Sentiment Analysis Using Machine Learning: Exploring NLP
- Sairam Penjarla
- Jun 26, 2024
Updated: Jun 27, 2024
In this blog post, we'll delve into a sentiment analysis project using various machine learning algorithms. The goal is to classify tweets into positive and negative sentiments using techniques such as text preprocessing, feature extraction, model training, and model evaluation. We'll explain each step in detail to provide a thorough understanding of the process.
Github URL:
It is highly recommended to go through the repo below to get the full version of the code along with the necessary files, such as the requirements.txt file and the CSV file.
Theory Behind Sentiment Analysis
What is Sentiment Analysis?
Sentiment analysis is the process of computationally identifying and categorizing opinions expressed in text to determine the sentiment conveyed by the writer. It is widely used in social media monitoring, customer feedback analysis, and market research to understand public opinion and sentiment trends.
Machine Learning for Sentiment Analysis
We'll discuss how machine learning algorithms can be applied to sentiment analysis tasks. These algorithms learn patterns from labeled data (tweets in this case) to classify new instances into predefined sentiment categories (positive or negative). This approach allows for automated sentiment classification at scale.
Code Explanation
1. Importing Required Libraries
import re
import pickle
import numpy as np
import pandas as pd
import seaborn as sns
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.stem import WordNetLemmatizer
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, classification_report
Here, we import essential libraries for data manipulation, visualization, natural language processing, and machine learning. These libraries provide tools for preprocessing text data, training classifiers, evaluating models, and saving/loading trained models.
2. Loading and Preprocessing Data
DATASET_COLUMNS = ["sentiment", "ids", "date", "flag", "user", "text"]
DATASET_ENCODING = "ISO-8859-1"
dataset = pd.read_csv('../input/sentiment140/training.1600000.processed.noemoticon.csv', encoding=DATASET_ENCODING , names=DATASET_COLUMNS)
dataset = dataset[['sentiment','text']]
dataset['sentiment'] = dataset['sentiment'].replace(4,1)
text, sentiment = list(dataset['text']), list(dataset['sentiment'])
In this section, we load the Sentiment140 dataset of labeled tweets, specifying the column names and the file encoding. We keep only the 'sentiment' and 'text' columns and relabel the positive class from 4 to 1, so the labels become 0 (negative) and 1 (positive). Finally, we pull the two columns out as plain Python lists for preprocessing and model training.
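As a small illustration of the relabeling step, the sketch below applies the same replace(4, 1) call to a hypothetical four-row DataFrame (the rows are made up for this example, not taken from the dataset) and then counts the labels. Counting is a quick way to confirm that only 0 and 1 remain and to check the class balance.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
toy = pd.DataFrame({
    "sentiment": [0, 4, 4, 0],
    "text": ["so tired today", "love this song", "great day", "missed my bus"],
})

# Same relabeling as in the post: the positive label 4 becomes 1.
toy["sentiment"] = toy["sentiment"].replace(4, 1)

# Count rows per label to check the result and the class balance.
label_counts = toy["sentiment"].value_counts().to_dict()
print(label_counts)
```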
3. Text Preprocessing
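def preprocess(textdata):
    # NOTE: `emojis` is a dict mapping emoticon strings to short descriptions.
    # It is not shown in this post and must be defined before calling preprocess.
    processedText = []
    wordLemm = WordNetLemmatizer()
    # Regex patterns for URLs, @mentions, non-alphanumerics, and repeated characters.
    urlPattern = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
    userPattern = r"@[^\s]+"
    alphaPattern = r"[^a-zA-Z0-9]"
    sequencePattern = r"(.)\1\1+"   # any character repeated 3 or more times
    seqReplacePattern = r"\1\1"     # collapsed to 2 repetitions
    for tweet in textdata:
        tweet = tweet.lower()
        # Replace URLs with the placeholder token ' URL'.
        tweet = re.sub(urlPattern,' URL',tweet)
        # Replace each known emoticon with 'EMOJI' plus its description.
        for emoji in emojis.keys():
            tweet = tweet.replace(emoji, "EMOJI" + emojis[emoji])
        # Replace @mentions with ' USER', then strip non-alphanumeric characters.
        tweet = re.sub(userPattern,' USER', tweet)
        tweet = re.sub(alphaPattern, " ", tweet)
        tweet = re.sub(sequencePattern, seqReplacePattern, tweet)
        tweetwords = ''
        for word in tweet.split():
            # Keep only words longer than one character, lemmatized.
            if len(word)>1:
                word = wordLemm.lemmatize(word)
                tweetwords += (word+' ')
        processedText.append(tweetwords)
    return processedText

processedtext = preprocess(text)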
Here, we define the preprocess function to clean each tweet. We lowercase the text, replace URLs with the token 'URL', replace known emoticons with 'EMOJI' followed by a description, replace user mentions with 'USER', turn non-alphanumeric characters into spaces, and collapse any character repeated three or more times down to two. We then drop single-character words and lemmatize the rest to normalize them. This step ensures that the text data is clean and standardized for further analysis.
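To see what the regular expressions do on their own, here is a minimal sketch that applies only the regex steps to one made-up tweet. It leaves out the emoji mapping and the lemmatizer, and the function name clean and the sample tweet are hypothetical, chosen just for this illustration.

```python
import re

# Patterns mirroring the ones used in preprocess, written as raw strings.
URL_PATTERN = r"((http://)[^ ]*|(https://)[^ ]*|( www\.)[^ ]*)"
USER_PATTERN = r"@[^\s]+"
ALPHA_PATTERN = r"[^a-zA-Z0-9]"
SEQUENCE_PATTERN = r"(.)\1\1+"   # a character repeated three or more times
SEQ_REPLACE_PATTERN = r"\1\1"    # keep only two copies


def clean(tweet):
    """Apply the regex steps only (no emoji mapping, no lemmatization)."""
    tweet = tweet.lower()
    tweet = re.sub(URL_PATTERN, " URL", tweet)
    tweet = re.sub(USER_PATTERN, " USER", tweet)
    tweet = re.sub(ALPHA_PATTERN, " ", tweet)
    tweet = re.sub(SEQUENCE_PATTERN, SEQ_REPLACE_PATTERN, tweet)
    # Normalize whitespace so the result is easy to read.
    return " ".join(tweet.split())


sample = "@bob I looooove this!!! http://example.com"
cleaned = clean(sample)
print(cleaned)  # USER i loove this URL
```

Note that the backreference \1 only works as intended inside a raw string (or with the backslash escaped exactly once), which is why the patterns are written with the r prefix.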
4. Splitting Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(processedtext, sentiment, test_size = 0.05, random_state = 0)
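We split the preprocessed data into training and testing sets using train_test_split from sklearn. This allows us to train the model on one subset of the data and evaluate its performance on unseen data. The test_size parameter specifies the proportion of the dataset to include in the test split: with test_size = 0.05, 5% of the tweets are held out for testing and 95% are used for training. Setting random_state = 0 makes the shuffle, and therefore the split, reproducible.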
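The effect of test_size and random_state can be checked on a small stand-in dataset. The sketch below uses 100 hypothetical strings instead of real tweets; the variable names are made up for the example.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 "tweets" with alternating labels.
texts = [f"tweet number {i}" for i in range(100)]
labels = [i % 2 for i in range(100)]

# Same arguments as in the post: hold out 5% and fix the random seed.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.05, random_state=0
)

# Repeating the call with the same seed reproduces the same split.
X_tr2, X_te2, _, _ = train_test_split(
    texts, labels, test_size=0.05, random_state=0
)

print(len(X_tr), len(X_te))  # 95 5
print(X_te == X_te2)         # True
```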
5. Feature Extraction: TF-IDF Vectorization
vectoriser = TfidfVectorizer(ngram_range=(1,2), max_features=500000)
vectoriser.fit(X_train)
X_train = vectoriser.transform(X_train)
X_test = vectoriser.transform(X_test)
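We use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert text data into numerical features. This step transforms text into a numerical matrix, where each row represents a document (tweet) and each column represents a unique word or combination of words (n-grams). Setting ngram_range=(1,2) includes both single words and two-word sequences, and max_features=500000 caps the vocabulary at the 500,000 most frequent terms. TF-IDF weighting helps capture the importance of words in each document relative to the entire corpus. The vectorizer is fitted on the training set only and then applied to both sets, so no information from the test tweets leaks into the features.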
6. Model Training and Evaluation
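def model_Evaluate(model):
    y_pred = model.predict(X_test)
    # Per-class precision, recall and F1-score, plus overall accuracy.
    print(classification_report(y_test, y_pred))
    cf_matrix = confusion_matrix(y_test, y_pred)
    categories = ['Negative','Positive']
    group_names = ['True Neg','False Pos', 'False Neg','True Pos']
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}' for v1, v2 in zip(group_names,group_percentages)]
    labels = np.asarray(labels).reshape(2,2)
    # Plot the matrix as an annotated heatmap (plot styling here is an assumption).
    sns.heatmap(cf_matrix, annot = labels, cmap = 'Blues', fmt = '',
                xticklabels = categories, yticklabels = categories)
    plt.xlabel("Predicted values")
    plt.ylabel("Actual values")
    plt.show()
We define the model_Evaluate function to evaluate classification models. It prints a classification report (precision, recall, and F1-score per class, plus overall accuracy) and plots the confusion matrix as a heatmap, with each cell labeled as a percentage of all test tweets. This step provides insights into how well the trained model distinguishes between positive and negative sentiments.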
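The link between the confusion matrix and the reported metrics can be worked out by hand. The sketch below uses eight hypothetical labels and predictions (not results from the models in this post) and derives accuracy, precision, recall, and F1-score for the positive class from the four cells.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = negative, 1 = positive).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes, so flattening
# gives the order: true neg, false pos, false neg, true pos.
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(cm.tolist())                      # [[3, 1], [1, 3]]
print(accuracy, precision, recall, f1)  # 0.75 0.75 0.75 0.75
```

This flattening order is why group_names in model_Evaluate is listed as 'True Neg', 'False Pos', 'False Neg', 'True Pos'.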
7. Model Selection and Training
BNBmodel = BernoulliNB(alpha = 2)
BNBmodel.fit(X_train, y_train)
model_Evaluate(BNBmodel)
SVCmodel = LinearSVC()
SVCmodel.fit(X_train, y_train)
model_Evaluate(SVCmodel)
LRmodel = LogisticRegression(C = 2, max_iter = 1000, n_jobs=-1)
LRmodel.fit(X_train, y_train)
model_Evaluate(LRmodel)
We train multiple classification models (BernoulliNB, LinearSVC, and LogisticRegression) using the TF-IDF transformed training data. Each model learns to classify tweets into positive or negative sentiments based on the extracted features. We evaluate each model's performance using the model_Evaluate function.
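The same three-model comparison can be tried on a tiny, made-up corpus to see the fit and predict calls end to end. This is only a sketch: the six training sentences are hypothetical, and the hyperparameters are copied from the post except n_jobs, which is left out here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = positive, 0 = negative.
train_texts = ["good great love", "love great", "good happy",
               "bad hate awful", "awful terrible", "hate bad sad"]
train_labels = [1, 1, 1, 0, 0, 0]
new_texts = ["great love", "terrible hate"]

# Fit the vectorizer on the training texts only, then reuse it.
vec = TfidfVectorizer()
X_small = vec.fit_transform(train_texts)
X_new = vec.transform(new_texts)

models = {
    "BernoulliNB": BernoulliNB(alpha=2),
    "LinearSVC": LinearSVC(),
    "LogisticRegression": LogisticRegression(C=2, max_iter=1000),
}

# Train each model on the same features and collect its predictions.
predictions = {}
for name, model in models.items():
    model.fit(X_small, train_labels)
    predictions[name] = model.predict(X_new).tolist()
    print(name, predictions[name])
```

Keeping the models in a dictionary makes it easy to add or remove a classifier without repeating the fit and evaluate lines.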
8. Saving Trained Models and Vectorizer
with open('vectoriser-ngram-(1,2).pickle','wb') as file:
    pickle.dump(vectoriser, file)
with open('Sentiment-LR.pickle','wb') as file:
    pickle.dump(LRmodel, file)
with open('Sentiment-BNB.pickle','wb') as file:
    pickle.dump(BNBmodel, file)
We save the trained TF-IDF vectorizer (vectoriser) and two of the trained sentiment classification models (LRmodel and BNBmodel) to disk using Python's pickle module; the LinearSVC model is not saved here. This allows us to reuse the models for future predictions without needing to retrain them.
9. Loading Trained Models
def load_models():
    # Load the fitted TF-IDF vectoriser.
    with open('vectoriser-ngram-(1,2).pickle', 'rb') as file:
        vectoriser = pickle.load(file)
    # Load the saved logistic regression model.
    with open('Sentiment-LR.pickle', 'rb') as file:
        LRmodel = pickle.load(file)
    return vectoriser, LRmodel
We define the load_models function to load the saved TF-IDF vectorizer and the logistic regression model (LRmodel). This function enables us to load pre-trained models for deployment and prediction tasks without retraining them.
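Saving and loading can be checked together with a quick round trip. The sketch below uses two hypothetical helper functions (save_object and load_object are not part of the original code) and a plain dictionary as a stand-in for a fitted model, written to a temporary directory so nothing is left on disk.

```python
import os
import pickle
import tempfile


def save_object(obj, path):
    """Serialize obj to path; the with-block closes the file even on error."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load_object(path):
    """Load a pickled object. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Hypothetical stand-in for a fitted model or vectoriser.
stand_in = {"name": "Sentiment-LR", "classes": [0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stand-in.pickle")
    save_object(stand_in, path)
    restored = load_object(path)

print(restored == stand_in)  # True
```

For real scikit-learn objects, a pickle should be loaded with the same library version that created it, since compatibility across versions is not guaranteed.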
10. Making Predictions
def predict(vectoriser, model, text):
    # Preprocess and vectorise the raw input strings.
    textdata = vectoriser.transform(preprocess(text))
    sentiment = model.predict(textdata)
    # Pair each input string with its predicted label.
    data = []
    for tweet, pred in zip(text, sentiment):
        data.append((tweet, pred))
    df = pd.DataFrame(data, columns = ['text','sentiment'])
    df = df.replace([0,1], ["Negative","Positive"])
    return df

if __name__=="__main__":
    # Load the saved vectoriser and model instead of retraining.
    vectoriser, LRmodel = load_models()
    text = ["I hate twitter",
            "May the Force be with you.",
            "Mr. Stark, I don't feel so good"]
    df = predict(vectoriser, LRmodel, text)
    print(df.head())
Finally, we define the predict function to make predictions on new text data. This function preprocesses the input text, transforms it using the loaded vectorizer, predicts sentiment using the loaded model, and returns the predictions in a structured DataFrame format. The if __name__=="__main__": block demonstrates an example usage of the predict function with sample text inputs.
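The last step of predict, turning numeric labels into readable ones, can also be written with map on the sentiment column alone, which is a slightly more targeted alternative to calling replace on the whole DataFrame. The sketch below uses hard-coded, hypothetical predictions rather than real model output.

```python
import pandas as pd

texts = ["I hate twitter", "May the Force be with you.", "I don't feel so good"]
# Hypothetical predictions, hard-coded for illustration (not real model output).
fake_predictions = [0, 1, 0]

result = pd.DataFrame({"text": texts, "sentiment": fake_predictions})

# Map numeric labels to names on the sentiment column only.
result["sentiment"] = result["sentiment"].map({0: "Negative", 1: "Positive"})
print(result)
```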
Conclusion
This blog post has provided a detailed walkthrough of building a sentiment analysis model using machine learning techniques. By following each step, readers have learned how to preprocess text