Data Science Mini Project: Sentiment Analysis using Python

About the Dataset

Nidhi Gajjar · Oct 28, 2021

In this guide, we’ll use the IMDB dataset of 50k movie reviews available on Kaggle. The dataset contains two columns (review and sentiment) that help us identify whether a review is positive or negative. In this project, I will find which machine learning model is best suited to predict the sentiment (output) of a given movie review (input).

Preparing the Dataset

After downloading the dataset, we load it into our Jupyter notebook and take a first look at the data.
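A minimal loading step might look like the following sketch (the CSV file name 'IMDB Dataset.csv' is an assumption based on the Kaggle download):

import pandas as pd

# Load the IMDB reviews downloaded from Kaggle (file name assumed)
df_review = pd.read_csv('IMDB Dataset.csv')
print(df_review.head())
print(df_review.shape)   # expected: (50000, 2) with columns 'review' and 'sentiment'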

The full dataset contains 50,000 rows. To train the models faster, I will take only 10,000 of them. This smaller sample will contain 9,000 positive and 1,000 negative reviews, deliberately making the data imbalanced.
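One way to build that deliberately imbalanced sample is sketched below; the exact slicing is my assumption, and df_review_imb is the name used in the next section:

# 9000 positive + 1000 negative reviews -> an imbalanced 10000-row sample
df_positive = df_review[df_review['sentiment'] == 'positive'][:9000]
df_negative = df_review[df_review['sentiment'] == 'negative'][:1000]
df_review_imb = pd.concat([df_positive, df_negative])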

Managing the imbalanced class

In this case I have a large amount of data for one class and far fewer observations for the other. This is known as imbalanced data, because the number of observations per class is not equally distributed. Let’s take a look at how our df_review_imb dataset is distributed.
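A quick check of the class distribution (a small sketch):

# Count reviews per sentiment class
print(df_review_imb['sentiment'].value_counts())
# positive    9000
# negative    1000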

To resample our data we use the imblearn library. You can either undersample the positive reviews or oversample the negative reviews (the right choice depends on the data you’re working with). In this case, we’ll use RandomUnderSampler.
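A sketch of the undersampling step, assuming the balanced frame is called df_review_bal; RandomUnderSampler simply drops rows from the majority class until both classes match:

from imblearn.under_sampling import RandomUnderSampler

# Undersample the positive class so both classes end up with 1000 reviews
rus = RandomUnderSampler(random_state=0)
X_bal, y_bal = rus.fit_resample(df_review_imb[['review']], df_review_imb['sentiment'])

df_review_bal = X_bal.copy()
df_review_bal['sentiment'] = y_bal
print(df_review_bal['sentiment'].value_counts())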

Splitting data into train and test set

Before we work with our data, we need to split it into a train and test set. The train dataset will be used to fit the model, while the test dataset will be used to provide an unbiased evaluation of a final model fit on the training dataset.
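A typical split with scikit-learn; the one-third test size and the train_x/train_y names are my assumptions:

from sklearn.model_selection import train_test_split

# Hold out a test set for an unbiased evaluation of the final model
train, test = train_test_split(df_review_bal, test_size=0.33, random_state=42)

train_x, train_y = train['review'], train['sentiment']
test_x, test_y = test['review'], test['sentiment']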

CountVectorizer

The CountVectorizer gives us the frequency of occurrence of words in a document. Let’s consider the following sentences.

review = [“I love writing code in Python. I love Python code”,
“I hate writing code in Java. I hate Java code”]
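Running CountVectorizer over the review list above produces a document-term matrix of raw counts; a small sketch:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
counts = cv.fit_transform(review)

# One row per sentence, one column per word, values are raw counts
print(pd.DataFrame(counts.toarray(), columns=cv.get_feature_names_out()))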

Term Frequency, Inverse Document Frequency (TF-IDF)

TF-IDF computes “weights” that represent how important a word is to a document in a collection of documents (a corpus). The TF-IDF value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word.
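Passing the same two sentences through TfidfVectorizer shows how the raw counts become weights, and the same vectorizer can then turn the real train and test reviews into the numerical input the models need (train_x_vector and test_x_vector are assumed names; the stop_words setting is my choice):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Demo on the two example sentences
demo = TfidfVectorizer()
weights = demo.fit_transform(review)
print(pd.DataFrame(weights.toarray(), columns=demo.get_feature_names_out()))

# Vectorize the real reviews: fit on the training set only, then transform the test set
tfidf = TfidfVectorizer(stop_words='english')
train_x_vector = tfidf.fit_transform(train_x)
test_x_vector = tfidf.transform(test_x)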

Model Selection

Support Vector Machines (SVM)

To fit an SVM model, we need to introduce the input (text reviews as numerical vectors) and output (sentiment). After fitting svc, we can predict whether a review is positive or negative with the .predict() method.
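A sketch with scikit-learn’s SVC; the linear kernel is my choice, and train_x_vector/tfidf come from the TF-IDF step above:

from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(train_x_vector, train_y)

# Predict the sentiment of an unseen review (example text is made up)
print(svc.predict(tfidf.transform(['A wonderful, moving film'])))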

Decision Tree

To fit a decision tree model, we need to introduce the input (text reviews as numerical vectors) and output (sentiment).
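A sketch along the same lines (dec_tree is an assumed name):

from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier(random_state=42)
dec_tree.fit(train_x_vector, train_y)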

Naive Bayes

To fit a Naive Bayes model, we need to introduce the input (text reviews as numerical vectors) and output (sentiment).
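Since TF-IDF features are non-negative, MultinomialNB is a natural choice here (the specific Naive Bayes variant is my assumption):

from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(train_x_vector, train_y)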

Logistic Regression

To fit a Logistic Regression model, we need to introduce the input (text reviews as numerical vectors) and output (sentiment).
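A sketch, with max_iter raised as a precaution against convergence warnings (the value is my assumption):

from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(train_x_vector, train_y)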

Model Evaluation

In this section, we’ll look at the traditional metrics used to evaluate our models.

Mean Accuracy

To obtain the mean accuracy of each model, just use the .score() method with the test samples and true labels, as shown below.
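Using the model objects from the sketches above:

# Mean accuracy of each fitted model on the held-out test set
print(svc.score(test_x_vector, test_y))
print(dec_tree.score(test_x_vector, test_y))
print(nb.score(test_x_vector, test_y))
print(log_reg.score(test_x_vector, test_y))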

SVM and Logistic Regression perform better than the other two classifiers, with SVM having a slight edge (84% accuracy). To show how the other metrics work, we’ll focus only on SVM.

F1 Score

The F1 score is the weighted average of precision and recall. Accuracy is most informative when true positives and true negatives matter most, while the F1 score is preferred when false negatives and false positives are the bigger concern.
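With string labels, average=None returns one F1 score per class (a sketch for the SVM predictions):

from sklearn.metrics import f1_score

print(f1_score(test_y, svc.predict(test_x_vector),
               labels=['positive', 'negative'], average=None))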

Classification report

We can also build a text report showing the main classification metrics, including those calculated above. To obtain the classification report, we pass the true labels and predicted labels to classification_report(y_true, y_pred).
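For the SVM predictions, that looks like:

from sklearn.metrics import classification_report

print(classification_report(test_y, svc.predict(test_x_vector),
                            labels=['positive', 'negative']))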

Confusion Matrix

A confusion matrix is a table that allows visualization of the performance of an algorithm. For binary classification, it has two rows and two columns that report the number of false positives, false negatives, true positives, and true negatives.
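A sketch for the SVM predictions; the labels argument fixes the row/column order:

from sklearn.metrics import confusion_matrix

conf_mat = confusion_matrix(test_y, svc.predict(test_x_vector),
                            labels=['positive', 'negative'])
print(conf_mat)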

Tuning the Model

Finally, it’s time to maximize our model’s performance.

GridSearchCV

This technique consists of an exhaustive search over specified parameter values in order to obtain the optimal hyperparameters. To do so, we write the following code.
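A sketch of the search over the SVM’s hyperparameters; the parameter grid and the 5-fold cross-validation are my assumptions:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Grid of candidate hyperparameters for the SVM
params = {'C': [1, 4, 8, 16, 32], 'kernel': ['linear', 'rbf']}
svc_grid = GridSearchCV(SVC(), params, cv=5)
svc_grid.fit(train_x_vector, train_y)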

After fitting the model, we obtain the best score, parameters, and estimator with the following code.
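Using the svc_grid object from the sketch above:

# Best cross-validation score, best hyperparameters, and the refitted estimator
print(svc_grid.best_score_)
print(svc_grid.best_params_)
print(svc_grid.best_estimator_)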
