Twitter Sentiment Analysis from Scratch


There is a famous saying: “The pen is mightier than the sword.” It is true, since your words can make someone’s day or ruin it, so it’s always wise to choose them carefully. But in today’s world of social media, people speak their hearts out without thinking much. So, in this article, we’ll see how positive or negative people can be on these platforms.

Problem Statement

  • We are going to review Twitter data to find the sentiment of people on this platform.
  • We’ll train a model that takes a person’s tweet as input and tells us whether it’s positive or negative.

Dataset

  • We are going to use a Kaggle dataset for our analysis. It contains 1.6 million tweets, each labeled as positive or negative, with 0.8 million examples per class.
(Figure: a sample of the data, showing tweets and their target labels)
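
If you want to follow along, here is a minimal loading sketch, assuming the Sentiment140 file as distributed on Kaggle (the file name, latin-1 encoding, and headerless column layout reflect that distribution; adjust them to your copy):

```python
import pandas as pd

# Sentiment140 ships as a headerless CSV; these column names are assigned by us.
cols = ["target", "id", "date", "flag", "user", "text"]
df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",  # file name on Kaggle; adjust if yours differs
    encoding="latin-1",
    names=cols,
)

# The raw target is 0 for negative and 4 for positive; map it to a clean 0/1 label.
df["target"] = df["target"].map({0: 0, 4: 1})
print(df[["text", "target"]].head())
```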

Preprocessing of tweets

  • Preprocessing is a very important part of any machine-learning pipeline. Our data contains a lot of information that doesn’t add any value to the model, so we’ll handle it piece by piece. We’ll perform the following preprocessing steps on our data:
  1. Remove any email ids in our text.
  2. Remove all hashtags and mentions (e.g., @ArpanSrivastava) from our tweets.
  3. Remove any punctuation marks in our tweets.
  4. Remove numbers from our tweets.
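
Here is a minimal sketch of steps 1–4 using Python’s re module; the exact patterns are one reasonable choice, not the only one:

```python
import re

def clean_tweet(tweet: str) -> str:
    tweet = re.sub(r"\S+@\S+", "", tweet)      # 1. remove email ids
    tweet = re.sub(r"[@#]\w+", "", tweet)      # 2. remove mentions and hashtags
    tweet = re.sub(r"[^\w\s]", "", tweet)      # 3. remove punctuation marks
    tweet = re.sub(r"\d+", "", tweet)          # 4. remove numbers
    return re.sub(r"\s+", " ", tweet).strip()  # collapse the leftover whitespace

print(clean_tweet("Loving #NLP thanks @ArpanSrivastava! Mail me at me@mail.com :) 100%"))
# -> "Loving thanks Mail me at"
```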

After doing the above steps, we are about 50% done with our cleaning. Now we’ll perform the second part of our preprocessing and learn a few NLP concepts along the way.

Removing stopwords

  • Stopwords are the most common words in a language. In English, these are words such as “have”, “they”, “our”, and “is”. These words don’t play much of a role in our analysis, since they appear in both negative and positive tweets alike.
  • We’ll remove these stopwords from our list of tweets using the NLTK library.
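
A minimal sketch, assuming NLTK is installed and its stopwords corpus has been downloaded:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")  # one-time download of the stopword lists
stop_words = set(stopwords.words("english"))

def remove_stopwords(tweet: str) -> str:
    # keep only the tokens that are not in NLTK's English stopword list
    return " ".join(w for w in tweet.split() if w.lower() not in stop_words)

print(remove_stopwords("they have ruined our day"))  # -> "ruined day"
```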

Stemming and lowercasing

  • Stemming is the process of reducing each word to its root. Suppose there are 3 words: tune, tuned, and tuning. They may look different, but they all share the same root; a standard Porter stemmer reduces each of them to “tune”. So in stemming, we replace these words with their root word.
  • Then we convert all the words to lowercase so that “Great”, “great”, and “GREAT” are not treated as different words in our analysis.
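
A minimal sketch of both steps using NLTK’s PorterStemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_and_lower(tweet: str) -> str:
    # lowercase first so "Great", "great" and "GREAT" collapse together,
    # then reduce every token to its stem
    return " ".join(stemmer.stem(w.lower()) for w in tweet.split())

print(stem_and_lower("Tuned Tuning GREAT Programming"))
# -> "tune tune great program"
```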

Vectorization of Data

  • Now after performing all our preprocessing, we are left with a dataset that looks something like this:
(Figure: the preprocessed data)
  • We can’t feed this text data into our machine-learning model directly, since machines only understand numbers, not text. So we’ll vectorize the text data.
  • What we can do is create a vocabulary array using all the unique words in our data.
  • For example, suppose we have 3 sentences:
  1. I love Programming.
  2. Programming is what I love.
  3. Programming is cool.
  • So, if we create a vocabulary using the above 3 sentences, we get an array something like this: [“I”, “love”, “Programming”, “is”, “what”, “cool”]
  • Now we’ll convert our sentences into vectors using a simple trick: if a word in our vocabulary exists in the sentence, we put a 1 at its position, otherwise a 0.
  • So the transformation will look something like this:
  • “I love Programming” -> [1,1,1,0,0,0]
  • “Programming is what I love” -> [1,1,1,1,1,0]
  • “Programming is cool”->[0,0,1,1,0,1]
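
A minimal sketch of this binary bag-of-words encoding in plain Python (scikit-learn’s CountVectorizer(binary=True) does the same job):

```python
sentences = [
    "I love Programming",
    "Programming is what I love",
    "Programming is cool",
]

# build the vocabulary from the unique words, in first-seen order
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

def to_vector(sentence):
    words = set(sentence.split())
    # 1 if the vocabulary word occurs in the sentence, else 0
    return [1 if word in words else 0 for word in vocab]

print(vocab)  # ['I', 'love', 'Programming', 'is', 'what', 'cool']
for s in sentences:
    print(s, "->", to_vector(s))
```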

Problem with the above approach

  • The above approach creates a very large sparse matrix, with one column per vocabulary word, and our learning algorithm will become slow because of that.

Better Solution

  • In this approach, we again create a vocabulary of all the unique words in our sentences.
  • Get a list of all positive tweets and find the frequency of each word in those positive tweets.
  • Similarly, we get all the negative tweets and find the frequency of all words in negative tweets.
(Figure: frequency of each word in positive and negative tweets)
  • Now we’ll transform our sentences into vectors using a simple formula.

x = [1, Σ freq_pos(word), Σ freq_neg(word)], where both sums run over the words of the tweet

  • Let’s understand our formula with the help of a single example.
  • Suppose our sentence is -> ‘I am happy learning code’.
  • Looking up each word in the frequency table above: sum of positive frequencies = 3 (I) + 3 (am) + 2 (happy) + 1 (learning) + 1 (code) = 10.
  • Similarly, sum of negative frequencies = 3 (I) + 3 (am) + 0 (happy) + 1 (learning) + 1 (code) = 8.
  • So finally our sentence will be represented as [1,10,8].
  • Remember, the initial 1 is for the bias.
  • We can now convert each of our examples into vectors.
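
A minimal sketch of this feature extraction, assuming plain whitespace-tokenized tweets; the toy tweets and counts below are illustrative, not the frequency table from the figure above:

```python
from collections import defaultdict

def build_freqs(tweets, labels):
    # freqs[(word, label)] = how often `word` appears in tweets of that label
    freqs = defaultdict(int)
    for tweet, label in zip(tweets, labels):
        for word in tweet.split():
            freqs[(word, label)] += 1
    return freqs

def extract_features(tweet, freqs):
    pos = sum(freqs[(word, 1)] for word in tweet.split())
    neg = sum(freqs[(word, 0)] for word in tweet.split())
    return [1, pos, neg]  # the leading 1 is the bias term

# toy labeled tweets, purely illustrative
tweets = ["i am happy", "i am sad", "happy happy code"]
labels = [1, 0, 1]
freqs = build_freqs(tweets, labels)
print(extract_features("i am happy code", freqs))  # -> [1, 6, 2]
```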

Training our Model

  • Since it’s a binary classification problem now, you can simply use Logistic Regression or any other binary classification algorithm to train and test your model.
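
A minimal training sketch with scikit-learn; the tiny X matrix below stands in for the real feature rows of [1, positive frequency, negative frequency], and fit_intercept=False because the bias is already baked into the features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# X: one row of [1, pos_freq, neg_freq] per tweet; y: 1 = positive, 0 = negative.
# These four rows are placeholders for the real extracted features.
X = np.array([[1, 10, 8], [1, 2, 9], [1, 12, 3], [1, 1, 7]])
y = np.array([1, 0, 1, 0])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# fit_intercept=False because the bias is already the first feature column
model = LogisticRegression(fit_intercept=False)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```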

Code Walkthrough

Till then, Happy Learning!!
