Article in International Journal of Computer Applications · May 2017

DOI: 10.5120/ijca2017914022
International Journal of Computer Applications (0975 – 8887)
Volume 165 – No.9, May 2017
Study of Twitter Sentiment Analysis using Machine Learning Algorithms on Python

Bhumika Gupta, PhD
Assistant Professor, C.S.E.D
G.B.P.E.C, Pauri, Uttarakhand, India

Monika Negi, Kanika Vishwakarma, Goldi Rawat, Priyanka Badhani
B.Tech, C.S.E.D
G.B.P.E.C Uttarakhand, India

ABSTRACT
Twitter is a platform widely used by people to express their opinions and display sentiments on different occasions. Sentiment analysis is an approach to analyze data and retrieve the sentiment that it embodies. Twitter sentiment analysis is an application of sentiment analysis on data from Twitter (tweets), in order to extract the sentiments conveyed by the user. In the past decades, the research in this field has consistently grown. The reason behind this is the challenging format of the tweets, which makes the processing difficult. The tweet format is very small, which generates a whole new dimension of problems like use of slang, abbreviations etc. In this paper, we aim to review some papers regarding research in sentiment analysis on Twitter, describing the methodologies adopted and models applied, along with describing a generalized Python based approach.

Keywords
Sentiment analysis, Machine Learning, Natural Language Processing, Python.

1. INTRODUCTION
Twitter has emerged as a major micro-blogging website, having over 100 million users generating over 500 million tweets every day. With such a large audience, Twitter has consistently attracted users to convey their opinions and perspective about any issue, brand, company or any other topic of interest. For this reason, Twitter is used as an informative source by many organizations, institutions and companies.

On Twitter, users are allowed to share their opinions in the form of tweets, using only 140 characters. This leads to people compacting their statements by using slang, abbreviations, emoticons, short forms etc. Along with this, people convey their opinions by using sarcasm and polysemy. Hence it is justified to term the Twitter language as unstructured.

In order to extract sentiment from tweets, sentiment analysis is used. The results from this can be used in many areas like analyzing and monitoring changes of sentiment with an event, sentiments regarding a particular brand or release of a particular product, analyzing public view of government policies etc.

A lot of research has been done on Twitter data in order to classify the tweets and analyze the results. In this paper we aim to review some of the research in this domain and study how to perform sentiment analysis on Twitter data using Python. The scope of this paper is limited to that of the machine learning models and we show the comparison of efficiencies of these models with one another.

2. ABOUT SENTIMENT ANALYSIS
Sentiment analysis is a process of deriving the sentiment of a particular statement or sentence. It is a classification technique which derives opinion from the tweets and formulates a sentiment, on the basis of which sentiment classification is performed.

Sentiments are subjective to the topic of interest. We are required to formulate what kind of features will decide the sentiment it embodies.

In the programming model, the sentiment we refer to is the class of entities that the person performing sentiment analysis wants to find in the tweets. The dimension of the sentiment class is a crucial factor in deciding the efficiency of the model.

For example, we can have two-class tweet sentiment classification (positive and negative) or three-class tweet sentiment classification (positive, negative and neutral).

Sentiment analysis approaches can be broadly categorized in two classes: lexicon based and machine learning based. The lexicon based approach is unsupervised, as it proposes to perform analysis using lexicons and a scoring method to evaluate opinions, whereas the machine learning approach involves use of feature extraction and training the model using a feature set and some dataset.

The basic steps for performing sentiment analysis include data collection, pre-processing of data, feature extraction, selecting baseline features, sentiment detection and performing classification either using simple computation or else machine learning approaches.

2.1 Twitter Sentiment Analysis
The aim while performing sentiment analysis on tweets is basically to classify the tweets in different sentiment classes accurately. In this field of research, various approaches have evolved, which propose methods to train a model and then test it to check its efficiency.

Performing sentiment analysis is challenging on Twitter data, as we mentioned earlier. Here we define the reasons for this:
• Limited tweet size: with just 140 characters in hand, compact statements are generated, which results in a sparse set of features.
• Use of slang: these words are different from English words and can make an approach outdated because of the evolving use of slang.
• Twitter features: it allows the use of hashtags, user references and URLs. These require different processing than other words.
• User variety: the users express their opinions in a variety of ways, some using a different language in between, while others use repeated words or symbols to convey an emotion.
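The lexicon based approach of Section 2, and the way slang undermines it, can be illustrated with a minimal sketch. The lexicon below is a hypothetical toy word list chosen only for illustration (it is not taken from any of the reviewed papers), and the names `LEXICON`, `lexicon_score` and `classify` are likewise assumptions.

```python
# Hypothetical toy lexicon: word -> polarity score. A real system would
# load a published sentiment lexicon instead of this hand-made list.
LEXICON = {
    "good": 1, "great": 1, "love": 1, "happy": 1,
    "bad": -1, "hate": -1, "terrible": -1, "sad": -1,
}


def lexicon_score(tweet, lexicon=LEXICON):
    """Sum the polarity of every known word; unknown words contribute 0."""
    words = (word.strip(".,!?") for word in tweet.lower().split())
    return sum(lexicon.get(word, 0) for word in words)


def classify(tweet, lexicon=LEXICON):
    """Three-class decision (positive, negative, neutral) from the score."""
    score = lexicon_score(tweet, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(classify("I love this phone, great battery!"))
print(classify("Terrible service, I hate it"))
# Slang such as "gr8" and "luv" is absent from the lexicon, so the tweet
# yields no usable features and falls back to the neutral class.
print(classify("gr8 phone luv it"))
```

The third call shows why slang and short forms are listed among the challenges: without a slang expansion step, a compact tweet may contain no word that the scoring method recognizes.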
All these problems are required to be faced in the pre-processing section.

Apart from these, we face problems in feature extraction with fewer features in hand and in reducing the dimensionality of features.

3. METHODOLOGY
In order to perform sentiment analysis, we are required to collect data from the desired source (here Twitter). This data undergoes various steps of pre-processing which make it more machine sensible than its previous form.

Fig. 1 – General Methodology for sentiment analysis

3.1 Tweet Collection
Tweet collection involves gathering relevant tweets about the particular area of interest. The tweets are collected using Twitter's streaming API [1], [3], or any other mining tool (for example WEKA [2]), for the desired time period of analysis. The format of the retrieved text is converted as per convenience (for example JSON in case of [3], [5]).

The dataset collected is imperative for the efficiency of the model. The division of the dataset into training and testing sets is also a deciding factor for the efficiency of the model. The training set is the main aspect upon which the results depend.

3.2 Pre-processing of tweets
The preprocessing of the data is a very important step as it decides the efficiency of the other steps down the line. It involves syntactical correction of the tweets as desired. The steps involved should aim for making the data more machine readable in order to reduce ambiguity in feature extraction. Below are a few steps used for pre-processing of tweets:

• Removal of re-tweets.
• Converting upper case to lower case: In case we are using case sensitive analysis, we might take two occurrences of the same word as different due to their case. It is important for an effective analysis not to present such ambiguities to the model.
• Stop word removal: Stop words that don't affect the meaning of the tweet are removed (for example and, or, still etc.). [3] uses the WEKA machine learning package for this purpose, which checks each word from the text against a dictionary ([3], [5]).
• Twitter feature removal: User names and URLs are not important from the perspective of future processing, hence their presence is futile. All usernames and URLs are converted to generic tags [3] or removed [5].
• Stemming: Replacing words with their roots, reducing different types of words with similar meanings [3]. This helps in reducing the dimensionality of the feature set.
• Special character and digit removal: Digits and special characters don't convey any sentiment. Sometimes they are mixed with words, hence their removal can help in associating two words that were otherwise considered different.
• Creating a dictionary to remove unwanted words and punctuation marks from the text [5].
• Expansion of slang and abbreviations [5].
• Spelling correction [5].
• Generating a dictionary for words that are important [7] or for emoticons [2].
• Part of speech (POS) tagging: It assigns a tag to each word in the text and classifies a word to a specific category like noun, verb, adjective etc. POS taggers are efficient for explicit feature extraction.

3.3 Feature Extraction
A feature is a piece of information that can be used as a characteristic which can assist in solving a problem (like prediction [11]). The quality and quantity of features are very important, as they determine the results generated by the selected model.

Selection of useful words from tweets is feature extraction.

• Unigram features – one word is considered at a time and it is decided whether it is capable of being a feature.
• N-gram features – more than one word is considered at a time.
• External lexicon – use of a list of words with predefined positive or negative sentiment.

Frequency analysis, used in [1], is a method to collect the features with the highest frequencies. Further, they removed some of them due to the presence of words with similar sentiment (for example happy, joy, ecstatic etc.) and created a group of these words. Along with this, affinity analysis is performed, which focuses on higher order n-grams in tweet feature representation.

Barnaghi et al [3] use unigrams and bigrams and apply Term Frequency Inverse Document Frequency (TF-IDF) to find the weight of a particular feature in a text and hence filter the features having the maximum weight. TF-IDF is a very efficient approach and is widely used in text classification and data mining.

Bouazizi et al [4] propose an approach where they rely not just on the vocabulary used but also on the expressions and sentence structure used in different conditions. They classified features into four classes: sentiment based features, punctuation and syntax based features, unigram based features and pattern based features.

The work of [5] is a bit different as they don't focus on a particular topic or event but propose to find trending topics in a region. The features extracted are divided in two categories: Common Features and Tweet Specific Features. The former is a combination of common sentiment words while the latter includes @-network features, user sentiment features and emoticons. Based on the post time of each user, a feature vector is built.
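The unigram and bigram TF-IDF weighting described above can be sketched in plain Python. This is an illustrative sketch, not the implementation used in [3]: it assumes the common definitions tf = (count in tweet) / (number of n-grams in tweet) and idf = log(N / df), and the function names `ngrams`, `tfidf` and `top_features` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n_max=2):
    """Return all n-grams up to length n_max as space-joined strings."""
    feats = []
    for n in range(1, n_max + 1):
        feats.extend(" ".join(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))
    return feats


def tfidf(documents, n_max=2):
    """Return one {n-gram: weight} dict per document, weight = tf * idf."""
    doc_feats = [ngrams(doc.lower().split(), n_max) for doc in documents]
    n_docs = len(doc_feats)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for feats in doc_feats:
        df.update(set(feats))
    weights = []
    for feats in doc_feats:
        counts = Counter(feats)
        total = len(feats)
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights


def top_features(doc_weights, k):
    """Keep the k highest-weighted n-grams of one document."""
    ranked = sorted(doc_weights, key=doc_weights.get, reverse=True)
    return ranked[:k]


tweets = ["good phone", "bad phone", "phone works"]
weights = tfidf(tweets)
print(top_features(weights[0], 2))
```

Because "phone" occurs in every tweet of this toy corpus, its idf is log(1) = 0 and it is filtered out, while the sentiment-bearing "good" keeps a positive weight.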
3.4 Sentiment classifiers
• Bayesian logistic regression: selects features and provides optimization for performing text categorization. It uses a Laplace prior to avoid over-fitting and produces sparse predictive models for text data. The Logistic Regression estimation has the parametric form:

$$P(c \mid d, \lambda) = \frac{1}{Z(d)} \exp\left(\sum_i \lambda_i f_i(c, d)\right)$$

where c is the class label, d is the tweet, Z(d) is a normalization function, λ is a vector of weight parameters for the feature set and f_i is a binary function that takes as input a feature and a class label. It is triggered when a certain feature exists and the sentiment is hypothesized in a certain way [3].

• Naïve Bayes: It is a probabilistic classifier with a strong conditional independence assumption that is optimal for classifying classes with highly dependent features. Adherence to the sentiment classes is calculated using the Bayes theorem:

$$P(y_j \mid X) = \frac{P(X \mid y_j)\, P(y_j)}{P(X)}, \qquad P(X \mid y_j) = \prod_{i=1}^{m} P(x_i \mid y_j)$$

X is a feature vector defined as X = {x_1, x_2, …, x_m} and y_j is a class label. Naïve Bayes is a very simple classifier with acceptable results, but not as good as other classifiers.

• Support Vector Machine Algorithm: Support vector machines are supervised models with associated learning algorithms that analyze data used for classification and regression analysis [6], [9]. It makes use of the concept of decision planes that define decision boundaries:

$$g(X) = w^{T} \phi(X) + b$$

X is the feature vector, 'w' is the weight vector and 'b' is the bias vector. φ(·) is the non-linear mapping from input space to high dimensional feature space. SVMs can be used for pattern recognition [2].

• Artificial Neural Network: the ANN model used for supervised learning is the Multi-Layer Perceptron, which is a feed forward model that maps data onto a set of pertinent outputs. Training data given to the input layer is processed by hidden intermediate layers and the data goes to the output layers. The number of hidden layers is a very important metric for the performance of the model. There are two steps in the working of an MLP NN: feed forward propagation, involving learning features from the feed forward propagation algorithm, and back propagation, for the cost function [5], [10].
Zimbra et al [1] propose an approach to use the Dynamic Architecture for Artificial Neural Networks (DAN2), which is a machine learned model with sufficient sensitivity to mild expression in tweets. They aim to analyze brand related sentiments, where occurrences of mild sentences are frequent. DAN2 is different from simple neural networks as the number of hidden layers is not fixed before using the model. As the input is given, accumulation of knowledge and learning takes place at each level and is forwarded to the next level. The hidden layers are dynamically generated until a desired level of performance is achieved.

• Case Base Reasoning: In this technique, problems that were successfully solved in the past are accessed and their solutions are retrieved and used further [10]. It doesn't require an explicit domain model, making elicitation a task of gathering case histories, and a CBR system can acquire new knowledge as cases. This makes maintenance of large volumes of information easier.

• Maximum Entropy Classifier: This classifier makes no assumptions regarding the relations between features; it always tries to maximize the entropy of a system by computing the conditional distribution of its class labels [9]:

$$P_\lambda(y \mid X) = \frac{1}{Z(X)} \exp\left(\sum_i \lambda_i f_i(X, y)\right)$$

'X' is the feature vector and 'y' is the class label. Z(X) is the normalization factor and λ_i is the weight coefficient. f_i(X, y) is the feature function, which is defined as

$$f_i(X, y) = \begin{cases} 1, & X = x_i \text{ and } y = y_i \\ 0, & \text{otherwise} \end{cases}$$

• Ensemble classifier: This classifier tries to make use of the features of all the base classifiers to do the best classification. The base classifiers used by [9] were Naïve Bayes, SVM and Maximum Entropy. The classifier classifies based on the output of the majority of classifiers (voting rule).

4. TWITTER SENTIMENT ANALYSIS WITH PYTHON
4.1 Python
Python is a high level, interpreted programming language, created by Guido van Rossum. The language is very popular for its code readability and compact code. It uses white space indentation to delimit blocks.

Python provides a large standard library and a wide range of third-party libraries which can be used for various applications, for example natural language processing, machine learning, data analysis etc.

It is favored for complex projects because of its simplicity, diverse range of features and dynamic nature.

4.2 Natural Language Toolkit (NLTK)
Natural Language Toolkit (NLTK) is a library in Python which provides the base for text processing and classification. Operations such as tokenization, tagging, filtering and text manipulation can be performed with the use of NLTK.

The NLTK library also embodies various trainable classifiers (for example the Naïve Bayes Classifier).

The NLTK library is used for creating a bag-of-words model, which is a type of unigram model for text. In this model, the number of occurrences of each word is counted. The data acquired can be used for training classifier models. The sentiment of the entire tweet is computed by assigning a subjectivity score to each word using a sentiment lexicon.
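To make the bag-of-words model and the Naïve Bayes classifier of Section 3.4 concrete, the following is a minimal sketch in plain Python. It deliberately does not use NLTK's own classifier; the class name `NaiveBayes`, the add-one (Laplace) smoothing and the four training tweets are assumptions made only for illustration.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Count occurrences of each lower-cased, whitespace-separated word."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(bag_of_words(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log P(y) + sum over words of count * log P(x_i | y)
            score = math.log(n / n_docs)
            for word, count in bag_of_words(text).items():
                if word not in self.vocab:
                    continue  # words never seen in training are ignored
                p = (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                score += count * math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training set.
train_texts = ["i love this phone", "great battery love it",
               "i hate this phone", "terrible battery hate it"]
train_labels = ["pos", "pos", "neg", "neg"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("love the great screen"))
print(model.predict("hate it"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the product over words is exactly the conditional independence assumption stated for Naïve Bayes.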
4.3 SCIKIT-LEARN
The Scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. It is a powerful library that provides many machine learning classification algorithms and efficient tools for data mining and data analysis. Below are various functions that can be performed using this library:

• Classification: Identifying the category to which a particular object belongs.
• Regression: Predicting a continuous-valued attribute associated with an object.
• Clustering: Automatic grouping of similar objects into sets.
• Dimension Reduction: Reducing the number of random variables under consideration.
• Model selection: Comparing, validating and choosing parameters and models.
• Preprocessing: Feature extraction and normalization in order to transform input data for use with a machine learning algorithm.

In order to work with scikit-learn, we are required to install NumPy on the system.

4.4 NumPy
NumPy is the fundamental package for scientific computing with Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It contains among other things:

• A powerful N-dimensional array object
• Sophisticated (broadcasting) functions
• Tools for integrating C/C++ and Fortran code
• Useful linear algebra, Fourier transform, and random number capabilities.

4.5 Setting Up Environment for Sentiment Analysis Using Python
The following components are required to be downloaded and installed properly.

• Download and install Python 2.6 or above in a desired location.
• Download and install NumPy.
• Download and install the NLTK library.
• Download and install the Scikit-learn library.

4.6 Data Collection
We have two options to collect data for sentiment analysis. The first is to use Tweepy, a client for the Twitter Application Programming Interface (API).

It can be installed using the pip command: pip install tweepy

To fetch tweets from the Twitter API one needs to register an App through their Twitter account. After that the following steps are performed:

• Open https://apps.twitter.com/ and click the button 'Create New App'.
• Fill in the details asked.
• When the App is created, the page will be automatically loaded.
• Open the 'Keys and Access Tokens' tab.
• Copy 'Consumer Key', 'Consumer Secret', 'Access token' and 'Access Token Secret'.

The keys copied are then inserted into the code, which helps in dynamic collection of tweets every time we run it.

The other option is to gather data non-dynamically using the existing data provided by websites (like kaggle.com) and save the data into whatever format we require (for example JSON, csv etc.).

The former method is slow in nature as it performs tweet collection every time we start the program. The latter approach may not provide us with the quality of tweets we require.

To solve this we can put the code for tweet collection in a different module, in a way that it doesn't operate every time we run the project.

4.7 Pre-processing in Python
The pre-processing in Python is easy to perform due to functions provided by the standard library. Some of the steps are given below:

• Converting all upper case letters to lower case.
• Removing URLs: Filtering of URLs can be done with the help of the regular expression (http|https|ftp)://[a-zA-Z0-9\\./]+.
• Removing Handles (User Reference): Handles can be removed using the regular expression @(\w+).
• Removing hashtags: Hashtags can be removed using the regular expression #(\w+).
• Removing emoticons: We can use an emoticon dictionary to filter out the emoticons or to save their occurrences in a different file.
• Removing repeated characters.

4.8 Feature Extraction
Various methodologies for extracting features are available in the present day. Term Frequency-Inverse Document Frequency is an efficient approach. TF-IDF is a numerical statistic that reflects the value of a word for the whole document (here, tweet).

Scikit-learn provides vectorizers that translate input documents into vectors of features. We can use the library function TfidfVectorizer(), to which we can provide parameters for the kind of features we want to keep, for example by specifying the minimum frequency of acceptable features (the min_df parameter).

4.9 TRAINING A MODEL
The scikit-learn library provides various machine learning models whose implementation in code is very easy. For example, once the svm module has been imported, one can easily create an instance of a Support Vector Machine in one line:

from sklearn import svm
classifier_poly = svm.SVC()

In order to make use of machine learning models, one is required to remember to install NumPy properly and import the desired model from scikit-learn.

After training the model, we use the same instance to test the model and save the results obtained.
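The steps of Sections 4.7 to 4.9 can be tied together in a short sketch. The regular expressions follow the ones listed in Section 4.7, with the character class of the URL pattern written as `./` in a raw string (an assumption about the intended escaping); the repeated-character rule, which collapses runs of three or more to two, is also an assumption, and emoticon handling is omitted. The helper names `clean_tweet` and `train_svm` are hypothetical.

```python
import re

URL_RE = re.compile(r"(http|https|ftp)://[a-zA-Z0-9./]+")
HANDLE_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")
# Assumption: a run of 3+ identical characters is reduced to 2.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def clean_tweet(text):
    """Lower-case a tweet and strip URLs, handles, hashtags and long repeats."""
    text = text.lower()
    text = URL_RE.sub("", text)
    text = HANDLE_RE.sub("", text)
    text = HASHTAG_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())  # normalize leftover whitespace


def train_svm(texts, labels, min_df=1):
    """Fit a TF-IDF vectorizer and an SVM on labelled tweets."""
    # Imported here so that clean_tweet stays usable without scikit-learn.
    from sklearn import svm
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(preprocessor=clean_tweet, min_df=min_df)
    features = vectorizer.fit_transform(texts)
    classifier = svm.SVC()
    classifier.fit(features, labels)
    return vectorizer, classifier


print(clean_tweet("@Bob I LOOOOVE this!!! http://t.co/abc #happy"))
```

After `vectorizer, classifier = train_svm(train_texts, train_labels)`, unseen tweets would be classified with `classifier.predict(vectorizer.transform(test_texts))`; the same fitted vectorizer must be reused so that training and test tweets share one feature space.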
5. EXPERIMENTATION FOR MODEL VALIDATION
After the pre-processing and feature extraction steps are performed, we work towards training and validating the model's performance. The collected dataset is divided in two: a training set and a testing set. The training set is used to train the classifier (machine learned model) while the testing set is the one on which the experimentation is performed. The ratio of training and testing dataset can vary as per application. [1] divides the dataset as 70% training and the rest testing, whereas [3] uses cross validation on the dataset by splitting it into 10 sections. This method selects 90% for the training set and 10% for testing.

[4] divided the set as a training set containing 21000 tweets and a testing set of 1400 tweets (approx. 93% and 7%), while [5] used 75% of the data for the training set and [9] used approx. 83% for training.

As the classification work in [6] is topic based and adaptive in nature, excessive manual labeling is avoided, which reduces the size of the training set.

The model which is chosen for experimentation is trained using the training set of data. Then this same trained model is used to classify new data, by which we can check its accuracy.

The proposed work of [3] is a bit different as they correlate the event and sentiment by using timestamps. Using this method for a particular event, it's possible to divide it into sub-events and further deepen the study of user sentiments. This approach is complex but yields very detailed results when we choose a large event and desire to see fluctuations in user sentiments with time.

The number of classes to be chosen for classification is up to the user. One can perform binary, ternary or multi-class classification based on the type of application we are aiming for. But it has been observed that as the number of classes increases, the performance of classifiers decreases [1], [3].

Table 1: Average accuracies of different models

S. no.  Classifier                     Accuracy
1.      DAN2                           86.06%
2.      SVM                            85.0%
3.      Bayesian Logistic Regression   74.84%
4.      Naïve Bayes                    66.24%
5.      Random Forest Classifier       87.5%
6.      Neural Network                 89.93%
7.      Maximum Entropy                90.0%
8.      Ensemble classifier            90.0%

5.1 Applications
• Commerce: Companies can make use of this research for gathering public opinion related to their brand and products. From the company's perspective, the survey of the target audience is imperative for making out the ratings of their products. Hence Twitter can serve as a good platform for data collection and analysis to determine customer satisfaction.

• Politics: The majority of tweets on Twitter are related to politics. Due to Twitter's widespread use, many politicians are also aiming to connect to people through it. People post their support or disagreement towards government policies, actions, elections, debates etc. Hence analyzing data from it can help in determining public view.

• Sports Events: Sports involve many events, championships, gatherings and some controversies too. Many people are enthusiastic sports followers and follow their favorite players present on Twitter. These people frequently tweet about different sports related events. We can use the data to gather public view of a player's action, team's performance, official decisions etc.
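The dataset divisions described in Section 5 (a fixed split such as 70% training, or 10-section cross validation with 90% training and 10% testing per round) and the accuracy measure can be sketched in plain Python. The function names and the fixed shuffling seed are assumptions for illustration; scikit-learn offers equivalent utilities.

```python
import random


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of the data and cut it into training and testing sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


def k_fold(items, k=10):
    """Yield (train, test) pairs; each item is used for testing exactly once."""
    items = list(items)
    for i in range(k):
        test = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, test


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


data = list(range(100))
train, test = train_test_split(data, train_fraction=0.7)
print(len(train), len(test))

folds = list(k_fold(data, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))

print(accuracy(["pos", "neg", "pos"], ["pos", "neg", "neg"]))
```

With cross validation the reported figure is usually the mean of the per-fold accuracies, which makes it less dependent on one particular split than a single 70/30 division.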
6. CONCLUSION
Twitter sentiment analysis comes under the category of text and opinion mining. It focuses on analyzing the sentiments of the tweets and feeding the data to a machine learning model in order to train it and then check its accuracy, so that we can use this model in future according to the results. It comprises steps like data collection, text pre-processing, sentiment detection, sentiment classification, and training and testing the model. This research topic has evolved during the last decade, with models reaching an efficiency of almost 85%-90%. But it still lacks the dimension of diversity in the data. Along with this, it has a lot of application issues with the slang used and the short forms of words. Many analyzers don't perform well when the number of classes is increased. Also, it has still not been tested how accurate the model will be for topics other than the one in consideration. Hence sentiment analysis has a very bright scope of development in future.
Fig. 2: Accuracy on the basis of number of classes used for classification of sentiments
7. REFERENCES
[1] David Zimbra, M. Ghiassi and Sean Lee, “Brand-Related Twitter Sentiment Analysis using Feature Engineering and the Dynamic Architecture for Artificial Neural Networks”, IEEE 1530-1605, 2016.
[2] Varsha Sahayak, Vijaya Shete and Apashabi Pathan, “Sentiment Analysis on Twitter Data”, (IJIRAE) ISSN: 2349-2163, January 2015.
[3] Peiman Barnaghi, John G. Breslin and Parsa Ghaffari, “Opinion Mining and Sentiment Polarity on Twitter and Correlation between Events and Sentiment”, 2016 IEEE Second International Conference on Big Data Computing Service and Applications.
[4] Mondher Bouazizi and Tomoaki Ohtsuki, “Sentiment Analysis: from Binary to Multi-Class Classification”, IEEE ICC 2016 SAC Social Networking, ISBN 978-1-4799-6664-6.
[5] Nehal Mamgain, Ekta Mehta, Ankush Mittal and Gaurav Bhatt, “Sentiment Analysis of Top Colleges in India Using Twitter Data”, (IEEE) ISBN 978-1-5090-0082-1, 2016.
[6] Halima Banu S and S Chitrakala, “Trending Topic Analysis Using Novel Sub Topic Detection Model”, (IEEE) ISBN 978-1-4673-9745-2, 2016.
[7] Shi Yuan, Junjie Wu, Lihong Wang and Qing Wang, “A Hybrid Method for Multi-class Sentiment Analysis of Micro-blogs”, ISBN 978-1-5090-2842-9, 2016.
[8] Apoorv Agarwal, Boyi Xie, Ilia Vovsha, Owen Rambow and Rebecca Passonneau, “Sentiment Analysis of Twitter Data”, Proceedings of the Workshop on Language in Social Media (LSM 2011), 2011.
[9] Neethu M S and Rajasree R, “Sentiment Analysis in Twitter using Machine Learning Techniques”, IEEE – 31661, 4th ICCCNT 2013.
[10] Aliza Sarlan, Chayanit Nadam and Shuib Basri, “Twitter Sentiment Analysis”, 2014 International Conference on Information Technology and Multimedia (ICIMU), Putrajaya, Malaysia, November 18 – 20, 2014.
[11] Feature engineering, Wikipedia 2017, https://en.wikipedia.org/wiki/Feature_engineering

IJCATM : www.ijcaonline.org