
Asian Journal of Computer Science and Technology

ISSN: 2249-0701 Vol.8 No.S2, 2019, pp. 1-6


© The Research Publication, www.trp.org.in

A Comprehensive Study on Lexicon Based Approaches for Sentiment Analysis

Venkateswarlu Bonta, Nandhini Kumaresh and N. Janardhan
Department of Computer Science, School of Mathematics and Computer Sciences,
Central University of Tamil Nadu, Thiruvarur, India
E-Mail: [email protected], [email protected],[email protected]

Abstract - In recent years, opinion-based postings in social media have been helping to reshape business, and public sentiments and emotions have had an impact on our social and political systems. Opinions are central to almost all human activities, as they are key influencers of our behaviour. Whenever we need to make a decision, we generally want to know others' opinions. Every organization and business wants to find customer or public opinion about its products and services. Thus, it is necessary to gather and study the opinions on the Web. However, finding and monitoring sites on the Web and distilling the reviews remains a big task, because each site typically contains a huge volume of opinion text, and the average human reader will have difficulty identifying the polarity of each review and summarizing the opinions in them. Hence, automated sentiment analysis is needed to find the polarity score and classify the reviews as positive or negative. This article uses NLTK, TextBlob and the VADER sentiment analysis tool to classify movie reviews from www.rottentomatoes.com, provided by Cornell University, and compares these tools to find the most efficient one for sentiment classification. The experimental results of this work confirm that VADER outperforms TextBlob.

Keywords: Sentiment Analysis, Opinion Mining, SentiWordNet, NLTK, TextBlob, VADER

I. INTRODUCTION

Sentiment analysis is the process of computationally identifying and categorizing the opinions expressed in a piece of text, especially in order to determine whether the writer's attitude towards a particular topic, product, etc. is positive, negative or neutral [1]. The attitude may be one of the following: (a) his or her judgment or evaluation, or (b) the affective state, i.e., the emotional state of the author when writing a review. Sentiment analysis can be highly useful in several cases. The best example is marketing. Marketing teams can use sentiment analysis when launching a new product or to determine the popularity of and preference for an existing product. Reviews from social media can be gathered and used to assess how well or badly a product or service is doing based on customer response. In the computer science literature, three main streams of sentiment definitions can be found. The first stream defines sentiment through opinion. The second defines sentiment through feelings. The third defines sentiment through both feelings and opinions [2]. In addition, there are other terms for slightly different tasks, namely opinion mining, opinion extraction, sentiment mining, review mining, etc. Though there are many names, their main aim is to extract the hidden polarity of the written text or review.

A. Applications

Sentiment analysis has attracted a large number of researchers due to its applications in a variety of disciplines. A huge amount of existing work has focused on online customer reviews of products, hotels, movies, etc. Individual consumers want to know the opinions of existing users of a product before purchasing it, and others' opinions about a political candidate before making a voting decision in an election. Sentiment analysis has also shown its impact on social media such as Twitter and Facebook, where it is used to understand user tweets and behaviour. Previously, when an individual wanted to know the opinion about any product or service, he or she used to ask friends or family members.

When an organization or a business needed public or consumer opinion, it used to conduct surveys, opinion polls, and focus groups. After the explosive growth of social media, there is no need to ask one's friends or family for opinions, and no organization needs to conduct surveys, opinion polls, and focus groups in order to gather public opinions, because there is an abundance of such information available publicly. Due to these applications, industrial activities have flourished in recent years. Sentiment analysis applications have spread to almost every possible domain, from consumer products to services, healthcare, and financial services to social events and political elections.

In addition, shifting to automated sentiment analysis saves time and money. Customers can log in and simply watch the graphs and tables that show them data about the required brands in a user-friendly environment. Such services also have the benefit of being able to capture, measure and display data in real time. Sentiment analysis has become the gateway to understanding customer needs and expectations and to extending the customer base.

II. RELATED WORK

Sentiment analysis is a type of data mining that deals with people's opinions through Natural Language Processing,

1 AJCST Vol.8 No.S2 March 2019



Computational Linguistics and Text Analysis. There are mainly two approaches to extract the sentiment from given reviews and classify the result as positive or negative:

1. Lexicon-based approach
2. Machine learning approach

The lexicon-based approach requires a predefined lexicon, while the machine learning approach classifies reviews automatically but requires training data. A lexicon is a stock of terms that belongs to a particular subject or language. Hybrid approaches are also used to overcome the drawbacks of the individual techniques.

A. Lexicon-Based Approach

The lexicon-based approach uses a sentiment lexicon with information about which words and phrases are positive and which are negative [3]. A sentiment lexicon is a list of lexical features which are generally labelled according to their semantic orientation as either positive or negative. Researchers first create a sentiment lexicon by compiling sentiment word lists using manual, lexical, or corpus-based approaches, and then determine the polarity score of a given review from the positive and negative indicators identified in the lexicon.

Fig. 1 Fundamental approach for sentiment classification using a lexicon

Some lexicons, such as LIWC (Linguistic Inquiry and Word Count) and GI (General Inquirer), categorize words as positive or negative according to their context-free semantic orientation. LIWC consists of almost 4,500 words organized into one of 76 categories, including 905 words in two categories especially related to sentiment analysis.

LIWC is well established and was validated in a process spanning more than a decade of work by sociologists, psychologists, and linguists [4]. Despite its extensive use for sentiment analysis of social media text, LIWC does not include acronyms, initialisms, emoticons, or slang, which are important factors for sentiment analysis of social media text.

However, other lexicons such as ANEW (Affective Norms for English Words), SentiWordNet, and SenticNet are associated with valence scores for sentiment intensity. SentiWordNet consists of 147,306 synsets annotated with three sentiment scores: positive, negative, and objective [5].

Though it is not a gold-standard resource like WordNet, it is useful for a wide range of tasks. One of the major advantages of the lexicon-based approach is its domain independence; it can also be easily extended and improved.

B. Machine Learning Approach

Machine learning approaches construct an algorithm and build a model through feature selection and by learning from labelled training datasets [6]. The Naïve Bayes classifier, Support Vector Machine (SVM), and Random Forest are well-known methods for sentiment classification through machine learning.

Fig. 2 Fundamental approach for sentiment classification using machine learning

These algorithms can automatically learn all kinds of features for classification through optimization. Since a sentiment classifier is trained on labelled data from one domain, it often does not work well in another domain. To overcome this problem, lexicon-based approaches are recommended.

C. Levels of Sentiment Analysis

Based on the level of analysis involved, sentiment analysis is categorized into three levels, namely
1. Document level analysis
2. Sentence level analysis
3. Entity and aspect level analysis


The task of document level analysis is to classify whether the whole opinion document expresses a positive or negative sentiment [7]. This level of analysis assumes that each document expresses opinions on a single entity. Sentence level analysis deals with sentences and determines whether each sentence expresses a positive, negative, or neutral opinion. This level of analysis is closely related to subjectivity classification. The goal of the third level, entity and aspect level analysis, is to discover sentiments on entities and their aspects. Instead of looking at language constructs such as reviews, sentences, etc., the aspect level looks directly at the opinion itself. It is based on the idea that an opinion consists of a sentiment and a target entity. This article presents our work on lexicon-based approaches to identify the sentiment of given movie reviews.

D. Lexicon-Based Sentiment Analysis Tools

1. NLTK: NLTK (Natural Language Toolkit) is an open source Natural Language Processing platform for Python, developed in conjunction with a computational linguistics course at the University of Pennsylvania in 2001. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as SentiWordNet, along with a suite of text processing libraries for classification, tokenization, and semantic reasoning. In NLTK, the sentiment score is calculated from SentiWordNet, which assigns each WordNet synset three numerical sentiment scores (positivity, negativity, and objectivity), each ranging from 0.0 to 1.0 and summing to 1.0 for each synset. The objectivity score of a synset can therefore be determined from the other two using the formula given below.

$$\mathrm{ObjScore} = 1 - (\mathrm{PosScore} + \mathrm{NegScore}) \qquad (1)$$

The scores are calculated using a complex mix of semi-supervised algorithms. SentiWordNet is not a gold-standard resource like WordNet and LIWC. However, it is useful for a wide range of tasks.

The WordNet synsets are uniquely identified by (POS, ID) pairs [8]. The SynsetTerms column shows the included terms with their sense numbers. The SentiWordNet lexicon is very noisy; a large majority of synsets have no positive or negative scores. It also fails to account for sentiment-bearing lexical features relevant to text in microblogs.

TABLE I SAMPLE SENTIWORDNET VALUES

POS  ID        PosScore  NegScore  SynsetTerms
A    00071142  0.5       0.5       Impressed#1
A    00070111  0         0         Enhansive#2
A    00065064  0.625     0         Good#5
A    00061664  0.625     0         Neat#1
A    00035868  0.5       0         Blusting#1

2. TextBlob: TextBlob is a Python library for processing textual data. It provides a consistent API for common natural language processing (NLP) tasks [9]. A TextBlob object behaves much like a Python string.

Features of TextBlob

a. Tokenization
b. Noun phrase extraction
c. POS tagging
d. Sentiment analysis
e. Language translation and detection
f. n-grams
g. Spelling correction
h. WordNet integration

3. VADER: VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool. It is open source under the MIT license and was introduced by Hutto and Gilbert [10]; its NLTK port credits modifications by George Berry, Ewan Klein, and Pierpaolo Pantone. The VADER lexicon performs exceptionally well in the social media domain. VADER retains the benefits of traditional sentiment lexicons like LIWC (Linguistic Inquiry and Word Count): it is bigger, yet just as simply inspected, understood, quickly applied, and easily extended. The VADER sentiment lexicon is of gold-standard quality and has been validated by humans. VADER distinguishes itself from LIWC in that it is more sensitive to sentiment expressions in social media contexts and also generalizes more favourably to other domains.

III. METHODOLOGY

Movie reviews are analysed for their sentiment around a movie's release to predict whether the response to the movie is positive or negative. The overall methodology follows four steps:

1. Data collection
2. Pre-processing
3. Sentiment extraction
4. Classification of sentiment as positive or negative

A. NLTK

Fig. 3 shows how NLTK classifies a review as either positive or negative. It first tokenizes the reviews into words and then removes stop words such as a, an, the, for, is, etc. After the stop words are removed, the remaining words are stemmed to get their root words. For example, "disappointed" is reduced to "disappoint". This reduces the time spent searching for the word in SentiWordNet. All special symbols and numbers are also removed from the reviews. POS (part-of-speech) tagging is then performed on the cleaned reviews, which involves stringent grammar rules. The data is then ready for classification: the positive and negative words are extracted from the given review and matched with their respective sentiment scores in SentiWordNet. Finally, the positive and negative terms found in the review are counted and, using the sentiment polarity, the review is assigned to the class that receives the highest score.
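The lookup-and-aggregate step of this pipeline can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy lexicon, the stop-word and negation lists, and all function names are hypothetical; a synset's polarity is taken as PosScore minus NegScore; and a term's score is assumed to be the average over its senses weighted by 1/(sense number), so that more common senses count more. Stemming and POS tagging are omitted for brevity.

```python
import re

# Hypothetical toy lexicon: term -> [(PosScore, NegScore), ...] ordered by
# sense number (sense 1 first). The values are illustrative only.
TOY_LEXICON = {
    "good": [(0.75, 0.0), (0.5, 0.0), (0.625, 0.0)],
    "neat": [(0.625, 0.0)],
    "bad": [(0.0, 0.625), (0.0, 0.75)],
    "dull": [(0.0, 0.5)],
}
STOP_WORDS = {"a", "an", "the", "for", "is", "was", "and", "but"}
NEGATIONS = {"not", "no", "never"}


def objectivity(pos, neg):
    """Objectivity of a synset: the three scores sum to 1.0."""
    return 1.0 - (pos + neg)


def synset_score(pos, neg):
    """Polarity of a single synset (assumed: PosScore - NegScore)."""
    return pos - neg


def term_score(term, lexicon=TOY_LEXICON):
    """Sense-rank weighted average over the k synsets that contain a term."""
    senses = lexicon.get(term)
    if not senses:
        return 0.0  # a term missing from the lexicon scores 0
    weighted = sum(
        synset_score(pos, neg) / rank
        for rank, (pos, neg) in enumerate(senses, start=1)
    )
    norm = sum(1.0 / rank for rank in range(1, len(senses) + 1))
    return weighted / norm


def review_scores(text, lexicon=TOY_LEXICON):
    """Return (PosScore, NegScore, SentiScore) for one review."""
    tokens = re.findall(r"[a-z]+", text.lower())  # drops digits and symbols
    pos_total = neg_total = 0.0
    negate = False
    for token in tokens:
        if token in NEGATIONS:
            negate = True  # flip the next non-stop-word term
            continue
        if token in STOP_WORDS:
            continue
        score = term_score(token, lexicon)
        if negate:
            score, negate = -score, False
        if score > 0:
            pos_total += score
        elif score < 0:
            neg_total += score
    return pos_total, neg_total, pos_total + neg_total


def classify(text):
    """Label a review by the sign of its final sentiment score."""
    senti = review_scores(text)[2]
    if senti > 0:
        return "positive"
    if senti < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for review in ("The movie was good", "not good", "dull and bad"):
        print(review, "->", classify(review))
```

In practice the toy dictionary would be replaced by real SentiWordNet lookups, for example through NLTK's `sentiwordnet` corpus reader, and the regular-expression tokenizer by NLTK's tokenizer, stop-word list, and stemmer.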


Fig. 3 Overall process of NLTK to classify a review

For each synset, the score in the SentiWordNet lexicon can be calculated by

$$\mathrm{SynsetScore} = \mathrm{PosScore} - \mathrm{NegScore} \qquad (2)$$

For a term with a specific POS tag, if k synsets contain it, then the sentiment score of the term can be calculated by the following expression

$$\mathrm{TermScore}(t) = \frac{\sum_{n=1}^{k} \mathrm{SynsetScore}(n)/n}{\sum_{n=1}^{k} 1/n} \qquad (3)$$

where n is the sense number. If a term is not in SentiWordNet, we assume that its sentiment score is 0. If a negation word appears in front of a term, we simply reverse the sentiment value of the respective term. The sentiment score of the target review can be calculated by adding up all the term sentiment scores as shown below:

$$\mathrm{PosScore}(p) = \sum_{i=1}^{m} \mathrm{TermScore}(t_i) \qquad (4)$$

$$\mathrm{NegScore}(p) = \sum_{j=1}^{n} \mathrm{TermScore}(t_j) \qquad (5)$$

$$\mathrm{SentiScore}(p) = \mathrm{PosScore}(p) + \mathrm{NegScore}(p) \qquad (6)$$

where p is a review which contains m positive terms and n negative terms. PosScore(p) and NegScore(p) represent the positivity and negativity of the corresponding review p. SentiScore(p) represents the final sentiment score of the review p.

B. TextBlob

TextBlob is a Python library that provides text mining, text analysis and text processing modules for Python developers. TextBlob reuses NLTK corpora, and if NLTK has been installed before TextBlob, then TextBlob will install with great ease. TextBlob supports Python versions 2.6 and later.

Installation:
$ pip install -U textblob
$ python -m textblob.download_corpora

The above commands install TextBlob and download the necessary NLTK corpora; if NLTK is installed before TextBlob, there is no need to download the corpora.

TextBlob performs sentence level analysis. First, it takes a dataset as input and then splits each review into sentences. A common way of determining polarity for an entire dataset is to count the number of positive and negative sentences/reviews and decide whether the response is positive or negative based on those totals. The polarity and subjectivity of a given review can be obtained from the sentiment property. It returns a named tuple with two fields, polarity and subjectivity. The polarity score ranges from -1 to 1, and subjectivity ranges from 0 to 1, where 0 is most objective and 1 is most subjective.

Example:
from textblob import TextBlob
review = TextBlob("the movie was interesting.")
review.sentiment
# Sentiment(polarity=0.5, subjectivity=0.5)

C. VADER

As mentioned earlier, VADER is a lexicon and rule-based sentiment analysis tool. It relies on a sentiment lexicon, a list of lexical features which are generally labelled according to their semantic orientation as either positive or negative, combined with a set of rules. VADER has been quite successful when dealing with social media texts, movie reviews, and product reviews. This is because VADER not only reports the positivity and negativity scores but also indicates how positive or negative a sentiment is. The developers of VADER used Amazon's Mechanical Turk to obtain most of their ratings.

1. Advantages of VADER

a. It works well on social-media-style text.
b. It does not require any training data but is constructed from a generalizable, valence-based, human-curated gold-standard lexicon.
c. VADER supports emoji for sentiment classification.
d. It is fast enough to be used online.
e. It does not severely suffer from a speed-performance tradeoff.

Installation:

C:\Users\Admin>pip install vaderSentiment

VADER analyses a piece of text to see if any of the words in the text are present in the VADER lexicon. The polarity indices can be obtained using the polarity_scores() function, which returns the negative, neutral, positive, and compound metric values for a given sentence. The compound score is computed by summing all the lexicon ratings and normalizing the result to lie between -1 and +1, where -1 indicates the most extreme negative and +1 the most extreme positive. It is useful for setting standardized thresholds for classifying sentences as positive, neutral or negative. The typical threshold values are given below:

Positive sentiment: compound score >= 0.05
Neutral sentiment: compound score > -0.05 and < 0.05
Negative sentiment: compound score <= -0.05

The positive, neutral, and negative scores are the most useful metrics when multidimensional measures of sentiment are needed for a given textual review. Table II shows sample words from the VADER lexicon along with their sentiment ratings.
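The thresholding rule above can be sketched in a few lines of Python. This is an illustrative sketch rather than VADER's own code: the function names are hypothetical, and the `normalize` helper assumes the mapping used in the open-source vaderSentiment package, where a summed valence x is squashed into (-1, 1) as x / sqrt(x^2 + alpha) with alpha = 15.

```python
import math


def normalize(valence_sum, alpha=15.0):
    """Squash a summed valence into (-1, 1).

    Assumed form: x / sqrt(x^2 + alpha), with alpha = 15 taken from the
    open-source vaderSentiment implementation.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


def classify_compound(compound, threshold=0.05):
    """Map a compound score to a label using the typical 0.05 thresholds."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    # 1.9 is the lexicon rating for "good"; -3.1 is the rating for "disaster".
    for rating in (1.9, -3.1, 0.0):
        compound = normalize(rating)
        print(rating, round(compound, 4), classify_compound(compound))
```

For instance, a lone rating of 1.9 (the lexicon value for "good") normalizes to about 0.4404, which lies above 0.05 and is therefore labelled positive. With the package installed, `SentimentIntensityAnalyzer().polarity_scores(text)` from `vaderSentiment.vaderSentiment` returns a dictionary with the keys `neg`, `neu`, `pos`, and `compound`, and the `compound` value can be passed straight to `classify_compound`.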


TABLE II SAMPLE VADER LEXICON VALUES

Word      Sentiment Rating
Great     3.1
Disaster  -3.1
Good      1.9
Horrible  -2.5
Rejoiced  2.0

VADER analyses sentiment primarily based on certain key points such as punctuation, capitalization, degree modifiers, conjunctions, and the preceding tri-gram [10].

There are more than 7,500 lexical features with validated valence scores that indicate both the sentiment polarity and the sentiment intensity, ranging from -4 to +4.

Fig. 4 Methods and process approach overview of VADER

2. Dataset

The dataset includes 11861 sentence-level snippets from www.rottentomatoes.com, provided by Cornell University [11]. The snippets were derived from an original set of 10662 movie reviews (5331 positive and 5331 negative) in Pang & Lee (July 2005).

IV. RESULTS AND DISCUSSION

Our experimental approach yields the following results.

Fig. 5 VADER versus TextBlob. VADER's classifications are closer than TextBlob's to the actual review labels in the dataset.

A. Lexicon-Classification Performance

TABLE III PERFORMANCE OF LEXICON SENTIMENT ANALYSIS TOOLS

          Classification accuracy metrics
Lexicon   Precision%  Recall%  F1 score%  Accuracy%
VADER     78.46       85.0     81.60      77.0
TextBlob  76.92       81.96    79.37      74.0
NLTK      60          55.0     57         62.0

From Table III it is clear that the accuracy of VADER is 77%, whereas TextBlob and NLTK reach 74% and 62% respectively. It is also observed that NLTK and TextBlob incorrectly classify a concentration of reviews as neutral. Presumably, this is due to a lack of coverage of the sentiment-oriented language of social media text, which includes emoticons, slang, and acronyms.

The lexicon of a machine learning algorithm is constructed by training its model on half of the data, and the remaining half is used for testing. Most of these algorithms work only in specific domains. Unlike machine learning algorithms, VADER performs well across various kinds of domains. Compared to machine learning techniques, VADER has several advantages. First, it is both quick and computationally economical. VADER runs directly on a standard modern laptop or computer; a corpus that takes a fraction of a second to analyse with VADER can take hours with more complex models like Support Vector Machines. Second, the lexicon and the rules used by VADER are directly accessible and not hidden. Therefore, VADER is easily understood, extended and modified.

VADER sentiment analysis works better than TextBlob for texts from social media and other Web sources because, when analysing comments or reviews from social media, the sentiment of a sentence changes based on the emoticons. VADER takes this into account, along with slang, capitalization, and the way words are written in their context. For example, "The movie is good" gives a compound score of 0.4404, whereas "The movie is GOOD" gives a compound score of 0.5622. Another factor that increases the intensity of the sentiment


in a sentence is the inclusion of exclamation marks. VADER considers up to three exclamation marks, each of which adds positive or negative intensity. For instance, "The movie was GOOD!" gives a result of 0.6027. VADER also takes into account modifying words in front of a sentiment term; for example, "extremely good" increases the positive intensity. VADER also supports emoji sentiments. Hence, VADER is a better option for analysing tweets and their sentiments.

V. CONCLUSION

VADER is a gold-standard list of lexical features which is specially attuned to sentiment in microblog text. If sentiment analysis is the only task planned for microblog text, and if it needs to be processed fast, then VADER is the better choice, using a compound-score threshold of 0.05. VADER also follows grammatical and syntactical conventions for expressing and emphasizing sentiment intensity. VADER performs better than the TextBlob and NLTK sentiment analysis tools.

REFERENCES

[1] Vikas Malik and Amit Kumar, "Sentiment Analysis of Twitter Data Using Naive Bayes Algorithm", International Journal on Recent and Innovation Trends in Computing and Communication, Vol. 6, No. 4, 2018.
[2] J. Ge, M. Alonso Vazquez, and U. Gretzel, "Sentiment analysis: a review", In Sigala, M. & Gretzel, U. (Eds.), Advances in Social Media for Travel, Tourism and Hospitality: New Perspectives, Practice and Cases, pp. 243-261. New York: Routledge, 2018.
[3] Z. Nanli, Z. Ping, L. Weiguo, and C. Meng, "Sentiment analysis: A literature review", Proceedings of the International Symposium on Management of Technology (ISMOT), Hangzhou, IEEE, pp. 572-576, 2012.
[4] J.W. Pennebaker, R.L. Boyd, K. Jordan, and K. Blackburn, "The development and psychometric properties of LIWC2015", Austin, TX: University of Texas at Austin, 2015.
[5] S. Baccianella, A. Esuli, and F. Sebastiani, "SENTIWORDNET 3.0: An Enhanced Lexical Resource for Sentiment Analysis and Opinion Mining", pp. 2200-2204, 2008.
[6] M. Hu and B. Liu, "Opinion Extraction and Summarization on the Web", pp. 1621-1624.
[7] B. Pang, L. Lee, and S. Vaithyanathan, "Thumbs up? Sentiment Classification using Machine Learning Techniques", pp. 79-86, 2002.
[8] Wordnet.com, "WordNet, a Lexical database for English", [online] Available: http://wordnet.princeton.edu/
[9] Textblob.com, "Textblob Tutorial, Quickstart", [online] Available at: https://textblob.readthedocs.io/en/latest/quickstart.html#quickstart
[10] C. J. Hutto and E. Gilbert, "VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text", Proceedings of the Eighth International AAAI Conference on Weblogs and Social Media, pp. 216-225, 2014.
[11] Cornell.edu, "Movie Review data", [online] Available: http://www.cs.cornell.edu/people/pabo/movie-review-data/
[12] Steven Bird and Edward Loper, "NLTK: The Natural Language Toolkit", 2006.
[13] Bing Liu, "Sentiment Analysis and Opinion Mining", Morgan & Claypool Publishers, May 2012.
[14] Adamo and David, "A Text Similarity Approach to Sentiment Classification (of Movie Reviews) using SentiWordNet", 10.13140/RG.2.1.3271.1120, 2015.
[15] H. Han, Y. Zhang, J. Zhang, J. Yang, and X. Zou, "Improving the performance of lexicon-based review sentiment analysis method by reducing additional introduced sentiment bias", pp. 1-11, 2018.
[16] Steven Loria, "Textblob Documentation", Release 0.15.2, 2018.
