Report
I would like to express my sincere gratitude to Mr. Rohit Goyal, Department of Computer Science and Engineering, whose role as project guide was invaluable to the project.
Last but not least, we convey our gratitude to all the teachers for providing us with the technical skills that will always remain our asset, and to all the non-teaching staff for their gracious hospitality.
ABSTRACT
Sentiment analysis or opinion mining is the computational study of people’s opinions,
sentiments, attitudes, and emotions expressed in written language. It is one of the most active
research areas in natural language processing and text mining in recent years. Its popularity is
mainly due to two reasons. First, it has a wide range of applications because opinions are central
to almost all human activities and are key influencers of our behaviors. Whenever we need to
make a decision, we want to hear others’ opinions. Second, it presents many challenging research problems, which had never been attempted before the year 2000. Part of the reason for the earlier lack of study was that there was little opinionated text in digital form. It is thus no surprise that the inception and rapid growth of the field coincide with those of social media on the Web. In fact, the research has also spread outside of computer science to the management and social sciences due to its importance to business and society as a whole. This report starts with a discussion of mainstream sentiment analysis research and then describes some recent work on modeling comments, discussions, and debates, which represents another kind of analysis of sentiments and opinions. Sentiment classification is a way to analyze the subjective information in a text and then mine the opinion.
Sentiment analysis is the procedure by which information is extracted from the opinions, appraisals and emotions of people with regard to entities, events and their attributes. In decision making, the opinions of others have a significant effect on how easily customers make choices with regard to online shopping, events, products and entities. Approaches to text sentiment analysis typically work at a particular level, such as the phrase, sentence or document level. This report analyzes a solution for sentiment classification at a fine-grained level, namely the sentence level, in which the polarity of a sentence is given by one of three categories: positive, negative and neutral.
TABLE OF CONTENTS
1 INTRODUCTION
1.1 Objective
2 ALGORITHM
2.1 Model
3 SYSTEM ANALYSIS AND DESIGN
4 IMPLEMENTATION
5 TESTING
5.2.2 Black Box Testing
6 SCREEN SHOTS
7 CONCLUSION
8 BIBLIOGRAPHY
1 INTRODUCTION
Sentiment analysis refers to the use of natural language processing, text analysis and computational linguistics to identify and extract subjective information from source materials. Generally speaking, sentiment analysis aims to determine the attitude of a speaker or writer with respect to some topic, or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation, affective state, or intended emotional communication.
Sentiment analysis is the process of examining a piece of writing for the positive, negative, or neutral feelings bound to it. Humans have the innate ability to determine sentiment; however, this process is time consuming, inconsistent, and costly in a business context. It is simply not realistic to have people individually read tens of thousands of customer reviews and score them for sentiment.
For example, consider Semantria’s cloud-based sentiment analysis software. It extracts the sentiment of a document and its components through the following steps:
A document is broken into its basic parts of speech, called POS tags, which identify the structural elements of a document, paragraph, or sentence (i.e., nouns, adjectives, verbs, and adverbs).
Sentiment-bearing phrases, such as “terrible service”, are identified through the use of specifically designed algorithms.
Each sentiment-bearing phrase in a document is given a score on a logarithmic scale that ranges between -10 and 10.
Finally, the scores are combined to determine the overall sentiment of the document or sentence. Document scores range between -2 and 2.
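The short Python sketch below illustrates how phrase-level scores on a -10 to 10 scale might be combined into a document-level score in the -2 to 2 range. It is only a toy illustration, not Semantria’s actual implementation; the phrase lexicon and the averaging rule are assumptions made for the example.

# Illustrative sketch only: a toy phrase-score aggregator, not Semantria's actual method.
# The phrase lexicon and the averaging rule are assumptions for illustration.
PHRASE_SCORES = {
    "terrible service": -8.0,   # hypothetical sentiment-bearing phrases, scored -10..10
    "great food": 7.0,
    "long wait": -3.0,
}

def document_score(text, phrase_scores=PHRASE_SCORES):
    """Average the scores of the phrases found in the text and rescale to roughly -2..2."""
    text = text.lower()
    found = [score for phrase, score in phrase_scores.items() if phrase in text]
    if not found:
        return 0.0                               # no sentiment-bearing phrase -> neutral
    return (sum(found) / len(found)) / 5.0       # map the -10..10 phrase scale onto -2..2

print(document_score("The great food did not make up for the terrible service."))   # -0.1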
Existing approaches to sentiment analysis can be grouped into three main categories:
Keyword spotting
Lexical affinity
Statistical methods
KEYWORD SPOTTING
Keyword spotting is the most naive approach and probably also the most popular because of its
accessibility and economy .Text is classified into affect categories based on the presence of fairly
unambiguous affect words like ‘happy’, ‘sad’, ‘afraid’, and ‘bored’ .The weaknesses of this
approach lie in two areas: poor recognition of affect when negation is involved and reliance on
surface features .About its first weakness, while the approach can correctly classify the sentence
“today was a happy day” as being happy, it is likely to fail on a sentence like “today wasn’t a
happy day at all” About its second weakness, the approach relies on the presence of obvious
affect words that are only surface features of the prose . In practice, a lot of sentences convey
affect through underlying meaning rather than affect adjectives For example, the text “My
husband just filed for divorce and he wants to take custody of my children away from me”
certainly evokes strong emotions, but uses no affect keywords, and therefore, cannot be
classified using a keyword spotting approach .
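A minimal keyword-spotting sketch is shown below. The small affect lexicon is invented for illustration; the last call also reproduces the negation weakness discussed above.

# Naive keyword spotting (illustrative sketch; the affect lexicon is made up).
AFFECT_WORDS = {
    "happy": "happy", "joyful": "happy",
    "sad": "sad", "miserable": "sad",
    "afraid": "afraid", "scared": "afraid",
    "bored": "bored",
}

def keyword_spot(text):
    """Return the affect category of the first affect keyword found, else None."""
    for token in text.lower().split():
        token = token.strip(".,!?'\"")
        if token in AFFECT_WORDS:
            return AFFECT_WORDS[token]
    return None

print(keyword_spot("today was a happy day"))              # 'happy'
print(keyword_spot("today wasn't a happy day at all"))    # also 'happy' -- negation is missed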
LEXICAL AFFINITY
Lexical affinity is slightly more sophisticated than keyword spotting as, rather than simply
detecting obvious affect words, it assigns arbitrary words a probabilistic ‘affinity’ for a particular
emotion For example, ‘accident’ might be assigned a 75% probability of being indicating a
negative affect, as in ‘car accident’ or ‘hurt by accident’ These probabilities are usually trained
from linguistic corpora .Though often outperforming pure keyword spotting, there are two main
problems with the approach First, lexical affinity, operating solely on the word-level, can easily
be tricked by sentences like “I avoided an accident” (negation) and “I met my girlfriend by
accident” (other word senses) Second, lexical affinity probabilities are often biased toward text
of a particular genre, dictated by the source of the linguistic corpora This makes it difficult to
develop a reusable, domain-independent model .
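The sketch below illustrates the lexical-affinity idea. The word probabilities are hand-assigned assumptions (including the 75% figure for ‘accident’ mentioned above); a real system would estimate them from a corpus.

# Lexical affinity sketch: each word carries a probability of signalling negative affect.
# The probabilities below are assumed for illustration, not trained from a corpus.
NEGATIVE_AFFINITY = {"accident": 0.75, "hurt": 0.80, "divorce": 0.85}

def negative_affinity(text):
    """Average the negative-affinity probabilities of the known words in the text."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    known = [NEGATIVE_AFFINITY[w] for w in words if w in NEGATIVE_AFFINITY]
    return sum(known) / len(known) if known else 0.0

print(negative_affinity("I was hurt by accident"))    # 0.775
print(negative_affinity("I avoided an accident"))     # 0.75 -- negation does not lower the score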
STATISTICAL METHODS
Statistical methods, such as Bayesian inference and support vector machines, have been popular
for affect classification of texts .By feeding a machine learning algorithm a large training corpus
of affectively annotated texts, it is possible for the system to not only learn the affective valence
of affect keywords (as in the keyword spotting approach), but also to take into account the
valence of other arbitrary keywords (like lexical affinity), punctuation, and word co-occurrence
frequencies. However, traditional statistical methods are generally semantically weak, meaning
that, with the exception of obvious affect keywords, other lexical or co-occurrence elements in a
statistical model have little predictive value individually .As a result, statistical text classifiers
only work with acceptable accuracy when given a sufficiently large text input .So, while these
methods may be able to affectively classify user’s text on the page- or paragraph- level, they do
not work well on smaller text units such as sentences or clauses .
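As a rough sketch of the statistical approach, the snippet below trains a Naive Bayes classifier on a tiny annotated corpus using scikit-learn’s CountVectorizer and MultinomialNB. The example texts and labels are invented, and, as noted above, a real system would need a far larger training corpus.

# Statistical approach sketch: bag-of-words features + Naive Bayes (scikit-learn).
# The toy corpus below is invented; real systems need large annotated corpora.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "I love this film, it made me so happy",
    "what a wonderful and touching story",
    "this was a terrible, boring movie",
    "I hated every minute of it",
]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()                   # word-frequency features
X_train = vectorizer.fit_transform(train_texts)
classifier = MultinomialNB().fit(X_train, train_labels)

X_test = vectorizer.transform(["a boring and terrible story"])
print(classifier.predict(X_test))                # expected: ['negative']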
1.1 OBJECTIVE
Sentiment classification is a way to analyze the subjective information in a text and then mine the opinion. Sentiment analysis is the procedure by which information is extracted from the opinions, appraisals and emotions of people with regard to entities, events and their attributes. In decision making, the opinions of others have a significant effect on how easily customers make choices with regard to online shopping, events, products and entities.
In recent years, sentiment analysis has become a hotspot in numerous research fields, including natural language processing (NLP), data mining (DM) and information retrieval (IR). This is due to the increasing amount of subjective text appearing on the internet. Machine learning is commonly used to classify sentiment from text. These techniques involve statistical models such as Support Vector Machines (SVM), bag-of-words representations and Naïve Bayes (NB). The data most commonly used in sentiment mining is taken from blogs, Twitter and web reviews, focusing on sentences that express sentiment directly. The main aim of this project is to develop a sentiment mining model that can process the text in mobile reviews.
2-ALGORITHM
2.1 Model
Feature engineering
The first thing we need to do when creating a machine learning model is to decide what to use as
features. We call features the pieces of information that we take from the text and give to the
algorithm so it can work its magic. For example, if we were doing classification on health, some
features could be a person’s height, weight, gender, and so on. We would exclude things that
maybe are known but aren’t useful to the model, like a person’s name or favorite color. In this
case though, we don’t even have numeric features. We just have text. We need to somehow
convert this text into numbers that we can do calculations on. So what do we do? Simple! We
use word frequencies. That is, we ignore word order and sentence construction, treating every
document as a set of the words it contains. Our features will be the counts of each of these
words. Even though it may seem too simplistic an approach, it works surprisingly well.
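As a small illustration of this idea, the snippet below builds word-frequency features by hand for two made-up sentences; this is essentially what libraries such as scikit-learn’s CountVectorizer automate.

# Word-frequency features built by hand (illustrative sketch; the sentences are made up).
from collections import Counter

documents = ["A great game", "A clean but forgettable game"]

# The vocabulary is every distinct lower-cased word across all documents.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

def to_features(doc):
    """Represent a document as a vector of word counts over the vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in vocabulary]

print(vocabulary)                  # ['a', 'but', 'clean', 'forgettable', 'game', 'great']
print(to_features(documents[0]))   # [1, 0, 0, 0, 1, 1]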
The Naïve Bayes classifier can be trained very efficiently in supervised learning. For example, consider an insurance company that intends to promote a new policy. To reduce promotion costs, the company wants to target the most likely prospects. It can collect historical data for its customers, including income range, number of current insurance policies, number of vehicles owned, money invested, and information on whether a customer has recently switched insurance companies. Using a Naïve Bayes classifier, the company can predict how likely a customer is to respond positively to a policy offering. With this information, the company can reduce its promotion costs by restricting the promotion to the most likely customers. The Naïve Bayes algorithm offers fast model building and scoring in both binary and multiclass situations for relatively low volumes of data. The algorithm makes predictions using Bayes’ theorem, which incorporates evidence or prior knowledge in its prediction. Bayes’ theorem relates the conditional and marginal probabilities of stochastic events H and X, and is mathematically stated as
P(H|X) = P(X|H) P(H) / P(X)
P stands for the probability of the variables within parentheses.
P(H) is the prior (marginal) probability of H. It is “prior” in the sense that it does not yet account for the information available in X.
P(H|X) is the conditional probability of H given X. It is also called the posterior probability because it already incorporates the outcome of event X.
P(X|H) is the conditional probability of X given H.
Let T be a training set of samples, each with its class label. There are k classes, C1, C2, ..., Ck. Each sample is represented by an n-dimensional vector X = {x1, x2, ..., xn}, depicting n measured values of the n attributes A1, A2, ..., An respectively.
Given a sample X, the classifier will predict that X belongs to the class having the highest a posteriori probability conditioned on X. That is, X is predicted to belong to the class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ k, j ≠ i.

Thus we find the class that maximizes P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes’ theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X).

As P(X) is the same for all classes, only P(X|Ci) P(Ci) needs to be maximized. If the class a priori probabilities P(Ci) are not known, then it is commonly assumed that the classes are equally likely, P(C1) = P(C2) = ... = P(Ck), and we would therefore maximize P(X|Ci).
Given data sets with many attributes, it would be computationally expensive to compute P(X|Ci). In order to reduce the computation involved in evaluating P(X|Ci) P(Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the sample:

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci).

The probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can easily be estimated from the training set. Recall that here xk refers to the value of attribute Ak for sample X. If Ak is categorical, then P(xk|Ci) is the number of samples of class Ci in T having the value xk for attribute Ak, divided by freq(Ci, T), the number of samples of class Ci in T.
In order to predict the class label of X, P(X|Ci) P(Ci) is evaluated for each class Ci. The classifier predicts that the class label of X is Ci if and only if it is the class that maximizes P(X|Ci) P(Ci).
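A short sketch of this procedure for categorical attributes is given below: the priors P(Ci) and the conditionals P(xk|Ci) are estimated from a tiny invented training set, and a new sample is assigned to the class maximizing P(X|Ci) P(Ci).

# Naive Bayes over categorical attributes (illustrative sketch; the data is invented).
from collections import Counter, defaultdict

# Each training sample: (attribute values, class label)
training = [
    ({"income": "high", "switched": "no"},  "respond"),
    ({"income": "high", "switched": "yes"}, "respond"),
    ({"income": "high", "switched": "no"},  "respond"),
    ({"income": "low",  "switched": "no"},  "ignore"),
    ({"income": "low",  "switched": "yes"}, "ignore"),
]

class_counts = Counter(label for _, label in training)
value_counts = defaultdict(Counter)            # (class, attribute) -> Counter of values
for attrs, label in training:
    for attr, value in attrs.items():
        value_counts[(label, attr)][value] += 1

def classify(sample):
    """Return the class Ci that maximizes P(X|Ci) * P(Ci)."""
    best_class, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / len(training)                                   # prior P(Ci)
        for attr, value in sample.items():
            score *= value_counts[(label, attr)][value] / count         # P(xk|Ci)
        if score > best_score:
            best_class, best_score = label, score
    return best_class

print(classify({"income": "high", "switched": "yes"}))   # expected: 'respond'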
Consider the following word counts from the training reviews:

Word     Positive reviews     Negative reviews
I        5                    4
Love     20                   6
This     5                    5
Film     4                    3

Given the test sentence “I love this film”, find the sentiment for the given test set.
Number of positive reviews = 10
Number of negative reviews = 10

Prior probability:
P(positive) = 10/20 = 0.5
P(negative) = 10/20 = 0.5
Conditional probability
The conditional probability is the probability that a random variable will take on a particular value given that the outcome of another random variable is known.
The conditional probability for the word ‘I’ in positive reviews is P(I|positive) = 5/10 = 0.5
The conditional probability for the word ‘LOVE’ in positive reviews is P(Love|positive) = 20/10 = 2
The conditional probability for the word ‘THIS’ in positive reviews is P(This|positive) = 5/10 = 0.5
The conditional probability for the word ‘FILM’ in positive reviews is P(Film|positive) = 4/10 = 0.4
The conditional probability for the word ‘I’ in negative reviews is P(I|negative) = 4/10 = 0.4
The conditional probability for the word ‘LOVE’ in negative reviews is P(Love|negative) = 6/10 = 0.6
The conditional probability for the word ‘THIS’ in negative reviews is P(This|negative) = 5/10 = 0.5
The conditional probability for the word ‘FILM’ in negative reviews is P(Film|negative) = 3/10 = 0.3
Posterior probability
The posterior probability is the product of the prior probability and the conditional probabilities.
The posterior probability for the positive class is P(positive) = 0.5 × 0.5 × 0.5 × 0.4 × 2 = 0.1
The posterior probability for the negative class is P(negative) = 0.5 × 0.6 × 0.3 × 0.5 × 0.4 = 0.018
The posterior probability for the positive review is greater than the posterior probability of the negative review:
P(positive) > P(negative)
The given test sentence “I Love This Film” is therefore predicted by Naïve Bayes to be a positive sentiment.
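The few lines of Python below reproduce this calculation using the word counts from the table; the numbers are exactly those of the worked example above.

# Reproducing the worked example above.
pos_counts = {"i": 5, "love": 20, "this": 5, "film": 4}   # counts in the 10 positive reviews
neg_counts = {"i": 4, "love": 6,  "this": 5, "film": 3}   # counts in the 10 negative reviews
n_reviews_per_class = 10
prior = 10 / 20                                            # 0.5 for each class

posterior_pos = prior
posterior_neg = prior
for word in "I love this film".lower().split():
    posterior_pos *= pos_counts[word] / n_reviews_per_class
    posterior_neg *= neg_counts[word] / n_reviews_per_class

print(posterior_pos, posterior_neg)        # approximately 0.1 and 0.018
print("positive" if posterior_pos > posterior_neg else "negative")   # positive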
A simple example
Let’s see how this works in practice with a simple example. Suppose we are building a classifier
that says whether a text is about sports or not. Our training data has 5 sentences:
Text Tag
“A great game” Sports
“The election was over” Not sports
“Very clean match” Sports
“A clean but forgettable game” Sports
“It was a close election” Not sports
Now, which tag does the sentence “A very close game” belong to?
Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the sentence “A very close game” is Sports and the probability that it is Not Sports; then we take the larger one. Written mathematically, what we want is P(Sports | a very close game), the probability that the tag of a sentence is Sports given that the sentence is “A very close game”.
Bayes’ Theorem
Now we need to transform the probability we want to calculate into something that can be
calculated using word frequencies. For this, we will use some basic properties of probabilities,
and Bayes’ Theorem. If you feel like your knowledge of these topics is a bit rusty, read up on it
and you’ll be up to speed in a couple of minutes.
Bayes’ Theorem is useful when working with conditional probabilities (like we are doing here),
because it provides us with a way to reverse them:
P(A|B) = P(B|A) × P(A) / P(B)
In our case, we have P(Sports | a very close game), so using this theorem we can reverse the conditional probability:

P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game)

Since for our classifier we’re just trying to find out which tag has the bigger probability, we can discard the divisor, which is the same for both tags, and just compare

P(a very close game | Sports) × P(Sports)

with

P(a very close game | Not Sports) × P(Not Sports)

This is better, since we could actually calculate these probabilities! Just count how many times the sentence “A very close game” appears in the Sports tag, divide it by the total, and obtain P(a very close game | Sports).
There’s a problem though: “A very close game” doesn’t appear in our training data, so this probability is zero. Unless every sentence that we want to classify appears in our training data, the model won’t be very useful.
Being Naïve
So here comes the Naive part: we assume that every word in a sentence is independent of the
other ones. This means that we’re no longer looking at entire sentences, but rather at individual
words. So for our purposes, “this was a fun party” is the same as “this party was fun” and “party
fun was this”.
This assumption is very strong but super useful. It’s what makes this model work well with little data or data that may be mislabeled. The next step is just applying this to what we had before:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)

And now, all of these individual words actually show up several times in our training data, and we can calculate them!
Calculating probabilities
The final step is just to calculate every probability and see which one turns out to be larger.
First, we calculate the a priori probability of each tag: for a given sentence in our training data,
the probability that it is Sports P(Sports) is ⅗. Then, P(Not Sports) is ⅖. That’s easy enough.
Then, calculating P(game|Sports) means counting how many times the word “game” appears in
Sports texts (2) divided by the total number of words in sports (11).
Therefore, P(game | Sports) = 2/11.
However, we run into a problem here: “close” doesn’t appear in any Sports text! That means that P(close | Sports) = 0. This is rather inconvenient, since we are going to be multiplying it with the other probabilities, so we’ll end up with
P(a | Sports) × P(very | Sports) × 0 × P(game | Sports). This equals 0, since in a multiplication, if one of the terms is zero, the whole calculation is nullified. Doing things this way simply doesn’t give us any information at all, so we have to find a way around it.
How do we do it? By using something called Laplace smoothing: we add 1 to every count so it’s never zero. To balance this, we add the number of possible words to the divisor, so the division will never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it', 'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match']. Since the number of possible words is 14 (I counted them!), applying smoothing we get that
P(game | Sports) = (2 + 1) / (11 + 14) = 3/25

The full results, applying the same smoothing to every word (the Not Sports texts contain 9 words in total, so their divisor is 9 + 14 = 23), are:

P(a | Sports) = (2+1)/25 = 3/25          P(a | Not Sports) = (1+1)/23 = 2/23
P(very | Sports) = (1+1)/25 = 2/25       P(very | Not Sports) = (0+1)/23 = 1/23
P(close | Sports) = (0+1)/25 = 1/25      P(close | Not Sports) = (1+1)/23 = 2/23
P(game | Sports) = (2+1)/25 = 3/25       P(game | Not Sports) = (0+1)/23 = 1/23

Now we just multiply all the probabilities and see which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports) ≈ 2.76 × 10⁻⁵
P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) × P(Not Sports) ≈ 0.572 × 10⁻⁵

Since 2.76 × 10⁻⁵ is the larger value, our classifier predicts that “A very close game” belongs to the Sports tag.
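The sketch below reproduces this example end to end in Python, including the Laplace smoothing, using the five training sentences from the table above.

# Naive Bayes with Laplace smoothing for the sports example (illustrative sketch).
from collections import Counter

training = [
    ("a great game", "Sports"),
    ("the election was over", "Not Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not Sports"),
]

word_counts = {"Sports": Counter(), "Not Sports": Counter()}
doc_counts = Counter()
for text, tag in training:
    doc_counts[tag] += 1
    word_counts[tag].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}   # the 14 possible words

def score(sentence, tag):
    """P(tag) multiplied by the smoothed P(word | tag) of every word in the sentence."""
    result = doc_counts[tag] / sum(doc_counts.values())                # prior
    total_words = sum(word_counts[tag].values())
    for word in sentence.split():
        result *= (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
    return result

for tag in ("Sports", "Not Sports"):
    print(tag, score("a very close game", tag))   # about 2.76e-05 vs 0.57e-05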
3-SYSTEM ANALYSIS AND DESIGN
Software Requirements
Hardware Requirements
3.2 PYTHON
NLTK
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs
for symbolic and statistical natural language processing (NLP) for English written in the Python
programming language. It was developed by Steven Bird and Edward Loper in the Department
of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a
cookbook. NLTK is intended to support research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information retrieval,
and machine learning. NLTK has been used successfully as a teaching tool, as an individual
study tool, and as a platform for prototyping and building research systems. There are 32
universities in the US and 25 countries using NLTK in their courses. NLTK supports
classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.
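The snippet below gives a brief taste of this functionality, assuming the required NLTK data packages have been downloaded; the example sentence is made up.

# Illustrative NLTK usage (requires nltk.download('punkt') and
# nltk.download('averaged_perceptron_tagger') to have been run once).
import nltk

sentence = "NLTK supports tokenization, stemming, tagging and parsing."   # made-up example
tokens = nltk.word_tokenize(sentence)     # tokenization
tagged = nltk.pos_tag(tokens)             # part-of-speech tagging
print(tokens)
print(tagged)    # e.g. [('NLTK', 'NNP'), ('supports', 'VBZ'), ('tokenization', 'NN'), ...]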
Scikit-learn
Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python
programming language. It features various classification, regression and clustering algorithms
including support vector machines, random forests, gradient boosting, k-means and DBSCAN,
and is designed to interoperate with the Python numerical and scientific libraries NumPy and
SciPy.
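As a minimal sketch of how scikit-learn is typically used for text classification, the snippet below builds a small pipeline with a TF-IDF vectorizer and a linear support vector machine; the four reviews and their labels are invented.

# Minimal scikit-learn text-classification sketch (the tiny dataset is invented).
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

reviews = [
    "an absolutely wonderful movie",
    "brilliant acting and a touching story",
    "a dull, predictable and boring film",
    "one of the worst movies I have seen",
]
labels = ["positive", "positive", "negative", "negative"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(reviews, labels)
print(model.predict(["a brilliant and touching film"]))   # expected: ['positive']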
A data flow diagram (DFD) is a graphical representation of the “flow” of data through an information system, modeling its process aspects. A DFD is often used as a preliminary step to create an overview of the system, which can later be elaborated.
Process: A process takes data as input, executes some steps and produces data as output.
External Entity: Objects outside the system being modeled, which interact with processes in the system.
Data Store: Files or storage of data that store data input and output from process.
Data Flow: The flow of data from process to process.
Level 1 Data Flow Diagram for the process (Naïve Bayes)
4-IMPLEMENTATION
I used Python for the implementation of “sentiment analysis of movie reviews using supervised learning techniques”. There are several stages involved in implementing the problem using different supervised learning techniques; among them, training and testing are the two main phases.
Gathering data: Twitter was used as a source of text in this project; we gathered tweets for major companies by using keywords and dates to scrape them. The Twitter API (although not the most efficient way) was used to gather the text data. Next, the opening and closing values of the stocks were gathered from Yahoo Finance for each company for the day.
Text cleaning: Stop-word removal, punctuation removal and stemming were a few of the techniques used to clean the text; a sketch is shown below. This step is common to every text analysis problem.
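A possible cleaning pipeline along those lines, sketched with NLTK’s stop-word list and Porter stemmer; the tweet text is invented.

# Sketch of the text-cleaning step: punctuation removal, stop-word removal, stemming.
# Requires nltk.download('punkt') and nltk.download('stopwords'); the tweet is invented.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

def clean(text):
    text = text.lower().translate(str.maketrans("", "", string.punctuation))   # drop punctuation
    tokens = nltk.word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(clean("The new phone looks amazing, but the battery is disappointing!"))
# e.g. ['new', 'phone', 'look', 'amaz', 'batteri', 'disappoint']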
Sentiment generation: It is then important to find the sentiment of each statement; there are many smart and reliable algorithms available to tag statements with complex or simple sentiments. Some sentiment algorithms also give sentiments such as anxiety or sadness, while the most commonly used categories are positive, negative and neutral. Natural language processing is used to tag each word with a sentiment, and the overall score or sentiment that the statement receives depends on the underlying word sentiments.
Stock independent variable: The stock opening and closing values were then broken down into high, low or neutral per day, based on the stock’s behavior for the day at closing time.
Prediction: Positive sentiments did seem to have an upward effect on stock prices. This was found using the following prediction method: the data was divided into two sets, training and testing (70%–30%). Different algorithms were used to learn the behavior of the sentiments with respect to the stock variable, and the remaining 30% was then used as unseen data to test whether the model could predict it. The positive sentiments were more predictive than the negative ones.
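A hedged sketch of this 70/30 split and evaluation is shown below; the feature matrix and labels are random stand-ins for the daily sentiment features and the high/low/neutral stock variable, so the printed accuracy is only illustrative.

# 70/30 train/test split and evaluation (illustrative sketch; the data is invented).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # stand-in for daily sentiment features
y = rng.integers(0, 3, size=100)     # stand-in for the high/low/neutral stock variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)
model = GaussianNB().fit(X_train, y_train)
print("accuracy on the unseen 30%:", accuracy_score(y_test, model.predict(X_test)))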
4.3 Training and Testing
This method mainly concentrates on the attributes. The training phase involves Elimination of Special Characters, Conversion to Lower Case and Word Count stages. The dataset obtained from the word count is passed to another stage where all the neutral words are eliminated using lists of positive and negative words. Finally, the resulting dataset is given as input to the Naïve Bayes method, and along with it the sentiment polarities are calculated carefully and given as input. In the testing phase, the test data is passed through the Elimination of Special Characters and Conversion to Lower Case stages. The prior, conditional and posterior probabilities are then calculated from the input data.
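A rough sketch of the training-phase preprocessing described above (special-character elimination, lower-casing, word counting and removal of neutral words) is given below; the small positive and negative word lists are assumptions made for illustration.

# Sketch of the training-phase preprocessing stages described above.
# The positive/negative word lists are assumptions for illustration.
import re
from collections import Counter

POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"terrible", "boring", "bad"}

def preprocess(review):
    review = re.sub(r"[^a-z\s]", " ", review.lower())   # eliminate special characters, lower-case
    counts = Counter(review.split())                     # word-count stage
    # keep only the sentiment-bearing (non-neutral) words
    return {w: c for w, c in counts.items() if w in POSITIVE_WORDS | NEGATIVE_WORDS}

print(preprocess("I LOVE this film!!! The plot was great, not boring at all."))
# {'love': 1, 'great': 1, 'boring': 1}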
5-TESTING
Testing is the process of evaluating a system or its components with the intent of finding whether it satisfies the specified requirements or not. This activity reports the actual results, the expected results and the difference between them; i.e., testing is executing a system in order to identify any gaps, errors or missing requirements contrary to the actual requirements.
In order to make sure that the system does not have any errors, the following levels of testing strategies are applied at different phases of software development:
5.1.1 Unit Testing
The goal of unit testing is to isolate each part of the program and show that individual parts are
correct in terms of requirements and functionality.
5.1.2 Integration Testing
Integration testing is the testing of combined parts of an application to determine whether they function correctly together. This testing can be done using two different methods.
In top-down integration testing, the highest-level modules are tested first and then progressively lower-level modules are tested.
In bottom-up integration testing, testing is performed starting from the smallest and lowest-level modules and proceeding one at a time. When the bottom-level modules are tested, attention turns to those on the next level that use the lower-level ones; they are tested individually and then linked with the previously examined lower-level modules. In a comprehensive software development environment, bottom-up testing is usually done first, followed by top-down testing.
5.1.3 System Testing
System testing is the next level of testing and tests the system as a whole. Once all the components are integrated, the application as a whole is tested rigorously to see that it meets the quality standards.
5.1.4 Acceptance Testing
The main purpose of this testing is to find whether the application meets the intended specifications and satisfies the client’s requirements. We will follow two different methods in this testing.
5.1.4.1 Alpha Testing
This test is the first stage of testing and will be performed within the team. Unit testing, integration testing and system testing, when combined, are known as alpha testing. During this phase, the following will be tested in the application:
Spelling Mistakes.
Broken Links.
The Application will be tested on machines with the lowest specification to test loading
times and any latency problems.
5.1.4.2 Beta Testing
In beta testing, a sample of the intended audience tests the application and sends its feedback to the project team. Using this feedback, the project team can fix the problems before releasing the software to the actual users.
5.2.1 White Box Testing
White box testing is the detailed investigation of the internal logic and structure of the code. To perform white box testing on an application, the tester needs to possess knowledge of the internal working of the code. The tester needs to look inside the source code and find out which unit or chunk of the code is behaving inappropriately.
5.2.2 Black Box Testing
Black box testing is the technique of testing without having any knowledge of the interior workings of the application. The tester is oblivious to the system architecture and does not have access to the source code. Typically, when performing a black box test, a tester will interact with the system’s user interface by providing inputs and examining outputs, without knowing how and where the inputs are worked upon.
5.3 Validation
All the levels of testing (unit, integration, system) and methods (black box, white box) were applied to our application successfully, and the results obtained were as expected.
The testing was done among the team members and by the end users. The application satisfies the specified requirements, and we finally obtained the expected results.
6-Screen Shots
7-CONCLUSION
The project uses a statistical model, Naive Bayes, for solving the problem. Naive Bayes is mainly based on the independence assumption. Training is very easy and fast; in this approach each attribute in each class is considered separately. Testing is straightforward, calculating the conditional probabilities from the available data. One of the major tasks is to find the sentiment polarities, which is very important in this approach to obtain the desired output. In this Naïve Bayes approach we only considered the words that are available in our dataset and calculated their conditional probabilities. Successful results were obtained after applying this approach to our problem.
8-BIBLIOGRAPHY
The following websites were referred to during the analysis and execution phases of the project:
http://www.analyticsvedia.com/sentimentanalysis
http://www.kaggle.com/sentimentanalysis
http://www.imdb.com/moviereviews