
ACKNOWLEDGEMENT

I would like to express my sincere gratitude to Mr. Rohit Goyal, Department of Computer
Science and Engineering, whose guidance as project mentor was invaluable to this project.

Last but not least, we convey our gratitude to all the teachers for providing us with the technical
skills that will always remain our asset, and to all non-teaching staff for their gracious hospitality.

ABSTRACT
Sentiment analysis or opinion mining is the computational study of people's opinions,
sentiments, attitudes, and emotions expressed in written language. It is one of the most active
research areas in natural language processing and text mining in recent years. Its popularity is
mainly due to two reasons. First, it has a wide range of applications, because opinions are central
to almost all human activities and are key influencers of our behaviors: whenever we need to
make a decision, we want to hear others' opinions. Second, it presents many challenging
research problems, which had never been attempted before the year 2000. Part of the reason for
the earlier lack of study was that there was little opinionated text in digital form. It is thus no
surprise that the inception and rapid growth of the field coincide with those of social media on
the Web. In fact, the research has also spread outside of computer science to the management
and social sciences due to its importance to business and society as a whole. This report starts
with a discussion of mainstream sentiment analysis research and then describes some recent
work on modeling comments, discussions, and debates, which represents another kind of
analysis of sentiments and opinions. Sentiment classification is a way to analyze the subjective
information in text and then mine the opinion.

Sentiment analysis is the procedure by which information is extracted from the opinions,
appraisals, and emotions of people regarding entities, events, and their attributes. In decision
making, the opinions of others have a significant effect on how easily customers make choices
regarding online shopping, events, products, and entities. Approaches to text sentiment
analysis typically work at a particular level, such as the phrase, sentence, or document level. This
report analyzes a solution for sentiment classification at a fine-grained level, namely the
sentence level, in which the polarity of a sentence is given by three categories: positive,
negative, and neutral.

TABLE OF CONTENTS

1 INTRODUCTION
1.1 Objective
1.2 Proposed Approach and Methods to be Employed
2 ALGORITHM
2.1 Model
2.1.1 Naïve Bayes
3 SYSTEM ANALYSIS AND DESIGN
3.1 Software and Hardware Requirements
3.2 Python
3.3 Data Flow Diagrams
4 IMPLEMENTATION
4.1 Steps of Sentiment Analysis
4.2 Training and Testing
5 TESTING
5.1 Testing Strategies
5.1.1 Unit Testing
5.1.2 Integration Testing
5.1.2.1 Top Down Integration Testing
5.1.2.2 Bottom Up Integration Testing
5.1.3 System Testing
5.1.4 Acceptance Testing
5.1.4.1 Alpha Testing
5.1.4.2 Beta Testing
5.2 Testing Methods
5.2.1 White Box Testing
5.2.2 Black Box Testing
5.3 Validation
5.4 Test Results
6 SCREENSHOTS
7 CONCLUSION
8 BIBLIOGRAPHY

1 INTRODUCTION

Sentiment analysis refers to the use of natural language processing, text analysis, and
computational linguistics to identify and extract subjective information from source materials.
Generally speaking, sentiment analysis aims to determine the attitude of a speaker or writer
with respect to some topic, or the overall contextual polarity of a document. The attitude may be
the author's judgment or evaluation, affective state, or intended emotional communication.
Sentiment analysis is the process of examining a piece of writing for the positive, negative, or
neutral feelings bound to it. Humans have the innate ability to determine sentiment; however,
this process is time-consuming, inconsistent, and costly in a business context. It is simply not
realistic to have people individually read tens of thousands of customer reviews and score them
for sentiment.

Consider, for example, Semantria's cloud-based sentiment analysis software, which extracts the
sentiment of a document and its components through the following steps:

 A document is broken into its basic parts of speech, called POS tags, which identify the
structural elements of a document, paragraph, or sentence
(i.e., nouns, adjectives, verbs, and adverbs).
 Sentiment-bearing phrases, such as "terrible service", are identified through the use of
specifically designed algorithms.
 Each sentiment-bearing phrase in a document is given a score based on a logarithmic
scale that ranges between -10 and 10.
 Finally, the scores are combined to determine the overall sentiment of the document or
sentence. Document scores range between -2 and 2.

Semantria's cloud-based sentiment analysis software is based on natural language processing
and delivers more consistent results than two human raters would. Using automated sentiment
analysis, Semantria analyzes each document and its components with sophisticated algorithms
developed to extract sentiment from content in a manner similar to a human reader.

Existing approaches to sentiment analysis can be grouped into three main categories:

 Keyword spotting
 Lexical affinity
 Statistical methods

KEYWORD SPOTTING
Keyword spotting is the most naive approach and probably also the most popular because of its
accessibility and economy. Text is classified into affect categories based on the presence of fairly
unambiguous affect words like 'happy', 'sad', 'afraid', and 'bored'. The weaknesses of this
approach lie in two areas: poor recognition of affect when negation is involved, and reliance on
surface features. Regarding the first weakness, while the approach can correctly classify the
sentence "today was a happy day" as being happy, it is likely to fail on a sentence like "today
wasn't a happy day at all". Regarding the second weakness, the approach relies on the presence
of obvious affect words that are only surface features of the prose. In practice, many sentences
convey affect through underlying meaning rather than affect adjectives. For example, the text
"My husband just filed for divorce and he wants to take custody of my children away from me"
certainly evokes strong emotions, but uses no affect keywords, and therefore cannot be
classified using a keyword spotting approach.
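
To make these weaknesses concrete, here is a minimal keyword-spotting sketch in Python; the
word lists are hypothetical stand-ins for a real affect lexicon, and the second call shows the
negation failure described above:

# Minimal keyword-spotting sketch. POSITIVE and NEGATIVE are hypothetical
# affect-word lists, not a real lexicon.
POSITIVE = {"happy", "love", "great"}
NEGATIVE = {"sad", "afraid", "bored", "terrible"}

def keyword_spot(text):
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"

print(keyword_spot("today was a happy day"))            # positive
print(keyword_spot("today wasn't a happy day at all"))  # also "positive": negation is missed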

LEXICAL AFFINITY
Lexical affinity is slightly more sophisticated than keyword spotting: rather than simply
detecting obvious affect words, it assigns arbitrary words a probabilistic 'affinity' for a particular
emotion. For example, 'accident' might be assigned a 75% probability of indicating a
negative affect, as in 'car accident' or 'hurt by accident'. These probabilities are usually trained
from linguistic corpora. Though often outperforming pure keyword spotting, the approach has
two main problems. First, lexical affinity, operating solely at the word level, can easily
be tricked by sentences like "I avoided an accident" (negation) and "I met my girlfriend by
accident" (other word senses). Second, lexical affinity probabilities are often biased toward text
of a particular genre, dictated by the source of the linguistic corpora. This makes it difficult to
develop a reusable, domain-independent model.
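
A minimal sketch of word-level affinity scoring follows, with made-up probabilities standing in
for corpus-trained values; it shows how purely word-level scoring ignores negation and word
sense:

# Lexical-affinity sketch. The affinity values are hypothetical stand-ins for
# probabilities that would normally be trained from a linguistic corpus.
NEGATIVE_AFFINITY = {"accident": 0.75, "hurt": 0.80}

def negative_affinity(text):
    # Average the negative-affect affinity of the known words in the text.
    scores = [NEGATIVE_AFFINITY[w] for w in text.lower().split() if w in NEGATIVE_AFFINITY]
    return sum(scores) / len(scores) if scores else 0.0

print(negative_affinity("I avoided an accident"))            # 0.75, despite the negation
print(negative_affinity("I met my girlfriend by accident"))  # 0.75, despite the benign sense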

STATISTICAL METHODS
Statistical methods, such as Bayesian inference and support vector machines, have been popular
for affect classification of texts. By feeding a machine learning algorithm a large training corpus
of affectively annotated texts, the system can learn not only the affective valence
of affect keywords (as in the keyword spotting approach), but also the valence of other arbitrary
keywords (as in lexical affinity), punctuation, and word co-occurrence frequencies. However,
traditional statistical methods are generally semantically weak, meaning that, with the exception
of obvious affect keywords, other lexical or co-occurrence elements in a statistical model have
little predictive value individually. As a result, statistical text classifiers only work with
acceptable accuracy when given a sufficiently large text input. So, while these methods may be
able to affectively classify users' text at the page or paragraph level, they do not work well on
smaller text units such as sentences or clauses.

1.1 OBJECTIVE
Sentiment classification is a way to analyze the subjective information in text and then mine the
opinion. Sentiment analysis is the procedure by which information is extracted from the
opinions, appraisals, and emotions of people regarding entities, events, and their attributes. In
decision making, the opinions of others have a significant effect on how easily customers make
choices regarding online shopping, events, products, and entities.

1.2 Proposed Approach and Methods to be Employed

Sentiment analysis or opinion mining is a study that attempts to identify and analyze emotions
and subjective information from text. Since early 2001, advances in internet technology and
machine learning techniques for information retrieval have made sentiment analysis popular
among researchers. Besides, the emergence of social networking and blogs as a communication
medium has also contributed to the development of research in this area. Sentiment analysis or
mining refers to the application of natural language processing, computational linguistics, and
text analytics to identify and extract subjective information from source materials. Sentiment
mining extracts the attitude of a writer in a document, including the writer's judgement and
evaluation of the discussed issue. Sentiment analysis allows us to identify the emotional state of
the writer during writing, and the intended emotional effect that the author wishes to convey to
the reader. In recent years, sentiment analysis has become a hotspot in numerous research fields,
including natural language processing (NLP), data mining (DM), and information retrieval (IR).
This is due to the increasing amount of subjective text appearing on the internet. Machine
learning is commonly used to classify sentiment from text. This technique involves statistical
models such as Support Vector Machines (SVM), Bag of Words, and Naïve Bayes (NB). The
texts most commonly used in sentiment mining are taken from blogs, Twitter, and web reviews,
focusing on sentences that express sentiment directly. The main aim of this project is to develop
a sentiment mining model that can process the text in movie reviews.

2 ALGORITHM

2.1 Model

Feature engineering
The first thing we need to do when creating a machine learning model is to decide what to use as
features. Features are the pieces of information that we take from the text and give to the
algorithm so it can work its magic. For example, if we were doing classification on health, some
features could be a person's height, weight, gender, and so on. We would exclude things that
may be known but are not useful to the model, like a person's name or favorite color. In this
case, though, we do not even have numeric features; we just have text. We need to somehow
convert this text into numbers that we can do calculations on. So what do we do? Simple: we
use word frequencies. That is, we ignore word order and sentence construction, treating every
document as the set of words it contains. Our features will be the counts of each of these
words. Even though it may seem too simplistic an approach, it works surprisingly well.
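
As an illustrative sketch, this word-frequency representation takes only a few lines of plain
Python:

import string
from collections import Counter

def word_count_features(document):
    # Lowercase, strip punctuation, split on whitespace, and count occurrences;
    # word order and sentence structure are deliberately ignored.
    cleaned = document.lower().translate(str.maketrans("", "", string.punctuation))
    return Counter(cleaned.split())

print(word_count_features("A great game, a clean game"))
# Counter({'a': 2, 'game': 2, 'great': 1, 'clean': 1})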

2.1.1 Naïve Bayes

A naïve Bayes classifier is a simple probability-based algorithm. It uses Bayes' theorem but
assumes that the attributes are independent of each other. Although this is an unrealistic
assumption in the practical world, the naïve Bayes classifier often works well in complex
real-world situations.

The naïve Bayes classifier algorithm can be trained very efficiently in supervised learning. For
example, suppose an insurance company intends to promote a new policy. To reduce the
promotion costs, the company wants to target the most likely prospects. It can collect historical
data for its customers, including income range, number of current insurance policies, number of
vehicles owned, money invested, and information on whether a customer has recently switched
insurance companies. Using a naïve Bayes classifier, the company can predict how likely a
customer is to respond positively to a policy offering. With this information, the company can
reduce its promotion costs by restricting the promotion to the most likely customers. The naïve
Bayes algorithm offers fast model building and scoring, in both binary and multiclass situations,
for relatively low volumes of data. The algorithm makes predictions using Bayes' theorem,
which incorporates evidence or prior knowledge in its prediction. Bayes' theorem relates the
conditional and marginal probabilities of stochastic events H and X, and is mathematically
stated as

P(H/X) = P(X/H) P(H) / P(X)

Here P stands for the probability of the variables within parentheses.

P(H) is the prior or marginal probability of H. It is 'prior' in the sense that it does not yet
account for the information available in X.

P(H/X) is the conditional probability of H given X. It is also called the posterior probability
because it has already incorporated the outcome of event X. P(X/H) is the conditional
probability of X given H.

P(X) is the prior or marginal probability of X, which acts as the evidence.

The theorem can also be represented as

posterior = likelihood × prior / normalising constant

The ratio P(X/H)/P(X) is also called the standardised likelihood.

The naive Bayesian classifier works as follows:

Let T be a training set of samples, each with its class label. There are k classes
{C1, C2, …, Ck}. Each sample is represented by an n-dimensional vector
X = {x1, x2, …, xn}, depicting the n measured values of the n attributes {A1, A2, …, An}
respectively.

Given a sample X, the classifier will predict that X belongs to the class having the highest a
posteriori probability conditioned on X. That is, X is predicted to belong to the class Ci if and
only if

P(Ci/X) > P(Cj/X) for 1 ≤ j ≤ k, j ≠ i

Thus we find the class that maximizes P(Ci/X). The class Ci for which P(Ci/X) is maximized is
called the maximum a posteriori hypothesis. By Bayes' theorem,

P(Ci/X) = P(X/Ci) P(Ci) / P(X)

As P(X) is the same for all classes, only P(X/Ci) P(Ci) needs to be maximized. If the class prior
probabilities P(Ci) are not known, it is commonly assumed that the classes are equally
likely, i.e. P(C1) = P(C2) = … = P(Ck), and we would therefore maximize P(X/Ci).

Otherwise we maximize P(X/Ci) P(Ci).

Given data sets with many attributes, it would be computationally expensive to compute P(X/Ci)
directly. In order to reduce the computation in evaluating P(X/Ci) P(Ci), the naive assumption of
class conditional independence is made. This presumes that the values of the attributes are
conditionally independent of one another, given the class label of the sample.

Mathematically this means that

P(X/Ci) = P(x1/Ci) × P(x2/Ci) × … × P(xn/Ci)

The probabilities P(x1/Ci), P(x2/Ci), …, P(xn/Ci) can easily be estimated from the training set.
Recall that here xk refers to the value of attribute Ak for sample X. If Ak is categorical, then
P(xk/Ci) is the number of samples of class Ci in T having the value xk for attribute Ak, divided
by freq(Ci, T), the number of samples of class Ci in T.

In order to predict the class label of X, P(X/Ci) P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of X is Ci if and only if it is the class that maximizes
P(X/Ci) P(Ci).

A naïve Bayes example for text classification

The training set consists of 10 positive reviews and 10 negative reviews, with the following
word counts:

Word    Positive reviews    Negative reviews
I       5                   4
Love    20                  6
This    5                   5
Film    4                   3

Given the test sentence "I love this film", find the sentiment for the given test set.

The training set consists of the following information:

Positive reviews = 10

Negative reviews = 10

Total number of reviews = positive reviews + negative reviews = 20

Prior probability:

The prior probability for the positive reviews is P(positive) = 10/20 = 0.5.

The prior probability for the negative reviews is P(negative) = 10/20 = 0.5.

Conditional probability:

The conditional probability is the probability that a random variable will take on a particular
value given that the outcome of another random variable is known. Note that here each word
count is divided by the number of reviews in the class rather than the total number of words, so
the resulting ratios act as likelihood weights and can exceed 1 (as for 'Love' below).

The conditional probability for the word 'I' in positive reviews is P(I/positive) = 5/10 = 0.5.

The conditional probability for the word 'Love' in positive reviews is P(Love/positive) = 20/10 = 2.

The conditional probability for the word 'This' in positive reviews is P(This/positive) = 5/10 = 0.5.

The conditional probability for the word 'Film' in positive reviews is P(Film/positive) = 4/10 = 0.4.

The conditional probability for the word 'I' in negative reviews is P(I/negative) = 4/10 = 0.4.

The conditional probability for the word 'Love' in negative reviews is P(Love/negative) = 6/10 = 0.6.

The conditional probability for the word 'This' in negative reviews is P(This/negative) = 5/10 = 0.5.

The conditional probability for the word 'Film' in negative reviews is P(Film/negative) = 3/10 = 0.3.

Posterior probability:

The posterior score is the product of the prior probability and the conditional probabilities:

posterior probability = prior probability × conditional probabilities

The posterior score for the positive class is P(positive) = 0.5 × 0.5 × 2 × 0.5 × 0.4 = 0.1.

The posterior score for the negative class is P(negative) = 0.5 × 0.4 × 0.6 × 0.5 × 0.3 = 0.018.

The posterior score for the positive class is greater than the posterior score for the negative
class:

P(positive) > P(negative)

The given test sentence "I love this film" is therefore predicted by naïve Bayes to be a positive
sentiment.
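
The worked example above can be reproduced with a short Python sketch, in which the per-word
ratios from the review database act as likelihood weights multiplied together with the class
prior:

# Sketch reproducing the worked example above.
priors = {"positive": 10 / 20, "negative": 10 / 20}
likelihoods = {
    "positive": {"i": 5 / 10, "love": 20 / 10, "this": 5 / 10, "film": 4 / 10},
    "negative": {"i": 4 / 10, "love": 6 / 10, "this": 5 / 10, "film": 3 / 10},
}

def posterior_score(sentence, label):
    # Multiply the class prior by the per-word likelihood weights.
    score = priors[label]
    for word in sentence.lower().split():
        score *= likelihoods[label][word]
    return score

test = "I love this film"
print(posterior_score(test, "positive"))  # 0.1
print(posterior_score(test, "negative"))  # 0.018 -> predicted positive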

A simple example

Let’s see how this works in practice with a simple example. Suppose we are building a classifier
that says whether a text is about sports or not. Our training data has 5 sentences:

Text Tag
“A great game” Sports
“The election was over” Not sports
“Very clean match” Sports
“A clean but forgettable game” Sports
“It was a close election” Not sports

Now, which tag does the sentence "A very close game" belong to?

Since Naive Bayes is a probabilistic classifier, we want to calculate the probability that the
sentence "A very close game" is Sports and the probability that it is Not Sports, and then take
the larger one. Written mathematically, what we want is P(Sports | a very close game): the
probability that the tag of a sentence is Sports given that the sentence is "A very close game".

Bayes’ Theorem

Now we need to transform the probability we want to calculate into something that can be
calculated using word frequencies. For this, we will use some basic properties of probabilities,
and Bayes' theorem. If you feel like your knowledge of these topics is a bit rusty, read up on
them and you'll be up to speed in a couple of minutes.

Bayes' theorem is useful when working with conditional probabilities (as we are doing here),
because it provides us with a way to reverse them:

P(A | B) = P(B | A) × P(A) / P(B)

In our case, we have P(Sports | a very close game), so using this theorem we can reverse the
conditional probability:

P(Sports | a very close game) = P(a very close game | Sports) × P(Sports) / P(a very close game)

Since for our classifier we're just trying to find out which tag has the bigger probability, we can
discard the divisor (which is the same for both tags) and just compare

P(a very close game | Sports) × P(Sports)

with

P(a very close game | Not Sports) × P(Not Sports)

This is better, since we could actually calculate these probabilities! Just count how many times
the sentence "A very close game" appears in the Sports tag, divide it by the total, and we obtain
P(a very close game | Sports).

There's a problem though: "A very close game" doesn't appear in our training data, so this
probability is zero. Unless every sentence that we want to classify appears in our training data,
the model won't be very useful.

Being Naïve

So here comes the naive part: we assume that every word in a sentence is independent of the
other ones. This means that we're no longer looking at entire sentences, but rather at individual
words. So for our purposes, "this was a fun party" is the same as "this party was fun" and "party
fun was this".

We write this as:

P(a very close game) = P(a) × P(very) × P(close) × P(game)

This assumption is very strong but super useful. It's what makes this model work well with little
data or data that may be mislabeled. The next step is just applying this to what we had before:

P(a very close game | Sports) = P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports)

And now, all of these individual words actually show up several times in our training data, and
we can calculate them!

Calculating probabilities

The final step is just to calculate every probability and see which one turns out to be larger.

Calculating a probability is just counting in our training data.

First, we calculate the a priori probability of each tag: for a given sentence in our training data,
the probability that it is Sports, P(Sports), is ⅗. Then, P(Not Sports) is ⅖. That's easy enough.

Then, calculating P(game | Sports) means counting how many times the word "game" appears in
Sports texts (2) divided by the total number of words in Sports texts (11). Therefore,

P(game | Sports) = 2/11

However, we run into a problem here: "close" doesn't appear in any Sports text! That means that
P(close | Sports) = 0. This is rather inconvenient, since we are going to be multiplying it with the
other probabilities, so we'll end up with

P(a | Sports) × P(very | Sports) × 0 × P(game | Sports)

This equals 0, since in a multiplication, if one of the terms is zero, the whole calculation is
nullified. Doing things this way simply doesn't give us any information at all, so we have to find
a way around it.

How do we do it? By using something called Laplace smoothing: we add 1 to every count so it's
never zero. To balance this, we add the number of possible words to the divisor, so the division
will never be greater than 1. In our case, the possible words are ['a', 'great', 'very', 'over', 'it',
'but', 'game', 'election', 'clean', 'close', 'the', 'was', 'forgettable', 'match']. Since the number of
possible words is 14, applying smoothing we get

P(game | Sports) = (2 + 1) / (11 + 14)

The full results are:

Word     P(word | Sports)       P(word | Not Sports)
a        (2 + 1) / (11 + 14)    (1 + 1) / (9 + 14)
very     (1 + 1) / (11 + 14)    (0 + 1) / (9 + 14)
close    (0 + 1) / (11 + 14)    (1 + 1) / (9 + 14)
game     (2 + 1) / (11 + 14)    (0 + 1) / (9 + 14)

Now we just multiply all the probabilities and see which is bigger:

P(a | Sports) × P(very | Sports) × P(close | Sports) × P(game | Sports) × P(Sports) ≈ 2.76 × 10^-5

P(a | Not Sports) × P(very | Not Sports) × P(close | Not Sports) × P(game | Not Sports) × P(Not Sports) ≈ 5.72 × 10^-6

Since 2.76 × 10^-5 > 5.72 × 10^-6, the classifier gives "A very close game" the Sports tag.
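
As a check, the whole calculation, Laplace smoothing included, fits in a short Python sketch:

from collections import Counter

training = [
    ("a great game", "Sports"),
    ("the election was over", "Not Sports"),
    ("very clean match", "Sports"),
    ("a clean but forgettable game", "Sports"),
    ("it was a close election", "Not Sports"),
]

word_counts = {"Sports": Counter(), "Not Sports": Counter()}
doc_counts = Counter()
for text, tag in training:
    doc_counts[tag] += 1
    word_counts[tag].update(text.split())

vocabulary = {w for counts in word_counts.values() for w in counts}  # the 14 possible words

def smoothed_score(sentence, tag):
    total_words = sum(word_counts[tag].values())        # 11 for Sports, 9 for Not Sports
    score = doc_counts[tag] / sum(doc_counts.values())  # prior: 3/5 or 2/5
    for word in sentence.split():
        # Laplace smoothing: add 1 to each count, add |vocabulary| to the divisor.
        score *= (word_counts[tag][word] + 1) / (total_words + len(vocabulary))
    return score

print(smoothed_score("a very close game", "Sports"))      # ~2.76e-05
print(smoothed_score("a very close game", "Not Sports"))  # ~5.72e-06 -> Sports wins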

3 SYSTEM ANALYSIS AND DESIGN

3.1 Software and Hardware Requirements

Software Requirements

 Operating System: Windows XP, Windows Vista, Windows 7, Windows 8 and higher versions
 Language: Python

Hardware Requirements

 RAM: 1 GB or more
 Processor: any Intel processor
 Hard Disk: 6 GB or more
 Speed: 1 GHz or more

3.2 PYTHON

Python is an interpreted, high-level programming language for general-purpose programming.
Created by Guido van Rossum and first released in 1991, Python has a design philosophy that
emphasizes code readability, notably using significant whitespace. It provides constructs that
enable clear programming on both small and large scales. In July 2018, Van Rossum stepped
down as the leader of the language community after 30 years. Python features a dynamic type
system and automatic memory management. It supports multiple programming paradigms,
including object-oriented, imperative, functional, and procedural, and has a large and
comprehensive standard library. Python interpreters are available for many operating systems.
CPython, the reference implementation of Python, is open source software and has a community-
based development model, as do nearly all of Python's other implementations. Python and
CPython are managed by the non-profit Python Software Foundation.

NLTK

The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs
for symbolic and statistical natural language processing (NLP) for English, written in the Python
programming language. It was developed by Steven Bird and Edward Loper in the Department
of Computer and Information Science at the University of Pennsylvania. NLTK includes
graphical demonstrations and sample data. It is accompanied by a book that explains the
underlying concepts behind the language processing tasks supported by the toolkit, plus a
cookbook. NLTK is intended to support research and teaching in NLP or closely related areas,
including empirical linguistics, cognitive science, artificial intelligence, information retrieval,
and machine learning. NLTK has been used successfully as a teaching tool, as an individual
study tool, and as a platform for prototyping and building research systems. There are 32
universities in the US and 25 countries using NLTK in their courses. NLTK supports
classification, tokenization, stemming, tagging, parsing, and semantic reasoning functionalities.

Scikit-learn

Scikit-learn (formerly scikits.learn) is a free software machine learning library for the Python
programming language. It features various classification, regression, and clustering algorithms,
including support vector machines, random forests, gradient boosting, k-means, and DBSCAN,
and is designed to interoperate with the Python numerical and scientific libraries NumPy and
SciPy.
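
A minimal sketch of how these libraries combine for this project's kind of task is shown below;
the four labelled reviews are made-up placeholders for a real training set:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

reviews = ["I love this film", "great acting and story",
           "terrible service", "boring and forgettable"]
labels = ["positive", "positive", "negative", "negative"]

# CountVectorizer builds the word-count features; MultinomialNB is the
# naive Bayes classifier described in section 2.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(reviews, labels)

print(model.predict(["a great film"]))  # ['positive']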

3.3 Data Flow Diagrams

A data flow diagram (DFD) is a graphical representation of the "flow" of data through an
information system, modeling its process aspects. A DFD is often used as a preliminary step to
create an overview of the system, which can later be elaborated.

Process: A process takes data as input, executes some steps, and produces data as output.

External Entity: Objects outside the system being modeled, which interact with processes in the
system.

Data Store: Files or storage that hold data input to and output from processes.

Data Flow: The flow of data from process to process.

DATA FLOW DIAGRAMS

Level 0 Data Flow Diagram for the process (Naïve Bayes) [diagram not reproduced]

Level 1 Data Flow Diagram for the process (Naïve Bayes) [diagram not reproduced]
4 IMPLEMENTATION
I used Python for the implementation of "sentiment analysis of movie reviews using supervised
learning techniques". There are several stages involved in implementing the problem using
different supervised learning techniques. Among them, training and testing are the two main
phases.

4.1 Steps of Sentiment Analysis

Gathering data: Twitter was used as a source of text in this project; we gathered tweets for
major companies by using keywords and dates to scrape them. The Twitter API (although not
the most efficient way) was used to gather text data. Next, the opening and closing values of
stocks were gathered from Yahoo Finance for each day for each company.

Text cleaning: Stopword removal, punctuation removal, and stemming were a few of the
techniques used to clean the text. This step is common to every text analysis problem.
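
A sketch of these cleaning steps using NLTK follows (it assumes the 'punkt' and 'stopwords'
resources have already been downloaded via nltk.download):

import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOPWORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean(text):
    tokens = word_tokenize(text.lower())
    tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
    tokens = [t for t in tokens if t not in STOPWORDS]           # stopword removal
    return [stemmer.stem(t) for t in tokens]                     # stemming

print(clean("The service was terrible, and we waited forever!"))
# e.g. ['servic', 'terribl', 'wait', 'forev']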

Sentiment generation: It is then important to find the sentiment of each statement; there are many
smart and reliable algorithms available to tag statements with complex or simple sentiments.
Some sentiment algorithms also produce sentiments like anxiety or sadness, while the most
commonly used are positive, negative, and neutral. Natural language processing is used to
tag each word with a sentiment, and the overall score or sentiment that the statement gets
depends on the underlying word sentiments.

Stock independent variable: The stock opening and closing values were then broken down into
high, low, or neutral per day, based on the stock's behavior at closing time.

Prediction: Positive sentiments did seem to have an upward effect on stock prices. This was
found by a prediction method: the data was divided into two sets, training and testing (70%-30%).
Different algorithms were used to learn the behavior of sentiments with respect to the stock
variable, and the remaining 30% was used as unseen data to test whether the model could
predict it. The positive sentiments were more predictive than the negative ones.

4.2 Training and Testing

This method mainly concentrates on the attributes. The training phase involves elimination of
special characters, conversion to lower case, and word-count stages. The dataset obtained from
the word count is passed to another stage where all the neutral words are eliminated using lists
of positive and negative words. Finally, the obtained dataset, along with the carefully calculated
sentiment polarities, is given as input to the naïve Bayes method. In the testing phase, the test
data is passed through the same elimination of special characters and conversion to lower case
stages, and the prior, conditional, and posterior probabilities are calculated from the input data.
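
A sketch of the neutral-word elimination stage described above follows; POSITIVE_WORDS
and NEGATIVE_WORDS are hypothetical lexicons standing in for the positive and negative
word lists used in the project:

POSITIVE_WORDS = {"love", "great", "good"}
NEGATIVE_WORDS = {"terrible", "boring", "bad"}

def eliminate_neutral(word_counts):
    # Keep only the words that appear in the positive or negative lists;
    # everything else is treated as neutral and dropped before naive Bayes.
    sentiment_words = POSITIVE_WORDS | NEGATIVE_WORDS
    return {w: c for w, c in word_counts.items() if w in sentiment_words}

print(eliminate_neutral({"i": 1, "love": 1, "this": 1, "film": 1}))
# {'love': 1}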

5 TESTING
Testing is the process of evaluating a system or its components with the intent of determining
whether it satisfies the specified requirements or not. This activity reports the actual results, the
expected results, and the difference between them; i.e., testing is executing a system in order to
identify any gaps, errors, or missing requirements contrary to the actual requirements.

5.1 Testing Strategies

In order to make sure that the system does not have any errors, the following testing strategies
are applied at different phases of software development.

5.1.1 Unit Testing

The goal of unit testing is to isolate each part of the program and show that individual parts are
correct in terms of requirements and functionality.

5.1.2 Integration Testing

The testing of combined parts of an application to determine whether they function correctly
together is integration testing. This testing can be done using two different methods.

5.1.2.1 Top Down Integration Testing

In Top-Down integration testing, the highest-level modules are tested first and then progressively
lower-level modules are tested.

5.1.2.2 Bottom Up Integration Testing

Testing is performed starting from the smallest and lowest-level modules, proceeding one at a
time. When the bottom-level modules have been tested, attention turns to the modules on the
next level up that use them: these are tested individually and then linked with the previously
examined lower-level modules. In a comprehensive software development environment,
bottom-up testing is usually done first, followed by top-down testing.

5.1.3 System Testing

This is the next level of testing; it tests the system as a whole. Once all the components are
integrated, the application as a whole is tested rigorously to see that it meets the quality standards.

5.1.4 Acceptance Testing

The main purpose of this testing is to find whether the application meets the intended
specifications and satisfies the client's requirements. We follow two different methods in this
testing.

5.1.4.1 Alpha Testing

This test is the first stage of testing and is performed within the team. Unit testing,
integration testing, and system testing combined are known as alpha testing. During this
phase, the following will be tested in the application:

 Spelling mistakes.
 Broken links.
 The application will be tested on machines with the lowest specifications to test loading
times and any latency problems.

5.1.4.2 Beta Testing

In beta testing, a sample of the intended audience tests the application and sends their feedback to
the project team. With this feedback, the project team can fix the problems before releasing the
software to the actual users.

5.2 Testing Methods

5.2.1 White Box Testing

White box testing is the detailed investigation of the internal logic and structure of the code. To
perform white box testing on an application, the tester needs to possess knowledge of the internal
working of the code. The tester looks inside the source code to find out which unit or chunk of
the code is behaving inappropriately.

5.2.2 Black Box Testing

The technique of testing without having any knowledge of the interior workings of the
application is black box testing. The tester is oblivious to the system architecture and does not
have access to the source code. Typically, when performing a black box test, a tester interacts
with the system's user interface by providing inputs and examining outputs, without knowing
how and where the inputs are processed.

5.3 Validation

All the levels of testing (unit, integration, system) and methods (black box, white box) were
applied to our application successfully, and the results obtained were as expected.

5.4 Test Results

The testing was done among the team members and by the end users. The application satisfies
the specified requirements, and we obtained the results as expected.

6 SCREENSHOTS

[Screenshots of the application are not reproduced in this text version.]
7 CONCLUSION
The project uses a statistical model, namely Naive Bayes, for solving the problem. Naive Bayes
is mainly based on the independence assumption. Training is very easy and fast: in this approach
each attribute in each class is considered separately. Testing is straightforward, calculating the
conditional probabilities from the data available. One of the major tasks is to find the sentiment
polarities, which is very important in this approach to obtain the desired output. In this naïve
Bayes approach we only considered the words that are available in our dataset and calculated
their conditional probabilities. Successful results were obtained after applying this approach to
our problem.

8 BIBLIOGRAPHY
The following websites were referred to during the analysis and execution phases of the project:

http://www.analyticsvedia.com/sentimentanalysis

http://www.kaggle.com/sentimentanalysis

http://www.imdb.com/moviereviews
