IJETR042461
100 www.erpublication.org
International Journal of Engineering and Technical Research (IJETR)
ISSN: 2321-0869 (O) 2454-4698 (P), Volume-6, Issue-1, September 2016
II. LITERATURE REVIEW
Although Twitter has been very popular as a web service, there has not been considerable published research on it. Huberman and others [3] studied the social interactions on Twitter and revealed that the driving process for usage is a sparse hidden network underlying the friends and followers, while most of the links represent meaningless interactions. Java et al. [4] investigated community structure and isolated different types of user intentions on Twitter. Jansen and others [5] examined Twitter as a mechanism for word-of-mouth advertising, and considered particular brands and products while examining the structure of the postings and the change in sentiments. However, the authors do not perform any analysis on the predictive aspect of Twitter.
There has been some prior work on analyzing the correlation between blog and review mentions and performance. Gruhl and others [6] showed how to generate automated queries for mining blogs in order to predict spikes in book sales. And while there has been research on predicting movie sales, almost all of it has used metadata on the movies themselves to perform the forecasting, such as the movie's genre, MPAA rating, running time, release date, the number of screens on which the movie debuted, and the presence of particular actors or actresses in the cast. Joshi and others [7] use linear regression from text and metadata features to predict earnings for movies. Mishne and Glance [8] correlate sentiments in blog posts with movie box-office scores. The correlations they observed for positive sentiments are fairly low and not sufficient to use for predictive purposes. Sharda and Delen [9] treated the prediction problem as a classification problem and used neural networks to classify movies into categories ranging from flop to blockbuster. Apart from the fact that they predict ranges rather than actual numbers, the best accuracy that their model can achieve is fairly low. Zhang and Skiena [10] used a news aggregation model along with IMDB data to predict movie box-office numbers. We have shown how our model can generate better results when compared to their method.
Akcora et al. [11], in their experiment, try to identify the emotional pattern and the word pattern associated with changes in public opinion, using Twitter data. To identify the breakpoint, the researchers use the Jaccard similarity of the words in two successive intervals. Sun et al. [12] study fan pages on Facebook to understand diffusion trees. Kwak et al. [13] compare the number of followers, PageRank and the number of retweets as three different measures of influence. Their finding is that the ranking of the most influential users differed depending on the measure.
Movies and songs are released across different parts of the world, and Twitter users are also from different parts of the world. The study aims to analyze whether an attitude in the tweets is a positive or negative sentiment or a cognitive statement, to understand the flow of interpersonal messages across different countries, and to understand people's behavior.
Many previous studies have shaped the goal of our work. Ishii et al. developed a mathematical framework for the spread of popularity of a movie in society. Their model considers the activity level of bloggers, estimated through the number of weblog posts on particular movies in the Japanese blogosphere, as a representative parameter for social popularity. Similarly, other researchers have developed models taking the activity level of editors on Wikipedia as a popularity parameter [14].

III. PROPOSED METHODOLOGY
In the case of social media, the scale and high variance of the information that propagates through large user communities presents an interesting opportunity for harnessing that data into a form that allows for specific predictions about particular results, without having to institute market mechanisms. Specifically, we consider the task of predicting box-office sentiment for movies and songs using the information posted on Twitter, one of the fastest growing social media on the Internet. Twitter, a micro-blogging social media site, has experienced a burst of popularity in recent years, leading to a huge user base consisting of several tens of millions of users who actively participate in the creation and propagation of information. Our methodology is divided into several blocks.

Figure 1 Proposed system block diagram

A. Data Set
The dataset that we used was obtained by crawling hourly feed data from Twitter.com. To ensure that we obtained all tweets referring to a movie or song, we used keywords present in the movie and song titles as search arguments. We extracted tweets at frequent intervals using the Twitter Search API, thereby ensuring we had the timestamp, author and tweet text for our analysis. With the intention of discovering the social media signals that potentially possess a stronger correlation with the profitability of a film, we identify signals which reflect audience approval from different social media domains. Note that the type of data in each domain may be different, e.g., Twitter is a social stream whereas YouTube is a social video publishing website. Most of the social media buzz around a movie that has been captured for this study is from before its release or during the first 1-2 weeks.

B. Classification of Data
Data can be classified into different categories depending on the domain, i.e.
Political, Business, Sports etc.
Recommendations, Complaints
Positive, Negative
Lexicon-based classification
  Requires a dictionary of words and their polarity scores.
Supervised learning classification
  Requires training data to create a classification model.
Various visualization techniques
  WEKA can be applied to classification of data.
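As a rough illustration of the lexicon-based classification named in Section B, the sketch below sums per-word polarity scores and maps the total to a label. It is a minimal sketch only, not the system used in this study: the `POLARITY` dictionary, its scores, the helper names and the extra "neutral" label are all assumptions made for the example.

```python
import re

# Hypothetical toy lexicon: the words and polarity scores are invented
# for illustration and are not taken from the paper.
POLARITY = {
    "good": 1.0,
    "great": 2.0,
    "love": 2.0,
    "bad": -1.0,
    "boring": -1.5,
    "awful": -2.0,
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_score(text, lexicon=POLARITY):
    """Sum the polarity scores of all tokens found in the lexicon."""
    return sum(lexicon.get(token, 0.0) for token in tokenize(text))


def classify(text, lexicon=POLARITY):
    """Map the summed score to a label; 'neutral' is an added assumption."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for tweet in [
    "I love this movie, great songs!",
    "Boring plot and awful music",
    "Watching the movie tonight",
]:
    print(tweet, "->", classify(tweet))
```

A supervised classifier replaces the hand-built dictionary with a model learned from labeled tweets, which is why it needs training data where the lexicon approach needs only the word list.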
Sentiment Analysis on Unstructured Social Media Data Compare with Different Classification Algorithms
Parameters   SVM     Logistic Regression   J48
TP Rate      0.501   0.752                 0.372
FP Rate      0.11    0.350                 0.262
Precision    0.82    0.752                 0.362
Recall       0.501   0.752                 0.372
F-Measure    0.622   0.752                 0.37
ROC Area     0.696   0.807                 0.552
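The F-Measure row can be cross-checked against the Precision and Recall rows, assuming the standard definition of the F-measure as the harmonic mean of precision and recall. The sketch below recomputes it from the tabulated values. The function and variable names are illustrative. If the table reports class-weighted averages (an assumption, since the source does not say), the recomputed value is only an approximate check.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "SVM": (0.82, 0.501),
    "Logistic Regression": (0.752, 0.752),
    "J48": (0.362, 0.372),
}

recomputed = {
    name: round(f_measure(p, r), 3) for name, (p, r) in reported.items()
}

for name, value in recomputed.items():
    print(f"{name}: recomputed F-measure = {value}")
```

Under these assumptions the recomputed values agree with the reported F-Measure row to within rounding.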
Figure 5 Pie-Chart after applying J48 algorithm
REFERENCES
[1] Bruns, A., The Active Audience: Transforming Journalism from
Gatekeeping to Gatewatching. In Making Online News: The
Ethnography of New Media Production. Eds. Chris Paterson and
David Domingo. New York: Peter Lang, 2008.
[2] Boyd, Danah and Ellison, N., Social Network Sites: Definition,
History and Scholarship. Journal of Computer-Mediated
Communication, 13(1), 210-230, 2007.
[3] Bernardo A. Huberman, Daniel M. Romero, and Fang Wu. Social
networks that matter: Twitter under the microscope. First Monday,
14(1), Jan 2009.
[4] Akshay Java, Xiaodan Song, Tim Finin and Belle Tseng. Why we
twitter: understanding microblogging usage and communities.
Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 workshop
on Web mining and social network analysis, pages 56-65, 2007.
[5] B. Jansen, M. Zhang, K. Sobel, and A. Chowdury. Twitter
power: Tweets as electronic word of mouth. Journal of the American
Society for Information Science and Technology, 2009.
[6] Karniouchina, Ekaterina, Impact of star and movie buzz on motion
picture distribution and box office revenue, Intern. J. of Research
in Marketing, vol. 28, pp. 62-74, 2011.
[7] Spann, Martin & Bernd Skiera, Internet-Based Virtual Stock Markets
for Business Forecasting, Management Science, vol. 49, no. 10, pp.
1310-1326, 2003.
[8] http://www.stateofdigital.com/how-to-recognize-twitter-bots-6-signals-to-look-out-for