

Sentiment Analysis of Tamil Movie Reviews via
Feature Frequency Count

Arunselvan S J, Anand Kumar M, Soman K P


Centre for Excellence in Computational Engineering and Networking
Amrita Vishwa Vidyapeetham
Coimbatore
INDIA
[email protected], m [email protected], kp [email protected]

Abstract—The digital community paves the way for a huge volume of opinion-rich reviews from forums, blogs, discussions and so on. In contrast with the common text classification approach, word counts in the document are used as features. Meta-level features are taken from hand-labeled Tamil movie reviews. Once the features are extracted, they are used as input to supervised machine learning algorithms for further classification. Generally, the frequency of occurrence of a keyword is the more suitable feature for overall sentiment analysis, and this is not necessarily indicated by the mere repeated use of keywords. Experimental results show that the method proposed in this paper achieves considerable accuracy in detecting sentiment information in Tamil; an accuracy of about 65 percent is obtained.

Keywords—Sentiment analysis; Feature Extraction; SVM; RBF.

I. INTRODUCTION

Opinion plays a vital role in finalizing decisions in everyday life. Millions of cyberspace users express their feelings through e-commerce, e-tourism, Quora, forums, social networking and lots more. By default, we value our friends' and relatives' opinions. The thoughts of people are crisp and short-lived, but sometimes complicated and lengthy. Sentiment analysis refers to identifying the orientation of an opinion, belief or feeling from data such as text documents [1]. The movie industry is huge and profitable, and people usually check reviews before watching a movie on which they spend their hard-earned cash. A recent survey in the entertainment industry shows that online reviews and movie ratings have a big impact on box office collections. It is tedious for everyone to analyze each and every review on the net; therefore, sentiment analysis comes into play. We use machine learning and NLP classification algorithms to reveal polarity. Sentiment analysis of movie reviews for Tamil is an important step, and developing a system which detects the preferences of a user still remains a challenging task for Tamil. We have built a sentiment classifier to determine positive and negative reviews using hand-collected Tamil movie data sets, since there is no standardized or predefined dataset for the Tamil language. We compare four supervised machine learning approaches: SVM, Naive Bayes (both Multinomial and Bernoulli), Random Kitchen Sink and Logistic Regression.

This paper attempts to compare the classifiers and find out how well each classifier performs on the given Tamil data set. Here, a preliminary opinion is split into positive and negative (and, seldom, neutral). It is significant to classify opinions in order to distill the emotions and discover a common man's underlying thoughts concerning every service, including movie reviews. Experimental results show an accuracy of 64 percent for the SVM classifier with unigram features. The major contributions of this paper are:
(a) Tamil movie reviews were collected, structured, analyzed and hand-tagged into positive and negative.
(b) Four major NLP classifying algorithms were applied.
(c) SVM-based classification was done by varying n-grams and different kernels.

II. LITERATURE SURVEY

Heeryon Cho et al. performed sentiment analysis of Korean movie reviews with a Korean sentiment dictionary containing 135,082 words. It was done by matching the review words against the dictionary, computing the arithmetic mean and fixing a threshold to classify reviews into positive and negative, with accuracies of 0.7985 and 0.815 respectively [2]. Vasu Jain predicted the success of a movie through the analysis of tweets, using the LingPipe sentiment classifier implementing an 8-gram language model [3]. Agarwal et al. performed polarity determination for text classification using heuristic knowledge from a WordNet synonymy graph, thereby increasing the effectiveness of the sentiment classifier [4]. Bo Pang further extended work on sentiment classification by extracting the subjective part of a text document and finding minimum cuts in a graph, resulting in the implementation of cross-sentence contextual constraints [5]. Dongjoo et al. suggested a conjunction method which included adjective determination for indirect messages in a document; a word-set exploration technique included clustering of antonyms and synonyms individually, and the orientation of each adjective was predicted correspondingly. Gloss classification uses tf-idf and SVM to find the maximum possible score for all three possible polarity edges of the triangle [6]. Kanayama et al. broke the sentiment-bearing text down into smaller sentiment units, mostly consisting of subjects and adjectives, which were given as input to transfer-based machine translation and sentiment analysis; parsing was done and the polarity pattern of each sentence was found [7].
Christopher et al. developed a search engine to spot products exactly, built around five major factors: extraction of features, their efficiency, product score, accuracy and the elapsed time for the customers. Red Opal combines all these dimensions while retrieving searched products for the consumer [8]. Bing et al. proposed a system that gives both consumers and manufacturers a clearer view when comparing products, where one can visually measure the weak and strong spots of the product one is looking at [9]. Prem et al. suggested a model in which lexical information is combined with word-class association; with the help of trained examples, domain-specific information was identified [10]. Ahmed et al. carried out opinion classification on online web forums by extracting specific features from the linguistic content for the Arabic language. In order to improve accuracy, an entropy-weighted genetic algorithm was created to add information content to the extracted features. They signify the importance of features in document-level classification for sentiment analysis [9] [11]. Amitava et al. proposed a computational approach for generating lexicon words for any language from an existing English-to-target-language dictionary. The English SentiWordNet (Andrea Esuli et al.) contains precisely labeled positive and negative lexicon entries with corresponding scores. A subjectivity lexicon was accumulated from hand-labeled data together with words learned from corpora, and then part-of-speech tagging was performed [12].

III. MACHINE LEARNING ALGORITHMS

A. Multinomial Naive Bayes

Multinomial Naive Bayes assumes that the probability of each word occurring in a document is totally independent of the word context as well as its position in the particular document [13]. Under the bag-of-words assumption, if a feature never occurs in a given class, its probability estimate will be zero; to normalize Naive Bayes we include a small correction known as Laplace smoothing [14]. Multinomial Naive Bayes is represented mathematically by

P(c|d) = P(c) \prod_{1 \le k \le n_d} P(t_k \mid c)    (1)

B. Bernoulli Naive Bayes

The Bernoulli Naive Bayes algorithm takes an estimate for each class c and a boolean feature w which denotes each word in the document: if the word appears in the document, w is 1, else zero [15]. A document is assigned to the class c which maximizes

p(c) \prod_{w} p(w \mid c)    (2)

The product ranges over the whole vocabulary, so the absence of a word from a document is also modeled; unlike the multinomial model, multiple occurrences of a word within a document are not counted [16].

C. Bayes Rule

Both the Multinomial and Bernoulli Naive Bayes algorithms are derived from Bayes' rule [14]:

p(c|d) = \frac{p(c) \, p(d|c)}{p(d)}    (3)

where p(c) and p(d) represent the individual probabilities of c and d respectively, and p(c|d) is the probability of c given d.

D. Logistic Regression

Logistic Regression is a binary classifier or regression model which predicts the logarithmic likelihood (log-odds) of a class. It is a discriminative model based on the logistic function which operates mainly on real-valued vector inputs; the input features enter as a linear function [17]. The main goal of this classifier is, given an observation x, to estimate the probability of y. A mathematical function is used to combine the weight vector with the observation to create a final decision [18]. The logistic equation is given by

\log\left( \frac{p(x)}{1 - p(x)} \right) = \beta_0 + x \cdot \beta    (4)

E. Random Kitchen Sink

RKS is used to map n-tuple data points to a higher-dimensional or infinite-dimensional space. It is based on the inverse Fourier transform of a shift-invariant kernel, here the RBF (Gaussian) kernel, and there is no explicit mapping [19]. The RBF kernel for a data pair x and y is

k(x, y) = e^{-\sigma \|x - y\|^2}    (5)

F. Support Vector Machine

A Support Vector Machine finds a hyperplane with the largest possible margin between the classes [20]. We use LibSVM, where SVM training projects the training feature points into a higher-dimensional space and then places a hyperplane which has the largest distance from the training data points. SVM prediction projects test data points into the space where the trained data points are placed [21]. Depending on the side of the hyperplane on which a test point falls, the Support Vector Machine classifies it as positive or negative. The generalized SVM formulation is given by

\min \frac{1}{2} \|w\|^2    (6)

subject to

y_i (w^T x_i - b) - 1 \ge 0 \quad \forall i    (7)

For given training data (x_i, y_i) from two classes such that y_i = \pm 1, the SVM finds the hyperplane w^T \phi(x) + b = 0, where \phi is the (possibly implicit) feature mapping.

The SVM Radial Basis Function kernel is expressed as

k(x, y) = e^{-\gamma \|x - y\|^2}    (8)

The polynomial kernel function is defined as

k(x, y) = (x^T y + c)^d    (9)

where d is the degree.

The linear kernel function is given as

k(x, y) = x \cdot y    (10)

The performance of the SVM classifier strongly depends on choosing the C and γ parameters in the training phase [22].
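As an illustration only, the classifiers described above could be instantiated roughly as in the following sketch, here using scikit-learn and an RBFSampler (random Fourier features) as a stand-in for the Random Kitchen Sink mapping; the parameter values are placeholders, not the settings used in this paper.

# Illustrative sketch: the classifiers of Section III as scikit-learn estimators.
# Parameter values are placeholders, not the paper's tuned settings.
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.kernel_approximation import RBFSampler
from sklearn.pipeline import make_pipeline

classifiers = {
    # Eq. (1): multinomial model over word counts, with Laplace smoothing (alpha=1).
    "Multinomial NB": MultinomialNB(alpha=1.0),
    # Eq. (2): Bernoulli model over binary presence/absence features.
    "Bernoulli NB": BernoulliNB(alpha=1.0, binarize=0.0),
    # Eq. (4): log-odds linear in the input features.
    "Logistic Regression": LogisticRegression(max_iter=1000),
    # Eq. (5): Random Kitchen Sink approximated by random Fourier features,
    # followed by a simple linear classifier in the randomized feature space.
    "Random Kitchen Sink": make_pipeline(
        RBFSampler(gamma=1e-6, n_components=500, random_state=0),
        LogisticRegression(max_iter=1000),
    ),
    # Eqs. (6)-(8): soft-margin SVM with an RBF kernel; C and gamma need tuning.
    "SVM (RBF)": SVC(kernel="rbf", C=1.0, gamma=1e-6),
}

In this sketch, the polynomial and linear kernels of Eqs. (9) and (10) correspond to SVC(kernel="poly", degree=3) and SVC(kernel="linear") respectively.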
IV. PROPOSED METHOD

All positive and negative documents are represented line by line in a single text file, and a random integer value is assigned to this file. The text file is read wherein each line represents a new review document; the file is scanned line by line for every review document, using \n as the delimiter. The dimension of the file read is 2300 × 1 (feature vector). Each of these documents is stored in a variable of dimension 1 × 1, from which the first 1000 are considered for training and the next 160 for testing, and vice versa: 1000 positive and 1000 negative documents are used for training, and 160 negative and 160 positive documents are used for testing.

Then a document-term matrix is created, where columns contain unique words and rows contain documents. The coordinate of a unique word and a document holds the count of that particular word in that document; this represents the intensity of that feature in the given document. We select feature words that appear many times in a document: we fix n (the number of counts) as a threshold and then calculate accuracy using only the features whose counts are beyond the fixed threshold. Once this feature matrix is created, the two distinct classes are classified with various NLP classifying algorithms. This simple model can be further extended to n-grams.

Ten-fold cross-validation is used to check the efficiency of the classifiers. Out of 2320 documents, 2000 are chosen for training and 320 are taken for testing, wherein continuous blocks of 10 documents are taken randomly. The accuracy of the classifier is then determined. The model can be scaled to n folds.
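A minimal sketch of this pipeline is given below, under the assumption of the single-file, one-review-per-line format described above; the file names, the way the frequency threshold is applied (corpus-wide counts) and the classifier used here are illustrative choices, not the authors' exact implementation.

# Sketch of the proposed pipeline: build a document-term count matrix,
# keep only the features whose count reaches the threshold n,
# and score a classifier with 10-fold cross-validation.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

def load_reviews(path):
    # One review per line, UTF-8 encoded.
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

docs = load_reviews("tamil_reviews.txt")             # hypothetical file name
labels = np.loadtxt("tamil_labels.txt", dtype=int)   # hypothetical: 1 = positive, 0 = negative

n = 10  # minimum frequency threshold; varied from 10 to 100 in the experiments
vectorizer = CountVectorizer(ngram_range=(1, 1))     # (1, 2) would add bigram features
X = vectorizer.fit_transform(docs)

# Keep only the columns (words) whose count reaches the threshold n.
counts = np.asarray(X.sum(axis=0)).ravel()
X = X[:, np.where(counts >= n)[0]]

scores = cross_val_score(MultinomialNB(), X, labels, cv=10, scoring="accuracy")
print(f"features kept: {X.shape[1]}, 10-fold accuracy: {scores.mean():.4f}")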
V. DATASET CREATION

Movie review sites are growing for all languages. Tamil movie reviews were collected from over 25 sites, resulting in unstructured data. The reviews vary from professionals to celebrities and normal people. The raw data contains spaces, special characters, symbols and fonts of other languages.

About 2320 Tamil-language movie reviews were collected, and an equal split into positive and negative reviews was made: 1160 positive and 1160 negative reviews were manually labeled. This is the first and biggest dataset existing for the Tamil language. The movie reviews were manually collected from many leading Tamil movie review websites such as karunthel.com, Filmibeat.com, 123tamilcinema.com, Thehindu.com, and many more. There is a total of 447332 words in the positive class and 444135 words in the negative class; the average word count is 385.

None of the reviews were crawled; they were all manually collected and labelled into the two main polarity classes. All special characters, HTML tags and fonts other than Tamil were filtered out. The reviews were all placed in a single text file, with every single review document put on a new line before pre-processing. The dataset is stored in 8-bit universal character encoding (UTF-8).
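The exact filtering rules are not spelled out in the paper, but the cleaning step described above could be sketched along the following lines; the regular expressions and file names are illustrative assumptions.

# Illustrative cleaning step: strip HTML tags, keep Tamil characters plus digits
# and basic punctuation, and write one cleaned review per line in UTF-8.
import re

TAG_RE = re.compile(r"<[^>]+>")                         # naive HTML tag remover
# The Tamil block is U+0B80-U+0BFF; other symbols and foreign-language fonts are dropped.
NON_TAMIL_RE = re.compile(r"[^\u0B80-\u0BFF0-9\s.,!?]")

def clean_review(raw: str) -> str:
    text = TAG_RE.sub(" ", raw)
    text = NON_TAMIL_RE.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()            # collapse repeated whitespace

with open("raw_reviews.txt", encoding="utf-8") as src, \
     open("tamil_reviews.txt", "w", encoding="utf-8") as dst:
    for line in src:
        cleaned = clean_review(line)
        if cleaned:
            dst.write(cleaned + "\n")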
VI. EXPERIMENTAL RESULTS

Different experiments have been performed on the filtered Tamil movie review data set. Four popular machine learning approaches, namely Multinomial and Bernoulli Naive Bayes, Logistic Regression, Random Kitchen Sink and Support Vector Machines, are applied to these data sets to find out their performance individually for the Tamil language.

We denote by n the feature threshold, that is, the minimum number of times a unique word must be repeated. By tuning the parameter n we check the accuracy of each classifying algorithm with k-fold cross-validation (k = 10).

By choosing n from 10 to 100 with an interval of 10, we find the accuracy of every classifier. Further, the elapsed time consumed by each classifier is also compared. The accuracy and elapsed-time analyses are performed on unigram and bigram features for MNB, BNB, LR and RKS.

On the other hand, we observed that SVM performed well on par with the other algorithms. For SVM, trigram features are also used, and accuracy and elapsed time are analyzed for different values of n. The Support Vector Machine has five different kernels, namely Linear, Polynomial, Radial Basis Function, Sigmoid and Precomputed. We tested SVM with these different kernels; the experiments were repeated and the accuracy and elapsed time were found. For this dataset, γ was varied from 0 to 10 and then fixed at 0.000001 for the best optimized performance.

While using the Radial Basis Function kernel we observe better accuracy compared to the Linear, Polynomial and other kernels. Using trial and error we find the best cost (-c) and gamma (-g) parameters. The Gaussian radial basis function performs better than the polynomial kernel with degree 3, whereas the sigmoid kernel gives a constant accuracy of 0.50625.

The feature counts for unigrams as well as bigrams over the 2320 documents are calculated for different values of n. For higher values of n, the feature count decreases.
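This experimental sweep can be pictured roughly as the grid below, which varies the threshold n and the SVM kernel while timing each run; the helper assumes the count matrix and per-feature counts from the earlier sketch, and the precomputed kernel is omitted since it needs an explicit Gram matrix. It is a sketch of the procedure, not the authors' code.

# Rough sketch of the sweep: vary the frequency threshold n and the SVM kernel,
# recording 10-fold accuracy and elapsed time for each configuration.
import time
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def run_grid(X_full, y, feature_counts):
    results = []
    for n in range(10, 101, 10):                       # n = 10, 20, ..., 100
        cols = np.where(feature_counts >= n)[0]        # features above the threshold
        X = X_full[:, cols]
        for kernel in ("linear", "poly", "rbf", "sigmoid"):
            clf = SVC(kernel=kernel, C=1.0, gamma=1e-6, degree=3)
            start = time.time()
            acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy").mean()
            results.append((n, kernel, X.shape[1], acc, time.time() - start))
    return results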
Fig. 1. Accuracy comparison for unigram feature

Fig. 1 and Table I show that SVM constantly outperforms the other algorithms for different values of N, with a maximum accuracy of 64.69 percent for bigram features. Logistic Regression constantly gives the least accuracy. Bernoulli Naive Bayes has its accuracy bouncing around; when n equals 100 the system has its least accuracy of 47.21 percent. Multinomial Naive Bayes is the second best classifier next to SVM. Random Kitchen Sink does not show much credibility as the number of features increases.

Fig. 2. Accuracy comparison for bigram feature

Fig. 3. Elapsed time comparison for unigram feature

Fig. 4. Elapsed time comparison for bigram feature

Fig. 2 and Table II show that there is an increase in elapsed time from unigram to bigram for Bernoulli Naive Bayes and Logistic Regression. Random Kitchen Sink shows only a slight variation between unigram and bigram. SVM shows a bigger difference in elapsed time compared with the others, whereas Multinomial Naive Bayes yields the highest time of all the classifiers.

Figs. 3 and 4 show that for bigram features both SVM and Multinomial Naive Bayes predict to almost the same extent, with a 1 to 2 percent difference. When n equals 90, we get a maximum accuracy of 60.32 for SVM. RKS shows the least performance here too, and Bernoulli Naive Bayes keeps decreasing as we increase n.

Fig. 5. Feature count of unigram vs. bigram

Fig. 5 shows how the feature count keeps diminishing as we increase n. When n equals 100, the classifier considers only the features (words) which have been repeated a minimum of 100 times; obviously the number of features will be smaller for a higher threshold. For bigrams, 8549 features are taken at n=10. Usually, the first and last words in a document will be taken as unique features when it comes to bigrams; therefore, for 2320 documents there will be 4640 such features along with the bigram features. That is why the feature dimensionality for bigrams is quite large on par with unigrams. The feature count remains the same for the distinct kernels.
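To illustrate the unigram-versus-bigram feature counts discussed here, a small check along the following lines could be run on the same corpus; it assumes the docs list from the earlier pipeline sketch, and the numbers it prints depend on the data.

# Count how many unigram vs. bigram features survive a given frequency threshold n.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def surviving_features(docs, order, n):
    X = CountVectorizer(ngram_range=(order, order)).fit_transform(docs)
    counts = np.asarray(X.sum(axis=0)).ravel()
    return int((counts >= n).sum())

# Example usage, assuming `docs` holds the cleaned reviews:
# for n in range(10, 101, 10):
#     print(n, surviving_features(docs, 1, n), surviving_features(docs, 2, n))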
There are many neutral reviews among the negative documents and some among the positive documents, which results in sparsity in the feature vector matrix; this affects the overall accuracy reported in this paper. Since the movie reviews were collected from various sites, the dataset is unstructured, and some reviews have implicit meaning and negations. There are a lot of stop words and unlemmatized words, which has resulted in an increased number of features in the feature vector.

VII. CONCLUSION AND FUTURE WORK

We have performed classification of Tamil movie reviews as an application of sentiment analysis. We have observed that SVM easily outperforms the other classifiers for both unigram and bigram features. Trigrams were also tested for SVM, but the results show very minimal change in accuracy compared with bigrams, and even the elapsed times for bigrams and trigrams are almost the same. We get an accuracy of 0.6469 for a minimum feature count of 10 with SVM on bigrams, which is really motivating for the Tamil language. Logistic Regression and Random Kitchen Sink do not seem to perform well, and Multinomial Naive Bayes for unigrams consumes a lot of time. These empirical results hold strictly for the Tamil dataset. The dataset is being further developed and filtered. Further, hand-labeled positive and negative features could be added to extract feature vectors from the test data.

REFERENCES
[1] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment
classification using machine learning techniques,” in Proceedings of
the ACL-02 conference on Empirical methods in natural language
processing, vol. 10. Association for Computational Linguistics, 2002,
pp. 79–86.
[2] H. Cho and S.-H. Choi, “Sentiment classification of movie reviews
using korean sentiment dictionary,” 2014.
[3] V. Jain, “Prediction of movie success using sentiment analysis of
tweets.”
[4] A. Agarwal and P. Bhattacharyya, “Sentiment analysis: A new approach
for effective use of linguistic knowledge and exploiting similarities in a
set of documents to be classified,” in Proceedings of the International
Conference on Natural Language Processing (ICON), 2005.
[5] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts,” in Proceedings of
the 42nd annual meeting on Association for Computational Linguistics,
2004.
[6] D. Lee, O.-R. Jeong, and S.-g. Lee, “Opinion mining of customer
feedback data on the web,” in Proceedings of the 2nd international
conference on Ubiquitous information management and communication,
2008.
[7] K. Hiroshi, N. Tetsuya, and W. Hideo, “Deeper sentiment analysis using
machine translation technology,” 2004.
[8] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, and C. Jin, “Red
opal: product-feature scoring from reviews,” in Proceedings of the 8th
ACM conference on Electronic commerce, 2007.
[9] B. Liu, M. Hu, and J. Cheng, “Opinion observer: analyzing and com-
paring opinions on the web,” in Proceedings of the 14th international
conference on World Wide Web. ACM, 2005, pp. 342–351.
[10] A. Abbasi, H. Chen, and A. Salem, “Sentiment analysis in multiple
languages: Feature selection for opinion classification in web forums,”
ACM Transactions on Information Systems (TOIS), vol. 26, p. 12, 2008.
[11] E. Refaee and V. Rieser, “An arabic twitter corpus for subjectivity and
sentiment analysis,” in Proceedings of the Ninth International Con-
ference on Language Resources and Evaluation (LREC14), Reykjavik,
Iceland, may. European Language Resources Association (ELRA), 2014.
[12] A. Esuli and F. Sebastiani, “Sentiwordnet: A high-coverage lexical
resource for opinion mining,” Evaluation, pp. 1–26, 2007.
[13] A. Juan and H. Ney, “Reversing and smoothing the multinomial naive bayes text classifier,” in PRIS. Citeseer, 2002.
[14] M. Panda, A. Abraham, and M. R. Patra, “Discriminative multinomial naive bayes for network intrusion detection,” 2010, pp. 5–10.
[15] S.-H. Yang, H. Zha, and B.-G. Hu, “Dirichlet-bernoulli alignment: A generative model for multi-class multi-label multi-instance corpora,” in Advances in neural information processing systems, 2009.
[16] K. Cho, A. Ilin, and T. Raiko, “Improved learning of gaussian-bernoulli restricted boltzmann machines.” Springer, 2011, pp. 10–17.
[17] D. Williams, X. Liao, Y. Xue, and L. Carin, “Incomplete-data classification using logistic regression.” ACM, 2005, pp. 972–979.
[18] W. Chen, Y. Chen, Y. Mao, and B. Guo, “Density-based logistic regression.” ACM, pp. 140–148.
[19] A. Rahimi and B. Recht, “Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning,” in Advances in neural information processing systems, 2009, pp. 1313–1320.
[20] C.-W. Hsu, C.-C. Chang, C.-J. Lin et al., “A practical guide to support vector classification,” 2003.
[21] S. Rueping, “Svm classifier estimation from group probabilities,” in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[22] T. Hastie, S. Rosset, R. Tibshirani, and J. Zhu, “The entire regularization path for the support vector machine,” The Journal of Machine Learning Research, vol. 5, pp. 1391–1415, 2004.
TABLE I. AVERAGE ACCURACY (%) OF DOCUMENT CLASSIFICATION
N    Multinomial Naive Bayes    Bernoulli Naive Bayes    Logistic Regression    Random Kitchen Sink    Support Vector Machines (RBF Kernel)
     Unigram  Bigram            Unigram  Bigram          Unigram  Bigram        Unigram  Bigram        Unigram  Bigram  Trigram
10 57.98 58.73 57.71 57.99 53.90 52.16 50.00 49.76 60.00 64.69 60.01
20 58.66 59.19 60.19 57.49 52.38 57.14 51.35 51.67 59.95 61.25 59.38
30 59.80 58.34 59.31 56.91 53.90 54.76 52.65 53.67 60.31 60.93 60.31
40 61.01 59.43 53.56 56.63 51.08 51.08 53.78 61.88 59.68 61.88 59.69
50 59.01 58.58 55.89 55.12 53.46 54.76 52.37 51.32 60.32 62.19 60.31
60 58.65 59.73 59.86 54.89 52.16 56.76 52.21 52.71 59.69 62.56 59.69
70 58.15 59.16 58.34 53.27 48.05 55.10 53.37 53.64 59.31 61.62 59.87
80 58.01 59.04 58.62 54.44 53.25 50.43 51.79 51.62 59.68 60.62 59.67
90 60.21 59.77 58.53 53.35 50.43 54.55 52.16 51.62 60.26 61.87 59.73
100 57.60 58.52 47.21 51.69 51.08 53.03 51.12 51.01 60.15 60.56 60.94

TABLE II. AVERAGE ELAPSED TIME OF THE ALGORITHMS


N    Multinomial Naive Bayes    Bernoulli Naive Bayes    Logistic Regression    Random Kitchen Sink    Support Vector Machines (RBF Kernel)
     Unigram  Bigram            Unigram  Bigram          Unigram  Bigram        Unigram  Bigram        Unigram  Bigram  Trigram
10 645.4260 407.85 0.4923 1.4759 5.3196 7.6128 0.4418 0.4319 25.8771 132.502 25.8439
20 662.3449 412.16 0.4776 0.9234 5.7252 7.1640 0.4654 0.4552 25.3172 95.1875 25.2983
30 659.7410 423.86 0.3479 0.8324 5.3040 7.1448 0.4454 0.4599 24.8517 85.9136 24.6543
40 667.6440 420.18 0.3378 0.7264 5.4912 7.1760 0.4487 0.4681 24.5796 76.4123 24.5225
50 666.7718 422.65 0.3838 0.8174 5.3352 6.8796 0.4748 0.4506 24.0705 78.2838 23.9837
60 662.4366 427.89 0.5202 0.6509 6.1776 6.8640 0.4724 0.4413 23.9158 72.4600 23.7400
70 660.3963 422.21 0.4288 0.6247 5.2104 7.0200 0.4606 0.4473 23.6860 72.0195 23.7368
80 662.3963 413.81 0.3225 0.6142 5.2884 6.9732 0.4610 0.4352 23.4350 70.4436 23.3499
90 664.0042 425.67 0.3592 0.6154 4.9452 6.8952 0.4689 0.4639 26.0621 68.5199 23.1827
100 669.3727 421.71 0.4279 0.5971 4.7424 6.0576 0.4445 0.4723 25.7392 68.1910 23.0620

TABLE III. SUPPORT VECTOR MACHINES


N    SVM Polynomial Kernel (Accuracy)    SVM RBF Kernel (Accuracy)       SVM Linear Kernel (Accuracy)    Number of Features
     Unigram  Bigram                     Unigram  Bigram  Trigram        Unigram  Bigram                 Unigram  Bigram
10 49.3750 53.7500 60.0000 64.6875 57.9900 60.3125 62.1875 1433 8549
20 49.3750 50.3125 59.9532 61.2500 57.4900 59.3750 62.1875 1114 5952
30 50.6250 50.3333 60.3125 60.9375 56.9100 56.5625 61.8750 951 4826
40 50.6250 53.1250 59.6875 61.8750 56.6300 61.2500 61.5625 861 4139
50 50.3125 52.5000 60.3125 62.1875 55.1200 56.8750 59.3750 791 3664
60 49.3750 50.9375 59.6875 62.5625 54.8900 59.3750 60.3125 746 3306
70 49.3770 52.1875 59.6875 61.5625 53.2700 53.4375 60.9375 713 3029
80 49.0625 51.8750 59.6875 60.6250 54.4400 62.5000 62.8125 679 2812
90 49.3750 51.5625 60.2655 61.8750 53.3500 57.1875 59.6875 644 2610
100 49.6875 50.6250 60.1578 61.5625 51.6900 53.1250 61.8750 613 2441
