Sentiment Analysis of Tamil Movie Reviews Via Feature Frequency Count
All content following this page was uploaded by M. Anand Kumar on 24 January 2018.
Abstract—The digital community paves the way for a huge volume of opinion-rich reviews from forums, blogs, discussions and so on. In divergence with the common text classification approach, word counts in the document are used as features. Meta-level features are taken from hand-labeled Tamil movie reviews. Once the features are extracted, they are used as input to supervised machine learning algorithms for further classification. Generally, the frequency of occurrence of a keyword is the more suitable feature for overall sentiment analysis, and sentiment is not necessarily indicated by repeated use of keywords. Experimental results show that the method proposed in this paper achieves considerable accuracy, about 65 percent, in detecting sentiment information in Tamil.

Keywords—Sentiment analysis; Feature Extraction; SVM; RBF.

… and negative (including, seldom, neutral). It is significant to classify opinions so as to distill the emotions and discover a common man's underlying thoughts concerning every service, including movie reviews. Experimental results show an accuracy of 64 percent for the SVM classifier with unigram features. The major contributions of this paper are:
(a) Tamil movie reviews were collected, structured, analyzed and hand-tagged into positive and negative.
(b) Four major NLP classification algorithms were applied.
(c) SVM-based classification was done by varying n-grams and different kernels.
II. LITERATURE SURVEY
C. Bayes Rule

Both the Multinomial and the Bernoulli Naive Bayes algorithms are derived from Bayes' rule [14]:

    p(c|d) = p(c) × p(d|c) / p(d)    (3)

where p(c) and p(d) represent the individual probabilities of c and d respectively, and p(c|d) is the probability of c given d.

The polynomial kernel function is defined as

    k(x, y) = (x^T y + c)^d    (9)

where d is the degree. The linear kernel function is given as

    k(x, y) = x · y    (10)

The performance of the SVM classifier strongly depends on choosing the c and γ parameters in the training phase [22].
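The kernel functions above can be sketched directly in code; the following is a minimal illustration (the function names and default parameter values are ours, not the paper's):

```python
import math

def linear_kernel(x, y):
    # k(x, y) = x . y  (eq. 10)
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x, y, c=1.0, d=3):
    # k(x, y) = (x^T y + c)^d  (eq. 9), with degree d
    return (linear_kernel(x, y) + c) ** d

def rbf_kernel(x, y, gamma=1e-6):
    # Gaussian RBF: k(x, y) = exp(-gamma * ||x - y||^2);
    # gamma = 1e-6 mirrors the value the experiments settle on, as an assumption
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that the polynomial kernel reduces to the linear kernel (up to the offset c) when d = 1, which is why the two are often compared at the same cost parameter.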
IV. PROPOSED METHOD

All positive and negative documents are merged, one review per line, into a single text file, and a random integer identifier is assigned to each. The file is then read line by line, using \n as the delimiter, so that each line yields one review document. The vector read from the file has dimension 2320 × 1 (the feature vector), and each document is stored as a 1 × 1 element of it. From these, 1000 positive and 1000 negative documents are taken for training, and 160 positive and 160 negative documents for testing (and vice versa).

A document-term matrix is then created in which columns correspond to unique words and rows to documents; the cell at the intersection of a word and a document holds the count of that word in that document. This count represents the intensity of that feature in the given document. We select feature words that appear many times in a document by fixing n (the number of counts) as a threshold, and accuracy is then calculated using only the features whose counts are beyond this fixed threshold. Once the feature matrix is created, the two distinct classes are separated with various NLP classification algorithms. This simple model can be further extended to n-grams.

Ten-fold cross-validation is used to check the efficiency of the classifiers. Out of 2320 documents, 2000 are chosen for training and 320 for testing, where contiguous blocks of 10 documents are picked at random. The accuracy of the classifier is then determined. The model can be scaled to n folds.

V. DATASET CREATION

Movie review sites are growing for all languages. Tamil movie reviews were collected from over 25 sites, resulting in unstructured data. The reviews range from professionals and celebrities to ordinary people. The raw data contains stray spaces, special characters, symbols and fonts of other languages.

… algorithms, namely Multinomial and Bernoulli Naive Bayes, Logistic Regression, Random Kitchen Sinks and Support Vector Machines, are applied on these data sets to find out their individual performance for the Tamil language.

We denote by n the number of features, i.e. the minimum number of times a unique word must be repeated in a document. By tuning the parameter n we check the accuracy of each classification algorithm with k-fold cross-validation (k = 10).

Choosing n from 10 to 100 in intervals of 10, we find the accuracy of every classifier; the elapsed time consumed by each classifier is also compared. The accuracy and elapsed-time analyses are performed on unigrams and bigrams for MNB, BNB, LR and RKS.

On the other hand, we observed that SVM performed on par with the other algorithms. Its trigram features, accuracy and elapsed time are analyzed for different values of n. The Support Vector Machine offers five different kernels, namely Linear, Polynomial, Radial Basis Function, Sigmoid and Precomputed, and we tested SVM with each of them, repeating the experiments and recording the accuracy and elapsed time. For this dataset, γ was varied from 0 to 10 and fixed at 0.000001 for the best optimized performance.

With the Radial Basis Function kernel we observe better accuracy than with the Linear, Polynomial and other kernels. Using trial and error we find the best cost (−c) and gamma (−g) parameters. The Gaussian radial basis function performs better than the polynomial kernel with degree 3, whereas the sigmoid kernel gives a constant accuracy of 0.50625.

The feature counts for unigrams as well as bigrams over the 2320 documents are calculated for different values of n. For higher values of n, the feature count decreases.
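The frequency-count pipeline described above (build a document-term matrix, then keep only the words whose count reaches the threshold n in some document) can be sketched as follows; the function names and the toy reviews are ours, not the paper's corpus:

```python
from collections import Counter

def document_term_matrix(documents):
    """Rows = documents, columns = unique words; each cell counts a word in a document."""
    vocab = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        matrix.append([counts.get(word, 0) for word in vocab])
    return vocab, matrix

def select_features(vocab, matrix, n):
    """Keep only columns (words) whose count reaches n in at least one document."""
    keep = [j for j in range(len(vocab)) if any(row[j] >= n for row in matrix)]
    kept_vocab = [vocab[j] for j in keep]
    reduced = [[row[j] for j in keep] for row in matrix]
    return kept_vocab, reduced

# Toy example with two one-line "reviews":
vocab, m = document_term_matrix(["good good movie", "bad movie movie movie"])
kept, reduced = select_features(vocab, m, n=2)
```

With n = 2, the word "bad" (count 1 everywhere) is dropped while "good" and "movie" survive, which is exactly the thresholding effect that shrinks the feature count as n grows.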
REFERENCES
[1] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment
classification using machine learning techniques,” in Proceedings of
the ACL-02 conference on Empirical methods in natural language
processing, vol. 10. Association for Computational Linguistics, 2002,
pp. 79–86.
[2] H. Cho and S.-H. Choi, “Sentiment classification of movie reviews
using Korean sentiment dictionary,” 2014.
[3] V. Jain, “Prediction of movie success using sentiment analysis of
tweets.”
[4] A. Agarwal and P. Bhattacharyya, “Sentiment analysis: A new approach
for effective use of linguistic knowledge and exploiting similarities in a
set of documents to be classified,” in Proceedings of the International
Conference on Natural Language Processing (ICON), 2005.
[5] B. Pang and L. Lee, “A sentimental education: Sentiment analysis using
subjectivity summarization based on minimum cuts,” in Proceedings of
the 42nd annual meeting on Association for Computational Linguistics,
2004.
[6] D. Lee, O.-R. Jeong, and S.-g. Lee, “Opinion mining of customer
feedback data on the web,” in Proceedings of the 2nd international
conference on Ubiquitous information management and communication,
2008.
[7] K. Hiroshi, N. Tetsuya, and W. Hideo, “Deeper sentiment analysis using
machine translation technology,” 2004.
[8] C. Scaffidi, K. Bierhoff, E. Chang, M. Felker, H. Ng, and C. Jin, “Red
opal: product-feature scoring from reviews,” in Proceedings of the 8th
ACM conference on Electronic commerce, 2007.
[9] B. Liu, M. Hu, and J. Cheng, “Opinion observer: analyzing and com-
paring opinions on the web,” in Proceedings of the 14th international
conference on World Wide Web. ACM, 2005, pp. 342–351.
[10] A. Abbasi, H. Chen, and A. Salem, “Sentiment analysis in multiple
languages: Feature selection for opinion classification in web forums,”
ACM Transactions on Information Systems (TOIS), vol. 26, p. 12, 2008.
[11] E. Refaee and V. Rieser, “An Arabic Twitter corpus for subjectivity and
sentiment analysis,” in Proceedings of the Ninth International Con-
ference on Language Resources and Evaluation (LREC14), Reykjavik,
Iceland, May. European Language Resources Association (ELRA), 2014.
[12] A. Esuli and F. Sebastiani, “SentiWordNet: A high-coverage lexical
resource for opinion mining,” Evaluation, pp. 1–26, 2007.
[13] A. Juan and H. Ney, “Reversing and smoothing the multinomial Naive
Bayes text classifier,” in PRIS. Citeseer, 2002.
TABLE I. AVERAGE ACCURACY (%) OF DOCUMENT CLASSIFICATION
(MNB = Multinomial Naive Bayes, BNB = Bernoulli Naive Bayes, LR = Logistic Regression, RKS = Random Kitchen Sinks; the SVM columns use the RBF kernel; Uni/Bi/Tri = unigram/bigram/trigram features)

 n   | MNB-Uni MNB-Bi | BNB-Uni BNB-Bi | LR-Uni LR-Bi | RKS-Uni RKS-Bi | SVM-Uni SVM-Bi SVM-Tri
 10  | 57.98   58.73  | 57.71   57.99  | 53.90  52.16 | 50.00   49.76  | 60.00   64.69  60.01
 20  | 58.66   59.19  | 60.19   57.49  | 52.38  57.14 | 51.35   51.67  | 59.95   61.25  59.38
 30  | 59.80   58.34  | 59.31   56.91  | 53.90  54.76 | 52.65   53.67  | 60.31   60.93  60.31
 40  | 61.01   59.43  | 53.56   56.63  | 51.08  51.08 | 53.78   61.88  | 59.68   61.88  59.69
 50  | 59.01   58.58  | 55.89   55.12  | 53.46  54.76 | 52.37   51.32  | 60.32   62.19  60.31
 60  | 58.65   59.73  | 59.86   54.89  | 52.16  56.76 | 52.21   52.71  | 59.69   62.56  59.69
 70  | 58.15   59.16  | 58.34   53.27  | 48.05  55.10 | 53.37   53.64  | 59.31   61.62  59.87
 80  | 58.01   59.04  | 58.62   54.44  | 53.25  50.43 | 51.79   51.62  | 59.68   60.62  59.67
 90  | 60.21   59.77  | 58.53   53.35  | 50.43  54.55 | 52.16   51.62  | 60.26   61.87  59.73
 100 | 57.60   58.52  | 47.21   51.69  | 51.08  53.03 | 51.12   51.01  | 60.15   60.56  60.94
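The 10-fold evaluation behind Table I can be sketched as a simple index split. This is a generic sketch assuming equal contiguous folds; the paper's exact fold construction (random contiguous blocks of 10 documents) may differ:

```python
def kfold_indices(num_docs, k=10):
    """Yield (train, test) index lists for k contiguous, equally sized folds."""
    fold_size = num_docs // k
    for i in range(k):
        test = list(range(i * fold_size, (i + 1) * fold_size))
        test_set = set(test)
        train = [j for j in range(num_docs) if j not in test_set]
        yield train, test
```

Averaging a classifier's accuracy over the k (train, test) splits produced this way gives figures comparable in form to those reported in Table I.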