1. Introduction
Advances in internet technology and the continuous development of Web 2.0 are resulting in the production of a substantial amount of data daily. The availability of a wide variety of social media platforms has increased connectivity among users, changing the prevalent view of socialization, personalization, and networking. In the fourth quarter of 2020, an estimated 1.8 billion users were active on Facebook each day [1]. This is in addition to Facebook's ancillary services such as Instagram, WhatsApp, and Messenger, each of which has about 1 billion monthly active users [2]. Similarly, according to third-party analysis, other platforms such as iMessage (owned by Apple), WeChat (Tencent), and YouTube (Google) are now members of the club of services with 1 billion monthly active users, which is no longer exclusive. Furthermore, 75% of internet users now regularly use at least one social media platform [3]. From a purely technical standpoint, increased accessibility has provided new opportunities and challenges by encouraging users to share their views, emotions, and opinions, in addition to consuming services [4,5]. One of the fast-growing and impactful social media networks is Twitter, on which users can read, post, and update short text messages termed ‘tweets’, which enable them to communicate their views, opinions, and sentiments about a particular entity. These sentiment-bearing tweets play a vital role in many areas, for instance, social media marketing [6], academics [7], and election campaign news [6].
Sentiment analysis aims at determining and categorizing the polarity of a subjective text at the phrase, sentence, or document level [8]. It has applications in various fields, including e-commerce, politics, entertainment, and health care, to name a few. For instance, sentiment analysis can help companies track customer perceptions of their products; it can also assist customers in selecting the best product based on public opinion [9]. While sentiment analysis has applications in a variety of fields, it comes with diverse issues and challenges related to Natural Language Processing (NLP). Recent research on sentiment analysis is still afflicted by technical and theoretical complexities that limit its overall accuracy in sentiment detection [10]. Hussein et al. [11] investigated the challenges of sentiment analysis and their effects on the accuracy of the results. The experimental results substantiated that accuracy is a considerable challenge in conducting sentiment analysis. They also demonstrated that accuracy is affected by several challenges, such as handling sarcasm, negation, abbreviations, domain dependence, and bi-polar words.
As for sentiment analysis of tweets, the key task is to classify opinion-bearing tweets as either negative or positive. Sentiment analysis of tweets comes with its own challenges. People tend to use informal language in their tweets, which heightens the risk of failing to detect the overall sentiment of the text [12]. Some tweets are very short and carry little contextual information, which provides inadequate indications of sentiment [13]. Understanding the sentiments of tweets containing acronyms and abbreviations is also an immense challenge for a computer. Owing to these challenges, researchers are showing growing interest in classifying the sentiments of tweets.
Sentiments of tweets can be investigated using three approaches: (I) a machine learning (ML) approach, which utilizes learning models for the classification of sentiments; (II) a rule-based approach, which uses corpus-based sentiment lexicons, publicly available sentiment lexicons, or a lexical dictionary for the extraction of sentiments; and (III) a hybrid approach, which combines the ML and rule-based approaches. In addition, deep learning approaches, as adopted in a large number of studies, have shown their significance in sentiment analysis [14], computer vision [15], and speech recognition [16]. For example, the authors of [17] showed that ConvBiLSTM, a deep learning model, produced more effective and robust results in analyzing the sentiments of tweets. They utilized a convolutional neural network (Conv) for the extraction of local features and bidirectional long short-term memory (BiLSTM) to capture long-distance dependencies, and classified tweets with 91.13% accuracy. In line with the above, another study integrated deep learning models into the sentiment classification of tweets, which yielded more accurate results than conventional ML models [18].
This study utilizes the Sentiment140 dataset, which contains 1.6 million tweets, of which 800,000 are annotated as negative and 800,000 as positive. Tweets in this dataset were originally annotated by considering emoticons; for instance, tweets with happy emoticons were considered positive and tweets containing sad emoticons were considered negative [19]. This study proposes that annotating tweets using a lexical dictionary generates more correlated features for more accurate sentiment classification of tweets. For this purpose, the current study proposes a novel approach that leverages the benefits of a lexical dictionary from the rule-based approach and learning models from the ML approach for the sentiment classification of tweets.
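The emoticon-based annotation originally used for the dataset can be illustrated with a minimal sketch. The emoticon lists and the function name `emoticon_label` are hypothetical and chosen only for illustration; the exact emoticon sets used to build Sentiment140, and the handling of tweets with no or conflicting emoticons, are assumptions rather than details taken from this study.

```python
from typing import Optional

# Illustrative emoticon lists (assumption): not the exact sets used for Sentiment140.
HAPPY_EMOTICONS = (":)", ":-)", ":D", "=)")
SAD_EMOTICONS = (":(", ":-(")


def emoticon_label(tweet: str) -> Optional[str]:
    """Return 'positive', 'negative', or None when emoticons give no clear signal."""
    has_happy = any(emoticon in tweet for emoticon in HAPPY_EMOTICONS)
    has_sad = any(emoticon in tweet for emoticon in SAD_EMOTICONS)
    if has_happy == has_sad:
        # No emoticon at all, or both kinds present: leave unlabeled (assumption).
        return None
    return "positive" if has_happy else "negative"
```

A rule of this kind looks only at surface markers, which is one reason a lexicon-based score over the words of the tweet may yield labels that correlate better with the text features.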
Key contributions of this study are summarized as follows:
This study explores the viability of the implementation of a lexical dictionary and evaluates the potency of a stacked ensemble for the sentiment classification of tweets.
A lexical dictionary, namely TextBlob, is integrated for sentiment annotation of tweets. TextBlob returns a float value in the range −1.0 to +1.0 that represents the sentiment orientation of the text, where +1.0 corresponds to positive and −1.0 corresponds to negative sentiment. We set the threshold value to 0, so that tweets with output values greater than 0 are regarded as positive and tweets with output values less than 0 as negative.
Three feature engineering approaches are integrated and evaluated in this study, namely term frequency-inverse document frequency (TF-IDF), bag of words (BOW), and a union of BOW and TF-IDF.
A novel stacked ensemble of an ML model, logistic regression (LR), and a deep learning model, long short-term memory (LSTM), is proposed for sentiment classification of tweets. LR is well suited to binary classification tasks, whereas LSTM is well suited to capturing long-term dependencies in larger datasets. Thus, the proposed stacked ensemble combines the predictions made by three LSTMs, used as base learners, with LR as the meta learner.
A diverse range of experiments is carried out in this study to compare the performance of the proposed approach with conventional state-of-the-art ML models including random forest (RF), AdaBoost (ADB), and logistic regression (LR). Moreover, this study also compares the performance of models using the original sentiments of tweets with their performance using sentiments extracted by TextBlob.
We also compare the performance of our proposed approach with related studies carried out on the Sentiment140 dataset for the sentiment classification of tweets.
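The annotation rule and the feature representations listed in the contributions above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in this study: all function names are hypothetical, the polarity score is passed in as a plain float (with the TextBlob library it would typically be read from `TextBlob(text).sentiment.polarity`), a score exactly equal to the threshold is treated as negative, TF-IDF uses raw counts with an unsmoothed log(N/df) weight, and the "union" of BOW and TF-IDF is read as concatenation of the two vectors.

```python
import math
from collections import Counter
from typing import Dict, List


def label_from_polarity(polarity: float, threshold: float = 0.0) -> str:
    """Map a polarity score in [-1.0, 1.0] to a binary sentiment label.

    Treating a score equal to the threshold as negative is an assumption.
    """
    return "positive" if polarity > threshold else "negative"


def build_vocabulary(docs: List[List[str]]) -> List[str]:
    """Sorted list of the distinct tokens across all tokenized documents."""
    return sorted({token for doc in docs for token in doc})


def bow_vector(doc: List[str], vocab: List[str]) -> List[float]:
    """Bag of words: raw count of each vocabulary term in the document."""
    counts = Counter(doc)
    return [float(counts[term]) for term in vocab]


def idf_weights(docs: List[List[str]], vocab: List[str]) -> Dict[str, float]:
    """Unsmoothed inverse document frequency, log(N / df), per term."""
    n_docs = len(docs)
    weights: Dict[str, float] = {}
    for term in vocab:
        doc_freq = sum(1 for doc in docs if term in doc)
        weights[term] = math.log(n_docs / doc_freq)
    return weights


def tfidf_vector(doc: List[str], vocab: List[str], idf: Dict[str, float]) -> List[float]:
    """TF-IDF: raw term count multiplied by the term's IDF weight."""
    counts = Counter(doc)
    return [counts[term] * idf[term] for term in vocab]


def union_features(doc: List[str], vocab: List[str], idf: Dict[str, float]) -> List[float]:
    """Concatenate the BOW and TF-IDF vectors (one reading of 'union')."""
    return bow_vector(doc, vocab) + tfidf_vector(doc, vocab, idf)
```

With a vocabulary of size V, the union representation has 2V dimensions, so a classifier sees both the raw counts and the rarity-weighted counts for each term.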
The remainder of the paper is organized as follows:
Section 2 reviews related work on sentiment analysis and briefly describes previous studies.
Section 3 briefly describes the dataset along with preprocessing techniques utilized to create clean data. It also explains the techniques and algorithms utilized in this research to conduct experiments.
Section 4 presents a detailed discussion and analysis of the results.
Section 5 presents the conclusion and future directions.
2. Related Work
In the field of text classification, there is wide scope for analyzing sentiments, and many researchers have studied sentiment analysis by identifying emotions contained in the text [20,21]. Ankit and Saleena [22] carried out Twitter sentiment analysis by integrating an ensemble of Naïve Bayes (NB), Random Forest (RF), Support Vector Machine (SVM), and Logistic Regression (LR) models with BOW as a feature extraction technique. The authors proposed a two-phase approach in which they first predicted the sentiment score of the tweet and then predicted the polarity of the tweet based on that score. They evaluated the proposed approach on four datasets: Sentiment140, HCR (Health Care Reform), the First GOP Debate Twitter Sentiment dataset, and the Twitter sentiment analysis dataset. The results showed that the proposed ensemble classifier performs better than the stand-alone classifiers.
Onan et al. [23] proposed a multi-objective weighted voting ensemble classifier for text sentiment classification. Their system incorporates Bayesian Logistic Regression (BLR), Linear Discriminant Analysis (LDA), NB, LR, and SVM as base learners, whose performance in terms of sensitivity and security determines the weight adjustment. Experiments on different classification tasks, including sentiment analysis, software defect prediction, spam filtering, credit risk modeling, and semantic mapping, suggest that their system outperforms conventional ensemble learning models. The highest accuracy of 98.86% was achieved in the software defect detection task on a dataset containing details of laptops.
Rustam et al. [24] proposed a voting classifier (VC) for the sentiment analysis of tweets. VC comprises logistic regression (LR) and a stochastic gradient descent classifier (SGDC) and produces predictions under soft voting. In their study, they classified the tweets into three classes (positive, negative, and neutral). Different ML classifiers were also tested on the “twitter-airline-sentiment” dataset. Their study investigated the effect of feature extraction techniques such as TF, TF-IDF, and word2vec on classification accuracy. LSTM, a deep learning model, was also used, and it achieved a lower accuracy than the ML models. The accuracy achieved by the voting classifier was 78.9% and 79.1% with TF and TF-IDF feature extraction, respectively.
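Soft voting, as used by such a voting classifier, averages the class probabilities produced by the individual classifiers and selects the class with the highest average. The following is a minimal sketch of that rule, assuming each classifier reports a probability for every class; the function name `soft_vote` and the probability values in the usage note are hypothetical and not taken from the cited study.

```python
from typing import Dict, List

ClassProbs = Dict[str, float]


def soft_vote(predictions: List[ClassProbs]) -> str:
    """Average per-class probabilities across classifiers and return the top class."""
    if not predictions:
        raise ValueError("need at least one classifier prediction")
    classes = predictions[0].keys()
    averaged = {
        label: sum(probs[label] for probs in predictions) / len(predictions)
        for label in classes
    }
    # The class with the highest mean probability wins the vote.
    return max(averaged, key=averaged.get)
```

For instance, if one classifier assigns probabilities of 0.6, 0.3, and 0.1 to the positive, negative, and neutral classes and a second assigns 0.1, 0.7, and 0.2, the averages are 0.35, 0.5, and 0.15, so the vote returns the negative class even though the first classifier alone would predict positive.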
Umer et al. [25] conducted sentiment analysis of tweets using an ensemble of a Convolutional Neural Network (CNN) and LSTM. Because ML classifiers do not perform well on vast amounts of data, they advised the use of a deep learning-based ensemble system to overcome this limitation. They evaluated their proposed approach on three different datasets and integrated feature extraction methods such as word2vec and TF-IDF. The results showed that the CNN-LSTM achieved higher accuracy than the other classifiers. They also compared the performance of the proposed CNN-LSTM model with other deep learning models, which validated the proposed approach.
Stojanovski et al. [26] used a deep CNN approach for sentiment analysis of Twitter data. The proposed CNN was trained on top of pre-trained word embeddings derived from large text corpora using unsupervised learning, and it combined multiple filters with varying window sizes, a dropout layer, two fully connected layers, and a softmax layer. The results showed that the word vectors pre-trained in the unsupervised learning phase are very effective on Twitter corpora. They used the Twitter 2015 dataset and achieved an F1 score of 64.85%.
Jianqiang et al. [27] suggested a deep learning-based system to classify tweets into negative and positive sentiments. The authors named the system global vector (GloVe) deep CNN (DCNN). For sentiment features, the authors concatenated the pre-trained N-gram features and word embedding features as feature vectors. Moreover, they captured contextual features by using a recurrent structure and used a CNN for the representation of text. Their proposed system achieved the highest accuracy of 85.97% on the STSGd dataset.
Santos et al. [28] proposed a deep convolutional neural network that uses character-level to sentence-level information to perform sentiment classification of short texts. They used two datasets in their study; the maximum accuracy they achieved was 86.4% on the STS corpus.
Ishaq et al. [29] proposed a deep neural network-based model for analyzing the sentiments of hotel reviews given by guests. The authors evaluated their proposed approach in terms of binary classification and multi-class classification with 3 classes and 10 classes. The results showed that a maximum accuracy of 97% was achieved by LSTM on binary classification.
Sentiment classification using deep learning models is strongly affected by the structure of the data under consideration. In this regard, one study employed and compared three CNN-based and five RNN-based deep neural networks to derive implications for the development of a sentiment classification system with maximized performance. The study concluded that the larger the training data size, the higher the accuracy of the model. The authors also investigated the effect of character-level and word-level input structures on the models, which showed that a word-level structure lets the model learn hidden patterns more accurately than a character-level structure [30].
Similarly, a hybrid sentiment classification model leveraging the benefits of word embedding techniques along with deep learning models was proposed in [31]. The authors combined FastText embeddings with character embeddings and fed them as input to the proposed hybrid of CNN and BiLSTM, which achieved the highest accuracy score of 82.14%.
Another study investigated the deep learning model CNN-LSTM for Twitter sentiment analysis [32]. Their method first utilized unlabeled data to pre-train word embeddings on a subset of the data, along with distant supervision and fine-tuning. Their proposed system is based on an ensemble of multiple CNN and LSTM networks used in the classification of the tweets. They used the SemEval-2017 Twitter dataset to evaluate the proposed approach. Using an ensemble of 10 CNN and 10 LSTM networks, they achieved an accuracy of 74.8%.