Ijst 2023 2912
RESEARCH ARTICLE
https://www.indjst.org/ 2240
Shahapur et al. / Indian Journal of Science and Technology 2024;17(21):2240–2249
1 Introduction
The rapid growth in social media use in recent years, enabled by technological advances,
has generated massive amounts of data. Text comments, which users write in large volumes,
are among the most abundant types of user data. Researchers
are beginning to focus on this useful data from social media. Sentiment Analysis is a
subfield of NLP that analyses and extracts feelings from text data by determining the
degree of polarity in the sentence. It is also known as opinion analysis or opinion mining.
Institutions have used a variety of research methods to ascertain the public perception
of a particular issue.
Polarity analysis determines whether the attitude or feeling expressed in a text is positive,
negative, or neutral. The analysis of people's views, feelings,
and opinions towards entities and their characteristics, as expressed in written texts,
is a subfield of text classification. Sentiment Analysis can be problematic because even humans
may find it difficult to interpret the emotion contained in a text. Furthermore, it
is not always easy to obtain high accuracy in the analysis process because a single
text may contain multiple emotions. Accordingly, one of the biggest challenges in
any text-based classification task is deciding which characteristics or indicators
to use to tell different classes apart. This section focuses on the latest
contributions to this field. G. Mostafa et al. applied several Machine Learning
algorithms to Twitter data, using numerous preprocessing and encoding steps
to boost accuracy rates, and compared the resulting accuracies.
The Machine Learning algorithms included KNeighbors, Support Vector Machines (SVM), and
Logistic Regression (LR). Their experiments show that the Neural Network
algorithm offers exceptional accuracy in comparison to the other algorithms (1).
Sentiment Analysis is a useful technique for understanding the beliefs and
motivations of others. The goal of this research is to use probabilistic ratings to reliably
identify and categorize news stories into positive, negative, and neutral categories (2) .
According to classical finance theory, prices reflect the balance between potential risks
and expected returns, and investors are logical, emotionless entities. This eliminates
the possibility that price decisions will be influenced by investor sentiment. However,
sentimental and logical investors are both acknowledged in modern behavioral
finance (3) .
The financial industry and academics have focused on the topic of stock market
return and volatility forecasting for more than 60 years. Sentiment Analysis from
financial news articles and online platforms such as Twitter has great potential to
improve trading strategies. Additionally, Sentiment Analysis of financial news articles
can provide predictive power useful for managers' risk assessment and portfolio
management. Natural Language Processing (NLP) techniques have been used more
frequently in recent academic studies to measure the impact of sentiment on stock
market performance from news, financial reports, and social media posts specific to
a given company.
Tetlock’s 2007 groundbreaking study examined the possible relationships between
sentiment in the media and changes in the stock market, using data from the Wall
Street Journal. According to his research, a greater degree of pessimism in the media
affected market prices negatively. Afterwards, Tetlock et al. (2008) investigated whether
corporate financial news could predict a company's accounting earnings and stock
returns using a bag-of-words model. Their findings suggested that unfavorable wording
in news stories about particular companies could be a sign of declining company profits.
Market prices, however, frequently underreacted to the information embedded in negative language, despite
this predictive ability (4). Users can determine whether the overall sentiment of an article is positive, neutral, or negative by looking
at certain terms and phrases. Although there are sophisticated algorithms available to identify the numerous sentiments that
can be communicated in an article, this tool streamlines the assessment process by giving users a simple result and empowering
them to make their own more nuanced decisions (5), (7). Figure 1 shows the arrangement of Sentiment Analysis in Natural Language
Processing. Basic problems like classifying sentiment in financial text data are at the heart of the study of Sentiment Analysis
in finance. It also entails using Sentiment Analysis to forecast variables like stock prices and market trends. There is also an
emphasis on using market time series data, among other techniques, to predict the sentiment of future news texts (6).
The primary goal of this paper is to provide users of sentiment data with recommendations on
the factors to take into account when creating sentiment indicators. The focus is on sentiment indicators that are taken from
financial news and used in financial forecasting (8). Natural Language Processing research has devoted considerable attention to the
task of Sentiment Analysis, especially in the financial sector. Several works in the literature employ diverse approaches to carry out
Financial Sentiment Analysis, ranging from lexicon-based methods to Machine Learning and Deep Learning strategies (9).
In the financial domain, stock market prediction is one of the applications in which Sentiment Analysis has been used
to predict future stock market trends and prices from the analysis of financial news articles. Joshi et al. compared three ML
algorithms and observed that Random Forest and Support Vector Machine (SVM) performed better than Naïve Bayes (NB).
Renault used StockTwits (a platform where people share ideas about the stock market) as a data source and applied five
algorithms, namely NB, a maximum entropy method, a linear SVM, Random Forest, and a multilayer perceptron and concluded
that the maximum entropy and linear SVM methods gave the best results. Over the years, researchers have combined Deep
Learning methods with traditional Machine Learning techniques (e.g., construction of sentiment lexicon), thus obtaining more
promising results (10) .
One popular technique that is being used more and more to gauge how social media users feel about a topic is Sentiment
Analysis. Data mining is the method most often used to perform Sentiment Analysis. Our primary concept involves utilizing
Machine Learning to ascertain, from investor messages, what they anticipate will happen to stock prices and the market in
general. The rationale behind our choice of Machine Learning methodology over data mining is that, particularly in large data
sets, the most difficult task in data mining is identifying features and choosing the best of those features.
The work by Loughran and McDonald is among the most well-known in this field. Drawing on documents from 1994 to 2008 obtained
through the US Securities and Exchange Commission Gateway, they composed a financial lexicon and manually compiled six word
lists: positive, negative, litigious, uncertain, modal strong, and modal weak. Mao et al. proposed an automatic constructor
for a Chinese financial lexicon. To create the lexicon automatically, their procedure
examines numerous corpora that are categorized as positive or negative.
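As a rough illustration of how a lexicon-based method scores a text, the sketch below counts matches against two word lists and converts the balance into a polarity label. The word lists are tiny hypothetical stand-ins, not entries taken from the lexicons described above, and the function names and the zero threshold are assumptions made for this example.

```python
import re

# Hypothetical miniature word lists for illustration only; these are NOT
# the actual entries of any published financial lexicon.
POSITIVE = {"gain", "growth", "profit", "strong", "improve"}
NEGATIVE = {"loss", "decline", "weak", "lawsuit", "default"}


def lexicon_polarity(text):
    """Return (positive - negative) / matched words, or 0.0 if nothing matches."""
    tokens = re.findall(r"[a-z]+", text.lower())
    pos = sum(token in POSITIVE for token in tokens)
    neg = sum(token in NEGATIVE for token in tokens)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0


def polarity_label(score, threshold=0.0):
    """Map a polarity score to a label; the threshold is an assumed cut-off."""
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


score = lexicon_polarity("Profit growth offset a small loss")
print(score, polarity_label(score))
```

A score near +1 means almost all matched words are positive, a score near -1 means almost all are negative, and a text with no matches falls back to neutral.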
Previous works employed the Multinomial NB model; however,
of all the models used, Multinomial NB demonstrated the lowest accuracy, approximately 74%. In this research, Multinomial
NB demonstrated the highest accuracy, nearly 82%. According to earlier research, the model correctly
classifies about 80% of positive sentiments, but it has considerable trouble classifying sentiments into the "neutral"
and "negative" classes, showing many misclassifications in these categories. The previous works used only social
media comments, whereas this research uses social media comments as well as news articles.
By increasing the accuracy of sentiment analysis in financial data, the project helps to improve risk management and
financial market decision-making. More sophisticated machine learning models and natural language processing methods are
used to decode emotions with greater accuracy. Multinomial Naïve Bayes, Logistic Regression, KNeighbors, Decision Tree,
and Random Forest are a few of the machine learning algorithms that are used to extract sentiment polarity and improve
classification accuracy. The research work also provides insights into the relationship between sentiment and market variables
such as volatility, trading volume, and stock prices.
Advanced machine learning models, such as Random Forest, Logistic Regression, KNeighbors, Decision Tree, and
Multinomial Naïve Bayes, are used in this work for sentiment analysis in finance. Compared to earlier studies (1), (10), the research
achieves a notable improvement in sentiment classification accuracy by utilizing these techniques. Sentiment extraction from
financial text data is improved by the application of Natural Language Processing techniques, particularly the Multinomial NB
model. Additionally, the work provides stakeholders with a clear overview by visually representing sentiment distribution using
a donut chart. The work also offers insightful information that can be used to manage risk and make wise financial decisions in
volatile markets. This work uses purely ML models and exceeds the accuracy of the previous works (1), (10), which reached
74%; here the accuracy increased to approximately 82%. Notably, the Multinomial NB model accuracy
rose significantly from 65% to 82%.
2 Methodology
Financial data Sentiment Analysis is a methodical and structured process that uses NLP techniques to extract meaningful
information from textual data related to investor sentiment, financial markets, and economic trends. This all-inclusive approach
is intended to carefully evaluate and measure the opinions stated in financial texts, such as news stories, tweets, earnings
reports, and other written materials. The main objective is to learn more about how sentiment affects financial markets. By
carefully examining and interpreting the sentiment contained in financial data, the proposed work enables decision-makers to
identify new trends, make well-informed decisions, and even forecast market moves. Thus, more knowledgeable and successful
financial and investment strategies are produced. All things considered, this methodology is a powerful instrument in the
field of financial analysis and risk assessment, offering a methodical and perceptive approach to the dynamic field of Financial
Sentiment Analysis.
2.1 Steps involved in Sentiment Analysis of Financial Data using Natural Language Processing
2.1.1 Data Collection
Data collection is unquestionably one of the most important phases of the Sentiment Analysis process. The quality, consistency, and
labeling of the collected data, as well as how it has been annotated, are critical for the later stages. To start this
procedure, first assemble a varied assortment of financial data sources. These could be news stories on finance, posts on social
media, earnings reports, or any other textual information related to the financial industry. Sources can include news websites,
social networking sites, financial reports, and a variety of other repositories; they should be carefully chosen.
By analyzing news, financial reports, and social media content, sentiment analysis is a tool that can be used to gauge public
opinion and attitudes toward specific industries or companies. When predicting market trends and making wise investment
decisions, these insights are a priceless resource. Businesses can adjust their strategies ahead of changes in the market by
understanding how the public views crucial economic factors like taxes, regulations, and economic indicators.
The purpose of this removal is to give more weight to words that convey the text's main idea. As an illustration, consider the text
"There is a pen on the table." The words "is," "a," "on," and "the" in this sentence are safe to remove during parsing because they
do not add much to the overall meaning of the statement. On the other hand, the keywords that convey the most important
details about the statement's subject matter are words like "there," "pen," and "table". The particular NLP task and its goals
should be carefully considered before deciding whether or not to remove stop words.
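The stop word example can be reproduced with a few lines of Python. This is a minimal sketch: the stop word set contains only the four words named in the example sentence, whereas a real list would be much longer, and the function name is an assumption made for illustration.

```python
import re

# Assumed stop word list, limited to the words named in the example sentence.
STOP_WORDS = {"is", "a", "on", "the"}


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [token for token in tokens if token not in stop_words]


print(remove_stop_words("There is a pen on the table."))
# prints ['there', 'pen', 'table']
```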
Stemming/Lemmatization: Word normalization techniques like Stemming and Lemmatization both seek to reduce words
to their most basic form. Stemming is a text normalization process in which prefixes and suffixes are removed from words
based on a list of frequently occurring affixes. This rule-based method eliminates suffixes such as "ing," "ly," "es," and "s," among
others, to simplify words. Lemmatization, on the other hand, is a more methodical and structured way to determine a word's root
form. It is based on morphological analysis, which considers word structure and grammatical relationships, and vocabulary,
which takes into account the significance of words within a dictionary.
These methods are used to improve the accuracy and efficiency of NLP tasks.
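The difference between the two techniques can be seen in a small sketch. The stemmer below strips only the four suffixes named in the text, and the lemmatizer is a hypothetical dictionary lookup that stands in for the vocabulary and morphological analysis a real lemmatizer would perform. Both function names, the lookup table, and the minimum stem length are assumptions for illustration.

```python
# Suffixes named in the text, checked in this order so "es" is tried before "s".
SUFFIXES = ("ing", "ly", "es", "s")

# Hypothetical lookup table standing in for a real lemmatizer's dictionary.
LEMMAS = {"was": "be", "better": "good", "shares": "share", "rising": "rise"}


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem characters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def naive_lemma(word):
    """Return the dictionary form if known, otherwise the lowercased word."""
    return LEMMAS.get(word.lower(), word.lower())


for word in ["rising", "shares", "quickly", "was"]:
    print(word, naive_stem(word), naive_lemma(word))
```

The stemmer turns "rising" into the non-word "ris" because it only applies suffix rules, while the dictionary lookup returns "rise". This is the trade-off the paragraph describes: stemming is cheap and rule-based, lemmatization is slower but yields valid root forms.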
hyperparameters precisely specify the parameters of the algorithm, the learning algorithm concentrates on minimizing the loss
function based on the input data. Algorithms are used by automated hyperparameter tuning techniques to find these ideal
values. Today, Bayesian optimization, random search, and Grid Search are a few of the most widely used automated methods.
In this work GridSearchCV is used to optimize hyperparameters and find the ideal model configuration.
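A minimal sketch of hyperparameter tuning with scikit-learn's GridSearchCV is given below. The text states only that GridSearchCV was used; the toy corpus, the choice of the Multinomial NB smoothing parameter `alpha` as the tuned hyperparameter, the candidate values, and the three-fold cross-validation are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with three examples per sentiment class.
texts = [
    "profits rose sharply this quarter",
    "strong growth in quarterly revenue",
    "shares gained after strong earnings",
    "profits fell sharply this quarter",
    "weak revenue and heavy losses",
    "shares dropped after weak earnings",
    "the company will report earnings on monday",
    "the board meets next quarter",
    "revenue figures are due this week",
]
labels = ["positive"] * 3 + ["negative"] * 3 + ["neutral"] * 3

# TF-IDF features feed a Multinomial Naive Bayes classifier.
pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])

# Assumed search space: only the smoothing parameter alpha is tuned here.
param_grid = {"nb__alpha": [0.1, 0.5, 1.0]}

# Every candidate is scored by cross-validated accuracy.
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(texts, labels)

print(search.best_params_, round(search.best_score_, 3))
```

Putting the vectorizer inside the pipeline means the TF-IDF vocabulary is refitted on each training fold, so the validation fold does not leak into feature extraction.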
2.1.9 Visualization
To monitor sentiment trends over time, create visualizations using line charts, heatmaps, or other suitable graphical
representations. Visualization is used in NLP to provide understanding and insights, enabling us to track changes in sentiment
over time.
In NLP, visualization is a potent tool that gives us a deeper understanding of data patterns. It has nothing to do with the
discipline of Neuro-Linguistic Programming, which specializes in methods for individual growth. Visualizations are used in
NLP for data analysis to make data trends and patterns easier to view and comprehend. Making data-driven decisions and
deriving conclusions from the data can be aided by these visualizations.
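Before a sentiment trend can be drawn as a line chart, the predictions have to be aggregated per time period. The sketch below shows one possible aggregation, a daily net sentiment score. The dated predictions, the +1/0/-1 scoring, and the function name are assumptions for illustration and are not taken from the study.

```python
from collections import defaultdict

# Hypothetical (date, predicted label) pairs.
predictions = [
    ("2024-01-01", "positive"),
    ("2024-01-01", "negative"),
    ("2024-01-01", "positive"),
    ("2024-01-02", "negative"),
    ("2024-01-02", "neutral"),
    ("2024-01-03", "positive"),
]

# Assumed numeric score for each label.
SCORE = {"positive": 1, "neutral": 0, "negative": -1}


def daily_net_sentiment(rows):
    """Average the label scores per day and return them in date order."""
    scores_by_day = defaultdict(list)
    for day, label in rows:
        scores_by_day[day].append(SCORE[label])
    return {day: sum(s) / len(s) for day, s in sorted(scores_by_day.items())}


trend = daily_net_sentiment(predictions)
for day, value in trend.items():
    print(day, round(value, 2))
```

The resulting dictionary maps each date to a value between -1 and 1, which can be passed directly to a plotting library as the x and y series of a line chart.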
2.1.10 Monitoring
To keep the model accurate, keep an eye on it and update it whenever new financial data becomes available. Natural language
processing is a difficult field because NLP models have to understand a wide range of human languages, emotions, context,
ambiguity, domain-specific language, and colloquialisms. It is crucial to keep an eye on these models after deployment to make
sure they function properly in a variety of scenarios. The flowchart in Figure 2 describes the process of the system.
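One simple way to monitor a deployed model is to track its accuracy over the most recent labeled examples and flag it for retraining when that accuracy drops. The sketch below illustrates the idea; the class name, window size, and threshold are hypothetical choices and do not describe the monitoring used in this work.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions (sliding window)."""

    def __init__(self, window=100, threshold=0.75):
        # window and threshold are assumed values chosen for illustration.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether the latest prediction matched the true label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Return the windowed accuracy, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """Flag retraining once the window is full and accuracy is too low."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.accuracy() < self.threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for predicted, actual in [("positive", "positive"), ("negative", "negative"),
                          ("positive", "negative"), ("neutral", "neutral")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_retraining())
```

Waiting until the window is full avoids triggering a retrain on the first few predictions, where a single error would dominate the estimate.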
3.1 Experiment
Algorithm for the proposed work is represented below:
(a) Convert text data into numerical features using the technique TF-IDF.
6. Choose a sentiment analysis model.
7. Train the sentiment analysis model on the training set.
8. Evaluate the model on the testing set:
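The steps that are listed above (TF-IDF feature extraction, model choice, training, and evaluation on a held-out test set) can be sketched with scikit-learn as follows. The toy corpus, the 75/25 split, and all model settings are assumptions for illustration; they are not the dataset or configuration used in the study, and the accuracies printed for such a small corpus are not meaningful.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy corpus with four examples per sentiment class.
texts = [
    "profits rose sharply this quarter",
    "strong growth in quarterly revenue",
    "shares gained after strong earnings",
    "record profits lifted the shares",
    "profits fell sharply this quarter",
    "weak revenue and heavy losses",
    "shares dropped after weak earnings",
    "record losses hit the shares",
    "the company will report earnings on monday",
    "the board meets next quarter",
    "revenue figures are due this week",
    "the annual meeting is held in june",
]
labels = ["positive"] * 4 + ["negative"] * 4 + ["neutral"] * 4

# Split first so the test texts never influence the TF-IDF vocabulary.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# Convert text into numerical TF-IDF features.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# The five model families named in the paper, with assumed settings.
models = {
    "Multinomial NB": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNeighbors": KNeighborsClassifier(n_neighbors=3),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Train each model on the training set and score it on the test set.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, model.predict(X_test))

for name, accuracy in results.items():
    print(f"{name}: {accuracy:.2%}")
```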
were designated as the training set. Figure 3 shows the image of the financial data set used.
at delivering information quickly. While bar graphs compare sentiment categories visually across different entities or time
periods, donut charts make it easy to compare proportions. These visual representations of sentiment data assist in spotting
patterns or trends, making them useful resources for both analysts and stakeholders.
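The proportions that a donut chart displays are simply the share of each sentiment class among all predictions. A minimal sketch of that computation follows; the label list and the function name are hypothetical.

```python
from collections import Counter


def sentiment_shares(labels):
    """Return the fraction of predictions that fall into each sentiment class."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical predicted labels.
predicted = ["positive", "positive", "neutral", "negative"]
shares = sentiment_shares(predicted)
print(shares)
```

These shares can then be passed to a plotting library. In matplotlib, for example, a donut chart is commonly drawn by calling `pie` with a `wedgeprops` width smaller than the radius.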
As mentioned before, the proposed work consists of five Machine Learning algorithms. For each algorithm, the
measures of accuracy and performance are reported in Table 1, and the accuracy in percentage is shown
in Figure 5. Model Performance Evaluation: Multinomial NB achieved the highest accuracy of 81.39%, while the KNeighbors
classifier yielded the lowest accuracy of 73.11%. Random Forest,
Logistic Regression, and Decision Tree reached 80.84%, 81.21%, and 76.79%, respectively. With Grid Search CV, which enabled
us to optimize the model and adjust its hyperparameters, the Multinomial NB model outperformed the other models
in this experiment. In this work, we also segregated positive, negative, and neutral
sentiments and displayed them through the donut chart. Previous research work that used the Multinomial NB
model obtained an accuracy of around 74%; in this research paper it increased to about 82%. Accuracy is the measure most
commonly used to assess the reliability of sentiment classification by machine learning models.
It compares the number of accurately predicted sentiments to the total number of instances. A more thorough evaluation
also considers precision, recall, and F1 score. Recall computes
the percentage of correctly predicted positive sentiments among all actual positive instances, whereas precision measures the
percentage among all positive predictions. An overall assessment of the model's efficacy is given by the F1 score, which strikes
a balance between precision and recall. Across different applications, these metrics work together to help refine
sentiment analysis models.
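The metrics discussed above can be computed directly from true and predicted labels. The sketch below treats one class as the "positive" class of interest and evaluates it one-vs-rest; the label lists and function names are hypothetical, and the zero fallback for empty denominators is an assumed convention.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, target="positive"):
    """One-vs-rest precision, recall, and F1 for the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)  # true positives
    fp = sum(t != target and p == target for t, p in pairs)  # false positives
    fn = sum(t == target and p != target for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical labels for five documents.
y_true = ["positive", "positive", "negative", "neutral", "positive"]
y_pred = ["positive", "negative", "positive", "neutral", "positive"]

print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred))
```

In this example three of five predictions are correct, so accuracy is 0.6. For the positive class there are two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3.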
5 Conclusion
In this study, Sentiment Analysis is performed on a financial dataset collected from multiple social media sites. The main goal is
to evaluate the effects of Machine Learning models and preprocessing methods on diverse and heterogeneous data.
The study categorizes financial data into three sentiment categories: positive, negative, and neutral. Machine
Learning algorithms are used to accomplish this classification task. Comparing the accuracy and execution time of five distinct
algorithms is a noteworthy aspect of the study. To increase the accuracy of the model, the dataset is subjected to several
preprocessing procedures, such as feature selection, data cleaning, and data balancing. Procedures for resampling and balancing
data are essential for improving the efficiency of the algorithms that are used. In this research, only the advanced ML models
such as Multinomial NB, LR, Decision Tree, Random Forest, and KNeighbors are used to extract financial sentiment from
textual data. The novel aspect of the study is the incorporation of several models (Random Forest, Logistic Regression, K-Nearest
Neighbors, Decision Tree, and Naive Bayes) that had not been examined in previously published works. Interestingly, in the
proposed work, the accuracy score of the Naive Bayes model increased from 65% to 82%. The accuracy scores of the built models are
as follows, in ascending order: 73.11% for KNeighbors, 76.79% for Decision Tree, 80.84% for Random Forest, 81.21% for
Logistic Regression, and 81.39% for Multinomial NB. Future research will concentrate on employing optimization
algorithms such as Swarm and Bat algorithms to attain enhanced accuracy outcomes. This study sheds light on the difficulties
associated with Sentiment Analysis in the financial sector and emphasizes the significance of algorithm selection, preprocessing,
and upcoming optimization initiatives for improving accuracy.
References
1) Ahmad HO, Umar SU. Sentiment Analysis of Financial Textual data Using Machine Learning and Deep Learning Models. Informatica. 2023;47(5):153–158. Available from: https://dx.doi.org/10.31449/inf.v47i5.4673.
2) Kaman S. News Sentiment Analysis By Using Deep Learning Framework. ScienceOpen Preprints. 2020;p. 1–4. Available from: http://dx.doi.org/10.14293/s2199-1006.1.sor-.ppcv5ia.v2.
3) Yadav A, Jha CK, Sharan A, Vaish V. Sentiment analysis of financial news using unsupervised approach. Procedia Computer Science. 2020;167:589–598. Available from: https://dx.doi.org/10.1016/j.procs.2020.03.325.
4) Deveikyte J, Geman H, Piccari C, Provetti A. A sentiment analysis approach to the prediction of market volatility. Frontiers in Artificial Intelligence. 2022;5:1–10. Available from: https://dx.doi.org/10.3389/frai.2022.836809.
5) Kalbande A. Summarization and Sentiment Analysis for Financial News. International Journal for Research in Applied Science and Engineering Technology. 2021;9(10):88–90. Available from: https://dx.doi.org/10.22214/ijraset.2021.38345.
6) Fazlija B, Harder P. Using Financial News Sentiment for Stock Price Direction Prediction. Mathematics. 2022;10(13):1–20. Available from: https://dx.doi.org/10.3390/math10132156.
7) Chan SWK, Chong MWC. Sentiment analysis in financial texts. Decision Support Systems. 2017;94:53–64. Available from: https://dx.doi.org/10.1016/j.dss.2016.10.006.
8) Arratia A, Avalos G, Cabaña A, Duarte-López A, Renedo-Mirambell M. Sentiment Analysis of Financial News: Mechanics and Statistics. In: Data Science for Economics and Finance. Springer, Cham. p. 195–216. Available from: http://dx.doi.org/10.1007/978-3-030-66891-4_9.
9) Zhang B, Yang H, Liu XY. Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models. SSRN Electronic Journal. 2023;p. 1–7. Available from: http://dx.doi.org/10.2139/ssrn.448983.
10) Gupta A, Dengre V, Kheruwala HA, Shah M. Comprehensive review of text-mining applications in finance. Financial Innovation. 2020;6(1):1–25. Available from: https://dx.doi.org/10.1186/s40854-020-00205-1.