
INDIAN JOURNAL OF SCIENCE AND TECHNOLOGY

RESEARCH ARTICLE

Discovering Untapped Potential in Finance Markets Using NLP-Driven Sentiment Analysis
Salma S Shahapur1*, Anil Koralli2, Geeta Chippalakatti1, Megha Mahadev Balikai1, Daneshwari Mudalagi1, Ryan Dias3, Sahana Devali1, Komal Wajantari1

1 Department of Electronics and Communication, Jain College of Engineering, Karnataka, India
2 Associate Professor, Department of Panchakarma, KAHER's Shri B M K Ayurveda Mahavidyalay, Karnataka, India
3 Department of Computer Science and Engineering, Jain College of Engineering, Karnataka, India

OPEN ACCESS
Received: 20-11-2023; Accepted: 03-05-2024; Published: 29-05-2024
*Corresponding author. [email protected]
Funding: None
Competing Interests: None
Copyright: © 2024 Shahapur et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Citation: Shahapur SS, Koralli A, Chippalakatti G, Balikai MM, Mudalagi D, Dias R, Devali S, Wajantari K (2024) Discovering Untapped Potential in Finance Markets Using NLP-Driven Sentiment Analysis. Indian Journal of Science and Technology 17(21): 2240-2249. https://doi.org/10.17485/IJST/v17i21.2912
Published By Indian Society for Education and Environment (iSee)
ISSN: Print 0974-6846; Electronic 0974-5645

Abstract
Objectives: The main objective of Sentiment Analysis on financial data is to decode sentiments through the use of Natural Language Processing (NLP) and Machine Learning (ML). This provides insights into market trends that are essential for risk management and well-informed financial decisions. Extracting sentiment from financial writing gives stakeholders practical guidance for wise investments in dynamic markets. A further objective is to increase the accuracy of Multinomial Naïve Bayes (Multinomial NB). In this work, purely ML models are used to exceed the accuracy of previous works, which reached 74%; here the accuracy increases to approximately 82%. Methods: This research work employs Machine Learning approaches to extract sentiment polarity (positive, negative, and neutral) from financial text data. The ML models used for Sentiment Analysis are Multinomial Naïve Bayes (Multinomial NB), Logistic Regression (LR), KNeighbors, Decision Tree, and Random Forest. Findings: Studies conducted in the financial sector have shown that the sentiment and informational substance of stock news have a large impact on market factors such as trading volume, volatility, stock prices, and corporate earnings. This study uses approximately 6000 newspaper articles to extract polarity from financial text data. Incorporating Multinomial NB with Natural Language Processing techniques results in an improvement in sentiment classification performance among all the models used, whereas previous works obtained the lowest accuracy of 74% with this model. Novelty: This research introduces advanced ML models for extracting financial sentiment from text, including Random Forest, LR, KNeighbors, Decision Tree, and Multinomial NB, previously unexplored in similar studies. Notably, the Multinomial NB model accuracy rose significantly from 65% to 82%. The study also presents sentiment segregation (positive, negative, neutral) visualized in a donut chart.


Keywords: Sentiment Analysis; Machine Learning; Natural Language
Processing; Financial data; Stock market prediction

1 Introduction
The proliferation of social media users in recent years has resulted in massive amounts
of data being generated by technological advancements. One of the most abundant
types of user data is text comments, which are large volumes of writing. Researchers
are beginning to focus on this useful data from social media. Sentiment Analysis is a
subfield of NLP that analyses and extracts feelings from text data by determining the
degree of polarity in the sentence. It is also known as opinion analysis or opinion mining.
Institutions have used a variety of research methods to ascertain the public perception
of a particular issue.
Determining the attitude or feeling expressed in a text, which may be positive, negative, or neutral, is known as polarity analysis. The analysis of people's views, feelings, and opinions towards entities and their characteristics as expressed in written texts is a subfield of text classification. Since even humans may find it difficult to interpret the emotion contained in a text, Sentiment Analysis can be problematic. Furthermore, it is not always easy to obtain high accuracy in the analysis process because a single text may contain multiple emotions. Accordingly, one of the biggest challenges in any text-based classification task is figuring out which characteristics or indicators to use to tell the classes apart. In this section, the focus is on the latest contributions in this field. G. Mostafa et al. applied several Machine Learning algorithms to Twitter data, using numerous preprocessing and encoding steps to boost accuracy rates. The obtained accuracies were then compared and reported. Their experiments show that, in comparison to other algorithms, the Neural Network algorithm offers exceptional accuracy. Among the Machine Learning algorithms used are KNeighbors, Support Vector Machines (SVM), and Logistic Regression (LR) (1).
Sentiment Analysis is a useful technique for understanding the beliefs and
motivations of others. The goal of this research is to use probabilistic ratings to reliably
identify and categorize news stories into positive, negative, and neutral categories (2) .
According to classical finance theory, prices reflect the balance between potential risks
and expected returns, and investors are logical, emotionless entities. This eliminates
the possibility that price decisions will be influenced by investor sentiment. However,
sentimental and logical investors are both acknowledged in modern behavioral
finance (3) .
The financial industry and academics have focused on the topic of stock market
return and volatility forecasting for more than 60 years. Sentiment Analysis from
financial news articles and online platforms such as Twitter has great potential to
improve trading strategies. Additionally, Sentiment Analysis of financial news articles
can provide predictive power useful for managers' risk assessment and portfolio
management. Natural Language Processing (NLP) techniques have been used more
frequently in recent academic studies to measure the impact of sentiment on stock
market performance from news, financial reports, and social media posts specific to
a given company.
Tetlock’s 2007 groundbreaking study examined the possible relationships between
sentiment in the media and changes in the stock market, using data from the Wall
Street Journal. According to his research, a greater degree of pessimism in the media
affected market prices negatively. Afterwards, Tetlock et al. (2008) investigated whether
corporate financial news could predict a company's accounting earnings and stock
returns using a bag-of-words model. Their findings suggested that unfavorable wording
in news stories about particular companies could be a sign of declining company profits.


Market prices, however, frequently showed an underreaction to the information buried in negative language, despite this predictive ability (4). Users can determine whether the overall sentiment of an article is neutral, positive, or negative by looking at certain terms and phrases. Although there are sophisticated algorithms available to identify the numerous sentiments that can be communicated in an article, this tool streamlines the assessment process by giving users a simple result and empowering
them to make their own more nuanced decisions (5) . Figure 1 shows the arrangement of Sentiment Analysis in Natural Language
Processing. Basic problems like classifying sentiment in financial text data are at the heart of the study of Sentiment Analysis
in finance. It also entails using Sentiment Analysis to forecast variables like stock prices and market trends. There is also an
emphasis on using market time series data, among other techniques, to predict the sentiment of future news texts (6) .

Fig 1. The arrangement of Sentiment Analysis in Natural Language Processing

Users can determine whether the overall sentiment of an article is neutral, positive, or negative by looking at certain terms and
phrases. Although there are sophisticated algorithms available to identify the numerous sentiments that can be communicated
in an article, this tool streamlines the assessment process by giving users a simple result and empowering them to make their
own more nuanced decisions (7) . The primary goal of this paper is to provide users of sentiment data with recommendations on
the factors to take into account when creating sentiment indicators. The focus is on sentiment indicators that are taken from
financial news and used in financial forecasting (8) . Natural Language Processing research has focused a lot of attention on the
task of Sentiment Analysis, especially in the financial sector. Several literary works employ diverse approaches to carry out
Financial Sentiment Analysis, spanning from lexicon-based methods to Machine Learning and Deep Learning strategies (9) .
In the financial domain, stock market prediction is one of the applications in which Sentiment Analysis has been used
to predict future stock market trends and prices from the analysis of financial news articles. Joshi et al. compared three ML
algorithms and observed that Random Forest and Support Vector Machine (SVM) performed better than Naïve Bayes (NB).
Renault used StockTwits (a platform where people share ideas about the stock market) as a data source and applied five
algorithms, namely NB, a maximum entropy method, a linear SVM, Random Forest, and a multilayer perceptron and concluded
that the maximum entropy and linear SVM methods gave the best results. Over the years, researchers have combined Deep
Learning methods with traditional Machine Learning techniques (e.g., construction of sentiment lexicon), thus obtaining more
promising results (10) .
One popular technique that is being used more and more to gauge how social media users feel about a topic is Sentiment
Analysis. Data mining is the method most often used to perform Sentiment Analysis. Our primary concept involves utilizing
Machine Learning to ascertain, from investor messages, what they anticipate will happen to stock prices and the market in
general. The rationale behind our choice of Machine Learning methodology over data mining is that, particularly in large data
sets, the most difficult task in data mining is identifying features and choosing the best of those features.
The work by Loughran and McDonald is among the most well-known in this field. Covering 1994 to 2008, they composed a financial lexicon and manually compiled six word lists (positive, negative, litigious, uncertain, strong modal, and weak modal) using the US Securities and Exchange Commission gateway. An automatic Chinese financial lexicon constructor is proposed by Mao et al. In an effort to automatically create a Chinese financial lexicon, their suggested procedure examines numerous corpora that are categorized as positive or negative.
In previously published works, the Multinomial NB model was also employed; however, of all the models used there, it demonstrated the least accuracy, approximately 74%. In this research, Multinomial NB demonstrates the highest accuracy, nearly 82%. According to earlier research, the model correctly classifies about 80% of positive sentiments but struggles with the "neutral" and "negative" classes, showing many misclassifications in these categories. The previous works relied only on social media comments, whereas this research uses both social media comments and news articles.


By increasing the accuracy of sentiment analysis in financial data, the project helps to improve risk management and
financial market decision-making. More sophisticated machine learning models and natural language processing methods are
used to decode emotions with greater accuracy. Multinomial Naïve Bayes, Logistic Regression, KNeighbors, Decision Tree,
and Random Forest are a few of the machine learning algorithms that are used to extract sentiment polarity and improve
classification accuracy. The research work also provides insights into the relationship between sentiment and market variables
such as volatility, trading volume, and stock prices.
Advanced machine learning models, such as Random Forest, Logistic Regression, KNeighbors, Decision Tree, and
Multinomial Naïve Bayes, are used in this work for sentiment analysis in finance. Compared to earlier studies (1) , (10) , the research
achieves a notable improvement in sentiment classification accuracy by utilizing these techniques. Sentiment extraction from
financial text data is improved by the application of Natural Language Processing techniques, particularly the Multinomial NB
model. Additionally, the work provides stakeholders with a clear overview by visually representing sentiment distribution using
a donut chart. Also, the work offers insightful information that can be used to manage risk and make wise financial decisions in
volatile markets. In this work, the authors use purely ML models to exceed the accuracy of previous works (1), (10), which reached 74%; here the accuracy increases to approximately 82%. Notably, the Multinomial NB model accuracy rose significantly from 65% to 82%.

2 Methodology
Financial data Sentiment Analysis is a methodical and structured process that uses NLP techniques to extract meaningful
information from textual data related to investor sentiment, financial markets, and economic trends. This all-inclusive approach
is intended to carefully evaluate and measure the opinions stated in financial texts, such as news stories, tweets, earnings
reports, and other written materials. The main objective is to learn more about how sentiment affects financial markets. By
carefully examining and interpreting the sentiment contained in financial data, the proposed work enables decision-makers to
identify new trends, make well-informed decisions, and even forecast market moves. Thus, more knowledgeable and successful
financial and investment strategies are produced. All things considered, this methodology is a powerful instrument in the
field of financial analysis and risk assessment, offering a methodical and perceptive approach to the dynamic field of Financial
Sentiment Analysis.

2.1 Steps involved in Sentiment Analysis of Financial Data using Natural Language Processing
2.1.1 Data Collection
Data collection is unquestionably one of the most important phases of the Sentiment Analysis process. The quality, consistency, and labeling of the collected data, as well as how it has been annotated, are critical to the later stages. To start, assemble a varied assortment of financial data sources. These could be financial news stories, posts on social media, earnings reports, or any other textual information related to the financial industry. Sources can include news websites, social networking sites, financial reports, and a variety of other repositories; they should be carefully chosen.
By analyzing news, financial reports, and social media content, sentiment analysis is a tool that can be used to gauge public
opinion and attitudes toward specific industries or companies. When predicting market trends and making wise investment
decisions, these insights are a priceless resource. Businesses can adjust their strategies ahead of changes in the market by
understanding how the public views crucial economic factors like taxes, regulations, and economic indicators.

2.1.2 Data Preprocessing


Text Cleaning: Text cleaning is a task that needs a precise goal, so having a thorough understanding of the intended outcome is crucial. Preparing the text data for analysis entails a painstaking process of removing special characters, symbols, and other superfluous characters.
Tokenization: A basic procedure called tokenization turns unprocessed data into a useful data string. Tokenization is widely
known for its use in cybersecurity and in the production of Non-Fungible Tokens (NFTs), but it is also very important in NLP.
Tokenization is a technique used in NLP to divide sentences and paragraphs into smaller units so that meaning can be assigned
to the text more easily. The segmentation of words in languages where spaces or punctuation marks do not clearly delineate word
boundaries is one significant challenge in this process. This problem is especially common in languages with a lot of symbols,
like Thai, Korean, Chinese, and Japanese. Tokenization and word separation are accomplished through a variety of techniques,
such as sub-word tokenization, character tokenization, and word tokenization.
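As a concrete illustration of this step, the following is a minimal tokenization sketch using NLTK's word tokenizer; the paper does not name a specific tokenizer, so this library choice and the sample sentence are assumptions.

```python
# Minimal word-tokenization sketch (illustrative; the specific tokenizer used
# in this work is not stated).
import nltk
nltk.download("punkt", quiet=True)          # one-time download of tokenizer models
from nltk.tokenize import word_tokenize

sentence = "Quarterly revenue rose 12%, beating analysts' estimates."
tokens = word_tokenize(sentence)
print(tokens)   # e.g. ['Quarterly', 'revenue', 'rose', '12', '%', ',', ...]
```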
Stop Word Removal: In NLP, eliminating stop words is not always necessary; it all depends on the particular task at hand.
Stop words are frequently removed from texts for tasks like text classification, where the goal is to group or categorize the text.


The purpose of this removal is to give more weight to the words that convey the text's main idea. As an illustration, consider the text "There is a pen on the table." The words "there," "is," "a," "on," and "the" in this sentence are safe to remove during parsing because they do not add much to the overall meaning of the statement. On the other hand, the keywords that convey the most important details about the statement's subject matter are words like "pen" and "table". The particular NLP task and its goals should be carefully considered before deciding whether or not to remove stop words.
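A minimal sketch of stop-word removal using NLTK's English stop-word list, applied to the example sentence above; the library choice is an assumption, since the paper does not specify one.

```python
# Stop-word removal sketch using NLTK's English stop-word list.
import nltk
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
tokens = ["there", "is", "a", "pen", "on", "the", "table"]
filtered = [t for t in tokens if t.lower() not in stop_words]
print(filtered)   # content-bearing words such as 'pen' and 'table' remain
```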
Stemming/Lemmatization: Word Normalization techniques like Stemming and Lemmatization both seek to reduce words
to their most basic form. As part of a text normalization process called stemming, prefixes and suffixes are removed from words
based on a list of frequently occurring affixes. This rule-based method eliminates suffixes such as ”ing,” ”ly,” ”es,” and ”s,” among
others, to simplify words. Lemmatization, on the other hand, is a more methodical and structured way to determine a words root
form. It is based on morphological analysis, which considers word structure and grammatical relationships, and vocabulary,
which takes into account the significance of words within a dictionary.
These methods are used to improve tasks related to NLP in terms of accuracy and efficiency.
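The difference between the two normalization techniques can be seen in a short sketch; PorterStemmer and WordNetLemmatizer are common NLTK choices and are assumed here, as the paper does not name specific implementations.

```python
# Stemming vs. lemmatization sketch (illustrative NLTK implementations).
import nltk
nltk.download("wordnet", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

words = ["trading", "volatilities", "earnings", "increases"]
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in words])          # rule-based suffix stripping
print([lemmatizer.lemmatize(w) for w in words])  # dictionary-based base forms
```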

2.1.3 Text Labelling


Adding sentiment labels to data, which generally categorize it as positive, negative, or neutral, is a necessary step in producing a
labeled dataset that NLP can use for training and assessment. An essential component of organizing and extracting information
from text data is NLP labeling. It entails identifying and giving particular labels to textual elements like entities, events, or
sentiments. The process of annotation lays the groundwork for additional study and understanding of the text’s content. In
NLP, data annotation is very important since it improves machine learning model accuracy. By methodically classifying and
labeling textual data, we enable machines to comprehend and interpret human language more effectively. The effective analysis
of textual content depends on this increase in comprehension.
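A small, purely hypothetical example of the ternary labeling scheme, with string labels mapped to integers for the classifiers; the sentences and the particular integer encoding are illustrative assumptions.

```python
# Hypothetical labeled rows illustrating positive / negative / neutral labels.
import pandas as pd

data = pd.DataFrame({
    "sentence": [
        "The company reported record profits this quarter.",   # positive
        "Shares plunged after the earnings miss.",              # negative
        "The board meeting is scheduled for Monday.",           # neutral
    ],
    "sentiment": ["positive", "negative", "neutral"],
})

label_map = {"negative": 0, "neutral": 1, "positive": 2}   # assumed encoding
data["label"] = data["sentiment"].map(label_map)
print(data)
```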

2.1.4 Feature Extraction


Feature extraction is a necessary step in NLP; two widely used approaches are Term Frequency-Inverse Document Frequency (TF-IDF) and Bag-of-Words. These techniques are essential for transforming textual data into a format that machine learning algorithms can handle efficiently.
Bag of Words: It is a straightforward but powerful technique that builds a vector of word frequencies by counting the
frequency of each word in a document. It is useful for a number of NLP tasks, such as text classification, similarity analysis, and
clustering, even though it ignores word order and grammar.
TF-IDF Vectorizer: Conversely, TF-IDF considers both term frequency, or how frequently a word appears in a document, and inverse document frequency, or how rare a word is across the entire corpus. As a result, words are represented in a more nuanced way, highlighting those that are significant for a particular document relative to the dataset as a whole.
These methods are crucial for reducing the dimensionality of text data and giving Machine Learning algorithms useful features, which increases the algorithms' efficacy in a range of NLP applications.
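A brief sketch contrasting the two representations with scikit-learn; the three-sentence corpus is made up for illustration, while the TF-IDF Vectorizer itself is the one later applied to the 'Sentence' column.

```python
# Bag-of-Words vs. TF-IDF features with scikit-learn (toy corpus).
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "stock prices rose sharply",
    "stock prices fell sharply",
    "earnings were flat this quarter",
]

bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)        # raw word counts per document
print(bow.get_feature_names_out())
print(X_bow.toarray())

tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)    # counts reweighted by rarity across the corpus
print(X_tfidf.shape)
```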

2.1.5 Model Selection


Optimizing the performance of ML models requires careful consideration of training hyperparameters and appropriate
architecture parameters. Generalization metrics have been investigated recently as a means of directing model selection. In
contrast to Computer Vision (CV) tasks, NLP tasks have received less attention. One of our goals is to use metrics that, without
requiring access to data, can be used to predict test errors.

2.1.6 Training the Model


Utilizing the labeled dataset, train the selected model. To assess the performance of the model, it is crucial to separate the data
into subsets for training and testing. The data is given to the NLP model in this stage of training so that it can learn. This
usually entails applying different learning algorithms, such as Deep Learning (DL) and Machine Learning, and splitting data
into training, validation, and testing sets. To minimize errors and optimize performance, the model will modify its weights and
biases in response to feedback from the data and the loss function. It is critical to keep a careful eye on the training process and
evaluate the model using metrics like recall, accuracy, precision, F1 score, perplexity, and so forth.
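A minimal training sketch, assuming a labeled DataFrame `data` with 'sentence' and 'label' columns as in the labeling example above and the 80/20 split used in this work; Multinomial NB stands in here for any of the five classifiers.

```python
# Train/test split, TF-IDF vectorization, and model fitting (sketch).
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    data["sentence"], data["label"], test_size=0.2, random_state=42)

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)   # fit on training text only
X_test_vec = vectorizer.transform(X_test)         # reuse the fitted vocabulary

model = MultinomialNB()
model.fit(X_train_vec, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test_vec)))
```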

2.1.7 Hyperparameter Tuning


Finding the best values for a learning algorithm's hyperparameters is known as hyperparameter tuning. When these parameters are optimized, the model performs better and produces fewer errors based on a predetermined loss function. It is crucial to remember that although hyperparameters precisely specify the configuration of the algorithm, the learning algorithm itself concentrates on minimizing the loss function based on the input data. Automated hyperparameter tuning techniques use algorithms to find these ideal values; today, Bayesian optimization, random search, and Grid Search are a few of the most widely used automated methods. In this work, GridSearchCV is used to optimize hyperparameters and find the ideal model configuration.
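A hyperparameter-tuning sketch with GridSearchCV, reusing the TF-IDF training features from the sketch above; the alpha grid shown is an illustrative assumption, not the exact grid used in the paper.

```python
# GridSearchCV over the Multinomial NB smoothing parameter (illustrative grid).
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}   # Laplace/Lidstone smoothing
grid = GridSearchCV(MultinomialNB(), param_grid, cv=5, scoring="accuracy")
grid.fit(X_train_vec, y_train)

print("Best alpha:", grid.best_params_["alpha"])
print("Best cross-validated accuracy:", grid.best_score_)
best_model = grid.best_estimator_
```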

2.1.8 Model Evaluation


Evaluate the model's performance with suitable metrics (recall, F1-score, accuracy, precision) and make sure it is robust by using cross-validation. Evaluating a language model is essential for comparing various models during experimentation and choosing the best among pre-trained models.
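An evaluation sketch that reports per-class precision, recall, and F1 alongside accuracy, and adds a cross-validated score for robustness; it assumes the variables from the training and tuning sketches above.

```python
# Per-class metrics on the held-out test set plus 5-fold cross-validation.
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score

y_pred = best_model.predict(X_test_vec)
print(classification_report(y_test, y_pred,
                            target_names=["negative", "neutral", "positive"]))

scores = cross_val_score(best_model, X_train_vec, y_train, cv=5, scoring="f1_macro")
print("5-fold macro-F1: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```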

2.1.9 Visualization
To monitor sentiment trends over time, create visualizations using line charts, heatmaps, or other suitable graphical
representations. Visualization is used in NLP to provide understanding and insights, enabling us to track changes in sentiment
over time.
In NLP, visualization is a potent tool that gives us a deeper understanding of data patterns. It has nothing to do with the
discipline of Neuro-Linguistic Programming, which specializes in methods for individual growth. Visualizations are used in
NLP for data analysis to make data trends and patterns easier to view and comprehend. Making data-driven decisions and
deriving conclusions from the data can be aided by these visualizations.
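A small sketch of a sentiment-trend line chart of the kind described above; the dates and daily counts are made up solely for illustration.

```python
# Line chart of daily sentiment counts over time (illustrative data).
import pandas as pd
import matplotlib.pyplot as plt

trend = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=5, freq="D"),
    "positive": [42, 55, 47, 60, 58],
    "negative": [12, 18, 9, 14, 11],
    "neutral":  [30, 25, 33, 28, 31],
}).set_index("date")

trend.plot(kind="line", marker="o")
plt.ylabel("Number of articles")
plt.title("Sentiment trend over time")
plt.tight_layout()
plt.show()
```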

2.1.10 Monitoring
To keep the model accurate, monitor it and update it whenever new financial data becomes available. Natural language processing is a difficult field because NLP models have to understand a wide range of human languages, emotions, context, ambiguity, domain-specific language, and colloquialisms. It is crucial to monitor these models after deployment to make sure they function properly in a variety of scenarios. The flowchart in Figure 2 describes the process of the system.

3 Results and Discussion


In the following subsections, the experiments and results are presented. The algorithm for Sentiment Analysis of financial data is introduced in subsection 3.1, the experimental setup is given in subsection 3.1.1, and the dataset is described in detail in subsection 3.1.2. Lastly, the results and analysis of all the classifiers are presented in Section 4.

3.1 Experiment
Algorithm for the proposed work is represented below:

1. Import necessary libraries and datasets.
2. Preprocess the financial text data:
   (a) Tokenize the text.
   (b) Remove stop words, punctuation, and special characters.
   (c) Perform stemming or lemmatization.
3. Label the data with sentiment labels (positive, negative, neutral).
4. Split the dataset into training and testing sets.
5. Vectorize the text data:
   (a) Convert text data into numerical features using the technique TF-IDF.
6. Choose a sentiment analysis model.
7. Train the sentiment analysis model on the training set.
8. Evaluate the model on the testing set:
   (a) Calculate accuracy, precision, recall, and F1 score.
9. Fine-tune the model if necessary:
   (a) Adjust hyperparameters or try different models to improve performance.
10. Apply the trained model to new financial text data for sentiment prediction.
11. Analyze and interpret sentiment results for decision-making in a financial context.

Fig 2. The described process for the system
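The steps above can be condensed into a single scikit-learn Pipeline; the file name, column names, and parameter grid below are assumptions made for illustration, and Multinomial NB again stands in for any of the five classifiers.

```python
# Compact end-to-end sketch of the algorithm using a scikit-learn Pipeline.
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

df = pd.read_csv("financial_sentiment.csv")   # assumed columns: 'Sentence', 'Sentiment'
X_train, X_test, y_train, y_test = train_test_split(
    df["Sentence"], df["Sentiment"], test_size=0.2,
    random_state=42, stratify=df["Sentiment"])

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("clf", MultinomialNB()),
])
grid = GridSearchCV(pipe, {"clf__alpha": [0.1, 0.5, 1.0]}, cv=5, scoring="accuracy")
grid.fit(X_train, y_train)

print("Best cross-validated accuracy:", grid.best_score_)
print(classification_report(y_test, grid.predict(X_test)))
```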

3.1.1 Experimental Setup


This section gives more detail about the proposed experiments using the news articles dataset. The main goal of this paper is to find out whether the applied models could improve Sentiment Analysis accuracy for messages from news articles. Deep Learning aims to mimic the principles of hierarchical learning found in the human brain, and using Deep Learning for feature extraction introduces non-linearity into the analysis of Big Data.
Models for Sentiment Analysis can be created using a variety of methods, such as supervised learning with labeled data, incorporating outside knowledge about word polarities, or combining these methods. We chose to examine the efficacy of Logistic Regression, among the other classifiers, to thoroughly investigate various approaches.

3.1.2 Data Set


This experiment's dataset comes from a variety of social media sites. The texts are preprocessed to make sure they are suitable for the experiments; texts that were too long, contained too many special characters, or showed too many subtle emotional overtones were removed. The dataset was gathered from several social media platforms that allow users to post financial commentary. It consists of two primary components: the comments themselves and the labels that go along with them. Approximately 6,000 records, a mixture of neutral, negative, and positive data, fit the parameters of the proposed algorithms. Sentiment Analysis in this work focuses on ternary classification, which divides sentiments into three categories: negative, neutral, and positive. Eighty percent of the data in the dataset were designated as the training set, and the remaining twenty percent were designated as the test set. Figure 3 shows an image of the financial dataset used.

Fig 3. An image of Data Set

3.1.3 Exploratory Data Analysis (EDA)


EDA is the procedure by which data analysts become acquainted with their dataset to derive conclusions and create theories. Usually, this process makes use of visualizations and descriptive statistics. Here, Word Clouds are generated for the sentence variable in our analysis. Word Clouds are a useful tool for visualizing text data because they show the frequency or importance of each word through its size and color. A Donut Chart was also created to display the sentiment analysis data. The proportions of categorical data can be seen in Donut Charts, where the size of each segment represents the proportion of the corresponding category. In this research, over 14.7% of the dataset was comprised of negative sentiment. The next step was Model Building, where the TF-IDF Vectorizer was used to calculate the significance of each word in the 'Sentence' column. The data was divided into training (80%) and test (20%) sets. During this stage, five different models (KNeighbors Classifier, Decision Tree Classifier, Random Forest Classifier, Multinomial NB, and Logistic Regression) are used to predict sentiment, and each one's performance is assessed based on accuracy. Figure 4 shows the Donut chart displaying the Sentiment Analysis data.

Fig 4. Donut chart displaying Sentiment Analysis data
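A minimal sketch of the two EDA visuals described above, a word cloud of the 'Sentence' column and a donut chart of the sentiment distribution; it assumes the DataFrame `df` from the pipeline sketch and the third-party wordcloud package.

```python
# Word cloud of the text column and a donut chart of sentiment proportions.
import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = " ".join(df["Sentence"].astype(str))
wc = WordCloud(width=800, height=400, background_color="white").generate(text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

counts = df["Sentiment"].value_counts()
plt.pie(counts, labels=counts.index, autopct="%1.1f%%",
        wedgeprops={"width": 0.4})   # a wedge width below 1 turns the pie into a donut
plt.title("Sentiment distribution")
plt.show()
```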

4 Results and Analysis


An analysis of financial data sentiment revealed a preponderance of positive sentiments, indicating hope and faith in market
developments. Furthermore, neutral feelings were observed, indicating a measured approach or a range of opinions. On
the other hand, negative opinions were less common and pointed out potential hazards or areas of worry in the financial
environment. Donut charts and bar graphs are popular ways to present results because they are easy to understand and efficient at delivering information quickly. While bar graphs compare sentiment categories visually across different entities or time
periods, donut charts make it easy to compare proportions. These visual representations of sentiment data assist in spotting
patterns or trends, making them useful resources for both analysts and stakeholders.
As mentioned before, the proposed work consists of five Machine Learning Algorithms. Finally, for each algorithm, the
measures of accuracy and performance have been computed in Table 1. The accuracy in percentage for each algorithm is shown
in Figure 5. Model Performance Evaluation: While Multinomial NB achieved the highest accuracy of 81.39%, the KNeighbors classifier yielded an accuracy of only 73.11%. The remaining models, Random Forest, Logistic Regression, and Decision Tree, achieved 80.84%, 81.21%, and 76.79%, respectively. With GridSearchCV, which enabled us to optimize the model and adjust its hyperparameters, the Multinomial NB model achieved the highest accuracy of 81.39% in this experiment, outperforming the other models. In this work, positive, negative, and neutral sentiments were also segregated and displayed through the donut chart. Compared to the previous research, which used the Multinomial NB model and obtained an accuracy of around 74%, in this paper the accuracy increased to approximately 82%. In sentiment analysis, machine learning models are commonly assessed through prediction accuracy, which measures the reliability of sentiment classification on a dataset by comparing accurately predicted sentiments to the total number of instances. Although accuracy is an important metric to consider, a more thorough evaluation can be obtained by analysing precision, recall, and F1 score as well. Recall computes the percentage of correctly predicted positive sentiments among all actual positive instances, whereas precision measures that percentage among all positive predictions. An overall assessment of the model's efficacy is given by the F1 score, which strikes a balance between recall and precision. Across different applications, these metrics work together to help refine sentiment analysis models.
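For reference, the standard definitions of the metrics discussed above can be written out explicitly, where TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives for a given class (averaged across classes in the multi-class setting):

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}, \qquad
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall}    = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```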

Fig 5. Accuracy in percentage for each Algorithm

Table 1. Outcomes of the Algorithm


Model Test Accuracy Score
Random Forest 80.84%
Logistic Regression 81.21%
Multinomial NB 81.39%
Decision Tree 76.79%
KNeighbors 73.11%

5 Conclusion
Sentiment Analysis is done in this study using a financial dataset that was collected from multiple social media sites. Evaluating
the effects of Machine Learning models and preprocessing methods on diverse and heterogeneous data is the main goal.
Financial data must be categorized into three sentiment categories (positive, negative, and neutral) for the study. Machine
Learning algorithms are used to accomplish this classification task. Comparing the accuracy and execution time of five distinct
algorithms is a noteworthy aspect of the study. To increase the accuracy of the model, the dataset is subjected to several
preprocessing procedures, such as feature selection, data cleaning, and data balancing. Procedures for resampling and balancing data are essential for improving the efficiency of the algorithms that are used. In this research, only advanced ML models, namely Multinomial NB, LR, Decision Tree, Random Forest, and KNeighbors, are used to extract financial sentiment from textual data. The novel aspect of the study is the incorporation of several models (Random Forest, Logistic Regression, K-Nearest Neighbors, Decision Tree, and Naive Bayes) that had not been examined in previously published works. Interestingly, in the proposed work, the accuracy score of the Naive Bayes model increased from 65% to 82%. The built models' accuracy scores are
as follows, in ascending order: 73.11% for the KNeighbors, 76.79% for Decision Tree, 80.84% for Random Forest, 81.21% for
Logistic Regression, and 81.39% for Multinomial NB. Subsequently, the research will concentrate on employing optimization
algorithms such as Swarm and Bat algorithms to attain enhanced accuracy outcomes. This study sheds light on the difficulties
associated with Sentiment Analysis in the financial sector and emphasizes the significance of algorithm selection, preprocessing,
and upcoming optimization initiatives for improving accuracy.

References
1) Ahmad HO, Umar SU. Sentiment Analysis of Financial Textual data Using Machine Learning and Deep Learning Models. Informatica. 2023;47(5):153–158. Available from: https://dx.doi.org/10.31449/inf.v47i5.4673.
2) Kaman S. News Sentiment Analysis By Using Deep Learning Framework. ScienceOpen Preprints. 2020;p. 1–4. Available from: http://dx.doi.org/10.14293/s2199-1006.1.sor-.ppcv5ia.v2.
3) Yadav A, Jha CK, Sharan A, Vaish V. Sentiment analysis of financial news using unsupervised approach. Procedia Computer Science. 2020;167:589–598. Available from: https://dx.doi.org/10.1016/j.procs.2020.03.325.
4) Deveikyte J, Geman H, Piccari C, Provetti A. A sentiment analysis approach to the prediction of market volatility. Frontiers in Artificial Intelligence. 2022;5:1–10. Available from: https://dx.doi.org/10.3389/frai.2022.836809.
5) Kalbande A. Summarization and Sentiment Analysis for Financial News. International Journal for Research in Applied Science and Engineering Technology. 2021;9(10):88–90. Available from: https://dx.doi.org/10.22214/ijraset.2021.38345.
6) Fazlija B, Harder P. Using Financial News Sentiment for Stock Price Direction Prediction. Mathematics. 2022;10(13):1–20. Available from: https://dx.doi.org/10.3390/math10132156.
7) Chan SWK, Chong MWC. Sentiment analysis in financial texts. Decision Support Systems. 2017;94:53–64. Available from: https://dx.doi.org/10.1016/j.dss.2016.10.006.
8) Arratia A, Avalos G, Cabaña A, Duarte-López A, Renedo-Mirambell M. Sentiment Analysis of Financial News: Mechanics and Statistics. In: Data Science for Economics and Finance. Springer, Cham. p. 195–216. Available from: http://dx.doi.org/10.1007/978-3-030-66891-4_9.
9) Zhang B, Yang H, Liu XY. Instruct-FinGPT: Financial Sentiment Analysis by Instruction Tuning of General-Purpose Large Language Models. SSRN Electronic Journal. 2023;p. 1–7. Available from: http://dx.doi.org/10.2139/ssrn.448983.
10) Gupta A, Dengre V, Kheruwala HA, Shah M. Comprehensive review of text-mining applications in finance. Financial Innovation. 2020;6(1):1–25. Available from: https://dx.doi.org/10.1186/s40854-020-00205-1.
