Enhanced Sentiment Analysis and Topic Modeling During The Pandemic Using Automated Latent Dirichlet Allocation

Received 27 May 2024, accepted 4 June 2024, date of publication 10 June 2024, date of current version 17 June 2024.

Digital Object Identifier 10.1109/ACCESS.2024.3411717

Department of Electronic Engineering, Institute of Information Science Technology, Jeju National University, Jeju 63243, South Korea
Department of Computer Engineering, Major of Electronic Engineering, Institute of Information Science Technology, Jeju National University, Jeju 63243, South Korea

Corresponding author: Yung-Cheol Byun
This work was supported by the “Regional Innovation Strategy (RIS)” through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (MOE).

ABSTRACT The COVID-19 pandemic has profoundly impacted human societies, resulting in the loss of millions of lives and slowing economic growth worldwide. This devastating pandemic underscores the gravity of viral threats and has led to multifaceted consequences, including loss of livelihoods, dynamic labor force migration, and significant ramifications for mental health. Furthermore, scientific institutions and companies are attempting to accelerate research and innovation by analyzing large data corpora to fight the pandemic. In this study, an advanced approach based on automated Latent Dirichlet Allocation (LDA) is proposed for handling a large data corpus and efficiently visualizing sentiment analysis results and discovered topics. This approach interrogates a substantial pandemic corpus, examining public sentiment and discerning evolving trends pertinent to the pandemic. A 10-topic LDA model was implemented, revealing Topic 8 as the most prevalent, with a frequency peak of 22.29, eclipsing the other topics. We employ text-mining techniques like WordCloud and Word2Vec to offer insights into specific terms relevant to the pandemic, such as “Origin,” “Symptom,” “Diagnostic,” and “Transmission.” Applying the t-SNE method enriches the analysis by visually unraveling semantic clusters within the corpus. The subsequent phase involves modeling strategic topics within the corpus through an unsupervised LDA-based approach, leveraging our suggested framework. This perspective contributes to a deeper understanding of the underlying dynamics by analyzing a large data corpus quickly and automatically and visualizing the discovered topics, with the aim of aiding front-line workers, healthcare practitioners, and community support in fighting the pandemic.

INDEX TERMS Topic modeling, LDA, sentiment analysis, machine learning, deep learning, feature

I. INTRODUCTION
The pandemic has affected millions of lives. The majority of countries around the world were driven to order a temporary shutdown of their economies to stop the virus from spreading. The pandemic led to a health crisis and slowed economic growth [1], [2], which provides evidence of the menace posed by the virus. Because of the long-term restriction to houses or residences, the consequences of this pandemic included loss of livelihoods. Coronaviruses (CoVs) are significant enveloped, single-stranded, positive-sense RNA viruses that infect the gastrointestinal tracts or respiratory systems of humans and animals [3]. Several viruses affect human life, such as Human Coronaviruses (HCoVs), which are known to exist in seven different strains. The beta-CoVs HCoV-OC43 and HCoV-HKU1 and the alpha-CoVs NL63 and 229E cause only moderate respiratory illnesses [4], [5].

The associate editor coordinating the review of this manuscript and approving it for publication was Zhengmao Li.

A. Batool, Y.-C. Byun: Enhanced Sentiment Analysis and Topic Modeling During the Pandemic

Over the past 20 years, however, two new coronaviruses, severe acute respiratory syndrome CoV (SARS-CoV) and Middle East respiratory syndrome CoV (MERS-CoV), have caused human infections in local nations and areas, with mortality rates of around 10% and 35%, respectively [6], [7], [8]. The coronavirus now circulating worldwide is SARS-CoV-2, the seventh HCoV, formerly known as 2019-nCoV. Tough medical and social efforts have been made globally against the pandemic caused by the SARS-CoV-2 coronavirus, which started infecting people in late December 2019. It is an evolving medical issue, so global monitoring of the disease and related research are being kept up to date to contain the pandemic, which is of great concern to the WHO. SARS-CoV-2, which appeared in late December 2019, is the seventh human coronavirus and the most recent one discovered. The WHO named the disease caused by the new virus COVID-19 and declared it a pandemic. As of 14 September 2022, 607,083,820 people had been infected with the disease, and 6,496,721 people had died [9].

Furthermore, as the scientific research community strives to develop innovative solutions to mitigate the pandemic, it also paves the way for accelerating the pace of innovation and discovery [10]. In addition, the pandemic has become a serious health concern globally, increasing the demand for health technology that helps health practitioners and policy makers achieve breakthroughs against future pandemics [11]. Research funding institutions have also focused significant attention on supporting research and innovation at the utmost pace to address the pandemic. However, to accelerate the pace of the research community towards innovative solutions, timely processing of the large pandemic data corpus is required. According to scientific evidence, many researchers are writing on different issues, including text for communities with disease outbreaks, epidemic alarms, and other medical services, all of which are included in the discussion. Therefore, an efficient solution is required to process the large pandemic data corpus in a timely manner and to provide a visualization of discovered topics to scientific researchers and health practitioners for mitigating the pandemic.

Several studies have applied Natural Language Processing (NLP) [12] to social media (SM) text related to COVID-19. NLP approaches are gaining popularity for processing enormous amounts of natural language data. NLP is a multidisciplinary science combining artificial intelligence (AI) and linguistics, leveraging computers to interpret and understand human language. This convergence of knowledge makes machines capable of processing, analyzing, and generating text to enable communication between humans and computers [13]. The importance of NLP is further heightened by the fact that we produce large amounts of unstructured text data in our daily routine. Named entity recognition, sentiment analysis [14], machine translation [15], topic modeling [16], text filtering [17], review analysis [18], and text summarization [19] are a few of the most widely used and well-liked NLP approaches. In addition, in [20], the authors employed epistemic network analysis to extract sentiment scores in blended learning data. Similarly, in [21], the authors used a collaborative filtering approach to analyze users' behaviour through social media data.

In the context of the COVID-19 crisis, large language models (LLMs) like BERT offer exceptional NLP capabilities but are computationally and memory-intensive [22], [23]. Binarization, which reduces model weights to 1 bit, significantly cuts these demands, yet often results in performance drops [24]. To address this, methods like BiLLM and BiBERT introduce innovative techniques such as binary residual approximation, optimal splitting search, Bi-Attention structures, and Direction-Matching Distillation (DMD) to enhance accuracy and efficiency [25]. These approaches achieve high-accuracy inference and substantial savings in FLOPs and model size, demonstrating their potential for real-world, resource-constrained scenarios [26]. The authors of [27] used non-negative matrix factorization and probabilistic latent sentiment analysis to normalize the mutual and distance information. In addition, in [28], the authors proposed a two-stage adaptive distillation model to capture aesthetic and context information in crowd-sensing environments. By leveraging these advancements, binarized LLMs can maintain performance while being more computationally and memory efficient. In the context of the COVID-19 crisis, NLP techniques have been crucial for analyzing vast amounts of natural language data from various sources, enabling effective sentiment analysis, topic modeling, and information retrieval.

Automated LDA (Latent Dirichlet Allocation) for the pandemic outbreak involves applying topic modeling techniques to extract key themes or topics from large-scale discussions related to the pandemic. This method helps researchers identify and categorize various aspects of the pandemic outbreak, such as medical issues, public responses, economic impacts, and policy discussions. Moreover, sentiment analysis can provide insights into prevailing sentiments surrounding the pandemic and its effects on different communities by gauging the emotional tone of these discussions.

This study identifies significant research topics and their connections and applies LDA modeling and NLP to evaluate the current status of the literature on COVID-19 and CoV infection. This study can also aid research on pandemic coordination by identifying high-priority scientific topics. Research is urgently needed on pathogens, treatments, virus diagnostics, vaccines, and viral genomes, while clinical characterization, epidemiology, and virus transmission are now priorities.

The distinctive contributions of our study are delineated as follows:
• Pioneering a novel approach, we introduce an automated LDA-based topic modeling method to scrutinize an extensive pandemic corpus. This method goes beyond conventional analyses, offering an enhanced
understanding of public sentiment and emerging trends linked to the pandemic.
• Elevating the analysis, we employ word-to-vector (Word2Vec) as an embedding technique, delving into the intricate semantic relationships and similarities among words. Specifically, we explore terms such as origin, symptom, diagnostic, and transmission, providing a nuanced perspective on their interconnections with the pandemic.
• We employ statistical analysis to assess the statistical significance of the proposed research study.
• In addition, a detailed comparison is provided to highlight the empirical effectiveness of the proposed research study over existing studies.

The rest of the paper is organized as follows. Section II reviews the literature on sentiment analysis and topic modeling of pandemic publications. Section III presents our method and research design, introduces data preprocessing, explores the dataset in depth, and explains the precise methodology of this study. The topic distribution and topic representations are discussed in more detail in Sections IV and V. In Section VI, we explain the results and generalizations. Finally, Section VII concludes by summarizing how this study fits into the research framework. We also describe the limitations of our research work and provide suggestions for future research.

II. RELATED WORK
In 2020, the WHO (World Health Organization) declared the coronavirus disease (COVID-19) outbreak a pandemic. A large body of research addresses COVID-19, covering issues such as the COVID-19 transmission process, virus symptoms, the psychological conditions of COVID-19 patients, boosting human immunity to prevent health consequences, prediction of COVID-19 data based on machine learning (ML) techniques, and the importance of online technology tools in this context. Our literature review examines the trend of general research on COVID-19 and the application of ML algorithms in related research. In the current study, an unsupervised ML method is used to identify gaps in the existing literature and suggest future research directions. Much research remains to be carried out to analyze this pandemic.

A. TOPIC MODELING
Topic modeling (TM) is a technique that may be used to manage an extensive collection of documents by grouping them according to various subjects. Although topic modeling is often regarded as a form of clustering, it is more reliable and frequently provides more accurate results than a clustering technique like k-means. A clustering technique presupposes that each document is assigned a single subject, and the distance between them is measured. TM assigns a document to a group of topics with different weights or probabilities without making any assumptions about how close or far apart the subjects are. Several TMs are available, with the latent Dirichlet allocation (LDA) model being the most often used [29]. Global researchers are working to comprehend the COVID-19 pandemic's many facets. Much research has appeared in the literature since the outbreak of COVID-19, which was reported at the end of December 2019. For example, one strand of the growing body of COVID-19 research is the NLP-based analysis of social media posts, scholarly papers, and daily news relevant to the disease. A topic-based study of COVID-19 news stories from Canada is presented by Bai et al. in [30]. To examine the news media during the early stages of the COVID-19 epidemic in China, [31] adopted a digital topic modeling technique. During the COVID-19 conference, [32] presented a system for identifying and following pertinent subjects from social media. Reference [33] examined how the local public responded to the new coronavirus (COVID-19).

Some studies used sentiment analysis to examine how individuals responded to the epidemic through social media posts. Twitter and Weibo posts made in China and America between January 2020 and May 2020 during the epidemic were examined by [49]. The results showed that most people were confident in controlling the pandemic, but sentiments like fear, sadness, and disgust also appeared worldwide. They compared people's emotions, i.e., anger, hate, fear, happiness, sadness, and surprise. To study the sentiment dynamics of residents of the Australian state of New South Wales (NSW) throughout the pandemic, [50] retrieved five months' worth of COVID-19-related tweets from Twitter. They grouped tweets by local government area (LGA) and tracked dynamic mood shifts over time. To dynamically assess the subject and mood of 13 million tweets about COVID-19, [51] devised a unique methodology. In addition, in [52], the authors carried out a cross-sectional study to investigate the impact of negative emotions and risk perception of health practitioners during the COVID-19 pandemic. Despite several issues with the biases, confounding, and representativeness of social media data [53], social media platforms have an estimated 3.96 billion users worldwide. Several methodologies have been used based on features including part of speech (POS), uni-grams, bi-grams, statistical techniques, and word and sentence embeddings [54]. In [55], the authors proposed a crowd-sourcing method to estimate moral elevation in medical data to facilitate well-being. Word embeddings in deep learning (DL) models have received greater attention recently [54]. In [56], the authors developed a DL-assisted model to distinguish between positive and negative emotions in medical data. In [57], Doc2Vec and Word2Vec were used for the sentiment analysis of medical documents. When assessing unsupervised models for the medical domain, the study's authors also employed WordNet's Welsh statistic. In addition, in [58], the authors employed an attentive multi-tasking ML model to recognize emotions for sustainable and livable environment

development. Social media continues to be a great source of textually rich, semantic data with excellent opportunities to monitor various social interaction-related characteristics, particularly conversations on public health challenges.

TABLE 1. Critical analysis of existing studies.

C. SENTIMENT ANALYSIS AND TOPIC MODELING
The research works cited in [59], [60], [61], [62], [63], and [64] performed either topic modeling or sentiment analysis using COVID-19 data. In contrast, extremely few studies have integrated topic modeling and sentiment analysis of COVID-19 data. Chandrasekaran et al. [65] used methods like LDA and VADER to analyze the trends and opinions expressed in tweets concerning the COVID-19 pandemic. Xue et al. [66] investigated tweets to assess public sentiment and conversation during the COVID-19 epidemic. They employed the LDA approach for topic modeling. In addition, in [67], the authors suggested a hybrid NLP model based on multi-layered features to analyze the hidden insights of the data. Similarly, in [68], the authors suggested a neural network (NN) based approach to transform the representations into a unified latent vector space. In [69], the authors investigated emotion using traditional ML in interactive education systems. Similarly, in [70], the authors attempted to investigate differences in behavioural embedding between two entities. Our research fits into a category of studies that use topic modeling and sentiment analysis to evaluate COVID-19 data. Although this body of research contributes positively to our understanding of cross-cultural COVID-19 news, the nature of our COVID-19 corpus (research articles) and the use of topic modeling and textual similarities for sentiment categorization make this research necessary.

In this section, we present a detailed methodology of the proposed architecture for COVID-19 analysis. The main stages of the study are shown in Fig. 1 and are as follows:
• Data Collection
• Data Preprocessing
• Word Cloud Generation
• Topic Modeling
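As a rough illustration of the Data Preprocessing and Word Cloud Generation stages listed above, the following minimal Python sketch tokenizes abstracts, removes stop words, extracts bigrams and trigrams, and counts word frequencies of the kind a word cloud would be sized by. The stop-word list, the function names, and the two sample abstracts are assumptions made for illustration only; they are not the resources or code used in the study.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not the list used in the study.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "is", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and split it into word tokens (hyphenated words kept whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple (n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def word_frequencies(docs):
    """Count cleaned tokens over all documents; a word cloud scales fonts by these counts."""
    counts = Counter()
    for doc in docs:
        counts.update(remove_stop_words(tokenize(doc)))
    return counts


# Hypothetical sample abstracts.
abstracts = [
    "The transmission of the coronavirus in patients.",
    "A study of coronavirus symptoms in patients.",
]
tokens = remove_stop_words(tokenize(abstracts[0]))
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
frequencies = word_frequencies(abstracts)
```

In practice a library stop-word list and a lemmatizer would likely replace the hand-written set, but the order of operations stays the same: tokenize, filter, then build n-grams and counts from the cleaned tokens.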

FIGURE 1. Overview model of the proposed methodology.

A. COLLECTION OF DATASET
We collected the abstracts of papers in the scientific literature indexed in the SCOPUS database that were published during and after the COVID-19 outbreak. The literature search used a minimum of three keywords: COVID-19, coronavirus, and SARS-CoV2. We found many papers published during the mentioned period and skipped those without abstracts. A total of 1994 papers were finally selected for analysis; the published research paper data were extracted from JSON files using the Pushshift API. The histogram of COVID-19 research papers is shown in Fig. 2.

B. DATA PREPROCESSING AND CLEANING
The raw data gathered for each step is pre-processed to conduct the analysis more effectively and efficiently. Unprocessed abstract data from published research papers might impede analysis since they are filled with incorrect terms

and stop words that provide unclear results. The essential processes for cleaning and converting the data into something usable are all part of the pre-processing. This transforms the text input into a consistent form, improving the functionality of ML algorithms. The pre-processing is carried out in the following steps:

FIGURE 2. COVID-19 publications histogram.

A stop word is a term used very frequently in a language. Examples in English are “the,” “a,” “an,” and “in.” These words do not significantly modify the meaning of a statement or its topic. As a result, it is permissible to omit them without changing the sense of the statement. By removing these terms, the algorithm can focus more on the words that help define the text.

The second pre-processing stage involves breaking down abstract phrases and sentences into smaller parts, such as individual words. Tokenization turns each resulting smaller unit into a separate entity known as a token. The extracted tokens help create more accurate models and identify the context of the analyzed text.

E. BIGRAMS/TRIGRAMS
A bi-gram captures a connection between two adjacent words in the abstract. In contrast to co-word use, bigram occurrence only considers a relation/edge if two words are placed one after the other in a sentence. A trigram uses a relationship model similar to the bigram occurrence, with an additional edge between the first and third words of the three-word sequence.

IV. TOPIC MODELING OF COVID-19 PUBLICATIONS
Topic modeling is an unsupervised classification technique for documents. Project management and engineering studies have gradually embraced the topic modeling technique known as LDA. LDA is an unsupervised ML method that can identify the main topics from a collection of unlabeled texts. Each document in LDA is considered as a probabilistic distribution over a set of K topics. Each topic $k \in \{1, \dots, K\}$ is represented as a distribution $\varphi_k$ over vocabulary words [71]. Each word contributes in a specific way to each topic.

The figure below shows the mathematical notation; for instance, θ designates a matrix with rows defined by documents and columns defined by topics, and θ(d, k) indicates the probability that topic k will appear in document d. Similarly, φ is a matrix where the columns are words and the rows are topics. Below is a simplified illustration of the LDA procedure.

In LDA, it is assumed that the topic distribution has a Dirichlet prior, resulting in a uniform topic distribution for each document; the equation below models the probability for a corpus [72]. Fig. 3 explains the LDA plate notation, and Table 2 displays the significance of the notations.

TABLE 2. Sentiment analysis and topic modeling of COVID-19.

LDA aims to find the document-topic matrix θ and the topic-word matrix φ that maximize the following joint probability distribution across the hidden and observable variables.

$$\prod_{d=1}^{D} P\left(w_1, \dots, w_{N_d} \mid \beta, \alpha\right) = \prod_{d=1}^{D} \int_{\theta_d} P(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{k} \theta_{dk}\, \beta_{k w_n} \right) d\theta_d$$

LDA assumes the following generative process given a corpus of D documents, where document d has length $N_d$:
• Draw $\theta_d \sim \mathrm{Dir}(\alpha)$, where $d \in \{1, 2, \dots, D\}$. $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter α, where α is frequently sparse.
• Draw $\beta_k \sim \mathrm{Dir}(\eta)$, where $k \in \{1, 2, \dots, K\}$; β, the word distribution of each topic, is often sparse.
• For the n-th position in document d, where $n \in \{1, 2, \dots, N_d\}$ and $d \in \{1, 2, \dots, D\}$:
• Select a topic $z_{d,n}$ for the position, generated from $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$.
• Fill that position with the word $w_{d,n}$, generated from the word distribution of the topic chosen in the preceding step, $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$.
This work uses LDA to model themes and separately cover trending topics. In topic modeling, the number of topics is a significant variable. We utilize the coherence score to determine the number of topics to make

these topics interpretable by humans. The coherence score in Equation 1 helps separate themes that are humanly understandable from artifacts of statistical inference. The coherence measure takes the top n words of each topic that appear often and averages the pairwise scores of those top words $w_1, \dots, w_n$. Finally, we obtain the total coherence score for the current topics. The number of topics was evaluated across two validation sets, with two parameters fixed at 0.01 and 0.1. We considered numbers of topics between 1 and 100. We decided on 10 topics since the findings show that this number produces the maximum coherence score, and we use LDA topic modeling to analyze the abstracts.

$$\mathrm{Coherence} = \sum_{i<j} \mathrm{score}\left(w_i, w_j\right) \tag{1}$$

FIGURE 3. Explanation of LDA in topic modeling.

We used a grid-search optimization approach to determine the number of topics K that results in the most compelling model. A detailed overview of documents and word probability is given in Figure 4. To further explain, after training baseline models spanning the range of K, the C_v coherence measure was computed to estimate the ideal number of topics K for the corpus of abstracts. The topic model's coherence score C_UMass averages the coherence ratings of each topic included in the model. The logarithm causes C_UMass to produce negative values, with values closer to 0 denoting topics more easily understood by humans.

$$C_{\mathrm{UMass}}\left(k; W^{(k)}\right) = \frac{2}{N(N-1)} \sum_{i<j} \log \frac{D\left(w_i^{(k)}, w_j^{(k)}\right) + \varepsilon}{D\left(w_i^{(k)}\right)}, \qquad C_{\mathrm{UMass}} = \frac{1}{K} \sum_{k=1}^{K} C_{\mathrm{UMass}}\left(k; W^{(k)}\right) \tag{2}$$

In Equation 2, $D(w_i)$ is the count of documents containing the word $w_i$, $D(w_i, w_j)$ is the count of documents containing both words $w_i$ and $w_j$, and $W^{(k)} = \left(w_1^{(k)}, \dots, w_N^{(k)}\right)$ is the list of the N most probable words of topic k [73].

Non-negative Matrix Factorization (NMF) was utilized to extract and identify the underlying topics from a vast collection of research articles. In the vector space model, the non-negative corpus matrix has size d × n, where d represents the number of words and n represents the total number of documents.

In NMF, the corpus matrix $Z \in \mathbb{R}^{d \times n}_{\geq 0}$ is factorized into two low-rank non-negative matrices: $W \in \mathbb{R}^{d \times x}_{\geq 0}$, known as the dictionary matrix, and $H \in \mathbb{R}^{x \times n}_{\geq 0}$, known as the coding matrix. This factorization is accomplished by solving the optimization problem in Equation 3:

$$\inf_{W \in \mathbb{R}^{d \times x}_{\geq 0},\; H \in \mathbb{R}^{x \times n}_{\geq 0}} \left\| Z - WH \right\|_F^2 \tag{3}$$

where $\|A\|_F^2 = \sum_{i,j} A_{ij}^2$ denotes the squared Frobenius norm of matrix A. NMF is essentially an iterative optimization algorithm. However, it has a significant drawback: the objective function is usually non-convex and possesses multiple local minima. As a result, different random initializations of the NMF procedure can lead to different matrix factorizations. This variability affects how the results are interpreted, including the topic vector representations in W and the relevance between articles and topics in H.
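To make the optimization problem in Equation 3 concrete, the sketch below factorizes a small non-negative matrix with the classic multiplicative update rules for the Frobenius objective. The choice of solver, the function names, the iteration count, and the toy matrix are all assumptions for illustration; the source does not state which NMF solver was used.

```python
import numpy as np


def nmf(Z, x, n_iter=200, seed=0, eps=1e-10):
    """Factorize a non-negative d x n matrix Z into W (d x x) and H (x x n).

    Uses multiplicative updates, which keep W and H non-negative and do not
    increase the squared Frobenius error ||Z - WH||_F^2 from step to step.
    """
    rng = np.random.default_rng(seed)
    d, n = Z.shape
    W = rng.random((d, x)) + eps  # random non-negative initialization
    H = rng.random((x, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ Z) / (W.T @ W @ H + eps)
        W *= (Z @ H.T) / (W @ H @ H.T + eps)
    return W, H


def frobenius_error(Z, W, H):
    """Squared Frobenius norm of the reconstruction residual."""
    return float(np.sum((Z - W @ H) ** 2))


# Hypothetical 4-word x 4-document corpus matrix with two visible blocks.
Z = np.array(
    [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 4, 1],
     [0, 0, 1, 4]],
    dtype=float,
)
W, H = nmf(Z, x=2)
error = frobenius_error(Z, W, H)

# A different seed gives a different starting point and possibly a different
# factorization, which is the initialization sensitivity noted above.
W_alt, H_alt = nmf(Z, x=2, seed=1)
```

Each column of W can be read as a topic over words, and each column of H gives the topic weights of one document.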

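The UMass coherence of Equation 2 and the selection of the number of topics by maximum coherence can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: documents are represented as sets of tokens, every top word occurs in at least one document, the smoothing constant `eps` is an arbitrary small value, and the sample documents and per-K scores are invented for illustration.

```python
import math
from itertools import combinations


def document_frequency(docs, *words):
    """Count the documents (token sets) that contain all of the given words."""
    return sum(1 for doc in docs if all(w in doc for w in words))


def umass_topic_coherence(top_words, docs, eps=1e-12):
    """Average pairwise UMass score over the top words of one topic (Equation 2).

    Assumes every word in top_words appears in at least one document,
    so the denominator D(w_i) is never zero.
    """
    n = len(top_words)
    total = 0.0
    for w_i, w_j in combinations(top_words, 2):  # all pairs with i < j
        joint = document_frequency(docs, w_i, w_j)
        total += math.log((joint + eps) / document_frequency(docs, w_i))
    return 2.0 * total / (n * (n - 1))


def umass_model_coherence(topics, docs, eps=1e-12):
    """Mean of the per-topic coherence values over all K topics."""
    return sum(umass_topic_coherence(t, docs, eps) for t in topics) / len(topics)


def best_num_topics(coherence_by_k):
    """Return the number of topics K whose model has the highest coherence."""
    return max(coherence_by_k, key=coherence_by_k.get)


# Hypothetical corpus: each document is the set of its tokens.
docs = [
    {"virus", "vaccine", "patient"},
    {"virus", "vaccine"},
    {"economy", "market"},
    {"economy", "market", "virus"},
]
coherent = umass_topic_coherence(["virus", "vaccine"], docs)
mixed = umass_topic_coherence(["virus", "market"], docs)
model_score = umass_model_coherence([["virus", "vaccine"], ["economy", "market"]], docs)

# Made-up coherence scores for three candidate values of K.
chosen_k = best_num_topics({5: -1.2, 10: -0.4, 20: -0.9})
```

Words that co-occur in many documents give a score closer to 0, so the `coherent` topic scores higher than the `mixed` one, and the grid search simply keeps the K with the largest model-level score.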

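The LDA generative process described in Section IV can also be simulated directly, which makes the roles of the document-topic and topic-word distributions easier to see. The sketch below is illustrative only: the function name, the corpus sizes, and the hyperparameter values are assumptions, every document is given the same length for simplicity, and the code samples from the model rather than fitting it.

```python
import numpy as np


def generate_corpus(D, K, V, doc_length, alpha, eta, seed=0):
    """Sample a synthetic corpus from the LDA generative process.

    D: number of documents, K: number of topics, V: vocabulary size.
    Returns theta (D x K), beta (K x V), and a list of documents,
    each a list of integer word ids.
    """
    rng = np.random.default_rng(seed)
    beta = rng.dirichlet(np.full(V, eta), size=K)     # word distribution per topic
    theta = rng.dirichlet(np.full(K, alpha), size=D)  # topic distribution per document
    docs = []
    for d in range(D):
        # Choose a topic for every position, then a word from that topic.
        z = rng.choice(K, size=doc_length, p=theta[d])
        words = [int(rng.choice(V, p=beta[k])) for k in z]
        docs.append(words)
    return theta, beta, docs


# Hypothetical sizes and symmetric priors.
theta, beta, docs = generate_corpus(D=5, K=3, V=8, doc_length=20, alpha=0.5, eta=0.5)
```

Fitting LDA runs this process in reverse: given only the documents, it infers plausible values of theta and beta.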
FIGURE 4. Grid-based determination of the optimal number of topics.

Algorithm 1 The Process NMF Algorithm
1: Input: corpus matrix $Z$.
2: Apply Non-negative Matrix Factorization (NMF) to decompose $Z$ into matrices $W$ and $H$ with $x$ topics.
3: Select the optimal number of topics $x^*$ by using a threshold value in matrix $H$, categorizing the articles into topics $Z_1, \dots, Z_{x^*}$. Any articles that do not meet the threshold are placed in an “Extra Document” matrix $Z_e$.
4: while the number of articles assigned to a topic $i$ > $m$ do
     Apply NMF to the sub-matrix $Z_i$ to obtain $W_i$ and $H_i$ with $x_i^*$ sub-topics.
     Assign the documents to the topics by the threshold α in matrix $H$.
     Assign the rest to $Z_e$;
   end
5: for each article $z_i$ in $Z_e$ do
     Calculate the cosine similarity between $z_i$ and each of the leaf topics.
     Assign $z_i$ to the most similar topic.
   end
   Repeat the loop for each topic until every topic has fewer than $m$ articles.

Moreover, the choice of the number of topics, k, introduces another source of variability. Different combinations of initial values for W and H, along with varying values of k, produce different topics, thereby leading to different clustering results for the articles.

We used NMF topic visualization with the algorithm implementation. In Algorithm 1 the data is visualized into 10 topics. After the topic visualization, topic 8 was assigned the highest accuracy, relating to health and vaccines, and the subtopics were waste, supply, and environment. The public health topic led to articles related to public sentiment about the COVID-19 outbreak.

VI. RESULTS AND DISCUSSION
This section presents experimental results and analysis to evaluate the proposed topic modeling approach. Topic coherence considers the most frequent words in each generated topic and measures the semantic similarity between the words of a topic. Either UCI or UMass is used to perform the pairwise calculations, and the mean coherence score is calculated across all the topics of the model.

A. SENTIMENT ANALYSIS
This study used 32314 research publications and 428,265 words for sentiment analysis. Among the data, 22.1% had positive sentiments, 12.3% had negative sentiments, and the majority, 64.1%, had neutral opinions. Table 1 shows how many words were related to each month. Based on the results, COVID has a primarily neutral sentiment. Topic 1 records the highest number of positive words, and from there, the tone of the public toward the COVID crisis seems less optimistic. While positive sentiment decreased by 4.5%, 4.2%, and 3% in topic 2, topic 3, and topic 4, neutral sentiment increased by 2.7%, 1.7%, and 1.3% in those topics. In topic 2, there is also a high number of negative sentiments. Negative sentiments increase in the latter topic when compared to topic 1 and topic 3. Except for topic 2, the percentage of neutral tweets remains almost the same throughout the year. As the COVID crisis piled up, there was a drop in positive sentiments and a significant increase in neutral sentiments, as shown in Table 3.

B. TOPIC MODELING USING PROPOSED APPROACH
We extract topics from the published COVID-19 papers using the LDA model in Gensim. The number of

A. Batool, Y.-C. Byun: Enhanced Sentiment Analysis and Topic Modeling During the Pandemic

TABLE 3. Positive and negative sentiment analysis.

FIGURE 6. Analysis of the topic frequency of optimal topics.

LDA topics depends on topic coherence. Regarding topic coherence, the distributional hypothesis states that words with similar meanings tend to be grouped together [74]. According to this hypothesis, similar words are more likely to occur in similar contexts. In the tmtoolkit package, we use the ‘‘coherence_gensim_c_npmi’’ metric to calculate the coherence value for each topic number k in the abstract collection. Fig. 5 illustrates the coherence value for each number of topics. Topic number 8 achieves the highest coherence value. As a result, 8 is the ideal topic number. The results of the LDA model are interpreted to illuminate the significance of the subjects. Visualization of 10 topics can be found. The full-text collection’s subject number is also set to 10 to maintain consistency with the numbering. In Equation 4, the coherence value of a topic V is the sum of pairwise distributional similarity scores over the topic words [75], [76], [77].

FIGURE 5. Coherence scores for different numbers of topics.

$$
\mathrm{Score}(w_i, w_j, \epsilon) = \left( \frac{\log \dfrac{P(w_i, w_j) + \epsilon}{P(w_i) \cdot P(w_j)}}{-\log\left(P(w_i, w_j) + \epsilon\right)} \right)^{\gamma}, \qquad
\mathrm{Coherence}(V) = \sum_{(w_i, w_j) \in V} \mathrm{Score}(w_i, w_j, \epsilon) \tag{4}
$$

Furthermore, Fig. 6 shows the frequency of the optimal topics. Topic 8 has the highest frequency, 22.97 percent, compared to the other listed topics. The average frequency percentage of topic 9 is 21.80, topic 6 is 21.79, topic 10 is 13.26, and topic 1 is 13.16. Similarly, Topic 5 achieved the lowest frequency of 1.167 compared to all listed topics.

In addition, Fig. 7 illustrates an overview of the global topics and associated terms to investigate the topic-term relationship. The analysis provides two types of visualization to interpret a selected topic by determining useful terms. On the left side, it provides a visual view of the selected global topics on a two-dimensional plane. On the right side, a bar chart illustrates the frequency of the terms associated with the selected global topic. The associated terms are ranked by frequency in decreasing order for topic interpretation. In this way, each global topic is analyzed compactly according to the frequency of the terms. The proposed analysis aims to help health practitioners interpret the relationship between global topics and associated terms in a large corpus to mitigate the pandemic.

C. WORDCLOUD OF COVID-19 RESEARCH PAPERS
In a wordcloud, the font size of each item depends on the importance of the word in the data; here the words are obtained by analyzing the abstract text of the COVID-19 corpus. To create the wordcloud, the text must first be preprocessed (tokenization, lemmatization) before the wordcloud items are generated. Clearly, the words ‘‘covid’’, ‘‘patient’’, and ‘‘Study’’ stand out from the rest of the abstract text analyzed. Other words such as ‘‘Wuhan’’ and ‘‘protein’’ also appear prominently, as the wordcloud in Fig. 8 shows.

D. WORD2VEC MODEL AND TEXTUAL COSINE SIMILARITIES
Word2Vec models train on a corpus of text to see which words tend to be used in a similar context. We built the


FIGURE 7. Overview of the global topics and associated terms for analyzing relationship between topic and terms.

word embedding using the Python library Gensim. In Word2Vec models, large corpora of text are used as inputs. As a result, each unique word in the corpus is represented by a vector. The Word2Vec model is shown in Fig. 9. Cosine similarity is used to measure vector similarity, as shown in Table 4 and Table 5. During the analysis of the original corpus, similar vectors represent words that are semantically related to ‘‘origin’’. Coronaviruses cause illnesses that can be passed from animals to humans. This type of transmission, known as zoonotic origin, occurs when a pathogen jumps from non-human animals to humans. Similarly, the words that have been used to describe COVID-19 symptoms are related to terms like ‘‘fever’’, ‘‘Patient’’, ‘‘Transmission’’, and ‘‘Vaccine’’.

E. T-SNE VISUALIZATION OF SEMANTIC CLUSTERS
This technique was finally used to reduce the dimensionality of each word vector, allowing the 2D position to be projected along with its label. An ML algorithm, K-means, was also implemented using the Scikit-learn Python library to partition the n words into semantic clusters. In the elbow method, the optimal K was determined by summing the squared distances between clusters [1, 30]. If the plot looks like an arm, the elbow on the arm is the optimal K. Here, K = 7.


FIGURE 8. Wordcloud of COVID-19 research publications.
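The wordcloud sizing idea from Section C can be sketched as follows: count how often each term occurs after simple preprocessing, then scale font sizes linearly with the counts. This is a minimal sketch under stated assumptions, not the pipeline used in this study. The toy abstracts, the small stopword list, and the size range are hypothetical, and lemmatization is omitted to keep the example dependency-free.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the illustration.
STOPWORDS = {"the", "of", "and", "in", "a", "to", "with", "is", "for", "on"}


def term_frequencies(abstracts):
    """Lowercase, tokenize on alphanumeric runs, drop stopwords, and count."""
    counts = Counter()
    for text in abstracts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts


def font_sizes(counts, min_size=10, max_size=60):
    """Scale each term's font size linearly with its count; the most
    frequent term gets max_size."""
    top = max(counts.values())
    return {word: min_size + (max_size - min_size) * count / top
            for word, count in counts.items()}


# Toy abstracts (hypothetical, not taken from the corpus).
abstracts = [
    "Covid patient study in Wuhan",
    "Study of covid spike protein",
    "Covid patient outcomes",
]

counts = term_frequencies(abstracts)
sizes = font_sizes(counts)
print(counts.most_common(3))
```

Terms that occur more often receive proportionally larger fonts, which is why the most frequent corpus terms dominate the rendered wordcloud.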

FIGURE 9. Word2vec model visualization.
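The cosine similarity used to rank related word vectors (Tables 4 and 5) can be illustrated with a short, dependency-free sketch. The three-dimensional vectors below are hypothetical toy values, not embeddings learned from the corpus, and the helper names are illustrative. In Gensim, a comparable ranking is available through the `most_similar` method of the trained model's word vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=3):
    """Rank every other word by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scored = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical toy embeddings; real Word2Vec vectors have far more dimensions.
vectors = {
    "fever": (0.9, 0.1, 0.0),
    "cough": (0.8, 0.2, 0.1),
    "vaccine": (0.1, 0.9, 0.2),
    "protein": (0.0, 0.3, 0.9),
}

ranked = most_similar("fever", vectors)
for other, score in ranked:
    print(f"{other}: {score:.3f}")
```

Because cosine similarity depends only on direction, two words with vectors of different lengths can still score close to 1 when they occur in similar contexts.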

From the t-SNE visualization of Word2Vec embeddings, we can distinguish several clusters in which we can recognize semantic similarities, including medical treatment, government policies and measures, vaccine research, epidemiological research, and COVID-19 detection, transmission, causes, and consequences of the disease. Inter-word distance in the 2D plane is an indication of inter-word similarity.

F. WORD MOVER’S DISTANCE (WMD)
WMD is a tool for measuring the distance between two documents based on their word embeddings. The topic similarity is computed between the topic and its subtopics, where the most related results in the Word2Vec embedding represent each topic. Analyzing the similarities between topics indicates a lower score with other correlated topics. In Table 5, the similarity metrics in Topic 3 and Topic 4, Dyspnea and fever also, and cough and myalgia in


TABLE 4. Cosine similarity in origin, symptom to similar word vector.

TABLE 5. Cosine similarity in diagnostic, transmission to similar word vector.

Topic 5 and 6 within Topic 7, the similarity metrics among its subtopic words are notably diminished compared to those between the similar words.

FIGURE 10. Optimal number of clusters K (elbow method).

G. DISCUSSION
1) NOVELTY AND SCOPE
This study stands out from existing research by focusing on public sentiment analysis of COVID-19 through research publications. A previous study by Ahammad [78] uses a smaller dataset of 10,254 news headlines and combines sentiment analysis with topic modeling. Xie et al. [79] offer insights into public sentiment on Weibo during the COVID-19 outbreak. Gyftopoulos et al. [80] collected data from Twitter posts and analyzed public sentiments based on the content of the posts circulated during the COVID-19 period. Thakur [81] focuses on COVID-19 and MPox simultaneously, revealing that nearly half of the tweets had a negative sentiment, followed by positive and neutral sentiments. Costola et al. [82] investigate the impact of COVID-19 news on financial markets, analyzing a large corpus of online articles from major news platforms and employing ML techniques for sentiment analysis. It also identifies top hashtags and most frequently used words, offering insights into prevalent topics in Twitter conversations.

Our research takes a more comprehensive approach by encompassing a wide and diverse range of COVID-related analyses. Leveraging data from the National Library of Medicine PubMed, we employed automated LDA to extract key themes from extensive discussions on the pandemic. This method aids researchers in identifying and categorizing various facets of the pandemic outbreak, ranging from medical issues to public responses, economic impacts, and policy. Additionally, sentiment analysis allows for insights into prevailing sentiments surrounding the pandemic and its implications for different communities by assessing the emotional tone of these discussions. Our study makes a significant contribution by examining research topics and their connections, employing LDA modeling and NLP (Natural Language Processing) to assess the current literature on COVID-19 and COV infection. Importantly, our research is academic and serves to aid in pandemic coordination efforts by identifying high-priority scientific topics. This is particularly crucial in areas such as pathogens, treatments, virus diagnostics, vaccines, and viral genomes, which are now deemed priorities alongside clinical characterization, epidemiology, and virus transmission research. Our study


TABLE 6. Comparison of the proposed study with similar approaches.

aims to provide a detailed analysis of public sentiment surrounding COVID-19. We use a new approach that involves research publications and advanced techniques such as LDA modeling and sentiment analysis. By building on existing studies, we aim to improve our understanding of the pandemic’s impact and provide valuable insights into the scientific community’s efforts in combating COVID-19. Table 6 presents a comprehensive comparison between our study and existing research using a similar approach, highlighting key differences and similarities in methodology.

2) LIMITATIONS AND FUTURE WORKS
It is important to note the limitations of this study. We have identified the issues needed to understand the medical treatment, governmental rules and regulations, vaccination research, epidemiological research, and the detection, transmission, causes, and effects of the illness COVID-19. However, our study is not exhaustive and there may be other aspects that could be explored in future research. Future research can consider other methods. The analysis exclusively refers to the COVID-19 pandemic literature. We may contend that our approach performs admirably on the sizable COVID-19 data set. Additionally, we concentrate on the literature analysis for the COVID-19 study.

This research makes several contributions. First, we summarize the COVID-19 publications using topic modeling, including the most pertinent terminology, major research themes, and emerging trends. Many articles have been published about the virus’s gene analysis. Government regulations and their effect have been discussed in particular articles. In the interim, vaccination emerged as a topic by the end of 2020, although it is not yet fully covered in the discussion. Second, the proposed research not only adds to the technique by using literature analysis but also provides practical insights. The comparative study of topic extraction from full paper texts against their related abstracts can assist us in comprehending the impact of the various texts based on topic modeling analysis findings. This research shows that extracting ideas from abstracts could be more effective than from full text because abstracts might convey the same information with fewer words. Third, for librarians or documentalists to effectively manage the literature on a particular subject, the current study offers a practical methodological framework that may be used in any field. Understanding the results may be aided by our LDA-based topic modeling, word-cloud subject visualization, and essential terms’ trends.
VOLUME 12, 2024


