Enhanced Sentiment Analysis and Topic Modeling During The Pandemic Using Automated Latent Dirichlet Allocation
South Korea
Corresponding author: Yung-Cheol Byun ([email protected])
This work was supported by the ‘‘Regional Innovation Strategy (RIS)’’ through the National Research Foundation of Korea (NRF) funded
by the Ministry of Education (MOE).
ABSTRACT The COVID-19 pandemic has profoundly impacted human societies, resulting in the loss
of millions of lives and slowing economic growth worldwide. This devastating pandemic underscores
the gravity of viral threats and has led to multifaceted consequences, including loss of livelihoods, dynamic
labor force migration, and significant ramifications on mental health. Furthermore, scientific
institutions and companies are attempting to accelerate research and innovation by analyzing large data
corpora to fight the pandemic. In this research study, an approach based on automated
Latent Dirichlet Allocation (LDA) is proposed to process a large data corpus and efficiently
visualize sentiment analysis results and discovered topics. This approach seeks to interrogate
a substantial pandemic corpus, delving into the intricacies of public sentiment and discerning evolving
trends pertinent to the pandemic. A sophisticated 10-topic LDA model was implemented, revealing Topic
8 as the most prevalent, with a frequency peak of 22.29, eclipsing other enumerated topics. We employ
text-mining techniques like WordCloud and Word2Vec to offer insights into specific terms relevant to
the pandemic, such as ‘‘Origin,’’ ‘‘Symptom,’’ ‘‘Diagnostic,’’ and ‘‘Transmission.’’ Applying the t-SNE
method enriches the analysis by visually unraveling semantic clusters within the corpus. The subsequent
phase involves modeling strategic topics within the corpus through an unsupervised LDA-based approach,
leveraging our suggested framework. This perspective contributes to a deeper understanding of the
underlying dynamics by analyzing a large data corpus quickly and automatically and visualizing the
discovered topics, with the aim of aiding front-line workers, healthcare practitioners, and community support in
fighting the pandemic.
INDEX TERMS Topic modeling, LDA, sentiment analysis, machine learning, deep learning, feature
extraction.
2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License.
For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/ VOLUME 12, 2024
A. Batool, Y.-C. Byun: Enhanced Sentiment Analysis and Topic Modeling During the Pandemic
Over the past 20 years, however, two new coronaviruses, severe acute respiratory syndrome CoV (SARS-CoV) and Middle East respiratory syndrome CoV (MERS-CoV), the latter originating in the Middle East, have caused human infections in local nations and areas, with mortality rates of around 10% and 35%, respectively [6], [7], [8]. The coronavirus now circulating worldwide is SARS-CoV-2, formerly 2019-nCoV, which started infecting people in late December 2019. It is the seventh coronavirus known to infect humans (HCoV) and the most recent one discovered. The resulting pandemic has demanded strenuous medical and social efforts globally through 2022. It is an evolving medical issue, so global monitoring of the disease and related research are continually updated to contain a pandemic of great concern to the WHO. The WHO named the disease COVID-19 and declared it a pandemic. As of 14 September 2022, 607,083,820 people had been infected with the disease, and 6,496,721 people had died [9].
Furthermore, as the scientific research community strives to develop innovative solutions to mitigate the pandemic, it also paves the way for accelerating the pace of innovation and discovery [10]. In addition, the pandemic has become a serious health concern globally, increasing the demand for health technology that helps health practitioners and policy makers achieve breakthroughs against future pandemics [11]. Research funding institutions have also focused significant attention on supporting research and innovation at the utmost pace to address the pandemic. However, to accelerate the pace of the research community towards innovative solutions, timely processing of the large pandemic data corpus is required to facilitate growing research. Researchers are writing on many different issues, including communities affected by disease outbreaks, epidemic alerts, and other medical services, all of which form part of the discussion. Therefore, an efficient solution is required to process the large pandemic data corpus in a timely manner and to visualize the discovered topics for scientific researchers and health practitioners working to mitigate the pandemic.
Several studies have applied Natural Language Processing (NLP) [12] to social media (SM) text related to COVID-19. NLP approaches are gaining popularity in processing enormous natural language data. NLP is a multidisciplinary science combining artificial intelligence (AI) and linguistics, leveraging computers to interpret and understand human language. This convergence of knowledge makes machines capable of processing, analyzing, and generating text to enable communication between humans and computers [13]. The importance of NLP is further heightened by the large amounts of unstructured text data produced in our daily routine. Entity recognition, sentiment analysis [14], machine translation [15], topic modeling [16], text filtering [17], review analysis [18], and text summarization [19] are a few of the most widely used and well-liked NLP approaches. In addition, in [20], the authors employed epistemic network analysis to extract sentiment scores from blended learning data. Similarly, in [21], the authors used a collaborative filtering approach to analyze user behaviour through social media data.
In the context of the COVID-19 crisis, large language models (LLMs) like BERT offer exceptional NLP capabilities but are computationally and memory-intensive [22], [23]. Binarization, which reduces model weights to 1 bit, significantly cuts these demands, yet often results in performance drops [24]. To address this, methods like BiLLM and BiBERT introduce innovative techniques such as binary residual approximation, optimal splitting search, Bi-Attention structures, and Direction-Matching Distillation (DMD) to enhance accuracy and efficiency [25]. These approaches achieve high-accuracy inference and substantial savings in FLOPs and model size, demonstrating their potential for real-world, resource-constrained scenarios [26]. In [27], the authors used non-negative matrix factorization and probabilistic latent sentiment analysis to normalize the mutual and distance information. In addition, in [28], the authors proposed a two-stage adaptive distillation model to capture aesthetic and context information in a crowd-sensing environment. By leveraging these advancements, binarized LLMs can maintain performance while being more computationally and memory efficient. In the context of the COVID-19 crisis, NLP techniques have been crucial for analyzing vast amounts of natural language data from various sources, enabling effective sentiment analysis, topic modeling, and information retrieval.
Automated LDA (Latent Dirichlet Allocation) for pandemic-related discussion involves applying topic modeling techniques to extract key themes or topics from large-scale discussions related to the pandemic. This method helps researchers identify and categorize various aspects of the pandemic outbreak, such as medical issues, public responses, economic impacts, and policy discussions. Moreover, sentiment analysis can provide insights into prevailing sentiments surrounding the pandemic and its effects on different communities by gauging the emotional tone of these discussions.
This study identifies significant research topics and their connections, applying LDA modeling and NLP to evaluate the current status of the literature on COVID-19 and CoV infection. This study can also aid research on pandemic coordination by identifying high-priority scientific topics. Research is urgently needed on pathogens, treatments, virus diagnostics, vaccines, and viral genomes, while clinical characterization, epidemiology, and virus transmission are now priorities.
The distinctive contributions of our study are delineated as follows:
• Pioneering a novel approach, we introduce an automated LDA based topic modeling method to scrutinize an extensive pandemic corpus. This method goes beyond conventional analyses, offering an enhanced
understanding of public sentiment and emerging trends linked to the pandemic.
• Elevating the analysis, we employ word-to-vector as an embedding technique, delving into the intricate semantic relationships and similarities among words. Specifically, we explore terms such as origin, symptom, diagnostic, transmission, etc., providing a nuanced perspective on their interconnections with the pandemic.
• We employ statistical analysis to assess the statistical significance of the proposed research study.
• In addition, a detailed comparison is provided to highlight the empirical effectiveness of the proposed research study over the existing studies.
The rest of the paper is organized as follows. Section II reviews the literature on pandemic publications that examined sentiment analysis and topic modeling. Section III, on our method and research design, introduces data preprocessing, explores the dataset in depth, and explains the precise methodology of this study. The subject distribution and topic representations are discussed in more detail in Sections IV and V. In Section VI, we present and discuss the results. Finally, the conclusion in Section VII summarizes how this study fits into the research framework. We also describe the limitations of our research work and provide suggestions for future research.

II. RELATED WORK
In 2020, the WHO (World Health Organization) declared coronavirus disease (COVID-19) a pandemic. The COVID-19 pandemic has attracted a large body of research, addressing issues such as the COVID-19 transmission process, the virus symptoms, the psychological conditions of COVID-19 patients, boosting human immunity to prevent health consequences, prediction of COVID-19 data based on machine learning (ML) techniques, and the importance of online technology tools in this context. Our literature evaluation examines the trend of general research on COVID-19 and the application of ML algorithms for related research. In the current study, an unsupervised ML method is used to identify gaps in existing literature and suggest future research directions. Much research remains to be carried out to analyze this pandemic.

A. TOPIC MODELING
Topic Modeling (TM) is a technique that may be used to manage an extensive collection of documents by grouping them according to various subjects. Although topic modeling is often regarded as a form of clustering, it is more reliable and frequently provides more accurate results than a clustering technique like k-means. A clustering technique presupposes that each document is assigned to a single subject and measures the distance between documents. TM assigns a document to a group of topics with different weights or probabilities without making any assumptions about how close or far apart the subjects are. Several TMs are available, with the latent Dirichlet allocation (LDA) model being the most often used [29]. Global researchers are working to comprehend the COVID-19 pandemic's many facets. Many research works have appeared in the literature since the outbreak of COVID-19, which was reported at the end of December 2019. For example, one part of the growing body of COVID-19 research is NLP-based analysis of social media posts, scholarly papers, and the daily news relevant to the disease. A topic study of COVID-19 news stories from Canada is presented by Bai et al. in [30]. To examine the news media during the early stages of the COVID-19 epidemic in China, [31] adopted a digital topic modeling technique. During the COVID-19 event, [32] presented a system for detecting and tracking relevant topics from Italian tweets. Reference [33] examined how the public responded on Twitter to the novel coronavirus (COVID-19) outbreak.

B. SENTIMENT ANALYSIS OF COVID-19
Some studies used sentiment analysis to examine how individuals responded to the epidemic through social media posts. Twitter posts and Weibo postings made in China and America between January 2020 and May 2020 during the epidemic were examined by [49]. The results showed that most people were confident in controlling the pandemic, but sentiments such as fear, sadness, and disgust also appeared worldwide. They compared people's emotions, i.e., anger, hate, fear, happiness, sadness, and surprise. To study the sentiment dynamics of residents of the Australian state of New South Wales (NSW) throughout the pandemic, [50] retrieved five months' worth of COVID-19-related tweets from Twitter. They grouped tweets by local government area (LGA) and tracked dynamic mood shifts over time. To dynamically assess the subject and mood of 13 million tweets about COVID-19, [51] devised a unique methodology. In addition, in [52], the authors carried out a cross-sectional study to investigate the impact of negative emotions and risk perception of health practitioners during the COVID-19 pandemic. Despite several issues with social media data's biases, confounding, and representativeness [53], social media platforms have an estimated 3.96 billion users worldwide. Several methodologies have been used based on characteristics including Part of Speech (POS), uni-grams, bi-grams, statistical techniques, and word and sentence embedding [54]. In [55], the authors proposed a crowd-sourcing method to estimate moral elevation in medical data to facilitate well-being. Word embedding in Deep Learning (DL) models has received greater attention recently [54]. In [56], the authors developed a DL-assisted model to distinguish between positive and negative emotions in medical data. In [57], Doc2Vec and Word2Vec were used for the sentiment analysis of medical documents. When assessing unsupervised models for the medical domain, the study's authors also employed WordNet's Welsh statistic. In addition, in [58], the authors employed an attentive multi-tasking ML model to recognize emotions for sustainable and livable environment
development. Social media continues to be a great source of textually rich, semantic data with excellent opportunities to monitor various social interaction-related characteristics, particularly conversations on public health challenges.

C. SENTIMENT ANALYSIS AND TOPIC MODELING
The research works in [59], [60], [61], [62], [63], and [64] performed either topic modeling or sentiment analysis on COVID-19 data. In contrast, very few studies have integrated topic modeling and sentiment analysis of COVID-19 data. Chandrasekaran et al. [65] used methods like LDA and VADER to analyze the trends and opinions expressed in tweets concerning the COVID-19 pandemic. Xue et al. [66] investigated tweets for assessing public sentiment and conversation during the COVID-19 epidemic. They employed the LDA approach for topic modeling. In addition, in [67], the authors suggested a hybrid NLP model based on multi-layered features to analyze the hidden insights of the data. Similarly, in [68], the authors suggested a neural network (NN) based approach to transform the representations into a unified latent vector space. In [69], the authors investigated emotion using traditional ML in interactive education systems. Similarly, in [70], the authors attempted to investigate differences in behavioural embedding between two entities. Our research fits into a category of studies that use topic modeling and sentiment analysis to evaluate COVID-19 data. Although existing research contributes positively to our understanding of cross-cultural COVID-19 news, the nature of our COVID-19 data (research articles) and the use of topic modeling and textual similarities for sentiment categorization make this research necessary.

III. MATERIALS AND METHODS
In this section, we present a detailed methodology of the proposed architecture for COVID-19 analysis. The main stages of this study, shown in Fig 1, are as follows:
• Data Collection
• Data Preprocessing
• Word Cloud Generation
• Topic Modeling
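As a rough illustration of the preprocessing stage in the list above, the following sketch tokenizes two made-up abstracts and counts adjacent word pairs and triples. It is only a minimal sketch under stated assumptions: the function names `tokenize` and `ngrams`, the regular expression, and the sample sentences are hypothetical and are not taken from the study, which does not specify its tokenizer.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return word tokens (hyphenated terms stay whole)."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens (n=2: bigrams, n=3: trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical abstracts, used only to exercise the functions.
abstracts = [
    "Public health measures reduce virus transmission.",
    "Vaccine research targets virus transmission.",
]

token_lists = [tokenize(a) for a in abstracts]
bigram_counts = Counter(bg for toks in token_lists for bg in ngrams(toks, 2))
trigram_counts = Counter(tg for toks in token_lists for tg in ngrams(toks, 3))
```

Counting n-grams per document in this way gives the co-occurrence edges on which a bigram or trigram network can be built; a production pipeline would more likely rely on an NLP library's tokenizer and phrase detector.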
D. TOKENIZATION
The second pre-processing stage involves breaking down the phrases and sentences of each abstract into smaller parts, such as individual words. Tokenization turns each newly acquired smaller unit into a separate entity known as a token. The extracted tokens help create more accurate models and identify the context of the analyzed text.

E. BIGRAMS/TRIGRAMS
A bi-gram is a connection between two adjacent words in the abstract. In contrast to co-word use, a bigram occurrence only counts as a relation (edge) if the two words are placed one after the other in a sentence. Trigrams use a similar relationship model: in a trigram of three words, there is an additional edge between the first and third words.

IV. TOPIC MODELING OF COVID-19 PUBLICATIONS
Topic modeling is an unsupervised technique for classifying documents. Project management and engineering studies have gradually embraced the topic modeling technique known as LDA. LDA is an unsupervised ML method that can identify the main topics from a collection of unlabeled texts. Each document in LDA is considered as a probabilistic mixture of topics. The likelihood of the corpus is

$$P(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int_{\theta_d} P(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{k} \theta_{dk} \, \beta_{k w_n} \right) d\theta_d$$

LDA assumes the following generative process given a corpus D made up of M documents, each of length $N_d$:
• Draw $\theta_d \sim \mathrm{Dir}(\alpha)$, where $d \in \{1, 2, \cdots, M\}$. $\mathrm{Dir}(\alpha)$ is a Dirichlet distribution with symmetric parameter $\alpha$, where $\alpha$ is frequently sparse.
• Draw $\beta_k \sim \mathrm{Dir}(\eta)$, where $k \in \{1, 2, \cdots, K\}$ and $\beta$ is often sparse.
• For the $n$th position in document $d$, where $n \in \{1, 2, \cdots, N_d\}$ and $d \in \{1, 2, \cdots, M\}$:
• Select a topic $z_{d,n} \sim \mathrm{Multinomial}(\theta_d)$ for the position.
• Fill the position with a word $w_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}})$, drawn from the word distribution of the topic chosen in the preceding step.
This work uses LDA to model themes and separately cover trending topics. In topic modeling, the number of subjects is a significant variable. We utilize the coherence score to determine the number of topics to make
these subjects interpretable by humans. The coherence score in Equation 1 helps separate themes that humans can understand from artifacts of statistical inference.
The coherence measure takes the top n most frequent words in each topic and averages the pairwise scores of those words $w_1, \ldots, w_n$. Finally, we obtain the total coherence score for the current topics. The number of topics was evaluated across two validation sets, with the two hyperparameters fixed at 0.01 and 0.1. We selected the number of subjects to be between 1 and 100. We decided on 10 topics since the findings show that this number produces the maximum coherence score, and we use LDA topic modeling to analyze the abstracts.

$$\mathrm{Coherence} = \sum_{i<j} \mathrm{score}(w_i, w_j) \qquad (1)$$

A. GRID BASED DETERMINATION
We used a grid-search optimization approach to determine the number of subjects K that results in the most compelling model. A detailed overview of documents and word probability is given in Figure 4. After training baseline models spanning the range of K, the C_v coherence measure was computed to estimate the ideal number of topics K for the corpus of abstracts. The topic model's coherence score C_UMass averages the coherence ratings for each subject included in the model. The log causes C_UMass to produce negative values, with values closer to 0 denoting topics that are more easily understood by humans.

$$C_{\mathrm{UMass}}\left(k; W^{(k)}\right) = \frac{2}{N(N-1)} \sum_{i<j} \log \frac{D\left(w_i^{(k)}, w_j^{(k)}\right) + \varepsilon}{D\left(w_i^{(k)}\right)}$$

$$C_{\mathrm{UMass}} = \frac{1}{K} \sum_{k=1}^{K} C_{\mathrm{UMass}}\left(k; W^{(k)}\right) \qquad (2)$$

In Equation 2, $D(w_i)$ is the count of documents containing the word $w_i$, $D(w_i, w_j)$ is the count of documents containing both words $w_i$ and $w_j$, and $W^{(k)} = \left(w_1^{(k)}, \ldots, w_N^{(k)}\right)$ is the list of the N most probable words of topic k [73].

V. NON-NEGATIVE MATRIX FACTORIZATION (NMF)
Non-negative Matrix Factorization (NMF) was utilized to extract and identify the underlying topics from a vast collection of research articles. In the vector space model, the non-negative matrix has size d x n, where d represents the number of words (vocabulary size) and n represents the total number of documents.

A. NMF FOR COVID-19 PUBLICATIONS
In Non-negative Matrix Factorization (NMF), the corpus matrix $Z \in \mathbb{R}^{d \times n}_{\geq 0}$ is factorized into two low-rank non-negative matrices: $W \in \mathbb{R}^{d \times x}_{\geq 0}$, known as the dictionary matrix, and $H \in \mathbb{R}^{x \times n}_{\geq 0}$, known as the coding matrix. This factorization is accomplished by solving the optimization problem described in Equation 3:

$$\inf_{W \in \mathbb{R}^{d \times x}_{\geq 0},\; H \in \mathbb{R}^{x \times n}_{\geq 0}} \lVert Z - WH \rVert_F^2 \qquad (3)$$

where $\lVert A \rVert_F^2 = \sum_{i,j} A_{ij}^2$ denotes the squared Frobenius norm of matrix A. NMF is typically solved with an iterative optimization algorithm. However, it has a significant drawback: the objective function is usually non-convex and possesses multiple local minima. As a result, different random initializations of the NMF procedure can lead to different matrix factorizations. This variability impacts how the results are interpreted, including the topic vector representations in W and the relevance between articles and topics in H.
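To make the factorization in Equation 3 and its sensitivity to initialization concrete, the sketch below runs NMF from two different random seeds on a tiny matrix. It is a minimal sketch under stated assumptions: the multiplicative update rule used here is one common solver and is not necessarily the one used in this study, and the helper names (`matmul`, `transpose`, `frobenius_sq`, `nmf`), the `seed` and `iters` parameters, and the 4 x 4 toy matrix are all hypothetical.

```python
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def frobenius_sq(Z, W, H):
    """Squared Frobenius norm of Z - WH, the objective in Equation 3."""
    WH = matmul(W, H)
    return sum((z - r) ** 2 for z_row, r_row in zip(Z, WH)
               for z, r in zip(z_row, r_row))


def nmf(Z, x, seed, iters=200, eps=1e-9):
    """Factorize a d-by-n matrix Z into non-negative W (d-by-x) and H (x-by-n)."""
    rng = random.Random(seed)
    d, n = len(Z), len(Z[0])
    W = [[rng.random() for _ in range(x)] for _ in range(d)]
    H = [[rng.random() for _ in range(n)] for _ in range(x)]
    for _ in range(iters):
        # Multiplicative update: H <- H * (W^T Z) / (W^T W H)
        Wt = transpose(W)
        num, den = matmul(Wt, Z), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(h_row, n_row, d_row)]
             for h_row, n_row, d_row in zip(H, num, den)]
        # Multiplicative update: W <- W * (Z H^T) / (W H H^T)
        Ht = transpose(H)
        num, den = matmul(Z, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(w_row, n_row, d_row)]
             for w_row, n_row, d_row in zip(W, num, den)]
    return W, H


# Hypothetical 4-word by 4-document count matrix with two obvious "topics".
Z = [[3, 2, 0, 0],
     [2, 3, 0, 0],
     [0, 0, 1, 4],
     [0, 0, 4, 1]]

W_a, H_a = nmf(Z, x=2, seed=0)
W_b, H_b = nmf(Z, x=2, seed=1)
err_a = frobenius_sq(Z, W_a, H_a)
err_b = frobenius_sq(Z, W_b, H_b)
```

Comparing `W_a` with `W_b` (and `H_a` with `H_b`) shows how two runs can reach factorizations that differ in scaling or topic order even when their reconstruction errors are similar, which is the interpretation issue described above.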
Moreover, the choice of the number of topics, k, introduces another source of variability. Different combinations of initial values for W and H, along with varying values of k, produce different topics, thereby leading to different clustering results for the articles.

Algorithm 1 The Process NMF Algorithm
1: Input: corpus matrix X
2: Apply Non-negative Matrix Factorization (NMF) to decompose X into matrices W and H with x topics.
3: Select the optimal number of topics x* by using a threshold value in matrix H, categorizing the articles into topics Z_1, ..., Z_x*. Any articles that do not meet the threshold are placed in an ‘‘Extra Document’’ matrix Z_e.
4: while the number of articles assigned to a topic Z_i > m do
     Apply NMF to the sub-matrix Z_i to obtain W_i and H_i with x_i* sub-topics.
     Assign the documents to the topics by the threshold α in matrix H.
     Assign the rest to Z_e.
   end
5: for each article z_i in Z_e do
     Calculate the cosine similarity between z_i and each of the leaf topics.
     Assign z_i to the most similar topic.
   end
   Repeat the loop for each topic until every topic has fewer than m articles.

B. NMF TOPIC VISUALIZATION
We applied NMF topic visualization using the implementation of Algorithm 1, which groups the data into 10 topics.
After the topic visualization, topic 8, which relates to health and vaccines, was assigned the highest accuracy, and its subtopics were waste, supply, and environment. The public health topic led to articles related to public sentiment about the COVID-19 outbreak.

VI. RESULTS AND DISCUSSION
This section presents experimental results and analysis to evaluate the proposed topic modeling approach. Topic coherence considers the most frequent words in each generated topic and measures the semantic similarity between the words of a topic. Either UCI or UMass is used to perform the pairwise calculations, and the mean coherence score is calculated across all the topics of the model.

A. SENTIMENT ANALYSIS
This study used 32314 research publications and 428,265 words for sentiment analysis. Among the data, 22.1% had positive sentiments, 12.3% had negative sentiments, and the majority, 64.1%, had neutral opinions. Table 1 shows how many words were related to each month. Based on the results, COVID has a primarily neutral sentiment. Topic 1 records the highest number of positive words, and from there, the tone of the public toward the COVID crisis seems less optimistic. While positive sentiment decreased by 4.5%, 4.2%, and 3% in topic 2, topic 3, and topic 4, neutral sentiment increased by 2.7%, 1.7%, and 1.3% across those topics. In topic 2, there is also a high number of negative sentiments. Negative sentiments increase in the latter topic when compared to topic 1 and topic 3. Except for topic 2, the percentage of neutral tweets remains almost the same throughout the year. As the COVID crisis intensified, there was a drop in positive sentiments and a significant increase in neutral sentiments, as shown in Table 3.

B. TOPIC MODELING USING PROPOSED APPROACH
We extract topics from the published COVID-19 papers using the LDA model in Gensim. The number of
Furthermore, Fig 6 shows the frequency of the optimal topics. Topic 8 has the highest frequency, 22.97 percent, compared to the other listed topics. The average frequency percentage of topic 9 is 21.80, topic 6 is 21.79, topic 10 is

FIGURE 7. Overview of the global topics and associated terms for analyzing relationship between topic and terms.

D. WORD2VEC MODEL AND TEXTUAL COSINE SIMILARITIES
Word2Vec models train on a corpus of text to see which words tend to be used in a similar context. We built the word embedding using the Python library Gensim. In Word2Vec models, large corpora of text are used as inputs. As a result, each unique word in the corpus is represented by a vector. The Word2Vec model is shown in Fig 9. Cosine similarity is used to measure vector similarity; Table 4 and Table 5 show the cosine similarity between the terms ‘‘diagnostic’’ and ‘‘transmission’’ and their most similar word vectors.
Similar vectors are used to represent words that are semantically related to ‘‘origin’’ in the analysis of the original corpus. Coronaviruses are pathogens that can be passed from animals to humans. This type of transmission, known as zoonotic origin, occurs when a pathogen jumps from non-human animals to humans.
Similarly, terms like ‘‘fever,’’ ‘‘Patient,’’ ‘‘Transmission,’’ and ‘‘Vaccine’’ are close to the words that have been used to describe COVID-19 symptoms.

E. T-SNE VISUALIZATION OF SEMANTIC CLUSTERS
This technique was finally used to reduce the dimensionality of each word vector, allowing its 2D position to be projected along with its label. An ML algorithm, K-means, was also implemented using the Scikit-learn Python library to partition the n words into semantic clusters. In the elbow method, K was determined from the sum of squared distances of the words to their cluster centers for K in the range [1, 30]. If the plot looks like an arm, the elbow on the arm is the optimal K. Here, K = 7.
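The elbow method described above can be sketched as follows. This is a hedged illustration rather than the study's code: the study reports using Scikit-learn's K-means, whereas this sketch uses a plain-Python Lloyd's iteration so that it runs without dependencies, and the function names, the `seed` and `iters` parameters, and the synthetic 2D points (stand-ins for projected word vectors) are all assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_inertia(points, k, seed=0, iters=50):
    """Run Lloyd's k-means and return the within-cluster sum of squared distances."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centers[c]
            for c, cl in enumerate(clusters)
        ]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Synthetic 2D points forming three well-separated groups.
offsets = (-0.5, 0.0, 0.5)
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx in offsets for dy in offsets]

# Inertia for each candidate K; the "elbow" is where the curve stops dropping sharply.
inertias = {k: kmeans_inertia(points, k) for k in range(1, 7)}
```

Plotting `inertias` against K and looking for the bend gives the candidate number of clusters; with real embeddings the bend is usually less sharp than on synthetic data, so the chosen K remains a judgment call.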
From the above t-SNE visualization of Word2Vec embeddings, we can distinguish several clusters in which we can recognize semantic similarities, including medical treatment, government policies and measures, vaccine research, epidemiological research, and COVID-19 detection, transmission, causes, and consequences of the disease. Inter-word distance in the 2D plane is an indication of inter-word similarity.

F. WORD MOVER'S DISTANCE (WMD)
WMD is a tool for measuring the distance between two documents based on their word embeddings. The similarity between topics and subtopics is computed by representing each topic with its most related terms in the Word2Vec embedding. Analyzing the similarities between topics indicates a lower score with other correlated topics. In Table 5, the similarity metrics in Topic 3 and Topic 4, Dyspnea and fever also, and cough and myalgia in
aims to provide a detailed analysis of public sentiment for the COVID-19 study. This research makes several
surrounding COVID-19. We use a new approach that involves contributions. First, we summarize the COVID-19 Publi-
research publications and advanced techniques such as cations using topic modeling, including the most pertinent
LDA modeling and sentiment analysis. By building on terminology, major research themes, and emerging trends.
existing studies, we aim to improve our understanding of Many articles have been published about the virus’s gene
the pandemic’s impact and provide valuable insights into analysis. Government regulations and their effect have been
the scientific community’s efforts in combating COVID-19. discussed in particular articles. In the interim, the vaccination
Table 6 presents a comprehensive comparison between by the end of 2020, although not yet in the complete
our study and existing research using a similar approach, discussion. Furthermore, the proposed research not only adds
highlighting key differences and similarities in methodology. to the technique by using literature analysis but also provides
practical insights. The comparative study of topic extraction
2) LIMITATIONS AND FUTURE WORKS from full paper texts against their related abstracts can assist
It’s important to note the limitations of this study. We have us in comprehending the impact of the various texts based on
identified the issue to understand the medical treatment, topic modeling analysis findings. This research shows that
governmental rules and regulations, vaccination research, extracting ideas from abstracts could be more effective than
epidemiological research, and the detection, transmission, causes, and effects of COVID-19.
However, our study is not exhaustive: other aspects, and other methods, could be explored in
future research. The analysis covers only the COVID-19 pandemic literature. We contend that our
approach performs well on the sizable COVID-19 data set. Additionally, we concentrate on the
literature analysis, which includes research themes, research trends, and topic similarity
networks. All of these are general information and not specific medical knowledge. Consequently,
our study framework might be applied to various types of literature.

VII. CONCLUSION
The study aims to provide a framework for topic modeling to aid in analyzing the research themes
and patterns surrounding the newly developing research topic. We compared topic modeling on
full-text papers and matching abstracts, using COVID-19 as a case, to determine the impact of
various document formats used as topic modeling input. With the help of the topic modeling
approach, the presented work shows the common research themes, trends, and similarity networks

full text because they might convey the same information with fewer words. Third, for librarians
or documentalists to effectively manage the literature on a particular subject, the current study
offers a practical methodological framework that may be used in any field. Understanding the
results may be aided by our LDA-based topic modeling, word-cloud subject visualization, and the
trends of essential terms.
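The comparison of abstract and full-text inputs mentioned in the conclusion can be illustrated with a small sketch. It assumes a UMass-style topic coherence score averaged over word pairs, which is one common way to judge topic quality; the text above does not state which measure was used, so this choice is an assumption. The function name `umass_coherence`, the topic words, and the toy documents are all hypothetical and are not taken from the study's data.

```python
import math


def umass_coherence(top_words, documents):
    """Mean UMass-style coherence of a topic's top words over tokenized documents."""
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    total, pairs = 0.0, 0
    for i in range(1, len(top_words)):
        for j in range(i):
            d_j = doc_freq(top_words[j])
            if d_j == 0:
                continue  # the word never occurs in this corpus, so skip the pair
            co_occurrences = doc_freq(top_words[i], top_words[j])
            total += math.log((co_occurrences + 1) / d_j)
            pairs += 1
    return total / pairs if pairs else float("nan")


# Hypothetical toy corpora: the same three papers as abstracts and as full texts.
topic_words = ["virus", "transmission", "symptom"]
abstracts = [
    ["virus", "transmission", "symptom"],
    ["virus", "vaccine"],
    ["transmission", "symptom"],
]
full_texts = [
    ["virus", "transmission", "symptom", "method", "data"],
    ["virus", "vaccine", "transmission", "trial"],
    ["transmission", "symptom", "virus", "cohort"],
]

abstract_score = umass_coherence(topic_words, abstracts)
full_text_score = umass_coherence(topic_words, full_texts)
print(f"abstracts: {abstract_score:.3f}, full text: {full_text_score:.3f}")
```

Scoring the same topic words against both corpora gives one number per document format, so the two inputs can be compared directly. With real data, the top words would come from a fitted LDA model rather than a hand-written list.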
[6] V. C. C. Cheng, S. K. P. Lau, P. C. Y. Woo, and K. Y. Yuen, ‘‘Severe acute [31] Q. Liu, Z. Zheng, J. Zheng, Q. Chen, G. Liu, S. Chen, B. Chu,
respiratory syndrome coronavirus as an agent of emerging and reemerging H. Zhu, B. Akinwunmi, J. Huang, C. J. P. Zhang, and W.-K. Ming,
infection,’’ Clin. Microbiol. Rev., vol. 20, no. 4, pp. 660–694, Oct. 2007. ‘‘Health communication through news media during the early stage of the
[7] J. F. W. Chan, S. K. P. Lau, K. K. W. To, V. C. C. Cheng, P. C. Y. Woo, COVID-19 outbreak in China: Digital topic modeling approach,’’ J. Med.
and K.-Y. Yuen, ‘‘Middle east respiratory syndrome coronavirus: Another Internet Res., vol. 22, no. 4, Apr. 2020, Art. no. e19118.
zoonotic betacoronavirus causing SARS-like disease,’’ Clin. Microbiol. [32] E. De Santis, A. Martino, and A. Rizzi, ‘‘An infoveillance system for
Rev., vol. 28, no. 2, pp. 465–522, Apr. 2015. detecting and tracking relevant topics from Italian tweets during the
[8] L. E. Gralinski and R. S. Baric, ‘‘Molecular pathology of emerging COVID-19 event,’’ IEEE Access, vol. 8, pp. 132527–132538, 2020.
coronavirus infections,’’ J. Pathol., vol. 235, no. 2, pp. 185–195, Jan. 2015. [33] S. Noor, Y. Guo, S. H. H. Shah, P. Fournier-Viger, and M. S.
[9] (2001). World Health Organization—COVID-19. Accessed: Aug. 2022. Nawaz, ‘‘Analysis of public reactions to the novel coronavirus (COVID-
[Online]. Available: https://covid19.who.int/ 19) outbreak on Twitter,’’ Kybernetes, vol. 50, no. 5, pp. 1633–1653,
[10] S. Ahamed and M. Samad, ‘‘Information mining for COVID-19 research May 2021.
from a large volume of scientific literature,’’ 2020, arXiv:2004.02085. [34] U. Naseem, I. Razzak, M. Khushi, P. W. Eklund, and J. Kim,
[11] F. Hu, L. Qiu, and H. Zhou, ‘‘Medical device product innovation choices ‘‘COVIDSenti: A large-scale benchmark Twitter data set for COVID-19
in Asia: An empirical analysis based on product space,’’ Frontiers Public sentiment analysis,’’ IEEE Trans. Computat. Social Syst., vol. 8, no. 4,
Health, vol. 10, Apr. 2022, Art. no. 871575. pp. 1003–1015, Aug. 2021.
[12] S. Malla and P. J. A. Alphonse, ‘‘COVID-19 outbreak: An ensemble pre- [35] K. Garcia and L. Berton, ‘‘Topic detection and sentiment analysis in
trained deep learning model for detecting informative tweets,’’ Appl. Soft Twitter content related to COVID-19 from Brazil and the USA,’’ Appl. Soft
Comput., vol. 107, Aug. 2021, Art. no. 107495. Comput., vol. 101, Mar. 2021, Art. no. 107057.
[13] Q. Chen, R. Leaman, A. Allot, L. Luo, C.-H. Wei, S. Yan, and Z. Lu, [36] D. S. Abdelminaam, F. H. Ismail, M. Taha, A. Taha, E. H. Houssein,
‘‘Artificial intelligence in action: Addressing the COVID-19 pandemic and A. Nabil, ‘‘CoAID-DEEP: An optimized intelligent framework for
with natural language processing,’’ Annu. Rev. Biomed. Data Sci., vol. 4, automated detecting COVID-19 misleading information on Twitter,’’ IEEE
no. 1, pp. 313–339, Jul. 2021. Access, vol. 9, pp. 27840–27867, 2021.
[14] S. Praveen and R. Ittamalla, ‘‘An analysis of attitude of general public [37] D. Konar, B. K. Panigrahi, S. Bhattacharyya, N. Dey, and R. Jiang, ‘‘Auto-
toward COVID-19 crises—Sentimental analysis and a topic modeling diagnosis of COVID-19 using lung CT images with semi-supervised
study,’’ Inf. Discovery Del., vol. 49, no. 3, pp. 240–249, Sep. 2021. shallow learning network,’’ IEEE Access, vol. 9, pp. 28716–28728, 2021.
[15] T. Tayir and L. Li, ‘‘Unsupervised multimodal machine translation for low- [38] L. L. Wang and K. Lo, ‘‘Text mining approaches for dealing with the
resource distant language pairs,’’ ACM Trans. Asian Low-Resour. Lang. Inf. rapidly expanding literature on COVID-19,’’ Briefings Bioinf., vol. 22,
Process., vol. 23, no. 4, pp. 1–22, Apr. 2024. no. 2, pp. 781–799, Mar. 2021.
[39] S. Madichetty and M. Sridev, ‘‘A novel method for identifying the damage
[16] K. A. R. Issam, S. Patel, and C. N. Subalalitha, ‘‘Topic modeling based
assessment tweets during disaster,’’ Future Gener. Comput. Syst., vol. 116,
extractive text summarization,’’ 2021, arXiv:2106.15313.
pp. 440–454, Mar. 2021.
[17] W. Dang, L. Cai, M. Liu, X. Li, Z. Yin, X. Liu, L. Yin, and W. Zheng,
[40] S. Madichetty, S. Muthukumarasamy, and P. Jayadev, ‘‘Multi-modal
‘‘Increasing text filtering accuracy with improved LSTM,’’ Comput.
classification of Twitter data during disasters for humanitarian response,’’
Informat., vol. 42, no. 6, pp. 1491–1517, 2023.
J. Ambient Intell. Hum. Comput., vol. 12, no. 11, pp. 10223–10237,
[18] S. Pan, G. J. Xu, K. Guo, S. H. Park, and H. Ding, ‘‘Cultural insights in
Nov. 2021.
souls-like games: Analyzing player behaviors,’’ IEEE Trans. Games, 2024.
[41] X. Li and Y. Sun, ‘‘Application of RBF neural network optimal
[19] R. Sandhiya, A. Boopika, M. Akshatha, S. Swetha, and N. Hariharan, segmentation algorithm in credit rating,’’ Neural Comput. Appl., vol. 33,
‘‘A review of topic modeling and its application,’’ in Handbook of no. 14, pp. 8227–8235, Jul. 2021.
Intelligent Computing and Optimization for Sustainable Development,
[42] K. Chakraborty, S. Bhatia, S. Bhattacharyya, J. Platos, R. Bag, and
2022, pp. 305–322.
A. E. Hassanien, ‘‘Sentiment analysis of COVID-19 tweets by deep
[20] C. Huang, Z. Han, M. Li, X. Wang, and W. Zhao, ‘‘Sentiment evolution learning classifiers—A study to show how popularity is affecting accuracy
with interaction levels in blended learning environments: Using learning in social media,’’ Appl. Soft Comput., vol. 97, Dec. 2020, Art. no. 106754.
analytics and epistemic network analysis,’’ Australas. J. Educ. Technol.,
[43] H. Jelodar, Y. Wang, R. Orji, and S. Huang, ‘‘Deep sentiment classification
vol. 37, no. 2, pp. 81–95, May 2021.
and topic discovery on novel coronavirus or COVID-19 online discussions:
[21] Y. Ban, Y. Liu, Z. Yin, X. Liu, M. Liu, L. Yin, X. Li, and W. NLP using LSTM recurrent neural network approach,’’ IEEE J. Biomed.
Zheng, ‘‘Micro-directional propagation method based on user clustering,’’ Health Informat., vol. 24, no. 10, pp. 2733–2742, Oct. 2020.
Comput. Informat., vol. 42, no. 6, pp. 1445–1470, 2023. [44] L. Carnevale, A. Celesti, G. Fiumara, A. Galletta, and M. Villari,
[22] H. Chen, C. Lv, L. Ding, H. Qin, X. Zhou, Y. Ding, X. Liu, M. Zhang, ‘‘Investigating classification supervised learning approaches for the
J. Guo, X. Liu, and D. Tao, ‘‘DB-LLM: Accurate dual-binarization for identification of critical patients’ posts in a healthcare social network,’’
efficient LLMs,’’ 2024, arXiv:2402.11960. Appl. Soft Comput., vol. 90, May 2020, Art. no. 106155.
[23] S. Pan, G. J. W. Xu, K. Guo, S. H. Park, and H. Ding, ‘‘Video-based [45] H. Qi, Z. Zhou, J. Irizarry, D. Lin, H. Zhang, N. Li, and J. Cui, ‘‘Automatic
engagement estimation of game streamers: An interpretable multimodal identification of causal factors from fall-related accident investigation
neural network approach,’’ IEEE Trans. Games, 2023. reports using machine learning and ensemble learning approaches,’’
[24] H. Qin, M. Zhang, Y. Ding, A. Li, Z. Cai, Z. Liu, F. Yu, and X. Liu, J. Manage. Eng., vol. 40, no. 1, Jan. 2024, Art. no. 04023050.
‘‘BiBench: Benchmarking and analyzing network binarization,’’ in Proc. [46] P. Kairon and S. Bhattacharyya, ‘‘COVID-19 outbreak prediction using
Int. Conf. Mach. Learn., 2023, pp. 28351–28388. quantum neural networks,’’ in Intelligence Enabled Research. Springer,
[25] W. Huang, Y. Liu, H. Qin, Y. Li, S. Zhang, X. Liu, M. Magno, and X. Qi, 2021, pp. 113–123.
‘‘BiLLM: Pushing the limit of post-training quantization for LLMs,’’ 2024, [47] Y. Zhang, H. Lyu, Y. Liu, X. Zhang, Y. Wang, and J. Luo, ‘‘Monitoring
arXiv:2402.04291. depression trend on Twitter during the COVID-19 pandemic,’’ 2020,
[26] H. Qin, Y. Ding, M. Zhang, Q. Yan, A. Liu, Q. Dang, Z. Liu, and X. Liu, arXiv:2007.00228.
‘‘BiBERT: Accurate fully binarized BERT,’’ 2022, arXiv:2203.06390. [48] K. Chatsiou, ‘‘Text classification of COVID-19 press briefings using BERT
[27] P. S. A. Babu, C. S. Rao Annavarapu, and A. Mohapatra, ‘‘A novel method and convolutional neural networks,’’ Tech. Rep., 2020.
for next-generation sequence data analysis using PLSA topic modeling [49] X. Li, M. Zhou, J. Wu, A. Yuan, F. Wu, and J. Li, ‘‘Analyzing
technique,’’ in Proc. 2nd Int. Conf. Adv. Comput. Commun. Paradigms COVID-19 on online social media: Trends, sentiments and emotions,’’
(ICACCP), Feb. 2019, pp. 1–6. 2020, arXiv:2005.14464.
[28] T. Zhou, Z. Cai, F. Liu, and J. Su, ‘‘In pursuit of beauty: Aesthetic- [50] J. Zhou, H. Zogan, S. Yang, S. Jameel, G. Xu, and F. Chen, ‘‘Detect-
aware and context-adaptive photo selection in crowdsensing,’’ IEEE Trans. ing community depression dynamics due to COVID-19 pandemic in
Knowl. Data Eng., 2023. Australia,’’ IEEE Trans. Computat. Social Syst., vol. 8, no. 4, pp. 982–991,
[29] I. Vayansky and S. A. P. Kumar, ‘‘A review of topic modeling methods,’’ Aug. 2021.
Inf. Syst., vol. 94, Dec. 2020, Art. no. 101582. [51] H. Yin, S. Yang, and J. Li, ‘‘Detecting topic and sentiment dynamics due
[30] Y. Bai, S. Jia, and L. Chen, ‘‘Topic evolution analysis of COVID-19 news to COVID-19 pandemic using social media,’’ in Proc. Int. Conf. Adv. Data
articles,’’ J. Phys., Conf. Ser., vol. 1601, no. 5, Aug. 2020, Art. no. 052009. Mining Appl. Springer, 2020, pp. 610–623.
[52] J. Li, C. Huang, Y. Yang, J. Liu, X. Lin, and J. Pan, ‘‘How nursing [74] S. Syed and M. Spruit, ‘‘Full-text or abstract? Examining topic coherence
students’ risk perception affected their professional commitment during scores using latent Dirichlet allocation,’’ in Proc. IEEE Int. Conf. Data Sci.
the COVID-19 pandemic: The mediating effects of negative emotions Adv. Analytics (DSAA), Oct. 2017, pp. 165–174.
and moderating effects of psychological capital,’’ Humanities Social Sci. [75] N. Aletras and M. Stevenson, ‘‘Evaluating topic coherence using
Commun., vol. 10, no. 1, pp. 1–9, May 2023. distributional semantics,’’ in Proc. 10th Int. Conf. Comput. Semantics
[53] A. E. Aiello, A. Renson, and P. N. Zivich, ‘‘Social media- and internet- (IWCS), 2013, pp. 13–22.
based disease surveillance for public health,’’ Annu. Rev. Public Health, [76] M. Röder, A. Both, and A. Hinneburg, ‘‘Exploring the space of topic
vol. 41, no. 1, pp. 101–118, Apr. 2020. coherence measures,’’ in Proc. 8th ACM Int. Conf. Web Search Data
[54] S. A. Waheeb, N. A. Khan, B. Chen, and X. Shang, ‘‘Machine learning Mining, Feb. 2015, pp. 399–408.
based sentiment text classification for evaluating treatment quality of [77] K. Stevens, P. Kegelmeyer, D. Andrzejewski, and D. Buttler, ‘‘Exploring
discharge summary,’’ Information, vol. 11, no. 5, p. 281, May 2020. topic coherence over many models and many topics,’’ in Proc. Joint Conf.
[55] C. Bao, X. Hu, D. Zhang, Z. Lv, and J. Chen, ‘‘Predicting moral elevation Empirical Methods Natural Lang. Process. Comput. Natural Lang. Learn.,
conveyed in Danmaku comments using EEGs,’’ Cyborg Bionic Syst., vol. 4, 2012, pp. 952–961.
p. 28, Jan. 2023. [78] T. Ahammad, ‘‘Identifying hidden patterns of fake COVID-19 news: An
[56] X. Si, H. He, J. Yu, and D. Ming, ‘‘Cross-subject emotion recognition in-depth sentiment analysis and topic modeling approach,’’ Natural Lang.
brain–computer interface based on fNIRS and DBJNet,’’ Cyborg Bionic Process. J., vol. 6, Mar. 2024, Art. no. 100053.
Syst., vol. 4, p. 45, Jan. 2023. [79] R. Xie, S. K. W. Chu, D. K. W. Chiu, and Y. Wang, ‘‘Exploring public
response to COVID-19 on Weibo with LDA topic modeling and sentiment
[57] Q. Chen and M. Sokolova, ‘‘Specialists, scientists, and sentiments:
analysis,’’ Data Inf. Manage., vol. 5, no. 1, pp. 86–99, Jan. 2021.
Word2Vec and Doc2Vec in analysis of scientific and medical texts,’’ Social
[80] S. Gyftopoulos, G. Drosatos, G. Fico, L. Pecchia, and E. Kaldoudi,
Netw. Comput. Sci., vol. 2, no. 5, pp. 1–11, 2021.
‘‘Analysis of pharmaceutical Companies’ social media activity during the
[58] H. Zhang, H. Liu, and C. Kim, ‘‘Semantic and instance segmentation in
COVID-19 pandemic and its impact on the public,’’ Behav. Sci., vol. 14,
coastal urban spatial perception: A multi-task learning framework with an
no. 2, p. 128, Feb. 2024.
attention mechanism,’’ Sustainability, vol. 16, no. 2, p. 833, Jan. 2024.
[81] N. Thakur, ‘‘Sentiment analysis and text analysis of the public discourse on
[59] J. Samuel, G. G. M. N. Ali, M. M. Rahman, E. Esawi, and Y. Samuel, Twitter about COVID-19 and MPox,’’ Big Data Cognit. Comput., vol. 7,
‘‘COVID-19 public sentiment insights and machine learning for tweets no. 2, p. 116, Jun. 2023.
classification,’’ Information, vol. 11, no. 6, p. 314, Jun. 2020. [82] M. Costola, O. Hinz, M. Nofer, and L. Pelizzon, ‘‘Machine learning
[60] A. S. Imran, S. M. Daudpota, Z. Kastrati, and R. Batra, ‘‘Cross- sentiment analysis, COVID-19 news and stock market reactions,’’ Res. Int.
cultural polarity and emotion detection using sentiment analysis and Bus. Finance, vol. 64, Jan. 2023, Art. no. 101881.
deep learning on COVID-19 related tweets,’’ IEEE Access, vol. 8,
pp. 181074–181090, 2020.
[61] S. Siddiqui, M. S. Faisal, S. Khurram, A. Irshad, M. Baz, H. Hamam,
N. Iqbal, and M. Shafiq, ‘‘Quality prediction of wearable apps in
the Google play store,’’ Intell. Autom. Soft Comput., vol. 32, no. 2,
pp. 877–892, 2022.
[62] S. Boon-Itt and Y. Skunkan, ‘‘Public perception of the COVID-19
AMREEN BATOOL received the bachelor's degree from GC University, Pakistan, the M.C.S. degree
from Virtual University of Pakistan, and the M.S. degree in computer science and technology
from Tiangong University, Tianjin, China, in 2021. She is currently pursuing the Ph.D. degree
with the Department of Electronic Engineering, Jeju National University, Republic of Korea.
She is a Project Coordinator with EUT Global Ltd. Her main role is to coordinate with clients
and field engineers to plan project delivery. Her research interests include machine learning,
deep learning, and blockchain technology.

YUNG-CHEOL BYUN received the B.S. degree from Jeju National University, in 1993, and the M.S.
and Ph.D. degrees from Yonsei University, in 1995 and 2001, respectively. He was a Special
Lecturer with SAMSUNG Electronics, in 2000 and 2001. From 2001 to 2003, he was a Senior
Researcher with the Electronics and Telecommunications Research Institute (ETRI). He was
promoted to join Jeju National University as an Assistant Professor, in 2003. He is currently
an Associate Professor with the Computer Engineering Department, Jeju National University. His
research interests include the areas of AI machine learning, pattern recognition, blockchain
and deep learning-based applications, big data and knowledge discovery, time series data
analysis and prediction, image processing and medical applications, and recommendation systems.