1. Introduction
In the era of big data, the number of netizens has been increasing annually [1]. Social media provides important platforms for Internet users to express and exchange their views, as well as to obtain information. Once a significant event occurs, the public tends to express their attention to and perception of the event on social media platforms, leading to the dissemination and discussion of the topic. In particular, from the end of 2019 to April 2020, the emergence of COVID-19 and the associated lockdown policies prompted the public to exchange information on social media platforms in order to learn about the epidemic [2]. User-generated content (UGC) on social media platforms such as WeChat, Sina Weibo, and Twitter has become an important source for capturing social sentiment and analyzing public opinion. Moreover, topic classification plays a pivotal role in public opinion analysis, on which a considerable amount of literature [2,3,4,5,6] has been published regarding COVID-19. Although the value of revealing public opinions and emotions through social media has been proven in extensive research [7,8,9], the following challenges still exist. First, there are two sources of geotagged posts: user registration information, which is coarse-grained, and check-in posts, which are fine-grained and shared voluntarily, but only by a minority of users. Second, how to effectively mine valuable topic information from word-limited and casually written social media posts requires further research. Finally, obtaining an appropriate classification of topics is one of the most frequently stated problems when using probabilistic topic models for topic mining, as the number of topics affects the generation results of the model. Although many methods [10] have been proposed to assess the optimal number of topics, their results are often inconsistent with people's subjective perceptions, and the lack of semantic interpretability of topics has remained a classification problem for many years. Therefore, it remains a challenge to obtain public opinion topics, including their spatial and temporal distribution, at a fine-grained scale within a city.
To analyze topics within online public opinion for a city during the COVID-19 epidemic, we propose a method that addresses some of the problems mentioned above. First, we establish the correlation between ground truth data related to the epidemic and check-in posts on social media, revealing whether it is feasible to use only check-in posts for public opinion analysis. Furthermore, an algorithm for evaluating the optimal topic classification of public opinions is proposed, and the temporal and spatial distribution of various topics, based on the optimal classification results, is further discussed.
For this paper, a standard corpus was adopted to test the proposed evaluation method, and check-in posts on Weibo were applied to solve practical problems arising during the COVID-19 epidemic. Check-in posts are one of the geotagged post types on Weibo and contain rich textual, temporal, and spatial information. Based on check-in microblogs in Wuhan from December 2019 to April 2020, COVID-19-related posts were extracted to explore the relationship between check-in posts and the epidemic. We then propose an evaluation method to find the optimal number of topics when the LDA topic model is used to classify public opinions; it substitutes a numerical expression for subjective human judgment from a semantic point of view using a BERT pre-trained model [11]. Finally, we analyzed the generated topics and the characteristics of their temporal and spatial distribution according to the check-in data, in order to provide a reference for the fine-grained monitoring, management, and scientific governance of public opinion.
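To make this pipeline concrete, the following is a minimal sketch of the topic-extraction step, assuming jieba for Chinese tokenization and gensim for LDA; the example posts and all parameter values are illustrative only, and the actual preprocessing and model settings used in this study are described later in the paper.

    # Minimal sketch: tokenize COVID-19-related check-in posts and fit an LDA model.
    # Example posts and parameter values are illustrative, not the study's actual data.
    import jieba
    from gensim import corpora
    from gensim.models import LdaModel

    posts = [
        "武汉加油，希望疫情早点结束",  # hypothetical check-in posts
        "今天小区开始封闭式管理了",
    ]

    # Tokenize with jieba; drop single-character tokens as a crude stopword filter
    docs = [[w for w in jieba.lcut(p) if len(w) > 1] for p in posts]

    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(d) for d in docs]

    # num_topics is exactly the parameter our evaluation method is designed to select
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=5,
                   passes=10, random_state=42)

    for topic_id, words in lda.show_topics(num_topics=5, num_words=8, formatted=False):
        print(topic_id, [w for w, _ in words])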
The remainder of this paper is structured as follows: Section 2 reviews the related studies in the literature. Section 3 explains the proposed method and other related methods. In Section 4, we compare our evaluation method with the other four using a standard corpus, then apply our approach to the COVID-19 case study and discuss the results. Section 5 provides our conclusions.
2. Related Works
Social media platforms, such as Weibo, Twitter, and Facebook, contain a variety of user-generated content (e.g., text, pictures, locations, and videos), making them content-rich and more convenient than traditional questionnaire surveys for obtaining large amounts of research data. Research on social media has thus become increasingly popular. Several works [12,13,14,15] have utilized social media data to analyze public opinion, with a focus on natural disasters and public health emergencies. With the outbreak of COVID-19, social media quickly became an important platform for information generation and dissemination. Topic mining, sentiment analysis, the temporal and spatial distribution of topics, and similar perspectives on public opinion analysis have been the main concerns of the many relevant studies in the literature [3,6,16,17]. Han et al. [6] combined the LDA topic model with a random forest model to classify COVID-19-related posts on Weibo, then analyzed the temporal and spatial distribution of different topics at the national scale. Their results showed that the variation trend of topics was in sync with the development of the epidemic over time. In addition, the spatial distribution of different topics was relevant to various factors, including disease severity, population density, and so on. Chen et al. [17] used SnowNLP and K-means methods to carry out sentiment analysis and topic classification, respectively, on Weibo posts during the epidemic, and verified the correlation between netizen emotions and the epidemic situation in various places. Boon-Itt and Skunkan [18] identified six categories of tweet topics by LDA topic modeling based on the highest topic coherence, but disregarded temporal and spatial information. Zheng et al. [5] analyzed Twitter tweets by LDA to uncover temporal differences in nine topics that were identified by existing methods [19,20].
From the perspective of data sources, existing studies [6,21] have adopted administrative districts, derived from user registration information, as the analysis unit when analyzing the spatial distribution pattern of topics or sentiments with social media data. Such coarse-grained information ignores the value of check-in posts on social media, with which the distribution characteristics of public opinion topics within a city can be revealed.
From the perspective of analytical methods for public opinion topics, the Latent Dirichlet Allocation (LDA) topic model [22] is considered to have the advantage of identifying topics from massive text collections in an unsupervised manner [23], and is also one of the most widely used probabilistic topic models. Wang et al. [24] adopted LDA to determine the latent emergency topics in microblog text and the corresponding topic–word distributions. On this basis, they used the SVM method to classify new posts according to the appropriate topics, which provided decision support for emergency response. Wang and Han [2,6] utilized LDA to initially set text classification labels, then introduced a random forest model to obtain the classification result for public opinion topics. Amara et al. [25] exploited a multilingual Facebook corpus to track COVID-19 trends, with topics extracted by LDA topic modeling. However, these works did not provide a theoretical basis for the optimal number of topics in LDA. Numerous topic mining and recognition studies [24,26] have shown the practicality and effectiveness of the LDA model, but the number of topics in the model can affect the classification results [27].
By introducing the concept of a topic, the LDA model can represent text information in a lower-dimensional topic space, which makes it effective for text topic mining. To choose an appropriate number of topics, many methods have been proposed [19,20,28,29] besides traditional approaches such as calculating the perplexity [22]. However, these methods only perform well for the model theoretically, and do not necessarily extract practical topic information consistent with human experience. Under normal circumstances, the number of topics is obtained by experience or repeated experiments, which can lead to large errors [30]. In addition to experience-based selection approaches [31], perplexity-based approaches [22], Bayesian statistics methods [28], and the HDP method [32] are other classical methods used to obtain the appropriate number of topics in LDA; however, these methods suffer from problems such as high time complexity or a lack of logical derivation.
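As an illustration of the classical perplexity-based approach [22], the sketch below fits LDA for each candidate number of topics and selects the value minimizing held-out perplexity; the variable names follow the gensim conventions of the earlier sketch and are assumptions for illustration, not the exact procedure of the cited works.

    # Hedged sketch of perplexity-based selection of the number of topics.
    import numpy as np
    from gensim.models import LdaModel

    def perplexity_curve(corpus, heldout, dictionary, candidates):
        scores = {}
        for k in candidates:
            lda = LdaModel(corpus=corpus, id2word=dictionary,
                           num_topics=k, passes=10, random_state=42)
            # gensim's log_perplexity returns a per-word bound in base 2,
            # so perplexity = 2 ** (-bound)
            scores[k] = np.exp2(-lda.log_perplexity(heldout))
        return scores

    # e.g., scores = perplexity_curve(corpus, heldout, dictionary, range(2, 31, 2))
    #       best_k = min(scores, key=scores.get)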
Other methods [27] have been adopted to choose the optimal number of LDA topics, mainly based on the similarity between topics. Cao et al. [19] utilized the cosine similarity to describe the stability of the topic structure according to the topic–word probability distribution matrix generated by LDA, while Deveaud et al. [20] adopted the KL divergence to measure the similarity between probability distributions. Krasnov and Sen [10] proposed a clustering approach with a cDBI metric to assess the optimal number of topics, but this only worked well on a small collection of English documents. Some studies have attempted to identify the optimal number of topics from the perspective of statistical physics. Ignatenko et al. [33] proposed a fractal approach to find the optimal number of topics in three topic modeling algorithms, i.e., the PLSA, ARTM, and LDA models. Under the assumption that the transition region corresponds to the “true” number of topics, they identified a range of values, rather than a single answer, for the optimal number of topics. Koltcov [34] regarded the number of topics as the equivalent of temperature in a nonequilibrium complex system (i.e., the topic model); by calculating Rényi and Tsallis entropies based on the free energy of such a system, the optimal number of topics could be identified. These works [19,20,33,34] chose the optimal number of topics based on the probability distribution of the LDA model but did not consider the semantic relevance of the topics generated by LDA, which could improve the human interpretability of topics. LDA is based on the bag-of-words model, which ignores the contextual information between texts. Therefore, Wang et al. [27] improved the topic–word matrix generated by the LDA model, employing a topic–word vector matrix that adds the semantic information of words using the Word2Vec word embedding model; the pseudo-F statistic in adaptive clustering was then used to obtain the optimal number of topics. However, Word2Vec produces static word vectors and ignores the contextual information of words. Different from the above works, in this paper, we propose a new evaluation method that considers semantic similarity by incorporating a BERT pre-trained model in order to obtain the optimal number of topics for LDA.
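To illustrate the general idea, and not the exact metric defined later in this paper, the sketch below scores a single topic by the average pairwise cosine similarity of contextual BERT embeddings of its top words; the model name and the mean-pooling strategy are assumptions for illustration.

    # Sketch: semantic coherence of a topic's top words using BERT embeddings.
    # "bert-base-chinese" and mean pooling are illustrative choices.
    import torch
    from itertools import combinations
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    bert = BertModel.from_pretrained("bert-base-chinese")

    def embed(word):
        # Mean-pool the last hidden states as a simple word representation
        inputs = tokenizer(word, return_tensors="pt")
        with torch.no_grad():
            hidden = bert(**inputs).last_hidden_state
        return hidden.mean(dim=1).squeeze(0)

    def topic_coherence(top_words):
        # Average pairwise cosine similarity over the topic's top words
        vecs = [embed(w) for w in top_words]
        sims = [torch.cosine_similarity(a, b, dim=0).item()
                for a, b in combinations(vecs, 2)]
        return sum(sims) / len(sims)

    # For each candidate number of topics k, average topic_coherence over the k
    # topics produced by LDA and choose the k with the highest mean coherence.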