Optimization of Topic Recognition Model for News Texts Based on LDA

H Wang, J Wang, Y Zhang, M Wang, C Mao - J. Digit. Inf. Manag., 2019 - dline.info
Abstract
Latent Dirichlet Allocation (LDA) is the most commonly used technique in topic modeling, but it requires the number of topics to be specified in advance. Apart from the main iterative methods based on perplexity and nonparametric methods, recent research offers no simple way to select the optimal number of topics in the model. To determine the number of topics appropriately and thereby optimize the LDA topic model, this paper proposes a non-iterative method for determining the number of topics automatically. The method is based on clustering by fast search and find of density peaks. It transforms the traditional problem of selecting the number of topics into a clustering problem and can thus be used to optimize the topic recognition model for news texts. It requires no iterative optimization and simplifies model development. The method first applies Word2Vec word embedding to the corpus text, exploiting its strong performance in capturing relationships between words to express the implicit semantic relationships within the topic corpora. A clustering algorithm that quickly searches for and finds density peaks then clusters the resulting word vectors, and the number of clusters obtained is used as the number of topics in the text. Experimental results show that the proposed method achieves better precision and F1 value than the perplexity-based method and is suitable for identifying the number of topics in corpora of different sizes. The method can effectively find an appropriate number of topics from the news text dataset and improve the accuracy of the LDA topic model.
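The clustering step described in the abstract can be illustrated with a minimal Python sketch of density-peak clustering used only to count clusters. This is not the authors' implementation. The toy 2-D points stand in for Word2Vec word vectors, and the function names (`density_peaks`, `estimate_num_topics`), the cutoff distance `d_c`, and the thresholds `rho_min` and `delta_min` are all assumptions made for illustration; the abstract does not state how cluster centers are selected.

```python
import math


def pairwise_distances(points):
    """Return the full Euclidean distance matrix as a list of lists."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist


def density_peaks(points, d_c):
    """Compute local density rho and separation delta for every point.

    rho[i]   = number of other points closer than the cutoff d_c
    delta[i] = distance to the nearest point ranked as denser than i
               (for the top-ranked point: its largest distance to any point)
    """
    n = len(points)
    dist = pairwise_distances(points)
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c)
           for i in range(n)]
    # Rank by density, descending; equal densities are ordered by index
    # (an assumption of this sketch, so that ties are handled deterministically).
    order = sorted(range(n), key=lambda i: (-rho[i], i))
    delta = [0.0] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])
        else:
            delta[i] = min(dist[i][j] for j in order[:rank])
    return rho, delta


def estimate_num_topics(points, d_c, rho_min, delta_min):
    """Count points that are both dense and far from any denser point.

    The thresholds are hypothetical; they would need tuning on real data.
    """
    rho, delta = density_peaks(points, d_c)
    centers = [i for i in range(len(points))
               if rho[i] >= rho_min and delta[i] >= delta_min]
    return len(centers), centers


# Toy stand-in for word vectors: three tight groups of four points each.
toy_vectors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1), (10.1, 0.1),
]

k, centers = estimate_num_topics(toy_vectors, d_c=0.5, rho_min=2, delta_min=1.0)
print("estimated number of topics:", k)
```

In the pipeline the abstract describes, the count `k` would then be supplied as the number of topics when fitting the LDA model, so no iterative search over candidate topic numbers is needed.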