BERTopic: Neural topic modeling with a class-based TF-IDF procedure

Maarten Grootendorst
[email protected]

Abstract

Topic models can be useful tools to discover latent topics in collections of documents. Recent studies have shown the feasibility of approaching topic modeling as a clustering task. We present BERTopic, a topic model that extends this process by extracting coherent topic representations through the development of a class-based variation of TF-IDF. More specifically, BERTopic generates document embeddings with pre-trained transformer-based language models, clusters these embeddings, and finally, generates topic representations with the class-based TF-IDF procedure. BERTopic generates coherent topics and remains competitive across a variety of benchmarks involving classical models and those that follow the more recent clustering approach of topic modeling.

1 Introduction

To uncover common themes and the underlying narrative in text, topic models have proven to be a powerful unsupervised tool. Conventional models, such as Latent Dirichlet Allocation (LDA) (Blei et al., 2003) and Non-Negative Matrix Factorization (NMF) (Févotte and Idier, 2011), describe a document as a bag-of-words and model each document as a mixture of latent topics.

One limitation of these models is that, through their bag-of-words representations, they disregard semantic relationships among words. As these representations do not account for the context of words in a sentence, the bag-of-words input may fail to accurately represent documents.

As an answer to this issue, text embedding techniques have rapidly become popular in the natural language processing field. More specifically, Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) and its variations (e.g., Lee et al., 2020; Liu et al., 2019; Lan et al., 2019) have shown great results in generating contextual word and sentence vector representations. The semantic properties of these vector representations allow the meaning of texts to be encoded in such a way that similar texts are close in vector space.

Although embedding techniques have been used for a variety of tasks, ranging from classification to neural search engines, researchers have started to adopt these powerful contextual representations for topic modeling. Sia et al. (2020) demonstrated the viability of clustering embeddings with centroid-based techniques, compared to conventional methods such as LDA, as a way to represent topics. From these clustered embeddings, topic representations were extracted by embedding words and finding those that are in close proximity to a cluster's centroid. Similarly, Top2Vec leverages Doc2Vec's word and document representations to learn jointly embedded topic, document, and word vectors (Angelov, 2020; Le and Mikolov, 2014). Comparable to the approach of Sia et al. (2020), documents are clustered and topic representations are created by finding words close to a cluster's centroid. Interestingly, although the topic representations are generated from a centroid-based perspective, the clusters are generated from a density-based perspective, namely by leveraging HDBSCAN (McInnes and Healy, 2017).

The aforementioned topic modeling techniques assume that words in close proximity to a cluster's centroid are most representative of that cluster, and thereby a topic. In practice, however, a cluster will not always lie within a sphere around its centroid. As such, the assumption cannot hold for every cluster of documents, and the representation of those clusters, and thereby the topic, might be misleading. Although Sia et al. (2020) attempt to overcome this issue by re-ranking topic words based on their frequency in a cluster, the initial candidates are still generated from a centroid-based perspective.

In this paper, we introduce BERTopic, a topic model that leverages clustering techniques and a class-based variation of TF-IDF to generate coherent topic representations. More specifically, we first create document embeddings using a pre-trained language model to obtain document-level information. Second, we reduce the dimensionality of the document embeddings before creating semantically similar clusters of documents that each represent a distinct topic. Third, to overcome the centroid-based perspective, we develop a class-based version of TF-IDF to extract the topic representation from each topic. These three independent steps allow for a flexible topic model that can be used in a variety of use cases, such as dynamic topic modeling.
2 Related Work

In recent years, neural topic models have increasingly shown success in leveraging neural networks to improve upon existing topic modeling techniques (Terragni et al., 2021; Cao et al., 2015; Zhao et al., 2021; Larochelle and Lauly, 2012). The incorporation of word embeddings into classical models, such as LDA, demonstrated the viability of using these powerful representations (Liu et al., 2015; Nguyen et al., 2015; Shi et al., 2017; Qiang et al., 2017). Foregoing incorporation into LDA-like models, there has been a recent surge of topic modeling techniques built primarily around embedding models, illustrating the potential of embedding-based topic modeling (Bianchi et al., 2020b; Dieng et al., 2020; Thompson and Mimno, 2020). CTM, for example, demonstrates the advantage of relying on pre-trained language models, namely that future improvements in language models may translate into better topic models (Bianchi et al., 2020a).

Several approaches have started simplifying the topic building process by clustering word and document embeddings (Sia et al., 2020; Angelov, 2020). This clustering approach allows for a flexible topic model, as the generation of the clusters can be separated from the process of generating the topic representations.

BERTopic builds on top of the clustering embeddings approach and extends it by incorporating a class-based variant of TF-IDF for creating topic representations.

3 BERTopic

BERTopic generates topic representations through three steps. First, each document is converted to its embedding representation using a pre-trained language model. Then, before clustering these embeddings, the dimensionality of the resulting embeddings is reduced to optimize the clustering process. Lastly, from the clusters of documents, topic representations are extracted using a custom class-based variation of TF-IDF.

3.1 Document embeddings

In BERTopic, we embed documents to create representations in vector space that can be compared semantically. We assume that documents containing the same topic are semantically similar. To perform the embedding step, BERTopic uses the Sentence-BERT (SBERT) framework (Reimers and Gurevych, 2019). This framework allows users to convert sentences and paragraphs to dense vector representations using pre-trained language models. It achieves state-of-the-art performance on various sentence embedding tasks (Reimers and Gurevych, 2020; Thakur et al., 2020).

These embeddings, however, are primarily used to cluster semantically similar documents and are not directly used in generating the topics. Any other embedding technique can be used for this purpose if the language model generating the document embeddings was fine-tuned on semantic similarity. As a result, the quality of clustering in BERTopic will increase as new and improved language models are developed. This allows BERTopic to continuously grow with the current state-of-the-art in embedding techniques.
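As an illustration of this embedding step, the following is a minimal sketch (not the reference implementation) using the sentence-transformers package; the example documents and the choice of the "all-MiniLM-L6-v2" checkpoint here are illustrative assumptions, and any SBERT model fine-tuned on semantic similarity could be swapped in.

```python
# Minimal sketch of the document embedding step with SBERT (sentence-transformers).
from sentence_transformers import SentenceTransformer

docs = [
    "The new electric vehicle has impressive range.",
    "Self-driving cars rely on large neural networks.",
    "The football match ended in a dramatic penalty shootout.",
]

# "all-MiniLM-L6-v2" is an assumed choice; the paper also uses "all-mpnet-base-v2".
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=False)  # shape: (n_docs, dim)
```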
3.2 Document clustering

As data increases in dimensionality, the distance to the nearest data point has been shown to approach the distance to the farthest data point (Aggarwal et al., 2001; Beyer et al., 1999). As a result, in high dimensional space, the concept of spatial locality becomes ill-defined and distance measures differ little.

Although clustering approaches exist for overcoming this curse of dimensionality (Pandove et al., 2018; Steinbach et al., 2004), a more straightforward approach is to reduce the dimensionality of embeddings. Although PCA and t-SNE are well-known methods for reducing dimensionality, UMAP has been shown to preserve more of the local and global features of high-dimensional data in lower projected dimensions (McInnes et al., 2018). Moreover, since it has no computational restrictions on embedding dimensions, UMAP can be used across language models with differing dimensional space. Thus, we use UMAP to reduce the dimensionality of the document embeddings generated in 3.1 (McInnes et al., 2018).

The reduced embeddings are clustered using HDBSCAN (McInnes et al., 2017). It is an extension of DBSCAN that finds clusters of varying densities by converting DBSCAN into a hierarchical clustering algorithm. HDBSCAN models clusters using a soft-clustering approach, allowing noise to be modeled as outliers. This prevents unrelated documents from being assigned to any cluster and is expected to improve topic representations. Moreover, Allaoui et al. (2020) demonstrated that reducing high dimensional embeddings with UMAP can improve the performance of well-known clustering algorithms, such as k-Means and HDBSCAN, both in terms of clustering accuracy and time.
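A minimal sketch of this reduce-then-cluster step with the umap-learn and hdbscan packages might look as follows; the random stand-in embeddings and the specific hyperparameter values (number of components, neighbors, minimum cluster size) are illustrative assumptions, not the settings used in the experiments.

```python
# Sketch: reduce embedding dimensionality with UMAP, then cluster with HDBSCAN.
import numpy as np
import umap
import hdbscan

# Stand-in for the SBERT document embeddings of Section 3.1.
embeddings = np.random.RandomState(42).rand(500, 384)

# Hyperparameter values below are illustrative assumptions.
reduced = umap.UMAP(
    n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine"
).fit_transform(embeddings)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=10, metric="euclidean", cluster_selection_method="eom"
).fit(reduced)

labels = clusterer.labels_  # -1 marks documents treated as outliers (noise)
```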
3.3 Topic Representation

The topic representations are modeled based on the documents in each cluster, where each cluster will be assigned one topic. For each topic, we want to know what makes one topic, based on its cluster-word distribution, different from another. For this purpose, we can modify TF-IDF, a measure for representing the importance of a word to a document, such that it allows for a representation of a term's importance to a topic instead.

The classic TF-IDF procedure combines two statistics, term frequency and inverse document frequency (Joachims, 1996):

W_{t,d} = tf_{t,d} \cdot \log(N / df_t)    (1)

where the term frequency models the frequency of term t in document d. The inverse document frequency measures how much information a term provides to a document and is calculated by taking the logarithm of the number of documents in a corpus N divided by the total number of documents that contain t.

We generalize this procedure to clusters of documents. First, we treat all documents in a cluster as a single document by simply concatenating the documents. Then, TF-IDF is adjusted to account for this representation by translating documents to clusters:

W_{t,c} = tf_{t,c} \cdot \log(1 + A / tf_t)    (2)

where the term frequency models the frequency of term t in a class c, or in this instance, a cluster. Here, the class c is the collection of documents concatenated into a single document for each cluster. Then, the inverse document frequency is replaced by the inverse class frequency to measure how much information a term provides to a class. It is calculated by taking the logarithm of the average number of words per class A divided by the frequency of term t across all classes. To output only positive values, we add one to the division within the logarithm.

Thus, this class-based TF-IDF procedure models the importance of words in clusters instead of individual documents. This allows us to generate topic-word distributions for each cluster of documents.

Finally, by iteratively merging the c-TF-IDF representation of the least common topic with that of its most similar topic, we can reduce the number of topics to a user-specified value.
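The class-based weighting of Equation 2 can be written down compactly; the sketch below is a simplified illustration using scikit-learn's CountVectorizer and the symbols of the equation, not the released implementation.

```python
# Simplified sketch of c-TF-IDF (Equation 2); a hypothetical helper, not the released code.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def c_tf_idf(docs_per_class):
    """docs_per_class: one concatenated string per cluster (class)."""
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs_per_class).toarray()  # shape: (n_classes, n_terms)
    tf = counts                               # tf_{t,c}: frequency of term t in class c
    A = counts.sum(axis=1).mean()             # A: average number of words per class
    tf_t = counts.sum(axis=0)                 # frequency of term t across all classes
    weights = tf * np.log(1 + A / tf_t)       # W_{t,c} = tf_{t,c} * log(1 + A / tf_t)
    return weights, vectorizer.get_feature_names_out()

# The highest-weighted terms per row then serve as that cluster's topic words; topics can
# be reduced by merging the least common topic's c-TF-IDF vector with its most similar one
# (e.g., by cosine similarity) until the desired number of topics is reached.
```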
4 Dynamic Topic Modeling

Traditional topic modeling techniques are static in nature and do not allow for sequentially-organized documents to be modeled. Dynamic topic modeling techniques, first introduced by Blei and Lafferty (2006) as an extension of LDA, overcome this by modeling how topics might have evolved over time and the extent to which topic representations reflect that.

In BERTopic, we can model this behavior by leveraging the c-TF-IDF representations of topics. Here, we assume that the temporal nature of topics should not influence the creation of global topics. The same topic might appear across different times, albeit possibly represented differently. As an example, a global topic about cars might contain words such as "car" and "vehicle" regardless of the temporal nature of specific documents. Car-related documents created in 2020, however, might be better represented with words such as "Tesla" and "self-driving", whereas these words would likely not appear in car-related documents created in 1990. Although the same topic is assigned to car-related documents in 1990 and 2020, its representation might differ. Thus, we first generate a global representation of topics, regardless of their temporal nature, before developing a local representation.

To do this, BERTopic is first fitted on the entire corpus as if there were no temporal aspects to the data in order to create a global view of topics. Then, we can create a local representation of each topic by simply multiplying the term frequency of documents at timestep i with the pre-calculated global IDF values:

W_{t,c,i} = tf_{t,c,i} \cdot \log(1 + A / tf_t)    (3)

A major advantage of using this technique is that these local representations can be created without the need to embed and cluster documents, which allows for fast computation. Moreover, this method can also be used to model topic representations by other meta-data, such as author or journal.
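A minimal sketch of Equation 3 under the assumptions above: the global IDF vector and vocabulary come from the c-TF-IDF procedure fitted on the whole corpus, and only per-timestep term frequencies need to be recomputed (function and variable names are hypothetical).

```python
# Sketch of local (per-timestep) topic representations, Equation 3.
from sklearn.feature_extraction.text import CountVectorizer

def topics_at_timestep(docs_per_class_at_i, global_vocabulary, global_idf):
    """docs_per_class_at_i: one concatenated string per cluster, restricted to timestep i.
    global_idf: log(1 + A / tf_t), computed once on the full corpus and aligned with
    global_vocabulary."""
    vectorizer = CountVectorizer(vocabulary=global_vocabulary)
    tf_i = vectorizer.fit_transform(docs_per_class_at_i).toarray()  # tf_{t,c,i}
    return tf_i * global_idf                                        # W_{t,c,i}
```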
4.1 Smoothing

Although we can observe how topic representations differ from one time to another, the topic representation at timestep t is independent of timestep t-1. As a result, this dynamic representation of topics might not result in linearly evolving topics. When we expect linearly evolving topics, we assume that a topic representation at timestep t depends on the topic representation at timestep t-1.

To overcome this, we can leverage the c-TF-IDF matrices that were created at each timestep to incorporate this linear assumption. For each topic and timestep, the c-TF-IDF vector is normalized by dividing the vector by its L1-norm. When comparing vectors, this normalization procedure prevents topic representations from having disproportionate effects as a result of the size of the documents that make up the topic.

Then, for each topic and representation at timestep t, we simply take the average of the normalized c-TF-IDF vectors at t and t-1. This allows us to influence the topic representation at t by incorporating the representation at t-1. Thus, the resulting topic representations are smoothed based on their temporal position.

It should be noted that although we might expect linearly evolving topics, this is not always the case. Hence, this smoothing technique is optional when using BERTopic and will be reflected in the experimental setup.
5 Experimental Setup

OCTIS (Optimizing and Comparing Topic models Is Simple), an open-source python package, was used to run the experiments, validate results, and preprocess the data (Terragni et al., 2021). Both the implementation of BERTopic [1] as well as the experimental setup [2] are freely available online.

[1] https://github.com/MaartenGr/BERTopic
[2] https://github.com/MaartenGr/BERTopic_evaluation

5.1 Datasets

Three datasets were used to validate BERTopic, namely 20 NewsGroups, BBC News, and Trump's tweets. We choose to thoroughly preprocess the 20 NewsGroups and BBC News datasets, and only slightly preprocess Trump's tweets to generate more diversity between datasets.

The 20 NewsGroups dataset [3] contains 16309 news articles across 20 categories (Lang, 1995). The BBC News dataset [4] contains 2225 documents from the BBC News website between 2004 and 2005 (Greene and Cunningham, 2006). Both datasets were retrieved using OCTIS and preprocessed by removing punctuation, lemmatizing, removing stopwords, and removing documents with fewer than 5 words.

[3] https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/20NewsGroup
[4] https://github.com/MIND-Lab/OCTIS/tree/master/preprocessed_datasets/BBC_news

To represent more recent data in a short-text form, we collected all tweets of Trump [5] before and during his presidency. The data contains 44253 tweets, excluding re-tweets, between 2009 and 2021. In both datasets, we lowercased all tokens.

[5] https://www.thetrumparchive.com/faq

To evaluate BERTopic in a dynamic topic modeling setting, Trump's tweets were selected as they inherently had a temporal nature to them. Additionally, the transcriptions of the United Nations (UN) general debates between 2006 and 2015 [6] were analyzed (Baturo et al., 2017). The Trump dataset was binned into 10 timesteps and the UN dataset into 9 timesteps.

[6] https://runestone.academy/runestone/books/published/httlads/_static/un-general-debates.csv
                   20 NewsGroups      BBC News          Trump
                   TC      TD         TC      TD        TC      TD
LDA                .058    .749       .014    .577      -.011   .502
NMF                .089    .663       .012    .549      .009    .379
T2V-MPNET          .068    .718       -.027   .540      -.213   .698
T2V-Doc2Vec        .192    .823       .171    .792      -.169   .658
CTM                .096    .886       .094    .819      .009    .855
BERTopic-MPNET     .166    .851       .167    .794      .066    .663

Table 1: Ranging from 10 to 50 topics with steps of 10, topic coherence (TC) and topic diversity (TD) were calculated at each step for each topic model. All results were averaged across 3 runs for each step. Thus, each score is the average of 15 separate runs.

5.2 Models

BERTopic is compared to LDA, NMF, CTM, and Top2Vec. LDA and NMF were run through OCTIS with default parameters. The "all-mpnet-base-v2" SBERT model was used as the embedding model for BERTopic and CTM (Song et al., 2020). Two variations of Top2Vec were modeled, one with Doc2Vec and one with the "all-mpnet-base-v2" SBERT model [7]. For fair comparisons between BERTopic and Top2Vec, the parameters of HDBSCAN and UMAP were fixed between topic models.

[7] For an overview of SBERT models and their performance, see https://www.sbert.net/docs/pretrained_models.html

To measure the generalizability of BERTopic across language models, four different language models were used in the experiments with BERTopic, namely the Universal Sentence Encoder (Cer et al., 2018), Doc2Vec, and the "all-MiniLM-L6-v2" (MiniLM) and "all-mpnet-base-v2" (MPNET) SBERT models.

Finally, BERTopic, with and without the assumption of linearly-evolving topics, was compared with the original dynamic topic model, referred to hereafter as LDA Sequence.

5.3 Evaluation

The performance of the topic models in this paper is reflected by two widely-used metrics, namely topic coherence and topic diversity. For each topic model, its topic coherence was evaluated using normalized pointwise mutual information (NPMI; Bouma, 2009). This coherence measure has been shown to emulate human judgment with reasonable performance (Lau et al., 2014). The measure ranges from [-1, 1], where 1 indicates a perfect association. Topic diversity, as defined by Dieng et al. (2020), is the percentage of unique words for all topics. The measure ranges from [0, 1], where 0 indicates redundant topics and 1 indicates more varied topics.
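As an illustration, topic diversity as defined above can be computed directly from the top words of each topic; this is a minimal sketch under the assumption that each topic is represented as a list of its top-k words.

```python
# Sketch of the topic diversity measure (Dieng et al., 2020):
# fraction of unique words among the top words of all topics.
def topic_diversity(topics):
    """topics: list of lists, each containing the top-k words of one topic."""
    all_words = [word for topic in topics for word in topic]
    return len(set(all_words)) / len(all_words)

# Example: two topics sharing one word out of six -> diversity ~0.83.
print(topic_diversity([["car", "vehicle", "engine"], ["car", "tesla", "battery"]]))
```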

Ranging from 10 to 50 topics with steps of 10, the NPMI score was calculated at each step for each topic model. All results were averaged across 3 runs for each step. To evaluate the dynamic topic models, the NPMI score was calculated at 50 topics for each timestep and then averaged. All results were averaged across 3 runs.

Validation measures such as topic coherence and topic diversity are proxies of what is essentially a subjective evaluation. One user might judge the coherence and diversity of a topic differently from another user. As a result, although these measures can be used to get an indication of a model's performance, they are just that, an indication.

It should be noted that although NPMI has been shown to correlate with human judgment, recent research states that this may only be the case for classical models and that this relationship might not exist with neural topic models (Hoyle et al., 2021). In part, the authors suggest a needs-driven approach to evaluation, as topic modeling's primary use is in computer-assisted content analysis.

To this purpose, the differences in running times of each model were explored, as they can greatly impact their usability. Here, we choose to focus on wall time, as it more accurately reflects how the topic modeling techniques would be used in practice. All models are run on a machine with 2 cores of an Intel(R) Xeon(R) CPU @ 2.00GHz and a Tesla P100-PCIE-16GB GPU.

Moreover, in Section 7, the strengths and weaknesses of the proposed model across use cases will be discussed extensively to further shed light on what the model can and cannot do.

6 Results

Our main results can be found in Table 1.
                     20 NewsGroups      BBC News          Trump
                     TC      TD         TC      TD        TC      TD
BERTopic-USE         .149    .858       .158    .764      .051    .684
BERTopic-Doc2Vec     .173    .871       .168    .819      -.088   .536
BERTopic-MiniLM      .159    .833       .170    .802      .060    .660
BERTopic-MPNET       .166    .851       .167    .792      .066    .663

Table 2: Using four different language models in BERTopic, topic coherence (TC) and topic diversity (TD) were calculated ranging from 10 to 50 topics with steps of 10. All results were averaged across 3 runs for each step. Thus, each score is the average of 15 separate runs.

6.1 Performance

From Table 1, we can observe that BERTopic generally has high topic coherence scores across all datasets. It has the highest scores on the slightly preprocessed dataset, Trump's tweets, whilst remaining competitive on the thoroughly preprocessed datasets, 20 NewsGroups and BBC News. Although BERTopic demonstrates competitive topic diversity scores, it is consistently outperformed by CTM. This is consistent with their results indicating high topic diversity, albeit using a different topic diversity measure (Bianchi et al., 2020a).

6.2 Language Models

The results in Table 2 demonstrate the stability of BERTopic, in terms of both topic coherence and topic diversity, across SBERT language models. As a result, the smaller and faster model, "all-MiniLM-L6-v2", might be preferable when limited GPU capacity is available.

Although the USE and Doc2Vec language models in BERTopic generally have similar performance, Doc2Vec scores low on the Trump dataset. This is reflected in the results we find in Table 1, where Top2Vec with Doc2Vec has poor performance. These results suggest that Doc2Vec struggles with creating accurate representations of the Trump dataset.

On topic coherence, Top2Vec with Doc2Vec embeddings shows competitive performance. However, when MPNET embeddings are used, both its topic coherence and diversity drop across all datasets, suggesting that Top2Vec might not be best suited to embeddings outside of those generated through Doc2Vec. This is not unexpected, as both word and document vectors in Doc2Vec are jointly embedded in the same space, which does not hold for all language models.

In turn, this also suggests why BERTopic remains competitive regardless of the embedding model. By separating the process of embedding documents and constructing the word-topic distribution, BERTopic is flexible in its embedding procedure.

6.3 Dynamic Topic Modeling

From Table 3, we can observe that BERTopic, with and without the assumption of linearly evolving topics, performs consistently well across both datasets. For Trump, it outperforms LDA Sequence on all measures, whereas it only achieves the top score on topic coherence for the UN dataset.

On both datasets, there seems to be no effect of the assumption of linearly evolving topics on either topic coherence or topic diversity, indicating that, from an evaluation perspective, the proposed assumption does not impact performance.

                     Trump              UN
                     TC      TD         TC      TD
LDA Sequence         .009    .715       .173    .820
BERTopic             .079    .862       .231    .779
BERTopic-Evolve      .079    .863       .226    .769

Table 3: The topic coherence (TC) and topic diversity (TD) scores were calculated on dynamic topic modeling tasks. The TC and TD scores were calculated for each of the 9 timesteps in each dataset. Then, all results were averaged across 3 runs for each step. Thus, each score represents the average of 27 values.

6.4 Wall time

From the left graph in Figure 1, CTM, using an MPNET SBERT model, is quite slow compared to all other models. If we remove that model from the results, we can more easily compare the wall times of the topic models that are closer in speed.
Figure 1: Computation time (wall time) in seconds of each topic model on the Trump dataset. Increasing sizes of vocabularies were regulated through selection of documents, ranging from 1000 to 43000 documents with steps of 2000. Left: computational results with CTM. Right: computational results without CTM, as it inflates the y-axis, making differentiation between the other topic models difficult to visualize.

Then, we can observe that the classical models, NMF and LDA, are faster than the neural-network-based topic modeling techniques. Moreover, BERTopic and Top2Vec are quite similar in wall times if they are using the same language models. Interestingly, the MiniLM SBERT model seems to be similar in speed to Doc2Vec, indicating that, in BERTopic, MiniLM is a good trade-off between speed and performance.

However, it should be noted that in the environment used in this experiment, a GPU was available for creating the embeddings. As a result, the wall time is expected to increase significantly when embedding documents without a GPU. Although Doc2Vec can be used as a language model instead, previous experiments in this study have put its stability with respect to topic coherence and topic diversity into question.

7 Discussion

Although we attempted to validate BERTopic across several experiments, topic modeling techniques can be validated through many other evaluation metrics, such as metrics for unsupervised and supervised modeling performance. Moreover, topic modeling techniques can be used across a variety of use cases, many of which are not covered in this study. For those reasons, we additionally discuss the strengths and weaknesses of BERTopic to further describe when, and perhaps most importantly, when not to use BERTopic.

7.1 Strengths

There are several notable strengths of BERTopic compared to the topic models used in this study.

First, the experiments demonstrate that BERTopic remains competitive regardless of the language model used to embed the documents and that performance may increase when leveraging state-of-the-art language models. This indicates its ability to scale performance with new developments in the field of language models whilst still remaining competitive if classical language models are used. Moreover, its stability across language models allows it to be used in a wide range of situations. For example, when a user does not have access to a GPU, Doc2Vec can be used to generate competitive results.

Second, separating the process of embedding documents from representing topics allows for significant flexibility in the usage and fine-tuning of BERTopic. Different preprocessing procedures can be used when embedding the documents and when generating the topic representations. For example, one might want to remove stopwords in the topic representations but not before creating document embeddings. Similarly, once the documents have been clustered, the topic generation process can be fine-tuned, by, for example, increasing the n-gram range of words in the topic representation, without the need to re-cluster the data.
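A minimal sketch of such fine-tuning, under the assumption that the cluster labels are already available, re-generates topic representations with bigrams from fixed clusters; the helper below is hypothetical and only illustrates the separation between clustering and topic generation.

```python
# Sketch: re-generate topic representations from fixed clusters with a larger
# n-gram range, without re-embedding or re-clustering the documents.
from collections import defaultdict
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def topic_words(docs, labels, ngram_range=(1, 2), top_n=10):
    """docs: list of documents; labels: cluster label per document (-1 = outlier)."""
    grouped = defaultdict(list)
    for doc, label in zip(docs, labels):
        if label != -1:
            grouped[label].append(doc)
    classes = sorted(grouped)
    joined = [" ".join(grouped[c]) for c in classes]            # one "document" per cluster
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    counts = vectorizer.fit_transform(joined).toarray()
    A = counts.sum(axis=1).mean()
    weights = counts * np.log(1 + A / counts.sum(axis=0))       # c-TF-IDF as in Equation 2
    terms = vectorizer.get_feature_names_out()
    return {c: [terms[i] for i in weights[idx].argsort()[::-1][:top_n]]
            for idx, c in enumerate(classes)}
```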
Third, by leveraging a class-based version of TF-IDF, we can represent topics as a distribution of words. These distributions have allowed BERTopic to model the dynamic and evolutionary aspects of topics with few changes to the core algorithm. Similarly, with these distributions, we can also model the representations of topics across classes.
7.2 Weaknesses

No model is perfect, and BERTopic is definitely no exception. There are several weaknesses to the model that should be addressed.

First, BERTopic assumes that each document contains only a single topic, which does not reflect the reality that documents may contain multiple topics. Although documents can be split up into smaller segments, such as sentences and paragraphs, this is not an ideal representation. However, as HDBSCAN is a soft-clustering technique, we can use its probability matrix as a proxy for the distribution of topics in a document. This resolves the issue to some extent, but it does not take into account that documents may contain multiple topics during the training of BERTopic.

Second, although BERTopic allows for a contextual representation of documents through its transformer-based language models, the topic representation itself does not directly account for that, as it is generated from bags-of-words. The words in a topic representation merely sketch the importance of words in a topic whilst those words are likely to be related. As a result, words in a topic might be similar to one another and can be redundant for the interpretation of the topic. In theory, this could be resolved by applying maximal marginal relevance to the top n words in a topic, but it was not explored in this study (Carbonell and Goldstein, 1998).
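As a pointer to how such a re-ranking could look, the following is a purely illustrative sketch of maximal marginal relevance over candidate topic words, assuming word embeddings and a topic embedding are available; it was not part of the evaluated model.

```python
# Hypothetical sketch of maximal marginal relevance (MMR) re-ranking of topic words:
# trade off relevance to the topic against redundancy among already selected words.
import numpy as np

def mmr(topic_vec, word_vecs, words, top_n=10, diversity=0.3):
    """topic_vec: (dim,); word_vecs: (n_words, dim); words: list of candidate words."""
    def cos(a, b):
        return a @ b.T / (np.linalg.norm(a, axis=-1, keepdims=True) * np.linalg.norm(b, axis=-1))

    relevance = cos(word_vecs, topic_vec[None, :]).ravel()   # similarity of each word to the topic
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(top_n, len(words)):
        candidates = [i for i in range(len(words)) if i not in selected]
        redundancy = cos(word_vecs[candidates], word_vecs[selected]).max(axis=1)
        scores = (1 - diversity) * relevance[candidates] - diversity * redundancy
        selected.append(candidates[int(np.argmax(scores))])
    return [words[i] for i in selected]
```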
8 Conclusion

We developed BERTopic, a topic model that extends the cluster embedding approach by leveraging state-of-the-art language models and applying a class-based TF-IDF procedure for generating topic representations. By separating the process of clustering documents and generating topic representations, significant flexibility is introduced in the model, allowing for ease of usability.

We present in this paper an in-depth analysis of BERTopic, ranging from evaluation studies with classical topic coherence measures to analyses involving running times. Our experiments suggest that BERTopic learns coherent patterns of language and demonstrates competitive and stable performance across a variety of tasks.

References

Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. 2001. On the surprising behavior of distance metrics in high dimensional space. In International conference on database theory, pages 420–434. Springer.
Mebarka Allaoui, Mohammed Lamine Kherfi, and Abdelhakim Cheriet. 2020. Considerably improving clustering algorithms using umap dimensionality reduction technique: A comparative study. In International Conference on Image and Signal Processing, pages 317–325. Springer.
Dimo Angelov. 2020. Top2vec: Distributed representations of topics. arXiv preprint arXiv:2008.09470.
Alexander Baturo, Niheer Dasandi, and Slava J Mikhaylov. 2017. Understanding state preferences with text as data: Introducing the un general debate corpus. Research & Politics, 4(2):2053168017712821.
Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. 1999. When is "nearest neighbor" meaningful? In International conference on database theory, pages 217–235. Springer.
Federico Bianchi, Silvia Terragni, and Dirk Hovy. 2020a. Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974.
Federico Bianchi, Silvia Terragni, Dirk Hovy, Debora Nozza, and Elisabetta Fersini. 2020b. Cross-lingual contextualized topic models with zero-shot learning. arXiv preprint arXiv:2004.07737.
David M Blei and John D Lafferty. 2006. Dynamic topic models. In Proceedings of the 23rd international conference on Machine learning, pages 113–120.
David M Blei, Andrew Y Ng, and Michael I Jordan. 2003. Latent dirichlet allocation. The Journal of machine Learning research, 3:993–1022.
Gerlof Bouma. 2009. Normalized (pointwise) mutual information in collocation extraction. Proceedings of GSCL, 30:31–40.
Ziqiang Cao, Sujian Li, Yang Liu, Wenjie Li, and Heng Ji. 2015. A novel neural topic model and its supervised extension. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 2210–2216.
Jaime Carbonell and Jade Goldstein. 1998. The use of mmr, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval, pages 335–336.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Adji B Dieng, Francisco JR Ruiz, and David M Blei. 2020. Topic modeling in embedding spaces. Transactions of the Association for Computational Linguistics, 8:439–453.
Cédric Févotte and Jérôme Idier. 2011. Algorithms for nonnegative matrix factorization with the β-divergence. Neural computation, 23(9):2421–2456.
Derek Greene and Pádraig Cunningham. 2006. Practical solutions to the problem of diagonal dominance in kernel document clustering. In Proc. 23rd International Conference on Machine learning (ICML'06), pages 377–384. ACM Press.
Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, and Philip Resnik. 2021. Is automated topic model evaluation broken? the incoherence of coherence. Advances in Neural Information Processing Systems, 34.
Thorsten Joachims. 1996. A probabilistic analysis of the rocchio algorithm with tfidf for text categorization. Technical report, Carnegie-mellon univ pittsburgh pa dept of computer science.
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
Ken Lang. 1995. Newsweeder: Learning to filter netnews. In Machine Learning Proceedings 1995, pages 331–339. Elsevier.
Hugo Larochelle and Stanislas Lauly. 2012. A neural autoregressive topic model. Advances in Neural Information Processing Systems, 25.
Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 530–539.
Quoc Le and Tomas Mikolov. 2014. Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2020. Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240.
Yang Liu, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. 2015. Topical word embeddings. In Twenty-ninth AAAI conference on artificial intelligence.
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
L. McInnes, J. Healy, and J. Melville. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints.
Leland McInnes and John Healy. 2017. Accelerated hierarchical density based clustering. In Data Mining Workshops (ICDMW), 2017 IEEE International Conference on, pages 33–42. IEEE.
Leland McInnes, John Healy, and Steve Astels. 2017. hdbscan: Hierarchical density based clustering. The Journal of Open Source Software, 2(11):205.
Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger. 2018. Umap: Uniform manifold approximation and projection. The Journal of Open Source Software, 3(29):861.
Dat Quoc Nguyen, Richard Billingsley, Lan Du, and Mark Johnson. 2015. Improving topic models with latent feature word representations. Transactions of the Association for Computational Linguistics, 3:299–313.
Divya Pandove, Shivan Goel, and Rinkl Rani. 2018. Systematic review of clustering high-dimensional and large datasets. ACM Transactions on Knowledge Discovery from Data (TKDD), 12(2):1–68.
Jipeng Qiang, Ping Chen, Tong Wang, and Xindong Wu. 2017. Topic modeling over short texts by incorporating word embeddings. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 363–374. Springer.
Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. arXiv preprint arXiv:2004.09813.
Min Shi, Jianxun Liu, Dong Zhou, Mingdong Tang, and Buqing Cao. 2017. We-lda: a word embeddings augmented lda model for web services clustering. In 2017 ieee international conference on web services (icws), pages 9–16. IEEE.
Suzanna Sia, Ayush Dalmia, and Sabrina J Mielke. 2020. Tired of topic models? clusters of pretrained word embeddings make for fast and good topics too! arXiv preprint arXiv:2004.14914.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2020. Mpnet: Masked and permuted pre-training for language understanding. arXiv preprint arXiv:2004.09297.
Michael Steinbach, Levent Ertöz, and Vipin Kumar. 2004. The challenges of clustering high dimensional data. In New directions in statistical physics, pages 273–309. Springer.
Silvia Terragni, Elisabetta Fersini, Bruno Giovanni Galuzzi, Pietro Tropeano, and Antonio Candelieri. 2021. Octis: Comparing and optimizing topic models is simple! In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, pages 263–270.
Nandan Thakur, Nils Reimers, Johannes Daxenberger, and Iryna Gurevych. 2020. Augmented sbert: Data augmentation method for improving bi-encoders for pairwise sentence scoring tasks. arXiv preprint arXiv:2010.08240.
Laure Thompson and David Mimno. 2020. Topic modeling with contextualized word representation clusters. arXiv preprint arXiv:2010.12626.
He Zhao, Dinh Phung, Viet Huynh, Yuan Jin, Lan Du, and Wray Buntine. 2021. Topic modelling meets deep neural networks: A survey. arXiv preprint arXiv:2103.00498.
