Topic models are probabilistic models used to abstract topical information from collections of text documents. Documents are modeled as probability distributions over latent topics, and topics are modeled as probability distributions over words. In a single collection of documents, topics are global; that is, they are shared across all documents in the collection. A nested document collection has documents that are nested inside a higher-order structure, for example, stories in a book, articles in a journal, or web pages in a web site. Regular topic models ignore the nesting and treat all documents as distinct. In a nested document collection, however, such as web pages nested in web sites, web pages of the same web site share similarities with each other that they do not share with web pages of other web sites. Regular topic models allow inferences about each web page individually; they are not suited for making inferences about an entire web site.
We propose hierarchical local topic models that place a hierarchical prior on web page topic distributions and explicitly model local topics, that is, topics that are unique to one web site. The hierarchical prior asserts that web page topic distributions vary around their web site topic distribution and that web site topic distributions vary around a global topic distribution. Explicitly modeling local topics reduces the number of global topics needed, identifies the local topics and their owning web site, and lets us adjust inferences about how topics are covered.
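To make the three-level structure of the hierarchical prior concrete, the following is a minimal sketch of how topic distributions could be simulated under it. It assumes Dirichlet distributions at every level and two concentration parameters that control how tightly each level clusters around its parent; the function name, parameter names, and the Dirichlet choice itself are illustrative assumptions rather than the specification of the proposed models, and local topics are omitted for brevity.

```python
import numpy as np

def sample_hierarchical_topic_distributions(n_sites, pages_per_site, n_topics,
                                            site_concentration=50.0,
                                            page_concentration=20.0, seed=0):
    """Simulate global, web site, and web page topic distributions.

    Assumption: every level is Dirichlet, centered on its parent level.
    Larger concentration values mean less variation around the parent.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-6  # numerical guard so Dirichlet parameters stay strictly positive

    # Global topic distribution shared by the whole collection.
    global_dist = rng.dirichlet(np.ones(n_topics))

    # Each web site's topic distribution varies around the global one.
    site_dists = rng.dirichlet(site_concentration * global_dist + eps,
                               size=n_sites)

    # Each web page's topic distribution varies around its own web site's.
    page_dists = np.stack([
        rng.dirichlet(page_concentration * site_dists[s] + eps,
                      size=pages_per_site)
        for s in range(n_sites)
    ])
    return global_dist, site_dists, page_dists

global_dist, site_dists, page_dists = sample_hierarchical_topic_distributions(
    n_sites=3, pages_per_site=4, n_topics=5)
print(page_dists.shape)  # (web sites, pages per site, topics)
```

In this sketch, pages of the same web site are more similar to each other than to pages of other web sites because they share a site-level center, which is the similarity that regular topic models ignore.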
We propose hierarchical topic presence models that place a sparsity-inducing prior on topic distributions; they let us model the presence of topics in web sites, web pages, or both web sites and web pages. Topic presence in a web site can be modeled with logistic regression, as a function of covariates.
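As a rough illustration of how topic presence could depend on covariates, the sketch below computes a logistic-regression presence probability for each web site and then draws a sparse topic distribution that puts zero weight on absent topics. The function names, the coefficient values, and the use of a Dirichlet over the present topics are hypothetical choices made for this example; they are not taken from the proposed models.

```python
import numpy as np

def topic_presence_probability(covariates, coefficients, intercept=0.0):
    """Logistic regression: probability that a topic is present in each web site.

    covariates: array of shape (n_sites, n_covariates)
    coefficients: array of shape (n_covariates,)
    """
    linear_predictor = intercept + covariates @ coefficients
    return 1.0 / (1.0 + np.exp(-linear_predictor))

def sample_sparse_topic_distribution(presence, rng, concentration=1.0):
    """Draw a topic distribution with exact zeros for absent topics.

    Assumption: present topics share a symmetric Dirichlet; absent topics get 0.
    """
    presence = np.asarray(presence, dtype=bool)
    dist = np.zeros(presence.shape[0])
    if presence.any():
        n_present = int(presence.sum())
        dist[presence] = rng.dirichlet(np.full(n_present, concentration))
    return dist

rng = np.random.default_rng(0)

# Hypothetical standardized covariates for three web sites (two covariates each).
covariates = np.array([[0.5, -1.0], [0.0, 0.0], [-0.3, 2.0]])
coefficients = np.array([1.2, 0.8])  # made-up values for illustration only

prob_present = topic_presence_probability(covariates, coefficients)
present = rng.random(prob_present.shape) < prob_present  # Bernoulli draws
print(prob_present, present)

# Sparse distribution over four topics where only topics 0 and 2 are present.
print(sample_sparse_topic_distribution([True, False, True, False], rng))
```

Separating the presence indicator from the topic weights is what lets statements such as "this web site does not cover this topic" be expressed directly, rather than inferred from a small but nonzero probability.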
We apply hierarchical topic presence models to identify health topics in United States county health department web sites, estimate the percent of web sites that cover particular health topics, and identify demographic predictors of topic presence for human immunodeficiency virus (HIV) and opioid use disorder (OUD) topics.