The accelerating rate of digitization of information increases the importance and number of problems which require automatic organization and classification of written text. Topic models blei.2012 are a flexible and widely used tool which identifies semantically related documents through the topics they address. These methods originated in machine learning and were largely based on heuristic approaches such as singular value decomposition in latent semantic indexing (LSI) deerwester.1990 in which one optimizes an arbitrarily chosen quality function. Only a more statistically principled approach, based on the formulation of probabilistic generative models ghahramani.2015 , allowed for a deeper theoretical foundation within the framework of Bayesian statistical inference. This, in turn, led to a series of key developments, in particular probabilistic latent semantic indexing (pLSI) hofmann.1999 and latent Dirichlet allocation (LDA) blei.2003 ; griffiths.2004 . The latter established itself as the state-of-the-art method in topic modeling and has been widely used not only for recommendation and classification manning.book2008 but also for bibliometrical boyack.2011 , psychological mcnamara.2011 , and political grimmer.2013 analysis. Beyond the scope of natural language, LDA has also been applied in biology liu.2010 (developed independently in this context pritchard.2000 ) and image processing fei.2005 .
However, despite its success and overwhelming popularity, LDA is known to suffer from fundamental flaws in the way it represents text. In particular, it lacks an intrinsic methodology to choose the number of topics, and contains a large number of free parameters that can cause overfitting. Furthermore, there is no justification for the use of the Dirichlet prior in the model formulation besides mathematical convenience. This choice restricts the types of topic mixtures and is not designed to be compatible with well-known properties of real text altmann.book2016 , such as Zipf’s law zipf.1936 for the frequency of words. More recently, consistency problems have also been identified with respect to how planted structures in artificial corpora can be recovered with LDA lancichinetti.2015 . A substantial part of the research in topic models focuses on creating more sophisticated and realistic versions of LDA that account for, e.g., syntax griffiths.2005 , correlations between topics li.2006 , meta-information (such as authors) rosen.2004 , or burstiness doyle.2009 . Other approaches consist of post-inference fitting of the number of topics zhao_heuristic_2015
or the hyperparameters wallach.2009a , or the formulation of nonparametric hierarchical extensions teh.2006 ; blei.2010 ; Paisley2015 . In particular, models based on the Pitman-Yor Sudderth2009 ; Sato2010 ; Buntine2014 or the negative binomial process have tried to address the issue of Zipf’s law Broderick2015 , yielding useful generalizations of the simplistic Dirichlet prior Zhou2015 . While all these approaches lead to demonstrable improvements, they do not provide satisfying solutions to the aforementioned issues because they either share the limitations due to the choice of Dirichlet priors, introduce idiosyncratic structures to the model, or rely on heuristic approaches in the optimization of the free parameters.
A similar evolution from heuristic approaches to probabilistic models is occurring in the field of complex networks, in particular in the problem of community detection fortunato.2010 . Topic models and community-detection methods have been developed largely independently from each other with only a few papers pointing to their conceptual similarities airoldi.2007 ; ball.2011 ; lancichinetti.2015 . The idea of community detection is to find large-scale structure, i.e. to identify groups of nodes with similar connectivity patterns fortunato.2010 . This is motivated by the fact that these groups describe the heterogeneous nonrandom structure of the network and may correspond to functional units, giving potential insights into the generative mechanisms behind the network formation. While there is a variety of different approaches to community detection, most methods are heuristic and optimize a quality function, the most popular being modularity newman.2004 . Modularity suffers from severe conceptual deficiencies, such as its inability to assess statistical significance, leading to detection of groups in completely random networks guimera.2004 , or its inability to find groups below a given size lancichinetti.2011 . Methods like modularity maximization are analogous to the pre-pLSI heuristic approaches to topic models, sharing with them many conceptual and practical deficiencies. In an effort to overcome these problems, many researchers moved to probabilistic inference approaches, most notably those based on stochastic block models (SBM) holland.1983 ; airoldi.2007 ; karrer.2011 , mirroring the same trend that occurred in topic modeling.
In this paper we propose and apply a unified framework to the fields of topic modeling and community detection. As illustrated in Fig. 1, by representing the word-document matrix as a bipartite network the problem of inferring topics becomes a problem of inferring communities. Topic models and community-detection methods have been previously discussed as being part of mixed-membership models Airoldi2014 . However, this has remained a conceptual connection lancichinetti.2015 and in practice the two approaches are used to address different problems airoldi.2007 : the occurrence of words within documents and the links/citations between documents, respectively. In contrast, here we develop a formal correspondence that builds on the mathematical equivalence between pLSI of texts and SBMs of networks ball.2011 and that we use to adapt community-detection methods to perform topic modeling. In particular, we derive a nonparametric Bayesian parametrization of pLSI, adapted from a hierarchical stochastic block model (hSBM) peixoto.2014a ; peixoto.2015 ; peixoto_nonparametric_2017 , that makes fewer assumptions about the underlying structure of the data. As a consequence, it better matches the statistical properties of real texts and solves many of the intrinsic limitations of LDA. For example, we demonstrate the limitations induced by the Dirichlet priors by showing that LDA fails to infer topical structures that deviate from the Dirichlet assumption. We show that our model correctly infers such structures and thus leads to a better topic model than Dirichlet-based methods (such as LDA) in terms of model selection, not only in various real corpora but even in artificial corpora generated from LDA itself.
Additionally, our nonparametric approach uncovers topical structures on many scales of resolution, automatically determines the number of topics together with the word classification, and its symmetric formulation allows the documents themselves to be clustered into hierarchical categories.
The goal of our manuscript is to introduce a unified approach to topic modeling and community detection, showing how ideas and methods can be transported between these two classes of problems. The benefit of this unified approach is illustrated by the derivation of an alternative to Dirichlet-based topic models, which is more principled in its theoretical foundation (making fewer assumptions about the data) and superior in practice according to model selection criteria.
ii.1 Community Detection for Topic Modeling
In this section we expose the connection between topic modeling and community detection, as illustrated in Fig. 2. We first revisit how a Bayesian formulation of pLSI assuming Dirichlet priors leads to LDA and how the former can be re-interpreted as a mixed membership SBM. We then use the latter to derive a more principled approach to topic modeling using nonparametric and hierarchical priors.
ii.1.1 Topic models: pLSI and LDA
PLSI is a model that generates a corpus composed of documents, where each document has words hofmann.1999 . The placement of the words in the documents is done based on the assignment of topic mixtures to both documents and words, from a total of topics. More specifically, one iterates through all documents, and for each document one samples and for each word-token , first a topic is chosen with probability , and then a word is chosen from that topic with probability . If is the number of occurrences of word of topic in document (summarized as ), the probability of a corpus is
We denote matrices by bold-face symbols, e.g. with and where is an individual entry, thus the notation refers to the vector with fixed and .
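The generative process described above can be illustrated with a minimal Python sketch. The names `eta`, `theta`, and `phi`, the Poisson-distributed document lengths, and all toy values are assumptions made for this illustration. The snippet samples topic-labeled word counts and evaluates the log-probability that this process assigns to them.

```python
import math
import random

def sample_poisson(rng, mean):
    """Knuth's multiplication method; adequate for small means."""
    threshold = math.exp(-mean)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def generate_corpus(eta, theta, phi, seed=0):
    """Return counts[d][w][r]: occurrences of word w labeled with topic r in document d.

    eta[d]      -- assumed expected length of document d
    theta[d][r] -- probability of topic r in document d
    phi[r][w]   -- probability of word w in topic r
    """
    rng = random.Random(seed)
    n_docs, n_topics, n_words = len(theta), len(phi), len(phi[0])
    counts = [[[0] * n_topics for _ in range(n_words)] for _ in range(n_docs)]
    for d in range(n_docs):
        length = sample_poisson(rng, eta[d])
        for _ in range(length):
            # First choose a topic for the token, then a word from that topic.
            r = rng.choices(range(n_topics), weights=theta[d])[0]
            w = rng.choices(range(n_words), weights=phi[r])[0]
            counts[d][w][r] += 1
    return counts

def log_likelihood(counts, eta, theta, phi):
    """Log-probability of the labeled counts under the assumed process:
    sum_d [k_d ln(eta_d) - eta_d] + sum_{d,w,r} [n ln(phi_rw theta_dr) - ln(n!)].
    """
    total = 0.0
    for d, doc in enumerate(counts):
        k_d = sum(sum(row) for row in doc)
        total += k_d * math.log(eta[d]) - eta[d]
        for w, row in enumerate(doc):
            for r, n in enumerate(row):
                if n:
                    total += n * math.log(phi[r][w] * theta[d][r]) - math.lgamma(n + 1)
    return total

# Hypothetical toy parameters: 2 documents, 2 topics, 4 words.
eta = [20.0, 30.0]
theta = [[0.9, 0.1], [0.2, 0.8]]
phi = [[0.5, 0.3, 0.2, 0.0], [0.0, 0.1, 0.4, 0.5]]
counts = generate_corpus(eta, theta, phi, seed=1)
print(log_likelihood(counts, eta, theta, phi))
```

Under these assumptions each labeled count is an independent Poisson variable, which is what makes the later comparison with a Poisson network model natural.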
For an unknown text, we could simply maximize Eq. (1) to obtain the best parameters , , and which describe the topical structure of the corpus. However, this approach cannot be used directly to model textual data without a significant danger of overfitting. The model possesses a large number of parameters, which grows as the number of documents, words, and topics is increased, and hence a maximum likelihood estimate will invariably incorporate a considerable amount of noise. One solution to this problem is to employ a Bayesian formulation, by proposing prior distributions for the parameters and integrating over them. This is precisely what is done in LDA blei.2003 ; griffiths.2004 , where one chooses Dirichlet priors and with hyperparameters and for the probabilities and above, and one uses instead the marginal likelihood.
If one makes a noninformative choice, i.e. and , inference using Eq. (2) is nonparametric and less susceptible to overfitting. In particular, one can obtain the labeling of word-tokens into topics, , conditioned only on the observed total frequencies of words in documents, , in addition to the number of topics itself, simply by maximizing or sampling from the posterior distribution. The weakness of this approach rests in the fact that the Dirichlet prior is a simplistic assumption about the data-generating process: in its noninformative form, every mixture in the model (both of topics in each document and of words into topics) is assumed to be equally likely, precluding the existence of any form of higher-order structure. This limitation has prompted the widespread practice of performing inference with LDA in a parametric way, by maximizing the likelihood with respect to the hyperparameters and , which can improve the quality of fit in many cases. However, this not only undermines to a large extent the initial purpose of a Bayesian approach, since the number of hyperparameters still increases with the number of documents, words, and topics, so that maximizing over them reintroduces the danger of overfitting; it also does not sufficiently address the original limitation of the Dirichlet prior. Namely, regardless of the hyperparameter choice, the Dirichlet distribution is unimodal, meaning that it generates mixtures which are either concentrated around the mean value, or spread away uniformly from it towards pure components. This means that for any choice of and the whole corpus is characterized by a single typical mixture of topics into documents, and a single typical mixture of words into topics. This is an extreme level of assumed homogeneity which stands in contradiction to a clustering approach initially designed to capture heterogeneity.
In addition to the above, the use of nonparametric Dirichlet priors is inconsistent with well-known universal statistical properties of real texts, most notably the highly skewed distribution of word frequencies, which typically follows Zipf’s law zipf.1936 . In contrast, the noninformative choice of the Dirichlet distribution with hyperparameters amounts to an expected uniform frequency of words in topics and documents. Although this disagreement can be addressed by choosing appropriate values of , such an approach, as already mentioned, runs contrary to nonparametric inference, and is subject to overfitting.
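The mismatch between a flat Dirichlet prior and skewed word frequencies can be checked numerically. The following is a minimal sketch, assuming that the noninformative choice corresponds to all hyperparameters being equal to one and that the Zipfian profile has exponent one. Both values, and the toy vocabulary of five words, are illustrative assumptions.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def mean_word_frequencies(n_words, n_samples=2000, seed=0):
    """Average word probabilities over many draws from a flat Dirichlet (all hyperparameters 1)."""
    rng = random.Random(seed)
    sums = [0.0] * n_words
    for _ in range(n_samples):
        for w, p in enumerate(sample_dirichlet(rng, [1.0] * n_words)):
            sums[w] += p
    return [s / n_samples for s in sums]

def zipf_frequencies(n_words, exponent=1.0):
    """Normalized rank-frequency profile proportional to rank**(-exponent)."""
    weights = [rank ** -exponent for rank in range(1, n_words + 1)]
    total = sum(weights)
    return [x / total for x in weights]

flat = mean_word_frequencies(5)
zipf = zipf_frequencies(5)
print([round(p, 3) for p in flat])   # close to uniform across the five words
print([round(p, 3) for p in zipf])   # strongly skewed towards the first rank
```

The averaged flat-prior frequencies are nearly identical for all words, whereas the Zipfian profile gives the most frequent word several times the weight of the least frequent one.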
In the following, we will show how the same original pLSI model can be re-cast as a network model that completely removes the limitations described above, and is capable of uncovering heterogeneity in the data at multiple scales.
ii.1.2 Topic models and community detection: Equivalence between pLSI and SBM
We show that pLSI is equivalent to a specific form of a mixed membership SBM as proposed by Ball et al. ball.2011 .
The SBM is a model that generates a network composed of nodes with adjacency matrix , which we will assume without loss of generality to correspond to a multigraph, i.e. . The nodes are placed in a partition composed of overlapping groups, and the edges between nodes and
are sampled from a Poisson distribution with average
where is the expected number of edges between group and group , and is the probability that node is sampled from group . The likelihood to observe , i.e. a particular decomposition of into labeled half-edges (i.e. edge endpoints) such that , can be written as
by exploiting the fact that the sum of Poisson variables is also distributed according to a Poisson.
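The Poisson average for a pair of nodes can be made concrete with a small sketch. It assumes that the average is the sum, over all pairs of groups, of the product of the two node propensities and the expected number of edges between the groups. The names `kappa` and `omega` and the toy values are illustrative assumptions.

```python
def expected_edges(kappa, omega, i, j):
    """Assumed Poisson mean for nodes i and j:
    sum over groups r, s of kappa[i][r] * omega[r][s] * kappa[j][s].
    """
    n_groups = len(omega)
    return sum(
        kappa[i][r] * omega[r][s] * kappa[j][s]
        for r in range(n_groups)
        for s in range(n_groups)
    )

# Hypothetical example with 4 nodes and 2 groups.
# kappa[i][r] is the probability that node i is sampled from group r,
# so each column sums to one.
kappa = [
    [0.5, 0.0],
    [0.5, 0.0],
    [0.0, 0.7],
    [0.0, 0.3],
]
# omega[r][s] is the expected number of edges between groups r and s.
omega = [
    [0.0, 10.0],
    [10.0, 0.0],
]

n_nodes = len(kappa)
total = sum(
    expected_edges(kappa, omega, i, j)
    for i in range(n_nodes)
    for j in range(n_nodes)
)
print(expected_edges(kappa, omega, 0, 2), total)
```

Because the propensities in each group sum to one, summing the node-level averages over all ordered pairs recovers the sum of the group-level expectations.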
The connection to pLSI can now be made by rewriting the token probabilities in Eq. (1) in a symmetric fashion as
where is the probability that the word belongs to topic , and is the overall propensity with which the word is chosen across all topics. In this manner, the likelihood of Eq. (1) can be re-written as
with . If we choose to view the counts as the entries of the adjacency matrix of a bipartite multigraph with documents and words as nodes, the likelihood of Eq. (6) is equivalent to the likelihood of Eq. (II.1.2) of the SBM, if we assume that each document belongs to its own specific group, , with for document-nodes, and by re-writing . Therefore, the SBM of Eq. (II.1.2) is a generalization of pLSI that allows the words as well as the documents to be clustered into groups, and includes it as a special case when the documents are not clustered.
In the symmetric setting of the SBM, we make no explicit distinction between words and documents, both of which become nodes in different partitions of a bipartite network. We base our Bayesian formulation that follows on this symmetric parametrization.
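Representing a corpus as a bipartite multigraph is straightforward in practice. The sketch below, with hypothetical toy documents, builds the edge multiplicities between document nodes and word nodes and derives the node degrees on both sides.

```python
from collections import Counter

def bipartite_edge_counts(documents):
    """Map (document index, word) to the number of times the word occurs in that document."""
    edges = Counter()
    for d, tokens in enumerate(documents):
        for word in tokens:
            edges[(d, word)] += 1
    return edges

# Hypothetical toy corpus, already tokenized.
docs = [
    "the beam energy of the beam".split(),
    "protein folding and the protein assembly".split(),
]
edges = bipartite_edge_counts(docs)

# Degrees: document length on one side, total word frequency on the other.
doc_degrees = Counter()
word_degrees = Counter()
for (d, word), multiplicity in edges.items():
    doc_degrees[d] += multiplicity
    word_degrees[word] += multiplicity

print(edges[(0, "beam")], doc_degrees[0], word_degrees["the"])
```

Each word-token becomes one edge, so the degree of a document node is its length and the degree of a word node is its total frequency in the corpus.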
ii.1.3 Community detection and the hierarchical SBM
Taking advantage of the above connection between pLSI and SBM, we show how the idea of hierarchical SBMs developed in Refs. peixoto.2014a ; peixoto.2015 ; peixoto_nonparametric_2017 can be extended such that they can be effectively used for the inference of topical structure in texts.
Like pLSI, the SBM likelihood of Eq. (II.1.2) contains a large number of parameters that grows with the number of groups, and therefore cannot be used effectively without knowing the most appropriate dimension of the model beforehand. Analogously to what is done in LDA, this can be addressed by assuming noninformative priors for the parameters and , and computing the marginal likelihood (for an explicit expression see Supplementary Materials Sec. 1.1)
where is a global parameter determining the overall density of the network. This can be used to infer the labeled adjacency matrix as done in LDA, with the difference that not only the words but also the documents would be clustered into mixed categories.
However, at this stage the model still shares some disadvantages with LDA. In particular, the noninformative priors make unrealistic assumptions about the data, where the mixture between groups and the distribution of nodes into groups are expected to be unstructured. Among other problems, this leads to a practical obstacle, as this approach possesses a “resolution limit” where at most groups can be inferred on a sparse network with nodes peixoto_parsimonious_2013 ; peixoto_nonparametric_2017 . In the following we propose a qualitatively different approach to the choice of priors by replacing the noninformative approach with a deeper Bayesian hierarchy of priors and hyperpriors, which are agnostic about the higher-order properties of the data while maintaining the nonparametric nature of the approach. We begin by re-formulating the above model as an equivalent microcanonical model peixoto_nonparametric_2017 (for a proof see Supplementary Materials Sec. 1.2) such that we can write the marginal likelihood as the joint likelihood of the data and its discrete parameters,
where is the total number of edges between groups and (we used the shorthand and ), is the probability of a labeled graph where the labeled degrees and edge counts between groups are constrained to specific values (and not their expectation values), is the uniform prior distribution of the labeled degrees constrained by the edge counts , and
is the prior distribution of edge counts, given by a mixture of independent geometric distributions with average.
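As an illustration of a geometric prior on edge counts, the following sketch evaluates the log-probability of a list of counts. It assumes that each count is an independent geometric variable on the non-negative integers with a common mean; the function names and values are hypothetical, and the exact form used in the cited references may differ in detail.

```python
import math

def geometric_log_prob(count, mean):
    """Log-probability of a non-negative integer under a geometric law with the given mean:
    P(e) = mean**e / (mean + 1)**(e + 1).
    """
    return count * math.log(mean) - (count + 1) * math.log(mean + 1.0)

def edge_count_log_prior(edge_counts, mean):
    """Sum of independent geometric log-probabilities, one per pair of groups."""
    return sum(geometric_log_prob(e, mean) for e in edge_counts)

# Hypothetical edge counts between three pairs of groups.
print(edge_count_log_prior([0, 2, 5], mean=3.0))
```

This distribution has a single parameter, the mean, which is why it carries no information about any particular pattern of connections between groups.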
The main advantage of this alternative model formulation is that it allows us to remove the homogeneous assumptions by replacing the uniform priors and by a hierarchy of priors and hyperpriors that incorporate the possibility of higher-order structures. This can be achieved in a tractable manner without the need of solving complicated integrals that would be required by introducing deeper Bayesian hierarchies in Eq. (7) directly.
In a first step, we follow the approach of Ref. peixoto.2015 and condition the labeled degrees on an overlapping partition , given by
such that they are sampled by a distribution
Importantly, the labeled degree sequence is sampled conditioned on the frequency of degrees inside each mixture , which itself is sampled from its own noninformative prior,
where is the number of incident edges in each mixture (for detailed expressions see Supplementary Materials Sec. 1.3).
Due to the fact that the frequencies of the mixtures as well as the frequencies of the labeled degrees are treated as latent variables, this model admits group mixtures which are far more heterogeneous than the Dirichlet prior used in LDA. In particular, as was shown in Ref. peixoto_nonparametric_2017 , the expected degrees generated in this manner follow a Bose-Einstein distribution, which is much broader than the exponential distribution obtained with the prior of Eq. (10). More importantly, the asymptotic form of the degree likelihood will approach the true distribution as the prior washes out peixoto_nonparametric_2017 , making it more suitable for skewed empirical frequencies, such as Zipf’s law or mixtures thereof gerlach.2013 , without requiring specific parameters (such as exponents) to be determined a priori.
In a second step, we follow Refs. peixoto.2014a ; peixoto_nonparametric_2017 and model the prior for the edge counts between groups by interpreting it as an adjacency matrix itself, i.e. a multigraph where the groups are the nodes. We then proceed by generating it from another SBM which, in turn, has its own partition into groups and matrix of edge counts. Continuing in the same manner yields a hierarchy of nested SBMs, where each level clusters the groups of the levels below. This yields a probability (see Ref. peixoto_nonparametric_2017 ) given by
where the index refers to the variable of the SBM at a particular level, e.g., is the number of nodes in group at level .
The use of this hierarchical prior is a strong departure from the noninformative assumption considered previously while containing it as a special case when the depth of the hierarchy is . It means that we expect some form of heterogeneity in the data at multiple scales, where groups of nodes are themselves grouped in larger groups forming a hierarchy. Crucially, this removes the “unimodality” inherent in the LDA assumption, as the group mixtures are now modeled by another generative level which admits as much heterogeneity as the original one. Furthermore, it can be shown to significantly alleviate the resolution limit of the noninformative approach, since it enables the detection of at most groups in a sparse network with nodes peixoto.2014a ; peixoto_nonparametric_2017 .
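To give a sense of how much the resolution limit is alleviated, the sketch below compares two scalings for the maximum number of detectable groups. It assumes the orders of magnitude reported in the cited references on the nested SBM, namely the square root of the number of nodes for the noninformative prior and the number of nodes divided by its logarithm for the hierarchical prior. Prefactors are omitted, so only the trend is meaningful.

```python
import math

def flat_limit(n_nodes):
    """Assumed scaling of the noninformative (flat) prior: sqrt(N)."""
    return math.sqrt(n_nodes)

def nested_limit(n_nodes):
    """Assumed scaling of the hierarchical prior: N / ln(N)."""
    return n_nodes / math.log(n_nodes)

for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(flat_limit(n)), round(nested_limit(n)))
```

Under these assumed scalings the gap widens quickly with network size, which matters for corpora with many documents and a large vocabulary.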
Given the above model we can find the best overlapping partitions of the nodes by maximizing the posterior distribution
which can be efficiently inferred using Markov chain Monte Carlo, as described in Refs. peixoto.2015 ; peixoto_nonparametric_2017 . The nonparametric nature of the model makes it possible to infer i) the depth of the hierarchy (containing the “flat” model in case the data does not support a hierarchical structure) and ii) the number of groups for both documents and words directly from the posterior distribution, without the need for extrinsic methods or supervised approaches to prevent overfitting. The latter can be seen by interpreting Eq. (19) as a description length; see the discussion after Eq. (22).
The model above generates arbitrary multigraphs, whereas text is represented as a bipartite network of words and documents. Since the latter is a special case of the former, where words and documents belong to distinct groups, the model can be used as it is, as it will “learn” the bipartite structure during inference. However, a more consistent approach for text is to include this information in the prior, since we should not have to infer what we already know. This can be done via a simple modification of the model, where one replaces the prior for the overlapping partition appearing in Eq. (13) by
where and now correspond to a disjoint overlapping partition of the words and documents, respectively. Likewise, the same must be done at the upper levels of the hierarchy, by replacing Eq. (17) with
In this way, by construction, words and documents will never be placed together in the same group.
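The constraint can be expressed as a simple validity check on a candidate partition. The sketch below is only an illustration of the constraint, not of the prior itself. It assumes that each node carries a set of group labels, and the data structures and names are hypothetical.

```python
def respects_bipartite_split(node_kinds, node_groups):
    """Return True if no group label is shared by a word node and a document node."""
    kinds_per_group = {}
    for node, groups in node_groups.items():
        for group in groups:
            kinds_per_group.setdefault(group, set()).add(node_kinds[node])
    return all(len(kinds) == 1 for kinds in kinds_per_group.values())

# Hypothetical nodes and overlapping group memberships.
node_kinds = {"doc0": "document", "doc1": "document", "beam": "word", "protein": "word"}
valid = {"doc0": {0}, "doc1": {0, 1}, "beam": {2}, "protein": {2, 3}}
invalid = {"doc0": {0}, "doc1": {1}, "beam": {1}, "protein": {2}}

print(respects_bipartite_split(node_kinds, valid))
print(respects_bipartite_split(node_kinds, invalid))
```

Restricting the prior to partitions that pass such a check means that no probability is spent on partitions that mix words and documents.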
ii.2 Comparing LDA and hSBM in real and artificial data
In this section we show that the theoretical considerations discussed in the previous section are relevant in practice. We show that hSBM constitutes a better model than LDA in three classes of problems. First, we construct simple examples that show that LDA fails in cases of non-Dirichlet topic mixtures, while hSBM is able to infer both Dirichlet and non-Dirichlet mixtures. Second, we show that hSBM outperforms LDA even in artificial corpora drawn from the generative process of LDA. Third, we consider five different real corpora. We perform statistical model selection based on the principle of minimum description length rissanen_modeling_1978 , computing the description length (the smaller the better) of each model (for details see Materials and Methods, Minimum Description Length).
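Model selection by description length reduces to comparing one number per model. The following sketch assumes the common two-part form, in which the description length is the negative log-likelihood of the data plus the negative log-prior of the parameters. The model names and numerical values are hypothetical.

```python
def description_length(log_likelihood, log_prior):
    """Description length in nats: -ln P(data | parameters) - ln P(parameters)."""
    return -(log_likelihood + log_prior)

def best_model(candidates):
    """Return the name of the candidate with the smallest description length."""
    return min(candidates, key=lambda name: description_length(*candidates[name]))

# Hypothetical (log-likelihood, log-prior) pairs for two fitted models.
candidates = {
    "model_a": (-1200.0, -150.0),
    "model_b": (-1100.0, -400.0),
}
for name, (loglik, logprior) in candidates.items():
    print(name, description_length(loglik, logprior))
print(best_model(candidates))
```

In this toy comparison the model with the better fit is not selected, because its parameters are more costly to describe. This is how the criterion penalizes overfitting.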
ii.2.1 Failure of LDA in the case of non-Dirichlet mixtures
The choice of the Dirichlet distribution as a prior for the topic mixtures implies that the ensemble of topic mixtures is assumed to be either unimodal or concentrated at the edges of the simplex. This is an undesired feature of this prior because there is no reason why data should show these characteristics. In order to explore how this affects the inference of LDA, we construct a set of simple examples with topics which allow for easy visualization. Besides real data, we consider synthetic data constructed from the generative process of LDA — in which case indeed follows a Dirichlet distribution — and from cases in which the Dirichlet assumption is violated — e.g. by superimposing two Dirichlet mixtures resulting in a bimodal instead of a unimodal .
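A non-Dirichlet ensemble of topic mixtures of the kind described above can be produced by superimposing two Dirichlet distributions. The sketch below assumes three topics, equal weights for the two components, and hypothetical concentration parameters chosen so that the two modes lie near different corners of the simplex.

```python
import random

def sample_dirichlet(rng, alphas):
    """Draw one probability vector from Dirichlet(alphas) via normalized gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def sample_bimodal_mixtures(n_docs, alphas_a, alphas_b, weight_a=0.5, seed=0):
    """Draw each document's topic mixture from one of two Dirichlet components."""
    rng = random.Random(seed)
    mixtures = []
    for _ in range(n_docs):
        alphas = alphas_a if rng.random() < weight_a else alphas_b
        mixtures.append(sample_dirichlet(rng, alphas))
    return mixtures

# Two components concentrated near the first and the second topic, respectively.
mixtures = sample_bimodal_mixtures(2000, [20.0, 2.0, 2.0], [2.0, 20.0, 2.0])
share_near_first_topic = sum(m[0] > 0.5 for m in mixtures) / len(mixtures)
print(round(share_near_first_topic, 2))
```

About half of the documents end up dominated by the first topic and the other half by the second, a pattern that a single Dirichlet distribution cannot reproduce.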
The results summarized in Fig. 3 show that SBM leads to better results than LDA. In Dirichlet-generated data (Fig. 3A), LDA self-consistently identifies the distribution of mixtures correctly. Remarkably, the SBM is also able to correctly identify the Dirichlet mixture even though we did not explicitly specify Dirichlet priors. In the non-Dirichlet synthetic data (Fig. 3B), the SBM results again closely match the true topic mixtures but LDA completely fails. In fact, although the result inferred by LDA no longer resembles a Dirichlet distribution after being influenced by data, it is significantly distorted by the unsuitable prior assumptions. Turning to real data (Fig. 3C), LDA and the SBM yield very different results. While the “true” underlying topic mixture of each document is unknown in this case, we can identify the negative consequence of the Dirichlet priors from the fact that the results from LDA are again similar to the ones expected from a Dirichlet distribution (and are thus likely an artifact), while the SBM results suggest a much richer pattern.
Taken together, the results of this simple example visually show that LDA not only struggles to infer non-Dirichlet mixtures, but also that it shows strong biases in the inference towards Dirichlet-type mixtures. On the other hand, SBM is able to capture a much richer spectrum of topic mixtures due to its nonparametric formulation. This is a direct consequence of the choice of priors: while LDA assumes a priori that the ensemble of topic mixtures, , follows a Dirichlet distribution, SBM is more agnostic with respect to the type of mixtures while retaining its nonparametric formulation.
ii.2.2 Artificial corpora sampled from LDA
We consider artificial corpora constructed from the generative process of LDA, incorporating some aspects of real texts (for details see Materials and Methods, Artificial corpora and Supplementary Materials Sec. 2.1). Although LDA is not a good model for real corpora, as the Dirichlet assumption is not realistic, it serves to illustrate that even in a situation that clearly favors LDA, the hSBM frequently provides a better description of the data.
From the generative process we know the true latent variable of each word-token. Therefore, we are able to obtain the inferred topical structure from each method by simply assigning the true labels without using approximate numerical optimization methods for the inference. This allows us to separate intrinsic properties of the model itself from external properties related to the numerical implementation.
In order to allow for a fair comparison between hSBM and LDA, we consider two different choices in the inference of each method. LDA requires the specification of a set of hyperparameters and used in the inference. While in this particular case we know the true hyperparameters that generated the corpus, in general these are unknown. Therefore, in addition to the true values, we also consider a noninformative choice, i.e. and . For the inference with hSBM, we only use the special case where the hierarchy has a single level such that the prior is noninformative. We consider two different parametrizations of the SBM: (1) each document is assigned to its own group, i.e. documents are not clustered, and (2) different documents can belong to the same group, i.e. documents are clustered. While the former is motivated by the original correspondence between pLSI and SBM, the latter shows the additional advantage offered by the possibility of clustering documents due to its symmetric treatment of words and documents in a bipartite network (for details see Supplementary Materials Sec. 2.2).
In Fig. 4A, we show that hSBM is consistently better than LDA for synthetic corpora of almost any text length ranging over 4 orders of magnitude. These results hold for asymptotically large corpora (in terms of the number of documents) as shown in Fig. 4B, where we observe that the normalized description length of each model converges to a fixed value when increasing the size of the corpus. We confirm that these results hold across a wide range of parameter settings varying the number of topics as well as the values and base measures of the hyperparameters (Supplementary Materials Sec. 3, Figs. S1 - S3).
The LDA description length does not depend strongly on the considered prior (true or noninformative) as the size of the corpora increases (Fig. 4B). This is consistent with the typical expectation that in the limit of large data, the prior “washes out”. Note, however, that for smaller corpora the description length of the noninformative prior is significantly worse than that of the true prior.
In contrast, the hSBM provides much shorter description lengths than LDA for the same data when allowing documents to be clustered as well. The only exception is for very small texts ( tokens), where we have not converged to the asymptotic limit in the per-word description length. In the limit we expect hSBM to provide a similarly good or better model than LDA for all text lengths. The improvement of the hSBM over LDA in an LDA-generated corpus is counterintuitive because, for sufficient data, we expect the true model to provide a better description for it. However, for a model like LDA the limit of “sufficient data” involves the simultaneous scaling of the number of documents, words, and topics to very high values. In particular, the generative process of LDA requires a large number of documents to resolve the underlying Dirichlet distribution of the topic-document distribution as well as a large number of topics to resolve the underlying word-topic distribution. While the former is realized by growing the corpus through adding documents, the latter aspect is nontrivial because the observed size of the vocabulary is not a free parameter but is determined by the word-frequency distribution and the size of the corpus through the so-called Heaps’ law altmann.book2016 . This means that as we grow the corpus by adding more and more documents, initially the vocabulary increases linearly and only for very large corpora does it settle into an asymptotic sublinear growth (Supplementary Materials Sec. 4, Fig. S4). This, in turn, requires an ever larger number of topics to resolve the underlying word-topic distribution. Such a large number of topics is not feasible in practice because it defeats the whole purpose of topic models: compressing the information by obtaining an effective, coarse-grained description of the corpus with a manageable number of topics.
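The dependence of the observed vocabulary on corpus size can be illustrated with a small simulation. The sketch below draws word-tokens from a Zipfian frequency distribution and records how many distinct words have been seen. The finite vocabulary of 5000 words, the exponent of one, and the function name are simplifying assumptions made for illustration.

```python
import random

def vocabulary_growth(n_tokens, vocab_size=5000, exponent=1.0, seed=0):
    """Return the number of distinct word types seen after each token of a Zipfian stream."""
    rng = random.Random(seed)
    weights = [rank ** -exponent for rank in range(1, vocab_size + 1)]
    tokens = rng.choices(range(vocab_size), weights=weights, k=n_tokens)
    seen = set()
    growth = []
    for token in tokens:
        seen.add(token)
        growth.append(len(seen))
    return growth

growth = vocabulary_growth(10000)
for n in (100, 1000, 10000):
    print(n, growth[n - 1])
```

In this toy setting the number of distinct words grows almost in proportion to the number of tokens at first and then increasingly slowly, so the observed vocabulary is fixed by the frequency distribution and the corpus size rather than chosen freely.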
In summary, the limits in which LDA provides a better description, that is, either extremely small texts or a very large number of topics, are irrelevant in practice. The observed limitations of LDA are due to the following reasons: i) the finite number of topics used to generate the data always leads to an under-sampling of the Dirichlet distributions, and ii) LDA is redundant in the way it describes the data in this sparse regime. In contrast, the assumptions of the hSBM are better suited for this sparse regime, and hence lead to a more compact description of the data, despite the fact that the corpora were generated by LDA.
ii.2.3 Real corpora
We compare LDA and SBM for a variety of different datasets, as shown in Table 1 (for details see Materials and Methods, Datasets for real corpora/Numerical implementations). When using LDA, we consider both noninformative priors and fitted hyperparameters, for a wide range of numbers of topics. We obtain systematically smaller values for the description length using the hSBM. For real corpora, the difference is exacerbated by the fact that the hSBM is capable of clustering documents, capitalizing on a source of structure in the data which is completely unavailable to LDA.
As our examples also show, LDA cannot be used in a direct manner to choose the number of topics, as the noninformative choice systematically underfits (the description length increases monotonically with the number of topics), and the parametric approach systematically overfits (the description length decreases monotonically with the number of topics). In practice, users are required to resort to heuristics arun_finding_2010 ; cao_density-based_2009 , or more complicated inference approaches based on the computation of the model evidence, which are not only numerically expensive, but can only be done under onerous approximations griffiths.2004 ; wallach.2009a . In contrast, the hSBM is capable of extracting the appropriate number of topics directly from its posterior distribution, while simultaneously avoiding both under- and overfitting peixoto.2014a ; peixoto_nonparametric_2017 .
In addition to these formal aspects, we argue that the hierarchical nature of the hSBM, and the fact that it clusters words as well as documents, makes it more useful in interpreting text. We illustrate this with a case study in the next section.
ii.3 Case study: Application of hSBM to Wikipedia articles
We illustrate the results of the inference with the hSBM for articles taken from the English Wikipedia in Fig. 5, showing the hierarchical clustering of documents and words. To make the visualization clearer, we focus on a small network created from only three scientific disciplines: Chemical Physics (21 articles), Experimental Physics (24 articles), and Computational Biology (18 articles). For clarity, we only consider words that appear more than once, such that we end up with a network of document-nodes, word-nodes, and edges.
The hSBM splits the network into groups on different levels, organized as a hierarchical tree. Note that the number of groups and the number of levels were not specified beforehand but automatically detected in the inference. On the highest level, hSBM reflects the bipartite structure into word- and document-nodes, as is imposed in our model.
In contrast to traditional topic models such as LDA, hSBM automatically clusters documents into groups. While we considered articles from three different categories (one category from biology and two categories from physics), the second level in the hierarchy separates documents into only two groups corresponding to articles about biology (e.g. bioinformatics or K-mer) and articles on physics (e.g. Rotating wave approximation or Molecular beam). For lower levels, articles become separated into a larger number of groups, e.g. one group contains two articles on Euler’s and Newton’s law of motion, respectively.
For words, the second level in the hierarchy splits nodes into three separate groups. We find that two groups represent words belonging to physics (e.g. beam, formula, or energy) and biology (assembly, folding, or protein) while the third group represents function words (the, of, or a). In fact, we find that the latter group’s words show a close-to-random distribution across documents by calculating the dissemination coefficient (right side of Fig. 5, see caption for definition). Furthermore, the median dissemination of the other groups is substantially less random, with the exception of one subgroup (containing and, for, or which). This suggests a more data-driven approach to dealing with function words in topic models. The standard practice is to remove words from a manually curated list of stopwords; however, recent results question the efficacy of such methods Schoffield2017 . In contrast, the hSBM is able to automatically identify groups of stopwords, potentially rendering such heuristic interventions unnecessary.
The underlying equivalence between pLSI and the overlapping version of the SBM means that the “bag of words” formulation of topical corpora is mathematically equivalent to bipartite networks of words and documents with modular structures. From this we were able to formulate a topic model based on a hierarchical version of the SBM (hSBM) in a fully Bayesian framework, alleviating some of the most serious conceptual deficiencies in current approaches to topic modeling such as LDA. In particular, the model formulation is nonparametric, and model complexity aspects such as the number of topics can be inferred directly from the model’s posterior distribution. Furthermore, the model is based on a hierarchical clustering of both words and documents, in contrast to LDA, which is based on a nonhierarchical clustering of the words alone. This enables the identification of structural patterns in text that are unavailable to LDA, while at the same time allowing for the identification of patterns at multiple scales of resolution.
We have shown that the hSBM constitutes a better topic model compared to LDA not only for a diverse set of real corpora but even for artificial corpora generated from LDA itself. It is capable of providing better compression (as a measure of the quality of fit) as well as a richer interpretation of the data. More importantly, however, the hSBM offers an alternative to the Dirichlet priors employed in virtually every variation of current approaches to topic modeling. While motivated by their computational convenience, Dirichlet priors do not reflect prior knowledge compatible with the actual usage of language. In fact, our analysis suggests that Dirichlet priors introduce severe biases into the inference result, which in turn dramatically hinder its performance in the case of even slight deviations from the Dirichlet assumption. In contrast, our work shows how to formulate and incorporate different (and, as we have shown, more suitable) priors in a fully Bayesian framework, which are completely agnostic to the type of inferred mixtures. Furthermore, it also serves as a working example that efficient numerical implementations of non-Dirichlet topic models are feasible and can be applied in practice to large collections of documents.
More generally, our results show how the same mathematical ideas can be applied to two extremely popular and mostly disconnected problems: the inference of topics in corpora and of communities in networks. We used this connection to obtain improved topic models, but there are many additional theoretical results in community detection that should be explored in the topic model context, e.g., fundamental limits to inference such as the undetectable-detectable phase transition decelle.2011 or the analogy to Potts-like spin systems in statistical physics hu.2012 . Furthermore, this connection allows the many extensions of the SBM, such as multilayer peixoto_inferring_2015 and annotated newman_structure_2016 ; hric_network_2016 versions, to be readily used for topic modeling of richer text including hyperlinks, citations between documents, etc. Conversely, the field of topic modeling has long adopted a Bayesian perspective to inference, which until now has not seen widespread use in community detection. Thus, insights from topic modeling about either the formulation of suitable priors, or the approximation of posterior distributions, might catalyze the development of improved statistical methods to detect communities in networks. Furthermore, the traditional application of topic models in the analysis of texts leads to classes of networks usually not considered by community detection algorithms. The word-document network is bipartite (words-documents), the topics/communities can be overlapping, and the number of links (word-tokens) and nodes (word-types) are connected to each other through Heaps’ law. In particular, the latter aspect results in dense networks, which have been largely overlooked by the networks community Courtney2018 . Topic models, thus, might provide additional insights into how to approach such networks, as it remains unclear how such properties affect the inference of communities in word-document networks.
More generally, Heaps’ law constitutes only one of numerous statistical laws in language altmann.book2016 , such as the well-known Zipf’s law zipf.1936 . While these regularities are well-studied empirically, few attempts have been made to incorporate them explicitly as prior knowledge, e.g. formulating generative processes that lead to Zipf’s law Sato2010 ; Buntine2014 . Our results show that the SBM provides a flexible approach to deal with Zipf’s law which constitutes a challenge to state-of-the-art topic models such as LDA. Zipf’s law appears also in genetic codes Mantegna1994 and images Sudderth2009 , two prominent fields in which LDA-type models have been extensively applied pritchard.2000 ; Broderick2015 , suggesting that the block-model approach we introduce here is promising also beyond text analysis.
Iv Materials & Methods
iv.1 Minimum Description Length
We note that is conditioned on the hyperparameters , and therefore it is exact for noninformative priors ( and ) only. Otherwise, Eq. (22) is only a lower bound for because it lacks the terms involving hyperpriors for and . For simplicity, we ignore this correction in our analysis and therefore we favor LDA.
The motivation for this approach is two-fold.
On the one hand, it offers a well-founded approach to unsupervised model selection within the framework of information theory, as it corresponds to the amount of information necessary to describe simultaneously i) the data when the model parameters are known, and ii) the parameters themselves. As the complexity of the model increases, the former will typically decrease, as it fits the data more closely, while at the same time it is compensated by an increase of the latter term, which serves as a penalty that prevents overfitting. In addition, given data and two models and with description length and , we can relate the difference
to the Bayes’ Factor (BF) kass.1995 . The latter quantifies how much more likely one model is compared to the other given the data
where we assume that each model is a priori equally likely, i.e. .
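For reference, the relation described above can be written out explicitly. The following is a sketch in standard notation, assuming natural logarithms and that the description length of model $M_i$ is $\Sigma_i = -\ln P(X \mid M_i)$; the symbols $X$, $M_1$, $M_2$ and $\Sigma_i$ are chosen here for illustration and may differ from the original typesetting.

```latex
\Delta\Sigma \equiv \Sigma_{1} - \Sigma_{2}, \qquad
\mathrm{BF} \equiv \frac{P(X \mid M_1)}{P(X \mid M_2)}
  = \frac{P(M_1 \mid X)\, P(M_2)}{P(M_2 \mid X)\, P(M_1)}
  = e^{-\Delta\Sigma}
```

With equal prior probabilities, $P(M_1) = P(M_2)$, the Bayes factor coincides with the posterior odds. As a purely numerical illustration, $\Delta\Sigma = -10$ nats would correspond to $\mathrm{BF} = e^{10} \approx 2.2 \times 10^{4}$ in favor of $M_1$.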
The description length allows for a straightforward model comparison without the introduction of confounding factors. In fact, commonly used supervised model selection approaches such as perplexity require additional approximation techniques wallach.2009a , which are not readily applicable to the microcanonical formulation of the SBM. It is thus not clear whether any difference in predictive power would result from the model and its inference or the approximation used in the calculation of perplexity. Furthermore, we note that it has been shown recently that supervised approaches based on the held-out likelihood of missing edges tend to overfit in key cases, failing to select the most parsimonious model, unlike unsupervised approaches which are more robust Valles-Catala2017 .
iv.2 Artificial corpora
For the construction of the artificial corpora, we fix the parameters in the generative process of LDA, i.e. the number of topics , the hyperparameters and , and the length of individual articles . The () - hyperparameters determine the distribution of topics (words) in each document (topic).
The generative process of LDA can be described in the following way. For each topic we sample a distribution over words from a -dimensional Dirichlet distribution with parameters for . For each document we sample a topic mixture from a -dimensional Dirichlet distribution with parameters for . For each word position ( is the length of document ) we first sample a topic from a multinomial with parameters and then sample a word from a multinomial with parameters .
We assume a parametrization in which i) each document has the same topic-document hyperparameter, i.e. for , and ii) each topic has the same word-topic hyperparameter, i.e. for . We fix the average probability of occurrence of a topic, , (word, ) by introducing scalar hyperparameters (), i.e. for ( for ). In our case we choose i) equiprobable topics, i.e. and ii) empirically measured word frequencies from the Wikipedia corpus, i.e. with , yielding a Zipfian distribution (Supplementary Materials Sec. 5, Fig. S5), shown to be universally described by a double power law gerlach.2013 .
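One way to read this parametrization is as a rescaling of a normalized frequency vector by a scalar concentration. The sketch below assumes the form (hyperparameter of item i) = scalar × (number of items) × (relative frequency of item i), which keeps the mean of the hyperparameter vector equal to the scalar. This formula and the function name `scaled_hyperparameters` are assumptions made for illustration; they are not taken verbatim from the text.

```python
from collections import Counter


def scaled_hyperparameters(counts, scalar):
    """Turn raw counts into a hyperparameter vector whose mean equals `scalar`.

    Assumed form (hypothetical, for illustration only):
        hyper[i] = scalar * n_items * counts[i] / total
    """
    total = sum(counts.values())
    n_items = len(counts)
    return {item: scalar * n_items * n / total for item, n in counts.items()}


# Toy word counts standing in for empirically measured word frequencies.
word_counts = Counter("a a a b".split())
beta_vector = scaled_hyperparameters(word_counts, scalar=0.5)

# Equal counts (equiprobable items) reduce to the scalar itself.
uniform_vector = scaled_hyperparameters(Counter({"t1": 1, "t2": 1}), scalar=0.5)
```

Under this assumed form, equiprobable topics give every entry the same value, while a Zipfian frequency vector concentrates most of the prior weight on a small number of frequent words.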
iv.3 Datasets for real corpora
For the comparison of hSBM and LDA we consider different datasets of written texts varying in genre, time of origin, average text length, number of documents, and language; as well as datasets used in previous works on topic models, e.g. blei.2003 ; wallach.2009 ; asuncion.2009 ; lancichinetti.2015 :
“Twitter”, a sample of Twitter messages obtained from http://www.nltk.org/nltk_data/;
“Reuters”, a collection of documents from the Reuters financial newswire service denoted as “Reuters-21578, Distribution 1.0” obtained from http://www.nltk.org/nltk_data/;
“Web of Science”, abstracts from physics papers published in the year 2000;
“New York Times (NYT)”, a collection of newspaper articles obtained from http://archive.ics.uci.edu/ml;
“PlosOne”, full text of all scientific articles published in 2011 in the journal PLoS One obtained via the PLOS API (http://api.plos.org/).
In all cases we considered a random subset of the documents, as detailed in Table 1. For the NYT data we did not employ any additional filtering since the data was already provided in the form of pre-filtered word counts. For the other datasets we employed the following filtering: i) we decapitalized all words, ii) we replaced punctuation and special characters (e.g. “.”, “,”, or “/”) by blank spaces such that we can define a word as any substring between two blank spaces, and iii) we kept only those words which consisted of the letters a-z.
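The three filtering steps can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the paper: the function name `filter_tokens` is hypothetical, and treating every non-alphanumeric, non-whitespace character (plus the underscore) as "punctuation or special character" is an assumption.

```python
import re


def filter_tokens(text):
    """Apply the three filtering steps to a raw text and return its word tokens."""
    # i) decapitalize all words
    lowered = text.lower()
    # ii) replace punctuation and special characters by blank spaces
    #     (assumption: anything that is not a letter, digit, or whitespace)
    spaced = re.sub(r"[^\w\s]|_", " ", lowered)
    # a word is any substring between two blank spaces
    candidates = spaced.split()
    # iii) keep only words consisting exclusively of the letters a-z
    return [word for word in candidates if re.fullmatch(r"[a-z]+", word)]


tokens = filter_tokens("Hello, World/2 times. naïve e-mail")
```

Note that step iii) discards tokens containing digits or accented letters entirely, rather than stripping the offending characters.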
iv.4 Numerical Implementations
For inference with LDA we used the package mallet (http://mallet.cs.umass.edu/). The algorithm for inference with the hSBM presented in this work is implemented in C++ as part of the graph-tool Python library (https://graph-tool.skewed.de). We provide code on how to use the hSBM for topic modeling in a GitHub repository (https://topsbm.github.io/).
Acknowledgements. We thank M. Palzenberger for help with the “Web of Science” data. E.G.A. thanks L. Azizi and W. L. Buntine for helpful discussions.
M.G., T.P.P., and E.G.A. designed research; M.G., T.P.P., and E.G.A. performed research; M.G. and T.P.P. analyzed data; M.G., T.P.P., and E.G.A. wrote the paper.
The authors declare no competing interests.
Data and materials availability:
Data and code are available from the sources indicated above or from authors upon request.
- (1) D. M. Blei, Probabilistic topic models. Commun. ACM 55, 77 (2012).
- (2) S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, R. Harshman, Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41, 391–407 (1990).
- (3) Z. Ghahramani, Probabilistic machine learning and artificial intelligence. Nature 521, 452–459 (2015).
- (4) T. Hofmann, Probabilistic latent semantic indexing, Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - SIGIR ’99, Berkeley, CA, USA, 15 to 19 August 1999, pp. 50–57.
- (5) D. M. Blei, A. Y. Ng, M. I. Jordan, Latent dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
- (6) T. L. Griffiths, M. Steyvers, Finding scientific topics. Proc. Natl. Acad. Sci. U.S.A. 101 Suppl, 5228–35 (2004).
- (7) C. D. Manning, P. Raghavan, H. Schutze, Introduction to Information Retrieval (Cambridge University Press, Cambridge, 2008).
- (8) K. W. Boyack, D. Newman, R. J. Duhon, R. Klavans, M. Patek, J. R. Biberstine, B. Schijvenaars, A. Skupin, N. Ma, K. Börner, Clustering more than two million biomedical publications: Comparing the accuracies of nine text-based similarity approaches. PLOS One 6, e18029 (2011).
- (9) D. S. McNamara, Computational methods to extract meaning from text and advance theories of human cognition. Top. Cogn. Sci. 3, 3–17 (2011).
- (10) J. Grimmer, B. M. Stewart, Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Polit. Anal. 21, 267–297 (2013).
- (11) B. Liu, L. Liu, A. Tsykin, G. J. Goodall, J. E. Green, M. Zhu, C. H. Kim, J. Li, Identifying functional miRNA-mRNA regulatory modules with correspondence latent dirichlet allocation. Bioinformatics 26, 3105–3111 (2010).
- (12) J. K. Pritchard, M. Stephens, P. Donnelly, Inference of population structure using multilocus genotype data. Genetics 155, 945–959 (2000).
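- (13) Fei-Fei Li, P. Perona, A bayesian hierarchical model for learning natural scene categories, Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20 to 25 June 2005, vol. 2, pp. 524–531.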
- (14) E. G. Altmann, M. Gerlach, Statistical laws in linguistics, in Creativity and Universality in Language, M. Degli Esposti, E. G. Altmann, F. Pachet, Eds. (Springer, Cham, 2016), pp. 7–26.
- (15) G. K. Zipf, The Psycho-Biology of Language (Routledge, London, 1936).
- (16) A. Lancichinetti, M. I. Sirer, J. X. Wang, D. Acuna, K. Körding, L. A. N. Amaral, A high-reproducibility and high-accuracy method for automated topic classification. Phys. Rev. X 5, 011007 (2015).
- (17) T. L. Griffiths, M. Steyvers, D. M. Blei, J. B. Tenenbaum, Integrating topics and syntax, in Advances in Neural Information Processing Systems 17 (NIPS 2004), L. K. Saul, Y. Weiss, L. Bottou, Eds. (MIT Press, 2005), pp. 537–544.
- (18) W. Li, A. McCallum, Pachinko allocation: DAG-structured mixture models of topic correlations, Proceedings of the 23rd International conference on Machine Learning - ICML ’06, Pittsburgh, PA, USA, 25 to 29 June 2006, pp. 577–584.
- (19) M. Rosen-Zvi, T. L. Griffiths, M. Steyvers, P. Smyth, The author-topic model for authors and documents, Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence - UAI’04, Banff, Canada, 7 to 11 July 2004, pp. 487–494.
- (20) G. Doyle, C. Elkan, Accounting for burstiness in topic models, Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, Montreal, Canada, 14 to 18 June 2009, pp. 281–288.
- (21) W. Zhao, J. J. Chen, R. Perkins, Z. Liu, W. Ge, Y. Ding, W. Zou, A heuristic approach to determine an appropriate number of topics in topic modeling. BMC Bioinformatics 16, S8 (2015).
- (22) H. M. Wallach, I. Murray, R. Salakhutdinov, D. Mimno, Evaluation methods for topic models, Proceedings of the 26th Annual International Conference on Machine Learning - ICML ’09, Montreal, Canada, 14 to 18 June 2009, pp. 1105–1112.
- (23) Y. W. Teh, M. I. Jordan, M. J. Beal, D. M. Blei, Hierarchical dirichlet processes. J. Am. Stat. Assoc. 101, 1566–1581 (2006).
- (24) D. M. Blei, T. L. Griffiths, M. I. Jordan, The nested chinese restaurant process and bayesian nonparametric inference of topic hierarchies. J. ACM 57, 1–30 (2010).
- (25) J. Paisley, C. Wang, D. M. Blei, M. I. Jordan, Nested hierarchical dirichlet processes. IEEE T. Pattern Anal. 37, 256–270 (2015).
- (26) E. B. Sudderth, M. I. Jordan, Shared segmentation of natural scenes using dependent Pitman-Yor processes, in Advances in Neural Information Processing Systems 21 (NIPS 2008), D. Koller, D. Schuurmans, Y. Bengio, L. Bottou, Eds. (Curran Associates, Inc., 2009), pp. 1585–1592.
- (27) I. Sato, H. Nakagawa, Topic models with power-law using Pitman-Yor process, Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’10, Washington, DC, USA, 25 to 28 July 2010, pp. 673–682.
- (28) W. L. Buntine, S. Mishra, Experiments with non-parametric topic models, Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’14, New York, NY, USA, 24 to 27 August 2014, pp. 881–890.
- (29) T. Broderick, L. Mackey, J. Paisley, M. I. Jordan, Combinatorial clustering and the beta negative binomial process. IEEE T. Pattern Anal. 37, 290–306 (2015).
- (30) M. Zhou, L. Carin, Negative binomial process count and mixture modeling. IEEE T. Pattern Anal. 37, 307–320 (2015).
- (31) S. Fortunato, Community detection in graphs. Phys. Rep. 486, 75–174 (2010).
- (32) E. M. Airoldi, D. M. Blei, S. E. Fienberg, E. P. Xing, Mixed membership stochastic blockmodels. J. Mach. Learn. Res. 9, 1981–2014 (2008).
- (33) B. Ball, B. Karrer, M. E. J. Newman, Efficient and principled method for detecting communities in networks. Phys. Rev. E 84, 036103 (2011).
- (34) M. E. J. Newman, M. Girvan, Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004).
- (35) R. Guimerà, M. Sales-Pardo, L. A. N. Amaral, Modularity from fluctuations in random graphs and complex networks. Phys. Rev. E 70, 025101 (2004).
- (36) A. Lancichinetti, S. Fortunato, Limits of modularity maximization in community detection. Phys. Rev. E 84, 066122 (2011).
- (37) P. W. Holland, K. B. Laskey, S. Leinhardt, Stochastic blockmodels: First steps. Soc. Networks 5, 109–137 (1983).
- (38) B. Karrer, M. E. J. Newman, Stochastic blockmodels and community structure in networks. Phys. Rev. E 83, 016107 (2011).
- (39) E. M. Airoldi, D. M. Blei, E. A. Erosheva, S. E. Fienberg, Eds., Handbook of Mixed Membership Models and Their Applications (CRC Press, Boca Raton, FL, 2014).
- (40) T. P. Peixoto, Hierarchical block structures and high-resolution model selection in large networks. Phys. Rev. X 4, 011047 (2014).
- (41) T. P. Peixoto, Model selection and hypothesis testing for large-scale network models with overlapping groups. Phys. Rev. X 5 (2015).
- (42) T. P. Peixoto, Nonparametric Bayesian inference of the microcanonical stochastic block model. Phys. Rev. E 95, 012317 (2017).
- (43) T. P. Peixoto, Parsimonious module inference in large networks. Phys. Rev. Lett. 110, 148701 (2013).
- (44) M. Gerlach, E. G. Altmann, Stochastic model for the vocabulary growth in natural languages. Phys. Rev. X 3, 021006 (2013).
- (45) J. Rissanen, Modeling by shortest data description. Automatica 14, 465–471 (1978).
- (46) R. Arun, V. Suresh, C. E. V. Madhavan, M. N. N. Murthy, On finding the natural number of topics with latent dirichlet allocation: Some observations, in Advances in Knowledge Discovery and Data Mining, M. J. Zaki, J. X. Yu, B. Ravindran, V. Pudi, Eds. (Springer, Berlin, Heidelberg, 2010), pp. 391–402.
- (47) J. Cao, T. Xia, J. Li, Y. Zhang, S. Tang, A density-based method for adaptive LDA model selection. Neurocomputing 72, 1775–1781 (2009).
- (48) A. Schofield, M. Magnusson, D. Mimno, Pulling out the stops: Rethinking stopword removal for topic models, Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, Valencia, Spain, 3 to 7 April 2017, vol. 2, pp. 432–436.
- (49) A. Decelle, F. Krzakala, C. Moore, L. Zdeborová, Inference and phase transitions in the detection of modules in sparse networks. Phys. Rev. Lett. 107, 065701 (2011).
- (50) D. Hu, P. Ronhovde, Z. Nussinov, Phase transitions in random Potts systems and the community detection problem: Spin-glass type and dynamic perspectives. Philos. Mag. 92, 406–445 (2012).
- (51) T. P. Peixoto, Inferring the mesoscale structure of layered, edge-valued, and time-varying networks. Phys. Rev. E 92, 042807 (2015).
- (52) M. E. J. Newman, A. Clauset, Structure and inference in annotated networks. Nat. Commun. 7, 11863 (2016).
- (53) D. Hric, T. P. Peixoto, S. Fortunato, Network structure, metadata, and the prediction of missing nodes and annotations. Phys. Rev. X 6, 031038 (2016).
- (54) O. T. Courtney, G. Bianconi, Dense power-law networks and simplicial complexes. Phys. Rev. E 97, 052303 (2018).
- (55) R. N. Mantegna, S. V. Buldyrev, A. L. Goldberger, S. Havlin, C. K. Peng, M. Simons, H. E. Stanley, Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73, 3169–3172 (1994).
- (56) R. E. Kass, A. E. Raftery, Bayes factors. J. Am. Stat. Assoc. 90, 773–795 (1995).
- (57) T. Vallès-Català, T. P. Peixoto, R. Guimerà, M. Sales-Pardo, Consistencies and inconsistencies between model selection and link prediction in networks. Phys. Rev. E 97, 026316 (2018).
- (58) H. M. Wallach, D. Mimno, A. McCallum, Rethinking LDA: Why Priors Matter, in Advances in Neural Information Processing Systems 22 (NIPS 2009), Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, A. Culotta, Eds. (Curran Associates, Inc., 2009), pp. 1973–1981.
- (59) A. Asuncion, M. Welling, P. Smyth, Y. W. Teh, On smoothing and inference for topic models, Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence - UAI’09, Montreal, Canada, 18 to 21 June 2009, pp. 27–34.
- (60) E. G. Altmann, J. B. Pierrehumbert, A. E. Motter, Niche as a determinant of word fate in online groups. PLOS One 6, e19009 (2011).
- (61) M. Gerlach, thesis, Technical University Dresden, Dresden, Germany (2016).
I Marginal likelihood of the SBM
i.1 Noninformative priors
For the labeled network considered in the main text, section Community detection: The hierarchical SBM, Eq. (4), we have
If we now make a noninformative choice for the priors,
we can compute the integrated marginal likelihood as
i.2 Equivalence with microcanonical model
As mentioned in the main text, Eq. (7) can be decomposed as
where is the total number of edges between groups and (we used the shorthand and ). is the probability of a labelled graph where the labelled degrees and edge counts between groups are constrained to specific values. This can be seen by writing
being the number of configurations (i.e. half-edge pairings) that are compatible with the constraints, and
is the number of configurations that correspond to the same labelled graph . is the uniform prior distribution of the labelled degrees constrained by the edge counts , since is the number of ways to distribute indistinguishable items into distinguishable bins. Furthermore, is the prior distribution of edge counts, given by a mixture of independent geometric distributions with average .
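The counting argument used for the uniform degree prior, namely the number of ways to distribute indistinguishable items into distinguishable bins, is the multiset coefficient. A minimal sketch follows, using the standard "stars and bars" identity; the function name `multiset_coefficient` and the argument names are chosen here for illustration.

```python
from math import comb


def multiset_coefficient(n_bins, n_items):
    """Number of ways to place n_items indistinguishable items
    into n_bins distinguishable bins (stars and bars)."""
    if n_bins == 0:
        # With no bins, only the empty allocation is possible.
        return 1 if n_items == 0 else 0
    return comb(n_bins + n_items - 1, n_items)


def log_uniform_prior(n_bins, n_items):
    """Log-probability of one specific allocation under a uniform prior."""
    from math import log
    return -log(multiset_coefficient(n_bins, n_items))


ways = multiset_coefficient(3, 2)  # e.g. 2 half-edges spread over 3 nodes
```

A uniform prior over such allocations then assigns each one a probability equal to the inverse of this count.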
i.3 Labelled degrees and overlapping partitions
As described in the main text, section Community detection: The hierarchical SBM, Eq. (13), the distribution of labeled degrees is given by
where the overlapping partition is distributed according to
Here, corresponds to a specific set of groups, i.e. a mixture, of size . The distribution above means that we first sample the frequency of mixture sizes from the distribution
where is the maximum overlap size (typically , unless we want to force nonoverlapping partitions with ). Given the frequencies, the mixture sizes are sampled uniformly on each node
We now consider the nodes with a given value of separately, and we put each one of them in a specific mixture of size . We do so by first sampling the frequencies in each mixture uniformly
and then we sample the mixtures themselves, conditioned on the frequencies,
The labeled degree sequence is sampled conditioned on this overlapping partition and also on the frequency of degrees inside each mixture ,
Here, is the sum of the degrees with label in mixture , which is sampled uniformly according to
where is the number of occupied mixtures that contain component . Given the degree sums, the frequency of degrees is sampled according to
where is the number of partitions of the integer into exactly parts, which can be pre-computed via the recurrence
with the boundary conditions and if or , or alternatively via the relation
where is the number of partitions of into at most parts, and using accurate asymptotic approximations for (see Ref. peixoto_nonparametric_2017 ). Finally, having sampled the frequencies, we sample the labeled degree sequence uniformly in each mixture
We refer to Ref. peixoto.2015 for further details of the above distribution.
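The restricted partition numbers that enter the degree-frequency prior can be pre-computed with a short memoized recursion. The sketch below relies on two textbook identities, which may not match the exact form of the recurrence used by the authors: the number of partitions of m into at most n parts satisfies q(m, n) = q(m, n − 1) + q(m − n, n), and the number of partitions of m into exactly n parts equals q(m − n, n). The function names are hypothetical, and exact integer arithmetic is used instead of the asymptotic approximations mentioned in the text.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def partitions_at_most(m, n):
    """Number of partitions of the integer m into at most n parts."""
    if m == 0:
        return 1  # the empty partition
    if m < 0 or n <= 0:
        return 0
    # Either no part uses the n-th slot, or every one of n parts is reduced by one.
    return partitions_at_most(m, n - 1) + partitions_at_most(m - n, n)


def partitions_exactly(m, n):
    """Number of partitions of the integer m into exactly n parts."""
    if n <= 0 or m < n:
        return 1 if (m == 0 and n == 0) else 0
    # Remove one unit from each of the n parts: what is left has at most n parts.
    return partitions_at_most(m - n, n)


total_partitions_of_5 = sum(partitions_exactly(5, k) for k in range(1, 6))
```

For large arguments the recursion depth and table size grow quickly, which is presumably why asymptotic approximations are preferred in practice.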
Ii Artificial corpora drawn from LDA
ii.1 Drawing artificial documents from LDA
We specify and , i.e. the hyperparameters used to generate the artificial corpus (note that the hyperparameters used in the inference with LDA can be different) and fixing , , , and proceed in the following way:
For each topic :
Draw the word-topic distribution (frequencies of words conditioned on the topic ) from a -dimensional Dirichlet:
For each document :
Draw the topic-document distribution (frequencies of topics conditioned on the doc ) from a -dimensional Dirichlet:
For each token ( is the length of each document) in document :
Draw a topic from the categorical
Draw a word-type from the categorical
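The steps above can be sketched with the Python standard library alone. This is an illustrative implementation under stated assumptions, not the code used for the paper: the function names are hypothetical, the hyperparameters are passed as explicit vectors `alpha` (one entry per topic) and `beta` (one entry per word-type), all documents share the same length, and Dirichlet samples are obtained by normalizing independent gamma draws.

```python
import random


def sample_dirichlet(params, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in params]
        total = sum(draws)
        if total > 0.0:  # guard against underflow for very small parameters
            return [g / total for g in draws]


def generate_lda_corpus(n_docs, doc_length, alpha, beta, seed=0):
    """Return a list of documents; each token is a (word_type, topic) pair."""
    rng = random.Random(seed)
    n_topics, n_words = len(alpha), len(beta)
    # For each topic: word-topic distribution from a V-dimensional Dirichlet.
    phi = [sample_dirichlet(beta, rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # For each document: topic-document distribution from a K-dimensional Dirichlet.
        theta = sample_dirichlet(alpha, rng)
        tokens = []
        for _ in range(doc_length):
            # For each token: first a topic, then a word-type given that topic.
            topic = rng.choices(range(n_topics), weights=theta)[0]
            word = rng.choices(range(n_words), weights=phi[topic])[0]
            tokens.append((word, topic))
        corpus.append(tokens)
    return corpus


corpus = generate_lda_corpus(n_docs=3, doc_length=5, alpha=[1.0, 1.0], beta=[0.5] * 4)
```

Keeping the topic label alongside each token gives direct access to the "true" labels of the generative process, which is what the comparison of description lengths requires.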
ii.2 Inference of corpora drawn from LDA
When we draw artificial corpora we obtain the labeled word-document counts , i.e. the “true” labels from the generative process of LDA as described above. In the following we describe how to obtain the description length of LDA and SBM when assigning the “true” labels as the result of the inference. In this way, we obtain the best possible inference results from each method. We can, therefore, compare the two models conceptually and avoid the issue of which particular numerical implementation was used.
ii.2.1 Inference with LDA
In the inference with LDA we simply need the word-topic, , the document-topic counts, , and the word-document matrix and use them to obtain the description length for LDA.
Note that for the inference we also have to specify the hyperparameters used in the inference, and . One approach is to consider the true prior (the same hyperparameter we used to generate the corpus) such that and . In general, however, the data is not generated from LDA such that it is unclear which is the best choice of hyperparameters for inference. Therefore, we also consider the case of a noninformative prior in which and .
ii.2.2 Inference with SBM
For the stochastic block model (SBM) we consider texts as a network in which the nodes consist of documents and words and the strength of the edge between them is given by the number of occurrences of the word in the document, yielding a bipartite multigraph. We consider the case of a degree-corrected, overlapping SBM with only one layer in the hierarchy.
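As an illustration of this representation, the following sketch builds the bipartite multigraph from tokenized documents using only the standard library. The function name and the node labels `("doc", i)` and `("word", w)` are hypothetical choices made here; an actual analysis would pass such an edge list to a network library such as graph-tool.

```python
from collections import Counter


def build_word_document_multigraph(docs):
    """Build a bipartite multigraph from a list of tokenized documents.

    Returns the document-nodes, the word-nodes, and a Counter mapping each
    (document-node, word-node) pair to its edge multiplicity.
    """
    edge_counts = Counter()
    for doc_id, tokens in enumerate(docs):
        for word in tokens:
            # One edge per word-token; repeated tokens increase the multiplicity.
            edge_counts[(("doc", doc_id), ("word", word))] += 1
    doc_nodes = [("doc", doc_id) for doc_id in range(len(docs))]
    word_nodes = sorted({word_node for _, word_node in edge_counts})
    return doc_nodes, word_nodes, edge_counts


docs = [["a", "b", "a"], ["b"]]
doc_nodes, word_nodes, edge_counts = build_word_document_multigraph(docs)
```

The total edge multiplicity equals the number of word-tokens in the corpus, and the degree of a word-node equals the total frequency of that word-type.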
No clustering of documents
For the SBM we use a particular parametrization starting from the equivalence between the degree-corrected SBM ball.2011 and probabilistic latent semantic indexing (pLSI) hofmann.1999 , as described in the main text, section Topic models: pLSI and LDA. Each document-node is put in its own group and the word-nodes are clustered into word-groups. The latter correspond to the topics in LDA (with possible mixtures among those groups) thus giving us a total of groups.
Clustering of documents
Instead of putting each document in a separate group we cluster the documents into groups as well such that we have groups in total. Note that this corresponds to a completely symmetric clustering of the groups in which we choose the indices such that are groups for the document-nodes and are word-nodes. For a given word-token of word-type appearing in document labeled in topic , we label the two half-edges as (the half-edge on the document-node) and (the half-edge on the word-node).
Iii Varying the hyperparameters and number of topics
In Fig. 4 of the main text we compare LDA and hSBM for corpora drawn from LDA for the case and . In Figs. (S1, S2, S3) we show that these results hold under very general conditions by varying i) the values of the scalar hyperparameters; ii) the number of topics; and iii) the base measure of the vector-valued hyperparameters and (symmetric or asymmetric following the approach in Ref. wallach.2009 ). While the individual curves for the description length of the different models look different, the qualitative behavior shown in Fig. 4 of the main text remains the same. In all cases, the hSBM performs better than LDA with noninformative priors; and only in a few cases does the hSBM have a larger description length than LDA with the true hyperparameters which actually generated the data. Note that the latter case constitutes an exception because i) the generating hyperparameters are unknown in practice; and ii) as the hyperparameters deviate from the noninformative choice, the LDA description length computed ceases to be complete, becoming only a lower bound to the complete one, which involves integration over the hyperparameters (and is thus intractable).
Iv Word-document networks are not sparse
Typically, in community detection it is assumed that networks are sparse, i.e. the number of edges scales linearly with the number of nodes , i.e. fortunato.2010 . In Fig. S4 we observe a different scaling for word-document networks, i.e. a superlinear scaling with . This is a direct result of the sublinear growth of the number of different words with the total number of words in the presence of heavy-tailed word-frequency distributions (known as Heaps’ law in quantitative linguistics altmann.book2016 ), which leads to the superlinear growth of the number of edges with the number of nodes. This means that the density, i.e. the average number of edges per node, increases as more documents are added to the corpus.
V Empirical word-frequency distribution
In the comparison of hSBM and LDA for corpora drawn from the generative process of LDA, we parametrize the word-topic hyperparameter as for with for . We use an empirical word-frequency distribution as measured from all articles in the Wikipedia corpus contained in the categories “Scientific Disciplines”. In Fig. S5 we show the empirically measured rank-frequency distribution for different words and word-tokens in total. We observe that this distribution is characterized by a heavy-tailed distribution with two power-laws. In Ref. gerlach.2013 it has been shown that virtually any collection of documents follows such a distribution of word frequencies.