A high-reproducibility and high-accuracy method for automated topic classification

02/03/2014 ∙ by Andrea Lancichinetti, et al.

Much of human knowledge sits in large databases of unstructured text. Leveraging this knowledge requires algorithms that extract and record metadata on unstructured text documents. Assigning topics to documents will enable intelligent search, statistical characterization, and meaningful classification. Latent Dirichlet allocation (LDA) is the state of the art in topic classification. Here, we perform a systematic theoretical and numerical analysis demonstrating that current optimization techniques for LDA often yield results that fail to infer the most suitable model parameters. Adapting approaches from community detection in networks, we propose a new algorithm that displays high reproducibility and high accuracy, as well as high computational efficiency. We apply it to a large set of documents in the English Wikipedia and reveal its hierarchical structure. Our algorithm promises to make "big data" text analysis systems more reliable.


I. The likelihood landscape is always rough

Most practitioners know that a very large number of topic models can fit the same data almost equally well: this poses a serious problem for an algorithm’s stability. We start investigating this problem by considering an elementary test-case, which we denote the language corpus. “Toy” models are helpful because they can be analytically treated and provide useful insights for more realistic and complex cases.

In our language corpus, topics are fully disambiguated languages – that is, no similar words are used across languages – and each document is written entirely in a single language, thus creating the simplest possible test case. As is assumed by the LDA generative model, we use a two-step process to create synthetic documents. In the first step, we select a language with probability , which corresponds to a Dirichlet distribution with very small concentration parameters (see SI). Given the language, in the second step, we randomly sample a given number of words from that language's vocabulary into the document. For the sake of simplicity, we restrict the vocabulary of each language to a set of unique equiprobable words. Thus, an "English" document in the language corpus is just a "bag" of English words. Note that every document uses words from a single language.
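As a concrete illustration, the two-step generative process above can be sketched in a few lines of Python (the corpus sizes, language names, and probabilities below are illustrative choices, not values from the paper):

```python
import random

def make_language_corpus(n_docs, doc_len, languages, lang_probs, vocab_size, seed=0):
    """Generate the toy 'language corpus': each document is a bag of words
    drawn from a single language's vocabulary of equiprobable words.
    Vocabularies are disjoint, so topics are fully disambiguated."""
    rng = random.Random(seed)
    # Disjoint vocabularies: word "lang:i" belongs only to that language.
    vocab = {lang: [f"{lang}:{i}" for i in range(vocab_size)] for lang in languages}
    corpus = []
    for _ in range(n_docs):
        # Step 1: pick a language (a Dirichlet prior with very small
        # concentration parameters concentrates all mass on one topic per document).
        lang = rng.choices(languages, weights=lang_probs)[0]
        # Step 2: sample words uniformly from that language's vocabulary.
        corpus.append([rng.choice(vocab[lang]) for _ in range(doc_len)])
    return corpus

docs = make_language_corpus(100, 50, ["en", "fr", "es"], [0.6, 0.2, 0.2], 1000)
```

Every generated document is, by construction, a bag of words from exactly one language.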

Let us be more concrete and consider a dataset with three languages and a distinct number of documents in each language. Consider also that there are more documents in English than in the other languages. An implementation of a topic model algorithm could correctly infer the three languages as topics, or alternatively split "English" into two "dialects" and merge the two other languages (see Fig. 1). This alternative model is wrong on two counts: it splits English into two parts while merging two different languages. Naïvely, one would expect the alternative model to have a smaller likelihood than the correct generative model. However, this is not always the case (Fig. 1C) for PLSA Hofmann (1999) and the symmetric version of LDA Blei et al. (2003a); Griffiths and Steyvers (2004) (symmetric LDA implements a model where each topic can be chosen a priori with equal probability). In fact, dividing (or overfitting) the "English" documents in the corpus yields an increase of the likelihood. As we show in the SI text, through this process of overfitting the log-likelihood increases by between and per English document, depending on the average length of the documents. Analogously, merging (or underfitting) the "French" and "Spanish" documents results in a decrease of the log-likelihood of per "French" and "Spanish" document, where is the average length of the documents. Thus, there is a critical fraction of "French" and "Spanish" documents below which the alternative model will have a greater likelihood than the correct generative model (Fig. 1).

Note that this theoretical limit on a likelihood's ability to identify the correct generative model is not restricted to topic modeling. Indeed, it also holds for non-negative matrix factorization Lee and Seung (1999) with KL-divergence, because of its equivalence to PLSA Gaussier and Goutte (2005). (Non-negative matrix factorization is a popular low-rank matrix approximation algorithm which has found countless applications, for example, in face recognition, text mining, and high-dimensional data analysis in general.)

However, the critical fraction of underfitted documents depends on the length of the documents in the corpus, and decreases as . In fact, by increasing the documents' length, or by using asymmetric LDA Wallach et al. (2009) rather than symmetric LDA, one can show for the language corpus that the generative model always has a higher likelihood than the alternative model (see SI). In this case, the ratio of the log-likelihoods of the alternative model and the generative model can be expressed as,

(1)

where and are the likelihoods of the alternative and generative models, respectively, and is the fraction of underfitted documents ("French" and "Spanish," in the example).

Even though the generative model has a greater likelihood, the ratio on the left-hand side of Eq. (1) can be arbitrarily close to 1. The reason is that the ratio is independent of the number of documents in the corpus and of the length of the documents. Thus, even with an infinite number of infinitely long documents, the generative model does not “tower” above other models in the likelihood landscape. The consequences of this fact are important because the number of alternative latent models that can be defined is extremely large – with a vocabulary of 1000 words per language, the number of alternative models is on the order of (see SI).

In conclusion, we find that only the full Bayesian model Wallach et al. (2009) can potentially detect the correct generative model regardless of the documents’ length, and an extremely large number of models are very close to the correct one in terms of likelihood. In the next section, we will show how current optimization techniques are affected by this problem.

II. Numerical analysis of the language test

Although the language corpus is a highly idealized case, it provides an example where many competing models have very similar likelihoods, and the overwhelming majority of those models have topics of more equal size. Indeed, because of the high degeneracy of the likelihood landscape, standard optimization techniques might not find the model with the highest likelihood even in such simple cases, and they might yield different models across different runs, as has been previously reported Wallach et al. (2009); Steyvers and Griffiths (2007).

Figure 2: Performance of the algorithms on datasets of documents written in different languages. To write each document, we first decide which language to use, and we then write 100 words sampled with the actual word frequencies of that language (we use a limited vocabulary of the 1000 most frequent words). For the sake of simplicity, all words have been disambiguated. A. The accuracy is measured in terms of the Best Match similarity (see Methods) between the fitted model and the generative model. Reproducibility is the similarity between fitted models obtained in different runs. B. The pie charts show the topics typically found by standard LDA optimization. Each slice is a topic found by the algorithm and is divided into colored strips. Each color represents a different language. The area of the strips in each slice is proportional to the language probability in that topic. C. Reproducibility and accuracy for this test as we tune the number of documents (we show median values and 25th and 75th percentiles). We input the correct number of topics into LDA and PLSA. The dashed lines indicate the accuracy we would obtain by overfitting one language and underfitting two others (top line), or by overfitting two languages and underfitting four (bottom line). LDA(r) and LDA(s) refer to different initializations (random or seeded) of the optimization technique.

Moreover, since small topics are the hardest to resolve (see SI, Sec. 1.6), standard algorithms might require the assumption that there are more topics in the corpus than in reality because the “extra topics” are needed to resolve small topics.

We test these hypotheses numerically on two synthetic language corpora (Fig. 2). For the first corpus, which we denote egalitarian, each of ten languages comprises an equal number of documents. For the second corpus, which we denote oligarchic, 60% of the documents belong to 20% of the languages. Specifically, we group the languages into two classes. The first class comprises two languages, each with 30% of the documents in the corpus. The second class comprises eight languages, each with 5% of the documents. For both corpora, we used the real-world word frequencies of the languages http://invokeit.wordpress.com/frequency-word-lists/ (2013).

In order to assess the validity of the models inferred by the algorithms under study, we calculated both the accuracy and the reproducibility of their outputs. We use a measure of normalized similarity (see Methods) to compare the inferred model to the generative model (accuracy) and to compare the inferred models from two runs of the algorithm (reproducibility).

In the synthetic corpora that we consider, topics are sufficiently equal in size and documents are sufficiently long that both datasets attain their highest likelihood at the generative model, even for PLSA and symmetric LDA. Additionally, we run the standard algorithms Hofmann (1999); Blei et al. (2003a) with the number of topics of the generative model (as we show in the SI, estimating the number of topics via model selection would lead to an over-estimation of the number of topics). We find that PLSA and the standard optimization algorithm implemented with LDA (variational inference) Blei et al. (2003a) are unable to find the global maximum of the likelihood landscape (see Fig. 2). In the SI, we also show the results for asymmetric LDA implementing Gibbs sampling Wallach et al. (2009), which, interestingly, performs well only in the egalitarian case.

Our results thus show that it is highly inefficient to explore the likelihood landscape blindly, either by starting from random initial conditions or by randomly seeding the topics using a sample of documents (Fig. 2), as is the current standard practice.

III. A network approach

In order to improve on the performance of current methods, we surmise that it will be useful to build some intuition about where to search in the likelihood landscape. We start by noting that a corpus can be viewed as a bipartite network of words and documents Dhillon (2001), and, using this insight, we construct a network of words which are connected if they co-appear in a document Zhou et al. (2007).

In the language corpora, finding the languages is as simple as finding the connected components of this graph. In general, however, finding topics will be more complex because of words shared by topics. We propose a new approach comprising three steps, which we denote TopicMapping. In the first step, we filter out words that are unlikely to provide a separation between topics because they are used indiscriminately across topics. Specifically, we compare the dot-product similarity Tan et al. (2005) of each pair of words (which co-appear in at least one document) with the expectation for a null model in which words are randomly shuffled across documents. For the null model, the distribution of dot-product similarities of pairs of words is well approximated by a Poisson distribution whose average depends on the frequencies of the words (see SI). We set a p-value of for accepting the significance of the similarity between pairs of words.
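A minimal sketch of this filtering step, assuming binary word-document occurrences (so the dot product of two words is simply the number of documents in which they co-appear) and an illustrative p-value threshold of 0.05; the paper's actual threshold and similarity weighting may differ:

```python
from collections import Counter
from itertools import combinations
from math import exp

def poisson_sf(k, mu):
    """P(X >= k) for X ~ Poisson(mu), via the complementary cdf."""
    term, cdf = exp(-mu), 0.0
    for i in range(k):
        cdf += term
        term *= mu / (i + 1)
    return max(0.0, 1.0 - cdf)

def significant_pairs(docs, p_threshold=0.05):
    """Keep word pairs whose document co-occurrence count is significantly
    above a null model in which words are shuffled across documents.
    Under the null, the co-occurrence of words a and b is approximately
    Poisson with mean d_a * d_b / D (d = document frequency, D = #docs)."""
    D = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    cooc = Counter()
    for doc in docs:
        for a, b in combinations(sorted(set(doc)), 2):
            cooc[(a, b)] += 1
    edges = []
    for (a, b), k in cooc.items():
        mu = doc_freq[a] * doc_freq[b] / D
        if poisson_sf(k, mu) < p_threshold:
            edges.append((a, b))
    return edges
```

Word pairs that never co-occur are never candidates, and frequent-but-random co-occurrences are discarded by the Poisson test.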

In the second step, we cluster the filtered network of words using a clustering algorithm developed by Rosvall and Bergstrom (Infomap) Rosvall and Bergstrom (2008). Unlike standard topic modeling algorithms, the method does not require an estimate of the number of topics present in the corpus. We use the groups identified by the clustering algorithm as our initial guesses for the number and word composition of the topics. Because our clustering algorithm is exclusive – that is, each word can belong to only a single topic – we must then use a latent topic model which allows for non-exclusivity. Specifically, we locally optimize a PLSA-like likelihood in order to obtain our estimate of the model probabilities (see SI for more information).
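Once the word graph is filtered, the initial topics come from graph clustering. TopicMapping uses Infomap; as a simplified stand-in (which is exact in the fully disambiguated language corpus, where topics are connected components), one can group words by connectivity:

```python
def initial_topics(edges):
    """Group words of the filtered word graph into candidate topics.
    In the language corpus, connected components already recover the
    languages; TopicMapping itself uses the Infomap community-detection
    algorithm, for which this is a simplified stand-in."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        # Depth-first traversal of one component.
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node] - comp)
        seen |= comp
        components.append(comp)
    return components
```

The number of returned groups is the initial estimate of the number of topics; no such number is supplied by the user.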

In the third step, we can decide to refine our guess further by running asymmetric LDA likelihood optimization Blei et al. (2003a) using, as initial conditions, the model probabilities found in the previous step. In general, if the topics are not too heterogeneously distributed, the algorithm converges after only a few iterations, as our guess is generally very close to a likelihood maximum (we found only one case where more iterations were needed: the Wikipedia dataset, see Fig. 5). Figure 2 shows the excellent performance of the TopicMapping algorithm.

IV. A real world example

In order to test the validity of the TopicMapping algorithm and better compare its performance to standard LDA optimization methods, we next consider a real-world corpus comprising 23,838 documents obtained from Web of Science (WoS). Each document contains the title and the abstract of a paper published in one of six top journals from different disciplines (Geology, Astronomy, Mathematics, Biology, Psychology, Economics). We pre-processed the documents in the WoS corpus by using a stemming algorithm http://snowball.tartarus.org/algorithms/english/stemmer.html (2011) and removing a standard list of stop-words. Pre-processing yielded 106,143 unique words.

Figure 3: Performance of the algorithms on a real-world example. In the pie charts, each slice is a different topic found by the method and the colored areas are proportional to the probability of the corresponding journal given that topic: . The topic labels are the most frequent words in the topic. The "*" symbol is due to the stemming algorithm we used (porter2). A. Performance of standard LDA when we input the number of journals as the number of topics. Big topics are split and small ones are merged. B. Performance of LDA when we input the number of topics suggested by model selection. Small topics are now resolved, but big ones are split so that each topic is comparable in size. C. TopicMapping's performance. D. Topics found by TopicMapping in a corpus where we added an interdisciplinary journal, Science. We also show the most frequent affiliations of papers published in Science in each topic (bottom). The total number of topics found is 19, but only topics with probability bigger than are shown in the figure (9 topics).

We surmised a generative model in which each journal defines a topic and each document is assigned exclusively to the topic defined by the journal in which it was published. We then compare the topics inferred by symmetric LDA (variational inference) and by TopicMapping with the surmised generative model (Fig. 3). While TopicMapping has nearly perfect accuracy and reproducibility, standard LDA optimization performs significantly worse. When using the standard approach, LDA estimates that the corpus comprises 20 to 30 topics (see SI) and yields a reproducibility of only 55%. Even when letting LDA know that there are only six topics, the inferred models lump together papers from the smaller journals, yielding an accuracy and a reproducibility of 70%.

Adding an interdisciplinary journal (Science), we can see that TopicMapping assigns the majority of papers published in Science to the already found topics, but several new topics are identified. In terms of likelihood, TopicMapping yields a slightly better likelihood than standard LDA optimization, but only if we compare models with the same effective number of topics. A more detailed discussion on this point can be found in the SI.

V. Systematic analysis on synthetic data

As a final and more systematic evaluation of the accuracy and reproducibility of the different algorithms, we implement a comprehensive generative model, where documents choose a topic distribution from a Dirichlet distribution as proposed in the LDA model. We tune the difficulty in separating topics within the corpora by setting (1) the value of a parameter which determines both the extent to which documents mix topics, and the extent to which words are significantly used by different topics; and (2) the fraction of words which are generic, that is, contain no information about the topics (see Methods).

Fig. 4 shows our results for the synthetic corpora. We have also done a more systematic analysis (see SI), but the main conclusion is the same as for the language test: the generative model has the highest likelihood (topics are sufficiently equal in size), but the number of overfitting models is so large and they are so close in terms of likelihood, that the optimization technique requires help in exploring the right portion of the parameter space. Without the right initialization, we get lower accuracy and reproducibility, as well as equally sized topics and an overestimation of the number of topics (see SI).

The computational overhead of using TopicMapping, for obtaining an initial guess of the parameter values, is small and the algorithm can be easily parallelized. To demonstrate this fact, we applied TopicMapping to a sample of the English Wikipedia with more than a million documents and almost a billion words (see Fig. 5).

VI. Conclusions

In the ten years since its introduction, there has been surprisingly little research on the limitations of LDA optimization techniques for inferring topic models Wallach et al. (2009). We are able to obtain a remarkable improvement in method validity by using a much simpler objective function Rosvall and Bergstrom (2008) to obtain an educated guess of the parameter values of the latent generative model. This guess is obtained exclusively from word-word correlations, whereas word-document correlations are accounted for later, in refining the initial guess. The algorithm is related to some recent work on spectral algorithms Anandkumar et al. (2012); Arora et al. (2013). However, here we propose a practical implementation which makes no assumption about topic separability or the number of topics, as most spectral algorithms do. Interestingly, TopicMapping provides only slight improvements in terms of likelihood (because of the high degeneracy of the likelihood landscape), but nevertheless yields much better accuracy and reproducibility.

Figure 4: A. Creating synthetic corpora using the generative model. For each document, is sampled from a Dirichlet distribution whose hyperparameters are defined as: , where is the number of topics, is the probability (i.e., the size) of topic , and is a parameter which tunes how mixed documents are: smaller values of yield a simpler model where documents make use of fewer topics. We also have a parameter to fix the fraction of generic words, and we implement a similar method for dividing words into specific and generic ones (see Methods). Once the latent topic structure is chosen, we write a corpus drawing words with probabilities given by the mixture of topics. B. The performance of the topic modeling algorithms on synthetic corpora. In all our tests, we generate a corpus of documents of words each, and our vocabulary is made of unique equiprobable words. We set the number of topics and we input this number into LDA and PLSA. "Equally sized" means all the topics have equal probability , while in the "unequally sized" case, 4 large topics have probability each, while the other 16 topics have probability . LDA(s) and LDA(r) refer to seeded and random initialization for LDA (variational inference). The plots show the median values as well as the 25th and 75th percentiles.

VII. Methods

VII.1 Comparing models

Here, we describe the algorithm for measuring the similarity between two models, and . Both topic models are described by two probability distributions: and . Given a document, we would like to compare two distributions: and . The problem is not trivial because the topics are not labeled: the numbers we use to identify the topics in each model are just one of the possible permutations of their labels. Documents, by contrast, carry the same labels in both models. For this reason, it is easy to quantify the similarity of topics and from different models if we look at which documents are in these topics: we can use Bayes' theorem to compute and , and compare these two probability distributions. We propose to measure the distance between and as the norm (or Manhattan distance): . Since we are dealing with probability distributions, . We can then define the normalized similarity between topics and as: .

To get a global measure of how similar one model is to the other, we compare each topic with all topics and pick the topic which is most similar to : this is the similarity we get best-matching model against : , where BM stands for Best Match, and the arrow indicates that each topic in looks for its best-matching topic in . Of course, we can make this similarity symmetric by averaging the measure with : .

Although this similarity is normalized between 0 and 1, it does not tell us how similar the two models are compared with what we could get from random topic assignments. For this reason, we also compute the average similarity , in which we randomly shuffle the document labels in model . Our null-model similarity is then defined as .

Eventually, we can define our measure of normalized similarity between the two models as:

(2)

An analogous similarity score can be defined for words using instead of .
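The Best Match comparison can be sketched as follows. Each model is represented here as a mapping from topic labels to the distribution over documents given that topic; since Eq. (2) is elided above, the normalization is assumed to take the common form (BM − BM_null) / (1 − BM_null):

```python
import random

def topic_similarity(p1, p2):
    """1 - (L1 distance)/2 between two distributions over documents."""
    docs = set(p1) | set(p2)
    dist = sum(abs(p1.get(d, 0.0) - p2.get(d, 0.0)) for d in docs)
    return 1.0 - dist / 2.0

def best_match(model_x, model_y):
    """For each topic in X, take its most similar topic in Y; average."""
    sims = [max(topic_similarity(tx, ty) for ty in model_y.values())
            for tx in model_x.values()]
    return sum(sims) / len(sims)

def bm_symmetric(x, y):
    return 0.5 * (best_match(x, y) + best_match(y, x))

def normalized_similarity(x, y, n_shuffles=20, seed=0):
    """Compare BM similarity to a null model where document labels in y
    are shuffled; the exact normalization of the paper's Eq. (2) is
    assumed here to be (BM - BM_null) / (1 - BM_null)."""
    rng = random.Random(seed)
    docs = sorted({d for topic in y.values() for d in topic})
    null = 0.0
    for _ in range(n_shuffles):
        perm = docs[:]
        rng.shuffle(perm)
        mapping = dict(zip(docs, perm))
        y_shuf = {k: {mapping[d]: p for d, p in t.items()} for k, t in y.items()}
        null += bm_symmetric(x, y_shuf)
    null /= n_shuffles
    bm = bm_symmetric(x, y)
    return (bm - null) / (1.0 - null) if null < 1.0 else 1.0
```

Because topics are matched by their document distributions, the measure is invariant under relabeling of the topics, as required.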

Figure 5: Topics found by TopicMapping on a large sample of Wikipedia ( million documents). Here, we show the topics found by TopicMapping after a single LDA iteration: indeed, this dataset represents an example where optimizing LDA until convergence gives rather different results (see SI, Sec. 1.6 and Sec. 10). We highlight the top topics that account for of the total documents: these are just a handful of topics which are very easy to interpret (left). The inset shows the topics we find on the sub-corpus of documents assigned to the main topic "General Knowledge".

VII.2 Generating synthetic corpora

The algorithm we used to generate synthetic datasets relies on the generative model assumed by LDA. First, we specify the number of documents and the number of words in each document, . For simplicity, we set the same number of words for each document, . Next, we set the number of topics and the probability distribution of each topic, . Finally, we specify the number of words in our vocabulary, , and the probability distribution of each word, . For the sake of simplicity, we used uniform probabilities for , although the same model can be used for arbitrary probability distributions. All these parameters define the size of the corpus; the other aspect to consider is how mixed documents are across topics and topics are across words. This is specified by one hyperparameter , whose use will be made clear in the following. The algorithm works in the following steps:

  1. For each document , we decide the probability that this document will make use of each topic: . These probabilities are sampled from the Dirichlet distribution with parameters: . The definition is such that will be used in the overall corpus with probability , while the factor is a normalization which assures that we get for equiprobable topics. In this particular case, means that documents are assigned to topics by drawing the probabilities uniformly at random (see SI for more on the Dirichlet distribution).

  2. For each topic, we need to define a probability distribution over words: . For this purpose, we first compute for each word, sampling the same Dirichlet distribution as before (). Second, we get from Bayes’ theorem: .

  3. We now have all we need to generate the corpus. Every word in document can be drawn by, first, selecting a topic with probability and, second, choosing a word with probability .

Small values of the parameter will yield "easy" corpora where documents are mostly about one single topic and words are specific to a single topic (Fig. 4). For simplicity, we keep constant for all documents and words. However, it is highly unrealistic that all words are mostly used in a single topic, since every realistic corpus contains generic words. To account for this, we divide the words into two classes, specific and generic: for the former class, we use the same as above, while for generic words we set . The fraction of generic words is a second parameter we set.
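The steps above can be sketched with the standard library alone (Dirichlet samples via normalized Gamma draws); the parametrization of the hyperparameters as alpha * K * p_t, and the uniform word probabilities, are assumptions consistent with the description above but not guaranteed to match the paper's exact notation:

```python
import random

def dirichlet(alphas, rng):
    """Sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def synthetic_corpus(n_docs, doc_len, topic_probs, vocab_size, alpha, seed=0):
    """Generate an LDA-style corpus: smaller alpha concentrates each
    document (and each word) on fewer topics."""
    rng = random.Random(seed)
    K = len(topic_probs)
    hyper = [alpha * K * p for p in topic_probs]
    # Step 2: sample p(topic | word) per word, then invert with Bayes' rule
    # (uniform word probabilities p(w) = 1/vocab_size are assumed).
    p_topic_given_word = [dirichlet(hyper, rng) for _ in range(vocab_size)]
    p_word_given_topic = []
    for t in range(K):
        unnorm = [p_topic_given_word[w][t] / vocab_size for w in range(vocab_size)]
        z = sum(unnorm)
        p_word_given_topic.append([u / z for u in unnorm])
    corpus = []
    for _ in range(n_docs):
        # Step 1: document-topic mixture from the same Dirichlet family.
        theta = dirichlet(hyper, rng)
        doc = []
        for _ in range(doc_len):
            # Step 3: topic first, then a word from that topic.
            t = rng.choices(range(K), weights=theta)[0]
            doc.append(rng.choices(range(vocab_size),
                                   weights=p_word_given_topic[t])[0])
        corpus.append(doc)
    return corpus
```

Generic words can then be modeled by resampling their p(topic | word) with a much larger concentration parameter, as described above.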

VIII. Supplementary Materials

TopicMapping software, datasets and related codes are available at https://sites.google.com/site/andrealancichinetti/topicmodeling

Acknowledgements.
We thank Xiaohan Zeng, David Mertens and Adam Hockenberry for discussions.

Supplementary Information

Outline

The supplementary material is organized as follows:

  • Sec. S1 provides analytical insights on the likelihood landscape: in particular, we discuss the theoretical limitations of PLSA Hofmann (1999) and symmetric LDA Blei et al. (2003a) in finding the correct generative model. Also, in Sec. S1.6 we present an additional example which suggests why equally sized topics often have better likelihood.

  • Sec. S2 describes the network approach we take for topic modeling.

  • Sec. S3 shows that standard LDA tends to over-estimate the number of topics, and to find equally sized topics.

  • Sec. S4 presents a more detailed analysis of the synthetic datasets: among other things, we visualize the algorithms’ results.

  • Sec. S5 shows the performance of asymmetric LDA Wallach et al. (2009).

  • Sec. S6 discusses the hierarchical topics of the Web of Science dataset and the role of the p-value threshold for TopicMapping.

  • Sec. S7 presents the computational complexity of TopicMapping.

  • Sec. S8 shows the topics we found on a large sample of the English Wikipedia.

  • Sec. S9 shows that TopicMapping often provides models with higher likelihood, if we compare models with the same effective number of topics.

  • Sec. S10 is an appendix with some more technical information about the calculations presented in Sec. S1, some clarifications about Dirichlet distributions and measuring perplexity, and some technical information about the algorithms’ usage.

S1 Degeneracy problem in inferring the latent topic structure

S1.1 Introduction

Most topic model optimizations are known to be computationally hard problems Sontag and Roy (2011). However, not much is known about how the roughness of the likelihood landscape affects the algorithms’ performance.

We investigate this question by (i) defining a simple generative model, (ii) generating synthetic data accordingly, and (iii) measuring how well the algorithms recover the generative model (which is considered the "ground truth").

In the whole study, we examine different generative models. In this section, we study the simplest among them, the language test. For this model, we prove that, if the topics are not sufficiently equal in size, the model which maximizes the likelihood optimized by PLSA and symmetric LDA can differ from the generative model. More specifically, we show that it is possible to find an extremely large number of alternative models (with the same number of topics) which overfit some topics and underfit others, but have a better likelihood than the true generative model. Symmetric LDA is the version of LDA where the prior is assumed to be the same for all topics, and it is probably the most commonly used. For asymmetric LDA, which allows different priors, the correct generative model does have the highest likelihood in the language test. However, we show that the ratio between the log-likelihood of the generative model and that of the alternative models can be arbitrarily close to 1, even in the limit of an infinite number of documents and an infinite number of words per document. This implies that, even increasing the amount of available information, the likelihood of the generative model will not increase relative to the others. Below, we also give some quantitative estimates.

S1.2 The simplest generative model

Let us call the number of topics. Each topic has a vocabulary of words, and for the sake of simplicity we assume all the words are equiprobable. We also assume that the same word cannot be found in two different topics, so that we are actually dealing with fully disambiguated languages. Each document is then written entirely in one of the languages, by sampling random words from the corresponding vocabulary (we use the same number of words for each document). This should be a very simple problem, since there is neither mixing of words across topics nor of topics across documents.

Let us compute the log-likelihood, , of the generative model. The process of generating a document works in two steps. We first select a language with probability , and we then write a document with probability :

(S3)

Let us focus on the second part, . After we selected the language that we are going to use, every document has the same probability of being generated:

(S4)

We will also consider later. We stress that is the log-likelihood per document. The symbol serves as a reminder that the likelihood is computed given that we know which language is used for the document.
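Since every word in a language is equiprobable, the elided expression can be reconstructed up to notation. Writing $N_w$ for the per-language vocabulary size and $\ell$ for the number of words per document (symbols introduced here for concreteness, not necessarily the paper's notation):

```latex
\log P(\mathrm{doc} \mid \mathrm{lang})
  = \log \left( \frac{1}{N_w} \right)^{\!\ell}
  = -\,\ell \log N_w .
```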

Now, let us compute the log-likelihood of an alternative model, where one language (say English) is overfitted into two dialects, and two other languages (say French and Spanish) are merged. Fig. S1 illustrates how we construct the alternative model. French and Spanish become a single topic, in which each French and Spanish word is equiprobable. The English words, instead, are arbitrarily divided into two groups: the first English dialect makes use of words from the first group with probability and words from the second group with probability , and the second dialect has probabilities and for the two groups. We assume that the first group of words is more likely for the first dialect, i.e., , while the situation is reversed for the second dialect: . The general idea is that if a document, just by chance, uses words from the first group with higher probability, it might be fitted better by the first dialect: overfitting the noise improves the likelihood and, if the English portion of the corpus is big enough, this improvement might overcome what we lose by underfitting French and Spanish.
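This overfitting gain can be checked numerically. The sketch below draws the half-vocabulary word counts of synthetic "English" documents and scores each document under its better-fitting dialect, relative to the single uniform English topic; the dialect bias q = 0.6 and the document length are illustrative choices, not values from the paper:

```python
import random
from math import log

def dialect_gain(n_docs=20000, doc_len=20, q=0.6, seed=0):
    """Average log-likelihood gain per 'English' document from splitting
    the vocabulary into two halves and fitting two 'dialects' that use one
    half with probability q > 1/2. Words are equiprobable within each half,
    so only the half-counts n1, n2 matter."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_docs):
        # Under the generative model each word lands in either half w.p. 1/2.
        n1 = sum(rng.random() < 0.5 for _ in range(doc_len))
        n2 = doc_len - n1
        # Gain relative to the single uniform topic (prob 1/V per word):
        # dialect 1 gives each half-1 word prob 2q/V, each half-2 word
        # prob 2(1-q)/V; dialect 2 is the mirror image.
        g1 = n1 * log(2 * q) + n2 * log(2 * (1 - q))
        g2 = n2 * log(2 * q) + n1 * log(2 * (1 - q))
        total += max(g1, g2)  # each document picks its better dialect
    return total / n_docs
```

With these settings the average gain per document is positive: the split model fits the sampling noise of short documents, even though the generative model has a single English topic.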

Figure S1: Distribution over words for the topics in the generative model (A) and in the most likely model (B). We set words in each language’s vocabulary.

In Sec. S10.1, we prove that the difference between the log-likelihood per English document of the generative model and that of the alternative model is bigger than , regardless of the number of words per document, the size of the vocabulary, or the number of documents. More precisely, if , the difference can be even higher, . Calling the likelihood per English document in the alternative model, we have that:

(S5)

Fig. S2 shows the log-likelihood difference per English document as a function of .

Figure S2: Average difference in the log-likelihood per English document of the alternative model and the generative model as a function of the ratio , (words per document over vocabulary size). The function as well as the two dashed lines have been analytically computed in Sec. S10.1.

Keeping the same number of topics, the alternative model will pay some cost for underfitting Spanish and French. Since the two languages are merged, the size of the vocabulary doubles, and the log-likelihood per Spanish or French document is:

(S6)

Now, to compute the expected log-likelihood of the alternative model we also need to know how often we use the different languages. Let us call the fraction of English documents, and the fraction of documents written in Spanish or French (underfitted documents).

The average log-likelihood per document of the alternative model can then be written as:

(S7)

We recall that, so far, we have not considered the probability that each document picks a certain language. Symmetric and asymmetric LDA make different assumptions at this point, and we treat both in the next two sections.

s1.3 Symmetric LDA

PLSA does not account for the probability of picking a language in the likelihood. LDA does: the hyperparameters are a global set of parameters (one per topic) which tune the probability that each document makes use of each topic. In our case, each document is uniquely assigned to a language: therefore, for each document, there is one language which has probability 1 while all the other languages have probability 0. This corresponds to the limiting case where the proportionality factor is very small.

For symmetric LDA, however, all the hyperparameters are equal. This implies that, regardless of the actual sizes of the languages, the algorithm fits the data with a model in which all languages are equiprobable. Therefore:

(S8)

If the English portion of the corpus is big enough, the likelihood of the alternative model can be higher than that of the generative model. To be more concrete, let us consider an example. If and , in Sec. S10.1 we show that can be as high as . Let us consider the simplest case of just three topics. Setting the right-hand side of Eq. S8 to zero, we find that if the alternative model has a better likelihood. If the topics are not balanced enough, symmetric LDA cannot find the right generative model, despite the total absence of mixing. However, this critical value actually depends on , and, as it increases, the generative model will eventually attain a better likelihood. This case is treated in detail below.

s1.4 Asymmetric LDA

For asymmetric LDA, the average log-likelihood of the true model becomes:

(S9)

where is the entropy of the language probability distribution, .

For the sake of simplicity, let us assume that French and Spanish are equiprobable, as well as the two English dialects (see Sec. S10.1). For the alternative model:

(S10)

From Eq. S7, we finally get:

(S11)

Since , now the generative model actually has the highest likelihood: in principle, asymmetric LDA is always able to find the generative model. The ratio of the two log-likelihoods, if the documents are long enough, becomes:

(S12)

The same equation holds for symmetric and asymmetric LDA, as well as PLSA. Therefore, even if we had an infinite amount of information (an infinite number of documents and words per document), the ratio of the two likelihoods can be very close to 1.

s1.5 Finding the generative model in practice

The number of alternative models is huge. In Sec. S10.1, we show that, if each language has a vocabulary of size , we can find alternative models (this is a conservative estimate): assuming equiprobable topics, the relative difference in their log-likelihoods is small, as we can estimate from Eq. S12.

One might argue that, even if the relative difference of the log-likelihoods is small, the basin of attraction of the generative model could be very large, so that optimization algorithms might be very effective in finding it anyway. Fig. S3 shows the probability of finding the correct model for equiprobable languages and for the heterogeneous case (computed using variational inference Blei et al. (2003a)).

Figure S3: In this test, the corpus has 5000 documents of 100 words each, and the vocabulary of each language has 1000 equiprobable words. In the equally sized case, we consider 10 equiprobable languages, while in the heterogeneous case we consider a few likely languages and 8 less likely ones. A. Cumulative probability of the relative difference between the log-likelihood of the generative model and the one found by the algorithm. B. Scatter plot of the relative difference of the log-likelihood versus the accuracy of the algorithm (accuracy is the Best Match similarity of the two models, see main text). Clear clusters are visible according to how many languages are overfitted. Fig. 2 of the main paper supports the same conclusion after we remove the assumption that words are equiprobable.

s1.6 Model competition in hierarchical data

In the previous sections, we only discussed the difference in likelihood of the generative model and an alternative model with the same number of topics . In this section, we consider a similar test case for which, however, we fit the data with a model with topics.

The generative model we consider here is illustrated in Fig. S4: we have topics which have no words in common with any other topic, and one bigger topic, say English, which has two subtopics, say “music” and “science”, which share some words. Let us call the number of words in one of the English subtopics (music) which cannot be found in the other subtopic, the number of words which can only be found in the other subtopic (science), and the number of words in common between the two subtopics. We further assume that , that the subtopics are equiprobable, and that, given a subtopic, each word is equiprobable. Let us call the number of words in each non-English language, the fraction of English documents and the fraction of documents written in a different language (for the sake of simplicity, all languages but English are equiprobable).

This model should be fitted with topics. However, let us assume that we do not know the exact number of topics (as is usually the case) and we try to fit the data with fewer topics. In Fig. S4, we show two possible competing models: the first model correctly finds all the languages, while the second correctly finds the English subtopics but merges two languages.

With similar calculations as above, we can prove (see Sec. S10.2) that the first model has higher likelihood if:

(S13)

The previous equation holds for symmetric LDA, and also for asymmetric LDA if (the exact expression for asymmetric LDA can be found in Sec. S10.2). If , the first model is always better (there are no subtopics); if , one model is better than the other if it under-fits the smaller fraction of documents. In general, if English is used enough and , the second model fits the data better.

Let us consider a numerical example: consider , words and words ( total words in the English vocabulary). This means that of the English words are used by both subtopics. Eq. S13 tells us that we will split English into the two subtopics if there are two other topics to merge.

We believe that this is the basic reason why large journals such as Cell and the Astronomical Journal are split by standard LDA in the Web of Science dataset (see Sec. S9). In general, since real-world topics are likely to display a hierarchical structure similar to the one described here, we argue that heterogeneity in the topic distribution makes standard algorithms prone to finding subtopics of large topics before resolving smaller ones.

Figure S4: Generative model and two competing models. In this example, we have languages, but one language (English) is bigger than the others and has two subtopics (“music” and “science”). is the number of words in the English vocabulary which can only be found in the music subtopic, is the equivalent for science, whereas is the number of words common to the two subtopics. If many documents are written in English, Model 2 has a better likelihood than Model 1.

S2 A network approach to topic modeling

We give here a detailed description of TopicMapping. The method works in three steps.

First, we build a network of words, where links connect terms appearing in the same documents more often than we would expect by chance. Second, we define the topics as clusters of words in this network, using the Infomap method Rosvall and Bergstrom (2008), and then compute the probabilities by locally maximizing a PLSA-like likelihood. Finally, we refine the topics by further optimizing the (asymmetric) LDA likelihood via variational inference Blei et al. (2003a).

How to define the network.

A corpus can be seen as a weighted bipartite network of words and documents: every word is connected to all documents where the word appears. The weight of the link is the number of times the word is repeated in document .

From this network, we would like to define a unipartite network of words which have many documents in common. A very simple measure of similarity between any pair of words and is the dot product similarity:

(S14)

From this definition, it is clear that generic words, like “to” or “of”, will be strongly connected to many more specific words, bringing together terms related to otherwise distant semantic areas. A possible way to filter out generic words is to compare the corpus to a simple null model where all words are randomly shuffled among documents.

For this purpose, we need to consider the probability distribution of the dot product similarity defined in Eq. S14. We start by noting that, in the null model, each weight is a random variable which follows a hypergeometric distribution with parameters given by the total number of words in the document, the total number of occurrences of the word in the whole corpus, and the total number of words in the corpus. The mean is:

(S15)

Assuming a large enough number of documents, we can neglect the correlations among the variables and, from Eqs. S14 and S15, we get:

(S16)

Since is the sum of rare events (if ), its probability distribution can be well approximated by a Poisson distribution with average given by Eq. S16, as shown in Fig. S5.

Figure S5: Poisson approximation of the probability distribution of the dot product similarity of words and in a randomly shuffled corpus. The occurrences of the words are , , and there are documents of length drawn uniformly between and words.

Finally, our procedure to filter out the noise consists in fixing a p-value and, for all pairs of words which share at least one document, computing the difference between their dot product similarity and the corresponding quantile of the Poisson distribution. More precisely, the subtracted term is the largest non-significant dot product similarity:

(S17)

This difference, if positive, is the weight of the link between the two words. Fig. S6 shows an example.
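The whole filtering pipeline (Eqs. S14–S17) can be sketched as follows; the function names and the pure-Python Poisson quantile are our own illustrative choices, not the authors' implementation.

```python
import math
from collections import Counter, defaultdict

def poisson_quantile(mu, p_value):
    """Smallest k whose Poisson(mu) CDF reaches 1 - p_value:
    the largest non-significant dot product similarity (Eq. S17)."""
    cdf, pmf, k = 0.0, math.exp(-mu), 0
    while cdf + pmf < 1.0 - p_value:
        cdf += pmf
        k += 1
        pmf *= mu / k
    return k

def build_word_network(docs, p_value=0.05):
    """docs: list of documents, each a list of word tokens."""
    counts = [Counter(d) for d in docs]      # per-document word counts
    n = Counter()                             # total occurrences per word
    for c in counts:
        n.update(c)
    N = sum(n.values())                       # total words in the corpus
    sum_l2 = sum(sum(c.values()) ** 2 for c in counts)
    # dot product similarity (Eq. S14) for pairs sharing a document
    sim = defaultdict(int)
    for c in counts:
        ws = sorted(c)
        for i, wi in enumerate(ws):
            for wj in ws[i + 1:]:
                sim[(wi, wj)] += c[wi] * c[wj]
    edges = {}
    for (wi, wj), s in sim.items():
        # null-model mean of the similarity (Eq. S16), neglecting correlations
        mu = n[wi] * n[wj] * sum_l2 / N ** 2
        w = s - poisson_quantile(mu, p_value)
        if w > 0:                             # keep only significant links
            edges[(wi, wj)] = w
    return edges
```

On a toy corpus like the one in Fig. S6, pairs of topical words survive the filter, while links involving a generic word occurring uniformly across documents fall below the Poisson quantile and are removed.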

Figure S6: A. The corpus comprises six documents: three about biology and three about math. B. We build a network connecting words with weights equal to their dot product similarity. C. We filter out non-significant weights, using a p-value of . Running Infomap Rosvall and Bergstrom (2008) on this network, we get two clusters and two isolated words (“study” and “research”). D. We refine the word clusters using a topic model: the two isolated words can now be found in both topics.

Finding the topics as clusters of words and Local Likelihood Optimization.

Once the network is built, we detect clusters of highly connected nodes using the Infomap method Rosvall and Bergstrom (2008). This provides us with a hard partition of words, meaning that words can only belong to a single cluster.

We now discuss how we can compute the distributions and , given a partition of words.

We recall that, in the probabilistic model of how documents are generated, we assume that every word appearing in a document has been drawn from a certain topic. We are in the realm of the bag-of-words approximation, and we therefore discard any information about the internal structure of the documents. It is then reasonable to assume that every occurrence of a given word in the same document was generated by the same topic.

We identify this topic with the single module where Infomap places the word: in fact, since the partition is hard (no word can sit in two modules), there is no dependency on the documents. Therefore:

(S18)

It is also useful to introduce , which is the number of times topic was chosen and word was drawn.

So far, we have a model in which all words are very specific to topics and documents use many topics, which is probably far from being a good candidate generative model. The model can be substantially improved by optimizing the PLSA-like likelihood:

(S19)

We now describe a series of very local moves aimed at improving the likelihood of the model. The local optimization algorithm aims at blurring the topics and making documents more specific to fewer topics. To do so, it simply finds, for each document, the topics which are infrequent (a more precise definition follows) and “moves” the words drawn from those topics to the most important topic in that document.

  1. For each document, we find its most significant topic: this is done by selecting the topic with the smallest p-value, considering a null model where each word is independently sampled from a given topic with its global probability. Given the number of words which actually come from that topic (see Eq. S18), the p-value is then computed using a binomial distribution.

  2. For each document, we define the infrequent topics as those used with probability smaller than a filter parameter. We then increment the probability of the document's most significant topic (see above) by the sum of the probabilities of the infrequent topics, which are set to zero. Similarly, the word counts of each infrequent topic are decreased for every word which belonged to it, and those of the most significant topic are increased accordingly.

  3. We repeat the previous step for all documents. We then recompute the distributions, as well as the likelihood of the model, making explicit its dependency on the filter parameter.

  4. We loop over all possible values of the filter parameter (scanning its whole range in small steps) and pick the model which maximizes the likelihood.
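Steps 1 and 2 can be sketched for a single document as follows, assuming we already have its word-topic counts and the global topic probabilities; all names are illustrative, and the exact bookkeeping of the word counts in the real algorithm may differ.

```python
import math

def binomial_pvalue(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the chance of drawing at least k
    words from a topic under the null model."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def most_significant_topic(doc_topic_counts, global_topic_probs):
    """Step 1: the topic with the smallest binomial p-value in this document."""
    n = sum(doc_topic_counts.values())
    return min(doc_topic_counts,
               key=lambda t: binomial_pvalue(doc_topic_counts[t], n,
                                             global_topic_probs[t]))

def prune_infrequent_topics(doc_topic_counts, global_topic_probs, filter_value):
    """Step 2: move the words of topics used with frequency below
    `filter_value` to the document's most significant topic."""
    n = sum(doc_topic_counts.values())
    best = most_significant_topic(doc_topic_counts, global_topic_probs)
    pruned = {t: c for t, c in doc_topic_counts.items()
              if c / n >= filter_value or t == best}
    # reassign the mass of the removed (infrequent) topics to the best topic
    pruned[best] = pruned.get(best, 0) + sum(
        c for t, c in doc_topic_counts.items() if t not in pruned)
    return pruned
```

For example, a document with counts {"A": 8, "B": 1, "C": 1} and global probabilities {"A": 0.2, "B": 0.4, "C": 0.4} has A as its most significant topic (drawing 8 of 10 words from a topic with probability 0.2 is very unlikely under the null), so with a filter of 0.15 the two stray words are moved to A.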

LDA Likelihood optimization.

The model we find at this point can be refined further via iterations of the Expectation-Maximization algorithm optimizing the LDA likelihood. The algorithm closely follows the implementation from Blei et al. (2003a). The main difference is that, for computational efficiency, we use sparse data structures, in which words and documents are assigned to only a subset of the topics.

In most cases, the model does not change very much and the algorithm converges very quickly. However, if topics are very heterogeneous in size, we might encounter situations similar to the one described in Sec. S1.6 (see Sec. S8 for an example). In practice, the software records models every few iterations, allowing users to better explore the data.

Implementation details.

Here, we would like to make a few points more precise.

  1. The filtering procedure and the LDA likelihood optimization in TopicMapping are deterministic. Instead, optimizing Infomap’s code length uses a Monte Carlo technique, which can be performed multiple times. The number of runs for Infomap’s optimization was set to 10 in most tests, although most results barely change with a single run. For measuring the reproducibility in Sec. S6, instead, we used 100 runs, because the topic structure is less sharp and we need some more runs to achieve good reproducibility (each run takes about a minute).

  2. After running Infomap, we might find that some words have not been assigned to any topic, because none of their possible connections to other words was considered significant. In each document which uses any of them, we automatically assign these words to the document's most significant topic.

  3. Some (small) topics might not have been selected as the most significant by any document. We remove these topics before the filtering procedure: if we do not, high values of the filter will yield models in which these topics do not appear at all, and this might penalize their likelihood just because the number of topics is diminished.

  4. Depending on the application, it might also be useful to remove very small topics even if they were selected as the most significant by a handful of documents (this is especially important to prevent the subsequent LDA optimization from inflating them, see Sec. S4.4). We used no threshold for the synthetic datasets, but we selected a threshold of 10 documents for the journals in Web of Science, and 100 documents for Wikipedia. In the implementation of the software, we let the users choose a threshold for removing small topics.

  5. The initial for LDA optimization was set to for all topics.

S3 Held-out likelihood and effective number of topics

The most used method for selecting the right number of topics consists in (i) holding out a certain fraction of documents (say 10% of the corpus), (ii) training the algorithm on the remainder of the dataset, and (iii) measuring the likelihood of the held-out corpus under the model obtained on the training set. The best number of topics should be the one for which the held-out likelihood is maximal. Fig. S7 shows that this method tends to give a higher number of topics than the actual one.

We also show that LDA tends to provide models in which the topic probability distribution is fairly close to uniform. To assess this, we compare the entropy of the topic distribution,

(S20)

with the maximum possible entropy, i.e. the one achieved by equally probable topics. In fact, it is easier to compare the exponential entropy Campbell (1966) of the topic probability distribution with its maximum. The former can be seen as an effective number of topics: it is the number of topics needed by a uniform distribution to achieve the same entropy. Fig. S7 shows that, indeed, the effective number of topics is rather close to the input number of topics.
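The effective number of topics is straightforward to compute; this small helper is a sketch of the definition above:

```python
import math

def effective_number_of_topics(topic_probs):
    """exp of the Shannon entropy of p(t): the number of equiprobable
    topics that would give the same entropy (Campbell, 1966)."""
    h = -sum(p * math.log(p) for p in topic_probs if p > 0)
    return math.exp(h)
```

A uniform distribution over 10 topics gives exactly 10, while a skewed distribution such as [0.91, 0.01, ..., 0.01] gives far fewer effective topics than its 10 nominal ones.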

Figure S7: Held-out likelihood and effective number of topics for the three datasets we considered in the main paper. In the language test, we considered documents, while in the synthetic dataset we set the fraction of generic words to . The dashed black lines on the left indicate the number of topics that should have been selected by the method. The black line on the right-hand panels marks the highest achievable value of the effective number of topics, and the horizontal lines are the actual effective numbers of topics.

S4 Additional analysis on the synthetic datasets

In this section, we present five supplementary sets of results related to the synthetic datasets presented in Fig. 4 in the main paper. In the first section, we measure the performance of the algorithms in terms of perplexity Blei et al. (2003a) (a standard measure of quality for topic models) and show that, in our case, this evaluation method has fairly low discriminatory power. We then propose a visualization of the comparison between the correct generative model and the models found by the algorithms we considered. The third section is dedicated to measuring the performance of the methods when we have no information about the correct number of topics to input. In the fourth section, we study how the performance of LDA is affected by the initial conditions of the optimization procedure, and we show that, as expected, they are crucial. Finally, we compare the performance of TopicMapping before and after running LDA as a refinement step.

s4.1 Perplexity

Fig. S8 shows the performance of the algorithms on the synthetic datasets in terms of perplexity (Sec. S10.4 explains in detail how perplexity is defined). Algorithms which yield a lower perplexity are considered to perform better, because the model they provide is less “surprised” by a portion of the dataset it has never seen before. The advantage of this approach is that it can be applied to generic real-world datasets, where the actual generative model is unknown. However, for the cases studied here, the measure performs poorly at discriminating among the methods.
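As a sketch of the standard definition, perplexity is the exponential of minus the average per-word log-likelihood of held-out documents. The interface below, in which the model exposes a per-word probability, is a simplification of our own: real LDA perplexity requires marginalizing over topic assignments.

```python
import math

def perplexity(docs, word_prob):
    """Perplexity of held-out `docs` under a model.
    `word_prob(doc, word)` returns the model's probability of `word`
    appearing in `doc` (a simplified, illustrative interface)."""
    log_lik, n_words = 0.0, 0
    for doc in docs:
        for w in doc:
            log_lik += math.log(word_prob(doc, w))
            n_words += 1
    return math.exp(-log_lik / n_words)
```

As a sanity check, a model that assigns every word a uniform probability 1/V has perplexity exactly V: it is as "surprised" as random guessing over the vocabulary.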

Figure S8: Evaluating the performance of several algorithms on synthetic corpora by measuring perplexity for several values of the parameters (the other parameters are the same as in Fig. 4 in the main paper). Perplexity seems to have low discriminatory power in this test.

s4.2 Visualizing topic models

Fig. S9 shows a visualization of the performance of the methods on the synthetic datasets. We selected a few runs in which the algorithms achieve average performance. The colors show how standard LDA and PLSA fail to recover the generative model. Similarly to what happens in the language test, some (small) topics are merged together (indicated by a “*” symbol) and some other topics are overfitted into two or more dialects.

Figure S9: Topic comparison for the synthetic datasets. All parameters are the same as in Fig. 4 in the main paper, and we set the fraction of generic words (words which are used uniformly across documents) to . Every rectangle is split into horizontal bars, one per document. Each bar is divided into color blocks representing topics, with block size proportional to the topic's probability in the document. The documents are sorted according to their most prominent topic. A. Performance of LDA, for equally sized topics. The “*” symbols indicate topics inferred by LDA in which two or more actual topics are merged. Top: comparison for documents. Bottom: same procedure for words: generic words are clearly distinguishable from specific ones. The numbers in the corners are obtained from the topic similarity (see main text). B. Unequally sized topics. We show results for two parameter values. Only the comparison for documents is shown. We compare LDA, PLSA and TopicMapping.

s4.3 Performances for different number of topics.

Here we discuss how the performance of LDA and PLSA changes if we do not know the exact number of topics. In the main paper, we fed the algorithms the right number of topics, although we have shown (Sec. S3) that this information is hard to guess. Here, we show what we get when setting a different number of topics, still reasonably close to the right value. In general, the performance gets worse as we move further from the correct number, although 15 or 25 topics sometimes give slightly better results. We also show that the results do not change very much if we increase the number of documents.

Figure S10: Performance of LDA and PLSA when we input different number of topics. The number of topics in the generative model is 20.

s4.4 LDA initial conditions

Figure S11: How the initial conditions affect the performance of LDA. We checked four different ways of initializing the topics: random and seeded are the basic provided options. Real model refers to setting the underlying true parameters as initial conditions. The fourth initialization uses the right initial conditions to which we added small topics peaked on a single randomly chosen word. The log-likelihood improvement is defined as the relative difference between the log-likelihood obtained with the different initial conditions and that of the seeded initialization. The plot shows mean values and standard deviations.

In this section, we discuss how the initial conditions affect the performance of LDA optimization. Two standard ways of initializing the topics have been considered: random and seeded. The former assigns random initial conditions, while the latter uses randomly sampled documents as seeds. We used both throughout the whole study, but we have only shown the seeded version for the WoS dataset (the difference in performance is not appreciable, though). Here, we compare these two initializations with the performance of the method when we guess the best possible initial conditions, meaning we start from the actual generative model (Fig. S11).

Similarly to the language test, starting from the generative model as initial conditions, we get an outstanding performance, which is also the optimal one in terms of likelihood. However, we checked that if we slightly change the number of topics, the performance gets worse while the likelihood improves. In Fig. S11, we show both performance and likelihood. The fourth initialization refers to a model close to the generative one, to which we added small topics from which only one single word can be drawn: more precisely, we pick a word at random and define a small topic whose word probability distribution is peaked on it. LDA will grow these small topics to increase the likelihood, overfitting the data and degrading the performance. This is the main reason why we decided to threshold small topics in the Web of Science dataset (see Sec. S2).

s4.5 TopicMapping guess

Here, we show the performance of TopicMapping's initial guess, i.e. before running the LDA optimization (see Fig. S12). We do not show the results for the language test because, in that case, there is no difference at all. In the systematic tests, instead, running LDA as a last step slightly improves the performance of the algorithm, although the difference is not dramatic. We found a remarkable difference only in the Wikipedia dataset (see Sec. S8), where the topic distribution provided by the guess was highly heterogeneous.

Figure S12: Performance of TopicMapping on the synthetic datasets, before and after running LDA.

S5 Asymmetric LDA

In this section, we discuss the results we obtain using asymmetric LDA Wallach et al. (2009) (http://mallet.cs.umass.edu). The algorithm has two main differences with respect to the other LDA method we used throughout the study: first, the prior probabilities of using a certain topic are not all equal, and, second, the optimization algorithm is based on Gibbs sampling rather than variational inference Blei et al. (2003a).

Fig. S13 shows that the algorithm performs better than symmetric LDA in the language test, although it still struggles to recognize the languages if the number of documents is large and the language probabilities are unequal. The performance on the synthetic graphs is better than that of standard LDA (see Fig. S14), but only for certain parameters.

Figure S13: Performance of asymmetric LDA in the language test (same as Fig. 2 in the main text). We used iterations of Gibbs sampling and input the correct number of languages to the algorithm. We optimize the hyperparameters every 100 iterations, but the performance is barely affected by the optimization interval. Curves are median values and the shaded areas indicate 25th and 75th percentiles.
Figure S14: Performance of asymmetric LDA on the tests presented in Fig. 4 in the main paper. Curves are median values and colored areas are 25th and 75th percentiles.

S6 The hierarchy of WoS dataset

In this section, we study the subtopic structure of the Web of Science dataset. In fact, we expect to find subtopics in each journal. Although we do not know any “real” topic model to compare with, we can still measure the reproducibility of the algorithm.

Similarly to what we observed above, we find again that standard LDA is not reproducible and the effective number of topics is strongly affected by the input number of topics, see Fig. S15.

For TopicMapping, we observe that the number of topics is affected by the p-value we choose for filtering out noisy words. This does not happen in the other tests presented so far, which have a rather clear topic structure: there, choosing a p-value of or barely makes any difference. In analyzing Astronomical Journal abstracts, instead, the topic structure is not as sharp, and we observe that reducing the p-value yields a higher number of topics. Fig. S15 shows the results. For Astronomical Journal, with a p-value of we only observe one topic. Decreasing the p-value to , we start observing sub-topics like “galaxi* observ* emiss*”, “star cluster metal” or “orbit system planet”. For Cell, we also observe that the effective number of topics increases for smaller p-values. However, in both cases, TopicMapping is much more reproducible.

Figure S15: Reproducibility and effective number of topics for LDA and TopicMapping for the scientific abstracts of Astronomical Journal and Cell. The number of topics can be tuned in LDA changing the input number of topics. Similarly, in TopicMapping the resolution can be tuned to some extent filtering words with different -values. However, this effect is present only in corpora with a less defined topic structures than the language test or the synthetic graphs, for instance. Median and 25th and 75th percentiles are shown.

S7 Computational complexity

For a given vocabulary size, LDA’s complexity is proportional to the number of documents times the number of topics.

The computational complexity of TopicMapping's initial guess is also linear in the number of documents. In particular, building the graph costs a time proportional to the number of unique-word pairs per document, summed over documents. Infomap's complexity is of the same order of magnitude (smaller if we filter links), because the algorithm runs in a time proportional to the number of edges in the graph. Local PLSA-likelihood optimization is also linear in the number of documents, and can scale better than LDA with the number of topics if the assignments of words to topics are sparse. In fact, we use sparse data structures to compute the topics for each document and each word, meaning that, for each document, we do not handle a list of all topics (including never-used topics), but only a list of the topics the document actually uses. Indeed, this enables the algorithm to scale much better with the number of topics (see Fig. S16) on the synthetic datasets.

As a further example, to analyze the WoS corpus, TopicMapping takes minutes on a standard desktop computer. LDA takes minutes for finding models with 6 topics and minutes for models with 24 topics.

Figure S16: Time needed for the execution of standard LDA and TopicMapping (before the LDA step) on synthetic corpora. Similarly to the other tests, we used a fixed vocabulary of 2000 unique words and 50 words per document. We set and the generic words are . Both algorithms’ complexity is linear in the number of documents. However, TopicMapping can be significantly faster if the number of topics is large.

S8 Topics in Wikipedia

We have collected a large sample of the English Wikipedia (May 2013). The whole dataset comprises more than 4 million articles. However, since most of them are very short articles (stubs), we decided to consider only articles with at least 5 in-links, 5 out-links and 100 unique words. Very specific words (such as those which appear in fewer than 100 articles) have also been pruned. This gives us a dataset of 1,294,860 articles, 118,599 unique words, and millions of words in total.

In order to get results quickly, we decided to parallelize most of the code. For building the network we used 9 threads, each one was assigned a fraction of the total word pairs we had to consider. Doing so, we were able to construct the graph of words in roughly 12 hours. Infomap is extremely fast: each run of the algorithm takes about one hour and we ran it 10 times. After that, we ran the filtering algorithm with a single thread, taking less than one day (we set a filtering step of 0.05). Finally, we parallelized the LDA optimization on about 50 threads: doing so, each iteration took about an hour.

In the main paper, we have shown the results of TopicMapping after running LDA optimization for one single iteration. The inset was obtained running the algorithm on the sub-corpus consisting of all words which were more likely drawn from the first topic. Fig. S17, instead, shows the results after the full LDA optimization. For comparison, we also show the results starting the algorithm with random initial conditions. Interestingly, in this dataset, LDA optimization changed our guess significantly. This is not what happens in any of the other datasets we have tested, for which the topics in our guess were less heterogeneous (see Sec. S1.6).

Figure S17: Topics found in a large Wikipedia sample by standard LDA and by TopicMapping with full LDA optimization. For each topic, we show the top 7 words. Bold fonts are used for the top topics which account for of the total.

S9 TopicMapping as a likelihood optimization method

Here we discuss to what extent TopicMapping provides models with better likelihood than standard LDA. In controlled test cases, such as the synthetic tests presented in this work, TopicMapping generally finds better models in terms of likelihood, and this explains why it performs better (the actual generative model has the highest likelihood).

In real cases, as we discussed in Sec. S1.6, the likelihood can be increased by splitting large topics into subtopics and by merging smaller topics. Therefore, if we compare the likelihood found by TopicMapping with the one found by variational inference Blei et al. (2003a) as a function of the number of topics, TopicMapping does not provide models with higher likelihood. However, this comparison heavily penalizes TopicMapping, which often provides models with a broad distribution of topic sizes, many of which are barely used at all. We therefore argue that comparing models with the same number of effective topics is fairer. Doing so, Fig. S18 shows that TopicMapping's models indeed often have higher likelihood. However, the difference is not dramatic, as we can see from the inset of Fig. S18, because of the degeneracy of the likelihood landscape.
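The "effective number of topics" can be quantified, for instance, as the exponential of the Shannon entropy of the topic-size distribution; this particular choice is an assumption of ours for illustration. It equals K only when all K topics are equally used, and is much smaller when most topics are barely used:

```python
import math

def effective_num_topics(topic_probs):
    """Exponential of the Shannon entropy of the topic sizes.

    Returns K for K equally used topics, and a much smaller
    number for a broad, heterogeneous topic-size distribution.
    """
    h = -sum(p * math.log(p) for p in topic_probs if p > 0)
    return math.exp(h)
```

For example, four equally used topics give an effective number of 4, while one dominant topic plus three rarely used ones gives an effective number close to 1.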

Figure S18: Comparison between TopicMapping and standard LDA in terms of likelihood, for the Web of Science dataset described in the main paper, and for Cell. Colors represent the results from standard LDA with different numbers of topics as input. A. TopicMapping does not provide better likelihood if we compare models with the same number of topics. B. Comparison for Web of Science between models with the same effective number of topics. C. Zooming in on the shaded area in B, we can see that TopicMapping performs better on average, although standard LDA is sometimes comparable. Indeed, splitting Cell and merging Schizophrenia Bulletin and American Economic Review gives very comparable results in terms of likelihood. D. Comparison for Cell.

S10 Appendices

s10.1 Likelihood of English documents in the language test

In this section, we compute the likelihood of the alternative model for the English documents (Sec. S1). Let us call $n_1$ and $n_2$ the number of English words in the first group and in the second group, respectively. We have that:

and the equivalent holds for the second group. When we write an English document, we randomly sample its words from the English vocabulary. This means that the number of words which fall in the first group (and, correspondingly, in the second) follows a binomial distribution:

(S21)
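Eq. S21 can be checked numerically. In the sketch below, `n1` and `n2` are the sizes of the two groups of (equiprobable) English words and `doc_len` is the number of words per document; the symbol names are ours:

```python
from math import comb

def prob_k_in_first_group(k, doc_len, n1, n2):
    """Probability that k of the doc_len sampled words fall in the
    first group of an N = n1 + n2 word equiprobable vocabulary."""
    q = n1 / (n1 + n2)  # probability that a single word is in group 1
    return comb(doc_len, k) * q**k * (1 - q)**(doc_len - k)
```

As a sanity check, the probabilities sum to one over k, and the mean number of first-group words is doc_len times n1/(n1+n2).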

The last ingredient is how to decide which dialect a document should be fitted with. Let us define a threshold on the number of first-group words in a document: if the document contains at least that many, we use the first dialect, and we use the second otherwise. Without loss of generality, we also assume that the threshold is at least one, because otherwise we go back to the single-dialect case (Eq. S3).

Consider the likelihood of an English document in this model. Its average can be written as:

(S22)

where

and the same equation holds for the second group, after the corresponding replacements.

We can compute the optimal values of the parameters simply by setting the derivatives to zero:

If we call:

the optimal and can be written as:

Here, the first quantity is how often we use the first dialect, the second is the expected number of words which fall in the first group of English words, and the third is the probability of using words from the first group, given that we are using the first dialect. We also have that:

For the second group, we have:

We can now compute the expected log-likelihood of Eq. S22:

Calling $H(x) = -x \log x - (1-x) \log(1-x)$ the entropy of a binary variable, we get:

(S23)
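For reference, the binary entropy entering Eq. S23, in natural logarithms (matching the log-likelihoods above), is:

```python
from math import log

def binary_entropy(x):
    """H(x) = -x ln x - (1 - x) ln(1 - x), in nats; H(0) = H(1) = 0."""
    if x in (0.0, 1.0):
        return 0.0
    return -x * log(x) - (1 - x) * log(1 - x)
```

It is maximal, ln 2, at x = 1/2, and symmetric around that point.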

Now, the problem is to find which choice of the parameters maximizes Eq. S23 for given group sizes and document length. It turns out that there are two different regimes, depending on whether the expected number of first-group words per document is small or large.

In the first regime, a possible strategy is to set the threshold to one: we use the second dialect if (and only if) the document contains no words from the first group.

In fact, using the equations above, we get:

It is possible to prove that, in this regime, the maximum is attained for:

and, disregarding finite-size effects due to the group size being an integer:

In the second regime, we restrict ourselves to a convenient choice of the threshold. In the limit of long documents, using the Gaussian approximation of the binomial distribution in Eq. S21, we get:

In this limit, moreover, the difference is independent of the remaining parameters:

In conclusion, the log-likelihood per document of the alternative model (given that we use English) is bigger than that of the generative model, and, remarkably, the difference varies only within a narrow range, so that it is substantially independent of all the parameters of the model. Since we can divide the English words into two arbitrary groups, we can actually construct a large number of alternative models. For instance, for the parameters considered above, the model with the highest likelihood splits English into two groups, so that the number of alternative models becomes:

and there are many more alternative models with only slightly smaller likelihood: for instance, with a less favorable split, the likelihood of the alternative model is close to the optimal one, while the number of such models is even larger. All these models are likely local maxima of the log-likelihood for Expectation-Maximization algorithms.

s10.2 Derivation of Eq. S13

Let us start by computing the log-likelihood per document for the model where the two English subtopics are merged and all languages are recovered (Model 1). We recall that this likelihood is computed given that we know the topics of the document.

If we merge the two English subtopics, the common words have one probability, while the words used in only one of the two subtopics have another. Therefore, the average log-likelihood per English document in Model 1 is:

which can be re-written as:

The log-likelihood per non-English document is:

If, instead, we merge two languages which are not English (Model 2), we get:

The difference in the average log-likelihood between the two models then becomes (recall the definition of the probability of any non-English language):

Eq. S13 follows from the equation above. For asymmetric LDA, we also have to consider the difference in the language entropies. Accounting for that, we get:

Then, Model 1 has higher likelihood than Model 2 if:

The correction from Eq. S13 is .

s10.3 The Dirichlet distribution

The Dirichlet distribution is frequently used in Bayesian statistics since it is the conjugate prior of the multinomial distribution. The distribution is parameterized by $K$ positive values $\alpha_1, \dots, \alpha_K$, and its support is the standard simplex, i.e. the set of vectors $\mathbf{x}$ of dimension $K$ such that $\sum_i x_i = 1$ and $x_i \geq 0$ for all $i$. Clearly, $\mathbf{x}$ can be interpreted as a probability distribution. Moreover, $\langle x_i \rangle = \alpha_i / \sum_j \alpha_j$.

In generating the synthetic corpus, for each document we draw the topic mixture from the Dirichlet distribution with the same parameters $\alpha_i$. In fact, even letting the $\alpha_i$ depend on the documents (but not on the topics), this definition makes sure we get back the pre-defined topic probabilities, since $\langle x_i \rangle = \alpha_i / \sum_j \alpha_j$. Fig. S19 shows how the documents' topic mixtures depend on the concentration parameter in the simple case of equiprobable topics.

Figure S19: For each document, we extract the topic mixture from the Dirichlet distribution for several values of the concentration parameter, setting equally probable topics. We then measure the probability of its most prominent topic (red curve) as well as the sum of the two and five largest topic probabilities (blue and green curves). The plot shows the median together with the upper and lower quantiles. Small values of the concentration parameter lead to documents which are mostly assigned to one single topic: for small values, the probability of the top topic is basically 1 and all the others are zero. For intermediate values, the top topic has roughly 0.5 probability, the second one has 0.2, and the last 15 combined have less than 0.05. For large values, all topic probabilities tend towards equality: this means that documents cannot be classified, as they use all topics with equal probability.
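The measurement of Fig. S19 can be reproduced qualitatively with the standard library alone, drawing symmetric Dirichlet samples as normalized Gamma variates; the number of topics and the sample sizes below are illustrative choices of ours:

```python
import random

def top_topic_probability(alpha, n_topics=20, n_docs=2000, seed=0):
    """Median probability of a document's most prominent topic when
    the topic mixture is drawn from a symmetric Dirichlet(alpha)."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_docs):
        # A Dirichlet sample is a vector of Gamma(alpha, 1) draws,
        # normalized to sum to one.
        g = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
        s = sum(g)
        maxima.append(max(g) / s)
    maxima.sort()
    return maxima[n_docs // 2]
```

Small concentration parameters yield documents dominated by a single topic, while large values push every topic probability towards 1/K, as described in the caption.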

s10.4 Measuring Perplexity

Perplexity is the conventional way to evaluate a topic model's accuracy Blei et al. (2003a). Here, we briefly review how it is computed.

The spirit is to cross-validate the model, whose parameters have been computed on a training set of documents, by looking at how well the model fits a small set of unseen documents. Therefore, the procedure is: (i) hold out a fraction of the documents from the corpus (typically a small fraction); (ii) train the algorithm using the remaining documents; (iii) infer the topic probabilities for the unseen documents without changing the topics; (iv) compare the actual word frequencies of the unseen documents with the inferred topic mixtures.
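Steps (iii)-(iv) can be sketched as follows, assuming the inferred per-document topic mixtures `theta` and topic-word probabilities `beta` are given (the variable names are ours):

```python
import math

def perplexity(docs, theta, beta):
    """exp(- total held-out log-likelihood / total number of words).

    docs:  list of held-out documents, each a list of word ids
    theta: per-document topic probabilities, theta[d][t]
    beta:  per-topic word probabilities, beta[t][w]
    """
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Mixture probability of word w in document d.
            p = sum(theta[d][t] * beta[t][w] for t in range(len(beta)))
            log_lik += math.log(p)
            n_words += 1
    return math.exp(-log_lik / n_words)
```

Lower perplexity means the model predicts the unseen words better; a uniform model over V words has perplexity exactly V.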

s10.5 Algorithms’ usage details

For LDA, we used the implementation that can be found at http://www.cs.princeton.edu/~blei/lda-c/index.html. The stopping criterion for LDA and PLSA was that the relative improvement of the log-likelihood bound with respect to the previous iteration fell below a small tolerance. In running LDA, we also let the algorithm optimize the hyperparameter $\alpha$, starting from a fixed initial value.

References

  • Jin et al. (2005) X. Jin, Y. Zhou, and B. Mobasher, in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (ACM, New York, NY, USA, 2005), KDD ’05, pp. 612–617, ISBN 1-59593-135-X, URL http://doi.acm.org/10.1145/1081870.1081945.
  • Krestel et al. (2009) R. Krestel, P. Fankhauser, and W. Nejdl, in Proceedings of the third ACM conference on Recommender systems (ACM, New York, NY, USA, 2009), RecSys ’09, pp. 61–68, ISBN 978-1-60558-435-5, URL http://doi.acm.org/10.1145/1639714.1639726.
  • Sudderth et al. (2005) E. B. Sudderth, A. Torralba, W. T. Freeman, and A. S. Willsky, in IEEE Intl. Conf. on Computer Vision (2005), pp. 1331–1338.
  • Niebles et al. (2008) J. C. Niebles, H. Wang, and L. Fei-fei, Int. J. Computer Vision pp. 299–318 (2008).
  • Liu et al. (2010) B. Liu, L. Liu, A. Tsykin, G. J. Goodall, J. E. Green, M. Zhu, C. H. Kim, and J. Li, Bioinformatics 26, 3105 (2010), ISSN 1367-4803, URL http://dx.doi.org/10.1093/bioinformatics/btq576.
  • Bíró et al. (2008) I. Bíró, J. Szabó, and A. A. Benczúr, in Proceedings of the 4th international workshop on Adversarial information retrieval on the web (ACM, New York, NY, USA, 2008), AIRWeb ’08, pp. 29–32, ISBN 978-1-60558-159-0, URL http://doi.acm.org/10.1145/1451983.1451991.
  • Deerwester et al. (1990) S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, Journal of the American society for information science 41, 391 (1990).
  • Lee and Seung (1999) D. D. Lee and H. S. Seung, Nature 401, 788 (1999).
  • Hofmann (1999) T. Hofmann, in Proceedings of the Fifteenth conference on Uncertainty in artificial intelligence (Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1999), UAI’99, pp. 289–296, ISBN 1-55860-614-9, URL http://dl.acm.org/citation.cfm?id=2073796.2073829.
  • Blei et al. (2003a) D. M. Blei, A. Y. Ng, and M. I. Jordan, J. Mach. Learn. Res. 3, 993 (2003a), ISSN 1532-4435, URL http://dl.acm.org/citation.cfm?id=944919.944937.
  • Steyvers and Griffiths (2007) M. Steyvers and T. Griffiths, Handbook of latent semantic analysis 427, 424 (2007).
  • Hoffman et al. (2012) M. Hoffman, D. M. Blei, and D. M. Mimno, in Proceedings of the 29th International Conference on Machine Learning (ICML-12) (2012), pp. 1599–1606.
  • Anandkumar et al. (2012) A. Anandkumar, D. P. Foster, D. Hsu, S. M. Kakade, and Y.-K. Liu, arXiv preprint arXiv:1204.6703 (2012).
  • Arora et al. (2013) S. Arora, R. Ge, Y. Halpern, D. Mimno, A. Moitra, D. Sontag, Y. Wu, and M. Zhu, in Proceedings of the 30th International Conference on Machine Learning (ICML-13) (2013), pp. 280–288.
  • Blei et al. (2003b) D. M. Blei, T. Griffiths, M. Jordan, and J. Tenenbaum, in NIPS (2003b), URL http://books.nips.cc/papers/files/nips16/NIPS2003_AA03.pdf.
  • Blei and Lafferty (2007) D. M. Blei and J. D. Lafferty, AAS 1, 17 (2007).
  • Griffiths and Steyvers (2004) T. L. Griffiths and M. Steyvers, Proc. Natl. Acad. Sci. USA 101, 5228 (2004).
  • Nallapati et al. (2007) R. Nallapati, W. Cohen, and J. Lafferty, in Proceedings of the Seventh IEEE International Conference on Data Mining Workshops (IEEE Computer Society, Washington, DC, USA, 2007), ICDMW ’07, pp. 349–354, ISBN 0-7695-3033-8, URL http://dx.doi.org/10.1109/ICDMW.2007.70.
  • Sontag and Roy (2011) D. Sontag and D. Roy, Advances in Neural Information Processing Systems 24, 1008 (2011).
  • Gaussier and Goutte (2005) E. Gaussier and C. Goutte, in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (ACM, 2005), pp. 601–602.
  • Wallach et al. (2009) H. Wallach, D. Mimno, and A. McCallum, Advances in Neural Information Processing Systems 22, 1973 (2009).
  • Invoke IT Blog (2013) Frequency word lists, URL http://invokeit.wordpress.com/frequency-word-lists/.
  • Dhillon (2001) I. S. Dhillon, in Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (ACM, 2001), pp. 269–274.
  • Zhou et al. (2007) T. Zhou, J. Ren, M. Medo, and Y.-C. Zhang, Physical Review E 76, 046115 (2007).
  • Tan et al. (2005) P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, (First Edition) (Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2005), ISBN 0321321367.
  • Rosvall and Bergstrom (2008) M. Rosvall and C. Bergstrom, Proc. Natl. Acad. Sci. USA 105, 1118 (2008).
  • Snowball (2011) The English (Porter2) stemming algorithm, URL http://snowball.tartarus.org/algorithms/english/stemmer.html.
  • Campbell (1966) L. Campbell, Probability Theory and Related Fields 5, 217 (1966).