We consider the problem of estimating high-dimensional, discrete, mixture distributions, in the context of topic models. The focus of this work is the estimation, with sharp finite sample convergence rates, of the distribution of the latent topics within the documents of a corpus. Our main application is to the estimation of Wasserstein distances between document generating distributions.
In the framework and traditional jargon of topic models, one has access to a corpus of documents generated from a common set of latent topics. Each document is modelled as a set of words drawn from a discrete distribution on points, where is the dictionary size. We observe the -dimensional word-count vector for each document , where we assume
The topic model assumption is that the matrix of expected word frequencies in the corpus, can be factorized as
Here represents the matrix of conditional probabilities of a word, given a topic, and therefore each column of belongs to the -dimensional probability simplex
The notation represents for each , and is the vector of all ones. The matrix collects the probability vectors , the simplex in . The entries of are probabilities with which each of the topics occurs within document , for each . Relationship (1) would be a very basic application of Bayes’ Theorem if also depended on . A matrix that is common across documents is the topic model assumption, which we will make in this paper.
Under model (1), each distribution on words, , is a discrete mixture of distributions. The mixture components correspond to the columns of , and are therefore common to the entire corpus, while the weights, given by the entries of , are document specific. Since not all topics are expected to be covered by all documents, the mixture weights are potentially sparse, in that may be sparse. Using their dual interpretation, throughout the paper we will refer to a vector as either the topic distribution or the vector of mixture weights, in document .
The observed word frequencies are collected in a data matrix with independent columns corresponding to the th document. Our main interest is to estimate , whether the matrix is known or unknown. We allow for the ambient dimensions and to depend on the sizes of the samples and throughout the paper.
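For concreteness, the sampling scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: we take the factorization in (1) to be the matrix product of the word-topic matrix and the topic-document matrix, and we draw the word counts of each document from a multinomial distribution. The names `A`, `T`, `doc_lengths` and `simulate_corpus` are ours and do not appear in the text.

```python
import numpy as np

def simulate_corpus(A, T, doc_lengths, seed=0):
    """Simulate word frequencies under an assumed model Pi = A @ T.

    A           : (p, K) word-topic matrix, each column in the simplex.
    T           : (K, n) topic-document matrix, each column in the simplex.
    doc_lengths : n document lengths (number of sampled words per document).
    Returns the expected frequencies Pi and the observed frequencies X.
    """
    rng = np.random.default_rng(seed)
    Pi = A @ T                      # expected word frequencies, one column per document
    p, n = Pi.shape
    X = np.empty((p, n))
    for i in range(n):
        counts = rng.multinomial(doc_lengths[i], Pi[:, i])
        X[:, i] = counts / doc_lengths[i]   # observed relative frequencies
    return Pi, X

# Toy corpus (hypothetical numbers): 4 words, 2 topics, 3 documents.
# Documents 0 and 2 each cover a single topic, so their topic vectors are sparse.
A = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.7],
              [0.0, 0.3]])
T = np.array([[1.0, 0.4, 0.0],
              [0.0, 0.6, 1.0]])
Pi, X = simulate_corpus(A, T, doc_lengths=[50, 80, 30])
```

In this toy corpus, the words that are used only by a topic absent from a document have zero probability in that document, and are therefore never observed in it.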
While, for ease of reference to the existing literature, we will continue to employ the text analysis jargon for the remainder of this work, and our main application will be to the analysis of a movie review data set, our results apply to any data set generated from a model satisfying (1), for instance in biology (Bravo González-Blas et al., 2019; Chen et al., 2020), hyperspectral unmixing (Ma et al., 2013) and collaborative filtering (Kleinberg and Sandler, 2008).
The specific problems treated in this work are listed below, and expanded upon in the following subsections.
The main focus of this paper is on the derivation of sharp, finite-sample, -error bounds for estimators of the potentially sparse topic distributions , under model (1), for each . The finite sample analysis covers two cases, corresponding to whether the components of the mixture, provided by the columns of , are either (i) known, or (ii) unknown, and estimated by from the corpus data . As a corollary, we derive corresponding finite sample -norm error bounds for mixture model-based estimators of .
The main application of our work is to the construction and analysis of similarity measures between the documents of a corpus, for measures corresponding to estimates of the Wasserstein distance between different probabilistic representations of a document.
1.1 A finite sample analysis of topic and word distribution estimators
Finite sample error bounds for estimators of in topic models (1) have been studied in Arora et al. (2012, 2013); Ke and Wang (2017); Bing et al. (2020, 2020), while the finite sample properties of estimators of and, by extension, those of mixture-model-based estimators of , are much less understood, even when is known beforehand, and therefore .
When is a probability vector parametrized as , with , and some known function , provided that is identifiable, the study of the asymptotic properties of the maximum likelihood estimator (MLE) of , derived from the -dimensional vector of observed counts , is over eight decades old. Proofs of the consistency and asymptotic normality of the MLE, when the ambient dimensions and do not depend on the sample size, can be traced back to Rao (1957, 1958) and later to the seminal work of Birch (1964), and are reproduced, in updated forms, in standard textbooks on categorical data (Bishop et al., 2007; Agresti, 2012).
The mixture parametrization treated in this work, when known, is an instance of these well-studied low-dimensional parametrizations. Specialized to our context, for document , the parametrization is with , for each component of . However, even when and are fixed, the aforementioned classical asymptotic results are not applicable, as they are established under the following key assumptions that typically do not hold for topic models:
, for all ,
for all .
The regularity assumption (1) is crucial in classical analyses (Rao, 1957, 1958), and stems from the basic requirement of M-estimation that be an interior point in its appropriate parameter space. In effect, since , this is a requirement on only a sub-vector of it. In the context of topic models, a given document of the corpus may not touch upon all topics, and in fact is expected not to. Therefore, it is expected that , for some . Furthermore, represents the number of topics common to the entire corpus, and although topic may not appear in document , it may be the leading topic of some other document . Both presence and absence of a topic in a document are subject to discovery, and are not known prior to estimation. Moreover, one does not observe the topic proportions per document directly. Therefore, one cannot use background knowledge, for any given document, to reduce to a smaller dimension in order to satisfy assumption (1).
The classical assumption (2) also typically does not hold for topic models. To see this, note that the matrix is also expected to be sparse: conditional on a topic , some of the words in a large -dimensional dictionary will not be used in that topic. Therefore, in each column , we expect that , for many rows . When the supports of and do not intersect, the corresponding probability of word in document is zero, . Since zero word probabilities are induced by unobservable sparsity in the topic distribution (or, equivalently, in the mixture weights), one once again cannot reduce the dimension a priori in a theoretical analysis. Therefore, the assumption (2) is also expected to fail.
The analysis of the MLE of , with known, is thus an open problem, even for fixed scenarios, whenever the standard assumptions (1) and (2) do not hold and the problem cannot be artificially reduced to a framework in which they do.
Finite sample analysis of the rates of the MLE of topic distributions, for known
In Section 2.1, we provide a novel analysis of the MLE of for known , under a sparse discrete mixture framework, in which both the ambient dimensions and are allowed to grow with the sample sizes and . Kleinberg and Sandler (2008) refer to the assumption of being known as the semi-omniscient setting in the context of collaborative filtering and note that even this setting is, surprisingly, very challenging for estimating the mixture weights. By studying the MLE of when is known, one gains appreciation of the intrinsic difficulty of this problem, that is present even before one further takes into account the estimation of the entire matrix .
To the best of our knowledge, the only existing work that treats this aspect of our problem is Arora et al. (2016), under the assumptions that
(a) the support of is known and with and ,
(b) the matrix is known and .
The parameter is called the condition number of (Kleinberg and Sandler, 2008) which measures the amount of linear independence between columns of that belong to the simplex . Under (a) and (b), the problem framework is very close to the classical one, and the novelty in Arora et al. (2016) resides in the provision of a finite sample -error bound of the difference between the restricted MLE (restricted to the known support ) and the true , a bound that is valid for growing ambient dimensions. However, assumption (a) is rather strong, as the support of is typically unknown. Furthermore, the restriction implies that . Hence (a) essentially requires to be approximately uniform on its a priori known support. This does not hold in general. For instance, even if the support were known, many documents will primarily cover a very small number of topics, while only mentioning the rest, and thus some topics will be much more likely to occur than others, per document.
Our novel finite sample analysis in Section 2.1 avoids the strong condition (a) in Arora et al. (2016). For notational simplicity, we pick one and drop the superscripts in , and within this section. In Theorem 1 of Section 2.1.1, we first establish a general bound for the -norm of the error , with being the MLE of . Then, in Section 2.1.2, we use this bound as a preliminary result to characterize the regime in which the Hessian matrix of the loss in (6), evaluated at , is close to its population counterpart (see condition (18) in Section 2.1.2). When this is the case, we prove a potentially faster rate of in Theorem 2. A consequence of both Theorem 1 and Theorem 2 is summarized in Corollary 3 of Section 2.1.2 for the case when is dense such that . For dense , provided that for some sufficiently large constant , achieves the parametric rate , up to a multiplicative factor .
As mentioned earlier, since is not necessarily an interior point, we cannot appeal to the standard theory of the MLE, nor can we rely on having a zero gradient of the log-likelihood at . Instead, our proofs of Theorems 1 and 2 consist of the following key steps:
We prove that the KKT conditions of maximizing the log-likelihood under the restriction that lead to a quadratic inequality in of the form where (the infinity norm of) is defined in the next point, and
We bound the linear term of this inequality by together with a sharp concentration inequality (stated among the technical lemmas in the Appendix) for
We prove that the quadratic term can be bounded from below by , using the definition of the condition number of , and control of the ratios over a suitable subset of indices such that .
The faster rate in Theorem 2 requires a more delicate control of , and its analysis is complicated by the division by . To this end, we use the bound in Theorem 1 to first prove that , for all with and some constant . We then prove a sharp concentration bound (stated among the technical lemmas in the Appendix) for the operator norm of the matrix for and . This will lead to an improved quadratic inequality
Finally, a sharp concentration inequality for gives the desired faster rates on .
Minimax optimality and adaptation to sparsity of the MLE of topic distributions, for known
In Section 2.1.3 we show that the MLE of can be sparse, without any need for extra regularization, a remarkable property that holds in the topic model set-up. Specifically, we introduce in Theorem 5 a new incoherence condition on the matrix under which holds with high probability. Therefore, if the vector is sparse, its zero components will be among those of . Our analysis uses a primal-dual witness approach based on the KKT conditions from solving the MLE. To the best of our knowledge, this is the first work proving that the MLE of sparse mixture weights can be exactly sparse, without extra regularization, and determining conditions under which this can happen. Since implies that if for some , so is , this sparsity recovery property further leads to a faster rate (up to a logarithmic factor) for with , as summarized in Corollaries 4 and 6 of Section 2.1.3. In Section 2.1.4 we prove that in fact is the minimax rate of estimating over a large class of sparse topic distributions, implying the minimax optimality of the MLE as well as its adaptivity to the unknown sparsity .
Finite sample analysis of the estimators of topic distributions, for unknown
We study the estimation of when is unknown in Section 2.2. Our procedure of estimating is valid for any estimator of with columns of belonging to . For any such estimator , we propose to plug it into the log-likelihood criterion for estimating . While the proofs are more technical, we can prove that the resulting estimate of by using retains all the properties proved for the MLE based on the known in Section 2.1, provided that the error is sufficiently small. In fact, all bounds of in Theorems 8 and 9 and Corollary 10 of Section 2.2.2, have an extra additive term reflecting the effect of estimating . In Appendix A.2, we also show that the estimator retains the sparsity recovery property despite using . Essentially, our take-home message is that the rate for is the same as plus the additive error , provided that estimates well in norm, with one instance given by the estimator in Bing et al. (2020).
Finite sample analysis of the estimators of word distributions
In Section 2.3 we compare the mixture-model-based estimator of with the empirical estimator (we drop the document-index ), which is simply the -dimensional observed word frequencies, in two aspects: the convergence rate and the estimation of probabilities corresponding to zero observed frequencies. For the empirical estimator , we find with , while . We thus expect a faster rate for the model-based estimate whenever . Regarding the second aspect, we note that we can have zero observed frequency () for some word that has strictly positive word probability (). The probabilities of these words are estimated incorrectly by zeroes by the empirical estimate whereas the model-based estimator can produce strictly positive estimates, for instance, under conditions stated in Section 2.3. On the other hand, for the words that have zero probabilities in (hence zero observed frequencies), the empirical estimate makes no mistakes in estimating their probabilities while the estimation error of tends to zero at a rate that is no slower than . In the case that has correct one-sided sparsity recovery, detailed in Section 2.1.3, also estimates zero probabilities by zeroes.
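The second aspect can be seen on a toy example. The sketch below is purely illustrative: the matrix `A`, the observed frequencies `x` and the helper `loglik` are hypothetical, and we assume that the likelihood criterion is the frequency-weighted sum of log fitted probabilities over the observed words. With two topics the maximizer can be located by a simple grid search over the weight of the first topic.

```python
import numpy as np

# Hypothetical word-topic matrix: 4 words, 2 topics, columns in the simplex.
A = np.array([[0.6, 0.0],
              [0.3, 0.2],
              [0.1, 0.3],
              [0.0, 0.5]])
# Observed frequencies of a short document: words 2 and 3 were never drawn.
x = np.array([0.7, 0.3, 0.0, 0.0])

def loglik(T, A, x):
    """Assumed criterion: sum over observed words of x_j * log(A_j . T)."""
    J = x > 0
    return float(np.sum(x[J] * np.log(A[J] @ T)))

# Grid search over T = (t, 1 - t); the grid starts above 0 to avoid log(0).
grid = np.linspace(0.001, 1.0, 1000)
values = [loglik(np.array([t, 1.0 - t]), A, x) for t in grid]
t_best = grid[int(np.argmax(values))]

T_hat = np.array([t_best, 1.0 - t_best])   # maximizer on the grid
pi_empirical = x                           # empirical estimate of the word distribution
pi_model = A @ T_hat                       # model-based estimate of the word distribution
```

On this input the criterion is increasing in the first weight, so the grid maximizer puts all mass on the first topic. The model-based estimate then assigns a strictly positive probability to word 2, which the empirical estimate sets to zero, and it assigns exactly zero to word 3, which only the absent topic uses.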
1.2 Estimates of the 1-Wasserstein document distances in topic models
In Section 3 we introduce two alternative probabilistic representations of a document : via the word generating probability vector, , or via the topic generating probability vector . We use either the 1-Wasserstein distance between the word distributions, , or the 1-Wasserstein distance between the topic distributions, , in order to evaluate the proximity of a pair of documents and , for metrics and between words and topics, defined in displays (48) and (51) – (52), respectively. In particular, in Section 3.1 we explain in detail that we regard a topic as a distribution on words, given by a column of , and therefore distances between topics are distances between discrete distributions in , and need to be estimated when is not known.
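Computationally, a 1-Wasserstein distance between two discrete distributions is the optimal value of a small linear program over couplings. The sketch below solves it with `scipy.optimize.linprog`. The function name `wasserstein1` is ours, and the ground metric used in the example (total variation between the columns of a toy word-topic matrix) is only one hypothetical choice, since the metrics of displays (48) and (51) to (52) are not reproduced here.

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein1(a, b, D):
    """1-Wasserstein distance between discrete distributions a and b.

    D[k, l] is the ground distance between support points k and l.
    Solves: minimize sum_{k,l} w[k, l] * D[k, l] over couplings w >= 0
    whose row sums equal a and whose column sums equal b.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    D = np.asarray(D, dtype=float)
    K, L = D.shape
    # Couplings are flattened row-major: w[k, l] sits at index k * L + l.
    A_eq = np.zeros((K + L, K * L))
    for k in range(K):
        A_eq[k, k * L:(k + 1) * L] = 1.0    # row-sum constraint for a[k]
    for l in range(L):
        A_eq[K + l, l::L] = 1.0             # column-sum constraint for b[l]
    b_eq = np.concatenate([a, b])
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return float(res.fun)

# Toy example: distance between two topic distributions, with a hypothetical
# ground metric equal to the total variation distance between topics.
A = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.7],
              [0.0, 0.3]])
D_topics = 0.5 * np.abs(A[:, :, None] - A[:, None, :]).sum(axis=0)
dist = wasserstein1([1.0, 0.0], [0.4, 0.6], D_topics)
```

The linear program has one variable per pair of support points, so working with topic distributions rather than word distributions reduces its size whenever the number of topics is much smaller than the dictionary size.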
In Section 3.2 we propose to estimate the two 1-Wasserstein distances by and , respectively, where is the model-based estimator of and is the estimator of , as studied in Section 2. As a main theoretical application of the error bounds derived in Section 2, we provide the finite sample upper bounds
that hold under the conditions given in Proposition 11 and Corollary 12 of Section 3.2. We assume, for notational simplicity, that . We denote by the -norm of the matrix , which counts the number of the non-zero entries in this matrix. We defer the precise technical details to Section 3.2, and give an overall, qualitative, assessment here. We first observe that the bounds are of the same order, and that their second term reflects the order of the error in estimating , and it is therefore zero if is known. In general, needs to be estimated and this error term is not zero. The most conservative rate in (2) and (3) is obtained when is dense, and , with the practical implications that a short document length (small ) can be compensated for, in terms of speed of convergence, by having a relatively small number of topics covered by the entire corpus, whereas working with a very large dictionary (large ) will not be detrimental to the rate in a very large corpus (large ).
Sparsity in is however expected, as given a topic, many of the words from a large dictionary will not be used in that topic. Moreover, sparsity in the topic distributions , for each , is also expected, as a given document will typically only touch upon a few of the topics. Display (59) in Remark 7 gives a refinement of the bounds above that reflects their adaptation to unknown sparsity in either the document specific topic-distribution or in the word-topic matrix .
To the best of our knowledge, this rate analysis of the estimates of 1-Wasserstein distance corresponding to estimators of discrete distributions in topic models is new. The only related results, discussed in Section 3.1, have been established relative to empirical frequency estimators of discrete distributions, from an asymptotic perspective (Sommerfeld and Munk, 2017; Tameling et al., 2018) or in finite samples (Weed and Bach, 2017).
In Remark 7 of Section 3.2 we discuss the net computational benefits of representing documents in terms of their -dimensional topic distributions, for 1-Wasserstein distance calculations. Using an IMDB movie review corpus as a real data example, we illustrate in Section 3.3 the practical benefits of these distance estimates, relative to the more commonly used earth(word)-mover’s distance (Kusner et al., 2015) between observed empirical word-frequencies, , with , for all . Our analysis reveals that all our proposed 1-Wasserstein distance estimates successfully capture differences in the relative weighting of topics between documents, whereas the standard is substantially less successful, likely owing in part to the fact noted in Section 1.1 above, that when the dictionary size is large, but the document length is relatively small, the quality of as an estimator of will deteriorate, and the quality of as an estimator of (49) will deteriorate accordingly.
The remainder of the paper is organized as follows. In Section 2.1 we study the estimation of when is known. A general bound of is stated in Section 2.1.1 and is improved in Section 2.1.2. The sparsity of the MLE is discussed in Section 2.1.3 and the minimax lower bounds of estimating are established in Section 2.1.4. Estimation of when is unknown is studied in Section 2.2. In Section 2.3 we discuss the comparison between model-based estimators and the empirical estimator of . Section 3 is devoted to our main application: the 1-Wasserstein distance between documents. In Section 3.1 we introduce alternative Wasserstein distances between probabilistic representations of documents with their estimation studied and analyzed in Section 3.2. Section 3.3 contains the analysis of a real data set of IMDB movie reviews. Simulation studies on the estimation of and are presented in Section 4. The Appendix contains all proofs, auxiliary results and additional simulation results.
For any positive integer , we write . For two real numbers and , we write and . For any set , its cardinality is written as . For any vector , we write its -norm as for . For a subset , we define as the subvector of with corresponding indices in . Let be any matrix. For any set and , we use to denote the submatrix of with corresponding rows and columns . In particular, () stands for the whole rows (columns) of in (). We write for each single row of . We use and to denote the operator norm and elementwise norm, respectively. We write . The -th canonical unit vector in is denoted by while represents the -dimensional vector of all ones. is short for the identity matrix. For two sequences and , we write if there exists such that for all . For a metric on a finite set , we use boldface to denote the corresponding matrix. The set contains all permutation matrices.
2 Estimation of topic distributions under topic models
We consider the estimation of the topic distribution vector, , for each . Pick any ; for notational simplicity, we write , and as well as throughout this section.
We allow, but do not assume, that the vector is sparse, as sparsity is expected in topic models: a document will cover some, but most likely not all, topics under consideration. We therefore introduce the following parameter space for :
with being any integer between and . From now on, we let and write for its cardinality.
In Section 2.1 we study the estimation of from the observed data , generated from background probability vector parametrized as , with known matrix . The intrinsic difficulties associated with the optimal estimation of are already visible when is known, and we treat this in detail before providing, in Section 2.2, a full analysis that includes the estimation of . We remark that assuming known is not entirely unrealistic for topic models of text data, since one then typically has access to a large corpus (with on the order of tens of thousands). When the corpus can be assumed to share the same , this matrix can be very accurately estimated.
The results of Section 2.1 hold for any known , not required to have any specific structure: in particular, we do not assume that it follows a topic model with anchor words (Assumption 1 stated in Section 2.2.1 below). We will make this assumption when we consider optimal estimation of when itself is unknown, in which case Assumption 1 serves as both a needed identifiability condition and a condition under which estimation of both and , in polynomial time, becomes possible. This is covered in detail in Section 2.2.
2.1 Estimation of when is known
When is known and given, with columns , the data has a multinomial distribution,
where is the topic distribution vector, with entries corresponding to the proportions of the topics, respectively. Under (4), it is natural to consider the Maximum Likelihood Estimator (MLE) of . The log-likelihood, ignoring terms independent of , is proportional to
where the last summation is taken over the index set of observed relative frequencies,
and using the convention that . Then
This optimization problem is also known as the log-optimal investment strategy, see for instance (Boyd et al., 2004, Problem 4.60). It can be computed efficiently, since the loss function in (6) is concave on its domain, the open half space , and the constraints and are convex.
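One simple way to compute the maximizer in (6) numerically is the classical EM iteration for mixture weights with known components, sketched below. This is a generic, swapped-in algorithm rather than a solver prescribed in the text, and it assumes that the criterion in (6) is the frequency-weighted sum of log fitted word probabilities over the observed words. The function name `mle_topic_weights` and its arguments are hypothetical.

```python
import numpy as np

def mle_topic_weights(A, x, n_iter=2000, tol=1e-12):
    """EM iteration for the mixture weights T when the components A are known.

    A : (p, K) matrix with columns in the simplex (assumed known).
    x : (p,) observed relative word frequencies, summing to one.
    Maximizes sum_{j : x_j > 0} x_j * log(A_j . T) over the simplex.
    """
    A = np.asarray(A, dtype=float)
    x = np.asarray(x, dtype=float)
    J = x > 0                          # only observed words enter the criterion
    A_J, x_J = A[J], x[J]
    K = A.shape[1]
    T = np.full(K, 1.0 / K)            # strictly positive starting point
    for _ in range(n_iter):
        fitted = A_J @ T               # current fitted probabilities of observed words
        T_new = T * (A_J.T @ (x_J / fitted))   # multiplicative EM update, stays in the simplex
        converged = np.abs(T_new - T).sum() < tol
        T = T_new
        if converged:
            break
    return T

# Toy inputs (hypothetical numbers): 4 words, 2 topics.
A = np.array([[0.5, 0.0],
              [0.5, 0.0],
              [0.0, 0.7],
              [0.0, 0.3]])
T_dense = mle_topic_weights(A, A @ np.array([0.4, 0.6]))    # frequencies equal to A @ T
T_sparse = mle_topic_weights(A, np.array([0.5, 0.5, 0.0, 0.0]))
```

Each update keeps the iterate in the simplex and does not decrease the criterion. In the second call only words of the first topic are observed, and the iteration returns a weight vector with an exact zero, without any penalty term.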
The following two subsections state the theoretical properties of the MLE in (6), and include a study of its adaptivity to the potential sparsity of and minimax optimality. In Section 2.3 we show that although is constructed only from observed, non-zero, frequencies, can be a non-zero estimate of for those indices for which we observe .
2.1.1 A general finite sample bound for
To analyze , we first introduce two deterministic sets that control defined in (5). Recalling , we collect the words with non-zero probabilities in the set
We will also consider the set
The sets and are appropriately defined such that holds with probability at least (see the technical lemmas in the Appendix). Define
We note that , and all depend on implicitly via . Another important quantity is the following restricted condition number of the submatrix of , defined as
We make the following simple, but very important, observation that
with , by using the fact that both and belong to . In fact, (12) holds generally for any estimator as
Display (12) implies that the “effective” error bound of arises mainly from the estimation of . Also because of this property, we need the condition number of to be positive only over the cone rather than the whole .
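The reasoning behind this observation can be checked numerically. The sketch below assumes that display (12) is the inequality stating that the full ℓ1 error is at most twice the ℓ1 error restricted to the support of the true topic distribution, and that the cone in question consists of vectors whose off-support ℓ1 mass is at most their on-support ℓ1 mass. The variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)
K, s = 6, 2                      # number of topics and size of the true support
worst_ratio = 0.0                # largest observed ||delta||_1 / ||delta_S||_1
cone_violations = 0              # times the off-support mass exceeds the on-support mass

for _ in range(1000):
    T = np.zeros(K)
    T[:s] = rng.dirichlet(np.ones(s))        # true vector, supported on the first s topics
    T_hat = rng.dirichlet(np.ones(K))        # an arbitrary point of the simplex
    delta = T_hat - T
    on = np.abs(delta[:s]).sum()             # error on the support S
    off = np.abs(delta[s:]).sum()            # error off the support
    # Off the support delta = T_hat >= 0, and delta sums to zero, so
    # off = -(sum of delta over S) <= on, hence ||delta||_1 = on + off <= 2 * on.
    if off > on + 1e-12:
        cone_violations += 1
    worst_ratio = max(worst_ratio, np.abs(delta).sum() / on)
```

The argument in the comments uses only that both vectors lie in the simplex, which is why the same bound holds for any estimator taking values in the simplex.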
The following theorem states the convergence rate of . Its proof can be found in the Appendix.
Theorem 1. Assume . For any , with probability , one has
Theorem 1 is a general result that only requires . The rates depend on two important quantities: and , which we discuss below in detail. In the next section we will show that the bound in Theorem 1 serves as an initial result, upon which one could obtain a faster rate of the MLE in certain regimes.
Remark 1 (Discussion on ).
The condition number, , is commonly used to quantify the linear independence of the columns belonging to of the matrix (Kleinberg and Sandler, 2008). As remarked in Kleinberg and Sandler (2008), the condition number plays the role of the smallest singular value, but it is more appropriate for matrices with columns belonging to a probability simplex. Because of the chain inequalities
and the fact that , having appear in the bound loses at most a factor compared to . But using potentially yields a much worse bound than using : there are instances for which is bounded from below by a constant whereas is only of order (see, for instance, Kleinberg and Sandler (2008, Appendix A)).
The restricted condition number in (11) for generalizes by requiring the condition of over the cones with and . We thus view as the analogue of the restricted eigenvalue (Bickel et al., 2009) of the Gram matrix in the sparse regression settings. In topic models, it has been empirically observed that the (restricted) condition number of is oftentimes bounded from below by some absolute constant (Arora et al., 2016).
To understand why appears in the rates, recall that the MLE in (6) only uses the words in as defined in (5). Intuitively, only the condition number of should play a role as we do not observe any information from words in . Since holds with high probability, we can thus bound from below by . For the same reason, in (10) is defined over rather than .
Remark 2 (Discussion on ).
Define the smallest non-zero entry in as
Recall . We have where
The magnitudes of both and closely depend on while additionally depends on
a quantity that essentially balances the entries of and those of . Clearly, when is dense, that is, , we have . In general, we have
We further remark that if has a special structure such that there exists at least one anchor word for each topic , that is, for each , there exists a row (see Assumption 1 in Section 2.2.1 below), it is easy to verify that the inequality for in (14) is in fact an equality.
2.1.2 Faster rates of
In this section we state conditions under which the general bound stated in Theorem 1 can be improved. We begin by noting that one of the main difficulties in deriving a faster rate for is in establishing a link between the Hessian matrix (the second order derivative) of the loss function in (6) evaluated at to that evaluated at .
To derive this link, we prove in the Appendix that a relative weighted error of estimating by stays bounded in probability, in the precise sense that
Further, we show in a technical lemma in the Appendix that the Hessian matrix of (6) at concentrates around its population-level counterpart, with replaced by . A sufficient condition under which (18) holds can be derived as follows. First note that
We have bounded by in (17), and have provided an initial bound on in Theorem 1. Therefore, (18) holds if these two bounds combine to show is of order . This is summarized in the following theorem. Let be defined in (11) with and in place of . Recall that is defined in (16). In addition, we define