Topic modeling [Ble12] aims at extracting the latent structure from a corpus of documents (either images or texts), that are represented as vectors . The key assumption is that the documents are (approximately) convex combinations of a small number of topics . Conditional on the topics, documents are generated independently by letting
where the weights and noise vectors are i.i.d. across . The scaling factor is introduced for mathematical convenience (an equivalent parametrization would have been to scale by a noise-level parameter ), and can be interpreted as a signal-to-noise ratio. It is also useful to introduce the matrix whose -th row is , and therefore
where and . The -th row of , is the vector of weights , while the rows of will be denoted by .
Note that belongs to the simplex . It is common to assume that its prior is Dirichlet: this class of models is known as Latent Dirichlet Allocations, or LDA [BNJ03]. Here we will take a particularly simple example of this type, and assume that the prior is Dirichlet in dimensions with all parameters equal to (which we will denote by ). As for the topics , their prior distribution depends on the specific application. For instance, when applied to text corpora, the are typically non-negative and represent normalized word count vectors. Here we will make the simplifying assumption that they are standard Gaussian . Finally, will be a noise matrix with entries .
In fully Bayesian topic models, the parameters of the Dirichlet distribution, as well as the topic distributions are themselves unknown and to be learned from data. Here we will work in an idealized setting in which they are known. We will also assume that data are in fact distributed according to the postulated generative model. Since we are interested in studying some limitations of current approaches, our main point is only reinforced by assuming this idealized scenario.
As is common with Bayesian approaches, computing the posterior distribution of the factors , given the data is computationally challenging. Since the seminal work of Blei, Ng and Jordan [BNJ03], variational inference is the method of choice for addressing this problem within topic models. The term ‘variational inference’ refers to a broad class of methods that aim at approximating the posterior computation by solving an optimization problem, see [JGJS99, WJ08, BKM17] for background. A popular starting point is the Gibbs variational principle, namely the fact that the posterior solves the following convex optimization problem:
denotes the Kullback-Leibler divergence. The variational expression in Eq. (1.4) is also known as the Gibbs free energy. Optimization is within the space
of probability measures on. To be precise, we always assume that a dominating measure over is given for , and both and have densities with respect to : we hence identify the measure with its density. Throughout the paper (with the exception of the example in Section 2) can be taken to be the Lebesgue measure.
Even for discrete, the Gibbs principle has exponentially many decision variables. Variational methods differ in the way the problem (1.3) is approximated. The main approach within topic modeling is naive mean field, which restricts the optimization problem to the space of probability measures that factorize over the rows of :
By a suitable parametrization of the marginals , , this leads to an optimization problem of dimension , cf. Section 3. Despite being non-convex, this problem is separately convex in the and
, which naturally suggests the use of an alternating minimization algorithm which has been successfully deployed in a broad range of applications ranging from computer vision to genetics[FFP05, WB11, RSP14]. We will refer to this as to the naive mean field iteration. Following a common use in the topics models literature, we will use the terms ‘variational inference’ and ‘naive mean field’ interchangeably.
The main result of this paper is that naive mean field presents an instability for learning Latent Dirichlet Allocations. We will focus on the limit with fixed. Hence, an LDA distribution is determined by the parameters . We will show that there are regions in this parameter space such that the following two findings hold simultaneously:
- No non-trivial estimator.
Any estimator , of the topic or weight matrices is asymptotically uncorrelated with the real model parameters . In other words, the data do not contain enough signal to perform any strong inference.
- Variational inference is randomly biased.
Given the above, one would hope the Bayesian posterior to be centered on an unbiased estimate. In particular,(the posterior distribution over weights of document
) should be centered around the uniform distribution. In contrast, we will show that the posterior produced by naive mean field is centered around a random distribution that is uncorrelated with the actual weights. Similarly, the posterior over topic vectors is centered around random vectors uncorrelated with the true topics.
One key argument in support of Bayesian methods is the hope that they provide a measure of uncertainty of the estimated variables. In view of this, the failure just described is particularly dangerous because it suggests some measure of certainty, although the estimates are essentially random.
Is there a way to eliminate this instability by using a better mean field approximation? We show that a promising approach is provided by a classical idea in statistical physics, the Thouless-Anderson-Palmer (TAP) free energy [TAP77, OW01]. This suggests a variational principle that is analogous in form to naive mean field, but provides a more accurate approximation of the Gibbs principle:
- Variational inference via the TAP free energy.
We show that the instability of naive mean field is remedied by using the TAP free energy instead of the naive mean field free energy. The latter can be optimized using an iterative scheme that is analogous to the naive mean field iteration and is known as approximate message passing (AMP).
While the TAP approach is promising –at least for synthetic data– we believe that further work is needed to develop a reliable inference scheme.
The rest of the paper is organized as follows. Section 2 discusses a simpler example, -synchronization, which shares important features with latent Dirichlet allocations. Since calculations are fairly straightforward, this example allows to explain the main mathematical points in a simple context. We then present our main results about instability of naive mean field in Section 3, and discuss the use of TAP free energy to overcome the instability in Section 4.
1.1 Related literature
Over the last fifteen years, topic models have been generalized to cover an impressive number of applications. A short list includes mixed membership models [EFL04, ABFX08], dynamic topic models [BL06], correlated topic models [LB06, BL07], spatial LDA [WG08], relational topic models [CB09]
, Bayesian tensor models[ZBHD15]
. While other approaches have been used (e.g. Gibbs sampling), variational algorithms are among the most popular methods for Bayesian inference in these models. Variational methods provide a fairly complete and interpretable description of the posterior, while allowing to leverage advances in optimization algorithms and architectures towards this goal (see[HBB10, BBW13]).
prove local convergence of naive mean field estimate to the true parameters for exponential families with missing data and Gaussian mixture models. In the context of Gaussian mixtures, the same authors prove that the covariance of the variational posterior is asymptotically smaller (in the positive semidefinite order) than the inverse of the Fisher information matrix[WT05]. All of these results are established in the classical large sample asymptotics with fixed. In the present paper we focus instead on the high-dimensional limit and prove that also the mode (or mean) of the variational posterior is incorrect. Notice that the high-dimensional regime is particularly relevant for the analysis of Bayesian methods. Indeed, in the classical low-dimensional asymptotics Bayesian approaches do not outperform maximum likelihood.
In order to correct for the underestimation of covariances, [WT05] suggest replacing its variational estimate by the inverse Fisher information matrix. A different approach is developed in [GBJ15], building on linear response theory.
Naive mean field variational inference was used in [CDP12, BCCZ13] to estimate the parameters of the stochastic block model. These works establish consistency and asymptotic normality of the variational estimates in a large signal-to-noise ratio regime. Our work focuses on estimating the latent factors: it would be interesting to consider implications on parameter estimation as well.
The recent paper [ZZ17] also studies variational inference in the context of the stochastic block model, but focuses on reconstructing the latent vertex labels. The authors prove that naive mean field achieves minimax optimal statistical rates. Let us emphasize that this problem is closely related to topic models: both are models for approximately low-rank matrices, with a probabilistic prior on the factors. The results of [ZZ17] are complementary to ours, in the sense that [ZZ17] establishes positive results at large signal-to-noise ratio (albeit for a different model), while we prove inconsistency at low signal-to-noise ratio. General conditions for consistency of variational Bayes methods are proposed in [PBY17].
Our work also builds on recent theoretical advances in high-dimensional low-rank models, that were mainly driven by techniques from mathematical statistical physics (more specifically, spin glass theory). An incomplete list of relevant references includes [KM09, DM14, DAM17, KXZ16, BDM16, LM16, Mio17, LKZ17, AK18]. These papers prove asymptotically exact characterizations of the Bayes optimal estimation error in low-rank models, to an increasing degree of generality, under the high-dimensional scaling with .
Related ideas also suggest an iterative algorithm for Bayesian estimation, namely Bayes Approximate Message Passing [DMM09, DMM10]. As mentioned above, Bayes AMP can be regarded as minimizing a different variational approximation known as the TAP free energy. An important advantage over naive mean field is that AMP can be rigorously analyzed using a method known as state evolution [BM11, JM13, BMN17].
Let us finally mention that a parallel line of work develops polynomial-time algorithms to construct non-negative matrix factorizations under certain structural assumptions on the data matrix , such as separability [AGM12, AGKM12, RRTB12]. It should be emphasized that the objective of these algorithms is different from the one of Bayesian methods: they return a factorization that is guaranteed to be unique under separability. In contrast, variational methods attempt to approximate the posterior distribution, when the data are generated according to the LDA model.
We denote by
the identity matrix, and bythe all-ones matrix in dimensions (subscripts will be dropped when the number of dimensions is clear from the context). We use for the all-ones vector.
We will use for the tensor (outer) product. In particular, given vectors expressed in the canonical basis as and , is the tensor with coordinates in the basis . We will identify the space of matrices with the tensor product . In particular, for , , we identify with the matrix .
Given a symmetric matrix , we denote by
its eigenvalues in decreasing order. For a matrix (or vector)we denote the orthogonal projector operator onto the subspace spanned by the columns of by , and its orthogonal complement by . When the subscript is omitted, this is understood to be the projector onto the space spanned by the all-ones vector: and .
2 A simple example: -synchronization
In synchronization we are interested in estimating a vector from observations , generated according to
where is distributed according to the Gaussian Orthogonal Ensemble , namely are independent of . The parameter corresponds to the signal-to-noise ratio.
Under model (2.1), for and any estimator , the following limit holds in probability:
How does variational inference perform on this problem? Any product probability distributioncan be parametrized by the means , and it is immediate to get
Here is obtained from by setting the diagonal entries to , and is the binary entropy function. In view of Lemma 2.1, the correct posterior distribution should be essentially uniform, resulting in vanishing. Indeed, is a stationary point of the mean field free energy : . We refer to this as the ‘uninformative fixed point’.
Is a local minimum? Computing the Hessian at the uninformative fixed point yields
The matrix is a rank-one deformation of a Wigner matrix and its spectrum is well understood [BBAP05, FP07, BGN11]. For , its eigenvalues are contained with high probability in the interval , with , as . For , , while the other eigenvalues are contained in . This implies
In other words, is a local minimum for , but becomes a saddle point for . In particular, for , variational inference will produce an estimate , although the posterior should be essentially uniform. In fact, it is possible to make this conclusion more quantitative.
Let be any local minimum of the mean field free energy , under the -synchronization model (2.1). Then there exists a numerical constant such that, with high probability, for ,
In other words, although no estimator is positively correlated with the true signal , variational inference outputs biases that are non-zero (and indeed of order one, for a positive fraction of them).
The last statement immediately implies that naive mean field leads to incorrect inferential statements for . In order to formalize this point, given any estimators of the posterior marginals, we define the per-coordinate expected coverage as
This is the expected fraction of coordinates that are estimated correctly by choosing according to the estimated posterior. Since the prior is assumed to be correct, it can be interpreted either as the expectation (with respect to the parameters) of the frequentist coverage, or as the expectation (with respect to the data) of the Bayesian coverage. On the other hand, if the were accurate, Bayesian theory would suggest claiming the coverage
The following corollary is a direct consequence of Proposition 2.2, and formalizes the claim that naive mean field leads to incorrect inferential statements. More precisely, it overestimates the coverage achieved.
Let be any local minimum of the mean field free energy , under the -synchronization model (2.1), and consider the corresponding posterior marginal estimates . Then, there exists a numerical constant such that, with high probability, for ,
While similar formal coverage statements can be obtained also for the more complex case of topic models, we will not make them explicit, since they are relatively straightforward consequences of our analysis.
3 Instability of variational inference for topic models
3.1 Information-theoretic limit
As in the case of synchronization discussed in Section 2, we expect it to be impossible to estimate the factors with strictly positive correlation for small enough signal-to-noise ratio (or small enough sample size ). The exact threshold was characterized recently in [Mio17] (but see also [DM14, BDM16, LM16, LKZ17] for closely related results). The characterization in [Mio17] is given in terms of a variational principle over matrices.
Theorem 1 (Special case of [Mio17]).
It is also shown in Appendix C.2 that is a stationary point of the free energy . We shall refer to as the uninformative point. Let be the supremum value of such that the infimum in Eq. (3.1) is uniquely achieved at :
As formalized below, for the data do not contain sufficient information for estimating , in a non-trivial manner.
Let . Then is a stationary point of the function . Further, it is a local minimum provided where the spectral threshold is given by
Finally, if , for any estimator , we have
for a constant.
We refer to Appendix C for a proof of this statement.
Note that Eq. (3.4) compares the mean square error of an arbitrary estimator , to the mean square error of the trivial estimator that replaces each column of by its average. This is equivalent to estimating all the weights by the uniform distribution . Of course, . However, this upper bound appears to be tight for small .
Solving numerically the -dimensional problem (3.1) indicates that for and .
3.2 Naive mean field free energy
We consider a trial joint distribution that factorizes according to rows ofand according to Eq. (1.5). It turns out (see Appendix D.2) that, for any stationary point of over such product distributions, the marginals take the form
where is the density of , and is the density of , and are defined implicitly by the normalization condition . In the following we let , denote the set of parameters in these distributions; these can also be viewed as matrices and whose -th row is (in the former case) or (in the latter).
It is useful to define the functions and as (proportional to) expectations with respect to the approximate posteriors (3.5)
For , we overload the notation and denote by the matrix whose -th row is (and similarly for ).
Note that Eq. (3.10) implies the following convex duality relation between and
By strict convexity of ,
(the latter is strongly convex on the hyperplane, ) we can view as a function of or . With an abuse of notation, we will write or interchangeably.
A critical (stationary) point of the free energy (3.9) is a point at which . It turns out that the mean field free energy always admits a point that does not distinguish between the latent factors, and in particular , , as stated in detail below. We will refer to this as the uninformative critical point (or uninformative fixed point).
Define and let be any solution of the following equation in
(Such a solution always exists.) Further define
Then the naive mean field free energy of Eq. (3.9) admits a stationary point whereby, for all , ,
3.3 Naive mean field iteration
As mentioned in the introduction, the variational approximation of the free energy is often minimized by alternating minimization over the marginals , of Eq. (1.5). Using the parametrization (3.5), we obtain the following naive mean field iteration for (see Appendix D.2):
Note that, while the free energy naturally depends on the , , the iteration sets , , independent of the indices . In fact, any stationary point of can be shown to be of this form.
Any critical point of the free energy (3.9) is a fixed point of the naive mean field iteration and vice-versa, as follows from Appendix D.3. In particular, the uninformative critical point is a fixed point of the naive mean field iteration.
In view of Section 3.1, for , the real posterior should be centered around a point symmetric under permutations of the topics. In particular, the posterior over the weights of document should be centered around the symmetric distribution . In other words, the uninformative fixed point should be a good approximation of the posterior for .
A minimum consistency condition for variational inference is that the uninformative stationary point is a local minimum of the posterior for . The next theorem provides a necessary condition for stability of the uninformative point, which we expect to be tight. As discussed below, it implies that this point is a saddle in an interval of below . We recall that the index of a smooth function at stationary point is the number of the negative eigenvalues of the Hessian .
If , then there exists such that the uninformative critical point of Lemma 3.2, is, with high probability, a saddle point, with index at least and .
Correspondingly is an unstable critical point of the mapping in the sense that the Jacobian has spectral radius larger than one at .
In the following, we will say that a fixed point is stable if the linearization of at (i.e. the Jacobian matrix ) has spectral radius smaller than one. By the Hartman-Grobman linearization theorem [Per13], this implies that is an attractive fixed point. Namely, there exists a neighborhood of such that, initializing the naive mean field iteration within that neighborhood, results in as . Vice-versa, we say that is unstable if the Jacobian has spectral radius larger than one. In this case, for any neighborhood of , and a generic initialization in that neighborhood, does not converge to the fixed point.
3.5 Numerical results for naive mean field
In order to investigate the impact of the instability described above, we carried out extensive numerical simulations with the variational algorithm (3.19), (3.20). After any number of iterations , estimates of the factors , are obtained by computing expectations with respect to the marginals (3.5). This results in
Note that can be used as the state of the naive mean-field iteration instead of .
We select a two-dimensional grid of ’s and generate different instances according to the LDA model for each grid point. We report various statistics of the estimates aggregated over the instances. We have performed the simulations for and . For space considerations, we focus here on the case , , and discuss other results in Appendix E. (Simulations for other values of also yield similar results.)
We initialize both the naive mean field iteration near the uninformative fixed-point as follows:
Here has entries and and is the estimate at the uninformative fixed point. We run a maximum of and a minimum of iterations, and assess convergence at iteration by evaluating
where the minimum is over the set of permutation matrices. We declare convergence when . We denote by , the estimates obtained at convergence.
Recall the definition . In order to investigate the instability of Theorem 2, we define the quantities
In Figure 1 we plot empirical results for the average , for , and four values of . In Figure 2, we plot the empirical probability that variational inference does not converge to the uninformative fixed point or, more precisely, with , evaluated on a grid of values. We also plot the Bayes threshold (which we find numerically that it coincides with the spectral threshold ) and the instability threshold .
It is clear from Figures 1, 2, that variational inference stops converging to the uninformative fixed point (although we initialize close to it) when is still significantly smaller than the Bayes threshold (i.e. in a regime in which the uninformative fixed point would a reasonable output). The data are consistent with the hypothesis that variational inference becomes unstable at , as predicted by Theorem 2.
Because of Proposition 3.1, we expect the estimates produced by variational inference to be asymptotically uncorrelated with the true factors for
. In order to test this hypothesis, we borrow a technique that has been developed in the study of phase transitions in statistical physics, and is known as the Binder cumulant[Bin81]. For the sake of simplicity, we focus here –again– on the case , deferring the general case to Appendix E. Since in this case , , we can encode the informative component of these matrices by taking the difference between their columns. For instance, we define , and analogously , , . We then define
Here denotes empirical average with respect to the sample, , and we set