Extreme Multi-label Learning using Mixture of Factor Analyzers
We present a family of expectation-maximization (EM) algorithms for binary and negative-binomial logistic regression, drawing a sharp connection with the variational-Bayes algorithm of Jaakkola and Jordan (2000). Indeed, our results allow a version of this variational-Bayes approach to be re-interpreted as a true EM algorithm. We study several interesting features of the algorithm, and of this previously unrecognized connection with variational Bayes. We also generalize the approach to sparsity-promoting priors, and to an online method whose convergence properties are easily established. This latter method compares favorably with stochastic-gradient descent in situations with marked collinearity.READ FULL TEXT VIEW PDF
We show that under mild conditions, Estimation of Distribution Algorithm...
We present a noise-injected version of the Expectation-Maximization (EM)...
A new approach for signal parametrization, which consists of a specific
In this work, we consider a manufactory process which can be described b...
Expectation-maximization (EM) is a popular and well-established method f...
Human feedback is widely used to train agents in many domains. However,
Decentralized POMDPs provide an expressive framework for multi-agent
Extreme Multi-label Learning using Mixture of Factor Analyzers
Consider a logistic regression, where for a fixed number of binary trials,
; and where the log-odds of success are modeled as a linear function ofpredictors:
Suppose further that is given a normal prior with mean and covariance
, and that our goal is to estimate, or approximate, the posterior distribution of.
The purpose of this article is to derive the solution to this problem embodied by Algorithm 1, and to study both its subsequent elaborations and its connection with variational-Bayes inference. The algorithm iterates between two simple steps: (1) use the current value of to update a diagonal matrix of working parameters , and (2) use the new to construct and solve a set of normal equations for the new . Using recently developed distributional theory from , it is straightforward to show that these two steps form an exact expectation-maximization (EM) algorithm. The ’s play the role of notionally missing data; their conditional distribution, which falls within the family of infinite convolutions of gammas, will be described shortly. The ascent property of the EM algorithm, together with the log-concavity of the posterior distribution, guarantee that the sequence of iterates converges to the posterior mode.
The details of Algorithm 1 may be familiar to many readers in a different guise, particularly the functional form of the update for the “missing” . This is not an accident: our method is closely related to the variational-Bayes approach to logistic regression described by . Indeed, one way of arriving at Algorithm 1 is via a pure variational argument appealing to convex duality. But we pursue an entirely different, probabilistic line of argument, giving rise to subtle and importance differences from the typical variational-Bayes approach.
Section 2 introduces the method. Section 3 pursues the connection with variational Bayes. Section 4 describes online and sparse variants of the algorithm. A supplemental file contains a large suite of numerical experiments, beyond the few described in the main paper.
. This includes both the logit and negative-binomial models as special cases. These are likelihoods that involve products of the form
where is a linear function of parameters, and where and involve the response for subject . The authors of  exploit these facts to derive an efficient Gibbs sampler for the logit model. In contrast, our focus is on using the same representation to derive expectation-maximization algorithms.
The key result is that terms of form (1
) are mixtures with respect to a Polya-Gamma distribution. Define, as the infinite convolution of gammas having Laplace transform
The second equality arises from the fact that the hyperbolic cosine function is holomorphic over the entire complex plane, and according to the Weierstrass factorization theorem, may therefore be represented as a product in terms of its zeros. This Laplace transform is easily inverted by recognizing each term in the product as the Laplace transform of a Gamma distribution. We therefore conclude that if , then it is equal in distribution to an infinite sum of gammas:
where each is an independent Gammarandom variable. The general class is constructed via exponential tilting of the density:
The corresponding Laplace transform may be calculated and inverted by a similar path. We omit the details of this calculation, which leads directly to the following definition.
A random variable has a Polya-Gamma distribution with parameters and , denoted , if
where each is an independent gamma random variable.
To make matters concrete, we take the case of binomial logistic regression, where we observe triplets , respectively denoting the number of trials, the number of successes, and the predictors for case . In many data sets we simply have and either 0 or 1, but this need not be the case. Indeed, the negative-binomial case arises when for some overdispersion parameter . For all models of this form, the likelihood in is
where is the linear predictor. The fundamental integral identity arising from the Polya-Gamma data-augmentation trick (see ) allows us to rewrite each term in the likelihood as follows.
where , and where the mixing distribution is Polya-Gamma; this identity arises from evaluating the Laplace transform (2) at . Moreover, the conditional distribution for that arises in treating the above integrand as a joint density is simply an exponential tilting of the prior, and is therefore also in the Polya-Gamma family: .
Appealing to (5), the complete-data log-likelihood , given all , is a quadratic form in . This yields a conditionally Gaussian likelihood:
This expression is linear in , meaning that the EM algorithm has an especially simple structure. In the E step, given the current estimate , we compute , where is the conditional expected value of , given the data and the current iterate for . We describe this calculation in detail below. Meanwhile, in the step, we choose the new to maximize
. This expression looks like the kernel of a normal distribution. Collecting terms and completing the square, we find that the new value ofmaximizes a quadratic form involving complete-data sufficient statistics and :
Here is the diagonal matrix , and
is the column vector. Thus solves the linear system . We have carried through the calculation without accounting for the contribution of the prior. But the log prior density is also a quadratic form in and therefore combines easily with the above to yield the complete-data posterior distribution:
If we wish, we can solve this system directly at each step. Alternatively, if this is too costly, we can simply take a step in the right direction toward the solution starting from the previous value using, for example, the linear conjugate-gradient algorithm. This will typically be much faster than a full solve, and will still lead to global convergence, as it is sufficient to take a partial M step that merely improves the observed-data objective function.
While the density of a Polya-Gamma random variable can be expressed only as an infinite series, its expected value may be calculated in closed form. To see this, apply the Weierstrass factorization theorem once more to calculate the Laplace transform of a distribution as
Differentiating this expression with respect to , negating, and evaluating the result at yields for a random variable . Therefore, the E step in our EM algorithm simply involves computing
and using these to construct the complete-data sufficient statistics previously defined. The fact that the algorithm converges to the posterior mode follows trivially from the ascent property of EM algorithms, together with the log-concavity of the posterior distribution.
Quasi-Newton acceleration is a powerful technique for speeding up an EM algorithm. Suppose that we decompose the observed-data log posterior as , where is the complete-data contribution arising from the Polya-Gamma latent-variable scheme, and is the remainder term. The corresponding decomposition of the Hessian matrix is: ; the fact that is a non-negative definite matrix follows from the information inequality. As the complete-data log-posterior is Gaussian, is the inverse of the covariance matrix given in (7). The idea of quasi-Newton acceleration is to iteratively approximate the Hessian of the remainder term, , using a series of inexpensive low-rank updates. The approximate Hessian is then used in a Newton-like step to yield the next iterate, along with the next update to . Because both Newton’s method and the M step of the EM algorithm require solving an order- linear system, the only extra per-iteration cost is forming a (comparatively cheap) low-rank update to the approximation for . See  for a full explanation of the general theory of quasi-Newton acceleration.
A few general conclusions emerge from our experiments. First, iteratively re-weighted least squares often fails to converge, especially when initialized from a poor location. As many others have observed, this reflects the numerical instability in evaluating the true Hessian matrix far away from the solution. Third, the EM algorithm is robust due to the guaranteed ascent property, but slow. Finally, the QN-EM algorithm is equally robust, but far faster. On our experiments, it usually required between 10 and 100 times fewer iterations than the basic EM to reach convergence.
We now draw a sharp connection with the variational algorithm for Bayesian logistic regression described by . Let denote the contribution to the log-likelihood of case :
where is implicitly a function of . The second term, , is symmetric in and concave when considered as a function of . Appealing to standard results from convex analysis, we may therefore write it in terms of its Legendre dual :
where . Moreover, by differentiating both sides of the above equation, the optimal value for as a function of , is easily shown to be
This leads to the following variational lower bound for :
This holds for any choice of , with equality achieved at . Upon defining and , it is readily apparent that each summand in (6) is identical to the expression just given in (10), up to an additive constant involving . In each case, the approximate posterior takes the Gaussian form given in (7).
This provides a probabilistic interpretation to the purely variational argument of . It is especially interesting that we are able to identify , the argument of the dual function , as a rescaling of the missing data
that arise from a complete different line of argument under the Polya-Gamma EM approach. Both terms play the role of the inverse-variance in a conditionally Gaussian likelihood. For further insight, compare the expression for the optimum value ofwith the expression for the conditional expected value of in Formula (9) and the inner loop of Algorithm 1.
Despite this equivalence of functional form, the EM and variational-Bayes algorithms do not give the same answers. The former converges to the posterior mode and treats as missing data having a Polya-Gamma mixing distribution, while the latter operates by a fundamentally different inferential principle. From Formula (10), the marginal likelihood of the data satisfy the following lower bound:
The second integral is analytically available, owing to the fact that the lower bound from (10) is a quadratic form in . The variational-Bayes approach is to choose to make this lower bound on the marginal likelihood as tight as possible. It leads to , where are the mean and variance of the previous approximation, given in (7). By comparison, the EM algorithm implicitly sets . Notice that they differ by a factor that involves the Mahalonobis norm of .
Thus we would summarize the key similarities and differences between the approaches as follows. First, both algorithms yield updates for that are identical in their functional dependence upon “local” parameters , even though these local parameters have different interpretations. Moreover, both algorithms yield updates that are identical in functional form. But the EM algorithm treats as missing data having a known conditional expected value, and therefore evaluates at the current value of the linear predictor. In contrast, the variational-Bayes approach, is treated as a function of a further variational parameter , with the chosen to maximize the lower bound on the marginal likelihood of the data. Thus at each step the chosen ’s are different because of the different arguments at which is evaluated.
Finally, both algorithms yield an approximate Gaussian posterior that can be interpreted as a complete-data posterior distribution in an EM, where the imputed data have a Polya-Gamma conditional distribution. In the EM approach, this Gaussian approximation is centered at the posterior mode. In the variational-Bayes approach, it is centered on some other point in the parameter space that, by definition, has a lower posterior density. By a parallel line of reasoning, one also concludes that the stationary point of EM algorithm leads to a worse lower bound on the marginal likelihood, although it is not clear how the tightness of this lower bound translates into accuracy in approximating the posterior for
. In our numerical experiment reported in the supplement (which used MCMC as a benchmark), the variational posterior was typically centered somewhere between the mean and mode of the true posterior distribution (which is often quite skewed, especially for larger coefficients). The posterior variances of the two algorithms tend to be nearly identical, and much smaller than the true variances. In fact, we can identify the degree of overconfidence of both methods using standard tools from EM theory, specifically the fraction of missing information introduced by the PG data augmentation scheme. See and .
We can also introduce a penalty function to the expected log likelihood. The obvious and popular choice is the penalty, leading to a new complete-data objective function:
This is a convex problem that can be solved efficiently using coordinate descent and an active-set strategy . Most updates are of order , where is the size of the active set; full steps need be taken only rarely. The updates can be made still more efficient by exploiting any sparsity that may be present in the design matrix .
There are many proposals for fitting sparse logistic regression models; see  for a review. Most of these methods share one thing in common: they treat the likelihood using the same device employed in Fisher scoring, or iteratively re-weighted least squares (IRLS). That is, given the current estimates of the parameters , a quadratic approximation to the log-likelihood is formed:
where is constant in the parameter, and where
with the estimated success probabilityevaluated at the current parameters. The various approaches differ in the manner by which one finds the solution to the penalized, iteratively re-weighted least squares objective function:
where is the penalty function. As a result, these methods potentially inherit the numerical instability of iteratively re-weighted least squares, especially if initialized poorly. In contrast, our approach handles the likelihood term using a missing-data argument, leading to a different quadratic form at each iteration. Numerical evidence presented in the supplement suggests that the data-augmentation method is more robust, and leads to solution paths in that achieve lower classification error than those based on penalized iteratively re-weighted least squares.
We now consider an online version of the EM algorithm, which is much more scalable, has essentially the same convergence guarantees as batch EM, and operates without ever loading the whole data set into memory. A good reference for the general theory of online EM algorithms is .
Suppose we have observed up through data point , and that we have a current estimate of the complete-data sufficient statistics; call these and , in which case is the maximizer of the corresponding complete-data objective function (7). Now we see a new triplet . In the online E-step, we update the sufficient statistics as a convex combination of the old statistics and a contribution from the new data:
The simplicity of this step owes to the linearity of the complete-data likelihood in . Here , and is given by Formula (9), evaluated at the current estimate . Then in the -step, we solve (or take a step towards the solution of) the log-Gaussian posterior density with sufficient statistics and .
This can also be done in mini-batches of size , with the obvious modification to the the sufficient-stat updates from the single-data-point case. Let and be -vectors of trials and successes for the observations in batch , and as an matrix of regressors for these observations. The updates are then
where is the diagonal matrix of terms arising from the current mini-batch, each computed using Equation 9 evaluated at the current estimate for . Empirically, we have found that mini-batch sizes within an order of magnitude of (the dimension of ) have given robust performance, and that the method is unstable when processing only a data point at a time.
In our experience, the online version of the algorithm is usually much faster than the batch algorithm, even in batch settings. It is also just as accurate. The only tuning parameter is the learning rate , which arises in all online-learning algorithms. The formal convergence requirement from  is that the follow a decay schedule where
Thus we adapt to new data fast enough (the non-summability condition), but not so fast that the estimate bounces around the truth forever without ever converging (the square-summability condition). This is the usual requirement of the updating rule in any stochastic-approximation algorithm. A simple way of ensuring this is to choose for some .
We follow the advice of , which is to use very close to , and then to smooth the estimate of using Polyak-Ruppert averaging: taking the final estimate of to be the average solution for over the final iterations, where is large enough so that the algorithm has settled down.
The online EM is a second-order method that requires working with a matrix. We therefore do not anticipate that it will scale to the very largest of problems, at least with present computing technology. Indeed, for ultra-large-scale problems such as those considered by , of order , it is infeasible to even form the sufficient statistics and , much less to solve the linear system in the final line of Algorithm 2. Here methods based on stochastic gradient descent (e.g. ) are probably the better choice, if only because nothing else beyond a first-order method will run.
Nonetheless, there is a vast middle ground of potential problems where online EM may offer notable advantages compared to stochastic gradient-descent: problems that are too large to be solved easily by batch methods, but that are “small enough” (which may still mean thousands of parameters) to make forming the sufficient statistics and feasible. The fundamental tradeoff is one of per-iteration cost (where a first-order method clearly wins) versus distance traversed per iteration (where a second-order method clearly wins). When the design points are highly collinear, SGD can be extremely slow to converge, and the per-iteration cost of online EM may be worth it. As argued by , constants matter when comparing the actual efficiency of first-order versus second-order methods, and these constants involve features of the the ’s that may not be known beforehand.
As an example, Figure 1 shows the performance of the online EM on a simulated data set of 250 predictors and 100,000 observations with collinear predictors. We benchmarked against stochastic gradient descent. The design points were correlated multivariate normal draws, with covariance matrix , being a factor loadings matrix with standard normal entries. The columns of were subsequently rescaled to have marginal variance , ensuring a standard normal distribution for the linear predictors . We used multiple passes for both algorithms (50 passes for SGD, 3 passes for online EM), with each pass scanning the data points in a random order. The algorithms used a similar decay schedule, with the online EM processing data in batches of size 500. We chose 3 and 50 to yield similar computing times for the two methods. Although this ratio will clearly depend upon the size of the problem, at least for this (nontrivially large) problem, online EM is clearly the more efficient choice.
There are three other advantages of online EM, at least for problems that live in this middle ground. First and most obviously, online EM gives an approximation to the entire posterior distribution, rather than just a point estimate (assuming that the conditional sufficient statistics are re-scaled by the total number of data points processed). Even if the error bars that arise from the complete-data posterior are too optimistic (ala variational Bayes), they are better than nothing, and still allow meaningful relative comparisons of uncertainty. Second, stochastic gradient descent is notoriously brittle with respect to the decay schedule of the learning rate, with poor choices leading to outrageously bad performance. Online EM, especially when coupled with Polyak-Ruppert averaging, tends to be much more robust.
Finally, merging SGD with sparsity-promoting penalties is known to be challenging, and is typically restricted to an penalty . In contrast, nonconvex (heavy-tailed) penalties are easily incorporated into both the batch and online versions of our approach. For example, the authors of  report considerable success with a sparse second-order method even for very large problems. The disadvantage of their method is that convergence is not formally guaranteed even without sparsity; by contrast, the convergence of our approach follows straightforwardly from standard results about online EM.
We have introduced a family of expectation-maximization algorithms for binomial and negative-binomial regression. The existence of such an algorithm, and its intimate connection with variational Bayes, have not previously been appreciated in the machine-learning community. Indeed, the fact that the same local parametersarise in both algorithms, despite having very different interpretations and constructions, is noteworthy and a bit puzzling. It suggests interesting connections between two fundamental operations in statistics—namely profiling and marginalization—that are not usually thought of as being part of the same constellation of ideas. The strengths of our method are: (1) that it is very robust thanks to the ascent property of EMs; (2) that it leads to error bars very similar to those that arise from variational Bayes, but with a guarantee of consistency; (3) that is can easily be extended to incorporate sparsity-inducing priors; and (4) that it leads straightforwardly to a robust second-order online algorithm whose convergence properties are easily established. Further research is clearly needed on the performance of the online EM, in order to establish the circumstances (as suggested by Figure 1) in which it will outperform stochastic gradient descent for a fixed computational budget.
In conclusion, we refer interested readers to the appendix, which contains several details not described here:
A comparison of the quality of posterior approximations arrived at by VB, EM, and QNEM on a simple example.
A study of the data-augmentation approach to handling logit likelihoods, versus penalized iteratively re-weighted least squares, in the context of sparsity-inducing priors.
An extension to the multinomial case.
Fully bayesian inference for neural models with negative-binomial spiking.In Advances in Neural Information Processing Systems (NIPS), volume 25, 2012.
Algorithms for sparse linear classifiers in the massive data setting.Journal of Machine Learning Research, 9:313–37, 2008.
This section describes a simple simulate example whose purpose is to understand two questions. (1) How similar are the variational-Bayes approximation and the complete-data log posterior arising from the EM algorithm? (2) Does using the estimated “remainder” Hessian matrix from QNEM lead to more sensible error bars, when judged against the posterior distribution calculated by MCMC?
We simulated a small data set with coefficients , independent, standard normal predictors , and observations from a logistic-regression model. We then used four algorithms to estimate :
Expectation-maximization, as described in the main manuscript.
Expectation-maximization with quasi-Newton acceleration.
Variational Bayes, as described in the main manuscript and in .
In all cases a vague mean-zero normal prior with precision on each coefficient was used.
Using MCMC samples (after an initial burnin of samples), we computed central credible intervals for each coefficient. These are show as black lines in Figure 2. The posterior means are shown as black crosses, and the posterior modes as black dots (the marginal posteriors for this problem are notably skewed). Using the other three methods, we constructed symmetric
approximate credible intervals using the estimate and the approximate posterior standard deviations. As Figure 1 shows, the VB and EM estimates have essentially identical approximate posterior standard deviations. The only visible difference is that the EM estimates are centered about the posterior mode, while the VB estimates are centered somewhere between the mean and the mode. The QNEM estimates have significantly higher spread than either the VB or EM intervals, and in this respect are much more in line with the true standard deviations.
The main manuscript makes the following claim:
[Existing methods for sparse logistic regression] potentially inherit the numerical instability of iteratively re-weighted least squares, especially if initialized poorly. In contrast, our approach handles the likelihood term using a missing-data argument, leading to a different quadratic form at each iteration. Numerical evidence presented in the supplement suggests that our method is more robust, and leads to solution paths in that achieve lower classification error than those based on penalized iteratively re-weighted least squares.
The goal of this section is to provide the evidence in support of this claim. Consider a quadratic approximation to the log-likelihood:
where is constant in the parameter, and where
with the estimated success probability evaluated at the current parameters. Most existing approaches use this approximation to the likelihood, and differ in the manner by which one finds the solution to the penalized, iteratively re-weighted least squares objective function:
where is the penalty function. In this section, we simply consider the case for all , and the penalty:
For example, in , a coordinate descent algorithm is used to solve the penalized weighted least-squares problem. The coordinate-wise update has the form
where is the fitted value excluding the contribution from , and is the soft-thresholding operator with value
Algorithm 3 summarizes the approach for fixed .
Our approach, on the other hand, is to employ the Polya-Gamma data augmentation trick for handling the logit likelihood and to represent the prior distribution as a mixture of normals:
We then use the linear conjugate gradient algorithm to optimize the resulting log posterior distribution. The complete-data log posterior, conditioning on the ”missing data” , can be computed as
In M step, we employ the same step as in Batch EM without sparsity. The complete-data sufficient statistics are
Thus solves the linear system .
To perform the E-step, we can simply replace and with their conditional expectations and , given the observed data and the current
. The conditional momentsand are given by the following expressions:
We can solve the linear system directly at each step, or we can instead use the linear conjugate gradient algorithm to solve the system. Here we have different variations for conjugate gradient algorithm. We can try n-step ()Conjugate Gradient (CG) Method, or -tolerance CG Method, or just one-step CG method. In p-step CG method, we force the algorithm to run until it reached the exact solution. In -step CG method, we stop after iterations, while in -tolerance CG Method we stop the algorithm when converges in -tolerance.
Notice that our data-augmentation trick can also be considered as a re-weighted least squares algorithm, but with different weights. The corresponding weights and working responses are
Here we will test different combinations of weights derived from IRLS (1) versus our data-augmentation scheme (2); and coordinate descent method (3) versus conjugate gradient method (4) for solving each iteration’s subproblem. There are four combinations, and our goal is to see which combination performs best. The numerical results are presented in the next sub-section.
Data augmentation also allows us to deal with bridge penalty/prior as well. Here we want to minimize
with . The bridge penalty can be represented as a mixture of normals:
where follows a Stable distribution. The conditional moments are given by
The corresponding algorithm for our batch EM is show above (Algorithm 6).
We now describe our numerical experiments on simulated data sets. For each dataset, we model them as logistic regression problems. We tested the four combinations of WLS/DA (for the likelihood) and CG/CD (for the prior/penalty) stated above for lasso penalty. We also tested our data augmentation approach combining (using the conjugate gradient algorithm)assuming a bridge penalty with equal both to 0.5 and 0.75.
We simulated as a by matrix with random and elements, and set the true equal to (with alternating signs) for the first 10 coefficients, with the rest being zero. The four graphs of the solution paths (as a function of ) for each algorithm are show in Figures 3 through 6. The grid size is 0.01. There are clear differences among the solutions paths for each algorithm, with data augmentation algorithms leading to systematically lower values of the penalized likelihood.
To get a further idea of which algorithm is providing the best solution, we also looked at out of sample performance. Specifically, we used 80% of the observations to estimate for each value of using each algorithm, and then used the estimated to predict the remaining 20% observations. We calculated the mean of the incorrect classifications across over 1000 random train/test splits. The results as a function of are shown in Figure 7. Overall the DA + coordinate-descent combination is a bit better than the others.
When the response variablehas more than levels, the linear logistic regression model can be generalized to a multinomial logistic model. Suppose we observe data , where is an integer outcome denoting membership in one of classes, and is a -vector of predictors. Under the multinomial logit model, the probability of observation falling in class is assumed to be
where is a -vector of regression coefficients for class .
Let be the indicator response matrix, with elements . Then we want to maximize the penalized log-likelihood:
For lasso penalty, the penalty function is .
Allowing only to vary for a single class a time, a partial quadratic approximation given current estimate to the log-likelihood part of (25) is
The approach is similar to logistic regression, except now for each the classes are cycled over in the outer loop, and for each class there is a partial quadratic approximation about the current parameters . Then coordinate descent is used to solve the penalized weighted least-squares problem. Notice that and give the same log-likelihood, yet a different penalty. Therefore, the estimate can be improved by solving
This can be done separately for each coordinate, and leads to being recentralized by choosing to be the median of the .
For each , the algorithm is:
Our data augmentation approach can be used in a parallel fashion to the binomial logit case, leading to an ECM (expectation, conditional maximization) algorithm. The difference is that we fix corresponding to class to be to ensure identifiability, and thus interpret the other coefficients in terms of changes in log-odds relative to the first category. We phrase the problem as one of maximizing the posterior density