1 Introduction
Maximum likelihood estimation of latent (hidden) variable models is computationally challenging because one must integrate over the latent variables to compute the likelihood. Often, the integral is highdimensional and computationally intractable so one must resort to approximation methods such as variational expectationmaximization
(Beal and Ghahramani, 2003).The variational expectationmaximization algorithm alternates between maximizing an evidence lower bound (ELBO) on the loglikelihood with respect to the model parameters and maximizing the lower bound with respect to the variational distribution parameters. Variational expectationmaximization is popular because it is computationally efficient and performs deterministic coordinate ascent on the ELBO surface. However, variational inference has some statistical and computational limitations. First, the ELBO is often multimodal, so the deterministic algorithm may converge to a local rather than a global optimum depending on the initial parameter estimates. Second, the variational distribution is often chosen for computational convenience rather than for accuracy with respect to the posterior distribution, so the lower bound may be far from tight. The magnitude of the cost of using a poor approximation is typically not known. Here, we develop a deterministic global optimization algorithm for variational expectationmaximization to address these the first issue and quantify the optimal ELBO for an example model to address the second issue.
Related work
It is wellknown that expectationmaximization algorithms can converge to a local optimum rather than a global optimum due to local modes or saddlepoints in evidence lower bound on the loglikelihood function. In a discussion for the Dempster et al. (1977) paper on expectationmaximization, Murray (1977) described a situation he frequently observed where the EM algorithm converged to a fixed point, but not the global optimum of the loglikelihood. Murray provided a simple data set that reproduced the convergence problems and advocated a solution that continues to be the common practice today – restart the algorithm at multiple initial points and hope that it finds a global optimum. Later, Wu (1983) described regularity conditions that ensure the EM algorithm converges to a global optimum. However, these regularity conditions are difficult to check in practice. Typically, the practical solution remains the one advocated by Murray – restart the algorithm at many initial values and retain the parameter estimates that give the maximum across restarts.
Contributions
In this article, we present a deterministic algorithm for variational inference that converges to the global optimum of the evidence lower bound on the loglikelihood. Global optimization methods such as the one we present here are important to study for several reasons. First, in applications where collecting data is expensive or timeconsuming (e.g. medical diagnostics and robotic planning in complex environments) we often have large, but not massive, data sets and we need an inference algorithm that maximizes the use of the information present in the sample while also being computationally tractable. Second, global optimization methods provide new avenues for developing computationallyefficient inference algorithms that have not been extensively studied. Lastly, a guaranteed global optimum parameter estimate allows us to benchmark approximate variational inference methods and approach the analysis of the statistical properties of those methods.
The remainder of this article is organized as follows. In this section, we present the general variational inference problem statement for hierarchical exponential family latent variable models. We summarize the properties and algorithmic steps for the GOP algorithm in Section 2. We derive the GOP inference algorithm for two approximations of the Bayesian Gaussian mixture model in Section 3 and Section 4. In those sections, we describe experimental results comparing the GOP algorithm to variationalEM and BARON. In the latter section, we compare the quality of the optimum for two variational models.
Problem statement
Given a statistical model, the observeddata loglikelihood function, or simply the loglikelihood, is , where is the observed data set and is the set of model parameters. The maximum likelihood parameter estimate is . If the model,
, is in the exponential family and all of the random variables are observed, the maximum likelihood optimization problem is convex and there is a unique maximum. However, if the model has unobserved random variables, we must integrate over those random variables and the loglikelihood may be multimodal. When the integral is computationally intractable, we must resort to approximation methods to compute the maximum likelihood parameter estimate.
In a latent variable exponential family model, we introduce the unobserved or latent variable with the conditional distribution, , and marginal distribution, . Our observed data set is i.i.d. . The complete data set is . Taking , the loglikelihood of the model is
(1) 
Unfortunately, the integral over in the loglikelihood function is often computationally intractable. By introducing an averaging distribution, , we can form a lower bound on the observeddata loglikelihood using Jensen’s inequality,
Then, we can obtain an approximate maximum likelihood estimate by maximizing this evidence lower bound (ELBO) over both the averaging distribution and the model parameters, . The first term in the ELBO is the expected completedata loglikelihood and the second term is the entropy of the variational distribution.
In the expected complete loglikelihood, there is a coupling between the variational distribution parameters, , and the model parameters, . This coupling makes optimizing the ELBO difficult in practice because the resulting optimization problem is nonconvex.
2 GOP Variational Inference
The general Global OPtimization (GOP) algorithm was first developed by Floudas and Visweswaran (1990). They extended the decomposition ideas of Benders (1962) and Geoffrion (1972) in the context of solving nonconvex optimization problems to global optimality (Floudas and Visweswaran, 1993; Floudas, 2000). While the framework has been successfully used for problems in process design, control, and computational chemistry, it has not, to our knowledge, been applied to variational statistical inference.
[Add reference to linear regression here.]
Here, we extend the GOP algorithm for variational inference in hierarchical exponential family models. In Section 2.1, we state the problem conditions necessary for the GOP algorithm. Then, we briefly review the mathematical theory for the GOP algorithm in Section 2.2. Finally, in Section 2.3, we outline the general GOP algorithmic steps. The review of the GOP algorithm in this section is not novel and necessarily brief. A more complete presentation of the general algorithm can be found in Floudas (2000).
2.1 Problem Statement
GOP addresses biconvex optimization problems that can be formulated as
(2) 
where and are convex compact sets, and and
are vectors of inequality and equality constraints respectively. We require the following conditions on (
2):
is convex in for every fixed , and convex in for every fixed ;

is convex in for every fixed , and convex in for every fixed ;

is affine in for every fixed , and affine in for every fixed ;

firstorder constraints qualifications (e.g. Slater’s conditions) are satisfied for every fixed .
If these conditions are satisfied, we have a biconvex optimization problem and we can use the GOP algorithm to find the global optimum.
Partitioning and transformation of decision variables
Variational inference is typically formulated as an alternating coordinate ascent algorithm where the evidence lower bound is iteratively maximized with respect to the model parameters and the variational distribution parameters. However, the objective function construed under that partitioning of the decision variables is not necessarily biconvex. Instead, we consider all of the decision variables (model parameters and variational parameters) in the variational inference optimization problem jointly. Then, we are free to partition variables as we choose to ensure the GOP problem conditions are satisfied.
One may be able to transform variables in the original variational optimization problem, adding equality constraints as necessary, to satisfy the GOP problem conditions. For larger problems and more complex models, the partitioning of the decision variables may not be apparent. Hansen et al. (1993) proposed an algorithm to bilinearize quadratic and polynomial function problems, rational polynomials, and problems involving hyperbolic functions.
Biconvexity for exponential family hierarchical latent variable models
The variational EM problem is a biconvex optimization problem under certain conditions. We cast the variational EM problem into a biconvex convex optimization form by partitioning the model parameters, , and the variational parameters into and such that the GOP conditions are satisfied. All of the functions are analytical and differentiable. For two choices of approximating distribution for a Bayesian Gaussian mixture model, we find a suitable partition of the decision variables. However, what conditions guaranetee such a partition for general exponential family hierarchical models remains an open question.
2.2 GOP Theory
Primal problem
Because the original problem (2) satisfies the biconvexity conditions, fixing (say to ) results in a convex subproblem,
(3) 
called the primal problem. The restriction of to a fixed value is equivalent to solving the original problem with additional constraints. Therefore, the solution of the primal problem provides an upper bound on the original problem (2). Since the problem is convex, it can be solved with any convex optimization solver.
In variational inference, this step would typically be considered the Estep if was the variational parameters and the Mstep if was the model parameters. However, we have partitioned the decision variables to formulate the problem as a biconvex optimization problem, therefore the direct EMstep interpretation is lost.
Relaxed dual problem
We can write the original optimization problem (2) as
in order to cast the problem as inner and outer optimization problems using a projection of the problem onto the space of variables:
s.t.  
where 
For a given , the inner minimization problem in (2.2) is simply the primal problem. Therefore, the function can be defined for a set of solutions of the primal problem (3) for different values of . However, since is defined implicitly (2.2) can be as difficult to solve as the original problem (2).
Because the original problem satisfies Slater’s conditions, by nonlinear duality theory, the optimal objective function value of the primal problem (3) for any is identical to the solution of its corresponding dual problem ,
Substituting into the primal problem (3) gives
Using the definition of the supremum we can form the bound
Then, we have an equivalent problem to (2.2),
s.t.  
The last two conditions in the dual problem (2.2) define an implicit set of constraints making the solution of the problem difficult.
Dropping the last two conditions gives a relaxed dual problem,
(6) 
where and the Lagrange function for the primal problem (3) is
(7) 
We call the minimization problem in (6) the inner relaxed dual problem,
(8) 
where denotes the value of the variables at the optimal solution of the inner relaxed dual problem.
In summary, the primal problem (3) contains more constraints than the original problem because is fixed and so provides an upper bound on the original problem. The relaxed dual problem (6) contains fewer constraints than the original problem and so provides a lower bound on the original problem. Because the primal and relaxed dual problems are formed from an innerouter optimization problem, they can be used to form an iterative alternating algorithm to determine the global solution of the original problem. Making the iteration explicit, the primal problem is
(9) 
and the relaxed dual problem is
(10) 
where
(11) 
The key here is that information is passed from the primal problem to the dual problem through the Lagrange multipliers, , and information is passed from the dual problem to the primal problem through the optimal dual variables, .
Relaxed dual problem decomposition
The form of the relaxed dual problem (6) is difficult to solve because it contains an inner relaxed dual problem that is parametric in . Here, we decompose the relaxed dual problem into a set of independent relaxed dual subproblems that are combined to find the solution of the relaxed dual problem. The algorithm iterates between solving the primal problem and the relaxed dual problem in a way that provides guaranteed convergence to the global optimum.
Recall, the inner relaxed dual problem (8) is
Because the inner relaxed dual problem is convex in , we can form a lower bound by taking the firstorder Taylor series approximation about some fixed ,
(12) 
where
(13) 
For notational convenience we define
the th component of the gradient of the original problem Lagrange function with respect to evaluated at . Note that we are using to denote the inequality constraints in the original problem; this different function has a superscript that indicates the iteration. Then, we can write the lower bound on the inner relaxed dual problem as
(14) 
For any fixed the summation and minimization can be exchanged,
(15) 
Because the th component of the second term on the right hand side is linear in , the minimum will be at a bound of .
The specific nature of the bound (lower or upper) is determined by the sign of . There are two cases:
where and are the lower and upper bounds of respectively. We can write the two cases compactly as
where
So, for any fixed , there exists a combination of bounds for the variables such that
(16) 
The vector is the same size as and each element indicates whether is set at its upper or lower bound. The variable is the value of at the upper/lower bound such that (16) is valid. For any discrete , we can obtain a lower bound on the inner relaxed dual problem by taking the minimum of the linearized Lagrange function over all combinations of bounds . Since this is true for every , it must be true for all .
Computing the linearized Lagrange over all combinations of bounds requires that we compute linearized Lagrange functions, which can be computationally expensive. The KarushKuhnTucker (KKT) optimality conditions for the primal problem require that
So, if the partial derivative of the primal Lagrange function with respect to is found not to be a function of , we do not need to consider the bounds of . We reduce the computational burden by a factor of for every such variable.
Every for which is a function of is called a connected variable. The set of all such connected variables is
(17) 
Therefore, the solution of the inner relaxed dual problem (8) with Lagrange function replaced by its linearization about depends only on the connected variables, and we can write the inner relaxed dual minimization problem as
(18) 
Now, we can solve for a lower bound of the inner relaxed dual problem by setting the connected variables at their bounds and minimizing the linearized primal Lagrange function. Setting the connected variables at their bounds is equivalent to selecting a halfspace defined by the cut (hyperplane) in the domain of . Therefore, we are solving for a lower bound of the linearized Lagrange function in a unique region in the domain of for each . Since we solve for such a lower bound for all combinations of the connected variables, we cover the entire domain of .
Since can enter as either or , we require that be linear in to ensure all of the regions defined by are convex. However, in general is convex. When is convex, but not linear, we can linearize the Lagrange function with respect to by taking the firstorder Taylor series approximation with respect to at . This linearization provides a lower bound on the Lagrange function and the resulting cuts are linear in .
Relaxed dual problem as a sequence of convex problems
We have shown that we can approximate the inner relaxed dual problem by a minimization over a finite combination of settings of at its bounds. Here, we show that we can solve the relaxed dual problem as a sequence of convex subproblems. Finally, we show we achieve global convergence by iteratively alternating between solving the primal problem and one step in the sequence of relaxed dual subproblems. These optimality results are based on proofs by Geoffrion (1972).
Suppose we have a sequence of relaxed dual subproblems such that is the optimal value of the th relaxed dual problem.
(19) 
Recall that for any , the relaxed dual problem contains fewer constraints than the original dual problem and is a valid underestimator of the original problem. Further, since the relaxed dual problem at iteration contains more constraints than the relaxed dual problem at iteration , the underestimator is nondecreasing in . Now, we use the approach for solving the inner relaxed dual problem by setting at its bounds to derive an iterative algorithm for (19).
Iteration 1
For we have
As this holds for all , it holds for the minimum over ,
Because the right hand side is only a function of , the minimization operators can be interchanged,
or equivalently
So, for iteration 1, we have an optimization problem that provides a valid lower bound on the inner relaxed dual problem. For each setting of at its bounds , we solve a convex optimization problem – the relaxed dual subproblem.
Iteration T
Recall that defines a cut in the domain of and all cuts intersect at such that any point lies in one and only one region defined by the cuts (with the exception of boundary points). Further, recall that there is a onetoone correspondence between , the primal problem variables set at their boundaries, and the region defined by the cuts, . Now, let be the set of Lagrange functions and associated qualifying constraints from the th iteration whose qualifying constraints are satisfied at , the current value of the relaxed dual problem decision variables. The set of active qualifying constraints uniquely identifies , the primal problem variable bounds, where we denote the combination of bounds that uniquely identify the active qualifying constraints from iteration as .
We can write the inner relaxed dual problems from (19) for iteration as
where is identified by . Since the variables are set for all iterations , the operator only applies to the th iteration and we have
(20) 
Therefore, the relaxed dual problem can be solved as a series of convex problems. Each iteration adds constraints to the relaxed dual problem ensuring that the lower bound is nondecreasing. Since the relaxed dual problem is a valid lower bound on the global optimum for every iteration, it is a valid lower bound for the th iteration. The proof that the upper bound provided by the primal and the lower bound provided by the relaxed dual converge can be found in (Floudas, 2000)[Theorem 3.6.1].
2.3 GOP Algorithm Steps
Step 0 – Initialize parameters.
Initialize the upper bound and the lower bound . Define lower and upper bounds for the primal decision variables and respectively. Select a feasible initial value of the dual decision variables . Set the iteration counter to . Finally, select a convergence tolerance parameter .
Step 1 – Solve primal problem
Solve the primal problem with the relaxed dual variables fixed at and store the optimal Lagrange multipliers and . Update the upper bound , where is the solution of the primal problem.
Step 2 – Select Lagrange functions from previous iterations
While , for identify the settings of the primal problem variables at their bounds such that all of the qualifying constraints from iteration are active at the current value of the relaxed dual variables . Select those qualifying constraints and their corresponding Lagrange function to be in the set of constraints of the current relaxed dual problem.
Step 3 – Solve relaxed dual problem
Determine the set of connected variables at iteration , . For each setting of the connected variables at their bounds where , solve the relaxed dual problem:
(21) 
where .
If feasible, store the solution in and . If not feasible, then store a null value.
Step 4 – Update lower bound and
Select the minimum from the entire set , . Note that contains the solutions of the relaxed dual problem from the current iteration and the solutions from all of the previous iterations. Update the value of the lower bound . Update the value of with the corresponding optimal relaxed dual decision variable . Remove and from the stored sets so that they are not selected again in a subsequent iteration.
Step 5 – Check for convergence
Check if . If the convergence criterion is satisfied, return the optimal parameter values, else increment the iteration counter and go to step 1.
3 BGMM – Point Mass Approximation
Here, we consider a variant of the Gaussian mixture model where cluster mean is endowed with a Gaussian prior. We consider the variational model for the posterior distribution to be a point mass function for discrete latent random variables and a Dirac delta function for continuous random variables. Variational expectationmaximization with a pointmass approximating distribution is exactly classical expectationmaximization
(Gelman et al., 2013, p 337).3.1 GOP Algorithm Derivation
Model structure
A Gaussian mixture model with a prior distribution on the cluster means is
(22) 
with model parameters , unobserved/latent variables , and observed variable .
Our inferential aim is the joint posterior distribution , for , where is the maximum likelihood parameter estimate. The log likelihood for the model is .
Variational distribution
We take the variational distribution to be the following fully factorized (meanfield) distribution
(23) 
where
We define be the set of variational distribution parameters.
Variational lower bound for the loglikelihood
Given the variational distribution (23), the evidence lower bound (ELBO) is found by applying Jensen’s inequality. The final ELBO is construed as a function of the model parameters, , and the variational parameters , where we have replaced by its natural parameter to make the ELBO linear in the parameter. The ELBO can be expanded to (see Appendix B.2 for the full derivation)
(24) 
Partitioning and transforming decision variables
The variational inference problem in standard form is a minimization problem, where the objective function is the negative ELBO.
The objective function is not biconvex in the separation of variables into model parameters, , and variational parameters, because of the first term involving . This term is cubic in the variational parameters and thus nonconvex. However, if we separate and , the first term is biconvex. The second term is biconvex in and . The third term is biconvex in and . The remaining two terms are convex in and respectively. So, we must separate from , from , and from . A partition that satisfies these conditions is and . A proof that all four GOP algorithm conditions are satisfied is provided in Appendix A.3.
Our selection of which variable to optimize over in the primal problem affects computational scalability. If we choose to optimize over in the primal problem, we must solve at most relaxed dual subproblems for each iteration where each subproblem has decision variables. This is disadvantageous because the number of relaxed dual subproblems grows exponentially in the sample size. Instead, we will optimize over in the primal problem, which means we only need to solve at most relaxed dual subproblems where each subproblem has
decision variables. While each subproblem may be large, it will be a linear program which can be solved efficiently for millions of decision variables.
The variational inference problem is now
(25) 
where and .
3.2 Experimental Results
First, we examine the convergence of the upper and lower bounds on the ELBO for a small data set. Then, we compare the accuracy and computational efficiency of GOP and variational inference algorithms. Last, we compare the optimal solution of GOP and variational inference under many random initializations.
It is wellknown that the expectationmaximization algorithm is only guaranteed to converge to a fixedpoint which may be arbitrarily far from a global optimum. Others have offered anecdotes and evidence that it is actually rather common for expectationmaximization to converge to a local optimum or saddle point rather than a global optimum. Murray (1977) described a simple data set that exhibits this local convergence issue for a Gaussian model. We sought to identify a minimal data set for the (Bayesian) Gaussian mixture model that exhibits the local convergence problem. Archambeau (2003)
found that repeated observations and outliers are particularly problematic for expectationmaximization and the Gaussian mixture model. We consider these properties to construct the data set
.We find that this minimal data set is useful for stresstesting any inference algorithm for the (Bayesian) Gassian mixture model.
Convergence experiment
We examined the optimum ELBO values for GOP and variational EM (VEM) by randomly drawing initial model and variational parameters and running each algorithm to convergence. The initial and are generated from a Dirichlet distribution with . The initial is generated from a Gamma with shape parameter equal to the range of . The initial
is drawn from uniform distribution over the range of
. We drew 100 random initial values and we observed that VEM converges to a local optimum 87 times out of 100 restarts, while GOP converges to the global optimum for all 100 restarts.We verified the global optimal value using BARON, an independent global optimization algorithm (Sahinidis, 1996; Tawarmalani and Sahinidis, 2005). These results show that GOP, empirically, always converges to the global optimum. The VEM algorithm failed to converge to the global optimum more often than it succeeded. These results quantify and agree with anecdotal evidence in the literature (Murray, 1977).
Computational time experiment
We explore two aspects of computational efficiency: “What is the behavior of the upper and lower bounds from the GOP algorithm across iterations?” and “How does the timing of the GOP algorithm compare to other local and global inference methods?”.
Figure 1 shows the convergence iterations for the GOP algorithm for two initial seeds. For one of the seeds the VEM algorithm converges to a suboptimal local optimum Figure 1(a) and for the other seed the VEM algorithm converges to a suboptimal local optimum Figure 1(b). The optimality of the solution is verified by BARON. The globally optimal ELBO is and the locally optimal ELBO is .
In early iterations, the GOP bounds include the local optimum, but at approximately 3 sec, the bound interval no longer covers the local optimum. So, even though the GOP algorithm is slower than other local methods, one way it can be used is as a running oracle on an algorithm with only local guarantees. When the GOP bound interval no longer covers the VEM solution, we can restart VEM until the solution found by VEM is covered by the GOP interval.
Algorithm  Time (s)  
GOP  
6.93  
9.14  
10.77  
BARON  
35  
38  
49  
Variational EM  
0.0053  
0.0071  
0.0073 
We compared the time for convergence to VEM and a standard global optimization algorithm, BARON for our minimal data set. The results are summarized in Table 1.
Our implementation of the GOP algorithm is faster than the global optimization algorithm, BARON and slower than VEM. Indeed, VEM is several orders of magnitude faster. However, the key point of this work is to demonstrate a method to obtain a global optimum. Our implementation of GOP can be substanially enhanced by (1) using the reduction method that shrinks the decision variable search space at each iteration (Visweswaran and Floudas, 1996), (2) using mixedinteger linear programming to identify the optimal relaxed dual subproblem at each iteration (Visweswaran and Floudas, 1996), and (3) using multithreading to compute each master problem. The multithreading option is most appealing because each relaxed dual problem is completely independent, thereby suggesting how to drastically reduce the most expensive computation step.
4 BGMM – Gaussian Approximation
In this section we explore the quality of the approximation provided by two different variational models for the Bayesian Gaussian mixture model. The original model structure is the same as in Section 3, so we start with the variational model structure here.
4.1 GOP Algorithm Derivation
Variational distribution
We take the variational distribution to be the following fully factorized (meanfield) distribution
(26) 
where
We define to be the set of variational distribution parameters.
Variational lower bound for the loglikelihood
Given the variational distribution (26), the evidence lower bound (ELBO) is found by applying Jensen’s inequality. The final ELBO is construed as a function of the model parameters, , and the variational parameters , where we have replaced by its natural parameter to make the ELBO linear in the parameter.
(27) 
The ELBO can be expanded (see Appendix C.1) to
(28) 
Partitioning and transforming decision variables
The variational inference problem in standard form is a minimization problem, where the objective function is the negative ELBO. Using the same argument as in Section 3, we find a partition that satisfies the GOP conditions – and . A full proof that all four GOP algorithm conditions are satisfied is provided in the Appendix C.3.
The variational inference problem is now
s.t.  
where and .
4.2 Experimental Results
In variational inference, the choice of approximating distribution is typically motivated by computational convenience rather than accuracy with respect to the posterior distribution. In addition, the loss incurred by choosing a particular approximating distribution is generally not known. The task of characterizing this loss is daunting because local methods such as VEM do not provide a performance guarantee.
Quality of optimal ELBO
We investigate the quality of the optimal ELBO provided by (1) the point mass approximation detailed in Section 3 and (2) the Gaussian approximation presented earlier in this section. We run GOP, adapted to both approximations, using the minimal data set described in 3. The globally optimal ELBO for point mass approximation and for Gaussian approximation is 84.04 and 82.75, respectively. We observe that the globally optimal ELBO for Gaussian approximation is higher than that of point mass approximation meaning that the Gaussian variational distribution is a better approximating distribution for our model compared to the point mass variational distribution. This result shows that we do indeed incur a loss in accuracy when selecting an inferior approximating distribution suggesting that careful consideration must be made when choosing an approximating distribution.
5 Conclusion
There are several contributions of this article. We describe an adaptation of a generalpurpose biconvex global optimization method (GOP) for variational expectationmaximization problems. We derive GOP inference algorithms for two variational approximations of the Bayesian Gaussian mixture model. We provide a simple data set for which the variational evidence lower bound is multimodel that is useful for testing approximation methods. We compare our GOP inference algorithm to one locally optimal algorithm (variationalEM) and one stateoftheart global optimization algorithm (BARON) in terms of accuracy and computational time and show that GOP always converges to the global optimum and is not prohibitive in terms of computational time for the small data set tested. We compare the globally optimal variational evidence lower bounds for two variational approximate distributions and show that the global variational evidence lower bound is better for a more flexible variational model. The focus in this article is on the properties of the global optimization algorithm. We describe several future improvements that can significantly improve the computational efficiency.
Appendix A Appendix for GMM
a.1 Derivation of Loglikelihood Lower Bound
The variational lower bound is
(29) 
Writing out each term of the bound gives
Comments
There are no comments yet.