1 Introduction
Discrete choice models have a long history in statistical analysis, appearing in applications as varied as the analysis of consumer choice data (Guadagni and Little 1983; Fader and Hardie 1996), transportation planning (Theil 1969; McFadden 1974; BenAkiva and Lerman 1985)
, economic demand estimation
(Train et al. 1987; Revelt and Train 1998), new product development (Moore et al. 1999), portfolio analysis (Uhler and Cragg 1971) and health services deployment (Hall et al. 2002). They apply to situations where agents (also called choosers or decisionmakers) select items from a finite collection of alternatives (the choice set), either once or repeatedly over time. For example, in a marketing context, agents are “households”; each household makes a number of “trips” to a store, and we observe the items selected for purchase on each trip.Heterogeneous discrete choice models, which allow preferences to differ across agents, are based on a hierarchical regression formulation. We have agents numbered
, each with an unseen parameter vector
encoding preferences over item attributes. We observe one or more choice events per agent. The ’s are modeled as independent draws from a prior distribution , whereis a hyperparameter. This prior represents the heterogeneity of preferences across the population. Inference in such a hierarchical model allows us to pool information across decisionmakers. If we use an empirical Bayes point estimate of
(Robbins 1955), the posterior distribution of each depends on all of , through the common estimate. In a fully Bayesian setup, integrating out the random variable
creates similar dependence.The marginal likelihood corresponding to one agent in a heterogeneous model is
(1) 
In most cases, including the “random utility” discrete choice model we study in this paper, (1) does not exist in closed form. As a consequence, we must use approximate methods both for empirical Bayes and fully Bayesian inference.
A standard empirical Bayes technique is to approximate (1) using Monte Carlo integration. But to match the asymptotics of maximum likelihood, the number of draws per agent must grow faster than the square root of the number of agents (Train 2003), which is infeasible for largescale problems. The usual approach to the fully Bayesian random utility model is Markov chain Monte Carlo (MCMC) simulation (Albert and Chib 1993; Allenby and Lenk 1994; Allenby and Rossi 1999). MCMC provides approximate draws from the joint posterior distribution on and . The draws enable the estimation of (1) and related integrals. However, the more agents there are in the data set, the more MCMC output we need to collect and store – even if we are only interested in , we still need repeated draws for all of .
Variational methods (Jordan et al. 1999; Wainwright and Jordan 2003) offer a deterministic alternative for approximate inference. With variational inference, we maximize a datadependent lower bound on the marginal likelihood (1), over a set of auxiliary parameters distinct from the model parameters. In the fully Bayesian specification, the end result is an approximate joint posterior distribution for and . For empirical Bayes, variational techniques lead to a point estimate as well as an approximate posterior distribution for the ’s.
The main advantage of variational methods versus MCMC is computational efficiency. Variational inference algorithms typically converge to their final approximation in far less time than it takes to generate an adequate number of MCMC draws. This advantage comes at the cost of a biased approximation, in contrast to the consistency guarantees that accompany MCMC. We give evidence in Section 4 that, for our random utility discrete choice model, variational bias is negligible, and the computational speedup is very large. Variational convergence is also easy to assess, in contrast to MCMC.
Furthermore, the size of the variational representation is fixed, while the size of the MCMC approximation increases with the number of draws. Variational techniques can therefore be applied to much larger data sets. For example, with 10,000 decisionmakers and 20,000 total MCMC draws, using a 25dimensional and corresponding , the MCMC representation of the posterior exceeds 2 GB—if we discard half the draws for burnin of the chain. In fact, many data sets today contain observations on millions of agents, models can contain far more than 25 preference parameters, and MCMC chains may require hundreds of thousands of iterations.
These difficulties are well known. MCMC is rarely applied to largescale heterogeneous models. Indeed, to address scalability, it is common to work with data from a subset of individuals, or a subset of choice items. However, this approach discards information that is valuable in the inferential process, and it can lead to biased estimates (Bradlow and Zanutto 2006).
In this paper, we derive variational algorithms for a common discrete choice model – the mixed multinomial logit (MML) model. We study this model because it the workhorse of discrete choice theory and is well known in many disciplines, including economics and marketing. There are other popular discrete choice models in the literature, but the mixed multinomial logit has appeal: it is conceptually simple, yet still exhibits the MCMC inference issues just described.
The rest of the paper is organized as follows. Section 2 presents the details of the MML model. In Section 3
, we describe variational procedures suitable for empirical Bayes and fully Bayesian inference under the MML model. These procedures include a novel application of the delta method for moments to variational inference. In Section
4, we compare variational methods to MCMC for the MML model, using an extensive suite of simulated data sets. Section 5 closes with discussion and future directions. Technical arguments and derivations are relegated to Appendixes A, B, and C.2 The mixed multinomial logit model of discrete choice
Let there be agents, indexed . We observe a total of choiceevent outcomes for agent . At each choice event, the agent selects from among a fixed set of items, indexed . The items are differentiated according to attributes, indexed . The th item’s value for the th attribute can vary across agents and from one choice event to another. For example, households might shop at different stores charging various prices for the same good, and the price of a good may change over time within a single store. We denote by the matrix of attribute values, also called covariates, that agent encounters at her th choice event. The th row of is denoted . The outcome of this choice event is the observed categorical random variable , which we represent as a indicator vector.
We use the observed pairs to infer which attributes have the strongest association with item choice. To this end, let denote the utility that accrues to agent if she chooses item at her th choice event. This approach, called a “random utility model” (Train 2003), assumes utility is a noisy linear function of the attributes: . Here, is a vector of agentspecific “tastes” or “preference loadings” for the item attributes, and is a random error term representing unobserved utility.
We assume that each agent, at each choice event, selects the item maximizing her utility. In the mixed multinomial logit model, we further assume the random error terms
are iid from a Gumbel Type 2 distribution. The implied choice probabilities turn out to be
(2) 
(McFadden 1974). In discrete choice modeling, the righthand side of (2) is called the “multinomial logit” distribution, denoted
. It is essentially the same as the multilogistic function used in polychotomous logistic regression, and it is often called the softmax function in machine learning research.
We further assume that are iid from a
variate normal distribution with mean vector
and covariance matrix , which we write as . For empirical Bayes estimation, the model is now completely specified:(3)  
(4) 
The toplevel parameters and , to be estimated by maximum marginal likelihood, represent the distribution of attribute preferences across the population. In particular, gives us information about the correlation of preferences between agents.
A fully Bayesian approach requires hyperprior distributions for
and . As is standard, we use conditionally conjugate distributions:(5) 
In (5), and are prespecified hyperparameters; is the inverse Wishart distribution with scale matrix and degrees of freedom; and and are hyperparameters fixed in advance. We call the fully Bayesian approach to MML model inference “hierarchical Bayes.”
3 Variational inference for the MML model
We have presented the component hierarchical distributions in the mixed multinomial logit model. Now we turn to the question of estimation and inference procedures. In the following, variable names inside of are used to distinguish among densities: we denote the pdfs in (3)–(5) by , , , and , respectively. We let denote all observed variables, i.e., the data.
For the empirical Bayes version of the MML model, the posterior density of the latent preference vectors, , is
(6) 
The joint posterior density for hierarchical Bayes, , is
(7) 
The numerator in both cases is the joint density of latent and observed variables, computed by multiplying together the densities defined in the model hierarchy (3)–(5).
The integrals appearing in these posterior densities have no closed form. As a consequence, exact inference is intractable. Variational inference is a deterministic alternative to the MCMC methods usually applied to this problem (Rossi et al. 2005). A variational algorithm selects from a prespecified family of distributions the best approximation to the true posterior distribution. We define so that all of its members permit tractable probability calculations. Then, wherever we need the true posterior, such as for an expectation, we use the approximating variational distribution instead. This plugin idea underlies MCMC methods as well – in place of the true posterior, we substitute the empirical distribution of the MCMC posterior draws.
3.1 Variational empirical Bayes
In this section we give an overview of a variational algorithm for approximate empirical Bayes estimation in the MML model. Appendix A fills in the details. We first specify a family of approximating distributions for the true posterior distribution (6). Since this posterior factors over , we take to be a family of factored distributions as well, so that . In particular, each factor is a variate normal density, with mean and covariance matrix .
For the particular data set at hand, we want to find , the best approximation in to the posterior distribution . To make the idea of a best approximation precise, we measure discrepancy with the KullbackLeibler (KL) divergence (also called the relative entropy). Shortening to , the optimal variational parameters are given by
(8) 
We can express the KL divergence between and as
(9)  
(10) 
where we define
(11) 
Here, denotes an average over , using . Because is nonnegative, (10) implies that, for all distributions ,
(12) 
The likelihood is called the “evidence” in some contexts; we therefore christen the evidence lower bound (ELBO) function.
Using (10), we can formulate a maximization problem equivalent to (8), having the same optimal point :
(13) 
The equivalence of (8) and (13), together with the bound (12), shows that the best approximation in to the posterior yields the tightest lower bound on the marginal likelihood.
We work with (13) rather than (8) – to evaluate the KL divergence, we would need to compute the marginal likelihood, bringing us back to our original problem. The variational parameters to be adjusted are . We conduct the variational inference (13) using block coordinate ascent on the coordinate blocks , , …, , . The coordinate updates do not have a closed form, but each update solves a smooth, unconstrained convex optimization problem, as we show in Appendix A. There we also give a closedform gradient and Hessian for the update, as well as closedform gradients for the update under two different parametrizations.
To finish with empirical Bayes, we explain how to obtain approximate MLEs and . Notice the variational inference procedure (13) yields an optimal value for fixed and . We alternate this variational inference step with a complementary optimization over and
. In fact, these optimizations constitute a version of the expectationmaximization (EM) algorithm for computing MLEs. The standard Estep, where we compute the posterior expected complete log likelihood, is replaced with a variational Estep, where the expected complete log likelihood is approximated using
. The variational EM algorithm alternates between the following steps until , , and the variational parameters stabilize: Estep (variational).

Using the current and , run the block coordinate ascent algorithm as described in Appendix A, yielding new variational parameter values .
 Mstep.

Using the current variational parameter values, update the empirical Bayes parameter estimates: .
It can be shown that the variational Estep finds a new value of which moves the lowerbounding function towards the true loglikelihood from below. The Mstep maximizes this adjusted lowerbounding function over and , as a surrogate for the true loglikelihood. We then retighten and remaximize until convergence. Appendix A gives details on initialization and the Mstep update.
3.2 Variational hierarchical Bayes
Fully Bayesian inference in the mixed multinomial logit model requires calculations under the posterior given in (7). In one sense, the setting is simpler than empirical Bayes: there are no unknown toplevel parameters to estimate. All we need is to extend the previous section’s variational inference procedure to include and . Appendix A.4 reports the details behind the extension; here we summarize the main ideas.
Although the joint posterior (7) is not factorized, we continue to use a family of factorized distributions for the variational approximation:
(14) 
Using a factored family for a nonfactored posterior is commonly called meanfield variational inference. In (14), is a variate normal density; is an inverse Wishart density; and the factors are variate normal densities as before. In the analysis and the algorithm, it is convenient to use a wellknown equivalence, treating as a Wishart distribution on . We are therefore optimizing over variational parameters . The variational problem for hierarchical Bayes is to find the best approximating distribution in by solving the analog of (13):
(15) 
As with empirical Bayes, we use a block coordinate ascent optimization algorithm to solve (15), iterating through the coordinate blocks that define . Here again, all coordinate updates are convex optimizations. The details appear in Appendix A: updates for , , all have simple closed forms; has a closed form which requires no updating; and the and updates are similar to the empirical Bayes case.
4 Empirical results
We compared the accuracy and speed of the variational methods described in the previous section to a standard and widely used MCMC approach (Allenby and Rossi 2003), on a suite of simulated data sets. Each data set was generated using the discrete choice model given by (3) and (4). To simulate a data set with choice items, item attributes, and agents, we first fixed values of and , the parameters controlling the distribution of preferences in the agent population. We then independently drew a vector for each agent, according to (4). We also drew for each agent 25 iid item attribute matrices consisting of iid entries. Finally, for each agent, we used and to simulate 25 choice events , according to (3). Thus, in our data sets, each agent has observed choices.
We simulated a total of 32 different scenarios by varying , , , and the selection of and . Specifically, each data set corresponds to a distinct configuration of the following candidate values: 3 or 12 choice items ; 3 or 10 item attributes ; 250, 1000, 5000, or 25000 agents ; and “low” or “high” heterogeneity of the agent population. In the lowheterogeneity scenario, the vector consists of evenly spaced values from 2 to 2, and the matrix
is 0.25 times the identity matrix. In the highheterogeneity scenario,
is the same, but is the identity matrix. The data sets with high heterogeneity have much more diverse collections of preference vectors .We ran variational empirical Bayes (VEB), variational hierarchical Bayes (VB), and the standard MCMC algorithm on the observable data from each of the 32 simulation scenarios. For VEB, we declared convergence as soon as an Estep/Mstep iteration caused the parameter estimates’ joint Euclidean norm to change by less than , relative to their norm at the beginning of the iteration (here we mean the joint norm of all the variational parameters, together with the model parameter estimates and ). The convergence criterion for VB was the same, except and do not have point estimates – we look instead at the change in the variational parameters corresponding to their posterior approximation, namely , , and .
Choosing MCMC convergence criteria is more delicate. We tried to set the number of burnin iterations and the thinning ratio algorithmically, using the technique of Raftery and Lewis (1992). But on several of our data sets, typical control parameter values, such as the default settings for the raftery.diag function in the R package coda (Plummer et al. 2006), led to a very large number of burnin iterations. Trace plots of the sampled parameters indicated these large burnin values were unnecessary, so using them would have been unfair to MCMC in our timing comparisons. Instead, we manually investigated MCMC convergence and autocorrelation for several data sets, using trace plots, partial autocorrelation functions, and related diagnostics available in the coda package. Based on these studies, we chose to use 1,000 iterations of burnin and a 10:1 thinning ratio. On each data set, we therefore ran 6,000 total iterations of MCMC to generate 500 draws for the approximate posterior. These numbers are as small as we could reasonably make them, in order to be fair to MCMC in the timing comparisons.
4.1 Accuracy
Our measure of accuracy for each inference procedure is based on the predictive choice distribution. Informally, this distribution gives the item choice probabilities exhibited by the “average” agent, when shown an item attribute matrix . In our simulations, we know the true values of and , so we can compute the true predictive choice distribution:
(16)  
(17) 
Equation (17) explains the “average agent” interpretation of the predictive choice distribution. A slightly different take on (16)(17) is the following: if we want to forecast the item choice probabilities of a new, previously unobserved decisionmaker, we can think of her as a random draw from the agent population. Under our model, the choice probabilities for such a randomly drawn agent are precisely (16)(17).
For any particular , we use the true values of and to compute a Monte Carlo estimate of (17). We base the estimate on enough draws of to insure that its variability does not affect even the least significant digit of our reported results. We handle the integral over in the same way for the estimated predictive choice distributions furnished by each of the three inference procedures. For VEB, the estimate is (16), with and replaced by and . On the other hand, with VB and MCMC, we obtain a posterior distribution over and ; we take the mean of (16) under this posterior as a point estimate of the predictive choice distribution:
(18) 
For VB, the posterior density in (18) is approximated by the fitted variational distribution
(19) 
For MCMC, the posterior is approximated as usual by the empirical distribution of draws from a Markov chain. In both cases, we handle the integral over and in (18) with another exhaustive Monte Carlo approximation.
We measure the error of each inference procedure as the distance from its estimate of the predictive choice distribution to the true distribution. As the metric on distributions, we use total variation (TV) distance, leading to what we call the “TV error” of each procedure. In this setting, TV error equals the maximum, over all choiceitem subsets, of the difference in the probabilities assigned to the subset by the estimated versus the true choice distribution. We also need to choose the attribute matrix at which the true and estimated predictive choice distributions are calculated. In each simulation scenario, we computed the TV error of VEB on 25 different random draws of . We then compared the three procedures using the which yielded the median of the 25 TV errors. In this sense our results are representative of accuracy for a “typical” item attribute matrix. However, results using any one of the 25 matrices were qualitatively the same in the examples we checked.
TV error values for the three inference procedures are presented in Table 1 for the 3item simulation scenarios, and in Table 2 for the 12item scenarios. The main conclusion we draw from Tables 1 and 2
is simple: on these data sets, there are no practical differences in accuracy among VEB, VB, and MCMC. The scale of the TV error for all the procedures is the same; that scale is larger in the 12item case than the 3item case, but all three procedures exhibit high accuracy on all data sets. The magnitude of our simulation study makes it difficult to carry out the replications required to put standard errors in these tables. But even if the differences among TV errors in every scenario were “significant” under a suitable definition, the patternless alternation in the identity of the most accurate method would make more specific conclusions dubious.
3 attributes  10 attributes  

Agents  Het.  VEB  VB  MCMC  VEB  VB  MCMC 
250  Low  0.23  0.31  0.35  0.66  0.50  0.59 
High  0.49  0.48  0.92  0.68  0.51  1.46  
1,000  Low  0.74  0.63  0.46  0.49  0.17  0.14 
High  0.53  0.33  0.09  0.30  0.39  0.18  
5,000  Low  0.25  0.37  0.35  0.56  0.59  0.07 
High  0.19  0.25  0.39  0.35  0.20  0.38  
25,000  Low  0.63  0.74  NA  0.59  0.55  NA 
High  0.53  0.53  NA  1.60  1.10  NA 
3 attributes  10 attributes  

Agents  Het.  VEB  VB  MCMC  VEB  VB  MCMC 
250  Low  2.88  2.80  2.64  1.97  1.91  2.44 
High  1.44  1.94  1.64  2.43  2.37  2.62  
1,000  Low  1.11  1.27  1.09  0.99  1.00  1.60 
High  0.98  1.18  1.18  1.99  2.05  1.96  
5,000  Low  1.25  1.45  1.18  0.95  1.13  0.97 
High  1.14  0.98  1.10  0.71  0.91  0.92  
25,000  Low  0.22  0.33  NA  0.51  0.53  NA 
High  0.99  0.57  NA  1.23  0.96  NA 
Figure 1 shows in a different way the comparable accuracy of the three procedures. When there are three choice items, any predictive choice distribution can be plotted as a point in the triangular simplex. The figure shows the close proximity of each procedure’s estimated choice distribution to the true distribution in one simulation scenario. Plots for all the other threeitem scenarios are qualitatively the same. Figure 1 also shows contours of the VB and MCMC approximate posterior distributions. We see that VB is producing not only a posterior mean similar to MCMC, but also a similar posterior density in the neighborhood of the mean.
4.2 Speed
For each simulated data set, we ran the three procedures in turn on the same unloaded machine: a 64bit dualcore 3.2 GHz Intel Xeon processor with 8 GB of main memory. For the MCMC inference, we used the rhierMnlRwMixture function in the R package bayesm (Rossi and McCulloch 2007), which has efficient vectorized implementations of the inner sampling routines. This package stores all MCMC draws in memory, however. For our largest data sets, with 25,000 decisionmakers, the machine’s memory was exhausted before MCMC converged. We were able to run MCMC for 1,000 iterations in this case, which allowed us to extrapolate accurately the time that would have been required for 6,000 iterations. We implemented the variational algorithms in R, with compiled C components for the numerical optimization routines.
Figure 2 displays time to convergence on each data set for the three procedures, according to the convergence criteria previously described. Within each panel, convergence time is plotted as a function of the number of agents, for fixed values of the other simulation parameters. Note that the vertical axis shows convergence time on a logarithmic scale, to ease comparison of MCMC to the variational methods. All the procedures scale roughly linearly with the number of agents, which leads to the logarithmic curves seen in the figure. The conclusions are the same in all the scenarios we simulated: variational methods converge faster than MCMC, and the magnitude of the difference increases with the number of observed agents. MCMC uses two days of computation time for 25,000 agents with 12 choice items, 10 item attributes, and high heterogeneity, versus an hour for each of the variational techniques. In the same setting but with low heterogeneity, MCMC’s twoday computation compares with two hours for VEB and six hours for VB. In other scenarios, variational run times are measured in minutes, as opposed to hours or days for MCMC.
5 Discussion
Variational methods allow estimation of hierarchical discrete choice models in a small fraction of the time required for MCMC. They open Bayesian inference to a wide class of problems for which resource constraints make MCMC intractable, such as the MML model with many heterogeneous units. For now, variational methods appear to be the only viable option in these cases.
Of course, one can use variational methods to estimates many more types of models than the MML model examined here. Within the MML family, it would be straightforward to add a utility scaling parameter, or allow the heterogeneous coefficients themselves to depend on observed covariates. The value of the variational approach is greatest when subsampling of data is illadvised. For example, consider a linear model with heterogeneous coefficients on exogenous and endogenous covariates, where the available instrumental variables only weakly explain the endogenous part. To draw inferences about the covariances of the heterogeneous parameters, we may need a large amount of data to achieve reasonable power in hypothesis testing. MCMC is untenable here, but variational methods have promise. When factorized variational distributions are inadequate, alternatives such as mixtures of normals or Dirichlet processes (Blei and Jordan 2006) can be applied.
We emphasize that we do not advocate abandoning MCMC in favor of variational methods. On the contrary, we suggest using MCMC when possible. MCMC offers consistency guarantees with no analog in variational methods, and it has withstood decades of scrutiny. But we advise against subsampling the data (i.e., throwing out information), or discarding key modeling elements, simply to make the problem fit within the time and resource constraints of MCMC. The possibility of applying variational methods to previously intractable problems makes them an important addition to the statistician’s toolkit.
Appendix A Variational inference and parameter estimation
In this appendix we describe the variational inference and estimation procedures for the mixed multinomial logit model. A few words on notation: means the matrix A is positive definite; means positive semidefinite; is the determinant of . For a scalar function and a vector , means the vector .
a.1 The empirical Bayes ELBO
For empirical Bayes in the MML model, the ELBO objective function (11) becomes
(20) 
The first term in (20) is the Shannon entropy of the variational distribution. The second and third terms are (minus) an unnormalized cross entropy – the missing normalization constant is the marginal likelihood. Recall that the variational distribution is a product of normal distributions .
The first and second terms of (20) are straightforward to derive; the third term requires more attention. Using the multinomial logit mass function
(21) 
the third term becomes
(22) 
The expected logsumexp in (22) has no closed form. For variational inference, we therefore approximate the ELBO objective function using a new objective function . To construct , we consider two alternatives: the zerothorder and firstorder delta method for moments (Bickel and Doksum 2007), which we call D0 and D1 respectively. D0 is equivalent to applying Jensen’s inequality to the expected logsumexp, resulting in the following lower bound to (22):
(23) 
Here we used the usual formula for the mean of a lognormal random variable. In a different context, Blei and Lafferty (2007) consider an approximation equivalent to D0 but expressed using a redundant variational parameter.
For approximation D1, we restrict to be diagonal, and
define
. Using
results in Appendix B, we obtain the following approximation
to (22):
(24) 
with as defined in Appendix B. Notice that, unlike D0, approximation D1 does not preserve the guarantee that the optimal value of the variational optimization lower bounds the marginal likelihood. However, in our simulations, using D1 resulted in more accurate variational approximations to the posterior.
In this appendix we give a derivation based on approximation D0. The derivation for D1 is similar, but simpler, because is treated as diagonal. Under D0, the final empirical Bayes objective function is
(25) 
The first line in (25) uses the wellknown entropy of the normal distribution. The second line uses the crossentropy of two normal distributions, also well known. The third line is approximation D0.
a.2 Empirical Bayes variational Estep
Here we describe a block coordinate ascent algorithm to maximize (25) over the variational parameters and . Although the problem is not jointly convex in all these parameters, each and coordinate update solves a smooth, unconstrained convex optimization problem. The requirement is satisfied after each update. We initialize the variational parameters at the maximum likelihood estimates from a homogeneous model (in which all agents share a common value).
The concavity of (25) in follows from the fact that and from the convexity of the logsumexp function. We update using standard algorithms for unconstrained convex optimization (Boyd and Vandenberghe 2004), supplying an analytic gradient and Hessian as follows. Define the function taking values in , with th component , and normalized to sum to one across . The gradient of with respect to can then be written
(26) 
Note the similarity of this gradient to the gradient from an regularized multiple logistic regression: it consists of a contribution from the regularizer (the lefthand term), plus a residualweighted sum of covariate vectors. Abbreviating to , an argument using matrix differentials (Magnus and Neudecker 2007) gives the Hessian
(27) 
The coordinate update is harder, because we need to insure that . Using a reformulation, we can avoid making the constraint explicit, which would complicate the optimization. Let for a lowertriangular matrix . Since , one such always exists—the Cholesky factor. We replace each in with , and optimize over the unconstrained set of lowertriangular matrices .
The objective function (25) remains concave in . To see this, compare the terms depending on to the function studied in Appendix C. We now give the gradient with respect to . Standard matrix differentiation of (25) leads to the gradient
(28) 
Again using matrix differentials and the Cauchy invariance rule, it is not hard to show that the gradient with respect to is
(29) 
Note that this is the gradient with respect to a dense matrix . Since we optimize over lowertriangular matrices, i.e. vech(), we need only use the lower triangular of the gradient. This is convenient for the term : it is uppertriangular, so its lower triangle is a diagonal matrix. Furthermore, from a standard result of linear algebra, the diagonal entries are simply , where the ’s form the diagonal of .
In practice we do the and updates in a single step by optimizing jointly over and , which remains a convex problem.
a.3 Empirical Bayes Mstep
In the Mstep, we maximize (25) over and . Identifying the terms which depend on , we recognize the usual Gaussian mean estimation problem. Further, (25) is easily seen to be concave in , with a closedform solution of the corresponding firstorder condition. We obtain the Mstep updates
(30) 
Here is the empirical covariance of the vectors.
a.4 Variational hierarchical Bayes
In the fully Bayesian MML model, and have prior distributions, with corresponding variational factors given in (14). The ELBO in this case has the same form as (20), with two differences. First, contains two new terms
(31) 
Second, there are two new crossentropy terms
(32) 
Also, the middle term of (20) changes in the fully Bayesian case, because and are now averaged over rather than treated as constants.
Using known formulas for normal and Wishart entropies, the two new entropy terms are seen to equal
(33) 
Here we used the expected log determinant of a Wishart random matrix
(34) 
and the log normalization constant of the Wishart distribution
(35) 
(see, for example, Beal 2003). The new cross entropy terms for and work out to
(36) 
and
(37) 
respectively. The middle term of (20) eventually becomes
(38) 
With these changes, it is not hard to see that is concave separately in and . The firstorder conditions for block coordinate ascent lead to the updates
(39)  
(40) 
By inspection, , so this constraint need not be explicitly enforced. Note the similarity to conjugate posterior updating: on the precision scale, is the sum of the prior precision matrix and copies of the variational posterior mean for . Similarly, is a precisionweighted convex combination of the prior vector and the empirical average of the variational posterior means for .
The updates for and are similarly straightforward to derive; we obtain
(41)  
(42) 
Notice that the solution (41) for involves only the constants and . We compute once in advance, leaving it unchanged during the variational optimization.
Appendix B An application of the delta method
Let be a function from to . According to the multivariate delta method for moments (Bickel and Doksum 2007),
(43) 
Consider the case
(44) 
where is a matrix whose rows are the vectors . Let , and restrict such that for . We can now rewrite (43):
(45) 
where is the diagonal of the Hessian of , evaluated at the point . Define . Using matrix differentials, it can be shown that
(46) 
where denotes the Hadamard product.
To use the approximation (45) in an optimization over , we need to compute the gradient. The formula for makes this a more extensive but still mechanical exercise in differentials. One obtains
(47) 
Appendix C A convexity result
Let be scalars, be vectors, , and . We show here that the function
(48) 
is concave on the set of fullrank matrices.
We argue that each of the three constituent terms, from left to right, is concave. The second differential of is
(49) 
By Theorem 10.6.1 of (Magnus and Neudecker 2007), the Hessian of is
, where is the
order commutation matrix and denotes the Kronecker product.
We now show that is (matrix)
positivedefinite.
(50)  
(51)  
(52) 
Equation (50) follows from the wellknown fact that . Thus, the Hessian of is negative definite, and is concave.
Concavity of the middle term in (48) follows in the usual way from the univariate convexity of the function
(53) 
for fixed matrices
Comments
There are no comments yet.