Variational Inference (VI) (Jordan et al., 1999; Wainwright et al., 2008) is a powerful method to approximate intractable integrals. As an alternative strategy to Markov chain Monte Carlo (MCMC) sampling, VI is fast, straightforward for monitoring convergence and typically easier to scale to large data (Blei et al., 2017) than MCMC. The key idea of VI is to approximate difficult-to-compute conditional densities of latent variables, given observations, through optimization. A family of distributions
is assumed for the latent variables, as an approximation to the exact conditional distribution. VI aims at finding the member, amongst the selected family, that minimizes the Kullback-Leibler (KL) divergence from the conditional law of interest.
Let and denote, respectively, the observed data and latent variables. The goal of the inference problem is to identify the conditional density (assuming a relevant reference measure, e.g. Lebesgue) of latent variables given observations, i.e. . This conditional density can, e.g., be used to produce point and interval estimates of the latent variables, or provide predictions. Let denote a family of densities defined over the space of latent variables – we denote members of this family as below. The goal of VI is to find the element of the family closest in KL divergence to the true . Thus, the original inference problem can be rewritten as an optimization one: identify such that
for the KL-divergence defined as
with being constant with respect to . The notation refers to expectation taken over . Thus, minimizing the KL divergence is equivalent to maximising the evidence lower bound, ELBO, given by
Let , , denote the support of the target , and the support of a variational density – assumed to be common over all members . Necessarily, , otherwise the KL-divergence will diverge to .
Many VI algorithms focus on the mean-field variational family, where variational densities in are assumed to factorise over blocks of . That is,
for individual supports , , , for some , and . It is advisable that highly correlated latent variables are placed in the same block to improve the performance of the VI method.
There are mainly two types of approaches to maximise ELBO in VI: a co-ordinate ascent approach and a gradient-based one. Co-ordinate ascent VI (CAVI) (Bishop, 2006) is amongst the most commonly used algorithms in this context. To obtain a local maximiser for ELBO, CAVI sequentially optimizes each factor of the mean-field variational density, while holding the others fixed. Analytical calculations on function space – involving variational derivarives – imply that, for given fixed , ELBO is maximised for
denotes vectorhaving removed component , with (resp. ) denoting the ordered indices that are smaller (resp. larger) than ; is the expectation taken under following its variational distribution, denoted . The above suggest immediately an iterative algorithm, guaranteed to provide values for ELBO that cannot decrease as the updates are carried out.
The expected value can be difficult to derive analytically. Also, CAVI typically requires traversing the entire dataset at each iteration, which can be overly computationally expensive for large datasets. Gradient-based approaches, which can potentially scale up to large data – alluding here to recent Stochastic-Gradient-type methods – can be an effective alternative for ELBO optimisation. However, such algorithms have their own challenges, e.g. analytical derivation of gradients can often be problematic.
In real-world applications, hybrid methods combining Monte Carlo with recursive algorithms are common, e.g., Auto-Encoding Variational Bayes, Doubly-Stochastic Variational Bayes for non-conjugate inference, Stochastic EM (Beaumont et al., 2002; Sisson et al., 2007; Wei and Tanner, 1990). In VI, Monte Carlo is often used to estimate the expectation within CAVI or the gradient within derivative-driven methods. This is the case, e.g., for Stochastic VI (Hoffman et al., 2013) and Black-Box VI (BBVI) (Ranganath et al., 2014).
BBVI is used in this work as a representative of gradient-based VI algorithms. It allows carrying out VI over a wide range of complex models. The variational density is typically chosen within a parametric family, so finding in (1) is equivalent to determining an optimal set of parameters that characterize , , , , with . The gradient of ELBO w.r.t. the variational parameters equals
and can be approximated by black-box Monte Carlo estimators as, e.g.,
with , , . The approximated gradient of ELBO can then be used within a stochastic optimization procedure to update at the th iteration with
is a Robbins-Monro-type step-size sequence. As we will see in later sections, BBVI is accompanied by generic variance reduction methods, as the variability of (6) for complex models can be large.
Remark 1 (Hard Constraints).
Though a gradient-based VI method is often more straightforward to apply than a co-ordinate ascent one, – e.g. combined with the use of modern approaches for automatic differentiation (Kucukelbir et al., 2017) – co-ordinate ascent methods can still be important for models with hard constraints, where gradient-based algorithms are laborious to apply. Indeed, notice in the brief description we have given above for CAVI and BBVI, the two methodologies are structurally different, as CAVI does not necessarily require to be build up via the introduction of an exogenous variational parameter . Thus, in the context of a support for the target that involves complex constraints, a CAVI approach overcomes this issue naturally by blocking together the ’s responsible for the constraints. In contrast, introduction of the variational parameter creates sometimes severe complications in the development of the derivative-driven algorithm, as normalising constants that depend on are extremely difficult to calculate analytically (reminiscent of doubly-intractable problems in MCMC literature), and obtain their derivatives. Thus, a main argument spanning this work – and illustrated within it – is that co-ordinate-ascent-based VI methods have an important role to play amongst VI approaches for important classes of statistical models.
The main contributions of the paper are:
We discuss, and then apply a Monte Carlo CAVI (MC-CAVI) algorithm in a sequence of problems of increasing complexity, and study its performance. As the name suggests, MC-CAVI uses the Monte Carlo principle for the approximation of the difficult-to-compute conditional expectations, , within CAVI.
We show analytically that, under suitable regularity conditions, MC-CAVI will get close to a maximiser of ELBO with high probability.
We contrast MC-CAVI with MCMC and BBVI through simulated and real examples, some of which involve hard constraints; we demonstrate MC-CAVI’s effectiveness in an important application imposing such hard constraints, with real data in the context of Nuclear Magnetic Resonance (NMR) spectroscopy.
The rest of the paper is organised as follows. Section 2 presents briefly the MC-CAVI algorithm. It also provides – in a specified setting – an analytical result illustrating non-accumulation of Monte Carlo errors in the execution of the recursions of the algorithm. That is, with a probability arbitrarily close to 1, the variational solution provided by MC-CAVI can be as close as required to the one of CAVI, for big enough Monte Carlo sample size, regardless of the number of algorithmic steps. Section 3 shows two numerical examples, contrasting MC-CAVI with alternative algorithms. Section 4 presents an implementation of MC-CAVI in a real, complex, challenging posterior distribution arising in metabolomics. This is a practical application, involving hard constraints, chosen to illustrate the potential of MC-CAVI in this context. We finish with some conclusions in Section 5.
2 MC-CAVI Algorithm
2.1 Description of the Algorithm
We begin with a description of the basic CAVI algorithm. A double subscript will be used to identify block variational densities: (resp. ) will refer to the density of the th block (resp. all blocks but the th), after updates have been carried out on that block density (resp. updates have been carried out on the blocks preceding the th, and updates on the blocks following the th).
Step 0: Initialize probability density functions, .
Step : For , given , , execute:
For , update:
with taken w.r.t. .
Iterate until convergence.
Assume that the expectations , , for an index set , can be obtained analytically, over all updates of the variational density ; and that this is not the case for . Intractable integrals can be approximated via a Monte Carlo method. In paticular, for , one obtains samples from the current and uses the standard Monte Carlo estimate
Implementation of such an approach gives rise to MC-CAVI, described in Algorithm 1.
2.2 Applicability of MC-CAVI
We discuss here the class of problems for which MC-CAVI can be applied. It is desirable to avoid settings where the order of samples or statistics to be stored in memory increases with the iterations of the algorithm. To set-up the ideas we begin with CAVI itself. Motivated by the standard exponential class of distributions, we work as follows.
Consider the case when the target density – we omit reference to the data in what follows, as is fixed and irrelevant for our purposes (notice that is not required to integrate to ) – is assumed to have the structure,
for -dimensional constant vector , vector function , with some , and relevant scalar functions , ; is the standard inner product in . Also, we are given the choice of block-variational densities in (3). Following the definition of CAVI from Section 2.1 – assuming that the algorithm can be applied, i.e. all required expectations can be obtained analytically – the number of ‘sufficient’ statistics, say giving rise to the definition of will always be upper bounded by . Thus, in our working scenario, CAVI will be applicable with a computational cost that is upper bounded by a constant within the class of target distributions in (8) – assuming relevant costs for calculating expectations remain bounded over the algorithmic iterations.
Moving on to MC-CAVI, following the definition of index set in Section 2.1, recall that a Monte Carlo approach is required when updating for , . In such a scenario, controlling computational costs amounts to having a target (8) admitting the factorisations,
Once (9) is satisfied, we do not need to store all samples from , but simply some relevant averages keeping the cost per iteration for the algorithm bounded. We stress that the combination of characterisations in (8)-(9) is very general and will typically be satisfied for most practical statistical models.
2.3 Theoretical Justification of MC-CAVI
An advantageous feature of MC-CAVI versus derivative-driven VI methods is its structural similarity with Monte Carlo EM (MCEM). Thus, one can build on results in the MCEM literature to prove asymptotical properties of MC-CAVI; see e.g. Chan and Ledolter (1995); Booth and Hobert (1999); Levine and Casella (2001); Fort et al. (2003). To avoid technicalities related with working on general spaces of probability density functions, we begin by assuming a parameterised setting for the variational densities – as in the BBVI case – with the family of variational densities being closed under CAVI or (more generally) MC-CAVI updates.
Assumption 1 (Closedness of Parameterised Under Variational Update).
For the CAVI or the MC-CAVI algorithm, each density obtained during the iterations of the algorithm, , , is of the parametric form
for a unique , for some , for all .
(Let and .)
Under Assumption 1, CAVI and MC-CAVI can be corresponded to some well-defined maps , respectively, so that, given current variational parameter , one step of the algorithms can be expressed in terms of a new parameter (different for each case) obtained via the updates
For an analytical study of the convergence properties of CAVI itself and relevant regularity conditions, see e.g. (Bertsekas, 1999, Proposition 2.7.1), or numerous other resources in numerical optimisation. Expressing the MC-CAVI update – say, the th one – as
it can be seen as a random perturbation of a CAVI step. In the rest of this section we will explore the asymptotic properties of MC-CAVI. We follow closely the approach in Chan and Ledolter (1995) – as it provides a less technical procedure, compared e.g. to Fort et al. (2003) or other works about MCEM – making all appropriate adjustments to fit the derivations into the setting of the MC-CAVI methodology along the way. We denote by , , the -fold composition of , respectively, for .
is an open subset of , and the mappings , are continuous on .
If for some , then is a fixed point of . A given is called an isolated local maximiser of ELBO if there is a neighborhood of over which is the unique maximiser of ELBO.
Assumption 3 (Properties of Near a Local Maximum).
Let be an isolated local maximum of ELBO. Then,
is a fixed point of ;
there is a neighborhood of over which is a unique maximum, such that for any .
The critical technical assumption required for delivering the convergence results in the rest of this section is the following one.
Assumption 4 (Uniform Convergence in Probability on Compact Sets).
For any compact set the following holds: for any , there exists a positive integer , such that for all we have,
It is beyond the context of this paper to examine Assumption 4 in more depth. We will only stress that Assumption 4 is the sufficient structural condition that allows to extend closeness between CAVI and MC-CAVI updates in a single algorithmic step into one for arbitrary number of steps.
We continue with a definition.
A fixed point of is said to be asymptotically stable if,
for any neighborhood of , there is a neighborhood of such that for all and all , ;
there exists a neighbourhood of such that if .
For a neighborhood of , we can chose a sub-neighborhood as described in Assumption 3.
For some small , the set has a connected component, say , so that and ; we can assume that is compact. Assumption 3 implies that ; in fact, since is connected and contains , we have . This completes the proof of part (i) of
Part (ii): Let .
Consider and a convergent subsequence, , for increasing integers . Thus, , whereas . The two last limits imply , so that . We have shown that any convergent subsequence of has limit ; the compactness of gives that also . This completes the proof of part (ii) of Definition 1. ∎
The main result of this Section is as follows.
Let be as within the proof of Lemma 1. Define , for an small enough so that . For , we have , thus there are such that for all and for all with , we obtain that . Also, due to continuity and compactness, there is such that for all and for all such that , we have . Let and where denotes integer part. Notice that given , we have that . Consider the event . Under Assumption 4, we have that for arbitrarily close to 1. Within , we have that for some , or else for all , giving that , which is impossible. ∎
2.4 Stopping Criterion and Sample Size
The method requires the specification of the Monte Carlo size ands a stopping rule.
Principled - but Impractical - Approach
As the algorithm approaches a local maximum, changes in ELBO should be getting closer to zero. To evaluate the performance of MC-CAVI, one could, in principle, attempt to monitor the evolution of ELBO during the algorithmic iterations. For current variational distribution , assume that MC-CAVI is about to update with , where the addition of the second subscript at this point emphasizes the dependence of the new value for on the Monte Carlo size . Define,
If the algorithm is close to a local maximum, ELBO should be close to zero, at least for sufficiently large . Given such a choice of , an MC-CAVI recursion should be terminated once ELBO
is smaller than a user-specified tolerance threshold. Assume that the random variableELBO has mean and variance . Chebychev’s inequality implies that, with probability greater than or equal to , ELBO lies within the interval , for any real . Assume that one fixes a large enough . The choice of and of a stopping criterion should be based on the requirements:
, with a predetermined level of tolerance;
the effective range should include zero, implying that ELBO differs from zero by less than .
Requirement (i) provides a rule for the choice of – assuming applied over all , for in areas close to a maximiser –, and requirement (ii) a rule for defining a stopping criterion. Unfortunately, the above considerations – based on the proper term ELBO that VI aims to maximise – involve quantities that are typically impossible to obtain analytically or via some reasonably expensive approximation.
Similarly to MCEM, it is recommended that gets increased as the algorithm gets more stable. It is computationally inefficient to start with a large value of when the current variational distribution can be far from the maximiser. In practice, one may monitor the convergence of the algorithm by plotting statistics – of significance – of the variational distribution versus the number of iterations. We can declare that convergence has been reached when such traceplots show relatively small random fluctuations (due to the Monte Carlo variability) around a fixed value. At this point, one may terminate the algorithm or continue with a larger value of , which will further decrease the traceplot variability.
3 Numerical Examples – Simulation Study
In this section we illustrate MC-CAVI with two simulated examples. First, we apply MC-CAVI and CAVI on a simple model to highlight main features and implementation strategies. Then, we contrast MC-CAVI, MCMC, BBVI in a complex scenario with hard constraints.
3.1 Simulated Example 1
We generate data points from and fit the semi-conjugate Bayesian model
|Example Model 1|
Let be the data sample mean. In each iteration, the CAVI density function – see (4) – for
is that of the Gamma distribution, with
that of the normal distribution. and denote the relevant expectations under the current CAVI distributions for and respectively; the former are initialized at 0 – there is no need to initialise in this case. Convergence of CAVI can be monitored, e.g., via the sequence of values of and . If the change in values of these two parameters is smaller than, say, , we declare convergence. Figure 1 shows the traceplots of , .
Convergence is reached within 0.0017secs111A Dell Latitude E5470 with Intel(R) Core(TM) i5-6300U CPU@2.40GHz is used for all experiments in this paper., after precisely two iterations, due to the simplicity of the model. The resulted CAVI distribution for is , and for it is Gamma so that .
Assume now that was intractable. Since is required to update the approximate distribution of , an MCMC step can be employed to sample from to produce the Monte Carlo estimator . Within this MC-CAVI setting, will replace the exact during the algorithmic iterations. are initialised as in CAVI. For the first 10 iterations we set , and for the remaining ones, to reduce variability. We monitor the values of shown in Figure 2.
The Figure shows that MC-CAVI has stabilized after about iterations; algorithmic time was 0.0114secs. To remove some Monte Carlo variability, the final estimator of is produced by averaging the last 10 values of its traceplot, which gives , i.e. a value very close to the one obtained by CAVI. The estimated distribution of is , the same as with CAVI.
The performance of MC-CAVI depends critically on the choice . Let A be the value of in the burn-in period, B the number of burn-in iterations and C the value of after burn-in. Figure 3 shows trace plots of under different settings of the triplet A-B-C.
As with MCEM, should typically be set to a small number at the beginning of the iterations so that the algorithm can reach fast a region of relatively high probability. should then be increased to reduce algorithmic variability close to the convergence region.
3.2 Variance Reduction for BBVI
In non-trivial applications, the variability of the initial estimator within BBVI in (6) will typically be large, so variance reduction approaches such as Rao-Blackwellization and control variates (Ranganath et al., 2014) are also used. Rao-Blackwellization (Casella and Robert, 1996) reduces variances by analytically calculating conditional expectations. In BBVI, within the factorization framework of (3), where , and recalling identity (5) for the gradient, a Monte Carlo estimator for the gradient with respect to , , can be simplified as
with , , and,
Depending on the model at hand, term can be obtained analytically or via a double Monte Carlo procedure (for estimating , over all ) – or a combination of thereof. In BBVI, control variates (Ross, 2002) can be defined on a per-component basis and be applied to the Rao-Blackwellized noisy gradients of ELBO in (11) to provide the estimator,
for the control,
where , denote the th co-ordinate of the vector-valued functions , respectively, given below,
3.3 Simulated Example 2: Model with Hard Constraints
In this section, we discuss the performance and challenges of MC-CAVI, MCMC, BBVI for models where the support of the posterior – thus, also the variational distribution – involves hard constraints.
Here, we provide an example which offers a simplified version of the NMR problem discussed in Section 4 but allows for the implementation of BBVI, as the involved normalising constants can be easily computed. Moreover, as with other gradient-based methods, BBVI requires to tune the step-size sequence in (7), which might be a laborious task, in particular for increasing dimension. Although there are several proposals aimed to optimise the choice of (Bottou, 2012; Kucukelbir et al., 2017), MC-CAVI does not face such a tuning requirement.
We simulate data according to the following scheme: observations are generated from , , with , , , . We fit the following model:
|Example Model 2|
We use a standard Metropolis-within-Gibbs. We set , and . Notice that we have the full conditional distributions,
(Above, and in similar expressions written in the sequel, equality is meant to be properly understood as stating that ‘the density on the left is equal to the density of the distribution on the right’.) For each , , the full conditional is,
where is the density of and its cdf. The Metropolis-Hastings proposal for is a uniform variate from .
For MC-CAVI, the logarithm of the joint distribution is given by,
under the constraints,
To comply with the above constraints, we factorise the variational distribution as,
Here, for the relevant iteration , we have,
The quantity used in the second line above means that the expectation is considered under and (independently) .
Then, MC-CAVI develops as follows:
Step 0: For , initialize , , .
Step : For , given , , execute:
For , apply an MCMC algorithm – with invariant law – consisted of a number, , of Metropolis-within-Gibbs iterations carried out over the relevant full conditionals,
As with the full conditional within the MCMC sampler, we use a uniform proposal at the Metropolis-Hastings step applied for . For each , the iterations begin from the -values obtained at the end of the corresponding MCMC iterations at step , with very first initial values being . Use the samples to obtain and .
Update the variational distribution for ,