Stochastic variational inference (VI) is becoming a cornerstone of modern machine learning research as it reduces Bayesian inference to a stochastic optimization problem that can be automatically solved using deep learning frameworksHoffman et al. (2013); Ranganath et al. (2014); Rezende et al. (2014). Particle-based VI methods have recently gained substantial popularity with the introduction of Stein variational gradient descent (SVGD) Liu and Wang (2017). In a particle-based variational method the posterior distribution is approximated as the stationary distribution of a system of interacting particles Dai et al. (2016); Liu and Wang (2017). In SVGD this dynamics can be decomposed into a steepest ascent term and a repulsive interaction between particles that avoids the collapse of the approximate posterior into its modes. While the dynamical system has been proven to converge to the exact posterior at the limit of infinitely many particles, the finite particle dynamics suffers from the fact that the repulsive force does not depend on the true posterior and is instead induced by an arbitrarily chosen kernel function Liu and Wang (2017). This is particularly problematic when the dimensionality of the latent space is high since the performance of kernel methods rapidly degrade with the dimensionality. Intuitively, the repulsive effect should not be dependent on an arbitrary kernel. Instead, it should reflect an ’explaining away’ phenomenon, where regions of the posterior that are already captured by a particle should not influence the dynamics of the other particles.
Optimal transport theory is becoming another fundamental part of modern machine learning research Cuturi (2013); Arjovsky et al. (2017); Gulrajani et al. (2017); Tolstikhin et al. (2018). Recently, optimal transport theory has been used to obtain a new flexible form of black-box stochastic variational Bayesian inference that replaces the usual Kullback-Leibler (KL) divergence with Wasserstein divergences Ambrogioni et al. (2018)
. In this paper we introduce a particle-based form of Wasserstein variational inference based on the theory of semi-discrete optimal transport. We call this method Wasserstein variational gradient descent (WVGD). We approximate the posterior distribution by minimizing a semi-discrete optimal transport divergence. In this formulation mode collapse is avoided thanks to the explaining away phenomenon without resorting to repulsive forces. Importantly, the optimal transport formulation does not only provide the particle approximation but also a set of optimal transportation densities that map each particle to a continuous distribution that models a part of the posterior. Once we assume a parametric form, these transportation densities behave like local version of the parametric models used in conventional stochastic VI, each modeling the probability density in a region around the particle. Therefore, our approach can also be seen as a form of ensemble VI.
2 Related work
The use of optimal transport divergences in variational Bayesian inference problems was introduced in Ambrogioni et al. (2018). However, Wasserstein variational inference is only applicable in case of joint-contrastive (amortized) inference problems, while WVGD can also be used without inference amortization. The WVGD method is a form of particle-based VI. Particle-based VI can be seen as an intermediate between sampling methods such as MCMC and conventional VI as it combines the non-parametric nature of samplers with an optimization point of view. The most popular particle-based VI algorithm is SVGD Liu and Wang (2017)
. SVGD has been applied in several domains including reinforcement learningHaarnoja et al. (2017); Liu et al. (2017) and meta learning Kim et al. (2018); Feng et al. (2017). The theory behind our variational approach is closely related to optimal transport clustering Laclau et al. (2017); Mi et al. (2018). In these clustering approaches a series of medoids are trained to minimize the Wasserstein distance with a target distribution.
3 Particle-based variational inference
In an approximate Bayesian inference problem, the aim is to approximate the posterior distribution of a latent variable given a set of observations . The posterior distribution has the following form:
Unfortunately, the normalization constant is usually intractable and has to be approximated. The basic idea behind VI is to approximate the real posterior with a more tractable family of parameterized densities
Usually has a parametric form that can range from a simple diagonal Gaussian to an highly complex distribution induced by a deep generative model Rezende and Mohamed (2015); Kingma et al. (2016); Dinh et al. (2016). In a (weighted) particle-based variational framework, the approximate distribution is a linear combination of delta functions:
where is the coordinate of the -th particle and is the weight associated with the particle. Ideally we would like to find the optimal set of particles by minimizing the divergence between and . Unfortunately the KL divergence is not defined for distributions such as that are not absolutely continuous with respect to the Lebesgue measure. SVGD circumvents this problem by defining an interacting particle dynamics whose asymptotic distribution at the thermodynamic limit () converges to the under the topology induced by the KL divergence. The problem is well-posed since the infinite ensemble distribution is absolutely continuous Liu (2017). However, the SVGD dynamics is not necessarily optimal in practical applications when only a finite number of particles are used. Specifically, in SVGD individual particles cannot explain away any finite amount of probability mass since in the ideal asymptotic ensemble each particle only contributes infinitesimally to the overall posterior. Consequently, when is finite the proper coverage of the posterior depends on the choice of a kernel function that regulates the repulsive interactions between the particles.
4 Semi-discrete optimal transport
Optimal transport divergences measure the deviation between two distributions as the cost of optimally transporting a distribution to the other. An optimal transport divergence is defined by the following optimization problem:
is the set of joint distributions havingand as respectively the first and the second marginal. An important advantage of optimal transport divergences is that they can be used to compare discrete and continuous distributions. This form of optimal transport is called semi-discrete and can be formulated as follows Peyré and Cuturi (2017):
where is the set of conditional distributions that fulfill the marginalization constraint:
5 Particle-based inference with semi-discrete optimal transport
Our aim is to obtain a particle-based variational approximation by minimizing the optimal transport divergence between a weighted set of particles and the posterior distribution. Using semi-discrete optimal transport, we can formulate this optimal finite particle approximation of the posterior as the solution of the following joint optimization problem:
where the notion of optimality depends on the cost function . The transportation densities map each particle to a component of the posterior distribution. In other words, the transportation densities can be seen as emission models that spread the probability mass centered in a particle to its surroundings.
6 Formal solution of the optimal transport problem
The solution of the semi-discrete optimal transport problem has several interesting properties. The support sets of the transportation densities are elements of a tessellation of the space into non-overlapping sets. In the general case, these sets can be found using computational geometry algorithms (Aurenhammer, 1987) and quasi-Newton solvers Mérigot (2011). Fortunately, the problem of finding these cells greatly simplify if we simultaneously optimize the transportation densities and the weights of the discrete distribution as stated in the following theorem:
Theorem 1 (Formal solution of the optimal transport problem).
The optimization problem
is solved by the following tessellation:
with optimal transportation densities obtained by restricting to each set of the tessellation:
where is the indicator function of the set . The optimal weights are given by the following expression:
It is easy to see that transporting each point to the particle such that is the smallest leads to the smallest possible transportation cost. This implies that the -th transportation density is supported on the set which (up to sets of zero measure) is disjoint from the supports of the other transportation densities. The marginalization constraint then imposes that, in the set , is proportional to . In a general semi-discrete optimal transport problem this solution is not allowed since is not necessarily equal to , leading to a violation of the marginalization constraint. However, the solution is always possible in this joint-optimization since we can simply set to be equal to .
6.1 Deriving the gradient
The result at the end of the last section gives a closed-form solution for the particle-based variational loss:
In order to use this formula as the basis in a stochastic gradient descent algorithm we need to prove its differentiability. The loss in Eq.6.1 depends on both directly through the cost and indirectly through the boundary of the indicator functions. The problem of differentiating functions of this form is well-known in Eulerian fluid dynamics Leal (2007). The following lemma is a direct consequence of the -th dimensional generalization of the famous Reynolds transport theorem:
Lemma 2 (Differentiability).
If the posterior density is continuous and the cost function is differentiable with continuous partial derivatives, the loss function in Eq. 6.1 is differentiable.
Using Reynolds transport theorem, we can write the partial derivative with respect to the -th component of the -th particle as follows:
The second integral in this expression is defined over the frontier of . The differential gives the component of the velocity of the boundary
that is parallel to the surface normal vector. The partial derivative is clearly continuous since it is a sum of integrals of continuous functions. The statement follows from the differentiability theorem. ∎
The partial derivatives in Eq. 12 suggest that the gradient descent dynamics of the particles is driven by two terms. The first term drives the -th particle towards the centroid of its own set . Under this dynamics, the particles interact only by explaining away parts of the posterior, thereby screening the other particles from the attractive forces contained in their own set. The second term is defined at the frontiers of the sets and seems to imply the existence of more direct interactions. However, this term is actually zero as stated in the following theorem:
Theorem 3 (Gradient).
The gradient of Eq. 6.1 with respect to the position of the -th particle is given by the following expression:
The partial derivative in Eq. 12 can be rewritten as follows:
The first integral in this expression gives the gradient in the statement. Consequently, we need to show that the term
is equal to zero. It is easy to see that each boundary is the intersection of two-particles boundaries:
where . Consequently, Eq. 15 can be decomposed into a sum of integrals over subsets of these two-particles boundaries:
where the two terms of this expression are the two integrals at the opposite side of a boundary. The velocities and are equal since they describe the motion of the same boundary. The two costs and are also equal on the boundary by definition. Finally, the unit length normal elements and have opposite direction as a set ends where the other begins. Consequently, the two integrals have equal magnitude and opposite sign and the final result is equal to zero.
7 Wasserstein variational gradient descent
We can now introduce the WVGD algorithm. The gradient descent dynamics of the system of particles is given by the following system of differential equations:
It is instructive to study some special cases. If we use a single particle, the loss function becomes:
The minimum of this loss is the -medoid of the posterior distribution Park et al. (2006). When is the squared Euclidean distance, the one-particle dynamics converges to the posterior mean. This behavior differs from SVGD, where the one-particle case reduces to the gradient ascent of the log posterior Liu and Wang (2017). In the multi-particle case particles interact by screening part of the posterior density by moving the boundaries of the sets . Consider the two-particles case in one dimension with . The velocity field of particle one can be decomposed into two terms:
The first term is the global attractive effect of the posterior density on particle one while the second term is an apparent repulsive interaction between particle one and particle two:
This repulsive interaction is a function of the posterior mass contained in the half-axis covered by the second particle. This is in stark contrast with SVGD where the repulsive interactions are only a function of the location of the particles.
8 Adaptive importance sampling and ensemble variational inference
Since we cannot directly sample from the posterior, to obtain an Monte Carlo estimate of the gradient we resort to importance sampling. The resulting (discretized) stochastic dynamics has the following form:
where is the learning rate, is the importance sampling distribution of the -th particle parameterized by , is an importance weight and
is a normalization term. The variance of the gradient can vary greatly depending on the choice of the sampling distribution and, since the distributionsshift during the time development of the system, it is important to update the sampling distributions together with the positions of the particles Karamchandani et al. (1989). A flexible choice for the adaptive importance sampling dynamics is to descend the gradient of the reverse KL divergence between the parameterized sampling distributions and the transportation densities:
An interesting side effect of this choice is that we simultaneously obtain the positions of the particles and tractable approximations for the transportation densities. Therefore, the resulting algorithm provides both the -particles approximation and an associated continuous approximation that can be seen as a form of ensemble VI. In the one-particle case the algorithm reduces to conventional VI.
We parameterize the sampling distribution as the restriction of a tractable distribution to the set :
For example, if is a multivariate Gaussian, the sampling distribution will be a truncated Gaussian. Note that this restriction is necessary since the KL divergence can only be defined between distributions that share the same support set and the optimal transportation density is supported on . The gradient of the KL divergence is given by:
where is the entropy of
. An unbiased estimate of the gradient of an expectation with respect tocan be obtained using the standard reparameterization trick Rezende et al. (2014) since the set does not depend on . The only difference with regular evidence lower bound (ELBO) maximization is that samples are rejected if they fall outside . A problem in computing the gradient of the entropy is that we do not have a closed-form expression for the normalization constant . However, an expression for the gradient of the entropy is given by the following theorem:
Theorem 4 (Gradient of the entropy).
The gradient of with respect to the position of the parameters is given by the following expression:
8.1 Partitioned evidence lower bound
Using the ensemble of transportation densities, we can obtain a lower bound of the model evidence. Consider the partition induced by the sets . We define the partitioned ELBO (PELBO) as follows:
where is the optimal weight defined in Theorem 1. Using the Jensen’s inequality, it is easy to show that the PELBO is an evidence lower bound:
where is given by . Note that the weights are not available in closed-form and need to be approximated by importance sampling.
We validate our WVGD method on a simple mixture of Gaussians problem. Random target posterior distributions were obtained by sampling the mixture (), location () and scale parameters (
of a mixture of five Gaussian distributions. We quantified the performance of WVGD and SVGD using the squared error between posterior density and variational approximation. The cost function in WVGD was the squared euclidean distance. We used a a Gaussian kernel for SVGD. The bandwidth was adapted during training using the formula, where md is the median of the distances between the particles. This euristic was suggested in as suggested in Liu and Wang (2017)
. The SVGD density was obtained by using kernel density estimation while the WVGD density was given by its transportation densities. In order to have a fair comparison, we fit the bandwidth parameter of the kernel density estimation individually on each groundtruth posterior, this leads to a small bias in favor of SVGD. Furthermore, since the WVGD method has three times more parameters than SVGD, we compared the error of WVGD with the error of SVGD with three times as many particles. The results shown in Fig.1 show that WVGD gives consistently better performance than SVGD.
In this paper we introduced WVGD as a new form of particle-based variational inference based on optimal transport theory. An interesting feature of our method is that it also provides an ensemble of truncated continuous densities that are variational approximations of the optimal transportation densities. In our experiments we showed that WVGD has higher performance than SVGD in a mixture of Gaussians problem.
- Hoffman et al.  M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
Ranganath et al. 
R. Ranganath, S. Gerrish, and D. Blei.
Black box variational inference.
International Con- ference on Artificial Intelligence and Statistic, 2014.
Rezende et al. 
D. J. Rezende, S. Mohamed, and D. Wierstra.
Stochastic backpropagation and approximate inference in deep generative models.International Conference on Machine Learning, 2014.
- Liu and Wang  Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. NIPS, 2017.
- Dai et al.  B. Dai, N. He, H. Dai, and L. Song. Provable Bayesian inference via particle mirror descent. AISTATS, 2016.
- Cuturi  M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. NIPS, 2013.
- Arjovsky et al.  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. ICML, 2017.
- Gulrajani et al.  I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville. Improved training of Wasserstein GANs. NIPS, 2017.
- Tolstikhin et al.  I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf. Wasserstein auto-encoders. ICLR, 2018.
- Ambrogioni et al.  L. Ambrogioni, U. Güçlü, Y. Güçlütürk, M. Hinne, E. Maris, and M.A.J. van Gerven. Wasserstein variational inference. NIPS, 2018.
- Haarnoja et al.  T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. ICML, 2017.
- Liu et al.  Y. Liu, P. Ramachandran, Q. Liu, and J. Peng. Stein variational policy gradient. UAI, 2017.
- Kim et al.  T. Kim, J. Yoon, O. Dia, S. Kim, Y. Bengio, and S. Ahn. Bayesian model-agnostic meta-learning. NIPS, 2018.
- Feng et al.  Y. Feng, D. Wang, and Q. Liu. Learning to draw samples with amortized stein variational gradient descent. UAI, 2017.
- Laclau et al.  C. Laclau, I. Redko, B. Matei, Y. Bennani, and V. Brault. Co-clustering through optimal transport. arXiv preprint arXiv:1705.06189, 2017.
- Mi et al.  L. Mi, W. Zhang, X. Gu, and Y. Wang. Variational wasserstein clustering. arXiv preprint arXiv:1806.09045, 2018.
- Rezende and Mohamed  D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
- Kingma et al.  D. P. Kingma, R. Salimans, T.and Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. NIPS, 2016.
- Dinh et al.  L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
- Liu  Q. Liu. Stein variational gradient descent as gradient flow. NIPS, 2017.
- Peyré and Cuturi  G. Peyré and M. Cuturi. Computational Optimal Transport. 2017.
- Aurenhammer  F. Aurenhammer. Power diagrams: properties, algorithms and applications. Journal on Computing, 16(1):78–96, 1987.
- Mérigot  Q. Mérigot. A multiscale approach to optimal transport. Computer Graphics Forum, 30(5):1583–1592, 2011.
- Leal  L. G. Leal. Advanced Transport Phenomena: Fluid Mechanics and Convective Transport Processes. Cambridge University Press, 2007.
Park et al. 
H. Park, J. Lee, and C. Jun.
A k-means-like algorithm for k-medoids clustering and its performance.ICCIE, 2006.
- Karamchandani et al.  A Karamchandani, P Bjerager, and CA Cornell. Adaptive importance sampling. Structural Safety and Reliability, 1989.