Calculating (dis)similarities between statistical mixtures is a core primitive often met in statistics, machine learning, signal processing, and information fusion, among others. However, the usual information-theoretic Kullback-Leibler (KL) divergence (also known as relative entropy) or the $f$-divergences between statistical mixtures do not admit closed-form formulas, and are in practice approximated by costly Monte Carlo stochastic integration.
To tackle this computational tractability problem, two research directions have been considered in the literature: The first line of research consists in proposing distances between mixtures that admit closed-form formulas (e.g., the Cauchy-Schwarz divergence or the Jensen quadratic Rényi divergence). The second line of research consists in lower and upper bounding the $f$-divergences between mixtures. This is tricky when considering bounded divergences like the Total Variation (TV) distance or the Jensen-Shannon (JS) divergence, which are upper bounded by $1$ and $\log 2$, respectively.
When dealing with probability densities, two main classes of statistical distances have been widely studied in the literature:
The Information-Geometric (IG) invariant $f$-divergences (characterized as the class of separable distances), and
The Optimal Transport (OT) distances (e.g., the Wasserstein distances).
In general, computing a closed-form formula for the OT between parametric distributions is difficult. A closed-form formula is known for elliptical distributions
(which include the multivariate Gaussian distributions), and the OT of multivariate continuous distributions can be calculated from the OT of their copulas.
The geometry induced by the distance differs in these two OT/IG cases, for example when considering location-scale families (or multivariate elliptical distributions).
In this paper, we build on the seminal work of Liu and Huang,
who proposed a novel family of statistical distances for statistical mixtures by solving linear programs between
component weights of mixtures, where the elementary distance between any two mixture components is prescribed. They proved that their distance between mixtures (which we term the MCOT distance, for Mixture Component Optimal Transport) is a metric whenever the elementary distance between mixture components is a metric. This framework also applies to semi-parametric mixtures obtained from Kernel Density Estimators (KDEs).
We describe our main contributions as follows:
We define the Chain Rule Optimal Transport (CROT) distance in Definition 1, and prove that it yields a metric whenever the distance between conditional distributions is a metric in §2.2 (Theorem 3). The CROT distance extends the Wasserstein/EMD distances and the MCOT distance between statistical mixtures. We further sketch how to build recursively hierarchical families of CROT distances.
In §4, experiments quantitatively highlight the performance of the CROT distance as an upper bound on the total variation distance and on a generalization of the square root of the Jensen-Shannon divergence.
2 The Chain Rule Optimal Transport (CROT) distance
We define a novel class of distances between statistical multivariate distributions. Recall the basic chain rule
factorization of a joint probability distribution:
$$p(x, y) = p(x)\, p(y|x),$$
where $p(x)$ is called the marginal probability, and $p(y|x)$ is termed the conditional probability. Let $\mathcal{M}$ and $\mathcal{C}$ denote the manifolds of marginal probability densities and conditional probability densities, respectively.
For example, for latent models like statistical mixtures or hidden Markov models [42, 39], $y$ plays the role of the observed variable while $x$ denotes the hidden variable (unobserved, so that inference has to tackle incomplete data, say, using the EM algorithm).
First, we state the generic definition of the Chain Rule Optimal Transport
distance between joint distributions $p(x, y) = p(x)\, p(y|x)$ and $q(x', y') = q(x')\, q(y'|x')$ as follows:
Definition 1 (CROT distance).
Given two multivariate distributions $p(x, y) = p(x)\, p(y|x)$ and $q(x', y') = q(x')\, q(y'|x')$, we define the Chain Rule Optimal Transport (CROT) distance as follows:
$$\mathrm{CROT}_D(p, q) := \inf_{r \in \Gamma(p, q)} \int\!\!\int r(x, x')\, D\big(p(\cdot|x), q(\cdot|x')\big)\, \mathrm{d}x\, \mathrm{d}x',$$
where $D$ is a ground distance defined on the conditional density manifold (e.g., the Total Variation (TV) distance), and $\Gamma(p, q)$ is the set of all probability measures on $\mathcal{X} \times \mathcal{X}$ with marginals $p(x)$ and $q(x')$, satisfying the following constraints:
$$\int r(x, x')\, \mathrm{d}x' = p(x), \qquad \int r(x, x')\, \mathrm{d}x = q(x').$$
When the ground distance $D$ is clear from the context, we write $\mathrm{CROT}$ as a shortcut for $\mathrm{CROT}_D$. Since $D \geq 0$ and since the independent coupling $r(x, x') = p(x)\, q(x')$ is a feasible transport solution, we get the following upper bound:
Property 2 (Upper bounds).
The CROT distance is upper bounded by
$$\mathrm{CROT}_D(p, q) \leq \int\!\!\int p(x)\, q(x')\, D\big(p(\cdot|x), q(\cdot|x')\big)\, \mathrm{d}x\, \mathrm{d}x'.$$
Let us notice that the CROT distance generalizes two distances met in the literature:
Remark 2.1 (CROT generalizes Wasserstein/EMD).
When the ground distance between conditional distributions amounts to a distance $d(x, x')$ between the conditioning values (e.g., Dirac conditional distributions compared with a metric between their locations), CROT recovers the Wasserstein/EMD optimal transport between the marginals.
Remark 2.2 (CROT generalizes MCOT).
When both $p(x)$ and $q(x')$ are (finite) categorical distributions, we recover the distance formerly defined by Liu and Huang, which we term the MCOT distance in the remainder (for Mixture Component Optimal Transport).
2.2 CROT is a metric when the ground distance is a metric
Theorem 3 (CROT metric).
$\mathrm{CROT}_D$ is a metric whenever $D$ is a metric.
We prove that $\mathrm{CROT}_D$ satisfies the following axioms of metric distances:
- Non-negativity. As $D \geq 0$, we have $\mathrm{CROT}_D(p, q) \geq 0$.
- Law of indiscernibles.
If $\mathrm{CROT}_D(p, q) = 0$, as $D$ is a metric, then the optimal transport plan density $r$ is concentrated on the region of $\mathcal{X} \times \mathcal{X}$ where $p(\cdot|x) = q(\cdot|x')$. We therefore have $p(x, y) = q(x, y)$ almost everywhere,
where $r \in \Gamma(p, q)$ s.t. $\int r(x, x')\, \mathrm{d}x' = p(x)$ and $\int r(x, x')\, \mathrm{d}x = q(x')$.
- Triangle inequality.
The proof of the triangle inequality is not straightforward.
where $\Gamma(p, q, s)$ denotes the set of all probability measures on $\mathcal{X} \times \mathcal{X} \times \mathcal{X}$ with marginals $p$, $q$, and $s$. ∎
3 CROT for statistical mixtures and Sinkhorn CROT
Consider two finite statistical mixtures $m(x) = \sum_{i=1}^{k} \alpha_i\, p_i(x)$ and $m'(x) = \sum_{j=1}^{k'} \alpha'_j\, p'_j(x)$, not necessarily homogeneous nor of the same type. The Mixture Component Optimal Transport (MCOT) distance proposed by Liu and Huang amounts to solving a Linear Program (LP) with the following objective function to minimize:
$$\sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, D(p_i : p'_j),$$
satisfying the following constraints:
$$w_{ij} \geq 0, \qquad \sum_{j=1}^{k'} w_{ij} = \alpha_i, \qquad \sum_{i=1}^{k} w_{ij} = \alpha'_j.$$
By defining $U(\alpha, \alpha')$ to be the set of non-negative $k \times k'$ matrices $w$ with $w \mathbf{1}_{k'} = \alpha$ and $w^\top \mathbf{1}_{k} = \alpha'$ (the transport polytope), we get the equivalent compact definition of MCOT/CROT:
$$\mathrm{CROT}_D(m, m') = \min_{w \in U(\alpha, \alpha')} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, D(p_i : p'_j).$$
When the ground distance is asymmetric, we shall use the ’:’ notation instead of the ’,’ notation for separating arguments.
In general, the LP problem (with $kk'$ variables, $kk'$ inequality constraints, and $k + k'$ equality constraints, of which $k + k' - 1$ are independent) delivers an optimal soft assignment of mixture components with at most $k + k' - 1$ nonzero coefficients in matrix $w$.¹ The complexity of linear programming in $n$ variables with $L$ input bits using Karmarkar's interior point method is polynomial, in $O(n^{3.5} L)$.
¹ An LP in $n$ dimensions has its solution located at a vertex of a polytope, described by the intersection of hyperplanes (linear constraints).
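As an illustration, the MCOT linear program can be solved directly with an off-the-shelf LP solver. The following minimal sketch (our own, not the authors' implementation) assumes SciPy's `linprog` with the HiGHS backend and uses an arbitrary toy cost matrix:

```python
import numpy as np
from scipy.optimize import linprog


def mcot(alpha, beta, M):
    """Solve the MCOT LP: minimize sum_ij w_ij * M_ij subject to
    row sums of w equal to alpha and column sums equal to beta."""
    k, kp = M.shape
    A_eq = np.zeros((k + kp, k * kp))
    for i in range(k):                     # sum_j w_ij = alpha_i
        A_eq[i, i * kp:(i + 1) * kp] = 1.0
    for j in range(kp):                    # sum_i w_ij = beta_j
        A_eq[k + j, j::kp] = 1.0
    b_eq = np.concatenate([alpha, beta])
    res = linprog(M.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(k, kp)


# Toy example: two 2-component mixtures; M_ij stands for D(p_i, q_j).
alpha = np.array([0.5, 0.5])
beta = np.array([0.3, 0.7])
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
cost, w = mcot(alpha, beta, M)  # optimal cost is 0.2 on this instance
```

The optimal plan here puts $w_{11} = 0.3$, $w_{12} = 0.2$, $w_{22} = 0.5$, illustrating the sparse soft assignment with at most $k + k' - 1 = 3$ nonzero coefficients.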
Observe that we necessarily have:
$$w_{ij} \leq \alpha_i,$$
and similarly that:
$$w_{ij} \leq \alpha'_j.$$
Note that $\mathrm{CROT}_D(m, m) = 0$ since the diagonal assignment $w_{ij} = \alpha_i \delta_{ij}$ is feasible with zero cost, where $\delta_{ij}$ denotes the Kronecker symbol: $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.
We can interpret MCOT as a discrete optimal transport between (non-embedded) histograms. When the ground distances are stored in a $k \times k'$ cost matrix $M$ with $M_{ij} = D(p_i : p'_j)$, the transport polytope is the polyhedral set of non-negative matrices:
$$U(\alpha, \alpha') = \left\{ w \in \mathbb{R}_{\geq 0}^{k \times k'} \,:\, w \mathbf{1}_{k'} = \alpha,\; w^\top \mathbf{1}_{k} = \alpha' \right\},$$
and the MCOT objective reads $\min_{w \in U(\alpha, \alpha')} \langle w, M \rangle_F$, where $\langle A, B \rangle_F = \mathrm{tr}(A^\top B)$ is the Frobenius inner product of matrices, and $\mathrm{tr}$ the matrix trace. This OT can be calculated using the network simplex in $O(n^3 \log n)$ time.
Cuturi showed how to relax the objective function in order to get fast calculation using the Sinkhorn divergence:
$$S_\lambda(m, m') := \langle w^\lambda, M \rangle_F, \qquad w^\lambda = \mathop{\mathrm{argmin}}_{w \in U(\alpha, \alpha')} \langle w, M \rangle_F - \frac{1}{\lambda}\, h(w),$$
where $h(w) = -\sum_{i,j} w_{ij} \log w_{ij}$ denotes the entropy of the transport matrix. The KL divergence between two matrices $w$ and $w'$ is defined by
$$\mathrm{KL}(w : w') = \sum_{i,j} w_{ij} \log \frac{w_{ij}}{w'_{ij}},$$
with the convention that $0 \log 0 = 0$. The Sinkhorn divergence is calculated using the equivalent dual Sinkhorn divergence by using matrix scaling algorithms (e.g., the Sinkhorn-Knopp algorithm).
Because the minimization is performed over a restricted (entropy-regularized) choice of plan, the entropic plan $w^\lambda$ is feasible but in general suboptimal for the LP, so we have
$$\mathrm{CROT}_D(m, m') \leq S_\lambda(m, m').$$
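The Sinkhorn-Knopp scaling loop itself is only a few lines. Here is a minimal numpy sketch of the dual Sinkhorn computation (the variable names and the toy cost matrix are ours, not from the paper):

```python
import numpy as np


def sinkhorn_crot(alpha, beta, M, lam=10.0, n_iter=1000):
    """Entropy-regularized OT via Sinkhorn-Knopp matrix scaling.
    Returns the transport cost <w, M> of the scaled plan w."""
    K = np.exp(-lam * M)                 # elementwise Gibbs kernel
    u = np.ones_like(alpha)
    for _ in range(n_iter):
        v = beta / (K.T @ u)             # scale to match column marginals
        u = alpha / (K @ v)              # scale to match row marginals
    w = u[:, None] * K * v[None, :]      # scaled transport plan
    return float((w * M).sum()), w


alpha = np.array([0.5, 0.5])
beta = np.array([0.3, 0.7])
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])
cost, w = sinkhorn_crot(alpha, beta, M)
# cost upper bounds the exact LP value (0.2 on this toy instance)
```

Since the scaled plan is feasible (up to numerical tolerance on the column marginals) but not the LP optimum, its cost sits at or above the exact CROT value, as stated above.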
3.1 Upper bounding statistical distances between mixtures with CROT
First, let us report the basic upper bound for MCOT mentioned earlier in Property 2. The objective function is upper bounded by:
$$\mathrm{CROT}_D(m, m') \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} \alpha_i\, \alpha'_j\, D(p_i : p'_j).$$
Now, when the conditional density distance $D$ is separately convex (i.e., convex in each of its two arguments), we get the following Separate Convexity Upper Bound (SCUB):
$$D(m : m') \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} \alpha_i\, \alpha'_j\, D(p_i : p'_j).$$
For example, norm-induced distances and $f$-divergences are separately convex distances.
For the particular case of the KL divergence and mixtures with the same number of components ($k = k'$), we have
$$\mathrm{KL}(m : m') \leq \mathrm{KL}(\alpha : \alpha') + \sum_{i=1}^{k} \alpha_i\, \mathrm{KL}(p_i : p'_i).$$
Since this holds for any permutation $\sigma$ of the mixture components, we can tighten this upper bound by minimizing over all permutations:
$$\mathrm{KL}(m : m') \leq \min_{\sigma} \left[ \mathrm{KL}(\alpha : \alpha'_\sigma) + \sum_{i=1}^{k} \alpha_i\, \mathrm{KL}(p_i : p'_{\sigma(i)}) \right].$$
Now, let us further rewrite $m = \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, p_i$ with $\sum_{j} w_{ij} = \alpha_i$, and $m' = \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, p'_j$ with $\sum_{i} w_{ij} = \alpha'_j$. That is, we can interpret $m$ and $m'$ as mixtures of $kk'$ (redundant) components $p_i$ and $p'_j$ sharing the same weight matrix $w$, and apply the upper bound of Eq. 17 for the "best split" of matching mixture components (the KL term between the weights vanishes since both redundant mixtures share the same weights).
Then it follows that
$$\mathrm{KL}(m : m') \leq \min_{w \in U(\alpha, \alpha')} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \mathrm{KL}(p_i : p'_j) = \mathrm{CROT}_{\mathrm{KL}}(m, m').$$
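A quick numerical sanity check of this bound (a toy setup of our own, not an experiment from the paper): for two univariate GMMs sharing the same weights, the identity matching is a feasible transport plan, so the weighted sum of component-wise KLs upper bounds CROT-KL, which in turn should upper bound a Monte Carlo estimate of $\mathrm{KL}(m : m')$.

```python
import numpy as np

rng = np.random.default_rng(0)


def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) : N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5


def gmm_pdf(x, w, mu, s):
    z = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return z @ w


# Two mixtures with identical weights: identity matching is feasible,
# so CROT-KL <= sum_i alpha_i KL(p_i : p'_i).
w1, mu1, s1 = np.array([0.7, 0.3]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
w2, mu2, s2 = np.array([0.7, 0.3]), np.array([0.2, 1.2]), np.array([1.0, 1.0])
crot_kl_ub = sum(a * kl_gauss(mu1[i], s1[i], mu2[i], s2[i])
                 for i, a in enumerate(w1))

# Monte Carlo estimate of KL(m : m') with samples drawn from m.
n = 400000
comp = rng.choice(2, size=n, p=w1)
x = rng.normal(mu1[comp], s1[comp])
mc_kl = float(np.mean(np.log(gmm_pdf(x, w1, mu1, s1) / gmm_pdf(x, w2, mu2, s2))))
```

With these parameters, $\sum_i \alpha_i\, \mathrm{KL}(p_i : p'_i) = 0.02$, and the Monte Carlo estimate of the mixture KL lands at or below this value (up to sampling noise), as the bound predicts.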
Thus CROT allows us to upper bound the KL divergence between mixtures. The technique of rewriting mixtures as mixtures of redundant components bears some resemblance to the variational upper bound on the KL divergence between mixtures proposed by Hershey and Olsen, which requires iterating an update of the variational upper bound until convergence.
Definition 4 (Joint convex distance).
A distance $D$ is joint convex if and only if, for all $\lambda \in [0, 1]$,
$$D\big(\lambda p_1 + (1 - \lambda) p_2 : \lambda q_1 + (1 - \lambda) q_2\big) \leq \lambda\, D(p_1 : q_1) + (1 - \lambda)\, D(p_2 : q_2).$$
The $f$-divergences (for a convex generator $f$ satisfying $f(1) = 0$) are joint convex distances. For mixtures with the same weights $\alpha$ but different component bases and a joint convex distance $D$ (e.g., KL), we get $D(m : m') \leq \sum_{i=1}^{k} \alpha_i\, D(p_i : p'_i)$.
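Joint convexity of the KL divergence is easy to check numerically on random categorical distributions (a toy verification of ours, not from the paper):

```python
import numpy as np


def kl(p, q):
    return float(np.sum(p * np.log(p / q)))


rng = np.random.default_rng(7)
max_violation = 0.0
for _ in range(100):
    # Random strictly positive categorical distributions on 4 outcomes.
    p1, p2, q1, q2 = (rng.dirichlet(np.ones(4)) for _ in range(4))
    lam = rng.uniform()
    lhs = kl(lam * p1 + (1 - lam) * p2, lam * q1 + (1 - lam) * q2)
    rhs = lam * kl(p1, q1) + (1 - lam) * kl(p2, q2)
    max_violation = max(max_violation, lhs - rhs)
```

Joint convexity guarantees `lhs <= rhs` on every trial, so `max_violation` stays at (numerical) zero.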
Theorem 5 (Upper Bound on Joint Convex Mixture Distance (UBJCMD)).
Let $m = \sum_{i=1}^{k} \alpha_i\, p_i$ and $m' = \sum_{j=1}^{k'} \alpha'_j\, p'_j$ be two finite mixtures, and $D$ any joint convex statistical base distance. Then CROT upper bounds the distance between the mixtures:
$$D(m : m') \leq \mathrm{CROT}_D(m, m').$$
Notice that $\mathrm{CROT}_D(m : m') \neq \mathrm{CROT}_D(m' : m)$ in general for an asymmetric base distance $D$.
Conversely, CROT yields a lower bound for joint concave distances (e.g., fidelity in quantum computing ).
Figure 2 illustrates the CROT distance between statistical mixtures (not having the same number of components).
4.1 Total Variation distance
The total variation distance between univariate Gaussian distributions admits a closed-form formula (using the erf function), and closed-form formulas for the total variation between Rayleigh distributions and between Gamma distributions are also available.
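For instance, in the equal-variance Gaussian case the two densities cross exactly once, at the midpoint of the means, which yields a one-line erf formula. The following sketch (our own derivation for this special case; the general unequal-variance case has up to two crossings and is slightly more involved) verifies it against direct quadrature:

```python
import numpy as np
from math import erf, sqrt


def tv_gaussians_equal_var(mu1, mu2, sigma):
    """TV(N(mu1, sigma^2), N(mu2, sigma^2)).
    The densities cross once at c = (mu1 + mu2)/2, so
    TV = F_p(c) - F_q(c) = erf(|mu1 - mu2| / (2 * sqrt(2) * sigma))."""
    return erf(abs(mu1 - mu2) / (2 * sqrt(2) * sigma))


# Numerical check against (1/2) * integral of |p - q| on a fine grid.
mu1, mu2, sigma = 0.0, 1.5, 1.0
x = np.linspace(-10.0, 12.0, 200001)
p = np.exp(-0.5 * ((x - mu1) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
q = np.exp(-0.5 * ((x - mu2) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
tv_quad = 0.5 * float(np.abs(p - q).sum() * (x[1] - x[0]))
tv_closed = tv_gaussians_equal_var(mu1, mu2, sigma)
```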
Figure 2(a) illustrates the performance of the various lower/upper bounds on the total variation between mixtures of Gaussian, Gamma, and Rayleigh distributions with respect to the true value, which is estimated using Monte Carlo sampling.
The acronyms of the various bounds are as follows:
CELB: Combinatorial Envelope Lower Bound  (applies only for 1D mixtures)
CEUB: Combinatorial Envelope Upper Bound  (applies only for 1D mixtures)
CGQLB: Coarse-Grained Quantization Lower Bound for a given number of bins (applies only to $f$-divergences, which satisfy the information monotonicity property)
CROT: Chain Rule Optimal Transport (this paper)
Sinkhorn CROT: Entropy-regularized CROT, with two settings of the regularization parameter $\lambda$ (chosen for convergence of the Sinkhorn-Knopp iterative matrix scaling algorithm).
We learn GMMs composed of multivariate Gaussian distributions with diagonal covariance matrices from the MNIST database using PCA (dimension reduction from the original dimension $28 \times 28 = 784$ to a reduced dimension) as explained in the caption of Table 1. We used the Expectation-Maximization (EM) algorithm implementation of scikit-learn.
We approximate the TV between $d$-dimensional GMMs $m$ and $m'$ using Monte Carlo stochastic integration: drawing $n$ i.i.d. samples $x_1, \ldots, x_n \sim m$, we estimate
$$\mathrm{TV}(m, m') = \frac{1}{2} \int |m(x) - m'(x)|\, \mathrm{d}x \approx \frac{1}{n} \sum_{l=1}^{n} \frac{|m(x_l) - m'(x_l)|}{2\, m(x_l)}.$$
Furthermore, we have:
$$\mathrm{TV}(m, m') = 1 - \int \min\big(m(x), m'(x)\big)\, \mathrm{d}x.$$
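A minimal sketch of this Monte Carlo estimator for univariate GMMs (the toy parameters are ours; the paper's experiments use multivariate GMMs fitted on MNIST):

```python
import numpy as np

rng = np.random.default_rng(1)


def gmm_pdf(x, w, mu, s):
    z = np.exp(-0.5 * ((x[:, None] - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return z @ w


w1, mu1, s1 = np.array([0.6, 0.4]), np.array([0.0, 3.0]), np.array([1.0, 1.0])
w2, mu2, s2 = np.array([0.5, 0.5]), np.array([0.5, 3.5]), np.array([1.0, 1.0])

# Draw x ~ m and average |m(x) - m'(x)| / (2 m(x)), an unbiased
# estimate of TV(m, m') = (1/2) * integral |m - m'|.
n = 500000
comp = rng.choice(2, size=n, p=w1)
x = rng.normal(mu1[comp], s1[comp])
p, q = gmm_pdf(x, w1, mu1, s1), gmm_pdf(x, w2, mu2, s2)
tv_mc = float(np.mean(np.abs(p - q) / (2 * p)))

# Reference value by quadrature on a fine grid (1D only).
grid = np.linspace(-8.0, 12.0, 200001)
pg, qg = gmm_pdf(grid, w1, mu1, s1), gmm_pdf(grid, w2, mu2, s2)
tv_grid = 0.5 * float(np.abs(pg - qg).sum() * (grid[1] - grid[0]))
```

In one dimension the quadrature reference is cheap; in the high-dimensional MNIST setting only the Monte Carlo estimator remains practical.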
The results are obtained using POT  (Python Optimal Transport).
The table columns report TV, CROT-TV, and the Sinkhorn CROT for the two regularization settings.
Our experiments yield the following observations. As the sample size decreases, the TV distances between GMMs become larger because the GMMs are pulled towards two different empirical distributions. As the dimension increases, TV increases because in a high-dimensional space the GMM components are less likely to overlap. We check that CROT-TV is an upper bound on TV, and we verify that the Sinkhorn divergences are upper bounds on CROT.
4.2 Square root of the symmetric $\alpha$-Jensen-Shannon divergence
TV is bounded in $[0, 1]$, which makes it difficult to appreciate the quality of the CROT upper bounds in general. We shall consider a different parametric distance whose upper bound can be set arbitrarily large.
It is well known that the square root of the Jensen-Shannon divergence is a metric (satisfying the triangle inequality). A generalization of the Jensen-Shannon divergence was proposed, given by
$$\mathrm{JS}_\alpha(p : q) := \mathrm{KL}\big(p : (1 - \alpha) p + \alpha q\big) + \mathrm{KL}\big(q : (1 - \alpha) q + \alpha p\big),$$
where $\alpha \in (0, 1)$. $\mathrm{JS}_\alpha$ unifies (twice) the Jensen-Shannon divergence (obtained when $\alpha = \frac{1}{2}$) with the Jeffreys divergence ($\alpha \to 1$).
A nice property is that the skew divergence $K_\alpha(p : q) := \mathrm{KL}(p : (1 - \alpha) p + \alpha q)$ is upper bounded as follows:
$$K_\alpha(p : q) \leq \log \frac{1}{1 - \alpha},$$
for $\alpha \in [0, 1)$, so that $\mathrm{JS}_\alpha(p : q) \leq 2 \log \frac{1}{1 - \alpha}$ for $\alpha \in [0, 1)$.
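The skew bound follows in one line from the pointwise inequality $(1 - \alpha) p + \alpha q \geq (1 - \alpha) p$ (our rendering of the standard argument):

```latex
K_\alpha(p:q)
  = \int p(x) \log \frac{p(x)}{(1-\alpha)\,p(x) + \alpha\, q(x)} \,\mathrm{d}x
  \leq \int p(x) \log \frac{p(x)}{(1-\alpha)\,p(x)} \,\mathrm{d}x
  = \log \frac{1}{1-\alpha}.
```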
Thus the square root of the symmetrized skew divergence is upper bounded by
$$\sqrt{\mathrm{JS}_\alpha(p : q)} \leq \sqrt{2 \log \frac{1}{1 - \alpha}}.$$
However, $\sqrt{\mathrm{JS}_\alpha}$ is not a metric in general. Indeed, in the extreme case $\alpha \to 1$ (the Jeffreys divergence), it is known that no positive power of the Jeffreys divergence yields a metric.
Observe that $\mathrm{JS}_\alpha$ is an $f$-divergence since $K_\alpha$ is an $f$-divergence for the generator $f_\alpha(u) = -\log\big((1 - \alpha) + \alpha u\big)$, and we have $\mathrm{JS}_\alpha(p : q) = K_\alpha(p : q) + K_\alpha(q : p)$. Since $I_f(q : p) = I_{f^\diamond}(p : q)$ for the conjugate generator $f^\diamond(u) = u f(1/u)$, it follows that the $f$-generator for the $\mathrm{JS}_\alpha$ divergence is:
$$f(u) = -\log\big((1 - \alpha) + \alpha u\big) + u \log \frac{u}{(1 - \alpha) u + \alpha}.$$
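As a sanity check (with toy discrete distributions of our choosing), one can verify numerically that $\mathrm{JS}_\alpha \leq 2 \log \frac{1}{1 - \alpha}$, and that $\alpha = \frac{1}{2}$ recovers twice the Jensen-Shannon divergence:

```python
import numpy as np


def kl(p, q):
    return float(np.sum(p * np.log(p / q)))


def js_alpha(p, q, alpha):
    """Symmetrized skew divergence JS_alpha(p:q) = K_alpha(p:q) + K_alpha(q:p)."""
    return kl(p, (1 - alpha) * p + alpha * q) + kl(q, (1 - alpha) * q + alpha * p)


p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.3, 0.6])

bounds_ok = all(js_alpha(p, q, a) <= 2 * np.log(1 / (1 - a))
                for a in (0.1, 0.5, 0.9))
m = 0.5 * (p + q)
two_js = kl(p, m) + kl(q, m)   # twice the Jensen-Shannon divergence
```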
Figure 2(b) displays the experimental results obtained for the $\mathrm{JS}_\alpha$ divergences.
5 Conclusion and perspectives
In this work, we defined the generic Chain Rule Optimal Transport (CROT) distance $\mathrm{CROT}_D$ (Definition 1) for a ground distance $D$, which encompasses the Wasserstein distance between point sets (Earth Mover's Distance) and the Mixture Component Optimal Transport (MCOT) distance, and proved that $\mathrm{CROT}_D$ is a metric whenever $D$ is a metric (Theorem 3). We then dealt with statistical mixtures, and showed that $D(m : m') \leq \mathrm{CROT}_D(m, m')$ (Theorem 5) whenever $D$ is joint convex. This holds in particular for statistical $f$-divergences.
We also considered the smoothed Sinkhorn CROT divergence for fast calculation via matrix scaling algorithms (the Sinkhorn-Knopp algorithm), which upper bounds the CROT distance.
There are many avenues to explore for further research. For example, we may consider infinite Gaussian mixtures, or the chain rule factorization for
$d$-variate densities,
$$p(x_1, \ldots, x_d) = p(x_1)\, p(x_2 | x_1) \cdots p(x_d | x_1, \ldots, x_{d-1}),$$
which gives rise to a hierarchy of CROT distances. Another direction is to explore the use of the CROT distance in deep learning.
Frank Nielsen thanks Steve Huntsman for bringing the reference to his attention.
-  Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
-  Heinz H Bauschke and Jonathan M Borwein. Joint and separate convexity of the Bregman distance. In Studies in Computational Mathematics, volume 8, pages 23–36. Elsevier, 2001.
-  Kuo-Chu Chang and Wei Sun. Scalable fusion with mixture distributions in sensor networks. In 11th International Conference on Control Automation Robotics & Vision (ICARCV), pages 1251–1256, 2010.
-  Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
-  Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
-  Minh N Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE signal processing letters, 10(4):115–118, 2003.
-  D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
-  B. Everett. An introduction to latent variable models. Springer Science & Business Media, 2013.
-  Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. arXiv preprint arXiv:1810.08278, 2018.
-  Rémi Flamary and Nicolas Courty. POT Python Optimal Transport library, 2017.
-  Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT 2004), page 31. IEEE, 2004.
-  N. Ghaffari and S. Walker. On Multivariate Optimal Transportation. ArXiv e-prints, January 2018.
-  Jacob Goldberger and Hagai Aronowitz. A distance measure between GMMs based on the unscented transform and its application to speaker recognition. In INTERSPEECH European Conference on Speech Communication and Technology, pages 1985–1988, 2005.
-  Jacob Goldberger, Shiri Gordon, and Hayit Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In IEEE International Conference on Computer Vision (ICCV), page 487. IEEE, 2003.
-  John R. Hershey and Peder A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–317. IEEE, 2007.
-  Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in $f$-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 90(9):1848–1853, 2007.
-  Fumiyasu Komaki. Bayesian prediction based on a class of shrinkage priors for location-scale models. Annals of the Institute of Statistical Mathematics, 59(1):135–146, 2007.
-  Bernhard Korte and Jens Vygen. Linear programming algorithms. In Combinatorial Optimization, pages 75–102. Springer, 2018.
-  Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
-  Zhu Liu and Qian Huang. A new distance measure for probability distribution function of mixture type. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 616–619. IEEE, 2000.
-  Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010.
-  Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1723–1726. IEEE, 2012.
-  Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25–34, 2014.
-  Frank Nielsen and Gaëtan Hadjeres. Monte Carlo information geometry: The dually flat case. arXiv preprint arXiv:1803.07225, 2018.
-  Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating -divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.
-  Frank Nielsen and Richard Nock. On mixtures: Finite convex combinations of prescribed component distributions. CoRR, abs/1708.00568, 2017.
-  Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
-  Frank Nielsen and Ke Sun. Guaranteed deterministic bounds on the total variation distance between univariate mixtures. In IEEE Machine Learning in Signal Processing (MLSP), pages 1–6, 2018.
-  Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002.
-  Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.
-  Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
-  József Pitrik and Dániel Virosztek. On the joint convexity of the Bregman divergence of matrices. Letters in Mathematical Physics, 105(5):675–692, 2015.
-  Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in neural information processing systems, pages 554–560, 2000.
-  Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3):19–41, 2000.
-  Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
-  Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, pages 99–102, 2015.
-  Olivier Schwander and Frank Nielsen. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry, pages 403–426. Springer, 2013.
-  Jorge Silva and Shrikanth Narayanan. Upper bound Kullback-Leibler divergence for hidden Markov models with application as discrimination measure for speech recognition. In IEEE International Symposium on Information Theory (ISIT), pages 2299–2303. IEEE, 2006.
-  Yoram Singer and Manfred K Warmuth. Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy. In Advances in Neural Information Processing Systems, pages 578–584, 1999.
-  Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
-  Li Xie, Valery A. Ugrinovskii, and Ian R. Petersen. Probabilistic distances between finite-state finite-alphabet hidden Markov models. IEEE transactions on automatic control, 50(4):505–511, 2005.