1 Introduction
Calculating (dis)similarities between statistical mixtures is a core primitive often met in statistics, machine learning, signal processing, and information fusion
[3] among others. However, the usual information-theoretic Kullback-Leibler (KL) divergence (also known as relative entropy) and related divergences between statistical mixtures [28] do not admit closed-form formulas, and are in practice approximated by costly Monte Carlo stochastic integration [28]. To tackle this computational tractability problem, two research directions have been considered in the literature: The first line of research consists in proposing distances between mixtures that admit closed-form formulas [23] (e.g., the Cauchy-Schwarz divergence or the Jensen quadratic Rényi divergence). The second line of research consists in lower and upper bounding the divergences between mixtures [28]. This is tricky when considering bounded divergences like the Total Variation (TV) distance or the Jensen-Shannon (JS) divergence, which are upper bounded by $1$ and $\log 2$, respectively.
When dealing with probability densities, two main classes of statistical distances have been widely studied in the literature:

The Information-Geometric (IG) invariant divergences [1] (characterized as the class of separable distances), and

The Optimal Transport (OT) distances [37] (including the Wasserstein distances and the Earth Mover's Distance [36]).

In general, computing a closed-form formula for the OT between parametric distributions is difficult. A closed-form formula is known for elliptical distributions [8] (which include the multivariate Gaussian distributions), and the OT of multivariate continuous distributions can be calculated from the OT of their copulas [13]. The geometry induced by the distance differs in these two OT/IG cases. For example, consider location-scale families (or multivariate elliptical distributions):

For the Wasserstein distance, the optimal transport geometry of Gaussian measures has nonnegative curvature [41], while

For invariant divergences, the information-geometric manifold has negative curvature [18] (hyperbolic geometry). It is known that for the Kullback-Leibler divergence, the manifold of mixtures with prescribed components is dually flat, and therefore admits an equivalent Bregman divergence [25].
In this paper, we build on the seminal work of Liu and Huang [21], who proposed a novel family of statistical distances between statistical mixtures by solving linear programs on the component weights of the mixtures, where the elementary distance between any two mixture components is prescribed. They proved that their distance between mixtures (which we term the MCOT distance, for Mixture Component Optimal Transport) is a metric whenever the elementary distance between mixture components is a metric. This framework also applies to semi-parametric mixtures obtained from Kernel Density Estimators [38] (KDEs). We describe our main contributions as follows:

We define the Chain Rule Optimal Transport (CROT) distance in Definition 1, and prove in §2.2 (Theorem 3) that it yields a metric whenever the distance between conditional distributions is a metric. The CROT distance extends the Wasserstein/EMD distances and the MCOT distance between statistical mixtures. We further sketch how to build recursively hierarchical families of CROT distances.

In §4, experiments highlight quantitatively the upper-bound performance of the CROT distance for bounding the total variation distance and a generalization of the square root of the Jensen-Shannon distance.
2 The Chain Rule Optimal Transport (CROT) distance
2.1 Definition
We define a novel class of distances between statistical multivariate distributions. Recall the basic chain rule factorization of a joint probability distribution $p(x, y)$:
$p(x, y) = p(y)\, p(x|y),$
where $p(y)$ is called the marginal probability, and $p(x|y)$ is termed the conditional probability. Let $\mathcal{M}$ and $\mathcal{C}$ denote the manifolds of marginal probability densities and conditional probability densities, respectively.
For example, for latent models like statistical mixtures or hidden Markov models [42, 39], $x$ plays the role of the observed variable while $y$ denotes the hidden variable [9] (unobserved, so that inference has to tackle incomplete data, say, using the EM algorithm [6]). First, we state the generic definition of the Chain Rule Optimal Transport distance between joint distributions $p(x, y) = p(y)\, p(x|y)$ and $q(x, y') = q(y')\, q(x|y')$ as follows:

Definition 1 (CROT distance).
Given two multivariate distributions $p(x, y)$ and $q(x, y')$, we define the Chain Rule Optimal Transport (CROT) as follows:

(1) $\mathrm{CROT}_D(p, q) := \min_{\gamma \in \Gamma(p, q)} \mathbb{E}_{\gamma(y, y')}\!\left[ D\big(p(x|y), q(x|y')\big) \right]$

(2) $= \min_{\gamma \in \Gamma(p, q)} \int D\big(p(x|y), q(x|y')\big)\, \mathrm{d}\gamma(y, y'),$

where $D$ is a ground distance defined on the conditional density manifold (e.g., the Total Variation — TV), and $\Gamma(p, q)$ is the set of all probability measures $\gamma(y, y')$ with marginals $p(y)$ and $q(y')$, i.e., satisfying the following constraints:

(3) $\int \gamma(y, y')\, \mathrm{d}y' = p(y), \qquad \int \gamma(y, y')\, \mathrm{d}y = q(y').$
When the ground distance $D$ is clear from the context, we write $\mathrm{CROT}$ as a shortcut for $\mathrm{CROT}_D$. Since CROT minimizes over all couplings and the independent coupling $\gamma(y, y') = p(y)\, q(y')$ is a feasible transport solution, we get the following upper bound:
Property 2 (Upper bounds).
The CROT distance is upper bounded by
$\mathrm{CROT}_D(p, q) \leq \int\!\!\int p(y)\, q(y')\, D\big(p(x|y), q(x|y')\big)\, \mathrm{d}y\, \mathrm{d}y'.$
Figure 1 illustrates the principle of the CROT distance. Another complementary motivation when dealing with statistical mixtures is presented in §3.
Let us notice that the CROT distance generalizes two distances met in the literature:
Remark 2.1 (CROT generalizes Wasserstein/EMD).
When the conditional distributions are Dirac point masses and the ground distance between Dirac distributions is the underlying metric between their support points, the CROT distance amounts to the Wasserstein distance (Earth Mover's Distance [36]).
Remark 2.2 (CROT generalizes MCOT).
When both marginals are (finite) categorical distributions, we recover the distance formerly defined in [21], which we term the MCOT distance in the remainder (for Mixture Component Optimal Transport).
2.2 CROT is a metric when the ground distance is a metric
Theorem 3 (CROT metric).
$\mathrm{CROT}_D$ is a metric whenever the ground distance $D$ is a metric.
Proof.
We prove that $\mathrm{CROT}_D$ satisfies the following axioms of metric distances:
 Non-negativity.

As $D \geq 0$, we have $\mathrm{CROT}_D \geq 0$.
 Law of indiscernibles.

If $\mathrm{CROT}_D(p, q) = 0$, then, as $D$ is a metric, the optimal coupling $\gamma$ is concentrated on the region where $p(x|y) = q(x|y')$. Since the marginals of $\gamma$ are $p(y)$ and $q(y')$, we therefore have
(4) $p(x, y) = q(x, y).$
 Symmetry.

(5) $\mathrm{CROT}_D(p, q) = \min_{\gamma \in \Gamma(p, q)} \int D\big(p(x|y), q(x|y')\big)\, \mathrm{d}\gamma(y, y')$
(6) $= \min_{\gamma' \in \Gamma(q, p)} \int D\big(q(x|y'), p(x|y)\big)\, \mathrm{d}\gamma'(y', y) = \mathrm{CROT}_D(q, p),$
where $\gamma'$ is such that $\gamma'(y', y) = \gamma(y, y')$ and we used the symmetry $D(a, b) = D(b, a)$.
 Triangle inequality.

The proof of the triangle inequality is not straightforward: it relies on a gluing argument. Given optimal couplings between $p$ and an intermediate distribution $r$, and between $r$ and $q$, one glues them into a probability measure $\sigma$ with the three prescribed marginals, and applies the triangle inequality of the ground metric $D$ on the conditionals to get
(7) $\mathrm{CROT}_D(p, q) \leq \mathrm{CROT}_D(p, r) + \mathrm{CROT}_D(r, q),$
where $\sigma$ ranges over $\Sigma(p, r, q)$, the set of all probability measures with marginals $p(y)$, $r(y'')$, and $q(y')$. ∎
3 CROT for statistical mixtures and Sinkhorn CROT
Consider two finite statistical mixtures $m(x) = \sum_{i=1}^{k} \alpha_i p_i(x)$ and $m'(x) = \sum_{j=1}^{k'} \beta_j q_j(x)$, not necessarily homogeneous nor of the same type. Let $D$ be a distance between component densities. The Mixture Component Optimal Transport (MCOT) distance proposed in [21] amounts to solving a Linear Program (LP) with the following objective function to minimize:
(8) $\sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, D(p_i, q_j),$
satisfying the following constraints:
(9) $\sum_{j=1}^{k'} w_{ij} = \alpha_i, \quad i \in \{1, \ldots, k\},$
(10) $\sum_{i=1}^{k} w_{ij} = \beta_j, \quad j \in \{1, \ldots, k'\},$
(11) $w_{ij} \geq 0.$
By defining $\Gamma(\alpha, \beta)$ to be the set of non-negative $k \times k'$ matrices $W = [w_{ij}]$ with $W \mathbf{1} = \alpha$ and $W^\top \mathbf{1} = \beta$ (the transport polytope [5]), we get the equivalent compact definition of MCOT/CROT:
(12) $\mathrm{MCOT}_D(m, m') = \min_{W \in \Gamma(\alpha, \beta)} \sum_{i,j} w_{ij}\, D(p_i, q_j).$
When the ground distance is asymmetric, we shall use the ’:’ notation instead of the ’,’ notation for separating arguments.
In general, the LP problem (with $k k'$ variables, $k k'$ inequalities, and $k + k'$ equalities, of which $k + k' - 1$ are independent) delivers an optimal soft assignment of mixture components with at most $k + k' - 1$ non-zero coefficients in matrix $W$.^1 The complexity of linear programming in $d$ variables with $b$-bit coefficients using Karmarkar's interior point method is polynomial in $d$ and $b$ [19].

^1 An LP in dimension $d$ has its solution located at a vertex of a polytope described by the intersection of hyperplanes (linear constraints).
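The MCOT linear program of Eqs. 8-11 can be solved directly with an off-the-shelf LP solver. Below is a minimal sketch; the helper name `mcot`, the toy weights, and the 0/1 cost matrix are our own illustrative choices, with `scipy.optimize.linprog` standing in for any LP solver:

```python
import numpy as np
from scipy.optimize import linprog

def mcot(alpha, beta, D):
    """Solve the MCOT LP: minimize sum_ij w_ij * D[i, j] subject to
    row sums equal to alpha, column sums equal to beta, and w_ij >= 0."""
    k, kp = D.shape
    A_eq = np.zeros((k + kp, k * kp))
    for i in range(k):
        A_eq[i, i * kp:(i + 1) * kp] = 1.0   # sum_j w_ij = alpha_i
    for j in range(kp):
        A_eq[k + j, j::kp] = 1.0             # sum_i w_ij = beta_j
    res = linprog(D.ravel(), A_eq=A_eq, b_eq=np.concatenate([alpha, beta]),
                  bounds=(0, None), method="highs")
    return res.fun, res.x.reshape(k, kp)

# Toy example: two 2-component mixtures with a 0/1 component cost matrix.
alpha, beta = np.array([0.5, 0.5]), np.array([0.3, 0.7])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, plan = mcot(alpha, beta, D)  # optimal cost is 0.2 here
```

Note that one of the $k + k'$ equality constraints is redundant; the HiGHS backend handles this without trouble.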
Observe that we necessarily have $w_{ij} \leq \alpha_i$ (from Eq. 9 and non-negativity),
and similarly that $w_{ij} \leq \beta_j$ (from Eq. 10).
Note that $\mathrm{MCOT}_D(m, m) = 0$ since $W = [\alpha_i \delta_{ij}] \in \Gamma(\alpha, \alpha)$, where $\delta$ denotes the Kronecker symbol: $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.
We can interpret MCOT as a discrete optimal transport between (non-embedded) histograms. When $k = k'$, the transport polytope is the polyhedral set of non-negative matrices
$\Gamma(\alpha, \beta) = \{ W \in \mathbb{R}_{\geq 0}^{k \times k} : W \mathbf{1} = \alpha,\ W^\top \mathbf{1} = \beta \},$
and
$\mathrm{MCOT}_D(m, m') = \min_{W \in \Gamma(\alpha, \beta)} \langle W, M \rangle,$
where $\langle A, B \rangle = \mathrm{tr}(A^\top B)$ is the Frobenius inner product of matrices, $M = [D(p_i, q_j)]$ is the cost matrix, and $\mathrm{tr}$ denotes the matrix trace. This OT can be calculated using the network simplex in $O(k^3 \log k)$ time.
Cuturi [5] showed how to relax the objective function in order to get fast calculations using the Sinkhorn divergence:
(13) $\mathrm{S}_{D,\lambda}(m, m') = \min_{W \in \Gamma_\lambda(\alpha, \beta)} \langle W, M \rangle,$
where $\Gamma_\lambda(\alpha, \beta) = \{ W \in \Gamma(\alpha, \beta) : \mathrm{KL}(W : \alpha \beta^\top) \leq \lambda \}$. The KL divergence between two non-negative matrices $W = [w_{ij}]$ and $W' = [w'_{ij}]$ is defined by
$\mathrm{KL}(W : W') = \sum_{i,j} w_{ij} \log \frac{w_{ij}}{w'_{ij}},$
with the convention that $0 \log \frac{0}{0} = 0$. The Sinkhorn divergence is calculated in practice via the equivalent dual Sinkhorn divergence by using matrix scaling algorithms (e.g., the Sinkhorn-Knopp algorithm).
Because the minimization is performed on the subset $\Gamma_\lambda(\alpha, \beta) \subseteq \Gamma(\alpha, \beta)$, we have
(14) $\mathrm{MCOT}_D(m, m') \leq \mathrm{S}_{D,\lambda}(m, m').$
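A minimal Sinkhorn-Knopp iteration for this entropy-regularized transport can be sketched as follows (the function name, the parameterization by a sharpness parameter `lam`, and the toy example are our own; this is the Lagrangian form of Cuturi's constrained problem [5]):

```python
import numpy as np

def sinkhorn(alpha, beta, D, lam=50.0, n_iter=1000):
    """Entropy-regularized OT via Sinkhorn-Knopp matrix scaling.

    Larger lam means weaker regularization: the returned cost upper
    bounds the unregularized LP optimum and approaches it as lam grows."""
    K = np.exp(-lam * D)                 # element-wise Gibbs kernel
    u = np.ones_like(alpha)
    for _ in range(n_iter):
        v = beta / (K.T @ u)             # scale columns to match beta
        u = alpha / (K @ v)              # scale rows to match alpha
    W = u[:, None] * K * v[None, :]      # transport plan diag(u) K diag(v)
    return float((W * D).sum()), W

alpha, beta = np.array([0.5, 0.5]), np.array([0.3, 0.7])
D = np.array([[0.0, 1.0], [1.0, 0.0]])
cost, W = sinkhorn(alpha, beta, D)       # slightly above the LP optimum 0.2
```

The alternating scalings enforce the two marginal constraints; for numerically demanding cost matrices, log-domain stabilized variants are preferred.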
3.1 Upper bounding statistical distances between mixtures with CROT
First, let us report the basic upper bound for MCOT mentioned earlier in Property 2. The objective function is upper bounded by:
(15) $\mathrm{MCOT}_D(m, m') \leq \sum_{i,j} \alpha_i \beta_j\, D(p_i, q_j).$
Now, when the distance $D$ between conditional densities is separate convex (i.e., convex in each argument taken separately), we get the following Separate Convexity Upper Bound (SCUB):
(16) $D(m, m') \leq \sum_{i,j} \alpha_i \beta_j\, D(p_i, q_j).$
For example, norm-induced distances or $f$-divergences [26] are separate convex distances.
For the particular case of the KL divergence, and when $k = k'$, we get the following upper bound using the log-sum inequality [7, 27]:
(17) $\mathrm{KL}(m : m') \leq \sum_{i=1}^{k} \alpha_i \left( \log \frac{\alpha_i}{\beta_i} + \mathrm{KL}(p_i : q_i) \right).$
Since this holds for any permutation $\sigma$ of the mixture components, we can tighten this upper bound by minimizing over all permutations:
(18) $\mathrm{KL}(m : m') \leq \min_{\sigma} \sum_{i=1}^{k} \alpha_i \left( \log \frac{\alpha_i}{\beta_{\sigma(i)}} + \mathrm{KL}(p_i : q_{\sigma(i)}) \right).$
The best permutation can be computed using the Hungarian cubic-time algorithm [40, 35, 15, 14], with cost matrix $C = [c_{ij}]$ and $c_{ij} = \alpha_i \left( \log \frac{\alpha_i}{\beta_j} + \mathrm{KL}(p_i : q_j) \right)$.
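Minimizing over permutations is a linear assignment problem, solvable with SciPy's Hungarian-method routine; the cost matrix below is a made-up illustration of the per-pair costs $c_{ij}$:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical per-pair costs, standing in for
# c_ij = alpha_i * (log(alpha_i / beta_j) + KL(p_i : q_j)).
C = np.array([[0.5, 2.0, 1.0],
              [2.0, 0.1, 3.0],
              [1.5, 2.5, 0.2]])

rows, perm = linear_sum_assignment(C)  # optimal permutation, cubic time
bound = C[rows, perm].sum()            # tightened upper bound of Eq. 18
```

Here the optimal permutation is the identity, matching each component to its diagonal counterpart.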
Now, let us further rewrite $m = \sum_{i,j} w_{ij}\, p_i$ with $\sum_j w_{ij} = \alpha_i$, and $m' = \sum_{i,j} w_{ij}\, q_j$ with $\sum_i w_{ij} = \beta_j$. That is, we can interpret $m$ and $m'$ as mixtures of $k \times k'$ (redundant) components $p_i$ and $q_j$, and apply the upper bound of Eq. 17 for the "best split" of matching mixture components (the weight $w_{ij}$ of component $p_i$ in $m$ is matched with the same weight $w_{ij}$ of component $q_j$ in $m'$, so that the log terms vanish):
(19) $\mathrm{KL}(m : m') \leq \sum_{i,j} w_{ij}\, \mathrm{KL}(p_i : q_j).$
Let
(20) $\mathrm{CROT}_{\mathrm{KL}}(m, m') = \min_{W \in \Gamma(\alpha, \beta)} \sum_{i,j} w_{ij}\, \mathrm{KL}(p_i : q_j).$
Then it follows that
(21) $\mathrm{KL}(m : m') \leq \mathrm{CROT}_{\mathrm{KL}}(m, m').$
Thus CROT allows us to upper bound the KL divergence between mixtures. The technique of rewriting mixtures as mixtures of redundant components bears some resemblance to the variational upper bound on the KL divergence between mixtures proposed in [16], which requires iterating an update of the variational upper bound until convergence.
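As a numerical sanity check of this bound, one can compare a Monte Carlo estimate of the KL divergence between two univariate GMMs with the matched-component upper bound; all mixture parameters below are made-up toy values, and the identity coupling is valid because the two mixtures share the same weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_gauss(m1, s1, m2, s2):
    """Closed-form KL(N(m1, s1^2) : N(m2, s2^2))."""
    return np.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2 * s2 ** 2) - 0.5

def gmm_pdf(x, ws, ms, ss):
    comp = np.exp(-0.5 * ((x[:, None] - ms) / ss) ** 2) / (ss * np.sqrt(2 * np.pi))
    return comp @ ws

# Toy 2-component GMMs with identical weights (made-up parameters).
ws = np.array([0.5, 0.5])
ms1, ss1 = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
ms2, ss2 = np.array([-0.8, 1.3]), np.array([0.6, 0.4])

# Upper bound from the identity coupling (valid since the weights match):
ub = sum(w * kl_gauss(a, s, b, t)
         for w, a, s, b, t in zip(ws, ms1, ss1, ms2, ss2))

# Monte Carlo estimate of KL(m : m'); equal-size strata are a valid
# sampling scheme here only because the component weights are equal.
x = np.concatenate([rng.normal(m, s, 200_000) for m, s in zip(ms1, ss1)])
kl_mc = float(np.mean(np.log(gmm_pdf(x, ws, ms1, ss1) / gmm_pdf(x, ws, ms2, ss2))))
```

For these parameters the bound `ub` is about 0.21, and the Monte Carlo estimate falls below it, as Eq. 21 predicts.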
In fact, the CROT distance provides a good upper bound on the distance between mixtures provided the base distance is joint convex [2, 33].
Definition 4 (Joint convex distance).
A distance $D$ is joint convex if and only if
$D\big(\lambda p_1 + (1 - \lambda) p_2,\ \lambda q_1 + (1 - \lambda) q_2\big) \leq \lambda D(p_1, q_1) + (1 - \lambda) D(p_2, q_2),$
where $\lambda \in [0, 1]$.
The $f$-divergences (for a convex generator $f$ satisfying $f(1) = 0$) are joint convex distances [31]. For mixtures with the same weights $\alpha$ but different component bases, and a joint convex distance $D$ (e.g., KL), we get $D(m, m') \leq \sum_i \alpha_i D(p_i, q_i)$.
Theorem 5 (Upper Bound on Joint Convex Mixture Distance (UBJCMD)).
Let $m = \sum_{i=1}^{k} \alpha_i p_i$ and $m' = \sum_{j=1}^{k'} \beta_j q_j$ be two finite mixtures, and let $D$ be any joint convex statistical base distance. Then CROT upper bounds the distance between the mixtures:
(22) $D(m, m') \leq \mathrm{CROT}_D(m, m').$
Proof.
For any $W \in \Gamma(\alpha, \beta)$, rewrite $m = \sum_{i,j} w_{ij}\, p_i$ and $m' = \sum_{i,j} w_{ij}\, q_j$. Joint convexity then yields $D(m, m') \leq \sum_{i,j} w_{ij}\, D(p_i, q_j)$, and minimizing the right-hand side over $W \in \Gamma(\alpha, \beta)$ gives the bound. ∎
Notice that $\mathrm{CROT}_D(m : m') \neq \mathrm{CROT}_D(m' : m)$ in general for an asymmetric base distance $D$.
Conversely, CROT yields a lower bound for joint concave distances (e.g., fidelity in quantum computing [30]).
Figure 2 illustrates the CROT distance between statistical mixtures (not having the same number of components).
4 Experiments
4.1 Total Variation distance
Since TV is a metric divergence [17] bounded in $[0, 1]$, so is MCOT. The closed-form formula for the total variation between univariate Gaussian distributions is reported in [24] (using the erf function), and formulas for the total variation between Rayleigh distributions and between Gamma distributions are given in [29]. Figure 2(a) illustrates the performance of the various lower/upper bounds on the total variation between mixtures of Gaussian, Gamma, and Rayleigh distributions with respect to the true value, which is estimated using Monte Carlo sampling.
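For completeness, the TV between two univariate Gaussians can be evaluated by generic numerical quadrature (a sketch; we do not reproduce the closed-form erf expression of [24]):

```python
import numpy as np
from scipy.integrate import quad

def norm_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def tv_gauss(m1, s1, m2, s2):
    """TV(p, q) = (1/2) * integral of |p(x) - q(x)| dx, by adaptive quadrature."""
    lo = min(m1 - 10 * s1, m2 - 10 * s2)   # integration window covering both modes
    hi = max(m1 + 10 * s1, m2 + 10 * s2)
    val, _ = quad(lambda x: abs(norm_pdf(x, m1, s1) - norm_pdf(x, m2, s2)),
                  lo, hi, limit=200)
    return 0.5 * val
```

For equal standard deviations $\sigma$, this agrees with the closed form $2\Phi\!\big(\tfrac{|\mu_1 - \mu_2|}{2\sigma}\big) - 1$.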
The acronyms of the various bounds are as follows:

CELB: Combinatorial Envelope Lower Bound [28] (applies only for 1D mixtures)

CEUB: Combinatorial Envelope Upper Bound [28] (applies only for 1D mixtures)

CGQLB: Coarse-Grained Quantization Lower Bound [28] for a prescribed number of bins (applies only to divergences that satisfy the information monotonicity property)

CROT: Chain Rule Optimal Transport (this paper)

Sinkhorn CROT: Entropy-regularized CROT [5], computed for two values of the regularization parameter (chosen for convergence of the Sinkhorn-Knopp iterative matrix scaling algorithm).
Next, we consider the renowned MNIST handwritten digit database [20]: a dataset of 70,000 grey-level handwritten digit images (http://yann.lecun.com/exdb/mnist/).
We learn GMMs composed of multivariate Gaussian distributions with diagonal covariance matrices from this MNIST database using PCA (dimension reduction from the original dimension to a reduced dimension $d$) as explained in the caption of Table 1. We used the Expectation-Maximization (EM) algorithm implementation of scikit-learn [32]. We approximate the TV between the $d$-dimensional GMMs $m$ and $m'$ using Monte Carlo stochastic integration:
$\mathrm{TV}(m, m') = \frac{1}{2} \int |m(x) - m'(x)|\, \mathrm{d}x = \frac{1}{2}\, \mathbb{E}_{x \sim m}\!\left[ \left| 1 - \frac{m'(x)}{m(x)} \right| \right].$
The results are obtained using POT [11] (Python Optimal Transport).
[Table 1: Estimated TV, the CROT-TV upper bound, and two Sinkhorn CROT upper bounds (for two regularization values) between MNIST GMMs, for several sample sizes and reduced dimensions.]
Our experiments yield the following observations: As the sample size decreases, the TV distances between GMMs become larger because the GMMs are pulled towards two different empirical distributions. As the dimension increases, TV increases because in a high-dimensional space the GMM components are less likely to overlap. We check that CROT-TV is an upper bound on TV, and we verify that the Sinkhorn divergences are upper bounds on CROT.
4.2 Square root of the symmetric JensenShannon divergence
TV is bounded in $[0, 1]$, which makes it difficult to appreciate the quality of the CROT upper bounds in general. We shall therefore consider a different parametric distance whose upper bound can be made arbitrarily large.
It is well known that the square root of the Jensen-Shannon divergence is a metric [12] (satisfying the triangle inequality). In [22], a generalization of the Jensen-Shannon divergence was proposed, given by
(23) $\mathrm{JS}_\alpha(p : q) = \mathrm{KL}\big(p : (1-\alpha) p + \alpha q\big) + \mathrm{KL}\big(q : (1-\alpha) q + \alpha p\big),$
where $\alpha \in (0, 1]$. $\mathrm{JS}_\alpha$ unifies (twice) the Jensen-Shannon divergence (obtained when $\alpha = \frac{1}{2}$) with the Jeffreys divergence ($\alpha = 1$) [22]. A nice property is that the skew divergence is upper bounded as follows: $\mathrm{KL}\big(p : (1-\alpha) p + \alpha q\big) \leq \log \frac{1}{1-\alpha}$ for $\alpha \in (0, 1)$, so that $\mathrm{JS}_\alpha \leq 2 \log \frac{1}{1-\alpha}$ for $\alpha \in (0, 1)$.
Thus the square root of the symmetrized divergence $\sqrt{\mathrm{JS}_\alpha}$ is upper bounded by $\sqrt{2 \log \frac{1}{1-\alpha}}$.
However, $\sqrt{\mathrm{JS}_\alpha}$ is not a metric in general [31]. Indeed, in the extreme case $\alpha = 1$, it is known that no positive power of the Jeffreys divergence yields a metric.
Observe that $\mathrm{JS}_\alpha$ is a divergence since the skew divergence $\mathrm{KL}\big(p : (1-\alpha)p + \alpha q\big)$ is an $f$-divergence for the generator $f_\alpha(u) = -\log(1 - \alpha + \alpha u)$, and we have $f_\alpha(1) = 0$. Since $\mathrm{KL}\big(q : (1-\alpha)q + \alpha p\big) = \int p\, u \log \frac{u}{(1-\alpha)u + \alpha}\, \mathrm{d}x$ for $u = \frac{q(x)}{p(x)}$, it follows that the generator for the $\mathrm{JS}_\alpha$ divergence is:
(24) $f(u) = -\log(1 - \alpha + \alpha u) + u \log \frac{u}{(1-\alpha)u + \alpha}.$
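To illustrate these properties numerically, the plain Jensen-Shannon divergence ($\alpha = \frac{1}{2}$, without the factor of two) between univariate Gaussians can be evaluated by quadrature, checking the $\log 2$ upper bound and the triangle inequality of its square root (a sketch with toy parameters):

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import rel_entr

def norm_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2.0 * np.pi))

def js(m1, s1, m2, s2):
    """JS(p, q) = (1/2) KL(p : m) + (1/2) KL(q : m) with m = (p + q)/2."""
    def integrand(x):
        p, q = norm_pdf(x, m1, s1), norm_pdf(x, m2, s2)
        m = 0.5 * (p + q)
        # rel_entr(a, b) = a * log(a / b), with the 0 log 0 = 0 convention
        return 0.5 * (rel_entr(p, m) + rel_entr(q, m))
    val, _ = quad(integrand, -30.0, 30.0, limit=300)
    return val
```

For well-separated Gaussians the value approaches $\log 2$, and $\sqrt{\mathrm{JS}}$ empirically satisfies the triangle inequality, consistent with [12].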
Figure 2(b) displays the experimental results obtained for the JS divergences.
5 Conclusion and perspectives
In this work, we defined the generic Chain Rule Optimal Transport (CROT) distance (Definition 1) for a ground distance $D$, which encompasses the Wasserstein distance between point sets (Earth Mover's Distance [36]) and the Mixture Component Optimal Transport (MCOT) distance [21], and proved that $\mathrm{CROT}_D$ is a metric whenever $D$ is a metric (Theorem 3). We then dealt with statistical mixtures, and showed that $D(m, m') \leq \mathrm{CROT}_D(m, m')$ (Theorem 5) whenever $D$ is joint convex. This holds in particular for statistical $f$-divergences.
We also considered the smoothed Sinkhorn CROT distance for fast calculation via matrix scaling algorithms (Sinkhorn-Knopp algorithm), which upper bounds the CROT distance.
There are many avenues to explore for further research. For example, we may consider infinite Gaussian mixtures [34], or the chain rule factorization of $d$-variate densities, which gives rise to a hierarchy of CROT distances. Another direction is to explore the use of the CROT distance in deep learning.
Acknowledgments
Frank Nielsen thanks Steve Huntsman for pointing out reference [21] to his attention.
References
 [1] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
 [2] Heinz H Bauschke and Jonathan M Borwein. Joint and separate convexity of the Bregman distance. In Studies in Computational Mathematics, volume 8, pages 23–36. Elsevier, 2001.
 [3] KuoChu Chang and Wei Sun. Scalable fusion with mixture distributions in sensor networks. In 11th International Conference on Control Automation Robotics & Vision (ICARCV), pages 1251–1256, 2010.
 [4] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [5] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
 [6] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
 [7] Minh N. Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE Signal Processing Letters, 10(4):115–118, 2003.

 [8] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
 [9] B. Everett. An introduction to latent variable models. Springer Science & Business Media, 2013.
 [10] Jean Feydy, Thibault Séjourné, FrançoisXavier Vialard, ShunIchi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. arXiv preprint arXiv:1810.08278, 2018.
 [11] Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017.
 [12] Bent Fuglede and Flemming Topsøe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT 2004), page 31. IEEE, 2004.
 [13] N. Ghaffari and S. Walker. On multivariate optimal transportation. arXiv e-prints, January 2018.
 [14] Jacob Goldberger and Hagai Aronowitz. A distance measure between GMMs based on the unscented transform and its application to speaker recognition. In INTERSPEECH European Conference on Speech Communication and Technology,, pages 1985–1988, 2005.

 [15] Jacob Goldberger, Shiri Gordon, and Hayit Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In IEEE International Conference on Computer Vision (ICCV), page 487. IEEE, 2003.
 [16] John R. Hershey and Peder A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–317. IEEE, 2007.
 [17] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 90(9):1848–1853, 2007.
 [18] Fumiyasu Komaki. Bayesian prediction based on a class of shrinkage priors for locationscale models. Annals of the Institute of Statistical Mathematics, 59(1):135–146, 2007.
 [19] Bernhard Korte and Jens Vygen. Linear programming algorithms. In Combinatorial Optimization, pages 75–102. Springer, 2018.
 [20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 [21] Zhu Liu and Qian Huang. A new distance measure for probability distribution function of mixture type. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 616–619. IEEE, 2000.
 [22] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010.
 [23] Frank Nielsen. Closedform informationtheoretic divergences for statistical mixtures. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1723–1726. IEEE, 2012.
 [24] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on bayes error using quasiarithmetic means. Pattern Recognition Letters, 42:25–34, 2014.
 [25] Frank Nielsen and Gaëtan Hadjeres. Monte Carlo information geometry: The dually flat case. arXiv preprint arXiv:1803.07225, 2018.
 [26] Frank Nielsen and Richard Nock. On the chi square and higherorder chi distances for approximating divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.
 [27] Frank Nielsen and Richard Nock. On mixtures: Finite convex combinations of prescribed component distributions. CoRR, abs/1708.00568, 2017.
 [28] Frank Nielsen and Ke Sun. Guaranteed bounds on informationtheoretic measures of univariate mixtures using piecewise logsumexp inequalities. Entropy, 18(12):442, 2016.
 [29] Frank Nielsen and Ke Sun. Guaranteed deterministic bounds on the total variation distance between univariate mixtures. In IEEE Machine Learning in Signal Processing (MLSP), pages 1–6, 2018.
 [30] Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002.
 [31] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.
 [32] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikitlearn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
 [33] József Pitrik and Dániel Virosztek. On the joint convexity of the Bregman divergence of matrices. Letters in Mathematical Physics, 105(5):675–692, 2015.
 [34] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in neural information processing systems, pages 554–560, 2000.
 [35] Douglas A. Reynolds, Thomas F. Quatieri, and Robert B. Dunn. Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3):19–41, 2000.
 [36] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
 [37] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkäuser, NY, pages 99–102, 2015.
 [38] Olivier Schwander and Frank Nielsen. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry, pages 403–426. Springer, 2013.
 [39] Jorge Silva and Shrikanth Narayanan. Upper bound KullbackLeibler divergence for hidden Markov models with application as discrimination measure for speech recognition. In IEEE International Symposium on Information Theory (ISIT), pages 2299–2303. IEEE, 2006.
 [40] Yoram Singer and Manfred K Warmuth. Batch and online parameter estimation of Gaussian mixtures based on the joint entropy. In Advances in Neural Information Processing Systems, pages 578–584, 1999.
 [41] Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
 [42] Li Xie, Valery A. Ugrinovskii, and Ian R. Petersen. Probabilistic distances between finitestate finitealphabet hidden Markov models. IEEE transactions on automatic control, 50(4):505–511, 2005.