On The Chain Rule Optimal Transport Distance

12/19/2018 ∙ by Frank Nielsen, et al.

We define a novel class of distances between statistical multivariate distributions by solving an optimal transportation problem on their marginal densities with respect to a ground distance defined on their conditional densities. By using the chain rule factorization of probabilities, we show how to perform optimal transport on a ground space being an information-geometric manifold of conditional probabilities. We prove that this new distance is a metric whenever the chosen ground distance is a metric. Our distance generalizes both the Wasserstein distances between point sets and a recently introduced metric distance between statistical mixtures. As a first application of this Chain Rule Optimal Transport (CROT) distance, we show that the ground distance between statistical mixtures is upper bounded by this optimal transport distance, whenever the ground distance is joint convex. We report on our experiments which quantify the tightness of the CROT distance for the total variation distance and a square root generalization of the Jensen-Shannon divergence between mixtures.

1 Introduction

Calculating (dis)similarities between statistical mixtures is a core primitive often met in statistics, machine learning, signal processing, and information fusion [3], among others. However, the usual information-theoretic Kullback-Leibler (KL) divergence (also known as the relative entropy) or, more generally, the $f$-divergences between statistical mixtures [28] do not admit closed-form formulas, and are in practice approximated by costly Monte Carlo stochastic integration [28].

To tackle this computational tractability problem, two research directions have been considered in the literature: The first line of research consists in proposing distances between mixtures that yield closed-form formulas [23] (e.g., the Cauchy-Schwarz divergence or the Jensen quadratic Rényi divergence). The second line of research consists in lower and upper bounding the $f$-divergences between mixtures [28]. This is tricky when considering bounded divergences like the Total Variation (TV) distance or the Jensen-Shannon (JS) divergence, which are upper bounded by $1$ and $\log 2$, respectively.

When dealing with probability densities, two main classes of statistical distances have been widely studied in the literature:

  1. The Information-Geometric (IG) invariant $f$-divergences [1] (characterized as the class of separable distances), and

  2. The Wasserstein distances of Optimal Transport (OT) [37] which can be computationally accelerated using entropy regularization [4, 10] (Sinkhorn divergence).

In general, computing closed-form formulas for the OT between parametric distributions is difficult. A closed-form formula is known for elliptical distributions [8] (which include the multivariate Gaussian distributions), and the OT between multivariate continuous distributions can be calculated from the OT of their copulas [13].

The geometry induced by the distance is different in these two OT/IG cases. For example, consider location-scale families (or multivariate elliptical distributions):

  1. For OT, the distance between any two members admits the same closed-form formula [8] (depending only on the mean and variance parameters, not on the type of location-scale family). The OT geometry of Gaussian distributions has positive curvature [41].

  2. For any $f$-divergence, the information-geometric manifold has negative curvature [18] (hyperbolic geometry). It is known that for the Kullback-Leibler divergence, the manifold of mixtures with prescribed components is dually flat, and therefore admits an equivalent Bregman divergence [25].

In this paper, we build on the seminal work of Liu and Huang [21], who proposed a novel family of statistical distances for statistical mixtures by solving linear programs between the component weights of the mixtures, where the elementary distance between any two mixture components is prescribed. They proved that their distance between mixtures (which we term the MCOT distance, for Mixture Component Optimal Transport) is a metric whenever the elementary distance between mixture components is a metric. This framework also applies to semi-parametric mixtures obtained from Kernel Density Estimators (KDEs) [38].

We describe our main contributions as follows:

  • We define the Chain Rule Optimal Transport (CROT) distance in Definition 1, and prove that it yields a metric whenever the distance between conditional distributions is a metric in §2.2 (Theorem 3). The CROT distance extends the Wasserstein/EMD distances and the MCOT distance between statistical mixtures. We further sketch how to recursively build hierarchical families of CROT distances.

  • We report a novel generic upper bound for statistical distances between mixtures [29] using CROT distances in §3 (Theorem 5) whenever the ground distance is joint convex.

  • In §4, experiments highlight quantitatively the upper bound performance of the CROT distance for bounding the total variation distance and a generalization of the square root of the Jensen-Shannon distance.

2 The Chain Rule Optimal Transport (CROT) distance

2.1 Definition

We define a novel class of distances between statistical multivariate distributions. Recall the basic chain rule factorization of a joint probability distribution $p(x, z)$:
$$p(x, z) = p(z)\, p(x \mid z),$$
where $p(z)$ is called the marginal probability, and $p(x \mid z)$ is termed the conditional probability. Let $\mathcal{M}$ and $\mathcal{C}$ denote the manifolds of marginal probability densities and conditional probability densities, respectively.

For example, for latent models like statistical mixtures or hidden Markov models [42, 39], $x$ plays the role of the observed variable while $z$ denotes the hidden variable [9] (unobserved, so that inference has to tackle incomplete data, say, using the EM algorithm [6]).

First, we state the generic definition of the Chain Rule Optimal Transport distance between joint distributions $p(x, z) = p(z)\, p(x \mid z)$ and $q(x, z') = q(z')\, q(x \mid z')$ (with $z \in \mathcal{Z}$ and $z' \in \mathcal{Z}'$) as follows:

Definition 1 (CROT distance).

Given two multivariate distributions $p(x, z) = p(z)\, p(x \mid z)$ and $q(x, z') = q(z')\, q(x \mid z')$, we define the Chain Rule Optimal Transport (CROT) distance as follows:
$$\mathrm{CROT}_\rho(p, q) := \inf_{\gamma \in \Gamma(p(z), q(z'))} \mathbb{E}_{(z, z') \sim \gamma}\big[\rho\big(p(x \mid z), q(x \mid z')\big)\big] \qquad (1)$$
$$= \inf_{\gamma \in \Gamma(p(z), q(z'))} \int_{\mathcal{Z} \times \mathcal{Z}'} \rho\big(p(x \mid z), q(x \mid z')\big)\, \mathrm{d}\gamma(z, z'), \qquad (2)$$
where $\rho$ is a ground distance defined on the conditional density manifold $\mathcal{C}$ (e.g., the Total Variation, TV), and $\Gamma(p(z), q(z'))$ is the set of all probability measures on $\mathcal{Z} \times \mathcal{Z}'$ with marginals $p(z)$ and $q(z')$, i.e., satisfying the following constraints:
$$\int_{\mathcal{Z}'} \mathrm{d}\gamma(z, z') = p(z), \qquad \int_{\mathcal{Z}} \mathrm{d}\gamma(z, z') = q(z'). \qquad (3)$$

When the ground distance $\rho$ is clear from the context, we write $\mathrm{CROT}(p, q)$ as a shortcut for $\mathrm{CROT}_\rho(p, q)$. Since $\Gamma(p(z), q(z')) \neq \emptyset$ and since the independent coupling $p(z)\, q(z')$ is a feasible transport solution, we get the following upper bounds:

Property 2 (Upper bounds).

The CROT distance is upper bounded by the cost of the independent coupling:
$$\mathrm{CROT}_\rho(p, q) \leq \int_{\mathcal{Z} \times \mathcal{Z}'} \rho\big(p(x \mid z), q(x \mid z')\big)\, p(z)\, q(z')\, \mathrm{d}z\, \mathrm{d}z',$$
and, when the ground distance is bounded, $\mathrm{CROT}_\rho(p, q) \leq \sup \rho$.

Figure 1 illustrates the principle of the CROT distance. Another complementary motivation, when dealing with statistical mixtures, is presented in §3.

Figure 1: The CROT distance: Optimal matching of marginal densities wrt. a distance on conditional densities. We consider the complete bipartite graph with edges weighted by the distances between the corresponding conditional densities defined at edge vertices.

Let us notice that the CROT distance generalizes two distances met in the literature:

Remark 2.1 (CROT generalizes Wasserstein/EMD).

In the case where the conditional densities are Dirac distributions, we recover the Wasserstein distance [41] between point sets (or Earth Mover's Distance [36]), where $\rho$ is the ground metric distance.

Remark 2.2 (CROT generalizes MCOT).

When both marginals $p(z)$ and $q(z')$ are (finite) categorical distributions, we recover the distance formerly defined in [21], which we term the MCOT distance (for Mixture Component Optimal Transport) in the remainder.

2.2 CROT is a metric when the ground distance is a metric

Theorem 3 (CROT metric).

$\mathrm{CROT}_\rho$ is a metric whenever $\rho$ is a metric.

Proof.

We prove that $\mathrm{CROT}_\rho$ satisfies the following axioms of metric distances:

Non-negativity.

As $\rho \geq 0$, we have $\mathrm{CROT}_\rho(p, q) \geq 0$.

Law of indiscernibles.

If $p = q$, the diagonal coupling concentrated on $\{z = z'\}$ is feasible and has zero cost, so $\mathrm{CROT}_\rho(p, q) = 0$. Conversely, if $\mathrm{CROT}_\rho(p, q) = 0$, then, as $\rho$ is a metric, the density of the optimal coupling $\gamma$ is concentrated on the region of $\mathcal{Z} \times \mathcal{Z}'$ where $p(x \mid z) = q(x \mid z')$. We therefore have
$$p(x) = \int p(x \mid z)\, p(z)\, \mathrm{d}z = \iint p(x \mid z)\, \mathrm{d}\gamma(z, z') = \iint q(x \mid z')\, \mathrm{d}\gamma(z, z') = \int q(x \mid z')\, q(z')\, \mathrm{d}z' = q(x). \qquad (4)$$

Symmetry.

Since $\rho$ is symmetric, and since $\gamma \in \Gamma(p(z), q(z'))$ if and only if its transpose $\gamma^{\top} \in \Gamma(q(z'), p(z))$, we have
$$\mathrm{CROT}_\rho(p, q) = \inf_{\gamma \in \Gamma(p(z), q(z'))} \int \rho\big(p(x \mid z), q(x \mid z')\big)\, \mathrm{d}\gamma(z, z') \qquad (5)$$
$$= \inf_{\gamma^{\top} \in \Gamma(q(z'), p(z))} \int \rho\big(q(x \mid z'), p(x \mid z)\big)\, \mathrm{d}\gamma^{\top}(z', z) = \mathrm{CROT}_\rho(q, p). \qquad (6)$$

Triangle inequality.

The proof of the triangle inequality is not straightforward; it relies on the standard gluing construction of optimal transport. Given (near-)optimal couplings $\gamma_{12} \in \Gamma(p(z), q(z'))$ and $\gamma_{23} \in \Gamma(q(z'), r(z''))$, glue them into a measure $\gamma_{123}$ on $\mathcal{Z} \times \mathcal{Z}' \times \mathcal{Z}''$ having these pairwise marginals; applying the triangle inequality of $\rho$ pointwise and noting that the $(z, z'')$-marginal of $\gamma_{123}$ is a feasible coupling yields
$$\mathrm{CROT}_\rho(p, r) \leq \mathrm{CROT}_\rho(p, q) + \mathrm{CROT}_\rho(q, r), \qquad (7)$$
where $\Gamma(q(z'), r(z''))$ denotes the set of all probability measures on $\mathcal{Z}' \times \mathcal{Z}''$ with marginals $q(z')$ and $r(z'')$. ∎

3 CROT for statistical mixtures and Sinkhorn CROT

Consider two finite statistical mixtures $m(x) = \sum_{i=1}^{k} \alpha_i p_i(x)$ and $m'(x) = \sum_{j=1}^{k'} \beta_j q_j(x)$, not necessarily homogeneous nor of the same type. Let $\rho_{ij} := \rho(p_i : q_j)$ denote the ground distance between components $p_i$ and $q_j$. The Mixture Component Optimal Transport (MCOT) distance proposed in [21] amounts to solving a Linear Program (LP) with the following objective function to minimize:
$$\min_{W = [w_{ij}]} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \rho_{ij}, \qquad (8)$$
satisfying the following constraints:
$$\sum_{j=1}^{k'} w_{ij} = \alpha_i, \quad i = 1, \ldots, k, \qquad (9)$$
$$\sum_{i=1}^{k} w_{ij} = \beta_j, \quad j = 1, \ldots, k', \qquad (10)$$
$$w_{ij} \geq 0, \quad \forall i, j. \qquad (11)$$

By defining $U(\alpha, \beta)$ to be the set of non-negative $k \times k'$ matrices $W = [w_{ij}]$ with $W \mathbf{1}_{k'} = \alpha$ and $W^{\top} \mathbf{1}_{k} = \beta$ (the transport polytope [5]), we get the equivalent compact definition of MCOT/CROT:
$$\mathrm{CROT}_\rho(m, m') = \min_{W \in U(\alpha, \beta)} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \rho(p_i : q_j). \qquad (12)$$

When the ground distance is asymmetric, we shall use the ’:’ notation instead of the ’,’ notation for separating arguments.

In general, the LP problem (with $kk'$ variables, $kk'$ inequality constraints, and $k + k'$ equality constraints, of which $k + k' - 1$ are independent) delivers an optimal soft assignment of mixture components with at most $k + k' - 1$ nonzero coefficients in matrix $W$. (An LP in $d$ dimensions has its solution located at a vertex of a polytope, described by the intersection of $d$ hyperplanes, i.e., linear constraints.) The complexity of linear programming in $d$ variables with $b$ input bits using Karmarkar's interior point method is polynomial, in $O(d^{3.5} b)$ [19].
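To make the LP formulation concrete, here is a minimal sketch (not the authors' code) of how the program of Eqs. 8–11 could be solved with a generic LP solver; the helper name mcot_lp, the row-major flattening of $W$, and SciPy's HiGHS backend are our own illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog

def mcot_lp(alpha, beta, cost):
    """Solve the MCOT linear program of Eqs. 8-11 with a generic LP solver.

    alpha: mixture weights of m (length k), beta: weights of m' (length k'),
    cost: (k, k') matrix of ground distances rho(p_i : q_j).
    The k*k' variables are the entries of W flattened row-wise; the equality
    constraints enforce the row sums (Eq. 9) and column sums (Eq. 10).
    """
    k, kp = cost.shape
    A_eq = np.zeros((k + kp, k * kp))
    for i in range(k):                      # sum_j w_ij = alpha_i
        A_eq[i, i * kp:(i + 1) * kp] = 1.0
    for j in range(kp):                     # sum_i w_ij = beta_j
        A_eq[k + j, j::kp] = 1.0
    b_eq = np.concatenate([alpha, beta])
    res = linprog(cost.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")  # w_ij >= 0 (Eq. 11)
    return res.x.reshape(k, kp), res.fun

W, value = mcot_lp(np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.4]),
                   np.array([[0.1, 0.9], [0.4, 0.3], [0.8, 0.2]]))
print(W, value)  # optimal plan and MCOT value for a toy cost matrix
```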

Observe that we necessarily have $w_{ij} \leq \alpha_i$ and, similarly, $w_{ij} \leq \beta_j$, so that $w_{ij} \leq \min(\alpha_i, \beta_j)$. Note that $\mathrm{MCOT}_\rho(m, m) = 0$ since the diagonal plan $w_{ij} = \alpha_i \delta_{ij}$ has zero cost, where $\delta_{ij}$ denotes the Kronecker symbol: $\delta_{ij} = 1$ iff $i = j$, and $\delta_{ij} = 0$ otherwise.

We can interpret MCOT as a discrete optimal transport between (non-embedded) histograms. When $k = k'$, the transport polytope $U(\alpha, \beta)$ is the polyhedral set of non-negative matrices
$$U(\alpha, \beta) = \big\{ W \in \mathbb{R}_{\geq 0}^{k \times k} \,:\, W \mathbf{1}_{k} = \alpha,\ W^{\top} \mathbf{1}_{k} = \beta \big\},$$
and
$$\mathrm{MCOT}_\rho(m, m') = \min_{W \in U(\alpha, \beta)} \langle W, D \rangle_F,$$
where $\langle A, B \rangle_F = \mathrm{tr}(A^{\top} B)$ is the Frobenius inner product of matrices, $\mathrm{tr}(\cdot)$ the matrix trace, and $D = [\rho(p_i : q_j)]$ the cost matrix. This OT can be calculated using the network simplex in $O(k^3 \log k)$ time.

Cuturi [5] showed how to relax the objective function in order to get a fast calculation using the Sinkhorn divergence:
$$\mathrm{CROT}_\rho^{\lambda}(m, m') := \min_{W \in U_\lambda(\alpha, \beta)} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \rho(p_i : q_j), \qquad (13)$$
where $U_\lambda(\alpha, \beta) := \{ W \in U(\alpha, \beta) \,:\, \mathrm{KL}(W \,\|\, \alpha \beta^{\top}) \leq \lambda \}$. The KL divergence between two non-negative matrices $A = [a_{ij}]$ and $B = [b_{ij}]$ is defined by
$$\mathrm{KL}(A \,\|\, B) := \sum_{i,j} a_{ij} \log \frac{a_{ij}}{b_{ij}},$$
with the convention that $0 \log 0 = 0$. The Sinkhorn divergence is calculated using the equivalent dual Sinkhorn divergence by means of matrix scaling algorithms (e.g., the Sinkhorn-Knopp algorithm).

Because the minimization in Eq. 13 is performed on the subset $U_\lambda(\alpha, \beta) \subseteq U(\alpha, \beta)$ of the transport polytope, we have
$$\mathrm{CROT}_\rho(m, m') \leq \mathrm{CROT}_\rho^{\lambda}(m, m'). \qquad (14)$$
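A sketch of the entropy-regularized computation with POT's Sinkhorn-Knopp solver follows; ot.sinkhorn uses the dual, entropic-penalty formulation, its reg parameter is roughly the inverse of the $\lambda$ above, and the toy data below are illustrative.

```python
import numpy as np
import ot  # POT: Python Optimal Transport [11]

def sinkhorn_crot(alpha, beta, cost, reg=0.1):
    """Entropy-regularized (dual Sinkhorn) relaxation of the MCOT linear program.

    reg > 0 is POT's entropic regularization weight: larger reg gives smoother
    (denser) plans and faster convergence.  The transport cost of the regularized
    plan upper-bounds the exact MCOT value (Eq. 14), since the plan is feasible
    but generally suboptimal.
    """
    alpha, beta = np.asarray(alpha, float), np.asarray(beta, float)
    W = ot.sinkhorn(alpha, beta, cost, reg)   # Sinkhorn-Knopp matrix scaling
    return W, float(np.sum(W * cost))

# Toy comparison against the exact LP value.
alpha, beta = np.array([0.2, 0.5, 0.3]), np.array([0.6, 0.4])
cost = np.array([[0.1, 0.9], [0.4, 0.3], [0.8, 0.2]])
exact = float(np.sum(ot.emd(alpha, beta, cost) * cost))
_, relaxed = sinkhorn_crot(alpha, beta, cost, reg=0.05)
print(exact, relaxed)   # relaxed >= exact, as in Eq. 14
```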

3.1 Upper bounding statistical distances between mixtures with CROT

First, let us report the basic upper bounds for MCOT mentioned earlier in Property 2. The objective function is upper bounded by the cost of the independent coupling $w_{ij} = \alpha_i \beta_j$:
$$\mathrm{MCOT}_\rho(m, m') \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} \alpha_i \beta_j\, \rho(p_i : q_j). \qquad (15)$$

Now, when the conditional density distance $\rho$ is separate convex (i.e., convex in each of its two arguments), we get the following Separate Convexity Upper Bound (SCUB):
$$\rho(m : m') \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} \alpha_i \beta_j\, \rho(p_i : q_j). \qquad (16)$$

For example, norm-induced distances or $f$-divergences [26] are separate convex distances.

For the particular case of the KL divergence, and when $k = k'$, we get the following upper bound using the log-sum inequality [7, 27]:
$$\mathrm{KL}(m : m') \leq \sum_{i=1}^{k} \left( \alpha_i \log \frac{\alpha_i}{\beta_i} + \alpha_i\, \mathrm{KL}(p_i : q_i) \right). \qquad (17)$$

Since this holds for any permutation $\sigma$ of the mixture components, we can tighten this upper bound by minimizing over all permutations:
$$\mathrm{KL}(m : m') \leq \min_{\sigma} \sum_{i=1}^{k} \left( \alpha_i \log \frac{\alpha_i}{\beta_{\sigma(i)}} + \alpha_i\, \mathrm{KL}(p_i : q_{\sigma(i)}) \right). \qquad (18)$$

The best permutation $\sigma^{*}$ can be computed using the cubic-time Hungarian algorithm [40, 35, 15, 14] (with cost matrix $C = [c_{ij}]$, where $c_{ij} = \alpha_i \log \frac{\alpha_i}{\beta_j} + \alpha_i\, \mathrm{KL}(p_i : q_j)$).
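A sketch of this permutation-based tightening (Eq. 18), assuming univariate Gaussian components so that the pairwise KLs are available in closed form; the helper names and the cost-matrix construction are our own illustration built on SciPy's Hungarian solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def kl_gauss(m0, s0, m1, s1):
    """Closed-form KL divergence between univariate Gaussians N(m0, s0^2) and N(m1, s1^2)."""
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

def permutation_kl_bound(alpha, comps_p, beta, comps_q):
    """Tightened upper bound of Eq. 18 for two k-component Gaussian mixtures.

    comps_p, comps_q: lists of (mean, std) pairs; alpha, beta: weight vectors.
    """
    k = len(alpha)
    C = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            C[i, j] = alpha[i] * np.log(alpha[i] / beta[j]) \
                      + alpha[i] * kl_gauss(*comps_p[i], *comps_q[j])
    rows, cols = linear_sum_assignment(C)    # Hungarian algorithm, O(k^3)
    return C[rows, cols].sum(), cols         # bound value and best permutation

bound, perm = permutation_kl_bound([0.4, 0.6], [(-1.0, 0.5), (2.0, 1.0)],
                                   [0.3, 0.7], [(0.0, 0.8), (2.5, 1.2)])
print(bound, perm)
```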

Now, let us further rewrite $m(x) = \sum_{i,j} w_{ij}\, p_i(x)$ with $\sum_{j} w_{ij} = \alpha_i$, and $m'(x) = \sum_{i,j} w_{ij}\, q_j(x)$ with $\sum_{i} w_{ij} = \beta_j$, for any $W = [w_{ij}] \in U(\alpha, \beta)$. That is, we can interpret $m$ and $m'$ as mixtures of $k k'$ (redundant) components $p_i$ and $q_j$ sharing the same weights $w_{ij}$, and apply the upper bound of Eq. 17 to this "best split" of matching mixture components (the weight-ratio terms vanish since both redundant mixtures share the same weights):
$$\mathrm{KL}(m : m') \leq \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \mathrm{KL}(p_i : q_j). \qquad (19)$$

Let
$$\mathrm{CROT}_{\mathrm{KL}}(m, m') := \min_{W \in U(\alpha, \beta)} \sum_{i=1}^{k} \sum_{j=1}^{k'} w_{ij}\, \mathrm{KL}(p_i : q_j). \qquad (20)$$

Then it follows that
$$\mathrm{KL}(m : m') \leq \mathrm{CROT}_{\mathrm{KL}}(m, m'). \qquad (21)$$

Thus CROT allows us to upper bound the KL divergence between mixtures. The technique of rewriting mixtures as mixtures of redundant components bears some resemblance to the variational upper bound on the KL divergence between mixtures proposed in [16], which requires iterating a variational update until convergence.
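As a self-contained numerical illustration of Eq. 21 (our own sketch, with arbitrary univariate Gaussian mixtures and the standard closed-form Gaussian KL), one can compare a Monte Carlo estimate of KL(m : m') against the CROT-KL value computed with POT:

```python
import numpy as np
import ot
from scipy.stats import norm

rng = np.random.default_rng(0)

# Two univariate Gaussian mixtures (weights, means, standard deviations).
alpha, mu_p, sd_p = np.array([0.4, 0.6]), np.array([-1.0, 2.0]), np.array([0.5, 1.0])
beta,  mu_q, sd_q = np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([0.8, 1.2])

def kl_gauss(m0, s0, m1, s1):
    return np.log(s1 / s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5

# CROT-KL: optimal transport of the weights with the pairwise component KLs as ground cost.
C = np.array([[kl_gauss(mu_p[i], sd_p[i], mu_q[j], sd_q[j])
               for j in range(len(beta))] for i in range(len(alpha))])
crot_kl = float(np.sum(ot.emd(alpha, beta, C) * C))

# Monte Carlo estimate of KL(m : m') for comparison.
def mix_pdf(x, w, mu, sd):
    return np.sum(w[:, None] * norm.pdf(x[None, :], mu[:, None], sd[:, None]), axis=0)

comp = rng.choice(len(alpha), size=100_000, p=alpha)
x = rng.normal(mu_p[comp], sd_p[comp])
kl_mc = float(np.mean(np.log(mix_pdf(x, alpha, mu_p, sd_p) / mix_pdf(x, beta, mu_q, sd_q))))

print(kl_mc, crot_kl)  # the CROT-KL value should upper-bound the Monte Carlo estimate
```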

In fact, the CROT distance provides a good upper bound on the distance between mixtures provided the base distance is joint convex [2, 33].

Definition 4 (Joint convex distance).

A distance $\rho$ is joint convex if and only if
$$\rho\big(\lambda p_1 + (1 - \lambda) p_2 \,:\, \lambda q_1 + (1 - \lambda) q_2\big) \leq \lambda\, \rho(p_1 : q_1) + (1 - \lambda)\, \rho(p_2 : q_2),$$
where $\lambda \in [0, 1]$.

The $f$-divergences (for a convex generator $f$ satisfying $f(1) = 0$) are joint convex distances [31]. For mixtures with the same weights but different component bases and a joint convex distance $\rho$ (e.g., KL), we get $\rho(m : m') \leq \sum_{i} \alpha_i\, \rho(p_i : q_i)$.

Theorem 5 (Upper Bound on Joint Convex Mixture Distance (UBJCMD)).

Let $m$ and $m'$ be two finite mixtures, and $\rho$ any joint convex statistical base distance. Then CROT upper bounds the distance between the mixtures:
$$\rho(m : m') \leq \mathrm{CROT}_\rho(m, m'). \qquad (22)$$
Proof.

For any $W \in U(\alpha, \beta)$, rewrite $m = \sum_{i,j} w_{ij}\, p_i$ and $m' = \sum_{i,j} w_{ij}\, q_j$; joint convexity then yields $\rho(m : m') \leq \sum_{i,j} w_{ij}\, \rho(p_i : q_j)$, and taking the minimizing $W \in U(\alpha, \beta)$ gives Eq. 22. Notice that the argument also holds for an asymmetric base distance $\rho(\cdot : \cdot)$. ∎

Conversely, CROT yields a lower bound for joint concave distances (e.g., fidelity in quantum computing [30]).

Figure 2: An interpretation of CROT obtained by rewriting the mixtures $m$ and $m'$ as mixtures of redundant components weighted by the $w_{ij}$'s, and using the joint convexity of the base distance $\rho$.

Figure 2 illustrates the CROT distance between statistical mixtures (not having the same number of components).

4 Experiments

4.1 Total Variation distance

Since TV is a metric $f$-divergence [17] bounded in $[0, 1]$, so is MCOT-TV. The closed-form formula for the total variation between univariate Gaussian distributions is reported in [24] (using the erf function), and formulas for the total variation between Rayleigh distributions and between Gamma distributions are given in [29].

Figure 3(a) illustrates the performance of the various lower/upper bounds on the total variation between mixtures of Gaussian, Gamma, and Rayleigh distributions with respect to the true value, which is estimated using Monte Carlo sampling.

The acronyms of the various bounds are as follows:

  • CELB: Combinatorial Envelope Lower Bound [28] (applies only for 1D mixtures)

  • CEUB: Combinatorial Envelope Upper Bound [28] (applies only for 1D mixtures)

  • CGQLB: Coarse-Grained Quantization Lower Bound [28] for a given number of bins (applies only to $f$-divergences that satisfy the information monotonicity property)

  • CROT: Chain Rule Optimal Transport (this paper)

  • Sinkhorn CROT: Entropy-regularized CROT [5], with two settings of the regularization parameter $\lambda$ (chosen to ensure convergence of the Sinkhorn-Knopp iterative matrix scaling algorithm).

Next, we consider the renowned MNIST handwritten digit database [20]: a dataset of 70,000 grey-level handwritten digit images (http://yann.lecun.com/exdb/mnist/).

We learn GMMs composed of multivariate Gaussian distributions with diagonal covariance matrices from this MNIST database using PCA (dimension reduction from the original dimension $28 \times 28 = 784$ to a reduced dimension $d$), as explained in the caption of Table 1. We used the Expectation-Maximization (EM) algorithm implementation of scikit-learn [32].

We approximate the TV between $d$-dimensional GMMs $m$ and $m'$ using Monte Carlo stochastic integration of the following integrals: Let $s(x) := \frac{1}{2}\big(m(x) + m'(x)\big)$. Then
$$\mathrm{TV}(m, m') = \frac{1}{2} \int |m(x) - m'(x)|\, \mathrm{d}x = \frac{1}{2}\, \mathbb{E}_{x \sim s}\!\left[ \frac{|m(x) - m'(x)|}{s(x)} \right].$$
Furthermore, we have:
$$\mathrm{TV}(m, m') = 1 - \int \min\big(m(x), m'(x)\big)\, \mathrm{d}x.$$
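A minimal sketch of such a Monte Carlo TV estimator for diagonal-covariance GMMs, based on the mixture-average proposal written above; the toy GMM parameters and helper names are illustrative and are not the scikit-learn models used in the experiments.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(x, weights, means, variances):
    """Density of a diagonal-covariance GMM evaluated at the rows of x."""
    dens = np.zeros(len(x))
    for w, mu, var in zip(weights, means, variances):
        dens += w * multivariate_normal.pdf(x, mean=mu, cov=np.diag(var))
    return dens

def sample_gmm(rng, n, weights, means, variances):
    comp = rng.choice(len(weights), size=n, p=weights)
    means, variances = np.asarray(means), np.asarray(variances)
    return rng.normal(means[comp], np.sqrt(variances[comp]))

def tv_monte_carlo(rng, n, gmm_a, gmm_b):
    """TV(m, m') = 0.5 * E_{x ~ s}[|m(x) - m'(x)| / s(x)],  s = (m + m') / 2."""
    xa = sample_gmm(rng, n // 2, *gmm_a)
    xb = sample_gmm(rng, n // 2, *gmm_b)
    x = np.vstack([xa, xb])                      # draws from the mixture s
    pa, pb = gmm_pdf(x, *gmm_a), gmm_pdf(x, *gmm_b)
    return 0.5 * np.mean(np.abs(pa - pb) / (0.5 * (pa + pb)))

rng = np.random.default_rng(0)
gmm_a = ([0.5, 0.5], [[0.0, 0.0], [3.0, 0.0]], [[1.0, 1.0], [1.0, 2.0]])
gmm_b = ([0.7, 0.3], [[0.5, 0.5], [3.0, 1.0]], [[1.5, 1.0], [1.0, 1.0]])
print(tv_monte_carlo(rng, 200_000, gmm_a, gmm_b))  # estimate in [0, 1]
```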

The results are obtained using POT [11] (Python Optimal Transport).

Table 1: TV, CROT-TV, and Sinkhorn CROT distances (for two settings of the regularization parameter) between two GMMs with $k$ components each, estimated on the PCA-processed MNIST dataset. $d$ is the dimensionality of the PCA, and the second parameter is the relative sample size used to estimate the GMMs. The two GMMs are estimated from non-overlapping samples. For each configuration, the GMMs are repeatedly estimated from different random splits of the MNIST dataset. The mean ± standard deviation is reported over independent runs.

Our experiments yield the following observations: As the sample size decreases, the TV distances between the GMMs become larger because the GMMs are pulled towards the two different empirical distributions. As the dimension increases, TV increases because in a high-dimensional space the GMM components are less likely to overlap. We check that CROT-TV is an upper bound of TV, and we verify that the Sinkhorn CROT values are upper bounds of CROT-TV.

4.2 Square root of the symmetric α-Jensen-Shannon divergence

Figure 3: Experimental results. (a) Performance of the CROT distance and the Sinkhorn CROT distance for upper bounding the total variation distance between mixtures of (1) Gaussian, (2) Gamma, and (3) Rayleigh distributions. (b) Performance of the CROT distance and the Sinkhorn CROT distance for upper bounding the square root of the α-Jensen-Shannon distance between mixtures of (1) Gaussian, (2) Gamma, and (3) Rayleigh distributions.

TV is bounded in $[0, 1]$, which makes it difficult to appreciate the quality of the CROT upper bounds in general. We shall consider a different parametric distance that is upper bounded by an arbitrary prescribed bound: the square root of the symmetric $\alpha$-Jensen-Shannon divergence introduced below.

It is well known that the square root of the Jensen-Shannon divergence is a metric [12] (satisfying the triangle inequality). In [22], a generalization of the Jensen-Shannon divergence was proposed, given by
$$\mathrm{JS}_\alpha(p : q) := \mathrm{KL}\big(p : (1 - \alpha) p + \alpha q\big) + \mathrm{KL}\big(q : (1 - \alpha) q + \alpha p\big), \qquad (23)$$
where $\alpha \in (0, 1]$. $\mathrm{JS}_\alpha$ unifies (twice) the Jensen-Shannon divergence (obtained when $\alpha = \frac{1}{2}$) with the Jeffreys divergence (obtained when $\alpha = 1$) [22]. A nice property is that the skew divergence $K_\alpha(p : q) := \mathrm{KL}(p : (1 - \alpha) p + \alpha q)$ is upper bounded as follows:
$$K_\alpha(p : q) \leq \log \frac{1}{1 - \alpha} \quad \text{for } \alpha \in [0, 1),$$
so that $\mathrm{JS}_\alpha(p : q) \leq 2 \log \frac{1}{1 - \alpha}$ for $\alpha \in [0, 1)$.

Thus, the square root of the symmetrized $\alpha$-divergence is upper bounded by $\sqrt{2 \log \frac{1}{1 - \alpha}}$. However, $\sqrt{\mathrm{JS}_\alpha}$ is not a metric in general [31]. Indeed, in the extreme case of $\alpha = 1$, it is known that no positive power of the Jeffreys divergence yields a metric.

Observe that $\mathrm{JS}_\alpha$ is an $f$-divergence since the skew divergence $K_\alpha$ is an $f$-divergence for the generator $f_\alpha(u) = u \log \frac{u}{(1 - \alpha) u + \alpha}$, and we have $\mathrm{JS}_\alpha(p : q) = I_{f_\alpha}(p : q) + I_{f_\alpha}(q : p)$. Since $I_{f_\alpha}(q : p) = I_{f_\alpha^{\diamond}}(p : q)$ for the conjugate generator $f_\alpha^{\diamond}(u) := u f_\alpha(1/u)$, it follows that the $f$-generator for the $\mathrm{JS}_\alpha$ divergence is:
$$g_\alpha(u) = u \log \frac{u}{(1 - \alpha) u + \alpha} - \log\big((1 - \alpha) + \alpha u\big). \qquad (24)$$
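A small numerical sketch of the $\mathrm{JS}_\alpha$ divergence of Eq. 23 between two densities by one-dimensional quadrature (the Gaussian test densities, integration range, and helper name are our own illustrative choices), which also checks the $2\log\frac{1}{1-\alpha}$ upper bound discussed above:

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def js_alpha(p, q, alpha, lo=-20.0, hi=20.0):
    """Skew alpha-Jensen-Shannon divergence of Eq. 23 between densities p and q,
    computed by numerical quadrature (natural logarithm, values in nats)."""
    def integrand(x):
        px, qx = p(x), q(x)
        return (px * np.log(px / ((1 - alpha) * px + alpha * qx))
                + qx * np.log(qx / ((1 - alpha) * qx + alpha * px)))
    return quad(integrand, lo, hi)[0]

p = lambda x: norm.pdf(x, loc=0.0, scale=1.0)
q = lambda x: norm.pdf(x, loc=2.0, scale=1.5)

for alpha in (0.5, 0.9, 0.99):
    val = js_alpha(p, q, alpha)
    bound = 2 * np.log(1 / (1 - alpha))     # upper bound discussed above
    print(alpha, val, bound, val <= bound)
```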

Figure 3(b) displays the experimental results obtained for the $\alpha$-JS divergences.

5 Conclusion and perspectives

In this work, we defined the generic Chain Rule Optimal Transport (CROT) distance $\mathrm{CROT}_\rho$ (Definition 1) for a ground distance $\rho$, which encompasses the Wasserstein distance between point sets (Earth Mover's Distance [36]) and the Mixture Component Optimal Transport (MCOT) distance [21], and proved that $\mathrm{CROT}_\rho$ is a metric whenever $\rho$ is a metric (Theorem 3). We then dealt with statistical mixtures, and showed that $\rho(m : m') \leq \mathrm{CROT}_\rho(m, m')$ (Theorem 5) whenever $\rho$ is joint convex. This holds in particular for the statistical $f$-divergences: $I_f(m : m') \leq \mathrm{CROT}_{I_f}(m, m')$.

We also considered the smoothed Sinkhorn CROT distance for fast calculation of $\mathrm{CROT}_\rho$ via matrix scaling algorithms (the Sinkhorn-Knopp algorithm), with $\mathrm{CROT}_\rho(m, m') \leq \mathrm{CROT}_\rho^{\lambda}(m, m')$.

There are many avenues to explore for further research. For example, we may consider infinite Gaussian mixtures [34], or the chain rule factorization of $d$-variate densities, $p(x_1, \ldots, x_d) = p(x_1)\, p(x_2 \mid x_1) \cdots p(x_d \mid x_1, \ldots, x_{d-1})$: this gives rise to a hierarchy of CROT distances. Another direction is to explore the use of the CROT distance in deep learning.

The smooth (dual) Sinkhorn divergence has also been shown experimentally (on MNIST classification) to improve over the EMD in applications [5]. It would also be interesting to compare Sinkhorn CROT against CROT in applications that deal with mixtures of features [21].

Acknowledgments

Frank Nielsen thanks Steve Huntsman for pointing out reference [21] to his attention.

References

  • [1] Shun-ichi Amari. Information Geometry and Its Applications. Applied Mathematical Sciences. Springer Japan, 2016.
  • [2] Heinz H Bauschke and Jonathan M Borwein. Joint and separate convexity of the Bregman distance. In Studies in Computational Mathematics, volume 8, pages 23–36. Elsevier, 2001.
  • [3] Kuo-Chu Chang and Wei Sun. Scalable fusion with mixture distributions in sensor networks. In 11th International Conference on Control Automation Robotics & Vision (ICARCV), pages 1251–1256, 2010.
  • [4] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • [5] Marco Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in neural information processing systems, pages 2292–2300, 2013.
  • [6] Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the royal statistical society. Series B (methodological), pages 1–38, 1977.
  • [7] Minh N Do. Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models. IEEE signal processing letters, 10(4):115–118, 2003.
  • [8] D. C. Dowson and B. V. Landau. The Fréchet distance between multivariate normal distributions. Journal of Multivariate Analysis, 12(3):450–455, 1982.
  • [9] B. Everitt. An introduction to latent variable models. Springer Science & Business Media, 2013.
  • [10] Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-Ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. arXiv preprint arXiv:1810.08278, 2018.
  • [11] Rémi Flamary and Nicolas Courty. POT: Python Optimal Transport library, 2017.
  • [12] Bent Fuglede and Flemming Topsoe. Jensen-Shannon divergence and Hilbert space embedding. In International Symposium on Information Theory (ISIT 2004), page 31. IEEE, 2004.
  • [13] N. Ghaffari and S. Walker. On Multivariate Optimal Transportation. ArXiv e-prints, January 2018.
  • [14] Jacob Goldberger and Hagai Aronowitz. A distance measure between GMMs based on the unscented transform and its application to speaker recognition. In INTERSPEECH European Conference on Speech Communication and Technology, pages 1985–1988, 2005.
  • [15] Jacob Goldberger, Shiri Gordon, and Hayit Greenspan. An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures. In IEEE International Conference on Computer Vision (ICCV), page 487. IEEE, 2003.
  • [16] John R. Hershey and Peder A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 4, pages IV–317. IEEE, 2007.
  • [17] Mohammadali Khosravifard, Dariush Fooladivanda, and T. Aaron Gulliver. Confliction of the convexity and metric properties in $f$-divergences. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 90(9):1848–1853, 2007.
  • [18] Fumiyasu Komaki. Bayesian prediction based on a class of shrinkage priors for location-scale models. Annals of the Institute of Statistical Mathematics, 59(1):135–146, 2007.
  • [19] Bernhard Korte and Jens Vygen. Linear programming algorithms. In Combinatorial Optimization, pages 75–102. Springer, 2018.
  • [20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [21] Zhu Liu and Qian Huang. A new distance measure for probability distribution function of mixture type. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 616–619. IEEE, 2000.
  • [22] Frank Nielsen. A family of statistical symmetric divergences based on Jensen’s inequality. arXiv preprint arXiv:1009.4004, 2010.
  • [23] Frank Nielsen. Closed-form information-theoretic divergences for statistical mixtures. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 1723–1726. IEEE, 2012.
  • [24] Frank Nielsen. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognition Letters, 42:25–34, 2014.
  • [25] Frank Nielsen and Gaëtan Hadjeres. Monte Carlo information geometry: The dually flat case. arXiv preprint arXiv:1803.07225, 2018.
  • [26] Frank Nielsen and Richard Nock. On the chi square and higher-order chi distances for approximating $f$-divergences. IEEE Signal Processing Letters, 21(1):10–13, 2014.
  • [27] Frank Nielsen and Richard Nock. On mixtures: Finite convex combinations of prescribed component distributions. CoRR, abs/1708.00568, 2017.
  • [28] Frank Nielsen and Ke Sun. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy, 18(12):442, 2016.
  • [29] Frank Nielsen and Ke Sun. Guaranteed deterministic bounds on the total variation distance between univariate mixtures. In IEEE Machine Learning in Signal Processing (MLSP), pages 1–6, 2018.
  • [30] Michael A Nielsen and Isaac Chuang. Quantum computation and quantum information, 2002.
  • [31] Ferdinand Österreicher and Igor Vajda. A new class of metric divergences on probability spaces and its applicability in statistics. Annals of the Institute of Statistical Mathematics, 55(3):639–653, 2003.
  • [32] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of machine learning research, 12(Oct):2825–2830, 2011.
  • [33] József Pitrik and Dániel Virosztek. On the joint convexity of the Bregman divergence of matrices. Letters in Mathematical Physics, 105(5):675–692, 2015.
  • [34] Carl Edward Rasmussen. The infinite Gaussian mixture model. In Advances in neural information processing systems, pages 554–560, 2000.
  • [35] Douglas A Reynolds, Thomas F Quatieri, and Robert B Dunn. Speaker verification using adapted Gaussian mixture models. Digital signal processing, 10(1-3):19–41, 2000.
  • [36] Yossi Rubner, Carlo Tomasi, and Leonidas J Guibas. The earth mover’s distance as a metric for image retrieval. International journal of computer vision, 40(2):99–121, 2000.
  • [37] Filippo Santambrogio. Optimal transport for applied mathematicians. Birkhäuser, NY, pages 99–102, 2015.
  • [38] Olivier Schwander and Frank Nielsen. Learning mixtures by simplifying kernel density estimators. In Matrix Information Geometry, pages 403–426. Springer, 2013.
  • [39] Jorge Silva and Shrikanth Narayanan. Upper bound Kullback-Leibler divergence for hidden Markov models with application as discrimination measure for speech recognition. In IEEE International Symposium on Information Theory (ISIT), pages 2299–2303. IEEE, 2006.
  • [40] Yoram Singer and Manfred K Warmuth. Batch and on-line parameter estimation of Gaussian mixtures based on the joint entropy. In Advances in Neural Information Processing Systems, pages 578–584, 1999.
  • [41] Asuka Takatsu. Wasserstein geometry of Gaussian measures. Osaka Journal of Mathematics, 48(4):1005–1026, 2011.
  • [42] Li Xie, Valery A. Ugrinovskii, and Ian R. Petersen. Probabilistic distances between finite-state finite-alphabet hidden Markov models. IEEE transactions on automatic control, 50(4):505–511, 2005.