1.1 Total variation and $f$-divergences
Let $(\mathcal{X}, \mathcal{F})$ be a measurable space on the sample space $\mathcal{X} = \mathbb{R}$ equipped with the Borel $\sigma$-algebra $\mathcal{F}$, and let $P$ and $Q$ be two probability measures with respective densities $p$ and $q$ with respect to the Lebesgue measure $\mu$. The Total Variation distance (TV for short) is a statistical metric distance defined by
$$\mathrm{TV}(p, q) = \frac{1}{2}\int |p(x) - q(x)|\,\mathrm{d}\mu(x).$$
The TV distance ranges in $[0, 1]$, and is related to the probability of error
$$P_e = \frac{1}{2}\int \min(p(x), q(x))\,\mathrm{d}\mu(x)$$
in Bayesian statistical hypothesis testing, so that $\mathrm{TV}(p, q) = 1 - 2P_e$. Since we have $|a - b| = a + b - 2\min(a, b)$ for any $a, b \ge 0$,
we can rewrite the TV equivalently as
$$\mathrm{TV}(p, q) = 1 - \int \min(p(x), q(x))\,\mathrm{d}\mu(x).$$
1.2 Prior work
For simple univariate distributions like univariate Gaussian distributions, the TV may admit a closed-form expression. For example, consider exponential family distributions with densities $p(x) = \exp(\theta_p^\top t(x) - F(\theta_p))$ and $q(x) = \exp(\theta_q^\top t(x) - F(\theta_q))$. When we can compute exactly the root solutions of $p(x) = q(x)$, e.g., when the sufficient statistics $t(x)$ encode a polynomial of degree at most $4$ (so that the roots are available in closed form), then we can split the distribution support as $\mathcal{X} = \uplus_j I_j$ based on the roots. Then in each interval
$I_j = (r_j, r_{j+1})$, we can compute the elementary interval integral $\int_{I_j} \min(p(x), q(x))\,\mathrm{d}x$ using the cumulative distribution functions $\Phi_p$ and $\Phi_q$. Indeed, assume without loss of generality that $p(x) \le q(x)$ on an interval $I_j = (r_j, r_{j+1})$. Then we have
$$\int_{I_j} \min(p(x), q(x))\,\mathrm{d}x = \Phi_p(r_{j+1}) - \Phi_p(r_j).$$
For univariate Gaussian distributions $p(x) = \mathcal{N}(x; \mu_1, \sigma_1)$ and $q(x) = \mathcal{N}(x; \mu_2, \sigma_2)$ with $\sigma_1 \neq \sigma_2$ (and $\sigma_1, \sigma_2 > 0$), the quadratic equation $\log p(x) = \log q(x)$ expands as $a x^2 + b x + c = 0$, where
$$a = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2}, \quad b = 2\left(\frac{\mu_2}{\sigma_2^2} - \frac{\mu_1}{\sigma_1^2}\right), \quad c = \frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2} + \log\frac{\sigma_1^2}{\sigma_2^2}.$$
We have two distinct roots $r_1 = \frac{-b - \sqrt{b^2 - 4ac}}{2a}$ and $r_2 = \frac{-b + \sqrt{b^2 - 4ac}}{2a}$ with $r_1 < r_2$. Therefore the TV between univariate Gaussians writes as follows:
$$\mathrm{TV}(p, q) = \frac{1}{2}\left|\mathrm{erf}\!\left(\frac{r_1 - \mu_1}{\sqrt{2}\,\sigma_1}\right) - \mathrm{erf}\!\left(\frac{r_1 - \mu_2}{\sqrt{2}\,\sigma_2}\right)\right| + \frac{1}{2}\left|\mathrm{erf}\!\left(\frac{r_2 - \mu_1}{\sqrt{2}\,\sigma_1}\right) - \mathrm{erf}\!\left(\frac{r_2 - \mu_2}{\sqrt{2}\,\sigma_2}\right)\right|,$$
where $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,\mathrm{d}t$ denotes the error function. Notice that it is a difficult problem to bound or find the modes of a GMM [6], and therefore to decompose the TV between GMMs into elementary intervals.
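As a sanity check, the closed-form TV between two univariate Gaussians can be implemented in a few lines. The following is a hedged sketch (the function names are ours, not from the paper's code), which can be validated against a brute-force Riemann sum of $\frac{1}{2}\int |p - q|$:

```python
import math

def gauss_cdf(x, mu, sigma):
    # CDF of N(mu, sigma) expressed with the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def tv_gaussians(mu1, s1, mu2, s2):
    # Closed-form TV between N(mu1, s1) and N(mu2, s2), assuming s1 != s2
    # so that log p(x) = log q(x) is a genuine quadratic with two roots.
    a = 1.0 / s1**2 - 1.0 / s2**2
    b = 2.0 * (mu2 / s2**2 - mu1 / s1**2)
    c = mu1**2 / s1**2 - mu2**2 / s2**2 + math.log(s1**2 / s2**2)
    disc = math.sqrt(b * b - 4.0 * a * c)
    r1, r2 = (-b - disc) / (2.0 * a), (-b + disc) / (2.0 * a)
    # TV = |Phi_1(r1) - Phi_2(r1)| + |Phi_1(r2) - Phi_2(r2)|,
    # since p - q changes sign exactly at the two roots.
    d = lambda r: gauss_cdf(r, mu1, s1) - gauss_cdf(r, mu2, s2)
    return abs(d(r1)) + abs(d(r2))
```

For instance, `tv_gaussians(0, 1, 0, 2)` agrees with numerical integration up to quadrature error.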
In practice, for mixture models (like Gaussian mixture models, GMMs) $m(x) = \sum_{i=1}^{k} w_i p_i(x)$ and $m'(x) = \sum_{i=1}^{k'} w'_i p'_i(x)$, the TV is approximated by either ➀ discretizing the integral (i.e., numerical integration)
$$\widehat{\mathrm{TV}}(m, m') = \frac{1}{2}\sum_{i=1}^{s} |m(x_i) - m'(x_i)|\,\Delta_i$$
for $s$ quadrature points $x_1 < \cdots < x_s$ with bin widths $\Delta_i$ (one can choose any quadrature rule), or ➁ performing stochastic Monte Carlo (MC) integration via importance sampling:
$$\widehat{\mathrm{TV}}_{\mathrm{MC}}(m, m') = \frac{1}{2n}\sum_{i=1}^{n} \frac{|m(x_i) - m'(x_i)|}{r(x_i)},$$
where $x_1, \ldots, x_n$ are independently and identically distributed (iid) samples from a proposal distribution $r(x)$. Choosing $r = \frac{1}{2}(m + m')$ yields a consistent estimator, $\lim_{n \to \infty} \widehat{\mathrm{TV}}_{\mathrm{MC}}(m, m') = \mathrm{TV}(m, m')$ almost surely (provided that the variance of the integrand is finite). This raises the problem of consistent calculations, since for a finite sample size $n$, we may have a first run with $\widehat{\mathrm{TV}}_{\mathrm{MC}} > \mathrm{TV}(m, m')$, and a second run with $\widehat{\mathrm{TV}}_{\mathrm{MC}} < \mathrm{TV}(m, m')$. Thus we seek guaranteed deterministic lower (L) and upper (U) bounds so that $L \le \mathrm{TV}(m, m') \le U$.
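To illustrate this run-to-run inconsistency, here is a hedged sketch of the MC importance-sampling estimator with proposal $r = \frac{1}{2}(m + m')$ (the function names and the toy GMM parameters are ours, chosen for illustration only):

```python
import math
import random

def gmm_pdf(x, weights, mus, sigmas):
    return sum(w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, mu, s in zip(weights, mus, sigmas))

def tv_mc(m, mp, n, rng):
    # TV = (1/2) E_r[|m(x) - m'(x)| / r(x)] with proposal r = (m + m')/2:
    # sample a fair coin (which mixture), then a component, then a Gaussian.
    est = 0.0
    for _ in range(n):
        weights, mus, sigmas = m if rng.random() < 0.5 else mp
        i = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(mus[i], sigmas[i])
        pm, pmp = gmm_pdf(x, *m), gmm_pdf(x, *mp)
        est += 0.5 * abs(pm - pmp) / (0.5 * (pm + pmp))
    return est / n

m  = ([0.5, 0.5], [-1.0, 2.0], [0.5, 1.0])
mp = ([0.7, 0.3], [0.0, 3.0], [1.0, 0.5])
runs = [tv_mc(m, mp, 1000, random.Random(seed)) for seed in range(5)]
# Different seeds give different estimates that may fall on either side of
# the true TV: there is no deterministic lower/upper guarantee.
```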
The TV is the only metric $f$-divergence [7], obtained for the generator $f(u) = \frac{1}{2}|u - 1|$ in
$$I_f(p, q) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x).$$
An $f$-divergence can either be bounded (e.g., the TV or the Jensen-Shannon divergence) or unbounded when the integral diverges (e.g., the Kullback-Leibler divergence or the $\alpha$-divergences).
Consider two finite mixtures $m(x) = \sum_{i=1}^{k} w_i p_i(x)$ and $m'(x) = \sum_{j=1}^{k'} w'_j p'_j(x)$. Since the TV is jointly convex, we have $\mathrm{TV}(m, m') \le \sum_{i=1}^{k}\sum_{j=1}^{k'} w_i w'_j\, \mathrm{TV}(p_i, p'_j)$. We may also refine this upper bound by using a variational bound [9, 10]. However, these upper bounds are too loose for the TV as they can easily go above the trivial upper bound of $1$.
In information theory, Pinsker’s inequality relates the Kullback-Leibler divergence to the TV by
$$\mathrm{KL}(p : q) \ge 2\,\mathrm{TV}^2(p, q).$$
Thus we can upper bound the TV in terms of the KL divergence as follows: $\mathrm{TV}(p, q) \le \sqrt{\mathrm{KL}(p : q)/2}$. Similarly, we can upper bound the TV using any $f$-divergence. However, the bounds may be implicit because the paper [11] considered the best lower bounds of a given $f$-divergence in terms of the total variation. For example, it is shown ([11], p. 15) that the Jensen-Shannon divergence is lower bounded by
$$\mathrm{JS}(p, q) \ge \frac{1}{2}\Big((1 + \mathrm{TV})\log(1 + \mathrm{TV}) + (1 - \mathrm{TV})\log(1 - \mathrm{TV})\Big), \qquad \mathrm{TV} = \mathrm{TV}(p, q).$$
See also [12] for reverse Pinsker inequalities (introducing crucial “fatness conditions” on the distributions since otherwise the $f$-divergences may be unbounded). We may then apply combinatorial lower and upper bounds on $f$-divergences of mixture models, following the method of [15], to get bounds on the TV. However, it is challenging to have our bounds for the TV $f$-divergence beat the naive upper bound of $1$ and the naive lower bound of $0$.
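A quick numerical illustration of the Pinsker-type upper bound $\mathrm{TV} \le \sqrt{\mathrm{KL}/2}$, and of how easily it exceeds the trivial bound of $1$, is sketched below (a hedged example using the standard closed-form Gaussian KL; the parameter values are ours):

```python
import math

def kl_gauss(mu1, s1, mu2, s2):
    # Closed-form KL(N(mu1, s1) : N(mu2, s2)).
    return math.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2.0 * s2**2) - 0.5

def tv_numeric(mu1, s1, mu2, s2, lo=-25.0, hi=25.0, n=50000):
    # Brute-force Riemann approximation of TV = (1/2) int |p - q|.
    h = (hi - lo) / n
    pdf = lambda x, mu, s: math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
    return 0.5 * h * sum(abs(pdf(lo + i * h, mu1, s1) - pdf(lo + i * h, mu2, s2))
                         for i in range(n))

tv = tv_numeric(0.0, 1.0, 1.0, 2.0)
pinsker = math.sqrt(kl_gauss(0.0, 1.0, 1.0, 2.0) / 2.0)
loose = math.sqrt(kl_gauss(0.0, 1.0, 10.0, 1.0) / 2.0)
# tv <= pinsker always; for far-apart Gaussians the Pinsker bound
# (here sqrt(50/2) = 5) exceeds the trivial upper bound of 1.
```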
1.3 Contributions and paper outline
We summarize our main contributions as follows:
We describe the Coarse-Grained Quantized Lower Bound (CGQLB, Theorem 1 in §2) by proving the information monotonicity of the total variation distance.
We present the Combinatorial Envelope Lower and Upper bounds (CELB/CEUB, Theorem 2 in §3) for the TV between univariate mixtures that rely on geometric envelopes and density ratio bounds.
2 TV bounds from information monotonicity
Let us prove the information monotonicity property [8] of the total variation distance: coarse-graining the (mixture) distributions can only decrease their total variation. (Note that this is not true for the Euclidean distance.)
Let $\mathcal{I} = \{I_1, \ldots, I_s\}$ be an arbitrary finite partition of the support into $s$ intervals. Using the cumulative distribution functions (CDFs) of the mixture components, we can calculate the mass of the mixtures inside each elementary interval as a weighted sum of the component CDFs. Let $\bar{m} = (\bar{m}_1, \ldots, \bar{m}_s)$ and $\bar{m}' = (\bar{m}'_1, \ldots, \bar{m}'_s)$, with $\bar{m}_i = \int_{I_i} m(x)\,\mathrm{d}x$, denote the induced coarse-grained discrete distributions (also called lumping [2]). Their total variation distance is
$$\mathrm{TV}(\bar{m}, \bar{m}') = \frac{1}{2}\sum_{i=1}^{s} |\bar{m}_i - \bar{m}'_i|.$$
Theorem 1 (Information monotonicity of TV).
The information monotonicity of the total variation ensures that
$$\mathrm{TV}(\bar{m}, \bar{m}') \le \mathrm{TV}(m, m').$$
Since $\bar{m}_i = \int_{I_i} m(x)\,\mathrm{d}x$ and $\bar{m}'_i = \int_{I_i} m'(x)\,\mathrm{d}x$, we have
$$\sum_{i=1}^{s} |\bar{m}_i - \bar{m}'_i| = \sum_{i=1}^{s} \left|\int_{I_i} \big(m(x) - m'(x)\big)\,\mathrm{d}x\right| \le \sum_{i=1}^{s} \int_{I_i} |m(x) - m'(x)|\,\mathrm{d}x = \int |m(x) - m'(x)|\,\mathrm{d}x.$$
Note that we coarse-grain a continuum support $\mathbb{R}$ (or $[0, \infty)$, say for Rayleigh mixtures) into a finite number of bins. The proof does not use the fact that the support is 1D and is therefore generalizable to the multi-dimensional case. For the discrete case, the proof would be different. In summary, this approach yields the Coarse-Grained Quantized Lower Bound (CGQLB).
By creating a hierarchy of nested partitions $\mathcal{I}^{(1)} \prec \mathcal{I}^{(2)} \prec \cdots \prec \mathcal{I}^{(T)}$ (each refining the previous one), we get the telescoping inequality:
$$\mathrm{TV}(\bar{m}^{(1)}, \bar{m}'^{(1)}) \le \mathrm{TV}(\bar{m}^{(2)}, \bar{m}'^{(2)}) \le \cdots \le \mathrm{TV}(\bar{m}^{(T)}, \bar{m}'^{(T)}) \le \mathrm{TV}(m, m').$$
This coarse-graining technique yields lower bounds for any $f$-divergence due to the information monotonicity property of $f$-divergences [8].
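The CGQLB computation can be sketched as follows (a hedged sketch: the function names and the toy GMMs are ours). Note how refining the partition with nested cut points can only increase the lower bound:

```python
import math

def gauss_cdf(x, mu, s):
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

def bin_mass(a, b, weights, mus, sigmas):
    # Mass of the mixture inside (a, b): weighted sum of component CDF gaps.
    return sum(w * (gauss_cdf(b, mu, s) - gauss_cdf(a, mu, s))
               for w, mu, s in zip(weights, mus, sigmas))

def cgqlb(m, mp, cuts):
    # Discrete TV of the coarse-grained bin masses; bins are
    # (-inf, c_1], (c_1, c_2], ..., (c_last, +inf).
    edges = [-math.inf] + list(cuts) + [math.inf]
    return 0.5 * sum(abs(bin_mass(a, b, *m) - bin_mass(a, b, *mp))
                     for a, b in zip(edges, edges[1:]))

m  = ([0.5, 0.5], [-1.0, 2.0], [0.5, 1.0])
mp = ([0.7, 0.3], [0.0, 3.0], [1.0, 0.5])
coarse = cgqlb(m, mp, [0.0])                             # one interior cut
fine = cgqlb(m, mp, [i / 4.0 - 5.0 for i in range(41)])  # nested refinement
# coarse <= fine <= TV(m, m') by information monotonicity
```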
We present now a simple upper bound when dealing with a very specific case of mixtures. Consider mixtures $m(x) = \sum_{i=1}^{k} w_i p_i(x)$ and $m'(x) = \sum_{i=1}^{k} w'_i p_i(x)$ sharing the same prescribed components (i.e., only the weights may differ). For example, this scenario occurs when we jointly learn a set of mixtures from several datasets [14]. Then it comes that
$$\mathrm{TV}(m, m') = \frac{1}{2}\int \Big|\sum_{i=1}^{k} (w_i - w'_i)\, p_i(x)\Big|\,\mathrm{d}x \le \frac{1}{2}\sum_{i=1}^{k} |w_i - w'_i|. \qquad (2)$$
We may always consider mixtures $m$ and $m'$ sharing the same prescribed components (by allowing some weights to be zero). In that case, let $w = (w_1, \ldots, w_k)$ and $w' = (w'_1, \ldots, w'_k)$ denote the weight distributions over the common components. From the above derivations we get $\mathrm{TV}(m, m') \le \mathrm{TV}(w, w')$. However, when the mixtures do not share components, we end up with the trivial upper bound of $1$ since in that case $\mathrm{TV}(w, w') = 1$. The upper bound in eq. (2) can be easily extended to mixtures of positive measures (with weight vectors not necessarily normalized to one).
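A hedged numerical check of eq. (2) for two mixtures sharing the same three Gaussian components (the parameter values are ours, chosen for illustration):

```python
import math

def pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

mus, sigmas = [-1.0, 0.5, 3.0], [0.5, 1.0, 0.8]
w, wp = [0.2, 0.5, 0.3], [0.4, 0.4, 0.2]  # two weight vectors, same components

def mix(x, weights):
    return sum(wi * pdf(x, mu, s) for wi, mu, s in zip(weights, mus, sigmas))

lo, hi, n = -20.0, 20.0, 40000
h = (hi - lo) / n
tv = 0.5 * h * sum(abs(mix(lo + i * h, w) - mix(lo + i * h, wp)) for i in range(n))
tv_weights = 0.5 * sum(abs(a - b) for a, b in zip(w, wp))
# tv <= tv_weights = 0.2: the discrete TV between the weight vectors
```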
3 TV bounds via geometric envelopes
Each mixture density can be sandwiched by its extreme weighted components: for every position $x$,
$$A(x) := \max\!\big(k\, w_{l(x)}\, p_{l(x)}(x),\; w_{u(x)}\, p_{u(x)}(x)\big) \;\le\; m(x) \;\le\; k\, w_{u(x)}\, p_{u(x)}(x) =: B(x), \qquad (5)$$
where $l(x)$ and $u(x)$ (resp. $l'(x)$ and $u'(x)$) denote respectively the indices of the component of mixture $m$ (resp. $m'$) that is the lowest (resp. highest) at position $x$. The index sequences $l(x), u(x), l'(x), u'(x)$ are piecewise constant integer functions when $x$ sweeps through
$\mathbb{R}$, and can be computed from the lower and upper geometric envelopes of the weighted mixture component probability distributions [13, 15]. It follows that
$$\min\!\big(A(x), A'(x)\big) \;\le\; \min\!\big(m(x), m'(x)\big) \;\le\; \min\!\big(B(x), B'(x)\big),$$
where $A'(x) \le m'(x) \le B'(x)$ denote the corresponding envelope bounds for $m'$.
We partition the support into $s$ elementary intervals $I_1, \ldots, I_s$ (with $\uplus_{j=1}^{s} I_j = \mathcal{X}$). Observe that on each interval, the indices $l(x)$, $u(x)$, $l'(x)$ and $u'(x)$ are all constant. We have
$$\mathrm{TV}(m, m') = 1 - \sum_{j=1}^{s} \int_{I_j} \min(m(x), m'(x))\,\mathrm{d}x,$$
and we can use the lower/upper bounds of eq. (5) to bound $\min(m(x), m'(x))$ on each piece. For a given interval $I_j = (a_j, b_j)$, we calculate the interval masses of the weighted components
in constant time using the cumulative distribution functions (CDFs) $\Phi_i$'s and $\Phi'_i$'s of the mixture components. Indeed, the probability mass inside an interval $(a_j, b_j)$ of a component (with CDF $\Phi$) is simply expressed as the difference between two CDF terms, $\Phi(b_j) - \Phi(a_j)$. Let $\alpha_j = \max\!\big(k\, w_{l}\, (\Phi_{l}(b_j) - \Phi_{l}(a_j)),\; w_{u}\, (\Phi_{u}(b_j) - \Phi_{u}(a_j))\big)$ and $\beta_j = k\, w_{u}\, (\Phi_{u}(b_j) - \Phi_{u}(a_j)) = \int_{I_j} B(x)\,\mathrm{d}x$, where $l$ and $u$ denote the constant envelope indices on $I_j$. It follows that
$$\alpha_j \le \int_{I_j} m(x)\,\mathrm{d}x \le \beta_j.$$
Notice that the above derivation applies to $m'$ as well, and therefore
$$\alpha'_j \le \int_{I_j} m'(x)\,\mathrm{d}x \le \beta'_j.$$
Thus we obtain the following lower and upper bounds of the TV:
$$1 - \sum_{j=1}^{s} \min(\beta_j, \beta'_j) \;\le\; \mathrm{TV}(m, m') \;\le\; 1 - \sum_{j=1}^{s} \max\!\big(0,\; \alpha_j + \alpha'_j - \beta_j - \beta'_j\big).$$
To further improve the bounds for exponential family components, we choose for each elementary interval $I_j$ a reference measure $\nu_j$, which can simply be set to the upper envelope component over $I_j$. Then we can bound the density ratio
$$\frac{p(x)}{\nu_j(x)} = \exp\!\big((\theta_p - \theta_{\nu_j})^\top t(x) - F(\theta_p) + F(\theta_{\nu_j})\big)$$
for any density $p$ in the same exponential family. Notice that $t(x)$ is usually a vector of monomials representing a polynomial function whose bounds can be computed straightforwardly for any given interval $I_j$. Therefore $\frac{p(x)}{\nu_j(x)}$
must lie in the range $[\lambda_j(p), \Lambda_j(p)]$, where
$$\lambda_j(p) = \exp\Big(\min_{x \in I_j} \log \frac{p(x)}{\nu_j(x)}\Big), \qquad \Lambda_j(p) = \exp\Big(\max_{x \in I_j} \log \frac{p(x)}{\nu_j(x)}\Big).$$
Correspondingly, $\lambda_j(p)\,\nu_j(x) \le p(x) \le \Lambda_j(p)\,\nu_j(x)$ on $I_j$. If $\Lambda_j(p) \le 1$, then $\min(p(x), \nu_j(x)) = p(x)$; otherwise, $\min(p(x), \nu_j(x)) \ge \min(\lambda_j(p), 1)\,\nu_j(x)$. Hence, we get refined deterministic bounds on the elementary interval integrals, computed in closed form using the CDF of the reference measure $\nu_j$.
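For Gaussian components, the log-density ratio over an interval is a quadratic in $x$, so its extrema lie at the interval endpoints or at the vertex. The following hedged sketch (our own naming, not the paper's code) computes a guaranteed range for such a ratio:

```python
import math

def log_ratio_coeffs(mu_p, s_p, mu_n, s_n):
    # log p(x) - log nu(x) = A x^2 + B x + C for two Gaussian densities.
    A = 0.5 * (1.0 / s_n**2 - 1.0 / s_p**2)
    B = mu_p / s_p**2 - mu_n / s_n**2
    C = 0.5 * (mu_n**2 / s_n**2 - mu_p**2 / s_p**2) + math.log(s_n / s_p)
    return A, B, C

def ratio_range(mu_p, s_p, mu_n, s_n, a, b):
    # Guaranteed range of p(x)/nu(x) over [a, b]: evaluate the quadratic at
    # the endpoints and, if it falls inside the interval, at its vertex.
    A, B, C = log_ratio_coeffs(mu_p, s_p, mu_n, s_n)
    g = lambda x: A * x * x + B * x + C
    vals = [g(a), g(b)]
    if A != 0.0:
        v = -B / (2.0 * A)
        if a < v < b:
            vals.append(g(v))
    return math.exp(min(vals)), math.exp(max(vals))
```

Per-interval ranges like these can then be combined over the mixture weights to sandwich a mixture by scaled copies of the reference density, whose interval mass is available through its CDF.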
We call these bounds CELB/CEUB for combinatorial envelope lower/upper bounds.
The total variation distance between univariate Gaussian mixtures can be deterministically approximated in $O(n \log n)$-time, where $n = k + k'$ denotes the total number of mixture components.
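In the same spirit (though simpler than the envelope arrangement described above), the following hedged sketch derives deterministic TV bounds on a fixed grid: the lower bound coarse-grains exact interval masses, while the upper bound uses the fact that a unimodal component attains its minimum over an interval at an endpoint, yielding a pointwise floor for each mixture. All names and parameters are ours:

```python
import math

def pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def cdf(x, mu, s):
    return 0.5 * (1.0 + math.erf((x - mu) / (s * math.sqrt(2.0))))

def mass(a, b, mix):
    weights, mus, sigmas = mix
    return sum(w * (cdf(b, mu, s) - cdf(a, mu, s))
               for w, mu, s in zip(weights, mus, sigmas))

def floor_const(a, b, mix):
    # A unimodal component density attains its minimum over [a, b] at an
    # endpoint, so max_i w_i * min(p_i(a), p_i(b)) lower-bounds the mixture.
    weights, mus, sigmas = mix
    return max(w * min(pdf(a, mu, s), pdf(b, mu, s))
               for w, mu, s in zip(weights, mus, sigmas))

def tv_bounds(m, mp, lo=-10.0, hi=10.0, bins=400):
    h = (hi - lo) / bins
    edges = [lo + i * h for i in range(bins + 1)]
    cells = [(-math.inf, lo)] + list(zip(edges, edges[1:])) + [(hi, math.inf)]
    # Lower bound: TV >= 1 - sum_j min(mass_j(m), mass_j(m')).
    lower = 1.0 - sum(min(mass(a, b, m), mass(a, b, mp)) for a, b in cells)
    # Upper bound: TV <= 1 - sum_j |I_j| * min(floor_j(m), floor_j(m')).
    upper = 1.0 - sum(h * min(floor_const(a, b, m), floor_const(a, b, mp))
                      for a, b in zip(edges, edges[1:]))
    return lower, upper

m  = ([0.5, 0.5], [-1.0, 2.0], [0.5, 1.0])
mp = ([0.7, 0.3], [0.0, 3.0], [1.0, 0.5])
lb, ub = tv_bounds(m, mp)
# lb <= TV(m, m') <= ub deterministically
```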
The following section describes experimental results that highlight the tightness of these bounds.
4 Experiments

We assess the proposed TV bounds on three univariate GMM models, as used in [13]. We split each elementary interval into pieces of equal size so as to improve the bound quality. For MC (Monte Carlo) and CGQLB, we sample from both $m$ and $m'$ and combine these sample sets. Fig. (1) shows the envelopes of these GMMs and the corresponding TV. In the rightmost figure, the $x$-axis is the sample size for MC and CGQLB, and the $y$-axis is the TV value. The 95% confidence interval is visualized for MC. We can see that the proposed combinatorial bounds are quite tight and enclose the true TV value. Notably, the CGQLB is even tighter if the sample size is large enough. Given the same sample size, the number of density evaluations (computing $m(x)$ and $m'(x)$ once) is the same for MC and CGQLB. Therefore, instead of doing MC, one should prefer CGQLB, which provides a deterministic bound. The experiments are further carried out on Gamma and Rayleigh mixtures, see Fig. (1).
In a second set of experiments, we generate random GMMs where the means are drawn from the standard Gaussian distribution (dataset 1) or from a Gaussian with its standard deviation increased to 5 (dataset 2). In both cases, the precisions are randomly sampled. All components have equal weights. Fig. (2) shows the mean $\pm$ standard deviation for CELB/CEUB/CGQLB against the number of mixture components $k$. TV (relative) shows the relative value of the bounds, which is the ratio between the bound and the “true” TV estimated using a large number of MC samples. CGQLB is implemented with 100 random samples drawn from a mixture of $m$ and $m'$ with equal weights. The Pinsker upper bound is based on a “true” KL estimated by MC sampling. (Strictly speaking, the Pinsker bound estimated in this way is not a deterministic bound, unlike our proposed bounds.) We perform 100 independent runs for each $k$. The TV decreases as $k$ increases because $m$ and $m'$ become more mixed. We see that CELB and CEUB provide relatively tight bounds as compared to the Pinsker bound, and are well enclosed by the trivial bounds $[0, 1]$. The quality of CGQLB is remarkably impressive: based on the yellow lines in the right figures, using merely 100 random samples we can get a lower bound which is very close to the true value of the TV.
5 Conclusion and discussion
We described novel deterministic lower and upper bounds on the total variation distance between univariate mixtures, and demonstrated their effectiveness for Gaussian, Gamma and Rayleigh mixtures. This task is all the more challenging since the TV value falls in the range $[0, 1]$, and the designed bounds should improve over these naive bounds. A first proposed approach relies on the information monotonicity [8] of the TV to design a lower bound (or a series of nested lower bounds), and can be extended to arbitrary $f$-divergences. A second set of techniques uses tools of computational geometry to compute the upper and lower geometric envelopes of the weighted mixture component distributions, and retrieves from these decompositions both Combinatorial Envelope Lower and Upper Bounds (CELB/CEUB). All those methods certify deterministic bounds, and are therefore recommended over the traditional stochastic Monte Carlo approximations, which carry no deterministic guarantee (although they are asymptotically consistent).
Finally, let us discuss the role of generalized TV distances in $f$-divergences: $f$-divergences are separable statistical divergences which admit the following integral-based representation [16, 11, 17, 18]:
$$I_f(p, q) = \int_0^1 \mathrm{TV}_\pi(p, q)\,\mathrm{d}\gamma_f(\pi),$$
where $\gamma_f$ is a weight measure induced by the convex generator $f$. Here, we have $\mathrm{TV}_\pi(p, q) = \mathrm{TV}(p, q)$ for $\pi = \frac{1}{2}$, see [17]. The $\mathrm{TV}_\pi$'s are generalized (bounded) total variation distances, and our deterministic bounds can be extended to these $\mathrm{TV}_\pi$'s. However, note that $I_f$ may be infinite (unbounded) when the integral diverges.
Code for reproducible research is available at https://franknielsen.github.io/BoundsTV/index.html
-  Sheldon Ross, A first course in probability, Pearson, 2014.
-  Imre Csiszar and Paul C. Shields, “Notes on information theory and statistics,” Foundations and Trends in Communications and Information Theory, vol. 30, pp. 42, 2004.
-  Frank Nielsen, “Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means,” Pattern Recognition Letters, vol. 42, pp. 25–34, 2014.
-  Michael J Swain and Dana H Ballard, “Color indexing,” International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
-  Frank Nielsen and Vincent Garcia, “Statistical exponential families: A digest with flash cards,” arXiv preprint arXiv:0911.4863, 2009.
-  C. Améndola, A. Engström, and C. Haase, “Maximum Number of Modes of Gaussian Mixtures,” ArXiv e-prints, Feb. 2017.
-  Mohammadali Khosravifard, Dariush Fooladivanda, and T Aaron Gulliver, “Confliction of the convexity and metric properties in $f$-divergences,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 90, no. 9, pp. 1848–1853, 2007.
-  Shun-ichi Amari, Information Geometry and Its Applications, Applied Mathematical Sciences. Springer Japan, 2016.
-  John R Hershey and Peder A Olsen, “Approximating the Kullback-Leibler divergence between Gaussian mixture models,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 2007, vol. 4, pp. IV–317.
-  J-L Durrieu, J-Ph Thiran, and Finnian Kelly, “Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4833–4836.
-  Mark D Reid and Robert C Williamson, “Generalised Pinsker inequalities,” arXiv preprint arXiv:0906.1244, 2009.
-  Igal Sason, “On reverse Pinsker inequalities,” arXiv preprint arXiv:1503.07118, 2015.
-  Frank Nielsen and Ke Sun, “Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities,” Entropy, vol. 18, no. 12, pp. 442, 2016.
-  Olivier Schwander, Stéphane Marchand-Maillet, and Frank Nielsen, “Comix: Joint estimation and lightspeed comparison of mixture models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2449–2453.
-  Frank Nielsen and Ke Sun, “Combinatorial bounds on the $\alpha$-divergence of univariate mixture models,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, 2017, pp. 4476–4480.
-  Friedrich Liese and Igor Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
-  Mark D Reid and Robert C Williamson, “Information, divergence and risk for binary experiments,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 731–817, 2011.
-  Igal Sason and Sergio Verdú, “$f$-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.