 # Guaranteed Deterministic Bounds on the Total Variation Distance between Univariate Mixtures

The total variation distance is a core statistical distance between probability measures that satisfies the metric axioms, with values always falling in [0,1]. This distance plays a fundamental role in machine learning and signal processing: it is a member of the broader class of f-divergences, and it is related to the probability of error in Bayesian hypothesis testing. Since the total variation distance does not admit closed-form expressions for statistical mixtures (like Gaussian mixture models), one often has to rely in practice on costly numerical integration or on fast Monte Carlo approximations, which however do not guarantee deterministic lower and upper bounds. In this work, we consider two methods for bounding the total variation distance between univariate mixture models. The first method uses the information monotonicity property of the total variation to design guaranteed nested deterministic lower bounds. The second method computes the geometric lower and upper envelopes of weighted mixture components to derive deterministic bounds based on density ratios. We demonstrate the tightness of our bounds in a series of experiments on Gaussian, Gamma and Rayleigh mixture models.


## 1 Introduction

### 1.1 Total variation and f-divergences

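Let $(\mathcal{X},\mathcal{F})$ be a measurable space on the sample space $\mathcal{X}$ equipped with the Borel $\sigma$-algebra [1] $\mathcal{F}$, and let $P$ and $Q$ be two probability measures with respective densities $p$ and $q$ with respect to the Lebesgue measure $\mu$. The Total Variation distance [2] (TV for short) is a statistical metric distance defined by

$$\mathrm{TV}(P,Q) := \sup_{E\in\mathcal{F}} |P(E)-Q(E)| = \mathrm{TV}(p,q),$$

with

$$\mathrm{TV}(p,q) = \frac{1}{2}\int_{\mathcal{X}} |p(x)-q(x)|\,\mathrm{d}\mu(x) = \frac{1}{2}\|p-q\|_1.$$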

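The TV distance ranges in $[0,1]$, and is related to the probability of error $P_e$ in Bayesian statistical hypothesis testing [3], so that $P_e(p,q)=\frac{1}{2}\left(1-\mathrm{TV}(p,q)\right)$ for equal prior probabilities. Since we have for any $a,b\in\mathbb{R}$

$$\frac{1}{2}|a-b| = \frac{a+b}{2}-\min(a,b) = \max(a,b)-\frac{a+b}{2},$$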

we can rewrite the TV equivalently as

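$$\begin{aligned}
\mathrm{TV}(p,q) &= \int_{\mathcal{X}}\left(\frac{p(x)+q(x)}{2}-\min(p(x),q(x))\right)\mathrm{d}\mu(x)\\
&= 1-\int_{\mathcal{X}}\min(p(x),q(x))\,\mathrm{d}\mu(x) = \int_{\mathcal{X}}\max(p(x),q(x))\,\mathrm{d}\mu(x)-1.
\end{aligned}\tag{1}$$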

Thus by bounding the “histogram similarity” [4, 3]

$$h(p,q) := \int_{\mathcal{X}}\min(p(x),q(x))\,\mathrm{d}\mu(x),$$

or equivalently

$$H(p,q) := \int_{\mathcal{X}}\max(p(x),q(x))\,\mathrm{d}\mu(x),$$

since $H(p,q)=2-h(p,q)$ by eq. (1), we obtain corresponding bounds for the TV and Bayes’ error probability $P_e$.

### 1.2 Prior work

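For simple univariate distributions like univariate Gaussian distributions, the TV may admit a closed-form expression. For example, consider exponential family distributions [5] with densities $p(x)=\exp(\theta_p^\top t(x)-F(\theta_p))$ and $q(x)=\exp(\theta_q^\top t(x)-F(\theta_q))$. When we can compute exactly the root solutions of $p(x)=q(x)$, e.g., when $(\theta_p-\theta_q)^\top t(x)$ encodes a polynomial of degree at most $4$ (for which closed-form root formulas exist), then we can split the distribution support as $\mathcal{X}=\uplus_{s=1}^{\ell} I_s$ based on the roots. Then in each interval $I_s$, we can compute the elementary interval integral using the cumulative distribution functions $\Phi_p$ and $\Phi_q$. Indeed, assume without loss of generality that $p(x)\ge q(x)$ on an interval $I=(a,b)$. Then we have

$$\frac{1}{2}\int_{I}|p(x)-q(x)|\,\mathrm{d}\mu(x) = \frac{1}{2}\left(\Phi_p(b)-\Phi_p(a)-\Phi_q(b)+\Phi_q(a)\right).$$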

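For univariate Gaussian distributions $p_1$ and $p_2$ with $p_i(x)=\frac{1}{\sigma_i\sqrt{2\pi}}\exp\left(-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right)$ (and $\sigma_1\neq\sigma_2$), the quadratic equation $\log\frac{p_1(x)}{p_2(x)}=0$ expands as $ax^2+bx+c=0$, where

$$a = \frac{1}{\sigma_1^2}-\frac{1}{\sigma_2^2},\qquad b = 2\left(\frac{\mu_2}{\sigma_2^2}-\frac{\mu_1}{\sigma_1^2}\right),\qquad c = \left(\frac{\mu_1}{\sigma_1}\right)^2-\left(\frac{\mu_2}{\sigma_2}\right)^2+2\log\frac{\sigma_1}{\sigma_2}.$$

We have two distinct roots $x_1=\frac{-b-\sqrt{\Delta}}{2a}$ and $x_2=\frac{-b+\sqrt{\Delta}}{2a}$ with $\Delta=b^2-4ac>0$. Therefore the TV between univariate Gaussians writes as follows:

$$\mathrm{TV}(p_1,p_2) = \frac{1}{2}\left|\mathrm{erf}\left(\frac{x_1-\mu_1}{\sigma_1\sqrt{2}}\right)-\mathrm{erf}\left(\frac{x_1-\mu_2}{\sigma_2\sqrt{2}}\right)\right| + \frac{1}{2}\left|\mathrm{erf}\left(\frac{x_2-\mu_1}{\sigma_1\sqrt{2}}\right)-\mathrm{erf}\left(\frac{x_2-\mu_2}{\sigma_2\sqrt{2}}\right)\right|,$$

where $\mathrm{erf}(x)=\frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,\mathrm{d}t$ denotes the error function. Notice that it is a difficult problem to bound or find the modes of a GMM [6], and therefore to decompose the TV between GMMs into elementary intervals.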
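The closed form for two univariate Gaussians can be checked numerically. The following is a minimal sketch, not the authors' code: the function names `gaussian_tv` and `numeric_tv` are illustrative, the equal-variance branch (a single crossing at the midpoint of the means) is an added assumption that the text does not cover, and the midpoint-rule integration is only a sanity check.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def gaussian_tv(mu1, sigma1, mu2, sigma2):
    """TV between N(mu1, sigma1^2) and N(mu2, sigma2^2) via the erf formula."""

    def erf_gap(x):
        # |erf((x - mu1)/(sigma1*sqrt 2)) - erf((x - mu2)/(sigma2*sqrt 2))|
        return abs(math.erf((x - mu1) / (sigma1 * math.sqrt(2.0)))
                   - math.erf((x - mu2) / (sigma2 * math.sqrt(2.0))))

    if sigma1 == sigma2:
        # Assumption (not in the text): with equal variances the densities
        # cross once, at the midpoint of the two means.
        return 0.5 * erf_gap(0.5 * (mu1 + mu2))

    # Coefficients of a*x^2 + b*x + c = 0, as given in the text.
    a = 1.0 / sigma1 ** 2 - 1.0 / sigma2 ** 2
    b = 2.0 * (mu2 / sigma2 ** 2 - mu1 / sigma1 ** 2)
    c = (mu1 / sigma1) ** 2 - (mu2 / sigma2) ** 2 + 2.0 * math.log(sigma1 / sigma2)
    root = math.sqrt(b * b - 4.0 * a * c)
    x1, x2 = (-b - root) / (2.0 * a), (-b + root) / (2.0 * a)
    return 0.5 * erf_gap(x1) + 0.5 * erf_gap(x2)


def numeric_tv(mu1, sigma1, mu2, sigma2, lo=-40.0, hi=40.0, steps=200_000):
    """Midpoint-rule approximation of (1/2) * integral of |p1 - p2| (check only)."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        total += abs(gaussian_pdf(x, mu1, sigma1) - gaussian_pdf(x, mu2, sigma2))
    return 0.5 * h * total


closed_form = gaussian_tv(0.0, 1.0, 1.0, 2.0)
numeric = numeric_tv(0.0, 1.0, 1.0, 2.0)
print(closed_form, numeric)
```

The two printed values should agree to several digits; the integration range and step count are arbitrary choices for this illustration.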

In practice, for mixture models (like Gaussian mixture models, GMMs), the TV is approximated by either ➀ discretizing the integral (i.e., numerical integration)

$$\widetilde{\mathrm{TV}}_m(p,q) = \frac{1}{2}\sum_{i=1}^{m}|p(x_i)-q(x_i)|\,(x_{i+1}-x_i) \ge 0,$$

for sample positions $x_1<\dots<x_{m+1}$ (one can choose any quadrature rule), or ➁ performing stochastic Monte Carlo (MC) integration via importance sampling:

$$\widehat{\mathrm{TV}}_m(p,q) = \frac{1}{2m}\sum_{i=1}^{m}\frac{1}{r(x_i)}|p(x_i)-q(x_i)| \ge 0,$$

where $x_1,\dots,x_m$ are independently and identically distributed (iid) samples from a proposal distribution $r(x)$. Choosing $r(x)=p(x)$ yields

$$\widehat{\mathrm{TV}}_m(p,q) = \frac{1}{2m}\sum_{i=1}^{m}\left|1-\frac{q(x_i)}{p(x_i)}\right| \ge 0.$$

While ➀ is time consuming, ➁ cannot guarantee deterministic bounds, although it is asymptotically a consistent estimator (confidence intervals can be calculated): $\lim_{m\to\infty}\widehat{\mathrm{TV}}_m(p,q)=\mathrm{TV}(p,q)$ (provided that the variance of the estimator is finite). This raises the problem of consistent calculations, since a first run and a second run on the same inputs may return different, and possibly contradictory, estimates. Thus we seek guaranteed deterministic lower ($L$) and upper ($U$) bounds so that $L(p,q)\le\mathrm{TV}(p,q)\le U(p,q)$.
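To make the contrast between ➀ and ➁ concrete, here is a small sketch of both approximations for two Gaussian mixtures. Everything in it is illustrative: the mixtures `p` and `q`, the helper names, the integration range and the sample sizes are assumptions chosen for the example, and the mixture `p` is deliberately given the widest component so that the ratio $q/p$ stays well behaved.

```python
import math
import random


def gmm_pdf(x, gmm):
    """Density of a univariate GMM given as a list of (weight, mean, std)."""
    return sum(w * math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
               for w, mu, s in gmm)


def gmm_sample(rng, gmm):
    """Draw one sample: choose a component by weight, then sample from it."""
    _, mu, s = rng.choices(gmm, weights=[w for w, _, _ in gmm])[0]
    return rng.gauss(mu, s)


def tv_riemann(p, q, lo, hi, m):
    """Deterministic discretization with m equal-width cells on [lo, hi]."""
    step = (hi - lo) / m
    return 0.5 * sum(abs(gmm_pdf(lo + i * step, p) - gmm_pdf(lo + i * step, q)) * step
                     for i in range(m))


def tv_monte_carlo(p, q, m, seed):
    """Importance-sampling estimate with proposal r = p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        x = gmm_sample(rng, p)
        total += abs(1.0 - gmm_pdf(x, q) / gmm_pdf(x, p))
    return total / (2.0 * m)


# Hypothetical example mixtures.
p = [(0.5, 0.0, 1.0), (0.5, 3.0, 1.5)]
q = [(0.3, -1.0, 0.8), (0.7, 2.0, 1.0)]

reference = tv_riemann(p, q, -15.0, 15.0, 20_000)
run_a = tv_monte_carlo(p, q, 5_000, seed=1)
run_b = tv_monte_carlo(p, q, 5_000, seed=2)
print(reference, run_a, run_b)
```

The two Monte Carlo runs differ only by their seed, yet they return different numbers scattered around the quadrature value, which is the consistency issue described above.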

The TV is the only metric $f$-divergence [7], with

$$\mathrm{TV}(p,q) = I_{f_{\mathrm{TV}}}(p:q),$$

where

$$I_f(p:q) := \int_{\mathcal{X}} p(x)\,f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x),\qquad f_{\mathrm{TV}}(u) = \frac{1}{2}|u-1|.$$

An $f$-divergence can either be bounded (e.g., TV or the Jensen-Shannon divergence) or unbounded when the integral diverges (e.g., the Kullback-Leibler divergence or the $\alpha$-divergences [8]).

Consider two finite mixtures $m(x)=\sum_{i=1}^{k}w_ip_i(x)$ and $m'(x)=\sum_{j=1}^{k'}w'_jp'_j(x)$. Since $I_f$ is jointly convex, we have $I_f(m:m')\le\sum_{i=1}^{k}\sum_{j=1}^{k'}w_iw'_j\,I_f(p_i:p'_j)$. We may also refine this upper bound by using a variational bound [9, 10]. However, these upper bounds are too loose for TV as they can easily go above the trivial upper bound of $1$.

In information theory, Pinsker’s inequality relates the Kullback-Leibler divergence to the TV by

$$\mathrm{KL}(p:q) \ge (2\log e)\,\mathrm{TV}^2(p:q).\tag{2}$$

Thus we can upper bound TV in terms of KL as follows: $\mathrm{TV}(p:q)\le\sqrt{\frac{\mathrm{KL}(p:q)}{2\log e}}$. Similarly, we can upper bound TV using any $f$-divergence [11]. However, the bounds may be implicit because the paper [11] considered the best lower bounds of a given $f$-divergence in terms of total variation. For example, it is shown ([11], p. 15) that the Jensen-Shannon divergence is lower bounded by

$$\mathrm{JS}(p:q) \ge \left(\frac{1}{2}-\frac{\mathrm{TV}(p:q)}{4}\right)\log\left(2-\mathrm{TV}(p:q)\right) + \left(\frac{1}{2}+\frac{\mathrm{TV}(p:q)}{4}\right)\log\left(2+\mathrm{TV}(p:q)\right) - \log 2.$$

See also [12] for reverse Pinsker inequalities (introducing crucial “fatness conditions” on the distributions since otherwise the $f$-divergences may be unbounded). We may then apply combinatorial lower and upper bounds on $f$-divergences of mixture models, following the method of [13], to get bounds on TV. However, it is challenging to have our bounds for the TV $f$-divergence beat the naive upper bound of $1$ and the lower bound of $0$.
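A short numeric illustration of how loose the Pinsker route of eq. (2) can be: the sketch below uses two Gaussians with equal variance, for which both the KL divergence (in nats, so that $\log e=1$) and the TV are available in closed form. The helper names and the chosen mean gaps are assumptions made for this example only.

```python
import math


def kl_gaussians(mu1, s1, mu2, s2):
    """KL(N(mu1, s1^2) : N(mu2, s2^2)) in nats (standard closed form)."""
    return math.log(s2 / s1) + (s1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * s2 ** 2) - 0.5


def pinsker_upper_bound(kl_nats):
    """TV <= sqrt(KL / 2) in nats, clipped to the trivial bound 1."""
    return min(1.0, math.sqrt(kl_nats / 2.0))


def tv_equal_variance(mu1, mu2, sigma):
    """Exact TV between N(mu1, sigma^2) and N(mu2, sigma^2).

    The densities cross once at the midpoint of the means, which gives
    TV = erf(|mu1 - mu2| / (2 * sigma * sqrt 2)).
    """
    return math.erf(abs(mu1 - mu2) / (2.0 * sigma * math.sqrt(2.0)))


rows = []
for gap in (0.5, 1.0, 2.0, 4.0):
    exact = tv_equal_variance(0.0, gap, 1.0)
    bound = pinsker_upper_bound(kl_gaussians(0.0, 1.0, gap, 1.0))
    rows.append((gap, exact, bound))
    print(f"gap={gap:3.1f}  TV={exact:.4f}  Pinsker bound={bound:.4f}")
```

Once the means are two standard deviations apart, the unclipped Pinsker value reaches 1, so the bound is no better than the trivial one.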

### 1.3 Contributions and paper outline

We summarize our main contributions as follows:

• We describe the Coarse-Grained Quantized Lower Bound (CGQLB, Theorem 1 in §2) by proving the information monotonicity of the total variation distance.

• We present the Combinatorial Envelope Lower and Upper bounds (CELB/CEUB, Theorem 2 in §3) for the TV between univariate mixtures that rely on geometric envelopes and density ratio bounds.

The paper is organized as follows: We present our deterministic bounds in §2 and in §3. We demonstrate numerical simulations in §4. Finally, §5 concludes and hints at further perspectives for designing bounds on $f$-divergences.

## 2 TV bounds from information monotonicity

Let us prove the information monotonicity property of the total variation distance: coarse-graining the (mixture) distributions cannot increase their total variation (this is not true for the Euclidean distance).

Let $\mathcal{I}=\{I_1,\dots,I_l\}$ be an arbitrary finite partition of the support $\mathcal{X}$ into $l$ intervals. Using the cumulative distribution functions (CDFs) of mixture components, we can calculate the mass of the mixtures inside each elementary interval as a weighted sum of the component CDFs. Let $m_{\mathcal{I}}$ and $m'_{\mathcal{I}}$ denote the induced coarse-grained discrete distributions (also called lumping), with $m_{\mathcal{I}}^s=\int_{I_s}m(x)\,\mathrm{d}\mu(x)$ and ${m'}_{\mathcal{I}}^{s}=\int_{I_s}m'(x)\,\mathrm{d}\mu(x)$. Their total variation distance is

$$\mathrm{TV}(m_{\mathcal{I}},m'_{\mathcal{I}}) := \frac{1}{2}\sum_{s=1}^{l}\left|m_{\mathcal{I}}^{s}-{m'}_{\mathcal{I}}^{s}\right|.$$
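###### Theorem 1 (Information monotonicity of TV).

The information monotonicity of the total variation ensures that

$$0 \le \mathrm{TV}(m_{\mathcal{I}},m'_{\mathcal{I}}) \le \mathrm{TV}(m,m') \le 1.\tag{3}$$

###### Proof.

By eq. (1),

$$\mathrm{TV}(m,m') = \int\max(m(x),m'(x))\,\mathrm{d}\mu(x)-1 = \sum_{s=1}^{l}\int_{I_s}\max(m(x),m'(x))\,\mathrm{d}\mu(x)-1.$$

Since $\max(m(x),m'(x))\ge m(x)$ and $\max(m(x),m'(x))\ge m'(x)$, we have

$$\int_{I_s}\max(m(x),m'(x))\,\mathrm{d}\mu(x) \ge \max\left(\int_{I_s}m(x)\,\mathrm{d}\mu(x),\int_{I_s}m'(x)\,\mathrm{d}\mu(x)\right).$$

Therefore

$$\begin{aligned}
\mathrm{TV}(m,m') &\ge \sum_{s=1}^{l}\max\left(\int_{I_s}m(x)\,\mathrm{d}\mu(x),\int_{I_s}m'(x)\,\mathrm{d}\mu(x)\right)-1\\
&= \sum_{s=1}^{l}\max\left(m_{\mathcal{I}}^{s},{m'}_{\mathcal{I}}^{s}\right)-1\\
&= \mathrm{TV}(m_{\mathcal{I}},m'_{\mathcal{I}}).
\end{aligned}$$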

Note that we coarse-grain a continuum support $\mathbb{R}$ (or $\mathbb{R}_+$, say for Rayleigh mixtures) into a finite number of bins. The proof does not use the fact that the support is 1D and therefore generalizes to the multi-dimensional case. For the discrete case, the proof will be different. In summary, this approach yields the Coarse-Grained Quantization Lower bound (CGQLB).

By creating a hierarchy of nested partitions $\mathcal{I}_1,\dots,\mathcal{I}_h$, each coarser than the previous one, we get the telescopic inequality:

$$\mathrm{TV}(m_{\mathcal{I}_h},m'_{\mathcal{I}_h}) \le \dots \le \mathrm{TV}(m_{\mathcal{I}_1},m'_{\mathcal{I}_1}) \le \mathrm{TV}(m,m').$$

This coarse-graining technique yields lower bounds for any $f$-divergence due to their information monotonicity property.
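The coarse-grained lower bound is simple to prototype for Gaussian mixtures, since bin masses only need component CDFs. The sketch below is an illustration under stated assumptions: mixtures are lists of hypothetical `(weight, mean, std)` triples, the bins are defined by a sorted list of cut points (with two unbounded outer bins), and the three nested cut sets are arbitrary choices rather than the sample-based bins used in the experiments.

```python
import math


def gauss_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2); math.erf handles x = +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def gmm_cdf(x, gmm):
    """CDF of a mixture given as a list of (weight, mean, std)."""
    return sum(w * gauss_cdf(x, mu, s) for w, mu, s in gmm)


def cgqlb(gmm_a, gmm_b, cuts):
    """TV between the two mixtures after lumping the real line into bins.

    `cuts` is a sorted list of finite breakpoints; the bins are
    (-inf, c1], (c1, c2], ..., (cn, +inf).
    """
    edges = [-math.inf] + list(cuts) + [math.inf]
    total = 0.0
    for lo, hi in zip(edges, edges[1:]):
        mass_a = gmm_cdf(hi, gmm_a) - gmm_cdf(lo, gmm_a)
        mass_b = gmm_cdf(hi, gmm_b) - gmm_cdf(lo, gmm_b)
        total += abs(mass_a - mass_b)
    return 0.5 * total


# Hypothetical example mixtures.
gmm_a = [(0.5, -1.0, 1.0), (0.5, 2.0, 0.5)]
gmm_b = [(0.7, 0.0, 1.5), (0.3, 3.0, 1.0)]

# Nested partitions: every coarser cut set is a subset of the finer one.
coarse = [0.0]
medium = [-2.0, 0.0, 2.0]
fine = [0.5 * i for i in range(-12, 13)]  # -6.0, -5.5, ..., 6.0

bounds = [cgqlb(gmm_a, gmm_b, cuts) for cuts in (coarse, medium, fine)]
print(bounds)
```

Because each cut set refines the previous one, the printed values are non-decreasing, and each is a guaranteed lower bound on the TV between the two mixtures.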

We now present a simple upper bound for a very specific case of mixtures. Consider mixtures sharing the same prescribed components (i.e., only the weights may differ). For example, this scenario occurs when we jointly learn a set of mixtures from several datasets [14]. Then it follows from the triangle inequality that

$$\begin{aligned}
\mathrm{TV}(m,m') &= \frac{1}{2}\int\left|\sum_{i=1}^{k}(w_i-w'_i)\,p_i(x)\right|\mathrm{d}\mu(x)\\
&\le \frac{1}{2}\sum_{i=1}^{k}|w_i-w'_i|\int p_i(x)\,\mathrm{d}\mu(x)\\
&= \frac{1}{2}\sum_{i=1}^{k}|w_i-w'_i| \le 1.
\end{aligned}\tag{4}$$

We may always consider mixtures $m$ and $m'$ sharing the same prescribed components (by allowing some weights to be zero). In that case, let $w$ and $w'$ denote the weight distributions over the common components. From the above derivations we get $\mathrm{TV}(m,m')\le\mathrm{TV}(w,w')$. However, when mixtures do not share components, we end up with the trivial upper bound of $1$ since in that case $\mathrm{TV}(w,w')=1$. The upper bound in eq. (4) can be easily extended to mixtures of positive measures (with weight vectors not necessarily normalized to one).
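As a quick check of the shared-component bound, the sketch below compares half the $\ell_1$ distance between two weight vectors with a numerically integrated TV. The components, weights, integration range and function names are all hypothetical values chosen for the illustration.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def shared_component_upper_bound(w, w_prime):
    """Half the l1 distance between the weight vectors, i.e. TV(w, w')."""
    return 0.5 * sum(abs(a - b) for a, b in zip(w, w_prime))


def numeric_tv_shared(components, w, w_prime, lo=-20.0, hi=20.0, steps=100_000):
    """Midpoint-rule TV between two mixtures built on the same components."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        diff = sum((a - b) * gaussian_pdf(x, mu, s)
                   for a, b, (mu, s) in zip(w, w_prime, components))
        total += abs(diff)
    return 0.5 * h * total


# Hypothetical shared components (mean, std) and two weight vectors.
components = [(-2.0, 1.0), (0.0, 1.0), (3.0, 0.5)]
w = [0.5, 0.3, 0.2]
w_prime = [0.2, 0.3, 0.5]

upper = shared_component_upper_bound(w, w_prime)
tv = numeric_tv_shared(components, w, w_prime)
print(tv, upper)
```

The gap between the two numbers comes from the overlap of the components whose weights differ: the less they overlap, the closer the TV gets to the weight-based bound.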

## 3 TV bounds via geometric envelopes

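Consider two statistical mixtures $m(x)=\sum_{i=1}^{k}w_ip_i(x)$ and $m'(x)=\sum_{i=1}^{k'}w'_ip'_i(x)$. Let us bound $\mathrm{TV}(m,m')$ following the computational geometric technique introduced in [13, 15] as follows

$$p_{l(x)}(x) \le m(x) \le p_{u(x)}(x),\qquad p'_{l'(x)}(x) \le m'(x) \le p'_{u'(x)}(x),$$

where $l(x)$ and $u(x)$ (resp. $l'(x)$ and $u'(x)$) denote the indices of the component of mixture $m$ (resp. $m'$) that is the lowest and the highest at position $x$, respectively. The sequences of $l(x)$, $u(x)$, $l'(x)$ and $u'(x)$ are piecewise constant (integer-valued) when $x$ sweeps through $\mathcal{X}$, and can be computed from the lower and upper geometric envelopes of the mixture component probability distributions [13, 15]. It follows that

$$\min\left(p_{l(x)}(x),p'_{l'(x)}(x)\right) \le \min(m(x),m'(x)) \le \max\left(p_{u(x)}(x),p'_{u'(x)}(x)\right).\tag{5}$$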

We partition the support $\mathcal{X}$ into $\ell$ elementary intervals $I_1,\dots,I_\ell$ (with $\mathcal{X}=\uplus_{s=1}^{\ell}I_s$). Observe that on each interval, the indices $l$, $u$, $l'$ and $u'$ are all constant. We have

$$h(m,m') = \sum_{s=1}^{\ell}\int_{I_s}\min(m(x),m'(x))\,\mathrm{d}\mu(x),$$

and we can use the lower/upper bounds of eq. (5) to bound $h(m,m')$. For a given interval $I_s$, we calculate

$$L_s(m,m') = \int_{I_s}\min\left(p_{l(x)}(x),p'_{l'(x)}(x)\right)\mathrm{d}\mu(x),\qquad U_s(m,m') = \int_{I_s}\max\left(p_{u(x)}(x),p'_{u'(x)}(x)\right)\mathrm{d}\mu(x),$$

in constant time using the cumulative distribution functions (CDFs) $\Phi_i$'s and $\Phi'_i$'s of the mixture components. Indeed, the probability mass inside an interval $I=(a,b)$ of a component (with CDF $\Phi$) is simply expressed as the difference between two CDF terms, $\Phi(b)-\Phi(a)$. Let $A(m,m')=\sum_{s=1}^{\ell}L_s(m,m')$ and $B(m,m')=\sum_{s=1}^{\ell}U_s(m,m')$. It follows that

$$A(m,m') \le h(m,m') \le B(m,m').$$

Notice that the above derivation applies to $H(m,m')$ as well, and therefore

$$A(m,m') \le H(m,m') \le B(m,m').$$

Thus we obtain the following lower and upper bounds of the TV:

$$L(m,m') := \max\{1-B(m,m'),\,A(m,m')-1\},\qquad U(m,m') := \min\{1-A(m,m'),\,B(m,m')-1\}.\tag{6}$$
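The envelope quantities $A$ and $B$ of eq. (6) can be sketched for Gaussian components as follows. This is an illustrative reading of the inequalities as stated above (with unweighted component densities), not the authors' implementation: breakpoints are taken at all pairwise crossings of all components, so that the ordering of the densities is constant on each elementary interval, and the final clipping to $[0,1]$ is an added convenience. The function names, the tolerance `1e-12` and the example mixtures are assumptions.

```python
import math


def log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2) up to a shared additive constant."""
    return -math.log(sigma) - 0.5 * ((x - mu) / sigma) ** 2


def cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2); math.erf handles x = +/- infinity."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def crossings(c1, c2):
    """Points where two Gaussian densities (mu, sigma) are equal."""
    (mu1, s1), (mu2, s2) = c1, c2
    a = 1.0 / s1 ** 2 - 1.0 / s2 ** 2
    b = 2.0 * (mu2 / s2 ** 2 - mu1 / s1 ** 2)
    c = (mu1 / s1) ** 2 - (mu2 / s2) ** 2 + 2.0 * math.log(s1 / s2)
    if abs(a) < 1e-12:  # (near-)equal variances: at most one crossing
        return [] if abs(b) < 1e-12 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:  # guard against rounding
        return []
    root = math.sqrt(disc)
    return [(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)]


def envelope_masses(components):
    """Integrals A and B of the lower and upper envelopes of the densities."""
    cuts = sorted({x for i, c1 in enumerate(components)
                   for c2 in components[i + 1:] for x in crossings(c1, c2)})
    edges = [-math.inf] + cuts + [math.inf]
    A = B = 0.0
    for lo, hi in zip(edges, edges[1:]):
        # Probe one interior point; the ordering is constant on (lo, hi).
        if math.isinf(lo) and math.isinf(hi):
            probe = 0.0
        elif math.isinf(lo):
            probe = hi - 1.0
        elif math.isinf(hi):
            probe = lo + 1.0
        else:
            probe = 0.5 * (lo + hi)
        lowest = min(components, key=lambda comp: log_pdf(probe, *comp))
        highest = max(components, key=lambda comp: log_pdf(probe, *comp))
        A += cdf(hi, *lowest) - cdf(lo, *lowest)
        B += cdf(hi, *highest) - cdf(lo, *highest)
    return A, B


def tv_envelope_bounds(gmm, gmm_prime):
    """Bounds L and U of eq. (6), clipped to [0, 1] (clipping is an addition)."""
    components = [(mu, s) for _, mu, s in gmm + gmm_prime]
    A, B = envelope_masses(components)
    return max(1.0 - B, A - 1.0, 0.0), min(1.0 - A, B - 1.0, 1.0)


# Hypothetical mixtures as lists of (weight, mean, std).
gmm = [(0.5, 0.0, 1.0), (0.5, 1.0, 1.2)]
gmm_prime = [(0.6, 0.2, 1.1), (0.4, 1.5, 0.9)]
lower, upper = tv_envelope_bounds(gmm, gmm_prime)
print(lower, upper)
```

Under this unweighted reading, $A\le 1\le B$ always holds, so only the upper bound can be informative in the sketch; the lower bound clips to zero.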

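To further improve the bounds for exponential family components, we choose for each elementary interval $I_s$ a reference measure $r_s(x)$, which can simply be set to the upper envelope over $I_s$. Then we can bound the density ratio

$$\frac{p_i(x)}{r_s(x)} = \exp\left((\theta_i-\theta_s)^\top t(x)\right) \in [A_i^s,B_i^s]$$

for any $p_i$ in the same exponential family. Notice that $t(x)$ is usually a vector of monomials, so that $(\theta_i-\theta_s)^\top t(x)$ is a polynomial function whose bounds can be computed straightforwardly for any given interval $I_s$. Therefore

$$\frac{m(x)-m'(x)}{r_s(x)} = \sum_{i=1}^{k}w_i\frac{p_i(x)}{r_s(x)} - \sum_{i=1}^{k'}w'_i\frac{p'_i(x)}{r_s(x)}$$

must lie in the range $[L_s,U_s]$, where

$$L_s = \sum_{i=1}^{k}w_iA_i^s - \sum_{i=1}^{k'}w'_i(B_i^s)',\qquad U_s = \sum_{i=1}^{k}w_iB_i^s - \sum_{i=1}^{k'}w'_i(A_i^s)',$$

and $(A_i^s)'$ and $(B_i^s)'$ denote the corresponding ratio bounds for the components $p'_i$ of $m'$. Correspondingly, $|m(x)-m'(x)|/r_s(x)\in[\mu_s,\Omega_s]$. If $L_s\le 0\le U_s$, then $\mu_s=0$, otherwise $\mu_s=\min(|L_s|,|U_s|)$. In both cases, $\Omega_s=\max(|L_s|,|U_s|)$. Hence, we get

$$\mu_s\int_{I_s}r_s(x)\,\mathrm{d}\mu(x) \le \int_{I_s}|m(x)-m'(x)|\,\mathrm{d}\mu(x) \le \Omega_s\int_{I_s}r_s(x)\,\mathrm{d}\mu(x),$$

and

$$\frac{1}{2}\sum_{s=1}^{\ell}\mu_s\int_{I_s}r_s(x)\,\mathrm{d}\mu(x) \le \mathrm{TV}(m,m') \le \frac{1}{2}\sum_{s=1}^{\ell}\Omega_s\int_{I_s}r_s(x)\,\mathrm{d}\mu(x).$$

We call these bounds CELB/CEUB for combinatorial envelope lower/upper bounds.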

###### Theorem 2.

The total variation distance between univariate Gaussian mixtures can be deterministically approximated in -time, where denotes the total number of mixture components.

The following section describes experimental results that highlight the tightness of these bounds.

## 4 Experiments

We assess the proposed TV bounds based on the following univariate GMM models , , and , which were used in . We split each elementary interval into pieces of equal size so as to improve the bound quality. For MC (Monte Carlo) and CGQLB, we sample from both $m$ and $m'$ and combine these sample sets. Fig. (1) shows the envelopes of these GMMs and the corresponding TV. In the rightmost figure, the $x$-axis is the sample size for MC and CGQLB, and the $y$-axis is the TV value. The 95% confidence interval is visualized for MC. We can see that the proposed combinatorial bounds are quite tight and enclose the true TV value. Notably, the CGQLB is even tighter if the sample size is large enough. Given the same sample size, the number of density evaluations (computing $m(x)$ and $m'(x)$ once) is the same for MC and CGQLB. Therefore, instead of doing MC, one should prefer CGQLB, which provides a deterministic bound. The experiments are further carried out on Gamma and Rayleigh mixtures, see Fig. (1).

In a second set of experiments, we generate random GMMs where the means are taken from the standard Gaussian distribution (dataset 1) or with its standard deviation increased to 5 (dataset 2). In both cases, the precision is sampled from . All components have equal weights. Fig. (2) shows mean ± standard deviation for CELB/CEUB/CGQLB against the number of mixture components $k$. TV (relative) shows the relative value of the bounds, which is the ratio between the bound and the “true” TV estimated using MC samples. CGQLB is implemented with 100 random samples drawn from a mixture of $m$ and $m'$ with equal weights. The Pinsker upper bound is based on a “true” KL estimated by MC samples (strictly speaking, the Pinsker bound estimated in this way is not a deterministic bound, unlike our proposed bounds). We perform 100 independent runs for each $k$. TV decreases as $k$ increases because $m$ and $m'$ are more mixed. We see that CELB and CEUB provide relatively tight bounds as compared to the Pinsker bound, and are well enclosed by the trivial bounds $[0,1]$. The quality of CGQLB is remarkably impressive: based on the yellow lines in the right figures, using merely 100 random samples we can get a lower bound which is very close to the true value of the TV.

Figure 1: Mixture models (from top to bottom: Gaussian, Gamma and Rayleigh) and their upper (red) and lower (blue) envelopes. The rightmost figure shows their TV computed by ➀ MC estimation (black error bars); ➁ the proposed guaranteed combinatorial bounds (blue and red lines); ➂ the coarse-grained quantized lower bound (green line).

(a) Random Dataset 1

## 5 Conclusion and discussion

We described novel deterministic lower and upper bounds on the total variation distance between univariate mixtures, and demonstrated their effectiveness for Gaussian, Gamma and Rayleigh mixtures. This task is all the more challenging since the TV value falls in the range $[0,1]$, and the designed bounds should improve over these naive bounds. A first proposed approach relies on the information monotonicity of the TV to design a lower bound (or a series of nested lower bounds), and can be extended to arbitrary $f$-divergences. A second set of techniques uses tools of computational geometry to compute the upper and lower geometric envelopes of the weighted component univariate distributions, and retrieves from these decompositions both Combinatorial Envelope Lower and Upper Bounds (CELB/CEUB). All those methods certify deterministic bounds, and are therefore recommended over the traditional Monte Carlo stochastic approximations that have no deterministic guarantee (although they are asymptotically consistent).

Finally, let us discuss the role of generalized TV distances in $f$-divergences: $f$-divergences are separable statistical divergences which admit the following integral-based representation [16, 11, 17, 18]:

$$\begin{aligned}
I^*_f(p:q) &= \int_{\mathcal{X}} q(x)\,f\!\left(\frac{p(x)}{q(x)}\right)\mathrm{d}\mu(x),\\
I^*_f(p:q) &= \int_0^1 \frac{1}{u^3}\,f''\!\left(\frac{1-u}{u}\right)\mathrm{TV}_u(p:q)\,\mathrm{d}u,\\
\mathrm{TV}_u(p:q) &:= I^*_{f_u}(p:q),\\
f_u(t) &:= \min\{u,1-u\}-\min\{1-u,ut\}.
\end{aligned}$$

Here, we have for , see . The $\mathrm{TV}_u$ are generalized (bounded) total variational distances, and our deterministic bounds can be extended to these $\mathrm{TV}_u$'s. However, note that $I^*_f$ may be infinite (unbounded) when the integral diverges.

Code for reproducible research is available at https://franknielsen.github.io/BoundsTV/index.html

## References

• [1] Sheldon Ross, A first course in probability, Pearson, 2014.
• [2] Imre Csiszar and Paul C. Shields, “Notes on information theory and statistics,” Foundations and Trends in Communications and Information Theory, vol. 30, pp. 42, 2004.
• [3] Frank Nielsen, “Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means,” Pattern Recognition Letters, vol. 42, pp. 25–34, 2014.
• [4] Michael J Swain and Dana H Ballard, “Color indexing,” International Journal of Computer Vision, vol. 7, no. 1, pp. 11–32, 1991.
• [5] Frank Nielsen and Vincent Garcia, “Statistical exponential families: A digest with flash cards,” arXiv preprint arXiv:0911.4863, 2009.
• [6] C. Améndola, A. Engström, and C. Haase, “Maximum Number of Modes of Gaussian Mixtures,” ArXiv e-prints, Feb. 2017.
• [7] Mohammadali Khosravifard, Dariush Fooladivanda, and T Aaron Gulliver, “Confliction of the convexity and metric properties in $f$-divergences,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. 90, no. 9, pp. 1848–1853, 2007.
• [8] Shun-ichi Amari, Information Geometry and Its Applications, Applied Mathematical Sciences. Springer Japan, 2016.
• [9] John R Hershey and Peder A Olsen, “Approximating the Kullback-Leibler divergence between Gaussian mixture models,” in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on. IEEE, 2007, vol. 4, pp. IV–317.
• [10] J-L Durrieu, J-Ph Thiran, and Finnian Kelly, “Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models,” in Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on. IEEE, 2012, pp. 4833–4836.
• [11] Mark D Reid and Robert C Williamson, “Generalised Pinsker inequalities,” arXiv preprint arXiv:0906.1244, 2009.
• [12] Igal Sason, “On reverse Pinsker inequalities,” arXiv preprint arXiv:1503.07118, 2015.
• [13] Frank Nielsen and Ke Sun, “Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities,” Entropy, vol. 18, no. 12, pp. 442, 2016.
• [14] Olivier Schwander, Stéphane Marchand-Maillet, and Frank Nielsen, “Comix: Joint estimation and lightspeed comparison of mixture models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 2449–2453.
• [15] Frank Nielsen and Ke Sun, “Combinatorial bounds on the $\alpha$-divergence of univariate mixture models,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017, pp. 4476–4480.
• [16] Friedrich Liese and Igor Vajda, “On divergences and informations in statistics and information theory,” IEEE Transactions on Information Theory, vol. 52, no. 10, pp. 4394–4412, 2006.
• [17] Mark D Reid and Robert C Williamson, “Information, divergence and risk for binary experiments,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 731–817, 2011.
• [18] Igal Sason and Sergio Verdú, “$f$-divergence inequalities,” IEEE Transactions on Information Theory, vol. 62, no. 11, pp. 5973–6006, 2016.