Strongly Convex Divergences

We consider a sub-class of the f-divergences satisfying a stronger convexity property, which we refer to as strongly convex, or κ-convex divergences. We derive new and old relationships, based on convexity arguments, between popular f-divergences.

1 Introduction

The concept of an f-divergence, introduced independently by Ali-Silvey [1] and Csiszár [6], unifies several important information measures between probability distributions, as integrals of a convex function f composed with the Radon-Nikodym derivative of the two probability distributions. For a convex function f : (0, ∞) → ℝ such that f(1) = 0, and measures P and Q such that P ≪ Q, the f-divergence from P to Q is given by D_f(P||Q) := ∫ f(dP/dQ) dQ. The canonical example of an f-divergence, realized by taking f(x) = x log x, is the relative entropy (often called the KL-divergence), and f-divergences inherit many properties enjoyed by this special case; non-negativity, joint convexity of arguments, and a data processing inequality. Other important examples include the total variation, the χ²-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [17] for more background.
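For concreteness, the following small sketch (our own illustration, not from the original text; the discrete setting and the particular choices of f are assumptions) evaluates a few of these divergences between two distributions on a finite set via D_f(P||Q) = Σ_i q_i f(p_i/q_i).

    import numpy as np

    # A minimal sketch: f-divergences between strictly positive discrete distributions,
    # computed as D_f(P||Q) = sum_i q_i * f(p_i / q_i).

    def f_divergence(p, q, f):
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(q * f(p / q)))

    f_kl        = lambda x: x * np.log(x)            # relative entropy (KL)
    f_tv        = lambda x: np.abs(x - 1) / 2        # total variation
    f_chi2      = lambda x: (x - 1) ** 2             # Pearson's chi-squared
    f_hellinger = lambda x: (np.sqrt(x) - 1) ** 2    # squared Hellinger

    p = [0.2, 0.5, 0.3]
    q = [0.4, 0.4, 0.2]
    for name, f in [("KL", f_kl), ("TV", f_tv), ("chi2", f_chi2), ("Hellinger^2", f_hellinger)]:
        print(name, f_divergence(p, q, f))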

We will be interested in how stronger convexity properties of f give improvements of classical f-divergence inequalities. This is in part inspired by the work of Sason [18], who demonstrated that divergences that are (as we define later) “κ-convex” satisfy stronger-than-usual data-processing inequalities.

Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when f is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study D_f(P||Q) in the presence of a bounded Radon-Nikodym derivative dP/dQ. One naturally obtains such bounds for skew divergences, that is, divergences of the form (P, Q) ↦ D_f((1−t)P + tQ || (1−s)P + sQ) for t, s ∈ [0, 1], as in this case the relevant Radon-Nikodym derivative satisfies d((1−t)P + tQ)/d((1−s)P + sQ) ≤ max{(1−t)/(1−s), t/s}. Important examples of skew divergences include the skew divergence [10] based on the relative entropy and the Vincze-Le Cam divergence [22, 9], called the triangular discrimination in [21], and its generalization due to Györfi and Vajda [8] based on the χ²-divergence. The Jensen-Shannon divergence [11] and its recent generalization [15] give examples of f-divergences realized as linear combinations of skewed divergences.

Let us outline the paper. In Section 2 we derive elementary results on κ-convex divergences and give a table of examples of κ-convex divergences. We demonstrate that κ-convex divergences can be lower bounded by the χ²-divergence, and that the joint convexity of the map (p, q) ↦ q f(p/q) can be sharpened under κ-convexity conditions on f. As a consequence we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average f-divergence from the set to the barycenter.

In Section 3 we investigate general skewing of f-divergences. In particular we introduce the skew-symmetrization of an f-divergence, which recovers the Jensen-Shannon divergence and the Vincze-Le Cam divergence as special cases. We also show that a scaling of the Vincze-Le Cam divergence is minimal among skew-symmetrizations of κ-convex divergences on (0, 2]. We then consider linear combinations of skew divergences, and show that a generalized Vincze-Le Cam divergence (based on skewing the χ²-divergence) can be upper bounded by the generalized Jensen-Shannon divergence introduced recently by Nielsen [15] (based on skewing the relative entropy), reversing the obvious bound that can be obtained from the classical inequality D(P||Q) ≤ χ²(P||Q). We also derive upper and lower total variation bounds for Nielsen’s generalized Jensen-Shannon divergence.

In Section 4 we consider a family of densities {p_θ}_{θ∈Θ} weighted by a probability measure λ, and a density q. We use the Bayes estimator (that is, the Bayes estimator associated to the 0-1 loss function) to derive a convex decomposition of the barycenter ∫ p_θ dλ(θ) and of q, each into two auxiliary densities. We use this decomposition to sharpen, for κ-convex divergences, an elegant theorem of Guntuboyina [7] that generalizes Fano and Pinsker’s inequality to f-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina’s inequality gives a new sharpening of Pinsker’s inequality in terms of the convex decomposition induced by the Bayes estimator.

Notation

We consider Borel probability measures P and Q on a Polish space X. For a convex function f : (0, ∞) → ℝ such that f(1) = 0, define the f-divergence from P to Q, via densities p for P and q for Q with respect to a common reference measure μ, as

 D_f(P||Q) = ∫_X f(p/q) q dμ   (1)
           = ∫_{pq>0} q f(p/q) dμ + f(0) Q({p = 0}) + f*(0) P({q = 0}).   (2)

We note that this representation is independent of the choice of reference measure μ, and such a reference measure always exists; take μ = P + Q, for example.

For t, s ∈ [0, 1], define

 D_f(t||s) := s f(t/s) + (1−s) f((1−t)/(1−s))   (3)

with the conventions f(0) := lim_{t→0+} f(t), 0 f(0/0) := 0, and 0 f(a/0) := a lim_{t→∞} f(t)/t for a > 0. For a random variable X and a set A we denote the probability that X takes a value in A by P(X ∈ A), the expectation of the random variable by E(X), and the variance by Var(X) := E(X − E(X))². For probability measures P and Q such that Q(A) = 0 implies P(A) = 0 for every Borel A, we write P ≪ Q, and when there exists a probability density function p such that P(A) = ∫_A p dμ for a reference measure μ, we write p = dP/dμ. For a probability measure μ on Θ and a μ-integrable function g, we write E_μ g := ∫_Θ g dμ.

2 Strongly convex divergences

Definition 2.1.

A real-valued function f on a convex set K ⊆ ℝ is κ-convex when x, y ∈ K and t ∈ [0, 1] imply

 f((1−t)x + ty) ≤ (1−t) f(x) + t f(y) − κ t(1−t)(x − y)²/2.   (4)

For example, when f is twice differentiable, (4) is equivalent to f''(x) ≥ κ for x ∈ K. Note that the case κ = 0 is just usual convexity.
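As a worked instance of this criterion (added here for illustration; the interval (0, b] is an arbitrary choice): for the relative entropy, f(x) = x log x has

 f''(x) = 1/x ≥ 1/b on (0, b],

so the relative entropy is (1/b)-convex on (0, b]; for Pearson's χ², f(x) = (x − 1)² has f''(x) ≡ 2, so that divergence is 2-convex on all of (0, ∞).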

Proposition 2.2.

For κ ≥ 0 and f : K → ℝ defined on an interval K ⊆ ℝ, the following are equivalent:

1. f is κ-convex.

2. The function g(x) := f(x) − κx²/2 is convex on K.

3. The right handed derivative, defined as f'_+(t) := lim_{h↓0} (f(t + h) − f(t))/h, satisfies

 f'_+(t) ≥ f'_+(s) + κ(t − s)

for t ≥ s.

Proof.

Observe that it is enough to prove the result when κ = 0, where the proposition reduces to the classical result for convex functions. ∎

Definition 2.3.

An f-divergence D_f is κ-convex on an interval K for κ > 0 when the function f is κ-convex on K.

The table below lists some κ-convex f-divergences of interest to this article.

Divergence                  f(x)                                    f''(x)
relative entropy (KL)       x log x                                 1/x
total variation             |x − 1|/2                               0
Pearson's χ²                (x − 1)²                                2
squared Hellinger           (√x − 1)²                               (1/2) x^{−3/2}
reverse relative entropy    −log x                                  1/x²
Vincze-Le Cam               (x − 1)²/(x + 1)                        8/(x + 1)³
Jensen-Shannon              (x log x − (1 + x) log((1 + x)/2))/2    1/(2x(1 + x))
Neyman's χ²                 (x − 1)²/x                              2/x³
Sason's divergence          (see [18])                              (see [18])
α-divergence                (see the remark below)                  (see the remark below)

On any sub-interval of (0, ∞), each divergence above is κ-convex with κ equal to the infimum of f''(x) over that sub-interval.

Observe that we have taken the normalization convention on the total variation, which we denote by |P − Q|_TV, such that |P − Q|_TV = sup_A |P(A) − Q(A)| ≤ 1, corresponding to the choice f(x) = |x − 1|/2 above. Also note that the α-divergence interpolates between several of these examples: it recovers Pearson's χ²-divergence, one half of Neyman's χ²-divergence, and the squared Hellinger divergence at particular values of α, and has as limiting cases the relative entropy and the reverse relative entropy. If f is κ-convex on an interval [a, b] ⊂ (0, ∞), then its dual divergence, given by f*(x) := x f(1/x), is strongly convex on [1/b, 1/a]. Recall that f* satisfies the equality D_{f*}(P||Q) = D_f(Q||P). For brevity, we will use χ²-divergence to refer to the Pearson χ²-divergence, and will articulate Neyman's χ²-divergence explicitly when necessary.
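The small script below (our own sketch; the grid check, the tolerance, and the choice f(x) = x log x are ours) numerically tests the defining inequality (4) on a grid, a quick way to sanity-check a claimed κ on a chosen interval.

    import numpy as np

    # Sketch: numerically test kappa-convexity (inequality (4)) of f on [a, b]
    # by checking the inequality on a grid of pairs (x, y) and weights t.

    def is_kappa_convex(f, kappa, a, b, n=120):
        xs = np.linspace(a, b, n)
        ts = np.linspace(0.0, 1.0, 21)
        for x in xs:
            for y in xs:
                for t in ts:
                    lhs = f((1 - t) * x + t * y)
                    rhs = (1 - t) * f(x) + t * f(y) - kappa * t * (1 - t) * (x - y) ** 2 / 2
                    if lhs > rhs + 1e-9:
                        return False
        return True

    f_kl = lambda x: x * np.log(x)
    print(is_kappa_convex(f_kl, kappa=0.5, a=1e-3, b=2.0))   # True: f''(x) = 1/x >= 1/2 on (0, 2]
    print(is_kappa_convex(f_kl, kappa=1.0, a=1e-3, b=2.0))   # False: 1-convexity fails for x > 1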

The next lemma is a restatement of Jensen’s inequality.

Lemma 2.4.

If f is κ-convex on the range of a random variable X,

 E f(X) ≥ f(E(X)) + (κ/2) Var(X).
Proof.

Apply Jensen's inequality to the convex function g(x) := f(x) − κx²/2. ∎

For a convex function f such that f(1) = 0 and c ∈ ℝ, the function f̃(x) := f(x) + c(x − 1) remains a convex function, and what is more satisfies

 D_f(P||Q) = D_{f̃}(P||Q)

since ∫_X (dP/dQ − 1) dQ = 0.

Definition 2.5 (χ²-divergence).

For f(x) = (x − 1)², we write

 χ²(P||Q) := D_f(P||Q).

The following result shows that every strongly convex divergence can be lower bounded, up to its convexity constant κ, by the χ²-divergence.

Theorem 2.1.

For a κ-convex function f,

 D_f(P||Q) ≥ (κ/2) χ²(P||Q).
Proof.

Define f̃(x) := f(x) − f'_+(1)(x − 1), and note that f̃ defines the same κ-convex divergence as f. So we may assume without loss of generality that f'_+(1) = 0, so that f is uniquely zero when x = 1. Since f is κ-convex, g(x) := f(x) − κ(x − 1)²/2 is convex, and by the normalization g'_+(1) = 0 as well. Thus g takes its minimum when x = 1 and hence g ≥ 0, so that f(x) ≥ κ(x − 1)²/2. Computing,

 D_f(P||Q) = ∫ f(dP/dQ) dQ ≥ (κ/2) ∫ (dP/dQ − 1)² dQ = (κ/2) χ²(P||Q).

The above proof uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [19], where it appears as Theorem 1, and was used to give sharp comparisons in several f-divergence inequalities.

Theorem 2.2 (Sason-Verdú [19]).

For divergences D_g and D_f defined by functions g and f such that g(t) ≤ c f(t) for all t ∈ (0, ∞), then

 D_g(P||Q) ≤ c D_f(P||Q).

Moreover, if f(t) > 0 for t ≠ 1, then

 sup_{P≠Q} D_g(P||Q)/D_f(P||Q) = sup_{t≠1} g(t)/f(t).
Corollary 2.6.

For a smooth κ-convex divergence D_f, the inequality

 D_f(P||Q) ≥ (κ/2) χ²(P||Q)   (5)

is sharp multiplicatively in the sense that

 inf_{P≠Q} D_f(P||Q)/χ²(P||Q) = κ/2   (6)

if f''(1) = κ.

Proof.

Without loss of generality we assume that f'(1) = 0. If f''(1) = κ, then taking g(t) := (t − 1)² and applying Theorem 2.2 and Theorem 2.1,

 sup_{P≠Q} D_g(P||Q)/D_f(P||Q) = sup_{t≠1} g(t)/f(t) ≤ 2/κ.   (7)

Observe that after two applications of L'Hospital's rule,

 lim_{ε→0} g(1+ε)/f(1+ε) = lim_{ε→0} g'(1+ε)/f'(1+ε) = g''(1)/f''(1) = 2/κ ≤ sup_{t≠1} g(t)/f(t).

Thus (6) follows. ∎
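As a numerical illustration of Theorem 2.1 and Corollary 2.6 (a sketch of ours; the choice of the squared Hellinger divergence and of the particular perturbation is arbitrary), the ratio D_f/χ² stays above κ/2 and approaches f''(1)/2 = 1/4 as P approaches Q:

    import numpy as np

    # Sketch: check D_f >= (kappa/2) * chi^2 (Theorem 2.1) for the squared Hellinger
    # divergence f(x) = (sqrt(x) - 1)^2, whose second derivative is x**(-1.5)/2, and
    # watch the ratio D_f / chi^2 approach f''(1)/2 = 1/4 as P -> Q (Corollary 2.6).

    def f_hellinger(x):
        return (np.sqrt(x) - 1.0) ** 2

    def f_divergence(p, q, f):
        return float(np.sum(q * f(p / q)))

    def chi2(p, q):
        return float(np.sum((p - q) ** 2 / q))

    q = np.array([0.25, 0.25, 0.25, 0.25])
    direction = np.array([0.1, -0.05, 0.05, -0.1])   # sums to zero, keeps p a distribution
    for eps in [1.0, 0.5, 0.1, 0.01]:
        p = q + eps * direction
        b = float(np.max(p / q))          # likelihood ratios lie in (0, b]
        kappa = 0.5 * b ** (-1.5)         # f is kappa-convex on (0, b]
        Df, c2 = f_divergence(p, q, f_hellinger), chi2(p, q)
        assert Df >= 0.5 * kappa * c2 - 1e-12
        print(eps, Df / c2, 0.5 * kappa)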

Proposition 2.7.

When D_f is an f-divergence such that f is κ-convex on [a, b], {P_θ}_{θ∈Θ} and {Q_θ}_{θ∈Θ} are probability measures indexed by a set Θ such that dP_θ/dQ_θ ∈ [a, b] holds for all θ, and P = ∫_Θ P_θ dμ(θ) and Q = ∫_Θ Q_θ dμ(θ) for a probability measure μ on Θ, then

 D_f(P||Q) ≤ ∫_Θ D_f(P_θ||Q_θ) dμ(θ) − (κ/2) ∫_Θ ∫_X ( dP_θ/dQ_θ − dP/dQ )² dQ_θ dμ(θ).   (8)

In particular, when Q_θ = Q for all θ,

 D_f(P||Q)   (9)
 ≤ ∫_Θ D_f(P_θ||Q) dμ(θ) − (κ/2) ∫_Θ ∫_X ( dP_θ/dQ − dP/dQ )² dQ dμ(θ)   (10)
 ≤ ∫_Θ D_f(P_θ||Q) dμ(θ) − κ ∫_Θ |P_θ − P|²_TV dμ(θ).   (11)
Proof.

Let dθ denote a reference measure on Θ dominating μ, write dμ(θ) = m(θ) dθ, and set ν(θ, x) := (dQ_θ/dQ)(x) m(θ), abbreviated ν_θ below, so that ∫_Θ ν(θ, x) dθ = 1 for Q-almost every x; then write dP/dQ = ∫_Θ (dP_θ/dQ_θ) ν(θ, x) dθ.

 D_f(P||Q) = ∫_X f(dP/dQ) dQ   (12)
           = ∫_X f( ∫_Θ (dP_θ/dQ) dμ(θ) ) dQ   (13)
           = ∫_X f( ∫_Θ (dP_θ/dQ_θ) ν(θ, x) dθ ) dQ.   (14)

By Jensen’s inequality, as in Lemma 2.4

 f( ∫_Θ (dP_θ/dQ_θ) ν_θ dθ ) ≤ ∫_Θ f(dP_θ/dQ_θ) ν_θ dθ − (κ/2) ∫_Θ ( dP_θ/dQ_θ − ∫_Θ (dP_{θ'}/dQ_{θ'}) ν_{θ'} dθ' )² ν_θ dθ.

Integrating this inequality gives

 D_f(P||Q) ≤ ∫_X ( ∫_Θ f(dP_θ/dQ_θ) ν_θ dθ − (κ/2) ∫_Θ ( dP_θ/dQ_θ − ∫_Θ (dP_{θ₀}/dQ_{θ₀}) ν_{θ₀} dθ₀ )² ν_θ dθ ) dQ.   (15)

Note that

 ∫_X ∫_Θ ( dP_θ/dQ_θ − ∫_Θ (dP_{θ₀}/dQ_{θ₀}) ν_{θ₀} dθ₀ )² ν_θ dθ dQ = ∫_Θ ∫_X ( dP_θ/dQ_θ − dP/dQ )² dQ_θ dμ(θ),

and

 ∫_X ∫_Θ f(dP_θ/dQ_θ) ν(θ, x) dθ dQ = ∫_Θ ∫_X f(dP_θ/dQ_θ) ν(θ, x) dQ dθ   (16)
                                    = ∫_Θ ∫_X f(dP_θ/dQ_θ) dQ_θ dμ(θ)   (17)
                                    = ∫_Θ D_f(P_θ||Q_θ) dμ(θ).   (18)

Inserting these equalities into (15) gives the result.
To obtain the total variation bound one needs only to apply Jensen’s inequality,

 ∫_X (dP_θ/dQ − dP/dQ)² dQ ≥ ( ∫_X |dP_θ/dQ − dP/dQ| dQ )²   (19)
                            = 4 |P_θ − P|²_TV ≥ 2 |P_θ − P|²_TV.   (20)

Observe that taking Q_θ = Q = P := ∫_Θ P_θ dμ(θ) in Proposition 2.7, one obtains a lower bound for the average f-divergence from the set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,

 κ ∫_Θ |P_θ − P|²_TV dμ(θ) ≤ ∫_Θ D_f(P_θ||P) dμ(θ).   (21)
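A quick numerical check of (21) (our own sketch; the Dirichlet-sampled family, the weights, and the convention |P − Q|_TV = (1/2) Σ_i |p_i − q_i| are choices made here), using f(x) = (x − 1)², which is 2-convex on all of (0, ∞):

    import numpy as np

    # Sketch: verify kappa * avg TV^2 <= avg D_f (inequality (21)) for f(x) = (x-1)^2,
    # i.e. D_f = chi^2 and kappa = 2, with |P - Q|_TV = (1/2) * sum_i |p_i - q_i|.

    def chi2(p, q):
        return float(np.sum((p - q) ** 2 / q))

    def tv(p, q):
        return 0.5 * float(np.sum(np.abs(p - q)))

    rng = np.random.default_rng(1)
    family = rng.dirichlet(np.ones(5), size=4)    # four distributions P_theta on 5 points
    weights = np.array([0.4, 0.3, 0.2, 0.1])      # the mixing measure mu
    barycenter = weights @ family                 # P = sum_theta mu(theta) P_theta

    kappa = 2.0
    lhs = kappa * sum(w * tv(p, barycenter) ** 2 for w, p in zip(weights, family))
    rhs = sum(w * chi2(p, barycenter) for w, p in zip(weights, family))
    print(lhs, "<=", rhs)
    assert lhs <= rhs + 1e-12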

The next result shows that for f strongly convex, Pinsker type inequalities can never be reversed.

Proposition 2.8.

Given f strongly convex and M > 0, there exist probability measures P and Q such that

 D_f(P||Q) ≥ M |P − Q|_TV.   (22)
Proof.

By κ-convexity, and arguing as in the proof of Theorem 2.1 after subtracting an affine term that leaves the divergence unchanged, f(x) ≥ κ(x − 1)²/2, and hence D_f(P||Q) ≥ (κ/2) χ²(P||Q). Taking measures on the two point space {0, 1}, P = (p, 1 − p) and Q = (q, 1 − q), gives χ²(P||Q) = (p − q)²/(q(1 − q)), which tends to infinity as q → 0 with p fixed, while |P − Q|_TV = |p − q| remains bounded.

In fact, building on the work of [3, 12], Sason and Verdú proved in [19] that for any f-divergence, sup_{P≠Q} D_f(P||Q)/|P − Q|_TV is determined by f(0) and f*(0) = lim_{t→∞} f(t)/t. Thus, an f-divergence can be bounded above by a constant multiple of the total variation if and only if f(0) + f*(0) < ∞. From this perspective, Proposition 2.8 is simply the obvious fact that strongly convex functions have super linear (at least quadratic) growth at infinity.

3 Skew divergences

If we consider the quotient of the cone of convex functions f : (0, ∞) → ℝ such that f(1) = 0, under the equivalence relation f ∼ g when f(x) = g(x) + c(x − 1) for some c ∈ ℝ, then the map f ↦ D_f gives a linear isomorphism between this quotient space and the space of all f-divergences. The mapping defined by f ↦ f*, where we recall f*(x) := x f(1/x), gives an involution of this space. Indeed, f**(x) = x f*(1/x) = f(x), so that applying the map twice returns the original divergence, and D_{f*}(P||Q) = D_f(Q||P). Mathematically, skew divergences give an interpolation of this involution as

 (P, Q) ↦ D_f((1−t)P + tQ || (1−s)P + sQ)

gives D_f(P||Q) by taking t = 0 and s = 1, or yields D_{f*}(P||Q) = D_f(Q||P) by taking t = 1 and s = 0.

Moreover, as mentioned in the introduction, skewing imposes boundedness of the Radon-Nikodym derivative between the two skewed measures, which allows us to constrain the domain of f-divergences and leverage κ-convexity to obtain f-divergence inequalities in this section.

The following appears as Theorem III.1 in the preprint [14]. It states that skewing an f-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed f-divergences. A proof is given in the appendix for the convenience of the reader.

Theorem 3.1 (Melbourne et al [14]).

For t, s ∈ [0, 1] and D_f an f-divergence, skewing preserves the class of f-divergences, in the sense that

 S_f(P||Q) := D_f((1−t)P + tQ || (1−s)P + sQ)   (23)

is an f-divergence if D_f is.

Definition 3.1.

For D_f an f-divergence, its skew symmetrization,

 Δ_f(P||Q) := (1/2) D_f( P || (P + Q)/2 ) + (1/2) D_f( Q || (P + Q)/2 ),

is determined by the convex function

 x ↦ ((1 + x)/4) ( f(2x/(1 + x)) + f(2/(1 + x)) ).   (24)

Observe that Δ_f(P||Q) = Δ_f(Q||P), and that dP/d((P + Q)/2) ≤ 2 and dQ/d((P + Q)/2) ≤ 2 hold for all P and Q, so only the values of f on [0, 2] are relevant. When f(x) = x log x, the relative entropy's skew symmetrization is the Jensen-Shannon divergence. When f(x) = (x − 1)², up to a normalization constant the χ²-divergence's skew symmetrization is the Vincze-Le Cam divergence, which we state below for emphasis. See [21] for more background on this divergence, where it is referred to as the triangular discrimination.
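To make the Jensen-Shannon case explicit, the following elementary computation (included here for convenience) substitutes f(x) = x log x into (24):

 ((1 + x)/4) ( (2x/(1 + x)) log(2x/(1 + x)) + (2/(1 + x)) log(2/(1 + x)) )
 = (x/2) log(2x/(1 + x)) + (1/2) log(2/(1 + x))
 = (1/2) ( x log x − (1 + x) log((1 + x)/2) ),

which is exactly the convex function generating the Jensen-Shannon divergence listed in the table of Section 2.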

Definition 3.2.

When f(x) = (x − 1)²/(x + 1), denote the Vincze-Le Cam divergence by

 Δ(P||Q) := D_f(P||Q).

If one denotes the skew symmetrization of the χ²-divergence by Δ_{χ²}, one can compute easily from (24) that Δ_{χ²} = Δ/2. We note that although skewing preserves convexity, by the above example it does not preserve κ-convexity in general. The χ²-divergence is a 2-convex divergence, while the function f(x) = (x − 1)²/(x + 1) corresponding to the Vincze-Le Cam divergence satisfies f''(x) = 8/(1 + x)³, which cannot be bounded away from zero on (0, ∞).

Corollary 3.3.

For D_f an f-divergence such that f is κ-convex on (0, 2],

 Δ_f(P||Q) ≥ (κ/4) Δ(P||Q) = (κ/2) Δ_{χ²}(P||Q),   (25)

with equality when f corresponds to the χ²-divergence, where Δ_f denotes the skew symmetrized divergence associated to f and Δ is the Vincze-Le Cam divergence.

Proof.

Applying Proposition 2.7 with Θ = {1, 2}, μ uniform on Θ, P_1 = P, P_2 = Q, and Q_1 = Q_2 = (P + Q)/2,

 0 = D_f( (P+Q)/2 || (Q+P)/2 )
   ≤ (1/2) D_f( P || (Q+P)/2 ) + (1/2) D_f( Q || (Q+P)/2 ) − (κ/8) ∫ ( 2 dP/d(P+Q) − 2 dQ/d(P+Q) )² d((P+Q)/2)
   = Δ_f(P||Q) − (κ/4) Δ(P||Q).

When f(x) = x log x, we have f''(x) = 1/x ≥ 1/2 on (0, 2], which demonstrates that up to a constant the Jensen-Shannon divergence bounds the Vincze-Le Cam divergence. See [21] for an improvement of the inequality in the case of the Jensen-Shannon divergence, called the “capacitory discrimination” in the reference, by a factor of 2.
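Numerically (a small sketch of ours; the sampled distributions are arbitrary and logarithms are natural), one can confirm the resulting bound JSD(P||Q) ≥ (1/8) Δ(P||Q) on random discrete distributions:

    import numpy as np

    # Sketch: check Corollary 3.3 for f(x) = x log x, i.e. JSD(P||Q) >= (1/8) * Delta(P||Q),
    # where Delta is the Vincze-Le Cam divergence, on random discrete distributions.

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def jsd(p, q):
        m = (p + q) / 2
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def vincze_le_cam(p, q):
        return float(np.sum((p - q) ** 2 / (p + q)))

    rng = np.random.default_rng(2)
    for _ in range(1000):
        p, q = rng.dirichlet(np.ones(6)), rng.dirichlet(np.ones(6))
        assert jsd(p, q) >= vincze_le_cam(p, q) / 8 - 1e-12
    print("Corollary 3.3 held with kappa = 1/2 on all sampled pairs.")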

We will now investigate more general, non-symmetric skewing in what follows.

Proposition 3.4.

For α, β ∈ [0, 1], define

 C(α) := 1 − α if α ≤ β, and C(α) := α if α > β,   (26)

and

 S_{α,β}(P||Q) := D( (1−α)P + αQ || (1−β)P + βQ )   (27)

Then

 S_{α,β}(P||Q) ≤ C(α) D_∞(α||β) |P − Q|_TV,   (28)

where D_∞(α||β) := log max{ α/β, (1−α)/(1−β) } denotes the Rényi divergence of order ∞ between Bernoulli distributions with parameters α and β.

We will need the following lemma, originally proved by Audenaert in the quantum setting [2]. It is based on a differential relationship between the skew divergence [10] and the Györfi-Vajda divergence [8]; see [13, 16].

Lemma 3.5 (Theorem III.1 [14]).

For t ∈ (0, 1] and probability measures P and Q,

 D( P || tP + (1−t)Q ) ≤ −log t · |P − Q|_TV.   (29)
Proof of Proposition 3.4.

If α ≤ β, then C(α) = 1 − α and D_∞(α||β) = log((1−α)/(1−β)). Also,

 (1−β)P+βQ=t((1−α)P+αQ)+(1−t)Q (30)

with t = (1−β)/(1−α), thus

 S_{α,β}(P||Q) = D( (1−α)P + αQ || t((1−α)P + αQ) + (1−t)Q )   (31)
              ≤ −log t · |((1−α)P + αQ) − Q|_TV   (32)
              = C(α) D_∞(α||β) |P − Q|_TV,   (33)

where the inequality follows from Lemma 3.5. Following the same argument for α > β, so that C(α) = α, D_∞(α||β) = log(α/β), and

 (1−β)P+βQ=t((1−α)P+αQ)+(1−t)P (34)

for t = β/α completes the proof. Indeed,

 S_{α,β}(P||Q) = D( (1−α)P + αQ || t((1−α)P + αQ) + (1−t)P )   (35)
              ≤ −log t · |((1−α)P + αQ) − P|_TV   (36)
              = C(α) D_∞(α||β) |P − Q|_TV.   (37)

We recover the classical bound [11, 21] of the Jensen-Shannon divergence by the total variation.

Corollary 3.6.

For probability measures P and Q,

 JSD(P||Q) ≤ (log 2) |P − Q|_TV.   (38)
Proof.

Since JSD(P||Q) = (1/2) D( P || (P+Q)/2 ) + (1/2) D( Q || (P+Q)/2 ), applying Lemma 3.5 with t = 1/2 to each term gives the result.
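As a sanity check of Corollary 3.6 (our own snippet; we use the normalization |P − Q|_TV = (1/2) Σ_i |p_i − q_i| and natural logarithms):

    import numpy as np

    # Sketch: check JSD(P||Q) <= log(2) * |P - Q|_TV (Corollary 3.6) on random
    # discrete distributions, with |P - Q|_TV = (1/2) * sum_i |p_i - q_i|.

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def jsd(p, q):
        m = (p + q) / 2
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def tv(p, q):
        return 0.5 * float(np.sum(np.abs(p - q)))

    rng = np.random.default_rng(3)
    for _ in range(1000):
        p, q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
        assert jsd(p, q) <= np.log(2) * tv(p, q) + 1e-12
    print("Corollary 3.6 held on all sampled pairs.")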

Proposition 3.4 gives a sharpening of Lemma 1 of Nielsen [15], who bounded the skew divergence S_{α,β} in terms of D_∞(α||β) alone and used the result to establish the boundedness of a generalization of the Jensen-Shannon divergence.

Definition 3.7 (Nielsen [15]).

For α ∈ [0, 1]ⁿ, w ∈ (0, 1)ⁿ such that Σ_{i=1}^n w_i = 1, and densities p and q with respect to a reference measure μ, define

 JS_{α,w}(p : q) = Σ_{i=1}^n w_i D( (1−α_i)p + α_i q || (1−ᾱ)p + ᾱ q )   (39)

where ᾱ := Σ_{i=1}^n w_i α_i.

Note that when n = 2, w = (1/2, 1/2), and α = (0, 1), JS_{α,w}(p : q) = JSD(p||q), the usual Jensen-Shannon divergence. We now demonstrate that Nielsen's generalized Jensen-Shannon divergence can be bounded by the total variation distance just as the ordinary Jensen-Shannon divergence.
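A direct implementation of (39) for discrete densities (a sketch of ours; the function names are not from [15]), which also checks the remark above that n = 2, w = (1/2, 1/2), α = (0, 1) recovers the ordinary Jensen-Shannon divergence:

    import numpy as np

    # Sketch: Nielsen's generalized Jensen-Shannon divergence JS_{alpha,w}(p:q), eq. (39),
    # for discrete densities, plus a check that alpha=(0,1), w=(1/2,1/2) gives back JSD.

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def js_alpha_w(p, q, alpha, w):
        alpha, w = np.asarray(alpha, dtype=float), np.asarray(w, dtype=float)
        abar = float(np.sum(w * alpha))
        mix_bar = (1 - abar) * p + abar * q
        return float(sum(wi * kl((1 - ai) * p + ai * q, mix_bar) for ai, wi in zip(alpha, w)))

    def jsd(p, q):
        m = (p + q) / 2
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    p = np.array([0.6, 0.3, 0.1])
    q = np.array([0.2, 0.3, 0.5])
    print(js_alpha_w(p, q, alpha=[0.0, 1.0], w=[0.5, 0.5]))           # equals JSD(p||q)
    print(jsd(p, q))
    print(js_alpha_w(p, q, alpha=[0.1, 0.4, 0.9], w=[0.2, 0.5, 0.3])) # a skewed variant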

Theorem 3.2.

For α ∈ [0, 1]ⁿ, w ∈ (0, 1)ⁿ such that Σ_{i=1}^n w_i = 1, and densities p and q with respect to a reference measure μ, with ᾱ := Σ_{i=1}^n w_i α_i,

 (log e) Var_w(α) |p − q|²_TV ≤ JS_{α,w}(p : q) ≤ A H(w) |p − q|_TV   (40)

where Var_w(α) := Σ_{i=1}^n w_i (α_i − ᾱ)², H(w) := −Σ_{i=1}^n w_i log w_i, and A := max_i |α_i − ᾱ_i| with ᾱ_i := Σ_{j≠i} w_j α_j / (1 − w_i).

Note that |α_i − ᾱ_i| ≤ 1, since ᾱ_i is the average of the terms {α_j}_{j≠i} with α_i removed, and thus A H(w) ≤ H(w). We will need the following theorem from [14] for the upper bound.

Theorem 3.3 ([14] Theorem 1.1).

For densities f_1, …, f_n with respect to a common reference measure γ, and λ ∈ (0, 1)ⁿ such that Σ_i λ_i = 1,

 h_γ( Σ_i λ_i f_i ) − Σ_i λ_i h_γ(f_i) ≤ T H(λ),   (41)

where h_γ(f) := −∫ f log f dγ, T := max_i |f_i − f̃_i|_TV, and H(λ) := −Σ_i λ_i log λ_i, with f̃_i := Σ_{j≠i} λ_j f_j / (1 − λ_i).

Proof of Theorem 3.2.

We apply Theorem 3.3 with f_i = (1 − α_i)p + α_i q, λ_i = w_i, and f := Σ_i λ_i f_i = (1 − ᾱ)p + ᾱ q, and noticing that in general

 h_γ( Σ_i λ_i f_i ) − Σ_i λ_i h_γ(f_i) = Σ_i λ_i D(f_i || f),   (42)

we have

 JS_{α,w}(p : q) = Σ_{i=1}^n w_i D( (1−α_i)p + α_i q || (1−ᾱ)p + ᾱ q )   (43)
                ≤ T H(w).   (44)

It remains to determine T,

 f̃_i − f_i = (f − f_i)/(1 − λ_i)   (45)
           = ( ((1−ᾱ)p + ᾱq) − ((1−α_i)p + α_i q) ) / (1 − w_i)   (46)
           = (α_i − ᾱ)(p − q)/(1 − w_i)   (47)
           = (α_i − ᾱ_i)(p − q).   (48)

Thus T = max_i |α_i − ᾱ_i| · |p − q|_TV and the proof of the upper bound is complete.

To prove the lower bound, we apply Pinsker’s inequality,