1 Introduction
The concept of an $f$-divergence, introduced independently by Ali and Silvey [1] and Csiszár [6], unifies several important information measures between probability distributions, as integrals of a convex function composed with the Radon–Nikodym derivative of the two probability distributions. For a convex function $f : (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$, and measures $P$ and $Q$ such that $P \ll Q$, the $f$-divergence from $P$ to $Q$ is given by
$$D_f(P\|Q) = \int f\!\left(\frac{dP}{dQ}\right) dQ.$$
The canonical example of an $f$-divergence, realized by taking $f(x) = x \log x$, is the relative entropy (often called the KL-divergence), and $f$-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity in the arguments, and a data-processing inequality. Other important examples include the total variation, the $\chi^2$-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [17] for more background. We will be interested in how stronger convexity properties of $f$ give improvements of classical $f$-divergence inequalities. This is in part inspired by the work of Sason [18], who demonstrated that divergences that are (as we define later) "$\lambda$-convex" satisfy data-processing inequalities stronger than the classical ones.
Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when $f$ is restricted to a subinterval of the real line. This observation is especially relevant to the situation in which one wishes to study divergences in the presence of a bounded Radon–Nikodym derivative $\frac{d\mu}{d\nu}$. One naturally obtains such bounds for skew divergences, that is, divergences of the form
$$(\mu,\nu) \mapsto D_f\big((1-t)\mu + t\nu \,\|\, (1-s)\mu + s\nu\big)$$
for $s, t \in [0,1]$, as in this case the relevant Radon–Nikodym derivative is bounded. Important examples of skew divergences include the skew divergence [10] based on the relative entropy, and the Vincze–Le Cam divergence [22, 9], called the triangular discrimination in [21], together with its generalization due to Györfi and Vajda [8] based on the $\chi^2$-divergence. The Jensen–Shannon divergence [11] and its recent generalization [15] give examples of $f$-divergences realized as linear combinations of skewed divergences.

Let us outline the paper. In Section 2 we derive elementary results on $\lambda$-convex divergences and give a table of examples of $\lambda$-convex divergences. We demonstrate that $\lambda$-convex divergences can be lower bounded by the $\chi^2$-divergence, and that the joint convexity of the map $(\mu,\nu) \mapsto D_f(\mu\|\nu)$ can be sharpened under $\lambda$-convexity conditions on $f$. As a consequence we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average divergence from the set to the barycenter.
In Section 3 we investigate general skewing of $f$-divergences. In particular we introduce the skew-symmetrization of an $f$-divergence, which recovers the Jensen–Shannon divergence and the Vincze–Le Cam divergence as special cases. We also show that a scaling of the Vincze–Le Cam divergence is minimal among skew-symmetrizations of $\lambda$-convex divergences. We then consider linear combinations of skew divergences, and show that a generalized Vincze–Le Cam divergence (based on skewing the $\chi^2$-divergence) can be upper bounded by the generalized Jensen–Shannon divergence introduced recently by Nielsen [15] (based on skewing the relative entropy), reversing the obvious bound that can be obtained from the classical comparison of the relative entropy with the $\chi^2$-divergence. We also derive upper and lower total variation bounds for Nielsen's generalized Jensen–Shannon divergence.
In Section 4 we consider a family of densities $\{p_\theta\}$ weighted by a prior $\pi$, and a density $q$. We use the Bayes estimator¹ to derive a convex decomposition of the barycenter and of $q$, each into two auxiliary densities. We use this decomposition to sharpen, for $\lambda$-convex divergences, an elegant theorem of Guntuboyina [7] that generalizes the Fano and Pinsker inequalities to $f$-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina's inequality gives a new sharpening of Pinsker's inequality in terms of the convex decomposition induced by the Bayes estimator.

¹This is the Bayes estimator for the associated loss function.

Notation
We consider Borel probability measures $\mu$ and $\nu$ on a Polish space $X$. For a convex function $f : (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$, define the $f$-divergence from $\mu$ to $\nu$, via densities $p$ for $\mu$ and $q$ for $\nu$ with respect to a common reference measure $\gamma$, as

(1) $D_f(\mu\|\nu) = \int_X f\!\left(\frac{p}{q}\right) q \, d\gamma$
(2) $\qquad\qquad = \int_{\{pq > 0\}} f\!\left(\frac{p}{q}\right) q \, d\gamma + f(0)\,\nu(\{p = 0\}) + f'(\infty)\,\mu(\{q = 0\}).$

We note that this representation is independent of $\gamma$, and such a reference measure always exists; take $\gamma = \frac{\mu + \nu}{2}$, for example.
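For concreteness, the definition above can be sketched numerically in the discrete case, where $\gamma$ is the counting measure and the densities are probability vectors. A minimal sketch (the helper names are ours, and the boundary conventions are omitted, so $q$ is assumed strictly positive):

```python
import math

def f_divergence(f, p, q):
    """D_f(p||q) = sum_i q_i * f(p_i / q_i) for probability vectors p, q.

    A minimal sketch: the boundary conventions (q_i = 0) are omitted,
    so q is assumed strictly positive."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Canonical examples of convex generators with f(1) = 0.
kl = lambda x: x * math.log(x)    # relative entropy
chi2 = lambda x: (x - 1) ** 2     # Pearson's chi^2

p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
print(f_divergence(kl, p, q))    # > 0 unless p == q
print(f_divergence(chi2, p, q))  # 0.175
```

Note that both values are non-negative and vanish exactly when $p = q$, in line with the basic properties of $f$-divergences recalled above.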
Here we use

(3) $f(0) := \lim_{t \to 0^+} f(t), \qquad f'(\infty) := \lim_{t \to \infty} \frac{f(t)}{t},$

with the conventions $0 \cdot f\!\left(\frac{0}{0}\right) := 0$ and $0 \cdot f\!\left(\frac{a}{0}\right) := a f'(\infty)$ for $a > 0$. For a random variable $Z$ and a set $A$, we denote the probability that $Z$ takes a value in $A$ by $\mathbb{P}(Z \in A)$, the expectation of the random variable by $\mathbb{E}[Z]$, and the variance by $\mathrm{Var}(Z) := \mathbb{E}\,|Z - \mathbb{E}[Z]|^2$. For probability measures $\mu$ and $\nu$ such that $\nu(A) = 0$ implies $\mu(A) = 0$ for all Borel $A$, we write $\mu \ll \nu$, and when there exists a probability density function $p$ such that $\mu(A) = \int_A p \, d\gamma$ for a reference measure $\gamma$, we write $p = \frac{d\mu}{d\gamma}$. For a probability measure $\gamma$ on $X$ and a $\gamma$-integrable function $h$, we denote $\gamma(h) := \int_X h \, d\gamma$.

2 Strongly convex divergences
Definition 2.1.
An $\mathbb{R}$-valued function $f$ on a convex set $K \subseteq \mathbb{R}$ is $\lambda$-convex when $x, y \in K$ and $t \in [0,1]$ implies

(4) $f\big((1-t)x + ty\big) \le (1-t) f(x) + t f(y) - \frac{\lambda}{2}\, t(1-t)\, (x - y)^2.$

For example, when $f$ is twice differentiable, (4) is equivalent to $f''(x) \ge \lambda$ for $x \in K$. Note that the case $\lambda = 0$ is just usual convexity.
Proposition 2.2.
For $\lambda \ge 0$ and $f : (a,b) \to \mathbb{R}$, the following are equivalent:

1. $f$ is $\lambda$-convex.

2. The function $g(x) := f(x) - \frac{\lambda}{2} x^2$ is convex on $(a,b)$.

3. The right-handed derivative, defined as $f'(x) := \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}$, satisfies
$f'(y) - f'(x) \ge \lambda (y - x)$
for $a < x < y < b$.
Proof.
Observe that it is enough to prove the result when $\lambda = 0$, by passing to $g(x) = f(x) - \frac{\lambda}{2}x^2$, where the proposition reduces to the classical result for convex functions. ∎
Definition 2.3.
An $f$-divergence $D_f$ is $\lambda$-convex on an interval $K$, for $\lambda \ge 0$, when the function $f$ is $\lambda$-convex on $K$.
The table below lists some $\lambda$-convex divergences of interest in this article.
Divergence | $f(x)$ | Domain | $\lambda$
relative entropy (KL) | $x \log x$ | $(0, M]$ | $\frac{1}{M}$
total variation | $\frac{|x-1|}{2}$ | $(0, \infty)$ | $0$
Pearson's $\chi^2$ | $(x-1)^2$ | $(0, \infty)$ | $2$
squared Hellinger | $(\sqrt{x} - 1)^2$ | $(0, M]$ | $\frac{1}{2} M^{-3/2}$
reverse relative entropy | $-\log x$ | $(0, M]$ | $\frac{1}{M^2}$
Vincze–Le Cam | $\frac{(x-1)^2}{x+1}$ | $(0, M]$ | $\frac{8}{(1+M)^3}$
Jensen–Shannon | $x \log x - (1+x)\log\frac{1+x}{2}$ | $(0, M]$ | $\frac{1}{M(M+1)}$
Neyman's $\chi^2$ | $\frac{1}{x} - 1$ | $(0, M]$ | $\frac{2}{M^3}$
Sason's | (see [18]) | $(0, M]$ | (see [18])
$\alpha$-divergence | $\frac{x^\alpha - \alpha x + \alpha - 1}{\alpha(\alpha-1)}$ | $(0, M]$, $\alpha \le 2$ | $M^{\alpha - 2}$
Observe that we have taken the normalization convention on the total variation, which we denote by $TV$, such that $TV(\mu,\nu) \le 1$. Also note that the $\alpha$-divergence interpolates several of the divergences above: it recovers Pearson's $\chi^2$ divergence (up to normalization) when $\alpha = 2$, one half Neyman's $\chi^2$ divergence when $\alpha = -1$, the squared Hellinger divergence (up to normalization) when $\alpha = \frac{1}{2}$, and has as limiting cases the relative entropy when $\alpha \to 1$ and the reverse relative entropy when $\alpha \to 0$. If $f$ is $\lambda$-convex on $[a,b] \subseteq (0,\infty)$, then its dual divergence, given by $f^*(x) := x f(1/x)$, is $\lambda a^3$-convex on $[\frac{1}{b}, \frac{1}{a}]$, since $(f^*)''(x) = \frac{1}{x^3} f''\!\left(\frac{1}{x}\right)$. Recall that $f^*$ satisfies the equality $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$. For brevity, we will use $\chi^2$-divergence to refer to the Pearson $\chi^2$-divergence, and will articulate Neyman's $\chi^2$ explicitly when necessary. The next lemma is a restatement of Jensen's inequality.
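The moduli $\lambda$ in the table are infima of $f''$ over the stated domain, and can be spot-checked numerically. A minimal sketch for a few of the rows (the dictionary of generators and all names here are ours):

```python
# Second derivatives of some convex generators from the table.
second_derivs = {
    "relative entropy": lambda x: 1.0 / x,            # f(x) = x log x
    "Pearson chi^2": lambda x: 2.0,                   # f(x) = (x-1)^2
    "squared Hellinger": lambda x: 0.5 * x ** -1.5,   # f(x) = (sqrt(x)-1)^2
    "reverse relative entropy": lambda x: x ** -2.0,  # f(x) = -log x
    "Vincze-Le Cam": lambda x: 8.0 / (1 + x) ** 3,    # f(x) = (x-1)^2/(x+1)
}

M = 3.0
claimed = {
    "relative entropy": 1 / M,
    "Pearson chi^2": 2.0,
    "squared Hellinger": 0.5 * M ** -1.5,
    "reverse relative entropy": M ** -2.0,
    "Vincze-Le Cam": 8.0 / (1 + M) ** 3,
}

grid = [M * (i + 1) / 1000 for i in range(1000)]  # a grid on (0, M]
for name, fpp in second_derivs.items():
    # each f'' above is non-increasing, so the infimum is attained at x = M
    assert abs(min(fpp(x) for x in grid) - claimed[name]) < 1e-9, name
print("table moduli match inf f'' on (0, M]")
```

Each of these second derivatives is non-increasing on $(0, \infty)$, which is why restricting to a bounded interval $(0, M]$ is what produces a positive modulus.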
Lemma 2.4.
If $\Phi$ is convex on the range of a $\gamma$-integrable function $h$, then for any probability measure $\gamma$, $\Phi\!\left(\int h \, d\gamma\right) \le \int \Phi(h) \, d\gamma$.
Proof.
Apply Jensen’s inequality to $\Phi$ and the push-forward of $\gamma$ under $h$. ∎
For a convex function $f$ such that $f(1) = 0$, and $c \in \mathbb{R}$, the function $\tilde{f}(x) := f(x) + c(x-1)$ remains a convex function with $\tilde{f}(1) = 0$, and what is more, satisfies $D_{\tilde{f}}(\mu\|\nu) = D_f(\mu\|\nu)$,
since $\int \left(\frac{d\mu}{d\nu} - 1\right) d\nu = 0$.
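This invariance under affine shifts $c(x-1)$ can be seen numerically as well. A minimal sketch for discrete distributions (names ours):

```python
import math

def f_div(f, p, q):
    # discrete f-divergence with strictly positive q, a minimal sketch
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

p = [0.1, 0.6, 0.3]
q = [0.3, 0.3, 0.4]

f = lambda x: x * math.log(x)     # relative entropy generator
c = 2.7                           # an arbitrary affine shift
g = lambda x: f(x) + c * (x - 1)

# sum_i q_i (p_i/q_i - 1) = sum_i (p_i - q_i) = 0, so the shift integrates away
assert abs(f_div(f, p, q) - f_div(g, p, q)) < 1e-12
print("D_f unchanged by the affine shift")
```

This is why an $f$-divergence determines its generator only up to the equivalence $f \sim f + c(x-1)$, a fact used again in Section 3.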
Definition 2.5 ($\chi^2$-divergence).
For $f(x) = (x-1)^2$, we write $\chi^2(\mu\|\nu) := D_f(\mu\|\nu)$.
The following result shows that every strongly convex divergence can be lower bounded, up to its convexity constant $\lambda$, by the $\chi^2$-divergence.
Theorem 2.1.
For a $\lambda$-convex function $f$,
$$D_f(\mu\|\nu) \ge \frac{\lambda}{2}\, \chi^2(\mu\|\nu).$$
Proof.
Define $\tilde{f}(x) := f(x) - f'(1)(x-1)$, and note that $\tilde{f}$ defines the same $\lambda$-convex divergence as $f$. So we may assume without loss of generality that $f'(1) = 0$ and that $f$ is uniquely zero when $x = 1$. Since $f$ is $\lambda$-convex, $g(x) := f(x) - \frac{\lambda}{2}(x-1)^2$ is convex, and by the normalization, $g'(1) = 0$ as well. Thus $g$ takes its minimum when $x = 1$, and hence $g \ge 0$, so that $f(x) \ge \frac{\lambda}{2}(x-1)^2$. Computing,
$$D_f(\mu\|\nu) = \int f\!\left(\frac{d\mu}{d\nu}\right) d\nu \ge \frac{\lambda}{2}\int \left(\frac{d\mu}{d\nu} - 1\right)^2 d\nu = \frac{\lambda}{2}\,\chi^2(\mu\|\nu).$$
∎
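The conclusion of Theorem 2.1 can be sanity-checked on random discrete distributions, taking $M$ to be the largest likelihood ratio and $\lambda = 1/M$ for the relative entropy, as in the table. A sketch (helper names ours):

```python
import math
import random

def f_div(f, p, q):
    # discrete f-divergence with strictly positive q, a minimal sketch
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl_gen = lambda x: x * math.log(x)
chi2_gen = lambda x: (x - 1) ** 2

random.seed(0)
for _ in range(1000):
    p = [random.uniform(0.1, 1.0) for _ in range(4)]
    q = [random.uniform(0.1, 1.0) for _ in range(4)]
    p = [x / sum(p) for x in p]
    q = [x / sum(q) for x in q]
    M = max(pi / qi for pi, qi in zip(p, q))  # ratios lie in (0, M]
    lam = 1.0 / M                             # modulus of x log x on (0, M]
    assert f_div(kl_gen, p, q) >= lam / 2 * f_div(chi2_gen, p, q) - 1e-12
print("D(p||q) >= (lambda/2) chi^2(p||q) held on all samples")
```

The modulus here is data-dependent: the smaller the largest likelihood ratio, the stronger the $\chi^2$ lower bound.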
The above proof uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [19], where it appears as Theorem 1, and was used to give sharp comparisons in several divergence inequalities.
Theorem 2.2 (SasonVerdú [19]).
For $f$-divergences defined by convex functions $g$ and $h$ with $g(x) \le c\, h(x)$ for all $x \in (0,\infty)$, then $D_g(\mu\|\nu) \le c\, D_h(\mu\|\nu)$.
Moreover, if in addition $\lim_{x \to 1} \frac{g(x)}{h(x)} = c$, then $c = \sup_{\mu \neq \nu} \frac{D_g(\mu\|\nu)}{D_h(\mu\|\nu)}$.
Corollary 2.6.
For a smooth $\lambda$-convex divergence $D_f$, the inequality

(5) $D_f(\mu\|\nu) \ge \frac{\lambda}{2}\,\chi^2(\mu\|\nu)$

is sharp multiplicatively in the sense that

(6) $\inf_{\mu \neq \nu} \frac{D_f(\mu\|\nu)}{\frac{\lambda}{2}\chi^2(\mu\|\nu)} = 1$

if $f''(1) = \lambda$.
Proof.
After the normalization $f'(1) = 0$, as in the proof of Theorem 2.1, apply Theorem 2.2 to the comparison $\frac{\lambda}{2}(x-1)^2 \le f(x)$, noting that by Taylor expansion $\lim_{x \to 1} \frac{f(x)}{\frac{\lambda}{2}(x-1)^2} = \frac{f''(1)}{\lambda} = 1$. ∎
Proposition 2.7.
When $D_f$ is an $f$-divergence such that $f$ is $\lambda$-convex on $[a,b]$, and $\{\mu_\theta\}_{\theta \in \Theta}$ and $\nu$ are probability measures indexed by a set $\Theta$ such that $a \le \frac{d\mu_\theta}{d\nu} \le b$ holds for all $\theta$, then for a probability measure $\pi$ on $\Theta$, writing $\bar{\mu} := \int_\Theta \mu_\theta \, d\pi(\theta)$,

(8) $\int_\Theta D_f(\mu_\theta\|\nu)\, d\pi(\theta) - D_f(\bar{\mu}\|\nu) \ge \frac{\lambda}{2}\left(\int_\Theta \chi^2(\mu_\theta\|\nu)\, d\pi(\theta) - \chi^2(\bar{\mu}\|\nu)\right).$

In particular, when $p_\theta$, $\bar{p}$, and $q$ denote densities of $\mu_\theta$, $\bar{\mu}$, and $\nu$ with respect to a common reference measure $\gamma$,

(9) $\int_\Theta D_f(\mu_\theta\|\nu)\, d\pi(\theta) - D_f(\bar{\mu}\|\nu) \ge \frac{\lambda}{2} \int_\Theta \int_X \frac{(p_\theta - \bar{p})^2}{q}\, d\gamma \, d\pi(\theta)$
(10) $\ge 2\lambda \int_\Theta TV^2(\mu_\theta, \bar{\mu})\, d\pi(\theta)$
(11) $\ge 2\lambda \left(\int_\Theta TV(\mu_\theta, \bar{\mu})\, d\pi(\theta)\right)^2.$
Proof.
Let $\gamma$ denote a reference measure dominating $\nu$ and each $\mu_\theta$, so that $\bar{\mu} \ll \gamma$ as well, and write $p_\theta = \frac{d\mu_\theta}{d\gamma}$, $\bar{p} = \frac{d\bar{\mu}}{d\gamma}$, and $q = \frac{d\nu}{d\gamma}$. Let $g(x) := f(x) - \frac{\lambda}{2}(x-1)^2$, which is convex by the $\lambda$-convexity of $f$, and satisfies $g(1) = 0$. Then

(12) $\int_\Theta D_f(\mu_\theta\|\nu)\, d\pi(\theta) = \int_\Theta \int_X f\!\left(\frac{p_\theta}{q}\right) q\, d\gamma\, d\pi(\theta)$
(13) $= \int_\Theta \int_X g\!\left(\frac{p_\theta}{q}\right) q\, d\gamma\, d\pi(\theta) + \frac{\lambda}{2}\int_\Theta \int_X \left(\frac{p_\theta}{q} - 1\right)^2 q\, d\gamma\, d\pi(\theta)$
(14) $= \int_\Theta \int_X g\!\left(\frac{p_\theta}{q}\right) q\, d\gamma\, d\pi(\theta) + \frac{\lambda}{2}\int_\Theta \chi^2(\mu_\theta\|\nu)\, d\pi(\theta).$

By Jensen's inequality, as in Lemma 2.4, $g\!\left(\frac{\bar{p}}{q}\right) = g\!\left(\int_\Theta \frac{p_\theta}{q}\, d\pi(\theta)\right) \le \int_\Theta g\!\left(\frac{p_\theta}{q}\right) d\pi(\theta)$.
Integrating this inequality gives

(15) $\int_\Theta \int_X g\!\left(\frac{p_\theta}{q}\right) q\, d\gamma\, d\pi(\theta) \ge \int_X g\!\left(\frac{\bar{p}}{q}\right) q\, d\gamma = D_f(\bar{\mu}\|\nu) - \frac{\lambda}{2}\chi^2(\bar{\mu}\|\nu).$

Note that $\int_X \left(\frac{p_\theta}{q} - 1\right)^2 q\, d\gamma = \int_X \frac{p_\theta^2}{q}\, d\gamma - 1$,
and

(16) $\int_\Theta \chi^2(\mu_\theta\|\nu)\, d\pi(\theta) - \chi^2(\bar{\mu}\|\nu) = \int_\Theta \int_X \frac{p_\theta^2}{q}\, d\gamma\, d\pi(\theta) - \int_X \frac{\bar{p}^2}{q}\, d\gamma$
(17) $= \int_\Theta \int_X \frac{p_\theta^2 - \bar{p}^2}{q}\, d\gamma\, d\pi(\theta)$
(18) $= \int_\Theta \int_X \frac{(p_\theta - \bar{p})^2}{q}\, d\gamma\, d\pi(\theta),$

where the last equality holds since $\int_\Theta p_\theta\, d\pi(\theta) = \bar{p}$. Inserting these equalities into (15) gives the result.
To obtain the total variation bound one needs only to apply Jensen's inequality, in the form of the Cauchy–Schwarz inequality,

(19) $\int_X \frac{(p_\theta - \bar{p})^2}{q}\, d\gamma \ge \left(\int_X |p_\theta - \bar{p}|\, d\gamma\right)^2$
(20) $= 4\, TV^2(\mu_\theta, \bar{\mu}),$

and (11) follows from Jensen's inequality applied to $x \mapsto x^2$.
∎
Observe that taking $\nu = \bar{\mu}$ in Proposition 2.7, one obtains a lower bound for the average divergence from a set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,

(21) $\int_\Theta D_f(\mu_\theta\|\bar{\mu})\, d\pi(\theta) \ge 2\lambda \int_\Theta TV^2(\mu_\theta, \bar{\mu})\, d\pi(\theta).$
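The barycenter bound (21) can be illustrated on a small discrete family; the following sketch takes the relative entropy with the data-dependent modulus $\lambda = 1/M$, where $M$ bounds the likelihood ratios (helper names ours):

```python
import math

def f_div(f, p, q):
    # discrete f-divergence with strictly positive q, a minimal sketch
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl_gen = lambda x: x * math.log(x)
tv = lambda a, b: 0.5 * sum(abs(x - y) for x, y in zip(a, b))

mus = [[0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]  # the family {mu_theta}
w = [0.3, 0.4, 0.3]                          # the mixing measure pi
bar = [sum(wi * mu[j] for wi, mu in zip(w, mus)) for j in range(2)]

M = max(mu[j] / bar[j] for mu in mus for j in range(2))
lam = 1.0 / M  # modulus of x log x on (0, M]

avg_div = sum(wi * f_div(kl_gen, mu, bar) for wi, mu in zip(w, mus))
msq_tv = sum(wi * tv(mu, bar) ** 2 for wi, mu in zip(w, mus))
assert avg_div >= 2 * lam * msq_tv - 1e-12
print("average KL to barycenter dominates 2*lambda*(mean square TV)")
```

Here the barycenter is a mixture, so every ratio $\frac{d\mu_\theta}{d\bar{\mu}}$ is automatically bounded, which is what makes the $\lambda$-convexity on a bounded interval applicable.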
The next result shows that for $f$ strongly convex, Pinsker type inequalities can never be reversed.
Proposition 2.8.
Given $f$ strongly convex and $M > 0$, there exist probability measures $\mu$ and $\nu$ such that

(22) $D_f(\mu\|\nu) > M \cdot TV(\mu, \nu).$
Proof.
By $\lambda$-convexity, $g(x) := f(x) - \frac{\lambda}{2}(x-1)^2$ is a convex function. Thus, arguing as in the proof of Theorem 2.1, $D_f(\mu\|\nu) \ge \frac{\lambda}{2}\chi^2(\mu\|\nu)$. Taking measures on the two point space $\{0,1\}$, $\mu = \delta_0$ and $\nu_n = \frac{1}{n}\delta_0 + \left(1 - \frac{1}{n}\right)\delta_1$, gives $\chi^2(\mu\|\nu_n) = n - 1$, which tends to infinity with $n$, while $TV(\mu, \nu_n) = 1 - \frac{1}{n} \le 1$.
∎
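The two point construction in the proof can be computed directly; the sketch below shows the ratio $\chi^2 / TV$ growing linearly in $n$ (helper names ours):

```python
# Two point space {0,1}: mu = delta_0, nu_n = (1/n, 1 - 1/n).
def chi2_div(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

tv = lambda a, b: 0.5 * sum(abs(x - y) for x, y in zip(a, b))

ratios = []
for n in [10, 100, 1000]:
    mu = [1.0, 0.0]
    nu = [1.0 / n, 1.0 - 1.0 / n]
    # chi^2(mu||nu_n) = n - 1 while TV(mu, nu_n) = 1 - 1/n, so the ratio is n
    ratios.append(chi2_div(mu, nu) / tv(mu, nu))
print(ratios)  # [10.0, 100.0, 1000.0] up to rounding
```

Since any strongly convex divergence dominates $\frac{\lambda}{2}\chi^2$, the same sequence witnesses (22) for every such divergence.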
In fact, building on the work of [3, 12], Sason and Verdú proved in [19] that for any $f$-divergence, $\sup_{\mu \neq \nu} \frac{D_f(\mu\|\nu)}{TV(\mu,\nu)} = f(0) + f'(\infty)$. Thus, an $f$-divergence can be bounded above by a constant multiple of the total variation if and only if $f(0) + f'(\infty) < \infty$. From this perspective, Proposition 2.8 is simply the obvious fact that strongly convex functions have super-linear (at least quadratic) growth at infinity, so that $f'(\infty) = \infty$.
3 Skew divergences
If we denote by $\mathcal{F}$ the quotient of the cone of convex functions $f : (0,\infty) \to \mathbb{R}$ such that $f(1) = 0$, under the equivalence relation $f \sim g$ when $f(x) = g(x) + c(x-1)$ for some $c \in \mathbb{R}$, then the map $f \mapsto D_f$ gives a linear isomorphism between $\mathcal{F}$ and the space of all $f$-divergences. The mapping defined by $D_f \mapsto D_{f^*}$, where we recall $f^*(x) := x f(1/x)$, gives an involution of the space of $f$-divergences. Indeed, $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$, so that $D_{(f^*)^*} = D_f$. Skew divergences give an interpolation of this involution, as
$$(\mu,\nu) \mapsto D_f\big((1-t)\mu + t\nu \,\|\, (1-s)\mu + s\nu\big)$$
gives $D_f(\mu\|\nu)$ by taking $t = 0$ and $s = 1$, and yields $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$ by taking $t = 1$ and $s = 0$.
Moreover, as mentioned in the introduction, skewing imposes boundedness of the relevant Radon–Nikodym derivative, which allows us to constrain the domain of the divergences and leverage $\lambda$-convexity to obtain divergence inequalities in this section.
The following appears as Theorem III.1 in the preprint [14]. It states that skewing an $f$-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed $f$-divergences. A proof is given in the appendix for the convenience of the reader.
Theorem 3.1 (Melbourne et al [14]).
For $s, t \in [0,1]$ and an $f$-divergence $D_f$, the skewing of $D_f$, in the sense that

(23) $(\mu,\nu) \mapsto D_f\big((1-t)\mu + t\nu \,\|\, (1-s)\mu + s\nu\big)$

is an $f$-divergence if $D_f$ is.
Definition 3.1.
For $D_f$ an $f$-divergence, its skew symmetrization,
$$\Delta_f(\mu\|\nu) := \frac{1}{2} D_f\!\left(\mu \,\Big\|\, \frac{\mu+\nu}{2}\right) + \frac{1}{2} D_f\!\left(\nu \,\Big\|\, \frac{\mu+\nu}{2}\right),$$
is determined by the convex function

(24) $\hat{f}(x) := \frac{1+x}{4}\left(f\!\left(\frac{2x}{1+x}\right) + f\!\left(\frac{2}{1+x}\right)\right).$

Observe that $\Delta_f(\mu\|\nu) = \Delta_f(\nu\|\mu)$, and that $\Delta_f(\mu\|\nu)$ is finite for all $\mu$ and $\nu$ whenever $f$ is bounded on $[0,2]$, since $\frac{d\mu}{d\left(\frac{\mu+\nu}{2}\right)} \le 2$ and $\frac{d\nu}{d\left(\frac{\mu+\nu}{2}\right)} \le 2$. When $f(x) = x \log x$, the relative entropy's skew symmetrization is the Jensen–Shannon divergence. When $f(x) = (x-1)^2$, up to a normalization constant, the $\chi^2$-divergence's skew symmetrization is the Vincze–Le Cam divergence, which we state below for emphasis. See [21] for more background on this divergence, where it is referred to as the triangular discrimination.
Definition 3.2.
When $f(x) = \frac{(x-1)^2}{x+1}$, denote the Vincze–Le Cam divergence by
$$\Delta(\mu\|\nu) := D_f(\mu\|\nu).$$
If one denotes the skew symmetrization of the $\chi^2$-divergence by $\widehat{\chi^2}$, one can compute easily from (24) that $\widehat{\chi^2}(\mu\|\nu) = \frac{1}{2}\Delta(\mu\|\nu)$. We note that although skewing preserves convexity, by the above example, it does not preserve $\lambda$-convexity in general. The $\chi^2$-divergence is a $2$-convex divergence, while its skew symmetrization, corresponding to the Vincze–Le Cam divergence, satisfies $\hat{f}''(x) = \frac{4}{(1+x)^3}$, which cannot be bounded away from zero on $(0,\infty)$.
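Both closed forms can be checked numerically against the defining mixture formula; a minimal sketch for discrete distributions (all helper names ours):

```python
import math

def f_div(f, p, q):
    # discrete f-divergence with strictly positive q, a minimal sketch
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def skew_sym(f, p, q):
    # (1/2) D_f(p || m) + (1/2) D_f(q || m) with m the midpoint mixture
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * f_div(f, p, m) + 0.5 * f_div(f, q, m)

p = [0.2, 0.5, 0.3]
q = [0.6, 0.1, 0.3]

chi2 = lambda x: (x - 1) ** 2
vlc = lambda x: (x - 1) ** 2 / (x + 1)
# the skew symmetrization of chi^2 is half the Vincze-Le Cam divergence
assert abs(skew_sym(chi2, p, q) - 0.5 * f_div(vlc, p, q)) < 1e-12

kl = lambda x: x * math.log(x)
js_gen = lambda x: 0.5 * x * math.log(x) - 0.5 * (1 + x) * math.log((1 + x) / 2)
# the skew symmetrization of KL is the Jensen-Shannon divergence
assert abs(skew_sym(kl, p, q) - f_div(js_gen, p, q)) < 1e-12
print("skew symmetrizations match their closed-form generators")
```

Both assertions are pointwise identities of the generators via (24), so they hold for any pair of strictly positive probability vectors.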
Corollary 3.3.
For $D_f$ an $f$-divergence such that $f$ is $\lambda$-convex on $[0,2]$,

(25) $\Delta_f(\mu\|\nu) \ge \frac{\lambda}{4}\,\Delta(\mu\|\nu),$

with equality when $D_f$ corresponds to the $\chi^2$-divergence, where $\Delta_f$ denotes the skew symmetrized divergence associated to $f$ and $\Delta$ is the Vincze–Le Cam divergence.
Proof.
Since the densities of $\mu$ and $\nu$ with respect to $\frac{\mu+\nu}{2}$ are bounded by $2$, Theorem 2.1 applies with $\lambda$-convexity of $f$ needed only on $[0,2]$, giving $\Delta_f(\mu\|\nu) \ge \frac{\lambda}{2}\,\widehat{\chi^2}(\mu\|\nu) = \frac{\lambda}{4}\,\Delta(\mu\|\nu)$, with equality when $f(x) = (x-1)^2$ and $\lambda = 2$. ∎

When $f(x) = x \log x$, we have $f''(x) = \frac{1}{x} \ge \frac{1}{2}$ on $(0,2]$, which demonstrates that up to a constant the Jensen–Shannon divergence bounds the Vincze–Le Cam divergence. See [21] for an improvement of the inequality in the case of the Jensen–Shannon divergence, called the "capacitory discrimination" in the reference.
We will now investigate more general, non-symmetric skewing in what follows.
Proposition 3.4.
For $t \in [0,1]$ and probability measures $\mu$ and $\nu$, define the skewed relative entropy

(26) $D_t(\mu\|\nu) := D\big(\mu \,\|\, (1-t)\mu + t\nu\big)$

and the skewed $\chi^2$-divergence (the divergence of Györfi and Vajda [8])

(27) $\chi_t^2(\mu\|\nu) := \chi^2\big(\mu \,\|\, (1-t)\mu + t\nu\big).$

Then

(28) $\frac{1-t}{2}\,\chi_t^2(\mu\|\nu) \;\le\; D_t(\mu\|\nu) \;\le\; \chi_t^2(\mu\|\nu).$
We will need the following lemma, originally proved by Audenaert in the quantum setting [2]. It is based on a differential relationship between the skew divergence [10] and the divergence of Györfi and Vajda [8]; see [13, 16].
Lemma 3.5 (Theorem III.1 [14]).
For $t \in (0,1]$ and probability measures $\mu$ and $\nu$, with $\mu_t := (1-t)\mu + t\nu$,

(29) $\frac{d}{dt}\, D(\mu\|\mu_t) = \frac{\chi^2(\mu\|\mu_t)}{t} = t \int \frac{(d\mu - d\nu)^2}{d\mu_t}.$
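The derivative identity $\frac{d}{dt} D(\mu\|(1-t)\mu + t\nu) = \chi^2(\mu\|(1-t)\mu + t\nu)/t$ can be verified by finite differences on a two-point example; a minimal sketch (helper names ours):

```python
import math

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_div(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

p = [0.3, 0.7]
q = [0.8, 0.2]
mix = lambda t: [(1 - t) * pi + t * qi for pi, qi in zip(p, q)]

t, h = 0.4, 1e-6
# centered finite difference of t -> D(p || (1-t) p + t q)
deriv = (kl_div(p, mix(t + h)) - kl_div(p, mix(t - h))) / (2 * h)
assert abs(deriv - chi2_div(p, mix(t)) / t) < 1e-5
print("finite difference matches chi^2(mu||mu_t)/t")
```

Integrating this identity in $t$ is what converts pointwise control of the Györfi–Vajda divergence into bounds on the skew divergence.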
Proof of Proposition 3.4.
Write $\mu_s := (1-s)\mu + s\nu$ and $h(s) := \int \frac{(d\mu - d\nu)^2}{d\mu_s}$, so that $\chi_s^2(\mu\|\nu) = s^2\, h(s)$. If $s \le t$, then $\mu_s \le \frac{\mu_t}{1-t}$ and $\mu_s \ge \frac{s}{t}\,\mu_t$, so that $h(s) \ge (1-t)\, h(t)$ and $h(s) \le \frac{t}{s}\, h(t)$. Also,

(30) $D_t(\mu\|\nu) = \int_0^t \frac{\chi_s^2(\mu\|\nu)}{s}\, ds = \int_0^t s\, h(s)\, ds$

by Lemma 3.5, thus

(31) $D_t(\mu\|\nu) \ge \int_0^t s\,(1-t)\, h(t)\, ds$
(32) $= \frac{(1-t)\, t^2}{2}\, h(t)$
(33) $= \frac{1-t}{2}\,\chi_t^2(\mu\|\nu),$

where the inequality follows from the bound $h(s) \ge (1-t)h(t)$ above. Following the same argument for the upper bound, $h(s) \le \frac{t}{s}\, h(t)$, so that

(34) $\int_0^t s\, h(s)\, ds \le \int_0^t t\, h(t)\, ds$

for $s \le t$ completes the proof. Indeed,

(35) $D_t(\mu\|\nu) = \int_0^t s\, h(s)\, ds$
(36) $\le \int_0^t t\, h(t)\, ds$
(37) $= t^2\, h(t) = \chi_t^2(\mu\|\nu).$
∎
Corollary 3.6.
For probability measures $\mu$, $\nu$ and $t \in [0,1]$,

(38) $D_t(\mu\|\nu) \ge 2(1-t)\, t^2\, TV^2(\mu, \nu).$
Proof.
Since $\chi_t^2(\mu\|\nu) = t^2 \int \frac{(d\mu - d\nu)^2}{d\mu_t} \ge 4t^2\, TV^2(\mu,\nu)$ by the Cauchy–Schwarz inequality, the result follows from the lower bound of Proposition 3.4. ∎
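A numerical spot-check of the skew divergence bounds of Proposition 3.4 and Corollary 3.6 on random discrete distributions (a sketch; the helper functions are ours, not from the paper):

```python
import math
import random

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def chi2_div(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

tv = lambda a, b: 0.5 * sum(abs(x - y) for x, y in zip(a, b))

random.seed(1)
for _ in range(500):
    p = [random.random() + 0.01 for _ in range(3)]
    q = [random.random() + 0.01 for _ in range(3)]
    p = [x / sum(p) for x in p]
    q = [x / sum(q) for x in q]
    t = random.uniform(0.05, 0.95)
    m = [(1 - t) * pi + t * qi for pi, qi in zip(p, q)]
    D_t, X_t = kl_div(p, m), chi2_div(p, m)
    assert (1 - t) / 2 * X_t - 1e-12 <= D_t <= X_t + 1e-12  # Proposition 3.4
    assert D_t >= 2 * (1 - t) * t * t * tv(p, q) ** 2 - 1e-12  # Corollary 3.6
print("skew divergence bounds held on all samples")
```

Note how both bounds degenerate as $t \to 1$: skewing toward $\nu$ shrinks all three quantities, and the Pinsker-type constant $2(1-t)t^2$ reflects this.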
Proposition 3.4 gives a sharpening of Lemma 1 of Nielsen [15], who proved the boundedness of the skewed relative entropy, and used the result to establish the boundedness of a generalization of the Jensen–Shannon divergence.
Definition 3.7 (Nielsen [15]).
For $n \ge 2$ and densities $p_1, \dots, p_n$ with respect to a reference measure $\gamma$, and weights $w_1, \dots, w_n > 0$ such that $\sum_{i=1}^n w_i = 1$, define

(39) $JS_w(p_1, \dots, p_n) := \sum_{i=1}^n w_i\, D(p_i \| \bar{p}),$

where $\bar{p} := \sum_{i=1}^n w_i\, p_i$.
Note that when $n = 2$ and $w = \left(\frac{1}{2}, \frac{1}{2}\right)$, $JS_w(p_1, p_2) = JS(p_1\|p_2)$, the usual Jensen–Shannon divergence. We now demonstrate that Nielsen's generalized Jensen–Shannon divergence can be bounded by the total variation distance, just as the ordinary Jensen–Shannon divergence can.
Theorem 3.2.
For $n \ge 2$ and densities $p_1, \dots, p_n$ with respect to a reference measure $\gamma$, and weights $w_1, \dots, w_n > 0$ such that $\sum_{i=1}^n w_i = 1$, then

(40) $2 \sum_{i=1}^n w_i^2 (1 - w_i)^2\, T_i^2 \;\le\; JS_w(p_1, \dots, p_n) \;\le\; \sum_{i=1}^n w_i\, T_i \log\frac{1}{w_i},$

where $T_i := TV(p_i, \bar{p}_{\hat{i}})$ and $\bar{p}_{\hat{i}} := \frac{1}{1 - w_i}\sum_{j \neq i} w_j\, p_j$, with $\bar{p} = w_i\, p_i + (1 - w_i)\, \bar{p}_{\hat{i}}$.
Note that since $\bar{p}_{\hat{i}}$ is the average of the terms with $p_i$ removed, $T_i \le 1$, and thus $JS_w(p_1, \dots, p_n) \le \sum_{i=1}^n w_i \log\frac{1}{w_i}$. We will need the following theorem from [14] for the upper bound.
Theorem 3.3 ([14] Theorem 1.1).
For densities $p_1, \dots, p_n$ with respect to a common reference measure $\gamma$, and weights $w_1, \dots, w_n > 0$ such that $\sum_{i=1}^n w_i = 1$,

(41) $0 \;\le\; h\!\left(\sum_{i=1}^n w_i\, p_i\right) - \sum_{i=1}^n w_i\, h(p_i) \;\le\; \sum_{i=1}^n w_i\, T_i \log\frac{1}{w_i},$

where $h(p) := -\int_X p \log p \, d\gamma$, and $T_i := TV(p_i, \bar{p}_{\hat{i}})$ with $\bar{p}_{\hat{i}} := \frac{1}{1 - w_i}\sum_{j \neq i} w_j\, p_j$.
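A numerical sketch of the total variation control of the generalized Jensen–Shannon divergence: the code below checks a Pinsker-type lower bound built from the leave-one-out mixtures, and the weight-entropy upper bound $JS_w \le \sum_i w_i \log\frac{1}{w_i}$ (all helper names are ours):

```python
import math

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

tv = lambda a, b: 0.5 * sum(abs(x - y) for x, y in zip(a, b))

ps = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
w = [0.2, 0.5, 0.3]
bar = [sum(wi * p[j] for wi, p in zip(w, ps)) for j in range(3)]
js_w = sum(wi * kl_div(p, bar) for wi, p in zip(w, ps))

lower = 0.0
for wi, p in zip(w, ps):
    # leave-one-out mixture and its total variation from p
    bar_i = [(bar[j] - wi * p[j]) / (1 - wi) for j in range(3)]
    T_i = tv(p, bar_i)
    lower += 2 * wi ** 2 * (1 - wi) ** 2 * T_i ** 2

H_w = sum(wi * math.log(1 / wi) for wi in w)  # entropy of the weights
assert lower - 1e-12 <= js_w <= H_w + 1e-12
print("generalized JS lies between the TV lower bound and H(w)")
```

The upper bound here uses only $T_i \le 1$; with the total variation terms retained, the bound tightens exactly when the components are close to their leave-one-out mixtures.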