The $f$-divergence unifies several important information measures between probability distributions, as integrals of a convex function $f$ composed with the Radon-Nikodym derivative of the two probability distributions. For a convex function $f : (0, \infty) \to \mathbb{R}$ such that $f(1) = 0$, and measures $\mu$ and $\nu$ such that $\mu \ll \nu$, the $f$-divergence from $\nu$ to $\mu$ is given by
$$D_f(\mu \| \nu) := \int f\left( \frac{d\mu}{d\nu} \right) d\nu.$$
The canonical example of an $f$-divergence, realized by taking $f(x) = x \log x$, is the relative entropy (often called the KL-divergence), and $f$-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity in its arguments, and a data processing inequality. Other important examples include the total variation, the $\chi^2$-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of  for more background.
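As a quick numerical illustration (a sketch of ours, not part of the paper; all function names below are our own), the defining integral reduces to a finite sum for discrete distributions:

```python
import math

def f_divergence(f, p, q):
    # D_f(p || q) = sum_i q_i * f(p_i / q_i), for strictly positive q.
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

# Canonical choices of f: each is convex with f(1) = 0.
kl = lambda x: x * math.log(x)                   # relative entropy (KL)
tv = lambda x: abs(x - 1) / 2                    # total variation, normalized to [0, 1]
chi2 = lambda x: (x - 1) ** 2                    # Pearson chi-squared
hellinger2 = lambda x: (math.sqrt(x) - 1) ** 2   # squared Hellinger

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
print(f_divergence(kl, p, q), f_divergence(chi2, p, q))
```

Non-negativity and $D_f(\mu\|\mu) = 0$ are immediate from Jensen's inequality in this discrete form.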
We will be interested in how stronger convexity properties of $f$ give improvements of classical $f$-divergence inequalities. This is in part inspired by the work of Sason , who demonstrated that divergences that are (as we define later) "$k$-convex" satisfy "stronger than $\chi^2$" data-processing inequalities.
Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when $f$ is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study $D_f(\mu\|\nu)$ in the presence of a bounded Radon-Nikodym derivative $\frac{d\mu}{d\nu}$. One naturally obtains such bounds for skew divergences, that is, divergences of the form $(\mu, \nu) \mapsto D_f\big((1-t)\mu + t\nu \,\big\|\, (1-s)\mu + s\nu\big)$ for $s, t \in [0, 1]$, as in this case
$$\frac{d\big((1-t)\mu + t\nu\big)}{d\big((1-s)\mu + s\nu\big)} \leq \max\left\{ \frac{1-t}{1-s}, \frac{t}{s} \right\}.$$
Important examples of skew divergences include the skew divergence  based on the relative entropy, and the Vincze-Le Cam divergence [22, 9], called the triangular discrimination in , and its generalization due to Györfi and Vajda  based on the $\chi^2$-divergence. The Jensen-Shannon divergence  and its recent generalization  give examples of $f$-divergences realized as linear combinations of skewed divergences.
Let us outline the paper. In Section 2 we derive elementary results on $k$-convex divergences and give a table of examples of $k$-convex divergences. We demonstrate that $k$-convex divergences can be lower bounded by the $\chi^2$-divergence, and that the joint convexity of the map $(\mu, \nu) \mapsto D_f(\mu\|\nu)$ can be sharpened under $k$-convexity conditions on $f$. As a consequence we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average $f$-divergence from the set to the barycenter.
In Section 3 we investigate general skewing of $f$-divergences. In particular we introduce the skew-symmetrization of an $f$-divergence, which recovers the Jensen-Shannon divergence and the Vincze-Le Cam divergence as special cases. We also show that a scaling of the Vincze-Le Cam divergence is minimal among skew-symmetrizations of $k$-convex divergences. We then consider linear combinations of skew divergences, and show that a generalized Vincze-Le Cam divergence (based on skewing the $\chi^2$-divergence) can be upper bounded by the generalized Jensen-Shannon divergence introduced recently by Nielsen  (based on skewing the relative entropy), reversing the obvious bound that can be obtained from the classical bound $\log x \leq x - 1$. We also derive upper and lower total variation bounds for Nielsen's generalized Jensen-Shannon divergence.
In Section 4 we consider a family of densities $\{p_i\}$ weighted by $\lambda_i$, and a density $q$. We use the Bayes estimator¹ to derive a convex decomposition of the barycenter $\sum_i \lambda_i p_i$ and of $q$, each into two auxiliary densities. We use this decomposition to sharpen, for $k$-convex divergences, an elegant theorem of Guntuboyina  that generalizes Fano's and Pinsker's inequalities to $f$-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina's inequality gives a new sharpening of Pinsker's inequality in terms of the convex decomposition induced by the Bayes estimator.

¹This is the Bayes estimator for the loss function .
We consider Borel probability measures $\mu$ and $\nu$ on a Polish space $X$. For a convex function $f : (0, \infty) \to \mathbb{R}$ such that $f(1) = 0$, define the $f$-divergence from $\nu$ to $\mu$, via densities $p$ for $\mu$ and $q$ for $\nu$ with respect to a common reference measure $\gamma$, as
$$D_f(\mu\|\nu) := \int f\left( \frac{p}{q} \right) q \, d\gamma.$$
We note that this representation is independent of $\gamma$, and such a reference measure always exists; take $\gamma = \mu + \nu$, for example.
For $a > 0$, we adopt the conventions
$$f(0) := \lim_{t \to 0^+} f(t), \qquad 0 \cdot f\!\left(\frac{0}{0}\right) := 0, \qquad 0 \cdot f\!\left(\frac{a}{0}\right) := a \lim_{t \to \infty} \frac{f(t)}{t},$$
so that the integral defining $D_f$ is well defined even when $p$ or $q$ vanishes.
For a random variable $X$ and a set $A$ we denote the probability that $X$ takes a value in $A$ by $\mathbb{P}(X \in A)$, the expectation of the random variable by $\mathbb{E} X$, and the variance by $\mathrm{Var}(X) := \mathbb{E}|X - \mathbb{E}X|^2$. For probability measures $\mu$ and $\nu$ such that $\nu(A) = 0$ implies $\mu(A) = 0$ for all Borel $A$, we write $\mu \ll \nu$, and when there exists a probability density function $p$ such that $\mu(A) = \int_A p \, d\gamma$ for a reference measure $\gamma$, we write $d\mu = p \, d\gamma$. For a probability measure $\mu$ on $X$, and a $\mu$-integrable function $\varphi$, we denote $\mu(\varphi) := \int \varphi \, d\mu$ for brevity.
2 Strongly convex divergences
An $\mathbb{R}$-valued function $f$ on a convex set $K$ is $k$-convex when $x, y \in K$ and $t \in [0, 1]$ implies
$$f(tx + (1-t)y) \leq t f(x) + (1-t) f(y) - \frac{k}{2} t (1-t) (x - y)^2. \tag{4}$$
For example, when $f$ is twice differentiable, (4) is equivalent to $f''(x) \geq k$ for $x \in K$. Note that the case $k = 0$ is just usual convexity.
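As a sanity check of ours (not in the paper), the defining inequality can be verified numerically for $f(x) = x \log x$ on $(0, b]$, where $f''(x) = 1/x \geq 1/b$ gives $k = 1/b$:

```python
import math
import random

def k_convex_gap(f, k, x, y, t):
    # Left side minus right side of the k-convexity inequality;
    # nonpositive everywhere exactly when f is k-convex.
    lhs = f(t * x + (1 - t) * y)
    rhs = t * f(x) + (1 - t) * f(y) - k * t * (1 - t) * (x - y) ** 2 / 2
    return lhs - rhs

f = lambda x: x * math.log(x)
b = 5.0
k = 1 / b  # f''(x) = 1/x >= 1/b on (0, b]

random.seed(0)
for _ in range(10_000):
    x, y = random.uniform(1e-3, b), random.uniform(1e-3, b)
    t = random.random()
    assert k_convex_gap(f, k, x, y, t) <= 1e-12
```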
For $k \geq 0$ and $f : (a, b) \to \mathbb{R}$, the following are equivalent:

The function $f$ is $k$-convex on $(a, b)$.

The function $g(x) := f(x) - \frac{k}{2} x^2$ is convex on $(a, b)$.

The right handed derivative, defined as $f'(x) := \lim_{h \to 0^+} \frac{f(x + h) - f(x)}{h}$, satisfies $f'(y) - f'(x) \geq k (y - x)$ for $a < x \leq y < b$.

Observe that it is enough to prove the result when $k = 0$, where the proposition is reduced to the classical result for convex functions. ∎
An $f$-divergence $D_f$ is $k$-convex on an interval $I$ for $k \geq 0$ when the function $f$ is $k$-convex on $I$.
The table below lists some $k$-convex $f$-divergences of interest to this article.

| divergence | $f(x)$ | interval | $k$ |
| --- | --- | --- | --- |
| relative entropy (KL) | $x \log x$ | $(0, b]$ | $1/b$ |
| reverse relative entropy | $-\log x$ | $(0, b]$ | $1/b^2$ |
| Vincze-Le Cam | $\frac{(x-1)^2}{x+1}$ | $(0, b]$ | $8/(1+b)^3$ |
Observe that we have taken the normalization convention on the total variation, which we denote by $TV$, such that $f(x) = \frac{1}{2}|x - 1|$, so that $TV$ takes values in $[0, 1]$. Also note, the $\alpha$-divergence interpolates Pearson's $\chi^2$-divergence, one half Neyman's $\chi^2$-divergence, and the squared Hellinger divergence at particular values of $\alpha$, and has as limiting cases the relative entropy when $\alpha \to 1$ and the reverse relative entropy when $\alpha \to 0$. If $f$ is $k$-convex on $(a, b)$, then its dual divergence is $k a^3$-convex on $(1/b, 1/a)$, since $(f^*)''(x) = f''(1/x)/x^3$. Recall that the dual divergence $D_{f^*}$, given by $f^*(x) := x f(1/x)$, satisfies the equality $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$. For brevity, we will use $\chi^2$-divergence to refer to the Pearson $\chi^2$-divergence, and will articulate Neyman's explicitly when necessary.
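The duality relation $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$ is easy to confirm numerically; the following sketch (ours, with our own function names) does so for the relative entropy on a discrete space:

```python
import math

def f_divergence(f, p, q):
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def dual(f):
    # f*(x) = x f(1/x); convex with f*(1) = 0 whenever f is.
    return lambda x: x * f(1 / x)

kl = lambda x: x * math.log(x)  # dual(kl)(x) = -log x, the reverse relative entropy
p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
# The dual divergence swaps the arguments.
assert abs(f_divergence(dual(kl), p, q) - f_divergence(kl, q, p)) < 1e-12
```

The involution property $f^{**} = f$ used later in Section 3 also follows directly from this formula.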
The next lemma is a restatement of Jensen’s inequality.
If $f$ is $k$-convex on the range of a random variable $X$, then
$$\mathbb{E} f(X) - f(\mathbb{E} X) \geq \frac{k}{2} \mathrm{Var}(X).$$

Apply Jensen's inequality to $g(x) := f(x) - \frac{k}{2} x^2$. ∎
For a convex function $f$ such that $f(1) = 0$, and $c \in \mathbb{R}$, the function $\tilde{f}(x) := f(x) - c(x - 1)$ remains a convex function with $\tilde{f}(1) = 0$, and what is more satisfies $D_{\tilde{f}}(\mu\|\nu) = D_f(\mu\|\nu)$, since $\int \left( \frac{p}{q} - 1 \right) q \, d\gamma = 0$.
Definition 2.5 ($\chi^2$-divergence).
For $f(x) = (x - 1)^2$, we write
$$\chi^2(\mu\|\nu) := D_f(\mu\|\nu) = \int \left( \frac{p}{q} - 1 \right)^2 q \, d\gamma.$$
The following result shows that every strongly convex divergence can be lower bounded, up to its convexity constant $k$, by the $\chi^2$-divergence.

For a $k$-convex function $f$,
$$D_f(\mu\|\nu) \geq \frac{k}{2} \chi^2(\mu\|\nu).$$

Define $\tilde{f}(x) := f(x) - f'(1)(x - 1)$, and note that $\tilde{f}$ defines the same $k$-convex divergence as $f$. So we may assume without loss of generality that $f$ is uniquely zero when $x = 1$. Since $f$ is $k$-convex, $g(x) := f(x) - \frac{k}{2}(x - 1)^2$ is convex, with $g(1) = 0$ and $g'(1) = f'(1) = 0$. Thus $g$ takes its minimum when $x = 1$, and hence $g \geq 0$, so that $f(x) \geq \frac{k}{2}(x - 1)^2$. Computing,
$$D_f(\mu\|\nu) = \int f\left(\frac{p}{q}\right) q \, d\gamma \geq \frac{k}{2} \int \left(\frac{p}{q} - 1\right)^2 q \, d\gamma = \frac{k}{2} \chi^2(\mu\|\nu). \qquad ∎$$
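The bound just proved can be tested numerically; in this sketch (ours, not the paper's) we take the relative entropy, which is $(1/b)$-convex on $(0, b]$ with $b$ the largest density ratio:

```python
import math

def f_divergence(f, p, q):
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

kl = lambda x: x * math.log(x)
chi2 = lambda x: (x - 1) ** 2

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
b = max(pi / qi for pi, qi in zip(p, q))  # density ratios lie in (0, b]
k = 1 / b                                 # x log x is (1/b)-convex on (0, b]

# D_f(mu || nu) >= (k/2) chi^2(mu || nu)
assert f_divergence(kl, p, q) >= k / 2 * f_divergence(chi2, p, q)
```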
The above proof uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in , where it appears as Theorem 1, and was used to give sharp comparisons in several -divergence inequalities.
Theorem 2.2 (Sason-Verdú ).
For divergences defined by $f$ and $g$ with $f(t) \leq c \, g(t)$ for all $t$, then $D_f(\mu\|\nu) \leq c \, D_g(\mu\|\nu)$.
Moreover, if $c = \sup_{t \neq 1} f(t)/g(t)$, then the constant $c$ is best possible.
For a smooth $k$-convex divergence $D_f$ with $f''(1) = k$, the inequality $D_f \geq \frac{k}{2}\chi^2$ is sharp multiplicatively in the sense that
$$\lim_{t \to 0^+} \frac{D_f(\mu_t\|\nu)}{\frac{k}{2} \chi^2(\mu_t\|\nu)} = 1, \qquad \text{where } \mu_t := (1 - t)\nu + t\mu.$$
When $D_f$ is an $f$-divergence such that $f$ is $k$-convex on $[a, b]$, and $\{\mu_\theta\}_{\theta \in \Theta}$ and $\{\nu_\theta\}_{\theta \in \Theta}$ are probability measures indexed by a set $\Theta$ such that $a \leq \frac{d\mu_\theta}{d\nu_\theta} \leq b$ holds for all $\theta$, then for a probability measure $\pi$ on $\Theta$,
$$\int_\Theta D_f(\mu_\theta\|\nu_\theta) \, d\pi(\theta) \geq D_f(\bar{\mu}\|\bar{\nu}) + \frac{k}{2}\left[ \int_\Theta \chi^2(\mu_\theta\|\nu_\theta) \, d\pi(\theta) - \chi^2(\bar{\mu}\|\bar{\nu}) \right],$$
where $\bar{\mu} := \int_\Theta \mu_\theta \, d\pi(\theta)$ and $\bar{\nu} := \int_\Theta \nu_\theta \, d\pi(\theta)$.
In particular, when $\nu_\theta = \nu$ for all $\theta$,
$$\int_\Theta D_f(\mu_\theta\|\nu) \, d\pi(\theta) \geq D_f(\bar{\mu}\|\nu) + \frac{k}{2}\left[ \int_\Theta \chi^2(\mu_\theta\|\nu) \, d\pi(\theta) - \chi^2(\bar{\mu}\|\nu) \right].$$
Let $\gamma$ denote a reference measure dominating the $\mu_\theta$ and $\nu_\theta$, so that we may write $d\mu_\theta = p_\theta \, d\gamma$ and $d\nu_\theta = q_\theta \, d\gamma$.
By Jensen’s inequality, as in Lemma 2.4
Integrating this inequality gives
Inserting these equalities into (15) gives the result.
To obtain the total variation bound one needs only to apply Jensen’s inequality,
Observe that taking $\nu = \bar{\mu} := \int_\Theta \mu_\theta \, d\pi(\theta)$ in Proposition 2.7, one obtains a lower bound for the average $f$-divergence from the set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,
$$\int_\Theta D_f(\mu_\theta\|\bar{\mu}) \, d\pi(\theta) \geq 2k \int_\Theta TV^2(\mu_\theta, \bar{\mu}) \, d\pi(\theta).$$
The next result shows that for $f$ strongly convex, Pinsker type inequalities can never be reversed.

Given $f$ strongly convex and $M > 0$, there exist measures $\mu$ and $\nu$ such that $D_f(\mu\|\nu) > M \cdot TV(\mu, \nu)$.

By $k$-convexity, $f(x) - \frac{k}{2}x^2$ is a convex function. Thus, after replacing $f$ by $\tilde{f}(x) = f(x) - f'(1)(x - 1)$, which leaves the divergence unchanged, $f(x) \geq \frac{k}{2}(x - 1)^2$, and hence $f(n)/n \to \infty$. Taking measures on the two point space $\{0, 1\}$ with $\mu_n(\{1\}) = 1/n$ and $\nu_n(\{1\}) = 1/n^2$ gives
$$\frac{D_f(\mu_n\|\nu_n)}{TV(\mu_n, \nu_n)} \geq \frac{n^{-2} f(n)}{1/n} = \frac{f(n)}{n},$$
which tends to infinity with $n$, while $TV(\mu_n, \nu_n) \leq 1/n$.
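The construction in the proof is easy to visualize numerically; this sketch (ours) uses the $\chi^2$-divergence, which is $2$-convex, on the two point space:

```python
def chi2_div(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def tv(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q)) / 2

ratios = []
for n in [10, 100, 1000]:
    mu = [1 - 1 / n, 1 / n]
    nu = [1 - 1 / n ** 2, 1 / n ** 2]
    ratios.append(chi2_div(mu, nu) / tv(mu, nu))

# The ratio grows roughly like n: no inequality chi^2 <= M * TV can hold.
assert ratios[0] < ratios[1] < ratios[2]
```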
In fact, building on the work of [3, 12], Sason and Verdú proved in , that for any $f$-divergence,
$$\sup_{\mu \neq \nu} \frac{D_f(\mu\|\nu)}{TV(\mu, \nu)} = f(0) + f^*(0).$$
Thus, an $f$-divergence can be bounded above by a constant multiple of the total variation if and only if $f(0) + f^*(0) < \infty$. From this perspective, Proposition 2.8 is simply the obvious fact that strongly convex functions have superlinear (at least quadratic) growth at infinity.
3 Skew divergences
If we denote by $\mathcal{C}$ the quotient of the cone of convex functions $f$ on $(0, \infty)$ such that $f(1) = 0$, under the equivalence relation $f \sim g$ when $f(x) = g(x) + c(x - 1)$ for some $c \in \mathbb{R}$, then the map $f \mapsto D_f$ gives a linear isomorphism between $\mathcal{C}$ and the space of all $f$-divergences. The mapping defined by $f \mapsto f^*$, where we recall $f^*(x) := x f(1/x)$, gives an involution of $\mathcal{C}$. Indeed, $f^{**} = f$, so that $D_{f^{**}} = D_f$. Mathematically, skew divergences give an interpolation of this involution, as $D_f\big((1-t)\mu + t\nu \,\big\|\, (1-s)\mu + s\nu\big)$ gives $D_f(\mu\|\nu)$ by taking $t = 0$ and $s = 1$, or yields $D_{f^*}(\mu\|\nu) = D_f(\nu\|\mu)$ by taking $t = 1$ and $s = 0$.
Moreover, as mentioned in the introduction, skewing imposes boundedness of the Radon-Nikodym derivative $\frac{d((1-t)\mu + t\nu)}{d((1-s)\mu + s\nu)}$, which allows us to constrain the domain of $f$-divergences and leverage $k$-convexity to obtain $f$-divergence inequalities in this section.
The following appears as Theorem III.1 in the preprint . It states that skewing an -divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed -divergences. A proof is given in the appendix for the convenience of the reader.
Theorem 3.1 (Melbourne et al ).
For $s, t \in [0, 1]$ and $D_f$ an $f$-divergence, the map
$$(\mu, \nu) \mapsto D_f\big((1-t)\mu + t\nu \,\big\|\, (1-s)\mu + s\nu\big)$$
is an $f$-divergence if $D_f$ is.
For $D_f$ an $f$-divergence, its skew symmetrization,
$$\Delta_f(\mu, \nu) := \frac{1}{2} D_f\left(\mu \,\Big\|\, \frac{\mu + \nu}{2}\right) + \frac{1}{2} D_f\left(\nu \,\Big\|\, \frac{\mu + \nu}{2}\right),$$
is determined by the convex function
$$\tilde{f}(x) = \frac{x + 1}{4}\left[ f\left(\frac{2x}{x + 1}\right) + f\left(\frac{2}{x + 1}\right) \right].$$
Observe that $\Delta_f(\mu, \nu) = \Delta_f(\nu, \mu)$, and when $f(0) < \infty$, $\Delta_f(\mu, \nu) < \infty$ for all $\mu$, $\nu$ since $\frac{2x}{x+1} \in [0, 2]$, $\frac{2}{x+1} \in [0, 2]$. When $f(x) = x \log x$, the relative entropy's skew symmetrization is the Jensen-Shannon divergence. When $f(x) = (x - 1)^2$, up to a normalization constant the $\chi^2$-divergence's skew symmetrization is the Vincze-Le Cam divergence, which we state below for emphasis. See  for more background on this divergence, where it is referred to as the triangular discrimination.
When $f(x) = \frac{(x - 1)^2}{x + 1}$, denote the Vincze-Le Cam divergence by
$$\Delta(\mu, \nu) := D_f(\mu\|\nu) = \int \frac{(p - q)^2}{p + q} \, d\gamma.$$
If one denotes the skew symmetrization of the $\chi^2$-divergence by $\Delta_{\chi^2}$, one can compute easily from (24) that $\Delta_{\chi^2} = \frac{1}{2}\Delta$. We note that although skewing preserves convexity, by the above example, it does not preserve $k$-convexity in general. The skew symmetrization of the $\chi^2$-divergence, a $2$-convex divergence, corresponds to the convex function $\tilde{f}(x) = \frac{(x - 1)^2}{2(x + 1)}$, which satisfies $\tilde{f}''(x) = \frac{4}{(x + 1)^3}$, a quantity that cannot be bounded away from zero on $(0, \infty)$.
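The identity $\Delta_{\chi^2} = \frac{1}{2}\Delta$ can be confirmed numerically; the sketch below (ours, with our own function names) compares the two on a discrete space:

```python
def chi2_div(p, q):
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))

def vincze_le_cam(p, q):
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q))

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # midpoint mixture

skew_sym = 0.5 * chi2_div(p, m) + 0.5 * chi2_div(q, m)
assert abs(skew_sym - 0.5 * vincze_le_cam(p, q)) < 1e-12
```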
For $D_f$ an $f$-divergence such that $f$ is $k$-convex on $[0, 2]$,
$$\Delta_f(\mu, \nu) \geq \frac{k}{4} \Delta(\mu, \nu),$$
with equality when $f$ corresponds to the $\chi^2$-divergence, where $\Delta_f$ denotes the skew symmetrized divergence associated to $f$ and $\Delta$ is the Vincze-Le Cam divergence.

Applying Proposition 2.7,
$$\Delta_f(\mu, \nu) \geq \frac{k}{2}\left[\frac{1}{2}\chi^2\left(\mu \,\Big\|\, \frac{\mu + \nu}{2}\right) + \frac{1}{2}\chi^2\left(\nu \,\Big\|\, \frac{\mu + \nu}{2}\right)\right] = \frac{k}{2}\Delta_{\chi^2}(\mu, \nu) = \frac{k}{4}\Delta(\mu, \nu). \qquad ∎$$
When $f(x) = x \log x$, we have $f''(x) = 1/x \geq 1/2$ on $(0, 2]$, which demonstrates that up to a constant the Jensen-Shannon divergence bounds the Vincze-Le Cam divergence, $JSD(\mu, \nu) \geq \frac{1}{8}\Delta(\mu, \nu)$. See  for improvement of the inequality in the case of the Jensen-Shannon divergence, called the "capacitory discrimination" in the reference, by a factor of .
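A numerical spot check (ours) of the resulting bound $JSD \geq \frac{1}{8}\Delta$:

```python
import math

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def vincze_le_cam(p, q):
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q))

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
assert jsd(p, q) >= vincze_le_cam(p, q) / 8
```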
We will now investigate more general, non-symmetric skewing in what follows.
For $s, t \in [0, 1]$, define
$$D_f^{s,t}(\mu\|\nu) := D_f\big((1-t)\mu + t\nu \,\big\|\, (1-s)\mu + s\nu\big).$$
Lemma 3.5 (Theorem III.1 ).
For $t \in [0, 1]$ and probability measures $\mu_1, \mu_2$ and $\nu_1, \nu_2$,
$$D_f\big(t\mu_1 + (1-t)\mu_2 \,\big\|\, t\nu_1 + (1-t)\nu_2\big) \leq t D_f(\mu_1\|\nu_1) + (1-t) D_f(\mu_2\|\nu_2).$$
Proof of Theorem 3.4.
If , then and . Also,
with , thus
where the inequality follows from Lemma 3.5. Following the same argument for , so that , , and
for completes the proof. Indeed,
For probability measure and ,
Definition 3.7 (Nielsen ).
For $n \geq 2$, $\alpha \in [0, 1]^n$, and densities $p$, $q$ with respect to a reference measure $\mu$, with weights $w_i > 0$ such that $\sum_{i=1}^n w_i = 1$ and $\bar{\alpha} := \sum_{i=1}^n w_i \alpha_i \in (0, 1)$, define
$$JS^{\alpha, w}(p : q) := \sum_{i=1}^n w_i \, D\big((1 - \alpha_i) p + \alpha_i q \,\big\|\, (1 - \bar{\alpha}) p + \bar{\alpha} q\big),$$
where $D$ denotes the relative entropy.
Note that when $n = 2$, $w_1 = w_2 = \frac{1}{2}$, and $(\alpha_1, \alpha_2) = (0, 1)$, we recover $JS^{\alpha, w}(p : q) = JSD(p, q)$, the usual Jensen-Shannon divergence. We now demonstrate that Nielsen's generalized Jensen-Shannon divergence can be bounded by the total variation distance just as the ordinary Jensen-Shannon divergence.
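Assuming Nielsen's vector-skew form recalled above, the reduction to the classical Jensen-Shannon divergence for $n = 2$, $w = (\frac{1}{2}, \frac{1}{2})$, $\alpha = (0, 1)$ can be checked directly (a sketch of ours; the function names are not Nielsen's):

```python
import math

def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p, q, alphas, weights):
    # Vector-skew Jensen-Shannon divergence: a w-average of relative
    # entropies between skewed mixtures and their w-barycenter.
    abar = sum(w * a for w, a in zip(weights, alphas))
    mix = lambda a: [(1 - a) * pi + a * qi for pi, qi in zip(p, q)]
    ref = mix(abar)
    return sum(w * kl_div(mix(a), ref) for w, a in zip(weights, alphas))

p = [0.2, 0.3, 0.5]
q = [0.4, 0.4, 0.2]
m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
classic = 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
assert abs(generalized_jsd(p, q, [0.0, 1.0], [0.5, 0.5]) - classic) < 1e-12
```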
For and densities with respect to a reference measure , , such that and then,
where and with
Note that since is the average of the terms with removed, and thus . We will need the following Theorem from  for the upper bound.
Theorem 3.3 ( Theorem 1.1).
For densities with respect to a common reference measure , and such that ,
where , and with .