Strongly Convex Divergences

09/22/2020
by   James Melbourne, et al.

We consider a sub-class of the f-divergences satisfying a stronger convexity property, which we refer to as strongly convex, or κ-convex divergences. We derive new and old relationships, based on convexity arguments, between popular f-divergences.


1 Introduction

The concept of an f-divergence, introduced independently by Ali-Silvey [1] and Csiszár [6], unifies several important information measures between probability distributions, as integrals of a convex function f composed with the Radon-Nikodym derivative of the two probability distributions. For a convex function f : (0,∞) → ℝ such that f(1) = 0, and measures μ and ν such that μ ≪ ν, the f-divergence from μ to ν is given by

D_f(μ‖ν) := ∫ f(dμ/dν) dν.

The canonical example of an f-divergence, realized by taking f(x) = x log x, is the relative entropy (often called the KL-divergence), and f-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity in the arguments, and a data processing inequality. Other important examples include the total variation, the χ²-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [17] for more background.
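For readers who want to experiment, here is a minimal numerical sketch of these examples (an added illustration; it assumes strictly positive discrete distributions and uses the standard generators, whose normalizations may differ from the conventions adopted later in the paper):

    import numpy as np

    def f_divergence(p, q, f):
        # D_f(p||q) = sum_i q_i * f(p_i / q_i) for strictly positive discrete distributions
        p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
        return float(np.sum(q * f(p / q)))

    # standard convex generators, each with f(1) = 0
    kl        = lambda x: x * np.log(x)          # relative entropy
    chi2      = lambda x: (x - 1) ** 2           # Pearson chi-squared
    tv        = lambda x: np.abs(x - 1) / 2      # total variation (normalized to [0, 1])
    hellinger = lambda x: (1 - np.sqrt(x)) ** 2  # squared Hellinger

    p = [0.5, 0.3, 0.2]
    q = [0.4, 0.4, 0.2]
    for name, f in [("KL", kl), ("chi^2", chi2), ("TV", tv), ("Hellinger^2", hellinger)]:
        print(name, f_divergence(p, q, f))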

We will be interested in how stronger convexity properties of f give improvements of classical f-divergence inequalities. This is in part inspired by the work of Sason [18], who demonstrated that divergences that are (as we define later) "κ-convex" satisfy "stronger than χ²" data-processing inequalities.

Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when f is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study D_f(μ‖ν) in the presence of a bounded Radon-Nikodym derivative dμ/dν. One naturally obtains such bounds for skew divergences, that is, divergences of the form (μ, ν) ↦ D_f((1 − t)μ + tν ‖ (1 − s)μ + sν) for s, t ∈ [0, 1], as in this case the relevant Radon-Nikodym derivative is bounded. Important examples of skew divergences include the skew divergence [10] based on the relative entropy, and the Vincze-Le Cam divergence [22, 9], called the triangular discrimination in [21], and its generalization due to Györfi and Vajda [8] based on the χ²-divergence. The Jensen-Shannon divergence [11] and its recent generalization [15] give examples of f-divergences realized as linear combinations of skewed divergences.
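To make the boundedness claim concrete, the following elementary estimate (an added verification using only the nonnegativity of densities) bounds the relevant Radon-Nikodym derivative whenever s ∈ (0, 1): for any common reference measure λ with dμ/dλ = p and dν/dλ = q,

d((1 − t)μ + tν)/d((1 − s)μ + sν) = ((1 − t)p + tq)/((1 − s)p + sq) ≤ max{(1 − t)/(1 − s), t/s},

with the analogous lower bound min{(1 − t)/(1 − s), t/s}. In particular, for the symmetric skewing s = 1/2 the ratio lies in [0, 2] when t ∈ [0, 1].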

Let us outline the paper. In Section 2 we derive elementary results on κ-convex divergences and collect examples of κ-convex divergences of interest. We demonstrate that κ-convex divergences can be lower bounded by the χ²-divergence, and that the joint convexity of the map (μ, ν) ↦ D_f(μ‖ν) can be sharpened under κ-convexity conditions on f. As a consequence we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average f-divergence from the set to the barycenter.

In Section 3 we investigate general skewing of f-divergences. In particular we introduce the skew-symmetrization of an f-divergence, which recovers the Jensen-Shannon divergence and the Vincze-Le Cam divergence as special cases. We also show that a scaling of the Vincze-Le Cam divergence is minimal among skew-symmetrizations of κ-convex divergences. We then consider linear combinations of skew divergences, and show that a generalized Vincze-Le Cam divergence (based on skewing the χ²-divergence) can be upper bounded by the generalized Jensen-Shannon divergence introduced recently by Nielsen [15] (based on skewing the relative entropy), reversing the obvious bound obtainable from the classical comparison between the relative entropy and the χ²-divergence. We also derive upper and lower total variation bounds for Nielsen's generalized Jensen-Shannon divergence.

In Section 4 we consider a family of densities weighted by a probability measure, and a further density q. We use the Bayes estimator (that is, the estimator that is Bayes optimal for the associated loss function) to derive a convex decomposition of the barycenter and of q, each into two auxiliary densities. We use this decomposition to sharpen, for κ-convex divergences, an elegant theorem of Guntuboyina [7] that generalizes Fano and Pinsker's inequality to f-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina's inequality gives a new sharpening of Pinsker's inequality in terms of the convex decomposition induced by the Bayes estimator.

Notation

We consider Borel probability measures μ and ν on a Polish space X. For a convex function f : (0,∞) → ℝ such that f(1) = 0, define the f-divergence from μ to ν, via densities p for μ and q for ν with respect to a common reference measure λ, as

D_f(μ‖ν) := ∫ f(p/q) q dλ    (1)
          = ∫ f(p/q) dν.    (2)

We note that this representation is independent of the choice of λ, and such a reference measure always exists; take λ = μ + ν, for example.

For p, q ≥ 0, define the integrand

q f(p/q)    (3)

with the conventions f(0) := lim_{t → 0+} f(t), 0·f(0/0) := 0, and 0·f(a/0) := a·lim_{t → ∞} f(t)/t for a > 0. For a random variable X and a set A we denote the probability that X takes a value in A by P(X ∈ A), the expectation of the random variable by E[X], and the variance by Var(X). For probability measures μ and ν satisfying ν(A) = 0 ⟹ μ(A) = 0 for all Borel A, we write μ ≪ ν, and when there exists a probability density function p such that μ(A) = ∫_A p dλ for a reference measure λ, we write dμ = p dλ. For a probability measure μ on X and a μ-integrable function g, we denote μ(g) := ∫ g dμ.

2 Strongly convex divergences

Definition 2.1.

An ℝ-valued function f on a convex set K is κ-convex when x, y ∈ K and t ∈ [0, 1] implies

f((1 − t)x + ty) ≤ (1 − t) f(x) + t f(y) − (κ/2) t(1 − t)(x − y)²    (4)

For example, when f is twice differentiable, (4) is equivalent to f″(x) ≥ κ for x ∈ K. Note that the case κ = 0 is just usual convexity.
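For example, applying the second-derivative criterion to two generators that appear repeatedly below (an added illustration; the domains are chosen for concreteness):

f(x) = x log x:  f″(x) = 1/x ≥ 1/b on (0, b],
f(x) = (x − 1)²:  f″(x) ≡ 2 on (0, ∞),

so the relative entropy generator is (1/b)-convex on (0, b], Pearson's χ² generator is 2-convex on all of (0, ∞), and the total variation generator |x − 1|/2 admits only κ = 0.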

Proposition 2.2.

For κ ≥ 0 and f : K → ℝ, the following are equivalent:

  1. f is κ-convex.

  2. The function x ↦ f(x) − (κ/2)(x − c)² is convex for any c ∈ ℝ.

  3. The right-handed derivative, defined as f'_+(x) := lim_{h → 0+} (f(x + h) − f(x))/h, satisfies

    f'_+(y) − f'_+(x) ≥ κ(y − x) for x ≤ y.

Proof.

Observe that it is enough to prove the result when κ = 0, by replacing f with x ↦ f(x) − (κ/2)x², where the proposition is reduced to the classical result for convex functions. ∎

Definition 2.3.

An f-divergence D_f is κ-convex on an interval K, for κ ≥ 0, when the function f is κ-convex on K.

The following f-divergences, each κ-convex on an appropriate domain, will be of interest to this article:

  relative entropy (KL)
  total variation
  Pearson's χ²
  squared Hellinger
  reverse relative entropy
  Vincze-Le Cam
  Jensen-Shannon
  Neyman's χ²
  Sason's divergence
  α-divergence
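For reference, the standard convex generators of several of the divergences just listed are recorded here (these are the usual textbook choices; the normalizations, and hence the accompanying convexity moduli κ, may differ by constant factors from the conventions used in this paper):

  relative entropy:          f(x) = x log x
  total variation:           f(x) = |x − 1|/2
  Pearson's χ²:              f(x) = (x − 1)²
  squared Hellinger:         f(x) = (1 − √x)²
  reverse relative entropy:  f(x) = −log x
  Vincze-Le Cam:             f(x) = (x − 1)²/(x + 1)
  Neyman's χ²:               f(x) = (1 − x)²/x

On a bounded interval (0, b], each twice-differentiable generator has modulus κ = inf_{x ∈ (0, b]} f″(x); for instance κ = 1/b for the relative entropy and κ = 2 for Pearson's χ².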

Observe that we have fixed a normalization convention for the total variation, which we denote by |·|_TV, under which it takes values in [0, 1]. Also note that the α-divergence interpolates Pearson's χ²-divergence, one half of Neyman's χ²-divergence, and the squared Hellinger divergence at particular values of α, and has as limiting cases the relative entropy and the reverse relative entropy. If f is κ-convex on [a, b] ⊆ (0, ∞), then its dual divergence, given by f*(x) := x f(1/x), is κa³-convex on [1/b, 1/a]. Recall that f* satisfies the equality D_{f*}(μ‖ν) = D_f(ν‖μ). For brevity, we will use χ²-divergence to refer to the Pearson χ²-divergence, and will articulate Neyman's χ² explicitly when necessary.
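As a worked illustration of this duality (added for convenience; the computations are standard): with f*(x) = x f(1/x),

  f(x) = x log x   ⟹  f*(x) = x · (1/x) log(1/x) = −log x,
  f(x) = (x − 1)²  ⟹  f*(x) = x (1/x − 1)² = (1 − x)²/x,

so the dual of the relative entropy is the reverse relative entropy, and the dual of Pearson's χ² is Neyman's χ². When f is twice differentiable, (f*)″(x) = f″(1/x)/x³, which is the source of the rescaled convexity modulus mentioned above.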

The next lemma is a restatement of Jensen’s inequality.

Lemma 2.4.

If f is κ-convex on the range of a random variable X, then

E f(X) ≥ f(E X) + (κ/2) Var(X).

Proof.

Apply Jensen's inequality to g(x) := f(x) − (κ/2)x². ∎

For a convex function f such that f(1) = 0, and c ∈ ℝ, the function g(x) := f(x) + c(x − 1) remains a convex function, and what is more satisfies D_g(μ‖ν) = D_f(μ‖ν), since ∫ (dμ/dν − 1) dν = 0.

Definition 2.5 (χ²-divergence).

For f(x) = (x − 1)², we write χ²(μ‖ν) := D_f(μ‖ν).

The following result shows that every strongly convex divergence can be lower bounded, up to its convexity constant κ, by the χ²-divergence.

Theorem 2.1.

For a κ-convex function f,

D_f(μ‖ν) ≥ (κ/2) χ²(μ‖ν).

Proof.

Define g(x) := f(x) − f'_+(1)(x − 1), and note that g defines the same κ-convex divergence as f. So we may assume without loss of generality that f'_+(1) = 0, so that f is uniquely zero when x = 1. Since f is κ-convex, h(x) := f(x) − (κ/2)(x − 1)² is convex, and by f'_+(1) = 0, h'_+(1) = 0 as well. Thus h takes its minimum when x = 1 and hence h ≥ 0, so that f(x) ≥ (κ/2)(x − 1)². Computing,

D_f(μ‖ν) = ∫ f(dμ/dν) dν ≥ (κ/2) ∫ (dμ/dν − 1)² dν = (κ/2) χ²(μ‖ν). ∎

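A quick numerical sanity check of Theorem 2.1 (an added illustration, not part of the original text): for the relative entropy, f(x) = x log x is (1/b)-convex once the likelihood ratio dμ/dν is confined to (0, b], so the theorem predicts D_KL(μ‖ν) ≥ (1/(2b)) χ²(μ‖ν) for such pairs.

    import numpy as np

    rng = np.random.default_rng(0)

    def kl(p, q):
        # relative entropy in nats for strictly positive discrete distributions
        return float(np.sum(p * np.log(p / q)))

    def chi2(p, q):
        # Pearson chi-squared divergence
        return float(np.sum((p - q) ** 2 / q))

    for _ in range(1000):
        p = rng.dirichlet(np.ones(5))
        q = rng.dirichlet(np.ones(5))
        b = float(np.max(p / q))  # on (0, b] the generator x log x has f''(x) = 1/x >= 1/b
        assert kl(p, q) >= chi2(p, q) / (2 * b) - 1e-12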
The above proof uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [19], where it appears as Theorem 1, and was used to give sharp comparisons in several f-divergence inequalities.

Theorem 2.2 (Sason-Verdú [19]).

For divergences defined by f and g with g(t) > 0 for t ≠ 1, set c := sup_{t ≠ 1} f(t)/g(t); then D_f(μ‖ν) ≤ c · D_g(μ‖ν).

Moreover, if f'(1) = g'(1) = 0, then sup_{μ ≠ ν} D_f(μ‖ν)/D_g(μ‖ν) = c.

Corollary 2.6.

For a smooth κ-convex divergence f, the inequality

D_f(μ‖ν) ≥ (κ/2) χ²(μ‖ν)    (5)

is sharp multiplicatively in the sense that

inf_{μ ≠ ν} D_f(μ‖ν)/χ²(μ‖ν) = κ/2    (6)

if f″(1) = κ.

Proof.

Without loss of generality we assume that f'(1) = 0. If D_f(μ‖ν) ≥ c (κ/2) χ²(μ‖ν) held for all μ, ν and some c > 1, then taking g(x) = (x − 1)² and applying Theorem 2.2 and Theorem 2.1,

f(t) ≥ c (κ/2)(t − 1)² for all t.    (7)

Observe that after two applications of L'Hôpital's rule,

lim_{t → 1} f(t)/(t − 1)² = f″(1)/2 = κ/2,

which is incompatible with (7) for c > 1. Thus (6) follows. ∎

Proposition 2.7.

Suppose D_f is an f-divergence such that f is κ-convex on [a, b], that {P_θ} and {Q_θ} are probability measures indexed by a set Θ such that a ≤ dP_θ/dQ_θ′ ≤ b holds for all θ and θ′, and that μ is a probability measure on Θ. Then

(8)

In particular, when Q_θ = Q for all θ,

(9)
(10)
(11)
Proof.

Let λ denote a reference measure dominating the relevant measures, so that densities exist, and write p_θ = dP_θ/dλ and q_θ = dQ_θ/dλ.

(12)
(13)
(14)

By Jensen’s inequality, as in Lemma 2.4

Integrating this inequality gives

(15)

Note that

and

(16)
(17)
(18)

Inserting these equalities into (15) gives the result.
To obtain the total variation bound one needs only to apply Jensen’s inequality,

(19)
(20)

Observe that, taking the Q_θ in Proposition 2.7 to be the barycenter of the P_θ, one obtains a lower bound for the average f-divergence from a set of distributions to their barycenter by the mean square total variation of the set of distributions to the barycenter,

(21)

The next result shows that for f strongly convex, Pinsker type inequalities can never be reversed.

Proposition 2.8.

Given f strongly convex and M > 0, there exist probability measures μ and ν such that

D_f(μ‖ν) > M |μ − ν|_TV.    (22)
Proof.

By κ-convexity, f(x) − (κ/2)x² is a convex function. Thus f has at least quadratic growth, and hence f(x)/x → ∞ as x → ∞. Taking measures on the two point space for which the likelihood ratio dμ/dν grows without bound gives D_f(μ‖ν) tending to infinity, while |μ − ν|_TV remains bounded by one. ∎
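To see the failure of a reverse Pinsker inequality concretely, the following small numerical illustration (an addition; the two-point construction is one convenient choice, not necessarily the one intended in the proof) tracks Pearson's χ²-divergence, which lower bounds every κ-convex divergence by Theorem 2.1, against the total variation:

    import numpy as np

    def chi2(p, q):
        return float(np.sum((p - q) ** 2 / q))

    def tv(p, q):
        return 0.5 * float(np.sum(np.abs(p - q)))

    for n in [10, 100, 1000, 10000]:
        mu = np.array([0.5, 0.5])
        nu = np.array([1.0 / n, 1.0 - 1.0 / n])
        # chi2 grows roughly like n/4 while tv stays below 1/2,
        # so no bound chi2 <= M * tv can hold uniformly in n.
        print(n, chi2(mu, nu), tv(mu, nu), chi2(mu, nu) / tv(mu, nu))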

In fact, building on the work of [3, 12], Sason and Verdú proved in [19] that any f-divergence with f(0) := lim_{t → 0+} f(t) and f*(0) := lim_{t → ∞} f(t)/t both finite satisfies D_f(μ‖ν) ≤ c_f |μ − ν|_TV for a constant c_f depending only on f. Thus, an f-divergence can be bounded above by a constant multiple of the total variation if and only if these two limits are finite. From this perspective, Proposition 2.8 is simply the obvious fact that strongly convex functions have super linear (at least quadratic) growth at infinity.

3 Skew divergences

If we denote by C the quotient of the cone of convex functions f on (0, ∞) such that f(1) = 0, under the equivalence relation f ∼ g when f(x) = g(x) + c(x − 1) for some c ∈ ℝ, then the map f ↦ D_f gives a linear isomorphism between C and the space of all f-divergences. The mapping defined by f ↦ f*, where we recall f*(x) = x f(1/x), gives an involution of C. Indeed, D_{f*}(μ‖ν) = D_f(ν‖μ), so that (f*)* = f. Mathematically, skew divergences give an interpolation of this involution, as

(s, t) ↦ D_f((1 − t)μ + tν ‖ (1 − s)μ + sν)

gives D_f(μ‖ν) by taking t = 0 and s = 1, or yields D_f(ν‖μ) by taking t = 1 and s = 0.

Moreover, as mentioned in the introduction, skewing imposes boundedness of the relevant Radon-Nikodym derivative, which allows us to constrain the domain of f-divergences and leverage κ-convexity to obtain f-divergence inequalities in this section.

The following appears as Theorem III.1 in the preprint [14]. It states that skewing an f-divergence preserves its status as an f-divergence. This guarantees that the generalized skew divergences of this section are indeed f-divergences. A proof is given in the appendix for the convenience of the reader.

Theorem 3.1 (Melbourne et al. [14]).

For s, t ∈ [0, 1] and an f-divergence D_f, skewing preserves the class of f-divergences, in the sense that

(μ, ν) ↦ D_f((1 − t)μ + tν ‖ (1 − s)μ + sν)    (23)

is an f-divergence if D_f is.

Definition 3.1.

For an f-divergence D_f, its skew symmetrization,

Δ_f(μ, ν) := (1/2) D_f(μ ‖ (μ + ν)/2) + (1/2) D_f(ν ‖ (μ + ν)/2),

is determined by the convex function

x ↦ ((1 + x)/4) [ f(2x/(1 + x)) + f(2/(1 + x)) ].    (24)

Observe that Δ_f(μ, ν) = Δ_f(ν, μ), and that Δ_f(μ, ν) is finite for all μ and ν whenever f is bounded on [0, 2], since dμ/d((μ + ν)/2) ≤ 2 and dν/d((μ + ν)/2) ≤ 2. When f(x) = x log x, the relative entropy's skew symmetrization is the Jensen-Shannon divergence. When f(x) = (x − 1)², up to a normalization constant the χ²-divergence's skew symmetrization is the Vincze-Le Cam divergence, which we state below for emphasis. See [21] for more background on this divergence, where it is referred to as the triangular discrimination.
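As a numerical check of the Jensen-Shannon case (an added illustration; it uses the generator in (24) with f(x) = x log x and natural logarithms):

    import numpy as np

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def js(p, q):
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def f_divergence(p, q, f):
        return float(np.sum(q * f(p / q)))

    def skew_symmetrized_generator(x):
        # generator (24) applied to f(t) = t log t
        f = lambda t: t * np.log(t)
        return (1 + x) / 4 * (f(2 * x / (1 + x)) + f(2 / (1 + x)))

    p = np.array([0.6, 0.3, 0.1])
    q = np.array([0.2, 0.5, 0.3])
    print(js(p, q), f_divergence(p, q, skew_symmetrized_generator))  # the two values agree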

Definition 3.2.

For f(x) = (x − 1)²/(x + 1), denote the Vincze-Le Cam divergence by Δ(μ, ν) := D_f(μ‖ν).

If one denotes the skew symmetrization of the χ²-divergence by Δ_{χ²}, one can compute easily from (24) that Δ_{χ²}(μ, ν) = Δ(μ, ν)/2. We note that although skewing preserves convexity, by the above example it does not preserve κ-convexity in general. The χ²-divergence is a 2-convex divergence, while the generator f(x) = (x − 1)²/(x + 1) corresponding to the Vincze-Le Cam divergence satisfies f″(x) = 8/(1 + x)³, which cannot be bounded away from zero on (0, ∞).
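The computation behind this identity is short; applying the generator (24) to f(x) = (x − 1)² gives

((1 + x)/4) [ (2x/(1 + x) − 1)² + (2/(1 + x) − 1)² ] = ((1 + x)/4) · 2(x − 1)²/(1 + x)² = (1/2) · (x − 1)²/(x + 1),

which is one half of the Vincze-Le Cam generator.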

Corollary 3.3.

For an f-divergence such that f is κ-convex on (0, 2),

Δ_f(μ, ν) ≥ (κ/4) Δ(μ, ν),    (25)

with equality when f corresponds to the χ²-divergence, where Δ_f denotes the skew symmetrized divergence associated to f and Δ is the Vincze-Le Cam divergence.

Proof.

Applying Proposition 2.7

When f(x) = x log x, we have f″(x) = 1/x ≥ 1/2 on (0, 2), which demonstrates that up to a constant the Jensen-Shannon divergence bounds the Vincze-Le Cam divergence. See [21] for an improvement of this inequality, by a constant factor, in the case of the Jensen-Shannon divergence, called the “capacitory discrimination” in that reference.
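Concretely, taking κ = 1/2 in (25) gives JS(μ, ν) ≥ Δ(μ, ν)/8, a valid if not optimal constant. A small numerical check of this consequence (an added illustration, with the triangular-discrimination normalization of Δ):

    import numpy as np

    rng = np.random.default_rng(1)

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def js(p, q):
        m = 0.5 * (p + q)
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    def le_cam(p, q):
        # triangular discrimination: sum (p - q)^2 / (p + q)
        return float(np.sum((p - q) ** 2 / (p + q)))

    for _ in range(1000):
        p = rng.dirichlet(np.ones(4))
        q = rng.dirichlet(np.ones(4))
        assert js(p, q) >= le_cam(p, q) / 8 - 1e-12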

We will now investigate more general, non-symmetric skewing in what follows.

Proposition 3.4.

For t ∈ [0, 1], define

(26)

and

(27)

Then

(28)

We will need the following lemma, originally proved by Audenaert in the quantum setting [2]. It is based on a differential relationship between the skew divergence [10] and the divergence of Györfi and Vajda [8]; see [13, 16].

Lemma 3.5 (Theorem III.1 [14]).

For t ∈ [0, 1] and probability measures μ and ν,

(29)
Proof of Proposition 3.4.

If , then and . Also,

(30)

with , thus

(31)
(32)
(33)

where the inequality follows from Lemma 3.5. Following the same argument for , so that , , and

(34)

for completes the proof. Indeed,

(35)
(36)
(37)

We recover the classical bound [11, 21] of the Jensen-Shannon divergence by the total variation.

Corollary 3.6.

For probability measures μ and ν,

(38)
Proof.

Since the Jensen-Shannon divergence is the skew symmetrization of the relative entropy, the bound follows from Proposition 3.4. ∎

Proposition 3.4 gives a sharpening of Lemma 1 of Nielsen [15], who proved a weaker bound of the same form and used the result to establish the boundedness of a generalization of the Jensen-Shannon divergence.

Definition 3.7 (Nielsen [15]).

For weights w_1, …, w_n and densities p_1, …, p_n with respect to a reference measure λ, such that w_i ≥ 0 and w_1 + ⋯ + w_n = 1, define

(39)

where .

Note that in the symmetric two-density case this recovers the usual Jensen-Shannon divergence. We now demonstrate that Nielsen's generalized Jensen-Shannon divergence can be bounded by the total variation distance just as the ordinary Jensen-Shannon divergence.
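As a hedged illustration of one common form such a weighted generalization can take, the sketch below computes the mixture form Σ_i w_i D_KL(p_i ‖ Σ_j w_j p_j); this is an assumption for illustration only and may differ from Nielsen's Definition 3.7 in how the terms are weighted or skewed.

    import numpy as np

    def kl(p, q):
        return float(np.sum(p * np.log(p / q)))

    def generalized_js(ps, ws):
        # assumed form: sum_i w_i * KL(p_i || mixture), mixture = sum_i w_i * p_i
        ps = np.asarray(ps, dtype=float)
        ws = np.asarray(ws, dtype=float)
        mixture = ws @ ps
        return float(sum(w * kl(p, mixture) for w, p in zip(ws, ps)))

    ps = [np.array([0.7, 0.2, 0.1]),
          np.array([0.1, 0.6, 0.3]),
          np.array([0.3, 0.3, 0.4])]
    print(generalized_js(ps, [0.5, 0.25, 0.25]))
    # with two densities and equal weights this reduces to the usual Jensen-Shannon divergence
    print(generalized_js(ps[:2], [0.5, 0.5]))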

Theorem 3.2.

For weights w_1, …, w_n and densities p_1, …, p_n with respect to a reference measure λ, such that w_i ≥ 0 and w_1 + ⋯ + w_n = 1,

(40)

where and with

Note that since is the average of the terms with removed, and thus . We will need the following Theorem from [14] for the upper bound.

Theorem 3.3 ([14] Theorem 1.1).

For densities p_1, …, p_n with respect to a common reference measure λ, and weights w_i ≥ 0 such that w_1 + ⋯ + w_n = 1,

(41)

where , and with .

Proof of Theorem 3.2.

We apply Theorem 3.3 with , , and noticing that in general

(42)

we have

(43)
(44)

It remains to determine ,

(45)
(46)
(47)
(48)

Thus and the proof of the upper bound is complete.

To prove the lower bound, we apply Pinsker’s inequality,