1 Introduction
We study the information concentration of probability measures: given a probability density p on R^d and a random variable X ~ p, we ask how concentrated the random variable −log p(X) is around its mean E[−log p(X)], which is simply the differential entropy of p. We focus on the class of log-concave probability measures, whose densities are of the form e^{−V} for some convex function V. Information concentration for log-concave measures has found many applications in learning theory, ranging from aggregation (dalalyan2016exponentially) and Bayesian decision theory (pereyra2017maximum) to, unsurprisingly, information theory (raginsky2013concentration). It also has immediate implications for online learning and PAC-Bayesian analysis (cf. Section 4 for further discussion).
bobkov2011concentration discovered the information concentration phenomenon for log-concave measures. Their result was later sharpened by fradelizi2016optimal, which establishes the current state of the art. However, the concentration bound in (fradelizi2016optimal) immediately reveals a poor dependence on the dimension (see Theorem 3.1).
This unpleasant dependence is, however, not due to any deficit of the analysis: even in the Gaussian case, information concentration is known to be dimension-dependent (cover1989gaussian), and the bound in (fradelizi2016optimal) matches the tightest known result. One can verify that the exponential distributions, another candidate for dimension-free concentration, share the same poor dimensional scaling.
Given these observations, one might pessimistically conjecture that no meaningful subclass of log-concave measures satisfies dimension-free information concentration. Our main result therefore comes as a surprise: not only does there exist a large subclass of log-concave measures with dimension-free information concentration, but this subclass is also extremely well known to the machine learning community:
Our main result (informal statement): Let dμ ∝ e^{−V} dx, where V is β-exp-concave. Then the information concentration of μ depends solely on the exp-concavity parameter β, and not on the ambient dimension.
Many loss functions in machine learning are known to be exp-concave; a non-exhaustive list includes the squared loss, entropic loss, log-linear loss, SVMs with squared hinge loss, and log loss; see (cesa2006games) for more. Moreover, distributions of the type e^{−V}, where V is exp-concave, appear frequently in many areas of learning theory. Consequently, our main result is tightly connected to learning with exp-concave losses; see Section 4. Our main insight is that exp-concave functions are Lipschitz in a local norm, and log-concave measures satisfy a "Poincaré inequality in this local norm", namely the Brascamp–Lieb inequality. We elaborate on the intuition in Section 5.1. In retrospect, once the right tools are identified, the proof of our main result is completely natural and elementary. In fact, our result suggests that exp-concavity arises naturally in dimension-free information concentration.
The rest of the paper is organized as follows. We first set up notation and review the basics of differential entropy and log-concave distributions in Section 2. Section 3 contains precise statements of the main result: we present various dimension-free inequalities for information concentration, and we provide a counterexample to a natural strengthening of our results. We discuss implications of information concentration in Section 4 with motivating examples. Finally, Section 5 presents the technical proofs.
2 Preliminaries
2.1 Notations
For a function , we write and . We write for a random variable associated with the probability measure .
In this paper, the norm ‖·‖ is always the Euclidean norm, and we use ⟨·, ·⟩ for the Euclidean inner product. We use ∇f, ∇²f, and ∂f to denote the gradient, Hessian, and subgradient of f, respectively. The notation C^k denotes the class of k times differentiable functions with continuous k-th derivatives.
2.2 Differential Entropy and LogConcave Distributions
Let μ be a probability measure having density p with respect to the Lebesgue measure, and let X ~ μ. The differential entropy (cover2012elements) of μ is defined as
(1) h(μ) := E[−log p(X)] = −∫ p(x) log p(x) dx.
The random variable −log p(X) is called the information content of X.
We study the concentration of the information content around the differential entropy:
P[|−log p(X) − h(μ)| > t] ≤ ε(t),
where ε(t) vanishes rapidly as t increases.
Throughout this paper, we consider log-concave probability measures, namely probability measures having density of the form
(2) p(x) = e^{−V(x)},
where V is a convex function such that ∫ e^{−V(x)} dx = 1. The function V is called the potential of the measure μ. For log-concave measures, the concentration of the information content is equivalent to the concentration of the potential, i.e., of V(X) around E[V(X)].
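Spelled out, the equivalence just stated is a one-line identity; as a sketch in the notation of (1) and (2):

```latex
% information content of a log-concave measure with potential V:
-\log p(X) = V(X), \qquad h(\mu) = \mathbb{E}\,[-\log p(X)] = \mathbb{E}\,[V(X)],
% so the deviation of the information content from the entropy is exactly
-\log p(X) - h(\mu) \;=\; V(X) - \mathbb{E}\,[V(X)].
```

Thus any tail bound proved for V(X) − E[V(X)] transfers verbatim to the information content.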
3 Dimension-Free Concentration of Information for Exp-Concave Potentials
This section presents our main results.
We first review the state-of-the-art bound in Section 3.1. In Section 3.2, we demonstrate dimension-free information concentration when the underlying potential is assumed to be exp-concave. All our results are of sub-exponential type; it is hence natural to ask whether the sub-Gaussian counterparts are also true. We show that this is impossible even in dimension 1 by giving a counterexample in Section 3.3. Finally, we highlight some immediate consequences of our main results in Section 3.4. All proofs are deferred to Section 5.
3.1 Previous Art
The state-of-the-art concentration bound for log-concave measures is given by fradelizi2016optimal:
(Information Concentration for Log-Concave Vectors)
Let μ be a d-dimensional log-concave probability measure. Then, we have
Var[−log p(X)] ≤ d.

There exist universal constants c and C such that
(3)
This is the main result of (fradelizi2016optimal) combined with the well-known relation
for every .
3.2 Our Results
We first recall the definition of exp-concave functions (hazan2016introduction): a function f is said to be β-exp-concave if e^{−βf} is concave. Equivalently, a twice-differentiable f is β-exp-concave if the matrix inequality ∇²f(x) ⪰ β ∇f(x)∇f(x)^T holds for all x. Notice that an exp-concave function is necessarily convex.
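As a numerical illustration of the Hessian criterion (our addition, not part of the original text): the one-dimensional squared loss with predictions and labels in [−B, B] is commonly quoted in the online-learning literature as 1/(8B²)-exp-concave. A minimal grid check of f''(x) ≥ α f'(x)²:

```python
import numpy as np

def is_exp_concave_1d(f_pp, f_p, alpha, xs):
    """Check the 1-D Hessian criterion f''(x) >= alpha * f'(x)**2 on a grid."""
    return all(f_pp(x) >= alpha * f_p(x) ** 2 - 1e-12 for x in xs)

# Squared loss f(x) = (x - y)^2 with prediction x and label y in [-B, B].
B, y = 1.0, 0.5
f_p = lambda x: 2.0 * (x - y)    # f'(x)
f_pp = lambda x: 2.0             # f''(x) is constant
xs = np.linspace(-B, B, 1001)

# f'(x)^2 = 4 (x - y)^2 <= 16 B^2, so alpha = 1/(8 B^2) satisfies 2 >= alpha * 16 B^2.
assert is_exp_concave_1d(f_pp, f_p, 1.0 / (8 * B ** 2), xs)
# A much larger alpha violates the criterion somewhere on the interval.
assert not is_exp_concave_1d(f_pp, f_p, 10.0, xs)
```

The same grid check applies to any twice-differentiable loss; the value 1/(8B²) should be read as the standard textbook constant rather than as part of this paper's statements.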
We next present three concentration inequalities for the information content in Theorems 3.2.1–3.2.3. Theorem 3.2.1 serves as the prototype for all the concentration inequalities to come, albeit with restrictive conditions that severely limit its applicability. To overcome this limitation, in Theorems 3.2.2 and 3.2.3 we introduce practically motivated assumptions and show how the restrictive conditions of Theorem 3.2.1 can be removed without affecting the concentration.
3.2.1 Information Concentration: the Strictly Convex Case
The first main result of this paper is that, for dμ ∝ e^{−V} dx with V exp-concave and strictly convex, the concentration of the information content depends solely on the exp-concavity parameter. Assume that V is β-exp-concave and strictly convex. Let μ be the log-concave distribution associated with V. Then

.

.
3.2.2 Information Concentration: the General Convex Case
In many applications in learning theory (cf. Section 4), the potential is not guaranteed to be globally strictly convex. However, we have the following observation:
Assume that V is β-exp-concave. Let S_x
be the subspace spanned by the eigenvectors corresponding to the nonzero eigenvalues of
∇²V(x). Then ∇V(x) ∈ S_x for all x. Simply put, V may not be strictly convex in all directions, but it is always strictly convex in the direction of its gradient. Our second result shows that in this case, one can drop the global strict convexity of V while retaining exactly the same dimension-free concentration. Assume that V is β-exp-concave, but not necessarily strictly convex. Let μ be the log-concave distribution associated with V. Then
.

.
3.2.3 Information Concentration in the Presence of Nonsmooth Potential
The following case appears frequently in machine learning applications: the potential can be decomposed as V = V₁ + V₂, where V₁ is a "nice" convex function (meaning it satisfies the assumptions of either Theorem 3.2.1 or Theorem 3.2.2), while V₂ is a nonsmooth convex function. Since V₂ is neither differentiable nor strictly convex, the results above do not apply.
Our third result shows that, in this scenario, the term V₁ in fact enjoys dimension-free concentration as if the nonsmooth term were absent. Let V = V₁ + V₂, where V₁ satisfies the assumptions of either Theorem 3.2.1 or Theorem 3.2.2, and V₂ is a general convex function. Then we have
(4) 
where the probability is with respect to the total measure dμ ∝ e^{−(V₁+V₂)} dx, and β is the exp-concavity parameter of V₁.
3.3 A Counterexample to Sub-Gaussian Concentration of Information Content
So far, we have established dimension-free concentration of sub-exponential type under various conditions. A natural question is whether, under the same assumptions, one has dimension-free sub-Gaussian concentration, i.e., a deviation inequality of the form
(5) 
for a universal constant and some constant depending only on .
In this subsection, we provide a counterexample to this conjecture, showing that this is impossible even in dimension 1.
Consider the one-dimensional case where and the support is . Notice that is trivially 1-exp-concave. If (5) holds for , then we would have
if . However, a straightforward computation shows that
(6) 
for every . We hence cannot have any sub-Gaussian concentration for .
It is easy to generalize this example to any dimension.
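Since the concrete construction is elided above, here is one candidate with the stated properties (a choice of ours, offered as an assumption rather than as the paper's exact example): the density p(x) = 2x on (0, 1] has potential V(x) = −log x up to an additive constant, which satisfies V''(x) = 1/x² = V'(x)² and is hence exactly 1-exp-concave, while its information content has a purely exponential upper tail:

```python
import math
import random

# Candidate: p(x) = 2x on (0, 1], V(x) = -log x (1-exp-concave since V'' = (V')^2).
# E[-log X] = \int_0^1 (-log x) 2x dx = 1/2 and P(-log X > s) = P(X < e^{-s}) = e^{-2s}.
def tail(t):
    """Exact upper tail P(V(X) - E[V(X)] > t) for t >= 0."""
    return math.exp(-2.0 * (t + 0.5))

# The exponent -log P(t) grows only linearly in t (sub-exponential decay);
# sub-Gaussian decay exp(-c t^2) would force -log P(t) / t to diverge.
exponents = [-math.log(tail(t)) / t for t in (1.0, 4.0, 16.0, 64.0)]
assert max(exponents) < 3.1 and abs(exponents[-1] - 2.0) < 0.1

# Monte Carlo sanity check: X = sqrt(U), U uniform on (0, 1], has density 2x.
random.seed(0)
n = 200_000
draws = (math.sqrt(1.0 - random.random()) for _ in range(n))
hits = sum(1 for x in draws if -math.log(x) > 1.5)
assert abs(hits / n - math.exp(-3.0)) < 5e-3
```

Note the bounded support: on an unbounded domain, exp-concavity of the potential is hard to sustain, so bounded-support examples are the natural place to look for such counterexamples.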
3.4 Immediate Consequences
An immediate consequence of information concentration is that many important densities in information theory also concentrate. [Concentration of Information Densities] Let be a joint log-concave density of the random variable pair . Denote the marginal distribution of the first argument by and similarly for , and denote the conditional distribution by . Then there exist universal constants such that the following holds:
If, in addition, , , and are exp-concave and is strictly convex, then the exponents in the above bounds can be improved to .
Notice that is the conditional (differential) entropy, and is the mutual information. The (random) quantities and play prominent roles in recent advances in non-asymptotic information theory; see polyanskiy2010channel and the references therein.
A celebrated result of prekopa1971logarithmic states that the marginals of log-concave measures are also log-concave. The corollary then follows from the well-known decompositions h(X, Y) = h(X) + h(Y | X) and I(X; Y) = h(X) + h(Y) − h(X, Y).
4 Motivating Examples
Unsurprisingly, information concentration has many applications in learning theory; we present three examples in this section. To avoid lengthy but straightforward calculations, we omit the details and refer the reader to the relevant literature.
Below, we consider loss functions of the form L = Σᵢ ℓᵢ, where the ℓᵢ's are exp-concave. By the lemma in Appendix A (Properties of Exp-Concave Functions), the total loss is also exp-concave. Denote the exp-concavity parameter of the total loss by β.
We remark that, in general, can depend on the dimension or the sample size . A comparison of the favorable regimes for different ’s is presented in Table 1.
Table 1: favorable regimes for the bound of fradelizi2016optimal and for our two bounds.
4.1 High-Probability Regret Bounds for Exponential Weight Algorithms
Exp-concave losses have received substantial attention in online learning as they exhibit logarithmic regret (hazan2007logarithmic). One class of algorithms attaining logarithmic regret is based on the Exponential Weight, which makes predictions according to
(7) 
where
(8) 
A common belief is that the algorithm (7) is inefficient to implement, and practitioners more often opt for first-order methods such as the Online Newton Step (see hazan2007logarithmic), which is also somewhat inefficient: every iteration requires inverting a matrix and a projection. However, recent years have witnessed a surge of interest in sampling schemes, mainly due to their connection to the ultra-simple Stochastic Gradient Descent (welling2011bayesian). Theoretical (bubeck2015sampling; durmus2016high; dalalyan2017theoretical; dalalyan2017user; cheng2017convergence) and empirical (welling2011bayesian; ahn2012bayesian; rezende2014stochastic; blei2017variational) studies of sampling schemes have now become one of the most active areas in machine learning. In view of these recent developments, it is natural to consider, instead of the expected prediction (7), taking samples and predicting . The following corollary of our main result establishes the desirable concentration property of . Let be i.i.d. samples from the distribution . Assume that satisfies the assumptions of either Theorem 3.2.1 or Theorem 3.2.2. Then
(9) 
For simplicity, assume ; the general case is similar.
By the classic Chernoff bounding technique, we can compute
where the second inequality follows from (22) with .
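The elided Chernoff computation has the following generic shape (a hedged sketch we add; the admissible range of λ and the constants are those provided by the moment bound (22)):

```latex
\Pr\Big[\frac{1}{K}\sum_{k=1}^{K}\big(V(X_k)-\mathbb{E}V\big) \ge t\Big]
\;\le\; e^{-\lambda K t}\;\mathbb{E}\Big[e^{\lambda \sum_{k=1}^{K}(V(X_k)-\mathbb{E}V)}\Big]
\;=\; e^{-\lambda K t}\Big(\mathbb{E}\,e^{\lambda (V(X_1)-\mathbb{E}V)}\Big)^{K},
```

where the equality uses the independence of the samples; bounding the moment generating function via (22) and optimizing over λ yields a deviation probability decaying exponentially in Kt.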
Plugging (9) into the expected regret bounds for the Exponential Weight algorithm (e.g., hazan2007logarithmic), we immediately obtain high-probability regret bounds.
Similar arguments hold for random-walk-based approaches in online learning (narayanan2010random).
4.2 Posterior Concentration in Bayesian and PAC-Bayesian Analysis
The (pseudo-)posterior distribution plays a fundamental role in the PAC-Bayesian theory:
(10) 
where . Here, represents the parameter vector and is the prior distribution. It is well-known that (10) is optimal in PAC-Bayesian bounds for the expected (over the posterior distribution on the parameter set) population risk (catoni2007pac). Moreover, when the loss functions are the negative log-likelihood of the data, the optimal PAC-Bayesian posterior (10) coincides with the Bayesian posterior; see (zhang2006information) or the more recent (germain2016pac).
We now consider the high-probability bound in the following sense: instead of taking the expectation over as previously done, we draw a random sample , and ask what the population risk is for . Besides its apparent theoretical interest, such a characterization is also important in practice, as there exist many sampling schemes for log-concave distributions (lovasz2007geometry; bubeck2015sampling; durmus2016high; dalalyan2017theoretical), while computing the mean is in general costly (the mean is typically obtained through a large amount of sampling anyway).
A straightforward application of Theorem 3.2.3 shows that, if the prior is log-concave, then concentrates around ; notice that many popular priors (uniform, Gaussian, Laplace, etc.) are log-concave. On the other hand, concentration of the empirical risk around the population risk is a classical theme in statistical learning. To conclude, Theorem 3.2.3 implies high-probability versions of the PAC-Bayesian bounds. In view of the equivalence established in (germain2016pac), we also obtain concentration for the Bayesian posterior in the case of the negative log-likelihood loss.
4.3 Bayesian Highest Posterior Density Region
Let be the posterior distribution as in (10). In Bayesian decision theory, the optimal confidence region associated with a level is given by the Highest Posterior Density (HPD) region (robert2007bayesian), which is defined as
(11) 
where is chosen so that .
Using concentration of the information content for log-concave distributions, pereyra2017maximum showed that the HPD region is contained in the set
(12) 
where is the MAP parameter, and for some constant . A straightforward application of our results shows that, when the data term in the potential is exp-concave, we can improve (12). For simplicity, let us focus on the uniform prior (constant density on its support). Adapting the analysis in pereyra2017maximum, we can show that the HPD region is contained in the set
(13) 
where . Comparing (12) and (13), we see that (ignoring logarithmic terms) we obtain improvements whenever
. This is typically the case in high-dimensional statistics
(buhlmann2011statistics) or compressive sensing (ji2008bayesian; foucart2013mathematical), where . Similar results can be established for the Gaussian and Laplace priors, where one can invoke results in (cover1989gaussian) and (talagrand1995concentration) to deduce the concentration of the prior term. We omit the details.
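To make the threshold construction (11)–(13) concrete, here is a minimal grid-based sketch (an illustration of ours; the standard Gaussian stands in for the posterior, and hpd_region is a hypothetical helper name):

```python
import numpy as np

def hpd_region(density, xs, alpha):
    """Grid approximation of the HPD region (11): the highest-density set
    {x : p(x) >= gamma} whose probability mass is 1 - alpha."""
    p = density(xs)
    dx = xs[1] - xs[0]
    order = np.argsort(p)[::-1]             # visit grid points from highest density down
    mass = np.cumsum(p[order]) * dx
    k = np.searchsorted(mass, 1.0 - alpha)  # points needed to accumulate mass 1 - alpha
    keep = np.zeros(xs.shape, dtype=bool)
    keep[order[: k + 1]] = True
    return xs[keep]

# Standard Gaussian posterior: the 95% HPD region is close to [-1.96, 1.96].
gauss = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)
xs = np.linspace(-6.0, 6.0, 12001)
region = hpd_region(gauss, xs, alpha=0.05)
assert abs(region.min() + 1.96) < 0.01 and abs(region.max() - 1.96) < 0.01
```

For a unimodal posterior the returned grid points form an interval, matching the set description in (11); for multimodal posteriors the same threshold rule can return a union of intervals.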
5 Proofs of the Main Results
We prove the main results in this section. Our analysis crucially relies on the variance form of the Brascamp–Lieb inequality, recalled and elaborated in Section 5.1. Sections 5.2–5.4 are devoted to the proofs of Theorems 3.2.1–3.2.3, respectively.
5.1 Proof Ideas
For a probability measure μ, we say that μ satisfies the Poincaré inequality with constant C if
(14) Var_μ(f) ≤ C ∫ ‖∇f‖² dμ
for all locally Lipschitz f. It is well-known that if (14) is satisfied for μ, then all Lipschitz functions concentrate exponentially (ledoux2004spectral; ledoux2005concentration):
(15) 
for some universal constant .
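One standard quantitative form of this implication (a hedged reconstruction of (15); the universal constants differ across sources such as ledoux2004spectral) is:

```latex
\mathrm{Var}_\mu(f) \le C \int \|\nabla f\|^2 \, d\mu
\;\;\Longrightarrow\;\;
\mu\big(|f - \mathbb{E}_\mu f| \ge t\big) \;\le\; 3\,\exp\!\Big(-\frac{c\, t}{\sqrt{C}\,\|f\|_{\mathrm{Lip}}}\Big)
\quad \text{for all } t > 0,
```

for a universal constant c > 0; in particular, the exponential tail rate is governed by the Poincaré constant and the Lipschitz constant alone.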
At first glance, our theorems seem to have little to do with the Poincaré inequality, since:

It is not known whether a log-concave distribution satisfies the Poincaré inequality with a dimension-independent constant (this is the content of the Kannan–Lovász–Simonovits conjecture; see kannan1995isoperimetric; alonso2015approaching).

Typically, the potential V
is not Lipschitz (consider the Gaussian distribution, where
V(x) = ‖x‖²/2 up to a constant). Moreover, even if the potential is Lipschitz, the Lipschitz constant often depends on the dimension (consider the product exponential distribution, where V(x) = Σᵢ xᵢ on the positive orthant, with Lipschitz constant √d).
The important observation in this paper is that the appropriate norm in (14) for information concentration is not the Euclidean norm (or any fixed norm), but instead the (dual of the) local norm defined by the potential itself, namely ‖u‖_x := ⟨u, ∇²V(x) u⟩^{1/2}.
The lemma in Appendix A (Properties of Exp-Concave Functions) expresses the fact that exp-concave functions are Lipschitz with respect to this local norm, and the Brascamp–Lieb inequality below provides a suitable strengthening of the Poincaré inequality: [Brascamp–Lieb Inequality] Let μ be a log-concave probability measure with potential V, twice differentiable and strictly convex. Then for all locally Lipschitz functions f, we have
(16) Var_μ(f) ≤ E_μ⟨∇f, (∇²V)^{−1} ∇f⟩.
We shall see that the Brascamp–Lieb inequality provides precisely the desired control of the Lipschitzness of the potential in terms of the aforementioned dual local norm. Once this is observed, the rest of the proof is routine: one deduces sub-exponential concentration of Lipschitz functions from a Poincaré-type inequality.
We remark that our approach is, in retrospect, completely natural and elementary. However, to the best of our knowledge, our work is the first to combine the Brascamp–Lieb inequality (16) with a local norm of the form ⟨·, (∇²V)^{−1} ·⟩.
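At the level of the variance, the combination just described is a two-line calculation; as a sketch (with β the exp-concavity parameter of the potential V):

```latex
% Brascamp--Lieb (16) applied to f = V, followed by the pointwise bound that
% \nabla^2 V \succeq \beta\,\nabla V \nabla V^{\top} implies
% \langle \nabla V, (\nabla^2 V)^{-1} \nabla V \rangle \le 1/\beta:
\mathrm{Var}_\mu\big(V(X)\big)
\;\le\; \mathbb{E}_\mu\,\big\langle \nabla V, (\nabla^2 V)^{-1}\,\nabla V \big\rangle
\;\le\; \frac{1}{\beta}.
```

Indeed, if H ⪰ β uu^⊤ with H ≻ 0, then conjugating by H^{−1/2} gives I ⪰ β (H^{−1/2}u)(H^{−1/2}u)^⊤, whence ⟨u, H^{−1}u⟩ ≤ 1/β; the variance of the information content is therefore dimension-free.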
5.2 Proof of Theorem 3.2.1
The first assertion is a simple application of the Brascamp–Lieb inequality (16) and the lemma in Appendix A (Properties of Exp-Concave Functions).
We now prove the concentration inequality. We first show that . Applying (16) to , we get
(17) 
by the lemma in Appendix A. Let . Then the inequality (17) reads
(18) 
and hence
(19) 
Apply (19) recursively to obtain
(20) 
Since , we have
as . Hence (20) implies
(21) 
which in turn gives
(22) 
The proof can now be completed by the classic Chernoff bounding technique:
(23) 
Now, the inequality (23) implies that for any 1-exp-concave , we have
If V is β-exp-concave, then βV is 1-exp-concave, and hence we conclude that
that is to say,
(24) 
The bound for is similar.
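For a single draw, the Chernoff step behind (23)–(24) can be summarized as follows (a hedged sketch; the constants are those delivered by the moment bound (22)): if E e^{λ(V−EV)} ≤ C₀ for all 0 < λ ≤ c₀β, then

```latex
\Pr\big[V(X) - \mathbb{E}V \ge t\big]
\;\le\; e^{-\lambda t}\,\mathbb{E}\,e^{\lambda(V(X)-\mathbb{E}V)}
\;\le\; C_0\, e^{-c_0 \beta t}
\qquad \text{(choosing } \lambda = c_0\beta\text{)}.
```

The exponent depends only on β, not on the ambient dimension.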
5.3 Proof of Lemma 3.2.2 and Theorem 3.2.2
We first prove Lemma 3.2.2.
For any point x, let {v₁, …, v_m} be an orthonormal basis for S_x, assumed to have dimension m. We extend it to an orthonormal basis {v₁, …, v_d} of the whole space, and we decompose ∇V(x) = Σᵢ cᵢ vᵢ for some real numbers cᵢ.
For the purpose of contradiction, assume that ∇V(x) ∉ S_x. Then cⱼ ≠ 0 for some j > m. But then
while
contradicting the expconcavity of . This finishes the proof of Lemma 3.2.2.
We now turn to Theorem 3.2.2.
Let be arbitrarily small, and consider the quantity
(25) 
Lemma 3.2.2 implies that (25) is equal to
(26) 
where is the identity map on the subspace . Since is strictly convex restricted to , and since , the lemma in Appendix A (Properties of Exp-Concave Functions) then implies
(27) 
for all .
Consider , and let . Since is strictly convex, we may invoke the BrascampLieb inequality (16) to conclude
(28) 
where the second inequality follows from (27). Letting in (28) then gives
(29) 
The rest of the proof is similar to that of Theorem 3.2.1; we omit the details.
5.4 Proof of Theorem 3.2.3
We will need the following strengthened Brascamp–Lieb inequality, which might be of independent interest. Once it is established, one can follow a proof similar to that in Section 5.2. We omit the details and focus on the proof of the following theorem in the rest of this subsection. [Nonsmooth Brascamp–Lieb Inequality] Let μ be a log-concave measure with potential V = V₁ + V₂, where V₁ is twice differentiable and strictly convex, and V₂ is convex but possibly non-differentiable. Then for all locally Lipschitz functions f, we have
(30) 
[Proof of Theorem 5.4] Define the cost functions
(31) 
and
(32) 
By Proposition 1.1 of cordero2017transport (see also p. 482 for the non-differentiable case), we know that the measure satisfies the transportation cost inequality:
for any probability measure . Here,
where the infimum is over all joint distributions with marginals
and , and is the relative entropy between and . Since V₂ is convex, we have for all , and hence the measure satisfies the weaker transportation cost inequality
(33) 
The theorem can then be deduced from a standard linearization procedure that has been well-known since the classic (otto2000generalization). The rest of the proof below is a suitable adaptation of the version in (cordero2017transport).
Since continuous functions with compact support are dense in , we will prove Theorem 5.4 for any continuous function with compact support. Notice that such functions are necessarily Lipschitz and hence differentiable almost everywhere. Since modifying on a set of measure 0 does not affect (30), we may henceforth assume that and has compact support.
Since , is uniformly continuous on any compact set, and hence we have
(34) 
uniformly in on any compact set when
. Assume for the moment that
for some . Given any function with compact support and , introduce the infimal convolution associated with the cost :(35) 
whence . By the definition of , for any joint probability measure having marginals and , we must have
(36) 
Consider the infimal convolution of :
(37) 
Let denote a point where the infimum is achieved. Since is globally Lipschitz, say with constant ,
(38) 
On the other hand, by setting in (37), we see that . Combining this with (38) gives
(39) 
Notice that (39) does not depend on , and hence uniformly in .
As is compactly supported, we have . Let be the support of , and let . We claim that on . Indeed, for any , suppose that the infimum of in (37) is attained in . Then , since and . On the other hand, if the infimum of in (37) is attained outside , then and implies that .
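For intuition about the infimal convolution (35) and the two properties used in the linearization argument (it never exceeds the original function, and it converges back to it as the scale vanishes), consider the quadratic-cost special case: an illustrative stand-in of ours, not the cost (31)–(32) of the proof.

```python
import numpy as np

# Grid sketch of the infimal convolution (Moreau envelope) with quadratic cost:
# Q_t g(x) = min_y { g(y) + |x - y|^2 / (2 t) }.
def inf_conv(g, xs, t):
    # rows indexed by x, columns by the minimization variable y
    return np.min(g[None, :] + (xs[:, None] - xs[None, :]) ** 2 / (2.0 * t), axis=1)

xs = np.linspace(-3.0, 3.0, 601)
g = np.abs(xs)                      # g(y) = |y|
q = inf_conv(g, xs, t=1.0)

# Known closed form for this pair (the Huber function):
huber = np.where(np.abs(xs) <= 1.0, xs ** 2 / 2.0, np.abs(xs) - 0.5)
assert np.max(np.abs(q - huber)) < 1e-3

# The two generic properties:
assert np.all(q <= g + 1e-12)             # Q_t g <= g (take y = x in the infimum)
q_small = inf_conv(g, xs, t=1e-3)
assert np.max(np.abs(q_small - g)) < 1e-2 # Q_t g -> g as t -> 0
```

The proof's cost functions replace the quadratic penalty, but the inequalities exploited there are exactly these two.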
For the sake of linearization, set for some with . We now compute
(40) 
As the set is itself compact, we have, uniformly in ,
(41) 
where the last line follows by (39). Noticing that (41) is convex in , we can find its minimum (up to ) and write
(42) 
Multiplying (42) by and integrating w.r.t. , we get, using (40) and ,
(43) 
By definition, contains the support of , and hence the integrals in (43) are in fact over the whole space. We hence conclude
(44) 
Replacing by in (44) and optimizing over , we get
(45) 
Moreover, using and , we can compute
(46) 
6 Conclusion
We have shown that for log-concave distributions with exp-concave potentials, the information concentration is dictated by the exp-concavity parameter β. Information-theoretically speaking, β (or rather 1/β) can be viewed as a sort of effective dimension, in the sense that 1/β and the ambient dimension d play very similar roles in both the variance and concentration bounds, the former for log-concave measures with exp-concave potentials and the latter for general log-concave measures. Such an understanding enables us to derive high-probability results for many machine learning algorithms, including Bayesian, PAC-Bayesian, and Exponential Weight type approaches.
This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement n 725594  timedata).
Appendix A. Properties of ExpConcave Functions
We present two useful properties of exp-concave functions in this appendix. While these properties are well-known to experts, we provide the proofs for completeness.
Assume that is exp-concave and . Then we have
(47) 
for all . Since is exp-concave, we have