1 Introduction
Exponential models are perhaps the most versatile and pragmatic statistical model for a variety of reasons — modelling flexibility (encompassing discrete variables, continuous variables, covariance matrices, time series, graphical models, etc); convexity properties allowing ease of optimization; and robust generalization ability. A principal issue for applicability to large scale problems is estimating these models when the ambient dimension of the parameters,
, is much larger than the sample size — the “” regime.Much recent work has focused on this problem in the special case of linear regression in high dimensions, where it is assumed that the optimal parameter vector is sparse (e.g.
Zhao and Yu [2006], Candes and Tao [2007], Meinshausen and Yu [2009], Bickel et al. [2008]). This body of prior work focused on: sharply characterizing the convergence rates for the prediction loss; consistent model selection; and obtaining sparse models. As we tackle more challenging problems, there is a growing need for model selection in more general exponential families. Recent work here includes learning Gaussian graphs (Ravikumar et al. [2008b]) and Ising models (Ravikumar et al. [2008a]).Classical results established that consistent estimation in general exponential families is possible, in the asymptotic limit where the number of dimensions is held constant (though some work establishes rates under certain conditions as is allowed to grow slowly with [Portnoy, 1988, Ghosal, 2000]). However, in modern problems, we typically grow rapidly with (so even asymptotically we are often interested in the regime where , as in the case of sparse estimation). While we have a handle on this question for a variety of special cases, a pressing question here is understanding how fast can scale as a function of in general exponential families — such an analysis must quantify the relevant aspects of the particular family at hand which govern their convergence rate. This is the focus of this work. We should emphasize that throughout this paper, while we are interested in modelling with an exponential family, we are agnostic about the true underlying distribution (e.g we do not necessarily assume that the data generating process is from an exponential family).
Our Contributions and Related Work
The key issue in analyzing the convergence rates of exponential families in terms of their prediction loss (which we take to be the log loss) is in characterizing the nature in which they are strictly convex — roughly speaking, in the asymptotic regime where we have a large sample size (with
kept fixed), we have a central limit theorem effect where the log loss of any exponential family approaches the log loss of a Gaussian, with a covariance matrix corresponding to the Fisher information matrix. Our first main contribution is quantifying the rate at which this effect occurs in general exponential families.
In particular, we show that every exponential family satisfies a certain rather natural growth rate condition on their standardized moments and standardized cumulants (recall that the
th standardized moment is the unitless ratio of the th central moment to theth power of the standard deviation, which for
is the skew and kurtosis). This condition is rather mild, where these moments can grow as fast as
. Interestingly, similar conditions have been well studied for obtaining exponential tail bounds for the convergence of a random variable to its mean
[Bernstein, 1946]. We show that this growth rate characterizes the rate at which the prediction loss of the exponential family behaves as a strongly convex loss function. In particular, our analysis draws many parallels to that of the analysis of Newton’s method, where there is a “burn in” phase in which a number of iterations must occur until the function behaves as a locally quadratic function — in our statistical setting, we now require a (quantified) “burn in” sample size, where beyond this threshold sample size, the prediction loss inherits the desired strong convexity properties (i.e. it is locally quadratic).
Our second contribution is an analysis of
regularization in generic families, in terms of both prediction loss and the sparsity level of the selected model. Under a particular sparse eigenvalue condition on the design matrix (the Restricted Eigenvalue (RE) condition in
Bickel et al. [2008]), we show how regularization in general exponential families enjoys a convergence rate of (where is the number of relevant features). This RE condition is one of the least stringent conditions which permit this optimal convergence rate for linear regression case (see Bickel et al. [2008]) — stronger mutual incoherence/irrepresentable conditions considered in Zhao and Yu [2006] also provide this rate. We show that an essentially identical convergence rate can be achieved for general exponential families — our results are nonasymptotic and precisely relate and .Our final contribution is one of approximate sparse model selection, i.e. where our goal is to obtain a sparse model with low prediction loss. A drawback of the RE condition in comparison to the mutual incoherence condition is that the latter permits perfect recovery of the true features (at the price of a more stringent condition). However, for the case of the linear regression, Zhao and Yu [2006], Bickel et al. [2008] show that, under a sparse eigenvalue or RE condition, the solution is actually sparse itself (with a multiplicative increase in the sparsity level, that depends on a certain condition number of the design matrix) – so while the the solution may not precisely recover the true model, it still is sparse (with some multiplicative increase) and does recover those features with large true weights.
For general exponential families, while we do not have a characterization of the sparsity level of the regularized solution (an interesting open question), we do however provide a simple two stage procedure (thresholding and refitting) which provides a sparse model, with support on no more than merely features and which has nearly as good performance (with a rather mild increase in the risk) — this result is novel even for the square loss case. Hence, even under the rather mild RE condition, we can obtain both a favorable convergence rate and a sparse model for generic families.
2 The Setting
Our samples are distributed independently according to , and we model the process with , where . However, we do not necessarily assume that lies in this model class. The class of interest is exponential families, which, in their natural form, we denote by:
where is the natural sufficient statistic for , and is the partition function. Here, is the natural parameter space — the (convex) set where is finite. While we work with an exponential family in this general (though natural) form, it should be kept in mind that can be the sufficient statistic for some prediction variable of interest, or, for a generalized linear model (such as for logistic or linear regression), we can have be a function of both and some covariate (see Dobson [1990]). We return to this point later.
Our prediction loss is the likelihood function and is the optimal parameter, i.e.
where the is over the natural parameter space and it is assumed that this is an interior point of this space. Later we consider the case where is sparse.
We denote the Fisher information of as , under the model of . The induced “Fisher risk” is
We also consider the risk .
For a sufficiently large sample size, we expect that the Fisher risk of an empirical minimizer , , be close to — one of our main contributions is quantifying when this occurs in general exponential families. This characterization is then used to quantify the convergence rate for methods in these families. We also expect this strong convexity property to be useful for characterizing the performance of other regularization methods as well.
All proofs can be found in the appendix.
3 (Almost) Strong Convexity of Exponential Families
We first consider a certain bounded growth rate condition for standardized moments and standardized cumulants, satisfied by all exponential families. This growth rate is fundamental in establishing how fast the prediction loss behaves as a quadratic function. Interestingly, this growth rate is analogous to those conditions used for obtaining exponential tail bounds for arbitrary random variables.
3.1 Analytic Standardized Moments and Cumulants
Moments:
For a univariate random variable distributed by , let us denote its th central moment (centered at the mean) by:
where is the mean . Recall that the th standardized moment is the ratio of the th central moment to the th power of the standard deviation, i.e. . This normalization with respect to standard deviation makes the standard moments unitless quantities. For and , the standardized moments are the skew and kurtosis.
We now define the analytic standardized moment for
— we use the term analytic to reflect that if the moment generating function of
is analytic^{1}^{1}1Recall that a real valued function is analytic on some domain of if the derivatives of all orders exist, and if for each interior point, the Taylor series converges in some sufficiently small neighborhood of that point. then has an analytic moment.Definition 3.1.
Let be a univariate random variable under . Then has an analytic standardized moment of if the standardized moments exist and are bounded as follows:
(where the above is assumed to hold if the denominator is ). If is a multivariate random variable distributed according to , we say that has an analytic standardized moment of with respect to a subspace (e.g. a set of directions) if the above bound holds for all univariate where .
This condition is rather mild in that the standardized moments increase as fast as (in a sense is just a unitless scale, and it is predominantly the which makes the condition rather mild). This condition is closely related to those used in obtaining sharp exponential type tail bounds for the convergence of a random variable to its mean — in particular, the Bernstein conditions [Bernstein, 1946] are almost identical to the above, expect that they use the th raw moments (not central moments) ^{2}^{2}2The Bernstein inequalities used in deriving tail bounds require that, for all , for some constant (which has units of ). . In fact, these moment conditions are weaker than requiring “subGaussian” tails.
While we would not expect analytic moments to be finite for all distributions (e.g. heavy tailed ones), we will see that exponential families have (finite) analytic standardized moments.
Cumulants:
Recall that the cumulantgenerating function of under is the log of the momentgenerating function, if it exists, i.e. . The th cumulant is given by the th derivate of at , i.e. . The first, second, and third cumulants are just the first, second, and third central moments — higher cumulants are neither moments nor central moments, but rather more complicated polynomial functions of the moments (though these relationships are known). Analogously, the th standardized cumulant is
— this normalization with respect to standard deviation (the second cumulant is the variance) makes these unitless quantities.
Cumulants are viewed as equally fundamental as central moments, and we make use of their behavior as well — in certain settings, it is more natural to work with the cumulants. We define the analytic standardized cumulant analogous to before:
Definition 3.2.
Let be a univariate random variable under . Then has an analytic standardized cumulant of if the standardized cumulants exist and are bounded as follows:
(where the above is assumed to hold if the denominator is ). If is a multivariate random variable distributed according to , we say that has an analytic standardized cumulant of with respect to a subspace if the above bound holds for all univariate where .
Existence:
The following lemma shows that exponential families have (finite) analytic standardized moments and cumulants, as a consequence of the analyticity of the moment and cumulant generating functions (the proof is in the appendix).
Lemma 3.3.
If is the sufficient statistic of an exponential family with parameter , where is an interior point of the natural parameter space, then has both a finite analytic standardized moment and a finite analytic standardized cumulant, with respect to all directions in .
3.2 Examples
Let us consider a few examples. Going through them, there are two issues to bear in mind. First, is quantified only at a particular (later, is the point we will be interested in) — note that we do not require any uniform conditions on any derivatives over all . Second, we are interested in how could depend on the dimensionality — in some cases, is dimension free and in other cases (like for generalized linear models), depends on the dimension through spectral properties of (and this dimension dependence can be relaxed in the sparse case that we consider, as discussed later).
3.2.1 One Dimensional Families
When is a scalar, there is no direction to consider.
Bernoulli distributions
In the canonical form, the Bernoulli distribution is,
with . We have . The central moments satisfy and for . Thus, is a standardized analytic moment at any . Further, for . Thus, is also a standardized analytic cumulant at any .
Unit variance Gaussian distributions
In the canonical form, unit variance Gaussian is,
with . We have and
. Odd central moments are
and for even , we have . Thus, is a standardized analytic moment at any . However, the loglikelihood is already quadratic in this case (as we shall see, there should be no “burn in” phase until it begins to look like a quadratic!). This becomes evident if we consider the cumulants instead. All cumulants for and hence is a standardized analytic cumulant at any — curiously, cumulant generating function cannot be a finite order polynomial of order greater than 2.3.2.2 Multidimensional Gaussian Covariance Estimation (i.e. “Gaussian Graphs”)
Consider a mean zero dimensional multivariate Normal parameterized by the precision matrix ,
A “direction” here is a positive semidefinite (p.s.d.) matrix , and we seek the cumulants of the random variable .
Note that has Wishart distribution with the moment generating function,
Let ’s be the eigenvalues of . Then, taking logs, the cumulant generating function ,
The th derivative of this is
Thus, the cumulant . Hence, for ,
Thus, is a standardized analytic cumulant at . Note that it is harder to estimate the central moments in this case. This example is also interesting in connection to the analysis of Newton’s method as the function is selfconcordant on the cone of p.s.d. matrices.
3.2.3 Generalized Linear Models
Consider the case where we have some covariate, response pair drawn from some distribution . Suppose that we have a family of distributions such that, for each , it is an exponential family with natural sufficient statistic ,
where . The loss we consider is . A special case of this setup is as follows. Say we have a one dimensional exponential family
where . The family can be be simply (i.e. taking ). Thus,
We see that and . For example, when is either the Bernoulli family or the unit variance Gaussian family, this corresponds to logistic regression or least squares regression, respectively. It is easy to see that the analogue of having a standardized analytic moment of at w.r.t. a direction is to have
where
In the above equation, the expectation is under , the marginal of on . If the sufficient statistic is bounded by in the norm a.s. and the expected Fisher information matrix
has minimum eigenvalue , then we can choose . Note that could be small but it arose only because we are considering an arbitrary direction . If the set of directions is smaller, then we can often get less pessimistic bounds. For example, see section 5.2.2 in the appendix. We also note that similar bounds can be derived when we assume subgaussian tails for rather than assuming it is bounded a.s.
3.3 Almost Strong Convexity
Recall that a strictly convex function is strongly convex if the Hessian of has a (uniformly) lower bounded eigenvalue (see Boyd and Vandenberghe [2004]). Unfortunately, as for all strictly convex functions, exponential families only behave in a strongly convex manner in a (sufficiently small) neighborhood of . Our first main result quantifies when this behavior is exhibited.
Theorem 3.4.
(Almost Strong Convexity) Let be either the analytic standardized moment or cumulant under with respect to a subspace . For any such that , if either
then
Suppose is an MLE. Both preconditions can be thought of as a “burn in” phase — the idea being that initially a certain number of samples is needed until the loss of is somewhat close to the minimal loss; after which point, the quadratic lower bound engages. This is analogous to the analysis of the Newton’s method, which quantifies the number of steps needed to enter the quadratically convergent phase (see Boyd and Vandenberghe [2004]). The constants of and can be made arbitrarily close to (with a longer “burn in” phase), as expected under the central limit theorem.
A key idea in the proof is an expansion of the prediction regret in terms of the central moments. We use the shorthand notation of and to denote the cumulants and moments of the random variable under the distribution .
Lemma 3.5.
(Moment and Cumulant Expansion) Define . For all ,
where the equalities hold if the right hand sides converge.
The proof of this Lemma (in the appendix) is relatively straightforward. The key technical step in the proof of Theorem 3.4 is characterizing when these expansions converge. Note that for , even if (one of our preconditions), a direct attempt at lower bounding using the above expansions with the analytic moment condition would not imply these expansions converge — the proof requires a more delicate argument.
4 Sparsity
We now consider the case where is sparse, with support and sparsity level , i.e.
In order to understand when regularized algorithms (for linear regression) converge at a rate comparable to that of algorithms (subset selection), Meinshausen and Yu [2009] considered a sparse eigenvalue condition on the design matrix, where the eigenvalues on any small (sparse) subset are bounded away from 0. Bickel et al. [2008] relaxed this condition so that vectors whose support is “mostly” on any small subset are not too small (see Bickel et al. [2008] for a discussion). We also consider this relaxed condition, but now on the Fisher matrix.
Assumption 4.1.
(Restricted Fisher Eigenvalues) For a vector , let be the vector such that and is on the other coordinates, and let denote the complement of . Assume that:
The constant of is for convenience. Note we only quantify on the support — a substantially weaker condition than in Meinshausen and Yu [2009], Bickel et al. [2008], which quantify over all subsets (in fact, many previous algorithms/analysis actually use this condition on subsets different from , e.g. Meinshausen and Yu [2009], Candes and Tao [2007], Zhang [2008]).
Furthermore, with regards to our analyticity conditions, our proof shows that the subspace of directions we need to consider is now restricted to the set:
(1) 
Under this Restricted Eigenvalue (RE) condition, we can replace the minimal eigenvalue used in Example 3.2.3 by (section 5.2.2 in appendix), which could be significantly smaller.
4.1 Fisher Risk
Consider the following regularized optimization problem:
(2) 
where the empirical expectation is with respect to a sample. This reduces to the usual linear regression example (for Gaussian means) and involves the logdeterminant in Gaussian graph setting (considered in Ravikumar et al. [2008b]) where is the precision matrix (see Example 3.2.2).
Our next main result provides a risk bound, under the RE condition. Typically, the regularization parameter is specified as a function of the noise level, under a particular noise model (e.g. for linear regression case, where with the noise model , is specified as [Meinshausen and Yu, 2009, Bickel et al., 2008]). Here, our theorem is stated in a deterministic manner (i.e. it is a distribution free statement), to explicitly show that an appropriate value of is determined by the norm of the measurement error, i.e. — we then easily quantify in a corollary under a mild distributional assumption. Also, we must have that this measurement error be (quantifiably) sufficiently small such that our “burn in” condition holds.
Theorem 4.2.
(Risk) Suppose that Assumption 4.1 holds and satisfies both
(3) 
where is the analytic standardized moment or cumulant of for the subspace defined in (1). (Note this setting requires that be sufficiently small). Then if is the solution to the optimization problem in (2), the Fisher risk is bounded as follows
and the risk is bounded as follows:
Intuitively, we expect the measurement error to be , so we think of . Note this would recover the usual (optimal) risk bound of (i.e. the same rate as an algorithm, up to the RE constant). Note that the mild dimension dependence enters through the measurement error. Hence, our theorem shows that all exponential families exhibit favorable convergence rates under the RE condition.
The following proposition and corollary quantify this under a mild (and standard) distributional assumption (which can actually be relaxed somewhat).
Proposition 4.3.
Bounded random variables are in fact subGaussian (though unbounded may also be subGaussian, e.g. Gaussian random variables are obviously subGaussian). The following corollary is immediate.
4.2 Approximate Model Selection
An important issue unaddressed by the previous result is the sparsity level of our estimate . For the linear regression case, Meinshausen and Yu [2009], Bickel et al. [2008] show that the solution is actually sparse, with a sparsity level of roughly , (i.e. the sparsity level increases by a factor which is essentially a condition number squared). In the general setting, we do not have a characterization of the actual sparsity level of the solution.
However, we now present a two stage procedure, which provides an estimate with support on merely features, with nearly as good risk (ShalevShwartz et al. [2009] discuss this issue of trading sparsity for accuracy, but their results are more applicable to settings with rates.). Consider the procedure where we select the set of coordinates which have large weight under (say greater than some threshold ). Then we refit to find an estimate with support only on these coordinates. That is, we restrict our estimate to the set . This algorithm is:
(4) 
Theorem 4.5.
(Sparsity) Suppose that 4.1 holds and the regularization parameter satisfies both
(5) 
where is the analytic standardized moment or cumulant of for the subspace defined in (1). If is the solution of (2) with this and is the solution of (4) with threshold and this , then:

has support on at most coordinates.

The Fisher risk is bounded as follows:
Using Proposition 4.3, we have following corollary.
References
 Bernstein [1946] S. Bernstein. The Theory of Probabilities. Gastehizdat Publishing House, Moscow, 1946.
 Bickel et al. [2008] Peter J. Bickel, Ya’acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37(4):1705–1732, 2008.
 Boyd and Vandenberghe [2004] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
 Brown [1986] Lawrence D. Brown. Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics, 1986.
 Candes and Tao [2007] Emmanuel Candes and Terence Tao. The Dantzig selector: Statistical estimation when is much larger than . Annals of Statistics, 35:2313, 2007.
 Dobson [1990] A.J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
 Ghosal [2000] Subhashis Ghosal. Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivar. Anal., 74(1):49–68, 2000. ISSN 0047259X.
 Krantz and Parks [2002] Steven G. Krantz and Harold R. Parks. A Primer of Real Analytic Functions. Birkhäuser, 2002.

Meinshausen and Yu [2009]
Nicolai Meinshausen and Bin Yu.
Lassotype recovery of sparse representations for highdimensional data.
Annals of Statistics, 37:246, 2009.  Portnoy [1988] S. Portnoy. Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity. Annals of Statistics, 16, 1988.
 Ravikumar et al. [2008a] P. Ravikumar, M. J. Wainwright, and J. Lafferty. Highdimensional ising model selection using l1regularized logistic regression. Technical Report Technical Report 750, UC Berkeley, Department of Statistics., 2008a.
 Ravikumar et al. [2008b] Pradeep Ravikumar, Martin J Wainwright, Garvesh Raskutti, and Bin Yu. Highdimensional covariance estimation by minimizing l1penalized logdeterminant divergence. Technical Report arXiv:0811.3628, Nov 2008b.
 ShalevShwartz et al. [2009] S. ShalevShwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, 2009. Available at ttic.uchicago.edu/shai.
 Zhang [2008] T. Zhang. Adaptive forwardbackward greedy algorithm for sparse learning with linear models. In Advances in Neural Information Processing Systems 22, 2008.

Zhao and Yu [2006]
P. Zhao and B. Yu.
On model selection consistency of lasso.
Journal of Machine Learning Research
, 7:2541–2567, 2006.
5 Appendix
5.1 Proofs for Section 3
Proof.
(of Lemma 3.3) The proof shows that the central moment generating function of , namely , is analytic at . First, notice that
It is known that for exponential families, (namely, the partition function) is analytic in the interior of (see [4]). Since is also analytic (as a function of ), we have by the chain of equalities above that the central moment generating function is also analytic (as a function of ) for any at the interior of . This property implies that the derivatives of the central moment generating function at (namely, the moments ) cannot grow too fast with . In particular, by proposition 2.2.10 in [8], it holds for all that the th derivative (which is equal to ) is at most for some constant . As a result, is at most for a suitable constant . Thus, has finite analytic standardized moment with respect to all directions.
As to the assertion about having finite analytic standardized cumulant, notice that our argument above also implies that the (raw) moment generating function, , is analytic. Therefore, , which is the cumulant generating function, is also analytic (since the logarithm is an analytic function). An analysis completely identical to the above leads to the desired conclusion about the cumulants of . ∎
From here on, we slightly abuse notation and let be the th central moment of the univariate random variable distributed under .
Proof.
The following upper and lower bounds are useful in that they guarantee the sum converges for the choice of specified.
Lemma 5.1.
Let and be defined as in Theorem 3.4. Let and set . If is is an analytic moment, then
If is is an analytic cumulant, then
Proof.
We only prove the analytic moment case (the proof for the cumulant case is identical). First let us show that:
We can bound the following sum from onwards as:
which proves the claim.
For our choice of ,
Hence, we have:
Analogously, the upper bound can be proved. ∎
The following core lemma leads to the proof of Theorem 3.4.
Lemma 5.2.
Proof.
As is clearly in and by convexity, we have:
For the cumulant case, we have that this is lower bounded by using Lemma 5.1 and Lemma 3.5, which proves (6). Now consider the analytic moment case. By, Lemma 3.5, we have
Now by Jensen’s inequality, we know that the fourth standardized moment (the kurtosis) is greater than one, so (since ). This implies that:
since the sum is only larger if we choose any argument in the . Now for , we have that . Proceeding,
which proves (6) (for the analytic moment case).
For the second claim, the precondition implies that the max, in (6), will be achieved with the argument of , which directly implies the lower bound. For the upper bound, we can apply Lemma 5.1 with ( under our precondition), which implies that is less than . The claim follows directly for the cumulant case using Lemma 3.5, with . For the moment case, we use that . ∎
We are now ready to prove Theorem 3.4.
Proof.
(of Theorem 3.4) If , then the previous Lemma implies the claim. Let us assume the condition on the loss, i.e. . If , then we are done by the previous argument. So let us assume that