1 Introduction
The Gaussian width is an important measure of the complexity of a set, and it plays an important role in geometry, statistics and probability theory. Most relevant to this paper is its central role in empirical process theory, where the Gaussian width and its Bernoulli analogue (known as the Rademacher width) can be used to upper bound the error for various types of nonparametric estimators
[26, 27, 3, 15, 5]. More recently, these same complexity measures have also been shown to play an important role in high-dimensional testing problems [31]. For a general set, it is nontrivial to provide analytical expressions for its Gaussian or Rademacher width. There are a variety of techniques for obtaining bounds, including upper bounds via the classical entropy integral of Dudley, as well as lower bounds due to Sudakov and Fernique (see the book [16] for details on these and other results). More recently, Talagrand [22] has introduced a generic chaining technique that leads to sharp lower and upper bounds. However, for a general set, it is typically impossible to evaluate the expressions obtained from generic chaining, and so for applications in statistics, it is of considerable interest to develop techniques that yield tractable characterizations of various forms of widths.
In this paper, we study a class of Gaussian widths that arise in the context of estimation over (possibly infinite-dimensional) ellipses. As we describe below, many nonparametric problems, among them regression and density estimation over classes of smooth functions, can be reduced to such ellipse estimation problems. Obtaining sharp rates for such estimation problems requires studying a localized notion of Gaussian width, in which the ellipse is intersected with a Euclidean ball around the element being estimated. The main technical contribution of this paper is to show how this localized Gaussian width can be bounded, from both above and below, using a localized form of the Kolmogorov width [19]. As we show with a number of corollaries, this Kolmogorov width can be calculated in many interesting examples.
Our work makes a connection to the evolving line of work on instance-specific rates in estimation and testing. Within the decision-theoretic framework, the classical approach is to study the (global) minimax risk over a certain problem class. In this framework, methods are compared via their worst-case behavior as measured by performance over the entire problem class. For the ellipse problems considered here, global minimax risks in various norms are well understood; for instance, see the classic papers [20, 11, 12]. When the risk function is nearly constant over the set, the global minimax risk is reflective of the typical behavior. If not, then one is motivated to seek more refined ways of characterizing the hardness of different problems, and the performance of different estimators.
One way of doing so is by studying the notion of an adaptive estimator, meaning one whose performance automatically adapts to some (unknown) property of the underlying function being estimated. For instance, estimators using wavelet bases are known to be adaptive to the unknown degree of smoothness [7, 8]. Similarly, in the context of shape-constrained problems, there is a line of work showing that for functions with simpler structure, it is possible to achieve faster rates than the global minimax ones (e.g., [18, 33, 6]). A related line of work, including some of our own, has studied adaptivity in the context of hypothesis testing (e.g., [25, 2, 30]). The adaptive estimation rates established in this work also share this spirit of being instance-specific.
1.1 Some motivating examples
A primary motivation for our work is to understand the behavior of least-squares estimators over ellipses. Accordingly, let us give a precise definition of the ellipse estimation problem, along with some motivating examples.
Given a fixed integer $d \ge 1$ and a sequence of nonnegative scalars $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_d \ge 0$, we can define an elliptical norm on $\mathbb{R}^d$ via $\|\theta\|_{\mathcal{E}}^2 := \sum_{j=1}^d \theta_j^2 / \mu_j$. Here for any coefficient $\mu_j = 0$, we interpret the constraint as enforcing that $\theta_j = 0$. For any radius $R > 0$, this seminorm defines an ellipse of the form
$$\mathcal{E}(R) := \Bigl\{ \theta \in \mathbb{R}^d \; : \; \sum_{j=1}^d \frac{\theta_j^2}{\mu_j} \le R^2 \Bigr\}. \tag{1}$$
We frequently focus on the case $R = 1$, in which case we adopt the shorthand $\mathcal{E}$ for the set $\mathcal{E}(1)$. Whereas equation (1) defines a finite-dimensional ellipse, it should be noted that our theory also applies to infinite-dimensional ellipses for sequences $\{\mu_j\}_{j=1}^{\infty}$ that are summable. Such results can be recovered by studying a truncated version of the ellipse with finite dimension $d$, and then taking suitable limits. In order to simplify the exposition, we develop our results for finite $d$, noting how they extend to infinite dimensions after stating our results.
Suppose that for some unknown vector $\theta^* \in \mathcal{E}$, we make noisy observations of the form
$$y = \theta^* + \sigma g, \qquad g \sim N(0, I_d). \tag{2}$$
We assume that the ellipse $\mathcal{E}$ and the noise standard deviation $\sigma > 0$ are known. The goal of ellipse estimation is to specify a mapping $y \mapsto \hat{\theta}(y)$ such that the associated Euclidean risk $\mathbb{E}\|\hat{\theta}(y) - \theta^*\|_2^2$ is as small as possible. Let us consider some concrete problems that can be reduced to instances of ellipse estimation.
Example 1 (Linear prediction with correlated designs).
Suppose that we make observations from the standard linear model
Here the response vector is observed, the design matrix is fixed and nonrandom, and the observations are corrupted by noise. Suppose moreover that we know a priori that the Euclidean norm of the regression vector is bounded by some radius. Alternatively, we can think of a condition of this form arising implicitly when using estimators such as ridge regression.
Given an estimate, its prediction accuracy can be assessed via the mean-squared error, where the expectation is taken over the observation noise. Equivalently, after an orthogonal transformation of the data, our problem is to minimize the mean-squared error of a transformed parameter. After this transformation, we arrive at an observation model that is a version of our original model (2). Moreover, the constraint on the norm of the regression vector translates into an ellipse constraint on the transformed parameter.
In particular, the ellipse is determined by the nonzero eigenvalues of the Gram matrix of the design. As shown in Figure 1, it is natural to conjecture that the location of the target vector within this ellipse affects the difficulty of estimation. Note that, on average, the observed vector lies at squared Euclidean distance proportional to the dimension from the true vector. In certain favorable cases, such as a vector that lies at or close to the boundary of an elongated side of the ellipse, the side knowledge provided by the ellipse is helpful. In other cases, such as a vector that lies closer to the center of the ellipse, the elliptical constraint is less helpful. The theory developed in this paper makes this intuition precise. In particular, Section 4 is devoted to a number of consequences of our main results for the problem of estimation in ellipses.
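To make the reduction described above concrete, here is a small numerical sketch of passing from a linear model to the sequence form. All names here (the design `X`, the vector `beta_star`, the `1/sqrt(n)` scaling) are illustrative assumptions, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.standard_normal((n, d))            # hypothetical design matrix
beta_star = rng.standard_normal(d) / np.sqrt(d)

# SVD X = U S V^T; take theta* = S V^T beta* / sqrt(n) as the transformed target
U, S, Vt = np.linalg.svd(X, full_matrices=False)
theta_star = (S * (Vt @ beta_star)) / np.sqrt(n)

# A Euclidean-norm ball on beta maps to the ellipse with aspect ratios
# mu_j = S_j^2 / n, i.e. the eigenvalues of X^T X / n
mu = S**2 / n
ellipse_norm_sq = np.sum(theta_star**2 / mu)
# the elliptical norm of theta* recovers the Euclidean norm of beta*
assert np.isclose(ellipse_norm_sq, np.linalg.norm(beta_star)**2)
```

The check at the end shows why the norm ball on the regression vector becomes an ellipse constraint: the elliptical norm of the transformed target equals the Euclidean norm of the original vector.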
Example 2 (Nonparametric regression using reproducing kernels).
We now turn to a class of nonparametric problems that involve a form of ellipse estimation. Suppose that our goal is to predict a response based on observing a collection of predictors. Assuming that pairs are drawn jointly from some unknown distribution, the optimal prediction in terms of mean-squared error is given by the conditional expectation. Given a collection of samples, the goal of nonparametric regression is to produce an estimate that is as close to the regression function as possible.
Assuming that the samples are i.i.d., we can rewrite our observations in the form
(3) 
where the noise terms form an independent sequence of zero-mean variables with unit variance. A computationally attractive way of estimating the regression function is to perform least-squares regression over a reproducing kernel Hilbert space, or RKHS for short [1, 13, 10, 28]. Any such function class is defined by a symmetric, positive definite kernel function; standard examples include the Gaussian kernel, the Laplace kernel, and the Sobolev (spline) kernels; see Figure 2 for some illustrative examples. Now suppose that the regression function belongs to the RKHS induced by the kernel, with a known bound on its Hilbert norm. In this case, the representer theorem [13] implies that the observation model (3) is equivalent to a finite-dimensional model defined by the kernel matrix, whose entries are given by evaluating the kernel on pairs of samples. The representer theorem and our choice of scaling ensure that the target vector belongs to an ellipse defined by the symmetric and positive semidefinite kernel matrix.
Note that the kernel matrix can be diagonalized as an orthonormal transformation of a diagonal matrix of nonnegative eigenvalues. Following this transformation, we arrive at an instance of the standard ellipse model, in which the target vector belongs to the standard ellipse (1) defined by the eigenvalues of the kernel matrix. Note that the transformed noise vector has zero-mean entries, each with standard deviation proportional to the original noise level. The entries of this noise vector are not exactly Gaussian (unless the initial noise vector was jointly Gaussian), but they are often well approximated by Gaussian variables due to central limit behavior for large sample sizes.
1.2 Organization
The remainder of this paper is organized as follows. In Section 2, we introduce some background on approximationtheoretic quantities, including the Gaussian width, metric entropy, and the Kolmogorov width. Section 3 is devoted to the statement of our main results, while Section 4 develops a number of their specific consequences for ellipse estimation. In Section 5, we provide the proofs of our main results, with more technical aspects of the arguments provided in the appendices.
2 Background
Before proceeding to the statements of our main results, we introduce some background on the notions of Gaussian width and Kolmogorov width, and set up the estimation problem with an ellipse constraint.
2.1 Gaussian width
Given a bounded subset $T \subseteq \mathbb{R}^d$, the Gaussian width of $T$ is defined as
$$\mathcal{G}(T) := \mathbb{E}\Bigl[\, \sup_{v \in T} \langle g, v \rangle \Bigr], \qquad g \sim N(0, I_d). \tag{4}$$
It measures the size of the set in a certain sense.
It is also useful to define the classical notions of packing and covering entropy. An $\epsilon$-cover of a set $T$ with respect to a metric $\rho$ is a discrete set $\{v^1, \ldots, v^N\} \subset T$ such that for each $v \in T$, there exists some $v^i$ satisfying $\rho(v, v^i) \le \epsilon$. The $\epsilon$-covering number $N(\epsilon; T, \rho)$ is the cardinality of the smallest such cover, and the logarithm of this number is called the covering metric entropy of the set $T$.
Similarly, an $\epsilon$-packing of a set $T$ is a subset $\{v^1, \ldots, v^M\} \subset T$ satisfying $\rho(v^i, v^j) > \epsilon$ for all $i \ne j$. The size of the largest such packing is called the $\epsilon$-packing number of $T$, which we denote by $M(\epsilon; T, \rho)$. It is related to the covering number by the inequalities
$$M(2\epsilon; T, \rho) \;\le\; N(\epsilon; T, \rho) \;\le\; M(\epsilon; T, \rho).$$
For this reason, we use the term metric entropy to refer to either the covering or the packing metric entropy, since they differ only by constant factors in the scale $\epsilon$.
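The relation between packing and covering can be illustrated numerically: a maximal $\epsilon$-packing is automatically an $\epsilon$-cover, since any point further than $\epsilon$ from every center could be added to the packing. The sketch below uses a discretized unit interval as a stand-in for a general set $T$.

```python
import numpy as np

def greedy_packing(points, eps):
    """Greedily select centers pairwise more than eps apart (an eps-packing)."""
    centers = []
    for p in points:
        if all(abs(p - c) > eps for c in centers):
            centers.append(p)
    return centers

T = np.linspace(0.0, 1.0, 1001)   # stand-in set: a fine grid on [0, 1]
eps = 0.1
centers = greedy_packing(T, eps)

# Maximality makes the packing an eps-cover: every point of T is within
# eps of some selected center
assert all(min(abs(t - c) for c in centers) <= eps for t in T)
```

This is exactly the argument behind the right-hand inequality relating packing and covering numbers.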
The connection between Gaussian width and metric entropy is well studied (e.g., [9, 23, 29]). For our future discussion, we collect a few results here for reference. First, Dudley's entropy integral [9] is an upper bound for the Gaussian width—viz.
$$\mathcal{G}(T) \;\le\; c \int_0^\infty \sqrt{\log N(\epsilon; T, \|\cdot\|_2)} \; d\epsilon \tag{5}$$
for some universal constant $c > 0$. This upper bound also holds for more general sub-Gaussian processes. Dudley's bound can be much looser than the more refined bounds obtained through Talagrand's generic chaining, which are tight up to a universal constant [23, Thm. 2.4.1]. For Gaussian processes like ours, Sudakov minoration (e.g., [4, Thm. 13.4]) provides a lower bound on the Gaussian width:
$$\mathcal{G}(T) \;\ge\; c' \, \sup_{\epsilon > 0} \, \epsilon \sqrt{\log M(\epsilon; T, \|\cdot\|_2)}. \tag{6}$$
Although we do not directly use this lower bound when proving our main lower bound (Theorem 2) below, we follow its spirit by constructing a large collection of well-separated points.
2.2 Kolmogorov width
In this section, we define the Kolmogorov width and briefly review its properties. This geometric quantity plays a central role in our main results.
For a given compact set $T \subset \mathbb{R}^d$ and integer $k \ge 1$, the Kolmogorov $k$-width of $T$ is given by
$$d_k(T) := \min_{\Pi_k} \max_{\theta \in T} \|\theta - \Pi_k \theta\|_2, \tag{7}$$
where the minimum is taken over the set of all $k$-dimensional orthogonal linear projections $\Pi_k$, and $\Pi_k \theta$ denotes the projection of $\theta$ onto the corresponding $k$-dimensional linear space. Any projection achieving the minimum in expression (7) is said to be an optimal projection for $T$. Note that the Kolmogorov width is a nonincreasing function of $k$, meaning that $d_1(T) \ge d_2(T) \ge \cdots \ge 0$.
We refer the reader to the book by Pinkus [19] for more details on the Kolmogorov width and its properties.
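For ellipses with nonincreasing aspect ratios, a classical fact (see Pinkus [19]) is that the Kolmogorov $k$-width equals the $(k{+}1)$-st semi-axis, with the optimal projection given by the top-$k$ coordinate subspace. The numerical sketch below checks the corresponding upper bound for the coordinate projection; the specific aspect ratios are illustrative.

```python
import numpy as np

# Ellipse with nonincreasing aspect ratios mu_1 >= ... >= mu_d; the claimed
# optimal projection is onto the first k coordinates, with worst-case
# residual sqrt(mu_{k+1})
mu = np.array([4.0, 2.0, 1.0, 0.25])
k = 2

# Random points on the boundary of the ellipse sum_j theta_j^2 / mu_j = 1
rng = np.random.default_rng(3)
u = rng.standard_normal((20000, mu.size))
theta = u / np.sqrt(np.sum(u**2 / mu, axis=1, keepdims=True))

# Distance from theta to span(e_1, ..., e_k) is the norm of the trailing
# coordinates; it never exceeds sqrt(mu_{k+1})
residual = np.sqrt(np.sum(theta[:, k:]**2, axis=1))
assert residual.max() <= np.sqrt(mu[k]) + 1e-9
```

The bound holds because each trailing coordinate contributes at most $\mu_{k+1}$ times its share of the elliptical constraint; equality is attained by the point lying on the $(k{+}1)$-st semi-axis.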
3 Main results
Let us first formally define the notion of localized Gaussian width, and then turn to the statement of our main results.
3.1 Localized Gaussian width
Let $\mathbb{B}(\delta) := \{v \in \mathbb{R}^d : \|v\|_2 \le \delta\}$ denote the Euclidean ball of radius $\delta$, and for a given vector $\theta^* \in \mathcal{E}$, define the shifted ellipse $\mathcal{E} - \theta^* := \{\theta - \theta^* : \theta \in \mathcal{E}\}$. The localized Gaussian width at $\theta^*$ and scale $\delta > 0$ is defined as
$$\mathcal{G}_{\theta^*}(\delta) := \mathbb{E}\Bigl[ \sup_{v \in (\mathcal{E} - \theta^*) \cap \mathbb{B}(\delta)} \langle g, v \rangle \Bigr], \qquad g \sim N(0, I_d). \tag{8}$$
Note that this quantity is simply the ordinary Gaussian width of the set $(\mathcal{E} - \theta^*) \cap \mathbb{B}(\delta)$, and we say that it is localized since the Euclidean ball restricts it to a neighborhood of $\theta^*$. See Figure 3 for an illustration of this set.
We note that localized forms of Gaussian and Rademacher complexity are standard in the literature on empirical processes (e.g., [3, 14]), where it is known that they are needed to obtain sharp rates. In the case of least-squares estimation over convex sets, there is an explicit connection between the localized Gaussian width and the associated estimation error [26, 5, 29]; we describe this relationship in more detail in Section 4 and Appendix D.
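A crude Monte Carlo sketch can help build intuition for the localized Gaussian width (8): sample points of the recentered ellipse that fall inside the ball, and average the resulting suprema over Gaussian draws. The ellipse, center, and scale below are illustrative choices, and the sampled supremum only lower-bounds the true one.

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([4.0, 0.25])          # aspect ratios of a 2-d ellipse
theta_star = np.array([1.0, 0.0])   # an interior point of the ellipse
delta = 0.5

# Sample points of the ellipse, recenter at theta_star, and keep those
# falling inside the Euclidean ball of radius delta
u = rng.standard_normal((50000, 2))
r = np.sqrt(rng.uniform(size=(50000, 1)))
theta = r * u / np.sqrt(np.sum(u**2 / mu, axis=1, keepdims=True))
v = theta - theta_star
v = v[np.linalg.norm(v, axis=1) <= delta]

# Average over Gaussian draws of the (sampled) supremum of <g, v>
g = rng.standard_normal((300, 2))
width_mc = np.max(g @ v.T, axis=1).mean()
# deterministic sanity check: the width is at most delta * E||g||_2
assert 0.0 < width_mc <= delta * np.linalg.norm(g, axis=1).mean()
```

The final inequality reflects the trivial bound by the ball alone; the localized width is smaller whenever the ellipse cuts away part of the ball.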
Our main results, to be stated in the following subsections, provide conditions under which we can provide a sharp characterization of the localized Gaussian width (8) in terms of the Kolmogorov width.
3.2 Upper bound on the localized Gaussian width
In order to state our first main result, we introduce an approximation-theoretic quantity having to do with the quality of a given low-dimensional projection. For a given integer and any linear projection of that dimension, let us define the set
(9) 
Here the inequality is interpreted coordinate-wise. It can be verified that the set is always nonempty, since the vector of all ones belongs to it. To provide some intuition for this definition, the first vector corresponds to the error incurred by using the subspace associated with the projection to approximate the recentered ellipse. The positive weight vector allows us to weight the entries of this error vector when computing the Euclidean norm.
We are now ready to state an upper bound on the localized Gaussian width.
Theorem 1.
Given any , projection tuple , and vector , we have
(10) 
See Section 5.1 for the proof of this result.
Note that Theorem 1 holds for any dimension and projection pair. Often, we can choose a specific pair for which the set is easy to characterize. In particular, given any fixed , let us define the critical dimension
(11) 
for some constant. In words, this integer is the minimal dimension for which there exists a projection of that dimension approximating a neighborhood of the recentered ellipse to the stated accuracy.¹ Although our notation does not explicitly reflect it, note that the critical dimension also depends on the ellipse.
¹The constants here are chosen for the sake of convenience in the proof, but other choices of these quantities (which both must be strictly less than one) are also possible.
Given the critical dimension, we let denote the minimizing projection in the definition (7) of the Kolmogorov width, and note that for any vector, the error associated with this projection is given accordingly. As can be seen in our later examples, this particular choice often yields tight control of the localized Gaussian width. So as to streamline notation, we adopt a shorthand for this quantity.
Regularity assumption:
For many ellipses encountered in practice, the first term in the upper bound (10) dominates the second term involving the set . In order to capture this condition, we say that the ellipse is regular at if there exists some pair such that
(12) 
Here is any universal constant. When this condition holds, Theorem 1 implies the existence of another universal constant such that
(13) 
As is shown in Appendix A, the regularity condition (12) is a generalization of a condition previously introduced by Yang et al. [32] in the context of kernel ridge regression, and it holds for many examples encountered in practice.
As a direct consequence of Theorem 1, the following corollary holds.
Corollary 1.
If the regularity assumption (12) is satisfied with dimension and projection pair , then the localized Gaussian width satisfies
(14) 
Let us illustrate the regularity condition (12) and associated consequences of Theorem 1 with some examples.
Example 3 (Gaussian width of the Euclidean ball).
We begin with a simple example: suppose that the ellipse is the Euclidean ball in , specified by the aspect ratios for all , and let us use Theorem 1 to upper bound the Gaussian width at . For and any integer , we have , because any projection of strictly smaller dimension must neglect at least one coordinate. Since , we conclude that for all . With this choice of , there is no error in the projection, meaning that . Consequently, the regularity condition (12) certainly holds, so that Theorem 1 implies that
In fact, a direct calculation yields that , where is a quantity tending to zero as the dimension grows (e.g., [29]). Consequently, our bound is asymptotically sharp up to the constant prefactor in this special case.
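The direct calculation mentioned here can be checked by simulation: the Gaussian width of the unit Euclidean ball is the expected Euclidean norm of a standard Gaussian vector, which approaches the square root of the dimension.

```python
import numpy as np

# Monte Carlo check: for the unit ball, sup_{||v|| <= 1} <g, v> = ||g||_2,
# so the Gaussian width (4) is E||g||_2 ~ sqrt(d) for large d
rng = np.random.default_rng(6)
d, trials = 400, 2000
g = rng.standard_normal((trials, d))
width_mc = np.linalg.norm(g, axis=1).mean()
assert abs(width_mc / np.sqrt(d) - 1.0) < 0.02
```

The small deficit from $\sqrt{d}$ is the vanishing correction term referred to in the text.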
We now turn to a second example that arises in nonparametric
regression and density estimation under smoothness constraints:
Example 4 (Gaussian width for Sobolev ellipses).
Now consider an ellipse defined by the aspect ratios , where is a smoothness parameter. Ellipses of this form arise when studying nonparametric estimation problems involving functions that are times differentiable with Lebesgue-integrable derivative [24]. Let us again use Theorem 1 to upper bound the localized Gaussian width at . From classical results on Kolmogorov widths of ellipses [19] (see also [30, Sec. 4.3]), we know that . Taking into account the intersection with the Euclidean ball, we find that
(15) 
valid for any . Since , we conclude that
again valid for all . Here the last inequality uses the fact that .
This argument also shows that the corresponding projection subspace is spanned by the first standard basis vectors. With this projection, any feasible vector satisfies , meaning that
(16) 
On the other hand, we also have , so there exists some constant such that the regularity condition (12) is validated. Therefore, Theorem 1 guarantees that
(17) 
In fact, the above bound (17) can be shown to be tight up to a constant prefactor. See the discussion following Corollary 2 in the sequel for further details.
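The scaling of the critical dimension in this example can also be checked numerically. Assuming aspect ratios decaying as $j^{-2\alpha}$ and ignoring the constants in definition (11), the smallest $k$ at which the Kolmogorov width drops below $\delta$ scales as $\delta^{-1/\alpha}$:

```python
import numpy as np

alpha, d = 2.0, 1000
mu = (1.0 + np.arange(d)) ** (-2 * alpha)      # aspect ratios mu_j = j^{-2 alpha}

for delta in [0.1, 0.01, 0.001]:
    # smallest k with Kolmogorov width sqrt(mu_{k+1}) <= delta
    k = int(np.argmax(np.sqrt(mu) <= delta))
    # predicted scaling: k ~ delta^{-1/alpha}, up to a factor close to one
    assert abs(k / delta ** (-1.0 / alpha) - 1.0) < 0.2
```

This matches the $\delta^{-1/\alpha}$ scaling used throughout the Sobolev example, with constants suppressed.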
3.3 Lower bound on the localized Gaussian width
Thus far, we have derived an upper bound on the localized Gaussian width. In this section, we use information-theoretic methods to prove an analogous lower bound. This lower bound involves both the critical dimension previously defined in equation (11), and a second quantity that measures the proximity of to the boundary of the ellipse. More precisely, for a given , define the mapping via
(18) 
As shown by Wei and Wainwright [30], this mapping is well-defined, and has the limiting behavior as ; for completeness, we include the verification of these claims in Appendix G, along with a sketch of the function. Let us denote by the largest positive value such that . Note that by this definition, we have .
Recall that the elliptical norm on $\mathbb{R}^d$ is defined via $\|\theta\|_{\mathcal{E}}^2 = \sum_{j=1}^d \theta_j^2 / \mu_j$. We are now ready to state our lower bound for the localized Gaussian width.
Theorem 2.
There exist universal constants such that for all
(19) 
See Section 5.2 for the proof of this theorem.
We remark that the regularity condition (12) is not necessary for this result to hold. Moreover, to interpret the stated inequality, note that it is equivalent to requiring that not lie too close to the boundary of the ellipse; we assume this because the near-boundary case is not our primary interest. Concretely, if we assume that , then , and therefore .
3.4 Some consequences
One useful consequence of Theorems 1 and 2 is in providing sufficient conditions for tight control of the localized Gaussian width. If the ellipse is regular at , then the above theorems imply that the localized Gaussian width (8) is characterized, up to a multiplicative constant, in terms of the critical dimension. Specifically, we have the sandwich relation
(20) 
for some positive constants and .
Recall our earlier calculation from Example 3, where we showed that the localized Gaussian width scales as the square root of the ambient dimension, up to multiplicative constants. The sandwich relation (20) shows that this same scaling holds more generally with the ambient dimension replaced by the critical dimension. Thus, we can think of the critical dimension as the "effective dimension" of the set.
It is worth pointing out that our results have a number of corollaries, in particular in terms of how local Gaussian widths and Kolmogorov widths are related to metric entropy. Recall the notion of the metric (packing) entropy previously defined in Section 2.1. The following corollary provides a two-sided bound in terms of the metric entropy of the set.
Corollary 2.
There are universal constants such that for any pair satisfying the regularity condition (12), we have
(21) 
See Appendix B for the proof. The lower bound (i) is a relatively straightforward consequence of Sudakov’s inequality (6), when combined with our results connecting the Kolmogorov and Gaussian widths. The upper bound (ii) requires a lengthier argument.
Recall that in Example 4, we argued that for the Sobolev ellipse with smoothness , the Kolmogorov width at is given by . Combining this calculation with Corollary 2, we find that up to a multiplicative constant. This is a known fact that can be verified by constructing explicit packings of these function classes, but it serves to illustrate the sharpness of our results in this particular context.
4 Consequences for estimation
In the previous section, we established upper and lower bounds on the localized Gaussian width in Theorem 1 and Theorem 2. We now turn to some consequences of these bounds, in particular for the problem of constrained leastsquares estimation.
In particular, suppose that we are given observations according to the earlier model (2), and consider the constrained least-squares estimator (LSE)
$$\hat{\theta} := \arg\min_{\theta \in \mathcal{E}} \|y - \theta\|_2^2. \tag{22}$$
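For ellipse constraints, the LSE (22) is a Euclidean projection onto the ellipse, which can be computed by bisection on a Lagrange multiplier. The sketch below is one standard way to evaluate such a projection; the ellipse and data are illustrative, and the stationarity formula assumes the constraint is active.

```python
import numpy as np

def project_to_ellipse(y, mu, iters=200):
    """Euclidean projection of y onto {theta : sum_j theta_j^2 / mu_j <= 1}."""
    if np.sum(y**2 / mu) <= 1.0:           # y already feasible
        return y.copy()
    lo, hi = 0.0, 1e12
    for _ in range(iters):                  # bisect on the multiplier lam
        lam = 0.5 * (lo + hi)
        theta = y / (1.0 + lam / mu)        # stationarity of the Lagrangian
        if np.sum(theta**2 / mu) > 1.0:
            lo = lam                        # still infeasible: increase lam
        else:
            hi = lam                        # feasible: decrease lam
    return theta

rng = np.random.default_rng(5)
mu = 1.0 / (1.0 + np.arange(50.0)) ** 2     # polynomially decaying ellipse
theta_star = np.zeros(50)
y = theta_star + 0.1 * rng.standard_normal(50)
theta_hat = project_to_ellipse(y, mu)
assert np.sum(theta_hat**2 / mu) <= 1.0 + 1e-6       # feasibility
# projection onto a convex set containing theta* is nonexpansive
assert np.linalg.norm(theta_hat - theta_star) <= np.linalg.norm(y - theta_star)
```

The shrinkage pattern `y_j / (1 + lam / mu_j)` shows how coordinates with small aspect ratios are damped most aggressively, mirroring ridge-type regularization.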
Let us assume that the ellipse is regular at , so that the localized Gaussian width satisfies the bounds (20) with constants and . Connecting the error to these Gaussian width bounds involves the following two functions
(23) 
with the critical dimension defined in expression (11).
Let us consider the fixed point equation
(24) 
Since is a nonincreasing function of (see Wei and Wainwright [30, Appendix D.1]) while is increasing, if this fixed point problem (24) has a solution, then the solution is unique and we denote it as .
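Under the monotonicity just noted, the fixed point can be computed by bisection. The sketch below assumes a critical equation of the simple form $\sigma \sqrt{k(\delta)} = \delta$ with $k(\delta) = \delta^{-1/\alpha}$, mimicking the Sobolev ellipse of Example 4; both the form of the equation and the choice of $k$ are illustrative assumptions rather than the paper's exact definitions (23)-(24).

```python
import numpy as np

def critical_radius(sigma, k, lo=1e-8, hi=1e3, iters=200):
    """Bisection for sigma * sqrt(k(delta)) = delta, with k nonincreasing."""
    # the left side is nonincreasing and the right side increasing in delta,
    # so the sign of the difference brackets the unique root
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigma * np.sqrt(k(mid)) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, sigma = 1.0, 0.01
delta_star = critical_radius(sigma, lambda d: d ** (-1.0 / alpha))
# closed form for this k: delta* = sigma^{2 alpha / (2 alpha + 1)}
assert abs(delta_star - sigma ** (2 * alpha / (2 * alpha + 1))) < 1e-6
```

With $\sigma \asymp n^{-1/2}$, the closed form recovers the familiar $n^{-\alpha/(2\alpha+1)}$ scaling of Sobolev rates, consistent with the rates discussed in Section 4.1.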
We can now give a precise statement relating the estimation rate of to the solution of the fixed point equation (24).
Proposition 1 (Least squares on ellipses).
Let be regular at , and let be the solution to the fixed point problem (24). Suppose furthermore that the following conditions hold:

The function is unimodal in .

There exists a constant such that for ,

There exists a constant such that for .
Then the error of the least squares estimator (22) satisfies
(25) 
for some constants that depend only on and .
See Appendix D for the proof of this result.
Note that this result is stated for the ellipse with unit radius. For arbitrary radii, one can easily rescale to obtain similar results; see equation (85) in Section D.1 for more detail. When we say that a function is unimodal, we mean that there is some point below which the function is nondecreasing and above which it is nonincreasing.
Equation (25) provides a high-probability bound on the least-squares error. If furthermore , then we are also guaranteed that the mean-squared error is sandwiched as
(26) 
for some universal constants .
We claim that the conditions of Proposition 1 are relatively mild. Note that the related function is strongly convex [5, Thm. 1.1], as mentioned in Appendix D.1, so it is reasonable to believe that its approximation is unimodal. Moreover, assumptions (b) and (c) essentially assert that the function does not change too drastically at two points close to the critical radius. In the next section, we check these assumptions for different examples.
Note that the fixed point problem (24) can be viewed as a kind of critical equation (e.g., [29, Ch. 13] and [32]), whose solution we call the critical radius. Typically, an upper bound on the localized Gaussian width allows this critical radius to serve as an upper bound on the error. Here, we show that with two-sided control of the localized Gaussian width and a regularity assumption, the error also satisfies a matching lower bound. In the next section, we illustrate the consequences of this result with some examples.
4.1 Adaptive estimation rates
We now demonstrate the consequences of Proposition 1 via some examples. We begin with the simple problem of estimation at the origin, where we recover a number of standard rates from the ellipse estimation literature. We then consider some more interesting examples of extremal vectors, and show how the resulting estimation rates differ from the classical ones.
4.1.1 Estimating at
We begin our exploration by considering the ellipse-constrained estimation problem at the origin. In this section, we focus on two types of ellipses, specified by aspect ratios exhibiting polynomial decay and exponential decay, respectively. The first corresponds to estimating a function in a smooth Sobolev class—that is, functions that are almost everywhere times differentiable, with the derivative being Lebesgue integrable.
polynomial decay:
Consider an ellipse defined by the aspect ratios for some . In Example 4, inequality (16), we verified that this ellipse is regular at the origin, and that . Thus, solving the fixed point problem (24) yields , and one can check that the conditions of Proposition 1 are met. Here our notation denotes equality up to constants independent of . By the rescaling argument (85), the proposition implies
(27) 
with probability for some constants and . One may notice that this rate coincides with the minimax estimation rate for a smooth Sobolev function class; we show later that this is indeed the case.
exponential decay:
Consider another case where the ellipse is defined by the aspect ratios , for some . Then a slight modification of the computation in Example 4 yields
In order to establish the regularity condition, notice that in this case, the infimum is achieved in the limit by , and furthermore
(28) 
which by definition, shows that is regular at .
Solving the fixed point problem (24) yields up to other polylogarithmic factors in . One can check that the conditions for Proposition 1 are met, so by the rescaling argument (85), we have, up to polylogarithmic factors,
(29) 
with probability for some constants and .
4.1.2 Estimating at extremal vectors
In the previous section, we studied the adaptive estimation rate at the origin. In this section, we study some nonzero choices of the vector . For concreteness, we restrict our attention to vectors that are nonzero in a single coordinate and zero in all other coordinates. Even for such simple vectors, our analysis reveals some interesting adaptive scalings.
Given an integer , consider for some , where are small constants defined in Wei and Wainwright [30, Corollary 2]. Note that the shrinkage away from the boundary is due to the boundary issue in Theorem 2. We believe this is an artifact of our analysis that is possibly removable; for instance, in our simulations below (Figure 4), we have an example with on the boundary of the ellipse that exhibits the same predicted behavior as its shrunken counterpart.
So as to streamline notation, we adopt a shorthand for this quantity. Wei and Wainwright [30, Section 4.4] show that with , we have
(30) 
This upper bound is proved by considering the projection onto the dimensional subspace spanned by . At the same time, we prove in Lemma 6 that
(31) 
polynomial decay:
Consider an ellipse with for some . From the above calculation, we can conclude that
Here our notation denotes equality up to constants independent of problem parameters such as . Let us verify the regularity condition (12) with dimension and projection onto the linear space spanned by . Since is feasible in the limit for the set , we have
Since , and is equal up to a constant, the right-hand side above is bounded above by , which establishes the regularity condition at
As long as , solving the fixed point problem (24) yields , and one can check that the conditions for Proposition 1 are met. Thus,
(32) 
with probability for some constants and .
exponential decay:
Now consider ellipse with for some . From the above calculation, we can conclude that
Let us verify the regularity condition (12) with dimension and projection onto the linear space spanned by . Since is feasible for the set , by a calculation similar to inequality (28), we can show that the ellipse is regular at .
Solving the fixed point problem (24) yields up to other polylogarithmic factors in . One can check that the conditions for Proposition 1 are met, so by the rescaling argument (85), we have, up to polylogarithmic factors,
(33) 
with probability