In this paper we introduce a minimax framework for nonparametric estimation under storage constraints. In the classical statistical setting, the minimax risk for estimating a function $f$ from a function class $\mathcal{F}$ using a sample of size $n$ places no constraints on the estimator $\hat f_n$, other than requiring it to be a measurable function of the data. However, if the estimator is to be constructed with restrictions on the computational resources used, it is of interest to understand how the error can degrade. Letting $C(\hat f_n) \le B_n$ indicate that the computational resources $C(\hat f_n)$ used to construct $\hat f_n$ are required to fall within a budget $B_n$, the constrained minimax risk is
$$R_n(\mathcal{F}, B_n) = \inf_{\hat f_n :\, C(\hat f_n) \le B_n} \; \sup_{f \in \mathcal{F}} R(\hat f_n, f).$$
Minimax lower bounds on the risk $R_n(\mathcal{F}, B_n)$ as a function of the computational budget $B_n$ thus determine a feasible region for computation-constrained estimation, and a Pareto optimal tradeoff for risk versus computation as $B_n$ varies.
Several recent papers have presented results on tradeoffs between statistical risk and computational resources, measured in terms of either running time of the algorithm, number of floating point operations, or number of bits used to store or construct the estimators [6, 5, 16]. However, the existing work quantifies the tradeoff by analyzing the statistical and computational performance of specific procedures, rather than by establishing lower bounds and a Pareto optimal tradeoff. In this paper we treat the case where the complexity is measured by the storage or space used by the procedure, and we sharply characterize the optimal tradeoff. Specifically, we limit the number of bits used to represent the estimator $\hat f_n$. We focus on the setting of nonparametric regression under standard smoothness assumptions, and study how the excess risk depends on the storage budget $B_n$.
We view the study of quantized estimation as a theoretical problem of fundamental interest. But quantization may arise naturally in future applications of large scale statistical estimation. For instance, when data are collected and analyzed on board a remote satellite, the estimated values may need to be sent back to Earth for further analysis. To limit communication costs, the estimates can be quantized, and it becomes important to understand what, in principle, is lost in terms of statistical risk through quantization. A related scenario is a cloud computing environment where data are processed for many different statistical estimation problems, with the estimates then stored for future analysis. To limit the storage costs, which could dominate the compute costs in many scenarios, it is of interest to quantize the estimates, and the quantization-risk tradeoff again becomes an important concern. Estimates are always quantized to some degree in practice. But to impose energy constraints on computation, future processors may limit precision in arithmetic computations more significantly; the cost of limited precision in terms of statistical risk must then be quantified. A related problem is to distribute the estimation over many parallel processors, and to then limit the communication costs of the submodels to the central host. We focus on the centralized setting in the current paper, but an extension to the distributed case may be possible with the techniques that we introduce here.
We study risk-storage tradeoffs in the normal means model of nonparametric estimation, assuming the target function lies in a Sobolev space. The problem is intimately related to classical rate distortion theory, and our results rely on a marriage of minimax theory and rate distortion ideas. We thus build on and refine the connection between function estimation and lossy source coding that was elucidated in David Donoho’s 1998 Wald Lectures.
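We work in the Gaussian white noise model
$$dY(t) = f(t)\,dt + \varepsilon\,dW(t), \qquad 0 \le t \le 1, \tag{1.1}$$
where $W$ is a standard Wiener process on $[0,1]$, $\varepsilon$ is the standard deviation of the noise, and $f$ lies in the periodic Sobolev space $\widetilde{W}(m,c)$ of order $m$ and radius $c$. (We discuss the nonperiodic Sobolev space $W(m,c)$ in Section 4.) The white noise model is a centerpiece of nonparametric estimation. It is asymptotically equivalent to nonparametric regression and density estimation, and simplifies some of the mathematical analysis in our framework. In this classical setting, the minimax risk of estimation
$$R_\varepsilon(m,c) = \inf_{\hat f_\varepsilon} \; \sup_{f \in \widetilde{W}(m,c)} \mathbb{E}\,\|f - \hat f_\varepsilon\|_2^2$$
is well known to satisfy
$$\lim_{\varepsilon \to 0} \varepsilon^{-\frac{4m}{2m+1}} R_\varepsilon(m,c) = \left(\frac{c^2(2m+1)}{\pi^{2m}}\right)^{\frac{1}{2m+1}} \left(\frac{m}{m+1}\right)^{\frac{2m}{2m+1}} \triangleq P_{m,c},$$
where $P_{m,c}$ is Pinsker’s constant. The constrained minimax risk for quantized estimation becomes
$$R_\varepsilon(m,c,B_\varepsilon) = \inf_{\hat f_\varepsilon :\, C(\hat f_\varepsilon) \le B_\varepsilon} \; \sup_{f \in \widetilde{W}(m,c)} \mathbb{E}\,\|f - \hat f_\varepsilon\|_2^2,$$
where $\hat f_\varepsilon$ is a quantized estimator that is required to use storage $C(\hat f_\varepsilon)$ no greater than $B_\varepsilon$ bits in total. Our main result identifies three separate quantization regimes.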
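In the over-sufficient regime, the number of bits is very large, satisfying $B_\varepsilon \gg \varepsilon^{-\frac{2}{2m+1}}$, and the classical minimax rate of convergence $\varepsilon^{\frac{4m}{2m+1}}$ is obtained. Moreover, the optimal constant is the Pinsker constant $P_{m,c}$.
In the sufficient regime, the number of bits scales as $B_\varepsilon \sim \varepsilon^{-\frac{2}{2m+1}}$. This level of quantization is just sufficient to preserve the classical minimax rate of convergence, and thus in this regime $R_\varepsilon(m,c,B_\varepsilon) \asymp \varepsilon^{\frac{4m}{2m+1}}$. However, the optimal constant degrades to a new constant $P_{m,c} + Q_{m,c,d}$, where $Q_{m,c,d}$ is characterized in terms of the solution of a certain variational problem, depending on $d = \lim_{\varepsilon \to 0} B_\varepsilon\, \varepsilon^{\frac{2}{2m+1}}$.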
In the insufficient regime, the number of bits scales as $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$, with however $B_\varepsilon \to \infty$. Under this scaling the number of bits is insufficient to preserve the unquantized minimax rate of convergence, and the quantization error dominates the estimation error. We show that the quantized minimax risk in this case satisfies
$$\lim_{\varepsilon \to 0} B_\varepsilon^{2m}\, R_\varepsilon(m,c,B_\varepsilon) = \frac{c^2 m^{2m}}{\pi^{2m}}.$$
Thus, in the insufficient regime the quantized minimax rate of convergence is $B_\varepsilon^{-2m}$, with optimal constant as shown above.
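By using an upper bound for the family of constants $Q_{m,c,d}$, the three regimes can be combined together to view the risk in terms of a decomposition into estimation error and quantization error. Specifically, we can write
$$R_\varepsilon(m,c,B_\varepsilon) \approx \underbrace{P_{m,c}\, \varepsilon^{\frac{4m}{2m+1}}}_{\text{estimation error}} + \underbrace{\frac{c^2 m^{2m}}{\pi^{2m}}\, B_\varepsilon^{-2m}}_{\text{quantization error}}.$$
When $B_\varepsilon \gg \varepsilon^{-\frac{2}{2m+1}}$, the estimation error dominates the quantization error, and the usual minimax rate and constant are obtained. In the insufficient case $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$, only a slower rate of convergence is achievable. When $B_\varepsilon$ and $\varepsilon^{-\frac{2}{2m+1}}$ are comparable, the estimation error and quantization error are on the same order. The threshold $\varepsilon^{-\frac{2}{2m+1}}$ should not be surprising, given that in classical unquantized estimation the minimax rate of convergence is achieved by estimating the first $\varepsilon^{-\frac{2}{2m+1}}$ Fourier coefficients and simply setting the remaining coefficients to zero. This corresponds to selecting a smoothing bandwidth that scales as $h \asymp n^{-\frac{1}{2m+1}}$ with the sample size $n$.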
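As a rough illustration of how the three regimes are separated, the following minimal Python sketch compares a bit budget with the threshold and evaluates the two rate terms. It assumes the threshold $\varepsilon^{-2/(2m+1)}$, the estimation rate $\varepsilon^{4m/(2m+1)}$, and the quantization rate $B^{-2m}$, and it ignores all constants. The function names and the `slack` factor that separates "much larger" from "comparable" are hypothetical choices made only for this illustration, since the regimes are defined asymptotically.

```python
def rate_terms(eps, m, bits):
    """Return (estimation_rate, quantization_rate), ignoring constants.

    Assumed rates: eps**(4m/(2m+1)) for estimation, bits**(-2m) for quantization.
    """
    estimation = eps ** (4.0 * m / (2.0 * m + 1.0))
    quantization = bits ** (-2.0 * m)
    return estimation, quantization


def regime(eps, m, bits, slack=10.0):
    """Classify a finite bit budget against the threshold eps**(-2/(2m+1)).

    `slack` is an arbitrary illustrative factor, not part of the theory.
    """
    threshold = eps ** (-2.0 / (2.0 * m + 1.0))
    if bits >= slack * threshold:
        return "over-sufficient"
    if bits * slack <= threshold:
        return "insufficient"
    return "sufficient"
```

At a budget exactly equal to the threshold the two rate terms coincide, which is one way to see why the sufficient regime sits at that scaling.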
At a high level, our proof strategy integrates elements of minimax theory and source coding theory. In minimax analysis one computes lower bounds by thinking in Bayesian terms to look for least-favorable priors. In source coding analysis one constructs worst case distributions by setting up an optimization problem based on mutual information. Our quantized minimax analysis requires that these approaches be carefully combined to balance the estimation and quantization errors. To show achievability of the lower bounds we establish, we likewise need to construct an estimator and coding scheme together. Our approach is to quantize the blockwise James-Stein estimator, which achieves the classical Pinsker bound. However, our quantization scheme differs from the approach taken in classical rate distortion theory, where the generation of the codebook is determined once the source distribution is known. In our setting, we require the allocation of bits to be adaptive to the data, using more bits for blocks that have larger signal size. We therefore design a quantized estimation procedure that adaptively distributes the communication budget across the blocks. Assuming only a lower bound on the smoothness $m$ and an upper bound on the radius $c$ of the Sobolev space, our quantization-estimation procedure is adaptive to $m$ and $c$ in the usual statistical sense, and is also adaptive to the coding regime. In other words, given a storage budget $B_\varepsilon$, the coding procedure achieves the optimal rate and constant for the unknown $m$ and $c$, operating in the corresponding regime for those parameters.
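For orientation, here is a minimal Python sketch of the unquantized blockwise James-Stein estimator in the Gaussian sequence model $Y_j = \theta_j + \varepsilon Z_j$. It shows only the shrinkage step and does not attempt the adaptive bit allocation described above. The dyadic block sizes and the function names `dyadic_blocks` and `block_james_stein` are assumptions made for illustration; other block systems are common in the literature.

```python
import numpy as np


def dyadic_blocks(n):
    """Split indices 0..n-1 into consecutive blocks of sizes 1, 2, 4, ...

    The last block is truncated if n is not of the form 2**k - 1.
    """
    blocks, start, size = [], 0, 1
    while start < n:
        blocks.append(slice(start, min(start + size, n)))
        start += size
        size *= 2
    return blocks


def block_james_stein(y, eps):
    """Positive-part James-Stein shrinkage applied block by block.

    Each block y_b is multiplied by max(0, 1 - n_b * eps**2 / ||y_b||**2),
    where n_b is the block length and eps is the noise level.
    """
    y = np.asarray(y, dtype=float)
    theta_hat = np.zeros_like(y)
    for blk in dyadic_blocks(len(y)):
        y_blk = y[blk]
        energy = float(np.sum(y_blk ** 2))
        if energy > 0.0:
            shrink = max(0.0, 1.0 - len(y_blk) * eps ** 2 / energy)
            theta_hat[blk] = shrink * y_blk
    return theta_hat
```

Blocks whose observed energy is at the noise level are set to zero, while blocks with large signal are left nearly untouched, which is the behavior an adaptive bit allocation would then build on.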
In the following section we establish some notation, outline our proof strategy, and present some simple examples. In Section 3 we state and prove our main result on quantized minimax lower bounds, relegating some of the technical details to an appendix. In Section 4 we show asymptotic achievability of these lower bounds, using a quantized estimation procedure based on adaptive James-Stein estimation and quantization in blocks, again deferring proofs of technical lemmas to the supplementary material. This is followed by a presentation of some results from experiments in Section 5, illustrating the performance and properties of the proposed quantized estimation procedure.
2 Quantized estimation and minimax risk
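Suppose that $(X_1, \dots, X_n)$ is a random vector drawn from a distribution $P_n$. Consider the problem of estimating a functional $\theta_n = \theta(P_n)$ of the distribution, assuming $\theta_n$ is restricted to lie in a parameter space $\Theta_n$. To unclutter some of the notation, we will suppress the subscript $n$ and write $\theta$ and $\Theta$ in the following, keeping in mind that nonparametric settings are allowed. The subscript $n$ will be maintained for random variables. The minimax $\ell_2$ risk of estimating $\theta$ is then defined as
$$R_n(\Theta) = \inf_{\hat\theta_n} \; \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2,$$
where the infimum is taken over all possible estimators $\hat\theta_n$ that are measurable with respect to the data $X_{1:n} = (X_1, \dots, X_n)$. We will abuse notation by using $\hat\theta_n$ to denote both the estimator and the estimate calculated based on an observed set of data. Among numerous approaches to obtaining the minimax risk, the Bayesian method is best aligned with quantized estimation. Consider a prior distribution $\pi(\theta)$ whose support is a subset of $\Theta$. Let $\delta(X_{1:n})$ be the posterior mean of $\theta$ given the data $X_{1:n}$, which minimizes the integrated risk. Then for any estimator $\hat\theta_n$,
$$\sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \;\ge\; \int_\Theta \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \, d\pi(\theta) \;\ge\; \int_\Theta \mathbb{E}_\theta \|\theta - \delta(X_{1:n})\|^2 \, d\pi(\theta).$$
Taking the infimum over $\hat\theta_n$ yields
$$\inf_{\hat\theta_n} \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \;\ge\; \int_\Theta \mathbb{E}_\theta \|\theta - \delta(X_{1:n})\|^2 \, d\pi(\theta) \triangleq R_n(\Theta; \pi).$$
Thus, any prior distribution supported on $\Theta$ gives a lower bound on the minimax risk, and selecting the least-favorable prior leads to the largest lower bound provable by this approach.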
Now consider constraints on the storage or communication cost of our estimate. We restrict to the set of estimators that use no more than a total of $B_n$ bits; that is, the estimator takes at most $2^{B_n}$ different values. Such quantized estimators can be formulated by the following two-step procedure. First, an encoder maps the data $X_{1:n} = (X_1, \dots, X_n)$ to an index $\phi_n(X_{1:n})$, where
$$\phi_n : \mathcal{X}^n \to \{1, 2, \dots, 2^{B_n}\}$$
is the encoding function. The decoder, after receiving or retrieving the index, represents the estimates based on a decoding function
$$\psi_n : \{1, 2, \dots, 2^{B_n}\} \to \Theta$$
mapping the index to a codebook of estimates. All that needs to be transmitted or stored is the $B_n$-bit-long index, and the quantized estimator $\hat\theta_n$ is simply $\psi_n \circ \phi_n$, the composition of the encoder and the decoder functions. Denoting by $C(\hat\theta_n)$ the storage, in terms of the number of bits, required by an estimator $\hat\theta_n$, the minimax risk of quantized estimation is then defined as
$$R_n(\Theta, B_n) = \inf_{\hat\theta_n :\, C(\hat\theta_n) \le B_n} \; \sup_{\theta \in \Theta} \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2,$$
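and we are interested in the effect of the constraint on the minimax risk. Once again, we consider a prior distribution $\pi(\theta)$ supported on $\Theta$ and let $\delta(X_{1:n})$ be the posterior mean of $\theta$ given the data. The integrated risk can then be decomposed as
$$\int \mathbb{E}_\theta \|\theta - \hat\theta_n\|^2 \, d\pi(\theta) = \mathbb{E}\|\theta - \delta(X_{1:n}) + \delta(X_{1:n}) - \hat\theta_n\|^2 = \mathbb{E}\|\theta - \delta(X_{1:n})\|^2 + \mathbb{E}\|\delta(X_{1:n}) - \hat\theta_n\|^2, \tag{2.1}$$
where the expectation is with respect to the joint distribution of $\theta$ and $X_{1:n}$, and the second equality is due to
$$\mathbb{E}\langle \theta - \delta(X_{1:n}),\, \delta(X_{1:n}) - \hat\theta_n \rangle = \mathbb{E}\Big( \big\langle \mathbb{E}(\theta - \delta(X_{1:n}) \mid X_{1:n}),\, \delta(X_{1:n}) - \hat\theta_n \big\rangle \Big) = 0,$$
using the fact that $\theta \to X_{1:n} \to \hat\theta_n$ forms a Markov chain. The first term in the decomposition (2.1) is the Bayes risk $R_n(\Theta; \pi)$. The second term can be viewed as the excess risk due to quantization.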
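The two-step encoder and decoder construction described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the coding scheme analyzed in the paper: the encoder first computes an unquantized estimate and then picks the nearest codeword, and the names `make_quantized_estimator`, `codebook`, and `estimator` are hypothetical. A codebook with $2^B$ rows corresponds to a $B$-bit index.

```python
import numpy as np


def make_quantized_estimator(codebook, estimator):
    """Build (encoder, decoder) for a nearest-codeword quantized estimator.

    codebook  : sequence of 2**B candidate estimates (scalars or vectors)
    estimator : callable mapping data to an unquantized estimate
    """
    codebook = np.asarray(codebook, dtype=float)
    codebook = codebook.reshape(len(codebook), -1)  # one row per codeword

    def encoder(data):
        # Map the data to the index of the closest codeword (squared error).
        estimate = np.asarray(estimator(data), dtype=float).reshape(-1)
        distances = np.sum((codebook - estimate) ** 2, axis=1)
        return int(np.argmin(distances))

    def decoder(index):
        # Only the index is stored or transmitted; the decoder looks it up.
        return codebook[index]

    return encoder, decoder
```

For example, with a scalar mean in $[0,1]$ and $B = 3$ bits, one could use the eight cell midpoints `(np.arange(8) + 0.5) / 8` as the codebook and `np.mean` as the estimator; the quantized estimator is then `decoder(encoder(data))`.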
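Let $T_n = T(X_1, \dots, X_n)$ be a sufficient statistic for $\theta$. The posterior mean can be expressed in terms of $T_n$ and we will abuse notation and write it as $\delta(T_n)$. Since the quantized estimator $\hat\theta_n$ uses at most $B_n$ bits, we have
$$B_n \;\ge\; H(\hat\theta_n) \;\ge\; H(\hat\theta_n) - H(\hat\theta_n \mid \delta(T_n)) \;=\; I(\hat\theta_n; \delta(T_n)),$$
where $H$ and $I$ denote the Shannon entropy and mutual information, respectively. Now consider the optimization
$$\inf_{P(\cdot \mid \delta(T_n))} \mathbb{E}\|\delta(T_n) - \tilde\theta_n\|^2 \quad \text{such that} \quad I(\tilde\theta_n; \delta(T_n)) \le B_n,$$
where the infimum is over all conditional distributions $P(\tilde\theta_n \mid \delta(T_n))$. This parallels the definition of the distortion rate function, minimizing the distortion under a constraint on mutual information. Denoting the value of this optimization by $Q_n(\Theta, B_n; \pi)$, we can lower bound the quantized minimax risk by
$$R_n(\Theta, B_n) \;\ge\; R_n(\Theta; \pi) + Q_n(\Theta, B_n; \pi).$$
Since each prior distribution $\pi(\theta)$ supported on $\Theta$ gives a lower bound, we have
$$R_n(\Theta, B_n) \;\ge\; \sup_{\pi} \big\{ R_n(\Theta; \pi) + Q_n(\Theta, B_n; \pi) \big\},$$
and the goal becomes to obtain a least favorable prior for the quantized risk.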
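When the posterior mean has independent Gaussian coordinates, the distortion rate optimization above has a classical solution known as reverse water-filling: every coordinate whose variance exceeds a common level is coded down to that level, and the rest receive no bits. The following Python sketch finds the level by bisection. It is an illustration of that textbook result under the assumption of independent Gaussian components and rates measured in bits; the function name `reverse_water_filling` is hypothetical.

```python
import numpy as np


def reverse_water_filling(variances, bits):
    """Reverse water-filling for independent Gaussian components.

    Returns (level, distortions), where distortions[j] = min(variances[j], level)
    and the level is chosen so that the total rate
        sum_j 0.5 * log2(variances[j] / level)   over components above the level
    equals the budget `bits` (up to bisection accuracy).
    """
    v = np.asarray(variances, dtype=float)
    if bits <= 0:
        # No bits: the best reconstruction is the mean, distortion = variance.
        return float(v.max()), v.copy()

    def rate(level):
        active = v > level
        return 0.5 * float(np.sum(np.log2(v[active] / level)))

    lo, hi = 0.0, float(v.max())  # rate(hi) = 0 <= bits; rate -> infinity as level -> 0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if rate(mid) > bits:
            lo = mid  # level too low: it would need more bits than allowed
        else:
            hi = mid
    return hi, np.minimum(v, hi)
```

The sum of the returned distortions is the value of the optimization for that Gaussian source, so adding it to the Bayes risk gives the two-term lower bound for a given Gaussian prior.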
Before turning to the case of quantized estimation over Sobolev spaces, we illustrate this technique on some simpler, more concrete examples.
Example 2.1 (Normal means in a hypercube).
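Let $X_i \sim N(\theta, \sigma^2 I_d)$ for $i = 1, \dots, n$. Suppose that $\sigma^2$ is known and $\theta$, which ranges over a $d$-dimensional hypercube, is to be estimated. We choose the prior $\pi(\theta)$ on $\theta$ to be a product distribution with density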
It is shown in  that
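Turning to $Q_n(\Theta, B_n; \pi)$, let $\delta(X_{1:n}) = (\delta_1, \dots, \delta_d)$ be the posterior mean of $\theta$. By the independence and symmetry among the dimensions, $\delta_1, \dots, \delta_d$ are independently and identically distributed. Denoting by $T_0$ this common distribution, we have
$$Q_n(\Theta, B_n; \pi) \;\ge\; d \cdot q(B_n / d),$$
where $q(B)$ is the distortion rate function for $T_0$, i.e., the value of the following problem:
$$\inf_{P(\tilde T \mid T)} \mathbb{E}(T - \tilde T)^2 \quad \text{such that} \quad I(T; \tilde T) \le B, \;\; T \sim T_0.$$
Now using the Shannon lower bound, we get
$$Q_n(\Theta, B_n; \pi) \;\ge\; \frac{d}{2\pi e}\, 2^{2h(T_0)}\, 2^{-2B_n/d},$$
where $h(T_0)$ denotes the differential entropy of $T_0$, measured in bits. Note that as $n \to \infty$, $T_0$ converges in distribution to the prior distribution of a single coordinate, so there exists a constant $c_2$ independent of $n$ and $d$ such that
$$R_n(\Theta, B_n) \;\ge\; R_n(\Theta; \pi) + c_2\, d\, 2^{-2B_n/d}.$$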
This lower bound intuitively shows that the risk is regulated by two factors, the estimation error and the quantization error; whichever is larger dominates the risk. The scaling behavior of this lower bound (ignoring constants) can be achieved by first quantizing each of the $d$ intervals using $B_n/d$ bits each, and then mapping the MLE to its closest codeword.
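The achievability scheme just described can be sketched as follows. This is a minimal illustration that assumes the hypercube is $[0,1]^d$ and that the budget is split evenly, with $\lfloor B/d \rfloor$ bits per coordinate and a uniform grid of cell midpoints as the per-coordinate codebook; the function name `hypercube_codeword` is hypothetical.

```python
import numpy as np


def hypercube_codeword(estimate, total_bits):
    """Snap an estimate to a per-coordinate uniform grid on [0, 1]^d.

    Each coordinate gets floor(total_bits / d) bits, i.e. 2**bits cells of
    equal width; the codeword is the midpoint of the cell containing the
    (clipped) estimate, which is also its nearest codeword.
    Returns (cell_indices, codeword).
    """
    x = np.clip(np.asarray(estimate, dtype=float), 0.0, 1.0)
    bits_per_coord = int(total_bits) // x.size
    levels = 2 ** bits_per_coord
    # Cell index in {0, ..., levels - 1}; the endpoint 1.0 goes to the last cell.
    cells = np.minimum((x * levels).astype(int), levels - 1)
    codeword = (cells + 0.5) / levels
    return cells, codeword
```

Under these assumptions each coordinate is moved by at most half a cell width, $2^{-(\lfloor B/d \rfloor + 1)}$, so the squared quantization error is at most $d \cdot 2^{-2(\lfloor B/d \rfloor + 1)}$, which has the same $d\, 2^{-2B/d}$ scaling as the quantization term in the lower bound. In use, `estimate` would be the sample mean of the $X_i$.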
Example 2.2 (Gaussian sequences in Euclidean balls).
In the example shown above, the lower bound is tight only in terms of the scaling of the key parameters. In some instances, we are able to find an asymptotically tight lower bound for which we can show achievability of both the rate and the constants. Estimating the mean vector of a Gaussian sequence with an $\ell_2$ norm constraint on the mean is one such case, as we showed in previous work.
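Specifically, let $X_i \sim N(\theta_i, \sigma^2)$ for $i = 1, \dots, n$, where $\theta = (\theta_1, \dots, \theta_n)$. Suppose that the parameter $\theta$ lies in the Euclidean ball $\Theta_n(c) = \{\theta : \frac{1}{n}\sum_{i=1}^n \theta_i^2 \le c^2\}$. Furthermore, suppose that $B_n = nB$. Then using the prior $\theta_i \sim N(0, c^2)$ it can be shown that, for the risk normalized by $n$,
$$\liminf_{n \to \infty} R_n(\Theta_n(c), B_n) \;\ge\; \frac{\sigma^2 c^2}{\sigma^2 + c^2} + \frac{c^4\, 2^{-2B}}{\sigma^2 + c^2}.$$
The asymptotic estimation error $\sigma^2 c^2/(\sigma^2 + c^2)$ is the well-known Pinsker bound for the Euclidean ball case. As shown in the same previous work, an explicit quantization scheme can be constructed that asymptotically achieves this lower bound, realizing the smallest possible quantization error $c^4\, 2^{-2B}/(\sigma^2 + c^2)$ for a budget of $B_n = nB$ bits.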
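The two terms in this example can be evaluated numerically with a short Python sketch. It assumes a Gaussian prior $N(0, c^2)$ on each coordinate, so that the per-coordinate Bayes risk is $\sigma^2 c^2/(\sigma^2 + c^2)$ and the posterior mean has variance $c^4/(\sigma^2 + c^2)$, and it uses the Gaussian distortion rate function $v\, 2^{-2b}$ for a source of variance $v$ coded at $b$ bits. These are standard facts, but their use here and the function name `euclidean_ball_terms` are assumptions of this illustration.

```python
def euclidean_ball_terms(sigma2, c2, bits_per_coord):
    """Per-coordinate (estimation, quantization) terms under a N(0, c2) prior.

    sigma2         : noise variance
    c2             : prior variance (squared radius of the ball)
    bits_per_coord : average number of bits available per coordinate
    """
    estimation = sigma2 * c2 / (sigma2 + c2)          # Bayes risk of the posterior mean
    posterior_mean_var = c2 ** 2 / (sigma2 + c2)      # variance of the posterior mean
    quantization = posterior_mean_var * 2.0 ** (-2.0 * bits_per_coord)
    return estimation, quantization
```

As a sanity check, with zero bits the two terms add up to $c^2$, the risk of the trivial estimator that always reports zero, and as the number of bits per coordinate grows the sum decreases to the Bayes risk.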
The Euclidean ball case is clearly relevant to the Sobolev ellipsoid case, but new coding strategies and proof techniques are required. In particular, as will be made clear in the sequel, we will use an adaptive allocation of bits across blocks of coefficients, using more bits for blocks that have larger estimated signal size. Moreover, determination of the optimal constants requires a detailed analysis of the worst case prior distributions and the solution of a series of variational problems.
3 Quantized estimation over Sobolev spaces
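Recall that the Sobolev space of order $m$ and radius $c$ is defined by
$$W(m,c) = \Big\{ f : [0,1] \to \mathbb{R} \;:\; f^{(m-1)} \text{ is absolutely continuous and } \int_0^1 \big(f^{(m)}(x)\big)^2 dx \le c^2 \Big\}.$$
The periodic Sobolev space is defined by
$$\widetilde{W}(m,c) = \Big\{ f \in W(m,c) \;:\; f^{(j)}(0) = f^{(j)}(1), \; j = 0, 1, \dots, m-1 \Big\}. \tag{3.1}$$
The white noise model (1.1) is asymptotically equivalent to making $n$ equally spaced observations along the sample path, $Y_i = f(i/n) + \sigma \epsilon_i$, where $\epsilon_i \sim N(0,1)$. In this formulation, the noise level $\varepsilon$ in the formulation (1.1) scales as $\varepsilon^2 = \sigma^2/n$, and the rate of convergence takes the familiar form $n^{-\frac{2m}{2m+1}}$, where $n$ is the number of observations.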
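To carry out quantized estimation we now require an encoder
$$\phi_\varepsilon : \mathbb{R}^{[0,1]} \to \{1, 2, \dots, 2^{B_\varepsilon}\},$$
which is a function applied to the sample path $Y(t)$. The decoding function then takes the form
$$\psi_\varepsilon : \{1, 2, \dots, 2^{B_\varepsilon}\} \to \mathbb{R}^{[0,1]}$$
and maps the index to a function estimate. As in the previous section, we write the composition of the encoder and the decoder as $\hat f_\varepsilon = \psi_\varepsilon \circ \phi_\varepsilon$, which we call the quantized estimator. The communication or storage $C(\hat f_\varepsilon)$ required by this quantized estimator is no more than $B_\varepsilon$ bits.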
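To recast quantized estimation in terms of an infinite sequence model, let $(\varphi_j)_{j=1}^\infty$ be the trigonometric basis, and let
$$\theta_j = \int_0^1 \varphi_j(t) f(t)\, dt, \qquad j = 1, 2, \dots,$$
be the Fourier coefficients. It is well known that $f = \sum_{j=1}^\infty \theta_j \varphi_j$ belongs to $\widetilde{W}(m,c)$ if and only if the Fourier coefficients $\theta$ belong to the Sobolev ellipsoid defined as
$$\Theta(m,c) = \Big\{ \theta \in \ell_2 \;:\; \sum_{j=1}^\infty a_j^2 \theta_j^2 \le \frac{c^2}{\pi^{2m}} \Big\}, \tag{3.2}$$
where $a_j = j^m$ for even $j$ and $a_j = (j-1)^m$ for odd $j$. Although this is the standard definition of a Sobolev ellipsoid, for the rest of the paper we will set $a_j = j^m$, for convenience of analysis. All of the results hold for both definitions of $a_j$. Also note that (3.2) actually gives a more general definition, since $m$ is no longer assumed to be an integer, as it is in (3.1). Expanding with respect to the same orthonormal basis, the observed path $Y(t)$ is converted into an infinite Gaussian sequence
$$Y_j = \int_0^1 \varphi_j(t)\, dY(t), \qquad j = 1, 2, \dots,$$
with $Y_j \sim N(\theta_j, \varepsilon^2)$. For an estimator $(\hat\theta_j)_{j=1}^\infty$ of $(\theta_j)_{j=1}^\infty$, an estimate of $f$ is obtained by
$$\hat f(x) = \sum_{j=1}^\infty \hat\theta_j \varphi_j(x),$$
with squared error $\|\hat f - f\|_2^2 = \|\hat\theta - \theta\|_2^2$. In terms of this standard reduction, the quantized minimax risk is thus reformulated as
$$R_\varepsilon(m,c,B_\varepsilon) = \inf_{\hat\theta_\varepsilon :\, C(\hat\theta_\varepsilon) \le B_\varepsilon} \; \sup_{\theta \in \Theta(m,c)} \mathbb{E}_\theta \|\theta - \hat\theta_\varepsilon\|_2^2.$$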
To state our result, we need to define the value of the following variational problem:
where the feasible set is the collection of increasing functions and values satisfying
The significance and interpretation of the variational problem will become apparent as we outline the proof of this result.
In the first regime, where the number of bits $B_\varepsilon$ is much greater than $\varepsilon^{-\frac{2}{2m+1}}$, we recover the same convergence result as in Pinsker’s theorem, in terms of both convergence rate and leading constant. The proof of the lower bound for this regime can directly follow the proof of Pinsker’s theorem, since the set of estimators considered in our minimax framework is a subset of all possible estimators.
In the second regime where we have “just enough” bits to preserve the rate, we suffer a loss in terms of the leading constant. In this “Goldilocks regime,” the optimal rate is achieved but the constant in front of the rate is Pinsker’s constant plus a positive quantity determined by the variational problem.
While the solution to this variational problem does not appear to have an explicit form, it can be computed numerically. We discuss this term at length in the sequel, where we explain the origin of the variational problem, compute the constant numerically and approximate it from above and below. The constants are shown graphically in Figure 1. Note that the parameter $d$, the limit of $B_\varepsilon\, \varepsilon^{\frac{2}{2m+1}}$, can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, since $\varepsilon^{-\frac{2}{2m+1}}$ is asymptotically the number of coefficients needed to estimate at the classical minimax rate. As shown in Figure 1, the constant for quantized estimation quickly approaches the Pinsker constant as $d$ increases, so that beyond a moderate value of $d$ the two are already very close.
In the third regime, where the communication budget is insufficient for the estimator to achieve the optimal rate, we obtain a sub-optimal rate which no longer depends explicitly on the noise level $\varepsilon$ of the model. In this regime, quantization error dominates, and the risk decays at a rate of $B_\varepsilon^{-2m}$ no matter how fast $\varepsilon$ approaches zero, as long as $B_\varepsilon \ll \varepsilon^{-\frac{2}{2m+1}}$. Here the analogue of Pinsker’s constant takes a very simple form.
Proof of Theorem 3.1.
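Consider a Gaussian prior distribution on $\theta = (\theta_j)_{j=1}^\infty$ with $\theta_j \sim N(0, \sigma_j^2)$ for $j = 1, 2, \dots$, in terms of parameters $\sigma^2 = (\sigma_j^2)_{j=1}^\infty$ to be specified later. One requirement for the variances is
$$\sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}},$$
where $a_j$ are the weights of the Sobolev ellipsoid. We denote this prior distribution by $\pi(\theta; \sigma^2)$, and show in Section A that it is asymptotically concentrated on the ellipsoid $\Theta(m,c)$. Under this prior the model is
$$\theta_j \sim N(0, \sigma_j^2), \qquad Y_j \mid \theta_j \sim N(\theta_j, \varepsilon^2), \qquad j = 1, 2, \dots,$$
and the marginal distribution of $Y_j$ is thus $N(0, \sigma_j^2 + \varepsilon^2)$. Following the strategy outlined in Section 2, let $\delta$ denote the posterior mean of $\theta$ given $Y$ under this prior, and consider the optimization
$$\inf \mathbb{E}\|\delta - \tilde\theta\|^2 \quad \text{such that} \quad I(\delta; \tilde\theta) \le B_\varepsilon,$$
where the infimum is over all distributions on $\tilde\theta$ such that $\theta \to Y \to \tilde\theta$ forms a Markov chain. Now, the posterior mean satisfies $\delta_j = \gamma_j Y_j$ where $\gamma_j = \sigma_j^2/(\sigma_j^2 + \varepsilon^2)$. Note that the Bayes risk under this prior is
$$\mathbb{E}\|\theta - \delta\|^2 = \sum_{j=1}^\infty \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2}.$$
Then the classical rate distortion argument gives that
$$I(\delta; \tilde\theta) \;\ge\; \sum_{j=1}^\infty I(\gamma_j Y_j; \tilde\theta_j) \;\ge\; \sum_{j=1}^\infty \frac{1}{2} \log_+ \!\left( \frac{\sigma_j^4}{(\sigma_j^2 + \varepsilon^2)\, \tilde\sigma_j^2} \right),$$
where $\tilde\sigma_j^2 = \mathbb{E}(\tilde\theta_j - \gamma_j Y_j)^2$ and $\log_+(x) = \max(\log x, 0)$. Therefore, the quantized minimax risk is lower bounded by
$$R_\varepsilon(m,c,B_\varepsilon) \;\ge\; V_\varepsilon(B_\varepsilon, m, c)\,(1 + o(1)),$$
where $V_\varepsilon(B_\varepsilon, m, c)$ is the value of the optimization
$$\max_{\sigma^2} \min_{\tilde\sigma^2} \; \sum_{j=1}^\infty \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2} + \sum_{j=1}^\infty \tilde\sigma_j^2 \quad \text{such that} \quad \sum_{j=1}^\infty \frac{1}{2} \log_+ \!\left( \frac{\sigma_j^4}{(\sigma_j^2 + \varepsilon^2)\, \tilde\sigma_j^2} \right) \le B_\varepsilon, \quad \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}},$$
and the deviation term $o(1)$ is analyzed in the supplementary material.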
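Observe that the quantity $V_\varepsilon(B_\varepsilon, m, c)$ in the lower bound can be upper and lower bounded by
$$\max\big\{ R_\varepsilon(m,c),\, Q_\varepsilon(m,c,B_\varepsilon) \big\} \;\le\; V_\varepsilon(B_\varepsilon, m, c) \;\le\; R_\varepsilon(m,c) + Q_\varepsilon(m,c,B_\varepsilon),$$
where the estimation error term $R_\varepsilon(m,c)$ is the value of the optimization
$$\max_{\sigma^2} \; \sum_{j=1}^\infty \frac{\sigma_j^2 \varepsilon^2}{\sigma_j^2 + \varepsilon^2} \quad \text{such that} \quad \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}},$$
and the quantization error term $Q_\varepsilon(m,c,B_\varepsilon)$ is the value of the optimization
$$\max_{\sigma^2} \min_{\tilde\sigma^2} \; \sum_{j=1}^\infty \tilde\sigma_j^2 \quad \text{such that} \quad \sum_{j=1}^\infty \frac{1}{2} \log_+ \!\left( \frac{\sigma_j^4}{(\sigma_j^2 + \varepsilon^2)\, \tilde\sigma_j^2} \right) \le B_\varepsilon, \quad \sum_{j=1}^\infty a_j^2 \sigma_j^2 \le \frac{c^2}{\pi^{2m}}.$$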
The following results specify the leading order asymptotics of these quantities.
Moreover, if and ,
This yields the following closed form upper bound.
Suppose that and . Then
Similarly, in the over-sufficient regime as , we conclude that
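We now turn to the sufficient regime $B_\varepsilon \sim \varepsilon^{-\frac{2}{2m+1}}$. We begin by making three observations about the solution to the max-min optimization that defines the lower bound. First, we note that the series $(\sigma_j^2)_{j=1}^\infty$ that solves it can be assumed to be decreasing. If $(\sigma_j^2)$ were not in decreasing order, we could rearrange it to be decreasing, and correspondingly rearrange $(\tilde\sigma_j^2)$, without violating the constraints or changing the value of the optimization. Second, we note that given $(\sigma_j^2)$, the optimal $(\tilde\sigma_j^2)$ is obtained by the “reverse water-filling” scheme. Specifically, there exists $\eta > 0$ such that
$$\tilde\sigma_j^2 = \begin{cases} \eta & \text{if } \dfrac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} \ge \eta, \\[2ex] \dfrac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} & \text{otherwise,} \end{cases}$$
where $\eta$ is chosen so that
$$\frac{1}{2} \sum_{j=1}^\infty \log_+ \!\left( \frac{\sigma_j^4}{(\sigma_j^2 + \varepsilon^2)\, \tilde\sigma_j^2} \right) \le B_\varepsilon.$$
Third, there exists an integer $J > 0$ such that the optimal series $(\sigma_j^2)$ satisfies
$$\frac{\sigma_j^4}{\sigma_j^2 + \varepsilon^2} \ge \eta \;\text{ for } j = 1, \dots, J, \qquad \sigma_j^2 = 0 \;\text{ for } j > J,$$
where $\eta$ is the “water-filling level” for $(\tilde\sigma_j^2)$. Using these three observations, the optimization can be reformulated as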