Quantized Nonparametric Estimation over Sobolev Ellipsoids

03/25/2015
by   Yuancheng Zhu, et al.
The University of Chicago

We formulate the notion of minimax estimation under storage or communication constraints, and prove an extension to Pinsker's theorem for nonparametric estimation over Sobolev ellipsoids. Placing limits on the number of bits used to encode any estimator, we give tight lower and upper bounds on the excess risk due to quantization in terms of the number of bits, the signal size, and the noise level. This establishes the Pareto optimal tradeoff between storage and risk under quantization constraints for Sobolev spaces. Our results and proof techniques combine elements of rate distortion theory and minimax analysis. The proposed quantized estimation scheme, which shows achievability of the lower bounds, is adaptive in the usual statistical sense, achieving the optimal quantized minimax rate without knowledge of the smoothness parameter of the Sobolev space. It is also adaptive in a computational sense, as it constructs the code only after observing the data, to dynamically allocate more codewords to blocks where the estimated signal size is large. Simulations are included that illustrate the effect of quantization on statistical risk.


1 Introduction

In this paper we introduce a minimax framework for nonparametric estimation under storage constraints. In the classical statistical setting, the minimax risk for estimating a function $f$ in a function class $\mathcal{F}$ from a sample of size $n$ places no constraints on the estimator $\hat f_n$, other than requiring it to be a measurable function of the data. However, if the estimator is to be constructed with restrictions on the computational resources used, it is of interest to understand how the error can degrade. Writing $\hat f_n \in \mathcal{C}(B_n)$ to indicate that the computational resources used to construct $\hat f_n$ are required to fall within a budget $B_n$, the constrained minimax risk is

$$R_n(\mathcal{F}; B_n) \;=\; \inf_{\hat f_n \in \mathcal{C}(B_n)} \; \sup_{f \in \mathcal{F}} \; \mathbb{E}\, \|\hat f_n - f\|^2 .$$

Minimax lower bounds on the risk as a function of the computational budget $B_n$ thus determine a feasible region for computation-constrained estimation, and a Pareto optimal tradeoff for risk versus computation as $B_n$ varies.

Several recent papers have presented results on tradeoffs between statistical risk and computational resources, measured in terms of either the running time of the algorithm, the number of floating point operations, or the number of bits used to store or construct the estimators [6, 5, 16]. However, the existing work quantifies the tradeoff by analyzing the statistical and computational performance of specific procedures, rather than by establishing lower bounds and a Pareto optimal tradeoff. In this paper we treat the case where the complexity is measured by the storage or space used by the procedure, and we sharply characterize the optimal tradeoff. Specifically, we limit the number of bits used to represent the estimator $\hat f_n$. We focus on the setting of nonparametric regression under standard smoothness assumptions, and study how the excess risk depends on the storage budget $B_n$.

We view the study of quantized estimation as a theoretical problem of fundamental interest, but quantization may also arise naturally in future applications of large-scale statistical estimation. For instance, when data are collected and analyzed on board a remote satellite, the estimated values may need to be sent back to Earth for further analysis. To limit communication costs, the estimates can be quantized, and it becomes important to understand what, in principle, is lost in terms of statistical risk through quantization. A related scenario is a cloud computing environment where data are processed for many different statistical estimation problems, with the estimates then stored for future analysis. To limit the storage costs, which could dominate the compute costs in many scenarios, it is of interest to quantize the estimates, and the quantization-risk tradeoff again becomes an important concern. Estimates are always quantized to some degree in practice, but as energy constraints are imposed on computation, future processors may limit the precision of arithmetic operations more aggressively [11]; the cost of limited precision in terms of statistical risk must then be quantified. A related problem is to distribute the estimation over many parallel processors, and to then limit the communication costs of the submodels to the central host. We focus on the centralized setting in the current paper, but an extension to the distributed case may be possible with the techniques that we introduce here.

We study risk-storage tradeoffs in the normal means model of nonparametric estimation assuming the target function lies in a Sobolev space. The problem is intimately related to classical rate distortion theory [12], and our results rely on a marriage of minimax theory and rate distortion ideas. We thus build on and refine the connection between function estimation and lossy source coding that was elucidated in David Donoho’s 1998 Wald Lectures [9].

We work in the Gaussian white noise model

$$dY_t \;=\; f(t)\, dt + \varepsilon\, dW_t, \qquad 0 \le t \le 1, \tag{1.1}$$

where $W$ is a standard Wiener process on $[0,1]$, $\varepsilon$ is the standard deviation of the noise, and $f$ lies in the periodic Sobolev space $\widetilde{\mathcal{F}}(m,c)$ of order $m$ and radius $c$. (We discuss the nonperiodic Sobolev space in Section 4.) The white noise model is a centerpiece of nonparametric estimation. It is asymptotically equivalent to nonparametric regression [4] and density estimation [17], and simplifies some of the mathematical analysis in our framework. In this classical setting, the minimax risk of estimation

$$R_\varepsilon(\widetilde{\mathcal{F}}(m,c)) \;=\; \inf_{\hat f} \; \sup_{f \in \widetilde{\mathcal{F}}(m,c)} \; \mathbb{E}\, \|\hat f - f\|_2^2$$

is well known to satisfy

$$R_\varepsilon(\widetilde{\mathcal{F}}(m,c)) \;=\; (P_{m,c} + o(1))\, \varepsilon^{4m/(2m+1)} \quad \text{as } \varepsilon \to 0, \tag{1.2}$$

where $P_{m,c}$ is Pinsker's constant [18]. The constrained minimax risk for quantized estimation becomes

$$R_\varepsilon(\widetilde{\mathcal{F}}(m,c); B_\varepsilon) \;=\; \inf_{\hat f \in \mathcal{C}(B_\varepsilon)} \; \sup_{f \in \widetilde{\mathcal{F}}(m,c)} \; \mathbb{E}\, \|\hat f - f\|_2^2,$$

where $\hat f$ is a quantized estimator that is required to use storage no greater than $B_\varepsilon$ bits in total. Our main result identifies three separate quantization regimes.

  • In the over-sufficient regime, the number of bits is very large, satisfying $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to \infty$, and the classical minimax rate of convergence is obtained. Moreover, the optimal constant is the Pinsker constant $P_{m,c}$.

  • In the sufficient regime, the number of bits scales as $B_\varepsilon \sim b\, \varepsilon^{-2/(2m+1)}$ for a constant $b > 0$. This level of quantization is just sufficient to preserve the classical minimax rate of convergence, and thus in this regime $R_\varepsilon(\widetilde{\mathcal{F}}(m,c); B_\varepsilon) \asymp \varepsilon^{4m/(2m+1)}$. However, the optimal constant degrades to a new constant $Q_{m,c,b} > P_{m,c}$, where $Q_{m,c,b}$ is characterized in terms of the solution of a certain variational problem, depending on $b$.

  • In the insufficient regime, the number of bits scales as $B_\varepsilon = o(\varepsilon^{-2/(2m+1)})$, with however $B_\varepsilon \to \infty$. Under this scaling the number of bits is insufficient to preserve the unquantized minimax rate of convergence, and the quantization error dominates the estimation error. We show that the quantized minimax risk in this case scales as $B_\varepsilon^{-2m}$, with an optimal constant given in Theorem 3.1. Thus, in the insufficient regime the quantized minimax rate of convergence is $B_\varepsilon^{-2m}$, which no longer depends on the noise level $\varepsilon$.

By using an upper bound for the family of constants $Q_{m,c,b}$, the three regimes can be combined to view the risk in terms of a decomposition into estimation error and quantization error. Specifically, we can write

$$R_\varepsilon(\widetilde{\mathcal{F}}(m,c); B_\varepsilon) \;\lesssim\; \underbrace{P_{m,c}\, \varepsilon^{4m/(2m+1)}}_{\text{estimation error}} \;+\; \underbrace{O\big(B_\varepsilon^{-2m}\big)}_{\text{quantization error}}.$$

When $B_\varepsilon \gg \varepsilon^{-2/(2m+1)}$, the estimation error dominates the quantization error, and the usual minimax rate and constant are obtained. In the insufficient case $B_\varepsilon \ll \varepsilon^{-2/(2m+1)}$, only a slower rate of convergence is achievable. When $B_\varepsilon$ and $\varepsilon^{-2/(2m+1)}$ are comparable, the estimation error and quantization error are of the same order. The threshold $\varepsilon^{-2/(2m+1)}$ should not be surprising, given that in classical unquantized estimation the minimax rate of convergence is achieved by estimating the first $\varepsilon^{-2/(2m+1)}$ Fourier coefficients and simply setting the remaining coefficients to zero. This corresponds to selecting a smoothing bandwidth that scales as $n^{-1/(2m+1)}$ with the sample size $n$.
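The following small numerical comparison (ours, not from the paper) evaluates the two error terms above for illustrative choices of the smoothness $m$, noise level $\varepsilon$, and bit budget $B_\varepsilon$, ignoring the leading constants; it simply shows which term dominates in each regime.

```python
# Illustrative comparison of the estimation and quantization error terms,
# ignoring leading constants.  The values of m, eps, and B are arbitrary choices.
m, eps = 1.0, 1e-6
estimation_error = eps ** (4 * m / (2 * m + 1))     # classical minimax rate
threshold = eps ** (-2 / (2 * m + 1))               # ~ number of coefficients to estimate

for B in [0.01 * threshold, threshold, 100 * threshold]:
    quantization_error = B ** (-2 * m)              # rate-distortion term
    regime = ("insufficient" if B < 0.1 * threshold
              else "over-sufficient" if B > 10 * threshold
              else "sufficient")
    print(f"B = {B:12.0f}   estimation ~ {estimation_error:.1e}   "
          f"quantization ~ {quantization_error:.1e}   ({regime})")
```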

At a high level, our proof strategy integrates elements of minimax theory and source coding theory. In minimax analysis one computes lower bounds by thinking in Bayesian terms, looking for least-favorable priors. In source coding analysis one constructs worst-case distributions by setting up an optimization problem based on mutual information. Our quantized minimax analysis requires that these approaches be carefully combined to balance the estimation and quantization errors. To show achievability of the lower bounds we establish, we likewise need to construct an estimator and coding scheme together. Our approach is to quantize the blockwise James-Stein estimator, which achieves the classical Pinsker bound. However, our quantization scheme differs from the approach taken in classical rate distortion theory, where the codebook is fixed once the source distribution is known. In our setting, we require the allocation of bits to be adaptive to the data, using more bits for blocks that have larger signal size. We therefore design a quantized estimation procedure that adaptively distributes the communication budget across the blocks (a simple sketch of such an allocation rule is given below). Assuming only a lower bound on the smoothness $m$ and an upper bound on the radius $c$ of the Sobolev space, our quantization-estimation procedure is adaptive to $m$ and $c$ in the usual statistical sense, and is also adaptive to the coding regime. In other words, given a storage budget $B_\varepsilon$, the coding procedure achieves the optimal rate and constant for the unknown $m$ and $c$, operating in the corresponding regime for those parameters.
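The sketch below is our own construction, not the paper's exact procedure; it illustrates one simple way to distribute a bit budget adaptively across blocks, giving more bits to blocks with larger estimated signal energy and none to blocks with essentially no signal. The function name and the log-scale weighting are illustrative choices.

```python
import numpy as np

def allocate_bits(block_signal, total_bits):
    """Split a total bit budget across blocks, giving more bits to blocks with
    larger estimated signal energy.  Blocks with (essentially) no signal get 0 bits.
    This is only a schematic allocation rule, not the paper's exact procedure."""
    s = np.maximum(np.asarray(block_signal, dtype=float), 0.0)
    if s.sum() == 0:
        return np.zeros_like(s, dtype=int)
    # proportional allocation on a log scale, then rounded down to integers
    weights = np.log1p(s)
    raw = total_bits * weights / weights.sum()
    bits = np.floor(raw).astype(int)
    # hand out the leftover bits to the blocks with the largest remainders
    leftover = total_bits - bits.sum()
    order = np.argsort(raw - bits)[::-1]
    bits[order[:leftover]] += 1
    return bits

# example: estimated per-block signal energies from a blockwise James-Stein fit
print(allocate_bits([5.0, 1.2, 0.3, 0.0], total_bits=32))
```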

In the following section we establish some notation, outline our proof strategy, and present some simple examples. In Section 3 we state and prove our main result on quantized minimax lower bounds, relegating some of the technical details to an appendix. In Section 4 we show asymptotic achievability of these lower bounds, using a quantized estimation procedure based on adaptive James-Stein estimation and quantization in blocks, again deferring proofs of technical lemmas to the supplementary material. This is followed by a presentation of some results from experiments in Section 5, illustrating the performance and properties of the proposed quantized estimation procedure.

2 Quantized estimation and minimax risk

Suppose that $X_n$ is a random vector drawn from a distribution $P_{\theta,n}$. Consider the problem of estimating a functional $\theta$ of the distribution, assuming $\theta$ is restricted to lie in a parameter space $\Theta$. To unclutter some of the notation, we will suppress the subscript $n$ and write $\theta$ and $\hat\theta$ in the following, keeping in mind that nonparametric settings are allowed. The subscript $n$ will be maintained for random variables. The minimax risk of estimating $\theta$ is then defined as

$$R(\Theta) \;=\; \inf_{\hat\theta} \; \sup_{\theta \in \Theta} \; \mathbb{E}\, \|\hat\theta - \theta\|^2,$$

where the infimum is taken over all possible estimators $\hat\theta$ that are measurable with respect to the data $X_n$. We will abuse notation by using $\hat\theta$ to denote both the estimator and the estimate calculated based on an observed set of data. Among the numerous approaches to obtaining the minimax risk, the Bayesian method is best aligned with quantized estimation. Consider a prior distribution $\pi$ whose support is a subset of $\Theta$. Let $\hat\theta_\pi$ be the posterior mean of $\theta$ given the data $X_n$, which minimizes the integrated risk. Then for any estimator $\hat\theta$,

$$\sup_{\theta \in \Theta} \mathbb{E}\, \|\hat\theta - \theta\|^2 \;\ge\; \int \mathbb{E}\, \|\hat\theta - \theta\|^2 \, d\pi(\theta) \;\ge\; \int \mathbb{E}\, \|\hat\theta_\pi - \theta\|^2 \, d\pi(\theta) \;=:\; R_\pi.$$

Taking the infimum over $\hat\theta$ yields

$$R(\Theta) \;\ge\; R_\pi.$$

Thus, any prior distribution supported on $\Theta$ gives a lower bound on the minimax risk, and selecting the least-favorable prior leads to the largest lower bound provable by this approach.

Now consider constraints on the storage or communication cost of our estimate. We restrict to the set of estimators that use no more than a total of $B_n$ bits; that is, the estimator takes at most $2^{B_n}$ different values. Such quantized estimators can be formulated by the following two-step procedure (a minimal code sketch is given after the risk decomposition below). First, an encoder maps the data $X_n$ to an index

$$i \;=\; \phi_n(X_n) \;\in\; \{1, 2, \dots, 2^{B_n}\},$$

where $\phi_n$ is the encoding function. The decoder, after receiving or retrieving the index, represents the estimate based on a decoding function

$$\psi_n : \{1, 2, \dots, 2^{B_n}\} \;\to\; \{\hat\theta^{(1)}, \dots, \hat\theta^{(2^{B_n})}\},$$

mapping the index to a codebook of estimates. All that needs to be transmitted or stored is the $B_n$-bit-long index, and the quantized estimator is simply $\hat\theta = \psi_n \circ \phi_n$, the composition of the encoder and the decoder functions. Denoting by $L(\hat\theta)$ the storage, in terms of the number of bits, required by an estimator $\hat\theta$, the minimax risk of quantized estimation is then defined as

$$R(\Theta; B_n) \;=\; \inf_{\hat\theta:\, L(\hat\theta) \le B_n} \; \sup_{\theta \in \Theta} \; \mathbb{E}\, \|\hat\theta - \theta\|^2,$$

and we are interested in the effect of the constraint $B_n$ on the minimax risk. Once again, we consider a prior distribution $\pi$ supported on $\Theta$ and let $\hat\theta_\pi$ be the posterior mean of $\theta$ given the data. The integrated risk can then be decomposed as

$$\int \mathbb{E}\,\|\hat\theta - \theta\|^2\, d\pi(\theta) \;=\; \mathbb{E}\,\|\hat\theta - \hat\theta_\pi\|^2 + 2\,\mathbb{E}\,\big[(\hat\theta - \hat\theta_\pi)^\top(\hat\theta_\pi - \theta)\big] + \mathbb{E}\,\|\hat\theta_\pi - \theta\|^2 \;=\; \mathbb{E}\,\|\hat\theta_\pi - \theta\|^2 + \mathbb{E}\,\|\hat\theta - \hat\theta_\pi\|^2, \tag{2.1}$$

where the expectation is with respect to the joint distribution of $\theta$ and $X_n$, and the second equality is due to

$$\mathbb{E}\,(\theta \mid X_n, \hat\theta) \;=\; \mathbb{E}\,(\theta \mid X_n) \;=\; \hat\theta_\pi,$$

using the fact that $\theta \to X_n \to \hat\theta$ forms a Markov chain. The first term in the decomposition (2.1) is the Bayes risk $R_\pi$. The second term can be viewed as the excess risk due to quantization.
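To make the two-step encoder/decoder formulation above concrete, here is a minimal sketch (our own, not from the paper) of a $B$-bit quantized estimator for a one-dimensional normal mean: the encoder maps the data to one of $2^B$ indices by nearest-codeword search on a fixed uniform codebook, and the decoder returns the corresponding codeword. The class name, the codebook range, and the use of the sample mean are illustrative choices.

```python
import numpy as np

class QuantizedEstimator:
    """Minimal sketch of a B-bit quantized estimator: the encoder maps data to an
    index in {0, ..., 2^B - 1}; the decoder maps the index back to a codeword."""
    def __init__(self, bits, lo=-1.0, hi=1.0):
        self.codebook = np.linspace(lo, hi, 2 ** bits)   # fixed, data-independent codebook

    def encode(self, x):
        estimate = np.mean(x)                            # e.g. the MLE for a normal mean
        return int(np.argmin(np.abs(self.codebook - estimate)))

    def decode(self, index):
        return self.codebook[index]

    def __call__(self, x):                               # hat{theta} = psi(phi(x))
        return self.decode(self.encode(x))

rng = np.random.default_rng(0)
q = QuantizedEstimator(bits=4)
x = 0.3 + rng.normal(size=100)                           # data with true mean 0.3
print(q(x))                                              # quantized estimate, one of 16 values
```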

Let $S_n$ be a sufficient statistic for $\theta$. The posterior mean can be expressed in terms of $S_n$, and we will abuse notation and write it as $\hat\theta_\pi(S_n)$. Since the quantized estimator $\hat\theta$ uses at most $B_n$ bits, we have

$$B_n \;\ge\; H(\hat\theta) \;\ge\; I(S_n; \hat\theta),$$

where $H$ and $I$ denote the Shannon entropy and mutual information, respectively. Now consider the optimization

$$\inf\; \mathbb{E}\,\big\|\hat\theta - \hat\theta_\pi(S_n)\big\|^2 \quad \text{such that} \quad I(S_n; \hat\theta) \le B_n,$$

where the infimum is over all conditional distributions of $\hat\theta$ given $S_n$. This parallels the definition of the distortion rate function, minimizing the distortion under a constraint on mutual information [12]. Denoting the value of this optimization by $D_\pi(B_n)$, we can lower bound the quantized minimax risk by

$$R(\Theta; B_n) \;\ge\; R_\pi + D_\pi(B_n).$$

Since each prior distribution supported on $\Theta$ gives a lower bound, we have

$$R(\Theta; B_n) \;\ge\; \sup_{\pi}\, \big\{ R_\pi + D_\pi(B_n) \big\},$$

and the goal becomes to obtain a least favorable prior for the quantized risk.
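As a simple worked illustration of this lower bound technique (our own example, for a single Gaussian coordinate): with prior $\theta \sim N(0, \tau^2)$ and observation $X = \theta + \varepsilon Z$, the Bayes risk is $\tau^2\varepsilon^2/(\tau^2+\varepsilon^2)$, the posterior mean has variance $\tau^4/(\tau^2+\varepsilon^2)$, and the Gaussian distortion rate function gives $D_\pi(B) = \tau^4/(\tau^2+\varepsilon^2)\cdot 2^{-2B}$.

```python
def quantized_risk_lower_bound(tau2, eps2, bits):
    """Scalar Gaussian illustration of the lower bound R_pi + D_pi(B):
    prior theta ~ N(0, tau2), observation X = theta + eps*Z."""
    bayes_risk = tau2 * eps2 / (tau2 + eps2)                # R_pi
    posterior_mean_var = tau2 ** 2 / (tau2 + eps2)          # Var of hat{theta}_pi
    distortion = posterior_mean_var * 2.0 ** (-2 * bits)    # Gaussian distortion-rate D(B)
    return bayes_risk + distortion

for B in [0, 1, 2, 4, 8]:
    print(B, quantized_risk_lower_bound(tau2=1.0, eps2=0.1, bits=B))
```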

Before turning to the case of quantized estimation over Sobolev spaces, we illustrate this technique on some simpler, more concrete examples.

Example 2.1 (Normal means in a hypercube).

Let $X_i \sim N(\theta_i, \sigma^2)$ independently for $i = 1, \dots, d$, where the mean vector $\theta = (\theta_1, \dots, \theta_d)$ lies in a $d$-dimensional hypercube. Suppose that $\sigma$ is known and $\theta$ is to be estimated. We choose the prior on $\theta$ to be a product distribution with density

It is shown in [15] that

where . Turning to , let be the posterior mean of . In fact, by the independence and symmetry among the dimensions, we know are independently and identically distributed. Denoting by this common distribution, we have

where is the distortion rate function for , i.e., the value of the following problem

such that

Now using the Shannon lower bound [8], we get

Note that as , converges to in distribution, so there exists a constant independent of and such that

This lower bound intuitively shows that the risk is regulated by two factors, the estimation error and the quantization error; whichever is larger dominates the risk. The scaling behavior of this lower bound (ignoring constants) can be achieved by first quantizing each of the $d$ intervals using $B/d$ bits each, and then mapping the MLE to its closest codeword.
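A minimal sketch of this per-coordinate scheme (ours; the unit interval $[0,1]$ for the hypercube and the function name are illustrative assumptions): split the $B$ bits evenly over the $d$ coordinates, build a uniform codebook on each interval, and map each coordinate of the MLE to its nearest codeword.

```python
import numpy as np

def quantize_mle_per_coordinate(x, total_bits, lo=0.0, hi=1.0):
    """Split B bits evenly over the d coordinates, build a uniform codebook on
    each interval [lo, hi], and map the MLE of each coordinate (here, x itself)
    to its nearest codeword.  The interval is an illustrative choice."""
    x = np.asarray(x, dtype=float)
    d = x.size
    bits_per_coord = total_bits // d
    codebook = np.linspace(lo, hi, 2 ** bits_per_coord)
    idx = np.argmin(np.abs(x[:, None] - codebook[None, :]), axis=1)
    return codebook[idx]

rng = np.random.default_rng(1)
theta = rng.uniform(0, 1, size=8)
x = theta + 0.1 * rng.normal(size=8)          # noisy observations of theta
print(quantize_mle_per_coordinate(x, total_bits=32))
```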

Example 2.2 (Gaussian sequences in Euclidean balls).

In the example shown above, the lower bound is tight only in terms of the scaling of the key parameters. In some instances, we are able to find an asymptotically tight lower bound for which we can show achievability of both the rate and the constants. Estimating the mean vector of a Gaussian sequence with an $\ell_2$ norm constraint on the mean is one such case, as we showed in previous work [27].

Specifically, let for , where . Suppose that the parameter lies in the Euclidean ball . Furthermore, suppose that . Then using the prior it can be shown that

The asymptotic estimation error is the well-known Pinsker bound for the Euclidean ball case. As shown in [27], an explicit quantization scheme can be constructed that asymptotically achieves this lower bound, realizing the smallest possible quantization error for a budget of $B$ bits.

The Euclidean ball case is clearly relevant to the Sobolev ellipsoid case, but new coding strategies and proof techniques are required. In particular, as will be made clear in the sequel, we will use an adaptive allocation of bits across blocks of coefficients, using more bits for blocks that have larger estimated signal size. Moreover, determination of the optimal constants requires a detailed analysis of the worst case prior distributions and the solution of a series of variational problems.

3 Quantized estimation over Sobolev spaces

Recall that the Sobolev space of order $m$ and radius $c$ is defined by

$$\mathcal{F}(m,c) \;=\; \Big\{ f : [0,1] \to \mathbb{R} \;:\; f^{(m-1)} \text{ is absolutely continuous and } \int_0^1 \big(f^{(m)}(t)\big)^2\, dt \le c^2 \Big\}.$$

The periodic Sobolev space is defined by

$$\widetilde{\mathcal{F}}(m,c) \;=\; \Big\{ f \in \mathcal{F}(m,c) \;:\; f^{(j)}(0) = f^{(j)}(1), \; j = 0, 1, \dots, m-1 \Big\}. \tag{3.1}$$

The white noise model (1.1) is asymptotically equivalent to making $n$ equally spaced observations along the sample path, $y_i = f(i/n) + \sigma \xi_i$, where $\xi_i \stackrel{iid}{\sim} N(0,1)$ [4]. In this formulation, the noise level in the formulation (1.1) scales as $\varepsilon = \sigma/\sqrt{n}$, and the rate of convergence takes the familiar form $n^{-2m/(2m+1)}$, where $n$ is the number of observations.

To carry out quantized estimation we now require an encoder

$$\phi_\varepsilon : Y \;\mapsto\; \phi_\varepsilon(Y) \in \{1, 2, \dots, 2^{B_\varepsilon}\},$$

which is a function applied to the sample path $Y = (Y_t)_{0 \le t \le 1}$. The decoding function then takes the form

$$\psi_\varepsilon : \{1, 2, \dots, 2^{B_\varepsilon}\} \;\to\; L^2[0,1],$$

and maps the index to a function estimate. As in the previous section, we write the composition of the encoder and the decoder as $\hat f = \psi_\varepsilon \circ \phi_\varepsilon$, which we call the quantized estimator. The communication or storage required by this quantized estimator is no more than $B_\varepsilon$ bits.

To recast quantized estimation in terms of an infinite sequence model, let $(\varphi_j)_{j \ge 1}$ be the trigonometric basis of $L^2[0,1]$, and let

$$\theta_j \;=\; \int_0^1 f(t)\, \varphi_j(t)\, dt, \qquad j = 1, 2, \dots,$$

be the Fourier coefficients. It is well known [22] that $f$ belongs to $\widetilde{\mathcal{F}}(m,c)$ if and only if the Fourier coefficients $\theta = (\theta_1, \theta_2, \dots)$ belong to the Sobolev ellipsoid $\Theta(m,c)$ defined as

$$\Theta(m,c) \;=\; \Big\{ \theta : \sum_{j=1}^\infty a_j^2\, \theta_j^2 \le c^2 \Big\}, \tag{3.2}$$

where

$$a_j \;=\; \begin{cases} j^m & \text{for even } j, \\ (j-1)^m & \text{for odd } j. \end{cases}$$

Although this is the standard definition of a Sobolev ellipsoid, for the rest of the paper we will set $a_j = j^m$, for convenience of analysis. All of the results hold for both definitions of $a_j$. Also note that (3.2) actually gives a more general definition, since $m$ is no longer assumed to be an integer, as it is in (3.1). Expanding with respect to the same orthonormal basis, the observed path is converted into an infinite Gaussian sequence

$$y_j \;=\; \theta_j + \varepsilon z_j, \qquad j = 1, 2, \dots,$$

with $z_j \stackrel{iid}{\sim} N(0,1)$. For an estimator $\hat\theta$ of $\theta$, an estimate of $f$ is obtained by

$$\hat f(t) \;=\; \sum_{j=1}^\infty \hat\theta_j\, \varphi_j(t),$$

with squared error $\|\hat f - f\|_2^2 = \|\hat\theta - \theta\|_2^2$. In terms of this standard reduction, the quantized minimax risk is thus reformulated as

$$R_\varepsilon(\Theta(m,c); B_\varepsilon) \;=\; \inf_{(\phi_\varepsilon,\, \psi_\varepsilon)} \; \sup_{\theta \in \Theta(m,c)} \; \mathbb{E}\, \big\| \psi_\varepsilon(\phi_\varepsilon(y)) - \theta \big\|_2^2. \tag{3.3}$$
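The following short simulation is our own illustration, with arbitrary parameter values: it generates a coefficient sequence inside the ellipsoid, observes it through the sequence model above, and applies the simple projection estimator that keeps roughly the first $\varepsilon^{-2/(2m+1)}$ coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Sequence model y_j = theta_j + eps*z_j with theta inside sum_j j^(2m) theta_j^2 <= c^2.
m, c, eps, J = 1, 1.0, 0.01, 2000
theta = rng.normal(size=J) / np.arange(1, J + 1) ** (m + 1)      # decaying coefficients
theta *= c / np.sqrt(np.sum(np.arange(1, J + 1) ** (2 * m) * theta ** 2))  # rescale into the ellipsoid

y = theta + eps * rng.normal(size=J)

# Projection estimator: keep roughly eps^(-2/(2m+1)) coefficients, zero out the rest.
n_star = int(eps ** (-2 / (2 * m + 1)))
theta_hat = np.where(np.arange(J) < n_star, y, 0.0)

print("coefficients kept:", n_star)
print("squared error    :", np.sum((theta_hat - theta) ** 2))
print("Pinsker rate     :", eps ** (4 * m / (2 * m + 1)))
```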

To state our result, we need to define the value of the following variational problem:

(3.4)

where the feasible set is the collection of increasing functions and values satisfying

The significance and interpretation of the variational problem will become apparent as we outline the proof of this result.

Theorem 3.1.

Let $R_\varepsilon(\Theta(m,c); B_\varepsilon)$ be defined as in (3.3), for smoothness $m$ and radius $c$.

  1. If $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to \infty$ as $\varepsilon \to 0$, then

    $$R_\varepsilon(\Theta(m,c); B_\varepsilon) \;=\; (P_{m,c} + o(1))\, \varepsilon^{4m/(2m+1)},$$

    where $P_{m,c}$ is Pinsker's constant defined in (1.2).

  2. If $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to b$ for some constant $b > 0$ as $\varepsilon \to 0$, then

    $$R_\varepsilon(\Theta(m,c); B_\varepsilon) \;=\; (Q_{m,c,b} + o(1))\, \varepsilon^{4m/(2m+1)},$$

    where $Q_{m,c,b} > P_{m,c}$ is determined by the value of the variational problem (3.4).

  3. If $B_\varepsilon \to \infty$ and $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to 0$ as $\varepsilon \to 0$, then the quantized minimax risk scales as $B_\varepsilon^{-2m}$, with an explicit leading constant depending only on $m$ and $c$.

In the first regime, where the number of bits is much greater than $\varepsilon^{-2/(2m+1)}$, we recover the same convergence result as in Pinsker's theorem, in terms of both the convergence rate and the leading constant. The proof of the lower bound for this regime can directly follow the proof of Pinsker's theorem, since the set of estimators considered in our minimax framework is a subset of all possible estimators.

In the second regime where we have “just enough” bits to preserve the rate, we suffer a loss in terms of the leading constant. In this “Goldilocks regime,” the optimal rate is achieved but the constant in front of the rate is Pinsker’s constant plus a positive quantity determined by the variational problem.

While the solution to this variational problem does not appear to have an explicit form, it can be computed numerically. We discuss this term at length in the sequel, where we explain the origin of the variational problem, compute the constant numerically, and approximate it from above and below. The constants $P_{m,c}$ and $Q_{m,c,b}$ are shown graphically in Figure 1. Note that the parameter $b$ can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, since $\varepsilon^{-2/(2m+1)}$ is asymptotically the number of coefficients needed to estimate at the classical minimax rate. As shown in Figure 1, the constant $Q_{m,c,b}$ for quantized estimation quickly approaches the Pinsker constant $P_{m,c}$ as $b$ increases; for moderate values of $b$ the two are already very close.

Figure 1: The constants $P_{m,c}$ and $Q_{m,c,b}$ as a function of the quantization level $b$ in the sufficient regime, where $B_\varepsilon \sim b\, \varepsilon^{-2/(2m+1)}$. The parameter $b$ can be thought of as the average number of bits per coefficient used by an optimal quantized estimator, because $\varepsilon^{-2/(2m+1)}$ is asymptotically the number of coefficients needed to estimate at the classical minimax rate. Here $m$ and $c$ are fixed at particular values. The curve indicates that with only a few bits per coefficient, optimal quantized minimax estimation degrades by less than a factor of 2 in the constant; with several bits per coefficient, the constant is very close to the classical Pinsker constant.

In the third regime, where the communication budget is insufficient for the estimator to achieve the optimal rate, we obtain a sub-optimal rate that no longer depends explicitly on the noise level $\varepsilon$ of the model. In this regime, quantization error dominates, and the risk decays at the rate $B_\varepsilon^{-2m}$ no matter how fast $\varepsilon$ approaches zero, as long as $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to 0$. Here the analogue of Pinsker's constant takes a very simple form.

Proof of Theorem 3.1.

Consider a Gaussian prior distribution on $\theta$ under which $\theta_j \sim N(0, \sigma_j^2)$ independently for $j = 1, 2, \dots$, in terms of variance parameters $\sigma_j^2$ to be specified later. One requirement for the variances is

$$\sum_{j=1}^{\infty} j^{2m}\, \sigma_j^2 \;<\; c^2 .$$

We denote this prior distribution by $\pi$, and show in Section A that it is asymptotically concentrated on the ellipsoid $\Theta(m,c)$. Under this prior the model is

$$y_j \;=\; \theta_j + \varepsilon z_j, \qquad \theta_j \sim N(0, \sigma_j^2), \qquad z_j \sim N(0,1),$$

and the marginal distribution of $y_j$ is thus $N(0, \sigma_j^2 + \varepsilon^2)$. Following the strategy outlined in Section 2, let $\hat\theta_\pi$ denote the posterior mean of $\theta$ given $y$ under this prior, and consider the optimization

$$\inf\; \mathbb{E}\,\|\hat\theta - \hat\theta_\pi\|^2 \quad \text{such that} \quad I(y; \hat\theta) \le B_\varepsilon,$$

where the infimum is over all distributions on $\hat\theta$ such that $\theta \to y \to \hat\theta$ forms a Markov chain. Now, the posterior mean satisfies $\hat\theta_{\pi,j} = \frac{\sigma_j^2}{\sigma_j^2 + \varepsilon^2}\, y_j \sim N(0, \tau_j^2)$, where $\tau_j^2 = \sigma_j^4/(\sigma_j^2 + \varepsilon^2)$. Note that the Bayes risk under this prior is

$$R_\pi \;=\; \sum_{j=1}^\infty \frac{\sigma_j^2\, \varepsilon^2}{\sigma_j^2 + \varepsilon^2}.$$

Define

Then the classical rate distortion argument [8] gives that

where . Therefore, the quantized minimax risk is lower bounded by

where is the value of the optimization

()
such that

and the deviation term is analyzed in the supplementary material.

Observe that the quantity can be upper and lower bounded by

(3.5)

where the estimation error term is the value of the optimization

()
such that

and the quantization error term is the value of the optimization

()
such that

The following results specify the leading order asymptotics of these quantities.

Lemma 3.2.

As ,

Lemma 3.3.

As ,

(3.6)

Moreover, if and ,

This yields the following closed form upper bound.

Corollary 3.4.

Suppose that and . Then

(3.7)

In the insufficient regime, where $B_\varepsilon \to \infty$ and $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to 0$ as $\varepsilon \to 0$, equation (3.5) and Lemma 3.3 show that

Similarly, in the over-sufficient regime, where $B_\varepsilon\, \varepsilon^{2/(2m+1)} \to \infty$ as $\varepsilon \to 0$, we conclude that

We now turn to the sufficient regime, $B_\varepsilon \sim b\, \varepsilon^{-2/(2m+1)}$. We begin by making three observations about the solution to the optimization (). First, we note that the series of prior variances that solves () can be assumed to be decreasing: if it were not in decreasing order, we could rearrange it to be decreasing, and correspondingly rearrange the associated series, without violating the constraints or changing the value of the optimization. Second, we note that given the variances, the optimal allocation of distortion across coordinates is obtained by the "reverse water-filling" scheme [8]; a small numerical sketch of this scheme is given at the end of this section. Specifically, there exists $\lambda > 0$ such that

where $\lambda$ is chosen so that

Third, there exists an integer such that the optimal series satisfies

where $\lambda$ is the "water-filling level" (see [8]). Using these three observations, the optimization () can be reformulated as

()
such that
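To illustrate the reverse water-filling step referenced above, here is a small numerical sketch (our own, not the paper's code): given per-coordinate variances and a total rate budget in bits, it finds the water level by bisection and returns the per-coordinate distortions and rates.

```python
import numpy as np

def reverse_water_filling(variances, total_bits, tol=1e-10):
    """Classical reverse water-filling for independent Gaussian sources:
    distortion D_j = min(lambda, tau_j^2), rate R_j = 0.5*log2(tau_j^2 / D_j),
    with lambda chosen so that sum_j R_j equals the total rate budget."""
    tau2 = np.asarray(variances, dtype=float)

    def total_rate(lam):
        d = np.minimum(lam, tau2)
        return np.sum(0.5 * np.log2(tau2 / d))

    lo, hi = tol, tau2.max()
    while hi - lo > tol:                      # bisection on the water level
        mid = 0.5 * (lo + hi)
        if total_rate(mid) > total_bits:      # rate too high -> raise the water level
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    distortions = np.minimum(lam, tau2)
    rates = 0.5 * np.log2(tau2 / distortions)
    return lam, distortions, rates

# example: decaying variances, a budget of 20 bits in total
lam, D, R = reverse_water_filling(1.0 / np.arange(1, 11) ** 2, total_bits=20)
print("water level:", lam)
print("total distortion:", D.sum(), " total rate:", R.sum())
```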