One of the current challenges in theoretical statistics is to understand when bootstrap methods work in high-dimensional problems. In this direction, there has been a surge of recent interest in connection with “max statistics” such as
where is the th coordinate of the sum , involving i.i.d. vectors in .
This type of statistic has been a focal point in the literature for at least two reasons. First, it is an example of a statistic for which bootstrap methods can succeed in high dimensions under mild assumptions, which was established in several pathbreaking works (Arlot, Blanchard and Roquain, 2010a, b; Chernozhukov, Chetverikov and Kato, 2013, 2017). Second, the statistic is closely linked to several fundamental topics, such as suprema of empirical processes, nonparametric confidence regions, and multiple testing problems. Likewise, many applications of bootstrap methods for max statistics have ensued at a brisk pace in recent years (see, e.g., Chernozhukov, Chetverikov and Kato, 2014; Wasserman, Kolar and Rinaldo, 2014; Chen, Genovese and Wasserman, 2015; Chang, Yao and Zhou, 2017; Zhang and Cheng, 2017; Dezeure, Bühlmann and Zhang, 2017; Chen, 2018; Fan, Shao and Zhou, 2018; Belloni et al., 2018).
One of the favorable aspects of bootstrap approximation results for the distribution is that rates have been established with only logarithmic dependence in . For instance, the results in Chernozhukov, Chetverikov and Kato (2017) imply that under certain conditions, the Kolmogorov distance between and its bootstrap counterpart satisfies the bound
with high probability, whereare constants not depending on or , and denotes the matrix whose rows are . (In the following, will be often re-used to designate a positive constant, possibly with a different value at each occurrence.) Additional refinements of this result can be found in the same work, with regard to the choice of metric, or choice of bootstrap method. Also, recent progress in sharpening the exponent has been made by Deng and Zhang (2017). However, this mild dependence on is offset by the dependence on , which differs from the rate in the multivariate Berry-Esseen theorem when .
Currently, the general question of determining the best possible rates of bootstrap approximation in high dimensions is largely open. In particular, rates of the form (1.1) for and related statistics have been conjectured to be minimax optimal in the settings considered by Chernozhukov, Chetverikov and Kato (2017) and Chen (2018). Nevertheless, in finite-sample experiments, the performance of bootstrap methods for max statistics is often more encouraging than what might be expected from the dependence on (see, e.g. Zhang and Cheng, 2017; Fan, Shao and Zhou, 2018; Belloni et al., 2018). This suggests that improved rates are possible in at least some situations.
The purpose of this paper is to quantify an instance of such improvement when additional model structure is available. Specifically, we consider the case when the coordinates of have decaying variances. If we let for each , and write , then this condition may be formalized as
we discuss examples related to principal component analysis, sparse count data, and Fourier coefficients of functional data. Furthermore, this condition can be empirically verified in an approximate sense, due to the fact that the parameters
can be accurately estimated, even in high dimensions.
Within the setting of decaying variances, our main results show that bootstrap approximation of can be achieved at a nearly parametric rate. More precisely, for any fixed , the bound
holds with high probability, where is a number that depends only on and . Here, it is worth emphasizing a few basic aspects of this bound. First, it is non-asymptotic and does not depend on . Second, the parameter is allowed to be arbitrarily small, and in this sense, the decay condition (1.2) is very weak. Third, the result holds when is constructed using the standard multiplier bootstrap procedure (Chernozhukov, Chetverikov and Kato, 2013), and it is not necessary to use any auxiliary dimension reduction or variable selection.
With regard to previous bootstrap approximation results for , it is important to clarify that our bound (1.3) does not conflict with the conjectured optimality of the rate . The reason is that the rate has been established in settings where the values are restricted from becoming too small. A basic version of such a requirement is that
, suggesting a phase transition between the ratesand at the “boundary” corresponding to .
Another important consideration that is related to the conditions (1.2) and (1.4) is the use of standardized variables. Namely, it is of special interest to approximate the distribution of the statistic
which is equivalent to approximating when each is standardized to have variance 1. Given that standardization eliminates variance decay, it might seem that the rate has no bearing on approximating . However, it is still possible to take advantage of variance decay, by using the basic notion of “partial standardization”.
The idea of partial standardization is to slightly modify by using a fractional power of each . Specifically, if we let be a free parameter, then we can consider the partially standardized statistic
which interpolates betweenand as ranges over . This statistic has the following significant property: If satisfy the variance decay condition (1.2), and if is chosen to be slightly less than 1, then our main results show that the rate holds for bootstrap approximations of . In fact, this effect occurs even when as . Further details can be found in Section 3. Also note that our main theoretical results are formulated entirely in terms of , which covers the statistic as a special case.
In practice, simultaneous confidence intervals derived from approximations to are just as easy to use as those based on
. Although there is a slight difference between the quantiles ofand when , the important point is that the quantiles of may be preferred, since faster rates of bootstrap approximation are available. (See also Figure 1 in Section 4.) In this way, the statistic offers a simple way to blend the utility of standardized variables with the beneficial effects of variance decay.
The remainder of the paper is organized as follows. In Section 2, we outline the problem setting, with a complete statement of the theoretical assumptions, as well as some motivating facts and examples. Our main results are given in Section 3, which consist of a Gaussian approximation result for (Theorem 3.1), and a corresponding bootstrap approximation result (Theorem 3.2). To provide a numerical illustration of our results, in Section 4 we discuss a problem in functional data analysis, where the variance decay condition naturally arises. Specifically, we show how bootstrap approximations to can be used to derive simultaneous confidence intervals for the Fourier coefficients of a mean function. Lastly, our conclusions are summarized in Section 5. Proofs are in the appendices as well as in the supplementary material. The organization of the proofs is described at the beginning of the appendices.
For any symmetric matrix
, the ordered eigenvalues are denoted as, where . If is a fixed vector, and , we write . In a slight abuse of notation, we also write to refer to the
norm of a scalar random variable, with . The -Orlicz norm is , and a random variable satisfying is said to be sub-exponential. If and are sequences of positive real numbers, then the relation means that there is an absolute constant , and an integer , such that for all . Also, define the abbreviations and . Lastly, when using symbols such as , etc., to refer to constants that do not depend on or , we often allow their value to change from line to line in order to simplify presentation.
2 Setting and preliminaries
We consider a sequence of models indexed by , with all parameters depending on , except for those that are explicitly stated to be fixed.
Assumption 2.1 (Data-generating model).
There is a vector and positive definite matrix , such that the observations are generated as for each , where the vectors are the rows of a matrix with i.i.d. entries.
The entries of satisfy , and , where is an absolute constant.
There is an absolute constant , such that the dimension satisfies .
With regard to the dimension in part (iii), note that we allow the ratio to less than 1, or to diverge at an arbitrarily fast rate as . Meanwhile, the sub-exponential condition in part (ii) is similar to other tail conditions that have been used in previous works on bootstrap methods for max statistics (Chernozhukov, Chetverikov and Kato, 2013; Deng and Zhang, 2017), and is considerably weaker than requiring to be sub-Gaussian.
To state our next assumption, fix any , and let denote a set of indices corresponding to the largest values among , i.e., . Next, define the quantity
as the largest correlation among distinct variables indexed by . Lastly, define the integer according to
which occurs in the following conditions.
Assumption 2.2 (Structural assumptions).
There is a parameter not depending on , and absolute constants , such that
There is an absolute constant such that
These assumptions are approximately checkable in practice, since the parameters and can be estimated at nearly parametric rates, even in high dimensions (see Lemmas D.6 and D.7). When considering the size of the decay parameter , note that if is viewed as a covariance operator acting on a Hilbert space, then the condition essentially corresponds to the case of a trace-class operator — a property that is typically assumed in functional data analysis (Hsing and Eubank, 2015). From this perspective, the condition is very weak, and allows the trace of to diverge as .
The conditions (2.2) and (2.3) are also mild in the sense that they only apply to a small index set of size . Furthermore, the correlation structure for the variables outside of is completely unrestricted. Lastly, the condition (2.3) can actually be relaxed so that is allowed to approach 1 at a certain rate as , but we do not pursue such refinements for simplicity.
2.1 Examples of correlation structures
Some examples of correlation matrices satisfying the condition (2.3) are given below.
Based on these examples, it is also straightforward to construct covariance matrices that jointly satisfy (2.1), (2.2), and (2.3). Specifically, let be any sequence of positive numbers satisfying (2.1) and (2.2) and let be one of the matrices above. Then, a suitable covariance matrix can be obtained by conjugating with .
2.2 Examples of variance decay
To provide additional context for the decay condition (2.1), we describe some general situations where it arises.
Principal components analysis (PCA). The broad applicability of PCA rests on the fact that many types of data have an underlying covariance matrix with weakly sparse eigenvalues. Roughly speaking, this means that most of the eigenvalues of are negligible in comparison to the top few. Similar to the condition (1.2), this situation is commonly modeled with the decay condition
for some parameter (see, e.g., Johnstone and Lu, 2009), where are the sorted eigenvalues of . Whenever this holds, it can be shown that the variance decay condition (1.2) must hold for some associated parameter , and this is done in Proposition 2.1 below. So, in a qualitative sense, this indicates that if a dataset is amenable to PCA, then it is also likely to fall within the scope of our setting.
Sparse count data. Consider a multinomial model based on cells and trials, parameterized by a vector of cell frequencies . The case when the vector
is approximately sparse often occurs in the analysis of contingency tables(see, e.g. Cressie and Read, 1984; Zelterman, 1987; Plunkett and Park, 2017). If the th trial is represented as a vector in the set of standard basis vectors , then . Therefore, a weak sparsity condition on conforms naturally with the variance decay condition (1.2). Similar considerations also apply to multivariate models with sparse Poisson marginals. Namely, if each observation has Poisson marginals and a weakly sparse mean vector , then the basic fact leads to variance decay.
Fourier coefficients of functional data. Let be an i.i.d. sample of functional data, taking values in a separable Hilbert space . In addition, suppose that the covariance operator is trace-class and satisfies an eigenvalue decay condition of the form (2.4) — which is common in functional data analysis (see, e.g., Cai and Hall, 2006). Lastly, for each , let denote the first generalized Fourier coefficients of with respect to some fixed orthonormal basis of . That is,
Under the above conditions, it can be shown that no matter which basis is chosen, the vectors always satisfy the variance decay condition (1.2). (This follows from Proposition 2.1 below.) In Section 4, we explore some consequences of this condition as it relates to simultaneous confidence intervals for the Fourier coefficients of the mean function .
To conclude this section, we state a proposition that was used in the examples above. In essence, this basic result shows that decay among requires at least some decay among . As a matter of notation, if is a fixed vector, and , then the weak- quasi-norm is given by where are the sorted absolute entries of .
Fix two numbers , and . Then, there is a constant depending only on and , such that for any positive semidefinite matrix , we have
In particular, if , and if there is a constant such that the inequality
holds for all , then the inequality
holds for all .
The proof is given in Appendix A, and follows essentially from the Schur-Horn majorization theorem, as well as inequalities relating and .
3 Main results
In this section, we present our main results on Gaussian approximation and bootstrap approximation.
3.1 Gaussian approximation
Let be independent random vectors drawn from , and let
The Gaussian counterpart of the partially standardized statistic (1.5) is defined as
Our first theorem shows that in the presence of variance decay, the distribution can approximate at a nearly parametric rate in Kolmogorov distance. Recall that for any random variables and , this distance is given by .
Theorem 3.1 (Gaussian approximation).
As a basic observation, note that the result handles the ordinary max statistic as a special case with . In addition, it is especially notable that the rate does not depend on the dimension , or the variance decay parameter (provided that it is positive). In this sense, the result shows that even a small amount of structure can have a substantial impact of Gaussian approximation, in relation to existing rates that hold when . Lastly, the lower bound on is needed, because if quickly approaches 1 as , then the variances will also quickly approach 1, eliminating the beneficial effect of variance decay.
3.2 Multiplier bootstrap approximation
In order to define the multiplier bootstrap counterpart of , first let be independent vectors drawn from , where we define
and . Noting that each has mean 0, we let , and define the associated max statistic as
where . In the exceptional case when for some , the expression is understood to be 0. This convention is natural, because if , then almost surely.
The above description of differs from some previous works insofar as we have suppressed the role of “multiplier variables”, and have defined in terms of direct samples from . From a mathematical standpoint, this is equivalent to the multiplier formulation (Chernozhukov, Chetverikov and Kato, 2013), where and are independent random variables, conditionally on .
Theorem 3.2 (Bootstrap approximation).
Suppose the conditions of Theorem 3.1 hold, with the same choice of . Then, there is an absolute constant , and a constant depending only and , such that the event
occurs with probability at least .
Remarks on proofs
At a high level, the proofs of Theorems 3.1 and 3.2 are based on the following observation. When the variance decay condition holds, there is a relatively small subset of that is likely to contain the maximizing index for . In other words, if denotes a random index satisfying , then the “effective range” of is fairly small. Although this situation is quite intuitive when the decay parameter is large, what is more surprising is that the effect persists even for small values of .
Once the maximizing index has been localized to a small set, it becomes possible to use tools that are specialized to the regime where . For example, Bentkus’ multivariate Berry-Esseen theorem (Bentkus, 2003) (cf. Lemma E.1) is especially important in this regard. Another technical aspect of the proofs worth mentioning is that they make essential use of the sharp constants in Rosenthal’s inequality, as established in (Johnson, Schechtman and Zinn, 1985) (Lemma E.4).
4 Numerical illustration with functional data
Due to advances in technology and data collection, functional data have become ubiquitous in the past two decades, and statistical methods for their analysis have received growing interest. General references and surveys may be found in Ramsay and Silverman (2005); Ferraty and Vieu (2006); Horvath and Kokoszka (2012); Hsing and Eubank (2015); Wang, Chiou and Müller (2016).
The purpose of this section is to present an illustration, showing how the partially standardized statistic and the bootstrap can be used to do inference on functional data. More specifically, we consider a one-sample test for a mean function, which proceeds by constructing simultaneous confidence intervals (SCI) for its Fourier coefficients. With regard to our theoretical results, this is a natural problem for illustration, because the Fourier coefficients of functional data typically satisfy the variance decay condition (1.2) — as explained in the third example of Section 2.2. Additional background on mean testing for functional data may be found in Benko, Härdle and Kneip (2009); Degras (2011); Cao, Yang and Todem (2012); Horváth, Kokoszka and Reeder (2013); Zheng, Yang and Härdle (2014); Choi and Reimherr (2018); Zhang et al. (2018), as well as references therein.
4.1 Testing the mean function
To set the stage, let be a separable Hilbert space of functions, and let be a random function with mean . Given a sample of i.i.d. realizations of , a basic goal is to test
where is a fixed element in .
This testing problem can be naturally formulated in terms of SCI, as follows. Let denote any orthonormal basis for . Also, let and respectively denote the generalized Fourier coefficients of and with respect to , so that
Then, the null hypothesis is equivalent tofor all . To test this condition, one can construct a confidence interval for each , and reject the null if for at least one . In practice, due to infinite dimensionality, one will choose a sufficiently large integer , and reject the null if for at least one .
— which is equivalent to constructing SCI. In the CR approach, the basis is taken to be the eigenfunctionsof the covariance operator , and is chosen as the number of eigenfunctions required to explain a certain fraction (say 99%) of variance in the data. However, since is unknown, the these functions must be estimated.
When is large, estimating the eigenfunctions is a well-known challenge in functional data analysis. For instance, if the sample paths of are not sufficiently smooth, then a large number may be needed to explain most of the variance. Another example occurs when holds, but and are not well separated. If this is the case, then a large choice of may be needed in order to distinguish and . In our illustration below, we consider an alternative approach to constructing SCI that leverages the fact that the bootstrap method in Section 3 can accommodate large values of .
4.2 Applying the bootstrap
Let be any pre-specified orthonormal basis for . For instance, when , a commonly considered option is to let be the standard Fourier basis. Letting be as before, define a sample of vectors in according to
and note that . For simplicity, we retain the other notation associated with in previous sections, so that , and likewise for other quantities. In addition, for any , let
For a given significance level , the quantiles of and are denoted and . This implies the following event occurs with probability at least ,
which leads to theoretical SCI for .
We now apply the bootstrap from Section 3.2 to estimate and . Specifically, if we generate independent samples of as in (3.4), then we define to be the empirical quantile of the samples, and similarly for . In turn, the bootstrap SCI are defined by
for each .
It remains to select a value for , which can be done with the following simple rule. For each value of in a set of candidates , we construct the associated intervals as in (4.3). Then, we choose the value for which the average width is the smallest. (Note that .)
In Figure 1, we illustrate the influence of on the shape of SCI. There are two main points to notice: (1) The intervals change very gradually as a function of , which shows that partial standardization is a mild adjustment to ordinary standardization. (2) The choice of involves a tradeoff, which controls the “allocation of power” among the intervals. When is close to 1, the intervals are wider for the top coefficients (small ), and narrower for the bottom coefficients (large ). However, as decreases from 1, the widths of the intervals gradually become more uniform, and the intervals for the top coefficients become narrower. Hence, if the vectors and differ in the top coefficients, then choosing a smaller value of may lead to a gain in power. One last interesting point to mention is that in the simulations below, the selection rule of “minimizing the average width” typically selected values of around 0.8, and hence strictly less than 1.
4.3 Simulation settings
To study the numerical performance of the SCI described above, we generated i.i.d. samples from a Gaussian process on , with population mean function
indexed by parameters , where , and
denotes the Beta distribution function with shape parameters. This family of functions was considered in Chen and Müller (2012). To interpret the parameters, note that determines the shape of the mean function (see Figure 2), whereas and are scaling and shift parameters. In terms of these parameters, the null hypothesis corresponds to .
The population covariance function was taken to be the Matérn function
which was considered in CR, with being a modified Bessel function of the second kind. We set , which results in relatively rough sample paths, as illustrated in Figure 3. To understand how this covariance structure relates to Assumption 2.2, we can numerically verify that Assumption 2.2(i) is satisfied with and (see Figure 3). In addition, Assumption 2.2(ii) is satisfied with when , as well as when .
When implementing the bootstrap in Section 4.2, we always used the first functions from the standard Fourier basis on [0,1]. (In principle, an even larger value could have been selected, but we chose to limit computation time.) Meanwhile, we implemented the CR method using its accompanying R package fregion (Choi and Reimherr, 2016) under default settings, which typically estimated the first eigenfunctions of the covariance operator .
Results on type I error
The nominal significance level was set to
in all simulations. To assess the actual type I error, we carried out 5,000 simulations under the null hypothesis, for both of the casesand . When , the type I error was 6.7% for the bootstrap method, and 1.6% for CR. When , the results were 5.7% for the bootstrap method, and 2.6% for CR. So, in these cases, the bootstrap respects the nominal significance level relatively well.
Results on power
To consider power, we varied each of the parameters , and , one at a time, while keeping the other two at their baseline value of zero. In each parameter setting, we carried out 1,000 simulations with sample size . The results are summarized in Figure 4, showing that the bootstrap achieves relative gains in power — especially when and differ in shape () or scale (). Indeed, it seems that using a large number of basis functions can help to catch small differences in shape or scale near the endpoints of the domain [0,1] (see also Figure 2).
The main conclusion to draw from our work is that a modest amount of variance decay in a high-dimensional model can substantially enhance rates of bootstrap approximation for max statistics. In particular, there are three aspects of this type of model structure that are worth emphasizing. First, the variance decay condition (1.2) is very weak, in the sense that the decay parameter is allowed to be arbitrarily small. Second, the condition is approximately checkable in practice, since the variances can be accurately estimated when . Third, this type of structure arises naturally in a variety of contexts, such as in applications of PCA, as well as in the analysis of sparse count data, and functional data.
Beyond our main theoretical focus on rates of approximation, we have also shown that the technique of partial standardization leads to favorable numerical results at moderate sample sizes. Specifically, this was illustrated with an example from functional data analysis, where the inherent variance decay of Fourier coefficients can be leveraged. Finally, we note that this application to functional data is just one of many possible illustrations, and the adaptation of these ideas to other situations may provide some opportunities for future work.
Organization of appendices
In Appendix A we prove Proposition 2.1, and in Appendices B and C we prove Theorems 3.1 and 3.2 respectively. These proofs rely on numerous technical lemmas, which are stated and proved in Appendix D of the supplementary material. Lastly, in Appendix E of the supplementary material, we provide statements of background results that are used in the proofs.
General remarks and notation
It will simplify some of the proofs to make use of the fact that the metric is always bounded by 1, and therefore, it is sufficient to show that Theorems 3.1 and 3.2 hold for all large values of . (This is because a constant can be chosen large enough so that for finitely many values of .)
To fix some notation that will be used throughout the appendices, let , and define a generalized version of as
In particular, note that the statistic defined in equation (1.5) is the same as . Similarly, the Gaussian and bootstrap versions of , denoted and , are defined as
In addition, define the parameter
as well as the integer
where is the value fixed in Theorems 3.1 and 3.2. This integer always satisfies . Lastly, we will often use the fact that if a random variable satisfies for some absolute constant , then there is another absolute constant , such that for all (Vershynin, 2018, Proposition 2.7.1).
Appendix A Proof of Proposition 2.1
It is a standard fact that for any , the norm dominates norm, and so . Next, since is symmetric, the Schur-Horn Theorem implies that the vector is majorized by (Marshall, Olkin and Arnold, 2011, p.300). Furthermore, when , the function is Schur-convex on , which means that if is majorized by , then (Marshall, Olkin and Arnold, 2011, p.138). Hence,
Finally, if , then for any , the inequality
holds, where for . This bound may be derived as in (Johnstone, 2017, p.257),
which completes the proof. ∎
Appendix B Proof of Theorem 3.1
Consider the inequality
where we define
Below, we show that the term is at most of order in Proposition B.1. Later on, we establish a corresponding result for and in Proposition B.2. Taken together, these results complete the proof of Theorem 3.1.∎
Suppose the conditions of Theorem 3.1 hold, with the same choice of . Then, there is a constant depending only on such that
For ease of notation, we will write below. Let denote the projection onto the coordinates indexed by . This means that if is enumerated as , and if , then the th row of is the standard basis vector . Next, define the diagonal matrix . It follows that
In light of this relation, we will deal with the covariance matrix of the random vector , which is
Also, let denote the random vector with zero mean and identity covariance matrix such that
It is simple to check that any fixed , there is a Borel-measurable convex set such that . By the same reasoning, we also have , where
is the standard Gaussian distribution on. Therefore, the quantity satisfies the bound
where denotes the collection of all Borel-measurable convex subsets of .