Functional data are nowadays common in practice and have been extensively studied in the past decades. For a comprehensive treatment on the subject of functional data analysis, we recommend the monographs Ramsay and Silverman (2005) and Kokoszka and Reimherr (2017) for an introduction, Ferraty and Vieu (2006) for nonparametric functional data analysis, Hsing and Eubank (2015) from a theoretical perspective, and Horváth and Kokoszka (2012) and Zhang (2013) with a focus on statistical inference.
Functional linear models, which relate a response variable to a predictor variable linearly where at least one of the two is a function, play an important role in functional data analysis. A functional linear model (FLM), in its general form accommodating functional responses and/or functional predictors, can be described mathematically by
where , with and being two separable Hilbert spaces respectively endowed with inner products and , and , called the slope operator, is an unknown Hilbert–Schmidt operator between and . The variable , representing measurement errors, is assumed to be centered, of finite variance, and independent of . The following popular models are special cases of (1):
The scalar-on-function model, corresponding to for an interval , , and for some function ;
The function-on-function model, corresponding to , for some intervals , and for some function ;
The function-on-vector model, also known as the varying coefficient model, corresponding to for some positive integer , for some interval , and for some function .
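For concreteness, the three special cases can be sketched in generic notation (the symbols below are ours, since the original displays are not reproduced here; the paper's notation may differ):

```latex
% Scalar-on-function: Y real-valued, X a function on an interval I
Y = \int_{I} \beta(t)\, X(t)\, dt + \varepsilon
% Function-on-function: Y a function on I_2, X a function on I_1
Y(s) = \int_{I_1} \beta(s, t)\, X(t)\, dt + \varepsilon(s), \qquad s \in I_2
% Function-on-vector: Y a function on I, X = (X_1, \ldots, X_p)^\top a vector
Y(t) = \sum_{k=1}^{p} \beta_k(t)\, X_k + \varepsilon(t), \qquad t \in I
```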
These models have been investigated by, among many others, Cardot et al. (1999, 2003); Yao et al. (2005); Hall and Horowitz (2007); James et al. (2009); Yuan and Cai (2010); Zhou et al. (2013); Lin et al. (2017), with a focus on estimation of the slope operator in one of these models.
In practice it is also important to check whether the predictor influences the response in the postulated model (1); this corresponds to whether the slope operator is null and can be cast as the following hypothesis testing problem
This problem has been investigated in the literature, with more attention given to the scalar-on-function model. For example, among many others, Hilgert et al. (2013) proposed Fisher-type parametric tests with random projection onto empirical functional principal components by using multiple testing techniques, Lei (2014) introduced an exponential scan test by utilizing the estimator for proposed in Hall and Horowitz (2007) that is based on functional principal component analysis, Qu and Wang (2017) developed a generalized likelihood ratio test using smoothing splines, and Xue and Yao (2021), exploiting techniques developed for post-regularization inference, constructed a test for the case where there is an ultrahigh number of functional predictors. For the function-on-function model, Kokoszka et al. (2008) proposed a weighted test statistic based on functional principal component analysis. For the function-on-vector model, Shen and Faraway (2004); Zhang (2011); Smaga (2019) proposed functional F-tests, while Zhu et al. (2012) considered a wild bootstrap method.
In this paper, we develop a unified approach to the hypothesis testing problem (2) with the following features. First, it is constructed for the model (1) and thus accommodates all aforementioned functional linear models. In addition, by choosing a suitable space , the proposed method can handle models that simultaneously contain scalar predictors and functional predictors. This contrasts with existing test methods that consider only one type of functional linear model at a time. Second, the method enjoys relatively high power, especially for small samples with weak signals, by leveraging a bootstrap strategy developed in Lopes et al. (2020) and Lin et al. (2020)
for high-dimensional data with variance decay. In these endeavors, it is demonstrated that the strategy of bootstrapping a max statistic, proposed for testing high-dimensional mean vectors in a one-sample or multi-sample setting, also extends to mean functions in functional data analysis. While the extension to mean functions is relatively straightforward, applying this strategy to the slope operator is much more challenging, as the functional linear model (1) involves an ill-posed problem (Hall and Horowitz, 2007).
Our strategy is to represent the slope operator by an orthonormal basis derived from functional principal component regression (Hall and Horowitz, 2007) and then transform the test into testing nullity of the coefficients relative to the basis. In addition, we circumvent the challenge of ill-posedness by further transforming the test into an equivalent and well-defined test on a high-dimensional vector; see Section 2 for details. Each coordinate of the high-dimensional vector is the mean of a composition of (random) functional principal component scores whose variances by nature decay to zero at a certain rate, and consequently, the principle behind Lopes et al. (2020) and Lin et al. (2020) applies.
To partially accommodate the fact that the functional principal components forming the needed orthonormal basis are often unknown and practitioners may then choose a different basis, such as a fixed known basis or the basis formed by the empirical functional principal components, we establish the uniform validity and consistency of the proposed test for a class of bases. Consequently, our theoretical analyses are considerably more challenging than those in Lopes et al. (2020) and Lin et al. (2020), which consider only a fixed basis. For example, a key step in our analyses is to establish a non-trivial probabilistic upper bound for for a centered sub-Gaussian vector with dependent coordinates and for a ball with radius and divergent ; see Lemma LABEL:lemma-min-sigma-hat in the supplementary material for details.
The rest of the paper is organized as follows. We describe the proposed test in Section 2 and analyze its theoretical properties in Section 3. We then proceed to showcase its numerical performance via simulation studies in Section 4 and illustrate its applications in Section 5. We conclude the article with a remark in Section 6. All proofs are provided in the supplementary material.
Without loss of generality, we assume and in (1) are centered, i.e., and . Such an assumption, adopted also in Cai et al. (2006), is practically satisfied by replacing with and replacing with , where and . This simplifies the model (1) to
In addition, we assume and where and are norms induced by and respectively, so that the covariances of and exist. Our goal is to test (2) based on the independently and identically distributed (i.i.d.) realizations . We assume that and are fully observed when they are functions. This assumption is pragmatically satisfied when and
are observed on a dense grid of their defining domains, as the observations on the grid can be interpolated to form an accurate approximation to and . The case where and are only observed on a sparse grid is much more challenging and is left for future research.
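The interpolation step above can be sketched as follows (a minimal illustration with our own function and variable names; linear interpolation is one simple choice):

```python
import numpy as np

def to_common_grid(obs_times, obs_values, eval_grid):
    """Approximate a densely observed curve on a common evaluation grid
    by linear interpolation between the observed points."""
    return np.interp(eval_grid, obs_times, obs_values)

# Illustrative densely observed curve: sin(2*pi*t) sampled at 200 random times.
rng = np.random.default_rng(0)
eval_grid = np.linspace(0.0, 1.0, 101)
obs_times = np.sort(rng.uniform(0.0, 1.0, 200))
obs_values = np.sin(2 * np.pi * obs_times)
curve = to_common_grid(obs_times, obs_values, eval_grid)  # approximation on the grid
```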
, the tensor product operator is defined by
for all . The tensor product for is defined analogously. For example, if , then for , and if , is represented by the function , for . With the above notation, the covariance operator of a random element in the Hilbert space is given by . For example, if then and if then for and all .
By Mercer’s theorem, the operator admits the decomposition
are eigenvalues, are the corresponding eigenelements that are orthonormal, and is the dimension of ; for example, if and if . Similarly, the operator is decomposed by
with eigenvalues and the corresponding eigenelements . Without loss of generality, we assume form a complete orthonormal system (CONS) of and form a CONS of .
Let be the set of Hilbert–Schmidt operators from to (Definition 4.4.2, Hsing and Eubank, 2015), and note that . Since and are CONS’s, can be represented as
where each is the generalized Fourier coefficient. It turns out that the coefficients are linked to the cross-covariance operator . Specifically, with , we have the following connection between and ; special cases of this connection have been exploited for example by Cai et al. (2006); Hall and Horowitz (2007); Kokoszka et al. (2008).
Consequently, the null hypothesis in (2) is equivalent to for all and . The next key observation is that it is further equivalent to for all and , as the eigenvalues are assumed to be nonzero. This bypasses the difficulty of estimating these eigenvalues. In practice, we test for when , and similarly, for when , where and are integers that may grow with the sample size; when or , one may choose or , respectively. Specifically, with denoting the vector formed by for and , we pragmatically consider the following surrogate hypothesis testing problem
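As a small illustration of the transformed testing target, each coordinate of the vector is an empirical mean of products of projection scores (a sketch under assumed notation: the arrays `xi` and `eta` play the role of the scores of the predictor and response on the first few basis elements):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 100, 5, 4                   # sample size and truncation levels (illustrative)
xi = rng.normal(size=(n, p))          # scores of X on the first p basis elements
eta = rng.normal(size=(n, q))         # scores of Y on the first q basis elements

# Coordinate (j, k) of the pq-dimensional vector is the sample mean of
# xi_j * eta_k, an empirical version of the (j, k) cross-covariance of scores.
W = xi[:, :, None] * eta[:, None, :]  # shape (n, p, q): per-observation products
vec = W.reshape(n, p * q)             # flatten to an n x (pq) data matrix
mean_vec = vec.mean(axis=0)           # the high-dimensional mean vector to be tested
```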
To test the above hypothesis, a key observation is that
is the mean of the random variable and the variance of exhibits a decay pattern under some regularity conditions; see Section 3 for details. This motivates us to adapt the technique of partial standardization developed in Lopes et al. (2020); Lin et al. (2020). The basic idea is to construct a test statistic by considering the asymptotic distributions of the max statistic
and the min statistic
where is a tuning parameter, , denotes the th coordinate of with being the vector formed by for and , and with being the th coordinate of . Intuitively, has the same distribution as under and may be much larger than under ; similar intuition applies to the random quantity . This leads us to consider the following test statistics
where represents the th coordinate of , and , which is an estimate of , is the th diagonal element of . For a significance level , we may reject the null hypothesis if exceeds its quantile or is below its quantile.
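The partially standardized max and min statistics can be sketched as follows (a minimal illustration in our own notation; `tau` is the tuning parameter, and each coordinate is divided by its estimated standard deviation raised to the power `tau`):

```python
import numpy as np

def max_statistic(S, tau):
    """Partially standardized max statistic for data S of shape (n, d).
    tau = 0 leaves coordinates unstandardized; tau = 1 fully studentizes."""
    n = S.shape[0]
    mean = S.mean(axis=0)
    sigma = S.std(axis=0, ddof=1)      # estimated coordinate-wise std deviations
    return np.max(np.sqrt(n) * mean / sigma**tau)

def min_statistic(S, tau):
    n = S.shape[0]
    mean = S.mean(axis=0)
    sigma = S.std(axis=0, ddof=1)
    return np.min(np.sqrt(n) * mean / sigma**tau)

rng = np.random.default_rng(0)
S = rng.normal(size=(200, 50))         # null data: all coordinate means are zero
M, L = max_statistic(S, 0.5), min_statistic(S, 0.5)
```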
It remains to estimate the quantiles of and for any given under the null hypothesis, for which we adopt a bootstrap strategy, as follows. Let be drawn from the distribution conditional on the data, where denotes the
-dimensional centered Gaussian distribution with the covariance matrix. Then, the bootstrap counterparts of and are given by
respectively. Intuitively, the distribution of provides an approximation to the distribution of when the sample size is sufficiently large, while the distribution of acts as a surrogate for the distribution of under ; we justify this intuition in Theorems 1 and 2. In particular, the distribution of , and consequently the quantiles of , can be practically computed via resampling from the distribution . Specifically, for a sufficiently large integer , e.g., , for each , we independently draw and compute and . The quantile of and the quantile of are then respectively estimated by the empirical quantile of and the empirical quantile of . Finally, we reject the null hypothesis if the max statistic exceeds the estimated quantile or the min statistic falls below the estimated quantile.
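A minimal sketch of this Gaussian-multiplier bootstrap (our own names; `B = 1000` draws and the use of the full sample covariance are illustrative choices):

```python
import numpy as np

def bootstrap_quantiles(S, tau, alpha=0.05, B=1000, seed=0):
    """Estimate the null quantiles of the partially standardized max and min
    statistics by resampling centered Gaussian vectors whose covariance is
    the sample covariance of the data S (shape n x d)."""
    rng = np.random.default_rng(seed)
    n, d = S.shape
    Sc = S - S.mean(axis=0)
    cov = Sc.T @ Sc / (n - 1)                 # sample covariance matrix
    sigma = np.sqrt(np.diag(cov))             # estimated coordinate std deviations
    Z = rng.multivariate_normal(np.zeros(d), cov, size=B)
    M = np.max(Z / sigma**tau, axis=1)        # bootstrap max statistics
    L = np.min(Z / sigma**tau, axis=1)        # bootstrap min statistics
    return np.quantile(M, 1 - alpha), np.quantile(L, alpha)

rng = np.random.default_rng(1)
S = rng.normal(size=(200, 10))
upper, lower = bootstrap_quantiles(S, tau=0.5)
```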
In practice, the eigenelements and are unknown. To test (7), we may then choose to use some fixed known orthonormal sequences and , such as the standard Fourier basis involving the and functions. Alternatively, we may estimate and from data. For example, is estimated by the eigenelement corresponding to the th eigenvalue of the sample covariance operator , and similarly, is estimated by the eigenelement corresponding to the th eigenvalue of .
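The empirical eigenelements can be obtained by eigen-decomposing a discretization of the sample covariance operator; a sketch for curves observed on a common uniform grid (the Riemann-sum quadrature weight and all names are our own choices):

```python
import numpy as np

def empirical_fpca(X, n_comp):
    """Estimate the leading eigenvalues/eigenfunctions from curves X of shape
    (n, m), observed on a common uniform grid of m points on [0, 1]."""
    n, m = X.shape
    w = 1.0 / m                              # quadrature weight for the L2 inner product
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / n                        # pointwise sample covariance on the grid
    vals, vecs = np.linalg.eigh(C * w)       # discretized covariance operator
    idx = np.argsort(vals)[::-1][:n_comp]    # largest eigenvalues first
    return vals[idx], vecs[:, idx].T / np.sqrt(w)   # rescale to unit L2 norm

# Illustrative data: two sine components with decaying score variances.
rng = np.random.default_rng(2)
grid = np.linspace(0.0, 1.0, 50)
scores = rng.normal(size=(100, 2)) * np.array([2.0, 1.0])
basis = np.vstack([np.sqrt(2) * np.sin(np.pi * grid),
                   np.sqrt(2) * np.sin(2 * np.pi * grid)])
eigvals, eigfuns = empirical_fpca(scores @ basis, 2)
```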
The tuning parameter , controlling the degree of partial standardization in (8) and potentially varying with the sample size, is the key to exploiting the decaying variances of the coordinates of . This may be better understood from the perspective of simultaneous confidence intervals (SCI) for hypothesis testing. Based on the distributions of and , one can also construct an SCI for each coordinate of , which for the th coordinate is empirically given by for a significance level . As discussed in Lin et al. (2020), in the extreme case that , all SCIs have the same width, which counters our intuition that the width of the SCI for each coordinate should adapt to the variance of that coordinate, while in the case that , all coordinates in (8) have the same variance, which eliminates the "low-dimensional" structure arising from the decaying variances. In practice, the value of can be tuned to maximize the power of the proposed test by the method described in Lin et al. (2020).
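A small numerical illustration of this trade-off (the critical value `c`, the standard deviations, and the sample size below are all illustrative): the SCI half-width for coordinate j scales as c * sigma_j**tau / sqrt(n).

```python
import numpy as np

sigma = np.array([1.0, 0.5, 0.25, 0.1])    # decaying coordinate std deviations
c, n = 2.5, 100                            # common bootstrap critical value, sample size

widths = {tau: c * sigma**tau / np.sqrt(n) for tau in (0.0, 0.5, 1.0)}
# tau = 0: every SCI has the same width, ignoring the variance decay;
# tau = 1: widths are proportional to sigma, erasing the decay structure;
# intermediate tau interpolates between the two extremes.
```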
We begin by introducing some notation. The symbol denotes the set of natural numbers, and denotes the set of sequences that are square summable. For a matrix , we write where is the element of at position . For a random variable and an integer , the -Orlicz norm is . We also use to denote the distribution of and define the Kolmogorov distance between random variables and by . For two sequences and with non-negative elements, means as , and means for some constant and all sufficiently large . Moreover, we write if , write if , and write if and . Also, define and .
Let with and . Our first assumption is on the tail behavior of ; a similar assumption appears in equation (4.24) of Vershynin (2018).
Assumption 1 (Tail behavior).
The random vector is sub-Gaussian in the sense of
for any vector and a constant , where is the canonical inner product in and is the covariance operator of .
To state the next assumption, for any , let denote the set of indices corresponding to the largest values among
which are the standard deviations of the elements in , i.e., . In addition, let denote the correlation matrix of random variables . Lastly, let be a constant fixed with respect to , and define the integers
Assumption 2 (Structural assumptions).
The eigenvalues for and for are positive, and there are constants and , not depending on , such that
where refers to the decreasingly ordered entries of .
There is a constant , not depending on , such that
where denotes the element of at the position. Also, the matrix with the entry given by is positive semi-definite, and there is a constant not depending on such that
The requirement of and in condition (i) of the above assumption ensures certain smoothness of the covariance functions of and , respectively; such a requirement is also adopted in Cai et al. (2018) and is connected to the so-called Sacks–Ylvisaker condition (Ritter et al., 1995; Yuan et al., 2010). For a scalar-on-function model, , and thus the requirement for in (10) and the condition on the generalized Fourier coefficients of in (11) are automatically satisfied. In addition, (11) is considerably weaker than the requirement in Cai et al. (2006); Hall and Horowitz (2007) for the scalar-on-function regression model.
Our last assumption imposes some conditions on the growth rates of and relative to , where we recall that and may vary with .
for an arbitrarily small but fixed , where and . In addition, if and if . Moreover, when .
For the scalar-on-function model, the above assumption requires which is only asymptotically slightly less than the rate for in Hall and Horowitz (2007), by noting that can be made arbitrarily small and and are asymptotically less than for any (small) .
As mentioned previously, the eigenelements and are often unknown, and practitioners may use alternative orthonormal elements which may differ from , and similarly use orthonormal elements in place of . In this case, all quantities depending on and , such as and , will be computed by using and . We write, for example, and , to indicate the dependence on and , and note that and .
Given two orthonormal sequences and , define
Let , where denotes the Kronecker product of two matrices. Consider a class of with and , such that 1) and are respectively orthonormal sequences, 2) and , and 3) with , where and are defined in the above assumptions, and is the identity matrix. The condition enforces that the variances of the coordinates of exhibit a decay pattern similar to that of the variances of . The class also corresponds to a class of test statistics and . Below we analyze the uniform asymptotic power and size over this class of test statistics; the asymptotic properties of the proposed test by using and then follow as direct consequences, since the class contains . In addition, we consider only the max statistic while noting that similar results hold for the min statistic.
For , we define the Gaussian counterparts of and by
The following result shows that the distribution of converges to the distribution at a near rate.
In the proposed test, a bootstrap strategy is used to estimate the distribution of , which is justified by the following result.
Theorem 2 (Bootstrap approximation).
In reality, the variances are estimated by , and the max statistic is pragmatically computed by
Below we show that the distribution of this practical max statistic converges to the distribution of the original max statistic.
for some constant not depending on . Consequently, the probability that the null hypothesis is rejected by using at the significance level when the null hypothesis in (2) is true, denoted by , is asymptotically controlled at the rate uniformly on .
Finally, below we establish the consistency of the proposed test.
for any fixed , one has
with probability at least , where is a constant not depending on , and
for some constant not depending on , one has
Consequently, if , then the null hypothesis in (2) will be rejected uniformly on with probability tending to one, that is,
4 Simulation Studies
To illustrate the numerical performance of the proposed method, we consider three families of models, namely, scalar-on-function, function-on-function and function-on-vector models. For each family, we consider various settings; see below for details. In all settings, is computed from (1) with .
For each setting, we consider different sample sizes, namely, and , to investigate the impact of on the power of a test. For the proposed test procedure, we set when and/or are functions, in the function-on-vector family, and in the scalar-on-function family. The tuning parameter is selected by the method described in Lin et al. (2020). Finally, we independently perform replications for each setting, based on which we compute the empirical size as the proportion of rejections among the replications when the null hypothesis is true and compute the empirical power as the proportion of rejections when the alternative hypothesis is true. In all settings, the significance level is .
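The empirical-size computation amounts to a rejection proportion over independent null replications, as in the following sketch (for brevity, a simple max-|z| test with a Bonferroni cutoff stands in for the bootstrap test; `n_rep`, `n`, and `d` are illustrative, not the paper's settings):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
alpha, n, d, n_rep = 0.05, 100, 10, 500
cutoff = norm.ppf(1 - alpha / (2 * d))     # two-sided Bonferroni critical value

rejections = 0
for _ in range(n_rep):
    S = rng.normal(size=(n, d))            # null data: all coordinate means are zero
    t = np.sqrt(n) * S.mean(axis=0) / S.std(axis=0, ddof=1)
    rejections += np.max(np.abs(t)) > cutoff
empirical_size = rejections / n_rep        # should be close to (at most) alpha
```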
We consider two settings for the functional predictor . In the first setting, is a centered Gaussian process with the following Matérn covariance function
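The standard Matérn covariance can be evaluated as follows (a sketch using the textbook parameterization with variance, range, and smoothness parameters; the parameter values are illustrative and the simulation settings in the text may differ):

```python
import numpy as np
from scipy.special import gamma, kv

def matern_cov(s, t, sigma2=1.0, rho=0.5, nu=1.5):
    """Matérn covariance C(s, t) = sigma2 * 2^{1-nu} / Gamma(nu) * x^nu * K_nu(x),
    where x = sqrt(2*nu) * |s - t| / rho and K_nu is the modified Bessel
    function of the second kind."""
    d = abs(s - t)
    if d == 0.0:
        return sigma2                      # K_nu diverges at 0; the limit is sigma2
    x = np.sqrt(2 * nu) * d / rho
    return sigma2 * 2 ** (1 - nu) / gamma(nu) * x ** nu * kv(nu, x)
```

For nu = 1.5 this reduces to the closed form sigma2 * (1 + x) * exp(-x), which provides a quick sanity check.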