# A Unified Approach to Hypothesis Testing for Functional Linear Models

We develop a unified approach to hypothesis testing for various types of widely used functional linear models, such as scalar-on-function, function-on-function and function-on-scalar models. In addition, the proposed test applies to models of mixed types, such as models with both functional and scalar predictors. In contrast with most existing methods that rest on the large-sample distributions of test statistics, the proposed method leverages the technique of bootstrapping max statistics and exploits the variance decay property that is an inherent feature of functional data, to improve the empirical power of tests especially when the sample size is limited and the signal is relatively weak. Theoretical guarantees on the validity and consistency of the proposed test are provided uniformly for a class of test statistics.


## 1 Introduction

Functional data are nowadays common in practice and have been extensively studied in the past decades. For a comprehensive treatment on the subject of functional data analysis, we recommend the monographs Ramsay and Silverman (2005) and Kokoszka and Reimherr (2017) for an introduction, Ferraty and Vieu (2006) for nonparametric functional data analysis, Hsing and Eubank (2015) from a theoretical perspective, and Horváth and Kokoszka (2012) and Zhang (2013) with a focus on statistical inference.

Functional linear models, which pair a response variable with a predictor variable in a linear way, where at least one of the two is a function, play an important role in functional data analysis. A functional linear model (FLM), in its general form accommodating functional responses and/or functional predictors, can be mathematically described by

$$Y-\mathbb{E}Y=\beta(X-\mathbb{E}X)+Z, \tag{1}$$

where $X\in\mathcal{H}_1$ and $Y\in\mathcal{H}_2$, with $\mathcal{H}_1$ and $\mathcal{H}_2$ being two separable Hilbert spaces respectively endowed with inner products $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$, and $\beta$, called the slope operator, is an unknown Hilbert–Schmidt operator between $\mathcal{H}_1$ and $\mathcal{H}_2$. The variable $Z$, representing measurement errors, is assumed to be centered, of finite variance, and independent of $X$. The following popular models are special cases of (1):

• The scalar-on-function model, corresponding to $\mathcal{H}_1=L^2(\mathcal{I})$ for an interval $\mathcal{I}$, $\mathcal{H}_2=\mathbb{R}$, and $\beta(x)=\int_{\mathcal{I}}b(t)x(t)\,\mathrm{d}t$ for some function $b\in L^2(\mathcal{I})$;

• The function-on-function model, corresponding to $\mathcal{H}_1=L^2(\mathcal{I}_1)$ and $\mathcal{H}_2=L^2(\mathcal{I}_2)$ for some intervals $\mathcal{I}_1$ and $\mathcal{I}_2$, and $(\beta x)(s)=\int_{\mathcal{I}_1}b(s,t)x(t)\,\mathrm{d}t$ for some function $b\in L^2(\mathcal{I}_2\times\mathcal{I}_1)$;

• The function-on-vector model, also known as the varying coefficient model, corresponding to $\mathcal{H}_1=\mathbb{R}^q$ for some positive integer $q$, $\mathcal{H}_2=L^2(\mathcal{I})$ for some interval $\mathcal{I}$, and $(\beta x)(t)=b(t)^{\top}x$ for some function $b:\mathcal{I}\to\mathbb{R}^q$.

These models have been investigated by, among many others, Cardot et al. (1999, 2003); Yao et al. (2005); Hall and Horowitz (2007); James et al. (2009); Yuan and Cai (2010); Zhou et al. (2013); Lin et al. (2017), typically with a focus on estimation of the slope operator.

Practically, it is also important to check whether the predictor has an influence on the response in the postulated model (1), which corresponds to whether the slope operator is null and can be cast as the following hypothesis testing problem:

$$H_0:\ \beta=0\quad\text{versus}\quad H_a:\ \beta\neq 0. \tag{2}$$

This problem has been investigated in the literature, with more attention given to the scalar-on-function model. For example, among many others, Hilgert et al. (2013) proposed Fisher-type parametric tests with random projection to empirical functional principal components by using multiple testing techniques; Lei (2014) introduced an exponential scan test by utilizing the estimator of the slope proposed in Hall and Horowitz (2007), which is based on functional principal component analysis; Qu and Wang (2017) developed a generalized likelihood ratio test using smoothing splines; and Xue and Yao (2021), exploiting the techniques developed for post-regularization inference, constructed a test for the case in which there is an ultrahigh number of functional predictors. For the function-on-function model, Kokoszka et al. (2008) proposed a weighted test statistic based on functional principal component analysis. For the function-on-vector model, Shen and Faraway (2004); Zhang (2011); Smaga (2019) proposed functional F-tests, while Zhu et al. (2012) considered a wild bootstrap method.

In this paper, we develop a unified approach to the hypothesis testing problem (2) with the following features. First, it is constructed for the model (1) and thus accommodates all aforementioned functional linear models. In addition, by choosing a suitable space $\mathcal{H}_1$, the proposed method can handle models that simultaneously contain scalar predictors and functional predictors. This contrasts with existing test methods, which consider only one type of functional linear model at a time. Second, the method enjoys relatively high power, especially for small samples with weak signals, by leveraging a bootstrap strategy developed in Lopes et al. (2020) and Lin et al. (2020) for high-dimensional data with variance decay. In these endeavors, it is demonstrated that the strategy of bootstrapping a max statistic, proposed for testing high-dimensional mean vectors in a one-sample or multiple-sample setting, also extends to mean functions in functional data analysis. While the extension to mean functions is relatively straightforward, application of this strategy to the slope operator is much more challenging, as the functional linear model (1) involves an ill-posed problem (Hall and Horowitz, 2007).

Our strategy is to represent the slope operator by an orthonormal basis derived from functional principal component regression (Hall and Horowitz, 2007) and then transform the test into testing nullity of the coefficients relative to the basis. In addition, we circumvent the challenge of ill-posedness by further transforming the test into an equivalent and well-defined test on a high-dimensional vector; see Section 2 for details. Each coordinate of the high-dimensional vector is the mean of a product of (random) functional principal component scores whose variances by nature decay to zero at a certain rate, and consequently, the principle behind Lopes et al. (2020) and Lin et al. (2020) applies.

To partially accommodate the fact that the functional principal components forming the needed orthonormal basis are often unknown, and that practitioners may then choose a different basis, such as a fixed known basis or the basis formed by the empirical functional principal components, we establish the uniform validity and consistency of the proposed test for a class of bases. Consequently, our theoretical analyses are considerably more challenging than those in Lopes et al. (2020) and Lin et al. (2020), which consider only a fixed basis. For example, a key step in our analyses is to establish a non-trivial probabilistic upper bound involving a centered sub-Gaussian vector with dependent coordinates over a ball whose radius diverges; see Lemma LABEL:lemma-min-sigma-hat in the supplementary material for details.

The rest of the paper is organized as follows. We describe the proposed test in Section 2 and analyze its theoretical properties in Section 3. We then proceed to showcase its numerical performance via simulation studies in Section 4 and illustrate its applications in Section 5. We conclude the article with a remark in Section 6. All proofs are provided in the supplementary material.

## 2 Methodology

Without loss of generality, we assume $X$ and $Y$ in (1) are centered, i.e., $\mathbb{E}X=0$ and $\mathbb{E}Y=0$. Such an assumption, adopted also in Cai et al. (2006), is practically satisfied by replacing $X_i$ with $X_i-\bar{X}$ and replacing $Y_i$ with $Y_i-\bar{Y}$, where $\bar{X}=n^{-1}\sum_{i=1}^{n}X_i$ and $\bar{Y}=n^{-1}\sum_{i=1}^{n}Y_i$. This simplifies the model (1) to

$$Y=\beta(X)+Z. \tag{3}$$

In addition, we assume $\mathbb{E}\|X\|_1^2<\infty$ and $\mathbb{E}\|Y\|_2^2<\infty$, where $\|\cdot\|_1$ and $\|\cdot\|_2$ are the norms induced by $\langle\cdot,\cdot\rangle_1$ and $\langle\cdot,\cdot\rangle_2$, respectively, so that the covariances of $X$ and $Y$ exist. Our goal is to test (2) based on independently and identically distributed (i.i.d.) realizations $(X_1,Y_1),\ldots,(X_n,Y_n)$. We assume that $X$ and $Y$ are fully observed when they are functions. This assumption is pragmatically satisfied when $X$ and $Y$ are observed on a dense grid of their defining domains, as the observations on the grid can be interpolated to form an accurate approximation to $X$ and $Y$. The case in which $X$ and $Y$ are observed only on a sparse grid is much more challenging and is left for future research.

For $x\in\mathcal{H}_1$ and $y\in\mathcal{H}_2$, the tensor product operator $x\otimes y:\mathcal{H}_1\to\mathcal{H}_2$ is defined by

$$(x\otimes y)z=\langle x,z\rangle_1\,y$$

for all $z\in\mathcal{H}_1$. The tensor product for $x,y\in\mathcal{H}_2$ is defined analogously. For example, if $\mathcal{H}_1=\mathbb{R}^q$, then $(x\otimes y)z=(x^{\top}z)y$ for $z\in\mathbb{R}^q$, and if $x$ and $y$ are elements of $L^2$ spaces, then $x\otimes y$ is represented by the kernel $y(s)x(t)$. With the above notation, the covariance operator of a random element $W$ in a Hilbert space $\mathcal{H}$ is given by $C_W=\mathbb{E}(W\otimes W)$. For example, if $\mathcal{H}=\mathbb{R}^q$ then $C_W=\mathbb{E}(WW^{\top})$, and if $\mathcal{H}=L^2(\mathcal{I})$ then $(C_Wx)(s)=\int_{\mathcal{I}}\mathbb{E}\{W(s)W(t)\}x(t)\,\mathrm{d}t$ for $x\in L^2(\mathcal{I})$ and all $s\in\mathcal{I}$.

By Mercer’s theorem, the operator $C_X$ admits the decomposition

$$C_X=\sum_{j_1=1}^{d_X}\lambda_{j_1}\phi_{j_1}\otimes\phi_{j_1}, \tag{4}$$

where $\lambda_1\geq\lambda_2\geq\cdots$ are eigenvalues, $\phi_1,\phi_2,\ldots$ are the corresponding eigenelements that are orthonormal, and $d_X$ is the dimension of $\mathcal{H}_1$; for example, $d_X=q$ if $\mathcal{H}_1=\mathbb{R}^q$ and $d_X=\infty$ if $\mathcal{H}_1=L^2(\mathcal{I})$. Similarly, the operator $C_Y$ is decomposed as

$$C_Y=\sum_{j_2=1}^{d_Y}\rho_{j_2}\psi_{j_2}\otimes\psi_{j_2}, \tag{5}$$

with eigenvalues $\rho_1\geq\rho_2\geq\cdots$ and the corresponding eigenelements $\psi_1,\psi_2,\ldots$. Without loss of generality, we assume $\{\phi_{j_1}\}$ form a complete orthonormal system (CONS) of $\mathcal{H}_1$ and $\{\psi_{j_2}\}$ form a CONS of $\mathcal{H}_2$.
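When $X$ is observed on a dense grid, the decomposition (4) can be approximated empirically by an eigen-decomposition of the discretized sample covariance. The following numpy sketch is illustrative only and not part of the paper's formal development; the function name `fpca` and the uniform-grid assumption are ours.

```python
import numpy as np

def fpca(X_obs, grid):
    """Empirical eigenpairs of the sample covariance operator C_X.

    X_obs : (n, m) array, n curves observed on a uniform grid of m points.
    Returns eigenvalues (descending) and eigenfunctions (columns),
    rescaled to be orthonormal in L^2 with respect to the grid spacing.
    """
    n, m = X_obs.shape
    w = grid[1] - grid[0]                  # uniform grid spacing
    Xc = X_obs - X_obs.mean(axis=0)        # center the curves
    C = Xc.T @ Xc / n                      # pointwise sample covariance
    evals, evecs = np.linalg.eigh(C * w)   # discretized covariance operator
    idx = np.argsort(evals)[::-1]          # sort eigenvalues descending
    return evals[idx], evecs[:, idx] / np.sqrt(w)
```

Dividing the eigenvectors by the square root of the grid spacing converts Euclidean orthonormality into (Riemann-sum) $L^2$ orthonormality.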

Let $\mathrm{HS}(\mathcal{H}_1,\mathcal{H}_2)$ be the set of Hilbert–Schmidt operators from $\mathcal{H}_1$ to $\mathcal{H}_2$ (Definition 4.4.2, Hsing and Eubank, 2015), and note that $\beta\in\mathrm{HS}(\mathcal{H}_1,\mathcal{H}_2)$. Since $\{\phi_{j_1}\}$ and $\{\psi_{j_2}\}$ are CONS’s, $\beta$ can be represented as

$$\beta=\sum_{j_1=1}^{d_X}\sum_{j_2=1}^{d_Y}b_{j_1j_2}\,\phi_{j_1}\otimes\psi_{j_2}, \tag{6}$$

where each $b_{j_1j_2}$ is a generalized Fourier coefficient. It turns out that these coefficients are linked to the cross-covariance operator $C_{XY}=\mathbb{E}(X\otimes Y)$. Specifically, we have the following connection between $b_{j_1j_2}$ and $C_{XY}$; special cases of this connection have been exploited, for example, by Cai et al. (2006); Hall and Horowitz (2007); Kokoszka et al. (2008).

###### Proposition 1.

$C_{XY}=\sum_{j_1=1}^{d_X}\sum_{j_2=1}^{d_Y}\lambda_{j_1}b_{j_1j_2}\,\phi_{j_1}\otimes\psi_{j_2}$ and $\mathbb{E}(\xi_{j_1}\eta_{j_2})=\lambda_{j_1}b_{j_1j_2}$ for all $j_1$ and $j_2$, where $\xi_{j_1}=\langle X,\phi_{j_1}\rangle_1$ and $\eta_{j_2}=\langle Y,\psi_{j_2}\rangle_2$.

Consequently, the null hypothesis in (2) is equivalent to $b_{j_1j_2}=0$ for all $j_1$ and $j_2$. The next key observation is that this is further equivalent to $\lambda_{j_1}b_{j_1j_2}=0$ for all $j_1$ and $j_2$, as the eigenvalues are assumed to be nonzero. This bypasses the difficulty of estimating these eigenvalues. In practice, we test $\lambda_{j_1}b_{j_1j_2}=0$ for $1\leq j_1\leq p_1$ when $d_X=\infty$, and similarly for $1\leq j_2\leq p_2$ when $d_Y=\infty$, where $p_1$ and $p_2$ are integers that may grow with the sample size; when $d_X<\infty$ or $d_Y<\infty$, one may choose $p_1=d_X$ or $p_2=d_Y$, respectively. Specifically, with $\nu$ denoting the vector formed by $\lambda_{j_1}b_{j_1j_2}$ for $1\leq j_1\leq p_1$ and $1\leq j_2\leq p_2$, we pragmatically consider the following surrogate hypothesis testing problem:

$$H_0:\ \nu=0\quad\text{versus}\quad H_a:\ \nu\neq 0. \tag{7}$$
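Under the surrogate formulation (7), the test reduces to testing whether a high-dimensional mean vector is zero. As an illustration only (the helper name `score_products` and the array layout are our own), the vector of score products can be formed as follows, assuming the predictor and response scores have already been computed:

```python
import numpy as np

def score_products(xi, eta):
    """Vectorize products of predictor and response FPC scores.

    xi  : (n, p1) predictor scores  <X_i, phi_{j1}>
    eta : (n, p2) response scores   <Y_i, psi_{j2}>
    Returns V of shape (n, p1*p2); the surrogate null is H0: E[V] = 0.
    """
    n = xi.shape[0]
    # Row i stacks xi[i, j1] * eta[i, j2] over all (j1, j2) pairs.
    return (xi[:, :, None] * eta[:, None, :]).reshape(n, -1)
```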

To test the above hypothesis, a key observation is that each coordinate of $\nu$ is the mean of the random variable $V_j=\xi_{j_1}\eta_{j_2}$, where $\xi_{j_1}=\langle X,\phi_{j_1}\rangle_1$ and $\eta_{j_2}=\langle Y,\psi_{j_2}\rangle_2$, and that the variance of $V_j$ exhibits a decay pattern under some regularity conditions; see Section 3 for details. This motivates us to adapt the technique of partial standardization developed in Lopes et al. (2020); Lin et al. (2020). The basic idea is to construct a test statistic by considering the asymptotic distributions of the max statistic

$$M=\max_{1\leq j\leq p}\frac{S_{n,j}}{\sigma_j^{\tau}} \tag{8}$$

and the min statistic

$$L=\min_{1\leq j\leq p}\frac{S_{n,j}}{\sigma_j^{\tau}},$$

where $\tau\in[0,1]$ is a tuning parameter, $p=p_1p_2$, $S_{n,j}$ denotes the $j$th coordinate of $S_n=\sqrt{n}(\bar{V}-\nu)$ with $\bar{V}=n^{-1}\sum_{i=1}^{n}V_i$ and $V_i$ being the vector formed by $\xi_{i,j_1}\eta_{i,j_2}$ for $1\leq j_1\leq p_1$ and $1\leq j_2\leq p_2$, and $\sigma_j^2=\mathrm{var}(V_{i,j})$ with $V_{i,j}$ being the $j$th coordinate of $V_i$. Intuitively, under $H_0$ we have $\sqrt{n}\,\bar{V}_j=S_{n,j}$, so the statistic $T_U$ below behaves like $M$ under $H_0$ and may be much larger under $H_a$; similar intuition applies to the random quantity $T_L$. This leads us to consider the following test statistics:

$$T_U=\max_{1\leq j\leq p}\frac{\sqrt{n}\,\bar{V}_j}{\hat{\sigma}_j^{\tau}}\quad\text{and}\quad T_L=\min_{1\leq j\leq p}\frac{\sqrt{n}\,\bar{V}_j}{\hat{\sigma}_j^{\tau}},$$

where $\bar{V}_j$ represents the $j$th coordinate of $\bar{V}$, and $\hat{\sigma}_j^2$, which is an estimate of $\sigma_j^2$, is the $j$th diagonal element of the sample covariance matrix of $V_1,\ldots,V_n$. For a significance level $\varrho$, we may reject the null hypothesis if $T_U$ exceeds its $1-\varrho/2$ quantile or $T_L$ is below its $\varrho/2$ quantile.
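A minimal sketch of computing the partially standardized statistics, assuming the rows of `V` hold the score-product vectors; the function name and the use of the sample standard deviation are our own illustrative choices:

```python
import numpy as np

def max_min_stats(V, tau):
    """Partially standardized max and min statistics T_U, T_L.

    V   : (n, p) array whose rows are the score-product vectors;
          the surrogate null is H0: E[V] = 0.
    tau : partial-standardization exponent in [0, 1].
    """
    n, p = V.shape
    Vbar = V.mean(axis=0)
    sigma_hat = V.std(axis=0, ddof=1)   # coordinate-wise sd estimate
    scale = sigma_hat ** tau            # partial standardization
    T = np.sqrt(n) * Vbar / scale
    return T.max(), T.min()
```

Setting `tau=0` leaves the coordinates unstandardized, while `tau=1` gives full studentization; intermediate values interpolate between the two.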

It remains to estimate the quantiles of $M$ and $L$ under the null hypothesis, for which we adopt the following bootstrap strategy. Let $S_n^{\star}$ be drawn from the distribution $N(0,\hat{\Sigma})$ conditional on the data, where $N(0,\hat{\Sigma})$ denotes the $p$-dimensional centered Gaussian distribution with covariance matrix $\hat{\Sigma}$, the sample covariance matrix of $V_1,\ldots,V_n$. Then, the bootstrap counterparts of $M$ and $L$ are given by

$$M^{\star}=\max_{1\leq j\leq p}\frac{S_{n,j}^{\star}}{\hat{\sigma}_j^{\tau}}\quad\text{and}\quad L^{\star}=\min_{1\leq j\leq p}\frac{S_{n,j}^{\star}}{\hat{\sigma}_j^{\tau}},$$

respectively. Intuitively, the distribution of $M^{\star}$ provides an approximation to the distribution of $M$ when the sample size is sufficiently large, while the distribution of $M$ acts as a surrogate for the distribution of $T_U$ under $H_0$; we justify this intuition in Theorems 1 and 2. In particular, the distribution of $M^{\star}$, and consequently the quantiles of $M^{\star}$, can be practically computed via resampling from the distribution $N(0,\hat{\Sigma})$. Specifically, for a sufficiently large integer $B$, for each $b=1,\ldots,B$, we independently draw $S_n^{\star(b)}\sim N(0,\hat{\Sigma})$ and compute the corresponding $M^{\star(b)}$ and $L^{\star(b)}$. The $1-\varrho/2$ quantile of $M$ and the $\varrho/2$ quantile of $L$ are then respectively estimated by the empirical $1-\varrho/2$ quantile $\hat{q}_M(1-\varrho/2)$ of $M^{\star(1)},\ldots,M^{\star(B)}$ and the empirical $\varrho/2$ quantile $\hat{q}_L(\varrho/2)$ of $L^{\star(1)},\ldots,L^{\star(B)}$. Finally, we

$$\text{reject the null hypothesis if } T_U>\hat{q}_M(1-\varrho/2)\ \text{ or }\ T_L<\hat{q}_L(\varrho/2).$$
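The bootstrap step above can be sketched as follows; this is an illustrative implementation under our own naming, drawing from $N(0,\hat{\Sigma})$ and returning the two empirical quantiles used in the rejection rule:

```python
import numpy as np

def bootstrap_quantiles(V, tau, alpha, B=1000, seed=0):
    """Bootstrap the null distributions of the max/min statistics.

    Draws S* ~ N(0, Sigma_hat) conditional on the data, where Sigma_hat
    is the sample covariance of the rows of V, and returns the empirical
    (1 - alpha/2) quantile of M* and the alpha/2 quantile of L*.
    """
    rng = np.random.default_rng(seed)
    n, p = V.shape
    Sigma = np.cov(V, rowvar=False)         # sample covariance of V
    scale = np.sqrt(np.diag(Sigma)) ** tau  # partial standardization
    S = rng.multivariate_normal(np.zeros(p), Sigma, size=B)
    M = (S / scale).max(axis=1)             # bootstrap max statistics
    L = (S / scale).min(axis=1)             # bootstrap min statistics
    return np.quantile(M, 1 - alpha / 2), np.quantile(L, alpha / 2)
```

One would then reject the null when `TU` exceeds the first returned quantile or `TL` falls below the second.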

In practice, the eigenelements $\phi_{j_1}$ and $\psi_{j_2}$ are unknown. To test (7), we may then choose to use some fixed known orthonormal sequences in their place, such as the standard Fourier basis involving the sine and cosine functions. Alternatively, we may estimate $\phi_{j_1}$ and $\psi_{j_2}$ from data. For example, $\phi_{j_1}$ is estimated by the eigenelement corresponding to the $j_1$th largest eigenvalue of the sample covariance operator $\hat{C}_X=n^{-1}\sum_{i=1}^{n}X_i\otimes X_i$, and similarly, $\psi_{j_2}$ is estimated by the eigenelement corresponding to the $j_2$th largest eigenvalue of $\hat{C}_Y=n^{-1}\sum_{i=1}^{n}Y_i\otimes Y_i$.

The tuning parameter $\tau$, controlling the degree of partial standardization in (8) and potentially varying with the sample size, is the key to exploiting the decaying variances of the coordinates of $V_i$. This may be better understood from the perspective of simultaneous confidence intervals (SCIs) for hypothesis testing. Based on the distributions of $M$ and $L$, one can also construct an SCI for each coordinate of $\nu$, whose width for the $j$th coordinate is proportional to $\hat{\sigma}_j^{\tau}$ for a given significance level. As discussed in Lin et al. (2020), in the extreme case $\tau=0$, all SCIs are of the same width, which counters our intuition that the width of the SCI for each coordinate should adapt to the variance of that coordinate, while in the case $\tau=1$, all coordinates in (8) have the same variance, which eliminates the “low-dimensional” structure arising from the decaying variances. In practice, the value of $\tau$ can be tuned to maximize the power of the proposed test by the method described in Lin et al. (2020).
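To see how $\tau$ interpolates between the two extremes just discussed, note that the SCI half-widths scale as $\hat{\sigma}_j^{\tau}$. A toy sketch follows; the functional form is a plausible reading of the partial standardization in (8), not a formula quoted from the paper:

```python
import numpy as np

def sci_half_widths(sigma_hat, q, n, tau):
    """Half-widths of simultaneous confidence intervals that scale as
    sigma_hat**tau: tau = 0 gives equal widths for all coordinates,
    while tau = 1 gives widths fully adaptive to each coordinate's
    standard deviation; q is a bootstrap quantile."""
    return q * sigma_hat ** tau / np.sqrt(n)
```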

## 3 Theory

We begin by introducing some notation. The symbol $\mathbb{N}$ denotes the set of natural numbers, and $\ell^2$ denotes the set of square-summable sequences. For a matrix $A$, we write $\|A\|_{\max}=\max_{i,j}|A_{ij}|$, where $A_{ij}$ is the element of $A$ at position $(i,j)$. For a random variable $W$ and an integer $m\geq 1$, the $\psi_m$-Orlicz norm is $\|W\|_{\psi_m}=\inf\{t>0:\mathbb{E}\exp(|W|^m/t^m)\leq 2\}$. We also use $\mathcal{L}(W)$ to denote the distribution of $W$ and define the Kolmogorov distance between random variables $W_1$ and $W_2$ by $d_K(\mathcal{L}(W_1),\mathcal{L}(W_2))=\sup_{t\in\mathbb{R}}|P(W_1\leq t)-P(W_2\leq t)|$. For two sequences $\{a_n\}$ and $\{b_n\}$ with non-negative elements, $a_n\ll b_n$ means $a_n/b_n\to 0$ as $n\to\infty$, and $a_n\lesssim b_n$ means $a_n\leq cb_n$ for some constant $c>0$ and all sufficiently large $n$. Moreover, we write $a_n\gtrsim b_n$ if $b_n\lesssim a_n$, $a_n\gg b_n$ if $b_n\ll a_n$, and $a_n\asymp b_n$ if both $a_n\lesssim b_n$ and $b_n\lesssim a_n$. Also, define $a\vee b=\max(a,b)$ and $a\wedge b=\min(a,b)$.

Let $V_{\infty}^{1}$ denote the infinite sequence formed by $\xi_{j_1}\eta_{j_2}$ for $j_1,j_2\in\mathbb{N}$, where $\xi_{j_1}=\langle X,\phi_{j_1}\rangle_1$ and $\eta_{j_2}=\langle Y,\psi_{j_2}\rangle_2$. Our first assumption concerns the tail behavior of $V_{\infty}^{1}$; a similar assumption appears in equation (4.24) of Vershynin (2018).

###### Assumption 1 (Tail behavior).

The random element $V_{\infty}^{1}$ is sub-Gaussian in the sense that

$$\|\langle V_{\infty}^{1},t\rangle\|_{\psi_2}\leq K\langle C_Vt,t\rangle^{\frac12} \tag{9}$$

for any vector $t\in\ell^2$ and a constant $K>0$, where $\langle\cdot,\cdot\rangle$ is the canonical inner product in $\ell^2$ and $C_V$ is the covariance operator of $V_{\infty}^{1}$.

To state the next assumption, for any $k\in\mathbb{N}$, let $J_k$ denote the set of indices corresponding to the $k$ largest values among the standard deviations of the elements of $V_{\infty}^{1}$. In addition, let $R_k$ denote the correlation matrix of the random variables indexed by $J_k$. Lastly, let $a>0$ be a constant fixed with respect to $n$, and define the integers

$$\ell_n=\lceil(1\vee\log^3 n)\wedge p\rceil\quad\text{and}\quad k_n=\lceil(\ell_n\vee n^{1/\log^a n})\wedge p\rceil.$$
###### Assumption 2 (Structural assumptions).
1. The eigenvalues $\lambda_{j_1}$ for $1\leq j_1\leq d_X$ and $\rho_{j_2}$ for $1\leq j_2\leq d_Y$ are positive, and there are constants $\alpha_1$ and $\alpha_2$, not depending on $n$, such that

$$\lambda_{j_1}\asymp j_1^{-\alpha_1}\quad\text{and}\quad\tilde{\kappa}(j_2)\asymp j_2^{-\alpha_2}. \tag{10}$$

Moreover,

$$\max_{j_1\geq 1}|b_{j_1(j_2)}|\asymp\tilde{\kappa}(j_2),\quad\text{for all }j_2=1,2,\ldots, \tag{11}$$

where $b_{j_1(j_2)}$ refers to the decreasingly ordered entries of $\{b_{j_1j_2}:j_2\geq 1\}$.

2. There is a constant $\epsilon_0>0$, not depending on $n$, such that

$$\max_{i\neq j}[R_{\ell_n}]_{i,j}\leq 1-\epsilon_0, \tag{12}$$

where $[R_{\ell_n}]_{i,j}$ denotes the element of $R_{\ell_n}$ at the $(i,j)$ position. Also, the matrix with the entry given by is positive semi-definite, and there is a constant not depending on such that

 ∑1≤i

The requirement on $\lambda_{j_1}$ and $\tilde{\kappa}(j_2)$ in condition (i) of the above assumption ensures certain smoothness of the covariance functions of $X$ and $Y$, respectively; such a requirement is also adopted in Cai et al. (2018) and is connected to the so-called Sacks–Ylvisaker condition (Ritter et al., 1995; Yuan et al., 2010). For a scalar-on-function model, $d_Y=1$, and thus the requirement for $\tilde{\kappa}$ in (10) and the condition on the generalized Fourier coefficients of $\beta$ in (11) are automatically satisfied. In addition, (11) is considerably weaker than the requirement in Cai et al. (2006); Hall and Horowitz (2007) for the scalar-on-function regression model.

Our last assumption imposes some conditions on the growth rates of $p_1$ and $p_2$ relative to $n$, where we recall that $p_1$ and $p_2$ may vary with $n$.

###### Assumption 3.

, and

$$p^{2\alpha_0}\,\tilde{p}^{\,2(\alpha+1)}\lesssim\frac{n}{\ell_n^2k_n^{2\alpha}}$$

for an arbitrarily small but fixed , where and . In addition, if and if . Moreover, when .

For the scalar-on-function model, the above assumption requires a growth rate of $p_1$ that is asymptotically only slightly less than the rate allowed in Hall and Horowitz (2007), by noting that $\alpha_0$ can be made arbitrarily small and that $\ell_n$ and $k_n$ are asymptotically less than $n^{\epsilon}$ for any (small) $\epsilon>0$.

As mentioned previously, the eigenelements $\phi_{j_1}$ and $\psi_{j_2}$ are often unknown, and practitioners may use alternative orthonormal elements $\tilde{\phi}_{j_1}$, which may differ from $\phi_{j_1}$, and similarly use orthonormal elements $\tilde{\psi}_{j_2}$ in place of $\psi_{j_2}$. In this case, all quantities depending on the $\phi_{j_1}$ and $\psi_{j_2}$, such as $V_i$ and $\hat{\sigma}_j$, will be computed by using $\tilde{\phi}_{j_1}$ and $\tilde{\psi}_{j_2}$. We write, for example, $M(\tilde{\phi},\tilde{\psi})$ and $T_U(\tilde{\phi},\tilde{\psi})$, to indicate the dependence on $\tilde{\phi}=\{\tilde{\phi}_{j_1}\}$ and $\tilde{\psi}=\{\tilde{\psi}_{j_2}\}$, and note that $M=M(\phi,\psi)$ and $T_U=T_U(\phi,\psi)$.

Given two orthonormal sequences $\tilde{\phi}$ and $\tilde{\psi}$, define

$$U_{p_X}(\tilde{\phi})=\begin{pmatrix}\langle\phi_1,\tilde{\phi}_1\rangle_1&\cdots&\langle\phi_1,\tilde{\phi}_{p_1}\rangle_1\\\vdots&\ddots&\vdots\\\langle\phi_{p_1},\tilde{\phi}_1\rangle_1&\cdots&\langle\phi_{p_1},\tilde{\phi}_{p_1}\rangle_1\end{pmatrix}\quad\text{and}\quad U_{p_Y}(\tilde{\psi})=\begin{pmatrix}\langle\psi_1,\tilde{\psi}_1\rangle_2&\cdots&\langle\psi_1,\tilde{\psi}_{p_2}\rangle_2\\\vdots&\ddots&\vdots\\\langle\psi_{p_2},\tilde{\psi}_1\rangle_2&\cdots&\langle\psi_{p_2},\tilde{\psi}_{p_2}\rangle_2\end{pmatrix}.$$

Let $U_p=U_{p_X}(\tilde{\phi})\otimes U_{p_Y}(\tilde{\psi})$, where $\otimes$ here denotes the Kronecker product of two matrices. Consider a class $\mathcal{F}_p$ of pairs $(\tilde{\phi},\tilde{\psi})$ such that 1) $\tilde{\phi}$ and $\tilde{\psi}$ are respectively orthonormal sequences, 2) the matrix $U_p$ is suitably close to the identity matrix, and 3) conditions involving the quantities defined in the above assumptions hold; the last condition enforces that the variances of the coordinates of $V_i(\tilde{\phi},\tilde{\psi})$ exhibit a decay pattern similar to that of the variances of the coordinates of $V_i$. The class $\mathcal{F}_p$ also corresponds to a class of test statistics $T_U(\tilde{\phi},\tilde{\psi})$ and $T_L(\tilde{\phi},\tilde{\psi})$. Below, we analyze the uniform asymptotic size and power over this class of test statistics; the asymptotic properties of the proposed test using $\phi$ and $\psi$ then follow as direct consequences, since the class contains $(\phi,\psi)$. In addition, we consider only the max statistic, while noting that similar results hold for the min statistic.
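The inner-product matrices $U_{p_X}(\tilde{\phi})$ and $U_{p_Y}(\tilde{\psi})$ defined above can be approximated numerically when the basis functions are evaluated on a dense uniform grid; the following sketch uses a simple Riemann-sum quadrature and is illustrative only (the function name is ours):

```python
import numpy as np

def basis_alignment(phi, phi_tilde, w):
    """Matrix of L^2 inner products  U[j, k] = <phi_j, phi_tilde_k>.

    phi, phi_tilde : (m, p) arrays of basis functions evaluated on a
    uniform grid of m points with spacing w; the integral is
    approximated by a Riemann sum."""
    return phi.T @ phi_tilde * w
```

For two identical orthonormal bases, the result is (up to quadrature error) the identity matrix, the case in which the chosen basis perfectly aligns with the true eigenbasis.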

For $(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p$, we define the Gaussian counterpart of $M(\tilde{\phi},\tilde{\psi})$ by

$$\check{M}(\tilde{\phi},\tilde{\psi})=\max_{1\leq j\leq p}\frac{\check{S}_{n,j}(\tilde{\phi},\tilde{\psi})}{\sigma_{nj}^{\tau}(\tilde{\phi},\tilde{\psi})},$$

where $\check{S}_n(\tilde{\phi},\tilde{\psi})$ is a centered Gaussian vector with the same covariance as $S_n(\tilde{\phi},\tilde{\psi})$. The following result shows that the distribution of $M(\tilde{\phi},\tilde{\psi})$ converges to that of $\check{M}(\tilde{\phi},\tilde{\psi})$ at a near-$n^{-1/2}$ rate.

###### Theorem 1 (Gaussian approximation).

For any small number $\delta>0$, if Assumptions 1–3 hold, then

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}d_K\big(\mathcal{L}(M(\tilde{\phi},\tilde{\psi})),\mathcal{L}(\check{M}(\tilde{\phi},\tilde{\psi}))\big)\lesssim n^{-\frac12+\delta}.$$

In the proposed test, a bootstrap strategy is used to estimate the distribution of $\check{M}(\tilde{\phi},\tilde{\psi})$, which is justified by the following result.

###### Theorem 2 (Bootstrap approximation).

For any small number $\delta>0$, if Assumptions 1–3 hold, then there is a constant $c>0$, not depending on $n$, such that the event

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}d_K\big(\mathcal{L}(\check{M}(\tilde{\phi},\tilde{\psi})),\mathcal{L}(M^{\star}(\tilde{\phi},\tilde{\psi})\,|\,D)\big)\leq c\,n^{-\frac12+\delta}$$

occurs with probability tending to one, where $\mathcal{L}(M^{\star}(\tilde{\phi},\tilde{\psi})\,|\,D)$ represents the distribution of $M^{\star}(\tilde{\phi},\tilde{\psi})$ conditional on the observed data $D$.

In practice, the variances $\sigma_{nj}^2(\tilde{\phi},\tilde{\psi})$ are estimated by $\hat{\sigma}_{nj}^2(\tilde{\phi},\tilde{\psi})$, and the max statistic is pragmatically computed by

$$\hat{M}(\tilde{\phi},\tilde{\psi})=\max_{1\leq j\leq p}\frac{S_{n,j}(\tilde{\phi},\tilde{\psi})}{\hat{\sigma}_{nj}^{\tau}(\tilde{\phi},\tilde{\psi})}.$$

Below, we show that the distribution of this practical max statistic converges to that of the original max statistic.

###### Theorem 3.

For any small number $\delta>0$, if Assumptions 1–3 hold, then

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}d_K\big(\mathcal{L}(\hat{M}(\tilde{\phi},\tilde{\psi})),\mathcal{L}(M(\tilde{\phi},\tilde{\psi}))\big)\lesssim n^{-\frac12+\delta}.$$

Theorems 1–3, together with the triangle inequality, imply that, with probability tending to one,

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}d_K\big(\mathcal{L}(\hat{M}(\tilde{\phi},\tilde{\psi})),\mathcal{L}(M^{\star}(\tilde{\phi},\tilde{\psi})\,|\,D)\big)\leq c\,n^{-\frac12+\delta}$$

for some constant $c>0$ not depending on $n$. Consequently, the probability that the null hypothesis is rejected at the significance level $\varrho$ when the null hypothesis in (2) is true, denoted by $\mathrm{SIZE}(\varrho,\tilde{\phi},\tilde{\psi})$, is asymptotically controlled at the rate $n^{-1/2+\delta}$ uniformly on $\mathcal{F}_p$.

###### Theorem 4.

For any small number $\delta>0$, if Assumptions 1–3 hold, then for any $\varrho\in(0,1)$ and some constant $c>0$ not depending on $n$, we have

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}\big|\mathrm{SIZE}(\varrho,\tilde{\phi},\tilde{\psi})-\varrho\big|\leq c\,n^{-1/2+\delta}.$$

Finally, below we establish the consistency of the proposed test.

###### Theorem 5.

Suppose Assumptions 1–3 hold. Then,

1. for any fixed $\varrho\in(0,1)$, one has

$$\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}\big|\hat{q}_{M(\tilde{\phi},\tilde{\psi})}(\varrho)\big|\leq c\sqrt{\log n}$$

with probability tending to one, where $c>0$ is a constant not depending on $n$, and

2. for some constant $c>0$ not depending on $n$, one has

$$P\bigg(\sup_{(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p}\max_{1\leq j\leq p}\frac{\hat{\sigma}_j^2(\tilde{\phi},\tilde{\psi})}{\sigma_{\max}^2(\tilde{\phi},\tilde{\psi})}<5\bigg)\geq 1-cn^{-1},$$

where $\sigma_{\max}^2(\tilde{\phi},\tilde{\psi})=\max_{1\leq j\leq p}\sigma_j^2(\tilde{\phi},\tilde{\psi})$.

Consequently, if the signal $\nu$ is sufficiently strong, then the null hypothesis in (2) will be rejected uniformly on $\mathcal{F}_p$ with probability tending to one, that is,

$$P\Big(\forall(\tilde{\phi},\tilde{\psi})\in\mathcal{F}_p:\ T_U(\tilde{\phi},\tilde{\psi})>\hat{q}_{M(\tilde{\phi},\tilde{\psi})}(1-\varrho/2)\ \text{ or }\ T_L(\tilde{\phi},\tilde{\psi})<\hat{q}_{L(\tilde{\phi},\tilde{\psi})}(\varrho/2)\Big)\to 1.$$

## 4 Simulation Studies

To illustrate the numerical performance of the proposed method, we consider three families of models, namely, scalar-on-function, function-on-function and function-on-scalar models. For each family, we consider various settings; see below for details. In all settings, the response is computed from (1).

For each setting, we consider several sample sizes to investigate the impact of the sample size $n$ on the power of a test. For the proposed test procedure, the truncation levels $p_1$ and $p_2$ are set according to the family under consideration. The tuning parameter $\tau$ is selected by the method described in Lin et al. (2020). Finally, we independently perform replications for each setting, based on which we compute the empirical size as the proportion of rejections among the replications when the null hypothesis is true, and the empirical power as the proportion of rejections when the alternative hypothesis is true. All settings use a common fixed significance level.
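The empirical size and power described above are plain Monte Carlo rejection rates; a generic sketch follows (the function names and arguments are our own):

```python
import numpy as np

def empirical_rejection_rate(test_fn, data_gen, n_reps=500, seed=0):
    """Monte Carlo estimate of the rejection probability of a test:
    the empirical size when data_gen draws from the null, and the
    empirical power when it draws from the alternative.

    test_fn  : callable mapping a dataset to True (reject) / False.
    data_gen : callable mapping a numpy Generator to one dataset.
    """
    rng = np.random.default_rng(seed)
    rejections = sum(bool(test_fn(data_gen(rng))) for _ in range(n_reps))
    return rejections / n_reps
```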

#### Scalar-on-function.

We consider two settings for the functional predictor $X$. In the first setting, $X$ is a centered Gaussian process with the following Matérn covariance function

 C(s,t)=σ221−νΓ(ν)(√2ν