1 Introduction
In this article we consider a scalaronfunction historical linear regression model where the functional predictor is defined on a time interval but influences the scalar response only on for some unknown cutoff time . Specifically, the model is written as
(1) 
where, without loss of generality, is assumed to be centered, i.e., , is then the mean of , is the slope function (or coefficient function), and represents the noise that is independent of .
By setting a new process and slope function , the above model can be equivalently expressed as in which the response depends only on the recent past of the process up to a time lag . The term “historical” stems from its resemblance to the functiononfunction historical linear model considered in Malfait and Ramsay (2003), where the response is a function instead of a scalar. In the case of , such model is reduced to model (1) or its equivalent form.
An example of the scalaronfunction historical linear regression is to determine the effects of the past engine acceleration on the current particulate matter emission. The response variable is the current particulate matter emission and the explanatory function is the smoothed engine acceleration curve for the past
seconds. Figure 1(a) displays smoothed engine acceleration curves against the backward time, in which means the current time, while Figure 1(b) shows the slope function estimated by the smoothing spline method (Cardot et al., 2003). We observe from Figure 1(b) that the acceleration over the past 20–60 seconds does not have apparent contribution to predicting the current particulate matter emission. Intuitively, the particulate matter emissions shall depend on the recent acceleration, but not the ancient one. Therefore, if a linear relation between the particulate matter emissions and the acceleration curve is assumed, one might naturally use the historical linear model (1) to analyze such data, where the task includes identifying the cutoff time beyond which the engine acceleration has no influence on the current particulate matter emission.The degenerate case in model (1) corresponds to the classic functional linear regression that has been studied in vast literature. Hastie and Mallows (1993) pioneered the smooth estimation of via penalized least squares and/or smooth basis expansion. Cardot et al. (2003) adopted Bspline basis expansion, while Li and Hsing (2007)
utilized Fourier basis, both with a roughness penalty to control the smoothness of estimated slope functions. Datadriven bases such as eigenfunctions of the covariance function of the predictor process
were considered in Cardot et al. (2003), Cai and Hall (2006) and Hall and Horowitz (2007). Yuan and Cai (2010) took a reproducing kernel Hilbert space approach to estimate the slope function. The case of sparsely observed functional data was studied by Yao et al. (2005). These estimation procedures for classic functional linear regression do not apply to the historical linear model where is often assumed. For models beyond linear regression and a comprehensive introduction to functional data analysis, readers are referred to the monographs by Ramsay and Silverman (2005), Ferraty and Vieu (2006), Hsing and Eubank (2015) and Kokoszka and Reimherr (2017), as well as the review papers by Morris (2015) and Wang et al. (2016) and references therein.Model (1) has been investigated by Hall and Hooker (2016) who proposed to estimate and by penalized least squares with a penalty on . The resulting estimates for are discontinuous at where stands for the estimated . This feature might not be desirable when is a priori assumed to be continuous. For example, it is more reasonable to assume the acceleration function influences particulate matter in a continuous and smooth manner. Moreover, in practice, predictor functions are often not very smooth, while our simulation study suggests that estimates of Hall and Hooker (2016) generally do not perform well in such case. Alternatively, we observe that model (1) is equivalent to a classic functional linear model with for all . Such a slope function is a special case of locally sparse functions which by definition are functions being zero in a substantial portion of their defining domains. Locally sparse slope functions have been studied in Lin et al. (2017), as well as pioneering works James et al. (2009) and Zhou et al. (2013). For example, in Lin et al. (2017), a general functional shrinkage regularization technique, called fSCAD, was proposed and demonstrated to be able to encourage the local sparseness. Although these endeavors are able to produce a smooth and locally sparse estimate, they do not specifically focus on the tail region . Therefore, the estimated slope functions produced by such methods might not be zero in the region that is very close to the endpoint , in particular when the boundary effect is not negligible.
In this article, we propose a new nested group bridge approach to estimate the slope function and the cutoff time . Comparing to the existing methods, the proposed approach has two features. First, it is based on Bspline basis expansion and penalized least squares with a roughness penalty. Therefore, the resulting estimator of is continuous and smooth over the entire domain , contrasting the discontinuous estimator of Hall and Hooker (2016). Second, it employs a new nested group bridge shrinkage method proposed in Section 2 to specifically shrink the estimated function on the tail region . Group bridge was proposed in Huang et al. (2009) for variable selection, and utilized by Wang and Kai (2015) for locally sparse estimation in the setting of nonparametric regression. In our approach, we creatively organize the coefficients of Bspline basis functions into a sequence of nested groups and apply the group bridge penalty to the groups. With the aid from Bspline basis expansion, such nested structure enables us to shrink the tail of the estimated slope function. This fixes the problem of the aforementioned generic locally sparse estimation procedures.
We structure the rest of the paper as follows. In Section 2 we present the proposed estimation method for the slope function and the cutoff time, and also provide computational details. In Section 3 we investigate the asymptotic properties of derived estimators. Simulation studies are discussed in Section 4, and an application to the particulate matter emissions data is given in Section 5.
2 Methodology
2.1 Nested Group Bridge Approach
Our estimation method utilizes Bspline basis functions that are detailed in de Boor (2001). Let
be a vector that contains
Bspline basis functions defined on with degree and equally spaced knots . For , let denote the vector of the th derivatives of the Bspline basis functions. Each of these basis functions is a piecewise polynomial of degree . Bspline basis functions are well known for their compact support property, i.e., each basis function is positive over at most adjacent subintervals. For illustration, Figure 2 shows thirteen Bspline basis functions defined on with and . Due to this compact support property, if we approximate by a linear combination of Bspline basis functions, then such approximation is locally sparse if the coefficients are sparse in groups.We shall further introduce some notations. Let , and for . Intuitively, each group represents the indices of Bspline basis functions that are nonzero on . For a vector of scalars, we denote by the subvector of elements whose indices are in the th group . We shall use to denote the norm of a generic dimensional vector , and use to denote the norm of a generic function . As our focus is on the estimation of and , without loss of generality, we assume that in model (1) in the sequel.
For a fixed , the historically sparse and smooth estimators for and are defined as
(2) 
where and minimizes the penalized least squares
(3) 
with known weights and nonnegative tuning parameters and
. In the above criterion, the first term is the ordinary least squares error that encourages the fidelity of model fitting, while the second term is a roughness penalty that aims to enforce smoothness of the estimate
. In practice, is a common choice, which corresponds to measuring the roughness of a function by its integrated curvature.The last term in the objective function (3) is designed to shrink the estimated slope function toward zero specifically on the tail region. It originates from the group bridge penalty that was introduced by Huang et al. (2009) for simultaneous selection of variables at both the group and withingroup individual levels. In (3), the groups have a special structure: . In other words, the groups are nested as a sequence and hence we call the last term in (3) nested group bridge. Due to such nested nature, if , then one can observe in (3) that (i) the coefficient appears in all groups where the coefficient also appears, and (ii) appears in more groups than . As a consequence, is always penalized more heavily than . These two features suggest that the nested group bridge penalty spends more effort on shrinking those coefficients of Bspline basis functions whose support is in a closer proximity to . As Bspline basis functions enjoy the aforementioned compact support property and our estimate is represented by a linear combination of such basis functions as in (2), the progressive shrinkage of nested group bridge encourages the estimate of to be locally sparse specifically on the tail part of the time domain. Such estimate is exactly what we are after in the scalaronfunction historical linear model (1). The weights are introduced to offset the effect of different dimensions of . As suggested by Huang et al. (2009), a simple choice for is , where denotes the cardinality of .
2.2 Computational Method
The objective function (3) is not convex and thus difficult to optimize. Huang et al. (2009) suggested the following formulation that was easier to work with. Based on Proposition 1 of Huang et al. (2009), for , if , then minimizes (3) if and only if minimizes
(4) 
subject to , where and . Below we develop an algorithm following this idea.
Let denote the matrix with elements , and let denote the matrix with elements . The first term of (4) can be expressed as and the second term of (4) yields . Since is a positive semidefinite matrix, by Cholesky decomposition we write , where is symmetric. Define
where is the zero vector of length . If we write for , then (4) can be written in the form
(5) 
Let be the diagonal matrix with the th diagonal element . With notation and , (5) can be expressed in a form of lasso problem (Tibshirani, 1996),
where denote the th element of vector . Now, we take the following iterative approach to compute .

Step 1. Obtain an initial estimate .

Step 2. At iteration , , compute

Step 3. At iteration , compute
(6) 
Step 4. Repeat Step 2 and Step 3 until convergence is reached.
A choice for the initial estimate is , which is obtained by the smoothing spline method (Cardot et al., 2003). Once is produced, the estimates for and are given in (2). As the nested group bridge penalty is not convex, the above algorithm converges to a local minimizer. It is worth emphasizing that (6) is a lasso problem, which can be efficiently solved by the least angle regression algorithm (Efron et al., 2004).
In our fitting procedure, there are a few tuning parameters including the smoothing parameter , the shrinkage parameter , and the parameters for constructing the Bspline basis functions such as the degree of the Bspline basis and the number of knots . Following the schemes of Marx and Eilers (1999), Cardot et al. (2003) and Lin et al. (2017), we choose to be relatively large to capture the local features of . In addition, is estimated by the knot , therefore a small may lead to a large bias of the estimator . The effect of potential overfitting caused by a large number of knots can be offset by the roughness penalty. Compared to , the degree is of less importance, and therefore we fix it to a reasonable value, i.e., . The smoothing parameter and shrinkage parameter can be chosen via Bayesian information criterion, as follows. Let be the estimate based on a chosen pair of and . Let denote the submatrix of with columns corresponding to the nonzero , and denote the submatrix of with rows and columns corresponding to the nonzero
. The approximated degree of freedom for
and isThen, Bayesian information criterion (bic) can be approximated by
The optimal and are selected to minimize .
3 Asymptotic Properties
Let and be the true values of the cutoff time and the slope function , respectively. We assume that realizations are fully observed, while notice that the analysis can be extended to sufficiently densely observed data. Without loss of generality, we assume . If , set , and if , let . Otherwise, let be an integer such that . According to Theorem XII(6) of de Boor (), there exists some with , such that for some positive constant and . Define , . For simplicity, we derive the theoretical results based on . Define as the the covariance operator of the random process , and as the empirical version of , which is defined by
For tow functions and defined on , we define the inner product in the Hilbert space as . Let be the matrix with elements . In order to establish our asymptotic properties, we assume that the following conditions are satisfied.

[label=C.0]

.

The th derivative exists and satisfies the Hölder condition with exponent , that is , for some constant , . Define . Assume .

, and .

There are constants such that
with probability tending to one as
goes to infinity, where anddenote the smallest and largest eigenvalues of a matrix, respectively.

, where .

.
The condition 1 assures the existence of the covariance function of . The second condition concerns the smoothness of the slope function , which has been used by Cardot et al. (2003) and Lin et al. (2017). In condition 3 we set the growth rate for the smoothing tuning parameter . Our analysis applies to , which is equivalent to Tikhonov regularization in Hall and Horowitz (2007) and simplifies our analysis. A similar result can be derived for . The last two conditions pose certain constraints on the decay rate of and (and hence ). Similar conditions appear in Wang and Kai (2015). Below we state the main results, and relegate their proofs to the supplementary file. Our first result provides the convergence rate of the estimator defined in (2).
The convergence rate consists of two competing components, the variance term
and the bias term . With an increase of , the approximation to by Bspline basis functions is improved, however, at the cost of increased variance. The next result shows that the null tail of , as well as the cutoff time , can be consistently estimated.4 Simulation Studies
We conduct simulation studies to evaluate the numerical performance of our nested group bridge method, and compare the results with the smoothing spline approach, as well as the two truncation methods proposed by Hall and Hooker (). The two truncation methods first expand the slope function with an orthonormal basis and then penalize by adding a penalty on to the least squares. Two estimation procedures were suggested by Hall and Hooker (2016). The first one (called Method A) estimates and simultaneously, while the second one (called Method B) estimates them in an iterative fashion.
In our studies, for the purpose of fair comparison, we consider the same scenarios for in Hall and Hooker (2016), namely,
Scenario I. ,
Scenario II. ,
Scenario III. ,
where denotes the indicator function. For all cases the slope function on and on . As illustrated in Figure 3, the slope function is discontinuous for scenario I, and the first and second derivatives of the slope functions are discontinuous for scenario II and III, respectively. The predictor functions are generated by , where are cubic Bspline basis functions defined on (the number 64 is randomly selected between 50 and 100) equally spaced knots over , and the coefficients
are generated from the standard normal distribution. The errors
are normally distributed and sampled so that the signaltonoise ratio equals to . We consider sample sizes and . For each of the three scenarios and for each sample size, we replicate the simulation independently for times. For our nested group bridge approach, we choose , where dividing by borrows the idea of adaptive lasso (Zou, 2006). We obtain by the smoothing spline method (Cardot et al., 2003). We set and .Table 1
summarizes the Monte Carlo mean and standard error of
. The results suggest that the proposed estimator is more accurate than the truncation methods in Scenario III when the second derivative of the slope function is discontinuous. On the other hand, in Scenario I and II when the slope function and the first derivative of the slope function are discontinuous, respectively, the proposed method is comparable to the truncation method B and better than the truncation method A. It is observed here and also discussed in Hall and Hooker (2016) that the truncation methods tend to underestimate and exhibit a large bias when is smooth. The figures in Table 1 seem inconsistent with those reported in Hall and Hooker (2016). A possible reason is that the predictor functions in their paper are much smoother than those in our setting. Indeed, an exponential decay in eigenvalues was assumed in the simulation setup of Hall and Hooker (2016), corresponding to rather smooth predictor functions. However, such smooth random functions might not be common in practice. The histograms shown in Figure 4 provide more details of the performance of our method. They indicate that when is not smooth, the proposed estimator is conservative, in the sense that , which might be better than being aggressive when accurate prediction of the response is the primary goal.NGR  TR (Method A)  TR (Method B)  True Value  

Scenario I  
()  ()  ()  
()  ()  ()  
Scenario II  
()  ()  ()  
()  ()  ()  
Scenario III  
()  ()  ()  
()  ()  () 

NGR, our proposed nested group bridge method; TR (Method A), the truncation method that estimates and simultaneously; TR (Method B), the truncation method that estimates and iteratively.
simulation replications with the corresponding Monte Carlo standard deviation included in parentheses.
To examine the quality of the estimation for , we report the mean integrated squared errors of the estimated in Table 2. It is observed that in general, the proposed estimator outperforms the smoothing spline method and the two truncation methods. The truncation methods do not regularize the roughness of the estimated slope function, which leads to a less favorable performance when the predictor function is relatively rough, as in our setting and common in practice. The smoothing spline method is comparable to the proposed method in terms of estimation accuracy of , but it is unable to provide an estimate for .
To examine the quality of the estimation for , we report the mean integrated squared errors of the estimated in Table 2. It is observed that in general, the proposed estimator outperforms the smoothing spline method and the two truncation methods. The truncation methods do not regularize the roughness of the estimated slope function, which leads to a less favorable performance when the predictor function is relatively rough, as in our setting and common in practice. The smoothing spline method is comparable to the proposed method in terms of estimation accuracy of , but it is unable to provide an estimate for .
NGR  SS  TR (Method A)  TR (Method B)  

Scenario I  
n =  ()  ()  ()  () 
n =  ()  ()  ()  () 
Scenario II  
n =  ()  ()  ()  () 
n =  ()  ()  ()  () 
Scenario III  
n =  ()  ()  ()  () 
n =  ()  ()  ()  () 

NGR, our proposed nested group bridge method; SS, the smoothing spline method; TR (Method A), the truncation method that estimates and simultaneously; TR (Method B), the truncation method that estimates and iteratively.
5 Applications: Particulate Matter Emissions Data
In this section, we demonstrate the proposed approach to analyze the particulate matter emissions data which are taken from the Coordinating Research Councils E55/E59 research project (Clark et al., 2007). In this project, trucks were placed on the chassis dynamometer bed to mimic inertia and particulate matter was measured by an emission analyzer on standard test cycles. The engine acceleration of diesel trucks was also recorded. We are interested in determining the effects of the past engine acceleration on the current particulate matter emission, and in particular, identifying the cutoff time in the past that have a predicting power on the current particulate matter emission. As noted in Hall and Hooker (2016), we obtain observation every second after the first seconds to remove dependence in the data. Let be the logarithm of the particulate matter emission measured at the th second after the first seconds, and be the corresponding engine acceleration at the past time . Both and are centered such that and . We estimate the functional linear model (1), where , the engine acceleration in the past seconds is the predictor curve, and . In total, we have such samples. Figure 5(a) displays randomly selected smoothed engine acceleration curves recorded on every second for seconds.
Figure 5(b) and (c) provides estimates for obtained by the proposed approach and the smoothing spline method, respectively, both of which use cubic Bspline basis functions. The proposed estimate is zero over and the estimate for is s. It suggests that the engine acceleration influences particulate matter emission for no longer than 20 seconds. A similar trend can be observed for the smoothing spline method which, however, does not give a clear cutoff time of influence of acceleration on particulate matter emission. Hall and Hooker (2016) suggested that the point estimate for is s using Method A and s using Method B, both of which are more aggressive than our estimator.
We also construct a 95% pointwise bootstrap pivotal confidence interval for
which is depicted in Figure 5(b) together with the proposed estimate. The bootstrap confidence interval is derived by resampling the residuals, reestimating , and at each time pointcalculating the sample quantile. Let
denote the sample quantile of the reestimated slope functions at time point . The bootstrap pivotal confidence interval for is . For further details of pivotal bootstrap confidence intervals, we refer readers to Wasserman (2010), Chapter . The % pointwise bootstrap confidence interval in Figure 5(b) implies that there is little effect of the acceleration on the current particulate matter emission when the time is beyond the past seconds.6 Concluding Remarks
In this paper, we consider to study the relation between a scalar response and a functional predictor in a historical functional linear model. We propose a nested group bridge approach to achieve the historical sparseness, which reduces the variability and enhances the interpretability. Compared with the truncation methods by Hall and Hooker (2016), the proposed approach is able to provide a smooth and continuous estimate for the coefficient function and performs much better when the coefficient function tends to zero more smoothly. The proposed estimator of the cutoff time enjoys the estimation consistency. We demonstrate in simulation studies and an real data application that the proposed approach performs well for predictor functions that are not very smooth. We also show that even when the signal to noise ratio is low, our proposed approach can still accommodate the situation very well.
Supplementary materials
A supplementary document is available online at Journal of Computational and Graphical Statistics, which includes proofs of the theoretical results. The R codes for our real data analysis and the simulation studies can be downloaded online.
References
 Cai and Hall (2006) Cai, T. T. and P. Hall (2006). Prediction in functional linear regression. The Annals of Statistics 34(5), 2159–2179.
 Cardot et al. (2003) Cardot, H., F. Ferraty, and P. Sarda (2003). Spline estimators for the functional linear model. Statistica Sinica 13, 571–591.
 Clark et al. (2007) Clark, N., M. Gautam, W. Wayne, D. Lyons, G. Thompson, and B. Zielinska (2007). Heavyduty vehicle chassis dynamometer testing for emissions inventory, air quality modeling, source apportionment and air toxics emissions inventory: E55/59 all phases. Coordinating Research Council, Alpharetta.
 de Boor (2001) de Boor, C. (2001). A practical Guide to Splines. New York: SpringerVerlag.
 Efron et al. (2004) Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani (2004). Least angle regression. The Annals of statistics 32(2), 407–499.
 Ferraty and Vieu (2006) Ferraty, F. and P. Vieu (2006). Nonparametric Functional Data Analysis: Theory and Practice. New York: SpringerVerlag.
 Hall and Hooker (2016) Hall, P. and G. Hooker (2016). Truncated linear models for functional data. Journal of Royal Statistical Society, Series B 78(3), 637–653.
 Hall and Horowitz (2007) Hall, P. and J. L. Horowitz (2007). Methodology and convergence rates for functional linear regression. The Annals of Statistics 35(1), 70–91.
 Hastie and Mallows (1993) Hastie, T. and C. Mallows (1993, May). A statistical view of some chemometrics regression tools. Technometrics 35(2), 140–143.
 Hsing and Eubank (2015) Hsing, T. and R. Eubank (2015). Theoretical Foundations of Functional Data Analysis, with an Introduction to Linear Operators. Chichester: Wiley.
 Huang et al. (2009) Huang, J., S. Ma, H. Xie, and C. H. Zhang (2009). A group bridge approach for variable selection. Biometrika 96, 339–355.
 James et al. (2009) James, G. M., J. Wang, and J. Zhu (2009). Functional linear regression that’s interpretable. The Annals of Statistics 37(5A), 2083–2108.
 Kokoszka and Reimherr (2017) Kokoszka, P. and M. Reimherr (2017). Introduction to Functional Data Analysis. Boca Raton: Chapman and Hall/CRC.

Li and Hsing (2007)
Li, Y. and T. Hsing (2007).
On rates of convergence in functional linear regression.
Journal of Multivariate Analysis
98, 1782–1804.  Lin et al. (2017) Lin, Z., J. Cao, and L. W. . H. Wang (2017). Locally sparse estimator for functional linear regression models. Journal of Computational and Graphical Statistics 26(2), 306–318.
 Malfait and Ramsay (2003) Malfait, N. and J. O. Ramsay (2003). The historical functional linear model. The Canadian Journal of Statistics 31(2), 115–128.
 Marx and Eilers (1999) Marx, B. D. and P. H. C. Eilers (1999). Generalized linear regression on sampled signals and curves: A Pspline approach. Technometrics 41(1), 1–13.
 Morris (2015) Morris, J. S. (2015). Functional regression. Annual Review of Statistics and Its Application 2(1), 321–359.
 Ramsay and Silverman (2005) Ramsay, J. O. and B. W. Silverman (2005, June). Functional Data Analysis. New York: Springer.
 Tibshirani (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological) 58(1), 267–288.
 Wang and Kai (2015) Wang, H. and B. Kai (2015). Functional sparsity: global vesus local. Statistica Sinica 25, 1337–1354.
 Wang et al. (2016) Wang, J.L., J.M. Chiou, and H.G. Müller (2016). Review of functional data analysis. Annual Review of Statistics and Its Application 3, 257–295.
 Wasserman (2010) Wasserman, L. (2010). All of statistics: a concise course in statistical inference. New York: Springer.

Yao
et al. (2005)
Yao, F., H. G. Müller, and J.L. Wang (2005).
Functional linear regression analysis for longitudinal data.
The Annals of Statistics 33, 2873–2903.  Yuan and Cai (2010) Yuan, M. and T. T. Cai (2010). A reproducing kernel Hilbert space approach to functional linear regression. The Annals of Statistics 38(6), 3412–3444.
 Zhou et al. (2013) Zhou, J., N.Y. Wang, and N. Wang (2013). Functional linear model with zerovalue coefficient function at subregions. Statistica Sinica 23, 25–50.
 Zou (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101(476), 1418–1429.
Comments
There are no comments yet.