 # Optimal Penalized Function-on-Function Regression under a Reproducing Kernel Hilbert Space Framework

Many scientific studies collect data where the response and predictor variables are both functions of time, location, or some other covariate. Understanding the relationship between these functional variables is a common goal in these studies. Motivated from two real-life examples, we present in this paper a function-on-function regression model that can be used to analyze such kind of functional data. Our estimator of the 2D coefficient function is the optimizer of a form of penalized least squares where the penalty enforces a certain level of smoothness on the estimator. Our first result is the Representer Theorem which states that the exact optimizer of the penalized least squares actually resides in a data-adaptive finite dimensional subspace although the optimization problem is defined on a function space of infinite dimensions. This theorem then allows us an easy incorporation of the Gaussian quadrature into the optimization of the penalized least squares, which can be carried out through standard numerical procedures. We also show that our estimator achieves the minimax convergence rate in mean prediction under the framework of function-on-function regression. Extensive simulation studies demonstrate the numerical advantages of our method over the existing ones, where a sparse functional data extension is also introduced. The proposed method is then applied to our motivating examples of the benchmark Canadian weather data and a histone regulation study.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Functional data have attracted much attention in the past decades (Ramsay & Silverman, 2005). Most of the existing literature has only considered the regression models of a scalar response against one or more functional predictors, possibly with some scalar predictors as well. Some of them considered a reproducing kernel Hilbert space framework. For example, Yuan & Cai (2010)

provided a thorough theoretical analysis of the penalized functional linear regression model with a scalar response. The paper laid the foundation for several theoretical developments including the representer theorem and minimax convergence rates for prediction and estimation for penalized functional linear regression models. In a follow-up,

Cai & Yuan (2012) showed that the minimax rate of convergence for the excess prediction risk is determined by both the covariance kernel and the reproducing kernel. Then they designed a data-driven roughness regularization predictor that can achieve the optimal convergence rate adaptively without the knowledge of the covariance kernel. Du & Wang (2014) extended the work of Yuan & Cai (2010) to the setting of a generalized functional linear model, where the scalar response comes from an exponential family distribution.

In contrast to these functional linear regression models with a scalar response, the model with a functional response over a functional predictor has only been scarcely investigated (Yao et al., 2005b; Ramsay & Silverman, 2005). Such data with functional responses and predictors are abundant in practice. We shall now present two motivating examples.

###### Example 1.1

Daily temperature and precipitation at 35 different locations in Canada averaged over 1960 to 1994 were collected (Figure 1). The main interest is to use the daily temperature profile to predict the daily precipitation profile for a location in Canada. Figure 1: Smoothed trajectories of temperature (Celsius) in left panel and the log (base 10) of daily precipitation (Millimetre) in right panel. The x-axis labels in both panels represent 365 days.
###### Example 1.2

Histone Regulation Data
Extensive researches have been shown that histone variants, i.e. histones with structural changes compared to their primary sequence, play an important role in the regulation of chromatin metabolism and gene activity (Ausió, 2006). An ultra-high throughput time course experiment was conducted to study the regulation mechanism during heat stress in Arabidopsis thaliana. The genome-wide histone variant distribution was measured by ChIP sequencing (ChIP-seq) (Johnson et al., 2007) experiments. We computed histone levels over 350 base pairs (bp) on genomes from the ChIP-seq data, see left panel in Figure 2. The RNA sequencing (RNA-seq) (Wang et al., 2009) experiments measured the expression levels over seven time points within 24 hours, see right panel in Figure 2. Of primary interest is to study the regulation mechanism between gene expression levels over time domain and histone levels over spatial domain. Figure 2: Smoothed trajectories of normalized histone levels in ChIP-seq experiments in left panel and the normalized expression levels in RNA-seq experiments in right panel. The x-axis label in the left panel stands for the region of 350 bp. The x-axis label in the right panel represents seven time points within 24 hours.

Motivated by the examples, we now present the statistical model. Let be two random processes defined respectively on . Suppose independent copies of are observed: , . The functional linear regression model of interest is

 Yi(t)=α(t)+∫Ixβ(t,s)Xi(s)ds+ϵi(t),    t∈Iy, (1)

where is the intercept function, is a bivariate coefficient function, and , independent of , are i.i.d. random error functions with and . Here denotes the -norm. In Example 1.1, and represent the daily precipitation and temperature at station . In Example 1.2, the expression levels of gene over seven time points, , from RNA-seq is used as the functional response. The histone levels of gene over 350 base pairs (bp), , from ChIP-seq is used as the functional predictor.

At a first look, model (1) might give the (wrong) impression of being an easy extension from the model with a scalar response, with the latter obtained from (1) by removing all the notation. However, the coefficient function in the scalar response case is univariate and thus can be easily estimated by most off-the-shelf smoothing methods. When extended to estimating a bivariate coefficient function in (1), many of these smoothing methods may encounter major numerical and/or theoretical difficulties. This partly explains the much less abundance of research in this direction.

Some exceptions though are reviewed below. Cuevas et al. (2002) considered a fixed design case, a different setting from (1) with and represented and analyzed as sequences. Nonetheless they provided many motivating applications in neuroscience, signal transmission, pharmacology, and chemometrics, where (1) can apply. The historical functional linear model in Malfait & Ramsay (2003) was among the first to study regression of a response functional variable over a predictor functional variable, or more precisely, the history of the predictor function. Ferraty et al. (2011) proposed a simple extension of the classical Nadaraya-Watson estimator to the functional case and derived its convergence rates. They provided no numerical results on the empirical performance of their kernel estimator. Benatia et al. (2015)

extended ridge regression to the functional setting. However, their estimation relied on an empirical estimate of the covariance process of predictor functions. Theoretically sound as it is, this covariance process estimate is generally not reliable in practice. Consequently, their coefficient surface estimates suffered as shown in their simulation plots.

Meyer et al. (2015) proposed a Bayesian function-on-function regression model for multi-level functional data, where the basis expansions of functional parameters were regularized by basis-space prior distributions and a random effect function was introduced to incorporate the with-subject correlation between functional observations.

A popular approach has been the functional principal component analysis (FPCA) as in

Yao et al. (2005b) and Crambes & Mas (2013). The approach starts with a basis representation of

in terms of the eigenfunctions in the Karhunen-Loève expansions of

and . Since this representation has infinitely many terms, it is truncated at certain point to obtain an estimable basis expansion of . Yao et al. (2005b) studied a general data setting where and are only sparsely observed at some random points. They derived the consistency and proposed asymptotic point-wise confidence bands for predicting response trajectories. Crambes & Mas (2013) furthered the theoretical investigation of the FPCA approach by providing a minimax optimal rates in terms of the mean square prediction error. However, the FPCA approach has a couple of critical drawbacks. Firstly, is a statistical quantity unrelated to or . Hence the leading eigenfunctions in the truncated Karhunen-Loève expansions of and may not be an effective basis for representing . See, e.g., Cai & Yuan (2012) and Du & Wang (2014) for some scalar-response examples where the FPCA approach breaks down when the aforementioned situation happens. Secondly, the truncation point is integer-valued and thus only has a discrete control on the model complexity. This puts it at disadvantage against the roughness penalty regularization approach, which offers a continuous control via a positive and real-valued smoothing parameter (Ramsay & Silverman, 2005, Chapter 5).

In this paper, we consider a penalized function-on-function regression approach to estimating the bivariate coefficient function . There have been a few recent developments in the direction of penalized function-on-function regression. Lian (2015) studied the convergence rates of the function-on-function regression model under a reproducing kernel Hilbert space framework. Although his model resembled model (1), he developed everything with the variable fixed and did not enforce any regularization on the direction. Firstly, this lack of -regularization can be problematic since this leaves the noisy errors on the direction completely uncontrolled and can result in an estimate that is very rough on the direction. Secondly, this simplification of fixing essentially reduces the problem to a functional linear model with a scalar response and thus makes all the results in Yuan & Cai (2010) directly transferrable even without calling on any new proofs. The R package fda maintained by Ramsay et al. has implemented a version of penalized B-spline estimation of with a fixed smoothing parameter. Ivanescu et al. (2015)

considered a penalized function-on-function regression model where the coefficient functions were represented by expansions into some basis system such as tensor cubic B-splines. Quadratic penalties on the expansion coefficients were used to control the smoothness of the estimates. This work provided a nice multiple-predictor-function extension to the function-on-function regression model in the

fda package. Scheipl & Greven (2016) studied the identifiability issue in these penalized function-on-function regression models. However, this penalized B-spline approach has several well-known drawbacks. First, it is difficult to show any theoretical optimality such as the minimax risk of mean prediction in Cai & Yuan (2012). So its theoretical soundness is hard to justify. Moreover, the B-spline expansion is only an approximate solution to the optimization of the penalized least squares score. Hence the penalized B-spline estimate is not numerically optimal from the beginning either. These drawbacks can have negative impacts on the numerical performance as we shall see from the simulation results in Section 4.

The penalized function-on-function regression method proposed in this paper obtains its estimator of through the minimization of penalized least squares on a reproducing kernel Hilbert space that is naturally associated with the roughness penalty. Such a natural formulation through a reproducing kernel Hilbert space offers several advantages comparing with the existing penalized function-on-function regression methods. Firstly, it allows us to establish a Representer Theorem which states that, although the optimization of the penalized least squares is defined on an infinite dimensional function space, its solution actually resides in a data-adaptive finite dimensional subspace. This result guarantees an exact solution when the optimization is carried out on this finite dimensional subspace. This result itself is a nontrivial generalization of the Representer Theorems in the scenarios of nonparametric smooth regression model (Wahba, 1990) and the penalized functional regression model with a scalar response (Yuan & Cai, 2010). Based on the Representer Theorem, we propose an estimation algorithm which uses penalized least squares and Gaussian quadrature with the Gauss-Legendre rule to estimate the bivariate coefficient function. The smoothing parameter is selected by the generalized cross validation (GCV) method. Secondly, the reproducing kernel Hilbert space framework allows us to show that our estimator has the optimal rate of mean prediction since it achieves the minimax convergence rate in terms of the excess risk. This generalizes the results in Cai & Yuan (2012) and Du & Wang (2014) for functional linear regression with a scalar response to the functional response scenario. In the numerical study, we have also considered the problem with sparsely sampled data. Particularly, we introduce an extra pre-smoothing step before applying the proposed penalized functional regression model. The pre-smoothing step implements the principal-component-analysis-through-expectation (PACE) method in Yao et al. (2005a). Our extensive simulation studies demonstrate the numerical advantages of our method over the existing ones. In summary, our method has the following distinguishing features: (i) it makes no structural dependence assumptions of over the predictor and response processes; (ii) the Representer Theorem guarantees an exact solution instead of an approximation to the optimization of the penalized score; (iii) benefited from the Representer Theorem, we develop a numerically reliable algorithm that has sound performance in simulations; (iv) we show theoretically the estimator achieves the optimal minimax convergence rate in mean prediction.

The rest of the paper is organized as follows. In Section 2, we first derive a Representer Theorem showing that the solution of the minimization of penalized least squares can be found in a finite-dimension subspace. In addition, an easily implementable estimation algorithm is considered in the Section 2. In Section 3, we prove that our method has the optimal rate of mean prediction. Numerical experiments are reported in Section 4, where we compare our method with the functional linear regressions in (Ramsay & Silverman, 2005; Yao et al., 2005b) in terms of prediction accuracy. Two real data examples, the Canadian weather data, and the histone regulation data are analyzed in Section 5. Discussion in Section 6 concludes the paper. Proofs of the theorems are collected in Supplementary Material.

## 2 Penalized Functional Linear Regression Method

We first introduce a simplification to model (1). Since model (1) implies that

 Yi(t)−EYi(t)=∫Ixβ(t,s){Xi(s)−EXi(s)}ds+ϵi(t),    t∈Iy,

we may, for simplicity, only consider and to be centered, i.e., . Thus, the functional linear regression model takes the form of

 Yi(t)=∫Ixβ(t,s)Xi(s)ds+ϵi(t),    t∈Iy. (2)

### 2.1 The Representer Theorem

Assume that the unknown resides in a reproducing kernel Hilbert space with the reproducing kernel , where . The estimate can be obtained by minimizing the following penalized least squares functional

 1nn∑i=1∫Iy{Yi(t)−∫Ixβ(t,s)Xi(s)ds}2dt+λJ(β) (3)

with respect to , where the sum of integrated squared errors represents the goodness-of-fit, is a roughness penalty on , and is the smoothing parameter balancing the trade-off. We now establish a Representer Theorem stating that actually resides in a finite dimensional subspace of . This result generalizes Theorem 1 in Yuan & Cai (2010) and facilitates the computation by reducing an infinite dimensional optimization problem to a finite dimensional one.

Note that the penalty functional is a squared semi-norm on . Its null space is a finite-dimensional linear subspace of . Denote by its orthogonal complement in such that . For any , there exists a unique decomposition where and . Let and be the corresponding reproducing kernels of and . Then and are both nonnegative definite operators on , and . In fact the penalty term . By the theory of reproducing kernel Hilbert spaces, has a tensor product decomposition . Here is the reproducing kernel Hilbert space with a reproducing kernel , and is the reproducing kernel Hilbert space with a reproducing kernel . For the reproducing kernels, we have . Note that the functions in and are univariate and defined respectively on and . Similar to the decomposition of and , we have the tensor sum decompositions of the marginal subspaces and , and the orthogonal decompositions of the marginal reproducing kernels and . Here is a reproducing kernel on with running through the index set .

Upon piecing the marginal decomposition parts back to the tensor product space, we obtain and . Correspondingly, the reproducing kernels satisfy that

 K0((t1,s1),(t2,s2)) =K0y(t1,t2)K0x(s1,s2), K1((t1,s1),(t2,s2)) =K0y(t1,t2)K1x(s1,s2)+K1y(t1,t2)K0x(s1,s2)+K1y(t1,t2)K1x(s1,s2).

Let and . Denote by and respectively the basis functions of and . With some abuse of notation, define and . Now we can state the Representer Theorem as follows with its proof collected in the Supplementary Material.

###### Theorem 2.1

Let be the minimizer of (3) in . Then resides in the subspace of functions of the form

 β(t,s) ={Ny∑k=1dk,βyψk,y(t)+n∑i=1ci,βy(K1yYi)(t)}{Nx∑l=1dl,βxψl,x(s)+n∑j=1cj,βx(K1xXj)(s)} ={d⊤βyψy(t)+c⊤βy(K1yY)(t)}{d⊤βxψx(s)+c⊤βx(K1xX)(s)}, (4)

where , , and

are some coefficient vectors, and

and are vectors of functions.

For the purpose of illustration, we give a detailed example below.

###### Example 2.1

Consider the case of tensor product cubic splines with . The marginal spaces with the inner product

 ⟨f,g⟩Hy=(∫10f∫10g+∫10f′∫10g′)+∫10f′′g′′dt.

The marginal space can be further decomposed into the tensor sum of and . The reproducing kernel is the orthogonal sum of and , where is a scaled version of the Bernoulli polynomial . The space has a dimension of and a set of basis functions .

The function space is defined as with the reproducing kernel and the penalty functional

We have and ; see, e.g., Chapter 2 of Gu (2013).

### 2.2 Estimation Algorithm

To introduce the computational algorithm, we first need some simplification of notation. Let and . We rewrite the functions spanning the subspace in Theorem 2.1 as , , and , , . Thus a function in this subspace has the form for some coefficient vectors , and vectors of functions , . To solve (3), we choose Gaussian quadrature with the Gauss-Legendre rule to calculate the integrals. Consider the Gaussian quadrature evaluation of an integral on with knots and weights such that . Let be the diagonal matrix with repeating times on the diagonal. Then the estimation of in (3) reduces to the minimization of

 (5)

with respect to and , where with , with being an matrix with the th entry , with being an matrix with the th entry , and is a matrix with the th entry . Let , , and , we have .

We then utilize standard numerical linear algebra procudures such as the Cholesky decomposition with pivoting and forward and back substitutions, to calculate and in (5) (Gu, 2013, Section 3.5). To choose the smoothing parameter in (5), a modified Generalized Cross-Validation (GCV) score (Wahba & Craven, 1979),

 V(λ)=(nT)−1YTw(I−A(λ))2Yw{(nT)−1tr(I−αA(λ))}2 (6)

is implemented, where is a fudge factor curbing undersmoothing (Kim & Gu, 2004) and is the smoothing matrix bridging the prediction and the observation as , similar to the hat matrix in a general linear model.

## 3 Optimal Mean Prediction Risk

We are interested in the estimation of coefficient function and mean prediction, that is, to recover the functional based on the training sample , . Let be an estimate of . Suppose is a new observation that has the same distribution as and is also independent of , . Then the prediction accuracy can be naturally measured by the excess risk

 Rn(^βn) = ∫Iy[E∗{Yn+1(t)−∫Ix^βn(t,s)Xn+1(s)ds}2−E∗{Yn+1(t)−∫Ixβ(t,s)Xn+1(s)ds}2]dt = ∫IyE∗{η^βn(Xn+1,t)−ηβ(Xn+1,t)}2dt

where represents the expectation taken over only. We shall study the convergence rate of as the sample size increases.

This section collects two theorems whose combination indicates that our estimator achieves the optimal minimax convergence rate in mean prediction. We first establish the minimax lower bound for the convergence rate of the excess risk . There is a one-to-one relationship between and which is a linear functional space endowed with an inner product such that

 β(t,s)=⟨K((t,s),⋅),β⟩H(K),   for any β∈H(K).

The kernel can also be treated as an integral operator such that

 K(β)(⋅)=⟨K((t,s),⋅),β⟩L2=∫∫IK((t,s),⋅)β(t,s)dtds.

It follows from the spectral theorem that there exist a set of orthonormal eigenfunctions

and a sequence of eigenvalues

such that

 K((t1,s1),(t2,s2))=∞∑k=1κkζk(t1,s1)ζk(t2,s2),    K(ζk)=κkζk,  k=1,2,….

Denote Let be the covariance kernel of . Define a new kernel such that

 (7)

Let be the eigenvalues of and be the corresponding eigenfunctions. Therefore,

 Π((t1,s1),(t2,s2))=∞∑k=1ρkϕk(t1,s1)ϕk(t2,s2),   ∀ (t1,s1),(t2,s2)∈Iy×Ix.
###### Theorem 3.1

Assume that for any

 (8)

for a positive constant . Suppose that the eigenvalues of the kernel in (7) satisfy for some constant . Then,

 limA→∞limn→∞supβ∈H(K)P{Rn≥An−2r2r+1}=0, (9)

when is of order .

Theorem 3.1 indicates that the convergence rate is determined by the decay rate of the eigenvalues of this new operator , which is jointly determined by both reproducing kernel and the covariance kernel as well as the alignment between and in a complicated way. This result has not been reported in the literature before. A close and related result is from Yuan & Cai (2010) who studied an optimal prediction risk for functional linear models, where the optimal rate depends on the decay rate of the eigenvalues of . It is interesting to see, on the other hand, whether the convergence rate of in Theorem 3.1 is optimal. In the following, we derive a minimax lower bound for the risk.

###### Theorem 3.2

Let be as in Theorem 3.1. Then the excess prediction risk satisfies

 limc→0limn→∞inf~ηsupβ∈H(K)P(Rn≥cn−2r2r+1)=1, (10)

where the infimum is taken over all possible predictors based on .

Theorem 3.2 shows that the minimax lower bound of the convergence rate for the prediction risk is , which is determined by and the decay rate of the eigenvalues of . We have shown that this rate is achieved by our penalized estimator, and therefore our estimator is rate-optimal.

## 4 Numerical Experiments

We compared the proposed optimal penalized function-on-function regression (OPFFR) method with existing function-on-function linear regression models under two different designs. In a dense design, each curve was densely sampled at regularly-spaced common time points. We compared the OPFFR with two existing models. In a sparse design, each curve was irregularly and sparsely sampled at possibly different time points. We extended the OPFFR to this design by adding an extra pre-smoothing step and compared it with the FPCA model. In the first model (Ramsay & Silverman, 2005) for comparison, the coefficient function is estimated by penalizing its B-spline basis function expansion. This approach does not have the optimal mean prediction property and partially implemented in the fda package of R (linmod function) for the case of a fixed smoothing parameter. We shall add a search on the grid for smoothing parameter selection to their implementation and denote this augmented approach by FDA. The coefficient function is represented in terms of 10 basis functions each for the and directions. The second model for comparison was the functional principal component analysis (hence denoted by FPCA) approach proposed by Yao et al. (2005b). The coefficient function is represented in terms of the leading functional principal components. This is implemented in the MatLab package PACE (FPCreg

function) maintained by the UC-Davis research group. The Akaike information criterion (AIC) and fraction of variance explained (FVE) criterion were used to select the number of principal components for predictor and response respectively. The cutoff value for FVE was

. The ‘regular’ parameter was set to 2 for the dense design and 0 for the sparse design. No binning was performed.

### 4.1 Simulation Study

#### 4.1.1 Dense Design

We simulated data according to model (2) with three scenarios.

• Scenario 1: The predictor functions are , where

is from the uniform distribution

, and if and otherwise. The coefficient function is the exponential function of and .

• Scenario 2: The predictor functions are the same as those in Scenario 1 and the coefficient function .

• Scenario 3: The predictor functions are generated as , where if and otherwise. The coefficient function .

For each simulation scenario, we generated samples, each with 20 time points on the interval . The random errors

were from a normal distribution with a constant variance

. The value of was adjusted to deliver three levels of signal-to-noise ratio (SNR, 5, and 10) in each scenario. To assess the mean prediction accuracy, we generated an additional predictor curves and computed the mean integrated squared error , where was the estimator obtained from the training data. We had 100 runs for each combination of scenario and SNR. Figure 3: Perspective plots of the true β(t,s) in three scenarios, and their respective estimates by the OPFFR, FDA, and FPCA methods when SNR=10.

We applied the OPFFR, FDA and FPCA methods to the simulated data sets. Figure 3 displayed the perspective plots of the true coefficient functions in the three scenarios as well as their respective estimates for a single run with SNR. In the first two scenarios, both OPFFR and FDA did a decent job in recovering the true coefficient function although the FDA estimates were slightly oversmoothed. In both scenarios the FPCA estimates clearly suffered since the true coefficient function could not be effectively represented by the eigen-functions of the predictor processes. Figure 4: Boxplots of log2(MISE) for three scenarios under three signal-to-noise ratios (SNR=0.5, 5, 10), based on 100 simulation runs. OPFFR is the proposed approach.

Figure 4 gave the summary reports of performances in terms of MISEs based on 100 runs. When the signal to noise ratio is low, the OPFFR and FDA approaches had comparable performances. But when the signal to noise ratio increases, OPFFR showed clear advantage against FDA. The FPCA method failed to deliver competitive performance against the other two methods in all the settings due to its restrictive requirement of the effective representation of the coefficient function.

#### 4.1.2 Sparse Design

In this section, we compared the performance of the proposed OPFFR method and the FPCA method regarding prediction error on sparsely, irregularly, and noisily observed functional data. To extend our method to sparsely and noisily observed data, we first applied the principal-component-analysis-through-conditional-expectation (PACE) method in Yao et al. (2005a) to the sparse functional data. Then we obtained a dense version of functional data by computing the PACE-fitted response and predictor functions at 50 selected time points for each curve. We applied the OPFFR method to these densely generated data and called this sparse extension to the OPFFR by the OPFFR-S method. The original OPFFR method, FPCA and OPFFR-S methods were all applied to the simulated data for comparison.

We first generated samples for both response and predictor functions in Scenario 3, each with 50 time points on interval . To obtain different sparsity levels, we then randomly chose 5, 10 and 15 time points from the 50 ones for each curve independently. Normally distributed random errors were added to functional response and predictor with the SNR set to 10 in generating each pair of noisy response and predictor. The mean integrated squared error (MISE) was calculated based on additional predictor curves without random noises.

Figure 5 displayed the perspective plots of the true coefficient functions in the sparse scenario as well as their respective estimates for a single run with 10 sampled time points per curve. The OPFFR-S method and FPCA performed well in estimating the coefficient function. The estimate recovered by the original OPFFR method was a little oversmoothed. In Figure 6, the performance in terms of MISEs based on 100 runs was compared. The OPFFR-S method always had the best prediction performances at all the three sparsity levels. When the sparsity level was high (5 time points per curve), the original OPFFR method had a worse prediction performance than the FPCA. However, its prediction performance quickly picked up as the data became denser. When the sparsity level was 15 time points per curve, it actually delivered a better prediction performance than the FPCA. Such an interesting phenomenon was referred to as the “phase transistion” (Cai & Yuan, 2011; Wang et al., 2016). Figure 5: Perspective plots of the true β(t,s) in the sparse scenario, and their respective estimates by the OPFFR-S, FPCA, and OPFFR methods when the number of randomly selected time points is ten. Figure 6: Boxplots of MISEs for the sparse scenario under three different sparsity levels, based on 100 simulation runs. The boxplots with different grayscale shades from left to right respectively represent the sparsity levels of 5, 10 and 15 time points per curve.

## 5 Real Data Examples

We analyzed two real example in this section. We showed that our method had the numerical advantage over other approaches in terms of prediction accuracy in the analysis of the Canadian weather and histone regulation data. The results in the Canadian weather data, a dense design case, and the histone regulation data, a sparse design case, echoed with our findings in the simulation study. The smoothing parameters used in FDA for Canadian weather data were taken from the example codes in Ramsay et al. (2009) and seven basis functions were used for the and directions respectively. In the histone regulation data we selected the smoothing parameter for FDA by a grid search on and used six basis functions each for the and directions. For the FPCA method, the ‘regular’ parameter was set to for the Canadian weather data and for the histone regulation data. The other parameters for FDA and FPCA approaches were the same as those used in the simulation study.

We first look at the Canadian weather data (Ramsay & Silverman, 2005), a benchmark data set in functional data analysis. The main goal is to predict the log daily precipitation profile based on the daily temperature profile for a geographic location in Canada. The daily temperature and precipitation data averaged over 1960 to 1994 were recorded at 35 locations in Canada. We compared OPFFR with FDA and FPCA in terms of prediction performance defined by integrated squared error (ISE) , where and was estimated by the dataset without the th observation. For the convenience of calculation, we computed at a grid of values as the surrogate of ISE. Since the findings through the coefficient function estimates were similar to those in Ramsay & Silverman (2005), we only focused on the comparison of prediction performance. The summary in Table  1 clearly showed the numerical advantage of the proposed OPFFR method over the FDA and FPCA methods.

### 5.2 Histone Regulation Data

Nucleosomes, the basic units of DNA packaging in eukaryotic cells, consist of eight histone protein cores including two copies of H2A, H2B, H3, and H4. Besides the role as DNA scaffold, histones provide a complex regulatory platform for regulating gene activity (Wollmann et al., 2012). Focused study of the interaction between histones and gene activity may reveal how the organisms respond to the environmental changes. There are multiple sequence variants of histone proteins, which have some amino acid changes compared to their primary sequence, coexist in the same nucleus. For instance, in both plants and animals, there exist three variants of H3, the H3.1, the H3.3, and the centromere-specific CENP-A (CENH3) (Deal & Henikoff, 2011). Each variant shows distinct regulatory mechanisms over gene expression.

In this paper, an ultra-high throughput time course study was conducted to explore the interaction mechanism between the gene activity and histone variant, H3.3, during heat stress in Arabidopsis thaliana. In this study, the 12-day-old Arabidopsis seedlings that had been grown at were subject to heat stress of , and plants were harvested at 7 different time points within 24 hours for RNA sequencing (RNA-seq) (Wang et al., 2009) and ChIP sequencing (ChIP-seq) (Johnson et al., 2007) experiments. We were interested in the genes responding to the heat shock, therefore 160 genes in response to heat (GO:0006951) pathway (Ashburner et al., 2000) were chosen. We selected 55 genes with the fold change above 0.5 at at least two consecutive time points in RNA-seq data. In ChIP-seq experiments, we calculated the mean of normalized read counts by taking the average of normalized read counts over seven time points for the region of 350 base pairs (bp) in the downstream of transcription start sites (TSS) of selected 55 genes. The normalized read counts over 350 bp from ChIP-seq and the normalized fragments per kilobase of transcript per million mapped reads (FPKM) (Trapnell et al., 2010) over seven time points from RNA-seq were used to measure the histone levels and gene expression levels respectively.

We applied the OPFFR, FDA and FPCA methods to histone regulation data in example 1.2. Since the gene expression levels were sparsely observed, we also applied the OPFFR-S method to the data. The comparison of the four methods is shown in Table 2. In the table, the standard deviation of ISEs was the only measure that neither the OPFFR nor the OPFFR-S was the most optimal. This was caused by a few observations where all the methods failed to make a good prediction and the OPFFR methods happened to have larger ISEs. In terms of all the other measures, the proposed OPFFR and OPFFR-S methods clearly showed the advantage in prediction accuracy again. Since the results from the OPFFR and OPFFR-S were comparable to each other, we chose to present all the following results based on the OPFFR analysis.

Figure 7 is the plot of the fitted coefficient function generated from our OPFFR method. For region between 300 bp and 350 bp, there was a strong negative influence of H3.3 on genes activity from half hour to 8 hours. It indicted that the loss of H3.3 might have the biological influence on the up-regulation of heat-induced genes. This negative correlation phenomenon was also observed after 30 minutes on the region of 250 bp to 300 bp between H3.3 and gene activity. In addition, the region from 50 bp to 150 bp had a positive effect on genes activity over time domain from 0 hour to half hour and 4 hours to 8 hours. Therefore, we provided a numerical evidence that heat-shock-induced transcription of genes in response to heat stress might be regulated via the epigenetic changes of H3.3, especially on the downstream region of TSS. The sample plots in Figure 8 showed a nice match of the predicted gene expression curves with the observed values. Figure 7: The estimated coefficient function β(t,s) for the histone regulation study. The y-axis label represents the positions on genomes and x-axis label represents seven time points. Figure 8: The fitted response functions for six genes in the histone regulation study. The y-axis stands for the normalized expression levels and x-axis label represents seven time points. The curve fitted using OPFFR is in the solid line, with the data in circles.

## 6 Conclusion

In this article, we have presented a new analysis tool for modeling the relationship of a functional response against a functional predictor. The proposed method is more flexible and generally delivers a better numerical performance than the FPCA approach since it does not have the restrictive structural dependence assumption on the coefficient function. When compared with the penalized B-splines method, the proposed method has the theoretical advantage of possessing the optimal rate for mean prediction as well as some numerical advantage as shown in the numerical studies. Moreover, the Representer Theorem guarantees an exact solution to the penalized least squares, a property that is not shared by the existing penalized function-on-function regression models. The application of our method to a histone regulation study provided numerical evidence that the changes in H3.3 might regulate some genes through transcription regulations. Although such a finding sheds light on the relationship between histone variant H3.3 and gene activity, the details of the regulation process are still unknown and merit further investigations. For instance, we may investigate how the H3.3 organizes the chromatins to up-regulate those active genes. Such investigations would call for more collaborations between statisticians and biologists.

When the regression model has a scalar response against one or more functional predictors, methods other than the roughness penalty approach are available to overcome the inefficient basis representation drawback in the FPCA method. For example, Delaigle et al. (2012) considered a partial least squares (PLS) based approach. Ferré & Yao (2003) and Yao et al. (2015) translated the idea of sufficient dimension reduction (SDR) into the setting of functional regression models. Intuitively, these methods might be more efficient in their selection of the principal component basis functions since they incorporate the response information into consideration. However, our experiments with a functional response version of the functional PLS (Preda & Saporta, 2005), not shown here due to space limit, did not look so promising. Therefore, further investigation in this direction is surely needed.

## References

• (1)
• Ashburner et al. (2000) Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T. et al. (2000), ‘Gene Ontology: tool for the unification of biology’, Nature Genetics 25(1), 25–29.
• Ausió (2006) Ausió, J. (2006), ‘Histone variants—the structure behind the function’, Briefings in Functional Genomics & Proteomics 5(3), 228–243.
• Benatia et al. (2015) Benatia, D., Carrasco, M. & Florens, J.-P. (2015), Functional linear regression with functional response, Technical report, Départment de sciences économiques, Université de Montréal, Montréal, Canada.
• Cai & Yuan (2011)

Cai, T. T. & Yuan, M. (2011), ‘Optimal estimation of the mean function based on discretely sampled functional data: Phase transition’,

The Annals of Statistics 39(5), 2330–2355.
• Cai & Yuan (2012) Cai, T. T. & Yuan, M. (2012), ‘Minimax and adaptive prediction for functional linear regression’, Journal of the American Statistical Association 107(499), 1201–1216.
• Crambes & Mas (2013) Crambes, C. & Mas, A. (2013), ‘Asymptotics of prediction in functional linear regression with functional outputs’, Bernoulli 19(5B), 2627–2651.
• Cucker & Smale (2002) Cucker, F. & Smale, S. (2002), ‘On the mathematical foundations of learning’, Bulletin of the American Mathematical Society 39(1), 1–49.
• Cuevas et al. (2002) Cuevas, A., Febrero, M. & Fraiman, R. (2002), ‘Linear functional regression: The case of fixed design and functional response’, Canadian Journal of Statistics 30(2), 285–300.
• Deal & Henikoff (2011) Deal, R. B. & Henikoff, S. (2011), ‘Histone variants and modifications in plant gene regulation’, Current Opinion in Plant Biology 14(2), 116–122.
• Delaigle et al. (2012) Delaigle, A., Hall, P. et al. (2012), ‘Methodology and theory for partial least squares applied to functional data’, The Annals of Statistics 40(1), 322–352.
• Du & Wang (2014) Du, P. & Wang, X. (2014), ‘Penalized likelihood functional regression’, Statistica Sinica 24, 1017–1041.
• Ferraty et al. (2011) Ferraty, F., Laksaci, A., Tadj, A. & Vieu, P. (2011), ‘Kernel regression with functional response’, Electronic Journal of Statistics 5, 159–171.
• Ferré & Yao (2003)

Ferré, L. & Yao, A.-F. (2003), ‘Functional sliced inverse regression analysis’,

Statistics 37(6), 475–488.
• Gu (2013) Gu, C. (2013), Smoothing Spline ANOVA Models (2nd Ed.), Springer-Verlag, New York.
• Ivanescu et al. (2015) Ivanescu, A. E., Staicu, A.-M., Scheipl, F. & Greven, S. (2015), ‘Penalized function-on-function regression’, Computational Statistics 30(2), 539–568.
• Johnson et al. (2007) Johnson, D. S., Mortazavi, A., Myers, R. M. & Wold, B. (2007), ‘Genome-wide mapping of in vivo protein-DNA interactions’, Science 316(5830), 1497–1502.
• Kim & Gu (2004) Kim, Y.-J. & Gu, C. (2004), ‘Smoothing spline gaussian regression: more scalable computation via efficient approximation’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 66(2), 337–356.
• Lian (2015) Lian, H. (2015), ‘Minimax prediction for functional linear regression with functional responses in reproducing kernel Hilbert spaces’,

Journal of Multivariate Analysis

140, 395–402.
• Malfait & Ramsay (2003) Malfait, N. & Ramsay, J. O. (2003), ‘The historical functional linear model’, Canadian Journal of Statistics 31(2), 115–128.
• Meyer et al. (2015) Meyer, M. J., Coull, B. A., Versace, F., Cinciripini, P. & Morris, J. S. (2015), ‘Bayesian function-on-function regression for multilevel functional data’, Biometrics 71(3), 563–574.
• Preda & Saporta (2005) Preda, C. & Saporta, G. (2005), ‘PLS regression on a stochastic process’, Computaional Statistics & Data Analysis 48(1), 149–158.
• Ramsay et al. (2009) Ramsay, J. O., Hooker, G. & Graves, S. (2009), Functional Data Analysis with R and MATLAB, Springer Science & Business Media.
• Ramsay & Silverman (2005) Ramsay, J. O. & Silverman, B. W. (2005), Functional Data Analysis, Springer-Verlag, New York.
• Scheipl & Greven (2016) Scheipl, F. & Greven, S. (2016), ‘Identifiability in penalized function-on-function regression models’, Electronic Journal of Statistics 10, 495–526.
• Trapnell et al. (2010) Trapnell, C., Williams, B. A., Pertea, G., Mortazavi, A., Kwan, G., Van Baren, M. J., Salzberg, S. L., Wold, B. J. & Pachter, L. (2010), ‘Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation’, Nature Biotechnology 28(5), 511–515.
• Wahba (1990) Wahba, G. (1990), Spline Models for Observational Data, Vol. 59 of CBMS-NSF Regional Conference Series in Applied Mathematics, SIAM, Philadelphia.
• Wahba & Craven (1979) Wahba, G. & Craven, P. (1979), ‘Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation.’, Numerische Mathematik 31, 377–403.
• Wang et al. (2016) Wang, J.-L., Chiou, J.-M. & Müller, H.-G. (2016), ‘Functional data analysis’, Annual Review of Statistics and Its Application 3, 257–295.
• Wang et al. (2009) Wang, Z., Gerstein, M. & Snyder, M. (2009), ‘RNA-Seq: a revolutionary tool for transcriptomics’, Nature Reviews Genetics 10(1), 57–63.
• Wollmann et al. (2012) Wollmann, H., Holec, S., Alden, K., Clarke, N. D., Jacques, P.-E. & Berger, F. (2012), ‘Dynamic deposition of histone variant H3.3 accompanies developmental remodeling of the Arabidopsis transcriptome’, PLoS Genetics 8(5), e1002658.
• Yao et al. (2015) Yao, F., Lei, E. & Wu, Y. (2015), ‘Effective dimension reduction for sparse functional data’, Biometrika 102(2), 421–437.
• Yao et al. (2005a) Yao, F., Müller, H.-G. & Wang, J.-L. (2005a), ‘Functional data analysis for sparse longitudinal data’, Journal of the American Statistical Association 100(470), 577–590.
• Yao et al. (2005b) Yao, F., Müller, H.-G. & Wang, J.-L. (2005b), ‘Functional linear regression analysis for longitudinal data’, The Annals of Statistics 33(6), 2873–2903.
• Yuan & Cai (2010) Yuan, M. & Cai, T. T. (2010), ‘A reproducing kernel Hilbert space approach to functional linear regression’, The Annals of Statistics 38(6), 3412–3444.