 # Tree-based boosting with functional data

In this article we propose a boosting algorithm for regression with functional explanatory variables and scalar responses. The algorithm uses decision trees constructed with multiple projections as the "base-learners", which we call "functional multi-index trees". We establish identifiability conditions for these trees and introduce two algorithms to compute them: one finds optimal projections over the entire tree, while the other one searches for a single optimal projection at each split. We use numerical experiments to investigate the performance of our method and compare it with several linear and nonlinear regression estimators, including recently proposed nonparametric and semiparametric functional additive estimators. Simulation studies show that the proposed method is consistently among the top performers, whereas the performance of any competitor relative to others can vary substantially across different settings. In a real example, we apply our method to predict electricity demand using price curves and show that our estimator provides better predictions compared to its competitors, especially when one adjusts for seasonality.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Functional data consist of a sample of functional variables, which are often smooth functions measured on a discrete grid. In this article, we study regression problems with a functional covariate and a scalar response , also called “scalar-on-function” regression, where the main goal is to estimate a function in order to make predictions for future observations of .

Functional data are intrinsically infinite dimensional, and that poses challenges that vary with how the regression model is defined. Assuming that are square-integrable functions defined over an interval , the linear functional regression model is defined as where is the functional regression coefficient and denotes the usual inner-product associated with the Hilbert space . Since is infinite dimensional, its estimation requires some form of dimension reduction. A very common way of fitting this model is to represent in some basis system (most often splines or functional principal components (FPCs)), and control its smoothness by regularizing the expansion, either through truncating the number of basis functions or by adding roughness penalties. A number of proposals in the literature explored different basis expansions and regularization strategies, for example reiss2007functional, zhao2012wavelet, hall2007methodology, and cardot2003spline. For non-Gaussian responses, several authors studied the generalized functional linear model for a known link function , see, e.g. james2002generalized, cardot2005estimation, muller2005generalized, and dou2012estimation.

A well-specified linear model typically leads to a stable estimator associated with easy interpretation, however, a misspecified one can result in unreliable conclusions. Compared to linear models, nonparametric ones exhibit a higher degree of flexibility and have received recent attention in the functional context. We refer to ferraty2006nonparametric and ling2018nonparametric for general discussions on this topic. Various proposals adapted nonparametric regression in the finite dimensional case to functional data, many of which are based on kernels that require distances between pairs of explanatory variables. For functional data, natural measures of proximity involve notions of shape as those given by the similarities of derivatives or FPC scores (shang2016bayesian). This suggests the use of semi-metrics as distance measures for many kernel-based functional regression estimators, including functional local-constant/linear estimators (ferraty2002functional; baillo2009local; barrientos2010locally; berlinet2011local), functional k-nearest neighbours (burba2009k; kudraszow2013uniform; kara2017data), functional recursive kernel estimator (amiri2014recursive), and Reproducing Kernel Hilbert Space (RKHS) methods (preda2007regression; avery2014rkhs). To avoid having to choose a semi-metric if several are available, ferraty2009additive and febrero2013generalized suggested to combine kernel estimators constructed with different semi-metrics.

It is well known that in finite dimensional settings nonparametric estimators suffer from the “curse of dimensionality” due to the sparseness of data in high-dimensional spaces. Similarly, the performance of kernel estimators for functional data is affected by sparsity in the infinite dimensional space, making it difficult to select a bandwidth when a data point has very few neighbours

(geenens2011curse; mas2012lower). Exceptions derived from (generalized) additive models avoid this issue by imposing some structure on the target function. For example, muller2008functional introduced a functional additive model with each component associated to a FPC score; muller2013continuously and mclean2014functional studied a continuous additive model and its generalized version where the additivity for both models occurs in the time domain. Their methods use splines to fit the smooth regression surface with different regularization strategies.

Beyond linear and nonparametric models, there have been several developments towards semi-parametric regression for functional data, particularly in the direction of index-based models. lingsemiparametric summarized advances in this field in a recent survey. ferraty2003modele first proposed a functional single index model , where is an unknown smooth function and is the functional regression coefficient. The computational and theoretical aspects of the proposal in ferraty2003modele were further explored by ait2008cross, ferraty2011estimation, jiang2011functional, and goia2015partitioned.

To estimate more complex regression functions, some proposals studied multi-index models:

 F(X)=r(⟨X,α1⟩,...,⟨X,αp⟩), (1)

where is the number of indices and is an unknown smooth function. amato2006dimension

proposed an extension of minimum average variance estimation to functional data. Their method requires a

-dimensional kernel smoother to estimate , which depends on a computationally expensive procedure to select bandwidths and needs a large sample size to obtain a stable estimator when is large. A few other proposals were derived from sliced inverse regression (li1991sliced), including functional sliced inverse regression (ferre2003functional)

, functional k-means inverse regression

(wang2014functional), and functional sliced average variance estimation (lian2014series). These methods avoid -dimensional smoothing but require more restrictive assumptions on the distribution of .

Another group of methods approximates in (1) with a finite sum of single index terms:

 F(X)≈r1(⟨X,α1⟩)+...+rp(⟨X,αp⟩). (2)

This additive approximation is attractive since each component can be estimated by a single dimensional smoother that can be computed efficiently. chen2011single and ferraty2013functional proposed iterative algorithms that estimate the ’s sequentially with kernel estimators, fitting one per iteration while keeping the previous ones fixed. For non-Gaussian responses, james2005functional considered a generalized functional index model with a known link function and estimated the ’s with spline smoothers.

In this article, we propose a new way to estimate the regression function using a boosting algorithm. There exist other proposals that developed boosting-type algorithms for functional data, but they either apply to tasks different from ours (e.g. identifying the most predictive design points), or do not consider multi-index estimators (ferraty2010most; tutz2010feature; fan2015functional; ferraty2009additive; chen2011single; ferraty2013functional; greven2017general)

. Compared to these proposals, our approach fits more “closely” into a boosting framework by minimizing a loss function with a gradient descent-like procedure and applying regularization via shrinkage and early stopping.

Like gradient boosting, our algorithm constructs a regression estimator using a linear combination of “base learners”. The proposed algorithm can be thought of as fitting an approximation of the regression function

 F(X)≈r1(⟨X,β1,1⟩,...,⟨X,β1,k⟩)+...+rT(⟨X,βT,1⟩,...,⟨X,βT,K⟩), (3)

where each of the components is characterized by indices. Compared to (2), the approximation (3) enables modelling more complex structures of the indices by including possible interactions. We fit each component with a tree-type base learner, which we call a “functional multi-index tree”, where the projection directions and are calculated at the -th boosting iteration.

The rest of this paper is organized as follows. In Section 2, we introduce in detail the functional regression model and our boosting algorithm, including identifiability conditions of functional multi-index trees and two algorithms to compute them. Section 3 reports the results of a simulation study comparing our method with existing alternatives. We discuss numerical results for a few representative settings and provide the complete set of results in the Appendix. A case study is presented in Section 4. Finally, Section 5 summarizes our findings and points out directions for future work.

## 2 Methodology

Consider a data set with i.i.d. realizations of the pair , where

are predictor variables in some space

, and is a scalar response. We are interested in estimating a function in order to make predictions for future observations of . Following friedman2001greedy, we define the target function as the minimizer of

 F=argminGEY,XL(Y,G(X)) (4)

over the joint distribution of

, where is a pre-specified loss function such as the squared loss . Gradient boosting (friedman2001greedy) is a method to estimate in (4) by constructing an approximate minimizer of the empirical risk:

 argminFn∑i=1L(yi,F(xi)). (5)

We view the objective in (5

) as a function of the vector

. Similar to the gradient descent algorithm that sequentially adds to the current point the negative gradient vector of the objective function, gradient boosting adds to the current function estimate an approximation to the negative gradient vector. At the -th iteration, the negative gradient vector is computed at the point obtained form the previous iteration :

 ut,i=−∂L(yi,b)∂b∣∣b=^Ft−1(xi), i=1,...,n, (6)

and is approximated using a base learner which we assume is characterized by parameters that are chosen to minimize

 n∑i=1(ut,i−ht(xi;Θt))2,

over members of a family of base learners and over in the parameter space. We then calculate the step size

 ^αt=argminα∈Rn∑i=1L(yi,^Ft−1(xi)+α^ht(xi;^Θt)).

and update

 ^Ft(x)=^Ft−1+γ^αt^ht(x;^Θt), (7)

where is a shrinkage parameter that controls the learning rate. The inclusion of is expected to reduce the impact of each base learner on by approximating more finely the search path, and as a result, improve the prediction accuracy (telgarsky2013margins). This simple and effective strategy can be viewed as a form of regularization. friedman2001greedy showed empirically that a much lower prediction error can be achieved when .

Similar to what the gradient descent algorithm does, gradient boosting minimizes the training loss in a greedy fashion and may eventually overfit the training data at the expense of degraded generalization performance. To avoid this issue, boosting algorithms generally use an “early stopping” strategy that stops training when the performance on the validation set starts to deteriorate. We define an early stopping rule based on the performance on a validation set () that follows the same model as the training set. Usually, the validation set is randomly selected from all available data and set aside from the training set. The early stopping time is defined as the iteration that achieves the lowest loss on :

 Tstop=argmint=1,...,Tmax∑i∈VL(yi,^Ft(xi)), (8)

where is the maximum number of iterations allowed in our algorithm. After training completes, the final boosting estimator is given by

 ^FTstop(x)=^F0(x)+Tstop∑t=1γαt^ht(x;^Θt),

where the initial function estimate is generally a constant defined as

 ^F0(x)=argmina∈Rn∑i=1L(yi,a).

Core to the performance of a boosting algorithm is the choice of base learners. In the functional context, for in a Hilbert space (e.g. the space of square-integrable functions) we will next introduce a flexible class of base learners which we call “functional multi-index trees”. They have the advantage of being capable of fitting functions of that involve multiple projections and their possible interactions. Furthermore, they are scalable to large sample sizes and can easily incorporate additional real-valued explanatory variables.

### 2.1 Functional multi-index tree

When the explanatory variables are vectors in for some , decision trees are commonly used as base learners for gradient boosting, which select variables to make the best splits. Similarly for , we “select” the optimal indices and use them to define the splits that partition the data.

Let be functions which we view as projection directions. The inner-products are corresponding indices projecting onto these directions. At the -th iteration, we compute negative gradients as in (6) and define a functional multi-index tree as the solution to

 (9)

where is a decision tree. In what follows, we will introduce two strategies to compute a functional multi-index tree: one chooses the optimal set of indices over the entire tree, and the other finds a best single index at each split. We refer to the resulting trees as Type A and Type B trees, respectively.

#### 2.1.1 Type A tree

The Type A tree takes a two-level approach where we estimate at the inner level and search for the optimal at the outer level. It is easy to see this two-level structure if we re-state (9) as

 argminβ1,...,βK{argminhn∑i=1(ut,i−h(⟨xi,β1⟩,...,⟨xi,βK⟩))2}. (10)

Given , the inner level is simply fitting a usual decision tree using , …, , as predictors. At the outer level, we find ,…, that minimize

 n∑i=1(ut,i−^h(⟨xi,β1⟩,...,⟨xi,βK⟩))2. (11)

If are unconstrained, the solution to (10) is not unique. For example, for any decision tree and non-zero real values ,

 h(⟨⋅,β1⟩,...,⟨⋅,βK⟩)=~h(⟨⋅,b1β1⟩,...,⟨⋅,bKβK⟩),

where the value at which the -th input of is split is multiplied by in . To avoid this issue, we introduce conditions under which , , …, and are identifiable up to sign changes of each :

, for

###### Condition 2

Each index is chosen by at least one of the splits of a decision tree . For any set of indices , there exist a and , such that for ,

 h(⟨L1(x),β1⟩...,⟨Lj(x),βj⟩...,⟨LK(x),βK⟩) (12)

is a non-constant function of , where if and else , for .

The following result shows that these conditions are sufficient to generate a unique solution to (10) up to sign changes of each . The proof of creftypecap 1 is included in the Appendix.

###### Theorem 1

Suppose that Conditions 1 and 2 hold, then ,…, are identifiable up to sign changes of each . That means for any decision trees and if

 h(⟨x,β1⟩...,⟨x,βK⟩)=~h(⟨x,η1⟩,...,⟨x,ηK⟩) (13)

hold for all , then for some , and .

A convenient approach to find the solution of (10) is to approximate the optimal directions using a flexible basis (e.g. splines). Let ,…, be an orthonormal set in and write . Then creftypecap 1 becomes where is the Euclidean norm and for . creftypecap 2 generally holds for functional multi-index trees except for very special situations, such as one of the indices being equal for all individuals.

With a slight abuse of notation, we define and write the objective in (10) as

 n∑i=1(ut,i−h(cT1⟨xi,ψ⟩,...,cTK⟨xi,ψ⟩))2. (14)

To further simplify the computation to minimize (14) under the condition for , we represent in a spherical coordinate system in order to obtain an optimization problem with simple box constraints.

Referring to blumenson1960derivation, Cartesian coordinates and spherical coordinates are connected through the following equations:

 rj= ⎷d∑l=1c2j,l cj,1=rjcos(θj,1) cj,l=rjcos(θj,l)l−1∏k=1sin(θj,k), l=2,...,d−1 ⋮ cj,d=rjd−1∏k=1sin(θj,k),

where , , and for . As a result of this connection, we can transform to and further reduce creftypecap 1 to . In addition, we restrict , which is equivalent to . This restriction ensures that (14) has a single unique solution as opposed to multiple solutions that are unique up to sign changes.

Finally, we treat the objective of (14) as a function of and minimize it under conditions

 θj,1∈[−π/2,π/2), θj,2,...,θj,d−1∈[0,π]. (15)

This can be achieved by applying generic gradient-free optimization algorithms that allow box constraints. Note that the objective function (14) may not be convex and therefore multiple starting points are recommended when preforming the optimization (see Section 3.1 for details).

#### 2.1.2 Type B tree

Unlike Type A trees which find projection directions to minimize the prediction error of the whole tree, Type B trees adapt the CART algorithm (breiman1984) to select an optimal direction at each split.

At each boosting iteration , we start from ,…, and make the first split at the root node by finding the optimal split and the projection direction that solve

 argminβ{argming∈G1n∑i=1(ut,i−g(⟨xi,β⟩)2}, (16)

where denotes the class of single split trees, also called decision stumps. For any given , the tree partitions the data into two regions (nodes) where the dividing threshold is selected from to minimize the squared prediction error. To make predictions for data points in each region, the average responses of the training data in that region is used.

A Type B tree can be fitted by repeatedly solving (16) at each split. We can view (16) as a special case of fitting a Type A tree with and depth , and apply the algorithm we introduced in Section 2.1.1. Having found the first split at the root node, we repeat the same process in each of the two resulted regions. This process is continued until the tree reaches its specified maximum depth.

This iterative optimization procedure is straightforward, but it can be computationally costly. There is also the concern of performing optimization at nodes that contain very few examples, for instance, nodes at deep levels of the tree, which may result in unstable estimation of or even convergence failure of the optimization algorithm. To avoid these issues, we suggest an alternative approach to approximate the solution of (16).

Rather than solving (16) with respect to all possible s at each split, we select from a pool of randomly generated candidates. As we did for Type A trees, we expand in the space of an orthonormal basis ,…, and let . We randomly generate a large pool of candidate vectors denoted as , where each is an independently sampled vector that satisfies and . This ensures the candidate s satisfy identifiability conditions where for .

At each split, the coefficient vector of the projection direction is selected from . This corresponds to applying the CART algorithm to find a decision tree that minimizes the squared error:

 (17)

where are the explanatory variables and is the response for the -th individual. In order to reduce overfitting, we use different pools randomly chosen at each boosting iteration. This is expected to introduce randomness to the algorithm and allow more directions to be considered.

We point out that Type B trees cannot fully replace Type A trees as either type is preferred in different situations. If the target function is complex and requires a base learner with high complexity to achieve satisfactory prediction accuracy, we suggest using Type B trees, taking advantage of its fast computation and flexible structure. On the other hand, when the target function is simple, parsimonious base learners (e.g. Type A trees with few indices) are often preferred to prevent overfitting. Note that both Type A and B trees can easily include real-valued explanatory variables by replacing in (9) with . More specifically, (14) becomes

 n∑i=1(ut,i−h(cT1⟨xi,ψ⟩,...,cTK⟨xi,ψ⟩,vi))2, (18)

for Type A trees, and (17) becomes

 (19)

for Type B rrees. This extension allows our method to fit partial-functional regression models with mixed-type predictors. In Section 4, we will introduce an example that illustrates the usage of our method in a partial-functional setting.

## 3 Simulation

To assess the numerical performance of our proposed method, we conducted a simulation study comparing it with alternative functional regression methods in the literature.

We generated data sets , consisting of a predictor and a scalar response that follow the model:

 yi=r(xi)+ρϵi, (20)

where the errors are i.i.d , is the regression function, and

is a constant that controls the signal-to-noise ratio (SNR):

 SNR=Var(r(X))Var(ρϵ).

To sample the functional predictors , we considered two models adopted from ferraty2013functional and boente2021robust respectively:

• Model 1 :

 xi(t)=ai+bit2+ciexp(t)+sin(dit),

where , , , and ; and

• Model 2 :

 xi(t)=μ(t)+4∑p=1√λjξijϕj(t),

where , , , and , , and

’s are the first four eigenfunctions of the “Mattern” covariance function

with parameters :

 γ(s,t)=C(√2ν|s−t|ρ), C(u)=σ221−νΓ(ν)uνKν(u),

where is the Gamma function and is the modified Bessel function of the second kind.

For each subject , we evaluated on a dense grid in for and for . Figure 1 shows an example of 10 randomly sampled ’s generated from (left panel) and (right panel). Figure 1: Ten random samples generated from Model 1 (left panel) and Model 2 (right panel).

We considered five regression functions:

• where the first two FPCs ( and ) and mean () for were calculated using a large sample of , while for their true values were used,

• where , and

• where and for , and and for .

These regression functions were selected to represent a single-index model and other nonlinear models (, , and ). To control the level of noise, we considered SNR = 5 and 20 as high and low noise levels and denoted them as and , For each combination of the predictor model (), regression function (), and noise level (), we generated i.i.d. samples of size . The dataset was randomly partitioned into a training set, a validation set, and a test set of size 400, 200, and 1000 respectively. In total, we simulated data from 16 settings, each defined as a combination from the set .

### 3.1 Implementation details

For each setting, we used 100 independently generated datasets and compared the performance of the following estimators:

• FLM1

: functional linear regression with cubic B-splines

(hastie1993statistical),

• FLM2: functional linear regression with FPC scores (cardot1999functional),

• FAM: functional additive models (muller2008functional),

• FAME: functional adaptive model estimation (james2005functional),

• FPPR: functional projection pursuit regression (ferraty2013functional),

• FGAM: functional generalized additive models (mclean2014functional),

• TFBoost(A,K): tree-based functional boosting with Type A trees, where each tree is specified with = 1, 2, or 3 indices, and

• TFBoost(B): tree-based functional boosting with Type B trees.

FLM1 and FLM2 are classical functional regression methods for linear models, and they use different bases to estimate the regression coefficients. For FLM1, we used cubic B-splines with 7 basis functions (3 evenly spaced interior knots), which was found to work generally well compared to using less or more interior knots. We penalized the second derivative of the coefficient and selected the penalization parameter to minimize the mean-squared-prediction-error (MSPE) on the validation set. For FLM2, we used the top 4 FPCs required to explain 99% of the variance.

FAM constructs an additive estimator using FPC scores as predictors. We modified its implementation in the fdapace package (rfdapace) to select the number of FPC scores that minimizes the MSPE on the validation set.

FAME estimates an extension of generalized additive models (GAM) to functional predictors. We used the code shared by the authors of james2005functional and fitted the model using a Gaussian link function. FPPR follows the principle of projection pursuit regression and assumes an additive decomposition with each component being a functional single index model (FSIM). We implemented FPPR using the code to fit a single additive component shared by the authors of ferraty2013functional and built an estimator by adding multiple functional single index estimators. Similarly to what was done with FLM1, for FPPR and FAME we used cubic B-splines with 7 functions (3 evenly spaced interior knots) as the basis. For both methods, we selected the number of additive components between 1 and 15 to minimize the MSPE on the validation set. In our simulation settings, the performance of FAME and FPPR rarely improved beyond 15 additive components.

The implementation of FGAM is available from the refund package in R (rrefund). The method fits a model of the form

 F(X)=μ+∫G(X(t),t)dt,

where , and is estimated using bivariate B-splines with roughness penalties. A restricted maximum likelihood approach is applied to select the parameter that penalizes the second order marginal differences of the basis (see the documentation of fgam in the refund

package for details). We chose to use tensor-type bivariate cubic B-splines of dimensions 15 by 15. In all our settings, this choice ensured a stable fit and almost always resulted in the best performance compared to using other B-splines bases.

We implemented TFBoost in R, including TFBoost(A.K) for any positive integer and TFBoost(B). We fitted decision trees using the rpart package (rpart) and performed optimization of TFBoost(A.K) using the Nelder_Mead function from the lme4 package (lme4). The code implementing our method is publicly available online at https://github.com/xmengju/TFBoost. To make a fair comparison with its alternatives, we used cubic B-splines with 7 functions (3 evenly spaced interior knots) as the basis for TFBoost methods, same as what we did for FLM1, FPPR, and FAME. We then orthonormalized the basis as required by Type A and Type B trees.

We studied the performance of TFBoost(A.K) with = 1,2, and 3; and TFBoost(B) using 200 random directions at each iteration. For each of these methods, the maximum depth () of the functional multi-index trees was fixed for all iterations. We experimented with and for each method selected the value of that achieved the lowest MSPE on the validation set at the early stopping time. We set the shrinkage parameter to 0.05 and the maximum number of iterations to 1000.

As mentioned in Section 2.1.1, the estimation of a Type A tree may result in a non-convex optimization problem. To avoid suboptimal local minima, we fitted each Type A tree with multiple starting points. More specifically, we first uniformly sampled 30 points in the spherical coordinates system that satisfied the box constrains in (15) and ran the Nelder-Mead algorithm for 10 steps using each of the 30 points as the starting point. We then chose the five ending points with the lowest objective values and continued running the Nelder-Mead algorithm until convergence. Out of the five resulting estimates, we chose the one that minimized the objective function.

The results for all methods under consideration were evaluated in terms of the MSPE on the test set ():

 MSPE=1|T|∑i∈T(^F(xi)−yi)2. (21)

For TFBoost, the estimate was reported at the early stopping time.

### 3.2 Results

For each combination of the regression function and the predictor model (), the results for the high noise () settings look very similar as those for the low noise () settings. In settings, the differences between estimators are more pronounced, showing a larger advantage of TFBoost(A.K) and TFBoost(B) over the others. We report here the results for settings and provide those for in the Appendix.

Figures 5, 4, 3 and 2 show the MSPEs on the test sets for to respectively. Since FLM1, FLM2, and FAM tend to produce very large errors, to be able to visualize the differences among the other methods, we excluded them from the figures and reported the summary statistics of their MSPEs in the Appendix. As expected, the linear estimators (FLM1 and FLM2) do not work well since to are nonlinear functions. FAM also performs poorly, likely due to its using one FPC as the projection direction for each additive component, making it less flexible compared with estimating the projection direction based on the data.

In each figure, panel (a) corresponds to data with the predictors generated from and panel (b) for those from . Each panel shows the boxplot of MSPEs on the test sets from 100 independent runs of the experiment. For TFBoost(A.K), we show the MSPEs for the value of that gave the best results on the test sets, and include the rest in the Appendix. In the case of a tie we picked the simpler model (one with a smaller ). As a result, we selected

• for all and settings, and ,

• for all settings and .

In general, TFBoost(A.K) and TFBoost(B) show stable and remarkable predictive performance in all these figures, with either or both achieving the lowest or second lowest average test MSPEs amongst all. In contrast, the performance of FPPR, FAME, and FGAM heavily depends on the regression function as well as the predictor model. For example, in panel (a) of Figure 3 where was generated from , FPPR produces the worst errors compared to the other methods. However, when followed in panel (b), FPPR becomes one of the best. The opposite holds for FGAM, being among the best in panel (a) and the worst in panel (b). Similarly in Figure 4, the performance of FGAM differs greatly in and settings. In Figures 5 and 2, the performances of FAME and FGAM are similar in both and settings, but they differ substantially across figures. FGAM produces significantly worse errors compared to FAME in Figure 2, whereas in Figure 5 we observe the opposite.

It is worth noting that TFBoost methods are close to their alternatives in terms of test errors when the true model matches the one on which the other methods are based. For example, in Figure 2 TFBoost(A.2) and TFBoost(B) produce similar errors as FPPR, where the regression function matches with the true model of FPPR specified with a single additive component. In other cases where the alternatives deviate from the true model, TFBoost(A.K) and TFBoost(B) usually show the lowest errors and their performances are stable regardless of the setting. This illustrates the flexibility of our method, which helps to achieve low prediction errors without posing strong assumptions on the target function.

So far we have focused on the performance of TFBoost in nonlinear settings since the base learners are nonlinear. In the Appendix, we provide also results for an experiment fitting TFBoost methods with data generated from a linear model and comparing them with alternatives considered previously. As expected, the linear estimators FLM1, FLM2 outperform the others. For FGAM, we note that the linear regression function matches with its model assumptions, which explains its good performance in linear settings. We note that the test errors of TFBoost(A.1) are only slightly worse than those produced by linear estimators: about 5% worse in the setting, and about 7% worse in the setting. This implies that even though TFBoost is an estimator for possibly nonlinear regression functions, we do not lose much applying it to data generated from linear models.

## 4 German electricity data

To illustrate the use of TFBoost, we analyzed a data set consisting of electricity spot prices traded at the European Energy Exchange (EEX) in Leipzig and electricity demand reported by the European Network of Transmission System Operators for Electricity from January 1st 2006 to September 30th 2008. On each day, electricity spot prices and demands were recorded hourly and represented as vectors of dimension 24. Our objective is to predict the daily average of electricity demand using hourly electricity spot prices. The whole dataset is available from the on-line supplementary materials of liebl2013modeling. To avoid days with known atypical price or demand values affecting our analysis, we removed weekends and holidays and analyzed data collected on the remaining 638 days. For each of these days, we view the hourly electricity spot prices as discretized evaluations of a smooth price function.

We considered methods of TFBoost(A,K) with = 1, 2, and 3, TFBoost(B), as well as their alternatives FLM1, FLM2, FAM, FPPR, FAME, and FGAM. To adjust for potential seasonality effects, we created a “day of the year” variable defined as the number of days from January 1st of the year in which the data were collected. In order to compare the results before and after adjusting for seasonality, our analysis consisted of two parts: (1) predicting daily electricity demand with hourly electricity prices only, and (2) with both hourly electricity prices and the “day of the year” variable.

In part (1), all methods under consideration were specified in the same way as in Section 3.1. In part (2), we added “day of the year” as an additional predictor to all methods and adjusted their algorithms accordingly. For TFBoost methods, at every boosting iteration, we added “day of the year” as an additional variable to the functional multi-index tree. For the competitors, we included the scalar predictor in the model following the same approach as we did for the functional predictor. We added “day of the year” as a linear term in FLM1 and FLM2, and as a nonparametric term in FAM, FPPR, FAME, and FGAM. For each of FAM and FPPR, we first calculated the estimator in the same way as in part (1) and obtained the residuals. Then we added to the part (1) estimator a Nadaraya-Watson estimator with Gaussian kernel fitted to the residuals using “day of the year” as the predictor. For FAME and FGAM we specified a smooth term on “day of the year” represented as penalized cubic splines (wood2017generalized). Other than adding this additional predictor, the other parameters and fitting procedures were kept the same as in part (1).

We randomly partitioned the data into a training set (60%), a validation set (20%), and a test set (20%). As in Section 3, we fitted each model using the training and validation sets and recorded MSPE on the test set. The validation set was used to choose the regularization parameter of FLM1, the number of additive components of FAM and FPPR, and the maximum tree depth and early stopping time of TFBoost(A,1),TFBoost(A,2), TFBoost(A,3), and TFBoost(B).

To reduce the variability introduced by partitioning the data, we repeated the experiment with 100 random data splits. Figures 7 and 6 show test MSPEs obtained from 100 random data splits in part (1) and (2) of the experiment respectively. The summary statistics of these test MSPEs are provided in the Appendix. To better reveal the differences between the estimators under consideration, in each figure the y-axis is truncated and represents test MSPE .

In both figures, TFBoost(A.1), TFBoost(A.2), TFBoost(A.3), and TFBoost(B)

show superior performance compared to their alternatives. They achieve the lowest errors with the smallest standard errors. In

Figure 6, we observe that TFboost(B) produces the smallest test errors, substantially better than FLM1,FLM2, FAM, and FGAM. The performance of FPPR is the closest to TFBoost(B), however, it is much more expensive to fit kernel estimators compared to Type B trees, particularly with data of a large sample size. Comparing Figure 7 with Figure 6, we observe improvements in test errors for all methods except for FLM1. This implies the importance of adjusting for seasonality when predicting the electricity demand, which is likely due to higher electricity usage in the summer for cooling and in the winter for heating. In Figure 7, TFBoost(A.1), TFBoost(A.2), TFBoost(A.3) produce very similar results, and TFBoost(A.2) outperforms the other two by a small margin in terms of the average test MSPE. Figure 6: Boxplot of test MSPEs obtained from 100 random splits of the data from part (1) of the experiment. The unit of y-axis is 10−6. Figure 7: Boxplot of test MSPEs obtained from 100 random splits of the data from part (2) of the experiment. The unit of y-axis is 10−6.

## 5 Conclusion and future work

In this paper we have proposed a novel boosting method for regression with a functional explanatory variable. Our approach uses functional multi-index trees as base learners, which offer flexibility and are relatively easy to fit. Compared with widely studied single index estimators, our multi-index estimator enables modelling possible interactions between indices, allowing us to better approximate complex regression functions. Through extensive numerical experiments, we demonstrate that our method consistently produces one of the lowest prediction errors across different settings, while available alternatives can be seriously affected by model misspecification. In addition, TFBoost with Type B learners exhibits significant computational advantages over kernel-based methods, being able to provide accurate predictions at a much lower cost.

The methodology of TFBoost suggests a number of interesting areas for future work. First, we formulated the regression problem with a single functional predictor, but in principle, it could also be extended to multiple predictors, functional or scalar. This can be achieved by including indices calculated from multiple predictors, either allowing each index to be associated with all predictors or one of the predictors. If there are many predictors, this approach may be computationally difficult and can result in unstable estimators when the sample size is small. In this situation, we need to assume sparsity in the functional space, which means that some but not all predictors are related to the response. Group penalties with projections on each predictor being a group can be considered to enable variable selection. Second, in our experiments we only studied TFBoost

applied with the squared loss. In cases where data contain outliers, it will be useful to consider fitting our method with different loss functions, for example, Huber’s or Tukey’s loss. Finally, in this paper we focused on functional predictors observed at many time points without measurement errors. In practice, one may encounter situations where only very few and possibly noisy measurements of the predictor function are available. In this case, smoothing each predictor curve provides unreliable approximations and makes them poor inputs to

TFBoost. It may be more advisable to borrow strength across curves, for example, fitting TFBoost to curves reconstructed using FPCA (yao2005functional and boente2021robust).

## 6 Acknowledgments

The authors gratefully thank Professor James and Professor Ferraty for sharing the code used in their papers (james2005functional and ferraty2013functional). This research was supported by the Natural Sciences and Engineering Research Council of Canada [Discovery Grant RGPIN-2016-04288].

APPENDICES

## Appendix A

###### Proof 1

It is clear that if for some , then . Therefore, it suffices to show that for some . We prove that if there do not exist for the two sets to be equal, there exists a set of indices for which (12) is a constant function and thus contradicts creftypecap 2.

For simplicity, we let . If for any , we match two sets so that the same vectors and align with each other. We let , , for and , for , and . By creftypecap 2, there exist a for , (12) is not a constant function.

By (13), creftypecap 1 and creftypecap 2, for any

 h(⟨x0,β1⟩+t1,...,⟨x0,βK⟩+tK) =h(⟨x0+t1β1,β1⟩,...,⟨x0+tKβK⟩,βK) =~h(⟨x0+t1β1,η1⟩,...,⟨x0+tKβK,ηK⟩) =~h(⟨x0,η1⟩+t1⟨β1,η1⟩,...,⟨x0,ηK⟩+tK⟨βK,ηK⟩)

and similarly

 ~h(⟨x0,η1⟩+t1,...,⟨x0,ηK⟩+tK) =~h(⟨x0+t1η1,η1⟩,...,⟨x0+tKηK,ηK⟩) =h(⟨x0+t1η1,β1⟩,...,⟨x0+tKηK,βK⟩) =h(⟨x0,β1⟩+t1⟨β1,η1⟩,...,⟨x0,βK⟩+tK⟨βK,ηK⟩)

By Cauchy-Schwarz inequality and creftypecap 1, for and for . For any ,

 h(⟨x0,β1⟩+t1,...,⟨x0,βK⟩+tK) =~h(⟨x0,η1⟩+t1⟨β1,η1⟩,...,⟨x0,ηK⟩+tK⟨βK,ηK⟩) =h(⟨x0,β1⟩+t1⟨β1,η1⟩2,...,⟨x0,βK⟩+tK⟨βK,ηK⟩2) ⋮ =h(⟨x0,β1⟩+t1⟨β1,η1⟩2n,...,⟨x0,βK⟩+tK⟨βK,ηK⟩2n) ⋮ =h(⟨x0,β1⟩+t1I1,...,⟨x0,βK⟩+tKIK) (22)

where for and for .

Let for any unit function , and . Then fills the space of . For , we define

 Lj(x) =(1−Ij)x+Ijx0 =(1−Ij)(x0+te)+Ijx0 =x0+(1−Ij)te
 h(⟨L1(x),β1⟩,...,⟨LK(x),βK⟩) =h(⟨x0+(1−I1)te,β1⟩,...,⟨x0+(1−IK)te,βK⟩) =h(⟨x0,β1⟩+⟨(1−I1)te,β1⟩,...,⟨x0,βK⟩+⟨(1−IK)te,βK⟩),by (???) =h(⟨x0,β1⟩+⟨I1(1−I1)te,β1⟩,...,⟨x0,βK⟩+⟨IK(1