1 Introduction
Functional data analysis (FDA) is a growing statistical field for analyzing curves, surfaces, or any multidimensional functions, in which each random function is treated as a sample element (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). Functional data are commonly found in many applications such as time-course gene expressions and brain scan images. The ever-expanding umbrella that encompasses deep-learning methodologies has thus far largely excluded the usage of functional covariates. With the advent and rise of functional data analysis, it is natural to extend neural networks and all of their recent advances to the functional space. The main goal of this article is to provide a new means of modelling functional data for scalar response prediction in the form of neural networks.
Let $Y$ be a scalar response variable, and $X(t)$, $t \in \mathcal{T}$, be the functional covariate. Several models have been proposed to predict the scalar response $Y$ with the functional covariate $X(t)$. For instance, when the scalar response $Y$ follows a normal distribution, the conventional functional linear model is defined as:

$$E(Y \mid X) = \alpha + \int_{\mathcal{T}} \beta(t) X(t)\,dt,$$

where $\alpha$ is the intercept term, and $\beta(t)$ is the functional coefficient which represents the cumulative linear effect of $X(t)$ on $Y$ (Cardot et al., 1999). This model was extended to the general functional linear model:

$$E(Y \mid X) = g\left(\alpha + \int_{\mathcal{T}} \beta(t) X(t)\,dt\right),$$

when the scalar response $Y$ follows a general distribution in an exponential family, where $g(\cdot)$ is referred to as the link function and has a specific parametric form for the corresponding distribution of $Y$ (Müller and Stadtmüller, 2005). When this link function $g(\cdot)$ has no parametric form, the model is called the functional single index model (Jiang and Wang, 2011). Other predictive methods in the realm of FDA are related to various estimation methods of $\beta(t)$. For example, a partial least squares approach was proposed by Preda et al. (2007) where an attempt was made to estimate $\beta(t)$ such that it maximized the covariance between the response, $Y$, and the functional covariate, $X(t)$. Ferraty and Vieu (2006) proposed a nonparametric model:

$$E(Y \mid X) = r(X),$$

where $r(\cdot)$ is a smooth nonparametric function that is estimated using a kernel approach. Another model, which serves as an extension to the previous, is the semi-functional partial linear model (Aneiros-Pérez and Vieu, 2006) defined as:

$$E(Y \mid X, Z) = \sum_{j=1}^{J} \alpha_j Z_j + r(X),$$

where $Z_j$ is the $j$th scalar covariate that is observed in the usual multivariate case, and the function $r(\cdot)$ is also estimated with no parametric form using kernel methods.
All of the functional models above have been shown to have some level of predictive success. However, we show that the general neural network in this article outperforms these models. We propose a novel methodology for deep network structures in a general regression framework for longitudinal data. The form of a single neuron $v$, for $K$ functional covariates and $J$ scalar covariates in this model, is expressed as:

$$v = g\left(\sum_{k=1}^{K} \int_{\mathcal{T}} \beta_k(t) x_k(t)\,dt + \sum_{j=1}^{J} w_j z_j + b\right), \tag{1}$$

where $g(\cdot)$ is some activation (i.e. nonlinear) function, $w_j$ is the weight associated with the scalar covariate $z_j$, $\beta_k(t)$ is the functional weight that corresponds to the functional covariate $x_k(t)$, and $b$ is the intercept.
With respect to neural networks, we have seen a growing number of approaches, some of which have resulted in previous benchmarks being eclipsed. For example, Krizhevsky et al. (2012) won the ImageNet Large-Scale Visual Recognition Challenge in 2012, improving on the next best approach by over 10% in accuracy. Rossi and Conan-Guez (2005) proposed a neural network in which they used a single functional variable for classification problems. He et al. (2016) introduced residual neural networks, which allowed for circumvention of vanishing gradients, an innovation that carved a path for networks with substantially more layers, thus further improving error rates. These successes, however, have come at the cost of interpretability. As the models become more complex, it becomes increasingly difficult to make sense of the network parameters. On the other hand, conventional linear regression models have a relatively clear interpretation of the parameter estimates (Seber and Lee, 2012). In the functional linear regression case, the coefficient parameters being estimated are functions $\beta(t)$, rather than a set of scalar values. This paper details an approach that makes the functional coefficient traditionally found in the regression model readily available from the neural network process in the form of functional weights; the expectation is that this increases the interpretability of the neural networks while maintaining the superior predictive power.

Our paper has three major contributions. First, we introduce the general framework of functional neural networks (FNNs) with a methodology that allows for deep architectures with multiple functional and scalar covariates, the usage of modern optimization techniques (Kingma and Ba, 2014; Ruder, 2016), and hyperparameters such as early stopping (Yao et al., 2007) and dropout (Srivastava et al., 2014), along with some justifications that underpin the approach. Second, we introduce functional weights that are smooth functions of time and can be much easier to interpret than the vectors of parameters estimated in the usual neural network. This is exemplified in our applications and simulations. Finally, branching off this work is a forthcoming R package developed on top of a popular deep learning library (TensorFlow/Keras) that will allow users to apply the proposed method easily on their own data sets.

The rest of our paper is organized as follows. We first introduce the methodology for functional neural networks in Section 2. Additionally, commentary is provided on the interpretation potential, weight initialization, and the hyperparameters of these networks. Then, Section 3 provides results from real-world examples; this includes prediction comparisons among a number of methods for multiple data sets. In Section 4, we use simulation studies for the purpose of recovering the true underlying coefficient function $\beta(t)$, and to test the predictive accuracy of multivariate and functional methods in four different contexts. Lastly, Section 5 contains some closing thoughts and new avenues of research for this kind of network.
2 Methods
2.1 Functional Neural Networks
We begin with a quick introduction to traditional neural networks, which are made up of hidden layers, each of which contains some number of neurons. Let $n_u$ be the number of neurons in the $u$th hidden layer. Each neuron in each layer is some nonlinear transformation of a linear combination of each activation in the previous layer. An activation value is the output from each of these neurons. For example, the first hidden layer $\boldsymbol{v}^{(1)}$ would be defined as $\boldsymbol{v}^{(1)} = g\left(\boldsymbol{W}^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}\right)$, where $\boldsymbol{x}$ is a vector of $J$ covariates, $\boldsymbol{W}^{(1)}$ is an $n_1 \times J$ weight matrix, $\boldsymbol{b}^{(1)}$ is the intercept (often referred to as the bias in machine learning texts), and $g(\cdot)$ is some activation function that transforms the resulting linear combination (Tibshirani et al., 2009). The choice of the function $g(\cdot)$ is highly context dependent. The rectifier (Hahnloser et al., 2000) and the sigmoid (Han and Moraga, 1995) are popular choices for $g(\cdot)$. Note that the vector $\boldsymbol{x}$ corresponds to a single observation of our data set. The resulting vector $\boldsymbol{v}^{(1)}$ is $n_1$-dimensional. This vector contains the activation values to be passed on to the next layer.

Thus far, the assumption has been that $\boldsymbol{x}$ is $J$-dimensional. However, we wish now to consider the case when our input is infinite dimensional, defined over some finite domain $\mathcal{T}$, i.e., we postulate that our input is a functional covariate $x(t)$, $t \in \mathcal{T}$. By finite domain we mean that $\mathcal{T}$ is a bounded interval. We must weigh this functional covariate at every point along its domain. Therefore, our weight must be infinite dimensional as well. We define this weight as $\beta(t)$. The form of a neuron with a single functional covariate in the first layer then becomes
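$$v_i^{(1)} = g\left(\int_{\mathcal{T}} \beta_i(t) x(t)\,dt + b_i^{(1)}\right), \tag{2}$$

where the subscript $i$ is an index that denotes one of the neurons in this first hidden layer, i.e., $i \in \{1, 2, \dots, n_1\}$. We omit the superscript on the functional weight $\beta_i(t)$, because this parameter only exists in the first layer of the network.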
The functional weight $\beta_i(t)$ is expressed as a linear combination of basis functions: $\beta_i(t) = \sum_{m=1}^{M} c_{im}\phi_m(t) = \boldsymbol{c}_i^{T}\boldsymbol{\phi}(t)$, where $\boldsymbol{\phi}(t)$ is a vector of basis functions, and $\boldsymbol{c}_i$ is the corresponding vector of basis coefficients. The basis coefficients for $\beta_i(t)$ will be initialized by the network; these initializations will then be updated as the network learns. Common choices of basis functions are the B-splines and the Fourier basis functions (Ramsay and Silverman, 2005). We also note that the evaluation of the neuron in Equation (2) results in some scalar value. This implies that the rest of the layers of the network can be of any of the usual forms (feedforward, residual, etc.). Using these basis approximations of $\beta_i(t)$, we can simplify to get that the form of a single neuron is:

$$v_i^{(1)} = g\left(\sum_{m=1}^{M} c_{im} \int_{\mathcal{T}} \phi_m(t) x(t)\,dt + b_i^{(1)}\right), \tag{3}$$

where the integral in Equation (3) can be approximated with numerical integration methods such as the composite Simpson’s rule (Süli and Mayers, 2003).
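As an illustration of how a neuron of the form in Equation (3) might be evaluated, the following is a minimal sketch rather than the implementation used in this paper. The function names (`simpson`, `fourier_basis`, `functional_neuron`), the unnormalised Fourier basis on $[0, 1]$, and the grid size are all assumptions made for the example.

```python
import math

def simpson(values, a, b):
    """Composite Simpson's rule for samples on an evenly spaced grid over [a, b].

    Requires an odd number of samples (an even number of subintervals)."""
    n = len(values) - 1
    if n < 2 or n % 2 != 0:
        raise ValueError("need an odd number of samples (at least 3)")
    h = (b - a) / n
    total = values[0] + values[-1]
    total += 4 * sum(values[1:-1:2])  # odd-indexed interior points
    total += 2 * sum(values[2:-1:2])  # even-indexed interior points
    return total * h / 3

def fourier_basis(m, t):
    """m-th basis function on [0, 1]: 1, sin(2*pi*t), cos(2*pi*t), sin(4*pi*t), ...

    Hypothetical, unnormalised choice used only for this sketch."""
    if m == 0:
        return 1.0
    k = (m + 1) // 2
    angle = 2 * math.pi * k * t
    return math.sin(angle) if m % 2 == 1 else math.cos(angle)

def functional_neuron(coefs, x, bias, activation, a=0.0, b=1.0, n_grid=201):
    """Evaluate g(sum_m c_m * integral(phi_m(t) * x(t) dt) + bias) numerically."""
    grid = [a + (b - a) * i / (n_grid - 1) for i in range(n_grid)]
    pre_activation = bias
    for m, c in enumerate(coefs):
        integrand = [fourier_basis(m, t) * x(t) for t in grid]
        pre_activation += c * simpson(integrand, a, b)
    return activation(pre_activation)

# Example: x(t) = sin(2*pi*t), three basis coefficients, rectifier activation.
relu = lambda z: max(0.0, z)
activation_value = functional_neuron(
    coefs=[0.5, 2.0, 0.0],
    x=lambda t: math.sin(2 * math.pi * t),
    bias=0.1,
    activation=relu,
)
```

In this example only the sine term contributes to the integral (the integral of $\sin^2(2\pi t)$ over $[0, 1]$ is $0.5$), so the pre-activation is approximately $0.1 + 2 \times 0.5 = 1.1$.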
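We can now consider the generalization for $K$ functional covariates and $J$ scalar covariates. Consider the input layer as presented in Figure 1. The covariates corresponding to the $l$th observation can be seen as the set:

$$\left\{x_{l1}(t), x_{l2}(t), \dots, x_{lK}(t), z_{l1}, z_{l2}, \dots, z_{lJ}\right\}.$$

Then, the $i$th neuron of the first hidden layer corresponding to the $l$th observation can be formulated as (we suppress the index $l$, because this expression does not change with the observation number):

$$v_i^{(1)} = g\left(\sum_{k=1}^{K} \int_{\mathcal{T}} \beta_{ik}(t) x_k(t)\,dt + \sum_{j=1}^{J} w_{ij}^{(1)} z_j + b_i^{(1)}\right) = g\left(\sum_{k=1}^{K} \sum_{m=1}^{M} c_{ikm} \int_{\mathcal{T}} \phi_{km}(t) x_k(t)\,dt + \sum_{j=1}^{J} w_{ij}^{(1)} z_j + b_i^{(1)}\right),$$

where $\phi_{km}(t)$ is the basis function and $c_{ikm}$ is the corresponding basis coefficient. This neuron formulation is the core of this methodology as alluded to in Equation (1). Note that here $c_{ikm}$ is unique at the initialization for each functional weight, $\beta_{ik}(t)$; the choice of these initializations is discussed later in the article. Also, in this formation, we have assumed that $M$ is the same across all functional weights; it could be the case that the user prefers some functional weight to be defined using a different number of basis functions than, say, $M$, so this is left as a hyperparameter.

Having specified the form, we define the following general formation of the first layer:

$$\boldsymbol{v}^{(1)} = \left(v_1^{(1)}, v_2^{(1)}, \dots, v_{n_1}^{(1)}\right)^{T}.$$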
To justify this, we consider the one-layer case and Theorem 1 in Cybenko (1989), which states that finite linear combinations of the form $G(x) = \sum_{j=1}^{N} \alpha_j \sigma(y_j^{T}x + \theta_j)$ exhibit the quality that, under some conditions, the 1-norm between the function we want to learn, $f$, and the function $G$ can be arbitrarily small. Since we are looking at the one-layer case, we have $n_1$ neurons (indexed from $1$ to $n_1$) and we omit the superscript that indexes the layer number, $(1)$. Additionally, we fix the observation number $l$ because it does not play a role in the proof (the same argument applies to each observation).
Theorem 1.
Let be any continuous sigmoidal function, denote the dimensional hypercube and denote the space of continuous functions. Then, the finite sum of the following form, is dense in :
meaning that for any and for , the function obeys:
A proof is provided in the supplementary document. After running through this set of initial neurons in the first layer and calculating the activations for the layers following, we can arrive at a final value. The output will be single dimensional. In order to assess performance, we can use some loss function, $R(\theta)$; for example, the mean squared error

$$R(\theta) = \frac{1}{N}\sum_{l=1}^{N}\left(y_l - \hat{y}_l\right)^2,$$

where $\theta$ is the set of parameters defining the neural network, $y_l$ is the observed data for the scalar response, and $\hat{y}_l$ is the output from the functional neural network.
2.2 Functional Neural Network Training
Having defined the general formation of functional neural networks, we can now turn our attention to the optimization of this kind of network. We will consider the usual backpropagation algorithm (Rumelhart et al., 1985). While we used the Adam optimizer, adam(), in the implementation (Kingma and Ba, 2014), we explain the general process for an optimization scheme that uses stochastic gradient descent.

Given our generalization and reworking of the parameters in the network, we can note that the set making up the gradient associated with the parameters is:

$$\nabla R_l = \left\{\frac{\partial R_l}{\partial c_{ikm}}, \frac{\partial R_l}{\partial w_{ij}^{(u)}}, \frac{\partial R_l}{\partial b_i^{(u)}}\right\},$$

where $R_l$ denotes the loss for the $l$th observation. This set exists for every observation, $l$. We are trying to optimize for the entirety of the training set, so we will move slowly in the direction opposite to the gradient. The rate at which we move, which is called the learning rate, will be denoted by $\gamma$. For the sake of efficiency, we will take a subset of the training observations, which is called a minibatch, for which we calculate $\nabla R_l$. Then, letting $\bar{a} = \frac{1}{N_b}\sum_{q=1}^{N_b} \frac{\partial R_q}{\partial a}$, where $\frac{\partial R_q}{\partial a}$ is the derivative with respect to any parameter $a$ for the $q$th observation in the minibatch and $N_b$ is the size of the minibatch, the update for $a$ is $a \leftarrow a - \gamma \bar{a}$ (Ruder, 2016). This process is repeated until all partitions (minibatches) of the data set are completed, thus completing one training iteration; the number of training iterations is left as a hyperparameter. We summarize the entire network process in Algorithm 1.
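The minibatch update described above can be sketched as follows. This is an illustrative sketch of plain minibatch stochastic gradient descent, not the Adam optimizer used in the implementation; the helper names (`minibatch_sgd_epoch`, `squared_error_grad`) and the toy linear model are assumptions introduced for the example.

```python
import random

def minibatch_sgd_epoch(params, data, grad_fn, lr, batch_size, rng):
    """One training iteration: visit every minibatch once, updating params in place."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = [data[i] for i in indices[start:start + batch_size]]
        # Average the per-observation derivatives over the minibatch.
        avg = {name: 0.0 for name in params}
        for obs in batch:
            grads = grad_fn(params, obs)
            for name in params:
                avg[name] += grads[name] / len(batch)
        # Gradient-descent step: move against the averaged gradient.
        for name in params:
            params[name] -= lr * avg[name]
    return params

def squared_error_grad(params, obs):
    """Derivatives of (y - (w*x + b))^2 for a toy linear model (illustrative only)."""
    x, y = obs
    resid = y - (params["w"] * x + params["b"])
    return {"w": -2.0 * resid * x, "b": -2.0 * resid}

# Toy usage: recover w = 3 and b = 1 from noiseless data.
rng = random.Random(0)
data = [(i / 20, 3.0 * (i / 20) + 1.0) for i in range(20)]
fitted = {"w": 0.0, "b": 0.0}
for _ in range(500):  # number of training iterations (a hyperparameter)
    minibatch_sgd_epoch(fitted, data, squared_error_grad, lr=0.1, batch_size=5, rng=rng)
```

In a functional neural network the dictionary of parameters would instead hold the basis coefficients, scalar weights, and intercepts, with the derivatives supplied by backpropagation.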
Lastly, we would like to emphasize that the number of first-layer parameters in the functional neural network presented here is significantly smaller than in a conventional network that takes the raw measurements as inputs. Consider a longitudinal data set where we have $N$ observations and $P$ scalar repeat measurements of some covariate at different points along its continuum. Passing this information into a conventional network will mean that the number of parameters in the first layer will be $(P + 1) \cdot n_1$. Note that in our network, the number of parameters in the first layer is a function of the number of basis functions we use to define the functional weight. The number of basis functions, $M$, that we use to define this functional weight will be less than $P$, as there is no need to have a functional weight that interpolates across all our observed points; we prefer a smooth effect across the continuum to avoid fitting to noise. Therefore, good practice indicates that the number of parameters in the first layer of our network is $(M + 1) \cdot n_1$ where $M < P$.

2.3 Functional Weights
Since a leading contributor to the black-box reputation of neural networks is the inordinate number of changing weights and intercepts, it would be helpful to consider instead a function defined by these seemingly uninterpretable numbers. In a functional neural network, we are estimating functional weights $\beta(t)$. These functional weights are akin to the ones estimated in the functional regression model (Ramsay and Silverman, 2005); the final set of functional weights can be compared with the one estimated from a functional linear model. In the case of multiple neurons, we take the average of the estimated functional weights $\hat{\beta}_i(t)$ across the neurons. Over iterations of the network, as it is trained, we can see movement of the functional weight over its domain. Since these parameters can be visualized, it can be much easier to garner intuition about the relationship between the functional covariates and the response. Figure 2 may be illuminating. At the 99th training iteration, the validation error stops decreasing with respect to some threshold. We can see that the difference in the curves is most pronounced in the beginning and is least pronounced after the model finds some local extremum. In this example, the functional weights were initialized from a uniform distribution, but a more drastic change in the shape could be seen with a different initialization and a different choice of basis functions for the functional weight.
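To make the averaging step concrete, the sketch below evaluates each neuron's functional weight on a grid from its basis coefficients and averages the curves pointwise. The function names and the monomial basis are hypothetical choices for illustration only; in practice the basis would be the B-spline or Fourier basis chosen for the network.

```python
def evaluate_weight(coefs, basis, t):
    """Evaluate one functional weight, sum_m c_m * phi_m(t), at a single point t."""
    return sum(c * phi(t) for c, phi in zip(coefs, basis))

def average_functional_weight(coef_matrix, basis, grid):
    """Average the per-neuron functional weights pointwise over a grid of t values."""
    n_neurons = len(coef_matrix)
    return [
        sum(evaluate_weight(coefs, basis, t) for coefs in coef_matrix) / n_neurons
        for t in grid
    ]

# Hypothetical monomial basis {1, t, t^2}; any list of callables works.
basis = [lambda t: 1.0, lambda t: t, lambda t: t * t]
coef_matrix = [
    [1.0, 0.0, 0.0],  # neuron 1: beta_1(t) = 1
    [3.0, 2.0, 0.0],  # neuron 2: beta_2(t) = 3 + 2t
]
avg_curve = average_functional_weight(coef_matrix, basis, [0.0, 0.5, 1.0])
```

Because the weights are linear in the coefficients, averaging the curves is equivalent to averaging the coefficient vectors when all neurons share the same basis.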
2.4 Weight Initialization and Parameter Tuning
For any usual neural network, the weights and intercepts can be initialized in a number of ways. For example, in Kim and Ra (1991) weights are initialized based on a boundary that allowed for quicker convergence. Another approach is to consider a zero-initialization, i.e., letting the initial parameter values be 0. Many of these approaches have also been compared to one another using various guidelines (Fernández-Redondo and Hernández-Espinosa, 2001). In the case of the networks presented here, this is left as a hyperparameter. Since the implementation is built on top of the Keras architecture (Chollet and others, 2015), the initialization is dependent on the type of connected layers, but generally the glorot_uniform() initializer is the choice for dense layers.
Due to the sheer number of hyperparameters in the network, a tuning approach can be used to find optimal values in our applications. The tuning method is to take a list of possible values for each parameter and run a cross-validation (Tibshirani et al., 2009) for all combinations. The number of folds to use depends on the size of the problem. The general scheme is that the function creates a grid, and calculates the K-fold cross-validated mean squared prediction error $\mathrm{MSPE} = \frac{1}{N}\sum_{b=1}^{B}\sum_{l \in S_b}\left(\hat{y}_l^{(-b)} - y_l\right)^2$, where $B$ is the number of folds, $S_b$ is the $b$th partition of the data set, and $\hat{y}_l^{(-b)}$ is the predicted value for $y_l$ obtained by training the functional neural network using the rest of the partitions of the data set. The number of data points in $S_b$ depends on the number of folds. The final output of the tuning function is the combination of hyperparameters that has the minimum value of this cross-validated error. A list of hyperparameters is given in Table S1 in the supplementary materials. One important parameter in this particular kind of network is the number of basis functions that govern the functional weights. Tuning this is fairly important, as the number of terms significantly impacts the potential for interpretability and restricts us to some particular shape of the curve. In the examples to come, we tune our hyperparameters using the tuning function in our forthcoming package.
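The grid-based tuning scheme above can be sketched generically as follows. This is not the tuning function of the forthcoming package; the names `kfold_indices`, `cv_mspe`, `grid_search`, and the toy nearest-neighbour model standing in for the network are assumptions for illustration. The folds are contiguous blocks, so the sketch assumes the observations have already been put in random order.

```python
import itertools

def kfold_indices(n, n_folds):
    """Split indices 0..n-1 into n_folds contiguous, nearly equal partitions."""
    sizes = [n // n_folds + (1 if b < n % n_folds else 0) for b in range(n_folds)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_mspe(xs, ys, fit_predict, hyper, n_folds):
    """Cross-validated MSPE: each fold is predicted by a model fit on the others."""
    n = len(xs)
    squared_error = 0.0
    for fold in kfold_indices(n, n_folds):
        held_out = set(fold)
        train_x = [xs[i] for i in range(n) if i not in held_out]
        train_y = [ys[i] for i in range(n) if i not in held_out]
        preds = fit_predict(train_x, train_y, [xs[i] for i in fold], **hyper)
        squared_error += sum((p - ys[i]) ** 2 for p, i in zip(preds, fold))
    return squared_error / n

def grid_search(xs, ys, fit_predict, grid, n_folds):
    """Return (best hyperparameters, best MSPE) over all combinations in the grid."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        hyper = dict(zip(names, values))
        score = cv_mspe(xs, ys, fit_predict, hyper, n_folds)
        if best is None or score < best[1]:
            best = (hyper, score)
    return best

def knn_fit_predict(train_x, train_y, test_x, k):
    """Toy stand-in model: average the responses of the k nearest training points."""
    preds = []
    for x0 in test_x:
        nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x0))[:k]
        preds.append(sum(train_y[i] for i in nearest) / k)
    return preds

xs = [i / 20 for i in range(20)]
ys = [2.0 * x for x in xs]
best_hyper, best_score = grid_search(xs, ys, knn_fit_predict, {"k": [1, 3, 5]}, n_folds=4)
```

For a functional neural network, `fit_predict` would train the network with the given hyperparameters (for example, the number of basis functions or neurons) and return predictions for the held-out fold.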
3 Applications
3.1 Bike Rental Data
An important problem in rental businesses is the amount of supply to keep on-site. If the company cannot meet demand, it is not as profitable as it could be. If it exceeds the required supply, it has made investments that are not yielding an acceptable return. Using the bike rental data set (Fanaee-T and Gama, 2014), we look to model the relationship between the total number of daily rentals (a scalar value) and the hourly temperature throughout the day (functional observation). It makes intuitive sense for temperature to be related to the number of bike rentals: on average, if it is cold, fewer people are likely to rent than if it were warm. We also expect there to be a temporal effect of temperature: if we have the same temperature at 1pm and 9pm, we would expect more rentals at 1pm under the assumption that fewer people decide to bike later at night. In total, we have data for 102 Saturdays on which we based our analysis; we chose the same day of the week to eliminate the day-to-day variation. Our functional observations of temperature (to be passed into the network) are formed using 31 Fourier basis functions.
We are first concerned with the accuracy of our predictions. Using $R^2$ and a 10-fold cross-validated mean squared prediction error, we can compare results for a number of models. Here, we compare with the usual functional linear model, an FPCA approach (Cardot et al., 1999), a nonparametric functional regression model (Ferraty and Vieu, 2006), and a functional partial least squares model (Preda et al., 2007). The results are summarized in Table 1. For the final model, we had a four-layer network, with the exact hyperparameter configurations given in Table S2 in the supplementary materials. We observe that FNNs outperform all the other models using both criteria, but note that the penalized partial least squares approach and the principal component ridge regression performed comparably.
Model  MSPE  $R^2$
Functional Linear Model (Basis)  0.0723  0.515
Functional Non-Parametric Regression  0.143  0.154
Functional PC Regression  0.0773  0.503
Functional PC Regression (2nd Deriv Penalization)  0.128  0.0481
Functional PC Regression (Ridge Regression)  0.0823  0.464
Functional Partial Least Squares  0.0755  0.458
Functional Partial Least Squares (2nd Deriv Penalization)  0.0701  0.545
Functional Neural Networks  0.0669  0.582
We can also look to see what the determined relationship is, according to the functional linear model and the functional neural network, between hourly temperature and daily rentals as indicated by $\hat{\beta}(t)$. Figure 3 shows the estimated weight function $\hat{\beta}(t)$. The optimal number of basis functions was eleven for the functional linear model and three for the functional neural network. For the functional linear model, we note that there seems to be no obvious discernible relationship between hourly temperature and bike rentals. In the case of the functional neural network, we see that there seems to be a positive relationship as we move into the afternoon and that this relationship tapers off as the day ends. We would also expect there to be no effect for when bike rental retailers would be closed, and this is much better reflected in the functional weight from the neural network than the functional coefficient in the functional linear model. We observe different scales for the two and we posit that this difference can be explained by the fact that the functional neural network has a large number of additional parameters that may be explaining some of the variation in the response. Moreover, even though the scale is different, the range of the scale is the same, so the relative effect across the continuum is similar.
3.2 Tecator Data
We consider the classic Tecator data set (Thodberg, 2015). The data are recorded on a Tecator Infratec Food and Feed Analyzer using near-infrared light (wavelengths from 850 to 1050 nm) to analyze the samples. Each sample contains meat with different moisture, fat, and protein contents. The goal is to predict the scalar value of fat content of a given meat sample using the functional covariate of the near-infrared absorbance spectrum and the scalar covariate associated with the water content. Absorbance spectroscopy measures the fraction of incident radiation absorbed by the sample. Samples with higher water composition may exhibit different spectral features (absorbance bands) than samples with higher protein content. Since we are working with functional covariates, we also have access to their derivatives; because this is fundamentally a problem in physics, the derivative information can serve as an important predictor and is used as such.
In total, there are 215 absorbance curves. We used 29 Fourier basis functions to estimate the functional observations. The first 165 absorbance curves are used as the training set, and the predictions are made on the remaining 50; this test/train paradigm comes from Febrero-Bande and de la Fuente (2012), who fitted several models using this paradigm. We present their results, along with results from the functional neural network, in Table 2. In the original paper, the authors use the metric $\mathrm{MEP} = \mathrm{MSPE} / \mathrm{Var}(y)$, where MSPE is the average squared error on the test set and $\mathrm{Var}(y)$ is the variance of the observed response (we can think of MEP as a rescaling of the MSPE), to assess the models. They also used $R^2$, which we tabulate in Table 2. In the functional neural network, we tuned to find that a six-layer network was optimal with a total of 4029 parameters. Our model has the lowest MEP, but its $R^2$ is about 0.03 lower than the best $R^2$. Most other models perform worse, with the Semi-Functional Partial Linear Model (Aneiros-Pérez and Vieu, 2006) being the most comparable.

Model  MEP  $R^2$
fregre.basis(X.d1, Fat)  0.0626  0.928 
fregre.basis.cv(X.d2, Fat)  0.0566  0.965 
fregre.pc(X.d1, Fat)  0.0580  0.950 
fregre.pc(X.d2, Fat)  0.0556  0.954 
fregre.pls(X.d1, Fat)  0.0567  0.951 
fregre.pls(X.d2, Fat)  0.0487  0.962 
fregre.lm(Fat, X.d1 + Water)  0.0097  0.987 
fregre.lm(Fat, X.d2 + Water)  0.0119  0.986 
fregre.np(X.d1, Fat)  0.0220  0.987 
fregre.np(X.d2, Fat)  0.0144  0.996 
fregre.plm(Fat, X.d1 + Water)  0.0090  0.996 
fregre.plm(Fat, X.d2 + Water)  0.0115  0.997 
FNN(Fat, X.d2 + Water)  0.00883  0.965 
Note that we only presented the results using the second derivative of the spectrum curves as the functional covariate because it was the better performer when compared to the network made using the first derivative or the raw functional observations themselves. We did not use multiple functional covariates here because all the other models only used one functional covariate; however, we did use water as a scalar covariate.
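For readers who wish to reproduce the MEP criterion on their own predictions, a minimal sketch is given below. It assumes MEP is the test-set MSPE divided by the sample variance of the observed response (with the $n - 1$ denominator used by `statistics.variance`); the exact variance convention in the original comparison is not stated here, so this is an assumption, as are the function names and the toy numbers.

```python
import statistics

def mspe(y_true, y_pred):
    """Mean squared prediction error over a test set."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def mep(y_true, y_pred):
    """MSPE rescaled by the sample variance of the observed response (assumed n - 1)."""
    return mspe(y_true, y_pred) / statistics.variance(y_true)

# Toy values for illustration only (not taken from the Tecator data).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
mep_value = mep(y_true, y_pred)
```

Here the MSPE is $1$ and the sample variance is $5/3$, giving an MEP of $0.6$.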
3.3 Canadian Weather Data
The data set used here has information regarding the total amount of precipitation in a year and the daily temperature for 35 Canadian cities. We are interested in modelling the relationship between annual precipitation and daily temperature. Generally, you would expect that lower temperatures would indicate higher precipitation rates. However, this is not always the case. In some regions, the temperature might be very low, but the inverse relationship with rain/snow does not hold. Our goal is to see whether we can successfully model these anomalies relative to other methods.
The functional observations are defined for the temperature of the cities, for which there are 365 (daily) time points. In total, there are 35 functional observations and the scalar response is the average precipitation across the year. A Fourier basis expansion was used with 65 basis functions defining each of the 35 cities (Ramsay et al., 2009). The details of the network hyperparameters are in Table S2 in the supplementary materials. The results from two criteria ($R^2$ and a leave-one-out cross-validated MSPE) are measured for a number of models. We see that the FNN model outperforms all other approaches including the usual neural networks. Table 3 summarizes the predictive results.
We can compare the estimated functional weight from the functional linear model with the functional neural network in Figure 4. For this data set, we decided to keep the number of basis functions the same across both models; the choice for this was eleven and comes from Ramsay et al. (2009). This was to measure how similar the functional coefficients would be under the same conditions. We observe similar patterns between the two models, especially over the second half of the domain. Note that the difference between the two recovered functional weights only accounts for some of the difference in predictive performance. The functional neural network has many more parameters, allowing for more flexibility in the modelling process and thus the large increase in accuracy.
Model  MSPE  $R^2$
Functional Linear Model (Basis)  0.123  0.00312
Functional Non-Parametric Regression  0.0647  0.0506
Functional PC Regression  0.0272  0.352
Functional PC Regression (2nd Deriv Penalization)  0.0930  0.00298
Functional PC Regression (Ridge Regression)  0.0259  0.382
Functional Partial Least Squares  0.0449  0.177
Functional Partial Least Squares (2nd Deriv Penalization)  0.0483  0.155
Neural Networks  0.126  0.0453
Functional Neural Networks  0.0194  0.541
4 Simulation Studies
4.1 Recovery of Functional Weight
In this section, we present results for the case where we know the true underlying functional weight. The goal is to compare the functional weight in the functional neural network to the conventional functional linear model. Useful results here would go a long way in showing that the functional neural network is not only useful for prediction, but can be a valuable tool when the goal is to approximate relationships via parameter estimation. In order to measure this, the integrated mean square error (IMSE) is used, which is defined as

$$\mathrm{IMSE} = \int_{\mathcal{T}} \left(\beta(t) - \hat{\beta}(t)\right)^2 dt,$$

where $\hat{\beta}(t)$ is the estimated functional weight either from the FNN or from the functional regression. We use the following to generate our response:
$$Y = g\left(\alpha + \int_{\mathcal{T}} \beta(t) X(t)\,dt\right) + \epsilon, \tag{4}$$

where $\beta(t)$ is a coefficient function chosen beforehand and $\epsilon$ is sampled from a Gaussian distribution. The true functional covariate $X(t)$ is generated, depending on the simulation scenario, from one of two parametric forms whose parameters govern the difference between the functional observations.

This generative procedure will be used for four different simulations. In all four, we generate 300 observations randomly using Equation (4) by varying the parameters of $X(t)$. The coefficients for $\beta(t)$ are set beforehand. We fit the functional linear model and the functional neural network for each simulation data set. We cross-validate over a grid for the smoothing parameter in order to find a smooth estimate of $\beta(t)$ from the functional linear model. The difference is measured using IMSE. The simulation is replicated 250 times.
The first simulation is for when the link function is the identity function. Here, we would expect the functional linear model to perform better than the functional neural network due to its deterministic nature and the linear relationship. In the second simulation, we look to see if our method can recover $\beta(t)$ when the link function is exponential. The third simulation explores this behaviour for a sigmoidal relationship. Lastly, we simulate a logarithmic relationship between the response and the functional covariates. All scenarios except Simulation 2 use the same form of the functional covariate for the data generation. These simulations are summarized as follows:
In all but the fourth scenario, we use a three-layer network with rectifier (Hahnloser et al., 2000) and linear activation functions. In the final scenario, we use a one-layer network with a sigmoidal activation function. With respect to the linear model, we cross-validate over a grid to find the optimal parameter to smooth the resulting functional weight.
In Figure 5, we present the results for these four simulations. We observe that the usual linear model seems to perform better when the relationship is linear. There are far more parameters in the FNN that are contributing to the prediction of $Y$. When the relationship is nonlinear, the functional linear model struggles, whereas the FNN does a much better job of recovering $\beta(t)$.
The averages of these results, along with computation times, are provided in Table 4. We observe that the functional neural network, as expected, takes longer to run across all simulation scenarios. However, we also observe, as the box plots indicate, that for nonlinear simulation scenarios, the functional neural network outperforms the functional linear model. This difference seems to be the most pronounced in simulation scenario 4. We also note that because of the stochastic nature of functional neural networks (i.e., random weight initialization), there is a higher variance in our estimates when compared with the deterministic functional linear model. We note that with a more rigorous tuning of the functional neural networks, we could further improve these results.

(FLM: Functional Linear Model; FNN: Functional Neural Networks)
Scenario  FLM Mean  FLM SD  FLM Avg. Comp. Time  FNN Mean  FNN SD  FNN Avg. Comp. Time
Simulation: 1  2.27  .0370  0.232s  2.39  .0476  4.68s 
Simulation: 2  2.53  .00901  0.232s  2.36  .0453  4.67s 
Simulation: 3  6.70  .0182  0.247s  6.43  .108  6.00s 
Simulation: 4  7.43  .0464  0.258s  6.46  .0752  6.30s 
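The IMSE comparison in this section can be approximated numerically on a grid. The sketch below assumes the IMSE is the unnormalised integral of the squared difference between the true and estimated functional weights (any normalising constant in the original definition is not recoverable here) and uses the trapezoidal rule; the function name, the domain $[0, 1]$, and the example curves are hypothetical.

```python
import math

def imse(beta_true, beta_hat, a=0.0, b=1.0, n_grid=1001):
    """Trapezoidal approximation of the integral of (beta_true - beta_hat)^2 over [a, b]."""
    h = (b - a) / (n_grid - 1)
    squared_diff = [
        (beta_true(a + i * h) - beta_hat(a + i * h)) ** 2 for i in range(n_grid)
    ]
    return h * (sum(squared_diff) - 0.5 * (squared_diff[0] + squared_diff[-1]))

# Hypothetical curves: the estimate is the truth shifted upward by 0.5.
true_weight = lambda t: math.sin(2 * math.pi * t)
estimated_weight = lambda t: math.sin(2 * math.pi * t) + 0.5
imse_value = imse(true_weight, estimated_weight)
```

A constant offset of $0.5$ over a unit-length domain gives an integrated squared error of $0.25$.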
4.2 Prediction
In this section, we look at how well our method does, relative to other approaches, under the four simulation scenarios detailed previously when the task is prediction. That is, we are interested in seeing how FNNs perform versus functional and multivariate approaches. The multivariate methods to be compared include: least squares regression (MLR), LASSO (Tibshirani, 1996), random forests (RF) (Breiman, 2001), gradient boosting approaches (GBM, XGB) (Friedman, 2001; Chen et al., 2015), and projection pursuit regression (PPR) (Friedman and Stuetzle, 1981).

We did not tune our functional neural networks because we found that they performed well in our initial tests regardless of the tuning. We did make an effort to tune all the other models. For example, the choice of $\lambda$ for the LASSO was made using cross-validation. The tree methods, including RF, GBM, and XGB, were tuned across a number of their hyperparameters such as the node size, and for PPR, we built models with different numbers of terms and picked the model with the lowest MSPE. In these simulations, we did 100 replicates. For each simulation replicate, we generated 300 functional observations in accordance with Equation (4). After the realization, we split the data randomly, built a model on the training set, and predicted on the test set. This process is repeated for the same four simulation scenarios as given in Section 4.1.
The box plots in Figure 6 measure the relative error in each simulation replicate. We call this the relative MSPE (rMSPE), defined as:

$$\mathrm{rMSPE}_q = \frac{\mathrm{MSPE}_q}{\min_{q'} \mathrm{MSPE}_{q'}},$$

where $\mathrm{MSPE}_q$ is the MSPE of model $q$ in a given simulation replicate. For example, on any given simulation replicate, we calculate the MSPE values for each model, and then divide each of them by the minimum in that run. The best model according to this measure will have a value of 1. Error values greater than 1 imply a worse performance. Table S3 in the supplementary materials contains the absolute MSPE values.
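The relative measure described above amounts to a single division per model within each replicate, as the following sketch shows. The function name and the MSPE values are made up for illustration and are not results from the simulations.

```python
def relative_mspe(mspe_by_model):
    """Divide each model's MSPE by the smallest MSPE in the same replicate."""
    best = min(mspe_by_model.values())
    return {model: value / best for model, value in mspe_by_model.items()}

# Hypothetical MSPE values for one simulation replicate.
replicate = {"FNN": 0.02, "MLR": 0.05, "RF": 0.03}
relative = relative_mspe(replicate)
```

The best model in the replicate receives a value of exactly 1, and every other model receives a value greater than 1.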
The relative measure we use makes it easy to compare each model within a simulation, and across the four simulation scenarios. Notably, we see in Figure 6 that the functional neural network performs well. This can be attributed to the addition of the functional information passed into the network; the functional neural network can take into account the temporal trend of the data to learn more about the underlying relationship in each training iteration. Therefore, we gain a model that better estimates the underlying true relationship between the covariates and the response in comparison with multivariate approaches.
In the curve-building process, we assume that there is noise associated with the observed discrete values. By working backwards to an approximation of the curve via basis functions, we effectively reduce that noise, and later, when we build the model, we avoid some of the error chasing that we would otherwise be prone to. This is a good application of Theorem 1, as we prove that this method should produce estimates of the response that come arbitrarily close to the true response, given that the response is a continuous function. As a comparison, we see that the tree-based methods generally perform comparably to one another within a particular simulation, but their performance changes across simulations. Depending on the scenario, it seems that multivariate methods are capable of outperforming functional methods with respect to rMSPE. However, one exception to this is the functional neural networks introduced here; they seem to be consistently good performers across these scenarios. With respect to outliers, we can see that they are most prevalent in Simulation 2; this is because the link function in this simulation was exponential. In this context, we expect the difference between our prediction and the corresponding observed value to be greater than it would be in the other simulation scenarios.
5 Conclusions and Discussion
The extreme rise in popularity of deep learning research has resulted in enormous breakthroughs in computer vision, classification, and scalar prediction. However, these advantages thus far have been limited to when the data is treated as discrete. This paper introduces the first of a family of neural networks that extend into the functional space.
In particular, we present a functional feedforward neural network to predict a scalar response with functional and scalar covariates. We developed a methodology which showed the steps required to compute the functional weight for the neural network. Multiple examples were provided which showed that the functional neural network outperformed a number of other functional models and multivariate methods with respect to the mean squared prediction error. It was also shown through simulation studies that the recovery of the true functional weight is better done by the functional neural network than the functional linear model when the true relationship is nonlinear.
To extend this project, algorithms can be developed for other combinations of input and output types, such as function-on-function regression models (Morris, 2015). Moreover, one can consider adding constraints to the first-layer neurons via penalization or other methods.
SUPPLEMENTAL MATERIALS
Data and R Codes: Data and R code used for each application and simulation study presented in this paper; this includes the source code for the upcoming R package. A README file is included which describes each file. (FNNcode.zip File)
Supplemental Document: Document containing the proof for Theorem 1, a table with descriptions of various parameters, model configurations, and additional simulation results. (supplementalFNN.pdf File)
References

Semi-functional partial linear regression. Statistics & Probability Letters 76 (11), pp. 1102–1110.
Random forests. Machine Learning 45 (1), pp. 5–32.
Functional linear model. Statistics & Probability Letters 45 (1), pp. 11–22.
Xgboost: extreme gradient boosting. R package version 0.4-2, pp. 1–4.
Keras. Note: https://keras.io
Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems 2 (4), pp. 303–314.
Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence 2 (2–3), pp. 113–127.
Statistical computing in functional data analysis: the R package fda.usc. Journal of Statistical Software 51 (4), pp. 1–28. ISSN 1548-7660.
Weight initialization methods for multilayer feedforward. In European Symposium on Artificial Neural Networks.
Nonparametric functional data analysis: theory and practice. Springer-Verlag, New York.
Projection pursuit regression. Journal of the American Statistical Association 76 (376), pp. 817–823.
Greedy function approximation: a gradient boosting machine. Annals of Statistics 29 (5), pp. 1189–1232.
Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature 405 (6789), pp. 947–951.
The influence of the sigmoid function parameters on the speed of backpropagation learning. In International Workshop on Artificial Neural Networks.
Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Functional single index models for longitudinal data. The Annals of Statistics 39 (1), pp. 362–388.
Weight value initialization for improving training speed in the backpropagation network. In IEEE International Joint Conference on Neural Networks.
Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.
Functional regression. Annual Review of Statistics and Its Application 2 (1), pp. 321–359.
Generalized functional linear models. The Annals of Statistics 33 (2), pp. 774–805.
PLS classification of functional data. Computational Statistics 22 (2), pp. 223–235.
Functional data analysis. Springer, New York.
Functional data analysis with R and MATLAB. Springer, New York.
Functional multilayer perceptron: a nonlinear tool for functional data analysis. Neural Networks 18 (1), pp. 45–60.
An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747.
Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science.
Linear regression analysis. John Wiley & Sons, Hoboken.
Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958.
An introduction to numerical analysis. Cambridge University Press, Cambridge.
Tecator meat sample dataset. StatLib Datasets Archive.
The elements of statistical learning: data mining, inference, and prediction. Springer, New York.
Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58 (1), pp. 267–288.
On early stopping in gradient descent learning. Constructive Approximation 26 (2), pp. 289–315.