1 Introduction
It is often of interest to predict a response $y$ from a set of inputs $x$. In many applications, the relationship between $x$ and $y$
can be quite complex, and its functional form may be difficult to know a priori. In low- and moderate-dimensional settings, there are many methods that have been effective for estimating such complex relationships. For classification problems, popular methods include kernel extensions of Support Vector Machines,
nearest neighbors, and classification trees (Cristianini and Shawe-Taylor, 2000; Hofmann et al., 2008); for regression problems, popular methods include spline regression, kernel regression, and regression trees (Nadaraya, 1964; Watson, 1964; Breiman et al., 1984). Neural networks have also proven to be particularly effective in problems from complex domains where other methods have had limited success (e.g. speech recognition, computer vision, and natural language processing, among others;
Graves et al., 2013; Krizhevsky et al., 2012; Szegedy et al., 2015; Socher et al., 2013; Mikolov et al., 2013). With the latest technological developments in biology and other fields, it is now very common to encounter high-dimensional data, where the number of features $p$
may far exceed the number of observations $n$. For example, in genome-wide and epigenome-wide studies, it is common to collect thousands or millions of features on each of hundreds or thousands of subjects. For general problems in this setting, where the appropriate network structure is unknown, neural networks are rarely used, since the number of training samples required for good performance is prohibitive. Instead, it is more typical to apply the Lasso (Tibshirani, 1996), its nonparametric extensions such as Sparse Additive Models (SpAM) (Ravikumar et al., 2007), and high-dimensional additive models (Meier et al., 2009) to solve high-dimensional problems. These methods typically model the data using the sum of a small number of univariate (or very low-dimensional) functions. Unfortunately, in actual scientific problems the response may depend on complex interactions between multiple covariates, and failing to model these interactions can result in highly biased estimates.

There are few, if any, solutions available for these high-dimensional problems with complex interactions. We aim to address this gap in the literature using penalized neural networks. In particular, we leverage a unique quality of neural nets that sets them apart from more traditional nonparametric methods: with relatively few parameters, a neural net is able to approximate models with multivariate interaction terms (Barron, 1993). This is in contrast to the exponential number of terms required by polynomial or spline regression to model general multivariate functions. In this paper, we propose controlling the size of the neural network space using the sparse group lasso penalty, a mixed penalty combining the lasso and the group lasso (Simon et al., 2013). Our method, sparse-input neural networks, groups weights connected to the same input node and consequently selects a small subset of informative features for modeling a potentially complex response.
We also provide a generalized gradient descent algorithm for training the network.
To understand the theoretical properties of these sparse-input neural networks, we prove oracle inequalities that give probabilistic performance guarantees, assuming we have reached a global minimizer of our penalized criterion. We show that if the response is best approximated by a sparse neural network that uses only $s$ of the features, the difference between the prediction error of a sparse-input neural network and the prediction error of the best neural network shrinks at a rate that depends on the number of relevant features $s$ and depends on the total number of features $p$ only logarithmically (here we treat the number of hidden nodes as fixed). Hence the prediction error of sparse-input neural networks grows slowly with the number of features, making them suitable for high-dimensional problems. In addition, we provide convergence rates for the weights connected to the irrelevant input features. To our knowledge, there have been no theoretical results up to now on the shrinkage rates of irrelevant model parameters for neural networks.
The paper is organized as follows. Section 2 describes our extension of neural networks for high-dimensional problems and an algorithm to train the network. A discussion of work related to our method is provided in Section 3. Section 4 establishes theoretical guarantees for our model. Finally, we present simulation studies in Section 5 and data analyses in Section 6 that show sparse-input neural networks can significantly outperform more traditional nonparametric high-dimensional methods when the true function is composed of complex higher-order interaction terms.
We note that this paper is an extension of a paper submitted to the Principled Approaches to Deep Learning Workshop at ICML 2017. The workshop paper only considered sparse-input neural networks for regression problems, and the proofs were specific to the lasso. Here we consider both regression and classification problems and prove oracle inequalities for the sparse group lasso and more general loss functions.
2 Sparse-input neural networks
In this paper, we consider neural networks with a single hidden layer of $m$ hidden nodes and a single output node. The neural network parameters are denoted $\theta = (W, b, \eta, c)$, where $W \in \mathbb{R}^{m \times p}$ is the matrix of first-layer input weights, $b \in \mathbb{R}^m$ is the vector of intercept terms for the hidden nodes, $\eta \in \mathbb{R}^m$ is the vector of second-layer weights, and $c \in \mathbb{R}$ is the final intercept term. Let $W_{k\cdot}$ denote the weights tied to the $k$th hidden node and $W_{\cdot j}$ denote the weights tied to the
$j$th input feature. Data is fed into the neural network, and output values from each node are fed as inputs into nodes in the following layer. The hidden nodes are associated with an activation function
$\sigma$, which we assume to be bounded and differentiable over $\mathbb{R}$. For regression problems, the neural network maps input $x$ to
$$f_\theta(x) = c + \eta^\top \sigma(Wx + b), \tag{1}$$
where $\sigma$ is applied element-wise.
A neural network for classification is similar to (1), except that we also apply the sigmoid function $\psi(z) = 1/(1 + e^{-z})$
at the end to ensure output values are between zero and one:
$$g_\theta(x) = \psi\big(f_\theta(x)\big). \tag{2}$$
To compare neural networks, let us denote the loss function as $\ell(y, \hat{y})$, where $y$ represents the true value and $\hat{y}$ represents the prediction from the neural network. We suppose that $\ell$ is differentiable with respect to $\hat{y}$. For regression problems, the loss function is typically the squared error loss $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$. For classification problems, the loss function is typically the logistic loss $\ell(y, \hat{y}) = -y \log \hat{y} - (1 - y)\log(1 - \hat{y})$.
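To make the model concrete, here is a minimal sketch of the network maps in (1) and (2) and the two loss functions. The function names, and the choice of tanh as the bounded differentiable activation, are our assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def forward(x, W, b, eta, c):
    """Single-hidden-layer regression network, as in equation (1):
    hidden activations sigma(W x + b), then a linear output layer."""
    sigma = np.tanh  # any bounded, differentiable activation
    return c + eta @ sigma(W @ x + b)

def forward_classify(x, W, b, eta, c):
    """Classification network, as in equation (2): squash the
    regression output through the sigmoid so predictions lie in (0, 1)."""
    f = forward(x, W, b, eta, c)
    return 1.0 / (1.0 + np.exp(-f))

def squared_error(y, y_hat):
    """Typical regression loss."""
    return 0.5 * (y - y_hat) ** 2

def logistic_loss(y, p_hat):
    """Typical classification loss for y in {0, 1}."""
    return -(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
```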
Given observations $(x_i, y_i)$ for $i = 1, \dots, n$, we propose fitting a sparse-input neural network where we penalize the first-layer weights using a sparse group lasso penalty and the second-layer weights using a ridge penalty as follows:
$$\hat{\theta} = \arg\min_{\theta}\ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_\theta(x_i)\big) + \frac{\lambda_0}{2} \|\eta\|_2^2 + \lambda\, \Omega_\alpha(W) \tag{3}$$
(with $g_\theta$ in place of $f_\theta$ for classification), where $\lambda_0, \lambda \ge 0$, $\alpha \in [0, 1]$, and $\Omega_\alpha$ is the sparse group lasso
$$\Omega_\alpha(W) = \alpha \sum_{j=1}^{p} \|W_{\cdot j}\|_2 + (1 - \alpha) \|W\|_1.$$
When $\alpha = 0$, the sparse group lasso reduces to the Lasso, and when $\alpha = 1$, it reduces to the group lasso (Yuan and Lin, 2006).
We have three types of penalties in this criterion. The ridge penalty on $\eta$ serves to control the magnitude of the weights that are not in the first layer. The sparse group lasso is a mixture of the group lasso and lasso penalties (Simon et al., 2013). The group lasso penalty encourages sparsity at the input level; that is, it encourages the entire vector $W_{\cdot j}$ to be zero. The lasso penalty encourages sparsity across all the weights, so the hidden nodes will be connected to only a few input nodes. The parameter $\alpha$ allows us to control the level of sparsity at the level of the inputs vs. the individual weights. Pictorially, (3) encourages fitting neural networks like that in Figure 1.
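The penalty structure described above can be sketched as follows. The function name, and the convention that $\alpha$ weights the group-lasso term (so $\alpha = 0$ gives the Lasso), are assumptions of this sketch.

```python
import numpy as np

def sparse_group_lasso(W, alpha):
    """Sparse group lasso on the first-layer weight matrix W
    (hidden nodes x input features). Each column W[:, j] -- the
    weights leaving input feature j -- forms one group."""
    group_part = np.sum(np.linalg.norm(W, axis=0))  # sum_j ||W[:, j]||_2
    lasso_part = np.sum(np.abs(W))                  # ||W||_1
    return alpha * group_part + (1 - alpha) * lasso_part
```

Note that zeroing an entire column removes the corresponding input feature from the model, which is exactly the group-level sparsity the text describes.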
One could also consider adding a sparsity penalty to the second-layer weights. This is useful if the neural network has a large number of hidden nodes that need to be pruned away. However, in the high-dimensional setting, using a small number of hidden nodes generally works well.
2.1 Learning
Sparse-input neural networks can be trained using generalized gradient descent (Daubechies et al., 2004; Beck and Teboulle, 2009; Nesterov, 2004). Though generalized gradient descent was originally developed for convex problems, we can apply the recent work by Gong et al. (2013) to find a critical point of a nonconvex objective function. Here we specialize their proposal, called GIST, to solve (3). The resulting algorithm is similar to that in Alvarez and Salzmann (2016).
Let $L_n(\theta)$ be the smooth component of the loss function in (3), defined as follows:
$$L_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_\theta(x_i)\big) + \frac{\lambda_0}{2} \|\eta\|_2^2. \tag{4}$$
Let $S(\cdot, \tau)$ be the coordinate-wise soft-thresholding operator
$$S(z, \tau)_k = \operatorname{sign}(z_k)\big(|z_k| - \tau\big)_+. \tag{5}$$
The algorithm for training a sparse-input neural network is given in Algorithm 1. The proximal gradient step is composed of three substeps. The first substep (7) performs a gradient update only for the smooth component of the loss; this gradient can be computed using the standard backpropagation algorithm. The second and third substeps, (8) and (9), are the proximal operations for the sparse group lasso (Simon et al., 2013): a soft-thresholding operation followed by a soft-scaling operation on each $W_{\cdot j}$.
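The soft-thresholding and soft-scaling substeps have simple closed forms. Below is a minimal sketch, under our assumed convention that the lasso part of the penalty carries weight $1 - \alpha$ and the group part weight $\alpha$.

```python
import numpy as np

def soft_threshold(z, tau):
    """Coordinate-wise soft-thresholding, as in equation (5)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def sgl_prox(W, t, lam, alpha):
    """Proximal operator of the sparse group lasso on the first-layer
    weights: soft-threshold every entry (lasso part), then soft-scale
    each column, i.e. each input-feature group (group-lasso part)."""
    W = soft_threshold(W, t * lam * (1 - alpha))
    norms = np.linalg.norm(W, axis=0)
    scale = np.maximum(1.0 - t * lam * alpha / np.where(norms > 0, norms, 1.0), 0.0)
    return W * scale  # columns with zero norm remain zero
```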
The initial step size at each iteration is chosen from a fixed range. In our implementation, we simply used a constant initial step size, though one can also choose it adaptively (Barzilai and Borwein, 1988; Gong et al., 2013).
We choose the step size according to a monotone line search criterion, which accepts step size $t$ if the following condition is satisfied:
$$F\big(\theta^{(k+1)}\big) \le F\big(\theta^{(k)}\big) - \frac{\delta}{2t}\big\|\theta^{(k+1)} - \theta^{(k)}\big\|_2^2, \tag{6}$$
where $F$ is the objective function of (3) and $\delta \in (0, 1)$. This line search criterion guarantees that Algorithm 1 will converge to a critical point, i.e. a point where the subdifferential contains zero (Gong et al., 2013).
Finally, another option for training sparse-input neural networks is to apply the accelerated generalized gradient descent framework for nonconvex objective functions developed in Ghadimi and Lan (2016). These methods are guaranteed to converge to a critical point and have accelerated rate guarantees.
In short, these generalized gradient descent algorithms provide an efficient way to train sparse-input neural networks.
$$\tilde{\theta} = \theta^{(k)} - t\, \nabla L_n\big(\theta^{(k)}\big) \tag{7}$$
$$\check{W} = S\big(\tilde{W},\ t\lambda(1 - \alpha)\big) \tag{8}$$
$$W^{(k+1)}_{\cdot j} = \left(1 - \frac{t \lambda \alpha}{\|\check{W}_{\cdot j}\|_2}\right)_+ \check{W}_{\cdot j}, \qquad j = 1, \dots, p \tag{9}$$
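Putting the pieces together, here is a sketch of the full training loop, with the three substeps (7)-(9) and the monotone line search (6), specialized to squared-error loss. The gradient of the smooth part is computed analytically (equivalent to backpropagation for one hidden layer). All names, default values, and the tanh activation are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def fit_sparse_input_nn(X, y, m=5, lam=0.1, alpha=0.8, lam0=1e-3,
                        t0=1.0, delta=0.5, n_iter=500, seed=0):
    """Proximal gradient descent with monotone backtracking line search."""
    n, p = X.shape
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((m, p)); b = np.zeros(m)
    eta = 0.1 * rng.standard_normal(m); c = 0.0

    def smooth_loss(W, b, eta, c):
        r = c + np.tanh(X @ W.T + b) @ eta - y
        return 0.5 * np.mean(r ** 2) + 0.5 * lam0 * (eta @ eta)

    def penalty(W):
        return lam * (alpha * np.linalg.norm(W, axis=0).sum()
                      + (1 - alpha) * np.abs(W).sum())

    for _ in range(n_iter):
        H = np.tanh(X @ W.T + b)
        r = c + H @ eta - y                      # residuals
        G = (r[:, None] * (1 - H ** 2)) * eta    # backprop through tanh
        gW = G.T @ X / n
        gb = G.mean(axis=0)
        geta = H.T @ r / n + lam0 * eta
        gc = r.mean()

        F_old = smooth_loss(W, b, eta, c) + penalty(W)
        t = t0
        while True:
            # (7) gradient step on the smooth component
            W_new = W - t * gW
            b_new, eta_new, c_new = b - t * gb, eta - t * geta, c - t * gc
            # (8) soft-threshold the first-layer weights (lasso part)
            W_new = soft_threshold(W_new, t * lam * (1 - alpha))
            # (9) soft-scale each input-feature group (group-lasso part)
            norms = np.linalg.norm(W_new, axis=0)
            W_new = W_new * np.maximum(
                1 - t * lam * alpha / np.where(norms > 0, norms, 1), 0)
            # (6) monotone acceptance criterion, else halve the step size
            diff = (np.sum((W_new - W) ** 2) + np.sum((b_new - b) ** 2)
                    + np.sum((eta_new - eta) ** 2) + (c_new - c) ** 2)
            if (smooth_loss(W_new, b_new, eta_new, c_new) + penalty(W_new)
                    <= F_old - delta / (2 * t) * diff) or t < 1e-10:
                break
            t *= 0.5
        W, b, eta, c = W_new, b_new, eta_new, c_new
    return W, b, eta, c
```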
2.2 Tuning Hyperparameters
There are four hyperparameters for fitting a sparse-input neural network: the ridge penalty parameter $\lambda_0$, the sparse group lasso mixing parameter $\alpha$, the sparse group lasso penalty parameter $\lambda$, and the number of hidden nodes $m$. These should be tuned to ensure low generalization error of the model. Typically, hyperparameters are tuned using cross-validation. With a few hyperparameters, we can tune them using a simple grid search. If there are many hyperparameters, one can use Nelder-Mead (Nelder and Mead, 1965) or Bayesian optimization methods (Snoek et al., 2012).
For sparse-input neural networks, we found that the hyperparameters can be tuned by searching over a relatively small space, so a simple grid search suffices. More specifically, the optimal number of hidden nodes for high-dimensional problems tends to be small, so only a few candidate values need be tested. In addition, since there are only a small number of weights in the second layer of the neural network, the estimation error is quite robust to different ridge penalty parameter values. In our simulations and data analyses, we pre-tune the ridge penalty parameter and keep it fixed.
The most important parameters to tune are the penalty parameters for the sparse group lasso, $\lambda$ and $\alpha$. We found that the lasso penalty can easily over-penalize the input weights and set everything to zero. Thus we only use candidate values where $\alpha$ is greater than 0.5.
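A grid search of this kind is straightforward to sketch. Here `fit` and `validation_error` are hypothetical placeholders for whatever training routine and held-out scoring are used; the grids are illustrative.

```python
import itertools
import numpy as np

def grid_search(fit, validation_error, lam_grid, alpha_grid):
    """Exhaustive search over (lam, alpha) pairs: train a model for
    each pair, score it on held-out data, keep the best."""
    best = (None, np.inf)
    for lam, alpha in itertools.product(lam_grid, alpha_grid):
        model = fit(lam=lam, alpha=alpha)
        err = validation_error(model)
        if err < best[1]:
            best = ((lam, alpha), err)
    return best

# Example grids: restrict alpha to values above 0.5 so the lasso
# part does not over-penalize the input weights.
lam_grid = np.logspace(-3, 0, 6)
alpha_grid = [0.6, 0.8, 0.95]
```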
3 Related Work
A number of other authors have applied lasso and group lasso penalties to neural networks. That work, however, has largely focused on learning network structure rather than feature selection.
Sun (1999) was one of the earliest papers to fit a neural network with a lasso penalty over all the weights. Scardapane et al. (2016) and Alvarez and Salzmann (2016) proposed using the sparse group lasso over all the weights in a deep neural network in order to learn a more compact network; this was in the low-dimensional setting where $n > p$. The recent work by Bach (2017) is most closely aligned with ours: he considers convex neural networks, which have a single hidden layer with an unbounded number of hidden nodes, and shows that, theoretically, a lasso penalty on the input weights of such networks should perform well in high-dimensional settings. The recent work in Zhang et al. (2016) examined the utility of regularization in deep learning. The authors found that using a ridge penalty on the internal network architecture was not necessary for a neural network to have good performance; instead, using an appropriate network architecture led to larger decreases in generalization error. Our results support this claim, since the sparse group lasso also performs structure learning.
Previously, statistical convergence rates for neural networks have been established for problems where the norm of the weights is constrained. These results show that the estimation error of neural networks grows with the logarithm of the number of input nodes (Bartlett, 1998; Anthony and Bartlett, 2009). Our convergence rate results also include such a logarithmic term. Moreover, we improve upon previously established convergence rates by showing that the excess estimation error shrinks at a faster rate. This faster convergence rate allows us to additionally bound the rate at which the norm of the irrelevant network weights shrinks to zero.
Our proofs for statistical convergence rates are inspired by Städler et al. (2010), which considers $\ell_1$-penalization for mixture regression models. The techniques used in their paper are relevant because neural networks can be thought of as a mixture of regressions. Significant additional work was required, however, as the identifiability condition assumed in Städler et al. (2010) does not hold for neural networks.
4 Theoretical Guarantees
In this section, we provide probabilistic finite-sample upper bounds on the prediction error of sparse-input neural networks. Instead of using a ridge penalty, we constrain the norm of the second-layer weights. Our sparse-input neural network problem now becomes
$$\hat{\theta} = \arg\min_{\theta:\ \|\eta\|_2 \le K}\ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i, f_\theta(x_i)\big) + \lambda\, \Omega_\alpha(W) \tag{10}$$
for a constant $K > 0$. All of our proofs are in the Supplementary Materials.
Notice that (10) assumes the estimator is a global minimizer of a nonconvex objective function. Since nonconvex problems are typically computationally intractable to solve, there is admittedly a disconnect between the computational algorithm we have proposed and the theoretical results we establish in this section. Though it is desirable to establish theoretical properties for estimators arising from local optima, it is difficult to characterize their behavior, and up to now much of the theory for the Lasso has depended on the estimator being a global minimizer. We do not address this issue here and leave it for future research.
4.1 Problem Setup and Notation
Let $P$ be the joint distribution of the covariates $x$ and the response $y$, where $x$ has support $\mathcal{X} \subseteq \mathbb{R}^p$. We assume $\mathcal{X}$ contains some open set. Let $P_x$ be the marginal distribution of $x$. Given $n$ observations, we denote the empirical distribution as $P_n$. The neural network can be defined either as in (1) or (2), where we suppose the activation function is the sigmoid function. Note that $\tanh$ neural networks are included in this framework, since the $\tanh$
and sigmoid functions are related by a linear transformation.
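The relationship referred to here can be made explicit. Writing $\psi(z) = 1/(1 + e^{-z})$ for the sigmoid,
$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = \frac{1 - e^{-2z}}{1 + e^{-2z}} = \frac{2}{1 + e^{-2z}} - 1 = 2\,\psi(2z) - 1,$$
so a tanh network can be rewritten as a sigmoid network by doubling the first-layer weights and intercepts and absorbing the remaining affine map into the second layer: $c + \eta^\top \tanh(Wx + b) = \big(c - \eta^\top \mathbf{1}\big) + (2\eta)^\top \psi(2Wx + 2b)$.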
Next we define a neural network equivalence class. Given a parameter $\theta$, the set of equivalent parameterizations is
$$\mathcal{C}(\theta) = \big\{\theta' : f_{\theta'}(x) = f_\theta(x) \text{ for all } x \in \mathcal{X}\big\}. \tag{11}$$
If $\theta$ satisfies the following independence property, then $\mathcal{C}(\theta)$ only contains parameterizations that are sign-flips or permutations of the hidden nodes of $\theta$, and thus has cardinality exactly $2^m m!$ (Sontag, 1996; Anthony and Bartlett, 2009). The independence property states that the first-layer weights satisfy $W_{k\cdot} \neq 0$ for all $k$ and $W_{k\cdot} \neq \pm W_{k'\cdot}$ for all $k \neq k'$.
Let the set of optimal neural networks that minimize the expected loss be denoted
$$\Theta^* = \arg\min_{\theta:\ \|\eta\|_2 \le K}\ \mathbb{E}_P\big[\ell\big(y, f_\theta(x)\big)\big]. \tag{12}$$
We suppose that $\Theta^*$ is the union of $J$ equivalence classes, where all of the optimal neural networks satisfy the independence property. We will also suppose that $K$ is chosen large enough that, for every element of $\Theta^*$, the gradient of the expected loss is zero.
Next, let us suppose that the expected loss can be minimized using neural networks that employ only a few input features $S \subseteq \{1, \dots, p\}$, which we also call the "relevant" features. That is, we suppose that the features in $S^c = \{1, \dots, p\} \setminus S$ have weights equal to zero for all $\theta^* \in \Theta^*$; thus we call $S^c$ the irrelevant features. Note that all neural nets from the same equivalence class have the same set of features with nonzero weights (because members of the equivalence class are permutations and sign-flips of one another). Let $s$ denote the cardinality of $S$. In this work we are particularly interested in sparse optimal neural networks, i.e. those with small $s$.
Let the minimum distance from any $\theta$ to $\Theta^*$ be defined as
$$d(\theta) = \min_{\theta^* \in \Theta^*} \|\theta - \theta^*\|_2,$$
and let the minimizer be denoted $\theta^*(\theta)$.
For notational shorthand, let the loss function be denoted $\ell_\theta(x, y) = \ell\big(y, f_\theta(x)\big)$. Using this notation, the excess loss of a neural network with parameters $\theta$ is defined as
$$\mathcal{E}(\theta) = \mathbb{E}_P\big[\ell_\theta(x, y)\big] - \mathbb{E}_P\big[\ell_{\theta^*}(x, y)\big] \quad \text{for any } \theta^* \in \Theta^*. \tag{13}$$
Recall that (13) is well-defined since $\mathbb{E}_P[\ell_{\theta^*}(x, y)]$ is the same for all $\theta^* \in \Theta^*$. Finally, let $W_S$ denote the weights tied to the input nodes in $S$ and $W_{S^c}$ denote the weights tied to the input nodes in $S^c$.
4.2 Results
To understand the behavior of the neural network estimated via (10), we derive an upper bound on the combination of its excess loss and the norm of the irrelevant neural network weights. Our proof technique is inspired by Städler et al. (2010); however, significant adaptations were needed to deal with the complex loss surface and equivalence classes of neural networks.
In order for our results to hold, we make the assumption that the expected loss is locally strongly convex at all $\theta^* \in \Theta^*$. Since this only specifies the local behavior at $\Theta^*$, the assumption is relatively weak. More formally, this assumption states:
Condition 1
Let the neural network parameters be ordered such that the weights $W_S$ tied to the relevant features come first. There is a constant $h > 0$ that may depend on $\Theta^*$ and the distribution $P$, but does not depend on $p$, such that for all $\theta^* \in \Theta^*$,
$$\nabla^2\, \mathbb{E}_P\big[\ell_\theta(x, y)\big]\Big|_{\theta = \theta^*} \succeq h \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}, \tag{14}$$
where $I$ is an identity matrix, the $0$s are appropriately sized zero matrices, and $A \succeq B$ means that $A - B$ is a positive semidefinite matrix.
In addition, we need the following identifiability condition.
Condition 2
For all $\theta \notin \Theta^*$ in the restricted set described below, there is an $\epsilon > 0$ that may depend on $\Theta^*$ and the distribution $P$, but does not depend on $p$, bounding the excess loss $\mathcal{E}(\theta)$ from below.
Condition 2 places a lower bound on the excess loss of neural networks outside the set of optimal neural networks $\Theta^*$. However, we only need this lower bound to apply to neural networks where the weight of the irrelevant nodes is dominated by the difference between $\theta$ and $\theta^*(\theta)$ with respect to the relevant coordinates. By restricting to this smaller set of neural networks, it is more realistic to claim that $\epsilon$ is independent of $p$. This condition is somewhat reflective of the typical compatibility conditions used in proving theoretical results for the Lasso (Bühlmann and Van De Geer, 2011).
Finally, we require a bound on the third derivative of the expected loss. It is easy to show that this condition is satisfied when the loss function is mean squared error or logistic loss.
Condition 3
The third derivative of the expected loss function is bounded uniformly over a neighborhood of $\Theta^*$ by some constant $G > 0$:
$$\sup_{\theta}\ \big\|\nabla^3\, \mathbb{E}_P\big[\ell_\theta(x, y)\big]\big\| \le G. \tag{15}$$
With the three conditions above, we have the following theorem.
Theorem 1
Theorem 1 simultaneously bounds the excess loss and the norm of the irrelevant weights. Consider the case where the identifiability constant $\epsilon$ is sufficiently large. Then the above theorem states that for $0 < \alpha < 1$ (i.e. we are using the sparse group lasso and not the degenerate cases with only the lasso or only the group lasso), the excess loss and the norm of $W_{S^c}$ shrink at rates governed by $\lambda$. If $\lambda$ shrinks as the number of samples increases, these quantities will go to zero. The convergence rate of the excess risk is faster for functions that are best approximated by neural networks that are more sparse (i.e. $s$ is small). However, we must choose $\lambda$
carefully so that the event underlying the upper bound in (16) occurs with high probability. The number of hidden nodes $m$
must also be chosen carefully, since it determines how well we can approximate the true function (Barron, 1994). This enters our rate implicitly, as the "excess risk" is relative to the best neural net with $m$ hidden nodes. In order to make claims regarding the appropriate choice of $\lambda$, we specialize the next theorem to classification and regression settings with particular loss functions. For the regression scenario, we need the additional assumption that the random noise is sub-Gaussian. Note that the following result bounds the excess risk with the number of hidden nodes held fixed; it does not handle the approximation error incurred from using a fixed number of hidden nodes. The proof relies on techniques from empirical process theory.
Theorem 2
Consider the following two settings:

Regression setting: Let $y = f_{\theta^*}(x) + \epsilon$, where
$\epsilon$
is a sub-Gaussian random variable with mean zero. That is,
$\epsilon$ satisfies, for some constants $C$ and $\sigma$,
$$P(|\epsilon| > t) \le C e^{-t^2/\sigma^2} \quad \text{for all } t > 0. \tag{19}$$
We suppose that $\epsilon$ is independent of $x$.

Classification setting: Let $y$ take on values in $\{0, 1\}$.
Suppose that the set of optimal neural networks $\Theta^*$ is composed of $J$ equivalence classes. For both settings, there exist constants, depending on the loss and the distribution, such that for
(20) 
we have for any , the probability is at least
where .
If the identifiability constant $\epsilon$ is not too small, Theorems 1 and 2 state that if $\lambda$ is chosen according to (20), the excess loss and the norm of the irrelevant weights both converge to zero,
modulo log terms that do not depend on $p$. Note that the total number of features $p$ only enters these rates through a log term.
To the best of our knowledge, our results are the first to bound the convergence rate of the norm of the irrelevant weights. Though our bounds might not be tight, they help us understand why we observe better performance from sparse-input neural networks compared to standard neural networks in our simulation studies and data analyses.
We hope to improve these convergence rate bounds in future research. In particular, we would like to shrink the exponents in these rates, since they become very slow for moderately sized problems.
In addition, our results depend on a couple of assumptions and constants. For our rates to be useful, we need to make sure the constants $h$ and $\epsilon$ in Conditions 1 and 2 do not shrink too quickly with $p$. Also, we assumed that there were $J$ equivalence classes in $\Theta^*$, but it would be better to upper bound how $J$ grows with the neural network size. Sontag (1996) upper bounds the number of critical points of the empirical loss function; that result suggests $J$ grows exponentially in the network size, which would not change our convergence rates. However, if $J$ grows faster, or if $J$ is infinite, the convergence rates may be slower.
5 Simulation study
We now present a simulation study to understand how sparse-input neural networks compare against other nonparametric methods. We focus on regression problems, as the results for classification are quite similar. We consider scenarios where the true function is the sum of univariate functions, a complex multivariate function, and a function that is in between these two extremes. In all cases, the true function is sparse and only depends on the first six variables. We compare sparse-input neural networks to five methods. Some of these methods we term "oracle methods": these take only the first six variables as input.

neural network with a single hidden layer and a ridge penalty on all the weights (ridge-only neural network);

oracle ridgeonly neural network;

oracle additive univariate model of the form $\sum_{j=1}^{6} f_j(x_j)$, where the $f_j$ are fit using additive smoothing splines;

oracle general multivariate model $f(x_1, \dots, x_6)$, where $f$ is fit using a 6-variate smoothing spline;

Sparse Additive Model (SpAM), which fits an additive univariate model with a sparsity-inducing penalty (Ravikumar et al., 2007).
The oracle methods are not competitors in practice, since they use information that will not be available; however, they give us some idea of how well our feature selection is working. Performance of the methods is assessed by the mean squared error (MSE) in estimating the true function, evaluated over 2000 covariate vectors drawn from the covariate distribution.
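This evaluation protocol can be sketched as follows. The helper name is ours, and the uniform test-covariate distribution matches the simulation design described next; both are assumptions of this sketch.

```python
import numpy as np

def estimation_mse(f_hat, f_true, p, n_test=2000, seed=0):
    """MSE of an estimated regression function against the truth,
    evaluated over freshly drawn test covariates (uniform on [0,1]^p)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(n_test, p))
    return np.mean((f_hat(X) - f_true(X)) ** 2)
```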
5.1 Simulation Settings
For all of the simulations, we generated the data according to the model $y = f(x) + \epsilon$, where $\epsilon$ is mean-zero noise scaled such that the signal-to-noise ratio is 2. We sampled each covariate
independently from the standard uniform distribution. We generated a training set, a validation set, and a test set. The validation set was a quarter of the training set size, and the test set was composed of 2000 observations. The penalty parameters of all the methods were tuned using the validation set. For neural networks, we also tuned the number of nodes in the hidden layer (5 or 10 nodes). We used
a sigmoidal activation function in our neural networks. Each simulation was repeated 20 times.
5.2 Additive univariate model
In the first scenario, the true function is the sum of univariate functions of the six relevant covariates.
Since the true model is additive, we expect the additive univariate oracle to perform best, followed by SpAM. As shown in Figure 2, this is indeed the case.
We find that sparse-input neural networks also perform quite well. Thus, if we are unsure whether the true function is the sum of univariate functions, a sparse-input neural network can be a good option. In addition, we notice that the performance of sparse-input neural networks tends to track the oracle neural network and the multivariate oracle. At small sample sizes, sparse-input neural networks and oracle neural networks perform better than the multivariate oracle, as there is not enough data to support fitting a completely unstructured 6-variate smoother. As the number of samples increases, the multivariate oracle overtakes the sparse-input neural network, since it knows which features are truly relevant. We observe similar trends in the next two scenarios. The ridge-only neural network performs poorly in this scenario: without a sparsity-inducing penalty, it is unable to determine which variables are relevant.
Figure 2 also shows the norm of the first-layer weights in the sparse-input neural network, stratified into those connected to the relevant variables and those connected to the remaining irrelevant variables. In this example, the norm of the irrelevant weights appears to approach zero as the number of samples increases, in accordance with our theoretical results in Section 4.
Figure 2 panels: (a) Additive Univariate; (b) Complex Multivariate; (c) High-dimensional.
5.3 Complex multivariate model
In the second simulation, we use a sparse, complex multivariate regression function of the six relevant covariates.
Here we expect the general multivariate oracle to perform best at large sample sizes, which is confirmed in Figure 2. Similar to the results in Section 5.2, the performance of sparse-input neural networks nearly tracks the trajectory of the oracle neural networks and the general multivariate oracle. As expected, the additive univariate oracle and SpAM perform very poorly: their MSEs flatten out very quickly due to the bias from assuming an additive univariate model. In fact, we see that given a large enough training set, even a ridge-only neural network can outperform the additive univariate oracle and SpAM.
Finally, we note that, as before, the norm of the irrelevant weights in the sparse-input neural network decreases toward zero as the number of samples increases.
5.4 Highdimensional additive multivariate model
Finally, we consider a setting with a large number of input features. We use a regression function that is the sum of 3- and 4-variate functions.
This places it between the simple additive univariate function in the first scenario and the complex 6variate function in the second scenario.
As shown in Figure 2, SpAM and sparse-input neural networks perform similarly in small samples and diverge at larger sample sizes. In particular, the sparse-input neural network starts outperforming SpAM as the oracle neural network and general multivariate oracle start outperforming the simple additive univariate oracle. This makes intuitive sense: at this point, the irreducible bias from the additive univariate model begins to exceed the additional variance incurred from estimating a multivariate function. As before, the neural network trained with only a ridge penalty cannot take advantage of sparsity and performs poorly.
6 Data analyses
We now present two analyses of high-throughput genomic data sets. The first is a regression problem and the second is a classification problem.
6.1 Regression problem: prediction of riboflavin production rates
We first consider a high-throughput genomic data set for the production of riboflavin in Bacillus subtilis (Bühlmann et al., 2014). For each of the 71 experimental settings, we have expression profiles of 4,088 genes, as well as standardized log-transformed riboflavin production rates. Our goal is to predict the riboflavin production rate. We randomly split the dataset such that the test set makes up one-fifth of the data and the rest is used for training.
We compare the performance of a sparse-input neural network to a ridge-penalized neural network, SpAM, and a linear model with a lasso penalty. In addition, we fit two special cases of sparse-input neural networks: a lasso-only version and a group-lasso-only version. The penalty parameters of all methods were tuned via 5-fold cross-validation. All the neural networks used three hidden nodes.
Table 1 shows the average performance across 30 random splits of the dataset, as measured by the average MSE on the test set. Sparse-input neural networks had the smallest MSE of all the methods, and this difference was statistically significant. These results suggest that riboflavin production rates are not well approximated by an additive univariate or a linear model. Instead, it is better to estimate higher-order interactions, even at such small sample sizes. Ridge-penalized neural networks performed the worst, presumably because they were unable to determine which features were relevant for modeling the response. The lasso-only and group-lasso-only sparse-input neural networks also did not perform as well. Thus it is valuable to encourage sparsity both at the group level and at the level of individual weights.
From this data analysis, we conclude that sparse-input neural networks can be quite effective in practice. Even though neural networks are not typically applied to such high-dimensional datasets, we see here that neural networks with proper regularization can significantly outperform traditional nonparametric regression methods.
Method  MSE on test set  Number of nonzero features
Sparse-input NN  0.124 (0.010)  52.7 (6.1)
Sparse-input NN: lasso-only  0.161 (0.016)  112.1 (30.5)
Sparse-input NN: group-lasso-only  0.144 (0.014)  56.1 (2.9)
Ridge-penalized NN  0.237 (0.018)  4088 (0)
SpAM  0.145 (0.014)  34.5 (1.3)
Linear model with Lasso  0.145 (0.017)  39.1 (3.3)
Table 1. Standard errors are given in parentheses.
6.2 Classification problem: prediction of molecular biology of Leukemia samples
Now we analyze gene expression data from adult patients with B- and T-cell acute lymphoblastic leukemia (Chiaretti et al., 2004). For each subject,
12,625 gene expression levels were measured using Affymetrix human 95Av2 arrays and normalized using the robust multichip average method. In addition, samples collected from each subject were classified according to their molecular biology. Our task is to use the gene expression data to predict which subjects had samples classified as NEG versus those that did not. In this dataset, a total of 74 subjects had samples classified as NEG. The data used in this analysis are available in the ALL package in Bioconductor.
We compare sparse-input neural networks against similar models as before, but now using SpAM specialized for classification tasks and a logistic model with the Lasso. We use the same settings as before to fit the neural networks and the same procedure to compare the performance of the different models.
Sparse-input neural networks performed better than both SpAM and logistic regression with the Lasso (Table 2). Sparse-input neural networks with only the group lasso did best over the thirty replicates, though their performance was not significantly different from that of sparse-input neural networks with the sparse group lasso. Surprisingly, SpAM selected models with many features, whereas sparse-input neural networks and logistic regression with the Lasso chose much more parsimonious models. Again, we see that ridge-penalized neural networks perform poorly, since they are unable to prune away irrelevant features.
Method  Error on test set  Number of nonzero features

Sparse-input NN  0.133 (0.010)  52.1 (1.1)
Sparse-input NN: lasso-only  0.174 (0.010)  67.7 (5.1)
Sparse-input NN: group-lasso-only  0.130 (0.010)  88.4 (13.0)
Ridge-penalized NN  0.191 (0.016)  12625 (0)
SpAM  0.153 (0.011)  242.6 (38.6)
Logistic regression with Lasso  0.150 (0.010)  44.1 (2.2)
7 Discussion
We have introduced sparse-input neural networks as a nonparametric regression and classification method for high-dimensional data, in which the first-layer weights are penalized with a sparse group lasso. When the true model is best approximated by a sparse network, we show that the fitted model using the Lasso has a prediction error that grows only logarithmically with the number of features. Thus sparse-input neural networks can be effective for modeling high-dimensional data. We have also provided empirical evidence via simulation studies and data analyses showing that sparse-input neural networks can outperform other, more traditional nonparametric methods. Our results show that neural networks can indeed be applied to high-dimensional datasets, as long as proper regularization is applied.
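The sparse group lasso penalty on the first-layer weights has a closed-form proximal operator, which is what makes proximal-gradient training of sparse-input networks practical: elementwise soft-thresholding (the lasso part) followed by groupwise soft-thresholding (the group lasso part), with one group per input feature. The sketch below is a generic implementation of that operator, not the spinn codebase's exact routine; the weight layout (rows of `W` index inputs) and the `step`/`lam`/`alpha` parameterization are assumptions for illustration.

```python
import numpy as np

def prox_sparse_group_lasso(W, step, lam, alpha):
    """Proximal operator of step * lam * (alpha * ||W||_1
    + (1 - alpha) * sum_j ||W[j, :]||_2), where row j of W holds the
    first-layer weights leaving input feature j.
    """
    # Lasso part: elementwise soft-thresholding.
    V = np.sign(W) * np.maximum(np.abs(W) - step * lam * alpha, 0.0)
    # Group lasso part: shrink each input's row toward zero as a group,
    # zeroing out the entire row (i.e., dropping the feature) when small.
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - step * lam * (1.0 - alpha) / np.maximum(norms, 1e-12), 0.0)
    return V * scale
```

Because the group term zeroes whole rows at once, an input whose outgoing weights are all shrunk to zero is effectively removed from the model, which is how the method performs feature selection.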
One drawback of sparse-input neural networks, and neural networks in general, is that they require a significant amount of time to train. Much of the training time is spent tuning the hyperparameters and testing different initializations, since the objective function is nonconvex. By contrast, since sparse additive models are convex and have fewer hyperparameters, they are faster to fit.
There are many directions for future research. Even though this paper only considers sparse-input neural networks with a single hidden layer, the estimation method extends easily to multiple hidden layers. It will be important to understand whether additional layers are useful in high-dimensional settings and how sparsity-inducing penalties can best be used to control the expressive power of deep networks.
In addition, the theoretical results in this paper assume the estimator is a global minimizer of the penalized empirical loss, but finding the global minimizer is typically a hard problem. In future work, we would like to analyze the behavior of neural networks when the estimator is a local minimizer. There are also a number of other assumptions required in our theoretical results that we would like to weaken or remove.
Up to now, there have been few, if any, methods that are effective for modeling complex multivariate interactions in a high-dimensional setting. Sparse-input neural networks try to fill this gap by leveraging the power of neural networks while controlling their complexity using sparsity-inducing penalties. Though neural networks have not been as popular in the statistical literature as in the machine learning and computer science literature, we hope this work shows that they have the potential to solve problems that traditional tools have had difficulty with. Moreover, the theory of neural networks is still not well understood and can benefit from the attention of statisticians.
Supplementary Materials
The code for fitting sparse-input neural networks is available at http://github.com/jjfeng/spinn.
Acknowledgements
Jean Feng was supported by NIH grants DP5OD019820 and T32CA206089. Noah Simon was supported by NIH grant DP5OD019820.
Appendix A Proofs
a.1 Proof of Theorem 1
The proof for Theorem 1 is composed of two main steps. First, we show that the excess loss of any is lower bounded by a quadratic function of the distance from to . We combine this with the definition of and to derive the result.
Lemma 1 (Quadratic lower bound)
By Taylor expansion and the assumption that the gradient of the loss function is zero at all , we have
(25) 
where is the Lagrange remainder. By Condition 1, we have that
(26) 
In addition, is bounded above by
(27)  
(28) 
where we used Condition 3 in the last line. Moreover, if (21) holds, then
(29)  
(30) 
Since , we can combine (28) and (30) to get
(31) 
Combining (26) and (31) gives us
Now we use Condition 2 and apply the Auxiliary Lemma in Städler et al. (2010). In particular, the Auxiliary Lemma states the following.
Auxiliary Lemma (Städler et al., 2010). Let have the following properties:
such that ,
, such that .
Then ,
(32) where
(33)
To use the Auxiliary Lemma, plug in .
Finally, we are ready to prove Theorem 1. [Proof of Theorem 1] For simplicity, we’ll denote , the closest point in to , by . By definition of ,
Rearranging, we get
Over the set , we have that
(34) 
Now we consider three possible cases.
Case 1:
The first term in the right hand side of (34) is clearly bounded by . Rearranging terms, we have
(35) 
As , we have
(36) 
Case 2:
In this case, (34) reduces to the following, after some rearranging of terms:
(37) 
Given the constraints of , we can show that (21) in Lemma 1 is satisfied. To see this, note that
(38) 
Also, we have from (37)
(39)  
(40) 
Using a similar argument, it is easy to show that
So Case 2 satisfies all the conditions in Lemma 1.
Now we split Case 2 into two subcases.
Case 2a:
In this case, (34) reduces to