Sparse-Input Neural Networks for High-dimensional Nonparametric Regression and Classification

by Jean Feng et al.
University of Washington

Neural networks are usually not the tool of choice for nonparametric high-dimensional problems where the number of input features is much larger than the number of observations. Though neural networks can approximate complex multivariate functions, they generally require a large number of training observations to obtain reasonable fits, unless one can learn the appropriate network structure. In this manuscript, we show that neural networks can be applied successfully to high-dimensional settings if the true function falls in a low dimensional subspace, and proper regularization is used. We propose fitting a neural network with a sparse group lasso penalty on the first-layer input weights, which results in a neural net that only uses a small subset of the original features. In addition, we characterize the statistical convergence of the penalized empirical risk minimizer to the optimal neural network: we show that the excess risk of this penalized estimator only grows with the logarithm of the number of input features; and we show that the weights of irrelevant features converge to zero. Via simulation studies and data analyses, we show that these sparse-input neural networks outperform existing nonparametric high-dimensional estimation methods when the data has complex higher-order interactions.




1 Introduction

It is often of interest to predict a response y from a set of inputs x. In many applications, the relationship between x and y can be quite complex, and its functional form may be difficult to know a priori. In low- and moderate-dimensional settings, there are many methods that are effective for estimating such complex relationships. For classification problems, popular methods include kernel extensions of support vector machines, k-nearest neighbors, and classification trees (Cristianini and Shawe-Taylor, 2000; Hofmann et al., 2008); for regression problems, popular methods include spline regression, kernel regression, and regression trees (Nadaraya, 1964; Watson, 1964; Breiman et al., 1984).

Neural networks have also proven to be particularly effective in problems from complex domains where other methods have had limited success, e.g. speech recognition, computer vision, and natural language processing, among others (Graves et al., 2013; Krizhevsky et al., 2012; Szegedy et al., 2015; Socher et al., 2013; Mikolov et al., 2013).

With the latest technological developments in biology and other fields, it is now very common to encounter high-dimensional data, where the number of features p may far exceed the number of observations n. For example, in genome-wide and epigenome-wide studies, it is common to collect thousands or millions of features on each of hundreds or thousands of subjects. For general problems in this setting, where the appropriate network structure is unknown, neural networks are rarely used since the number of training samples required for good performance is prohibitive. Instead, it is more typical to apply the Lasso (Tibshirani, 1996), its nonparametric extensions such as Sparse Additive Models (SpAM) (Ravikumar et al., 2007), and high-dimensional additive models (Meier et al., 2009) to solve high-dimensional problems. These methods typically model the data using the sum of a small number of univariate (or very low-dimensional) functions. Unfortunately, in actual scientific problems, the response may depend on complex interactions between multiple covariates, and failing to model these interactions can result in highly biased estimates.

There are few, if any, solutions available for these high-dimensional problems with complex interactions. We aim to address this gap in the literature using penalized neural networks. In particular, we leverage a unique quality of neural nets that sets them apart from more traditional nonparametric methods: with relatively few parameters, a neural net is able to approximate models with multivariate interaction terms (Barron, 1993). This is in contrast to the exponentially many terms required by polynomial or spline regression to model general multivariate functions. In this paper, we propose controlling the size of the neural network space using the sparse group lasso penalty, a mixture of lasso and group lasso penalties (Simon et al., 2013). Our method, sparse-input neural networks, groups weights connected to the same input node and consequently selects a small subset of informative features for modeling a potentially complex response. We also provide a generalized gradient descent algorithm for training the network.

To understand the theoretical properties of these sparse-input neural networks, we prove oracle inequalities that give probabilistic performance guarantees, assuming we have reached a global minimizer of our penalized criterion. We show that if the response is best approximated by a sparse neural network that uses only s of the p features, the difference between the prediction error of a sparse-input neural network and the prediction error of the best neural network shrinks at a rate that grows only logarithmically in p (here we treat the number of hidden nodes as fixed). Hence the prediction error of sparse-input neural networks grows slowly with the number of features, making them suitable for high-dimensional problems. In addition, we provide convergence rates for the weights connected to the irrelevant input features. To our knowledge, there have been no theoretical results on the shrinkage rates of irrelevant model parameters for neural networks up to now.

The paper is organized as follows. Section 2 describes our extension of neural networks for high-dimensional problems and an algorithm to train the network. A discussion of work related to our method is provided in Section 3. Section 4 establishes theoretical guarantees of our model. Finally, we present simulation studies in Section 5 and data analyses in Section 6 that show sparse-input neural networks can significantly outperform more traditional nonparametric high-dimensional methods when the true function is composed of complex higher-order interaction terms.

We note that this paper is an extension of a paper submitted to the Principled Approaches to Deep Learning Workshop at ICML 2017. The workshop paper only considered sparse-input neural networks for regression problems and the proofs were specific to the lasso. Here we consider both regression and classification problems and prove oracle inequalities for the sparse group lasso and more general loss functions.

2 Sparse-input neural networks

In this paper, we consider neural networks with a single hidden layer of m hidden nodes and a single output node. The neural network parameters are denoted θ = (W, b, η, η0), where W is the matrix of first-layer input weights, b is the vector of intercept terms for the hidden nodes, η is the vector of second-layer weights, and η0 is the final intercept term. Let W_k· denote the weights tied to the kth hidden node and W_·j denote the weights tied to the jth input feature. Data is fed into the neural network and output values from each node are fed as inputs into nodes in the following layer. The hidden nodes are associated with an activation function ψ, which we assume to be bounded and differentiable. For regression problems, the neural network maps input x to

f_θ(x) = η0 + Σ_{k=1}^m η_k ψ(W_k·ᵀ x + b_k).    (1)
A neural network for classification is similar to (1) except that we also apply the sigmoid function φ(z) = 1/(1 + e^{−z}) at the end to ensure output values are between zero and one:

p_θ(x) = φ(f_θ(x)).    (2)
To compare neural networks, let us denote the loss function as ℓ(y, ŷ), where y represents the true value and ŷ represents the prediction from the neural network. We suppose that ℓ is differentiable with respect to ŷ. For regression problems, the loss function is typically the squared error loss ℓ(y, ŷ) = (y − ŷ)². For classification problems, the loss function is typically the logistic loss ℓ(y, ŷ) = −y log ŷ − (1 − y) log(1 − ŷ).
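As a concrete reference, the two loss functions can be written in a few lines of code (an illustrative sketch; the clipping constant is our own addition for numerical stability):

```python
import numpy as np

def squared_error(y, yhat):
    # Squared error loss for regression.
    return (y - yhat) ** 2

def logistic_loss(y, phat):
    # Logistic loss for classification, with y in {0, 1} and phat the
    # predicted probability. Clipping (our addition) avoids log(0).
    phat = np.clip(phat, 1e-12, 1 - 1e-12)
    return -(y * np.log(phat) + (1 - y) * np.log1p(-phat))
```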

Given observations (x_i, y_i) for i = 1, ..., n, we propose fitting a sparse-input neural network where we penalize the first-layer weights using a sparse group lasso penalty and the second-layer weights using a ridge penalty as follows:

θ̂ = argmin_θ (1/n) Σ_{i=1}^n ℓ(y_i, f_θ(x_i)) + λ0 ||η||₂² + λ Ω_α(W),    (3)

where λ0, λ ≥ 0, α ∈ [0, 1], and Ω_α(W) = α Σ_{j=1}^p ||W_·j||₂ + (1 − α) ||W||₁ is the sparse group lasso, with ||W||₁ denoting the element-wise ℓ1-norm. When α = 0, the sparse group lasso reduces to the Lasso, and when α = 1, it reduces to the group lasso (Yuan and Lin, 2006).

Figure 1: An example of a sparse-input neural network. The heavy and dotted lines indicate nonzero and zero weights, respectively. Each shaded oval corresponds to a group of first-layer weights. The weights from the dark blue oval are both nonzero. Each medium blue oval has a single nonzero weight, so they exhibit element-wise sparsity. All the weights in the light blue ovals are zero, so they exhibit group-level sparsity.

We have three types of penalties in this criterion. The ridge penalty on η serves to control the magnitude of the weights that are not in the first layer. The sparse group lasso is a mixture of the group lasso and lasso penalties (Simon et al., 2013). The group lasso penalty on W encourages sparsity at the input level; that is, it encourages the entire vector W_·j to be zero. The lasso penalty on W encourages sparsity across all the weights, so the hidden nodes will be connected to only a few input nodes. The parameter α allows us to control the level of sparsity at the level of the inputs vs. the individual weights. Pictorially, (3) encourages fitting neural networks like that in Figure 1.
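For concreteness, the sparse group lasso penalty can be computed as follows. This is an illustrative sketch, not the authors' code; we store W with one row per input feature, and α weights the group lasso term, matching the convention above:

```python
import numpy as np

def sparse_group_lasso(W, alpha):
    # W: first-layer weights, one row per input feature (shape p x m).
    # alpha = 1 gives the pure group lasso; alpha = 0 gives the pure lasso.
    lasso = np.abs(W).sum()                  # element-wise sparsity
    group = np.linalg.norm(W, axis=1).sum()  # input-feature (row) sparsity
    return alpha * group + (1 - alpha) * lasso
```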

One could also consider adding a sparsity penalty to the second layer weights. This is useful if the neural network has a large number of hidden nodes that need to be pruned away. However in the high-dimensional setting, using a small number of hidden nodes generally works well.

2.1 Learning

Sparse-input neural networks can be trained using generalized gradient descent (Daubechies et al., 2004; Beck and Teboulle, 2009; Nesterov, 2004). Though generalized gradient descent was originally developed for convex problems, we can apply the recent work by Gong et al. (2013) to find a critical point of a non-convex objective function. Here we specialize their proposal, called GIST, to solve (3). This algorithm is similar to that in Alvarez and Salzmann (2016).

Let L(θ) be the smooth component of the loss function in (3), defined as follows:

L(θ) = (1/n) Σ_{i=1}^n ℓ(y_i, f_θ(x_i)) + λ0 ||η||₂².

Let S_c(·) be the coordinate-wise soft-thresholding operator

S_c(z) = sign(z) max(|z| − c, 0).
The algorithm for training a sparse-input neural network is given in Algorithm 1. The proximal gradient step is composed of three sub-steps. The first sub-step (7) performs a gradient update step only for the smooth component of the loss; the gradient can be computed using the standard back-propagation algorithm. The second and third sub-steps, (8) and (9), are the proximal operations for the sparse group lasso (Simon et al., 2013): a soft-thresholding operation on the first-layer weights followed by a soft-scaling operation on each group W_·j.
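The second and third sub-steps can be sketched as a single proximal map (our own implementation sketch; W again has one row per input feature, and α follows the convention above):

```python
import numpy as np

def prox_sparse_group_lasso(W, t, lam, alpha):
    # Proximal operator for t * lam * [alpha * group + (1 - alpha) * lasso].
    # Sub-step 1: coordinate-wise soft-thresholding (lasso part).
    thresh = t * lam * (1 - alpha)
    W = np.sign(W) * np.maximum(np.abs(W) - thresh, 0.0)
    # Sub-step 2: soft-scaling of each row (group lasso part).
    scale_level = t * lam * alpha
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    safe = np.maximum(norms, 1e-12)  # guard against division by zero
    return W * np.maximum(1.0 - scale_level / safe, 0.0)
```

Note that any row whose norm falls below the soft-scaling level is set exactly to zero, which is what produces group-level sparsity.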

The initial step size at each iteration is chosen from a fixed interval. In our implementation, we simply used a constant initial step size, though one can also choose the initial step size adaptively (Barzilai and Borwein, 1988; Gong et al., 2013).

We choose the step size t according to a monotone line search criterion, which accepts the step size if the following condition is satisfied:

F(θ^(k+1)) ≤ F(θ^(k)) − (σ/(2t)) ||θ^(k+1) − θ^(k)||₂²,

where F is the objective function of (3) and σ ∈ (0, 1). This line search criterion guarantees that Algorithm 1 will converge to a critical point (where the subdifferential contains zero) (Gong et al., 2013).
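A minimal sketch of this monotone line search (the argument names are ours; `prox` is any proximal map for the nonsmooth penalty, applied with the current step size):

```python
import numpy as np

def line_search_step(theta, grad, objective, prox, t0=1.0, sigma=0.1,
                     shrink=0.5, max_halvings=30):
    # Shrink the step size t until the full objective decreases by at
    # least (sigma / (2 t)) * ||theta_new - theta||^2, the monotone
    # acceptance rule used by GIST-style methods.
    f_old = objective(theta)
    t = t0
    for _ in range(max_halvings):
        theta_new = prox(theta - t * grad(theta), t)
        delta = theta_new - theta
        if objective(theta_new) <= f_old - 0.5 * sigma / t * float(delta @ delta):
            return theta_new, t
        t *= shrink
    return theta, t  # no acceptable step found; keep the current iterate
```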

Finally, another option for training sparse-input neural networks is to apply the accelerated generalized gradient descent framework for non-convex objective functions developed in Ghadimi and Lan (2016). This approach is guaranteed to converge to a critical point and comes with accelerated rate guarantees.

In short, these generalized gradient descent algorithms provide an efficient way to train sparse-input neural networks.

   Initialize neural network parameters θ^(0). Choose σ ∈ (0, 1) and step-size bounds 0 < t_min ≤ t_max.
  for iteration k = 0, 1, 2, ... do
     Choose an initial step size t ∈ [t_min, t_max].
     repeat
        θ̃ = θ^(k) − t ∇L(θ^(k))   (gradient step on the smooth component)
        for j = 1, ..., p do
           W̃_·j ← S_{tλ(1−α)}(W̃_·j)   (soft-threshold)
           W̃_·j ← max(1 − tλα/||W̃_·j||₂, 0) · W̃_·j   (soft-scale)
        end for
        Shrink t if the line search criterion fails.
     until  line search criterion is satisfied
     θ^(k+1) = θ̃
  end for
Algorithm 1 Training sparse-input neural networks
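Putting the pieces together, Algorithm 1 can be sketched end-to-end for the squared error loss. This is a simplified illustration, not the authors' implementation: it uses a fixed step size rather than the line search, and the variable names, default penalty values, and toy dimensions are our own choices:

```python
import numpy as np

def fit_sparse_input_nn(X, y, m=5, lam=0.05, alpha=0.5, ridge=1e-4,
                        step=0.05, n_iters=1000, seed=0):
    # One-hidden-layer network with tanh activations, squared error loss,
    # a ridge penalty on the second layer, and a sparse group lasso penalty
    # on the first layer (alpha weights the group lasso term).
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = rng.normal(scale=0.1, size=(p, m))  # first-layer weights (one row per feature)
    b = np.zeros(m)                         # hidden-node intercepts
    eta = rng.normal(scale=0.1, size=m)     # second-layer weights
    eta0 = 0.0                              # final intercept
    for _ in range(n_iters):
        # Forward pass and residuals.
        H = np.tanh(X @ W + b)
        resid = H @ eta + eta0 - y
        # Back-propagation for the smooth part of the objective.
        dout = 2.0 * resid / n
        d_eta = H.T @ dout + 2.0 * ridge * eta
        d_eta0 = dout.sum()
        dH = np.outer(dout, eta) * (1.0 - H ** 2)
        dW = X.T @ dH
        db = dH.sum(axis=0)
        # Gradient step on the smooth component.
        W -= step * dW
        b -= step * db
        eta -= step * d_eta
        eta0 -= step * d_eta0
        # Proximal step: soft-threshold, then soft-scale each row of W.
        W = np.sign(W) * np.maximum(np.abs(W) - step * lam * (1 - alpha), 0.0)
        norms = np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
        W *= np.maximum(1.0 - step * lam * alpha / norms, 0.0)
    return W, b, eta, eta0
```

On a toy problem where only the first feature matters, this fit typically drives the first-layer weights of many irrelevant features to exactly zero.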

2.2 Tuning Hyper-parameters

There are four hyper-parameters for fitting a sparse-input neural network: the ridge penalty parameter λ0, the sparse group lasso mixing parameter α, the sparse group lasso penalty parameter λ, and the number of hidden nodes m. The parameters should be tuned to ensure low generalization error of the model. Typically, the hyper-parameters are tuned using cross-validation. With a few hyper-parameters, we can tune them using a simple grid search. If there are many hyper-parameters, one can use Nelder-Mead (Nelder and Mead, 1965) or Bayesian optimization methods (Snoek et al., 2012).

For sparse-input neural networks, we found that the hyper-parameters can be tuned by searching over a relatively small space, so a simple grid search will suffice. More specifically, the optimal number of hidden nodes for high-dimensional problems tends to be small, so only a few candidate values need be tested. In addition, since there are only a small number of weights in the second layer of the neural network, the estimation error is quite robust to different ridge penalty parameter values. In our simulations and data analysis, we pre-tune the ridge penalty parameter and keep it fixed.

The most important parameters to tune are the penalty parameters for the sparse group lasso, λ and α. We found that the lasso penalty can easily over-penalize the input weights and set everything to zero. Thus we only use candidate values where α is greater than 0.5.

3 Related Work

A number of other authors have applied lasso and group lasso penalties to neural networks. That work, however, has largely focused on learning network structure rather than feature selection.

Sun (1999) was one of the earliest papers to fit a neural network with a lasso penalty over all the weights. Scardapane et al. (2016) and Alvarez and Salzmann (2016) proposed using the sparse group lasso over all the weights in a deep neural network in order to learn a more compact network; this is in the low-dimensional setting with n > p. The recent work by Bach (2017) is most closely aligned with ours: he considers convex neural networks, which have a single hidden layer with an unbounded number of hidden nodes, and shows that, theoretically, a lasso penalty on the input weights of such networks should perform well in high-dimensional settings.

The recent work in Zhang et al. (2016) examined the utility of regularization in deep learning. The authors found that a ridge penalty on the network weights was not necessary for a neural network to have good performance; instead, using an appropriate network architecture led to larger decreases in generalization error. Our results support this claim, since the sparse group lasso also performs structure learning.

Previously, statistical convergence rates for neural networks have been established for problems where the ℓ1-norm of the weights is constrained. These results show that the estimation error of the neural network indeed grows with the logarithm of the number of input nodes (Bartlett, 1998; Anthony and Bartlett, 2009). Our convergence rate results also have a log p term. Moreover, we improve upon previously established convergence rates by showing that the excess estimation error shrinks at a rate of 1/n rather than 1/√n. This faster convergence rate allows us to additionally bound the rate at which the norm of the irrelevant network weights shrinks to zero.

Our proofs for statistical convergence rates are inspired by Städler et al. (2010), which considers ℓ1-penalization for mixture regression models. The techniques used in their paper are relevant because neural networks can be thought of as a mixture of regressions. Significant additional work was required, however, as the identifiability condition assumed in Städler et al. (2010) does not hold for neural networks.

4 Theoretical Guarantees

In this section, we provide probabilistic, finite-sample upper bounds on the prediction error of sparse-input neural networks. Instead of using a ridge penalty, we constrain the ℓ2-norm of the second-layer weights. Our sparse-input neural network problem now becomes

θ̂ = argmin_{θ : ||η||₂ ≤ K} (1/n) Σ_{i=1}^n ℓ(y_i, f_θ(x_i)) + λ Ω_α(W),    (10)

for a constant K > 0. All of our proofs are in the Supplementary Materials.

Notice that (10) assumes that the estimator is a global minimizer of a non-convex objective function. Since non-convex problems are typically computationally intractable to solve, there is admittedly a disconnect between the computational algorithm we have proposed and the theoretical results we establish in this section. Though it is desirable to establish theoretical properties for estimators arising from local optima, it is difficult to characterize their behavior and up to now, much of the theory for the Lasso depends on the estimator being a global minimizer. We do not address this issue and leave this problem for future research.

4.1 Problem Setup and Notation


be the joint distribution of the covariates

and response where has support over . We assume contains some open set and . Let be the marginal distribution of . Given observations, we denote the empirical distribution as .

The neural network can be defined either as in (1) or (2), where we suppose the activation function ψ is the sigmoid function. Note that tanh neural networks are included in this framework, since the tanh and sigmoid functions are related by a linear transformation.

Next we define a neural network equivalence class. Given a parameter θ, the set of equivalent parameterizations is

E(θ) = {θ′ : f_{θ′}(x) = f_θ(x) for all x ∈ X}.

If θ satisfies the following independence property, then E(θ) only contains parameterizations that are sign-flips or rotations of θ and thus has cardinality exactly 2^m m! (Sontag, 1996; Anthony and Bartlett, 2009). The independence property states that the first-layer weights in θ satisfy W_k· ≠ 0 for all k and W_k· ≠ ±W_{k′}· for all k ≠ k′.

Let the set of optimal neural networks that minimize the expected loss be denoted

Θ* = argmin_θ E[ℓ(y, f_θ(x))].

We suppose that Θ* is the union of finitely many equivalence classes, where all of the optimal neural networks satisfy the independence property. We will suppose that K is chosen large enough such that for every θ* in Θ*, the gradient of the expected loss at θ* is zero.

Next, let us suppose that the expected loss can be minimized using neural networks that employ only a few input features S ⊆ {1, ..., p}, which we also call the “relevant” features. That is, we suppose that the features j ∉ S have weights W_·j equal to zero for every θ* ∈ Θ*. Thus we call S^c the irrelevant features. Note that all neural nets from the same equivalence class have the same set of features with nonzero weights (because members of the equivalence class are rotations and sign-flips of one another). Let s denote the cardinality of S. In this work we are particularly interested in sparse optimal neural networks, i.e. those with small s.

Let the minimum distance from a parameter θ to Θ* be defined as

d(θ) = min_{θ* ∈ Θ*} ||θ − θ*||₂,

and let the minimizer be denoted

θ*(θ) = argmin_{θ* ∈ Θ*} ||θ − θ*||₂.

For notational shorthand, let the loss function be denoted ℓ_θ(x, y) = ℓ(y, f_θ(x)). Using this notation, the excess loss of a neural network with parameters θ is defined as

E(θ) = E[ℓ_θ(x, y)] − E[ℓ_{θ*}(x, y)]    (13)

for any θ* ∈ Θ*. Recall that (13) is well-defined since E[ℓ_{θ*}(x, y)] is the same for all θ* ∈ Θ*. Finally, let W_S denote the weights tied to the input nodes in S and W_{S^c} denote the weights tied to the input nodes in S^c.

4.2 Results

To understand the behavior of our estimated neural network θ̂ from (10), we derive an upper bound on the combination of its excess loss and the ℓ1-norm of its irrelevant neural network weights. Our proof technique is inspired by Städler et al. (2010); however, significant adaptations were needed to deal with the complex loss surface and equivalence classes of neural networks.

In order for our results to hold, we assume that the expected loss is locally strongly convex at every θ* ∈ Θ*. Since this only specifies the local behavior at Θ*, the assumption is relatively weak. More formally, this assumption states:

Condition 1

Order the neural network parameters θ so that the relevant parameters (W_S, b, η, η0) come first and the irrelevant first-layer weights W_{S^c} come last. There is a constant h_min > 0 that may depend on Θ* and the distribution P, but does not depend on p, such that for all θ* ∈ Θ*,

∇² E[ℓ_{θ*}(x, y)] ⪰ h_min [ I 0 ; 0 0 ],

where I is an identity matrix over the relevant parameters, the 0s are appropriately sized zero matrices, and A ⪰ B means that A − B is a positive semi-definite matrix.

In addition, we need the following identifiability condition.

Condition 2

There is an ε_min > 0 that may depend on Θ* and the distribution P, but does not depend on p, such that for all θ ∉ Θ* for which ||W_{S^c}||₁ is dominated by the ℓ1-distance between θ and θ*(θ) over the remaining parameters,

E(θ) ≥ ε_min d²(θ).

Condition 2 places a lower bound on the excess loss of neural networks outside the set of optimal neural networks Θ*. However, we only need this lower bound to apply to neural networks where the weight of the irrelevant nodes is dominated by the difference between θ and θ*(θ) with respect to the remaining parameters. By restricting to this smaller set of neural networks, it is more realistic to claim that ε_min is independent of p. This condition is somewhat reflective of the typical compatibility conditions used in proving theoretical results for the Lasso (Bühlmann and Van De Geer, 2011).

Finally, we require a bound on the third derivative of the expected loss. It is easy to show that this condition is satisfied when the loss function is mean squared error or logistic loss.

Condition 3

The third derivative of the expected loss function is bounded uniformly over the parameter space by some constant G > 0: for all θ and all coordinate triples (a, b, c),

| ∂³ E[ℓ_θ(x, y)] / ∂θ_a ∂θ_b ∂θ_c | ≤ G.
With the three conditions above, we have the following theorem. We use the notation a ∨ b = max(a, b).

Theorem 1

Suppose Conditions 1, 2, and 3 hold, and suppose the optimal neural networks in Θ* only have nonzero weights for the s input features in S. Let θ̂ be the solution to (10) with α ∈ (0, 1). Then there is a set T, whose probability is controlled in Theorem 2, over which

E(θ̂) + λ · min(α, 1 − α) · ||Ŵ_{S^c}||₁ ≤ C λ² s / ε_min,    (16)

where the constant C depends on h_min and G but not on p.
Theorem 1 simultaneously bounds the excess loss and the ℓ1-norm of the irrelevant weights. Consider the case where the identifiability constant ε_min is sufficiently large that it does not dominate the bound. Then the above theorem states that for α ∈ (0, 1) (i.e. we are using the sparse group lasso and not the degenerate cases with only the lasso or only the group lasso), the excess loss will be on the order of λ²s and the ℓ1-norm of Ŵ_{S^c} will shrink at the rate λs. If λ shrinks as the number of samples n increases, these values will go to zero. The convergence rate of the excess risk is faster for functions that are best approximated by neural networks that are more sparse (i.e. s is small). The upper bound in (16) will also be small if G and 1/h_min are small. However, we must choose λ carefully so that the set T occurs with high probability. The number of hidden nodes m also must be chosen carefully, since it determines how well we can approximate the true function (Barron, 1994). This comes implicitly into our rate, as the “excess risk” is relative to the best neural net with m hidden nodes.

In order to make claims regarding the probability of the set T, we specialize the next theorem to classification and regression settings with particular loss functions. For the regression scenario, we need the additional assumption that the random noise is sub-gaussian. Note that the following result bounds the excess risk when the number of hidden nodes is fixed; it does not handle the approximation error incurred from using a fixed number of hidden nodes. The proof relies on techniques from empirical process theory.

Theorem 2

Consider the following two settings:

  1. Regression setting: Let y = f_{θ*}(x) + w for θ* ∈ Θ*, where the noise w is a sub-gaussian random variable with mean zero. That is, w satisfies, for some constants L and M,

     E[exp(w²/L²)] ≤ M.

     We suppose that w is independent of x.

  2. Classification setting: Let y take on values in {0, 1}, where Pr(y = 1 | x) = p_{θ*}(x) for θ* ∈ Θ*.

Suppose that the set of optimal neural networks Θ* is composed of finitely many equivalence classes. For both settings, there exist constants c₀ and c₁ that depend on m, s, K, L, and M such that for

λ ≥ c₀ √(log(p ∨ n)/n),    (20)

we have for any n that the probability of the set T from Theorem 1 is at least 1 − c₁ exp(−nλ²/c₁).

If the identifiability constant ε_min is not too small, Theorems 1 and 2 state that if α ∈ (0, 1) and λ is chosen according to (20), the excess loss converges at the rate s log p / n and the ℓ1-norm of the irrelevant weights converges at the rate s √(log p / n), modulo log terms that do not depend on p. Note that the total number of features p only enters these rates through a log term.

To the best of our knowledge, our results are the first to bound the convergence rate of the norm of irrelevant weights. Though our bounds might not be tight, they help us understand why we observe better performance in sparse-input neural networks compared to standard neural networks in our simulation studies and data analyses.

We hope to improve these convergence rate bounds in future research. In particular, we would like to shrink the exponents on s and m, since the rates become very slow for moderate values of these quantities.

In addition, our results depend on a couple of assumptions and constants. For our rates to be useful, we need to make sure the constants h_min and ε_min in Conditions 1 and 2 do not shrink too quickly with p. Also, we assumed that Θ* is composed of finitely many equivalence classes, but it would be better to upper bound how the number of equivalence classes grows with the neural network size. Sontag (1996) shows that the number of critical points of the empirical loss function is upper-bounded by a quantity exponential in the number of hidden nodes. This result suggests that the number of equivalence classes grows exponentially in m, which would not change our convergence rates. However, if the number grows faster, or is infinite, the convergence rates may be slower.

5 Simulation study

We now present a simulation study to understand how sparse-input neural networks compare against other nonparametric methods. We focus on regression problems, as the results for classification are quite similar. We consider scenarios where the true function is the sum of univariate functions, a complex multivariate function, and a function in between these two extremes. In all the cases, the true function is sparse and only depends on the first six variables. We compare sparse-input neural networks to five methods. Some of these methods we term “oracle methods”: these take only the first six variables as input.

  • neural network with a single hidden layer and a ridge penalty on all the weights (ridge-only neural network);

  • oracle ridge-only neural network;

  • oracle additive univariate model of the form f(x) = Σ_{j=1}^6 f_j(x_j), where the f_j are fit using additive smoothing splines;

  • oracle general multivariate model where f is fit using a 6-variate smoothing spline;

  • Sparse Additive Model (SpAM), which fits an additive univariate model with a sparsity-inducing penalty (Ravikumar et al., 2007).

The oracle methods are not competitors in practice, since they use information that is not available; however, they give us some idea of how well our feature selection is working. Performance of the methods is assessed by the mean squared error (MSE) in estimating the true function f, evaluated over 2000 covariate vectors drawn from the covariate distribution.

5.1 Simulation Settings

For all of the simulations, we generated the data according to the model y = f(x) + ε, where ε is mean-zero noise scaled such that the signal-to-noise ratio is 2. We sampled each covariate independently from the standard uniform distribution. We generated a training set, a validation set, and a test set. The validation set was a quarter of the training set size, and the test set was composed of 2000 observations. The penalty parameters in all the methods were tuned using the validation set. For neural networks, we also tuned the number of nodes in the hidden layer (5 or 10 nodes). We use a sigmoid-type activation function in our neural networks. Each simulation was repeated 20 times.
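This data-generating process can be sketched as follows. The particular regression function `f_true` below is a hypothetical stand-in, not one of the paper's scenario functions; the point is the noise scaling that fixes the signal-to-noise ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(X):
    # A hypothetical sparse regression function of the first two features
    # (stand-in for the scenario-specific functions).
    return np.sin(2.0 * X[:, 0]) + X[:, 1] ** 2

def simulate(n, p, f, snr=2.0):
    # Covariates are drawn i.i.d. from the standard uniform distribution;
    # Gaussian noise (our choice) is scaled so that
    # var(f(X)) / var(noise) equals snr.
    X = rng.uniform(size=(n, p))
    signal = f(X)
    sigma = np.sqrt(np.var(signal) / snr)
    y = signal + rng.normal(scale=sigma, size=n)
    return X, y

X, y = simulate(2000, 50, f_true)
```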

5.2 Additive univariate model

In the first scenario, the true function is the sum of univariate functions of the six relevant covariates. Since the true model is additive, we expect the additive univariate oracle to perform the best, followed by SpAM. As shown in Figure 2, this is indeed the case.

We find that sparse-input neural networks also perform quite well. Thus, if we are unsure whether the true function is the sum of univariate functions, a sparse-input neural network can be a good option. In addition, we notice that the performance of sparse-input neural networks tends to track the oracle neural network and the multivariate oracle. At small sample sizes, sparse-input neural networks and oracle neural networks perform better than the multivariate oracle, as there is not enough data to support fitting a completely unstructured 6-variate smoother. As the number of samples increases, the multivariate oracle overtakes the sparse-input neural network, since it knows which features are truly relevant. We observe similar trends in the next two scenarios. The ridge-only neural network performs poorly in this scenario. Without a sparsity-inducing penalty, it is unable to determine which variables are relevant.

Figure 2 shows the norm of the first-layer weights in the sparse-input neural network, stratified by those connected to the six relevant variables and those connected to the remaining irrelevant variables. In this example, the norm of the irrelevant weights appears to approach zero as the number of samples increases, in accordance with our theoretical results in Section 4.

Figure 2: Simulation results from the three scenarios: (a) an additive univariate function, (b) a complex multivariate function, and (c) a high-dimensional sum of multivariate functions. The left plots compare the mean squared error (MSE) of the nonparametric regression methods; the dashed lines indicate oracle models. The right plots show the average magnitude of the first-layer weights from the sparse-input neural network fit, stratified into the relevant and irrelevant features.

5.3 Complex multivariate model

In the second simulation, we use a sparse, complex multivariate regression function of the six relevant covariates. Here we expect the general multivariate oracle to perform the best at large sample sizes, which is confirmed in Figure 2. Similar to the results in Section 5.2, the performance of sparse-input neural networks nearly tracks the trajectory of the oracle neural networks and the general multivariate oracle. As expected, the additive univariate oracle and SpAM perform very poorly. Their MSEs flatten out very quickly due to the bias from assuming an additive univariate model. In fact, we see that given a large enough training set, even a ridge-only neural network can outperform the additive univariate oracle and SpAM.

Finally, we note that, as before, the norm of the irrelevant weights in the sparse-input neural network decreases toward zero as the number of samples increases.

5.4 High-dimensional additive multivariate model

Finally, we consider a setting with a large number of input features. We use a regression function that is the sum of 3- and 4-variate functions of the six relevant covariates. This places it between the simple additive univariate function in the first scenario and the complex 6-variate function in the second scenario.

As shown in Figure 2, SpAM and sparse-input neural networks perform similarly in small samples and diverge at larger sample sizes. In particular, the sparse-input neural network starts outperforming SpAM as the oracle neural network and general multivariate oracle start outperforming the simple additive univariate oracle. This makes intuitive sense: at this point the irreducible bias from the additive univariate model begins to exceed the additional variance incurred from estimating a 3- or 4-variate function. As before, the neural network trained with only a ridge penalty cannot take advantage of sparsity and performs poorly.
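The simulation setup described above can be sketched in code. The following is a minimal, hypothetical example: the specific 3- and 4-variate component functions below are illustrative stand-ins, not the functions used in the paper.

```python
import numpy as np

def simulate_sparse_regression(n, p, noise_sd=1.0, seed=0):
    """Draw n observations with p features where the response depends
    only on the first 7 features, as the sum of a 3-variate and a
    4-variate function (hypothetical stand-ins for the paper's choices)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    # Hypothetical 3-variate component over features 0-2.
    f1 = np.sin(X[:, 0] * X[:, 1]) + X[:, 2] ** 2
    # Hypothetical 4-variate component over features 3-6.
    f2 = np.cos(X[:, 3] + X[:, 4]) * X[:, 5] - X[:, 6]
    y = f1 + f2 + noise_sd * rng.normal(size=n)
    return X, y

X, y = simulate_sparse_regression(n=200, p=1000)
```

All remaining p - 7 features are pure noise, which is what lets the sparsity-inducing penalty shrink their first-layer weights toward zero.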

6 Data analyses

We now present two data analyses on high-throughput genomic data sets. The first data analysis is a regression problem and the second is a classification problem.

6.1 Regression problem: prediction of riboflavin production rates

We first consider a high-throughput genomic data set for the production of riboflavin in Bacillus subtilis (Bühlmann et al., 2014). For each experimental setting, we have gene expression profiles of 4088 genes as well as a standardized log-transformed riboflavin production rate. Our goal is to predict the riboflavin production rate. We randomly split the dataset such that the test set makes up one-fifth of the data and the rest is used for training.

We compare the performance of a sparse-input neural network to a ridge-penalized neural network, SpAM, and a linear model with a lasso penalty. In addition, we fit two special cases of sparse-input neural networks: a lasso-only version and a group-lasso-only version. The penalty parameters in all methods were tuned via 5-fold cross-validation. All the neural networks used three hidden nodes.
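The 5-fold cross-validation used to tune the penalty parameters can be sketched as a simple fold loop. This is a generic illustration, using a ridge-penalized linear model (which has a closed-form fit) as a stand-in for the actual estimators; the function names and the penalty grid are our own.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression coefficients."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_select_lambda(X, y, lambdas, n_folds=5, seed=0):
    """Pick the penalty with the lowest mean validation MSE over folds."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds  # random fold labels
    scores = []
    for lam in lambdas:
        mse = 0.0
        for k in range(n_folds):
            train, val = folds != k, folds == k
            beta = ridge_fit(X[train], y[train], lam)
            mse += np.mean((y[val] - X[val] @ beta) ** 2)
        scores.append(mse / n_folds)
    return lambdas[int(np.argmin(scores))]
```

The same loop applies to any of the compared methods; only the inner fit/predict step changes.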

Table 1 shows the average performance across 30 random splits of the dataset, as measured by the average MSE on the test set. Sparse-input neural networks had the smallest MSE of all the methods, and this difference was statistically significant. These results suggest that riboflavin production rates are not well approximated by an additive univariate or a linear model; instead, it is better to estimate higher-order interactions, even at such small sample sizes. Ridge-penalized neural networks performed the worst, presumably because they were unable to determine which features were relevant for modeling the response. The lasso-only and group-lasso-only sparse-input neural networks also did not perform as well. Thus it is valuable to encourage sparsity both at the group level and at the level of individual weights.

From this data analysis, we conclude that sparse-input neural networks can be quite effective in practice. Even though neural networks are not typically applied to such high-dimensional datasets, we see here that neural networks with proper regularization can significantly outperform traditional nonparametric regression methods.

Method MSE on test set Number of nonzero features
Sparse-input NN 0.124 (0.010) 52.7 (6.1)
Sparse-input NN: lasso-only 0.161 (0.016) 112.1 (30.5)
Sparse-input NN: group-lasso-only 0.144 (0.014) 56.1 (2.9)
Ridge-penalized NN 0.237 (0.018) 4088 (0)
SpAM 0.145 (0.014) 34.5 (1.3)
Linear model with Lasso 0.145 (0.017) 39.1 (3.3)
Table 1: Average performance of the different methods for predicting riboflavin production rates in Bacillus subtilis. Standard errors given in parentheses.
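The "Number of nonzero features" column can be read off the fitted first-layer weight matrix: a feature counts as selected only if some weight attached to it is nonzero. A minimal sketch (the hidden-by-features layout and function name are our illustrative choices):

```python
import numpy as np

def count_selected_features(W, tol=1e-8):
    """Count input features whose column of first-layer weights is
    not (numerically) all zero.  W has shape (n_hidden, n_features),
    so column j holds every weight attached to feature j."""
    norms = np.linalg.norm(W, axis=0)
    return int(np.sum(norms > tol))
```

A ridge-penalized network never zeroes a column exactly, which is why its count in Table 1 equals the total number of features.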

6.2 Classification problem: prediction of molecular biology of Leukemia samples

Now we analyze gene expression data from adult patients with B- and T-cell acute lymphoblastic leukemia (Chiaretti et al., 2004). For each subject, 12625 gene expression levels were measured using Affymetrix human 95Av2 arrays and normalized using the robust multichip average method. In addition, samples collected from each subject were classified according to their molecular biology. Our task is to use the gene expression data to predict which subjects had samples classified as NEG versus those that were not. In this dataset, a total of 74 subjects had samples classified as NEG. The data used in this analysis is available in the ALL package in Bioconductor (CITE?).

We compare sparse-input neural networks against the same types of models as before, but now against SpAM specialized for classification tasks and a logistic model with the lasso. We also use the same settings as before to fit the neural networks and the same procedure to compare the performance of the different models.

Sparse-input neural networks performed better than both SpAM and logistic regression with the lasso (Table 2). Sparse-input neural networks with only the group lasso did the best over the thirty replicates, though their performance was not significantly different from that of sparse-input neural networks with the sparse group lasso. Surprisingly, SpAM selected models with many features, whereas sparse-input neural networks and logistic regression with the lasso chose much more parsimonious models. Again, we see that the ridge-penalized neural networks perform poorly since they are unable to prune away irrelevant features.

Method Error on test set Number of nonzero features
Sparse-input NN 0.133 (0.010) 52.1 (1.1)
Sparse-input NN: lasso-only 0.174 (0.010) 67.7 (5.1)
Sparse-input NN: group-lasso-only 0.130 (0.010) 88.4 (13.0)
Ridge-penalized NN 0.191 (0.016) 12625 (0)
SpAM 0.153 (0.011) 242.6 (38.6)
Logistic regression with Lasso 0.150 (0.010) 44.1 (2.2)
Table 2: Average performance of the different methods for predicting molecular biology of samples collected from patients with B- and T-cell acute lymphoblastic leukemia.

7 Discussion

We have introduced sparse-input neural networks as a nonparametric regression and classification method for high-dimensional data, in which the first-layer weights are penalized with a sparse group lasso. When the true model is best approximated by a sparse network, we show that the fitted model has a prediction error that grows only logarithmically with the number of features. Thus sparse-input neural networks can be effective for modeling high-dimensional data. We have also provided empirical evidence via simulation studies and data analyses showing that sparse-input neural networks can outperform other, more traditional nonparametric methods. Our results show that neural networks can indeed be applied to high-dimensional datasets, as long as proper regularization is used.
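The sparse group lasso penalty on the first-layer weights is typically handled with a proximal step during training: an elementwise soft-threshold for the lasso part, followed by a group soft-threshold on each input feature's column of weights. The following is a minimal numpy sketch; the function name, the hidden-by-features weight layout, and the two penalty parameters `lam_lasso` and `lam_group` are illustrative choices, not the paper's code.

```python
import numpy as np

def sparse_group_lasso_prox(W, lam_lasso, lam_group):
    """Proximal operator of the sparse group lasso applied to the
    first-layer weight matrix W (shape: n_hidden x n_features).
    Each column groups the weights attached to one input feature,
    so a zeroed column removes that feature from the network."""
    # Elementwise soft-thresholding (the lasso part).
    V = np.sign(W) * np.maximum(np.abs(W) - lam_lasso, 0.0)
    # Group soft-thresholding on each feature's column of weights.
    norms = np.linalg.norm(V, axis=0, keepdims=True)
    scale = np.maximum(1.0 - lam_group / np.maximum(norms, 1e-12), 0.0)
    return V * scale

W = np.array([[0.05, 2.0],
              [-0.02, -1.5]])
W_new = sparse_group_lasso_prox(W, lam_lasso=0.1, lam_group=0.1)
# The first feature's small weights are pruned to exactly zero,
# while the second feature's column is shrunk but kept.
```

Because the group step can zero an entire column at once, it yields exact feature-level sparsity, which the elementwise lasso alone does not guarantee.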

One drawback of sparse-input neural networks, and neural networks in general, is that they require a significant amount of time to train. Much of the training time is spent on tuning the hyper-parameters and testing different initializations since the objective function is non-convex. On the other hand, since sparse additive models are convex and have fewer hyper-parameters, they are faster to fit.

There are many directions for future research. Even though this paper only considers sparse-input neural networks with a single hidden layer, the estimation method can be easily extended to multiple hidden layers. It will be important to understand if additional layers are useful in high-dimensional settings and how sparsity-inducing penalties can be best used to control the expressive power of deep networks.

In addition, the theoretical results in this paper assume the estimator is a global minimizer of the penalized empirical loss, but finding the global minimizer is typically a hard problem. In future work, we would like to analyze the behavior of neural networks when the estimator is a local minimizer. There are also a number of other assumptions required in our theoretical results that we would like to weaken or remove.

Up to now, there have been few, if any, methods that are effective for modeling complex multivariate interactions in a high-dimensional setting. Sparse-input neural networks try to fill this gap by leveraging the power of neural networks while controlling their complexity using sparsity-inducing penalties. Though neural networks have not been as popular in the statistical literature as in the machine learning and computer science literature, we hope this work shows that they have the potential to solve problems that traditional tools have had difficulty with. Moreover, the theory for neural networks is still not well understood and can benefit from the attention of statisticians.

Supplementary Materials

The code for fitting sparse-input neural networks is available at


Jean Feng was supported by NIH grants DP5OD019820 and T32CA206089. Noah Simon was supported by NIH grant DP5OD019820.

Appendix A Proofs

A.1 Proof of Theorem 1

The proof for Theorem 1 is composed of two main steps. First, we show that the excess loss of any is lower bounded by a quadratic function of the distance from to . We combine this with the definition of and to derive the result.

Using Conditions 1, 2, and 3, we lower bound the excess loss by a quadratic function.

Lemma 1 (Quadratic lower bound)

Suppose Conditions 1, 2, 3 hold. Consider constant . Suppose satisfies both and the following condition:


Then where


By Taylor expansion and the assumption that the gradient of the loss function is zero at all , we have


where is the Lagrange remainder. By Condition 1, we have that


In addition, is bounded above by


where we used Condition 3 in the last line. Moreover, if (21) holds, then


Since , we can combine (28) and (30) to get


Combining (26) and (31) gives us

Now we use Condition 2 and apply the Auxiliary Lemma in Städler et al. (2010), which states the following:

Auxiliary Lemma in Städler et al. (2010) Let have the following properties:

  1. such that ,

  2. , such that .

Then ,




To use the Auxiliary Lemma, plug in .

Finally, we are ready to prove Theorem 1. [Proof of Theorem 1] For simplicity, we’ll denote , the closest point in to , by . By definition of ,

Rearranging, we get

Over the set , we have that


Now we consider three possible cases.

Case 1:

The first term on the right-hand side of (34) is clearly bounded by . Rearranging terms, we have


As , we have


Case 2:

In this case, (34) reduces to the following, after some rearranging of terms:


Given the constraints of , we can show that (21) in Lemma 1 is satisfied. To see this, note that


Also, we have from (37)


Using a similar argument, it is easy to show that

So Case 2 satisfies all the conditions in Lemma 1.

Now we split Case 2 into two sub-cases.

Case 2a:

In this case, (34) reduces to