High-dimensional machine learning is often tackled using generalized linear models (GLMs), where a response variable $y \in \mathbb{R}$ is related to a feature vector $x \in \mathbb{R}^d$ via
$$\mathbb{E}[y \mid x] = g_\star(w_\star^\top x) \tag{1}$$
for some unknown weight vector $w_\star \in \mathbb{R}^d$ and some smooth transfer function $g_\star$. Typical examples of $g_\star$ are the logit and probit functions for classification, and the linear function for regression. High-dimensional parameter estimation for GLMs has been widely studied, from both a theoretical and an algorithmic point of view [Van de Geer2008, Negahban et al.2012, Park and Hastie2007]. While classical work on GLMs assumes $g_\star$ is known, this function is often unknown in real-world datasets, and hence we need methods that can simultaneously learn both $g_\star$ and $w_\star$.
The model in (1) with unknown $g_\star$ is called a Single Index Model (SIM) and is a powerful semi-parametric generalization of a GLM. SIMs were first introduced in the econometrics and statistics literature [Horowitz and Härdle1996, Ichimura1993, Horowitz2009], and have since become popular in statistical machine learning applications as well. Recently, computationally and statistically efficient algorithms have been provided for learning SIMs [Kalai and Sastry2009, Kakade et al.2011] in low-dimensional settings where the number of samples/observations $n$ is much larger than the ambient dimension $d$. However, many problems in modern machine learning, signal processing and computational biology are high dimensional, i.e. the number of parameters to learn, $d$, far exceeds the number of data points $n$. For example, in genetics, one has to infer activation weights for thousands of genes with hundreds of measurements.
In this paper, motivated by high-dimensional data analysis problems, we consider learning SIMs in high dimensions. This is a hard learning problem because (i) statistical inference is ill-posed, and indeed impossible, in the high-dimensional setup without additional structural assumptions, and (ii) unlike in GLMs, the transfer function itself is unknown and also needs to be learned from the data. To handle these problems we impose additional structure on the unknown weight vector $w_\star$, which is elegantly captured by the concept of small atomic cardinality [Chandrasekaran et al.2012], and make smoothness assumptions on the transfer function $g_\star$. The concept of small atomic cardinality generalizes commonly imposed structures in high-dimensional statistics, such as sparsity, group sparsity, and low rank, and allows us to design a single algorithm that can learn a SIM under various structural assumptions.
We provide an efficient algorithm called CSI (Calibrated Single Index) that can be used to learn SIMs in high dimensions. The algorithm is an optimization procedure that minimizes a loss function calibrated to the unknown SIM, over both the transfer function and the weight vector. CSI alternates between a projected gradient descent step to update its estimate of the weight vector and a function learning procedure called LPAV to learn a monotonic, Lipschitz function. We provide extensive experimental evidence that demonstrates the effectiveness of CSI in a variety of high-dimensional machine learning scenarios. Moreover, we show that we obtain competitive, and often better, results compared to a single-layer neural network, at significantly lower computational cost.
Related Work and Our Contributions
Alquier and Biau [Alquier and Biau2013] consider learning high-dimensional single index models. They provide estimators based on PAC-Bayesian analysis, which rely on reversible jump MCMC and are slow to converge even for moderately sized problems. [Radchenko2015] learns high-dimensional single index models with simple sparsity assumptions on the weight vectors, while [Ganti, Balzano, and Willett2015] provide methods to learn SIMs in the matrix factorization setting. While these are first steps towards learning high-dimensional SIMs, our method can handle several types of structural assumptions, generalizing these approaches to several other structures in an elegant manner. Restricted versions of the SIM estimation problem with (structured) sparsity constraints have been considered in [Plan, Vershynin, and Yudovina2014, Rao et al.2016], where the authors are only interested in accurate parameter estimation and not prediction. Hence, in these works the proposed algorithms do not learn the transfer function. We finally comment that there is also related literature focused on how to query points in order to learn the SIM, such as [Cohen et al.2012].
The class of SIMs belongs to a larger set of semi-parametric models called multiple index models [Hastie et al.2005], which involve learning a sum of multiple transfer functions and their corresponding weight vectors. Other semi-parametric models [Friedman and Stuetzle1981, Buja, Hastie, and Tibshirani1989, Ravikumar et al.2009], where the model is a linear combination of univariate functions, are also popular, but our restrictions on the transfer function allow us to use simple optimization methods to learn it.
Finally, neural networks have emerged as a powerful alternative for learning nonlinear transfer functions, and can be thought of as compositions of nonlinear functions. In the high-dimensional setting (data-poor regime), it may be hard to accurately estimate all the parameters of a multilayer network, and a thorough comparison is beyond the scope of this paper. Nonetheless, we show that our method enjoys comparable and often superior performance to a single-layer feedforward NN, while being significantly cheaper to train. These positive results indicate that one could perhaps use our method as a much cheaper alternative to NNs in practical data analysis problems, and motivate us to consider “deep” variants of our method in the future. To the best of our knowledge, simple, practical algorithms with good empirical performance for learning single index models in high dimensions are not available.
Structurally Constrained Problems in High Dimensions
We now set up the notation used in the sequel and the problem we are interested in solving. Assume we are provided i.i.d. data $\{(x_i, y_i)\}_{i=1}^n$, where the label $y_i$ is generated according to the model $\mathbb{E}[y_i \mid x_i] = g_\star(w_\star^\top x_i)$ for an unknown parameter vector $w_\star \in \mathbb{R}^d$ and an unknown 1-Lipschitz, monotonic function $g_\star$. The monotonicity assumption on $g_\star$ is not unreasonable. In GLMs the transfer function is monotonic. In neural networks the most common activation functions are the ReLU, sigmoid, and hyperbolic tangent functions, all of which are monotonic. Moreover, learning monotonic functions is an easier problem than learning general smooth functions, as this learning problem can be cast as a simple quadratic programming problem. This allows us to avoid costlier non-parametric smoothing techniques such as local polynomial regression [Tsybakov2009]. We additionally assume that $y \in [0, 1]$ (this can easily be relaxed: bounded $y$ is sufficient). Let $X \in \mathbb{R}^{n \times d}$ be a matrix with each row corresponding to an $x_i$, and let $y \in \mathbb{R}^n$ be the corresponding vector of observations. Note that in the case of matrix estimation problems the data are matrices, and for the sake of notational simplicity we assume that these matrices have been vectorized. In the case where $n < d$, the problem of recovering $w_\star$ from the measurements is ill-posed even when $g_\star$ is known. To overcome this, one usually makes additional structural assumptions on the parameters $w_\star$. Specifically, we assume that the parameters satisfy a notion of “structural simplicity”, which we will now elaborate on.
Suppose we are given a set of atoms, $\mathcal{A}$, such that any $w \in \mathbb{R}^d$ can be written as $w = \sum_{a \in \mathcal{A}} c_a a$. Although the number of atoms in $\mathcal{A}$ may be uncountably infinite, the sum notation implies that any $w$ can be expressed as a linear combination of a finite number of atoms (this representation need not be unique).
Consider the following non-convex atomic cardinality function:
$$\|w\|_0^{\mathcal{A}} = \inf\Big\{ \sum_{a \in \mathcal{A}} \mathbb{1}(c_a \neq 0) \;:\; w = \sum_{a \in \mathcal{A}} c_a a \Big\} \tag{2}$$
Here $\mathbb{1}(\cdot)$ denotes the indicator function: it is unity when the condition inside the $(\cdot)$ is satisfied, and zero otherwise. We say that a vector $w$ is “structurally simple” with respect to an atomic set $\mathcal{A}$ if $\|w\|_0^{\mathcal{A}}$ in (2) is small. The notion of structural simplicity plays a central role in several high-dimensional machine learning and signal processing applications:
The idea of group sparsity plays a central role in multitask learning [Argyriou, Evgeniou, and Pontil2008] and computational biology [Jacob, Obozinski, and Vert2009], among other applications. The atoms are low dimensional unit disks, and the atomic cardinality of a vector $w$ is simply the group sparsity of $w$.
Matrix estimation problems that typically appear in problems such as collaborative filtering [Koren, Bell, and Volinsky2009] can be modeled as learning vectors with atoms being unit rank matrices and the resulting atomic cardinality being the rank of the matrix.
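As a rough illustration of the structures discussed in this section, the sketch below computes the atomic cardinality for three common atomic sets. The function names, tolerances, and the assumption of non-overlapping groups are ours and are not taken from the paper.

```python
import numpy as np

def sparsity(w, tol=1e-12):
    """Atomic cardinality for signed canonical basis atoms: count of nonzeros."""
    return int(np.sum(np.abs(w) > tol))

def group_sparsity(w, groups, tol=1e-12):
    """Number of (assumed non-overlapping) groups on which w is nonzero."""
    return sum(1 for idx in groups if np.linalg.norm(w[idx]) > tol)

def matrix_rank(W, tol=1e-10):
    """Atomic cardinality for unit-rank atoms: count of nonzero singular values."""
    return int(np.sum(np.linalg.svd(W, compute_uv=False) > tol))

w = np.array([0.0, 1.5, 0.0, -2.0, 0.0, 0.0])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
W = np.outer([1.0, 2.0, 3.0], [1.0, 0.0, -1.0])  # a rank-one matrix

print(sparsity(w), group_sparsity(w, groups), matrix_rank(W))  # 2 2 1
```

In each case a small value indicates that the parameter is structurally simple with respect to the chosen atomic set.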
Problem Setup: Calibrated loss minimization
Our goal in this paper will be to solve an optimization problem of the form
$$\min_{g,\, w} \; \frac{1}{n}\sum_{i=1}^n L(y_i, x_i, w; g) + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t.} \quad \|w\|_0^{\mathcal{A}} \le s \tag{3}$$
where $\mathcal{A}$ is a known atomic set, $s$ is a positive integer, and $L$ is an appropriately designed loss function that we elaborate on next. Notice that in the above formulation we added a squared norm penalty to make the objective function strongly convex. When we are dealing with matrix problems we can use the Frobenius norm of the parameter matrix. The constraint on the atomic cardinality ensures the learning of structurally simple parameters, and indeed makes the problem well posed.
Suppose $g_\star$ was known. Let $\Phi_\star$ be a function such that $\Phi_\star' = g_\star$, and consider the following optimization problem:
$$\min_{w} \; \frac{1}{n}\sum_{i=1}^n \big[\Phi_\star(w^\top x_i) - y_i\, w^\top x_i\big] + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t.} \quad \|w\|_0^{\mathcal{A}} \le s \tag{4}$$
Modulo the penalty and the regularization terms, the above objective is a sample version of the following stochastic optimization problem:
$$\min_{w} \; \mathbb{E}\big[\Phi_\star(w^\top x) - y\, w^\top x\big] \tag{5}$$
Since $g_\star$ is a monotonically increasing function, $\Phi_\star$ is convex and the above stochastic optimization problem is convex. By taking the first derivative we can verify that the optimal solution satisfies the relation $\mathbb{E}[y \mid x] = g_\star(w^\top x)$. Hence, by defining the loss function in terms of the integral of the transfer function, the loss function is calibrated to the transfer function, and automatically adapts to the SIM from which the data is generated. To gain further intuition, notice that when $g_\star$ is linear, $\Phi_\star$ is quadratic and the optimization problem in Equation (4) is a constrained squared loss minimization problem. When $g_\star$ is logit, the problem in Equation (4) is a constrained logistic loss minimization problem.
Accordingly, when $g_\star$ is unknown we use the loss
$$L(y, x, w; g) = \Phi(w^\top x) - y\, w^\top x, \qquad \Phi' = g \tag{6}$$
where we constrain $g$ to the set of monotonic, 1-Lipschitz functions. With this choice of $L$, our optimization problem becomes
$$\min_{g,\, w} \; \frac{1}{n}\sum_{i=1}^n \big[\Phi(w^\top x_i) - y_i\, w^\top x_i\big] + \frac{\lambda}{2}\|w\|_2^2 \quad \text{s.t.} \quad \|w\|_0^{\mathcal{A}} \le s, \;\; \Phi' = g \tag{7}$$
Notice that in the above optimization problem we are simultaneously learning a function as well as a weight vector. This additional layer of complication explains why learning SIMs is a considerably harder problem than learning GLMs, where a typical optimization problem is similar to the one in Equation (4). As we will later show in our experimental results, this additional complexity in optimization is justified by the excellent results achieved by our algorithm compared to GLM-based algorithms such as linear/logistic regression.
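To make the calibration argument of this section concrete, the short derivation below writes $g$ for a known monotone transfer function and $\Phi$ for an antiderivative of $g$. It assumes that differentiation and expectation can be interchanged, and it ignores the cardinality constraint and the regularizer.

```latex
\begin{align*}
\nabla_w\, \mathbb{E}\big[\Phi(w^\top x) - y\, w^\top x\big]
  &= \mathbb{E}\big[(g(w^\top x) - y)\, x\big]
   = \mathbb{E}\big[(g(w^\top x) - \mathbb{E}[y \mid x])\, x\big], \\
g(z) = z \;&\Rightarrow\; \Phi(z) = \tfrac{1}{2} z^2, \quad
  \Phi(w^\top x) - y\, w^\top x = \tfrac{1}{2}(y - w^\top x)^2 - \tfrac{1}{2} y^2, \\
g(z) = \frac{1}{1 + e^{-z}} \;&\Rightarrow\; \Phi(z) = \log(1 + e^{z}), \quad
  \Phi(w^\top x) - y\, w^\top x = \log\big(1 + e^{w^\top x}\big) - y\, w^\top x .
\end{align*}
```

The gradient in the first line vanishes at any $w$ for which $g(w^\top x) = \mathbb{E}[y \mid x]$, which is the calibration property. The second line is the squared loss up to a term that does not depend on $w$, and the third line is the logistic negative log-likelihood for labels in $\{0, 1\}$.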
The Calibrated Single Index Algorithm
Our algorithm for solving the joint optimization problem over the transfer function and the weight vector formulated above is called the Calibrated Single Index algorithm (CSI) and is sketched in Algorithm 1. CSI interleaves parameter learning via iterative projected gradient descent and monotonic function learning via the LPAV algorithm.
Function learning using LPAV:
We use the LPAV [Kakade et al.2011] method to update the function $g$. One way to learn a monotonic function would be to model the function as a multi-layer neural network and learn the weights of the network using a gradient-based algorithm. LPAV is computationally far simpler. Furthermore, learning the many parameters of a NN is typically not an option in data-poor settings such as the ones we are interested in. Another alternative is to cast learning $g$ as a dictionary learning problem, which requires a good dictionary at hand, which in turn relies on domain expertise.
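Given a vector $w$, in order to find a function fit $\hat g$ that minimizes the joint objective over the transfer function and the weight vector, we can look at the first-order optimality condition. Differentiating the objective w.r.t. $w$ and setting the regularization term aside, we get $\sum_{i=1}^n (g(w^\top x_i) - y_i)\, x_i = 0$. If we assume that the features are uncorrelated and have similar variance (the variance assumption can be satisfied by normalizing the features appropriately, and the uncorrelatedness assumption via a whitening transformation), then by elementary algebra we just need the function to minimize the expression $\sum_{i=1}^n (g(w^\top x_i) - y_i)^2$. LPAV solves this exact optimization problem. More precisely, given data $(p_1, y_1), \ldots, (p_n, y_n)$, where $p_i \in \mathbb{R}$ and $y_i \in \mathbb{R}$, LPAV outputs a best univariate monotonic, 1-Lipschitz function $\hat g$ that minimizes the squared error $\sum_{i=1}^n (g(p_i) - y_i)^2$. LPAV does this using the following two-step procedure. In the first step, it solves:
$$\hat z = \arg\min_{z \in \mathbb{R}^n} \|z - y\|_2^2 \quad \text{s.t.} \quad 0 \le z_j - z_i \le p_j - p_i \;\; \text{whenever } p_i \le p_j$$
This is a simple convex quadratic programming problem and can be solved using standard methods. In the second step, we define $\hat g$ as follows: Let $\hat g(p_i) = \hat z_i$ for all $i = 1, \ldots, n$. To get $\hat g$ everywhere else on the real line, LPAV performs linear interpolation as follows: Sort $p_i$ for all $i$ and let $p_{(i)}$ be the $i$-th entry after sorting, with $\hat z_{(i)}$ the corresponding fitted value. Then, for any $\zeta \in \mathbb{R}$, we have
$$\hat g(\zeta) = \begin{cases} \hat z_{(1)} & \text{if } \zeta \le p_{(1)} \\ \hat z_{(n)} & \text{if } \zeta \ge p_{(n)} \\ \mu\, \hat z_{(i)} + (1 - \mu)\, \hat z_{(i+1)} & \text{if } \zeta = \mu\, p_{(i)} + (1 - \mu)\, p_{(i+1)},\; \mu \in [0, 1] \end{cases}$$
It is easy to see that $\hat g$ is a Lipschitz, monotonic function and attains the smallest least squares error on the given data.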
Note that solving the LPAV is not the same as fitting a GLM. Specifically, LPAV finds a function that minimizes the squared error between the fitted function and the response.
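The two-step LPAV procedure can be sketched as follows. This is a minimal illustration rather than the specialized solver of [Kakade et al.2011]: it rewrites the fitted values as a base value plus non-negative increments, so that the monotonicity and Lipschitz constraints become simple box constraints, and it hands the resulting bounded least-squares problem to SciPy's `lsq_linear`. The function names `lpav_fit` and `lpav_predict`, the tiny floor on the gaps, and the constant extension outside the data range are our assumptions.

```python
import numpy as np
from scipy.optimize import lsq_linear

def lpav_fit(p, y):
    """Least-squares fit of a monotone, 1-Lipschitz function to points (p_i, y_i).

    Returns the sorted knots and the fitted values at those knots.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(p)
    p_s, y_s = p[order], y[order]
    n = len(p_s)
    if n == 1:
        return p_s, y_s.copy()
    # Write z_k = a + d_0 + ... + d_{k-1}; the constraints 0 <= z_{k+1} - z_k <= gap_k
    # then become the box 0 <= d_k <= gap_k, with the offset a left unconstrained.
    A = np.tril(np.ones((n, n)))
    gaps = np.maximum(np.diff(p_s), 1e-12)  # tiny floor: lsq_linear requires lb < ub
    lb = np.concatenate(([-np.inf], np.zeros(n - 1)))
    ub = np.concatenate(([np.inf], gaps))
    sol = lsq_linear(A, y_s, bounds=(lb, ub))
    return p_s, A @ sol.x

def lpav_predict(knots, values, t):
    """Linear interpolation between knots, held constant outside their range."""
    return np.interp(t, knots, values)

# Usage on noisy samples of a monotone function (synthetic data for illustration).
rng = np.random.default_rng(0)
p = np.sort(rng.uniform(-3.0, 3.0, size=40))
y = 1.0 / (1.0 + np.exp(-p)) + 0.05 * rng.standard_normal(40)
knots, z = lpav_fit(p, y)
print(lpav_predict(knots, z, np.array([-1.0, 0.0, 1.0])))
```

A dense formulation like this costs far more than the structured solver referenced in the paper, so it is only suitable for small $n$.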
We are now ready to describe CSI.
CSI begins by initializing the weight vector $\hat w_0$ using the operator $P_s(\cdot)$. Here $P_s$ is a projection operator that outputs the best $s$-atomic-sparse representation of its argument; $s$ is provided as a parameter to CSI. We then update our estimate of the transfer function to $\hat g_t$ by using the LPAV algorithm on the data projected onto the vector $\hat w_{t-1}$. Using the updated estimate $\hat g_t$, we update our weight vector to $\hat w_t$ by a single gradient step on the calibrated loss objective with $\Phi = \hat\Phi_t$, where $\hat\Phi_t$ and $\hat g_t$ are related by the equation $\hat\Phi_t' = \hat g_t$. This gradient step is followed by an atomic projection step (Step 5 in CSI). While one can use convergence checks and stopping conditions to decide when to stop, we noticed that a few tens of iterations are sufficient, and we fix the iteration count $T$ accordingly in our experiments.
A key point to note is that CSI is very general: indeed the only step that depends on the particular structural assumption made is the projection step (step 5 in Algorithm (1)).
As long as one can define this projection step for the structural constraint of interest, one can use the CSI algorithm to learn an appropriate high-dimensional SIM. As we show next, this projection step is indeed tractable in many cases of interest in high-dimensional statistics.
Note that the projection can be replaced by a soft thresholding-type operator as well, and the algorithmic performance should be largely unaffected. However, performing hard thresholding is typically more efficient, and has been shown to enjoy comparable performance to soft thresholding operators in several cases.
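The alternation described in this section can be sketched as below for the plain sparsity case. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the initialization `project(X.T @ y / n, s)`, the step size `eta`, the default iteration count, and the final refit of the transfer function are our choices, and the LPAV step is solved with SciPy's generic `lsq_linear` instead of a specialized routine.

```python
import numpy as np
from scipy.optimize import lsq_linear

def hard_threshold(v, s):
    """Keep the s largest-magnitude entries of v (projection for plain sparsity)."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def lpav(p, y):
    """Monotone, 1-Lipschitz least-squares fit; returns a callable estimate of g."""
    order = np.argsort(p)
    p_s, y_s = p[order], y[order]
    n = len(p_s)
    A = np.tril(np.ones((n, n)))             # fitted values = offset + increments
    gaps = np.maximum(np.diff(p_s), 1e-12)   # tiny floor: lsq_linear needs lb < ub
    lb = np.concatenate(([-np.inf], np.zeros(n - 1)))
    ub = np.concatenate(([np.inf], gaps))
    z = A @ lsq_linear(A, y_s, bounds=(lb, ub)).x
    return lambda t: np.interp(t, p_s, z)

def csi(X, y, s, eta=1.0, lam=0.0, T=20, project=hard_threshold):
    """Alternate LPAV function updates with projected gradient steps on w."""
    n = X.shape[0]
    w = project(X.T @ y / n, s)              # assumed initialization
    for _ in range(T):
        g = lpav(X @ w, y)                   # function update on projected data
        grad = X.T @ (g(X @ w) - y) / n + lam * w
        w = project(w - eta * grad, s)       # gradient step, then atomic projection
    return w, lpav(X @ w, y)                 # refit g for the final w (our choice)

# Usage on a small synthetic problem (sizes chosen only for illustration).
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 120))
w_true = np.zeros(120)
w_true[:4] = 0.5
y = 1.0 / (1.0 + np.exp(-(X @ w_true)))
w_hat, g_hat = csi(X, y, s=4, T=5)
print(np.flatnonzero(w_hat))
```

Swapping `project` for a group-sparse or low-rank projection is the only change needed to target a different structural constraint, which mirrors the generality claim made above.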
Examples of Atomic Projections
A key component of Algorithm 1 is the projection operator $P_s(\cdot)$, which depends entirely on the atomic set $\mathcal{A}$. Suppose we are given a vector $w \in \mathbb{R}^d$, an atomic set $\mathcal{A}$ and a positive integer $s$. Also, let $w = \sum_{a \in \mathcal{A}} c_a a$, where the $c_a$ achieve the infimum in the sense of (2). Let $\bar c$ be the elements $c_a$, arranged in descending order by magnitude. We define
$$P_s(w) = \sum_{i=1}^{s} \bar c_i\, a_{(i)}$$
where $\bar c_i$ is the $i$-th element of the vector $\bar c$, and $a_{(i)}$ denotes the corresponding atom in the original representation. We can see that performing such projections is computationally efficient in most cases:
When the atoms are the signed canonical basis vectors, the projection is the standard hard thresholding operation: retain the top $s$ magnitude coefficients of $w$.
Under low rank constraints, $P_s(\cdot)$ reduces to retaining the best rank-$s$ approximation of $w$. Since $s$ is typically small, this can be done efficiently using power iterations.
When the atoms are low dimensional unit disks, the projection step reduces to computing the norm of $w$ restricted to each group, and retaining the top $s$ groups.
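The three projections listed above can be sketched in a few lines of NumPy. The function names are ours, the groups are assumed not to overlap, and the low-rank case uses a full SVD for brevity where the text suggests power iterations for small $s$.

```python
import numpy as np

def project_sparse(w, s):
    """Hard thresholding: keep the s largest-magnitude coefficients."""
    out = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-s:]
    out[keep] = w[keep]
    return out

def project_group_sparse(w, groups, s):
    """Keep the s groups with the largest Euclidean norm (non-overlapping groups)."""
    norms = np.array([np.linalg.norm(w[idx]) for idx in groups])
    out = np.zeros_like(w)
    for k in np.argsort(norms)[-s:]:
        out[groups[k]] = w[groups[k]]
    return out

def project_low_rank(W, s):
    """Best rank-s approximation via a truncated SVD."""
    U, sig, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :s] * sig[:s]) @ Vt[:s]

w = np.array([3.0, -1.0, 0.5, 4.0, -2.0, 0.1])
groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print(project_sparse(w, 2))                # [3. 0. 0. 4. 0. 0.]
print(project_group_sparse(w, groups, 1))  # [0.  0.  0.5 4.  0.  0. ]
print(np.round(project_low_rank(np.diag([3.0, 2.0, 1.0]), 1), 6))
```

Each function returns an array of the same shape as its input, so any of them can serve as the projection step of the iterative procedure.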
Computational Complexity of CSI
To analyze the computational complexity of each iterate of the CSI algorithm, we need to analyze the time complexity of the gradient step, the projection step and the LPAV steps used in CSI . The gradient step takes time. The projection step for low-rank, sparse and group sparse cases can be naively implemented using time or via the use of max-heaps in time. The LPAV algorithm is a quadratic program with immense structure in the inequality constraints and in the quadratic objective. Using clever algorithmic techniques one can solve this optimization problem in time (See Appendix D in [Kakade et al.2011]). The total runtime complexity for iterations of CSI is , making the algorithm fairly efficient. In most large scale problems, the data is sparse, in which case the term can be replaced by .
We now compare and contrast our method with several other algorithms, in various high dimensional structural settings and on several datasets. We start with the case of standard sparse parameter recovery, before proceeding to display the effectiveness of our method in multitask/multilabel learning scenarios and also in the structured matrix factorization setting.
Sparse Signal Recovery
We compare our method with several standard algorithms on high dimensional datasets:
Sparse classification with the logistic loss (SLR) and the squared hinge loss (SQH). We vary the regularization parameter over . We used MATLAB code available in the L1-General library.
Sparse regression using least squares (SLS). We used a modified Frank Wolfe method [Rao, Shah, and Wright2015], and varied the regularizer over .
Our method CSI . We varied the sparsity of the solution as , rounded off to the nearest integer, where is the dimensionality of the data.
Slisotron [Kakade et al.2011] which is an algorithm for learning SIMs in low-dimensions.
Single layer feedforward NN (SLNN) trained using TensorFlow [Abadi et al.2016], with the Adam optimizer [Kingma and Ba2014] used to minimize cross-entropy (the settings used are: learning_rate=0.1, beta1=0.9, beta2=0.999, epsilon=1e-08, use_locking=False). We used early stopping and validated results over multiple epochs between 50 and 1000, and the number of hidden units was varied between 5 and 1000. Since an SLNN is not constrained to fitting a monotonic function, we would expect SLNNs to have smaller bias than SIMs. However, since SLNNs use more parameters, they have larger variance than SIMs.
We always perform a train-validation-test split of the data, and report the results on the test set.
We tested the algorithms on several datasets: link and page are datasets from the UCI machine learning repository. We also use four datasets from the 20 newsgroups corpus: atheism-religion, autos-motorcycle, cryptography-electronics and mac-windows. We compared the AUC in Table (1) - since several of the datasets are unbalanced - for each of the methods. The following is a summary:
CSI outperforms simple, widely popular learning algorithms such as SLR, SQH, SLS. Often, the difference between CSI and these other algorithms is quite substantial. For example when measuring accuracy, the difference between CSI and either SLR, SQH, SLS on all the datasets is at least 2% and in many cases as large as .
CSI comfortably outperforms Slisotron on all datasets and often by a margin as large as . This is expected because Slisotron does not enforce any structure such as sparsity in its updates.
The most interesting result is the comparison with SLNN. In spite of its simplicity, we see that CSI is comparable to and often outperforms SLNN by a slight margin.
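Since the comparison above is reported in terms of AUC, the following sketch shows one standard way to compute it from real-valued scores, namely as the fraction of positive/negative pairs that are ranked correctly, with ties counted as one half. The helper name `auc` and the toy inputs are ours and are unrelated to the reported results.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    diff = pos[:, None] - neg[None, :]
    # A pair counts 1 if the positive is scored higher, 0.5 if the scores tie.
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

For a SIM, the scores can be taken to be the fitted values of the learned transfer function applied to the projected test data. Because AUC depends only on the ranking of the scores, it is insensitive to class imbalance in a way that accuracy is not.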
Group Sparsity: Multilabel and Multitask Learning
Next, we consider the problem of multi-label learning using group sparsity structure. We consider two datasets. For multilabel learning, the flags dataset contains measurement and possible labels (based on the colors in the flag). The data is split into measurements, for training and test respectively. Out of the training set, we randomly set aside of the measurements for validation.
For multitask learning, the atp7d dataset consists of simultaneous regression tasks from dimensional data with measurements. We perform a random split of the data for training, validation and testing.
We compared our method with group sparse logistic regression and least squares, using the MALSAR package [Zhou, Chen, and Ye2011]. For logistic regression and least squares, the range of parameter values was . We varied the step size on a log scale for our method, setting the group sparsity parameter to be for both datasets. Table (2) shows that our method performs better than both compared methods. For classification, we use the F1 score as a performance measure, since multilabel problems are highly unbalanced, and measures such as accuracy are not indicative of performance. For multitask learning, we report the MSE.
Structured Matrix Factorization
We now visit the problem of matrix completion in the presence of graph side information. We consider two datasets, Epinions and Flixster. Both datasets have a (known) social network among the users. We process the data as follows: we first retain the top 1000 users and items with the most ratings. Then we sparsify the data so as to randomly retain only 3000 observations in the training set, out of which we set aside 300 observations for cross validation. Furthermore, we binarize the observations at 3, corresponding to “likes” and “dislikes” among users and items. [Natarajan, Rao, and Dhillon2015] showed that the problem of structured matrix factorization can be cast as an atomic norm constrained program. The least squares approach solves the following program:
Here the atoms are built from the singular vectors and singular values of the graph Laplacian of the graph among the rows of the matrix, and from the same quantities for the graph Laplacian of the graph among its columns. We use the same atoms in our case, except we replace the loss function by our calibrated loss. We report the MSE in Table (3).
Empirical discussion of the convergence of CSI
When $g_\star$ is known, the CSI algorithm is essentially an iterative gradient descent algorithm on a convex likelihood function, combined with hard thresholding. Such algorithms have been analyzed and exponential rates of convergence have been established [Agarwal, Negahban, and Wainwright2012, Jain, Tewari, and Kar2014]. These results assume that the likelihood loss function satisfies certain restricted strong convexity and restricted strong smoothness assumptions. This leads to a natural question: Can we establish exponential rates of convergence for the CSI algorithm for the single index model, i.e. when $g_\star$ is unknown? While we have been unable to establish a formal analysis of the rates of convergence in this case, we believe that such fast rates might not be achievable in the case of SIMs and that at best one can achieve much slower sub-linear rates of convergence on the iterates. We support our claim with an experiment, where we study how quickly the iterates generated by CSI converge to $w_\star$ on a synthetic dataset generated using the SIM. Our synthetic experiment is set up as follows: We generate the covariates from a standard normal distribution. We use in our experiment. We then choose $w_\star$ to be sparse with the locations of the non-zero entries chosen at random. The non-zero entries are filled with values sampled from . Next choose $g_\star$ to be the logistic function (note that this definition of $g_\star$ is exactly the same as the standard logistic formula; since we are working with expectations in Equation (1) and not probabilities as is done in classical logistic regression, our formula, on the surface, looks a bit different) and generate labels in $\{0, 1\}$ for each $x_i$ using a SIM as shown in Equation (1) with the above $w_\star, g_\star$. For our experiments both $w_\star, g_\star$ are kept hidden from the CSI algorithm. We run CSI with and . In Figure (1) we show how the distance of the iterates from $w_\star$ changes as the number of iterations of CSI increases. This result tells us that the distance monotonically decreases with the number of iterations and, moreover, that the problem is harder as dimensionality increases. Combining the results of [Jain, Tewari, and Kar2014] and the simulation result shown in Figure (1) we make the following conjecture.
Suppose we are given i.i.d. labeled data which satisfies the SIM , where is a Lipschitz, monotonic function and . Let be restricted strong convex and restricted strong smooth, as defined in [Agarwal, Negahban, and Wainwright2012], for the given data distribution. Then with an appropriate choice of the parameters and , algorithm CSI with the hard-thresholding operation after iterations outputs a vector that satisfies
where is some function that depends on and is some function dependent on and is independent of . represents the statistical error of the iterates that arises due to the presence of limited data.
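For readers who want to reproduce the flavor of the synthetic convergence experiment described in this section, the data-generating process can be sketched as follows. The sizes `n`, `d`, `s` and the distribution of the non-zero entries are placeholders of our own choosing, since the exact values used in the experiment are not stated here; `make_sim_data` and `distance_to_truth` are hypothetical helper names.

```python
import numpy as np

def make_sim_data(n, d, s, rng):
    """Draw (X, y, w_star) from a sparse SIM with a logistic transfer function."""
    X = rng.standard_normal((n, d))             # standard normal covariates
    w_star = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    w_star[support] = rng.standard_normal(s)    # assumed distribution of non-zeros
    prob = 1.0 / (1.0 + np.exp(-(X @ w_star)))  # E[y | x] = g(w_star^T x)
    y = (rng.random(n) < prob).astype(float)    # labels in {0, 1}
    return X, y, w_star

def distance_to_truth(w_hat, w_star):
    """Euclidean distance between an iterate and the true weight vector."""
    return float(np.linalg.norm(w_hat - w_star))

rng = np.random.default_rng(0)
X, y, w_star = make_sim_data(n=50, d=20, s=3, rng=rng)  # placeholder sizes
print(X.shape, y.shape, np.count_nonzero(w_star))
```

Recording `distance_to_truth` after every iteration of the learning algorithm, for several values of `d`, gives curves of the kind discussed above.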
Conclusions and Discussion
In this paper, we introduced CSI, a unified algorithm to learn single index models in high dimensions, under general structural constraints on the data. The simplicity of our learning algorithm, its versatility, and its competitive results make it a great tool that can be added to a data analyst’s toolbox.
- [Abadi et al.2016] Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G. S.; Davis, A.; Dean, J.; Devin, M.; et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467.
- [Agarwal, Negahban, and Wainwright2012] Agarwal, A.; Negahban, S. N.; and Wainwright, M. J. 2012. Fast global convergence of gradient methods for high-dimensional statistical recovery. The Annals of Statistics 40(5):2452–2482.
- [Alquier and Biau2013] Alquier, P., and Biau, G. 2013. Sparse single-index model. The Journal of Machine Learning Research 14(1):243–280.
- [Argyriou, Evgeniou, and Pontil2008] Argyriou, A.; Evgeniou, T.; and Pontil, M. 2008. Convex multi-task feature learning. Machine Learning 73(3):243–272.
- [Buja, Hastie, and Tibshirani1989] Buja, A.; Hastie, T.; and Tibshirani, R. 1989. Linear smoothers and additive models. The Annals of Statistics 453–510.
- [Chandrasekaran et al.2012] Chandrasekaran, V.; Recht, B.; Parrilo, P. A.; and Willsky, A. S. 2012. The convex geometry of linear inverse problems. Foundations of Computational Mathematics 12(6):805–849.
- [Cohen et al.2012] Cohen, A.; Daubechies, I.; DeVore, R.; Kerkyacharian, G.; and Picard, D. 2012. Capturing ridge functions in high dimensions from point queries. Constructive Approximation 35(2):225–243.
- [Donoho2006] Donoho, D. L. 2006. Compressed sensing. Information Theory, IEEE Transactions on 52(4):1289–1306.
- [Friedman and Stuetzle1981] Friedman, J. H., and Stuetzle, W. 1981. Projection pursuit regression. Journal of the American statistical Association 76(376):817–823.
- [Ganti, Balzano, and Willett2015] Ganti, R. S.; Balzano, L.; and Willett, R. 2015. Matrix completion under monotonic single index models. In Advances in Neural Information Processing Systems, 1864–1872.
- [Hastie et al.2005] Hastie, T.; Tibshirani, R.; Friedman, J.; and Franklin, J. 2005. The elements of statistical learning: data mining, inference and prediction. The Mathematical Intelligencer 27(2):83–85.
- [Horowitz and Härdle1996] Horowitz, J. L., and Härdle, W. 1996. Direct semiparametric estimation of single-index models with discrete covariates. Journal of the American Statistical Association 91(436):1632–1640.
- [Horowitz2009] Horowitz, J. L. 2009. Semiparametric and nonparametric methods in econometrics. Springer.
- [Ichimura1993] Ichimura, H. 1993. Semiparametric least squares (sls) and weighted sls estimation of single-index models. Journal of Econometrics 58(1):71–120.
- [Jacob, Obozinski, and Vert2009] Jacob, L.; Obozinski, G.; and Vert, J.-P. 2009. Group lasso with overlap and graph lasso. In ICML, 433–440.
- [Jain, Tewari, and Kar2014] Jain, P.; Tewari, A.; and Kar, P. 2014. On iterative hard thresholding methods for high-dimensional m-estimation. In Advances in Neural Information Processing Systems, 685–693.
- [Kakade et al.2011] Kakade, S. M.; Kanade, V.; Shamir, O.; and Kalai, A. 2011. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, 927–935.
- [Kalai and Sastry2009] Kalai, A. T., and Sastry, R. 2009. The isotron algorithm: High-dimensional isotonic regression. In COLT.
- [Kingma and Ba2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- [Koren, Bell, and Volinsky2009] Koren, Y.; Bell, R.; and Volinsky, C. 2009. Matrix factorization techniques for recommender systems. Computer (8):30–37.
- [Natarajan, Rao, and Dhillon2015] Natarajan, N.; Rao, N.; and Dhillon, I. 2015. Pu matrix completion with graph information. In CAMSAP, 2015 IEEE 6th International Workshop on, 37–40. IEEE.
- [Negahban et al.2012] Negahban, S. N.; Ravikumar, P.; Wainwright, M. J.; and Yu, B. 2012. A unified framework for high-dimensional analysis of m-estimators with decomposable regularizers. Statistical Science 27(4):538–557.
- [Park and Hastie2007] Park, M. Y., and Hastie, T. 2007. L1-regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4):659–677.
- [Plan, Vershynin, and Yudovina2014] Plan, Y.; Vershynin, R.; and Yudovina, E. 2014. High-dimensional estimation with geometric constraints. arXiv preprint arXiv:1404.3749.
- [Radchenko2015] Radchenko, P. 2015. High dimensional single index models. Journal of Multivariate Analysis 139:266–282.
- [Rao et al.2016] Rao, N. S.; Nowak, R. D.; Cox, C. R.; and Rogers, T. T. 2016. Classification with the sparse group lasso. Signal Processing, IEEE Transactions on 64.
- [Rao, Shah, and Wright2015] Rao, N.; Shah, P.; and Wright, S. 2015. Forward–backward greedy algorithms for atomic norm regularization. IEEE Transactions on Signal Processing 63(21):5798–5811.
- [Ravikumar et al.2009] Ravikumar, P.; Lafferty, J.; Liu, H.; and Wasserman, L. 2009. Sparse additive models. JRSS: Series B 1009–1030.
- [Ryali et al.2010] Ryali, S.; Supekar, K.; Abrams, D. A.; and Menon, V. 2010. Sparse logistic regression for whole-brain classification of fmri data. NeuroImage 51(2):752–764.
- [Tsybakov2009] Tsybakov, A. 2009. Introduction to nonparametric estimation. Springer Verlag.
- [Van de Geer2008] Van de Geer, S. A. 2008. High-dimensional generalized linear models and the lasso. The Annals of Statistics 614–645.
- [Zhou, Chen, and Ye2011] Zhou, J.; Chen, J.; and Ye, J. 2011. MALSAR: Multi-tAsk Learning via StructurAl Regularization. Arizona State University.