In the multivariate regression problem the objective is to estimate the conditional mean
where is a
-dimensional response vector,is a -dimensional covariate vector, and we are given a sample of i.i.d. pairs
from the joint distribution ofand
. This is also referred to as multi-task learning in the machine learning literature. Under a linear model, the mean is estimated aswhere is a matrix of regression coefficients. When the dimensions and are large relative to the sample size , the coefficients of cannot be reliably estimated, without further assumptions.
In reduced rank regression the matrix is estimated under a rank constraint , so that the rows or columns of lie in an -dimensional subspace of or . Intuitively, this implies that the model is based on a smaller number of features than the ambient dimensionality would suggest, or that the tasks representing the components of the response are closely related. In low dimensions, the constrained rank model can be computed as an orthogonal projection of the least squares solution; but in high dimensions this is not well defined.
Recent research has studied the use of the nuclear norm as a convex surrogate for the rank constraint. The nuclear norm , also known as the trace or Ky-Fan norm, is the sum of the singular vectors of . A rank constraint can be thought of as imposing sparsity, but in an unknown basis; the nuclear norm plays the role of the norm in sparse estimation. Its use for low rank estimation problems was proposed by Fazel (2002)
. More recently, nuclear norm regularization in multivariate linear regression has been studied byYuan et al. (2007), and by Negahban and Wainwright (2011), who analyzed the scaling properties of the procedure in high dimensions.
In this paper we study nonparametric parallels of reduced rank linear models. We focus our attention on additive models, so that the regression function has each component equal to a sum of functions, one for each covariate. The objective is then to estimate the matrix of functions .
The first problem we address, in Section 2, is to determine a replacement for the regularization penalty in the linear model. Because we must estimate a matrix of functions, the analogue of the nuclear norm is not immediately apparent. We propose two related regularization penalties for nonparametric low rank regression, and show how they specialize to the linear case. We then study, in Section 4, the (infinite dimensional) subdifferential of these penalties. In the population setting, this leads to stationary conditions for the minimizer of the regularized mean squared error. This subdifferential calculus then justifies penalized backfitting algorithms for carrying out the optimization for a finite sample. Constrained rank additive models (cram) for multivariate regression are analogous to sparse additive models (SpAM) for the case where the response is 1-dimensional (Ravikumar et al., 2009) (studied also in the reproducing kernel Hilbert space setting by Raskutti, Wainwright and Yu (2012)), but with the goal of recovering a low-rank matrix rather than an entry-wise sparse vector. The backfitting algorithms we derive in Section 5 are analogous to the iterative smoothing and soft thresholding backfitting algorithms for SpAM proposed by Ravikumar et al. (2009). A uniform bound on the excess risk of the estimator relative to an oracle is given Section 7. This shows the statistical scaling behavior of the methods for prediction. The analysis requires a concentration result for nonparametric covariance matrices in the spectral norm. Experiments with synthetic, gene, and biochemistry data are given in Section 8, which are used to illustrate different facets of the proposed nonparametric reduced rank regression techniques.
2 Nonparametric Nuclear Norm Penalization
We begin by presenting the penalty that we will use to induce nonparametric regression estimates to be low rank. To motivate our choice of penalty and provide some intuition, suppose that are smooth one-dimensional functions with a common domain. What does it mean for this collection of functions to be low rank? Let be a collection of points in the common domain of the functions. We require that the matrix of function values is low rank. This matrix is of rank at most for every set of arbitrary size if and only if the functions are -linearly independent—each function can be written as a linear combination of of the other functions.
In the multivariate regression setting, but still assuming the domain is one-dimensional for simplicity ( and ), we have a random sample . Consider the sample matrix associated with a vector of smooth (regression) functions, and suppose that . We would like for this to be a low rank matrix. This suggests the penalty
denotes the eigenvalues of a symmetric matrixand
denotes the singular values of a matrix. Now, assuming the columns of are centered, and for each , we recognize as the sample covariance of the population covariance
This motivates the following sample and population penalties, where denotes the matrix square root:
We will also use the notation .
This leads to the following population and empirical regularized risk functionals for low rank nonparametric regression:
|population penalized risk:||(2.3)|
|empirical penalized risk:||(2.4)|
where, in the empirical case, denotes the matrix of response values for the sample . We recall that if has spectral decomposition then .
3 Constrained Rank Additive Models (cram)
We now consider the case where is -dimensional. Throughout the paper we use superscripts to denote indices of the -dimensional response, and subscripts to denote indices of the -dimensional covariate. We consider the family of additive models, with regression functions of the form
each term being a -vector of functions evaluated at .
In this setting we propose two different penalties. The first penalty, intuitively, encourages the vector to be low rank, for each . Assume that the functions all have mean zero; this is required for identifiability in the additive model. As a shorthand, let denote the covariance matrix of the -th component functions, with sample version . The population and sample versions of the first penalty are then given by
The second penalty encourages the set of vector-valued functions to be low rank. This penalty is given by
where, for convenience of notation, is an matrix. The corresponding population and empirical risk functionals, for the first penalty, are then
and similarly for the second penalty.
Now suppose that each is normalized so that . In the linear case we have where . Let . Some straightforward calculation shows that the penalties reduce to
Thus, in the linear case the first penalty is encouraging to be column-wise sparse, so that many of the s are zero, meaning that doesn’t appear in the fit. This is a version of the group lasso (Yuan and Lin, 2006). The second penalty reduces to the nuclear norm regularization used for high-dimensional reduced-rank regression.
4 Subdifferentials for Functional Matrix Norms
A key to deriving algorithms for functional low-rank regression is computation of the subdifferentials of the penalties. We are interested in -dimensional matrices of functions . For each column index and row index ,
is a function of a random variable, and we will take expectations with respect to implicitly. We write to mean the th column of , which is a -vector of functions of . We define the inner product between two matrices of functions as
and write . Note that equals the Frobenius norm of where is a positive semidefinite matrix.
We define two further norms on a matrix of functions , namely,
where is the spectral norm (operator norm), the largest singular value of , and it is convenient to write the matrix square root as . Each of the norms depends on only through . In fact, these two norms are dual—for any ,
where the supremum is attained by setting , with denoting the matrix pseudo-inverse.
The subdifferential of is the set
The fact that contains the subdifferential can be proved by comparing our setting (matrices of functions) to the ordinary matrix case; see Watson (1992); Recht, Fazel and Parrilo (2010). Here, we show the reverse inclusion, . Let and let be any element of the function space. We need to show
where the second equality follows from , and the fact that . The inequality follows from the duality of the norms.
Finally, we show that . We have
where we use the fact that , implying . Next, let
be a reduced singular value decomposition, whereis a positive diagonal matrix of size . Then , and we have
This implies that and so these two symmetric matrices have orthogonal row spans and orthogonal column spans. Therefore,
where the last bound comes from the fact that . Therefore . ∎
This gives the subdifferential of penalty 2, defined in (3.3). We can view the first penalty update as just a special case of the second penalty update. For penalty 1 in (3.1), if we are updating and fix all the other functions, we are now penalizing the norm
which is clearly just a special case of penalty 2 with a single -vector of functions instead of different -vectors of functions. So, we have
5 Stationary Conditions and Backfitting Algorithms
Returning to the base case of covariate, consider the population regularized risk optimization
where is a vector of univariate functions. The stationary condition for this optimization is
Assume the singular values are sorted as , and let be the largest index such that . Thus, has rank . Note that , and therefore
where and . It follows that
and take . Then we have .
It remains to show that satisfies the conditions of the subdifferential in (4.3). Since
we have . Also, since
It follows that . ∎
The analysis above justifies a backfitting algorithm for estimating a constrained rank additive model with the first penalty, where the objective is
For a given coordinate , we form the residual , and then compute the projection , with singular value decomposition . We then update
and proceed to the next variable. This is a Gauss-Seidel procedure that parallels the population backfitting algorithm for SpAM (Ravikumar et al., 2009).
In the sample version we replace the conditional expectation by a nonparametric linear smoother, . The algorithm is given in Figure 1. The algorithm for penalty 2 is similar and given in Figure 2. Both algorithms can be viewed as functional projected gradient descent procedures. Note that to predict at a point not included in the training set, the smoother matrices are constructed using that point; that is, .
6 Working over an RKHS
Suppose that the functions are required to lie in a Hilbert space . A modified empirical optimization is (for the first penalty)
where is an data matrix and is an matrix of function values associated with the th columns of a data matrix . The first penalty is then a nuclear-norm constraint on these observed function values. The second penalty is a smoothness penalty on each of the coordinate functions in the appropriate Hilbert space, and is not empirical.
If is an RKHS with kernel , then the representer theorem implies that we can restrict to functions of the form
where the are real weights. In this case the optimization becomes a finite dimensional semidefinite program over . This parallels the approach of Raskutti, Wainwright and Yu (2012) for sparse additive models; see also Dinuzzo and Fukumizu (2012).
If denotes the Gram matrix for the th variable, then where . Using the first penalty, the convex optimization is then
where the third term is a smoothness penalty for the RKHS. This is a cone program with constraints involving both the second-order cone and the semidefinite cone.
7 Excess Risk Bounds
The population risk of a regression matrix is
with sample version denoted . Consider all models that can be written as
where is an orthogonal matrix, is a positive diagonal matrix, and satisfies . The population risk can be reexpressed as
and similarly for the sample risk, with replacing above. The “uncontrollable” contribution to the risk, which does not depend on , is . We can express the remaining “controllable” risk as
Using the von Neumann trace inequality, where ,
where here and in the following is a generic constant. For the last factor in (7.1), it holds that
where is a -covering of the unit -sphere, which has size ; compare Vershynin (2012, p. 665). We now assume that the functions are uniformly bounded from a Sobolev space of order two. Specifically, let denote a uniformly bounded, orthonormal basis with respect to , and assume that where
for some . The -covering number of satisfies .
Suppose that is Gaussian and the true regression function is bounded. Then the family of random variables is sub-Gaussian and sample continuous. It follows from a result of Cesa-Bianchi and Lugosi (1999) that
for some constant . Thus, by Markov’s inequality we conclude that
when tends to infinity and and possibly change with . If
then returning to (7.1), this gives us a bound on that is . More precisely, for , define the class of matrices of functions
Then, for a fitted matrix chosen from , writing , we have
|Subtracting from each of the bracketed differences, we obtain that|
then we may conclude from (7.2) that
This proves the following result.
Let minimize the empirical risk over the class . Suppose that is Gaussian, the true regression function