1 Introduction
Since it was introduced by Tibshirani (1996), the lasso regularized method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include the finite sample oracle inequalities. Among the extensive literature of the lasso method, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009)
derived the oracle inequalities for prediction risk and estimation error in a general nonparametric regression model including the high-dimensional linear regression as a special example, and
van de Geer (2008)provided oracle inequalities for the generalized linear models with Lipschitz loss functions, e.g. logistic regression and classification with hinge loss.
We consider lasso regularized high-dimensional Cox regression. Let be the survival time and the censoring time. Suppose we observe a sequence of iid observations , , where , , and are the covariates in . Due to largely parallel material, we follow closely the notation in van de Geer (2008). Let
Here is a convex subset of , and the functions are real-valued basis functions on , which are identity functions of corresponding covariates in a standard Cox model.
Consider the following Cox model (Cox, 1972):
where is the parameter of interest and is the unknown baseline hazard function. The negative log partial likelihood function for becomes
(1.1) |
The corresponding estimator with lasso penalty is denoted by
where is the weighted
norm of the vector
, with random weightsClearly the negative log partial likelihood is a sum of non-iid random variables. For ease of theoretical calculation, it is natural to consider the following intermediate function as a “replacement” of the negative log partial likelihood function:
(1.2) |
which has the desirable iid structure, but with an unknown population expectation
The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2), and the corresponding loss function becomes
(1.3) |
with expected loss
(1.4) |
where denotes the distribution of . Define the target function by
where . For simplicity we will assume that there is a unique minimum as in van de Geer (2008). Uniqueness holds for the regular Cox model when , see for example, Andersen and Gill (1982). Define the excess risk of by
It is desirable to show similar non-asymptotic oracle inequalities for the Cox regression model as in, for example, van de Geer (2008)
for generalized linear models. That is, with large probability,
Here is called the “estimation error” by van de Geer (2008), which is typically proportional to times the number of nonzero elements in .
Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function given in (1.3) is not Lipschitz. Hence the conclusion of van de Geer (2008) can not be applied directly. With the Lipschitz condition in van de Geer (2008) replaced by a similar boundedness assumption for regression parameters in Bühlmann (2006), we tackle the problem using pointwise arguments to obtain the oracle bounds of two types of errors: one is between empirical loss (1.2) and expected loss (1.4), and one is between the negative log partial likelihood (1.1) and empirical loss (1.2).
The article is organized as follows. In Section 2, we provide assumptions and additional notation that will be used throughout the paper. In Section 3, following the flow of van de Geer (2008), we first consider the case where the weights are fixed, then discuss briefly the case with random weights .
2 Assumptions
We impose five basic assumptions in this section. Assumptions A, B, and C are identical to the corresponding assumptions in van de Geer (2008). Assumption D has a similar flavor to the assumption (A2) in Bühlmann (2006) for the persistency property of boosting method in high-dimensional linear regression models. Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data, see for example, Andersen and Gill (1982).
Assumption A.
Assumption B. There exists an and strictly convex increasing G, such that for all with , one has
Assumption C. There exists a function on the subsets of the index set , such that for all , and for all and , we have
Assumption D.
Assumption E. The observation time stops at a finite time with
The convex conjugate of function given in Assumption B is denoted by such that . A typical choice of is quadratic function with some constant , i.e. , see van de Geer (2008).
From Assumptions A, D and E, we have for any ,
(2.1) |
for all , where .
Let be the theoretical norm of , and be the empirical norm. For any and in , denote
Similarly we have corresponding empirical versions,
3 Main results
3.1 Non-random normalization weights in the penalty
We show that a similar result to Theorem A.4 of van de Geer (2008) holds for the Cox model. Suppose that are known and consider the estimator
Denote the empirical probability measure based on the sample by . Let be a Rademacher sequence, independent of the training data . We fix some and denote for some . For any where , denote
Note that van de Geer (2008) has considered the supremum of the above over . We find that the pointwise argument is adequate for our purpose because only the lasso estimator is of interest, and that the calculation with in van de Geer (2008) does not apply to the Cox model due to the lack of Lipschitz property.
Lemma 3.1.
Under Assumptions A, D and E, for all satisfying , we have
where
Proof.
By the symmetrization theorem, see e.g. van der Vaart and Wellner (1996) or Theorem A.2 in van de Geer (2008), for a class of only one function we have
For , instead of using the contraction theorem that requires Lipschitz, we use the mean value theorem in the following:
We now can bound using the Bousquet’s concentration theorem provided in van de Geer (2008) as Theorem A.1.
Corollary 3.1.
Under Assumptions A, D and E, for all , and all satisfying , it holds that
where
Proof.
Now for any satisfying , we bound
which is equal to
By the mean value theorem, we have
Lemma 3.2.
Under Assumption E, we have
Lemma 3.3.
Under Assumptions A, D and E, for all we have
(3.6) | |||
where is a constant that only depends on .
Proof.
For a class of functions indexed by , , we calculate its bracketing number. For any , let be the -th quantile of , i.e.,
where is the smallest integer that is greater than or equal to . Furthermore, denote and . For , define brackets with
such that when . Since
we have , which yields
where . Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any ,
where is a constant that only depends on . Note that is bounded by . Let , we obtain (3.6).
∎
Lemma 3.4.
Under Assumptions A, D and E, for all we have
(3.7) |
Proof.
Consider the classes of functions indexed by ,
Using the same argument in the proof of Lemma 3.3, we have
where , and then for any ,
Thus we have
Let , i.e. . Since
Comments
There are no comments yet.