## 1 Introduction

Since it was introduced by Tibshirani (1996), the lasso-regularized method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include finite-sample oracle inequalities. Within the extensive literature on the lasso, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009) derived oracle inequalities for the prediction risk and estimation error in a general nonparametric regression model that includes high-dimensional linear regression as a special case, and van de Geer (2008) provided oracle inequalities for generalized linear models with Lipschitz loss functions, e.g. logistic regression and classification with hinge loss.

We consider lasso-regularized high-dimensional Cox regression. Let be the survival time and the censoring time. Suppose we observe a sequence of iid observations , , where , , and are the covariates in . Because much of the material is parallel, we closely follow the notation of van de Geer (2008). Let

Here is a convex subset of , and the functions are real-valued basis functions on , which are identity functions of corresponding covariates in a standard Cox model.

Consider the following Cox model (Cox, 1972):

where is the parameter of interest and is the unknown baseline hazard function. The negative log partial likelihood function for becomes

(1.1)
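To make the object in (1.1) concrete, the following is a minimal sketch of a normalized negative log partial likelihood for a linear Cox model. Several things here are assumptions rather than the paper's own statement: the exact normalization (a factor of 1/n outside the sum and inside the logarithm), the Breslow-style handling of ties, the use of identity basis functions so that the linear predictor is `X @ theta`, and all function and variable names, which are hypothetical.

```python
import numpy as np

def neg_log_partial_likelihood(theta, X, y, delta):
    """Normalized negative log partial likelihood for a linear Cox model.

    Assumed form (illustrative, not taken verbatim from the source):
        -(1/n) * sum_i delta_i * [ f_i - log( (1/n) * sum_{j: y_j >= y_i} exp(f_j) ) ]
    with linear predictor f_i = X[i] @ theta.
    """
    theta = np.asarray(theta, dtype=float)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = len(y)
    f = X @ theta
    # at_risk[i, j] is True when subject j is still at risk at time y[i]
    at_risk = y[None, :] >= y[:, None]
    # Risk-set average of exp(f_j) for each subject i
    risk_avg = (at_risk * np.exp(f)[None, :]).sum(axis=1) / n
    # Only uncensored subjects (delta_i = 1) contribute to the sum
    return -np.sum(delta * (f - np.log(risk_avg))) / n

# Hypothetical toy data: 4 subjects, 2 covariates, one censored observation
X = np.array([[0.5, 1.0], [1.5, -0.2], [-0.3, 0.8], [0.0, 0.0]])
y = np.array([2.0, 1.0, 3.5, 0.7])
delta = np.array([1, 0, 1, 1])
print(neg_log_partial_likelihood(np.array([0.1, -0.2]), X, y, delta))
```

The risk-set term for subject i depends on every other observation with a larger or equal observed time, which is why the summands are not independent even when the observations are iid.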

The corresponding estimator with lasso penalty is denoted by

where is the weighted $\ell_1$ norm of the vector , with random weights

Clearly, the negative log partial likelihood is a sum of non-iid random variables. For ease of theoretical calculation, it is natural to consider the following intermediate function as a “replacement” for the negative log partial likelihood function:

(1.2)

which has the desirable iid structure, but with an unknown population expectation

The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2), and the corresponding loss function becomes

(1.3)

with expected loss

(1.4)

where denotes the distribution of . Define the target function by

where . For simplicity we will assume that there is a unique minimum as in van de Geer (2008). Uniqueness holds for the regular Cox model when , see for example, Andersen and Gill (1982). Define the excess risk of by

It is desirable to show similar non-asymptotic oracle inequalities for the Cox regression model as in, for example, van de Geer (2008) for generalized linear models. That is, with large probability,

Here is called the “estimation error” by van de Geer (2008), which is typically proportional to times the number of nonzero elements in .

Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function given in (1.3) is not Lipschitz. Hence the conclusions of van de Geer (2008) cannot be applied directly. With the Lipschitz condition in van de Geer (2008) replaced by a similar boundedness assumption for the regression parameters in Bühlmann (2006), we tackle the problem using pointwise arguments to obtain oracle bounds for two types of errors: one between the empirical loss (1.2) and the expected loss (1.4), and one between the negative log partial likelihood (1.1) and the empirical loss (1.2).

The article is organized as follows. In Section 2, we provide assumptions and additional notation that will be used throughout the paper. In Section 3, following the flow of van de Geer (2008), we first consider the case where the weights are fixed, then discuss briefly the case with random weights .

## 2 Assumptions

We impose five basic assumptions in this section. Assumptions A, B, and C are identical to the corresponding assumptions in van de Geer (2008). Assumption D is similar in flavor to assumption (A2) in Bühlmann (2006) for the persistency property of the boosting method in high-dimensional linear regression models. Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data; see, for example, Andersen and Gill (1982).

Assumption A.

Assumption B. There exists an and strictly convex increasing G, such that for all with , one has

Assumption C. There exists a function on the subsets of the index set , such that for all , and for all and , we have

Assumption D.

Assumption E. The observation time stops at a finite time with

The convex conjugate of the function given in Assumption B is denoted by such that . A typical choice of is a quadratic function with some constant , i.e. , see van de Geer (2008).
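As a worked instance of such a conjugate pair, suppose the quadratic choice is written as $G(u) = c u^2$ with a hypothetical constant $c > 0$. The scaling of the constant is an assumption made here for illustration and may differ from the parameterization used in van de Geer (2008). Maximizing $uv - cu^2$ over $u \ge 0$ gives the conjugate and the resulting Fenchel–Young inequality:

```latex
\begin{align*}
H(v) &= \sup_{u \ge 0} \{\, uv - G(u) \,\}
      = \sup_{u \ge 0} \{\, uv - c u^2 \,\} \\
     &= \frac{v^2}{4c}, \qquad v \ge 0,
        \quad \text{attained at } u = \frac{v}{2c}, \\
uv   &\le G(u) + H(v) = c u^2 + \frac{v^2}{4c}
        \qquad \text{for all } u, v \ge 0 .
\end{align*}
```

The last line is the property that lets a product term be split into a margin term in $G$ and a remainder in $H$.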

From Assumptions A, D and E, we have for any ,

(2.1)

for all , where .

Let be the theoretical norm of , and be the empirical norm. For any and in , denote

Similarly we have corresponding empirical versions,

## 3 Main results

### 3.1 Non-random normalization weights in the penalty

We show that a similar result to Theorem A.4 of van de Geer (2008) holds for the Cox model. Suppose that are known and consider the estimator
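The role of the normalization weights in the penalty can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's definition: the penalty is taken to be a tuning parameter times a weighted $\ell_1$ norm, the fixed weights are simply passed in as known numbers, and the data-driven alternative is assumed to be the empirical root-mean-square of each basis function, following the convention in van de Geer (2008). The names `weighted_l1`, `empirical_weights`, `penalized_objective`, and `lam` are hypothetical.

```python
import numpy as np

def empirical_weights(Psi):
    """Column-wise empirical norms: sqrt((1/n) * sum_i psi_k(X_i)^2).

    Psi is an (n, m) array whose column k holds basis function k
    evaluated at the n observations (assumed layout).
    """
    Psi = np.asarray(Psi, dtype=float)
    return np.sqrt(np.mean(Psi ** 2, axis=0))

def weighted_l1(theta, weights):
    """Weighted l1 norm: sum_k weights[k] * |theta[k]|."""
    theta = np.asarray(theta, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * np.abs(theta)))

def penalized_objective(loss_value, theta, weights, lam):
    """Loss plus lam times the weighted l1 penalty."""
    return loss_value + lam * weighted_l1(theta, weights)

# Fixed (known) weights, as in this subsection
theta = np.array([1.0, -1.0, 0.0])
fixed = np.array([1.0, 2.0, 0.5])
print(penalized_objective(0.5, theta, fixed, lam=0.1))

# Data-driven weights, for comparison with the random-weight case
Psi = np.array([[1.0, 2.0, 0.0], [1.0, -2.0, 3.0]])
print(empirical_weights(Psi))
```

Keeping the loss value separate from the penalty mirrors the analysis, where the weights enter only through the penalty term.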

Denote the empirical probability measure based on the sample by . Let be a Rademacher sequence, independent of the training data . We fix some and denote for some . For any where , denote

Note that van de Geer (2008) considered the supremum of the above over . We find that the pointwise argument is adequate for our purpose because only the lasso estimator is of interest, and that the calculation with in van de Geer (2008) does not apply to the Cox model due to the lack of a Lipschitz property.

###### Lemma 3.1.

Under Assumptions A, D and E, for all satisfying , we have

where

###### Proof.

By the symmetrization theorem, see e.g. van der Vaart and Wellner (1996) or Theorem A.2 in van de Geer (2008), for a class of only one function we have
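For reference, the symmetrization inequality in its standard textbook form is recalled below. The symbols $\mathcal{G}$, $g$, and $Z_i$ are generic placeholders introduced here, not the paper's notation; $P_n$ is the empirical measure, $P$ the population measure, and $\varepsilon_1, \dots, \varepsilon_n$ a Rademacher sequence independent of the data.

```latex
\[
\mathbb{E} \sup_{g \in \mathcal{G}} \bigl| (P_n - P)\, g \bigr|
\;\le\;
2\, \mathbb{E} \sup_{g \in \mathcal{G}}
\Bigl| \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i\, g(Z_i) \Bigr| .
\]
```

When the class $\mathcal{G}$ contains a single function, the suprema disappear and the bound becomes a pointwise statement, which is the setting used in this proof.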

For , instead of using the contraction theorem, which requires a Lipschitz condition, we use the mean value theorem as follows:

We can now bound using Bousquet’s concentration theorem, stated as Theorem A.1 in van de Geer (2008).

###### Corollary 3.1.

Under Assumptions A, D and E, for all , and all satisfying , it holds that

where

###### Proof.

Now for any satisfying , we bound

which is equal to

By the mean value theorem, we have

###### Lemma 3.2.

Under Assumption E, we have

###### Lemma 3.3.

Under Assumptions A, D and E, for all we have

(3.6)

where is a constant that only depends on .

###### Proof.

For a class of functions indexed by , , we calculate its bracketing number. For any , let be the -th quantile of , i.e.,

where is the smallest integer that is greater than or equal to . Furthermore, denote and . For , define brackets with

such that when . Since

we have , which yields

where . Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any ,

where is a constant that only depends on . Note that is bounded by . Letting , we obtain (3.6).

∎

###### Lemma 3.4.

Under Assumptions A, D and E, for all we have

(3.7)

###### Proof.

Consider the classes of functions indexed by ,

Using the same argument as in the proof of Lemma 3.3, we have

where , and then for any ,

Thus we have

Let , i.e. . Since
