# Non-asymptotic Oracle Inequalities for the High-Dimensional Cox Regression via Lasso

We consider the finite sample properties of the regularized high-dimensional Cox regression via lasso. Existing literature focuses on linear models or generalized linear models with Lipschitz loss functions, where the empirical risk functions are sums of independent and identically distributed (iid) losses. The summands in the negative log partial likelihood function for censored survival data, however, are neither iid nor Lipschitz. We first approximate the negative log partial likelihood function by a sum of iid non-Lipschitz terms, then derive the non-asymptotic oracle inequalities for the lasso penalized Cox regression, using pointwise arguments to handle the lack of the iid and Lipschitz properties.

## 1 Introduction

Since it was introduced by Tibshirani (1996), the lasso regularized method for high-dimensional regression models with sparse coefficients has received a great deal of attention in the literature. Properties of interest for such regression models include the finite sample oracle inequalities. Among the extensive literature on the lasso method, Bunea, Tsybakov, and Wegkamp (2007) and Bickel, Ritov, and Tsybakov (2009) derived the oracle inequalities for prediction risk and estimation error in a general nonparametric regression model, including the high-dimensional linear regression as a special example, and van de Geer (2008) provided oracle inequalities for the generalized linear models with Lipschitz loss functions, e.g. logistic regression and classification with hinge loss.

We consider lasso regularized high-dimensional Cox regression. Let $T$ be the survival time and $C$ the censoring time. Suppose we observe a sequence of iid observations $(X_i, Y_i, \Delta_i)$, $i = 1, \dots, n$, where $Y_i = \min(T_i, C_i)$, $\Delta_i = 1(T_i \le C_i)$, and $X_i$ are the covariates in $\mathcal{X}$. Because much of the material is parallel, we follow closely the notation in van de Geer (2008). Let

$$\mathcal{F} = \Big\{ f_\theta(\cdot) = \sum_{k=1}^m \theta_k \psi_k(\cdot),\ \theta \in \Theta \Big\}.$$

Here $\Theta$ is a convex subset of $\mathbb{R}^m$, and the functions $\psi_1, \dots, \psi_m$ are real-valued basis functions on $\mathcal{X}$; in a standard Cox model they are the identity functions of the corresponding covariates.

Consider the following Cox model (Cox, 1972):

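$$\lambda(t \mid X) = \lambda_0(t)\, e^{f_\theta(X)},$$

where $\theta$ is the parameter of interest and $\lambda_0$ is the unknown baseline hazard function. The negative log partial likelihood function for $\theta$ becomes

$$l_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \Big\{ f_\theta(X_i) - \log\Big[\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge Y_i)\, e^{f_\theta(X_j)}\Big] \Big\}\Delta_i. \tag{1.1}$$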

The corresponding estimator with lasso penalty is denoted by

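$$\hat\theta_n := \arg\min_{\theta \in \Theta} \{ l_n(\theta) + \lambda_n \hat I(\theta) \},$$

where $\hat I(\theta) := \sum_{k=1}^m \hat\sigma_k |\theta_k|$ is the weighted $\ell_1$ norm of the vector $\theta$, with random weights $\hat\sigma_k$ given by the empirical norms of the basis functions $\psi_k$ (see Section 2).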

Clearly the negative log partial likelihood is a sum of non-iid random variables. For ease of theoretical calculation, it is natural to consider the following intermediate function as a “replacement” of the negative log partial likelihood function:

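$$\tilde l_n(\theta) = -\frac{1}{n}\sum_{i=1}^n \{ f_\theta(X_i) - \log\mu(Y_i; f_\theta) \}\Delta_i, \tag{1.2}$$

which has the desirable iid structure, but with an unknown population expectation

$$\mu(t; f_\theta) = E_{X,Y}\{ 1(Y \ge t)\, e^{f_\theta(X)} \}.$$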
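As a rough numerical illustration, the negative log partial likelihood is what (1.2) becomes when the unknown $\mu(Y_i; f_\theta)$ is replaced by the empirical at-risk average $\frac{1}{n}\sum_j 1(Y_j \ge Y_i) e^{f_\theta(X_j)}$. The sketch below assumes identity basis functions, so that $f_\theta(x) = x^\top\theta$, and takes the penalty weights to be root-mean-square column norms. The function names are hypothetical and are not taken from the paper.

```python
import numpy as np

def neg_log_partial_likelihood(theta, X, Y, delta):
    """Version of (1.2) with mu(Y_i; f_theta) replaced by the at-risk average
    (1/n) * sum_j 1(Y_j >= Y_i) * exp(f_theta(X_j)).
    Assumption: identity basis functions, so f_theta(x) = x @ theta."""
    f = X @ theta
    at_risk = Y[None, :] >= Y[:, None]            # at_risk[i, j] = 1(Y_j >= Y_i)
    mu_hat = (at_risk * np.exp(f)[None, :]).mean(axis=1)
    return -np.mean((f - np.log(mu_hat)) * delta)

def weighted_l1(theta, X):
    """Weighted l1 penalty. Assumption: the weight for coordinate k is the
    empirical norm sqrt(mean_i X[i, k]^2)."""
    sigma_hat = np.sqrt(np.mean(X ** 2, axis=0))
    return float(np.sum(sigma_hat * np.abs(theta)))

def lasso_objective(theta, X, Y, delta, lam):
    """Penalized criterion: partial likelihood plus lam times the penalty."""
    return neg_log_partial_likelihood(theta, X, Y, delta) + lam * weighted_l1(theta, X)

# Tiny example: three uncensored subjects and one covariate.
X = np.array([[1.0], [0.0], [-1.0]])
Y = np.array([1.0, 2.0, 3.0])
delta = np.array([1.0, 1.0, 1.0])
value_at_zero = neg_log_partial_likelihood(np.zeros(1), X, Y, delta)
```

At $\theta = 0$ the at-risk averages in this example are $1$, $2/3$ and $1/3$, so `value_at_zero` equals $\log(2/9)/3$. The value is negative only because of the $1/n$ normalization inside the logarithm.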

The negative log partial likelihood function (1.1) can then be viewed as a “working” model for the empirical loss function (1.2), and the corresponding loss function becomes

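$$\gamma_{f_\theta} = \gamma(f_\theta(X), Y, \Delta) := -\{ f_\theta(X) - \log\mu(Y; f_\theta) \}\Delta, \tag{1.3}$$

with expected loss

$$l(\theta) = -E_{Y,\Delta,X}\big[ \{ f_\theta(X) - \log\mu(Y; f_\theta) \}\Delta \big] = P\gamma_{f_\theta}, \tag{1.4}$$

where $P$ denotes the distribution of $(X, Y, \Delta)$. Define the target function by

$$\bar f := \arg\min_{f \in \mathbf{F}} P\gamma_f,$$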

where . For simplicity we will assume that there is a unique minimum as in van de Geer (2008). Uniqueness holds for the regular Cox model when , see for example, Andersen and Gill (1982). Define the excess risk of by

$$\mathcal{E}(f) := P\gamma_f - P\gamma_{\bar f}.$$

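It is desirable to show non-asymptotic oracle inequalities for the Cox regression model similar to those in, for example, van de Geer (2008) for generalized linear models. That is, with large probability,

$$\mathcal{E}(f_{\hat\theta_n}) \le \text{const.} \times \min_{\theta \in \Theta} \{ \mathcal{E}(f_\theta) + \mathcal{V}_\theta \}.$$

Here $\mathcal{V}_\theta$ is called the “estimation error” by van de Geer (2008), which is typically proportional to $\lambda_n^2$ times the number of nonzero elements in $\theta$.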

Note that the summands in the negative log partial likelihood function (1.1) are not iid, and the intermediate loss function given in (1.3) is not Lipschitz. Hence the conclusion of van de Geer (2008) cannot be applied directly. With the Lipschitz condition in van de Geer (2008) replaced by a boundedness assumption on the regression parameters similar to that in Bühlmann (2006), we tackle the problem using pointwise arguments to obtain oracle bounds for two types of errors: one between the empirical loss (1.2) and the expected loss (1.4), and one between the negative log partial likelihood (1.1) and the empirical loss (1.2).

The article is organized as follows. In Section 2, we provide assumptions and additional notation that will be used throughout the paper. In Section 3, following the flow of van de Geer (2008), we first consider the case where the weights $\sigma_k$ are fixed, then discuss briefly the case with random weights $\hat\sigma_k$.

## 2 Assumptions

We impose five basic assumptions in this section. Assumptions A, B, and C are identical to the corresponding assumptions in van de Geer (2008). Assumption D has a similar flavor to assumption (A2) in Bühlmann (2006) for the persistency property of the boosting method in high-dimensional linear regression models. Here it replaces the Lipschitz assumption in van de Geer (2008). Assumption E is commonly used for survival models with censored data; see, for example, Andersen and Gill (1982).

Assumption A.

Assumption B.  There exists an and strictly convex increasing G, such that for all with , one has

Assumption C.  There exists a function on the subsets of the index set , such that for all , and for all and , we have

Assumption D.

Assumption E.  The observation time stops at a finite time $\tau > 0$ with $\pi := P(Y \ge \tau) > 0$.

The convex conjugate of function given in Assumption B is denoted by such that . A typical choice of is quadratic function with some constant , i.e. , see van de Geer (2008).

From Assumptions A, D and E, we have for any $\theta \in \Theta$,

$$e^{|f_\theta(X_i)|} \le e^{K_m L_m \sigma_{(m)}} := U_m < \infty \tag{2.1}$$

for all , where .

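Let $\sigma_k$ be the theoretical norm of $\psi_k$, and $\hat\sigma_k$ be the empirical norm. For any $\theta$ and $\tilde\theta$ in $\Theta$, denote

$$I_1(\theta \mid \tilde\theta) := \sum_{k:\, \tilde\theta_k \ne 0} \sigma_k |\theta_k|, \qquad I_2(\theta \mid \tilde\theta) := I(\theta) - I_1(\theta \mid \tilde\theta).$$

Similarly we have corresponding empirical versions,

$$\hat I_1(\theta \mid \tilde\theta) := \sum_{k:\, \tilde\theta_k \ne 0} \hat\sigma_k |\theta_k|, \qquad \hat I_2(\theta \mid \tilde\theta) := \hat I(\theta) - \hat I_1(\theta \mid \tilde\theta).$$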
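The split of the weighted $\ell_1$ norm into the part on the support of $\tilde\theta$ and the remainder can be sketched as follows. This is a minimal illustration: the helper name `split_penalty` is hypothetical, and the weights are passed in as a plain array, so the same function covers both the theoretical and the empirical versions.

```python
import numpy as np

def split_penalty(theta, theta_tilde, sigma):
    """Return (I1, I2): the weighted l1 norm of theta restricted to the
    support of theta_tilde, and the remainder I(theta) - I1."""
    theta, theta_tilde, sigma = map(np.asarray, (theta, theta_tilde, sigma))
    on_support = theta_tilde != 0                  # indices k with theta_tilde_k != 0
    total = np.sum(sigma * np.abs(theta))          # I(theta)
    i1 = np.sum(sigma[on_support] * np.abs(theta[on_support]))
    return float(i1), float(total - i1)

# Only the first coordinate of theta_tilde is nonzero.
i1, i2 = split_penalty([1.0, -2.0, 0.5], [3.0, 0.0, 0.0], [1.0, 2.0, 4.0])
```

Here `i1` collects the single on-support term $1 \cdot |1| = 1$, while `i2` collects the rest, $2 \cdot |-2| + 4 \cdot |0.5| = 6$.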

## 3 Main results

### 3.1 Non-random normalization weights in the penalty

We show that a result similar to Theorem A.4 of van de Geer (2008) holds for the Cox model. Suppose that the weights $\sigma_k$ are known and consider the estimator

$$\hat\theta_n := \arg\min_{\theta \in \Theta} \{ l_n(\theta) + \lambda_n I(\theta) \}.$$

Denote the empirical probability measure based on the sample by . Let be a Rademacher sequence, independent of the training data . We fix some and denote for some . For any where , denote

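$$Z_\theta(M) := \big| (P_n - P)[\gamma_{f_\theta} - \gamma_{f_{\theta^*}}] \big| = \big| [\tilde l_n(\theta) - l(\theta)] - [\tilde l_n(\theta^*) - l(\theta^*)] \big|.$$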

Note that van de Geer (2008) has considered the supremum of the above over . We find that the pointwise argument is adequate for our purpose because only the lasso estimator is of interest, and that the calculation with in van de Geer (2008) does not apply to the Cox model due to the lack of Lipschitz property.

###### Lemma 3.1.

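Under Assumptions A, D and E, for all $\theta$ satisfying $I(\theta - \theta^*) \le M$, we have

$$E Z_\theta(M) \le \bar a_n M,$$

where

$$\bar a_n = 4 a_n, \qquad a_n = \sqrt{\frac{2 K_m^2 \log(2m)}{n}} + \frac{K_m \log(2m)}{n}.$$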
###### Proof.

By the symmetrization theorem, see e.g. van der Vaart and Wellner (1996) or Theorem A.2 in van de Geer (2008), for a class of only one function we have

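$$
\begin{aligned}
E Z_\theta(M) &\le 2E\left( \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \big\{ [f_\theta(X_i) - \log\mu(Y_i; f_\theta)]\Delta_i - [f_{\theta^*}(X_i) - \log\mu(Y_i; f_{\theta^*})]\Delta_i \big\} \right| \right) \\
&\le 2E\left( \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \{ f_\theta(X_i) - f_{\theta^*}(X_i) \}\Delta_i \right| \right) + 2E\left( \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \{ \log\mu(Y_i; f_\theta) - \log\mu(Y_i; f_{\theta^*}) \}\Delta_i \right| \right) \\
&= A + B.
\end{aligned}
$$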

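For $A$ we have

$$A \le 2\left( \sum_{k=1}^m \sigma_k |\theta_k - \theta_k^*| \right) E\left( \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Delta_i \psi_k(X_i)/\sigma_k \right| \right).$$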

Applying Lemma A.1 in van de Geer (2008) with and , we obtain

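$$E\left( \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Delta_i \frac{\psi_k(X_i)}{\sigma_k} \right| \right) \le a_n.$$

Thus we have

$$A \le 2 a_n M. \tag{3.1}$$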
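The bound on the expected maximum of the Rademacher averages can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the paper: covariates are iid Uniform(-1, 1) with identity basis functions, event indicators are independent Bernoulli(0.7) draws, and $K_m$ is taken to be $\sup|\psi_k|/\sigma_k = \sqrt{3}$. The function names are hypothetical.

```python
import numpy as np

def a_n(n, m, K_m):
    """The rate a_n = sqrt(2 K_m^2 log(2m) / n) + K_m log(2m) / n."""
    return np.sqrt(2 * K_m**2 * np.log(2 * m) / n) + K_m * np.log(2 * m) / n

def rademacher_max(n, m, reps=200, seed=0):
    """Monte Carlo estimate of E max_k |(1/n) sum_i eps_i Delta_i psi_k(X_i) / sigma_k|."""
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(1.0 / 3.0)                     # L2 norm of a Uniform(-1, 1) variable
    draws = np.empty(reps)
    for r in range(reps):
        X = rng.uniform(-1.0, 1.0, size=(n, m))    # assumed bounded design
        delta = rng.random(n) < 0.7                # assumed event indicators
        eps = rng.choice([-1.0, 1.0], size=n)      # Rademacher signs
        draws[r] = np.max(np.abs((eps * delta) @ X / sigma)) / n
    return draws.mean()

n, m = 200, 50
K_m = np.sqrt(3.0)                                 # sup|psi_k| / sigma_k in this toy design
estimate = rademacher_max(n, m)
bound = a_n(n, m, K_m)
```

In this toy design the simulated average `estimate` should sit comfortably below `bound`, which is consistent with the inequality being an upper bound rather than a sharp rate.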

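For $B$, instead of using the contraction theorem, which requires the Lipschitz property, we use the mean value theorem:

$$
\begin{aligned}
&\left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \{ \log\mu(Y_i; f_\theta) - \log\mu(Y_i; f_{\theta^*}) \}\Delta_i \right| \\
&\quad = \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Delta_i \sum_{k=1}^m \frac{1}{\mu(Y_i; f_{\theta^{**}})} \int_{Y_i}^{\infty}\!\int_{\mathcal{X}} (\theta_k - \theta_k^*)\psi_k(x) e^{f_{\theta^{**}}(x)}\, dP_{X,Y}(x,y) \right| \\
&\quad = \left| \sum_{k=1}^m \sigma_k(\theta_k - \theta_k^*) \frac{1}{n}\sum_{i=1}^n \frac{\varepsilon_i \Delta_i}{\mu(Y_i; f_{\theta^{**}})\sigma_k} \int_{Y_i}^{\infty}\!\int_{\mathcal{X}} \psi_k(x) e^{f_{\theta^{**}}(x)}\, dP_{X,Y}(x,y) \right| \\
&\quad \le \sum_{k=1}^m \sigma_k |\theta_k - \theta_k^*| \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Delta_i F_{\theta^{**}}(k, Y_i) \right| \\
&\quad \le M \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n \varepsilon_i \Delta_i F_{\theta^{**}}(k, Y_i) \right|,
\end{aligned}
$$

where $\theta^{**}$ is between $\theta$ and $\theta^*$, and

$$F_{\theta^{**}}(k, t) = \frac{E[1(Y \ge t)\psi_k(X) e^{f_{\theta^{**}}(X)}]}{\mu(t; f_{\theta^{**}})\sigma_k} \le \frac{(\|\psi_k\|_\infty/\sigma_k)\, E[1(Y \ge t) e^{f_{\theta^{**}}(X)}]}{\mu(t; f_{\theta^{**}})} \le K_m. \tag{3.2}$$

Since for all $k$,

$$E[\varepsilon_i \Delta_i F_{\theta^{**}}(k, Y_i)] = 0, \qquad \|\varepsilon_i \Delta_i F_{\theta^{**}}(k, Y_i)\|_\infty \le K_m,$$

and

$$\frac{1}{n}\sum_{i=1}^n E\big[\{\varepsilon_i \Delta_i F_{\theta^{**}}(k, Y_i)\}^2\big] \le \frac{1}{n}\sum_{i=1}^n E\big[\{F_{\theta^{**}}(k, Y_i)\}^2\big] \le E K_m^2 = K_m^2,$$

following Lemma A.1 in van de Geer (2008), we obtain

$$B \le 2 a_n M. \tag{3.3}$$

Combining (3.1) and (3.3), the upper bound for $E Z_\theta(M)$ is achieved. ∎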

We can now bound $Z_\theta(M)$ using Bousquet’s concentration theorem, provided in van de Geer (2008) as Theorem A.1.

###### Corollary 3.1.

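Under Assumptions A, D and E, for all $r_1 > 0$, and all $\theta$ satisfying $I(\theta - \theta^*) \le M$, it holds that

$$P\big( Z_\theta(M) \ge \bar\lambda^A_{n,0} M \big) \le \exp(-n \bar a_n^2 r_1^2),$$

where

$$\bar\lambda^A_{n,0} := \bar\lambda^A_{n,0}(r_1) := \bar a_n \left( 1 + 2 r_1 \sqrt{2(K_m^2 + \bar a_n K_m)} + \frac{4 r_1^2 \bar a_n K_m}{3} \right).$$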
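To get a feel for the size of these constants, the quantities in Lemma 3.1 and Corollary 3.1 can be evaluated directly. The sketch below only transcribes the displayed formulas. The function name and the example values of $n$, $m$, $K_m$ and $r_1$ are illustrative assumptions, not values from the paper.

```python
import math

def oracle_constants(n, m, K_m, r1):
    """Return (a_bar_n, lambda_bar, tail_bound) for the displayed formulas:
    a_n        = sqrt(2 K_m^2 log(2m) / n) + K_m log(2m) / n,
    a_bar_n    = 4 a_n,
    lambda_bar = a_bar_n (1 + 2 r1 sqrt(2 (K_m^2 + a_bar_n K_m)) + 4 r1^2 a_bar_n K_m / 3),
    tail_bound = exp(-n a_bar_n^2 r1^2)."""
    a = math.sqrt(2 * K_m**2 * math.log(2 * m) / n) + K_m * math.log(2 * m) / n
    a_bar = 4 * a
    lam = a_bar * (1 + 2 * r1 * math.sqrt(2 * (K_m**2 + a_bar * K_m))
                   + 4 * r1**2 * a_bar * K_m / 3)
    tail = math.exp(-n * a_bar**2 * r1**2)
    return a_bar, lam, tail

# Illustrative high-dimensional setting with m much larger than n.
a_bar, lam, tail = oracle_constants(n=1000, m=5000, K_m=1.0, r1=1.0)
a_bar_big_n, _, _ = oracle_constants(n=4000, m=5000, K_m=1.0, r1=1.0)
```

Because $m$ enters only through $\log(2m)$, the rate $\bar a_n$ stays small even when $m$ is much larger than $n$, and it shrinks as $n$ grows.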
###### Proof.

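Using the triangle inequality and the mean value theorem, we obtain

$$
\begin{aligned}
|\gamma_{f_\theta} - \gamma_{f_{\theta^*}}| &\le |f_\theta(X) - f_{\theta^*}(X)|\Delta + |\log\mu(Y; f_\theta) - \log\mu(Y; f_{\theta^*})|\Delta \\
&\le \sum_{k=1}^m \sigma_k |\theta_k - \theta_k^*| \frac{|\psi_k(X)|}{\sigma_k} + |\log\mu(Y; f_\theta) - \log\mu(Y; f_{\theta^*})| \\
&\le M K_m + \sum_{k=1}^m \sigma_k |\theta_k - \theta_k^*| \cdot \max_{1 \le k \le m} |F_{\theta^{**}}(k, Y)| \\
&\le 2 M K_m,
\end{aligned}
$$

where $\theta^{**}$ is between $\theta$ and $\theta^*$, and $F_{\theta^{**}}(k, Y)$ is defined in (3.2). So we have

$$\|\gamma_{f_\theta} - \gamma_{f_{\theta^*}}\|_\infty \le 2 M K_m,$$

and

$$P(\gamma_{f_\theta} - \gamma_{f_{\theta^*}})^2 \le 4 M^2 K_m^2.$$

Therefore, in view of Bousquet’s concentration theorem and Lemma 3.1, for all $M > 0$ and $r_1 > 0$,

$$P\left( Z_\theta(M) \ge \bar a_n M \left( 1 + 2 r_1 \sqrt{2(K_m^2 + \bar a_n K_m)} + \frac{4 r_1^2 \bar a_n K_m}{3} \right) \right) \le \exp(-n \bar a_n^2 r_1^2).$$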

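Now for any $\theta$ satisfying $I(\theta - \theta^*) \le M$, we bound

$$R_\theta(M) := \big| [l_n(\theta) - \tilde l_n(\theta)] - [l_n(\theta^*) - \tilde l_n(\theta^*)] \big|,$$

which is at most

$$
\begin{aligned}
&\frac{1}{n}\sum_{i=1}^n \left| \left[ \log\frac{\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge Y_i) e^{f_\theta(X_j)}}{\mu(Y_i; f_\theta)} - \log\frac{\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge Y_i) e^{f_{\theta^*}(X_j)}}{\mu(Y_i; f_{\theta^*})} \right] \Delta_i \right| \\
&\quad \le \sup_{0 \le t \le \tau} \left| \log\frac{\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge t) e^{f_\theta(X_j)}}{\mu(t; f_\theta)} - \log\frac{\frac{1}{n}\sum_{j=1}^n 1(Y_j \ge t) e^{f_{\theta^*}(X_j)}}{\mu(t; f_{\theta^*})} \right|.
\end{aligned}
$$

By the mean value theorem, we have

$$
\begin{aligned}
R_\theta(M) &\le \sup_{0 \le t \le \tau} \Bigg| \sum_{k=1}^m (\theta_k - \theta_k^*) \left\{ \frac{\sum_{j=1}^n 1(Y_j \ge t) e^{f_{\theta^{**}}(X_j)}}{\mu(t; f_{\theta^{**}})} \right\}^{-1} \\
&\qquad \times \left\{ \frac{\sum_{j=1}^n 1(Y_j \ge t)\psi_k(X_j) e^{f_{\theta^{**}}(X_j)}}{\mu(t; f_{\theta^{**}})} - \frac{\sum_{j=1}^n 1(Y_j \ge t) e^{f_{\theta^{**}}(X_j)}\, E[1(Y \ge t)\psi_k(X) e^{f_{\theta^{**}}(X)}]}{\mu(t; f_{\theta^{**}})^2} \right\} \Bigg| \\
&= \sup_{0 \le t \le \tau} \Bigg| \sum_{k=1}^m \sigma_k(\theta_k - \theta_k^*) \left\{ \frac{\sum_{j=1}^n 1(Y_j \ge t)\{\psi_k(X_j)/\sigma_k\} e^{f_{\theta^{**}}(X_j)}}{\sum_{j=1}^n 1(Y_j \ge t) e^{f_{\theta^{**}}(X_j)}} - \frac{E[1(Y \ge t)\{\psi_k(X)/\sigma_k\} e^{f_{\theta^{**}}(X)}]}{\mu(t; f_{\theta^{**}})} \right\} \Bigg| \\
&\le M \sup_{0 \le t \le \tau} \left[ \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) e^{f_{\theta^{**}}(X_i)} \right]^{-1} \Bigg\{ \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t)\{\psi_k(X_i)/\sigma_k\} e^{f_{\theta^{**}}(X_i)} - E[1(Y \ge t)\{\psi_k(X)/\sigma_k\} e^{f_{\theta^{**}}(X)}] \right| \\
&\qquad + K_m \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) e^{f_{\theta^{**}}(X_i)} - E[1(Y \ge t) e^{f_{\theta^{**}}(X)}] \right| \Bigg\},
\end{aligned}
$$

where $\theta^{**}$ is between $\theta$ and $\theta^*$, and by (2.1) we have

$$\sup_{0 \le t \le \tau} \left[ \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) e^{f_{\theta^{**}}(X_i)} \right]^{-1} \le U_m \left[ \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge \tau) \right]^{-1}. \tag{3.5}$$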
###### Lemma 3.2.

Under Assumption E, we have

$$P\left( \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge \tau) \le \frac{\pi}{2} \right) \le 2 e^{-n\pi^2/2}.$$
###### Proof.

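This is obtained directly from Massart (1990) by taking $r = \sqrt{n}\,\pi/2$ in the following:

$$P\left( \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge \tau) \le \frac{\pi}{2} \right) \le P\left( \sup_{0 \le t \le \tau} \sqrt{n} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) - P(Y \ge t) \right| \ge r \right) \le 2 e^{-2r^2}.$$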
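The bound in Lemma 3.2 can be compared with the exact probability, since the number of subjects with $Y_i \ge \tau$ is binomial with success probability $\pi = P(Y \ge \tau)$. The following sketch uses only the standard library. The helper names and the grid of $(n, \pi)$ values are illustrative assumptions.

```python
import math

def lower_tail_exact(n, pi):
    """Exact P(Binomial(n, pi) <= n * pi / 2): the probability that the
    empirical fraction of subjects with Y_i >= tau is at most pi / 2."""
    k_max = math.floor(n * pi / 2)
    return sum(math.comb(n, k) * pi**k * (1 - pi)**(n - k)
               for k in range(k_max + 1))

def lemma_bound(n, pi):
    """Right-hand side of Lemma 3.2: 2 * exp(-n * pi^2 / 2)."""
    return 2 * math.exp(-n * pi**2 / 2)

# Each entry is (n, pi, exact probability, bound).
checks = [(n, pi, lower_tail_exact(n, pi), lemma_bound(n, pi))
          for n in (10, 50, 200) for pi in (0.2, 0.5, 0.8)]
```

For small $n\pi^2$ the bound exceeds one and is uninformative; it becomes useful once $n\pi^2$ is moderately large, which is the regime relevant to the non-asymptotic results.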

###### Lemma 3.3.

Under Assumptions A, D and E, for all we have

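$$P\left( \sup_{0 \le t \le \tau} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) e^{f_\theta(X_i)} - \mu(t; f_\theta) \right| \ge U_m \bar a_n r_1 \right) \le \frac{1}{5} W^2 e^{-n \bar a_n^2 r_1^2}, \tag{3.6}$$

where $W$ is a constant that only depends on $K$.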

###### Proof.

For a class of functions indexed by , , we calculate its bracketing number. For any , let be the -th quantile of , i.e.,

$$P(Y \le t_i) = i\varepsilon, \qquad i = 1, \cdots, \lceil 1/\varepsilon \rceil - 1,$$

where is the smallest integer that is greater than or equal to . Furthermore, denote and . For , define brackets with

$$L_i(x, y) = 1(y \ge t_i)\, e^{f_\theta(x)}/U_m, \qquad U_i(x, y) = 1(y > t_{i-1})\, e^{f_\theta(x)}/U_m$$

such that when . Since

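$$\{E[U_i - L_i]^2\}^{1/2} \le \left\{ E\left[ \frac{e^{f_\theta(X)}}{U_m} \{ 1(Y \ge t_i) - 1(Y > t_{i-1}) \} \right]^2 \right\}^{1/2} \le \{P(t_{i-1} < Y < t_i)\}^{1/2} \le \sqrt{\varepsilon},$$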

we have , which yields

$$N_{[\,]}(\varepsilon, \mathcal{F}, L_2) \le \frac{2}{\varepsilon^2} = \left( \frac{K}{\varepsilon} \right)^2,$$

where . Thus, from Theorem 2.14.9 in van der Vaart and Wellner (1996), we have for any ,

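$$P\left( \sqrt{n} \sup_{0 \le t \le \tau} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{e^{f_\theta(X_i)}}{U_m} - \frac{\mu(t; f_\theta)}{U_m} \right| \ge r \right) \le \frac{1}{2} W^2 r^2 e^{-2r^2} \le \frac{1}{5} W^2 e^{-r^2},$$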

where is a constant that only depends on . Note that is bounded by . Let , we obtain (3.6).

###### Lemma 3.4.

Under Assumptions A, D and E, for all we have

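$$P\left( \sup_{0 \le t \le \tau} \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{\psi_k(X_i)}{\sigma_k} e^{f_\theta(X_i)} - E\left[ 1(Y \ge t) \frac{\psi_k(X)}{\sigma_k} e^{f_\theta(X)} \right] \right| \ge K_m U_m \left[ \bar a_n r_1 + \sqrt{\frac{\log(2m)}{n}} \right] \right) \le \frac{1}{10} W^2 e^{-n \bar a_n^2 r_1^2}. \tag{3.7}$$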
###### Proof.

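Consider the classes of functions indexed by $t$,

$$\mathcal{G}_k = \left\{ 1(y \ge t)\, e^{f_\theta(x)} \psi_k(x)/(\sigma_k K_m U_m) : t \in [0, \tau],\ y \in \mathbb{R},\ \big| e^{f_\theta(x)} \psi_k(x)/\sigma_k \big| \le K_m U_m \right\}, \qquad k = 1, \ldots, m.$$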

Using the same argument as in the proof of Lemma 3.3, we have

$$N_{[\,]}(\varepsilon, \mathcal{G}_k, L_2) \le \left( \frac{K}{\varepsilon} \right)^2,$$

where , and then for any ,

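$$P\left( \sqrt{n} \sup_{0 \le t \le \tau} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{e^{f_\theta(X_i)} \psi_k(X_i)}{\sigma_k K_m U_m} - E\left[ 1(Y \ge t) \frac{e^{f_\theta(X)} \psi_k(X)}{\sigma_k K_m U_m} \right] \right| \ge r \right) \le \frac{1}{5} W^2 e^{-r^2}.$$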

Thus we have

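$$
\begin{aligned}
&P\left( \sqrt{n} \sup_{0 \le t \le \tau} \max_{1 \le k \le m} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{e^{f_\theta(X_i)} \psi_k(X_i)}{\sigma_k U_m K_m} - E\left[ 1(Y \ge t) \frac{e^{f_\theta(X)} \psi_k(X)}{\sigma_k U_m K_m} \right] \right| \ge r \right) \\
&\quad \le P\left( \bigcup_{k=1}^m \left\{ \sqrt{n} \sup_{0 \le t \le \tau} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{e^{f_\theta(X_i)} \psi_k(X_i)}{\sigma_k U_m K_m} - E\left[ 1(Y \ge t) \frac{e^{f_\theta(X)} \psi_k(X)}{\sigma_k U_m K_m} \right] \right| \ge r \right\} \right) \\
&\quad \le \sum_{k=1}^m P\left( \sqrt{n} \sup_{0 \le t \le \tau} \left| \frac{1}{n}\sum_{i=1}^n 1(Y_i \ge t) \frac{e^{f_\theta(X_i)} \psi_k(X_i)}{\sigma_k U_m K_m} - E\left[ 1(Y \ge t) \frac{e^{f_\theta(X)} \psi_k(X)}{\sigma_k U_m K_m} \right] \right| \ge r \right) \\
&\quad \le \frac{m}{5} W^2 e^{-r^2} = \frac{1}{10} W^2 e^{\log(2m) - r^2}.
\end{aligned}
$$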

Let , i.e. . Since

 √¯a2nr21+log(2m)n≤