    # Cox's proportional hazards model with a high-dimensional and sparse regression parameter

This paper deals with the proportional hazards model proposed by D. R. Cox in a high-dimensional and sparse setting for the regression parameter. The Dantzig selector is applied to estimate the regression parameter, and its variable selection consistency for the model is proved. This property enables us to reduce the dimension of the parameter and to construct asymptotically normal estimators for the regression parameter and the cumulative baseline hazard function.


## 1 Introduction

The proportional hazards model, which was proposed by Cox (1972), is one of the most commonly used models in survival analysis. In a fixed-dimensional setting, i.e., the case where the number of covariates is fixed, Andersen and Gill (1982) proved that the maximum partial likelihood estimator of the regression parameter is consistent and asymptotically normal. They also discussed the asymptotic properties of the Breslow estimator of the cumulative baseline hazard function.

Recently, many researchers have been interested in a high-dimensional and sparse setting for the regression parameter, that is, the case where $p_n \gg n$ and the number of nonzero components of the true value is relatively small. In this setting, several estimation methods have been proposed for various regression-type models. In particular, penalized methods such as the Lasso (Tibshirani (1997), Huang et al. (2013), Bradic et al. (2011) and others) have been well studied. Huang et al. (2013) derived oracle inequalities for the Lasso estimator in the proportional hazards model, which imply that the Lasso estimator is consistent even in a high-dimensional setting. Bradic et al. (2011) considered general penalized estimators, including the Lasso and SCAD, and proved that these estimators are consistent and asymptotically normal. On the other hand, the Dantzig selector, which was proposed by Candès and Tao (2007) for the linear regression model, was applied to the proportional hazards model by Antoniadis et al. (2010), who dealt with the consistency of the estimator. Fujimori and Nishiyama (2017) extended their consistency result for the Dantzig selector to $\ell_q$ consistency for every $q \in [1, \infty]$ by a method similar to that of Huang et al. (2013). However, to our knowledge, the asymptotic normality of estimators of the high-dimensional regression parameter and of the Breslow estimator has not yet been studied.

In this paper, we will focus on the asymptotic normality of estimators in a high-dimensional setting. To discuss this problem, we need to consider dimension reduction of the regression parameter. We will show that the Dantzig selector has the variable selection consistency, which enables us to reduce the dimension. Then, we will construct a new maximum partial likelihood estimator by using the variable selection consistency result and show that this estimator is asymptotically normal. In addition, we will prove that a Breslow type estimator, which is obtained by using the maximum partial likelihood estimator after dimension reduction, is also asymptotically normal.

This paper is organized as follows. The model setup, some regularity conditions and matrix conditions to deal with a high-dimensional and sparse setting are introduced in Section 2. In Section 3, we prove the asymptotic properties of estimators of the high-dimensional regression parameter, that is, the variable selection consistency of the Dantzig selector and the asymptotic normality of the maximum partial likelihood estimator after dimension reduction. The asymptotic property of the Breslow type estimator is established in Section 4.

Throughout this paper, for a vector $v \in \mathbb{R}^p$, we write $\|v\|_q$ for its $\ell_q$ norm, i.e.,

$$\|v\|_q = \left(\sum_{j=1}^{p} |v_j|^q\right)^{1/q}, \quad q < \infty; \qquad \|v\|_\infty = \sup_{1 \le j \le p} |v_j|.$$
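As a quick illustration, these norms can be computed directly; a minimal NumPy sketch, where the vector `v` and the helper `lq_norm` are our own illustrative choices:

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

def lq_norm(v, q):
    # ||v||_q = (sum_j |v_j|^q)^(1/q) for finite q
    return np.sum(np.abs(v) ** q) ** (1.0 / q)

print(lq_norm(v, 1))      # l1 norm: 7.0
print(lq_norm(v, 2))      # l2 norm: 5.0
print(np.max(np.abs(v)))  # sup norm ||v||_inf: 4.0
```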

In addition, for an $m \times n$ matrix $A$, we define $\|A\|_\infty$ by

$$\|A\|_\infty := \sup_{1 \le i \le m} \sup_{1 \le j \le n} |A_i^j|,$$

where $A_i^j$ denotes the $(i,j)$-component of the matrix $A$. For a vector $v \in \mathbb{R}^p$ and an index set $T \subset \{1, \dots, p\}$, we write $v_T$ for the $|T|$-dimensional sub-vector of $v$ restricted to the index set $T$, where $|T|$ is the number of elements of the set $T$. Similarly, for a matrix $A$ and index sets $T, T'$, we define the sub-matrix $A_{T,T'}$ by

$$A_{T,T'} := (A_i^j)_{i \in T,\, j \in T'}.$$

## 2 Preliminaries

### 2.1 Model setup

Let $T_i$ be a survival time and $C_i$ a censoring time of the $i$-th individual for every $i = 1, 2, \dots, n$, which are positive real-valued random variables on a probability space $(\Omega, \mathcal{F}, P)$. Assume that each $i$-th individual has an $\mathbb{R}^{p_n}$-valued covariate process $Z_i = \{Z_i(t)\}_{t \in [0,1]}$, and that the survival time $T_i$ is conditionally independent of the censoring time $C_i$ given $Z_i$. Moreover, we assume that the $T_i$'s never occur simultaneously. For every $i$ and $t \in [0,1]$, we observe $(X_i, D_i, Z_i(t))$, where $X_i := T_i \wedge C_i$ and $D_i := 1_{\{T_i \le C_i\}}$. We define the counting process $N_i$ and the at-risk process $Y_i$ for every $i$ as follows:

$$N_i(t) := 1_{\{X_i \le t,\, D_i = 1\}}, \quad Y_i(t) := 1_{\{X_i \ge t\}}, \quad t \in [0,1].$$

Let $\{\mathcal{F}_t\}_{t \in [0,1]}$ be the filtration defined as follows:

$$\mathcal{F}_t := \sigma\{N_i(u),\ Y_i(u),\ Z_i(u);\ 0 \le u \le t,\ i = 1, 2, \dots, n\}.$$

Suppose that $Z_i$, $i = 1, 2, \dots, n$, are predictable processes. In Cox's proportional hazards model, it is assumed that each $N_i$ has the following intensity:

$$\lambda_i(t) := Y_i(t) \lambda_0(t) \exp(\beta_0^T Z_i(t)), \quad t \in [0,1],$$

where $\lambda_0$ is the unknown deterministic baseline hazard function and $\beta_0 \in \mathbb{R}^{p_n}$ is the unknown regression parameter. Then, the following process is a square integrable martingale for every $i$:

$$M_i(t) := N_i(t) - \int_0^t \lambda_i(s)\, ds, \quad t \in [0,1].$$
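In concrete terms, the observed data and the processes $N_i$, $Y_i$ can be set up as follows; a hypothetical simulation sketch, where the exponential distributions are our own illustrative assumptions, not part of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
T = rng.exponential(1.0, n)    # latent survival times T_i (illustrative)
C = rng.exponential(1.0, n)    # censoring times C_i (illustrative)
X = np.minimum(T, C)           # observed time X_i = T_i ∧ C_i
D = (T <= C).astype(int)       # event indicator D_i = 1{T_i <= C_i}

def N(i, t):
    # counting process N_i(t) = 1{X_i <= t, D_i = 1}
    return int(X[i] <= t and D[i] == 1)

def Y(i, t):
    # at-risk indicator Y_i(t) = 1{X_i >= t}
    return int(X[i] >= t)
```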

Note that the predictable variation processes of the $M_i$'s are given by

$$\langle M_i, M_i \rangle(t) = \int_0^t \lambda_i(s)\, ds, \quad t \in [0,1],$$

and

$$\langle M_i, M_j \rangle(t) = 0, \quad i \ne j,\ t \in [0,1].$$

Hereafter, we write $\Lambda_0$ for the cumulative baseline hazard function, i.e.,

$$\Lambda_0(t) := \int_0^t \lambda_0(s)\, ds, \quad t \in [0,1].$$

The aim of this paper is to estimate the regression parameter $\beta_0$ and the cumulative baseline hazard $\Lambda_0$ in a high-dimensional and sparse setting for $\beta_0$, i.e.,

$$p = p_n \gg n, \quad S := |T_0| \ll n,$$

where $T_0 := \{j;\ \beta_0^j \ne 0\}$ is the support index set of the true value $\beta_0$. To estimate $\beta_0$, we use Cox's log-partial likelihood, which is given by

$$C_n(\beta) := \sum_{i=1}^n \int_0^1 \left\{\beta^T Z_i(t) - \log S_n^{(0)}(\beta, t)\right\} dN_i(t),$$

where

$$S_n^{(0)}(\beta, t) := \sum_{i=1}^n Y_i(t) \exp(\beta^T Z_i(t)).$$

Put $L_n(\beta) := n^{-1} C_n(\beta)$. We write $U_n(\beta)$ for the gradient of $L_n(\beta)$ and $-J_n(\beta)$ for the Hessian of $L_n(\beta)$, i.e.,

$$U_n(\beta) = \frac{1}{n} \sum_{i=1}^n \int_0^1 \left\{Z_i(t) - \frac{S_n^{(1)}}{S_n^{(0)}}(\beta, t)\right\} dN_i(t)$$

and

$$J_n(\beta) = \frac{1}{n} \sum_{i=1}^n \int_0^1 \left\{\frac{S_n^{(2)}}{S_n^{(0)}}(\beta, t) - \left(\frac{S_n^{(1)}}{S_n^{(0)}}\right)^{\otimes 2}(\beta, t)\right\} dN_i(t),$$

where

$$S_n^{(1)}(\beta, t) := \sum_{i=1}^n Z_i(t) Y_i(t) \exp(\beta^T Z_i(t))$$

and

$$S_n^{(2)}(\beta, t) := \sum_{i=1}^n Z_i(t)^{\otimes 2} Y_i(t) \exp(\beta^T Z_i(t)).$$
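For time-fixed covariates at a given $t$, the quantities $S_n^{(0)}$, $S_n^{(1)}$, $S_n^{(2)}$ are straightforward weighted sums; a minimal sketch, where the function name `S_moments` is our own:

```python
import numpy as np

def S_moments(beta, Z_t, Y_t):
    """S_n^(0), S_n^(1), S_n^(2) at a fixed time t.

    Z_t : (n, p) array of covariate values Z_i(t);
    Y_t : (n,) array of at-risk indicators Y_i(t).
    """
    w = Y_t * np.exp(Z_t @ beta)  # weights Y_i(t) exp(beta^T Z_i(t))
    S0 = w.sum()                  # scalar S_n^(0)
    S1 = Z_t.T @ w                # (p,) vector: sum_i w_i Z_i(t)
    S2 = (Z_t.T * w) @ Z_t        # (p, p): sum_i w_i Z_i(t) Z_i(t)^T
    return S0, S1, S2
```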

Note that $U_n(\beta_0) = U_n(\beta_0, 1)$ is the terminal value of the following square integrable martingale:

$$U_n(\beta_0, t) := \frac{1}{n} \sum_{i=1}^n \int_0^t \left\{Z_i(s) - \frac{S_n^{(1)}}{S_n^{(0)}}(\beta_0, s)\right\} dM_i(s).$$

### 2.2 Regularity conditions and matrix conditions

We assume the following conditions.

###### Assumption 2.1.

There exists a global constant $C > 0$ such that the true value $\beta_0$ satisfies

$$\inf_{j \in T_0} |\beta_0^j| > C.$$

The covariate processes $Z_i$, $i = 1, 2, \dots, n$, are uniformly bounded, i.e., there exists a global constant $K_1$ such that

$$\sup_{t \in [0,1]} \sup_{i} \|Z_i(t)\|_\infty \le K_1.$$

The baseline hazard function $\lambda_0$ is integrable, i.e.,

$$\int_0^1 \lambda_0(t)\, dt < \infty.$$

For every $n \in \mathbb{N}$, there exist a deterministic $\mathbb{R}$-valued function $s_n^{(0)}$, an $\mathbb{R}^{p_n}$-valued function $s_n^{(1)}$ and an $\mathbb{R}^{p_n \times p_n}$-valued function $s_n^{(2)}$ which satisfy the following conditions:

$$\sup_{\beta} \sup_{t \in [0,1]} \left\|\frac{1}{n} S_n^{(l)}(\beta, t) - s_n^{(l)}(\beta, t)\right\|_\infty \to^p 0, \quad l = 0, 1, 2,$$

as $n \to \infty$.

The functions $s_n^{(l)}$, $l = 0, 1, 2$, satisfy the following conditions:

$$\limsup_{n \to \infty} \sup_{\beta} \sup_{t \in [0,1]} \|s_n^{(l)}(\beta, t)\|_\infty < \infty, \quad l = 0, 1, 2,$$

and

$$\liminf_{n \to \infty} \inf_{\beta} \inf_{t \in [0,1]} s_n^{(0)}(\beta, t) > 0.$$

For every $n \in \mathbb{N}$, the following matrix $I_n(\beta_0)$ is nonnegative definite:

$$I_n(\beta_0) := \int_0^1 \left\{\frac{s_n^{(2)}}{s_n^{(0)}} - \left(\frac{s_n^{(1)}}{s_n^{(0)}}\right)^{\otimes 2}\right\}(\beta_0, t)\, s_n^{(0)}(\beta_0, t) \lambda_0(t)\, dt.$$
For every $\epsilon > 0$, it holds that

$$\sum_{i=1}^n \int_0^1 \left\|\xi_{T_0,i}^n(t)\right\|_2^2\, 1_{\{\|\xi_{T_0,i}^n(t)\|_2 > \epsilon\}}\, Y_i(t) \exp(\beta_0^T Z_i(t)) \lambda_0(t)\, dt \to^p 0,$$

where

$$\xi_{T_0,i}^n(t) := \frac{1}{\sqrt{n}} \left\{Z_{i,T_0}(t) - \frac{S_{n,T_0}^{(1)}}{S_n^{(0)}}(\beta_{0,T_0}, t)\right\}.$$

Note that this condition ensures that Lindeberg's condition holds. Recalling that $T_0$ is the support index set of the true value $\beta_0$, we introduce the following factor for the matrix $I_n(\beta_0)$.

###### Definition 2.2.

Define the set $C_{T_0}$ as follows:

$$C_{T_0} := \{h \in \mathbb{R}^{p_n};\ \|h_{T_0^c}\|_1 \le \|h_{T_0}\|_1\}.$$

The compatibility factor is defined by

$$\kappa(T_0; I_n(\beta_0)) := \inf_{0 \ne h \in C_{T_0}} \frac{S^{1/2} \left(h^T I_n(\beta_0) h\right)^{1/2}}{\|h_{T_0}\|_1}.$$

Matrix factors of this kind appear in many papers dealing with high-dimensional and sparse settings. See, e.g., Bickel et al. (2009), van de Geer and Bühlmann (2009) and Huang et al. (2013) for the details. We assume the following condition on $\kappa(T_0; I_n(\beta_0))$.

###### Assumption 2.3.

The compatibility factor is asymptotically positive, i.e.,

$$\liminf_{n \to \infty} \kappa(T_0; I_n(\beta_0)) > 0.$$

## 3 The estimator for the regression parameter

### 3.1 The Dantzig selector for the proportional hazards model

Now, we define the estimator $\hat\beta_n$ of the regression parameter by the Dantzig selector for the proportional hazards model:

$$\hat\beta_n := \operatorname*{arg\,min}_{\beta \in \mathbb{R}^{p_n}} \|\beta\|_1 \quad \text{subject to} \quad \|U_n(\beta)\|_\infty \le \gamma_{n,p_n}, \tag{1}$$

where $\gamma_{n,p_n} > 0$ is a tuning parameter. This type of estimator was proposed by Antoniadis et al. (2010) and was further discussed by Fujimori and Nishiyama (2017).
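For intuition, the analogous Dantzig selector in the linear regression model of Candès and Tao (2007), $\min \|\beta\|_1$ subject to $\|X^T(y - X\beta)\|_\infty \le \gamma$, reduces to a linear program. The sketch below is our own illustration of that linear-model analogue, not of (1) itself (whose constraint is nonlinear in $\beta$); it solves the LP with SciPy:

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, gamma):
    """Linear-model Dantzig selector: min ||b||_1 s.t. ||X^T(y - Xb)||_inf <= gamma.

    Written as an LP in (b_plus, b_minus) >= 0 with b = b_plus - b_minus.
    """
    n, p = X.shape
    G = X.T @ X
    u = X.T @ y
    # |u - G b| <= gamma componentwise, i.e. G b <= u + gamma and -G b <= gamma - u
    A_ub = np.block([[G, -G], [-G, G]])
    b_ub = np.concatenate([u + gamma, gamma - u])
    c = np.ones(2 * p)  # minimize sum(b_plus + b_minus) = ||b||_1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * p),
                  method="highs")
    return res.x[:p] - res.x[p:]

# noiseless toy example: the sparse truth should be recovered almost exactly
rng = np.random.default_rng(1)
X = rng.standard_normal((40, 5))
beta0 = np.array([2.0, 0.0, 0.0, -1.0, 0.0])
beta_hat = dantzig_selector(X, X @ beta0, gamma=0.1)
```

Since the truth is feasible, the minimizer's $\ell_1$ norm never exceeds $\|\beta_0\|_1$, and with a well-conditioned design the solution stays close to $\beta_0$.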

### 3.2 The $\ell_q$ consistency of the Dantzig selector

In this subsection, we discuss the consistency of the estimator $\hat\beta_n$ in the sense of the $\ell_q$-norm for every $q \in [1, \infty]$. Assume that $p_n$ and $\gamma_{n,p_n}$ satisfy the following conditions:

$$\log p_n = O(n^\zeta), \quad \gamma_{n,p_n} = O(n^{-\alpha} \log p_n),$$

where $\zeta$ and $\alpha$ are positive constants. Suppose that the sparsity $S$ is a fixed constant which does not depend on $n$. Moreover, we define the random sequence $\{\epsilon_n\}$ by

$$\epsilon_n := \|I_n(\beta_0) - J_n(\beta_0)\|_\infty.$$

Then, we can show that $\epsilon_n \to^p 0$ (see Fujimori and Nishiyama (2017)).

###### Theorem 3.1.

Under Assumptions 2.1 and 2.3, it holds for a global positive constant $K_2$ that

$$\lim_{n \to \infty} P\left(\|\hat\beta_n - \beta_0\|_1 \ge \frac{4 K_2 S \gamma_{n,p_n}}{\kappa^2(T_0; I_n(\beta_0)) - 4 S \epsilon_n}\right) = 0.$$

In particular, it holds for every $q \in [1, \infty]$ that $\|\hat\beta_n - \beta_0\|_q \to^p 0$ as $n \to \infty$.

The proof of Theorem 3.1 can be found in Fujimori and Nishiyama (2017).

### 3.3 The variable selection consistency of the Dantzig selector

The aim of this subsection is to show that $\hat\beta_n$ selects the nonzero components of $\beta_0$ correctly. To do this, we define the following estimator of the support index set $T_0$ of the true value $\beta_0$:

$$\hat T_n := \{j;\ |\hat\beta_n^j| > \gamma_{n,p_n}\}.$$

A similar estimator can be seen in Fujimori (2017), which considers a linear model of diffusion processes in a high-dimensional and sparse setting. The following theorem states that $\hat\beta_n$ has the variable selection consistency.
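The thresholding step defining $\hat T_n$ is elementary; a one-line sketch, where the function name is our own:

```python
def selected_support(beta_hat, gamma):
    # \hat T_n = {j : |beta_hat_j| > gamma}
    return {j for j, b in enumerate(beta_hat) if abs(b) > gamma}
```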

###### Theorem 3.2.

Under Assumptions 2.1 and 2.3, it holds that

$$\lim_{n \to \infty} P(\hat T_n = T_0) = 1.$$
Proof. Note that $\gamma_{n,p_n} \to 0$ as $n \to \infty$ and that the sparsity $S$ is assumed to be fixed. We have that

$$\lim_{n \to \infty} P(\|\hat\beta_n - \beta_0\|_\infty > \gamma_{n,p_n}) = 0$$

by the bound from Theorem 3.1. Therefore, it is sufficient to show that the inequality

$$\|\hat\beta_n - \beta_0\|_\infty \le \gamma_{n,p_n}$$

implies that

$$\hat T_n = T_0.$$

For every $j \in T_0$, it follows from the triangle inequality that

$$|\beta_0^j| - |\hat\beta_n^j| \le |\hat\beta_n^j - \beta_0^j| \le \gamma_{n,p_n}.$$

Since $\inf_{j \in T_0} |\beta_0^j| > C$ and $\gamma_{n,p_n} \to 0$, we have that

$$|\hat\beta_n^j| \ge |\beta_0^j| - \gamma_{n,p_n} > \gamma_{n,p_n}$$

for sufficiently large $n$, which implies that $j \in \hat T_n$, i.e., $T_0 \subset \hat T_n$. On the other hand, for every $j \in T_0^c$, we have that

$$|\hat\beta_n^j - \beta_0^j| = |\hat\beta_n^j| \le \gamma_{n,p_n}$$

since it holds that $\beta_0^j = 0$. Then, we can see that $j \notin \hat T_n$, which implies that $\hat T_n \subset T_0$. We thus obtain the conclusion.

### 3.4 The maximum partial likelihood estimator for the regression parameter after dimension reduction

Using the set $\hat T_n$, we construct a new estimator $\hat\beta_n^{(2)}$ as the solution to the following equation:

$$U_n(\beta_{\hat T_n}) = 0, \quad \beta_{\hat T_n^c} = 0. \tag{2}$$

We prove the asymptotic normality of $\hat\beta_n^{(2)}$. In this subsection, we assume that the following matrix $I$ is positive definite:

$$I := \int_0^1 \left\{\frac{s^{(2)}}{s^{(0)}} - \left(\frac{s^{(1)}}{s^{(0)}}\right)^{\otimes 2}\right\}(\beta_{0,T_0}, t)\, s^{(0)}(\beta_{0,T_0}, t) \lambda_0(t)\, dt,$$

where

$$s^{(0)}(\beta_{0,T_0}, t) := s_n^{(0)}(\beta_{0,T_0}, t), \quad s^{(1)}(\beta_{0,T_0}, t) := s_{n,T_0}^{(1)}(\beta_{0,T_0}, t)$$

and

$$s^{(2)}(\beta_{0,T_0}, t) := s_{n,T_0,T_0}^{(2)}(\beta_{0,T_0}, t).$$
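Under the simplification of time-fixed covariates, solving (2) on the selected support can be sketched as a Newton iteration on the partial likelihood score; this is our own illustration, with function names and the data-generating choices below being assumptions:

```python
import numpy as np

def cox_score_hessian(beta, times, events, Z):
    """Score U and observed information J of the log partial likelihood C_n,
    for time-fixed covariates Z (a simplification of the general Z_i(t))."""
    p = Z.shape[1]
    U = np.zeros(p)
    J = np.zeros((p, p))
    for i in np.flatnonzero(events == 1):
        at_risk = times >= times[i]     # Y_j(X_i) = 1{X_j >= X_i}
        w = np.exp(Z[at_risk] @ beta)
        S0 = w.sum()
        S1 = Z[at_risk].T @ w
        S2 = (Z[at_risk].T * w) @ Z[at_risk]
        U += Z[i] - S1 / S0
        J += S2 / S0 - np.outer(S1, S1) / S0 ** 2
    return U, J

def refit_mple(times, events, Z, support, n_iter=25):
    """Solve U(beta_T) = 0 on the selected support T, beta = 0 elsewhere,
    by Newton's method (the log partial likelihood is concave)."""
    T = sorted(support)
    b = np.zeros(len(T))
    for _ in range(n_iter):
        U, J = cox_score_hessian(b, times, events, Z[:, T])
        b = b + np.linalg.solve(J, U)   # Newton step
    beta = np.zeros(Z.shape[1])
    beta[T] = b
    return beta
```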

The following theorem states that this estimator is consistent.

###### Theorem 3.3.

Under Assumptions 2.1 and 2.3, it holds that

$$\|\hat\beta_n^{(2)} - \beta_0\|_1 \to^p 0$$

as $n \to \infty$.

Proof. We have that

$$\|\hat\beta_n^{(2)} - \beta_0\|_1 = \|\hat\beta_{n,T_0}^{(2)} - \beta_{0,T_0}\|_1 + \|\hat\beta_{n,T_0^c}^{(2)}\|_1.$$

It follows from Lemma 3.1 of Andersen and Gill (1982) that the first term on the right-hand side is $o_p(1)$ since the sparsity $S$ is assumed to be fixed. Moreover, we have that

$$\|\hat\beta_{n,T_0^c}^{(2)}\|_1\, 1_{\{\hat T_n = T_0\}} = 0$$

by the definition of $\hat\beta_n^{(2)}$. Noting that $P(\hat T_n = T_0) \to 1$ by Theorem 3.2, we obtain the conclusion by using Slutsky's theorem.

To show the asymptotic normality of $\hat\beta_n^{(2)}$, we need to prove the next lemma.

###### Lemma 3.4.

For every random sequence $\{\beta_n^*\}_{n \in \mathbb{N}}$ which satisfies

$$\|\beta_n^* - \beta_0\|_1 \to^p 0$$

as $n \to \infty$, it holds that

$$\|J_n(\beta_n^*) - I_n(\beta_0)\|_\infty = o_p(1).$$
Proof. We have for every $l = 0, 1, 2$ and $t \in [0,1]$ that

$$\frac{1}{n} \left\|S_n^{(l)}(\beta_n^*, t) - S_n^{(l)}(\beta_0, t)\right\|_\infty \le \frac{1}{n} \left\|\sum_{i=1}^n Y_i(t) Z_i(t)^{\otimes l} \exp(\beta_0^T Z_i(t)) \left\{\exp\left[\|Z_i(t)\|_\infty \|\beta_n^* - \beta_0\|_1\right] - 1\right\}\right\|_\infty \le K_1 \exp(K_1 \|\beta_0\|_1) \left|\exp\left(K_1 \|\beta_n^* - \beta_0\|_1\right) - 1\right|.$$

The right-hand side of this inequality converges to $0$ in probability since $\|\beta_n^* - \beta_0\|_1 \to^p 0$ as $n \to \infty$. Then, we obtain the conclusion in a similar way to the proof in Andersen and Gill (1982).

Then, we can prove the asymptotic normality in the following sense, in a similar way to Andersen and Gill (1982).

###### Theorem 3.5.

Under Assumptions 2.1 and 2.3, it holds that

$$\sqrt{n}\left(\hat\beta_{n,\hat T_n}^{(2)} - \beta_{0,T_0}\right) 1_{\{\hat T_n = T_0\}} \to^d N(0, I^{-1}).$$
Proof. It follows from the Taylor expansion and Lemma 3.4 that

$$\sqrt{n}\left(\hat\beta_{n,\hat T_n}^{(2)} - \beta_{0,T_0}\right) 1_{\{\hat T_n = T_0\}} = I^{-1} \sqrt{n}\, U_{n,T_0}(\beta_{0,T_0})\, 1_{\{\hat T_n = T_0\}} + o_p(1).$$

Using the martingale central limit theorem, we can see that

$$\sqrt{n}\, U_{n,T_0}(\beta_{0,T_0}) \to^d N(0, I).$$

Since Theorem 3.2 implies that $P(\hat T_n = T_0) \to 1$ as $n \to \infty$, we obtain the conclusion by using Slutsky's theorem.

## 4 The estimator for the cumulative baseline hazard function

We define the estimator of $\Lambda_0$ by the following Breslow type estimator:

$$\hat\Lambda(t) := \int_0^t \frac{d\bar N(s)}{\sum_{i=1}^n Y_i(s) \exp(\hat\beta_n^{(2)T} Z_i(s))}, \quad t \in [0,1],$$

where $\bar N := \sum_{i=1}^n N_i$ and $\hat\beta_n^{(2)}$ is defined by the equation (2). We discuss the asymptotic property of $\hat\Lambda$ in this section. For every $t \in [0,1]$, we have the decomposition

$$\sqrt{n}\{\hat\Lambda(t) - \Lambda_0(t)\} = (I) + (II) + (III),$$

where

$$(I) = \sqrt{n} \int_0^t \left\{\frac{1}{S_n^{(0)}(\hat\beta_n^{(2)}, s)} - \frac{1}{S_n^{(0)}(\beta_0, s)}\right\} d\bar N(s),$$

$$(II) = \sqrt{n} \left\{\int_0^t \frac{d\bar N(s)}{S_n^{(0)}(\beta_0, s)} - \int_0^t \lambda_0(s)\, 1_{\{\sum_{i=1}^n Y_i(s) > 0\}}\, ds\right\}$$

and

$$(III) = \sqrt{n} \left\{\int_0^t \lambda_0(s)\, 1_{\{\sum_{i=1}^n Y_i(s) > 0\}}\, ds - \Lambda_0(t)\right\}.$$

The third term $(III)$ is asymptotically negligible because it follows from Assumption 2.1 that

$$\lim_{n \to \infty} P\left(\int_0^t \lambda_0(s)\, 1_{\{\sum_{i=1}^n Y_i(s) > 0\}}\, ds - \Lambda_0(t) = 0\right) = 1.$$

Moreover, the term $(II)$ equals the following process $W_n$:

$$W_n(t) = \sqrt{n} \int_0^t \frac{d\bar M(s)}{S_n^{(0)}(\beta_0, s)}, \quad \bar M := \sum_{i=1}^n M_i,$$

which is a square integrable martingale. Using the Taylor expansion, we have that

$$(I) = H_n(\beta_n^*, t)^T \sqrt{n}\left(\hat\beta_n^{(2)} - \beta_0\right),$$

where

$$H_n(\beta_n^*, t) := -\int_0^t \frac{S_n^{(1)}}{\{S_n^{(0)}\}^2}(\beta_n^*, s)\, d\bar N(s)$$

and $\beta_n^*$ lies between $\hat\beta_n^{(2)}$ and $\beta_0$. Since $\|\beta_n^* - \beta_0\|_1 \to^p 0$ by Theorem 3.3, we can see that

$$\sup_{t \in [0,1]} \left\|H_n(\beta_n^*, t) + \int_0^t \frac{s_n^{(1)}}{s_n^{(0)}}(\beta_0, s) \lambda_0(s)\, ds\right\|_\infty = o_p(1)$$

in a similar way to the proof of Lemma 3.4. Therefore, we obtain the following theorem, which is proved by using Slutsky's theorem in a similar way to Andersen and Gill (1982).

###### Theorem 4.1.

Under Assumptions 2.1 and 2.3, it holds that $\sqrt{n}(\hat\beta_{n,\hat T_n}^{(2)} - \beta_{0,T_0})\, 1_{\{\hat T_n = T_0\}}$ and the process

$$\left[\sqrt{n}\{\hat\Lambda(t) - \Lambda_0(t)\} + \sqrt{n} \int_0^t \left(\hat\beta_{n,\hat T_n}^{(2)} - \beta_{0,T_0}\right)^T \frac{s^{(1)}}{s^{(0)}}(\beta_{0,T_0}, s) \lambda_0(s)\, ds\right] 1_{\{\hat T_n = T_0\}}$$

are asymptotically independent. The latter process is asymptotically distributed as a Gaussian martingale with the variance function

$$\int_0^t \frac{\lambda_0(s)}{s^{(0)}(\beta_{0,T_0}, s)}\, ds.$$
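For time-fixed covariates, the Breslow type estimator $\hat\Lambda$ reduces to a sum over event times; a minimal sketch, where the function name is our own:

```python
import numpy as np

def breslow(times, events, Z, beta_hat, t):
    """Breslow type estimator: Lambda_hat(t) = int_0^t d bar N(s) / S_n^(0)(beta_hat, s),
    for time-fixed covariates Z (a simplification of the general Z_i(t))."""
    risk = np.exp(Z @ beta_hat)             # exp(beta_hat^T Z_i)
    total = 0.0
    for u in np.sort(times[events == 1]):   # jump times of bar N
        if u <= t:
            S0 = risk[times >= u].sum()     # S_n^(0)(beta_hat, u), Y_i(u) = 1{X_i >= u}
            total += 1.0 / S0
    return total
```

With $\hat\beta_n^{(2)} = 0$ and no censoring, this recovers the familiar Nelson-Aalen-type sum of reciprocal risk-set sizes.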

Acknowledgements. The author would like to express sincere appreciation to Prof. Y. Nishiyama of Waseda University and Dr. K. Tsukuda of the University of Tokyo for many hours of discussion about this work.

## References

• Andersen and Gill (1982) Andersen, P.K. and Gill, R.D. (1982). Cox’s regression model for counting processes: a large sample study. Ann. Statist. 10, no. 4, 1100-1120.
• Antoniadis et al. (2010) Antoniadis, A., Fryzlewicz, P. and Letué, F. (2010). The Dantzig selector in Cox’s proportional hazards model. Scand. J. Stat. 37, no. 4, 531-552.
• Bickel et al. (2009) Bickel, P.J., Ritov, Y. and Tsybakov, A.B. (2009). Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist. 37, no. 4, 1705-1732.
• Bradic et al. (2011) Bradic, J., Fan, J. and Jiang, J. (2011). Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann. Statist. 39, no. 6, 3092-3120.
• Candès and Tao (2007) Candès, E. and Tao, T. (2007). The Dantzig selector: statistical estimation when p is much larger than n. Ann. Statist. 35, no. 6, 2313-2351.
• Cox (1972) Cox, D.R. (1972). Regression models and life tables (with discussion). J. Roy. Statist. Soc. Ser B 34 187-220.
• Fujimori and Nishiyama (2017) Fujimori, K. and Nishiyama, Y. (2017). The consistency of the Dantzig selector for Cox’s proportional hazards model. J. Statist. Plann. Inference 181, 62-70.
• Fujimori (2017) Fujimori, K. (2017). The Dantzig selector for a linear model of diffusion processes. arXiv:1709.00710
• Huang et al. (2013) Huang, J., Sun, T., Ying, Z., Yu, Y. and Zhang, C-H. (2013). Oracle inequalities for the LASSO in the Cox model. Ann. Statist. 41, no. 3, 1142-1165.
• Tibshirani (1997) Tibshirani, R. (1997). The lasso method for variable selection in the Cox model. Stat. Med. 16 385-395.
• van de Geer (1995) van de Geer, S. (1995). Exponential inequalities for martingales, with application to maximum likelihood estimation for counting processes. Ann. Statist. 23, no. 5, 1779-1801.
• van de Geer and Bühlmann (2009) van de Geer, S.A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the Lasso. Electron. J. Stat. 3, 1360-1392.