# High Dimensional Robust Inference for Cox Regression Models

We consider high-dimensional inference for potentially misspecified Cox proportional hazard models based on low dimensional results by Lin and Wei [1989]. A de-sparsified Lasso estimator is proposed based on the log partial likelihood function and shown to converge to a pseudo-true parameter vector. Interestingly, the sparsity of the true parameter can be inferred from that of the above limiting parameter. Moreover, each component of the above (non-sparse) estimator is shown to be asymptotically normal with a variance that can be consistently estimated even under model misspecifications. In some cases, this asymptotic distribution leads to valid statistical inference procedures, whose empirical performances are illustrated through numerical examples.

## Authors

• 3 publications
• 1 publication
• 12 publications
• 49 publications
04/17/2017

### Statistical inference for high dimensional regression via Constrained Lasso

In this paper, we propose a new method for estimation and constructing c...
09/24/2021

### On Statistical Inference with High Dimensional Sparse CCA

We consider asymptotically exact inference on the leading canonical corr...
05/05/2021

### The costs and benefits of uniformly valid causal inference with high-dimensional nuisance parameters

Important advances have recently been achieved in developing procedures ...
10/28/2017

### Cox's proportional hazards model with a high-dimensional and sparse regression parameter

This paper deals with the proportional hazards model proposed by D. R. C...
09/25/2020

### Towards the interpretation of time-varying regularization parameters in streaming penalized regression models

High-dimensional, streaming datasets are ubiquitous in modern applicatio...
06/23/2018

### Assumption Lean Regression

It is well known that models used in conventional regression analysis ar...
07/04/2020

### Transfer learning of regression models from a sequence of datasets by penalized estimation

Transfer learning refers to the promising idea of initializing model fit...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

With rapid technology advances, it is now possible to collect a large amount of healthcare data as often observed in genomic or image studies. In survival analysis, data are often analyzed to investigate how covariates for patient information affect the occurrence of some events such as disease. The Cox proportional hazards model [cox:1972] is one of the most widely used survival models for censored time-to-event data. When the number of covariates collected is larger than sample size, high dimensional regularized Cox regression (e.g., under Lasso penalty) has been proposed in the literature, e.g., [gui:2005, bradic:2011, jelena:2015]. In particular, [kong:2014] and [huang:2014] studied finite sample oracle inequalities for Lasso regularized Cox models under random and fixed designs, respectively.

As a natural followup work, we consider high dimensional inference for Cox regression models under possible misspecifications. Recent process towards high dimensional inference is mostly concerned with (generalized) linear models. One particular method is through de-sparsifying Lasso estimator; see [van:2014], [javanmard:2014] and [zhang:2014]. In this paper, a similar de-sparsified Lasso estimator is proposed based on the log partial likelihood of Cox regression models. A key technical condition in justifying high dimension inference for (generalized) linear models is that the summands in the log likelihood need to be Lipschitz and independently and identically distributed (i.i.d.). Unfortunately, the summands in the log partial likelihood for censored survival data are neither i.i.d. nor Lipschitz. One major novelty in our theoretical analysis is to introduce an intermediate function with a sum of i.i.d. non-Lipschitz terms for approximating the above log partial likelihood as in [kong:2014]

. We further apply mean value theorem to deal with the non-Lipschitz loss function under a bounded condition, i.e., Assumption

2.

This novel technical device enables us to derive the limiting value of our proposed (non-sparse) estimator, called as pseudo-true parameter, which turns out to be determined by the intermediate loss function proposed above and can be interpreted meaningfully; see Example 1. Note that the pseudo-true parameter is not necessarily sparse even if the underlying true hazard function depends only on a few covariates. Fortunately, we are able to identify a situation where the sparsity of the true parameter can be inferred from that of the above limiting parameter. Specifically, the inactive variables are always included in the sparsity set estimated through the working model.

Another crucial feature of our work is that it does not require the model to be correctly specified. The consequences of misspecifying low dimensional Cox models have been extensively investigated in [gail:1984, struthers:1986, lin:1989] among others. High dimensional inference for misspecified linear models have recently been studied in [buhlmann:2015]. To perform valid statistical inference, we establish the asymptotic distribution that centers around the pseudo-true parameter, and further provide a robust variance estimation formula that works even under model misspecifications. Empirical performances are demonstrated in both correctly specified and misspecified Cox regression models. While this manuscript was under preparation, we note an arxiv work [fang:2014] for high dimensional inference on correctly specified Cox regression based on decorrelated method. During our revision, we also note another arxiv work [yu:2018]

for constructing confidence intervals for high dimensional Cox model based on CLIME estimator

[cai:2011], where the covariates are possibly time-dependent. Nevertheless, our inference results are constructed based on de-sparsified Lasso with a particular focus on misspecification, and analyzed through a different intermediate function approach.

## 2 Robust de-sparsified Lasso estimator

Consider a survival model with a true hazard function for a failure time given a covariate vector . Denote as the censoring time, and . Let be i.i.d. observations from the underlying true model and be the design matrix. We fit a potentially misspecified working model to the above observations:

 λ(t|X)=λ(t)exp(XTβ), (1)

where and is an unknown baseline hazard function. Note that does not need to take an exponential regression form, or has to be a proportional hazard model.

Under the working model (1), the negative log partial likelihood function is written as

 ln(β)=−1nn∑i=1[XiTβ−log{1nn∑j=11(Yj≥Yi)exp(XjTβ)}]Δi, (2)

with its first and second derivatives

 ˙ln(β)=−1nn∑i=1{Xi−ˆμ1(Yi;β)ˆμ0(Yi;β)}Δi,¨ln(β)=1nn∑i=1⎧⎨⎩ˆμ2(Yi;β)ˆμ0(Yi;β)−[ˆμ1(Yi;β)ˆμ0(Yi;β)]⊗2⎫⎬⎭Δi,

where and represents the Kronecker product.111For a matrix and a matrix , the Kronecker product is a matrix in such that

. In particular, we have and .

The Lasso estimator for is defined as

 ˆβ:=argminβ∈Rp{ln(β)+2λ∥β∥1},

where is the norm. It is known that does not possess a tractable limiting distribution [kong:2014, huang:2014]. Inspired by the recent de-sparsifying idea, we construct a non-sparse estimator by inverting the Karush-Kuhn-Tucker (KKT) condition:

 ˆb:=(ˆb1,…,ˆbp)T=ˆβ−ˆΘ˙ln(ˆβ),

where is a reasonable approximation for the inverse of . We remark that the procedure of constructing remains the same regardless whether the working model (1) is correctly specified or not. As will be shown in Section 3, the limiting value of can be interpreted meaningfully. Moreover, is shown to be asymptotically normal, whose variance can be estimated consistently even under model misspecifications.

The approximation can be constructed by performing nodewise Lasso as follows. We first re-write as a product of a matrix and its transpose:

 ¨ln(β)=CTβCβ, (3)

where

 Cβ:=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝n−1Δ11(Y1≥Y1)√exp(X1Tβ)ˆμ0(Y1;β)(X1T−ˆμT1(Y1;β)ˆμ0(Y1;β))⋮n−1Δ11(Yn≥Y1)√exp(XnTβ)ˆμ0(Y1;β)(XnT−ˆμT1(Y1;β)ˆμ0(Y1;β))⋮n−1Δn1(Y1≥Yn)√exp(X1Tβ)ˆμ0(Yn;β)(X1T−ˆμT1(Ynβ)ˆμ0(Yn;β))⋮n−1Δn1(Yn≥Yn)√exp(XnTβ)ˆμ0(Yn;β)(XnT−ˆμT1(Yn;β)ˆμ0(Yn;β))⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠n2×p. (4)

Denote as the -th column of and as the submatrix of without . Based on the decomposition (3), we run the following nodewise Lasso times:

 ˆγj:=argminγ{∥Cˆβ,j−Cˆβ,−jγ∥2+2λj∥γ∥1}, (5)

where . Define where denotes the -th diagonal element of and denotes the -th row of without . We now define

 ˆΘ:=ˆT−2ˆG,

where

 ˆT2:=diag(ˆτ21,⋯,ˆτ2p)andˆG:=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1−ˆγ1,2⋯−ˆγ1,p−ˆγ2,11⋯−ˆγ2,p⋮⋮⋱⋮−ˆγp,1−ˆγp,2⋯1⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠.

## 3 Theoretical properties

### 3.1 Pseudo-true Parameter

In this section, we derive the limiting value of , denoted as , and further discuss the meaning of its sparsity. As discussed previously, the summands in log partial likelihood (2), based on which is constructed, are neither i.i.d. nor Lipschitz. Therefore, we first need to introduce an intermediate function that approximates (2):

 ˜ln(β):=−1nn∑i=1{XiTβ−logμ0(Yi;β)}Δi,

where As implied by Theorem 3.2, the pseudo-true parameter is the unique solution to a system of equations,

 ∫∞0μ1(t)dt−∫∞0μ1(t;β)μ0(t;β)μ0(t)dt=0, (6)

where , provided that

 Σβ0=∫∞0⎧⎨⎩μ2(t;β0)μ0(t;β0)−[μ1(t;β0)μ0(t;β0)]⊗2⎫⎬⎭μ0(t)dt (7)

is positive definite. It is easy to verify that (6) and (7) turn out to be and , respectively. Hence, indeed plays a similar role as a true likelihood for .

From the example below, we note that with some particular limit value can still be useful for some statistical inference problems. [Example 1] Suppose that the true hazard function is in comparison with the working model . Let be the sub-vector of without the first element . If we assume that is independent of and is symmetric about zero, it can be shown by substituing into (6) that , provided that the censoring time is independent of . In this case, according to Theorem 3.2 below, we can construct a valid test based on

for testing the null hypothesis that the failure time does not depend on

, for any .

The pseudo-true parameter defined in (6) is not necessarily sparse even if the underlying true hazard function only depends on a few covariates. Theorem 3.1 says that if we infer a variable as an active variable (significantly different from zero) in the working model, it must be an active variable in the true model. Interestingly, this directly implies for in Example 1 without doing any calculation.

Define and as the index set of all variables having an influence on the true conditional hazard function (conditional distribution of given ). Let be a sub-vector of with (the complement of being its index set.

Suppose that . Then we have If we further assume that the censoring time is independent of , then .

In the theorem above, we do not need Gaussian design condition, which is required in [buhlmann:2015] for misspecified linear models. Rather, a conditional expectation condition

suffices (even for generalized linear regression).

### 3.2 Asymptotic Distribution

In this section, we show that

converges to a normal distribution and further provide a

robust variance estimate formula that is consistent even under misspecifications.

Recall that is the intermediate function. Some straightforward calculation shows that can be re-written as (in comparison with (3))

 ¨˜ln(β)=1nn∑i=1⎧⎨⎩μ2(Yi;β)μ0(Yi;β)−[μ1(Yi;β)μ0(Yi;β)]⊗2⎫⎬⎭Δi=EX,Y(DTβDβ),

where denotes the expectation with respect to and only, and

 Dβ:=⎛⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜ ⎜⎝1nΔ11(Y≥Y1)√exp(XTβ)μ0(Y1;β)(XT−μT1(Y1;β)μ0(Y1;β))⋮1nΔ11(Y≥Y1)√exp(XTβ)μ0(Y1;β)(XT−μT1(Y1;β)μ0(Y1;β))⋮1nΔn1(Y≥Yn)√exp(XTβ)μ0(Yn;β)(XT−μT1(Yn;β)μ0(Yn;β))⋮1nΔn1(Y≥Yn)√exp(XTβ)μ0(Yn;β)(XT−μT1(Yn;β)μ0(Yn;β))⎞⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟ ⎟⎠n2×p.

Before stating our main assumptions, we need the following notation. For , define . Let (assume to exist) and . For simplicity, we write . Recall that is the design matrix.

.

.

###### Assumption 3

, where is defined in the Appendix.

###### Assumption 4

The smallest eigenvalue of

is bounded away from zero and .

###### Assumption 5

The observation time stops at a finite time

with probability

.

###### Assumption 6

and , where and is the number of off-diagonal non-zeros of the -th row of .

###### Assumption 7

and uniformly for

Assumptions 1 – 4 are the same as Conditions (iii), (iv) and (v) in Theorem 3.3 of [van:2014]. Assumption 5 is typically required in survival analysis, see [andersen:1982]. The condition on in Assumption 6 is also typical for the de-sparsified Lasso method, while the condition imposed on is to ensure that the difference between and is of the order , where and are the -th rows of and respectively. Note that Assumption 2 significantly relaxes the bounded condition on imposed in [kong:2014]. In fact, with Assumption 2, we can obtain a similar non-asymptotic oracle inequality as that in [kong:2014] by choosing a slightly larger constant in the tuning parameter defined therein.

Lemma 3.2 (Lemma 3.2) describes the difference between and ( and ). We omit the proof of Lemma 3.2, which can be straightforwardly adapted from [kong:2014] under a weaker condition Assumption 2 as discussed above. Our proof of Lemma 3.2 differs from [van:2014] as (used in the analysis of ) does not necessarily minimize . This is due to the introduction of our intermediate function.

Under Assumptions 1 - 7, we have

 ∥ˆβ−β0∥1=OP(λs0),∥X(ˆβ−β0)∥2/n=OP(λ2s0).

Under Assumptions 1 - 7, we have for every ,

 ∥ˆΘj−Θj∥1=OP(λs3/2j∨λ√s0sj),∥ˆΘj−Θj∥2=OP(λ√s0∨λsj),

and

 |ˆτ2ˆβ,j−τ2β0,j|=OP(sj√logp/n∨λ√s0).

Moreover,

 |ˆΘTjΣβ0ˆΘj−Θjj|≤(∥Σβ0∥∞∥ˆΘj−Θj∥21)∧(Λ2max∥ˆΘj−Θj∥22)+2|ˆτ2ˆβ,j−τ2β0,j|,

where is the largest eigenvalue of .

Lemma 3.2 shows the asymptotic normality of the score statistic under high dimensional setting, which is similar to [lin:1989] for any fixed .

Under Assumptions 1 - 7, we have

 √nˆΘTj˙ln(β0)√ˆΘTjE{n−1∑ni=1vi(β0)⊗2}ˆΘj

converges weakly to , where

 vi(β)=Δi{Xi−μ1(Yi;β)μ0(Yi;β)}−∫∞01(Yi≥t)exp{XiTβ}μ0(t;β){Xi−μ1(Yi;β)μ0(Yi;β)}d˜F(t),

and .

From Lemmas 3.23.2, we obtain our main results on the asymptotic normality of . In particular, the asymptotic variance formula (8) in Theorem 3.2 (also used in [lin:1989] for low dimensional case) is robust in the sense that it can be applied irrespective whether the model is correct or not, while (9) in Corollary 3.2 only holds for correctly specified models.

Under Assumptions 1 - 7, we have for every ,

 √n(ˆbj−β0j)/ˆσj=Vj+oP(1),

where converges weakly to and

 ˆσ2j=ˆΘTj{n−1n∑i=1ˆvi(β0)⊗2}ˆΘj, (8)

with

 ˆvi(β)=Δi{Xi−ˆμ1(Yi;β)ˆμ0(Yi;β)}−n∑k=1Δk1(Yi≥Yk)exp{XiTβ}nˆμ0(Yk;β){Xi−ˆμ1(Yk;β)ˆμ0(Yk;β)}.

If the working model (1) is correctly specified, we have for every ,

 √n(ˆbj−β0j)/˜σj=Vj+oP(1),

where converges weakly to and

 ˜σ2j=ˆΘTj¨ln(ˆβ)ˆΘj. (9)

## 4 Numerical study

We conducted extensive simulations to investigate the finite sample performances of our high dimensional inference methods. The rows of were drawn independently from with each element truncated by . Constant censoring time was generated to yield and censoring rates. The Lasso estimator was obtained with a tuning parameter from -fold cross-validation, while ’s in nodewise Lasso were also chosen by -fold cross-validation. We set and with replications. Note that when , we compare our results with those derived from partial likelihood estimation method. All the simulations were done on Purdue University rice cluster. For the case , it took approximately 4 hours for replications run on one node with two 10-core Intel Xeon-E5 processors (20 cores per node) and 64 GB of memory. For , it took approximately 1 hour for replications based on de-sparsified Lasso method. For the real example in section 4.3, it took approximately 6 hours.

### 4.1 Correctly specified Cox regression model

Assume is the true hazard function with two different covariance matrices :

• Independent: ,

• Equal Correlation: .222Here for denotes a matrix with diagonal elements 1 and off-diagonal elements 0.8.

The active set has either cardinality or with and the regression coefficients were drawn from a fixed realization of

i.i.d. uniform random variables on

. Denote as a two-sided confidence interval.

In Table 1 and 2, we report empirical versions of

 Avgcov S0=s−10∑j∈S0P(β0j∈CIj), Avglength S0=s−10∑j∈S0length(CIj), Avgcov Sc0=(p−s0)−1∑j∈Sc0P(0∈CIj), Avglength Sc0=(p−s0)−1∑j∈Sc0length(CIj).

It is demonstrated that the coverage probabilities are generally close to . For active sets with a larger , we observe that the confidence intervals are wider, especially for the equal correlation case. This might be because our high dimensional inference procedure is more suitable for very sparse regression. When , it can be seen that coverage probabilities of confidence intervals based on the de-sparsified Lasso method are mostly closer to the nominal level than those based on partial likelihood method.We further notice from Table 2 that partial likelihood method does not work well for the ‘equal correlation’ case. This indicates that partial likelihood method does not allow strong collinearity among the covariates.

### 4.2 misspecified Cox regression model

In this section, we consider misspecified models. In Tables 3 and 4, survival time was generated from , and the working model (1) was used to fit the data in simulations. As explained in Example 1, the pseudo-true parameter . we calculated the average coverage probabilities and average lengths by considering two covariance matrices:

• Independent: ,

• Block Equal Correlation I: .

The asymptotic variance estimates were calculated either from (8) (robust) or (9) (non-robust). Table 3 demonstrates that when robust variance estimate is used, the coverage probabilities are closer to the nominal level in comparison with the non-robust formula.

Next, we test the null hypothesis that the failure time does not depend on . When the working model (1) is false, a valid test for based on and robust variance estimation method is possible if . One example for is that is symmetric about and independent of other covariates, and has an important effect on the true hazard function . Note that the true model need not take an exponential regressionform, and neither does it have to a proportional hazards model. In Tables 5 and 6, different true hazards with covariates satisfying these conditions were explored, under the following covariance matrices:

• Independent: ,

• Block Equal Correlation II: .

Row 1 of Tables 5 and 6 is on the omission of relevant covariates from Cox models, rows 2-3 are on the misspecification of regression forms with possible omission of relevant covariates, and rows 4-5 are on nonproportional hazards models with possible omission of relevant covariates. In particular, row 2 is an additive hazards model and row 4 is an accelerated failure time model.

Tables 5 and 6 demonstrate that tests based on robust variance estimate give empirical sizes closer to 5% than those non-robust cases. This is because test based on robust variance estimation method is asymptotically valid, whereas test based on non-robust variance estimation may not be. When , it is noted that our results based on de-sparsified method are comparable with those based on partial likelihood method.

### 4.3 Real Data Analysis

We consider a dataset, [alizadeh:2000]: gene-expression data in lymphoma patients. The data (“LymphomaData.rda”) is available in R glmnet package and is publicly available online. The original data is available from http://llmpp.nih.gov/lymphoma/data.shtml. There are patients with measurements on genes. It is of particular interest to find out which genes are important to the disease. We model the data with a high dimensional Cox regression model and obtain the following results for significance. There are genes out of the total genes found significant at individual level based on the robust variance estimation method, while genes are found significant based on non-robust variance estimation method. This is because the robust variance estimates are generally smaller than the non-robust variance estimates. It is consistent with the findings in [lin:1989] that when the model is correctly specified, robust variance estimates tend to be smaller than non-robust variance estimates, see row 1 of Table 1 in [lin:1989]. This also suggests it may be ideal to model the data with a high dimensional Cox regression model. The Bonferroni-Holm procedure based on finds no significant coefficient at the significance level for the family-wise error rate (FWER), under either robust or non-robust variance estimate. Similarly as Example 4.3 of [van:2014], such a low power is expected in presence of thousands of variables.

## Appendix

### .4 Notations

Let , and

 J=diag{˜J,⋯,˜Jn},~{}where~{}˜J=1n1Tn, ˜Δ=diag(Δ1,⋯,Δ1n,⋯,Δn,⋯,Δnn),