1 Introduction
With rapid technological advances, it is now possible to collect large amounts of healthcare data, as is often observed in genomic or imaging studies. In survival analysis, such data are often analyzed to investigate how patient covariates affect the occurrence of events such as disease. The Cox proportional hazards model [cox:1972] is one of the most widely used survival models for censored time-to-event data. When the number of covariates collected is larger than the sample size, high dimensional regularized Cox regression (e.g., under the Lasso penalty) has been proposed in the literature, e.g., [gui:2005, bradic:2011, jelena:2015]. In particular, [kong:2014] and [huang:2014] studied finite sample oracle inequalities for Lasso regularized Cox models under random and fixed designs, respectively.
As a natural follow-up, we consider high dimensional inference for Cox regression models under possible misspecification. Recent progress on high dimensional inference is mostly concerned with (generalized) linear models. One particular method is the desparsified Lasso estimator; see [van:2014], [javanmard:2014] and [zhang:2014]. In this paper, a similar desparsified Lasso estimator is proposed based on the log partial likelihood of Cox regression models. A key technical condition in justifying high dimensional inference for (generalized) linear models is that the summands in the log likelihood be Lipschitz and independently and identically distributed (i.i.d.). Unfortunately, the summands in the log partial likelihood for censored survival data are neither i.i.d. nor Lipschitz. One major novelty in our theoretical analysis is to introduce an intermediate function with a sum of i.i.d. non-Lipschitz terms to approximate the above log partial likelihood, as in [kong:2014]. We further apply the mean value theorem to deal with the non-Lipschitz loss function under a boundedness condition, i.e., Assumption 2. This novel technical device enables us to derive the limiting value of our proposed (non-sparse) estimator, called the pseudo-true parameter, which turns out to be determined by the intermediate loss function proposed above and can be interpreted meaningfully; see Example 1. Note that the pseudo-true parameter is not necessarily sparse even if the underlying true hazard function depends only on a few covariates. Fortunately, we are able to identify a situation in which the sparsity of the true parameter can be inferred from that of the above limiting parameter. Specifically, the inactive variables are always included in the sparsity set estimated through the working model.
Another crucial feature of our work is that it does not require the model to be correctly specified. The consequences of misspecifying low dimensional Cox models have been extensively investigated in [gail:1984, struthers:1986, lin:1989], among others. High dimensional inference for misspecified linear models has recently been studied in [buhlmann:2015]. To perform valid statistical inference, we establish the asymptotic distribution that centers around the pseudo-true parameter, and further provide a robust variance estimation formula that works even under model misspecification. Empirical performance is demonstrated in both correctly specified and misspecified Cox regression models. While this manuscript was under preparation, we noted an arXiv work [fang:2014] on high dimensional inference for correctly specified Cox regression based on a decorrelation method. During our revision, we also noted another arXiv work [yu:2018] on constructing confidence intervals for high dimensional Cox models based on the CLIME estimator [cai:2011], where the covariates are possibly time-dependent. Nevertheless, our inference results are built on the desparsified Lasso with a particular focus on misspecification, and are analyzed through a different intermediate function approach.
2 Robust desparsified Lasso estimator
Consider a survival model with a true hazard function for a failure time given a covariate vector . Denote as the censoring time, and . Let be i.i.d. observations from the underlying true model and be the design matrix. We fit a potentially misspecified working model to the above observations:
(1) 
where and is an unknown baseline hazard function. Note that need not take an exponential regression form, nor does it have to be a proportional hazards model.
Under the working model (1), the negative log partial likelihood function is written as
(2) 
with its first and second derivatives
where and represents the Kronecker product.^1
^1 For a matrix and a matrix , the Kronecker product is a matrix in such that
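Since the displayed formulas are not reproduced here, the following is a minimal sketch of the negative log partial likelihood (2) under the working Cox model, assuming no tied event times (Breslow convention); all names are illustrative and not part of the paper.

```python
import math

def neg_log_partial_likelihood(beta, times, events, X):
    """Average negative log partial likelihood of a Cox working model
    with linear predictor x'beta, assuming no tied event times.

    times  : observed times (minimum of failure and censoring times)
    events : 1 if the failure is observed, 0 if censored
    X      : list of covariate vectors, one per subject
    """
    n = len(times)
    # linear predictors x_i' beta
    eta = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
    loss = 0.0
    for i in range(n):
        if events[i] == 1:
            # risk set at t_i: subjects still under observation at that time
            risk = sum(math.exp(eta[j]) for j in range(n) if times[j] >= times[i])
            loss -= eta[i] - math.log(risk)
    return loss / n
```

Each summand involves a risk-set sum over other subjects, which is why the summands are neither i.i.d. nor Lipschitz, motivating the intermediate function introduced in Section 3.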
The Lasso estimator for is defined as
where is the norm. It is known that does not possess a tractable limiting distribution [kong:2014, huang:2014]. Inspired by the recent desparsifying idea, we construct a non-sparse estimator by inverting the Karush-Kuhn-Tucker (KKT) condition:
where is a reasonable approximation to the inverse of . We remark that the procedure for constructing remains the same regardless of whether the working model (1) is correctly specified. As will be shown in Section 3, the limiting value of can be interpreted meaningfully. Moreover, is shown to be asymptotically normal, and its variance can be estimated consistently even under model misspecification.
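As a rough sketch of inverting the KKT condition, the desparsified estimator takes a one-step-correction form: the Lasso solution minus the approximate inverse Hessian applied to the gradient of the loss at that solution. The names below are ours, not the paper's.

```python
def desparsify(beta_hat, grad, theta_hat):
    """One-step correction b = beta_hat - theta_hat @ grad, where grad is the
    gradient of the negative log partial likelihood at the Lasso solution and
    theta_hat (a p x p list of lists) approximates the inverse Hessian."""
    p = len(beta_hat)
    return [beta_hat[k] - sum(theta_hat[k][j] * grad[j] for j in range(p))
            for k in range(p)]
```

With the exact inverse Hessian, this is one Newton step started at the Lasso solution, which removes the shrinkage bias at the cost of sparsity.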
The approximation can be constructed by performing nodewise Lasso as follows. We first rewrite as a product of a matrix and its transpose:
(3) 
where
(4) 
Denote as the th column of and as the submatrix of without . Based on the decomposition (3), we run the following nodewise Lasso times:
(5) 
where . Define where denotes the th diagonal element of and denotes the th row of without . We now define
where
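The nodewise regressions (5) are ordinary Lasso problems: each column of the design factor is regressed on the remaining columns. A plain coordinate-descent solver for one such regression might look as follows (a sketch with illustrative names; the paper's tuning is by cross-validation, not fixed).

```python
import math

def lasso_cd(y, Z, lam, n_iter=200):
    """Coordinate-descent Lasso minimizing (1/2n)||y - Z g||^2 + lam * ||g||_1,
    as used in each nodewise regression: y is one column of the design factor,
    Z holds the remaining columns."""
    n, p = len(Z), len(Z[0])
    g = [0.0] * p
    # residuals for the current coefficients
    resid = [y[i] - sum(Z[i][k] * g[k] for k in range(p)) for i in range(n)]
    for _ in range(n_iter):
        for k in range(p):
            zk = [Z[i][k] for i in range(n)]
            norm = sum(v * v for v in zk) / n
            if norm == 0.0:
                continue
            # partial residual correlation for coordinate k
            rho = sum(zk[i] * (resid[i] + zk[i] * g[k]) for i in range(n)) / n
            # soft-thresholding update
            new = math.copysign(max(abs(rho) - lam, 0.0), rho) / norm
            for i in range(n):
                resid[i] += zk[i] * (g[k] - new)
            g[k] = new
    return g
```

Running this once per column (with per-column penalties) yields the rows of the approximate inverse after the rescaling by the estimated diagonal terms described above.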
3 Theoretical properties
3.1 Pseudo-true Parameter
In this section, we derive the limiting value of , denoted as , and further discuss the meaning of its sparsity. As discussed previously, the summands in log partial likelihood (2), based on which is constructed, are neither i.i.d. nor Lipschitz. Therefore, we first need to introduce an intermediate function that approximates (2):
where . As implied by Theorem 3.2, the pseudo-true parameter is the unique solution to a system of equations,
(6) 
where , provided that
(7) 
is positive definite. It is easy to verify that (6) and (7) turn out to be and , respectively. Hence, indeed plays a role similar to that of a true likelihood for .
From the example below, we note that with some particular limiting value can still be useful for some statistical inference problems. [Example 1] Suppose that the true hazard function is , in comparison with the working model . Let be the subvector of without the first element . If we assume that is independent of and is symmetric about zero, it can be shown by substituting into (6) that , provided that the censoring time is independent of . In this case, according to Theorem 3.2 below, we can construct a valid test based on for testing the null hypothesis that the failure time does not depend on , for any .
The pseudo-true parameter defined in (6) is not necessarily sparse even if the underlying true hazard function depends only on a few covariates. Theorem 3.1 says that if we infer a variable to be active (significantly different from zero) in the working model, it must be an active variable in the true model. Interestingly, this directly implies for in Example 1 without any calculation.
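Given the asymptotic normality established in Theorem 3.2, the test in Example 1 reduces to a standard Wald-type construction. A minimal sketch, assuming one already has a coordinate of the desparsified estimator and its estimated standard error (names are ours):

```python
import math

def wald_p_value(b_j, se_j):
    """Two-sided p-value for H0: the j-th pseudo-true coefficient is zero,
    using the asymptotic normality of the desparsified estimator.
    b_j is the estimated coordinate, se_j its estimated standard error."""
    z = abs(b_j / se_j)
    # standard normal two-sided tail probability via the complementary error function
    return math.erfc(z / math.sqrt(2.0))
```

Under the conditions of Example 1, rejecting when this p-value falls below the nominal level gives an asymptotically valid test that the failure time does not depend on the covariate in question, even though the working model is misspecified.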
Define and as the index set of all variables having an influence on the true conditional hazard function (the conditional distribution of given ). Let be the subvector of with (the complement of ) as its index set.
Suppose that . Then we have If we further assume that the censoring time is independent of , then .
In the theorem above, we do not need the Gaussian design condition required in [buhlmann:2015] for misspecified linear models. Rather, a conditional expectation condition suffices (even for generalized linear regression).
3.2 Asymptotic Distribution
In this section, we show that converges to a normal distribution and further provide a robust variance estimation formula that is consistent even under misspecification. Recall that is the intermediate function. Some straightforward calculation shows that can be rewritten as (in comparison with (3))
where denotes the expectation with respect to and only, and
Before stating our main assumptions, we need the following notation. For , define . Let (assumed to exist) and . For simplicity, we write . Recall that is the design matrix.
Assumption 1
.
Assumption 2
.
Assumption 3
, where is defined in the Appendix.
Assumption 4
Assumption 5
Assumption 6
and , where and is the number of offdiagonal nonzeros of the th row of .
Assumption 7
and uniformly for
Assumptions 1 – 4 are the same as Conditions (iii), (iv) and (v) in Theorem 3.3 of [van:2014]. Assumption 5 is typically required in survival analysis; see [andersen:1982]. The condition on in Assumption 6 is also typical for the desparsified Lasso method, while the condition imposed on ensures that the difference between and is of the order , where and are the th rows of and , respectively. Note that Assumption 2 significantly relaxes the boundedness condition on imposed in [kong:2014]. In fact, under Assumption 2, we can obtain a nonasymptotic oracle inequality similar to that in [kong:2014] by choosing a slightly larger constant in the tuning parameter defined therein.
Lemma 3.2 describes the difference between and ( and ). We omit the proof, which can be straightforwardly adapted from [kong:2014] under the weaker Assumption 2 discussed above. Our proof differs from [van:2014], as (used in the analysis of ) does not necessarily minimize ; this is due to the introduction of our intermediate function.
and
Moreover,
where is the largest eigenvalue of .
Lemma 3.2 shows the asymptotic normality of the score statistic under high dimensional setting, which is similar to [lin:1989] for any fixed .
From Lemmas 3.2 – 3.2, we obtain our main results on the asymptotic normality of . In particular, the asymptotic variance formula (8) in Theorem 3.2 (also used in [lin:1989] for the low dimensional case) is robust in the sense that it can be applied irrespective of whether the model is correctly specified, while (9) in Corollary 3.2 holds only for correctly specified models.
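The robust formula (8) is sandwich-type: the projected per-subject scores are squared and averaged rather than replaced by the model-based Hessian. A minimal sketch for a single coordinate, assuming the per-subject score vectors and the corresponding row of the approximate inverse Hessian are available (names and layout are ours):

```python
def sandwich_variance(theta_row, scores):
    """Robust (sandwich-type) variance estimate for one coordinate of the
    desparsified estimator: the empirical second moment of the per-subject
    scores projected along theta_row (one row of the approximate inverse).
    Valid under misspecification, unlike the model-based formula."""
    n = len(scores)
    p = len(theta_row)
    # projected score for each subject
    proj = [sum(theta_row[k] * s[k] for k in range(p)) for s in scores]
    return sum(v * v for v in proj) / n
```

Under a correctly specified model the information equality makes this agree asymptotically with the non-robust formula (9); under misspecification only the sandwich form remains consistent.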
4 Numerical study
We conducted extensive simulations to investigate the finite sample performance of our high dimensional inference methods. The rows of were drawn independently from with each element truncated by . A constant censoring time was chosen to yield and censoring rates. The Lasso estimator was obtained with a tuning parameter selected by -fold cross-validation, while the ’s in the nodewise Lasso were also chosen by -fold cross-validation. We set and with replications. Note that when , we compare our results with those derived from the partial likelihood estimation method. All simulations were run on the Purdue University Rice cluster. For the case , it took approximately 4 hours for replications on one node with two 10-core Intel Xeon E5 processors (20 cores per node) and 64 GB of memory. For , it took approximately 1 hour for replications based on the desparsified Lasso method. For the real data example in Section 4.3, it took approximately 6 hours.
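A data-generating step of this kind can be sketched as follows, assuming a unit baseline hazard so that survival times are exponential with rate exp(x'beta); standard-normal covariates stand in for the truncated-Gaussian design described above, and all names are illustrative.

```python
import math
import random

def simulate_cox(n, beta, censor_time, rng):
    """Generate (observed time, event indicator, covariates) triples from a
    Cox model with unit baseline hazard: T = -log(U) / exp(x'beta), with
    administrative censoring at a fixed constant time."""
    data = []
    for _ in range(n):
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        eta = sum(b * xi for b, xi in zip(beta, x))
        # inverse-transform sampling of an exponential failure time
        t = -math.log(rng.random()) / math.exp(eta)
        data.append((min(t, censor_time), 1 if t <= censor_time else 0, x))
    return data
```

Tuning the constant `censor_time` up or down changes the induced censoring rate, which is how fixed target rates can be achieved in a design like this.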
4.1 Correctly specified Cox regression model
Assume is the true hazard function with two different covariance matrices :

Independent: ,

Equal Correlation: .^2
^2 Here for denotes a matrix with diagonal elements 1 and off-diagonal elements 0.8.
The active set has cardinality either or , with , and the regression coefficients were drawn from a fixed realization of i.i.d. uniform random variables on . Denote as a two-sided confidence interval. In Tables 1 and 2, we report empirical versions of
The coverage probabilities are generally close to . For active sets with larger , we observe that the confidence intervals are wider, especially in the equal correlation case. This might be because our high dimensional inference procedure is more suitable for very sparse regression. When , the coverage probabilities of confidence intervals based on the desparsified Lasso method are mostly closer to the nominal level than those based on the partial likelihood method. We further notice from Table 2 that the partial likelihood method does not work well in the ‘equal correlation’ case, which indicates that it does not tolerate strong collinearity among the covariates.
Active set  Active set  

Independent  Equal Correlation  Independent  Equal Correlation  
Censoring Rate  15%  30%  15%  30%  15%  30%  15%  30% 
Avgcov  0.902  0.896  0.918  0.914  0.930  0.952  0.915  0.903 
Avglength  5.553  5.842  9.883  10.442  5.467  5.982  12.477  14.104 
Avgcov  0.967  0.965  0.943  0.934  0.998  0.998  0.946  0.925 
Avglength  4.874  5.273  7.396  7.972  5.176  5.675  11.264  12.690 
Desparsified Lasso Method  

Active set  Active set  
Independent  Equal Correlation  Independent  Equal Correlation  
Censoring Rate  15%  30%  15%  30%  15%  30%  15%  30% 
Avgcov  0.837  0.867  0.921  0.925  0.833  0.836  0.878  0.879 
Avglength  0.400  0.445  0.754  0.855  0.508  0.551  0.841  0.947 
Avgcov  0.949  0.952  0.949  0.956  0.940  0.937  0.927  0.921 
Avglength  0.338  0.382  0.757  0.865  0.344  0.388  0.791  0.899 
Partial Likelihood Method  
Active set  Active set  
Independent  Equal Correlation  Independent  Equal Correlation  
Censoring Rate  15%  30%  15%  30%  15%  30%  15%  30% 
Avgcov  0.823  0.809  0.230  0.401  0.831  0.811  0.229  0.397 
Avglength  0.459  0.513  0.776  0.902  0.459  0.513  0.775  0.902 
Avgcov  0.921  0.920  0.927  0.908  0.925  0.919  0.930  0.904 
Avglength  0.367  0.414  0.776  0.896  0.367  0.414  0.776  0.895 
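The Avgcov and Avglength summaries in the tables above are simple functionals of the per-replication intervals. A sketch of their computation, assuming estimates, standard errors, and true values are collected across replications (names are ours):

```python
def coverage_and_length(estimates, ses, truth, z=1.96):
    """Empirical coverage probability and average length of two-sided 95%
    confidence intervals of the form b +/- z * se."""
    cover = 0
    total_len = 0.0
    for b, se, t in zip(estimates, ses, truth):
        lo, hi = b - z * se, b + z * se
        cover += lo <= t <= hi
        total_len += hi - lo
    m = len(estimates)
    return cover / m, total_len / m
```

Averaging these quantities over active and inactive coordinates separately gives the rows reported for each censoring rate and covariance design.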
4.2 Misspecified Cox regression model
In this section, we consider misspecified models. In Tables 3 and 4, the survival time was generated from , and the working model (1) was used to fit the data. As explained in Example 1, the pseudo-true parameter . We calculated the average coverage probabilities and average lengths under two covariance matrices:

Independent: ,

Block Equal Correlation I: .
The asymptotic variance estimates were calculated either from (8) (robust) or (9) (non-robust). Table 3 demonstrates that when the robust variance estimate is used, the coverage probabilities are closer to the nominal level than with the non-robust formula.
Independent  Block Equal Correlation I  
15%  30%  15%  30%  
Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  
avgcov  0.930  0.900  0.937  0.904  0.946  0.938  0.952  0.938 
avglength  1.498  1.368  1.684  1.523  1.753  1.734  1.856  1.791 

Desparsified Lasso Method  
Independent  Block Equal Correlation I  
15%  30%  15%  30%  
Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  
avgcov  0.961  0.956  0.963  0.960  0.966  0.956  0.970  0.961 
avglength  0.360  0.343  0.396  0.386  0.827  0.783  0.944  0.903 
Partial Likelihood Method  
Independent  Block Equal Correlation I  
15%  30%  15%  30%  
Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  Robust  NonRobust  
avgcov  0.913  0.919  0.922  0.919  0.915  0.921  0.920  0.919 
avglength  0.344  0.347  0.386  0.379  0.725  0.732  0.714  0.799 
Next, we test the null hypothesis that the failure time does not depend on . When the working model (1) is false, a valid test for based on and the robust variance estimation method is possible if . One example for is that is symmetric about and independent of the other covariates, and has an important effect on the true hazard function . Note that the true model need not take an exponential regression form, nor does it have to be a proportional hazards model. In Tables 5 and 6, different true hazards with covariates satisfying these conditions were explored, under the following covariance matrices:

Independent: ,

Block Equal Correlation II: .
Row 1 of Tables 5 and 6 concerns the omission of relevant covariates from Cox models, rows 2-3 concern misspecification of the regression form with possible omission of relevant covariates, and rows 4-5 concern non-proportional hazards models with possible omission of relevant covariates. In particular, row 2 is an additive hazards model and row 4 is an accelerated failure time model.
Tables 5 and 6 demonstrate that tests based on the robust variance estimate give empirical sizes closer to 5% than those based on the non-robust estimate. This is because the test based on the robust variance estimation method is asymptotically valid, whereas the test based on the non-robust variance estimation may not be. When , our results based on the desparsified method are comparable with those based on the partial likelihood method.
True Hazard Function  Independent  Block Equal Correlation II  

15%  30%  15%  30%  
Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  
1.  0.049  0.080  0.049  0.063  0.043  0.196  0.040  0.202 
2.  0.049  0.083  0.044  0.074  0.056  0.126  0.049  0.110 
3.  0.042  0.063  0.045  0.083  0.069  0.108  0.062  0.131 
4.  0.054  0.072  0.046  0.092  0.043  0.223  0.060  0.250 
5.  0.054  0.096  0.046  0.116  0.049  0.158  0.048  0.180 
Desparsified Lasso Method  

True Hazard Function  Independent  Block Equal Correlation II  
15%  30%  15%  30%  
Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  
1.  0.071  0.127  0.074  0.124  0.050  0.131  0.062  0.143 
2.  0.034  0.036  0.033  0.034  0.048  0.039  0.051  0.052 
3.  0.047  0.049  0.041  0.039  0.036  0.035  0.049  0.046 
4.  0.095  0.177  0.098  0.188  0.056  0.188  0.051  0.184 
5.  0.058  0.085  0.064  0.084  0.058  0.085  0.065  0.105 
Partial Likelihood Method  
15%  30%  15%  30%  
Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  Rob.  NonR.  
1.  0.061  0.157  0.075  0.196  0.064  0.191  0.063  0.198 
2.  0.069  0.068  0.088  0.100  0.075  0.073  0.065  0.084 
3.  0.083  0.084  0.087  0.086  0.085  0.077  0.071  0.079 
4.  0.058  0.240  0.052  0.259  0.069  0.228  0.060  0.234 
5.  0.075  0.137  0.069  0.131  0.101  0.161  0.114  0.170 
is a zero-mean normal variable with standard deviation ; is a standard exponential variable
4.3 Real Data Analysis
We consider the gene-expression dataset of lymphoma patients from [alizadeh:2000]. The data (“LymphomaData.rda”) are available in the R glmnet package and are publicly available online; the original data can be obtained from http://llmpp.nih.gov/lymphoma/data.shtml. There are patients with measurements on genes. It is of particular interest to identify which genes are important to the disease. We model the data with a high dimensional Cox regression model and obtain the following significance results. Of the total genes, genes are found significant at the individual level based on the robust variance estimation method, while genes are found significant based on the non-robust variance estimation method. This is because the robust variance estimates are generally smaller than the non-robust variance estimates. It is consistent with the findings in [lin:1989] that when the model is correctly specified, robust variance estimates tend to be smaller than non-robust variance estimates; see row 1 of Table 1 in [lin:1989]. This also suggests that it may be reasonable to model the data with a high dimensional Cox regression model. The Bonferroni-Holm procedure based on finds no significant coefficient at the significance level for the family-wise error rate (FWER), under either the robust or the non-robust variance estimate. As in Example 4.3 of [van:2014], such low power is expected in the presence of thousands of variables.
Appendix
.4 Notations
Let , and