Inference for High Dimensional Censored Quantile Regression

With the availability of high dimensional genetic biomarkers, it is of interest to identify heterogeneous effects of these predictors on patients' survival, along with proper statistical inference. Censored quantile regression has emerged as a powerful tool for detecting heterogeneous effects of covariates on survival outcomes. To our knowledge, there is little work available to draw inference on the effects of high dimensional predictors for censored quantile regression. This paper proposes a novel procedure to draw inference on all predictors within the framework of global censored quantile regression, which investigates covariate-response associations over an interval of quantile levels, instead of a few discrete values. The proposed estimator combines a sequence of low dimensional model estimates that are based on multi-sample splittings and variable selection. We show that, under some regularity conditions, the estimator is consistent and asymptotically follows a Gaussian process indexed by the quantile level. Simulation studies indicate that our procedure can properly quantify the uncertainty of the estimates in high dimensional settings. We apply our method to analyze the heterogeneous effects of SNPs residing in lung cancer pathways on patients' survival, using the Boston Lung Cancer Survival Cohort, a cancer epidemiology study on the molecular mechanism of lung cancer.

There are no comments yet.

Authors

• 2 publications
• 6 publications
• 3 publications
• 90 publications
• Estimation and Inference for High Dimensional Generalized Linear Models: A Splitting and Smoothing Approach

For a better understanding of the molecular causes of lung cancer, the B...
03/11/2019 ∙ by Zhe Fei, et al. ∙ 0

• Structured Bayesian variable selection for multiple related response variables and high-dimensional predictors

It is becoming increasingly common to study the complex association betw...
01/14/2021 ∙ by Zhi Zhao, et al. ∙ 0

• Statistical Inference for Cox Proportional Hazards Models with a Diverging Number of Covariates

For statistical inference on regression models with a diverging number o...
06/06/2021 ∙ by Lu Xia, et al. ∙ 0

• Weighted Cox regression for the prediction of heterogeneous patient subgroups

An important task in clinical medicine is the construction of risk predi...
03/19/2020 ∙ by Katrin Madjar, et al. ∙ 0

• Deep Learning for Quantile Regression: DeepQuantreg

The computational prediction algorithm of neural network, or deep learni...
07/14/2020 ∙ by Yichen Jia, et al. ∙ 0

• Correlation-Adjusted Regression Survival Scores for High-Dimensional Variable Selection

Background: The development of classification methods for personalized m...
02/22/2018 ∙ by Thomas Welchowski, et al. ∙ 0

• Correlation-Adjusted Survival Scores for High-Dimensional Variable Selection

Background: The development of classification methods for personalized m...
02/22/2018 ∙ by Thomas Welchowski, et al. ∙ 0

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Lung cancer presents much heterogeneity in etiology (mckay2017large; dong2012potentially; huang2009genome), and some genetic variants may insert different impacts on different quantile levels of survival time. For example, in the Boston Lung Cancer Survival Cohort (BLCSC) (Christiani2017blcs), a cancer epidemiology cohort of over 11,000 lung cancer cases enrolled in the Boston area since 1992, it was found that SNP AX.37793583 (rs115952579), along with age, gender, cancer stage and smoking status, had heterogeneous effects on different quantiles of survival time. A total of patients in the study were genotyped, with the goal of identifying lung cancer survival-predictive SNPs. Target gene approaches, which focus on SNPs residing in cancer-related gene pathways, are appealing for increased statistical power in detecting significant SNPs (moon2003current; risch2008lung; ho2019machine), and the investigators have identified SNPs residing in 14 well-known lung cancer-related genes (zhu2017driver; korpanty2014biomarkers; yamamoto2008pik3ca; kelley2001genetic). One goal was to investigate whether and how each SNP might play a different role among the high-risk (i.e., lower quantiles of overall survival) and low-risk (i.e., higher quantiles of overall survival) cancer survivors.

Quantile regression (QR) (koenker1978regression)

is a significant extension of classic linear regression. By permitting the effects of active variables to vary across quantile levels, quantile regression can naturally accommodate and examine the heterogeneous impacts of biomarkers on different segments of the response variable’s conditional distribution. As survival data are subject to censoring and may be incomplete, QR methods developed for complete data may be unsuitable. Efforts have been devoted to developing censored quantile regression (CQR)

(powell1986censored; portnoy2003censored; peng2008survival, among others), which has become a useful alternative strategy to traditional survival models, such as the Cox model and the accelerated failure time model. QR has also been widely studied to accommodate high dimensional predictors. For example, wang2012quantile dealt with variable selection using non-convex penalization; zheng2013adaptive proposed an adaptive penalized quantile regression estimator that can select the true sparse model with high probability; and fan2014adaptive studied the penalized quantile regression with a weighted penalty in an ultra-high dimensional setting. As to high dimensional CQR (HDCQR), he2013quantile provided a model-free variable screening procedure for ultrahigh dimensional covariates, and zheng2018high proposed a penalized HDCQR built upon a stochastic integral based estimating equation. However, most of the existing works in HDCQR were designed to select a subset of predictors and estimate the effects of the selected variables, instead of drawing inference on all predictors.

Progress in high dimensional inferences has been made for linear and non-linear models (zhang2014confidence; buhlmann2014high; javanmard2014confidence; ning2017general; fei2019drawing; fei2021estimation). For example, meinshausen2009p proposed to aggregate -values from multi-sample splittings for high dimensional linear regression. Another line of works referred to as post-selection inference includes berk2013valid, lee2016exact, and belloni2019valid, which provided post-selection inference at fixed quantiles for complete data. However, these methods may not handle censored outcomes. For censored median regression, shows2010sparse

provided sparse estimation and inference, but it cannot handle high dimensional data.

We propose to draw inference on high dimensional HDCQR based on a splitting and fusing scheme, termed Fused-HDCQR. Utilizing a variable selection procedure for HDCQR such as zheng2018high

, our method operates partial regression followed by smoothing. Specifically, partial regression allows us to estimate the effect of each predictor, regardless of whether or not it is chosen by variable selection. The fused estimator aggregates the estimates based on multiple data-splittings and variable selection, with a variance estimator derived by the functional delta method

(efron2014estimation; wager2018estimation). To comprehensively assess the covariate effects on the survival distribution, we adopt a “global” quantile model (zheng2015globally) with the quantile level varying over an interval, instead of a local CQR that focuses only on a few pre-specified quantile levels. The global quantile model can address the molecular mechanism of lung cancer, our motivating disease, that hypothesizes that some genetic variants may cause heterogeneous impacts on different but unspecified segments of survival distribution (mckay2017large; dong2012potentially; huang2009genome).

Our work presents several advantages. First, compared to high dimensional Cox models (zhao2012principled; fang2017testing; kong2021high), the employed HDCQR stems from the accelerated failure time model (wei1992accelerated) and offers straightforward interpretations (hong2019quantile). Second, utilizing the global conditional quantile regression, it uses various segments of the conditional survival distribution to improve the robustness of variable selection and capture global sparsity. Third, our splitting-and-averaging scheme avoids the technicalities of estimating the precision matrix by inverting a Hessian matrix of the log likelihood, which is a major challenge for debiased-LASSO type methods (zhang2014confidence; van2014asymptotically) and is even more so if we apply debiased-LASSO to the CQR setting. Finally, as opposed to post-selection inferences (belloni2019valid, among others), Fused-HDCQR accounts for variations in model selection and draws inference for all of the predictors.

The rest of the paper is organized as follows. Section 2 introduces the method, and Section 3 details the asymptotic properties. Section 4 derives a non-parametric variance estimator, Section 5 conducts simulation studies, and Section 6 applies the proposed method to analyze the BLCSC data. The technical details, such as proofs and additional lemmas, are relegated to the online Supplementary Materials.

2 Model and Method

2.1 High dimensional censored quantile regression

Let and denote the survival outcome and censoring time, respectively. We assume that is independent of given , a

-dimensional vector of covariates (

). Let , and , where is the binary indicator function. The observed data, , are identical and independently distributed (i.i.d.) copies of . With , let be the -th conditional quantile of given . A global censored quantile regression model stipulates

 QY(τ|Z)=ZTβ∗(τ),τ∈(0,1), (1)

where is a -dimensional vector of coefficients at . We aim to draw inference on for each and for all , where is an upper bound for estimable quantiles subject to identifiability constraint caused by censoring (peng2008survival).

Let , , and . Then, is a martingale process under model (1) (fleming2011counting) and hence . We use and , , to denote the sample analogs of and . Let and

 Un(β,τ)=n−1n∑i=1Zi{Ni(θi(τ))−∫τ01{logXi≥θi(u)}dH(u)}.

We denote by the expectation of .

The martingale property implies with , entailing an estimating equation with :

 n1/2Un(β,τ)=n−1/2n∑i=1Zi{Ni(θi(τ))−∫τ01{logXi≥θi(u)}dH(u)}=0. (2)

The stochastic integral in (2) naturally suggests sequential estimation with respect to . We define a grid of quantile values to cover the interval , where and . The assumption on the lower bound is made to circumvent the singularity problem with CQR at , as detailed in assumption A. In practice, is chosen such that only a small proportion of observations are censored below the -th quantile.

Then, ’s, the estimates of ’s, , can be sequentially obtained by solving

 n−1/2n∑i=1Zi(Ni(θi(τk))−k−1∑r=0∫τr+1τr1{logXi≥´θi(τr)}dH(u))=0,

where . Due to the monotonicity of in , can be solved efficiently via -minimization. And , is defined as a right-continuous piece-wise constant function that only jumps at the grid points. It can be shown that is uniformly consistent and converges weakly to a mean zero Gaussian process for when . More importantly, provides a comprehensive understanding of the covariate effects on the conditional survival distribution over the quantile interval . We incorporate this sequential estimating procedure for low dimensional CQR estimation in our proposed method.

In addition, our method requires dimension reduction, which can be accomplished by existing methods, including the screening method proposed by he2013quantile and the penalized estimation and selection procedure developed by zheng2018high. Specifically, zheng2018high incorporated an penalty into the stochastic integral based estimating equation in (2) to obtain an L-HDCQR estimator, which achieves a uniform convergence rate of , and results in “sure screening” variable selection with high probability, where is defined in condition D. zheng2018high also proposed an AL-HDCQR estimator by employing the Adaptive Lasso penalties, which attains a uniform convergence rate of and selection consistency.

2.2 Fused-HDCQR estimator

Our proposed Fused-HDCQR procedure consists of multiple data splitting, selecting variables, fitting low dimensional CQRs with partitioned data, applying append-and-estimate to all predictors, and aggregating those estimates.

1. With the full data , determine via cross-validation the tuning parameter(s) of , an HDCQR variable selection method.

2. Let be a large positive integer. For each ,

1. randomly split the data into equal halves, and ;

2. on , apply with on , to select a subset of predictors, denoted by , or for short;

3. on , for each , append to such that , fit a partial CQR on the covariates indexed by , and denote their coefficient estimates by . Here, is a right-continuous piecewise-constant function with jumps only at the grid points of ;

4. denote by the entry of corresponding to .

3. Fusing: the final estimate of is

 ˆβj(τ)=1BB∑b=1˜βbj(τ). (3)
Remark 1.

We could select different tuning parameters for in each data split, but with much added computation. Our numerical evidence suggests that a globally chosen work well.

Remark 2.

Our procedure needs a variable selection procedure to reduce dimension. For example, L-HDCQR selects a subset: where ’s are the L-HDCQR estimates, is a predetermined threshold, and starts with 2 as the intercept term (corresponding to ) is always included in the model. For the choice of variable selection methods, our experience suggests that we adopt the screening method in he2013quantile for fast computation, use L-HDCQR for detecting any non-zero effects in the quantile interval , and choose AL-HDCQR if we opt to select fewer predictors.

Remark 3.

We select by minimizing a -fold cross-validation error defined by deviance residuals in the presence of censored outcomes (zheng2018high). Specifically, we partition the data to folds, and let be the penalized estimate of using all of the data excluding the -th fold with a tuning parameter and , where . Under the global CQR model (1), we define the cross-validation error as

 CV Error(λ)=K∑k=1∑i∈fold k∫τUν∣∣Di[ˆβ(−k)λ(τ)]∣∣dτ, (4)

where

with . Here, , is the counting process, and is the martingale residual under model (1) (zheng2018high).

Remark 4.

When carrying out quantile regression at each grid point, we formulate it as a linear programming problem

(Koenker2005), which can be solved by a simplex algorithm with a computational complexity of (klee1972good). Since our grid size is and the number of resampling, , is , the computational complexity of our procedure is .

3 Theoretical Studies

3.1 Notation and regularity conditions

For any vector and a subset , denote by its complementary set, and define , the -norm of the sub-vector , in which if and if . We set the following conditions.

1. There exist a quantile and a constant such that

 n−1n∑i=11{logCi≤ZTiβ∗(ν)}(1−Δi)≤cn−1/2

holds for sufficiently large .

2. (Bounded observations) . Without loss of generality, we assume . In addition, .

3. (Bounded densities) Let , , , and . Also, define , and .

1. There exist constants , , and such that

 f––≤infz,τ∈[ν,τU]f(zTβ∗(τ)|z)≤supz,τ∈[ν,τU]f(zTβ∗(τ)|z)≤¯¯¯f, g–≤infz,τ∈[ν,τU]g(zTβ∗(τ)|z)≤supz,τ∈[ν,τU]g(zTβ∗(τ)|z)≤¯¯¯g.
2. There exist constants and such that, when ,

 supz,τ∈[ν,τU]∣∣f(zTβ∗(τ)+t|z)−f(zTβ∗(τ)|z)∣∣≤A|t|, supz,τ∈[ν,τU]∣∣g(zTβ∗(τ)+t|z)−g(zTβ∗(τ)|z)∣∣≤A|t|.
4. (Sparsity) Assume , and let

 Sτ={j:β∗j(τ)≠0},S∗=⋃τ∈[ν,τU]Sτ={j:supτ∈[ν,τU]|β∗j(τ)|>0},andq=|S∗|.

Let be the index set of covariates selected by with a tuning parameter . There exist constants , such that , , and

 P(S∗⊆ˆS)≥1−K2(p∨n)−1−c2.
5. Let . There exists a constant such that and , for all and .

6. (

Bounded eigenvalues

) is bounded below and above by and , respectively, over , where ;
(Nonlinear impact) .

7. is equally gridded with for () and a constant .

Assumption A requires the number of censored observations below the -th quantile not to exceed , which is satisfied if the lower bound of ’s support is greater than the lower bound of ’s support, a reasonable scenario in real applications. As recommended in zheng2018high, is chosen such that only a small proportion of the observed survival times below the -th quantile are censored. B assumes that the covariates are uniformly bounded. As pointed out by zheng2015globally, the global linear quantile regression model is most meaningful when the covariates are confined to a compact set to avoid crossing of the quantile functions. C ensures the positiveness of between and , which is essential for the identifiability of for . D restricts the order of data dimensions, as well as the sparsity of , which is necessary for the convergence of the low dimensional estimator in (2) (Condition C4 in wang2012quantile). D also characterizes the “sure screening” property by . This asymptotic property does not assess the variability of selection with a finite sample; it is crucial to account for such variability for high dimensional inference (fei2019drawing; fei2021estimation). Also, several variable selection methods for high dimensional CQR satisfy the sure screening property in D with additional mild conditions.

• L-HDCQR: by Corollary 4.1 of zheng2018high, a beta-min condition is required in addition to the set of conditions in this paper. Explicitly, there exist constants , such that

 infj∈S∗supτ∈[τL,τU]|β∗j(τ)|>C1exp(C2qτU)√qlog(p∨n)/n+L√qϵn.
• AL-HDCQR: by Corollary 4.2 of zheng2018high, AL-HDCQR achieves the stronger selection consistency property, which implies the sure screening property.

• Quantile-adaptive Screening: by Theorem 3.3 of he2013quantile, with a proper threshold value in their technical conditions, the screening procedure achieves the sure screening property.

E characterizes the smoothness of . The eigenvalue condition in F is the sparse Riesz condition in zhang2008sparsity, satisfied by many commonly used covariance structures, including the compound symmetry structure and the first order autoregressive structure (AR(1)) (zhang2008sparsity). Also, the nonlinear impact condition controls the minoration of the quantile regression objective function by a quadratic function, as adopted in zheng2018high, for establishing the consistency of L-HDCQR estimator. The condition is satisfied when the covariates

have a log-concave density, which includes the commonly used normal distribution, Wishart distribution and Dirichlet distribution

(lovasz2007geometry). G details the fineness of , which renders an adequate approximation to the stochastic integration in (2).

3.2 Theoretical properties of Fused-HDCQR

We first extend the results in peng2008survival from a fixed to a -diverges-but-less-than- case. The results are novel and critical since we allow the true model size to increase with , while the selected ’s in the fused procedure vary around . Specifically, we assume a subset in Theorems 1 and 2, where , and . Let be the estimator from peng2008survival of fitting the CQR with over the -grid .

Theorem 1.

(Consistency with a diverging number of covariates) Under Conditions AG and given a subset such that and , there exist positive constants and such that

 supν≤τ≤τU∥´βS(τ)−β∗(τ)∥≤ζ1exp(ζ2)(K1nc1−1logn)1/2

with probability at least .

Remark 5.

From the proof of this theorem (in particular, the proofs of Propositions 1 and 2 in the Supplementary Materials that lead to this theorem), it can be seen that and do not depend on the choice of or . Thus, and are universal for all possible satisfying and .

Next, we derive the weak convergence of for any .

Theorem 2.

(Weak convergence with a diverging number of covariates) Under Conditions AG and given a such that and , it holds that

 √n(´βj(τ)−β∗j(τ))

converges weakly to a mean zero Gaussian process for and any .

In high dimensional settings, the next theorem shows that the fused estimator enjoys desirable theoretical properties.

Theorem 3.

Consider the Fused-HDCQR estimator in (3). Under assumptions AG, for any ,

 √n(ˆβj(τ)−β∗j(τ))

converges weakly to a mean zero Gaussian process for .

Our framework enables us to obtain the joint distribution of any

-dimensional estimated coefficients, where is a finite number. Let be the collection of the indices of covariates of interest. We can show that the weak convergence result of , a -dimensional subvector of the oracle estimator, still holds for , that is, converges to a

-dimensional Gaussian distribution at any

. We only need to replace by in the proof of Theorem 2 in the Appendix and slightly modify the arguments accordingly. Consequently, the term I in the proof of Theorem 3 still converges weakly to a mean zero Gaussian distribution, while the norms of items II and III are still . Therefore, Theorem 3 still holds for any -dimensional subvector of , i.e., converges to a mean zero -dimensional Gaussian distribution at any .

As shown in the proof, the covariance function of depends on the unknown active set , the unknown conditional density functions and , and other unknown quantities. Thus, it is not calculable. The next section proposes an alternative model-free variance estimator based on the functional delta method and the multi-sample splitting properties (efron2014estimation; fei2021estimation).

4 A Variance Estimator via the Functional Delta Method

Let indicate whether the observation is in the sub-sample , and . For each , we define the re-sampling covariance between and at as

 sij(τk)=1BB∑b=1(Jbi−J⋅i)(˜βbj(τk)−ˆβj(τk)).

Define and let . It follows that the covariance between and can be consistently estimated by

 ˆCovj(τk,τℓ)=n−1n(nn−n1)2n∑i=1sij(τk)sij(τℓ)=n(n−1)(n−n1)2STj(τk)Sj(τℓ),

where the multiplier is a finite-sample correction for sub-sampling (wager2018estimation). In particular, by taking , a variance estimator for is

 ˆVj(τk)=n(n−1)(n−n1)2STj(τk)Sj(τk). (5)

As in wager2018estimation, it follows that with . Furthermore, for a finite , we propose a bias corrected version of (5):

 ˆVBj(τk)=ˆVj(τk)−nn1B(n−n1){B−1B∑b=1(˜βbj(τk)−ˆβj(τk))2},τk∈Γm. (6)

The correction term in (6) is a suitable multiplier of the re-sampling variance of ’s, and converges to zero with and . Thus, the two variance estimators in (5) and (6) are asymptotically equal. However, in (5) requires to be of order to reduce the Monte Carlo noise below the sampling noise, while in (6) only requires to be of order to achieve the same (wager2014confidence).

Since converges weakly to a Gaussian process by Theorem 3, and our variance estimators are consistent on the grid points, we define an asymptotic

point-wise confidence interval for

at any as

 (ˆβj(τk)−Φ−1(1−α/2)√ˆVBj(τk),ˆβj(τk)+Φ−1(1−α/2)√ˆVBj(τk)),

where is as defined in (6), and

is the standard normal cumulative distribution function. The

-value of testing for a is

 2×{1−Φ(|ˆβj(τk)|/√ˆVBj(τk))}.

5 Simulation Studies

In various settings, we compare the proposed method, Fused-HDCQR (referred to as “Fused” in the tables and figures hereafter), with some competing methods in quantile regression or high dimensional inference. These methods include wang2012quantile (“W12”) and fan2014adaptive (“F14”) for quantile regression; zheng2018high (“Z18”) for censored quantile regression; and meinshausen2009p (“M09”) for inference with aggregated -values from multi-sample splittings.

In the simulations and the later data analysis, we choose L-HDCQR described in Section 3 as the variable selection tool for Fused-HDCQR. We also explore the feasibility of using other alternatives for variable selection, such as fan2009ultrahigh (“F09”) and M09.

When implementing Fused-HDCQR, we specify the number of splits as , the quantile interval as , and the grid length as . As regards the selection of tuning parameters, Theorems 1 and 2 suggest that our procedure not be sensitive to tuning parameters as long as they can ensure sure screening. In practical settings, we recommend to select tuning parameters by minimizing the 5-fold cross-validation error as in (4), which may help achieve sure screening and works well in our simulations. We study the following examples with sparse non-zero effects, some of which are heterogeneous.

Example 1. The event times are generated by

 logTi=˜ZTib+εi,i=1,…,n,

where the coefficient vector is sparse with , for all other ’s, and . Therefore, the true coefficients satisfy for all , where , the -th quantile of the distribution of , is the intercept. The covariates ’s are i.i.d. from and are independent across . The censoring times are generated independently as , giving a censoring rate around .

Example 2. The event times follow

 logTi=˜ZTib+1.5˜Z3,iεi, (7)

where , for all other ’s, and . We first generate with an AR(1) , where , and then let , except that the third covariate . Thus, , and , for all other ’s. The censoring times are generated independently as , giving a censoring rate around .

Example 3. The event times follow

 logTi=˜ZTib+ϕ1(ξi)˜Z1,i+ϕ4(ξi)˜Z4,i,

where , for all other ’s, , and are monotone functions as the dashed lines in Figure 1, both are continuous with zero and non-zero pieces over . We first generate as in Example 2, and then let , except that and . Therefore, , and , for all other ’s. The censoring times are generated independently as , which gives a censoring rate around .

For each of these examples, we set = and to study the impacts of the sample size and the number of variables on the performance, and, in particular, how the methods fare when . In Example 3, which mimics the real data example in Section 6 most closely, we have also explored , which is roughly equal to the dimension of the real dataset. For every parameter configuration, a total of 100 independent datasets are generated, and we report the average results based on these replications. We choose 100 replications because the penalized methods for high dimensional CQR in general take much computing time (Table 5).

We first evaluate the feasibility of using various variable selection tools for our proposed method. Comparisons of true positives and false negatives among F09, M09, and L-HDCQR under Examples 1–3 are reported in Table 1. F09 presents a subpar performance because, by taking intersections of variables selected from different partitions of data, it tends to miss out some true signals and thus have fewer true positives. In contrast, L-HDCQR retains more true positives than both F09 and M09, while having more false positives. Because our method requires the variable selection step to include the true signals with high probability, even at the cost of some false positives, we use L-HDCQR as the screening tool for our method.

We next compare the performance of Fused-HDCQR with other high dimensional quantile regression methods at under Example 1. As a benchmark for comparisons, we also compute the oracle estimates based on the true model (with

known). Since W12, F14, and Z18 only provide coefficient estimates without standard errors (SEs), we only report the estimation biases for them, while reporting the average SEs, empirical standard deviations (SDs) and coverage probabilities of the confidence intervals for our method. Table

2 shows that Fused-HDCQR presents the smallest biases, which are comparable to those of the oracle estimates. In contrast, Z18 has smaller biases when the sample size is large, and larger biases otherwise, while W12 and F14 incur substantial biases since they are not designed for censored data. Moreover, the average SEs based on Fused-HDCQR agree with the empirical SDs of the estimates. The consistent estimates of coefficients and SEs obtained by Fused-HDCQR lead to proper coverage probabilities around the nominal level. In addition, the coverage probabilities improve as increases.

Table 2 also concerns the power for detection of signals. Since W12, F14, and Z18 cannot draw inference and, in general, there is a lack of literature that deals with inference for HDCQR, we compare our method with the aggregated -value approach (M09) in the quantile setting, though M09 originated from linear regression. The results indicate that Fused-HDCQR outperforms M09, presenting more power when the effect size is moderate or large.

Table 3 summarizes the results from Example 2 with the heterogeneous effect varying with . We compare the estimation accuracy between Fused-HDCQR and Z18, as well as the statistical power between Fused-HDCQR and M09. Again, Fused-HDCQR presents smaller biases than Z18 and a higher power than M09. To assess whether the tuning parameters selected as in Remark 3 help the variable selection method (L-HDCQR), used by Fused-HDCQR, satisfy assumption D in Section 3, we report the selection frequency of each signal variable in Table 3 (and also in Table 4), and observe that the selection frequency increases as the sample size increases, hinting that assumption D may be satisfied with these selected tuning parameters.

Table 4 summarizes the results based on Example 3. For the two heterogeneous effects and that vary with , their estimation biases of Fused-HDCQR become smaller and the estimated SEs are closer to the empirical ones as increases. Figure 1 shows that the Fused-HDCQR estimates in general agree with the oracle estimates and the truth, except at the non-smooth change points, and have narrower confidence intervals with a larger , where the vertical bars are the average confidence intervals of the grid.

In regards to the choice of in the variance computation, our numerical experience suggests that it may be sufficient to use a that is of the same order of the sample size, even when is less than . This coincides with the note under (6) that is only required to be of order to reduce the Monte Carlo noise below the sampling noise.

Finally, we compare the computation intensity among Z18, M09, W12, F14, and Fused-HDCQR under Example 1 and report in Table 5 the computing time on average per dataset. Our method is the most computationally intensive, because it involves multiple data-splittings and draws inferences on all of the coefficients. However, by utilizing parallel computing, we have managed to reduce the computational time to the same order of Z18, W12, and F14 that are based on penalized regression. The R code used for generating the simulation results can be accessed via https://github.com/feizhe/HDCQR_Paper.

6 Application to the Boston Lung Cancer Survival Cohort (BLCSC)

Detection of molecular profiles related to cancer survival can aid personalized treatment in prolonging patients’ survival and improving their quality of life. In a subset of BLCSC samples, 674 lung cancer patients were measured with survival times, along with SNPs and clinical indicators, such as lung cancer subtypes (adenocarcinoma, squamous cell carcinoma, or others), cancer stages (1-4), age, gender, education level ( high school or high school) and smoking status (active or non-active smokers); see Table 6 for patients’ characteristics. The censoring rate was 23% and a total of 518 deaths were observed during the followup period, with the observed followup time varying from 13 to days.

We could have included all 40,000 SNPs in our analysis. However, for more statistical power, we opt for the targeted gene approach by focusing on 2,002 SNPs residing in 14 genes identified to be cancer related, namely, ALK, BRAF, BRCA1, EGFR, ERBB2, ERCC1, KRAS, MET, PIK3CA, RET, ROS1, RRM1, TP53, and TYMS (brose2002braf; toyooka2003tp53; paez2004egfr; soda2007identification). Pinpointing the effects of individual loci within the targeted genes is helpful for understanding disease mechanisms (evans2011interaction; d2019systematic) and designing gene therapies (paques2007meganucleases; hanawa2004extended). We also adjust for patients’ clinical and environmental characteristics listed in Table 6, which gives a total of predictors.

We apply Fused-HDCQR to compute the point estimates (3) and the variance estimates (6). We set the quantile interval to be , which is wide enough to cover high- and low-risk groups and, in the meantime, ensures the quantile parameters be estimable in the presence of censoring (zheng2015globally). We choose the lower bound to circumvent the singularity problem with CQR at , because few () observations are censored below the -th quantile. With , we form the -grid of length . We set as the number of re-samples, which is sufficiently large and comparable to the sample size. To determine the tuning parameter in L-HDCQR for selection, we use 5-fold cross-validation as specified in Remark 3.

For ease of presentation, we summarize the results evaluated at quantile levels, , instead of the whole grid . To highlight the findings of the high-risk group, we rank all SNPs based on their -values at . In particular, after Bonferroni correction for multiple testing, there are significant SNPs for

with the overall type I error of

. Our method estimates the coefficients and the -values for all predictors, and we only present the results for the patient characteristics, the top significant SNPs, and the least significant SNPs in Figure 2 and Table 7. The estimated coefficient of active smoking drops from () to () as changes from to , and then increases to () as changes to , suggesting that active smoking might be more harmful to the high- and median-risk groups than the low-risk group of patients. The most significant SNP at is AX.37793583_T, which remains significant throughout to . However, its estimated coefficient decreases from () to (), indicating its heterogeneous impacts on survival, i.e., stronger protective effect at lower quantiles and vice versa.

The effects of some SNPs are nearly zero for higher quantiles. For example, the estimated coefficient of AX.15207405_G decreases from (; ) to (; ), with the estimated standard error increasing from to . Similarly, the estimated coefficient of AX.40182999_A decreases from (; ) to (; ). The results again hint at heterogeneous SNP effects in various risk groups, which cannot be detected using traditional Cox models.

Finally, our results shed light on the roles of SNPs in the high-risk group (i.e., lower quantiles). Specifically, we map the SNPs with significant effects at the -th quantile by Fused-HDCQR to the corresponding genes and rank the genes by the number of significant SNPs (over total number of SNPs for each gene in the parenthesis), which are TP53 (14/321), RRM1 (14/174), ERCC1 (10/167), BRCA1 (10/114), ALK (8/163), ROS1 (5/294), EGFR (5/261), ERBB2 (4/167), and 6 other genes with numbers of significant SNPs less than 4. While these genes were reported to be associated with lung cancer (toyooka2003tp53; takeuchi2012ret; rosell2011pretreatment; lord2002low; zheng2007dna; sasaki2006egfr; brose2002braf), our analysis provides more detailed information as to which SNPs and locations of the genes are jointly associated with the lung cancer survival, as well as the estimated effects and uncertainties. Analysis of heterogeneous SNP effects has been gaining increasing research attention in lung cancer research (mckay2017large; dong2012potentially; huang2009genome), and beyond it (garcia2008heterogeneity; cheng2010prostate; gulati2014systematic).

7 Conclusions

Our proposed procedure involves repeated estimates from low dimensional CQRs, which are computationally straightforward and can be efficiently implemented with parallel computing. We require the variable selection to possess a sure screening property as in condition D. This seems to be supported by our simulations, which find our procedure works well when the variable selection method can select a superset of the true model with high probability. Our condition is much weaker than a condition of selection consistency as specified in fei2019drawing.

For the selection of , we recommend to be in the same order of the sample size . Smaller might not affect coefficient estimation much; but it might yield inaccurate estimated standard errors, leading to incorrect inferences. In addition, we opt to define by setting the grid as equally spaced points between and . This may cover the quantile interval well, with reasonable computation efficiency.

There are open questions to be addressed. First, substantial work is needed for handling highly correlated predictors as the performance of our method, like the other competing methods, deteriorates when correlations among predictors become stronger. Second, it is of interest to investigate an alternative method when the sparsity condition fails. For example, it is challenging to find an effective strategy to draw inference when a non-negligible portion of predictors have small but non-zero effects. We will pursue them elsewhere.

Acknowledgements

We are deeply grateful toward the Editor, the AE and the two referees for their constructive comments and suggestions that have improved the manuscript substantially. We thank our long-term collaborator, David Christiani of Harvard Medical School, for providing the BLCSC data. The work is partially supported by grants from NIH (5R01CA249096 and U01CA209414).