# A pliable lasso for the Cox model

We introduce a pliable lasso method for estimation of interaction effects in the Cox proportional hazards model framework. The pliable lasso is a linear model that includes interactions between covariates X and a set of modifying variables Z and assumes sparsity of the main effects and interaction effects. The hierarchical penalty excludes interaction effects when the corresponding main effects are zero: this avoids overfitting and an explosion of model complexity. We extend this method to the Cox model for survival data, incorporating modifiers that are either fixed or varying in time into the partial likelihood. For example, this allows modeling of survival times that differ based on interactions of genes with age, gender, or other demographic information. The optimization is done by blockwise coordinate descent on a second order approximation of the objective.

## Authors

• 2 publications
• 4 publications
12/01/2017

### A Pliable Lasso

We propose a generalization of the lasso that allows the model coefficie...
03/31/2020

### A flexible adaptive lasso Cox frailty model based on the full likelihood

In this work a method to regularize Cox frailty models is proposed that ...
06/26/2021

### Using relative weight analysis with residualization to detect relevant nonlinear interaction effects in ordinary and logistic regressions

Relative weight analysis is a classic tool for detecting whether one var...
07/20/2021

### Study of the Parent-of-origin effect in monogenic diseases with variable age of onset. Application on ATTRv

In genetic diseases with variable age of onset, an accurate estimation o...
03/28/2020

### A Hierarchical Integrative Group LASSO (HiGLASSO) Framework for Analyzing Environmental Mixtures

Environmental health studies are increasingly measuring multiple polluta...
12/19/2017

### High dimensional Single Index Bayesian Modeling of the Brain Atrophy over time

We study the effects of gender, APOE genes, age, genetic variation and A...
03/13/2018

### Regularized hazard estimation for age-period-cohort analysis

In epidemiological or demographic studies, with variable age at onset, a...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider an approach to estimating high dimensional covariate and interaction effects in the survival analysis framework. For example, a common case occurs with patient survival time data, which may include genetic information and demographics. We might use a model that includes interactions between these two components. The assumption is that the effect of gene expression may differ depending on variables like the age or gender of the individual. The problem is that the set of genes and thus the set of possible interactions may be large. We propose to use a pliable lasso penalty to incorporate these interactions in a hierarchical way. This allows for interactions among the genes that have a main effect on survival. Another example is if the relationship between genes and survival changes over time. Using the pliable lasso, we can estimate possible gene interactions with time and time dependent variables as well.

The pliable lasso was introduced in the Gaussian setting in [tibshirani2017]. In this setting, suppose the data contains high dimensional covariates , modifying variables , and outcomes . The regression model including interactions between each and each is

 yi=β0−ziθ0−p∑k=1xik(βk+ziθk)+ϵi

where is a noise term assumed to be Gaussian. Since is high dimensional, regularized least squares regression can be used to estimate main effects and interaction effects . We can add a lasso penalty on or on and . The problem is that the first method ignores interactions and the second might not find main effects due to the large number of interaction terms. Under the assumption that the interactions are hierarchical, [tibshirani2017] instead proposed adding the pliable lasso penalty to get the objective

 M(β0,θ0,β,θ)=12nn∑i=1(yi−β0−ziθ0−p∑k=1xik(βk+ziθk))2
 +λ((1−α)p∑k=1(∥(βk,θk)∥2+∥θk∥2)+αp∑k=1∥θk∥1)

We optimize with respect to and , where and are intercept and main effect terms for modifiers,

is the vector of

terms and is the vector of concatenated vectors. This penalty term includes an overlapping group lasso penalty . If is nonzero in the solution, is also nonzero, but can be zero when is nonzero. This imposes a hierarchy on the main effects and corresponding interactions.

In the survival analysis setting, consider data with high dimensional covariates , the time observed , and whether the sample is censored . We are interested in estimating the effects of each variable on failure time. The likelihood is

 n∏i=1f(yi|xi)δiS(yi|xi)1−δi

where is the density and the survival function. In order to estimate the effects of interest, a parametrization can be used. One often used model is Cox’s proportional hazards model [cox], which defines the hazard function to be . Each individual has the same baseline hazard with a proportionality constant determined by their characteristics. The limitation of this form is that the effect of a variable on survival is fixed across time. The trend over time is determined only be and is shared by all individuals. The parameter of interest is , which gives the effect of each covariate. Under this model, we maximize the partial likelihood

 L(β)=m∏i=1exj(i)β∑j∈Riexjβ

where is the index of the observation corresponding to time and is the risk set at time .

One drawback of the model is that with high dimensional covariates the solution for may be degenerate. Adding a lasso penalty is proposed in [tibshirani1996]

for linear regression and extended for Cox’s model in

[tibshirani1997]. A more general elastic net penalty

 λPα(β)=λ(αp∑k=1|βk|+(1/2)(1−α)p∑k=1β2k)

can also be used. Since the data is high dimensional, these problems are computationally difficult, and [simon] proposes a fast coordinate descent algorithm.

In this paper, we consider a survival model including high dimensional covariates and their interactions with known modifying variables . We incorporates these interaction effects in a hierarchical way following [tibshirani2017] using the pliable lasso. We extend the blockwise coordinate descent algorithm and screening conditions in the paper to the proportional hazards model by approximating the partial log likelihood problem as a weighted least squares objective. We also propose an application to the non proportional case, where time varying variables can be chosen as modifiers.

## 2 Cox’s proportional hazards model

We start with the Cox proportional hazard function , in which the hazard function for the individuals are proportional to each other with covariates determining the level. Supposing that the data does not have ties or weights, we include modifying variables and interactions . The Cox proportional hazard function with modifying variables is then

 hi(t)=h0(t)eztiθ0+xtiβ+wtiθ

and the partial likelihood and log likelihood are

 L(θ0,β,θ)=m∏i=1ezj(i)tθ0+∑pk=1xj(i)kβk+wtj(i)kθk∑j∈Rieztjθ0+∑pk=1xjkβk+wtjkθk
 l(θ0,β,θ)=m∑i=1zj(i)tθ0+p∑k=1xj(i)kβk+wtj(i)kθk−log(∑j∈Rieztjθ0+∑pk=1xjkβk+wtjkθk)

Since including interactions increases the dimension of parameters, the solution to the log likelihood maximization may not be well defined, particularly in the case of high dimensional covariates . We propose to use the pliable lasso penalty, and get the objective

 J(θ0,β,θ)=m∑i=1zj(i)tθ0+p∑k=1xj(i)kβk+wtj(i)kθk−log(∑j∈Rieztjθ0+∑pk=1xjkβk+wtjkθk)
 +λ((1−α)p∑k=1(∥(βk,θk)∥2+∥θk∥2)+αp∑k=1∥θk∥1)

The advantage over the lasso penalty is that this allows hierarchical inclusion of interactions with depending on the inclusion of in the model. The assumption is that variables that do not enter the model also do not have interactions but variables that do enter the model may or may not have interactions.

A coordinate descent algorithm for the least squares optimization with pliable lasso penalty is given in [tibshirani2017]. We follow the same optimization steps, but at each iteration of the algorithm, we approximate the Cox log likelihood using a quadratic expansion to obtain the weighted least squares problem. Following [simon], we use a Taylor expansion at the current parameter values . Setting , we have

 l(θ0,β,θ)≈l(~θ0,~β,~θ)
 +((θ0,β,θ)−(~θ0,~β,~θ))t˙l(~θ0,~β,~θ)+((θ0,β,θ)−(~θ0,~β,~θ))t¨l(~θ0,~β,~θ)((θ0,β,θ)−(~θ0,~β,~θ))/2
 ≈l(~θ0,~β,~θ)+(Zθ0+Xβ+Wθ−~η)tl′(~η)+(Zθ0+Xβ+Wθ−~η)tl′′(~η)(Zθ0+Xβ+Wθ−~η)/2
 =(Zθ0+Xβ+Wθ−~η+l′′(~η)−1l′(~η))tl′′(~η)(Zθ0+Xβ+Wθ−~η+l′′(~η)−1l′(~η))/2+C(~θ0,~β,~θ)

Defining to be indicator of failure and to be the set of times where is at risk for a failure index, the derivatives of the partial log likelihood are

 l′(~η)j=δj−∑i∈Cjeηj∑k∈Rieηk
 l′′(~η)jj=−∑i∈Cj(∑k∈Rieηk)eηj−e2ηj(∑k∈Rieηk)2

The log likelihood in the objective may be approximated using only the diagonal elements of . Adding the pliable lasso penalty to the approximation gives

 M(θ0,β,θ)=12nn∑j=1−l′′(~η)jj(~ηj−l′(~η)j/l′′(~η)jj−ztjθ0−p∑k=1xjkβk−wtjkθk)2
 +(1−α)λp∑k=1∥(βk,θk)∥2+∥θk∥2+αλp∑k=1∥θk∥1

Optimizing for , , and is a weighted least squares maximization problem with weights and adjusted residuals . The problem may be solved using the Gaussian pliable lasso algorithm [tibshirani2017]. The advantage of this algorithm is that we evaluate conditions under which each and are zero based on the adjusted residuals before proceeding to the gradient descent step. This saves computation for sparse models, where many of the effects are zero. The details are given in the appendix.

In the general case, survival data may include ties for times of failure or weights for observations. We use the Breslow approximation [breslow], which assumes the same denominator for each tied sample. We weight each observation accordingly. Suppose the weights are given by , are the indices that fail for time and is the sum of the weights for these indices. The partial likelihood and log likelihood are

 L(β,θ)=m∏i=1∏j∈Di(ωjeztjθ0+∑pk=1xjkβk+wtjkθk)ωj(∑j∈Riωjeztjθ0+∑pk=1xjkβk+wtjkθk)di
 l(β,θ)=m∑i=1∑j∈Diωj(ztjθ0+p∑k=1xjkβk+wtjkθk)−dilog(∑j∈Riwjeztjθ0+∑pk=1xjkβk+wtjkθk)

The derivatives are

 l′(~η)j=ωjδj−∑i∈Cjdiωjeηj∑k∈Riωkeηk
 l′′(~η)jj=−∑i∈Cjdi(∑k∈Riωkeηk)ωjeηj−ω2je2ηj(∑k∈Riωkeηk)2

This is a generalization, and taking the weights to be one and failure times to be distinct returns the same log likelihood and derivatives as above.

For most applications, we are interested in determining a path of solutions corresponding to a sequence of values. Among these, cross validation can be used to select a model. In this case, the maximum of interest is when all of the coefficients are zero. We approximate this value by calculating the at which all are zero, ignoring interactions. This is the same calculation commonly used for lasso paths and is given by

 λmax=maxk|Xtk(−diag(l′′(~η)))(~η−l′(~η)/diag(l′′(~η))−Zθ0)/n|≤(1−α)λ

To allow faster convergence at small values, where the solution may not be sparse, warm starts are used.

Using the path of solutions, we would like to find the optimal value. Typically, -fold cross validation is used, where validation on each of the

folds is based on the log likelihood. However, for the Cox model, since the partial log likelihood depends on the risk set sample, the value calculated using the validation fold may have few observations leading to a high variance estimate. For the

th fold, we instead use the goodness of fit estimate from [simon] given by

 l(^θ0,^β,^θ)−l−i(^θ0,^β,^θ)

where is the log likelihood on the rest of the folds not including and , , and are the estimated coefficient values for each . These estimates are then averaged over each of the folds and the minimizing is chosen as optimal.

## 3 Extension to the non-proportional hazards setting

The assumption that the hazard functions for the individuals are proportional may be unrealistic. The effect of a subset of the modifiers or covariates may be different across time, so that the hazard functions for differing values are not proportional. A common generalization that accounts for interaction effects is to stratify based on the subset of non proportional variables, assuming it is known. For individual in strata , the hazard function is

 hi(t)=hg0(t)eztiθ0+xtiβ+wtiθ

where and now refers to the remaining covariates. This allows the same effects , , and to be estimated across strata but a different baseline hazard for each. To optimize, the stratified partial likelihood can be written with separate products for samples in each strata . Including a pliable lasso penalty, it is possible to generalize the estimation strategy above to this case.

We note that the advantage of stratification is that it allows for an interactive effect between time and the specified variables. This suggests that pliable lasso may be used to select these variables if we treat time as a modifier in our model. We consider the case when the subset of covariates with non proportional hazard functions are unknown. We assume that the baseline hazard function depends on covariates through interactions between and some collection of functions . The hazard function is

 hi(t)=(h0(t)exiG(t))tθ′k)eztiθ0+xtiβ+wtiθ.

We can approximate the interaction with time by taking to be an evaluated spline bases. Then, using a pliable lasso penalty with extended parameters instead of , we are able to also estimate and determine the non proportional covariates. Combining this approach with the above stratification gives a more general model reflecting both known non proportional modifiers and unknown covariates. In general, instead of a spline bases, the function can be any time dependent modifiers of interest. The estimation approach for this extension is given in the appendix.

To simplify the computation, we instead propose to use a logistic model to approximate the likelihood111This approximation appears to be known to the statistical community, but was new to us. We thank Terry Therneau for very helpful discussions about this issue. . For each failure time , we create a covariate matrix by concatenating with for each in the risk set. We form the outcome vector by concatenating indicating failure for observation and for the other observations. The matrix for the possibly time dependent modifying variables is formed by concatenating the values of with for each in the risk set. Then, we stack the problems for each and use the binomial pliable lasso algorithm with separate intercepts for each time.

To justify this procedure, we show that the logistic model log likelihood can approximate the Cox log likelihood. The Cox log likelihood for observation is

 ηi−log(∑j∈Rieηj)

The logistic model log likelihood contribution for observation is

 β0+ηi−∑j∈Rilog(1+eβ0eηj)

Then, assuming , solves the gradient condition and . This reduces to the Cox log likelihood contribution above. The main assumption of this approximation is that the relative risk is near zero. This holds if the risk set at each failure time is large, which happens when the data set is large and there are few failures.

## 4 Simulations

We generate and with sample size , number of x variables , and number of z variables . We use a constant baseline hazard , so that

 hi(t)=eztiθ0+xtiβ+wtiθ

Define the linear term in the hazard as

 λ=x1−x2+x3+x4+x1z1−x1z2−2x2z3+2x2z4

For this specification, failure times follow an exponential distribution. The model is

 y∼e−λExp(1)

and censoring times are generated by

 c∼Exp(1).

All nonzero interactions are on variables with main effects. Since the interactions are hierarchical, pliable lasso has an advantage. The results for an alternative model that is not hierarchical are shown in the appendix.

The methods we compare are pliable lasso, lasso with no interactions, and lasso with all interactions using standardized covariates. The lasso versions are optimized with glmnet [simon]. The pliable lasso uses a fixed , but we obtain similar results using a range of values. The cross validated model is chosen for each using the value that minimizes the goodness of fit criteria listed above. The

value using the one standard deviation rule is also tested, but the performance in terms of test negative log likelihood is worse for all methods.

The tables below include the negative log likelihood values evaluated using a test set of size . The value at the true coefficients are also shown. In terms of this measure, pliable lasso performs better than both lasso methods for both values of . For small , pliable lasso and the full lasso give comparable results. The total number of variables including interactions is in this simulation, which is small enough for the full lasso to perform well. However, for large , pliable lasso outperforms the full lasso, because the total number of variables is , but performs similarly to the main effect lasso. Both methods that include interactions do not select interaction effects well. The hierarchical pliable lasso penalty enforces that the interactions are nonzero only when the main effect is nonzero. Even though the model does relatively well on selecting main effects, the false positives and negatives increase the errors in selecting interaction effects. On the other hand, without the hierarchical structure of the penalty, the lasso with interactions misses more main effects. To have better selection of interaction effects, we may use a smaller model than the cross validated one.

We next consider a baseline hazard that includes interactions on covariates. Using the interaction model in the previous section, this introduces time dependent modifiers. We generate and do not include any fixed modifiers for simplicity. The number of x variables is , but we increase the sample size to . This is chosen to compensate for the increased complexity of having modifiers that change with time. The hazard we use is

 hi(t)=e(xit)tθ′extiβ

Define the linear terms as

 λ=x1−x2+x3+x4

and the time dependent linear terms as

 k=5x1+5x2

Failure times in this case follow the same distribution as the logarithm of an exponential variable. The model is

 y∼log(Exp(1)eλk+1)/k

and censoring times are generated by

 c∼Exp(1).

All time dependency is on variables with main effects, giving pliable lasso an advantage. The results for an alternative model that is not hierarchical are shown in a later section.

We compare the pliable lasso to the lasso with no interactions and regular Cox estimation with time dependent variables. For the pliable lasso, we use five linear spline terms, standardize all covariates including the splines, and sample risks sets of size five. We present the results for pliable lasso using a fixed . With larger , pliable lasso outperforms the other methods in terms of log likelihood and main effect determination but does not find time effects. Cox estimation with a linear time interaction using the coxph R library is included as a comparison since glmnet does not allow possible time dependencies. Cross validation with the goodness of fit criteria is used to choose the model for lasso and pliable lasso.

The tables below include the same measures as above, where the negative log likelihood is calculated with a test set of size . For pliable lasso, the false positives and false negatives for are calculated based on selection of any of the spline terms corresponding to a given covariate interaction. The log likelihood values are comparable, but the pliable lasso outperforms the lasso in selecting main effects. It also does well in selecting interaction effects, but since Cox estimation does not set effects to zero, we cannot compare.

## 5 NKI data example

We test the Cox pliable lasso on the breast cancer NKI data from [vantveer][vandevijver]. The data contains survival information and gene expression levels of 24,481 genes for 319 individuals. We consider distant metastasis free survival as the outcome and threshold on the variance of the genes to obtain 134 total covariates. The lasso and pliable lasso estimators are compared, where the pliable lasso includes a linear time interaction. The results shown below are from left out test data.

We use the logistic approximation to estimate the survival model. The below plot shows the AUC for the test predictions from logistic lasso and logistic pliable lasso. We see that the pliable lasso outperforms the lasso in terms of classifying survival events.

The plot below shows that the Cox partial log likelihood for the test data is also larger for the pliable lasso estimates than for the lasso ones either using the approximate logistic model or the Cox model. Adding the time varying component improves on estimates of the survival model as well as the classification.

## 6 Conclusion

We have extended the pliable lasso approach to Cox’s model for survival data. This allows for estimation of main effects for covariates and interaction effects between covariates and modifiers that may be fixed in time or time varying. The estimation uses a blockwise coordinate descent algorithm that optimizes a weighted least squares problem with the penalty term at each step. To simplify computations, particularly for the time varying case, we propose to use a logistic model to approximate the partial likelihood. The implementation will be made available in R as an option in the plasso package.

## Appendix: Computational details for the non proportional hazards model

The problem with the time dependent form of the hazard function is that the likelihood terms for the observations in the risk set in the denominator change depending on the time . For simplicity, we write and and omit the variable terms. The logic for calculating main effects and interactions on follows the same pattern. We write the hazard function as

 h(t)=h0(t)exjβ+wjtθ

and the partial likelihood and log likelihood as

 L(β,θ)=m∏i=1e∑pk=1xj(i)kβk+wj(i)ikθk∑j∈Rie∑pk=1xjkβk+wjikθk
 l(β,θ)=m∑i=1p∑k=1xjkβk+wjikθk−log(∑j∈Rie∑pk=1xjkβk+wjikθk)

Due to the difference in the calculation of the risk set depending on , the likelihood now depends on and for . Labeling the corresponding matrices, we write and . The Taylor expansion is similar to the above but the derivatives are in terms of . We have

 l(β,θ)≈l(~β,~θ)+((β,θ)−(~β,~θ))t(˙l(~β,~θ))+((β,θ)−(~β,~θ))t¨l(~β,~θ)((β,θ)−(~β,~θ))/2
 ≈l(~β,~θ)+((X0β+W0θ,X1β+W1θ)−~η)tl′~η(~η)+((X0β+W0θ,X1β+W1θ)−~η)tl′′~η(~η)
 ×((X0β+W0θ,X1β+W1θ)−~η)/2
 =((X0β+W0θ,X1β+W1θ)−~η+l′′(~η)−1l′(~η))tl′′(~η)
 ×((X0β+W0θ,X1β+W1θ)−~η+l′′(~η)−1l′(~η))/2+C(~β,~θ)

The derivatives are

 l′~η0(~η)j=δj−eηj+ηji∑k∈Rieηk+ηki
 l′′~η0(~η)jj=−(∑k∈Rieηk+ηki)eηj+ηji−e2(ηj+ηji)(∑k∈Rieηk+ηki)2
 l′~η1(~η)ji′=−eηj+ηji′∑k∈Ri′eηk+ηki
 l′′~η1(~η)ji′ji′=−(∑k∈R′ieηk+ηki′)eηj+ηji′−e2(ηj+ηji′)(∑k∈R′ieηk+ηki′)2

We again only using the diagonal elements of in the Taylor approximation. Adding the pliable lasso penalty gives

 M(β,θ)=12nn∑j=1−l′′~η0(~η)jj(~η0j−l′~η0(~η)j/l′′~η0(~η)0jj−p∑k=1xjkβk−wjikθk)2
 +12n∑ji′−l′′~η1(~η)ji′ji′(~η1ji′−l′~η1(~η)ji′/l′′~η1(~η)ji′ji′−xjkβ−wji′kθk)2
 +(1−α)λp∑k=1∥(βk,θk)∥2+∥θk∥2+αλp∑k=1∥θk∥1

This is another weighted least squares plus pliable lasso penalty problem that may be solved using the Gaussian pliable lasso algorithm. The weights are and the adjusted residuals are .

Since we now calculate separate denominators for each time, the computation time is increased by an order of . To reduce it, we sample points in the denominator when computing the log likelihood and its derivatives. This sample risk set procedure is proposed by [liddell] and asymptotically justified by [goldstein]. In practice, we use the logistic approximation described above, which is much faster and gives similar results.

## Optimization details for proportional and non proportional hazards model

Following [tibshirani2017], we solve the weighted least squares problem by checking zero conditions for each group corresponding to and its interactions. For the proportional hazard case, the derivatives of the objective with the pliable lasso penalty are

 dM(θ0,β,θ)dθ0=1nn∑j=1(−zj)(−l′′(~η)jj)(~ηj−l′(~η)j/l′′(~η)jj−ztjθ0−p∑k=1xjkβk−wtjkθk)=0
 ^θ0=(n∑j=1zj(−l′′(~η)jj)zj)−1(n∑j=1zj(−l′′(~η)jj)(~ηj−l′(~η)j/l′′(~η)jj−p∑k=1xjkβk−wtjkθk))
 dM(θ0,β,θ)dβk=1nn∑j=1(−xjk)(−l′′(~η)jj)(~ηj−l′(~η)j/l′′(~η)jj−ztjθ0−p∑k=1xjkβk−wtjkθk)
 +(1−α)λu
 =−Xtkr/n+(1−α)λu=0
 dM(θ0,β,θ)dθk=1nn∑j=1(−wjk)(−l′′(~η)jj)(~ηj−l′(~η)j/l′′(~η)jj−ztjθ0−p∑k=1xjkβk−wtjkθk)
 +(1−α)λ(u2+u3)+αλv
 =−Wtkr/n+(1−α)λ(u2+u3)+αλv=0
 (βk,θk)≠0,u=βk∥(βk,θk)∥2,u2=θk∥(βk,θk)∥2
 (βk,θk)=0,∥(u,u2)∥2≤1
 θk≠0,u3=θk∥θk∥2
 θk=0,∥u3∥2≤1
 v=sign(θj)

The conditions for are

 |Xtkr(−k)/n|≤(1−α)λ
 ∥∥S(Wtkr(−k)/n,αλ)∥∥2≤2(1−α)λ

and the conditions for are

 ^βk=(n/n∑j=1(−l′′(~η)jj)(xjk)2)S(Xtkr(−k)/n,(1−α)λ)
 ∥∥S(Wtk(r(−k)−diag(l′′(~η))Xk^βk)/n,αλ)∥∥2≤(1−α)λ

When , taking , we optimize

 M(γk)=−l(γ