Machine Learning for Survival Analysis: A Survey

08/15/2017 ∙ by Ping Wang, et al. ∙ University of Michigan ∙ Virginia Polytechnic Institute and State University

Accurately predicting the time of occurrence of an event of interest is a critical problem in longitudinal data analysis. One of the main challenges in this context is the presence of instances whose event outcomes become unobservable after a certain time point, or instances that do not experience any event during the monitoring period. Such a phenomenon is called censoring, and it can be effectively handled using survival analysis techniques. Traditionally, statistical approaches have been widely developed in the literature to overcome this censoring issue. In addition, many machine learning algorithms have been adapted to effectively handle survival data and tackle other challenging problems that arise in real-world data. In this survey, we provide a comprehensive and structured review of the representative statistical methods along with the machine learning techniques used in survival analysis, and we provide a detailed taxonomy of the existing methods. We also discuss several topics that are closely related to survival analysis and illustrate several successful applications in various real-world application domains. We hope that this paper will provide a more thorough understanding of the recent advances in survival analysis and offer some guidelines for applying these approaches to solve new problems that arise in applications with censored data.


1 Introduction

Due to the development of various data acquisition and big data technologies, the ability to collect a wide variety of data and to monitor observations over long periods of time has been attained in different disciplines. For most real-world applications, the primary objective of monitoring these observations is to obtain a better estimate of the time of occurrence of a particular event of interest. One of the main challenges with such time-to-event data is that it usually contains censored instances, i.e., instances for which the event of interest is not observed, either because of the time limitation of the study period or because the instance was lost track of during the observation period. More precisely, certain instances have experienced the event (and are labeled as event instances), while for the remaining instances the information about the outcome variable is only available up to a specific time point in the study. Therefore, it is not suitable to directly apply standard statistical and machine learning predictive algorithms to analyze survival data. Survival analysis, an important subfield of statistics, provides various mechanisms to handle such censored data problems that arise in modeling this complex data (also referred to as time-to-event data when modeling a particular event of interest is the main objective), which occurs ubiquitously in various real-world application domains.

In addition to the difficulty of handling censored data, there are also several unique challenges in performing predictive modeling with such survival data, and hence several researchers have, more recently, developed new computational algorithms for effectively handling these complex challenges. To tackle such practical concerns, some related works have adapted several machine learning methods to solve survival analysis problems, and machine learning researchers have developed more sophisticated and effective algorithms which either complement or compete with the traditional statistical methods. In spite of the importance of these problems and their relevance to various real-world applications, this research topic is scattered across different disciplines. Moreover, only a few surveys are available in the literature on this topic and, to the best of our knowledge, there is no comprehensive review of survival analysis and its recent developments from a machine learning perspective. Almost all of the existing survey articles describe solely statistical methods and either completely ignore or barely mention the machine learning advancements in this research field. One of the earliest surveys may be found in [Chung et al. (1991)], which gives an overview of the statistical survival analysis methods and describes their applications in criminology by predicting the time until recidivism. Most of the existing books about survival analysis [Kleinbaum and Klein (2006), Lee and Wang (2003), Allison (2010)] introduce this topic from the traditional statistical perspective rather than from the machine learning standpoint. Recently, the authors of [Cruz and Wishart (2006)] and [Kourou et al. (2015)] discussed applications in cancer prediction and provided a comparison of several machine learning techniques.

The primary purpose of this survey article is to provide a comprehensive and structured overview of various machine learning methods for survival analysis along with the traditional statistical methods. We describe the commonly used evaluation metrics and the advanced related formulations that are investigated in this research topic. We will discuss a detailed taxonomy of all the survival analysis methods that were developed in traditional statistics as well as, more recently, in the machine learning community. We will also provide links to various implementations and source code, which will enable readers to delve further into the methods discussed in this article. Finally, we will discuss various applications of survival analysis.

The rest of this paper is organized as follows. In Section 2, we give a brief review of the basic concepts, notations and definitions that are necessary to comprehend the survival analysis algorithms and provide the formal problem statement for the survival analysis problem. A taxonomy of the existing survival analysis methods, including both statistical and machine learning methods, is also provided to elucidate a holistic view of the existing work in the area of survival analysis. We then review the well-studied representative conventional statistical methods, including non-parametric, semi-parametric, and parametric models, in Section 3. Section 4 describes several basic machine learning approaches, including survival trees, Bayesian methods, support vector machines and neural networks developed for survival analysis; different kinds of advanced machine learning algorithms such as ensemble learning, transfer learning, multi-task learning and active learning for handling survival data are also discussed. Section 5 demonstrates the evaluation metrics for survival models. In addition to the survival analysis algorithms, some topics related to this area have received considerable attention in various fields: in Section 6, several related concepts such as early prediction and complex events are discussed; various data transformation techniques such as uncensoring and calibration, which are typically used in conjunction with existing predictive methods, are also mentioned briefly; and topics in complex event analysis such as competing risks and recurrent events are covered as well. In Section 7, various real-world applications of survival analysis methods are briefly explained and more insights into these application domains are provided. In Section 8, the details about the implementations and software packages of the survival analysis methods are discussed. Finally, Section 9 concludes our discussion.

2 Definition of Survival Analysis

In this section, we will first provide the basic notations and terminologies used in this paper. We will then give an illustrative example which explains the structure of survival data, followed by a more formal problem statement for survival analysis. Finally, we give a complete taxonomy of the existing survival analysis methods available in the literature, including both the conventional statistical methods and the machine learning approaches. This taxonomy provides a holistic view of the field of survival analysis and will help readers gain the basic knowledge about the methods used in this field before getting into the detailed algorithms.

Notations used in this paper:
  P        Number of features
  N        Number of instances
  X        Feature vector
  X_i      Covariate vector of instance i
  T        Vector of event times
  C        Vector of last follow-up (censoring) times
  y        Vector of observed times, y_i = min(T_i, C_i)
  δ        Binary vector of event status indicators
  β        Coefficient vector
  f(t)     Death density function
  F(t)     Cumulative event probability function
  S(t)     Survival probability function
  h(t)     Hazard function
  h_0(t)   Baseline hazard function
  H(t)     Cumulative hazard function

2.1 Survival Data and Censoring

During the study of a survival analysis problem, it is possible that the events of interest are not observed for some instances; this occurs because of a limited observation time window or because instances are lost to follow-up due to events unrelated to the one of interest. This concept is known as censoring [Klein and Moeschberger (2005)]. Censoring can be broadly categorized into three groups based on when it occurs [Lee and Wang (2003)]: (i) right-censoring, for which the observed survival time is less than or equal to the true survival time; (ii) left-censoring, for which the observed survival time is greater than or equal to the true survival time; and (iii) interval censoring, for which we only know that the event occurs during a given time interval. It should be noted that the true event occurrence time is unknown in all three cases. Among them, right-censoring is the most common scenario arising in practical problems [Marubini and Valsecchi (2004)]; thus, survival data with right-censoring will be the main focus of this paper.

For a survival problem, the time to the event of interest is known precisely only for those instances whose event occurred during the study period. For the remaining instances, since we may lose track of them during the observation time or their time to event is greater than the observation time, we can only record the censored time, which may be the time of withdrawal, the time of being lost to follow-up, or the end of the observation. These are considered to be censored instances in the context of survival analysis. In other words, for any given instance i, we can observe either the survival time T_i or the censoring time C_i, but not both; when only the observed time y_i = min(T_i, C_i) can be recorded during the study, the dataset is said to be right-censored. In a survival problem with right-censored instances, the censoring time is also a random variable, since instances enter the study at random times and the times at which they are censored are also random. Thus, in this paper, we assume that the censoring occurs randomly in the survival problems. For the sake of brevity, we will refer to randomly occurring right-censoring simply as censoring hereafter in the paper.

In Figure 1, an illustrative example is given for a better understanding of the definition of censoring and the structure of survival data. Six instances are observed during the study period, and the event occurrence information during this time period is recorded. From Figure 1, we can see that only subjects S4 and S6 experienced the event (marked by ‘X’) during the follow-up time, so the observed time for them is the event time. The event did not occur within the study period for subjects S1, S2, S3 and S5, which are therefore considered to be censored and are marked by red dots in the figure. More specifically, subjects S2 and S5 are censored because no event occurred during the study period, while subjects S1 and S3 are censored due to withdrawal or being lost to follow-up within the study time period.

Figure 1: An illustration demonstrating the survival analysis problem.

Problem Statement: A given instance i is represented by a triplet (X_i, y_i, δ_i), where X_i is the feature vector; δ_i is the binary event indicator, i.e., δ_i = 1 for an uncensored instance and δ_i = 0 for a censored instance; and y_i denotes the observed time, which is equal to the survival time T_i for an uncensored instance and to the censoring time C_i for a censored instance, i.e.,

$$y_i = \begin{cases} T_i & \text{if } \delta_i = 1 \\ C_i & \text{if } \delta_i = 0 \end{cases} \qquad (1)$$

It should be noted that T_i is a latent value for censored instances, since these instances did not experience the event during the observation time period.
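To make this representation concrete, the following Python sketch (not part of the original survey; the names and data are hypothetical) constructs the triplet (X_i, y_i, δ_i) from latent event and censoring times:

```python
import numpy as np

# Hypothetical example: build the (X, y, delta) representation of right-censored
# survival data from latent event times T and censoring times C.
rng = np.random.default_rng(0)
N, P = 6, 3                                  # instances and features (toy sizes)
X = rng.normal(size=(N, P))                  # covariate matrix
T = rng.exponential(scale=12.0, size=N)      # latent event times
C = rng.uniform(low=2.0, high=12.0, size=N)  # censoring / end-of-follow-up times

delta = (T <= C).astype(int)   # 1 = event observed, 0 = censored
y = np.minimum(T, C)           # observed time y_i = min(T_i, C_i), as in Eq. (1)

print(np.column_stack([y.round(2), delta]))
```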

The goal of survival analysis is to estimate the time to the event of interest T_j for a new instance j with feature predictors denoted by X_j. It should be noted that, in survival analysis problems, the value of T_j is both non-negative and continuous.

2.2 Survival and Hazard Function

The survival function, which represents the probability that the time to the event of interest is not earlier than a specified time t [Lee and Wang (2003), Klein and Moeschberger (2005)], is one of the primary estimation targets in survival analysis. Conventionally, the survival function is denoted by S(t) and is given as follows:

$$S(t) = Pr(T \geq t) \qquad (2)$$

The survival function monotonically decreases with t, and its initial value is 1 when t = 0, which represents the fact that, at the beginning of the observation, 100% of the observed subjects survive; in other words, none of the events of interest have occurred.

On the contrary, the cumulative death distribution function F(t), which represents the probability that the event of interest occurs earlier than t, is defined as F(t) = 1 − S(t), and the death density function f(t) can be obtained as f(t) = dF(t)/dt for continuous cases and as f(t) = [F(t + Δt) − F(t)]/Δt, where Δt denotes a small time interval, for discrete cases. Figure 2 shows the relationship among these functions.

Figure 2: Relationship among the different entities f(t), F(t) and S(t).

In survival analysis, another commonly used function is the hazard function h(t), which is also called the force of mortality, the instantaneous death rate or the conditional failure rate [Dunn and Clark (2009)]. The hazard function does not indicate the chance or probability of the event of interest; instead, it is the rate of the event at time t given that no event occurred before time t. Mathematically, the hazard function is defined as:

$$h(t) = \lim_{\Delta t \to 0} \frac{Pr(t \leq T < t + \Delta t \mid T \geq t)}{\Delta t} \qquad (3)$$

Similar to S(t), h(t) is also a non-negative function. However, while all survival functions S(t) decrease over time, the hazard function can have a variety of shapes. Consider the definition of f(t), which can also be expressed as f(t) = −dS(t)/dt, so that the hazard function can be represented as:

$$h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\bigl[\ln S(t)\bigr] \qquad (4)$$

Thus, the survival function defined in Eq. (2) can be rewritten as

$$S(t) = \exp\bigl(-H(t)\bigr), \qquad H(t) = \int_0^t h(u)\,du \qquad (5)$$

where H(t) represents the cumulative hazard function (CHF) [Lee and Wang (2003)].
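As a hedged numerical illustration of these identities (not from the original paper), the following Python sketch checks S(t) = exp(−H(t)) and h(t) = f(t)/S(t) for a constant-hazard exponential model with an assumed rate λ = 0.3:

```python
import numpy as np

# Hypothetical check of S(t) = exp(-H(t)) and h(t) = f(t)/S(t)
# for a constant-hazard (exponential) model with rate lam.
lam = 0.3
t = np.linspace(0.0, 10.0, 1001)

h = np.full_like(t, lam)              # hazard function h(t) = lam
H = np.cumsum(h) * (t[1] - t[0])      # numerical cumulative hazard H(t)
S = np.exp(-H)                        # survival function via Eq. (5)
f = h * S                             # density via h(t) = f(t) / S(t), Eq. (4)

# Compare against the closed forms exp(-lam * t) and lam * exp(-lam * t);
# only a small discretization error should remain.
print(np.max(np.abs(S - np.exp(-lam * t))))
print(np.max(np.abs(f - lam * np.exp(-lam * t))))
```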

2.3 Taxonomy of Survival Analysis methods

Broadly speaking, survival analysis methods can be classified into two main categories: statistical methods and machine learning based methods. Statistical methods share with machine learning methods the common goal of predicting the survival time and estimating the survival probability at the estimated survival time. However, statistical methods focus more on characterizing both the distributions of the event times and the statistical properties of the parameter estimates by estimating the survival curves, while machine learning methods focus more on predicting event occurrence at a given time point by combining the traditional survival analysis methods with various machine learning techniques. Machine learning methods are usually applied to high-dimensional problems, while statistical methods are generally developed for low-dimensional data. In addition, machine learning methods for survival analysis offer more effective algorithms by combining survival models with machine learning techniques and taking advantage of recent developments in machine learning and optimization to learn the dependencies between covariates and survival times in different ways.

Based on the assumptions and the usage of the parameters in the model, the traditional statistical methods can be subdivided into three categories: (i) non-parametric models, (ii) semi-parametric models and (iii) parametric models. Machine learning algorithms, such as survival trees, Bayesian methods, neural networks and support vector machines, which have become more popular in recent years, are included under a separate branch. Several advanced machine learning methods, including ensemble learning, active learning, transfer learning and multi-task learning methods, are also included. The overall taxonomy also includes some of the research topics that are related to survival analysis, such as complex events, data transformation and early prediction. A complete taxonomy of these survival analysis methods is shown in Figure 3.

Figure 3: Taxonomy of the methods developed for survival analysis.

3 Traditional statistical methods

In this section, we will introduce three different types of statistical methods to estimate the survival/hazard functions: non-parametric, semi-parametric and parametric methods. Table 3 shows both the advantages and disadvantages of each type of method based on theoretical and experimental analysis and lists the specific methods in each type.

Summary of different types of statistical methods for survival analysis.

Non-parametric
  Advantages: More efficient when no suitable theoretical distribution is known.
  Disadvantages: Difficult to interpret; yields inaccurate estimates.
  Specific methods: Kaplan-Meier, Nelson-Aalen, Life-Table.

Semi-parametric
  Advantages: Knowledge of the underlying distribution of survival times is not required.
  Disadvantages: The distribution of the outcome is unknown; not easy to interpret.
  Specific methods: Cox model, Regularized Cox, CoxBoost, Time-Dependent Cox.

Parametric
  Advantages: Easy to interpret; more efficient and accurate when the survival times follow a particular distribution.
  Disadvantages: When the distribution assumption is violated, it may be inconsistent and can give sub-optimal results.
  Specific methods: Tobit, Buckley-James, Penalized regression, Accelerated Failure Time.

Non-parametric methods are more efficient when no suitable theoretical distribution is known for the event time or when the proportional hazards assumption does not hold. In non-parametric methods, an empirical estimate of the survival function is obtained using the Kaplan-Meier (KM) method, the Nelson-Aalen (NA) estimator or the Life-Table (LT) method. More generally, the KM estimate of the survival probability at a specified survival time is the product of the same estimate up to the previous time and the observed survival rate for the given time; thus, the KM method is also referred to as a product-limit method [Kaplan and Meier (1958), Lee and Wang (2003)]. The NA method is an estimator based on modern counting process techniques [Andersen et al. (2012)]. LT [Cutler and Ederer (1958)] is the application of the KM method to interval-grouped survival data.

Under the semi-parametric category, the Cox model is the most commonly used regression analysis approach for survival data; it differs significantly from other methods since it is built on the proportional hazards assumption and employs partial likelihood for parameter estimation. The Cox regression method is described as a semi-parametric method since the distribution of the outcome remains unknown even though it is based on a parametric regression model. In addition, several useful variants of the basic Cox model, such as penalized Cox models, the CoxBoost algorithm and the Time-Dependent Cox model (TD-Cox), have also been proposed in the literature.

Parametric methods are more efficient and accurate for estimation when the time to the event of interest follows a particular distribution specified in terms of certain parameters. It is relatively easy to estimate the time to the event of interest with parametric models, but it becomes awkward or even impossible to do so with the Cox model [Allison (2010)]. Linear regression is one of the main parametric survival methods, and the Tobit model, the Buckley-James regression model and penalized regression are the most commonly used linear models for survival analysis. In addition, other parametric models, such as the accelerated failure time (AFT) model, which models the survival time as a function of the covariates [Kleinbaum and Klein (2006)], are also widely used. We will now describe these three types of statistical survival methods in this section.

3.1 Non-parametric Models

Among all these functions, the survival function or its graphical presentation is the most widely used one. In 1958, Kaplan and Meier [Kaplan and Meier (1958)] developed the Kaplan-Meier (KM) Curve, or product-limit (PL) estimator, to estimate the survival function using the actual lengths of the observed times; it is the most widely used method for estimating the survival function. Let T_1 < T_2 < ... < T_K be the set of distinct ordered event times observed for the N instances. In addition to these event times, there are also censoring times for instances whose event times are not observed. For a specific event time T_j, the number of observed events is d_j, and r_j instances are considered to be "at risk" since their event time or censoring time is greater than or equal to T_j. It should be noted that, due to censoring, we cannot simply compute r_j as the difference between r_{j-1} and d_{j-1}; the correct way to obtain r_j is r_j = r_{j-1} − d_{j-1} − c_{j-1}, where c_{j-1} is the number of censored instances during the time period between T_{j-1} and T_j. Then the conditional probability of surviving beyond time T_j can be defined as:

$$p(T_j) = \frac{r_j - d_j}{r_j} \qquad (6)$$

Based on this conditional probability, the product-limit estimate of the survival function is given as follows:

$$\hat{S}(t) = \prod_{j:\, T_j \leq t} p(T_j) = \prod_{j:\, T_j \leq t}\Bigl(1 - \frac{d_j}{r_j}\Bigr) \qquad (7)$$

However, if the subjects in the data are grouped into interval periods according to time, if the number of subjects is very large, or if the study covers a large population, the Life Table (LT) analysis [Cutler and Ederer (1958)] is a more convenient method. Different from the KM and LT methods, the Nelson-Aalen estimator [Nelson (1972), Aalen (1978)] estimates the cumulative hazard function for censored data based on the counting process approach. It should be noted that when the time to the event of interest follows a specific distribution, non-parametric methods are less efficient than parametric methods.
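To make the product-limit computation concrete, here is a minimal numpy sketch (my own illustration, not the survey's code) that evaluates Eqs. (6) and (7) on a small right-censored sample; d_j and r_j follow the notation reconstructed above:

```python
import numpy as np

def kaplan_meier(y, delta):
    """Product-limit estimate of S(t) at each distinct event time (Eq. 7)."""
    event_times = np.unique(y[delta == 1])            # distinct ordered event times
    surv = []
    s = 1.0
    for t_j in event_times:
        r_j = np.sum(y >= t_j)                        # number at risk at t_j
        d_j = np.sum((y == t_j) & (delta == 1))       # events observed at t_j
        s *= 1.0 - d_j / r_j                          # conditional survival, Eq. (6)
        surv.append(s)
    return event_times, np.array(surv)

# Toy right-censored sample (observed times and event indicators).
y = np.array([2.0, 3.0, 3.0, 5.0, 8.0, 9.0])
delta = np.array([1, 1, 0, 1, 0, 1])
print(kaplan_meier(y, delta))
```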

3.2 Semi-Parametric Models

As a hybrid of the parametric and non-parametric approaches, semi-parametric models can obtain a more consistent estimator under a broader range of conditions compared to the parametric models, and a more precise estimator than the non-parametric methods [Powell (1994)]. The Cox model [David (1972)] is the most commonly used survival analysis method in this category. Unlike parametric methods, knowledge of the underlying distribution of the time to the event of interest is not required, but the attributes are assumed to have an exponential influence on the outcome. We will now discuss the Cox model in more detail and then describe different variants and extensions of the basic Cox model, such as regularized Cox models, CoxBoost and Time-Dependent Cox.

3.2.1 The Basic Cox Model

For a given instance i, represented by a triplet (X_i, y_i, δ_i), the hazard function in the Cox model follows the proportional hazards assumption given by

$$h(t, X_i) = h_0(t)\exp(X_i\beta) \qquad (8)$$

for i = 1, 2, ..., N, where the baseline hazard function h_0(t) can be an arbitrary non-negative function of time, X_i is the corresponding covariate vector for instance i, and β is the coefficient vector. The Cox model is a semi-parametric algorithm since the baseline hazard function h_0(t) is unspecified. For any two instances i and j, the hazard ratio is given by

$$\frac{h(t, X_i)}{h(t, X_j)} = \frac{h_0(t)\exp(X_i\beta)}{h_0(t)\exp(X_j\beta)} = \exp\bigl[(X_i - X_j)\beta\bigr] \qquad (9)$$

which means that the hazard ratio is independent of the baseline hazard function. Cox model is a proportional hazards model since the hazard ratio is a constant and all the subjects share the same baseline hazard function. Based on this assumption, the survival function can be computed as follows:

(10)

where is the cumulative baseline hazard function, and represents the baseline survival function. The Breslow’s estimator [Breslow (1972)] is the most widely used method to estimate , which is given by

(11)

where if is an event time, otherwise . Here, represents the set of subjects who are at risk at time .
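The Breslow estimate is straightforward to compute once β is available. The following sketch (illustrative only, assuming no tied event times and an already-estimated coefficient vector) accumulates the baseline cumulative hazard of Eq. (11):

```python
import numpy as np

def breslow_cumulative_hazard(X, y, delta, beta):
    """Breslow estimate of H_0(t) at each observed event time (Eq. 11), assuming no ties."""
    risk = np.exp(X @ beta)                       # exp(X_j beta) for every subject
    event_times = np.sort(y[delta == 1])
    H0 = []
    h0_sum = 0.0
    for t_i in event_times:
        at_risk = y >= t_i                        # risk set R_i at time t_i
        h0_sum += 1.0 / np.sum(risk[at_risk])     # increment of the baseline hazard
        H0.append(h0_sum)
    return event_times, np.array(H0)

# Toy usage with an assumed, already-estimated coefficient vector.
X = np.array([[0.5], [1.2], [-0.3], [0.0], [2.0]])
y = np.array([2.0, 4.0, 4.5, 6.0, 7.0])
delta = np.array([1, 0, 1, 1, 0])
print(breslow_cumulative_hazard(X, y, delta, beta=np.array([0.8])))
```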

Because the baseline hazard function h_0(t) in the Cox model is not specified, it is not possible to fit the model using the standard likelihood function. In other words, the baseline hazard is a nuisance function, while the coefficients β are the parameters of interest in the model. To estimate the coefficients, Cox proposed a partial likelihood [David (1972), David (1975)] which depends only on the parameters of interest and is free of the nuisance parameters. The probability that an instance with covariate vector X_i fails at time t, conditional on surviving until time t, is expressed by the hazard h(t, X_i) = h_0(t)exp(X_iβ). Let J (J ≤ N) be the total number of events of interest that occurred during the observation period for the N instances, and let T_1 < T_2 < ... < T_J denote the distinct ordered event times. Without considering ties, let X_j be the covariate vector for the subject who fails at T_j, and let R_j be the set of subjects at risk at T_j. Conditional on the fact that an event occurs at T_j, the individual probability corresponding to covariate X_j can be formulated as follows:

$$\frac{\exp(X_j\beta)}{\sum_{i \in R_j}\exp(X_i\beta)} \qquad (12)$$

and the partial likelihood is the product of these probabilities over all subjects; accounting for the Cox assumption and the presence of censoring, the partial likelihood is defined as follows:

$$L(\beta) = \prod_{j=1}^{N}\left[\frac{\exp(X_j\beta)}{\sum_{i \in R_j}\exp(X_i\beta)}\right]^{\delta_j} \qquad (13)$$

It should be noted that here δ_j ∈ {0, 1}: if δ_j = 1, the corresponding term in the product is the conditional probability; otherwise, when δ_j = 0, the corresponding term equals 1, which means that it has no effect on the final product. The coefficient vector β is estimated by maximizing this partial likelihood or, equivalently and more efficiently, by minimizing the negative log-partial likelihood

$$LL(\beta) = -\sum_{j=1}^{N}\delta_j\left[X_j\beta - \log\sum_{i \in R_j}\exp(X_i\beta)\right] \qquad (14)$$

The maximum partial likelihood estimator (MPLE) [David (1972), Lee and Wang (2003)] can be used along with the numerical Newton-Raphson method [Kelley (1999)] to iteratively find an estimator which minimizes LL(β).
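As a hedged sketch of this estimation step (not the paper's implementation), the negative log-partial likelihood of Eq. (14) can be written in a few lines of numpy and handed to a generic quasi-Newton optimizer; in practice one would typically use an established routine such as those in the lifelines or survival packages:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_partial_likelihood(beta, X, y, delta):
    """Negative log partial likelihood of the Cox model (Eq. 14), assuming no ties."""
    eta = X @ beta
    ll = 0.0
    for j in range(len(y)):
        if delta[j] == 1:
            at_risk = y >= y[j]                      # risk set R_j
            ll += eta[j] - np.log(np.sum(np.exp(eta[at_risk])))
    return -ll

# Toy data and a generic quasi-Newton fit (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = rng.exponential(scale=np.exp(-X @ np.array([0.7, -0.4])))
delta = (rng.uniform(size=30) < 0.7).astype(int)

res = minimize(neg_log_partial_likelihood, x0=np.zeros(2), args=(X, y, delta), method="BFGS")
print(res.x)   # estimated coefficient vector beta
```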

3.2.2 Regularized Cox models

With the development of data collection and detection techniques, most real-world domains tend to encounter high-dimensional data. In some cases, the number of variables (P) in the given data is almost equal to or even exceeds the number of instances (N). It is challenging to build a prediction model with all the features, and such a model may provide inaccurate results because of overfitting [van Houwelingen and Putter (2011)]. This motivates the use of sparsity-inducing norms to select the vital features in high dimensions under the assumption that most of the features are not significant [Friedman et al. (2001)]. For the purpose of identifying the features most relevant to the outcome variable among tens of thousands of features, different penalty functions, including lasso, group lasso, fused lasso and graph lasso, are used to develop prediction models through sparse learning methods. The family of ℓ_q-norm penalty functions, of the form λ‖β‖_q, are the most commonly used penalty functions. The smaller the value of q, the sparser the solution; however, when q < 1, the penalty is non-convex, which makes the optimization problem more challenging to solve. Here, we introduce the commonly used regularized Cox models, whose regularizers are summarized in Table 3.2.2.

Different regularizers used in the variants of the Cox model (λ, λ_1, λ_2 > 0 and 0 < α < 1 are tuning parameters; F is the sparse symmetric edge set matrix described below):
  Lasso-Cox:   λ‖β‖_1
  Ridge-Cox:   λ‖β‖_2²
  EN-Cox:      λ(α‖β‖_1 + (1 − α)‖β‖_2²)
  OSCAR-Cox:   λ_1‖β‖_1 + λ_2‖Fβ‖_1

Lasso-Cox: Lasso [Tibshirani (1996)] is an ℓ_1-norm regularizer which performs feature selection and estimates the regression coefficients simultaneously. In [Tibshirani (1997)], the ℓ_1-norm penalty was incorporated into the log-partial likelihood shown in Eq. (14) to obtain the Lasso-Cox algorithm, which inherits the feature-selection properties of the ℓ_1-norm.

There are also several extensions of the Lasso-Cox method. Adaptive Lasso-Cox [Zhang and Lu (2007)] is based on a penalized partial likelihood with adaptively weighted ℓ_1 penalties on the regression coefficients, with small weights for large coefficients and large weights for small coefficients. In fused Lasso-Cox [Tibshirani et al. (2005)], the coefficients and their successive differences are penalized using the ℓ_1-norm. In graphical Lasso-Cox [Friedman et al. (2008)], sparse graphs are estimated using the coordinate descent method by applying an ℓ_1-penalty to the inverse covariance matrix. These extensions solve survival problems in a similar way to the regular Lasso-Cox model by incorporating different penalties.

Ridge-Cox: Ridge regression was originally proposed by Hoerl and Kennard [Hoerl and Kennard (1970)] and was successfully used in the context of Cox regression by Verweij et al. [Verweij and Van Houwelingen (1994)]. It incorporates an ℓ_2-norm regularizer to select the correlated features and shrink their values towards each other.

The feature-based regularized Cox method (FEAR-Cox) [Vinzamuri and Reddy (2013)] uses a feature-based non-negative valued regularizer of the form β^T M β for the modified least squares formulation of Cox regression, where M is a positive semi-definite matrix, and the cyclic coordinate descent method is used to solve this optimization problem. Ridge-Cox is a special case of FEAR-Cox in which M is the identity matrix.

EN-Cox: The elastic net (EN), which combines the ℓ_1 and squared ℓ_2 penalties, has the potential to perform feature selection and deal with correlations between features simultaneously [Zou and Hastie (2005)]. The EN-Cox method was proposed by Simon et al. [Simon et al. (2011)], where the elastic net penalty term shown in Table 3.2.2, with 0 < α < 1, is introduced into the log-partial likelihood function in Eq. (14). Different from Lasso-Cox, EN-Cox can select more than N features when P > N.

The Kernel Elastic Net (KEN-Cox) algorithm [Vinzamuri and Reddy (2013)], which uses the concept of kernels, compensates for the drawback that EN-Cox is only partially effective at dealing with correlated features in survival data. KEN-Cox builds a kernel similarity matrix for the feature space in order to incorporate pairwise feature similarity into the Cox model. The regularizer used in KEN-Cox combines a non-smooth sparsity-inducing term with a smooth term defined through a symmetric kernel matrix whose entries encode the pairwise feature similarities, so the KEN-Cox formulation includes both smooth and non-smooth terms.

OSCAR-Cox: The modified graph Octagonal Shrinkage and Clustering Algorithm for Regression (OSCAR) [Yang et al. (2012), Ye and Liu (2012)] regularizer is incorporated into the basic Cox model to obtain the OSCAR-Cox algorithm [Vinzamuri and Reddy (2013)], which can perform variable selection for highly correlated features in regression problems. The main advantage of the OSCAR regularizer is that it tends to assign equal coefficients to features which relate to the outcome in similar ways. In addition, it simultaneously obtains the advantages of individual sparsity, due to the ℓ_1 norm, and group sparsity, due to the pairwise ℓ_∞ norm. The regularizer used in the formulation of OSCAR-Cox is given in Table 3.2.2, where F is the sparse symmetric edge set matrix generated by building a graph structure which treats each feature as an individual node. In this way, a pairwise feature regularizer can be incorporated into the basic Cox regression framework.

For the regularizers shown in Table 3.2.2, the tuning parameters can be adjusted to control the influence of the regularization term. The performance of these penalized estimators depends significantly on the choice of λ, and the optimal λ can be chosen via cross-validation.
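For illustration, a penalized Cox model of this kind can be fit with the lifelines library, whose CoxPHFitter exposes a penalizer strength and an l1_ratio mixing the ℓ_1 and ℓ_2 components (argument names assume a recent lifelines version); the data set here is synthetic:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter   # assumes the lifelines package is installed

# Toy right-censored data frame with two covariates.
rng = np.random.default_rng(2)
n = 200
df = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
risk = np.exp(0.8 * df["x1"] - 0.5 * df["x2"])
event_time = rng.exponential(scale=1.0 / risk)
censor_time = rng.exponential(scale=2.0, size=n)
df["time"] = np.minimum(event_time, censor_time)
df["event"] = (event_time <= censor_time).astype(int)

# Elastic-net penalized Cox model: penalizer sets the overall strength, l1_ratio trades
# off the l1 (lasso) and squared-l2 (ridge) components (l1_ratio=1.0 -> Lasso-Cox).
cph = CoxPHFitter(penalizer=0.1, l1_ratio=0.5)
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()
```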

3.2.3 CoxBoost

While there exist several algorithms (such as penalized parameter estimation) that can be applied to fit sparse survival models on high-dimensional data, none of them is applicable when some mandatory covariates must be taken into consideration explicitly in the model. The CoxBoost approach [Binder and Schumacher (2008)] was proposed to incorporate such mandatory covariates into the final model. The CoxBoost method also aims at estimating the coefficients β in Eq. (8), as in the Cox model. It considers a flexible set of candidate variables for updating in each boosting step by employing an offset-based gradient boosting approach. This is the key difference from the regular gradient boosting approach, which either updates only one component of β in component-wise boosting or fits the gradient using all covariates in each step.

3.2.4 Time-dependent (TD) Cox Model

The Cox regression model has also been effectively adapted to handle time-dependent covariates, i.e., variables whose values may change over time for a given instance. Typically, time-dependent variables can be classified into three categories [Kleinbaum and Klein (2006)]: internal time-dependent variables, ancillary time-dependent variables and defined time-dependent variables. An internal time-dependent variable changes due to internal characteristics or behavior specific to the individual. In contrast, a variable is called an ancillary time-dependent variable if its value changes primarily due to the environment, which may affect several individuals simultaneously. A defined time-dependent variable, which takes the form of a time-independent variable multiplied by a function of time, is used to analyze a time-independent predictor that does not satisfy the proportional hazards (PH) assumption of the Cox model. The commonly used data layout for the time-dependent Cox model is the counting process (CP) format [Kleinbaum and Klein (2006)].

Given a survival analysis problem which involves both time-dependent and time-independent features, we can denote the variables at time t as X(t) = (X_1(t), ..., X_{P_1}(t), X_1, ..., X_{P_2}), where P_1 and P_2 represent the number of time-dependent and time-independent variables, respectively, and X_i(t) and X_j represent the i-th time-dependent feature and the j-th time-independent feature, respectively. Then, by introducing the time-dependent features into the basic Cox model given in Eq. (8), the time-dependent Cox model can be formulated as:

$$h(t, X(t)) = h_0(t)\exp\Bigl(\sum_{i=1}^{P_1}\gamma_i X_i(t) + \sum_{j=1}^{P_2}\beta_j X_j\Bigr) \qquad (15)$$

where γ_i and β_j represent the coefficients corresponding to the i-th time-dependent variable and the j-th time-independent variable, respectively. For two sets of predictors at time t, X*(t) and X(t), the hazard ratio for the time-dependent Cox model can be computed as follows:

$$HR(t) = \frac{h(t, X^*(t))}{h(t, X(t))} = \exp\Bigl[\sum_{i=1}^{P_1}\gamma_i\bigl(X_i^*(t) - X_i(t)\bigr) + \sum_{j=1}^{P_2}\beta_j\bigl(X_j^* - X_j\bigr)\Bigr] \qquad (16)$$

Since the first component in the exponent of Eq. (16) is time-dependent, the hazard ratio of the TD-Cox model is a function of time t; this means that it does not satisfy the PH assumption of the standard Cox model. It should be noted that the coefficient γ_i is itself not time-dependent; it represents the overall effect of the i-th time-dependent variable across the survival time points. The likelihood function of the time-dependent Cox model can be constructed in the same manner and optimized with the same time complexity as in the standard Cox model.
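The counting-process layout mentioned above can be sketched as a long-format table in which each row covers an interval during which the time-dependent covariate is constant; the example below (hypothetical column names and data) also shows how such data could be fit with lifelines' CoxTimeVaryingFitter:

```python
import pandas as pd
from lifelines import CoxTimeVaryingFitter   # assumes the lifelines package is available

# Counting-process (CP) layout: each row covers an interval (start, stop] during which
# the time-dependent covariate "biomarker" is constant; "event" indicates whether the
# event of interest occurred at the end of that interval.
df = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 3, 3],
    "start":     [0, 3, 6, 0, 4, 0, 2],
    "stop":      [3, 6, 9, 4, 7, 2, 5],
    "biomarker": [1.2, 1.8, 2.5, 0.7, 0.9, 1.5, 2.1],  # time-dependent covariate
    "age":       [60, 60, 60, 52, 52, 47, 47],         # time-independent covariate
    "event":     [0, 0, 1, 0, 0, 0, 1],
})

# A small penalizer keeps this tiny toy fit numerically stable.
ctv = CoxTimeVaryingFitter(penalizer=0.1)
ctv.fit(df, id_col="id", event_col="event", start_col="start", stop_col="stop")
ctv.print_summary()
```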

3.3 Parametric Models

The parametric censored regression models assume that the survival times, or the logarithms of the survival times, of all instances in the data follow a particular theoretical distribution [Lee and Wang (2003)]. These models are important alternatives to the Cox-based semi-parametric models and are also widely used in many application domains. Parametric methods are simple, efficient and effective in predicting the time to the event of interest, and parametric survival models tend to produce survival estimates that are consistent with a theoretical survival distribution. The commonly used distributions in parametric censored regression models are: normal, exponential, Weibull, logistic, log-logistic and log-normal. If the survival times of all instances in the data follow one of these distributions, the model is referred to as a linear regression model. If the logarithms of the survival times follow one of these distributions, the problem can be analyzed using the accelerated failure time model, in which we assume that a covariate can affect the time to the event of interest of an instance by some constant factor [Lee and Wang (2003)]. It should be noted that if no suitable theoretical distribution is known, non-parametric methods are more efficient.

The maximum likelihood estimation (MLE) method [Lee and Wang (2003)] can be used to estimate the parameters of these models. Let us assume that there are N instances, with c censored observations and N − c uncensored observations, and let θ be a general notation for the set of all parameters [Li et al. (2016e)]. Then the death density function and the survival function of the survival time can be represented as f(y_i, θ) and S(y_i, θ), respectively. For a given instance i, if it is censored, the actual survival time is not available; however, we can conclude that the instance did not experience the event of interest before the censoring time C_i, so the value of the survival function at that time should be a probability close to 1. In contrast, if the event occurs for instance i at T_i, then the death density function at T_i should take a high value. Thus, we can use the product of the death density function over the uncensored observations as their joint probability, and the product of the survival function over the censored observations as theirs [Li et al. (2016e)]. Therefore, we can estimate the parameters θ by optimizing the likelihood function of all instances, which takes the form

$$L(\theta) = \prod_{\delta_i = 1} f(y_i, \theta)\prod_{\delta_i = 0} S(y_i, \theta) \qquad (17)$$
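As a small worked example of Eq. (17) (my own sketch, not from the survey), consider the exponential model with f(y) = λe^{−λy} and S(y) = e^{−λy}; the censored log-likelihood reduces to (Σ_i δ_i) log λ − λ Σ_i y_i, whose maximizer λ̂ = Σ_i δ_i / Σ_i y_i can be checked against a numerical optimizer:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def neg_log_likelihood(lam, y, delta):
    """Censored negative log-likelihood of Eq. (17) for the exponential model:
    f(y) = lam * exp(-lam * y), S(y) = exp(-lam * y)."""
    if lam <= 0:
        return np.inf
    return -(np.sum(delta) * np.log(lam) - lam * np.sum(y))

# Toy right-censored sample.
y = np.array([2.0, 3.5, 4.0, 1.2, 6.3, 0.8])
delta = np.array([1, 0, 1, 1, 0, 1])

res = minimize_scalar(neg_log_likelihood, bounds=(1e-6, 10.0), args=(y, delta), method="bounded")
print(res.x)                              # numerical MLE of lam
print(np.sum(delta) / np.sum(y))          # closed-form MLE, should agree
```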

Table 3.3 shows the death density function and its corresponding survival and hazard functions for these commonly used distributions. We will now discuss these distributions in more detail.

Density (PDF), survival and hazard functions for the distributions commonly used in the parametric methods in survival analysis (one standard parameterization):
  Exponential:  f(t) = λe^{−λt};  S(t) = e^{−λt};  h(t) = λ
  Weibull:      f(t) = λkt^{k−1}e^{−λt^k};  S(t) = e^{−λt^k};  h(t) = λkt^{k−1}
  Logistic:     f(t) = e^{−(t−μ)/σ} / [σ(1 + e^{−(t−μ)/σ})²];  S(t) = 1/(1 + e^{(t−μ)/σ});  h(t) = 1/[σ(1 + e^{−(t−μ)/σ})]
  Log-logistic: f(t) = λpt^{p−1}/(1 + λt^p)²;  S(t) = 1/(1 + λt^p);  h(t) = λpt^{p−1}/(1 + λt^p)
  Normal:       f(t) = (1/(σ√(2π))) e^{−(t−μ)²/(2σ²)};  S(t) = 1 − Φ((t−μ)/σ);  h(t) = f(t)/S(t)
  Log-normal:   f(t) = (1/(tσ√(2π))) e^{−(ln t−μ)²/(2σ²)};  S(t) = 1 − Φ((ln t−μ)/σ);  h(t) = f(t)/S(t)

Exponential Distribution: Among the parametric models in survival analysis, the exponential model is the simplest and most prominent one since it is characterized by a constant hazard rate λ, which is its only parameter. In this case, failure or death is assumed to be a random event independent of time. A larger value of λ indicates a higher risk and a shorter expected survival period. Based on the survival function shown in Table 3.3, we have ln S(t) = −λt, so the relationship between the logarithm of the survival function and time t is linear with −λ as the slope. Thus, it is easy to determine whether the survival time follows an exponential distribution by plotting ln Ŝ(t) against time t [Lee and Wang (2003)].

Weibull Distribution: The Weibull model, which is characterized by two parameters λ and k, is the most widely used parametric distribution for survival problems. The shape of the hazard function is determined by the shape parameter k, which provides more flexibility compared to the exponential model. If k = 1, the hazard is constant, and in this case the Weibull model reduces to the exponential model; if k < 1, the hazard function decreases over time, and if k > 1 it increases over time. The scale of the hazard function is determined by the scale parameter λ.

Logistic and Log-logistic Distribution: In contrast to the Weibull model, the hazard functions of both the logistic and log-logistic models allow for non-monotonic behavior, as shown in Table 3.3. The survival time and the logarithm of the survival time follow the logistic distribution in the logistic and log-logistic models, respectively. For the logistic model, μ is the location parameter of the function, while σ is the scale parameter. For the log-logistic model, p is the shape parameter: if p ≤ 1, the hazard function decreases over time; however, if p > 1, the hazard function first increases over time to a maximum value and then decreases, i.e., the hazard function is unimodal. Thus, the log-logistic distribution may be used to describe either a monotonically decreasing hazard or a hazard that first increases and then decreases [Lee and Wang (2003)].

Normal and Log-normal Distribution: If the survival time T (or its logarithm ln T) is normally distributed with mean μ and variance σ², then T is normally (or log-normally) distributed. These models are suitable for survival patterns with an initially increasing and then decreasing hazard rate.
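To visualize the hazard shapes described above without committing to a particular parameterization, one can compute h(t) = f(t)/S(t) numerically from scipy.stats distributions (scipy's fisk is the log-logistic); this is an illustrative sketch only:

```python
import numpy as np
from scipy import stats

t = np.linspace(0.01, 10.0, 500)

def hazard(dist, t):
    """Numerical hazard h(t) = f(t) / S(t) for a frozen scipy distribution."""
    return dist.pdf(t) / dist.sf(t)

dists = {
    "exponential":  stats.expon(scale=2.0),               # constant hazard
    "weibull k=2":  stats.weibull_min(c=2.0, scale=3.0),   # increasing hazard
    "log-logistic": stats.fisk(c=2.0, scale=3.0),          # unimodal hazard (shape > 1)
    "log-normal":   stats.lognorm(s=0.8, scale=3.0),       # increases then decreases
}

for name, dist in dists.items():
    h = hazard(dist, t)
    print(f"{name:13s}  h(1)={h[np.searchsorted(t, 1.0)]:.3f}  "
          f"argmax at t={t[np.argmax(h)]:.2f}")
```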

Based on the framework given in Eq. (17), we will discuss these commonly used parametric methods.

3.3.1 Linear regression models

In data analysis, the linear regression model, together with the least squares estimation method, is one of the most commonly used approaches. It cannot be applied directly to survival analysis problems since the actual event times are missing for censored instances. Several linear models [Miller and Halpern (1982), Koul et al. (1981), Buckley and James (1979), Wang et al. (2008), Li et al. (2016e)], including Tobit regression and Buckley-James (BJ) regression, have been proposed to handle censored instances in survival analysis. Strictly speaking, linear regression for censored data is a specific type of parametric censored regression; however, since this method is fundamental in data analysis, we discuss linear regression methods for censored data separately here.

Tobit Regression: The Tobit model [Tobin (1958)] is one of the earliest attempts to extend linear regression with the Gaussian distribution to data analysis with censored observations. In this model, a latent variable y_i* is introduced, and the assumption is that it depends linearly on X_i via the parameter β, i.e., y_i* = X_iβ + ε_i, where ε_i is a normally distributed error term. Then, for the i-th instance, the observable variable y_i will be y_i* if y_i* > 0 and 0 otherwise; in other words, if the latent variable is above zero, the observed variable equals the latent variable, and zero otherwise. Based on this latent variable formulation, the parameters in the model can be estimated with the maximum likelihood estimation (MLE) method.
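A hedged sketch of the corresponding Tobit likelihood (my own illustration): uncensored observations contribute the normal density of the latent model, while observations censored at zero contribute Pr(y* ≤ 0) through the normal CDF:

```python
import numpy as np
from scipy import stats, optimize

def tobit_neg_loglik(params, X, y):
    """Negative log-likelihood of the classic Tobit model (latent y* = X beta + eps,
    observed y = max(y*, 0)), as described above."""
    beta, log_sigma = params[:-1], params[-1]
    sigma = np.exp(log_sigma)                        # keep sigma positive
    mu = X @ beta
    obs = y > 0
    ll_obs = stats.norm.logpdf(y[obs], loc=mu[obs], scale=sigma)
    ll_cens = stats.norm.logcdf(-mu[~obs] / sigma)   # Pr(y* <= 0)
    return -(ll_obs.sum() + ll_cens.sum())

# Toy data generated from the assumed latent model.
rng = np.random.default_rng(3)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y_star = X @ np.array([0.5, 1.0]) + rng.normal(scale=0.8, size=200)
y = np.maximum(y_star, 0.0)

res = optimize.minimize(tobit_neg_loglik, x0=np.zeros(3), args=(X, y), method="BFGS")
print(res.x[:-1], np.exp(res.x[-1]))   # estimated beta and sigma
```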

Buckley-James Regression: The Buckley-James (BJ) regression [Buckley and James (1979)] estimates the survival times of the censored instances as response values based on the Kaplan-Meier (KM) estimation method, and then fits a linear (AFT) model using both the survival times of the uncensored instances and the approximated survival times of the censored instances. To handle high-dimensional survival data, Wang et al. [Wang et al. (2008)] applied the elastic net regularizer to BJ regression (EN-BJ).

Penalized Regression: Penalized regression methods [Kyung et al. (2010)] are well known for their nice properties of simultaneous variable selection and coefficient estimation. Penalized regression can provide better prediction results in the presence of either multi-collinearity of the covariates or high dimensionality. Recently, these methods have received a great deal of attention in survival analysis. The weighted linear regression model with different regularizers for high-dimensional censored data is an efficient method for handling censored data by giving different weights to different instances [Li et al. (2016e)]. In addition, the structured regularization based linear regression algorithm [Bach et al. (2012), Vinzamuri et al. (2017)] for right-censored data has a good ability to infer the underlying structure of the survival data.

  • Weighted Regression: The weighted regression method [Li et al. (2016b)] can be used when the constant variance assumption about the errors in ordinary least squares regression (homoscedasticity) is violated, a situation known as heteroscedasticity. Instead of minimizing the residual sum of squares, weighted regression minimizes the weighted sum of squares Σ_i w_i (y_i − X_iβ)². Ordinary least squares is the special case in which all the weights w_i = 1. The weighted regression problem can be solved in the same manner as the ordinary linear least squares problem. In addition, weighted regression allows us to assign higher weights to the instances that we want to emphasize or for which mistakes are especially costly: if we give certain samples high weights, the model is pulled towards matching those samples. This is very helpful in survival analysis for putting more emphasis on the instances whose information may contribute more to the model; a small sketch of the weighted least squares solution is given after this list.

  • Structured Regularization: The ability to effectively infer latent knowledge through tree-based hierarchies and graph-based relationships is extremely crucial in survival analysis. This is also supported by the effectiveness of structured sparsity based regularization methods in regression [Bach et al. (2012)]. Structured regularization based LInear REgression algorithm for right Censored data (SLIREC) in [Vinzamuri et al. (2017)] infers the underlying structure of the survival data directly using sparse inverse covariance estimation (SICE) method and uses the structural knowledge to guide the base linear regression model. The structured approach is more robust compared to the standard statistical and Cox based methods since it can automatically adapt to different distributions of events and censored instances.
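As referenced in the weighted-regression item above, here is a minimal numpy sketch (with illustrative weights and data) of the weighted least squares solution obtained from the weighted normal equations:

```python
import numpy as np

def weighted_least_squares(X, y, w):
    """Solve min_beta sum_i w_i * (y_i - X_i beta)^2 via the weighted normal equations."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Toy data: emphasize the last two instances with larger weights.
rng = np.random.default_rng(4)
X = np.column_stack([np.ones(10), rng.normal(size=10)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.1, size=10)
w = np.array([1.0] * 8 + [5.0, 5.0])

print(weighted_least_squares(X, y, w))   # with w_i = 1 this reduces to ordinary least squares
```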

3.3.2 Accelerated Failure Time (AFT) Model

In the parametric censored regression methods discussed previously, we assume that the survival time (or the logarithm of the survival time) of all instances in the given data follows a specific distribution and that the relationship between either the survival time or the logarithm of the survival time and the features is linear. Specifically, if the relationship between the logarithm of the survival time and the covariates is linear, the model is termed the accelerated failure time (AFT) model [Kalbfleisch and Prentice (2011)]. Thus, we consider these regression methods as generalized linear models.

The AFT model assumes that the relationship between the logarithm of the survival time and the covariates is linear and can be written in the following form:

$$\ln(T) = X\beta + \sigma\epsilon \qquad (18)$$

where X is the covariate matrix, β represents the coefficient vector, σ denotes an unknown scale parameter, and ε is an error variable which follows a similar distribution to