In many medical studies, estimating the failure time distribution function, or quantities that depend on this distribution, as a function of patient demographic and prognostic variables, is of central importance for risk assessment and health planning. Frequently, such data are subject to right censoring. The goal of this paper is to develop tools for analyzing such data using machine learning techniques.
Traditional approaches to right censored failure time analysis include parametric models, such as the Weibull distribution, and semiparametric models, such as proportional hazards models (see Lawless, 2003, for both). Even when less stringent models, such as nonparametric estimation, are used, it is typically assumed that the distribution function is smooth in both time and covariates (Dabrowska, 1987; Gonzalez-Manteiga and Cadarso-Suarez, 1994). These assumptions seem restrictive, especially when considering today’s high-dimensional data settings.
In this paper, we propose a support vector machine (SVM) learning method for right censored data. The choice of SVM is motivated by the fact that SVM learning methods are easy-to-compute techniques that enable estimation under weak or no assumptions on the distribution (Steinwart and Christmann, 2008). SVM learning methods, which we review in detail in Section 2, are a collection of algorithms that attempt to minimize the risk with respect to some loss function. An SVM learning method typically minimizes a regularized version of the empirical risk over some reproducing kernel Hilbert space (RKHS). The resulting minimizer is referred to as the SVM decision function. The SVM learning method is the mapping that assigns to each data set its corresponding SVM decision function.
We adapt the SVM framework to right censored data as follows. First, we represent the distribution’s quantity of interest as a Bayes decision function, i.e., a function that minimizes the risk with respect to a loss function. Second, we construct a data-dependent version of this loss function using inverse-probability-of-censoring weighting (Robins et al., 1994). Third, we minimize a regularized empirical risk with respect to this data-dependent loss function to obtain an SVM decision function for censored data. Finally, we define the SVM learning method for censored data as the mapping that assigns to every censored data set its corresponding SVM decision function.
Note that unlike the standard SVM decision function, the proposed censored SVM decision function is obtained as the minimizer of a data-dependent loss function. In other words, for each data set, a different minimization loss function is defined. Moreover, minimizing the empirical risk no longer consists of minimizing a sum of i.i.d. observations. Consequently, different techniques are needed to study the theoretical properties of the censored SVM learning method.
We prove a number of theoretical results for the proposed censored SVM learning method. We first prove that the censored SVM decision function is measurable and unique. We then show that the censored SVM learning method is a measurable learning method. We provide a probabilistic finite-sample bound on the difference in risk between the learned censored SVM decision function and the Bayes risk. We further show that the SVM learning method is consistent for every probability measure for which the censoring is independent of the failure time given the covariates, and the probability that no censoring occurs is positive given the covariates. Finally, we compute learning rates for the censored SVM learning method. We also provide a simulation study that demonstrates the performance of the proposed censored SVM learning method. Our results are obtained under some conditions on the approximation RKHS and the loss function, which can be easily verified. We also assume that the estimation of censoring probability at the observed points is consistent.
used neural networks. Segal (1988), Hothorn et al. (2004), Ishwaran et al. (2008), and Zhu and Kosorok (2011), among others, suggested versions of splitting trees and random forests for survival data. Johnson et al. (2004), Shivaswamy et al. (2007), Shim and Hwang (2009), and Zhao et al. (2011), among others, suggested versions of SVM different from the proposed censored SVM. The theoretical properties of most of these algorithms have never been studied. Exceptions include the consistency proof of Ishwaran and Kogalur (2010) for random survival trees, which requires the assumption that the feature space is discrete and finite. In the context of multistage decision problems, Goldberg and Kosorok (2012b) proposed a Q-learning algorithm for right censored data for which a theoretical justification is given, under the assumption that the censoring is independent of both failure time and covariates. However, neither of these theoretically justified algorithms is an SVM learning method. Therefore, we believe that the proposed censored SVM and the accompanying theoretical evaluation given in this paper represent a significant innovation in developing methodology for learning in survival data.
Although the proposed censored SVM approach enables the application of the full SVM framework to right censored data, one potential drawback is the need to estimate the censoring probability at observed failure times. This estimation is required in order to use inverse-probability-of-censoring weighting for constructing the data-dependent loss function. We remark that in many applications it is reasonable to assume that the censoring mechanism is simpler than the failure-time distribution; in these cases, estimation of the censoring distribution is typically easier than estimation of the failure distribution. For example, the censoring may depend only on a subset of the covariates, or may be independent of the covariates; in the latter case, an efficient estimator exists. Moreover, when the only source of censoring is administrative, in other words, when the data is censored because the study ends at a prespecified time, the censoring distribution is often known to be independent of the covariates. Fortunately, the results presented in this paper hold for any censoring estimation technique. We present results for both correctly specified and misspecified censoring models. We also discuss in detail the special cases of the Kaplan-Meier and the Cox model estimators (Fleming and Harrington, 1991).
While the main contribution of this paper is the proposed censored SVM learning method and the study of its properties, an additional contribution is the development of a general machine learning framework for right censored data. The principles and definitions that we discuss in the context of right censored data, such as learning methods, measurability, consistency, and learning rates, are independent of the proposed SVM learning method. This framework can be adapted to other learning methods for right censored data, as well as for learning methods for other missing data mechanisms.
The paper is organized as follows. In Section 2 we review right censored data and SVM learning methods. In Section 3 we briefly discuss the use of SVM for survival data when no censoring is present. Section 4 discusses the difficulties that arise when applying SVM to right censored data and presents the proposed censored SVM learning method. Section 5 contains the main theoretical results, including finite sample bounds and consistency. Simulations appear in Section 6. Concluding remarks appear in Section 7. The lengthier key proofs are provided in the Appendix. Finally, the Matlab code for both the algorithm and the simulations can be found in LABEL:sec:suppA.
In this section, we establish the notation used throughout the paper. We begin by describing the data setup (Section 2.1). We then discuss loss functions (Section 2.2). Finally, we discuss SVM learning methods (Section 2.3). The notation for right censored data generally follows Fleming and Harrington (1991) (hereafter abbreviated FH91). For the loss function and the SVM definitions, we follow Steinwart and Christmann (2008) (hereafter abbreviated SC08).
2.1 Data Setup
We assume the data consist of independent and identically distributed random triplets . The random vector is a covariate vector that takes its values in a set . The random variable is the observed time defined by , where is the failure time, is the censoring time, and where . The indicator is the failure indicator, where is if is true and otherwise, i.e., whenever a failure time is observed.
Let be the survival functions of , and let be the survival function of . We make the following assumptions:
takes its values in the segment for some finite , and .
is independent of , given .
The first assumption assures that there is a positive probability of censoring over the observation time range (). Note that the existence of such a is typical since most studies have a finite time period of observation. In the above, we also define to be the left-hand limit of a right continuous function with left-hand limits. The second assumption is standard in survival analysis and ensures that the joint nonparametric distribution of the survival and censoring times, given the covariates, is identifiable.
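The data setup above can be made concrete with a small simulation. The following is only a minimal sketch: the names `Z`, `T`, `C`, `U`, `delta`, and `tau`, as well as the chosen distributions, are assumptions made for this illustration and are not taken from the paper's simulations. The censoring time is bounded by a finite constant and has positive probability of reaching that bound, in the spirit of the first assumption.

```python
import numpy as np

def simulate_censored(n, tau=2.0, seed=0):
    """Simulate n right censored triplets (covariate, observed time, failure indicator).

    Hypothetical setup: one uniform covariate, exponential failure times whose
    mean depends on the covariate, and exponential censoring truncated at tau,
    so that every subject still under observation at tau is censored at tau.
    """
    rng = np.random.default_rng(seed)
    Z = rng.uniform(0.0, 1.0, size=n)                    # covariate
    T = rng.exponential(scale=0.5 + Z)                   # failure time
    C = np.minimum(rng.exponential(2.0, size=n), tau)    # censoring time, at most tau
    U = np.minimum(T, C)                                 # observed time
    delta = (T <= C).astype(int)                         # 1 if the failure is observed
    return Z, U, delta

Z, U, delta = simulate_censored(500)
```

Here failure and censoring times are drawn independently given the covariate, which is one simple way to satisfy the second assumption.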
We assume that the censoring mechanism can be described by some simple model. Below, we consider two possible examples, although the main results do not require any specific model. First, we need some notation. For every , define and . Note that since we are interested in the survival function of the censoring variable, is the counting process for the censoring, and not for the failure events, and is the at-risk process for observing a censoring time. For a cadlag function on , define the product integral (van der Vaart and Wellner, 1996). Define to be the empirical measure, i.e., . Define to be the expectation of with respect to .
Independent censoring: Assume that is independent of both and . Define
Then is the Kaplan-Meier estimator for . is a consistent and efficient estimator for the survival function (FH91).
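As an illustration of this example, the sketch below computes a Kaplan-Meier estimate of the survival function of the censoring variable, treating censoring events (failure indicator equal to zero) as the events of interest. The function and argument names are hypothetical, and the sketch assumes there are no ties between failure and censoring times.

```python
import numpy as np

def km_censoring_survival(U, delta):
    """Kaplan-Meier estimate of the censoring survival function P(C > t).

    Censoring events (delta == 0) play the role of 'events'; observed
    failures only remove subjects from the at-risk set.
    Assumes no ties between failure and censoring times.
    """
    U = np.asarray(U, dtype=float)
    cens = 1 - np.asarray(delta, dtype=int)
    times = np.unique(U[cens == 1])                            # distinct censoring times
    at_risk = np.array([(U >= t).sum() for t in times])        # subjects with U >= t
    events = np.array([((U == t) & (cens == 1)).sum() for t in times])
    # surv[k] is the estimate after the first k censoring times (surv[0] = 1)
    surv = np.concatenate(([1.0], np.cumprod(1.0 - events / at_risk)))

    def G_hat(t, left_limit=False):
        # right-continuous step function; left_limit=True evaluates just before t
        side = "left" if left_limit else "right"
        return surv[np.searchsorted(times, t, side=side)]

    return G_hat

G_hat = km_censoring_survival([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0])
```

The `left_limit` option is included because inverse-probability-of-censoring weights are usually evaluated just before the observed failure time.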
The proportional hazards model: Consider the case that the hazard of given is of the form for some unknown vector and some continuous unknown nondecreasing function with and . Let be the zero of the estimating equation
Then is a consistent and efficient estimator for the survival function (FH91).
Even when no simple form for the censoring mechanism is assumed, the censoring distribution can be estimated using a generalization of the Kaplan-Meier estimator of Example 1.
Generalized Kaplan-Meier: Let be a kernel function of width . Define and . Define
Then the generalized Kaplan-Meier estimator is given by , where the product integral is defined for every fixed . Under some conditions, Dabrowska (1987, 1989) proved consistency of the estimator and discussed its convergence rates.
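A kernel-weighted Kaplan-Meier estimator of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: it uses a scalar covariate, a Gaussian kernel of width `h`, and assumes no tied observation times; all names are hypothetical.

```python
import numpy as np

def generalized_km(U, delta, Z, z, h):
    """Kernel-weighted Kaplan-Meier estimate of P(C > t | Z = z).

    Hypothetical choices: scalar covariate, Gaussian kernel of width h,
    and no tied observation times.
    """
    U = np.asarray(U, dtype=float)
    Z = np.asarray(Z, dtype=float)
    cens = 1 - np.asarray(delta, dtype=int)
    w = np.exp(-0.5 * ((Z - z) / h) ** 2)         # kernel weight of each subject at z
    order = np.argsort(U)
    U, cens, w = U[order], cens[order], w[order]
    at_risk = np.cumsum(w[::-1])[::-1]            # weighted mass of subjects with U_j >= U_i
    factors = np.where(cens == 1, 1.0 - w / at_risk, 1.0)
    surv = np.concatenate(([1.0], np.cumprod(factors)))

    def G_hat(t):
        # product over weighted censoring events at times <= t
        return surv[np.searchsorted(U, t, side="right")]

    return G_hat

# With a very wide kernel all weights are nearly equal, so the estimate
# essentially reduces to the ordinary Kaplan-Meier estimator.
G_wide = generalized_km([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0],
                        [0.1, 0.4, 0.6, 0.9], z=0.5, h=1e6)
```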
Usually we denote the estimator of the survival function of the censoring variable by without referring to a specific estimation method. When needed, the specific estimation method will be discussed. When independent censoring is assumed, as in Example 1, we denote the estimator by .
By Assumption (A1), , and thus if the estimator is consistent for , then, for all large enough, . In the following, for simplicity, we assume that the estimator is such that . In general, one can always replace by , where . In this case, for all large enough, and for all , .
2.2 Loss Functions
Let the input space be a measurable space. Let the response space be a closed subset of . Let be a measure on .
A function is a loss function if it is measurable. We say that a loss function is convex if is convex for every and . We say that a loss function is locally Lipschitz continuous with local Lipschitz constant function if for every
We say that is Lipschitz continuous if there is a constant such that the above holds for any with .
For any measurable function we define the -risk of with respect to the measure as . We define the Bayes risk of with respect to loss function and measure as , where the infimum is taken over all measurable functions . A function that achieves this infimum is called a Bayes decision function.
We now present a few examples of loss functions and their respective Bayes decision functions. In the next section we discuss the use of these loss functions for right censored data.
Binary classification: Assume that . We would like to find a function such that for almost every , . One can think of as a function that predicts the label of a pair when only is observed. In this case, the desired function is the Bayes decision function with respect to the loss function . In practice, since the loss function is not convex, it is usually replaced by the hinge loss function .
Expectation: Assume that . We would like to estimate the expectation of the response given the covariates . The conditional expectation is the Bayes decision function with respect to the squared error loss function .
Median and quantiles: Assume that . We would like to estimate the median of . The conditional median is the Bayes decision function for the absolute deviation loss function . Similarly, the -quantile of given is obtained as the Bayes decision function for the loss function
Note that the functions , , , and for are all convex. Moreover, all these functions except are Lipschitz continuous, and is locally Lipschitz continuous when is compact.
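The loss functions in the examples above are simple to write down in code. The sketch below uses hypothetical function names; `y` denotes the response, `s` the value of the decision function, and `alpha` the quantile level.

```python
import numpy as np

def hinge_loss(y, s):
    """Hinge loss for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * s)

def squared_loss(y, s):
    """Squared error loss; its Bayes decision function is the conditional mean."""
    return (y - s) ** 2

def absolute_loss(y, s):
    """Absolute deviation loss; its Bayes decision function is the conditional median."""
    return np.abs(y - s)

def quantile_loss(y, s, alpha):
    """Quantile (pinball) loss for level alpha in (0, 1)."""
    r = y - s
    return np.where(r >= 0, alpha * r, (alpha - 1.0) * r)

# Minimizing the average absolute deviation over constants recovers the sample median.
y = np.array([0.0, 1.0, 2.0, 10.0, 3.0])
grid = np.linspace(0.0, 10.0, 1001)
best = grid[np.argmin([absolute_loss(y, s).mean() for s in grid])]
```

With `alpha = 0.5` the quantile loss is half the absolute deviation loss, so both lead to the median.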
2.3 Support Vector Machine (SVM) Learning Methods
Let be a convex locally Lipschitz continuous loss function. Let be a separable reproducing kernel Hilbert space (RKHS) of a bounded measurable kernel on (for details regarding RKHS, the reader is referred to SC08, Chapter 4).
Let be a set of i.i.d. observations drawn according to the probability measure . Fix and let be as above. Define the empirical SVM decision function
is the empirical risk.
For some sequence , define the SVM learning method , as the map
for all . We say that is measurable if it is measurable for all with respect to the minimal completion of the product -field on . We say that is (-risk) -consistent if for all
We say that is universally consistent if for all distributions on , is -consistent.
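For intuition, the empirical SVM decision function has a closed form when the loss is the squared error loss. By the representer theorem, the minimizer is a kernel expansion over the observed inputs, and the coefficients solve a linear system. The sketch below is a minimal illustration under these assumptions; the Gaussian kernel, the names `gaussian_kernel` and `fit_ls_svm`, and the parameters `lam` and `gamma` are choices made for this example only.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def fit_ls_svm(X, y, lam, gamma=1.0):
    """Minimize lam * ||f||_H^2 + (1/n) * sum_i (y_i - f(x_i))^2 over the RKHS.

    Writing f = sum_j coef_j k(., x_j), the first-order condition gives
    (K + n * lam * I) coef = y.
    """
    n = len(y)
    K = gaussian_kernel(X, X, gamma)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)
    return lambda Xnew: gaussian_kernel(Xnew, X, gamma) @ coef

# Well-separated inputs make the kernel matrix almost the identity, so the
# fitted values are the responses shrunk by the factor 1 / (1 + n * lam).
X = np.array([[0.0], [10.0], [20.0]])
y = np.array([1.0, -2.0, 3.0])
f = fit_ls_svm(X, y, lam=1.0 / 3.0)
```

For other convex losses, such as the hinge or absolute deviation loss, no closed form is available and the minimization is typically carried out with a convex optimization solver.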
We now briefly summarize some known results regarding SVM learning methods needed for our exposition. More advanced results can be obtained using conditions on the functional spaces and clipping. We will discuss these ideas in the context of censoring in Section 5.
Let be a convex Lipschitz continuous loss function such that is uniformly bounded. Let be a separable RKHS of a bounded measurable kernel on the set . Choose such that , and . Then
The empirical SVM decision function exists and is unique.
The SVM learning method defined in (2) is measurable.
The -risk .
If the RKHS is dense in the set of integrable functions on , then the SVM learning method is universally consistent.
3 SVM for Survival Data without Censoring
In this section we present a few examples of the use of SVM for survival data but without censoring. We show how different quantities obtained from the conditional distribution of given can be represented as Bayes decision functions. We then show how SVM learning methods can be applied to these estimation problems and briefly review theoretical properties of such SVM learning methods. In the next section we will explain why these standard SVM techniques cannot be employed directly when censoring is present.
Let be a random vector where is a covariate vector that takes its values in a set , is a survival time that takes its values in for some positive constant , and where is distributed according to a probability measure on .
Note that the conditional expectation is the Bayes decision function for the least squares loss function . In other words
where the minimization is taken over all measurable real functions on (see Example 6). Similarly, the conditional median and the -quantile of can be shown to be the Bayes decision functions for the absolute deviation function and , respectively (see Example 7). In the same manner, one can represent other quantities of the conditional distribution using Bayes decision functions.
Defining quantities computed from the survival function as Bayes decision functions is not limited to regression (i.e., to a continuous response). Classification problems can also arise in the analysis of survival data (see, for example, Ripley and Ripley, 2001; Johnson et al., 2004). For example, let , , be a cutoff constant. Assume that survival to a time greater than is considered as death unrelated to the disease (i.e., remission) and a survival time less than or equal to is considered as death resulting from the disease. Denote
In this case, the decision function that predicts remission when the probability of given the covariates is greater than and failure otherwise is a Bayes decision function for the binary classification loss of Example 5.
Let be a data set of i.i.d. observations distributed according to . Let where is some deterministic measurable function. For regression problems, is typically the identity function and for classification can be defined, for example, as in (4). Let be a convex locally Lipschitz continuous loss function, . Note that this includes the loss functions , , and . Define the empirical decision function as in (1) and the SVM learning method as in (2). Then it follows from Theorem 8 that for an appropriate RKHS and regularization sequence , is measurable and universally consistent.
4 Censored SVM
In the previous section, we presented a few examples of the use of SVM for survival data without censoring. In this section we explain why standard SVM techniques cannot be applied directly when censoring is present. We then explain how to use inverse probability of censoring weighting (Robins et al., 1994) to obtain a censored SVM learning method. Finally, we show that the obtained censored SVM learning method is well defined.
Let be a set of i.i.d. random triplets of right censored data (as described in Section 2.1). Let be a convex locally Lipschitz loss function. Let be a separable RKHS of a bounded measurable kernel on . We would like to find an empirical SVM decision function. In other words, we would like to find the minimizer of
where is a fixed constant, and is a known function. The problem is that the failure times may be censored, and thus unknown. While a simple solution is to ignore the censored observations, it is well known that this can lead to severe bias (Tsiatis, 2006).
In order to avoid this bias, one can reweight the uncensored observations. Note that at time , the -th observation has probability not to be censored, and thus, one can use the inverse of the censoring probability for reweighting in (5) (Robins et al., 1994).
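The reweighting argument can be made explicit with a short calculation. The notation here is assumed for illustration only: $T$ is the failure time, $C$ the censoring time, $Z$ the covariates, $\delta = \mathbf{1}\{T \le C\}$ the failure indicator, $G(t \mid z) = P(C > t \mid Z = z)$ the censoring survival function with left-hand limit $G(t^- \mid z) = P(C \ge t \mid Z = z)$, and $L$ a loss function evaluated at a decision function $f$. Using the conditional independence of $C$ and $T$ given $Z$, and assuming $G(T^- \mid Z) > 0$:

```latex
\begin{align*}
E\!\left[\frac{\delta\, L\bigl(Z, T, f(Z)\bigr)}{G(T^- \mid Z)} \,\Big|\, Z, T\right]
  &= \frac{L\bigl(Z, T, f(Z)\bigr)}{G(T^- \mid Z)}\, P(C \ge T \mid Z, T) \\
  &= \frac{L\bigl(Z, T, f(Z)\bigr)}{G(T^- \mid Z)}\, G(T^- \mid Z)
   = L\bigl(Z, T, f(Z)\bigr).
\end{align*}
```

Taking expectations on both sides shows that, when the true censoring survival function is used, the weighted loss computed from uncensored observations has the same expectation as the loss computed from the (possibly unobserved) failure times.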
More specifically, define the random loss function by
where is the estimator of the survival function of the censoring variable based on the set of random triplets (see Section 2.1). When is given, we denote . Note that in this case the function is no longer random. In order to show that is a loss function, we need to show that is a measurable function.
Let be a convex locally Lipschitz loss function. Assume that the estimation procedure is measurable. Then for every the function is measurable.
By Remark 4, the function is well defined. Since by definition, both and are measurable, we obtain that is measurable. ∎
We define the empirical censored SVM decision function to be
The existence and uniqueness of the empirical censored SVM decision function is ensured by the following lemma:
Let be a convex locally Lipschitz loss function. Let be a separable RKHS of a bounded measurable kernel on . Then there exists a unique empirical censored SVM decision function.
Note that given , the loss function is convex for every fixed , , and . Hence, the result follows from Lemma 5.1 together with Theorem 5.2 of SC08. ∎
Note that the empirical censored SVM decision function is just the empirical SVM decision function of (1), after replacing the loss function with the loss function . However, there are two important implications to this replacement. Firstly, empirical censored SVM decision functions are obtained by minimizing a different loss function for each given data set. Secondly, the second expression in the minimization problem (6), namely,
is no longer constructed from a sum of i.i.d. random variables.
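To make the construction concrete, the following sketch computes an empirical censored SVM decision function for the squared error loss, where the weighted minimization problem has a closed form. It is an illustration under several assumptions that are not prescribed by the paper: a Gaussian kernel, a Kaplan-Meier estimate of the censoring survival function with no tied times, the failure time itself as the response, and a hypothetical lower bound `floor` on the estimated censoring survival function. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

def km_censoring_left_limit(U, delta):
    """Kaplan-Meier estimate of P(C >= U_i) for each subject i (no ties assumed)."""
    U = np.asarray(U, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(U)
    order = np.argsort(U)
    cens_sorted = 1 - delta[order]                    # 1 where a censoring time is observed
    at_risk = n - np.arange(n)                        # n, n-1, ..., 1 at the sorted times
    surv = np.cumprod(1.0 - cens_sorted / at_risk)    # estimate at each sorted time
    left = np.concatenate(([1.0], surv[:-1]))         # estimate just before each sorted time
    out = np.empty(n)
    out[order] = left
    return out

def fit_censored_ls_svm(Z, U, delta, lam, gamma=1.0, floor=1e-3):
    """Minimize lam * ||f||_H^2 + (1/n) * sum_i w_i * (U_i - f(Z_i))^2,
    with weights w_i = delta_i / G_hat(U_i-).

    Writing f = sum_j coef_j k(., Z_j), the first-order condition gives
    (W K + n * lam * I) coef = W U, where W = diag(w).
    """
    U = np.asarray(U, dtype=float)
    delta = np.asarray(delta, dtype=int)
    n = len(U)
    G_left = np.maximum(km_censoring_left_limit(U, delta), floor)
    w = delta / G_left                                # censored observations get weight 0
    K = gaussian_kernel(Z, Z, gamma)
    coef = np.linalg.solve(w[:, None] * K + n * lam * np.eye(n), w * U)
    predict = lambda Znew: gaussian_kernel(Znew, Z, gamma) @ coef
    return predict, coef

Z = np.array([[0.0], [10.0], [20.0], [30.0]])
U = np.array([1.0, 2.0, 3.0, 4.0])
delta = np.array([1, 0, 1, 0])
predict, coef = fit_censored_ls_svm(Z, U, delta, lam=0.25)
```

In this sketch the censored observations receive zero weight in the loss term, so their kernel coefficients are zero; they influence the fit only through the estimated censoring survival function, which is why the weights, and hence the loss, depend on the whole data set.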
We would like to show that the learning method defined by the empirical censored SVM decision functions is indeed a learning method. We first define the term learning method for right censored data or censored learning method for short.
A censored learning method on maps every data set , , to a function .
Choose such that . Define the censored SVM learning method , as for all . The measurability of the censored SVM learning method is ensured by the following lemma, which is an adaptation of Lemma 6.23 of SC08 to the censored case.
Let be a convex locally Lipschitz loss function. Let be a separable RKHS of a bounded measurable kernel on . Assume that the estimation procedure is measurable. Then the censored SVM learning method is measurable, and the map is measurable.
First, by Lemma 2.11 of SC08, for any , the map is measurable. The survival function is measurable on and by Remark 4, the function is well defined and measurable. Hence is measurable. Note that the map where is also measurable. Hence we obtain that the map , defined by
is measurable. By Lemma 10, is the only element of satisfying
By Aumann’s measurable selection principle (SC08, Lemma A.3.18), the map is measurable with respect to the minimal completion of the product -field on . Since the evaluation map is measurable (SC08, Lemma 2.11), the map is also measurable. ∎
5 Theoretical Results
In the following, we discuss some theoretical results regarding the censored SVM learning method proposed in Section 4. In Section 5.1 we discuss function clipping which will serve as a tool in our analysis. In Section 5.2 we discuss finite sample bounds. In Section 5.3 we discuss consistency. Learning rates are discussed in Section 5.4. Finally, censoring model misspecification is discussed in Section 5.5.
5.1 Clipped Censored SVM Learning Method
In order to establish the theoretical results of this section we first need to introduce the concept of clipping. We say that a loss function can be clipped at , if, for all ,
where denotes the clipped value of at , that is,
In our context the response variable usually takes its values in a bounded set (see Section 3). When the response space is bounded, we have the following criterion for clipping. Let be a distance-based loss function, i.e., for some function . Assume that . Then can be clipped at some (SC08, Chapter 2).
Moreover, when the sets and are compact, we have the following criterion for clipping which is usually easy to check.
Let and be compact. Let be continuous and strictly convex, with a bounded minimizer for every . Then can be clipped at some .
See proof in Appendix A.3.
For a function , we define to be the clipped version of , i.e., . Finally, we note that the clipped censored SVM learning method, that maps every data set , , to the function is measurable, where is the clipped version of defined in (6). This follows from Lemma 12, together with the measurability of the clipping operator.
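Clipping itself is a one-line operation. The sketch below, with hypothetical names `clip_values` and `M`, clips predicted values to the interval from `-M` to `M` and checks numerically that, for the squared error loss and responses bounded by `M`, clipping never increases the loss.

```python
import numpy as np

def clip_values(s, M):
    """Clip values to [-M, M]: -M below the interval, M above it, unchanged inside."""
    return np.clip(s, -M, M)

rng = np.random.default_rng(0)
M = 1.0
y = rng.uniform(-M, M, size=1000)        # responses bounded by M
s = 3.0 * rng.standard_normal(1000)      # unclipped predictions, possibly far outside [-M, M]

loss_clipped = (y - clip_values(s, M)) ** 2
loss_raw = (y - s) ** 2
```

Because every response lies inside the clipping interval, moving a prediction to the nearest point of that interval can only bring it closer to the response.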
5.2 Finite Sample Bounds
We would like to establish a finite-sample bound for the generalization of clipped censored SVM learning methods. We first need some notation. Define the censoring estimation error
to be the difference between the estimated and true survival functions of the censoring variable.
Let be an RKHS over the covariates space . Define the -th dyadic entropy number as the infimum over , such that can be covered with no more than balls of radius with respect to the metric induced by the norm. For a bounded linear transformation where is a normed space, we define the dyadic entropy number as . For details, the reader is referred to Appendix 5.6 of SC08.
Define the Bayes risk , where the infimum is taken over all measurable functions . Note that Bayes risk is defined with respect to both the loss and the distribution . When a function exists such that we say that is a Bayes decision function.
We need the following assumptions:
The loss function is a locally Lipschitz continuous loss function that can be clipped at such that the supremum bound
holds for all and for some . Moreover, there is a constant such that
for all and for some .
is a separable RKHS of a measurable kernel over and is a distribution over for which there exist constants and such that
for all and ; and where is shorthand for the function .
There are constants and , such that for all the following entropy bound holds:
where is the embedding of into the space of square integrable functions with respect to the empirical measure .
Before we state the main result of this section, we present some examples for which the assumptions above hold:
We are now ready to establish a finite sample bound for the clipped censored SVM learning methods:
Let be a loss function and be an RKHS such that assumptions (B1)–(B3) hold. Let satisfy for some . Let be an estimator of the survival function of the censoring variable and assume (A1)–(A2). Then, for any fixed regularization constant , , and , with probability not less than ,
where is a constant that depends only on , , , , and .
The proof appears in Appendix A.2.
Specifically, let be the Kaplan-Meier estimator. Let be a lower bound on the survival function at . Then, for every and the following Dvoretzky-Kiefer-Wolfowitz-type inequality holds (Bitouzé et al., 1999, Theorem 2):
As a result, we obtain the following corollary:
Consider the setup of Theorem 17. Assume that the censoring variable is independent of both and . Let be the Kaplan-Meier estimator of . Then for any fixed regularization constant , , and , with probability not less than ,
where is a constant that depends only on , , , , and .
5.3 -universal Consistency
In this section we discuss consistency of the clipped version of the censored SVM learning method proposed in Section 4. In general, -consistency means that (3) holds for all . Universal consistency means that the learning method is -consistent for every probability measure on . In the following we discuss a more restrictive notion than universal consistency, namely -universal consistency. Here,
is the set of all probability distributions for which there is a constant such that conditions (A1)–(A2) hold. We say that a censored learning method is -universally consistent if (3) holds for all . We note that when the first assumption is violated for a set of covariates with positive probability, there is no hope of learning the optimal function for all , unless some strong assumptions on the model are enforced. The second assumption is required for proving consistency of the learning method proposed in Section 4. However, it is possible that other censored learning techniques will be able to achieve consistency for a larger set of probability measures.
In order to show -universal consistency, we utilize the bound given in Theorem 17. We need the following additional assumptions:
For all distributions on , .
is consistent for and there is a finite constant such that for any .
Before we state the main result of this section, we present some examples for which the assumptions above hold:
Assume that is compact. A continuous kernel whose corresponding RKHS is dense in the class of continuous functions over the compact set is called universal. Examples of universal kernels include the Gaussian kernels, and other Taylor kernels. For more details, the reader is referred to SC08, Chapter 4.6. For universal kernels, Assumption (B4) holds for , , , and (SC08, Corollary 5.29).
Assume that is consistent for . When is the Kaplan-Meier estimator, Assumption (B5) holds for all (Bitouzé et al., 1999, Theorem 3). Similarly, when is the proportional hazards estimator (see Example 2), under some conditions, Assumption (B5) holds for all (see Goldberg and Kosorok, 2012a, Theorem 3.2 and its conditions). When is the generalized Kaplan-Meier estimator (see Example 3), under strong conditions on the failure time distribution, Dabrowska (1989) showed that Assumption (B5) holds for all where is the dimension of the covariate space (see Dabrowska, 1989, Corollary 2.2 and the conditions there). Recently, Goldberg and Kosorok (2013) relaxed these assumptions and showed that Assumption (B5) holds for all where satisfies
where is the survival function of given (see Goldberg and Kosorok, 2013, for the conditions).
Now we are ready for the main result.
Define the approximation error
By Theorem 17, for we obtain
for any fixed regularization constant , , and , with probability not less than .