The volume and velocity of information about individual patients or customers are greatly increasing with use of electronic records and personal device. Potential benefits of utilizing such information could be numerous, ranging from the ability to determine large-scale effects of treatment to the ability to monitor real-time effects of treatment on a general population. In the context of this wealth of real-world data (RWD), it is often necessary to conduct large-scale or online regression to unveil real-world evidence (RWE). For large-scale or online survival data, the response variable is survival time and is subject to possible right censoring. Letbe the survival time, the censoring time, and the
-vector of covariates. Define, , and . Suppose that the data consist of independent and identically distributed (i.i.d.) copies of and denote it by for large-scale data and for online data, respectively. Here denotes the size of the dataset and is assumed to be large. We consider the accelerated failure time model which postulates
where the stochastic error is independent of and its distribution is left unspecified. Because it provides a natural formulation of the effects of covariates on potentially censored response variable, the model (1.1
), along with the Cox proportional hazards model, are two main approaches to the regression analysis of censored data(Kalbfleisch and Prentice, 2002; Zeng and Lin, 2007). Define the residual . Let denote the counting and at risk processes of the residual, respectively. Write
For classical batch-based methods, the estimation and inference in model (1.1) often centers on solving the weighted rank-based estimating equations which take the form
Solving (1.3) is equivalent to minimizing the objective function
. The optimization can be formulated as a linear programming problem and solved by standard statistical packages(Jin et al., 2003; Koenker, 2005). This approach yields numerically efficient estimation and inference procedures for the accelerated failure time model. However, it requires the entire dataset to be stored in memory and the computational complexity is of order . When is extremely large, as in large-scale RWD, or in an online setting, as in streaming RWD, these batch-based estimation methods become numerically infeasible. Instead, online learning tools avoid the problem of managing the large-scale data exceeding the size of the memory and is applicable to the streaming data where the observations arrive sequentially by sharing the property of analyzing one observation at a time (Bottou and Le Cun, 2005). As an online learning tool, stochastic gradient descent (SGD) algorithms have recently regained a great deal of attention in the statistical community for analyzing big data since nowadays it is becoming increasingly prevalent in practice to manage and process big data that are much larger than the memory of a typical PC (Bottou, 2010). Stochastic gradient descent, as a stochastic approximation method, processes one data point at a time upon its arrival. For example, suppose that we have i.i.d. observations, denotes the model parameter, and is minus log likelihood of the th observation, . The maximum likelihood estimates of can then be solved by minimizing the objective function
Instead of using the Newton-Raphson algorithm to directly minimize , the SGD method calculates the estimates by recursively updating the estimates upon the arrival of each observation, starting with some initial estimates , for ,
where is some learning rate. This approach provides a numerically convenient and memory efficient approach for large-scale or online applications. The estimator or its variants has been shown to exhibit good properties such as asymptotic consistency and normality under some regularity conditions (Ruppert, 1988; Polyak and Juditsky, 1992)2011) and Fang et al. (2018). It is important to note that stochastic gradient descent is not directly applicable to the estimation based on (1.3) because it involves the calculation of the ranks of and upon the arrival of each data point, the recursive updating cannot be carried out without resorting to the entire dataset. To be more specific, denote the gradient of the objective function (1.4) that involves the observation by . We find that and it involves all the observations in the set . To evaluate , we need to have the entire set of observations. However, having the entire set of observations is what the SGD method attempts to avoid. To address this problem and overcome the difficulty of the original SGD method, we propose a new strategy which retains the numerical simplicity of SGD in recursive and online updating as well as caters to the specific nature of rank-based analysis of survival data. Therefore, the proposed estimation procedure scales well for large-scale and streaming survival data. Furthermore, apart from scalable estimation, there remains the core inferential need to assess the quality and quantify the uncertainty of the proposed SGD stimator. The ability to assess estimator quality efficiently is essential to allow efficient use of available resources by processing only as much data as is necessary to achieve a desired accuracy or confidence. We propose an online resampling method which allows scalable inference in an online and parallel manner. It preserves the automatic nature of the original bootstrap and is thus applicable to a wide variety of inferential problems.
The rest of this paper is organized as follows. In Section 2, we propose a scalable SGD method for large-scale or online survival data and study the asymptotic properties of the proposed estimator. In Section 3, we propose an online resampling strategy and establish the theory to justify its validity. Simulation studies and an application to the Surveillance, Epidemiology, and End Results (SEER) breast cancer data are provided in Section 4 to examine the performance of the proposed method. All the proofs are presented in the Appendix.
Because the summands in (1.2) or (1.3) involve more than one data point, the original SGD approach which sequentially makes use of one data point for each recursive updating is no longer applicable. To address this problem, we propose a new stochastic gradient descent method which sequentially updates the estimates upon the arrival of every data points. Here is fixed and set to be greater than to allow the recursive updating to be effectively carried out based on (1.2) or (1.3). For ease of exposition, we consider the large-scale setting but the method can be applied to the online setting in a similar fashion. Without loss of generality, it is assumed that . The data consist of i.i.d. copies of . Let denote the data points in the th updating, . Based on , it can be shown that the true regression parameter minimizes the expectation of the objective function
which has the gradient
In the same spirit of the original SGD, we propose the following recursive algorithm. Starting from some initial estimate , for , we update the estimate via
which can also be recursively updated given that , . It can be seen that the proposed method retains the appealing properties of SGD such as numerical convenience and memory efficiency. It scales well with the size of the dataset and is readily applicable to large-scale and online applications. Next we study its asymptotic properties.
2.2 Limiting distribution
It is assumed that is independent of conditional on . Throughout the paper, we shall use , and to denote the distribution, density and survival functions of , respectively. The conditional distribution, density and survival functions of given are denoted by and , respectively. Let , and be the Hessian matrix of . Denote the true parameter by . Let and . We make the following assumptions.
is bounded and the matrix is full rank.
The density function of and its derivative are bounded.
The conditional density function of and its derivative are bounded.
The matrix is strictly positive definite.
The learning rate are chosen as with and .
We prove the asymptotic normality of the SGD estimator as follows.
Under Assumptions A1-A5, we have (i)
2.3 Asymptotic relative efficiency
Let denote the classical batch-based estimator which minimizes the objective function (1.4) involving the entire dataset. Let , where . As , it is known that
is asymptotically normally distributed with mean zero and the covariance, where
and is the hazard function of , and (Tsiatis et al., 1990; Ying, 1993). Therefore both the classical batched-based method and the proposed SGD method yield asymptotically unbiased and normally distributed estimates. For any given , when comparing these two methods in estimating , a measure of asymptotic relative efficiency of the proposed SGD method relative to the classical batch-based method can be defined as
For any , and when , RE(k) .
The above theorem suggests the efficiency of the proposed method when is large. We also use Monte-carlo method to evaluate numerically in Section 4.1.
In order to conduct statistical inference with the proposed SGD estimator , we propose an online bootstrap resampling procedure, which recursively updates the SGD estimate as well as a large number of randomly perturbed SGD estimates, upon the arrival of every observations. The resampling strategy based on the random perturbation has also been widely used for inference in classical batch-based methods (Rao and Zhao, 1992; Jin et al., 2003; Peng and Huang, 2008). Specifically, let2.3), with , upon receiving , we recursively updates randomly perturbed SGD estimates,
We will show that and converge in distribution to the same limiting distribution. In practice, these results allow us to estimate the distribution of by generating a large number, say , of random samples of . We obtain by sequentially updating perturbed SGD estimates for each sample ,
and then approximate the sampling distribution of using the empirical distribution of . Specifically, the covariance matrix of can be estimated by the sample covariance matrix constructed from . Estimating the distribution of based on the distribution of leads to the construction of confidence regions for . The resulting inferential procedure retains the numerical simplicity of the SGD method, only using one pass over every data points. The proposed inferential procedure scales well for datasets with millions of data points or more, and its theoretical validity can be justified with mild regularity conditions as shown in the next section.
3.1 Theoretical justification
In this section, we derive some theoretical properties of , justifying that the conditional distribution of given data can approximate the sampling distribution of , under the following assumptions. Let be the Euclidean norm for vectors and the operator norm for matrices. From Theorem 2.1, we can conduct statistical inference based on provided that we can estimate the covariance matrix . Because involves the unknown hazard function, to bypass the difficulty of nonparametric smoothing in estimating the hazard function, we can use some resampling procedure to approximate the sampling distribution of . We first derive the asymptotically linear representation of for any perturbation variables that are i.i.d. random variables satisfying that .
If Assumptions A1-A5 hold, and the perturbation variables, , are non-negative i.i.d. random variables satisfying that , then we have,
By Theorem 3.1, letting , we derive the following representation for ,
denote the conditional probability and expectation given the data. Starting from (3.9), we derive the following theorem.
If Assumptions A1-A6 hold, then we have
By Theorem 3.2, the Kolmogorov-Smirnov distance between and converges to zero in probability. This validates our proposal of the perturbation-based resampling procedure for inference with the proposed SGD estimator .
4 Numerical studies
4.1 Simulation studies
Extensive simulation studies are conducted to assess the operating characteristics of the proposed SGD methods. We generate the failure time from the model
where and are independent standard normal random variables, , and
follows the standard normal, logistic or extreme-value distribution. The censoring time is generated from the uniform distribution to yield a censoring proportion ofor . We consider the learning rate , the sample size and let to examine how the proposed procedures are influenced by different choices of in practice. For each simulation setting, we repeat the data generation times. For each data repetition, we use as random weights and generate
copies of random weights from the standard exponential distribution. Then, for each data repetition, we obtain the proposed SGD estimate (2.4) and apply the online resampling procedure to construct confidence intervals. We report the bias, standard deviation and the empirical coverage probability (Cov_P) of interval estimation at 95% confidence level. The simulation results are summarized in Table 1 for the standard normal error, Table 2 for the standard logistic error and Table 3 for the standard extreme-value error, respectively. From Table 1, Table 2 and Table 3, we see that, the estimation is quite accurate, the estimation accuracy is robust to varying choices of , and the empirical coverage probabilities are close to the nominal level . This indicates the good performance of the proposed SGD-based estimation and inference procedures.
Next, we examine the computational scalability of the proposed method and compare it with the batch-based method. Note that the computational complexity of the batch-based method is . By Jin et al. (2003), the optimization of (1.4) can be formulated as a linear programming problem and we use the function in software to obtain the estimator. We let the error follows the standard normal distribution and the censoring proportion is set to be . The data are generated as before and we repeat the data generation times. We use Batch to denote the classical batch-based method and use to denote the proposed SGD method which updates the estimates every data points. The average computation time per petition is summarized in Table 4 for varying and .
From Table 4, we can see that the proposed SGD method scales well with the size of the dataset, the time cost increases linearly with and is robust to varying choices of . However, for the classical batch-based method, the computational burden is dramatically increased when the sample size increases. It becomes computationally inefficient or prohibitive when is greater than , mainly due to memory and time restrictions for the computation in practice.
Finally, we evaluate the relative efficiency of the proposed SGD estimation method to the batch-based estimation method (1.4). Because when , it is numerically inefficient to obtain the batch-based estimates, we use the asymptotic relative efficiency formula in Section 2.3 for the assessment. Because the formula involves the unknown density function and the censoring distribution, we use the Monte-Carlo method to evaluate it and assess the impact of on the asymptotic relative efficiency. The results are summarized in Table 5. We can see that the performance of the proposed SGD method is quite robust to varying choices of if gauged by the asymptotic relative efficiency and when is moderately large such as or , is close to . This affirms our theoretical result in Section 2.3 and indicates that in addition to its superior computational advantages, the proposed method performs also well in terms of the estimation efficiency.
4.2 An application to the SEER breast cancer data
We applied the proposed method to the breast cancer data collected in the U.S. National Cancer Institute’s Surveillance, Epidemiology, and End Results-SEER Public‐Use Database. The SEER program collects cancer incidence and survival data from population-based cancer registries covering approximately 34.6% of the population of the United States. The database records a number of patient and tumor characteristics such as demographics, primary tumor site, tumor morphology, stage at diagnosis, and first course of treatment, and the registries follow up with patients for vital status. With such a large-scale dataset, it offers a unique opportunity to examine the effect of patient and tumor characteristics on survival which is of most relevance in many cancer survival studies (Rosenberg et al., 2005; McCready et al., 2000).
We consider the data collected from 1998 to 2015 and select the data by the following criteria: 1) Races of either white or black; 2) Tumor size of less than 2cm; 3) Cancer stages of In situ, Localized, Regional, Distant; and 4) No missing values. The dataset consists of observations and the censoring proportion is about 10%. The model (1.1) is employed to examine the covariate effects and the proposed SGD methods are used for estimation (2.4) and inference (3.6) since the classical batch-based approach (1.4) is numerically no longer applicable due to memory and time constraints. The following covariates are included in our analysis. 1) Age ( 6 levels: Below 35, 36-45, 46-55, 56-65, 66-75, Above 75); 2) Race ( 1 for White and 0 for Black); 3) Tumor Grade ( 4 levels: Well differentiated, Moderately differentiated. Poorly differentiated, Undifferentiated); 4) Cancer Stage (4 levels: In situ, Localized, Regional, Distant); 5) Year of diagnosis (3 levels: 1997-2003, 2004-2009, 2010-2015); and 6) the logarithm of Tumor size (mm). The categories “Above 75”, “Undifferentiated”, “Distant” are taken to be the reference levels for Age, Tumor Grade and Cancer Stage, respectively.
The estimated regression coefficients along with their 95% confidence intervals with varying choices of are reported in Table 6. We find that the survival is longer for women of medium age 36-45, compared with that of younger or older women. This result is consistent with previous works by Wingo et al. (1998) and Rosenberg et al. (2005). The effect of Race on survival is also consistent with previous studies (Li et al., 2003). The average survival time of White is about 12% longer than Black. For Tumor Grade, we see that patients with smaller grade levels tend to live longer. For Cancer Stage, patients who are diagnosed at an earlier stage have a larger chance of being cured, especially for the In situ stage. We also find that Year at diagnosis has a significant effect on survival, which indicates that the effectiveness of treatment improves over time along with advances of medical research. For Tumor size, the effect is quite pronounced and the tumor size is negatively correlated with the survival. It is important to note that for varying choices of , the proposed method yields fairly robust point and interval estimates for the regression coefficients. This reaffirms our findings in simulation studies and demonstrates that the proposed estimation and inference procedures indeed provide a useful tool for large-scale or online analysis of survival data which is becoming increasingly prevalent in practice.
|Age (Below 35)||0.3442||0.3429||0.3412||0.3389|
|(0.3270, 0.3615 )||(0.3246, 0.3613)||(0.3249, 0.3575)||(0.3273, 0.3507)|
|(0.4065, 0.4372)||(0.4132, 0.4434)||(0.4188, 0.4464)||(0.4261, 0.4457)|
|(0.3868, 0.4168)||(0.3923, 0.4216)||(0.3965, 0.4233)||(0.4034, 0.4229)|
|( 0.3763, 0.4062)||(0.3805, 0.4093)||(0.3831, 0.4095)||(0.3895, 0.4087)|
|(0.3402, 0.3694)||( 0.3434, 0.3720)||(0.3432, 0.3699)||(0.3480, 0.3672)|
|(0.1559, 0.1689)||(0.1168, 0.1239)||(0.1082, 0.1157)||(0.1034, 0.1093)|
Grade (Well differentiated)
|(0.1220, 0.1895)||(0.1255, 0.1911)||(0.1171, 0.1972)||(0.1312, 0.1840)|
Grade (Moderately differentiated)
|(0.1184, 0.1865)||( 0.1205, 0.1863)||(0.1111, 0.1918)||(0.1254, 0.1782)|
Grade (Poorly differentiated)
|(0.1031, 0.1714)||(0.1037, 0.1693)||(0.0930, 0.1737)||(0.1070, 0.1591)|
Stage (In situ)
|(0.9685, 1.0424)||(0.9697, 1.0257)||(0.9631, 1.0244)||(0.9668, 1.0178)|
|(0.9463, 1.0116)||(0.9388, 0.9936)||(0.9314, 0.9912)||(0.9331, 0.9820)|
|(0.9295, 0.9948)||(0.9197, 0.9744)||(0.9096, 0.9695)||(0.9107, 0.9594)|
|(1.2170, 1.2223)||(1.2211, 1.2271)||(1.2293, 1.2346)||(1.2388, 1.2438)|
|(0.8268, 0.8315)||(0.8301, 0.8357)||( 0.8407, 0.8455)||(0.8524, 0.8566)|
|(-0.0105, -0.0065)||(-0.0108, -0.0067)||(-0.01178, -0.0080)||(-0.0126, -0.0085)|
Appendix A Proofs
Consider the general setting where we sequentially update the estimates for every observations. Without loss of generality, let . Recall that and the data consist of independent and identically distributed (i.i.d.) copies of . The data can be rewritten as , where denote the th block of data points and , for . Using the th block , the objective function is
Let . Then, the corresponding gradient function is
and let .
Let and . For a given positive constant , define the following Lyapunov function associated with ,
Under Assumption A1, is Lipschitz continuous. That is, there exists such that
Let , and are the i.i.d copies from the population. Noting that is a U-statistic, we have
Under Assumptions A1-A3, we see that each element of is bounded. Hence there exists a positive constant such that . ∎
In addition, because the derivative of density function of and are bounded under assumptions A2 and A3, by a similar argument, we also obtain the following lemma.
There exist constants and such that, for all , we have
For any , we have
And there exist and such that, for all ,
Note that . By Taylor expansion, we have
where is some vector on the segment of and and . Hence, for any . Therefore,
For the second part, it suffices to show that there exist and such that, for all ,
Noting that , we only need to show that there exist and such that, for all ,
. This holds because the smallest eigenvalue ofis positive and is continuous around . ∎
Let be the Borel field of data and be the current estimate of . Define
Noting that , we see that is a martingale-difference sequence. Consider the following decomposition,
where , , and
Denote . Under Assumption A1 and by the dominated convergence theorem, we have
Hence, we establish the following lemma.
are i.i.d. with . Under Assumption A1, we have
with as .
Lemma A.5 (Polyak and Juditsky (1992), Lemma 1).
Define the following sequences of matrices,
where by convention. Then, under Assumptions A4 and A5, there exists a constant such that for all and ,
where is the operator norm of a matrix .
Now we apply the above lemma to the following recursive process: starting from ,
By recursion, we obtain the following two formulas,