Log In Sign Up

BigSurvSGD: Big Survival Data Analysis via Stochastic Gradient Descent

In many biomedical applications, outcome is measured as a “time-to-event” (eg. disease progression or death). To assess the connection between features of a patient and this outcome, it is common to assume a proportional hazards model, and fit a proportional hazards regression (or Cox regression). To fit this model, a log-concave objective function known as the “partial likelihood” is maximized. For moderate-sized datasets, an efficient Newton-Raphson algorithm that leverages the structure of the objective can be employed. However, in large datasets this approach has two issues: 1) The computational tricks that leverage structure can also lead to computational instability; 2) The objective does not naturally decouple: Thus, if the dataset does not fit in memory, the model can be very computationally expensive to fit. This additionally means that the objective is not directly amenable to stochastic gradient-based optimization methods. To overcome these issues, we propose a simple, new framing of proportional hazards regression: This results in an objective function that is amenable to stochastic gradient descent. We show that this simple modification allows us to efficiently fit survival models with very large datasets. This also facilitates training complex, eg. neural-network-based, models with survival data.


page 1

page 2

page 3

page 4


Nonparametric Bayesian Lomax delegate racing for survival analysis with competing risks

We propose Lomax delegate racing (LDR) to explicitly model the mechanism...

SGDR: Stochastic Gradient Descent with Warm Restarts

Restart techniques are common in gradient-free optimization to deal with...

Weak error analysis for stochastic gradient descent optimization algorithms

Stochastic gradient descent (SGD) type optimization schemes are fundamen...

Smart Gradient – An Adaptive Technique for Improving Gradient Estimation

Computing the gradient of a function provides fundamental information ab...

Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

This article suggests that deterministic Gradient Descent, which does no...

Scalable Estimation and Inference with Large-scale or Online Survival Data

With the rapid development of data collection and aggregation technologi...

Online Trajectory Segmentation and Summary With Applications to Visualization and Retrieval

Trajectory segmentation is the process of subdividing a trajectory into ...

1 Introduction

It is commonly of interest in biomedical settings to characterize the relationship between characteristics of an individual and their risk of experiencing an event of interest (eg. progression of a disease, recovery, death, etc. see Lee1997). Outcomes of this type are known as “time-to-event” outcomes, and characterizing such relationships is known as Survival analysis (Schober2018). In such applications, we often only have partial information on some patients due to censoring (e.g., they might leave the study before experiencing the event of interest).

Cox proportional hazards regression (CoxPH) (Cox1972)

is the most common tool for conducting survival analyses. The CoxPH model assumes a particular semi-parametric relationship between the risk at each given time of experiencing an event and the features of a patient (eg. age, sex, treatment assignment, etc). To estimate, parameters in this model CoxPH regression maximizes a log-concave function known as the "partial likelihood". Once estimated, this model can predict the person-specific risk of an event (as a function of their features). Such predictions are often used in personalized medicine, eg. in the development of prognostic and predictive biomarkers

(Bjorn2008). To maximize the partial likelihood, it is most common to use an efficient second-order algorithm such as Newton-Raphson (MITTAL2014) for datasets with few features (though potentially many observations). Traditionally, CoxPH has been used on data-sets with relatively few observations, though penalized extensions have been developed for high dimensional applications (Simon2011).

It is increasingly common to have biomedical datasets with a large number of observations, especially with increasing use of electronic medical records (Raghupathi2014). Although the CoxPH regression model has been widely used for small-to-moderate numbers of observations, current methodologies for fitting the Cox model have issues on datasets with many observations. In particular, in fitting the Cox model, it is common to leverage the sequential structure of the partial likelihood to vastly speed up computation (from to , see Simon2011). However, when there are a large number of observations, this can lead to computational instability (which we illustrate in this manuscript).

The second issue is that the partial likelihood does not naturally decouple over individuals or subsets of individuals. Thus, if the dataset does not fit in memory, the model can be very computationally expensive to fit: Standard distributed optimization methods such as those based the alternating direction method of multipliers (Boyd2010) cannot be used. This additionally means that the objective is not directly amenable to stochastic gradient-based optimization methods (Ruder2016). Unfortunately, to fit more complex neural network-based models, it is most common to use stochastic-gradient-based optimization (NN). This decoupling issue makes it impossible or at least very impractical to fit neural-network-based models with time-to-event data.

In this paper, we propose a novel and simple framework for conducting survival analysis using the CoxPH model. Our framework is built upon an objective that is a modification of the usual partial likelihood function. In particular this modified objective decouples over subsets of observations, and allows us to employ stochastic-gradient-based methods that engage only a subset of our data at each iteration. We show that the parameters estimated by this new objective function are equivalent to the original parameters when the assumptions of the CoxPH model hold (and may actually be more robust in the case of model misspecification). In addition, our new objective function is amenable to optimization via stochastic-gradient based methods. Standard stochastic gradient-based algorithms are computationally efficient and stable for this objective and can easily scale to datasets that are too large to fit in memory. We discuss how our new framework can be implemented in both streaming (Gaber2005) and non-streaming algorithms. We also discuss extending our framework to use mini-batches and we present some recommendations that we have found important, in practice, for performance.

We organize this paper as follows. In Section 2, we review the CoxPH regression model, the partial likelihood, and standard optimization tools used to maximize the partial likelihood. In Section 3, we present our new framework for fitting the CoxPH model and we prove the statistical equivalence of the parameters indicated by our optimization problem and those one from the standard CoxPH model. We also discuss applications of our proposed framework to both streaming and non-streaming algorithms. In addition, we discuss some recommendations that we have found important in practice for performance. In section 4 we discuss how our framework can be extended to left truncation and right censoring. In section 5, we discuss the equivalence of our optimization procedure to U-statistics based optimization. In Section 6, we provide simulation results that compare estimates from stochastic optimization of our modified objective to the current state of the art that estimates parameters by attempting to optimize the usual partial likelihood. In section 7, we conclude our paper and discuss some potential implications of our framework.

2 Cox Proportional Hazards (Cox PH) Model

The Cox PH model proposed by Cox (Cox1972) is a commonly used semi-parametric regression model in the medical literature for evaluating the association between the time until some event of interest and a set of variable(s) measured on a patient. More formally, suppose on each patient we measure an event time, and

a vector of numeric features. The Cox model engages with the so-called hazard function

where , . The hazard function,

, can be thought of as the probability density of having an event at time

, given that a patient (with covariates ) has not had an event up until that time. In particular the Cox model assumes a particular form for the hazard function:


where is a specified function of parameters that determines the role played by in the hazard; and is a baseline hazard function (independent of covariates). Note that may be assumed to be of different forms in different applications: For instance, in many scenarios is taken to be , and the simple linear model is used. This model assumes that the manner in which a patient’s covariates modulate their risk of experiencing an event is independent of time. In particular it is encoded entirely in . This simplifies estimation and interpretation of the predictive model.

Our aim is to use data to estimate . In particular we will assume that we have a dataset with independent observations drawn from the model (1):

. For the moment we assume that there is no censoring (all event times are observed), and no ties (all event times are unique). Estimation is conducted using the log-partial-likelihood:


where is the “risk set for patient ”. Note that aside from , the expression in (2) is independent of the event times. Extending this partial likelihood to deal with censoring, left-truncation and ties is quite straightforward (Klein2003), however for ease of exposition we do not include it in this manuscript.

2.1 Estimating by Maximizing the Log-Partial-Likelihood

Using the log-partial-likelihood from equation (2) an estimate of can be obtained as


When linear is used, our objective function in (3) is convex in , and thus the tools of convex optimization can be applied to find (see Corollary 1 in Appendix A for the proof of convexity). In the current gold standard survival package in R (coxph2019), Newton-Raphson is used to minimize (3) with linear . In the later sections of this manuscript we will refer to this implementation as coxph().

For linear , one can show that which is rate optimal

(as is standard for estimation in parametric models, see


In current state-of-the-art packages, the structure of the ordered structure of the loss (as well as the gradient, and hessian) are leveraged to improve computational efficiency. In particular, we examine the gradient


where is the gradient of with respect to . While a naive calculate would have computational complexity because of the nested summations, this is not necessary. In the case that the times are ordered we see that . This allows us to use cumulative sums and differences to calculate the entire gradient in computational complexity, with a single complexity sort required at the beginning of the algorithm (Simon2011). This is also true for calculating the Hessian. Unfortunately, however, when employing this strategy, the algorithm becomes susceptible to roundoff issues, especially with a larger number of observations () and features (), as seen in Section 6.

Additional inspection of the gradient in (4) shows why stochastic-gradient-based methods cannot be used to decouple gradient calculations over observations in our sample: While the gradient can be written as a sum over indices , the denominator for the term involves all observations in the dataset. In the next section, we propose a novel simple modification of optimization problem (4) that admits an efficient stochastic-gradient-based algorithm for estimating .

3 Big survival data analysis using SGD: BigSurvSGD

We begin by reformulating our problem. We consider a population parameter , defined as the population minimizer of the expected partial likelihood of random patients (which we will refer to as “strata of size s”)


Here we think of as a draw of random patients from our population. Note that the minimum value for is 2, otherwise, expression (2) becomes zero for all . By including a superscript in , we note that this parameter may depend on . In fact, when the assumptions of the Cox model hold (1) then we have for all . The proof of this is quite simple, with details given in Appendix A.

To estimate , we select a small fixed () and directly apply stochastic gradient descent to the population optimization problem (5). In practice this will amount to calculating stochastic gradients using random strata of size . One may note that for small, there are on the order of such strata. However, results for stochastic gradient descent indicate that under strong convexity of (5), on the order of only steps should be required to obtain a rate optimal estimator (converging at a rate of in MSE). See Appendix B for the proof of this convergence rate .

In Section 6, we see that this modification mitigates issues with roundoff error, and allows us to computationally efficiently fit survival models with millions of observations and many features.

3.1 Pairwise concordance ()

An interesting special case is when we choose strata of size , and look at pairs of patients. Then, in the case of no censoring, the population minimizer in (5), i.e., maximizes the expectation of the pairwise log-partial likelihood


This log-partial-likelihood can be thought of as a smoothed version of the standard concordance measure used in the concordance index (Peter2012). Thus, even when the proportional hazards model does not hold, the parameter maintains a useful interpretation as the population minimizer of the average smoothed concordance index. (6

) is similar to the objective function for conditional logistic regression (CLR) with strata size


. In the deep learning literature, neural net models with a conditional logistic outcome layers are often referred to as Siamese Neural Networks

(Gregory2015), though, to our knowledge, these ideas have not been previously been applied to time-to-event data.

3.2 Optimization with SGD

Suppose that we have independent strata, each with independent patients drawn from our population (with ). For ease of notation, let denote the indices of patients in strata for each .

We first note, that for any we have


when are drawn from a reasonable distribution (eg. bounded); and is not too poorly behaved (eg. lipschitz). Here is defined analogously to (5) using


where are risk sets that include only patients in stratum ; and denotes the gradient of wrt .

From here we can give the simplest version of our stochastic gradient descent (SGD) algorithm for (3). We choose an initial (perhaps ), and at each iteration , we update our estimate by


Here, is the learning rate (and should be specified in advance, or determined adaptively as discussed later in Section 3.3.1). The computation time to run steps of stochastic gradient descent according to (9) for linear is (where is the total sample size). If ordering of risk sets is not leveraged, then computation is required. In contrast, Newton’s algorithm for optimizing the full log-partial likelihood requires computation per iteration when the roundoff-error-prone updating rule is used (and if not). Additionally, using these small strata of size , we are not prone to roundoff issues when using stochastic optimization (because this sum is calculated separately for each strata).

3.2.1 Averaging over updates

It has been shown that SGD algorithms for strongly convex objective-functions are asymptotically more efficient if we use a running average of the iterates as the final estimate (Ruppert1988; Polyak1992). Additionally in this case can be set to a fixed value (so long as it is sufficiently small). We denote the running-average estimator by


Note that the averaging process does not change the values of . In our simulations in the main manuscript, we use the averaged . Strong convexity of the objective in (5) depends on properties of and (weakly) on the distribution of . For linear , and with a non-degenerate distribution, this objective will be strongly convex (See Appendix B for the proof of strong convexity of our objective function). In such cases, standard results (Bottou2010) show that . As a reminder, this is the statistically optimal rate of convergence for estimating (or equivalently, when the Cox model holds) from observations. See Appendix B for the proof of this convergence rate for averaging over iterates. When we use averaging over iterates, it has been shown that choosing the learning rate as (C is a constant) gives us such an optimal convergence rate (Eric2011). We choose this learning rate in our simulation studies in Section 6 where we tune the constant to get the desired convergence rate.

3.3 Streaming vs non-streaming

In the discussion above, we imagined that observations were arriving in a continual stream of strata, and that we were more concerned with the cost of computation than the cost of data collection. The algorithm we described engaged with each strata only once (in calculating a single stochastic gradient). Algorithm 1 in Appendix C details an implementation of a streaming algorithm in this imagined scenario. In practice, we generally have a fixed (though potentially large) number of observations, , that are not naturally partitioned into strata. To employ SGD here, we randomly partition our observations into disjoint strata of size , and then carry out our updates in (9). Additionally, in practice we may want to take more than one pass over the data (and similarly use more than random partitioning), we discuss this further in Section 3.3.1. Algorithm 2 in Appendix C details an implementation that takes multiple passes over the data (and includes additional bells and whistles discussed in Section  3.3.1).

3.3.1 Mini-batches, Moment-based-learning-rate, and Multiple Epochs

For small , the stochastic gradients given in (8) are potentially quite noisy. In these cases, rather than using only a single stratum to calculate the stochastic gradient, it may be preferable to average results over multiple strata. This is known as a using a mini-batch (Ruder2016) in the SGD literature. In this survival context, we use batches of strata, so a larger strata size can eliminate the need for mini-batch sizes of greater than .

Choosing a reasonable value for the hyper-parameter (the learning rate) is critical for good practical performance of the algorithm. A number of publications have developed methods for adaptively selecting the learning rate using moments, including Adagrad (Duchi2011), and ADAM (Kingma2014). However, it was shown that ADAM may not converge in some settings and an updated version called AMSGrad has been proposed (Sashank2019). We found substantially improved performance on simulated data with AMSGrad over the simple non-adaptive updating rule given in (9), especially in combination with averaging as discussed above.

In practice, taking only a single pass over the data leads to poor empirical performance. We generally use multiple passes (or “epochs”). In particular, for each epoch, we (1) randomly partition our data into strata; then (2) use the last updated iterate of

from the previous epoch as the initial iterate of for the new epoch; and finally (3) apply a full pass of stochastic gradient descent (with eg. averaging and momentum) over the partitioned data. In practice we found that epochs was more than enough for very robust convergence (often was fine).

4 Extensions for Left Truncation and Right Censoring

In practice it is common for participants in a study to leave the study before experiencing an event. This is known as right censoring. In particular it is often assumed that a patient has some random “time until censoring” , and what we observe is , whichever happens first (the event or censoring), along with an indicator that the patient experienced an event (rather than censoring). Censored patients still contribute some information to estimation of — in particular, if a patient is censored quite late, then there is a long period of time during which we know they did not experience an event (so likely we should estimate them to be low risk).

In some studies it is common to consider the event time as the age at which a patient had an event (rather than the calendar on study). This means that patients do not enroll in the study at (and patients can only be observed to fail once they are enrolled). This phenomenon, wherein patients enroll in a study at times other than , is known as left-truncation.

When i) the assumptions of the Cox model hold; and ii) censoring and truncation times are independent of event times conditional on covariates , it is relatively straightforward to adapt cox regression to accommodate these missingness mechanisms. The partial likelihood is modified in 2 minor ways: 1) The outer summation is only taken over indices for which an event occurred (i.e. that were not censored); and 2) The risk sets, , are modified to include only patients currently at risk (who have already been enrolled, and have not yet been censored, or had an event) (Andersen1993). We can similarly modify our objective function (5), and apply our algorithm with only minor modification (a slight change in the gradient).

5 U-statistic based optimization

While in this manuscript we discuss obtaining an estimator by directly attempting to minimize the population objective function (5) using SGD, there is a corresponding empirical minimization problem. In particular, for strata of size , one could define an estimator by


As in a standard U-statistic (Van2000), this sum is taken over all subsets of patients out of our original patients (resulting in terms). This appears to be a difficult optimization problem, given the enormous number of terms. However, our approach shows that, in fact, only of those terms need be considered: The majority contain redundant information. In fact, one could see this directly by noting that the objective function in (11) decouples over subsets: An application of stochastic gradient descent here would involve sampling strata with replacement, and (assuming the objective is strongly convex), with averaging over iterates would converge to a tolerance of after steps. This approach which is known as incomplete U-statistics (Gunnar1976) could be taken more generally for losses defined by -statistics.

6 Simulation Results

Data Generation

We assume that our event time follows the Cox PH model (1) with simple linear detailed below. We generate the baseline hazard

using an exponential distribution with parameters

. We generate the censoring and event times independently. The details of the data simulation procedure are given below (Bender2005)

where , i.e., time to event or censoring whichever comes first. Here , the probability of censoring, is a parameter we can tune. Though this is not written in the form of (1), it is still consistent with the Cox PH model assumptions, with . In all comparisons, we include the performance of coxph() the gold standard R implementation of Newton’s algorithm for maximizing the partial likelihood.

We used R version 3.6.1 (R2019)

to conduct the analyses. All simulations were conducted on a quad-core Intel Core i7-3520 M CPU @ 2.9GHz with 12 GB RAM. In all figures, we used mean of mean-square-error (MSE) over datasets for central tendency (curves) and standard error of MSE over datasets for variability (error bars o curves). Our implementation of the SGD procedure described in this manuscript is publicly available in the github repository: (Tarkhan2020). The implementation is written in R, with the computational back-end written in C++.

Small Data Results

We first evaluate the statistical efficiency of our estimation procedure (using strata sizes of less than ). We evaluate mean-squared error (MSE) between estimators , and and the truth, over 1000 simulated datasets with number of epochs up to 100. We see that for small strata sizes (eg. 2 and 5), there is some statistical inefficiency: Even though the convergence rate is still , the constant in front appears to be worse with small strata sizes. For large strata sizes, there appears to be nearly no statistical inefficiency. For practical purposes, the SGD-based estimators do quite well. This can be seen in Figure 1.

We next evaluate the performance of averaged SGD with a fixed learning rate, against averaged SGD with an adaptive learning rate (using AMSGrad) with a fixed strata-size of for both. In addition, we try various numbers of epochs (from 10 to 100). Performance is shown in Figure 2 for 1000 simulated datasets. We see that with enough epochs (around 100) both perform well. However, AMSGrad nearly reaches that performance with as few as 50 epochs, where using a fixed learning rate does not attain that performance with fewer than epochs. For both of these methods, we tuned our [initial] learning-rate to be optimal in these experiments.

Figure 1: (MSE) of estimates from AveAMSGrad (with optimal for the learning rate defined as ) for different strata sizes () with B=1, P=10, 100 epochs, and probability of censoring .
Figure 2: of estimates from averaged SGD (AveSGD), averaged AMSGrad (AveAMSGrad), and coxph(). We choose the proportionality constant and for learning rate defined as for AveAMSGrad and AveSGD, respectively), S=20, B=1, P=10, and probability of censoring .

In practice, we have found that AMSGrad is much more robust to misspecification of this initial learning rate. Figure 3 compares MSE of the estimate from AveAMSGrad for different choices of the proportionality constant in the learning rate over 1000 simulated datasets. We see that selecting the constant around 25 in our learning rate (defined as ) gives strong performance. However, AveAMSGrad is relatively robust to a wide range of the learning rates around the optimum value due to its capability of adapting the learning rate over iterates.

Figure 3: (MSE) of estimates from AveAMSGrad based on sample size () with , , , probability of censoring , and different proportional constant for learning rate defined as

We discussed the computational instability of coxph() in small-to-moderate sized datasets and how our framework can avoid such an instability. Figure 4 compares the MSE of estimates from coxph() for the small-to-moderate sample size () and number of features () over 1000 simulated datasets. As we see, coxph() performs poorly for larger and . For instance, coxph() with performs worse than . This is because of computational instability with coxph() for the larger sample sizes and numbers of features. One important feature of these examples is that we include a large amount of signal (which increases as the number of features increases). With less signal, this instability is less pronounced unless very large sample sizes are used.

Figure 4: of estimates from coxph() for different sample size () and different dimension of coefficient () with and probability of censoring .

Big Data Results

We next consider the numerical stability of our algorithm/framework (versus directly maximizing the full partial likelihood using Newton’s algorithm). We generated 100 datasets and we only used one epoch for AveAMSGrad algorithm. Figure 5 shows a surprising and unfortunate result for coxph(): We see that as sample size increases drastically, the performance of coxph() actually stops improving and starts getting worse! In particular for , coxph() is basically producing nonsense by the time we get to observations for this simulation setup. This indicates that for large datasets the current gold standard may be inadequate, though we do note that there is a large amount of signal in these simulations (more than we might often see in practice). In contrast BigSurvSGD has no such issues and gives quite strong performance even with only one epoch. Note did not add error bars for ease of illustration for this case.

Figure 5: MSE of estimates from AveAMSGrad (with proportional constant for learning rate defined as ) and coxph() for different values of sample size () and dimension () with B=1, s=20, and probability of censoring .

We next examine the computational efficiency of our framework and algorithm. Figure 6 plots computing time (in seconds) for estimates of from coxph() and AveAMSGrad for different sample sizes (we did not add error bars for ease of illustration). We considered four different numbers of covariates and to examine the sensitivity of the computing time to the dimension of . In these examples, our algorithm read the data in chunks from the hard-drive (allowing us to engage with datasets difficult to fit in memory). The computing time of our proposed framework increases linearly in the sample size, , (and in ). The computing time for coxph() also grows roughly linearly in , though it grows quadratically in (which can be seen from the poor performance with ). Furthermore, coxph() fails for the medium-to-large datasets as it is poorly equipped to deal with datasets that do not easily fit in memory (R unfortunately generally deals somewhat poorly with memory management). For instance from Figure , coxph() with crashes after observations. As a reminder, the statistical performance of the output of coxph(), due to floating-point issues, degenerates much earlier.

Figure 6: Computation time (seconds) of estimates from coxph and AveAMSGrad (with C=3 for ) based on sample size (n) with B=1, s=20, probability of censoring .

7 Conclusion

We propose a simple and novel framework for doing survival analysis under a Cox proportional hazards model. Our framework leverages a modified optimization problem which allows us apply iterative methods over only a subset of our observations at each time. In particular it allows us to leverage the tools of stochastic gradient descent (and its extensions). This results in an algorithm that is more computationally efficient and stable than the current state of the art. We showed that our framework can handle large survival datasets with little difficulty. This framework will also facilitate training complex models such as neural network with large survival datasets.


Appendix A Proof of consistency for parameter in section 3

In this appendix, for the sake of completeness, we prove the Fisher consistency of parameter . We treat the Cox proportional hazard model as a counting process (SurvPoint2008). We assume that censoring and survival times are independent given the covariate vector of interest and they follow the model (1) with true parameter . In the following, we define some terminology before proceeding with the proof.

Definition 1: . For patient with time to event define the counting process by

For instance, if we define , the above expression is an indicator representing whether patient failed in interval (i.e., 1 represents failure and 0 otherwise). We further define which is a counting process for failure times over all patients. We assume that the failure time process is absolutely continuous w.r.t. Lebesgue measure on time so that there is at most one failure at any time (i.e., no ties).

Definition 2: . We define to be an indicator representing whether patient is at risk at time , i.e., . By this definition, indicates number of patients who are at risk at time . Note that the independent censoring assumption implies that those patients at risk at time (who have not yet failed or been censored) represent a random sample of the sub-spopulation of patients who will survive until time .

Definition 3: . Let denote the filtration that includes all information up to time , i.e.,

Note that given , we know whether patient failed or was censored (i.e., ), when they failed/were censored (i.e., ), and their covariate vectors .

Using the above definitions, one can write the log-partial-likelihood as


where is the duration of the study. Note that we keep in the most general form and we only assume that it is differentiable in for all . Then the score function may be written as


where is a weight proportional to the hazard of failure of patient ; .

Now we show that the parameter is Fisher consistent, i.e., .


where (a) follows from the conditional expectation given the filtration ; (b) follows from the fact that and are known given ; (c) follows from the fact that . Since , the parameter is Fisher consistent. In the following, we present a sufficient condition under which such a parameter is a global minmizer of .

Corollary 1: The is a global minimizer of if is an affine function of .
Proof. Suppose that we choose a convex function of . Then, the first term inside summation of (i.e., ) is a concave function. We show that the second term inside summation of , i.e., is a convex function through the following steps:

Step 1: is a convex function because is a non-decreasing convex function and is convex (Boyd2004, sec. 3.2.4).

Step 2: is a convex function because non-negative weighted sums of convex functions is convex (Boyd2004, sec. 3.2.1).

Step 3: is a convex function because is non-decreasing concave function and is a convex function (Boyd2004, sec. 3.2.4).

Therefore, expression is the sum of a concave function and a convex function. A sufficient condition for convexity of this expression is convexity of or equivalently concavity of . This means that needs to be both convex and concave at the same time. The only functions that satisfy this condition are affine functions (Boyd2004, sec. 3.1.1). By choosing as an affine function, and hence become convex functions. Therefore, the parameter becomes a global minimizer of .

Having the loss function

a convex function motivates us to explore the convergence rate of SGD-based minimization algorithms in the next section.

Appendix B Convergence rate of SGD-based estimate

In this appendix, for the sake of completeness, we prove that our SGD-based estimate can achieve the convergence rate of for choices , , and no ties. We also assume that our covariates are bounded, i.e., there exists some such that with probability and that we consider a domain of optimization such that .

For the sake of simplicity of proof, we rewrite our optimization procedure as


and we estimate iteratively using SGD from strata of size through


Authors in (Eric2011) showed that if the loss function satisfies the 4 following conditions, then, we can achieve the optimal convergence rate of by choosing for the single SGD-based iterates and , for averaging over iterates (i.e., Polyak-Ruppert average).

Condition 1: Gradient of

is an unbiased estimate of the gradient


Condition 2: Gradient of is D-Lipschitz-continuous, i.e., and , there exists such that,


Condition 3: is -strongly convex, i.e., , there exists such that,


Condition 4: Variance of the gradient of is bounded, i.e., there exists such that for all ,


In the following, we show that the loss function in our framework, i.e., satisfies all four conditions above.

Proof of Condition 1:
This condition is automatically satisfied based on the definition of in (B.1) and that .

Proof of condition 2: The loss function belongs to continuous function family and proving (B.3) is equivalent to proving


For , and assuming no ties, can be simplified as


where ; . Note that and that because . Therefore we have


where follows the triangle inequality. Therefore, by assuming our covariates are bounded, gradient of is Lipschitz-continuous with . This completes the proof of Condition 2.

Proof of condition 3: The loss function belongs to the continuous function family and proving (B.4) is equivalent to proving


where is the expected Hessian matrix. Starting from (B.7), can be written as