Ensemble Risk Modeling Method for Robust Learning on Scarce Data

08/13/2011 ∙ by Marina Sapir, et al. ∙ 0

In medical risk modeling, typical data are "scarce": they have a relatively small number of training instances (N), censoring, and high dimensionality (M). We show that the problem may be effectively simplified by reducing it to bipartite ranking, and introduce a new bipartite ranking algorithm, Smooth Rank, for robust learning on scarce data. The algorithm is based on ensemble learning with unsupervised aggregation of predictors. The advantage of our approach is confirmed in comparison with two "gold standard" risk modeling methods on 10 real life survival analysis datasets, where the new approach has the best results on all but the two datasets with the largest ratio N/M. For a systematic study of the effects of data scarcity on modeling by all three methods, we conducted two types of computational experiments: on real life data with randomly drawn training sets of different sizes, and on artificial data with an increasing number of features. Both experiments demonstrated that Smooth Rank has a critical advantage over the popular methods on scarce data; it does not suffer from overfitting where the other methods do.




1 Introduction

1.1 The Survival Analysis Problem

Survival analysis deals with datasets where each observation $(x_i, t_i, \delta_i)$, $i = 1, \dots, N$, includes a covariate vector $x_i \in \mathbb{R}^M$, a survival time $t_i$ and a binary event indicator $\delta_i$: $\delta_i = 1$ if an event (failure) occurred, and $\delta_i = 0$ if the observation is censored at time $t_i$. So, the survival time $t_i$ means either the time to event or the time to the end of study, if the observation was stopped (censored) without the event. The last two variables represent the target, or outcome, of the observation.

For example, in a study of the risk of metastases after cancer surgery, survival time is the time from the surgery to the discovery of metastases, if they occurred, or to some other end of study. The observations where metastases were not discovered end up censored. The censoring may occur either because the cancer was removed by surgery, or because the patients were lost to follow-up before the cancer metastasized. Some of these lost patients could have died from different causes or moved to another location.

The prediction in survival analysis is generally understood as an estimate of an individual’s risk, and the concept of risk is associated not only with the fact of event, but with its timing: the earlier the event happened, the higher was the risk.

Because of incomplete (censored) observations, the individuals in the study can only be partially ordered by their risk: for observations $i$ and $j$, $i$ has higher risk than $j$ (written $i \succ j$) if $t_i < t_j$ and $\delta_i = 1$.

Denote by $f$ a model built by a survival analysis algorithm. The commonly accepted criterion of risk modeling is Harrell's concordance index (CI) [1] between the risk order and the values of $f$. The concordance index is based on the concept of concordant pairs of observations: pairs $(i, j)$ where both $i \succ j$ and $f(x_i) > f(x_j)$. The pairs where $i \succ j$ and $f(x_i) < f(x_j)$ are called discordant. By definition,

$$CI = \frac{C + 0.5\,T}{C + D + T},$$

where $C$, $D$, $T$ are the numbers of concordant pairs, discordant pairs and ties, respectively. With continuous features, ties are unlikely, and in the absence of ties the concordance index measures the proportion of concordant pairs.
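The pair counting behind the concordance index can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name and argument layout are hypothetical, a pair is treated as comparable only when the observation with the shorter time is an event, and ties in the score are assumed to count as one half.

```python
def concordance_index(times, events, scores):
    """Harrell-style concordance index for right-censored data.

    times  : survival times (time to event or to censoring)
    events : 1 if the event occurred, 0 if the observation is censored
    scores : model output, larger score = higher predicted risk
    """
    concordant = discordant = ties = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored record cannot be the higher-risk member of a pair
        for j in range(n):
            if times[i] < times[j]:  # i failed before j was last observed
                if scores[i] > scores[j]:
                    concordant += 1
                elif scores[i] < scores[j]:
                    discordant += 1
                else:
                    ties += 1
    total = concordant + discordant + ties
    if total == 0:
        raise ValueError("no comparable pairs")
    # Assumption: a tie in the score counts as half a concordant pair.
    return (concordant + 0.5 * ties) / total
```

A score that ranks every earlier failure above every later observation gives 1.0, the reversed score gives 0.0, and a constant score gives 0.5.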

Thus, the goal can be formalized as learning the partially known order by building a scoring function $f$ on the covariates that minimizes the number of discordant pairs of observations. This puts survival analysis in the class of supervised ranking problems (see, for example, [2, 3]).

What makes the problem difficult is the data quality in medical longitudinal studies, which we call "scarcity": the number of observations tends to be small; many of them may be censored; the number of features may be large compared with the number of observations; the outcome and the quantitative features depend to a large degree on unknown factors; clinical features are subjective; some of the features may be irrelevant; and so on. All this puts the main emphasis on the robustness of learning.

1.2 Traditional Approach

Survival analysis is not commonly treated as a ranking problem. Usually, the solution is found by modeling the time it takes for events to occur [4, 5] (a more difficult problem), and the most common method for this purpose is Cox proportional hazard regression (Cox PH) [6]. Cox PH regression builds a model by maximizing the likelihood of the observed survival rates. Originally, the method was intended for studying the dependence between a few covariates and the outcome; if used for prediction, it tends to overfit on typical data. A popular tutorial [1] recommends that the number of covariates for Cox regression not exceed one tenth of the number of uncensored observations.

In machine learning, most of the developed approaches try to "improve" Cox PH regression one way or another. The most popular methods include $L_1$ or $L_2$ penalization of the regression parameters [7, 8], to make learning more robust. As we show here on real and artificial data, this regularization may not be sufficient, as it often does not lead to improved performance on scarce data and does not prevent overfitting.

1.3 Alternative Approach

We propose an alternative approach to the survival analysis learning.

First, we address the noise in the outcome by defining the "risk" as the chance of having a failure before a certain time $\tau$. The interpretation of risk as the propensity of an individual to have an early failure is natural and acceptable for medical practitioners. Considering survival analysis as a ranking problem and splitting the observations into two classes ("early failure" vs "no early failure"), we reduce the problem to bipartite ranking (see, for example, [9]). Binarization of the survival outcome simplifies the problem and decreases the influence of noise in the survival times.

Note that even though only the order between the two classes is modeled, the model produces continuous scores associated with levels of risk, which, in turn, are related to the timing of events. Therefore, the performance can still be evaluated by the concordance index between the scores and the order on the test data.

Second, we introduce a new bipartite ranking method, Smooth Rank, designed specifically for scarce data. The method is based on the strong regularization technique used in Naive Bayes: unsupervised aggregation of independently built univariate predictors. Avoiding multidimensional optimization makes Naive Bayes less sensitive to the "curse of dimensionality" and allows it to be competitive with more sophisticated methods [10] on scarce data. The term strong regularization was first used in [4] with reference to weighted voting [11], which is based on the same approach.

In addition to the strong regularization, Smooth Rank employs smoothing techniques to make the model more robust, less dependent on peculiarities of the small training samples.

1.4 Comparison of the Two Approaches

To show that our approach works, Smooth Rank is compared here with the $L_1$-penalized path method CoxPath and with Cox PH regression on 10 real life survival analysis datasets. Smooth Rank has the best performance on 8 datasets. It yields to the other methods only on the two datasets with the largest ratio $N/M$.

In addition, we study the relationship between data scarcity and the methods' performance in two types of computational experiments. First, on two real life datasets, we randomly exclude some observations to produce series of training sets of different sizes. Then, on artificial datasets, we gradually increase the number of features, keeping the size of the training set constant. Both experiments demonstrate that Smooth Rank has a significant advantage in performance on scarce data, while the other methods may work better on "rich" data.

2 Smooth Rank

2.1 Definition of Smooth Rank

The main scheme of the algorithm can be described as a two-step procedure:

  1. Independently for each feature $x^j$, $j = 1, \dots, M$, build a predictor $f_j(x^j)$ and calculate its weight $w_j$ based on its performance;

  2. Calculate a scoring function $F(x) = \sum_{j=1}^{M} w_j f_j(x^j)$.

For a classification problem, there are two popular ensemble algorithms which follow this scheme. One of them is the Naive Bayes classifier [12], where all weights $w_j = 1$ and each predictor is built as a log-ratio between the densities of the two classes: $f_j(z) = \log \left( p_1^j(z) / p_2^j(z) \right)$. Another example of an algorithm with the same scheme is "weighted voting" [11].

There are several ways the general scheme can be implemented in the context of survival analysis. Below is one of the possible implementations.

The algorithm is applied to data where the observations are split into two classes, with labels $c \in \{1, 2\}$, by a survival time threshold $\tau$.

Bipartite Ranking Algorithm Smooth Rank

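  • For each feature $x^j$, $j = 1, \dots, M$:

    1. Build kernel approximations $p_1^j(z)$, $p_2^j(z)$ of the density of each class on $x^j$;

    2. For each point $z$ calculate

      $$g_j(z) = \frac{p_1^j(z) - p_2^j(z)}{\pi_1 p_1^j(z) + \pi_2 p_2^j(z)},$$

      where $\pi_1, \pi_2$ are the frequencies of the classes 1 and 2;

    3. Build the marginal predictor $f_j$ as a smooth approximation of the function $g_j$;

    4. Calculate the weight $w_j$ of the predictor based on its correlation with the outcome.

  • Calculate the scoring function $F(x) = \sum_{j=1}^{M} w_j f_j(x^j)$.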
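The overall flow of the algorithm can be illustrated with a small pure-Python sketch. It is not the authors' R implementation and departs from it in labeled ways: a Gaussian kernel with a hand-picked bandwidth replaces the cosine kernel, linear interpolation on a grid stands in for the LOESS smoothing, low-density points are not skipped, the weight `2 * AUC - 1` is an assumed formula, and the post-filtering of weights is omitted. All function names are hypothetical, and a constant feature is not handled.

```python
import math

def kde(sample, grid, bandwidth):
    """Gaussian kernel density estimate of `sample`, evaluated on `grid`."""
    norm = 1.0 / (len(sample) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((z - s) / bandwidth) ** 2) for s in sample)
            for z in grid]

def auc(pos, neg):
    """Fraction of (pos, neg) pairs ranked correctly; ties count one half."""
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def interpolate(grid, values, z):
    """Piecewise-linear interpolation, clamped to the ends of the grid."""
    if z <= grid[0]:
        return values[0]
    if z >= grid[-1]:
        return values[-1]
    step = grid[1] - grid[0]
    k = min(int((z - grid[0]) / step), len(grid) - 2)
    frac = (z - grid[k]) / step
    return values[k] + frac * (values[k + 1] - values[k])

def fit_feature(x1, x2, n_grid=64):
    """Marginal predictor and weight for one feature.

    x1, x2 : values of the feature in class 1 and class 2.
    """
    lo, hi = min(x1 + x2), max(x1 + x2)
    grid = [lo + (hi - lo) * k / (n_grid - 1) for k in range(n_grid)]
    bw = 0.2 * (hi - lo)  # assumption: wide, hand-picked bandwidth
    p1, p2 = kde(x1, grid, bw), kde(x2, grid, bw)
    pi1 = len(x1) / (len(x1) + len(x2))  # class frequencies as priors
    pi2 = 1.0 - pi1
    g = [(a - b) / (pi1 * a + pi2 * b) for a, b in zip(p1, p2)]

    def predict(z):
        return interpolate(grid, g, z)

    # Assumed weight formula: 2*AUC - 1 on the training data, floored at 0.
    w = 2.0 * auc([predict(z) for z in x1], [predict(z) for z in x2]) - 1.0
    return predict, max(w, 0.0)

def fit_smooth_rank(X1, X2):
    """X1, X2: lists of feature vectors for class 1 and class 2."""
    m = len(X1[0])
    parts = [fit_feature([r[j] for r in X1], [r[j] for r in X2]) for j in range(m)]
    return lambda row: sum(w * f(row[j]) for j, (f, w) in enumerate(parts))
```

The returned scoring function is a weighted sum of univariate predictors, so no multidimensional optimization takes place; a row whose features look like class 1 ("early failure") receives a higher score.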

2.2 Implementation and parameters

The algorithm is implemented in R using some standard R functions.

2.2.1 Selection of the time threshold

In the experiments, we select the time threshold $\tau$ to make the classes similar in size. Class 1 contains events only: observations with $\delta_i = 1$ and $t_i \le \tau$. Class 2 contains all observations with survival time above $\tau$. This means that the censored observations with censoring time below the threshold are excluded from training.
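The binarization can be sketched as follows. The search over candidate thresholds is an assumption made for illustration (the text only says the classes are made similar in size), as is the choice to place an event occurring exactly at the threshold in class 1; the function names are hypothetical.

```python
def binarize(times, events, tau):
    """Label each observation: 1 = event by tau, 2 = observed beyond tau,
    None = censored before tau (excluded from training)."""
    labels = []
    for t, e in zip(times, events):
        if t > tau:
            labels.append(2)      # still event-free after the threshold
        elif e:
            labels.append(1)      # early failure
        else:
            labels.append(None)   # censored too early to be classified
    return labels

def balanced_threshold(times, events):
    """Pick the event time that makes the two classes closest in size.

    Assumption: candidates are the observed event times; among equally
    balanced candidates the smallest one is returned.
    """
    candidates = sorted({t for t, e in zip(times, events) if e})

    def imbalance(tau):
        labels = binarize(times, events, tau)
        return abs(labels.count(1) - labels.count(2))

    return min(candidates, key=imbalance)
```

For times `[1, 2, 3, 4, 5, 6]` with events `[1, 0, 1, 1, 0, 1]`, the chosen threshold is 3: two early failures, three observations beyond the threshold, and one censored record dropped.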

2.2.2 Density evaluation

The density is approximated with a cosine kernel. The R function density [13] uses a Fourier transform with a discretized version of the kernel and then makes a linear approximation to evaluate the density at the specified points. The density was evaluated with the default function parameters, on 512 equally spaced points.

2.2.3 Building marginal predictors

The function $g(z) = (p_1(z) - p_2(z)) / (\pi_1 p_1(z) + \pi_2 p_2(z))$ is less sensitive to errors in the density evaluation than the default function used in Naive Bayes, $\log(p_1(z) / p_2(z))$. However, in areas where the density of both classes is low, small errors in the densities can lead to big errors in $g$. To deal with this issue, $g$ is not evaluated at points where the estimated densities fall below a small threshold. The aggregation in the last step handles the values of the predictors at these points as missing. The function $g$ is smoothed in Step 3 using the loess procedure with the default parameters. LOESS stands for "locally weighted scatterplot smoothing" [14]. An advantage of this method is that it does not require specifying the class of functions used for the approximation. The loess procedure was used with polynomials of degree 1.

2.2.4 Calculation of weights

The calculation of weights is implemented as a two step procedure.

First, the weight of each predictor is calculated from CI, the concordance index between the variable and the outcome. For a two-class outcome, in the absence of ties, CI equals the area under the ROC curve, which is a common performance measure for bipartite ranking [3].
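The equivalence between the concordance index for a two-class outcome and the area under the ROC curve can be checked numerically. The sketch below, with hypothetical function names, computes the same quantity in two ways: as the proportion of correctly ordered (class 1, class 2) pairs, and through the rank-sum (Mann-Whitney) formula commonly used for AUC. It assumes there are no tied scores.

```python
def ci_two_class(labels, scores):
    """Proportion of (class 1, class 2) pairs where class 1 scores higher."""
    pos = [s for s, c in zip(scores, labels) if c == 1]
    neg = [s for s, c in zip(scores, labels) if c != 1]
    concordant = sum(1 for p in pos for q in neg if p > q)
    return concordant / (len(pos) * len(neg))

def auc_rank_sum(labels, scores):
    """AUC from the rank-sum statistic (no ties assumed)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank = {i: r + 1 for r, i in enumerate(order)}  # ascending ranks from 1
    pos = [i for i, c in enumerate(labels) if c == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    rank_sum = sum(rank[i] for i in pos)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.5, 0.1, 0.8, 0.3]
assert abs(ci_two_class(labels, scores) - auc_rank_sum(labels, scores)) < 1e-12
```

In this example 8 of the 9 cross-class pairs are ordered correctly, so both functions return 8/9.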

The next step includes "post-filtering", or "shrinkage". The goal of this step is to improve learning on datasets where many correlated weak predictors can outweigh a few strong ones. Rather than setting a hard threshold for the selection of predictors, or using the data to optimize the threshold, the filtering is based on a comparison of all weights with the highest weight. The updated weights are calculated by an empirical formula that filters out relatively weak predictors, making the filtering data-dependent without any tuning of hyper-parameters.

2.3 Properties of the predictors


If $p_1(z)$, $p_2(z)$ approximate the densities of both classes on a feature $x^j$, then, according to Bayes' theorem, the marginal conditional probability of the class $c$ at the point $z$ can be defined by the formula

$$P(c \mid z) = \frac{\pi_c \, p_c(z)}{\pi_1 p_1(z) + \pi_2 p_2(z)},$$

where $c \in \{1, 2\}$ is the class label and $\pi_1, \pi_2$ are the priors for the two classes, approximated by their frequencies. Then the marginal predictor can be presented as

$$g(z) = \frac{P(1 \mid z)}{\pi_1} - \frac{P(2 \mid z)}{\pi_2},$$

the difference between the ratios of conditional posterior probabilities to prior probabilities in the two classes. If the variable $x^j$ is conditionally independent of the class at $z$, both posterior probabilities at this point are equal to their priors, and $g(z) = 0$. At each point $z$, the value of the marginal predictor function indicates the degree and direction of the local association between the values of the variable $x^j$ and the response variable. Thus, $x^j$ influences the scoring function only at those points which are predictive, and does not participate in ranking otherwise.
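The link between this Bayesian reading and the density-based formula of Step 2 follows from one line of algebra. The derivation below is a reconstruction under assumed notation: $p_1, p_2$ are the class densities on one feature, $\pi_1, \pi_2$ the class frequencies, and $g$ the function computed in Step 2.

```latex
\begin{align*}
\frac{P(1 \mid z)}{\pi_1} - \frac{P(2 \mid z)}{\pi_2}
  &= \frac{1}{\pi_1} \cdot \frac{\pi_1 p_1(z)}{\pi_1 p_1(z) + \pi_2 p_2(z)}
   - \frac{1}{\pi_2} \cdot \frac{\pi_2 p_2(z)}{\pi_1 p_1(z) + \pi_2 p_2(z)} \\
  &= \frac{p_1(z) - p_2(z)}{\pi_1 p_1(z) + \pi_2 p_2(z)} = g(z).
\end{align*}
```

In particular, $g(z) = 0$ exactly when $p_1(z) = p_2(z)$, that is, when the feature value $z$ carries no information about the class.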

The fact that densities for each class are evaluated independently increases robustness of the proposed function.

In our implementation, the parameters of the kernel approximation and LOESS procedures were fixed to ensure maximal smoothness. Thus, unlike most other advanced risk modeling methods, Smooth Rank does not have hyper-parameters to tune.

3 Results on Real Data

3.1 Algorithms under comparison

Smooth Rank is compared with two algorithms which have become standard in survival analysis [4, 5, 15].

3.1.1 Cox proportional hazard regression

In the traditional approach by Sir David Cox [6], the risk of failure is understood as a time-dependent "hazard function" $h(t)$: the instantaneous rate at which an individual who has survived up to time $t$ experiences the event (failure) at time $t$.

Cox proportional hazard (PH) regression is based on the strong assumption that the hazard function has the form

$$h(t \mid x) = h_0(t) \exp(\beta^T x),$$

where $h_0(t)$ is an unknown time-dependent function, common for all individuals in the population. The assumption implies, in particular, that for any two individuals, their hazards are proportional at all times. Accordingly, the result of the modeling is not the time-dependent hazard functions, but rather the "proportionality" scores.
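The proportionality can be made explicit with a short calculation. It assumes the standard form of the Cox model, with baseline hazard $h_0(t)$ and coefficient vector $\beta$; both symbols are notation adopted here for illustration.

```latex
\[
\frac{h(t \mid x_i)}{h(t \mid x_j)}
  = \frac{h_0(t)\, \exp(\beta^T x_i)}{h_0(t)\, \exp(\beta^T x_j)}
  = \exp\!\left( \beta^T (x_i - x_j) \right).
\]
```

The baseline $h_0(t)$ cancels, so the ratio does not depend on $t$, and ranking individuals by the score $\beta^T x$ ranks them by hazard at every time point.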

The method cannot be applied to data with more features than observations ($M > N$).

We use the method’s implementation from the R package survival. The method does not work with missing values.

3.1.2 CoxPath algorithm

The CoxPath [16] algorithm is one of the most popular approaches to regularization of the Cox PH regression. The path algorithm implements $L_1$-penalized Cox regression with a series of values of the regularization parameter $\lambda$. The important property of $L_1$-regularization is that it includes automatic feature selection, and it can work when the number of features exceeds the number of training cases.

The function implemented in the R package glmpath by the method's authors is used here. The function builds regression models at the values of $\lambda$ at which the set of non-zero coefficients changes. For each model, the function outputs the values of three criteria: AIC, BIC, loglik. The AIC criterion was chosen to select the best model for the given training set. We used the default values of the parameters of the coxpath procedure. The method does not work with missing values.

3.2 The Datasets

The following datasets were used for the comparison of methods.

  • BMT: The dataset represents data on 137 bone marrow transplant patients [17]. The data allow modeling of several outcomes. Here, the models are built for disease-free survival time. The first feature is the diagnosis, which has three values: ALL; AML Low Risk; AML High Risk. Other features characterize the demographics of the patient and donor, the hospital, the waiting time for the transplant, and some characteristics of the treatment. There are 11 features overall, two of them nominal.

  • Colon: These are data from one of the first successful trials of adjuvant chemotherapy for colon cancer. Levamisole is a low-toxicity compound previously used to treat worm infestations in animals; 5-FU is a moderately toxic (as these things go) chemotherapy agent. Two outcomes can be modeled: recurrence and death. The data can be found in the R package survival. The features include the treatment (with three options: Observation, Levamisole, Levamisole+5-FU), properties of the tumor, and the number of lymph nodes. There are 11 features and 929 observations in total.

  • Lung1: Survival in patients with advanced lung cancer from the North Central Cancer Treatment Group [18]. Performance scores rate how well the patient can perform usual daily activities. Other features characterize calorie intake and weight loss. The dataset has 228 records with 7 features.

  • Lung2: the dataset from [19]. Along with the patients' performance scores, the features include cell type (squamous, small cell, adeno, and large), type of treatment and prior treatment.

  • BC : Breast cancer dataset [20]. It contains 7 tumor characteristics in 97 records of patients.

  • PBC: These data are from the Mayo Clinic trial in primary biliary cirrhosis of the liver conducted between 1974 and 1984 [21]. Patients are characterized by a standard description of the disease conditions. The dataset has 17 features and 418 observations.

  • Al: The data [22] on 40 patients with diffuse large B-cell lymphoma contain information about 148 gene expressions associated with cell proliferation from lymphochip microarray data. Since there are more features than observations, Cox regression could not be applied to these data.

  • Ro02s: the dataset from [23] contains information about 240 patients with lymphoma. Using hierarchical cluster analysis on the whole dataset and expert knowledge about factors associated with disease progression, the authors identified four relevant clusters and a single gene out of the 7399 genes on the lymphochip. Along with gene expressions, the data include two features for histological grouping of the patients. The authors aggregated the gene expressions in each selected cluster to create signatures of the clusters. The signatures, rather than the gene expressions themselves, were used for modeling. The dataset with aggregated data has 7 features.

  • Ro03g, Ro03s: the data [24] on 92 lymphoma patients. The input variables include data from the lymphochip as well as results of some other tests. The Ro03s data contain averaged values of the gene expressions related to cell proliferation (the proliferation signature). The Ro03g dataset includes the values of the gene expressions in the proliferation cluster, instead of their average. Thus, the Ro03s dataset contains 6 features, and the Ro03g dataset contains 26 features.

3.3 Description of the experiment

For each dataset we made 100 random splits into train and test data in proportion 2:1. For each split, all methods were applied to the train data and tested on the test data. Thus, the splits are the same for all the methods. The performance of each method was evaluated by the average concordance index on the test data.
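The resampling protocol can be sketched as a small generator. The function name, the fixed seed and the rounding of the training-set size are assumptions made for illustration; fixing the seed is one simple way to guarantee that every method sees the same splits.

```python
import random

def repeated_splits(n, n_splits=100, train_fraction=2 / 3, seed=0):
    """Yield (train_indices, test_indices) for repeated random 2:1 splits.

    A fixed seed makes the sequence of splits reproducible, so the same
    splits can be replayed for each method under comparison.
    """
    rng = random.Random(seed)
    n_train = round(n * train_fraction)
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        yield idx[:n_train], idx[n_train:]
```

For a dataset with 137 records this yields 100 splits with 91 training and 46 test records each; the test-set concordance index of each method would then be averaged over the splits.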

Cox PH regression and CoxPath algorithms do not work with missing values, while Smooth Rank does. So, for the first two methods records with missing values were removed. The exception is the data Al, where there are too few records and many missing values. In this dataset 5-nearest neighbors imputation was used for all the methods.

3.4 The results

      Data     N     M     N/M     Smooth Rank      Cox      Cox Path
 1    BMT      137   11    12.4    0.68* (6.4)      0.58     0.58
 2    Colon    929   11    84.4    0.65  (4)        0.66*    0.66*
 3    Lung1    228   7     32.6    0.63* (5.7)      0.62     0.62
 4    Lung2    137   6     22.8    0.73* (2.16)     0.69     0.70
 5    BCW      97    7     13.9    0.71* (5.9)      0.69     0.69
 6    PBC      418   17    24.6    0.83* (12.6)     0.82     0.82
 7    Al       40    148   0.27    0.63* (110)      -        0.52
 8    Ro02s    240   7     34.3    0.70  (7)        0.73*    0.73*
 9    Ro03s    92    6     15.3    0.76* (3)        0.74     0.75
10    Ro03g    92    26    3.54    0.76* (23)       0.58     0.67

Every cell contains the mean value of CI on the test data over 100 random splits in proportion 2:1. The highest values in each row are marked with an asterisk. The number in brackets is the average number of features left after filtering in Smooth Rank.

Table 1: Comparison of Methods on Survival Analysis Data

The results are presented in Table 1. The ratio $N/M$ is included as a measure of dataset scarcity: the smaller the ratio, the less representative (more scarce) the dataset. The table contains the average CI for each method on each dataset.

Table 1 shows that in 8 out of 10 cases Smooth Rank has the best results. Smooth Rank yields to the two other methods on the 2 datasets (lines 2 and 8) with the largest ratio $N/M$.

In the three cases with the lowest ratio $N/M$ (lines 1, 7, 10) the advantage of Smooth Rank is the most prominent. Its CI exceeds that of the better of the other methods by 0.09 to 0.11 (see Table 1).

Note that the dataset Ro03s is a processed version of the dataset Ro03g: Ro03g contains the original values of gene expression, and Ro03s includes aggregated features, "signatures". While Smooth Rank has equally good (the best) results with or without aggregation, the two other methods require preliminary feature aggregation for comparable performance.

The table allows us to contrast two types of regularization: the traditional one, with $L_1$ penalization, and the alternative approach proposed here, which includes the strong regularization.

CoxPath uses penalization and model selection to improve Cox PH regression, but most of the time its accuracy is almost identical to that of Cox PH regression on the given data. A possible explanation is that optimal model selection on the same small training set, as part of the path method, leads to "model selection bias" [25, 26].

Advantages in accuracy of Smooth Rank over other methods indicate superiority of the strong regularization for risk modeling on scarce data.

4 Experiments with Controlled Data Scarcity

To confirm the advantages of Smooth Rank on scarce data, we conducted two series of experiments controlling two aspects of the data scarcity: number of training instances and dimensionality. In the first series with real data, some instances were randomly removed to obtain training sets of various sizes. In the second experiment on artificial data, the number of instances was fixed, but the number of features was gradually increased. The goal was to see the trends in methods performance on the series of modified datasets.

4.1 Experiments with reduced number of training cases

The experiments were conducted on two of the largest real life datasets from our list, PBC and Colon, where missing values were imputed using 5 nearest neighbors.

For each dataset, 20% of the records were randomly selected as test data. The rest of the data were used to randomly draw training sets of five given sizes, 20 random training sets of each size in total. All the methods were trained on each training set and tested on the same test set. The experiment was repeated 10 times, starting with the selection of the test set. So, every method was applied 200 times to training sets of the same size. For each method and each sample size, the average CI on the test data was calculated over all 200 models.

Figures 1 and 2 demonstrate the trends in the methods' performance with an increasing number of instances in the training set. As expected, the average performance of each method improves with the number of training cases. In both cases, Smooth Rank has higher accuracy than the two other methods when the training sets are small. For the Colon dataset, as the number of training instances grows, Cox PH regression and CoxPath surpass Smooth Rank in accuracy.

Figure 1: PBC dataset

Figure 2: Colon dataset

Overall, the experiment confirms advantage of the Smooth Rank on the scarce data.

In particular, on the PBC data, Smooth Rank achieved the same quality of prediction as the two other methods with about half of the training instances. This is an important quality for longitudinal studies, where each individual requires months or years of observation, making it difficult to increase the number of instances for learning.

The experiment confirms our hypothesis that the observed higher accuracy of Cox PH regression and CoxPath on the Colon dataset (see Table 1) can be explained by the larger than usual size of the training sample, not by some specifics of the data. If only a portion of the instances were available, Smooth Rank would be the best method to model the data.

4.2 Experiments with increasing number of features

It is very common in survival analysis with medical applications that, out of all available information, features for modeling are selected because they are known to have some bearing on the risk of failure. The artificial datasets for this experiment were designed to model this type of data.


The risk of each record was generated as the logarithm of a normally distributed random variable. Under the assumption that times of events depend on the risk and on some unknown factors, the time of event was generated from the risk together with a noise term, a random variable with uniform distribution within a fixed interval.

Half of the records were randomly assigned the status "censored". For non-censored observations (events), the target time is the time of event. For the censored observations, target times indicate the end of observation, which happened before the event (failure) occurred. The target times of the censored observations were calculated from the times of events using a random variable uniformly distributed within a fixed interval.

Every feature was generated independently, in the same way as the times of events.

In reality, the features do not always depend on the risk linearly. However, having more complex features could affect methods differently and make the effects of the dimensionality on the performance less clear.

All experiments were conducted on samples with 400 records, which were split into equally sized train and test sets.

For each number of features $M$ that is a multiple of 5 in the interval [5, 75], training and test samples with $M$ features were generated 20 times. For each method, the average performance on the test data was calculated for each dimension of the data.

Fig. 3 shows the results of the experiments. All the methods have similar performance with the smallest number of features, $M = 5$, which is equal to 1/20 of the number of uncensored observations in the training set. Cox PH regression does not benefit from adding more features. CoxPath reaches its top performance at one of the studied dimensionalities (see Fig. 3), and both CoxPath and Cox PH regression tend to lose accuracy as the number of features increases further.

Figure 3: Artificial data

Smooth Rank is the only method which does not show any sign of overfitting in this experiment. It achieves the best accuracy and the largest advantage over the other methods on the datasets with the highest studied dimensionality, $M = 75$.

The difference in accuracy between CoxPath and Smooth Rank for the most scarce datasets is similar in size to the differences we observed in the experiments on real data (see Table 1). The consistency between the simulation and the application of the methods to real data may serve to justify the simulation.

5 Conclusions

In survival analysis studies with medical applications, it is much easier and cheaper to add features than observations. Unknown factors affect measurements and the outcome to a large degree. Small, noisy, high-dimensional data put stringent demands on the robustness of the learning methods.

We proposed two innovations to address this challenge: (1) reduction of the survival analysis to bipartite ranking; (2) a new robust algorithm for bipartite ranking, Smooth Rank. The method does not use multidimensional optimization to avoid the “curse of dimensionality”; it uses smoothing techniques while building marginal predictors.

Advantages of Smooth Rank were proved experimentally, in comparison against the most popular methods for survival analysis (CoxPath and Cox PH regression) in three types of tests. First, the three methods were applied on the 10 real life survival analysis datasets. Then, to systematically study effects of data scarcity on the methods performance, we conducted two computational experiments: with real data, where some instances were randomly removed to produce series of training samples of different sizes, and with artificial data, where the number of features was gradually increased.

All three types of experiments, indeed, demonstrated that Smooth Rank has sizable advantage in accuracy over other two methods on data with smaller number of observations and/or higher dimensionality. The method does not suffer from overfitting where two other methods do. This can make the method a valuable tool in survival analysis studies.

Smooth Rank is a general bipartite ranking method. Even though its creation was motivated by risk modeling, it may be useful in other applications where robustness of learning is critical. Comparison of the method with other (bipartite) ranking algorithms on other applications may be a subject of another study.


  •  1. Harrell FE, Lee KL, Mark DB (1996) Tutorial in biostatistics. Multivariable prognostic models. Statistics in Medicine 15: 361 - 387.
  •  2. Clémençon S, Lugosi G, Vayatis N (2008) Ranking and empirical minimization of u-statistics. Annals of Statistics 36: 844-874.
  •  3. Cortes C, Mohri M (2004) Auc optimization vs. error rate minimization. In: Thrun S, Saul L, Schölkopf B, editors, Advances in Neural Information Processing, Cambridge, MA: MIT Press. pp. 313 - 320.
  •  4. Segal MR (2006) Microarray gene expression data with linked survival phenotypes. Biostatistics 7: 268 - 285.
  •  5. Wieringen W, Kun D, Hampel R, Boulesteix AL (2009) Survival prediction using gene expression data: a review and comparison. Computational Statistics and Data Analysis 53: 1590 - 1603.
  •  6. Cox D (1972) Regression models and life-tables. Journal of Royal Statistical Society Series B (Methodological) 34: 187 - 220.
  •  7. Gui J, Li H (2005) Penalized cox regression analysis in the high-dimensional and low-sample size settings, with applications to microarray gene expression data. Bioinformatics 21: 3001 – 3008.
  •  8. Li H, Luan Y (2003) Kernel cox regression models for linking gene expression profiles to censored survival data. In: Altman R, Dunker AK, Hunter L, editors, Pacific Symposium on Biocomputing, World Scientific. pp. 65 – 75.
  •  9. Clémençon S, Vayatis N (2009) Adaptive estimation of the optimal roc curve and a bipartite ranking algorithm. In: Gavaldà R, Lugosi G, Zeugmann T, Zilles S, editors, Algorithmic learning theory, LNAI 5809, Springer-Verlag. pp. 216-231.
  •  10. Friedman J (1997) On bias, variance, 0/1 loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1: 55 – 77.
  •  11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, et al. (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531 - 537.
  •  12. Hastie T, Tibshirani R, Friedman J (2009) The Elements of Statistical Learning. Springer, NY.
  •  13. Becker RA, Chambers JM, Wilks AR (1988) The New S Language. Wadsworth and Brooks.
  •  14. Cleveland W (1979) Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74(368): 829 – 836.
  •  15. Raz D, Ray MR, Kim JY, He B, Taron M, et al. (2008) A multigene assay is prognostic of survival in patients with early-stage lung adenocarcinoma. Clin Cancer Res 14: 5565-5570.
  •  16. Park M, Hastie T (2007) L1-regularization path algorithm for generalized linear models. J R Statist Soc B 69: 659 - 677.
  •  17. Klein JP, Moeschberger ML (2003) Survival Analysis: Techniques for Censored and Truncated Data. Springer, NY, 2 edition.
  •  18. Loprinzi CL, et al (1994) Prospective evaluation of prognostic variables from patient-completed questionnaires. north central cancer treatment group. Journal of Clinical Oncology 12: 601 - 607.
  •  19. Kalbfleisch J, Prentice R (2002) The Statistical Analysis of Failure Time Data. J. Wiley, Hoboken, N.J.
  •  20. van’t Veer LJ, et al (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530 - 536.
  •  21. Therneau T, Grambsch P (2000) Modeling Survival Data: Extending the Cox Model. Springer-Verlag, New York.
  •  22. Alizadeh A, et al (2000) Distinct types of diffuse large-b-cell lymphoma identified by gene expression profiling. Nature 403: 503 – 511.
  •  23. Rosenwald A, et al (2002) The use of molecular profiling to predict survival after chemotherapy for diffuse large-B-cell lymphoma. New England Journal of Medicine 346: 1937 – 1947.
  •  24. Rosenwald A, et al (2003) The proliferation gene expression signature is a quantitative predictor of oncogenic events that predict survival in mantle cell lymphoma. Cancer Cell : 183–197.
  •  25. Burnham KP, Anderson DR (1998) Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer.
  •  26. Cawley GC, Talbot NL (2010) On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11: 2079–2107.