Super Learning in the SAS system

05/21/2018 ∙ by Alexander P Keil, et al. ∙ 0

Background and objective: Stacking is an ensemble machine learning method that averages predictions from multiple other algorithms, such as generalized linear models and regression trees. A recent iteration of stacking, called super learning, has been developed as a general approach to black box learning and has seen frequent usage, in part due to the availability of an R package. I develop super learning in the SAS software system using a new macro, and demonstrate its performance relative to the R package. Methods: I follow closely previous work using the R SuperLearner package and assess the performance of super learning in a number of domains. I compare the R package with the new SAS macro in a small set of simulations assessing curve fitting in a prediction model, a set of 14 publicly available datasets to assess cross-validated, expected loss, and data from a randomized trial of job seekers' training to assess the utility of super learning in causal inference using inverse probability weighting. Results: Across the simulated data and the publicly available data, the macro performed similarly to the R package, even with a different set of potential algorithms available natively in R and SAS. The example with inverse probability weighting demonstrated the ability of the SAS macro to include algorithms developed in R. Conclusions: The super learner macro performs as well as the R package at a number of tasks. Further, by extending the macro to include the use of R packages, the macro can leverage both the the robust, enterprise oriented procedures in SAS and the nimble, cutting edge packages in R. In the spirit of ensemble learning, this macro extends the potential library of algorithms beyond a single software system and provides a simple avenue into machine learning in SAS.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Supervised machine learning algorithms are emerging as an essential tool for prediction and causal inference in biomedicine. Ensemble machine learning algorithms combine multiple algorithms into a single learner that can improve prediction characteristics such as classification accuracy or mean squared error. One ensemble machine learning method, referred to as stacking, is an approach to combining an arbitrary set of learning algorithms, including other ensemble methodsWolpert (1992); Breiman (1996). A recent approach to stacking, referred to as super learning, has demonstrated theoretical and practical properties that make it an ideal framework for prediction van der Laan et al. (2007); Polley and van der Laan (2010). One of the practical properties is the relative ease of software implementation that has lead to the development of several software packages for super learning. In turn, the availability of software has made the approach relatively simple to implement in some analysis systems Naimi and Balzer (2018).

Existing implementations of Super Learner include an R package Polley et al. (2017), maintained by the developers of the super learning algorithm, and unofficial, open source releases in Python 2Lendle (2014) that has been updated for Python 3 by one of us Keil (2017), and a small open source version in SAS Brooks (2016). Unfortunately, the SAS implementation does not appear to be under active development and has a limited library of algorithms.

I demonstrate usage of super learning in the SAS system using a macro written by us. This macro improves on existing software by providing a general approach to super learning with an extensive existing library that is easily extensible by the user to incorporate new learners, including algorithms from the R programming language. I demonstrate the use of this macro using one simulated example and two examples using real data, and I compare performance with existing implementations. The first two examples closely follow the analyses of Polley and van der Laan, which demonstrated the R SuperLearner package Polley and van der Laan (2010); Polley et al. (2017), while the third example demonstrates some unique features of the SAS macro: namely, the ability to include algorithms from both SAS and R in the same ensemble machine learner.

2 Methods: supervised and super learning

We first provide a brief review of supervised machine learning, and then describe the super learning algorithm.

2.1 Supervised learning

Suppose one is interested in learning about how lung cancer mortality rates vary according to age, smoking, and radon exposures in a population of uranium miners from Colorado in the 1950s. One can frame such learning in terms of a causal inference problem (e.g. what would be the change in the lung cancer mortality rates if one could eliminate smoking among the miners?) or in terms of a prediction problem (e.g. what is the expected lung cancer mortality rate among a non-smoking 70 year old former miner who was exposed at the Mining Safety and Health Administration occupational exposure limit from ages 20 to 65?). Supervised machine learning is one way to use the inputs (smoking, age, radon exposure) and outputs (lung cancer mortality) as a way to describe or uncover patterns in how relates to . The way these variables relate is through a function that yields the average (or ”predicted”) lung cancer mortality rate for a given pattern of smoking and radon exposure at a given age within the context of our study sample . The parameters determine the shape of the function that relates inputs to outputs. For example

could represent log-odds ratios if our function is a logistic regression model or it could represent the rules that define the nodes of classification and regression trees.

Using the study data and a some function (e.g. logistic model, regression tree), one “trains” the parameters of that function by choosing the

that minimize some estimated expected loss function, which could be the negative log-likelihood (as in maximum likelihood estimation) or some other estimated expected loss function such as the squared sum of the fitted residuals


(as in ordinary least-squares). In this context, supervised machine learning is the act of training the parameters such that the function

carries information about how and relate to each other in the study sample. That is, the model is “learned” from the data by minimizing an estimate of an expected loss function.

2.2 Super learning

Super learning is a way of combining multiple functions, or learners, using cross-validation. Precise, theoretic descriptions of super learning are given in van der Laan et al. (2007) and Polley and van der Laan (2010), but I review basic principles here using alternative notation. Let be the number of learners in the library (set) of learners, and index each learner as for . I denote predictions from the

learners by the vector

. For example, M could equal 3 for a continuous and our library could contain a generalized linear model (glm), a generalized additive model (g.a.m.), and a regression tree (tree), and . The super learner prediction is given as a combination of the predictions from the leaners, which can be expressed as in equations 1 and 2.


I adopt Wolpert’s terminology and refer to the functions as “level-0” models, which are essentially regression models for the observed on covariates indexed by parameters ; I refer to the function as a “level-1” model in which the observed is regressed on the set of predictions using a model indexed by parameters Wolpert (1992). The backbone for “stacked generalization” was laid out by Wolpert Wolpert (1992) and developed further by Breiman Breiman (1996) using parametric level-1 models. The algorithm given in equations 1 and 2 was generalized to arbitrary functions and by van der Laan et al., who allowed that

could be, for example, a penalized regression model or a data adaptive approach like random forest

van der Laan et al. (2007)

; this generalization was termed “super learning.” In practice, however, modern super learning algorithms are relatively unchanged from stacking algorithms in place by the late 1990s, which rely on parametric models in which the parameters

form a convex combination (i.e. , and ); thus, super learner predictions can often be expressed as weighted combinations of a set of other machine learning algorithms.

Because we do not know the parameters , we must estimate them. In combination with the level-1 model, -fold cross-validation is used as a way to estimate without over-fitting to the data.

-fold cross-validation proceeds by partitioning the data into equally sized folds, where denotes the th fold, and is the size of the study sample. Typically, is chosen as 10 or 20. I denote a set of cross-validated predictions by . This notation emphasizes that a cross-validated prediction for the th fold results from first training the parameters of the function on the remaining folds, denoted by , and then using the trained value to predict , given the values of in the th fold of the study sample, denoted The final set of cross-validated predictions for learner are given as the set of all N cross-validated predictions (that is, if is a vector of length , then will also be a vector of length ). The coefficients are estimated in a model of the form , which is essentially a fitted regression model identical to the level-1 model above, but cross-validated predictions from each of the level-0 models are used in place of the “true” predictions. That is, the super learning algorithm finds that minimize the estimated expected cross-validated loss function . The final super-learner prediction is made using predictions from the level-1 model with inputs estimated in the full study sample and parameters .

In the remainder of the manuscript, we give three example applications of super learning using the %SuperLearner macro, including one simulation study and two examples with real-world data. The %SuperLearner macro is available from the github page of the author, and version 1.0 of the macro can be enabled in SAS by including the following at the top of a SAS program:


3 Results using the %SuperLearner macro.

3.1 Example 1: Simulation studies

Polley and van der Laan (2010) demonstrated the performance of super learner in a simple simulation problem. This simulation involved learning a regression function characterizing the mean of a continuous variable as a function of a single continuous predictor

, given as a uniform random variable with min=-4, max=4.

was generated under four different scenarios, given by:

Sim 1

Sim 2

Sim 3

Sim 4

The simulations assess out-of-sample prediction accuracy for super learner. Specifically, this set of simulations quantifies how well a learning algorithm with parameters estimated in a training set of N=100 could predict outcomes in a validation data set (with the same data generating mechanism) of size 10,000. The metric used is an estimated statistic given by

Polley and van der Laan showed that the optimal value of (the expected value under the true parametric model) is 0.80 for all four simulations, where is estimated in the validation data. The simulations were each repeated 100 times and was calculated using estimated by super learner as well as for each learner in the super learner library.

I repeated Polley and van der Laan’s original simulation analysis using SAS and R with some modifications: I used the R package ‘xgboost’ for

boosting (rather than ‘gbm’), and b.a.r.t. was dropped from the library (there is no SAS procedure for b.a.r.t.

). The final super learner library contained algorithms for linear regression, linear regression with all first order interaction terms (

glm, glm + intx), random forest, bootstrap aggregation of trees (baggingPeters and Hothorn (2017)), generalized additive models (g.a.m.Hastie (2018)

), gradient boosting (

boostingChen et al. (2018)

), neural networks (

neural netVenables and Ripley (2002)), multivarate adaptive regression splines (m.a.r.s.Kooperberg (2015)), Bayesian additive regression trees (b.a.r.t.Chipman et al. (2010)), and local polynomial regression (loessCleveland (1991)). Variations of some of these algorithms were added to the super learner library: bagging algorithms with complexity parameters set to 0.0, 0.01, and 0.1 were used, as well as one with a mean split size (ms) of 5; g.a.m.

algorithms were created using splines with 2, 3, or 4 degrees of freedom;

neural net algorithms were created with 2,3,4, or 5 hidden nodes; finally loess algorithms were created with smoothing parameters set to 0.75, 0.5, 0.25, or 0.1.

An example call to the %SuperLearner macro for the analysis of the simulated data is given in Figure 1. For each algorithm, and for super learner, I estimated the mean and interquartile range of the estimates across the 100 simulations. As shown in Figure 2, super learner performed equally well in both R and SAS. There were few meaningful differences across software platforms in the performance of individual learners, with the exception of loess, which is likely due to in parameterization of the smoothing kernel.


Figure 1: Calling the %SuperLearner macro to carry out the simulation analysis described in section 3.1
Figure 2: Repeating Polley and van der Laan simulations Polley and van der Laan (2010) assessing the out-of-sample prediction accuracy for super learner and algorithms in the super learner library. The optimal value of is given by the black vertical line, and means are given with solid circles with interquartile ranges given by horizontal lines.

3.2 Example 2: Performance in 14 real world datasets

To extend beyond simple simulation scenarios, I tested the performance of the %SuperLearner macro in 14 real world data sets. The analyses in this section closely follow those reported in section 3.2 of Polley and van der Laan. I used 14 publicly available, real world datasets with continuous target variables and sample sizes between 200 and 700 (Appendix Table A1). The datasets vary in terms of the number of predictors and are from diverse subject matter areas including economic, health, engineering, and biologic data. Some of the datasets used by Polley and van der Laan are no longer available, so only 10 of our 14 datasets were featured in Polley and van der Laan’s analyses.

The objective of these analyses is to assess cross-validated predictive accuracy, which I quantify using 10-fold cross validated mean-squared error (CVMSE). I followed Polley and van der Laan by scaling the CVMSE for each learner by the CVMSE for the generalized linear model using

. To facilitate comparisons of the average performance of each learner, I calculated the geometric mean of

across all datasets for each learner.

In addition to a number of the learners used in the simulation analyses of section 3.1, I also included b spline (basis splines, SAS only), b.a.r.t. (R only), stepwise (step-wise selection of a linear model Hastie and Pregibon (1992)), ridge

(ridge regression

Hastie and Pregibon (1992)), l.a.s.s.o.(least absolute shrinkage and selection operator Friedman et al. (2010)), random forest Liaw and Wiener (2002), bayes glm (R only) s.v.m

(support vector machine regression, R only

Meyer et al. (2017)), and d.s.a. (the deletion/substitution/addition algorithm Molinaro et al. (2010), R only).

In both SAS and R, the super learner algorithm had the lowest (best) average of all the algorithms examined (3). Among the other algorithms, g.a.m. performed well under several different parameterizations. Notably, the SAS version of g.a.m. yielded very few estimates that performed worse than a linear model, and seem to have a more consistent average performance than the R version. In contrast with the findings of Polley and van der Laan, b.a.r.t. did not perform the best among the other algorithms, which may be partly due to differences in the packages used (the b.a.r.t. algorithm used by Polley and van der Laan was not available at the time of this analysis) or simply due to differences in the datasets used for each analysis. Interestingly, the l.a.s.s.o. algorithm performed on average better than the linear model in R, but not in SAS. The SAS and R versions of the l.a.s.s.o. differ in terms of how the model is chosen, so the nominal category of the learner will not necessarily dictate its performance. Some individual algorithmic differences aside, there appeared to be no important difference in average across these 14 datasets between Super Learner implemented with native-SAS procedures and Super Learner implemented with native-R packages. Within a given dataset, however, they may not yield the same result due to differing performance among the available algorithms.

Figure 3: Variant of Polley and van der Laan real data analysis Polley and van der Laan (2010) assessing the 10-fold cross-validated relative mean-squared error (relative to glm) across 14 real datasets, sorted by geometric mean, denoted with a plus (+) sign.

3.3 Example 3: Super learner estimates of inverse probability weights

One recent application of prediction algorithms is inverse probability weighting. Inverse probability weighting has been used for causal effect estimation in observational studies Robins et al. (2000), generalizing study results from randomized trials Cole and Stuart (2010) and observational studies Keil and Richardson (2017), and addressing potential bias from non-compliance and dependent censoring in clinical trials Robins and Finkelstein (2000); Cain and Cole (2009).

Under assumptions of exchangeability, consistency, positivity, no interference, no measurement error, and correct model specification, inverse probability weighting can be used to correct for selection bias from informative censoring in order to estimate the per-protocol (i.e. perfect adherence) effect of a treatment in a randomized trial. Super learner provides one potential avenue to relax the statistical assumption of correct model specification in the spirit of other machine learning approaches to estimate propensity scores Lee et al. (2010); Westreich et al. (2010); Watkins et al. (2013).

I used the %SuperLearner macro to estimate inverse-probability-of-treatment and inverse-probability-of-censoring weights in a randomized trial of a job training protocol among patients with substance abuse disorders. This trial was previously described in detail Svikis et al. (2012). Briefly, 628 un- or under-employed patients seeking treatment for substance abuse disorders were randomized either to receive standard of care or to be offered weekly training via a series of small group sessions. Participants were assessed at randomization, as well as 4, 12, and 24 weeks (approximately) following randomization to enquire about current and former employment status, current and prior substance use (including a testing program), reading skills, and demographics. Over the course of the study, 127(20%) participants were lost-to-follow-up.

In our analysis, I included only individuals with non-missing baseline values for race, self-reported illicit substance use, or reading scores. I censored individuals after their first missed study visit or drug test, and individuals were censored at the time of the scheduled (missed) visit or test. The outcome of interest () is the first self-reported change in employment (gaining employment). Specific dates of employment were not available, so the time of the job change is recorded to be the time of the interview. The treatment of interest is participation in the small group job-training sessions. The target parameters are the study arm specific cumulative probabilities of gaining employment during the study period, had all participants remained under follow-up and compliant to the study arm to which they were assigned. Such estimands are often referred to as the “per-protocol” effects.

3.3.1 Estimating the probability of gaining employment

Inverse probability weighting was used to adjust for chance baseline imbalances in randomization by measured variables (treatment weights Greenland and Robins (2009)) and informative loss-to-follow-up by measured, possibly time-varying variables (censoring weights Robins and Finkelstein (2000)). Using these weights, I estimated non-parametric cumulative distribution curves that estimate the per-protocol cumulative probability of obtaining employment in each study arm Cole and Hernán (2004), and I compared these curves with the estimated curves from an intent-to-treat (ITT) analysis which is a crude comparison of estimated probabilities of employment across study arms.

3.3.2 Estimating inverse probability weights using super learner

Following the notation of Section 2, inverse probability of treatment weights were based on estimated probabilities derived from (where is treatment arm) and predictions from each learner depended baseline variables given by age, self-reported number of jobs ever worked, WRAT subtest scores for word and letters, any work in the last 4 weeks, race, the number of positive drug tests (out of 10 different substances), and self-reported ever-use of cocaine or heroin. The time-varying censoring weights were based on (where is a censoring indicator at time ). Predictions from each learner depended on all variables included in treatment weights as well as the most recent number of positive drug tests, study arm and day of the study as predictors.

The library of learners for both treatment and censoring weights comprise a logistic model (glm), b.a.r.t., m.a.r.s., boosting, bagging, stepwise, and neural net. I expanded the neural net

algorithms in the previous examples to potentially include multiple hidden layers (indexed by #layers, #nodes per layer), which is a simple version of “deep learning

LeCun et al. (2015)

. The weight used in analysis was calculated using the product of treatment and censoring weights, and weights were stabilized by multiplying by the crude probabilities of treatment and censoring. The loss function for both treatment and censoring level-1 models was the negative of the binomial likelihood with a logit link and

were constrained to be a convex combination.

A notable feature of this analysis is that the b.a.r.t. algorithm fit via an R routine. However, using the RLANG system option in SAS, it can be called via the %SuperLearner macro using the library member ‘r_bart’ (as can R versions of g.a.m., s.v.m., random forest, boosting, bagging, and m.a.r.s.). Thus, the %SuperLearner macro subsumes many of the learners available in the R SuperLearner package.

3.3.3 Results

As shown in Table 1, the learning algorithms that contributed the greatest to the super learner fit were l.a.s.s.o. (treatment weights) and m.a.r.s. (censoring weights). The strong influence of l.a.s.s.o. on the super learner fit for predicting treatment coincides with prior knowledge that most baseline covariates would not contribute to predictions of treatment (on the logit scale) because treatment is randomized. The other explicit variable selection algorithm in the library, stepwise, had the second lowest expected cross-validated loss. Such prior knowledge is not available for predicting censoring. Interestingly, b.a.r.t. demonstrated the lowest expected cross-validated loss of all learners in the library, but did not contribute substantially to the super learner fit. Intuition about regression modeling may provide one explanation: regressors that demonstrate high bivariate correlation with the dependent variable may not display high correlation conditional on other regressors. That is, this may simply be a level-1 model analogue of Simpson’s paradox Hernán et al. (2011).

The estimated 10-fold cross validated expected loss function was lowest for l.a.s.s.o. for the treatment weights and was lowest for super learner for the censoring weights. Notably, the method used to fit the level-1 model of super learner had a noticeable impact on whether or not super learner achieved the minimum (or near minimum) estimated cross-validated expected loss among the set of learners. For example, using the “NNLS” (the R SuperLearner package default: non-negative, normalized least squares) method resulted in expected cross-validated loss for super learner that was higher than 8/14 learners in the library for prediction of censoring (Appendix Table A2). This result emphasizes the utility of using an additional layer of cross-validation to assess whether super learner adequately fits the data. It also underscores the continued importance of the analyst in evaluating the fit and plausibility of findings from automated algorithms such as super learner Keil and Edwards (2018).

Treatment Censoring
model model
Learner Loss Loss
super learner 0.695
glm 0.20 0.703 0.00 0.292
b.a.r.t. 0.00 0.696 0.00 0.277
m.a.r.s. 0.10 0.698 0.36 0.280
boosting 0.00 0.748 0.00 0.290
bagging 0.18 0.705 0.00 0.280
stepwise 0.00 0.697 0.00 0.292
random forest 0.00 0.717 0.00 0.288
l.a.s.s.o. 0.51 0.00 0.290
naive Bayes 0.00 0.700 0.19 0.623
s.v.m. 0.00 0.816 0.18 0.367
neural net (1,5) 0.00 0.714 0.00 0.284
neural net (1,15) 0.00 0.699 0.05 0.286
neural net (2,15) 0.00 0.698 0.12 0.284
random forest 2 0.02 0.732 0.09 0.303

Estimated expected 10-fold cross-validated loss: or ; given as the negative of the binomial likelihood with a logit link. Learners with lowest expected loss are denoted in bold.
using the “dbarts” package in R
Random forest using sampling with replacement (SAS default is sampling without replacement)

Table 1: Super learner coefficients for the level-1 model, and 10-fold cross validated loss functions for each learner

The mean (sd) of the stabilized weights (estimated using super learner) was 0.93 (0.13), while weights estimated using only logistic models for treatment and censoring had a mean (sd) of 1.03 (0.40). Ideally, a mean stabilized weight will equal 1.0, though it has previously been reported that machine learning methods to estimate propensity scores have resulted in stabilized weights that deviated substantially from 1.0 but nonetheless resulted in treatment effect estimates with low bias relative to misspecified parametric models Lee et al. (2010). As demonstrated in figure 4, the per-protocol effect of the intervention did not differ appreciably from the intent to treat analysis. Based on the small difference between the intent to treat analysis and the per-protocol analysis, bias due to informative dropout by the variables I included as covariates did not appear to be meaningfully large. Statistical analysis using weighted Cox models agreed with graphical analysis which suggested a larger effect of treatment in the latter half of the study (not shown).

Figure 4: Cumulative probability of self-reported employment in a randomized trial of the Job Seekers’ Workshop in patients with substance abuse disorders.

4 Conclusions

While the SuperLearner package has been available in R since 2010, there has been no such facility that is generally available in the SAS software system. With the addition of the SAS Enterprise Miner software, the availability of machine learning algorithms in SAS has warranted a need for a way to combine inference from both parametric and data adaptive models. The %SuperLearner macro fills this gap and, as demonstrated, performs similarly to the R package in a number of problems, even with a different set of natively available machine learning algorithms in each software system.

The reliance on SAS places some constraints on the available features of the %SuperLearner macro. Namely, while the %SuperLearner macro can be used to train super learner in one dataset in order to make predictions in another (as in the simulation example shown in figure 1), these processes must currently be done simultaneously. The R package, on the other hand, allows one to save a trained super learner model for making predictions at a later time. The procedural programming oriented approach of SAS make such a feature difficult to implement in SAS. Further, many of the procedures underlying machine learning algorithms in SAS require the (paid) installation of SAS Enterprise Miner (e.g. random forest and neural net) and even basic implementations require SAS/STAT and SAS/OR, which may or may not be included in some SAS installations.

Limitations notwithstanding, the %SuperLearner macro is a power tool that can draw on both the SAS and R systems for machine learning algorithms. Thus, in the spirit of ensemble machine learning algorithms, this approach is appealing in the number of different learners that can be implemented as part of the super learner library. This macro draws on a number of strengths from the SAS system, including the robust, enterprise oriented procedures, by-group processing, and the default capability to handle datasets that are too large to fit in memory. Rather than replacing the SuperLearner package in R, this macro provides a valuable alternative to researchers more familiar with the SAS system or who use SAS due to enterprise features or collaborative ease.



  • Wolpert (1992) D. H. Wolpert, Stacked generalization, Neural networks 5 (1992) 241–259.
  • Breiman (1996) L. Breiman, Stacked regressions, Machine learning 24 (1996) 49–64.
  • van der Laan et al. (2007) M. J. van der Laan, E. C. Polley, A. E. Hubbard, Super learner, Report, Division of Biostatistics, University of California, Berkeley, 2007.
  • Polley and van der Laan (2010) E. C. Polley, M. J. van der Laan, Super learner in prediction, Report, Division of Biostatistics, University of California, Berkeley, 2010.
  • Naimi and Balzer (2018) A. I. Naimi, L. B. Balzer, Stacked generalization: an introduction to super learning, Eur J Epidemiol (2018).
  • Polley et al. (2017) E. Polley, E. LeDell, C. Kennedy, M. van der Laan, SuperLearner: Super Learner Prediction, 2017. URL:, r package version 2.0-22.
  • Lendle (2014) S. Lendle, Supy learner, 2014. URL:
  • Keil (2017) A. P. Keil, Supy learner (python 3), 2017. URL:
  • Brooks (2016) J. Brooks, Superlearner sas macro, 2016. URL:
  • Peters and Hothorn (2017) A. Peters, T. Hothorn, ipred: Improved Predictors, 2017. URL:, r package version 0.9-6.
  • Hastie (2018) T. Hastie, gam: Generalized Additive Models, 2018. URL:, r package version 1.15.
  • Chen et al. (2018) T. Chen, T. He, M. Benesty, V. Khotilovich, Y. Tang, xgboost: Extreme Gradient Boosting, 2018. URL:, r package version
  • Venables and Ripley (2002) W. N. Venables, B. D. Ripley, Modern Applied Statistics with S, fourth ed., Springer, New York, 2002. URL:, iSBN 0-387-95457-0.
  • Kooperberg (2015) C. Kooperberg, polspline: Polynomial Spline Routines, 2015. URL:, r package version 1.1.12.
  • Chipman et al. (2010) H. A. Chipman, E. I. George, R. E. McCulloch, Bart: Bayesian additive regression trees, The Annals of Applied Statistics 4 (2010) 266–298.
  • Cleveland (1991) W. S. Cleveland, Local regression models, Statistical Models in S. (1991).
  • Hastie and Pregibon (1992) T. J. Hastie, D. Pregibon, Generalized linear models. Chapter 6 of Statistical Models in S, Wadsworth & Brooks/Cole, 1992.
  • Friedman et al. (2010) J. Friedman, T. Hastie, R. Tibshirani, Regularization paths for generalized linear models via coordinate descent, Journal of Statistical Software 33 (2010) 1–22.
  • Liaw and Wiener (2002) A. Liaw, M. Wiener, Classification and regression by randomforest, R News 2 (2002) 18–22.
  • Meyer et al. (2017)

    D. Meyer, E. Dimitriadou, K. Hornik, A. Weingessel, F. Leisch, e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien, 2017. URL:, r package version 1.6-8.
  • Molinaro et al. (2010) A. M. Molinaro, K. Lostritto, M. J. van der Laan, partdsa: Deletion/substitution/addition algorithm for partitioning the covariate space in prediction, Bioinformatics 26 (2010) 1357–63.
  • Robins et al. (2000) J. M. Robins, M. A. Hernán, B. Brumback, Marginal structural models and causal inference in epidemiology, Epidemiology 11 (2000) 550–60.
  • Cole and Stuart (2010) S. R. Cole, E. A. Stuart, Generalizing evidence from randomized clinical trials to target populations: The actg 320 trial, American journal of epidemiology 172 (2010) 107–115.
  • Keil and Richardson (2017) A. P. Keil, D. B. Richardson, Quantifying cancer risk from radiation, Risk Analysis (2017).
  • Robins and Finkelstein (2000) J. M. Robins, D. M. Finkelstein, Correcting for noncompliance and dependent censoring in an aids clinical trial with inverse probability of censoring weighted (ipcw) log-rank tests, Biometrics 56 (2000) 779–788.
  • Cain and Cole (2009) L. E. Cain, S. R. Cole, Inverse probability-of-censoring weights for the correction of time-varying noncompliance in the effect of randomized highly active antiretroviral therapy on incident aids or death, Statistics in medicine 28 (2009) 1725–1738.
  • Lee et al. (2010) B. K. Lee, J. Lessler, E. A. Stuart, Improving propensity score weighting using machine learning, Statistics in medicine 29 (2010) 337–346.
  • Westreich et al. (2010) D. Westreich, J. Lessler, M. J. Funk,

    Propensity score estimation: neural networks, support vector machines, decision trees (cart), and meta-classifiers as alternatives to logistic regression,

    Journal of clinical epidemiology 63 (2010) 826–833.
  • Watkins et al. (2013) S. Watkins, M. Jonsson-Funk, M. A. Brookhart, S. A. Rosenberg, T. M. O’Shea, J. Daniels, An empirical comparison of tree-based methods for propensity score estimation, Health Serv Res 48 (2013) 1798–817.
  • Svikis et al. (2012) D. S. Svikis, L. Keyser-Marcus, M. Stitzer, T. Rieckmann, L. Safford, P. Loeb, T. Allen, C. Luna-Anderson, S. E. Back, J. Cohen, et al., Randomized multi-site trial of the job seekers’ workshop in patients with substance use disorders, Drug & Alcohol Dependence 120 (2012) 55–64.
  • Greenland and Robins (2009) S. Greenland, J. M. Robins, Identifiability, exchangeability and confounding revisited, Epidemiol Perspect Innov 6 (2009) 4.
  • Cole and Hernán (2004) S. R. Cole, M. A. Hernán, Adjusted survival curves with inverse probability weights, Comput Methods Programs Biomed 75 (2004) 45–9.
  • LeCun et al. (2015) Y. LeCun, Y. Bengio, G. Hinton, Deep learning, nature 521 (2015) 436.
  • Hernán et al. (2011) M. A. Hernán, D. Clayton, N. Keiding, The simpson’s paradox unraveled, Int J Epidemiol 40 (2011) 780–5.
  • Keil and Edwards (2018) A. P. Keil, J. K. Edwards, You are smarter than you think:(super) machine learning in context, European journal of epidemiology (2018) 1–4.
  • Cook and Weisberg (2009) R. D. Cook, S. Weisberg, An introduction to regression graphics, volume 405, John Wiley & Sons, 2009.
  • Penrose et al. (1985) K. W. Penrose, A. Nelson, A. Fisher, Generalized body composition prediction equation for men using simple measurement techniques, Medicine & Science in Sports & Exercise 17 (1985) 189.
  • James et al. (2013) G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning, volume 112, Springer, 2013.
  • Berndt (1991) E. R. Berndt, The practice of econometrics: classic and contemporary, Addison Wesley Publishing Company, 1991.
  • Kibler et al. (1989) D. Kibler, D. W. Aha, M. K. Albert, Instance-based prediction of real-valued attributes, Computational Intelligence 5 (1989) 51–57.
  • Harrell (2001) F. E. Harrell, Regression modeling strategies, with applications to linear models, survival analysis and logistic regression, GET ADDRESS: Springer (2001).
  • Chu (2001) S. Chu, Pricing the c’s of diamond stones, Journal of Statistics Education 9 (2001).
  • Rosner (2015) B. Rosner, Fundamentals of biostatistics, Nelson Education, 2015.
  • Harrison Jr and Rubinfeld (1978) D. Harrison Jr, D. L. Rubinfeld, Hedonic housing prices and the demand for clean air, Journal of environmental economics and management 5 (1978) 81–102.
  • Cook (2009) R. D. Cook, Regression graphics: Ideas for studying regressions through graphics, volume 482, John Wiley & Sons, 2009.
  • Gelman et al. (2014) A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin, Bayesian data analysis, volume 2, CRC press Boca Raton, FL, 2014.
  • Carroll et al. (1997) R. J. Carroll, D. Ruppert, L. A. Stefanski, J. Buonaccorsi, Measurement error in nonlinear models, Metrika 45 (1997) 182–183.
  • Western (1996) B. Western, Vague theory and model uncertainty in macrosociology, Sociological Methodology (1996) 165–192.

5 Appendix


Figure A1: Calling the advanced %_SuperLearner macro to estimate probability of censoring to estimate the per protocol effect of a job skill workshop on employment among individuals with substance abuse disorders, as described in section 3.3
Data N p Citation
ais 202 10 Cook and Weisberg (2009)
bodyfat 252 14 Penrose et al. (1985)
cholesterol 297 13 James et al. (2013)
cps78 550 18 Berndt (1991)
cps85 534 18 Berndt (1991)
cpu 209 6 Kibler et al. (1989)
diabetes 375 15 Harrell (2001)
diamond 308 17 Chu (2001)
fev 654 4 Rosner (2015)
house 506 13 Harrison Jr and Rubinfeld (1978)
mussels 201 3 Cook (2009)
presidential 591 20 Gelman et al. (2014)
sat 339 4 Carroll et al. (1997)
strike 625 6 Western (1996)

number of predictors
does not appear in Polley and van der Laan (2010).

Table A1: 14 Real-world datasets used to evaluate super learning of a continuous variable in SAS and R
Treatment Censoring
model model
Learner Loss Loss
super learner 0.250 0.079
glm 0.32 0.251 0.00 0.079
b.a.r.t. 0.00 0.251 0.00
m.a.r.s. 0.13 0.251 0.41 0.079
boosting 0.00 0.273 0.00 0.081
bagging 0.13 0.255 0.00 0.079
stepwise 0.00 0.252 0.00 0.079
random forest 0.00 0.260 0.04 0.080
l.a.s.s.o. 0.40 0.00 0.079
naive Bayes 0.00 0.253 0.14 0.221
s.v.m. 0.00 0.296 0.00 0.084
neural net (1,5) 0.00 0.257 0.00 0.079
neural net (1,15) 0.00 0.252 0.00 0.079
neural net (2,15) 0.00 0.252 0.40 0.078
random forest 2 0.02 0.266 0.02 0.083

Estimated expected 10-fold cross-validated loss: or ; given as the mean squared error. Learners with lowest expected loss are denoted in bold.
using the “dbarts” package in R
Random forest using sampling with replacement (SAS default is sampling without replacement)

Table A2: Super learner coefficients for the level-1 model, and 10-fold cross validated loss functions for each learner, using the method=NNLS in the %SuperLearner call