1 Introduction
In many scientific fields, statistical models have become a frequently used tool to approach research questions. Choosing the most appropriate model(s), i.e. the model(s) best supported by the data, however, can be difficult. Especially in ecological, sociological and psychological research, where data are often sparse while systems are complex, the evidence for one particular statistical model may not be exclusive. Alternative models that fit the data comparably well but yield considerably different predictions are ubiquitous (e.g. Draper 1995, Madigan et al. 1995, Raftery 1996).
In ecological and evolutionary research, statistical procedures are dominated by an “information-theoretical approach” (Burnham & Anderson 1998, 2002), which essentially means that model fit is assessed by Akaike’s Information Criterion (defined as $\mathrm{AIC} = -2\ell + 2p$, with $\ell$ the maximised log-likelihood and $p$ the number of parameters: Akaike 1973). In this field, the AIC has become the paradigmatic standard for model selection of likelihood-based models as well as for determination of model weights in model averaging (e.g. Diniz-Filho et al. 2008, Mundry 2011, Hegyi & Garamszegi 2011). In recent years, and particularly in the field of species distribution analyses, non-parametric, likelihood-free approaches (‘machine learning’) have become more prevalent and in comparisons typically show better predictive ability (e.g. Recknagel 2001, Elith et al. 2006, Olden et al. 2008, Elith & Leathwick 2009). However, these approaches do not allow the computation of an AIC, because many such ‘black-box’ methods are not likelihood-based, nor can one readily account for model complexity, as the number of parameters does not reflect the effective degrees of freedom (a phenomenon Elder (2003) calls the “paradox of ensembles”). Thus, at the moment ecologists are faced with the dilemma of either following the AIC paradigm, which essentially limits their toolbox to GLMs and GAMs, or using machine-learning tools and closing the AIC door. From a statistical point of view, dropping an AIC-based approach to model selection and averaging is no loss, as an alternative approximation of the Kullback-Leibler distance is possible through cross-validation.
In this study, we explore a potential avenue to unite AIC and machine learning, based on the concept of Generalised Degrees of Freedom (GDF). We explore the computation of GDFs for Gaussian and Bernoulli-distributed data as plug-in estimates of the number of parameters in a model. We also compare such a GDF-AIC-based approach with cross-validation.
The paper is organised as follows. We first review the Generalised Degrees of Freedom concept and relate it to the degrees of freedom in linear models. Next, we briefly illustrate the relation between cross-validation and Kullback-Leibler divergence, as KL also underlies the derivation of the AIC. Through simulation and real data we then explore the computation of GDFs for a variety of modelling algorithms and its stability. The paper closes with a comparison of GDF-based AIC and cross-validation-derived deviance, and comments on computational efficiency of the different approaches and consequences for model averaging.
1.1 Generalised Degrees of Freedom
Generalised Degrees of Freedom (GDF), originally proposed by Ye (1998) and illustrated for machine learning by Elder (2003), can be used as a measure of model complexity through which we can make information-theoretical approaches applicable to black-box algorithms.
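In order to understand the conceptual properties of this method, we can make use of a somewhat simpler version of degrees of freedom (df). For a linear model $m$, $\hat{y} = Hy$, df are computed as the trace of the hat (or projection) matrix $H$, with elements $h_{ij}$ (e.g. Hastie et al. 2009, p. 153):
$$\mathrm{df} = \mathrm{trace}(H) = \sum_{i=1}^{N} h_{ii} = \sum_{i=1}^{N} \frac{\partial \hat{y}_i}{\partial y_i} \tag{1}$$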
For a linear model this is the number of its independent parameters, i.e. the rank of the model. To expand this concept to other, non-parametric methods, making it in fact independent of the actual fitted model, the definition above provides the basis for a generalised version of degrees of freedom. Thus, and according to Ye (1998),
$$\mathrm{GDF} = \sum_{i=1}^{N} \frac{\partial \hat{y}_i}{\partial y_i} \tag{2}$$
where $\hat{y}_i$ is the fitted value. That is to say, a model is considered the more complex the more adaptively it fits the data, which is reflected in higher GDF. Or in the words of Ye (1998, p. 120): GDF quantifies the “sensitivity of each fitted value to perturbations in the corresponding observed value”. For additive error models (i.e. $y = f(x) + \varepsilon$, with $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$), we can use the specific definition that
$$\mathrm{GDF} = \frac{\sum_{i=1}^{N} \mathrm{cov}(\hat{y}_i, y_i)}{\sigma_\varepsilon^2} \tag{3}$$
(Hastie et al. 2009, p. 233). Beyond the additive-error model, however, we have to approximate eqn 2 in other ways. By fitting the model with perturbed $y$ and assessing the response in $\hat{y}$ we can evaluate $\partial \hat{y}_i / \partial y_i$, so that
$$\mathrm{GDF} \approx \sum_{i=1}^{N} \frac{\hat{y}'_i - \hat{y}_i}{y'_i - y_i} \tag{4}$$
where $\hat{y}'_i$ is the fitted value of the model refitted to the perturbed data, and $y'_i$ is perturbed in such a way that, for normally distributed data, $y'_i = y_i + \delta_i$ with $\delta_i \sim N(0, \sigma_\delta^2)$ and $\sigma_\delta^2$ being small relative to the variance of $y$.
In order to adapt GDF to binary data we need to reconsider this procedure. Since a perturbation cannot be achieved by adding small random noise to $y$, the only possibility is to replace 0s by 1s and vice versa. A perturb-all-data-at-once approach (Elder 2003) as for Gaussian data is thus not feasible. We explore ways to perturb Bernoulli-distributed $y$ below.
The equivalence of GDF and df in the linear model encourages us to use GDF as a plug-in estimator of the number of parameters in the AIC computation.
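As a small illustration of the linear-model case, the following sketch computes the diagonal of the hat matrix for a straight-line regression (intercept plus slope) and sums it to obtain the degrees of freedom. The covariate values and the function name `hat_diagonal` are illustrative assumptions, not taken from our analyses; the closed form $h_{ii} = 1/N + (x_i - \bar{x})^2 / \sum_j (x_j - \bar{x})^2$ is the standard result for simple linear regression.

```python
def hat_diagonal(x):
    """Diagonal elements h_ii of the hat matrix for the model y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    return [1.0 / n + (xi - x_mean) ** 2 / sxx for xi in x]


# Illustrative covariate values (hypothetical, for demonstration only).
x = [0.5, 1.0, 2.0, 3.5, 4.0, 6.0, 7.5]
h = hat_diagonal(x)

# trace(H): equals the number of fitted parameters (intercept and slope).
df = sum(h)
print(round(df, 10))
```

Each $h_{ii}$ is the sensitivity of the fitted value $\hat{y}_i$ to its own observation $y_i$, which is the quantity that GDF generalises to models without an explicit hat matrix.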
1.2 Cross-validation and a measure of model complexity
Cross-validation is a standard approach to quantify predictive performance of a model, automatically accounting for model complexity (by yielding lower fits for too simple and too complex models, e.g. Wood 2006, Hastie et al. 2009). Because each model is fitted repeatedly, cross-validation is computationally more expensive than the AIC, but the same problem arises for GDF: it requires many simulations to estimate it stably (see below).
We decided to use the log-likelihood as measure of fit for the cross-validation (Horne & Garton 2006), with the following reasoning. Let $f_\theta(y)$ denote the model density of our data, $y$, and let $f_t(y)$ denote the true density. Then the Kullback-Leibler (KL) divergence (Kullback & Leibler 1951) is
$$\mathrm{KL} = \int \left[\log f_t(y) - \log f_\theta(y)\right] f_t(y)\, dy \tag{5}$$
AIC is an estimate of this value (in the process taking an expectation over the distribution of $\hat\theta$). Now $\int \log f_t(y)\, f_t(y)\, dy$ is independent of the model, and the same for all models under consideration (and hence gets dropped from the AIC). So the important part of the KL-divergence is
$$-\int \log f_\theta(y)\, f_t(y)\, dy \tag{6}$$
that is, the (negative) expected model log-likelihood, where the expectation is over the distribution of data not used in the estimation of $\hat\theta$. Expression 6 can be estimated by cross-validation:
$$\ell_{CV} = \sum_{i=1}^{K} \log f_{\hat\theta^{[-i]}}\left(y^{[i]}\right) \tag{7}$$
where $\ell_{CV}$ is the sum over $K$ folds of the log-likelihood of the test subset $y^{[i]}$, given the trained model $f_{\hat\theta^{[-i]}}$, with parameter estimates $\hat\theta^{[-i]}$. To put this on the same scale as AIC we multiply by $-2$ to obtain the cross-validated deviance.
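The cross-validated log-likelihood can be sketched in a few lines for a deliberately simple case. The example below assumes a straight-line Gaussian model fitted by least squares; the simulated data, the helper names (`fit_line`, `gaussian_loglik`, `cv_loglik`) and the random assignment of folds are illustrative assumptions, not the code used for the analyses in this paper.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept and slope, plus the ML residual SD, for y = a + b*x."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(rss / n)


def gaussian_loglik(x, y, a, b, sigma):
    """Log-likelihood of the pairs (x, y) under y ~ N(a + b*x, sigma^2)."""
    return sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
               - (yi - a - b * xi) ** 2 / (2 * sigma ** 2)
               for xi, yi in zip(x, y))


def cv_loglik(x, y, n_folds=10, seed=1):
    """Sum over folds of the held-out log-likelihood under the model trained on the rest."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    total = 0.0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b, s = fit_line([x[i] for i in train], [y[i] for i in train])
        total += gaussian_loglik([x[i] for i in test], [y[i] for i in test], a, b, s)
    return total


# Hypothetical simulated data, for illustration only.
rng = random.Random(42)
x = [rng.uniform(0, 10) for _ in range(60)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 1) for xi in x]

l_m = gaussian_loglik(x, y, *fit_line(x, y))  # in-sample maximum log-likelihood
l_cv = cv_loglik(x, y)                        # cross-validated log-likelihood
print(l_m, l_cv, -2 * l_cv)                   # last value: cross-validated deviance
```

The held-out log-likelihood is evaluated with parameters estimated without the test fold, which is what makes it an estimate of out-of-sample fit rather than of in-sample fit.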
Assuming that AIC and (leave-one-out) cross-validation are asymptotically equivalent (Stone 1977), and given the definition of AIC, we argue that it should be possible to extract a measurement from $\ell_{CV}$ that quantifies model complexity. Hence,
$$-2\ell_{CV} \approx \mathrm{AIC} = -2\ell_m + 2p_{CV} \quad\Rightarrow\quad p_{CV} = \ell_m - \ell_{CV} \tag{8}$$
with $p_{CV}$ representing (estimated) model complexity, $\ell_m$ the maximum log-likelihood of the original (non-cross-validated) model and $N$ the number of data points. Embracing the small sample-size bias adjustment of AIC (Sugiura 1978, Hurvich & Tsai 1989), i.e. setting $-2\ell_{CV} \approx \mathrm{AICc} = -2\ell_m + 2p_{CV} + 2p_{CV}(p_{CV}+1)/(N - p_{CV} - 1) = -2\ell_m + 2p_{CV}N/(N - p_{CV} - 1)$ and solving for $p_{CV}$, we get:
$$p_{CV} = \frac{(\ell_m - \ell_{CV})(N - 1)}{\ell_m - \ell_{CV} + N} \tag{9}$$
Thus, we can compute both a cross-validation-based deviance that should be equivalent to the AIC, $-2\ell_{CV}$, as well as a cross-validation alternative to GDF, $p_{CV}$, based on the degree of overfitting of the original model (eqns. 8 and 9).
This approach to computing model complexity is not completely unlike that of the DIC (Spiegelhalter et al. 2002), where $p_D$ represents the effective number of parameters in the model and is computed as the mean of the deviances in an MCMC-analysis minus the deviance of the mean estimates: $p_D = \overline{D(\theta)} - D(\bar\theta)$ (Wood 2015). In eqn. 8 the likelihood estimate $\ell_m$ plays the role of $D(\bar\theta)$, while the cross-validation estimates $\overline{D(\theta)}$.
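The two complexity estimates of eqns. 8 and 9 amount to simple arithmetic on the in-sample and cross-validated log-likelihoods. The sketch below assumes the usual AICc penalty $2p + 2p(p+1)/(N-p-1)$; the function names and the numerical values are hypothetical and serve only to check that the small-sample solution reproduces the cross-validated deviance when plugged back in.

```python
def p_cv(l_m, l_cv, n=None):
    """Cross-validation-based model complexity.

    Without n: AIC-based estimate, p = l_m - l_cv.
    With n:    AICc-based estimate, p = (l_m - l_cv) * (n - 1) / (l_m - l_cv + n).
    """
    d = l_m - l_cv
    if n is None:
        return d
    return d * (n - 1) / (d + n)


def aicc_penalty(p, n):
    """Penalty term of AICc for p (effective) parameters and n data points."""
    return 2 * p + 2 * p * (p + 1) / (n - p - 1)


# Hypothetical values, for illustration only.
l_m, l_cv, n = -80.0, -84.5, 50

p_large = p_cv(l_m, l_cv)       # AIC-based estimate
p_small = p_cv(l_m, l_cv, n)    # AICc-based estimate

# Plugging p_small back into AICc should return the cross-validated deviance.
print(p_large, p_small, -2 * l_m + aicc_penalty(p_small, n), -2 * l_cv)
```

For large $N$ the two estimates converge, since $(N-1)/(\ell_m - \ell_{CV} + N)$ approaches 1.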
2 Implementing and evaluating the GDF approach for normally and Bernoulli-distributed data
We analysed simulated and real data sets using five different statistical models. Then we computed their GDF, disturbing data points at a time, with different intensity (only for normal data), and for different values of .
2.1 Data: simulated and real
First, we evaluated the GDF approach on normally distributed data following Elder (2003), deliberately using a relatively small data set: . The response was simulated as (, with values of 10 and 10, respectively), with . (This simulation was repeated to control for possible influences of the data itself on the resulting GDF, but results were near-identical.)
We simulated binary data with (effective sample size ESS) and , (with values of , 5, , 10, respectively), with .
Our real-world data comprised a fairly small data set (, ESS), showing the occurrence of sperm whales (Physeter macrocephalus) around Antarctica (collected during cetacean IDCR-DESS SOWER surveys), and a larger global occurrence data set (, ESS) for the red fox (Vulpes vulpes, provided by the International Union for Conservation of Nature (IUCN)). We pre-selected predictors as described in Dormann & Kaschner (2010) and Dormann et al. (2010), yielding the six and three (respectively) most important covariates. Since we are not interested in developing the best model, or comparing model performance, we did not strive to optimise the model settings for each data set.
2.2 Implementing GDF for normally distributed data
A robust computation of GDF is a little more problematic than eqn 4 suggests. Ye (1998) proposed to fit a linear regression to repeated perturbations of each data point, i.e. to regress the change in fitted value, $\hat{y}'_i - \hat{y}_i$, on the perturbation, $y'_i - y_i$, for each repeatedly perturbed data point $y_i$, calculating GDF as the sum of the slopes across all data points. As $y_i$ is constant for all models, and $\hat{y}_i$ is constant for the non-stochastic algorithms (GLM, GAM), the linear regression simplifies to $\hat{y}'_i$ over $y'_i$. For internally stochastic models we compute a mean $\bar{\hat{y}}_i$ as plug-in for $\hat{y}_i$, and therefore apply the procedure also to randomForest, ANN and BRT.
Elder (2003) presents this so-called “horizontal” method (the perturbed $y'$ and fitted $\hat{y}'$ are stored as different columns in two separate matrices, which are then regressed row-wise) as more robust compared to the “vertical” method (where for each perturbation a GDF is computed for each column in these matrices and later averaged across replicate perturbations). Convergence is poor when using the “vertical” method, so we restricted the computation of GDF to the “horizontal” method.
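The “horizontal” method can be sketched as follows for a model whose GDF is known in advance. The example assumes a straight-line least-squares fit (for which GDF should be close to 2), perturbs all data points at once with Gaussian noise, and regresses, for each data point, the refitted values on the perturbed observations across replicates. The simulated data, the function names, the number of replicates and the perturbation standard deviation are illustrative assumptions rather than the settings used in this study.

```python
import random


def fit_predict(x, y):
    """Fitted values of the least-squares line y = a + b*x."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sxx = sum((xi - xm) ** 2 for xi in x)
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sxx
    a = ym - b * xm
    return [a + b * xi for xi in x]


def slope(u, v):
    """Least-squares slope of v regressed on u."""
    n = len(u)
    um, vm = sum(u) / n, sum(v) / n
    return (sum((ui - um) * (vi - vm) for ui, vi in zip(u, v))
            / sum((ui - um) ** 2 for ui in u))


def gdf_horizontal(x, y, model, n_rep=400, sd=0.25, seed=1):
    """Sum over data points of the slope of refitted on perturbed values."""
    rng = random.Random(seed)
    y_pert, y_fit = [], []          # one row per replicate perturbation
    for _ in range(n_rep):
        yp = [yi + rng.gauss(0, sd) for yi in y]
        y_pert.append(yp)
        y_fit.append(model(x, yp))  # refit to the perturbed response
    n = len(y)
    return sum(slope([row[i] for row in y_pert], [row[i] for row in y_fit])
               for i in range(n))


# Hypothetical simulated data, for illustration only.
rng = random.Random(7)
x = [rng.uniform(0, 10) for _ in range(30)]
y = [2.0 + 0.8 * xi + rng.gauss(0, 1) for xi in x]

gdf = gdf_horizontal(x, y, fit_predict)
print(gdf)  # should lie near 2, the rank of the straight-line model
```

Any function with the same signature as `fit_predict` could be passed as `model`, which is what makes the approach applicable to black-box algorithms; for internally stochastic algorithms the per-point slopes will be noisier.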
2.3 GDF for binary data
While normal data can be perturbed by adding normal noise (see next section), binary data cannot. The only option is to invert their value, i.e. $0 \to 1$ or $1 \to 0$. Clearly, we cannot perturb many data points simultaneously this way, as that will dilute any actual signal in the data. However, for large data sets it is computationally expensive to perturb all data points individually, and repeatedly (to yield the values for the horizontal method, see above). We thus varied the number of data points to invert, to evaluate whether we can raise it without biasing the GDF estimate.
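Inverting a random subset of a binary response can be sketched as below. The function name `flip_subset`, the example response and the subset size are hypothetical; the sketch also prints the prevalence before and after the perturbation, since inverting values generally changes it.

```python
import random


def flip_subset(y, k, rng):
    """Return a copy of the 0/1 response y with k randomly chosen values inverted,
    together with the indices that were inverted."""
    idx = rng.sample(range(len(y)), k)
    y_new = list(y)
    for i in idx:
        y_new[i] = 1 - y_new[i]
    return y_new, idx


# Hypothetical binary response, for illustration only.
rng = random.Random(3)
y = [1 if rng.random() < 0.3 else 0 for _ in range(40)]

y_pert, flipped = flip_subset(y, 5, rng)
n_changed = sum(a != b for a, b in zip(y, y_pert))
print(n_changed, sum(y) / len(y), sum(y_pert) / len(y_pert))
```

Because every inversion is a change of magnitude 1, the perturbation cannot be made arbitrarily small as in the Gaussian case; only the number of inverted points can be tuned.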
2.4 How many data points to perturb simultaneously?
With the number of points to perturb, the number of replicates for each GDF computation and the amplitude of disturbance (for the normal data), we have three tuning parameters when computing GDF.
First, we calculated GDF for the simulated data and the sperm whale data set with the number of perturbed points taking increasing values (from 1 to close to $N$, or to ESS for binary data). Thus, random subsets of that size of the response variable $y$ were perturbed, yielding $y'$. After each particular perturbation the model was refitted to $y'$. To gain insight into the variance of the computed GDF values we repeated this calculation 100 times (due to very high computational effort, the number of reruns had to be limited to 10 for both randomForest and BRT).
The perturbations for the normally distributed $y$ were also drawn from a normal distribution. We evaluated the sensitivity to the amplitude of this perturbation by setting it to 0.125 and 0.5.
2.5 Modelling approaches
We analysed the data using Generalised Linear Model (GLM), Generalised Additive Model (GAM), randomForest, feed-forward Artificial Neural Network (ANN) and Boosted Regression Trees (BRT). For the GLM, GDF should be identical to the trace of the hat matrix (and the rank of the model), hence the GLM serves as benchmark. For GAMs, different ways to compute the degrees of freedom of the model have been proposed (Wood 2006), while for the other three methods the intrinsic stochasticity of the algorithm (or in the case of ANN of the initial weights) may yield models with different GDFs each time.
All models were fitted using R (R Core Team 2014) and packages gbm (for BRTs: Ridgeway et al. 2013, interaction depth=3, shrinkage=0.001, number of trees=3000, cv.folds=5), mgcv (for GAMs: Wood 2006, thin plate regression splines), nnet (for ANNs: Venables & Ripley 2002, size=7, decay=0.03 for simulated and decay=0.1 for real data, linout=TRUE for normal data), randomForest (Liaw & Wiener 2002) and core package stats (for GLMs, using also quadratic terms and first-order interactions). R code for all simulations as well as the data are available at https://github.com/biometry/GDF.
2.6 Computation of AIC and AIC-weights, from GDF and cross-validation
In addition to the number of parameters, the AIC formula requires the likelihood of the data, given the model. As machine-learning algorithms may minimise a score function different from the likelihood, the result probably differs from a maximum likelihood estimate. To calculate the AIC, however, we have to assume that the distance minimised by non-likelihood methods is proportional to the likelihood, otherwise the AIC would not be a valid approximation of the KL-distance. No such assumption has to be made for cross-validation, as the log-likelihood here only serves as a measure of model performance. For the normal data, we compute the standard deviation of the model’s residuals as plug-in estimate of $\sigma$.
For the binary data, we use the model fits as probability in the Bernoulli distribution to compute the likelihood of that model. We then calculated the AIC for all considered models based on their GDF value. Due to the small sample sizes, we used AICc (Sugiura 1978, Hurvich & Tsai 1989):
$$\mathrm{AICc} = -2\ell_m + 2\,\mathrm{GDF} + \frac{2\,\mathrm{GDF}\,(\mathrm{GDF} + 1)}{N - \mathrm{GDF} - 1} \tag{10}$$
where $\ell_m$ is the log-likelihood of the fitted model and $N$ the number of data points.
We used (10-fold) cross-validation, maintaining prevalence in the case of binary data, to compute the log-likelihood of the cross-validation, yielding $\ell_{CV}$. To directly compare it to AICc, we multiplied $\ell_{CV}$ with $-2$. The cross-validation automatically penalises for overfitting by making poorer predictions.
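The likelihood plug-ins and the GDF-based AICc described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the residual standard deviation is computed as $\sqrt{\mathrm{RSS}/N}$ (the choice of denominator is an assumption), fitted probabilities are clipped away from 0 and 1 by a small constant `eps` (also an assumption, introduced only to avoid taking the logarithm of zero), and all function names and numbers are hypothetical.

```python
import math


def gaussian_loglik(y, fitted):
    """Gaussian log-likelihood using the residual SD as plug-in estimate of sigma."""
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = rss / n  # assumption: ML-type estimate with denominator n
    return -0.5 * n * math.log(2 * math.pi * sigma2) - rss / (2 * sigma2)


def bernoulli_loglik(y, prob, eps=1e-12):
    """Bernoulli log-likelihood using fitted values as success probabilities."""
    total = 0.0
    for yi, pi in zip(y, prob):
        pi = min(max(pi, eps), 1 - eps)  # assumption: clip to avoid log(0)
        total += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return total


def aicc(loglik, gdf, n):
    """AICc with GDF used as plug-in for the number of parameters."""
    return -2 * loglik + 2 * gdf + 2 * gdf * (gdf + 1) / (n - gdf - 1)


# Hypothetical numbers, for illustration only.
print(aicc(loglik=-100.0, gdf=2.0, n=50))
print(bernoulli_loglik([1, 0, 1], [0.8, 0.3, 0.6]))
```

Note that GDF need not be an integer, so the correction term is evaluated for fractional values as well, and the formula is undefined when GDF approaches $N - 1$.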
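For model averaging, we computed model weights $w_i$ for each model $i$, once for the GDF-based AICc and once for the cross-validation log-likelihood, using the equation for Akaike weights (Turkheimer et al. 2003, Burnham & Anderson 2002, p. 75):
$$w_i = \frac{\exp(-\Delta_i / 2)}{\sum_{j=1}^{n} \exp(-\Delta_j / 2)} \tag{11}$$
where $\Delta_i = \mathrm{AICc}_i - \mathrm{AICc}_{\min}$, taking the smallest AICc, i.e. the AICc of the best of the candidate models, as $\mathrm{AICc}_{\min}$; $n$ is the number of models to be averaged over.
The same idea can be applied to cross-validated log-likelihoods, so that
$$w_i^{CV} = \frac{\exp(\Delta_i^{CV})}{\sum_{j=1}^{n} \exp(\Delta_j^{CV})} \tag{12}$$
where $\Delta_i^{CV} = \ell_{CV,i} - \ell_{CV,\max}$, with $\ell_{CV,\max}$ being the largest cross-validated log-likelihood in the model set.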
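Both weighting schemes can be sketched with the same few lines. The example assumes that weights based on cross-validated log-likelihoods use the difference to the largest log-likelihood without the factor 1/2, because the log-likelihood is on half the deviance scale; function names and input values are hypothetical.

```python
import math


def akaike_weights(aicc_values):
    """Model weights from AICc values: exp(-delta/2), normalised to sum to 1."""
    best = min(aicc_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aicc_values]
    total = sum(rel)
    return [r / total for r in rel]


def cv_weights(l_cv_values):
    """Model weights from cross-validated log-likelihoods: exp(l - l_max), normalised."""
    best = max(l_cv_values)
    rel = [math.exp(l - best) for l in l_cv_values]
    total = sum(rel)
    return [r / total for r in rel]


# Hypothetical cross-validated log-likelihoods of three candidate models.
l_cv = [-50.0, -51.0, -55.0]
w_cv = cv_weights(l_cv)

# Applying the Akaike-weight formula to the cross-validated deviance (-2 * l_cv)
# gives the same weights, which shows the two equations are consistent.
w_dev = akaike_weights([-2 * l for l in l_cv])
print(w_cv, w_dev)
```

Subtracting the best value before exponentiating leaves the weights unchanged but avoids numerical underflow for large deviances.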
3 Results
3.1 GDF configuration analysis
For normally distributed data, increasing the number of points perturbed simultaneously typically slightly increased the variance of the Generalised Degrees of Freedom calculated for the model (Fig. 1 left column, GLM, GAM, but not for randomForest and ANN). For GLM and GAM, GDF computations yielded exactly the same value as the model’s rank (indicated by the dashed horizontal line).
For simulated Bernoulli data (Fig. 1 central column), we also observed an effect of the number of points perturbed on the actual GDF value, which decreased for GLM, GAM and ANN, but increased for BRTs with the number of data points perturbed. Several data points needed to be perturbed to yield an approximately correct estimate. More worryingly, GDF depended non-linearly on the number of data points perturbed, with values varying by a factor of 2 for GAM and ANN. For GAM the sensitivity occurs because the smoothing parameter selection is sensitive to the quite severe information loss as more and more data are perturbed. GLM and BRT yielded more consistent but still systematically varying GDF estimates.
The same pattern was observed for the sperm whale data (Fig. 1, right column). For the GLM some bias remained observable, also in the sperm whale case study. We attribute it to the fact that by perturbing the binary data we also alter the prevalence.
The two replicate simulations yielded consistent estimates, except in the case of the normal GAM and normal BRT. This suggests that both methods “fitted into the noise” specific to the data set, while the other methods did not. We did not observe this phenomenon with the binary data.
For randomForest, GDF estimates centre around 0, meaning that perturbations of the data did not affect the model predictions. This is to some extent explicable by the stochastic nature of randomForest, giving different predictions when fitted to the same data. We interpret the value of 0 as the perturbations creating less variability than the intrinsic stochasticity of this approach.
This is not a general feature of stochastic approaches, as in neural networks and boosted regression trees the intrinsic stochasticity seems to be much less influential, and both approaches yielded relatively consistent GDF values. The actual GDF estimate of course depends on the settings of the methods and should not be interpreted as representative.
To compute the GDF for normally distributed data, we have to choose the strength of perturbation. The GDF value is robust to this choice, unless many data points are disturbed (Fig. 2). Only for BRT does increasing the strength of perturbation lead to a consistent, but small, decrease in GDF estimates, suggesting again that BRTs fit into the noise.
Estimates of GDF for randomForest and ANN (but not BRT) respond with decreasing variance to increasing the strength of perturbation. Increasing the intensity of perturbation seems to overrule their internal stochasticity.
It seems clear from the results presented so far that no single best perturbation strength and optimal proportion of data to perturb exists for all methods. For the following analyses, we use, for normal data, perturbation and (except for GAM, where ); and for Bernoulli data (except for BRT and ANN, where ).
3.2 Efficiency of GDF and cross-validation computations
Both GDF and cross-validation require multiple analyses. For the GDF, we need to run several perturbations, and possibly replicate this many times. For cross-validation, we may also want to repeatedly perform the 10-fold cross-validation itself to yield stable estimates. The mean GDF and CV-log-likelihood ($\ell_{CV}$) over 1–1000 replicates are depicted in Fig. 3. With 100 runs, both estimates have stabilised, but 100 runs for GDF represent 25000 model evaluations (due to the perturbations and 50 internal replicates), while for 10-fold cross-validation these represent only 1000 model evaluations, making it 25 times less costly.
3.3 GDF-based AIC vs cross-validation
We analysed four data sets with the above settings for GDF and cross-validation, two simulated (Gaussian and Bernoulli) and two real-world data sets (sperm whale and red fox geographic distributions).
Both Generalised Degrees of Freedom (GDF) and cross-validation log-likelihood differences ($p_{CV}$) measure model complexity in an asymptotically equivalent way (section 1.2). For finite data sets both approaches yield identical rankings of model complexity but are rather different in absolute value (Table 1). Particularly the red fox case study yields very low GDF estimates for GLM, GAM and randomForest, while their cross-validation model complexity is much higher.
Model  GDF  GDF  

Gaussian simulation  Bernoulli simulation  
GLM  
GAM  
randomForest  
ANN  
BRT  
Sperm whale  Red fox  
GLM  
GAM  
randomForest  
ANN  
BRT 
Model complexity is only one term in the AIC formula (eqn. 8) and for large data sets the log-likelihood will dominate. To put the differences between the two measures of model complexity into perspective, we computed AIC for all data sets and compared it to the equivalent cross-validation deviance ($-2\ell_{CV}$).
Across the entire range of data sets analysed both approaches yield extremely similar results (Fig. 4 right). Within each data set, however, the pattern is more idiosyncratic, revealing a high sensitivity to low sample size (i.e. all data sets except the red fox).
3.4 GDF, and model weights
For all four data sets one modelling approach always substantially outperforms the others, making model averaging an academic exercise. We compare model weights (according to eqns 11 and 12) purely to quantify the differences in AIC and cross-validation deviance for model averaging. Only for the Bernoulli simulation is the difference noticeable (Table 2). Here GLM and randomForest share the model weight when quantified based on cross-validation, while for the GDF approach GLM takes all the weight.
Model  

Gaussian simulation  Bernoulli simulation  
GLM  
GAM  
rF  
ANN  
BRT  
Sperm whale  Red fox  
GLM  
GAM  
rF  
ANN  
BRT 
4 Discussion
So far, Ye’s (1998) Generalised Degrees of Freedom concept has not attracted much attention in the statistical literature, even though it builds on established principles and applies to machine learning, where model complexity is unknown. Shen & Huang (2006) have explored the perturbation approach to GDFs in the context of adaptive model selection for linear models and later extended it to linear mixed effect models (Zhang et al. 2012). They also extended it to the exponential family (including the Bernoulli distribution) and even to classification and regression trees (Shen et al. 2004). Our study differs in that the models considered are more diverse and internally include weighted averaging, which clearly poses a challenge to the GDF algorithm.
4.1 GDF for normally distributed data
For normally distributed data our explorations demonstrate a low sensitivity to the intensity of perturbation used to compute GDFs. Furthermore, across the five modelling approaches employed here, GDF estimates are stable and constant for different numbers of data points perturbed simultaneously.
GDF estimates were consistent with the rank of GLM models and in line with the estimated degrees of freedom reported by the GAM. For neural networks and boosted regression trees, GDF estimates appear plausible, but cannot be compared with any self-reported values. Compared to the cross-validation method, GDF values are typically, but not consistently, lower by 10–30% (Table 1).
For randomForest, GDF estimates were essentially centred on zero. It seems strange that an algorithm that internally uses hundreds of classification and regression trees should have no (or even negative) degrees of freedom. We expected a low value due to the averaging of submodels (called the ‘paradox of ensembles’ by Elder 2003), but not such a complete insensitivity to perturbations. (Within this study, SH and CFD independently re-programmed our GDF function to make sure that this was not due to a programming error.) As eqn. 4 shows, the perturbation of each individual data point is compared to the change in the model expectation for this data point, and the ratios are then summed over all data points. To yield a GDF of 0, the change in expectation (numerator) must be much smaller than the perturbation itself (denominator). This is possible when expectations are variable due to the stochastic nature of the algorithm. It seems that randomForest is much more variable than the other stochastic approaches of boosting and neural networks.
4.2 Bernoulli GDF
Changing the value of Bernoulli data from 0 into 1 (or vice versa) is a stronger perturbation than adding a small amount of normal noise to Gaussian data. As our exploration has shown, the GDF for such Bernoulli data is indeed much less well-behaved than for the normal data. Not only does the estimated GDF depend on the number of data points perturbed, this dependence also differs for each modelling approach we used. This makes GDF computation impractical for Bernoulli data. As a consequence, we did not attempt to extend GDFs in this way to other distributions, as in our perception only a general, distribution- and model-independent algorithm is desirable.
4.3 GDF vs model complexity from cross-validation
Cross-validation is typically used to get a non-optimistic assessment of model fit (e.g. Hawkins 2004). As we have shown, it can also be used to compute a measure of model complexity similar (in principle) to GDF (eqn. 8 and Table 1). Both express model complexity as the effective number of parameters fitted in the model. GDF and the cross-validation-based model complexity estimator are largely similar, but may also differ substantially (Table 1, red fox case study). Since the ‘correct’ value for this estimator is unknown, we cannot tell which approach actually works better. Given our inability to choose the optimal number of data points to perturb (except for GLM), we prefer $p_{CV}$, which does not make any such assumption.
4.4 Remaining problems
To make the GDF approach more generally applicable, a new approach has to be found. The original idea of Ye (1998) is appealing, but not readily transferable in the way we had hoped.
Another problem, even for Gaussian data where this approach seems to be performing fine, is the high computational burden. GDF estimation requires tens of thousands of model evaluations, giving it very limited appeal, except for small data sets and fast modelling approaches. Cross-validation, as alternative, is at least an order of magnitude faster, but still requires around 1000 evaluations. If the aim is to compute model weights for model averaging, no precise estimation of model complexity is needed and even the results of a single 10-fold cross-validation based on eqn. 8 can be used. It was beyond the scope of this study to develop an efficient cross-validation-based approach to compute degrees of freedom, but we clearly see this as a more promising way forward than GDF.
4.5 Alternatives to AIC
The selection of the most appropriate statistical model is most commonly based on Kullback-Leibler (KL) discrepancy (Kullback & Leibler 1951): a measure representing the distance between the true and an approximating model. Thus, we assume that a model for which the distance to the true model is minimal is the KL-best model. Yet, since KL-discrepancy is not observable, even if a true model existed, many statisticians have attempted to find a metric approximation (e.g. Burnham & Anderson 2002, Burnham et al. 1994). Akaike (1973), who proposed this measure as the basis for model selection in the first place, developed the AIC to get around the discussed problem.
The point of the cross-validated log-likelihood is that we do away with the approximation that yields the degrees of freedom term in the AIC, instead estimating the model-dependent part of the KL divergence directly. This approach is disadvantageous if AIC can be computed from a single model fit. But if the EDF term for the AIC requires repeated model fits, then there is no reason to use the AIC approximation to the KL-divergence, rather than a more direct estimator. If leave-one-out cross-validation is too expensive, then we can leave out several, at the cost of some Monte-Carlo variability (resulting from the fact that averaging over all possible left-out sets is generally impossible).
5 Conclusion
We have shown that the idea of using GDFs to extend information-theoretical measures of model fit (such as AIC) to non-likelihood models is burdened with large computational costs and yields variable results for different modelling approaches. Cross-validation is more variable than GDF, but it is a more direct way to compute measures of model complexity, fit and weights (in a model averaging context). As cross-validation may, but need not, employ the likelihood fit to the hold-out, it appears more plausible for models that do not make likelihood assumptions. Thus, we recommend repeated 10-fold cross-validation to estimate any of the statistics under consideration.
Acknowledgements
This work was partly performed on the computational resource bwUniCluster funded by the Ministry of Science, Research and Arts and the Universities of the State of Baden-Württemberg, Germany, within the framework program bwHPC.
References
 Akaike (1973) Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Second International Symposium on Information Theory (pp. 267–281). Budapest: Akademiai Kiado.
 Burnham & Anderson (1998) Burnham, K. P. & Anderson, D. R. (1998). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Berlin: Springer.
 Burnham & Anderson (2002) Burnham, K. P. & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Berlin: Springer. 2nd edition.
 Burnham et al. (1994) Burnham, K. P., Anderson, D. R., & White, G. C. (1994). Evaluation of the Kullback-Leibler discrepancy for model selection in open population capture-recapture models. Biometrical Journal, 36(3), 299–315.
 Diniz-Filho et al. (2008) Diniz-Filho, J. A. F., Rangel, T. F. L. V. B., & Bini, L. M. (2008). Model selection and information theory in geographical ecology. Global Ecology and Biogeography, 17, 479–488.
 Dormann & Kaschner (2010) Dormann, C. & Kaschner, K. (2010). Where’s the sperm whale? A species distribution example analysis. http://www.mcedecology.org/?page_id=355.
 Dormann et al. (2010) Dormann, C. F., Gruber, B., Winter, M., & Herrmann, D. (2010). Evolution of climate niches in European mammals? Biology Letters, 6, 229–232.
 Draper (1995) Draper, D. (1995). Assessment and propagation of model uncertainty. Journal of the Royal Statistical Society Series B, 57, 45–97.
 Elder (2003) Elder, J. F. (2003). The generalization paradox of ensembles. Journal of Computational and Graphical Statistics, 12(4), 853–864.
 Elith et al. (2006) Elith, J., Graham, C. H., Anderson, R. P., Dudík, M., Ferrier, S., Guisan, A., Hijmans, R. J., Huettmann, F., Leathwick, J. R., Lehmann, A., Li, J., Lohmann, L. G., Loiselle, B. A., Manion, G., Moritz, C., Nakamura, M., Nakazawa, Y., Overton, J. M., Peterson, A. T., Phillips, S. J., Richardson, K., Scachetti-Pereira, R., Schapire, R. E., Soberón, J., Williams, S., Wisz, M. S., Zimmermann, N. E., & Araujo, M. (2006). Novel methods improve prediction of species’ distributions from occurrence data. Ecography, 29(2), 129–151.
 Elith & Leathwick (2009) Elith, J. & Leathwick, J. R. (2009). Species distribution models: ecological explanation and prediction across space and time. Annual Review of Ecology, Evolution, and Systematics, 40(1), 677–697.
 Hastie et al. (2009) Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, volume 2. Berlin: Springer.
 Hawkins (2004) Hawkins, D. M. (2004). The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1), 1–12.
 Hegyi & Garamszegi (2011) Hegyi, G. & Garamszegi, L. Z. (2011). Using information theory as a substitute for stepwise regression in ecology and behavior. Behavioral Ecology and Sociobiology, 65(1), 69–76.
 Horne & Garton (2006) Horne, J. S. & Garton, E. O. (2006). Likelihood cross-validation versus least squares cross-validation for choosing the smoothing parameter in kernel home-range analysis. Journal of Wildlife Management, 70, 641–648.
 Hurvich & Tsai (1989) Hurvich, C. M. & Tsai, C.-L. (1989). Regression and time series model selection in small samples. Biometrika, 76(2), 297–307.
 Kullback & Leibler (1951) Kullback, S. & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
 Liaw & Wiener (2002) Liaw, A. & Wiener, M. (2002). Classification and regression by randomForest. R News, 2(3), 18–22.
 Madigan et al. (1995) Madigan, D., York, J., & Allard, D. (1995). Bayesian graphical models for discrete data. International Statistical Review / Revue Internationale de Statistique, 63(2), 215–232.
 Mundry (2011) Mundry, R. (2011). Issues in information theory-based statistical inference – commentary from a frequentist’s perspective. Behavioral Ecology and Sociobiology, 65(1), 57–68.
 Olden et al. (2008) Olden, J. D., Lawler, J. J., & Poff, N. L. (2008). Machine learning methods without tears: A primer for ecologists. The Quarterly Review of Biology, 83(2), 171–193.
 Raftery (1996) Raftery, A. E. (1996). Approximate Bayes factors and accounting for model uncertainty in generalised linear models. Biometrika, 83(2), 251–266.
 Recknagel (2001) Recknagel, F. (2001). Applications of machine learning to ecological modelling. Ecological Modelling, 146(1–3), 303–310.
 Ridgeway et al. (2013) Ridgeway, G. et al. (2013). gbm: Generalized Boosted Regression Models. R package version 2.1.
 Shen & Huang (2006) Shen, X. & Huang, H.-C. (2006). Optimal model assessment, selection, and combination. Journal of the American Statistical Association, 101(474), 554–568.
 Shen et al. (2004) Shen, X., Huang, H.-C., & Ye, J. (2004). Adaptive model selection and assessment for exponential family distributions. Technometrics, 46(3), 306–317.
 Spiegelhalter et al. (2002) Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society B, 64, 583–639.
 Stone (1977) Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike’s criterion. Journal of the Royal Statistical Society. Series B (Methodological), 39(1), 44–47.
 Sugiura (1978) Sugiura, N. (1978). Further analysis of the data by Akaike’s information criterion and the finite corrections. Communications in Statistics - Theory and Methods, 7(1), 13–26.
 R Core Team (2014) R Core Team (2014). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
 Turkheimer et al. (2003) Turkheimer, F. E., Hinz, R., & Cunningham, V. J. (2003). On the undecidability among kinetic models: From model selection to model averaging. Journal of Cerebral Blood Flow & Metabolism, 23(4), 490–498.
 Venables & Ripley (2002) Venables, W. N. & Ripley, B. D. (2002). Modern Applied Statistics with S. New York: Springer, 4th edition.
 Wood (2006) Wood, S. (2006). Generalized Additive Models: An Introduction with R. New York: Chapman and Hall/CRC.
 Wood (2015) Wood, S. N. (2015). Core Statistics. Cambridge: Cambridge University Press.
 Ye (1998) Ye, J. (1998). On measuring and correcting the effects of data mining and model selection. Journal of the American Statistical Association, 93(441), 120–131.
 Zhang et al. (2012) Zhang, B., Shen, X., & Mumford, S. L. (2012). Generalized degrees of freedom and adaptive model selection in linear mixed-effects models. Computational Statistics and Data Analysis, 56(3), 574–586.