1 Introduction
Model selection forms a key part of a large proportion of publications in ecology journals. This is particularly true in population modelling studies in which generalised linear models (GLMs) are typically tested with different combinations of potential predictor variables (
Thieme (2018); Jacobson et al. (2004); Imperio et al. (2013)). In a large number of cases, Akaike’s information criterion (AIC), or its adjusted version for small samples AICc, is used to compare the relative performance of different combinations of variables (henceforth ‘models’). The model with the most support according to the information criterion is then usually selected for further use or as a conclusion in itself. There is a noted tendency, however, to neglect to test whether any of the models are indeed useful or even ‘significant’. After all, the best of a bad bunch of models is still a bad model. It was found by Mac Nally et al. (2017) that, out of 119 ecology papers considered that use information criteria to compare the performance of different models, only 55 included some measure of the absolute goodness of fit. The authors of that paper suggest both that some measure of absolute performance should be shown and that the null model (i.e. a statistical model with no explanatory variables) should always be included as a benchmark with which to compare the performance of each of the candidate models. This idea is expanded upon in Wheatcroft (2019) in which more flexible benchmark models are suggested as alternatives to the null model.Whilst it is essential that some measure of the absolute goodness of fit of the ‘best’ model is included, it is argued here that doing so does not solve the problem entirely, due to implicit multiple testing that is not taken into account. Suppose that a statistical test measures the significance of the ‘best’ model, which has been determined by an information criterion or some other method of model selection. Whilst, in this case, only one formal test is actually performed, the model of interest has already been determined as one that appears to perform relatively well under model selection. Since there is a close relationship between model selection techniques such as information criteria and formal statistical significance tests, those that perform well under the former tend also to perform well under the latter. Crucially, this is true both when the model is actually informative. i.e. would perform better than the null model outofsample, and when it appears to be informative only by random chance. Therefore, in the latter case, the probability that a statistical test on the ‘best’ model is wrongly found to be significant is inflated. In statistical terminology, this means that the probability of a type I error is increased. It is argued in this paper that multiple testing needs to be accounted for when assessing the significance of each of the models and a framework is introduced with which to do this.
The distinction between assessing the relative and absolute values of a set of candidate models is well known. For example, in the context of ecology, it was pointed out by Symonds & Moussalli (2011) that, in model selection, ‘you can have a set of essentially meaningless variables and yet the analysis will still produce a best model’. They therefore suggest that it is ‘important to assess the goodness of fit (, ) of the model that includes all the predictors under study, arguing that ’if this global model is a good fit, then you can rest assured that the best approximating model will be a good fit also’. This approach seems somewhat adhoc since a global model with a large enough number of parameters will always appear to provide a good fit insample, regardless of how informative each of the variables are. As an explanation of why testing the significance of the ‘best’ model is often neglected, Burnham & Anderson (2004) suggest that, historically, it has often been assumed that there is a single ‘true’ model and that that model is in the candidate set. The Bayesian derivation of the Bayesian Information Criterion (BIC), for example, works under this assumption (Burnham & Anderson (2004)). If the assumption holds, with enough data, one can eventually expect to select the true model, and that model will, by definition, provide a good fit. In practice, few people believe that the ‘true’ model is ever likely to be a member of the candidate set. Another suggested approach is to ensure that each variable included in the model is carefully justified a priori, such that only variables with a high chance of being informative are included (Burnham & Anderson (2001, 2004)). Whilst this is a sensible suggestion, it does not solve the problem since, whilst those variables may be expected to be important, a priori, this may not be reflected in the models once the data have been considered.
The statistical literature on multiple testing is considerable. Perhaps the most well known approach to the problem is the Bonferroni correction which makes a simple adjustment to the significance level according to the number of hypotheses that are tested (Bonferroni (1936)). Other methodologies, such as the BonferroniHolm method Holm (1979) and Benjamini–Hochberg and Sidak corrections (Benjamini & Hochberg (1995); Šidák (1967)), for example, control the order in which tests are applied to limit the number of overall tests, producing a uniformly more powerful approach. These are discussed further in section 2.5.
A weakness of the above approaches is that they assume that each of the hypotheses are independent of each other. If there is dependency between a set of hypotheses, the probability of committing a type I error in at least one of those hypotheses does not generally grow as quickly as when they are independent. Such methods are therefore too conservative in such cases, with the result that the probability of rejecting an informative model is increased, i.e. a type II error is committed (
Nakagawa (2004)). To attempt to overcome this problem, the WestfallYoung procedure uses permutation tests to adjust the pvalues in multiple correlated hypothesis tests, whilst taking account of the dependency between the hypotheses Westfall et al. (1993). This provides a test which is far more powerful in such cases.In ecology, it is common to define candidate models as different combinations of the same set of candidate variables in a generalised linear model (Bolker et al. (2009)). There is therefore a strong degree of dependency between the candidate models and so the Bonferroni correction is unsuitable (along with other similar procedures). In this paper, two permutation tests are proposed, which are based on the WestfallYoung procedure. The first test, named the single model permutation test, assesses the significance of individual models on the basis of a model selection statistic. This is then extended to define another test, called the model selection permutation test, that measures the significance of the entire model selection procedure, whilst taking into account the dependencies between the candidate models. The result of the first test is an individual pvalue for each model whilst the result of the second test is a single pvalue relating to the model selection procedure itself. The idea is then that, if the pvalue of the model selection permutation test is smaller than the chosen significance level, the whole model selection procedure can be considered to be ‘significant’, that is the probability of finding a model selection statistic as good or better than that of the ‘best’ model by chance is small. Model selection can then go ahead with the reassurance that the performance of the ‘best’ models is unlikely to have occurred simply due to random chance.
The model selection permutation test proposed in this paper has been utilised in another recent paper entitled ‘Effects of weather and hunting on wild reindeer population dynamics in Hardangervidda National Park’ on which the author of this paper is also named. In that paper, the test is referred to as a ‘sanity check’ test and a reference to this paper is provided. As such, some of the analysis from that paper is reproduced here with the primary focus here being the application of the proposed tests. Additionally, in this paper, an example is used in the form of a population modelling analysis of ibex populations in the northern part of Italy which was taken from an existing paper published in 2004 (Jacobson et al. (2004)).
2 Methods
2.1 Forecast Evaluation and Model Selection
It was noted by Mac Nally et al. (2017) that authors commonly neglect to include a measure of the absolute performance of the ‘best’ model alongside a model selection procedure. Of those papers that do include such a measure, the vast majority were found to use , adjusted or related measures. Although the low number of cases in which no absolute measure of fit is provided is concerning, those measures that are commonly used for this purpose can be problematic themselves. AIC and its corrected version are founded in information theory and approximate the expected difference in information loss from approximating the underlying system with different candidate models. Since ‘information’ in this case relates to the probability or probability density assigned to the outcome, it is therefore a measure of probabilistic performance. It can be noted that generalised linear models, as commonly used in ecology, naturally provide a set of probabilistic forecasts. and similar related metrics, however, are measures of deterministic performance, i.e. they only consider a forecast to be a single number. This means that, whilst models are selected according to the performance of the resulting probabilistic forecasts, they are evaluated as point forecasts. This seems like an inconsistent approach to the forecasting problem as a whole.
In fact, probabilistic forecasts can contain a great deal of information that cannot be communicated in point forecasts. In the case of a Gaussian forecast distribution, for example, the variance can be of great value in understanding the uncertainty in the point estimate defined by the mean. For more complex forecast distributions, a single number such as the mean may be entirely inadequate. Consider, for example, a herd of terrestrial animals that, according to a probabilistic forecast distribution, is equally likely to be on either side of a large lake that runs from north to south. It is difficult, in this case, to define a single number from the distribution that represents a useful point forecast. After all, it makes little sense to predict the mean of that distribution since it would fall within the lake, an area in which there is little or no chance of the herd residing. Equally, it would make little sense to forecast that the herd will be on a particular side of the lake since each are deemed equally likely. In summary, to use a point rather than a probabilistic forecast, information must be discarded.
In addition to the issues described above, measures of the predictive performance of point forecasts tend to be fraught with problems. For example , which appears to be the most commonly used measure in ecology papers, is widely known to be a poor measure of forecast performance (Wheatcroft (2015)). Firstly, the correlation is insensitive to scale. This means that, if two variables are correlated, it doesn’t necessarily mean that one is a good predictor of the other. For example, a set of temperature forecasts measured in Fahrenheit when the outcomes are measured in Celsius may still have a high value. This has been widely acknowledged and, for example, Murphy describes as a measure of potential rather than absolute skill (Murphy & Epstein (1989)). Secondly, other well known problems with using correlation coefficients apply. For example, influential observations can greatly increase the correlation between two variables without much, or any, actual improvement in predictive performance (Wheatcroft (2015)).
2.2 Evaluating Probabilistic Forecasts
Probabilistic forecasts are usually evaluated using functions of the forecast and the outcome called scoring rules. A wide range of scoring rules have been proposed and there is still some debate over which are the most appropriate (Gneiting & Raftery (2007)). A property of scoring rules generally considered to be of high importance is called propriety. A score is proper if it is optimised in expectation when the distribution from which the outcome is drawn is issued as the forecast (J.Bröcker & Smith (2007)). Propriety would therefore discourage a forecaster in possession of that forecast distribution from issuing a different one to achieve a better score. It is worth noting that no similar property exists for measures of the performance of point forecasts. For example, needn’t favour a forecast based on the true distribution of the outcome.
An example of a proper scoring rule is the ignorance score (Good (1952); Roulston & Smith (2002)) defined by
(1) 
where is the probability density placed on the outcome. The ignorance score is negatively oriented and hence smaller values indicate better forecast skill. The score is also local because it only takes the probability at the outcome into consideration (Gneiting & Raftery (2007)) and, in fact, can be shown to be the only scoring rule that is both proper and local (Bernardo (1979)). An advantage of the ignorance score is in its interpretation. The difference in the mean ignorance between two sets of forecasts can be interpreted as the base 2 logarithm of the ratio of the density placed on the outcome by each, measured in bits. For example, if the mean ignorance of one set of forecasts is 3 bits smaller than another, it places times more probability density on the outcome, on average. The ignorance score is used in the results section of this paper alongside leaveoneout crossvalidation.
2.3 Approaches to Model Selection
Model selection is a key part of many studies in a wide range of disciplines, including ecology (Johnson & Omland (2004)). The standard approach is to define a set of candidate models a priori and to attempt to rank them according to how well they would generalise outofsample. The basis of model selection techniques is founded on the observation that a fair comparison is needed between models with different numbers of parameters. If an extra parameter is added, the fit of the model will necessarily improve insample but may be ‘overfitted’ and will not improve outofsample. Model selection techniques therefore attempt to account for this issue.
Model selection techniques typically fall into two different categories. Information criteria weigh up the insample fit of the model with the number of parameters to be selected such that extra parameters are penalised. Crossvalidation, on the other hand, divides the dataset such that parameter selection is always performed on data that are distinct from those on which the performance of the model is tested.
By far the most commonly used information criterion in ecology is Akaike’s Information Criterion (AIC) and its corrected version for small samples AICc (Akaike (1974); Wagenmakers & Farrell (2004)). AIC is given by
(2) 
where is the maximised likelihood and is the number of parameters selected. In each case, the model with the lowest AIC is considered to be the most appropriate when applied outofsample. For small sample sizes, however, AIC is slightly biased and thus a corrected, unbiased, version is often used. The corrected version (Claeskens & Hjort (2008)) is defined by
(3) 
Crossvalidation takes a different approach to model selection. Here, the data are divided into two sets: a training set, over which the parameters are selected, and a test set, on which the model is tested with those parameters. The process is then repeated with different subsets of the data set used as the training and test sets. In leaveoneout crossvalidation, the test set consists of a single point and the training set consists of each of the other points. This process is repeated such that each data point forms the test set exactly once. Leaveoneout crossvalidation can be used alongside any method of forecast evaluation and, in this paper, is performed with the ignorance score such that the forecasts can be evaluated probabilistically. In fact, this approach can be shown to be asymptotically equivalent to AIC but will usually be expected to give a different ordering of models for finite sample sizes (Stone (1974)).
This paper suggests a two step process to model selection. First, the significance of the model selection procedure should be assessed at some predefined level to assess the probability that a statistic at least as favourable than that of the ‘best’ model could have occurred by chance, given the candidate models. If the model selection procedure is found to be significant, normal model selection should then take place and the best model(s) chosen. By taking the first step, confidence can be had that the information contained in the models is indeed informative.
It is important to note that, even if the model selection procedure is found to be significant, it is not necessarily the case that any of the model(s) are fit for their required purpose (e.g. population management). To determine whether the models are fit for purpose would require further analysis and consideration beyond the scope of this paper.
2.4 Permutation Tests
A permutation test is a nonparametric statistical test in which the significance of a test statistic is obtained by calculating its distribution under all different permutations of the set of observed outcomes. For example, a permutation test for the slope parameter of a simple linear regression would be performed by permuting the positions of the
values (the dependent variable), keeping the values (the predictor variables) in their original positions and calculating the slope parameter under all possible combinations of . The position of the slope parameter that has been calculated from the data in their original positions would then be compared to this distribution to calculate a pvalue. In practice, it is often computationally prohibitive to consider all possible permutations and thus permutations are randomly chosen a fixed number of times. Such tests are called randomised permutation tests. Permutation tests have a number of advantages over standard parametric tests. Unlike the latter, no assumptions about the distribution of the test statistic under the null hypothesis are required since the method draws from the exact distribution. Permutation tests thus give an exact test and, as such, randomised permutation tests are asymptotically exact. The general nature of permutation tests allows them to be applied in a wide range of settings without knowing the underlying sampling distribution. In this paper, two types of permutation test are demonstrated. The first assesses the significance of a single model without taking into consideration the other models in the model selection procedure whilst the second assesses the significance of the entire model selection procedure and thus takes into account multiple testing.
2.5 Multiple Testing
The problem of multiple testing is wellknown and has been widely studied. Remedies to the problem typically involve adjustments to the pvalue of each hypothesis to reflect the number that are tested. Much of the literature on multiple testing focuses on controlling the familywise error rate (FWER) , defined as the probability of wrongly rejecting at least one of the hypotheses. Whilst, under standard hypothesis testing, usually grows with the number of hypotheses, the aim here is usually to limit the FWER to . Perhaps the most common approach to the problem is the Bonferroni correction which adjusts the required significance level for each test to where is the number of hypotheses tested. A major weakness of the Bonferroni correction, however, is that it assumes that each of the significance tests are independent of each other. When this is not the case, the test is too conservative and the true FWER is less than , resulting in a loss of power. Several modifications to the Bonferroni correction, such as the BonferroniHolm (Holm (1979)), BenjaminiHochberg (Benjamini & Hochberg (1995)) and Sidak (Šidák (1967)) corrections, have been proposed that aim to increase the power by adjusting the order in which hypotheses are considered. None of these approaches take into account dependency between hypotheses, however.
An alternative approach to multiple testing was proposed by Westfall and Young in 1993 and aims to account for dependency between tests (Westfall et al. (1993)). The approach makes use of permutation tests by randomly permuting the outcomes and calculating adjusted pvalues for each hypothesis. An adjusted pvalue for the hypothesis is given by
(4) 
where denotes the observed pvalue for the test, is the ‘complete’ null hypothesis that all null hypotheses are true and is the pvalue of the hypothesis under a given permutation of the outcomes. The adjusted pvalue of the hypothesis corresponds to the probability of obtaining a pvalue as small or smaller from at least one of the hypotheses that are tested simultaneously.
2.6 A Singlemodel Permutation Test
A permutation test is now described with which to test the significance of individual models in a model selection procedure. The test is performed by comparing the observed model selection statistic with the distribution of that statistic under the null hypothesis that the outcomes are independent of the model predictions. An approximate pvalue is calculated by counting the proportion of permutations in which the model selection statistic is smaller (assuming a negatively oriented statistic) than the observed statistic. The singlemodel permutation test is formally described below:

Calculate the model selection statistic under the original ordering of the outcomes.

Set

Randomly permute the outcomes, ensuring that none of them fall into their original positions.

Calculate the model selection statistic under the new ordering.

Set

Repeat steps two to four until .

Calculate the approximate pvalue , where is the indicator function.
In fact, the test outlined above can be considered a standard permutation test and, as such, is not particularly novel and is somewhat similar to that of the WestfallYoung permutation test. However, whilst that test uses individual pvalues to calculate adjusted pvalues for each hypothesis (or model), the above test uses model selection statistics which do not necessarily naturally have pvalues associated with them.
The single model permutation test provides a simple basis with which to assess the significance of a single model. Note that, for a single model, when the chosen model selection statistic is an information criterion, the penalty for the number of parameters is always constant and therefore the test is equivalent to performing a permutation test on the loglikelihood. However, in the next section, the test is extended to multiple models with different numbers of parameters and it is here in which the value of permutation tests for model selection statistics becomes apparent.
2.7 A Modelselection Permutation Test
A permutation test for an entire model selection procedure is now defined. The aim here is to estimate the probability that the ‘best’ model selection statistic could have occurred by chance. We call this test the modelselection permutation test.
Under the model selection permutation test, the outcomes are randomly permuted as they are for the single model permutation test. Here, the null hypothesis is that the outcomes and the model predictions are independent for all tested models. For a given permutation of outcomes, a model selection statistic is calculated for each model. The comparison of interest is between the observed ‘best’ model selection statistic and the statistic of the ‘best’ model under each permutation. The pvalue is estimated by counting the proportion of permutations in which the model selection of the ‘best’ model is more favourable than that of the ‘best’ model under the true ordering of the outcomes. Formally, the procedure is performed as follows:

Calculate the model selection statistic for each model under the original ordering of the outcomes.

Set

Randomly permute the outcomes, ensuring that none of them fall into their original positions.

Calculate the model selection statistic for each model .

Set

Repeat steps two to four until .

Calculate an estimated pvalue where is the indicator function.
2.8 Experiment One: Demonstration of Type I Error Inflation
The aim of experiment one is to demonstrate that, in a case in which all models are, by construction, uninformative, the probability that the ‘best’ model is ‘significant’ increases with the number of candidate models. This represents, by definition, inflation in the probability of a type I error. It is then demonstrated that, for the model selection permutation test, the probability of a type I error is consistent with the prescribed significance level and is not affected by the number of candidate models. This is demonstrated in two cases: one in which the models are defined to be independent of each other and another in which there is dependency between models resulting from shared predictor variables.
The experiment is conducted as follows: Let
be a set of outcomes, each of which are independent, identically distributed draws from a standard Gaussian distribution
. Let be the mth predictor variable of which is also standard Gaussian and is independent of . Define a model to be some combination of predictor variables in a multiple linear regression with as the dependent variable. As such, none of the models have any predictive value outofsample and thus the null model is, by design, the optimal choice. AICc is calculated along with a pvalue from the single model permutation test. The ‘best’ model, according to AICc, is then defined to be significant if its pvalue is less than , i.e. it is significant at the 5 percent level. In addition, the model selection permutation test is performed at the 5 percent level. This procedure is repeated times and the proportion of repeats in which the ‘best’ model is found to be significant and the proportion in which the model selection permutation test is found to be significant is calculated.Two different cases are considered:

independent models  each model is a linear regression with one of as a single predictor variable. There are thus candidate models.

dependent models  candidate variables are defined and distinct combinations are randomly selected, without replacement, as candidate models.
In the former case, by construction, each model is independent. In the latter, however, since different candidate models have shared predictor variables, there is a dependency structure between models. For each value of , there are possible combinations of variables (excluding the null model) and thus only values of up to this value can be considered. Therefore, for , all of the possible combinations of variables are tested and only a subset are tested for .
2.9 Population Modelling Examples
Two real population modelling examples from ecology are used to demonstrate both the single model and model selection permutation tests. Both examples are published in existing papers and are presented here with the minimal details required to effectively demonstrate the methodology presented in this paper. Further details can be found in the papers themselves.
2.9.1 Experiment Two: Ibex
The first population modelling example was published in Ecology in 2004 in ‘Climate forcing and density dependence in a mountain ungulate population’ (Jacobson et al. (2004)). In that paper, the authors fit 20 different population models to attempt to explain changes in the ibex population of Gran Paradiso National Park in Northwestern Italy between the years of 1956 and 2000, using combinations of the following predictor variables:

Current population.

Snow cover.

Interaction between snow cover and current population.
Ten different combinations of the three variables were fitted with both the Modified Stochastic Ricker and Modified Stochastic Gompertz models (defined in the appendix) such that a total of twenty models were assessed. The relative population change in year is defined as where and are the population counts in years and respectively. The Modified Stochastic Ricker and Modified Stochastic Gompertz models are generalised linear models such that the relative population change is modelled as a linear function of the chosen predictor variables. The Stochastic Ricker and Gompertz models differ only in the way they treat the current population size as a predictor variable.
AIC was calculated for each model based on its performance in predicting the relative population change (rather than the actual population size). Although, in that paper, AIC was the only model selection statistic considered, here, for illustration, the models are also compared using leaveoneoutcrossvalidation with the mean ignorance score as the evaluation method (see section 2.2). The model selection statistics for each model are presented relative to that of the null model, i.e. with the statistic of the null model subtracted, such that a negative value indicates more support for the model than for the null model.
To demonstrate the two tests defined in this paper, the single model permutation test is performed for each model and an estimated pvalue is calculated. In addition, results from the model selection permutation test are shown to assess the credibility of the overall model selection procedure.
2.10 Experiment three: Reindeer
This example comes from a study of the population of wild reindeer in Hardangervidda National Park in Southern Norway (Bargmann et al. (n.d.)). The aim of the study was to attempt to understand the factors that cause the population to change over time. This was done using a Modified Stochastic Ricker population model (defined in the appendix) with various combinations of factors as inputs. The following climatic factors were considered as potential predictors of the population:

(a) Mean temperature over January and February.

(b) Days in February/March in which the temperature exceed .

(c) The number of summer growing degree days from June to September (days above 5 degrees Celsius).

(d) The current size of the population (density dependence).

(e) The proportion of the population hunted and killed.

(f) Interaction between proportion killed and chosen weather variable.

(g) Interaction between population size and chosen weather variable.
The winter of 2010 was significantly colder than each of the other years in the data set and was found to be an influential observation (according to Cook’s distance). Given this, the analysis was performed twice: with and without that year included. The corrected version of Akaike’s Information Criterion (AICc) was used to rank the performance of the models.
In this paper, the analysis is repeated and, for the purposes of demonstration, the models are also compared using the crossvalidated mean ignorance score, as an alternative model selection technique. The analysis is performed with the year 2010 removed (see above). Following the original paper, a slightly different approach is taken to that of the ibex example. Whilst, in the ibex case, the performance of the models was assessed in terms of prediction of the relative population change, in this case, forecasts of the actual population counts were produced. To do this, MonteCarlo simulation was used with a large sample and forecast distributions were produced using kernel density estimation.
Both the single model permutation test and the model selection permutation test are performed in the context of both the crossvalidated mean ignorance and the AICc for the original set of candidate models. The experiment is then repeated with a subset of the models to demonstrate a case in which the model selection permutation test is not significant.
3 Results
3.1 Experiment One: Demonstration of Type I Error Inflation
The results of experiment one are now presented. In figure 1, the dashed lines show, as a function of , the proportion of repeats in which the ‘best’ model, as selected by AICc, is found to be significant under the single model permutation test and the solid lines show the proportion of repeats in which the model selection permutation test is found to be significant. Both tests are performed at the 5 percent level. The grey area denotes the interval in which the proportions would fall with 95 percent probability if the underlying probability of a significant result were truly 5 percent. If the proportion falls outside of this range, there is significant evidence that the probability of rejecting, and therefore committing a type I error, is different to the prescribed significance level.
As expected, as the number of candidate models is increased, the probability of a significant result for the ‘best’ model is inflated beyond the prescribed significance level. This is true of both the independent and dependent models cases. In the latter case, the probability increases less quickly because fewer predictor variables are considered and therefore the probability of finding one that happens to be ‘significantly’ correlated with the outcomes is reduced. This shows the importance of taking dependency between models into account. The proportion of cases in which a significant result is found for the model selection permutation test is demonstrated to be consistent with the significance level of 5 percent.
3.2 Experiment Two: Ibex
The results of the model selection procedure for the ibex example are shown in table 1. Consistent with the original paper, columns headed by , and indicate whether density dependence, snow cover and the interaction between the two, respectively, have been included in the model. It is found that almost all of the models outperform the null model under both model selection methods. Estimated pvalues calculated using the single model permutation test are shown for each model based on the two model selection techniques considered. In all cases, the estimated pvalues are found to be extremely small.
Although it seems unlikely that the performance of the models could be explained simply through random chance, it is rigorous to use the model selection permutation test to assess the overall significance of the model selection procedure. For both model selection techniques, out of permutations tested, none were found in which the ‘best’ model outperformed that for the observed outcomes and thus the estimated pvalue is zero. A CDF of the AIC of the ‘best’ model (relative to the null model) under each permutation is shown in the top panel of figure 2 along with the minimum AIC from the observed data set. The equivalent, but with the crossvalidated mean ignorance, is shown in the lower panel. From, these results, it is clear that it is extremely unlikely that the ‘best model’ in the model selection procedure occurred purely by chance. Given its strong significance, confidence can be had that the results indicate genuine predictive skill.
Model  b  c  e  Pars  DD  AIC  pvalue  Mean Ign  pvalue 

M11  *  *  *  7  R  
M12  *  *  *  7  G  
M13  *  *  5  R  
M14  *  *  5  G  
M15  *  *  5  R  
M4  *  *  3  G  
M2  *  *  *  4  G  
M16  *  *  5  G  
M3  *  *  3  R  
M18  *  *  5  G  
M1  *  *  *  4  R  
M5  *  *  3  R  
M6  *  *  3  G  
M8  *  *  3  G  
M7  *  *  3  R  
M17  *  *  5  R  
M9  *  2  R  
M19  *  3  R  
M10  *  2  G  
M20  *  3  G  
M0  1     
3.3 Experiment three: Reindeer
The results of the model selection procedure for the reindeer case are shown in table 2. Here, those variables that are included in the model are indicated with a star. The letters correspond to the variables listed in section 2.10. The AICc and mean ignorance (both shown relative to that of the null model) are shown for each model along with estimated pvalues obtained from the single variable permutation test.
Model  a  b  c  d  e  f  g  h  AICc  Mean ign.  p (IGN)  p (AICc) 

M3  *  *  
M5  *  *  *  
M10  *  *  *  
M1  *  
M8  *  *  
M2  *  *  
M4  *  *  *  
M6  *  
M7  *  *  
M9  *  *  *  
M14  *  *  *  
M13  *  *  
M16  *  *  *  
M11  *  
M0      
M12  *  *  
M15  *  *  * 
Here, whilst a number of the models are found to be strongly significant, the pvalues of those models are typically larger than for the best ibex population models in experiment two. It is therefore prudent to apply the model selection permutation test to assess the probability that the model selection statistics of the ‘best’ model could have occurred by chance. The results of doing this using the multiple model permutation test are shown in table 3. Here, the pvalues are small and therefore confidence can be had that the ‘best’ model is indeed informative relative to the null model and did not simply occur by chance.
AICc  Crossvalidated mean ignorance 

The reindeer example is now used to demonstrate a case in which, whilst one or more of the models is found to be significant, the probability that this occurred by chance is found to be high. Consider a model selection procedure in which the best six models according to the AICc in table 2 were not used as candidate models and therefore the selection is between the ten remaining models. The included models are those below the horizontal line in the table. At least one of the candidate models is significant at the 5 percent level for both model selection techniques. However, given the number of candidate models, caution is advised. Applying the model selection permutation test, the pvalues shown in table 4 are obtained. Here, in both cases, the test is insignificant at the 5 percent level and thus there is a high probability that the significance of the individual models simply occurred by chance.
AICc  Crossvalidated mean ignorance 

4 Discussion
There is a clear and obvious need in ecology for authors to assess the absolute value of the ‘best’ model in a model selection procedure. Currently, this step is all too often completely absent. The single model permutation test defined in this paper provides a generalised approach with which to assess the significance of a model. However, by selecting the ‘best’ model via model selection and proceeding to evaluate its significance, the probability of a type I error can be inflated far beyond the significance level. This is because the model with the best model selection statistic has already been determined as one that performs well relative to the other models, perhaps by chance.
One can imagine that, if each of the models were independent, intuition could be used to assess the impact of multiple testing. Caution would be advised if one out of a total of twenty models were significant at the five percent level, for example. The Bonferroni correction works on this basis. Commonly, in model selection in ecology, the same variables are present in multiple models. Given this dependency, this intuition is lost and therefore more formal methods are required. The model selection permutation test has been proposed for situations such as these. The test estimates the probability that the ‘best’ model could have occurred by chance, whilst taking the dependency structure between the models into account. As such, the test gives a clear and intuitive approach to the problem of significance in model selection by assessing the entire model selection procedure.
The tests described in this paper can be used to assess whether a set of variables can provide better predictions than the null model in a population modelling procedure. Although the focus here is on ecology and, in particular, population modelling, the methodology is highly applicable to other fields in which model selection is applied. For example, in sports forecasting, one may want to determine which combination of factors most impact the probability of scoring a goal or the outcome of a game.
Whilst the tests described can help provide confidence that the best candidate variables are more informative than the null model in terms of making predictions, it should be highly stressed that, even if a model can be shown to significantly outperform the null model, it is not necessarily the case that the model is fit for a particular purpose. Before using the model, further evidence regarding the suitability of the model in a particular setting should be gathered. Nonetheless, the tests described in this paper provide a key step towards rigorous model selection in ecology which, in turn, allows for better modelling and hence a better understanding of the factors that impact animal populations.
Appendix A Population Modelling
The permutation tests described in this paper are demonstrated using two population modelling examples taken from existing papers. Background methodology relevant to both papers is described here. Each of the two examples make use of population models. The Modified Stochastic Gompertz Model is defined by
(5) 
and the Modified Stochastic Ricker model is defined by
(6) 
where is the population count in the year, is called the relative population change, is the explanatory variable, and is a random draw from a Gaussian distribution with mean zero and variance . The two models are very similar and only differ in how the current population is used as an explanatory variable (i.e. which form of so called ‘density dependence’ is considered). The parameters , and are to be selected using leastsquares estimation. The Stochastic Gompertz and Ricker Models automatically give probabilistic forecasts of the relative population change in the form of of a Gaussian distribution . The forecast distribution of the relative population change can be used to estimate a forecast distribution of the actual population. In this paper, where applicable (for the reindeer case), this is done using MonteCarlo simulation with samples.
A ‘null’ model distribution naturally arises from the Modified Stochastic Ricker or Gompertz Model with all parameters except for the intercept and the variance set to zero. The null model therefore takes the form where and are parameters to be selected.
References
 (1)
 Akaike (1974) Akaike, H. (1974), ‘A new look at the statistical model identification’, Automatic Control, IEEE Transactions on 19(6), 716–723.
 Bargmann et al. (n.d.) Bargmann, T., Wheatcroft, E., Imperio, S. & Vettas, O. R. (n.d.), ‘Effects of weather and hunting on wild reindeer population dynamics in hardangervidda national park’, Population Ecology (accepted) .
 Benjamini & Hochberg (1995) Benjamini, Y. & Hochberg, Y. (1995), ‘Controlling the false discovery rate: a practical and powerful approach to multiple testing’, Journal of the Royal statistical society: series B (Methodological) 57(1), 289–300.
 Bernardo (1979) Bernardo, J. M. (1979), ‘Expected information as expected utility’, the Annals of Statistics pp. 686–690.

Bolker et al. (2009)
Bolker, B. M., Brooks, M. E., Clark, C. J., Geange, S. W., Poulsen, J. R., Stevens, M. H. H. & White, J.S. S. (2009), ‘Generalized linear mixed models: a practical guide for ecology and evolution’,
Trends in ecology & evolution 24(3), 127–135.  Bonferroni (1936) Bonferroni, C. E. (1936), ‘Teoria statistica delle classi e calcolo delle probabilita: Libreria internazionale seeber’.
 Burnham & Anderson (2001) Burnham, K. P. & Anderson, D. R. (2001), ‘Kullbackleibler information as a basis for strong inference in ecological studies’, Wildlife research 28(2), 111–119.
 Burnham & Anderson (2004) Burnham, K. P. & Anderson, D. R. (2004), ‘Multimodel inference: understanding aic and bic in model selection’, Sociological methods & research 33(2), 261–304.
 Claeskens & Hjort (2008) Claeskens, G. & Hjort, N. L. (2008), Model Selection and Model Averaging, Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press.
 Gneiting & Raftery (2007) Gneiting, T. & Raftery, A. E. (2007), ‘Strictly proper scoring rules, prediction, and estimation’, Journal of the American Statistical Association 102(477), 359–378.
 Good (1952) Good, I. J. (1952), ‘Rational decisions’, Journal of the Royal Statistical Society: Series B 14, 107–114.
 Holm (1979) Holm, S. (1979), ‘A simple sequentially rejective multiple test procedure’, Scandinavian journal of statistics pp. 65–70.
 Imperio et al. (2013) Imperio, S., Bionda, R., Viterbi, R. & Provenzale, A. (2013), ‘Climate change and human disturbance can lead to local extinction of alpine rock ptarmigan: New insight from the western italian alps’, PloS one 8(11), e81598.

Jacobson et al. (2004)
Jacobson, A. R., Provenzale, A., von Hardenberg, A., Bassano, B. & FestaBianchet, M. (2004), ‘Climate forcing
and density dependence in a mountain ungulate population’, Ecology 85(6), 1598–1610.
http://dx.doi.org/10.1890/020753  J.Bröcker & Smith (2007) J.Bröcker & Smith, L. (2007), ‘Scoring probabilistic forecasts: the importance of being proper’, Tellus A 22(2).
 Johnson & Omland (2004) Johnson, J. B. & Omland, K. S. (2004), ‘Model selection in ecology and evolution’, Trends in ecology & evolution 19(2), 101–108.

Mac Nally et al. (2017)
Mac Nally, R., Duncan, R. P., Thomson, J. R. & Yen, J. D. L.
(2017), ‘Model selection using information
criteria, but is the “best” model any good?’, Journal of Applied
Ecology pp. n/a–n/a.
http://dx.doi.org/10.1111/13652664.13060  Murphy & Epstein (1989) Murphy, A. H. & Epstein, E. S. (1989), ‘Skill scores and correlation coefficients in model verification’, Monthly weather review 117(3), 572–582.
 Nakagawa (2004) Nakagawa, S. (2004), ‘A farewell to bonferroni: the problems of low statistical power and publication bias’, Behavioral ecology 15(6), 1044–1045.
 Roulston & Smith (2002) Roulston, M. S. & Smith, L. A. (2002), ‘Evaluating probabilistic forecasts using information theory’, Monthly Weather Review 130, 1653–1660.

Šidák (1967)
Šidák, Z. (1967), ‘Rectangular confidence regions for the means of multivariate normal distributions’,
Journal of the American Statistical Association 62(318), 626–633.  Stone (1974) Stone, M. (1974), ‘Crossvalidatory choice and assessment of statistical predictions’, Journal of the Royal Statistical Society: Series B (Methodological) 36(2), 111–133.
 Symonds & Moussalli (2011) Symonds, M. R. & Moussalli, A. (2011), ‘A brief guide to model selection, multimodel inference and model averaging in behavioural ecology using akaike’s information criterion’, Behavioral Ecology and Sociobiology 65(1), 13–21.
 Thieme (2018) Thieme, H. R. (2018), Mathematics in population biology, Princeton University Press.
 Wagenmakers & Farrell (2004) Wagenmakers, E.J. & Farrell, S. (2004), ‘Aic model selection using akaike weights’, Psychonomic Bulletin and Review 11(1), 192–196.
 Westfall et al. (1993) Westfall, P. H., Young, S. S. & Wright, S. P. (1993), ‘On adjusting pvalues for multiplicity’, Biometrics 49(3), 941–945.
 Wheatcroft (2015) Wheatcroft, E. (2015), Improving predictability of the future by grasping probability less tightly, PhD thesis, The London School of Economics and Political Science (LSE).
 Wheatcroft (2019) Wheatcroft, E. (2019), ‘Beyond the null model in ecology (in preparation)’, Journal of Applied Ecology .