1 Likelihood-based goodness-of-fit comparisons and their limitations
Traditional goodness-of-fit measures are based around likelihood, which is defined, for observed data $x$ and model $M$ with parameter set $\theta$, as
$$\mathcal{L}(\theta \mid x) = p(x \mid \theta, M),$$
where $p(x \mid \theta, M)$ is the probability distribution for data $x$ under the model $M$. Since likelihood can in most instances be increased by arbitrarily increasing model complexity, more sophisticated goodness-of-fit measures penalise complexity, measured as the number of parameters estimated in the model. For example, Akaike’s Information Criterion (AIC) is defined [Akaike:1974aa] as
$$\mathrm{AIC} = 2k - 2\ln \mathcal{L}(\hat\theta \mid x),$$
while the Bayesian Information Criterion (BIC) is given [Schwarz:1978aa] by
$$\mathrm{BIC} = k \ln n - 2\ln \mathcal{L}(\hat\theta \mid x),$$
where $\mathcal{L}(\hat\theta \mid x)$ is the likelihood of model $M$ under maximum likelihood parameters $\hat\theta$, $k$ is the number of parameters estimated, and $n$ is the number of data points in observed data set $x$. A model with the smallest AIC or BIC is selected. Naturally, BIC penalises complexity more than AIC (unless $\ln n < 2$, which should never occur in practice, as this implies that there are 7 or fewer observed data points).
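As a brief illustration of these criteria (the data and the pair of candidate models below are invented for the example, and `scipy` is assumed available), AIC and BIC can be computed directly from maximised log-likelihoods:

```python
import numpy as np
from scipy import stats

def aic(loglik, k):
    """AIC = 2k - 2 ln L-hat."""
    return 2 * k - 2 * loglik

def bic(loglik, k, n):
    """BIC = k ln n - 2 ln L-hat."""
    return k * np.log(n) - 2 * loglik

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=1.5, size=200)

# Normal model: k = 2 (mean, standard deviation), MLE in closed form.
mu, sigma = x.mean(), x.std()
ll_norm = stats.norm.logpdf(x, mu, sigma).sum()

# Student-t model: k = 3 (df, location, scale), fitted numerically.
df, loc, scale = stats.t.fit(x)
ll_t = stats.t.logpdf(x, df, loc, scale).sum()

n = len(x)
# The model with the smaller criterion is selected; BIC penalises the
# extra t parameter more heavily here, since ln(200) > 2.
print(aic(ll_norm, 2), aic(ll_t, 3))
print(bic(ll_norm, 2, n), bic(ll_t, 3, n))
```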
AIC- or BIC-based model selection, however, is limited in the situations in which it can be applied. Situations in which AIC or BIC are inappropriate include:

where likelihood is intractable, in theory or practice;

where the structure of the set of models, or data upon which the models are compared, makes AIC or BIC inappropriate; and

where alternative model selection bases are deemed more appropriate, for philosophical and/or practical reasons.
Each of these situations will now be discussed in turn.
1.1 Intractable likelihood
Likelihood may be impossible to calculate for one or more of the candidate models. Types of data and models for which this might occur are well documented in the literature; examples include networks [Caimo:2015aa, Ratmann:2007aa, Ratmann:2009aa], complicated time series [Breto:2009aa, Jasra:2014aa], and hidden Markov models [Yildirim:2015aa]. In differentiating between these types of models, Approximate Bayesian Computation (ABC) has recently gained popularity, but this technique is sensitive to prior distributions for both the choice of model and for each model’s parameters, as well as to choices of summary statistics [Robert:2011aa]. This manuscript presents an alternative manner of differentiating between models, without the selection of prior probabilities in the model space.
1.2 Model or data structure
Some model structures are of sufficiently different form to be incomparable using likelihood-based methods like AIC or BIC. For example, suppose some univariate data of sample size $n$ is to be fit either using a normal distribution, or using kernel density estimates. AIC or BIC require calculation of likelihood, and of the number of parameters. The first model, the normal distribution, has two parameters: the mean and standard deviation. For the kernel density model, the ‘number of parameters’ is somewhat nebulous, as all observed data points are considered for all estimates, but have differing influence.
The structure of data might also lead to models being incomparable using AIC or BIC. For example, suppose two time series models are to be considered: an ARIMA model, and an ARIMA model applied to differenced data. There is one more data point in the ARIMA model than in the differenced ARIMA model. As such, the likelihood function of the ARIMA model will contain one more term than will the differenced ARIMA model, so will be of a different order of magnitude [Harvey:1980aa]. This makes direct comparison of likelihoods, and likelihood-based goodness-of-fit measures, unavailable. The MMM procedure in this manuscript overcomes this issue using a simulation-based approach.
1.3 Alternative model selection bases
Finally, likelihood is not always the preferred criterion according to which we wish to select a model. One example of this is in choosing an appropriate distribution to fit to some given data. Since likelihood measures the probability density of the observed data, usually treated as independent observations, given a model, it does not take into consideration whether the data fit the shape of the proposed distribution; that is, likelihood is not designed to differentiate whether a candidate distribution is appropriate given the skewness, kurtosis or other moments of the data.
In this instance, it might be desirable to compare candidate distributions on the basis of a distributional goodness-of-fit measure, such as the Kolmogorov–Smirnov statistic [Massey:1951aa], Székely and Rizzo’s energy statistic [Szekely:2005aa], or some more rudimentary summary statistic like the number of extreme values in a distribution. Since these statistics’ raw values cannot be directly compared between candidate models, a more rigorous framework to compare these values must be considered. The MMM framework in this manuscript is able to address this.
A similar motivation may be to compare models on the basis of how reasonable the models’ assumptions are. Goodness-of-fit to model assumptions is not a simple binary question; some valid models are more well-founded than others in terms of their assumptions. For example, consider a simple linear regression model. One assumption is normality of the error term, which is usually assessed using a quantile–quantile plot of the residuals. The assumption of normality might be considered reasonable for a number of different quantile–quantile plots, but one might more readily accept some plots than others. This is one demonstration of the fact that the validity of an assumption is on a continuum. Where a model fits on this continuum for some distributional assumption may distinguish some candidate models from others.
Since fidelity to assumptions justifies the generalisability of a model, a model might be favoured if it satisfies its assumptions better than other candidates. However, it is not immediately obvious how one might compare the fidelity of two or more models to their respective distributional assumptions. For example, if two models assume different error distributions, comparing these assumptions would involve comparing the goodness-of-fit of one set of errors to one distribution, with the goodness-of-fit of another set of errors to a different distribution. Again, test statistics for distributional fit, like the Kolmogorov–Smirnov or energy statistics [Szekely:2005aa], are not measured on the same scale for all candidate distributions, and may be more sensitive to some kinds of lack-of-fit than to others. These raw goodness-of-fit statistics thus cannot be directly compared between models. The problem of comparing distributional goodness-of-fit among such models is explored in detail in this manuscript, with the MMM framework able to provide for such comparisons.
2 Genesis of general goodness-of-fit comparisons
2.1 Likelihood-ratio test for nested hypotheses
The likelihood-ratio statistic is a well-known basis for hypothesis tests comparing two nested models [Wilks:1938aa]. When models are nested, they come from the same parameterised family, so the hypothesis test consists of choosing between two sets of parameters for this family, often denoted $\Theta_0$ and $\Theta_1$. The likelihood-ratio statistic is
$$\lambda = -2 \ln \frac{\sup_{\theta \in \Theta_0} \mathcal{L}(\theta \mid x)}{\sup_{\theta \in \Theta_1} \mathcal{L}(\theta \mid x)}$$
for observed data $x$ and common likelihood function $\mathcal{L}$. Significance levels for this statistic are easily determined, since for nested models,
$$\lambda \xrightarrow{d} \chi^2_{d_1 - d_0}$$
as sample size $n \to \infty$, with $d_0$ the dimensionality of $\Theta_0$ and $d_1$ the dimensionality of $\Theta_1$ [Wilks:1938aa].
This test, however, cannot be undertaken for non-nested hypotheses, since in this case, the likelihoods in the expression for $\lambda$ would be from different families of distribution, meaning the distribution of $\lambda$ does not necessarily follow Wilks’ asymptotic expression.
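A minimal sketch of this test follows; the normal family with a fixed-mean null against a free-mean alternative is an invented illustration, not an example from the literature discussed here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(loc=0.3, scale=1.0, size=100)

# Nested hypotheses from the same normal family:
#   H0: mean = 0   (1 free parameter: sigma)
#   H1: mean free  (2 free parameters: mu, sigma)
sigma0 = np.sqrt((x ** 2).mean())              # MLE of sigma under H0
ll0 = stats.norm.logpdf(x, 0.0, sigma0).sum()
mu1, sigma1 = x.mean(), x.std()
ll1 = stats.norm.logpdf(x, mu1, sigma1).sum()

lam = -2 * (ll0 - ll1)                         # likelihood-ratio statistic
p_value = stats.chi2.sf(lam, df=2 - 1)         # chi-squared reference, d1 - d0 = 1
print(lam, p_value)
```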
2.2 Extension to non-nested hypotheses
In response to this shortcoming, Cox (1961) [Cox:1961aa] extends the likelihood-ratio test to some cases with non-nested hypotheses. Suppose we have some realisations $x_1, \ldots, x_n$, and seek to compare two hypotheses:
$$H_f: x_i \sim f(\cdot \mid \alpha), \qquad H_g: x_i \sim g(\cdot \mid \beta),$$
where $f$ and $g$ are non-nested distributions, and $H_f$ is the null hypothesis. Cox’s test statistic compares the logarithm of the ratio of likelihoods under each hypothesis to the expected value of the log-ratio under the null hypothesis. Cox’s statistic can be written as
$$T = \ln \frac{L_f(\hat\alpha)}{L_g(\hat\beta)} - \mathbb{E}_{\hat\alpha}\!\left[\ln \frac{L_f(\hat\alpha)}{L_g(\hat\beta)}\right],$$
where $L_f$ and $L_g$ are likelihood functions of distributions $f$ and $g$ respectively, and $\mathbb{E}_{\hat\alpha}$ denotes an expectation with respect to distribution $f$ with parameters $\hat\alpha$.
Cox demonstrates that $T$ is asymptotically normally distributed, with mean zero. He notes that the variance of $T$, and the expectation term in the expression for $T$, are difficult to calculate, depending on the distributions $f$ and $g$ [Cox:1961aa]. Derivations of these values for a limited number of pairs of non-nested hypotheses have been published (see, e.g., [Cox:1962aa, Jackson:1968aa, Walker:1967aa]).
2.3 A Monte Carlo approach to test statistic distributions
Due to the intractability of the asymptotic distribution of Cox’s statistic for some pairs of non-nested hypotheses, and the fact that $T$ often converges to its asymptotic distribution slowly [Williams:1970ab], Williams (1970) introduces a simulation approach to determining a distribution of the test statistic. Using an equivalent variation on the test statistic, Williams proposes simulating distributions for $T$ under both the null and alternative hypotheses, and then drawing a conclusion as to which hypothesis is to be favoured [Williams:1970aa, Williams:1970ab]. More explicitly, his approach consists of the following steps:

Estimate parameters $\hat\alpha$ and $\hat\beta$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in observed data $x$, and calculate an observed value of $T$.

Simulate $N$ datasets of size $n$ under $f$, denoted $x^f_1, \ldots, x^f_N$, and $N$ datasets of size $n$ under $g$, denoted $x^g_1, \ldots, x^g_N$.

For each simulation $i = 1, \ldots, N$, calculate $T^f_i$ and $T^g_i$.

Compare the observed value of $T$ to the distributions of the $T^f_i$s and the $T^g_i$s:
The null hypothesis, that the data arose from $f$, is to be preferred if the observed value of $T$ is more likely under the distribution of the $T^f_i$ than of the $T^g_i$. Otherwise, the alternative hypothesis, that the data arose from $g$, is to be preferred.
If the observed value of $T$ is unlikely under both distributions, it may be that neither model is preferable, and if the observed value of $T$ is likely under both models, there is insufficient evidence to prefer one model over another [Williams:1970aa].
Williams proposes a small value of $N$ due to the historical computing limitations of his time, but larger values can now be chosen to give greater confidence in model choice [Williams:1970ab].
Williams notes that a limitation of his approach is that the simulated distribution of the test statistic is strongly dependent upon the values of the estimated parameters $\hat\alpha$ and $\hat\beta$, and suggests further simulation may alleviate this [Williams:1970ab]. He does not suggest a specific method for doing so. The model mimicry method accounts for this parameter uncertainty using a bootstrap.
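The simulation steps above can be sketched in code; the normal null against a Cauchy alternative, the raw log-likelihood-ratio as the statistic, and the crude closest-mean decision rule are all this author's illustrative assumptions, not Williams' exact specification:

```python
import numpy as np
from scipy import stats

def log_lr(x):
    """Log-likelihood ratio statistic: fitted normal (H0) vs fitted Cauchy (H1)."""
    mu, s = x.mean(), x.std()
    loc, scale = stats.cauchy.fit(x)
    return (stats.norm.logpdf(x, mu, s).sum()
            - stats.cauchy.logpdf(x, loc, scale).sum())

rng = np.random.default_rng(3)
x = rng.standard_cauchy(100)          # observed data (true model: Cauchy)
t_obs = log_lr(x)

# Simulate N datasets under each fitted hypothesis and recompute the
# statistic, giving a reference distribution under H0 and under H1.
N = 100
mu0, s0 = x.mean(), x.std()
c_loc, c_scale = stats.cauchy.fit(x)
t_h0 = np.array([log_lr(rng.normal(mu0, s0, x.size)) for _ in range(N)])
t_h1 = np.array([log_lr(stats.cauchy.rvs(c_loc, c_scale, size=x.size,
                                         random_state=rng)) for _ in range(N)])

# Favour the hypothesis whose reference distribution t_obs sits closer to
# (here judged crudely by distance to the simulated means).
prefer_h1 = abs(t_obs - t_h1.mean()) < abs(t_obs - t_h0.mean())
```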
3 The model mimicry method
3.1 The model mimicry method for two models
As noted in Section 2.3, the approach of Williams in non-nested model selection is hindered by the fact that it does not account for parameter uncertainty [Williams:1970ab, Wagenmakers:2004aa].
An additional limitation of his approach is that it is built around a likelihood-ratio goodness-of-fit measure [Wagenmakers:2004aa]. While this is suitable for a number of model comparison situations, one might instead wish to compare popular measures like the Akaike information criterion (AIC) or the Bayesian information criterion (BIC) across non-nested models.
Other goodness-of-fit measures may also be preferred in specific instances. For example, we might seek a model whose distributional assumptions are best justified. A goodness-of-fit measure for multivariate distributions is thus more appropriate here than comparing likelihoods.
Accounting for these drawbacks of the approach of Williams (1970), Wagenmakers et al. (2004) [Wagenmakers:2004aa] present a method for testing hypotheses of non-nested models which both accounts for uncertainty in parameter estimation, and is suited to general goodness-of-fit measures. The method consists of reframing Williams’ approach in terms of a generic goodness-of-fit measure, and adding the additional step of a nonparametric bootstrap prior to each simulation. The nonparametric bootstrap prevents the distribution of goodness-of-fit measures from relying heavily on a particular parameter estimate; instead, for a stable model, a variety of parameters will be used, drawn from the region of the parameter space inhabited by those estimated using the observed data.
Wagenmakers et al. call this approach “model mimicry” [Wagenmakers:2004aa]. This is because the method leads to the selection of models that best replicate the observed data. The specific method they propose is labelled the “parametric bootstrap cross-fitting method” (‘PBCM’), since the act of simulating data under each model is a parametric bootstrap, and each model is fit to both models’ simulations. In this manuscript, this is referred to as “model mimicry”, so as to provide a clearer distinction with the method later introduced for comparing multiple models (“multimodel mimicry”).
The model mimicry approach of Wagenmakers et al. is as follows [Wagenmakers:2004aa]:

Apply the nonparametric bootstrap to the observations in observed data $x$; in other words, take a sample of size $n$ from $x$, sampling with replacement. Denote this $x^*$.

Estimate parameters $\hat\alpha$ and $\hat\beta$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in bootstrap $x^*$.

Simulate a dataset of size $n$ under $f$, denoted $y^f$, and a dataset of size $n$ under $g$, denoted $y^g$.

Fit both models to both sets of simulated data, and calculate goodness-of-fit (‘GOF’) measures for each of the models’ fit. In other words:

Estimate parameters for $f$ and $g$ from the observations in $y^f$, and calculate GOF measures $\mathrm{GOF}_f(y^f)$ and $\mathrm{GOF}_g(y^f)$; and

Estimate parameters for $f$ and $g$ from the observations in $y^g$, and calculate GOF measures $\mathrm{GOF}_f(y^g)$ and $\mathrm{GOF}_g(y^g)$.

For the data generated from $f$, calculate the difference in the goodness-of-fit measures between the two models:
$$\Delta^f = \mathrm{GOF}_f(y^f) - \mathrm{GOF}_g(y^f).$$
Do the same for the data generated from $g$:
$$\Delta^g = \mathrm{GOF}_f(y^g) - \mathrm{GOF}_g(y^g).$$

Repeat steps 1–5 for $i = 1, \ldots, N$, yielding $N$ observations from the distribution of $\Delta^f$ and of $\Delta^g$.

Meanwhile, fit both models to the observed data $x$, yielding parameter estimates for each. Calculate the goodness-of-fit of each model, $\mathrm{GOF}_f(x)$ and $\mathrm{GOF}_g(x)$, and the difference between these:
$$\Delta^{\mathrm{obs}} = \mathrm{GOF}_f(x) - \mathrm{GOF}_g(x).$$

Compare the observed value $\Delta^{\mathrm{obs}}$ to the distributions of $\Delta^f$ and $\Delta^g$:

The null hypothesis, that the data arose from $f$, is to be preferred if the observed value $\Delta^{\mathrm{obs}}$ is more likely under the distribution of $\Delta^f$ than of $\Delta^g$; otherwise, the alternative hypothesis, that the data arose from $g$, is to be preferred.

In other words, select model $f$ if
$$p_{\Delta^f}(\Delta^{\mathrm{obs}}) > p_{\Delta^g}(\Delta^{\mathrm{obs}})$$
for density functions $p_{\Delta^f}$ and $p_{\Delta^g}$.

A diagram outlining the model mimicry method can be found in Figure 1.
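The steps above can be sketched in code; the choice of maximised log-likelihood as the goodness-of-fit measure, and of scipy's `norm` and `cauchy` as the two candidates, are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def gof(dist, y):
    """Goodness of fit: maximised log-likelihood of `dist` on data y."""
    return dist.logpdf(y, *dist.fit(y)).sum()

def model_mimicry(x, dist0, dist1, n_rep=100, rng=None):
    """Parametric bootstrap cross-fitting: returns simulated GOF
    differences under each model, plus the observed difference."""
    if rng is None:
        rng = np.random.default_rng()
    d0, d1 = [], []
    for _ in range(n_rep):
        xb = rng.choice(x, size=x.size, replace=True)   # nonparametric bootstrap
        p0, p1 = dist0.fit(xb), dist1.fit(xb)           # refit both models
        y0 = dist0.rvs(*p0, size=x.size, random_state=rng)
        y1 = dist1.rvs(*p1, size=x.size, random_state=rng)
        d0.append(gof(dist0, y0) - gof(dist1, y0))      # data simulated from model 0
        d1.append(gof(dist0, y1) - gof(dist1, y1))      # data simulated from model 1
    d_obs = gof(dist0, x) - gof(dist1, x)
    return np.array(d0), np.array(d1), d_obs

rng = np.random.default_rng(4)
x = rng.standard_cauchy(100)
d0, d1, d_obs = model_mimicry(x, stats.norm, stats.cauchy, n_rep=50, rng=rng)
# Prefer the model under whose simulated difference distribution d_obs is
# more likely (here judged crudely by distance to the simulated means).
prefer_cauchy = abs(d_obs - d1.mean()) < abs(d_obs - d0.mean())
```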
Data from a Cauchy distribution:
To illustrate this process, a comparison was made between the fit of the normal and Cauchy distributions to data simulated according to a Cauchy distribution. Using a Cauchy distribution, 100 variates were simulated. Model mimicry was applied, with 500 replicates, comparing the normal distribution to the Cauchy distribution. The goodness-of-fit statistic chosen for comparison of fit to the distributions was Székely and Rizzo’s energy statistic [Szekely:2005aa]; this demonstrates the notion that model mimicry can be applied in a much broader range of situations than likelihood-based approaches.
The energy statistic relies upon the idea that the structure of Euclidean distances between independently drawn deviates from a given distribution is unique to that distribution. In other words, if two distributions $F$ and $G$ are different, the expected distance between one point from $F$ and one point from $G$ should be greater than the mean of: (1) the expected distance between two points from $F$; and (2) the expected distance between two points from $G$. The energy statistic uses this argument to measure the distance between observed data and a proposed distribution.
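A Monte Carlo sketch of a univariate energy statistic follows; the estimator below, and the use of naive fitted parameters in the usage lines, are this author's illustrative assumptions rather than Székely and Rizzo's exact formulation:

```python
import numpy as np

def energy_stat(x, sampler, n_mc=2000, rng=None):
    """Monte Carlo estimate of an energy goodness-of-fit statistic for
    sample x against the distribution generated by `sampler(size, rng)`."""
    if rng is None:
        rng = np.random.default_rng()
    n = x.size
    y = sampler(n_mc, rng)                       # draws from the candidate
    xy = np.abs(x[:, None] - y[None, :]).mean()  # estimates E|X - Y|
    yy = np.abs(y[:, None] - y[None, :]).mean()  # estimates E|Y - Y'| (slight bias from i = j terms)
    xx = np.abs(x[:, None] - x[None, :]).mean()  # mean within-sample distance
    return n * (2 * xy - yy - xx)

rng = np.random.default_rng(5)
x = rng.standard_cauchy(100)
mu, s = x.mean(), x.std()                        # naive normal fit
e_norm = energy_stat(x, lambda m, g: g.normal(mu, s, m), rng=rng)
e_cauchy = energy_stat(x, lambda m, g: g.standard_cauchy(m), rng=rng)
```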
The application of the model mimicry technique yielded 500 observations from the distributions of $\Delta^{\mathrm{Normal}}$ and $\Delta^{\mathrm{Cauchy}}$, the difference in energy statistics between the normal and the Cauchy distributions when the true model is assumed to be normal and Cauchy respectively. For a visual representation, a plot of logarithms of the two distributions, shifted by a constant, with a vertical line for the observed value $\Delta^{\mathrm{obs}}$, can be found in Figure 2. Natural logarithms needed to be taken due to the very large variation in empirical energy statistics for the Cauchy distribution, and the constant is added to ensure all values are positive, since logarithms can only be taken of positive values.
The observed value, $\Delta^{\mathrm{obs}}$, is represented by a black line. It is clear from Figure 2 that a Cauchy distribution is a better fit to the Cauchy-simulated data than a normal distribution, since the observed value is far more likely under the distribution simulated from the Cauchy model, for the fitted model parameters, than under that simulated from the normal model.
As seen in this example, simulated distributions of goodness-of-fit measures provide a useful reference by which to compare the fit of two models. When the observed difference in goodness-of-fit between two models is more likely under one model than another, this model is to be preferred, regardless of whether the raw goodness-of-fit statistic is higher for one model than for another. The model mimicry method, in being able to compare models for which goodness-of-fit statistics are not directly comparable, and being able to compare irregular models like the Cauchy distribution, presents a robust method for complicated model comparisons.
3.2 Extensions to more than two models
Wagenmakers et al.’s model mimicry is limited in that it allows for the comparison of only two models. The method of comparing distributions of differences in goodness-of-fit does not easily extend to more than two models, so when more than two non-nested candidate models are to be compared, another approach should be taken.
While more than two models can be compared on a pairwise basis, it should be noted that Wagenmakers et al.’s model mimicry assumes in each comparison that one of the competing models is true. The decision rule for model selection encouraged by Wagenmakers et al.’s model mimicry, that we should prefer model $f$ over $g$ if $p_{\Delta^f}(\Delta^{\mathrm{obs}}) > p_{\Delta^g}(\Delta^{\mathrm{obs}})$, is less useful when neither model $f$ nor $g$ is true.
For example, if comparing on a pairwise basis three models $A$, $B$ and $C$, of which model $A$ is true, one of the three comparisons will be undertaken on an incorrect assumption: that between model $B$ and model $C$. This, however, will not be a fatal problem if model $A$ is indeed true; the pairwise applications of Wagenmakers et al.’s model mimicry should reveal model $A$ is preferable to both models $B$ and $C$, rendering the comparison between models $B$ and $C$ unnecessary. If one is willing to accept the discomfort of undertaking particular pairwise analyses on flawed assumptions, the pairwise model selection procedure may be appropriate.
If one seeks to avoid this by comparing all models simultaneously, the similar approaches of Allcroft and Glasbey (2003) [Allcroft:2003aa] and Schultheis and Naidu (2014) [Schultheis:2014aa] may be preferred. Allcroft and Glasbey (2003) [Allcroft:2003aa], one year before Wagenmakers et al. published their methodology [Wagenmakers:2004aa], propose a technique similar to Wagenmakers et al.’s model mimicry, but with the capability to compare more than two models. The major point of difference between the Allcroft and Glasbey method and Wagenmakers et al.’s model mimicry is that while distributions of goodness-of-fit differences are simulated in model mimicry, the Allcroft and Glasbey method uses raw goodness-of-fit values to simulate multivariate distributions of goodness-of-fit vectors under each of the $K$ candidate models. The observed vector is then compared to the simulated distributions to determine which hypothesis is most likely. After simulation, model selection then becomes a classification problem in a $K$-dimensional space.
In contrast to Wagenmakers et al.’s model mimicry, the Allcroft and Glasbey method omits the nonparametric bootstrap at each simulation, and does not re-estimate the parameters of each model for each simulation, instead using only the estimated parameters from the observed data throughout the procedure. These omissions are reversed in the work of Schultheis and Naidu (2014) [Schultheis:2014aa], and their technique will be preferred here to reduce the procedure’s sensitivity to parameter estimates.
The favoured method, that of Schultheis and Naidu (2014), here called “multimodel mimicry”, is thus as follows [Schultheis:2014aa]:

Apply the nonparametric bootstrap to the observations in observed data $x$; in other words, take a sample of size $n$ from $x$, sampling with replacement. Denote this $x^*$.

Estimate parameters $\hat\theta_1, \ldots, \hat\theta_K$, for proposed distributions $M_1, \ldots, M_K$ respectively, from the observations in bootstrap $x^*$.

Simulate a dataset of size $n$ under each of $M_1, \ldots, M_K$, denoted $y_1, \ldots, y_K$ respectively.

Fit every model to every set of simulated data, and calculate goodness-of-fit (‘GOF’) measures for each of the models’ fit. In other words, for each simulated dataset $y_j$, estimate parameters for each of the distributions $M_1, \ldots, M_K$ from the observations in $y_j$, and calculate GOF measures $\mathrm{GOF}_1(y_j), \ldots, \mathrm{GOF}_K(y_j)$.

Repeat steps 1–4 for $i = 1, \ldots, N$, yielding $N$ observations from the distributions of
$$G_j = \left(\mathrm{GOF}_1(y_j), \ldots, \mathrm{GOF}_K(y_j)\right), \qquad j = 1, \ldots, K,$$
where $\mathrm{GOF}_k(y_j)$ is the goodness-of-fit to model $M_k$ of data produced according to model $M_j$.

Meanwhile, fit all models to the observed data $x$, yielding parameter estimates for each. Calculate the goodness-of-fit of each model, giving the observed vector $G_{\mathrm{obs}} = \left(\mathrm{GOF}_1(x), \ldots, \mathrm{GOF}_K(x)\right)$.

Compare the observed value $G_{\mathrm{obs}}$ to the distributions of $G_1, \ldots, G_K$:

The hypothesis to be preferred is that under which the observed value $G_{\mathrm{obs}}$ is most likely.

In other words, select the model $M_j$ satisfying
$$j = \arg\max_{j'} p_{G_{j'}}(G_{\mathrm{obs}})$$
for density functions $p_{G_1}, \ldots, p_{G_K}$.

A diagram outlining the multiple model comparisons method can be found in Figure 3.
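These steps can be sketched in code. The Kolmogorov–Smirnov statistic as the goodness-of-fit measure, and the candidate set of exponential, lognormal and gamma distributions (gamma standing in for the chi-squared model of the later example, for fitting stability), are this author's illustrative assumptions:

```python
import numpy as np
from scipy import stats

def ks_gof(dist, y):
    """Goodness of fit: Kolmogorov-Smirnov distance of y to fitted `dist`
    (smaller is better)."""
    return stats.kstest(y, dist.cdf, args=dist.fit(y)).statistic

def multimodel_mimicry(x, dists, gof, n_rep=50, rng=None):
    """For each replicate: bootstrap the data, refit every candidate,
    simulate from each, and score every model on every simulated dataset.
    Returns one (n_rep, K) array of GOF vectors per generating model,
    plus the observed GOF vector."""
    if rng is None:
        rng = np.random.default_rng()
    K = len(dists)
    sims = [[] for _ in range(K)]
    for _ in range(n_rep):
        xb = rng.choice(x, size=x.size, replace=True)    # nonparametric bootstrap
        params = [d.fit(xb) for d in dists]
        for j, (d, p) in enumerate(zip(dists, params)):
            yj = d.rvs(*p, size=x.size, random_state=rng)
            sims[j].append([gof(d2, yj) for d2 in dists])
    g_obs = np.array([gof(d, x) for d in dists])
    return [np.array(s) for s in sims], g_obs

rng = np.random.default_rng(6)
x = rng.exponential(scale=2.0, size=100)
sims, g_obs = multimodel_mimicry(x, [stats.expon, stats.lognorm, stats.gamma],
                                 ks_gof, n_rep=20, rng=rng)
```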
3.3 Classifying results from multimodel mimicry
In Step 7 of the method adapted here from Schultheis and Naidu [Schultheis:2014aa], a model $M_j$ is chosen which maximises the density $p_{G_j}(G_{\mathrm{obs}})$.
Unfortunately, the distributions of $G_1, \ldots, G_K$ are only known through the simulated observations of these distributions. The task of choosing the model that maximises the density of $G_{\mathrm{obs}}$ under that model is, in other words, a supervised classification task, assigning a new point to one of $K$ sets of observed points. Three popular methods for supervised classification are

inspection,

model-based classifiers, and

nonparametric classifiers.
The first two such methods are discussed in the paragraphs below. For further discussion of nonparametric classifiers, including nearest neighbour methods, see Hastie, Tibshirani and Friedman [Hastie:2009aa]; these methods are not used here as the decision boundaries in these scenarios are rarely so irregular as to necessitate such an approach.
Inspection
With just a single data point to classify, it seems natural at first to determine the nearest distribution by inspection. In this context, this would amount to looking at a plot containing $G_{\mathrm{obs}}$ and all observations of $G_1, \ldots, G_K$, and determining which set of observations appears closest to $G_{\mathrm{obs}}$. For simple analyses, there is nothing inherently wrong with this approach; indeed, in their paper introducing model mimicry, Wagenmakers et al. use inspection to classify models for their Example 1 [Wagenmakers:2004aa]. However, there are two key flaws with this methodology: inspection becomes more difficult in a higher-dimensional space, and inspection may become inefficient for multiple data sets.
In the examples of Wagenmakers et al. [Wagenmakers:2004aa], two models are differentiated using model mimicry; that is, the version able to deal with a binary model selection. Wagenmakers et al. were thus able to make model selections by inspecting histograms. For multimodel mimicry, multidimensional distributions of points are considered, making visualisation of $G_{\mathrm{obs}}$ within the goodness-of-fit space difficult. Pairwise scatterplots (by dimension) are possible, though information is lost in showing just marginal goodness-of-fit distributions. An alternative is a two-dimensional principal components plot, which is able to represent a much larger proportion of the variation in the data than a two-dimensional marginal plot. This is one visualisation used by Schultheis and Naidu [Schultheis:2014aa], though the principal components plot is also unlikely to fully convey the higher-dimensional system it represents. This manuscript will thus propose the use of discriminant-based classifiers for multimodel mimicry output.
Another issue with model selection by inspection is that it becomes inefficient when a larger number of data sets are considered. For example, suppose a model is sought to describe multiple potential realisations of data sets. This situation is common in psychological modelling (see, e.g., Wagenmakers et al. [Wagenmakers:2004aa]). In this instance, it may be necessary to classify many values of $G_{\mathrm{obs}}$, with separate sets of distributions of $G_1, \ldots, G_K$ for each data set. By inspection, this would require considering a large number of visualisations in order to draw conclusions. Numerical classification of $G_{\mathrm{obs}}$ should then be considered.
Discriminant-based classifiers
Numerically, we seek the probability that $G_{\mathrm{obs}}$ belongs to some model $M_j$. In other words, for each $j$, we seek $P(M_j \mid G_{\mathrm{obs}})$.
Using Bayes’ theorem, we can rearrange this to
$$P(M_j \mid G_{\mathrm{obs}}) = \frac{p(G_{\mathrm{obs}} \mid M_j)\,P(M_j)}{\sum_{k=1}^{K} p(G_{\mathrm{obs}} \mid M_k)\,P(M_k)}.$$
Since there are the same number (denoted earlier $N$) of observations for each model, and there is no assumed prior preference for any model, we can treat $P(M_j) = 1/K$ for all $j$:
$$P(M_j \mid G_{\mathrm{obs}}) = \frac{p(G_{\mathrm{obs}} \mid M_j)}{\sum_{k=1}^{K} p(G_{\mathrm{obs}} \mid M_k)}. \qquad (3.1)$$
In order to classify $G_{\mathrm{obs}}$, we thus need to estimate $p(G_{\mathrm{obs}} \mid M_j)$. We briefly discuss here estimation of the distributions of $G_1, \ldots, G_K$ in three manners of increasing complexity (all from Hastie, Tibshirani and Friedman (2009) [Hastie:2009aa]):

Linear discriminant analysis (LDA), which assumes homoscedastic normal distributions;

Quadratic discriminant analysis (QDA), which assumes heteroscedastic normal distributions; and

Mixture discriminant analysis (MDA), which assumes mixtures of heteroscedastic normal distributions.
It should be noted that Hastie, Tibshirani and Friedman restrict their version of MDA to the assumption of mixtures of homoscedastic normal distributions ([Hastie:2009aa] at page 440), while this is generalised here to mixtures of heteroscedastic normal distributions for additional flexibility.
Linear discriminant analysis (LDA), the simplest of the three methods, is so named because it produces linear decision boundaries; in other words, the boundary between the region whose points would be classified to one model, and the region that would be classified to another, is always linear [Hastie:2009aa]. The LDA model supposes that each distribution can be expressed as
$$G_j \sim \mathcal{N}(\mu_j, \Sigma)$$
for $j = 1, \ldots, K$. Note here that $\Sigma$ is not dependent on $j$, meaning the variance is assumed to be the same for all models. Calculating the density functions can be done by maximum likelihood; each mean $\mu_j$ is simply estimated to be the sample mean of the simulated observations of $G_j$, while the variance is estimated to be the weighted average of sample variance matrices for each $G_j$. Since in this case there are the same number of observations of each goodness-of-fit distribution,
$$\hat\Sigma = \frac{1}{K}\sum_{j=1}^{K} \hat\Sigma_j$$
for variance estimate $\hat\Sigma$, and sample variances estimated by
$$\hat\Sigma_j = \frac{1}{N-1}\sum_{i=1}^{N}\left(G_{j,i} - \bar G_j\right)\left(G_{j,i} - \bar G_j\right)^\top \qquad (3.2)$$
for the $i$th observation $G_{j,i}$ of $G_j$. Once these parameters have been estimated, Equation (3.1) can be used to select a model; the model that maximises the discriminant in (3.1) can be selected. The higher the value of the discriminant for the chosen model, the greater the confidence one can express in the selection of that model.
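A minimal numerical sketch of the LDA discriminant of Equation (3.1), with pooled covariance and a uniform prior, follows; the toy clusters stand in for simulated goodness-of-fit vectors and are invented for illustration:

```python
import numpy as np
from scipy.stats import multivariate_normal

def lda_posteriors(clusters, g_obs):
    """Posterior model probabilities under the LDA assumptions of
    Equation (3.1): Gaussian clusters with a shared (pooled) covariance
    and a uniform prior over models."""
    means = [c.mean(axis=0) for c in clusters]
    # Pooled covariance: average of the per-cluster sample covariances
    # (the clusters are equally sized, one per candidate model).
    pooled = np.mean([np.cov(c, rowvar=False) for c in clusters], axis=0)
    dens = np.array([multivariate_normal.pdf(g_obs, m, pooled) for m in means])
    return dens / dens.sum()

# Toy clusters of simulated GOF vectors (hypothetical values).
rng = np.random.default_rng(7)
clusters = [rng.normal(loc=m, scale=0.5, size=(200, 2))
            for m in ([0, 0], [2, 2], [4, 0])]
g_obs = np.array([1.9, 2.1])
post = lda_posteriors(clusters, g_obs)
print(post)   # highest posterior for the second cluster
```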
Quadratic discriminant analysis (QDA) relaxes the constant variance assumption of LDA. This means it is able to respond to differing covariance structures between goodness-of-fit distributions. The example later in this manuscript demonstrates that this may be useful in a multimodel mimicry context; in that instance, the variance of the $G_j$ distributions, as represented in Figure 4, was not the same for all candidate models.
The QDA model supposes that each distribution can be expressed as
$$G_j \sim \mathcal{N}(\mu_j, \Sigma_j)$$
for $j = 1, \ldots, K$. Note here that $\Sigma_j$ is dependent on $j$. Again, density functions are calculated by maximum likelihood, with means for each model estimated by the sample mean $\bar G_j$, and sample variances as in Equation (3.2) above.
Mixture discriminant analysis (MDA) presupposes that each $G_j$ can be approximated by a Gaussian mixture, which is a weighted mixture of normal densities. Thus MDA allows for more flexible decision boundaries than LDA or QDA, by accounting for the fact that the distributions of goodness-of-fit among candidate models may be multimodal. The MDA model supposes that each goodness-of-fit distribution can be expressed as
$$G_j \sim \sum_{c=1}^{C_j} w_{j,c}\,\mathcal{N}(\mu_{j,c}, \Sigma_{j,c})$$
for components $c = 1, \ldots, C_j$ of the $j$th mixture model, with weights $w_{j,c}$ summing to one. Note here that the means, variances and even number of components may vary within and among candidate distributions.
The allowance for covariance matrices that are not identical across all components and all distributions is an extension of the discriminant suggested by Hastie, Tibshirani and Friedman ([Hastie:2009aa] at page 440). The additional flexibility provided by this extension is necessary in situations such as that in the example in Figure 5, in which the distribution of goodness-of-fit under the lognormal candidate model, unlike the distribution for other candidate models, has a highly irregular covariance structure. Only a more flexible Gaussian mixture is able to capture this irregular covariance structure, which requires multiple components with different covariance matrices.
The parameters of the mixture model, within each group, can be estimated using an Expectation Maximisation (EM) algorithm for Gaussian mixtures [Xu:1996aa].
Using LDA, QDA or MDA, models can be selected by maximising the discriminant in Equation (3.1). If the LDA, QDA or MDA assumptions hold, the discriminant in Equation (3.1) represents an estimate of the probability that each model is true, given $G_{\mathrm{obs}}$, using an uninformative prior [Hastie:2009aa]. This provides for a numerical representation of not just which model should be selected, but also the uncertainty in this choice of model.
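The per-cluster mixture fitting behind MDA can be sketched with a minimal EM algorithm for Gaussian mixtures; this is a simplified, assumed implementation (the ridge term and the random initialisation at data points are this author's choices), not the algorithm of [Xu:1996aa] verbatim:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, rng=None):
    """Minimal EM for a k-component Gaussian mixture with full, per-component
    covariances, as used per cluster by the MDA classifier described above."""
    if rng is None:
        rng = np.random.default_rng()
    n, d = X.shape
    w = np.full(k, 1.0 / k)
    mu = X[rng.choice(n, k, replace=False)]            # init means at data points
    cov = np.array([np.cov(X, rowvar=False)] * k)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each point
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mu[j], cov[j])
                                for j in range(k)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means and covariances
        nk = r.sum(axis=0)
        w = nk / n
        mu = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (r[:, j, None] * diff).T @ diff / nk[j] + 1e-6 * np.eye(d)
    return w, mu, cov

def gmm_density(x, w, mu, cov):
    """Mixture density at point x."""
    return sum(wj * multivariate_normal.pdf(x, mj, cj)
               for wj, mj, cj in zip(w, mu, cov))

rng = np.random.default_rng(8)
X = np.vstack([rng.normal([0, 0], 0.5, size=(150, 2)),
               rng.normal([3, 3], 0.8, size=(150, 2))])
w, mu, cov = em_gmm(X, k=2, n_iter=50, rng=rng)
density_at_origin = gmm_density(np.array([0.0, 0.0]), w, mu, cov)
```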
3.4 An example of multimodel mimicry
Using the expansion of the procedure of Schultheis and Naidu (2014) [Schultheis:2014aa], this section contains an example concerning data simulated from an exponential distribution. Hypotheses that the data come from an exponential distribution, a lognormal distribution and a chi-squared distribution are compared using multimodel mimicry, using Székely and Rizzo’s energy statistic [Szekely:2005aa] as a measure of goodness-of-fit to each distribution, as in this manuscript’s earlier example with regard to pairwise model mimicry. Since the three candidate distributions are similar, they should be difficult to distinguish, making this example demonstrative of the power of multimodel mimicry.
First, 100 variates from an exponential distribution were simulated. Multimodel mimicry was then applied to this data, comparing the ability of an exponential distribution, a lognormal distribution and a chi-squared distribution to describe the data, with 500 replicates. This yielded 500 observations from each of the distributions of $G_1$, $G_2$ and $G_3$, the vectors of energy goodness-of-fit of all models to the data simulated under the exponential, lognormal and chi-squared models respectively. Each model was also fit to the observed data, yielding the observed goodness-of-fit of each model to the data, $G_{\mathrm{obs}}$.
For a visual representation of this, pairwise plots of the logarithms of two components of each goodness-of-fit vector are given in Figure 4, alongside the marginal distributions of the logarithms of each component. Logarithms are taken to make the plots easier to visualise, and pairwise plots are used because the three-dimensional clouds of goodness-of-fit variates are difficult to represent directly. The variates simulated under each of the three candidate models are represented by a cluster of points in the pairwise scatterplots, and by a distribution function on the diagonals of the figure, representing the marginal distributions under each model. The observed goodness-of-fit vector is represented by a black point or black line. The observed value appears to fit most closely to the exponential variates in all plots, indicating that the observed goodness-of-fit is more likely if the true model is the exponential distribution than if the true model is lognormal or chi-squared. Thus, the exponential model should be selected. However, it is not sufficient to show that the components of the observed vector are, marginally, closest to the components of the exponential variates: these components follow only the marginal distributions of the goodness-of-fit, and marginal distributions do not completely characterise a joint distribution. To confirm that the exponential model should be selected, linear, quadratic and mixture discriminant analyses were undertaken to classify the observed point. All discriminant analyses classified the observed goodness-of-fit into the exponential cluster, with probabilities 0.665, 0.998 and 0.997 for linear, quadratic and mixture discriminant analyses respectively. Further visual representation can be provided by taking principal components of the collection of goodness-of-fit variates simulated under each candidate model, and projecting the observed value onto this space, providing a two-dimensional projection of the goodness-of-fit space. A plot of this can be found in Figure 5, and it reinforces the selection of the exponential model. In this instance, the first two principal components capture 84.3% of the variance in the distributions, meaning Figure 5 is sufficiently characteristic of the data to be of use in model selection.
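The classification step can be sketched as follows, assuming scikit-learn is available. Mixture discriminant analysis is not implemented in scikit-learn, so only the linear and quadratic analyses are illustrated; the clouds of goodness-of-fit vectors and the observed vector below are synthetic stand-ins, not the values behind Figures 4 and 5.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

# Stand-in clouds of three-dimensional goodness-of-fit vectors, one cloud per
# candidate model; in practice these come from the cross-fitting simulation.
centres = {"exponential": (0.1, 0.4, 0.5),
           "lognormal":   (0.5, 0.1, 0.6),
           "chisquared":  (0.6, 0.5, 0.1)}
X = np.vstack([rng.normal(c, 0.05, (500, 3)) for c in centres.values()])
y = np.repeat(list(centres), 500)

g_obs = np.array([0.12, 0.41, 0.52])   # hypothetical observed goodness-of-fit vector

results = {}
for clf in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    clf.fit(X, y)
    # classify the observed vector and record the posterior class probability
    results[type(clf).__name__] = (clf.predict([g_obs])[0],
                                   clf.predict_proba([g_obs]).max())
```

The posterior probabilities from `predict_proba` provide the quantified uncertainty in model choice referred to above.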
In summary, multimodel mimicry allows for the comparison of multiple models simultaneously, using any relevant goodness-of-fit measure. The approach is effective at distinguishing between models even at small sample sizes; here, only 100 observed data points were used. While the final act of model classification is difficult to visualise directly, pairwise scatterplots and principal components plots can aid this. Classification can then be undertaken by inspection or by discriminant analysis, with discriminant analysis providing a clearer understanding of the level of uncertainty in the model choice.
4 Discussion of the multimodel mimicry framework
The multimodel mimicry technique, introduced by Schultheis and Naidu and given broader statistical underpinnings in this manuscript, is useful for comparing any number of candidate models, provided each candidate model satisfies the following requirements:

1. Model parameters should be estimable given training data.
2. There must exist some statistic by which the data’s goodness-of-fit to the model can be measured.
3. It must be possible to simulate data under the model.
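The three requirements above can be expressed as a minimal model interface. The sketch below, with hypothetical names, shows an exponential model satisfying all three, with a Kolmogorov–Smirnov statistic standing in for the goodness-of-fit measure.

```python
from typing import Protocol
import numpy as np
from scipy import stats

class CandidateModel(Protocol):
    """The three capabilities MMM requires of every candidate model."""
    def fit(self, data: np.ndarray) -> None: ...          # 1. parameters estimable
    def gof(self, data: np.ndarray) -> float: ...         # 2. a goodness-of-fit statistic
    def simulate(self, n: int, rng) -> np.ndarray: ...    # 3. data can be simulated

class Exponential:
    def fit(self, data):
        self.scale = float(data.mean())                   # maximum likelihood estimate
    def gof(self, data):
        # any distributional statistic works; a Kolmogorov-Smirnov distance here
        return stats.kstest(data, "expon", args=(0, self.scale)).statistic
    def simulate(self, n, rng):
        return rng.exponential(self.scale, n)
```

Any object exposing these three methods, whatever its internals, can be slotted into the MMM procedure.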
This makes MMM more flexible than a number of alternative techniques for model comparison. For example, relative to Wilks’ likelihood-ratio test, MMM does not require that the models be nested. Relative to direct comparisons of AIC or BIC, MMM does not require that likelihood can be computed, nor that the models and data are of a form that allows AIC and BIC values to be directly compared. Relative to Approximate Bayesian Computation, MMM does not require the estimation of prior probabilities for models and their parameters, and can be undertaken in a frequentist setting.
That MMM is more flexible than some alternatives does not, however, make it universally applicable; the three requirements above are non-trivial. These requirements will now be discussed in turn, followed by a fourth limitation: the MMM’s preference for complexity.
Parameter estimation
The MMM technique requires that each candidate model can be fitted to the observed data. The technique does not provide alternative methods for such model fitting, so if a model is poorly specified or unidentifiable, the MMM procedure does not resolve this issue. Computational approaches such as Markov chain Monte Carlo or Approximate Bayesian Computation may resolve these issues in some cases.
Goodness-of-fit statistics
The MMM technique requires a goodness-of-fit statistic to be defined, and calculable, for each model. It is assumed in this manuscript that the same goodness-of-fit measure is used for all candidate models, but the MMM procedure may still be useful where different goodness-of-fit measures are available for different models. An earlier example in this manuscript is the difficulty of calculating AIC or BIC for a model incorporating kernel density estimates, since the number of parameters is difficult to determine. Other distributional goodness-of-fit measures can determine how well a kernel density estimate fits the data. It remains uncertain whether, within the MMM framework, such distributional goodness-of-fit measures for a kernel density estimate could be compared with AIC or BIC values for other candidate models. There is no computational reason that this comparison could not be made, but its statistical consequences are a potential course of further research.
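As a concrete sketch of the kernel density example, assuming SciPy and illustrative Exp(1) data: a distributional statistic such as the energy distance can still be computed against a sample drawn from the fitted kernel density estimate, even though the estimate has no usable parameter count for AIC or BIC.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
data = rng.exponential(1.0, 200)      # illustrative data; Exp(1) assumed

# A kernel density estimate has no well-defined number of parameters, so AIC
# and BIC are awkward, but a distributional goodness-of-fit still applies:
kde = stats.gaussian_kde(data)
sim = kde.resample(2000, seed=3).ravel()   # large sample from the fitted KDE
gof = stats.energy_distance(data, sim)     # how well the KDE mimics the data
```

The resulting `gof` value plays the same role within MMM as any likelihood-based criterion would for a parametric candidate model.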
Simulation
The MMM approach involves the simulation of data under all candidate models, using the fitted parameters of those models. This naturally raises questions about the applicability of the approach to nonparametric models, or to models whose confines, and whose ability to simulate new data, are harder to define. In many machine learning contexts, such as neural networks, goodness-of-fit to a model can be calculated in the form of a loss function, but the simulation of new data under the model is an emerging area of research. For example, a deep learning system for image processing may be able to classify images into certain categories, and develop a model for doing so, but may be unable to produce new images according to these classifications. Where simulation is not possible, the MMM technique is unavailable. A course of further research may involve the use of nonparametric bootstraps under each model, rather than the parametric bootstraps of the MMM’s cross-fitting method, to compare model performance.
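The distinction between the two bootstraps can be sketched as follows, assuming an illustrative normal model; this is a sketch of the proposed research direction, not an implementation of it.

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(0.0, 1.0, 100)     # illustrative observed data

# Parametric bootstrap, as in the MMM cross-fitting step: replicate data are
# simulated from the fitted model (here a normal with estimated mean and sd).
parametric_rep = rng.normal(data.mean(), data.std(), data.size)

# Nonparametric bootstrap: replicate data are resampled from the observed data
# with replacement, requiring no ability to simulate from the model itself.
nonparametric_rep = rng.choice(data, size=data.size, replace=True)
```

The nonparametric replicate makes no use of the model’s generative mechanism, which is precisely why it may help where simulation under a model is impossible.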
Complexity
When introducing the model mimicry technique for the comparison of two models, Wagenmakers et al. [Wagenmakers:2004aa] note that the technique has a preference for more complex models, since a more complex model may be able to produce data that mimics data produced by a simpler model. This issue recurs in the extension to multiple models introduced by Schultheis and Naidu [Schultheis:2014aa] and developed here. Further research may illuminate methods for penalising complex models in the MMM context. As a general rule for the application of the MMM, this issue can be addressed by preferring a more parsimonious model over a more complex one wherever MMM does not express a strong preference for the more complex model.
5 Conclusion
This manuscript presents the history behind, and statistical context for, the multimodel mimicry technique. MMM provides a framework for the simultaneous comparison of multiple statistical models, on the basis of generalised goodness-of-fit measures and in cases where candidate models differ greatly in model or data structure. In doing so, the technique is not limited to scenarios in which likelihood is calculable for the candidate models, in which likelihood-based goodness-of-fit measures can be compared between models, or indeed in which likelihood is the quantity to be compared at all. For example, the MMM technique allows models to be compared on the basis of goodness-of-fit to model-building assumptions, to desired features such as the frequency or quantification of extreme data points, or to some distribution.
This manuscript also demonstrates the effectiveness of the technique, even at relatively small sample sizes and for otherwise difficult-to-distinguish models. For example, for 100 data points simulated from an exponential distribution, MMM correctly identified that the data came from an exponential distribution, rather than from a lognormal or a chi-squared distribution.
Finally, we discuss potential areas for further research on the MMM. How the MMM framework interacts with situations in which model parameters are difficult to determine, in which different goodness-of-fit measures are used for different models, or in which simulation of new data under models is difficult, all form courses for future exploration. A more sophisticated approach for dealing with the MMM’s preference towards more complex models would also be a welcome addition to the literature in this area.