Multi-model mimicry for model selection according to generalised goodness-of-fit criteria

11/21/2019
by   Lachlann McArthur, et al.

Selecting between candidate models is at the core of statistical practice. As statistical modelling techniques are rapidly evolving, the need for similar evolution in the ways to select between candidate models is increasing. With Akaike's Information Criterion (AIC) and Bayesian Information Criterion (BIC) not applicable for all sets of candidate models, and likelihood not the only criterion upon which one might seek to compare models, a flexible model selection technique called model mimicry has emerged. Using simulation to compare models based upon generalised goodness-of-fit measures, model mimicry is available for any pair of candidate models, so long as (1) the models are able to be fit to observed data, (2) new sets of data can be simulated under the models, and (3) a metric exists by which a dataset's goodness-of-fit to the model can be calculated. In this manuscript, a variation of model mimicry for the simultaneous comparison of multiple models (multi-model mimicry) is outlined, placed in its historical context, and its effectiveness is demonstrated for simple examples. Classification techniques for model selection using output from multi-model mimicry are also explored.

1 Likelihood-based goodness-of-fit comparisons and their limitations

Traditional goodness-of-fit measures are based around likelihood which, for observed data $y$ and a model $M$ with parameter set $\theta$, is defined as

$L(\theta \mid y) = P(y \mid \theta, M),$

where $P(y \mid \theta, M)$ is the probability distribution for data $y$ under the model $M$. Since likelihood can in most instances be increased by arbitrarily increasing model complexity, more sophisticated goodness-of-fit measures penalise complexity, measured as the number of parameters estimated in the model. For example, Akaike's Information Criterion (AIC) is defined [Akaike:1974aa] as

$\mathrm{AIC} = 2k - 2\ln L(\hat\theta \mid y),$

while the Bayesian Information Criterion (BIC) is given [Schwarz:1978aa] by

$\mathrm{BIC} = k \ln(n) - 2\ln L(\hat\theta \mid y),$

where $L(\hat\theta \mid y)$ is the likelihood of the model under maximum likelihood parameters $\hat\theta$, $k$ is the number of estimated parameters, and $n$ is the number of data points in the observed data set $y$. The model with the smallest AIC or BIC is selected. Naturally, BIC penalises complexity more than AIC (unless $\ln(n) < 2$, which should never occur in practice as this implies that there are 7 or fewer observed data points).
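To make these definitions concrete, the short Python sketch below computes AIC and BIC for a normal model fitted by maximum likelihood; the simulated data and helper names are illustrative assumptions rather than anything from the manuscript.

```python
import numpy as np
from scipy import stats

def aic(log_lik, k):
    """Akaike's Information Criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln L."""
    return k * np.log(n) - 2 * log_lik

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.5, size=50)          # illustrative data

# Maximum likelihood fit of a normal model: sample mean and (biased) SD.
mu_hat, sd_hat = y.mean(), y.std()
log_lik = stats.norm.logpdf(y, loc=mu_hat, scale=sd_hat).sum()

k, n = 2, len(y)                                      # two estimated parameters
print(aic(log_lik, k), bic(log_lik, k, n))            # smaller values are preferred
```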

AIC- or BIC-based model selection, however, is limited in the situations in which it can be applied. Situations in which AIC or BIC are inappropriate include:

  • where likelihood is intractable, in theory or practice;

  • where the structure of the set of models, or data upon which the models are compared, makes AIC or BIC inappropriate; and

  • where alternative model selection bases are deemed more appropriate, for philosophical and/or practical reasons.

Each of these situations will now be discussed in turn.

1.1 Intractable likelihood

Likelihood may be impossible to calculate for one or more of the candidate models. Types of data and models for which this might occur are well documented in the literature; examples include networks [Caimo:2015aa, Ratmann:2007aa, Ratmann:2009aa], complicated time series [Breto:2009aa, Jasra:2014aa], and hidden Markov models [Yildirim:2015aa]. In differentiating between these types of models, Approximate Bayesian Computation (ABC) has recently gained popularity, but this technique is sensitive to prior distributions for both the choice of model and for each model's parameters, as well as to choices of summary statistics [Robert:2011aa]. This manuscript presents an alternative manner of differentiating between models, without the selection of prior probabilities in the model space.

1.2 Model or data structure

Some model structures are of sufficiently different form to be incomparable using likelihood-based methods like AIC or BIC. For example, suppose some univariate data of sample size $n$ is to be fit either using a normal distribution, or using kernel density estimates. AIC or BIC require calculation of likelihood, and of the number of parameters. The first model, the normal distribution, has two parameters: the mean and the standard deviation. For the kernel density model, the 'number of parameters' is somewhat nebulous, as all observed data points are considered for all estimates, but have differing influence.

The structure of data might also lead to models being incomparable using AIC or BIC. For example, suppose two time series models are to be considered: an ARIMA model, and an ARIMA model applied to differenced data. There is one more data point in the ARIMA model than in the differenced ARIMA model. As such, the likelihood function of the ARIMA model will contain one more term than that of the differenced ARIMA model, and so the two likelihoods will generally be of different orders of magnitude [Harvey:1980aa]. This makes direct comparison of likelihoods, and of likelihood-based goodness-of-fit measures, unavailable. The MMM procedure in this manuscript overcomes this issue using a simulation-based approach.

1.3 Alternative model selection bases

Finally, likelihood is not always the preferred criterion according to which we wish to select a model. One example of this is in choosing an appropriate distribution to fit to some given data. Since likelihood measures the probability density of the observed data, treated usually as independent observations, given a model, it does not take into consideration whether the data fits the shape of the proposed distribution; that is, likelihood is not designed to differentiate whether a candidate distribution is appropriate given the skew, kurtosis or other moments of the data.

In this instance, it might be desirable to compare candidate distributions on the basis of a distributional goodness-of-fit measure, such as the Kolmogorov-Smirnov statistic [Massey:1951aa], Székely and Rizzo's energy statistic [Szekely:2005aa], or some more rudimentary summary statistic like the number of extreme values in a distribution. Since these statistics' raw values cannot be directly compared between candidate models, a more rigorous framework to compare these values must be considered. The MMM framework in this manuscript is able to address this.

A similar motivation may be to compare models on the basis of how reasonable the models’ assumptions are. Goodness-of-fit to model assumptions is not a simple binary question; some valid models are more well-founded than others in terms of their assumptions. For example, consider a simple linear regression model. One assumption is normality of the error term, which is usually assessed using a quantile-quantile plot of the residuals. The assumption of normality might be considered reasonable for a number of different quantile-quantile plots, but one might more readily accept some plots than others. This is one demonstration of the fact that the validity of an assumption is on a continuum. Where a model fits on this continuum for some distributional assumption may distinguish some candidate models from others.

Since fidelity to assumptions justifies the generalisability of a model, a model might be favoured if it satisfies its assumptions better than other candidates. However, it is not immediately obvious how one might compare the fidelity of two or more models to their respective distributional assumptions. For example, if two models assume different error distributions, comparing these assumptions would involve comparing the goodness-of-fit of one set of errors to one distribution with the goodness-of-fit of another set of errors to a different distribution. Again, test statistics for distributional fit, like the Kolmogorov-Smirnov or energy statistics [Szekely:2005aa], are not measured on the same scale for all candidate distributions, and may be more sensitive to some kinds of lack-of-fit than to others. These raw goodness-of-fit statistics thus cannot be directly compared between models. The problem of comparing distributional goodness-of-fit among such models is explored in detail in this manuscript, with the MMM framework able to provide for such comparisons.

2 Genesis of general goodness-of-fit comparisons

2.1 Likelihood-ratio test for nested hypotheses

The likelihood-ratio statistic [Wilks:1938aa] is a well-known basis for hypothesis tests comparing two nested models. When models are nested, they come from the same parameterised family, so the hypothesis test consists of choosing between two sets of parameters for this family, often denoted $\Theta_0$ and $\Theta_1$, with $\Theta_0 \subset \Theta_1$. The likelihood-ratio statistic is

$\lambda = \dfrac{\sup_{\theta \in \Theta_0} L(\theta \mid y)}{\sup_{\theta \in \Theta_1} L(\theta \mid y)}$

for observed data $y$ and common likelihood function $L$. Significance levels for this statistic are easily determined, since for nested models,

$-2\ln\lambda \xrightarrow{d} \chi^2_{d_1 - d_0}$

as sample size $n \to \infty$, with $d_1$ the dimensionality of $\Theta_1$ and $d_0$ the dimensionality of $\Theta_0$ [Wilks:1938aa].

This test, however, cannot be undertaken for non-nested hypotheses, since in this case the likelihoods in the expression for $\lambda$ would be from different families of distribution, meaning the distribution of $-2\ln\lambda$ does not necessarily follow Wilks' asymptotic expression.

2.2 Extension to non-nested hypotheses

In response to this shortcoming, Cox (1961) [Cox:1961aa] extends the likelihood-ratio test to some cases with non-nested hypotheses. Suppose we have some realisations $y_1, \ldots, y_n$ of random variable $Y$, and seek to compare two hypotheses:

$H_f: Y \sim f(y; \alpha) \quad \text{versus} \quad H_g: Y \sim g(y; \beta),$

where $f$ and $g$ are non-nested distributions, and $H_f$ is the null hypothesis. Cox's test statistic compares the logarithm of the ratio of likelihoods under each hypothesis to the expected value of the log-ratio under the null hypothesis. Cox's statistic can be written as

$T_f = \ln\dfrac{L_f(\hat\alpha \mid y)}{L_g(\hat\beta \mid y)} - \mathrm{E}_{\hat\alpha}\!\left[\ln\dfrac{L_f(\hat\alpha \mid Y)}{L_g(\hat\beta \mid Y)}\right],$

where $L_f$ and $L_g$ are likelihood functions of distributions $f$ and $g$ respectively, and $\mathrm{E}_{\hat\alpha}$ denotes an expectation with respect to distribution $f$ with parameters $\hat\alpha$.

Cox demonstrates that $T_f$ is asymptotically normally distributed, with mean zero. He notes that the variance of $T_f$, and the expectation term in the expression for $T_f$, are difficult to calculate, depending on the distributions $f$ and $g$ [Cox:1961aa]. Derivations of these values for a limited number of pairs of non-nested hypotheses have been published (see, e.g., [Cox:1962aa, Jackson:1968aa, Walker:1967aa]).

2.3 A Monte Carlo approach to test statistic distributions

Due to the intractability of the asymptotic distribution of Cox's statistic for some pairs of non-nested hypotheses, and the fact that $T_f$ often converges to its asymptotic distribution slowly [Williams:1970ab], Williams introduces a simulation approach to determining a distribution of the test statistic. Using an equivalent variation on the test statistic, here denoted $T$, Williams proposes simulating distributions for $T$ under both the null and alternative hypotheses, and then drawing a conclusion as to which hypothesis is to be favoured [Williams:1970aa, Williams:1970ab]. More explicitly, his approach consists of the following steps:

  • Estimate parameters $\hat\alpha$ and $\hat\beta$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in observed data $y$, and calculate an observed value of $T$.

  • Simulate $N$ datasets of size $n$ under $f(y; \hat\alpha)$, denoted $y^f_1, \ldots, y^f_N$, and $N$ datasets of size $n$ under $g(y; \hat\beta)$, denoted $y^g_1, \ldots, y^g_N$.

  • For each simulation $i = 1, \ldots, N$, calculate $T(y^f_i)$ and $T(y^g_i)$.

  • Compare the observed value of $T$ to the distributions of the $T(y^f_i)$s and the $T(y^g_i)$s:

    The null hypothesis, that $Y \sim f$, is to be preferred if the observed value of $T$ is more likely under the distribution of the $T(y^f_i)$ than of the $T(y^g_i)$. Otherwise, the alternative hypothesis, that $Y \sim g$, is to be preferred.

If the observed value of $T$ is unlikely under both distributions, it may be that neither model is preferable, and if the observed value of $T$ is likely under both models, there is insufficient evidence to prefer one model over another [Williams:1970aa].

Williams proposes only a small value of $N$, owing to the computing limitations of his time, but larger values can now be chosen to give greater confidence in model choice [Williams:1970ab].

Williams notes that a limitation of his approach is that the simulated distribution of the test statistic is strongly dependent upon the values of the estimated parameters $\hat\alpha$ and $\hat\beta$, and suggests further simulation may alleviate this [Williams:1970ab]. He does not suggest a specific method for doing so. The model mimicry method accounts for this parameter uncertainty using a bootstrap.

3 The model mimicry method

3.1 The model mimicry method for two models

As noted in Section 2.3, the approach of Williams in non-nested model selection is hindered by the fact that it does not account for parameter uncertainty [Williams:1970ab, Wagenmakers:2004aa].

An additional limitation of his approach is that it is built around a likelihood-ratio goodness-of-fit measure [Wagenmakers:2004aa]. While this is suitable for a number of model comparison situations, one might instead wish to compare popular measures like the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) across non-nested models.

Other goodness-of-fit measures may also be preferred in specific instances. For example, we might seek a model whose distributional assumptions are best justified. A goodness-of-fit measure for multivariate distributions is thus more appropriate here than comparing likelihoods.

Accounting for these drawbacks of the approach of Williams (1970), Wagenmakers et al. (2004) [Wagenmakers:2004aa] present a method for testing hypotheses of non-nested models which both accounts for uncertainty in parameter estimation, and is suited to general goodness-of-fit measures. The method consists of re-framing Williams’ approach in terms of a generic goodness-of-fit measure, and adding the additional step of a non-parametric bootstrap prior to each simulation. The non-parametric bootstrap precludes the distribution of goodness-of-fit measures from relying heavily on a particular parameter estimate; instead, for a stable model, a variety of parameters will be used, drawn from the region of the parameter space inhabited by those estimated using the observed data.

Wagenmakers et al. call this approach “model mimicry” [Wagenmakers:2004aa], because the method leads to the selection of models that best replicate the observed data. The specific method they propose is labelled the “parametric bootstrap cross-fitting method” (‘PBCM’), since the act of simulating data under each model is a parametric bootstrap, and each model is fit to both models’ simulations. In this manuscript, this is referred to as “model mimicry”, so as to provide a clearer distinction from the method later introduced for comparing multiple models (“multi-model mimicry”).

The model mimicry approach of Wagenmakers et al. is as follows [Wagenmakers:2004aa]:

  1. Apply the non-parametric bootstrap to the observations in observed data $y$; in other words, take a sample of size $n$ from $y$, sampling with replacement. Denote this $y^*$.

  2. Estimate parameters $\hat\alpha^*$ and $\hat\beta^*$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in bootstrap $y^*$.

  3. Simulate a dataset of size $n$ under $f(y; \hat\alpha^*)$, denoted $y^f$, and a dataset of size $n$ under $g(y; \hat\beta^*)$, denoted $y^g$.

  4. Fit both models to both sets of simulated data, and calculate goodness-of-fit (‘GOF’) measures for each of the models’ fits. In other words:

    • Estimate parameters $\hat\alpha^f$ and $\hat\beta^f$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in $y^f$, and calculate GOF measures $\mathrm{GOF}_f(y^f)$ and $\mathrm{GOF}_g(y^f)$; and

    • Estimate parameters $\hat\alpha^g$ and $\hat\beta^g$, for null distribution $f$ and alternative distribution $g$ respectively, from the observations in $y^g$, and calculate GOF measures $\mathrm{GOF}_f(y^g)$ and $\mathrm{GOF}_g(y^g)$.

  5. For the data generated from $f$, calculate the difference in the goodness-of-fit measures between the two models:

    $\Delta^f = \mathrm{GOF}_f(y^f) - \mathrm{GOF}_g(y^f).$

    Do the same for the data generated from $g$:

    $\Delta^g = \mathrm{GOF}_f(y^g) - \mathrm{GOF}_g(y^g).$

  6. Repeat steps 1-5 $N$ times, yielding $N$ observations from the distribution of $\Delta^f$ and of $\Delta^g$.

  7. Meanwhile, fit both models to the observed data $y$, yielding $\hat\alpha$ and $\hat\beta$. Calculate the goodness-of-fit of each model, $\mathrm{GOF}_f(y)$ and $\mathrm{GOF}_g(y)$, and the difference between these:

    $\Delta_{\mathrm{obs}} = \mathrm{GOF}_f(y) - \mathrm{GOF}_g(y).$

  8. Compare the observed value $\Delta_{\mathrm{obs}}$ to the distributions of $\Delta^f$ and $\Delta^g$:

    • The null hypothesis, that $Y \sim f$, is to be preferred if the observed value $\Delta_{\mathrm{obs}}$ is more likely under the distribution of $\Delta^f$ than of $\Delta^g$; otherwise, the alternative hypothesis, that $Y \sim g$, is to be preferred.

    • In other words, select model $f$ if

      $\hat{p}_{\Delta^f}(\Delta_{\mathrm{obs}}) > \hat{p}_{\Delta^g}(\Delta_{\mathrm{obs}}),$

      for estimated density functions $\hat{p}_{\Delta^f}$ and $\hat{p}_{\Delta^g}$.

A diagram outlining the model mimicry method can be found in Figure 1.

Figure 1: The model mimicry method outlined by Wagenmakers et al. (2004). Simulated distributions of differences in goodness-of-fit under competing models are compared to the difference between the models’ goodness-of-fit for the observed data.
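As an illustration of how steps 1-8 fit together, the following Python sketch (an assumed implementation, not the authors' code) takes two scipy.stats distribution families and a user-supplied goodness-of-fit function `gof(data, model)`, which refits `model` to `data` and returns a fit statistic; a concrete energy-statistic choice for `gof` is sketched in the next example.

```python
import numpy as np

def model_mimicry(y, model_f, model_g, gof, n_reps=500, rng=None):
    """Model mimicry (parametric bootstrap cross-fitting) for two candidate models.
    Returns simulated GOF differences under each model and the observed difference."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    delta_f, delta_g = [], []
    for _ in range(n_reps):
        # Step 1: non-parametric bootstrap of the observed data.
        y_star = rng.choice(y, size=n, replace=True)
        # Step 2: fit both models to the bootstrap sample.
        pars_f, pars_g = model_f.fit(y_star), model_g.fit(y_star)
        # Step 3: parametric bootstrap -- simulate one dataset under each fitted model.
        y_f = model_f.rvs(*pars_f, size=n, random_state=rng)
        y_g = model_g.rvs(*pars_g, size=n, random_state=rng)
        # Steps 4-5: refit both models to both simulated datasets; record GOF differences.
        delta_f.append(gof(y_f, model_f) - gof(y_f, model_g))
        delta_g.append(gof(y_g, model_f) - gof(y_g, model_g))
    # Step 7: difference in goodness-of-fit for the observed data.
    delta_obs = gof(y, model_f) - gof(y, model_g)
    return np.array(delta_f), np.array(delta_g), delta_obs
```

Step 8 then amounts to comparing `delta_obs` with the two simulated samples, for instance via kernel density estimates (e.g. scipy.stats.gaussian_kde), preferring the model under whose simulated distribution the observed difference has the higher estimated density.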

Data from a Cauchy distribution:

To illustrate this process, a comparison was made between the fit of the normal and Cauchy distributions to data simulated according to a Cauchy distribution. From this Cauchy distribution, 100 variates were simulated. Model mimicry was applied, with 500 replicates, comparing the normal distribution to the Cauchy distribution. The goodness-of-fit statistic chosen for comparison of fit to the distributions was Székely and Rizzo’s energy statistic [Szekely:2005aa]; this demonstrates that model mimicry can be applied in a much broader range of situations than likelihood-based approaches.

The energy statistic relies upon the idea that the structure of Euclidean distances between independently-drawn deviates from a given distribution is unique to that distribution. In other words, if two distributions $F$ and $G$ are different, the expected distance between one point from $F$ and one point from $G$ should be greater than the mean of: (1) the expected distance between two points from $F$; and (2) the expected distance between two points from $G$. The energy statistic uses this argument to measure the distance between observed data and a proposed distribution.
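One way to turn this description into a usable goodness-of-fit function is the Monte Carlo sketch below, which approximates the energy distance between the data and a candidate scipy.stats family fitted to it; the reference-sample size and function name are assumptions for illustration.

```python
import numpy as np

def energy_gof(y, model, n_ref=1000, rng=None):
    """Monte Carlo estimate of the energy distance between data `y` and the
    candidate distribution family `model`, fitted to `y` by maximum likelihood.
    Smaller values indicate better distributional fit."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y, dtype=float)
    params = model.fit(y)
    z = model.rvs(*params, size=n_ref, random_state=rng)    # sample from the fitted model
    between = np.abs(y[:, None] - z[None, :]).mean()        # E|Y - Z|
    within_y = np.abs(y[:, None] - y[None, :]).mean()       # E|Y - Y'|
    within_z = np.abs(z[:, None] - z[None, :]).mean()       # E|Z - Z'|
    return 2.0 * between - within_y - within_z
```

With this in place, the Cauchy example above corresponds roughly to a call such as `model_mimicry(y, stats.norm, stats.cauchy, energy_gof)`.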

The application of the model mimicry technique yielded 500 observations from the distributions of $\Delta^{\mathrm{Normal}}$ and $\Delta^{\mathrm{Cauchy}}$, the difference in energy statistics between the normal and the Cauchy distributions when the true model is assumed to be normal and Cauchy respectively. For a visual representation, a plot of logarithms of the two distributions, shifted by a constant, with a vertical line for the observed difference $\Delta_{\mathrm{obs}}$, can be found in Figure 2. Natural logarithms were taken due to the very large variation in empirical energy statistics for the Cauchy distribution, and the constant was added to ensure all values are positive, since logarithms can only be taken of positive values.

The observed value, $\Delta_{\mathrm{obs}}$, is represented by a black line. It is clear from Figure 2 that a Cauchy distribution is a better fit to the Cauchy-simulated data than a normal distribution, since the observed difference in goodness-of-fit is far more likely under the distribution simulated when the Cauchy model is assumed true than under the distribution simulated when the normal model is assumed true.

As seen in this example, simulated distributions of goodness-of-fit measures provide a useful reference by which to compare the fit of two models. When the observed difference in goodness-of-fit between two models is more likely under one model than another, this model is to be preferred, regardless of whether the raw goodness-of-fit statistic is higher for one model than for another. The model mimicry method, in being able to compare models for which goodness-of-fit statistics are not directly comparable, and being able to compare irregular models like the Cauchy distribution, presents a robust method for complicated model comparisons.

Figure 2: Plot comparing the fit of the normal (orange) and Cauchy (green) distributions to data simulated from a Cauchy distribution, using the model mimicry method outlined by Wagenmakers et al. The black line is the observed difference in goodness-of-fit; it falls within the distribution simulated under the Cauchy model, indicating that the Cauchy distribution is the better fit to the data.

3.2 Extensions to more than two models

Wagenmakers et al.’s model mimicry is limited in that it allows for the comparison of only two models. The method of comparing distributions of differences in goodness-of-fit does not easily extend to more than two models, so when more than two non-nested candidate models are to be compared, another approach should be taken.

While more than two models can be compared on a pairwise basis, it should be noted that Wagenmakers et al.’s model mimicry assumes in each comparison that one of the two competing models is true. The decision rule for model selection encouraged by Wagenmakers et al.’s model mimicry, that we should prefer one model over another if the observed difference in goodness-of-fit is more likely under the first model’s simulated distribution, is less useful when neither of the two models being compared is true.

For example, if comparing on a pairwise basis three models $M_1$, $M_2$ and $M_3$, for which model $M_1$ is true, one of the three comparisons will be undertaken on an incorrect assumption: that between model $M_2$ and model $M_3$. This, however, will not be a fatal problem if model $M_1$ is indeed true; the pairwise applications of Wagenmakers et al.’s model mimicry should reveal that model $M_1$ is preferable to both models $M_2$ and $M_3$, rendering the comparison between models $M_2$ and $M_3$ unnecessary. If one is willing to accept the discomfort of undertaking particular pairwise analyses on flawed assumptions, the pairwise model selection procedure may be appropriate.

If one seeks to avoid this by comparing all models simultaneously, the similar approaches of Allcroft and Glasbey (2003) [Allcroft:2003aa] and Schultheis and Naidu (2014) [Schultheis:2014aa] may be preferred. Allcroft and Glasbey (2003) [Allcroft:2003aa], one year before Wagenmakers et al. published their methodology [Wagenmakers:2004aa], propose a technique similar to Wagenmakers et al.’s model mimicry, but with the capability to compare more than two models. The major point of difference between the Allcroft and Glasbey method and Wagenmakers et al.’s model mimicry is that, while distributions of differences in goodness-of-fit are simulated in model mimicry, the Allcroft and Glasbey method uses the raw goodness-of-fit values to simulate multivariate distributions of the goodness-of-fit vector under each of the candidate models. The observed goodness-of-fit vector is then compared to the simulated distributions to determine which hypothesis is most likely. After simulation, model selection then becomes a classification problem in a $K$-dimensional space, where $K$ is the number of candidate models.

In contrast to Wagenmakers et al.’s model mimicry, the Allcroft and Glasbey method omits the non-parametric bootstrap at each simulation, and does not re-estimate the parameters of each model for each simulation, instead using only the estimated parameters from the observed data throughout the procedure. These omissions are reversed in the work of Schultheis and Naidu (2014) [Schultheis:2014aa], and their technique will be preferred here to reduce the procedure’s sensitivity to parameter estimates.

The favoured method, that of Schultheis and Naidu (2014), here called “multi-model mimicry”, is thus as follows [Schultheis:2014aa]:

  1. Apply the non-parametric bootstrap to the observations in observed data $y$; in other words, take a sample of size $n$ from $y$, sampling with replacement. Denote this $y^*$.

  2. Estimate parameters $\hat\theta^*_1, \ldots, \hat\theta^*_K$, for proposed distributions $M_1, \ldots, M_K$ respectively, from the observations in bootstrap $y^*$.

  3. Simulate a dataset of size $n$ under each of $M_1(\hat\theta^*_1), \ldots, M_K(\hat\theta^*_K)$, denoted $y^{(1)}, \ldots, y^{(K)}$ respectively.

  4. Fit every model to every set of simulated data, and calculate goodness-of-fit (‘GOF’) measures for each of the models’ fits. In other words:

    • Estimate parameters $\hat\theta_1, \ldots, \hat\theta_K$, for distributions $M_1, \ldots, M_K$ respectively, from the observations in $y^{(1)}$, and calculate GOF measures $\mathrm{GOF}_1(y^{(1)}), \ldots, \mathrm{GOF}_K(y^{(1)})$;

    • do the same for the observations in $y^{(2)}$, calculating GOF measures $\mathrm{GOF}_1(y^{(2)}), \ldots, \mathrm{GOF}_K(y^{(2)})$; and so on, up to

    • the observations in $y^{(K)}$, calculating GOF measures $\mathrm{GOF}_1(y^{(K)}), \ldots, \mathrm{GOF}_K(y^{(K)})$.

  5. Repeat steps 1-4 $N$ times, yielding $N$ observations from the distributions of the goodness-of-fit vectors

    $\mathbf{G}^{(j)} = \left(\mathrm{GOF}_1(y^{(j)}), \ldots, \mathrm{GOF}_K(y^{(j)})\right), \quad j = 1, \ldots, K,$

    where $\mathrm{GOF}_k(y^{(j)})$ is the goodness-of-fit to model $M_k$ of data produced according to model $M_j$.

  6. Meanwhile, fit all models to the observed data $y$, yielding $\hat\theta_1, \ldots, \hat\theta_K$. Calculate the goodness-of-fit of each model, $\mathbf{g}_{\mathrm{obs}} = \left(\mathrm{GOF}_1(y), \ldots, \mathrm{GOF}_K(y)\right)$.

  7. Compare the observed value $\mathbf{g}_{\mathrm{obs}}$ to the distributions of $\mathbf{G}^{(1)}, \ldots, \mathbf{G}^{(K)}$:

    • The hypothesis to be preferred is that under which the observed value $\mathbf{g}_{\mathrm{obs}}$ is most likely.

    • In other words, select the model $M_j$ satisfying

      $j = \operatorname*{arg\,max}_{k} \, \hat{p}_{\mathbf{G}^{(k)}}\!\left(\mathbf{g}_{\mathrm{obs}}\right),$

      for estimated density functions $\hat{p}_{\mathbf{G}^{(k)}}$.

A diagram outlining the multiple model comparisons method can be found in Figure 3.

Figure 3: Multi-model mimicry outlined by Schultheis and Naidu (2014). Simulated multivariate distributions of goodness-of-fit under competing models are compared to all models’ goodness-of-fit to the observed data.
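In code, the multi-model procedure differs from the two-model sketch only in looping over all candidate models and recording full goodness-of-fit vectors; the Python sketch below is again an assumed illustration rather than the authors' implementation.

```python
import numpy as np

def multi_model_mimicry(y, models, gof, n_reps=500, rng=None):
    """Multi-model mimicry for a list of scipy.stats candidate families.
    Returns, for each generating model, an (n_reps x K) array of goodness-of-fit
    vectors, together with the observed goodness-of-fit vector g_obs."""
    rng = np.random.default_rng() if rng is None else rng
    n, K = len(y), len(models)
    sims = {k: np.empty((n_reps, K)) for k in range(K)}
    for r in range(n_reps):
        # Step 1: non-parametric bootstrap of the observed data.
        y_star = rng.choice(y, size=n, replace=True)
        for k, gen_model in enumerate(models):
            # Steps 2-3: fit the generating model to the bootstrap sample, then simulate.
            pars = gen_model.fit(y_star)
            y_sim = gen_model.rvs(*pars, size=n, random_state=rng)
            # Step 4: goodness-of-fit of the simulated data to every candidate model.
            sims[k][r, :] = [gof(y_sim, m) for m in models]
    # Step 6: goodness-of-fit of every candidate model to the observed data.
    g_obs = np.array([gof(y, m) for m in models])
    return sims, g_obs
```

For the example in Section 3.4, `models` would be something like `[stats.expon, stats.lognorm, stats.chi2]`, with `energy_gof` from the earlier sketch supplying the goodness-of-fit measure.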

3.3 Classifying results from multi-model mimicry

In Step 7 of the method adapted here from Schultheis and Naidu [Schultheis:2014aa], a model $M_j$ is chosen which satisfies

$j = \operatorname*{arg\,max}_{k} \, \hat{p}_{\mathbf{G}^{(k)}}\!\left(\mathbf{g}_{\mathrm{obs}}\right).$

Unfortunately, the distributions of $\mathbf{G}^{(1)}, \ldots, \mathbf{G}^{(K)}$ are only known through the simulated observations of these distributions. The task of choosing the model that maximises the density of $\mathbf{g}_{\mathrm{obs}}$ under that model is, in other words, a supervised classification task, assigning a new point to one of $K$ sets of observed points. Three popular methods for supervised classification are

  • inspection,

  • model-based classifiers, and

  • non-parametric classifiers.

The first two such methods are discussed in the paragraphs below. For further discussion of non-parametric classifiers, including $k$-nearest neighbour methods, see Hastie, Tibshirani and Friedman [Hastie:2009aa]; these methods are not used here as the decision boundaries in these scenarios are rarely so irregular as to necessitate such an approach.

Inspection

With just a single data point to classify, it seems natural at first to determine the nearest distribution by inspection. In this context, this would amount to looking at a plot containing $\mathbf{g}_{\mathrm{obs}}$ and all observations of $\mathbf{G}^{(1)}, \ldots, \mathbf{G}^{(K)}$, and determining which set of observations appears closest to $\mathbf{g}_{\mathrm{obs}}$. For simple analyses, there is nothing inherently wrong with this approach; indeed, in their paper introducing model mimicry, Wagenmakers et al. use inspection to classify models for their Example 1 [Wagenmakers:2004aa]. However, there are two key flaws with this methodology: inspection becomes more difficult in a higher-dimensional space, and inspection may become inefficient when multiple data sets are to be classified.

In the examples of Wagenmakers et al. [Wagenmakers:2004aa], two models are differentiated using model mimicry; that is, the version able to deal only with a binary model selection. Wagenmakers et al. were thus able to make model selections by inspecting histograms. For multi-model mimicry, multi-dimensional distributions of points are considered, making visualisation of $\mathbf{g}_{\mathrm{obs}}$ within the goodness-of-fit space difficult. Pairwise scatterplots (by dimension) are possible, though information is lost in showing just marginal goodness-of-fit distributions. An alternative is a two-dimensional principal components plot, which is able to represent a much larger proportion of the variation in the data than a two-dimensional marginal plot. This is one visualisation used by Schultheis and Naidu [Schultheis:2014aa], though the principal components plot is also unlikely to fully convey the higher-dimensional system it represents. This manuscript will thus propose the use of discriminant-based classifiers for multi-model mimicry output.

Another issue with model selection by inspection is that it becomes inefficient when a larger number of data sets is considered. For example, suppose a model is sought to describe multiple potential realisations of data sets. This situation is common in psychological modelling (see, e.g., Wagenmakers et al. [Wagenmakers:2004aa]). In this instance, it may be necessary to classify many values of $\mathbf{g}_{\mathrm{obs}}$, with a separate set of distributions of $\mathbf{G}^{(1)}, \ldots, \mathbf{G}^{(K)}$ for each data set. By inspection, this would require considering a large number of visualisations in order to draw conclusions. Numerical classification of $\mathbf{g}_{\mathrm{obs}}$ should then be considered.

Discriminant-based classifiers

Numerically, we seek the probability that $\mathbf{g}_{\mathrm{obs}}$ belongs to some model $M_k$. In other words, for each $k$, we seek

$P\!\left(M_k \mid \mathbf{g}_{\mathrm{obs}}\right).$

Using Bayes’ theorem, we can rearrange this to

$P\!\left(M_k \mid \mathbf{g}_{\mathrm{obs}}\right) = \dfrac{p\!\left(\mathbf{g}_{\mathrm{obs}} \mid M_k\right) P(M_k)}{\sum_{j=1}^{K} p\!\left(\mathbf{g}_{\mathrm{obs}} \mid M_j\right) P(M_j)}.$

Since there are the same number (denoted earlier $N$) of observations for each model, and there is no assumed prior preference for any model, we can treat the priors $P(M_k)$ as equal:
(3.1)

$P\!\left(M_k \mid \mathbf{g}_{\mathrm{obs}}\right) = \dfrac{p\!\left(\mathbf{g}_{\mathrm{obs}} \mid M_k\right)}{\sum_{j=1}^{K} p\!\left(\mathbf{g}_{\mathrm{obs}} \mid M_j\right)}.$

In order to classify $\mathbf{g}_{\mathrm{obs}}$, we thus need to estimate $p\!\left(\mathbf{g}_{\mathrm{obs}} \mid M_k\right)$. We briefly discuss here estimation of the distributions of $\mathbf{G}^{(k)}$ in three manners of increasing complexity (all from Hastie, Tibshirani and Friedman (2009) [Hastie:2009aa]):

  • Linear discriminant analysis (LDA), which assumes homoscedastic normal distributions;

  • Quadratic discriminant analysis (QDA), which assumes heteroscedastic normal distributions; and

  • Mixture discriminant analysis (MDA), which assumes mixtures of heteroscedastic normal distributions.

It should be noted that Hastie, Tibshirani and Friedman restrict their version of MDA to the assumption of mixtures of homoscedastic normal distributions ([Hastie:2009aa] at page 440), while this is generalised here to mixtures of heteroscedastic normal distributions for additional flexibility.

Linear discriminant analysis (LDA), the simplest of the three methods, is so named because it produces linear decision boundaries; in other words, the boundary between the region whose points would be classified to one model and the region that would be classified to another is always linear [Hastie:2009aa]. The LDA model supposes that each distribution can be expressed as

$\mathbf{G}^{(k)} \sim \mathcal{N}\!\left(\boldsymbol\mu_k, \Sigma\right)$

for $k = 1, \ldots, K$. Note here that $\Sigma$ is not dependent on $k$, meaning the variance is assumed to be the same for all models. Calculating the density functions can be done by maximum likelihood; each mean $\boldsymbol\mu_k$ is simply estimated to be the sample mean $\bar{\mathbf{G}}^{(k)}$ of each $\mathbf{G}^{(k)}$, while the variance is estimated to be the weighted average of sample variance matrices for each $\mathbf{G}^{(k)}$. Since, in this case, there are the same number of observations of each goodness-of-fit distribution,

$\hat\Sigma = \dfrac{1}{K}\sum_{k=1}^{K} \hat\Sigma_k$

for variance estimate $\hat\Sigma$, and sample variances estimated by

(3.2)

$\hat\Sigma_k = \dfrac{1}{N-1}\sum_{i=1}^{N}\left(\mathbf{G}^{(k)}_i - \bar{\mathbf{G}}^{(k)}\right)\left(\mathbf{G}^{(k)}_i - \bar{\mathbf{G}}^{(k)}\right)^{\!\top}$

for the $i$th observation $\mathbf{G}^{(k)}_i$ of $\mathbf{G}^{(k)}$. Once these parameters have been estimated, Equation (3.1) can be used to select a model; the model that maximises the discriminant in (3.1) can be selected. The higher the value of the discriminant for the chosen model, the greater the confidence one can express in the selection of that model.

Quadratic discriminant analysis (QDA) relaxes the constant-variance assumption of LDA. This means it is able to respond to differing covariance structures between goodness-of-fit distributions. The example later in this manuscript demonstrates that this may be useful in a multi-model mimicry context; in that instance, the variance of the goodness-of-fit distributions, as represented in Figure 4, was not the same for all candidate models.

The QDA model supposes that each distribution can be expressed as

$\mathbf{G}^{(k)} \sim \mathcal{N}\!\left(\boldsymbol\mu_k, \Sigma_k\right)$

for $k = 1, \ldots, K$. Note here that $\Sigma_k$ is dependent on $k$. Again, density functions are calculated by maximum likelihood, with means for each model estimated by the sample means $\bar{\mathbf{G}}^{(k)}$, and sample variances as in Equation (3.2) above.

Mixture discriminant analysis (MDA) presupposes that each $\mathbf{G}^{(k)}$ can be approximated by a Gaussian mixture, which is a weighted sum of normal densities. Thus MDA allows for more flexible decision boundaries than LDA or QDA, by accounting for the fact that the distributions of goodness-of-fit among candidate models may be multi-modal. The MDA model supposes that each goodness-of-fit distribution can be expressed as

$\mathbf{G}^{(k)} \sim \sum_{r=1}^{R_k} \pi_{kr}\, \mathcal{N}\!\left(\boldsymbol\mu_{kr}, \Sigma_{kr}\right)$

for components $r = 1, \ldots, R_k$ of the $k$th mixture model, with mixing proportions $\pi_{kr}$ summing to one. Note here that the means, variances and even number of components may vary within and among candidate distributions.

Allowing the covariance matrices to differ across components and across distributions is an extension of the discriminant suggested by Hastie, Tibshirani and Friedman ([Hastie:2009aa] at page 440). The additional flexibility provided by this extension is necessary in situations such as that in the example in Figure 5, in which the distribution of goodness-of-fit under the log-normal candidate model, unlike the distributions for the other candidate models, has a highly irregular covariance structure. Only a more flexible Gaussian mixture, with multiple components having different covariance matrices, is able to capture this irregular structure.

The parameters of the mixture model, within each group, can be estimated using an Expectation Maximisation (EM) algorithm for Gaussian mixtures [Xu:1996aa].

Using LDA, QDA or MDA, models can be selected by maximising the discriminant in Equation (3.1). If the LDA, QDA or MDA assumptions hold, the discriminant in Equation (3.1) represents an estimate of the probability that each model is true, given , using an uninformative prior [Hastie:2009aa]. This provides for a numerical representation of not just which model should be selected, but also the uncertainty in this choice of model.
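The classification of the observed goodness-of-fit vector can be carried out with standard tools. The sketch below (an assumed scikit-learn illustration, not the authors' code) evaluates the discriminant in Equation (3.1) via LDA and QDA, and approximates the more flexible MDA variant by fitting a separate full-covariance Gaussian mixture to each model's simulated cloud and normalising the resulting densities under equal priors.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)
from sklearn.mixture import GaussianMixture

def classify_gof(sims, g_obs, n_components=2):
    """Posterior model probabilities for g_obs under LDA, QDA and an MDA-style
    mixture discriminant, given simulated clouds sims[k] of shape (n_reps, K)."""
    keys = sorted(sims)
    X = np.vstack([sims[k] for k in keys])
    labels = np.concatenate([np.full(len(sims[k]), k) for k in keys])

    lda = LinearDiscriminantAnalysis().fit(X, labels)
    qda = QuadraticDiscriminantAnalysis().fit(X, labels)

    # MDA-style discriminant: one Gaussian mixture per candidate model, equal priors.
    log_dens = np.array([
        GaussianMixture(n_components=n_components, covariance_type="full",
                        random_state=0).fit(sims[k]).score_samples(g_obs[None, :])[0]
        for k in keys
    ])
    mda_post = np.exp(log_dens - log_dens.max())
    mda_post /= mda_post.sum()

    return (lda.predict_proba(g_obs[None, :])[0],
            qda.predict_proba(g_obs[None, :])[0],
            mda_post)
```

Because each cloud contains the same number of simulations, the class priors estimated by the LDA and QDA fits coincide with the uniform prior assumed in Equation (3.1).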

3.4 An example of multi-model mimicry

Using the expansion of the procedure of Schultheis and Naidu (2014) [Schultheis:2014aa], this section contains an example concerning data simulated from an exponential distribution. Hypotheses that the data come from an exponential distribution, a log-normal distribution and a chi-squared distribution are compared using multi-model mimicry, with Székely and Rizzo’s energy statistic [Szekely:2005aa] as the measure of goodness-of-fit to each distribution, as in this manuscript’s earlier example of pairwise model mimicry.

Since the three candidate distributions are similar, they should be difficult to distinguish, making this example demonstrative of the power of multi-model mimicry.

First, 100 variates from an exponential distribution were simulated. Multi-model mimicry was then applied to this data, comparing the ability of an exponential distribution, a log-normal distribution and a chi-squared distribution to describe the data, with 500 replicates. This yielded 500 observations from each of the distributions of $\mathbf{G}^{(\mathrm{Exp})}$, $\mathbf{G}^{(\mathrm{LN})}$ and $\mathbf{G}^{(\chi^2)}$, the vectors of energy goodness-of-fit of all models to the data simulated under the exponential, log-normal and chi-squared models respectively. Each model was also fit to the observed data, yielding the observed goodness-of-fit of each model to the data, $\mathbf{g}_{\mathrm{obs}}$.

For a visual representation of this, pairwise plots of logarithms of two components of each goodness-of-fit vector are given in Figure 4, alongside marginal distributions of logarithms of each component of each vector. Logarithms are taken to make the plots easier to visualise. Pairwise plots are used due to the difficulty of representing the three-dimensional clouds of variates from the vectors of goodness-of-fit. Each of $\mathbf{G}^{(\mathrm{Exp})}$, $\mathbf{G}^{(\mathrm{LN})}$ and $\mathbf{G}^{(\chi^2)}$ is represented by a cluster of points in the pairwise scatterplots, and by a density function on the diagonals of this figure representing the marginal distributions under each model. The observed value, $\mathbf{g}_{\mathrm{obs}}$, is represented by a black point or black line. It can be seen that the observed value appears to fit most closely to the exponential variates in all plots, indicating that the observed goodness-of-fit is more likely if the true model is the exponential distribution than if the true model is log-normal or chi-squared. Thus, the exponential model should be selected.

However, it is not sufficient to show that components of $\mathbf{g}_{\mathrm{obs}}$ are, marginally, closest to components of the variates from $\mathbf{G}^{(\mathrm{Exp})}$. The components of each $\mathbf{G}^{(k)}$ are marginal distributions of the goodness-of-fit, and marginal distributions do not completely characterise a joint distribution. To confirm that the exponential model should be selected, linear, quadratic and mixture discriminant analyses were undertaken to classify the point $\mathbf{g}_{\mathrm{obs}}$. All discriminant analyses classified the observed goodness-of-fit into the exponential cluster, with probabilities 0.665, 0.998, and 0.997 for linear, quadratic and mixture discriminant analyses respectively.

Further visual representation can be provided by taking principal components of the pooled variates from $\mathbf{G}^{(\mathrm{Exp})}$, $\mathbf{G}^{(\mathrm{LN})}$ and $\mathbf{G}^{(\chi^2)}$, and projecting $\mathbf{g}_{\mathrm{obs}}$ onto this space, providing two-dimensional projections of the goodness-of-fit space. A plot of this can be found in Figure 5 and reinforces the selection of the exponential model. In this instance, the first two principal components capture 84.3% of the variance in the distributions, meaning Figure 5 is sufficiently characteristic of the data to be of use in model selection.
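A figure in the style of Figure 5 can be produced along the following lines (an assumed illustrative snippet using scikit-learn and matplotlib), projecting both the simulated goodness-of-fit clouds and the observed vector onto the first two principal components of the pooled simulations.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def pca_plot(sims, g_obs, names):
    """Two-dimensional principal components view of multi-model mimicry output."""
    keys = sorted(sims)
    pca = PCA(n_components=2).fit(np.vstack([sims[k] for k in keys]))
    for k in keys:
        pts = pca.transform(sims[k])
        plt.scatter(pts[:, 0], pts[:, 1], s=8, alpha=0.4, label=names[k])
    obs = pca.transform(g_obs[None, :])[0]
    plt.scatter(obs[0], obs[1], c="black", s=60, label="observed")
    print("variance explained:", pca.explained_variance_ratio_.sum())
    plt.xlabel("PC1"); plt.ylabel("PC2"); plt.legend(); plt.show()
```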

In summary, multi-model mimicry allows for the comparison of multiple models simultaneously, using any relevant goodness-of-fit measure. The approach is effective at distinguishing between models even at small sample sizes; here $n = 100$. While the final act of model classification is difficult to visualise directly, pairwise scatterplots and principal components plots can aid this. Classification can then be undertaken by inspection or using discriminant analyses, with discriminant analyses providing a clearer understanding of the level of uncertainty in model choice.

Figure 4: Plot comparing the fit of exponential, log-normal and chi-squared distributions to data simulated from the exponential distribution, using multi-model mimicry. Each plot on the diagonal is of marginal, univariate densities of energy goodness-of-fit statistics to the labelled model, of data simulated from the three models as denoted in the legend. For example, the plot in the second row and second column is of how well data simulated under all three models fits to the log-normal distribution; the black line in this plot is the observed goodness-of-fit to the log-normal distribution. Scatterplots below the diagonal are of goodness-of-fit statistics to the two labelled models, of data simulated from the three models as denoted in the legend. For example, the plot in the second row and first column is of goodness-of-fit to the exponential ($x$-axis) and log-normal ($y$-axis) distributions, of data simulated under all three distributions. A contour plot for each hypothesis is added to aid interpretation. The black dot in this scatterplot corresponds to the observed goodness-of-fit. The exponential distribution appears to be the best fit to the data.
Figure 5: Plot comparing principal components of the fit of the exponential, log-normal and chi-squared distributions to data simulated from the exponential distribution, using multi-model mimicry. Each cluster of points represents data simulated under one of the hypotheses: that the data comes from an exponential (orange), a log-normal (blue) or a chi-squared distribution (green). A contour plot for each hypothesis is added to aid interpretation. Since the black point, the observed goodness-of-fit, fits in the centre of the exponential cluster, the exponential model is to be preferred.

4 Discussion of the multi-model mimicry framework

The multi-model mimicry technique introduced by Schultheis and Naidu, and given broader statistical underpinnings in this manuscript, is useful for comparing any number of candidate models with the following capabilities:

  1. Model parameters should be estimable given training data.

  2. There must exist some statistic by which the data’s goodness-of-fit to the model can be measured.

  3. It must be possible to simulate data under the model.

This makes MMM more flexible than a number of alternative techniques for model comparison. For example, relative to Wilks’ likelihood-ratio test, MMM does not require that the models be nested. Relative to direct comparisons of AIC or BIC, MMM does not require that likelihood can be computed, nor that the models and data are of a form that allows AIC and BIC values to be compared directly. Relative to Approximate Bayesian Computation, MMM does not require the estimation of prior probabilities of models and their parameters, and can be undertaken in a frequentist environment.

That MMM is more flexible than some alternatives does not, however, make it universally applicable; the three requirements above are non-trivial. These requirements will now be discussed in turn, followed by a fourth limitation: the MMM's preference for complexity.

Parameter estimation

The MMM technique requires the model to be able to be fit, based upon the observed data. The technique does not provide alternative methods for such model fitting, so if a model is poorly specified or unidentifiable, the MMM procedure does not resolve this issue. Computational approaches such as Markov Chain Monte Carlo or Approximate Bayesian Computation may resolve these issues in some cases.

Goodness-of-fit statistics

The MMM technique requires goodness-of-fit statistics for each model to be defined, and calculable. It is assumed in this manuscript that the same goodness-of-fit measure is used for all candidate models, but it may be the case that the MMM procedure is still useful in cases where different goodness-of-fit measures are available for different models. An earlier example given in this manuscript is the difficulty of calculating AIC or BIC for a model incorporating kernel density estimates, since the number of parameters is difficult to determine. Other distributional goodness-of-fit measures can determine how well a kernel density estimate fits to data. It remains uncertain whether within the MMM framework, the distributional goodness-of-fit measures for a kernel density estimate could be compared to AIC or BIC for other candidate models. There is no computational reason that this comparison could not be done, but the statistical consequences of this are a potential course of further research.

Simulation

The MMM approach involves simulation of data under all candidate models, using the parameters of these models. This naturally brings about questions as to the applicability of the approach to non-parametric models, or to those where the confines of the model, and its ability to simulate new data, are harder to define. For example, in many machine learning contexts, such as neural networks, goodness-of-fit to models can be calculated in the form of loss functions, but the simulation of new data under models is an emerging area of research. For instance, a deep learning environment such as that in image processing may be able to classify images into certain categories, and develop a model for doing so, but it may be unable to produce new images according to these classifications. Where simulation is not possible, the MMM technique is unavailable. A course of further research may involve the use of non-parametric bootstraps under each model, rather than the parametric bootstraps of the MMM's cross-fitting method, to compare model performance.

Complexity

When introducing the model mimicry technique for comparison of two models, Wagenmakers et al. [Wagenmakers:2004aa] note that the technique has a preference for more complex models, since a more complex model may be able to produce data which mimics that produced by a simpler model. This issue recurs in the extension to multiple models introduced by Schultheis and Naidu [Schultheis:2014aa] and enumerated here. Further research may illuminate methods for penalising complex models in the MMM context. As a general rule for application of the MMM, this issue is addressed by preferring a more parsimonious model over a more complex one where MMM does not express a strong preference for the more complex model.

5 Conclusion

This manuscript presents the history behind, and statistical context for, the multi-model mimicry technique. MMM provides a framework for the simultaneous comparison of multiple statistical models, on the basis of generalised goodness-of-fit measures and in cases where candidate models differ greatly in model or data structure. In doing so, the technique is not limited to scenarios wherein likelihood for the candidate models is calculable, where likelihood-based goodness-of-fit measures are able to be compared between models, or where it is even likelihood that is sought to be compared between models. For example, the MMM technique allows for the comparison of models on the basis of goodness-of-fit to model-building assumptions, to desired features like the frequency or quantification of extreme data points, or to some distribution.

This manuscript also demonstrates the effectiveness of this technique, even at relatively small sample sizes and for otherwise difficult-to-distinguish models. For example, for 100 data points simulated from an exponential distribution, MMM was able to correctly identify that this data came from an exponential distribution, rather than a log-normal or a chi-squared distribution.

Finally, we discuss potential areas for further research in the area of the MMM. How the MMM framework interacts with situations in which model parameters are difficult to determine, where different goodness-of-fit measures are used for different models, or where simulation of new data under models is difficult, all form courses for future exploration. A more sophisticated approach for dealing with the MMM's preference towards more complex models would also be a welcome addition to the literature in this area.

References