Model criticism is the process of assessing the goodness of fit between some data and a statistical model of that data111Following O’Hagan (2003, p423) we prefer the term model criticism over model validation and model checking, as if “all models are strictly wrong” it is impossible to validate a model, and model criticism has a more active tone of looking to discover problems, compared to model checking, which may seem a more passive activity that does not expect to uncover any problems.. While model criticism uses goodness-of-fit tests to judge aspects of the model, its general objective is to identify deficiencies in the model that can lead to model extension to address these deficiencies. The extended model(s) can again be subjected to criticism, and the process continues until a satisfactory model is found (O’Hagan, 2003). Model criticism is contrasted with model comparison in that model criticism assesses a single model, while model comparison deals with at least two models to decide which model is a better fit. Model comparison can be applied to compare the original and the extended model after model criticism and extension (O’Hagan, 2003, p. 2).
Bayesian modelling has become an indispensable tool in statistical learning, and it is being widely used to model complex signals, e.g., by Ratmann et al. (2009). With its growing popularity, there is need for model criticism in this framework. Most work on model criticism makes use of the idea that “if the model fits, then replicated data generated under the model should look similar to observed data” (Gelman et al., 2004, p. 165). In contrast, in this paper we focus on a less well explored idea that for latent variable models, we can probabilistically pull back the data into the space of the latent variables, and carry out model criticism in that space. We can summarize this principle as that if the model fits, then posterior inferences should match the prior assumptions.
To elaborate, consider a model with observed variables and unobserved variables
with joint distributionwhere are known parameters. In general may contain latent variables , parameters
, and hyperparameters222For example, in the context of the Bayesian matrix factorization (Salakhutdinov and Mnih, 2008), is the observed data matrix, is the matrix of latent factors , the loading matrix , precision hyperparameters , and
denotes the parameters of the hyperpriors.. Given a sample from the marginal distribution , and a single posterior sample from the conditional distribution , the joint sample is a draw from the distribution . This property can be used to check the fit of the model in the latent space by checking if is a sample from the marginal distribution . Testing a single sample against a distribution, however, is not an effective approach. But, in many widely-used models, groups of unknown variables are independently and identically distributed under the prior. These related variables are easily aggregated together, giving a simple test of the prior assumptions. Figure 1 summarizes the overall approach, which is justified in §3.
In comparison to model criticism in the observation space, comparing with prior , provides an additional tool for model criticism which does not require crafting an appropriate discrepancy measure, generating replicate observations, and approximating the null distribution. This approach also does not suffer from the “double use” of data (see discussion in §2). These points have also been made by Yuan and Johnson (2012), but were applied to a relatively small scale hierarchical linear model. We develop the use of model criticism in latent space for large scale and complex models, yielding new insights and developments. Specifically, we apply this approach to the criticism of linear dynamical systems, factor analysis and Gaussian processes, and discuss its connection to the observation space based approach.
The structure of the rest of the paper is as follows: in §2 we describe the methods of model criticism in observation space. §3 provides details of the argument for model criticism in latent space and describes related work, and §4 shows results from applying the method to the three examples. Table 1 describes the notations followed in the paper, and Table 2 shows the distributions used in the paper.
|Upper case italics||Random variable or a group of random variables|
|Lower case italics||Realization of a random variable|
|Lower case bold||Vectors, realization or random variable|
|Upper case bold||Matrices, realization or random variable|
|Distribution of random variable|
|Conditional distribution of random variable given|
|Probability density function of r. v. , abbreviation for|
|Conditional density function of r. v. given , abbreviation for|
|A posterior sample|
|Dirichlet||where and .|
2 Model Criticism in Observation Space
A general approach of model criticism is to evaluate if replicated data generated under the (fitted) model looks similar to observed data. Consider that we are modelling observed data with a latent variable model parameterized by , i.e., we have defined the likelihood and (optionally) a prior distribution over potential parameter values. The principle of model criticism in the observation space is to assess if is a reasonable observation under the proposed model. For example, given the maximum likelihood estimator
(or another point estimate)of the parameters, one standard approach is to find the plug-in p-value (Bayarri and Berger, 2000)
Here is called a discrepancy functionis a replicate observation generated under the fitted model, i.e., .
If the p-value is low, then it implies that the probability of generating a more extreme dataset than the observed data is small, or in other words, the observed data itself is considered extreme relative to the model, and thus, the model does not adequately describe the dataset. In summary, a low p-value rejects the hypothesis that the data is being adequately modelled. The p-value is usually estimated via an empirical average by generating multiple replicates , , and evaluating
where is a sample from .
An alternative to point estimation is to consider a Bayesian treatment of the problem where one can integrate out the contribution of the parameters. The test statistic can be averaged under either the prior distribution or the posterior distribution. The prior-predictive distribution is defined to have the density where parameterizes the prior distribution over . One can generate replicate observations from this distribution, and compute the prior predictive p-value (Box, 1980)
where is a sample from . This approach is not reasonable when the prior distribution is improper (cannot be integrated) or uninformative. Additionally, even if the prior distribution is informative, one might not generate enough samples to represent the data distribution well when the parameter space is large. However, notice that one does not need to fit the model to criticise it.
On the other hand, one can use the posterior distribution
, and sample from the posterior-predictive distribution with density. The posterior predictive p-value (Rubin, 1984) is then computed as:
where is a sample from , i.e., by generating samples from the posterior distribution instead333Note that the p-value might be available in closed form depending on the choice of (Gelman et al., 1996, Eq. (8–9)). Then where are posterior samples.. The support of the posterior is usually more concentrated than prior, and the posterior distribution may be well-defined even if the prior distribution is improper.
The posterior predictive p-value has been criticised for “double use” of data, once for computing the posterior distribution and once for computing the discrepancy measure (Bayarri and Berger, 2000). This means that
does not have a uniform distribution under the null hypothesis, whereasis a valid p-value. is subject to the same criticism as since the MLE uses the observed data as well (Bayarri and Berger, 2000). Lloyd and Ghahramani (2015, §7) view the different p-values as arising from “different null hypotheses and interpretations of the word ‘model’ ”. They argued that the posterior predictive and plug-in p-values are most useful for highly flexible models, as the aim is to assess the fitted model rather than the whole space of models. Lloyd and Ghahramani (2015) also point out that “it may be more appropriate to hold out data and attempt to falsify the null hypothesis that future data will be generated by the plug-in or posterior distribution”, which is also in line with the discussion in (O’Hagan, 2003, §2.1). Further examples of posterior predictive checking can be found in (Belin and Rubin, 1995; Gopalan et al., 2015).
In all of the model criticism described above, a key quantity is the discrepancy function used to compare the data and predictive simulations. We agree with Belin and Rubin (1995, p. 753) who wrote of the importance of identifying discrepancy functions “that would not automatically be well fit by the assumed model”, and that “there is no unique method of Bayesian model monitoring, as there are an unlimited number of non-sufficient statistics that could be studied”.
Lloyd and Ghahramani (2015) suggest the Maximum Mean Discrepancy (MMD) as a measure of discrepancy between the observed data and replicates. The motivation of using this approach is to maximize the discrepancy over a class of discrepancy functions rather than choosing only one, i.e.,
where is a set of functions. The function that maximizes the discrepancy is known as the witness function. When is a reproducing kernel Hilbert space (RKHS) the witness function can be derived in closed form as
where is the kernel of the RKHS. This estimation does not work well in high dimensions, and therefore, the authors suggests reducing the dimensionality of the observation space before applying this statistic (Lloyd and Ghahramani, 2015, p. 4).
3 Model Criticism in Latent Space
Recall we have a model , with observed variables , unobserved variables , and known parameters . In general may contain latent variables , parameters , and hyperparameters . Our procedure depends on the following two key observations:
If is drawn from the above model, then a sample from is a sample from the prior distribution . To see why this is true, observe that a natural way to sample from the joint is to generate a sample from , and then generate a sample from in that order. However, it is also valid to draw samples from the joint by first sampling from and then sampling from . Thus we have
If is a sample from , then a sample from will be a draw from .
It is important to clarify what Statement 1 is not saying. It is not saying that repeated draws from will explore the full prior distribution , but only that it is a valid way to draw one sample from it if is a draw from the model444Throughout this paper, we assume that the prior distribution is proper, so the respective posterior distribution is well-defined, and that any MCMC sampler has converged, i.e., the posterior sample is well-behaved.. However, testing how well a single draw from a given distribution fits that distribution is difficult. This brings us to our second observation.
If is a collection of variables, i.e., , and the prior distribution of decomposes into independent draws from the same distribution, e.g., then it is possible to aggregate these variables together, i.e., instead of testing if is a sample from , one can test if is independent and identical draws from the distribution . In other words, rather than testing one sample against a known high dimensional distribution, one can test if the collection of samples are independent and identical draws from a known lower-dimensional distribution . Thus, we define aggregation as pooling variables with the same prior distribution together, and an aggregated posterior sample (APS) is defined as a set of posterior samples that have been aggregated for comparison with a specific reference distribution. The above can be generalized to the situation where is a collection of variables and parameters such that . Then can be aggregated and tested against 555Alternatively, and can be combined to define a pivotal quantity whose distribution does not depend on (Yuan and Johnson, 2012), and can be tested against that distribution.. Aggregation can be also extended to the case where consists of groups of variables where aggregation is performed within each group by pooling and comparing against where denotes all groups except . We provide more concrete examples of aggregation in §3.1 and Table 3.
We refer to this approach as aggregated posterior checking (APC). We summarize this approach in Algorithm 1. Ideas equivalent to Statement 1 and the aggregation of posterior samples can also be found in Yuan and Johnson (2012) 666We had independently derived the key results. We thank an anonymous referee for pointing out the work of Yuan and Johnson (2012). , but were applied to the case where contains only model parameters, and for hierarchical linear models. See §3.2 for more details on related work.
So far, we have addressed the idea of assessing deviations from the prior distribution and aggregation in the latent space. However, the same idea can be applied to the observation space as well, i.e., to the likelihood by testing if is a sample from . Although it is true that a discrepancy in the choice of likelihood should be reflected in the posterior sample, assessing the discrepancy in the likelihood directly provides better understanding and easier resolution of the discrepancy. Notice that, although we make use of , our approach is not equivalent to model criticism in the observation space since we do not compare the observed data with replicate observations , but only investigate the relation between the latent space and observation space . Both methods, however, require generating posterior samples (for model criticism in the latent space we use ).
3.1 Application to different models
We discuss below the application of model criticism in latent space to factor analysis, linear dynamical systems and Gaussian process regression. These situations are then demonstrated on real data in §4.
Factor analysis model
Consider a factor analysis model with hyperparameters , parameters (loading matrix) , latent variables (factors) and data . Grouping and similarly for 777We have omitted the fixed parameters for simplicity.,
Figure 1d illustrates this model. In Gaussian factor analysis, and (See Table 2). There is an identifiability issue in the factor analysis model between and , which is resolved by fixing the scale of one of the two. In Eq. (7) the dependence of on is taken to be null, i.e., . (In the case of example §4.1 we fix the scale of instead.) Also, and . Thus the fixed parameters .
If is drawn from the above model, then a sample from is a sample from the prior . In factor analysis, decomposes into independent draws from , and therefore, one can pool the posterior samples to assess deviations from . Moreover, each usually decomposes into independent draws over the different latent dimensions as , one can pool the to assess deviations from . Similarly, if the prior over the factor loadings matrix decomposes as then one can pool the s, and compare with . One can also go beyond the marginal or the full vector , and assess a subset of the vector such as bivariate interactions (see §4.1).
Linear dynamical system
One can extend the idea of aggregation beyond factor analysis models. For example, Statement 1 holds for general latent variable models with repeated structure
. Take, for example, a linear dynamical system model with a latent Markov chain, so that
where consists of the system and observation matrices , and precisions, . Then according to Statement 1 a sample drawn from should be distributed according to the prior over . Although the ’s are not independent (due to the Markov chain), we can consider model criticism for . For example for a system model parameterized as with , violations of the model will show up as deviations of the ’s from (see §4.2). Similarly, for an observation model parameterized as with , violations of the model may also show up as deviations of the ’s from . See §4.2.
Gaussian process regression
A Gaussian process probabilistic model is defined as:
where is the mean function parameterized by , is the covariance function (or kernel) parameterized by , and is the observation noise precision, see e.g., Rasmussen and Williams (2006). Given observations , , where , and . Alternatively, considering the eigendecomposition where
is the matrix of the eigenvectorss and
is the diagonal matrix of the corresponding eigenvalues, i.e.,,
This implies that, according to the model, the projections of the signal on the eigenvector are independent samples from . Thus, the normalized projections
should be independent samples from distribution. One can thus test the normality of the ’s to assess the goodness of fit. However, note that if the th eigenvalue of
is much smaller than the noise variance, then this
is dominated by the white noise contribution. Thus we only include’s corresponding to eigenvalues with to assess the fit of the GP model888The factor of 2 on the RHS of the inequality is included because the ’s are shifted by by definition.. See §4.3.
3.2 Related Work
Cook et al. (2006) consider the situation with (in our notation) a prior on the parameters of the model, and data . They then assume that specific parameters are drawn from the prior, then data drawn from . They then consider samples drawn from , and comment (in the caption of their Figure 1, translating to our notation) that “ should look like a draw from for ”. They then use the ‘reverse’ of Statement 1 to validate the correctness of posterior samples generated by a statistical software, by comparing with
. Their recommended method for this is to calculate posterior quantiles for each scalar parameter; if the software is working correctly then the posterior quantiles are uniformly distributed. Although they share with us the observation thatshould look like a draw from , this is used to answer a totally different question. Also, they do not discuss the inclusion of latent variables in the model.
Johnson (2007) and later Yuan and Johnson (2012) also consider a model with parameters and data drawn from . Their interest is in the use of pivotal quantity that has a known and invariant sampling distribution when data are generated from a model with data-generating parameters . Then Yuan and Johnson (2012) show that if the is a pivotal quantity distributed according to , then is also distributed according to , if is drawn from the posterior on given . The result of Yuan and Johnson (2012) extends earlier work by Johnson (2007) to the case where does not depend on the data
—for example this situation can arise in a Bayesian hierarchical linear regression model, when considering the second level where parameters for individual units are generated from a hyperprior.
Regression diagnostics is a well-explored example of model criticism. Existing approaches assess certain statistical assumptions made during modelling, e.g., if the residuals follow a normal distribution with zero mean, (e.g., using a Q-Q plot(Wilk and Gnanadesikan, 1968)), if the residuals are homoscedastic, (e.g., using the Breusch–Pagan (Breusch and Pagan, 1979) or White test (White, 1980)) or if the successive residual terms are uncorrelated (e.g., using the Durbin–Watson test (Durbin and Watson, 1950)). Regression diagnostics can be seen as a special case of model criticism in the latent space since residuals are representatives of errors, which are latent variables of the model. However, our methods are also applicable to more complex models.
Meulders et al. (1998) consider a factor analysis model for binary data, using (in our notation) priors on and . They carry out posterior sampling using block Gibbs sampling for and
and compare histograms of these variables against the prior. Discrepancies between the prior and histograms of the sampled aggregated posterior led to model extension, expanding the model to use a mixture of two beta distributions for the parameters. However, the authors do not explain the basis for carrying out this check (cf. Statement1).
Buccigrossi and Simoncelli (1999) consider the posterior distribution of wavelet coefficients (analogous to in the factor analysis model) in response to image patches. By considering the distribution of a bivariate aggregated posterior, they show that this is not equal to the product of the marginals, but exhibits variance correlations. (This is shown by introducing a “bowtie plot” showing the conditional histogram of given .) This work is a nice example of how the failure of a diagnostic test can give rise to an extended model (see §4.1).
O’Hagan (2003, §3) considered model criticism tools that can be applied at each node of a graphical model (and of course latent variables can be considered as such). O’Hagan (§3.1 2003) discussed the idea of residual testing at different levels of a hierarchical model as well as a generic probabilistic model. He suggested checking if a node in a probabilistic model is misbehaving by comparing the posterior samples at that node to prior distribution. O’Hagan (§3.2 2003) also emphasized that conflict can arise between the different sources of information about a variable at a particular node, arising from contributions from each neighbouring node in the graph. However, he did not suggest using the aggregated posterior to assess goodness of fit, but considered the posterior at each node separately.
Tang et al. (2012) introduce the concept of the “aggregated posterior” as applied to deep mixtures of factor analysers (MFA) model. Consider the situation as above but where is estimated by maximum likelihood, so it is the posterior over that is of interest. Thus Under this model we also have that where the subscript denotes that both and correspond to distributions under the model. Tang et al. (2012, p3) define the aggregated posterior as “the empirical average over the data of the posteriors over the factors”, i.e.,
where the integral wrt has been replaced by the empirical average over samples. If the data distribution is equal to the model distribution then should agree with . However, differences between and will manifest as differences between the two respective distributions in the latent space.
In practice, however, one does not explicitly construct the aggregated posterior Eq. (12) since it is only asymptotically equal to the prior. Instead Tang et al. (2012) compare a collection of samples from for to . This is a valid approach since if follow the distribution , then follow the distribution as we show in Statement 1. Additionally, as is not known in practice, Tang et al. (2012) replace with maximum likelihood estimate in the definition of aggregated posterior Eq. (12). In Statement 1 we extend this idea to a Bayesian setting where and are not fixed parameters but latent variables themselves.
Tang et al. (2012) started with a simple mixture of factor analysers (MFA), and observed that the aggregated posterior for a latent component often doesn’t match the prior. By replacing the prior for a component with another MFA model, they constructed a deep MFA model. The idea of the aggregated posterior (although not the name) can be traced back e.g. to Hinton et al. (2006)
, where in deep belief nets the idea was that the posterior distribution of the latents of a restricted Boltzmann machine (RBM) could be modelled by another RBM.
In this section, we provide three examples of model criticism and extension in the latent space. First, we explore a factor analysis model in the context of image compression (§4.1). The objective of this example is to show how the model can be criticised in the latent space as well as in the observation space. Our analysis leads to changing the latent distribution from a single Gaussian to a scale mixture of Gaussians, which captures both the marginal and the joint structure of the latent space, and improves the model in the observation space as well.
Next, we explore a linear dynamical system model (§4.2) in the context of modelling time series. We show that model criticism in latent space allows us to interrogate not only the standard “innovations” (defined in (20)), but also the latent residuals (defined in (19)).
Finally, we explore a Gaussian process model (§4.3) in the context of modelling time series. The objective of this example is to show when model criticism in the latent space can be a natural choice whereas model criticism in the observation space can be difficult. Our analysis leads to changing the covariance function from squared exponential to a combination of periodic and squared exponential kernels.
We implemented all models (except the Gaussian process model) in JAGS (Plummer, 2003), keeping a single sample in the MCMC run after discarding a burn-in of 1000 samples (10,000 samples for §4.2). Note that for model criticism in the latent space, we need only a single sample. We summarize the aggregation process and corresponding reference distributions used in this section in Table 3.
|MF (13)||for (LABEL:eq:eq:Gaussian)-(LABEL:eq:eq:Laplace) for (LABEL:eq:eq:scalemxgauss)||, and||, and for (LABEL:eq:eq:Gaussian) , and for (LABEL:eq:eq:Laplace) , and for (LABEL:eq:eq:scalemxgauss)|
|LDS (17), (18)||from (19) and from (20) from (19) and from (20)|
|GP (9)||for (22) for (22) and (23) for (22) (large and small) and (23)||from (21)|
4.1 Image patch data
The Berkeley Segmentation Database (Martin et al., 2001) consists of 200 training images. Following Zoran and Weiss (2012), we convert the images to greyscale, extract randomly located patches, and remove DC components from all image patches999https://people.csail.mit.edu/danielzoran/NIPSGMM.zip. We extract 50,000 image patches, and fit different matrix factorization models of the form:
with latent dimensions ( explained variance in PCA). Previously Zoran and Weiss (2012) used full covariance zero-mean Gaussians (, , ). We set .
We start by assuming a Gaussian distribution for the latent model
and generate a sample from the posterior. To criticise the model, we aggregate the univariate posterior samples , and bivariate samples . If the observed data follows the model, then the distributions of the corresponding APSs are univariate normal and bivariate normal
respectively. We observe that neither of the APSs follow the expected prior distribution. Also, the marginal distribution is more concentrated around zero than the expected distribution, whereas the joint distribution shows heteroscedasticity (Figure2, left column) which is inconsistent with the factorized bivariate normal prior.
An alternative latent variable model for the factor analysis model is (See Table 2)
We use the same aggregation strategy as before, and compare the empirical distributions with the univariate distribution and bivariate distribution respectively. We observe similar characteristics in the aggregated posterior as before (Figure 2, middle column).
To accommodate these observations, we allow a scale mixture of Gaussian distributions as used by Wainwright and Simoncelli (2000) with 8 components for the latent variable
|(16, Scale Mixture of Gaussians)|
We generate sample as well. We assess the same aggregated distributions as before and compare them with , and respectively. We observe that the empirical marginal distribution follows the mixture distribution well, although a KS test rejects the hypothesis that the aggregated posterior follows the mixture distribution. Additionally, the joint distribution captures the heteroscedasticity in the latent space (Figure 2, right column).
We also show the eigenvectors of the corresponding loading matrix for each of the three cases (Figure 2 bottom row). We show the eigenvectors rather than the loading matrix themselves since for the Gaussian and Gaussian scale mixture, the columns of the loading matrix may not correspond to any particular pattern due to rotational invariance. We observe that all three loading matrices span a similar space.
The matrix factorization model can be criticised in the observation space with established image statistics as a discrepancy measure. This, however, requires generating replicate data of the same size as the observed data, which in this case is computationally extensive since . To avoid generating multiple replicates, i.e., matrices , we only generate a single replicate for each latent distribution choice and compare them the observed data.
For all three cases, i.e., LABEL:tag:eq:Gaussian, LABEL:tag:eq:Laplace, and LABEL:tag:eq:scalemxgauss, we generate latent samples from the fitted parameters (and for LABEL:tag:eq:scalemxgauss). We use the rest of the fitted parameters, i.e., , , and to generate samples from . We generate 50,000, replicate image patches, and compare the observed and replicate data in terms of the distribution of raw pixel values. We show the results in Figure 3. We observe that the distribution of the image pixel values in the replicate data follows the observed data more closely for LABEL:tag:eq:scalemxgauss than the other latent distributions. However, it is not a perfect fit, and that tells us that this model can improved further; potentially by increasing , and varying the noise characteristics such as using a full diagonal covariance.
4.2 Honey bee data
The honey bee data consists of measurements of coordinate and head angle () of 6 honey bees. The measurements are usually translated into a 4-dimensional multivariate time series , and modelled using a switching linear dynamical system to capture three distinct dynamical regimes, namely, left turn, right turn and waggle (Oh et al., 2008). We follow this strategy, and model each time series by a switching linear dynamical system (SLDS) (Fox et al., 2009) as follows: