Material for the research project on information-theoretic model check
The choice of model class is fundamental in statistical learning and system identification, no matter whether the class is derived from physical principles or is a generic black-box. We develop a method to evaluate the specified model class by assessing its capability of reproducing data that is similar to the observed data record. This model check is based on the information-theoretic properties of models viewed as data generators and is applicable to, e.g., sequential data and nonlinear dynamical models. The method can be understood as a specific two-sided posterior predictive test. We apply the information-theoretic model check to both synthetic and real data and compare it with a classical whiteness test.
Parametric statistical inference often begins with the choice of a model class which is used to describe an unknown data-generating process. In system identification and sequential data analysis, we obtain a sequence of dependent samples from this process. A classical problem has been to assess whether the unknown process is contained in the proposed model class, usually relying on large-sample results (White, 1982). In many real-world applications, however, we only have a limited data record and we expect the model class to be misspecified in some respect. A more relevant question would then be: how consistent is the model class with the observed data?
A classical means of assessing a model is through its residuals or prediction errors. For linear dynamic models, for example, one can check whether the prediction errors constitute a white noise process, cf. Ljung and Box (1978) and Söderström and Stoica (1989, ch. 11). If they do, the errors reflect an irreducible component of the data-generating process and the model class is deemed consistent with the data. The whiteness statistic, however, only assesses the maximum likelihood model in the proposed model class, and it is not trivial to extend to more general models, such as nonlinear state-space models.
In this paper, we develop a method to quantify the consistency of more general model classes with observed data. It follows the principle that ‘if the model fits, then replicated data generated under the model should look similar to observed data’ (Gelman et al., 2014, p. 143). The proposed measure targets those models in the class that provide the best approximation of the data-generating process, and it corresponds to a p-value in a portmanteau test, in that no alternative model classes have to be specified. The method is based on an information-theoretic formulation that quantifies the ‘similarity’ between observed data and data generated by the model. For this reason, we will refer to our method as an information-theoretic model check (Itmc). From a Bayesian perspective, the Itmc can be understood as a posterior predictive test (e.g., Rubin, 1984; Gelman et al., 1996; Gelman et al., 2014, Section 6.3) using a test statistic with two tail events. We will, however, motivate our contribution from a more general perspective centered on the best models in a specified model class.
For models with latent variables (such as state-space models) another method was recently proposed by Sohan et al. (2017), with the idea of assessing the properties of the posterior distribution for the latent variables (i.e., the state smoothing distribution for state-space models). We will compare the proposed Itmc to this, as well as to the Ljung-Box method.
Notation: For notational convenience, we denote a random sequence as Y and distinguish it from the observed data y in the same space. We denote the number of data samples in the sequence as N, and make no assumptions about independence among the samples. Possible known exogenous input signals are omitted from the notation.
Let p0 denote an unknown data-generating process, from which we have obtained a data sequence y. We are interested in a mathematical description of p0, using the knowledge provided to us via y. Such inference begins by specifying a class of models, each of which corresponds to a possible data-generating distribution p_θ indexed by a parameter θ. We denote such a parametric model class

M = { p_θ : θ ∈ Θ },

where Θ denotes the parameter space. Our aim is to assess the consistency of this class with respect to the observed data y. Specifically, we will assess whether the models in M that best approximate p0 could generate sequences similar to the observed sequence y.
Using the Kullback-Leibler divergence (Kullback, 1959) as our metric, we index the minimum divergence models by

θ* ∈ arg min_{θ ∈ Θ} E[ ln p0(Y) − ln p_θ(Y) ],    (2)

where the expectation is over Y ~ p0. When the minimum divergence is 0, the model class is well-specified. That is, p_θ* and p0 generate statistically indistinguishable sequences Y. The model divergence, however, depends on the unknown data-generating process p0 and can therefore not be computed directly. Our starting point is therefore to assess how probable it is that the best models in the class, p_θ*, could generate sequences that are similar to the y observed from p0.
Consider a specific model p_θ in M, which generates a random sequence

Y ~ p_θ.

We expect certain types of generated sequences to be more probable than others (Cover and Thomas, 2012; Gray, 2011), and this information-theoretic principle enables us to assess whether the model could generate Y similar to y. We use the self-information or ‘surprisal’ of observing Y,

S_θ(Y) = − ln p_θ(Y),    (3)

as a test statistic. Since Y is random, the surprisal is a random variable and its expected value is the entropy of the model. Based on this quantity, we define a set of sequences that are equivalent to y,

{ Y : S_θ(Y) = S_θ(y) },

i.e., a set where each sequence has the same surprisal as y. This is illustrated by the dashed contour in Figure 1.
Now consider two tail events: Y has either higher or lower surprisal than y. When the probabilities of both events are balanced, the surprisal of y is within the typical range of surprisals for sequences generated by the model p_θ. We can define a p-value for the two tail events:

p(θ) = 2 min{ Pr[ S_θ(Y) ≥ S_θ(y) ], Pr[ S_θ(Y) ≤ S_θ(y) ] }.    (4)

Thus, when p(θ) is very low, p_θ tends to generate sequences with very different levels of surprisal compared to the actual observation y.
Remark 1: The model divergence in (2) equals the average difference in surprisal of Y under the model and under the data-generating process.
Remark 2: For models generating Y in i.i.d. blocks, the sets of sequences with equal surprisal are related to the notion of typical sets, cf. Cover and Thomas (2012).
Remark 3: The p-value for the Ljung-Box method is based on a one-tail event, unlike (4).
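For a single fixed model, the two-sided p-value (4) is straightforward to estimate by Monte Carlo, given that we can simulate from and evaluate p_θ. A minimal sketch follows; the function name and interface are illustrative choices, not the paper's implementation:

```python
import numpy as np

def surprisal_p_value(log_lik, y, sample, n_rep=1000, seed=None):
    """Monte Carlo estimate of the two-sided surprisal p-value for one model.

    log_lik(seq) -> log p_theta(seq);  sample(rng) -> one replicate Y ~ p_theta.
    """
    rng = np.random.default_rng(seed)
    s_obs = -log_lik(y)                                    # surprisal of observed y
    s_rep = np.array([-log_lik(sample(rng)) for _ in range(n_rep)])
    hi = np.mean(s_rep >= s_obs)                           # Pr[S(Y) >= S(y)]
    lo = np.mean(s_rep <= s_obs)                           # Pr[S(Y) <= S(y)]
    return 2.0 * min(hi, lo)                               # two tail events

# Illustrative i.i.d. N(0,1) model: a sequence of all zeros has atypically
# low surprisal, so the check flags it even though each sample is 'probable'.
n = 100
log_lik = lambda seq: float(-0.5 * np.sum(seq**2) - 0.5 * n * np.log(2 * np.pi))
sample = lambda r: r.standard_normal(n)
```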
Since the minimum divergence parameter θ* is unknown, we average the p-value (4) over a distribution of candidate models,

p̂ = ∫ p(θ) w(θ) dθ,    (5)

where the weights w(θ) are high for models in the neighbourhood of θ*. The weights are normalized and defined as

w(θ) ∝ w0(θ) p_θ(y),

which reflects the uncertainty about where θ* is located in the parameter space (Casella and Berger, 2002; Bissiri and Walker, 2012). The initial weights w0(θ) are defined using one of the following approaches when θ* is unique:
We refer to (5), with either of the choices for the initial weights w0, as the information-theoretic model check (Itmc). When p̂ is close to zero, it is highly improbable that the minimum divergence model can generate sequences that are similar to y. In this case, the specified model class is highly inconsistent with the data.
As the weights concentrate around θ*, the dispersion of the p-values p(θ) in this neighbourhood, i.e.,

σ̂² = ∫ ( p(θ) − p̂ )² w(θ) dθ,

is reduced. Thus an interval p̂ ± σ̂ quantifies the confidence in the consistency assessment of M.
Remark 1: In the context of single or independent data, Box (1980) considered a statistic for one-tail events using a model averaged over the entire model class, i.e., the prior predictive distribution. This approach does not target the minimum divergence models of interest here, cf. Rubin (1984, Section 5.3) for a more elaborate discussion.¹

¹ In the non-trivial case of a state-space model, which we will illustrate in Section 3.3, we actually consider the ‘posterior’ for the unknown parameters but the ‘prior’ for the equally unknown latent states. This is a natural choice for state-space models, and can again be motivated from the discussion by Rubin (1984) on what should be considered a replication for the problem at hand.
To compute p̂ in (5), several integrals that are not available in closed form have to be evaluated. These can be approximated numerically using a straightforward Monte Carlo implementation, as outlined in Algorithm 1. With K being the number of θ-samples from w(θ) and L the number of Y-samples drawn from p_θ for each θ-sample, it has a computational complexity on the order of O(KL). We have found satisfactory performance of this implementation with only moderate values of K and L.
Here we assume that new data can be simulated from p_θ, and that samples can be drawn from w(θ). The former assumption is natural for many model classes, and methods such as importance sampling or Markov chain Monte Carlo can be used for the latter if needed (as will be done in Section 3.3). We further assume that p_θ(y) is available for point-wise evaluation, which is the case for many (but not all) models. For general state-space models (as in Section 3.3) we can only estimate this quantity using, e.g., a particle filter (Doucet and Johansen, 2011).
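The nested Monte Carlo scheme can be sketched as follows. The interface (sample_theta, simulate, log_lik) and the returned dispersion are illustrative choices for exposition, not the paper's exact Algorithm 1:

```python
import numpy as np

def itmc(y, sample_theta, simulate, log_lik, K=50, L=200, seed=0):
    """Nested Monte Carlo sketch of the averaged p-value: draw K parameter
    samples from the weights, estimate a two-sided surprisal p-value for each
    using L replicated sequences, and return the average and its dispersion.
    Cost is O(K * L) model simulations and likelihood evaluations.
    """
    rng = np.random.default_rng(seed)
    p_vals = np.empty(K)
    for k in range(K):
        theta = sample_theta(rng)                  # theta ~ w(theta)
        s_obs = -log_lik(theta, y)                 # surprisal of observed data
        s_rep = np.array([-log_lik(theta, simulate(theta, rng))
                          for _ in range(L)])      # surprisals of replicates
        hi = np.mean(s_rep >= s_obs)
        lo = np.mean(s_rep <= s_obs)
        p_vals[k] = 2.0 * min(hi, lo)
    return float(p_vals.mean()), float(p_vals.std())
```

As a sanity check, with a conjugate Gaussian mean model (an assumed toy example, not from the paper), data whose variance badly contradicts the model class drives the averaged p-value toward zero.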
We will first consider a series of examples with synthetic data. The behavior of the model check is illustrated and then compared to the well-established Ljung-Box method. Finally, the method is applied to a non-trivial real-data system identification problem. All source code is available via GitHub: https://github.com/saerdna-se/itmc.
In the interest of clarity, we consider the simple class of AR(1)-models,

y_t = θ y_{t−1} + e_t,    e_t white Gaussian noise,    (8)

throughout this section, with the auto-regression coefficient θ as the only unknown parameter. For the sake of generality, but of lesser significance herein, we consider initial weights w0(θ) taken as a zero-mean Gaussian with unit variance, to reflect our initial beliefs about the location of the minimum divergence model in the parameter space. The following two cases are studied:
Case i: The data generated by (8) itself, for a fixed true value of θ.
Case ii: The data generated by a saturated version of (8), namely

y_t = sat( θ y_{t−1} ) + e_t,    (9)

where sat(·) truncates its argument at a fixed level.
The model class (8) is thus well-specified for Case i, but misspecified for Case ii. For illustration, we compute the Itmc (and its dispersion, using Algorithm 1 with moderate values of K and L) as data samples are added sequentially up to the full record length N. The results are plotted in Figure 2, and we see that the Itmc indicates that the model class is not consistent with the data in Case ii.
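The two data-generating processes above can be simulated as follows; the coefficient and saturation level used here are illustrative guesses, since the paper's exact values are not reproduced in this extract:

```python
import numpy as np

def simulate_ar1(n, theta=0.5, sigma=1.0, sat=None, seed=None):
    """Simulate y_t = theta*y_{t-1} + e_t (Case i). If `sat` is given, the
    auto-regressive part is saturated at +/- sat before the noise is added
    (Case ii); the AR(1) model class is then misspecified for the data."""
    rng = np.random.default_rng(seed)
    y = np.zeros(n)
    for t in range(1, n):
        m = theta * y[t - 1]
        if sat is not None:
            m = float(np.clip(m, -sat, sat))   # saturated version of (8)
        y[t] = m + sigma * rng.standard_normal()
    return y
```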
In this section, we will evaluate the performance of the proposed Itmc by repeating all experiments 100 times each. We also compare all results to the well-established Ljung-Box method (Ljung and Box, 1978), which builds on the test statistic

Q = N(N + 2) Σ_{k=1}^{h} r_k² / (N − k),    (10)

where r_k is the lag-k sample autocorrelation of the prediction errors of an estimated model, typically obtained using the maximum likelihood method. If the minimum divergence model yields white residuals (the null hypothesis), Q follows a χ² distribution with h degrees of freedom. From this one can construct a p-value to measure the evidence against the null hypothesis. A similar method is formulated in Söderström and Stoica (1989, ch. 11). Unlike the Itmc, the Ljung-Box method concentrates on a single estimated model in a linear model class.
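The Ljung-Box statistic (10) is simple to compute from a residual sequence. A sketch (the `n_params` correction to the degrees of freedom is the standard adjustment when the residuals come from an estimated model):

```python
import numpy as np
from scipy import stats

def ljung_box(e, h=10, n_params=0):
    """Ljung-Box whiteness test: Q = N(N+2) * sum_{k=1..h} r_k^2 / (N-k),
    with r_k the lag-k sample autocorrelation of the residuals e. Under the
    whiteness null hypothesis, Q is approximately chi-squared distributed
    with h - n_params degrees of freedom."""
    e = np.asarray(e, dtype=float)
    e = e - e.mean()
    n = len(e)
    denom = float(np.dot(e, e))
    r = np.array([np.dot(e[:-k], e[k:]) / denom for k in range(1, h + 1)])
    q = n * (n + 2.0) * float(np.sum(r**2 / (n - np.arange(1, h + 1))))
    p = float(stats.chi2.sf(q, df=h - n_params))
    return q, p
```

White residuals give a roughly uniform p-value, while strongly autocorrelated residuals (e.g., a random walk) give a p-value near zero.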
We start by repeating the well-specified Case i from the previous section. For each of several batch sizes N, we simulate 100 such data sets, and apply the Itmc and the Ljung-Box method (with a fixed number of lags h). The results are shown as histograms in Figure 3. For both methods the histograms are nearly uniform, similar to classical p-values, which have a uniform distribution under the null hypothesis.
We now consider the misspecified Case ii, together with another misspecified Case iii, where the data is generated by an AR(2)-model,

y_t = θ₁ y_{t−1} + θ₂ y_{t−2} + e_t.    (11)

Note that in Case iii, the model misspecification is less severe than in Case ii: the best AR(1) model is more likely to generate data similar to (11) than to (9). As in the well-specified case, we apply the Itmc along with the Ljung-Box method, and report the results for Cases ii and iii in Figures 4 and 5, respectively.
The results indicate that when N becomes very large, both the Itmc and the Ljung-Box method ultimately find M to be inconsistent with the data in Cases ii and iii. For the Itmc there is a noticeable difference between the two cases, in that it takes fewer samples to find M to be inconsistent in the more severely misspecified Case ii. For the Ljung-Box method, which targets a different property of the model class, the behavior for Cases ii and iii is reversed.
For simplicity, we have so far assumed the noise variances to be known. In most applications, however, this is not the case. To illustrate that the Itmc, in contrast with the Ljung-Box method, also indicates incorrectly specified variances, we consider data sets generated by (8) but with a noise variance smaller than the one assumed, still (erroneously) taking (8) as the model class (Case iv), as well as the opposite (Case v), with data generated with a noise variance larger than the one the model class assumes. We report these results in Figure 6. As expected, the Ljung-Box method does not register any problems with M, since it is constructed as a pure whiteness test, in contrast to the Itmc. It is, however, more natural to assume that the noise variance is one of the unknown parameters in the model class M. This will indeed be the case in the next example.
Cascaded water tanks, as shown in Figure 7, are well-studied in the automatic control and system identification community. A commonly used first-principles state-space model (13) describes the water levels with outflows proportional to the levels themselves, where the coefficients have a physical interpretation and have to be identified along with the variances of the white process and measurement noise. At the nonlinear system identification workshop in Brussels 2016, it was pointed out by Holmes et al. (2016) that this model relies on an improbable assumption of laminar flow. A better model of the system (14), grounded in fluid dynamics, instead lets the outflow of each tank be proportional to the square root of its water level, introducing additional parameters that have to be identified as well. Indeed, after parameter estimation, the predictive performance of (14) on test data is in general superior to that of the original model (13). To complicate the picture slightly, the data set provided by Schoukens and Noël (2017) also contains events with overflow, for which the model was extended as well; see Holmes et al. (2016).
Our interest in this paper, however, is to answer whether the extended model class (14) describes the true data-generating process well, or if there is more room for modelling improvements. We consider a discrete-time formulation of (14), with the extension to also account for overflow, as our model class M, in which the noise is assumed to be Gaussian. All unknown parameters, including the noise variance, are contained in θ.
We evaluate the model with respect to a real data set that contains exogenous input signals. For the sake of illustration, we use the same inputs to also generate six synthetic data sets from M with physically reasonable values of the unknown parameters. The real and synthetic data sets are shown together in Figure 7(a) and exhibit visually similar characteristics. Intuitively, the proposed model check probes the question: “is there a statistically meaningful difference between the gray data set (true) and the colored ones (synthetic)?”
The Itmc is implemented using a particle filter to estimate the likelihood p_θ(y) and particle Gibbs with ancestor sampling (Lindsten et al., 2014) to draw samples of θ. The results for the real data as well as the synthetic data are found in Figure 8, which suggests that M does not contain a minimum divergence model that is consistent with the observed data. Thus there is scope for improved modelling.
A naive attempt to apply the Ljung-Box method – which originally is not intended for nonlinear state-space models – failed to produce any meaningful results. The method recently proposed by Sohan et al. (2017) was also applied, which amounts to:
draw one parameter sample θ from its posterior,
draw one state trajectory sample from the state smoothing distribution,
analyze whether the corresponding process noise realization follows the Gaussian noise assumption using a z-test.
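The last of these steps can be sketched as follows. The function takes one sampled state trajectory and the transition mean as given (in practice the trajectory comes from a smoother, e.g., a particle Gibbs draw); the names and the linear toy model in the test are illustrative assumptions:

```python
import numpy as np
from math import erfc, sqrt

def process_noise_ztest(x, f, q):
    """Recover the process-noise realization w_t = x_{t+1} - f(x_t) implied by
    one sampled state trajectory x, and z-test whether its sample mean is
    consistent with zero-mean Gaussian noise of variance q."""
    x = np.asarray(x, dtype=float)
    w = x[1:] - f(x[:-1])                      # implied noise realization
    n = len(w)
    z = sqrt(n) * float(np.mean(w)) / sqrt(q)  # standardized sample mean
    p = erfc(abs(z) / sqrt(2.0))               # two-sided normal p-value
    return z, p
```

Note that the result depends on the particular trajectory sample, which is one source of the run-to-run variability discussed below.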
To implement this, we again used particle filters and particle Gibbs with ancestor sampling. Since the procedure relies on a single random sample from the smoothing distribution, its result varies dramatically each time it is computed, but on average it indicates no difference between the real data and the synthetic data, contrary to the result of the Itmc. There is no claim that the Itmc and the method by Sohan et al. (2017) should be equivalent in any sense. The different results are therefore not unexpected, but they indicate that the Itmc is perhaps more sensitive in finding model inconsistencies with respect to the data.
We have developed a method to evaluate a specified model class by assessing its capability of reproducing data that is similar to an observed data record. Specifically, we have targeted the minimum divergence models in the class and viewed them as data generators. Using the self-information or surprisal as a test statistic, we formulated the information-theoretic model check (Itmc) as an averaged p-value which can be understood as corresponding to a specific two-sided posterior predictive test.
We have applied Itmc to both synthetic and real data and obtained promising results, also in comparison to the standard whiteness test and the recent method by Sohan et al. (2017). The usefulness of Itmc for other model classes than dynamic time-series models remains to be investigated, as well as a more thorough theoretical understanding of its behavior.
This work has been partly supported by the Swedish Research Council (VR) under contracts 2016-06079 and 621-2014-5874, as well as the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012).