 # Is My Model Flexible Enough? Information-Theoretic Model Check

The choice of model class is fundamental in statistical learning and system identification, no matter whether the class is derived from physical principles or is a generic black box. We develop a method to evaluate a specified model class by assessing its capability of reproducing data that is similar to the observed data record. This model check is based on the information-theoretic properties of models viewed as data generators and is applicable to, e.g., sequential data and nonlinear dynamical models. The method can be understood as a specific two-sided posterior predictive test. We apply the information-theoretic model check to both synthetic and real data and compare it with a classical whiteness test.

## Code Repositories

### itmc

Material for the research project on information-theoretic model check


## 1 Introduction

Parametric statistical inference often begins with the choice of a model class which is used to describe an unknown data-generating process. In system identification and sequential data analysis, we obtain a sequence of dependent samples from this process. A classical problem has been to assess whether the unknown process is contained in the proposed model class, usually relying on large-sample results (White, 1982). In many real-world applications, however, we only have a limited data record and we expect the model class to be misspecified in some respect. A more relevant question would then be: how consistent is the model class with the observed data?

A classical means of assessing a model is through its residuals or prediction errors. For linear dynamic models, for example, one can check whether the prediction errors constitute a white noise process, cf. Ljung and Box (1978) and Söderström and Stoica (1989, ch. 11). In such cases, the errors reflect an irreducible component of the data-generating process and the model class is deemed consistent with the data. The whiteness statistic, however, only assesses the maximum likelihood model in the proposed model class, and it is not trivial to extend whiteness statistics to more general models, such as nonlinear state-space models.

In this paper, we develop a method to quantify the consistency of more general model classes with observed data. It follows the principle that ‘if the model fits, then replicated data generated under the model should look similar to observed data’ (Gelman et al., 2014, p. 143). The proposed measure targets those models in the class that provide the best approximation of the data-generating process, and it corresponds to a p-value in a Portmanteau test, in that no alternative model classes have to be specified. The method is based on an information-theoretic formulation that quantifies the ‘similarity’ between observed data and data generated by the model. For this reason, we refer to our method as an information-theoretic model check (Itmc). From a Bayesian perspective, the Itmc can be understood as a posterior predictive test (e.g., Rubin, 1984; Gelman et al., 1996; Gelman et al., 2014, Section 6.3) using a test statistic with two tail events. We will, however, motivate our contribution from a more general perspective centered on the best models in a specified model class.

For models with latent variables (such as state-space models), another method was recently proposed by Seth et al. (2017), based on the idea of assessing properties of the posterior distribution of the latent variables (i.e., the state smoothing distribution for state-space models). We will compare the proposed Itmc to this method, as well as to the Ljung-Box method.

Notation: For notational convenience, we denote a random sequence as y and distinguish it from the observed data y–– in the same space. We denote the number of data samples in the sequence as T, and make no assumptions about independence among the samples. Possible known exogenous input signals are omitted from the notation.

## 2 Information-theoretic model check

Let p0 denote an unknown data-generating process, from which we have obtained a data sequence y––. We are interested in a mathematical description of p0, using the knowledge provided to us via y––. Such inference begins by specifying a class of models, each of which corresponds to a possible data-generating distribution indexed by a parameter θ. We denote such a parametric model class

 M = {p(y|θ) : θ ∈ Θ}, (1)

where Θ is the parameter space. Our aim is to assess the consistency of this class with respect to the observed data y––. Specifically, we will assess whether the models in M that best approximate p0 could generate sequences similar to the observed sequence y––.

Using the Kullback-Leibler divergence (Kullback, 1959) as our metric, we index the minimum divergence models by

 θ⋆ ∈ argminθ E0[ln p0(y) − ln p(y|θ)], (2)

where the minimized quantity is the model divergence and the expectation E0 is over p0. When the minimum divergence is 0, the model class is well-specified. That is, p(y|θ⋆) and p0 generate statistically indistinguishable sequences y. The model divergence, however, depends on the unknown data-generating process p0 and can therefore not be computed directly. Our starting point is therefore to assess how probable it is that the best models in the class, indexed by θ⋆, could generate sequences that are similar to the y–– observed from p0.
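To make the role of p0 in (2) concrete, here is a small numerical sketch of our own (not from the paper): a Monte Carlo estimate of the model divergence for scalar Gaussian models when the true process is, by assumption, N(0, 1). All function names are hypothetical.

```python
# Hypothetical sketch: Monte Carlo estimate of the model divergence (2)
# for scalar Gaussian models, assuming the true process p0 = N(0, 1).
import math
import random

random.seed(0)

def log_pdf_normal(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (y - mu) ** 2 / (2 * sigma ** 2)

def model_divergence(mu_theta, sigma_theta, n=100_000):
    # Average ln p0(y) - ln p(y|theta) over samples y ~ p0, as in (2).
    total = 0.0
    for _ in range(n):
        y = random.gauss(0.0, 1.0)  # y ~ p0 = N(0, 1)
        total += log_pdf_normal(y, 0.0, 1.0) - log_pdf_normal(y, mu_theta, sigma_theta)
    return total / n

# Closed form for Gaussians: KL(N(0,1) || N(mu, 1)) = mu^2 / 2.
print(model_divergence(0.0, 1.0))  # ~ 0: well-specified model
print(model_divergence(0.5, 1.0))  # ~ 0.125 = 0.5^2 / 2
```

Note that the estimator requires samples from p0, which is exactly what is unavailable in practice; this is what motivates the assessment developed next.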

### 2.1 Assessment of a model as a data generator

Consider a specific model p(y|θ) in M, which generates a random sequence:

 y∼p(y|θ).

We expect certain types of generated sequences to be more probable than others (Cover and Thomas, 2012; Gray, 2011), and this information-theoretic principle enables us to assess whether the model could generate sequences y similar to y––. We use the self-information or ‘surprisal’ of observing y,

 D(y,θ) ≜ −ln p(y|θ),

as a test statistic. Since y is random, the surprisal is a random variable and its expected value is the entropy of the model. Based on this quantity, we define a set of sequences that are equivalent to y––,

 {y : D(y,θ) = D(y––,θ)}, (3)

i.e., a set where each y has the same surprisal as y––. This is illustrated by the dashed contour in Figure 1.

Figure 1: Every fixed model p(y|θ) in the model class M (left) generates hypothetical sequences y (right). For any given y, the surprisal D(y,θ) may be greater or less than D(y––,θ). The dashed contour represents the set (3) with the same level of surprisal as y––, whereas the solid contour represents a set of sequences y with D(y,θ) ≤ D(y––,θ). A model that tends to generate sequences y with levels of surprisal similar to that of y–– is deemed to be highly consistent with the data. This principle motivates (4).

Now consider two tail events: y has either higher or lower surprisal than y––. When the probabilities of both events are balanced, the surprisal of y–– is within the typical range of surprisals for sequences generated by the model. We can define a p-value for the two tail events:

 ρ(y––,θ) = 2 min{ Pr{D(y,θ) ≥ D(y––,θ)}, Pr{D(y,θ) ≤ D(y––,θ)} }. (4)

Thus, when ρ(y––,θ) is very low, p(y|θ) tends to generate sequences with very different levels of surprisal compared to the actual observations y––.
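The two-sided p-value (4) for a single fixed model can be approximated by simulation. The sketch below is our own illustration, using the AR(1) class (8) from Section 3.1 with unit noise variance; the function names and sample sizes are hypothetical.

```python
# Hypothetical sketch of the two-sided p-value (4) for one fixed model:
# simulate sequences from p(y|theta) and compare their surprisal with
# that of the observed sequence. Here p(y|theta) is an AR(1) model.
import math
import random

random.seed(1)

def simulate_ar1(theta, T):
    y, prev = [], 0.0
    for _ in range(T):
        prev = theta * prev + random.gauss(0.0, 1.0)
        y.append(prev)
    return y

def surprisal_ar1(y, theta):
    # D(y, theta) = -ln p(y|theta) with unit-variance Gaussian innovations
    d, prev = 0.0, 0.0
    for yt in y:
        e = yt - theta * prev
        d += 0.5 * math.log(2 * math.pi) + 0.5 * e ** 2
        prev = yt
    return d

def two_sided_p(y_obs, theta, n_sim=2000):
    d_obs = surprisal_ar1(y_obs, theta)
    sims = [surprisal_ar1(simulate_ar1(theta, len(y_obs)), theta) for _ in range(n_sim)]
    upper = sum(d >= d_obs for d in sims) / n_sim
    lower = sum(d <= d_obs for d in sims) / n_sim
    return 2 * min(upper, lower)

y_obs = simulate_ar1(0.7, 200)
print(two_sided_p(y_obs, 0.7))  # typically not small: the model matches the data
print(two_sided_p(y_obs, 0.0))  # typically very small: severe mismatch
```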

Remark 1: The model divergence in (2) equals, on average, the difference in surprisal of y under the model and under the data-generating process.

Remark 2: For models generating y with i.i.d. blocks, the sets of sequences with equal surprisal are related to the notion of typical sets, cf. Cover and Thomas (2012).

Remark 3: The p-value for the Ljung-Box method is based on a one-tail event, unlike (4).

### 2.2 Averaging around minimum divergence models

Ideally, the minimum divergence models (2) would be evaluated using (4). That is, we would evaluate ρ(y––,θ⋆), where θ⋆ is unknown. In lieu of θ⋆, we define the averaged p-value

 ρ⋆(y––) = ∫ ρ(y––,θ) w(θ|y––) dθ, (5)

where the weights w(θ|y––) are high for models in the neighbourhood of θ⋆. The weights are normalized and defined as follows:

 w(θ|y––) ≜ w0(θ) p(y––|θ) / ∫ w0(θ) p(y––|θ) dθ, (6)

which reflects uncertainty about where θ⋆ is located in the parameter space (Casella and Berger, 2002; Bissiri and Walker, 2012). The initial weights w0(θ) are defined using one of the following approaches when θ⋆ is unique:

• Fisherian: By setting w0(θ) constant, the maximum weight in (6) is located at the maximum likelihood estimate, which converges to θ⋆ as T → ∞ under standard regularity conditions (Ljung and Caines, 1979).

• Bayesian: Alternatively, w0(θ) can be chosen to describe prior assumptions about the location of θ⋆. Then (6) represents the posterior distribution of the parameters. When the sequence consists of blocks that are i.i.d., the weights (6) concentrate at θ⋆ as T → ∞ (Berk, 1966).

We refer to (5), with either of the choices for w0(θ), as the information-theoretic model check (Itmc). When ρ⋆(y––) is close to zero, it is highly improbable that the minimum divergence model could generate sequences that are similar to y––. In this case, the specified model class is highly inconsistent with the data.

As the weights concentrate around θ⋆, the dispersion of ρ(y––,θ) in this neighbourhood, i.e.,

 d⋆(y––) = [ ∫ (ρ(y––,θ) − ρ⋆(y––))² w(θ|y––) dθ ]^(1/2), (7)

is reduced. Thus the interval ρ⋆(y––) ± d⋆(y––) quantifies the confidence in the consistency assessment of M.

Remark 1: In the context of single or independent data, Box (1980) considered a statistic for one-tail events using a model averaged over the entire model class, i.e., the prior predictive distribution ∫ p(y|θ) w0(θ) dθ. This approach does not target the θ⋆ of interest here; cf. Rubin (1984, Section 5.3) for a more elaborate discussion.¹

¹ In the non-trivial case of a state-space model, which we will illustrate in Section 3.3, we actually consider the ‘posterior’ for the unknown parameters but the ‘prior’ for the equally unknown latent states. This is a natural choice for state-space models, and it can again be motivated from the discussion by Rubin (1984) on what should be considered a replication for the problem at hand.

Remark 2: Gelman et al. (1996) defined a Bayesian posterior p-value for general statistics with one-tail events. The measure (5) is similar but uses a specific information-theoretic test statistic with two-tail events and also quantifies the assessment via (7).

### 2.3 Implementation

To compute ρ⋆(y––) in (5), several integrals that are not available in closed form have to be evaluated. These can be approximated numerically using a straightforward Monte Carlo implementation, as outlined in Algorithm 1. With a given number of θ-samples drawn from w(θ|y––) and, for each θ-sample, a number of simulated sequences drawn from p(y|θ), the computational complexity is of the order of the product of these two sample counts. We have found satisfactory performance of this implementation with only moderate sample counts.

Here we assume that new data y can be simulated from p(y|θ), and that samples can be drawn from w(θ|y––). The former assumption is natural for many model classes, and methods such as importance sampling or Markov chain Monte Carlo can be used for the latter if needed (as will be done in Section 3.3). We further assume that p(y|θ) is available also for point-wise evaluation, which is the case for many (but not all) models. For general state-space models (as in Section 3.3) we can only estimate this using, e.g., a particle filter (Doucet and Johansen, 2011).
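As one possible reading of such an implementation (the paper's Algorithm 1 may differ in detail), the sketch below approximates (5)-(7) for the AR(1) class of Section 3.1 by importance sampling: θ is drawn from the N(0, 1) initial weights, reweighted by the likelihood to approximate (6), and (4) is estimated by simulation for each θ-sample. All names and sample sizes are our own choices.

```python
# Hypothetical Monte Carlo sketch of the Itmc quantities (5)-(7) for an
# AR(1) model class with unit noise variance and N(0,1) initial weights.
import math
import random

random.seed(2)

def simulate_ar1(theta, T):
    y, prev = [], 0.0
    for _ in range(T):
        prev = theta * prev + random.gauss(0.0, 1.0)
        y.append(prev)
    return y

def surprisal_ar1(y, theta):
    # D(y, theta) = -ln p(y|theta) for AR(1) with unit noise variance
    d, prev = 0.0, 0.0
    for yt in y:
        e = yt - theta * prev
        d += 0.5 * math.log(2 * math.pi) + 0.5 * e ** 2
        prev = yt
    return d

def itmc(y_obs, n_theta=100, n_rep=100):
    T = len(y_obs)
    # Step 1: importance sampling approximation of the weights (6);
    # draw theta from w0 = N(0,1), reweight by the likelihood p(y_obs|theta).
    thetas = [random.gauss(0.0, 1.0) for _ in range(n_theta)]
    log_w = [-surprisal_ar1(y_obs, th) for th in thetas]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    w = [x / total for x in w]
    # Step 2: for each theta, estimate the two-sided p-value (4) by simulation.
    ps = []
    for th in thetas:
        d_obs = surprisal_ar1(y_obs, th)
        sims = [surprisal_ar1(simulate_ar1(th, T), th) for _ in range(n_rep)]
        hi = sum(d >= d_obs for d in sims) / n_rep
        lo = sum(d <= d_obs for d in sims) / n_rep
        ps.append(2 * min(hi, lo))
    # Step 3: the averaged p-value (5) and its dispersion (7).
    rho = sum(wi * pi for wi, pi in zip(w, ps))
    disp = math.sqrt(sum(wi * (pi - rho) ** 2 for wi, pi in zip(w, ps)))
    return rho, disp

y_obs = simulate_ar1(0.7, 100)
rho, disp = itmc(y_obs)
print(rho, disp)  # the averaged p-value and its dispersion
```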

## 3 Illustrations and experiments

We will first consider a series of examples with synthetic data. The behavior of the model check is illustrated and then compared to the well-established Ljung-Box method. Finally, the method is applied to a non-trivial real-data system identification problem. All source code is available via GitHub.

### 3.1 Synthetic data: illustration of method

Figure 2(a): Case i (well-specified). As t grows and data accumulates, the value of ρ⋆(y––) changes, but the overall conclusion from the Itmc is that the specified model class is consistent with y–– throughout.

In the interest of clarity, we consider the simple class of AR(1)-models,

 M={p(y|θ):yt=θyt−1+et,  et∼N(0,1)}, (8)

throughout this section, with the autoregression coefficient θ as the only unknown parameter. For the sake of generality, though of lesser significance here, we take the initial weights w0(θ) to be a zero-mean Gaussian with unit variance, reflecting our initial beliefs about the location of the minimum divergence model in the parameter space. The following two cases are studied:

• Case i: The data is generated by (8) with θ = 0.7.

• Case ii: The data is generated by a saturated version of (8), namely

 yt=(0.7yt−1+et)∨−0.3,  et∼N(0,1). (9)

The model class (8) is thus well-specified for Case i, but misspecified for Case ii. For illustration, we compute the Itmc and its dispersion using Algorithm 1 as data samples are added sequentially. The results are plotted in Figure 2, and we see that the Itmc indicates that the model class is not consistent with the data in Case ii.
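The two synthetic data generators can be sketched as follows (our own code; the function names are hypothetical, and ‘∨’ in (9) is read as the elementwise maximum, i.e., saturation from below at −0.3):

```python
# Hypothetical sketch of the synthetic data generators: the well-specified
# AR(1) process (Case i) and its saturated version (9) (Case ii).
import random

random.seed(3)

def case_i(T, theta=0.7):
    y, prev = [], 0.0
    for _ in range(T):
        prev = theta * prev + random.gauss(0.0, 1.0)
        y.append(prev)
    return y

def case_ii(T, theta=0.7, floor=-0.3):
    y, prev = [], 0.0
    for _ in range(T):
        # saturation: (theta * y + e) v -0.3, i.e. max with the floor
        prev = max(theta * prev + random.gauss(0.0, 1.0), floor)
        y.append(prev)
    return y

print(min(case_ii(1000)))  # never below -0.3 by construction
```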

### 3.2 Synthetic data: evaluation and comparison

In this section, we will evaluate the performance of the proposed Itmc by repeating all experiments 100 times each. We also compare all results to the well-established Ljung-Box method (Ljung and Box, 1978), which builds on the test statistic

 Q = T(T+2) ∑k=1..h r̂k² / (T−k), (10)

where r̂k is the lag-k sample autocorrelation of the prediction errors of an estimated model, typically obtained using the maximum likelihood method. If the minimum divergence model yields white residuals (the null hypothesis), Q asymptotically follows a χ² distribution. From this one can construct a p-value to measure the evidence against the null hypothesis. A similar method is formulated in Söderström and Stoica (1989, ch. 11). Unlike the Itmc, the Ljung-Box method concentrates on a single estimated model in a linear model class.
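For reference, the statistic (10) can be computed directly from the residual autocorrelations. The sketch below is a generic implementation of our own, not the exact code used in the experiments.

```python
# Hypothetical sketch of the Ljung-Box statistic (10) computed from a
# sequence of residuals; under the whiteness null hypothesis, Q is
# approximately chi-squared distributed with h degrees of freedom.
import math
import random

random.seed(4)

def ljung_box_q(resid, h):
    T = len(resid)
    mean = sum(resid) / T
    c0 = sum((r - mean) ** 2 for r in resid)
    q = 0.0
    for k in range(1, h + 1):
        ck = sum((resid[t] - mean) * (resid[t - k] - mean) for t in range(k, T))
        rk = ck / c0  # lag-k sample autocorrelation
        q += rk ** 2 / (T - k)
    return T * (T + 2) * q

# White residuals: Q stays small relative to the chi-squared(h) quantiles.
white = [random.gauss(0.0, 1.0) for _ in range(500)]
print(ljung_box_q(white, 10))
```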

We start by repeating the well-specified Case i from the previous section. For each of the considered batch sizes T, we simulate 100 such data sets and apply the Itmc and the Ljung-Box method. The results are shown as histograms in Figure 3. For both methods the histograms are nearly uniform, similar to classical p-values, which have a uniform distribution under the null hypothesis.

We now consider the misspecified Case ii, together with another misspecified Case iii, where the data is generated by

 yt=−0.3yt−1+0.5yt−2+et,  et∼N(0,1), (11)

i.e., an AR(2) model. Note that in Case iii the model misspecification is less severe than in Case ii: the best AR(1) model is more likely to generate data similar to (11) than to (9). As in the well-specified case, we apply the Itmc along with the Ljung-Box method, and report the results for Cases ii and iii in Figures 4 and 5, respectively.

The results indicate that when T becomes very large, both the Itmc and the Ljung-Box method ultimately find M to be inconsistent with the data in Cases ii and iii. For the Itmc there is a noticeable difference between the two cases, in that it takes fewer samples to find M inconsistent in the more severely misspecified Case ii. For the Ljung-Box method, which targets a different property of the model class, the behavior for Cases ii and iii is reversed.

For simplicity, we have so far assumed the noise variances to be known. In most applications, however, this is not the case. To illustrate that the Itmc, in contrast with the Ljung-Box method, also indicates incorrectly specified variances, we consider data sets generated by

 yt=0.7yt−1+et,  et∼N(0,0.1), (12)

still (erroneously) assuming (8) as the model class (Case iv), as well as the opposite (Case v), where the data is generated with unit noise variance but the model class assumes a variance of only 0.1. We report these results in Figure 6. As expected, the Ljung-Box method does not register any problems with M, since it is constructed as a pure whiteness test, in contrast to the Itmc. It is, however, more natural to assume that the noise variance is one of the unknown parameters in the model class M. This will indeed be the case in the next example.

### 3.3 Real data: cascaded water tank modeling

Figure 7: Cascaded water tanks. We consider the problem of modelling the behavior from the input signal (the voltage over the pump generating the inflow to the upper tank) to the output signal (the water level in the lower tank). Photo courtesy of Maarten Schoukens.

Cascaded water tanks, as shown in Figure 7, are well-studied in the automatic control and system identification community. A commonly used first-principles state-space model of the system is

 ẋ1(t) = −k1√x1(t) + k2u(t) + w1(t),
 ẋ2(t) = k1√x1(t) − k3√x2(t) + w2(t),
 y(t) = x2(t), (13)

where k1, k2, and k3 have a physical interpretation and have to be identified along with the variances of the white noise processes w1(t) and w2(t). At the nonlinear system identification workshop in Brussels 2016, it was pointed out by Holmes et al. (2016) that this model relies on an improbable assumption of laminar flow. A better model of the system, grounded in fluid dynamics, is

 ẋ1(t) = −k1√x1(t) − k4x1(t) + k2u(t) + w1(t),
 ẋ2(t) = k1√x1(t) + k4x1(t) − k3√x2(t) − k5x2(t) + w2(t),
 y(t) = x2(t), (14)

where k4 and k5 have to be identified as well. Indeed, after parameter estimation, the predictive performance of (14) on test data is in general superior to that of the original model (13). To complicate the picture slightly, the data set provided by Schoukens and Noël (2017) also contained events with overflow, for which the model was extended as well; see Holmes et al. (2016).

(a) The real data y–– (gray) together with synthetic data (colored) from different models in the model class (14) (each colored data set is generated with randomly chosen, physically reasonable parameter values). Intuitively, the Itmc assesses whether the gray data is also likely to come from that model class.

Our interest in this paper, however, is to answer whether the extended model class (14) describes the true data-generating process well, or whether there is more room for modelling improvements. We take a discrete-time formulation of (14), with the extension to also account for overflow, as our model class M, in which the noise is assumed to be Gaussian. All unknown parameters, including the noise variances, are contained in θ.

We evaluate the model class with respect to a real data set that contains exogenous input signals. For the sake of illustration, we use the same inputs to also generate six synthetic data sets from M with physically reasonable values of the unknown parameters. The real and synthetic data sets are shown together in Figure 7(a) and exhibit visually similar characteristics. Intuitively, the proposed model check probes the question: “is there a statistically meaningful difference between the gray data set (real) and the colored ones (synthetic)?”

The Itmc is implemented using a particle filter to estimate p(y––|θ) and particle Gibbs with ancestor sampling (Lindsten et al., 2014) to draw samples from w(θ|y––). The results for the real data as well as the synthetic data are found in Figure 8, which suggests that M does not contain a minimum divergence model that is consistent with the observed data. Thus there is scope for improved modelling.
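When p(y|θ) cannot be evaluated point-wise, as for the state-space model above, a bootstrap particle filter provides an unbiased estimate of the likelihood and hence an estimate of the surprisal. The sketch below uses a simple scalar model of our own choosing for illustration, not the water-tank model (14); all names are hypothetical.

```python
# Hypothetical sketch: a bootstrap particle filter estimating the
# log-likelihood ln p(y|theta) of a scalar state-space model with
# Gaussian process noise (sd sigma_w) and observation noise (sd sigma_v).
import math
import random

random.seed(5)

def log_lik_pf(ys, f, sigma_w, sigma_v, n_particles=500):
    # f: state transition mean, x_{t+1} = f(x_t) + w_t
    particles = [random.gauss(0.0, 1.0) for _ in range(n_particles)]
    ll = 0.0
    for y in ys:
        # propagate each particle through the dynamics
        particles = [f(x) + random.gauss(0.0, sigma_w) for x in particles]
        # weight by the Gaussian observation density N(y; x, sigma_v^2)
        w = [math.exp(-0.5 * ((y - x) / sigma_v) ** 2) / (sigma_v * math.sqrt(2 * math.pi))
             for x in particles]
        ll += math.log(sum(w) / n_particles)
        # multinomial resampling
        particles = random.choices(particles, weights=w, k=n_particles)
    return ll  # the surprisal estimate is -ll

ys = [0.5, -0.2, 0.1, 0.4]
print(-log_lik_pf(ys, lambda x: 0.7 * x, 0.5, 0.5))  # estimated surprisal
```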

A naive attempt to apply the Ljung-Box method, which is not originally intended for nonlinear state-space models, failed to produce any meaningful results. The method recently proposed by Seth et al. (2017) was also applied, which amounts to:

1. inferring the weights w(θ|y––),
2. drawing one parameter sample θ from w(θ|y––),
3. drawing one state trajectory sample from the state smoothing distribution,
4. analyzing whether the corresponding process noise realization follows the Gaussian noise assumption using a z-test.

To implement this, we again used particle filters and particle Gibbs with ancestor sampling. Since the procedure relies on a single random sample from w(θ|y––), the result of the method varies dramatically each time it is computed, but on average it indicates no difference between the real data and the synthetic data, contrary to the result of the Itmc. There is no claim that the Itmc and the method by Seth et al. (2017) should be equivalent in any sense, so the differing results are not unexpected; they indicate, however, that the Itmc is perhaps more sensitive in detecting model inconsistencies with respect to the data.

## 4 Conclusions

We have developed a method to evaluate a specified model class by assessing its capability of reproducing data that is similar to an observed data record. Specifically, we have targeted the minimum divergence models in the class and viewed them as data generators. Using the self-information or surprisal as a test statistic, we formulated the information-theoretic model check (Itmc) as an averaged p-value which can be understood as corresponding to a specific two-sided posterior predictive test.

We have applied the Itmc to both synthetic and real data and obtained promising results, also in comparison with the standard whiteness test and the recent method by Seth et al. (2017). The usefulness of the Itmc for model classes other than dynamic time-series models remains to be investigated, as does a more thorough theoretical understanding of its behavior.

## Acknowledgements

This work has been partly supported by the Swedish Research Council (VR) under contracts 2016-06079 and 621-2014-5874, as well as the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012).

## References

• Berk (1966) Robert H. Berk. Limiting behavior of posterior distributions when the model is incorrect. The Annals of Mathematical Statistics, 37(1):51–58, 1966.
• Bissiri and Walker (2012) Pier Giovanni Bissiri and Stephen G. Walker. Converting information into probability measures with the Kullback–Leibler divergence. Annals of the Institute of Statistical Mathematics, 64(6):1139–1160, 2012.
• Box (1980) George E. P. Box. Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society, Series A, 143(4):383–430, 1980.
• Casella and Berger (2002) George Casella and Roger L. Berger. Statistical inference. Brooks/Cole, Cengage Learning, 2 edition, 2002.
• Cover and Thomas (2012) Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.
• Doucet and Johansen (2011) Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In The Oxford Handbook of Nonlinear Filtering. Oxford University Press, 2011.
• Gelman et al. (1996) Andrew Gelman, Xiao-Li Meng, and Hal Stern. Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6:733–807, 1996.
• Gelman et al. (2014) Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC Press, 3 edition, 2014.
• Gray (2011) Robert M. Gray. Entropy and Information Theory. Springer Science & Business Media, 2011.
• Holmes et al. (2016) Geoffrey Holmes, Tim Rogers, Elisabet J. Cross, Nikolaus Dervilis, Graeme Manson, Robert J. Barthorpe, and Keith Worden. Cascaded tanks benchmark: Parametric and nonparametric identification. Presentation at Workshop on Nonlinear System Identification Benchmarks 2016, May 2016.
• Kullback (1959) Solomon Kullback. Information Theory and Statistics. Dover Publications, 1959.
• Lindsten et al. (2014) Fredrik Lindsten, Michael I. Jordan, and Thomas B. Schön. Particle Gibbs with ancestor sampling. The Journal of Machine Learning Research (JMLR), 15(1):2145–2184, 2014.
• Ljung and Box (1978) Greta Ljung and George E. P. Box. On a measure of lack of fit in time series models. Biometrika, 65(2):297–303, 1978.
• Ljung and Caines (1979) Lennart Ljung and Peter E. Caines. Asymptotic normality of prediction error estimators for approximate models. Stochastics, 3:29–46, 1979.
• Rubin (1984) Donald B. Rubin. Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151–1172, 1984.
• Schoukens and Noël (2017) Maarten Schoukens and Jean Philippe Noël. Three benchmarks addressing open challenges in nonlinear system identification. In Proceedings of the 20th World Congress of the International Federation of Automatic Control (IFAC), pages 448–453, Toulouse, France, 2017.
• Söderström and Stoica (1989) Torsten Söderström and Peter Stoica. System identification. Prentice-Hall, Inc., 1989.
• Seth et al. (2017) Sohan Seth, Iain Murray, and Christopher K. I. Williams. Model criticism in latent space. arXiv:1711.04674, 2017.
• White (1982) Halbert White. Maximum likelihood estimation of misspecified models. Econometrica, 50(1):1–25, 1982.