The experiment is just as important as the likelihood in understanding the prior: A cautionary note on robust cognitive modelling

by Lauren Kennedy, et al.

Cognitive modelling shares many features with statistical modelling, making it seem trivial to borrow from the practices of robust Bayesian statistics to protect the practice of robust cognitive modelling. We take one aspect of statistical workflow, prior predictive checks, and explore how they might be applied to a cognitive modelling task. We find that the likelihood alone is not enough to interpret the priors; we also need to incorporate information about the experiment. This suggests that while cognitive modelling might borrow from statistical practices, especially workflow, care must be taken to make the necessary adaptations.








1 Visualize priors and complex likelihoods with prior predictive checks

Cognitive models are designed to postulate a generative process for the complex procedures that occur within the brain. Given this complexity, it is hardly surprising that this generative process is often far more complicated than most models proposed for purely data-driven statistical analyses. However, modelling of data and modelling of the brain do share some commonalities, and we believe these commonalities (e.g., the prevalence of a Bayesian framework, the desire to interpret particular parameters, and the need to compare models) suggest the potential application of robust statistical modelling practices within a cognitive modelling framework.

2 Example: Balloon Analogue Risk Task

We frame the remainder of this comment in the context of an illustrative example, a task designed to investigate risk aversion through the trade-off between choosing to pump a balloon (increasing the expected reward for a given trial, provided pumping does not pop the balloon) and cashing out a trial. For simplicity, we use the model described in lee2014bayesian but originally proposed by van2011cognitive, reproduced in Figure 1 (left panel) for the reader's convenience. For now we focus on the non-hierarchical model in panel (a); the hierarchical model explores the difference between sober, tipsy, and drunk conditions.



(a) Non-hierarchical model




(b) Hierarchical model
Figure 1: Non-hierarchical and hierarchical models of the Balloon Analogue Risk Task. Reproduced from lee2014bayesian, including the use of Uniform priors, which would not be our first choice.

In this model the observed variables are p, the probability the balloon will pop, and the decision made by the individual on a given trial (i.e., pump or take the reward). The remaining parameters control the expected number of pumps (γ⁺) and the between-trial variation (β).

2.1 Prior Predictive Checks

Previous advice on specifying priors has encouraged the use of “uninformative priors” to prevent researcher bias (through prior selection) from impacting the model. However, work by gabry2019visualization demonstrates that uninformative priors, when interpreted in terms of the likelihood, can actually be very informative. In fact, gelman2017prior argue that “The prior can often only be understood in the context of the likelihood”, which is already filtering through to cognitive science PrincipledCognitiveScience, albeit in contexts where mixed effects (or multilevel) models are appropriate.

However, in cognitive modelling, we argue that more than the likelihood is needed to understand the prior. We draw the reader’s attention in particular to the uniform hyperpriors at the highest level of both the non-hierarchical and hierarchical versions of the model. The use of uniform priors at the highest level of the hierarchy is common, as it is assumed that these parameters will be well informed by the observed data and hence need little prior regularization. However, diffuse priors on parameters at lower levels of the hierarchy can be surprisingly informative once they filter through the likelihood gabry2019visualization, gelman2017prior, PrincipledCognitiveScience. One technique for understanding the priors in terms of the likelihood is the prior predictive check.

Prior predictive checks work by recalling that a Bayesian model is a generative model for the observed data (cf. a cognitive model, which is a generative model for an underlying process). This implies that simulations from the full Bayesian model (that is, simulating parameters from the prior, simulating cognitive behaviour from the cognitive model, and then simulating measurements from the observation model) should be plausible data sets. For example, they should not imply that a drunk undergraduate on their 20th BART trial is likely to pump the balloon more than 50 times. In this way, prior predictive checks can be thought of as sanity checks that the prior specification and observation model have not interacted badly with the cognitive model.
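As a concrete sketch, prior predictive draws for the non-hierarchical BART model can be generated by simulating parameters from their priors and pushing each draw through the cognitive model. The parameterization below (a target number of pumps ω = −γ⁺/log(1 − p), with pumping probability 1/(1 + exp(β(k − ω))) at opportunity k) and the Uniform(0, 10) bounds are our reading of the model attributed to van2011cognitive and lee2014bayesian; treat both as assumptions for illustration. Note that, as in the text's first check, the balloon never pops here.

```python
# Prior predictive simulation for the non-hierarchical BART model,
# using only the prior and the likelihood (the balloon never pops).
# Assumed parameterization: omega = -gamma_plus / log(1 - p_pop),
# theta_k = 1 / (1 + exp(beta * (k - omega))).
import numpy as np

rng = np.random.default_rng(1)
p_pop = 0.10  # probability the balloon pops on any pump (design quantity)

def theta(k, gamma_plus, beta):
    """Probability of choosing to pump at opportunity k."""
    omega = -gamma_plus / np.log(1.0 - p_pop)
    z = np.clip(beta * (k - omega), -700.0, 700.0)  # guard against overflow
    return 1.0 / (1.0 + np.exp(z))

def pumps_likelihood_only(gamma_plus, beta, max_pumps=10_000):
    """One trial; the trial ends only when the participant cashes out."""
    for k in range(1, max_pumps + 1):
        if rng.random() >= theta(k, gamma_plus, beta):
            return k - 1  # cashed out before pump k
    return max_pumps

# One prior draw per simulated participant, as in the 200-participant check.
draws = rng.uniform(0.0, 10.0, size=(200, 2))
pumps = [pumps_likelihood_only(g, b) for g, b in draws]
print("mean simulated pumps (likelihood only):", np.mean(pumps))
```

With wide uniform priors the target ω can approach 100 pumps, so many simulated participants pump the balloon far more than anything seen in real data.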

Two things can go wrong when the prior predictive distribution does not conform to our substantive understanding of the experiment. The lesser problem is that such models can cause difficulties for the computational engine used for inference. This is a manifestation of the Folk Theorem of statistical computing: when you have computational problems, often there’s a problem with your model FolkThm.

The more troubling problem that can occur when the prior predictive distribution puts a lot of weight on infeasible parts of the data space is that this will affect inference. In extreme cases, any data that is observed will be in conflict with the model, which will lead to poor finite-sample performance. Even in the situation where we know that asymptotically the inference will be consistent, the data will still use up a lot of its information in overcoming the poor prior specification. This will lead to underpowered inferences.

Prior predictive checks should complement the already prevalent practice of parameter recovery checks. These checks empirically assess the identifiability of model parameters by simulating data from the model using known parameters and checking that the posterior recovers these parameters when this particular experimental design is used. talts2018validating show that a combination of prior simulations and parameter recovery checks can be used to validate inference software.
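A minimal, non-Bayesian sketch of a parameter recovery check for this model: simulate choice data from known (γ⁺, β), then recover them by maximum likelihood on a grid, as a stand-in for the posterior fit the practice describes. The parameterization (ω = −γ⁺/log(1 − p), logistic pumping probability) and the "true" values are our assumptions for illustration.

```python
# Parameter recovery sketch: simulate BART choices from known parameters,
# then check that maximising the likelihood recovers them.
import numpy as np

rng = np.random.default_rng(7)
p_pop = 0.10
true_gamma, true_beta = 1.2, 0.9   # hypothetical "known" parameters

def theta(k, gamma_plus, beta):
    omega = -gamma_plus / np.log(1.0 - p_pop)
    z = np.clip(beta * (k - omega), -700.0, 700.0)
    return 1.0 / (1.0 + np.exp(z))

def simulate_trial():
    """Return (number of pumps, popped?) for one trial under the true values."""
    k = 1
    while True:
        if rng.random() >= theta(k, true_gamma, true_beta):
            return k - 1, False    # cashed out before pump k
        if rng.random() < p_pop:
            return k, True         # balloon popped on pump k
        k += 1

trials = [simulate_trial() for _ in range(300)]

# Sufficient statistics: pump / cash-out decisions observed at opportunity k.
max_k = max(n + 1 for n, _ in trials)
n_pump = np.zeros(max_k + 1)
n_cash = np.zeros(max_k + 1)
for n, popped in trials:
    n_pump[1:n + 1] += 1
    if not popped:                 # a pop censors the would-be next decision
        n_cash[n + 1] += 1

ks = np.arange(max_k + 1)

def log_lik(gamma_plus, beta):
    th = np.clip(theta(ks, gamma_plus, beta), 1e-12, 1 - 1e-12)
    return np.sum(n_pump[1:] * np.log(th[1:]) + n_cash[1:] * np.log(1 - th[1:]))

grid = np.linspace(0.1, 5.0, 50)
_, g_hat, b_hat = max((log_lik(g, b), g, b) for g in grid for b in grid)
print(f"true gamma+ = {true_gamma}, recovered = {g_hat:.2f}")
print(f"true beta   = {true_beta}, recovered = {b_hat:.2f}")
```

Note that popping censors the data (a popped trial never shows the cash-out decision), so recovery must be checked under the same experimental design used to collect the real data.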

(a) Mean number of pumps using prior checks with likelihood, prior.
(b) Mean number of pumps using prior checks with likelihood, prior and experimental design.
Figure 2: Prior checks with likelihood and likelihood plus experimental design. Here we simulated the expected number of pumps for 200 participants with different probabilities of the balloon popping (x-axis), and plot the average number of pumps per participant on the y-axis. Note that while both prior checks suggest that increasing the probability of the balloon popping decreases the expected number of pumps, the majority of simulated pumps are exceptionally high for the prior checks using only the prior and likelihood (left panel), but not for the prior checks that use the experimental design as well (right panel).

2.1.1 Non-hierarchical model

To demonstrate this, we simulate values from the priors for the non-hierarchical model for the balloon task and then use these draws to predict the expected outcome—in this case the number of pumps—using the likelihood. We simulate each trial as an independent participant to show the distribution of the expected number of balloon pumps. In Figure 2 we compare this number to the number of pumps by participant George (data available from lee2014bayesian), which reflects the number of pumps observed in published literature from this task. The prior checks suggest that the expected number of pumps can reach values far beyond anything observed; the observed numbers of pumps (their maximum marked with a red line) are far smaller.

However, in cognitive modelling there is more to interpreting the priors than merely the likelihood. In our example, part of the experimental design is that after each decision there is some probability that the balloon pops, ending the trial. This is not included in the likelihood, but is a feature of the experimental manipulation. When we incorporate this into the prior predictive checks—adding the popping probability to the likelihood—we see that the expected number of pumps is much more reasonable.
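A sketch of the design-aware check follows: the same simulator, but after every pump the balloon pops with probability p, ending the trial. The parameterization (ω = −γ⁺/log(1 − p), logistic pumping probability, Uniform(0, 10) priors) is again our assumption based on the cited model; the popping step is the experimental-design information that the likelihood alone omits.

```python
# Prior predictive simulation that incorporates the experimental design:
# after each pump the balloon may pop with probability p_pop, ending the trial.
import numpy as np

rng = np.random.default_rng(1)
p_pop = 0.10

def pumps_with_design(gamma_plus, beta, max_pumps=10_000):
    omega = -gamma_plus / np.log(1.0 - p_pop)
    for k in range(1, max_pumps + 1):
        z = np.clip(beta * (k - omega), -700.0, 700.0)
        theta_k = 1.0 / (1.0 + np.exp(z))
        if rng.random() >= theta_k:
            return k - 1   # participant cashes out before pump k
        if rng.random() < p_pop:
            return k       # balloon pops: trial over (design, not likelihood)
    return max_pumps

draws = rng.uniform(0.0, 10.0, size=(200, 2))
pumps = [pumps_with_design(g, b) for g, b in draws]
print("mean simulated pumps (with design):", np.mean(pumps))
```

Because popping truncates every trial at a Geometric(p) horizon, the simulated counts stay near or below 1/p pumps on average no matter how wide the uniform priors are, which is exactly the behaviour shown in the right panel of Figure 2.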

How can we say that the prior expected number of pumps is reasonable? The natural response is to say that before the experiment was conducted, we could not possibly have known the expected number of times participants would choose to pump the balloon. Although cognitive modelling examples are often not as stark as the one proposed by gabry2019visualization, where the priors suggested a pollution concentration so thick that human life could not survive, experimental experience suggests that a participant is exceptionally unlikely to commit to pressing a button 200 x 90 times for a limited increase in payoff (or a potential decrease in payoff) in various stages of inebriation.

Using these adapted prior checks, we can determine that while the wide uniform priors might look like they could potentially be informative, when combined with the experimental design they seem appropriate. However, we can do more. Not only do prior predictive checks help us to understand the priors, they also help us to understand how informative the data are for distinguishing between different parameter values. For example, extending the width of the priors on both parameters produces only slight differences in the tails of the distribution of the expected number of pumps.

2.1.2 Hierarchical model

We see again that prior predictive checks that also incorporate the experimental design are preferred when we consider the hierarchical model. In Figure 3 we plot the difference in mean and variance between a simulated sober condition and the two simulated intoxicated conditions. Without modifying the prior checks, the hierarchical version of this model suggests that the expected variance of the number of pumps across trials is remarkably small. This suggests that once a participant chooses a number of times to pump a balloon, they tend to commit to this number strongly across trials. Combined with the large expected number of pumps, this would be quite remarkable behaviour! However, when we consider the prior checks that incorporate experimental information, the implied behaviour seems much more reasonable.

(a) Difference in mean from simulated sober condition (traditional prior checks)
(b) Difference in mean from simulated sober condition (prior checks with experimental design)
(c) Difference in variance from simulated sober condition (traditional prior checks)
(d) Difference in variance from simulated sober condition (prior checks with experimental design)
Figure 3: Comparison of how the width of the uniform prior impacts the expected difference between conditions in both variance (top) and mean (bottom) using traditional prior checks (left), and prior checks using experimental design (right).

2.1.3 The difference between prior predictive checks for data modelling and cognitive modelling

The first time we applied the prior predictive check idea to the BART model, we did not take into account the fact that a trial would sometimes end with the balloon popping. This meant that we drew incorrect conclusions about the suitability of the priors. In hindsight, this is a consequence of the likelihood principle, which says that only the parts of the generative model that depend on the unknown parameters are needed in the likelihood in order to perform inference. However, the full generative model is needed to make predictions and, we argue, is also needed to do pre-data modelling. In this case, the probability of popping at each stage is a fixed, known quantity that is independent of all of the other parameters and is hence not in the specification given by lee2014bayesian, which means that prior predictive checks taken directly from their model specification would incorrectly lead us to conclude that the priors are quite unreasonable.

This turns out to be the fundamental challenge when adapting prior predictive checks to cognitive modelling. While in data analysis the outcome we want to predict is usually obvious from the structure of the model, in cognitive modelling it often requires further information about the context of the experiment. In order to critique the BART model we need to know that the balloon will sometimes pop. In order to fit the BART model it is not strictly necessary to know this.

3 Model comparison

One concern with potentially informative priors is that there is a carry-through impact on the reliability of model comparison techniques. We believe that in some models this could be an unintended consequence, but for our balloon example we find few differences when we simultaneously vary the widths of the uniform priors included in the model. This further suggests that modifying the prior predictive check technique has been useful. We use the George data included in lee2014bayesian and the Stan code included with this chapter. We made a few adjustments to increase computational efficiency (non-centered parameterization, use of combined functions), modified the code slightly to model different probability conditions, and made a few changes to ensure compatibility with the bridgesampling R package. Our final code is included on LK’s GitHub page (lauken13/Comment-Robust-Cognitive-Modelling-). We also randomly permuted the George data between intoxication conditions to investigate the evidence for the null.

3.1 Bayes Factors

We use the bridgesampling package bridgesampling to calculate Bayes factors comparing the hierarchical model to the non-hierarchical model. As we can see in Figure 4, while the evidence for the alternative reduces with priors of increasing width, the BF reliably suggests support for the alternative (left panel). We find similar results with permuted datasets (right panel). The BF does decrease as the widths of the uniform priors increase, but never so much as to suggest the hierarchical model is preferred when the non-hierarchical is true (right panel) or to change the conclusions from the George data (left panel).

The practical stability of the Bayes factors for this problem is related to the relative insensitivity of the prior predictive distribution to the upper bound on the uniform prior. Figure 5 shows that increasing this upper bound only slightly changes the tail of the predictive distribution. This is a demonstration of the principle that it is not so much the prior on the parameter itself that controls the behaviour of the Bayes factor as the way that prior distribution pushes through the entire model. In this case, a quite large change in the prior on a deep parameter results in only a mild change in the tail of the prior predictive distribution.
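To make the principle concrete, consider a deliberately simple conjugate toy (not the BART model): y_i ~ N(θ, 1), with θ ~ N(0, τ²) under H1 and θ = 0 under H0. Here the prior predictive scale grows directly with τ, so the Bayes factor is highly sensitive to the prior width; in the BART model, by contrast, the prior predictive barely moves with the uniform bound, which is why its Bayes factors are stable. The data and τ values below are illustrative assumptions.

```python
# Toy conjugate model showing that the Bayes factor moves only through the
# prior predictive: y_i ~ N(theta, 1); H1: theta ~ N(0, tau^2); H0: theta = 0.
# Under H1 the prior predictive for the sample mean is N(0, tau^2 + 1/n).
import numpy as np

rng = np.random.default_rng(3)
n = 25
y = rng.normal(0.4, 1.0, size=n)   # illustrative data, true theta = 0.4
ybar = y.mean()

def log_normal_pdf(x, var):
    return -0.5 * np.log(2 * np.pi * var) - x**2 / (2 * var)

log_m0 = log_normal_pdf(ybar, 1.0 / n)   # H0: theta fixed at 0

def log_bf(tau):
    """log marginal(H1) - log marginal(H0), via the sufficient statistic ybar."""
    return log_normal_pdf(ybar, tau**2 + 1.0 / n) - log_m0

for tau in (1.0, 5.0, 25.0):
    print(f"tau = {tau:5.1f}   log BF(H1 vs H0) = {log_bf(tau):+.2f}")
```

Widening the prior inflates the prior predictive variance τ² + 1/n, pushing mass away from any fixed sample mean and driving the Bayes factor towards H0 (the Jeffreys–Lindley effect). When, as in the BART model, the prior width barely changes the prior predictive, the Bayes factor is correspondingly stable.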

(a) George data
(b) Randomly permuted George data.
Figure 4: Effect of the width of the uniform prior on the log Bayes factor for a hierarchical model (H1; positive values favour H1) against a non-hierarchical model (H0; negative values favour H0).
(a) Prior predictive check
(b) Prior predictive checks with experimental design
Figure 5: Comparison of the width of the uniform prior (x axis) on the expected number of pumps (y axis) when the probability of the balloon popping is held constant at .10.

3.2 Leave-one-out cross validation

As an alternative to the Bayes factor approach, we also employ an approximation to leave-one-out (LOO) cross validation using the LOO package LOO. For simplicity we leave one observation out (i.e., one choice to either pump or cash out), but more appropriate uses of LOO would leave one trial or condition out. As we can see in Figure 6, LOO estimates remain consistent over priors of increasing width.

This is in line with leave-one-out cross validation being sensitive to changes in the posterior predictive distribution, which is essentially unchanged by the width of the uniform priors.
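This insensitivity is easy to see in a simple conjugate toy (y_i ~ N(θ, 1), θ ~ N(0, τ²)), where the leave-one-out predictive densities are available in closed form; the exact calculation below is a stand-in for the PSIS approximation the LOO package performs. The data and τ values are illustrative assumptions.

```python
# Exact LOO in a conjugate toy: y_i ~ N(theta, 1), theta ~ N(0, tau^2).
# The leave-one-out posterior is theta | y_{-i} ~ N(mu_i, 1/prec), so the
# LOO predictive for y_i is N(mu_i, 1/prec + 1).
import numpy as np

rng = np.random.default_rng(3)
n = 25
y = rng.normal(0.4, 1.0, size=n)   # illustrative data

def elpd_loo(tau):
    prec = (n - 1) + 1.0 / tau**2          # leave-one-out posterior precision
    total = 0.0
    for i in range(n):
        mu_i = (y.sum() - y[i]) / prec     # leave-one-out posterior mean
        var = 1.0 / prec + 1.0             # posterior predictive variance
        total += -0.5 * np.log(2 * np.pi * var) - (y[i] - mu_i) ** 2 / (2 * var)
    return total

for tau in (1.0, 5.0, 25.0):
    print(f"tau = {tau:5.1f}   exact elpd_loo = {elpd_loo(tau):.3f}")
```

Because the posterior precision is dominated by the (n − 1) data term, widening the prior from τ = 1 to τ = 25 moves the LOO estimate only marginally, in contrast to the Bayes factor's strong dependence on prior width.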

(a) George data
(b) Randomly permuted George data.
Figure 6: Comparison of the width of the uniform prior using LOO for a hierarchical (H1) against a non-hierarchical model (H0). Ribbons indicate the standard error of the ELPD approximation.

3.3 What constitutes a meaningful difference?

Herein lies the problem with model comparison: if we are comparing a difference between conditions, are we hypothesizing that there is a difference between conditions for γ⁺, for β, or for both? If we were to test each one separately we would need to compare a null model against a model with γ⁺ varying, a model with β varying, and a model with both varying. With small and noisy data, what does it actually mean if we can distinguish between these models?

Given the relative flexibility of the modelling and the small range of potential scores in the actual observed data, is it possible to distinguish between small changes in γ⁺ while holding β constant, and vice versa? Another way of saying this is to ask whether the small effects in parameter estimates are due to sampler noise (i.e., MCMC error), measurement error, or actual differences in processes.

To the extent that these questions can ever be resolved, we believe that prior simulation from the generative model for the data has a role to play. These simulations can be used to work out what type of model comparison tools are useful for the problem (and experimental design) at hand. They can also be used to answer questions about what type of difference between models can be detected with the data at hand (we are deliberately avoiding the word power here because this remains a vital question even outside the Neyman-Pearson framework). Furthermore, simulation studies can and should be used to assess how these tools perform under different types of data model mis-specification. Cognitive modelling cannot be robust unless measurement error and mis-specification are properly considered.

4 Statistical tools can’t tell us what we want in practice.

We’ve shown that prior predictive checks need to be adapted to understand the practical implications of priors in the context of cognitive modelling. Similarly, model comparison tools don’t tell us what we are most interested in. Regardless of method, they are all about prediction: predicted performance on the next participant provided they are exchangeable with the ones we have seen previously, predicted performance on the next iteration given nothing has changed from the previous iterations, prediction from the sample to the population assuming the sample is representative. Model comparison tools are suited (although not all at once) to asking these questions, but they all assume some type of equivalence.

As navarro2019between notes, cognitive modelling asks a bolder question. Rather than prediction, we are often interested in extrapolation: extrapolation to different participants, extrapolation to changes in condition, extrapolation to a population that is markedly different from our young and educated sample. We are interested not in whether subject George is less likely to pump a balloon when inebriated, but rather in whether inebriation causes some cognitive change in people that is realized as a general risk aversion expressed across a number of different domains. Moreover, as navarro2019between further notes, we cannot answer these claims with a single experiment, nor should we expect to answer them with any statistical analysis of a single experiment.

We started this comment by claiming that cognitive models are similar to statistical models, with greater interpretability and complexity. If this were true, then robust cognitive modelling should borrow heavily from the statistical sciences. However, the work we present in this comment suggests that we cannot blindly apply the practices of statistics to cognitive modelling. Cognitive models share many traits with statistical models, and so we should employ prior predictive checks and model comparison tools and consider in-sample prediction, but we need to do so with adaptations.

We shouldn’t expect to fall back on traditional statistical modelling tools; instead we should strive to reach further. Lee2019 point to some avenues, but the reality is that many tools are lacking because basic assumptions, like independence in the likelihood and the notion of equivalence held so dear by statisticians, are the very assumptions that cognitive modellers want and need to violate. We find that these challenges suggest a grand and exciting future for robust cognitive modelling, and we look forward to what the future will bring.