1 Visualize priors and complex likelihoods with prior predictive checks
Cognitive models are designed to postulate a generative process for the complex procedures that occur within the brain. Given this complexity, it is hardly surprising that this generative process is often far more complicated than most models proposed for purely data-driven statistical analyses. However, modelling of data and modelling of the brain do share some commonalities, and we believe these commonalities (e.g., the prevalence of a Bayesian framework, the desire to interpret particular parameters, and the need to compare models) suggest the potential application of robust statistical modelling practices within a cognitive modelling framework.
2 Example: Balloon Analogue Risk Task
We frame the remainder of this comment in the context of an illustrative example, the Balloon Analogue Risk Task (BART), a task designed to investigate risk aversion through the trade-off between choosing to pump a balloon (increasing the expected reward for a given trial, provided pumping does not pop the balloon) and cashing out a trial. For simplicity, we use the model described in lee2014bayesian, originally proposed by van2011cognitive and reproduced in Figure 1, left panel, for the reader's convenience. For now we focus on the nonhierarchical model in that panel, but the hierarchical model explores the difference between sober, tipsy and drunk conditions.
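For readers without the figure to hand, the nonhierarchical model can be written out roughly as follows. This is a sketch based on the two-parameter specification in lee2014bayesian as we recall it; the $\mathrm{Uniform}(0, 10)$ bounds and the indexing (trial $j$, pump opportunity $k$) are assumptions that should be checked against that source.

```latex
\begin{align*}
\gamma^{+} &\sim \mathrm{Uniform}(0, 10), \\
\beta &\sim \mathrm{Uniform}(0, 10), \\
\omega &= -\frac{\gamma^{+}}{\log(1 - p)}, \\
\theta_{jk} &= \frac{1}{1 + \exp\{\beta (k - \omega)\}}, \\
d_{jk} &\sim \mathrm{Bernoulli}(\theta_{jk}).
\end{align*}
```

Here $\theta_{jk}$ is the probability of pumping at opportunity $k$. It falls below one half once $k$ exceeds $\omega$, and a larger $\beta$ makes that switch sharper.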




In this model the observed quantities are $p$, the probability the balloon will pop, and $d_{jk}$, the decision made by the individual on a given trial (i.e., pump or take reward). The remaining parameters control the expected number of pumps ($\gamma^{+}$) and the between-trial variation ($\beta$).
2.1 Prior Predictive Checks
Previous advice on specifying priors has encouraged the use of “uninformative priors” to prevent researcher bias (through prior selection) from impacting the model. However, work by gabry2019visualization demonstrates that uninformative priors, when interpreted in terms of the likelihood, can actually be very informative. In fact, gelman2017prior argue that “The prior can often only be understood in the context of the likelihood”, an idea that is already filtering through to cognitive science PrincipledCognitiveScience, albeit in contexts where mixed effects (or multilevel) models are appropriate.
However, in cognitive modelling, we argue that more is needed than the likelihood to understand the prior. We draw the reader’s attention in particular to the uniform hyperpriors at the highest level of both the nonhierarchical and hierarchical versions of the model. The use of uniform priors at the highest level of the hierarchy is common, as it is assumed that these parameters will be well informed by the observed data and hence need little prior regularization. However, diffuse priors on parameters at lower levels of the hierarchy can be surprisingly informative once they filter through the likelihood gabry2019visualization, gelman2017prior, PrincipledCognitiveScience. One technique for understanding the priors in terms of the likelihood is the prior predictive check.
Prior predictive checks work by recalling that a Bayesian model is a generative model for the observed data (cf. a cognitive model, which is a generative model for an underlying process). This implies that simulations from the full Bayesian model (that is, simulating parameters from the prior, simulating cognitive behaviour from the cognitive model, and then simulating measurements from the observation model) should be plausible data sets. For example, they should not imply that there is a low probability that a drunk undergraduate on their 20th BART trial will pump the balloon fewer than 50 times. In this way, prior predictive checks can be thought of as sanity checks that the prior specification and observation model have not interacted badly with the cognitive model.
Two things can go wrong when the prior predictive distribution does not conform to our substantive understanding of the experiment. The lesser problem is that these models can cause problems with the computational engine being used for inference. This is a manifestation of the Folk Theorem of statistical computing: when you have computational problems, often there’s a problem with your model FolkThm.
The more troubling problem that can occur when the prior predictive distribution puts a lot of weight on infeasible parts of the data space is that this will affect inference. In extreme cases, any data that is observed will be in conflict with the model, which will lead to poor finite-sample performance. Even in the situation where we know that asymptotically the inference will be consistent, the data will still use up a lot of its information in overcoming the poor prior specification. This will lead to underpowered inferences.
Prior predictive checks should complement the already prevalent practice of parameter recovery checks. These checks empirically assess the identifiability of model parameters by simulating data from the model using known parameters and checking that the posterior recovers these parameters when this particular experimental design is used. talts2018validating show that a combination of prior simulations and parameter recovery checks can be used to validate inference software.
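A parameter recovery check of the kind described above can be sketched in a few lines. The example below is illustrative only: it assumes the two-parameter BART model with Uniform(0, 10) priors, uses a brute-force grid posterior rather than Stan, and the function names, the burst probability of 0.15, and the "true" parameter values are all hypothetical choices made for this sketch.

```python
import numpy as np

def simulate_counts(gamma, beta, p_burst, n_trials, rng, max_pumps=200):
    """Simulate BART trials; return pump and cash-out counts per opportunity k."""
    omega = -gamma / np.log1p(-p_burst)
    n_pump = np.zeros(max_pumps, dtype=int)
    n_cash = np.zeros(max_pumps, dtype=int)
    for _ in range(n_trials):
        for k in range(1, max_pumps + 1):
            z = np.clip(beta * (k - omega), -700.0, 700.0)
            if rng.uniform() < 1.0 / (1.0 + np.exp(z)):
                n_pump[k - 1] += 1
                if rng.uniform() < p_burst:  # balloon pops, trial ends
                    break
            else:
                n_cash[k - 1] += 1           # participant cashes out
                break
    return n_pump, n_cash

def grid_posterior_means(n_pump, n_cash, p_burst, upper=10.0, n_grid=200):
    """Posterior means of (gamma, beta) on a grid, assuming Uniform(0, upper) priors."""
    step = upper / n_grid
    grid = np.arange(n_grid) * step + step / 2.0   # cell midpoints on (0, upper)
    used = np.nonzero(n_pump + n_cash)[0]          # opportunities with any data
    k = (used + 1)[None, None, :]
    gamma = grid[:, None, None]
    beta = grid[None, :, None]
    z = beta * (k + gamma / np.log1p(-p_burst))    # beta * (k - omega)
    log_theta = -np.logaddexp(0.0, z)              # log P(pump)
    log_one_minus = -np.logaddexp(0.0, -z)         # log P(cash out)
    log_lik = (n_pump[used] * log_theta + n_cash[used] * log_one_minus).sum(axis=-1)
    post = np.exp(log_lik - log_lik.max())         # flat prior: posterior is proportional to likelihood
    post /= post.sum()
    gamma_mean = (post.sum(axis=1) * grid).sum()
    beta_mean = (post.sum(axis=0) * grid).sum()
    return gamma_mean, beta_mean

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    counts = simulate_counts(1.2, 0.6, 0.15, 500, rng)
    print(grid_posterior_means(*counts, p_burst=0.15))
```

Note that the burst probability shapes which decisions are ever observed (it truncates trials) but does not appear in the likelihood itself, which is evaluated only on the decisions that were made.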
2.1.1 Nonhierarchical model
To demonstrate this, we simulate values from the priors for the nonhierarchical model for the balloon task and then use these simulated values to predict the expected outcome (in this case the number of pumps) using the likelihood. We simulate each trial as an independent participant to show the distribution of the expected number of balloon pumps. In Figure 2 we compare this number to the number of pumps by participant George (data available from lee2014bayesian), which reflects the number of pumps observed in published literature from this task. The prior checks suggest that the expected number of pumps could reach values far beyond the largest observed number of pumps (marked with a red line).
However, in cognitive modelling there is more to interpreting the priors than merely the likelihood. In our example, part of the experimental design is that after each decision to pump there is some probability that the balloon pops, ending the trial. This is not included in the likelihood, but is a feature of the experimental manipulation. When we incorporate this into the prior predictive checks, adding it to the likelihood, we see that the expected number of pumps is much more reasonable.
How can we say that the prior expected number of pumps is reasonable? The natural response to this is to say that before the experiment was conducted, we could not possibly have known the expected number of times the participants would choose to pump the balloon. Although cognitive modelling examples are often not as clear as the one proposed by gabry2019visualization, where the priors suggested a pollution concentration so thick that human life could not survive, experimental experience suggests that a participant is exceptionally unlikely to commit to pressing a button 200 x 90 times for a limited increase in payoff (or a potential decrease in payoff) in various stages of inebriation.
Using these adapted prior checks, we can determine that while the wide uniform priors might look like they could potentially be informative, when combined with the experimental design they seem appropriate. However, we can do more. Not only do prior predictive checks help us to understand the priors, they also help us to understand how informative the data are for distinguishing between different values of the parameters. For example, extending the upper bound of the priors on both parameters produces only slight differences in the tails of the distribution of the expected number of pumps.
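The two versions of the check described above, with and without the balloon popping, can be sketched as follows. This is a minimal illustration rather than the code used for the figures: the function name `simulate_pumps`, the Uniform(0, 10) priors, and the burst probability of 0.15 are assumptions made for this sketch, and it simulates realized pump counts (one simulated trial per prior draw) rather than an analytic expectation.

```python
import numpy as np

def simulate_pumps(n_sims, p_burst=0.15, upper=10.0, with_popping=False,
                   max_pumps=500, seed=1):
    """Prior predictive draws of the number of pumps in a single BART trial."""
    rng = np.random.default_rng(seed)
    gamma = rng.uniform(0.0, upper, n_sims)   # assumed Uniform(0, upper) prior
    beta = rng.uniform(0.0, upper, n_sims)    # assumed Uniform(0, upper) prior
    omega = -gamma / np.log1p(-p_burst)
    pumps = np.zeros(n_sims, dtype=int)
    active = np.ones(n_sims, dtype=bool)      # trials that are still running
    for k in range(1, max_pumps + 1):
        if not active.any():
            break
        z = np.clip(beta * (k - omega), -700.0, 700.0)
        theta = 1.0 / (1.0 + np.exp(z))       # probability of pumping at k
        pump = active & (rng.uniform(size=n_sims) < theta)
        pumps[pump] += 1
        active = pump                         # everyone else has cashed out
        if with_popping:
            # Experimental design: each pump pops the balloon with prob. p_burst.
            popped = rng.uniform(size=n_sims) < p_burst
            active = active & ~popped
    return pumps

if __name__ == "__main__":
    for flag in (False, True):
        draws = simulate_pumps(5000, with_popping=flag)
        print(flag, draws.mean(), np.quantile(draws, [0.5, 0.99]))
```

Comparing the upper quantiles of the two sets of draws against the largest number of pumps seen in the data gives a quick numerical counterpart to the visual check.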
2.1.2 Hierarchical model
We see again that prior predictive checks that also incorporate the experimental design are preferred when we consider the hierarchical model. In Figure 3 we plot the difference in mean and variance between a simulated sober condition and the two simulated intoxicated conditions. Without modifying the prior checks, the hierarchical version of this model suggests that the expected variance of the number of pumps across trials is remarkably small. This suggests that once a participant chooses a number of times to pump a balloon, they tend to commit to this number strongly across trials. Combined with the large expected number of trials this would be quite remarkable behaviour! However, when we consider the prior checks that incorporate experimental information, the design seems much more reasonable.
2.1.3 The difference between prior predictive checks for data modelling and cognitive modelling
The first time we applied the prior predictive check idea to the BART model we did not take into account the fact that an experiment would sometimes end with the balloon popping. This meant that we drew incorrect conclusions about the suitability of the priors. In hindsight, this is a consequence of the likelihood principle, which says that only the parts of the generative model that depend on the unknown parameters are needed in the likelihood in order to perform inference. However, the full generative model is needed to make predictions and, we argue, is also needed to do pre-data modelling. In this case, the probability of popping at each stage is a fixed, known parameter that is independent of all of the other parameters and is hence not in the specification given by lee2014bayesian. Using prior predictive checks directly from their model specification would therefore incorrectly lead us to conclude that the priors are quite unreasonable.
This turns out to be the fundamental challenge when adapting prior predictive checks to cognitive modelling. While in data analysis the outcome we want to predict is usually obvious from the structure of the model, in cognitive modelling it often requires further information about the context of the experiment. In order to critique the BART model we need to know that the balloon will sometimes pop. In order to fit the BART model it is not strictly necessary to know this.
3 Model comparison
One concern with potentially informative priors is that there is a carry-through impact on the reliability of model comparison techniques. We believe that in some models this could be an unintended consequence, but for our balloon example we find that there are few differences when we simultaneously vary the width of the uniform priors included in the model. This further suggests that modifying the prior predictive checks technique has been useful. We use the George data included in lee2014bayesian and the Stan code included with this chapter. We made a few adjustments to increase computational efficiency (noncentered parameterization, use of combined functions), modified the code slightly to model different probability conditions, and made a few changes to ensure compatibility with the bridgesampling R package. Our final code is included on LK’s GitHub page (lauken13/CommentRobustCognitiveModelling). We also randomly permuted the George data between intoxication conditions to investigate the evidence for the null.
3.1 Bayes Factors
We use the bridgesampling package bridgesampling to calculate Bayes factors (BF) for the hierarchical model compared with the nonhierarchical model. As we can see in Figure 4, while the evidence for the alternative reduces with priors of increasing width, the BF reliably suggests support for the alternative (left panel). We find similar results with permuted datasets (right panel). The BF does decrease with an increase in the widths of the uniform priors, but never so much as to suggest the hierarchical model is preferred when the nonhierarchical model is true (right panel) or to change the conclusions from the George data (left panel).
The practical stability of the Bayes factors for this problem is related to the relative insensitivity of the prior predictive distribution to the upper bound on the uniform prior. Figure 5 shows that increasing this upper bound only slightly changes the tail of the predictive distribution. This is a demonstration of the principle that it is not so much the prior on the parameter that controls the behaviour of the Bayes factor as the way that prior distribution pushes through the entire model. In this case, a quite large change in the prior on a deep parameter results in only a mild change in the tail of the prior predictive distribution.
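One way to see why the width of a uniform prior can move a Bayes factor is to write out the marginal likelihood for a single parameter $\theta$ with a $\mathrm{Uniform}(0, b)$ prior. This is a simplified one-parameter sketch, not the BART model itself; $\theta$, $b$ and $b_0$ are notation introduced only for this illustration.

```latex
\begin{align*}
\mathrm{BF}_{10} &= \frac{p(y \mid H_1)}{p(y \mid H_0)}, \\
p(y \mid H, b) &= \int_0^{b} p(y \mid \theta)\, \frac{1}{b}\, d\theta
  \approx \frac{1}{b} \int_0^{b_0} p(y \mid \theta)\, d\theta
  \quad \text{when } p(y \mid \theta) \approx 0 \text{ for } \theta > b_0 \text{ and } b \ge b_0 .
\end{align*}
```

Under that condition, widening the prior beyond $b_0$ scales the marginal likelihood by $1/b$. Because $p(y \mid H, b)$ is the prior predictive density evaluated at the observed data, a change in $b$ that barely moves the prior predictive distribution near the data barely moves the Bayes factor.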
3.2 Leave-one-out cross-validation
As an alternative to the Bayes factor approach, we also employ an approximation to leave-one-out (LOO) cross-validation using the loo package LOO. For simplicity we leave one observation out (i.e., one choice to either pump or cash out), but more appropriate uses of LOO would leave one trial or condition out. As we can see in Figure 6, LOO estimates remain consistent over priors of increasing width.
This is in line with leave-one-out cross-validation being sensitive to changes in the posterior predictive distribution, which is essentially unchanged by the width of the uniform priors.
Figure 6: Comparison of the width of the uniform prior using LOO for a hierarchical (H1) against a nonhierarchical model (H0). Ribbons indicate the standard error of the ELPD approximation.
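To make the leave-one-observation-out quantity concrete, the sketch below computes an importance-sampling estimate of the expected log pointwise predictive density (ELPD) from a matrix of pointwise log-likelihoods. It is a deliberately simplified stand-in: it uses raw importance weights without the Pareto smoothing that the loo package applies, and the function name `is_loo_elpd` and the input layout (posterior draws in rows, observations in columns) are assumptions of this sketch.

```python
import numpy as np

def is_loo_elpd(log_lik):
    """Unsmoothed importance-sampling LOO.

    log_lik: array of shape (n_draws, n_obs) holding log p(y_i | theta_s).
    Returns the ELPD estimate and its standard error.
    """
    log_lik = np.asarray(log_lik, dtype=float)
    n_draws, n_obs = log_lik.shape
    # p(y_i | y_-i) is approximated by the harmonic mean of p(y_i | theta_s):
    # log p = log(S) - logsumexp_s(-log_lik[s, i])
    pointwise = np.log(n_draws) - np.logaddexp.reduce(-log_lik, axis=0)
    elpd = pointwise.sum()
    se = np.sqrt(n_obs * np.var(pointwise, ddof=1))
    return elpd, se

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fake_log_lik = rng.normal(-0.7, 0.1, size=(1000, 50))  # placeholder values
    print(is_loo_elpd(fake_log_lik))
```

Leaving out a whole trial or condition, as suggested above, would instead require summing the log-likelihood columns within each trial or condition before applying the same formula, and raw importance weights are likely to be less stable in that setting.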
3.3 What constitutes a meaningful difference?
Herein lies the problem with model comparison: if we are comparing a difference between conditions, are we hypothesizing that there is a difference between conditions in the parameter controlling the expected number of pumps ($\gamma^{+}$) or in the between-trial variation ($\beta$)? If we were to test each one separately, we would need to compare a null model against a model with $\gamma^{+}$ varying, a model with $\beta$ varying, and a model with both varying. With small and noisy data, what does it actually mean if we can distinguish between these models?
Given the relative flexibility of the modelling and the small range of potential scores in the actual observed data, is it possible to distinguish small changes in one parameter while the other is held constant? Another way of saying this is to question whether the small effects in parameter estimates are due to sampler noise (i.e., MCMC error), measurement error, or actual differences in processes.
To the extent that these questions can ever be resolved, we believe that prior simulation from the generative model for the data has a role to play. These simulations can be used to work out what type of model comparison tools are useful for the problem (and experimental design) at hand. They can also be used to answer questions about what type of difference between models can be detected with the data at hand (we are deliberately avoiding the word power here because this remains a vital question even outside the Neyman-Pearson framework). Furthermore, simulation studies can and should be used to assess how these tools perform under different types of data model misspecification. Cognitive modelling cannot be robust unless measurement error and misspecification are properly considered.
4 Statistical tools can’t tell us what we want in practice
We’ve shown that prior predictive checks need to be adapted to understand the practical implications of priors in the context of cognitive modelling. Similarly, model comparison tools don’t tell us what we are most interested in. Regardless of method, they are all about prediction: predicted performance on the next participant provided they are exchangeable with the ones we have seen previously, predicted performance on the next iteration given nothing has changed from the previous iterations, prediction from the sample to the population assuming the sample is representative. Model comparison tools are suited (although not all at once) to asking these questions, but they all assume some type of equivalence.
As navarro2019between notes, cognitive modelling asks a bolder question. Rather than prediction, we are often interested in extrapolation. Extrapolation to different participants, extrapolation to changes in condition, extrapolation to a population that is markedly different from our young and educated sample. We are interested not in whether subject George is less likely to pump a balloon given inebriation, but rather in whether inebriation causes some cognitive change in people that is realized in a general risk aversion that is expressed in a number of different domains. Moreover, as navarro2019between further notes, we cannot settle these claims with a single experiment, nor should we expect to settle them with any statistical analysis of a single experiment.
We started this comment by claiming that cognitive models are similar to statistical models with greater interpretability and complexity. If this were true then robust cognitive modelling should borrow heavily from the statistical sciences. However, the work we present in this comment suggests that we cannot blindly apply the practices of statistics to cognitive modelling. Cognitive models share many traits of statistical models, and so we should employ prior predictive checks and model comparison tools and consider in-sample prediction, but we need to do so with adaptations.
We shouldn’t expect to fall back on traditional statistical modelling tools but instead we should strive to reach further. Lee2019 point to some avenues, but the reality is many tools are lacking because the basic assumptions like independence of the likelihood and the notion of equivalence held so dear to statisticians are the very assumptions that cognitive modellers want and need to violate. We find these challenges suggest a grand and exciting future for robust cognitive modelling and look forward to what the future will bring.