Amortized Bayesian Inference for Models of Cognition

05/08/2020 · by Stefan T. Radev et al.

As models of cognition grow in complexity and number of parameters, Bayesian inference with standard methods can become intractable, especially when the data-generating model is of unknown analytic form. Recent advances in simulation-based inference using specialized neural network architectures circumvent many previous problems of approximate Bayesian computation. Moreover, due to the properties of these special neural network estimators, the effort of training the networks via simulations amortizes over subsequent evaluations, which can reuse the same network for multiple datasets and across multiple researchers. However, these methods have been largely underutilized in cognitive science and psychology so far, even though they are well suited for tackling a wide variety of modeling problems. With this work, we provide a general introduction to amortized Bayesian parameter estimation and model comparison and demonstrate the applicability of the proposed methods on a well-known class of intractable response-time models.


1 Generative Models in Cognitive Science

Mathematical models formalize theories of cognition and enable the systematic investigation of cognitive processes through simulations and testable predictions. They also permit a joint analysis of behavioral and neural data, bridging a crucial gap between cognitive science and neuroscience Turner et al. (2019). Moreover, questions demanding a choice among competing cognitive theories can be resolved at the level of formal model comparison.

The generative property of such models arises from the fact that one can simulate the process of interest and study how it behaves under various conditions. More formally, consider a cognitive model which represents a theoretically plausible, potentially noisy process by which observable behavior $x$ arises from an assumed cognitive system governed by hidden parameters $\theta$ and independent, non-cognitive noise $\xi$:

$$x = g(\theta, \xi) \quad \text{with} \quad \xi \sim p(\xi) \qquad (1)$$
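To make Eq. 1 concrete, here is a minimal Python sketch of such a simulator: a hypothetical drift-diffusion-style trial generator. The function name and parameterization are illustrative, not taken from the original paper.

```python
import numpy as np

def simulate_trial(theta, rng, dt=0.001, max_steps=5000):
    """Simulate one response time trial: x = g(theta, xi).

    theta = (v, a, t0): drift, threshold, non-decision time.
    Gaussian noise xi plays the role of the non-cognitive noise in Eq. 1.
    """
    v, a, t0 = theta
    evidence = a / 2.0                       # unbiased starting point
    for step in range(1, max_steps + 1):
        xi = rng.normal()                    # xi ~ p(xi)
        evidence += v * dt + np.sqrt(dt) * xi
        if evidence >= a or evidence <= 0.0:
            choice = 1 if evidence >= a else 0
            return t0 + step * dt, choice
    return np.nan, -1                        # no boundary reached

rng = np.random.default_rng(42)
rt, choice = simulate_trial(theta=(1.0, 2.0, 0.3), rng=rng)
```

Repeated calls to such a simulator under different parameter configurations are all that simulation-based inference methods require.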

Generative models of this form have been developed in various domains throughout psychology and cognitive science, including decision making Voss et al. (2019), memory Myung et al. (2007), reinforcement learning Fontanesi et al. (2019), and risky behavior Stout et al. (2004), to name just a few. Once a model (or a set of models) of some cognitive process of interest has been formulated, the challenge becomes to perform inference on real data. We will now focus on the mathematical tools provided by Bayesian probability theory for parameter estimation and model comparison Jaynes (2003).

2 Bayesian Parameter Estimation

Bayesian parameter estimation leverages prior knowledge about reasonable parameter ranges and integrates this information with the information provided by the data to arrive at a posterior distribution over parameters. In a Bayesian context, the posterior encodes our updated belief about parameter ranges conditional on a set of observations $x$. Bayes' rule gives us the well-known analytical form of the posterior:

$$p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta)\, p(\theta)\, d\theta} \qquad (2)$$

where $p(x \mid \theta)$ represents the likelihood, that is, the distribution of the data given parameters $\theta$, and $p(\theta)$ denotes the prior, that is, the distribution of $\theta$ before observing the data. The denominator is a normalizing constant usually referred to as the marginal likelihood. Note that all distributions are also implicitly conditional on the particular generative model $M$.
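For a one-parameter model, Eq. 2 can be approximated directly on a grid. The following sketch uses an illustrative conjugate Gaussian example to make the role of the normalizing constant explicit; the model and data are toy choices.

```python
import numpy as np
from scipy import stats

# Grid approximation of Eq. 2 for a toy model: x ~ Normal(theta, 1),
# with prior theta ~ Normal(0, 1).
x = np.array([0.8, 1.2, 0.5, 1.0])                  # observed data (illustrative)
theta_grid = np.linspace(-3, 3, 1001)

likelihood = stats.norm.pdf(x[:, None], loc=theta_grid, scale=1.0).prod(axis=0)
prior = stats.norm.pdf(theta_grid, 0.0, 1.0)

unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, theta_grid)  # denominator of Eq. 2

print("posterior mean:", np.trapz(theta_grid * posterior, theta_grid))
```

Grid approximation only works in very low dimensions, which is precisely why the methods discussed below are needed for realistic cognitive models.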

Based on the obtained estimate of the posterior distribution, usually in the form of random draws from the posterior, summary statistics such as posterior means or credible intervals per parameter can be obtained. What is more, the posterior distribution can be further transformed to obtain subsequent quantities of interest, for example, the posterior predictive distribution, which can be compared to the observed data for the purpose of model checking Lynch and Western (2004).
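As a minimal sketch of this workflow, the snippet below pushes hypothetical posterior draws through a toy simulator to obtain a posterior predictive distribution; all quantities here are stand-ins for the outputs of an actual fitting procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in: posterior draws for a single mean parameter mu
# of a Gaussian model, e.g. obtained via MCMC or an amortized estimator.
posterior_mu = rng.normal(loc=0.8, scale=0.1, size=500)

# Posterior predictive distribution: for each draw, simulate a new dataset.
n_obs = 50
predictive = np.array([rng.normal(mu, 1.0, size=n_obs) for mu in posterior_mu])

# Model checking: compare a test statistic between predictions and data.
observed = rng.normal(1.0, 1.0, size=n_obs)          # stand-in observed data
ppp = np.mean(predictive.mean(axis=1) >= observed.mean())
print("posterior predictive p-value for the mean:", ppp)
```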

3 Bayesian Model Comparison

In many research domains, there is not a single model for a particular process, but whole classes of models instantiating different and often competing theories. Bayesian model comparison proceeds by assigning a plausibility value to each candidate model. These plausibility values (model weights, model probabilities, model predictions, etc.) can be used to guide subsequent model selection.

To set the stage, consider a set of candidate models $\{M_1, \dots, M_J\}$. An intuitive way to quantify plausibility is to consider the marginal likelihood of a model $M_j$, given by:

$$p(x \mid M_j) = \int p(x \mid \theta_j, M_j)\, p(\theta_j \mid M_j)\, d\theta_j \qquad (3)$$

which is also the denominator in Eq. 2 (with $M$ implicit in the previous definition). This quantity is also known as evidence or prior predictive distribution, since the likelihood is weighted by the prior (in contrast to a posterior predictive distribution, where the likelihood would be weighted by the posterior). The marginal likelihood penalizes the prior complexity of a model and thus naturally embodies the principle of Occam's razor Jaynes (2003). To compare two competing models, one can focus on the ratio of two marginal likelihoods, called a Bayes factor (BF):

$$BF_{12} = \frac{p(x \mid M_1)}{p(x \mid M_2)} \qquad (4)$$

which quantifies the relative evidence of model $M_1$ over model $M_2$. Alternatively, if prior information about model plausibility is available, one can consider the model posteriors $p(M_j \mid x)$ and compute the posterior odds:

$$\frac{p(M_1 \mid x)}{p(M_2 \mid x)} = \frac{p(x \mid M_1)}{p(x \mid M_2)} \times \frac{p(M_1)}{p(M_2)} \qquad (5)$$

which combine the relative evidence given by the BF with prior information in the form of prior odds.
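Since Eq. 3 is an expectation of the likelihood under the prior, it can, for tractable likelihoods, be estimated naively by Monte Carlo. The following sketch compares two hypothetical Gaussian models of the same data; the models and priors are illustrative only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(0.4, 1.0, size=30)            # observed data (simulated here)

def marginal_likelihood(x, prior_scale, n_draws=50_000):
    """Naive MC estimate of Eq. 3: average the likelihood over prior draws."""
    mu = rng.normal(0.0, prior_scale, size=n_draws)      # theta ~ p(theta | M)
    log_lik = stats.norm.logpdf(x[:, None], loc=mu, scale=1.0).sum(axis=0)
    # Log-mean-exp for numerical stability.
    return np.logaddexp.reduce(log_lik) - np.log(n_draws)

log_ml_1 = marginal_likelihood(x, prior_scale=1.0)    # model M1: diffuse prior
log_ml_2 = marginal_likelihood(x, prior_scale=0.1)    # model M2: tight prior

print("log BF_12:", log_ml_1 - log_ml_2)              # Eq. 4 on the log scale
```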

4 Model Intractability

In order for cognitive models to be useful in practice, parameter estimation and model comparison should be feasible within reasonable time limits. As evident from their definitions, both Bayesian parameter estimation and model comparison depend on the likelihood function, which needs to be evaluated analytically or numerically for any triplet $(x, \theta, M)$.

When this is possible, standard Bayesian approaches for obtaining random draws from the posterior, such as Markov chain Monte Carlo (MCMC), or for optimizing an approximate posterior, such as variational inference (VI), can be readily applied. However, when the likelihood function is not available in closed form or is too expensive to evaluate, standard methods no longer apply.

In fact, many interesting models from a variety of domains in cognitive science and psychology turn out to be intractable Voss et al. (2019); Turner et al. (2016). This has precluded the wide exploration and application of these models, as researchers have often traded off complexity or neurocognitive plausibility for simplicity in order to make these models tractable. In the following, we discuss the most popular approach to inference with intractable models.

5 Simulation-Based Inference

Simulation-based methods leverage the generative property of mathematical models by treating a particular model as a scientific simulator from which synthetic data can be obtained given any configuration of the parameters. Simulation-based inference is common to many domains in science in general Cranmer et al. (2019) and a variety of different approaches exist. These methods have also been dubbed likelihood-free, which is somewhat unfortunate, since the likelihood is implicitly defined by the generative process and sampling from the likelihood is realized through the stochastic simulator:

$$x \sim p(x \mid \theta) \iff x = g(\theta, \xi) \ \text{with}\ \xi \sim p(\xi) \qquad (6)$$

Simulation-based methods differ mainly with respect to how they utilize the synthetic data to perform inference on real observed data Cranmer et al. (2019). The utility of any simulation-based method depends on multiple factors, such as asymptotic guarantees, data utilization, efficiency, scalability, and software availability.

Approximate Bayesian computation (ABC) offers a standard set of theoretically sound methods for performing inference on intractable models Cranmer et al. (2019). The core idea of ABC methods is to approximate the posterior by repeatedly sampling parameters from a proposal (prior) distribution and then generating a synthetic dataset by running the simulator with the sampled parameters. If the simulated dataset is sufficiently similar to an actually observed dataset, the corresponding parameters are retained as a sample from the desired posterior; otherwise, they are rejected. However, in practice, ABC methods are notoriously inefficient and suffer from various problems, such as the curse of dimensionality or the curse of inefficiency Marin et al. (2018). More efficient methods employ various techniques to optimize sampling or correct potential biases.
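A minimal rejection-ABC sketch, assuming the toy Gaussian setting from the earlier examples; the tolerance, summary statistics, and prior are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
observed = rng.normal(0.4, 1.0, size=30)      # stand-in observed dataset

def summary(data):
    # Hand-picked summary statistics; choosing these well is the hard part.
    return np.array([data.mean(), data.std()])

s_obs, accepted = summary(observed), []
epsilon = 0.1                                 # tolerance threshold

for _ in range(50_000):
    theta = rng.normal(0.0, 1.0)              # 1. sample from the prior
    synthetic = rng.normal(theta, 1.0, size=30)   # 2. run the simulator
    if np.linalg.norm(summary(synthetic) - s_obs) < epsilon:
        accepted.append(theta)                # 3. keep if close to the data

print(len(accepted), "accepted draws; posterior mean ~", np.mean(accepted))
```

The inefficiency is plain to see: the overwhelming majority of simulations are thrown away, and the entire loop must be rerun for every new dataset.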

Recently, the scientific repertoire for simulation-based inference has been enhanced with ideas from deep learning, and neural density estimation (NDE) in particular Greenberg et al. (2019). These methods employ specialized neural network architectures which are trained with simulated data to perform efficient and accurate inference on previously intractable problems Cranmer et al. (2019). NDE methods are rapidly developing and still largely underutilized in cognitive modeling, even though first applications to simulated data Radev et al. (2020b, a) as well as to actual data Wieschen et al. (2020) exist.

6 Amortized Inference

Figure 1: Graphical illustration of amortized parameter estimation and model comparison with different neural network estimators. (a) Amortized Bayesian parameter estimation with invertible neural networks Radev et al. (2020b). The left panel depicts the training phase, in which the summary network ($h_\psi$) and the inference network ($f_\phi$) are jointly optimized to approximate the true target posterior. The right panel depicts inference with the already trained networks on observed data. (b) Amortized Bayesian model comparison with evidential neural networks Radev et al. (2020a). The left panel depicts the training phase, during which the evidential network is optimized to approximate the true model posteriors via a higher-order Dirichlet distribution. The right panel depicts inference with an already trained evidential network. The upfront training effort for both inference tasks is amortized over arbitrary numbers of datasets from a research domain.

The majority of simulation-based methods need to be applied to each dataset separately. This quickly becomes infeasible when multiple datasets are to be analyzed and multiple candidate models are considered, since the expensive inference procedure needs to be repeated from scratch for each combination of dataset and model.

In contrast, the concept of amortized inference refers to an approach which minimizes the cost of inference by separating the process into an expensive training (optimization) phase and a cheap inference phase which can be easily repeated for multiple datasets or models without computational overhead. Thus, the effort of training or optimization amortizes over repeated applications on multiple datasets or models. In some cases, the efficiency advantage of amortized inference becomes noticeable even for a few datasets Radev et al. (2020b, a).

The field of amortized inference is rapidly growing, and a variety of methods and concepts are currently being explored. For instance, inference compilation involves pre-training a neural network with simulations from a generative model and then using the network in combination with a probabilistic program to optimize sampling from the posterior Le et al. (2016). The pre-paid estimation method Mestdagh et al. (2019) proceeds by creating a large grid of simulations which are reduced to summary statistics and stored on disk. Subsequent inference involves computing the nearest neighbors of an observed dataset in the pre-paid grid and interpolating between them. Sequential neural posterior estimation (SNPE) methods employ various iterative refinement schemes to transform a proposal distribution into the correct target posterior via expressive NDEs trained over multiple simulation rounds Greenberg et al. (2019).

In line with these ideas, we recently proposed two general frameworks for amortized Bayesian parameter estimation and model comparison based on specialized neural network architectures Radev et al. (2020b, a). In particular, these frameworks were designed to implement the following desirable properties:

  • Fully amortized Bayesian inference for parameter estimation and model comparison of intractable models

  • Asymptotic theoretical guarantees for sampling from the true parameter and model posteriors

  • Learning maximally informative summary statistics directly from data instead of manual selection

  • Scalability to high-dimensional problems through considerations regarding the probabilistic symmetry of the data

  • Implicit preference for simpler models based purely on generative performance

  • Online learning eliminating the need for storing large grids or reference tables

  • Parallel computations and GPU acceleration applicable to simulations, training/optimization, and inference alike

In the following, we describe our recently developed methods for parameter estimation and model comparison in turn.

7 Amortized Parameter Estimation with Invertible Neural Networks

Recently, we proposed a novel amortization method based on invertible neural networks Radev et al. (2020b), which we dubbed BayesFlow. The method relies solely on simulations from a process model in order to learn and calibrate the posterior over all possible parameter values and observed data patterns.

The BayesFlow method involves two separate neural networks trained jointly. A permutation invariant summary network is responsible for reducing an entire dataset $x_{1:N}$ with a variable number of observations $N$ into a vector of learned summary statistics. (Note that the i.i.d. assumption is not a necessary condition for the method to work; it is used here only to simplify the discussion.) Importantly, permutation invariant networks can deal with i.i.d. sequences of variable size and preserve their probabilistic symmetry. An inference network, implemented as an invertible neural network Radev et al. (2020b), is responsible for approximating the true posterior of model parameters given the output of the summary network. Invertible networks can perform asymptotically exact inference and scale well from simple low-dimensional problems to high-dimensional distributions with complex dependencies. During training, model parameters and synthetic datasets are generated on the fly, and the neural network parameters are adjusted via joint backpropagation (see Figure 1a, left panel, for a graphical illustration of the training phase).

Given a model and a prior over the model parameters, the goal is thus to train a conditional invertible neural network $f_\phi$ with adjustable parameters $\phi$ together with a summary network $h_\psi$ with adjustable parameters $\psi$. These networks jointly learn an approximate posterior $q_\phi(\theta \mid h_\psi(x_{1:N}))$ over the relevant parameters for arbitrary numbers of datasets and dataset sizes $N$, as long as they share the same data structure. To achieve this, the networks minimize the Kullback-Leibler (KL) divergence between the true and the approximate posterior:

$$\min_{\phi, \psi} \; \mathbb{KL}\left( p(\theta \mid x_{1:N}) \,\|\, q_\phi(\theta \mid h_\psi(x_{1:N})) \right) \qquad (7)$$

Utilizing the fact that we have access to the joint distribution $p(\theta, x_{1:N})$ via the simulator, we minimize the KL divergence in expectation over all possible datasets that can be generated given the prior and the model, resulting in the following criterion:

$$\min_{\phi, \psi} \; \mathbb{E}_{p(\theta,\, x_{1:N})}\left[ -\log q_\phi(\theta \mid h_\psi(x_{1:N})) \right] \qquad (8)$$

In practice, we approximate the criterion via its Monte Carlo (MC) estimate, since we can simulate theoretically infinite amounts of data and can easily evaluate $q_\phi(\theta \mid h_\psi(x_{1:N}))$ due to our invertible architecture. In case of perfect convergence of the networks, the summary network outputs sufficient summary statistics and the inference network samples from the true posterior Radev et al. (2020b). Importantly, once the networks have been trained with sufficient amounts of simulated data, they can be stored and applied for inference on multiple datasets from a research domain (see Figure 1a, right panel).
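The following PyTorch sketch illustrates the training logic behind Eq. 8 under strong simplifying assumptions: a mean-pooling summary network stands in for $h_\psi$, and a conditional Gaussian density stands in for the invertible inference network $f_\phi$. All names and architecture choices are illustrative, not the BayesFlow implementation.

```python
import math
import torch
import torch.nn as nn

class SummaryNet(nn.Module):
    """Permutation invariant summary: embed each observation, then mean-pool."""
    def __init__(self, obs_dim=1, summary_dim=8):
        super().__init__()
        self.embed = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(),
                                   nn.Linear(32, summary_dim))

    def forward(self, x):                 # x: (batch, N, obs_dim)
        return self.embed(x).mean(dim=1)  # pooling makes the output order-invariant

class GaussianPosterior(nn.Module):
    """Stand-in for the invertible network: q(theta | summary) is Gaussian."""
    def __init__(self, summary_dim=8, theta_dim=1):
        super().__init__()
        self.net = nn.Linear(summary_dim, 2 * theta_dim)

    def log_prob(self, theta, s):
        mu, log_var = self.net(s).chunk(2, dim=-1)
        logp = -0.5 * (((theta - mu) ** 2) / log_var.exp() + log_var
                       + math.log(2.0 * math.pi))
        return logp.sum(dim=-1)

summary_net, inference_net = SummaryNet(), GaussianPosterior()
params = list(summary_net.parameters()) + list(inference_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)

for step in range(2000):                             # expensive training phase
    theta = torch.randn(64, 1)                       # 1. theta ~ p(theta)
    x = theta.unsqueeze(1) + torch.randn(64, 50, 1)  # 2. x ~ p(x | theta)
    loss = -inference_net.log_prob(theta, summary_net(x)).mean()  # MC of Eq. 8
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Once trained, posterior inference for a new dataset amounts to a single forward pass through both networks, which is precisely what makes the procedure amortized.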

8 Amortized Model Comparison with Evidential Neural Networks

In another recent work Radev et al. (2020a), we explored a framework for Bayesian model comparison on intractable models via evidential neural networks. We proposed to train a permutation invariant classifier network on simulated data from multiple models. The goal of this network is to approximate posterior model probabilities as accurately as possible. To achieve this, the network is trained to output the parameters of a higher-order probability distribution (parameterized as a Dirichlet distribution) over the model probabilities themselves, which quantifies the uncertainty in model probability estimates. Thus, for a classifier network $e_\zeta$ with parameters $\zeta$, the higher-order posterior distribution over model probabilities $\pi$ is given by:

$$q(\pi \mid x) = \frac{1}{B(\alpha)} \prod_{j=1}^{J} \pi_j^{\alpha_j - 1} \qquad (9)$$

where $\alpha = e_\zeta(x)$ denotes the vector of concentration parameters obtained by the network for a dataset $x$ and $B(\alpha)$ is the multivariate beta function. The mean of this Dirichlet distribution can be used as a best estimate for the posterior model probabilities:

$$p(M_j \mid x) \approx \frac{\alpha_j}{\sum_{j'=1}^{J} \alpha_{j'}} \qquad (10)$$

Additionally, its variance can be interpreted as the epistemic uncertainty surrounding the actual evidence which the data provide for model comparison.
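A small NumPy sketch of Eqs. 9-10, using a hypothetical vector of concentration parameters as it might be output by a trained evidential network:

```python
import numpy as np

alpha = np.array([8.0, 3.0, 1.0])        # hypothetical network output e_zeta(x)
alpha0 = alpha.sum()

model_probs = alpha / alpha0             # Eq. 10: Dirichlet mean
# Variance of each pi_j: epistemic uncertainty in the model probabilities.
variances = alpha * (alpha0 - alpha) / (alpha0 ** 2 * (alpha0 + 1))

print("posterior model probabilities:", model_probs)
print("epistemic uncertainty (variance):", variances)
```

Larger total concentration $\alpha_0$ corresponds to more simulated evidence behind the probability estimates and thus to smaller epistemic uncertainty.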

For training the network, we again utilize the fact that we have access to the joint distribution $p(M, x)$ via simulations (see Figure 1b, left panel). Our optimization criterion is:

$$\min_{\zeta} \; \mathbb{E}_{p(M,\, x)}\left[ \mathcal{L}\big(e_\zeta(x),\, m\big) \right] \qquad (11)$$

where $\mathcal{L}$ is a strictly proper loss function Gneiting and Raftery (2007), $m$ is the true model index, and the data $x$ implicitly depend on $m$. In practice, we approximate this expectation via draws from the joint distribution available through simulation. Because we optimize a strictly proper criterion, asymptotic convergence implies that the mean of the Dirichlet distribution represents the true model posteriors. Our simulation-based approach implicitly captures a preference for simpler models (Occam's razor), since simpler models will tend to generate more similar datasets. As a consequence, when such datasets are plausible under multiple models, the comparably simpler models will be more probable.

As with parameter estimation, once the evidential network has been trained on simulated data from the candidate models, it can be applied to multiple upcoming observations from a research domain (see Figure 1b, right panel).

9 Example Applications

In the following, we will present two applications of amortized Bayesian parameter estimation to a new intractable evidence accumulation model (EAM). EAMs are a popular class of models in psychology and cognitive science, as they allow a model-based analysis of response time (RT) distributions. Here, we will consider a Lévy flight model (LFM) with a non-Gaussian noise assumption Voss et al. (2019); Wieschen et al. (2020) as an example. The Lévy flight process is driven by the following stochastic differential equation:

$$dx_c = v_c\, dt + \xi\, (dt)^{1/\alpha} \qquad (12)$$
$$\xi \sim \text{AlphaStable}(\alpha, 0, 1, 0) \qquad (13)$$

where $x_c$ denotes the accumulated cognitive evidence in condition $c$, $v_c$ denotes the average speed of information accumulation (drift) in that condition, and $\alpha$ controls the tails of the noise distribution (i.e., smaller values increase the probability of outliers in the accumulation process). Further parameters of the model are: a decision threshold ($a$), which reflects the amount of information needed for selecting a response; a starting point ($z$), indicative of response biases; and a non-decision time ($t_0$), reflecting additive encoding and motor processes. The relationship of the $\alpha$ parameter to the other parameters of the model has not been previously investigated.
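A minimal Euler-Maruyama-style simulation of Eqs. 12-13 might look as follows; the parameter values are arbitrary, and SciPy's levy_stable is used for the alpha-stable noise (at $\alpha = 2$ the process reduces to the standard Gaussian diffusion).

```python
import numpy as np
from scipy.stats import levy_stable

def simulate_lfm_trial(v, a, z, t0, alpha, dt=0.001, max_steps=10_000, rng=None):
    """Simulate one Lévy flight trial; returns (response time, choice)."""
    rng = rng or np.random.default_rng()
    # Pre-draw symmetric alpha-stable noise (Eq. 13, beta = 0).
    xi = levy_stable.rvs(alpha, 0.0, loc=0.0, scale=1.0,
                         size=max_steps, random_state=rng)
    x = z * a                                         # relative starting point
    for step in range(max_steps):
        x += v * dt + xi[step] * dt ** (1.0 / alpha)  # Euler step of Eq. 12
        if x >= a:
            return t0 + (step + 1) * dt, 1            # upper boundary response
        if x <= 0.0:
            return t0 + (step + 1) * dt, 0            # lower boundary response
    return np.nan, -1                                 # no decision reached

rt, choice = simulate_lfm_trial(v=1.5, a=2.0, z=0.5, t0=0.3, alpha=1.7)
print(f"RT = {rt:.3f} s, choice = {choice}")
```

The model is intractable precisely because no closed-form first-passage time density is known for alpha-stable noise, so every likelihood evaluation would itself require simulation.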

9.1 Simulation Example

Figure 2: Simulation results. (a) Parameter recovery as a function of the number of trials $N$ per participant: the left panel depicts recovery of the four drift rate parameters, the right panel recovery of the other four parameters. Posterior means are used as summaries of the full posteriors, and shaded regions represent bootstrap 95% confidence intervals. (b) Simulation-based calibration (SBC) results as a validation check for the correctness of the full posteriors.

As a first example, consider a simulated RT experiment with four conditions. How many trials are needed for accurate parameter recovery? To answer this question, we can simulate multiple experiments with varying numbers of trials per participant ($N$) and then quantify the discrepancy between the ground-truth parameters and their estimates.

Figure 3: Example full posteriors and bivariate posterior correlations from data of one participant in the LDT.

However, since the model is intractable, such a simulation scenario is not feasible with non-amortized methods, which would need weeks on standard machines Voss et al. (2019). Using the BayesFlow method (Figure 1a), we can instead train the networks with simulated datasets, varying the number of trials during each simulation. Such training takes approximately one day on a standard laptop equipped with an NVIDIA® GTX1060 graphics card. Subsequent inference is then very cheap: amortized parameter estimation for 500 simulated participants takes less than 2 seconds.

We visualize the results by plotting the average recovery metric obtained from fitting the LFM to simulated participants at different trial numbers $N$ (see Figure 2a). Notably, recovery of the ground-truth parameters via posterior means is nearly perfect at higher trial numbers.

As a validation tool for visually detecting systematic biases in the approximate posteriors, we can also cheaply apply simulation-based calibration (SBC) and inspect the rank statistics of the posterior samples for uniformity Talts et al. (2018). Results from applying SBC to simulated participants are depicted in Figure 2b. Indeed, we confirm that no pronounced issues across the marginal posteriors are present.
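For readers unfamiliar with SBC, the following self-contained sketch demonstrates the rank statistic on a toy conjugate Gaussian model, where exact posterior draws are available; with BayesFlow, the posterior draws would instead come from the trained networks.

```python
import numpy as np

rng = np.random.default_rng(7)
n_sims, n_draws, n_obs = 1000, 99, 20

ranks = []
for _ in range(n_sims):
    theta_true = rng.normal(0.0, 1.0)              # theta* ~ prior N(0, 1)
    x = rng.normal(theta_true, 1.0, size=n_obs)    # simulate a dataset
    # Exact conjugate posterior N(mu_post, sd_post): a stand-in for the
    # approximate posterior produced by a trained amortized network.
    sd_post = np.sqrt(1.0 / (1.0 + n_obs))
    mu_post = sd_post ** 2 * x.sum()
    draws = rng.normal(mu_post, sd_post, size=n_draws)
    ranks.append(np.sum(draws < theta_true))       # rank of theta* among draws

# If the posterior is correct, ranks are uniform on {0, ..., n_draws}.
hist, _ = np.histogram(ranks, bins=10)
print("rank histogram (should be roughly flat):", hist)
```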

9.2 Real Data Example

We can also apply the same network from the previous simulation example for fully Bayesian inference on real data. Here, we fit the LFM to previously unpublished data from eleven participants performing a long lexical decision task (LDT). Since the task had a 2 × 2 design, with a factor for difficulty (hard vs. easy) and a factor for stimulus type (word vs. non-word), we assume a different drift rate for each design cell.

Using the full estimated posteriors, we can analyze posterior correlations at an individual level and investigate task-dependent relationships between the $\alpha$ parameter and the other parameters (see Figure 3 for an example from a single participant). Across participants, $\alpha$ displays only small positive correlations with the drift rates, as well as small positive correlations with the threshold and non-decision time parameters. This, together with the simulation results, provides first evidence that the $\alpha$ parameter can indeed be decoupled from the other parameters and possibly indicates a separate decision process.

10 Outlook

The purpose of this work was to introduce the main ideas behind amortized Bayesian inference methods for simulation-based parameter estimation and model comparison. Although these methods come with promising theoretical guarantees and clear practical advantages, their utility for cognitive modeling is just beginning to be explored. Moreover, there are still many open questions and avenues for future research.

First, a systematic investigation of a potential amortization gap in certain practical applications seems warranted. An amortization gap refers to a drop in estimation accuracy due to the fact that we are relying on a single set of neural network parameters for solving an inference problem globally, instead of performing per-dataset optimization. Even though we have not observed such a scenario in our applications and simulations, this behavior might occur when the neural network estimators are not expressive enough to represent complex posterior distributions.

Second, there are still few systematic guidelines on how best to design and tune the neural network architectures so as to perform optimally across a variety of parameter estimation and model comparison tasks. Even though neural density estimation methods outperform standard ABC methods on multiple metrics and in various contexts, there is certainly room for improvement. Black-box optimization methods for hyperparameter tuning, such as Bayesian optimization or active inference Snoek et al. (2012), might facilitate additional performance gains and reduce potentially suboptimal architectural choices.

Finally, user-friendly software for applying Bayesian amortization methods out of the box is still largely in its infancy. Developing and maintaining such software is a crucial future goal for increasing the applicability and usability of novel simulation-based methods.

11 Conclusion

We hope that the inference architectures discussed in this work will spur the interest of cognitive modelers from various domains. We believe that such architectures can greatly enhance model-based analysis in cognitive science and psychology. By leaving subsidiary tractability considerations to powerful end-to-end algorithms, researchers can focus more on the task of model development and evaluation to further improve our understanding of cognitive processes.

12 Acknowledgments

This research was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation; grant number GRK 2277 "Statistical Modeling in Psychology"). We thank the Technology Industries of Finland Centennial Foundation (grant 70007503; Artificial Intelligence for Research and Development) for partial support of this work.

References

  • K. Cranmer, J. Brehmer, and G. Louppe (2019) The frontier of simulation-based inference. arXiv preprint arXiv:1911.01429.
  • L. Fontanesi, S. Gluth, M. S. Spektor, and J. Rieskamp (2019) A reinforcement learning diffusion decision model for value-based decisions. Psychonomic Bulletin & Review 26 (4), pp. 1099–1121.
  • T. Gneiting and A. E. Raftery (2007) Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102 (477), pp. 359–378.
  • D. S. Greenberg, M. Nonnenmacher, and J. H. Macke (2019) Automatic posterior transformation for likelihood-free inference. arXiv preprint arXiv:1905.07488.
  • E. T. Jaynes (2003) Probability theory: The logic of science. Cambridge University Press.
  • T. A. Le, A. G. Baydin, and F. Wood (2016) Inference compilation and universal probabilistic programming. arXiv preprint arXiv:1610.09900.
  • S. M. Lynch and B. Western (2004) Bayesian posterior predictive checks for complex models. Sociological Methods & Research 32 (3), pp. 301–335.
  • J. Marin, P. Pudlo, A. Estoup, and C. Robert (2018) Likelihood-free model choice. Chapman and Hall/CRC Press, Boca Raton, FL.
  • M. Mestdagh, S. Verdonck, K. Meers, T. Loossens, and F. Tuerlinckx (2019) Prepaid parameter estimation without likelihoods. PLoS Computational Biology 15 (9), pp. e1007181.
  • J. I. Myung, M. Montenegro, and M. A. Pitt (2007) Analytic expressions for the BCDMEM model of recognition memory. Journal of Mathematical Psychology 51 (3), pp. 198–204.
  • S. T. Radev, M. D'Alessandro, P. Bürkner, U. K. Mertens, A. Voss, and U. Köthe (2020a) Amortized Bayesian model comparison with evidential deep learning. arXiv preprint arXiv:2004.10629.
  • S. T. Radev, U. K. Mertens, A. Voss, L. Ardizzone, and U. Köthe (2020b) BayesFlow: Learning complex stochastic models with invertible neural networks. arXiv preprint arXiv:2003.06281.
  • J. Snoek, H. Larochelle, and R. P. Adams (2012) Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pp. 2951–2959.
  • J. C. Stout, J. R. Busemeyer, A. Lin, S. J. Grant, and K. R. Bonson (2004) Cognitive modeling analysis of decision-making processes in cocaine abusers. Psychonomic Bulletin & Review 11 (4), pp. 742–747.
  • S. Talts, M. Betancourt, D. Simpson, A. Vehtari, and A. Gelman (2018) Validating Bayesian inference algorithms with simulation-based calibration. arXiv preprint arXiv:1804.06788.
  • B. M. Turner, B. U. Forstmann, M. Steyvers, et al. (2019) Joint models of neural and behavioral data. Springer.
  • B. Turner, P. Sederberg, and J. McClelland (2016) Bayesian analysis of simulation-based models. Journal of Mathematical Psychology 72, pp. 191–199.
  • A. Voss, V. Lerche, U. Mertens, and J. Voss (2019) Sequential sampling models with variable boundaries and non-normal noise: A comparison of six models. Psychonomic Bulletin & Review 26 (3), pp. 813–832.
  • E. M. Wieschen, A. Voss, and S. Radev (2020) Jumping to conclusion? A Lévy flight model of decision making. The Quantitative Methods for Psychology 16 (2), pp. 120–132.