1 Introduction
Researchers from various scientific fields face the problem of selecting the most plausible theory for an empirical phenomenon among multiple alternative theories. These theories are often formally stated as mathematical models which describe how observable quantities arise from unobservable (latent) parameters. Focusing on the level of mathematical models, the problem of theory selection then becomes one of model selection.
For instance, neuroscientists might be interested in comparing different models of spiking patterns given in vivo recordings of neural activity [izhikevich2004model]. Epidemiologists, on the other hand, might consider different models for predicting the spread and dynamics of an unfolding infectious disease [verity2020estimates]. Moreover, the preference for one model over alternative models in these examples can have important consequences for research projects or social policies. Accounting for such complex natural phenomena often requires specifying complex models which entail some degree of randomness. Inherent stochasticity, incomplete description, or ignorance all call for some form of uncertainty awareness. To make matters worse, empirical data on which models are fit are necessarily finite and can only be acquired with finite precision. Finally, the plausibility of many nontrivial models throughout various branches of science can be assessed only approximately [ratmann2009model, deisboeck2010multiscale, turner2016bayesian, marin2018likelihood, izhikevich2004model, da2018model]. Therefore, there is a considerable degree of uncertainty in the process of selecting a single most plausible model.
A consistent mathematical framework for describing uncertainty and quantifying plausibility is offered by the Bayesian view on probability theory
[jaynes2003probability]. A lot of Bayesian model comparison approaches involve 1) fitting all competing models to the observed data, 2) assigning weights to each model, and 3) deciding on further actions, such as model comparison, selection, or averaging. In this scheme, steps 1) and 2) can be tackled with Bayesian inference methods, and step 3) pertains to the domain of decision theory.
When standard Bayesian inference methods, such as Markov Chain Monte Carlo (MCMC,
[brooks2011handbook]) or Variational Inference (VI), are applicable, a wide range of model selection and comparison methods can be considered. These include, for instance, informationbased criteria (WAIC, WBIC) [watanabe2013waic], crossvalidation [vehtari2017practical], Bayes factors (BF)
[kass1995bayes], minimum description length (MDL) [hansen2001model], or stacking of posterior predictive distributions
[yao2018using]. These methods come with certain desirable theoretical guarantees, for instance, selecting the true model in the limit of infinite data, given that the model is in the set of considered models. Moreover, they encode a fundamental tradeoff between a model’s complexity and its predictive performance (generalizability). This tradeoff is a manifestation of Occam’s razor, which expresses our preference for a simpler model over a more complex one when both models can account for the data equally well.When comparing complex models, the computational burden of fitting models to data becomes increasingly prohibitive. It is not rare that fitting a single model to multiple data sets can take up entire weeks [papamakarios2018sequential]. Computation becomes even more cumbersome when the above mentioned Bayesian inference methods cannot be applied to one or the whole set of models [voss2019sequential]. This occurs, for instance, when the likelihood function is not available in closed form or cannot be evaluated in reasonable time [turner2014generalized]. This presents a challenge for many model comparison methods, for instance, Bayes factors or expected log posterior density (ELPD) values, as the likelihood needs to be evaluated or approximated at multiple parameter values. Faced with these computational difficulties, researchers are forced to either make simplifying assumptions or resort to simulationbased methods.
When simplifying the models is not an option, simulationbased approaches are the only alternative available for testing competing theories via formal model comparison. Simulationbased methods employ a generative approach which approximates model probabilities via simulations from the models [turner2014generalized, marin2018likelihood, papamakarios2018sequential]. By relying solely on simulations, the likelihood need not be evaluated directly but is implicitly given by the generative (sampling) process. It is therefore of great practical utility to develop simulationbased approximations for all the above mentioned model comparison methods [sisson2011likelihood]. Ideally, a method for approximate model comparison should satisfy the following requirements:

Theoretical guarantee
: Model probability estimates should be, at least in theory, calibrated to the true model probabilities induced by an empirical problem;

Accurate approximation: Model probability estimates should be accurate even for finite or small sample sizes;

Occam’s razor: Preference for simpler models should be expressed by the model probability estimates;

Scalability: The method should be applicable to complex models with implicit likelihood within reasonable time limits;

Maximum data utilization: The method should capitalize on all information contained in the data and avoid information loss through insufficient summary statistics of the data;
However, most approximate methods come with significant tradeoffs between these criteria. In practical applications, these tradeoffs are far from easy to balance. For instance, vanilla approximate Bayesian computation (ABC) methods offer elegant theoretical guarantees but do not scale well in order to be practical in complex scenarios with limited computational power. More scalable developments from the ABC family (ABCSMC, ABCMCMC, ABC neural networks and the recently proposed ABC random forests) offer great efficiency boosts but still rely on handcrafted summary statistics
[marin2018likelihood, jiang2017learning, sisson2011likelihood]. Handcrafted summary statistics (typically some form of dimensionality reduction, e.g. mean, covariance and higher central moments of the observed data) are problematic, because in general there is no guarantee that these summaries extract all relevant information. Model comparison with insufficient summary statistics deteriorates the resulting model posteriors
[robert2011lack]. Learned statistics suffer much less from this problem.Other modern approaches for approximate inference, such as the probability density approximation (PDA, [turner2014generalized]), the sequential neural likelihood (SNL, [papamakarios2018sequential]), or the automatic posterior transformation (ATP, [greenberg2019automatic]) methods, can perform model comparison by using samples from an approximate posterior over the models’ parameters. However, this involves fitting all models for each individual dataset in a costly perdataset optimization loop. In addition, most of these methods also rely on fixed summary statistics.
In the current work, we propose and validate a novel method for Bayesian model comparison based on evidential deep neural networks. Training of the method is purely simulationbased and, further, circumvents the step of explicitly fitting all alternative models to each dataset. Moreover, the method is designed to amortize the effort of model comparison globally over multiple models, datasets, and dataset sizes from a given model family. Thus, the goal is to enable researchers to reuse pretrained networks for arbitrary many new data sets. This makes the method applicable in scenarios where multiple models need to be fitted to multiple datasets, so that perdataset inference is too costly to perform. In addition, we propose to avoid handcrafted summary statistics by using novel deep learning architectures which are aligned to the probabilistic structure of the raw data (e.g., permutation invariant networks [bloem2019probabilistic], recurrent networks [gers1999learning]). Finally, we propose a novel way to measure epistemic uncertainty in model comparison problems, following the pioneering work of [sensoy2018evidential] on image classification. We argue that this measure of epistemic uncertainty provides a proxy to quantify absolute evidence even in an closed framework, which assumes that the true datagenerating model is within the candidate set.
The rest of the paper is organized as follows. The Method section formulates the problem of model comparison formally and derives a framework for learning absolute model evidence via simulations from a set of generative models. With our subsequent Experiments section, we demonstrate empirically the following properties of our model:

Approximation of analytic Bayes factors on a simple toy example with tractable marginal likelihoods and binary i.i.d. data (Experiment 1)

Performance on a Gaussian Mixture Model toy example with hundreds of competing models with
i.i.d. data (Experiment 2) 
Efficiency benefits of using amortized inference over a standard ABCSMC method on a toy Markov jump model with noni.i.d. data (Experiment 3)

Performance and epistemic uncertainty quantification on complex models of human decision making with i.i.d. data (Experiment 4)

Performance and epistemic uncertainty quantification on complex stochastic models of singleneuron spiking patterns with non
i.i.d. data (Experiment 5)
2 Method
2.1 Problem Formulation
Assume we have a collection of competing generative models . Each represents a theoretically plausible (potentially noisy) process by which observable quantities arise from hidden parameters and independent noise :
(1) 
where is the corresponding parameter space of model . The subscript makes it explicit that each model might be specified over a different parameter space. Equivalently, the datagenerating process can be expressed in terms of the likelihood:
(2) 
We assume that the functional or algorithmic form of each is known and that we have a sample (dataset) of observations drawn from . The task of model selection is to choose the model that best describes the observed dataset. Formally, the probability of the entire sample under a model can be expressed by the marginal likelihood:
(3) 
where is the prior over model ’s parameters and is the likelihood of the sample under model with parameters . The marginal likelihood focuses on prior predictions and thus penalizes the prior complexity of a model (i.e., the prior acts as a weight on the likelihood). This is in contrast to posterior predictions, which require marginalization over the parameter posterior and can be used to select the model which best predicts new data. The ratio of marginal likelihoods of two models and is called a Bayes factor (BF) and is given by:
(4) 
The Bayes factor thus encodes relative evidence in favor of model compared to model
. Alternatively, one can focus on the (marginal) posterior probability of a model:
(5) 
and consider the ratio of posterior probabilities (
posterior odds
) of two models:(6) 
The posterior odds are connected to the Bayes factor via the respective models’ priors and . If both models are equally likely a priori, that is, , the posterior odds equal the Bayes factor. In this case, if the Bayes factor, or, equivalently, the posterior odds equal one, the observed data provide no evidence for one of the models over the other. However, a relative evidence of one does not allow to distinguish whether the data are equally likely or equally unlikely under both models, as this is a question of absolute evidence.
Closely related to the distinction between relative and absolute evidence is the distinction between closed and complete frameworks [yao2018using]. Under an closed framework, the true model is assumed to be in the predefined set of competing models , so relative evidence is identical to absolute evidence. Under an complete framework, a true model is assumed to exist but is not necessarily assumed to be a member of . However, one still focuses on the models in due to computational or conceptual limitations^{1}^{1}1In this work, we delegate the discussion of whether the concept of a true model has any ontological meaning to philosophy. See also [yao2018using] for discussion of an open framework, in which no true model is assumed to exist.
Deciding on the particular framework under which a model comparison problem is tackled is often a matter of prior theoretical considerations. However, since in most nontrivial research scenarios is a finite set and candidate models in are often simpler approximations to the true model, there will be uncertainty as to whether the observed data could have been generated by one of these models. In the following, we will refer to this uncertainty as epistemic uncertainty. We will propose a datadriven way to calibrate epistemic uncertainty in addition to the model probabilities through simulations under an closed framework. Consequently, given real observed data, a researcher can obtain a measure of uncertainty with regard to whether the generative model of the data is likely to be in or not. From this perspective, our method lies somewhere between an closed and an complete framework as it provides information from both viewpoints.
2.2 Model Selection as Classification
In line with previous simulationbased approaches to model selection, we will utilize the fact that we can generate arbitrary amounts of data via Eq.2 for each considered model . Following [pudlo2015reliable, marin2018likelihood], we cast the problem of model comparison as a probabilistic classification task. In other words, we seek a parametric mapping from an arbitrary data space to a probability simplex containing the posterior probability over indices. Previously, different learning algorithms (e.g., random forests [marin2018likelihood]) have been employed to tackle model comparison as classification. Following recent developments in algorithmic alignment and probabilistic symmetry [xu2019what, bloem2019probabilistic], our method parameterizes via a specialized neural network with trainable parameters which is aligned to the probabilistic structure of the observed data. See the Network Architectures section for a detailed description of the employed networks’ structure.
In addition, our method differs from previous classification approaches to model comparison in the following aspects. First, it requires no handcrafted summary statistics, since the most informative summary statistics are learned directly from data. Second, it uses online learning (i.e., onthefly simulations) and requires no storage of large reference tables or data grids. Third, the addition of new competing models does not require changing the architecture or retraining the network from scratch, since the underlying data domain remains the same. In line with the transfer learning literature, only the last layer of a pretrained network needs to be changed and training can be resumed from where it Last, our method is uncertaintyaware, as it returns a higherorder distribution over posterior model probabilities. From this distribution, one can extract both absolute and relative evidences, as well as quantify the model selection uncertainty implied by the observed data.
To set up the model classification task, we run Algorithm 1 repeatedly to construct training batches with simulated data sets of size and model indices of the form
. We then feed each batch to a neural network which takes as input simulated data with variable sizes and returns a distribution over posterior model probabilities. The neural network parameters are optimized via standard backpropagation. Upon convergence, we can apply the pretrained network to arbitrarily many datasets of the form
to obtain a vector of probabilities
which approximates the true model posterior .Note, that this procedure incurs no memory overhead, as the training batches need not be stored in memory all at once. Intuitively, the connection between data and models is encoded in the network’s weights. Once trained, the evidential network can be reused to perform instant model selection on multiple real observations. As mentioned above, the addition of new models requires simply adjusting the pretrained network, which requires much less time than retraining the network from scratch. We now describe how model probabilities and evidences are represented by the network.
2.3 Evidence Representation
In order to obtain a measure of absolute evidence by considering a finite number of competing models, we place a Dirichlet distribution over the estimated posterior model probabilities [sensoy2018evidential]. This corresponds to modeling secondorder probabilities in terms of the theory of subjective logic (SL) [jsang2018subjective]. These secondorder probabilities represent an uncertainty measure over quantities which are themselves probabilities. We use the secondorder probabilities to capture epistemic uncertainty about whether the observed data has been generated by one of the candidate models considered during training.
The probability density function (PDF) of a Dirichlet distribution is given by:
(7) 
where belongs to the unit simplex (i.e., and is the multivariate beta function. The Dirichlet density is parameterized by a vector of concentration parameters which can be interpreted as evidences in the ST framework [jsang2018subjective]. The sum of the individual evidence components
is referred to as the Dirichlet strength, and it affects the precision of the higherorder distribution in terms of its variance. Intuitively, the Dirichlet strength governs the
peakedness of the distribution, with larger values leading to more peaked densities (i.e., most of the density being concentrated in a smaller region of the simplex). We can use the mean of the Dirichlet distribution, which is a vector of probabilities given by:(8) 
to approximate the posterior model probabilities , as will become clearer later in this section. A crucial advantage of such a Dirichlet representation is that it allows to look beyond model probabilities by inspecting the vector of computed evidences. For instance, imagine a scenario with three possible models. If , the data provides equally strong evidence for all models ((b), first column) – all models explain the data well. If, on the other hand, , then the Dirichlet distribution reduces to a uniform on the simplex indicating no evidence for any of the models ((b), second column) – no model explains the observations well. Note that in either case one cannot select a model on the basis of the data, because posterior model probabilities are equal, yet the interpretation of the two outcomes is very different: The secondorder Dirichlet distribution allows one to distinguish between equally likely (first case) and equally unlikely (second case) models. The last column of (b) illustrates a scenario with in which case one can distinguish between all models (see also Figure 9 for a scenario with data simulated from an actual model).
We can further quantify this distinction by computing an uncertainty score given by:
(9) 
where is the number of considered models. This uncertainty score ranges between (total certainty) and (total uncertainty) and has a straightforward interpretation. Accordingly, total uncertainty is given when , which would mean that the data provide no evidence for any of the candidate models. On the other hand, implies a large Dirichlet strength , which would read that the data provide plenty of evidence for one or more models in question. The uncertainty score corresponds to the concept of vacuity (i.e., epistemic uncertainty) in the terminology of SL [jsang2018subjective]. We argue that epistemic uncertainty should be a crucial aspect in model selection, as it quantifies the strength of evidence, and, consequently, the strength of the theoretical conclusions we can draw given the observed data.
Consequently, model comparison in our framework consists in inferring the parameters of a Dirichlet distribution given an observed dataset. The problem of inferring posterior model probabilities can be reparameterized as:
(10) 
where
is a neural network with positive outputs (e.g., ReLU outputs). Note, that we suppress the dependence of
on the neural network parameters and the dataset for readability purposes. Additionally, we can also obtain a measure of absolute model evidence by considering the uncertainty encoded by the full Dirichlet distribution (Eq.9).2.4 Learning Evidence in an Closed Framework
How do we ensure that the outputs of the neural network match the true unknown model posterior probabilities? As per Algorithm 1, we only have access to samples from . Consider, for illustrational purposes, a dataset with a single observation, that is and . We use the mean of the Dirichlet distribution parameterized by an evidential neural network with parameters to approximate . To optimize the parameters of the neural network, we minimize some loss in expectation over all possible datasets:
(11)  
(12) 
where
is the onehot encoded vector of the true model index. We also require that
be a strictly proper loss [gneiting2007strictly]. A loss function is strictly proper if and only if it attains its minimum when
[gneiting2007strictly]. To derive a strictly proper loss, let be a concave mapping from the probability simplex to the reals, for example a generalized entropy or a negated norm. A fundamental theorem about proper scoring rules for discrete distributions [gneiting2007strictly] states that a loss is proper if and only if it can be expressed as:(13) 
When we choose as the Shannon entropy , we obtain the logarithmic loss:
(14)  
(15) 
where when is the true model index and otherwise. Thus, in order to estimate , we can minimize the expected logarithmic loss over all simulated datasets where denotes the th component of the Dirichlet density given by the evidential neural network. Since we use a strictly proper loss, the evidential network yields the true model posterior probabilities over all possible datasets when perfectly converged.
Intuitively, the logarithmic loss encourages high evidence for the true model and low evidences for the alternative models. Correspondingly, if a dataset with certain characteristics can be generated by different models, evidence for these models will jointly increase. Additionally, the model which generates these characteristics most frequently will accumulate the most evidence and thus be preferred. However, we also want low evidence, or, equivalently, high epistemic uncertainty, for datasets which are implausible under all models. We address this problem in the next section.
2.5 Learning Absolute Evidence through Regularization
We now propose a way to address the scenario in which no model explains the observed data well. In this case, we want the evidential network to estimate low evidence for all models in the candidate set. In order to attenuate evidence for datasets which are implausible under all models considered, we incorporate a KullbackLeibler (KL) divergence into the criterion in Eq.14. We compute the KL divergence between the Dirichlet density generated by the neural network and a uniform Dirichlet density implying total uncertainty. Thus, the KL shrinks evidences which do not contribute to correct model assignments during training, so an implausible dataset at inference time will lead to low evidence under all models. This type of regularization has been used for capturing outofdistribution (OOD) uncertainty in image classification [sensoy2018evidential]. Accordingly, our modified optimization criterion becomes:
(16) 
with
(17) 
where the term represents the estimate evidence vector after removing the evidence for the true model. This is possible, because we know the true model during simulationbased training. For application on real data sets after training, knowing the ground truth is not required anymore as
has been obtained already at this point. The KL penalizes evidences for the false models and drives these evidences towards unity. Equivalently, the KL acts as a groundtruth preserving prior on the higherorder Dirichlet distribution which preserves evidence for the true model and attenuates misleading evidences for the false models. The hyperparameter
controls the weight of regularization and encodes the tolerance of the algorithm to accept implausible (outofdistribution) datasets. With large values of , it becomes possible to detect cases where all models are deficient; with , only relative evidence is generated. Note, that in the latter case, we recover our original proper criterion without penalization. The KL weight can be selected through prior empirical considerations on how well the simulations cover the plausible set of realworld datasets.Importantly, the introduction of the KL regularizer renders the loss no longer strictly proper. Therefore, a large regularization weight would lead to poorer calibration of the approximate model posteriors, as the regularized loss is no longer minimized by the true model posterior. However, since the KL prior is groundtruth preserving, the accuracy of recovering the true model should not be affected. Indeed, we observe this behavior throughout our experiments.
To make optimization tractable, we utilize the fact that we can easily simulate batches of the form via Algorithm 1 and approximate Eq.16 via standard backpropagation by minimizing the following loss:
(18) 
over multiple batches to converge at a Monte Carlo estimator of . In practice, convergence can be determined as the point at which the loss stops decreasing, a criterion similar to early stopping
. Alternatively, the network can be trained for a predefined number of epochs. Note, that, at least in principle, the network can be trained arbitrarily long, since we assume that we can access the joint distribution
through simulation (see (a)).2.6 Implicit Preference for Simpler Models
Remembering that
, we note that perfect convergence implies that preference for simpler models (Bayesian Occam’s razor) is automatically encoded by our method. This is due to the fact that we are approximating an expectation over all possible datasets, parameters, and models. Accordingly, datasets generated by a simpler model tend to be more similar compared to those from a more complex competitor. Therefore, during training, certain datasets which are plausible under both models will be generated more often by the simpler model than by the complex model. Thus, a perfectly converged evidential network will capture this behavior by assigning higher posterior probability to the simpler model (assuming equal prior probabilities). Therefore, at least in theory, our method captures complexity differences arising purely from the generative behavior of the models and does not presuppose an
ad hoc measure of complexity (e.g., number of parameters).2.7 Neural Network Architectures
As already discussed, we need specialized neural network architectures for dealing with data sets with variable numbers of observations and different probabilistic symmetries (e.g., i.i.d. or temporal ordering). In the following, we describe the network architectures used to tackle the most common probabilistic symmetries observed in the social and life sciences, that is, exchangeable and temporal sequences.
2.7.1 Exchangeable Sequences
The most prominent feature of i.i.d. sequences is permutation invariance, since changing the order of individual elements does not change the associated likelihood function or posterior. Accordingly, if we denote an arbitrary permutation of elements as , the following should hold for the associated model posteriors:
(19) 
We encode probabilistic permutation invariance by enforcing functional permutation invariance with respect to the outputs of the evidential network [bloem2019probabilistic]. Following recent work on deep sets [zaheer2017deep] and probabilistic symmetry [bloem2019probabilistic], we implement a deep invariant network performing a series of equivariant and invariant transformations. An invariant transformation is characterized by:
(20) 
that is, permuting the input elements does not change the resulting output. Such a transformation is often referred to as a pooling operation. On the other hand, an equivariant transformation is characterized by:
(21) 
that is, permuting the input is equivalent to permuting the output of the transformation. We parameterize a learnable invariant function via an invariant module
performing a sequence of nonlinear transformations followed by a pooling (sum) operation and another nonlinear transformation:
(22) 
where and can be arbitrary (nonlinear) functions, which we parameterize via fully connected (FC) neural networks. Figure 2 (left panel) presents a graphical illustration of the invariant module.
We parameterize a learnable equivariant transformation via an equivariant module performing the following operations for each input element :
(23) 
so that that is a combination of elementwise and invariant transforms (see Figure 2, center). Again, we parameterize the internal function via a standard FC neural network. Note, that an equivariant module also takes as an input the output of an invariant module, in order to increase the expressiveness of the learned transformation. Thus, each equivariant module contains a separate invariant module. Finally, we can stack multiple equivariant modules followed by an invariant module, in order to obtain a deep invariant evidential network :
(24) 
where denotes the vector of all learnable neural network parameters and the final invariant module implements a shifted ReLU output nonlinearity:
(25) 
in order to represent Dirichlet evidences ( is technically possible and valid but not of relevance for our purposes). The rightmost panel in Figure 2 provides a graphical illustration of a deep invariant network. We use this architecture in Experiments 1, 2 and 4.
2.7.2 NonExchangeable Sequences
One of the most common nonexchangeable sequences encountered in practice are timeseries with arbitrarily long temporal dependencies. A natural choice for time seriesdata with variable length are LSTM networks [gers1999learning], as recurrent networks are designed to deal with long sequences of variable size. Another reasonable choice might be 1D fully convolutional networks [long2015fully], which can also process sequences with variable length. A different type of frequently encountered nonexchangeable data are images, which have successfully been tackled via 2D convolutional networks.
Since, in this work, we apply our evidential method to timeseries models, we will describe an architecture for processing data with temporal dependencies. We use a combination of a LSTM and a 1D convolutional network to reduce the observed timeseries into fixedsize vector representations. We then concatenate these vectors and pass them through a standard fully connected network to obtain the final Dirichlet evidences. At a high level, our architecture performs the following operations on an input sequence . First, a manytoone LSTM network reduces the input sequence to a vector of predefined size. Then, a 1D convolutional network reduces the input sequence to a matrix where is the length of the filtered sequence and is the number of filters in the final convolutional layer. In this way, only the time dimension of depends on the length of the input. A mean pooling operator is then applied to the time dimension of to obtain a fixedsize representation of size . Finally, the outputs of the LSTM and convolutional networks are concatenated and fed through a FC network with a shifted ReLU output nonlinearity, which yields the Dirichlet evidences . Thus, the computational flow is as follows:
(26)  
(27)  
(28)  
(29) 
Figure 3 illustrates the computational flow of a sequence network. This architecture provides us with a powerful estimator for comparing models whose outputs consist of multivariate timeseries. All neural network parameters are jointly optimized during the training phase. We use this architecture in Experiments 3 and 5.
2.8 Putting it All Together
The essential steps of our evidential method are summarized in Algorithm 2. Note, that steps 47 and 912 can be executed in parallel with GPU support in order to dramatically accelerate convergence and inference. In sum, we propose to cast the problem of model selection as evidence estimation and learn a Dirichlet distribution over posterior model probabilities directly via simulations from the competing models. To this end, we train an evidential neural network which approximates posterior model probabilities and further quantifies the epistemic uncertainty as to whether the true model is contained in the set of candidate models. Moreover, once trained on simulations from a set of models, the network can be reused and extended with new models across a research domain, essentially amortizing the model comparison process. Accordingly, if the priors over model parameter do not change, multiple researchers can reuse the same network for multiple applications. If the priors over model parameters change or additional models need to be considered, the parameters of a pretrained network can be adjusted or the network extended with additional output nodes for the new models.
3 Experiments
3.1 Training the Networks
We train all neural networks described in this paper via minibatch gradient descent. For all following experiments, we use the Adam optimizer with a starter learning rate of and an exponential decay rate of . We did not perform an extensive search for optimal values of network hyperparameters. All networks were implemented in Python using the TensorFlow library [abadi2016tensorflow] and trained on a singleGPU machine equipped with NVIDIA^{®} GTX1060 graphics card^{2}^{2}2Code and simulation scripts for all experiments are available at https://github.com/stefanradev93/BayesFlow..
3.2 Performance Metrics
Throughout the following examples, we use a set of performance metrics to assess the overall performance of our method. To test how well the method is able to recover the true model (hard assignment of model indices), we compute the accuracy of recovery as the fraction of correct model assignments over the total number of test datasets, where model assignments are done by selecting the model with the highest probability. To test how well the posterior probability estimates of the evidence network match the true model posteriors, we compute the expected calibration error (ECE, [guo2017calibration]
). The ECE measures the gap between the confidence and the accuracy of a classifier and is an unbiased estimate of exact miscalibration
[guo2017calibration]. In practice, we will report calibration curves for each model, as these are easier to interpret for multiclass classification problems. Finally, to ensure that the method does not exhibit overconfidence, we compute an overconfidence metric which is given by the difference between a high probability threshold (e.g., ) and the accuracy of the model above this threshold: where is the set of indices of predicted probabilities larger than . Any deviation from zero would be indicative of overconfidence and thus lack of confidence in the method’s estimates.3.3 Experiment 1: BetaBinomial Model with Known Analytical Marginal Likelihood
As a basic proofofconcept for our evidential method, we apply it to a toy model comparison scenario with an analytically tractable marginal likelihood. Here, we pursue the following goals. First, we want to demonstrate that the estimated posterior probabilities closely approximate the analytic model posteriors. To show this, we compare the analytically computed vs. the estimated Bayes factors. Second, we want to demonstrate that the estimated model probabilities are wellcalibrated. Last, we want to show that accuracy of true model recovery matches closely the accuracy obtained by the analytic Bayes factors across all . For this, we consider the simple betabinomial model given by:
(30)  
(31) 
The analytical marginal likelihood of the betabinomial model is:
(32) 
where denotes the number of successes in the trials. For this example, we will consider a model comparison scenario with two models, one with a flat prior on the parameter , and one with a sharp prior . The two prior densities are depicted in (a).
We train a small permutation invariant evidential network with batches of size
until convergence. For each batch, we draw the samples size from a discrete uniform distribution
and input the raw binary data to the network. We validate the network on 5000 separate validation datasets for each . Convergence took approximately 15 minutes. Inference on all 5000 validation datasets less than 2 seconds.achieved with both the analytic and the estimated Bayes factors (the shaded region represents a 95% bootstrap confidence interval around the accuracies of the evidential network).
First, we observe that the estimated Bayes factors closely approximate the analytic Bayes factors ((b)). We also observe no systematic under or overconfidence in the estimated Bayes factors, which is indicated by the calibration curve resembling a straight line ((c)). Finally, the accuracy of recovery achieved with the estimated Bayes factors closely matches that of the analytic Bayes factors across all sample sizes ((d)).
3.4 Experiment 2: 400 Gaussian Mixture Models
With this example, we want to show that our method is capable of performing model comparison on problems involving hundreds of competing models. Further, we want to corroborate the desired improvement in accuracy with increasing number of observations, as shown in the previous experiment.
To this aim, we construct a setting with 2D Gaussian Mixture Models (GMMs) with mixture components. The construction of the models and data proceeds as follows. We first specify the mean vectors on two linearlyspaced grids: and for . We then generate data from each GMM model by:
(33)  
(34)  
(35)  
(36) 
where denotes the discrete uniform distribution and
the identity matrix of appropriate dimension. Thus, each dataset consists of
samples from one of GMMs with different component mean vectors. (a) depicts simulated datasets from GMM with linearly increasing coordinates showing that differences between the models are very subtle.We train an invariant evidential network for epochs for approximately hours wallclock time. We then validate the performance of the network on all possible between and with previously unseen simulated datasets. (b) depicts a heatmap of the normalized confusion matrix obtained on the validation data sets with . Importantly, we observe that the main diagonal of the heatmap indicates that the predicted model indices mostly match the true model indices, with mistakes occurring mainly between models with very close means. Bootstrap mean accuracy at was around (SD=0.007), which is times better than chance accuracy ( considering the dimensionality of the problem. Finally, (c) depicts the accuracy of recovery over all considered. Again, we observe that accuracy improves as more observations become available, with sublinear scaling.
3.5 Experiment 3: Markov Jump Process of Stochastic Chemical Reaction Kinetics
With this example, we first want to apply our evidential method to a simple model of nonexchangeable chemical molecule concentration data. We also want to demonstrate the efficiency benefits of our amortized learning method compared to the standard nonamortized ABCSMC algorithm.
We define two Markov jump process models and for conversion of (chemical) species to species :
(37)  
(38) 
Each model has a single rate parameter . We use the Gillespie simulator to generate simulated timeseries from the two models with an upper time limit of seconds. Both models start with initial concentrations and , so they only differ in terms of their reaction kinetics. The input timeseries consist of a time vector as well as two vectors of molecule concentrations for each species at each time step, and , which we stack together. We place a wide uniform prior over each rate parameter: .
We train an evidential sequence network for epochs of minibatch updates and validate its performance on previously unobserved timeseries. Wallclock training time was approximately minutes. In contrast, wallclock inference time on the validation timeseries was ms, leading to dramatic amortization gains. The bootstrap accuracy of recovery was over the entire validation set.
We also apply the ABCSMC algorithm available from the pyABC [klinger2018pyabc] library to a single dataset generated from model 1 () with rate parameter . Figure 6 depicts the observed data generated from model 1 (left panel) as well as observed data generated from model 2 with . Notably, the differences in the datagenerating processes defined by the two models are subtle and not straightforward to explicitly quantify.
In the ABCSMC method, we set the minimum rejection threshold to and the maximum number of populations to , as these settings lead to perfect recovery of the true model. As a distance function, we use the norm between the raw concentration timesseries of species , evaluated at time points.^{3}^{3}3These settings were picked from the original pyABC documentation available at https://pyabc.readthedocs.io/en/latest/examples/chemical_reaction.html
The convergence of the ABCSMC algorithm on the single data set took minutes wallclock time. Thus, inference on the validation datasets would have taken more than days to complete. Accordingly, we see that the training effort with our method is worthwhile even after as few as 5 datasets. As for recovery on the single test dataset, ABCSMC selects the true model with a probability of , whereas our evidential networks outputs a probability of which results in a negligible difference of between the results from two methods.
3.6 Experiment 4: Stochastic Models of Decision Making
In this experiment, we apply our evidential method to compare several nontrivial nested stochastic evidence accumulator models (EAMs) from the field of decision making [usher2001time, ratcliff2008diffusion]. With this experiment, we want to demonstrate the performance of our method in terms of accuracy and posterior calibration on exchangeable data obtained from complex, researchrelevant models. Additionally, we want to demonstrate how our regularization scheme can be used to capture absolute evidence by artificially rendering the data implausible under all models.
3.6.1 Model Comparison Setting
EAMs describe the dynamics of decision making via different neurocognitively plausible parameters (i.e., speed of information processing, decision threshold, bias/preactivation, etc.). EAMs are most often applied to choice reaction times (RT) data to infer neurocognitive processes underlying generation of RT distributions in cognitive tasks. The most general form of an EAM is given by a stochastic differential equation:
(39) 
where denotes a change in activation of an accumulator, denotes the average speed of information accumulation (often termed the drift rate), and represents a stochastic additive component with .
Multiple flavors of the above stated basic EAM form exist throughout the literature [usher2001time, ratcliff2008diffusion, turner2016bayesian, brown2008simplest]. Moreover, most EAMs are intractable with standard Bayesian methods [brown2008simplest], so model selection is usually hard and computationally cumbersome. With this example, we pursue several goals. First, we want to demonstrate the utility of our method for performing model selection on multiple nested models. Second, we want to empirically show that our method implements Occam’s razor. Third, we want to show that our method can indeed provide a proxy for absolute evidence.
0  0  0  0  
0  0  0  0  
0  0  0  
0  0  
0  
To this end, we start with a very basic EAM defined by four parameters with denoting the speed of information processing (drift rate) for two simulated RT experimental tasks , denoting the decision threshold, and denoting an additive constant representing the time required for nondecisional processes like motor reactions. We then define five more models with increasing complexity by successively freeing the parameters (bias), (heavytaildness of noise distribution), (variability of nondecision time), (varibaility of driftrate), and (variability of bias). Note, that the inclusion of nonGaussian diffusion noise renders an EAM model intractable, since in this case no closedform likelihood is available (see [voss2019sequential] for more details). Table 1 lists the priors over model parameters as well as fixed parameter values.
The task of model selection is thus to choose among six nested EAM models , each able to capture increasingly complex behavioral patterns. Each model is able to account for all datasets generated by the previous models , since the previous models are nested within the th model. For instance, model can generate all data sets possible under the other models at the cost of increased functional and parametric complexity. Therefore, we need to show that our method encodes Occam’s razor purely through the generative behavior of the models.
In order to show that our regularization method can be used as a proxy to capture absolute evidence, we perform the following experiment. We define a temporal shifting constant (in units of seconds) and apply the shift to each response time in each validation data set. Therefore, as increases, each data set becomes increasingly implausible under all models considered. For each , we compute the average uncertainty over all shifted validation data sets and plot is as a function of . Here, we only consider the maximum number of trials .
We train three evidential neural networks with different KL weights: in order to investigate the effects of on accuracy, calibration, and uncertainty. All networks were trained with variable number of trials per batch for a total of 50000 iterations. The training of each network took approximately half a day on a single computer. In contrast, performing inference on 5000 data sets with a pretrained network took less than 10 seconds.
3.6.2 Validation Results
To quantify the global performance of our method, we compute the accuracy of recovery as a function of the number of observations () for each of the models. We also compute the epistemic uncertainty as a function of . To this end, we generate 500 new datasets for each and compute the accuracy of recovery and average uncertainty. These results are depicted in Figure 7.
Accuracy. We observe that accuracy of recovery increases with increasing sample size and begins to flatten out around , independently of the regularization weight ((a)). This behavior is desirable, as selecting the true model should become easier when more information is available. Further, since the models are nested, perfect recovery is not possible, as the models exhibit a large shared data space.
Calibration. (d) depicts calibration curves for each model and each regularization value. The unregularized network appears to be very well calibrated, whereas the regularized networks become increasingly underconfident with increasing regularization weight. This is due to the fact that the regularized networks are encouraged to generate zero evidence for the wrong models, so model probabilities become miscalibrated. Importantly, none of the networks shows overconfidence.
Occam’s Razor. We also test Occam’s razor by generating data sets from each model with and compute the average predicted model posterior probabilities by the unregularized network. Thus, all data sets generated by model are plausible under the remaining models . These average model probabilities are depicted in (e). Even though dataset generated by the nested simpler models are plausible under the more complex models, we observe that Occam’s razor is encoded by the behavior of the network, which, on average, consistently selects the simpler model when it is the true datagenerating model. We also observe that this behavior is independent of regularization (results for and are not depicted in (e))
Epistemic uncertainty and absolute evidence. Epistemic uncertainty over different trial numbers is zero when no KL regularization is applied (). On the other hand, both small () or large () regularization weights lead to nonzero uncertainty over all possible ((b)). This pattern reflects a reduction in epistemic uncertainty with increasing amount of information and mirrors the inverse of the recovery curve. Note, that the value at which epistemic uncertainty begins to flatten out is larger for the highly regularized model, as it encodes more cautiousness with respect to the challenging task of selecting a true nested model. Finally, results on shifted datasets are depicted in (c). Indeed, we observe that the regularized networks are able to detect implausible data sets and output total uncertainty around for all manipulated data sets. Uncertainty increases faster for high regularization. On the other hand, the unregularized model does not have any way of signaling impossibility of a decision, so its uncertainty remains at over all .
3.7 Experiment 5: Stochastic Models of SingleNeuron Activity
In this section, the evidential method is applied to a number of nested spiking neuron models describing the properties of cells in the nervous system. The purpose of this experiment is twofold. On the one hand, we want to assess the ability of our method to classify models deploying a variety of spiking patterns which might account for different cortical and subcortical neuronal activity. On the other hand, we want to challenge the network’s ability to detect biologically implausible data patterns as accounted by epistemic uncertainty. To this aim, we rely on a renowned computational model of biological neuronal dynamics.
3.7.1 Model Comparison Setting
In computational neuroscience, mathematical modeling of neuronal electrical dynamics serves as a basis to investigate and understand the functional organization of the brain from both single neuron and largescale neuronal networks processing perspectives [hodgkin1952quantitative, burkitt2006review, izhikevich2003simple, abbott1990model, dura2019netpyne]. A plurality of different neuron models have been proposed during the last decades, spanning from completely abstract to biologically detailed models. The former offer a simplified mathematical representation able to account for the main functional properties of spiking neurons, the latter are thought to provide a detailed analogy between models’ state variables and ion channels in biological neurons [pospischil2011comparison]. Apparently, the computational models differ in their capability to reproduce firing patterns observed in real cortical neurons [izhikevich2004model].
The model considered here is a HodgkinHuxley type model of cerebral cortex and thalamic neurons [pospischil2008minimal, hodgkin1952quantitative]
. The forward model is formulated as a set of five ordinary differential equations (ODEs) describing how the neuron membrane potential
unfolds in time as a function of an injected current , and ion channels properties. The changing in membrane potential is defined by the first ODE, the membrane equation:(40) 
where is the specific membrane capacitance, the intrinsic neural noise, and the s are the ionic currents flowing through channels, such that:
(41)  
(42)  
(43)  
(44) 
Here, is the leak conductance, while are the sodium, potassium and Mtype channel maximum conductances, respectively. and denote the leak equilibrium potential, the sodium and potassium reversal potentials, respectively. In particular, is assumed constant through time, whilst the other conductances vary over time. Consistently, indicates the vector of the state variables accounting for ion channel gating kinetics evolving according to the following set of ODEs:
(45)  
(46) 
where , and , , and are nonlinear functions of the membrane potential (see [pospischil2008minimal] for details).
Here, we treat conductances and as free parameters, and consider different neuronal models based on different parameter configurations. It is also assumed that such configurations allow to affect the span of the possible firing patterns attainable by each model. In particular, we consider 3 models defined by the parameter sets , and . We place the following priors over the parameters and :
(47)  
(48) 
When not considered as free parameters, and are set to default values, such that and , otherwise parameters are drawn from the following priors:
(49)  
(50) 
(a) depicts example runs from each model with respective parameters , , .
In order to assess performance, we train an unregularized evidential network for epochs resulting in mini batch updated. For each batch, we draw a random input current duration (in units of milliseconds), with the same constant input current, , for each dataset simulation. Here, reflects the physical time window in which biological spiking patterns can unfold. Since the sampling rate of membrane potential is fixed (), affects both the span of observable spiking behaviour and the number of simulated data points. The entire training phase with online learning took approximately hours. On the other hand, model comparison on validation time series took approximately seconds, which highlights the extreme efficiency gains obtainable via globally amortized inference.
3.7.2 Validation Results
Regarding model selection, we observe accuracies above across all , with no gains in accuracy for increasing , which shows that even short input currents are sufficient for performing reliable model selection for these models. Further, mean bootstrap calibration curves and accuracies on validation data sets are depicted in (d). We observe good calibration for all three models, with calibration errors less than . Notably, overconfidence was for all three models. The normalized confusion matrix is depicted in (b).
In order to asses how well we can capture epistemic uncertainty for biologically implausible datasets, we train another evidential network with a gradually increasing regularization weight up to . We then fix the parameter of model and gradually increase its second parameter from to . Since spiking patterns observed with low values of are quite implausible and have not been observed during training, we expect uncertainty to gradually decrease. Indeed, (c) shows this pattern. Three example firing patterns and the corresponding posterior estimates are depicted in Figure 9. On the other hand, changing the sign of the output membrane potential, which results in biologically implausible datasets, leads to a trivial selection of . This is contrary to expectations, and shows that absolute evidence is also relative to what the evidential network has learned during training.
4 Discussion
With the current work, we introduced a novel simulationbased method for approximate Bayesian model comparison based on specialized evidential neural networks, We demonstrated that our method can successfully deal with both exchangeable () and nonexchangeable (time dependent) sequences with variable numbers of observations without relying on fixed summary statistics. Further, we present a way to globally amortize the process of model comparison for a given family of models, by splitting it into a potentially costly global upfront training and a cheap inference phase. In this way, pretrained evidential networks can be stored, shared, and reused across multiple data sets and model comparison applications. Finally, we demonstrated a way to obtain a measure of absolute evidence in spite of operating in an closed framework during the simulation phase. In the following, we reiterate the main advantages of our method.
Theoretical guarantee. By using a strictly proper loss [gneiting2007strictly], we showed that our method can closely approximate analytic model posterior probabilities and Bayes factors in theory and practice. In other words, posterior probability estimates are perfectly calibrated to the true model posterior probabilities when the strictly proper logarithmic loss (Eq.18) is globally minimized. Indeed, our experiments confirm that the network outputs are well calibrated. However, when optimizing the regularized version of the logarithmic loss (Eq.17), we are no longer working with a strictly proper loss, so calibration declines at the cost of capturing implausible data sets. However, we demonstrated that the accuracy of recovery (i.e., selecting the most plausible model in the set of considered models) does not suffer when training with regularization. In any case, asymptotic convergence is never guaranteed in finitesample scenarios, so validation tools such as calibration and accuracy curves are indispensable in practical applications.
Amortized inference. Following ideas from inference compilation [le2016inference] and prepaid parameter estimation [mestdagh2019prepaid]
, our method avoids fitting each candidate model to each data set separately. Instead, we cast the problem of model comparison as a supervised learning of absolute evidence and train a specialized neural network architecture to assign model evidences to each possible data set. This requires only the specification of plausible priors over each model’s parameters and the corresponding forward process, from which simulations can be obtained on the fly. During the upfront training, we use online learning to avoid storage overhead due to large simulated grids or reference tables
[marin2018likelihood, mestdagh2019prepaid]. Importantly, the separation of model comparison into a costly upfront training phase and a cheap inference phase ensures that subsequent applications of the pretrained networks to multiple observed data sets are very efficient. Indeed, we showed in our experiments that inference on thousands of data sets can take less than a second with our method. Moreover, by sharing and applying a pretrained network for inference within a particular research domain, results will be highly reproducible, since the settings of the method will be held constant in all applications.Raw data utilization and variable sample size. The problem of insufficient summary statistics has plagued the field of approximate Bayesian computation for a long time, so as to deserve being dubbed the curse of insufficiency [marin2018likelihood]. Using suboptimal summary statistics can severely compromise the quality of posterior approximations and validity of conclusions based on these approximations [robert2011lack]. Our method avoids using handcrafted summaries by aligning the architecture of the evidential neural network to the inherent probabilistic symmetry of the data [bloem2019probabilistic]. Using specialized neural network architectures, such as permutation invariant networks or a combination of recurrent and convolutional networks, we also ensure that our method can deal with data sets containing variable numbers of observations. Moreover, by minimizing the strictly proper version of the logarithmic loss, we ensure that asymptotic convergence implies maximal data utilization by the network.
Absolute evidence and epistemic uncertainty
. Besides point estimates of model posterior probabilities, our evidential networks yield a full higherorder probability distribution over the posterior model probabilities themselves. By choosing a Dirichlet distribution, we can use the mean of the Dirichlet distribution as the best approximation of model posterior probabilities. Beyond that, following ideas from the study of subjective logic
[jsang2018subjective] and uncertainty quantification in classification tasks [sensoy2018evidential], we can extract a measure of epistemic uncertainty. We employ epistemic uncertainty to quantify the impossibility of making a model selection decision based on a data set, which is classified as implausible under all candidate models. Therefore, the epistemic uncertainty serves as a proxy to measure absolute evidence, in contrast to relative evidence, as given by Bayes factors or posterior odds. This is an important practical advantage, as it allows us to conclude that all models in the candidate set are a poor approximation of the datagenerating process of interest. Indeed, our initial experiments confirm that our measure of epistemic uncertainty increases as data sets no longer lie within the range of certain models. However, extensive validation is needed in order to explore and understand which aspects of an observed sample lead to model misfit.These advantageous properties notwithstanding, our proposed method has certain limitations. First, our regularized optimization criterion induces a tradeoff between calibration and epistemic uncertainty, as confirmed by our experiments. This tradeoff is due to the fact that we capture epistemic uncertainty via a special form of KullbackLeibler (KL) regularization during the training phase, which renders the optimized loss function no longer strictly proper. We leave it to future research to investigate whether this tradeoff is fundamental and whether there are more elegant ways to quantify absolute evidence from a simulationbased perspective. Second, our method is intended for model comparison from a prior predictive (marginal likelihood) perspective. However, since we do not explicitly fit each model to data, we cannot perform model comparison/selection based on posterior predictive performance. We note that in certain scenarios, posterior predictive performance might be a preferable metric for model comparison, so in this case, simulationbased sampling methods should be employed (e.g., ABC or neural density estimation, [da2018model, papamakarios2018sequential]). Third, asymptotic convergence might be hard to achieve in realworld applications. In this case, approximation error will propagate into model posterior estimates. Therefore, it is important to use performance diagnostic tools, such as calibration curves, accuracy of recovery, and overconfidence bounds, in order to detect potential estimation problems. Finally, even though our method exhibits excellent performance on the domain examples considered in the current work, it might break down in highdimensional parameter spaces. Future research should focus on applications to even more challenging model comparison scenarios, for instance, hierarchical Bayesian models with intractable likelihoods, or neural network models.
Funding
This research was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation; grant number GRK 2277 "Statistical Modeling in Psychology"). We thank the Technology Industries of Finland Centennial Foundation (grant 70007503; Artificial Intelligence for Research and Development) for partial support of this work.
Acknowledgments
We thank David Izydorczyk and Mattia Sensi for reading the paper and providing constructive feedback.