Unbiased and Efficient Log-Likelihood Estimation with Inverse Binomial Sampling

The fate of scientific hypotheses often relies on the ability of a computational model to explain the data, quantified in modern statistical approaches by the likelihood function. The log-likelihood is the key element for parameter estimation and model evaluation. However, the log-likelihood of complex models in fields such as computational biology and neuroscience is often intractable to compute analytically or numerically. In those cases, researchers can often only estimate the log-likelihood by comparing observed data with synthetic observations generated by model simulations. Standard techniques to approximate the likelihood via simulation either use summary statistics of the data or are at risk of producing severe biases in the estimate. Here, we explore another method, inverse binomial sampling (IBS), which can estimate the log-likelihood of an entire data set efficiently and without bias. For each observation, IBS draws samples from the simulator model until one matches the observation. The log-likelihood estimate is then a function of the number of samples drawn. The variance of this estimator is uniformly bounded, achieves the minimum variance for an unbiased estimator, and we can compute calibrated estimates of the variance. We provide theoretical arguments in favor of IBS and an empirical assessment of the method for maximum-likelihood estimation with simulation-based models. As case studies, we take three model-fitting problems of increasing complexity from computational and cognitive neuroscience. In all problems, IBS generally produces lower error in the estimated parameters and maximum log-likelihood values than alternative sampling methods with the same average number of samples. Our results demonstrate the potential of IBS as a practical, robust, and easy to implement method for log-likelihood evaluation when exact techniques are not available.

There are no comments yet.

Authors

• 1 publication
• 4 publications
• 3 publications
07/20/2020

Maximum likelihood estimation for matrix normal models via quiver representations

In this paper, we study the log-likelihood function and Maximum Likeliho...
10/24/2020

PEP: Parameter Ensembling by Perturbation

Ensembling is now recognized as an effective approach for increasing the...
07/24/2019

Some computational aspects of maximum likelihood estimation of the skew-t distribution

Since its introduction, the skew-t distribution has received much attent...
09/13/2019

Estimating Fisher Information Matrix in Latent Variable Models based on the Score Function

The Fisher information matrix (FIM) is a key quantity in statistics as i...
11/02/2020

Noise-Contrastive Estimation for Multivariate Point Processes

The log-likelihood of a generative model often involves both positive an...
10/12/2011

Improving parameter learning of Bayesian nets from incomplete data

This paper addresses the estimation of parameters of a Bayesian network ...
09/24/2019

Numerical evaluation of the transition probability of the simple birth-and-death process

The simple (linear) birth-and-death process is a widely used stochastic ...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The likelihood function

is one of the most important mathematical objects for modern statistical inference. Briefly, the likelihood function measures how well a model with a given set of parameters can explain an observed data set. For a data set of discrete observations, the likelihood has the intuitive interpretation of the probability that a random sample generated from the model matches the data, for a given setting of the model parameters.

In many scientific disciplines, such as computational neuroscience and cognitive science, computational models are used to give a precise quantitative form to scientific hypotheses and theories. Statistical inference then plays at least two fundamental roles for scientific discovery. First, our goal may be parameter estimation for a model of interest. Parameter values may have a significance in themselves, for example we may be looking for differences in parameters between distinct experimental conditions in a clinical or behavioral study. Second, we may be considering a number of competing scientific hypotheses, instantiated by different models, and we want to evaluate which model ‘best’ captures the data according to some criteria, such as explanation (what evidence the data provide in favor of each model?) and prediction (which model best predicts new observations?).

Crucially, the likelihood function is a key element for both parameter estimation and model evaluation. A principled method to find best-fitting model parameters for a given data set is maximum-likelihood estimation (MLE), which entails optimizing the likelihood function over the parameter space (Myung, 2003)

. Other common parameter estimation methods, such as maximum-a-posteriori (MAP) estimation or full or approximate Bayesian inference of posterior distributions, still involve the likelihood function

(Gelman et al., 2013). Moreover, almost all model comparison metrics commonly used for scientific model evaluation are based on likelihood computations, from predictive metrics such as Akaike’s information criterion (AIC; Akaike, 1974), the deviance information criterion (DIC; Spiegelhalter et al., 2002), the widely applicable information criterion (WAIC; Watanabe, 2010), leave-one-out cross-validation (Vehtari et al., 2017); to evidence-based metrics such at the marginal likelihood (MacKay, 2003) and (loose) approximations thereof, such as the Bayesian information criterion (BIC; Schwarz et al., 1978) or the Laplace approximation (MacKay, 2003).

However, many complex computational models, such as those developed in computational biology (Pritchard et al., 1999; Ratmann et al., 2007; Wilkinson, 2011), neuroscience (Pospischil et al., 2008; Sterratt et al., 2011) and cognitive science (van Opheusden et al., 2016), take the form of a generative model or simulator, that is an algorithm which given some context information and parameter settings returns one or more simulated observations (a synthetic data set). In those cases, the likelihood is often impossible to calculate analytically, and even when the likelihood might be available in theory, the numerical calculations needed to obtain it might be overwhelmingly expensive and intractable in practice. In such situations, the only thing one can do is to run the model to simulate observations (‘samples’). In the absence of a likelihood function, common approaches to ‘likelihood-free inference’ generally try and match summary statistics of the data with summary statistics of simulated observations (Beaumont et al., 2002; Wood, 2010).

In this paper, we ask instead the question of whether we can use samples from a simulator model to directly estimate the likelihood of the full data set, without recurring to summary statistics, in a ‘correct’ and ‘efficient’ manner, for some specific definition of these terms. The answer is yes, as long as we use the right sampling method.

In brief, a sampling method consists of a ‘sampling policy’ (a rule that determines how long to keep drawing samples for) and an ‘estimator’ which converts the samples to a real-valued number. To estimate the likelihood of a single observation (e.g., the response of a participant on a single trial of a behavioral experiment), the most obvious sampling policy is to draw a fixed amount of samples from the simulator model, and the simplest estimator is the fraction of samples that match the observation (or is ‘close enough’ to it, for continuous observations). However, most basic applications, such as computing the likelihood of multiple observations, require one to estimate the logarithm of the likelihood, or log-likelihood (see Section 2.3 for the underlying technical reasons). The ‘fixed sampling’ method described above cannot provide unbiased estimates for the log-likelihood (see Section 3). Such bias vanishes in the asymptotic limit of infinite samples, but drawing samples from the model can be computationally expensive, especially if the simulator model is complex. In practice, the bias introduced by any fixed sampling method can translate to considerable biases in estimates of model parameters, or even reverse the outcome of model comparison analyses. In other words, using poor sampling methods can cause researchers to draw conclusions about scientific hypotheses which are not supported by their data.

In this work, we introduce inverse binomial sampling (IBS) as a powerful and simple technique for correctly and efficiently estimating log-likelihoods of simulator-based models. Crucially, IBS is a sampling method that provides uniformly unbiased estimates of the log-likelihood (Haldane, 1945b; de Groot, 1959) and calibrated estimates of their variance, which is also uniformly bounded.

We note that the problem of estimating functions

from observations of a Bernoulli distribution with parameter

has been studied for mostly theoretical reasons in the mid-th century, with major contributions from Haldane (1945a, b), Girshick et al. (1946), Dawson (1953) and de Groot (1959). These works have largely focused on deriving the set of functions for which an unbiased estimate exists, and demonstrating that for those functions, the inverse sampling policy (see Section 2.4) is in a precise sense ‘efficient’. Our main contribution here is to demonstrate that inverse binomial sampling provides a practically and theoretically efficient solution for a common problem in computational modeling; namely likelihood-free inference of complex models. To back up our claims, we provide theoretical arguments for the efficiency of IBS and a practical demonstration of its value for log-likelihood estimation and fitting of simulation-based models, in particular those used in computational cognitive science.111At time of submission, we became aware of a workshop paper with a similar goal of proposing inverse binomial sampling as a method for likelihood-free inference in certain econometric models (Duncan, 2004). However, the paper does not present an empirical assessment of the quality of the estimation and to our knowledge has not led to further adoption of IBS.

The paper is structured as follows. After setting the stage with useful definitions and notation (Section 2), we describe more in detail the issues with the fixed sampling method and why they cannot be fixed (Section 3). We then present a series of arguments for why IBS solves these issues, and in particular why being unbiased here is of particular relevance (Section 4). Then, we present an empirical comparison of IBS and fixed sampling in the setting of maximum-likelihood estimation (Section 5). As case studies, we take three model-fitting problems of increasing complexity from computational cognitive science: an ‘orientation discrimination’ task, a ‘change localization’ task, and a complex sequential decision making task. In all problems, IBS generally produces lower error in the estimated parameters than fixed sampling with the same average number of samples. IBS also returns solutions that are very close in value to the true maximum log-likelihood. We conclude by discussing further applications and extensions of IBS (Section 6). Our theoretical analyses and empirical results demonstrate the potential of IBS as a practical, robust, and easy-to-implement method for log-likelihood evaluation when exact or numerical solutions are unavailable.

Implementations of IBS with tutorials and examples are available at the following link: https://github.com/lacerbi/ibs.

2 Definitions and notation

The two fundamental ingredients to run IBS are:

1. A data set consisting of ‘trials’ characterized by ‘stimuli’ and discrete ‘responses’ .

2. A generative model for the data (also known as a ‘simulator’): a stochastic function that takes as input a stimulus

and a parameter vector

(and possibly other information) and outputs a response .

In this section we expand on and provide motivations for the above assumptions, and introduce related definitions and notation used in the rest of the paper.

Here and in the following, for ease of reference, we use the language of behavioral and cognitive modeling (e.g., ‘trial’ for data points, ‘stimulus’ for independent or contextual variables, ‘response’ for observations or outcomes), but the statistical techniques that we discuss in the paper apply to any model and data set from any domain as long as they satisfy the fundamental assumption of IBS delineated above.

2.1 The likelihood function

We assume that we want to model a data set consisting of ‘trials’ (data points), where

• is the stimulus (i.e., the experimental context, or independent variable) presented on the -th trial; typically, is a scalar or vector of discrete or continuous variables (more generally, there are no restrictions on what can be as long as the simulator can accept it as input);

• is the response (i.e., the experimental observations, outcomes, or dependent variables) measured on the -th trial; can be a scalar of vector, but crucially we assume it takes discrete values.

The requirement that be discrete will be discussed below, in Section 2.2.

Given a data set , and a model parametrized by parameter vector , we can write the likelihood function as

 Pr(D|θ)=N∏i=1Pr(ri|r1,…,ri−1,s1,…,sN,θ)=N∏i=1Pr(ri|r1,…,ri−1,s1,…,si,θ) (1)

where the first line follows from the chain rule of probability, and holds in general, whereas in the second step we applied the reasonable ‘causal’ (or ‘no-time-travel’) assumption that the response at the

-th trial is not influenced by future stimuli.222We also used the causality assumption that current responses are not influenced by future responses to choose a specific order to apply the chain rule in the first line.

Equation 1 describes the most general class of models, in which the response in the current trial might be influenced by the history of both previous stimuli and previous responses. Many models commonly make a stronger conditional independence assumption between trials, such that the response on the current trial only depends on the current stimulus. Under this stronger assumption, the likelihood takes a simpler form,

 Pr(D|θ)=N∏i=1Pr(ri|si,θ). (2)

While Equation 2 is simpler, it still includes a wide variety of models. For example, note that time-dependence can be easily included in the model by incorporating time into the ‘stimulus’ , and including time-dependent parameters explicitly in the model specification. In the rest of this work, for simplicity we consider models that make conditional independence assumptions as in Equation 2, but our techniques apply in general also for likelihoods as per Equation 1.

Given that the likelihood of the -th trial can be directly interpreted as the probability of observing response in the -th trial (conditioned on everything else), we denote such quantity with . The value is a function of , depends on the current stimulus and response, and may or may not depend on previous stimuli or responses.

With this notation, we can simply write the likelihood as

 Pr(D|θ)=N∏i=1pi. (3)

Finally, we note that for numerical convenience, given that can be large (in the hundreds if not thousands or more), it is common practice to work with the logarithm of the likelihood, or log-likelihood, that is

 L(θ)≡logPr(D|θ)=logN∏i=1pi=N∑i=1logpi. (4)

Crucially, we assume that the likelihood function is unavailable in a tractable form – for example, because the model is too complex to derive an analytical expression for the likelihood. Instead, IBS provides a technique for estimating Equation 4 via simulation.

2.2 The generative model or simulator

While we assume no availability of an explicit representation of the likelihood function, we assume that the model of interest is represented implicitly by a stochastic generative model (or ‘simulator’). In the most general case, the simulator is a stochastic function that takes as input the current stimulus , arrays of past stimuli and responses, and a parameter vector , and outputs a discrete response , conditional on all past events,

 ri∼g(s1,…,si,r1,…,ri−1;θ). (5)

As mentioned in the previous section, a common assumption for a model is that the response in the current trial only depends on the current stimulus and parameter vector, in which case

 r∼g(s;θ). (6)

For example, the model

could be simulating the responses of a human participant in a complex cognitive task; the (discrete) choices taken by a rodent in a perceptual decision-making experiment; or the spike count of a neuron in sensory cortex for a specific time bin after a stimulus presentation.

We list now the requirements that the simulator model needs to satisfy to be used in conjuction with IBS.

Discrete response space

Lacking an expression for the likelihood function, the only way to estimate the likelihood or any function thereof is by drawing samples on each trial, and matching them to the response . This approach requires that there is a nonzero probability for a random sample to match , hence the assumption that the space of responses is discrete. We will discuss in Section 6.3 a possible method to extend IBS to larger or continuous response spaces.

Conditional simulation

An important requirement of the generative model, stated implicitly by Equations 5 and 6, is that the simulator should afford conditional simulation, in that we can simulate the response for any trial , given the current stimulus , and possibly previous stimuli and responses. Note that this class of models, while large, does not include all possible simulators, in that some simulators might not afford conditional generation of responses. For example, models with latent dynamics might be able to generate a full sequence of responses given the stimuli, but it might not be easy (or even possible) to generate the response in a given trial, conditional on a specific sequence of previous responses.

Computational cost

Finally, for the purpose of some of our analyses we assume that drawing a sample from the generative model is at least moderately computationally expensive, which limits the approximate budget of samples one is willing to use for each likelihood evaluation (in our analyses, up to about a hundred, on average, per likelihood evaluation). Number of samples is a reasonable proxy for any realistic resource expenditure since most costs (e.g., time, energy, number of processors) would be approximately proportional to it. Therefore, we also require that every response value in the data has a non-negligible probability of being sampled from the model – given the available budget of samples one can reasonably draw. In this paper, we will focus on the low-sample regime, since that is where IBS considerably outperforms other approaches. For our analyses of performance of the algorithm, we also assume that the computational cost is independent of the stimulus, response or model parameters, but this is not a requirement of the method.

2.3 Reduction to Bernoulli sampling

Given the conditional independence structure codified by Equation 3, to estimate the log-likelihood of the entire data set, we cannot do better than estimating on each trial independently, and combining the results. However, combining estimates into an estimate of is non-trivial. Instead, it is easier to estimate for each trial and calculate the log-likelihood

 (7)

We can estimate this log-likelihood by simply summing estimates

across trials, in which case the central limit theorem guarantees that the distribution of

is normally distributed for large values of

, which is true for typical values of of the order of a hundred or more (see also Section 4.6).

We can make one additional simplification, without loss of generality. The generative model specifies an implicit probability distribution

for each trial. However, to estimate the log-likelihood, we do not need to know the full distribution, only the probability for a random sample from the model to match the observed response . Therefore, we can convert each sample to

 x={1if r=ri(hit')0otherwise(miss'), (8)

and lose no information relevant for estimating the log-likelihood. By construction, follows a Bernoulli distribution with probability . Note that this holds regardless of the type of data, the structure of the generative model or the model parameters. The only difference between different models and data sets is the distribution of the likelihood across trials. Moreover, since

is interpreted as the parameter of a Bernoulli distribution, we can apply standard frequentist or Bayesian statistical reasoning to it.

In conclusion, we can reduce the problem of estimating the log-likelihood of a given model by sampling to a smaller problem: given a method to draw samples from a Bernoulli distribution with unknown parameter , estimate as precisely and accurately as possible using on average as few samples as possible.

2.4 Sampling policies and estimators

A sampling policy is a function that, given a sequence of samples , decides whether to draw an additional sample or not (Girshick et al., 1946). In this work, we compare two sampling policies:

1. The commonly used fixed policy: Draw a fixed number of samples , then stop.

2. The inverse binomial sampling policy: Keep drawing samples until , then stop.

In our case, an estimator (of ) is a function that takes as input a sequence of samples and returns an estimate of . We recall that the bias of an estimator of , for a given true value of the Bernoulli parameter , is defined as

 Bias[^L|p]=E[^L]−logp, (9)

where the expectation is taken over all possible sequences generated by the chosen sampling policy under the Bernoulli probability . Such estimator is (uniformly) unbiased if for all (that is, the estimator is centered around the true value).

Fixed sampling

For the fixed sampling policy, since all samples are independent and identically distributed, a sufficient statistic for estimating from the samples is the number of ‘hits’, . The most obvious estimator for an obtained sequence of samples is then

 ^Lnaive(x)=log(m(x)M), (10)

but this estimator has infinite bias; since as long as , there is always a nonzero chance that , in which case (and thus ). This divergence can be fixed in multiple ways; in the main text we use

 ^Lfixed(x)=log(m(x)+1M+1). (11)

Note that any estimator based on the fixed sampling policy will always produce biased estimates of , as guaranteed by the reasoning in Section 3 below. As an empirical validation, we show in Appendix B.1 that our results do not depend on the specific choice of estimator for fixed sampling (Equation 11).

Inverse binomial sampling

For inverse binomial sampling we note that, since

is a binary variable, the policy will always result in a sequence of samples of the form

 x=(K0,0,0,0,0,…,0,1), (12)

where the length of the sequence is a stochastic variable, which we label (a positive integer). Moreover, since each sample is independent and a ‘hit’ with probability , the length

follows a geometric distribution with parameter

,

 Pr(K=k)=p(1−p)k−1. (13)

We convert a value of into an estimate for using the IBS estimator,

 ^LIBS(x)={0 for K=1−∑K−1k=11k for K>1. (14)

Crucially, Equation 14 combined with the IBS policy provides a uniformly unbiased estimator of (de Groot, 1959). Moreover, we can show that is the uniformly minimum-variance unbiased estimator of under the IBS policy. For a full derivation of the properties of the IBS estimator, we refer to Appendix A.1. Equation 14 can be written compactly as , where is the digamma function (Abramowitz and Stegun, 1948).

We now provide an understanding of why fixed sampling is not a good policy, despite its intuitive appeal, and then show why IBS solves many of the problems with fixed sampling.

3 Why fixed sampling fails

We summarize in Figure 1 the properties of the IBS estimator and of fixed sampling, for different number of samples , as a function of the trial likelihood

. In particular, we plot the expected number of samples, the bias, and the standard deviation of the estimators.

The critical disadvantage of the fixed sampling policy with samples is that its estimates of the log-likelihood are inevitably biased (see Figure 1B). Fixed sampling is ‘inevitably’ biased because the bias decreases as one takes more samples, but for , the estimator remains biased. More precisely, in a joint limit where , and for some constant , the bias collapses onto a single ‘master curve’ (see Figure 2; and Appendix A.2 for the derivation). In particular, we observe that the bias is close to zero for and that it diverges when , or equivalently, for and , respectively.

To convey the intuition for why the bias diverges for small probabilities, we provide a gambling analogy. Imagine playing a slot machine and losing the first bets you make. You can now deduce that this slot machine likely has a win rate less than , but there is no way of knowing whether it is , , or even

apart from any prior beliefs you may have (for example, you expect that the house has stacked the odds in their favor but not overwhelmingly so). In practice, this uncertainty is unlikely to affect your decision whether to continue playing the slot machine, since the expected value of the slot machine depends linearly on its win rate. However, if your goal is to estimate the

logarithm of the win rate, the difference between these percentages becomes infinitely large as the true win rate tends to 0. We provide a more formal treatment of the bias of fixed sampling in Appendix A.2.

3.1 Why fixed sampling cannot be fixed

The asymptotic analyses above suggest an obvious solution to prevent fixed sampling estimators from becoming strongly biased: make sure to draw enough samples so that . Although this solution will succeed in theory, it has practical issues. First of all, choosing requires knowledge of on each trial, which is equivalent to the problem we set out the solve in the first place. Moreover, even if one can derive or estimate an upper bound on (for example, in behavioral models that include a lapse rate, that is a nonzero probability of giving a uniformly random response), fixed sampling will be inefficient. As shown in Figure 2, the bias in is small when or and increasing even further has diminishing returns, at least for the purpose of reducing bias. If we choose inversely proportional to the probability on the trial where the model is least likely to match the observed response, we will draw many more samples than necessary for all other trials.

One might hope that in practice the likelihood is approximately the same across trials, but the opposite is true. As an example, take a typical ‘orientation discrimination’ psychophysical task in which a participant has to detect whether a presented oriented grating is tilted clockwise or anti-clockwise from vertical, and consider a generative model for the observer’s responses that includes sensory measurement noise and lapses (see Section 5.2 for details). Moreover, imagine that the experiment contains trials, and the participant’s true lapse rate is . The model will always assign more probability to correct responses than errors, so, for all correct trials, will be at least . However, there will likely be a handful of trials where the participant lapses and makes a grave error (responding incorrectly to a stimulus very far from the decision boundary), in which case will be . This hypothetical scenario is not exceptional, in fact it is almost inevitable in any experiment where participants occasionally make unpredictable responses, and perform hundreds or more trials.

A more sophisticated solution would relax the assumption that needs to be constant for all trials, and instead choose as a function of on each trial. However, since is unknown, one would need to first estimate by sampling, choose for each trial, then re-estimate . Such an iterative procedure would create a non-fixed sampling scheme, in which adapts to on each trial. This approach is promising, and it is, in fact, how we originally arrived at the idea of using inverse binomial sampling for log-likelihood estimation, while working on the complex cognitive model described in Section 5.4.

Finally, a heuristic solution would be to disregard any statistical concerns, pick

based on some intuition or from similar studies in the literature, and hope that the bias turns out to be negligible. We do not intend to dissuade researchers from using such pragmatic approaches if they work in practice. Unfortunately, this one does not. As Figure 2 shows, estimating log-likelihoods with fixed sampling can cause biases of or more points of model evidence if the data set contains even a single trial on which . Since differences in log-likelihoods larger than to points are often regarded as strong evidence for one model over another (Kass and Raftery, 1995; Jeffreys, 1998; Anderson and Burnham, 2002), it is well possible for such biases to reverse the outcome of a model comparison analysis. This point bears repeating; if one uses fixed sampling to estimate log-likelihoods and the number of samples is too low, one risks of drawing conclusions about scientific hypotheses that are not supported by the experimental data one has collected.

4 Is inverse binomial sampling really better?

While one could expect that the unbiasedness of the IBS estimator would come at a cost, such as more samples, a much higher variance, or perhaps a particularly complex implementation, we show here that IBS is not only unbiased, but it is sample-efficient, its estimates are low-variance, and can be implemented in a few lines of code.

4.1 Implementation

We present in Algorithm 1 a description in pseudo-code of the basic IBS procedure to estimate the log-likelihood of a given parameter vector for a given data set and generative model. The procedure is based on the inverse binomial sampling scheme introduced in Section 2.4, generalized sequentially to multiple trials.

For each trial, we draw sampled responses from the generative model, given the stimulus in that trial, using the subroutine sample_from_model, until one matches the observed response . This yields a value of on each trial , which IBS converts to an estimate (where we use the convention that a sum with zero terms equals 0). We make our way sequentially across all trials, returning then the summed log-likelihood estimate for the entire data set.

In practice, depending on the programming language of choice, it might be useful to take advantage of numerical features such as vectorization to speed up computations. An alternative ‘parallel’ implementation of IBS is described in Appendix C.1. Implementations of IBS in different programming languages can be found at the following web page: https://github.com/lacerbi/ibs.

4.2 Computational time

The number of samples that IBS takes on a trial with probability is geometrically distributed with mean . We saw earlier that for fixed-sampling estimators to be approximately unbiased, one needs at least samples, and IBS does exactly that. Moreover, since IBS adapts the number of samples it takes on different trials, it will be considerably more sample-efficient than fixed sampling with constant across trials. For example, in the aforementioned example of the orientation discrimination task, when most trials have a likelihood , IBS will often take just or samples on those trials. Therefore, it will allocate most of its samples and computational time on trials where is low and those samples are needed.

4.3 Variance

The derivation of the variance of the IBS estimator is similar to the calculation of its expected value (see Equation 25 in Appendix A.1), which yields

 Var[^LIBS]=∞∑k=11k2(1−p)k=Li2(1−p), (15)

where we introduced the dilogarithm or Spence’s function (Maximon, 2003). The variance (plotted in Figure 1C as standard deviation) increases when , but it does not diverge; instead, it converges to . Therefore, IBS is not only uniformly unbiased, but its variance is uniformly bounded.

We already mentioned that is the minimum-variance unbiased estimator of given the inverse binomial sampling policy, but it also comes close (less than distance) to saturating the information inequality, which specifies the minimum variance that can be theoretically achieved by any estimator under a non-fixed sampling policy (an analogue of the Cramer-Ráo bound; de Groot, 1959). We note that fixed sampling eventually saturates the information inequality in the limit , but as mentioned in the previous section, the fixed-sampling approach can be highly wasteful or substantially biased (or both), not knowing a priori how large has to be across trials. See Appendix A.3 for a full discussion of the information inequality and comparison between estimators.

Equation 15 has theoretical relevance, but requires us to know the true value of the likelihood , which is unknown in practice. Instead, we define the estimator of the variance of a specific IBS estimate, having sampled for times until a ‘hit’, as

 Var[^LIBS∣∣K]=ψ1(1)−ψ1(K), (16)

where is the trigamma function, the derivative of the digamma function (Abramowitz and Stegun, 1948). We derived Equation 16 from a Bayesian interpretation of the IBS estimator, which can be found in Appendix A.4.

4.4 Iterative multi-fidelity

We define a multi-fidelity estimator as an estimator with a tunable parameter that affords different degrees of precision at different computational costs (i.e., from a cheaper, inaccurate estimate to a very accurate but expensive one), borrowing the term from the literature on computer simulations and surrogate models (Kennedy and O’Hagan, 2000; Forrester et al., 2007). IBS provides a particularly convenient way to construct an iterative multi-fidelity estimator in that we can perform independent ‘repeats’ of the IBS estimate at , and combine them by averaging,

 ^LIBS-R(θ)=1RR∑r=1^L(r)IBS(θ)Var[^LIBS-R(θ)]=1R2R∑r=1Var[^L(r)IBS(θ)], (17)

where denotes the -th independent estimate obtained via IBS. For , we recover the standard (‘1-rep’) IBS estimator. The variances in Equation 17 are computed empirically using the estimator in Equation 16.

Importantly, we do not need to perform all repeats at the same time, but we can iteratively refine our estimates whenever needed, and only need to store the current estimate, its variance and the number of repeats performed so far:

 ^LIBS-R+1(θ)=1R+1[R⋅^LIBS-R(θ)+^L(r+1)IBS(θ)]Var[^LIBS-R+1(θ)]=1(R+1)2{R2⋅Var[^LIBS-R(θ)]+Var[^L(r+1)IBS(θ)]}. (18)

Crucially, while a similar procedure could be performed with any estimator (including fixed sampling), the fact that IBS is unbiased and its variance is bounded ensures that the combined iterative estimator is also unbiased and eventually converges to the true value for , with variance bounded above by .

Finally, we note that the iterative multi-fidelity approach described in this section can be extended such that, instead of having the same number of repeats for all trials, one could adaptively allocate a different number of repeats to each trial so as to minimize the overall variance of the estimated log-likelihood (see Appendix C.2).

4.5 Bias or variance?

In the previous sections, we have seen that IBS is always unbiased, whereas fixed sampling can be severely biased when using too few samples. However, with the right choice of , fixed sampling can have slightly lower variance. We now list several practical and theoretical arguments for why bias matters more than variance, and being unbiased is a highly desirable property for an estimator of the log-likelihood.

1. To use IBS or fixed sampling to estimate the log-likelihood of a given data set, we sum estimates of across trials. Basic rules of probability imply that, as , the standard deviation of will grow proportional to , whereas the bias grows linearly with .

2. When using the log-likelihood (or a derived metric) for model selection, it is common to collect evidence for a model, possibly hierarchically, across multiple datasets (e.g., different participants in a behavioral experiment), which provides a second level of averaging that can reduce variance but not bias.

3. Besides model selection, the other key reason to estimate log-likelihoods is to infer parameters of a model, for example via maximum-likelihood estimation. For this purpose, one would use an optimization algorithm that calls the routine that estimates many times with different candidate values of , and uses this information to estimate the value that maximizes . Powerful, sample-efficient optimization algorithms, such as those based on Bayesian optimization, work by building a statistical approximation (a surrogate) of the objective function (Jones et al., 1998; Snoek et al., 2012; Shahriari et al., 2015; Acerbi and Ma, 2017), most commonly via Gaussian processes (Rasmussen and Williams, 2006). These methods can operate successfully with noisy objectives by effectively averaging function values from nearby parameter vectors. By contrast, no optimization algorithm can handle bias. This argument is not limited to maximum-likelihood estimation, as recent methods have been proposed to use Gaussian process surrogates to perform (approximate) Bayesian inference and infer posterior distributions (Kandasamy et al., 2015; Acerbi, 2018; Järvenpää et al., 2019); also these techniques can handle variance in the estimates but not bias.

4. The ability to combine unbiased estimates of known variance iteratively (as described in Section 4.4) is particularly useful with adaptive fitting methods based on Gaussian processes, whose algorithmic cost grows super-linearly in the number of distinct training points (Rasmussen and Williams, 2006). Thanks to iterative multi-fidelity estimation, these methods would have the opportunity to refine their estimates of the log-likelihood at a previously evaluated point, whenever deemed useful, without incurring an increased algorithmic cost.

5. On a conceptual level, bias is much more dangerous than variance. Bias can cause researchers to confidently draw false conclusions, variance causes decreased statistical power and lack of confidence. Appropriate statistical tools can account for variance and explain seemingly conflicting findings resulting from underpowered studies (Maxwell et al., 2015), whereas bias is impossible to recognize or correct no matter what statistical techniques one uses.

4.6 Higher-order moments

So far, we have considered the mean (or bias) and variance of and

in detail, but ignored any higher-order moments. This is justified since to estimate the log-likelihood of a model with a given parameter vector, we will sum these estimates across many trials. Therefore, the central limit theorem guarantees that the distribution of

is Gaussian with mean and variance determined by the mean and variance of or on each trial, at least as long as the distribution of across trials satisfies a regularity condition.333Specifically, the Lindeberg (1922) or Lyapunov conditions (Ash et al., 2000, Chapter 7.3), both of which place restrictions on the degree to which the variance of any single trial can dominate the distribution of . A sufficient but far from necessary condition is that there exists a lower bound on , which is the case for example for a behavioral model with a lapse rate. Using the same argument, the total number of samples that IBS uses to estimate is also approximately Gaussian.

In the following, we demonstrate empirically that the distributions of the number of samples taken by IBS and of the estimates are Gaussian. Importantly, we also show that the estimate of the variance from Equation 16 is calibrated. As a realistic scenario, we consider the psychometric function model described in Section 5.2. For each simulated data set, we estimated the log-likelihood under the true data-generating parameters (see Appendix B.1 for details). Specifically, for each data set we ran IBS and recorded the estimated log-likelihood , the total number of samples taken, and a Bayesian estimate for the variance of from Equation 16. For the total number of samples and the estimate, we can compute the theoretical mean and variance by knowing the trial likelihoods , which we can evaluate exactly in this example.

For each obtained , we computed a

-score by subtracting the exact mean and dividing by the exact standard deviation, obtained by knowing the mean and variance of geometric random variables underlying the samples taken in each trial. If

is normally distributed, we expect that the variable across data sets should appear to be distributed as a standard normal, . If is not normally distributed, we should see deviations from normality in the distribution of , especially in the tails. By comparing the histogram of -scores with a standard normal in Figure 3A, we see that the total number of samples is approximately normal.

We did the same analysis for the estimate , using the -scored variable

 z≡^LIBS−Ltrue√Var[^LIBS], (19)

where here is the exact variance of the estimator computed via Equation 15. The histogram of -scores in Figure 3B is again very close to a standard normal.

Finally, in practical scenarios we do not know the true likelihoods, so the key question is whether we can obtain valid estimates of the variance of via Equation 16. If such estimate is correctly calibrated, the distribution of -scores should remain approximately Gaussian if we use Equation 16 for the denominator of Equation 19. Indeed, the calibration plot in Figure 3C shows an excellent match with a standard normal, confirming that our proposed estimator of the variance is well calibrated.

5 Numerical experiments

In this section, we examine the performance of IBS and fixed sampling on several realistic model-fitting problems of increasing complexity. The example problems we consider here model tasks drawn from psychophysics and cognitive science: an orientation discrimination experiment (Section 5.2); a change localization task (Section 5.3); and playing a four-in-a-row game that involves complex sequential decision making (Section 5.4). For the first problem, we can derive the exact analytical expression for the log-likelihood; for the second problem, we have an integral expression for the log-likelihood that we can approximate numerically; and finally, for the third problem, we are in the true scenario in which the log-likelihood is intractable.

First, we describe the procedure used to perform our numerical experiments. Code to run all our numerical experiments and analyses is available at the following link: https://github.com/basvanopheusden/ibs-development.

5.1 Procedure

For each problem, we simulate data from the generative model given different known settings of model parameters, and we compare the accuracy (and other statistics) of both IBS and fixed sampling in recovering the true data-generating parameters through maximum-likelihood estimation. Since these methods provide noisy and possibly biased estimates of , and due to variability in the simulated datasets, the estimates that result from optimizing the log-likelihood will also be noisy and possibly biased. To explore performance in a variety of settings, and to account for variability in the data-generation process, for each problem we consider different parameter settings, where is the number of model parameters (that is, the dimension of ), and for each parameter setting we generate distinct synthetic datasets.

For each dataset, we compare fixed sampling with different numbers of samples (from to or , depending on the problem), and IBS with different number of ‘repeats’ , as defined in Section 4.4 (from to up to , depending on the problem). If available, we also test the performance of maximum-likelihood estimation using the ‘exact’ log-likelihood function (calculated either analytically or via numerical integration).

For all methods, we maximize the log-likelihood with Bayesian Adaptive Direct Search (BADS, Acerbi and Ma, 2017; github.com/lacerbi/bads), a hybrid Bayesian optimization algorithm based on the mesh-adaptive direct search framework (Audet and Dennis Jr, 2006), which affords a fast, robust exploration of the function landscape via Gaussian process surrogates. BADS has been shown to be much more effective than alternative optimization methods particularly when dealing with stochastic objective functions, and with a relatively limited budget of a few hundreds to a few thousands function evaluations (Acerbi and Ma, 2017).

5.2 Orientation discrimination

The first task we simulate is an orientation discrimination task, in which a participant observes an oriented patch on a screen, and indicates whether they believe it was rotated leftwards or rightwards with respect to a reference line (see Figure 4A). Here, on each trial the stimulus is the orientation of the patch with respect to the reference (in degrees), and the response is ‘rightwards’ or ‘leftwards’.

For each dataset, we simulated trials, drawing on each trial the stimulus

from a Gaussian distribution with mean

(the reference) and standard deviation . The generative model assumes that the observer makes a noisy measurement of the stimulus, which is normally distributed with mean and standard deviation , as per standard signal detection theory (Green and Swets, 1966). They then respond ‘rightwards’ if is larger than (a parameter which captures response bias, or an incorrect memory of the reference line) and ‘leftward’ otherwise. However, a fraction of the time, given by the lapse rate , the observer guesses randomly. We visually illustrate the model in Figure 4B. For both theoretical reasons and numerical convenience, we parametrize the slope as . Thus, the model has parameter vector .

We can derive the likelihood of each trial analytically:

 Pr(`rightwards' response|s,θ)=γ2+(1−γ)Φ(s−μσ), (20)

where is the cumulative normal distribution function. Equation 20 takes the form of a typical psychometric function  (Wichmann and Hill, 2001). Note that in this section we use Gaussian distributions for circularly distributed variables, which is justified under the assumption that both the stimulus distribution and the measurement noise are small. For more details about the numerical experiments, see Appendix B.1.

In Figure 5, we show the parameter recovery using fixed sampling, IBS and the exact log-likelihood function from Equation 20. First, we show that IBS can estimate the sensory noise parameter and lapse rate more accurately than fixed sampling while using on average the same or fewer samples (Figure 5A,D). For visualization purposes, we show here a representative example with or repeats of IBS and or fixed samples (see Figure 15 in the Appendix for the plots with all tested values of and ). As baseline, we also plot the mean and standard deviation of exact maximum-likelihood estimation, which is imperfect due to the finite data size ( trials), and stochasticity and heuristics used in the optimization algorithm. We omit results for estimates of the response bias , since even fixed sampling can match the performance of exact MLE with only sample per trial.

Next, we fix and plot the mean and standard deviation of the estimated and across simulated data sets as a function of the (average) number of samples per trial used by IBS or fixed sampling (Figure 5B,E). Fixed sampling is highly sensitive to the number of samples, and with less than samples per trial, its estimate of is severely biased. Estimating remains impossible even with samples per trial. By contrast, IBS estimates and reasonably accurately regardless of of the number of samples per trial. IBS has a slight tendency to underestimate , which is a result of an interaction of the uncertainty handling in BADS with our choice of model parametrization and parameter bounds. In general, estimating lapse rates is notoriously prone to biases (Prins, 2012).

Finally, we measure the root mean squared error (RMSE) of IBS, fixed sampling and the exact solution, averaged across all simulated data sets, as a function of number of samples per trial (Figure 5C,F). This analysis confirms the same pattern: fixed sampling makes severe errors in estimating with fewer than samples, and for it requires as many as samples per trial to become approximately unbiased. IBS outperforms fixed sampling for both parameters and any number of samples, and even with as few as 2 or 3 repeats comes close to matching the RMSE of exact maximum-likelihood inference.

5.3 Change localization

The second problem we consider is a typical ‘change localization’ task (see Figure 6A), in which participants observe a display of oriented patches, and after a short inter-stimulus interval, a second display of patches (Van den Berg et al., 2012). Of these patches, are identical between displays and one denoted by will have changed orientation. The participant responds by indicating which patch they believe changed orientation. Here, on each trial the stimulus is a vector of 12 elements corresponding to a vector of orientations (in degrees) of the six patches in the first display, concatenated with the vector of orientation of the six patches in the second display. The response is the patch reported by the participant.

For each dataset, we simulated

trials. On each trial, the patches on the first display are all independently drawn from a uniform distribution

Uniform[0,360]. For the second display, we randomly select one of the patches and change its orientation by an amount drawn from a von Mises distribution centered at with concentration parameter . A von Mises distribution is the equivalent of a Gaussian distribution in circular space, and the concentration parameter is monotonically related to the precision (inverse variance) of the distribution. Note that, for mathematical convenience (but without loss of generality) we assume that patch orientations are defined on the whole circle, whereas in fact they are defined on the half-circle .

The generative model assumes that participants independently measure the orientation of each patch in both displays. For each patch, the measurement distribution is a von Mises centered on the true orientation with concentration parameter , representing sensory precision. The participant then selects the patch for which the absolute circular difference of the measurements between the first and second display is largest. This model too includes a lapse rate , the probability with which the participant guesses uniformly randomly across responses.

Since thinking in terms of concentration parameter is not particularly intuitive, we reparametrize participants’ sensory noise as , since in the limit , the von Mises distribution with concentration parameter tends to a Gaussian distribution with standard deviation . The model has then two parameters, .

We can express the trial likelihood for the change localization model in an integral form that does not have a known analytical solution (see Appendix B.2 for a derivation). We can, however, evaluate the integral numerically, which can take a few seconds for a high-precision likelihood evaluation across all trials in a dataset. The key quantity in the computation of the trial likelihood is , the difference in orientation between the changed stimulus at position between the first and second display. We plot the probability of a correct response, , as a function of in Figure 6B. As expected, the probability of a correct response increases monotonically with the amount of change, with the slope being modulated by sensory noise and the asymptote by the lapse rate (but also by the sensory noise, for large noise, as we will discuss later). For more details about the numerical experiments, see Appendix B.2.

In Figure 7, we compare the performance of IBS, fixed sampling and the ‘exact’ log-likelihood evaluated through numerical integration. As before, IBS estimates both and more accurately with fewer samples than fixed sampling (Figure 7A,D). As an example, we show IBS with repeats and fixed sampling with or ; the full results with all tested values of and are reported in Figure 16 in the Appendix.

Interestingly, maximum-likelihood estimation via the ‘exact’ method provides biased estimates of when the noise is high. This is because sensory noise and lapse become empirically non-identifiable for large , as large noise produces a nearly-flat response distribution, which is indistinguishable from lapse. This produces a ridge in the likelihood landscape, which may induce biases and spurious trade-offs between the two parameters. This issue can be ameliorated by using Bayesian inference instead of maximum-likelihood estimation (Acerbi et al., 2014). IBS and fixed sampling perform seemingly better here because the interaction between a ridge (or flat region) in the true likelihood landscape and noisy estimates thereof depend on specifics of the problem and of the optimization procedure. For these particular settings of , IBS and fixed sampling perform better at recovering than the ‘exact’ method, but it does not necessarily hold true in general.

In Figure 7B,E, we show the estimates of fixed sampling and IBS for simulated data with and , and find that fixed sampling severely underestimates when using less then samples, and underestimates even with samples per trial. By contrast, IBS produces parameter estimates with relatively little bias and standard deviation close to that of exact maximum-likelihood estimation. Finally, in Figure 7C,F we show that IBS has lower RMSE than fixed sampling for both parameters when compared on equal terms of number of samples.

5.4 Four-in-a-row game

The third problem we examine is a complex sequential decision-making task, a variant of tic-tac-toe in which two players compete to place pieces in a row, column or diagonal on a -by- board (see Figure 8A). In previous work, van Opheusden et al. (2016) showed that people’s decision-making process in this game can be modeled accurately as heuristic search

. A heuristic search algorithm makes a move in a given board state by searching through a decision tree of move sequences starting at that board state for a number of moves into the future. To decide which candidate future moves to include in the tree, the algorithm uses a

value function defined as

 V(board,move)=nf∑i=1wifi(board,move)+N(0,σ2), (21)

in which denotes a set of features (i.e., configurations of pieces on the board, such as ‘three pieces on a row of the same color’; see Figure 8B), the corresponding feature weights, and is a model parameter which controls value noise. As before, we parameterize the model with .

The interpretation of this value function is that moves which lead to a high value are given priority in the search algorithm, and the model is more likely to make those moves. As a heuristic to reduce the search space, any moves for which the value is less than that of the highest-value move minus a threshold parameter are pruned from the tree and never considered as viable options. Finally, when evaluating , the model stochastically omits features from the sum ; the probability for any feature to be omitted (or dropped) is independent with probability (the drop rate). van Opheusden et al. (2016) considered various heuristic search models with different feature sets, and estimated the value of feature weights as well as the size (number of nodes) of the decision tree based on human data. Here, we consider a reduced model in which the feature identity , feature weights and tree size are fixed (see Appendix B.3 for their values). Thus, the current model has three parameters, .

Note that even though the 4-in-a-row task is a sequential game, the heuristic search model makes an independent choice on each move, with the ‘stimulus’ on each trial being the current board state. Hence, the model satisfies the conditional independence assumptions of Equations 2 and 6. Note also that, even though the heuristic search algorithm can be specified as a generative ‘simulator’ which we can query to make moves in any board position, we have no way of calculating the distribution over its moves, since this would require integrating over all possible trees it could build, features which may be dropped, and realizations of the value noise. Therefore, we are in the scenario in which log-likelihood estimation is only possible by simulation, and we cannot compare the performance of fixed sampling or IBS to any ‘exact’ method.

To generate synthetic data sets for the 4-in-a-row task, we first compiled a set of board positions which occurred in human-versus-human play (van Opheusden et al., 2016). For each data set, we then randomly sampled board positions without replacement which we used as ‘stimuli’ for each trial, and sampled a move from the heuristic search algorithm for each position to use as ‘responses’. For more details about the numerical experiments, see Appendix B.3.

In Figure 9, we perform the same tests as before, comparing fixed sampling and IBS, but lacking any ‘exact’ estimation method. Due to the high computational complexity of the model, we only consider IBS with up to repeats, corresponding to samples. The full results with all tested values of and are reported in Figure 17 in the Appendix. As a specific example for Figure 9B,E,H we show the estimates of fixed sampling and IBS for simulated data with , pruning threshold and .

Fixed sampling underestimates the value noise , even when using samples, whereas IBS estimates it accurately with times fewer samples (Figure 9A). This bias of fixed sampling gets worse with fewer samples (Figure 9B), and overall, IBS outperforms fixed sampling when compared on equal terms (Figure 9C). The same holds true for the pruning threshold . IBS estimates about equally well as fixed sampling, but with about half as many samples (Figure 9D), fixed sampling is severely biased when using too few samples (Figure 9E) and overall, IBS outperforms fixed sampling.

The results are slightly more complicated for the feature drop rate . As before, fixed sampling produces severely biased estimates of with up to samples (Figure 9G), and the bias increases when using fewer samples (Figure 9H). However, for this parameter IBS is also biased, but towards (Figure 9G & H), which is the midpoint of the ‘plausible’ upper and lower bounds used as reference by the optimization algorithm (see Appendix B.3

for details). This bias can be interpreted as a form of regression towards the mean; likely a by-product of the optimization algorithm struggling with a low signal-to-noise ratio for this parameter and these settings (i.e., a nearly flat likelihood landscape for the amount of estimation noise on the log-likelihood). The negative bias of fixed sampling helps to reduce its variance in the low-

regime, and therefore in terms of RMSE, fixed sampling compares favorably with IBS regardless of its bias (Figure 9I).

5.5 Log-likelihood loss

In the previous sections, we have analyzed the bias and error of different estimation methods when recovering the generating model parameters in various scenarios. Another important question, crucial for model selection, is how well different methods are able to recover the true maximum log-likelihood. The ability to recover the true parameters and the true maximum log-likelihood are related but distinct properties because, for example, a relatively flat likelihood landscape could yield parameter estimates very far from ground truth, but still afford recovery of a value of the log-likelihood close to the true maximum. We recall that differences in log-likelihood much greater than one point are worrisome as they might significantly affect the outcomes of a model comparison (Kass and Raftery, 1995; Jeffreys, 1998; Anderson and Burnham, 2002).

To compute the log-likelihood loss of a method for a given data set, we estimate the difference between the ‘exact’ log-likelihood evaluated at the ‘true’ maximum-likelihood solution (as found after multiple optimization runs) and the ‘exact’ log-likelihood of the solution returned by the multi-start optimization procedure for a given method, as described in Section 5.1. In terms of methods, we consider IBS and fixed-sampling with different amounts of samples. We perform the analysis for the two scenarios, orientation discrimination (Section 5.2) and change localization (Section 5.3), for which we have access to the exact likelihood, either analytically or numerically.

The results in Figure 10 show that IBS, even with only a few repeats, is able to return solutions which are very close to the true maximum-likelihood solution in terms of log-likelihood (within 1-2 points); whereas fixed sampling remains severely biased even with large number of samples, being thus at risk of inducing wrong inferences in model selection. Note that our analyses of the loss are based on the ‘exact’ log-likelihood values evaluated at the solution returned by the optimization procedure. In practice, we would not have access to the ‘exact’ log-likelihood at the solution; but its value can be estimated up to the desired precision with IBS, by taking multiple repeats at the returned solution (see Section 4.4).

5.6 Summary

The results in this section demonstrate that in realistic scenarios, fixed sampling with too few samples causes severe biases in parameter and maximum log-likelihood estimates, whereas inverse binomial sampling is much more accurate and robust to the number of samples used. Across all models and all parameters, IBS yields parameter estimates with little bias and variance close to that of ‘exact’ maximum-likelihood estimation, even when using only a handful of repeats ( between 1 and 5). Conversely, fixed sampling yields severely biased parameter estimates when using too few samples per trial, especially for parameters which control decision noise, such as measurement noise and lapse rates in the two perceptual decision-making tasks, and value noise in the 4-in-a-row task. Moreover, for the two models for which we have access to ‘exact’ log-likelihood estimates, we found that IBS is able to recover maximum-likelihood solutions close to the true maximum log-likelihood, whereas fixed sampling remains severely biased even for many samples.

It is true that, given a large enough number of samples, fixed sampling is eventually able to recover most parameters and maximum log-likelihood values with reasonable accuracy. However, we have seen empirically that the number of samples required for reliable estimation varies between tasks, models and parameters of interests. For tasks and models where an exact likelihood or a numerical approximation thereof is unavailable, such as the problem we examined in Section 5.4, this limitation renders fixed sampling close to useless. By contrast, IBS automatically chooses the number of samples to allocate to the problem.

Finally, for complex models with a large response space, accurate parameter estimation with fixed sampling will require many more samples per trial than are feasible given the computational time needed to generate them. Therefore, in such scenarios accurate and efficient parameter estimation is only possible with IBS.

6 Discussion

In this work, we presented inverse binomial sampling (IBS), a method for estimating the log-likelihood of simulation-based models given an experimental data set. We demonstrated that estimates from IBS are uniformly unbiased, their variance is uniformly bounded, and we introduced a calibrated estimator of the variance. IBS is sample-efficient and, for the purpose of maximum-likelihood estimation, combines naturally with gradient-free optimization algorithms that handle stochastic objective functions, such as Bayesian Adaptive Direct Search (BADS; Acerbi and Ma, 2017). We compared IBS to fixed sampling and showed that the bias inherent in fixed sampling can cause researchers to draw false conclusions when performing model selection. Moreover, we showed in three realistic scenarios of increasing complexity that maximum-likelihood estimation of model parameters is more accurate with IBS than with fixed sampling with the same average number of samples.

In the rest of this section, we discuss additional applications of IBS, possible extensions, and give some practical usage recommendations.

We developed inverse binomial sampling for log-likelihood estimation of likelihood-free models, for the purpose of model comparison or fitting model parameters with maximum-likelihood estimation, but IBS has other practical uses.

Checking analytical or numerical log-likelihood calculations

We presented IBS as a solution for when the log-likelihood is intractable to compute analytically or numerically. However, even for models where the log-likelihood could be specified, deriving it can be quite involved and time-consuming, and mistakes in the calculation or implementation of the resulting equations are not uncommon. In this scenario, IBS can be useful for:

• quickly prototyping (testing) of new models, as writing the generative model and fitting it to the data is usually much quicker than deriving and implementing the exact log-likelihood;

• checking for derivation or implementation mistakes, as one can compare the supposedly ‘exact’ log-likelihood against estimates from IBS (on real or simulated data);

• assessing the quality of numerical approximations used to calculate the log-likelihood, for example when using methods such as adaptive quadrature for numerical integration (Press et al., 1992).

Estimating entropy and other information-theoretic quantities

We can also use inverse binomial sampling to estimate the entropy of an arbitrary discrete probability distribution , with , a discrete set (see, e.g., Cover and Thomas, 2012, for an introduction to information theory). To do this, we first draw a sample from the distribution, then use IBS to estimate . The first sample and the samples in IBS are independent, and therefore we can calculate the expected value of the outcome of IBS,

 E[^LIBS]=Ex∼Pr(⋅)[logPr(x)]=∑x∈ΩPr(x)logPr(x), (22)

which is the definition of the negative entropy of .

We can use this technique to estimate the entropy of the predicted response distribution of a generative model with a given parameter vector on any trial. For example, such quantity could be used in a behavioral model to test for the generalized Hick-Hyman law, that states that reaction time is proportional to the entropy of the available choices (Hyman, 1953)

. Moreover, we can generalize the method to estimate the cross-entropy between two distributions (sample from one, estimate log-likelihood with the other), or the Kullback-Leibler divergence between distributions. We note that all the estimates of these quantities are also uniformly unbiased.

444The lack of bias in entropy estimates by IBS may be surprising in light of a theorem stating that uniformly unbiased estimators of the entropy given a finite set of samples cannot exist (Paninski, 2003). This theorem does not apply to IBS since its sample size is a stochastic variable. It does, however, prove that one cannot estimate entropy (or similar information-theoretic quantities) with fixed sampling.

6.2 Bayesian inference

In this paper we focused on maximum-likelihood estimation, but another common approach to parameter estimation is Bayesian inference (Gelman et al., 2013). Bayesian inference has the goal of computing the posterior distribution of the parameters given the observations, computed as

 p(θ|D)=Pr(D|θ)p(θ)Z with Z≡∫dPr(D|θ)p(θ)dθ, (23)

where is the likelihood, the prior density of the parameters (typically assumed continuous), and the normalization constant, known as the evidence or marginal likelihood, a quantity used for Bayesian model selection due to a number of desirable properties (MacKay, 2003). Since is often hard to compute, many (approximate) Bayesian inference techniques are able to calculate the posterior distribution by having access only to the unnormalized

posterior, or joint distribution

; or equivalently to the log joint . We see then that IBS could be used to perform Bayesian inference of likelihood-free models by providing a means to compute the log-likelihood in the log joint distribution (the prior is assumed to be a simple distribution which we can express in closed form).

In Appendix C.3

, we describe how several approaches to approximate Bayesian inference could be used in conjunction with the unbiased log-likelihood estimates provided by IBS: Markov Chain Monte Carlo

(Hastings, 1970; Brooks et al., 2011); variational inference (Jordan et al., 1999; Ranganath et al., 2014); and Gaussian process surrogate methods (Kandasamy et al., 2015; Järvenpää et al., 2019), including Variational Bayesian Monte Carlo (Acerbi, 2018, 2019).

Finally, note that the techniques in this paper can be easily applied to maximum-a-posteriori (MAP) estimation – which is not quite Bayesian inference, but more like a regularized form of maximum-likelihood, that still yields a point estimate instead of a full posterior distribution. MAP estimation is attained by simply adding the log-prior to the log-likelihood in the optimization objective, where the log-prior acts as a regularization term.

6.3 Approximate IBS for continuous responses

So far, we have assumed that the space of possible responses is discrete. This assumption is necessary since, for continuous responses, the probability that a sample from the generative model exactly matches an observed response is zero (technically, near-zero since any computer implementation of a real number is finite). For this reason, IBS will never terminate, or at least not within a physically sensible time scale.

A simple approach to make continuous responses discrete is via binning the response space. Alternatively, we recommend an approach inspired by Approximate Bayesian Computation (ABC; Beaumont et al., 2002), which we call Approximate IBS (AIBS). Given a metric to measure distance in response space, and a tolerance threshold , we can use IBS to estimate

 Lε(θ)=N∑i=1logPr(D(~ri,ri)≤ε|si,θ)|Bε(ri)|, (24)

where the are responses drawn from the generative model, and denotes the volume of the set of responses whose distance from is no more than .

The -approximate log-likelihood in Equation 24 can then be used as normal for maximum-likelihood estimation or Bayesian inference. As , the approximate likelihood tends to the true likelihood, under some regularity conditions which we leave to explore for future work (see Prangle 2017 for a similar proof for ABC). However, the expected number of samples used by IBS diverges in that limit, so in practice there is a lower bound for that is feasible and one needs to extrapolate to the limit, or be satisfied to perform inference with an -approximate likelihood.

The common idea between AIBS and ABC is that they both use a distance metric to judge similarity between simulated samples and data. However, ABC commonly bases the comparison on summary statistics of the data (which may not be sufficient statistics, and thus not capture all aspects of the data); whereas AIBS uses the full responses. Secondly, ABC in practice requires dedicated algorithms to perform parameter estimation and inference (basic techniques, such as rejection sampling, can be extremely inefficient); whereas AIBS simply provides a (noisy) log-likelihood, which can then be used in combination with a wider class of likelihood-based inference methods, as long as they support noisy estimates (see Appendix C.3 for some examples). We leave a further analysis of AIBS, and a comparison with other likelihood-free inference approaches, as a promising direction for future work.

6.4 Usage recommendations

We conclude with a number of recommendations for researchers who want to fit a model to a data set, having access only to a simulator or generative model.

• First, try to derive a closed-form analytic expression for the log-likelihood of the model. If this is tractable, validate that the log-likelihood is free of implementation mistakes by comparing its output against log-likelihood estimates obtained by IBS with well-chosen test trials and model parameters.

• If exact analytics are intractable, find an analytical or numerical approximation, for example using variational inference or Riemannian integration, and once again validate the quality of the approximation using IBS.

• If the model is too complex for analytical or numerical approximations, estimate the log-likelihood using inverse binomial sampling.

• Finally, perform inference using the analytical, numerical, or IBS-based log-likelihood function with a sample-efficient inference algorithm, such as those based on Gaussian process surrogate modeling. For maximum-likelihood (or maximum-a-posteriori) estimation, hybrid Bayesian optimization methods have proved to be quite effective

(Acerbi and Ma, 2017).

Avoiding infinite loops

One issue of IBS is that it can ‘hang’, in the sense that the implementation of the estimator can run indefinitely, without returning an answer, if the simulator is unable to match a particularly unlikely observation. This is a natural behavior of IBS that stems from its efficiency in allocating samples, as we examined in Section 4.2. We recommend two easy solutions to avoid infinite loops:

• Implement a ‘lapse rate’ in the simulator model, which represents the probability of a completely random response (typically uniform across all possible responses). The lapse rate could be fixed to a small, non-zero value (e.g., ), or let as a free model parameter; in which case, ensure that the minimum lapse rate is a small, non-zero value (e.g., ).

• Introduce an early-stopping threshold, such that IBS stops sampling when the estimated log-likelihood of the entire data set goes below a threshold (see Appendix C.1).

We implemented both of these solutions in our analyses in Section 5.

Acknowledgments

This work has utilized the NYU IT High Performance Computing resources and services. We thank Aspen Yoo for help with Figures 4 and 6 and useful comments on the manuscript.

Appendix A Further theoretical analyses

a.1 Why inverse binomial sampling works

We start by showing that the inverse binomial sampling policy described in Section 2.4, combined with the estimator (Equation 14), yields a uniformly unbiased estimate of . This derivation follows from de Groot (1959, Theorem 4.1), adapted to our special case of estimating instead of an arbitrary function :

 (25)

The first equality is the definition of (Equation 14), using the notational convention that . In the second equality we introduce the indicator function which is when and

otherwise. The third equality follows by linearity of the expectation and the fourth directly from the definition of the indicator function. The fifth and second-to last equality uses the formula for the cumulative distribution function of a geometric variable, that is

, and thus . The final equality is the definition of the Taylor series of expanded around . Note that this series converges for all .

In the derivation above, we can replace by an arbitrary set of coefficients and show that

 E[K−1∑k=1ak]=∞∑k=1ak(1−p)k, (26)

for all for which the resulting Taylor series converges. Equation 26 immediately proves two corollaries. First, we can use the inverse binomial sampling policy to estimate any analytic function of . Second, since we can rewrite any estimator as , and since Taylor series are unique, is the only choice for which equals . In other words, is the only uniformly unbiased estimator of with the inverse sampling policy. Therefore, it trivially is the uniformly minimum-variance unbiased estimator under this policy, since no other unbiased estimator exist.

a.2 Analysis of bias of fixed sampling

We provide here a more formal analysis of the bias of fixed sampling. We initially consider the estimator defined by Equation 11 in the main text, but we will see that our arguments hold generally for any estimator based on a fixed sampling policy.

We showed in Figure 2 that in the regime of , , while keeping , the bias of

tends to a master curve. This follows since, in this limit, the binomial distribution

converges to a Poisson distribution

and therefore the bias converges to

 Bias[^Lfixed|p]=E[^Lfixed−logλM]=E[exp(−λ)∞∑m=0λmm!log(m+1)−log(M+1)−logλM]⟶p→0M→∞pM→λexp(−λ)∞∑m=0λmm!log(m+1)−logλ, (27)

which is the master curve in Figure 2. In particular, the bias is close to zero for and it diverges when , or equivalently, for and , respectively.

This asymptotic behavior is not a coincidence. In fact, it is mathematically guaranteed since the Fisher information of equals and the reparametrization identity for the Fisher information yields (Lehmann and Casella, 2006). In the limit of , which corresponds to , this Fisher information vanishes and the outcome of fixed sampling simply provides zero information about or . Therefore, any estimates of are not informed by the data and instead are a function of the regularization chosen in the estimator (Equation 11). Note that the argument above does not invoke the specific form of the estimator, and therefore holds for any choice of regularization.

We can express the problem with fixed sampling more clearly using Bayesian statistics, in a formal treatment of the ‘gambling’ analogy we presented in the main text. The ‘correct’ belief about given the outcome of fixed sampling () is quantified by the posterior distribution , which is a product of the likelihood and a prior . In the limit , the Poisson distribution converges to a Kronecker delta distribution concentrated on . In other words, almost surely none of the samples taken by the behavioral model will match the participant’s response. When , the likelihood equals , which is mostly flat (when considered as a function of , see Figure 11) for and therefore our posterior belief ought to be dominated by the prior and become independent of the data. Therefore, we once again conclude that in the limit , the fixed sampling policy provides no information to base an estimate of on, and it is impossible to avoid bias.

a.3 Estimator variance and information inequality

We proved in Section A.1 that is the minimum-variance unbiased estimator of given the inverse binomial sampling policy. Here we show that the estimator also comes close to saturating the information inequality, the analogue of a Cramer-Ráo bound for an arbitrary function and a non-fixed sampling policy (de Groot, 1959),

 Std(^f|p)≥√p(1−p)E[K|p])∣∣∣df(p)dp∣∣∣. (28)

In our case, where , the information inequality reduces to . In Figure 12, we plot the standard deviation of IBS compared to this lower bound.

It may be disappointing that IBS does not match the information inequality. Kolmogorov (1950) showed that the only functions for which the fixed sampling policy with samples allows an unbiased estimator are polynomials of degree up to , and those estimators can saturate the information equality. Dawson (1953) and later de Groot (1959) showed that if an unbiased estimator of a non-polynomial function exists and it matches the information inequality, it must use the inverse binomial sampling policy. Moreover, de Groot derived necessary and sufficient conditions for to allow such estimators (de Groot, 1959). Therefore, the standard deviation in IBS is close (within ) to its theoretical minimum.

To compare the variance of IBS and fixed sampling on equal terms, we use the scaling behavior of as . Specifically, for fixed sampling, we plot and for IBS we plot (see Figure 13). With this scaling, the curves for fixed sampling again collapse onto a master curve555These curves converge pointwise on and uniformly on any interval , but not uniformly on . The limits and are not exchangeable.. Note that repeated-sampling IBS estimators (see Section 4.4), obtained by averaging multiple IBS estimates, overlap with the curve for regular IBS for any .

All these curves increase and diverge as , reflecting the fact that estimating log-likelihoods for small is hard. The standard deviation of fixed sampling is always lower than that of IBS, especially when (specifically when ). In other words, fixed sampling produces low-variance estimators exactly in the range in which its estimates are biased, as guaranteed by the Cramer-Ráo bound. However, in the large- limit, fixed sampling does saturate the information inequality, so its master curve lies slightly below IBS. In other words, if one is able to draw so many samples that bias is no issue, then fixed sampling provides a slightly better trade-off between variance and computational time. However, in Section C.2, we discuss an improvement to IBS which decreases its variance by a factor -, in which case IBS is clearly superior.

a.4 A Bayesian derivation of the IBS estimator

In Sections 3 and A.2 we hinted at a Bayesian interpretation of the problem of estimating . We show here that indeed we can see the IBS estimator as a Bayesian point estimate of with a specific choice of prior for . For the rest of this section, we use to denote the likelihood of a trial (instead of ); that is is the parameter of the Bernoulli distribution and

the quantity we are seeking to estimate. We changed notation to avoid confusion with expressions such as the prior probability of

, which is .

Let be the number of samples until a ‘hit’, as per the IBS sampling policy. Following Bayes’ rule, we can write the posterior over given as

 p(q|K)=Pr(K|q)p(q)Pr(K)=(1−q)K−1qBeta(q;α,β)∫10(1−q)K−1qBeta(q;α,β)dq=Γ(K+α+β)Γ(α+1)Γ(K+β−1)(1−q)K+β−2qα, (29)

where we used the fact that follows a geometric distribution, and we assumed a prior over .

In particular, let us compute the posterior mean of under the Haldane prior, (Haldane, 1932). Thanks to the ‘law of the unconscious statistician’, we can compute the posterior mean of directly from Equation 29,

 Ep(logq|K)[logq]=(K−1)∫10(logq)(1−q)K−2dq=∫10(logq)Beta(q;1,K−1)dq=ψ(1)−ψ(K)=−K−1∑k=11k, (30)

where the first row follows from setting and ; it can be shown that the third row is the expectation of

for a Beta distribution, with

the digamma function (Abramowitz and Stegun, 1948); and the last equality follows from the relationship between the digamma function and harmonic numbers, that is , where is Euler-Mascheroni constant. We also used the notational convention that for any . Note that the last row is equal to the IBS estimator, , as defined in Equation 14 in the main text.

Crucially, Equation 29 shows that we can recover the IBS estimator as the posterior mean of given , under the Haldane prior for . This interpretation allows us to also define naturally the variance of our estimate for a given , as the variance of the posterior over ,

 Varp(logq|K)[logq]=ψ1(1)−ψ1(K), (31)

where is the trigamma function, the derivative of the digamma function; the equality follows from a known expression for the variance of under a Beta distribution for .

Appendix B Experimental details

In this section, we report details for the three numerical experiments described in the main text and supplementary results.

b.1 Orientation discrimination

The parameters of the orientation discrimination model are the (inverse) slope, or sensory noise, represented as , the bias , and the lapse rate . The logarithmic representation for is a natural choice for scale parameters (and more in general, for positive parameters that can span several orders of magnitude).

We define the lower bound (LB), upper bound (UB), plausible lower bound (PLB), and plausible upper bound (PUB) of the parameters as per Table 1

. The upper and lower bounds are hard constraints, whereas the plausible bounds provide information to the algorithm to where the global optimum is likely to be, and are used by BADS, for example, to draw a set of initial points to start building the surrogate Gaussian process, and to set priors over the Gaussian process hyperparameters

(Acerbi and Ma, 2017). Here we also use the plausible bounds to select ranges for the parameters used to generate simulated datasets, and to initialize the optimization, as described below.

To generate synthetic data sets, we select ‘true’ parameter settings for the orientation discrimination task as follows. We set the baseline parameter as , , and . Then, for each parameter , we linearly vary the value of in 40 increments from PLB to PUB as defined in Table 1 (e.g., from to for ), while keeping the other two parameters fixed to their baseline value. For each one of the 120 parameter settings defined in this way, we randomly generated stimuli and responses for 100 datasets from the generative model, resulting in 12000 distinct data sets for which we know the true generating parameters.

We evaluated the log-likelihood with the following methods: fixed sampling with samples, with ; IBS with repeats, with ; and exact. To avoid wasting computations on ‘bad’ parameter settings, for IBS we used the ‘early stopping threshold’ technique described in Section C.1, setting a lower bound on the log-likelihood of IBS equal to the log-likelihood of a chance model, that is .

For each data set and method, we optimized the log-likelihood by running BADS times with different starting points. We selected starting points as the points that lie on one-third or two-third of the distance between the plausible upper and lower bound for each parameter, that is all combinations of , , . Each of these optimization runs returns a candidate for . For methods that return a noisy estimate of the log-likelihood, we then re-evaluate for each of these candidates with higher precision (for fixed sampling, we use samples; for IBS, we use repeats). Finally, we select the candidate with highest (estimated) log-likelihood.

When estimating parameters using IBS or fixed sampling, we enabled the ‘uncertainty handling’ option in BADS, informing it to incorporate measurement noise into its model of the objective function. Note that during the optimization, the algorithm iteratively infers a single common value for the observation noise associated with the function values in a neighborhood of the current point (Acerbi and Ma, 2017). A future extension of BADS may allow the user to explicitly provide the noise associated with each data point, which is easily computed for the IBS estimates (Equation 16 in the main text), affording the construction of a better surrogate model of the log-likelihood.

Alternative fixed sampling estimator

In the main text, we considered the fixed-sampling estimator defined by Equation 11. We performed an additional analysis to empirically validate that our results do not depend on the specific choice of estimator for fixed sampling (as expected given the theoretical arguments in Section 3).

An alternative way of avoiding the divergence of fixed sampling is to correct samples that happen to be all zeros, for example with

 ^Lfixed-bound(x)=log(max{m(x),mmin}M), (32)

for some , which sets a lower bound for the log-likelihood equal to . We then performed our analyses of the orientation discrimination task using the estimator with . As shown in Figure 14, the results are remarkably similar to what we found using the fixed-sampling estimator defined by Equation 11.

Complete parameter recovery results

For completeness, we report in Figure 15 the parameter recovery results for fixed sampling, inverse binomial sampling and ‘exact’ analytical methods for the orientation discrimination task, for all tested number of samples and IBS repeats . All estimates were obtained via maximum-likelihood estimation using the Bayesian Adaptive Direct Search (Acerbi and Ma, 2017), as described previously in this section.

b.2 Change localization

First, we derive the trial likelihood of the change localization model. Assuming that the change happens at location , by symmetry we can write

 Pr(respond i|c changed)=⎧⎪⎨⎪⎩Pcorrect(Δ(c)s;θ)if i=c15(1−Pcorrect(Δ(c)s;θ))otherwise (33)

where is the absolute circular distance between the true orientations of patch in the first and second display. We can derive an expression for by marginalizing over the circular distance between the respective measurements,

 Pcorrect(Δ(c)s;θ)=γ6+(1−γ)∫2π0Pr(Δ(c)x|Δ(c)