Animals, humans, and, similarly, reinforcement learning agents may safely assume that the world is stochastic and stationary during intervals of time delimited by change points. The exact position and orientation of leaves on a tree, a stock market index, or the time it takes to travel from A to B in a crowded city may be well captured by stationary stochastic processes for extended periods of time. Then sudden changes may happen: the distribution of leaf positions becomes different due to a storm, the stock market index is affected by the enforcement of a new law, or a blocked road causes additional traffic jams. The violation of an agent’s expectation caused by such sudden changes is perceived by the agent as surprise, which can be seen as a measure of how much the agent’s current belief differs from reality. Surprise, with its physiological manifestations in pupil dilation Preuschoff et al. (2011); Nassar et al. (2012) and EEG signals Modirshanechi et al. (2019); Ostwald et al. (2012), is believed to modulate learning, potentially through the release of specific neurotransmitters Yu and Dayan (2005); Gerstner et al. (2018), to allow animals and humans to adapt quickly to sudden changes.
The bulk of work on surprise-based learning has focused more on biological plausibility than on accurate learning Nassar et al. (2012); Yu and Dayan (2005); Nassar et al. (2010); Faraji et al. (2018); Friston et al. (2017); Schwartenbeck et al. (2013); Friston (2010); Behrens et al. (2007). On the other hand, exact and approximate Bayesian online methods Adams and MacKay (2007); Fearnhead and Liu (2007) for change point detection and parameter estimation have been developed without any focus on biological plausibility Aminikhanghahi and Cook (2017); Wilson et al. (2010). In this work, we take a top-down approach to surprise-based learning: we start with a generative model of change points and observations and derive approximate online methods that contain a surprise-modulated learning rate. Our goal is to find approximate methods that are computationally efficient and biologically plausible while sacrificing learning accuracy only marginally. Additionally, we seek to provide theoretical insights into commonalities and differences among existing surprise-based and approximate Bayesian approaches.
2 General Framework and Related Work
2.1 The Generative Model
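In order to study learning in an environment that exhibits occasional and abrupt changes, we consider the following hierarchical generative model in discrete time. At each time point $t$, the observation $Y_t = y_t$ comes from a probability distribution with parameter $\Theta_t = \theta_t$. Abrupt changes of the environment correspond to sudden changes of this parameter. At every time $t$, there is a change probability $p_c \in (0, 1)$ for the parameter to be drawn from its prior distribution $\pi^{(0)}$ independently of its previous value, and a probability $1 - p_c$ to stay the same as $\theta_{t-1}$. A change at time $t$ is specified by the event $C_t = 1$; otherwise $C_t = 0$. We sample $\Theta_1$ from the prior $\pi^{(0)}$, and for $t > 1$ the generative model is

$$
\begin{aligned}
P(c_t, \theta_t, y_t \mid \theta_{t-1}) &= P(c_t)\, P(\theta_t \mid c_t, \theta_{t-1})\, P_Y(y_t \mid \theta_t), \\
P(c_t) &= \text{Bernoulli}(c_t; p_c), \\
P(\theta_t \mid c_t, \theta_{t-1}) &= \begin{cases} \delta(\theta_t - \theta_{t-1}) & \text{if } c_t = 0, \\ \pi^{(0)}(\theta_t) & \text{if } c_t = 1. \end{cases}
\end{aligned}
\tag{1}
$$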
Random variables are indicated by capital letters, and values by small letters. $P$ stands for either probability density function (for the continuous variables) or probability mass function (for the discrete variables), and $\delta$ is the Dirac or Kronecker delta distribution, respectively. $P_Y$ is the time-invariant likelihood function.
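The sampling procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_sequence` and the two callables `sample_prior` and `sample_observation` are hypothetical helpers introduced here, and flagging the first time step as a change is a convention of this sketch, not part of the model definition.

```python
import random


def sample_sequence(n_steps, p_c, sample_prior, sample_observation, rng):
    """Sample parameters, change flags and observations from the change-point model.

    sample_prior(rng) draws a parameter from the prior; sample_observation(theta, rng)
    draws one observation given the current parameter (both hypothetical helpers).
    """
    thetas, changes, observations = [], [], []
    theta = sample_prior(rng)  # the first parameter comes from the prior
    for t in range(n_steps):
        # Convention of this sketch: the first step is flagged as a change.
        change = 1 if t == 0 or rng.random() < p_c else 0
        if change and t > 0:
            theta = sample_prior(rng)  # redrawn independently of the old value
        thetas.append(theta)
        changes.append(change)
        observations.append(sample_observation(theta, rng))
    return thetas, changes, observations


# Usage: Gaussian prior N(0, 1) on the mean, observation noise with standard deviation 0.5.
rng = random.Random(0)
thetas, changes, ys = sample_sequence(
    200,
    0.05,
    sample_prior=lambda r: r.gauss(0.0, 1.0),
    sample_observation=lambda theta, r: r.gauss(theta, 0.5),
    rng=rng,
)
```

Passing the prior and the likelihood as callables keeps the sketch independent of the particular task; the Gaussian choice in the usage lines is just one possible instance.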
Given a sequence of observations $y_{1:t}$, the agent’s belief about the parameter at time $t$ is defined as the posterior probability distribution $\pi^{(t)}(\theta) = P(\Theta_t = \theta \mid y_{1:t})$ of the parameter $\Theta_t$. In the online learning setting studied here, the agent’s goal is to update the belief $\pi^{(t)}$ to the new belief $\pi^{(t+1)}$, or an approximation thereof, upon observing $y_{t+1}$.
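First we demonstrate that exact Bayesian inference on the generative model in Eq. 1 leads to an explicit trade-off between integrating the new observations with the old belief into a distribution $\pi_{\text{integration}}^{(t+1)}$ and forgetting the past observations, so as to restart with the belief $\pi_{\text{reset}}^{(t+1)}$:

$$
\pi^{(t+1)}(\theta) = (1 - \gamma)\, \pi_{\text{integration}}^{(t+1)}(\theta) + \gamma\, \pi_{\text{reset}}^{(t+1)}(\theta). \tag{2}
$$

This trade-off is governed by a surprise-modulated learning rate

$$
\gamma = \gamma(S, m) = \frac{m S}{1 + m S}, \tag{3}
$$

where $S \geq 0$ has the natural interpretation of the surprise of the most recent observation, and $m \geq 0$ is a parameter controlling the effect of surprise on learning. The exact definitions of $\pi_{\text{integration}}^{(t+1)}$, $\pi_{\text{reset}}^{(t+1)}$, and $S$ will be given in the Results section.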
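As a small numerical illustration, the learning rate can be written as a function of a surprise value and a modulation parameter. The sketch assumes the saturating form $\gamma(S, m) = mS/(1 + mS)$, which is what the exact recursive update of Section 3.1 yields when $m = p_c/(1 - p_c)$; the function name `learning_rate` is introduced here only for illustration.

```python
import math


def learning_rate(surprise, m):
    """Surprise-modulated learning rate gamma(S, m) = m*S / (1 + m*S).

    Assumes surprise >= 0 and m >= 0; the result lies in [0, 1].
    """
    if surprise < 0 or m < 0:
        raise ValueError("surprise and m must be non-negative")
    if m == 0:
        return 0.0  # surprise has no effect on learning
    x = m * surprise
    if math.isinf(x):
        return 1.0  # infinitely surprising observation: full reset
    return x / (1.0 + x)


# With S = 1 (observation equally likely with and without a change) and
# m = p_c / (1 - p_c), the learning rate equals the change probability p_c.
p_c = 0.1
gamma_neutral = learning_rate(1.0, p_c / (1.0 - p_c))
```

The learning rate increases monotonically with both the surprise and $m$, and saturates at 1, which corresponds to discarding the previous belief entirely.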
Second, we propose two approximate algorithms (Particle Filtering and Variational SMiLe) which inherit the explicit trade-off and its surprise-modulated learning rate from the exact Bayesian approach. Our methods are computationally efficient and biologically plausible; Particle Filtering has been shown to have a neuronal implementation in its general form Kutschireiter et al. (2017), whereas Variational SMiLe can be implemented with simple update rules for the exponential family of distributions. Moreover, empirical results show that the performance of the two approximate algorithms is comparable to, and more robust across environments than, that of other state-of-the-art approximations.
Finally, we interpret existing related algorithms in a unifying way, under the light of this surprise-modulated trade-off in Eq. 2.
2.3 Related Work
Exact Bayesian Inference.
For the generative model in Eq. 1, it is possible to find an exact online Bayesian update of the belief using a message passing algorithm Adams and MacKay (2007); Fearnhead and Liu (2007). This algorithm’s space and time complexity increases quadratically with $t$, which makes it unsuitable for a continual learning setting. However, simple approximations like dropping messages below a certain threshold Adams and MacKay (2007) or stratified resampling Fearnhead and Liu (2007) make it possible to reduce the computational complexity. The interpretation of these last approaches within our theoretical framework, as well as their relationship to our algorithms, is discussed in the Supplementary Material.
Leaky Integration and Variations of Delta-Rules.
In order to estimate some sufficient statistic, leaky integration of new observations is a particularly simple form of trade-off between integrating and forgetting. After a transient phase, the update of a leaky integrator takes the form of a delta-rule that can be seen as an approximation of corresponding exact Bayesian updates Meyniel et al. (2016); Heilbron and Meyniel (2019); Yu and Cohen (2009). This update rule was found to be biologically plausible and consistent with human behavioral data Meyniel et al. (2016); Yu and Cohen (2009). However, Behrens et al. (2007); Heilbron and Meyniel (2019) demonstrated that in some situations, the exact Bayesian model is significantly better than leaky integration in explaining human behavior. The inflexibility of leaky integration with a single, constant leak parameter can be overcome by a weighted combination of multiple leaky integrators Wilson et al. (2013), where the weights are updated in a similar fashion as in the exact online methods Adams and MacKay (2007); Fearnhead and Liu (2007), or by considering an adaptive leak parameter Nassar et al. (2012, 2010). The latter Nassar et al. (2012, 2010) bear close connections to our work, which are further discussed in the Supplementary Material.
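For concreteness, a leaky integrator in its delta-rule form can be sketched as follows. The parameterization with a single constant `leak` in $[0, 1]$ and the function name are assumptions of this sketch and are not taken from the cited works.

```python
def leaky_integrator(observations, leak, initial=0.0):
    """Delta rule with a constant leak: estimate <- estimate + leak * (y - estimate).

    leak = 0 ignores new observations; leak = 1 forgets everything but the last one.
    Returns the estimate after each observation.
    """
    estimate = initial
    trace = []
    for y in observations:
        estimate += leak * (y - estimate)  # move towards the new observation
        trace.append(estimate)
    return trace


# Usage: three identical observations, leak of one half.
trace = leaky_integrator([1.0, 1.0, 1.0], 0.5)
```

A constant leak fixes the balance between integrating and forgetting once and for all, which is the inflexibility that the adaptive and surprise-modulated variants discussed in this section try to overcome.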
Learning in the presence of abrupt changes has also been considered without an explicit assumption about the underlying generative model. One approach uses a surprise modulated learning rate Faraji et al. (2018) similar to Eq. 3. Other approaches use different generative models, e.g. conditional sampling of the parameters also when there is a change Yu and Dayan (2005), deeper hierarchy without fixed change probability Wilson et al. (2010), or models with drift in the parameters Mathys et al. (2011); Gershman et al. (2014). In the signal processing literature we find further approaches to address the problem of learning in nonstationary environments with abrupt changes (see Aminikhanghahi and Cook (2017) for a review, and Lin et al. (2017); Cummings et al. (2018) for two recent examples).
3 Theoretical Results
3.1 Recursive Bayesian Inference
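Using Bayes’ rule, our aim is to find a rule to update the belief $\pi^{(t)}$ to the new belief

$$
\pi^{(t+1)}(\theta) = \frac{P_Y(y_{t+1} \mid \theta)\, P(\Theta_{t+1} = \theta \mid y_{1:t})}{P(y_{t+1} \mid y_{1:t})}. \tag{4}
$$

The first term in the numerator is the likelihood of the current observation given its parameter, and the second term is the agent’s estimated probability distribution of $\Theta_{t+1}$ before observing $y_{t+1}$. Since there is always the possibility of an abrupt change, the second term is not the agent’s previous belief $\pi^{(t)}(\theta)$, but $P(\Theta_{t+1} = \theta \mid y_{1:t}) = (1 - p_c)\, \pi^{(t)}(\theta) + p_c\, \pi^{(0)}(\theta)$ (see Supplementary Material). As a result, it is possible to find a recursive formula for updating the belief. For the derivation of this recursive rule, we define the following terms.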
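The probability of observing $y_{t+1}$ with the belief $\pi^{(t)}$ is

$$
P(y_{t+1}; \pi^{(t)}) = \int P_Y(y_{t+1} \mid \theta)\, \pi^{(t)}(\theta)\, d\theta. \tag{5}
$$

Under the assumption of no change, $C_{t+1} = 0$, and using the most recent belief $\pi^{(t)}$ as prior, the exact Bayesian update for $\pi^{(t+1)}$ is

$$
\pi_B^{(t+1)}(\theta) = \frac{P_Y(y_{t+1} \mid \theta)\, \pi^{(t)}(\theta)}{P(y_{t+1}; \pi^{(t)})}. \tag{6}
$$

Note that $\pi_B^{(t+1)}$ corresponds to the integration term of Eq. 2; it is the incorporation of the new information into the current belief via Bayesian updating.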
The “Generative Model Surprise” of the observation $y_{t+1}$ is defined as the ratio of the probability of observing $y_{t+1}$ given $C_{t+1} = 1$ (i.e. when there is a change), to the probability of observing $y_{t+1}$ given $C_{t+1} = 0$ (i.e. when there is no change), i.e.

$$
S_{GM}(y_{t+1}; \pi^{(t)}) = \frac{P(y_{t+1}; \pi^{(0)})}{P(y_{t+1}; \pi^{(t)})}, \tag{7}
$$

where $P(y; \pi)$ denotes the probability of observing $y$ under the belief $\pi$ (Eq. 5). This definition of surprise measures how much more probable the current observation is under the naive prior relative to the current belief (see Supplementary Material for further discussion and interpretation). We emphasize that this definition is not arbitrary: it is the term that allows us to write the exact inference on the generative model in a recursive form.
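To make the ratio concrete, consider the special case of a Gaussian likelihood with known noise variance and Gaussian beliefs, for which the probability of an observation under a belief is again Gaussian with the two variances added. The following sketch assumes this conjugate Gaussian setting; beliefs are represented as `(mean, variance)` tuples and all function names are introduced here only for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def marginal_probability(y, belief, noise_var):
    """P(y; belief) for a Gaussian belief (mean, var) over the mean of y."""
    mean, var = belief
    return gaussian_pdf(y, mean, var + noise_var)


def generative_model_surprise(y, prior, belief, noise_var):
    """Ratio of P(y; prior) (change) to P(y; belief) (no change)."""
    return marginal_probability(y, prior, noise_var) / marginal_probability(y, belief, noise_var)


# Usage: a broad prior and a belief that is concentrated around zero.
prior = (0.0, 1.0)
belief = (0.0, 0.01)
s_expected = generative_model_surprise(0.0, prior, belief, noise_var=0.1)    # below 1
s_unexpected = generative_model_surprise(3.0, prior, belief, noise_var=0.1)  # far above 1
```

An observation close to what the current belief predicts yields a surprise below one, whereas an observation that the prior explains much better than the current belief yields a surprise far above one.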
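Using the above definitions and Eq. 4, we find the recursive update rule (see Supplementary Material)

$$
\pi^{(t+1)}(\theta) = (1 - \gamma_{t+1})\, \pi_B^{(t+1)}(\theta) + \gamma_{t+1}\, P(\theta \mid y_{t+1}), \tag{8}
$$

where $P(\theta \mid y_{t+1}) \propto P_Y(y_{t+1} \mid \theta)\, \pi^{(0)}(\theta)$ is the posterior given a change at time $t+1$ (the reset term of Eq. 2), $\pi_B^{(t+1)}$ is as in Eq. 6, and $\gamma_{t+1} = \gamma(S, m)$ as in Eq. 3, with $S = S_{GM}(y_{t+1}; \pi^{(t)})$ as in Eq. 7 and $m = p_c / (1 - p_c)$. The recursive formula of Eq. 8 shows an explicit trade-off between integrating the new sample with the old information and forgetting the previous observations. The weight of this convex sum is modulated by surprise in light of the new observation. Since the parameter of modulation is equal to $p_c / (1 - p_c)$, the effect of surprise on learning increases when the environment is more volatile, i.e. when the change probability increases.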
Despite the simplicity of the recursive formula in Eq. 8, the updated belief $\pi^{(t+1)}$ is generally not in the same family of distributions as the previous belief $\pi^{(t)}$, e.g. the result of averaging two normal distributions is not a normal distribution. Hence it is in general impossible to find a simple and exact update rule for, e.g., some sufficient statistic. In the following sections, we investigate two approximations that have simple update rules.
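When the parameter can take only finitely many values, the recursive update can be carried out exactly, which makes the trade-off easy to inspect. The sketch below assumes such a finite parameter set, a learning rate of the form $\gamma = mS/(1+mS)$ with $m = p_c/(1-p_c)$, and a surprise given by the ratio of the probability of the observation under the prior to that under the current belief; the function name and the list-based representation are hypothetical.

```python
def exact_update(belief, prior, likelihoods, p_c):
    """One step of the recursive update on a finite set of parameter values.

    belief[k], prior[k]: probabilities of the k-th parameter value;
    likelihoods[k]: probability of the new observation under the k-th value.
    Returns the new belief and the learning rate gamma.
    """
    p_y_belief = sum(l * b for l, b in zip(likelihoods, belief))  # no change
    p_y_prior = sum(l * p for l, p in zip(likelihoods, prior))    # change
    integrated = [l * b / p_y_belief for l, b in zip(likelihoods, belief)]
    reset = [l * p / p_y_prior for l, p in zip(likelihoods, prior)]
    surprise = p_y_prior / p_y_belief
    m = p_c / (1.0 - p_c)
    gamma = m * surprise / (1.0 + m * surprise)
    new_belief = [(1.0 - gamma) * i + gamma * r for i, r in zip(integrated, reset)]
    return new_belief, gamma


# Usage: a coin whose bias is one of three values; the observation is "heads".
thetas = [0.1, 0.5, 0.9]
prior = [1.0 / 3.0] * 3
belief = [0.8, 0.15, 0.05]
new_belief, gamma = exact_update(belief, prior, likelihoods=thetas, p_c=0.1)
```

The result coincides with applying Bayes' rule directly to the mixture of the previous belief and the prior, weighted by $1 - p_c$ and $p_c$; writing it as a convex combination merely exposes the surprise-dependent weight.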
3.2 Particle Filtering
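The exact Bayesian update can also be performed by marginalization of the joint posterior $P(\theta, c_{1:t} \mid y_{1:t})$ over the hidden states $c_{1:t}$. As a result of this marginalization, the agent’s belief is $\pi^{(t)}(\theta) = \sum_{c_{1:t}} P(\theta \mid c_{1:t}, y_{1:t})\, P(c_{1:t} \mid y_{1:t})$, where we dropped the explicit mentioning of the random variables, e.g. $C_{1:t}$, and display only their values, e.g. $c_{1:t}$, to shorten notation. The first term is simple to compute, because when $c_{1:t}$ is known, inference depends only on the observations after the last change point. However, since the computation of the term $P(c_{1:t} \mid y_{1:t})$ is difficult and the summation over all hidden states is computationally costly, in this section, we approximate this term via particle filtering Gordon et al. (1993), i.e.

$$
P(c_{1:t} \mid y_{1:t}) \approx \sum_{i=1}^{N} w_t^{(i)}\, \delta(c_{1:t} - c_{1:t}^{(i)}), \tag{9}
$$

where $\{c_{1:t}^{(i)}\}_{i=1}^{N}$ is a set of $N$ realizations (particles) drawn from a proposal distribution and $\{w_t^{(i)}\}_{i=1}^{N}$ are their corresponding weights at time $t$.
Hence the approximated belief is

$$
\hat{\pi}^{(t)}(\theta) = \sum_{i=1}^{N} w_t^{(i)}\, \pi_i^{(t)}(\theta), \tag{10}
$$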
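where $\pi_i^{(t)}(\theta) = P(\theta \mid c_{1:t}^{(i)}, y_{1:t})$ is the approximated belief corresponding to particle $i$. The update procedure includes two steps: (1) updating the weights, and (2) sampling the new hidden state $c_{t+1}^{(i)}$ for each particle. The first step amounts to

$$
w_{t+1}^{(i)} = (1 - \gamma_{t+1})\, w_{B,t+1}^{(i)} + \gamma_{t+1}\, w_t^{(i)}, \tag{11}
$$

where $\gamma_{t+1} = \gamma(S_{GM}(y_{t+1}; \hat{\pi}^{(t)}), m)$ with $m = p_c / (1 - p_c)$, and $w_{B,t+1}^{(i)} = w_t^{(i)}\, P(y_{t+1}; \pi_i^{(t)}) / P(y_{t+1}; \hat{\pi}^{(t)})$ are the weights corresponding to a Bayesian update (Eq. 6; see Supplementary Material for the derivation). As a second step we sample each particle’s hidden state from the proposal distribution with the stay probability

$$
Q(c_{t+1}^{(i)} = 0 \mid c_{1:t}^{(i)}, y_{1:t+1}) = 1 - \gamma\big(S_{GM}(y_{t+1}; \pi_i^{(t)}), m\big) = \frac{1}{1 + m\, S_{GM}(y_{t+1}; \pi_i^{(t)})}. \tag{12}
$$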
Interestingly, the above formulas are in the same spirit as Eq. 8. For the weight update there is a trade-off between an exact Bayesian update and keeping the value of the previous time step, controlled by a learning rate modulated in exactly the same way as in Eq. 8. Note that in contrast to Eq. 8, the trade-off for the particles’ weights is not between forgetting and integrating, but between maintaining the previous knowledge and integrating. However, the stay probability for sampling is a decreasing function of surprise. As a result, although the weights are updated less for surprising events, a higher surprise causes a higher probability of change. This is eventually identical to forgetting, since for a particle whose state is changed, the approximated belief is equal to the posterior given a change, $P(\theta \mid y_{t+1}) \propto P_Y(y_{t+1} \mid \theta)\, \pi^{(0)}(\theta)$.
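The two steps can be sketched for a Gaussian likelihood with known noise variance, where each particle carries a Gaussian belief `(mean, variance)`. This is a minimal sketch under several assumptions: the conjugate Gaussian setting, the learning rate $\gamma = mS/(1+mS)$ with $m = p_c/(1-p_c)$, a stay probability of $1/(1 + mS_i)$ computed from each particle's own surprise $S_i$, and no resampling step. All function names are hypothetical, and the ratios are written without explicit division by the particle likelihoods to avoid numerical problems for very surprising observations.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def conjugate_update(mean, var, y, noise_var):
    """Bayesian update of a Gaussian belief over the mean after observing y."""
    precision = 1.0 / var + 1.0 / noise_var
    return (mean / var + y / noise_var) / precision, 1.0 / precision


def particle_filter_step(particles, weights, y, prior, noise_var, p_c, rng):
    """One update of the weights and hidden states; particles are (mean, var) tuples."""
    p0 = gaussian_pdf(y, prior[0], prior[1] + noise_var)  # P(y; prior)
    p_i = [gaussian_pdf(y, mu, var + noise_var) for mu, var in particles]
    p_hat = sum(w * p for w, p in zip(weights, p_i))      # P(y; approximated belief)
    z = (1.0 - p_c) * p_hat + p_c * p0
    gamma = p_c * p0 / z                                  # equals m*S / (1 + m*S)
    # (1 - gamma) * w_B + gamma * w, with w_B = w * p_i / p_hat
    new_weights = [(1.0 - p_c) * w * p / z + gamma * w for w, p in zip(weights, p_i)]
    new_particles = []
    for (mu, var), p in zip(particles, p_i):
        stay = (1.0 - p_c) * p / ((1.0 - p_c) * p + p_c * p0)  # equals 1 / (1 + m*S_i)
        if rng.random() < stay:
            new_particles.append(conjugate_update(mu, var, y, noise_var))
        else:  # change: forget the past and restart from the prior
            new_particles.append(conjugate_update(prior[0], prior[1], y, noise_var))
    return new_particles, new_weights


def estimate(particles, weights):
    """Weighted mean of the particles' belief means."""
    return sum(w * mu for w, (mu, _) in zip(weights, particles))
```

A typical use initializes all particles with the prior and uniform weights and then calls `particle_filter_step` once per observation. In practice one would likely add a resampling step when the weights degenerate, which is omitted here for brevity.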
3.3 Variational SMiLe Rule
In order to keep the updated belief in the same family as the previous beliefs, one possibility is to apply the weighted averaging of the exact Bayesian update rule (Eq. 8) to the logarithm of the beliefs rather than their normal forms, i.e.

$$
\log \pi^{(t+1)}(\theta) = (1 - \gamma_{t+1}) \log \pi_B^{(t+1)}(\theta) + \gamma_{t+1} \log P(\theta \mid y_{t+1}) + \text{Const.}, \tag{13}
$$

where $\gamma_{t+1} = \gamma(S_{GM}(y_{t+1}; \pi^{(t)}), m)$ takes the same functional form as for the exact Bayesian update, but $m$ is a positive free parameter which can be tuned to each environment. By doing so, we still have the explicit trade-off of Eq. 8. The advantageous consequence of averaging over logarithms is that, if the initial belief $\pi^{(0)}$ is the conjugate prior of the likelihood function, then we always have $\pi^{(t)}$ and $\pi^{(t+1)}$ in the same family, which applies in particular to distributions from the exponential family. This results in a simple update rule for the parameters of $\pi^{(t+1)}$.
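One way to interpret this new update rule is to rewrite it as the solution of a constrained optimization problem

$$
\pi^{(t+1)} = \arg\min_{q}\, D_{KL}\big[q \,\|\, \pi_B^{(t+1)}\big] \quad \text{s.t.} \quad D_{KL}\big[q \,\|\, P(\cdot \mid y_{t+1})\big] \leq B_{t+1}, \tag{14}
$$

where $D_{KL}$ denotes the Kullback-Leibler divergence and the bound $B_{t+1}$ is a decreasing function of the surprise $S_{GM}(y_{t+1}; \pi^{(t)})$ at each timestep (see Supplementary Material for the derivation). According to Eq. 14, the updated belief is a variational approximation of the Bayesian update $\pi_B^{(t+1)}$. Because of its similarity to the Surprise Minimization Learning rule “SMiLe” Faraji et al. (2018), we call this approach the “Variational Surprise Minimization Learning” rule, or in short the “Variational SMiLe” rule.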
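For Gaussian beliefs the log-domain averaging has a closed form: a weighted average of two Gaussian log-densities is again a Gaussian log-density whose precision and precision-weighted mean are the corresponding weighted averages. The sketch below illustrates this for a Gaussian likelihood with known noise variance. It assumes the learning rate $\gamma = mS/(1+mS)$ with surprise given by the ratio of the marginal probability under the prior to that under the current belief; function names and the `(mean, variance)` representation are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def conjugate_update(mean, var, y, noise_var):
    """Bayesian update of a Gaussian belief over the mean after observing y."""
    precision = 1.0 / var + 1.0 / noise_var
    return (mean / var + y / noise_var) / precision, 1.0 / precision


def variational_smile_step(belief, y, prior, noise_var, m):
    """Average the log of the Bayesian update and the log of the reset posterior."""
    mu_b, var_b = conjugate_update(belief[0], belief[1], y, noise_var)  # integrate
    mu_r, var_r = conjugate_update(prior[0], prior[1], y, noise_var)    # reset
    p_belief = gaussian_pdf(y, belief[0], belief[1] + noise_var)
    p_prior = gaussian_pdf(y, prior[0], prior[1] + noise_var)
    gamma = m * p_prior / (p_belief + m * p_prior)  # m*S / (1 + m*S), S = p_prior / p_belief
    precision = (1.0 - gamma) / var_b + gamma / var_r
    mean = ((1.0 - gamma) * mu_b / var_b + gamma * mu_r / var_r) / precision
    return (mean, 1.0 / precision), gamma


# Usage: a confident belief around zero meets a very unexpected observation.
prior = (0.0, 10.0)
new_belief, gamma = variational_smile_step((0.0, 0.01), 5.0, prior, noise_var=0.25, m=0.1)
```

In contrast to the exact update, the result is a single Gaussian, so only a mean and a variance have to be stored and updated.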
3.4 Application to the Exponential Family
For both Particle Filtering and Variational SMiLe, we derive compact update rules for the case where the likelihood function $P_Y$ is in the exponential family and $\pi^{(0)}$ is its conjugate prior. The resulting update rules are easy to implement. The pseudocode for Particle Filtering and Variational SMiLe can be found in the Supplementary Material.
3.5 Modifications and extensions of related approaches
In order to enable fair comparisons in simulations and to allow for a comparative discussion from a theoretical point of view, we modified or extended existing related approaches. In the surprise measure defined by Faraji et al. (2018), the prior is always a Uniform distribution. We used the generative model prior and simplified the implementation of the modulated learning rate. The algorithms of Nassar et al. (2012, 2010) were specifically developed for the case of a Uniform prior with a range of values much larger than the range of the (Gaussian) likelihood function. We extended their approaches to a more general case where the prior is a Gaussian distribution with arbitrary variance. We implemented the message passing algorithm of Adams and MacKay (2007) and an additional simplified version of it, where we simply keep a fixed number of particles at each time step, namely the ones with the highest weights. All modifications, extensions and comparative interpretations can be found in the Supplementary Material.
We evaluated our algorithms on two tasks, a Gaussian and a Categorical estimation task. We compared our algorithms to the online Bayesian Message Passing algorithm Adams and MacKay (2007) (MP Bayes), a simpler variation thereof inspired by the work of Fearnhead and Liu (2007) (MP), the (extended) reduced Bayesian algorithm of Nassar et al. (2010) (reduced Bayes’10), the (extended) reduced Bayesian algorithm of Nassar et al. (2012) (reduced Bayes’12), a slightly modified version of SMiLe Faraji et al. (2018), and a simple Leaky Integrator. The MP Bayes and the MP algorithms come from the field of change point detection. The former has high memory demands, whereas the latter has the same memory demands as the Particle Filters we implemented. Note that we also compared to the original algorithm of Fearnhead and Liu (2007) but found that the simpler MP gave rise to better performance; we therefore report the results of the latter here. The reduced Bayes’10 and ’12 and the SMiLe algorithm come from the human learning literature and are more biologically oriented. More details on the aforementioned algorithms as well as the pseudocode of the modified SMiLe rule can be found in the Supplementary Material.
4.1 Gaussian estimation task
The goal of the agent is to estimate the mean of observed samples, which are drawn from a Gaussian distribution with known variance , i.e. . The mean is itself drawn from a Gaussian distribution whenever the environment changes. An example of the task can be seen in Fig. 1A. We simulated the task for all combinations of and . For each combination of and , we first tuned the free parameter of each algorithm, i.e. for SMiLe and Variational SMiLe, and the leak parameter for the Leaky Integrator, by minimizing the mean squared error on three random initializations of the task. For the Particle Filter (pf), the MP Bayes, the MP, the reduced Bayes’10 and the reduced Bayes’12, the true of the environment was indeed the value that gave the best performance and we used this value for the simulations.
We evaluated the performance of the algorithms on ten different random task instances for steps each. Note that the parameter is known to all algorithms, apart from the Leaky Integrator.
In Fig. 1B we show the mean squared estimation error of each algorithm for steps after a change in the environment, over multiple changes, for two exemplar task settings. The Particle Filter with 10 and 20 particles (pf10 and pf20), and the reduced Bayes’12 have a performance very close to that of the MP Bayes algorithm, with much lower memory requirements. The MP algorithm with 10 and 20 particles (MP10 and MP20) is the closest to MP Bayes for low (Fig. 1B, left panel), but its performance deteriorates for the case of high and low levels (Fig. 1B, right panel). Variational SMiLe exhibits very good performance as well. It sometimes outperforms the other algorithms early after an environmental change (Fig. 1B, right panel), but shows slightly higher error values at later phases. For the Leaky Integrator we observe a trade-off between good performance in the transient phase and the stationary phase; a fixed value cannot fulfil both requirements. The Modified SMiLe rule, by construction, never narrows its belief below some minimal value, which allows it to have a very low (sometimes the lowest) error immediately after a change, but leads to high errors subsequently. The reduced Bayes’10 performs sufficiently well for lower , but not for higher values. The Particle Filter with 1 particle is in expectation similar to reduced Bayes’10 and reduced Bayes’12 (see Supplementary Material for derivation and discussion), and its performance is governed by the noise that the sampling of a single particle entails. It therefore performs worse than the two reduced Bayes models. Still, it performs better than the MP with 1 particle. The latter algorithm can be seen as a “greedy” version of Particle Filtering with 1 particle; at each step the more likely of the two possibilities, changing or staying, is kept.
In Fig. 1C we can see the average estimation error of the MP Bayesian algorithm over the whole timeline for each of the considered and levels, and in Fig. 1D the difference of the other algorithms from this benchmark. As expected, all algorithms have lower average error values for lower and lower . The Particle Filter pf10 and the Message Passing MP20 have the smallest difference from MP Bayes. The average error of the MP is higher for high and low , and the Particle Filter is more robust across levels of environmental parameters. Next in performance is the reduced Bayes’12. The Variational SMiLe exhibits a large deviation from the MP Bayes for high and low , but is still more resilient compared to the MP algorithm for this type of environments. The simple Leaky Integrator performs well at low and but deviates more from the MP Bayes as these parameters increase (Fig. 1D). The SMiLe rule performs best at lower , i.e. in more deterministic environments.
4.2 Categorical estimation task
In this task, the goal of the agent is to estimate the occurrence probability of five possible states. Each observation is drawn from a Categorical distribution with parameters , i.e. . When there is a change in the environment, the parameters are drawn from a Dirichlet distribution , where is the stochasticity parameter. An illustration of this task is depicted in Fig. 2A. We considered all combinations of stochasticity levels and change probability levels . The algorithms of Nassar et al. (2012, 2010) were specifically developed for a Gaussian estimation task and their extension to a Categorical task is not straightforward. Similarly to the Gaussian task, all algorithms were first optimized for each combination of environmental parameters, and the parameter is known to all algorithms, except for the Leaky Integrator.
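A generator for a task of this kind can be sketched as follows. The sketch assumes a symmetric Dirichlet distribution whose concentration is the same for all five states and is given by a hypothetical argument `concentration`; whether this matches the exact parameterization of the stochasticity parameter used in the simulations is an assumption. Dirichlet samples are obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma variates."""
    while True:
        draws = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(draws)
        if total > 0.0:  # guard against numerical underflow for tiny alphas
            return [d / total for d in draws]


def categorical_task(n_steps, p_c, concentration, rng, n_states=5):
    """Sample state probabilities and observations with abrupt changes."""
    alphas = [concentration] * n_states
    probs = sample_dirichlet(alphas, rng)
    all_probs, observations = [], []
    for t in range(n_steps):
        if t > 0 and rng.random() < p_c:
            probs = sample_dirichlet(alphas, rng)  # change: redraw the parameters
        all_probs.append(probs)
        observations.append(rng.choices(range(n_states), weights=probs)[0])
    return all_probs, observations


# Usage: 100 steps with change probability 0.1 and unit concentration.
rng = random.Random(0)
all_probs, observations = categorical_task(100, 0.1, 1.0, rng)
```

Small concentration values tend to produce probability vectors with most mass on one state, whereas large values tend to produce nearly uniform vectors.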
As before, the Particle Filter pf20 and the MP20 have a performance closest to that of MP Bayes (Fig. 2B); Particle Filtering performs better for high and MP20 performs better for low . The MP10 performs also very well. Variational SMiLe is the next in the ranking, with a behavior after a change similar to the Gaussian task. For all algorithms, except for the MP10 and MP20, the highest deviations from the MP Bayes are observed for medium stochasticity levels (Fig. 2D). When the environment is nearly deterministic (e.g. so that the parameter vectors have almost all mass concentrated in one component), or highly stochastic (e.g. so that nearly uniform Categorical distributions are more likely to be sampled), these algorithms achieve higher performance, while the Particle Filter is the one that is most resilient against choice of the stochasticity parameter . For the Variational SMiLe in particular, the lowest mean error is achieved for the extreme cases of high with high and low with low . In summary, for the same memory demands MP10 and MP20 are less robust across stochasticity levels compared to pf10 and pf20.
4.3 Robustness against suboptimal parameter choice
To investigate the robustness of the algorithms to a mismatch between the assumed and the actual probability of change points, we first tuned each algorithm’s parameter for an environment with a change probability , and then tested the algorithms in environments with different change probabilities, while keeping the parameter fixed. For each new environment with a different change probability, we calculated the difference between the mean squared error of these fixed parameters and the minimum possible mean squared error of the MP Bayes algorithm, i.e. the resulting mean squared error for the case that the MP Bayes’ parameter is tuned for the actual . More precisely, if we denote as the mean squared error of an algorithm with parameters – i.e. parameters tuned for an environment with – applied in an environment with , we calculated the quantity , for each algorithm . We call this quantity mean regret. The lower the values and the flatter the curve of the mean regret are, the better the performance and the robustness of the algorithm in the face of lacking knowledge of the environment. The flatness of the curve indicates the degree of deviations of the performance as we move away from the optimally tuned parameter. We ran three random (and same for all algorithms) task initializations for each level.
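The mean regret described above is a simple difference of mean squared errors, evaluated separately for each actual change probability. The following sketch makes the bookkeeping explicit; the dictionaries and all numbers in the usage lines are hypothetical placeholders and not results of the simulations.

```python
def mean_regret(mse_fixed, mse_oracle):
    """Mean regret per actual change probability.

    mse_fixed[p]:  error of an algorithm whose parameter was tuned for some
                   assumed change probability, evaluated at actual probability p.
    mse_oracle[p]: error of MP Bayes tuned for p and evaluated at p.
    """
    return {p: mse_fixed[p] - mse_oracle[p] for p in mse_fixed}


# Usage with made-up numbers (keys are actual change probabilities).
mse_fixed = {0.01: 0.5, 0.05: 0.75}
mse_oracle = {0.01: 0.25, 0.05: 0.5}
regret = mean_regret(mse_fixed, mse_oracle)
```

A flat curve of these values across the actual change probabilities indicates that the algorithm is insensitive to a mismatch between the assumed and the actual change probability.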
In Fig. 3 we plot the mean regret for each algorithm for the Gaussian task for 4 pairs of and levels. For and (Fig. 3A) MP Bayes and the MP algorithms show the highest robustness (smallest regret) and are closely followed by the Particle Filter, the Variational SMiLe, and the reduced Bayes’12 (note the regret’s small range of values). The lower the actual change probability, the higher the regret, but the changes are still very small. The curve for the SMiLe is also quite flat, but the mean regret is much higher. The same holds for the Leaky Integrator. For and (Fig. 3B) MP Bayes, MP, Particle Filtering and Variational SMiLe have very similar robustness levels. The robustness of the Leaky Integrator deteriorates considerably as the actual change probability increases. In Fig. 3C and Fig. 3D we plot the mean regret for , and and respectively. For this high stochasticity level the optimal values for the parameter of the Leaky Integrator were around regardless of the level. This means that in a highly stochastic environment the optimal behavior for the Leaky Integrator is to constantly integrate new observations into its belief, i.e. to act like a Perfect Integrator. This feature makes it blind to the change probability and therefore very robust against the lack of knowledge of it (Fig. 3C). The rest of the algorithms are more sensitive to changes. The Particle Filter is more robust than the MP algorithms, especially for lower , as we saw in the previous subsections. The MP algorithms exhibit high fluctuations in their performance, likely because they are biased estimators. The reduced Bayes’12 is quite robust at this level (Fig. 3C and D). Overall for MP Bayes, Particle Filtering, Variational SMiLe and reduced Bayes’12, a mismatch between the assumed change probability and the actual one does not deteriorate the performance dramatically for , (Fig. 3D). The MP Bayes is the most robust for low if (Fig. 3D). If the reduced Bayes’10 seems to be slightly more robust, likely for reasons similar to the case of the Leaky Integrator.
In summary, most of the time, the mean regret for MP Bayes, MP10, and MP20 is less than or equal to the mean regret for pf10 and pf20. However, the variability in the mean regret for pf10 and pf20 is smaller, and their curves are flatter across levels, which makes their performance more predictable. The results for the Categorical estimation task are similar to the Gaussian task (Fig. 4).
We have shown that performing exact Bayesian inference on the generative model of interest naturally leads to a definition of surprise and a surprise-modulated adaptive learning rate, which is similar to one that has previously been proposed in the neuroscience literature with heuristic arguments Nassar et al. (2012, 2010); Faraji et al. (2018). We have proposed two approximate algorithms for learning in non-stationary environments, which exhibit the surprise-modulated learning rate of the exact Bayesian approach. Empirically we observed that our algorithms achieve levels of performance comparable to approximate Bayesian methods with higher memory demands, and are more resilient across different environments compared to methods with similar memory demands. Our methods may find application in a model-based reinforcement learning setting, where it is desirable to have computationally efficient methods with low approximation errors. Our definition of surprise may be of interest for the active field of research on quantitative measures of surprise Shannon (1948); Itti and Baldi (2006); Schmidhuber (2010); Faraji et al. (2018) (see Supplementary Material for further discussion on connections between the Generative Model Surprise and other surprise measures). Building on the body of literature on three-factor learning rules Gerstner et al. (2018), where a third factor indicating reward or surprise enables a synaptic change or a belief update Yu and Dayan (2005); Angela (2012), our theoretical results may offer interesting interpretations of behavioral and neurophysiological data.
This research was supported by the Swiss National Science Foundation (No. 200020_184615) and by the European Union Horizon 2020 Framework Program under grant agreement No. 785907 (Human Brain Project, SGA2).
- Preuschoff et al. (2011) K. Preuschoff, B. M. ’t Hart, and W. Einhäuser. Pupil dilation signals surprise: Evidence for noradrenaline’s role in decision making. Frontiers in neuroscience, 5:115, 2011.
- Nassar et al. (2012) M. R. Nassar, K. M. Rumsey, R. C. Wilson, K. Parikh, B. Heasly, and J. I. Gold. Rational regulation of learning dynamics by pupil-linked arousal systems. Nature neuroscience, 15(7):1040, 2012.
- Modirshanechi et al. (2019) A. Modirshanechi, M. M. Kiani, and H. Aghajan. Trial-by-trial surprise-decoding model for visual and auditory binary oddball tasks. NeuroImage, 196:302–317, 2019.
- Ostwald et al. (2012) D. Ostwald, B. Spitzer, M. Guggenmos, T. T. Schmidt, S. J. Kiebel, and F. Blankenburg. Evidence for neural encoding of bayesian surprise in human somatosensation. NeuroImage, 62(1):177–188, 2012.
- Yu and Dayan (2005) A. J. Yu and P. Dayan. Uncertainty, neuromodulation, and attention. Neuron, 46(4):681–692, 2005.
- Gerstner et al. (2018) W. Gerstner, M. Lehmann, V. Liakoni, D. Corneil, and J. Brea. Eligibility traces and plasticity on behavioral time scales: experimental support of neohebbian three-factor learning rules. Frontiers in neural circuits, 12, 2018.
- Nassar et al. (2010) M. R. Nassar, R. C. Wilson, B. Heasly, and J. I. Gold. An approximately bayesian delta-rule model explains the dynamics of belief updating in a changing environment. Journal of Neuroscience, 30(37):12366–12378, 2010.
- Faraji et al. (2018) M. Faraji, K. Preuschoff, and W. Gerstner. Balancing new against old information: the role of puzzlement surprise in learning. Neural computation, 30(1):34–83, 2018.
- Friston et al. (2017) K. Friston, T. FitzGerald, F. Rigoli, P. Schwartenbeck, and G. Pezzulo. Active inference: a process theory. Neural computation, 29(1):1–49, 2017.
- Schwartenbeck et al. (2013) P. Schwartenbeck, T. FitzGerald, R. Dolan, and K. Friston. Exploration, novelty, surprise, and free energy minimization. Frontiers in psychology, 4:710, 2013.
- Friston (2010) K. Friston. The free-energy principle: a unified brain theory? Nature reviews neuroscience, 11(2):127, 2010.
- Behrens et al. (2007) T. E. Behrens, M. W. Woolrich, M. E. Walton, and M. F. Rushworth. Learning the value of information in an uncertain world. Nature neuroscience, 10(9):1214, 2007.
- Adams and MacKay (2007) R. P. Adams and D. J. MacKay. Bayesian online changepoint detection. arXiv preprint arXiv:0710.3742, 2007.
- Fearnhead and Liu (2007) P. Fearnhead and Z. Liu. On-line inference for multiple changepoint problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(4):589–605, 2007.
- Aminikhanghahi and Cook (2017) S. Aminikhanghahi and D. J. Cook. A survey of methods for time series change point detection. Knowledge and information systems, 51(2):339–367, 2017.
- Wilson et al. (2010) R. C. Wilson, M. R. Nassar, and J. I. Gold. Bayesian online learning of the hazard rate in change-point problems. Neural computation, 22(9):2452–2476, 2010.
- Kutschireiter et al. (2017) A. Kutschireiter, S. C. Surace, H. Sprekeler, and J.-P. Pfister. Nonlinear bayesian filtering and learning: a neuronal dynamics for perception. Scientific reports, 7(1):8722, 2017.
- Meyniel et al. (2016) F. Meyniel, M. Maheu, and S. Dehaene. Human inferences about sequences: A minimal transition probability model. PLoS computational biology, 12(12):e1005260, 2016.
- Heilbron and Meyniel (2019) M. Heilbron and F. Meyniel. Confidence resets reveal hierarchical adaptive learning in humans. PLoS computational biology, 15(4):e1006972, 2019.
- Yu and Cohen (2009) A. J. Yu and J. D. Cohen. Sequential effects: superstition or rational behavior? In Advances in neural information processing systems, pages 1873–1880, 2009.
- Wilson et al. (2013) R. C. Wilson, M. R. Nassar, and J. I. Gold. A mixture of delta-rules approximation to bayesian inference in change-point problems. PLoS computational biology, 9(7):e1003150, 2013.
- Mathys et al. (2011) C. Mathys, J. Daunizeau, K. J. Friston, and K. E. Stephan. A bayesian foundation for individual learning under uncertainty. Frontiers in human neuroscience, 5:39, 2011.
- Gershman et al. (2014) S. J. Gershman, A. Radulescu, K. A. Norman, and Y. Niv. Statistical computations underlying the dynamics of memory updating. PLoS computational biology, 10(11):e1003939, 2014.
- Lin et al. (2017) K. Lin, J. L. Sharpnack, A. Rinaldo, and R. J. Tibshirani. A sharp error analysis for the fused lasso, with application to approximate changepoint screening. In Advances in Neural Information Processing Systems, pages 6884–6893, 2017.
- Cummings et al. (2018) R. Cummings, S. Krehbiel, Y. Mei, R. Tuo, and W. Zhang. Differentially private change-point detection. In Advances in Neural Information Processing Systems, pages 10825–10834, 2018.
- Gordon et al. (1993) N. J. Gordon, D. J. Salmond, and A. F. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE proceedings F (radar and signal processing), volume 140, pages 107–113. IET, 1993.
- Särkkä (2013) S. Särkkä. Bayesian filtering and smoothing, volume 3. Cambridge University Press, 2013.
- Shannon (1948) C. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423 and 623–656, 1948.
- Itti and Baldi (2006) L. Itti and P. F. Baldi. Bayesian surprise attracts human attention. In Advances in neural information processing systems, pages 547–554, 2006.
- Schmidhuber (2010) J. Schmidhuber. Formal theory of creativity, fun, and intrinsic motivation (1990–2010). IEEE Transactions on Autonomous Mental Development, 2(3):230–247, 2010.
- Angela (2012) J. Y. Angela. Change is in the eye of the beholder. Nature neuroscience, 15(7):933, 2012.
- Doucet et al. (2000) A. Doucet, S. Godsill, and C. Andrieu. On sequential monte carlo sampling methods for bayesian filtering. Statistics and computing, 10(3):197–208, 2000.
6 Supplementary Material
6.1 Derivation of the Recursive Bayesian Formula
The second term in the numerator of Eq. S1 can be written as
The denominator in Eq. S1 can be written as
We denote the posterior given a change in the environment as follows:
We can then write Eq. S4:
with as defined in Eq. 7 of the main text.
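The steps of this derivation can be sketched as follows. The notation is an assumption made for illustration and is not taken from the numbered equations: $p_c$ is the change probability, $\pi^{(0)}$ the prior, $\pi^{(t)}$ the belief after $t$ observations, $P_Y(y|\theta)$ the likelihood, and $P(y;\pi)$ the marginal likelihood of $y$ under a belief $\pi$. Under a generative model in which the parameter is redrawn from the prior with probability $p_c$ at each step and stays unchanged otherwise, the exact posterior separates into a no-change term and a change term, and dividing numerator and denominator by $(1-p_c)\,P(y_{t+1};\pi^{(t)})$ gives the surprise-modulated mixture.

```latex
% Assumed notation (illustrative only):
%   P(y;\pi) = \int P_Y(y|\theta)\,\pi(\theta)\,d\theta
\begin{align}
\pi^{(t+1)}(\theta)
  &= \frac{P_Y(y_{t+1}|\theta)\,\big[(1-p_c)\,\pi^{(t)}(\theta) + p_c\,\pi^{(0)}(\theta)\big]}
          {(1-p_c)\,P(y_{t+1};\pi^{(t)}) + p_c\,P(y_{t+1};\pi^{(0)})}, \\
\pi_B^{(t+1)}(\theta)
  &= \frac{P_Y(y_{t+1}|\theta)\,\pi^{(t)}(\theta)}{P(y_{t+1};\pi^{(t)})}, \qquad
P(\theta|y_{t+1})
   = \frac{P_Y(y_{t+1}|\theta)\,\pi^{(0)}(\theta)}{P(y_{t+1};\pi^{(0)})}, \\
\pi^{(t+1)}(\theta)
  &= (1-\gamma_{t+1})\,\pi_B^{(t+1)}(\theta) + \gamma_{t+1}\,P(\theta|y_{t+1}), \\
\gamma_{t+1}
  &= \frac{m\,S}{1+m\,S}, \qquad
S = \frac{P(y_{t+1};\pi^{(0)})}{P(y_{t+1};\pi^{(t)})}, \qquad
m = \frac{p_c}{1-p_c}.
\end{align}
```

In this sketch, $\pi_B^{(t+1)}$ is the Bayesian update that assumes no change, $P(\theta|y_{t+1})$ is the posterior given a change, and $S$ is the ratio of the two marginal likelihoods, so a larger $S$ moves the belief towards the reset posterior.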
6.2 Derivation of the Optimization-Based Formulation of Variational SMiLe Rule
To derive the optimization-based update rule for the Variational SMiLe rule, we follow the approach of Faraji et al. (2018).
Consider the general form of the following variational optimization problem:
where , and on the extremes of , we will have trivial solutions:
Note that the Kullback–Leibler divergence is a convex function with respect to its first argument, i.e. in our setting. Therefore, both the objective function and the constraints of the optimization problem in Eq. S8 are convex. For convenience, we assume that the parameter space for is discrete, but the final results can also be generalized to the continuous case with some considerations. For the discrete setting, the optimization problem in Eq. S8 can be rewritten as
To solve this problem, we find a that satisfies the Karush–Kuhn–Tucker (KKT) conditions
where and are the parameters of the dual problem. Defining , and setting the partial derivative to zero, we have
where is always chosen such that is the normalization factor. According to the KKT conditions, , and as a result .
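As a numerical illustration of this result, the following Python sketch assumes a discrete parameter space and two strictly positive distributions, here called `p1` (the distribution in the objective) and `p2` (the distribution in the constraint). The names `p1`, `p2`, `bound`, and all function names are hypothetical. The sketch also assumes that the stationarity condition yields a normalized geometric mixture proportional to `p1**(1 - gamma) * p2**gamma`, with `gamma = lam / (1 + lam)` for a dual parameter `lam >= 0`. Because the divergence from the mixture to `p2` decreases monotonically in `gamma` on [0, 1], the smallest feasible `gamma` can be found by bisection.

```python
import numpy as np


def kl(q, p):
    """KL[q || p] for discrete distributions with strictly positive entries."""
    return float(np.sum(q * (np.log(q) - np.log(p))))


def geometric_mix(p1, p2, gamma):
    """Normalized distribution proportional to p1**(1 - gamma) * p2**gamma."""
    log_q = (1.0 - gamma) * np.log(p1) + gamma * np.log(p2)
    log_q -= log_q.max()  # shift for numerical stability before exponentiating
    q = np.exp(log_q)
    return q / q.sum()


def solve_gamma(p1, p2, bound, tol=1e-10):
    """Smallest gamma in [0, 1] such that KL[q_gamma || p2] <= bound (bound >= 0)."""
    if kl(p1, p2) <= bound:
        return 0.0  # constraint inactive: q = p1 already satisfies it
    lo, hi = 0.0, 1.0  # constraint violated at lo, satisfied at hi (q = p2)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl(geometric_mix(p1, p2, mid), p2) > bound:
            lo = mid
        else:
            hi = mid
    return hi


# Illustrative values only (not taken from the paper).
p1 = np.array([0.7, 0.2, 0.1])
p2 = np.array([0.1, 0.3, 0.6])
bound = 0.5 * kl(p1, p2)
gamma = solve_gamma(p1, p2, bound)
q_star = geometric_mix(p1, p2, gamma)
print(gamma, q_star, kl(q_star, p2))
```

At the two extremes the sketch recovers the trivial solutions: a bound of zero forces `gamma` to 1 and the mixture to `p2`, while a bound at or above `kl(p1, p2)` leaves `gamma` at 0 and the mixture at `p1`.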
6.3 Modified SMiLe Rule
The constraint of the minimization problem for the Variational SMiLe is essentially a modified version of the Confidence Corrected Surprise (see below for the original version) defined by Faraji et al. (2018):
In the original version of (see below), is always assumed to be a uniform distribution for the computation of , which is not well-defined for some types of parameters. With the aim of minimizing the Confidence Corrected Surprise by updating the belief over time, Faraji et al. (2018) suggested an update rule that solves the optimization problem:
where is an arbitrary bound. The authors showed that the solution to this optimization problem is:
where is specified so that it satisfies the constraint. Although Eq. S16 looks very similar to Eq. 8 of the main text, it signifies a trade-off between the latest belief and the belief updated by only the most recent observation , whereas in the approaches we analyzed the trade-off (Eq. 8 in the main text) is between integrating the new observation with the old ones (i.e. ) and . To modulate the learning rate by surprise, Faraji et al. (2018) considered the boundary as a function of surprise, i.e. , where is a free parameter, and is the maximum value for the boundary, . Since Faraji et al. (2018) mentioned that this choice was arbitrary, and to be consistent with our other approaches, we modulate the learning rate of the Modified SMiLe rule similarly to the Variational SMiLe rule, but with (as opposed to ) as the measure of surprise: .
6.4 Original SMiLe Rule
In the original version of the SMiLe rule proposed by Faraji et al. (2018), the definition of the Confidence Corrected surprise is given by
where is the scaled likelihood defined as
which can be ill-defined, since the normalization factor can be infinite. The other parts are exactly the same as in the modified version, except for the modulation procedure. In the original version, the modulation is done over the boundary
and then, is found by satisfying the constraint of the optimization.
6.5 Derivation of Particle Filtering
Here we derive the weight update for the particle filter. We start by defining the number of changes from the beginning until time as the random variable .
Our formalism differs from a standard derivation Särkkä (2013) in the absence of the Markov property of conditional observations (i.e. ). We have