The Hierarchical Adaptive Forgetting Variational Filter

05/15/2018 · by Vincent Moens et al.

A common problem in Machine Learning and statistics consists in detecting whether the current sample in a stream of data belongs to the same distribution as the previous ones, is an isolated outlier, or inaugurates a new distribution of data. We present a hierarchical Bayesian algorithm that aims at learning a time-specific approximate posterior distribution of the parameters describing the distribution of the observed data. We derive the update equations of the variational parameters of the approximate posterior at each time step for models from the exponential family, and show that these updates find interesting correspondences in Reinforcement Learning (RL). In this perspective, our model can be seen as a hierarchical RL algorithm that learns a posterior distribution according to a certain stability confidence that is, in turn, learned according to its own stability confidence. Finally, we show some applications of our generic model, first in an RL context, next with an adaptive Bayesian Autoregressive model, and finally in the context of Stochastic Gradient Descent optimization.







1 Introduction

Learning in a changing environment is a difficult albeit ubiquitous task. One key issue for learning in such a context is to discriminate between isolated, unexpected events and a prolonged contingency change. This discrimination is challenging with conventional techniques because they rely on prior assumptions about environment stability. When a fluctuating context is assumed, past experience is forgotten immediately when an unexpected event occurs; but if that event was just noise, this erroneous forgetting can be very costly. In less variable contexts, model parameters tend to change more gradually, thus sometimes missing fluctuations when they happen faster than expected. Most models cover only one of the two possibilities, and either gradually adapt their predictions to the new contingency or do so abruptly, but not both.

One classical solution to the problem of change detection is to compare the likelihood of the current observation given the previous posterior distribution with a default probability distribution (Kulhavy & Karny, 1984), representing an initial, naive state of the learner. Usually, the mixing coefficient (or forgetting factor) that is used to weight these two hypotheses is adapted to the current data in order to detect and account for a possible contingency change. This mixing coefficient can be implemented in a linear or exponential manner (Kulhavý & Kraus, 1996). We will focus here on the exponential case.
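For a conjugate model, exponential mixing has a particularly simple form: the geometric mixture of two exponential-family densities is again in the family, with the w-weighted combination of their parameters. A minimal sketch for a Beta-Bernoulli learner follows; the forgetting factor and the data are illustrative assumptions, not values from the paper:

```python
def exponential_forgetting_step(a, b, a0, b0, x, w):
    """One step of stabilized exponential forgetting for a Beta-Bernoulli model.

    The prior for the new trial is a geometric (exponential) mixture of the
    previous posterior Beta(a, b) and the naive prior Beta(a0, b0), weighted
    by the forgetting factor w in [0, 1].  For Beta densities this mixture is
    again a Beta whose parameters are the w-weighted combinations.
    """
    a_prior = w * a + (1.0 - w) * a0   # forgetting pulls back toward the naive prior
    b_prior = w * b + (1.0 - w) * b0
    # Conjugate update with the new binary observation x in {0, 1}
    return a_prior + x, b_prior + (1.0 - x)

# w = 1 keeps all past evidence; w = 0 resets to the naive prior at every trial.
a, b = 1.0, 1.0
for x in [1, 1, 1, 0, 1]:
    a, b = exponential_forgetting_step(a, b, 1.0, 1.0, x, w=0.9)
```

With w = 0.9 the learner keeps roughly 1/(1 - 0.9) = 10 trials of effective memory; in the HAFVF, w itself is a latent variable with a Beta posterior rather than a constant.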

In the past decade, several Bayesian solutions to this problem based on the aforementioned strategy have been proposed (Smidl, 2004; Smidl & Gustafsson, 2012; Azizi & Quinn, 2015). However, they usually suffer from several drawbacks: many of them put a restrictive prior on the mixing coefficient (e.g. (Smidl, 2004; Masegosa et al., 2017)) and cannot account for the fact that an unexpected event is unlikely to be caused by a contingency change if the environment has been stable for a long time.

We propose the Hierarchical Adaptive Forgetting Variational Filter (HAFVF). The core idea of the model is that the mixing coefficient can be learned as a latent variable with its own mixing coefficient. It is inspired by the observation that animals tend to decrease their flexibility (i.e. their capacity to adapt to a new contingency) when they are trained in a stable environment, and that this flexibility is inversely correlated with the training length (Dickinson, 1985). We suggest that this strategy may be beneficial in many environments, where the stability of the system identified by a learner is a quantity that can be learned as an independent variable with a certain confidence: in certain environments, contingency changes are inherently more likely than in others. Although this assumption may not hold in every case, we show that it helps the algorithm to stabilize and to discriminate contingency changes from accidents.

Accordingly, we frame our algorithm in an RL framework. We explore how the forward learning algorithm can be extended to the forward-backward case. We show three applications of our model: first in the case of a simple RL task, next to fit an autoregressive model, and finally for gradient learning in a Stochastic Gradient Descent (SGD) algorithm.

2 Hierarchical Model

Figure 1: Directed Acyclic Graph of the HAFVF. Latent variables are represented by white circles. Mixtures of distributions are represented by red squares. The three levels of the model are displayed: the prior of each latent variable is a mixture of the previous posterior distribution and an initial prior, and the mixing coefficient is itself distributed according to a similar mixture of distributions with its own coefficient.

Let be a stream of data distributed according to a set of distributions , where the change trials are unknown and unpredictable. We make the following assumptions:

Assumption 1.

Let and , then .

Corollary 1.

If is a measure of the relative probability that belongs to wrt , and if , then .

Assumption 1 and Corollary 1 state that the probability of seeing a contingency change decreases with time in a steady environment. This might seem counter-intuitive or even maladaptive in many situations, but it is a key assumption we use to discriminate artifacts from contingency changes: after a long sequence, the amount of evidence needed to switch from the current belief to the naive belief is greater than after a short sequence. This assumption leads us to build a model where, if the learner is very confident in its belief, it takes more time to forget past observations, because more evidence for a contingency change is needed. Therefore, in this context, the learner aims not only to learn the distribution of the data at hand, but also a measure of confidence in the steadiness of the environment.

Assumption 2.

In the set of probability distributions considered, all elements have the same parametric form, which belongs to the exponential family and has a conjugate prior that is also from the exponential family:


We now focus on the problem of approximating the current posterior distribution of the parameters given the current and past observations. For clarity, we will make the subscripts implicit in the following. Let us first focus on the problem of estimating this posterior distribution in the stationary case. After a number of steps, and given some prior distribution, the posterior distribution can be formulated recursively as:

Given the restriction imposed by Assumption 2, this posterior probability distribution has a closed-form expression and can be estimated efficiently.

We enrich this basic model by first formulating the prior distribution of at as a mixture of the previous posterior distribution and an arbitrary prior:


Following Assumption 2, the conjugate distribution is also from the exponential family and reads

where we have expanded the natural parameters; one component of the parameter vector indicates the effective (prior) number of observations. If the initial prior has the same form as the previous posterior, then the log-partition function can be computed efficiently (Mandt et al., 2014):

Note that this result simplifies when combined with the numerator of Equation 1:


where the latent variable weights the initial prior against the posterior at the previous trial. We incorporate this variable in the set of latent variables, and we put a mixture prior on it with its own weight: following this approach, the previous posterior probability of the forgetting factor conditions the current one, together with a prior that is blind to the stream of data up to now. Assuming that each of the latent variables can be generated by changing distributions, the joint probability now reads:
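In natural-parameter space, the weighted prior and the conjugate update above can be sketched in a few lines. Below, a Normal model with known unit variance is used purely as an illustration: the parameterization eta = (sum of observations, effective count) and the weight w = 0.8 are assumptions, not values from the paper.

```python
import numpy as np

def mixed_prior(eta_prev, eta0, w):
    """Weighted prior in natural-parameter space: the geometric mixture of the
    previous posterior (eta_prev) and the naive prior (eta0) is conjugate
    again, with parameters w * eta_prev + (1 - w) * eta0.  The component
    counting effective observations is discounted the same way, so w < 1
    bounds the memory of the filter."""
    return w * np.asarray(eta_prev) + (1.0 - w) * np.asarray(eta0)

def conjugate_update(eta_prior, t_x):
    """Conjugate update: add the sufficient statistics t(x) of the new datum."""
    return eta_prior + np.asarray(t_x)

# Normal model with known unit variance: eta = (sum of observations, count).
eta0 = np.array([0.0, 1.0])            # naive prior: mean 0, one pseudo-observation
eta = eta0.copy()
for x in [2.0, 2.0, 2.0]:
    eta = conjugate_update(mixed_prior(eta, eta0, w=0.8), [x, 1.0])
posterior_mean = eta[0] / eta[1]       # shrunk estimate of the mean
```

Because the effective count is discounted by w at each step, the posterior mean keeps tracking recent data instead of averaging over the whole stream.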


where we have assumed that the posterior probability factorizes (mean-field assumption), and where the remaining parameters describe the naive, initial prior distributions over the respective latent variables. The model presented in Equation 3 is no longer conjugate, and the posterior probability does not generally have an analytical solution. We therefore introduce a variational posterior to approximate the true posterior. In short, Variational Inference (Jaakkola & Jordan, 2000) works by replacing the posterior by a proxy of an arbitrary form and finding the configuration of this approximate posterior that minimizes the Kullback-Leibler divergence between this distribution and the true posterior. This is virtually identical to maximizing the Evidence Lower Bound on the log model evidence (ELBO).

For simplicity, we use a factorized variational posterior where each factor has the same form as the prior distribution of its latent variable. Under this factorization, Equation 3 conveniently simplifies to:


This model is shown in Figure 1. In what follows, we will restrict our analysis to the case where the two mixing coefficients are Beta distributed, meaning that the approximate posteriors we will optimize for these two variables will also be Beta distributions.

2.1 Update equations


We first define the following notation: is the difference between the previous approximate posterior and the initial prior. We use as the weighted prior parameters, and as the expectation of under . Similarly, and are the weighted prior over and its expectation under , respectively. Also, we will often abbreviate the summary statistics of as .

We now focus on the problem of finding the approximate posterior configuration that maximizes the ELBO. Various techniques have been developed to solve this problem: whereas Stochastic Gradient Variational Bayes (Kingma & Welling, 2013) and Stochastic Variational Inference (Hoffman et al., 2012) work well for large datasets, more traditional conjugate (Winn et al., 2005) or non-conjugate (Knowles & Minka, 2011) Variational Message Passing (VMP) algorithms are better suited for our problem. This technique allows us to derive closed-form update equations that can be applied sequentially to each of the nodes of the factorized posterior distribution until a certain convergence criterion is met. We interpret these results in a Hierarchical Reinforcement Learning framework, where each level adapts its learning rate (LR) as a function of the expected log-likelihood of the current observation given the past.

Fortunately, under the form of the approximate posterior we chose and using Conjugate VMP, the variational parameters of the posterior over the latent parameters have a simple form given the current value of and . For a number of observations observed at time , we have:


Equation 5 finds an interesting correspondence in the RL literature. Consider the limit case (which is still analytically tractable following Equation 2). As the expectation of a distribution of the exponential family has a general closed form, one can derive a similar posterior expectation (Diaconis & Ylvisaker, 1979):

Now, after substitution, the above expression becomes (Mathys, 2016)


where the first term is the average at the time of the previous observation and the second factor is the LR, whose value is inversely proportional to the effective memory and to the current expected value of the forgetting factor. (One can easily see that the forgetting factor dictates the memory of the learner: assuming it is stationary, the effective memory converges to a fixed limit.) Equation 7 is a classical incremental update rule in RL (Sutton & Barto, 1998), and our algorithm can be viewed as a special case of such algorithms where the LR is adapted online to the data at hand.
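The correspondence with the incremental delta rule can be sketched as follows. Here the effective memory follows the recursion n_t = w n_{t-1} + 1, whose stationary limit is 1/(1 - w); the concrete numbers are illustrative assumptions:

```python
def delta_rule(mu_prev, x, n_eff):
    """Incremental RL update: mu += lr * (x - mu), with the learning rate set
    by the inverse of the effective memory (a pseudo-count)."""
    return mu_prev + (1.0 / n_eff) * (x - mu_prev)

def effective_memory(n_prev, w):
    """With a constant forgetting factor w, the effective count follows
    n_t = w * n_{t-1} + 1, which converges to the geometric limit 1/(1 - w)."""
    return w * n_prev + 1.0

n, mu = 1.0, 0.0
for x in [1.0, 1.0, 0.0]:
    n = effective_memory(n, w=0.9)
    mu = delta_rule(mu, x, n)
```

In the HAFVF the learning rate is not fixed by a constant w but adapted online through the posterior over the forgetting factor.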

The update equation for the forgetting factor is, however, not as simple to derive as the previous one, because its likelihood is not conjugate to its Beta prior. To solve this problem, we use a Non-Conjugate VMP (NCVMP) approach (Knowles & Minka, 2011). Briefly, NCVMP minimizes an approximate KL divergence in order to find the values of the approximate posterior parameters that maximize the ELBO. In order to use NCVMP, the first step is to derive the expected log-joint probability of the model, which we will need to differentiate wrt the variational parameters of each forgetting factor in turn. It quickly appears that part of this expression does not always have an analytical form for common exponential-family distributions: indeed, the relevant expected value is, in general, intractable and needs to be approximated. Expanding the Taylor series of this expression up to the second order and taking the expectation, we have:


Notice that the second term of the sum in Equation 8 involves the prior covariance of the model parameters. Hence, this penalty term becomes important when the product of the following factors increases: the distance between the previous posterior and the initial prior, the posterior variance of the forgetting factor, and the prior covariance of the parameters. This has the effect of favoring values of the forgetting factors that have a low variance, especially when the two proposed distributions, the previous posterior and the initial prior, are very distant from each other.

We now derive the update equation for the approximate posterior of . Let us first define


We obtain the following result:

Proposition 1.

Using Algorithm 1 of (Knowles & Minka, 2011), the update equation for has the form:


and is the n-th order polygamma function.


Proof. Follows directly from Algorithm 1 in (Knowles & Minka, 2011); the full development can be found in the supplementary materials. ∎

The update equation in Equation 10 can be easily transposed for .

In Proposition 1, we show that the update can be decomposed into four terms: the first is the (weighted) prior, which acts as a reference for the update.

The second term depends upon the derivative, wrt the variational parameters, of the expectation of the log probability, times a constant. This derivative has a simple form:

Lemma 1.

The derivative of the first order Taylor expansion of the expected log probability around has the form

The proof is given in the supplementary materials. The expression of is easily understood as a measure of similarity between the current update of the variational posterior and the previous posterior dependent prior . Note that a rather straightforward result of Lemma 1 is that : as the posterior becomes stronger, the relative change that one can expect tends to zero, and the impact of on the update of can be expected to decrease. This is the behaviour we aimed at: a very strong posterior probability becomes more and more difficult to change as the training time increases.

Note also the opposite sign of the related increment in Equation 10 for and . This implies that if , then , and the update of will tend to increase. The opposite is true for , showing that the posterior of effectively deals with the similarity between the current observation and the previous ones.

The third and fourth terms of Equation 10 are conditioned by the posterior variance and the prior variance of the variables involved. In brief, they push the update in a direction that lowers both variances. We will show in the next section a simple example of the relative contribution that each of these terms has in the update.

An important consideration is that the updated parameter values must remain positive, a restriction that may be violated in practice, especially for low values of the concentration. In such cases, we reset the violating value to some arbitrary value where the inequality holds, and resume the update loop until convergence or until a certain number of iterations is reached. Note that NCVMP is not guaranteed to converge but, as suggested by Knowles & Minka (2011), the use of a form of exponential damping can improve the convergence of the algorithm.
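The reset-and-damp loop described above can be sketched as a generic damped fixed-point iteration. The update function, reset value, and validity check below are placeholders standing in for the NCVMP quantities, and a toy scalar fixed point is used so the sketch is runnable:

```python
def damped_fixed_point(update_fn, theta_init, theta_reset, is_valid,
                       damping=0.5, tol=1e-6, max_iter=100):
    """Sketch of the damped NCVMP-style update loop described in the text.

    Each proposal is exponentially damped toward the previous iterate (NCVMP
    is not guaranteed to converge); a proposal violating the positivity
    constraint is replaced by an arbitrary valid reset value, and iteration
    resumes until convergence or the iteration budget is exhausted.
    """
    theta = theta_init
    for _ in range(max_iter):
        proposal = update_fn(theta)
        if not is_valid(proposal):
            proposal = theta_reset          # reset to an arbitrary valid value
        new_theta = damping * theta + (1.0 - damping) * proposal
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Toy fixed point theta = 1 + 0.5 / theta, whose root is (1 + sqrt(3)) / 2.
root = damped_fixed_point(lambda t: 1.0 + 0.5 / t, 2.0, 1.0, lambda t: t > 0)
```

The damping factor trades convergence speed for stability, which mirrors the role of exponential damping in the NCVMP updates.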

2.2 Example: Binary distribution learning

In order to better understand the relative contributions to the variational update scheme, we generated a sequence of 200 binary data points distributed according to a binomial distribution whose probability switched between two values every 40 trials. This distribution can be modelled as a hierarchy of Beta distributions, where the first level is a Bernoulli distribution with a conjugate, Beta approximate posterior, and the one or two levels above are both Beta distributions measuring the stability of the level below. We simulated the learning process in three cases:

  • A two-layer HAFVF model, where only the posterior over could be forgotten (incremental).

  • A two-layer HAFVF model, where the posterior of was being forgotten at a fixed rate (i.e. fixed to ).

  • A three-layer HAFVF model, where the posterior of was being forgotten at a rate of .

In each of these examples, we used the following implementation: the Beta prior of the first level was fixed, at a value which proved to be a good compromise between informativeness and freedom to fit the data. If applicable, the top level was given a fixed prior as well.

In the first case, the fitting rapidly degenerated, as the memory grew at each trial. Figure 2, left column, gives a hint about the reason for this behaviour: each observation decreases the prior covariance, which results in a positive increment for both forgetting factors. This can be viewed as a form of confirmation bias: because the posteriors are confident about the distribution of the data, they tend to reinforce each other and lose flexibility. Consequently, the impact of contingency changes decreases as learning goes on. This might seem undesirable (and, in this pathological case, it is), but with datasets containing outliers it can be very beneficial: a longer training in a stable environment will require a longer and/or stronger sequence of outliers to reset the parameters.

Figure 2: Binary learning with a single level of forgetting. Incremental (left column) and fixed-decay (right column) posterior learning. A. First-level learning. The learner loses its capacity to forget as data are observed, because the expected effective memory (B.) tends to grow indefinitely when no decay is assumed. C. Trial-wise increments. The effects of contingency changes decreased when no decay was considered.

Adding a forgetting factor to the second-level posterior can moderate the effect of overtraining. In the case of a fixed forgetting factor for this posterior (Figure 2, right column), the fitting is much more stable: the model is able to learn and forget the current distribution efficiently, with a memory bounded at approximately 5 trials. This shows that adding forgetting over the second-level posterior effectively provides the flexibility we aim at: contingency changes are efficiently detected, and the drop of the forgetting factor triggers a resetting of the parameters in the following trials.

In the last case (Figure 3), the first level of the model acquires a higher memory than in the second example, due to the ability of the model to adapt the forgetting factor of , which relaxes its bound. It is, however, more flexible than the first example.

Figure 3: Binary learning with two levels of adaptive forgetting ( and ) and a third fixed level . D. is similar to C. for the third level updates .

2.3 Forward-Backward algorithm

Let us consider the conjugate posterior of the distribution from the exponential family when the whole dataset has been observed. For a given , one can derive the posterior probability of given as:


Given Equation 2 and Equation 5, if is the conjugate prior of and is from the exponential family, we can substitute the prior by , where and are the effective samples retrieved from the forward and the backward application of the AFVF on the dataset, respectively. Formally, we have:


where the and superscripts index the forward and backward pass, respectively. In offline learning, this technique can increase the effective memory of the approximate posterior distribution just before and after the change trials.
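As a sketch, the forward-backward substitution can be written as a combination of the effective samples in natural-parameter space. The exact combination rule involves symbols lost from this extraction; subtracting one copy of the naive prior so that it is not counted twice is an assumption:

```python
import numpy as np

def forward_backward_prior(eta_f, eta_b, eta0):
    """Combine the effective samples gathered by the forward and backward
    passes of the filter in natural-parameter space.  Summing the two passes
    and subtracting one copy of the naive prior (so that it is not counted
    twice) is a standard combination; the paper's exact rule involves symbols
    lost from this extraction, so treat this as an assumed sketch."""
    return np.asarray(eta_f) + np.asarray(eta_b) - np.asarray(eta0)

# Illustrative numbers only: forward pass, backward pass, and naive prior.
combined = forward_backward_prior([4.0, 3.0], [2.5, 2.0], [1.0, 1.0])
```

Because both passes accumulate evidence from opposite directions, the combined prior has a larger effective sample size near change points than either pass alone, which matches the stated benefit for offline learning.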

3 Related work

Change detection is a broad field in machine learning, where no optimal and general solution exists (Kulhavý & Zarrop, 1993). Consequently, assumptions about the structure of the system can lead to very different algorithms and results.

The Kalman Filter (Azizi & Quinn, 2015) is a special case of Bayesian Filtering (BF) (Doucet et al., 2001) that has had a large success in the Signal Processing literature due to its sparsity and efficiency. It is, however, highly restrictive, and its assumptions need to be relaxed in many instances. One can discriminate two main approaches to this problem: the first is to use a global approximation of BF such as Particle Filtering (PF) (Smidl & Quinn, 2008; Smidl & Gustafsson, 2012; Özkan et al., 2013), which enjoys a bounded error but suffers from a lower accuracy than other, local approximations. The second class of algorithms comprises the Stabilized Forgetting (SF) family (Kulhavý & Zarrop, 1993; Azizi & Quinn, 2015; Laar et al., 2017), of which our model is a special case. SF suffers from an unbounded error, but it usually has a greater accuracy for a given amount of resources (Smidl & Quinn, 2008). Note that SF has been shown to be essential to reduce the divergence between the true posterior and its approximation in recursive Bayesian estimation (Kárný, 2014). As we apply the SF operator both to the posterior of the model parameters and to the mixture weight (through the weighted mixture prior), we ensure that the divergence is reduced for both of these latent variables.

Even though our model is described as a Stabilized Exponential Forgetting algorithm (Kulhavý & Kraus, 1996) and is well suited for signal processing, it can be generalized to models where there is no prediction of future states (e.g. smoothing of a signal, reinforcement learning, etc.). It also improves on other methods in the following ways.

First, it uses a Beta prior on the mixing coefficient. This is unusual (but not unique; see Dedecius & Hofman, 2012), as most previous approaches used a truncated exponential prior (Smidl & Quinn, 2005; Masegosa et al., 2017) or a fixed, linear mixture prior that accounts for the stability of the process (Smidl & Gustafsson, 2012). In Stabilized Linear Forgetting, a Bernoulli prior with a Beta hyperprior has been proposed for the mixture weight (Laar et al., 2017). Our approach is designed to learn the posterior probability of the forgetting factor in a flexible manner. We show that this posterior probability depends upon its own (and possibly a mixture of) prior distribution and upon the prior covariance of the model parameters. This makes change detection more subtle than an all-or-none process, as one might observe with a Bernoulli distribution. It also enables us to accumulate evidence for a change of distribution across trials, which can help to discriminate outliers from real, prolonged contingency changes. This is, to our knowledge, an entirely novel feature in the adaptive forgetting literature.

The second important novelty of our model lies in its hierarchical learning of the environment stability. This is somewhat similar to the Hierarchical Gaussian Filter (HGF) (Mathys, 2011; Mathys et al., 2014). The present model is, however, much more general, as the generic form we provide can be applied to several members of the exponential family. Also, although the KL divergence (error term) of our model is not bounded in the long run, it can be efficiently applied to a large variety of datasets and models, whereas the HGF often fails to fit processes that are highly stationary, have many datapoints and/or contain abrupt contingency changes.

4 Experiments

The HAFVF was coded in the Julia language (Bezanson et al., 2017), using forward-mode automatic differentiation (Revels et al., 2016) for the NCVMP updates in the RL and AR parts of this section, and an analytical gradient for the SGD part.

4.1 Reinforcement Learning

We first look at the behaviour of the model in the simple case of estimating the current distribution of a random variable sampled from a moving distribution. We simulated two sequences of 2x200 datapoints, where each pair of points was generated according to the same multivariate normal distribution. We then added an independent random walk to the means.

We applied the Forward-Backward (FB) version of the HAFVF to these datasets. We used the same Normal Inverse Wishart prior for both of these results (, , , ). The prior over was manipulated to include a high confidence (, ) or a low confidence (, ) about the average value of . Note that both of these priors had the same expected value. To avoid overfitting of early trials (which may happen using weak priors) while keeping the distribution flexible, we used a flat prior over : . The top level forgetting was ignored (). Results are shown in Figure 4.

Figure 4: Experiment 1. Left column: weak prior over the forgetting factor. Right column: strong prior. Shaded areas represent the 95% posterior confidence interval. See text for more comments.

As the first setting had a weak prior over the forgetting factor, it had more freedom to adapt the posterior distribution to the current data. The effective memory trace was greater when the environment was stable, and changed faster after the contingency change than with the more confident prior, where the adaptation was slow and the effective memory did not increase much above the prior-defined threshold of 10 (or 20 for the FB algorithm).

The behaviour of both models after the contingency change is informative about the effect that the prior had on the inference process: the weak-prior forgetting factor dropped immediately after an unexpected observation was made, which can be advantageous when sudden changes are expected, but maladaptive in the presence of outliers. The strong-prior model behaved in the opposite way, and handled the change more slowly than its weak-prior counterpart.

It is interesting to note that the posterior probability distribution of the forgetting factor (not shown in the figure) was also more flexible in the first model fit than in the second, because the observations in the level below were also more variable, due to the less confident prior: this had the effect of increasing the gain in precision, which increased the strength of the corresponding posterior (through the related terms in Equation 10).

4.2 Autoregressive model

We fitted the HAFVF to a simulated non-stationary sinusoidal signal of 400 datapoints generated by two separate systems with a low and a high frequency, respectively. These signals were randomly generated as the sum of five sinusoidal waves, with the aim of observing whether the algorithm was able to adapt to the abrupt contingency change.

Because we also aimed at a more informative view of the performance of the algorithm in the presence of artifacts, we altered this signal by adding two impulses of 2 a.u. at two separate time points.

We studied a single implementation of the model, with a relatively strong prior over the first-level forgetting factor and a flat prior over the second. The forward-backward version of the algorithm was applied, and we arbitrarily chose forward and backward orders of 10 samples. Figure 5 shows the results of this experiment.
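A sketch of the simulated signal follows. The frequency bands, phases, and impulse locations are assumptions (the text only specifies low and high frequency regimes, five sinusoids per regime, 400 samples, and impulses of 2 a.u.):

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(400)

def five_wave_signal(freqs, t, rng):
    """Sum of five sinusoids with random phases.  The frequency bands are
    assumptions: the text only specifies 'low' and 'high' frequency regimes."""
    phases = rng.uniform(0.0, 2.0 * np.pi, size=5)
    return sum(np.sin(2.0 * np.pi * f * t + p) for f, p in zip(freqs, phases))

# Two 200-sample regimes followed by two artifact impulses of 2 a.u.
low = five_wave_signal(rng.uniform(0.01, 0.05, 5), t[:200], rng)
high = five_wave_signal(rng.uniform(0.10, 0.30, 5), t[200:], rng)
signal = np.concatenate([low, high])
for idx in (100, 300):                 # impulse locations are assumed
    signal[idx] += 2.0
```

The abrupt regime switch at sample 200 tests adaptation to contingency changes, while the isolated impulses test robustness to artifacts.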

Figure 5: Experiment 2: Autoregressive model with a weak prior over the forgetting factor. A. Observations and simulated response of the models. The zoomed windows show the effect of the artifacts on the estimated mean value. B. Effective memory (the "effective number of observations" parameter of the posterior) of the three parts of the algorithm (plain lines), and corresponding expected effective memory (dashed lines). Outliers had a limited impact on learning in both cases. C. Value of the AR mean weights through time. The model dealt adequately with the outliers (the values of the parameters did not change substantially) and with the contingency change (the values adapted to the two different signals).

4.3 Stochastic Gradient Descent

SGD is a popular technique for finding the minimum of (often computationally expensive) loss functions over large datasets (Tran et al., 2015), or of objectives involving intractable integrals that can be sampled from (Kingma & Welling, 2013). However, SGD can be unstable, especially with recurrent neural networks (Fabius & van Amersfoort, 2014), where an isolated, highly noisy sample in the sequence can lead to a degenerate sample of the gradient over the whole sequence. This effect is further magnified when the sample size is low.

We implemented a slightly modified version of the HAFVF in an SGD framework, intended to be similar to the Adam optimizer (Kingma & Ba, 2015); more details can be found in the supplementary materials. In short, we used two specific decays for the posteriors of the means and variances of the gradients, respectively, while ensuring that the required constraint between the two held. We modelled these posteriors as a set of Normal-Inverse-Gamma distributions. Each set of weights and biases of the multilayer perceptrons was provided with its own hierarchical decay, to take advantage of the fact that some groups of partial derivatives might be noisier than others. We used this algorithm with strong priors over the decays, to limit the effect of degenerate gradients on the approximate posteriors.
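The optimizer can be sketched as an Adam-like update in which the two decays would, in the full model, be per-group posterior expectations learned by the HAFVF rather than constants. The constraint between them (assumed here to be w1 <= w2, as in Adam) is enforced by clipping; all names are illustrative, not the paper's:

```python
import math

def adafvf_like_step(theta, grad, state, lr=1e-3, eps=1e-8):
    """One step of an Adam-like update in the spirit of the AdaFVF sketch.

    In the full model, the decays w1 (gradient means) and w2 (gradient
    variances) are posterior expectations learned per parameter group; here
    they are fixed placeholders, and w1 <= w2 is an assumed constraint.
    """
    w1 = min(state['w1'], state['w2'])
    w2 = state['w2']
    state['m'] = w1 * state['m'] + (1.0 - w1) * grad        # first moment
    state['v'] = w2 * state['v'] + (1.0 - w2) * grad ** 2   # second moment
    return theta - lr * state['m'] / (math.sqrt(state['v']) + eps)

state = {'m': 0.0, 'v': 0.0, 'w1': 0.9, 'w2': 0.999}
theta = 1.0
for _ in range(10):
    theta = adafvf_like_step(theta, 2.0 * theta, state)     # gradient of theta**2
```

In the adaptive version, a sudden drop in the learned decays after an outlier shrinks the moment estimates back toward the naive prior, which is how the optimizer "forgets" degenerate gradients.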

This algorithm was tested with a variational recurrent auto-encoding regression model inspired by (Moens & Zenon, 2018), where the output probability density was a first-passage density of a Wiener process (Ratcliff, 1978). The simulated dataset was composed of 64 subjects performing a 500-trial two-alternative forced-choice task (Britten et al., 1992), where choices and reaction times were the quantities the model aimed to predict. At each step of the SGD process, 5 subjects were sampled, for a total of 2500 trials.

Figure 6 compares the results of the AdaFVF SGD optimizer with those of the Adam optimizer, executed with its default parameters. The AdaFVF proved to be less affected by degenerate samples than Adam, as can be seen from the ELBO trace and from the heat plots of the expected memories, as well as from the estimated average negative ELBO of the two optimizers at iteration 10000.

Figure 6: Experiment 3: SGD with the HAFVF. A. ELBO for the Adam optimizer and for the AdaFVF SGD. After an outlier was sampled, the AdaFVF simply forgot the gradient history, and reset its belief to the naive prior, thereby decreasing the relative contribution of this sample. B. Expected memory of the variance posteriors. The impact of outliers is highlighted by the zoomed windows.

5 Limitations, perspective and conclusion

Our algorithm has the following limitations. The first lies in the exponential form we have given to the mixture distributions; a linear form, similar to (Laar et al., 2017), could however also be implemented at specific levels of the hierarchy. It may also be difficult to choose an adequate prior for the various levels of the hierarchy. The naive prior of the lower level is usually crucial but hard to specify, but this is a generic feature of adaptive forgetting. For the two top levels, we propose as a rule of thumb to use weak priors in situations where abrupt contingency changes are expected: they can provide a higher memory to the model, but they are more affected by outliers than stronger priors. The latter option is therefore advisable in situations where the sequence is expected to contain outliers, and when large amounts of data are modelled. There is, however, no generic solution, and one might need to try different model specifications before selecting the most suitable one.

The HAFVF and variants could lead to many promising developments in RL related fields, where they might help to prevent unnecessary forgetting of past events during exploration, in signal processing and more distant fields such as deep learning, where they could be used to prevent the occurrence of catastrophic forgetting.

In conclusion, we present a new generic model aimed at coping with abrupt or slow signal changes and presence of artifacts. This model flexibly adapts its memory to the volatility of the environment, and reduces the risk of abruptly forgetting its learned belief when isolated, unexpected events occur. The HAFVF constitutes a promising tool for decay adaptation in RL, system identification and SGD.


We thank the reviewers for their careful reading and precious comments on the manuscript. We also thank Oleg Solopchuk and Alexandre Zénon, who greatly contributed to the development and writing of the present manuscript. The present work was supported by grants from the ARC (Actions de Recherche Concertées, Communauté Francaise de Belgique).