Indian Buffet Neural Networks for Continual Learning

12/04/2019, by Samuel Kessler et al., University of Oxford

We place an Indian Buffet Process (IBP) prior over the neural structure of a Bayesian Neural Network (BNN), thus allowing the complexity of the BNN to increase and decrease automatically. We apply this methodology to the problem of resource allocation in continual learning, where new tasks occur and the network requires extra resources. Our BNN exploits online variational inference with relaxations of the Bernoulli and Beta distributions (which constitute the IBP prior), allowing the reparameterisation trick to be used to learn variational posteriors via gradient-based methods. As we automatically learn the number of weights in the BNN, overfitting and underfitting problems are largely overcome. We show empirically that the method offers competitive results compared to Variational Continual Learning (VCL) in some settings.


1 Introduction

In continual learning a model is required to learn a set of tasks, one by one, and to remember solutions to each. After learning a task, the model loses access to the data Kirkpatrick et al. (2017); Nguyen et al. (2018); Yoon et al. (2018). More formally, in such continual learning problems we have a set of sequential prediction tasks with datasets $\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_T$, where $\mathcal{D}_t = \{(x_t^{(n)}, y_t^{(n)})\}_{n=1}^{N_t}$. When performing task $t$ the learner typically loses access to $\mathcal{D}_{1:t-1}$, yet must be able to continue to perform predictions for all the tasks seen so far Zeno et al. (2018).

The core challenges of continual learning are threefold. Firstly, models need to leverage transfer learning from previously learned tasks when learning a new task at time $t$ Thrun (1995); Kirkpatrick et al. (2017); Rusu et al. (2016); Lee et al. (2017). Secondly, the model needs to have enough new neural resources available to learn the new task Rusu et al. (2016); Yoon et al. (2018); Blundell et al. (2015); Kirkpatrick et al. (2017). Finally, the model is required to overcome catastrophic forgetting of old tasks. If the model is, for example, a feed-forward neural network, it will exhibit forgetting of previous tasks Kirkpatrick et al. (2017); Goodfellow et al. (2015).

One popular way to perform continual learning uses the natural sequential learning approach embedded within Bayesian inference, namely that the prior for task $t$ is the posterior obtained from task $t-1$. This enables knowledge transfer and offers an approach to overcoming catastrophic forgetting. Previous Bayesian approaches have involved Laplace approximations Kirkpatrick et al. (2017); Ritter et al. (2018); Lee et al. (2017) and variational inference Nguyen et al. (2018); Swaroop et al. (2018); Zeno et al. (2018) to aid computational tractability. Whilst such methods solve, in principle, the first and third objectives of continual learning, the second objective (that of ensuring adequate resources for new learning) is not necessarily achieved. For example, the amount of neural resource can alter performance on MNIST classification (see Table 1 in Blundell et al. (2015)). The problem is made more difficult because the neural resources required for a good solution to one task might not be sufficient (or may be redundant) for a different task.
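As a concrete restatement of the sequential update underlying these approaches (written here in generic notation; this is standard Bayes' rule rather than a formulation specific to this paper), the posterior after observing tasks $1, \ldots, t$ satisfies

$p(\theta \mid \mathcal{D}_{1:t}) \;\propto\; p(\mathcal{D}_t \mid \theta)\, p(\theta \mid \mathcal{D}_{1:t-1}),$

so the posterior after task $t-1$ plays the role of the prior when learning task $t$; the methods above replace the exact posterior with a tractable approximation.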

Non-Bayesian neural networks use additional neural resources to remember previous tasks and to learn new ones. Neurons which have been trained on previous tasks are frozen, and a new neural network is appended to the existing network for learning a new task Rusu et al. (2016). The problem with this approach is scalability: the number of neural resources increases linearly with the number of tasks. The work of Yoon et al. (2018) tackles this problem with selective retraining and expansion, using a suitable regulariser to ensure that the network does not expand continuously. However, these expandable networks are unable to shrink and are vulnerable to overfitting if misspecified to begin with. Moreover, knowledge transfer and the prevention of catastrophic forgetting are not solved in a principled manner, unlike approaches couched in a Bayesian framework. We summarise several solutions to the general problem in Section B of the appendix.

As the level of resource required is unknown in advance, we propose a Bayesian neural network which adds or withdraws neural resources automatically, in response to the data. We achieve this by using a sparse binary latent matrix $Z$, distributed according to a structured Indian Buffet Process (IBP) prior. The IBP prior on the infinite binary matrix $Z$ allows inference over which, and how many, neurons are required for a predictive task. The weights of the BNN are treated as non-interacting draws from Gaussians Blundell et al. (2015). Catastrophic forgetting is overcome by repeated application of the Bayesian update rule, embedded within variational inference Doshi-Velez et al. (2009); Nguyen et al. (2018). In the next section we detail the model, and in Section 3 we provide representative results.

2 Expandable Bayesian Neural Network with an Indian Buffet Process prior

We start by considering the matrix factorisation problem $X = Z A + \epsilon$, where $X \in \mathbb{R}^{N \times D}$, $Z \in \{0, 1\}^{N \times K}$ and $A \in \mathbb{R}^{K \times D}$. Each column of the binary matrix $Z$ corresponds to the presence of a latent feature from $A$: with $z_{nk} = 1$, latent feature $k$ is present in observation $n$. In the scenario where we do not know the number of features beforehand and we desire a prior that allows the number of non-zero columns of $Z$ to be inferred, the IBP provides a suitable prior on $Z$ Doshi-Velez et al. (2009).

In our proposed model we use $Z$ distributed according to a nonparametric IBP prior, which induces a posterior used to select neurons and their number. We consider a neural network with $K$ neurons in each of its layers. Thence, for an arbitrary activation function $f$, the binary matrix is applied as follows: $h_n = z_n \circ f(W x_n)$, where $x_n \in \mathbb{R}^{D}$ is an input, $W \in \mathbb{R}^{K \times D}$ is the weight matrix, $z_n \in \{0, 1\}^{K}$ is the $n$-th row of $Z$, and $\circ$ is the elementwise product. We have ignored biases for simplicity. The IBP has some nice properties for this application, including the number of elements sampled growing with $N$ and promoting a 'rich get richer' scheme Griffiths and Ghahramani (2011). Hence the number of neurons which are selected grows with the number of points in the dataset, and the same neurons tend to be selected across points, enabling learning. This neuron selection scheme is in contrast to dropout, which randomly selects weights.

Figure 1: Left, comparison of weight pruning for the IBP BNN on MNIST with a BNN with no IBP prior. We prune the weights of each network according to the absolute value of the weights and the signal-to-noise ratio, and apply the 'binary' mask to activation outputs from the IBP prior BNN. These curves are an average of 5 separate optimisations ± one standard error. Middle, the matrix $Z$ for a batch in the test set, to demonstrate which neurons are active. Notice that $Z$ is not perfectly binary as it has been relaxed with a Concrete distribution. Right, a histogram showing the number of active neurons for each point in the test set.

We use a stick-breaking IBP prior Teh et al. (2007a); Doshi-Velez et al. (2009), in which a probability $\pi_k$ is assigned to column $k$ of $Z$. Whether neuron $k$ is active for data point $n$ is determined by $z_{nk} \sim \mathrm{Bernoulli}(\pi_k)$. Here $\pi_k$ is generated via the so-called stick-breaking process: $\nu_j \sim \mathrm{Beta}(\alpha, 1)$ and $\pi_k = \prod_{j=1}^{k} \nu_j$. As a result, $\pi_k$ decreases exponentially with $k$. The Beta parameter $\alpha$ controls how quickly the probabilities decay. By learning the Beta parameters we can influence how many neurons are required for a particular layer and for a particular task.
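As a small illustrative sketch (not the authors' code; the concentration $\alpha$, truncation $K$, batch size and input dimension are assumed values), the stick-breaking construction and the resulting neuron selection from the previous paragraphs can be simulated as:

import numpy as np

rng = np.random.default_rng(0)
alpha, K, N, D = 5.0, 100, 32, 784            # concentration, truncation, batch size, input dim (assumed)

nu = rng.beta(alpha, 1.0, size=K)             # nu_k ~ Beta(alpha, 1)
pi = np.cumprod(nu)                           # pi_k = prod_{j<=k} nu_j, decays with k
Z = (rng.random((N, K)) < pi).astype(float)   # z_nk ~ Bernoulli(pi_k)

X = rng.normal(size=(N, D))
W = rng.normal(size=(K, D)) * 0.1
H = Z * np.maximum(X @ W.T, 0.0)              # h_n = z_n ∘ f(W x_n), here with a ReLU activation
print(Z.sum(axis=1).mean())                   # average number of active neurons per data point

Early columns of Z are used most often, so larger $\alpha$ activates more neurons on average.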

Our expandable BNN has diagonal Gaussian weights, $w_{ij} \sim \mathcal{N}(\mu_{ij}, \sigma_{ij}^2)$ for all layers, as in Blundell et al. (2015); Nguyen et al. (2018), and the binary matrices follow the IBP prior. For continual learning, the posterior over the BNN weights and IBP parameters forms the prior for the new task. In practice, we use a variational approximation, where the variational posterior from task $t-1$ is taken as the prior for task $t$. This encourages knowledge transfer and prevents catastrophic forgetting. The parameter $\alpha$ of the IBP prior controls the number of neurons available for task $t$: as it increases (or decreases) it encourages the use of more (or fewer) neurons, and hence adds (or removes) computational resources for learning the new task.

The posterior of our model given the data is approximated using structured stochastic variational inference Hoffman and Blei (2015). The variational Beta parameters act globally over $Z$, so the variational approximation we propose retains some of the structure of the desired posterior. We make use of the reparameterisation trick Kingma and Welling (2014), together with the Concrete reparameterisation of the Bernoulli distribution Maddison et al. (2017); Jang et al. (2017) and the Kumaraswamy reparameterisation of the Beta distribution Nalisnick and Smyth (2017); Singh et al. (2017), to allow stochastic gradients to pass through to the Beta parameters in the hierarchical IBP posterior. The model is discussed in more detail in Section C.

3 Results

(a) MNIST
(b) MNIST + noise
(c) MNIST + background image
Figure 2: Continual learning average accuracies on task-specific test sets for different datasets and different models. The BNN with an IBP prior is compared to baseline BNNs with no IBP prior and fixed hidden state sizes of 5, 10 and 50. The accuracies reported are an average of 5 different optimisations. Breakdowns of task accuracies versus the number of tasks the model has seen are available in Appendix E.

We investigate whether the neural sparsity imposed by the IBP prior is sensible. This is done by weight pruning on the MNIST multi-class classification problem. We compare our approach with a variational BNN which has the same neural network architecture but no IBP prior commanding the structure and number of neurons in a layer. The IBP BNN has an accuracy of 0.95 while the plain BNN achieves an accuracy of 0.96; however, the IBP prior BNN is more robust to pruning, and the pruned weights coincide with those suppressed by the IBP prior. The pruning accuracy is shown in Figure 1. There is a small improvement in the accuracies when pruning with the signal-to-noise ratio (SNR), defined as $|\mu|/\sigma$ for each weight. This is expected as MNIST is a relatively simple problem with good accuracy even on small networks. Note that the sparsity induced by the IBP prior renders the effects of variational overpruning Trippe and Turner (2017) redundant. Overall, the above results show a sensible sparsity induced by the variational IBP.
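As a brief, hypothetical illustration of SNR-based pruning (shapes and the pruning fraction below are assumptions for the example, not the values used in the experiments):

import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=(100, 784)) * 0.1                # variational means (assumed shapes)
sigma = np.exp(rng.normal(size=(100, 784)) - 3.0)     # variational standard deviations

snr = np.abs(mu) / sigma                              # signal-to-noise ratio per weight
prune_frac = 0.75
threshold = np.quantile(snr, prune_frac)
mask = (snr >= threshold).astype(float)               # keep only the highest-SNR weights
W_pruned = mu * mask                                  # prune by zeroing low-SNR weights
print(f"kept {mask.mean():.2%} of weights")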

Our main experiments deal with continual learning on various split MNIST datasets. For a fair comparison, we use multi-head network architectures for all experiments. The baselines are VCL networks Nguyen et al. (2018) with a single hidden layer of size 5, 10 or 50. The sizes are chosen to expose potential underfitting or overfitting issues. Our model also uses a single layer, with a variational truncation $K$. The IBP prior BNN outperforms all VCL baseline networks for the split MNIST tasks which have background noise or background images, as shown in Figure 2, while having fewer than 15 active neurons (see Figures 4 and 5). The baseline models overfit on the second task and subsequently propagate a poor approximate posterior. On split MNIST the smallest baseline underfits and the larger baselines perform well versus our model, as this is a simple task and overfitting is difficult to expose; see Figure 1(a). Continual learning experiments on the not MNIST dataset show that the smaller baselines underfit, while the largest baseline performs better than our model on some tasks; see Figure 6. For all the datasets considered in the continual learning experiments, the BNN with an IBP prior is able to expand as the model is required to solve more tasks; see Figure 2(b) to Figure 5(b). Further analysis of the results is presented in the supplementary material in Section E, and additional experimental details are presented in Section D.

4 Summary

We introduce a structured IBP prior over a Bayesian Neural Network (BNN), with application to continual learning. The IBP prior effectively induces sparsity in the network, allowing it to add neural resources for new tasks yet still overcome the overfitting problems which plague VCL networks. Our goal is continual learning, not to induce sparsity for a parsimonious model or for the sake of compression; however, it would be interesting to compare to BNNs designed for these goals through the introduction of sparsity-inducing priors Louizos et al. (2017); Molchanov et al. (2017). Natural extensions of this work include applying the IBP prior directly to the BNN weights, as well as more extensive testing with a broader range of datasets and larger numbers of tasks.

References

  • [1] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight Uncertainty in Neural Networks. In International Conference on Machine Learning.
  • [2] T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan (2013) Streaming Variational Bayes. In Neural Information Processing Systems.
  • [3] S. P. Chatzis (2018) Indian Buffet Process deep generative models for semi-supervised classification. arXiv:1402.3427.
  • [4] A. Chaudhry, P. K. Dokania, T. Ajanthan, and P. H. S. Torr (2018) Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence. In European Conference on Computer Vision (ECCV).
  • [5] F. Doshi-Velez and Z. Ghahramani (2009) Accelerated sampling for the Indian Buffet Process. In International Conference on Machine Learning.
  • [6] F. Doshi-Velez, K. T. Miller, J. Van Gael, and Y. W. Teh (2009) Variational Inference for the Indian Buffet Process. In International Conference on Artificial Intelligence and Statistics.
  • [7] S. Farquhar and Y. Gal (2018) A Unifying Bayesian View of Continual Learning. In Bayesian Deep Learning workshop, Neural Information Processing Systems.
  • [8] Z. Ghahramani and T. L. Griffiths (2006) Infinite latent feature models and the Indian Buffet Process. In Advances in Neural Information Processing Systems, pp. 475–482.
  • [9] I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville, and Y. Bengio (2015) An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks. arXiv:1312.6211.
  • [10] T. L. Griffiths and Z. Ghahramani (2011) The Indian Buffet Process: An Introduction and Review. Journal of Machine Learning Research 12, pp. 1185–1224.
  • [11] M. D. Hoffman and D. M. Blei (2015) Structured Stochastic Variational Inference. In International Conference on Artificial Intelligence and Statistics.
  • [12] L. F. James (2017) Bayesian Poisson calculus for latent feature modeling via generalized Indian Buffet Process priors. The Annals of Statistics 45 (5), pp. 2016–2045.
  • [13] E. Jang, S. Gu, and B. Poole (2017) Categorical Reparameterization with Gumbel-Softmax. In International Conference on Learning Representations.
  • [14] D. P. Kingma and J. L. Ba (2015) Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations.
  • [15] D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. In International Conference on Learning Representations.
  • [16] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 114 (13), pp. 3521–3526.
  • [17] S. Lee, J. Kim, J. Jun, J. Ha, and B. Zhang (2017) Overcoming Catastrophic Forgetting by Incremental Moment Matching. In Neural Information Processing Systems.
  • [18] D. Lopez-Paz and M. Ranzato (2017) Gradient Episodic Memory for Continual Learning. In Neural Information Processing Systems.
  • [19] C. Louizos, K. Ullrich, and M. Welling (2017) Bayesian Compression for Deep Learning. In Neural Information Processing Systems.
  • [20] C. J. Maddison, A. Mnih, and Y. W. Teh (2017) The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. In International Conference on Learning Representations.
  • [21] D. Molchanov, A. Ashukha, and D. Vetrov (2017) Variational Dropout Sparsifies Deep Neural Networks. In International Conference on Machine Learning.
  • [22] E. Nalisnick and P. Smyth (2017) Stick-Breaking Variational Autoencoders. In International Conference on Learning Representations.
  • [23] C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational Continual Learning. In International Conference on Learning Representations.
  • [24] R. Ranganath, S. Gerrish, and D. M. Blei (2014) Black Box Variational Inference. In Artificial Intelligence and Statistics.
  • [25] H. Ritter, A. Botev, and D. Barber (2018) Online Structured Laplace Approximations for Overcoming Catastrophic Forgetting. In Neural Information Processing Systems.
  • [26] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive Neural Networks. arXiv:1606.04671.
  • [27] J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress and Compress: A scalable framework for continual learning. In International Conference on Machine Learning.
  • [28] H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual Learning with Deep Generative Replay. In Neural Information Processing Systems.
  • [29] R. Singh, J. Ling, and F. Doshi-Velez (2017) Structured Variational Autoencoders for the Beta-Bernoulli Process. In Workshop on Advances in Approximate Bayesian Inference, Neural Information Processing Systems.
  • [30] S. Swaroop, C. V. Nguyen, T. D. Bui, and R. E. Turner (2018) Improving and Understanding Variational Continual Learning. In Continual Learning workshop, Neural Information Processing Systems.
  • [31] Y. W. Teh, D. Görür, and Z. Ghahramani (2007a) Stick-breaking Construction for the Indian Buffet Process. In International Conference on Artificial Intelligence and Statistics.
  • [32] Y. W. Teh, D. Görür, and Z. Ghahramani (2007b) Stick-breaking construction for the Indian Buffet Process. In Artificial Intelligence and Statistics, pp. 556–563.
  • [33] S. Thrun (1995) Lifelong Learning: A Case Study. Technical report.
  • [34] B. L. Trippe and R. E. Turner (2017) Overpruning in Variational Bayesian Neural Networks. In Advances in Approximate Bayesian Inference workshop, Neural Information Processing Systems.
  • [35] F. Wood and T. L. Griffiths (2007) Particle filtering for nonparametric Bayesian matrix factorization. In Neural Information Processing Systems.
  • [36] J. Yoon, E. Yang, J. Lee, and S. J. Hwang (2018) Lifelong Learning with Dynamically Expandable Networks. In International Conference on Learning Representations.
  • [37] F. Zenke, B. Poole, and S. Ganguli (2017) Continual Learning Through Synaptic Intelligence. In International Conference on Machine Learning.
  • [38] C. Zeno, I. Golan, E. Hoffer, and D. Soudry (2018) Task Agnostic Continual Learning Using Online Variational Bayes. In Bayesian Deep Learning workshop, Neural Information Processing Systems.

Appendix A Preliminaries

In this section, we describe the Indian Buffet Process prior used in our model and the Variational Continual Learning (VCL) framework which we use for continual learning.

A.1 Indian Buffet Process Prior

The Indian Buffet Process (IBP) [8, 10, 12] is a stochastic process defining a probability distribution over sparse binary matrices with a finite number of rows and an infinite number of columns. This distribution is suitable to use as a prior for models with a potentially infinite number of features. The form of the prior ensures that only a finite number of features will be present in any finite set of observations, but allows for extra features to appear as more data points are observed. The IBP probability density is defined as follows:

$P(Z) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!} \exp(-\alpha H_N) \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}$,    (1)

where $K_+$ is the number of non-zero columns in $Z$, $m_k$ is the number of ones in column $k$ of $Z$, $H_N = \sum_{n=1}^{N} 1/n$ is the $N$-th harmonic number, and $K_h$ is the number of occurrences of the non-zero binary vector $h$ among the columns of $Z$. The parameter $\alpha$ controls the expected number of features present in each observation.

The name of the Indian Buffet Process originates from a metaphor in which the rows of $Z$ correspond to customers and the columns correspond to dishes in an infinitely long buffet. The first customer samples the first $\mathrm{Poisson}(\alpha)$ dishes. The $n$-th customer then samples dish $k$ with probability $m_k / n$, where $m_k$ is the number of customers who have already sampled dish $k$. The $n$-th customer also samples $\mathrm{Poisson}(\alpha / n)$ new dishes. Therefore, $z_{nk}$ is one if customer $n$ tried the $k$-th dish and zero otherwise.
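The metaphor translates directly into a generative simulation. The following sketch (illustrative only, not the authors' code; the number of customers and $\alpha$ are arbitrary) draws a matrix $Z$ by following the buffet description above:

import numpy as np

def sample_ibp(n_customers, alpha, rng):
    """Draw a binary matrix Z from the IBP via the buffet metaphor."""
    dish_counts = []                                       # m_k, the popularity of each dish
    rows = []
    for n in range(1, n_customers + 1):
        row = [rng.random() < m / n for m in dish_counts]  # revisit dish k with probability m_k / n
        n_new = rng.poisson(alpha / n)                     # sample Poisson(alpha / n) new dishes
        dish_counts = [m + int(z) for m, z in zip(dish_counts, row)] + [1] * n_new
        rows.append(row + [True] * n_new)
    Z = np.zeros((n_customers, len(dish_counts)))
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

rng = np.random.default_rng(0)
Z = sample_ibp(20, alpha=3.0, rng=rng)
print(Z.shape, Z.sum(axis=0))   # the number of active columns grows roughly as alpha * H_N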

A.2 Variational Continual Learning

The continual learning process can be decomposed into a sequence of Bayesian updates, where the approximate posterior for task $t-1$ is used as the prior for the new task $t$. Variational Continual Learning (VCL) [23] uses a BNN for prediction, where the weights are independent Gaussians, and takes the variational posterior from the previous task as the prior for the next. Consider learning the first task and let $\theta$ denote the vector of network parameters; the variational posterior is $q_1(\theta) \approx p(\theta \mid \mathcal{D}_1)$. For the next task we lose access to $\mathcal{D}_1$ and the prior is $q_1(\theta)$. Optimisation of the ELBO yields $q_2(\theta) \approx p(\theta \mid \mathcal{D}_{1:2})$. Generalising, the negative ELBO for the $t$-th task is

$\mathcal{L}^{VCL}_t = \mathrm{KL}\big(q_t(\theta) \,\|\, q_{t-1}(\theta)\big) - \mathbb{E}_{q_t(\theta)}\big[\log p(\mathcal{D}_t \mid \theta)\big]$.    (2)

The first term acts to regularise the posterior over $\theta$, ensuring continuity with $q_{t-1}$, and the second term is the expected log-likelihood of the data.
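For diagonal Gaussian posteriors the KL term in Equation (2) has a closed form; a minimal sketch (with hypothetical variable names, not the authors' implementation) is:

import numpy as np

def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between diagonal Gaussians, summed over all weights."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2) - 0.5)

# Illustrative use: the negative ELBO for task t is the KL to the previous task's
# posterior minus the expected log-likelihood (estimated by Monte Carlo sampling).
# neg_elbo = kl_diag_gaussians(mu_t, sigma_t, mu_prev, sigma_prev) - expected_log_lik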

Appendix B Related work

In this section we discuss the literature on Continual Learning and that associated with the use of the IBP prior in Deep Learning.

B.1 Continual Learning

Continual learning can be viewed as a sequential learning problem, and one approach to learning in this setting is online Bayesian inference [2]. Elastic Weight Consolidation (EWC) is a seminal piece of work in continual learning which performs online Bayesian inference with a diagonal Laplace approximation to make Bayesian inference tractable [16]. This reduces to an $\ell_2$ regularisation ensuring that the new weights for task $t$ are close to all previous task weights in terms of Euclidean distance. Synaptic Intelligence (SI) [37] creates an importance measure determined by the loss over the optimisation trajectory and by the Euclidean distance moved from the previous task's local minimum. SI uses this importance measure to weight an $\ell_2$ regularisation ensuring that the optimal weights for task $t$ are similar to those for task $t-1$. In another regularisation-based approach, one learns the conditional distribution for task $t$ regularised so that it is close to that of task $t-1$ in terms of KL-divergence; this can also be approximated as an $\ell_2$ regularisation similarly to EWC and SI [4]. The work of [17] also uses a diagonal approximation to the Fisher information used for the Laplace approximation, together with techniques from the transfer learning literature. Instead of approximating the Fisher information as diagonal and ignoring correlations in parameter space, [25] uses a block-diagonal Kronecker-factored approximation which accounts for covariances between weights of the same layer and assumes independence between the weights of different layers. Recent work has also considered variational approximations to sequential Bayesian inference in continual learning, proposed for both discriminative and generative models [23, 30].

Another approach to continual learning is to expand the neural network model, retaining predictive performance on previous tasks while allowing new neural capacity for learning new tasks. One approach is Progressive Networks [26], which freezes weights learnt from previous tasks and makes connections from the frozen networks to a new network that is trained on the current task. This allows the Progressive Network to leverage previous knowledge to remember old tasks while also providing new neural capacity for learning a new task. The drawback is that the number of networks grows linearly with the number of tasks. A more efficient expansion approach is to selectively retrain neurons and, if required, expand the network with a group-sparsity regulariser to ensure sparsity at the neuron level [36].

Several other solutions to continual learning have been proposed. These involve replaying data from previous tasks with a generative model trained to reconstruct the data of earlier tasks [28], storing summaries of data with coresets [23], or storing random samples from each task [18] and ensuring that the loss incurred on this memory dataset does not increase when learning task $t$. Combining methods also yields good results; examples include combining VCL with generative replay [7] and combining Progressive Networks with EWC to ensure that the number of parameters in the network does not increase with the number of tasks [27].

B.2 The Indian Buffet Process prior in Deep Learning

The IBP prior has been used for sparse matrix factorisation. Inference for the IBP has been performed in several ways, including Gibbs sampling [5, 10], particle filtering [35], slice sampling [32], and variational inference [6]. For deep generative models, the IBP has been used to model the latent state of VAEs, with inference performed by mean-field variational inference using black-box variational inference [24] in [3]. The stick-breaking VAE of [22] introduces a suitable reparameterisation to handle gradient-based learning with the reparameterisation trick [15]. Because the mean-field approximation removes much of the structure of a hierarchical model like the IBP, [29] uses structured stochastic variational inference [11] to allow dependencies between global and local parameters, achieving better results in VAEs than the mean-field approximation.

Appendix C Inference

In this section we develop a variational approach for performing inference on the posterior distribution of the BNN weights and the IBP parameters. We use a structured variational model in which dependencies are established between the global Beta parameters and the local parameters which comprise the BNN hidden layers [11], similarly to [29]. Once we have obtained our variational posterior, placing our inference procedure within the VCL framework set out in Section A.2 is straightforward. The following set of equations governs the hierarchical IBP prior BNN model for an arbitrary layer of the BNN:

$\nu_k \sim \mathrm{Beta}(\alpha, 1)$    (3)
$\pi_k = \prod_{j=1}^{k} \nu_j$    (4)
$z_{nk} \sim \mathrm{Bernoulli}(\pi_k)$    (5)
$w_{kd} \sim \mathcal{N}(\mu_{kd}, \sigma_{kd}^2)$    (6)
$h_n = z_n \circ f(W x_n)$    (7)

Here $k$ denotes a neuron in the layer, $w_{k\cdot}$ denotes a row of the weight matrix $W$, and $k$ also identifies a column of our binary matrix $Z$. $\circ$ is the elementwise multiplication operation. The binary matrix $Z$ controls the inclusion of a particular neuron: $z_{nk} = 1$ switches neuron $k$ on for data point $n$, and $z_{nk} = 0$ switches it off.

The closed-form solution to the true posterior of our IBP parameters and BNN weights involves integrating over the joint distribution of the data and our hidden variables, $p(W, Z, \nu \mid \mathcal{D}) = p(\mathcal{D}, W, Z, \nu) / \int p(\mathcal{D}, W, Z, \nu)\, \mathrm{d}W\, \mathrm{d}Z\, \mathrm{d}\nu$. Since it is not possible to obtain a closed-form solution to this integration, we make use of variational inference and the reparameterisation trick [15]. We use a structured variational approximation [11]; this approach has been shown to perform better than the mean-field approximation in VAEs [29]. The variational approximation used is

$q_\phi(W, Z, \nu) = q_\phi(\nu)\, q_\phi(Z \mid \nu)\, q_\phi(W)$,    (8)

where the variational posterior is truncated up to $K$ components, while the prior is still infinite [22]. $\phi$ denotes the set of variational parameters which we optimise over. Each term in Equation (8) is specified as follows:

$q_\phi(\nu) = \prod_{k=1}^{K} \mathrm{Beta}(\nu_k; \alpha_k, \beta_k)$    (9)
$\pi_k = \prod_{j=1}^{k} \nu_j$    (10)
$q_\phi(Z \mid \nu) = \prod_{n=1}^{N} \prod_{k=1}^{K} \mathrm{Bernoulli}(z_{nk}; \pi_k)$    (11)
$q_\phi(W) = \prod_{i, j} \mathcal{N}(w_{ij}; \mu_{ij}, \sigma_{ij}^2)$    (12)

Now that we have defined our structured variational approximation in Equation (8), we can write down the objective for task $t$. Following the VCL framework, the variational posterior for task $t$ minimises the KL divergence to the (unnormalised) Bayesian update of the previous posterior,

$q_t = \arg\min_{q} \mathrm{KL}\Big(q(W, Z, \nu) \,\Big\|\, \tfrac{1}{\mathcal{Z}_t}\, q_{t-1}(W, Z, \nu)\, p(\mathcal{D}_t \mid W, Z)\Big)$,    (13)

which is equivalent to minimising the negative ELBO

$\mathcal{L}_t(\phi) = \mathrm{KL}\big(q_t(W, Z, \nu) \,\|\, q_{t-1}(W, Z, \nu)\big) - \mathbb{E}_{q_t}\big[\log p(\mathcal{D}_t \mid W, Z)\big]$.    (14)

In the above formula, $q_t$ is the approximate posterior for task $t$ and $q_{t-1}$ is the approximate posterior for task $t-1$, which acts as the prior for task $t$. By substituting Equation (8), we obtain the negative ELBO objective for each task as:

$\mathcal{L}_t(\phi) = \mathrm{KL}\big(q_t(\nu) \,\|\, q_{t-1}(\nu)\big) + \mathbb{E}_{q_t(\nu)}\big[\mathrm{KL}\big(q_t(Z \mid \nu) \,\|\, q_{t-1}(Z \mid \nu)\big)\big] + \mathrm{KL}\big(q_t(W) \,\|\, q_{t-1}(W)\big) - \mathbb{E}_{q_t}\big[\log p(\mathcal{D}_t \mid W, Z)\big]$.    (15)

Estimating gradients with respect to the Bernoulli and Beta variational parameters requires a suitable reparameterisation. Samples from the Bernoulli distribution in Equation (11) arise after taking an argmax over the Bernoulli parameters. The argmax is discontinuous and a gradient cannot be calculated. We therefore reparameterise the Bernoulli as a Concrete distribution [20, 13]. Additionally, we reparameterise the Beta as a Kumaraswamy distribution for the same reason [22]: to separate sampling nodes and parameter nodes in the computation graph (see Figure 2 in [13] for clarification) and to allow the use of stochastic gradient methods to learn the variational parameters in the approximate IBP posterior. Variational inference on the Gaussian weights of the BNN in Equation (12) is performed with a mean-field approximation, identical to [1, 23]. In the next sections we detail the reparameterisations of the Bernoulli and Beta distributions and show how to calculate the KL-divergence terms in Equation (15).
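To make the sampling path concrete, the following sketch draws one stochastic forward sample through a layer using the Kumaraswamy, Concrete and Gaussian reparameterisations described above (all parameter values, shapes, the temperature and the ReLU activation are assumptions for illustration, not the authors' implementation):

import numpy as np

rng = np.random.default_rng(0)
N, D, K, temp = 32, 784, 100, 0.5             # batch, input dim, truncation, Concrete temperature (assumed)

# Variational parameters (these would normally be learnt by gradient descent).
a, b = np.full(K, 4.0), np.ones(K)            # Kumaraswamy parameters for nu
mu, log_sigma = rng.normal(size=(K, D)) * 0.1, np.full((K, D), -3.0)

u = rng.random(K)
nu = (1.0 - u**(1.0 / b))**(1.0 / a)          # Kumaraswamy sample, Eq. (17)
pi = np.cumprod(nu)                           # stick-breaking probabilities, Eq. (10)

u2 = rng.random((N, K))
logits = np.log(pi) - np.log1p(-pi)           # logit(pi_k)
z = 1.0 / (1.0 + np.exp(-(logits + np.log(u2) - np.log1p(-u2)) / temp))   # binary Concrete sample

eps = rng.normal(size=(K, D))
W = mu + np.exp(log_sigma) * eps              # Gaussian reparameterisation of the weights

X = rng.normal(size=(N, D))
H = z * np.maximum(X @ W.T, 0.0)              # relaxed mask applied to the layer activations
print(H.shape)

Because every sampled quantity is a deterministic, differentiable function of the variational parameters plus independent noise, gradients of a downstream loss can flow back to a, b, mu and log_sigma.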

C.1 The variational Gaussian weight distribution reparameterisation

The variational posterior over the weights of the BNN is a diagonal Gaussian, $q_\phi(W) = \prod_{i,j} \mathcal{N}(w_{ij}; \mu_{ij}, \sigma_{ij}^2)$. By using a reparameterisation, one can represent the BNN weights via a deterministic function $w = g_\phi(\epsilon)$, where $\epsilon$ is an auxiliary noise variable and $g_\phi$ a deterministic function parameterised by $\phi = \{\mu, \sigma\}$. The BNN weights can be sampled directly through the reparameterisation $w = \mu + \sigma \epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$. With this simple reparameterisation the weight samples are deterministic functions of the variational parameters $\mu$ and $\sigma$, and the noise comes from the independent auxiliary variable $\epsilon$ [15]. When taking a gradient of the ELBO objective in Equation (15), the expectation of the log-likelihood may be rewritten as an integral over $\epsilon$, so that the gradient with respect to $\mu$ and $\sigma$ can move inside the expectation, allowing gradients to be calculated using the chain rule [1].

C.2 The variational Beta distribution reparameterisation

The Beta distribution can be reparameterised using the Kumaraswamy distribution [22], which has parameters $a$ and $b$ and density

$q(\nu; a, b) = a b\, \nu^{a-1} (1 - \nu^a)^{b-1}$.    (16)

When $a = 1$ or $b = 1$ the Kumaraswamy and Beta distributions are identical. This reparameterisation has been used successfully to learn a discrete latent representation in a VAE, where the parameters $a$ and $b$ are learnt using stochastic gradient descent [22, 29]. The Kumaraswamy distribution can be reparameterised as

$\nu = \big(1 - u^{1/b}\big)^{1/a}$,    (17)

where $u \sim \mathrm{Uniform}(0, 1)$.

The KL divergence between our variational Kumaraswamy posterior and Beta prior has a closed form:

$\mathrm{KL}\big(q(\nu; a, b) \,\|\, \mathrm{Beta}(\nu; \alpha, \beta)\big) = \frac{a - \alpha}{a}\Big(-\gamma - \Psi(b) - \frac{1}{b}\Big) + \log(a b) + \log B(\alpha, \beta)$    (18)
$\qquad\qquad - \frac{b - 1}{b} + (\beta - 1)\, b \sum_{m=1}^{\infty} \frac{1}{m + a b}\, B\Big(\frac{m}{a}, b\Big)$,    (19)

where $\gamma$ is the Euler–Mascheroni constant, $\Psi$ is the digamma function, $B$ is the Beta function, and the infinite sum can be approximated by a finite sum.
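A sketch of how this closed form could be evaluated numerically with a truncated sum (SciPy-based; the number of retained terms is an assumption, not a value taken from the paper):

import numpy as np
from scipy.special import betaln, digamma

def kl_kumaraswamy_beta(a, b, alpha, beta, n_terms=10):
    """Finite-sum approximation to KL(Kumaraswamy(a, b) || Beta(alpha, beta)), Eqs. (18)-(19)."""
    euler_gamma = 0.5772156649015329
    kl = (a - alpha) / a * (-euler_gamma - digamma(b) - 1.0 / b)
    kl += np.log(a * b) + betaln(alpha, beta) - (b - 1.0) / b
    m = np.arange(1, n_terms + 1)
    kl += (beta - 1.0) * b * np.sum(np.exp(betaln(m / a, b)) / (m + a * b))
    return kl

print(kl_kumaraswamy_beta(1.0, 1.0, 1.0, 1.0))   # ~0: both distributions reduce to Uniform(0, 1)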

C.3 The variational Bernoulli distribution reparameterisation

The Bernoulli distribution can be reparameterised using a continuous approximation to the discrete distribution. If we have a discrete distribution with class parameters $\alpha_1, \ldots, \alpha_K$, $\alpha_k \in (0, \infty)$, then a sample can be drawn via the Gumbel-max trick as $z = \mathrm{one\_hot}\big(\arg\max_k (\log \alpha_k + g_k)\big)$ with $g_k \sim \mathrm{Gumbel}(0, 1)$. Sampling from this distribution therefore requires performing an argmax operation, and the crux of the problem is that the argmax operation does not have a well-defined derivative.

To address the derivative issue above, we use the Concrete distribution [20], or Gumbel-Softmax distribution [13], as an approximation to the Bernoulli distribution. The idea is that instead of returning a state on a vertex of the probability simplex as argmax does, these relaxations return states inside the probability simplex (see Figure 2 in [20]). We follow the Concrete formulation and notation from [20] and sample from the probability simplex as

$z_k = \frac{\exp\big((\log \alpha_k + g_k) / \lambda\big)}{\sum_{i=1}^{K} \exp\big((\log \alpha_i + g_i) / \lambda\big)}$,    (20)

with temperature hyperparameter $\lambda \in (0, \infty)$, parameters $\alpha_k \in (0, \infty)$ and i.i.d. Gumbel noise $g_k \sim \mathrm{Gumbel}(0, 1)$. This equation resembles a softmax with a Gumbel perturbation. As $\lambda \to 0$ the softmax computation approaches the argmax computation. This can be used as a relaxation of the variational Bernoulli distribution, and it reparameterises Bernoulli random variables so as to allow gradient-based learning of the variational Beta parameters downstream in our model.
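A toy two-class example (values assumed for illustration) showing that the relaxed sample in Equation (20) approaches a one-hot vector as the temperature is lowered:

import numpy as np

rng = np.random.default_rng(0)
log_alpha = np.log(np.array([0.7, 0.3]))       # two-class parameters (assumed)
g = -np.log(-np.log(rng.random(2)))            # i.i.d. Gumbel(0, 1) noise

def concrete_sample(log_alpha, g, temp):
    y = (log_alpha + g) / temp
    return np.exp(y - np.logaddexp.reduce(y))  # softmax of the perturbed logits, Eq. (20)

for temp in (5.0, 1.0, 0.1):
    print(temp, concrete_sample(log_alpha, g, temp))   # approaches a one-hot vector as temp -> 0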

When performing variational inference using the Concrete reparameterisation for the posterior, a Concrete reparameterisation of the Bernoulli prior is also required to properly lower bound the ELBO in Equation (15). Let $q_{\phi, \lambda_1}(Z)$ be the Bernoulli variational posterior over the sparse binary masks for all weights and data points, and $p_{\lambda_2}(Z)$ be the Bernoulli prior. To guarantee a lower bound on the ELBO, both Bernoulli distributions are replaced with Concrete densities, i.e. the KL term is computed as

$\mathrm{KL}\big(q_{\phi, \lambda_1}(Z) \,\|\, p_{\lambda_2}(Z)\big) = \mathbb{E}_{q_{\phi, \lambda_1}(Z)}\big[\log q_{\phi, \lambda_1}(Z) - \log p_{\lambda_2}(Z)\big]$,    (21)

where $q_{\phi, \lambda_1}$ is the Concrete density for the variational posterior with parameters $\phi$ and temperature $\lambda_1$, given the global parameters $\nu$, and $p_{\lambda_2}$ is the Concrete prior with temperature $\lambda_2$. Equation (21) is evaluated numerically by sampling from the variational posterior (we take a single Monte Carlo sample [15]). At test time we can sample from a Bernoulli using the learnt variational parameters of the Concrete distribution [20].

In practice, we work in log-space to alleviate underflow problems when handling Concrete probabilities. One can instead work with $\log z$, as the KL divergence is invariant under this invertible transformation, and Equation (21) remains valid for optimising the Concrete parameters [20]. For binary Concrete variables we can sample $z = \sigma\big((\log \alpha + \log u - \log(1 - u)) / \lambda\big)$, where $u \sim \mathrm{Uniform}(0, 1)$ and $\sigma$ is the sigmoid, and the log-density of the pre-sigmoid variable $y$ is $\log p(y) = \log \lambda - \lambda y + \log \alpha - 2 \log\big(1 + \exp(\log \alpha - \lambda y)\big)$ [20].
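A sketch of the log-space sampling and log-density computation for binary Concrete variables (illustrative only; it follows the formula above, with arbitrary example parameters):

import numpy as np

def binconcrete_sample_and_logdensity(log_alpha, temp, rng):
    """Sample a relaxed binary mask and return the pre-sigmoid sample's log-density."""
    u = rng.random(log_alpha.shape)
    logistic = np.log(u) - np.log1p(-u)                 # Logistic(0, 1) noise
    y = (log_alpha + logistic) / temp                   # pre-sigmoid sample
    z = 1.0 / (1.0 + np.exp(-y))                        # relaxed Bernoulli sample in (0, 1)
    log_density = (np.log(temp) - temp * y + log_alpha
                   - 2.0 * np.logaddexp(0.0, log_alpha - temp * y))
    return z, log_density

rng = np.random.default_rng(0)
log_alpha = np.log(np.array([0.3 / 0.7, 2.0]))          # log-odds of two mask entries (assumed)
z, logp = binconcrete_sample_and_logdensity(log_alpha, temp=0.5, rng=rng)
print(z, logp)   # logp from posterior and prior densities gives a single-sample estimate of Eq. (21)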

Appendix D Experimental details

For all experiments, the BNN architecture used for incorporating the IBP prior has a single hidden layer with ReLU activation functions; the variational truncation parameter $K$ for the IBP variational posterior sets the maximum number of nodes in the network. At the start of the optimisation the parameters of the Beta distribution are initialised to the same fixed values for all $k$. The temperature parameters of the Concrete distributions for the variational posterior and the prior are set to fixed values (the prior distribution is also chosen to be Concrete in order to obtain a proper lower bound on the ELBO, for the reasons discussed in Section C.3). Our implementation of the IBP is adapted from the code by [29].

Our implementation of the BNN and the continual learning framework is based on code from [23]. The BNNs use a multi-head architecture. The Gaussian weights of the BNN have their means initialised with the maximum-likelihood estimators and their variances initialised to a small constant. We use the Adam optimiser [14] with a fixed learning rate.

For the weight pruning experiment the baseline BNN has a hidden layer of size 100. The only difference from the details above is that both the BNN and the BNN with the IBP prior are trained for 600 epochs.

We summarise the sizes of the training and test sets for the datasets used in our experiments in Table 1.

Dataset Training set size Test set size
Split MNIST
Split MNIST + noise
Split MNIST + background images
not MNIST
Table 1: Sizes of the training and test sets for the datasets used.

Appendix E Further results on MNIST variants

We elaborate on the results presented in Section 3. Results are shown for various MNIST variants and the not MNIST dataset. The accuracies of each task after successive continual learning steps are shown in Figures 3 to 6.

For the split MNIST dataset (Figure 3) we note that the IBP prior BNN is able to outperform the 5-neuron VCL network, which underfits. On the other hand, the 10- and 50-neuron VCL networks outperform the IBP prior network, in particular for tasks 3 to 5. The IBP prior BNN expands by a small amount as the number of tasks increases. Regarding MNIST with background noise (data obtained from https://sites.google.com/a/lisa.iro.umontreal.ca/public_static_twiki/variations-on-the-mnist-digits), it is clear that the IBP prior model outperforms all VCL baselines for all tasks. The IBP prior model also expands slightly; however, it does not extend its capacity past the largest VCL baseline model considered. Similarly, the results using MNIST with random background images in general show the IBP prior BNN outperforming all VCL baselines, except when the model first sees the data in tasks 2, 3, and 4; despite this, the IBP prior network forgets these tasks less throughout the continual learning process. The results for not MNIST (data obtained from http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html) show that the IBP prior BNN outperforms the smaller baseline BNNs for tasks 1, 2 and 3, while the largest baseline BNN outperforms the IBP prior BNN for all tasks apart from the first.

Note that the sparse matrices shown in Figures 3 to 6 are not binary, as they would be in the original IBP formulation, because we use a Concrete relaxation of the Bernoulli distribution. For Figures 1 and 2(b) to 5(b), a neuron is defined as active in the IBP BNN when its mask value exceeds a fixed threshold.

(a) Split MNIST task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. The final plot is a per task average accuracy as shown previously.
(b) Per task sparse matrix for a batch in test set and histograms of the number of active neurons per point in the test set.
Figure 3: Split MNIST task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. Our model is compared to VCL benchmarks with different numbers of hidden states, denoted in the plot legend. Accuracies are an average of 5 optimisations. We also show the sparse matrix $Z$ of the IBP prior model after each approximate Bayesian update, together with a histogram of the number of neurons which are active for each data point in the test set. The average number of neurons which are active per point in the test set increases steadily from 13.2 to 13.9.
(a) Split MNIST + noise dataset continual learning task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. The final plot is a per task average accuracy as shown previously.
(b) Per task sparse matrix for a batch in test set and histograms of the number of active neurons per point in the test set.
Figure 4: Split MNIST with random noise task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. Our model is compared to VCL benchmarks with different numbers of hidden states. Accuracies are an average of 5 optimisations. We also show the sparse matrix $Z$ of the model after each approximate Bayesian update, together with a histogram of the number of neurons which are active for each point in the test set. The average number of neurons which are active per point in the test set increases steadily from 13.3 to 14.2.
(a) Split MNIST + background image dataset continual learning task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. The final plot is a per task average accuracy as shown previously.
(b) Per task sparse matrix for a batch in test set and histograms of the number of active neurons per point in the test set.
Figure 5: Split MNIST with random background images task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. Our model is compared to VCL benchmarks with different numbers of hidden states. Accuracies are an average of 5 optimisations. We also show the sparse matrix $Z$ of the model after each approximate Bayesian update, together with a histogram of the number of neurons which are active for each point in the test set. The average number of neurons which are active per point in the test set increases steadily from 13.4 to 14.0.
(a) Split Not MNIST image dataset continual learning task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. The final plot is a per task average accuracy.
(b) Per task sparse matrix for a batch in test set and histograms of the number of active neurons per point in the test set.
Figure 6: Split Not MNIST task accuracies versus the number of tasks the model has seen and performed the Bayesian update for. Our model is compared to VCL benchmarks with different numbers of hidden states. Accuracies are an average of 5 optimisations. We also show the sparse matrix $Z$ of the model after each approximate Bayesian update, together with a histogram of the number of neurons which are active for each point in the test set. The average number of neurons which are active per point in the test set increases steadily from 13.2 to 14.7.