1 Bayesian Incremental Learning
Recent work has shown promise in incremental learning; for example, a set of reinforcement learning problems has been solved successively by a single model with the help of weight consolidation (Kirkpatrick et al., 2016; Nguyen et al., 2017). In this work we focus on a specific incremental learning setting: a single fixed task in which independent data portions arrive sequentially. We formulate a Bayesian method for incremental learning and use recent advances in approximate Bayesian inference (Kingma & Welling, 2013; Kingma et al., 2015; Louizos & Welling, 2017) to obtain a scalable learning algorithm. We demonstrate that our method improves on a naive fine-tuning approach on MNIST and CIFAR-10, and that it can be applied to a conventional (non-Bayesian) pre-trained DNN.
Consider an i.i.d. dataset $\mathcal{D}$. In an incremental learning setting, this dataset is divided into parts $\mathcal{D}_1, \dots, \mathcal{D}_T$, which arrive sequentially during training. The goal is to build an efficient algorithm that takes a model trained on the first $t-1$ units of data $\mathcal{D}_1, \dots, \mathcal{D}_{t-1}$ and retrains it on a new unit of data $\mathcal{D}_t$ without access to $\mathcal{D}_1, \dots, \mathcal{D}_{t-1}$ and without forgetting the dependencies learned from them.
The most naive deep learning approach to incremental learning is to fine-tune the model by applying Stochastic Gradient Descent (SGD) updates with the same loss function on each new data part. In that case, however, the model is likely to converge to a local optimum for the new data unit without retaining the information learned from the previous parts of the data.
The Bayesian framework is a powerful tool for working with probabilistic models. It allows us to estimate the posterior distribution over the weights $w$ of the model. We can use Bayes' rule to update the posterior distribution sequentially as each new data portion $\mathcal{D}_t$ arrives:
$$p(w \mid \mathcal{D}_1, \dots, \mathcal{D}_t) \propto p(\mathcal{D}_t \mid w)\, p(w \mid \mathcal{D}_1, \dots, \mathcal{D}_{t-1}) \tag{1}$$
Unfortunately, in most cases the posterior distribution is intractable, so we can use stochastic variational inference (Hoffman et al., 2012) to approximate it. In the next section we present a scalable method for incremental learning, and study different variational approximations of the posterior distribution.
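As a toy illustration of the sequential Bayes update (not an experiment from this work), consider a Beta-Bernoulli model, where the posterior is tractable. Updating on the data portions one after another, with each posterior serving as the next prior, gives exactly the posterior obtained from all the data at once. The model, the data and the helper name `update` are assumptions chosen for illustration.

```python
def update(prior, portion):
    """Beta(a, b) prior plus Bernoulli observations gives a Beta posterior."""
    a, b = prior
    ones = sum(portion)
    return (a + ones, b + len(portion) - ones)


data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1]
portions = [data[i:i + 4] for i in range(0, len(data), 4)]  # D_1, D_2, D_3

prior = (1, 1)  # uniform Beta(1, 1) prior
posterior = prior
for portion in portions:
    # the posterior after portion t becomes the prior for portion t + 1
    posterior = update(posterior, portion)

batch_posterior = update(prior, data)  # all data seen at once
print(posterior, batch_posterior)
```

With an exact posterior the incremental and batch results coincide; the difficulty addressed in the rest of the paper arises only because the posterior of a neural network has to be approximated.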
2 Scalable Method for Bayesian Incremental Learning
We apply variational inference to approximate the posterior $p(w \mid \mathcal{D}_1, \dots, \mathcal{D}_t)$ by $q_t(w)$, with access only to $\mathcal{D}_t$ and the previous approximation $q_{t-1}(w)$. To train our model we follow Kingma & Welling (2013) and use the reparameterization trick to obtain an unbiased, differentiable, minibatch-based Monte Carlo estimator of the variational lower bound
$$\mathcal{L}_t = \mathbb{E}_{q_t(w)} \log p(\mathcal{D}_t \mid w) - \mathrm{KL}\big(q_t(w) \,\|\, q_{t-1}(w)\big) \tag{2}$$
The prior distribution $q_{t-1}(w)$ is the posterior approximation from the previous step. Unfortunately, this approximation is not exact, and as a result the incremental procedure becomes biased. The quality of the incremental learning algorithm depends strongly on the posterior approximation $q_t(w)$: more expressive families have a smaller approximation gap but poorer stability. We investigate how different approximations behave in a Bayesian incremental learning algorithm.
Fully Factorized Gaussian Approximation is a fast, stable and easy-to-use approximation family. For a dense layer with input and output dimensions $I$ and $O$, respectively, the model is:
$$q(W) = \prod_{i=1}^{I} \prod_{j=1}^{O} \mathcal{N}\big(w_{ij} \mid \mu_{ij}, \sigma_{ij}^2\big) \tag{3}$$
The approximate posterior for a convolutional layer factorizes similarly over all kernel parameters. The Gaussian approximation is widely used (Blundell et al., 2015; Kingma et al., 2015; Kucukelbir et al., 2015; Molchanov et al., 2017); however, this family has low expressiveness (Louizos & Welling, 2017), which affects the quality of incremental learning.
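For the fully factorized Gaussian family both terms of the variational lower bound are simple to evaluate: the KL term between the current and the previous approximation has a closed form, and the data term is estimated with a reparameterized weight sample. The following is a minimal sketch under stated assumptions, not the implementation used in the experiments: the toy Gaussian likelihood `log_lik`, the function names and all numbers are hypothetical, and in practice the gradient with respect to `mu` and `sigma` would be obtained by automatic differentiation through `w`.

```python
import math
import random


def kl_diag_gauss(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between fully factorized Gaussians."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


def log_lik(x, y, w):
    """Toy likelihood (assumption): y ~ N(w[0] * x + w[1], 1)."""
    resid = y - (w[0] * x + w[1])
    return -0.5 * (resid ** 2 + math.log(2 * math.pi))


def elbo_estimate(minibatch, n_data, mu, sigma, mu_prev, sigma_prev, rng):
    """One-sample estimate of E_q[log p(D_t | w)] - KL(q_t || q_{t-1})."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    w = [m + s * e for m, s, e in zip(mu, sigma, eps)]  # reparameterization
    scale = n_data / len(minibatch)  # rescale minibatch sum to the full portion
    data_term = scale * sum(log_lik(x, y, w) for x, y in minibatch)
    return data_term - kl_diag_gauss(mu, sigma, mu_prev, sigma_prev)


rng = random.Random(0)
minibatch = [(0.0, 1.0), (1.0, 3.1), (2.0, 4.9)]
value = elbo_estimate(minibatch, n_data=30,
                      mu=[2.0, 1.0], sigma=[0.1, 0.1],
                      mu_prev=[0.0, 0.0], sigma_prev=[1.0, 1.0], rng=rng)
print(value)
```

Here `mu_prev` and `sigma_prev` play the role of the previous posterior approximation, which acts as the prior at the current step.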
Next, we consider a convolutional layer with $F$ filters and $C$ channels with filter size $k_h \times k_w$.
Channel Factorized Gaussian Approximation fits convolutional layers well, preserving correlations between the kernel parameters within each channel. Following Rezende et al. (2014), we use the Cholesky decomposition to parameterize the covariance matrix. Under this parameterization, we can both perform the reparameterization trick and compute the density efficiently.
$$q(W) = \prod_{f=1}^{F} \prod_{c=1}^{C} \mathcal{N}\big(\mathrm{vec}(w_{fc}) \mid \mu_{fc}, L_{fc} L_{fc}^{\top}\big) \tag{4}$$
where $w_{fc}$ is the $k_h \times k_w$ kernel of filter $f$ for channel $c$ and $L_{fc}$ denotes a lower triangular matrix with positive diagonal elements.
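To see why a Cholesky factor makes both operations cheap, the sketch below draws a reparameterized sample and evaluates the log-density of a single Gaussian with covariance $L L^{\top}$ using only forward substitution, so the covariance matrix is never inverted explicitly. It is an illustrative sketch for one kernel; the function names and the $3 \times 3$ kernel size are assumptions.

```python
import math
import random


def sample(mu, L, rng):
    """Reparameterized draw w = mu + L @ eps with eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    return [m + sum(L[i][j] * eps[j] for j in range(i + 1))
            for i, m in enumerate(mu)]


def log_density(w, mu, L):
    """log N(w | mu, L L^T) using forward substitution with L."""
    d = len(mu)
    y = []
    for i in range(d):  # solve L y = w - mu row by row
        r = w[i] - mu[i] - sum(L[i][j] * y[j] for j in range(i))
        y.append(r / L[i][i])
    # log det(L L^T) = 2 * sum of log-diagonal, hence the positive diagonal
    log_det = 2.0 * sum(math.log(L[i][i]) for i in range(d))
    return -0.5 * (d * math.log(2 * math.pi) + log_det + sum(v * v for v in y))


rng = random.Random(0)
d = 9  # vec of a 3 x 3 kernel (kernel size is an assumption)
mu = [0.0] * d
L = [[1.0 if i == j else (0.1 if j < i else 0.0) for j in range(d)]
     for i in range(d)]
w = sample(mu, L, rng)
print(log_density(w, mu, L))
```

Both routines cost $O(d^2)$ per kernel, where $d$ is the number of parameters in one kernel.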
Multiplicative Normalizing Flow Approximation is a highly expressive variational family. Louizos & Welling (2017) successfully employed it to train Bayesian deep neural networks. MNFs introduce an auxiliary variable $z$ to define a posterior approximation $q(w)$:
$$q(w) = \int q(w \mid z)\, q(z)\, dz, \qquad q(w \mid z) = \prod_{i,j} \mathcal{N}\big(w_{ij} \mid z_i \mu_{ij}, \sigma_{ij}^2\big), \qquad z = \mathrm{NF}(z_0) \tag{5}$$
where $z_0$ follows a simple distribution and $\mathrm{NF}$ is a normalizing flow (Rezende & Mohamed, 2015). However, this approximation cannot be used directly in incremental learning, because it requires computing an intractable integral to evaluate $q(w)$ (the prior on the next incremental step). To address this issue, we derive a new variational lower bound (Appendix C) to optimize the joint approximation $q_t(w, z)$ instead of the marginal $q_t(w)$:
$$\mathcal{L}_t = \mathbb{E}_{q_t(w, z)} \log p(\mathcal{D}_t \mid w) - \mathrm{KL}\big(q_t(w, z) \,\|\, q_{t-1}(w, z)\big) \tag{6}$$
The key difference relative to the original lower bound is that we treat $z$ as a regular parameter and not as an auxiliary variable. This allows us to use a joint prior $q_{t-1}(w, z)$, which leads to a tractable incremental learning procedure. We hope that a more expressive posterior will help in the incremental setting.
Pretraining. When training a large neural network, it is beneficial to initialize it with a model that has been pre-trained on another task. In order to apply the Bayesian approach, one needs to specify a prior distribution over the weights given the pretrained DNN. The simplest choice is a fully-factorized Gaussian prior centered at the pretrained values $w_{\text{pre}}$ with some fixed variance $\sigma^2$. Typically, one would use grid search for $\sigma^2$, but a better approach might be to use the Laplace approximation (Azevedo-filho, 1994) to obtain $\sigma^2$ from the old data. Fitting a Laplace approximation yields an individual variance $\sigma_i^2$ for every weight, which appears to be beneficial in our experiments.
3 Experiments
In our experiments, we compared test set accuracy after incremental training on MNIST and CIFAR-10 datasets for LeNet5 (LeCun et al., 1998) and 3Conv3FC (Hinton et al., 2012) architectures, respectively, using the proposed approach and fine-tuning. Details for the training procedure are described in Appendix A.
Incremental Learning on MNIST and CIFAR-10 The fine-tuning (FT) approach achieved the same score in the non-incremental setting (a single data portion, $T = 1$) as the Bayesian methods 1. However, it failed to solve the incremental learning task ($T > 1$), resulting in low classification performance. The fully-factorized Gaussian approximation (FFG) solves the problem successfully and matches the performance of the non-incremental setting ($T = 1$). Normalizing flows (MNF) performed worse than the fully-factorized Gaussian approximation in the incremental learning setting, but better when $T = 1$. We expect that this gap is due to the small amount of data available at each step and the unstable convergence of complex posteriors.
We encountered optimization problems when moving to larger architectures. In larger architectures the KL term dominates the data term in the objective, which leads to severe underfitting. To cope with this problem, we downscaled the KL term, which is a common trick in Bayesian deep learning (Ullrich et al., 2017; Higgins et al., 2017). The resulting objective is no longer a proper variational lower bound, but it works well in practice and outperforms fine-tuning by a large margin.
Incremental Learning with Domain Adaptation on CIFAR(5+5) In this experiment we evaluated the pretraining approaches on the CIFAR(5+5) dataset, obtained by randomly dividing CIFAR-10 into two equal parts based on labels. Pretraining was done on the first half of the data, while the incremental task was solved on the second half. We compared grid search and the Laplace approximation for the prior variance, as described above, to the fine-tuning approach. The experiments showed that pretraining improves the performance of Bayesian neural networks trained incrementally on the rest of the data, whereas the fine-tuning approach fails to benefit from pretraining. Moreover, the Laplace approximation performs well, so the grid search is not necessary.
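The Laplace-based choice of per-weight prior variances compared in this experiment can be illustrated on a small model. The text does not specify how the approximation was computed, so the following is only a sketch under assumptions: it uses a diagonal Laplace approximation for logistic regression, where the variance of weight $i$ is the inverse of the corresponding diagonal entry of the Hessian of the negative log posterior. The names `laplace_diag_variances` and `prior_precision` are hypothetical, and `w_map` stands for the pretrained weights.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def laplace_diag_variances(w_map, xs, prior_precision=1.0):
    """Diagonal Laplace approximation for logistic regression weights.

    Returns 1 / H_ii, where H is the Hessian of the negative log posterior
    at w_map; its diagonal is sum_n p_n (1 - p_n) x_ni^2 + prior_precision.
    """
    h = [prior_precision] * len(w_map)
    for x in xs:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w_map, x)))
        for i, xi in enumerate(x):
            h[i] += p * (1.0 - p) * xi * xi
    return [1.0 / hi for hi in h]


# The second input feature is always zero, so the old data say nothing
# about the second weight and its variance stays at the prior value.
old_inputs = [[1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
variances = laplace_diag_variances([0.0, 0.0], old_inputs)
print(variances)
```

Weights that the old data constrain strongly receive a small variance and are held close to their pretrained values, while poorly constrained weights remain free to move during incremental training.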
Acknowledgments
This research was supported by Samsung Research, Samsung Electronics.
References
- Azevedo-filho (1994) Adriano Azevedo-filho. Laplace’s method approximations for probabilistic inference in belief networks with continuous variables. In de Mantaras, pp. 28–36. Morgan Kaufmann, 1994.
- Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks, 2015.
- Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
- Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. Technical report, 2012.
- Hoffman et al. (2012) Matt Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference, 2012.
- Kingma & Welling (2013) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
- Kingma et al. (2015) Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (eds.), Advances in Neural Information Processing Systems 28, pp. 2575–2583. Curran Associates, Inc., 2015.
- Kirkpatrick et al. (2016) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks, 2016.
- Kucukelbir et al. (2015) Alp Kucukelbir, Rajesh Ranganath, Andrew Gelman, and David M. Blei. Automatic variational inference in stan, 2015.
- LeCun et al. (1998) Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
- Louizos & Welling (2017) Christos Louizos and Max Welling. Multiplicative normalizing flows for variational bayesian neural networks, 2017.
- Molchanov et al. (2017) Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), 2017.
- Nguyen et al. (2017) Cuong V. Nguyen, Yingzhen Li, Thang D. Bui, and Richard E. Turner. Variational continual learning, 2017.
- Rezende & Mohamed (2015) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows, 2015.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
- Ullrich et al. (2017) Karen Ullrich, Edward Meeds, and Max Welling. Soft weight-sharing for neural network compression, 2017.
Appendix
A Bayesian Incremental Learning Algorithm
Incremental Learning We conducted experiments on the LeNet5 network (MNIST dataset) and the 3Conv3FC network (CIFAR-10 dataset). We compared the four previously described approaches: Fine-Tuning (FT), Fully-Factorized Gaussian (FFG), Channel-Factorized Gaussian (CFG), and Multiplicative Normalizing Flows (MNF). The dataset is divided into $T$ slices, and we perform the incremental training procedure on each slice consecutively. We use the Adam optimizer with default parameters. The optimizer state (e.g. the moving moment estimates) is reset before each incremental stage. The predictions of the Bayesian models were averaged over 100 samples from the approximate posterior distribution.
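Averaging predictions over posterior samples can be sketched as follows. This is an illustrative toy, not the evaluation code: the "network" is a linear model with one weight per class and a factorized Gaussian posterior, and the names `predictive` and `softmax` are assumptions.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predictive(x, mu, sigma, rng, n_samples=100):
    """Monte Carlo average of class probabilities over w ~ N(mu, sigma^2)."""
    avg = [0.0] * len(mu)
    for _ in range(n_samples):
        # one weight per class: a deliberately tiny linear model
        w = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
        probs = softmax([wk * x for wk in w])
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg


rng = random.Random(0)
probs = predictive(0.5, mu=[1.0, -1.0, 0.0], sigma=[0.3, 0.3, 0.3], rng=rng)
print(probs)
```

Note that the class probabilities are averaged, not the weights: each sample yields a full prediction, and the ensemble of 100 predictions is averaged.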
Incremental Learning with Pretraining We use the following experiment design on the CIFAR-10 dataset. First, we split CIFAR-10 into two "CIFAR-5" datasets with 5 classes each (selected at random). We use the first five classes for pretraining and then apply the incremental learning framework to the second part of the dataset. We call this task CIFAR(5+5). At the initial stage of the experiment we train a neural network on the first dataset in the conventional non-incremental setting. Then we divide the second dataset into $T$ parts and train an incremental model on each part consecutively, as described in the previous sections. This experiment design allows us to model pretraining on an unrelated task from a similar domain. We use the parameters of the network trained on the first dataset to initialize the incremental training procedure on the second dataset, with the same architecture in both stages. We reuse only the parameters of the convolutional layers learned during the first stage; we do not use the pretrained parameters of the fully-connected layers, which produce the network's predictions, since the two stages classify objects into different sets of classes. This is a typical technique, as the convolutional layers tend to extract task-independent features, whereas the fully-connected layers use these features to obtain the prediction.
B Models
Block type | Width | Stride | Padding | Input shape | Nonlinearity |
---|---|---|---|---|---|
Convolution () | | 1 | 0 | | ReLU |
Pooling () | | 1 | 0 | | None |
Convolution () | | 1 | 0 | | ReLU |
Pooling () | | 1 | 0 | | None |
Fully-connected | 500 | | | | ReLU |
Fully-connected | 10 | | | | Softmax |

Full description of the LeNet-5-Caffe (LeCun et al., 1998) architecture for the MNIST dataset.

Block type | Width | Stride | Padding | Input shape | Nonlinearity |
---|---|---|---|---|---|
Convolution () | | 1 | 2 | | ReLU |
Pooling () | | 2 | 0 | | None |
Convolution () | | 1 | 2 | | ReLU |
Pooling () | | 2 | 0 | | None |
Convolution () | | 1 | 1 | | ReLU |
Pooling () | | 2 | 0 | | None |
Fully-connected | 1000 | | | | ReLU |
Fully-connected | 1000 | | | | ReLU |
Fully-connected | 10 | | | | Softmax |
C MNFs for the incremental learning task
This section describes the proposed MNF-based approximation (Louizos & Welling, 2017) of the posterior distribution that is suitable for the incremental learning task, together with the training procedure for this model. Unfortunately, the MNF inference technique cannot be applied directly in the incremental learning setting. For the variational lower bound in the incremental learning task we have to estimate $\mathrm{KL}\big(q_t(w) \,\|\, q_{t-1}(w)\big)$, where $q_{t-1}(w)$ is the posterior approximation obtained at the previous step of incremental learning. This KL term is intractable; moreover, we can evaluate neither the new variational approximation $q_t(w)$ nor the old one $q_{t-1}(w)$, as computing these distributions requires marginalization over the whole space of latent variables $z$. An alternative idea is to include the latent variables $z$ in the original probabilistic model as a new parameter:
(7) |
To simplify the notation used in the derivation, we omit the parameter and time indices from here on:
(8) |
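Now our goal is to obtain the joint estimate $q(w, z)$. Consider an arbitrary step of incremental learning. Denote the variational distribution obtained at the previous step as $q_{\text{old}}(w, z)$. Now consider the probabilistic model for the current step of incremental learning, with data portion $\mathcal{D}$:
$$p(\mathcal{D}, w, z) = p(\mathcal{D} \mid w)\, q_{\text{old}}(w, z) \tag{9}$$
We can approximate the posterior via optimization of the variational lower bound:
$$\mathcal{L} = \mathbb{E}_{q(w, z)} \log p(\mathcal{D} \mid w) - \mathrm{KL}\big(q(w, z) \,\|\, q_{\text{old}}(w, z)\big) \to \max_{q} \tag{10}$$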
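For simplicity we use the notation of (10). We can expand the data term and rewrite it in the following form:
$$\mathbb{E}_{q(w, z)} \log p(\mathcal{D} \mid w) = \mathbb{E}_{q(z)} \mathbb{E}_{q(w \mid z)} \log p(\mathcal{D} \mid w) \tag{11}$$
where
$$q(w \mid z) = \prod_{i,j} \mathcal{N}\big(w_{ij} \mid z_i \mu_{ij}, \sigma_{ij}^2\big), \qquad z = \mathrm{NF}(z_0), \quad z_0 \sim q(z_0) \tag{12}$$
For a fully-connected layer it is equivalent to the following sampling scheme:
$$z_0 \sim q(z_0), \qquad z = \mathrm{NF}(z_0), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, 1), \qquad w_{ij} = z_i \mu_{ij} + \sigma_{ij} \varepsilon_{ij} \tag{13}$$
Here $\mathrm{NF}$ stands for the normalizing flow. Instead of sampling the weights directly, we can apply the local reparameterization trick as described by Louizos & Welling (2017), which completes the computation of the data term.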
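The sampling scheme for a fully-connected layer can be sketched in a few lines. This is a minimal sketch under strong assumptions: the normalizing flow is replaced by a toy elementwise affine map (Louizos & Welling (2017) use a more expressive flow), the weights are sampled directly instead of using the local reparameterization trick, and the names `affine_flow` and `sample_layer` are hypothetical. The sketch also returns $\log q(z)$, obtained from the change-of-variables formula, since the KL term needs it.

```python
import math
import random


def affine_flow(z0, shift, log_scale):
    """Toy invertible map z = z0 * exp(log_scale) + shift, with log|det J|."""
    z = [v * math.exp(s) + b for v, s, b in zip(z0, log_scale, shift)]
    return z, sum(log_scale)


def sample_layer(mu, sigma, shift, log_scale, rng):
    """Draw (w, z, log q(z)) for a dense layer; mu and sigma are I x O lists."""
    z0 = [rng.gauss(0.0, 1.0) for _ in mu]  # one z per input unit
    log_q_z0 = sum(-0.5 * (v * v + math.log(2 * math.pi)) for v in z0)
    z, log_det = affine_flow(z0, shift, log_scale)
    log_q_z = log_q_z0 - log_det  # change of variables
    # w_ij = z_i * mu_ij + sigma_ij * eps_ij with eps_ij ~ N(0, 1)
    w = [[z[i] * mu[i][j] + sigma[i][j] * rng.gauss(0.0, 1.0)
          for j in range(len(mu[i]))] for i in range(len(mu))]
    return w, z, log_q_z


rng = random.Random(0)
mu = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]      # I = 3 inputs, O = 2 outputs
sigma = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]
w, z, log_q_z = sample_layer(mu, sigma, shift=[1.0, 1.0, 1.0],
                             log_scale=[0.5, 0.0, 0.25], rng=rng)
print(z, log_q_z)
```

Each $z_i$ multiplies all means in row $i$, so a single latent variable per input unit introduces correlations between the weights attached to that unit.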
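Now we need to calculate the KL divergence term. Writing $q_{\text{old}}(w, z) = q_{\text{old}}(w \mid z)\, q_{\text{old}}(z)$ for the variational distribution from the previous step, we can rewrite it in the following way:
$$\mathrm{KL}\big(q(w, z) \,\|\, q_{\text{old}}(w, z)\big) = \mathbb{E}_{q(w, z)} \log q(w, z) - \mathbb{E}_{q(w, z)} \log q_{\text{old}}(w, z) \tag{14}$$
$$= \mathbb{E}_{q(w, z)} \log q(w, z) - \mathbb{E}_{q(z)} \mathbb{E}_{q(w \mid z)} \log q_{\text{old}}(w \mid z) - \mathbb{E}_{q(z)} \log q_{\text{old}}(z) \tag{15}$$
The expectation $\mathbb{E}_{q(w \mid z)} \log q_{\text{old}}(w \mid z)$ is, up to sign, a cross-entropy between two normal distributions and can be computed analytically. The expectation $\mathbb{E}_{q(z)} \log q_{\text{old}}(z)$ can be estimated efficiently by sampling, as the log-density is computed analytically as the log-density of a normalizing flow. Now we need to compute the expectation $\mathbb{E}_{q(w, z)} \log q(w, z)$. It can be written in the following form:
$$\mathbb{E}_{q(w, z)} \log q(w, z) = \mathbb{E}_{q(z)} \mathbb{E}_{q(w \mid z)} \log q(w \mid z) + \mathbb{E}_{q(z)} \log q(z) \tag{16}$$
We can remove the expectation w.r.t. the distribution $q(z)$ in the first term, as the entropy of a normal distribution depends only on its variance and does not depend on $z$. Therefore we have
$$\mathbb{E}_{q(w, z)} \log q(w, z) = -H\big(q(w \mid z)\big) + \mathbb{E}_{q(z)} \log q(z) \tag{17}$$
The first term is computed analytically, and the second term can be easily estimated by sampling, using the log-density computed by the normalizing flow. Therefore, if we use the joint model $q(w, z)$ instead of the marginalized model $q(w)$, we do not need to perform the nested variational inference procedure.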
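The only things left are to write down the cross-entropy of the two normal distributions $q(w \mid z)$ and $q_{\text{old}}(w \mid z)$, the entropy of $q(w \mid z)$, and the sampling procedures for $\mathbb{E}_{q(z)} \log q(z)$ and $\mathbb{E}_{q(z)} \log q_{\text{old}}(z)$.
Sampling for $\mathbb{E}_{q(z)} \log q(z)$ and $\mathbb{E}_{q(z)} \log q_{\text{old}}(z)$ ("$\simeq$" denotes unbiased estimation):
$$\hat{z} = \mathrm{NF}(z_0), \qquad z_0 \sim q(z_0) \tag{18}$$
$$\mathbb{E}_{q(z)} \log q(z) \simeq \log q(\hat{z}) = \log q(z_0) - \log \left| \det \frac{\partial\, \mathrm{NF}(z_0)}{\partial z_0} \right| \tag{19}$$
$$\mathbb{E}_{q(z)} \log q_{\text{old}}(z) \simeq \log q_{\text{old}}(\hat{z}) \tag{20}$$
These estimators can be differentiated w.r.t. the normalizing flow parameters to obtain an unbiased gradient estimate.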
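Cross-entropy of $q(w \mid z)$ and $q_{\text{old}}(w \mid z)$, which are Gaussian with means $z_i \mu_{ij}$ and $z_i \mu^{\text{old}}_{ij}$ and standard deviations $\sigma_{ij}$ and $\sigma^{\text{old}}_{ij}$:
$$-\mathbb{E}_{q(w \mid z)} \log q_{\text{old}}(w \mid z) = \sum_{i,j} \left[ \frac{1}{2} \log\big(2\pi (\sigma^{\text{old}}_{ij})^2\big) + \frac{\sigma_{ij}^2 + z_i^2 \big(\mu_{ij} - \mu^{\text{old}}_{ij}\big)^2}{2 (\sigma^{\text{old}}_{ij})^2} \right] \tag{21}$$
Entropy of $q(w \mid z)$:
$$H\big(q(w \mid z)\big) = \sum_{i,j} \frac{1}{2} \log\big(2\pi e\, \sigma_{ij}^2\big) \tag{22}$$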
This concludes the definition of the MNF approximation for the Bayesian incremental learning procedure.
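The two analytic quantities used above, the entropy of a factorized Gaussian and the cross-entropy between two factorized Gaussians, are standard and can be checked numerically. The sketch below uses flat lists of means and standard deviations and defines the cross-entropy as $-\mathbb{E}_{q} \log p$; the function names and the numbers are assumptions. For the MNF posterior the means would be $z_i \mu_{ij}$ for the current step and $z_i$ times the previous-step means for the old distribution, with the same sampled $z$.

```python
import math


def gauss_entropy(sigma):
    """Entropy of a factorized Gaussian; it does not depend on the mean."""
    return sum(0.5 * math.log(2 * math.pi * math.e * s * s) for s in sigma)


def gauss_cross_entropy(mu_q, sigma_q, mu_p, sigma_p):
    """Cross-entropy -E_q[log p] for factorized Gaussians q and p."""
    return sum(
        0.5 * math.log(2 * math.pi * sp * sp)
        + (sq * sq + (mq - mp) ** 2) / (2 * sp * sp)
        for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p)
    )


# Means scaled by the same sampled z for the new and the old distribution.
z = [0.5, 2.0]
mu_new, mu_old = [1.0, -1.0], [0.8, -0.5]
means_new = [zi * m for zi, m in zip(z, mu_new)]
means_old = [zi * m for zi, m in zip(z, mu_old)]
ce = gauss_cross_entropy(means_new, [0.1, 0.2], means_old, [0.3, 0.3])
kl_w_given_z = ce - gauss_entropy([0.1, 0.2])  # KL = cross-entropy - entropy
print(kl_w_given_z)
```

Subtracting the entropy from the cross-entropy gives the KL divergence between the two conditional distributions for a fixed $z$, which is non-negative.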
Discussion of the model
Multiplicative normalizing flows provide a good approximation of the posterior distribution and show high predictive performance in practice. However, they contain many parameters, which makes them slow to train and can cause optimization problems.