A long-held dream of the AI community is to build a machine capable of operating autonomously for long periods or even indefinitely. Such a machine must necessarily learn and adapt to a changing environment and, crucially, manage memories of what it has learned for the future tasks it will encounter. A spectrum of learning scenarios are available depending on problem requirements. In lifelong learning (Thrun, 1996) the machine is presented a sequence of tasks and must use knowledge learned from the previous tasks to perform better on the next. In the resource-constrained lifelong learning setting the machine is constrained to a small buffer of previous experiences. Some approaches to lifelong learning assume that a task is a set of examples chosen from the same distribution (Rusu et al., 2016; Fernando et al., 2017; Shin et al., 2017; Ramapuram, Gregorova, and Kalousis, 2017; Al-Shedivat et al., 2017; Lee et al., 2018). If instead the machine is given a sequence of examples without any batching, then this is called continual learning. In this paper we focus on this more challenging continual learning scenario.
has three main requirements: (1) continually adapt in a non-stationary environment, (2) retain memories which are useful, (3) manage compute and memory resources over a long period of time. Most neural network research has focused on methods to improve (1) and (2). In this paper we consider (3) as well and further investigate the role of efficient experience storage in avoiding the catastrophic forgetting(McCloskey and Cohen, 1989) problem that makes (2) so challenging.
Experience memory has been influential in many recent approaches. For example, experience replay (Lin, 1992) was integral in helping to stabilize the training of Deep Q Learning on Atari games (Mnih et al., 2015). Episodic storage mechanisms (Schaul et al., 2015; Blundell et al., 2016; Pritzel et al., 2017; Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017)
were also some of the earliest solutions to the catastrophic forgetting problem in the supervised learning setting(Murre, 1992; Robins, 1995). Unlike approaches which simply focus on remembering representations of old tasks (Li and Hoiem, 2016; Riemer, Khabiri, and Goodwin, 2016; Kirkpatrick et al., 2017), episodic storage techniques achieve superior performance because of their ability to continually improve on old tasks over time as useful information is learned later (Lopez-Paz and Ranzato, 2017).
All of these techniques try to use stored experiences to stabilize learning. However, they do not consider agents which must operate independently in the world for a long time. In this scenario, assuming the kind of high-dimensional data which make up human experiences, the efficient storage of experiences becomes an important factor. Storing full experiences in memory, as these methods do, causes storage costs to scale linearly with the number of experiences. To truly learn over a massive number of experiences in a non-stationary environment, the incremental cost of adding experiences to memory must be sub-linear in the number experiences.
In this paper we propose a scalable experience memory module which learns to adapt to a non-stationary 111We assume the environment is non-stationary, but not adversarial. So past experiences can be helpful for learning future tasks.
environment and improve itself over time. The memory module is implemented using a variational autoencoder which learns to compress high-dimensional experiences to a compact latent code for storage. This code can then be used to reconstruct realistic recollections for both experience replay training and improvement of the memory module itself. We demonstrate empirically that the module achieves sub-linear scaling with the number of experiences and provides a useful basis for a realistic continual learning system. Our experiments show superior performance over state-of-the-art approaches for lifelong learning with a very small incremental storage footprint.
Storing Parameters Instead of Experiences. Our method is complementary to recent work leveraging episodic storage to stabilize learning (Mnih et al., 2015; Blundell et al., 2016; Pritzel et al., 2017; Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017). Some recently proposed methods for lifelong learning don’t store experiences at all, instead recording the parameters of a network model for each task (Rusu et al., 2016; Kirkpatrick et al., 2017; Fernando et al., 2017). This yields linear (or sometimes worse) storage cost scaling with respect to the number of tasks. For our experiments, and in most settings of long-term interest for continual learning, the storage cost of these extra model parameters per task significantly exceeds the per task size of a corresponding experience buffer.
Generative Models to Support Lifelong Learning. Pseudorehearsals (Robins, 1995)
is a related approach for preventing catastrophic forgetting that unlike our recollection module does not require explicit storage of codes. Instead it learns a generative experience model alongside the main model. For simple learning problems, very crude approximations of the real data such as randomly generated data from an appropriate distribution can be sufficient. However, for complex problems like those found in NLP and computer vision with highly structured high dimensional inputs, more refined approximations are needed to stimulate the network with relevant old representations. To the best of our knowledge, we are the first to consider variational autoencoders(Kingma and Welling, 2014) as a method of creating pseudo-experiences to support supervised learning. Some recent work (Ramapuram, Gregorova, and Kalousis, 2017) considers the problem of generative lifelong learning for a variational autoencoder, introducing a modified training objective. This is potentially complementary to our contributions in this paper.
Biological Inspiration and Comparisons. Interestingly, the idea of scalable experience storage has a biologically inspired motivation relating back to the pioneering work of McClelland, McNaughton, and O’Reilly (1995), who hypothesized complementary dynamics for the hippocampus and neocortex. In this theory, updated in (Kumaran, Hassabis, and McClelland, 2016), the hippocampus is responsible for fast learning, providing a very plastic representation for retaining short term memories. Because the neocortex, responsible for reasoning, would otherwise suffer as a result of catastrophic forgetting, the hippocampus also plays a key role in generating approximate recollections (experience memories) to interleave with incoming experiences, stabilizing the learning of the neocortex. Our approach follows hippocampal memory index theory (Teyler and DiScenna, 1986; Teyler and Rudy, 2007), approximating this role of the hippocampus, with a modern deep neural network model. As this theory suggests, lifelong systems need both a mechanism of pattern completion (providing a partial experience as a query for a full stored experience) and pattern separation (maintaining separate indexable storage for each experience). As in the theory, we do not literally store previous experiences, but rather compact indexes which can be used to retrieve the experience from an association cortex, modeled as an auto-encoder.
Storing Few Experiences. Recent work on continual lifelong learning in deep neural networks (Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017) has focused on the resource constrained lifelong learning problem and how to promote stable learning with a relatively small diversity of prior experiences stored in memory. In this work, we complete the picture by also considering the relationship to the fidelity of the prior experiences stored memory. We achieve this by considering an additional resource constraint on the number of bits of storage allowed for each experience.
The Scalable Recollection Module (SRM)
The core of our approach is an architecture which supports scalable storage and retrieval of experiences as shown in Figure 1. There are three primary components: an encoder, an index buffer, and a decoder
. When a new experience is received (in the figure, an image of the numeral "6"), the encoder compresses it to a sequence of discrete latent codes (one hot vectors). These codes are concantenated and further compressed to abit binary code or “index” shown in decimal in the figure. This compressed code is then stored in the index buffer. This path is shown in blue. Experiences are retrieved from the index buffer by sampling a code from the index buffer and passing it through the decoder to create an approximate reconstruction of the original input. This path is shown in red in the figure.
The recollection buffer is implemented using a discrete variational auto-encoder (Jang, Gu, and Poole, 2017)(Maddison, Mnih, and Teh, 2017). A discrete variational autoencoder is a generative neural model with two components: an encoder and a decoder. The encoder is trained to take the input (say an image) and produce a discrete distribution over a set of latent categories which describe that input. To generate data, the discrete distribution is sampled. This can be done in a differentiable way using the so-called “reparameterization trick" which pushes the (non-differentiable) sampling operation to the front of the network. The decoder takes the sample and decodes it back into the original form of the input (in our example, an image). The VAE can then be used to encode experiences into a compact discrete code and later to generate experiences by sampling codes in the latent space and running them through the decoder.
Variational autoencoders have been used in the past to learn generative models of experience (called “pseudo-rehearsals" in the literature) (Robins, 1995). Our model differs in two respects: first we are using a discrete VAE which can produce compact codes; and second we are maintaining a buffer of those codes. Typical applications for VAEs focus on a fixed distribution to be learned and require that the VAE reproduce samples from that distribution with high fidelity. For continual learning it is also important that the VAE adapt to the non-stationary data distribution. Without an index buffer, the VAE’s parameters would quickly adapt to the current input and forget its past experiences. Additionally, as we will show later, the buffer leads to greater efficiency in generating samples that capture the variation of the distribution seen.
Improving Experience Replay Training
The recollection module can be used in many ways in a continual learning setting. In Algorithm 1 we show one approach which we will use later in our experiments.
In this setting the model must learn tasks sequentially from dataset . At every step it receives a triplet representing the input, task label, and correct output. There are two models to be trained: the recollection module which consists of a memory index buffer , an encoder and decoder ; and a predictive task model . The training proceeds in two phases: in the first phase we ensure that the recollection module itself is stabilized against forgetting; in the second phase we stabilize the predictive model.
For the recollection module, we achieve stabilization through a novel extension of experience replay (Lin, 1992). When an incoming example is received, we first sample multiple batches of recollections from the index buffer and decode them into experiences using the current decoder. We then perform steps of optimization on the encoder/decoder parameters and by interleaving the current input example with a different batch of past recollections at each of the optimization steps. For each optimization step, the error for each experience in a batch is computed by encoding that experience into a latent code using the encoder and then decoding back to an experience to compute the reconstruction error. On the first optimization step, the the reconstruction error is computed using the same decoder parameters that were used in the creation of that input experience in the batch. In subsequent steps, those parameters change as the recollection module is stabilized, learning parameters to successfully reconstruct both the old experiences in the buffer as well as the new experience. In this way the recollection module continues to remember the relevant past experiences in the buffer while integrating new experiences.
After the recollection module is trained with loss functionand learning rate , the predictive model is trained on just one of the recollection sample sets (we arbitrarily chose the first) using loss function and learning rate . Finally, the new sample is written to the index buffer. Perhaps surprisingly, this strategy of reconstructing experiences from codes and then performing experience replay training using them can be as effective for enabling continual learning as replaying real, uncompressed inputs in some cases.
In the resource constrained setting we study in this paper, there is an upper limit on the number of allowable episodic memories which governs the memory update . Our approach is fairly robust to changes in the update rule. For experience replay, we maintain the buffer using reservoir sampling (Vitter, 1985). In constrast, for the recently proposed Gradient Episodic Memory (GEM) algorithm which modulates gradients on incoming examples by solving a quadratic program with respect to past examples, we follow prior work and keep an equal number of the most recent examples for each task. As detailed in the appendix, integration of Scalable Recollections across algorithms is quite easy and the other differences between the experience replay and GEM algorithms with Scalable Recollections are contained to the different ways of utilizing episodic memory inherent to each lifelong learning algorithm. We note that the PROJECT function within GEM solves a quadratic program explained in (Lopez-Paz and Ranzato, 2017).
In this section we argue that it is more efficient to employ an index buffer rather than sampling directly from the latent VAE code space. This results from the ability of the model to recreate the input distribution. A typical strategy is to randomly sample from old experiences in episodic storage and interleave them with new ones to stabilize learning. So to the extent that the recollections are representative of the full distribution of prior experiences, learning proceeds exactly as if samples were drawn from a stationary i.i.d. distribution.
However, when sampling from the code space of the VAE without a buffer, the sample distribution will be unlikely to match that of the experiences from training. If the number of examples is larger than the number of possible codes (the capacity of the VAE) then the VAE will be unable to differentiate truly different images and hence have poor reconstruction. This scenario must be avoided for performance reasons. On the other hand, if the VAE capacity is considerably larger than the number of training examples, then sampling it at random is highly unlikely to reproduce the input distribution.
To alleviate this problem, a buffer can be added whose distribution of codes will match the distribution of the input data codes, if it is sufficiently large. Intuitively, setting the buffer size to be less than the capacity of the VAE will ensure that sampling from the buffer is more efficient than sampling directly from the VAE code space. Our experiments empirically support this hypothesis (Question 6):
Hypothesis 1 An index buffer is a more efficient parameterization of the input space seen by a variational autoencoder than its latent code if:
is the autoencoder latent variable size (i.e. 32 for typical continuous latent variables), is the number of latent variables in the autoencoders (i.e. the number of hidden units for continuous latent variables), and is the index buffer size.
In the vast majority of settings relevant to the study of continual lifelong learning today it is safe to assume that this inequality will hold. For example, even if we store an index for every example in a dataset like MNIST or CIFAR, this inequality will hold unless a continuous latent variable autoencoder has 3 or fewer latent variables. To put it another way, the per sample compression would have to be over 400X for popular datasets like Omniglot and MNIST and well over 1500X for CIFAR. These levels of compression with good reconstruction are far beyond what is possible with any known tools today. Moreover, given, as an example, the blurry nature of CIFAR images, it is possible that this compression quality is completely unfeasible. In practice, do to natural redundancy in the space of samples, it is more likely that will also only need to be significantly less than the number of examples seen. In our work we find that selecting subsets works well although a buffer may be even more efficient with online clustering strategies as in (Kaiser et al., 2017).
Our experiments will primarily focus on three public datasets commonly used for deep lifelong and multi-task learning.
MNIST-Rotations: (Lopez-Paz and Ranzato, 2017) A dataset with 20 tasks including 1,000 training examples for each task. The tasks are random rotations between 0 degrees and 180 degrees of the input space for each digit in MNIST.
Incremental CIFAR-100: (Lopez-Paz and Ranzato, 2017) A continual learning split of the CIFAR-100 image classification dataset considering each of the 20 course grained labels to be a task with 2,500 examples each.
Omniglot: A character recognition dataset (Lake et al., 2011) in which we consider each of the 50 alphabets to be a task. This is an even more challenging setting than explored in prior work on continual lifelong learning, containing more tasks and fewer examples of each class.
Evaluation for Continual Lifelong Learning
In this section we evaluate the benefits of the Scalable Recollection Module in enabling continual lifelong learning.
Metric: As in prior work, we measure retention as our key metric. It is defined as the test set accuracy on all tasks after sequential training has been completed over each task.
Architecture: We model our experiments after (Lopez-Paz and Ranzato, 2017) and use a Resnet-18 model as for CIFAR-100 and Omniglot as well as a two layer MLP with 200 hidden units for MNIST-Rotations. Across all of our experiments, our autoencoder models include three convolutional layers in the encoder and three deconvolutional layers in the decoder. Each convolutional layer has a kernel size of 5. As we vary the size of our categorical latent variable across experiments, we in turn model the number of filters in each convolutional layer to keep the number of hidden variables consistent at all intermediate layers of the network.
Module hyperparameters:In our experiments we used a binary cross entropy loss for both and . In the appendix we outline a constrained optimization procedure to find the optimal discrete autoencoder latent code design for a given resource footprint constraint. We follow this procedure to derive architectures that can be directly compared to various episodic storage baselines in our experiments.
The key question we consider is the following:
Question 1 Is the recollection module useful in improving retention for continual lifelong learning?
To answer this question we compare the retention performance of a system equipped with the recollection module to one equipped with a buffer of real experiences. We consider three datasets: MNIST-Rotations, CIFAR-100, and Omniglot. To account for the fact that real experiences take up more storage, we give both approaches the same storage budget. Specifically, we define two new quantities: incremental storage and items. The incremental storage is the effective size of the incremental storage used after the sunk cost of the initial model parameters. For clarity we express the effective incremental storage size in terms of the number of real examples of storage that would have the same footprint of resources used. The number of items by contrast refers to the number of items used in the episodic storage buffer whether they are real or approximate recollections. Obviously, by compressing items stored in the buffer we are able to store more items at the same effective incremental storage size.
In Table 1 we consider learning with a very small incremental resource footprint on MNIST-Rotations where we allow the effective storage of only one full experience or fewer per class. We find that storing real experiences can perform well in this regime, but the Scalable Recollection Module significantly outperforms them. For example, we achieve considerable improvements over results reported for the popular elastic weight consolidation (EwC) algorithm (Kirkpatrick et al., 2017) on this benchmark. We note that EwC stores some items for computing the Fisher information. We also note that the large incremental cost of the parameter and Fisher information storage for each task is equal to that of storing a real buffer of over 18 thousand examples.
In Table 2 we show results for Incremental CIFAR-100. This is a significant challenge for the Scalable Recollection Module since the CIFAR tiny image domain is known to be a particularly difficult setting for VAE performance (Kingma et al., 2016; Chen et al., 2017). We explore settings with a very small incremental resource footprint, including a couple of settings with even less incremental memory allowance than the number of classes. Real storage performs relatively well when the number of examples is greater than the number of classes, but otherwise suffers from a biased sampling towards a subset of classes. This can be seen by the decreased performance for small buffer sizes compared to using no buffer at all, learning online. Consistently we see that tuning the recollection module to approximate recollections with a reasonably sized index buffer results in improvements over real storage at the same incremental resource cost.
To further validate our findings, we tried the Omniglot dataset, attempting to learn continually in the difficult incremental 50 task setting. With an incremental resource constraint of 10 full examples, replay achieves 3.6% final retention accuracy (online learning produces 3.5% accuracy). In contrast, the recollection module achieves 5.0% accuracy. For an incremental resource footprint of 50 full examples, replay achieves 4.3% accuracy which is further improved to 4.8% accuracy by taking three gradient descent steps per new example. The recollection module again achieves better performance with 9.3% accuracy at one step per example and 13.0% accuracy at three steps per example.
|GEM Real Storage||100||100||62.5|
|Replay Real Storage||100||100||63.4|
|LwF (Li and Hoiem, 2016)||0||0||34.5|
|Replay Real Storage||10||10||29.4|
|GEM Real Storage||20||20||23.4|
Question 2 How does use of Scalable Recollections influence the long term retention of knowledge?
The value of using recollections becomes even more apparent for long term retention of skills. We demonstrate this in Figure 2 by first training models on Incremental CIFAR-100 and then training them for 1 million training examples on CIFAR-10. The number of training examples seen from CIFAR-100 is only 5% of the examples seen from CIFAR-10. Not only does the recollection module allow experience replay to generalize more effectively than real storage during initial learning, it also retains the knowledge much more gracefully over time. We provide a detailed chart in the appendix that includes learning for larger real storage buffer sizes as a comparison. A six times larger real storage buffer loses knowledge much faster than Scalable Recollections despite better performance during training on CIFAR-100.
Question 3 Can Scalable Recollections overcome the overhead of autoencoder model parameters?
To achieve the greater goals of lifelong learning, we are mostly interested in scaling to conditions where the number of examples seen is very large and as a result the incremental storage footprint dominates the overhead of the initial model parameters. However, we would like to empirically demonstrate that we can overcome this overhead in practical problems. Unfortunately, to demonstrate performance with a very small total storage footprint on a single dataset, an incredibly small autoencoder would then be required to learn a function for the very complex input space from scratch. We demonstrate in Table 3
that transfer learning provides a solution to this problem. By employing unlabeled background knowledge we are able to perform much better at the onset with a small autoencoder. We explore a total resource footprint on top of the size ofequivalent to 200 real examples. This equates to the smallest setting explored in (Lopez-Paz and Ranzato, 2017) and we demonstrate that we are able to achieve state of the art results by initializing only the autoencoder representation with one learned on CIFAR-10. CIFAR-10 is drawn from the same larger database as CIFAR-100, but is non-overlapping.
|Replay Real Storage||200||43.0|
|Replay Recollections - No Transfer||1392||43.7|
|Replay Recollections - CIFAR-10 Transfer||1392||49.7|
|GEM Real Storage||200||48.7|
|GEM Recollections - No Transfer||1392||43.7|
|GEM Recollections - CIFAR-10 Transfer||1392||54.2|
|iCaRL (Rebuffi et al., 2017)||200||43.6|
Why Scalable Recollections Work
Given the impressive performance of the Scalable Recollections Module for supporting the continual lifelong learning for neural networks, we would like to further explore the proposed system to elucidate why it works so well.
Question 4 How do discrete latent codes compare with continuous latent codes for compression?
In Figure 3 we empirically demonstrate that autoencoders with categorical latent variables can achieve significantly more storage compression of input observations at the same average distortion as autoencoders with continuous variables. In this experiment to make the continuous baseline even tougher to beat on the training set, we leverage a standard autoencoder instead of a variational one as it does not add noise to its representation, which would make it harder to reconstruct the original input. See the appendix for details.
Question 5 How do learned methods of compression compare with static compression algorithms?
In Figure 3 we also compare the performance of autoencoders with JPEG, which is a static compression algorithm commonly used in industry. We can see that JPEG performs quite well for low degrees of compression, but scales less gracefully than discrete autoencoder based compression for larger degrees of sample compression. This is because learned compression algorithms have the ability to further customize to the regularities seen in the data than a generic one. More detail is provided in the appendix.
Question 6 Do we see gains in learning efficiency as a result of the index buffer as predicted by Hypothesis 1?
The recollection module must not only provide a means of compressing the storage of experiences in a scalable way, but also a mechanism for efficiently sampling recollections so that they are truly representative of prior experiences.
|Latent Representation||Sampling Strategy||Reconstruction Distortion||Nearest Neighbor Distortion|
|38 2d variables||Code Sampling||0.058||0.074|
|168 2d variables||Code Sampling||0.021||0.081|
We first consider the typical method of sampling a variational autoencoder, which we will refer to as code sampling, where each latent variable is selected randomly. Obviously, by increasing the capacity of the autoencoder we are able to achieve lower reconstruction distortion. However, interestingly, we find that while increasing the autoencoder capacity increases modeling power, it also increases the chance that a randomly sampled latent code will not be representative of those seen in the training distribution. Instead, we maintain an index buffer of indexes associated with prior experiences. Let us call sampling from the index buffer, buffer sampling. Table 4 shows a comparison of code and buffer sampling for two different latent variable representation sizes. The reconstruction distortion is the error in reconstructing the recollection using the decoder. The nearest neighbor distortion is the distance from the sampled code to its nearest neighbor in the training set. We can see that for the same reconstruction distortion, the buffer approach yields a significantly smaller nearest neighbor distortion. This means that the buffer sampling produces a more representative sample than code sampling.
How much does this matter in practice? Figure 4 demonstrates its utility through a knowledge distillation experiment. Here we compare the two representation sizes and approaches at the task of distilling a CNN teacher model into a student model of the same architecture through the latent codes. That is the student is trained on the reconstructed data using the teacher output as a label. By far the best learning curve is obtained using buffer sampling. We would like to emphasize that these results are not a byproduct of increased model capacity associated with the buffer: the small representation with the buffer significantly outperforms the big representation with code sampling despite 7.4x fewer total bits of storage including the model parameters and buffer. In the appendix we include comprehensive experiments showing that distillation based on buffer sampling with a discrete latent code VAE is even more efficient than storing real examples.
Question 7 Does recollection based self-stabilization of the autoencoder lead to effective continual lifelong learning?
We validate our recollection module training procedure by demonstrating that recollections generated by an autoencoder model can actually be effective in preventing catastrophic forgetting for the very same model. As shown in Figure 5 for continual learning on CIFAR-100 with number of steps and an effective incremental buffer size of an average of two items per class, the recollection module is very similarly effective to real storage for stabilizing the lifelong autoencoder. The negative effects of the less effective synthetic examples are apparently drowned out by the positive effects of a larger diversity of stored examples.
Question 8 Can recollection efficiency improve over time?
We explore this question in Figure 6 where we consider online training of the model with a random initialization and no buffer (Online) and offline training with random initialization and full data storage (Offline) trained over 100 iterations. Predictably, access to unlimited storage and all of the tasks simultaneously means that the performance of Offline is consistently better than Online. To demonstrate the value of transfer from a good representation in the continual learning setting, we additionally plot an online model with no replay buffer and a representation initialized after training for 100 iterations on CIFAR-10 (Transfer Online). Transfer adds significant value, performing comparably to the randomly initialized model with access to all tasks simultaneously and unlimited storage (Offline). In fact, it performs considerably better for the first few tasks where the number of prior experiences is much greater than the number of new experiences. Improvements from transfer learning thus have a substantial effect in stabilizing as well as demonstrated in Table 3.
We have proposed and experimentally validated a general purpose Scalable Recollection Module that is designed to scale for very long time-frames. We have demonstrated superior performance over other state-of-the-art approaches for lifelong learning using very small incremental storage footprints. These increases can be dramatically boosted with unsupervised recollection module pre-training. We have shown that VAEs with categorical latent variables significantly outperform those with continuous latent variables (and even JPEG) for lossy compression. Finally, we have also shown that maintaining an explicit buffer is key to capturing the distribution of previously seen samples and generating realistic recollections needed to effectively prevent forgetting.
- Al-Shedivat et al. (2017) Al-Shedivat, M.; Bansal, T.; Burda, Y.; Sutskever, I.; Mordatch, I.; and Abbeel, P. 2017. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641.
Bengio et al. (2009)
Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J.
Proceedings of the 26th annual international conference on machine learning, 41–48. ACM.
- Blundell et al. (2016) Blundell, C.; Uria, B.; Pritzel, A.; Li, Y.; Ruderman, A.; Leibo, J. Z.; Rae, J.; Wierstra, D.; and Hassabis, D. 2016. Model-free episodic control. arXiv preprint arXiv:1606.04460.
Bouneffouf and Birol (2015)
Bouneffouf, D., and Birol, I.
Sampling with minimum sum of squared similarities for nystrom-based large scale spectral clustering.
- Chen et al. (2017) Chen, X.; Kingma, D. P.; Salimans, T.; Duan, Y.; Dhariwal, P.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2017. Variational lossy autoencoder. ICLR.
- Fernando et al. (2017) Fernando, C.; Banarse, D.; Blundell, C.; Zwols, Y.; Ha, D.; Rusu, A. A.; Pritzel, A.; and Wierstra, D. 2017. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734.
- Graves et al. (2017) Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.
- Gumbel (1954) Gumbel, E. J. 1954. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office.
- Jang, Gu, and Poole (2017) Jang, E.; Gu, S.; and Poole, B. 2017. Categorical reparameterization with gumbel-softmax. ICLR.
- Kaiser et al. (2017) Kaiser, Ł.; Nachum, O.; Roy, A.; and Bengio, S. 2017. Learning to remember rare events. arXiv preprint arXiv:1703.03129.
- Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Auto-encoding variational bayes. ICLR.
- Kingma et al. (2016) Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; and Welling, M. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 4743–4751.
- Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 201611835.
- Kulesza, Taskar, and others (2012) Kulesza, A.; Taskar, B.; et al. 2012. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3):123–286.
- Kumaran, Hassabis, and McClelland (2016) Kumaran, D.; Hassabis, D.; and McClelland, J. L. 2016. What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in cognitive sciences 20(7):512–534.
- Lake et al. (2011) Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the Cognitive Science Society, volume 33.
- LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
- Lee et al. (2018) Lee, J.; Yun, J.; Hwang, S.; and Yang, E. 2018. Lifelong learning with dynamically expandable networks. ICLR.
- Li and Hoiem (2016) Li, Z., and Hoiem, D. 2016. Learning without forgetting. In European Conference on Computer Vision.
Self-improving reactive agents based on reinforcement learning, planning and teaching.Machine learning 8(3-4):293–321.
- Lopez-Paz and Ranzato (2017) Lopez-Paz, D., and Ranzato, M. 2017. Gradient episodic memory for continuum learning. NIPS.
Maddison, Mnih, and Teh (2017)
Maddison, C. J.; Mnih, A.; and Teh, Y. W.
The concrete distribution: A continuous relaxation of discrete random variables.ICLR.
- Maddison, Tarlow, and Minka (2014) Maddison, C. J.; Tarlow, D.; and Minka, T. 2014. A* sampling. In Advances in Neural Information Processing Systems, 3086–3094.
- McClelland, McNaughton, and O’Reilly (1995) McClelland, J. L.; McNaughton, B. L.; and O’Reilly, R. C. 1995. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review 102(3):419.
- McCloskey and Cohen (1989) McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation.
- Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
- Murre (1992) Murre, J. M. 1992. Learning and categorization in modular neural networks.
- Pritzel et al. (2017) Pritzel, A.; Uria, B.; Srinivasan, S.; Badia, A. P.; Vinyals, O.; Hassabis, D.; Wierstra, D.; and Blundell, C. 2017. Neural episodic control. In International Conference on Machine Learning, 2827–2836.
- Ramapuram, Gregorova, and Kalousis (2017) Ramapuram, J.; Gregorova, M.; and Kalousis, A. 2017. Lifelong generative modeling. arXiv preprint arXiv:1705.09847.
Rebuffi, Kolesnikov, and
Rebuffi, S.; Kolesnikov, A.; and Lampert, C.
icarl: Incremental classifier and representation learning.CVPR.
- Riemer, Khabiri, and Goodwin (2016) Riemer, M.; Khabiri, E.; and Goodwin, R. 2016. Representation stability as a regularizer for improved text analytics transfer learning. arXiv preprint arXiv:1704.03617.
- Ring (1994) Ring, M. B. 1994. Continual learning in reinforcement environments. Ph.D. Dissertation, University of Texas at Austin Austin, Texas 78712.
- Robins (1995) Robins, A. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7(2):123–146.
- Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
- Schaul et al. (2015) Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.
- Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, 2994–3003.
- Teyler and DiScenna (1986) Teyler, T. J., and DiScenna, P. 1986. The hippocampal memory indexing theory. Behavioral neuroscience 100(2):147.
- Teyler and Rudy (2007) Teyler, T. J., and Rudy, J. W. 2007. The hippocampal indexing theory and episodic memory: updating the index. Hippocampus 17(12):1158–1169.
- Thrun (1994) Thrun, S. 1994. Lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, volume 1, 23–30.
- Thrun (1996) Thrun, S. 1996. Is learning the n-th thing any easier than learning the first? Advances in neural information processing systems 640–646.
- Vitter (1985) Vitter, J. S. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS).
Yang and Hospedales (2017)
Yang, Y., and Hospedales, T.
Deep multi-task representation learning: A tensor factorisation approach.ICLR.
- Zhang, Kjellstrom, and Mandt (2017) Zhang, C.; Kjellstrom, H.; and Mandt, S. 2017. Stochastic learning on imbalanced data: Determinantal point processes for mini-batch diversification. arXiv preprint arXiv:1705.00607.
- Zhu and Bento (2017) Zhu, J.-J., and Bento, J. 2017. Generative adversarial active learning. arXiv preprint arXiv:1702.07956.
Appendix A Recollection Module Implementation
The key role of the scalable recollections module is to efficiently facilitate the transfer of knowledge between two neural models. In this section we argue that the recently proposed discrete latent variable variational autoencoders (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) are ideally suited to implement the encoder/decoder functionality of the recollection module.
Typical experience storage strategies store full experiences which can be expensive. For example, to store 32x32 CIFAR images with 3 color channels and 8-bits per pixel per channel will incur a cost per image stored of 8x3x32x32 = 24,576 bits. Deep non-linear autoencoders are a natural choice for compression problems. An autoencoder with a continuous latent variable of size , assuming standard 32-bit representations used in modern GPU hardware, will have a storage cost of 32 bits for each latent representation. Unfortunately, continuous variable autoencoders which use 32-bits for their network parameters may incur an unnecessary storage cost for many problems – especially in constrained-resource settings.
A solution which combines the benefit of VAE training with an ability to explicitly control precision is the recently proposed VAE with categorical latent variables (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017). Here we consider a bottleneck representation between the encoder and decoder with categorical latent variables each containing dimensions representing a one hot
encoding of the categorical variable. This can be compressed to a binary representation ofbits.
Discrete VAE Implementation
In order to model an autoencoder with discrete latent variables, we follow the success of recent work (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) and employ the Gumbel-Softmax function. The Gumbel-Softmax function leverages the Gumbel-Max trick (Gumbel, 1954; Maddison, Tarlow, and Minka, 2014) which provides an efficient way to draw samples
from a categorical distribution with class probabilitiesrepresenting the output of the encoder:
In equation 1, is a sample drawn from Gumbel(0,1), which is calculated by drawing from Uniform(0,1) and computing =-log(-log()). The function quantizes its input into a one hot vector. The softmax function is used as a differentiable approximation to argmax, and we generate -dimensional sample vectors with temperature in which:
The Gumbel-Softmax distribution is smooth for 0, and therefore has a well-defined gradient with respect to the parameters . During forward propagation of the categorical autoencoder, we send the output of the encoder through the sampling procedure of equation 1
to create a categorical variable. However, during backpropagation we replace non-differentiable categorical samples with a differentiable approximation during training using the Gumbel-Softmax estimator in equation2. Although past work (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) has found value in varying over training, we still were able to get strong results keeping fixed at 1.0 across our experiments. Across all of our experiments, our generator model includes three convolutional layers in the encoder and three deconvolutional layers in the decoder.
One issue when deploying an autoencoder is the determination of the latent code size hyperparameter. In the next section we will derive a maximization procedure that we use throughout our experiments to find the optimal autoencoder latent variable size to use for a given resource constraint.
Details on GEM Integration
Appendix B Optimizing Latent Code Size for a Resource Constraint
The ability of a discrete variational autoencoder to memorize inputs should be strongly related to the effective bottleneck capacity , which we define, for discrete latent variables, as:
Incremental Storage Resource Constraints
First, let us consider the dynamics of balancing resources in a simple setting where we have an incremental storage constraint for new incoming data without regard for the size of the model used to compress and decompress recollections. We refer to the total storage constraint over all incoming examples as and the average storage rate limit as . We can then define as the probability that an incoming example is stored in memory. Thus, the expected number of bits required per example stored is , assuming simple binary encoding. If we treat as fixed, we can then define the following optimization procedure to search for a combination of and that maximizes capacity while fulfilling an incremental resource storage constraint:
which yields the approximate solution . As seen in equation 4, there is an inherent trade-off between between the diversity of experiences we store governed by and the distortion achieved that is related to the capacity. The optimal trade-off is likely problem dependent. Our work takes a first step at trying to understand this relationship. For example, we demonstrate that deep neural networks can see improved stabilization in resource constrained settings by allowing for some degree of distortion. This is because of an increased ability to capture the diversity in the data at the same incremental resource constraint.
Total Storage Resource Constraints
In some ways, the incremental storage constraint setting described in the previous section is not the most rigorous setting when comparing recollections to a selected subset of full inputs. Another important factor is the number of parameters in the model used for compression and decompression. generally is also to some degree a function of and . For example, in most of our experiments, we use the same number of hidden units at each layer as used in the bottleneck layer. With fully connected layers, this yields . As such, we can revise equation 4 to handle a more rigorous constraint for optimizing a discrete latent variable autoencoder architecture:
While this setting is more rigorous when comparing to lossless inputs, it is a somewhat harsh restriction with which to measure lifelong learning systems. This is because it is assumed that the compression model’s parameters should be largely transferable across tasks. To some degree, these parameters can be viewed as a sunk cost from the standpoint of continual learning. In our experiments, we also look at transferring these representations from related tasks to build a greater understanding of this trade-off.
Appendix C Additional Details on Experimental Protocol
Each convolutional layer has a kernel size of 5. As we vary the size of our categorical latent variable across experiments, we in turn model the number of filters in each convolutional layer to keep the number of hidden variables consistent at all intermediate layers of the network. In practice, this implies that the number of filters in each layer is equal to . We note that the discrete autoencoder is stochastic, not deterministic and we just report one stochastic pass through the data for each experimental trial. In all of our knowledge distillation experiments, we report an average result over 5 runs. For MNIST and Omniglot we follow prior work and consider 28x28 images with 1 channel and 8-bits per pixel. MNIST and Omniglot images were originally larger, but others have found the down sampling to 28x28 does not effect performance of models using it to learn.
Distortion as a Function of Compression Experiments
More detail about the architecture used in these experiments are provided for categorical latent variables in Table 5 and for continuous latent variables in Table 6. For each architecture we ran with a learning rate of 1e-2, 1e-3, 1e-4, and 1e-5, reporting the option that achieves the best training distortion. For the distortion, the pixels are normalized by dividing by 255.0 and we take the mean over the vector of the absolute value of the reconstruction to real sample difference and then report the mean over the samples in the training set. Compression is the ratio between the size of an 8bpp MNIST image and the size of the latent variables, assuming 32 bits floating point numbers in the continuous case and the binary representation for the categorical variables. The JPEG data points were collected using the Pillow Python package using quality 1, 25, 50, 75, and 100. We subtracted the header size form the JPEG size so it is a relatively fair accounting of the compression for a large data set of images all of the same size. The JPEG compression is computed as an average over the first 10,000 MNIST training images.
Continual Lifelong Learning Experiments
In our experiments on MNIST-Rotations we found it optimal to set the GEM learning rate to 0.1 and memory strength to 0.5. In our recollections experiments, we used an autoencoder with 104 4d variables for the size 100 buffer and one with 139 8d variables for the size 200 buffer. For experience replay, we used a learning rate of 0.01 for the predictive model and a learning rate of 0.001 for the autoencoder. For the size 100 buffer we experienced more over-fitting to the buffer and used a batch size of 5. For the size 200 buffer we used a batch size of 25.
Our categorical latent variable autoencoders had the following sizes for Incremental CIFAR-100: 48 2d variables for an effective buffer size of 10, 98 2d variables for an effective buffer size of 20, 244 2d variables for an effective buffer size of 50, 294 2d variables for an effective buffer size of 60, and 620 3d variables for an effective buffer size of 200. The predictive model was trained with a learning rate of 1e-3 in all of our replay experiments and 0.1 in all of our GEM experiments. The learning rate for the autoencoder was 1e-4 for the buffer size of 200 regardless of the lifelong learning model and buffer size of 20 with GEM. The autoencoder learning rate was 1e-3 for replay with buffer sizes 10 and 50 as well as GEM with buffer size 60. The GEM memory strength parameter was set to 0.1 for our real storage experiments and 0.01 for our experiments with recollection. In our experiments with GEM and with replay on MNIST-Rotations, we were able to get very good results even just training our autoencoder online and not leveraging the recollection buffer to stabilize its training. We could have seen further improvements, as we show on CIFAR-100, by using the buffer for stabilization.
For incremental Omniglot our learning rate was set to 1e-3. For the effective buffer size of 50 experiments, we leveraged a categorical latent variable autoencoder with 312 2d variables. For the effective buffer size of 10 experiments, we utilized a categorical latent variables consisting of 62 2d variables. We follow 90% multi-task training and 10% testing splits for Omniglot established in (Yang and Hospedales, 2017).
During the transfer learning experiments from CIFAR-10, for replay a learning rate of 1e-3 was used for the Resnet-18 reasoning model and a learning rate of 3e-4 was used for the discrete autoencoder generator. For GEM, we used a learning rate of 0.1 for the resnet model, a learning rate of 1e-3 for the autoencoder, and a memory strength of 0.1. For the experiment without transfer learning, we instead used a higher learning rate of 1e-3 for the replay autoencoder. We used a learning rate of 1e-4 for the autoencoder in our transfer learning GEM experiments.
Detailed Retention Results
We provide a detailed version of Figure 2 in the main text in Figure 7. It includes learning for larger real storage buffer sizes as a comparison. For example, a six times larger real storage buffer looses knowledge significantly faster than scalable recollections despite better performance when originally training on Incremental CIFAR-100.
Appendix D Scalable Recollections for Distillation
Question 9 Can Scalable Recollections retain knowledge so that it is transferable to new neural network instantiations?
In our experiments, we train a teacher model with a LeNet (LeCun et al., 1998)convolutional neural network (CNN) architecture on the popular MNIST benchmark, achieving 99.29% accuracy on the test set. We would like to test whether recollections drawn from our proposed recollection module are sufficient input representations for the teacher neural network to convey its function to a seperate randomly initialized student neural network of the same size. In Table 7 we validate the effectiveness of our technique by comparing it to some episodic storage baselines of interest. As baselines we consider training with the the same number of randomly sampled real examples, using real input and the teacher’s output vector as a target, and using random sampling to select a subset of real examples to store. When training with a large number of memories for a more complete knowledge transfer, the recollection compression clearly shows dividens over random sampling baselines. This is impressive particularly because these results are for the stricter total storage footprint compression setting where we account for the autoencoder model parameters and on a per sample basis the compression is actually 37x, 101x, and 165x.
We also would like to validate these findings in a more complex setting for which we consider distillation with outputs from a 50 task Resnet-18 teacher model that gets 94.86% accuracy on Omniglot. We test performance after one million training episodes, which is enough to achieve teacher performance using all of the real training examples. However, sampling diversity restricts learning significantly, for example, achieving 28.87% accuracy with 10% sampling, 8.88% with 2% sampling, and 5.99% with 1% sampling. In contrast the recollection module is much more effective, achieving 87.86% accuracy for 10x total resource compression, 74.03% accuracy for 50x compression, and 51.45% for 100x compression. Later we also demonstrate that these results generalize to mismatched architectures for the student and teacher. Moreover, in the we will also show that by using heuristics to explore the buffer instead of random sampling we can transfer knowledge even more efficiently to a new model than with randomly drawn real examples. Indeed, it is clear that Scalable Recollections provide a general purpose reservoir of knowledge that is efficiently transferable to anything ranging from old network instantiations of the same model (in which case it reinforces prior learning) to totally new and different architectures (looking to obtain this knowledge from scratch).
MNIST Generative Distillation Experiments
Alongside the teacher model, we train an variational autoencoder model with discrete latent variables. Each model is trained for 500 epochs. During the final pass through the data, we forward propogate through each training example and store the latent code in an index buffer. This buffer eventually grows to a size of 50,000. After training is complete, the index buffer is used as a statistical basis for sampling diverse recollections to train a student network. A logical and effective strategy for training a student model is to sample randomly from this buffer and thus capture the full distribution. For all of our distillation experiments we ran the setting with a learning rate of 1e-3 and 1e-4, reporting the best result. We found that the higher learning rate was beneficial in setting with a low number of examples and the lower learning rate was beneficial in setting with a larger number of examples. The categorical latent variable autoencoders explored had the following representation sizes: 168 2d variables for 10x compression, 62 2d variables for 50x compression, and 38 2d variables for 100x compression. For our code sampling baselines, we used the numpy random integer function to generate each discrete latent variable.
Omniglot Generative Distillation Experiment
The learning rate for the Resnet-18 reasoning model was 1e-4 in our experiments. Our trained discrete autoencoder models were of the following representation sizes: 32 variables of size 2 for 100x compression, 50 variables of size 2 for 50x compression, and 134 variables of size 2 for 10x compression. We follow 90% multi-task training and 10% testing splits for Omniglot established in (Yang and Hospedales, 2017).
CNN to MLP Distillation Results
In Figure 8 we explore generative knowledge distillation transferring knowledge from a CNN teacher network to a MLP student network.
Automated Generative Curriculum Learning
|Episodes||Real||Real||Active 10x||Active 100x||Active & Diverse||Active & Diverse|
|Data||Teacher||Compress||Compress||10x Compress||100x Compress|
While random sampling from a buffer can be very effective, we would like to further maximize the efficiency of distilling knowledge from a teacher model to a student model. This motivates the automated curriculum learning setting (Bengio et al., 2009) as recently explored for multi-task learning in (Graves et al., 2017) or rather automated generative curriculum learning in our case. We tried some simple reinforcement learning solutions with rewards based on (Graves et al., 2017)
but were unsuccessful in our initial experiments because of the difficulty of navigating a complex continuous action space. We also tried an active learning formulation proposed for GANs to learn the best latent code to sample(Zhu and Bento, 2017) at a given time. We had limited success with this strategy as well as it tends to learn to emphasize regions of the latent space that optimize incorrectness, but no longer capture the distribution of inputs.
Designing generative sampling heuristics. Inspired by these findings, we instead employ simple sampling heuristics to try to design a curriculum with prototypical qualities like responsiveness to the student and depth of coverage. We model responsiveness to the student as active sampling by focusing on examples where the student does not have good performance. We randomly sample latent codes using our recollection buffer and choose the one that is most difficult for the current student for backpropagation by cheaply forward propagating through the student for each. By sampling from the recollection buffer, we are able to ensure our chosen difficult samples are still representative of the training distribution. We set to 10 in our experiments so the sampling roughly equates to sampling once from the most difficult class for the student model at each point in time. We model depth of coverage by sampling a bigger batch of random examples and adding a filtering step before considering difficulty. We would like to perform diverse sampling that promotes subset diversity when we filter from examples down to examples. One approach to achieving this is a Determinantal Point Process (DPP) (Kulesza, Taskar, and others, 2012) as recently proposed for selecting diverse neural network mini-batches (Zhang, Kjellstrom, and Mandt, 2017). We use the dot product of the inputs as a measure of similarity between recollections and found the DPP to achieve effective performance as a diverse sampling step. However, we follow (Bouneffouf and Birol, 2015) and use a process for sampling based on the sum of the squared similarity matrix as outlined in the next section. We found the sum of the squared similarity matrix to be equally effective to the determinant and significantly more scalable to large matrices. We also set to 10 in our experiments.
Minimum Sum of Squared Similarities
This algorithm is trying to find a new landmark point that maximizes the determinant by finding a point that minimizes the sum of squared similarities (MSSS). The MSSS algorithm initially randomly chooses two points from the dataset . It then computes the sum of similarities between the sampled points and a subset, , selected randomly from the remaining data points. The point with the smallest sum of squared similarities is then picked as the next landmark data point. The procedure is repeated until a total of landmark points are picked.