Introduction
A long-held dream of the AI community is to build a machine capable of operating autonomously for long periods, or even indefinitely. Such a machine must necessarily learn and adapt to a changing environment and, crucially, manage memories of what it has learned for the future tasks it will encounter. A spectrum of learning scenarios is available depending on problem requirements. In lifelong learning (Thrun, 1996) the machine is presented a sequence of tasks and must use knowledge learned from previous tasks to perform better on the next. In the resource-constrained lifelong learning setting, the machine is restricted to a small buffer of previous experiences. Some approaches to lifelong learning assume that a task is a set of examples chosen from the same distribution (Rusu et al., 2016; Fernando et al., 2017; Shin et al., 2017; Ramapuram, Gregorova, and Kalousis, 2017; Al-Shedivat et al., 2017; Lee et al., 2018). If instead the machine is given a sequence of examples without any batching, the setting is called continual learning. In this paper we focus on this more challenging continual learning scenario.
Continual learning (Thrun, 1994; Ring, 1994; Thrun, 1996)
has three main requirements: (1) continually adapting in a nonstationary environment; (2) retaining memories that are useful; and (3) managing compute and memory resources over a long period of time. Most neural network research has focused on methods to improve (1) and (2). In this paper we consider (3) as well, and further investigate the role of efficient experience storage in avoiding the catastrophic forgetting
(McCloskey and Cohen, 1989) problem that makes (2) so challenging. Experience memory has been influential in many recent approaches. For example, experience replay (Lin, 1992) was integral in helping to stabilize the training of Deep Q-Learning on Atari games (Mnih et al., 2015). Episodic storage mechanisms (Schaul et al., 2015; Blundell et al., 2016; Pritzel et al., 2017; Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017)
were also some of the earliest solutions to the catastrophic forgetting problem in the supervised learning setting
(Murre, 1992; Robins, 1995). Unlike approaches that simply focus on remembering representations of old tasks (Li and Hoiem, 2016; Riemer, Khabiri, and Goodwin, 2016; Kirkpatrick et al., 2017), episodic storage techniques achieve superior performance because of their ability to continually improve on old tasks as useful information is learned later (Lopez-Paz and Ranzato, 2017). All of these techniques try to use stored experiences to stabilize learning. However, they do not consider agents that must operate independently in the world for a long time. In this scenario, assuming the kind of high-dimensional data that makes up human experience, the efficient storage of experiences becomes an important factor. Storing full experiences in memory, as these methods do, causes storage costs to scale linearly with the number of experiences. To truly learn over a massive number of experiences in a nonstationary environment, the incremental cost of adding experiences to memory must be sublinear in the number of experiences.
In this paper we propose a scalable experience memory module which learns to adapt to a nonstationary (we assume the environment is nonstationary, but not adversarial, so past experiences can be helpful for learning future tasks)
environment and improve itself over time. The memory module is implemented using a variational autoencoder which learns to compress high-dimensional experiences into a compact latent code for storage. This code can then be used to reconstruct realistic recollections, both for experience replay training and for improvement of the memory module itself. We demonstrate empirically that the module achieves sublinear scaling with the number of experiences and provides a useful basis for a realistic continual learning system. Our experiments show superior performance over state-of-the-art approaches for lifelong learning with a very small incremental storage footprint.
Related Work
Storing Parameters Instead of Experiences. Our method is complementary to recent work leveraging episodic storage to stabilize learning (Mnih et al., 2015; Blundell et al., 2016; Pritzel et al., 2017; Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017). Some recently proposed methods for lifelong learning do not store experiences at all, instead recording the parameters of a network model for each task (Rusu et al., 2016; Kirkpatrick et al., 2017; Fernando et al., 2017). This yields linear (or sometimes worse) storage cost scaling with respect to the number of tasks. For our experiments, and in most settings of long-term interest for continual learning, the storage cost of these extra model parameters per task significantly exceeds the per-task size of a corresponding experience buffer.
Generative Models to Support Lifelong Learning. Pseudorehearsal (Robins, 1995) is a related approach for preventing catastrophic forgetting that, unlike our recollection module, does not require explicit storage of codes. Instead, it learns a generative experience model alongside the main model. For simple learning problems, very crude approximations of the real data, such as randomly generated data from an appropriate distribution, can be sufficient. However, for complex problems like those found in NLP and computer vision, with highly structured, high-dimensional inputs, more refined approximations are needed to stimulate the network with relevant old representations. To the best of our knowledge, we are the first to consider variational autoencoders
(Kingma and Welling, 2014) as a method of creating pseudo-experiences to support supervised learning. Some recent work (Ramapuram, Gregorova, and Kalousis, 2017) considers the problem of generative lifelong learning for a variational autoencoder, introducing a modified training objective. This is potentially complementary to our contributions in this paper.

Biological Inspiration and Comparisons. Interestingly, the idea of scalable experience storage has a biologically inspired motivation relating back to the pioneering work of McClelland, McNaughton, and O’Reilly (1995), who hypothesized complementary learning dynamics for the hippocampus and neocortex. In this theory, updated in (Kumaran, Hassabis, and McClelland, 2016), the hippocampus is responsible for fast learning, providing a very plastic representation for retaining short-term memories. Because the neocortex, responsible for reasoning, would otherwise suffer from catastrophic forgetting, the hippocampus also plays a key role in generating approximate recollections (experience memories) to interleave with incoming experiences, stabilizing the learning of the neocortex. Our approach follows hippocampal memory indexing theory (Teyler and DiScenna, 1986; Teyler and Rudy, 2007), approximating this role of the hippocampus with a modern deep neural network model. As this theory suggests, lifelong learning systems need both a mechanism of pattern completion (providing a partial experience as a query for a full stored experience) and pattern separation (maintaining separate indexable storage for each experience). As in the theory, we do not literally store previous experiences, but rather compact indexes which can be used to retrieve the experience from an association cortex, modeled as an autoencoder.
Storing Few Experiences. Recent work on continual lifelong learning in deep neural networks (Rebuffi, Kolesnikov, and Lampert, 2017; Lopez-Paz and Ranzato, 2017) has focused on the resource-constrained lifelong learning problem and how to promote stable learning with a relatively small diversity of prior experiences stored in memory. In this work, we complete the picture by also considering the fidelity of the prior experiences stored in memory. We achieve this by imposing an additional resource constraint on the number of bits of storage allowed for each experience.
The Scalable Recollection Module (SRM)
The core of our approach is an architecture which supports scalable storage and retrieval of experiences, as shown in Figure 1. There are three primary components: an encoder, an index buffer, and a decoder. When a new experience is received (in the figure, an image of the numeral “6”), the encoder compresses it to a sequence of discrete latent codes (one-hot vectors). These codes are concatenated and further compressed into a short binary code, or “index” (shown in decimal in the figure). This compressed code is then stored in the index buffer. This path is shown in blue. Experiences are retrieved by sampling a code from the index buffer and passing it through the decoder to create an approximate reconstruction of the original input. This path is shown in red in the figure.

The recollection buffer is implemented using a discrete variational autoencoder (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017). A discrete variational autoencoder is a generative neural model with two components: an encoder and a decoder. The encoder is trained to take the input (say, an image) and produce a discrete distribution over a set of latent categories which describe that input. To generate data, the discrete distribution is sampled. This can be done in a differentiable way using the so-called “reparameterization trick,” which pushes the (non-differentiable) sampling operation to the front of the network. The decoder takes the sample and decodes it back into the original form of the input (in our example, an image). The VAE can then be used to encode experiences into a compact discrete code and later to generate experiences by sampling codes in the latent space and running them through the decoder.
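To make the storage path concrete, here is a minimal NumPy sketch of the encode-discretize-pack pipeline described above. It is an illustration under our own assumptions, not the paper's implementation: the network layers are omitted, and the function and parameter names (`gumbel_softmax_sample`, `pack_index`, the temperature, the sizes) are hypothetical.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature=1.0, rng=None):
    """Sample a relaxed one-hot vector from categorical logits.

    Adding Gumbel noise moves the (non-differentiable) sampling step to the
    input of a softmax -- the "reparameterization trick" for discrete codes.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    gumbel = -np.log(-np.log(rng.uniform(1e-10, 1.0, size=logits.shape)))
    y = (logits + gumbel) / temperature
    e = np.exp(y - y.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pack_index(one_hots):
    """Concatenate hard one-hot codes into a single integer "index".

    With z latent variables of k categories each, the index needs only
    z * log2(k) bits -- far fewer than the raw input.
    """
    k = one_hots.shape[-1]
    index = 0
    for c in one_hots.argmax(axis=-1):
        index = index * k + int(c)
    return index

def unpack_index(index, z, k):
    """Recover the categorical assignments from a packed index."""
    cats = []
    for _ in range(z):
        cats.append(index % k)
        index //= k
    return list(reversed(cats))
```

At retrieval time, the unpacked categories would be re-expanded to one-hot vectors and passed through the decoder to reconstruct an approximate experience.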
Variational autoencoders have been used in the past to learn generative models of experience (called “pseudorehearsal” in the literature) (Robins, 1995). Our model differs in two respects: first, we use a discrete VAE, which can produce compact codes; second, we maintain a buffer of those codes. Typical applications of VAEs focus on a fixed distribution to be learned and require that the VAE reproduce samples from that distribution with high fidelity. For continual learning it is also important that the VAE adapt to the nonstationary data distribution. Without an index buffer, the VAE’s parameters would quickly adapt to the current input and its past experiences would be forgotten. Additionally, as we will show later, the buffer leads to greater efficiency in generating samples that capture the variation of the distribution seen so far.
Improving Experience Replay Training
The recollection module can be used in many ways in a continual learning setting. In Algorithm 1 we show one approach which we will use later in our experiments.
In this setting the model must learn tasks sequentially from a dataset. At every step it receives a triplet representing the input, task label, and correct output. There are two models to be trained: the recollection module, which consists of a memory index buffer, an encoder, and a decoder; and a predictive task model. Training proceeds in two phases: in the first phase we ensure that the recollection module itself is stabilized against forgetting; in the second phase we stabilize the predictive model.
For the recollection module, we achieve stabilization through a novel extension of experience replay (Lin, 1992). When an incoming example is received, we first sample multiple batches of recollections from the index buffer and decode them into experiences using the current decoder. We then perform several steps of optimization on the encoder and decoder parameters, interleaving the current input example with a different batch of past recollections at each optimization step. For each optimization step, the error for each experience in a batch is computed by encoding that experience into a latent code using the encoder and then decoding it back into an experience to compute the reconstruction error. On the first optimization step, the reconstruction error is computed using the same decoder parameters that were used to create the experiences in the batch. In subsequent steps, those parameters change as the recollection module is stabilized, learning to successfully reconstruct both the old experiences in the buffer and the new experience. In this way the recollection module continues to remember the relevant past experiences in the buffer while integrating new ones.
After the recollection module is trained with its reconstruction loss and learning rate, the predictive model is trained on just one of the recollection sample sets (we arbitrarily chose the first) using its own loss function and learning rate. Finally, the new sample is written to the index buffer. Perhaps surprisingly, this strategy of reconstructing experiences from codes and then performing experience replay training on them can, in some cases, be as effective for enabling continual learning as replaying real, uncompressed inputs.

In the resource-constrained setting we study in this paper, there is an upper limit on the number of allowable episodic memories, which governs the memory update rule. Our approach is fairly robust to changes in this update rule. For experience replay, we maintain the buffer using reservoir sampling (Vitter, 1985). In contrast, for the recently proposed Gradient Episodic Memory (GEM) algorithm, which modulates gradients on incoming examples by solving a quadratic program with respect to past examples, we follow prior work and keep an equal number of the most recent examples for each task. As detailed in the appendix, integrating Scalable Recollections across algorithms is quite easy; the remaining differences between the experience replay and GEM algorithms with Scalable Recollections are confined to the different ways each lifelong learning algorithm utilizes episodic memory. We note that the PROJECT function within GEM solves a quadratic program explained in (Lopez-Paz and Ranzato, 2017).
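The buffer maintenance and replay loop described above can be sketched as follows. This is a schematic illustration under our own assumptions, not the authors' code: the `IndexBuffer` and `train_step` names are ours, the gradient updates are elided as a comment, and `encode`/`decode` stand in for the discrete VAE.

```python
import random

class IndexBuffer:
    """Fixed-capacity buffer of compressed codes maintained with reservoir
    sampling (Vitter, 1985), so it holds a uniform sample of the stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.codes = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def write(self, code):
        """Keep each of the n_seen codes with equal probability."""
        self.n_seen += 1
        if len(self.codes) < self.capacity:
            self.codes.append(code)
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.codes[j] = code

    def sample(self, batch_size):
        return [self.rng.choice(self.codes) for _ in range(batch_size)]

def train_step(x, buffer, encode, decode, n_steps=3, batch_size=8):
    """One incoming example: decode several batches of past codes into
    recollections, interleave each with the new input (one batch per
    optimization step), then store the new example's code."""
    batches = [buffer.sample(batch_size) for _ in range(n_steps)] if buffer.codes else []
    interleaved = [[decode(c) for c in batch] + [x] for batch in batches]
    # ... gradient updates on the encoder/decoder (and then on the
    # predictive model with one of the batches) would go here ...
    buffer.write(encode(x))
    return interleaved
```

The reservoir update is what keeps the buffer an unbiased sample of the whole nonstationary stream, rather than only its most recent portion.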
Recollection Efficiency
In this section we argue that it is more efficient to employ an index buffer than to sample directly from the latent VAE code space. The difference comes down to how well each approach recreates the input distribution. A typical strategy is to randomly sample old experiences from episodic storage and interleave them with new ones to stabilize learning. To the extent that the recollections are representative of the full distribution of prior experiences, learning proceeds exactly as if samples were drawn from a stationary i.i.d. distribution.
However, when sampling from the code space of the VAE without a buffer, the sample distribution is unlikely to match that of the experiences seen during training. If the number of examples is larger than the number of possible codes (the capacity of the VAE), then the VAE will be unable to differentiate truly different images and hence will have poor reconstruction. This scenario must be avoided for performance reasons. On the other hand, if the VAE capacity is considerably larger than the number of training examples, then sampling it at random is highly unlikely to reproduce the input distribution.
To alleviate this problem, a buffer can be added whose distribution of codes will match the distribution of the input data codes, if it is sufficiently large. Intuitively, setting the buffer size to be less than the capacity of the VAE will ensure that sampling from the buffer is more efficient than sampling directly from the VAE code space. Our experiments empirically support this hypothesis (Question 6):
Hypothesis 1 An index buffer is a more efficient parameterization of the input space seen by a variational autoencoder than its latent code if

log2(N) < z · log2(s),

where s is the autoencoder latent variable size (i.e. 32 for typical continuous latent variables), z is the number of latent variables in the autoencoder (i.e. the number of hidden units for continuous latent variables), and N is the index buffer size. Equivalently, the inequality states that the buffer size N is smaller than the capacity s^z of the latent code space, so an index into the buffer costs fewer bits than a latent code.
In the vast majority of settings relevant to the study of continual lifelong learning today, it is safe to assume that this inequality will hold. For example, even if we store an index for every example in a dataset like MNIST or CIFAR, the inequality will hold unless a continuous latent variable autoencoder has 3 or fewer latent variables. To put it another way, the per-sample compression would have to be over 400X for popular datasets like Omniglot and MNIST, and well over 1500X for CIFAR. These levels of compression with good reconstruction are far beyond what is possible with any known tools today. Moreover, given, as an example, the blurry nature of CIFAR images, it is possible that this compression quality is simply infeasible. In practice, due to natural redundancy in the space of samples, it is more likely that the buffer size will also only need to be significantly less than the number of examples seen. In our work we find that selecting subsets works well, although a buffer may be even more efficient with online clustering strategies as in (Kaiser et al., 2017).
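Reading Hypothesis 1's inequality as log2(N) < z · log2(s), the reading consistent with the worked numbers in this paragraph, a few lines of arithmetic reproduce the claims; the function name and the MNIST sizes below are our own illustrative choices.

```python
import math

def buffer_beats_code(n_buffer, n_vars, var_size):
    """Hypothesis 1: an index (log2 N bits) is cheaper than a latent code
    (z * log2 s bits) exactly when log2(N) < z * log2(s), i.e. N < s**z."""
    return math.log2(n_buffer) < n_vars * math.log2(var_size)

# Storing an index for every MNIST training example (~60,000 indexes) with
# 32-valued continuous latents: the inequality fails only for z <= 3.
print(buffer_beats_code(60_000, n_vars=3, var_size=32))   # False
print(buffer_beats_code(60_000, n_vars=4, var_size=32))   # True

# Per-sample compression achieved by the index itself: a 28x28 8-bit image
# is 6272 bits, while an index costs only log2(60000) ~ 16 bits (~400X).
print(784 * 8 / math.log2(60_000))
```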
Evaluation
Datasets
Our experiments will primarily focus on three public datasets commonly used for deep lifelong and multitask learning.
MNIST-Rotations: (Lopez-Paz and Ranzato, 2017) A dataset with 20 tasks, each with 1,000 training examples. The tasks are random rotations of the MNIST input space by angles between 0 and 180 degrees.
Incremental CIFAR-100: (Lopez-Paz and Ranzato, 2017) A continual learning split of the CIFAR-100 image classification dataset, treating each of the 20 coarse-grained labels as a task with 2,500 examples each.
Omniglot: A character recognition dataset (Lake et al., 2011) in which we consider each of the 50 alphabets to be a task. This is an even more challenging setting than explored in prior work on continual lifelong learning, with more tasks and fewer examples per class.
Evaluation for Continual Lifelong Learning
In this section we evaluate the benefits of the Scalable Recollection Module in enabling continual lifelong learning.
Metric: As in prior work, we measure retention as our key metric. It is defined as the test set accuracy on all tasks after sequential training has been completed over each task.
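As a small sanity check of how we read this definition, retention is simply the mean of per-task test accuracies collected after training on the final task; the function below is our own trivial sketch, not code from the paper.

```python
from statistics import mean

def retention(final_task_accuracies):
    """Retention: mean test accuracy across all tasks, measured once after
    sequential training on the last task has finished."""
    return mean(final_task_accuracies)
```

For example, `retention([0.5, 0.7, 0.9])` averages the three per-task accuracies into a single score.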
Architecture: We model our experiments after (Lopez-Paz and Ranzato, 2017) and use a ResNet-18 model for CIFAR-100 and Omniglot, as well as a two-layer MLP with 200 hidden units for MNIST-Rotations. Across all of our experiments, our autoencoder models include three convolutional layers in the encoder and three deconvolutional layers in the decoder. Each convolutional layer has a kernel size of 5. As we vary the size of our categorical latent variable across experiments, we adjust the number of filters in each convolutional layer to keep the number of hidden variables consistent at all intermediate layers of the network.
Module hyperparameters:
In our experiments we used a binary cross-entropy loss for both training losses. In the appendix we outline a constrained optimization procedure to find the optimal discrete autoencoder latent code design for a given resource footprint constraint. We follow this procedure to derive architectures that can be directly compared to various episodic storage baselines in our experiments.

The key question we consider is the following:
Question 1 Is the recollection module useful in improving retention for continual lifelong learning?
To answer this question we compare the retention performance of a system equipped with the recollection module to one equipped with a buffer of real experiences. We consider three datasets: MNIST-Rotations, CIFAR-100, and Omniglot. To account for the fact that real experiences take up more storage, we give both approaches the same storage budget. Specifically, we define two new quantities: incremental storage and items. Incremental storage is the effective amount of storage used beyond the sunk cost of the initial model parameters. For clarity, we express the effective incremental storage size in terms of the number of real examples that would have the same resource footprint. The number of items, by contrast, refers to the number of items in the episodic storage buffer, whether they are real experiences or approximate recollections. Obviously, by compressing the items stored in the buffer, we are able to store more items at the same effective incremental storage size.
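The accounting behind these two quantities can be illustrated with a back-of-the-envelope calculation; the sizes below (a raw 28x28 8-bit image, a 16-bit index) are assumptions for illustration only and do not correspond to the exact code sizes used in our experiments.

```python
def items_at_same_footprint(budget_real_examples, example_bits, index_bits):
    """Number of compressed indexes that fit in the footprint of a given
    number of real examples (ignoring the fixed autoencoder overhead)."""
    return budget_real_examples * example_bits // index_bits

# Assumed sizes: a raw 28x28 8-bit image (6272 bits) vs. a 16-bit index.
print(items_at_same_footprint(100, example_bits=784 * 8, index_bits=16))  # 39200
```

In other words, the same incremental budget that holds a handful of raw examples can hold orders of magnitude more compressed items.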
In Table 1 we consider learning with a very small incremental resource footprint on MNIST-Rotations, where we allow the effective storage of only one full experience or fewer per class. We find that storing real experiences can perform well in this regime, but the Scalable Recollection Module significantly outperforms it. For example, we achieve considerable improvements over the results reported for the popular elastic weight consolidation (EwC) algorithm (Kirkpatrick et al., 2017) on this benchmark. We note that EwC stores some items for computing the Fisher information. We also note that the large incremental cost of the parameter and Fisher information storage for each task is equal to that of storing a real buffer of over 18 thousand examples.
In Table 2 we show results for Incremental CIFAR-100. This is a significant challenge for the Scalable Recollection Module, since the CIFAR tiny image domain is known to be a particularly difficult setting for VAE performance (Kingma et al., 2016; Chen et al., 2017). We explore settings with a very small incremental resource footprint, including a couple of settings with even less incremental memory allowance than the number of classes. Real storage performs relatively well when the number of examples is greater than the number of classes, but otherwise suffers from sampling biased towards a subset of classes. This can be seen in the decreased performance for small buffer sizes compared to using no buffer at all and learning online. Consistently, we see that tuning the recollection module to approximate recollections with a reasonably sized index buffer results in improvements over real storage at the same incremental resource cost.
To further validate our findings, we also tried the Omniglot dataset, attempting to learn continually in the difficult incremental 50-task setting. With an incremental resource constraint of 10 full examples, replay achieves 3.6% final retention accuracy (online learning produces 3.5% accuracy). In contrast, the recollection module achieves 5.0% accuracy. With an incremental resource footprint of 50 full examples, replay achieves 4.3% accuracy, which improves to 4.8% when taking three gradient descent steps per new example. The recollection module again achieves better performance, with 9.3% accuracy at one step per example and 13.0% accuracy at three steps per example.
Table 1: Retention on MNIST-Rotations.
Method | Incremental Storage | Items | Retention
GEM Real Storage | 100 | 100 | 62.5
GEM Real Storage | 200 | 200 | 67.4
GEM Recollections | 100 | 3000 | 79.0
GEM Recollections | 200 | 3000 | 81.5
Replay Real Storage | 100 | 100 | 63.4
Replay Real Storage | 200 | 200 | 71.3
Replay Recollections | 100 | 3000 | 75.6
Replay Recollections | 200 | 3000 | 81.1
EwC | 18288 | 1000 | 54.6
Online | 0 | 0 | 51.9
Table 2: Retention on Incremental CIFAR-100.
Model | Incremental Storage | Items | Retention
Online | 0 | 0 | 33.3
LwF (Li and Hoiem, 2016) | 0 | 0 | 34.5
Replay Real Storage | 10 | 10 | 29.4
Replay Real Storage | 50 | 50 | 33.4
Replay Real Storage | 200 | 200 | 43.0
Replay Recollections | 10 | 5000 | 39.7
Replay Recollections | 50 | 5000 | 47.9
Replay Recollections | 200 | 5000 | 51.6
GEM Real Storage | 20 | 20 | 23.4
GEM Real Storage | 60 | 60 | 40.6
GEM Real Storage | 200 | 200 | 48.7
GEM Recollections | 20 | 5000 | 52.4
GEM Recollections | 60 | 5000 | 56.0
GEM Recollections | 200 | 5000 | 59.0
Question 2 How does the use of Scalable Recollections influence the long-term retention of knowledge?
The value of using recollections becomes even more apparent for long-term retention of skills. We demonstrate this in Figure 2 by first training models on Incremental CIFAR-100 and then training them for 1 million training examples on CIFAR-10. The number of training examples seen from CIFAR-100 is only 5% of the number seen from CIFAR-10. Not only does the recollection module allow experience replay to generalize more effectively than real storage during initial learning, it also retains knowledge much more gracefully over time. We provide a detailed chart in the appendix that includes learning curves for larger real storage buffer sizes as a comparison. A six times larger real storage buffer loses knowledge much faster than Scalable Recollections, despite better performance during training on CIFAR-100.
Question 3 Can Scalable Recollections overcome the overhead of autoencoder model parameters?
To achieve the greater goals of lifelong learning, we are mostly interested in scaling to conditions where the number of examples seen is very large, so that the incremental storage footprint dominates the overhead of the initial model parameters. However, we would like to empirically demonstrate that we can overcome this overhead in practical problems. Unfortunately, to demonstrate performance with a very small total storage footprint on a single dataset, an incredibly small autoencoder would be required to learn a function for the very complex input space from scratch. We demonstrate in Table 3
that transfer learning provides a solution to this problem. By employing unlabeled background knowledge, we are able to perform much better at the onset with a small autoencoder. We explore a total resource footprint, on top of the size of the predictive model, equivalent to 200 real examples. This equates to the smallest setting explored in (Lopez-Paz and Ranzato, 2017), and we demonstrate that we are able to achieve state-of-the-art results by initializing only the autoencoder representation with one learned on CIFAR-10. CIFAR-10 is drawn from the same larger database as CIFAR-100, but is non-overlapping.

Table 3: Retention on Incremental CIFAR-100 with a total resource footprint equivalent to 200 real examples.
Model | Items | Retention
Replay Real Storage | 200 | 43.0
Replay Recollections (No Transfer) | 1392 | 43.7
Replay Recollections (CIFAR-10 Transfer) | 1392 | 49.7
GEM Real Storage | 200 | 48.7
GEM Recollections (No Transfer) | 1392 | 43.7
GEM Recollections (CIFAR-10 Transfer) | 1392 | 54.2
iCaRL (Rebuffi et al., 2017) | 200 | 43.6
Why Scalable Recollections Work
Given the impressive performance of the Scalable Recollection Module in supporting continual lifelong learning for neural networks, we would like to further explore the proposed system to elucidate why it works so well.
Question 4 How do discrete latent codes compare with continuous latent codes for compression?
In Figure 3 we empirically demonstrate that autoencoders with categorical latent variables can achieve significantly more storage compression of input observations at the same average distortion as autoencoders with continuous variables. In this experiment, to make the continuous baseline even tougher to beat on the training set, we leverage a standard autoencoder instead of a variational one, as it does not add noise to its representation (which would make it harder to reconstruct the original input). See the appendix for details.
Question 5 How do learned methods of compression compare with static compression algorithms?
In Figure 3 we also compare the performance of autoencoders with JPEG, a static compression algorithm commonly used in industry. We can see that JPEG performs quite well at low degrees of compression, but scales less gracefully than discrete autoencoder based compression at larger degrees of sample compression. This is because learned compression algorithms can specialize to the regularities seen in the data in a way a generic algorithm cannot. More detail is provided in the appendix.
Question 6 Do we see gains in learning efficiency as a result of the index buffer as predicted by Hypothesis 1?
The recollection module must not only provide a means of compressing the storage of experiences in a scalable way, but also a mechanism for efficiently sampling recollections so that they are truly representative of prior experiences.
Table 4: Code sampling vs. buffer sampling at two latent representation sizes.
Latent Representation | Sampling Strategy | Reconstruction Distortion | Nearest Neighbor Distortion
38 2-d variables | Code Sampling | 0.058 | 0.074
38 2-d variables | Buffer Sampling | 0.058 | 0.054
168 2-d variables | Code Sampling | 0.021 | 0.081
168 2-d variables | Buffer Sampling | 0.021 | 0.021
We first consider the typical method of sampling a variational autoencoder, which we refer to as code sampling, where each latent variable is selected randomly. Obviously, by increasing the capacity of the autoencoder we are able to achieve lower reconstruction distortion. However, interestingly, we find that while increasing the autoencoder capacity increases modeling power, it also increases the chance that a randomly sampled latent code will not be representative of those seen in the training distribution. Instead, we maintain an index buffer of indexes associated with prior experiences; we call sampling from this buffer buffer sampling. Table 4 shows a comparison of code and buffer sampling for two different latent variable representation sizes. The reconstruction distortion is the error in reconstructing the recollection using the decoder. The nearest neighbor distortion is the distance from the sampled code to its nearest neighbor in the training set. We can see that, for the same reconstruction distortion, the buffer approach yields a significantly smaller nearest neighbor distortion. This means that buffer sampling produces a more representative sample than code sampling.
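A toy simulation illustrates why buffer sampling is more representative. This is not the paper's experiment: the code space sizes, the restriction of "training" codes to a sub-region, and the Hamming-distance proxy for nearest neighbor distortion are all our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: "training" codes occupy only a corner of a much larger
# discrete code space (8 variables with 10 categories; data uses 0-2).
z, k = 8, 10
train_codes = rng.integers(0, 3, size=(500, z))

def nn_distortion(samples, train):
    """Mean normalized Hamming distance from each sampled code to its
    nearest neighbor among the training codes."""
    d = (samples[:, None, :] != train[None, :, :]).sum(-1)  # pairwise distances
    return d.min(axis=1).mean() / z

code_samples = rng.integers(0, k, size=(200, z))         # uniform over code space
buffer_samples = train_codes[rng.integers(0, 500, 200)]  # drawn from the buffer

print(nn_distortion(buffer_samples, train_codes))  # 0.0: always a stored code
print(nn_distortion(code_samples, train_codes))    # much larger: rarely near the data
```

As the code space grows relative to the data, uniform code sampling lands ever farther from anything actually seen, while buffer sampling by construction stays on the data.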
How much does this matter in practice? Figure 4 demonstrates its utility through a knowledge distillation experiment. Here we compare the two representation sizes and approaches on the task of distilling a CNN teacher model into a student model of the same architecture through the latent codes. That is, the student is trained on the reconstructed data using the teacher output as a label. By far the best learning curve is obtained using buffer sampling. We would like to emphasize that these results are not a byproduct of increased model capacity associated with the buffer: the small representation with the buffer significantly outperforms the big representation with code sampling, despite 7.4x fewer total bits of storage, including the model parameters and buffer. In the appendix we include comprehensive experiments showing that distillation based on buffer sampling with a discrete latent code VAE is even more efficient than storing real examples.
Question 7 Does recollection based selfstabilization of the autoencoder lead to effective continual lifelong learning?
We validate our recollection module training procedure by demonstrating that recollections generated by an autoencoder model can actually be effective in preventing catastrophic forgetting for the very same model. As shown in Figure 5, for continual learning on CIFAR-100 with an effective incremental buffer size averaging two items per class, the recollection module is similarly effective to real storage for stabilizing the lifelong autoencoder. The negative effects of the less faithful synthetic examples are apparently drowned out by the positive effects of a larger diversity of stored examples.
Question 8 Can recollection efficiency improve over time?
We explore this question in Figure 6, where we consider online training of the model with a random initialization and no buffer (Online), and offline training with random initialization and full data storage (Offline) over 100 iterations. Predictably, access to unlimited storage and all of the tasks simultaneously means that the performance of Offline is consistently better than Online. To demonstrate the value of transfer from a good representation in the continual learning setting, we additionally plot an online model with no replay buffer and a representation initialized by training for 100 iterations on CIFAR-10 (Transfer Online). Transfer adds significant value, performing comparably to the randomly initialized model with access to all tasks simultaneously and unlimited storage (Offline). In fact, it performs considerably better for the first few tasks, where the number of prior experiences is much greater than the number of new experiences. Improvements from transfer learning thus have a substantial stabilizing effect as well, as demonstrated in Table 3.
Conclusion
We have proposed and experimentally validated a general-purpose Scalable Recollection Module that is designed to scale over very long timeframes. We have demonstrated superior performance over other state-of-the-art approaches to lifelong learning using very small incremental storage footprints. These gains can be dramatically boosted with unsupervised recollection module pretraining. We have shown that VAEs with categorical latent variables significantly outperform those with continuous latent variables (and even JPEG) for lossy compression. Finally, we have also shown that maintaining an explicit buffer is key to capturing the distribution of previously seen samples and generating realistic recollections needed to effectively prevent forgetting.
References
 Al-Shedivat et al. (2017) Al-Shedivat, M.; Bansal, T.; Burda, Y.; Sutskever, I.; Mordatch, I.; and Abbeel, P. 2017. Continuous adaptation via meta-learning in non-stationary and competitive environments. arXiv preprint arXiv:1710.03641.

 Bengio et al. (2009) Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, 41–48. ACM.
 Blundell et al. (2016) Blundell, C.; Uria, B.; Pritzel, A.; Li, Y.; Ruderman, A.; Leibo, J. Z.; Rae, J.; Wierstra, D.; and Hassabis, D. 2016. Model-free episodic control. arXiv preprint arXiv:1606.04460.

 Bouneffouf and Birol (2015) Bouneffouf, D., and Birol, I. 2015. Sampling with minimum sum of squared similarities for Nystrom-based large scale spectral clustering.
 Chen et al. (2017) Chen, X.; Kingma, D. P.; Salimans, T.; Duan, Y.; Dhariwal, P.; Schulman, J.; Sutskever, I.; and Abbeel, P. 2017. Variational lossy autoencoder. ICLR.
 Fernando et al. (2017) Fernando, C.; Banarse, D.; Blundell, C.; Zwols, Y.; Ha, D.; Rusu, A. A.; Pritzel, A.; and Wierstra, D. 2017. Pathnet: Evolution channels gradient descent in super neural networks. arXiv preprint arXiv:1701.08734.
 Graves et al. (2017) Graves, A.; Bellemare, M. G.; Menick, J.; Munos, R.; and Kavukcuoglu, K. 2017. Automated curriculum learning for neural networks. arXiv preprint arXiv:1704.03003.
 Gumbel (1954) Gumbel, E. J. 1954. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office.
 Jang, Gu, and Poole (2017) Jang, E.; Gu, S.; and Poole, B. 2017. Categorical reparameterization with Gumbel-Softmax. ICLR.
 Kaiser et al. (2017) Kaiser, Ł.; Nachum, O.; Roy, A.; and Bengio, S. 2017. Learning to remember rare events. arXiv preprint arXiv:1703.03129.
 Kingma and Welling (2014) Kingma, D. P., and Welling, M. 2014. Autoencoding variational bayes. ICLR.
 Kingma et al. (2016) Kingma, D. P.; Salimans, T.; Jozefowicz, R.; Chen, X.; Sutskever, I.; and Welling, M. 2016. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, 4743–4751.
 Kirkpatrick et al. (2017) Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A. A.; Milan, K.; Quan, J.; Ramalho, T.; GrabskaBarwinska, A.; et al. 2017. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences 201611835.
 Kulesza, Taskar, and others (2012) Kulesza, A.; Taskar, B.; et al. 2012. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3):123–286.
 Kumaran, Hassabis, and McClelland (2016) Kumaran, D.; Hassabis, D.; and McClelland, J. L. 2016. What learning systems do intelligent agents need? complementary learning systems theory updated. Trends in cognitive sciences 20(7):512–534.
 Lake et al. (2011) Lake, B.; Salakhutdinov, R.; Gross, J.; and Tenenbaum, J. 2011. One shot learning of simple visual concepts. In Proceedings of the Cognitive Science Society, volume 33.
 LeCun et al. (1998) LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
 Lee et al. (2018) Lee, J.; Yun, J.; Hwang, S.; and Yang, E. 2018. Lifelong learning with dynamically expandable networks. ICLR.
 Li and Hoiem (2016) Li, Z., and Hoiem, D. 2016. Learning without forgetting. In European Conference on Computer Vision.

 Lin (1992) Lin, L.-J. 1992. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8(3-4):293–321.
 Lopez-Paz and Ranzato (2017) Lopez-Paz, D., and Ranzato, M. 2017. Gradient episodic memory for continuum learning. NIPS.

 Maddison, Mnih, and Teh (2017) Maddison, C. J.; Mnih, A.; and Teh, Y. W. 2017. The concrete distribution: A continuous relaxation of discrete random variables. ICLR.
 Maddison, Tarlow, and Minka (2014) Maddison, C. J.; Tarlow, D.; and Minka, T. 2014. A* sampling. In Advances in Neural Information Processing Systems, 3086–3094.
 McClelland, McNaughton, and O’Reilly (1995) McClelland, J. L.; McNaughton, B. L.; and O’Reilly, R. C. 1995. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review 102(3):419.
 McCloskey and Cohen (1989) McCloskey, M., and Cohen, N. J. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of learning and motivation.
 Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
 Murre (1992) Murre, J. M. 1992. Learning and categorization in modular neural networks.
 Pritzel et al. (2017) Pritzel, A.; Uria, B.; Srinivasan, S.; Badia, A. P.; Vinyals, O.; Hassabis, D.; Wierstra, D.; and Blundell, C. 2017. Neural episodic control. In International Conference on Machine Learning, 2827–2836.
 Ramapuram, Gregorova, and Kalousis (2017) Ramapuram, J.; Gregorova, M.; and Kalousis, A. 2017. Lifelong generative modeling. arXiv preprint arXiv:1705.09847.

 Rebuffi, Kolesnikov, and Lampert (2017) Rebuffi, S.; Kolesnikov, A.; and Lampert, C. 2017. iCaRL: Incremental classifier and representation learning. CVPR.
 Riemer, Khabiri, and Goodwin (2016) Riemer, M.; Khabiri, E.; and Goodwin, R. 2016. Representation stability as a regularizer for improved text analytics transfer learning. arXiv preprint arXiv:1704.03617.
 Ring (1994) Ring, M. B. 1994. Continual learning in reinforcement environments. Ph.D. Dissertation, University of Texas at Austin.
 Robins (1995) Robins, A. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science 7(2):123–146.
 Rusu et al. (2016) Rusu, A. A.; Rabinowitz, N. C.; Desjardins, G.; Soyer, H.; Kirkpatrick, J.; Kavukcuoglu, K.; Pascanu, R.; and Hadsell, R. 2016. Progressive neural networks. arXiv preprint arXiv:1606.04671.
 Schaul et al. (2015) Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2015. Prioritized experience replay. arXiv preprint arXiv:1511.05952.
 Shin et al. (2017) Shin, H.; Lee, J. K.; Kim, J.; and Kim, J. 2017. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, 2994–3003.
 Teyler and DiScenna (1986) Teyler, T. J., and DiScenna, P. 1986. The hippocampal memory indexing theory. Behavioral neuroscience 100(2):147.
 Teyler and Rudy (2007) Teyler, T. J., and Rudy, J. W. 2007. The hippocampal indexing theory and episodic memory: updating the index. Hippocampus 17(12):1158–1169.
 Thrun (1994) Thrun, S. 1994. Lifelong learning perspective for mobile robot control. In Proceedings of the IEEE/RSJ/GI International Conference on Intelligent Robots and Systems, volume 1, 23–30.
 Thrun (1996) Thrun, S. 1996. Is learning the nth thing any easier than learning the first? Advances in neural information processing systems 640–646.
 Vitter (1985) Vitter, J. S. 1985. Random sampling with a reservoir. ACM Transactions on Mathematical Software (TOMS).

 Yang and Hospedales (2017) Yang, Y., and Hospedales, T. 2017. Deep multi-task representation learning: A tensor factorisation approach. ICLR.
 Zhang, Kjellstrom, and Mandt (2017) Zhang, C.; Kjellstrom, H.; and Mandt, S. 2017. Stochastic learning on imbalanced data: Determinantal point processes for mini-batch diversification. arXiv preprint arXiv:1705.00607.
 Zhu and Bento (2017) Zhu, J.J., and Bento, J. 2017. Generative adversarial active learning. arXiv preprint arXiv:1702.07956.
Appendix A Recollection Module Implementation
The key role of the Scalable Recollection Module is to efficiently facilitate the transfer of knowledge between two neural models. In this section we argue that the recently proposed discrete latent variable variational autoencoders (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) are ideally suited to implement the encoder/decoder functionality of the recollection module.
Typical experience storage strategies store full experiences, which can be expensive. For example, storing 32x32 CIFAR images with 3 color channels and 8 bits per pixel per channel incurs a cost per image of 8 x 3 x 32 x 32 = 24,576 bits. Deep nonlinear autoencoders are a natural choice for compression problems. An autoencoder with a continuous latent variable of size $d$, assuming the standard 32-bit representations used in modern GPU hardware, will have a storage cost of $32d$ bits for each latent representation. Unfortunately, continuous variable autoencoders that rely on these 32-bit representations may incur an unnecessary storage cost for many problems – especially in constrained-resource settings.
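To make the bookkeeping concrete, the following sketch works through the storage arithmetic for one image under each scheme; the latent sizes (a 64-dimensional continuous code, and 32 categorical variables of 16 classes each) are illustrative assumptions, not values from the paper:

```python
import math

# Illustrative storage arithmetic; latent sizes below are assumed, not from the paper.
raw_bits = 8 * 3 * 32 * 32                 # full 32x32 RGB image: 24,576 bits
cont_dim = 64                              # assumed continuous latent size d
continuous_bits = 32 * cont_dim            # one 32-bit float per dimension
n_vars, n_classes = 32, 16                 # assumed N categorical variables, d classes each
discrete_bits = n_vars * math.ceil(math.log2(n_classes))  # binary code

print(raw_bits, continuous_bits, discrete_bits)  # prints: 24576 2048 128
```

Even under these modest assumptions the discrete binary code is orders of magnitude smaller than the raw image, which is the gap the recollection module exploits.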
A solution which combines the benefit of VAE training with an ability to explicitly control precision is the recently proposed VAE with categorical latent variables (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017). Here we consider a bottleneck representation between the encoder and decoder with $N$ categorical latent variables, each containing $d$ dimensions representing a one-hot encoding of the categorical variable. This can be compressed to a binary representation of $N \lceil \log_2 d \rceil$ bits.

Discrete VAE Implementation
In order to model an autoencoder with discrete latent variables, we follow the success of recent work (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) and employ the Gumbel-Softmax function. The Gumbel-Softmax function leverages the Gumbel-Max trick (Gumbel, 1954; Maddison, Tarlow, and Minka, 2014), which provides an efficient way to draw samples $z$ from a categorical distribution with class probabilities $\pi_i$ representing the output of the encoder:

$$z = \text{one\_hot}\left(\arg\max_i \left[ g_i + \log \pi_i \right]\right) \quad (1)$$

In equation 1, $g_i$ is a sample drawn from Gumbel(0,1), which is calculated by drawing $u$ from Uniform(0,1) and computing $g = -\log(-\log(u))$. The one_hot function quantizes its input into a one-hot vector. The softmax function is used as a differentiable approximation to argmax, and we generate $d$-dimensional sample vectors $y$ with temperature $\tau$ in which:
$$y_i = \frac{\exp\left((\log \pi_i + g_i)/\tau\right)}{\sum_{j=1}^{d} \exp\left((\log \pi_j + g_j)/\tau\right)} \quad \text{for } i = 1, \dots, d \quad (2)$$
The Gumbel-Softmax distribution is smooth for $\tau > 0$, and therefore has a well-defined gradient with respect to the parameters $\pi$. During forward propagation of the categorical autoencoder, we send the output of the encoder through the sampling procedure of equation 1 to create a categorical variable. However, during backpropagation we replace the non-differentiable categorical samples with a differentiable approximation using the Gumbel-Softmax estimator of equation 2. Although past work (Jang, Gu, and Poole, 2017; Maddison, Mnih, and Teh, 2017) has found value in varying $\tau$ over training, we were able to get strong results keeping $\tau$ fixed at 1.0 across our experiments. Across all of our experiments, our generator model includes three convolutional layers in the encoder and three deconvolutional layers in the decoder.

One issue when deploying an autoencoder is the determination of the latent code size hyperparameter. In the next section we derive a maximization procedure, used throughout our experiments, to find the optimal autoencoder latent variable size for a given resource constraint.
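The sampling procedure of equations 1 and 2 can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's implementation: it pairs the hard one-hot forward sample with the soft relaxed sample and fixes the temperature at 1.0, mirroring the text.

```python
import numpy as np

def gumbel_softmax_sample(log_pi, tau=1.0, rng=np.random):
    """Sketch of equations (1)-(2): draw Gumbel(0,1) noise, form a soft
    (differentiable) sample y via a tempered softmax, and a hard one-hot
    sample via argmax for the forward pass. log_pi has shape (N, d)."""
    u = rng.uniform(1e-10, 1.0, size=np.shape(log_pi))   # Uniform(0,1) draws
    g = -np.log(-np.log(u))                              # Gumbel(0,1) noise
    logits = (log_pi + g) / tau
    y = np.exp(logits - logits.max(axis=-1, keepdims=True))
    y /= y.sum(axis=-1, keepdims=True)                   # Eq. (2): soft sample
    hard = np.zeros_like(y)
    hard[np.arange(y.shape[0]), y.argmax(axis=-1)] = 1.0  # Eq. (1): one_hot(argmax)
    return hard, y
```

In a framework with automatic differentiation, `hard` would be used in the forward pass and gradients would flow through `y` (the straight-through pattern described above).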
Details on GEM Integration
Appendix B Optimizing Latent Code Size for a Resource Constraint
The ability of a discrete variational autoencoder to memorize inputs should be strongly related to the effective bottleneck capacity $C$, which we define, for $N$ discrete latent variables of $d$ dimensions each, as:

$$C = N \log_2 d \quad (3)$$
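For illustration, packing a code of $N$ categorical indices into the $N \lceil \log_2 d \rceil$-bit binary form used for buffer storage can be sketched as follows; these are hypothetical helpers, not the paper's code:

```python
import math

def code_to_bits(indices, d):
    """Pack N categorical indices (each in [0, d)) into a single integer of
    N * ceil(log2 d) bits -- the binary buffer representation."""
    w = math.ceil(math.log2(d))           # bits per categorical variable
    bits = 0
    for idx in indices:
        bits = (bits << w) | idx
    return bits, len(indices) * w         # packed value and total bit count

def bits_to_code(bits, n, d):
    """Inverse of code_to_bits: recover the n categorical indices."""
    w = math.ceil(math.log2(d))
    out = []
    for _ in range(n):
        out.append(bits & ((1 << w) - 1))
        bits >>= w
    return out[::-1]
```

The round trip is lossless, so the buffer pays only $N \lceil \log_2 d \rceil$ bits per stored recollection.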
Incremental Storage Resource Constraints
First, let us consider the dynamics of balancing resources in a simple setting where we have an incremental storage constraint for new incoming data, without regard for the size of the model used to compress and decompress recollections. We refer to the total storage constraint over all $T$ incoming examples as $R$ and the average storage rate limit as $r = R/T$. We can then define $p$ as the probability that an incoming example is stored in memory. Thus, the expected number of bits required per incoming example is $p N \lceil \log_2 d \rceil$, assuming simple binary encoding. If we treat $r$ as fixed, we can then define the following optimization procedure to search for a combination of $N$, $d$, and $p$ that maximizes capacity while fulfilling an incremental resource storage constraint:

$$\max_{N, d, p} \; N \log_2 d \quad (4)$$

$$\text{subject to} \quad p \, N \lceil \log_2 d \rceil \leq r$$

which yields the approximate solution $N \log_2 d \approx r / p$. As seen in equation 4, there is an inherent tradeoff between the diversity of experiences we store, governed by $p$, and the distortion achieved, which is related to the capacity. The optimal tradeoff is likely problem dependent. Our work takes a first step toward understanding this relationship. For example, we demonstrate that deep neural networks can see improved stabilization in resource-constrained settings by allowing for some degree of distortion, because of an increased ability to capture the diversity of the data under the same incremental resource constraint.
Total Storage Resource Constraints
In some ways, the incremental storage constraint setting described in the previous section is not the most rigorous setting for comparing recollections to a selected subset of full inputs. Another important factor is the number of parameters $M$ in the model used for compression and decompression. $M$ is generally also to some degree a function of $N$ and $d$. For example, in most of our experiments, we use the same number of hidden units at each layer as used in the bottleneck layer. With fully connected layers, this yields $M = O((Nd)^2)$. As such, we can revise equation 4 to handle a more rigorous constraint for optimizing a discrete latent variable autoencoder architecture:

$$\max_{N, d, p} \; N \log_2 d \quad (5)$$

$$\text{subject to} \quad 32M + p \, T N \lceil \log_2 d \rceil \leq R$$
While this setting is more rigorous when comparing to lossless inputs, it is a somewhat harsh restriction with which to measure lifelong learning systems. This is because it is assumed that the compression model’s parameters should be largely transferable across tasks. To some degree, these parameters can be viewed as a sunk cost from the standpoint of continual learning. In our experiments, we also look at transferring these representations from related tasks to build a greater understanding of this tradeoff.
Appendix C Additional Details on Experimental Protocol
Each convolutional layer has a kernel size of 5. As we vary the size of our categorical latent variable across experiments, we in turn adjust the number of filters in each convolutional layer to keep the number of hidden variables consistent at all intermediate layers of the network. In practice, this implies that the number of filters in each layer is determined by the bottleneck size $Nd$. We note that the discrete autoencoder is stochastic, not deterministic, and we report one stochastic pass through the data for each experimental trial. In all of our knowledge distillation experiments, we report an average result over 5 runs. For MNIST and Omniglot we follow prior work and consider 28x28 images with 1 channel and 8 bits per pixel. MNIST and Omniglot images were originally larger, but others have found that downsampling to 28x28 does not affect the performance of models trained on them.
Distortion as a Function of Compression Experiments
More details about the architectures used in these experiments are provided for categorical latent variables in Table 5 and for continuous latent variables in Table 6. For each architecture we ran with learning rates of 1e-2, 1e-3, 1e-4, and 1e-5, reporting the option that achieves the best training distortion. For the distortion, the pixels are normalized by dividing by 255.0; we take the mean over the vector of the absolute value of the difference between the reconstruction and the real sample, and then report the mean over the samples in the training set. Compression is the ratio between the size of an 8 bpp MNIST image and the size of the latent variables, assuming 32-bit floating point numbers in the continuous case and the binary representation for the categorical variables. The JPEG data points were collected using the Pillow Python package with quality 1, 25, 50, 75, and 100. We subtracted the header size from the JPEG size so it is a relatively fair accounting of the compression for a large dataset of images all of the same size. The JPEG compression is computed as an average over the first 10,000 MNIST training images.
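The distortion metric described above can be sketched directly; the function name is a hypothetical helper, and pixels are assumed to be in [0, 255]:

```python
import numpy as np

def distortion(x, x_hat):
    """Mean absolute reconstruction error on pixels normalized to [0, 1]:
    average first over each sample's pixel vector, then over samples."""
    x = np.asarray(x, dtype=np.float64) / 255.0
    x_hat = np.asarray(x_hat, dtype=np.float64) / 255.0
    per_sample = np.abs(x - x_hat).reshape(len(x), -1).mean(axis=1)
    return per_sample.mean()
```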
Table 5: Categorical latent variable architectures ($N$ variables of dimension $d$).

| $N$ | $d$ | Compression | Distortion |
|---|---|---|---|
| 6 | 20 | 209.067 | 0.06609 |
| 10 | 20 | 125.440 | 0.04965 |
| 6 | 16 | 261.333 | 0.07546 |
| 12 | 10 | 130.667 | 0.05497 |
| 10 | 14 | 156.800 | 0.05410 |
| 24 | 3 | 130.667 | 0.05988 |
| 38 | 2 | 165.053 | 0.05785 |
| 6 | 2 | 1045.333 | 0.13831 |
| 40 | 3 | 78.400 | 0.04158 |
| 20 | 2 | 313.600 | 0.08446 |
| 8 | 6 | 261.333 | 0.08423 |
| 12 | 6 | 174.222 | 0.06756 |
| 30 | 2 | 209.067 | 0.06958 |
| 24 | 6 | 87.111 | 0.04065 |
| 4 | 37 | 261.333 | 0.07795 |
| 8 | 15 | 196.000 | 0.06812 |
| 48 | 10 | 32.667 | 0.01649 |
| 209 | 8 | 10.003 | 0.01455 |
| 12 | 37 | 87.111 | 0.03996 |
| 313 | 4 | 10.019 | 0.01420 |
| 392 | 3 | 8.000 | 0.01348 |
| 50 | 18 | 25.088 | 0.01859 |
| 168 | 2 | 37.333 | 0.01955 |
| 108 | 3 | 29.037 | 0.01894 |
| 62 | 2 | 101.161 | 0.04073 |
| 208 | 2 | 30.154 | 0.01832 |
| 68 | 5 | 30.745 | 0.01849 |
Table 6: Continuous latent variable architectures.

| $d$ | Compression | Distortion |
|---|---|---|
| 1 | 49 | 0.135196 |
| 2 | 24.5 | 0.124725 |
| 3 | 16.333 | 0.0947032 |
| 5 | 9.8 | 0.0354035 |
| 7 | 7 | 0.031808 |
| 20 | 2.45 | 0.0149272 |
Continual Lifelong Learning Experiments
In our experiments on MNIST-Rotations we found it optimal to set the GEM learning rate to 0.1 and the memory strength to 0.5. In our recollections experiments, we used an autoencoder with 104 4-d variables for the size 100 buffer and one with 139 8-d variables for the size 200 buffer. For experience replay, we used a learning rate of 0.01 for the predictive model and a learning rate of 0.001 for the autoencoder. For the size 100 buffer we experienced more overfitting to the buffer and used a batch size of 5. For the size 200 buffer we used a batch size of 25.
Our categorical latent variable autoencoders had the following sizes for Incremental CIFAR-100: 48 2-d variables for an effective buffer size of 10, 98 2-d variables for an effective buffer size of 20, 244 2-d variables for an effective buffer size of 50, 294 2-d variables for an effective buffer size of 60, and 620 3-d variables for an effective buffer size of 200. The predictive model was trained with a learning rate of 1e-3 in all of our replay experiments and 0.1 in all of our GEM experiments. The learning rate for the autoencoder was 1e-4 for the buffer size of 200 regardless of the lifelong learning model, and for the buffer size of 20 with GEM. The autoencoder learning rate was 1e-3 for replay with buffer sizes 10 and 50 as well as for GEM with buffer size 60. The GEM memory strength parameter was set to 0.1 for our real storage experiments and 0.01 for our experiments with recollection. In our experiments with GEM and with replay on MNIST-Rotations, we were able to get very good results even just training our autoencoder online without leveraging the recollection buffer to stabilize its training. We could have seen further improvements, as we show on CIFAR-100, by using the buffer for stabilization.
For incremental Omniglot our learning rate was set to 1e-3. For the effective buffer size of 50 experiments, we leveraged a categorical latent variable autoencoder with 312 2-d variables. For the effective buffer size of 10 experiments, we utilized a categorical latent representation consisting of 62 2-d variables. We follow the 90% multi-task training and 10% testing splits for Omniglot established in (Yang and Hospedales, 2017).
During the transfer learning experiments from CIFAR-10, for replay a learning rate of 1e-3 was used for the ResNet-18 reasoning model and a learning rate of 3e-4 was used for the discrete autoencoder generator. For GEM, we used a learning rate of 0.1 for the ResNet model, a learning rate of 1e-3 for the autoencoder, and a memory strength of 0.1. For the experiment without transfer learning, we instead used a higher learning rate of 1e-3 for the replay autoencoder. We used a learning rate of 1e-4 for the autoencoder in our transfer learning GEM experiments.
Detailed Retention Results
We provide a detailed version of Figure 2 of the main text in Figure 7. It includes learning curves for larger real storage buffer sizes as a comparison. For example, a six times larger real storage buffer loses knowledge significantly faster than Scalable Recollections, despite better performance when originally training on Incremental CIFAR-100.
Appendix D Scalable Recollections for Distillation
Question 9 Can Scalable Recollections retain knowledge so that it is transferable to new neural network instantiations?
Table 7: MNIST distillation results.

| Episodes | Real Data | 10% Sample | 2% Sample | 1% Sample | Real Teacher | 10x Compress | 50x Compress | 100x Compress |
|---|---|---|---|---|---|---|---|---|
| 10 | 10.43 | 9.94 | 11.07 | 10.70 | 10.07 | 10.65 | 10.99 | 13.89 |
| 100 | 19.63 | 18.16 | 22.82 | 22.35 | 25.32 | 19.34 | 16.20 | 21.06 |
| 1000 | 90.45 | 88.88 | 90.71 | 89.93 | 91.01 | 90.66 | 90.52 | 90.03 |
| 10000 | 97.11 | 96.83 | 95.98 | 94.97 | 97.42 | 96.77 | 96.37 | 95.65 |
| 100000 | 98.51 | 97.99 | 96.14 | 94.92 | 98.63 | 98.59 | 98.17 | 97.75 |
In our experiments, we train a teacher model with a LeNet (LeCun et al., 1998) convolutional neural network (CNN) architecture on the popular MNIST benchmark, achieving 99.29% accuracy on the test set. We would like to test whether recollections drawn from our proposed recollection module are sufficient input representations for the teacher neural network to convey its function to a separate, randomly initialized student neural network of the same size. In Table 7 we validate the effectiveness of our technique by comparing it to some episodic storage baselines of interest. As baselines we consider training with the same number of randomly sampled real examples, using real input with the teacher's output vector as a target, and using random sampling to select a subset of real examples to store. When training with a large number of memories for a more complete knowledge transfer, recollection compression clearly pays dividends over the random sampling baselines. This is particularly impressive because these results are for the stricter total storage footprint compression setting, where we account for the autoencoder model parameters; on a per-sample basis the compression is actually 37x, 101x, and 165x.
We also would like to validate these findings in a more complex setting, for which we consider distillation with outputs from a 50-task ResNet-18 teacher model that achieves 94.86% accuracy on Omniglot. We test performance after one million training episodes, which is enough to achieve teacher performance using all of the real training examples. However, sampling diversity restricts learning significantly, for example, achieving 28.87% accuracy with 10% sampling, 8.88% with 2% sampling, and 5.99% with 1% sampling. In contrast, the recollection module is much more effective, achieving 87.86% accuracy for 10x total resource compression, 74.03% accuracy for 50x compression, and 51.45% for 100x compression. Later we also demonstrate that these results generalize to mismatched architectures for the student and teacher. Moreover, we also show below that by using heuristics to explore the buffer instead of random sampling we can transfer knowledge even more efficiently to a new model than with randomly drawn real examples. Indeed, it is clear that Scalable Recollections provide a general-purpose reservoir of knowledge that is efficiently transferable to anything ranging from old network instantiations of the same model (in which case it reinforces prior learning) to totally new and different architectures (looking to obtain this knowledge from scratch).
MNIST Generative Distillation Experiments
Alongside the teacher model, we train a variational autoencoder model with discrete latent variables. Each model is trained for 500 epochs. During the final pass through the data, we forward propagate through each training example and store the latent code in an index buffer. This buffer eventually grows to a size of 50,000. After training is complete, the index buffer is used as a statistical basis for sampling diverse recollections to train a student network. A logical and effective strategy for training a student model is to sample randomly from this buffer and thus capture the full distribution. For all of our distillation experiments we ran the setting with learning rates of 1e-3 and 1e-4, reporting the best result. We found that the higher learning rate was beneficial in settings with a low number of examples and the lower learning rate was beneficial in settings with a larger number of examples. The categorical latent variable autoencoders explored had the following representation sizes: 168 2-d variables for 10x compression, 62 2-d variables for 50x compression, and 38 2-d variables for 100x compression. For our code sampling baselines, we used the numpy random integer function to generate each discrete latent variable.
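The index-buffer workflow described above can be sketched as follows. The `IndexBuffer` class is a hypothetical illustration; `encode`, `decode`, and `teacher` stand in for the trained encoder, decoder, and teacher network, which are assumed to be supplied by the surrounding training code:

```python
import random

class IndexBuffer:
    """Minimal sketch of the recollection index buffer: store each training
    example's discrete latent code, then sample codes to generate
    recollections (with teacher outputs as labels) for distillation."""
    def __init__(self):
        self.codes = []

    def fill(self, dataset, encode):
        # Final pass over the data: record every example's latent code.
        for x in dataset:
            self.codes.append(encode(x))

    def sample_batch(self, decode, teacher, batch_size):
        # Random sampling from the buffer captures the full distribution.
        batch = random.sample(self.codes, batch_size)
        recollections = [decode(c) for c in batch]
        targets = [teacher(r) for r in recollections]  # teacher output as label
        return recollections, targets
```

A student would then be trained on `(recollections, targets)` minibatches exactly as in ordinary knowledge distillation.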
Omniglot Generative Distillation Experiment
The learning rate for the ResNet-18 reasoning model was 1e-4 in our experiments. Our trained discrete autoencoder models had the following representation sizes: 32 variables of size 2 for 100x compression, 50 variables of size 2 for 50x compression, and 134 variables of size 2 for 10x compression. We follow the 90% multi-task training and 10% testing splits for Omniglot established in (Yang and Hospedales, 2017).
CNN to MLP Distillation Results
In Figure 8 we explore generative knowledge distillation, transferring knowledge from a CNN teacher network to an MLP student network.
| Episodes | Real Data | 10% Sample | 2% Sample | 1% Sample | Real Teacher | 10x Compress | 50x Compress | 100x Compress |
|---|---|---|---|---|---|---|---|---|
| 10 | 13.64 | 17.04 | 14.57 | 15.13 | 15.87 | 16.70 | 11.80 | 14.66 |
| 100 | 36.37 | 37.04 | 38.35 | 34.04 | 38.56 | 37.16 | 40.09 | 42.31 |
| 1000 | 80.54 | 79.08 | 78.18 | 77.76 | 80.00 | 80.72 | 80.00 | 77.75 |
| 10000 | 91.04 | 90.84 | 88.38 | 86.83 | 90.86 | 91.37 | 90.60 | 90.46 |
| 100000 | 96.66 | 95.02 | 91.61 | 88.97 | 96.60 | 96.71 | 96.24 | 95.22 |
Automated Generative Curriculum Learning
| Episodes | Real Data | Real Teacher | Active 10x Compress | Active 100x Compress | Active & Diverse 10x Compress | Active & Diverse 100x Compress |
|---|---|---|---|---|---|---|
| 10 | 10.43 | 10.07 | 9.95 | 10.19 | 10.67 | 11.51 |
| 100 | 19.63 | 25.32 | 14.80 | 22.57 | 27.05 | 29.93 |
| 1000 | 90.45 | 91.01 | 93.45 | 92.97 | 94.81 | 92.54 |
| 10000 | 97.11 | 97.42 | 98.61 | 97.53 | 98.59 | 97.66 |
| 100000 | 98.51 | 98.63 | 99.18 | 98.25 | 99.20 | 98.32 |
While random sampling from a buffer can be very effective, we would like to further maximize the efficiency of distilling knowledge from a teacher model to a student model. This motivates the automated curriculum learning setting (Bengio et al., 2009), as recently explored for multi-task learning in (Graves et al., 2017), or rather automated generative curriculum learning in our case. We tried some simple reinforcement learning solutions with rewards based on (Graves et al., 2017), but were unsuccessful in our initial experiments because of the difficulty of navigating a complex continuous action space. We also tried an active learning formulation proposed for GANs (Zhu and Bento, 2017) to learn the best latent code to sample at a given time. We had limited success with this strategy as well, as it tends to learn to emphasize regions of the latent space that optimize incorrectness but no longer capture the distribution of inputs.

Designing generative sampling heuristics. Inspired by these findings, we instead employ simple sampling heuristics to try to design a curriculum with prototypical qualities like responsiveness to the student and depth of coverage. We model responsiveness to the student as active sampling, focusing on examples where the student does not perform well. We randomly sample $n$ latent codes using our recollection buffer and choose the one that is most difficult for the current student for backpropagation, by cheaply forward propagating each through the student. By sampling from the recollection buffer, we ensure that our chosen difficult samples are still representative of the training distribution. We set $n$ to 10 in our experiments, so the sampling roughly equates to sampling once from the most difficult class for the student model at each point in time. We model depth of coverage by sampling a bigger batch of random examples and adding a filtering step before considering difficulty. We would like to perform diverse sampling that promotes subset diversity when we filter from $mn$ examples down to $n$ examples. One approach to achieving this is a Determinantal Point Process (DPP) (Kulesza, Taskar, and others, 2012), as recently proposed for selecting diverse neural network minibatches (Zhang, Kjellstrom, and Mandt, 2017). We use the dot product of the inputs as a measure of similarity between recollections and found the DPP to achieve effective performance as a diverse sampling step. However, we follow (Bouneffouf and Birol, 2015) and use a process for sampling based on the sum of the squared similarity matrix, as outlined in the next section. We found the sum of the squared similarity matrix to be equally effective to the determinant and significantly more scalable to large matrices. We also set $m$ to 10 in our experiments.
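The active-and-diverse heuristic can be sketched as below. This is a hypothetical illustration: `decode` and `student_loss` stand in for the trained decoder and a per-example student loss, the batch sizes `n` and `m` mirror the text, and `diverse_filter` is a pluggable diversity step (e.g. the MSSS procedure of the next section).

```python
import random
import numpy as np

def active_diverse_sample(buffer_codes, decode, student_loss,
                          n=10, m=10, diverse_filter=None):
    """Sketch of active & diverse sampling: draw m*n random codes from the
    recollection buffer, optionally filter down to n diverse candidates,
    then return the single candidate hardest for the current student."""
    candidates = random.sample(buffer_codes, min(m * n, len(buffer_codes)))
    if diverse_filter is not None:
        candidates = diverse_filter(candidates, n)     # depth of coverage
    recollections = [decode(c) for c in candidates]
    losses = [student_loss(r) for r in recollections]  # responsiveness
    return recollections[int(np.argmax(losses))]       # hardest example
```

Sampling from the buffer first keeps even the "hard" examples representative of the training distribution, which is what distinguishes this heuristic from the active learning formulation that failed above.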
Minimum Sum of Squared Similarities
This algorithm tries to find a new landmark point that maximizes the determinant by instead finding the point that minimizes the sum of squared similarities (MSSS). The MSSS algorithm initially chooses two points at random from the dataset $X$. It then computes the sum of squared similarities between the sampled points and a subset $S$ selected randomly from the remaining data points. The point in $S$ with the smallest sum of squared similarities is then picked as the next landmark data point. The procedure is repeated until a total of $k$ landmark points are picked.
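The procedure above can be sketched as follows; this is an illustrative implementation assuming dot-product similarity between rows of $X$, not the reference code of (Bouneffouf and Birol, 2015):

```python
import random
import numpy as np

def msss_sample(X, k, subset_size=None, rng=random):
    """Minimum Sum of Squared Similarities sketch: greedily pick k diverse
    landmark rows of X, each new landmark minimizing the sum of squared
    dot-product similarities to the landmarks chosen so far."""
    X = np.asarray(X, dtype=np.float64)
    idx = rng.sample(range(len(X)), 2)                 # two random seed points
    while len(idx) < k:
        pool = [i for i in range(len(X)) if i not in idx]
        if subset_size is not None:                    # optional random subset S
            pool = rng.sample(pool, min(subset_size, len(pool)))
        sims = X[pool] @ X[idx].T                      # similarities to landmarks
        scores = (sims ** 2).sum(axis=1)               # sum of squared similarities
        idx.append(pool[int(np.argmin(scores))])       # most dissimilar candidate
    return idx
```

Because only a similarity matrix sum is needed (rather than a determinant), each greedy step is cheap, which is the scalability advantage noted above.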