It is typical practice to design and optimize machine learning (ML) models to solve a single task. Humans, by contrast, do not learn complex tasks in isolation; they are capable of generalizing and transferring knowledge and skills learned from one task to another. This ability to remember, learn and transfer information across tasks is referred to as continual learning Thrun1995 ; Ruvolo2013 ; Hassabis2017 ; Parisi2019 . The major challenge for creating ML models with continual learning ability is that they are prone to catastrophic forgetting McClelland1995 ; McCloskey1989 ; Goodfellow2013 ; French1999 . ML models tend to forget the knowledge learned from previous tasks when re-trained on new observations corresponding to a different (but related) task. Specifically, when a deep neural network (DNN) is fed with a sequence of tasks, its ability to solve the first task declines significantly after training on the following tasks. The typical structure of DNNs does not, by design, preserve previously learned knowledge without interference between tasks, i.e., without catastrophic forgetting. To overcome catastrophic forgetting, a learning system must continuously acquire knowledge from newly fed data while preventing training on the new data samples from destroying the existing knowledge.
In this paper, we propose a novel approach to continual learning with DNNs that addresses the catastrophic forgetting issue: a technique called online leverage score sampling (OLSS). In OLSS, we progressively compress the input information learned thus far, along with the input from the current task, and form more efficiently condensed data samples. The compression technique is based on the statistical leverage score measure, and it uses the concept of frequent directions to connect the series of compression steps for a sequence of tasks.
When thinking about continual learning, a major source of inspiration is the ability of biological brains to learn without destructive interference between older memories and to generalize knowledge across multiple tasks. In this regard, the typical approach is to enable some form of episodic memory in the network, with consolidation McClelland1995 via replay of older training data. However, this is an expensive process and does not scale well to learning a large number of tasks. As an alternative, taking inspiration from neuro-computational models of complex synapses BennaFusi2016 , recent work has focused on assigning some form of importance to the parameters of a DNN and performing task-specific synaptic consolidation Kirkpatrick2017 ; Zenke2017 . Here, we take a very different view of continual learning and find inspiration in the brain's ability to use dimensionality reduction Pang2016 to extract meaningful information from its environment and drive behavior. As such, we enable progressive dimensionality reduction (in terms of the number of samples) of previous task data combined with new task data, in order to preserve only a good summary of the information (discarding the less relevant information, or effective forgetting) before further learning. Repeating this process in an online manner, we enable continual learning over a long sequence of tasks. Much like our brains, a central strategy employed by our method is to strike a balance between dimensionality reduction of task-specific data and dimensionality expansion as processing progresses throughout the hierarchy of the neural network FusiMiller2016 .
1.1 Related Work
Recently, a number of approaches have been proposed to adapt a DNN model to the continual learning setting: adaptive model architectures, such as adding columns or neurons for new tasks Rusu2016 ; Yoon2018 ; Schwarz2018 ; model parameter adjustment or regularization techniques, such as imposing restrictions on parameter updates Kirkpatrick2017 ; Zenke2017 ; Li2016 ; Titsias2019 ; memory revisit techniques, which steer model updates towards the optimal directions Lopez-Paz2017 ; Rebuffi2017 ; Shin2017 ; Bayesian approaches to modeling continuously acquired information Titsias2019 ; Nguyen2018 ; Garnelo2018 ; and, in broader domains, approaches targeted at different setups or goals such as few-shot learning or transfer learning Finn2017 ; Nichol2018 .
In order to demonstrate our idea in comparison with the state-of-the-art techniques, we briefly discuss the following three popular approaches to continual learning:
I) Regularization: It constrains or regularizes the model parameters by adding terms to the loss function that prevent the model from deviating significantly from the parameters important to earlier tasks. Typical algorithms include elastic weight consolidation (EWC) Kirkpatrick2017 and continual learning through synaptic intelligence (SI) Zenke2017 .
II) Architectural modification: It revises the model structure successively after each task in order to provide more memory and additional free parameters in the model for new task input. Recent examples in this direction are progressive neural networks Rusu2016 and dynamically expanding networks Yoon2018 .
III) Memory replay: It stores data samples from previous tasks in a separate memory buffer and retrains the new model based on both the new task input and the memory buffer. Popular algorithms here are gradient episodic memory (GEM) Lopez-Paz2017 and incremental classifier and representation learning (iCaRL) Rebuffi2017 .
Among these approaches, regularization is particularly prone to saturation of learning when the number of tasks is large. The additional regularization term in the loss function quickly loses its effectiveness when the important parameters of different tasks overlap too often. Modifications of the network architecture, as in progressive networks, resolve the saturation issue but do not scale as the number and complexity of tasks increase. The scalability problem is also prominent in current memory replay techniques, which often suffer from high memory and computational costs.
Our approach resembles the use of memory replay since it preserves the original input data samples from earlier tasks for further training. However, it does not require extra memory for training and is cost efficient compared to previous memory replay methods. It also makes more effective use of the model structure by exploiting the model capacity to adapt with more tasks, in contrast to constant addition of neurons or additional network layers for new tasks. Furthermore, unlike the importance assigned to model specific parameters when using regularization methods, we assign importance to the training data that is relevant in effectively learning new tasks, while forgetting less important information.
2 Online Leverage Score Sampling
Before presenting the idea, we first set up the problem: Let represent a sequence of tasks, each task consists of data samples and each sample has a feature dimension and an output dimension , i.e., input and true output . Here, we assume the feature and output dimensions are fixed for all tasks (Footnote 1: if we know a priori that the feature or output dimensions are different, we could choose a presumed larger value of and ; in continual learning our aim is to solve successive problems with some degree of overlap, so requiring the same feature and output dimensions across tasks is not overly strict). The goal is to train a DNN over the sequence of tasks and ensure it performs well on all of them, without catastrophic forgetting. Here, we consider that the network's architecture stays the same and the tasks are received in a sequential manner. Formally, with representing a DNN, our objective is to minimize the loss (Footnote 2: here, we represent a generic Euclidean loss term; however, this could take the form of any typical formulation in terms of -loss, -loss or cross-entropy loss as commonly used in classification problems):
Under this setup, we look at some of the existing models:
Online EWC trains on the th task with a loss function containing additional penalty terms
where indicates the importance level of the previous tasks compared to task , represents the th diagonal entry of the Fisher information matrix for Task , represents the number of parameters in the network, corresponds to the th model parameter for the current task and is the th model parameter value for the th task.
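The penalty described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the standard quadratic EWC form, in which each previous task contributes a sum of squared parameter deviations weighted by the diagonal Fisher entries, scaled by half of the importance level. The names `ewc_penalty`, `ewc_loss`, `anchors`, `fishers` and `lam` are hypothetical.

```python
from typing import Sequence


def ewc_penalty(theta: Sequence[float],
                anchors: Sequence[Sequence[float]],
                fishers: Sequence[Sequence[float]],
                lam: float) -> float:
    """Quadratic EWC-style penalty summed over previous tasks.

    theta   : current parameter values (one entry per network parameter)
    anchors : anchors[k][j] is parameter j as stored after previous task k
    fishers : fishers[k][j] is the j-th diagonal Fisher entry for task k
    lam     : importance of the previous tasks relative to the current one
    """
    total = 0.0
    for anchor, fisher in zip(anchors, fishers):
        for t, a, f in zip(theta, anchor, fisher):
            total += f * (t - a) ** 2
    # The factor 1/2 is an assumed convention, not taken from the source.
    return 0.5 * lam * total


def ewc_loss(task_loss: float,
             theta: Sequence[float],
             anchors: Sequence[Sequence[float]],
             fishers: Sequence[Sequence[float]],
             lam: float) -> float:
    """Loss on the current task plus the penalty from all previous tasks."""
    return task_loss + ewc_penalty(theta, anchors, fishers, lam)
```

Parameters with large Fisher entries are pulled strongly towards their stored values, while parameters with small entries remain free to adapt to the current task.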
Alternatively, GEM maintains an extra memory buffer containing data samples from each of the previous tasks with . It trains on the current task with a regular loss function, but subject to inequalities on each update of (update on each parameter ),
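As a rough illustration of these inequalities: in the GEM paper Lopez-Paz2017 , a proposed update is accepted when the gradient on the current task has a non-negative inner product with the gradient computed on each previous task's memory. The sketch below checks those constraints and, for the simplified case of a single violated constraint, applies the closed-form projection. GEM itself solves a quadratic program when several constraints are involved, which is not reproduced here. All function names are hypothetical.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


def violated_constraints(grad: Sequence[float],
                         memory_grads: Sequence[Sequence[float]]) -> List[int]:
    """Indices of previous tasks whose memory gradient conflicts with grad."""
    return [k for k, g_k in enumerate(memory_grads) if dot(grad, g_k) < 0.0]


def project_single(grad: Sequence[float], g_k: Sequence[float]) -> List[float]:
    """Closest vector to grad (in L2) with non-negative inner product with g_k.

    Simplification: handles one constraint only; the full method needs a QP.
    """
    inner = dot(grad, g_k)
    if inner >= 0.0:
        return list(grad)  # constraint already satisfied, nothing to do
    scale = inner / dot(g_k, g_k)
    return [g - scale * m for g, m in zip(grad, g_k)]
```

The extra gradient evaluations on the memory buffer, one per previous task and per update, are what make this family of methods costly.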
2.1 Our approach
Unlike either method above, the new method OLSS aims to find an approximation of in a streaming (online) manner, i.e., to form a sketch to approximate such that the resulting
is likely to perform on all tasks as well as
In order to avoid extra memory and computation cost during the training process, we can set the approximation to have the same number of rows (number of data samples) as the current task .
Equations (1) and (2) represent nonlinear least squares problems. Note that a nonlinear least squares problem can be solved approximately through a sequence of linear least squares problems with where is the Jacobian of at each update (using the Gauss-Newton method). Besides this technique, there are various other approaches to this problem. Here we adopt a simple, cost-effective randomization technique, leverage score sampling, which has been used extensively in solving large scale linear least squares and low rank approximation problems Cohen2017 ; Drineas2012 ; Woodruff2014 .
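To make the Gauss-Newton remark concrete, the following sketch solves a small nonlinear least squares problem by repeatedly solving a linear least squares problem in the Jacobian. The two-parameter exponential model is a hypothetical stand-in for a DNN, chosen only because its Jacobian is easy to write by hand; the function names are assumptions as well.

```python
import numpy as np


def model(theta, x):
    # Hypothetical nonlinear model f(x; theta) = theta[0] * exp(theta[1] * x).
    return theta[0] * np.exp(theta[1] * x)


def jacobian(theta, x):
    # Partial derivatives of the model with respect to theta[0] and theta[1].
    e = np.exp(theta[1] * x)
    return np.column_stack([e, theta[0] * x * e])


def gauss_newton(x, y, theta0, n_iter=20):
    """Each iteration solves the linear problem min_step ||J step - residual||."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        residual = y - model(theta, x)
        J = jacobian(theta, x)
        step, *_ = np.linalg.lstsq(J, residual, rcond=None)
        theta = theta + step
    return theta


x = np.linspace(0.0, 1.0, 50)
true_theta = np.array([2.0, -1.5])
y = model(true_theta, x)  # noise-free synthetic data
theta_hat = gauss_newton(x, y, theta0=[1.0, -1.0])
```

Each inner problem is an ordinary linear least squares problem, which is the setting in which row sampling by leverage scores is usually analyzed.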
2.2 Statistical Leverage Score and Leverage Score Sampling
Statistical leverage scores measure the non-uniformity structure of a matrix; a higher score indicates that the corresponding row contributes more heavily to the non-uniformity of the matrix. They have been widely used for outlier detection in statistical data analysis. In recent applications Drineas2012 ; Woodruff2014 , they have also emerged as a fundamental tool for constructing randomized matrix sketches. Given a matrix, a sketch of is another matrix where is significantly smaller than but still approximates well, more specifically, . Theoretical accuracy guarantees have been derived for random sampling methods based on statistical leverage scores Woodruff2014 ; Ma2014 .
Considering our setup, which is to approximate a matrix for solving a least squares problem, and also the computational efficiency, we adopt the following leverage score based sampling method:
Given a sketch size , define a distribution with (Footnote 3: since , forms a probability distribution). The sketch is formed by independently and randomly selecting rows of without replacement, where the th row is selected with probability . Based on this, we are able to select the samples that contribute the most to a given dataset. The remaining problem is to embed it in a sequence of tasks and still generate promising approximations to solve the least squares problem. In order to achieve that, we make use of the concept of frequent directions.
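The sampling step can be sketched as follows. Since the defining formula is not reproduced in the text above, the sketch assumes the standard definition of statistical leverage scores (the squared row norms of the left singular vectors) and selects rows with probability proportional to those scores. The function names and the toy matrix are hypothetical.

```python
import numpy as np


def leverage_scores(A):
    """Squared row norms of the left singular vectors of A (thin SVD)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    # Keep only directions with numerically non-zero singular values.
    rank = int(np.sum(s > s.max() * max(A.shape) * np.finfo(float).eps))
    return np.sum(U[:, :rank] ** 2, axis=1)


def leverage_sample(A, ell, rng):
    """Pick ell distinct row indices with probability proportional to leverage."""
    tau = leverage_scores(A)
    p = tau / tau.sum()  # the scores sum to rank(A), so this normalizes them
    return rng.choice(A.shape[0], size=ell, replace=False, p=p)


rng = np.random.default_rng(0)
A = rng.standard_normal((200, 5))
A[0] *= 50.0  # an unusually large row should receive a high score
tau = leverage_scores(A)
idx = leverage_sample(A, ell=20, rng=rng)
sketch = A[idx]
```

In this toy example the rescaled first row dominates one direction of the column space, so its score is close to one and it is almost always retained in the sketch.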
2.3 Frequent Directions
Frequent directions extends the idea of frequent items in the item frequency approximation problem to a matrix Liberty2013 ; Ghashami2016 ; Teng2019 . It is also used to generate a sketch for a matrix, but in a data streaming environment. As the rows of are fed in one by one, the original idea of frequent directions is to first perform a Singular Value Decomposition (SVD) on the first rows of and shrink the top singular values by the same amount, which is determined by the th singular value, and then save the product of the shrunken top singular values and the top right singular vectors as a sketch for the first rows of . When the next rows are fed in, they are appended to the sketch and the shrink and product steps are performed again. This process is repeated until the final sketch for is reached. Unlike the leverage score sampling sketching technique, frequent directions guarantees a deterministic bound on the accuracy of the sketch: with and denotes the best rank- approximation of Liberty2013 ; Ghashami2016 .
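The streaming procedure can be illustrated with a short block-wise variant of frequent directions. This is a sketch under stated assumptions rather than the exact algorithm of Liberty2013 : rows arrive in blocks of the sketch size, and the shrinkage amount is taken to be the squared singular value just beyond the retained ones, because the index is not legible in the text above. The names `frequent_directions`, `ell`, `err` and `bound` are hypothetical.

```python
import numpy as np


def frequent_directions(A, ell):
    """Stream the rows of A in blocks of ell; return a sketch of at most ell rows."""
    n, d = A.shape
    B = np.zeros((0, d))
    for start in range(0, n, ell):
        # Append the incoming rows behind the current sketch.
        stacked = np.vstack([B, A[start:start + ell]])
        _, s, Vt = np.linalg.svd(stacked, full_matrices=False)
        # Assumed shrinkage: squared (ell+1)-th singular value, zero if absent.
        delta = s[ell] ** 2 if len(s) > ell else 0.0
        k = min(ell, len(s))
        shrunk = np.sqrt(np.maximum(s[:k] ** 2 - delta, 0.0))
        # Shrunken singular values times the top right singular vectors.
        B = shrunk[:, None] * Vt[:k]
    return B


rng = np.random.default_rng(0)
A = rng.standard_normal((300, 20))
ell = 10
B = frequent_directions(A, ell)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)        # spectral-norm error
bound = np.linalg.norm(A, "fro") ** 2 / ell       # bound for this variant
```

For this variant every shrink step removes at least `ell` times the shrinkage amount from the squared Frobenius norm, so the covariance error `err` cannot exceed `bound` regardless of the data.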
Inspired by the routine of frequent directions in a streaming data environment, our OLSS method is constructed as follows: First initialize a ‘sketch’ matrix and a corresponding . For the first task , we randomly select rows of and (the corresponding rows of) without replacement according to the leverage score sampling defined above with probability distribution based on ’s leverage scores, then train the model on the sketch (, ); after seeing Task 2, we append (, ) to the sketch (, ) respectively and again randomly select out of data samples according to the leverage score sampling with the probability distribution based on the leverage scores of , and form a new sketch and , then train on it. This process is repeated until the end of the task sequence. We present the step by step procedure in Algorithm 1.
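The procedure just described can be summarized in code. The sketch below follows the text (append the new task to the current sketch, compute leverage scores of the concatenated inputs, sample rows without replacement, train on the result) but is not the authors' Algorithm 1: leverage scores use the standard SVD-based definition, the training step is a user-supplied callback, and the names `olss`, `train_fn` and `ell` are hypothetical.

```python
import numpy as np


def leverage_scores(A):
    """Squared row norms of the left singular vectors of A (thin SVD)."""
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    keep = s > s.max() * max(A.shape) * np.finfo(float).eps
    return np.sum(U[:, keep] ** 2, axis=1)


def olss(tasks, ell, train_fn, seed=0):
    """tasks yields (inputs, targets) pairs; train_fn(inputs, targets) is user code."""
    rng = np.random.default_rng(seed)
    A_hat = B_hat = None
    for A_i, B_i in tasks:
        if A_hat is None:
            A_cat, B_cat = A_i, B_i
        else:
            # Append the new task behind the current sketch.
            A_cat = np.vstack([A_hat, A_i])
            B_cat = np.vstack([B_hat, B_i])
        tau = leverage_scores(A_cat)
        size = min(ell, int(np.count_nonzero(tau)))
        idx = rng.choice(len(A_cat), size=size, replace=False, p=tau / tau.sum())
        # Inputs and targets are selected with the same row indices.
        A_hat, B_hat = A_cat[idx], B_cat[idx]
        train_fn(A_hat, B_hat)
    return A_hat, B_hat


rng = np.random.default_rng(1)
tasks = []
for _ in range(3):
    A_i = rng.standard_normal((50, 8))
    tasks.append((A_i, A_i.sum(axis=1, keepdims=True)))  # toy targets
seen_shapes = []
A_hat, B_hat = olss(tasks, ell=50,
                    train_fn=lambda A, B: seen_shapes.append(A.shape))
```

Because the sketch never grows beyond `ell` rows, the amount of data passed to the training step stays constant no matter how many tasks have been seen.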
2.4 Main Algorithm
Leverage score sampling and frequent directions both come with theoretical accuracy bounds, in terms of the sketch , on the error term . The bounds show that the sketch contains the relevant information used to form the covariance matrix of all the data samples ; in other words, the sketch captures the relationship among the data samples in the feature space (which is of dimension ). For a sequence of tasks, it is common to have noisy data samples or interruptions among samples for different tasks. The continuous update of important rows in a matrix (data samples for a sequence of tasks), or equivalently the continuous effective forgetting of less useful rows, may serve as a filter that removes the unwanted noise.
Unlike most existing methods, Algorithm 1 does not work directly with the training model; instead, it can be considered a data pre-processing step that constantly extracts useful information from previous and current tasks. Because of its parallel construction, OLSS could be combined with any of the aforementioned algorithms to further improve performance.
Regarding the computational complexity, when is large, the SVD of in Step 6 is computationally expensive, taking time. This step computes the leverage scores, and it can be sped up significantly with the leverage score approximation techniques in the literature Drineas2012 ; Cohen2017 ; Rudi2018 . For example, with the randomized algorithm in Drineas2012 , the leverage scores for could be approximated in time.
However, one possible drawback of the above procedure is that the relationship represented in a covariance matrix is linear, so any underlying nonlinear connections among the data samples may not be fully captured in the sketch. Furthermore, the structure of the function would also affect the information required to be kept in the sketch in order to perform well on solving the least squares problem in (2). As such, there may exist certain underlying dependency of a data sample’s importance on the DNN model architecture. This remains a future research direction.
We evaluate the performance of the proposed algorithm OLSS on three classification tasks used as benchmarks in related prior work.
Incremental CIFAR100 Rebuffi2017 ; Zenke2017 : a variant of the CIFAR object recognition dataset with 100 classes Krizhevsky2009 . The experiment is on 20 tasks and each task consists of 5 classes; each task consists of training and testing samples. Each task introduces a new set of classes: over the 20 tasks, each new task concerns examples from a disjoint subset of 5 classes.
In the setting of Lopez-Paz2017 for incremental CIFAR100, a softmax layer is added to the output vector which only allows entries representing the classes in the current task to output values larger than . In our setting, we allow the entries representing all the past occurring classes to output values larger than . We believe this is a more natural setup for continual learning.
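The difference between the two output settings can be illustrated with a masked softmax. This is a hypothetical sketch: it assumes that classes outside the allowed set receive zero probability and that classes are assigned to tasks in contiguous blocks of five, neither of which is stated explicitly above. The function names are assumptions.

```python
import math
from typing import Iterable, List, Sequence


def masked_softmax(logits: Sequence[float], allowed: Iterable[int]) -> List[float]:
    """Softmax restricted to the allowed class indices; all others get zero."""
    allowed_set = set(allowed)
    if not allowed_set:
        raise ValueError("at least one class must be allowed")
    peak = max(logits[c] for c in allowed_set)  # subtract for numerical stability
    exps = [math.exp(z - peak) if c in allowed_set else 0.0
            for c, z in enumerate(logits)]
    total = sum(exps)
    return [e / total for e in exps]


def classes_current(task_index: int, classes_per_task: int = 5) -> range:
    """Classes of the current task only (the setting attributed to GEM above)."""
    start = task_index * classes_per_task
    return range(start, start + classes_per_task)


def classes_seen(task_index: int, classes_per_task: int = 5) -> range:
    """All classes introduced up to and including the current task."""
    return range((task_index + 1) * classes_per_task)


logits = [0.0] * 100
probs_current = masked_softmax(logits, classes_current(1))
probs_seen = masked_softmax(logits, classes_seen(1))
```

With uniform logits at the second task, the first setting spreads the probability over five classes and the second over ten, so the second setting forces the model to discriminate against classes from earlier tasks as well.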
For the aforementioned experiments, we compare the performance of the following algorithms:
A simple SGD predictor.
iCaRL Rebuffi2017 : it classifies based on a nearest-mean-of-exemplars rule, keeps an episodic memory and updates its exemplar set continuously to prevent catastrophic forgetting. It is only applicable to the incremental CIFAR100 experiment because it requires the same input representation across tasks.
In addition to these, experiments were also conducted using SI Zenke2017 on the same three tasks. However, we observed no significant improvement in performance and a sensitivity to the choice of learning rate, with learning ability being relatively better than online EWC. As such, we do not show SI performance in our plots. It can, however, be tested using our open-sourced code for this paper (Footnote 4: the PyTorch code base implementing all the experiments presented here will be made publicly available on GitHub).
The competing algorithms SGD, EWC, GEM and iCaRL were implemented based on the publicly available code from the original authors of the GEM paper Lopez-Paz2017 ; a plain SGD optimizer is used for all algorithms. The DNN used for rotated and permuted MNIST is an MLP with hidden layers, each with rectified linear units, whereas a smaller version of ResNet18 Heetal2016 , with three times fewer feature maps across all layers, is used for the incremental CIFAR100 experiment. We train epochs with batch size on the rotated and permuted MNIST datasets and epochs with batch size on incremental CIFAR100. The regularization and memory hyper-parameters in EWC, iCaRL and GEM were set as described in Lopez-Paz2017 . The space parameter for our OLSS algorithm was set to be equal to the number of samples in each task. The learning rate for each algorithm was determined through a grid search on . The final learning rates used in each experiment for the different algorithms were set as follows:
rotated MNIST, SGD: , EWC: , GEM: and OLSS: ; permuted MNIST, SGD: , EWC: , GEM: and OLSS: and for incremental CIFAR100, SGD: , EWC: , GEM: , iCaRL: and OLSS: .
To evaluate the performance of different algorithms, we examine
As observed from Figure 1 (left), across the three benchmarks OLSS achieves average task accuracy similar to or slightly higher than GEM and clearly outperforms SGD, EWC and iCaRL. This demonstrates the ability of OLSS to continuously select useful data samples with progressive learning and thereby overcome the catastrophic forgetting issue. In terms of maintaining the performance on the earliest task (Task 1) after training on a sequence of tasks, OLSS shows robust performance on par with GEM on rotated and permuted MNIST, and is slightly worse than GEM as the number of tasks increases in the case of incremental CIFAR100. However, both of these methods significantly outperform SGD, EWC and iCaRL.
In order to compare the computational time across the methods, we report the wall clock time in Table 1. SGD is noticeably the fastest among all the algorithms, but it performs the worst, as observed in Figure 1. It is followed by OLSS and EWC (only in the case of CIFAR100 is EWC relatively faster than OLSS). The algorithms iCaRL and GEM both demand much higher computational costs, with GEM being significantly slower than the rest. This behavior is expected because GEM requires additional constraint validation and, on certain occasions, a gradient projection step to correct for constraint violations across data samples from previously learned tasks stored in the memory buffer (see Section 3 in Lopez-Paz2017 ). As such, although the buffered replay-memory approach in GEM prevents catastrophic forgetting, its computational cost becomes prohibitively high for online use while training DNNs in sequential multi-task learning scenarios.
Based on the performance and computational efficiency on all three datasets, OLSS emerges as the most favorable among the current state-of-the-art algorithms for continual learning that we compared.
The space parameter of OLSS ( in Algorithm 1) could be varied to balance its accuracy and efficiency. Here the choice of (number of samples in the current task) is made so that the number of training samples is standardized across all algorithms, enabling effective compression and extraction of data samples for OLSS in a straightforward comparison. However, note that if , OLSS does require some additional memory in order to compute the SVD of the concatenated sketch of previous tasks and the current task. Unless the algorithm is run in an edge computing environment with limited on-chip memory, this issue could be ignored.
On the other hand, GEM and iCaRL keep an extra episodic memory throughout the training process. Memory size was set to be for GEM and for iCaRL by considering the accuracy and efficiency in the experiments. Varying the size of the episodic memory would also affect their performance as well as the running time. As described earlier, GEM requires a constraint validation step and a potential gradient projection step for every update of the model parameters. As such, the computational time complexity in this case is proportional to the product of the number of samples kept in the episodic memory, the number of parameters in the model and the number of iterations required to converge. In contrast, OLSS uses an SVD to compute the leverage scores for each task, which can be achieved with a time complexity proportional to the product of the square of the number of features and the number of data samples. This is considerably less than for GEM, as shown in Table 1. The computational complexity can be further reduced with fast leverage score approximation methods such as the randomized algorithm in Drineas2012 .
As shown in Figure 2, after training on the whole sequence of tasks, both GEM and OLSS are able to preserve the accuracy for most tasks on rotated and permuted MNIST. Nevertheless, it is difficult for all algorithms to completely recover the accuracy of previously trained tasks on CIFAR100. In the case of synaptic consolidation based methods like EWC, the loss function contains an additional regularization or penalty term for each previously trained task. These additional penalties are isolated from each other. As the number of tasks increases, the model may lose elasticity in consolidating the overlapping parameters, and as such shows a steeper slope in the EWC plot of Figure 2.
We presented a new approach to the continual learning problem with deep neural networks. It is inspired by the randomization and compression techniques typically used in statistical analysis. We combined a simple importance sampling technique, leverage score sampling, with the frequent directions concept and developed an online effective forgetting or compression mechanism that preserves meaningful information from previous and current tasks, enabling continual learning across a sequence of tasks. Despite its simple structure, the results on classification benchmark experiments (designed for the catastrophic forgetting issue) demonstrate its effectiveness compared to the recent state of the art.
-  M. K. Benna and S. Fusi, Computational Principles of Synaptic Memory Consolidation, in Nature Neuroscience, 19(12), pp. 1697-1708, 2016.
-  M. B. Cohen, C. Musco and C. Musco, Input Sparsity Time Low-Rank Approximation via Ridge Leverage Score Sampling, in Proceedings of the 28th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1758 - 1777, 2017.
-  P. Drineas, M. W. Mahoney and S. Muthukrishnan, Relative-error CUR Matrix Decomposition, in SIAM Journal on Matrix Analysis and Applications, 30, pp.844-881, 2008.
-  P. Drineas, M. W. Mahoney, S. Muthukrishnan and T. Sarlós, Faster Least Squares Approximation, in Numerische Mathematik, 117, pp.219-249, 2010.
-  P. Drineas, M. Magdon-Ismail, M. W. Mahoney and D. P. Woodruff, Fast Approximation of Matrix Coherence and Statistical Leverage, in Journal of Machine Learning Research, 13, pp. 3441-3472, 2012.
-  C. Finn, P. Abbeel and S. Levine, Model-Agnostic Meta-learning for Fast Adaptation of Deep Networks, in International Conference on Machine Learning (ICML), 2017.
-  R. M. French, Catastrophic Forgetting in Connectionist Networks, in Trends in Cognitive Sciences, 1999.
-  S. Fusi, E. K. Miller and M. Rigotti, Why Neurons Mix: High Dimensionality for Higher Cognition, in Current Opinion in Neurobiology, 37, pp. 66-74, 2016.
-  M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. Ali Eslami and Y. W. Teh, Neural Processes, in Theoretical Foundations and Applications of Deep Generative Models Workshop, ICML, 2018.
-  M. Ghashami, E. Liberty, J. M. Phillips and D. P. Woodruff, Frequent Directions: Simple and Deterministic Matrix Sketching, in SIAM Journal on Computing, 45, pp. 1762-1792, 2016.
-  I. J. Goodfellow, M. Mirza, D. Xiao, A. Courville and Y. Bengio, An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks, arXiv, 2013.
-  D. Hassabis, D. Kumaran, C. Summerfield and M. Botvinick, Neuroscience-Inspired Artificial Intelligence, in Neuron, 95 (2), pp. 245-258, 2017.
-  K. He, X. Zhang, S. Ren and J. Sun, Deep Residual Learning for Image Recognition, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran and R. Hadsell, Overcoming Catastrophic Forgetting in Neural Networks, in PNAS, 2017.
-  A. Krizhevsky, Learning Multiple Layers of Features from Tiny Images, in Technical Report, University of Toronto, 2009.
-  Y. LeCun, C. Cortes and C. J. Burges, The MNIST Database of Handwritten Digits, URL: http://yann.lecun.com/exdb/mnist/, 1998.
-  Z. Li and D. Hoiem, Learning without Forgetting, in European Conference on Computer Vision (ECCV), pp. 614-629, 2016.
-  E. Liberty, Simple and Deterministic Matrix Sketching, in Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2013.
-  D. Lopez-Paz and M. Ranzato, Gradient Episodic Memory for Continual Learning, in Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017.
-  P. Ma, M. W. Mahoney and B. Yu, A Statistical Perspective on Algorithmic Leveraging, in Proceedings of the International Conference on Machine Learning (ICML), 2014.
-  M. W. Mahoney, Randomized Algorithms for Matrices and Data, in Foundations and Trends in Machine Learning, 3, pp. 123-224, 2011.
-  J. L. McClelland, B. L. McNaughton and R. C. O’Reilly, Why There Are Complementary Learning Systems in the Hippocampus and Neocortex: Insights From the Successes and Failures of Connectionist Models of Learning and Memory, in Psychological Review, 102, pp. 419-457, 1995.
-  M. McCloskey and N. J. Cohen, Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem, in the Psychology of Learning and Motivation, 24, pp. 104-169, 1989.
-  C. V. Nguyen, Y. Li, T. D. Bui and R. E. Turner, Variational Continual Learning, in International Conference on Learning Representations (ICLR), 2018.
-  A. Nichol, J. Achiam and J. Schulman, On First-Order Meta-Learning Algorithms, CoRR, 2018.
-  R. Pang, B. J. Lansdell and A. L. Fairhall, Dimensionality Reduction in Neuroscience, in Current Biology, 26(14), pp. R656-R660, 2016.
-  G. I. Parisi, R. Kemker, J. L. Part, C. Kanan and S. Wermter, Continual Lifelong Learning with Neural Networks: A Review, in Neural Networks, 2019.
-  S.-A. Rebuffi, A. Kolesnikov, G. Sperl and C. H. Lampert, iCaRL: Incremental Classifier and Representation Learning, in Proceedings of Conference on Computer Vision and Pattern Recognition, 2017.
-  A. Rudi, D. Calandriello, L. Carratino and L. Rosasco, On Fast Leverage Score Sampling and Optimal Learning, in Advances in Neural Information Processing Systems, pp. 5677 - 5687, 2018.
-  A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu and R. Hadsell, Progressive Neural Networks, arXiv:1606.04671, 2016.
-  P. Ruvolo and E. Eaton, ELLA: An Efficient Lifelong Learning Algorithm, in Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.
-  J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu and R. Hadsell, Progress & Compress: A Scalable Framework for Continual Learning, arXiv: 1805.06370, 2018.
-  H. Shin, J. K. Lee, J. Kim and J. Kim, Continual Learning with Deep Generative Replay, in Proceedings of the 31st Conference on Neural Information Processing Systems (NeurIPS), 2017.
-  D. Teng and D. Chu, Fast Frequent Directions for Low Rank Approximation, in IEEE Transactions on Pattern Analysis and Machine Intelligence, 41 (6), pp. 1279-1293, 2019.
-  M. K. Titsias, J. Schwarz, A. G. de G. Matthews, R. Pascanu and Y. W. Teh, Functional Regularisation for Continual Learning using Gaussian Processes, arXiv:1901.11356, 2019.
-  S. Thrun and T. Mitchell, Lifelong Robot Learning, in Robotics and Autonomous Systems, 15, pp. 25-46, 1995.
-  D. Woodruff, Sketching as A Tool for Numerical Linear Algebra, in Foundations and Trends ® in Theoretical Computer Science, 10, pp.1-157, 2014.
-  J. Yoon, E. Yang, J. Lee and S. J. Hwang, Lifelong Learning with Dynamically Expandable Networks, in Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
-  F. Zenke, B. Poole and S. Ganguli, Continual Learning Through Synaptic Intelligence, in Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.