1 Introduction
An important ability of humans is to continually build and update abstract concepts. Humans develop and learn abstract concepts to characterize and communicate their perceptions and ideas [1]. These concepts are often evolved and expanded efficiently as more experience about new domains is gained. Consider, for example, the concept of the printed character “4”. This concept is often taught to represent the “natural number four” in the mother tongue of elementary school students, e.g., English. Upon learning this concept, humans can efficiently expand it by observing only a few samples from other related domains, e.g., a variety of handwritten digits or printed digits in other secondary languages. Despite remarkable progress in artificial intelligence (AI) over the past decade, learning concepts efficiently in a way similar to humans remains an unsolved challenge for AI. This is because the exceptional progress of AI is mostly driven by the reemergence of deep neural networks. Since deep networks are trained in an end-to-end supervised learning setting, access to labeled data is necessary for learning any new distribution. For this reason, and despite the emergence of behaviors similar to the nervous system in deep nets [2], adapting a deep neural network to learn a concept in a new domain usually requires retraining the model from scratch, which is conditioned on the availability of a large number of labeled samples in the new domain. Moreover, training deep networks in a continual learning setting is challenging due to the phenomenon of “catastrophic forgetting” [3]. When a network is trained on multiple sequential tasks, the newly learned knowledge usually interferes with past learned knowledge, causing the network to forget what has been learned before.

In this paper, we develop a computational model that is able to expand and generalize learned concepts efficiently to new domains using a few labeled data points from the new domains. We rely on the Parallel Distributed Processing (PDP) paradigm [4] for this purpose. Work on semantic cognition within the parallel distributed processing framework hypothesizes that abstract semantic concepts are formed in higher-level layers of the nervous system [5, 6]. We model this hypothesis by assuming that the data points are mapped into an embedding space, which captures existing concepts. To prevent catastrophic forgetting, we rely on the Complementary Learning Systems (CLS) theory [7]. CLS theory hypothesizes that the continual lifelong learning ability of the nervous system is a result of a dual long-term and short-term memory system. The hippocampus acts as short-term memory and encodes recent experiences that are used to consolidate the knowledge in the neocortex as long-term memory through offline experience replays during sleep [8]. This suggests that if we store suitable samples from past domains in a memory buffer, like in the neocortex, these samples can be replayed along with current task samples from recent-memory hippocampal storage to train the base model jointly on the past and the current experiences to tackle catastrophic forgetting.
More specifically, we model the latent embedding space via the responses of a hidden layer in a deep neural network. Our idea is to stabilize and consolidate the data distribution in this space, where domain-independent abstract concepts are encoded. Doing so, new forms of concepts can be learned efficiently by coupling them to their past learned forms in the embedding space. Data representations in this embedding space can be considered as neocortical representations in the brain, where the learned abstract concepts are captured. We model concept learning in a sequential task learning framework, where learning concepts in each new domain is considered to be a task. To generalize the learned concepts without forgetting, we use an autoencoder as the base network to benefit from the efficient coding ability of deep autoencoders, and we model the embedding space as the middle layer of the autoencoder. This also makes our model generative, which can be used to implement the offline memory replay process in the sleeping brain [9]. To this end, we fit a parametric multimodal distribution to the training data representations in the embedding space. Points drawn from this distribution can be used to generate pseudo-data points through the decoder network for experience replay to prevent catastrophic forgetting. We demonstrate that this learning procedure enables the base model to generalize its learned concepts to new domains using a few labeled samples.

2 Related Work
Lake et al. [1] modeled human concept learning within a “Bayesian program learning” (BPL) paradigm. They present BPL as an alternative to deep learning for mimicking the learning ability of humans, as these models require considerably less training data. The concepts are represented as probabilistic programs that can generate additional instances of a concept given a few samples of that concept. However, the algorithm proposed by Lake et al. [1] requires human supervision and domain knowledge to tell the algorithm how the real-world concepts are generated. This approach seems feasible for the recognition task that they designed to test their idea, but it does not scale to other, more challenging concept learning problems. Our framework similarly relies on a generative model that can produce pseudo-samples of the learned concepts, but we follow an end-to-end deep learning scheme that automatically encodes concepts in the hidden layer of the network with minimal human supervision. Our approach can therefore be applied to a broader range of problems. The price is that we rely on data to train the model, but only a few data points are labeled. This is similar to humans, who also need practice to generate samples of a concept when they do not have domain knowledge [10]. This generative strategy has been used in the machine learning (ML) literature to address “few-shot learning” (FSL) [11, 12]. The goal of FSL is to adapt a model that is trained on a source domain with sufficient labeled data to generalize well on a related target domain with only a few labeled target data points. In our work, the domains are different but related in that similar concepts are shared across the domains.

Most FSL algorithms consider only one source and one target domain, which are learned jointly. Moreover, the main goal is to learn the target task. In contrast, we consider a continual learning setting in which the domain-specific tasks arrive sequentially. Hence, catastrophic forgetting becomes a major challenge. An effective approach to tackle catastrophic forgetting is to use experience replay [13, 14]. Experience replay addresses catastrophic forgetting by storing and continually replaying data points of past learned tasks. Consequently, the model retains the probability distributions of the past learned tasks. To avoid requiring a memory buffer to store past task samples, generative models have been used to produce pseudo-data points for past tasks. To this end, generative adversarial learning can be used to match the cumulative distribution of the past tasks with the current task distribution, which allows for generating pseudo-data points for experience replay [15]. Similarly, an autoencoder structure can also be used to generate pseudo-data points [16, 17]. Building upon our prior work [17], we develop a new method for generative experience replay to tackle catastrophic forgetting. Although prior works require access to labeled data for all the sequential tasks for experience replay, we demonstrate that experience replay is feasible even in the setting where only the initial task has fully labeled data. Our contribution is to combine ideas of few-shot learning with generative experience replay to develop a framework that can continually update and generalize learned concepts when new domains are encountered in a lifelong learning setting. We couple the distributions of the tasks in the middle layer of an autoencoder and use the shared distribution to expand concepts using a few labeled data points without forgetting the past.

3 Problem Statement and the Proposed Solution
In our framework, learning concepts in each domain is considered to be an ML task, e.g., different types of digit characters. We consider a continual learning setting [18], where an agent receives consecutive tasks in a sequence over its lifetime. The total number of tasks, the distributions of the tasks, and the order of the tasks are not known a priori. Since the agent is a lifelong learner, the current task is learned at each time step and the agent then proceeds to learn the next task. The knowledge gained from past experiences is used to learn the current task efficiently, i.e., using a minimal number of labeled data points. The newly learned knowledge from the current task is also accumulated with the past experiences to potentially ease learning in the future. Additionally, this accumulation must be done consistently to generalize the learned concepts, as the agent must perform well on all learned tasks, i.e., it must not forget. This is because the learned tasks may be encountered again at any time in the future. Figure 1 presents a high-level block-diagram visualization of this framework.
We model an abstract concept as a class within a domain-dependent classification task. Data points for each task $\mathcal{Z}^{(t)}$ are drawn i.i.d. from the joint probability distribution, i.e., $(x_i^{(t)}, y_i^{(t)}) \sim p^{(t)}(x, y)$, which has the marginal distribution $q^{(t)}(x)$ over $x$. We consider a deep neural network $f_\theta(\cdot): \mathbb{R}^d \to \mathbb{R}^k$ as the base learning model, where $\theta$ denotes the learnable weight parameters. A deep network is able to solve classification tasks by extracting task-dependent, high-quality features in a data-driven, end-to-end learning setting [19]. Within the PDP paradigm [4, 5, 6], this means that the data points are mapped into a discriminative embedding space, modeled by the network hidden layers, where the classes become separable, i.e., data points belonging to a class are grouped as an abstract concept. On this basis, the deep network $f_\theta$ is a functional composition of an encoder $\phi_v(\cdot): \mathbb{R}^d \to \mathcal{Z} \subset \mathbb{R}^f$ with learnable parameters $v$, which encodes the input data into the embedding space $\mathcal{Z}$, and a classifier subnetwork $h_w(\cdot): \mathbb{R}^f \to \mathbb{R}^k$ with learnable parameters $w$, which maps the encoded information into the label space. In other words, the encoder network changes the input data distribution as a deterministic function. Because the embedding space is discriminative, the data distribution in the embedding space would be a multimodal distribution that can be modeled as a Gaussian mixture model (GMM). Figure 1 visualizes this intuition based on experimental data, used in the experimental validation section.

Within the ML formalism, the agent can solve the task $\mathcal{Z}^{(1)}$ using standard empirical risk minimization (ERM). Given the labeled training dataset $\mathcal{D}^{(1)} = \langle X^{(1)}, Y^{(1)} \rangle$, where $X^{(1)} = [x_1^{(1)}, \ldots, x_{n_1}^{(1)}] \in \mathbb{R}^{d \times n_1}$ and $Y^{(1)} = [y_1^{(1)}, \ldots, y_{n_1}^{(1)}] \in \mathbb{R}^{k \times n_1}$, we can solve for the optimal weight parameters of the network: $\hat{\theta}^{(1)} = \arg\min_\theta \hat{e}_\theta = \arg\min_\theta \frac{1}{n_1} \sum_{i} \mathcal{L}_d(f_\theta(x_i^{(1)}), y_i^{(1)})$. Here, $\mathcal{L}_d(\cdot)$ is a suitable loss function, e.g., cross entropy. Conditioned on having a large enough number of labeled data points $n_1$, the empirical risk would be a suitable function to estimate the real risk function, $e = \mathbb{E}_{(x,y) \sim p^{(1)}(x,y)}\big(\mathcal{L}_d(f_\theta(x), y)\big)$ [20], as the Bayes optimal objective. Hence, the trained model will generalize well on test data points for the task $\mathcal{Z}^{(1)}$. Good generalization performance means that each class would be learned as a concept which is encoded in the hidden layers. Our goal is to consolidate these learned concepts and generalize them when the next tasks arrive with a minimal number of labeled data points. That is, for tasks $\mathcal{Z}^{(t)}$, $t > 1$, we have access to the dataset $\mathcal{D}^{(t)} = \langle \{X'^{(t)}, Y^{(t)}\}, X^{(t)} \rangle$, where $X'^{(t)}$ denotes the few labeled data points and $X^{(t)}$ denotes the unlabeled data points. This learning setting means that the learned concepts must be generalized to the subsequent domains with minimal supervision. Standard ERM cannot be used to learn the subsequent tasks because the number of labeled data points is not sufficient, i.e., overfitting would occur. Additionally, even in the presence of enough labeled data, catastrophic forgetting would be a consequence of using ERM. This is because the model parameters would be updated using solely the current task data, which can move the values of $\theta$ away from the values learned in the past time steps. Hence, the agent would not retain its learned knowledge.

Following the PDP hypothesis, our goal is to use the encoded distribution in the embedding space to expand the concepts that are captured in the embedding space such that catastrophic forgetting does not occur. The gist of our idea is to update the encoder subnetwork such that each subsequent task is learned with its distribution in the embedding space matching the distribution that is shared by $\{\mathcal{Z}^{(t)}\}_{t=1}^{T-1}$ at $t = T$. Since this distribution is initially learned via $\mathcal{Z}^{(1)}$ and subsequent tasks are enforced to share this distribution in the embedding space with $\mathcal{Z}^{(1)}$, we do not need to learn it from scratch, as the concepts are shared across the tasks. As a result, since the embedding space becomes invariant with respect to any learned input task, catastrophic forgetting would not occur.
The key challenge is to adapt the standard ERM such that the tasks share the same distribution in the embedding space. To this end, we modify the base network to form a generative autoencoder by amending the model with a decoder $\psi_u: \mathcal{Z} \to \mathbb{R}^d$. We train the model such that the encoder-decoder pair $(\phi_v, \psi_u)$ forms an autoencoder. Doing so, we enhance the ability of the model to encode the concepts as separable clusters in the embedding. We use the knowledge about the form of the data distribution in the embedding to match the distributions of all tasks in the embedding. This leads to consistent generalization of the learned concepts. Additionally, since the model is generative and knowledge about past experiences is encoded in the network, we can use the CLS process [7] to prevent catastrophic forgetting. When learning a new task, pseudo-data points for the past learned tasks can be generated by sampling from the shared distribution in the embedding and feeding the samples to the decoder subnetwork. These pseudo-data points are used along with the new task data to learn each task. Since the new task is learned such that its distribution matches the past shared distribution, pseudo-data points generated for learning future tasks would also represent the current task.
4 Proposed Algorithm
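Following the above framework, learning the first task ($t = 1$) reduces to minimizing the discrimination loss for classification and the autoencoder reconstruction loss to solve for the optimal parameters:

$$\min_{v,w,u} \mathcal{L}_c(X^{(1)}, Y^{(1)}) = \min_{v,w,u} \frac{1}{n_1} \sum_{i=1}^{n_1} \Big[ \mathcal{L}_d\big(h_w(\phi_v(x_i^{(1)})), y_i^{(1)}\big) + \gamma \mathcal{L}_r\big(\psi_u(\phi_v(x_i^{(1)})), x_i^{(1)}\big) \Big] \qquad (1)$$

where $\phi_v$, $h_w$, and $\psi_u$ denote the encoder, classifier, and decoder subnetworks, $\mathcal{L}_d$ is the discrimination loss, $\mathcal{L}_r$ is the reconstruction loss, $\mathcal{L}_c$ is the combined loss, and $\gamma$ is a trade-off parameter.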
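As a minimal sketch of the combined objective described above, the snippet below averages a discrimination loss plus a weighted reconstruction loss over a batch. It assumes cross entropy for the discrimination loss and mean squared error for the reconstruction loss; the function names and the batch layout are hypothetical, and the forward passes through the encoder, classifier, and decoder are assumed to have been computed elsewhere.

```python
import math

def cross_entropy(probs, label):
    """Discrimination loss for one sample: -log probability of the true class."""
    return -math.log(max(probs[label], 1e-12))

def squared_error(x, x_hat):
    """Reconstruction loss for one sample: mean squared error (an assumption)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def combined_loss(batch, gamma=1.0):
    """Average of discrimination loss + gamma * reconstruction loss.

    Each item of `batch` is (x, label, probs, x_hat): the input, its class
    index, the classifier output on the embedding, and the decoder output.
    """
    total = 0.0
    for x, label, probs, x_hat in batch:
        total += cross_entropy(probs, label) + gamma * squared_error(x, x_hat)
    return total / len(batch)

# Toy batch with two 2-dimensional inputs and two classes.
batch = [
    ([0.0, 1.0], 0, [0.5, 0.5], [0.0, 0.0]),
    ([1.0, 1.0], 1, [0.2, 0.8], [1.0, 1.0]),
]
loss = combined_loss(batch, gamma=2.0)
```

Setting `gamma` to zero recovers plain classification ERM, while larger values put more weight on reconstruction quality.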
If the base learning model is complex enough, the concepts would be formed in the embedding space as separable clusters upon learning the first task. This means that the data distribution can be modeled as a GMM distribution in the embedding. We can use standard methods such as expectation maximization to fit a GMM distribution with $k$ components to the multimodal empirical distribution formed by the drawn samples in the embedding space. Let $\hat{p}_J^{(0)}(z)$ denote the estimated parametric GMM distribution. The goal is to retain this initial estimate, which captures the concepts, when future domains are encountered. Following the PDP framework, we learn the subsequent tasks such that the current task shares the same GMM distribution with the previously learned tasks in the embedding space. We also update the estimate of the shared distribution after learning each subsequent task. Updating this distribution means generalizing the concepts to the new domains without forgetting the past domains. As a result, the distribution $\hat{p}_J^{(t-1)}(z)$ captures knowledge about the past domains when $\mathcal{Z}^{(t)}$ is being learned. Moreover, we can perform experience replay by generating pseudo-data points, first drawing samples from $\hat{p}_J^{(t-1)}(z)$ and then passing the samples through the decoder subnetwork. The remaining challenge is to update the model such that each subsequent task is learned with its corresponding empirical distribution matching $\hat{p}_J^{(t-1)}(z)$ in the embedding space. Doing so ensures the suitability of a GMM to model the empirical distribution.

To match the distributions, let $\mathcal{D}_{ER}^{(t)} = \langle X_{ER}^{(t)}, Y_{ER}^{(t)} \rangle$, with $X_{ER}^{(t)} = \psi_u(Z_{ER}^{(t)})$, denote the pseudo-dataset for the past tasks, generated for experience replay from samples $Z_{ER}^{(t)}$ of $\hat{p}_J^{(t-1)}(z)$ when $\mathcal{Z}^{(t)}$ is being learned. Following the described framework, we form the following optimization problem to learn $\mathcal{Z}^{(t)}$ and generalize the concepts:

$$\min_{v,w,u} \; \mathcal{L}_c(X'^{(t)}, Y^{(t)}) + \mathcal{L}_c(X_{ER}^{(t)}, Y_{ER}^{(t)}) + \eta D\big(\phi_v(q^{(t)}(X^{(t)})), \hat{p}_J^{(t-1)}(Z_{ER}^{(t)})\big) + \lambda \sum_{j=1}^{k} D\big(\phi_v(q^{(t)}(X'^{(t)})) \,|\, C_j, \; \hat{p}_J^{(t-1)}(Z_{ER}^{(t)}) \,|\, C_j\big) \qquad (2)$$

where $D(\cdot, \cdot)$ is a suitable metric function to measure the discrepancy between two probability distributions, $C_j$ denotes the $j$-th concept (class), and $\eta$ and $\lambda$ are trade-off parameters. The first two terms in Eq. (2) denote the combined loss terms for the few labeled data points of the current task and for the generated pseudo-dataset, defined similarly to Eq. (1). The third and the fourth terms implement our idea and enforce the distribution of the current task to be close to the distribution shared by the past learned tasks. The third term is added to minimize the distance between the distribution of the current task and $\hat{p}_J^{(t-1)}$ in the embedding space. Data labels are not needed to compute this term. The fourth term may look similar, but note that we have conditioned the distance between the two distributions on the concepts to avoid the matching challenge, i.e., the case when wrong concepts (or classes) across two tasks are matched in the embedding space [21]. We use the few labeled data points that are accessible for the current task to compute this term. Adding these terms guarantees that we can continually use a GMM to model the shared distribution in the embedding.
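The replay step described above (sample from the fitted mixture, then decode) can be sketched as follows. This is an illustration under simplifying assumptions rather than the authors' procedure: instead of expectation maximization it estimates one diagonal Gaussian per class directly from labeled embeddings, and the `decoder` argument is a hypothetical callable standing in for the decoder subnetwork.

```python
import random
from collections import defaultdict

def fit_diagonal_gmm(embeddings, labels):
    """Estimate one diagonal Gaussian per class from labeled embeddings.

    Returns {class: (weight, means, stds)}. Labels are used in place of EM,
    which is a simplification made for this sketch.
    """
    groups = defaultdict(list)
    for z, c in zip(embeddings, labels):
        groups[c].append(z)
    gmm = {}
    for c, zs in groups.items():
        dim = len(zs[0])
        means = [sum(z[d] for z in zs) / len(zs) for d in range(dim)]
        stds = [
            (sum((z[d] - means[d]) ** 2 for z in zs) / len(zs)) ** 0.5
            for d in range(dim)
        ]
        gmm[c] = (len(zs) / len(embeddings), means, stds)
    return gmm

def sample_pseudo_dataset(gmm, decoder, n, rng):
    """Draw n (pseudo-input, label) pairs: sample z from the mixture, then decode."""
    classes = list(gmm)
    weights = [gmm[c][0] for c in classes]
    data = []
    for _ in range(n):
        c = rng.choices(classes, weights=weights, k=1)[0]
        _, means, stds = gmm[c]
        z = [rng.gauss(m, s) for m, s in zip(means, stds)]
        data.append((decoder(z), c))
    return data

# Toy usage with a hypothetical identity decoder.
rng = random.Random(0)
embeddings = [[0.0, 0.1], [0.2, -0.1], [5.0, 5.1], [4.8, 4.9]]
labels = [0, 0, 1, 1]
gmm = fit_diagonal_gmm(embeddings, labels)
replay = sample_pseudo_dataset(gmm, decoder=lambda z: z, n=6, rng=rng)
```

Because each pseudo-sample is drawn from a known mixture component, it comes with a label for free, which is what allows the replayed data to be used in a supervised loss.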
The main remaining question is the selection of a suitable probability distance metric $D(\cdot, \cdot)$. Common probability distance measures such as the Jensen–Shannon divergence or the KL divergence are not applicable to our problem, as the gradient of these measures is zero when the corresponding distributions have non-overlapping supports [22]. Since deep learning optimization problems are solved using first-order gradient-based optimization methods, we must select a distribution metric which has non-vanishing gradients. For this reason, we select the Wasserstein Distance (WD) metric [23], which satisfies this requirement and has recently been used extensively in deep learning applications to measure and minimize the distance between two probability distributions [24]. In particular, we use the Sliced Wasserstein Distance (SWD) [25], which is a suitable approximation of WD that can be computed efficiently using empirical samples drawn from the two distributions. Our concept learning algorithm, Efficient Concept Learning Algorithm (ECLA), is summarized in Algorithm 1.
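To illustrate why SWD is cheap to compute from samples, the following is a minimal sketch of a Monte Carlo estimate of the squared sliced 2-Wasserstein distance. It assumes both sample sets have the same size, and the function name, the number of projections, and the default seed are choices made for this example.

```python
import math
import random

def sliced_wasserstein(xs, ys, n_projections=50, rng=None):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance
    between two equal-size sample sets (lists of d-dimensional points)."""
    assert len(xs) == len(ys), "this sketch assumes equal sample sizes"
    rng = rng or random.Random(0)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        # Draw a random direction on the unit sphere.
        theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(t * t for t in theta))
        theta = [t / norm for t in theta]
        # Project both sample sets onto the direction and sort them.
        px = sorted(sum(a * t for a, t in zip(x, theta)) for x in xs)
        py = sorted(sum(a * t for a, t in zip(y, theta)) for y in ys)
        # In one dimension, optimal transport pairs the sorted samples.
        total += sum((a - b) ** 2 for a, b in zip(px, py)) / len(px)
    return total / n_projections

# Toy usage: a point cloud and two shifted copies of it.
xs = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
near = [[x + 1.0, y + 0.5] for x, y in xs]
far = [[x + 2.0, y + 1.0] for x, y in xs]
d_near = sliced_wasserstein(xs, near)
d_far = sliced_wasserstein(xs, far)
```

Each slice only requires a projection and a sort, and the estimate stays informative (and differentiable with respect to the samples) even when the two sample sets do not overlap, which is the property motivating its use here.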
5 Theoretical Analysis
We follow a standard PAC-learning style framework to analyze our algorithm [20] and use a result from domain adaptation [26] to demonstrate its effectiveness. We perform the analysis in the embedding space $\mathcal{Z}$, where the hypothesis class is the set of all the classifiers $h_w(\cdot)$ parameterized by $w$. For any given model in this class, let $e_t$ denote the observed risk for the domain that contains the task $\mathcal{Z}^{(t)}$, let $e_{t'}$ denote the observed risk for the same model on another secondary domain, and let $w^*$ denote the optimal parameter for training the model on these two tasks jointly, i.e., $w^* = \arg\min_w e_C(w) = \arg\min_w \{e_t + e_{t'}\}$. We also denote the Wasserstein distance between two given distributions as $W(\cdot, \cdot)$. We rely on the following theorem [26], which relates the performance of a model trained on a particular domain to its performance on another secondary domain.

Theorem 5.1.
Consider two tasks $\mathcal{Z}^{(t)}$ and $\mathcal{Z}^{(t')}$, and a model $f_{\theta^{(t')}}$ trained for $\mathcal{Z}^{(t')}$. Then for any $d' > d$ and $\zeta < \sqrt{2}$, there exists a constant number $N_0$ depending on $d'$ such that for any $\xi > 0$ and $\min(n_t, n_{t'}) \ge N_0 \max(\xi^{-(d'+2)}, 1)$, with probability at least $1 - \xi$ for all $f_{\theta^{(t')}}$, the following holds:

$$e_t - e_{t'} \le W(\hat{p}^{(t)}, \hat{p}^{(t')}) + e_C(w^*) + \sqrt{2 \log(1/\xi) / \zeta} \left( \sqrt{1/n_t} + \sqrt{1/n_{t'}} \right) \qquad (3)$$

where $\hat{p}^{(t)}$ and $\hat{p}^{(t')}$ are the empirical distributions formed by the samples drawn from $p^{(t)}$ and $p^{(t')}$.

Theorem 5.1 is a broad result that provides an upper bound on the performance degradation of a trained model when it is used in another domain. It suggests that if the model performs well on $\mathcal{Z}^{(t')}$ and if the upper bound is small, then the model performs well on $\mathcal{Z}^{(t)}$. The last term is a constant term which depends on the number of available samples. This term is negligible when the number of samples in both domains is large. The two important terms are the first and the second terms. The first term is the Wasserstein distance between the two distributions. It may seem that, according to this term, if we minimize the WD between the two distributions, then the model should perform well on $\mathcal{Z}^{(t)}$. But it is crucial to note that the upper bound depends on the second term as well. This term suggests that the base model should be able to learn both tasks jointly. However, in the presence of an “XOR classification problem”, the tasks cannot be learned by a single model [27]. This means that not only should the WD between the two distributions be small, but the distributions should also be aligned class-conditionally. Building upon Theorem 5.1, we provide the following theorem for our framework.
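To give a feel for when the sample-dependent constant in such a bound becomes negligible, the sketch below evaluates it for a few sample sizes. It assumes the term has the form $\sqrt{2\log(1/\xi)/\zeta}\,(\sqrt{1/n} + \sqrt{1/m})$ used in the optimal-transport analysis of [26]; the function name and the default values of `xi` and `zeta` are illustrative assumptions.

```python
import math

def sample_term(n_first, n_second, xi=0.05, zeta=1.0):
    """Sample-size-dependent constant of the bound (assumed form):
    sqrt(2 * log(1 / xi) / zeta) * (sqrt(1 / n_first) + sqrt(1 / n_second)).
    """
    confidence = math.sqrt(2.0 * math.log(1.0 / xi) / zeta)
    return confidence * (math.sqrt(1.0 / n_first) + math.sqrt(1.0 / n_second))

# The term shrinks like 1/sqrt(n): quadrupling both sample sizes halves it.
small = sample_term(100, 100)
large = sample_term(400, 400)
few_shot = sample_term(60000, 50)  # one large set, one very small set
```

Under this assumed form the smaller of the two sample sets dominates the term, which is one way to read why the number of generated pseudo-data points matters in addition to the number of real samples.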
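Theorem 5.2.
Consider the ECLA algorithm at learning time step $t = T$. Then for all tasks $t < T$ and under the conditions of Theorem 5.1, we can conclude:

$$e_t \le e^J_{T-1} + W(\phi(q^{(t)}), \hat{p}_J^{(t)}) + \sum_{s=t}^{T-2} W(\hat{p}_J^{(s)}, \hat{p}_J^{(s+1)}) + e_C(w^*) + \sqrt{2 \log(1/\xi) / \zeta} \left( \sqrt{1/n_t} + \sqrt{1/n_{er}} \right) \qquad (4)$$

where $e^J_{T-1}$ denotes the risk for the pseudo-task with the distribution $\psi(\hat{p}_J^{(T-1)})$ and $n_{er}$ denotes the number of generated pseudo-data points.

Proof: In Theorem 5.1, consider the task $\mathcal{Z}^{(t)}$ with the distribution $\phi(q^{(t)})$ and the pseudo-task with the distribution $\hat{p}_J^{(T-1)}$ in the embedding space. We can use the triangle inequality recursively on the term $W(\phi(q^{(t)}), \hat{p}_J^{(T-1)})$ in Eq. (3), i.e., $W(\phi(q^{(t)}), \hat{p}_J^{(s)}) \le W(\phi(q^{(t)}), \hat{p}_J^{(s-1)}) + W(\hat{p}_J^{(s-1)}, \hat{p}_J^{(s)})$ for all time steps $t < s < T$. Adding up all the terms yields Eq. (4).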
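The recursive use of the triangle inequality in the proof can be written out explicitly. In the chain below, $\mu$ is a shorthand (assumed here for readability) for the embedded distribution of task $t$, and $\hat{p}^{(s)}$ stands for the GMM estimate after learning step $s$:

```latex
\begin{align*}
W(\mu, \hat{p}^{(T-1)})
  &\le W(\mu, \hat{p}^{(T-2)}) + W(\hat{p}^{(T-2)}, \hat{p}^{(T-1)}) \\
  &\le W(\mu, \hat{p}^{(T-3)}) + W(\hat{p}^{(T-3)}, \hat{p}^{(T-2)})
       + W(\hat{p}^{(T-2)}, \hat{p}^{(T-1)}) \\
  &\;\;\vdots \\
  &\le W(\mu, \hat{p}^{(t)}) + \sum_{s=t}^{T-2} W(\hat{p}^{(s)}, \hat{p}^{(s+1)}).
\end{align*}
```

Each step splits off one consecutive pair of estimates, so a task learned long ago accumulates one extra distance term per subsequently learned task.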
We can rely on Theorem 5.2 to demonstrate why our algorithm can generalize concepts without forgetting the past learned knowledge. The first term in Eq. (4), the risk on the pseudo-task, is small because experience replay minimizes this term using the labeled pseudo-dataset via ERM. The fourth term, the joint error, is small since we use the few labeled data points to align the distributions class-conditionally in Eq. (2). The last term is a negligible constant when the number of samples is large. The second term denotes the distance between the task distribution and the fitted GMM. When the PDP hypothesis holds and the model learns a task well, this term is small, as we can approximate the task distribution in the embedding space with the fitted GMM (see Ashtiani et al. [28] for a rigorous analysis of estimating a distribution with a GMM). In other words, this term is small if the classes are learned as concepts. Finally, the terms in the sum in Eq. (4) are minimized because at $t = s$ we draw samples from $\hat{p}_J^{(s-1)}$ and, by learning $\psi^{-1} = \phi$, enforce that $\hat{p}_J^{(s-1)} \approx \phi(\psi(\hat{p}_J^{(s-1)}))$. The sum in Eq. (4) models the effect of history. After learning a task and moving forward, this term potentially grows as more tasks are learned. This means that forgetting effects would increase as more subsequent tasks are learned, which is intuitive. To sum up, ECLA minimizes the upper bound of $e_t$ in Eq. (4). This means that the model can learn and remember $\mathcal{Z}^{(t)}$, which in turn means that the concepts have been generalized without being forgotten on the old domains.
6 Experimental Validation
We validate our method on learning two sets of sequential learning tasks: permuted MNIST tasks and digit recognition tasks. These are standard benchmark classification tasks for sequential task learning. We adjust them for our learning setting. Each class in these tasks is considered to be a concept, and each task of the sequence is considered to be learning the concepts in a new domain.
6.1 Learning permuted MNIST tasks
The permuted MNIST tasks are a standard benchmark designed for testing the ability of AI algorithms to overcome catastrophic forgetting [15, 29]. The sequential tasks are generated using the MNIST ($\mathcal{M}$) digit recognition dataset [30]. Each task in the sequence is generated by applying a fixed random shuffling to the pixel values of the digit images across the MNIST dataset [29]. As a result, the generated tasks are homogeneous in terms of difficulty and are suitable for controlled experiments. Our learning setting differs from prior works in that we consider the case where only the data for the initial MNIST task is fully labeled. In the subsequent tasks, only a few data points are labeled. To the best of our knowledge, no prior method addresses this learning scenario for direct comparison, so we only compare against: (a) classic backpropagation (BP) single-task learning, (b) full experience replay (FR) using the full stored data for all the previous tasks, and (c) learning using fully labeled data (CLEER) [17]. We use the same base network structure for all the methods for a fair comparison. BP is used to demonstrate that our method can address catastrophic forgetting. FR is used as a lower bound to demonstrate that our method is able to learn cross-task concepts without using fully labeled data. CLEER is an instance of ECLA where fully labeled data is used to learn the subsequent tasks. We use CLEER to compare our method against an upper bound.
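The task construction described above can be sketched in a few lines. The snippet assumes images are given as flat lists of pixel values and keeps the first task unpermuted; the function names and the seeding scheme are choices made for this illustration, not details taken from the benchmark papers.

```python
import random

def make_permutation(n_pixels, seed):
    """Return one fixed random ordering of pixel indices."""
    rng = random.Random(seed)
    perm = list(range(n_pixels))
    rng.shuffle(perm)
    return perm

def permute_image(image, perm):
    """Reorder the pixels of a flattened image according to perm."""
    return [image[i] for i in perm]

def make_permuted_tasks(images, n_tasks, base_seed=0):
    """Build a task sequence: the first task is the original data, and each
    later task applies one fixed permutation to every image."""
    n_pixels = len(images[0])
    tasks = [list(images)]
    for t in range(1, n_tasks):
        perm = make_permutation(n_pixels, base_seed + t)
        tasks.append([permute_image(img, perm) for img in images])
    return tasks

# Toy usage with two 4-pixel "images".
images = [[0, 1, 2, 3], [10, 11, 12, 13]]
tasks = make_permuted_tasks(images, n_tasks=4)
```

Because the same permutation is shared by all images within a task, every task contains exactly the same information as the original data, which is why the tasks are comparable in difficulty.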




We used standard stochastic gradient descent to learn the tasks and created learning curves by computing the performance of the model on the standard testing split of the current and the past learned tasks at each learning iteration. Figure 2 presents learning curves for four permuted MNIST tasks. Figure 2(a) presents learning curves for BP (dashed curves) and CLEER (solid curves). As can be seen, CLEER (i.e., ECLA with fully labeled data) is able to address catastrophic forgetting. Figure 2(b) presents learning curves for FR (dashed curves) and ECLA (solid curves) when 5 labeled data points per class are used. We observe that FR can tackle catastrophic forgetting perfectly, but the challenge is the memory buffer requirement, which grows linearly with the number of learned tasks, making this method only suitable for comparison as an upper bound. The FR result also demonstrates that if we can generate high-quality pseudo-data points, catastrophic forgetting can be prevented completely. Deviation of the pseudo-data from the real data is the major reason for the initial performance degradation of ECLA on all the past learned tasks when a new task arrives and its learning starts. This degradation can be ascribed to the existing distance between $\hat{p}_J^{(T-1)}$ and $\phi(q^{(s)})$ at $t = T$ for $s < T$. Note also that, as our theoretical analysis predicts, the performance on a past learned task degrades more as more tasks are learned subsequently. This is compatible with the nervous system, as memories fade out as time passes unless enhanced by continually experiencing a task or a concept.




In addition to requiring fully labeled data, we demonstrate that FR does not identify concepts across the tasks. To this end, we have visualized the testing data for all the tasks in the embedding space in Figure 2 for FR and ECLA after learning the fourth task. For visualization purposes, we have used UMAP [31], which reduces the dimensionality of the embedding space to two. In Figure 2(c) and Figure 2(d), each color denotes the data points of one of the digits (each circular shape is in fact a cluster of data points). We can see that the digits form separable clusters for both methods. This result is consistent with the PDP hypothesis and is the reason behind the good performance of both methods. It also demonstrates why a GMM is a suitable choice to model the data distribution in the embedding space. However, we can see that when FR is used, four distinct clusters are formed for each digit (i.e., one cluster per domain for each digit class). In other words, FR is unable to identify and generalize abstract concepts across the domains. In contrast, we have exactly ten clusters for the ten digits when ECLA is used, and hence the concepts are identified across the domains. This is the reason that we can generalize the learned concepts to new domains, despite using only a few labeled data points.
6.2 Learning sequential digit recognition tasks
We performed a second set of experiments on a more realistic scenario. We consider two handwritten digit recognition datasets for this purpose: the MNIST ($\mathcal{M}$) and USPS ($\mathcal{U}$) datasets. The USPS dataset is a more challenging classification task, as the size of the training set is smaller (20,000 compared to 60,000 images). We performed experiments on the two possible sequential learning scenarios, $\mathcal{M} \to \mathcal{U}$ and $\mathcal{U} \to \mathcal{M}$. The experiments can be considered as concept learning for numeral digits, as both tasks are digit recognition tasks but in different domains, i.e., written by different people.
Figure 3(a) and Figure 3(b) present learning curves for these two tasks when 10 labeled data points per class are used for the training of the second task. First, note that the network mostly retains the knowledge about the first task following the learning of the second task. Also note that the generalization to the second domain, i.e., the learning of the second task, is faster in Figure 3(a). Because the MNIST dataset has more training data points, the empirical distribution can capture the task distribution more accurately, and hence the concepts would be learned better, which in turn makes learning the second task easier. As expected from the theoretical justification, this empirical result suggests that the performance of our algorithm depends on the closeness of the distribution of the generated pseudo-data to the distributions of previous tasks, and that improving the probability estimation will boost the performance of our approach. We have also presented UMAP visualizations of the data points for the tasks in the embedding space in Figure 3(c) and Figure 3(d). We observe that the distributions are matched in the embedding space and cross-domain concepts are learned by the network. These results demonstrate that our algorithm, inspired by the PDP and CLS theories, can generalize concepts to new domains.
7 Conclusions
Inspired by the CLS theory and the PDP paradigm, we developed an algorithm that enables a deep network to update and generalize its learned concepts in a continual learning setting. Our generative framework is able to encode abstract concepts in a hidden layer of the deep network in the form of a parametric GMM distribution. This distribution can be used to generalize concepts to new domains, where only a few labeled samples are accessible. Additionally, the model is able to generate pseudo-data points for past tasks, which can be used for experience replay to tackle catastrophic forgetting. Future work will extend our model to detect new concepts automatically and to actively ask for a few labeled data points as samples of unseen concepts are encountered.
References
 [1] Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 [2] Yaniv Morgenstern, Mohammad Rostami, and Dale Purves. Properties of artificial networks evolved to contend with natural spectra. Proceedings of the National Academy of Sciences, 111(Supplement 3):10868–10872, 2014.
 [3] Robert M French. Catastrophic forgetting in connectionist networks. Trends in cognitive sciences, 3(4):128–135, 1999.
 [4] James L McClelland, David E Rumelhart, PDP Research Group, et al. Parallel distributed processing. Explorations in the Microstructure of Cognition, 2:216–271, 1986.
 [5] James L McClelland and Timothy T Rogers. The parallel distributed processing approach to semantic cognition. Nature reviews neuroscience, 4(4):310, 2003.
 [6] Andrew M Saxe, James L McClelland, and Surya Ganguli. A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, page 201820226, 2019.
 [7] James L McClelland, Bruce L McNaughton, and Randall C O’reilly. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review, 102(3):419, 1995.
 [8] S. Diekelmann and J. Born. The memory function of sleep. Nat Rev Neurosci, 11(114), 2010.
 [9] B. Rasch and J. Born. About sleep’s role in memory. Physiol Rev, 93:681–766, 2013.
 [10] Marieke Longcamp, Marie-Thérèse Zerbato-Poudou, and Jean-Luc Velay. The influence of writing practice on letter recognition in preschool children: A comparison between handwriting and typing. Acta psychologica, 119(1):67–79, 2005.
 [11] Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
 [12] Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017.
 [13] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
 [14] Anthony Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146, 1995.
 [15] Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pages 2990–2999, 2017.
 [16] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
 [17] Mohammad Rostami, Soheil Kolouri, and Praveen Pilly. Complementary learning for overcoming catastrophic forgetting using experience replay. In IJCAI, 2019.
 [18] Paul Ruvolo and Eric Eaton. Ella: An efficient lifelong learning algorithm. In International Conference on Machine Learning, pages 507–515, 2013.
 [19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
 [20] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
 [21] Amir Globerson and Sam T Roweis. Metric learning by collapsing classes. In Advances in neural information processing systems, pages 451–458, 2006.
 [22] Julien Rabin and Gabriel Peyré. Wasserstein regularization of imaging problem. In 2011 18th IEEE International Conference on Image Processing, pages 1541–1544. IEEE, 2011.
 [23] Nicolas Bonnotte. Unidimensional and evolution methods for optimal transportation. PhD thesis, Paris 11, 2013.
 [24] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE transactions on pattern analysis and machine intelligence, 39(9):1853–1865, 2017.
 [25] Nicolas Bonneel, Julien Rabin, Gabriel Peyré, and Hanspeter Pfister. Sliced and radon wasserstein barycenters of measures. Journal of Mathematical Imaging and Vision, 51(1):22–45, 2015.
 [26] Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.
 [27] Manish Mangal and Manu Pratap Singh. Analysis of multidimensional xor classification problem with evolutionary feedforward neural networks. International Journal on Artificial Intelligence Tools, 16(01):111–120, 2007.
 [28] Hassan Ashtiani, Shai Ben-David, Nicholas Harvey, Christopher Liaw, Abbas Mehrabian, and Yaniv Plan. Nearly tight sample complexity bounds for learning mixtures of gaussians via sample compression schemes. In Advances in Neural Information Processing Systems, pages 3412–3421, 2018.
 [29] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
 [30] Yann LeCun, Bernhard E Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne E Hubbard, and Lawrence D Jackel. Handwritten digit recognition with a backpropagation network. In Advances in neural information processing systems, pages 396–404, 1990.
 [31] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.