NeuroData's package for exploring and using progressive learning algorithms
In biological learning, data is used to improve performance on the task at hand, while simultaneously improving performance on both previously encountered tasks and as yet unconsidered future tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance degrades upon learning new tasks. Many recent approaches have attempted to mitigate this issue, called catastrophic forgetting, to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of progressive learning, whether biological or artificial, is to improve performance on all tasks (including past and future) with any new data. We propose a general approach to progressive learning that ensembles representations, rather than learners. We show that ensembling representations—including representations learned by decision forests or neural networks—enables both forward and backward transfer on a variety of simulated and real data tasks, including vision, language, and adversarial tasks. This work suggests that further improvements in progressive learning may follow from a deeper understanding of how biological learning achieves such high degrees of efficiency.
Classical supervised learning [Mohri2018-tf] considers random variables $(X, Y) \sim P_{X,Y}$, where $X$ is an $\mathcal{X}$-valued input, $Y$ is a $\mathcal{Y}$-valued response, and $P_{X,Y}$ is the joint distribution of $(X, Y)$. Given a loss function, the goal is to find the hypothesis, or predictor, $h : \mathcal{X} \to \mathcal{Y}$ that minimizes expected loss, or risk, $R(h)$. The minimum risk depends on the unknown $P_{X,Y}$.
A learning algorithm (or rule) is a sequence of functions, $f = \{f_n\}$, where each $f_n$ maps from $n$ training samples, $\mathbf{D}_n = \{(X_i, Y_i)\}_{i=1}^{n}$, to a hypothesis in a class of hypotheses $\mathcal{H}$, i.e., $f_n : (\mathcal{X} \times \mathcal{Y})^n \to \mathcal{H}$. A learning algorithm is evaluated on its generalization error (or expected risk) at a particular sample size $n$, $\mathbb{E}[R(f_n(\mathbf{D}_n))]$, where the expectation is taken with respect to $\mathbf{D}_n$. The goal is to find an $f$ with small generalization error, assuming each $(X_i, Y_i)$ pair is independent and identically distributed from some true but unknown $P_{X,Y}$ [Mohri2018-tf].
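The risk of a fixed hypothesis can be estimated by its average loss on held-out samples. The following is a minimal sketch of that estimate under 0-1 loss; the toy task and hypothesis are illustrative, not from the paper.

```python
import random

def zero_one_loss(y_hat, y):
    # 0-1 loss: 1 if the prediction is wrong, 0 otherwise
    return int(y_hat != y)

def empirical_risk(hypothesis, samples):
    # Average loss over a finite sample -- an estimate of the true risk,
    # which would be an expectation over the unknown joint distribution.
    return sum(zero_one_loss(hypothesis(x), y) for x, y in samples) / len(samples)

# Toy task: the label is the sign of the input, and the hypothesis
# thresholds at zero, so its estimated risk is 0.
hypothesis = lambda x: int(x > 0)
random.seed(0)
samples = [(x, int(x > 0)) for x in (random.uniform(-1, 1) for _ in range(100))]
print(empirical_risk(hypothesis, samples))  # 0.0
```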
Lifelong learning generalizes classical machine learning in two ways: (i) instead of one task, there is an environment of (possibly infinitely) many tasks, and (ii) data arrive sequentially, rather than in batch mode. This setting requires that an algorithm can generalize to “out-of-past-task” examples. The goal in lifelong learning, given data from a new task, is to use all the data from previous tasks to achieve lower generalization error on this new task, and use the new data to improve the generalization error on all the previous tasks. Note that this is much stronger than simply avoiding catastrophic forgetting, which would mean that generalization on past tasks does not degrade. We define a progressive learning system as one that improves performance on past tasks given new data.
Previous work in lifelong learning falls loosely into two algorithmic frameworks for multiple tasks: (i) learning models with parameters specific to certain tasks and parameters shared across tasks [ruvolo2013ella], and (ii) decreasing the “size” of the hypothesis class with respect to the amount of training data [Finn2019-yv]. Some approaches additionally store or replay (rehearse) previously encountered data to reduce forgetting [kirkpatrick2017overcoming]. The literature, however, has not yet codified general evaluation criteria to demonstrate lifelong learning on sequences of tasks [ll_nn_review].
We define transfer efficiency as the ratio of the generalization error of (i) an algorithm that has learned only from data associated with a given task, to (ii) the same learning algorithm that also has access to other data. Transfer efficiency is akin to relative efficiency from classical statistics [bickel2015mathematical]. Whereas relative efficiency typically compares two different estimators for a given sample size, transfer efficiency compares the same estimator on two different datasets (one a subset of the other). Let $R^t$ be the risk associated with task $t$. Let $\mathbb{E}[R^t(f_n(\mathbf{D}_n^t))]$ denote the error on task $t$ of the algorithm that consumes only task-$t$ data, $\mathbf{D}_n^t$. Similarly, let $\mathbb{E}[R^t(f_n(\mathbf{D}_n))]$ denote the error on task $t$ of the algorithm that consumes all the data, $\mathbf{D}_n$.
The transfer efficiency of algorithm $f$ for task $t$ with sample size $n$ is
$$TE_n^t(f) := \frac{\mathbb{E}[R^t(f_n(\mathbf{D}_n^t))]}{\mathbb{E}[R^t(f_n(\mathbf{D}_n))]}.$$
Algorithm $f$ transfer learns if and only if $TE_n^t(f) > 1$.
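The definition above reduces to a simple ratio once the two generalization errors have been estimated. A minimal sketch, with hypothetical error estimates (the numbers are made up, not measured values from the paper):

```python
def transfer_efficiency(err_task_only, err_all_data):
    # TE > 1 iff access to the other-task data reduced error on this task.
    return err_task_only / err_all_data

# Hypothetical Monte Carlo estimates of the two generalization errors:
te = transfer_efficiency(0.20, 0.16)
print(te)  # 1.25 > 1, so this (hypothetical) algorithm transfer-learned
```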
To evaluate a progressive learning algorithm while respecting the streaming nature of the tasks, it is convenient to consider two extensions of transfer efficiency. Transfer efficiency admits a factorization into forward and backward transfer efficiency: $TE_n^t(f) = FTE_n^t(f) \times BTE_n^t(f)$.
Forward transfer efficiency is the ratio of (i) the generalization error of the algorithm with access only to task $t$ data, to (ii) the generalization error of the algorithm with sequential access to the data up to and including the last observation from task $t$. Thus, this quantity measures the relative effect of previously seen out-of-task data on the performance on task $t$. Formally, let $n^t$ be the index of the last occurrence of task $t$ in the data sequence, and let $\mathbf{D}_n^{<t}$ be all data up to and including that data point. $f_n^{<t}$ is the learning algorithm that consumes only $\mathbf{D}_n^{<t}$ and outputs a hypothesis; denote the generalization error in this setting as $\mathbb{E}[R^t(f_n^{<t}(\mathbf{D}_n^{<t}))]$.
The forward transfer efficiency of $f$ for task $t$ given $n$ samples is
$$FTE_n^t(f) := \frac{\mathbb{E}[R^t(f_n(\mathbf{D}_n^t))]}{\mathbb{E}[R^t(f_n^{<t}(\mathbf{D}_n^{<t}))]}.$$
In task streaming settings, one can also determine the rate of backward transfer by comparing the above to the generalization error of the algorithm having sequentially seen the entire training sequence. Backward transfer efficiency is the ratio of (i) the generalization error of the algorithm with access to the data up to and including the last observation from task $t$, to (ii) the generalization error of the learning algorithm with access to the entire data sequence. Thus, this quantity measures the relative effect of future task data on the performance on task $t$.
The backward transfer efficiency of $f$ for task $t$ given $n$ samples is
$$BTE_n^t(f) := \frac{\mathbb{E}[R^t(f_n^{<t}(\mathbf{D}_n^{<t}))]}{\mathbb{E}[R^t(f_n(\mathbf{D}_n))]}.$$
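The factorization of transfer efficiency into its forward and backward components can be checked numerically. The error estimates below are illustrative placeholders, not results from the paper:

```python
def fte(err_task_only, err_up_to_t):
    # Forward transfer: effect of previously seen out-of-task data.
    return err_task_only / err_up_to_t

def bte(err_up_to_t, err_all):
    # Backward transfer: effect of future-task data.
    return err_up_to_t / err_all

def te(err_task_only, err_all):
    return err_task_only / err_all

# Hypothetical errors for (i) task-only data, (ii) data up to the last
# task-t observation, and (iii) the full sequence:
e_task, e_up_to_t, e_all = 0.30, 0.24, 0.20
product = fte(e_task, e_up_to_t) * bte(e_up_to_t, e_all)
# te(e_task, e_all) == product, since the intermediate error cancels
```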
Our approach to progressive intelligence relies on hypotheses that can be decomposed into three constituent parts: $h = w \circ v \circ u$. The transformer, $u$, maps an $\mathcal{X}$-valued input into an internal representation space [Vaswani2017-lq, Devlin2018-lk]. The voter, $v$, maps the transformed data point into a posterior distribution on the response space $\mathcal{Y}$. Finally, a decider, $w$, such as “argmax”, produces a predicted label (in coding theory, these three functions would be called the encoder, channel, and decoder, respectively [Cover2012-sl, Cho2014-ew]). Our key innovation is realizing that decision rules can ensemble representations learned by transformers across tasks. In particular, a representation learned for task $t$ might be a useful representation for task $t'$, and vice versa. Combining these two representations can improve performance on both $t$ and $t'$, and the approach extends to an arbitrary number of tasks (Figure 1). Composable hypotheses are more modular and flexible than recursive decompositions, an approach to transfer learning and multi-task learning discussed in [Thrun2012-sj].
Suppose after $n$ samples we have data from a set of $T$ tasks, or environment. We desire algorithms that use data from task $t'$ to transfer knowledge to task $t$, for all pairs $t, t'$. Let $h_t = w_t \circ v_t \circ u_t$ be the hypothesis function learned for task $t$. Define the cross-task posterior as the function $v_t \circ u_{t'}$ that votes on classes in task $t$ using the representation output by the transformer for task $t'$. For example, when using decision trees, this corresponds to learning the partition of a tree from task $t'$, and then pushing data from task $t$ through it to learn the task-$t$ vote. Given $T$ tasks, there are $T$ such cross-task posteriors for each task. The task-$t$ decider $w_t$ then combines the votes to obtain a final posterior on $\mathcal{Y}_t$ learned from all tasks, for example, by averaging. The task-$t$ ensembled hypothesis is thus:
$$h_t(x) = w_t\!\left(\frac{1}{T}\sum_{t'=1}^{T} (v_t \circ u_{t'})(x)\right). \qquad (1)$$
In progressive learning settings it is common for the learner to gain access to a new task after learning the first $T$ tasks. Using the above approach for ensembling representations, incorporating the information from this new dataset is straightforward. Indeed, it only requires learning a single-task composable hypothesis for the new task, the cross-task posteriors for that task, and new cross-task posteriors for the original $T$ tasks. The corresponding functions are updated by augmenting the environment with the new task, and then defining each ensembled hypothesis using Eq. 1. In all cases, we ensure that the posterior outputs are calibrated [Guo2019-xe].
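The decider step in Eq. 1 can be sketched concretely: average the cross-task posteriors for a task (one per transformer, all voting on that task's classes) and take the argmax. The posterior matrices below are hypothetical values for illustration:

```python
import numpy as np

def ensemble_predict(cross_task_posteriors):
    # cross_task_posteriors: list of (n_samples, n_classes) arrays, one per
    # task transformer, each voting on the same target task's classes.
    avg = np.mean(cross_task_posteriors, axis=0)  # the decider averages votes
    return np.argmax(avg, axis=1)                 # then outputs argmax

p1 = np.array([[0.7, 0.3], [0.4, 0.6]])  # votes via task 1's representation
p2 = np.array([[0.6, 0.4], [0.1, 0.9]])  # votes via task 2's representation
preds = ensemble_predict([p1, p2])       # averaged posterior -> [0, 1]
```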
A Lifelong Forest (L2F) is a decision-forest-based instance of ensembling representations. For each task, the transformer of an L2F is a decision forest [Amit1997-nd, breiman2001random]. The leaf nodes of each decision tree partition the input space [breiman1984classification]. The representation of $x$ corresponding to a single tree $b$ can be a one-hot encoded $L_b$-dimensional vector with a “1” in the location corresponding to the leaf of tree $b$ into which $x$ falls. The representation of $x$ resulting from the collection of $B$ trees simply concatenates the one-hot vectors from the $B$ trees. Thus, the transformer is the mapping from $x$ to a $B$-sparse vector of length $\sum_b L_b$. The in-task and cross-task posteriors are learned by populating the cells of the partitions and taking class votes with out-of-bag samples, as in ‘honest trees’ [breiman1984classification, denil14, Athey19]. The in-task and cross-task posteriors output the average normalized class votes across the collection of trees, adjusted for finite-sample bias [Guo2019-xe]. The decider averages the in-task and cross-task posterior estimates and outputs argmax to produce a single prediction, as per (1). Recall that honest decision forests are universally consistent classifiers and regressors [Athey19], meaning that with sufficiently large sample sizes, under suitable though general assumptions, they will converge to minimize risk. The single-task version of this approach simplifies to an approach called ‘Uncertainty Forests’ (UF) [Guo2019-xe].
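The leaf-indicator representation can be sketched as follows. Here each "tree" is stood in for by a hand-made stump mapping a point to a leaf index; a real L2F would of course use learned decision trees:

```python
import numpy as np

def one_hot(index, length):
    v = np.zeros(length)
    v[index] = 1.0
    return v

def forest_transformer(x, trees):
    # Concatenate one-hot leaf indicators across trees: for B trees with
    # L_b leaves each, this yields a B-sparse vector of length sum_b L_b.
    return np.concatenate([one_hot(leaf(x), n_leaves) for leaf, n_leaves in trees])

# Two hypothetical stumps, each splitting one coordinate into 2 leaves:
trees = [(lambda x: int(x[0] > 0), 2), (lambda x: int(x[1] > 0), 2)]
rep = forest_transformer(np.array([0.5, -0.2]), trees)
# rep has exactly one "1" per tree: [0, 1, 1, 0]
```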
A Lifelong Network (L2N) is a deep network (DN)-based instance of ensembling representations. For each task, the transformer in an L2N is the “backbone” of a DN, including all but the final layer. Thus, each transformer maps an element of $\mathcal{X}$ to an element of $\mathbb{R}^d$, where $d$ is the number of neurons in the penultimate layer of the DN. In practice, we use a LeNet [lecun1998gradient] trained using cross-entropy loss and the Adam optimizer [kingma2014adam] to learn the transformer. In-task and cross-task voters are learned via $k$-Nearest Neighbors ($k$-NN) [Stone1977-fi]. Recall that a $k$-NN, with $k$ chosen such that $k \to \infty$ and $k/n \to 0$ as the number of samples $n$ goes to infinity, is a universally consistent classifier [Stone1977-fi]. To calibrate the posteriors estimated by $k$-NN, we rescale them using isotonic regression [Niculescu-Mizil2005-sa, rajendran2019accurate].
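The $k$-NN voter admits a very short sketch: the posterior at a query point is the class-frequency among its $k$ nearest stored features. For simplicity the features here are 1-D scalars; in an L2N they would be penultimate-layer activations, and the resulting posteriors would additionally be calibrated:

```python
import numpy as np

def knn_posterior(query, features, labels, k, n_classes):
    # Posterior estimate: fraction of each class among the k nearest features.
    dists = np.abs(features - query)
    nearest = labels[np.argsort(dists)[:k]]
    return np.bincount(nearest, minlength=n_classes) / k

features = np.array([0.0, 0.1, 0.2, 1.0, 1.1, 1.2])
labels = np.array([0, 0, 0, 1, 1, 1])
post = knn_posterior(0.05, features, labels, k=3, n_classes=2)
# the 3 nearest neighbors of 0.05 are all class 0, so post == [1.0, 0.0]
```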
Consider a very simple two-task environment: Gaussian XOR and Gaussian Not-XOR (N-XOR), as depicted in Figure 2 (see Appendix A for details). In this environment, the optimal discriminant boundaries are “axis-aligned”, and the two tasks share the exact same decision boundary: the coordinate axes. Thus, transferring from one task to the other merely requires learning a bit flip.
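A sampler for this environment can be sketched as below. The blob means ($\pm 0.5$ on the diagonals) and variance are illustrative choices, not the paper's exact parameters; N-XOR is obtained by flipping the labels:

```python
import numpy as np

def sample_xor(n, var=0.1, flip=False, seed=0):
    # Gaussian XOR: class 0 blobs sit on one diagonal, class 1 on the other;
    # flip=True yields N-XOR (same distribution, labels flipped).
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)              # class labels
    signs = rng.integers(0, 2, size=n) * 2 - 1  # which of the two class blobs
    centers = 0.5 * np.stack([signs, np.where(y == 0, signs, -signs)], axis=1)
    X = centers + rng.normal(scale=np.sqrt(var), size=(n, 2))
    return X, (1 - y if flip else y)
```

With a shared seed, the XOR and N-XOR samples coincide except for the flipped labels, matching the "bit flip" relationship between the two tasks.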
Figure 2 shows the generalization error for Lifelong Forest and Uncertainty Forest on XOR. L2F and UF achieve the same generalization error on XOR, but UF does not improve its performance on XOR with N-XOR data (because it is a single task algorithm and therefore does not operate on other task data), whereas the performance of L2F continues to improve on XOR given N-XOR data, demonstrating forward transfer. Figure 2 also shows the generalization error for L2F and UF on N-XOR. In this case, UF was trained only on XOR. Both algorithms perform at chance levels until the first N-XOR data arrive. However, L2F improves more rapidly than UF, demonstrating backward transfer.
Finally, Figure 2 shows transfer efficiency for L2F. For XOR (task 1), forward transfer efficiency is one until N-XOR data arrive, and then it quickly ramps up prior to saturating. For N-XOR, backward transfer efficiency of L2F shoots up when N-XOR data arrive, but eventually converges to the same limiting performance of UF. Note that forward transfer efficiency is the ratio of generalization errors for XOR (left panel), and backward transfer efficiency is the ratio of the generalization errors for N-XOR (center panel).
Statistics has a rich history of robust learning [huber1996robust], and machine learning has recently focused on adversarial learning [Bruna2013-iq]. However, in both cases the focus is on adversarial examples, rather than adversarial tasks. In the context of progressive learning, we informally define a task to be adversarial with respect to task if the true joint distribution of task , without any domain adaptation, has no information about task . In other words, training data from task can only add noise, rather than signal, for task . An adversarial task for Gaussian XOR is Gaussian Rotated-XOR (R-XOR) (Figure 3, top). Training first on XOR therefore impedes the performance of Lifelong Forests on R-XOR, and thus forward transfer falls below one, demonstrating a graceful forgetting. Because the discriminant boundaries are learned imperfectly with finite data, data from R-XOR can actually improve performance on XOR, and thus backward transfer is above one.
Consider an environment with a three spiral and five spiral task (Figure 3, bottom). In this environment, axis-aligned splits are inefficient, because the optimal partitions can be approximated with fewer irregular polytopes than the orthotopes available from axis-aligned splits. The three spiral data helps the five spiral performance because the optimal partitioning for these two tasks is relatively similar to one another, as indicated by forward transfer increasing with increased five spiral data. This is despite the fact that the five spiral task requires more fine partitioning than the three spiral task. Because L2F grows relatively deep trees, it over-partitions space, thereby rendering tasks with more coarse optimal decision boundaries useful for tasks with more fine optimal decision boundaries. The five spiral data also improves the three spiral performance, as long as there are a sufficient number of samples from the five spiral task to adequately estimate the posteriors within each cell.
The CIFAR 100 challenge [krizhevsky2009learning] consists of 50,000 training and 10,000 test samples, each a 32x32 RGB image of a common object from one of 100 possible classes, such as apples and bicycles. CIFAR 10x10 divides these data into 10 tasks, each with 10 classes [Lee2019-eg] (see Appendix C for details). We compare Lifelong Forests and Lifelong Networks to several state-of-the-art deep-learning-based lifelong learning algorithms, including Deconvolution-Factorized CNNs (DF-CNN) [Lee2019-eg], elastic weight consolidation (EWC) [kirkpatrick2017overcoming], Online EWC [schwarz2018progress], Synaptic Intelligence (SI) [zenke2017continual], Learning without Forgetting (LwF) [li2017learning], and Progressive Neural Networks (Prog-NN) [rusu2016progressive]. Figure 4 shows the forward, backward, and overall transfer efficiency for each algorithm on this benchmark dataset.
Forward transfer efficiency of Lifelong Forests and Lifelong Networks is (approximately) monotonically increasing. This indicates that as they observe data from new tasks, they are able to leverage data from previous tasks to improve performance on the new tasks. This is in contrast to the other approaches, none of which appear to reliably demonstrate any degree of forward transfer (top left). Similarly, the backward transfer efficiencies of Lifelong Forests and Lifelong Networks are (approximately) monotonically increasing for each task. This indicates that as they observe data from new tasks, they are able to improve performance on previous tasks, relative to their performance on those tasks prior to observing new out-of-task data. That is, L2F and L2N progressively learn. This is in contrast to the other approaches, none of which appear to demonstrate any degree of backward transfer; to the contrary, the previously proposed approaches appear to demonstrate catastrophic forgetting or do not transfer at all (top right). Taken together, transfer efficiency for L2F and L2N is (approximately) monotonically increasing, yielding an improvement on most tasks by virtue of other-task data; in contrast, most other approaches fail to demonstrate transfer for the majority of tasks (bottom).
Consider the same CIFAR 10x10 experiment above, but where, for tasks 2 through 9, the class labels within each task are randomly permuted, rendering each of those tasks adversarial with regard to the first task. Figure 5 (left) indicates that L2F gracefully forgets in the presence of these adversarial tasks. The other algorithms seem invariant to label shuffling (comparing performance with Figure 4 indicates that performance is effectively unchanged), suggesting that they fail to take advantage of class labels when transferring to other tasks.
Now, consider a Rotated CIFAR experiment, which uses only data from the first task, divided into two equally sized subsets (making two tasks), where the second subset is rotated. Figure 5 (center) shows that L2F’s transfer efficiency is nearly invariant to rotation angle, whereas the other approaches fail to transfer for any angle. Note that zero rotation angle corresponds to the two tasks having identical distributions; the fact that none of the other algorithms transfer even in this setting suggests that they cannot transfer at all. Appendix C contains additional experiments with repeated tasks or subtasks, demonstrating similar results: L2F and L2N transfer efficiently in these contexts as well.
The lifelong learning approaches presented so far either build new, or recruit existing, resources (L2F and L2N both exclusively build resources). However, this binary distinction is unnecessary and unnatural: in biological learning, systems develop from building to recruiting resources. Figure 5 (right) demonstrates that variants of L2F can span the continuum from purely building to purely recruiting resources. We trained L2F on the first nine CIFAR 10x10 tasks using 50 trees per task. For the tenth task, we could select the 50 trees (out of the 450 existing trees) that perform best on task 10 (recruiting). Or, we could train 50 new trees, as L2F would normally do (building). Or we could do a hybrid: build 25 and recruit 25. This hybrid approach performs better than either alternative, suggesting that one can improve our representation ensembling approach by dynamically determining whether to build a new representation, and if so, how to optimally combine those resources with existing resources.
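The building-vs-recruiting continuum can be sketched as a simple selection rule: given per-tree scores on the new task, recruit the best existing trees and fill the remaining budget with newly built ones. The scores and the budget split below are hypothetical:

```python
def hybrid_ensemble(existing_scores, budget, n_recruit):
    # Rank existing trees by their score on the new task, recruit the top
    # n_recruit, and report how many new trees must be built to fill the budget.
    ranked = sorted(range(len(existing_scores)),
                    key=lambda i: existing_scores[i], reverse=True)
    recruited = ranked[:n_recruit]
    n_built = budget - n_recruit
    return recruited, n_built

scores = [0.3, 0.9, 0.5, 0.7]  # hypothetical per-tree accuracy on the new task
recruited, n_built = hybrid_ensemble(scores, budget=2, n_recruit=1)
# recruited == [1] (the best existing tree); n_built == 1 (one new tree)
```

Setting `n_recruit` to 0 or to the full budget recovers the pure building and pure recruiting endpoints of the continuum.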
Neither L2F nor L2N leverages any modality-specific architecture, such as convolutions, making them applicable to other modalities out of the box. We consider two natural language processing environments, both using 8 million sentences downloaded from Wikipedia, and trained a 16-dimensional Fasttext [bojanowski2016enriching] embedding of tokenized words and 2-4 character n-grams from these sentences. These embeddings served as the input to our representation ensembling algorithms (see Appendix D for details). In the first experiment, the goal was to identify the language associated with a given sentence. In the second experiment, the goal was to identify which of 20 different Bing entity types best matches a word or phrase. For example, the entity type for “Johns Hopkins University” is “education.school”. Appendix Figure 4 shows that L2F demonstrates backward transfer in both of these environments.
We introduced representation ensembling as a general approach to progressive learning. The two specific algorithms we developed demonstrate the possibility of achieving both forward and backward transfer. Previous approaches relied on either (i) always building new resources, or (ii) only recruiting existing resources, when confronted with new tasks. We demonstrate that a hybrid approach outperforms both edge cases, motivating the development of more elaborate dynamic systems.
We hope this work motivates a tighter connection between biological and machine learning. By carefully designing behavioral and cognitive human and/or non-human animal experiments, in tandem with neurophysiology, gene expression, anatomy, and connectivity, we may be able to infer more about how neural systems progressively learn so efficiently. Designing such experiments in neurodevelopment, neuro-impairment, and neurodegeneration may yield valuable information regarding limiting or reversing memory or learning impairments.
We thank Raman Arora, Dinesh Jayaraman, Rene Vidal, Jeremias Sulam, and Michael Powell for helpful discussions. This work is graciously supported by the Defense Advanced Research Projects Agency (DARPA) Lifelong Learning Machines program through contract FA8650-18-2-7834.
In the following simulation, we construct an environment with two tasks. For each run, we sample 250 times from the first task, followed by 750 times from the second task. These 1,000 samples comprise the training data. We sample another 1,000 held-out samples to evaluate the algorithms. We fit an Uncertainty Forest (UF) (an honest forest with a finite-sample correction [Guo2019-xe]) for both tasks and use the learned trees to subsequently fit a Lifelong Forest. We repeat this process 1,500 times to obtain error bars, which correspond to 95% confidence intervals.
Gaussian XOR is a two-class classification problem with equal class priors. Conditioned on being in class 0, a sample is drawn from a mixture of two Gaussians with fixed means, and variances proportional to the identity matrix. Gaussian N-XOR has the same distribution as Gaussian XOR with the class labels flipped. Rotated XOR (R-XOR) rotates XOR by $\theta$ degrees.
We study Lifelong Forests on a sequence of 10 rotated Gaussian XOR tasks. Representations of the tasks are shown on the left of Figure 1. The angle between each line and the origin corresponds to the rotation angle for that task. In particular, the line for Task 1 corresponds to Gaussian XOR, and the line for Task 10 corresponds to the most-rotated Gaussian $\theta$-XOR. The eight other tasks were chosen to be progressively less similar to Gaussian XOR. This set of tasks was chosen to study both the effect of many tasks that are dissimilar to a particular task (through the performance of Lifelong Forests on Task 1, Gaussian XOR) and the effect of many tasks close to a particular task (through the performance of Lifelong Forests on Tasks 2-10). For the suite of rotated XORs, a pair of tasks with a small difference in rotation angle between them will result in helpful transfer.
Figure 1 (center) shows the transfer efficiency of Lifelong Forests on the rotated XOR suite. Tasks 2 through 9 had 100 training samples per task. Task 1 had a variable amount of training data. When the number of XOR training samples is small, the representations learned from the tasks close to XOR help improve performance. When the number of training samples from XOR is sufficiently large that the representation learned for XOR is nearly optimal, other tasks cause a graceful forgetting. For an intermediate number of XOR training samples, other task data causes greater forgetting. The representations from the non-XOR tasks are mutually beneficial, as indicated by the (generally) upward trend of the transfer efficiency for each task.
Finally, consider a two-task environment, where Task 1 is XOR and Task 2 is $\theta$-XOR, for differing values of $\theta$. Figure 1 shows that $\theta$-XOR can improve performance on XOR whenever the rotation angle $\theta$ is far from 45 degrees.
We compared our approaches to six reference lifelong learning methods. These algorithms can be classified into two groups based on whether they build or recruit resources given new tasks. Among them, Prog-NN [rusu2016progressive] and DF-CNN [Lee2019-eg] learn new tasks by building new resources. The other four algorithms, EWC [kirkpatrick2017overcoming], Online-EWC [schwarz2018progress], SI [zenke2017continual], and LwF [li2017learning], recruit existing resources. EWC, Online-EWC, and SI rely on preferentially updating the network parameters depending on their relative importance to the previous task. LwF, on the other hand, predicts the labels of the input data from the current task using the model trained on the previous tasks. These predicted labels act as soft targets for the current training data, i.e., (input data, soft target) pairs are used in the regularization term while (input data, original target) pairs are used in the main loss function. This prevents the parameters from deviating too far from their optimum values for the previous tasks while still enabling the network to learn a new task. The implementations of all of the algorithms are adapted from the code provided by the authors of [Lee2019-eg] and [Van_de_Ven2019-wy]. The code was modified to work in the CIFAR 10x10 setting without any change in the parameters; all algorithms used the default hyperparameters.
| Task # | Image Classes |
|---|---|
| 1 | apple, aquarium fish, baby, bear, beaver, bed, bee, beetle, bicycle, bottle |
| 2 | bowl, boy, bridge, bus, butterfly, camel, can, castle, caterpillar, cattle |
| 3 | chair, chimpanzee, clock, cloud, cockroach, couch, crab, crocodile, cup, dinosaur |
| 4 | dolphin, elephant, flatfish, forest, fox, girl, hamster, house, kangaroo, keyboard |
| 5 | lamp, lawn mower, leopard, lion, lizard, lobster, man, maple tree, motorcycle, mountain |
| 6 | mouse, mushroom, oak tree, orange, orchid, otter, palm tree, pear, pickup truck, pine tree |
| 7 | plain, plate, poppy, porcupine, possum, rabbit, raccoon, ray, road, rocket |
| 8 | rose, sea, seal, shark, shrew, skunk, skyscraper, snail, snake, spider |
| 9 | squirrel, streetcar, sunflower, sweet pepper, table, tank, telephone, television, tiger, tractor |
| 10 | train, trout, tulip, turtle, wardrobe, whale, willow tree, wolf, woman, worm |
Table 1 shows the image classes associated with each task number. Table 2 shows a set of summary statistics from the CIFAR 10x10 experiment of Section 5.1 of the main text. Notably, only Lifelong Forests and Lifelong Networks have both mean and minimum transfer efficiency greater than 1. Further, Table 3 shows the final transfer efficiencies for each algorithm studied for convenience and reproducibility.
| Algorithm | Mean TE | Min. TE | Mean FTE (Task 10) | Mean BTE (Task 1) |
We also considered the setting where each task is defined by a random sampling of 10 out of 100 classes with replacement. This environment is designed to demonstrate the effect of tasks with shared subtasks, which is a common property of real world learning tasks. This setting generalizes the previously proposed “Class-Incremental” and “Task-Incremental” distinction [Van_de_Ven2019-wy]. Figure 2 shows transfer efficiency of Lifelong Forests on Task 1. Finally, we considered a curriculum where tasks repeat, in contrast to CIFAR 10x10. Figure 3 demonstrates that L2F consistently transfers in this setting as well.
We downloaded a language identification corpus consisting of around 8 million sentences and 350 languages from https://tatoeba.org/eng/downloads, and trained a 16-dimensional Fasttext [bojanowski2016enriching] embedding of tokenized words and 2-4 character n-grams from these sentences using a character-based skip-gram model, without using the language labels. We then picked 30 languages and randomly chose 150 sentences for training and 2,500 sentences for testing for every language but Bosnian. For Bosnian we used 150 sentences for training and 396 sentences for testing due to a limited number of samples. A sentence embedding is found by averaging all L2-normalized word and n-gram embedding vectors within a sentence.
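The sentence-embedding step (averaging L2-normalized token embeddings) can be sketched as follows; the tiny two-token embedding table here is made up for illustration, whereas the experiment uses a 16-dimensional Fasttext model:

```python
import numpy as np

def sentence_embedding(tokens, table):
    # Average the L2-normalized embedding vectors of the known tokens.
    vecs = [table[t] / np.linalg.norm(table[t]) for t in tokens if t in table]
    return np.mean(vecs, axis=0)

# Hypothetical 2-D embedding table:
table = {"hello": np.array([3.0, 4.0]), "world": np.array([0.0, 2.0])}
emb = sentence_embedding(["hello", "world"], table)
# unit vectors [0.6, 0.8] and [0.0, 1.0] average to [0.3, 0.9]
```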
We split the 30 languages into ten 3-class tasks, and tasks are presented one at a time. Class splits are given in Table 4. The backward transfer efficiencies of Lifelong Forests for the ten tasks are shown in the left panel of Figure 4. Lifelong Forests generally transfer knowledge across the stream of tasks.
| Task Number | Language Classes |
|---|---|
| 1 | Swedish, Norwegian Bokmål, Danish |
| 2 | Mandarin Chinese, Yue Chinese, Wu Chinese |
| 3 | Russian, Ukrainian, Polish |
| 4 | Spanish, Italian, Portuguese |
| 5 | Finnish, Hungarian, Estonian |
| 6 | English, Dutch, German |
| 7 | Croatian, Serbian, Bosnian |
| 8 | Japanese, Korean, Vietnamese |
| 9 | Hebrew, Arabic, Hindi |
| 10 | French, Catalan, Breton |
An entity type is a label of an entity, such as "Johns Hopkins University", that provides a description of the entity, such as "education.school". We obtained a proprietary entity name and type table from Bing catalogs. For each entity we generated an embedding using a pre-trained English Fasttext model of 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus, and statmt.org news datasets (16 billion tokens). The entity name embedding used was the summation of the L2-normalized vectors for all tokens corresponding to the entity name.
We took the entity name embedding vectors for 20 entity types. For each type, we used 10,000 entity names for training and 1,000 entity names as a testing set. That is, we classified entity types based on their names. We split the 20 entity types into 5 tasks of 4 classes each. Tasks are presented one at a time. Entity types and task splits are given in Table 5. The backward transfer efficiencies corresponding to each task are shown in the right panel of Figure 4. Again, Lifelong Forests transfer knowledge across the stream of tasks.
| Task Number | Entity Type Classes |
|---|---|
| 1 | american_football.player, biology.organism_classification, book.author, book.book |
| 2 | book.edition, book.written_work, business.operation, commerce.consumer_product |
| 3 | education.field_of_study, education.school, film.actor, film.character |
| 4 | film.film, media_common.actor, music.artist, music.group |
| 5 | organization.organization, people.person, tv.series_episode, tv.program |