Lifelong Learning using Eigentasks: Task Separation, Skill Acquisition, and Selective Transfer

07/14/2020
by   Aswin Raghavan, et al.

We introduce the eigentask framework for lifelong learning. An eigentask is a pairing of a skill that solves a set of related tasks with a generative model that can sample from the skill's input space. The framework extends generative replay approaches, which have mainly been used to avoid catastrophic forgetting, to also address other lifelong learning goals such as forward knowledge transfer. We propose a wake-sleep cycle of alternating task learning and knowledge consolidation for learning in our framework, and instantiate it for lifelong supervised learning and lifelong RL. We achieve improved performance over the state-of-the-art in supervised continual learning, and show evidence of forward knowledge transfer in a lifelong RL application in the game Starcraft 2.


1 Introduction

The goal of lifelong learning (Chen and Liu, 2016; Silver et al., 2013) is to continuously learn a stream of machine learning tasks over a long lifetime, accumulating knowledge and leveraging it to learn novel tasks faster (positive forward transfer) (Fei et al., 2016) without forgetting the solutions to previous tasks (negative backward transfer). The learning problem is non-stationary and open-ended. The learner does not know the number or distribution of tasks, it may not know the identity of tasks or when the task changes, and it must be scalable to accommodate an ever-increasing body of knowledge within a finite model. These characteristics make lifelong learning particularly challenging for deep learning approaches, which are vulnerable to catastrophic forgetting in the presence of non-stationary data.

Many approaches to continual learning (Parisi et al., 2019; Nguyen et al., 2018) and lifelong learning have been studied and applied to computer vision (Hayes et al., 2018; Liu et al., 2020) and reinforcement learning (RL) (Ammar et al., 2015; Tessler et al., 2016), among other areas. Recent work comprises many complementary approaches, including learning a regularized parametric representation (Kirkpatrick et al., 2016, 2018), transfer of learned representations (Lee et al., 2019), meta-learning (Nagabandi et al., 2019), neuromodulation (Masse et al., 2018), dynamic neural network architectures (Li et al., 2019; Rusu et al., 2016), and knowledge consolidation using memory (van de Ven and Tolias, 2019).

Figure 1: Our approach is based on eigentasks, which combine a skill with a generative model of its inputs. Our lifelong learning agents operate in a wake-sleep cycle, solving a stream of tasks during the wake phase, and consolidating new task knowledge during the sleep phase using generative replay.

This paper advances the state of the art in the generative replay approach to lifelong learning (van de Ven and Tolias, 2019; Shin et al., 2017). In this approach, a generative model of the data distribution is learned and used for data augmentation, i.e., to replay data from old tasks when learning a new task. We refer to the model p(x) of the input distribution as the generator (e.g., of images) and the conditional model p(y|x) as a skill (e.g., a classifier). Prior work has focused on generative replay as a way to mitigate catastrophic forgetting, i.e., avoiding negative backward transfer. However, forward transfer – the ability to leverage knowledge to quickly adapt to a novel but related task – is an equally important lifelong learning problem that generative replay has not addressed. A structured form of replay is one of the functions of mammalian sleep (Krishnan et al., 2019; Louie and Wilson, 2001). Recent evidence in biology (McClelland et al., 2020) shows that only a few selected experiences are replayed during sleep, in contrast to the typical replay mechanisms in machine learning. Our proposed approach bridges these gaps.

Our main contribution is a framework that combines generative memory with a set of skills that span the space of behaviors necessary to solve any given task. The framework consists of a set of generator-skill pairs that partition a stream of data into what we call eigentasks. Each generator models a subset of the input space, and the corresponding skill encodes the appropriate outputs for the inputs in the generator’s support set. The likelihood of an input according to each generator is used to retrieve the appropriate skill, making the eigentask model a content-addressable memory for skills and enabling forward transfer. Eigentasks can be seen as a combination of generative memory with mixture-of-experts (MoE) models (Makkuva et al., 2018; Tsuda et al., 2020). The MoE component facilitates forward transfer, while the generative component avoids forgetting.

We develop a concrete instantiation of eigentasks called the Open World Variational Auto Encoder (OWVAE) that uses a set of VAEs (Kingma and Welling, 2014) as generators. OWVAE partitions data into eigentasks using out-of-distribution detection based on a likelihood ratio test in the latent space of the generators. We present a loss function for end-to-end learning of eigentasks that incorporates the losses of the generators and skills, weighted by the likelihood ratio. We show experimentally that OWVAE achieves superior performance compared to state-of-the-art (SOTA) generative memory (GM) approaches on a new benchmark that contains a mix of MNIST and FashionMNIST datasets, and comparable performance on the splitMNIST benchmark. OWVAE's superior performance is attributed to task disentanglement (Achille et al., 2018) and confirmed visually by comparing the samples generated by each VAE.

Our second contribution is a sampling strategy that improves the quality of generative replay by using the confidence of the predictions output by the paired skills to reject out-of-distribution samples. Our experiments show improved continual learning and reduction in forgetting when using rejection sampling as compared to accepting all generated examples.

Our third contribution is a lifelong RL algorithm that leverages the OWVAE for exploration of new tasks. The OWVAE generates advice by recalling options (Sutton et al., 1999) that are applicable to the current task, based on the eigentask partitioning of old tasks. This advice is incorporated into an off-policy learner whose behavior policy is a mixture of the options and the policy being learned. We apply our lifelong RL algorithm to the challenging video game Starcraft 2 (SC2). On a sequence of “mini-games” (Vinyals et al., 2017) as tasks, our approach shows positive forward transfer when the new task is of the same type as one of the old tasks. In most mini-games we observe a jump start in the accumulated reward, and in one mini-game, forward transfer outperforms the single-task learner by 1.5x with 10x fewer samples due to better exploration.

2 Background

In lifelong learning, an agent is faced with a never-ending sequence of tasks. Each task is defined by a tuple (X_t, Y_t, P_t, ℓ_t), where X_t is the input space, Y_t is the output space, P_t is the input distribution, and ℓ_t(x, ŷ) gives the loss for outputting ŷ on input x. In this paper we consider the case where all the input and output spaces are the same, X_t = X and Y_t = Y, so our definition of a task reduces to the pair (P_t, ℓ_t). We denote the (countably infinite) set of tasks by 𝒯. Since a task is totally characterized by its index t, we use tasks and their indices interchangeably.

A task sequence is a sequence of tuples (t_1, n_1), (t_2, n_2), … drawn from the task distribution D, where each t_i is a task index and the corresponding n_i is the number of samples from P_{t_i} that the agent sees before the transition to the next task. The lifelong learning agent's objective is to learn a single hypothesis h : X → Y that minimizes the expected per-sample loss wrt the task distribution,

h* = argmin_h E_{(t,n)∼D} [ (1/n) Σ_{j=1}^{n} ℓ_t(x_j, h(x_j)) ]        (1)

where x_1, …, x_n are instances sampled IID from P_t. Equation 1 defines the optimal hypothesis wrt the task distribution. Note that the task distribution is unknown to the agent, and the data stream does not contain the task index. Lifelong learning (Eq. 1) is strictly harder than multi-task learning, where instances have known task IDs.
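For concreteness, the following sketch (synthetic tasks and a hand-written hypothesis, all hypothetical) estimates the expected per-sample loss of Eq. 1 by sampling a task sequence and averaging the per-task losses:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical task family for illustration: task t draws inputs from a Gaussian
# centered at t, and labels them with a task-specific threshold rule.
def sample_inputs(t, n):
    return rng.normal(loc=float(t), scale=1.0, size=n)

def task_loss(t, x, y_hat):
    y_true = (x > t).astype(int)            # assumed labeling rule for task t
    return float(np.mean(y_true != y_hat))  # 0/1 per-sample loss, averaged

def hypothesis(x):
    return (x > 0.5).astype(int)            # one fixed hypothesis h

# A task sequence drawn from a (hypothetical) task distribution D over
# (task index, number of samples) pairs.
task_sequence = [(int(rng.integers(0, 3)), int(rng.integers(50, 200))) for _ in range(100)]

# Empirical estimate of the expected per-sample loss in Eq. 1 for h.
losses = []
for t, n in task_sequence:
    x = sample_inputs(t, n)
    losses.append(task_loss(t, x, hypothesis(x)))
print("estimated lifelong loss of h:", np.mean(losses))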

2.1 Backward and Forward Transfer

The notion of knowledge transfer is central to evaluating success in lifelong learning. In this paper, we approach lifelong learning in an episodic setting where the agent learns in a series of learning epochs of fixed length (not necessarily aligned with task boundaries). In this setting, we can define the key metrics of forward and backward transfer in a manner that is agnostic to the task boundaries. Let N_{t,k} be the total number of samples seen from task t through the end of epoch k. We define a loss restricted to the tasks actually seen by the learner through epoch k as

L_k(h) = E_{t∼D} [ E_{x∼P_t} [ ℓ_t(x, h(x)) ] | N_{t,k} > 0 ]        (2)

Backward Transfer (BT) describes the difference in performance on old tasks before and after one or more epochs of learning. Let h_{1:k} be the hypothesis obtained after training sequentially on the data from epochs 1, …, k, where the shorthand i:j denotes the contiguous range of epochs from i to j. The one-step backward transfer after epoch k is,

BT_k = L_{k-1}(h_{1:k-1}) − L_{k-1}(h_{1:k})        (3)

Negative BT corresponds to forgetting knowledge learned in previous episodes, and positive BT could indicate successful knowledge consolidation between tasks. Forward Transfer (FT) describes the difference in loss on new tasks with and without training on previous tasks.

FT_k = E_{t∼D} [ E_{x∼P_t} [ ℓ_t(x, h_{k:k}(x)) − ℓ_t(x, h_{1:k}(x)) ] | (N_{t,k} > 0, N_{t,k−1} = 0) ]        (4)

The terms within parentheses restrict the loss to the new tasks of the k-th epoch, i.e., tasks first seen in epoch k, and h_{k:k} denotes a learner trained on epoch k alone. Positive FT indicates transfer of knowledge or skills between tasks, and typically corresponds to jump-start performance and lower sample complexity.
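As an illustration, the following sketch (with synthetic loss values; the bookkeeping of which hypothesis is evaluated on which task set is assumed to be done elsewhere) computes one-step BT and FT as defined in Eqs. 3 and 4:

import numpy as np

# Synthetic measurements for illustration (epoch index k = 2):
# loss_prev[k]        : restricted loss (Eq. 2) of h_{1:k-1} on tasks seen through epoch k-1
# loss_old[k]         : restricted loss (Eq. 2) of h_{1:k}   on tasks seen through epoch k-1
# loss_new_seq[k]     : loss of h_{1:k} on tasks first seen in epoch k
# loss_new_scratch[k] : loss of h_{k:k} (trained on epoch k alone) on those same tasks
loss_prev        = np.array([np.nan, np.nan, 0.30])
loss_old         = np.array([np.nan, np.nan, 0.33])
loss_new_seq     = np.array([np.nan, np.nan, 0.20])
loss_new_scratch = np.array([np.nan, np.nan, 0.42])

def backward_transfer(k):
    """One-step BT (Eq. 3); negative values indicate forgetting."""
    return loss_prev[k] - loss_old[k]

def forward_transfer(k):
    """FT (Eq. 4); positive values mean prior training helped on the new tasks."""
    return loss_new_scratch[k] - loss_new_seq[k]

k = 2
print(f"epoch {k}: BT = {backward_transfer(k):+.2f}, FT = {forward_transfer(k):+.2f}")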

2.2 Reinforcement Learning

The flexibility of our eigentask framework allows it to be applied in supervised learning (SL), unsupervised learning, and reinforcement learning (RL). For simplicity, we describe lifelong RL in the finite Markov decision process (MDP) setting. An MDP is a tuple (S, A, T, R), where S is a finite set of states, A is a finite set of actions, T is the transition function, and R is the reward function. The objective is to find a policy π that maximizes the value function V^π. Tasks in lifelong RL are MDPs with common state and action spaces, but whose transition and reward functions may differ. The loss is the negative return of the policy, ℓ_t(π) = −V^π(s_0).

3 Eigentask Framework

Lifelong learning agents must balance plasticity vs. stability: improving at the current task while maintaining performance on previously-learned tasks. The problem of stability is especially acute in neural network models, where naïvely training an NN model on a sequence of tasks leads to catastrophic forgetting of previous tasks. A proven technique for avoiding forgetting is to mix data from all previous tasks with data from the current task during training, thereby reverting the streaming learning problem back to an offline learning problem. The generative memory (GM) approach accomplishes this with a generative model of the data distribution, factorized as a generator p(x) over the input space and a discriminator or skill p(y|x) conditioned on the input x. A skill can correspond to a classifier in supervised learning (SL) or to a policy or option in RL.
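As a minimal sketch of the basic GM training step (module names, interfaces, and the latent_dim parameter are hypothetical), replayed inputs are drawn from the current generator, pseudo-labelled by the current skill, and mixed with the new task's batch:

import torch

def generative_replay_batch(generator, skill, new_x, new_y, latent_dim):
    """One GM-style training batch: pair real data from the current task with
    replayed data sampled from the generator and labelled by the frozen skill.
    Assumes generator(z) returns inputs shaped like new_x and skill(x) returns logits."""
    with torch.no_grad():
        z = torch.randn(new_x.size(0), latent_dim)      # sample the generator's prior
        replay_x = generator(z)                         # synthetic inputs from old tasks
        replay_y = skill(replay_x).argmax(dim=1)        # pseudo-labels from the old skill
    x = torch.cat([new_x, replay_x], dim=0)
    y = torch.cat([new_y, replay_y], dim=0)
    return x, y                                         # train the new model on this mix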

To achieve selective transfer, we propose to use multiple generator-skill pairs to disentangle streaming data into “canonical tasks” that we call eigentasks. Informally, eigentasks partition the joint input-output space such that all inputs within an eigentask use the same skill. Eigentasks capture task similarity defined in terms of the combination of generative and skill losses. The use of multiple generators corresponds to a mixture model. Prior work on mixtures of VAEs or GANs (Dilokthanakul et al., 2016; Zhang et al., 2017; Rao et al., 2019) clusters inputs based on perceptual similarity alone or creates one task per label. On the other hand, mixtures-of-experts (Tsuda et al., 2020) capture skill similarity alone. The eigentask framework combines the MoE concept with generative replay to avoid forgetting and realize selective transfer.

3.1 Eigentask Definition

Formally, an eigentask is a tuple (G, π) comprising a generator G(z) that defines a distribution over inputs as a function of random noise z, and a skill π(x) that maps inputs to outputs. An eigentask model consists of a set of K eigentasks and a similarity function w(x) whose output is a probability vector over the K eigentasks. Typically, G, π, and w are parameterized functions such as DNNs. We write w as a function of the current input, but in general it may be a function of the entire input history. We first describe the loss function for end-to-end learning of eigentasks in the offline setting; the extension to the streaming setting is discussed in the next section. Given a dataset D = {(x_i, y_i)}, the general loss function for eigentask learning is,

L(D) = Σ_{(x_i, y_i) ∈ D} Σ_{k=1}^{K} w_k(x_i) [ L^gen_k(x_i) + L^skill_k(x_i, y_i) ]        (5)

where L^gen_k and L^skill_k denote the generative and discriminative losses for the k-th generator and skill, respectively. The loss in (5) is agnostic to the choice of generator (e.g., VAE or GAN), skill (e.g., classifier, policy, options), and similarity function w. Note that the generator-skill pairs are independent, to avoid interference between eigentasks. This is in contrast to recent approaches that learn a shared embedding across tasks (e.g., Achille et al., 2018). In our experiments, L^skill is a cross-entropy loss for classification problems and an RL loss such as the policy gradient for RL problems.
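A minimal PyTorch-style sketch of the loss in Eq. 5 is shown below; the generators, skills, and similarity modules are hypothetical stand-ins, with generators assumed to return reconstructions, skills class logits, and the similarity network a probability vector over eigentasks:

import torch
import torch.nn as nn
import torch.nn.functional as F

def eigentask_loss(x, y, generators, skills, similarity):
    """Sketch of Eq. 5: sum over eigentasks k of w_k(x) * (L_gen_k + L_skill_k).
    Assumed interfaces: generators[k](x) returns a reconstruction of x,
    skills[k](x) returns class logits, similarity(x) returns (batch, K) weights."""
    w = similarity(x)                                   # rows sum to 1 over K eigentasks
    total = x.new_zeros(())
    for k, (gen, skill) in enumerate(zip(generators, skills)):
        gen_loss = F.mse_loss(gen(x), x, reduction="none").flatten(1).mean(dim=1)
        skill_loss = F.cross_entropy(skill(x), y, reduction="none")
        total = total + (w[:, k] * (gen_loss + skill_loss)).mean()
    return total

# Toy usage with two eigentasks on 10-dimensional inputs and 4 classes.
K, d, n_classes = 2, 10, 4
generators = [nn.Linear(d, d) for _ in range(K)]
skills = [nn.Linear(d, n_classes) for _ in range(K)]
similarity = nn.Sequential(nn.Linear(d, K), nn.Softmax(dim=1))
x, y = torch.randn(8, d), torch.randint(0, n_classes, (8,))
print(eigentask_loss(x, y, generators, skills, similarity))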

Figure 2: An Open World Variational Auto Encoder (OWVAE) with two eigentasks, showing the terms used in the loss function (Eq. 7).

We develop a concrete instantiation of eigentasks called the Open World Variational Auto Encoder (OWVAE) that uses a set of VAEs (Kingma and Welling, 2014) as generators, with latent space denoted by z, and with encoder q_k(z|x), decoder p_k(x|z), and prior p(z) for the k-th eigentask. Figure 2 shows an OWVAE with two eigentasks. We use the reconstruction error between the input and the decoder output as the generative loss L^gen. We propose to use a likelihood ratio test to define w; (Ren et al., 2019) developed a similar LR test concurrently. We use the likelihood of decoder k generating the observed data x for some latent code, approximated using the encoding produced by q_k(z|x):

w(x) = softmax( log p(z_1), …, log p(z_K) ),   where z_k = μ_k(x)        (6)

where the subscript k denotes eigentask k, z_k = μ_k(x) is the encoding of x by the k-th encoder, p(z) is the density of the standard Gaussian prior, and the softmax is taken over the K eigentasks. The approximation is valid when the decoder weights are the inverse of the encoder weights. The loss function for OWVAE is given in Eq. 7,

L_OWVAE(D) = Σ_{(x_i, y_i) ∈ D} Σ_{k=1}^{K} w_k(x_i) [ L^VAE_k(x_i) + L^CE_k(x_i, y_i) ]        (7)

where L^VAE_k is the standard VAE loss for the k-th eigentask (reconstruction error plus the KL divergence between q_k(z|x) and the prior p(z)), and L^CE_k is the cross-entropy loss of the k-th skill.

At test time, given an input x we calculate w(x) with a forward pass through the encoders. We sample the index of the decoder and skill to use according to the categorical distribution given by w(x). In the simplest case, we pass the decoder output as the input to the skill. In our experiments, we observed that using the mid-level features (the encoder features before projecting to the latent space) led to the best skill accuracy, matching an observation by van de Ven and Tolias (2018). Note that the above sampling method is conditioned on x, and that the OWVAE does not allow direct sampling from the learned mixture because the mixing coefficients depend on the input. Generative replay can be used to learn an OWVAE incrementally, but it requires direct sampling of old tasks. Section 4 proposes a strategy for direct sampling.
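The following sketch illustrates one possible reading of the similarity computation in Eq. 6 (an assumption of this sketch, not necessarily the exact implementation): each eigentask is scored by the log-density of its encoding of x under the standard Gaussian prior, and a softmax over eigentasks yields w(x):

import torch

def eigentask_weights(x, encoders):
    """Assumed reading of Eq. 6: score each eigentask by the log-density of its
    encoding of x under the standard Gaussian prior, then softmax over eigentasks.
    encoders[k](x) is assumed to return the mean of q_k(z|x), shape (batch, latent_dim)."""
    log_scores = []
    for enc in encoders:
        z = enc(x)
        log_scores.append(-0.5 * (z ** 2).sum(dim=1))   # log N(z; 0, I) up to a constant
    return torch.softmax(torch.stack(log_scores, dim=1), dim=1)   # (batch, K)

Sampling an index from the returned categorical distribution then selects which decoder and skill to apply, as described above.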

  Input: OWVAE M, buffer B, # sleep iterations N
  Initialize M; set B to empty.
  while True do
     Initialize task learner
     repeat {Wake Phase}
        Classification (Section 4): store new instance (x, y) in buffer B
        RL (Section 5): Update B with Alg 3
     until B is full
     Create copy M' of M
     for N iterations do {Sleep Phase}
        Fetch batch b from B
        Generate replay r from M' using Alg 2
        Update M using Eq. 7 on b ∪ r
     end for
     Set B to empty
  end while
Algorithm 1 General Wake-Sleep Cycle

3.2 The Wake-Sleep Cycle

Our lifelong learning agents operate in a wake-sleep cycle, solving a stream of tasks during the wake phase, and consolidating new task knowledge during the sleep phase using generative replay. In the wake phase, the learner's goal is to maximize FT, i.e., to quickly produce correct outputs on incoming inputs. In supervised learning that might mean simply outputting the correct label according to the new task, while in reinforcement learning the agent needs to explore the new task and maximize reward. During the wake phase, the learner converts the streaming input to batched data by storing new task examples in a short-term buffer, along with any intermediate solutions of wake-phase learning. Periodically, the learner enters a sleep phase, whose objective is memory consolidation over all tasks with minimal negative BT. We use generative replay to incorporate new task batches into an eigentask model that is continuously updated. A general wake-sleep cycle is shown in Algorithm 1, where the sleep phase is activated whenever the buffer is full. We describe instantiations of the wake-sleep cycle for supervised learning and RL in the next two sections.

4 Lifelong Supervised Learning

Wake phase: In this work, we use a trivial wake phase for supervised classification: we simply store the new task examples (instance and label) in the buffer. The OWVAE could be leveraged in the wake phase, e.g., by augmenting new task data with selective replay of similar tasks, or by using the OWVAE skills as hints for knowledge distillation. For example, hints led to positive FT when the new task was a noisy version of old tasks (noise added to labels and pixels). We have not yet investigated these possibilities completely, as they are specific to the scenario. Algorithms for learning from streaming data (e.g., Hayes et al., 2018; Smith et al., 2019) can be used to update the task learner.

Sleep phase: As mentioned in Section 3.1, sampling from an OWVAE is conditional on the input because the mixing coefficients are a function of x. Some eigentasks may have received little or no training (e.g., when the old tasks are few and similar). These untrained generators will generate noise that must not be used to augment new task data. To mitigate this problem, we developed a rejection sampling strategy that uses the confidence of the skill associated with each generator. Algorithm 2 shows the sampling strategy for OWVAE. We reject a sample if the confidence of the associated skill is below a threshold τ, and we reject over-represented labels to generate label-balanced replay. These two refinements to the sampling process were critical in achieving high accuracy on continual learning benchmarks. In addition, rejection sampling improved the accuracy of the basic GM approach as well.

  Input: OWVAE M, batch size n, threshold τ
  Initialize a deque Q_y of size n / |Y| for each label y
  repeat
     for Eigentask k in M do
        Generate z ∼ p(z), x = decoder_k(z)
        y, c = label and confidence predicted by skill_k(x)
        if c ≥ τ: Push (x, y) to Q_y
     end for
  until each Q_y is full or MAX_TRIES is reached
  Return ∪_y Q_y
Algorithm 2 Rejection Sampling from OWVAE
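A compact Python rendering of Algorithm 2 (the owvae.eigentasks and owvae.latent_dim attributes, and the skill interface that maps a generated input directly to class logits, are assumptions of this sketch):

import torch
from collections import deque

def rejection_sample_replay(owvae, batch_size, num_labels, threshold, max_tries=100):
    """Sketch of Alg. 2: draw samples from each eigentask's decoder, keep only
    those whose paired skill is confident, and balance labels with per-label
    deques. owvae.eigentasks (list of (decoder, skill) pairs) and
    owvae.latent_dim are assumed attributes."""
    per_label = max(1, batch_size // num_labels)
    queues = {y: deque(maxlen=per_label) for y in range(num_labels)}
    for _ in range(max_tries):
        for decoder, skill in owvae.eigentasks:
            z = torch.randn(1, owvae.latent_dim)        # sample the VAE prior
            x = decoder(z)                              # generated replay input
            probs = torch.softmax(skill(x), dim=1)
            conf, label = probs.max(dim=1)
            if conf.item() >= threshold:                # reject low-confidence samples
                queues[label.item()].append((x, label))
        if all(len(q) == per_label for q in queues.values()):
            break
    return [pair for q in queues.values() for pair in q]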

5 Lifelong Reinforcement Learning

In the lifelong RL setting, tasks are MDPs (Section 2.2), the eigentask skills are policies, and the associated generators generate states where the policies should be applied.

Wake Phase: One of the key determinants of RL performance is the efficiency of exploration. Without any prior knowledge, RL algorithms typically explore randomly in the early stages of learning. Our approach (see Algorithm 3) is to use the skill corresponding to the most relevant eigentask to aid exploration. An off-policy RL algorithm is used for training because the exploration policy is defined by these skills: the behavior policy is a mixture of the target policy and the mixture of skills induced by the OWVAE's w-function. Let π denote the target policy and let π_k be the k-th skill. Given a state s, the behavior policy b is,

b(a|s) = ε · Σ_{k=1}^{K} w_k(s) π_k(a|s) + (1 − ε) · π(a|s)        (8)

where ε is a “mixing” function that decays over time, so that eigentask usage is gradually reduced and replaced by the target policy π. We use importance weighting as implemented by the off-policy actor-critic algorithm VTrace (Espeholt et al., 2018), but our approach is compatible with any off-policy algorithm. In the last step of the wake phase, trajectories from the target policy are stored in the buffer, to be consolidated later in the sleep phase.
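A small sketch of the mixture in Eq. 8, with made-up action probabilities; the decay schedule for the mixing rate is left outside the function:

import numpy as np

def behavior_policy(target_probs, skill_probs, w, eps):
    """Sketch of Eq. 8. target_probs: (A,) action probabilities of the target
    policy; skill_probs: (K, A) probabilities of the K eigentask skills;
    w: (K,) eigentask weights from Eq. 6; eps: decaying mixing rate."""
    mixture_of_skills = w @ skill_probs                 # (A,)
    return eps * mixture_of_skills + (1.0 - eps) * target_probs

# With made-up numbers: early in training (eps near 1) the agent mostly follows
# the recalled skills; as eps decays, it follows the target policy instead.
target = np.array([0.25, 0.25, 0.25, 0.25])
skills = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1]])
w = np.array([0.8, 0.2])
print(behavior_policy(target, skills, w, eps=0.9))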

  Input: OWVAE M, Buffer B, Off-policy learner A, MDP m
  Initialize policy π and mixing rate ε
  repeat
     Observe state s from m; get w(s) from the OWVAE using Eq. 6
     Sample action a from the behavior policy defined by π and w(s) as in Eq. 8
     Execute a in m. Observe next state s' and reward r
     Add (s, a, r, s') to RL training set D
     Update π using A on D. Decrease ε
  until Sample budget reached
  Reset m to its initial state
  Execute π to generate a set of trajectories
  Add the trajectories to buffer B
Algorithm 3 Exploration using OWVAE for lifelong RL

Sleep Phase: In the sleep phase, the goal is to consolidate the final target policy π into the eigentask skills π_k. Our approach is based on policy distillation (Rusu et al., 2015), which transforms the problem into a supervised learning problem. The sleep phase then proceeds in the same manner as for supervised learning tasks (Section 4).
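A hedged sketch of a sleep-phase distillation objective follows; the KL-divergence form with a temperature is one common instantiation of policy distillation, and the paper's exact loss may differ:

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the frozen teacher's action distribution to the
    student's; a common policy-distillation objective (assumed here)."""
    t = temperature
    teacher = F.softmax(teacher_logits / t, dim=1)
    log_student = F.log_softmax(student_logits / t, dim=1)
    return F.kl_div(log_student, teacher, reduction="batchmean")

# Hypothetical usage: the teacher is the wake-phase target policy and the
# student is the eigentask skill being consolidated, evaluated on replayed states.
teacher_logits = torch.randn(32, 6)
student_logits = torch.randn(32, 6, requires_grad=True)
distillation_loss(student_logits, teacher_logits).backward()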

6 Experiments

We validate the idea of eigentasks on unsupervised, supervised classification, and RL tasks. We show experimentally that OWVAE achieves superior performance compared to the state-of-the-art (SOTA) generative memory (GM) approaches RtF (van de Ven and Tolias, 2019) and DGR (Shin et al., 2017) on a new benchmark that contains a mix of MNIST and FashionMNIST datasets, and comparable performance on the splitMNIST benchmark for continual learning. OWVAE's superior performance is attributed to task disentanglement (Achille et al., 2018) and confirmed visually by comparing the samples generated by each VAE. We demonstrate our lifelong RL algorithm on the Starcraft 2 (SC2) mini-games benchmark (Vinyals et al., 2017), and show that our approach compares favorably to the baselines of single-task and multi-task RL.

6.1 Illustration: synthetic problem

In this section, we illustrate the OWVAE model on a synthetic but challenging problem for current GM approaches. The problem is inspired by the Wisconsin Card Sorting task (Tsuda et al., 2020), where the tasks cannot be distinguished by “perceptual similarity”, but can be distinguished by “skill similarity”, so a mixture-of-experts model can solve the tasks (Tsuda et al., 2020). We test whether an OWVAE can separate the tasks and achieve a high accuracy.

Consider two binary classification tasks whose input space is an isotropic Gaussian in two dimensions, and whose labels are flipped between tasks: any input labeled 1 in task-0 is labeled 0 in task-1, and vice versa. The data distribution is shown in Figure 3(a). Any single classifier that achieves accuracy p on task-0 must have accuracy 1 − p on task-1, and thus an average accuracy of 50%. Figure 3(b) shows that an OWVAE with two eigentasks is able to achieve high accuracy on both tasks. Figure 3(c) shows the reason: a meaningful similarity function w has been learned that separates the tasks successfully into two eigentasks. Interestingly, label-0 of task-0 is combined with label-1 of task-1 within one eigentask, and label-1 of task-0 is combined with label-0 of task-1 within the other. In contrast, current GM methods will not work on this problem because learning either task causes forgetting of the other.
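A sketch of how such a conflicting-task dataset can be generated (the specific labeling rule, the sign of the first coordinate, is an assumption made here purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

def make_conflicting_tasks(n_per_task=1000):
    """Both tasks share the same isotropic 2-D Gaussian input distribution;
    task-1 labels are the flip of task-0 labels."""
    x0 = rng.normal(size=(n_per_task, 2))
    y0 = (x0[:, 0] > 0).astype(int)          # task-0 labels (assumed rule)
    x1 = rng.normal(size=(n_per_task, 2))
    y1 = 1 - (x1[:, 0] > 0).astype(int)      # task-1 labels are flipped
    return (x0, y0), (x1, y1)

(x0, y0), (x1, y1) = make_conflicting_tasks()
# Any single classifier with accuracy p on task-0 has accuracy 1 - p on task-1,
# so its average accuracy over the two tasks is 50%.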

(a) Setup of the synthetic problem with conflicting tasks.
(b) Accuracy with OWVAE(2): both tasks can be separated and learned.
(c) Tracking w shows correctly identified eigentasks.
Figure 3: Illustration on synthetic problem (Section 6.1): OWVAE(2) learns two conflicting tasks with no perceptual dissimilarity.

6.2 Continual Learning for Supervised Classification

We use the class-incremental learning (Class-IL) setting introduced in (van de Ven and Tolias, 2019). In this setting, new classes or groups of classes are presented incrementally to the learning algorithm. We use the standard splitMNIST problem and compare to the SOTA. In splitMNIST, the MNIST dataset is split into five tasks, with each task having two of the original classes. Further, we create a new benchmark combining the MNIST and Fashion-MNIST datasets. The combined MNIST and Fashion-MNIST classes are split evenly into ten tasks. Each task introduces two new classes, one MNIST digit and one fashion article. On this new benchmark, we establish a new SOTA by showing that current replay-based approaches are inferior to OWVAE.
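A sketch of how the split(MNIST+FashionMNIST) tasks can be constructed with torchvision; pairing digit i with fashion class i, relabelled as 10 + i, is an assumption of this sketch:

from torch.utils.data import ConcatDataset, Subset
from torchvision import datasets, transforms

def split_mnist_fashion_tasks(root="./data"):
    """Ten two-class tasks, each introducing one MNIST digit and one
    FashionMNIST article (pairing and relabelling are assumptions)."""
    tfm = transforms.ToTensor()
    mnist = datasets.MNIST(root, train=True, download=True, transform=tfm)
    fashion = datasets.FashionMNIST(root, train=True, download=True, transform=tfm)
    fashion.targets = fashion.targets + 10               # keep the 20 classes distinct
    tasks = []
    for i in range(10):
        digit_idx = (mnist.targets == i).nonzero(as_tuple=True)[0].tolist()
        fashion_idx = (fashion.targets == 10 + i).nonzero(as_tuple=True)[0].tolist()
        tasks.append(ConcatDataset([Subset(mnist, digit_idx), Subset(fashion, fashion_idx)]))
    return tasks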

In both benchmarks, each new task is trained for 500 iterations with a batch size of 32. All compared SOTA approaches use a two-layer perceptron with 650 neurons for the encoder and decoder, and a fixed latent dimension. We compare OWVAE against eight continual learning approaches spanning regularization, replay, and replay with exemplars (Table 1). In addition, we show two baselines: a single-task learner (lower bound; only learns on the current task) and an offline multi-task learner (upper bound; knows all tasks). Training is done on the standard train sets and results are reported on the standard test sets.

The OWVAE uses two eigentasks, each with the same architecture as the SOTA models but with only 400 neurons, so that the OWVAE has the same total number of parameters. Within the OWVAE, the inputs to the skill are the mid-level features of the corresponding encoder, i.e., the activations in the last layer of the encoder. As in Section 4, no wake-phase learning is used; the sleep phase uses 500 iterations with a batch size of 32, and the confidence threshold τ (as in Alg. 2) is held fixed. To study the impact of rejection sampling, we perform an ablation study over different augmentation strategies: (1) BaseAug: all examples generated during replay are accepted, (2) BAug: rejection sampling to create label-balanced replay, (3) VAug: rejecting low-confidence examples, and (4) VBAug: combining (2) and (3) (as in Alg. 2).

Approach Method D1 D2
Baselines None - lower bound 19.90 10.22
Offline - upper bound 97.94 90.89
Regularization EWC 20.01 10.00
Online EWC 19.96 10.00
SI 19.99 10.00
Replay LwF 23.85 10.07
DGR 90.79 73.36
DGR x2 91.83 65.82
DGR+distill 91.79 72.40
DGR+distill x2 94.01 67.37
RtF 92.56 61.15
RtF x2 92.86 61.41
Replay+Exemplars iCaRL 94.57 82.69
Replay+Eigentask ET1-BaseAug 87.68 69.29
ET1-BAug 90.99 74.11
ET1-VAug 87.33 63.34
ET1-VBAug 90.69 77.43
ET2-BaseAug 88.93 57.91
ET2-BAug 91.27 69.95
ET2-VAug 82.08 69.55
ET2-VBAug 90.25 76.81
Table 1: Average test accuracy over all tasks on the splitMNIST (D1) and split(MNIST+FashionMNIST) (D2) benchmarks. ET1 and ET2 denote the number of eigentasks in an OWVAE model. Methods compared: EWC (Kirkpatrick et al., 2016), Online EWC (Schwarz et al., 2018), SI (Zenke et al., 2017), LwF (Li and Hoiem, 2016), DGR (Shin et al., 2017), RtF (van de Ven and Tolias, 2019), and iCaRL (Rebuffi et al., 2017). The variants denoted x2 have the same number of parameters as ET2.

The average accuracies over all tasks at the end of training are shown in Table 1. As observed in prior work, the class-IL setting is hard for the regularization approaches like EWC, as well as LwF; their performance is very low, comparable to the single task learner (they learn and immediately forget each task). Our approach (Replay+Eigentasks) has accuracy comparable to other replay-based methods (except LwF) on splitMNIST. However, on split(MNIST+FashionMNIST), our approach has a higher accuracy (about 4% higher). Unsurprisingly, using exemplars within replay improves accuracy on both benchmarks. Exemplars could be integrated into OWVAE in future work.

The ablation study shows that VBAug (rejection using both confidence and label balance) yields the largest improvement (2-3% on splitMNIST, 19% on split(MNIST+FashionMNIST)). VBAug performs better than BAug, BaseAug, and VAug, whereas VAug by itself appears to decrease performance relative to BaseAug. Interestingly, VBAug and BAug also improved the performance of the basic GM approach, i.e., the single-eigentask model ET1 (2-3% over ET1-BaseAug on splitMNIST, 5-7% on split(MNIST+FashionMNIST)). A detailed per-task breakdown of accuracy over time is given in the appendix.

Figure 4 shows the task separation learned by OWVAE. The figure visualizes the VAE reconstructions for each task. It shows that the first eigentask has learned all the MNIST digits and is able to reconstruct them, whereas the second eigentask has learned all the fashion articles. The first eigentask has also learned some fashion articles, whereas the second eigentask has not learned any digits. Blurry and noisy images are removed from replay by our rejection sampling strategy.

Figure 4: Split(MNIST+FashionMNIST): Image reconstruction by OWVAE(2) showing task separation. Top row: Ground truth. Middle and Bottom: reconstruction by first and second eigentask.

6.3 Starcraft 2

We use the Starcraft 2 learning environment (SC2LE) (Vinyals et al., 2017). SC2 is a rich platform in which diverse RL tasks can be implemented, and we use the SC2LE "mini-games" as the task set for our experiments. Our policy architecture is a slight modification of the FullyConv architecture of Vinyals et al. (2017), and we use the 17 64x64 feature maps extracted by SC2LE.

6.3.1 Eigentask Learning

In the OWVAE, we use only two feature maps namely “unit type” (identity of the game unit present in each pixel) and “unit density” (average number of units per pixel). An example of these feature maps is shown in the appendix.

The eigentasks for SC2 are learned in an unsupervised and offline manner. We first collected a dataset by executing a random policy in each mini-game and recording the feature maps per frame. Whenever possible, we also collected similar data by running a scripted agent. We then trained an OWVAE with three eigentasks incrementally in a fully unsupervised manner. The setup is similar to the continual learning experiment for splitMNIST (each mini-game is seen for 500 iterations, etc.). SC2 mini-games are perceptually different, so the OWVAE is able to separate the tasks into meaningful eigentasks despite the unsupervised training. As shown in Figure 5, by looking at the relative values of the OWVAE w-function, we see that the first eigentask grouped all combat tasks together but incorrectly included a navigation task (possibly due to the unsupervised training). The second eigentask learned the BuildMarines task alone, whereas the third eigentask grouped the resource-gathering tasks together. For each task type, we observe one eigentask clearly dominating the w values, while no single eigentask dominates always. These task groupings can be confirmed by looking at the VAE reconstructions shown in the appendix.

Figure 5: Continual unsupervised learning of an OWVAE(3) on SC2 mini-games: variation of w over iterations and the learned task grouping. Each mini-game is observed for 500 iterations.
(a)
(b)
(c)
Figure 6: Forward transfer in Starcraft 2 mini-games.

6.3.2 Forward Policy Transfer

We focus on the wake phase of our lifelong RL algorithm (Alg. 3) and examine forward transfer due to efficient exploration. We manually set the OWVAE skills, selecting them from a set of trained single-task policies. These policies are separated into groups based on the task separation observed from the unsupervised OWVAE training above, and each skill is assigned a policy from the corresponding group. Training uses the VTrace learning rule (Espeholt et al., 2018) in an A2C implementation heavily adapted from code published by Ring (2018). We examine transfer from skills whose source tasks are either similar or dissimilar to the target task (Table 2). The main results on forward transfer are summarized in Figure 6. The plots include two baselines: a single-task policy trained on the target task, and a multi-task policy trained on batches containing experience from all six tasks. In order to demonstrate efficient exploration, the vertical axis shows the mean per-episode return obtained by the behavior policy during training.

Category Tasks
Combat DefeatRoaches, DefeatZerglingsAndBanelings
Navigation MoveToBeacon, CollectMineralShards
Hybrid FindAndDefeatZerglings
Economy CollectMineralsAndGas
Economy CollectMineralsAndGas
Table 2: Categories of SC2 tasks. Tasks in the same category are considered “similar” in our experiments.

The experiment in Figure 6(a) examines transfer to the DefeatZerglingsAndBanelings task. In the Transfer-Similar condition, the OWVAE skills are policies trained on CollectMineralShards and DefeatRoaches; DefeatRoaches is another combat task and thus similar to the target task. In the Transfer-Dissimilar condition, the OWVAE skills are CollectMineralShards and MoveToBeacon. In the Transfer-Similar condition, our approach yielded both good performance from the start of training and substantially better asymptotic performance. Interestingly, our approach even surpassed the asymptotic performance of single-task and multi-task learning, clearly showing the impact of the better exploration transferred through the OWVAE skills. Furthermore, the asymptotic performance is also better than the best published performance for the FullyConv policy architecture (Vinyals et al., 2017), by about 1.5x while using 10x fewer RL iterations. In the Transfer-Dissimilar condition, our approach still resulted in better initial performance than single-task training, but converged to the asymptotic performance of the single-task policy at a slower rate.

In the experiment of Figure 6(b), we study transfer to the MoveToBeacon task. Because of the way we trained the OWVAE, the MoveToBeacon task is clustered with two combat tasks rather than with the more similar CollectMineralShards task. As a result, the OWVAE w-function selects a combat skill for transfer to MoveToBeacon. When we use the default schedule for the behavior-policy mixing rate ε, transfer from the inappropriate skill hinders learning. However, a different mixing schedule that gives more weight to the target policy allows the agent to overcome this effect.

Finally, the experiment in Figure 6(c) investigates transfer to the FindAndDefeatZerglings task. This is an interesting target task because it combines elements of the Combat and Navigation categories, but is not highly similar to either of them. We evaluated two different skill sets (CollectMineralShards+DefeatRoaches and CollectMineralShards+MoveToBeacon), but transfer had no clear positive or negative effect for this target task.

7 Discussion and Future Work

We introduced the eigentask framework for lifelong learning, which combines generative replay with mixture-of-experts style skill learning. We use the framework in a wake-sleep cycle where new tasks are solved in the wake phase and experiences are consolidated into memory in the sleep phase. We applied it to both lifelong supervised learning and RL problems. We developed refinements to the standard generative replay approach to enable selective knowledge transfer. Combined with the rejection sampling trick, we achieved SOTA performance on continual learning benchmarks. In lifelong RL, we demonstrated successful forward transfer to new tasks in Starcraft 2, and exceeded the best published performance on one of the Starcraft 2 tasks.

Our immediate goal in future work is to close the wake-sleep loop in lifelong RL. We have demonstrated success for components of the approach, but not for the full eigentask framework. We are interested in adding change-point detection to improve on the likelihood ratio test, and hierarchical eigentasks that could be more compact and more efficient. Finally, we want to incorporate task similarity measures that account for history, to separate RL tasks that have similar observations but different dynamics.

Acknowledgements

This material is based on work supported by the Lifelong Learning Machines (L2M) program of the Defense Advanced Research Projects Agency (DARPA) under contract HR0011-18-C-0051. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA, the Department of Defense or the U.S. Government.

References

  • A. Achille, T. Eccles, L. Matthey, C. Burgess, N. Watters, A. Lerchner, and I. Higgins (2018) Life-long disentangled representation learning with cross-domain latent homologies. In Advances in Neural Information Processing Systems, pp. 9873–9883. Cited by: §1, §3.1, §6.
  • H. B. Ammar, E. Eaton, J. M. Luna, and P. Ruvolo (2015) Autonomous Cross-Domain Knowledge Transfer in Lifelong Policy Gradient Reinforcement Learning. In IJCAI, Cited by: §1.
  • Z. Chen and B. Liu (2016) Lifelong Machine Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 10 (3). External Links: ISSN 1939-4608, 1939-4616, Link Cited by: §1.
  • S. Diekelmann and J. Born (2010) The memory function of sleep. Nature Reviews Neuroscience 11 (2). Cited by: Acknowledgements.
  • N. Dilokthanakul, P. A. M. Mediano, M. Garnelo, M. C. H. Lee, H. Salimbeni, K. Arulkumaran, and M. Shanahan (2016) Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. arXiv:1611.02648 [cs, stat]. External Links: Link Cited by: §3.
  • L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, et al. (2018) IMPALA: scalable distributed deep-RL with importance weighted actor-learner architectures. In ICML, Cited by: §5, §6.3.2.
  • G. Fei, S. Wang, and B. Liu (2016) Learning Cumulatively to Become More Knowledgeable. In ACM SIGKDD ’16, San Francisco, California, USA. External Links: ISBN 978-1-4503-4232-2, Link Cited by: §1.
  • T. L. Hayes, N. D. Cahill, and C. Kanan (2018) Memory Efficient Experience Replay for Streaming Learning. arXiv preprint arXiv:1809.05922. Cited by: §1, §4.
  • C. He (2018) Exemplar-Supported Generative Reproduction for Class Incremental Learning. In British Machine Vision Conference (BMVC), Cited by: Acknowledgements.
  • F. Huszár (2018) Note on the quadratic penalties in elastic weight consolidation. Proceedings of the National Academy of Sciences. Cited by: Acknowledgements.
  • D. P. Kingma and M. Welling (2014) Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat]. External Links: Link Cited by: §1, §3.1.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. (2018) Reply to Huszár: The elastic weight consolidation penalty is empirically valid. Proceedings of the National Academy of Sciences 115 (11). Cited by: §1.
  • J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and R. Hadsell (2016) Overcoming catastrophic forgetting in neural networks. arXiv:1612.00796 [cs, stat]. External Links: Link Cited by: §1, Table 1.
  • G. P. Krishnan, T. Tadros, R. Ramyaa, and M. Bazhenov (2019) Biologically inspired sleep algorithm for artificial neural networks. arXiv:1908.02240 [cs]. External Links: Link Cited by: §1.
  • S. Lee, J. Stokes, and E. Eaton (2019) Learning Shared Knowledge for Deep Lifelong Learning using Deconvolutional Networks. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China. External Links: ISBN 978-0-9992411-4-1, Link Cited by: §1.
  • X. Li, Y. Zhou, T. Wu, R. Socher, and C. Xiong (2019) Learn to Grow: A Continual Structure Learning Framework for Overcoming Catastrophic Forgetting. arXiv:1904.00310 [cs]. External Links: Link Cited by: §1.
  • Z. Li and D. Hoiem (2016) Learning without Forgetting. arXiv:1606.09282 [cs, stat]. External Links: Link Cited by: Table 1.
  • X. Liu, H. Yang, A. Ravichandran, R. Bhotika, and S. Soatto (2020) Continual Universal Object Detection. arXiv:2002.05347 [cs]. External Links: Link Cited by: §1.
  • K. Louie and M. A. Wilson (2001) Temporally structured replay of awake hippocampal ensemble activity during rapid eye movement sleep. Neuron 29 (1). Cited by: §1.
  • A. V. Makkuva, S. Oh, S. Kannan, and P. Viswanath (2018) Breaking the gridlock in Mixture-of-Experts: Consistent and Efficient Algorithms. arXiv:1802.07417 [cs]. External Links: Link Cited by: §1.
  • N. Y. Masse, G. D. Grant, and D. J. Freedman (2018) Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization. Proc Natl Acad Sci USA 115 (44). External Links: ISSN 0027-8424, 1091-6490, Link Cited by: §1.
  • J. L. McClelland, B. L. McNaughton, and A. K. Lampinen (2020) Integration of New Information in Memory: New Insights from a Complementary Learning Systems Perspective. preprint Neuroscience. External Links: Link Cited by: §1.
  • A. Nagabandi, C. Finn, and S. Levine (2019) DEEP ONLINE LEARNING VIA META-LEARNING: CONTINUAL ADAPTATION FOR MODEL-BASED RL. In ICLR, Cited by: §1.
  • C. V. Nguyen, Y. Li, T. D. Bui, and R. E. Turner (2018) Variational Continual Learning. arXiv:1710.10628 [cs, stat]. External Links: Link Cited by: §1.
  • G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter (2019) Continual Lifelong Learning with Neural Networks: A Review. Neural Networks 113. External Links: ISSN 08936080, Link Cited by: §1.
  • J. Ramapuram, M. Gregorova, and A. Kalousis (2017) Lifelong Generative Modeling. arXiv:1705.09847 [cs, stat]. External Links: Link Cited by: Acknowledgements.
  • D. Rao, F. Visin, A. Rusu, R. Pascanu, Y. W. Teh, and R. Hadsell (2019) Continual unsupervised representation learning. In Advances in Neural Information Processing Systems, Cited by: §3.
  • S. Rebuffi, A. Kolesnikov, G. Sperl, and C. H. Lampert (2017) iCaRL: incremental classifier and representation learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 1.
  • J. Ren, P. J. Liu, E. Fertig, J. Snoek, R. Poplin, M. A. DePristo, J. V. Dillon, and B. Lakshminarayanan (2019) Likelihood Ratios for Out-of-Distribution Detection. arXiv:1906.02845 [cs, stat]. External Links: Link Cited by: footnote 1.
  • R. Ring (2018) Reaver: modular deep reinforcement learning framework. GitHub. Note: https://github.com/inoryy/reaver Cited by: §6.3.2.
  • A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell (2015) Policy distillation. arXiv preprint arXiv:1511.06295. Cited by: §5.
  • A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell (2016) Progressive Neural Networks. arXiv:1606.04671 [cs]. External Links: Link Cited by: §1.
  • J. Schwarz, J. Luketina, W. M. Czarnecki, A. Grabska-Barwinska, Y. W. Teh, R. Pascanu, and R. Hadsell (2018) Progress & compress: a scalable framework for continual learning. arXiv preprint arXiv:1805.06370. Cited by: Table 1.
  • H. Shin, J. K. Lee, J. Kim, and J. Kim (2017) Continual learning with deep generative replay. In Advances in Neural Information Processing Systems, pp. 2990–2999. Cited by: §1, Table 1, §6.
  • D. L. Silver, Q. Yang, and L. Li (2013) Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium Series, Cited by: §1.
  • J. Smith, S. Baer, Z. Kira, and C. Dovrolis (2019) Unsupervised continual learning and self-taught associative memory hierarchies. Cited by: §4.
  • R. S. Sutton, D. Precup, and S. Singh (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112 (1-2), pp. 181–211. Cited by: §1.
  • C. Tessler, S. Givony, T. Zahavy, D. J. Mankowitz, and S. Mannor (2016) A Deep Hierarchical Approach to Lifelong Learning in Minecraft. arXiv:1604.07255 [cs]. External Links: Link Cited by: §1.
  • B. Tsuda, K. M. Tye, H. T. Siegelmann, and T. J. Sejnowski (2020) A modeling framework for adaptive lifelong learning with transfer and savings through gating in the prefrontal cortex. bioRxiv. External Links: Link Cited by: §1, §3, §6.1.
  • G. M. van de Ven and A. S. Tolias (2018) Generative replay with feedback connections as a general strategy for continual learning. arXiv preprint arXiv:1809.10635. Cited by: §3.1.
  • G. M. van de Ven and A. S. Tolias (2019) Three scenarios for continual learning. arXiv:1904.07734 [cs, stat]. External Links: Link Cited by: §1, §1, §6.2, Table 1, §6.
  • O. Vinyals, T. Ewalds, S. Bartunov, P. Georgiev, A. S. Vezhnevets, M. Yeo, A. Makhzani, H. Küttler, J. Agapiou, J. Schrittwieser, et al. (2017) Starcraft II: a new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. Cited by: §1, §6.3.2, §6.3, §6.
  • M. A. Wilson and B. L. McNaughton (1994) Reactivation of hippocampal ensemble memories during sleep. Science 265 (5172). Cited by: Acknowledgements.
  • J. Yosinski, J. Clune, Y. Bengio, and H. Lipson (2014) How transferable are features in deep neural networks?. In Advances in neural information processing systems, Cited by: Acknowledgements.
  • F. Zenke, B. Poole, and S. Ganguli (2017) Continual learning through synaptic intelligence. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 3987–3995. Cited by: Table 1.
  • D. Zhang, Y. Sun, B. Eriksson, and L. Balzano (2017) Deep unsupervised clustering using mixture of autoencoders. arXiv preprint arXiv:1712.07788. Cited by: §3.

Appendix A Policy Distillation Results

We use a policy distillation approach for policy consolidation. As a proof-of-concept, we conducted an experiment using policy distillation to combine two SC2 policies – for the CollectMineralShards and DefeatRoaches tasks – into a single policy. In the experiment, we use real observations from both tasks for distillation, rather than sampling one set of observations from a generative model. Figure 7 compares the performance of the distilled policy to the performance of single task policies. Distillation is able to preserve the control knowledge embodied in both policies while compressing them into a single policy, and about 100x fewer training batches are required for distillation than were required originally to learn the policies being distilled. Policy distillation thus provides an effective and efficient means of knowledge consolidation for our framework.

(a)
(b)
Figure 7: Policy distillation

Appendix B Supplementary Experimental Results

(a) Average area under the curve averaged over classes vs iterations.
(b) Mean average precision vs iterations.
(c) Per-class (y-axis) accuracy over iterations (x-axis).
Figure 8: SplitMNIST: Per-task breakdown of accuracy vs iterations. Each task is seen for 500 iterations. Steady increase in these metrics indicates successful continual learning without forgetting.
(a) Average area under the curve averaged over classes vs iterations.
(b) Mean average precision vs iterations.
(c) Per-class (y-axis) accuracy over iterations (x-axis).
Figure 9: Split(MNIST+FashionMNIST): Per-task breakdown of accuracy vs iterations. Each task is seen for 500 iterations. Steady increase in these metrics indicates successful continual learning without forgetting.
Table 3: SC2 data representation for the mini-games BuildMarines, CollectMineralShards, DefeatZerglingsAndBanelings, MoveToBeacon, CollectMineralsAndGas, and DefeatRoaches.
Figure 10: OWVAE reconstructions for SC2 tasks.