1 Introduction
Human and animal learning is characterized not just by a capacity to acquire complex skills, but also the ability to adapt rapidly when those skills must be carried out under new or changing conditions. For example, animals can quickly adapt to walking and running on different surfaces (Herman, 2017) and humans can easily modulate force during reaching movements in the presence of unexpected perturbations (Flanagan & Wing, 1993). Furthermore, these experiences are remembered, and can be recalled to adapt more quickly when similar disturbances occur in the future (Doyon & Benali, 2005). Since learning entirely new models on such short timescales is impractical, we can devise algorithms that explicitly train models to adapt quickly from small amounts of data. Such online adaptation is crucial for intelligent systems operating in the real world, where changing factors and unexpected perturbations are the norm. In this paper, we propose an algorithm for fast and continuous online learning that utilizes deep neural network models to build and maintain a task distribution, allowing for the natural development of both generalization as well as task specialization.
Our working example is continuous adaptation in the model-based reinforcement learning setting, though our approach generally addresses any online learning scenario with streaming data. We assume that each “trial” consists of multiple tasks, and that the delineation between the tasks is not provided explicitly to the learner; instead, the method must adaptively decide what “tasks” even represent, when to instantiate new tasks, and when to continue updating old ones. For example, a robot running over changing terrain might need to handle uphill and downhill slopes, and might choose to maintain separate models that become specialized to each slope, adapting to each one in turn based on the currently inferred surface.
We perform adaptation simply by using online stochastic gradient descent (SGD) on the model parameters, while maintaining a mixture model over model parameters for different tasks. The mixture is updated via the Chinese restaurant process (Stimberg et al., 2012), which enables new tasks to be instantiated as needed over the course of a trial. Although online learning is perhaps one of the oldest applications of SGD (Bottou, 1998), modern parametric models such as deep neural networks are exceedingly difficult to train online with this method. They typically require medium-sized minibatches and multiple epochs to arrive at sensible solutions, which is not suitable when receiving data in an online streaming setting. One of our key observations is that meta-learning can be used to learn a prior initialization for the parameters that makes such direct online adaptation feasible, with only a handful of gradient steps. The meta-training procedure we use is based on model-agnostic meta-learning (MAML) (Finn et al., 2017), where a prior weight initialization is learned for a model so as to optimize improvement on any task from a meta-training task distribution after a small number of gradient steps.
Meta-learning with MAML has previously been extended to model-based RL (Nagabandi et al., 2018), but only for the k-shot adaptation setting: the meta-learned prior model is adapted to the k most recent time steps, but the adaptation is not carried forward in time (i.e., adaptation is always performed from the prior itself). This rigid batch-mode setting is restrictive in an online learning setup and is insufficient for tasks that are further outside of the training distribution. A more natural formulation is one where the model receives a continuous stream of data and must adapt online to a potentially nonstationary task distribution. This requires both fast adaptation and the ability to recall prior tasks, as well as an effective adaptation strategy to interpolate as needed between the two.
The primary contribution of this paper is a meta-learning for online learning (MOLe) algorithm that uses expectation maximization, in conjunction with a Chinese restaurant process prior on the task distribution, to learn mixtures of neural network models that are each updated with online SGD. In contrast to prior multitask and meta-learning methods, our method’s online assignment of soft task probabilities allows for task specialization to emerge naturally, without requiring task delineations to be specified in advance. We evaluate MOLe in the context of model-based RL on a suite of challenging simulated robotic tasks including disturbances, environmental changes, and simulated motor failures. Our simulated experiments show a half-cheetah agent and a hexapedal crawler robot performing continuous model adaptation in an online setting. Our results show online instantiation of new tasks, the ability to adapt to out-of-distribution tasks, and the ability to recognize and revert to prior tasks. Additionally, we demonstrate that MOLe outperforms a state-of-the-art prior method that does k-shot model-based meta-RL, as well as natural baselines such as continuous gradient updates for adaptation and online learning without meta-training.
2 Related Work
Online learning is one of the oldest subfields of machine learning (Bottou, 1998; Jafari et al., 2001). Prior algorithms have used online gradient updates (Duchi et al., 2011) and probabilistic filtering formulations (Murphy, 2002; Hoffman et al., 2010; Broderick et al., 2013). In principle, commonly used gradient-based learning methods, such as SGD, can easily be used as online learning algorithms (Bottou, 1998). In practice, their performance with deep neural network function approximators is limited (Sahoo et al., 2017): such high-dimensional models must be trained with batch-mode methods, minibatches, and multiple passes over the data. We aim to lift this restriction by using model-agnostic meta-learning (MAML) to explicitly pretrain a model that enables fast adaptation, which we then use for continuous online adaptation via an expectation maximization algorithm with a Chinese restaurant process (Blei et al., 2003) prior for dynamic allocation of new tasks in a nonstationary task distribution.
Online learning is related to continual or lifelong learning (Thrun, 1998), where the agent faces a nonstationary distribution of tasks over time. However, unlike works that focus on avoiding negative transfer, i.e. catastrophic forgetting (Kirkpatrick et al., 2017; Rebuffi et al., 2017; Zenke et al., 2017; Lopez-Paz et al., 2017; Nguyen et al., 2017), online learning focuses on the ability to rapidly learn and adapt in the presence of nonstationarity. While some continual learning works consider the problem of forward transfer, e.g. Rusu et al. (2016); Aljundi et al. (2017); Wang et al. (2017), these works and others in continual learning generally focus on small sets of tasks where fast, online learning is not realistically possible, since there are simply not enough tasks to recover structure that enables fast, few-shot learning in new tasks or environments.
Our approach builds on techniques for meta-learning or learning-to-learn (Thrun & Pratt, 1998; Schmidhuber, 1987; Bengio et al., 1992; Naik & Mammone, 1992). However, most recent meta-learning work considers a setting where one task is learned at a time, often from a single batch of data (Santoro et al., 2016; Ravi & Larochelle, 2017; Munkhdalai & Yu, 2017; Wang et al., 2016; Duan et al., 2016). In our work, we specifically address nonstationary task distributions and do not assume that task boundaries are known. Prior work (Jerfel et al., 2018) has also considered nonstationary task distributions; whereas Jerfel et al. (2018) use the meta-gradient to estimate the parameters of a mixture over the task-specific parameters, we focus on fast adaptation and accumulation of task-specific mixture components during runtime optimization. Other meta-learning works have considered nonstationarity within a task (Al-Shedivat et al., 2017) and episodes involving multiple tasks at meta-test time (Ritter et al., 2018), but they do not consider continual online adaptation with unknown task separation. Prior work has also studied meta-learning for model-based RL (Nagabandi et al., 2018). This prior method updates the model every time step, but each update is a batch-mode k-shot update, using exactly k prior transitions and resetting the model at each step. This allows for adaptive control, but does not enable continual online adaptation, since updates from previous steps are always discarded. In our comparisons, we find that our approach substantially outperforms this prior method. To our knowledge, our work is the first to apply meta-learning to learn streaming online updates.
3 Problem Statement
We formalize our online learning problem setting as follows: at each time step $t$, the model receives an input $x_t$ and produces a prediction $\hat{y}_t$. It then receives a ground truth label $y_t$, which must be used to adapt the model to increase its prediction accuracy on the next input $x_{t+1}$. The true labels are assumed to come from some task distribution $P(Y_t \mid X_t, T_t)$, where $T_t$ is the task at time $t$. The tasks themselves change over time, resulting in a nonstationary task distribution, and the identity of the task $T_t$ is unknown to the learner. In real-world settings, tasks might correspond to unknown parameters of the system (e.g., motor malfunction on a robot), user preferences, or other unexpected events. This problem statement covers a range of online learning problems that all require continual adaptation to streaming data and trading off between generalization and specialization.
In our experiments, we use model-based RL as our working example, where the input $x_t$ is a state-action pair, and the output $y_t$ is the next state. We discuss this application to model-based RL in Section 6, but we keep the following derivation of our method general for the case of arbitrary online prediction problems.
4 Online Learning with a Mixture of Meta-Trained Networks
We discuss our meta-learning for online learning (MOLe) algorithm in two parts: online learning in this section, and meta-learning in the next. In this section, we explain our online learning method that enables effective online learning using a continuous stream of incoming data from a nonstationary task distribution. We aim to retain generalization so as to not lose past knowledge, as well as gain specialization, which is particularly important for learning new tasks that are further out-of-distribution and require more learning. We discuss the process of obtaining a meta-learned prior in Sec. 5, but we first formulate in this section an online adaptation algorithm using SGD with expectation maximization to maintain and adapt a mixture model over task model parameters (i.e., a probabilistic task distribution).
4.1 Method Overview
Let $p_{\theta(T_t)}(y_t \mid x_t)$ represent the predictive distribution of our model on input $x_t$, for an unknown task $T_t$. Our goal is to estimate model parameters $\theta_t(T_i)$ for each task $T_i$ in the nonstationary task distribution. This requires inferring the distribution over tasks at each step, $P(T_t = T_i \mid x_t, y_t)$, using that distribution to make predictions $\hat{y}_t$, and also using it to update each model from $\theta_t(T_i)$ to $\theta_{t+1}(T_i)$. In practice, the parameters $\theta(T_i)$ of each model will correspond to the weights of a neural network $f_{\theta(T_i)}$.
Each model begins with some prior parameter vector $\theta^*$, which we will discuss in more detail in Section 5. Since the number of tasks is also unknown, we begin with one task at time step 0, where $|T| = 1$ and thus $\theta_0(T) = \{\theta_0(T_0)\} = \{\theta^*\}$. From here, we continuously update all parameters in $\theta_t(T)$ and add new tasks as needed, in the attempt to model the true underlying process $P(Y_t \mid X_t, T_t)$. Since task identities are unknown, we must also estimate $P(T_t)$ at each time step. Thus, the online learning problem consists of adapting each $\theta_t(T_i)$ at each time step according to the inferred task probabilities $P(T_t = T_i \mid x_t, y_t)$. To do this, we adapt the expectation maximization (EM) algorithm and optimize the expected log-likelihood, given by
$$\mathcal{L} = \mathbb{E}_{T_t \sim P(T_t \mid x_t, y_t)}\left[\log p_{\theta_t(T_t)}(y_t \mid x_t, T_t)\right] \tag{1}$$
where we use $\theta_t(T_t)$ to denote the model parameters corresponding to task $T_t$. Finally, to handle the unknown number of tasks, we employ the Chinese restaurant process to instantiate new tasks as needed.
4.2 Approximate Online Inference
We use expectation maximization (EM) to update the model parameters. In our case, the E step in EM involves estimating the task distribution $P(T_t = T_i \mid x_t, y_t)$ at the current time step, while the M step involves updating all model parameters from $\theta_t(T)$ to obtain the new model parameters $\theta_{t+1}(T)$. The parameters are always updated by one gradient step per time step, according to the inferred task responsibilities.
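We first estimate the expectations over all task parameters $T_i$ in the task distribution, where the posterior of each task probability can be written as follows:
$$P(T_t = T_i \mid x_t, y_t) \propto p_{\theta_t(T_i)}(y_t \mid x_t, T_t = T_i)\, P(T_t = T_i) \tag{2}$$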
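We then formulate the task prior using a Chinese restaurant process (CRP) to enable new tasks to be instantiated during a trial. The CRP is an instantiation of a Dirichlet process. In the CRP, at time $t$, the probability of each task $T_i$ should be given by
$$P(T_t = T_i) = \frac{n_{T_i}}{t - 1 + \alpha} \tag{3}$$
where $n_{T_i}$ is the expected number of datapoints in task $T_i$ for all steps $1, \dots, t-1$, and $\alpha$ is a hyperparameter that controls the instantiation of new tasks. The prior therefore becomes
$$P(T_t = T_i) = \frac{\sum_{t'=1}^{t-1} P(T_{t'} = T_i \mid x_{t'}, y_{t'})}{t - 1 + \alpha}, \qquad P(T_t = T_{\text{new}}) = \frac{\alpha}{t - 1 + \alpha} \tag{4}$$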
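The CRP prior described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `crp_prior` and the convention of appending the candidate new task as the last entry are assumptions made here. The denominator is computed as the sum of the expected counts plus the concentration hyperparameter, which matches "steps so far plus the hyperparameter" whenever the counts were accumulated from normalized posteriors.

```python
import numpy as np

def crp_prior(expected_counts, alpha):
    """CRP prior over existing tasks plus one candidate new task (last entry).

    expected_counts[i] is the running sum of past posterior probabilities
    assigned to task i, i.e. its expected number of datapoints so far.
    alpha > 0 controls how readily a new task is instantiated.
    """
    counts = np.asarray(expected_counts, dtype=float)
    # Existing tasks are weighted by their expected counts; the new task by alpha.
    weights = np.append(counts, alpha)
    return weights / weights.sum()

# Two existing tasks with expected counts 3 and 1, and alpha = 1.
print(crp_prior([3.0, 1.0], alpha=1.0))
```

A larger `alpha` shifts prior mass toward the candidate new task, so new tasks are instantiated more eagerly.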
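Combining the prior and likelihood, we derive the following posterior task probability distribution:
$$P(T_t = T_i \mid x_t, y_t) \propto p_{\theta_t(T_i)}(y_t \mid x_t, T_t = T_i)\left[\sum_{t'=1}^{t-1} P(T_{t'} = T_i \mid x_{t'}, y_{t'}) + \mathbb{1}(T_i = T_{\text{new}})\,\alpha\right] \tag{5}$$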
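As a hedged sketch of the E step, the posterior over tasks can be computed in log space by adding each task's log-likelihood to the log of its CRP weight and normalizing. The function name `task_posterior` and the argument layout (the last log-likelihood belongs to the candidate new task) are assumptions for illustration. Working in log space is a numerical-stability choice made here, not something stated in the text.

```python
import numpy as np

def task_posterior(log_likelihoods, expected_counts, alpha):
    """Posterior task probabilities: likelihood times CRP weight, normalized.

    log_likelihoods has one entry per existing task plus a final entry for
    the candidate new task; expected_counts has one entry per existing task.
    """
    log_lik = np.asarray(log_likelihoods, dtype=float)
    # CRP weights: expected counts for existing tasks, alpha for the new task.
    weights = np.append(np.asarray(expected_counts, dtype=float), alpha)
    log_post = log_lik + np.log(weights)
    log_post -= log_post.max()          # subtract the max before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```

When all tasks explain the data equally well, the posterior reduces to the prior; when one task's likelihood dominates, nearly all mass moves to that task regardless of the counts.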
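Having estimated the latent task probabilities, we next perform the M step, which improves the expected log-likelihood in Equation 1 based on the inferred task distribution. Since each task starts from the prior $\theta^*$, the values of all parameters in $\theta_{t+1}(T)$ after one gradient update, with step size $\beta$, are given by
$$\theta_{t+1}(T_i) = \theta^* + \beta \sum_{t'=0}^{t} P(T_{t'} = T_i \mid x_{t'}, y_{t'})\, \nabla_{\theta_{t'}(T_i)} \log p_{\theta_{t'}(T_i)}(y_{t'} \mid x_{t'}) \quad \forall\, T_i \tag{6}$$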
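If we assume that all parameters of $\theta_t(T)$ have already been updated for the previous time steps $0, \dots, t-1$, we can approximate this update by simply updating all parameters $\theta_t(T)$ on the newest data, with step size $\beta$:
$$\theta_{t+1}(T_i) = \theta_t(T_i) + \beta\, P(T_t = T_i \mid x_t, y_t)\, \nabla_{\theta_t(T_i)} \log p_{\theta_t(T_i)}(y_t \mid x_t) \quad \forall\, T_i \tag{7}$$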
This procedure is an approximation, since updates to task parameters will in reality also change the corresponding task probabilities at previous time steps. However, this approximation removes the need to store previously seen data points and yields a fully online, streaming algorithm. Finally, to fully implement the EM algorithm, we must alternate the E and M steps to convergence at each time step, rolling back the previous gradient update at each iteration. In practice, we found it sufficient to perform the E and M steps only once per time step. While this is a crude simplification, successive time steps in the online learning scenario are likely to be correlated, making this procedure reasonable. However, it is also straightforward to perform multiple steps of EM while still remaining fully online.
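To make the approximate M step concrete, the sketch below applies one responsibility-weighted gradient step to every task's parameters using only the newest data. It assumes, purely for illustration, that each task model is linear with a unit-variance Gaussian likelihood, so that the negative log-likelihood is a squared error up to constants; the paper's models are neural networks. The name `m_step` and its arguments are hypothetical.

```python
import numpy as np

def m_step(thetas, posterior, x, y, lr):
    """One responsibility-weighted gradient step per task on the newest data.

    thetas:    list of parameter vectors, one per task, each of shape (d,)
    posterior: inferred probability of each task for the current datapoint
    x, y:      newest inputs of shape (n, d) and targets of shape (n,)
    """
    new_thetas = []
    for theta, resp in zip(thetas, posterior):
        residual = x @ theta - y          # prediction error of this task's model
        grad = x.T @ residual / len(y)    # gradient of 0.5 * mean squared error
        # Tasks with low responsibility barely move; likely tasks take a full step.
        new_thetas.append(theta - lr * resp * grad)
    return new_thetas
```

Because each step uses only the current datapoint and the current responsibilities, nothing from earlier time steps has to be stored.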
We now summarize this full online learning portion of MOLe, and we also outline it in Alg. 1. At the first time step $t = 0$, the task distribution is initialized to contain one entry: $\theta_0(T) = \{\theta^*\}$. At every time step after that, an E step is performed to estimate the task distribution and an M step is performed to update the model parameters. The CRP prior also assigns, at each time step, a probability to adding a new task. The parameters of this new task are adapted from $\theta^*$ on the latest data. The prediction on the next datapoint is then made using the model parameters corresponding to the most likely task $T^*$.
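The summary above can be sketched as a single online step. This is an illustrative reading under several assumptions, not a reproduction of Alg. 1: the task models are linear with a Gaussian likelihood, the candidate new task is added only when it is the most probable task, and the candidate is fit and scored on the same datapoint for brevity. All function names are hypothetical.

```python
import numpy as np

def log_lik(theta, x, y, sigma=1.0):
    # Gaussian log-likelihood (up to a constant) of a linear model y ~ x @ theta.
    return -0.5 * np.sum((x @ theta - y) ** 2) / sigma**2

def grad_step(theta, x, y, lr, weight=1.0):
    # One gradient step on 0.5 * mean squared error, scaled by `weight`.
    return theta - lr * weight * (x.T @ (x @ theta - y)) / len(y)

def online_step(thetas, counts, theta_prior, x, y, alpha=1.0, lr=0.1):
    """One E step, one M step, and an optional new-task instantiation."""
    # Candidate new task: adapt the prior on the latest data.
    candidate = grad_step(theta_prior, x, y, lr)
    # E step: likelihood times CRP weight (expected counts, alpha for the candidate).
    lls = np.array([log_lik(th, x, y) for th in thetas + [candidate]])
    weights = np.maximum(np.append(counts, alpha), 1e-12)  # guard against log(0)
    log_post = lls + np.log(weights)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # M step: responsibility-weighted update of every existing task.
    thetas = [grad_step(th, x, y, lr, w) for th, w in zip(thetas, post[:-1])]
    counts = counts + post[:-1]
    # Assumption of this sketch: keep the candidate only if it is the most probable task.
    if post.argmax() == len(thetas):
        thetas.append(candidate)
        counts = np.append(counts, post[-1])
    # Use the task whose updated model best explains the latest data.
    best = int(np.argmax([log_lik(th, x, y) for th in thetas]))
    return thetas, counts, best
```

A caller would start with `thetas = [theta_prior]`, keep `counts` as a NumPy array with one entry per task, and feed each new `(x, y)` pair through `online_step`, using the returned index to choose the model for the next prediction.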
5 Meta-Learning the Prior
We formulated an algorithm above for performing online adaptation using continually incoming data. For this method, we choose to meta-train the prior using the model-agnostic meta-learning (MAML) algorithm. This meta-training algorithm is an appropriate choice, because it results in a prior that is specifically intended for gradient-based fine-tuning. Before we further discuss our choice of meta-training procedure, we first give an overview of MAML and meta-learning in general.
Given a distribution of tasks, a meta-learning algorithm produces a learning procedure, which can, in some cases, quickly adapt to a new task. MAML optimizes for an initialization of a deep network that achieves good few-shot task generalization when fine-tuned using a few datapoints from that task. At train time, MAML sees small amounts of data from large numbers of tasks, where data from each task $T$ can be split into training and validation subsets ($D^{tr}_T$ and $D^{val}_T$), where $D^{tr}_T$ is of size $k$. MAML optimizes for model parameters $\theta$ such that one or more gradient steps on $D^{tr}_T$ result in a minimal loss $L$ on $D^{val}_T$. In our case, we will set $D^{tr}_{T_t} = (x_t, y_t)$ and $D^{val}_{T_t} = (x_{t+1}, y_{t+1})$, and the loss $L$ will correspond to negative log likelihood. A good $\theta$ that allows such adaptation to be successful across various meta-training tasks is thus a good network initialization from which adaptation can solve various new tasks that are related to the previously seen tasks. The MAML objective is defined as follows:
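$$\min_{\theta} \sum_{T} L\left(\theta - \eta \nabla_{\theta} L(\theta, D^{tr}_T),\; D^{val}_T\right) \tag{8}$$
Here, $\eta$ is the inner learning rate. Once this meta-objective is optimized, the resulting $\theta^*$ acts as a prior from which fine-tuning can occur at test-time, using recent experience from $D^{tr}_{T_{\text{test}}}$ as follows:
$$\phi_{T_{\text{test}}} = \theta^* - \eta \nabla_{\theta^*} L(\theta^*, D^{tr}_{T_{\text{test}}}) \tag{9}$$
Here, $\phi_{T_{\text{test}}}$ is adapted from the meta-learned prior $\theta^*$ to be more representative for the current time.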
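As a small worked illustration of the MAML objective, the sketch below computes the exact meta-gradient for a linear model with squared-error loss, where the second-order term is available in closed form. The linear model, the single inner step, and the names `loss`, `grad`, `meta_objective`, and `maml_meta_gradient` are all assumptions made for this example; with a neural network the same quantity would normally be obtained by automatic differentiation.

```python
import numpy as np

def loss(theta, X, y):
    # Squared-error loss; equals Gaussian negative log-likelihood up to constants.
    return 0.5 * np.mean((X @ theta - y) ** 2)

def grad(theta, X, y):
    return X.T @ (X @ theta - y) / len(y)

def meta_objective(theta, tasks, inner_lr):
    """Mean validation loss after one inner gradient step on each task."""
    return np.mean([loss(theta - inner_lr * grad(theta, X_tr, y_tr), X_val, y_val)
                    for X_tr, y_tr, X_val, y_val in tasks])

def maml_meta_gradient(theta, tasks, inner_lr):
    """Exact gradient of meta_objective with respect to the initialization."""
    meta_grad = np.zeros_like(theta)
    for X_tr, y_tr, X_val, y_val in tasks:
        adapted = theta - inner_lr * grad(theta, X_tr, y_tr)   # inner step
        hessian = X_tr.T @ X_tr / len(y_tr)                    # exact for a quadratic loss
        jac = np.eye(len(theta)) - inner_lr * hessian          # d(adapted) / d(theta)
        meta_grad += jac.T @ grad(adapted, X_val, y_val)       # chain rule
    return meta_grad / len(tasks)
```

Repeatedly stepping `theta` against `maml_meta_gradient` moves the initialization toward a point from which one inner step performs well across the task set.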
Although Finn et al. (2017) demonstrated this fast adaptation of deep neural networks and Nagabandi et al. (2018) extended this framework to model-based meta-RL, these methods address adaptation in the k-shot setting, always adapting directly from the meta-learned prior and not allowing further adaptation or specialization. In this work, we have extended these capabilities by enabling more evolution of knowledge through a temporally extended online adaptation procedure.
While our procedure for continual online learning is still initialized with this meta-training for k-shot adaptation (i.e., MAML), we found that this prior was sufficient to enable effective continual online adaptation at test time. The intuitive rationale for this is that MAML trains the model to be able to change significantly using only a small number of datapoints and gradient steps. Note that this meta-trained prior can be used at test time in (a) a k-shot setting, similar to how it was trained, or it can be used at test time by (b) taking substantially more gradient steps away from this prior. We show in Sec. 7 that our method outperforms both of these methods, but the mere ability to use this meta-learned prior in these ways makes the use of MAML enticing.
We note that it is quite possible to modify the MAML algorithm to optimize the model directly with respect to the weighted updates discussed in Section 4.2. This simply requires computing the task weights (the E step) on each batch during meta-training, and then constructing a computation graph where all gradient updates are multiplied by their respective weights. Standard automatic differentiation software can then compute the corresponding meta-gradient. For short trial lengths, this is not substantially more complex than standard MAML; for longer trial lengths, truncated backpropagation is an option. Although such a meta-training procedure better matches the way that the model is used during online adaptation, we found that it did not substantially improve our results. While it’s possible that the difference might be more significant if meta-training for longer-term adaptation, this observation does suggest that simply meta-training with MAML is sufficient for enabling effective continuous online adaptation in nonstationary multitask settings. To clarify, although this modified training procedure of incorporating the EM weight updates (during meta-training) did not explicitly improve our results, we see that test-time performance did indeed improve with using more data for the standard MAML meta-training procedure (see Appendix).
6 Application to Model-Based RL
In our experiments, we apply MOLe to model-based reinforcement learning. RL in general aims to act in a way that maximizes the sum of future rewards. At each time step $t$, the agent executes action $a_t$ from state $s_t$, transitions to the next state $s_{t+1}$ according to the transition probabilities (i.e., dynamics) $p(s_{t+1} \mid s_t, a_t)$ and receives rewards $r_t = r(s_t, a_t)$. The goal at each step is to execute the action $a_t$ that maximizes the discounted sum of future rewards $\sum_{t'=t}^{\infty} \gamma^{t'-t} r(s_{t'}, a_{t'})$, where discount factor $\gamma \in [0, 1]$ prioritizes near-term rewards. In model-based RL, in particular, the predictions from a known or learned dynamics model are either used to learn a policy, or used directly inside a planning algorithm to select actions that maximize reward.
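For concreteness, the discounted sum of future rewards over a finite reward sequence can be computed with a backward recursion. The helper name `discounted_return` is hypothetical, and truncating the sum at the end of the list is an assumption of this sketch.

```python
def discounted_return(rewards, gamma):
    """Discounted sum r_0 + gamma * r_1 + gamma^2 * r_2 + ... over a finite list."""
    total = 0.0
    # Accumulate from the last reward backward so each step applies one factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```

A discount factor closer to zero weights immediate rewards more heavily, which is the sense in which it prioritizes near-term rewards.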
In our work, the underlying distribution that we aim to model is the dynamics distribution $p(s_{t+1} \mid s_t, a_t, T_t)$, where the unknown $T_t$ represents the underlying settings (e.g., state of the system, external details, environmental perturbations, etc.). The goal for MOLe is to estimate this distribution with a predictive model $p_{\theta}$. To instantiate MOLe in this context of model-based RL, we follow Algorithm 1 with the following specifications:
(1) We set the input $x_t$ to be the concatenation of $K$ previous states and actions, given by $x_t = [s_{t-K}, a_{t-K}, \dots, s_{t-2}, a_{t-2}, s_{t-1}, a_{t-1}]$, and the output to be the corresponding next states $y_t = [s_{t-K+1}, \dots, s_{t-1}, s_t]$. This provides us with a slightly larger batch of data for each online update, as compared to using only the data from the given time step. Since individual time steps at high frequency can be very noisy, using the past $K$ transitions helps to damp out the updates.
(2) The predictive model $p_{\theta}$ represents each of these underlying transitions as an independent Gaussian such that $p_{\theta}(y_t \mid x_t) = \prod_{t'=t-K}^{t-1} p(s_{t'+1} \mid s_{t'}, a_{t'})$, where each $p(s_{t'+1} \mid s_{t'}, a_{t'})$ is parameterized with a Gaussian given by mean $f_{\theta}(s_{t'}, a_{t'})$ and constant variance $\sigma^2$. We implement this mean dynamics function $f_{\theta}$ as a neural network model with three hidden layers each of dimension 500, and ReLU nonlinearities.
(3) To calculate the new task parameter $\theta_t(T_{\text{new}})$, which may or may not be added to the task distribution $\theta_t(T)$, we use a set of $K$ nearby datapoints that is separate from the set $(x_t, y_t)$. This is done to avoid calculating the parameter using the same dataset on which it is evaluated, since $P(T_t = T_{\text{new}} \mid x_t, y_t)$ comes from evaluating the parameter $\theta_t(T_{\text{new}})$ on the data $(x_t, y_t)$.
(4) Unlike standard online streaming tasks where the next data point $x_{t+1}$ is just given, the incoming data point (i.e., the next visited state) in this case is influenced by the predictive model itself. This is because, after the most likely task $T^*$ is selected from the possible tasks, the predictions from the model $p_{\theta_t(T^*)}$ are used by the controller to plan over a sequence of future actions and select the actions that maximize future reward. Note that the planning procedure is based on stochastic optimization, following prior work (Nagabandi et al., 2018), and we provide more details in the appendix. Since the controller’s action choice determines the next data point, and since the controller’s choice is dependent on the estimated model parameters, it is even more crucial in this setting to appropriately adapt the model.
(5) Finally, note that we attain $\theta^*$ from meta-training using model-agnostic meta-learning (MAML), as mentioned in the method above. However, in this case, MAML is performed in the loop of model-based RL. In other words, the model parameters at a given iteration of meta-training are used by the controller to generate on-policy rollouts, the data from these rollouts is then added to the dataset for MAML, and this process repeats until the end of meta-training.
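Item (4) above says the controller plans with the learned model through stochastic optimization, with details deferred to the appendix. One simple instance of such a planner is random shooting, sketched below as an assumption rather than as the authors' exact procedure: it samples candidate action sequences, rolls them out through the model, and executes the first action of the best sequence. The names `random_shooting`, `dynamics_fn`, and `reward_fn`, the uniform action range, and the default horizon and sample count are all hypothetical.

```python
import numpy as np

def random_shooting(state, dynamics_fn, reward_fn, action_dim,
                    horizon=10, n_candidates=500, rng=None):
    """Return the first action of the best sampled action sequence.

    dynamics_fn(states, actions) -> next states, batched over candidates.
    reward_fn(states, actions)   -> rewards of shape (n_candidates,).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1] (an assumed action range).
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for h in range(horizon):
        returns += reward_fn(states, actions[:, h])
        states = dynamics_fn(states, actions[:, h])   # predicted next states
    # Execute only the first action, then replan at the next time step.
    return actions[np.argmax(returns), 0]
```

In a model-predictive control loop, `dynamics_fn` would wrap the mean prediction of the currently selected task model, so a better-adapted model directly changes which data point is observed next.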
7 Experiments
The questions that we aimed to study from our experiments include: Can MOLe 1) autonomously discover some task structure amid a stream of nonstationary data? 2) adapt to tasks that are further outside of the task distribution than can be handled by a k-shot learning approach? 3) recognize and revert to tasks it has seen before? 4) avoid overfitting to a recent task to prevent deterioration of performance upon the next task switch? 5) outperform other methods?
To study these questions, we conduct experiments on agents in the MuJoCo physics engine (Todorov et al., 2012). The agents we used are a half-cheetah ($S \in \mathbb{R}^{21}$, $A \in \mathbb{R}^{6}$) and a hexapedal crawler ($S \in \mathbb{R}^{50}$, $A \in \mathbb{R}^{12}$). Using these agents, we design a number of challenging online learning problems that involve multiple sudden and gradual changes in the underlying task distribution, including tasks that are extrapolated from those seen previously, where online learning is critical. Through these experiments, we aim to build problem settings that are representative of the types of disturbances and shifts that a real RL agent might encounter.
We present results and analysis of our findings in the following three sections, and videos can be found at https://sites.google.com/berkeley.edu/onlineviameta. In our experiments, we compare to several alternative methods, including two approaches that leverage meta-training and two approaches that do not:
(a) k-shot adaptation with meta-learning: Always adapt from the meta-trained prior $\theta^*$, as typically done with meta-learning methods (Nagabandi et al., 2018). This method is often insufficient for adapting to tasks that are further out of distribution, and the adaptation is also not carried forward in time for future use.
(b) continued adaptation with meta-learning: Always take gradient steps from the previous time step’s parameters. This method often overfits to recently observed tasks, so it should indicate the importance of our method effectively identifying task structure to avoid overfitting and enable recall.
(c) model-based RL: Train a model on the same data as the methods above, using standard supervised learning, and keep this model fixed throughout the trials (i.e., no meta-learning and no adaptation).
(d) model-based RL with online gradient updates: Use the same model from model-based RL (i.e., no meta-learning), but adapt it online using gradient descent at run time. This is representative of commonly used dynamic evaluation methods (Rei, 2015; Krause et al., 2017, 2016; Fortunato et al., 2017).
7.1 Terrain Slopes on Half-Cheetah
We start with the task of a half-cheetah (Fig. 1) agent, traversing terrains of differing slopes. The prior model is meta-trained on data from terrains with random slopes of low magnitudes, and the test trials are executed on difficult out-of-distribution tasks such as basins, steep hills, etc. As shown in Fig. 2, neither model-based RL nor model-based RL with online gradient updates performs well on these out-of-distribution tasks, even though those models were trained on the same data that the meta-trained model received. The poor performance of the model-based RL approach indicates the need for model adaptation (as opposed to assuming a single model can encompass everything), while the poor performance of model-based RL with online gradient updates indicates the need for a meta-learned initialization to enable online learning with neural networks.
For the three meta-learning and adaptation methods, we expect continued adaptation with meta-learning to perform poorly due to continuous gradient steps causing it to overfit to recent data; that is, we expect experience on the upward slopes to lead to deterioration of performance on downward slopes, or something similar. However, based on both our qualitative and quantitative results, we see that the meta-learning procedure seems to have initialized the agent with a parameter space in which these various “tasks” are not seen as substantially different, where online learning by SGD performs well. This suggests that the meta-learning process finds a task space where there is an easy skill transfer between slopes; thus, even when MOLe is faced with the option of switching tasks or adding new tasks to its dynamic latent task distribution, it chooses not to do so (Fig. 3). Unlike findings that we will see later, it is interesting that the discovered task space here does not correspond to human-distinguishable categorical labels. Finally, note that these tasks of changing slopes are not particularly similar to each other (and that the discovered task space is indeed useful), because the two non-meta-learning baselines do indeed fail on these test tasks despite having similar training performance on the shallow training slopes.
7.2 Half-Cheetah Motor Malfunctions
While the findings from the half-cheetah on sloped terrains illustrate that separate task parameters aren’t always necessary for what might externally seem like separate tasks, we also want to study agents that experience more drastically changing nonstationary task distributions during their experience in the world. For this set of experiments, we train all models on data where a single actuator is selected at random to experience a malfunction during the rollout. In this case, malfunction means that the polarity or magnitude of actions applied to that actuator is altered. Fig. 4 shows the results of various methods on drastically out-of-distribution test tasks, such as altering all actuators at once. The left of Fig. 4 shows that when the task distribution during the test trials contains only a single task, such as ‘sign negative’ where all actuators are prescribed to be the opposite polarity, then continued adaptation performs well by repeatedly performing gradient updates on incoming data. However, as shown in the other tasks of Fig. 4, the performance of this continued adaptation substantially deteriorates when the agent experiences a nonstationary task distribution. Due to overspecialization on recent incoming data, such methods that continuously adapt tend to forget and lose previously existing skills. This overfitting and forgetting of past skills is also illustrated in the consistent performance deterioration shown in Fig. 4. MOLe, on the other hand, dynamically builds a probabilistic task distribution and allows adaptation to these difficult tasks, without forgetting past skills. We show a sample task setup in Fig. 5, where the agent experiences alternating periods of normal and crippled-leg operation. This plot shows the successful recognition of new tasks as well as old tasks; note that both the recognition and adaptation are done online, without using a bank of past data to perform the adaptation, and without a human-specified set of task categories.
7.3 Crippling of End Effectors on Six-Legged Crawler
To further examine the effects of our continual online adaptation algorithm, we study another, more complex agent: a 6-legged crawler (Fig. 6). In these experiments, models are trained on random joints being crippled (i.e., unable to apply actuator commands). In Fig. 7, we present two illustrative test tasks: (1) the agent sees a set configuration of crippling for the duration of its test-time experience, and (2) the agent receives alternating periods of experience, between regions of normal operation and regions of having crippled legs.
The first setting is similar to data seen during training, and thus, we see that even the model-based RL and model-based RL with online gradient updates baselines do not fail. The methods that include both meta-learning and adaptation, however, do have higher performance. Furthermore, we see again that continued gradient steps in this case of a single-task setting are not detrimental. The second setting’s nonstationary task distribution (when the leg crippling is dynamic) illustrates the need for online adaptation (model-based RL fails), the need for a good prior to adapt from (failure of model-based RL with online gradient updates), the harm of overfitting to recent experience and thus forgetting older skills (low performance of continued gradient steps), and the need for further adaptation away from the prior (limited performance of k-shot adaptation). With MOLe, this agent is able to build its own representation of “task” switches, and we see that this switch does indeed correspond to recognizing regions of leg crippling (left of Fig. 8). The plot of the cumulative sum of rewards (right of Fig. 8) of each of the three meta-learning plus adaptation methods includes this same task switch pattern every 500 steps: here, we can clearly see that steps 500-1000 and 1500-2000 were the crippled regions. Continued gradient steps actually performs worse on the second and third times it sees normal operation, whereas MOLe is noticeably better as it sees the task more often. Note this improvement of both skills, where development of one skill actually does not hinder the other.
Finally, we examine experiments where the crawler experiences (during each trial) walking straight, making turns, and sometimes having a crippled leg. The performance during the first 500 time steps of “walking forward in a normal configuration” for continued gradient steps was comparable to MOLe (+/-10% difference), but its performance during the last 500 time steps of “walking forward in a normal configuration” was 200% lower. Note this detrimental effect of performing updates without allowing for separate task specialization/adaptation.
8 Discussion
We presented an online learning method for neural network models that can handle nonstationary, multi-task settings within each trial. Our method adapts the model directly with SGD, where an EM algorithm uses a Chinese restaurant process prior to maintain a distribution over tasks and handle nonstationarity. Although SGD generally makes for a poor online learning algorithm in the streaming setting for large parametric models such as deep neural networks, we observe that, by (1) meta-training the model for fast adaptation with MAML and (2) employing our online algorithm for probabilistic updates at test time, we can enable effective online learning with neural networks. In our experiments, we applied this approach to model-based RL, and we demonstrated that it could be used to adapt the behavior of simulated robots faced with various new and unexpected tasks. Our results showed that our method can develop its own notion of task, continuously adapt away from the prior as necessary (to learn even tasks that require more adaptation), and recall tasks it has seen before. While we use model-based RL as our evaluation domain, our method is general and could be applied to other streaming and online learning settings. An exciting direction for future work would be to apply our method to domains such as time series modeling and active online learning.
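As a rough illustration of how a Chinese restaurant process prior can be combined with per-task likelihoods to form a distribution over tasks, the following minimal sketch uses the textbook CRP form. The function names, the use of soft counts, and the default `alpha=1.0` are assumptions made for this example; the exact update used by the method may differ.

```python
import numpy as np

def crp_prior(soft_counts, alpha=1.0):
    """Textbook CRP prior over K existing tasks plus one potential new task.

    soft_counts[i] is the (possibly fractional) number of past time steps
    attributed to task i; alpha is the concentration parameter (assumed).
    """
    counts = np.asarray(soft_counts, dtype=float)
    total = counts.sum() + alpha
    # Existing tasks are weighted by their counts, the new task by alpha.
    return np.append(counts, alpha) / total

def task_posterior(log_likelihoods, soft_counts, alpha=1.0):
    """E-step sketch: posterior over tasks, proportional to likelihood * prior.

    log_likelihoods has K + 1 entries: one per existing task model and one
    for a candidate new task. All soft counts are assumed to be positive.
    """
    prior = crp_prior(soft_counts, alpha)
    log_post = np.asarray(log_likelihoods, dtype=float) + np.log(prior)
    log_post -= log_post.max()  # subtract the max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()

prior = crp_prior([3.0, 1.0], alpha=1.0)
posterior = task_posterior([0.0, 0.0, 0.0], [3.0, 1.0], alpha=1.0)
```

With equal likelihoods, the posterior reduces to the prior, so a new task is instantiated only when the data are explained markedly better by a fresh model than by any existing one.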
References
 Al-Shedivat et al. (2017) Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.
 Aljundi et al. (2017) Rahaf Aljundi, Punarjay Chakravarty, and Tinne Tuytelaars. Expert gate: Lifelong learning with a network of experts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
 Bengio et al. (1992) Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, 1992.
 Blei et al. (2003) David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
 Botev et al. (2013) Zdravko I Botev, Dirk P Kroese, Reuven Y Rubinstein, and Pierre L’Ecuyer. The cross-entropy method for optimization. In Handbook of statistics, volume 31, pp. 35–59. Elsevier, 2013.
 Bottou (1998) Léon Bottou. Online learning and stochastic approximations. Online learning in neural networks, 17(9):142, 1998.
 Broderick et al. (2013) Tamara Broderick, Nicholas Boyd, Andre Wibisono, Ashia C Wilson, and Michael I Jordan. Streaming variational bayes. In Advances in Neural Information Processing Systems, pp. 1727–1735, 2013.
 Doyon & Benali (2005) Julien Doyon and Habib Benali. Reorganization and plasticity in the adult brain during learning of motor skills. Current opinion in neurobiology, 15(2):161–167, 2005.
 Duan et al. (2016) Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. Rl2: Fast reinforcement learning via slow reinforcement learning. arXiv:1611.02779, 2016.
 Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
 Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. International Conference on Machine Learning (ICML), 2017.
 Flanagan & Wing (1993) J Randall Flanagan and Alan M Wing. Modulation of grip force with load force during point-to-point arm movements. Experimental Brain Research, 95(1):131–143, 1993.
 Fortunato et al. (2017) Meire Fortunato, Charles Blundell, and Oriol Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.
 Herman (2017) Robert Herman. Neural control of locomotion, volume 18. Springer, 2017.
 Hoffman et al. (2010) Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent dirichlet allocation. In advances in neural information processing systems, pp. 856–864, 2010.
 Jafari et al. (2001) Amir Jafari, Amy Greenwald, David Gondek, and Gunes Ercal. On no-regret learning, fictitious play, and nash equilibrium. In ICML, volume 1, pp. 226–233, 2001.
 Jerfel et al. (2018) Ghassen Jerfel, Erin Grant, Thomas L Griffiths, and Katherine Heller. Online gradient-based mixtures for transfer modulation in meta-learning. arXiv preprint arXiv:1812.06080, 2018.
 Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 2017.
 Krause et al. (2016) Ben Krause, Liang Lu, Iain Murray, and Steve Renals. Multiplicative lstm for sequence modelling. arXiv preprint arXiv:1609.07959, 2016.
 Krause et al. (2017) Ben Krause, Emmanuel Kahembwe, Iain Murray, and Steve Renals. Dynamic evaluation of neural sequence models. CoRR, abs/1709.07432, 2017.
 Lopez-Paz et al. (2017) David Lopez-Paz et al. Gradient episodic memory for continual learning. In Advances in Neural Information Processing Systems, 2017.
 Munkhdalai & Yu (2017) Tsendsuren Munkhdalai and Hong Yu. Meta networks. International Conference on Machine Learning (ICML), 2017.

 Murphy (2002) K. Murphy. Dynamic Bayesian Networks: Representation, Inference and Learning. PhD thesis, Dept. Computer Science, UC Berkeley, 2002. URL https://www.cs.ubc.ca/~murphyk/Thesis/thesis.html.
 Nagabandi et al. (2018) Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018.
 Naik & Mammone (1992) Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.
 Nguyen et al. (2017) Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. arXiv:1710.10628, 2017.
 Rao (2009) A. Rao. A survey of numerical methods for optimal control. In Advances in the Astronautical Sciences, 2009.
 Ravi & Larochelle (2017) Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

 Rebuffi et al. (2017) Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H Lampert. icarl: Incremental classifier and representation learning. In Proc. CVPR, 2017.
 Rei (2015) Marek Rei. Online representation learning in recurrent neural language models. CoRR, abs/1508.03854, 2015.
 Ritter et al. (2018) Samuel Ritter, Jane X Wang, Zeb Kurth-Nelson, Siddhant M Jayakumar, Charles Blundell, Razvan Pascanu, and Matthew Botvinick. Been there, done that: Meta-learning with episodic recall. arXiv preprint arXiv:1805.09692, 2018.
 Rusu et al. (2016) Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. arXiv:1606.04671, 2016.
 Sahoo et al. (2017) Doyen Sahoo, Quang Pham, Jing Lu, and Steven CH Hoi. Online deep learning: Learning deep neural networks on the fly. arXiv preprint arXiv:1711.03705, 2017.
 Santoro et al. (2016) Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.
 Schmidhuber (1987) Jurgen Schmidhuber. Evolutionary principles in self-referential learning. Diploma thesis, Institut f. Informatik, Tech. Univ. Munich, 1987.
 Stimberg et al. (2012) Florian Stimberg, Andreas Ruttor, and Manfred Opper. Bayesian inference for change points in dynamical systems with reusable states - a chinese restaurant process approach. In Artificial Intelligence and Statistics, pp. 1117–1124, 2012.
 Thrun (1998) Sebastian Thrun. Lifelong learning algorithms. In Learning to learn. Springer, 1998.
 Thrun & Pratt (1998) Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.
 Todorov et al. (2012) Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5026–5033. IEEE, 2012.
 Wang et al. (2016) Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv:1611.05763, 2016.
 Wang et al. (2017) Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Growing a brain: Fine-tuning by increasing model capacity. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 Williams et al. (2015) Grady Williams, Andrew Aldrich, and Evangelos Theodorou. Model predictive path integral control using covariance variable importance sampling. CoRR, abs/1509.01149, 2015.
 Zenke et al. (2017) Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, 2017.
Appendix A Test-time Performance vs Training Data
We verify below that as the meta-trained models are trained with more data, their performance on test tasks does improve.
Appendix B Hyperparameters
In all experiments, we use a dynamics model consisting of three hidden layers, each of dimension 500, with ReLU nonlinearities. The control method that we use is random-shooting model predictive control (MPC), where 1000 candidate action sequences, each of horizon length H=10, are sampled at each time step, fed through the predictive model, and ranked by their expected reward. The first action from the highest-scoring candidate action sequence is then executed before the entire planning process repeats at the next time step.
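The model architecture described above can be sketched in NumPy as follows. This is only an illustrative forward pass under stated assumptions: the input is taken to be the concatenated state and action, the output is taken to be a next-state prediction, and the function names, He-style initialization, and example dimensions are hypothetical choices not specified in the text.

```python
import numpy as np

def init_mlp(in_dim, out_dim, hidden=500, n_hidden=3, seed=0):
    """Create weights for an MLP with n_hidden hidden layers of size `hidden`."""
    rng = np.random.default_rng(seed)
    sizes = [in_dim] + [hidden] * n_hidden + [out_dim]
    params = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        # He-style initialization (an assumption; not stated in the text).
        W = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))
        b = np.zeros(fan_out)
        params.append((W, b))
    return params

def mlp_forward(params, x):
    """Forward pass: ReLU on every hidden layer, linear output layer."""
    h = x
    for W, b in params[:-1]:
        h = np.maximum(h @ W + b, 0.0)
    W, b = params[-1]
    return h @ W + b

# Hypothetical dimensions: 5-dimensional state, 2-dimensional action.
state_dim, action_dim = 5, 2
params = init_mlp(state_dim + action_dim, state_dim)
batch = np.zeros((4, state_dim + action_dim))
prediction = mlp_forward(params, batch)
```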
Below, we list relevant training and testing parameters for the various methods used in our experiments. # Tasks/itr corresponds to the number of tasks sampled during each iteration of collecting data to train the model, and # TS/itr is the total number of time steps collected during that iteration (summed over all tasks).
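Training parameters:

| Method | Iters | Epochs | # Tasks/itr | # TS/itr | K | outer LR | inner LR |
|---|---|---|---|---|---|---|---|
| Meta-learned approaches (3) | 12 | 50 | 16 | 2000-3000 | 16 | 0.001 | 0.01 |
| Non-meta-learned approaches (2) | 12 | 50 | 16 | 2000-3000 | 16 | 0.001 | N/A |

Testing parameters:

| Method | CRP parameter | LR (model update) | K (previous data) |
|---|---|---|---|
| MOLe (ours) | 1 | 0.01 | 16 |
| continued adaptation with meta-learning | N/A | 0.01 | 16 |
| k-shot adaptation with meta-learning | N/A | 0.01 | 16 |
| model-based RL | N/A | N/A | N/A |
| model-based RL with online gradient updates | N/A | 0.01 | 16 |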
Appendix C Controller
As mentioned in Section 6, we use the learned dynamics model in conjunction with a controller to select the next action to execute. The controller uses the learned model together with a reward function that encodes the desired task. Many methods could be used to perform this action selection, including the cross-entropy method (CEM) (Botev et al., 2013) or model predictive path integral control (MPPI) (Williams et al., 2015), but in our experiments, we use a random-sampling shooting method (Rao, 2009).
At each time step $t$, we randomly generate $N$ candidate action sequences with $H$ actions in each sequence.
$$A_t^{(k)} = \left(a_t^{(k)}, \dots, a_{t+H-1}^{(k)}\right), \quad k = 1, \dots, N \qquad (10)$$
We then use the learned dynamics model $f_\theta$ to predict the resulting states of executing these candidate action sequences.
$$\hat{s}_{t'+1}^{(k)} = f_\theta\left(\hat{s}_{t'}^{(k)}, a_{t'}^{(k)}\right), \quad \hat{s}_t^{(k)} = s_t \qquad (11)$$
Next, we use the reward function $r$ to select the action sequence with the highest associated predicted reward.
$$k^* = \arg\max_k \sum_{t'=t}^{t+H-1} r\left(\hat{s}_{t'}^{(k)}, a_{t'}^{(k)}\right) \qquad (12)$$
Next, rather than executing the entire sequence of selected optimal actions, we use a model predictive control (MPC) framework to execute only the first action $a_t^{(k^*)}$ from the current state $s_t$. We then replan at the next time step. This use of MPC can compensate for model inaccuracies by preventing errors from accumulating, since we replan at each time step using updated state information.
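The shooting procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names, the uniform sampling of actions within fixed bounds, and the toy dynamics and reward functions are all assumptions introduced for the example.

```python
import numpy as np

def random_shooting_action(state, dynamics_fn, reward_fn, action_dim,
                           n_candidates=1000, horizon=10,
                           action_low=-1.0, action_high=1.0, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences (uniform sampling is an assumption).
    actions = rng.uniform(action_low, action_high,
                          size=(n_candidates, horizon, action_dim))
    # Every candidate rollout starts from the current state.
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for h in range(horizon):
        # Accumulate predicted reward, then step the learned model forward.
        returns += reward_fn(states, actions[:, h])
        states = dynamics_fn(states, actions[:, h])
    best = int(np.argmax(returns))
    # MPC: only the first action is executed before replanning.
    return actions[best, 0]

# Toy stand-ins for the learned model and the task reward (hypothetical).
def toy_dynamics(states, actions):
    return states + actions

def toy_reward(states, actions):
    return -np.sum((states + actions) ** 2, axis=1)

first_action = random_shooting_action(
    np.array([0.5]), toy_dynamics, toy_reward, action_dim=1,
    horizon=1, rng=np.random.default_rng(0))
```

In a control loop, this function would be called once per time step with the newly observed state, which is what lets replanning absorb model errors.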