1 Introduction
Inspired by the remarkable capability of humans to quickly learn and adapt to new tasks, the concept of learning to learn, or meta-learning, has recently become popular within the machine learning community [2, 4, 5]. When optimizing a policy for a reinforcement learning agent or learning a classification task, it appears sensible not to approach each individual task from scratch, but to learn a learning mechanism that is common across a variety of tasks and can be reused. The purpose of this work is to encode these learning strategies into an adaptive high-dimensional loss function, or meta-loss, which generalizes across multiple tasks and can be utilized to optimize models with different architectures. Inspired by inverse reinforcement learning [18], our work combines the learning-to-learn paradigm of meta-learning with the generality of learning loss landscapes. We construct a unified, fully differentiable framework that can shape the loss function to provide a strong learning signal for a range of models, such as classifiers, regressors, or control policies. As the loss function is independent of the model being optimized, it is agnostic to the particular model architecture. Furthermore, by training our loss function to optimize different tasks, we can achieve generalization across multiple problems.

The meta-learning framework presented in this work involves an inner and an outer loop. In the inner loop, a model, or optimizee, is trained with gradient descent using the loss coming from our learned meta-loss function. Fig. 1 shows the pipeline for updating the optimizee with the meta-loss. The outer loop optimizes the meta-loss function by minimizing the task-specific losses of the updated optimizees. After training the meta-loss function, the task-specific losses are no longer required, since the training of optimizees can be performed entirely by using the meta-loss function alone. In this way, our meta-loss can find more efficient ways to optimize the original task loss. Furthermore, since we can choose which information to provide to our meta-loss, we can train it to work in scenarios with sparse information by only providing inputs that we expect to have at test time.

The contributions of this work are as follows: (i) we present a framework for learning adaptive, high-dimensional loss functions through backpropagation that shape the loss landscape such that it can be efficiently optimized with gradient descent; (ii) we show that our learned meta-loss functions are agnostic to the architecture of optimizee models; and (iii) we present a reinforcement learning framework that significantly improves the speed of policy training and enables learning in self-supervised and sparse-reward settings.
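The inner/outer loop described above can be sketched end-to-end on a toy scalar problem. Everything below is an illustrative assumption rather than the paper's implementation: the optimizee is a one-parameter linear model, the meta-loss is a single learnable scale `phi` on a squared error, and the outer gradient is estimated by finite differences instead of backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=32)
theta_true = 2.0
y = theta_true * x

# Task-specific loss: ordinary mean-squared error (outer loop only).
def task_loss(theta):
    return np.mean((theta * x - y) ** 2)

# Inner loop: one gradient step on the learned meta-loss
#   M_phi(theta) = phi * mean((theta*x - y)^2),
# whose gradient w.r.t. theta is phi * mean(2*(theta*x - y)*x).
def inner_update(phi, theta, alpha=0.05):
    grad_theta = phi * np.mean(2.0 * (theta * x - y) * x)
    return theta - alpha * grad_theta

# Outer loop: adjust phi so that a single inner step lowers the task
# loss, estimating d task_loss(theta') / d phi by central differences.
phi, eps, beta = 0.1, 1e-4, 0.5
for _ in range(200):
    theta0 = rng.normal()                   # fresh optimizee each time
    g = (task_loss(inner_update(phi + eps, theta0))
         - task_loss(inner_update(phi - eps, theta0))) / (2 * eps)
    phi -= beta * g

# Meta-test: a new optimizee is trained with the meta-loss alone; the
# task loss is never consulted again.
theta = -1.0
for _ in range(50):
    theta = inner_update(phi, theta)
# theta has converged close to theta_true
```

Despite its simplicity, the sketch exercises the full pipeline: the outer loop shapes the meta-loss so that inner gradient steps make rapid progress, and at meta-test time the task loss is no longer needed.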
2 Related Work
Meta-learning originates in the concept of learning to learn [20, 3, 26]. Recently, there has been a wide interest in finding ways to improve learning speed and generalization to new tasks through meta-learning. The main directions of research in this area can be divided into learning representations that can be easily adapted to new tasks [5], learning unsupervised rules that can be transferred between tasks [16, 10], learning optimizer policies that transform policy updates with respect to known loss or reward functions [2, 13, 14, 4], and learning loss or reward landscapes [23, 9].
Our framework falls into the category of learning loss landscapes; similar to [2], we aim at learning a separate optimization procedure that can be applied to various optimizee models. However, in contrast to [2] and [4], our framework does not require a specific recurrent architecture of the optimizer and can operate without an explicit external loss or reward function at test time. Furthermore, as our learned loss functions are independent of the models to be optimized, they can be easily transferred to other optimizee models, in contrast to [5], where the learned representation cannot be separated from the original model of the optimizee.
The idea of learning loss landscapes or reward functions in the reinforcement learning (RL) setting can be traced back to the field of inverse reinforcement learning (IRL) [18, 1]. However, in contrast to the original goal of IRL, inferring reward functions from expert demonstrations, we aim at extending this idea and learning loss functions that can improve learning speed and generalization for a wider range of applications. Furthermore, we design our framework to be fully differentiable, facilitating the training of both the learned meta-loss and the optimizee models.
A range of recent works demonstrates the advantages of meta-learning for improving exploration strategies in RL settings, especially in the presence of sparse rewards. In [15], an agent is trained to mimic expert demonstrations while only having access to a sparse reward signal at test time. In [8] and [6], a structured latent exploration space is learned from prior experience, which enables fast exploration in novel tasks. The work in [29] proposes a method for automatically learning potential-based reward shaping by learning the Q-function parameters during the meta-training phase, such that at meta-test time the Q-function can adapt quickly to new tasks. In our work, we also demonstrate that we can significantly improve RL sample efficiency by training our meta-loss to optimize an actor policy, even when providing only limited or no reward information to the learned loss function at test time.
Closest to our method are the works on evolved policy gradients [9], teacher networks [28], and meta-critics [23]. In contrast to the evolutionary approach of [9], we design a differentiable framework and describe a way to optimize the loss function with gradient descent in both supervised and reinforcement learning settings. In [28], instead of learning a differentiable loss function directly, a teacher network is trained to predict the parameters of a manually designed loss function, where each new loss-function class requires a new teacher-network design and training. Our method does not require manual design of the loss-function parameterization, as our loss functions are learned entirely from data. Finally, in [23] a meta-critic is learned to provide a value function conditioned on a task, which is used to train an actor policy. Although training a meta-critic in the supervised setting reduces to learning a loss function similar to our work, in the reinforcement learning setting we show that it is possible to use learned loss functions to optimize policies directly with gradient descent.
3 Meta-Learning via Learned Loss
In this work, we aim to learn an adaptive loss function, which we call meta-loss, that is used to train an optimizee, e.g. a classifier, a regressor, or an agent policy. In the following, we describe the general architecture of our framework, which we call Meta-Learning via Learned Loss (ML³).
3.1 ML³ framework
Let $f_\theta$ be an optimizee with parameters $\theta$, and let $M_\phi$ be the meta-loss model with parameters $\phi$. Let $x$ denote the inputs of the optimizee, $f_\theta(x)$ the outputs of the optimizee, and $y$ information about the task, such as a regression target, a classification target, a reward function, etc. Let $p(\mathcal{T})$ be a distribution of tasks and $\mathcal{L}_{\mathcal{T}}$ the task-specific loss of the optimizee for the task $\mathcal{T} \sim p(\mathcal{T})$.
Fig. 2 shows the diagram of our framework architecture for a single step of the optimizee update. The optimizee is connected to the meta-loss network, which allows the gradients from the meta-loss to flow through the optimizee. The meta-loss additionally takes the inputs $x$ of the optimizee and the task-information variable $y$. In our framework, we represent the meta-loss function using a neural network, which is subsequently referred to as a meta-loss network. It is worth noting that it is possible to train the meta-loss to perform self-supervised learning by not including $y$ in the meta-loss network inputs. A single update of the optimizee is performed using gradient descent on the meta-loss by backpropagating the output of the meta-loss network through the optimizee, keeping the parameters of the meta-loss network fixed:

$$\theta_{\text{new}} = \theta - \alpha \nabla_{\theta} M_\phi\big(y, f_\theta(x)\big), \qquad (1)$$
where $\alpha$ is the learning rate, which can be either fixed or learned jointly with the meta-loss network. The objective of learning the meta-loss network is to minimize the task-specific loss over a distribution of tasks and over multiple steps of optimizee training with the meta-loss:

$$\min_{\phi} \; \sum_{i=1}^{M} \sum_{j=1}^{N} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta_{i,j}}\big), \qquad (2)$$

where each $\theta_{i,j}$ is obtained by repeatedly applying the update in Eq. 1.
Here $M$ is the number of tasks and $N$ is the number of steps of updating the optimizee using the meta-loss. The task-specific objective depends on the updated optimizee parameters $\theta_{i,j}$, and hence on the parameters $\phi$ of the meta-loss network, making it possible to connect the meta-loss network to the task-specific loss and propagate the error back through the meta-loss network. Another variant of this objective would be to optimize only for the final performance of the optimizee at the last step of applying the meta-loss: $\min_\phi \sum_{i=1}^{M} \mathcal{L}_{\mathcal{T}_i}\big(f_{\theta_{i,N}}\big)$. However, this requires relying on backpropagation through a chain of all optimizee update steps. As we noticed in our experiments, including the task loss from each step, and avoiding propagating it through the chain of updates by stopping the gradients at each optimizee update step, works better in practice.
In order to facilitate the optimization of the meta-loss network for long optimizee-update horizons, we split the optimization of $\phi$ into several steps with smaller horizons, similar to [2]. Algorithm 1 summarizes the training procedure of the meta-loss network, which we later refer to as meta-train. Algorithm 2 shows the optimizee training with the learned meta-loss at test time, which we call meta-test.
3.2 ML³ for Reinforcement Learning
In this section, we introduce several modifications that allow us to apply the ML³ framework to reinforcement learning problems. Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, p_0, \gamma, T)$ be a finite-horizon Markov Decision Process (MDP), where $\mathcal{S}$ and $\mathcal{A}$ are state and action spaces, $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ is a state-transition probability function or system dynamics, $\mathcal{R}(s_t, a_t)$ a reward function, $p_0(s_0)$ an initial state distribution, $\gamma$ a reward discount factor, and $T$ a horizon. Let $\tau = (s_0, a_0, \ldots, s_T, a_T)$ be a trajectory of states and actions and $R(\tau) = \sum_{t=0}^{T} \gamma^{t} \mathcal{R}(s_t, a_t)$ the trajectory reward. The goal of reinforcement learning is to find parameters $\theta$ of a policy $\pi_\theta(a \mid s)$ that maximize the expected discounted reward over trajectories induced by the policy, $\mathbb{E}_{\pi_\theta}[R(\tau)]$, where $s_0 \sim p_0$, $s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a_t)$, and $a_t \sim \pi_\theta(a_t \mid s_t)$. In what follows, we show how to train a meta-loss network to perform effective policy updates in a reinforcement-learning scenario.

To apply our ML³ framework, we replace the optimizee $f_\theta$ from the previous section with a stochastic policy $\pi_\theta(a \mid s)$. We present two cases for applying ML³ to RL tasks. In the first case, we assume availability of a differentiable system-dynamics model and a reward function. In the second case, we assume a fully model-free scenario with a non-differentiable reward function.
In the case of an available differentiable system-dynamics model and a reward function, the ML³ objective derived in Eq. 2 can be applied directly by setting the task loss to $\mathcal{L}_{\mathcal{T}} = -R(\tau)$ and differentiating all the way through the reward function, the dynamics model, and the policy that was updated using the meta-loss.
In many realistic scenarios, we have to assume unknown system-dynamics models and non-differentiable reward functions. In this case, we can define a surrogate objective, which is independent of the dynamics model, as our task-specific loss [27, 24, 21]:

$$\mathcal{L}_{\mathcal{T}} = -\mathbb{E}_{\pi_\theta}\!\left[ R(\tau) \sum_{t=0}^{T} \log \pi_\theta(a_t \mid s_t) \right]. \qquad (3)$$
Although we are evaluating the task loss on full trajectory rewards, we perform the policy updates from Eq. 1 using stochastic gradient descent (SGD) on the meta-loss with mini-batches of experience $\{(s_t, a_t, r_t)\}$ of batch size $B$, similar to [9]. The inputs of the meta-loss network are the sampled states, sampled actions, rewards, and policy probabilities of the sampled actions: $\big(s, a, r, \pi_\theta(a \mid s)\big)$. We notice that in practice, including the policy's distribution parameters directly in the meta-loss inputs, e.g. the mean $\mu(s)$ of a Gaussian policy, works better than including the probability estimate $\pi_\theta(a \mid s)$, as it provides a direct way to update the distribution parameters using backpropagation through the meta-loss.

As mentioned before, it is possible to provide different information about the task during meta-train and meta-test times. In our work, we show that by providing additional rewards in the task loss during meta-train time, we can encourage the trained meta-loss to learn exploratory behaviors. This additional information shapes the learned loss function such that the environment does not need to provide it at meta-test time. It is also possible to train the meta-loss in a fully self-supervised fashion, where the task-related input is excluded from the meta-loss network input.
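The surrogate task loss of Eq. 3 can be sketched in a few lines of numpy. The setup below is an illustrative assumption, not taken from the paper: a one-dimensional Gaussian policy with a linear mean, toy linear dynamics, and a quadratic tracking reward.

```python
import numpy as np

rng = np.random.default_rng(1)
w, sigma, horizon = 0.5, 0.3, 10   # illustrative policy parameters

def log_prob(a, s):
    # log density of a ~ N(w*s, sigma^2)
    mu = w * s
    return -0.5 * ((a - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def rollout():
    s, states, actions, rewards = 1.0, [], [], []
    for _ in range(horizon):
        a = w * s + sigma * rng.normal()   # sample from the policy
        r = -(s - a) ** 2                  # toy reward: track the state
        states.append(s); actions.append(a); rewards.append(r)
        s = 0.9 * s + 0.1 * a              # toy linear dynamics
    return np.array(states), np.array(actions), np.array(rewards)

# Surrogate task loss over a batch of trajectories:
#   L_T = -E_tau[ R(tau) * sum_t log pi(a_t | s_t) ]
losses = []
for _ in range(8):
    s, a, r = rollout()
    losses.append(-r.sum() * log_prob(a, s).sum())
surrogate = np.mean(losses)
```

The resulting scalar depends only on sampled states, actions, and rewards, so it can be evaluated without a dynamics model, which is exactly what makes it usable as the task loss in the model-free setting.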
4 Experiments
In this section, we evaluate the applicability and the benefits of the learned meta-loss under a variety of aspects. The questions we seek to answer are as follows. (1) Can we learn a loss model that improves upon the original task-specific loss functions, i.e. can we shape the loss landscape to achieve better optimization performance at test time? Using a simple regression task as an example, we demonstrate that our framework can generate convex loss landscapes suitable for fast optimization. (2) Can we improve the learning speed when using our ML³ loss function as a learning signal in complex, high-dimensional tasks? We concentrate on reinforcement learning tasks as one of the most challenging benchmarks for learning performance. (3) Can we learn a loss function that leverages additional information at meta-train time and can operate in sparse-reward or self-supervised settings at meta-test time? (4) Can we learn a loss function that generalizes over different optimizee model architectures?
Throughout all of our experiments, the meta-loss network is parameterized by a feed-forward neural network with two hidden layers of 40 neurons each, followed by a nonlinear activation function. The learning rate for the optimizee network was learned together with the loss.

4.1 Learned Loss Landscape
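As a sketch of the stated architecture, the following builds a feed-forward network with two hidden layers of 40 units. The tanh nonlinearity, the softplus output that keeps the scalar loss positive, and the input dimension are our assumptions, not details given in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(in_dim, hidden=40):
    # Two hidden layers of 40 units, scalar output.
    sizes = [in_dim, hidden, hidden, 1]
    return [(rng.normal(scale=1 / np.sqrt(m), size=(m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def meta_loss_net(params, z):
    # z: concatenated meta-loss inputs, e.g. [state, action, reward, mean]
    h = z
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)
    W, b = params[-1]
    out = h @ W + b
    # Softplus keeps the loss positive; average over the minibatch.
    return np.log1p(np.exp(out)).mean()

params = init_mlp(in_dim=6)
batch = rng.normal(size=(32, 6))      # a minibatch of meta-loss inputs
loss_val = meta_loss_net(params, batch)
```

In the full framework this scalar would be differentiated with respect to the optimizee parameters (Eq. 1) rather than evaluated on random inputs as done here.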
For visualization and illustration purposes, this set of experiments shows that our meta-learner is able to learn convex loss functions for tasks with inherently non-convex or difficult-to-optimize loss landscapes. Effectively, the meta-loss eliminates local minima for gradient-based optimization and creates well-conditioned loss landscapes. We illustrate this on a sine-frequency regression example, where we fit a single parameter for the purpose of visualization simplicity.
Fig. 3 shows loss landscapes for fitting the frequency parameter $\nu$ of the sine function $y = \sin(\nu x)$. Below, we show the landscape of optimization with a mean-squared loss on the outputs of the sine function, using 1000 samples from the target function. The target frequency $\nu^*$ is indicated by a vertical red line, and the mean-squared loss is computed as $\frac{1}{1000}\sum_{i}\big(\sin(\nu x_i) - \sin(\nu^* x_i)\big)^2$. As noted in [19], the landscape of this loss is highly non-convex and difficult to optimize with conventional gradient descent. In our work, we can circumvent this problem by introducing additional information about the ground-truth value of the frequency at meta-train time, while only using samples from the sine function as inputs to the meta-loss network. That is, during meta-train time, our task-specific loss is the squared distance to the ground-truth frequency: $\mathcal{L}_{\mathcal{T}} = (\nu - \nu^*)^2$. The inputs of the meta-loss network are the target values of the sine function, $\sin(\nu^* x_i)$, similar to the information available in the mean-squared loss. Effectively, at meta-test time we can use the same samples as in the mean-squared loss, yet achieve the convex loss landscapes depicted in Fig. 3 at the top.
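The contrast between the two landscapes described above can be checked numerically; the target frequency, sampling range, and grid below are illustrative choices, not values from the paper.

```python
import numpy as np

# Fit the frequency of y = sin(nu * x). The output-space MSE is highly
# non-convex in nu, while the meta-train task loss (nu - nu*)^2 is convex.
nu_true = 3.0
x = np.linspace(-np.pi, np.pi, 1000)
y = np.sin(nu_true * x)

def mse_landscape(nu):
    return np.mean((np.sin(nu * x) - y) ** 2)

def task_loss(nu):
    return (nu - nu_true) ** 2

grid = np.linspace(0.0, 6.0, 601)
mse = np.array([mse_landscape(nu) for nu in grid])

# Count strict local minima of the sampled output-space MSE: the ripples
# of the cross term produce several, which trap gradient descent ...
local_min = [i for i in range(1, len(grid) - 1)
             if mse[i] < mse[i - 1] and mse[i] < mse[i + 1]]
# ... whereas the frequency-space task loss has a single minimum at nu*.
```

Running this confirms that the sampled MSE curve has multiple local minima over the grid, while `task_loss` vanishes only at the true frequency.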
4.2 Reinforcement Learning
For the remainder of the experimental section, we focus on reinforcement learning tasks. Reinforcement learning remains one of the most challenging problems in terms of learning performance and learning speed. In this section, we present our experiments on a variety of policy optimization problems. We use ML³ for model-based and model-free reinforcement learning, demonstrating the applicability of our approach in both settings. In the former, as mentioned in Section 3.2, we assume access to a differentiable reward function and dynamics model that could be available either a priori or learned from samples with differentiable function approximators, such as neural networks. This scenario formulates the task loss as a function of differentiable trajectories, enabling direct gradient-based optimization of the policy, similar to trajectory optimization methods such as the iterative Linear-Quadratic Regulator (iLQR) [25].

In the model-free setting, we treat the dynamics of the system as a black box. In this case, direct differentiation of the task loss is not possible, and we formulate the learning signal for the meta-loss network as a surrogate policy-gradient objective; see Section 3.2 for a detailed description. The policy is represented by a feed-forward neural network in all experiments.
4.2.1 Sample efficiency
We now present our results for continuous-control reinforcement learning tasks, comparing the task performance of a policy trained with our meta-loss to that of a policy optimized with an appropriate comparison method. When a model is available, we compare against a gradient-based optimizer, in this case iLQR [25]. iLQR has widespread application in robotics [12, 11] and is therefore a suitable comparison method for approaches that require knowledge of a model. In the model-free setting, we use a popular policy-gradient method, Proximal Policy Optimization (PPO) [22], for comparison. We first evaluate our method on simple, classical continuous-control problems where the dynamics are known, and then continue with higher-dimensional problems where we do not have full knowledge of the model.
In Fig. 3(a), we compare a policy optimized with the learning signal coming from the meta-loss network to trajectories optimized with iLQR. The task is a free-movement task of a point mass in a 2D space with known dynamics parameters; we call this environment PointmassGoal. The state space is four-dimensional, $(x, y, \dot{x}, \dot{y})$, where $(x, y)$ are the 2D positions and $(\dot{x}, \dot{y})$ the velocities, and the actions $(\ddot{x}, \ddot{y})$ are accelerations. The task distribution consists of different target positions that the point mass should reach. The task-specific loss at training time is defined by the distance from the target at the last time step of the rollout. In Fig. 3(a), we average the learning performance over ten random goals. We observe that the policies optimized with the learned meta-loss converge faster and get closer to the targets than the trajectories optimized with iLQR. We would like to point out that, on top of the improvement in convergence rates, and in contrast to iLQR, our trained meta-loss requires neither a differentiable dynamics model nor a differentiable reward function at meta-test time, as it updates the policy directly through gradient descent.
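A point mass with acceleration actions is a double integrator, which can be stepped with a few lines of numpy; the timestep `dt` below is an illustrative assumption.

```python
import numpy as np

dt = 0.05  # assumed integration timestep

def step(state, action):
    # state = (x, y, xdot, ydot), action = (xddot, yddot):
    # Euler integration of double-integrator dynamics.
    x, y, xd, yd = state
    ax, ay = action
    return np.array([x + dt * xd, y + dt * yd, xd + dt * ax, yd + dt * ay])

s = np.zeros(4)
for _ in range(100):
    s = step(s, np.array([1.0, 0.0]))  # constant acceleration along x
```

Because the dynamics are linear and differentiable, this is exactly the kind of environment where both iLQR and backpropagation through trajectories apply, making it a clean testbed for the comparison above.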
In Fig. 3(b), we provide a similar comparison on a task that requires swinging up and balancing an inverted pendulum. In this task, the state space is three-dimensional, $(\cos\theta, \sin\theta, \dot{\theta})$, where $\theta$ is the angle of the pendulum. The action is a one-dimensional torque. The task distribution consists of the different initial angle configurations the pendulum starts in. The plot shows the result averaged over ten different initial configurations of the pendulum. From the figure we can see that the policy optimized with ML³ is able to swing up and balance, whereas the iLQR trajectory struggles to keep the pendulum upright after the swing-up and oscillates around the vertical configuration.
In the following, we continue with the model-free evaluation. In Fig. 5, we show the performance of our framework on two continuous-control tasks based on OpenAI Gym MuJoCo environments [7]: ReacherGoal and AntGoal. The ReacherGoal environment is a two-link 2D manipulator that has to reach a specified goal location with its end-effector. The task distribution consists of initial random link configurations and random goal locations. The performance metric for this environment is the mean trajectory sum of negative distances to the goal, averaged over 10 tasks.

The AntGoal environment requires a four-legged agent to run to a goal location. The task distribution consists of random goals initialized on a circle around the initial position. The performance metric for this environment is the mean trajectory sum of differences between the initial and the current distances to the goal, averaged over 10 tasks.
Fig. 4(a) and Fig. 4(b) show the meta-test time performance for the ReacherGoal and the AntGoal environments, respectively. We can see that the ML³ loss significantly improves optimization speed in both scenarios compared to PPO. In our experiments, we observed that on average ML³ requires five times fewer samples to reach 80% of task performance in terms of our metrics for the model-free tasks.
4.2.2 Sparse Rewards and Self-Supervised Learning
By providing additional reward information during meta-train time, as pointed out in Section 3.2, it is possible to shape the learned loss signal such that it improves optimization during policy training. By having access to additional information during meta-training, the meta-loss network can learn a loss function that provides exploratory strategies to the agent or allows the agent to learn in a self-supervised setting.
In Fig. 6, we show results from the MountainCar environment [17], a classical control problem where an underactuated car has to drive up a steep hill. The propulsion force generated by the car does not allow steady climbing of the hill, so to solve the task the car has to accumulate energy by repeatedly driving back and forth. In this environment, greedy minimization of the distance to the goal often results in a failure to solve the task. The state space is two-dimensional, consisting of the position and velocity of the car; the action space consists of a one-dimensional torque. In our experiments, we provide intermediate goal positions during meta-train time, which are not available at meta-test time. The meta-loss network incorporates this behavior into its loss function, leading to improved exploration at meta-test time, as can be seen in Fig. 5(a). Fig. 5(b) shows the average distance between the car and the goal at the last rollout time step over several iterations of policy updates with ML³ and iLQR. As we observe, ML³ can successfully bring the car to the goal in a small number of updates, whereas iLQR is not able to solve this task.
The meta-loss network can also be trained in a fully self-supervised fashion, by removing the task-related input (i.e. rewards) from the meta-loss input. We successfully apply this setting in our experiments with the continuous-control MuJoCo environments ReacherGoal and AntGoal (see Fig. 5). In both cases, during meta-train time, the meta-loss network is still optimized using the rewards provided by the environments. However, at meta-test time, no external reward signal is provided, and the meta-loss computes the loss signal for the policy based solely on its environment-state input.
4.2.3 Generalization across different model architectures
One key advantage of learning the loss function is its reusability across different policy architectures, which is impossible for frameworks that aim to meta-train the policy directly [5, 4]. To test the capability of the meta-loss to generalize across different architectures, we first meta-train our meta-loss on an architecture with two layers and then meta-test the same meta-loss on architectures with a varied number of layers. Fig. 6(a) and Fig. 6(b) show the meta-test time comparison for the ReacherGoal and the AntGoal environments in a model-free setting for four different model architectures. Each curve shows the average and the standard deviation over ten different tasks in each environment. Our comparison clearly indicates that the meta-loss can be effectively reused across multiple architectures, with a mild variation in performance compared to the overall variance of the corresponding task optimization.
5 Conclusions
In this work, we presented a framework to meta-learn a loss function entirely from data. We showed how the meta-learned loss can become well-conditioned and suitable for efficient optimization with gradient descent. We observed significant speed improvements in benchmark reinforcement learning tasks on a variety of environments. Furthermore, we showed that by introducing additional guiding rewards during training time, we can train our meta-loss to develop exploratory strategies that significantly improve performance at meta-test time, even in sparse-reward and self-supervised settings. Finally, we presented experiments demonstrating that the learned meta-loss transfers well to unseen model architectures and can therefore be applied to new policy classes.
We believe that the ML³ framework is a powerful tool for incorporating prior experience and transferring learning strategies to new tasks. In future work, we plan to look at combining multiple learned meta-loss functions in order to generalize over different families of tasks. We also plan to further develop the idea of introducing additional curiosity rewards during training time to improve the exploration strategies learned by the meta-loss.
References
 [1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 [2] Marcin Andrychowicz, Misha Denil, Sergio Gomez Colmenarejo, Matthew W. Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In NeurIPS, pages 3981–3989, 2016.
 [3] Yoshua Bengio and Samy Bengio. Learning a synaptic learning rule. Technical Report 751, Département d’Informatique et de Recherche Opérationelle, Université de Montréal, Montreal, Canada, 1990.
 [4] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. CoRR, abs/1611.02779, 2016.
 [5] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [6] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems, pages 5302–5311, 2018.
 [7] OpenAI Gym, 2019.
 [8] Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.
 [9] Rein Houthooft, Yuhua Chen, Phillip Isola, Bradly C. Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. In NeurIPS, pages 5405–5414, 2018.
 [10] Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. CoRR, abs/1810.02334, 2018.
 [11] Jonas Koenemann, Andrea Del Prete, Yuval Tassa, Emanuel Todorov, Olivier Stasse, Maren Bennewitz, and Nicolas Mansard. Whole-body model-predictive control applied to the HRP-2 humanoid. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 3346–3351. IEEE, 2015.
 [12] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages 1–9, 2013.
 [13] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.
 [14] Franziska Meier, Daniel Kappler, and Stefan Schaal. Online learning of a memory for learning rates. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 2425–2432. IEEE, 2018.
 [15] Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Guided metapolicy search. arXiv preprint arXiv:1904.00956, 2019.

 [16] Luke Metz, Niru Maheswaranathan, Brian Cheung, and Jascha Sohl-Dickstein. Learning unsupervised learning rules. In International Conference on Learning Representations, 2019.
 [17] Andrew Moore. Efficient memory-based learning for robot control. PhD thesis, University of Cambridge, 1990.
 [18] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, pages 663–670, 2000.
 [19] Giambattista Parascandolo, Heikki Huttunen, Tao Xiang, Timothy Hospedales, and Tuomas Virtanen. Taming the waves: sine as activation function in deep neural networks. Submitted to ICLR, 2017.
 [20] Juergen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: The meta-meta-… hook. Institut für Informatik, Technische Universität München, 1987.
 [21] John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. In NeurIPS, pages 3528–3536, 2015.
 [22] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
 [23] Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.
 [24] Richard Sutton, David McAllester, Satinder Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NeurIPS, 2000.
 [25] Yuval Tassa, Nicolas Mansard, and Emo Todorov. Control-limited differential dynamic programming. In IEEE International Conference on Robotics and Automation (ICRA), 2014.
 [26] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 2012.
 [27] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
 [28] Lijun Wu, Fei Tian, Yingce Xia, Yang Fan, Tao Qin, Jian-Huang Lai, and Tie-Yan Liu. Learning to teach with dynamic loss functions. In NeurIPS, pages 6467–6478, 2018.
 [29] Haosheng Zou, Tongzheng Ren, Dong Yan, Hang Su, and Jun Zhu. Reward shaping via metalearning. arXiv preprint arXiv:1901.09330, 2019.