I Introduction
Reinforcement learning (RL) is a framework for training an agent to acquire a behavior by reinforcing actions that maximize a notion of task-relevant future rewards. A reward function, i.e., the function that assigns a reward value to every action decision made by the agent, is designed to guide the training toward the desired behavior. For specific applications, there are learning frameworks that train with sparse rewards [1, 2]. However, in many real-world problems, a behavior is best realized by finding a trade-off among a number of conflicting reward functions. This is known as multi-objective reinforcement learning (MORL), a fundamental problem in the design of many autonomous systems. As an example, real systems consume energy and, in most cases, reducing energy consumption as a desirable behavior lowers the performance of the system. Other examples are safety and stability, which in many cases also conflict with the main objectives of the system.
In many RL problems, the trade-off between different objectives is found by constructing a synthetic reward function, e.g., a weighted linear sum of the objectives, using expert knowledge [3, 4]. This approach may not be suitable for problems that (1) aim at realizing a multifaceted behavior consisting of several phases in sequence, which prevents the use of a single set of predetermined preferences over the objectives, (2) require the system to adapt itself to operate in different modes, e.g., energy-saving versus high-performance mode, or (3) make it hard to find a suitable set of preferences over the objectives manually, a problem known as preference elicitation (PE) [5].
However, most current MORL solutions are limited to simple toy examples with discrete state-action spaces [5, 6, 7]. To the best of our knowledge, there is no MORL approach that trains deep policies with different objectives for complex continuous control tasks.
In this work, we introduce a novel approach to solve deep MORL problems based on model-agnostic meta-learning [8]. We train a meta-policy with tasks sampled from a distribution over MDPs with different preferences over the objectives. We demonstrate that our proposed approach outperforms direct policy search methods, e.g., [9], at obtaining Pareto optimal policies for different continuous RL problems. Given a reward vector, a policy is Pareto optimal w.r.t. the expected return measure if no other policy achieves a higher expected return in every component of the vector; i.e., improving one component over a Pareto optimal policy requires sacrificing another.
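To make the dominance relation concrete, the following illustrative helper (the function name and array layout are ours, not from the paper) filters a set of vector-valued expected returns down to its non-dominated subset:

```python
import numpy as np

def non_dominated(returns):
    """Boolean mask of non-dominated (Pareto optimal) points.

    `returns` is an (n_policies, n_objectives) array of expected returns,
    where larger is better for every objective. Policy i is dominated if
    some other policy j is at least as good in every objective and
    strictly better in at least one.
    """
    returns = np.asarray(returns, dtype=float)
    n = len(returns)
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(returns[j] >= returns[i]) \
                    and np.any(returns[j] > returns[i]):
                mask[i] = False  # policy i is dominated by policy j
                break
    return mask
```

For instance, with returns `[[1, 2], [2, 1], [0, 0], [2, 2]]`, only the last policy is non-dominated, since `[2, 2]` dominates the other three.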
The rest of this paper is organized as follows: we outline related work in Section II and introduce the required background in Sections III and IV. Section V provides the details of our method. In Section VI, we explain our experimental results. Finally, in Section VII, we conclude the paper and outline future work.
II Related Work
MORL approaches mainly try to find Pareto optimal solutions, defined as non-dominated solutions representing trade-offs among multiple conflicting objectives [5, 6, 7]. MORL methods can be divided into two main categories: single-policy and multi-policy approaches [5, 10], with the main difference being the number of policies that are learned in each run. Single-policy algorithms obtain solutions with different trade-offs between the objectives by running the RL method several times with different parameters. In contrast, multi-policy algorithms update several policies simultaneously in each run to improve the current Pareto optimal solutions.
In case a scalar value can be computed from the multiple objectives using domain-specific expert knowledge, the problem can be converted to a single-objective RL problem, which can be solved with standard RL methods. As examples, Schulman et al. [3] and Duan et al. [4] applied policy search methods to a number of continuous control benchmarks by hand-designing a synthetic reward function from multiple objectives for different locomotion tasks. However, setting the parameters a priori requires fine-tuning, and it is non-intuitive to find the right trade-off [5, 11]. Lizotte et al. [12] proposed a multi-objective Q-learning solution by linearly combining the objectives related to symptoms and side-effects of different treatment plans. They identified dominated actions, as well as preferred actions in different areas of the objective space, by using a tabular Q-function with an extra entry for the objectives. However, it is well-known that multi-objective Q-learning with the weighted-sum approach cannot approximate concave parts of the true Pareto front [13]. Van Moffaert et al. [14] introduced a similar multi-objective Q-learning method based on the Chebyshev scalarization function as a replacement for the weighted sum to approximate the Pareto front. Still, the method updates the Q-function based on action decisions made according to a preference over the objectives encoded by the weights of the Chebyshev function, resulting in suboptimal Pareto fronts. In a later study [15], they improved the method by exploiting a hypervolume-based action-selection mechanism. The hypervolume is a measure used for performance assessment of Pareto optimal solutions, rather than a single preference given by a set of weights. However, estimating the hypervolume is a computationally expensive NP-hard problem [15].

Our approach trains stochastic policies with a weighted-sum scalarization method by sampling different random weights during the meta-learning phase, alleviating the aforementioned suboptimality of the Pareto front while still performing tractable computations. The work by Natarajan and Tadepalli [16] resembles ours, in that it also tries to obtain a number of policies from which a non-dominated Pareto front can be established for preferences over different objectives encoded by a weight vector. In their work, a bag of policies is constructed recursively to approximate the front, as well as to initialize new policies, such that policy training can be resumed from the closest policy in the bag. We instead propose to initialize new policies, in a so-called adaptation phase, from the updated meta-policy, which is trained with many different weight values. The meta-policy is proven to adapt to new conditions faster [8], and can be realized by both stochastic and deterministic policies.
Recently, policy search approaches have been studied in the context of MORL. Parisi et al. [9] proposed to estimate a number of Pareto optimal policies by performing gradient ascent in the policy parameter space, solving different single-objective RL problems obtained from different convex combinations of the objectives. In contrast, our method does not initialize each policy randomly: it obtains a suitable initial policy parameter in the meta-learning phase, resulting in a more efficient way of estimating the Pareto optimal policies. Furthermore, as we demonstrate in the experiment section, our method outperforms [9] for complicated tasks requiring a hierarchy of skills to be acquired in sequence.
Pirotta et al. [17] introduced a manifold-based policy search MORL solution which assumes policies to be sampled from a manifold, i.e., a parametric distribution over the policy parameter space. They proposed to update the manifold according to an indicator function such that the sampled policies construct an improved Pareto front. In more recent work, Parisi et al. [18] extended the method to work with the hypervolume indicator function as a better-suited measure for evaluating the optimality of different Pareto fronts. However, the number of parameters of the manifold grows quadratically with the number of policy parameters. As an example, a policy with 30 parameters requires 7448 parameters to model the manifold with Gaussian mixture models [18]. As a result, these approaches may not be applicable to training deep policies with several thousand parameters.
In this paper, we propose to frame policy search MORL for continuous control with large numbers of policy parameters, e.g., deep policies, as a meta-learning problem. In contrast to earlier policy search approaches, we train a meta-policy on different tasks sampled from a distribution over preferences on the objectives, such that, by fine-tuning the meta-policy with a few gradient updates, the Pareto front can be constructed more efficiently.
III Preliminaries
A multi-objective sequential decision making process can be represented by a Markov decision process (MDP) defined by the tuple $(\mathcal{S}, \mathcal{A}, p, \mathbf{r}, p_0, \gamma)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $p(s_{t+1}|s_t, a_t)$ is the action-conditioned state transition probability, $\mathbf{r} = (r_1, \dots, r_n)$ represents the reward functions corresponding to the $n$ different objectives, $p_0(s_0)$ is the distribution of initial states, and $\gamma \in [0, 1)$ is a discount factor. We consider episodic tasks in which an episode starts by sampling an initial state from $p_0$, and then, for every timestep $t$, sampling actions from a parametric stochastic policy $\pi_\theta(a_t|s_t)$. The successor state in each timestep is given according to $p(s_{t+1}|s_t, a_t)$, and a reward vector $\mathbf{r}_t$ is provided at every timestep by the environment. For every state-action pair in the episode, the return is a vector defined as the discounted sum of future rewards, $\mathbf{R}_t = \sum_{k=0}^{\infty} \gamma^k \mathbf{r}_{t+k}$.

IV Multi-objective Policy Search
Reinforcement learning is based on performing exploratory actions and reinforcing the actions that result in return outcomes exceeding the expectation of the agent. The expectation of the agent is modeled by a state value function, represented by a neural network in our work, which is continually updated to model the expected return for a given state $s_t$ while following the policy $\pi_\theta$,

$$\mathbf{V}^\pi(s_t) = \mathbb{E}_{\pi_\theta}\left[\mathbf{R}_t \mid s_t\right].$$

The agent performs exploratory actions and compares the actual return outcome for every state-action pair with the expected value to form the advantage function,

$$\mathbf{A}(s_t, a_t) = \mathbf{R}_t - \mathbf{V}^\pi(s_t).$$
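The vector return and advantage computations above can be sketched in numpy as follows. This is a plain Monte-Carlo sketch with names of our choosing; the paper may use a different advantage estimator:

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Vector-valued returns R_t = sum_k gamma^k r_{t+k} for one episode.

    `rewards` is a (T, n_objectives) array with one reward vector per
    timestep; returns are accumulated backwards through the episode.
    """
    rewards = np.asarray(rewards, dtype=float)
    returns = np.zeros_like(rewards)
    running = np.zeros(rewards.shape[1])
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """Advantage vectors A(s_t, a_t) = R_t - V(s_t), where `values` holds
    the (T, n_objectives) value-function predictions for visited states."""
    return discounted_returns(rewards, gamma) - np.asarray(values, dtype=float)
```

With zero value predictions the advantages simply equal the returns, which makes the relation between the two quantities easy to check.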
Here, the advantage $\mathbf{A}$ is a vector which can be converted to a scalar value by a parametric scalarization function $f_{\boldsymbol{\omega}}$, e.g., the weighted sum, $f_{\boldsymbol{\omega}}(\mathbf{A}) = \boldsymbol{\omega}^\top \mathbf{A}$. The policy parameters $\theta$ are updated such that state-action pairs with $f_{\boldsymbol{\omega}}(\mathbf{A}(s_t, a_t)) > 0$ are reinforced, i.e., become more probable in the future. This can be achieved by minimizing a loss function, in our case the clipped version [3] of the TRPO loss [19],

$$\mathcal{L}(\theta) = -\mathbb{E}_{\tau}\left[\min\left(\rho_t(\theta)\, f_{\boldsymbol{\omega}}(\mathbf{A}_t),\; \mathrm{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) f_{\boldsymbol{\omega}}(\mathbf{A}_t)\right)\right], \qquad \rho_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)},$$

where $\tau$ represents the state-action trajectories and $\epsilon$ is a scalar parameter found empirically. The parameters are updated such that the new policy assigns higher probability to state-action pairs resulting in positive scalarized advantages, while the clipping ensures that the policy distribution does not drastically change (e.g., in terms of Kullback-Leibler (KL) divergence) from the old policy $\pi_{\theta_{\text{old}}}$.
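As a concrete sketch, the scalarized clipped surrogate can be written as follows. This is a minimal numpy version with names of our choosing; a real implementation would compute the probability ratios from the policy network and backpropagate through them:

```python
import numpy as np

def clipped_loss(ratio, adv, weights, eps=0.2):
    """PPO-style clipped surrogate loss on weighted-sum-scalarized advantages.

    `ratio` holds pi_theta(a|s) / pi_theta_old(a|s) per sampled
    state-action pair, `adv` is a (T, n_objectives) advantage array, and
    `weights` is the preference vector omega. `eps` is the clip parameter.
    """
    ratio = np.asarray(ratio, dtype=float)
    # weighted-sum scalarization: f_omega(A) = omega^T A, per timestep
    scalar_adv = np.asarray(adv, dtype=float) @ np.asarray(weights, dtype=float)
    unclipped = ratio * scalar_adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * scalar_adv
    # pessimistic (min) bound, negated because we minimize the loss
    return -np.mean(np.minimum(unclipped, clipped))
```

When the ratio is 1 (new policy equals old policy), the loss is just the negated mean scalarized advantage; ratios outside $[1-\epsilon, 1+\epsilon]$ are clipped so that large policy changes gain no extra credit.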
V Multi-objective Meta-RL (Meta-MORL)
Meta-RL algorithms train a meta-policy on multiple tasks sampled from a given task distribution. As a result, the agent can learn a novel task, sampled from the same task distribution, more efficiently with a limited amount of training data [8]. In this section, we describe how this paradigm can be applied to the MORL domain to efficiently find a number of Pareto optimal policies. The scalarization function $f_{\boldsymbol{\omega}}$, introduced in the previous section, converts a MORL problem into different single-objective RL problems for different values of $\boldsymbol{\omega}$. In fact, a new task is defined by setting the weight parameter $\boldsymbol{\omega}$ to arbitrary non-negative values. Therefore, a distribution over tasks can be obtained by assigning a distribution over the weight parameter $\boldsymbol{\omega}$. In this case, meta-learning can be applied to find a meta-policy trained on all the tasks sampled from the task distribution. The meta-policy, once trained, does not contribute to the construction of the Pareto front by itself; instead, it is used as an optimal initial value of the policy parameters to efficiently train multiple policies with different objectives in a few learning iterations. These policies then construct the Pareto optimal solutions.
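Since a task is fully specified by its weight vector, the task distribution reduces to a distribution over $\boldsymbol{\omega}$. A minimal sketch, assuming a uniform Dirichlet draw over the probability simplex (the paper does not fix a particular choice of distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_weights(n_objectives, rng=rng):
    """Sample a non-negative preference vector over the objectives.

    Each draw defines one single-objective RL task via the weighted-sum
    scalarization f_omega. A Dirichlet draw with all concentration
    parameters equal to 1 is uniform over the simplex, so the weights are
    non-negative and sum to one.
    """
    return rng.dirichlet(np.ones(n_objectives))
```

Any distribution over non-negative weights would define a valid task distribution; the simplex constraint merely fixes the overall reward scale across tasks.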
We follow the model-agnostic meta-learning approach introduced in [8]. In this formulation, the learning consists of three phases: (1) an adaptation phase, in which a number of policies are updated for a few iterations starting from the meta-policy, (2) a meta-learning phase, in which the meta-policy is updated by aggregating data generated by the policies trained in the previous phase, and (3) a fine-tuning phase, in which, once the meta-policy has converged, the Pareto optimal policies are trained from an initialization with the meta-policy parameters. The following sections introduce the different phases of the training in more detail.
V-A Adaptation Phase
In the adaptation phase, a number of tasks $\mathcal{T}_i$, sampled from $p(\mathcal{T})$, are defined. For each task, a policy is initialized with the parameters of the meta-policy. The policies are updated for one iteration using state-action trajectories generated by running the meta-policy, with return values computed according to the assigned task (specified by $\boldsymbol{\omega}_i$). In short, each policy is trained as

$$\theta_i = \theta - \alpha \nabla_\theta \mathcal{L}_{\mathcal{T}_i}(\pi_\theta), \qquad (1)$$

where $\theta$ is the meta-policy parameter and $\alpha$ is the adaptation step size.
V-B Meta-learning Phase
The meta-policy is updated in the meta-learning phase by aggregating the information of the policies trained in the previous step,

$$\theta \leftarrow \theta - \beta \nabla_\theta \sum_i \mathcal{L}_{\mathcal{T}_i}(\pi_{\theta_i}), \qquad (2)$$

where $\beta$ is the meta step size.
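The two phases can be sketched with a first-order approximation (as in first-order MAML; full MAML additionally differentiates through the inner gradient step). The quadratic toy losses below are ours, purely for illustration:

```python
def adapt(theta, grad_fn, alpha=0.1):
    # Adaptation phase (one gradient step from the meta-parameters
    # on a single task's loss), cf. Eq. (1).
    return theta - alpha * grad_fn(theta)

def meta_update(theta, task_grad_fns, alpha=0.1, beta=0.5):
    # Meta-learning phase, first-order approximation of Eq. (2): each task
    # contributes the gradient of its loss evaluated at its adapted
    # parameters (full MAML would also differentiate through `adapt`).
    meta_grad = sum(g(adapt(theta, g, alpha)) for g in task_grad_fns)
    return theta - beta * meta_grad

# Toy tasks with losses L_i(theta) = 0.5 * (theta - c_i)^2, whose gradient
# is (theta - c_i); c_i plays the role of a task preference.
tasks = [lambda th: th - 1.0, lambda th: th + 1.0]  # optima at +1 and -1
theta = 0.3
for _ in range(50):
    theta = meta_update(theta, tasks)
```

On these two symmetric tasks the meta-parameters settle midway between the task optima, from which a single adaptation step moves toward either optimum, which is exactly the role the meta-policy plays for the Pareto front policies.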
V-C Fine-tuning Phase
Finally, once the meta-policy is trained, a set of Pareto optimal policies is found by fine-tuning the meta-policy for a number of iterations on different tasks randomly sampled from the task distribution $p(\mathcal{T})$.
The details of our method are provided in Algorithm 1. The meta-policy is initialized randomly and updated in the training loop. In each iteration, trajectories of states, actions, and rewards are sampled by running the meta-policy. For every task, a weight vector is sampled and a new policy is trained in the adaptation phase. The trained policies are used to sample new trajectories, based on which the meta-policy is updated in the meta-learning phase. Finally, a set of Pareto front policies is obtained by initializing the policies with the meta-policy and training them for a number of iterations.
Our method resembles the multi-policy approaches, in that it optimizes a meta-policy to estimate the Pareto front implicitly, but it also resembles the single-policy approaches, since the final Pareto front policies are trained individually for each objective during the fine-tuning phase.
VI Experiments
In the experiments, we aim to answer the following questions: (1) Does meta-MORL successfully train a set of policies with a large number of parameters that estimate the Pareto front for continuous control problems? (2) Compared to direct training using a set of fixed weights over the objectives, does meta-MORL perform better considering the training time and the optimality of the resulting Pareto front? To answer these questions, we prepared five continuous control tasks with several conflicting objectives in different simulated environments (Fig. 1). We compared the results of our approach against the Radial Algorithm (RA) [9] regarding the training time and the quality of the estimated Pareto front. For all of the tasks, we used the Proximal Policy Optimization (PPO) [3] algorithm as the policy learning method.
VI-A Environments
The simulated environments are provided by Roboschool [20] and OpenAI Gym [21]. The original environments return a single reward, which is the summation of rewards with respect to several objectives. Instead, we decompose the objectives to construct a reward vector. For each task, there is a minimum requirement that a policy needs to achieve in order to be considered valid (e.g., neither causing crashes nor getting joints stuck during execution). A detailed description of each task follows.
VI-A.1 Reacher
We start with a simple control task using a simulated arm with a small number of degrees of freedom (DoFs). The goal is to move the tip of the arm as close as possible to a random target position. The environment returns a reward vector whose components are related to the distance from the arm tip to the target position, an energy consumption cost, and a fixed joint-stuck term penalizing stuck motors. Among the rewards, the distance to the target and the energy consumption cost are conflicting. A policy is considered valid if its average discounted return of the joint-stuck penalty is below a given threshold.
VI-A.2 LunarLanderContinuous
The goal of this task is to control the two engines of a rocket to land safely on the ground. The reward vector includes (1) a shaping reward given when the rocket is upright and moving toward the center of the landing area, (2) an energy consumption cost for the main engine, (3) an energy consumption cost for the side engine, and (4) a landing reward which can be positive or negative depending on the success or failure of the landing. There is a weak conflict between the shaping reward and the energy consumption costs for the engines. A policy is considered valid, i.e., not crashing, if its average discounted return of the landing reward exceeds a given threshold.
VI-A.3 HalfCheetah, Ant, and Humanoid
To study the suitability of the meta-MORL method for more complex RL problems, we prepared a set of high-dimensional locomotion tasks using agents with different DoFs. The goal of the tasks is to control the agents to move forward. The agents are a half cheetah, an ant, and a humanoid robot. All the agents use a 5-dimensional reward vector consisting of (1) being alive, i.e., staying in an upright position, (2) gaining forward speed, (3) an energy consumption penalty, (4) a joint limit penalty, and (5) a collision penalty. For every timestep, the agent receives a fixed alive reward if it stays upright. The forward speed reward is proportional to the forward speed (negative when moving backward). The energy consumption penalty is proportional to the amount of force applied to all joints. The joint limit penalty punishes the agent with a fixed negative value when a joint is positioned outside its limit. The collision cost punishes the agent for self-collision. Among these rewards, the speed and the energy consumption rewards are conflicting and cannot be optimized at the same time. A policy is valid if it keeps the agent upright; this corresponds to an average discounted return of the alive reward above a given threshold for each agent.
VI-B Results
Here, we present the meta-MORL results for obtaining the Pareto optimal policies and compare them with RA [9] as the baseline. As described in Sec. II, RA is a single-policy approach that estimates the Pareto front using multiple runs, each of which solves a single-objective RL problem obtained from the scalarization function with different weights. The starting policy of RA is initialized randomly. The meta-MORL method works similarly to RA but with the difference that, instead of being initialized randomly, each policy is initialized with the meta-policy. The meta-policy is trained, as explained in the previous section, with random scalarization functions.
VI-B.1 Estimating the Pareto Front
We use the hypervolume indicator [15] to evaluate the quality of the Pareto fronts estimated by the two methods. The hypervolume indicator calculates the volume enclosed between a reference point and the Pareto front points; a larger hypervolume indicates a more optimal Pareto front. The reference point is defined as the minimum value of each objective over all policies. The hypervolumes of the Pareto fronts constructed by the two methods are listed in Table I. The results show that, for simple tasks, like LunarLander and Reacher, the performance of RA is equal to or slightly better than that of meta-MORL. However, for complex tasks, meta-MORL finds solutions resulting in larger hypervolumes, i.e., more optimal Pareto fronts.
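For the two-objective special case, the hypervolume can be computed exactly as a sum of rectangle areas once the non-dominated points are sorted; the following sketch (a maximization convention and names of our choosing) illustrates the measure. In higher dimensions, exact computation becomes much more expensive, which is the NP-hardness issue noted in Sec. II:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2-objective front w.r.t. a reference point.

    `front` is a list of (f1, f2) points, larger values being better,
    and `ref` a reference point dominated by every front point. The
    dominated region is a union of axis-aligned rectangles, swept here
    in ascending f1 order.
    """
    pts = sorted(front)                      # ascending in f1
    # keep only non-dominated points: scan in descending f1, keeping
    # points whose f2 strictly improves on everything to their right
    nd, best_f2 = [], float("-inf")
    for f1, f2 in reversed(pts):
        if f2 > best_f2:
            nd.append((f1, f2))
            best_f2 = f2
    nd.reverse()                             # ascending f1, descending f2
    hv, prev_f1 = 0.0, ref[0]
    for f1, f2 in nd:
        hv += (f1 - prev_f1) * (f2 - ref[1])  # vertical strip area
        prev_f1 = f1
    return hv
```

For example, the front `[(1, 3), (2, 2), (3, 1)]` with reference point `(0, 0)` encloses a staircase of area 6, and adding a dominated point such as `(1.5, 1)` leaves the value unchanged.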
To analyze the optimality of the Pareto front for each task, we plot only three of the objectives in Fig. 3. We evaluated the dominance of each policy compared to all other policies and marked the non-dominated policies with black circles. The number of non-dominated policies found by the two methods is given in Fig. 1(a). Considering both the hypervolume and the number of non-dominated policies confirms that meta-MORL outperforms the baseline in complex continuous control tasks.
VI-B.2 Data-efficiency
Here, we demonstrate that the proposed method constructs the Pareto front using less data. Fig. 4 illustrates the hypervolume measure calculated for the meta-MORL method during the fine-tuning phase. The red dashed lines denote the hypervolume found by the baseline method once all policies have converged. To compare the data-efficiency of the methods, we measured how much data is used in each case.

Baseline: Each policy is trained for a fixed, task-dependent number of iterations, with separate iteration budgets for Reacher and LunarLanderContinuous, for HalfCheetah and Ant, and for Humanoid. In total, 30 different policies are trained for each task; the accumulated iterations yield the hypervolume indicated by the red dashed line.

Meta-MORL: The meta-policy is trained with a task-dependent number of meta-updates, i.e., iterations of the adaptation and meta-learning phases. In our implementation, a meta-update requires five times more training data than a regular update. Even so, meta-MORL reaches performance similar to the baseline with a fraction of the baseline's iterations. Furthermore, meta-MORL keeps improving the results with more updates, while no further improvement can be observed with additional training of the baseline method.
VI-B.3 Performance of Individual Policies
We also studied the performance of each policy w.r.t. its corresponding single-objective RL problem given by the scalarization function. For different scalarization functions, we compared meta-MORL and the baseline by counting the valid policies obtained by each method. We also counted, for each method, the number of policies that outperform the other method on the same scalar objective. Here, the validity of a policy indicates whether it can accomplish a task, and the accumulated scalar return indicates its performance on that specific task. These measures are illustrated in Fig. 1(b) and Fig. 1(c), respectively. As shown, in both cases meta-MORL outperforms the baseline, with greater margins for more complicated control tasks.
It is not surprising that meta-MORL outperforms the baseline in this case. Assume that, for the Humanoid task, we would like to learn a policy that minimizes the energy cost as the most important objective. Training a randomly initialized policy will most likely produce a policy that causes the agent to fall down, since this is the most likely local optimum when the policy is initialized randomly. Using meta-learning, however, the meta-policy is found such that any combination of the objectives can be achieved with a few more training iterations. In the above example, the policy can therefore find a better behavior that minimizes the energy consumption cost while still walking forward. In that sense, the meta-learning method results in a more efficient exploration strategy for acquiring optimal behaviors.
Table I: Hypervolume of the Pareto fronts constructed by the two methods.

Environment    Meta-MORL    RA
Reacher        9.95         13.82
LunarLander    453          422
HalfCheetah
Ant
Humanoid
VII Conclusions
In this work, we introduced a novel meta-MORL approach based on model-agnostic meta-learning [8] to solve deep MORL problems. We proposed to convert a MORL problem into a number of single-objective RL tasks using a parametric scalarization function. Then, using meta-learning, a meta-policy is obtained such that the performance on each task can be improved with a few more training iterations. The meta-policy is finally used as the initial policy to train a set of Pareto optimal policies with different objectives.
We evaluated our method on several simulated continuous control tasks and demonstrated that it scales well to high-dimensional control problems. Furthermore, we demonstrated that it outperforms the only baseline method that can be applied to train deep policies to construct the Pareto front for reward vectors with several dimensions.
VIII Acknowledgments
This work is supported by the European Union's Horizon 2020 research and innovation program through the CENTAURO project (grant agreement No. 644839) and the socSMCs project (H2020-FETPROACT-2014), and by the Academy of Finland through the DEEPEN project.
References
[1] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg, “Learning by playing - solving sparse reward tasks from scratch,” arXiv preprint arXiv:1802.10567, 2018.
 [2] A. Ghadirzadeh, A. Maki, D. Kragic, and M. Björkman, “Deep predictive policy training using reinforcement learning,” in Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on. IEEE, 2017, pp. 2351–2358.
 [3] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[4] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in International Conference on Machine Learning, 2016, pp. 1329–1338.
[5] C. Liu, X. Xu, and D. Hu, “Multiobjective reinforcement learning: A comprehensive overview,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 45, no. 3, pp. 385–398, 2015.
[6] P. Vamplew, R. Dazeley, A. Berry, R. Issabekov, and E. Dekker, “Empirical evaluation methods for multiobjective reinforcement learning algorithms,” Machine Learning, vol. 84, no. 1-2, pp. 51–80, 2011.

[7] D. M. Roijers, P. Vamplew, S. Whiteson, and R. Dazeley, “A survey of multi-objective sequential decision-making,” Journal of Artificial Intelligence Research, vol. 48, pp. 67–113, 2013.
[8] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” arXiv preprint arXiv:1703.03400, 2017.
[9] S. Parisi, M. Pirotta, N. Smacchia, L. Bascetta, and M. Restelli, “Policy gradient approaches for multi-objective sequential decision making,” in 2014 International Joint Conference on Neural Networks (IJCNN). IEEE, 2014, pp. 2323–2330.
 [10] K. Van Moffaert and A. Nowé, “Multiobjective reinforcement learning using sets of pareto dominating policies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 3483–3512, 2014.
 [11] T. Brys, A. Harutyunyan, P. Vrancx, A. Nowé, and M. E. Taylor, “Multiobjectivization and ensembles of shapings in reinforcement learning,” Neurocomputing, vol. 263, pp. 48–59, 2017.
 [12] D. J. Lizotte, M. H. Bowling, and S. A. Murphy, “Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis,” in Proceedings of the 27th International Conference on Machine Learning (ICML10). Citeseer, 2010, pp. 695–702.
 [13] I. Das and J. E. Dennis, “A closer look at drawbacks of minimizing weighted sums of objectives for pareto set generation in multicriteria optimization problems,” Structural optimization, vol. 14, no. 1, pp. 63–69, 1997.
 [14] K. Van Moffaert, M. M. Drugan, and A. Nowé, “Scalarized multiobjective reinforcement learning: Novel design techniques.” in ADPRL, 2013, pp. 191–199.
 [15] K. Van Moffaert, M. M. Drugan, and A. Nowé, “Hypervolumebased multiobjective reinforcement learning,” in International Conference on Evolutionary MultiCriterion Optimization. Springer, 2013, pp. 352–366.
 [16] S. Natarajan and P. Tadepalli, “Dynamic preferences in multicriteria reinforcement learning,” in Proceedings of the 22nd international conference on Machine learning. ACM, 2005, pp. 601–608.
 [17] M. Pirotta, S. Parisi, and M. Restelli, “Multiobjective reinforcement learning with continuous pareto frontier approximation,” in 29th AAAI Conference on Artificial Intelligence, AAAI 2015 and the 27th Innovative Applications of Artificial Intelligence Conference, IAAI 2015. AAAI Press, 2015, pp. 2928–2934.
 [18] S. Parisi, M. Pirotta, and J. Peters, “Manifoldbased multiobjective policy search with sample reuse,” Neurocomputing, vol. 263, pp. 3–14, 2017.
 [19] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning, 2015, pp. 1889–1897.
 [20] “Roboschool: opensource software for robot simulation,” https://blog.openai.com/roboschool/.
 [21] M. Plappert, M. Andrychowicz, A. Ray, B. McGrew, B. Baker, G. Powell, J. Schneider, J. Tobin, M. Chociej, P. Welinder, V. Kumar, and W. Zaremba, “Multigoal reinforcement learning: Challenging robotics environments and request for research,” 2018.