I Introduction
Deep reinforcement learning (RL) has achieved great successes over the last years, enabling learning of effective policies from high-dimensional input, such as pixels, on complicated tasks. However, compared to problems with discrete action spaces, continuous control problems with high-dimensional continuous state-action spaces, such as those encountered in robotics, have proven much more challenging. One problem encountered in continuous action spaces is that straightforward optimization of task reward leads to idiosyncratic solutions that switch between extreme values for the controls at high frequency, a phenomenon also referred to as bang-bang control. While such solutions can maximize reward and can be acceptable in simulation, they are usually not suitable for real-world systems where smooth control signals are desirable. Unnecessary oscillations are not only energy inefficient, they also exert stress on a physical system by exciting second-order dynamics and increasing wear and tear.
To regularize the behavior, a common strategy is to add penalties to the reward function. As a result, the reward function is composed of positive reward for achieving the goal and negative reward (cost) for high control actions or high energy usage. This effectively casts the problem into a multi-objective optimization setting, where, depending on the ratio between the reward and the different penalties, different behaviors may emerge. While every ratio will have its optimal policy, finding the ratio that results in the desired behavior can be difficult. Often, one must find different hyperparameter settings for different reward-penalty trade-offs or tasks. The process of finding these parameters can be cumbersome, and may prevent robust and general solutions. In this paper we rephrase the problem: instead of trying to find the appropriate ratio between reward and cost, we regularize the optimization problem by adding constraints, thereby reducing its effective dimensionality. More specifically, we propose to minimize the penalty while respecting a lower bound on the success rate of the task.
Using a Lagrangian relaxation technique, we introduce cost coefficients for each of the imposed constraints that are tuned automatically during the optimization process. In this way we can find the optimal trade-off between reward and costs (that also satisfies the imposed constraints) automatically. By making the cost multipliers state-dependent, and adapting them alongside the policy, we can not only impose constraints on expected reward or cost, but also on their instantaneous values. Such pointwise constraints allow for much tighter control over the behavior of the policy, since a constraint that is satisfied only in overall expectation could still be violated momentarily. Our constrained optimization procedure can further be generalized to a multitask setting to train policies that are able to dynamically trade off reward and penalties within and across tasks. This allows us to, for example, learn energy-efficient locomotion at a range of different velocities.
The contributions of this work are: (i) we demonstrate that state-dependent Lagrangian multipliers for large and continuous state spaces can be implemented with a neural network that generalizes across states; (ii) we introduce a structured critic that simultaneously learns both reward and value estimates as well as the coefficient to balance them in a single model; and finally (iii) we demonstrate how our constrained optimization framework can be employed in a multitask setting to effectively train goal-conditioned policies. Our approach is general and flexible in that it can be applied to any value-based RL algorithm and any number of constraints. We evaluate our approach on a number of simulated continuous control problems in Section IV using tasks from the DM Control Suite [20] and a (realistically simulated) locomotion task with the Minitaur quadruped. Finally, we apply our method to a reaching task that requires satisfying a visually defined constraint on a real Sawyer robot arm.
II Background and related work
We consider Markov decision processes (MDPs) [18] where an agent sequentially interacts with an environment, in each step observing the state of the environment and choosing an action according to a policy. Executing the action in the environment causes a state transition with an associated reward defined by some utility function. The goal of the agent is to maximize the expected sum of rewards along trajectories generated by following the policy, also known as the return. While some tasks have a well-defined reward (e.g. the increase in score when playing a game), in other cases it is up to the user to define a reward function that produces a desired behavior. Designing suitable reward functions can be a difficult problem even in the single-objective case [e.g. 15, 4].
Multi-objective RL (MORL) problems arise in many domains, including robotics, and have been covered by a rich body of literature [see e.g. 16, for a recent review], suggesting a variety of solution strategies. For instance, [12] devise a Deep RL algorithm that implements an outer loop method and repeatedly calls a single-objective solver. [11] propose an algorithm for learning in a stochastic game setting with vector-valued rewards (their approach is based on approachability of a target set in the reward space). However, most of these approaches explicitly recast the multi-objective problem into a single-objective problem (that is amenable to existing methods), where one aims to find the trade-off between the different objectives that yields the desired result. In contrast, we aim for a method that automatically trades off different components in the objective to achieve a particular goal. To achieve this, we cast the problem in the framework of constrained MDPs (CMDPs) [3]. CMDPs have been considered in a variety of works, including in the robotics and control literature. For instance, Achiam et al. [2] and Dalal et al. [5] focus on constraints motivated by safety concerns and propose algorithms that ensure that constraints remain satisfied at all times. These works, however, assume that the initial policy already satisfies the constraint, which is usually not practical when, as in our case, the constraint involves the task success rate. The motivation for the work by [21] is similar to ours, but unlike ours their approach maximizes reward subject to a constraint on the cost and enforces constraints only in expectation.
Constraint-based formulations are also frequently used in single-objective policy search algorithms where bounds on the policy divergence are employed to control the rate of change in the policy from one iteration to the next [e.g. 14, 10, 17, 1]. Our use of constraints, although similar in the practical implementation, is conceptually orthogonal. Also, these methods typically employ constraints that are satisfied only in expectation. While we note that our approach can be applied to any value-based off-policy method, we make use of the method described in MPO [1] as the underlying policy optimization algorithm, without loss of generality of our method. MPO is an actor-critic algorithm that is known to yield robust policy improvement. In each policy improvement step, for each state sampled from the replay buffer, MPO creates a population of actions. Subsequently, these actions are reweighted based on their estimated values such that better actions will have higher weights. Finally, MPO uses a supervised learning step to fit a new policy in continuous state and action space. See Abdolmaleki et al. [1] and Appendix A for more details.
III Constrained optimization for control
We consider MDPs where we have both a reward and a cost, each of which is a function of state and action. The goal is to automatically find a stochastic policy (with learnable parameters) that both maximizes the (expected) reward that defines task success and minimizes a cost that regularizes the solution. For instance, in the case of the well-known cartpole problem we might want to achieve a stable swing-up while minimizing other quantities, such as control effort or energy. This can be expressed using a penalty proportional to the total cost, i.e. by maximizing, with respect to the policy parameters, the expected sum of rewards minus the weighted costs, where the expectation is with respect to trajectories produced by executing the policy. The problem then becomes one of finding an appropriate trade-off between task reward and cost, and hence a suitable value of the penalty coefficient. Finding this trade-off is often non-trivial. An alternative way of looking at this dilemma is to take a multi-objective optimization perspective. Instead of fixing the coefficient, we can optimize for it simultaneously and can obtain different Pareto-optimal solutions for different values of the coefficient. In addition, to ease the definition of a desirable regime for the coefficient, one can consider imposing hard constraints on the cost to reduce dimensionality [7], instead of linearly combining the different objectives. Defining such hard constraints is often more intuitive than trying to manually tune coefficients. For example, in locomotion, it is easier to define desired behavior in terms of a lower bound on speed or an upper bound on an energy cost.
III-A Constrained MDPs
The perspective outlined above can be formalized as a CMDP [3]. A constraint can be placed on either the reward or the cost. In this work we primarily consider a lower bound on the expected total return (although the theory derived below equivalently applies to constraints on cost), i.e. the expected total return must be at least the minimum desired return. In the case of an infinite horizon with a given stationary state distribution, the constraint can instead be formulated for the per-step reward, i.e. as a lower bound on the expected per-step reward. In practice one often optimizes the discounted return in both cases. To apply model-free RL methods to this problem we first define an estimate of the expected discounted return for a given policy as the action-value function of the reward, and similarly an action-value function for the expected discounted cost. We can then recast the CMDP in value space, where the lower bound on the value is the desired per-step reward scaled by the limit of the converging sum over discounts:
(1) 
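As a hedged sketch, the value-space problem described above can be written as follows. The symbols are assumptions chosen for illustration and are not recovered from the original formula: $Q_r$ and $Q_c$ denote the action-value functions of the reward and of the cost, $\bar{r}$ the desired per-step reward, $\bar{V}_r$ the resulting lower bound on the value, and $\gamma \in [0, 1)$ the discount factor.

```latex
% Assumed notation; one form consistent with the surrounding text.
\min_{\pi} \; \mathbb{E}_{s, a \sim \pi}\left[ Q_c(s, a) \right]
\quad \text{s.t.} \quad
\mathbb{E}_{s, a \sim \pi}\left[ Q_r(s, a) \right] \ge \bar{V}_r,
\qquad
\bar{V}_r = \sum_{t=0}^{\infty} \gamma^{t}\, \bar{r} = \frac{\bar{r}}{1 - \gamma}.
```

For example, with $\gamma = 0.99$ a desired per-step reward of $0.9$ would correspond to a value bound of $0.9 / 0.01 = 90$; these numbers are illustrative only.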
III-B Lagrangian relaxation
We formulate task success as a lower bound on the reward. This constraint is typically not satisfied at the start of learning since the agent first needs to learn how to solve the task. This rules out methods for solving CMDPs which assume that the constraint is satisfied at the start and limit the policy update to remain within the constraint-satisfying regime [e.g. 2]. Lagrangian relaxation is a general method for solving constrained optimization problems, and CMDPs by extension [3]. In this setting, the hard constraint is relaxed into a soft constraint, where any constraint violation acts as a penalty for the optimization. Applying Lagrangian relaxation to Equation 1 results in the unconstrained dual problem
(2) 
with an additional minimization w.r.t. the Lagrangian multiplier.
A larger multiplier results in a higher penalty for violating the constraint. Hence, we can iteratively update the multiplier by gradient descent on the relaxed objective, and alternate with policy optimization, until the constraint is satisfied. Under assumptions described in Tessler et al. [21], this approach converges to a saddle point. At convergence, the multiplier is exactly the desired trade-off between reward and cost we aimed to find. To perform the policy optimization, any off-the-shelf off-policy optimization algorithm can be used (since we assume that we have a learned, approximate Q-function at our disposal). In practice, we perform policy optimization using the MPO algorithm [1] and refer to Appendix A for additional details.
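The alternating scheme can be illustrated on a toy problem. The sketch below is not the paper's implementation: it assumes a one-dimensional "policy" consisting of a single action `a`, a hypothetical reward value `q_r(a) = a`, a hypothetical negated cost value `q_c(a) = -a**2` (so that the whole objective is maximized), and a lower bound `v_bar = 0.5`. The inner policy optimization is solved in closed form, and the multiplier is updated by projected gradient descent.

```python
def lagrangian(a, lam, v_bar=0.5):
    """Relaxed objective: negated cost plus multiplier times constraint slack."""
    q_r = a          # hypothetical reward value
    q_c = -a ** 2    # hypothetical negated cost value
    return q_c + lam * (q_r - v_bar)


def best_action(lam):
    """Maximizer of the relaxed objective in `a` (derivative -2a + lam set to zero)."""
    return lam / 2.0


def solve(v_bar=0.5, lr=0.5, steps=200):
    lam = 0.0
    for _ in range(steps):
        a = best_action(lam)                  # "policy optimization" step
        grad_lam = a - v_bar                  # derivative of the objective w.r.t. lam
        lam = max(0.0, lam - lr * grad_lam)   # descent step, projected onto lam >= 0
    return best_action(lam), lam


a_star, lam_star = solve()
print(f"action={a_star:.4f}, multiplier={lam_star:.4f}")
```

In this toy setting the multiplier grows while the constraint is violated and the iteration settles where the constraint is exactly met; the final multiplier is the coefficient that a hand-tuned penalty would have needed.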
At the start of learning, as the constraint is not yet satisfied, the multiplier will grow in order to suppress the cost and focus the optimization on maximizing the reward value. Depending on how quickly the constraint can be satisfied, the multiplier can grow very large, resulting in an overall large magnitude of the combined value. This can result in unstable learning, as most actor-critic methods that have an explicit parameterization of the value function are especially sensitive to large (swings in) values. To improve stability, we reparameterize the combined value to be a projection into a convex combination of the reward term and the cost term. Instead of scaling only the reward term, we can then adaptively reweight the relative importance of reward and cost, and make the magnitude of the combined value bounded. To enforce non-negativity of the multiplier, we can perform a change of variable to obtain the following dual optimization problem
(3) 
Note that to correspond to the formulation in Equation 2, we only perform gradient descent w.r.t. the multiplier on the first term in the numerator. In practice, we limit the multiplier to a bounded interval, defined via a small constant, and initialize it to a fixed value.
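The effect of the reparameterization can be sketched numerically. The snippet assumes, as one possible choice, the change of variable `lam = exp(lam_prime)`, under which dividing the relaxed objective by `lam + 1` yields a convex combination with weight `exp(lam_prime) / (exp(lam_prime) + 1)`. The function names and example values are hypothetical.

```python
import math


def relaxed_value(q_r, q_c, v_bar, lam):
    """Unnormalized relaxed value; its magnitude grows linearly with lam."""
    return lam * (q_r - v_bar) + q_c


def mixed_value(q_r, q_c, v_bar, lam_prime):
    """Convex combination of constraint slack and cost value, with lam = exp(lam_prime)."""
    lam = math.exp(lam_prime)
    return (lam * (q_r - v_bar) + q_c) / (lam + 1.0)


q_r, q_c, v_bar = 2.0, -3.0, 9.0   # hypothetical value estimates and bound
for lam_prime in (-5.0, 0.0, 5.0, 10.0):
    raw = relaxed_value(q_r, q_c, v_bar, math.exp(lam_prime))
    mixed = mixed_value(q_r, q_c, v_bar, lam_prime)
    print(f"lam'={lam_prime:5.1f}  raw={raw:12.2f}  mixed={mixed:7.3f}")
```

Because `lam + 1` is positive, the normalization leaves the maximizing policy unchanged for a fixed multiplier, while the mixed value always stays between the two terms being combined.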
III-C Pointwise constraints
One downside of the CMDP formulation given in Equation 1 is that the constraint is placed on the expected total episode return, or expected reward. The constraint will therefore not necessarily be satisfied at every single timestep, or visited state, during the episode. For some tasks, however, this difference turns out to be important. For example, in locomotion, a constant speed is more desirable than a fluctuating one, even though the latter might also satisfy a minimum velocity in expectation. Fortunately, we can extend the single constraint introduced in Section III-A to a set, possibly infinite, of pointwise constraints; one for each state induced by the policy. This can be formulated as the following optimization problem:
(4) 
Analogous to Section III-B, this problem can be optimized with Lagrangian relaxation by introducing state-dependent Lagrangian multipliers. Formally, we can write
(5) 
Analogous to the assumption that nearby states have a similar value, here we assume that nearby states have similar multipliers. This allows learning a parametric multiplier function alongside the action-value function, which can generalize to unseen states. In practice, we train a single critic model that outputs the multiplier as well as the reward and cost value estimates. We provide pseudocode for the resulting constrained optimization algorithm in Appendix A. Note that, in this case, the lower bound is still a fixed value and does not depend on the state. In general such a constraint might be impossible to satisfy for some states in a given task if the state distribution is not stationary (e.g. we cannot satisfy a reward constraint in the swing-up phase of the simple pendulum). However, the lower bound can also be made state-dependent and our approach will still be applicable.
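A tabular toy example illustrates the difference between a single multiplier and state-dependent multipliers. It is only a sketch under strong assumptions: three hypothetical states with reward value `a` and negated cost value `-k * a**2` for a state-specific `k`, a bound of `0.5`, a closed-form inner maximization, and a lookup table in place of the neural network described above.

```python
K = [1.0, 2.0, 4.0]   # hypothetical per-state cost curvature
V_BAR = 0.5           # lower bound on the reward value


def best_action(lam, k):
    """Maximizer of -k*a**2 + lam*(a - V_BAR) with respect to a."""
    return lam / (2.0 * k)


def global_multiplier(lr=1.0, steps=500):
    """One shared multiplier; the constraint holds only on average over states."""
    lam = 0.0
    for _ in range(steps):
        actions = [best_action(lam, k) for k in K]
        slack = sum(actions) / len(actions) - V_BAR
        lam = max(0.0, lam - lr * slack)
    return [best_action(lam, k) for k in K]


def pointwise_multipliers(lr=1.0, steps=500):
    """One multiplier per state; the constraint holds in every state."""
    lams = [0.0 for _ in K]
    for _ in range(steps):
        for i, k in enumerate(K):
            slack = best_action(lams[i], k) - V_BAR
            lams[i] = max(0.0, lams[i] - lr * slack)
    return [best_action(lam, k) for lam, k in zip(lams, K)]


print("global   :", [round(a, 3) for a in global_multiplier()])
print("pointwise:", [round(a, 3) for a in pointwise_multipliers()])
```

With the shared multiplier the bound is met on average while individual states fall well below it; with one multiplier per state every state ends up at the bound.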
III-D Conditional constraints
Up to this point, we have made the assumption that we are only interested in a single, fixed value for the lower bound. However, in some tasks one would want to solve Equation 4 for different lower bounds, i.e. minimizing cost for various success rates. For example, in a locomotion task, one could be interested in optimizing energy for multiple different target speeds or gaits. Assuming locomotion is a stationary behavior, one could set the lower bound to each of a range of target velocities. In the limit this would achieve the same result as multi-objective optimization: it would identify the set of solutions wherein it is impossible to increase one objective without worsening another, also known as a Pareto front. To avoid the need to solve a large number of optimization problems, i.e. solving for every lower bound separately, we can condition the policy, value function and Lagrangian multipliers on the desired target value and, effectively, learn a bound-conditioned policy
(6) 
Here the goal variable, the desired lower bound for the reward, is observed by the policy and critic and maps to a lower bound for the value. Such a conditional constraint allows a single policy to dynamically trade off cost and return.
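Conditioning on the bound can be sketched as an input-construction step. Everything here is illustrative: the function names, the discount of `0.99`, the target range and the observation values are assumptions, and the value bound reuses the per-step scaling from Section III-A.

```python
import random


def value_bound(target, discount=0.99):
    """Map a desired per-step reward to a lower bound on the discounted value."""
    return target / (1.0 - discount)


def conditioned_observation(observation, target):
    """Append the goal variable so that policy and critic can observe it."""
    return list(observation) + [target]


def sample_training_example(observation, low, high, rng):
    """Draw a target for this episode and build the conditioned inputs."""
    target = rng.uniform(low, high)
    return conditioned_observation(observation, target), value_bound(target)


rng = random.Random(0)
obs = [0.1, -0.2, 0.05]   # hypothetical sensor readings
inputs, bound = sample_training_example(obs, low=0.0, high=0.5, rng=rng)
print(inputs, bound)
```

At evaluation time the same policy would simply be fed a different target value, without retraining.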
IV Experiments
In order to understand the generality and potential impact of our approach, we experiment in the four continuous control domains shown in Figure 1: the cartpole and humanoid from the DM Control Suite benchmark, a more challenging, realistically simulated robot locomotion task, and finally two variants of a reaching task on a real robot arm.
IV-A Control benchmarks
We consider three tasks from the DeepMind Control Suite [20] benchmark to illustrate the problem of bang-bang control specifically, and test the effectiveness of our approach: cartpole swing-up, humanoid stand and humanoid walk. Each of these tasks, by default, has a shaped reward that combines the success criterion (e.g. pole upright and cart in the center for cartpole) with a bonus for a low control signal. The total reward lies in the same bounded range in all cases. We compare agents trained on this original reward with two variants: i) an agent trained with the control term from the reward removed, ii) an agent trained without the control term but using the proposed approach from Section III. In all cases we learn a neural network controller using the MPO algorithm [1]. More specifically, we train a two-layer MLP policy to output the mean and variance of a Gaussian policy. For the constrained optimization approach, we use a fixed lower bound on the expected per-step reward and use the norm of the force output as the penalty. More details about the training setup can be found in Appendix A. Table I shows the average reward (excl. control penalty) and control penalty for each of the tasks and setups, both averaged across the entire episode as well as the final 50%. The latter is relevant as all three tasks have an initial balancing component that by its nature requires significant control input.
Table I: Average reward / control penalty, over the full episode (full) and over its final 50% (last).
| Task | Window | Constrained | Unconstrained | Original |
| --- | --- | --- | --- | --- |
| cartpole | full | 0.891 / 0.302 | 0.885 / 1.918 | 0.895 / 0.733 |
| cartpole | last | 0.998 / 0.013 | 1.000 / 1.459 | 0.998 / 0.074 |
| humanoid (stand) | full | 0.961 / 5.608 | 0.964 / 37.19 | 0.952 / 27.52 |
| humanoid (stand) | last | 0.998 / 4.538 | 0.993 / 37.29 | 0.999 / 27.01 |
| humanoid (walk) | full | 0.869 / 21.60 | 0.953 / 26.84 | 0.957 / 29.57 |
| humanoid (walk) | last | 0.903 / 21.60 | 0.984 / 26.82 | 0.990 / 29.42 |
For cartpole, we see that all agents obtain almost identical returns. The constrained method, however, is able to achieve significantly lower penalties, even compared to the original reward that included a (non-adaptive) penalty. Figure 2(a) shows a comparison of typical rollouts of the different optimization strategies. When optimizing for the reward alone, we can observe that the average absolute control signal is large and the agent keeps switching rapidly between a large negative and large positive force. While the agent is able to solve the task (and the behavior can be somewhat smoothed by using the policy mean instead of sampling), this kind of bang-bang control is not desirable for any real-world control system. The policy learned with the constrained approach is visibly smoother; in particular it never reaches maximum or minimum actuation levels after the swing-up (during which a switch between maximum and minimum actuation is indeed the optimal solution). For the agent trained with the original reward function, which incorporates a fixed control penalty, the action distribution also shrinks after the swing-up phase, but not as much as in the constraint-based approach.
We observe a similar trend for the humanoid stand task, where all three setups result in almost the same average reward, but the constraint-based approach is able to reduce the control penalty by 80% compared to the original reward setup. We visualize the resulting policies in Figure 3 by overlaying frames from the final 50% of the episode. Bang-bang control will result in a more jittery motion and hence a more blurry image, as can be seen in Figure 3(a). In contrast, both the constrained and original setup show a fixed pose and significantly less jitter. In the constrained case, however, the agent consistently learns to use a smaller control norm by putting the legs closer together. This can be observed in Figure 2(b), where, after the initial stand-up, the constrained optimization approach results in a lower control norm.
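The quoted reduction can be checked against the full-episode control penalties for humanoid stand reported in Table I (5.608 for the constrained agent versus 27.52 for the original reward). The helper name is hypothetical.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


# Full-episode control penalties for humanoid stand, taken from Table I.
original_penalty = 27.52
constrained_penalty = 5.608

reduction = relative_reduction(original_penalty, constrained_penalty)
print(f"{100 * reduction:.1f}% lower control penalty")
```

The result is just under 80%, in line with the figure quoted in the text.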
For the humanoid walk task, we observe that while the constraint-based approach still results in a lower penalty, there is also a reduction in the average reward. This is to be expected: when walking, the penalty can be minimized by slowing down, thus the average per-step reward will stick closer to the imposed lower bound. Interestingly, the original reward configuration results in a higher control penalty compared to the unconstrained case, perhaps because the control penalty is mixed into the reward differently than in the (un)constrained case and may hence result in a different optimum of the reward.
IV-B Minitaur locomotion
Figure 4: (a) shows the per-step reward over time. (b) shows the trade-off between the per-step reward and penalty during training. Policies start off at 0 m/s and first learn to satisfy the constraint before optimizing the penalty. (c) shows how the Lagrangian multiplier(s) change over time. For the state-dependent case, we show the mean and standard deviation of the multiplier across the training batch.
Our second simulated experiment is based on the Minitaur robot developed by Ghost Robotics [9]. The Minitaur is a quadruped with two degrees of freedom (DoF) in each of the four legs, which are actuated by high-power direct-drive motors, allowing it to express various dynamic gaits such as trotting, pronking and galloping. Implementing these gaits with state-of-the-art control techniques requires significant effort, however, and performance becomes sensitive to modeling errors when using model-based approaches. Learning-based approaches have shown promise as an alternative for devising locomotion controllers [19]. They are less dependent on gait and other task-dependent heuristics and can lead to more versatile and dynamic behaviors. We propose learning gaits that are sufficiently smooth and efficient by optimizing for power usage. This will avoid high-frequency changes in the control signal that in turn could cause instability or mechanical stress.
Although the reported experiments are conducted in simulation, we have made a significant effort to capture many of the challenges of real robots. We model the Minitaur in MuJoCo [23], as depicted in Figure 1(c), using model parameters obtained from data sheets as well as system identification to improve the fidelity. The Minitaur is placed on a varying, rough terrain that is procedurally generated for every rollout. We use a nonlinear actuator model based on a general DC motor model and the torque-current characteristic described in De and Koditschek [6]. The observations include noisy motor positions, yaw, pitch, roll, angular velocities and accelerometer readings, but no direct perception of the terrain, making the problem partially observed. The policy outputs position setpoints at 100 Hz that are fed to a proportional position controller, with a delay of 20 ms between sensor readings and the corresponding control signal, to match delays observed on the real hardware. To improve robustness, and with the aim of simulation-to-real transfer, we perform domain randomization [22] on a number of model parameters, as well as apply random external forces to the body (see Appendix B for details).
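The actuation pipeline described above, position setpoints at 100 Hz with a 20 ms delay (two control steps), can be sketched with a small delay buffer. The class name, the proportional gain and the initial setpoint are hypothetical, the delay is modeled as acting on the setpoint, and the nonlinear actuator model is omitted.

```python
from collections import deque

CONTROL_RATE_HZ = 100
DELAY_S = 0.02
DELAY_STEPS = round(DELAY_S * CONTROL_RATE_HZ)   # two control steps


class DelayedPositionController:
    """Proportional position controller acting on delayed setpoints."""

    def __init__(self, kp=1.0, initial_setpoint=0.0):
        self.kp = kp   # hypothetical proportional gain
        # Pre-fill the buffer so that the first outputs use the initial setpoint.
        self.buffer = deque([initial_setpoint] * DELAY_STEPS)

    def step(self, setpoint, position):
        """Queue the new setpoint and return the command for the delayed one."""
        self.buffer.append(setpoint)
        delayed_setpoint = self.buffer.popleft()
        return self.kp * (delayed_setpoint - position)


controller = DelayedPositionController(kp=2.0)
commands = [controller.step(setpoint=1.0, position=0.0) for _ in range(4)]
print(commands)
```

The new setpoint only influences the command two steps after it was issued, which is the kind of latency the policy has to cope with.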
As we are only considering forward locomotion, we set the reward to be the forward velocity of the robot’s base expressed in the world frame. The cost is the total power usage of the motors according to the actuator model. As the legs can collide with the main body when giving the agent access to the full control range, a constant penalty is added to the power penalty during any self-collision. We use a largely similar training setup as in Section IV-A. However, since the episodes are 30 s in length and observations are partial and noisy, the agent requires memory for effective state estimation; we thus add an LSTM [8] layer to the model. In addition to learning separate values for the reward and the cost, we split up the cost value into separate value functions for the power usage and collision penalty. We also increase the number of actors to 100 to sample a larger number of domain variations more quickly. More details can be found in Appendix A.
We first evaluate the effect of applying the lower bound to each individual state instead of to the global average velocity. Figure 4 shows a comparison between the learning dynamics of a model using a single multiplier and a model with a state-dependent one, i.e. constrained in expectation or per step. Both agents try to achieve a lower bound on the value that is equivalent to a minimum velocity of 0.5 m/s. At first, both agents “focus” on satisfying the constraint, increasing the penalty significantly in order to do so. Once the target velocity is exceeded, the agents start to optimize the penalty, which drives them back to the imposed bound. We see that a single global multiplier leads to large oscillations between moving too slowly at a lower penalty and moving too fast at a higher penalty. Although this process eventually converges, it is inefficient. In contrast, the agent with the state-dependent multiplier tracks the target velocity more closely, and achieves slightly lower penalties. The state-dependent multiplier shows generally lower values over time as well (Figure 4(c)).
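Table II: Fixed target velocity. Each cell lists velocity delta (m/s) / penalty. Columns Fixed 1 to Fixed 4 are the clipped-reward baselines with a fixed penalty coefficient, the last column is the constrained approach, and the final row is the baseline with unbounded reward.
| Target (m/s) | Fixed 1 | Fixed 2 | Fixed 3 | Fixed 4 | Constrained |
| --- | --- | --- | --- | --- | --- |
| 0.1 | 0.1 / 35.74 | 0.01 / 104.2 | 0.07 / 112.35 | 0.1 / 245.49 | 0.01 / 127.14 |
| 0.2 | 0.2 / 46.48 | 0.01 / 210.04 | 0.15 / 207.19 | 0.23 / 399.83 | 0.03 / 106.88 |
| 0.3 | 0.3 / 50.3 | 0.06 / 154.91 | 0.16 / 213.1 | 0.24 / 429.6 | 0.04 / 89.97 |
| 0.4 | 0.4 / 54.05 | 0.06 / 195.98 | 0.11 / 306.1 | 0.32 / 627.66 | 0.05 / 132.97 |
| 0.5 | 0.5 / 60.71 | 0.13 / 250.69 | 0.13 / 332.53 | 0.26 / 808.38 | 0.05 / 142.93 |
| unbounded | 0.0 / 54.63 | 1.25 / 775.08 | 1.24 / 1556.97 | 1.24 / 1656.42 | - |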
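Table III: Agents conditioned on the target velocity. Each cell lists velocity delta (m/s) / penalty. Columns Fixed 1 to Fixed 4 are baselines with a fixed penalty coefficient; the last column is the constrained approach.
| Target (m/s) | Fixed 1 | Fixed 2 | Fixed 3 | Fixed 4 | Constrained |
| --- | --- | --- | --- | --- | --- |
| 0.0 | 0.0 / 53.68 | 0.01 / 116.59 | 0.17 / 272.45 | 0.37 / 757.53 | 0.0 / 84.07 |
| 0.1 | 0.1 / 54.49 | 0.0 / 158.68 | 0.21 / 324.16 | 0.37 / 619.3 | 0.0 / 141.86 |
| 0.2 | 0.2 / 53.54 | 0.02 / 256.68 | 0.21 / 373.13 | 0.36 / 627.19 | 0.04 / 174.79 |
| 0.3 | 0.3 / 53.6 | 0.02 / 314.71 | 0.16 / 336.48 | 0.42 / 747.24 | 0.02 / 188.18 |
| 0.4 | 0.4 / 54.82 | 0.07 / 384.94 | 0.15 / 467.21 | 0.32 / 870.34 | 0.05 / 252.54 |
| 0.5 | 0.5 / 52.37 | 0.1 / 366.48 | 0.01 / 594.36 | 0.27 / 1026.3 | 0.05 / 361.16 |
| 0.6 | 0.6 / 52.36 | 0.2 / 686.36 | 0.07 / 770.67 | 0.02 / 1632.96 | 0.04 / 773.79 |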
In Table II, we compare the reward-penalty trade-off of our approach to baselines where we clip the reward at the target value and use a fixed coefficient for the penalty. As there is less incentive for the agent to increase the reward over the target, there is more opportunity to optimize the penalty. Results shown are the per-step overshoot with respect to the desired target velocity and the penalty, averaged across 4 seeds and 100 episodes each (the first 100 ms are clipped to disregard transient behavior when starting from a standstill). We also compare to a baseline where the reward is unbounded, shown in the final row of Table II. In the unbounded reward case, it proves to be difficult to achieve a positive but moderately slow speed. Either the penalty coefficient is too high and the agent is biased towards standing still, or it is too low and the agent reaches the end of the course before the time limit (corresponding to an average velocity of approx. 1.25 m/s). For the clipped reward, we observe a similar issue when the coefficient is set too high. In nearly all other cases, the targeted speed is exceeded by some margin that increases with decreasing coefficient. While there is less incentive to exceed the target, a larger margin decreases the chances of the actual speed momentarily dropping below the target speed. Using the constraint-based approach, we generally achieve average actual speeds closer to the target and at a lower average penalty, showing the merits of adaptively trading off reward and cost.
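The fixed-coefficient baselines can be sketched as a scalar reward function. The function and argument names are hypothetical, and the numerical values below are arbitrary rather than taken from the experiments.

```python
def baseline_reward(velocity, power, alpha, target=None):
    """Forward-velocity reward minus a fixed-coefficient power penalty.

    If `target` is given, the velocity reward is clipped at the target so that
    exceeding it earns nothing extra; `target=None` is the unbounded variant.
    """
    reward = velocity if target is None else min(velocity, target)
    return reward - alpha * power


# Moving faster than the target only adds penalty in the clipped variant.
at_target = baseline_reward(velocity=0.5, power=100.0, alpha=0.001, target=0.5)
too_fast = baseline_reward(velocity=0.8, power=250.0, alpha=0.001, target=0.5)
unbounded = baseline_reward(velocity=0.8, power=250.0, alpha=0.001)
print(at_target, too_fast, unbounded)
```

In both variants the coefficient `alpha` still has to be chosen by hand, which is the tuning burden the constraint-based approach is meant to remove.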
Table III shows a comparison between agents trained across varying target speeds (sampled uniformly from a fixed range of velocities). These agents are given the target speed as an observation. The evaluation procedure is the same as before, except that we evaluate the same conditional policy for multiple target values. We make similar observations: a fixed penalty coefficient generally leads to higher speeds than the set target, and higher penalties. Interestingly, for higher target velocities, the actual velocity exceeds the target less, indicating that different values for the penalty coefficient are required for different targets. As we learn multipliers that are conditioned on the target, we can track the target more closely, even for higher speeds. We also evaluate these models for a target speed outside of the training range. Performance degrades quite rapidly, with the constraint no longer satisfied, and at significantly higher cost. This can be explained by the way the policies change behavior to match the target speed: generally the speed is changed by modulating the stride length. Increasing the stride length much further than observed during training, however, results in collisions occurring that were not present at lower speeds, and hence higher penalties. The same observation also explains why the penalties in the conditional case are higher than in the fixed case (final column in Table III vs. Table II), as distinct behaviors are optimal for different target velocities. This is likely a limitation of the relatively simple policy architecture, and improving diversity across goal velocities will be studied in future work.
Figure 5 extends the comparisons by plotting penalty over absolute velocity deltas for the different approaches. The plots show that finding a suitable weighting that works for all tasks and setpoints is difficult. While it is easy to identify values for the coefficient that are clearly too high or low, performance over tasks can vary even for well-tuned values. Our approach, as shown in Figure 5(e), achieves a consistent performance, with low velocity overshoot errors and low penalty across all tests. These results suggest that since our approach is less sensitive to task-specific changes, it may also greatly reduce computationally expensive hyperparameter tuning. Videos showing some of the learned behaviors, both in the fixed and conditional constraint case, can be found at https://sites.google.com/view/successatanycost.
IV-C Sawyer reaching with visibility constraint
To demonstrate that our algorithm can, without modification, be used on robotic hardware we apply it to a reaching task on a robot arm in a crowded tabletop environment. To explore the versatility of the constraintbased approach, we design the task such that it contains a reward objective, as well as a constraint. The agent must learn to reach to a random 3D target location while maintaining constant visibility.
In more detail, the robot is a Sawyer 7-DoF arm mounted on a table and equipped with a Robotiq 2F-85 parallel gripper. We place a 5 cm wide cube inside the gripper, and track the cube with a camera using fiducials (augmented reality tags). The objective is to reach a virtual target position sampled within a 20 cm cube workspace. A number of obstacles are, however, placed in front of the camera, as seen in Figure 6. There are two ways the agent can lose visibility: either the cube is occluded by obstacles or the wrist is rotated such that the cube faces away from the camera. Hence, we add an objective to keep the cube visible and constrain it by a lower bound; the visibility is a binary signal indicating whether at least one of the cube’s fiducials is detected in the camera frame.
To phrase this in the framework of Section III, the reward in this case is the visibility, which we constrain to be true 95% of the time. The negative cost is now a sigmoidal function of the distance of the cube to the target, being at most 1 and decaying to 0.05 over a distance of 10 cm.
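One function with the stated properties, a value of 1 at the target that decays to 0.05 at a distance of 10 cm, is sketched below. The text does not specify the exact functional form, so the Gaussian-shaped falloff and the function name are assumptions.

```python
import math


def reach_reward(distance_m, margin_m=0.10, value_at_margin=0.05):
    """Smooth reward in (0, 1]: 1 at zero distance, `value_at_margin` at `margin_m`."""
    scale = math.sqrt(-math.log(value_at_margin))
    return math.exp(-((distance_m / margin_m) * scale) ** 2)


for d in (0.0, 0.05, 0.10, 0.20):
    print(f"distance={d:.2f} m  reward={reach_reward(d):.3f}")
```

Any other monotonically decaying shape with the same two anchor points would fit the description equally well.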
The policy and critic receive several inputs: the proprioception (joint positions, velocities and torques), the visibility indicator, and the previous action taken by the agent. We use an action and observation history of 2, i.e. the past two observations/actions are provided to the agent. The policy outputs 4-dimensional Cartesian velocities: three translational degrees of freedom (limited to [-0.07, 0.07] m/s) plus wrist rotation (limited to [-1, 1] rad/s). The policy is executed at a 20 Hz control rate and we limit each episode to 600 steps. As the camera image itself is not observed, the obstacle configuration has to be indirectly inferred through trial and error. We hence keep the obstacle configuration fixed during training, though the setup can be extended to varying obstacle configurations if the camera image is also observed.
This task setup shows a possible application of the proposed method to a more complex task. Other approaches to solving such a task exist (weighted costs, multiplicative costs or early episode termination). However, the problem formulation of a hard constrained performance metric with a secondary reward objective that is compatible with the constraint (i.e. being defined in the nullspace of the constraint) feels very natural in our approach.
Figure 7 shows the learning progress of the agent in this task. Looking at the value estimates, we see that optimization initially focuses on increasing the value of the visibility objective, while the value for the reaching objective does not change much. Once the lower bound of 95% visibility, corresponding to a value of 9.5, has been met after about 200 training steps, the value of the reaching task starts to increase as well. After about 2500 steps the reaching objective has also achieved its maximum achievable return (without violating the constraint). The visibility value remains nearly constant during the remainder of learning. We observe the same trend in Figure 6(b), which shows the ratio, as defined by Equation 3, of the reaching objective versus the visibility objective. Initially all the weight is put on the visibility objective, which then shifts to about 80% reaching and 20% visibility. Note that the average ratio is plotted, but the actual ratio will differ for each state encountered by the policy. Figure 6 shows an example rollout of the learned policy. The agent is able to avoid shortcuts that affect visibility, maintains a wrist rotation facing the camera and settles in a position where one tag remains visible at all times. A video showing the behavior of the learned policy qualitatively can be found at https://sites.google.com/view/successatanycost.
V Conclusion
In order to regularize behavior in continuous control RL tasks in a controllable way, we introduced a constraint-based RL approach that is able to automatically trade off rewards and penalties, and can be used in conjunction with any model-free, value-based RL algorithm. The constraints are applied in a pointwise fashion, for each state that the learned policy encounters. The resulting constrained optimization problem is solved using Lagrangian relaxation by iteratively adapting a set of Lagrangian multipliers, one per state, during training. We show that we can learn these multipliers in the critic model alongside the value estimates of the policy, and closely track the imposed bounds. The policy and critic can furthermore generalize across lower bounds by making the constraint value observable, resulting in a single conditional RL agent that is able to dynamically trade off reward and costs in a controllable way. We applied our approach to a number of continuous control benchmarks and showed that, without some cost function, we observe high-amplitude and high-frequency control. Our method is able to reduce the control actions significantly, sometimes without sacrificing average reward. In a simulated quadruped locomotion task, we are able to minimize electrical power usage subject to a lower bound on the forward velocity. We show that our method can achieve both lower velocity overshoot and lower power usage compared to a baseline that uses a fixed penalty coefficient. Finally, we successfully learned a reaching task in a cluttered tabletop environment on a real robot arm with a visibility constraint, demonstrating that our method extends to real-world systems and nontrivial problems.
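The Lagrangian relaxation summarized above can be illustrated on a toy problem. The sketch below is not the method of this paper: it uses a single scalar multiplier instead of one per state, and a hypothetical one-dimensional problem in which an effort a in [0, 1] yields reward a and penalty a squared. It only shows the alternation between a primal step on the policy variable and a dual step on the multiplier.

```python
def solve_constrained(bound, lr=0.5, steps=200):
    """Minimise the penalty a**2 subject to reward a >= bound, a in [0, 1].

    Toy illustration of Lagrangian relaxation with one scalar multiplier.
    Lagrangian: L(a, lam) = -a**2 + lam * (a - bound).
    """
    lam = 0.0
    a = 0.0
    for _ in range(steps):
        # Primal step: maximise -a**2 + lam * a over a in [0, 1].
        a = min(1.0, max(0.0, lam / 2.0))
        # Dual step: increase lam while the reward is below the bound,
        # decrease it (down to zero) when the constraint has slack.
        lam = max(0.0, lam - lr * (a - bound))
    return a, lam
```

For a bound of 0.8 the iteration settles at a = 0.8 with multiplier 1.6: the multiplier grows only as far as needed to satisfy the bound, after which the remaining freedom is spent on reducing the penalty.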
References
 Abdolmaleki et al. [2018] Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a Posteriori Policy Optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b.

 Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained Policy Optimization. In Proceedings of the 34th International Conference on Machine Learning, pages 22–31, 2017.
 Altman [1999] E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
 Amodei et al. [2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
 Dalal et al. [2018] Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. CoRR, abs/1801.08757, 2018.
 De and Koditschek [2015] Avik De and Daniel E. Koditschek. The Penn Jerboa: A platform for exploring parallel composition of templates. CoRR, abs/1502.05347, 2015. URL https://arxiv.org/abs/1502.05347.
 Deb [2014] Kalyanmoy Deb. Multi-objective optimization. In Search methodologies, pages 403–449. Springer, 2014.
 Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 Kenneally et al. [2016] G. Kenneally, A. De, and D. E. Koditschek. Design principles for a family of direct-drive legged robots. IEEE Robotics and Automation Letters, 1(2):900–907, July 2016.
 Levine and Koltun [2013] Sergey Levine and Vladlen Koltun. Guided policy search. In Proceedings of the 30th International Conference on Machine Learning, 2013.
 Mannor and Shimkin [2004] Shie Mannor and Nahum Shimkin. A geometric approach to multi-criterion reinforcement learning. Journal of Machine Learning Research, 5:325–360, December 2004. ISSN 1532-4435.
 Mossalam et al. [2016] Hossam Mossalam, Yannis M. Assael, Diederik M. Roijers, and Shimon Whiteson. Multi-objective deep reinforcement learning. CoRR, abs/1610.02707, 2016. URL https://arxiv.org/abs/1610.02707.
 Munos et al. [2016] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS), 2016.
 Peters and Mülling [2010] Jan Peters and Katharina Mülling. Relative entropy policy search. 2010.
 Popov et al. [2017] Ivaylo Popov, Nicolas Heess, Timothy P. Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin A. Riedmiller. Data-efficient deep reinforcement learning for dexterous manipulation. CoRR, abs/1704.03073, 2017. URL http://arxiv.org/abs/1704.03073.

 Roijers et al. [2013] Diederik M. Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research, 48(1):67–113, October 2013. ISSN 1076-9757.
 Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1889–1897, Lille, France, 07–09 Jul 2015. PMLR. URL http://proceedings.mlr.press/v37/schulman15.html.
 Sutton and Barto [1998] Richard S. Sutton and Andrew G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
 Tan et al. [2018] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. Robotics: Science and Systems (RSS), 2018.
 Tassa et al. [2018] Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. DeepMind Control Suite. CoRR, abs/1801.00690, 2018. URL https://arxiv.org/abs/1801.00690.
 Tessler et al. [2018] Chen Tessler, Daniel J. Mankowitz, and Shie Mannor. Reward constrained policy optimization. CoRR, abs/1805.11074, 2018. URL https://arxiv.org/abs/1805.11074.
 Tobin et al. [2017] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. CoRR, abs/1703.06907, 2017. URL https://arxiv.org/abs/1703.06907.
 Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033, Oct 2012. doi: 10.1109/IROS.2012.6386109.
Appendix A: Optimization details
VA General algorithm
VB Policy Evaluation
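Our method needs access to a Q-function for optimization. While any method for policy evaluation can be used, we rely on the Retrace algorithm [13]. More concretely, we learn a Q-function $Q_i(s, a; \phi)$ for each cost term, where $\phi$ denotes the parameters of the function approximator, by minimizing the mean squared loss:
$$
\begin{aligned}
\min_\phi L(\phi) &= \min_\phi \mathbb{E}_{s_t, a_t \sim b}\Big[\big(Q_i(s_t, a_t; \phi) - Q_t^{\mathrm{ret}}\big)^2\Big], \\
Q_t^{\mathrm{ret}} &= Q_{\phi'}(s_t, a_t) + \sum_{j=t}^{\infty} \gamma^{j-t} \Big(\prod_{k=t+1}^{j} c_k\Big) \Big[r(s_j, a_j) + \gamma\, \mathbb{E}_{\pi(a|s_{j+1})}\big[Q_{\phi'}(s_{j+1}, a)\big] - Q_{\phi'}(s_j, a_j)\Big], \\
c_k &= \min\Big(1, \frac{\pi(a_k|s_k)}{b(a_k|s_k)}\Big),
\end{aligned}
\tag{7}
$$
where $Q_{\phi'}$ denotes the output of a target Q-network, with parameters $\phi'$, that we copy from the current parameters $\phi$ after a fixed number of updates. Note that while the above description uses the definition of reward $r$, we learn the value for the costs analogously. We truncate the infinite sum after $N$ steps by bootstrapping with $Q_{\phi'}$. Additionally, $b(a|s)$ denotes the probabilities of an arbitrary behaviour policy, in our case given through data stored in a replay buffer.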
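A truncated Retrace target of the kind described above can be sketched in a few lines. This is a minimal illustration following the standard form of Retrace [13], with truncated importance weights min(1, pi/b) and a finite trajectory of N steps; the function name and argument layout are assumptions made for this example, and all Q-values are taken to come from the target network.

```python
def retrace_target(q, q_next_expected, rewards, pi_probs, b_probs, gamma):
    """Truncated Retrace target for the first step of an N-step trajectory.

    q[j]               : target-network Q(s_j, a_j), j = 0..N-1
    q_next_expected[j] : expectation of target-network Q(s_{j+1}, a) under pi
    rewards[j]         : r(s_j, a_j)
    pi_probs[j]        : pi(a_j | s_j) under the current policy
    b_probs[j]         : b(a_j | s_j) under the behaviour policy
    """
    target = q[0]
    trace = 1.0  # product of c_k for k = 1..j (empty product at j = 0)
    for j in range(len(rewards)):
        if j > 0:
            # Truncated importance weight c_j = min(1, pi / b).
            trace *= min(1.0, pi_probs[j] / b_probs[j])
        td_error = rewards[j] + gamma * q_next_expected[j] - q[j]
        target += (gamma ** j) * trace * td_error
    return target
```

With a single step the target reduces to the usual one-step bootstrap, reward plus discounted expected next value; longer trajectories add off-policy corrected temporal-difference errors whose weight shrinks whenever the behaviour policy was more likely than the current policy to take the stored action.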
We use the same critic model to predict all values as well as the Lagrangian multipliers. Following Equation 5, we hence also minimize the following loss:
(8) 
Our total critic loss combines the value prediction loss (7) and the constraint loss (8), where a scalar weight (the constraint loss scale in Table IV) is used to balance the constraint and value prediction losses.
VC Maximum a Posteriori Policy Optimization
Given the Q-function, in each policy optimization step MPO uses expectation-maximization (EM) to optimize the policy. In the E-step, MPO finds the solution to a KL-regularized RL objective; the KL regularization helps to avoid premature convergence. We note, however, that our method would work with any other policy gradient algorithm for updating the policy. In the E-step, a sample-based optimal policy is found by minimizing:
(9)  
Afterwards the parametric policy is fitted via weighted maximum likelihood learning (subject to staying close to the old policy) given via the objective:
(10)  
Assuming a Gaussian policy (as in this paper), this objective can further be decoupled into mean and covariance parts for the policy (which in turn allows for more fine-grained control over the policy change), yielding:
(11)  
(12) 
where
Decoupling the updates of mean and covariance allows setting different learning rates for the mean and the covariance matrix, and controlling their contributions to the KL separately. For additional details regarding the rationale of this procedure we refer to the original paper [1].
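The two EM steps can be illustrated with a deliberately simplified sketch for a single state with scalar actions. It assumes a fixed temperature `eta` (in MPO [1] the temperature is obtained by optimizing a dual function) and omits the KL constraints of the M-step, so it shows only the sample reweighting and the weighted maximum likelihood fit, not the full procedure. The function names are hypothetical.

```python
import math


def e_step_weights(q_values, eta):
    """Sample-based E-step: weights proportional to exp(Q / eta).

    q_values are Q(s, a_i) for actions a_i sampled from the old policy.
    eta is a fixed temperature here (an assumption for this sketch).
    """
    m = max(q_values)  # subtract the maximum for numerical stability
    exps = [math.exp((q - m) / eta) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_gaussian_fit(actions, weights):
    """M-step without trust region: weighted ML mean and variance."""
    mean = sum(w * a for w, a in zip(weights, actions))
    var = sum(w * (a - mean) ** 2 for w, a in zip(weights, actions))
    return mean, var
```

A small temperature concentrates the weights on the highest-valued samples, while a large temperature leaves them close to uniform, so the fitted policy stays near the old one.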
VD Hyperparameters
The hyperparameters for the Q-learning and policy optimization procedure are listed in Table IV. We optimize the objectives given above with Adam, using different learning rates for critic and policy learning.
Parameter  Cartpole  Humanoid  Minitaur  Sawyer 

Hidden units policy  
Hidden units critic  
LSTM cells        
Discount  
Policy learning rate  
Critic learning rate  
Constraint loss scale ()  
Number of actors  
E-step constraint ()  
M-step constraint on ()  
M-step constraint on () 
Appendix B: Minitaur simulation details
Parameter  Sample frequency  Description 

Body mass  episode  global scale , with scale for each separate body 
Joint damping  episode  global scale , with scale for each separate joint 
Battery voltage  episode  global scale , with scale for each separate motor 
IMU position  episode  offset , both cartesian and angular 
Motor calibration  episode  offset 
Gyro bias  episode  
Accelerometer bias  episode  
Terrain friction  episode  
Gravity  episode  scale 
Motor position noise  time step  , additional dropout 
Angular position noise  time step  
Gyro noise  time step  
Accelerometer noise  time step  
Perturbations  time step  Per-step decay of 5%, with a chance of adding a force in any planar direction 
The distributions referred to in the table are the normal distribution, the corresponding lognormal distribution, the uniform distribution and the Bernoulli distribution.
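The sampling scheme in the table, with some parameters drawn once per episode and others at every time step, can be sketched as follows. All numerical values, the choice of distribution per parameter, and the function names are placeholders for illustration; the actual distribution parameters are those of the table and are not reproduced here. Only the 5% per-step decay of the perturbation force is taken from the text.

```python
import math
import random


def sample_episode_params(rng, num_bodies=4):
    """Per-episode draws. All distribution parameters are placeholders."""
    global_scale = rng.lognormvariate(0.0, 0.1)
    body_scales = [rng.lognormvariate(0.0, 0.02) for _ in range(num_bodies)]
    return {
        # A global scale combined with a separate scale for each body.
        "body_mass_scale": [global_scale * s for s in body_scales],
        "gyro_bias": rng.gauss(0.0, 0.001),
        "terrain_friction": rng.uniform(0.2, 0.8),
    }


def step_perturbation(force_xy, rng, decay=0.05, p_new=0.001, magnitude=1.0):
    """Per-step update of a planar perturbation force.

    The force decays by `decay` each step; with probability `p_new`
    (placeholder value) a new force in a random planar direction is added.
    """
    fx, fy = force_xy
    fx *= 1.0 - decay
    fy *= 1.0 - decay
    if rng.random() < p_new:
        angle = rng.uniform(0.0, 2.0 * math.pi)
        fx += magnitude * math.cos(angle)
        fy += magnitude * math.sin(angle)
    return fx, fy
```

Passing an explicit `random.Random` instance keeps the draws reproducible, which makes it easier to replay a particular randomized episode.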