1 Introduction
Recent advances in numerical optimal control and Reinforcement Learning (RL) are enabling researchers to study increasingly complex control problems [19, 21, 24, 26]. Many of these problems, both in simulation and the real world, have hybrid dynamics and action spaces, consisting of continuous and discrete decision variables. In robotics, common examples of continuous actions are analogue outputs, torques or velocities, while discrete actions can be control modes, gear switching or discrete valves. Outside of robotics we also find many hybrid control problems, such as computer games where mouse or joystick inputs are continuous but button presses or clicks are discrete. However, many state-of-the-art RL approaches have been optimized to work well with either discrete (e.g. [19]) or continuous (e.g. MPO [2, 1], SVG [14], DDPG [16] or Soft Actor Critic [10]) action spaces but can rarely handle both (notable exceptions are policy gradient methods, e.g. [23]), or they perform better in one parameterization than another [21]. This can make it convenient to transform all control variables so that they can be handled by a single paradigm, e.g. by discretizing continuous variables, or by approximating discrete actions as continuous and thresholding them on the environment side instead of as part of the RL agent. Alternatively, control variables may be removed from the optimization problem, e.g. by using expert-designed heuristics for discrete variables in continuous problems. Although either approach can work in practice, both strategies generally reduce control authority or remove structure from the problem, which can affect performance or ultimately make a problem harder to solve.
In this work, we propose to approach these problems in their native form, i.e. as hybrid control problems with partially discrete and partially continuous action spaces. We derive a data-efficient, model-free RL algorithm that is able to solve control problems with both continuous and discrete action spaces as well as hybrid optimal control problems with controlled (and autonomous) switching. Being able to handle both discrete and continuous actions robustly with the same algorithm allows us to choose the most natural solution strategy for any given problem rather than letting algorithmic convenience dictate this choice. We demonstrate the effectiveness of our approach on a set of control problems with a focus on robotics tasks, both in simulation and on hardware. These examples contain native hybrid problems, but we also demonstrate how a hybrid approach allows for novel formulations of existing problems by augmenting the action space with ‘meta actions’ or other quasi-hierarchical schemes. With minimal algorithmic changes this enables, for instance, variable rate control, thereby improving energy efficiency and reducing mechanical wear. More generally, it allows us to implement strategies that can address some of the challenges in RL such as exploration.
2 Related Work
Control problems with continuous and discrete action spaces, in practice, each tend to have their own idiosyncrasies and challenges. In consequence, even though approaches exist that can, in principle, deal with either type of decision variable, different types of algorithms are favored for solving the different types of problems. To tackle hybrid reinforcement learning problems, recent work such as P-DQN [28] and Q-PAMDP [18] simply combines a discrete and a continuous RL algorithm and hence does not solve the problem in a unified fashion. A substantial part of the hybrid RL literature focuses on a subcategory called Parameterized Action Space Markov Decision Processes (PAMDP) [18, 13, 7, 9], which is a hierarchical problem where the agent first selects a discrete action and subsequently a continuous set of parameters for that action. Masson et al. [18] solve PAMDPs using a low-dimensional policy parameterization. In contrast, [13, 5, 9] make use of continuous (deep) RL to output both continuous actions and weights for the discrete choices. Subsequently, an argmax operation (or softmax) is applied to select the discrete action to be applied to the system. While our work can also be applied to PAMDPs, it is not limited to this special case but can solve general hybrid problems. Vice versa, despite being designed for solving PAMDPs, the ‘argmax-trick’ used in [13, 9] can also be used to solve non-hierarchical hybrid problems. Similarly, a range of work on hierarchical control in continuous action spaces effectively optimizes hybrid action spaces [4, 27, 29, 12]. As an alternative to gradient-based optimization, evolutionary approaches [22, 3, 11] can be applied to address hybrid control problems. However, even recent approaches do not achieve comparable data efficiency [22], which limits their use in low-data regimes such as robotics.
In the optimal control literature, hybrid problems are frequently tackled using (discrete) Dynamic Programming, e.g. Value or Policy Iteration [8]. Other approaches include partitioning the system into continuous subsystems and then approaching the problem in a hierarchical fashion, i.e. solving each continuous control problem separately and adding a discrete controller on top (see surveys in [6, 8]), or tackling the overall problem using game theory [17]. Some approaches treat the entire system as continuous (see survey in [8]). One can then try to solve the problem using state-of-the-art (continuous) numerical optimal control. However, most of these algorithms rely on differentiability of the system dynamics with respect to control actions, which is not the case for discrete actions. To overcome this issue, Mixed Integer Programming has been used [6], which solves an optimization problem involving both continuous and discrete optimization variables. With increasing computational power, Mixed Integer approaches allow for online or Model Predictive Control (MPC) implementations [6], often relying on locally linear-quadratic approximations. However, solving larger-scale Mixed Integer problems, especially in the presence of constraints, non-differentiable costs or nonlinearities, can still be very challenging due to the combinatorial complexity.
3 Preliminaries
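We consider reinforcement learning with an agent operating in a Markov Decision Process (MDP) consisting of the state space $\mathcal{S}$, a hybrid continuous and discrete action space $\mathcal{A}$, an initial state distribution $p(s_0)$, transition probabilities $p(s_{t+1}|s_t, a_t)$ which specify the probability of transitioning from state $s_t$ to $s_{t+1}$ under action $a_t$, a reward function $r(s, a)$ and the discount factor $\gamma$. The actions are drawn from a policy $\pi(a|s)$ which is a state-conditioned probability distribution over actions. The objective is to optimize $J(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$, where the expectation is taken with respect to the trajectory distribution induced by $\pi$. We also define the action-value function for policy $\pi$ as the expected discounted return when choosing action $a$ in state $s$ and acting subsequently according to policy $\pi$ as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, s_0 = s, a_0 = a\right]$. This function satisfies the recursive expression $Q^{\pi}(s_t, a_t) = \mathbb{E}_{p(s_{t+1}|s_t, a_t)}\left[r(s_t, a_t) + \gamma V^{\pi}(s_{t+1})\right]$ where $V^{\pi}(s) = \mathbb{E}_{\pi(a|s)}\left[Q^{\pi}(s, a)\right]$ is the value function of $\pi$.
4 Method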
In this section we introduce a class of policies for hybrid discrete-continuous decision making and the update rules required for optimizing such policies.
4.1 Hybrid Policies
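We consider hybrid policies of a simple class in this paper, which allows us to represent action spaces with both continuous and discrete dimensions, or purely continuous or purely discrete action spaces if desired. Formally, we define a hybrid policy $\pi_\theta(a|s)$ as a state-dependent distribution that jointly models discrete and continuous random variables. We assume independence between action dimensions, denoted by $a^i$, for simplicity, i.e.

$$\pi_\theta(a|s) = \pi^c_\theta(a^{\mathcal{C}}|s)\, \pi^d_\theta(a^{\mathcal{D}}|s) = \prod_{a^i \in a^{\mathcal{C}}} \pi^c_\theta(a^i|s) \prod_{a^i \in a^{\mathcal{D}}} \pi^d_\theta(a^i|s), \tag{1}$$

where $a^{\mathcal{C}}$ and $a^{\mathcal{D}}$ are the subsets of action dimensions with continuous values and discrete values respectively (with $\mathcal{C}$ and $\mathcal{D}$ representing continuous and discrete action spaces). We represent each component of the continuous policy as a normal distribution $\pi^c_\theta(a^i|s) = \mathcal{N}\left(\mu_{i,\theta}(s), \sigma^2_{i,\theta}(s)\right)$ and represent the discrete policy as a (per-dimension) categorical distribution over $K$ discrete choices parameterized by state-dependent probabilities $\alpha_\theta(s)$, i.e. $\pi^d_\theta(a^i|s) = \mathrm{Cat}\left(\alpha^i_\theta(s)\right)$ with $\sum_{k=1}^{K} \alpha^i_{k,\theta}(s) = 1$, where $\theta$ comprises the parameters of all policy components which we want to optimize. In particular, we will consider the case where $\mu_\theta$, $\sigma_\theta$ and $\alpha_\theta$ are all represented as outputs of a neural network. We refer to the appendix for additional details on sampling and computing log probabilities of this policy parameterization.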
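To make the parameterization concrete, the following is a minimal sketch of how a flat vector of network outputs could be turned into the parameters of such a hybrid policy. The output layout, the softplus transform for the standard deviations, the `min_std` floor and the function names are all assumptions made for illustration; the paper does not specify them.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps standard deviations positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softmax(logits):
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_policy_head(raw, n_continuous, choices_per_dim, min_std=1e-3):
    """Map raw network outputs to (means, stds, categorical probabilities).

    Assumed layout of `raw` (hypothetical, not from the paper):
    [means..., pre-activation stds..., logits of discrete dim 0, dim 1, ...]
    """
    expected = 2 * n_continuous + sum(choices_per_dim)
    if len(raw) != expected:
        raise ValueError(f"expected {expected} outputs, got {len(raw)}")
    means = list(raw[:n_continuous])
    stds = [softplus(v) + min_std for v in raw[n_continuous:2 * n_continuous]]
    probs = []
    offset = 2 * n_continuous
    for k in choices_per_dim:
        # One categorical distribution per discrete action dimension.
        probs.append(softmax(raw[offset:offset + k]))
        offset += k
    return means, stds, probs
```

Each entry of `probs` sums to one, matching the normalization constraint on the categorical probabilities, and purely continuous or purely discrete policies correspond to `choices_per_dim=[]` or `n_continuous=0`.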
4.2 Hybrid Policy Optimization
As pointed out in Section 2, a number of policy optimizers are in principle capable of optimizing hybrid policies. In this work we build on top of the MPO algorithm [2, 1]. MPO is a state-of-the-art policy optimization algorithm that allows us to use off-policy data and has been shown to be both data efficient and robust. To train a hybrid policy using MPO we rely on a learned approximation to the Q-function (we refer to the appendix for an exposition of how this can be learned from a replay buffer). Using this Q-function, MPO updates the policy in two steps.
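Step 1: first we construct a non-parametric improved policy $q$. This is done by maximizing $\bar{J}(s, q) = \mathbb{E}_{q(a|s)}\left[Q(s, a)\right]$ for states $s$ drawn from a replay buffer $\mathcal{R}$ while ensuring that the solution stays close to the current policy $\pi_k$, i.e. $\mathbb{E}_{s \sim \mathcal{R}}\left[\mathrm{KL}\left(q(a|s) \,\|\, \pi_k(a|s)\right)\right] < \epsilon$, where $\mathrm{KL}$ denotes the Kullback-Leibler divergence. This optimization has a closed-form solution given as

$$q(a|s) \propto \pi_k(a|s) \exp\left(\frac{Q(s, a)}{\eta}\right),$$

where $\eta$ is a temperature parameter that can be computed by minimizing a convex dual function [2].

Step 2: we use supervised learning to fit a new parametric policy to samples from $q$. Concretely, to obtain the parametric policy, we solve the following weighted maximum likelihood problem with a constraint on the change of the policy from one iteration to the next:

$$\theta_{k+1} = \arg\max_\theta \mathbb{E}_{s \sim \mathcal{R}}\left[\mathbb{E}_{a \sim q(a|s)}\left[\log \pi_\theta(a|s)\right]\right] \quad \text{s.t.} \quad \mathbb{E}_{s \sim \mathcal{R}}\left[\mathcal{T}\left(\pi_{\theta_k}(a|s), \pi_\theta(a|s)\right)\right] < \epsilon, \tag{2}$$

where $\mathcal{T}$ denotes a measure of the change between the old and the new policy.
To control the change of the continuous and discrete policies independently, we take an approach similar to [1] and decouple the optimization of discrete and continuous policies. This allows us to enforce separate constraints for each: the first constraint bounds the KL divergence between the old and new continuous policy by $\epsilon_c$ and the second one bounds the average KL across categorical distributions by $\epsilon_d$. To solve Equation (2), we first employ Lagrangian relaxation to make it amenable to gradient-based optimization and then perform a fixed number of gradient ascent steps (using Adam [15]); details can be found in the Appendix.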
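The two steps can be illustrated with a sample-based sketch. It assumes that, for each state, a handful of actions has been sampled from the current policy and scored by the learned Q-function; the weights of Step 1 then reduce to a softmax over Q-values divided by the temperature, and the objective of Step 2 becomes a weighted sum of log-probabilities. The function names are hypothetical, the dual follows the form given in the MPO paper [2] as we understand it, and the KL constraints and their Lagrangian treatment are omitted.

```python
import math


def estep_weights(q_values, eta):
    """Self-normalized weights w_j proportional to exp(Q(s, a_j) / eta).

    The actions a_j are assumed to be sampled from the current policy, so the
    policy density in the closed-form solution is absorbed by the sampling.
    """
    m = max(q_values)
    exps = [math.exp((q - m) / eta) for q in q_values]
    z = sum(exps)
    return [e / z for e in exps]


def temperature_dual(q_values_per_state, eta, epsilon):
    """Sample estimate of g(eta) = eta*eps + eta * E_s[log E_a[exp(Q/eta)]]."""
    total = 0.0
    for q_values in q_values_per_state:
        m = max(q_values)
        mean_exp = sum(math.exp((q - m) / eta) for q in q_values) / len(q_values)
        total += m / eta + math.log(mean_exp)
    return eta * epsilon + eta * total / len(q_values_per_state)


def weighted_log_likelihood(weights, log_probs):
    """Step 2 objective for one state: sum_j w_j * log pi_theta(a_j | s)."""
    return sum(w * lp for w, lp in zip(weights, log_probs))
```

Because the hybrid policy factorizes over action dimensions, each entry of `log_probs` is itself a sum of Gaussian and categorical log-probabilities, so the same weights train the continuous and discrete components jointly.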
5 Verification and Application of Hybrid Reinforcement Learning
With a scalable hybrid RL approach we are no longer bound to formulating purely discrete or continuous problems. This enables us to solve native hybrid control problems in their natural form, but we can also reformulate control problems (or create entirely new ones) to address various issues in control and RL. These novel formulations can lead to better overall performance and learning speed but also reduce the amount of expert knowledge required. We group our experiments in three categories: the first set is used to verify the algorithm, the second set contains examples of ‘native’ hybrid problems and finally we show novel hybrid formulations in continuous domains, adding ‘meta’ control actions.
5.1 Validating Hybrid MPO
While there are established benchmarks and proven algorithms for continuous (and discrete) reinforcement learning, there are not many for hybrid RL. We run validation experiments on continuous domains where we partially or fully discretize the action space. This allows us to test our approach on well-established benchmarks and compare it to state-of-the-art continuous control and hybrid RL algorithms. We consider the DeepMind Control Suite [25], which consists of a broad range of continuous control challenges, as well as a real-world manipulation setup with a robotic arm.
5.1.1 Comparison of Continuous, Discrete and Hybrid Control on Control Suite
For our comparison to a continuous baseline, we run Hybrid MPO in two settings, first fully discrete and second hybrid (partially discrete, partially continuous). In the hybrid setting we discretize the last two action dimensions while in the fully discrete one we discretize all action dimensions. In both the hybrid and the discrete approach, we discretize the continuous actions coarsely, allowing only three distinct values per action dimension. We use regular MPO as our continuous benchmark. Results show no noticeable performance difference between the fully continuous baseline and either the fully discrete or the hybrid approach. Surprisingly, despite the coarse control, both learning speed and final performance are practically identical (see Figure 5 in the Appendix for learning curves). Computational complexity is also similar. Hence, we conclude that the approach is comparable in data efficiency to standard MPO and that our implementation is sound from an optimization perspective.
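The discretization used in these settings can be sketched as a lookup from a categorical index to a value on a regular grid. The sketch below assumes action limits of -1 and 1 and a grid that includes both limits; these values and the function names are illustrative assumptions rather than details taken from the experiments.

```python
def action_grid(low, high, num_values):
    """Regularly spaced action values from `low` to `high`, both included."""
    if num_values < 2:
        raise ValueError("need at least two values to span the action range")
    step = (high - low) / (num_values - 1)
    return [low + i * step for i in range(num_values)]


def decode_hybrid_action(continuous_part, discrete_indices, grids):
    """Build the environment action from continuous values and grid indices.

    `grids[i]` is the grid of the i-th discretized dimension; the discretized
    dimensions are appended after the continuous ones (e.g. the last two
    action dimensions in the hybrid setting).
    """
    decoded = [grid[index] for grid, index in zip(grids, discrete_indices)]
    return list(continuous_part) + decoded


# Hybrid setting: two continuous dimensions, two coarsely discretized ones.
coarse = action_grid(-1.0, 1.0, 3)
env_action = decode_hybrid_action([0.3, -0.2], [0, 2], [coarse, coarse])
```

The fully discrete setting corresponds to an empty `continuous_part` and one grid per action dimension.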
5.1.2 Comparison to the ‘argmax-trick’
In our second test, we compare Hybrid MPO against the ‘argmax-trick’ proposed in the PAMDP literature [13, 9], which represents the state of the art in this field. The ‘argmax-trick’ models discrete actions with continuous weights for each (action) option and subsequently applies the option with the highest weight (or samples from the softmax distribution). While developed for PAMDPs, it can be directly applied to non-hierarchical hybrid problems as well. Hence, we believe it is insightful to use a similar approach as a baseline to understand how well a purely continuous RL approach can scale to hybrid problems. For a fair comparison, and to eliminate any influence of the specific RL algorithm, we apply the ‘argmax-trick’ structure to MPO (instead of using DDPG as in [13] or PPO as in [9]). Effectively, we use fully continuous MPO with an argmax operation on continuous weights for each discrete action dimension. We refer to it as ‘argmax-MPO’. In all experiments we fully discretize all action dimensions independently at regular intervals between the upper and lower action limits (as continuous dimensions are handled the same in all implementations, we discretize the full action space). We vary the resolution between 3 and 61 control values per dimension, exploring different trade-offs between problem size and the granularity of control.
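For readers unfamiliar with the baseline, the following sketch shows the selection step of the ‘argmax-trick’: a continuous policy outputs one weight per option of every discrete dimension, and the environment-side wrapper picks the option with the largest weight or samples from a softmax over the weights. The flat weight layout, the temperature parameter and the function names are assumptions for illustration.

```python
import math
import random


def argmax_index(weights):
    # Index of the option with the highest continuous weight.
    return max(range(len(weights)), key=lambda i: weights[i])


def softmax_sample(weights, rng, temperature=1.0):
    # Alternative to argmax: sample an option from the softmax distribution.
    m = max(weights)
    exps = [math.exp((w - m) / temperature) for w in weights]
    return rng.choices(range(len(weights)), weights=exps, k=1)[0]


def select_discrete_actions(flat_weights, choices_per_dim):
    """Apply the argmax per discrete dimension to a flat weight vector."""
    if len(flat_weights) != sum(choices_per_dim):
        raise ValueError("weight vector does not match the option counts")
    actions, offset = [], 0
    for k in choices_per_dim:
        actions.append(argmax_index(flat_weights[offset:offset + k]))
        offset += k
    return actions
```

With this construction the continuous policy has as many output dimensions as there are options in total, so a resolution of 61 values per action dimension yields 61 continuous outputs for each of them.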
We observe that across all tasks, Hybrid MPO learns much faster. A subset of the results is provided in Figure 1. Even when comparing fine-resolution Hybrid MPO experiments with coarse argmax-MPO, Hybrid MPO still trains faster despite the increase in parameters. There is little difference between the two approaches in final performance; however, it can take argmax-MPO significantly longer to reach it. The learning speed of argmax-MPO scales quite poorly with task complexity and action resolution. These results are in line with the experiments in previous work [13, 5] where the ‘argmax-trick’ was applied to low-dimensional problems with few discrete options. In comparison, Hybrid MPO is almost unaffected by the action resolution and no noticeable difference between discretizations with 3 and 61 values is visible. In tasks where finer control is required, such as the Humanoid task in Figure 1, Hybrid MPO actually performs better with finer action resolution. Another big difference between Hybrid MPO and argmax-MPO is the scaling of computational complexity. Hybrid MPO sees less than a 15% increase in computation time for the learning steps over the action discretizations, even for tasks with large action dimension such as the humanoid. This also holds true for argmax-MPO in low-dimensional tasks. However, in higher-dimensional setups such as the humanoid, computation time (in our implementation) more than quadruples between discretizations with 3 and 61 values. On an absolute scale, Hybrid MPO outperforms argmax-MPO across all tasks and action resolutions, emphasizing the benefit of a natively hybrid implementation.
5.1.3 Sawyer Reach-Grasp-Lift with Discrete Gripper Control
In order to validate whether the results above can be reproduced on hardware, we apply Hybrid MPO to a robotic manipulation task. We use a Rethink Robotics Sawyer robot arm (for details on the setup, see Appendix C.1). The goal is to reach, grasp and lift a cube, as shown in Figure 2. The reward is the sum of the rewards for these three subtasks, where the reach reward is given for minimizing the distance between the gripper and the cube, the grasp reward is a sparse, binary reward for triggering the built-in grasp sensor and the lift reward is a shaped reward for increasing the height of the object above the table. We run the baseline MPO algorithm with the arm and the gripper controlled in continuous velocity mode, while for Hybrid MPO we discretize the gripper velocity to two values, i.e. open or close at full speed. The results in Figure 4 (left) show that the Hybrid MPO approach significantly outperforms the baseline, which is unable to solve the task. (We will revisit this setup in Section 5.3.2; hence the figure contains learning curves not discussed yet.) The reason lies in exploration: in order to reach the cube, the agent needs to open the gripper. However, to grasp the block the gripper needs to be closed again. The two subtasks require opposite behaviour and, since the grasp reward is sparse and the gripper is slow, this poses a challenging exploration problem for learning to grasp. Initially, the Gaussian policy will have most of its probability mass concentrated on small action values and will thus struggle to move the gripper’s fingers enough to see any grasp reward, explaining the plateau in the learning curve. The Hybrid MPO approach, on the other hand, always operates the gripper at full velocity and hence exploration is improved, allowing the robot to solve the task completely. While the improved exploration is a side effect of discretization, it underlines the shortcomings of Gaussian exploration.
5.2 Optimal Mode Selection for Peg-in-Hole
Many control problems are actually hybrid but are approximated as purely continuous or purely discrete. Examples are systems that combine continuous actions with a mode selection mechanism or discrete events. These choices are often excluded from the problem and the discrete choice/action is either fixed or selected based on heuristics. A common example in classical formulations is the choice of control mode or action space. Usually an expert chooses the mode that seems most suitable for the task but this choice is rarely verified. In the following example (which can also be interpreted as a PAMDP), we show that Hybrid MPO allows for exposing multiple “modes” to the agent, allowing it to select the mode and a continuous action for it based on the current observation. As a test setup, we create a peg-in-hole setup where the robot has to perform precision insertion. In our example, we provide the agent with two (emulated) control modes that could resemble a gearbox with two gears: a coarse Cartesian velocity controller with limits of 0.07 m/s and a fine one with limits of 0.01 m/s. We further impose wrist force limits, which result in episode termination when exceeded, to protect the setup and encourage gentle insertion. The reward is shaped and computed based on the known hole position and forward kinematics. We train a hybrid agent that can switch between both modes (“virtual gears”) as well as two continuous agents, one for each fixed mode.
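The “virtual gears” can be pictured as a small wrapper that scales a normalized continuous action by the velocity limit of the selected mode. The sketch below uses the two limits stated above; the normalization of the continuous action to [-1, 1], the mode indexing and the function name are assumptions for illustration.

```python
# Velocity limits in m/s from the text: index 0 = coarse mode, 1 = fine mode.
# The index assignment is an assumption.
MODE_SPEED_LIMITS = (0.07, 0.01)


def cartesian_velocity_command(mode, normalized_action):
    """Scale a normalized Cartesian action by the limit of the chosen mode.

    `mode` is the discrete part of the hybrid action, `normalized_action`
    the continuous part, assumed to lie in [-1, 1] per axis (clipped here).
    """
    limit = MODE_SPEED_LIMITS[mode]
    return [limit * max(-1.0, min(1.0, a)) for a in normalized_action]


# The same continuous action yields a fast or a slow motion depending on mode.
fast = cartesian_velocity_command(0, [1.0, -1.0, 0.5])
slow = cartesian_velocity_command(1, [1.0, -1.0, 0.5])
```

A fixed-mode continuous agent corresponds to calling this function with a constant `mode`.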
Our results show that the hybrid approach achieves an average final reward of approximately 2750 compared to 2500 for the fine control agent and around 1750 for the coarse mode agent. These results show that for an expert designer, selecting an appropriate mode beforehand can be difficult: the slow agent does not trigger the force limits but is much slower at inserting, leading to a lower average reward. The fast agent terminates frequently and while it might achieve good reward in some episodes, it cannot match the consistent performance of the hybrid agent, even after four times as much training time. While a switching heuristic could help alleviate the problem, designing an optimal heuristic is difficult and possibly a lengthy, iterative process. In order to understand the mode switching behavior of the hybrid agent, we run 100 test evaluations. The left plot in Figure 3 shows the observed mean and standard deviation of the mode chosen by the Hybrid MPO approach counted over the 100 evaluation runs. The right plot in Figure 3 shows the agent’s mode selection for two sample episodes, representative of all other episodes. The agent uses the coarse control mode to approach the hole quickly and consistently uses the fine action scale to perform the precise insertion. Some episodes show a “wiggling” behavior, where the hybrid policy switches between modes near the peg, while others show a single, discrete switch. While the experiment shows that an a priori choice of the control mode can harm final performance and/or learning speed, one can even imagine tasks where a wrong mode selection would lead to complete failure of an experiment. There will also be tasks where the added complexity of choosing the mode negatively affects learning speed and using a single mode is optimal. But even in such cases, a hybrid approach is beneficial since it only requires a single experiment, whereas a fixed choice would still need to be verified by an ablation.
5.3 Adding Meta Control to Continuous Domains
The combination of discrete and continuous actions in Hybrid MPO opens up new ways of formulating control problems, even for fully continuous systems. Hybrid MPO can be used to add ‘meta actions’, i.e. actions that modify the (continuous) actions or change the overall system behavior. In the following experiments, we demonstrate how such additional actions can e.g. improve exploration, solve event triggered control or reduce mechanical wear at training time.
5.3.1 Furuta Pendulum
The first set of experiments using ‘meta control actions’ is conducted on a ‘Furuta Pendulum’ (shown in Figure 2), which is the rotational implementation of a cart pole. We define a sparse reward task that poses a hard exploration challenge: the reward is one when the pendulum is within a range of [-5, +5] degrees around the upright position and the main motor is in a range of [-15, +15] degrees around the backward-pointing position. Otherwise the reward is zero. Before each episode, the motor is reset to the front-facing position. Hence, in order to experience reward, first a large motion is required to move the main motor to the back and subsequent fast motions are required to swing the pendulum up. As a result, exploration without time correlation or with fixed time correlation will be quite poor. However, most RL approaches rely on such exploration, and the baseline MPO agent accordingly struggles to solve the task. To improve exploration, we use Hybrid MPO and add a discrete ‘meta’ action “act-or-repeat” to the problem, where the agent can choose to use the newly sampled continuous action or repeat the previous one (which is provided as an observation such that the problem remains an MDP). This problem can be interpreted as a PAMDP where one discrete action has no continuous parameters. We bias the initial policy to choose “repeat” with a probability of 95% to encourage exploration. Hence, there is stochastic action repeat leading to exploration at different frequencies. As the results in Figure 4 (right) show, the Hybrid MPO agent solves the task much more quickly, despite having to learn not to repeat actions when balancing the pendulum. In Appendix D.3, we also provide a comparison on a sparse balancing task in front as well as a fully shaped task. These show that even in simple problems the additional “act-or-repeat” action does not harm (but also does not improve) learning speed or final performance.
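The “act-or-repeat” meta action can be sketched as a thin wrapper between agent and environment. The class below is a hypothetical illustration: it applies either the new continuous action or the previously applied one, and appends the previous action to the observation. The helper at the end shows one way to bias a two-way categorical towards “repeat” with 95% probability; how the bias is realized in the actual agent is not stated in the text.

```python
import math


class ActOrRepeat:
    """Environment-side handling of the discrete 'act-or-repeat' meta action."""

    def __init__(self, action_dim):
        # Assumption: the 'previous action' is zero at the start of an episode.
        self.previous_action = [0.0] * action_dim

    def reset(self):
        self.previous_action = [0.0] * len(self.previous_action)

    def apply(self, new_action, repeat):
        """Return the action to execute; `repeat` is 1 (repeat) or 0 (act)."""
        if not repeat:
            self.previous_action = list(new_action)
        return list(self.previous_action)

    def augment_observation(self, observation):
        # Exposing the previous action keeps the problem an MDP.
        return list(observation) + list(self.previous_action)


def initial_repeat_logits(repeat_probability=0.95):
    """Logits [act, repeat] whose softmax gives the desired repeat probability."""
    return [0.0, math.log(repeat_probability / (1.0 - repeat_probability))]
```

During training the agent can lower the repeat probability wherever fast reactions are needed, for instance when balancing the pendulum.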
5.3.2 Sawyer Reach-Grasp-Lift with Agent-Controlled Action Repeat
In this experiment, we test the improved exploration strategy on the robot arm block lifting task, described in Subsection 5.1.3. While discretizing the gripper action helped with exploration, it is masking the underlying problem, which is a complex combination of small issues: in order to reach the cube without pushing it away, the agent learns to keep the gripper fingers open. However, the subsequent grasp is a sparse signal that requires closing the fingers by a fair amount. The gripper fingers are slow and thus they effectively act as a low pass filter that filters out most of the policy’s (zero mean initialized) Gaussian exploration. Hence, with higher control frequency the issue becomes more severe, creating an unfavorable relation between control rate and exploration.
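The filtering effect can be reproduced in a toy simulation. The sketch below treats the finger position as the integral of zero-mean Gaussian velocity commands, a strong simplification of the real gripper, and compares the average finger displacement with and without stochastic action repeat. All parameter values and the function name are illustrative assumptions, not measurements from the robot.

```python
import random


def mean_finger_travel(repeat_probability, steps=200, trials=300,
                       sigma=1.0, dt=0.01, seed=0):
    """Average absolute finger displacement under Gaussian exploration.

    At every control step a fresh zero-mean Gaussian velocity command is
    drawn unless the previous command is repeated, which happens with
    probability `repeat_probability`. The position integrates the command.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        position = 0.0
        command = rng.gauss(0.0, sigma)
        for _ in range(steps):
            if rng.random() >= repeat_probability:
                command = rng.gauss(0.0, sigma)
            position += command * dt
        total += abs(position)
    return total / trials


independent = mean_finger_travel(repeat_probability=0.0)
repeated = mean_finger_travel(repeat_probability=0.95)
```

In this toy model, commands that are resampled at every step largely cancel out, whereas repeated commands accumulate, so `repeated` comes out several times larger than `independent`. Shortening `dt` while keeping the episode duration fixed shrinks the displacement of the independent case further, mirroring the unfavorable relation between control rate and exploration described above.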
By adding the same “act-or-repeat” action as in Subsection 5.3.1, we can once again significantly improve exploration, as shown in Figure 4 (left). The Hybrid MPO agent with action repeat does not plateau after learning to reach since it experiences occasional reward from grasping and lifting early on in training. When comparing these results to the baseline, we can see that exploring at a single rate can lead to poor exploration, covering only a limited part of the state space, and effectively hinder learning. When using an additional action to decide whether to repeat the previous (continuous) action or use the newly sampled action, we effectively explore at different rates. As in the Furuta Pendulum experiments, the additional action dimension does not seem to affect learning speed and final performance. We also run an experiment where we combine discrete gripper actions with our “act-or-repeat” action. Unsurprisingly, there is not much additional gain in learning speed or performance. However, there is a second side effect of using action repeats: it leads to smoother exploration and fewer (mechanical) load direction changes. When dealing with hardware this significantly reduces wear and tear. One could also imagine encouraging action repeats as part of the reward function. This could be directly applied to Event Triggered Control [5]. The experiments also show how exploration and control rate are often unnecessarily intertwined in RL. Larger step sizes lead to a different exploration behavior. However, exploration and control rate should not be coupled, and Hybrid MPO (with action repeat) allows us to disentangle the two.
5.3.3 Action Repeat on Control Suite Tasks
We extend the experiments for agent-controlled action repeats to the Control Suite domain. As the Furuta Pendulum results might already hint at, learning curves show no noticeable gain or performance drop from action repeats (see Figure 5 in the Appendix). We assume that the improved exploration does not pay off in these tasks since rewards are strongly shaped, in contrast to the sparse rewards in the previous experiments. Given that action repeats do not seem to hurt performance, there is little reason not to make use of them.
5.3.4 Action Attention on Control Suite Tasks
In the last application, we try to push the limits of Hybrid MPO. We test the agent in a setting that could be best described as “action-attention”: from all available action dimensions, the agent can only affect one at a time while zero action is applied to all others. Hence, independent of the number of actuators ($N$) in the system, the agent has two actions: one discrete action, which selects one of the $N$ actuators, and one bounded continuous action, which is mapped to the action range of the selected actuator. This problem is also consistent with the PAMDP formulation. We once again take a look at Control Suite tasks. Compared to the previous experiments where the agent could control all degrees of freedom simultaneously, “action-attention” poses a severe limitation in control authority, especially with increasing degrees of freedom. Not unexpectedly, there is a loss in performance, especially in high-dimensional domains. Yet, the agent copes with the loss of control authority by learning alternative strategies, which can be best seen in the supplemental video. One notable example is the “swimmer” task where the agent’s action attention travels in wave form through the body, actuating one link after the other. A second example is the “walker”, where the agent resorts to “walking on its knees”, hence focusing its limited attention capacity on a small number of degrees of freedom rather than spreading it across all.
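The “action-attention” interface amounts to a simple mapping from the two-part hybrid action to a full actuator command. In the sketch below the continuous action is assumed to be normalized to [-1, 1] and the actuator index to be zero-based; both conventions and the function name are illustrative assumptions.

```python
def action_attention(actuator_index, value, lows, highs):
    """Expand (index, value) into a full action vector.

    Only the selected actuator receives a command; all others get zero.
    `value` is assumed to lie in [-1, 1] and is mapped affinely onto the
    range [lows[i], highs[i]] of the selected actuator.
    """
    n = len(lows)
    if not 0 <= actuator_index < n:
        raise IndexError("actuator index out of range")
    clipped = max(-1.0, min(1.0, value))
    low, high = lows[actuator_index], highs[actuator_index]
    action = [0.0] * n
    action[actuator_index] = low + 0.5 * (clipped + 1.0) * (high - low)
    return action


# Three actuators with different ranges; only the second one is driven.
command = action_attention(1, 0.5, [-1.0, -2.0, -1.0], [1.0, 2.0, 1.0])
```

The size of the agent's action space is thus two regardless of the number of actuators, while the number of discrete choices grows with it.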
6 Conclusion
In this work, we have introduced a native hybrid reinforcement learning approach that can solve problems with mixed continuous and discrete action spaces. It can be applied both to hierarchical (PAMDPs) and non-hierarchical hybrid control problems. Our algorithm outperforms continuous-policy-based hybrid algorithms in general hybrid problems. Our experiments and application examples show the potential of Hybrid MPO to treat hybrid problems in their native form, improving performance and auxiliary criteria (such as controlling at variable rate) and removing expert priors. We further show that hybrid RL can be used to rethink the way we set up our control problems. By adding discrete ‘meta actions’ to otherwise continuous problems, we can partially address common reinforcement learning pitfalls such as exploration or mechanical wear during training. While we provide a diverse set of examples, we believe there are many more applications of Hybrid RL that should be explored in future work. Finally, a detailed evaluation specifically for PAMDPs using benchmark problems and algorithms from the PAMDP literature could be insightful.
The authors would like to thank the DeepMind robotics team for support for the robotics experiments.
References
[1] (2018) Relative entropy regularized policy iteration. arXiv preprint arXiv:1812.02256.
[2] (2018) Maximum a posteriori policy optimisation. CoRR abs/1806.06920.
[3] (1996) Evolutionary algorithms in theory and practice: evolution strategies, evolutionary programming, genetic algorithms. Oxford University Press.
[4] (2017) The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.
[5] (2018) Deep reinforcement learning for event-triggered control. In 2018 IEEE Conference on Decision and Control (CDC), pp. 943–950.
[6] (1999) Control of systems integrating logic, dynamics, and constraints. Automatica 35 (3), pp. 407–427.
[7] (2019) Multi-pass Q-networks for deep reinforcement learning with parameterised action spaces. arXiv preprint arXiv:1905.04388.
[8] (1998) A unified framework for hybrid control: model and optimal control theory. IEEE Transactions on Automatic Control 43 (1), pp. 31–45.
[9] (2019) Hybrid actor-critic reinforcement learning in parameterized action space. arXiv preprint arXiv:1903.01344.
[10] (2018) Soft actor-critic algorithms and applications. arXiv preprint arXiv:1812.05905.
[11] (2001) Completely derandomized self-adaptation in evolution strategies. Evolutionary Computation 9 (2), pp. 159–195.
[12] (2019) The termination critic. CoRR abs/1902.09996.
[13] (2015) Deep reinforcement learning in parameterized action space. arXiv preprint arXiv:1511.04143.
[14] (2015) Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems, pp. 2944–2952.
[15] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[16] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
[17] (1995) A game-theoretic approach to hybrid system design. In International Hybrid Systems Workshop, pp. 1–12.
[18] (2016) Reinforcement learning with parameterized actions. In Thirtieth AAAI Conference on Artificial Intelligence.
[19] (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
[20] (2016) Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems (NIPS).
[21] (2018) Learning dexterous in-hand manipulation. CoRR abs/1808.00177.
[22] (2017) Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
[23] (2017) Proximal policy optimization algorithms. CoRR abs/1707.06347.
[24] (2017) Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354.
[25] (2018) DeepMind control suite. arXiv preprint arXiv:1801.00690.
[26] (2014) Control-limited differential dynamic programming. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 1168–1175.
[27] (2019) Regularized hierarchical policies for compositional transfer in robotics. arXiv preprint arXiv:1906.11228.
[28] (2018) Parametrized deep Q-networks learning: reinforcement learning with discrete-continuous hybrid action space. arXiv preprint arXiv:1810.06394.
[29] (2019) DAC: the double actor-critic architecture for learning options. arXiv preprint arXiv:1904.12691.
Appendix
Appendix A General Algorithm
We use a policy iteration procedure similar to MPO [2, 1] to optimize the hybrid policies we are interested in. To update the policy, we first learn a Q-function; this step is known as policy evaluation. We then use the Q-function to update the policy; this step is known as policy improvement. To act in the environment, we sample directly from the hybrid policy and store the transitions, together with the log probabilities of the actions, in the replay buffer. Next, we explain how we sample from the hybrid policy and how we compute log probabilities.
A.1 Sampling and Log Probabilities
As shown in the main paper, we model the (behaviour) policy as a joint distribution of categorical and normal distributions (Eq. 1 in the main paper), and we use this policy to sample continuous and discrete actions at the same time. We sample from the normal and categorical distributions independently and concatenate the samples to obtain the desired action vector. Because Eq. 1 assumes independence, the log probability of the action vector is simply the sum of the log probabilities under the normal and categorical distributions. We store the log probabilities of the actions that were taken (i.e. the log probability under the behaviour policy at that point in time) in the replay buffer to learn the Q-function.
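A minimal sketch of this sampling scheme, assuming a diagonal Gaussian over the continuous dimensions and one categorical distribution per discrete dimension. The function names and the list-based parameterisation are illustrative and not taken from the paper.

```python
import math
import random


def hybrid_log_prob(continuous, discrete, means, stddevs, categorical_probs):
    """Log-probability of a hybrid action under independent components."""
    log_prob = 0.0
    for x, mu, sigma in zip(continuous, means, stddevs):
        # Log-density of a univariate normal distribution.
        log_prob += (-0.5 * ((x - mu) / sigma) ** 2
                     - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))
    for k, probs in zip(discrete, categorical_probs):
        # Log-probability of the chosen category.
        log_prob += math.log(probs[k])
    return log_prob


def sample_hybrid_action(means, stddevs, categorical_probs, rng=random):
    """Sample each component independently and return (action, log_prob)."""
    continuous = [rng.gauss(mu, sigma) for mu, sigma in zip(means, stddevs)]
    discrete = [rng.choices(range(len(p)), weights=p)[0]
                for p in categorical_probs]
    log_prob = hybrid_log_prob(continuous, discrete,
                               means, stddevs, categorical_probs)
    # The action vector is the concatenation of both parts.
    return continuous + discrete, log_prob
```

The returned log probability is the quantity that would be written to the replay buffer alongside the transition.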
A.2 Policy Evaluation
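As mentioned above, we need access to a Q-function to update the policy. While any method for policy evaluation can be used, we rely on the Retrace algorithm [20]. Concretely, we fit the Q-function $Q_\phi(s, a)$. Note that in this paper $a$ is sampled from the hybrid policy and is a vector with both discrete and continuous dimensions. $Q_\phi$ is represented by a neural network with parameters $\phi$, which are obtained by minimising the squared loss:
$$\min_\phi L(\phi) = \min_\phi \mathbb{E}_{\mu_b(s),\, b(a|s)}\Big[\big(Q_\phi(s_t, a_t) - Q_t^{\mathrm{ret}}\big)^2\Big], \quad \text{with} \tag{3}$$
$$Q_t^{\mathrm{ret}} = Q_{\phi'}(s_t, a_t) + \sum_{j=t}^{t+N-1} \gamma^{j-t} \Big(\prod_{k=t+1}^{j} c_k\Big) \Big[r(s_j, a_j) + \gamma\, \mathbb{E}_{\pi(a|s_{j+1})}\big[Q_{\phi'}(s_{j+1}, a)\big] - Q_{\phi'}(s_j, a_j)\Big], \quad c_k = \min\Big(1, \frac{\pi(a_k|s_k)}{b(a_k|s_k)}\Big),$$
where $Q_{\phi'}$ denotes the target network for the Q-function, with parameters $\phi'$, that we copy from the current parameters $\phi$ after a fixed number of steps. $N$ denotes the number of steps for reward accumulation before we bootstrap with $Q_{\phi'}$. Additionally, $b(a|s)$ denotes the probabilities of an arbitrary behaviour policy and $\mu_b(s)$ the corresponding state distribution. In our case we use an experience replay buffer, and hence $b(a|s)$ is given by the action probabilities stored in the buffer, which correspond to the action probabilities at the time of action selection.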
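The target computation can be sketched as follows, assuming the standard Retrace return of [20] truncated after a fixed number of steps. All argument names are hypothetical; the per-step inputs (target-network values, expected next-state values and stored log probabilities) are assumed to be precomputed for one trajectory segment.

```python
import math


def retrace_target(rewards, q_taken, v_next, log_pi, log_b, gamma):
    """Truncated Retrace target for the first step of a trajectory segment.

    rewards[j]  : reward r(s_j, a_j)
    q_taken[j]  : target-network value of the action actually taken at step j
    v_next[j]   : expected target-network value at s_{j+1} under the current policy
    log_pi[j]   : log-probability of a_j under the current policy
    log_b[j]    : stored log-probability of a_j under the behaviour policy
    """
    target = q_taken[0]
    trace = 1.0  # running product of truncated importance weights c_k
    for j in range(len(rewards)):
        if j > 0:
            # c_j = min(1, pi(a_j|s_j) / b(a_j|s_j)); the product starts at k = t + 1.
            trace *= min(1.0, math.exp(log_pi[j] - log_b[j]))
        td_error = rewards[j] + gamma * v_next[j] - q_taken[j]
        target += (gamma ** j) * trace * td_error
    return target
```

Truncating the importance weights at one keeps the variance of the correction bounded, at the cost of cutting the trace short when the stored behaviour probabilities differ strongly from the current policy.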
A.3 Maximum a Posteriori Policy Optimization for Hybrid Policies
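Given the Q-function $Q_\phi(s, a)$ and the current policy $\pi(a|s, \theta_k)$, in each policy improvement step MPO performs an EM-style procedure for policy optimization. In the E-step, a sample-based optimal policy $q(a|s)$ is found by maximizing:
$$\max_q \int \mu(s) \int q(a|s)\, Q_\phi(s, a)\, da\, ds \quad \text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(q(a|s)\,\|\,\pi(a|s, \theta_k)\big)\, ds < \epsilon \tag{4}$$
where $\mu(s)$ is the state distribution. In our case, the state distribution is represented by state samples in the experience replay buffer. We can solve the above problem and obtain the sample-based distribution $q(a|s)$ in closed form,
$$q(a|s) \propto \pi(a|s, \theta_k) \exp\Big(\frac{Q_\phi(s, a)}{\eta^*}\Big) \tag{5}$$
where we can obtain $\eta^*$ by minimising the following convex dual function,
$$g(\eta) = \eta \epsilon + \eta \int \mu(s) \log \int \pi(a|s, \theta_k) \exp\Big(\frac{Q_\phi(s, a)}{\eta}\Big)\, da\, ds; \tag{6}$$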
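see the MPO paper [2] for more details on this optimization procedure. Note that this step only needs samples from the current policy and does not depend on the parametric form of the policy. Afterwards, the parametric policy is fitted via weighted maximum likelihood learning (subject to staying close to the old policy), given by the objective:
$$\theta_{k+1} = \arg\max_\theta \int \mu(s) \int q(a|s) \log \pi(a|s, \theta)\, da\, ds \tag{7}$$
$$\text{s.t.} \quad \int \mu(s)\, \mathrm{KL}\big(\pi(a|s, \theta_k)\,\|\,\pi(a|s, \theta)\big)\, ds < \epsilon_\pi$$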
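When the actions are sampled from the current policy, the E-step described above reduces to a per-state softmax over Q-values, and the dual becomes an average over sampled states and actions. The sketch below assumes Q-values have already been evaluated for a fixed number of sampled actions per state. The ternary search and its bounds are illustrative choices and not the optimiser used in the paper.

```python
import math


def dual(eta, q_values, epsilon):
    """Sample-based dual g(eta); q_values[s][j] is Q for sampled action j in state s."""
    total = 0.0
    for q_s in q_values:
        q_max = max(q_s)  # subtract the maximum for numerical stability
        mean_exp = sum(math.exp((q - q_max) / eta) for q in q_s) / len(q_s)
        total += q_max / eta + math.log(mean_exp)
    return eta * epsilon + eta * total / len(q_values)


def minimise_dual(q_values, epsilon, lo=1e-3, hi=100.0, iters=200):
    """Ternary search for the temperature; valid because the dual is convex in eta."""
    for _ in range(iters):
        left = lo + (hi - lo) / 3.0
        right = hi - (hi - lo) / 3.0
        if dual(left, q_values, epsilon) < dual(right, q_values, epsilon):
            hi = right
        else:
            lo = left
    return 0.5 * (lo + hi)


def e_step_weights(q_s, eta):
    """Normalised weights proportional to exp(Q / eta) for one state's sampled actions."""
    q_max = max(q_s)
    unnormalised = [math.exp((q - q_max) / eta) for q in q_s]
    normaliser = sum(unnormalised)
    return [w / normaliser for w in unnormalised]
```

The resulting weights are what the weighted maximum likelihood step uses to score each sampled action; neither function depends on whether an action dimension is continuous or discrete.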
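As in this paper we assume a hybrid continuous-discrete policy $\pi(a|s, \theta) = \pi^c(a^C|s, \theta)\, \pi^d(a^D|s, \theta)$, the constraint in this objective can further be decoupled into continuous and discrete parts, which allows for independent control over the change of the continuous and discrete policies:
$$\theta_{k+1} = \arg\max_\theta \int \mu(s) \int q(a|s) \log \pi(a|s, \theta)\, da\, ds \tag{8}$$
$$\text{s.t.} \quad T_c < \epsilon_c, \quad T_d < \epsilon_d,$$
where
$$T_c = \int \mu(s)\, \mathrm{KL}\big(\pi^c(a^C|s, \theta_k)\,\|\,\pi^c(a^C|s, \theta)\big)\, ds$$
and
$$T_d = \int \mu(s)\, \frac{1}{K} \sum_{i=1}^{K} \mathrm{KL}\big(\pi^d(a^{D_i}|s, \theta_k)\,\|\,\pi^d(a^{D_i}|s, \theta)\big)\, ds.$$
The first constraint bounds the KL divergence of the continuous policy, which is a Gaussian distribution in this paper, over the continuous dimensions. The second constraint bounds the average KL divergence across all $K$ discrete dimensions, where each dimension is represented by a categorical distribution. To solve this constrained optimisation, we first write the generalised Lagrangian,
$$L(\theta, \alpha_c, \alpha_d) = \int \mu(s) \int q(a|s) \log \pi(a|s, \theta)\, da\, ds + \alpha_c (\epsilon_c - T_c) + \alpha_d (\epsilon_d - T_d),$$
where $\alpha_c$ and $\alpha_d$ are Lagrangian multipliers, and we solve the following primal problem,
$$\max_\theta \min_{\alpha_c > 0,\, \alpha_d > 0} L(\theta, \alpha_c, \alpha_d).$$
To solve for $\theta$, we iteratively solve the inner and outer optimisation programs independently: we fix the Lagrangian multipliers to their current values and optimise for $\theta$ (outer maximisation), and then fix the parameters $\theta$ to their current values and optimise for the Lagrangian multipliers (inner minimisation). We continue this procedure until the policy parameters and Lagrangian multipliers converge.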
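The alternating scheme can be illustrated on a deliberately small toy problem: fitting only the mean of a one-dimensional Gaussian with fixed standard deviation, under a single KL bound. This is a sketch under those assumptions, not the paper's implementation. In this toy case the outer maximisation has a closed form, whereas the paper's neural-network policy would require gradient steps, and the step size `dual_lr` is an arbitrary illustrative value.

```python
def fit_mean_with_kl_bound(actions, weights, mu_old, sigma, epsilon,
                           alpha=1.0, dual_lr=50.0, iters=500):
    """Weighted maximum likelihood for a Gaussian mean under a KL constraint.

    Maximises sum_i w_i * log N(a_i; mu, sigma^2) subject to
    KL(N(mu_old, sigma^2) || N(mu, sigma^2)) < epsilon.
    The weights are assumed to sum to one.
    """
    target = sum(w * a for w, a in zip(weights, actions))  # weighted sample mean
    mu = mu_old
    for _ in range(iters):
        # Outer maximisation: with alpha fixed, the Lagrangian is quadratic in mu.
        mu = (target + alpha * mu_old) / (1.0 + alpha)
        # KL divergence between two Gaussians with equal variance.
        kl = (mu - mu_old) ** 2 / (2.0 * sigma ** 2)
        # Inner minimisation: gradient step on the multiplier, kept non-negative.
        alpha = max(0.0, alpha - dual_lr * (epsilon - kl))
    return mu, alpha
```

With a tight bound the multiplier grows until the new mean sits on the constraint boundary; with a loose bound the multiplier falls to zero and the update becomes plain weighted maximum likelihood. A hybrid policy would carry one such multiplier for the continuous part and one for the discrete part.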
Appendix B Hyperparameters and Reward Functions
B.1 Hyperparameters
All experiments are run in a single-actor setup, and we use the same hyperparameters for both MPO and Hybrid MPO (where they exist for both). Table 1 shows the hyperparameters we used for the experiments.
Hyperparameters

Policy net  200-200-200 
Number of actions sampled per state  20 
Q-function net  500-500-500 
0.1  
0.0001 (0.01 on hardware)  
0.001  
Discount factor ($\gamma$)  0.99 
Adam learning rate  0.0003 
Replay buffer size  2000000 
Target network update period  250 
Batch size  3072 
Activation function  elu 
Layer norm on first layer  Yes 
Tanh on output of layer norm  Yes 
Tanh on Gaussian mean  No 
Min variance  Zero 
Max variance  unbounded 
Max transition use  500 
B.2 Reward Functions
B.2.1 Control Suite
All Control Suite tasks are run with their default reward functions.
B.2.2 Sawyer Cube Manipulation
The reward function for cube manipulation is defined as
(9) 
where the individual components are defined as
reach  (10)  
grasp  (11)  
lift  (12) 
where is the Euclidean distance between the gripper’s pinch position and the object position and is the height of the cube above the table and is the maximum height of the cube at 0.15 m.
B.2.3 Furuta Pendulum
The sparse reward function is defined as
(13) 
where is the main motor angle (measured from the front position) and is the pendulum angle (measured from the upright position).
The shaped reward is defined as
(14) 
where the individual terms are defined as
(15)  
(16)  
(17) 
where is the pendulum angle error and are motor and pendulum velocities.
Appendix C Hardware Setup
C.1 Sawyer Manipulation Setup
The Sawyer cube manipulation setup consists of a Sawyer robot arm developed by Rethink Robotics. It is equipped with a Robotiq 2F-85 gripper as well as a Robotiq FTS300 force-torque sensor at the wrist. In front of the robot, there is a basket with a base size of 20 × 20 cm and inclined walls. The robot is controlled in Cartesian velocity space with a maximum speed of 0.07 m/s. The control mode has four control inputs: three Cartesian linear velocities as well as a rotational velocity of the gripper around the vertical axis. Together with the gripper finger angles, the total size of the action space is five.
Inside the workspace of the robot is a 5 × 5 × 5 cm cube that is equipped with Augmented Reality markers on its surface. The markers are tracked by three cameras (two in front, one in the back). Using an extrinsic calibration of the cameras, the individual measurements are then fused into a single estimate, which is provided as input to the agent. For the peg-in-hole experiments, the insertion is measured based on the known fixed position of the hole. Hence, no additional observation is provided to the agent.
The agent is run at 20 Hz and receives proprioception observations from the arm (joint and end-effector positions and velocities), the gripper (joint positions, velocities and grasp signal) and the force-torque sensor (forces and torques). In the case of the cube experiments, the agent also receives the cube pose as an input.
During all experiments, episodes are 600 steps long but are terminated early when the force-torque measurements of the wrist exceed 15 N on any of the principal axes.
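The early-termination rule can be sketched as below. The `step_fn` callback is a hypothetical stand-in for one control step that returns the wrist force readings per axis, and comparing the absolute value of each component against the limit is an assumption about how the threshold is applied.

```python
def exceeds_force_limit(forces, limit=15.0):
    """Return True if the force magnitude on any principal axis exceeds the limit (in N)."""
    return any(abs(f) > limit for f in forces)


def run_episode(step_fn, max_steps=600, limit=15.0):
    """Run up to max_steps control steps, stopping early when the limit is exceeded.

    step_fn(t) is assumed to execute step t and return the wrist forces (fx, fy, fz).
    Returns the number of steps that were executed.
    """
    for t in range(max_steps):
        forces = step_fn(t)
        if exceeds_force_limit(forces, limit):
            return t + 1
    return max_steps
```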
C.2 Furuta Pendulum
The Furuta pendulum is driven by a single central brushless motor while the pendulum joint is passive and its position is measured with an encoder. The main motor is controlled in velocity mode and the agent controls the speed setpoint for the velocity controller.
The agent receives both the main motor and the pendulum’s positions and velocities as observations. All position measurements are mapped to sine/cosine space to ensure any wrap around is handled gracefully and input to the agent is bounded.
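The sine/cosine mapping can be sketched as follows; the function name and the flat list layout of the observation are illustrative.

```python
import math


def encode_angles(angles):
    """Map each angle (in radians) to a (sin, cos) pair, giving bounded, wrap-free inputs."""
    observation = []
    for theta in angles:
        observation.extend((math.sin(theta), math.cos(theta)))
    return observation
```

Angles that differ by a full revolution map to the same pair, and every component lies in [-1, 1].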
All trained agents run at 100 Hz and episodes are 1000 steps long, unless the pendulum velocity becomes too high, in which case the episode is terminated early to protect the pendulum joint.
Appendix D Additional Experimental Results
In this section, we provide further details on the experiments conducted, as well as the learning curves. The main results are discussed in the paper; the additional details are provided for the interested reader and for the sake of completeness and reproducibility.
D.1 Control suite comparison
As described in Section 5.1.1, we compare baseline MPO and hybrid MPO on partially and fully discretized versions of the Control Suite tasks. Figure 5 shows learning curves for a (representative) subset of the tested tasks. We did not see any significant differences in learning speed or final performance between the approaches. One reason that the same performance can be reached in a fully discrete setting is that the Control Suite tasks do not require extreme control precision, and mechanical (rigid body dynamics) systems serve as a form of low-pass filter, smoothing out discrete actions.
D.2 Action attention
In Section 5.3.4 we describe the mechanism of ‘action attention’ that we tested on the DeepMind Control Suite. Figure 6 shows the learning curves for three tasks. As previously mentioned, the action attention agent does not (and possibly cannot, given the loss of control authority) reach the same performance as the baseline agent. However, it still learns and finds a solution that generates reward, occasionally coming up with novel strategies such as reducing the number of actuators the agent actually makes use of or coordinating them in sequence.
D.3 Furuta Pendulum
In Section 5.3.1, we described the use of agent-controlled action repeats and how they can be used for challenging exploration tasks such as balancing the Furuta pendulum in the back using sparse rewards. Apart from this challenging task, we also tested two simpler tasks: balancing the pendulum in front using sparse rewards, and balancing the pendulum anywhere with a shaped reward. The goal of these experiments was to verify that using (and having to ‘unlearn’) action repeats causes no performance loss in simple tasks.
As Figure 7 shows, performance and learning speed are not affected by the action repeat: the hybrid and the baseline agent learn equally fast and reach the same performance. Hence, there seems to be no drawback in using action repeats by default, at least not in lower-dimensional systems.
D.4 Peg-in-Hole
Section 5.2 discussed the ‘peg-in-hole’ insertion tasks and provided basic performance figures. Figure 8 shows the learning curves for the same experiment. As one can observe, the coarse agent trains very slowly since it terminates episodes early due to exceeding the force-torque limits. The fine control agent trains fast, even faster than the hybrid one. However, it never reaches the same performance as the hybrid one. One might argue that an expert could come up with a mode-switching strategy. However, it is unclear when the mode switch should happen, e.g. is it distance based and, if so, at what distance should it switch? So even this simple example shows that expert heuristics can be tricky to derive.
To put the learning curves into perspective, we evaluate the final policy for 100 episodes of 200 steps and compare it to the fast (coarse) MPO baseline. During evaluation, we only apply the mean of the policy. We define success as having achieved full reward during at least one timestep in the episode. Table 2 summarizes the results. Hybrid MPO triggers the force-torque limits once and is thus unable to complete the task in that episode. Despite four times more training time, the fast baseline triggers the limits 8 times and additionally does not succeed in 3 episodes where the limits are not triggered.
Hybrid MPO  MPO (coarse)  

early episode terminations  1/100  8/100 
successful episodes (max reward)  99/100  89/100 