1 Introduction
Model-based learning methods possess the desirable trait of being "sample-efficient" [williams2017information, chua2018deep, abraham2020modelbasedgen] in solving robot learning tasks. That is, in contrast to experience-based methods (i.e., model-free methods that learn a direct state-action map from experience), model-based methods require significantly less data to learn a control response that achieves a desired task. However, a downside to model-based learning is that the models are often highly complex and require special structure to capture the necessary intricacies [nagabandi2018neural, havens2019learning, sharma2019dynamics, abraham2020modelbasedgen, AbrahamRSS17, abraham2019active]. Experience-based methods, on the other hand, approach the problem of learning skills by avoiding the need to model the environment (or dynamics) and instead learn a mapping (policy) that returns an action based on prior experience [schulman2017proximal, haarnoja2018soft]. Despite often outperforming model-based methods, experience-based approaches require a significant amount of data and diverse experience to work properly [chua2018deep]. How is it, then, that humans are capable of rapidly learning tasks with a limited amount of experience? And is there a way to enable robotic systems to achieve similar performance? Some recent work tries to address these questions by exploring "how" models of environments are structured, combining probabilistic models with deterministic components [chua2018deep]. Other work has explored latent-space representations to condense the complexities [havens2019learning, sharma2019dynamics]. Related methods use high-fidelity Gaussian processes to create models, but are limited by the amount of data that can be collected [deisenroth2011pilco]. Finally, some researchers try to improve experience-based methods by adding exploration as part of the objective [pathak2017curiosity].
However, these approaches often do not formally combine model-based planning with experience-based learning. Those that do combine the two tend to do so in stages [chebotar2017path, bansal2017mbmf, nagabandi2018neural]. First, a model is used to collect data for a task, jump-starting the data-collection process. Then, supervised learning is used to update a policy [levine2014GPS, chebotar2017path], or an experience-based method is used to continue learning from that stage [nagabandi2018neural]. While novel, this approach does not algorithmically combine the two robot learning approaches in an optimal manner. Moreover, the model is often used as an oracle that provides labels to a base policy. As a result, the model-based method is not improved, and the resulting policy is underutilized. Our approach is to algorithmically combine model-based and experience-based learning by using the learned model as a gauge for how well an experience-based policy will behave, and then optimally updating the resulting actions. Using hybrid control as the foundation for our approach, we derive a controller that optimally uses model-based actions when the policy is uncertain, and falls back on the experience-based policy when there exist high-confidence actions that will result in a favorable outcome. As a result, our approach does not rely on improving the model (though it can easily integrate high-fidelity models), but instead optimally combines the policies generated from model-based and experience-based methods to achieve high performance. Our contributions can be summarized as follows:
- A hybrid control theoretic approach to robot learning
- Deterministic and stochastic algorithmic variations
- A measure for determining the agreement between the learned model and policy
- Improved sample-efficiency and robot learning performance
- Diverse implementation using standard off-policy reinforcement learning [haarnoja2018soft] and behavior cloning [pomerleau1998autonomous]
The paper is structured as follows: Section 2 provides background on the problem statement and its formulation; Section 3 introduces our approach and derives both deterministic and stochastic variations of our algorithm; Section 4 provides simulated results and comparisons as well as experimental validation of our approach; and Section 5 concludes the work.
2 Background
Markov Decision Processes:
The robot learning problem is often formulated as a Markov decision process (MDP), which represents a set of accessible continuous states $s \in \mathcal{S}$ the robot can be in, continuous bounded actions $a \in \mathcal{A}$ that a robotic agent may take, rewards $r(s)$, and a transition probability $p(s_{t+1} \mid s_t, a_t)$, which governs the probability of transitioning from one state $s_t$ to the next $s_{t+1}$ given an action $a_t$ applied at time $t$ (we assume a deterministic transition model). The goal of the MDP formulation is to find a mapping from state to action that maximizes the total reward acquired from interacting in an environment for some fixed amount of time. This can also be written as the following objective

$$\max_{\pi} \ \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t)} \left[ \sum_{t=0}^{T-1} r(s_t) \right] \qquad (1)$$
where the solution is an optimal policy $\pi^\star$. In most common experience-based reinforcement learning problems, a stochastic policy $\pi(a \mid s)$ is learned such that it maximizes the reward at a state $s$. Model-based approaches solve the MDP problem by modeling the transition function and the reward function^{1}, and either use the model to construct a policy or directly generate actions through model-based planning [chua2018deep]. If the transition model and the reward function are known, the MDP formulation becomes an optimal control problem where one can use any set of existing methods [li2004iterative] to solve for the best set of actions (or policy) that maximizes the reward (optimal control problems are often specified using a cost instead of a reward; however, the analysis remains the same).

^{1}We exclude the dependency of the reward on the action for clarity, as one could always append the state vector with the action and obtain the dependency.

Hybrid Control Theory for Mode Scheduling:
In the problem of mode scheduling, the goal is to maximize a reward function through the synthesis of two (or more) control strategies (modes). This is achieved by optimizing when one policy should take control at each instant (which is developed from hybrid control theory); the switch between policies is often known as mode switching [axelsson2008gradient]. A common example is a vertical take-off and landing vehicle switching between flight mode and landing mode. Most problems are written in continuous time subject to continuous-time dynamics of the form
$$\dot{x}(t) = f(x(t), u(t)) = g(x(t)) + h(x(t)) u(t) \qquad (2)$$

where $f$ is the (possibly nonlinear) transition function divided into the free dynamics $g(x)$ and the control map $h(x)$, and $u(t)$ is assumed to be generated from some default behavior. The objective function (similar to Eq. (1) but often written in continuous time) is written as the deterministic function of the state and control:
$$\mathcal{J} = \int_{t_0}^{t_0 + t_H} r(x(t)) \, dt \qquad (3)$$

where the default control signal is

$$u(t) = u_{\text{def}}(t), \quad \forall t \in [t_0, t_0 + t_H] \qquad (4)$$

$t_H$ is the time horizon (in seconds), and $x(t)$ is generated from some initial condition $x(t_0)$ using the model (2) and action sequence (4). The goal in hybrid control theory is to find the optimal time $\tau$ and application duration $\lambda$ to switch from some default set of actions to another set of actions that best improves the performance in (3) subject to the dynamics (2). The following section derives the algorithmic combination of the MDP learning formulation with the hybrid control foundations into a joint hybrid learning approach.
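To make the mode-scheduling idea concrete, the following minimal sketch rolls out a default mode and measures the effect of inserting an alternate control for a duration at a chosen switching time. The scalar system, cost, and controls here are illustrative assumptions, not from the paper:

```python
def rollout_cost(x0, T, dt, default_u, alt_u=None, tau=None, lam=None):
    """Euler-integrate xdot = u and accumulate the running cost x^2.

    If alt_u is given, it replaces default_u on the window [tau, tau + lam),
    mimicking a single mode insertion of duration lam at time tau.
    """
    x, cost, t = x0, 0.0, 0.0
    for _ in range(int(T / dt)):
        if alt_u is not None and tau <= t < tau + lam:
            u = alt_u(x)          # inserted mode
        else:
            u = default_u(x)      # default mode
        cost += x * x * dt
        x += u * dt
        t += dt
    return cost

default = lambda x: 0.0           # default mode: no control
stabilize = lambda x: -5.0 * x    # alternate mode: drive x toward 0

J_default = rollout_cost(1.0, T=2.0, dt=0.01, default_u=default)
J_switched = rollout_cost(1.0, T=2.0, dt=0.01, default_u=default,
                          alt_u=stabilize, tau=0.5, lam=0.25)
```

Comparing `J_switched` against `J_default` is a finite-duration version of the sensitivity that the mode insertion gradient captures in the limit of an infinitesimally short insertion.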
3 Hybrid Learning
The goal of this section is to introduce hybrid learning as a method for optimally utilizing modelbased and experiencebased (policy) learning. We first start with the deterministic variation of the algorithm that provides theoretical proofs, which describe the foundations of our method. The stochastic variation is then derived as a method for relaxing the assumptions made in the deterministic variation. The main theme in both the deterministic and stochastic derivations is that the learning problem is solved indirectly. That is, we solve the (often harder) learning problem by instead solving subproblems whose solutions imply that the harder problem is solved.
3.1 Deterministic
Consider the continuous-time formulation of the objective and dynamics in (2) and (3) with the MDP formulation, where $f$ and $r$ are learned using arbitrary regression methods (e.g., neural network least squares, Gaussian processes), and the policy $\pi$ is learned through an experience-based approach (e.g., policy gradient [sutton2000policy]). In addition, let us assume that the default action in (4) is defined as the mean of the policy, where we ignore uncertainty for the time being.^{2} (^{2}We will add the uncertainty into the hybrid problem in the stochastic derivation of our approach for hybrid learning.) That is, $u_{\text{def}}(t) = \mu(x(t))$ is defined by assuming that the policy has the form $\pi(a \mid s) = \mathcal{N}(\mu(s), \Sigma(s))$, where $\mathcal{N}$ is a normal distribution and $\mu(s)$, $\Sigma(s)$ are the mean and variance of the policy as a function of state. As we do not have the form of the best action, let us first calculate how sensitive (3) is at any time $\tau$ to switching from $\mu$ to an arbitrary action $u^\star$ for an infinitely small duration $\lambda$.^{3} (^{3}We avoid the problem of instability of the robotic system from switching control strategies, as later we develop and use the best action for all $t$ instead of searching for a particular time when to switch.) Assume that $f$, $r$, and $\mu$ are differentiable and continuous in time. The sensitivity of (3) (also known as the mode insertion gradient [axelsson2008gradient]) with respect to the duration time $\lambda$ of switching from $\mu$ to $u^\star$ at any time $\tau$ is defined as

$$\frac{\partial \mathcal{J}}{\partial \lambda}\bigg|_{\tau} = \rho(\tau)^\top \left( f_2(\tau) - f_1(\tau) \right) \qquad (5)$$
where $f_1 = f(x(t), \mu(x(t)))$ and $f_2 = f(x(t), u^\star(t))$, and $\rho(t)$ is the adjoint variable, which is the solution to the differential equation

$$\dot{\rho} = -\frac{\partial r}{\partial x} - \frac{\partial f}{\partial x}^\top \rho \qquad (6)$$

with terminal condition $\rho(t_0 + t_H) = 0$. See Appendix 6 for the proof. Lemma 3 gives us the proof and definition of the mode insertion gradient (5), which tells us the infinitesimal change in the objective function when switching from the default policy behavior to some other arbitrarily defined control for a small time duration $\lambda$. We can directly use the mode insertion gradient to see how an arbitrary action changes the performance of the task relative to the policy that is being learned. However, in this work we use the mode insertion gradient as a method for obtaining the best action the robot can take given the learned predictive model of the dynamics and the task rewards. We can be more direct in our approach and ask the following question: given a suboptimal policy $\pi$, what is the best action $u^\star$ the robot can take to maximize (3), at any time $t$, subject to the uncertainty (or certainty) of the policy defined by $\Sigma(s)$? We approach this new subproblem by specifying the auxiliary optimization problem:
$$\max_{u^\star(t) \ \forall t} \ \frac{\partial \mathcal{J}}{\partial \lambda} + \log \pi(u^\star \mid x) \qquad (7)$$

where the idea is to maximize the mode insertion gradient (i.e., find the action that has the most impact in changing the objective) subject to the $\log \pi$ term, which ensures the generated action is penalized for deviating from the policy when there is high confidence that was based on prior experience. Assuming that $f$, $r$, and $\pi$ are continuous and differentiable in $x$ and $u$, the best possible action that improves the performance of (3) and is a solution to (7) for any time $t$ is
$$u^\star(t) = \Sigma(x(t)) \, h(x(t))^\top \rho(t) + \mu(x(t)) \qquad (8)$$

where $\rho(t)$ is defined in (6) and $h(x)$ is the affine mapping from actions to the dynamics. Inserting the definition of the mode insertion gradient (5) into (7), taking the derivative with respect to the point-wise $u^\star$, and setting it to zero gives

$$h^\top \rho - \Sigma^{-1} (u^\star - \mu) = 0$$

where we drop the dependency on time for clarity. Solving for $u^\star$ gives the best action

$$u^\star = \Sigma h^\top \rho + \mu$$

which is the action that maximizes the mode insertion gradient subject to the certainty of the policy for all $t$. The proof in Theorem 3.1 provides the best action that a robotic system can take given a default experience-based policy. Each action generated uses the sensitivity of changing the objective based on the predictive model's behavior, while relying on the experience-based policy to regulate when the model information will be useful. We convert the result in Theorem 3.1 into our first (deterministic) algorithm (see Alg. 1).
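As an illustration of the action synthesis in Theorem 3.1, a minimal sketch assuming the policy mean and covariance, the control map, and the adjoint have already been computed (array shapes and values here are illustrative, not from the paper):

```python
import numpy as np

def hybrid_action(mu, Sigma, h, rho):
    """Sketch of u* = Sigma h^T rho + mu.

    mu    : (m,)   policy mean at the current state
    Sigma : (m, m) policy covariance (the policy's confidence)
    h     : (n, m) control map df/du of the affine dynamics
    rho   : (n,)   adjoint (costate) computed from the learned model/reward
    """
    return Sigma @ h.T @ rho + mu

mu = np.array([0.5])
h = np.array([[1.0]])
rho = np.array([2.0])

# A certain policy (zero covariance) falls back on the policy mean;
# an uncertain one lets the model-based correction h^T rho act.
u_certain = hybrid_action(mu, np.zeros((1, 1)), h, rho)
u_uncertain = hybrid_action(mu, np.eye(1), h, rho)
```

The covariance acts as a gain on the model-based term: as $\Sigma \to 0$ the hybrid action reduces to the policy mean, matching the fallback behavior described in the text.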
The benefit of the proposed approach is that we are able to make (numerically based) statements about the generated action and the contribution of the learned predictive models towards improving the task. Furthermore, we can even make the claim that (8) provides the best possible action given the current belief of the dynamics $f$ and the task reward $r$. Under the assumptions of Theorem 3.1, the mode insertion gradient evaluated at (8) is non-negative and is zero when the policy satisfies the control Hamiltonian condition for (3). Inserting (8) into (5) yields

$$\frac{\partial \mathcal{J}}{\partial \lambda} = \rho^\top h \Sigma h^\top \rho \geq 0 \qquad (9)$$
since $\Sigma$ is positive definite. From Pontryagin's maximum principle, a solution that is a local optimum of the objective function satisfies

$$h^\top \rho = 0 \qquad (10)$$

when the policy is optimal. Therefore, if the policy is a solution, then it must be that $h^\top \rho = 0$, (9) vanishes, and (8) returns a solution to the optimal control problem (3). Corollary 2 tells us that the action defined in (8) generates the best action that will improve the performance of the robot given valid learned models. In addition, Corollary 2 also states that if the policy is already a solution, then our approach for hybrid learning does not impede on the solution and returns the policy's action. Taking note of each proof, we can see that there is the strict requirement of continuity and differentiability of the learned models and policy. As this is not always possible, and learned models often have noisy derivatives, our goal is to reformulate (3) into an equivalent problem that can be solved without these assumptions. One way is to formulate the problem in discrete time (as an expectation), which we do in the following section.
3.2 Stochastic
We relax the continuity, differentiability, and continuous-time restrictions specified in (3) by first restating the objective as an expectation:

$$\mathcal{J} = \mathbb{E}_{v \sim \pi} \left[ \sum_{t=0}^{T-1} r(x_t) \right] \qquad (11)$$

where $x_{t+1} = f(x_t, v_t)$ subject to the initial condition $x_0$, and $v = \{v_t\}_{t=0}^{T-1}$ is a sequence of randomly generated actions from the policy $\pi$. Rather than trying to find the best time $\tau$ and discrete duration $\lambda$, we approach the problem from a hybrid information-theoretic view and instead find the best augmented actions to $v$ that improve the objective. This is accomplished by defining two distributions $\mathbb{P}$ and $\mathbb{Q}$: the uncontrolled system response distribution^{4} (^{4}We refer to "uncontrolled" as the unaugmented control response of the robotic agent subject to a stochastic policy.) and the open-loop control distribution (the augmented action distribution), described by the probability density functions $p(v)$ and $q(v)$, respectively. Here, $p$ and $q$ use the same variance as the policy. The uncontrolled distribution represents the default predicted behavior of the robotic system under the learned policy $\pi$. Furthermore, the augmented open-loop control distribution is a way for us to define a probability of an augmented action and, more importantly, a free variable over which to optimize given the learned models. Following the work in [williams2017information], we use Jensen's inequality and importance sampling on the free energy [theodorou2012relative] definition of the control system using the open-loop augmented distribution $\mathbb{Q}$:

$$\lambda \log \mathbb{E}_{v \sim \mathbb{P}}\left[ e^{\mathcal{J}/\lambda} \right] = \lambda \log \mathbb{E}_{v \sim \mathbb{Q}}\left[ e^{\mathcal{J}/\lambda} \frac{p(v)}{q(v)} \right] \geq \mathbb{E}_{v \sim \mathbb{Q}}\left[ \mathcal{J} + \lambda \log \frac{p(v)}{q(v)} \right] \qquad (12)$$
where $\lambda$ here is what is known as the temperature parameter (and not the time duration as used prior). Note that in (12), if $q(v) \propto \exp(\frac{1}{\lambda}\mathcal{J}) p(v)$, then the inequality becomes a constant. Further reducing the free energy gives

$$\mathbb{E}_{v \sim \mathbb{Q}}\left[ \mathcal{J} \right] - \lambda \, D_{\text{KL}}\left( \mathbb{Q} \,\|\, \mathbb{P} \right)$$

which is the optimal control problem we desire to solve plus a bounding term that binds the augmented actions to the policy. In other words, the free-energy formulation can be used as an indirect approach to solve the hybrid optimal control problem by making the bound a constant. Specifically, we can indirectly make $q(v) \propto \exp(\frac{1}{\lambda}\mathcal{J}) p(v)$, which would reduce the free-energy bound to a constant. Using this knowledge, we can define an optimal distribution $\mathbb{Q}^\star$ through its density function

$$q^\star(v) = \frac{1}{\eta} \exp\left(\frac{1}{\lambda}\mathcal{J}\right) p(v), \qquad \eta = \int_\Omega \exp\left(\frac{1}{\lambda}\mathcal{J}\right) p(v) \, dv \qquad (13)$$

where $\Omega$ is the sample space.^{5} (^{5}The motivation is to use the optimal density function to gauge how well the policy performs.) Letting the ratio be defined as $w(v) = q^\star(v)/p(v)$ gives us the proportionality that we seek to make the free energy a constant. However, because we cannot directly sample from $\mathbb{Q}^\star$, and we want to generate a separate set of actions defined in $\mathbb{Q}$ that augments the policy given the learned models, our goal is to push $\mathbb{Q}$ towards $\mathbb{Q}^\star$. As done in [williams2017information, williams2016aggressive], this corresponds to the following optimization:
$$\mathbb{Q}^+ = \arg\min_{\mathbb{Q}} D_{\text{KL}}\left( \mathbb{Q}^\star \,\|\, \mathbb{Q} \right) \qquad (14)$$

which minimizes the Kullback-Leibler divergence between the optimal distribution $\mathbb{Q}^\star$ and the open-loop distribution $\mathbb{Q}$. In other words, we want to construct a separate distribution that augments the policy distribution (based on the optimal density function) such that the objective is improved. The recursive, sample-based solution to (14) is

$$\hat{u}_t \leftarrow \hat{u}_t + \sum_{k} \omega(v^k) \, \delta u_t^k, \qquad \omega(v^k) = \frac{\exp\left(\frac{1}{\lambda}\mathcal{J}(v^k)\right)}{\sum_j \exp\left(\frac{1}{\lambda}\mathcal{J}(v^j)\right)} \qquad (15)$$

where $k$ denotes the sample index and $\delta u_t^k = v_t^k - \hat{u}_t$. See Appendix 6 for the proof. The idea behind the stochastic variation is to generate samples from the stochastic policy and evaluate their utility based on the current belief of the dynamics and the reward function. Since samples directly depend on the likelihood of the policy, any actions that stray too far from the policy will be penalized depending on the confidence of the policy. The inverse holds when the policy has low confidence (high variance): the sample span increases and more model-based information is utilized. Note that we do not have to worry about continuity and differentiability conditions on the learned models and can utilize arbitrarily complex models with this algorithm. We outline the stochastic algorithm for hybrid learning in Alg. 2.
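A minimal sketch of this kind of sample-based, exponentially weighted update, assuming the predicted return of each policy sample has already been evaluated under the learned model (the max-shift and the `hybrid_update` name are implementation conveniences, not from the paper):

```python
import numpy as np

def hybrid_update(policy_mean, samples, returns, temperature=1.0):
    """Weight policy-sampled action sequences by exponentiated return.

    policy_mean : (T, m) nominal action sequence to be augmented
    samples     : (K, T, m) action sequences drawn from the stochastic policy
    returns     : (K,) predicted total reward of each sequence
    """
    w = np.exp((returns - returns.max()) / temperature)  # shift for stability
    w /= w.sum()                                         # normalized weights
    # weighted correction of the nominal sequence toward high-return samples
    return policy_mean + np.einsum('k,ktm->tm', w, samples - policy_mean)

nominal = np.zeros((2, 1))
samples = np.array([[[0.0], [0.0]],
                    [[1.0], [1.0]]])
returns = np.array([0.0, 10.0])

# at low temperature the update snaps to the best-performing sample
augmented = hybrid_update(nominal, samples, returns, temperature=1e-3)
```

The temperature controls how sharply the update concentrates on high-return samples; with equal returns, the update simply averages the policy's own samples.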
4 Results
In this section, we present two implementations of our approach for hybrid learning. The first uses experience-based methods through robot interaction with the environment. The goal is to show that our method can improve the overall performance and sample-efficiency by utilizing learned predictive models and experience-based policies generated from off-policy reinforcement learning. In addition, we aim to show that through our approach, both the model-based and experience-based methods are improved through the hybrid synthesis. The second implementation illustrates our algorithm with imitation learning, where expert demonstrations are used to generate the experience-based policy (through behavior cloning) and the learned predictive models adapt to the uncertainty in the policy. All implementation details are provided in Appendix 7.
Learning from Experience:
We evaluate our approach in the deterministic and stochastic settings using experience-based learning methods and compare against the standard in model-based and experience-based learning. In addition, we illustrate the ability to evaluate our method's performance of the learned models through the mode insertion gradient. Experimental results validate hybrid learning for real robot tasks. For each example, we use Soft Actor Critic (SAC) [haarnoja2018soft] as our experience-based method and a neural-network-based implementation of model-predictive path integral control for reinforcement learning [williams2017information] as a benchmark standard method. The parameters for SAC are held at their defaults across all experiments to remove any impact of hyperparameter tuning. Hybrid learning is tested in four simulated environments: pendulum swing-up, cartpole swing-up, the hopper environment, and the half-cheetah environment (Fig. 1), using the Pybullet simulator [coumans2016pybullet]. In addition, we compare against state-of-the-art approaches for model-based and experience-based learning. We first illustrate the results using the deterministic variation of hybrid learning in Fig. 1 (compared against SAC and a deterministic model-predictive controller [ansari2016sequential]). Our approach uses the confidence bounds generated by the stochastic policy to infer when it is best to rely on the policy or the predictive models. As a result, hybrid learning allows for performance comparable to experience-based methods with the sample-efficiency of model-based learning approaches. Furthermore, the hybrid control approach allows us to generate a measure for calculating the agreement between the policy and the learned models (bottom plots in Fig. 1), as well as when (and how much) the models were assisting the policy. The commonality between the examples is the eventual reduction in the assistance of the learned model and policy. This allows us to better monitor the learning process and gauge how well the policy alone understands the underlying task.

We next evaluate the stochastic variation of hybrid learning, where we compare against a stochastic neural-network model-based controller [williams2017information] and SAC. As shown in Fig. 2, the stochastic variation still maintains improved performance and sample-efficiency across all examples while also having smoother learning curves. This is a direct result of the derivation of the algorithm, where continuity and differentiability of the learned models are not necessary. In addition, exploration is naturally encoded into the algorithm through the policy, which results in more stable learning when there is uncertainty in the task. In contrast, the deterministic approach required added exploration noise to induce exploration of other regions of state space.
A similar result is found when comparing the model-based performance of the deterministic and stochastic approaches, where the deterministic variation suffers from modeling discontinuous dynamics. We can analyze the individual learned model and policy obtained from hybrid learning. Specifically, we look at the cartpole swing-up task for the stochastic variation of hybrid learning in Fig. 4 and compare against benchmark model-based learning (NN-MPPI [williams2017information]) and experience-based learning (SAC [haarnoja2018soft]) approaches. Hybrid learning is shown to improve the learning capabilities of both the learned predictive model and the policy through the hybrid control approach. In other words, the policy is "filtered" through the learned model and augmented, allowing the robotic system to be guided by both prediction and experience. Thus, both the predictive model and the policy benefit, each ultimately performing better as a standalone component through hybrid learning.
Next, we apply hybrid learning to real robot experiments to illustrate the sample-efficiency and performance our approach can obtain (see Fig. 3 for a task illustration).^{6} (^{6}The same default parameters for SAC are used for this experiment.) We use a Sawyer robot whose goal is to push a block on a table to a specified marker. The positions of the marker and the block are known to the robot. The robot is rewarded for pushing the block to the marker. What makes this task difficult is the contact between the arm and the block that the robot needs to discover in order to complete the pushing task. As shown in Fig. 3, our hybrid learning approach is able to learn the task within 20 episodes (a total time of 3 minutes, with 10 seconds per episode). Since our method naturally relies on the predictive models when the policy is uncertain, the robot is able to plan through the contact to achieve the task, whereas SAC takes significantly longer to discover the pushing dynamics. As a result, we are able to achieve the task with minimal environment interaction.
Learning from Examples:
We extend our method to use expert demonstrations as experience (also known as imitation learning [argall2009survey, ross2010efficient]). Imitation learning focuses on using expert demonstrations either to mimic a task or as initialization for learning complex, data-intensive tasks. We use imitation learning, specifically behavior cloning, as an initialization for how a robot should accomplish a task. Hybrid learning as described in Section 3 is then used as a method to embed model-based information to compensate for the uncertainty in the learned policy, improving the overall performance through planning. The specific algorithmic implementation of hybrid imitation learning is provided in Appendix 7.
We test hybrid imitation on the Pybullet Ant environment. The goal is for the four-legged ant to run as far as it can to the right (from the viewer's perspective) within the allotted time. At each iteration, we provide the agent with an expert demonstration generated from a PPO [schulman2017proximal] solution. Each demonstration is used to construct a predictive model as well as a policy (through behavior cloning). The stochastic hybrid learning approach is used to plan and test the robot's performance in the environment. Environment experience is then used to update the predictive models, while the expert demonstrations are solely used to update the policy. In Fig. 5, we compare hybrid learning against behavior cloning. Our method is able to achieve the task at the level of the expert within 6 (200-step) demonstrations, whereas the behavior-cloned policy is unable to achieve the expert's performance. Interestingly, the ant environment is less susceptible to the covariate shift problem (where the state distribution generated by the expert policy does not match the distribution of states generated by the imitated policy [ross2010efficient]), which is common in behavior cloning. This suggests that the ant experiences a significantly large distribution of states during the expert demonstration. However, the resulting performance of behavior cloning is worse than that of the expert. Our approach is able to achieve similar performance as behavior cloning with roughly 2 fewer demonstrations and performs just as well as the expert demonstrations. We also test our approach in a robot experiment with the Franka Panda robot (which is more likely to have the covariate shift problem). The goal for the robot is to learn how to stack a block on top of another block using demonstrations (see Fig. 6). As with the simulated ant example in Fig. 5, a demonstration is provided at each attempt at the task and is used to update the learned models.
Experience obtained in the environment is solely used to update the predictive models. We use a total of ten pre-collected demonstrations of the block-stacking example (given one at a time to the behavior cloning algorithm before testing). At each testing time, the robot arm is initialized at the same spot over the initial block. Since the demonstrations vary around the arm's initial position, any state drift is a result of the generated imitated actions and will result in the covariate shift problem, leading to poor performance. As shown in Fig. 6, our approach is capable of learning the task in as little as two demonstrations, whereas behavior cloning suffers from poor performance. Since our approach synthesizes actions when the policy is uncertain, the robot is able to interpolate between regions where the expert demonstration was lacking, enabling the robot to achieve the task.
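The imitation-learning loop described above can be sketched as follows; every callable here is a hypothetical stand-in for the corresponding component (expert, behavior cloning, model fitting, stochastic hybrid planner), not the paper's actual implementation:

```python
def hybrid_imitation_loop(get_demo, clone_policy, fit_model, plan, env_rollout,
                          n_iters=10):
    """Skeleton of the hybrid imitation procedure.

    get_demo()          -> one expert demonstration
    clone_policy(demos) -> policy fit to all demos so far (behavior cloning)
    fit_model(data)     -> predictive model fit to environment experience
    plan(policy, model) -> stochastic hybrid controller for one episode
    env_rollout(ctrl)   -> (new experience, episode return)
    """
    demos, experience, episode_returns = [], [], []
    for _ in range(n_iters):
        demos.append(get_demo())          # one demonstration per iteration
        policy = clone_policy(demos)      # demos only update the policy
        model = fit_model(experience)     # env data only updates the model
        controller = plan(policy, model)  # stochastic hybrid synthesis
        new_data, ret = env_rollout(controller)
        experience.extend(new_data)
        episode_returns.append(ret)
    return episode_returns
```

The key split, as in the text, is that demonstrations update only the policy while environment experience updates only the predictive model.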
5 Conclusion
We present hybrid learning as a method for formally combining model-based learning with experience-based policy learning based on hybrid control theory. Our approach derives the best action a robotic agent can take given the learned models. The proposed method is then shown to improve both the sample-efficiency of the learning process and the overall performance of the model and policy, combined and individually. Lastly, we tested our approach in various simulated and real-world environments using a variety of learning conditions and showed that our method improves both the sample-efficiency and the resulting performance of learning motor skills.
References
6 Proofs
Assume that $f$, $r$, and $\mu$ are differentiable and continuous in time. The sensitivity of (3) (also known as the mode insertion gradient) with respect to the duration time $\lambda$ of switching from $\mu$ to $u^\star$ at any time $\tau$ is defined as

$$\frac{\partial \mathcal{J}}{\partial \lambda}\bigg|_{\tau} = \rho(\tau)^\top \left( f_2(\tau) - f_1(\tau) \right) \qquad (16)$$

where $f_1 = f(x, \mu(x))$ and $f_2 = f(x, u^\star)$, and $\rho(t)$ is the adjoint variable, which is the solution to the differential equation

$$\dot{\rho} = -\frac{\partial r}{\partial x} - \frac{\partial f}{\partial x}^\top \rho \qquad (17)$$

with terminal condition $\rho(t_0 + t_H) = 0$. First, define the trajectory
(18) 
generated from where . Next, let us take the derivative of (3) with respect to the time duration so that we have the following expression:
(19) 
Using (18), we can define as
(20) 
where $s$ is a placeholder for time under the integrand, and the remaining terms are boundary terms from applying Leibniz's rule. Noting that (20) is a linear convolution (due to the repeating terms) with the stated initial condition, we can rewrite (20) using a state-transition matrix
(21) 
with initial condition as
(22) 
Using (22) in (19) and pulling out the term from under the integrand, we can rewrite (19) as the following:
(23) 
Taking the limit as $\lambda \to 0^+$ gives the instantaneous sensitivity of switching from $\mu$ at any time $\tau$. Let us define this term as the adjoint variable
(24) 
which gives us the mode insertion gradient
(25) 
where the adjoint can be rewritten as the following differential equation
(26) 
with terminal condition . The recursive, samplebased, solution to
(27) 
found in Eq. 14 is
(28) 
where denotes the sample index and . Expanding the objective in (14), we can show that
(29) 
Defining the policy as normally distributed, we can show that
where shorthand notation is used for the Gaussian density. Plugging this expression into Eq. (29) gives
which we can solve for $\hat{u}_t$ at each time $t$ by setting the derivative with respect to $\hat{u}_t$ to zero to give the optimal solution
(30) 
Note that the expression which allows us to rewrite (30) in the following way:
(31)  
(32) 
Using the change of variables $\delta u_t^k = v_t^k - \hat{u}_t$, we get the recursive, sample-based solution
(33) 
where
(34) 
7 Implementation Details
Here, we present implementation details for each of the examples provided in the main paper as well as additional algorithmic details mentioned throughout the paper. Parameters are shared between the deterministic and stochastic variations of hybrid learning unless otherwise specified. All simulated examples use reward functions specified as the default rewards found in the Pybullet environments [coumans2016pybullet] unless otherwise specified. Table 1 provides a list of all hyperparameters used for each environment tested.
Models:
For each simulated example using the experience-based method, we use the same model representation of the dynamics, with the network weights as learned parameters. For locomotion tasks we use the rectified linear unit (ReLU) nonlinearity. The reward function is modeled as a two-layer network with ReLU activation functions. Both the reward function and the dynamics model are optimized using Adam [kingma2014adam] with a fixed learning rate. The model is regularized using the negative log-loss of a normal distribution, where the variance is a hyperparameter that is simultaneously learned based on model-free learning. The predicted reward utility is improved via the error between the predicted and target rewards (similar to temporal-difference learning [boyan1999least, precup2001off]). This loss encourages learning the value of the state and action that was taken for environments whose rewards do not strictly depend on the current state (i.e., the reward functions used in the Pybullet locomotion examples). A batch of samples is taken from the data buffer for training.

Environment | T | N | policy dim | λ
Pendulum Swingup | 5 | 200 | 20 | 0.1
Cartpole Swingup | 5 | 200 | 20 | 0.1
Hopper | 5 | 1000 | 20 | 0.1
HalfCheetah | 10 | 1000 | 20 | 0.2
Sawyer | 10 | 100 | 10 | 0.01
Ant | 20 | 400 | 40 | 1.0
Franka Panda | 40 | 200 | 40 | 1.0
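A minimal sketch of the kind of dynamics-and-reward network described above (layer sizes, initialization, and the residual next-state form are illustrative assumptions, not the paper's exact hyperparameters):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_layer(n_in, n_out):
    """He-style random initialization for one dense layer."""
    return rng.normal(0, np.sqrt(2.0 / n_in), (n_in, n_out)), np.zeros(n_out)

class DynamicsRewardModel:
    """Single-hidden-layer model with a next-state head x' = x + MLP([x, u])
    and a separate scalar reward head, both sharing a ReLU hidden layer."""
    def __init__(self, x_dim, u_dim, hidden=20):
        self.W1, self.b1 = mlp_layer(x_dim + u_dim, hidden)
        self.Wx, self.bx = mlp_layer(hidden, x_dim)  # next-state head
        self.Wr, self.br = mlp_layer(hidden, 1)      # reward head

    def forward(self, x, u):
        z = np.maximum(0.0, np.concatenate([x, u]) @ self.W1 + self.b1)  # ReLU
        next_x = x + z @ self.Wx + self.bx
        reward = (z @ self.Wr + self.br).item()
        return next_x, reward

model = DynamicsRewardModel(x_dim=3, u_dim=1)
next_x, reward = model.forward(np.zeros(3), np.zeros(1))
```

In practice both heads would be trained with Adam against observed transitions and rewards, as described above; the forward pass here only illustrates the structure.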
Policy:
For the policy, we parameterize a normal distribution with a mean function defined as a single-layer network (with the nonlinearity and number of nodes given in Table 1, similar to the dynamics model used). The diagonal of the variance is specified using a single-layer network with a rectified linear unit activation function. Soft actor critic (SAC) is used to optimize the policy for the pendulum, cartpole, hopper, and half-cheetah environments. All examples use the same hyperparameters for SAC specified by the shared parameters in [haarnoja2018soft], including the structure of the value and soft Q functions, and excluding the batch size and policy (for which we match the samples used with model learning and utilize the simpler policy representation). The ant and Panda robot with behavior cloning use the policy structure defined in Table 1, with a similar stochastic parameterization as mentioned above. The negative log-loss of the normal distribution is used for behavior cloning on expert demonstrations, with a fixed learning rate for each method.
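The behavior-cloning loss described above, the negative log-likelihood of a diagonal Gaussian, can be written as a short pure-Python sketch (function name and argument layout are illustrative):

```python
import math

def gaussian_nll(action, mu, var):
    """Negative log-likelihood of an expert action under N(mu, diag(var)).

    action, mu, var are equal-length sequences, one entry per action dim.
    Minimizing this over demonstrations pulls the policy mean toward the
    expert actions, weighted by the learned per-dimension variance.
    """
    return sum(0.5 * ((a - m) ** 2 / v + math.log(2 * math.pi * v))
               for a, m, v in zip(action, mu, var))
```

Note that because the variance appears inside the loss, confident (low-variance) dimensions are penalized more heavily for mean errors, which is what later lets the hybrid controller gauge policy confidence.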
Robot Experiments:
In all robot experiments, a camera is used to identify the location of objects in the environment using landmark tags and color image processing. For the Sawyer robot example, the state is defined as the pose of the end-effector of the arm relative to the block as well as the pose of the block relative to the target. The action space is the target end-effector velocity. The reward is a function of the poses of the block relative to the target and the end-effector relative to the target location. For the Franka robot, the state is defined as the end-effector position, the block position, and the gripper state (open or closed), as well as the measured wrench at the end-effector. The action space is defined as the commanded end-effector velocity. The reward function depends on the stage the Franka is in within the block-stacking task, as well as the end-effector pose, the block pose, the target stacking position, and the measured wrench at the end-effector.