I Introduction
Our aim in this work is to address two challenges in current reinforcement learning (RL) methods  complex task specification and safe exploration. While much effort has been made in the development of sample efficient RL algorithms, the abilities to accurately specify the desired behavior and ensure safety during learning are paramount to the deployment of learning agents on physical systems.
Because reinforcement learning agents aim to maximize the expected return, they often exploit loopholes in the reward definition to gain an edge in achieving this goal and this results in undesired behavior. This phenomenon is often referred to as reward hacking [1] (some examples are provided in [2]). For tasks with higher complexity, an accurate definition of a reward function is challenging. For example, the authors of [3] shaped a reward function for the peg insertion task consisting of nested min/max functions that requires tuning to perform well. In this work, we use the robustness degree of temporal logic formulas as the reward function which is also constructed with nested min/max functions. However, the user only needs to specify the task as a logic formula and the transformation to the reward function is performed automatically. The robustness reward also has rigorous semantics such that a value greater than zero indicates completion of the task.
Safety is a critical aspect of reinforcement learning when physical systems that potentially operate around humans are involved. This issue is often addressed by constructing a similar environment in simulation, perform training in the simulation environment and transfer the learned policy to the real world [4][5]. However, it is difficult for the dynamics of the simulation to exactly resemble that of reality and some finetuning is necessary for the simtoreal transfer to succeed. Therefore, it is often an effective practice to incorporate a shielding controller both during learning and at deployment to ensure safety as well as to guide exploration [6][7].
In this paper, we propose a system that integrates temporal logic guided reinforcement learning with control barrier functions and control Lyapunov functions [8]. Our main contributions are as follows:

we propose three ways to use temporal logic and its equivalent finite state automata (FSA) in our system  provide reward for the RL agent, perform goal selection for the control Lyapunov function for guided exploration and define safe sets for the control barrier functions.

we extend the FSA augmented MDP framework [9] to handle hard constraints and violation of specifications.

we integrate the system into a quadratic program that can be solved efficiently.

we demonstrate the applicability of our framework in a simulated continuous control task with safety constraints, known system dynamics and unknown environmental dynamics, and study the use and effectiveness of each component in the system.
Ii Related Work
Using temporal logic in an RL setting has been explored in the past. The authors of [10] and [11] use temporal logic and automata to solve the nonMarkovian reward decision process (NMRD). In [12], the authors take advantage of the robustness degree of signal temporal logic to guide the learning process in discrete MDPs. In [13], the authors propose a reward machine, which in effect is an FSA. However, the user is required to manually design the reward machine. In [14] and [15], linear temporal logic (LTL) and geometric linear temporal logic (GLTL) are combined with MDP to learn policies in discrete environments.
Safe exploration and deployment is critical for applying learning agents on physical systems. The authors of [7] and [16] approach the problem in the context of constrained policy optimization and showed minimum violation of safety constraints during learning. The authors of [17] and [18] utilize control barrier functions to safeguard exploration. Temporal logic is used in conjunction with an abstraction of the system dynamics to shield the learning process from unsafe actions in [6]. The Authors of [19] provides a comprehensive survey on safe reinforcement learning. We incorporate hard constraints in a TL specification with control barriers functions such that these constraints can be strictly enforced (in addition to other user specified constraints) during learning. Compared to previous work, our method is relatively easy to implement, able to handle continuous state and action spaces, nonlinear constraints and dynamics, and can be executed efficiently.
Iii Preliminaries
Iiia Reinforcement Learning
We start with the definition of a Markov Decision Process.
Definition 1.
An MDP is defined as a tuple , where is the state space ; is the action space ( and can also be discrete sets); is the transition function with
being the conditional probability density of taking action
at state and ending up in state ; is the reward function with being the reward obtained by executing action at state and transitioning to .We define a task to be the process of finding the optimal policy (or for stochastic policies) that maximizes the expected return, i.e.
(1) 
The horizon of a task (denoted ) is defined as the maximum allowable timesteps of each execution of and hence the maximum length of a trajectory. In Equation (1), is the expectation following . The stateaction value function is defined as
(2) 
i.e. it is the expected return of choosing action at state and following onwards. For offpolicy actor critic methods such as deep deterministic policy gradient [20], is used to evaluate the quality of policy . Parameterized and ( and are learnable parameters) are optimized alternately to obtain .
IiiB scTLTL And Finite State Automata
We consider tasks specified with syntactically cosafe Truncated Linear Temporal Logic (scTLTL) which is a restricted version of truncated linear temporal logic(TLTL) [21]. In particular, we restrict from using the (always) operator. By doing so, we can establish a correspondence between an scTLTL formula with a FSA.
Due to space constraints, we do not provide the complete set of definitions for the Boolean and quantitative semantics of scTLTL (refer to [21]). Examples of scTLTL include which entails that eventually and then eventually become true (sequencing). Another example means until becomes true, is true implies that eventually is true.
We denote to be the MDP state at time , and to be a sequence of states (state trajectory) from time to , i.e., . scTLTL provides a set of realvalued functions that quantify the degree of satisfaction of a given with respective to a formula . This measure is also referred to as robustness degree or simply robustness ( maps a state trajectory and a formula to a real number). For example, . Here, if , then is satisfied. Because requires to be true at least once in the trajectory, hence we take the max over the time horizon. In general, a robustness of greater than zero implies that satisfies and vice versa.
Definition 2.
An FSA corresponding to a scTLTL formula . is defined as a tuple , where is a set of automaton states; is the input alphabet (a set of first order logic formulas); is the initial state; is a conditional probability defined as
(3) 
is a set of final automaton states.
Here is the automaton state of . is the predicate guarding the transition from to . Because is a predicate without temporal operators, the robustness is only evaluated at . Therefore, we use the shorthand . The FSA corresponding to a scTLTL formula can be generated automatically with available packages like Lomap [22] (refer to [23] for details on the generation procedure).
IiiC FSA Augmented MDP
The FSA augmented MDP [9] establishes a connection between the TL specification and the standard reinforcement learning problem. A policy learned using has implicit knowledge of the FSA through the automaton state .
Definition 3.
An FSA augmented MDP corresponding to a scTLTL formula (constructed from FSA and MDP ) is defined as , where , is the probability of transitioning to given and ,
(4) 
is defined in Equation (3). is the FSA augmented reward function, defined by
(5) 
where represents the disjunction of all predicates guarding the outgoing transitions that originate from ( is the set of automata states that are connected with through outgoing edges).
Note that in Equation (5), is calculated from . It is a measure of the progress of satisfying by taking action in state (encapsulated by ).
Intuitively, the reward function in Equation (5) encourages the system to exit the current automaton state and move on to the next, and by doing so eventually reach the final state which satisfies the TL specification (property of FSA).
IiiD Zeroing Control Barrier Functions (CBF) And Control Lyapunov Functions (CLF)
Define an affine control system as
(6) 
where and are locally Lipschitz continuous, is the MDP state, is the control. Here we used the same notation for state and action as the MDP in Definition 1. As will become clear in the next section, we embed the control system as part of the MDP transition dynamics. Following [24] and [8], we provide the definition of the zeroing control barrier function and the control Lyapunov function.
Definition 4.
Given a set for a continuously differentiable function , the function is a zeroing control barrier function defined on set with , if there exists an extended class function such that
(7) 
Definition 5.
A continuously differentiable function is an exponentially stabilizing control Lyapunov function [8] if there exists positive constants such that
(8) 
Proposition 1.
A controller that meets the objectives specified by the CLF and renders the safe set forward invariant (satisfies the condition in Equation (7)) can be found by solving the quadratic program (CLFCBFQP)
(9) 
where is a relaxation variable that grows when there is a conflict between the CBF and CLF constraints. is chosen to be a large positive constant ().
Iv Problem Formulation and Approach
Problem 1.
Given an MDP where is the state space; is the action space; consists of known system dynamics in the form where and unknown environmental dynamics ; is the reward function, a safe set and a scTLTL formula over state predicates, find a policy such that:
(10) 
where is the FSA augmented MDP reward in Equation (5), is a weighting parameter, is the time horizon and is the discount factor, is the control barrier function in Definition 4. It is required that the system trajectory always stays within the safe set defined by and is not violated at all times during learning.
Equation (10) requires to optimize the expectation of a weighted return that combines satisfying (maximizing ) and maximizing the MDP reward while ensuring safety during learning. Here we distinguish between not satisfying and violating . Violation means that a state is entered such that can never be satisfied afterwards.
To solve Problem 1, we first translate to its equivalent FSA and construct a FSA augmented MDP using Definition 3. The FSA is used in three ways  provide reward for the RL agent, provide goal states for the CLF to guide exploration and provide the safe set for the CBF. The actions from the RL agent are then integrated with the CLFCBFQP to solve for the final safe action that’s transmitted to the system and environment. A diagram of our workflow is provided in Figure 1.
V Tl Guided Safe Reinforcement Learning
In this section, we discuss in detail the components of Figure 1. We will use a running example for explanation purposes. We define three circular regions , , each of form where is the center coordinate of the region, is the radius. A specification entails that regions or and are to be eventually visited and and should not be visited before . The FSA for is shown in Figure 2.
One drawback of the FSA definition (Definition 2) is that it does not explicitly handle spec violation. This problem can be solved by adding an additional set of trap states such that if , then no path exists on the FSA that connects to (i.e. entering violates the spec). This can be seen in Figure 2when is visited before .
Va FSA as RL Reward
Here we extend the reward definition in Equation (5) for the FSA augmented MDP to handle the trap state. The modified reward is as follows:
(11) 
where . As an example, in Figure 2 .
in Equation (11) promotes exiting the current FSA state and enter the next nontrap state. penalizes entering the trap state. the min() function penalizes the more violating term with lower robustness. is a weighting coefficient. It is important to note that in Equation (11) does not prevent the RL agent from violating during learning. In fact, the agent needs to explore in order to learn and hence the selfloop. During testing, the user can choose to terminate once the a trap state is entered
VB FSA For CLF Goal Selection
The intuition behind goal selection using the FSA is to find the outgoing edge (leading to a nontrap state) that is easiest for the agent to activate at its current state. Then select the MDP state that maximizes the robustness of the guarding predicate of that edge as the goal. This can be formalized by
(12) 
where is the set of automata states that are connected with through outgoing edges. Finding using Equation (12
) requires solving two optimization problems. The second one is relatively easy since there are only a finite number of outgoing edges to choose from. The first problem can be difficult if the state space is large. One useful heuristic is that if the predicates are defined as regions in the state space, the state with max robustness is the center of that region or at the center of intersection of multiple regions. For the example in Figure
2, would be the center of region or whichever is closer to state .After finding , a simple CLF can be defined as
(13) 
Other CLFs that meet the criteria in Equation (8) can also be used. For simplicity, the above CLF will be adopted in our case studies.
VC FSA for CBF Constraints
We exploit the fact that the guarding predicates in an FSA are propositional logic formulas of inequality constraints over states. They can be naturally interpreted as safe sets and incorporated into the CBF. At state , if there is a trap state connected to , we would like to avoid activating . That is we need to ensure that is always false or is always true. We first restructure to its conjunctive normal form (CNF) i.e. [25]  the conjunction of a set of disjunctive formulas. An example of a CNF is where , , .
We define the FSAbased safe set at state with as
(14) 
Here we assume each predicate is written in the form . The goal of the CBF is to keep the agent from entering the trap state during learning
For the example in Figure 2, at state , , . Therefore, .
VD TlClfCbf Qp
We modify the known system dynamics to
(15) 
is the action provided by the RL agent. and can be found by solving the following quadratic program
(16) 
where , captures extra safety constraints defined by the user, are TL spec related constraints constructed with Equation (14), is constructed with Equation (13). A summary of our algorithm is provided in Algorithm 1.
This framework provides much flexibility in defining a task. It incorporates prior knowledge about the task domain in the form of temporal logic specifications as well as regular reward functions. The user can also specify hard and soft constraints (encapsulated in the CBF and CLF). The system also combines modelbased planning and modelfree learning so that knowledge about the system dynamics can be taken advantage of while unknown dynamics can also be handled.
Vi Case Study
Via Experiment Setup
Figure 3shows our simulation environment. It consists of one agent with known dynamics, safety constraints in the form of two straight lines forming a channel that the agent has to stay within, three circular goal regions whose positions are kept fixed in a episode but are randomized between episodes, and two obstacles that move in the vicinity of the channel and whose dynamics are unknown.
ViB Agent Dynamics
The MDP state of the agent consists of its position and orientation as well as the position of all goals and moving obstacles (13 continuous dimensions). Combined with 1 discrete dimension for the FSA state (from construction of the FSA augmented MDP), the total state space is 14 dimensional. The agent’s controls are the forward velocity and steering speed .
The agent moves according to the unicycle model
(17) 
This underactuated system is uncontrollable. As in [26][27], we use a reference point different from the robot center with coordinates in the robot frame. It is shown that the relationship below holds
(18) 
where are the coordinates of the reference point (orientation is the same as the center). With , we can rewrite the above fully actuated dynamics in the form in Equation (15). Having obtained , the robot’s center coordinates can be calculated via a simple transformation.
ViC Task Definition
The task is defined as
(19) 
In English, entails that the agent is to eventually visit or and eventually and is not to be visited before or and constraints and are to be satisfied at all times before is visited. The predicates are defined as where is the coordinate of the agent, is the center coordinate of goal region . , define the safety constraints as channel of width that the agent needs to navigate within.
We define an additional reward as the minimum distance between the agent and the moving obstacles. This reward is part of the objective in Equation (10).
ViD Implementation Details
For the RL component, we use proximal policy optimization (PPO) [28]
as the learning algorithm. The policy is represented by a feedforward neural network with 3 hidden layers of 300, 200, 100 ReLU units respectively. The value function is of the same architecture. Each episode has horizon
steps and the positions of the goal regions are randomized between episodes (goals may initiate outside the safe channel). We collect a batch of 5 trajectories for each update iteration. During learning, an episode terminates only when the horizon is reached or the task is completed. The agent is allowed to travel outside the safety channel and collide with the moving obstacles during learning (to receive a penalty). However, this is not allowed during evaluation.ViE Results and Discussion
In this section, we study the effect of each component in our system in Figure 1on the learning progress. Specifically, we train the system in four configurations  only RL, RL with CBF, RL with CLF and RL with CBF and CLF.
Figure 4
shows the learning curve in terms of the discounted return. We can see that all configurations are able to achieve similar end performance. The configurations with CLF exhibit faster rising time at the beginning of the learning process. This implies that the RL agent is able to find a useful policy faster with the help of CLF. However, it can also be observed that the learning curves with CLFs show higher variances and volatility. This is mainly due to the fact that the CLF is not aware of moving obstacles or safety constraints, its only job is to guide the agent to the nearest next goal that promotes progress on the FSA according to Equation (
12). Therefore, at moments of near collision and safety violations, the RL agent will need to fight the actions provided by the CLFs and hence adding difficulty to the learning problem. This is most apparent with the configuration RL+CLF+CBF where the RL agent will need to learn to work in harmony with the CLF and CBF controls.
Figure 5shows the maximum violation of the safety constraints and (a value smaller than zero indicates violation). We can see that the CBF is able to prevent the agent from violating the safety constraints at all times during learning. For configurations without CBF, because a violation penalty is imposed on the agent (Equation (11)), it also gradually learns to minimize violation. Figure 5(a) shows that the RL+CLF configuration exhibits a sudden decrease in performance (same time as in Figure 4). This occurs when the nearest next goal happens to be outside the safety region and the RL agent has not learned enough to counter the guiding actions of the CLF.
Figure 6 shows the minimum distance between the agent and the moving obstacles as a function of policy updates. As learning progresses, the agent learns to stay away from the obstacles. The agent is able to keep a further distance from the obstacles in configurations without CBF as this configuration allows the agent to travel outside the safe zone.
Figure 7 shows sample trajectories of the agent in various configurations. The upper subfigures (a) and (b) illustrate trajectories for pure RL and pure CBF. We can see that both agents behave similarly by going to (the nearer of and ) then . The difference is that the RL agent chooses an obstacle free path and tries to trade off between accomplishing the task, avoiding obstacles and minimizing safety violation (controlled by the weights introduced in the previous section). Figure 7(c) shows the agent trajectory with CLF and CBF. CLF still commands the agent to , however, CBF stops the agent at the border of the safety zone and the agent is stuck. Collision with moving obstacles occurs in this case because nor the CLF or the CBF know the obstacle dynamics which is required for avoidance. Figure 7(d) shows the agent trajectory with RL, CLF and CBF. The agent starts close to and tries to move towards . However, it has learned that if it keeps trying to get to it will get stuck at the border and receive a low return. Therefore, near the border it chooses to instead move towards and eventually finishes the task. An agent trained with RL and CBF (not shown here) can potentially also obtain similar behavior. As discussed previously, adding CLF promotes exploration at the beginning of learning. A video can be accessed at https://youtu.be/lUsE3hGpLAk.
Figure 8shows the evaluation success rate over 20 trials. Different from the training process, during evaluation an episode terminates in three circumstances  horizon is reached, task accomplished and the agent comes in collision with the moving obstacles (defined by a minimum threshold on relative distance). To ensure safety, CBF is always turned on during evaluation. The results show that the agents trained with CBF exhibits higher success rates. This is because agents trained without CBF sometimes rely on traveling outside the safe zone to avoid obstacles and get to goals. During evaluation, this is not an option and therefore these agents are more prone to getting stuck at the safe zone boundaries.
Lastly, we want to discuss some practical issues of our approach and a possible area of future work. The system presented in Figure 1
integrates multiple components together. Central to controlling how the components interact with each other is a set of hyperparameters. This includes
in Equation (11), in Equation (10), the coefficients , , and the control bounds in Equation (16). We find that the learning process and the final performance of the policy is noticeably influenced by these parameters. Taking the control bounds as an example, a good practice is to set a wide bound for the CBF actions as these actions will need to dominate at the boundary of the safety constraints. The bounds for the CLF actions should be set narrower than that of the RL actions. This is to ensure that the RL agent is able to overcome the CLF actions when it needs to. A better strategy is to shrink the CLF action bounds as learning progresses such that it can guide exploration at the beginning of training and diminishes its effect as the agent learns. This type of guidance scheduling is also adopted in [29] with demonstrations. An important future work is therefore to alleviate the burden of parameter tuning from the user.Vii Conclusions
In this paper, we explore the combination of temporal logic, reinforcement learning, control Lyapunov functions and control barrier functions to form an expressive, safe and learnable control system. We propose techniques that allow an FSA to simultaneously provide rewards, objectives and safety constraints to each component in the system. We analyzed the effect of each component on the learning process and evaluate the resultant policy. In a simulation experiment with known agent dynamics and unknown environmental dynamics, we showed that our system is able to learn a policy that accomplishes the given task expressed as a logic formula while ensuring safety during learning.
References
 [1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete Problems in AI Safety,” pp. 1–29, 2016. [Online]. Available: http://arxiv.org/abs/1606.06565
 [2] J. Lehman, J. Clune et al., “The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities,” CoRR, vol. abs/1803.03453, 2018.
 [3] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. A. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” CoRR, vol. abs/1707.08817, 2017.
 [4] S. James, P. Wohlhart, M. Kalakrishnan, D. Kalashnikov, A. Irpan, J. Ibarz, S. Levine, R. Hadsell, and K. Bousmalis, “Simtoreal via simtosim: Dataefficient robotic grasping via randomizedtocanonical adaptation networks,” CoRR, vol. abs/1812.07252, 2018.
 [5] Y. Chebotar, A. Handa, D. Khayyat, M. Macklin, J. Issac, N. D. Ratliff, and D. Fox, “Closing the simtoreal loop: Adapting simulation randomization with real world experience,” CoRR, vol. abs/1810.05687, 2018.
 [6] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, and U. Topcu, “Safe reinforcement learning via shielding,” in AAAI, 2018.
 [7] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, and Y. Tassa, “Safe exploration in continuous action spaces,” CoRR, vol. abs/1801.08757, 2018.
 [8] A. D. Ames, J. W. Grizzle, and P. Tabuada, “Control barrier function based quadratic programs with application to adaptive cruise control,” in 53rd IEEE Conference on Decision and Control. IEEE, 2014, pp. 6271–6278.
 [9] X. Li, Y. Ma, and C. Belta, “Automata guided reinforcement learning with demonstrations,” arXiv preprint arXiv:1809.06305, 2018.
 [10] G. D. Giacomo, L. Iocchi, M. Favorito, and F. Patrizi, “Reinforcement learning for ltlf/ldlf goals,” CoRR, vol. abs/1807.06333, 2018.
 [11] A. Camacho, O. Chen, S. Sanner, and S. A. McIlraith, “Decisionmaking with nonmarkovian rewards: From ltl to automatabased reward shaping,” in Proceedings of the Multidisciplinary Conference on Reinforcement Learning and Decision Making (RLDM), 2017, pp. 279–283.
 [12] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Qlearning for robust satisfaction of signal temporal logic specifications,” in Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE, 2016, pp. 6565–6570.

[13]
R. T. Icarte, T. Klassen, R. Valenzano, and S. McIlraith, “Using reward
machines for highlevel task specification and decomposition in reinforcement
learning,” in
International Conference on Machine Learning
, 2018, pp. 2112–2121.  [14] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, S. Seshia, and Others, “A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications,” Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 1091–1096, 2014.
 [15] M. L. Littman, U. Topcu, J. Fu, C. L. Isbell, M. Wen, and J. MacGlashan, “Environmentindependent task specifications via gltl,” CoRR, vol. abs/1704.04341, 2017.
 [16] J. Achiam, D. Held, A. Tamar, and P. Abbeel, “Constrained policy optimization,” in ICML, 2017.
 [17] M. Ohnishi, L. Wang, G. Notomista, and M. Egerstedt, “Barriercertified adaptive reinforcement learning with applications to brushbot navigation,” 2018.
 [18] R. Cheng, G. Orosz, R. M. Murray, and J. W. Burdick, “Endtoend safe reinforcement learning through barrier functions for safetycritical continuous control tasks,” in AAAI, 2019.
 [19] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.
 [20] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” CoRR, vol. abs/1509.02971, 2015.
 [21] X. Li, C.I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” arXiv preprint arXiv:1612.03471, 2016.
 [22] C. Vasile, Github repository, 2017.
 [23] K. Y. Rozier, “Explicit or symbolic translation of linear temporal logic to automata,” Ph.D. dissertation, Rice University, 2013.
 [24] X. Xu, P. Tabuada, J. W. Grizzle, and A. D. Ames, “Robustness of control barrier functions for safety critical control,” IFACPapersOnLine, vol. 48, no. 27, pp. 54–61, 2015.
 [25] P. Jackson and D. Sheridan, “Clause form conversions for boolean circuits,” in International Conference on Theory and Applications of Satisfiability Testing. Springer, 2004, pp. 183–198.
 [26] C. Belta, V. Isler, and G. J. Pappas, “Discrete abstractions for robot motion planning and control in polygonal environments,” IEEE Transactions on Robotics, vol. 21, no. 5, pp. 864–874, 2005.
 [27] J. P. Desai, J. Ostrowski, and V. Kumar, “Controlling formations of multiple mobile robots,” in Proceedings. 1998 IEEE International Conference on Robotics and Automation (Cat. No. 98CH36146), vol. 4. IEEE, 1998, pp. 2864–2869.
 [28] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
 [29] A. Rajeswaran, V. Kumar, A. Gupta, J. Schulman, E. Todorov, and S. Levine, “Learning complex dexterous manipulation with deep reinforcement learning and demonstrations,” CoRR, vol. abs/1709.10087, 2018.