I Introduction
Reinforcement learning methods can find well-behaved controllers (or policies) for robots without detailed knowledge of the robots' internal structure and dynamics, or of the environment they operate in. An often overlooked issue in reinforcement learning research is the design of effective reward functions.
Real-world applications usually involve complex tasks that cannot be defined as a reach-avoid operation ("go from A to B while avoiding obstacles"). Consider the task of driving a car for a humanoid robot. A simple extrinsic reward function is the travel duration to the destination. However, the robot would need an unacceptably large number of trials to learn to apply the gas and brake, use the steering wheel and transmission, and drive safely, using just the extrinsic reward. On the other hand, humans have mastered driving to an acceptable level of proficiency, and general rules have been defined to perform and learn the skill faster and more reliably. Formally incorporating known rules in the reward function dramatically accelerates the time to learn new skills, where correct behaviors (e.g., always keep at least one hand on the steering wheel) are encouraged, and hazardous behaviors (e.g., stepping on the gas and brake pedals at the same time) are penalized.
The problem of accurately incorporating complex specifications in reward functions is referred to in the literature as reward hacking [1]. The inability of ad hoc rewards to capture the semantics of complex tasks has negative repercussions on the learned policies. Policies that maximize the reward functions are not guaranteed to satisfy the specifications. Furthermore, it is not easy to design, or prove, that increasing rewards translate to better satisfaction of the specifications. A simple example from [1] that highlights these problems involves a robot learning to clean an office. If a positive reward is given only when the robot cleans up a mess (picks up trash from the ground), then the robot may learn to first make a mess and then clean it up. Imperfect reward functions provide opportunities for a learning robot to exploit, finding high-gain solutions that are algorithmically correct but deviate from the designer's intentions.
In this paper, we use formal specification languages to capture the designer's requirements of what the robot should achieve. We propose Truncated Linear Temporal Logic (TLTL) as a specification language with an extended set of operators defined over finite-time trajectories of a robot's states. TLTL provides convenient and effective means of incorporating complex intentions, domain knowledge, and constraints into the learning process. We define quantitative semantics (also referred to as robustness degree) for TLTL. The robustness degree is used to transform temporal logic formulas into real-valued reward functions.
We compare the convergence rate and the quality of learned policies of RL algorithms using temporal logic (i.e., robustness degree) and heuristic reward functions. In addition, we compare the results of a simple RL algorithm with TL rewards against a more elaborate RL algorithm with heuristic rewards. In both cases, better quality policies were learned faster using the proposed approach with TL rewards than with the heuristic reward functions.
II Background
We will use the terms controller and policy interchangeably throughout the paper.
II-A Policy Search in Reinforcement Learning
In this section we briefly introduce a class of reinforcement learning methods called policy search methods. Policy search methods have exhibited much potential in finding satisfactory policies in Markov Decision Processes (MDPs) with continuous state and action spaces (also referred to as infinite MDPs), which especially suits the need for finding controllers in robotic applications [2].

Definition 1. An infinite MDP is a tuple $\langle S, A, p(\cdot,\cdot,\cdot), R(\cdot) \rangle$, where $S \subseteq \mathbb{R}^n$ is a continuous set of states; $A \subseteq \mathbb{R}^m$ is a continuous set of actions; $p : S \times A \times S \to [0, 1]$ is the transition probability function, with $p(s, a, s')$ being the probability of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$ (also commonly written as a conditional probability $p(s' \mid s, a)$); $R : \tau \to \mathbb{R}$ is a reward function, where $\tau = (s_0, a_0, \dots, s_T, a_T)$ is a state-action trajectory and $T$ is the horizon.

In RL, the transition function $p$ is unknown to the learning agent. The reward function can be designed or learned (as in the case of inverse reinforcement learning). The goal of RL is to find an optimal stochastic policy $\pi^*$ that maximizes the expected accumulated reward, i.e.
$\pi^* = \arg\max_{\pi}\; \mathbb{E}_{p(\tau \mid \pi)}\big[R(\tau)\big], \qquad (1)$

where $p(\tau \mid \pi)$ is the trajectory distribution from following policy $\pi$, and $R(\tau)$ is the reward obtained given trajectory $\tau$.
In policy search methods, the policy is represented by a parameterized model (e.g., neural network, radial basis functions) denoted by $\pi_\theta(a \mid s)$ (also written as $\pi_\theta$ in short), where $\theta$ is the set of model parameters. Search is then conducted in the policy's parameter space to find the optimal set of $\theta$ that achieves (1):

$\theta^* = \arg\max_{\theta}\; \mathbb{E}_{p(\tau \mid \theta)}\big[R(\tau)\big]. \qquad (2)$
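To make the parameter-space search of (2) concrete, the following sketch runs a naive search over policy parameters on a toy 1-D point-mass problem. The dynamics, the linear policy form, and all constants here are illustrative assumptions, not part of the method described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=20):
    """Simulate one episode of a toy 1-D point mass under a linear policy
    a_t = theta * (goal - s_t); return the trajectory reward R(tau)."""
    s, goal, total = 0.0, 1.0, 0.0
    for _ in range(horizon):
        a = theta * (goal - s)
        s = s + 0.1 * a + 0.01 * rng.standard_normal()  # unknown dynamics
        total -= (goal - s) ** 2                        # accumulated cost
    return total

def expected_return(theta, n_episodes=30):
    """Monte Carlo estimate of E[R(tau)] under the policy pi_theta."""
    return np.mean([rollout(theta) for _ in range(n_episodes)])

# Naive search over the policy-parameter space for the theta maximizing (2).
thetas = np.linspace(0.0, 5.0, 26)
best = max(thetas, key=expected_return)
```

Practical policy search methods such as REPS (next section) replace this exhaustive search with sample-efficient, constrained distribution updates.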
Many policy search methods exist to solve the above problem. The authors of [2] provide a survey on policy search methods applied in robotics. In this work we adopt the Relative Entropy Policy Search (REPS) technique. A brief overview of the method is given in the next section.
II-B Relative Entropy Policy Search
Relative Entropy Policy Search is an information-theoretic approach to the policy search problem. The episode-based version of REPS can be formulated as the following constrained optimization problem:

$\max_{p}\; \mathbb{E}_{p(\tau)}\big[R(\tau)\big] \quad \text{subject to} \quad D_{KL}\big(p(\tau)\,\|\,q(\tau)\big) \le \epsilon, \qquad (3)$

where $q(\tau)$ is the trajectory distribution following the existing policy, $D_{KL}$ is the KL divergence between two trajectory distributions, and $\epsilon$ is a threshold. This constraint limits the step size a policy update can take and ensures that the trajectory distribution resulting from the updated policy stays near the sampled trajectories. The constraint not only promotes exploration safety, which is especially important in robotic applications, but also helps the agent avoid premature convergence.
The optimization problem in (3) can be solved using the method of Lagrange multipliers, which results in a closed-form trajectory distribution update equation given by

$p(\tau) \propto q(\tau)\, \exp\!\big(R(\tau)/\eta\big). \qquad (4)$
Since we only have sample trajectories ($\tau^i \sim q(\tau)$, $i = 1, \dots, N$), we can estimate $p$ only at the sampled points by

$p(\tau^i) \propto \exp\!\big(R(\tau^i)/\eta\big). \qquad (5)$
$q(\tau^i)$ is dropped from the above result because we are already sampling from $q(\tau)$. $\eta$ is the Lagrange multiplier obtained by optimizing the dual function

$g(\eta) = \eta\,\epsilon + \eta \log\Big(\frac{1}{N}\sum_{i=1}^{N} \exp\!\big(R(\tau^i)/\eta\big)\Big). \qquad (6)$
We refer interested readers to [2] for detailed derivations.
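The weight computation in (5) together with the dual optimization in (6) can be sketched in a few lines. The use of SciPy's bounded scalar minimizer and the log-parameterization of $\eta$ are implementation choices for this illustration, not part of the original derivation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def reps_weights(returns, epsilon=0.5):
    """Episode-based REPS: minimize the dual g(eta) of Eq. (6) and return
    normalized per-trajectory weights p(tau^i) ~ exp(R(tau^i)/eta), Eq. (5)."""
    R = np.asarray(returns, dtype=float)
    R = R - R.max()  # shift for numerical stability; weights are invariant

    def dual(log_eta):
        eta = np.exp(log_eta)  # optimize over log(eta) to enforce eta > 0
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))

    res = minimize_scalar(dual, bounds=(-5.0, 5.0), method="bounded")
    eta = np.exp(res.x)
    w = np.exp(R / eta)
    return w / w.sum()
```

At the dual optimum, the KL divergence between the weighted sample distribution and the (uniform) sampling distribution equals the threshold epsilon whenever the constraint is active.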
We adopt time-varying linear-Gaussian policies $\pi_\theta(a_t \mid s_t) = \mathcal{N}(K_t s_t + k_t, \Sigma_t)$ (here for $t \in \{0, \dots, T\}$) and weighted maximum-likelihood estimation to update the policy parameters (the feedback gain $K_t$ is kept fixed to reduce the dimension of the parameter space). This approach has been used in [3]. The difference is that [3] recomputes the sample weights at each step using the cost-to-go before updating $k_t$. Since a temporal logic reward (described in the next section) depends on the entire trajectory, it does not have a notion of cost-to-go and can only be evaluated as a terminal reward. Therefore $p(\tau^i)$ (written in short as $d^i$) is computed once and used for the updates of all $k_t$ (a similar approach is used in episodic PI-REPS [4]). The resulting update equation is

$k_t = \frac{\sum_{i=1}^{N} d^i\, k_t^i}{\sum_{i=1}^{N} d^i}, \qquad d^i = \exp\!\big(R(\tau^i)/\eta\big), \qquad (7)$

where $k_t$ is the feedforward term in the time-varying linear-Gaussian policy at time $t$ and $k_t^i$ is that of sample trajectory $\tau^i$.
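The weighted maximum-likelihood update then reduces to a weighted average of the sampled feedforward terms, with one fixed weight per trajectory. The array layout below is an assumption for illustration.

```python
import numpy as np

def wml_update(weights, k_samples):
    """Weighted maximum-likelihood update of the feedforward terms k_t of a
    time-varying linear-Gaussian policy: the new k_t is the d^i-weighted
    mean of the sampled k_t^i, with d^i held fixed over all time steps t."""
    d = np.asarray(weights, dtype=float)    # shape (N,): one weight per trajectory
    K = np.asarray(k_samples, dtype=float)  # shape (N, T, dim_a): sampled k_t^i
    # Sum_i d^i * k_t^i / Sum_i d^i, for every t simultaneously.
    return np.einsum("i,itj->tj", d, K) / d.sum()
```

With all weights equal the update is the ordinary mean of the samples; as the weights concentrate on high-return trajectories, the policy mean moves toward them.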
III Truncated Linear Temporal Logic (TLTL)
In this section, we propose TLTL, a new temporal logic that we argue is well suited for specifying goals and introducing domain knowledge for the RL problem. In the following definitions, the sets of real and integer numbers are denoted by $\mathbb{R}$ and $\mathbb{Z}$, respectively. The subset of integers $\{a, \dots, b-1\}$, $a, b \in \mathbb{Z}$, $a < b$, is denoted by $[a, b)$, and $[a, b] = [a, b) \cup \{b\}$.
III-A TLTL Syntax and Semantics
A TLTL formula is defined over predicates of the form $f(s) < c$, where $f : \mathbb{R}^n \to \mathbb{R}$ is a function of the state and $c$ is a constant. A TLTL specification has the following syntax:

$\phi := \top \mid f(s) < c \mid \neg\phi \mid \phi \wedge \psi \mid \phi \vee \psi \mid \Diamond\phi \mid \Box\phi \mid \phi\,\mathcal{U}\,\psi \mid \phi\,\mathcal{T}\,\psi \mid \bigcirc\phi \qquad (8)$

where $\top$ is the Boolean constant true, $f(s) < c$ is a predicate, $\neg$ (negation/not), $\wedge$ (conjunction/and), and $\vee$ (disjunction/or) are Boolean connectives, and $\Diamond$ (eventually), $\Box$ (always), $\mathcal{U}$ (until), $\mathcal{T}$ (then), and $\bigcirc$ (next) are temporal operators. Implication is denoted by $\phi \Rightarrow \psi$. TLTL formulas are evaluated against finite time sequences over a set $S \subseteq \mathbb{R}^n$. As will become clear later, such sequences will be produced by a Markov Decision Process (MDP, see Definition 1).
We denote $s_t$ to be the state at time $t$, and $s_{t:t+k}$ to be a sequence of states (state trajectory) from time $t$ to $t+k$, i.e., $s_{t:t+k} = s_t s_{t+1} \dots s_{t+k}$. The Boolean semantics of TLTL is defined as:

$s_{t:t+k} \models f(s) < c \;\Leftrightarrow\; f(s_t) < c$,
$s_{t:t+k} \models \neg\phi \;\Leftrightarrow\; \neg(s_{t:t+k} \models \phi)$,
$s_{t:t+k} \models \phi \Rightarrow \psi \;\Leftrightarrow\; (s_{t:t+k} \models \phi) \Rightarrow (s_{t:t+k} \models \psi)$,
$s_{t:t+k} \models \phi \wedge \psi \;\Leftrightarrow\; (s_{t:t+k} \models \phi) \wedge (s_{t:t+k} \models \psi)$,
$s_{t:t+k} \models \phi \vee \psi \;\Leftrightarrow\; (s_{t:t+k} \models \phi) \vee (s_{t:t+k} \models \psi)$,
$s_{t:t+k} \models \bigcirc\phi \;\Leftrightarrow\; (s_{t+1:t+k} \models \phi) \wedge (k > 0)$,
$s_{t:t+k} \models \Box\phi \;\Leftrightarrow\; \forall t' \in [t, t+k)\;\; s_{t':t+k} \models \phi$,
$s_{t:t+k} \models \Diamond\phi \;\Leftrightarrow\; \exists t' \in [t, t+k)\;\; s_{t':t+k} \models \phi$,
$s_{t:t+k} \models \phi\,\mathcal{U}\,\psi \;\Leftrightarrow\; \exists t' \in [t, t+k)$ s.t. $s_{t':t+k} \models \psi$ and $\forall t'' \in [t, t')\; s_{t'':t'} \models \phi$,
$s_{t:t+k} \models \phi\,\mathcal{T}\,\psi \;\Leftrightarrow\; \exists t' \in [t, t+k)$ s.t. $s_{t':t+k} \models \psi$ and $\exists t'' \in [t, t')\; s_{t'':t'} \models \phi$.

Intuitively, state trajectory $s_{t:t+k} \models \Box\phi$ if the specification defined by $\phi$ is satisfied for every subtrajectory $s_{t':t+k}$, $t' \in [t, t+k)$. Similarly, $s_{t:t+k} \models \Diamond\phi$ if $\phi$ is satisfied for at least one subtrajectory $s_{t':t+k}$, $t' \in [t, t+k)$. $s_{t:t+k} \models \phi\,\mathcal{U}\,\psi$ if $\phi$ is satisfied at every time step before $\psi$ is satisfied, and $\psi$ is satisfied at a time between $t$ and $t+k$. $s_{t:t+k} \models \phi\,\mathcal{T}\,\psi$ if $\phi$ is satisfied at least once before $\psi$ is satisfied between $t$ and $t+k$. A trajectory of duration $k$ is said to satisfy formula $\phi$ if $s_{0:k} \models \phi$.
We equip TLTL with quantitative semantics (robustness degree) $\rho$, i.e., a real-valued function $\rho(s_{t:t+k}, \phi)$ of a state trajectory and a TLTL specification that indicates how far $s_{t:t+k}$ is from satisfying or violating the specification $\phi$. The quantitative semantics of TLTL is defined as follows:

$\rho(s_{t:t+k}, \top) = \rho_{max}$,
$\rho(s_{t:t+k}, f(s) < c) = c - f(s_t)$,
$\rho(s_{t:t+k}, \neg\phi) = -\rho(s_{t:t+k}, \phi)$,
$\rho(s_{t:t+k}, \phi \Rightarrow \psi) = \max\big(-\rho(s_{t:t+k}, \phi),\, \rho(s_{t:t+k}, \psi)\big)$,
$\rho(s_{t:t+k}, \phi \wedge \psi) = \min\big(\rho(s_{t:t+k}, \phi),\, \rho(s_{t:t+k}, \psi)\big)$,
$\rho(s_{t:t+k}, \phi \vee \psi) = \max\big(\rho(s_{t:t+k}, \phi),\, \rho(s_{t:t+k}, \psi)\big)$,
$\rho(s_{t:t+k}, \bigcirc\phi) = \rho(s_{t+1:t+k}, \phi)$ $(k > 0)$,
$\rho(s_{t:t+k}, \Box\phi) = \min_{t' \in [t, t+k)} \rho(s_{t':t+k}, \phi)$,
$\rho(s_{t:t+k}, \Diamond\phi) = \max_{t' \in [t, t+k)} \rho(s_{t':t+k}, \phi)$,
$\rho(s_{t:t+k}, \phi\,\mathcal{U}\,\psi) = \max_{t' \in [t, t+k)} \min\big(\rho(s_{t':t+k}, \psi),\, \min_{t'' \in [t, t')} \rho(s_{t'':t'}, \phi)\big)$,
$\rho(s_{t:t+k}, \phi\,\mathcal{T}\,\psi) = \max_{t' \in [t, t+k)} \min\big(\rho(s_{t':t+k}, \psi),\, \max_{t'' \in [t, t')} \rho(s_{t'':t'}, \phi)\big)$,

where $\rho_{max}$ represents the maximum robustness value. Moreover, $\rho(s_{t:t+k}, \phi) > 0 \Rightarrow s_{t:t+k} \models \phi$ and $\rho(s_{t:t+k}, \phi) < 0 \Rightarrow s_{t:t+k} \not\models \phi$, which implies that the robustness degree can substitute for the Boolean semantics in order to enforce the specification $\phi$. As an illustrative example, consider the specification $\phi = \Diamond(s > 5)$, where $s$ is a one-dimensional state, and a two-step state trajectory $s_{0:1} = s_0 s_1 = [4, 6]$. The robustness is $\rho(s_{0:1}, \phi) = \max_{t \in \{0,1\}}(s_t - 5) = \max(-1, 1) = 1$. Since $\rho(s_{0:1}, \phi) > 0$, $s_{0:1} \models \phi$, and the value is a measure of the satisfaction margin (refer to Example 1 in [5] for a more detailed example on task specification using TL and robustness).
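A minimal evaluator for this robustness degree can be sketched as a direct recursion over the formula structure, covering a subset of the operators. The tuple encoding of formulas and the RHO_MAX constant are illustrative choices, not the authors' implementation.

```python
RHO_MAX = 1e4  # maximum robustness value rho_max

def rho(traj, phi):
    """Recursively evaluate the TLTL robustness of formula phi on a finite
    state trajectory (a list of states). Formulas are nested tuples; a
    predicate f(s) < c is encoded as ("pred", g) with g(s) = c - f(s),
    so rho > 0 iff the predicate holds at the trajectory's first state."""
    op = phi[0]
    if op == "true":
        return RHO_MAX
    if op == "pred":
        return phi[1](traj[0])
    if op == "not":
        return -rho(traj, phi[1])
    if op == "and":
        return min(rho(traj, p) for p in phi[1:])
    if op == "or":
        return max(rho(traj, p) for p in phi[1:])
    if op == "always":       # min over all suffixes
        return min(rho(traj[t:], phi[1]) for t in range(len(traj)))
    if op == "eventually":   # max over all suffixes
        return max(rho(traj[t:], phi[1]) for t in range(len(traj)))
    if op == "next":
        return rho(traj[1:], phi[1]) if len(traj) > 1 else -RHO_MAX
    raise ValueError("unknown operator: %s" % op)

# "eventually (s > 5) and always (s < 10)"
phi = ("and",
       ("eventually", ("pred", lambda s: s - 5)),
       ("always", ("pred", lambda s: 10 - s)))
```

On the trajectory [4, 6] this yields min(max(-1, 1), min(6, 4)) = 1, matching the worked example above.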
III-B Comparison With Existing Formal Languages
In our view, a formal language for RL task specification should have the following characteristics: (1) the language should be defined over predicates, so tasks can be conveniently specified as functions of states; (2) the language should provide quantitative semantics as a continuous measure of its satisfaction; (3) specification formulas should be evaluated over finite sequences (state trajectories) of variable length, thus allowing per-step evaluation on currently available data; (4) temporal operators may have time bounds but should not require them. A wide variety of formal specification languages exists, and some possess parts of the above characteristics. We selectively analyze three specification languages, namely Signal Temporal Logic (STL) [6] (related to Metric Temporal Logic (MTL), omitted here for simplicity), Bounded Linear Temporal Logic (BLTL) [7], and Linear Temporal Logic on Finite Traces (LTLf) [8].
One of the most important elements in using a formal language in reinforcement learning is the ability to transform a specification into a real-valued function that can be used as a reward. This requires quantitative semantics to be defined for the chosen language. One obvious choice is Signal Temporal Logic (STL), which is defined over infinite real-valued signals with a time bound required for every temporal operator. While this is useful for analyzing signals, it can cause problems when defining tasks for robots. For example, if the goal is to have the robot learn to put a beer in the fridge, the robot only needs to find the correct way to operate the fridge (e.g., open the fridge door, place the beer on a shelf, and close the fridge door) and possibly perform this sequence of actions at an acceptable speed. But using STL to specify this task would require the designer to manually put time bounds on how long each action/subtask should take. If such a bound is set inappropriately, the robot may fail to find a satisfying policy due to its hardware constraints even though it is capable of performing the task. This is quite common in robotic tasks where we care about the robot accomplishing the given task but have no hard constraints on when or how fast the task should be finished. In this case mandatory time bounds add unnecessary complexity to the specification and thus to the overall learning process.
Two other possible choices are BLTL and LTLf. Both can be evaluated over finite sequences of states. However, similar to STL, temporal operators in BLTL require time bounds. Both languages are defined over atomic propositions rather than predicates, and neither comes with quantitative semantics.
With the above requirements in mind, we design TLTL such that its formulas over state predicates can be evaluated against finite trajectories of any length. In the context of reinforcement learning this can be the length of an execution episode. TLTL does not require a time bound to be specified with every use of a temporal operator. If, however, the user finds explicit time bounds helpful in certain cases, the semantics of STL can easily be incorporated into TLTL. The set of operators provided by TLTL can be conveniently used to specify common components (goals, constraints, sequences, decisions, etc.) that many tasks or rules are made of. The combination of these components can cover a wide range of specifications for robotic tasks.
IV Related Work
Making good use of the reward function has been considered in the past but has not been a main focus of reinforcement learning research. In [9], the authors proposed the method of potential-based reward shaping. It was shown that this method can provide additional training rewards while preserving the optimal policy of the original problem. Efforts have also been made in inverse reinforcement learning (IRL), where the goal is to extract a reward function from observed optimal/professional behavior, and to learn the optimal policy using this reward. The authors of [10] presented three algorithms that address the problem of IRL and showed their applicability in simple discrete and continuous environments.
Combining temporal logic with reinforcement learning to learn logically complex skills has been looked at only very recently. In [5], the authors used a log-sum-exp approximation to adapt the robustness of STL specifications to Q-learning on MDPs with discrete spaces. The authors of [11] and [12] have also taken advantage of automata-based methods to synthesize control policies that satisfy LTL specifications for MDPs with unknown transition probabilities.
The methods mentioned above are constrained to discrete state and action spaces, and to a somewhat limited set of temporal operators. To the best of our knowledge, this paper is the first to apply TL in reinforcement learning on continuous state and action spaces, and to demonstrate its capabilities in experiments.
V Experiments
In this section we first use two simulated manipulation tasks to compare the TLTL reward with a discrete reward, as well as with a distance-based continuous reward commonly used in the RL literature. We then specify a toast-placing task in TLTL in which a Baxter robot is required to learn a combination of a reaching policy and a gripper timing policy¹.

¹The simulation is implemented in rllab [13] and gym [14]. The experiment is implemented in rllab and ROS.
V-A Simulated 2D Manipulation Tasks
Figure 2 shows a 2D simulated environment with a three-joint manipulator. The 8-dimensional state feature space includes the joint angles, joint velocities, and end-effector position. The 3-dimensional action space consists of the joint velocities. Gaussian noise is added to the velocity commands.
For the first task, the end-effector is required to reach the goal position $g$ while avoiding obstacles $o_1$ and $o_2$. The discrete and continuous rewards are summarized as follows:
(9) 
In the above rewards, $d^g_t$ is the Euclidean distance between the end-effector and the goal at time $t$, $d^{o_i}_t$ is the distance between the end-effector and obstacle $o_i$, and $r_i$ is the radius of obstacle $o_i$. The TLTL specification and its resulting robustness function are described as
$\phi_1 = \Diamond\,\Box\,(d^g < \epsilon) \;\wedge\; \Box\,\big(d^{o_1} > r_1 \wedge d^{o_2} > r_2\big) \qquad (10)$

$\rho(s_{0:T}, \phi_1) = \min\Big(\max_{t \in [0,T)} \min_{t' \in [t,T)} \big(\epsilon - d^g_{t'}\big),\;\; \min_{t \in [0,T)} \min\big(d^{o_1}_t - r_1,\, d^{o_2}_t - r_2\big)\Big) \qquad (11)$

(where $\epsilon$ is a small tolerance defining the goal region).
In English, $\phi_1$ describes the task of "eventually always stay at the goal and always stay away from the obstacles". The user needs only to specify $\phi_1$; the reward function is generated automatically from the quantitative semantics described in Section III-A. Here $s_{0:T}$ is the trajectory of the end-effector position and $d^g_t$ is the goal distance at time $t$.
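For a reach-avoid specification of this shape, the robustness reward can be computed directly with a few vectorized operations over the end-effector trajectory. The goal tolerance eps and the circular obstacle model below are assumptions for illustration, not the paper's exact constants.

```python
import numpy as np

def reach_avoid_robustness(ee_traj, goal, obstacles):
    """Robustness of phi = eventually-always (d_goal < eps) and
    always (d_obstacle_i > r_i), for an end-effector position trajectory
    of shape (T, 2). `obstacles` is a list of (center, radius) pairs."""
    eps = 0.1  # assumed goal-region tolerance
    d_goal = np.linalg.norm(ee_traj - goal, axis=1)  # d_t^g for all t
    # eventually-always: max over t of min over the suffix [t, T) of (eps - d).
    # suffix_max[t] = max of d_goal over [t, T), via a reversed running max.
    suffix_max = np.maximum.accumulate(d_goal[::-1])[::-1]
    rho_goal = np.max(eps - suffix_max)
    # always avoid: min over time and obstacles of (d_t^{o_i} - r_i).
    rho_avoid = min(
        np.min(np.linalg.norm(ee_traj - center, axis=1) - radius)
        for center, radius in obstacles)
    return min(rho_goal, rho_avoid)  # conjunction = min
```

A positive return value certifies that the trajectory satisfies the reach-avoid specification; the magnitude is the satisfaction (or violation) margin.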
For the second task, the gripper is required to visit goals $g_1$, $g_2$, and $g_3$ in this specific sequence while avoiding the obstacles (one more obstacle is added to further constrain the free space). The discrete and continuous rewards are summarized as
(12) 
Here an additional state vector is maintained to record which goals have already been visited in order to determine the next goal. In the rewards above, $g_c$ denotes the correct next goal to visit and the remaining goals are to be avoided. The TLTL specification is defined as

$\phi_2 = \big(\psi_{g_1}\,\mathcal{T}\,\psi_{g_2}\,\mathcal{T}\,\psi_{g_3}\big) \wedge \big((\neg\psi_{g_2} \wedge \neg\psi_{g_3})\,\mathcal{U}\,\psi_{g_1}\big) \wedge \big(\neg\psi_{g_3}\,\mathcal{U}\,\psi_{g_2}\big) \wedge \Big(\bigwedge_{i=1}^{3} \Box\big(\psi_{g_i} \Rightarrow \bigcirc\Box\,\neg\psi_{g_i}\big)\Big) \wedge \Box\,\phi_o \qquad (13)$
where $\psi_{g_i}$ is the predicate for goal $g_i$ and $\phi_o$ is the obstacle avoidance constraint ($\bigwedge$ is a shorthand for a sequence of conjunctions). In English, $\phi_2$ states "visit $g_1$ then $g_2$ then $g_3$, and don't visit $g_2$ or $g_3$ until visiting $g_1$, and don't visit $g_3$ until visiting $g_2$, and always if $g_i$ is visited then next always don't visit $g_i$ (don't revisit goals), and always avoid obstacles". Due to space constraints the robustness of $\phi_2$ is not explicitly presented, but it is also a complex function consisting of nested $\min$ and $\max$ functions that would be difficult to design by hand, yet is generated automatically from the quantitative semantics of TLTL.
During training, we consider the obstacles as "penetrable," in that the end-effector of the gripper can enter them with a negative reward received; depending on the reward function, the negative reward may be proportional to the penetration depth. In practice, we find this approach to facilitate learning better than simply granting the agent a negative reward on contact with an obstacle and reinitializing the episode. We also adopt this approach in the physical experiment in the next section.
Task 1 has a horizon of 200 timesteps and is trained for 200 iterations, with each iteration updated on 30 sample trajectories. Because of the added complexity, task 2 has a horizon of 500 timesteps and is trained for 500 iterations with the same number of samples per update. To compare the influence of the reward functions on the learning outcome, we first fix the learning algorithm to be episode-based REPS and compare the average return per iteration for the TLTL robustness reward, the discrete reward, and the continuous reward. However, it is meaningless to compare returns on different scales. We therefore take the sample trajectories learned with the discrete and continuous rewards and calculate their corresponding TLTL robustness return for comparison. The reason for choosing TLTL robustness as the comparison measure is that both the discrete and continuous rewards have semantic ambiguity depending on the choices of the discrete returns and coefficients. TLTL is rigorous in its semantics, and a robustness greater than zero guarantees satisfaction of the task specification.
In addition, since the discrete and continuous rewards can provide an immediate reward per step (as opposed to the TLTL robustness, which requires the entire trajectory to produce a terminal reward), we also used a step-based REPS [3] that updates the policy at each step using the cost-to-go. This is a common technique used to reduce the variance in the Monte Carlo return estimate. For the continuous reward, a grid search is performed on the coefficients and the best outcome is reported. We train each comparison case on 4 different random seeds. The mean and variance of the average returns are illustrated in Figure 3.

It can be observed that in both tasks the TLTL robustness reward resulted in the best learning outcome in terms of convergence rate and final return. For the level of stochasticity present in the simulation, step-based REPS showed only minor improvement in the rate of convergence and variance reduction. For the simpler task 1, a well-tuned continuous reward achieves learning performance comparable to the TLTL robustness reward. For task 2, the TLTL reward outperforms the competing reward functions by a considerable margin. The discrete reward fails to learn a useful policy due to sparse returns. A video of the learning process is provided.
The results indicate that a reward function with well-defined semantics can significantly improve the learning outcome of an agent while providing convenience to the designer. For tasks with a temporal/causal structure (such as task 2), a hierarchical learning approach is usually employed, in which the agent learns higher-level policies that schedule over lower-level ones [15]. We show that incorporating the temporal structure correctly into the reward function allows a relatively simple non-hierarchical algorithm to learn hierarchical tasks in continuous state and action spaces.
V-B Learning a Toast-Placing Task With a Baxter Robot
Pick-and-place tasks have been a common test scenario in reinforcement learning research [16], [17]. The task is framed as correctly reaching a grasp position where the end-effector performs the grasp operation upon approach. For the object-placing process, progress is measured by tracking the distance between the object and the deployment location. In our experiment, we focus on the placing task. We do not track the position of the object but rather express the desired behavior as a TLTL specification. The robot simultaneously learns to reach the specified region and a gripper timing policy that releases the object at the right instant (as opposed to directly specifying the point of release).
Figure 4 shows the experimental setup. A Baxter robot is used to perform the task of placing a piece of bread in a toaster. The 21-dimensional state feature space includes the 7 joint angles and joint velocities, the xyz-rpy pose of the end-effector, and the gripper position. The end-effector pose is tracked using a motion tracking system as an additional source of information. The gripper position ranges continuously from 0 to 100, with 0 being fully closed. The 8-dimensional action space includes the 7 joint velocities and the desired gripper position. Actions are sent at 20 Hz.
The placing task is specified by the TLTL formula

$\phi = \Box\big(\neg\psi_{table} \wedge \neg\psi_{toaster}\big) \wedge \Diamond\,\psi_{slot} \wedge \big(\psi_{gripper\_close}\,\mathcal{U}\,\psi_{slot}\big) \wedge \Box\big(\psi_{slot} \Rightarrow \bigcirc\Box\,\psi_{gripper\_open}\big) \qquad (14)$

where $\psi_{slot}$, $\psi_{table}$, and $\psi_{toaster}$ are predicates describing spatial regions in the form $(x_{min} < x < x_{max}) \wedge (y_{min} < y < y_{max}) \wedge (z_{min} < z < z_{max})$ ($(x, y, z)$ is the position of the end-effector). Orientation constraints are specified in a similar way to ensure that the correct pose is reached at the position of release. The regions for $\psi_{slot}$ and $\psi_{toaster}$ are depicted in Figure 4. $\psi_{gripper\_open}$ and $\psi_{gripper\_close}$ describe the conditions for the gripper to be open or closed. In English, the specification describes the process of "always don't hit the table or the toaster, and eventually reach the slot, and keep the gripper closed until the slot is reached, and always if the slot is reached then next always keep the gripper open". The resulting robustness for $\phi$ is
(15) 
Due to space constraints, (15) is written in its recursive form, where the robustness of the individual predicates is evaluated at each time step. Again, the robustness is generated from the TLTL quantitative semantics, and specification $\phi$ is satisfied when the robustness is greater than zero. Also, the robustness for $\psi_{gripper\_open}$ and $\psi_{gripper\_close}$ is normalized to the same scale as that of the other predicates. This ensures that all subformulas are treated equally during learning. The implementation of (15), and of robustness in general, is highly vectorized, and run-time evaluation of complicated specifications does not usually cause significant overhead.
For a comparison case, we design the following reward function
(16) 
In the above equation, $d^{slot}_t$ and $d^{toaster}_t$ are the Euclidean distances between the end-effector and the centers of the toaster regions defined in Figure 4 at time $t$, and $p^{gripper}_t$ is the gripper position at time $t$. The reward function encourages being close to the slot and keeping away from the toaster. If the gripper has not yet reached within 3 centimeters of the slot center at all times before $t$, the gripper should be closed ($p^{gripper}_t = 0$); otherwise the gripper should be open. The coefficients are manually tuned and the best outcome is reported.
Similar to the simulation experiment, during training the obstacle (the toaster in this case) is taken away and its region is penetrable, with a negative reward proportional to the penetration depth (largest at the center of the region) provided by the robustness. The table is kept in its position, and a new episode is initialized if a collision with the table occurs. Each episode has a horizon of 100 timesteps (around 6 seconds), and each update iteration uses 10 sample trajectories. Episode-based REPS is again used as the RL algorithm for this task. The resulting training curves are plotted in Figure 5.
In Figure 5, trajectories learned from the comparison reward at each iteration are used to calculate their corresponding robustness value (as explained in the previous section) for a reasonable comparison. We can observe that training with the TLTL reward reached a significantly better policy than training with the comparison reward. One important reason is that the semantics of the reward in Equation (16) relies heavily on the relative magnitudes of its coefficients. For example, if the coefficient on the slot-distance term is much higher than the others, then the reward will put most emphasis on reaching the slot and pay less attention to learning the correct gripper timing policy or obstacle avoidance. An exhaustive hyperparameter search on the physical robot is infeasible. In addition, the comparison reward expresses much less information than the TLTL robustness. For example, penalizing collision with the toaster is necessary only when the gripper comes in contact with the toaster; otherwise the agent should focus on the other subtasks (reaching the slot, improving the gripper policy). For the comparison reward, this logic is again achieved only by finding the right combination of hyperparameters. By contrast, because the robustness function is made up of a series of embedded min/max functions, at any instant the agent is maximizing only a set of active functions. These active functions represent the bottlenecks in improving the overall return. By adopting this form, the robustness reward effectively focuses the agent's effort on improving the most critical set of subtasks at any time, so as to achieve efficient overall learning progress. However, this may render the TLTL robustness reward susceptible to scaling issues (if the robustness of one subformula changes on a different scale than the other subformulas, the agent may devote all its effort to improving that one subtask and fail to improve the others). Therefore, proper normalization is required. Currently this normalization is performed manually; future work includes automatic or adaptive normalization of predicate robustnesses.
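A minimal version of the manual normalization described above divides each predicate robustness by its largest attainable magnitude, which is assumed to be known to the designer per predicate.

```python
import numpy as np

def normalize_rho(rho_value, scale):
    """Divide a predicate robustness by its largest attainable magnitude
    (`scale`, assumed known per predicate), mapping it into [-1, 1] while
    preserving its sign, so every subformula contributes on the same
    scale inside the nested min/max structure of the overall robustness."""
    return float(np.clip(rho_value / scale, -1.0, 1.0))
```

Because the overall robustness is a composition of min and max operations, this sign-preserving rescaling does not change which trajectories satisfy the specification, only the relative magnitudes that drive learning.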
To evaluate the resulting behavior, 10 trials of the toast-placing task are executed with the policy learned from each reward. The policy from the TLTL reward achieves a 100% success rate, while the comparison reward fails to learn the task (due to its inability to learn the correct gripper timing policy). A video of the learning progress is provided.
VI Conclusion
Reflecting on how we learn as humans, we are usually given a goal and a set of well-defined rules, and it is up to us to find methods to best achieve the goal within those rules. Imagine instead that we are given only the goal (drive safely from A to B) but not the rules (traffic laws): even if we could instantly reinitialize after each accident, it would take an intractable number of trials to learn to drive safely. Robot learning is analogous. Rules can be experience or knowledge (switch to a low gear when driving on steep slopes) that accelerates learning, or they can be constraints that an agent must abide by (rules in traffic, sports, and games). Being able to formally express these rules as reward functions and incorporate them into the learning process is helpful, and often necessary, for a robot to operate in the world.
In this paper we proposed TLTL, a formal specification language with quantitative semantics designed for convenient robotic task specification. We compared the learning performance of the TLTL reward with two of the more commonly used forms of reward (namely, a discrete and a continuous reward function) in a 2D simulated manipulation environment while fixing the RL algorithm. We also compared the outcome of the TLTL reward trained using a relatively inefficient episode-based method against the discrete/continuous rewards trained using a lower-variance step-based method. Results show that the TLTL reward not only outperformed all of its comparison cases but also enabled a non-hierarchical RL method to successfully learn a temporally structured task. Furthermore, we used TLTL to express a toast-placing task and demonstrated successful learning on a Baxter robot. Future work includes adapting the TLTL robustness reward to more efficient gradient-based methods [18] by exploiting its smooth approximations [5], and using automata theory for guided exploration and learning [19].
References
 [1] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete Problems in AI Safety,” pp. 1–29, 2016. [Online]. Available: http://arxiv.org/abs/1606.06565
 [2] M. P. Deisenroth, “A Survey on Policy Search for Robotics,” Foundations and Trends in Robotics, vol. 2, no. 1, pp. 1–142, 2011. [Online]. Available: http://www.nowpublishers.com/articles/foundationsandtrendsinrobotics/ROB021
 [3] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, “Path integral guided policy search,” arXiv preprint arXiv:1610.00529, 2016.

 [4] V. Gómez, H. J. Kappen, J. Peters, and G. Neumann, “Policy search for path integral control,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2014, pp. 482–497.
 [5] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-Learning for Robust Satisfaction of Signal Temporal Logic Specifications,” 2016.

 [6] A. Donzé and O. Maler, “Robust satisfaction of temporal logic over real-valued signals,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 6246 LNCS, pp. 92–106, 2010.
 [7] T. Latvala, A. Biere, K. Heljanko, and T. Junttila, “Simple bounded LTL model checking,” Formal Methods in Computer-Aided Design, vol. 3312, LNCS, pp. 186–200, 2004. [Online]. Available: http://www.springerlink.com/index/A1JNFCB7Q9KNC1Q1.pdf
 [8] G. De Giacomo and M. Y. Vardi, “Linear temporal logic and Linear Dynamic Logic on finite traces,” IJCAI International Joint Conference on Artificial Intelligence, pp. 854–860, 2013.
 [9] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” Sixteenth International Conference on Machine Learning, vol. 3, pp. 278–287, 1999.
 [10] A. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” Proceedings of the Seventeenth International Conference on Machine Learning, vol. 0, pp. 663–670, 2000. [Online]. Available: http://wwwcs.stanford.edu/people/ang/papers/icml00irl.pdf
 [11] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, S. Seshia, and Others, “A learning based approach to control synthesis of markov decision processes for linear temporal logic specifications,” Decision and Control (CDC), 2014 IEEE 53rd Annual Conference on, pp. 1091–1096, 2014.
 [12] J. Fu and U. Topcu, “Probably Approximately Correct MDP Learning and Control With Temporal Logic Constraints,” 2014. [Online]. Available: http://arxiv.org/abs/1404.7073
 [13] Y. Duan, X. Chen, R. Houthooft, J. Schulman, and P. Abbeel, “Benchmarking deep reinforcement learning for continuous control,” in Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.
 [14] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
 [15] T. G. Dietterich, “Hierarchical reinforcement learning with the maxq value function decomposition,” J. Artif. Intell. Res.(JAIR), vol. 13, pp. 227–303, 2000.
 [16] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection,” arXiv, 2016. [Online]. Available: http://arxiv.org/abs/1603.02199v1
 [17] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous offpolicy updates,” arXiv preprint arXiv:1610.00633, 2016.
 [18] N. Heess, J. J. Hunt, T. P. Lillicrap, and D. Silver, “Memorybased control with recurrent neural networks,” arXiv preprint arXiv:1512.04455, 2015.
 [19] E. Aydin Gol, M. Lazar, and C. Belta, “Language-guided controller synthesis for discrete-time linear systems,” in Proceedings of the 15th ACM International Conference on Hybrid Systems: Computation and Control. ACM, 2012, pp. 95–104.