I Introduction
Reinforcement learning (RL) has enjoyed groundbreaking success in recent years, ranging from playing Atari games at a superhuman level [1] and competing with world champions in the game of Go [2] to generating visuomotor control policies for robots [3], [4]. Despite much effort being put into developing sample-efficient algorithms, an important aspect of RL remains less explored: the reward function. The reward function is the window through which designers specify the desired behavior and impose important constraints on the system. While most reward functions used in the current RL literature have been based on heuristics for relatively simple tasks, real-world applications typically involve tasks that are logically more complex.
Commonly adopted reward functions take the form of a linear combination of basis functions (often quadratic) [5]. This type of reward function has limited expressiveness and is semantically ambiguous because of its dependence on a set of weights. Reward functions of this form have been used to successfully learn high-dimensional control tasks such as humanoid walking [6] and multiple household tasks (e.g., placing coat hangers, twisting bottle caps) [3]. However, the parameters of the reward function must be tuned, and this iteration is expensive for robotic tasks. Moreover, these tasks are logically straightforward in that there is little logical interaction between subtasks (such as sequencing, conjunction/disjunction, implication, etc.).
Consider the harder task of learning to use an oven. The agent is required to perform a series of subtasks in the correct sequence (set temperature and timer → preheat → open oven door → place item in oven → close oven door). In addition, the agent has to make the simple decision of when to open the oven door and place the item (i.e., preheat finished implies open oven door). Tasks like this are commonly found in household environments (using the microwave, refrigerator or even a drawer), and a function that correctly maps the desired behavior to a real-valued reward can be difficult to design. If the semantics of the reward function cannot be guaranteed, then an increase in the expected return will not necessarily represent better satisfaction of the task specification. This is referred to as reward hacking by [7].
Reward engineering has been briefly explored in the reinforcement learning literature. The authors of [8] and [9] provide general formalisms for reward engineering and discuss its significance. The authors of [10] proposed potential-based reward shaping and proved policy invariance under this type of reward transformation. Another line of work aims to infer a reward function from demonstration. This idea is called inverse reinforcement learning and is explored by [11] and [12].
In this paper, we adopt the expressive power of temporal logic (TL) and use it as a task specification language for reinforcement learning in continuous state and action spaces. Its quantitative semantics (also referred to as robustness degree or simply robustness) translate a TL formula to a real-valued function that can be used as the reward. By definition of the quantitative semantics, a robustness value greater than zero guarantees satisfaction of the temporal logic specification.
Temporal logic (TL) has been adopted as the specification language for a wide variety of control tasks. The authors of [13] use linear temporal logic (LTL) to specify a persistent surveillance task carried out by aerial robots. Similarly, [14] and [15] applied LTL in traffic network control. The application of TL in reinforcement learning has been less investigated. The authors of [16] combined signal temporal logic (STL) with Q-learning while also adopting the log-sum-exp approximation of robustness. However, their focus is on discrete state and action spaces, and they ensure satisfiability by expanding the state space to a history-dependent state space. This does not scale well to the large or continuous state-action spaces that are often encountered in control tasks.
Our main contributions in this paper are as follows:

we present a model-free policy search algorithm, which we call temporal logic policy search (TLPS), that takes advantage of the robustness function to facilitate learning. We show that an optimal parameterized policy that maximizes the robustness can be obtained by solving a constrained optimization,

we propose a smoothing approximation of the robustness degree, which is necessary for obtaining the gradients of the objective and constraints. We prove that using the smoothed robustness as reward provides similar semantic guarantees to the original robustness definition while providing significant speedup in learning,

finally, we demonstrate the performance of the proposed approach using simulated navigation tasks.
II Preliminaries
II-A Truncated Linear Temporal Logic (TLTL)
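In this section, we provide definitions for TLTL (refer to our previous work [17] for a more elaborate discussion of TLTL). A TLTL formula is defined over predicates of the form $f(s) < c$, where $f(s)$ is a function of state $s$ and $c$ is a constant. We express the task as a TLTL formula with the following syntax:
$$\phi := \top \mid f(s) < c \mid \neg\phi \mid \phi \wedge \psi \mid \phi \vee \psi \mid \Diamond\phi \mid \Box\phi \mid \phi\,\mathcal{U}\,\psi \mid \phi\,\mathcal{T}\,\psi \mid \bigcirc\phi \mid \phi \Rightarrow \psi, \tag{1}$$
where $\top$ is the Boolean constant true, $f(s) < c$ is a predicate, $\neg$ (negation/not), $\wedge$ (conjunction/and), and $\vee$ (disjunction/or) are Boolean connectives, and $\Diamond$ (eventually), $\Box$ (always), $\mathcal{U}$ (until), $\mathcal{T}$ (then), $\bigcirc$ (next) are temporal operators. Implication is denoted by $\Rightarrow$. TLTL formulas are evaluated against finite time sequences of states.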
We denote $s_t$ to be the state at time $t$, and $s_{t:t+k}$ to be a sequence of states (state trajectory) from time $t$ to $t+k$, i.e., $s_{t:t+k} = s_t s_{t+1} \ldots s_{t+k}$. The Boolean semantics of TLTL is defined as:
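Intuitively, state trajectory $s_{t:t+k} \models \Box\phi$ (reads $s_{t:t+k}$ satisfies $\Box\phi$) if the specification defined by $\phi$ is satisfied for every subtrajectory $s_{t':t+k}$, $t' \in [t, t+k)$. Similarly, $s_{t:t+k} \models \Diamond\phi$ if $\phi$ is satisfied for at least one subtrajectory $s_{t':t+k}$, $t' \in [t, t+k)$. $s_{t:t+k} \models \phi\,\mathcal{U}\,\psi$ if $\phi$ is satisfied at every time step before $\psi$ is satisfied, and $\psi$ is satisfied at a time between $t$ and $t+k$. $s_{t:t+k} \models \phi\,\mathcal{T}\,\psi$ if $\phi$ is satisfied at least once before $\psi$ is satisfied between $t$ and $t+k$. A trajectory $s$ of duration $T$ is said to satisfy formula $\phi$ if $s_{0:T} \models \phi$.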
TLTL is equipped with quantitative semantics (robustness degree) $\rho(s_{t:t+k}, \phi)$, i.e., a real-valued function that indicates how far $s_{t:t+k}$ is from satisfying or violating the specification $\phi$. We define the task satisfaction measurement $\rho(s_{t:t+k}, \phi)$, which is recursively expressed as:
where $\rho_{max}$ represents the maximum robustness value. Moreover, $\rho(s_{t:t+k}, \phi) > 0 \Rightarrow s_{t:t+k} \models \phi$ and $\rho(s_{t:t+k}, \phi) < 0 \Rightarrow s_{t:t+k} \not\models \phi$, which implies that the robustness degree can substitute Boolean semantics in order to enforce the specification $\phi$.
Example 1
Consider specification , where is a one-dimensional state. Intuitively, this formula specifies that eventually reaches region for at least one time step. Suppose we have a state trajectory of horizon 3. The robustness is . Since , and the value is a measure of the satisfaction margin. Note that both states and stayed within the specified region, but "more" satisfies the predicate by being closer to the center of the region, thereby achieving a higher robustness value than .
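To make the recursive quantitative semantics more concrete, the following minimal Python sketch evaluates the robustness of a few TLTL operators on a scalar trajectory. It assumes the usual max/min reading of the semantics (a predicate f(s) < c has robustness c - f(s_t), conjunction is a min, eventually is a max over time, always is a min over time). The helper names, the region (1, 3) and the sample trajectory are hypothetical and are not the values used in Example 1, and the time index here simply runs over all remaining states, which may differ from the exact index range of the formal definition.

```python
from typing import Callable, Sequence

Trajectory = Sequence[float]
# A robustness function maps (trajectory, start index t) to a real number.
Rho = Callable[[Trajectory, int], float]


def predicate_lt(f: Callable[[float], float], c: float) -> Rho:
    """Robustness of the predicate f(s) < c at time t: c - f(s_t)."""
    return lambda traj, t: c - f(traj[t])


def negation(r: Rho) -> Rho:
    return lambda traj, t: -r(traj, t)


def conjunction(*rs: Rho) -> Rho:
    return lambda traj, t: min(r(traj, t) for r in rs)


def disjunction(*rs: Rho) -> Rho:
    return lambda traj, t: max(r(traj, t) for r in rs)


def eventually(r: Rho) -> Rho:
    # Best margin achieved at any remaining time step.
    return lambda traj, t: max(r(traj, tp) for tp in range(t, len(traj)))


def always(r: Rho) -> Rho:
    # Worst margin over all remaining time steps.
    return lambda traj, t: min(r(traj, tp) for tp in range(t, len(traj)))


# Hypothetical region 1 < s < 3, written with "<" predicates only.
in_region = conjunction(
    predicate_lt(lambda s: s, 3.0),    # s < 3
    predicate_lt(lambda s: -s, -1.0),  # -s < -1, i.e. s > 1
)
trajectory = [0.5, 2.0, 2.9]  # hypothetical states

rho_eventually = eventually(in_region)(trajectory, 0)
rho_always = always(in_region)(trajectory, 0)
print(rho_eventually, rho_always)
```

In this toy run, both 2.0 and 2.9 lie inside the region, but 2.0 is closer to its center and therefore determines the robustness of the eventually formula, while the first state (outside the region) makes the always formula negative.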
II-B Markov Decision Process
In this section, we introduce the finite horizon infinite Markov decision process (MDP) and the semantics of a TLTL formula over an MDP. We start with the following definition:
Definition 1
A finite horizon infinite MDP is defined as a tuple $\langle S, A, p(\cdot|\cdot,\cdot) \rangle$, where $S$ is the continuous state space; $A$ is the continuous action space; $p(s'|s,a)$ is the conditional probability density of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$. We denote $T$ as the horizon of the MDP.
Given an MDP in Definition 1, a state trajectory of length $T$ (denoted $\tau = s_{0:T}$) can be produced. The semantics of a TLTL formula $\phi$ over $\tau$ can be evaluated with the robustness degree $\rho(\tau, \phi)$ defined in the previous section. A robustness $\rho(\tau, \phi) > 0$ implies that $\tau$ satisfies $\phi$, i.e. $\tau \models \phi$, and vice versa. In the next section, we will take advantage of this property and propose a policy search method that aims to maximize the expected robustness degree.
III Problem Formulation and Approach
We first formulate the problem of policy search with TLTL specification as follows:
Problem 1
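Given an MDP in Definition 1 and a TLTL formula $\phi$, find a stochastic policy $\pi(a|s)$ ($\pi$ determines a probability of taking action $a$ at state $s$) that maximizes the expected robustness degree
$$\pi^{\star} = \arg\max_{\pi} \mathbb{E}_{p^{\pi}(\tau)}\left[\rho(\tau, \phi)\right], \tag{2}$$
where the expectation is taken over the trajectory distribution $p^{\pi}(\tau)$ following policy $\pi$, i.e.
$$p^{\pi}(\tau) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t, a_t)\,\pi(a_t|s_t). \tag{3}$$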
In reinforcement learning, the transition function $p(s'|s,a)$ is unknown to the agent. The solution to Problem 1 learns a stochastic time-varying policy $\pi(a_t|s_t)$ [18], which is a conditional probability density function of action $a_t$ given the current state $s_t$ at time step $t$.
In this paper, the policy is a parameterized policy (also written as $\pi_\theta$ in short), where $\theta$ is used to represent the policy parameter. The objective defined in Equation (2) then becomes finding the optimal policy parameter $\theta^{\star}$ such that
$$\theta^{\star} = \arg\max_{\theta} \mathbb{E}_{p^{\pi_\theta}(\tau)}\left[\rho(\tau, \phi)\right]. \tag{4}$$
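The expectation in the objective above can be estimated by sampling trajectories under a policy and averaging their robustness. The sketch below illustrates this on a toy problem and is only an assumption-laden illustration: the 1-D dynamics `s' = s + a`, the Gaussian policy with a fixed mean action, the goal interval and all function names are hypothetical and do not come from the paper.

```python
import random


def rollout(mean_action, noise_std, horizon, rng):
    """Sample a 1-D state trajectory under toy dynamics s' = s + a."""
    s, traj = 0.0, [0.0]
    for _ in range(horizon):
        a = rng.gauss(mean_action, noise_std)  # stochastic policy (assumed)
        s = s + a                              # hypothetical dynamics
        traj.append(s)
    return traj


def robustness_eventually_goal(traj, goal=2.0, half_width=0.5):
    """Robustness of 'eventually |s - goal| < half_width'."""
    return max(half_width - abs(s - goal) for s in traj)


def expected_robustness(mean_action, n_samples=500, seed=0):
    """Monte Carlo estimate of the expected robustness under the policy."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        traj = rollout(mean_action, noise_std=0.1, horizon=5, rng=rng)
        total += robustness_eventually_goal(traj)
    return total / n_samples


towards_goal = expected_robustness(mean_action=0.5)
away_from_goal = expected_robustness(mean_action=-0.5)
print(towards_goal, away_from_goal)
```

A policy whose mean action drives the state toward the goal obtains a positive estimate, while one that moves away obtains a negative estimate, which is the ordering that the policy search objective exploits.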
To solve Problem 1, we introduce temporal logic policy search (TLPS), a model-free RL algorithm. At each iteration, a set of sample trajectories is collected under the current policy. Each sample trajectory is updated to a new one with a higher robustness degree by following the gradient of the robustness while also staying close to the sample so that the system dynamics are not violated. A new trajectory distribution is fitted to the set of updated trajectories. Each sample trajectory is then assigned a weight according to its probability under the updated distribution. Finally, the policy is updated with weighted maximum likelihood. This process ensures that each policy update results in a trajectory distribution with higher expected robustness than the current one. Details of TLPS will be discussed in the next section.
IV Temporal Logic Policy Search (TLPS)
Given a TLTL formula over predicates of , TLPS finds the parameters of a parametrized stochastic policy that maximizes the following objective function.
(5) 
where is defined in Equation (3).
In TLPS, we model the policy as a time-varying linear Gaussian, i.e. $\pi(a_t|s_t) = \mathcal{N}(K_t s_t + k_t, C_t)$, where $K_t$, $k_t$ and $C_t$ are the feedback gain, feedforward gain and covariance of the policy at time $t$ (a similar approach has been adopted in [19], [20]). The trajectory distribution in Equation (3) is modeled as a Gaussian where and .
At each iteration, sample trajectories are collected (denoted , ). For each sample trajectory , we find an updated trajectory by solving
(6) 
In the above equation, $\hat{\rho}$ is the log-sum-exp approximation of $\rho$. This is to take advantage of the many off-the-shelf nonlinear programming methods that require gradient information of the Lagrangian (sequential quadratic programming is used in our experiments). For the log-sum-exp approximation, we can show that the approximation error is bounded. In addition, the local ascending directions on the approximated surface coincide with those on the actual surface given mild constraints (these will be discussed in more detail in the next section). Equation (6) aims to find a new trajectory that achieves higher robustness. The constraint limits the deviation of the updated trajectory from the sample trajectory so that the system dynamics are not violated.
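The trajectory update step can be illustrated with a small numerical sketch. Note the substitutions: instead of sequential quadratic programming, it uses projected gradient ascent, and it assumes the proximity constraint is a Euclidean ball of radius `alpha` around the sample trajectory. The smooth predicate margin `radius**2 - (s - goal)**2`, the 1-D trajectory and all names and parameter values are hypothetical choices made only for illustration.

```python
import math


def smooth_robustness(traj, goal, radius, k):
    """Log-sum-exp approximation of max_t (radius^2 - (s_t - goal)^2)."""
    margins = [radius ** 2 - (s - goal) ** 2 for s in traj]
    m = max(margins)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(k * (r - m)) for r in margins)) / k


def smooth_robustness_grad(traj, goal, radius, k):
    """Gradient w.r.t. each state: softmax weight times margin derivative."""
    margins = [radius ** 2 - (s - goal) ** 2 for s in traj]
    m = max(margins)
    exps = [math.exp(k * (r - m)) for r in margins]
    z = sum(exps)
    return [(e / z) * (-2.0 * (s - goal)) for e, s in zip(exps, traj)]


def update_trajectory(sample, goal, radius, k, alpha, step=0.05, iters=200):
    """Projected gradient ascent; keeps ||updated - sample||_2 <= alpha."""
    updated = list(sample)
    for _ in range(iters):
        grad = smooth_robustness_grad(updated, goal, radius, k)
        updated = [s + step * g for s, g in zip(updated, grad)]
        delta = [u - s for u, s in zip(updated, sample)]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm > alpha:  # project back onto the ball around the sample
            updated = [s + d * alpha / norm for s, d in zip(sample, delta)]
    return updated


def true_robustness(traj, goal, radius):
    """Non-smoothed robustness of 'eventually reach the goal region'."""
    return max(radius ** 2 - (s - goal) ** 2 for s in traj)


sample = [0.0, 0.5, 1.0]  # hypothetical sampled trajectory
updated = update_trajectory(sample, goal=2.0, radius=0.5, k=5.0, alpha=0.5)
print(true_robustness(sample, 2.0, 0.5), true_robustness(updated, 2.0, 0.5))
```

In this toy setting the updated trajectory has a higher robustness than the sample while remaining within distance `alpha` of it, which is the qualitative behavior the constrained update is meant to provide.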
After we obtain a set of updated trajectories, an updated trajectory distribution is fitted using
(7) 
The last step is to update the policy. We will only be updating the feedforward terms and the covariance . The feedback terms are kept constant (the policy parameters are , ). This significantly reduces the number of parameters to be updated and increases the learning speed. For each sample trajectory, we obtain its probability under
(8) 
( is also written in short as ) where is the sample index. Using these probabilities, a normalized weighting for each sample trajectory is calculated using the softmax function ( is a parameter to be tuned). Finally, similar to [19], the policy is updated using weighted maximum likelihood (Equation (9)).
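The weighting and policy update steps described in this section can be sketched as follows. This is a minimal illustration under stated assumptions rather than the update used in the experiments: each sample is summarized by a scalar feedforward term, the softmax is applied to generic per-sample scores with a temperature named `eta` (a hypothetical name), and the weighted maximum-likelihood fit is that of a scalar Gaussian.

```python
import math


def softmax_weights(scores, eta):
    """Normalized weights w_i proportional to exp(eta * score_i)."""
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(eta * (s - m)) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def weighted_gaussian_fit(samples, weights):
    """Weighted maximum-likelihood mean and variance of scalar samples."""
    mean = sum(w * x for w, x in zip(weights, samples))
    var = sum(w * (x - mean) ** 2 for w, x in zip(weights, samples))
    return mean, var


# Hypothetical per-sample feedforward terms and trajectory scores.
feedforward_samples = [0.0, 1.0, 2.0]
scores = [0.0, 1.0, 2.0]

weights = softmax_weights(scores, eta=1.0)
new_mean, new_var = weighted_gaussian_fit(feedforward_samples, weights)
print(weights, new_mean, new_var)
```

Samples with higher scores receive larger weights, so the fitted mean moves toward them. Setting `eta` to zero recovers uniform weights, and larger values concentrate the update on the best samples.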
V Robustness Smoothing
In the TLPS algorithm introduced in the previous section, one of the steps requires solving a constrained optimization problem that maximizes the robustness (Equation (6)). The original robustness definition in Section II-A is non-differentiable and thus rules out many efficient gradient-based methods. In this section we adopt a smooth approximation of the robustness function using log-sum-exp. Specifically,
$$\max(x_1, \ldots, x_n) \approx \frac{1}{k}\log\sum_{i=1}^{n} e^{k x_i}, \qquad \min(x_1, \ldots, x_n) \approx -\frac{1}{k}\log\sum_{i=1}^{n} e^{-k x_i}, \tag{10}$$
where $k$ is a smoothness parameter. We denote an iterative max-min function as
where . mami denotes a function as where is an operator such that . and are indices of the functions in mami and can be any positive integer. As we showed in Section II-A, any robustness function can be expressed as an iterative max-min function.
Following the log-sum-exp approximation, any iterative max-min function (i.e., the robustness of any TL formula) can be approximated as follows
where if and if . In the remainder of this section, we provide three lemmas that show the following:

the approximation error between the max-min function and its smooth approximation approaches zero as the smoothness parameter tends to infinity. This error is always bounded in terms of the logarithm of the number of terms in the max-min function, which is determined by the number of predicates in the TL formula and the horizon of the problem. Tuning the smoothness parameter trades off differentiability of the robustness function against approximation error,

despite the error introduced by the approximation, the optimal points remain invariant (i.e., a maximizer of the original function is also a maximizer of its approximation). This result guarantees that the optimal policy is unchanged when using the approximated TL reward,

even though the log-sum-exp approximation smooths the robustness function, locally the ascending directions of the original and the approximated functions can be tuned to coincide with small error, and the deviation is controlled by the smoothness parameter. As many policy search methods are local methods that improve the policy near samples, it is important to ensure that the ascending direction of the approximated TL reward does not oppose that of the real one.
Due to space constraints, we will only provide sketches of the proofs for the lemmas.
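The first property above can be checked numerically for a single max or min. The sketch below implements the standard log-sum-exp smoothing with a smoothness parameter `k` and compares the approximation error against the well-known bound log(n)/k for n terms. The input values are hypothetical, and the sketch covers only one level of max or min, not a nested max-min function.

```python
import math


def smooth_max(values, k):
    """(1/k) * log(sum(exp(k * v))), computed stably by shifting by the max."""
    m = max(values)
    return m + math.log(sum(math.exp(k * (v - m)) for v in values)) / k


def smooth_min(values, k):
    """Smooth minimum obtained from the identity min(v) = -max(-v)."""
    return -smooth_max([-v for v in values], k)


values = [0.3, -1.2, 0.25, 0.8]  # hypothetical robustness terms
for k in (1.0, 10.0, 100.0):
    err_max = smooth_max(values, k) - max(values)
    err_min = min(values) - smooth_min(values, k)
    bound = math.log(len(values)) / k
    print(k, err_max, err_min, bound)
```

The smooth maximum never underestimates the true maximum, the smooth minimum never overestimates the true minimum, and both errors shrink as `k` grows, at the cost of a surface that becomes closer to non-differentiable.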
Lemma 1
Let be the number of terms of , and satisfy
where and .
For simplicity and without loss of generality, we illustrate the proof of Lemma 1 by constructing an approximation for a finite max-min-max problem
Let , , , and , , . Firstly, we define . Straightforward algebraic manipulation reveals that
(11)  
Furthermore, let us define , we have
By substituting into Equation (11), we obtain
Multiplying on both sides, then
Finally, let , then we have
(12) 
Substitute into Equation (12)
Then we conclude the proof.
Lemma 2
Suppose , there exists a positive constant such that for all is also one of the maximum points of for any , i.e.
We start by considering as a maximum function, i.e. . Let us denote , then when
There always exists a positive constant , such that for all the above statement holds. Lemma 2 can be obtained by using the above proof for the mami function in general.
Lemma 3
Let us denote the subgradient of as and the gradient of as . There exists a positive constant such that for all , and satisfy
where denotes the inner product.
Here we will only provide the proof when is a pointwise maximum of convex functions. One can generalize it to any iterative max-min function using the chain rule. Supposing , the subgradient of is
where is the set of "active" functions. The corresponding is defined as
where its first order derivative is
if
Therefore, there always exists a positive constant , such that holds for all .
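The kind of statement made by Lemma 3 can be illustrated numerically in one dimension. The sketch assumes that the lemma asserts a positive inner product between a subgradient of the pointwise maximum and the gradient of its log-sum-exp approximation once the smoothness parameter is large enough. The two linear functions and the evaluation point are hypothetical.

```python
import math


def smooth_max_gradient(slopes, x, k):
    """Gradient of (1/k) log sum_i exp(k * a_i * x) for linear f_i(x) = a_i x."""
    values = [a * x for a in slopes]
    m = max(values)
    exps = [math.exp(k * (v - m)) for v in values]
    z = sum(exps)
    # Softmax-weighted combination of the individual gradients a_i.
    return sum((e / z) * a for e, a in zip(exps, slopes))


slopes = [1.0, -3.0]  # hypothetical f_1(x) = x and f_2(x) = -3x
x = 1.0               # f_1 is the only active function at this point
subgradient = 1.0     # slope of the active function f_1

for k in (0.1, 2.0, 20.0):
    # In one dimension the inner product is a plain product.
    print(k, subgradient * smooth_max_gradient(slopes, x, k))
```

For a small `k` the inactive function contributes enough weight to flip the sign of the smoothed gradient, while for larger `k` the product becomes positive and approaches the squared norm of the subgradient, in line with the existence of a threshold constant.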
VI Case Studies
In this section, we apply TLPS on a vehicle navigation example. As shown in Figure 1, the vehicle navigates in a 2D environment. It has a 6 dimensional continuous state feature space where is the position of its center and is the angle its heading makes with the axis. Its 2 dimensional action space consists of the forward driving speed and the steering angle of its front wheels. The car moves according to dynamics
(13) 
with added Gaussian noise ( is the distance between the front and rear axles). However, the learning agent is not provided with this model and needs to learn the desired control policy through trial and error.
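For readers who want to reproduce a comparable simulation, the sketch below implements one Euler step of a standard kinematic car model driven by forward speed and front-wheel steering angle, with optional additive Gaussian noise. This is an assumption: it is a commonly used model that matches the verbal description (speed, steering angle, axle distance), not necessarily the exact form of Equation (13). It only tracks the pose (x, y, heading), and the names `car_step`, `wheelbase` and `dt` are hypothetical.

```python
import math
import random


def car_step(state, action, wheelbase, dt, noise_std=0.0, rng=None):
    """One Euler step of a kinematic car model with optional Gaussian noise."""
    x, y, heading = state
    speed, steering = action
    dx = speed * math.cos(heading)
    dy = speed * math.sin(heading)
    dheading = speed / wheelbase * math.tan(steering)
    # Additive noise on each pose component; zero when no generator is given.
    noise = [rng.gauss(0.0, noise_std) if rng is not None else 0.0
             for _ in range(3)]
    return (
        x + dt * dx + noise[0],
        y + dt * dy + noise[1],
        heading + dt * dheading + noise[2],
    )


# Drive straight for one time unit, without noise.
state = (0.0, 0.0, 0.0)
for _ in range(10):
    state = car_step(state, action=(1.0, 0.0), wheelbase=2.0, dt=0.1)
print(state)

# A single noisy step with a positive steering angle.
noisy = car_step((0.0, 0.0, 0.0), (1.0, 0.3), 2.0, 0.1,
                 noise_std=0.01, rng=random.Random(0))
print(noisy)
```

In a model-free setting such a function would only be used inside the simulator to generate samples. The learning agent itself never queries it directly.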
We test TLPS on two tasks with increasing difficulty. In the first task, the vehicle is required to reach the goal while avoiding the obstacle . We express this task as a TLTL specification
(14) 
In Equation (14), defines the square shaped goal region, is the Euclidean distance between the vehicle’s center and the center of the obstacle, is the radius of the obstacle. In English, the specification describes the task of "eventually reach the goal and always stay away from the obstacle". Using the quantitative semantics described in Section II-A, the robustness of the specification is
(15) 
where and are the vehicle position and distance to obstacle center at time . Using log-sum-exp, an approximation of the robustness can be obtained as
(16) 
Because we used the same smoothness parameter throughout the approximation, intermediate $\log$ and $\exp$ operations cancel and we end up with Equation (16). The approximated robustness is used in the optimization problem defined in Equation (6).
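The structure of the robustness for the first task and of its smoothed counterpart can be sketched as follows, under assumptions: the goal is taken to be an axis-aligned square given by hypothetical bounds, the obstacle is a circle, and the robustness is read as min(max over time of the goal margin, min over time of the obstacle margin). The smoothed version nests log-sum-exp operators with a single smoothness parameter `k` rather than using a simplified closed form, and all numeric values are made up.

```python
import math


def smooth_max(values, k):
    m = max(values)
    return m + math.log(sum(math.exp(k * (v - m)) for v in values)) / k


def smooth_min(values, k):
    return -smooth_max([-v for v in values], k)


def goal_margins(p, goal):
    """Signed margins of point p w.r.t. an axis-aligned square goal."""
    (x, y), (x_lo, x_hi, y_lo, y_hi) = p, goal
    return [x - x_lo, x_hi - x, y - y_lo, y_hi - y]


def obstacle_margin(p, center, radius):
    """Distance from p to the obstacle center minus the obstacle radius."""
    return math.hypot(p[0] - center[0], p[1] - center[1]) - radius


def robustness(traj, goal, center, radius):
    reach = max(min(goal_margins(p, goal)) for p in traj)           # eventually
    avoid = min(obstacle_margin(p, center, radius) for p in traj)   # always
    return min(reach, avoid)                                        # conjunction


def smooth_robustness(traj, goal, center, radius, k):
    reach = smooth_max([smooth_min(goal_margins(p, goal), k) for p in traj], k)
    avoid = smooth_min([obstacle_margin(p, center, radius) for p in traj], k)
    return smooth_min([reach, avoid], k)


goal = (4.0, 5.0, 4.0, 5.0)               # hypothetical square goal
center, radius = (2.0, 2.0), 1.0          # hypothetical obstacle
good = [(0.0, 0.0), (0.0, 2.0), (1.0, 4.0), (3.0, 4.5), (4.5, 4.5)]
bad = [(0.0, 0.0), (2.0, 2.0), (4.5, 4.5)]  # passes through the obstacle

print(robustness(good, goal, center, radius),
      smooth_robustness(good, goal, center, radius, k=50.0))
print(robustness(bad, goal, center, radius))
```

The trajectory that reaches the goal while staying clear of the obstacle has positive robustness, the one that crosses the obstacle center has negative robustness, and for a large `k` the smoothed value stays close to the exact one.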
In task 2, the vehicle is required to visit goals 1, 2, 3 in this specific order while avoiding the obstacle. Expressing this task in TLTL results in the specification
(17) 
where is a shorthand for a sequence of conjunctions, and are the predicates for goal . In English, the specification states "visit goal 1, then goal 2, then goal 3, and don't visit goal 2 or goal 3 until visiting goal 1, and don't visit goal 3 until visiting goal 2, and always if a goal is visited, this implies next always don't visit that goal (don't revisit goals), and always avoid the obstacle". Due to space constraints, the robustness of this specification and its approximation will not be explicitly presented, but they take a similar form of nested functions that can be generated from the quantitative semantics of TLTL.
During training time, the obstacle is considered "penetrable" in that the car can cross its boundary, with a negative reward granted according to the penetration depth. In practice we find that this facilitates learning compared to giving a single negative reward at contact with the obstacle and restarting the episode.
Each episode has a horizon timesteps. 40 episodes of sample trajectories are collected and used for each update iteration. The policy parameters are initialized randomly in a given region (the policy covariances should be initialized to relatively high values to encourage exploration). Each task is trained for 50 iterations and the results are presented in Figures 2 and 3. Figure 2 shows sample trajectory distributions for selected iterations. Trajectory distributions are illustrated as shaded regions with width equal to 2 standard deviations. A lighter shade indicates an earlier time in the training process. We used for this set of results. We can see from Figure 2 that the trajectory distribution is able to converge and satisfy the specification. Satisfaction occurs much sooner for task 1 (around 30 iterations) compared with task 2 (around 50 iterations).
Figure 3 compares the average robustness (of 40 sample trajectories) per iteration for TLPS with different values of the approximation parameter in Equation (10). As a baseline, we also compare TLPS with episode-based relative entropy policy search (REPS) [18]. The original robustness function is used as the terminal reward for REPS, and our previous work [17] has shown that this combination outperforms heuristic rewards designed for the same robotic control task. The magnitude of the robustness value changes with the approximation parameter. Therefore, in order for the comparison to be meaningful (putting average returns on the same scale), sample trajectories collected for each comparison case are used to calculate their original robustness values against the TLTL formula and plotted in Figure 3 (a similar approach is taken in [17]). The original robustness is chosen as the comparison measure for its semantic integrity (a value greater than zero indicates satisfaction).
Results in Figure 3 show that a larger smoothness parameter results in faster convergence and a higher average return. This is consistent with the results of Section V, since a larger smoothness parameter indicates lower approximation error. However, this advantage diminishes as the parameter increases, due to the approximated robustness function losing differentiability. For the relatively easy task 1, TLPS performed comparably to REPS. However, for the harder task 2, TLPS exhibits a clear advantage both in terms of rate of convergence and quality of the learned policy.
TLPS is a local policy search method that offers gradual policy improvement, controllable policy space exploration and smooth trajectories. These characteristics are desirable for learning control policies for systems that involve physical interactions with the environment (likewise for other local RL methods). Results in Figure 3 show a rapid exploration decay in the first 10 iterations, and little improvement is seen after the iteration. During experiments, the authors find that adding a policy covariance damping schedule can help with initial exploration and final convergence. A principled exploration strategy is possible future work.
Similar to many policy search methods, TLPS is a local method. Therefore, policy initialization is a critical aspect of the algorithm (compared with value-based methods such as Q-learning). In addition, because the trajectory update step in Equation (6) does not consider the system dynamics and relies on being close to sample trajectories, divergence can occur with a small or a large learning rate. Making the algorithm more robust to hyperparameter changes is also an important future direction.
VII Conclusion
As reinforcement learning research advances and more general RL agents are developed, it becomes increasingly important that we are able to correctly communicate our intentions to the learning agent. A well-designed RL agent will be proficient at finding a policy that maximizes its returns, which means it will exploit any flaws in the reward function that can help it achieve this goal. Human intervention can sometimes help alleviate this problem by providing additional feedback. However, as discussed in [8], if the communication link between the human and the agent is unstable (space exploration missions) or the agent operates on a timescale difficult for humans to respond to (financial trading agents), it is critical that we are confident about what the agent will learn.
In this paper, we applied temporal logic as the task specification language for reinforcement learning. The quantitative semantics of TL is adopted for accurate expression of logical relationships in an RL task. We explored robustness smoothing as a means to transform the TL robustness into a differentiable function and provided theoretical results on the properties of the smoothed robustness. We proposed temporal logic policy search (TLPS), a model-free method that utilizes the smoothed robustness and operates in continuous state and action spaces. Simulation experiments were conducted to show that TLPS is able to effectively find control policies that satisfy given TL specifications.
References
 [1] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
 [2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. V. D. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, and K. Kavukcuoglu, “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7585, pp. 484–489, 2016. [Online]. Available: http://dx.doi.org/10.1038/nature16961
 [3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-End Training of Deep Visuomotor Policies,” arXiv, p. 6922, 2015. [Online]. Available: http://arxiv.org/abs/1504.00702
 [4] S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection,” arXiv, 2016. [Online]. Available: http://arxiv.org/abs/1603.02199v1
 [5] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” arXiv preprint arXiv:1610.00633, 2016.
 [6] J. Peters, S. Vijayakumar, and S. Schaal, “Reinforcement learning for humanoid robotics,” in Proceedings of the third IEEE-RAS international conference on humanoid robots, 2003, pp. 1–20.
 [7] D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané, “Concrete Problems in AI Safety,” pp. 1–29, 2016. [Online]. Available: http://arxiv.org/abs/1606.06565
 [8] D. Dewey, “Reinforcement learning and the reward engineering principle,” in 2014 AAAI Spring Symposium Series, 2014.
 [9] I. Arel, “The threat of a reward-driven adversarial artificial general intelligence,” in Singularity Hypotheses. Springer, 2012, pp. 43–60.

 [10] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” Sixteenth International Conference on Machine Learning, vol. 3, pp. 278–287, 1999.
 [11] A. Ng and S. Russell, “Algorithms for inverse reinforcement learning,” Proceedings of the Seventeenth International Conference on Machine Learning, vol. 0, pp. 663–670, 2000. [Online]. Available: http://wwwcs.stanford.edu/people/ang/papers/icml00irl.pdf
 [12] P. Sermanet, K. Xu, and S. Levine, “Unsupervised perceptual rewards for imitation learning,” arXiv preprint arXiv:1612.06699, 2016.
 [13] K. Leahy, D. Zhou, C.-I. Vasile, K. Oikonomopoulos, M. Schwager, and C. Belta, “Persistent surveillance for unmanned aerial vehicles subject to charging and temporal logic constraints,” Autonomous Robots, vol. 40, no. 8, pp. 1363–1378, 2016.
 [14] S. Sadraddini and C. Belta, “A provably correct MPC approach to safety control of urban traffic networks,” in American Control Conference (ACC), 2016. IEEE, 2016, pp. 1679–1684.
 [15] S. Coogan, E. A. Gol, M. Arcak, and C. Belta, “Traffic network control from temporal logic specifications,” IEEE Transactions on Control of Network Systems, vol. 3, no. 2, pp. 162–172, 2016.
 [16] D. Aksaray, A. Jones, Z. Kong, M. Schwager, and C. Belta, “Q-learning for robust satisfaction of signal temporal logic specifications,” in Decision and Control (CDC), 2016 IEEE 55th Conference on. IEEE, 2016, pp. 6565–6570.
 [17] X. Li, C.-I. Vasile, and C. Belta, “Reinforcement learning with temporal logic rewards,” IEEE International Conference on Intelligent Robots and Systems, 2017.
 [18] M. P. Deisenroth, “A Survey on Policy Search for Robotics,” Foundations and Trends in Robotics, vol. 2, no. 1, pp. 1–142, 2011. [Online]. Available: http://www.nowpublishers.com/articles/foundationsandtrendsinrobotics/ROB021
 [19] Y. Chebotar, M. Kalakrishnan, A. Yahya, A. Li, S. Schaal, and S. Levine, “Path integral guided policy search,” arXiv preprint arXiv:1610.00529, 2016.
 [20] W. H. Montgomery and S. Levine, “Guided policy search via approximate mirror descent,” in Advances in Neural Information Processing Systems, 2016, pp. 4008–4016.
 [21] F. Stulp and O. Sigaud, “Path integral policy improvement with covariance matrix adaptation,” ICML, 2012.