A Policy Search Method For Temporal Logic Specified Reinforcement Learning Tasks

09/27/2017, by Xiao Li, et al.

Reward engineering is an important aspect of reinforcement learning. Whether or not the user's intentions can be correctly encapsulated in the reward function can significantly impact the learning outcome. Current methods rely on manually crafted reward functions that often require parameter tuning to obtain the desired behavior. This operation can be expensive when exploration requires systems to interact with the physical world. In this paper, we explore the use of temporal logic (TL) to specify tasks in reinforcement learning. A TL formula can be translated to a real-valued function that measures its level of satisfaction against a trajectory. We take advantage of this function and propose temporal logic policy search (TLPS), a model-free learning technique that finds a policy satisfying the TL specification. A set of simulated experiments is conducted to evaluate the proposed approach.


I Introduction

Reinforcement learning (RL) has enjoyed groundbreaking success in recent years, ranging from playing Atari games at super-human level [1] and playing competitively with world champions in the game of Go [2] to generating visuomotor control policies for robots [3], [4]. Despite much effort being put into developing sample-efficient algorithms, an important aspect of RL remains less explored. The reward function is the window for designers to specify the desired behavior and impose important constraints on the system. While most reward functions used in the current RL literature are based on heuristics for relatively simple tasks, real-world applications typically involve tasks that are logically more complex.

Commonly adopted reward functions take the form of a linear combination of basis functions (often quadratic) [5]. This type of reward function has limited expressiveness and is semantically ambiguous because of its dependence on a set of weights. Reward functions of this form have been used to successfully learn high-dimensional control tasks such as humanoid walking [6] and multiple household tasks (e.g. placing coat-hangers, twisting bottle caps, etc.) [3]. However, parameter tuning of the reward function is required, and this iteration is expensive for robotic tasks. Moreover, these tasks are logically straightforward in that there is little logical interaction between sub-tasks (such as sequencing, conjunction/disjunction, implication, etc.).

Consider the harder task of learning to use an oven. The agent is required to perform a series of sub-tasks in the correct sequence (set temperature and timer → preheat → open oven door → place item in oven → close oven door). In addition, the agent has to make the simple decision of when to open the oven door and place the item (i.e. preheat finished implies open oven door). Tasks like this are commonly found in household environments (using the microwave, refrigerator or even a drawer), and a function that correctly maps the desired behavior to a real-valued reward can be difficult to design. If the semantics of the reward function cannot be guaranteed, then an increase in the expected return will not necessarily represent better satisfaction of the task specification. This is referred to as reward hacking by [7].

Reward engineering has been briefly explored in the reinforcement learning literature. Authors of [8] and [9] provide general formalisms for reward engineering and discuss its significance. Authors of [10] proposed potential-based reward shaping and proved policy invariance under this type of reward transformation. Another line of work aims to infer a reward function from demonstration. This idea is called inverse reinforcement learning and is explored by [11] and [12].

In this paper, we adopt the expressive power of temporal logic and use it as a task specification language for reinforcement learning in continuous state and action spaces. Its quantitative semantics (also referred to as robustness degree, or simply robustness) translates a TL formula to a real-valued function that can be used as the reward. By definition of the quantitative semantics, a robustness value greater than zero guarantees satisfaction of the temporal logic specification.

Temporal logic (TL) has been adopted as the specification language for a wide variety of control tasks. Authors of [13] use linear temporal logic (LTL) to specify a persistent surveillance task carried out by aerial robots. Similarly, [14] and [15] applied LTL to traffic network control. Application of TL in reinforcement learning has been less investigated. [16] combined signal temporal logic (STL) with Q-learning while also adopting the log-sum-exp approximation of robustness. However, their focus is on discrete state and action spaces, and they ensured satisfiability by expanding the state space to a history-dependent state space. This does not scale well for large or continuous state-action spaces, which is often the case for control tasks.

Our main contributions in this paper are as follows:

  • we present a model-free policy search algorithm, which we call temporal logic policy search (TLPS), that takes advantage of the robustness function to facilitate learning. We show that an optimal parameterized policy that maximizes the robustness can be obtained by solving a constrained optimization,

  • a smooth approximation of the robustness degree is proposed, which is necessary for obtaining the gradients of the objective and constraints. We prove that using the smoothed robustness as the reward provides similar semantic guarantees to the original robustness definition while providing a significant speedup in learning,

  • finally, we demonstrate the performance of the proposed approach using simulated navigation tasks.

II Preliminaries

II-A Truncated Linear Temporal Logic (TLTL)

In this section, we provide definitions for TLTL (refer to our previous work [17] for a more elaborate discussion of TLTL). A TLTL formula is defined over predicates of the form $f(s) < c$, where $f$ is a function of the state $s$ and $c$ is a constant. We express the task as a TLTL formula with the following syntax:

$\phi := \top \mid f(s) < c \mid \neg\phi \mid \phi \wedge \psi \mid \phi \vee \psi \mid \Diamond\phi \mid \Box\phi \mid \phi\,\mathcal{U}\,\psi \mid \phi\,\mathcal{T}\,\psi \mid \bigcirc\phi \mid \phi \Rightarrow \psi \qquad (1)$

where $\top$ is the Boolean constant true, $f(s) < c$ is a predicate, $\neg$ (negation/not), $\wedge$ (conjunction/and), and $\vee$ (disjunction/or) are Boolean connectives, and $\Diamond$ (eventually), $\Box$ (always), $\mathcal{U}$ (until), $\mathcal{T}$ (then), and $\bigcirc$ (next) are temporal operators. Implication is denoted by $\Rightarrow$. TLTL formulas are evaluated against finite time sequences of states $s_{0:T}$.

We denote $s_t$ to be the state at time $t$, and $s_{t:t+k}$ to be a sequence of states (state trajectory) from time $t$ to $t+k$, i.e., $s_{t:t+k} = s_t s_{t+1} \cdots s_{t+k}$. The Boolean semantics of TLTL is defined recursively over the syntax in (1) (we refer the reader to [17] for the complete definition).

Intuitively, a state trajectory $s_{t:t+k} \models \Box\phi$ (reads $s_{t:t+k}$ satisfies $\Box\phi$) if the specification defined by $\phi$ is satisfied for every subtrajectory $s_{t':t+k}$, $t' \in [t, t+k)$. Similarly, $s_{t:t+k} \models \Diamond\phi$ if $\phi$ is satisfied for at least one subtrajectory. $s_{t:t+k} \models \phi\,\mathcal{U}\,\psi$ if $\phi$ is satisfied at every time step before $\psi$ is satisfied, and $\psi$ is satisfied at a time between $t$ and $t+k$. $s_{t:t+k} \models \phi\,\mathcal{T}\,\psi$ if $\phi$ is satisfied at least once before $\psi$ is satisfied between $t$ and $t+k$. A trajectory of duration $T$ is said to satisfy formula $\phi$ if $s_{0:T} \models \phi$.

TLTL is equipped with quantitative semantics (robustness degree) $\rho(s_{t:t+k}, \phi)$, i.e., a real-valued function that indicates how far $s_{t:t+k}$ is from satisfying or violating the specification $\phi$. We define the task satisfaction measurement $\rho(s_{0:T}, \phi)$, which is recursively expressed as:

$\rho(s_{t:t+k}, \top) = \rho_{\max}$,
$\rho(s_{t:t+k}, f(s) < c) = c - f(s_t)$,
$\rho(s_{t:t+k}, \neg\phi) = -\rho(s_{t:t+k}, \phi)$,
$\rho(s_{t:t+k}, \phi \wedge \psi) = \min\big(\rho(s_{t:t+k}, \phi),\, \rho(s_{t:t+k}, \psi)\big)$,
$\rho(s_{t:t+k}, \phi \vee \psi) = \max\big(\rho(s_{t:t+k}, \phi),\, \rho(s_{t:t+k}, \psi)\big)$,
$\rho(s_{t:t+k}, \Box\phi) = \min_{t' \in [t, t+k)} \rho(s_{t':t+k}, \phi)$,
$\rho(s_{t:t+k}, \Diamond\phi) = \max_{t' \in [t, t+k)} \rho(s_{t':t+k}, \phi)$,

with the remaining operators defined analogously (see [17]). Here $\rho_{\max}$ represents the maximum robustness value. Moreover, $\rho(s_{t:t+k}, \phi) > 0 \Rightarrow s_{t:t+k} \models \phi$ and $\rho(s_{t:t+k}, \phi) < 0 \Rightarrow s_{t:t+k} \not\models \phi$, which implies that the robustness degree can substitute the Boolean semantics in order to enforce the specification $\phi$.

Example 1

Consider a specification $\phi = \Diamond \psi$, where $\psi$ is a predicate requiring the one dimensional state $s$ to lie within a given region. Intuitively, this formula specifies that $s$ eventually reaches the region for at least one time step. Suppose we have a state trajectory $s_{0:2} = s_0 s_1 s_2$ of horizon 3. The robustness is $\rho(s_{0:2}, \phi) = \max_{t \in \{0,1,2\}} \rho(s_t, \psi)$. A robustness value greater than zero indicates that $s_{0:2} \models \phi$, and its magnitude is a measure of the satisfaction margin. Note that if two states both stay within the specified region, the one closer to the center of the region "more" satisfies the predicate and thereby achieves a higher robustness value.
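To make the quantitative semantics concrete, the following Python sketch evaluates the robustness of an eventually-reach-a-region formula over a short one-dimensional trajectory. The region bounds and trajectory values are hypothetical illustrations, not numbers from the example above.

```python
import numpy as np

def pred_robustness(s, a, b):
    # rho(s, a < s AND s < b) = min(s - a, b - s): positive iff s lies inside (a, b)
    return min(s - a, b - s)

def eventually_robustness(traj, a, b):
    # rho(s_{0:T}, Eventually(a < s < b)) = max over time of the predicate robustness
    return max(pred_robustness(s, a, b) for s in traj)

# Hypothetical horizon-3 trajectory and region (a, b) = (1.0, 3.0)
traj = np.array([0.2, 1.4, 2.1])
print(eventually_robustness(traj, 1.0, 3.0))
# Positive result, so the trajectory satisfies the formula; the state 2.1 is closer
# to the region's center than 1.4, so it provides the larger (maximizing) margin.
```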

II-B Markov Decision Process

In this section, we introduce the finite horizon infinite Markov decision process (MDP) and the semantics of a TLTL formula over an MDP. We start with the following definition:

Definition 1

A finite horizon infinite MDP is defined as a tuple $\langle S, A, p(\cdot \mid \cdot, \cdot) \rangle$, where $S \subseteq \mathbb{R}^n$ is the continuous state space; $A \subseteq \mathbb{R}^m$ is the continuous action space; $p(s_{t+1} \mid s_t, a_t)$ is the conditional probability density of taking action $a_t \in A$ at state $s_t \in S$ and ending up in state $s_{t+1} \in S$. We denote $T$ as the horizon of the MDP.

Given an MDP in Definition 1, a state trajectory of length $T$ (denoted $s_{0:T}$) can be produced. The semantics of a TLTL formula $\phi$ over $s_{0:T}$ can be evaluated with the robustness degree defined in the previous section: $\rho(s_{0:T}, \phi) > 0$ implies that $s_{0:T}$ satisfies $\phi$, i.e. $s_{0:T} \models \phi$, and vice versa. In the next section, we will take advantage of this property and propose a policy search method that aims to maximize the expected robustness degree.

III Problem Formulation and Approach

We first formulate the problem of policy search with TLTL specification as follows:

Problem 1

Given an MDP in Definition 1 and a TLTL formula $\phi$, find a stochastic policy $\pi(a_t \mid s_t)$ ($\pi$ determines the probability of taking action $a_t$ at state $s_t$) that maximizes the expected robustness degree

$\pi^\star = \arg\max_{\pi} \; \mathbb{E}_{p^{\pi}(s_{0:T})}\big[\rho(s_{0:T}, \phi)\big], \qquad (2)$

where the expectation is taken over the trajectory distribution $p^{\pi}(s_{0:T})$ following policy $\pi$, i.e.

$p^{\pi}(s_{0:T}) = p(s_0) \prod_{t=0}^{T-1} \int_{A} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)\, da_t. \qquad (3)$

In reinforcement learning, the transition function $p(s_{t+1} \mid s_t, a_t)$ is unknown to the agent. The solution to Problem 1 learns a stochastic time-varying policy $\pi_t(a_t \mid s_t)$ [18], which is a conditional probability density function of action $a_t$ given the current state $s_t$ at time step $t$.

In this paper, a parameterized policy $\pi_\theta(a_t \mid s_t)$ (also written as $\pi_\theta$ for short) is used, where $\theta$ denotes the policy parameters. The objective defined in Equation (2) then becomes finding the optimal policy parameters $\theta^\star$ such that

$\theta^\star = \arg\max_{\theta} \; \mathbb{E}_{p^{\pi_\theta}(s_{0:T})}\big[\rho(s_{0:T}, \phi)\big]. \qquad (4)$

To solve Problem 1, we introduce temporal logic policy search (TLPS), a model-free RL algorithm. At each iteration, a set of sample trajectories is collected under the current policy. Each sample trajectory is updated to a new one with a higher robustness degree by following the gradient of the (smoothed) robustness while also staying close to the sample so that the system dynamics are not violated. A new trajectory distribution is fitted to the set of updated trajectories. Each sample trajectory is then assigned a weight according to its probability under the updated distribution. Finally, the policy is updated with weighted maximum likelihood. This process ensures that each policy update results in a trajectory distribution with higher expected robustness than the current one. Details of TLPS will be discussed in the next section.

As introduced in Section II-A, the robustness degree consists of nested max/min functions and is non-differentiable, so calculating its gradient directly is not possible. In Section V, we discuss the use of log-sum-exp to approximate the robustness function and provide proofs of some properties of the approximated robustness.

IV Temporal Logic Policy Search (TLPS)

Given a TLTL formula $\phi$ over predicates of the state $s$, TLPS finds the parameters $\theta$ of a parametrized stochastic policy $\pi_\theta$ that maximize the following objective function

$J(\theta) = \mathbb{E}_{p^{\pi_\theta}(s_{0:T})}\big[\rho(s_{0:T}, \phi)\big], \qquad (5)$

where $p^{\pi_\theta}(s_{0:T})$ is defined in Equation (3).

In TLPS, we model the policy as a time-varying linear Gaussian, i.e. $\pi_t(a_t \mid s_t) = \mathcal{N}(K_t s_t + k_t, \Sigma_t)$, where $K_t$, $k_t$ and $\Sigma_t$ are the feedback gain, feed-forward gain and covariance of the policy at time $t$ (a similar approach has been adopted in [19], [20]). The trajectory distribution in Equation (3) is then modeled as a Gaussian $p^{\pi_\theta}(s_{0:T}) = \mathcal{N}(\mu_\tau, \Sigma_\tau)$, where $\mu_\tau$ and $\Sigma_\tau$ are the mean and covariance of the state trajectory under the current policy.
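As a concrete illustration, the snippet below samples an action from such a time-varying linear Gaussian policy. The dimensions, gains, and covariance values are placeholders for illustration and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, T = 6, 2, 200

# One (K_t, k_t, Sigma_t) triple per time step: feedback gain, feed-forward gain, covariance.
K = [np.zeros((action_dim, state_dim)) for _ in range(T)]
k = [np.zeros(action_dim) for _ in range(T)]
Sigma = [0.5 * np.eye(action_dim) for _ in range(T)]  # relatively large covariance encourages exploration

def sample_action(s, t):
    """Draw a_t ~ N(K_t s + k_t, Sigma_t)."""
    mean = K[t] @ s + k[t]
    return rng.multivariate_normal(mean, Sigma[t])

a0 = sample_action(np.zeros(state_dim), 0)
```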

At each iteration, $N$ sample trajectories are collected (denoted $\tau^i$, $i = 1, \dots, N$). For each sample trajectory $\tau^i$, we find an updated trajectory $\bar\tau^i$ by solving

$\bar\tau^i = \arg\max_{\tau} \; \hat\rho(\tau, \phi) \quad \text{subject to} \quad \lVert \tau - \tau^i \rVert^2 \le \epsilon. \qquad (6)$

In the above equation, $\hat\rho$ is the log-sum-exp approximation of $\rho$. This is to take advantage of the many off-the-shelf nonlinear programming methods that require gradient information of the Lagrangian (sequential quadratic programming is used in our experiments). Using the log-sum-exp approximation, we can show that the approximation error is bounded. In addition, the local ascending directions on the approximated surface coincide with those on the actual surface under mild conditions (these will be discussed in more detail in the next section). Equation (6) aims to find a new trajectory that achieves higher robustness. The constraint limits the deviation of the updated trajectory from the sample trajectory so that the system dynamics are not violated.
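A minimal sketch of this trajectory-update step is given below, assuming a generic smoothed robustness function and a simple Euclidean deviation constraint; it uses SciPy's SLSQP solver as a stand-in for the sequential quadratic programming routine mentioned above. The objective, trajectory, and parameter values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

beta = 10.0  # smoothness parameter (assumed value)
eps = 0.5    # allowed deviation from the sample trajectory (assumed value)

def smoothed_robustness(traj):
    # Example smoothed objective: soft-max over time of the margin to a goal at s = 2
    # (stands in for the log-sum-exp approximation of a TLTL robustness).
    margins = -np.abs(traj - 2.0)
    return logsumexp(beta * margins) / beta

tau_sample = np.linspace(0.0, 1.0, 10)  # a sampled 1-D state trajectory (illustrative)

res = minimize(
    lambda tau: -smoothed_robustness(tau),          # maximize the smoothed robustness
    x0=tau_sample,
    method="SLSQP",
    constraints=[{"type": "ineq",
                  "fun": lambda tau: eps - np.linalg.norm(tau - tau_sample)}],
)
tau_updated = res.x  # updated trajectory with higher smoothed robustness
```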

After we obtain the set of updated trajectories $\{\bar\tau^i\}_{i=1}^{N}$, an updated trajectory distribution $\mathcal{N}(\hat\mu_\tau, \hat\Sigma_\tau)$ is fitted using

$\hat\mu_\tau = \frac{1}{N}\sum_{i=1}^{N} \bar\tau^i, \qquad \hat\Sigma_\tau = \frac{1}{N}\sum_{i=1}^{N} \big(\bar\tau^i - \hat\mu_\tau\big)\big(\bar\tau^i - \hat\mu_\tau\big)^{T}. \qquad (7)$

The last step is to update the policy. We only update the feed-forward terms $k_t$ and the covariances $\Sigma_t$; the feedback terms $K_t$ are kept constant (the policy parameters are thus $\theta = \{k_t, \Sigma_t\}_{t=0}^{T-1}$). This significantly reduces the number of parameters to be updated and increases the learning speed. For each sample trajectory, we obtain its probability under $\mathcal{N}(\hat\mu_\tau, \hat\Sigma_\tau)$

$p^i = \mathcal{N}\big(\tau^i \mid \hat\mu_\tau, \hat\Sigma_\tau\big) \qquad (8)$

($p^i$ is also written as $p_i$ for short), where $i$ is the sample index. Using these probabilities, a normalized weighting for each sample trajectory is calculated using the softmax function $w^i = e^{\alpha p^i} / \sum_{j=1}^{N} e^{\alpha p^j}$ ($\alpha$ is a parameter to be tuned). Finally, similar to [19], the policy is updated using weighted maximum likelihood by

$k_t = \sum_{i=1}^{N} w^i \big(a_t^i - K_t s_t^i\big), \qquad \Sigma_t = \sum_{i=1}^{N} w^i \big(a_t^i - K_t s_t^i - k_t\big)\big(a_t^i - K_t s_t^i - k_t\big)^{T}, \qquad (9)$

where $s_t^i$ and $a_t^i$ are the state and action of the $i$-th sample trajectory at time $t$.

According to [21], such an update strategy results in convergence. The complete algorithm is described in Algorithm 1.

1:Inputs: Episode horizon $T$, batch size $N$, KL constraint parameter $\epsilon$, smoothed robustness function $\hat\rho$, softmax parameter $\alpha$
2:Initialize policy $\pi_\theta$
3:Initialize trajectory buffer
4:for episode $= 1$ to number of training episodes do
5:      $\{\tau^i\}$ = SampleTrajectories($\pi_\theta$)
6:     Store $\{\tau^i\}$ in the trajectory buffer
7:     if Size(buffer) $\ge N$ then
8:          $\{\bar\tau^i\}_{i=1}^{N}$ = GetUpdatedTrajectories($\{\tau^i\}_{i=1}^{N}$) Using Equation (6)
9:          $(\hat\mu_\tau, \hat\Sigma_\tau)$ = FitTrajectoryDistribution($\{\bar\tau^i\}_{i=1}^{N}$) Using Equation (7)
10:         for i=1 to N do
11:              Compute $p^i$ using Equation (8)
12:              Compute the weight $w^i$ using the softmax function
13:         end for
14:         for t = 0 to T-1 do
15:              Update $k_t$ using Equation (9)
16:              Update $\Sigma_t$ using Equation (9)
17:         end for
18:         Clear buffer
19:     end if
20:end for
Algorithm 1 Temporal Logic Policy Search
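The fragment below sketches one plausible implementation of the weighting and policy-update steps (lines 10-17 of Algorithm 1), under the assumption that the weights come from a softmax over the (log-)probabilities in Equation (8) and that only the feed-forward terms and covariances are re-fit while the feedback gains stay fixed. The exact update used by the authors may differ; all array shapes and values here are illustrative.

```python
import numpy as np

def softmax_weights(log_probs, alpha):
    """Normalized weights w_i from (log) trajectory probabilities, with temperature alpha."""
    z = alpha * np.asarray(log_probs)
    z -= z.max()                      # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

def weighted_policy_update(actions, states, K, w):
    """Re-fit feed-forward terms k_t and covariances Sigma_t by weighted maximum likelihood.

    actions: (N, T, da) sampled actions, states: (N, T, ds) sampled states,
    K: (T, da, ds) fixed feedback gains, w: (N,) sample weights.
    """
    N, T, da = actions.shape
    k_new, Sigma_new = [], []
    for t in range(T):
        # residual of each sample with respect to the fixed feedback term K_t s_t
        resid = actions[:, t, :] - states[:, t, :] @ K[t].T          # (N, da)
        k_t = (w[:, None] * resid).sum(axis=0)                       # weighted mean
        centered = resid - k_t
        Sigma_t = (w[:, None, None] *
                   (centered[:, :, None] * centered[:, None, :])).sum(axis=0)
        k_new.append(k_t)
        Sigma_new.append(Sigma_t + 1e-6 * np.eye(da))                # small regularizer
    return k_new, Sigma_new

# tiny usage example with random data
N, T, ds, da = 5, 4, 3, 2
rng = np.random.default_rng(0)
states = rng.normal(size=(N, T, ds))
actions = rng.normal(size=(N, T, da))
K = np.zeros((T, da, ds))
w = softmax_weights(rng.normal(size=N), alpha=1.0)
k_new, Sigma_new = weighted_policy_update(actions, states, K, w)
```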

V Robustness Smoothing

In the TLPS algorithm introduced in the previous section, one of the steps requires solving a constrained optimization problem that maximizes the robustness (Equation (6)). The original robustness definition in Section II-A is non-differentiable and thus rules out many efficient gradient-based methods. In this section, we adopt a smooth approximation of the robustness function using log-sum-exp. Specifically,

$\max(a_1, \dots, a_M) \approx \frac{1}{\beta}\ln\sum_{i=1}^{M} e^{\beta a_i}, \qquad \min(a_1, \dots, a_M) \approx -\frac{1}{\beta}\ln\sum_{i=1}^{M} e^{-\beta a_i}, \qquad (10)$

where $\beta > 0$ is a smoothness parameter. We denote an iterative max-min function as

$\mathrm{mami}(x) = o_1\Big(o_2\big(\cdots o_n\big(f_1(x), \dots, f_M(x)\big)\cdots\big)\Big),$

where each $o_k$ is an operator in $\{\max, \min\}$ applied over a subset of the terms, $f_1, \dots, f_M$ are the functions appearing in $\mathrm{mami}$, and the nesting depth $n$ and the number of terms $M$ can be any positive integers. As we showed in Section II-A, any robustness function can be expressed as an iterative max-min function.
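For reference, a direct implementation of the smooth max and min in Equation (10) is shown below, using SciPy's numerically stable `logsumexp`. This is a generic sketch rather than the authors' code; the test values are arbitrary.

```python
import numpy as np
from scipy.special import logsumexp

def smooth_max(a, beta):
    # (1/beta) * log(sum_i exp(beta * a_i)) -- overestimates max by at most log(M)/beta
    return logsumexp(beta * np.asarray(a)) / beta

def smooth_min(a, beta):
    # -(1/beta) * log(sum_i exp(-beta * a_i)) -- underestimates min by at most log(M)/beta
    return -logsumexp(-beta * np.asarray(a)) / beta

a = [0.3, -1.2, 0.9]
print(smooth_max(a, beta=20.0), max(a))   # close for large beta
print(smooth_min(a, beta=20.0), min(a))
```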

Following the log-sum-exp approximation, any iterative max-min function (i.e., the robustness of any TL formula) can be approximated by $\widehat{\mathrm{mami}}_\beta$, which is obtained by replacing each $\max$ in $\mathrm{mami}$ with $\frac{1}{\beta}\ln\sum e^{\beta(\cdot)}$ and each $\min$ with $-\frac{1}{\beta}\ln\sum e^{-\beta(\cdot)}$, i.e., the sign of the exponent is positive if the corresponding operator is a $\max$ and negative if it is a $\min$. In the remainder of this section, we provide three lemmas that show the following:

  • the approximation error between $\mathrm{mami}$ and $\widehat{\mathrm{mami}}_\beta$ approaches zero as $\beta \to \infty$. This error is always bounded by $\ln(M)/\beta$, where the number of terms $M$ is determined by the number of predicates in the TL formula and the horizon of the problem. Tuning $\beta$ trades off differentiability of the robustness function against approximation error,

  • despite the error introduced by the approximation, the optimal points remain invariant, i.e. a maximum point of $\mathrm{mami}$ is also a maximum point of $\widehat{\mathrm{mami}}_\beta$ for sufficiently large $\beta$. This result provides a guarantee that the optimal policy is unchanged when using the approximated TL reward,

  • even though the log-sum-exp approximation smooths the robustness function, locally the ascending directions of $\mathrm{mami}$ and $\widehat{\mathrm{mami}}_\beta$ can be made to coincide with small error, and the deviation is controlled by the parameter $\beta$. As many policy search methods are local methods that improve the policy near samples, it is important to ensure that the ascending direction of the approximated TL reward does not oppose that of the real one.

Due to space constraints, we will only provide sketches of the proofs for the lemmas.

Lemma 1

Let $M$ be the number of terms of $\mathrm{mami}$. Then $\mathrm{mami}$ and $\widehat{\mathrm{mami}}_\beta$ satisfy

$\big|\mathrm{mami}(x) - \widehat{\mathrm{mami}}_\beta(x)\big| \le \frac{\ln M}{\beta},$

for all $x$ and all $\beta > 0$.

For simplicity and without loss of generality, we illustrate the proof of Lemma 1 by constructing an approximation for a finite max-min-max problem. Applying the bounds in Equation (10) first to the innermost max terms, then to the intermediate min, and finally to the outermost max, each log-sum-exp layer over- or under-estimates its exact counterpart by at most the logarithm of its number of terms divided by $\beta$. Substituting the inner bounds into the outer ones (Equation (11)), multiplying through by $\beta$ on both sides, and collecting the error terms yields Equation (12), which gives the bound in Lemma 1. Then we conclude the proof.
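A quick numerical check of the bound in Lemma 1, using the smooth max from Equation (10) on randomly drawn terms (an illustration with assumed values of $M$ and $\beta$, not part of the original proof):

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(1)
M, beta = 50, 5.0
a = rng.normal(size=M)

smooth = logsumexp(beta * a) / beta   # smooth max of the terms
exact = a.max()
# the smooth max overestimates the exact max by at most ln(M)/beta
assert 0.0 <= smooth - exact <= np.log(M) / beta
print(smooth - exact, np.log(M) / beta)
```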

Lemma 2

Suppose $x^\star$ is a maximum point of $\mathrm{mami}$. Then there exists a positive constant $B$ such that for all $\beta \ge B$, $x^\star$ is also one of the maximum points of $\widehat{\mathrm{mami}}_\beta$, i.e. $\widehat{\mathrm{mami}}_\beta(x^\star) = \max_{x} \widehat{\mathrm{mami}}_\beta(x)$.

We start by considering $\mathrm{mami}$ as a pure maximum function, i.e. $\mathrm{mami}(x) = \max_i f_i(x)$. Let us denote $f_{i^\star}(x^\star) = \max_i f_i(x^\star)$. For $\beta$ large enough, the log-sum-exp approximation is dominated by its largest term, so the ordering of candidate maximum points is preserved and $x^\star$ remains a maximizer of the approximation. There always exists a positive constant $B$ such that for all $\beta \ge B$ the above statement holds. Lemma 2 is obtained by applying the same argument to the $\mathrm{mami}$ function in general.

Lemma 3

Let us denote the sub-gradient of $\mathrm{mami}$ at $x$ as $\partial\,\mathrm{mami}(x)$ and the gradient of $\widehat{\mathrm{mami}}_\beta$ at $x$ as $\nabla\widehat{\mathrm{mami}}_\beta(x)$. There exists a positive constant $B$ such that for all $\beta \ge B$, $\partial\,\mathrm{mami}(x)$ and $\nabla\widehat{\mathrm{mami}}_\beta(x)$ satisfy

$\big\langle g, \nabla\widehat{\mathrm{mami}}_\beta(x) \big\rangle \ge 0 \quad \text{for all } g \in \partial\,\mathrm{mami}(x),$

where $\langle \cdot, \cdot \rangle$ denotes the inner product.

Here we will only provide the proof for the case where $\mathrm{mami}$ is a point-wise maximum of convex functions; one can generalize it to any iterative max-min function using the chain rule. Suppose $\mathrm{mami}(x) = \max_i f_i(x)$. Its sub-gradient at $x$ is the convex hull of the gradients of the "active" functions, i.e. of $\{\nabla f_i(x) : i \in I(x)\}$, where $I(x) = \{i : f_i(x) = \mathrm{mami}(x)\}$ is the set of "active" functions. The corresponding approximation is $\widehat{\mathrm{mami}}_\beta(x) = \frac{1}{\beta}\ln\sum_i e^{\beta f_i(x)}$, whose first order derivative is a convex combination of the gradients $\nabla f_i(x)$ with softmax weights that concentrate on the active set as $\beta$ grows. Therefore, there always exists a positive constant $B$ such that the inner product condition of Lemma 3 holds for all $\beta \ge B$.

VI Case Studies

In this section, we apply TLPS to a vehicle navigation example. As shown in Figure 1, the vehicle navigates in a 2D environment. It has a 6 dimensional continuous state feature space, where $(x, y)$ is the position of its center and $\theta$ is the angle its heading makes with the $x$-axis. Its 2 dimensional action space consists of the forward driving speed $v$ and the steering angle $u$ of its front wheels. The car moves according to the dynamics

$\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \frac{v}{L}\tan u, \qquad (13)$

with added Gaussian noise ($L$ is the distance between the front and rear axles). However, the learning agent is not provided with this model and needs to learn the desired control policy through trial and error.
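The following snippet simulates dynamics of this form with a simple Euler step and additive Gaussian noise. The time step, axle length, noise scale, and inputs are assumed values for illustration and are not the simulation settings used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
L, dt, noise_std = 0.5, 0.05, 0.01   # axle distance, integration step, noise scale (assumed)

def step(x, y, theta, v, u):
    """One Euler step of the car model: xdot = v cos(theta), ydot = v sin(theta),
    thetadot = (v / L) tan(u), with added Gaussian noise."""
    x_next = x + dt * v * np.cos(theta) + rng.normal(0.0, noise_std)
    y_next = y + dt * v * np.sin(theta) + rng.normal(0.0, noise_std)
    theta_next = theta + dt * (v / L) * np.tan(u) + rng.normal(0.0, noise_std)
    return x_next, y_next, theta_next

state = (0.0, 0.0, 0.0)
for _ in range(200):                  # roll out one episode of horizon 200
    state = step(*state, v=1.0, u=0.1)
```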


Fig. 1 : Vehicle navigation task using TLTL specifications. The vehicle is shown in brown, the obstacle is shown as the green circle, and the goals are shown as the green squares. Left: Task 1 is to reach the goal while avoiding the obstacle. Right: Task 2 is to visit goals 1, 2, 3 in this order while avoiding the obstacle.

We test TLPS on two tasks of increasing difficulty. In the first task, the vehicle is required to reach the goal while avoiding the obstacle. We express this task as the TLTL specification

$\phi_1 = \Diamond\,\psi_g \;\wedge\; \Box\,(d_o > r). \qquad (14)$

In Equation (14), $\psi_g$ defines the square shaped goal region, $d_o$ is the Euclidean distance between the vehicle's center and the center of the obstacle, and $r$ is the radius of the obstacle. In English, $\phi_1$ describes the task of "eventually reach the goal and always stay away from the obstacle". Using the quantitative semantics described in Section II-A, the robustness of $\phi_1$ is

$\rho(s_{0:T}, \phi_1) = \min\Big(\max_{t \in [0, T)} \rho(s_t, \psi_g),\; \min_{t \in [0, T)} \big(d_o^t - r\big)\Big), \qquad (15)$

where $\rho(s_t, \psi_g)$ is the robustness of the goal predicate at time $t$, and $(x_t, y_t)$ and $d_o^t$ are the vehicle position and its distance to the obstacle center at time $t$. Using the log-sum-exp approximation, $\hat\rho(s_{0:T}, \phi_1)$ can be obtained as

$\hat\rho(s_{0:T}, \phi_1) = -\frac{1}{\beta}\ln\left(\Big(\sum_{t=0}^{T-1} e^{\beta\,\rho(s_t, \psi_g)}\Big)^{-1} + \sum_{t=0}^{T-1} e^{-\beta\,(d_o^t - r)}\right). \qquad (16)$

Because we use the same $\beta$ throughout the approximation, the intermediate logarithms and exponentials cancel and we end up with Equation (16). $\hat\rho(s_{0:T}, \phi_1)$ is used in the optimization problem defined in Equation (6).
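As an illustration of how a smoothed robustness of this kind is evaluated in practice, the sketch below computes one for a specification of the same form (eventually reach a goal region, always keep a distance larger than $r$ from the obstacle). The goal and obstacle geometry are hypothetical, and the expression follows the generic log-sum-exp construction rather than the authors' exact Equation (16).

```python
import numpy as np
from scipy.special import logsumexp

beta = 10.0
goal_center, goal_half_width = np.array([4.0, 4.0]), 0.5
obs_center, obs_radius = np.array([2.0, 2.0]), 0.7

def smoothed_robustness(traj):
    """traj: (T, 2) array of (x, y) positions."""
    # per-step robustness of the goal predicate (inside a square region)
    goal_rho = goal_half_width - np.abs(traj - goal_center).max(axis=1)
    # per-step robustness of the obstacle predicate d_o > r
    obs_rho = np.linalg.norm(traj - obs_center, axis=1) - obs_radius
    soft_eventually = logsumexp(beta * goal_rho) / beta        # smooth max over time
    soft_always = -logsumexp(-beta * obs_rho) / beta           # smooth min over time
    # smooth min of the two sub-formulas (conjunction)
    return -logsumexp(-beta * np.array([soft_eventually, soft_always])) / beta

traj = np.stack([np.linspace(0, 4, 50), np.linspace(0, 4, 50)], axis=1)
print(smoothed_robustness(traj))
```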

In task 2, the vehicle is required to visit goals 1, 2, 3 in this specific order while avoiding the obstacle. Expressed in TLTL, this results in the specification

$\phi_2 = \Diamond\big(\psi_{g_1} \wedge \Diamond(\psi_{g_2} \wedge \Diamond\psi_{g_3})\big) \;\wedge\; \big(\neg(\psi_{g_2} \vee \psi_{g_3})\,\mathcal{U}\,\psi_{g_1}\big) \;\wedge\; \big(\neg\psi_{g_3}\,\mathcal{U}\,\psi_{g_2}\big) \;\wedge\; \bigwedge_{i=1}^{3}\Box\big(\psi_{g_i} \Rightarrow \bigcirc\Box\neg\psi_{g_i}\big) \;\wedge\; \Box\,(d_o > r), \qquad (17)$

where $\bigwedge_{i=1}^{3}$ is a shorthand for a sequence of conjunctions and $\psi_{g_i}$ is the predicate for goal $g_i$. In English, $\phi_2$ states "visit $g_1$ then $g_2$ then $g_3$, and don't visit $g_2$ or $g_3$ until visiting $g_1$, and don't visit $g_3$ until visiting $g_2$, and always if visited $g_i$ implies next always don't visit $g_i$ (don't revisit goals), and always avoid the obstacle". Due to space constraints, the robustness of $\phi_2$ and its approximation will not be explicitly presented, but they take a similar form of nested $\min$/$\max$ functions that can be generated from the quantitative semantics of TLTL.

During training, the obstacle is considered "penetrable" in that the car can cross its boundary, with a negative reward granted according to the penetration depth. In practice, we find that this facilitates learning compared to a single negative reward given at contact with the obstacle followed by restarting the episode.

Each episode has a horizon of $T = 200$ time-steps. 40 episodes of sample trajectories are collected and used for each update iteration. The policy parameters are initialized randomly in a given region (the policy covariances should be initialized to relatively high values to encourage exploration). Each task is trained for 50 iterations and the results are presented in Figures 2 and 3. Figure 2 shows sample trajectory distributions for selected iterations. Trajectory distributions are illustrated as shaded regions with width equal to 2 standard deviations; a lighter shade indicates an earlier stage of the training process. A fixed value of $\beta$ was used for this set of results. We can see from Figure 2 that the trajectory distribution is able to converge and satisfy the specification. Satisfaction occurs much sooner for task 1 (around 30 iterations) compared with task 2 (around 50 iterations).


Fig. 2 : Sample trajectory distributions for selected iterations for left: task 1, right: task 2. Each iteration consists of 40 sample trajectories, each having a horizon of 200 time-steps. The width of each distribution is 2 standard deviations and color represents recency in the training process (lighter color indicates an earlier time in training).


Fig. 3 : Average return vs training iteration for left: task 1, right: task 2. The average return is represented as the original robustness value calculated from sample trajectories. TLPS is compared across different values of $\beta$. REPS with the original robustness as terminal reward is used as a baseline.

Figure 3 compares the average robustness (over 40 sample trajectories) per iteration for TLPS with different values of the approximation parameter $\beta$ in (10). As a baseline, we also compare TLPS with episode-based relative entropy policy search (REPS) [18]. The original robustness function is used as the terminal reward for REPS, and our previous work [17] has shown that this combination outperforms a heuristic reward designed for the same robotic control task. The magnitude of the robustness value changes with varying $\beta$. Therefore, in order for the comparison to be meaningful (putting average returns on the same scale), the sample trajectories collected for each comparison case are used to calculate their original robustness values against the TLTL formula, and these values are plotted in Figure 3 (a similar approach was taken in [17]). The original robustness is chosen as the comparison measure for its semantic integrity (a value greater than zero indicates satisfaction).

Results in Figure 3 show that a larger $\beta$ results in faster convergence and a higher average return. This is consistent with the results of Section V, since a larger $\beta$ indicates a lower approximation error. However, this advantage diminishes as $\beta$ increases further because the approximated robustness function loses differentiability. For the relatively easy task 1, TLPS performs comparably with REPS. However, for the harder task 2, TLPS exhibits a clear advantage both in terms of rate of convergence and quality of the learned policy.

TLPS is a local policy search method that offers gradual policy improvement, controllable policy space exploration and smooth trajectories. These characteristics are desirable for learning control policies for systems that involve physical interactions with the environment. However, like other local RL methods, TLPS lacks a principled exploration strategy. Results in Figure 3 show a rapid exploration decay during the first 10 iterations, after which little further improvement is seen. During experiments, the authors find that adding a policy covariance damping schedule can help with initial exploration and final convergence. A principled exploration strategy is possible future work.

Similar to many policy search methods, TLPS is a local method. Therefore, policy initialization is a critical aspect of the algorithm (compared with value-based methods such as Q-learning). In addition, because the trajectory update step in Equation (6) does not consider the system dynamics and relies on staying close to the sample trajectories, divergence can occur when the smoothing or constraint parameters are chosen poorly or the learning rate is too large. Making the algorithm more robust to hyperparameter changes is also an important future direction.

VII Conclusion

As reinforcement learning research advances and more general RL agents are developed, it becomes increasingly important that we are able to correctly communicate our intentions to the learning agent. A well designed RL agent will be proficient at finding a policy that maximizes its return, which means it will exploit any flaws in the reward function that can help it achieve this goal. Human intervention can sometimes help alleviate this problem by providing additional feedback. However, as discussed in [8], if the communication link between the human and the agent is unstable (space exploration missions) or the agent operates on a timescale difficult for humans to respond to (a financial trading agent), it is critical that we are confident about what the agent will learn.

In this paper, we applied temporal logic as the task specification language for reinforcement learning. The quantitative semantics of TL is adopted for accurate expression of logical relationships in an RL task. We explored robustness smoothing as a means to transform the TL robustness into a differentiable function and provided theoretical results on the properties of the smoothed robustness. We proposed temporal logic policy search (TLPS), a model-free method that utilizes the smoothed robustness and operates in continuous state and action spaces. Simulation experiments show that TLPS is able to effectively find control policies that satisfy given TL specifications.

References