Automata Guided Reinforcement Learning With Demonstrations

09/17/2018 ∙ by Xiao Li, et al. ∙ Boston University

Tasks with complex temporal structures and long horizons pose a challenge for reinforcement learning agents due to the difficulty in specifying the tasks in terms of reward functions as well as large variances in the learning signals. We propose to address these problems by combining temporal logic (TL) with reinforcement learning from demonstrations. Our method automatically generates intrinsic rewards that align with the overall task goal given a TL task specification. The policy resulting from our framework has an interpretable and hierarchical structure. We validate the proposed method experimentally on a set of robotic manipulation tasks.




I Introduction

Learning robotic skills for tasks with complex structures and long horizons poses a significant challenge for current reinforcement learning methods. Recent endeavors have focused mainly on lower-level motor control tasks such as grasping [1][2], dexterous hand manipulation [3], and Lego insertion [4]. However, demonstrations of robotic systems capable of learning controls for tasks that require logical execution of subtasks have been less successful.

The first challenge in learning complex tasks is the low initial success rate: the agent rarely receives a positive reward signal through exploration. It has been shown that providing demonstration data can significantly facilitate learning. This idea has been adopted to learn tasks such as block stacking [5][6], insertion [7][8], and autonomous driving [9][10]. Intrinsic rewards have also been shown to provide extra learning signal [11]; however, carelessly designed intrinsic rewards can adversely affect learning performance.

Learning only from demonstrations suffers from covariate shift (accumulated error resulting from deviation of the state and action distributions from the demonstrations), which is often addressed by combining demonstrations with reinforcement learning [12][13]. The second challenge is that tasks requiring long sequences of actions usually result in high-variance learning signals (gradients), which drastically hinder learning progress. This problem can be alleviated by using temporal abstractions [14]. Hierarchical reinforcement learning has recently been applied successfully to simulated control tasks [15], simple navigation tasks [16], and robotic manipulation tasks [17][18].

Fig. 1: left: Training environment in the V-REP simulator [19]. right: Experimental environment.

The third challenge in learning complex tasks arises with the specification of task rewards. Reward engineering is time-consuming even for low-level control tasks, where efforts in reward shaping [20] and tuning are often necessary. The process becomes much more difficult as the structure of the task grows more complex. The authors of [21] have shown that specifying the reward function already takes considerable effort for simple block-stacking tasks.

We propose to address the above problems by using formal specification languages, particularly temporal logic (TL) as the task specification language. TL has been used in control synthesis [22], path planning [23] and learning [24][25]. It has been shown to provide convenience and performance guarantees in tasks with logical structures and persistence requirements.

Our goal in this work is to provide a framework that integrates temporal logic with reinforcement learning from demonstrations. We show that our framework generates intrinsic rewards that are aligned with the task goals and results in a policy with an interpretable hierarchy. We experimentally validate our framework on learning robotic manipulation tasks with logical structures. All of our training is done in the simulation environment shown in Figure 1. We show that, when configured properly, the policies can transfer directly to the real robot.

II Related Work

Policy/task sketches have been used to decompose a complex task into a set of sub-tasks [26][27]. However, these methods only support sequential execution of the subtasks, whereas our approach is able to compose subtasks in any temporal/logical relationship. Moreover, given the specification of the task in syntactically co-safe truncated linear temporal logic (scTLTL), our method does not require each subtask to be specified in terms of a reward function.

The works in [28][29] are the most closely related to ours. The authors of [28] incorporate maximum-likelihood inverse reinforcement learning with side information (additional constraints of the task) in the form of co-safe linear temporal logic (which is transformed to an equivalent finite state automaton). However, their method only supports discrete state and action spaces. The authors of [29] propose the reward machine, which in effect is an FSA. However, the user is required to design the reward machine manually, whereas our method generates the reward machine from TL specifications.

III Preliminaries

III-A Off-Policy Reinforcement Learning

We start with the definition of a Markov Decision Process.

Definition 1

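An MDP is defined as a tuple $\mathcal{M} = \langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$, where $S \subseteq \mathbb{R}^n$ is the state space; $A \subseteq \mathbb{R}^m$ is the action space ($S$ and $A$ can also be discrete sets); $p: S \times A \times S \to [0,1]$ is the transition function with $p(s'|s,a)$ being the conditional probability density of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$; $r: S \times A \times S \to \mathbb{R}$ is the reward function with $r(s,a,s')$ being the reward obtained by executing action $a$ at state $s$ and transitioning to $s'$.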

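We define a task to be the process of finding the optimal policy $\pi^\star: S \to A$ (or $\pi^\star: S \times A \to [0,1]$ for stochastic policies) that maximizes the expected return, i.e.

$$\pi^\star = \arg\max_{\pi} E^{\pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1})\Big]. \tag{1}$$

The horizon of a task (denoted $T$) is defined as the maximum allowable number of time-steps of each execution of $\pi$ and hence the maximum length of a trajectory. In Equation (1), $E^{\pi}[\cdot]$ is the expectation following $\pi$. The state-action value function is defined as

$$Q^{\pi}(s, a) = E^{\pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a\Big], \tag{2}$$

i.e. it is the expected return of choosing action $a$ at state $s$ and following $\pi$ onwards. For off-policy actor-critic methods such as deep deterministic policy gradient [30], $Q^{\pi}$ is used to evaluate the quality of policy $\pi$. Parameterized $Q^{\pi}_{\theta_Q}$ and $\pi_{\theta_\pi}$ ($\theta_Q$ and $\theta_\pi$ are learnable parameters) are optimized alternately to obtain $\pi^\star$.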

III-B scTLTL and Finite State Automata

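We consider tasks specified with syntactically co-safe Truncated Linear Temporal Logic (scTLTL), which is derived from truncated linear temporal logic (TLTL) [31]. The $\Box$ (always) operator is omitted in order to establish a connection between TLTL and finite state automata (Definition 2). The syntax of scTLTL is defined as

$$\phi := \top \mid f(s) < c \mid \neg\phi \mid \phi \wedge \psi \mid \Diamond\phi \mid \phi \,\mathcal{U}\, \psi \mid \phi \,\mathcal{T}\, \psi \mid \bigcirc\phi \tag{3}$$

where $\top$ is the True Boolean constant. $s \in S$ is an MDP state in Definition 1. $f(s) < c$ is a predicate over the MDP states where $c \in \mathbb{R}$. $\neg$ (negation/not) and $\wedge$ (conjunction/and) are Boolean connectives. $\Diamond$ (eventually), $\mathcal{U}$ (until), $\mathcal{T}$ (then), and $\bigcirc$ (next) are temporal operators. $\Rightarrow$ (implication) and $\vee$ (disjunction/or) can be derived from the above operators.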

We denote $s_t \in S$ to be the state at time $t$, and $s_{t:t+k}$ to be a sequence of states (state trajectory) from time $t$ to $t+k$, i.e., $s_{t:t+k} = s_t s_{t+1} \ldots s_{t+k}$. The Boolean semantics of scTLTL is defined as:

A trajectory $s_{0:T}$ is said to satisfy formula $\phi$ if $s_{0:T} \models \phi$.

There exists a real-valued function $\rho(s_{0:T}, \phi)$ called robustness degree (sometimes referred to as just robustness) that measures the level of satisfaction of trajectory $s_{0:T}$ with respect to a scTLTL formula $\phi$. The robustness can be defined recursively as

where $\rho_{max}$ represents the maximum robustness value. A robustness of greater than zero implies that $s_{t:t+k}$ satisfies $\phi$ and vice versa ($\rho(s_{t:t+k}, \phi) > 0 \Rightarrow s_{t:t+k} \models \phi$ and $\rho(s_{t:t+k}, \phi) < 0 \Rightarrow s_{t:t+k} \not\models \phi$). The robustness can substitute Boolean semantics to enforce the specification $\phi$.
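To make the quantitative semantics concrete, the following is a minimal sketch of a robustness evaluator for a small fragment of the logic. It assumes the commonly used min/max semantics (a predicate f(s) < c scores c - f(s), negation flips the sign, conjunction and disjunction take the minimum and maximum, and eventually takes the maximum over suffixes of the trajectory). The helper names and the one-dimensional toy trajectory are illustrative and do not come from the paper.

```python
from typing import Callable, Sequence

Trajectory = Sequence[float]               # toy 1-D states s_t, s_{t+1}, ...
Formula = Callable[[Trajectory], float]    # maps a trajectory to its robustness


def predicate(f: Callable[[float], float], c: float) -> Formula:
    """Robustness of the predicate f(s) < c, evaluated at the first state only."""
    return lambda traj: c - f(traj[0])


def neg(phi: Formula) -> Formula:
    """Negation flips the sign of the robustness."""
    return lambda traj: -phi(traj)


def conj(phi: Formula, psi: Formula) -> Formula:
    """Conjunction is as robust as its weakest operand."""
    return lambda traj: min(phi(traj), psi(traj))


def disj(phi: Formula, psi: Formula) -> Formula:
    """Disjunction is as robust as its strongest operand."""
    return lambda traj: max(phi(traj), psi(traj))


def eventually(phi: Formula) -> Formula:
    """Best robustness of phi over all suffixes of the trajectory."""
    return lambda traj: max(phi(traj[t:]) for t in range(len(traj)))


# Hypothetical regions on a line: "a" is within 0.5 of x = 0, "b" within 0.5 of x = 4.
in_a = predicate(lambda s: abs(s - 0.0), 0.5)
in_b = predicate(lambda s: abs(s - 4.0), 0.5)
visit_both = conj(eventually(in_a), eventually(in_b))

trajectory = [2.0, 0.25, 2.0, 4.0]
print(visit_both(trajectory))       # positive: both regions are visited
print(visit_both(trajectory[:3]))   # negative: region b has not been reached yet
```

A positive value indicates satisfaction, and its magnitude indicates how much margin the trajectory has before the formula would be violated.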

Definition 2

An FSA corresponding to a scTLTL formula $\phi$ is defined as a tuple $\mathcal{A}_\phi = \langle Q_\phi, \Psi_\phi, q_0, p_\phi(\cdot|\cdot), F_\phi \rangle$, where $Q_\phi$ is a set of automaton states; $\Psi_\phi$ is the input alphabet (a set of first-order logic formulas); $q_0 \in Q_\phi$ is the initial state; $p_\phi$ is a conditional probability defined as

$$p_\phi(q_j \mid q_i, s) = \begin{cases} 1 & \rho(s, \psi_{q_i,q_j}) > 0 \\ 0 & \text{otherwise;} \end{cases} \tag{4}$$

$F_\phi$ is a set of final automaton states. The transitions in the FSA are deterministic. For reasons that will become clear later, we adopt the probability notation in Equation (4) so that we can combine it with an MDP transition. (Here we slightly modify the conventional definition of FSA and incorporate the probabilities in Equation (4). For simplicity, we continue to adopt the term FSA.)

We denote by $\psi_{q_i,q_j} \in \Psi_\phi$ the predicate guarding the transition from $q_i$ to $q_j$. Because $\psi_{q_i,q_j}$ is a predicate without temporal operators, the robustness $\rho(s_{t:t+k}, \psi_{q_i,q_j})$ is only evaluated at $s_t$. Therefore, we use the shorthand $\rho(s_t, \psi_{q_i,q_j}) = \rho(s_{t:t+k}, \psi_{q_i,q_j})$. The translation from a TLTL formula to an FSA can be done automatically with available packages like Lomap [32]. An example of scTLTL is provided in the next section.

IV Problem Formulation and Approach

Problem 1

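Given an MDP $\mathcal{M}$ with unknown transition dynamics $p(s'|s,a)$ and a scTLTL formula $\phi$ as in Definition 2, find a policy $\pi^\star_\phi$ such that

$$\pi^\star_\phi = \arg\max_{\pi_\phi} E^{\pi_\phi}\big[\mathbb{1}\big(\rho(s_{0:T}, \phi) > 0\big)\big], \tag{5}$$

where $\mathbb{1}(\rho(s_{0:T}, \phi) > 0)$ is an indicator function with value $1$ if $\rho(s_{0:T}, \phi) > 0$ and $0$ otherwise.

$\pi^\star_\phi$ in Equation (5) is said to satisfy $\phi$. Problem 1 defines a policy search problem where the trajectories resulting from following the optimal policy should satisfy the given scTLTL formula in expectation. At a high level, our approach is to construct a product MDP between the MDP $\mathcal{M}$ and the FSA $\mathcal{A}_\phi$ corresponding to $\phi$, and to learn the policy $\pi^\star_\phi$ using the product. To accelerate learning, we provide human demonstrations of the task specified by $\phi$ and a simple technique to make the demonstrations compatible with the product MDP.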

V FSA Augmented MDP

We introduce the FSA augmented MDP:

Definition 3

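An FSA augmented MDP corresponding to scTLTL formula $\phi$ (constructed from FSA $\langle Q_\phi, \Psi_\phi, q_0, p_\phi(\cdot|\cdot), F_\phi \rangle$ and MDP $\langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$) is defined as $\mathcal{M}_\phi = \langle \tilde{S}, A, \tilde{p}(\cdot|\cdot,\cdot), \tilde{r}(\cdot,\cdot), F_\phi \rangle$ where $\tilde{S} \subseteq S \times Q_\phi$, $\tilde{p}(\tilde{s}'|\tilde{s},a)$ is the probability of transitioning to $\tilde{s}'$ given $\tilde{s}$ and $a$,

$$\tilde{p}(\tilde{s}' \mid \tilde{s}, a) = p\big((s',q') \mid (s,q), a\big) = \begin{cases} p(s'|s,a) & p_\phi(q'|q,s) = 1 \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

$p_\phi$ is defined in Equation (4). $\tilde{r}: \tilde{S} \times \tilde{S} \to \mathbb{R}$ is the FSA augmented reward function, defined by

$$\tilde{r}(\tilde{s}, \tilde{s}') = \rho\big(s', D^{\phi}_{q}\big), \tag{7}$$

where $D^{\phi}_{q} = \bigvee_{q' \in \Omega_q} \psi_{q,q'}$ represents the disjunction of all predicates guarding the transitions that originate from $q$ ($\Omega_q$ is the set of automata states that are connected with $q$ through outgoing edges). Equation (7) effectively acts as an intrinsic reward that aligns with the overall goal of Equation (5).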

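Fig. 2: Finite state automaton generated from formula $\phi = \Diamond a \wedge \Diamond b$.

Example 1

Figure 2 illustrates the FSA resulting from formula $\phi = \Diamond a \wedge \Diamond b$ (where $a$, $b$ are predicates over states). In English, $\phi$ entails that during a run, the regions specified by $a$ and $b$ need to be visited at least once. The FSA has four automaton states $Q_\phi = \{q_0, q_1, q_2, q_f\}$ with $q_0$ being the input (initial) state (here $q_i$ serves to track the progress in satisfying $\phi$). The input alphabet is defined as $\Psi_\phi = \{\neg a \wedge \neg b, \neg a \wedge b, a \wedge \neg b, a \wedge b\}$. Shorthands are used in the figure, for example $a = (a \wedge b) \vee (a \wedge \neg b)$. $\Psi_\phi$ represents the power set of $\{a, b\}$, i.e. $\Psi_\phi = 2^{\{a,b\}}$. During execution, the FSA always starts from state $q_0$ and transitions according to Equation (6). The specification is satisfied when $q_f$ is reached.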
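The automaton in Example 1 can be sketched in a few lines. The code below hand-codes a four-state FSA for a formula of the form "eventually a and eventually b" and advances it with guards that fire when their robustness is positive. The state names, the one-dimensional regions, and the `step` and `run` helpers are assumptions made for illustration. In practice the automaton would be generated by a tool rather than written by hand.

```python
def rho_a(s: float) -> float:
    """Robustness of hypothetical predicate a: within 0.5 of x = 0."""
    return 0.5 - abs(s - 0.0)


def rho_b(s: float) -> float:
    """Robustness of hypothetical predicate b: within 0.5 of x = 4."""
    return 0.5 - abs(s - 4.0)


# Outgoing edges per automaton state: (next state, guard robustness function).
# Conjunction is a minimum and negation a sign flip. The guards leaving a state
# are mutually exclusive, so transitions are deterministic.
EDGES = {
    "q0": [
        ("qf", lambda s: min(rho_a(s), rho_b(s))),    # a and b
        ("q1", lambda s: min(rho_a(s), -rho_b(s))),   # a and not b
        ("q2", lambda s: min(-rho_a(s), rho_b(s))),   # not a and b
    ],
    "q1": [("qf", rho_b)],   # a already visited, waiting for b
    "q2": [("qf", rho_a)],   # b already visited, waiting for a
    "qf": [],                # final (accepting) state
}


def step(q: str, s: float) -> str:
    """Return the next automaton state; stay in q if no outgoing guard holds."""
    for q_next, guard in EDGES[q]:
        if guard(s) > 0:
            return q_next
    return q


def run(states, q="q0"):
    """Feed a state trajectory through the automaton and record each automaton state."""
    visited = [q]
    for s in states:
        q = step(q, s)
        visited.append(q)
    return visited


print(run([2.0, 0.25, 2.0, 4.0]))   # ['q0', 'q0', 'q1', 'q1', 'qf']
```

Visiting the regions in the opposite order takes the other branch through `q2` and ends in the same final state.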

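The goal is to find the optimal policy that maximizes the expected sum of discounted return, i.e.

$$\pi^\star = \arg\max_{\pi} E^{\pi}\Big[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde{r}(\tilde{s}_t, \tilde{s}_{t+1})\Big], \tag{8}$$

where $\gamma < 1$ is the discount factor and $T$ is the time horizon.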

The reward function in Equation (7) encourages the system to exit the current automaton state and move on to the next, and by doing so eventually reach the final state (property of FSA) which satisfies the TL specification and hence Equation (5). The discount factor in Equation (8) reduces the number of satisfying policies to one.

The FSA augmented MDP can be constructed with any standard MDP and a scTLTL formula, and Equation (8) can be solved with any off-the-shelf RL algorithm. After obtaining the optimal policy $\pi^\star(s, q)$, executing $\pi^\star(s, q_i)$ without transitioning the automaton state (i.e. keeping $q_i$ fixed) results in a set of meaningful policies that can be used as is or composed with other such policies.
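A minimal sketch of the FSA augmented MDP as a thin wrapper around an ordinary environment is given below. It assumes that the automaton state is advanced with guards evaluated on the current MDP state, and that the reward is the robustness of the disjunction of outgoing guards (a maximum) evaluated on the next MDP state. The toy point-mass dynamics, the two-region sequential automaton, the zero reward at the final state, and all names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Guard = Callable[[float], float]             # robustness of an edge predicate
Edges = Dict[str, List[Tuple[str, Guard]]]   # automaton state -> outgoing edges

# Hypothetical sequential task on a line: reach x = 1, then reach x = 3.
EDGES: Edges = {
    "q0": [("q1", lambda s: 0.25 - abs(s - 1.0))],
    "q1": [("qf", lambda s: 0.25 - abs(s - 3.0))],
    "qf": [],
}


def fsa_step(q: str, s: float, edges: Edges) -> str:
    """Advance the automaton: take the edge whose guard has positive robustness."""
    for q_next, guard in edges[q]:
        if guard(s) > 0:
            return q_next
    return q


def augmented_reward(q: str, s_next: float, edges: Edges) -> float:
    """Robustness of the disjunction of all guards leaving q, evaluated at s_next."""
    if not edges[q]:
        return 0.0   # assumption: no intrinsic reward once the final state is reached
    return max(guard(s_next) for _, guard in edges[q])


def product_step(s: float, q: str, a: float, edges: Edges):
    """One product transition: toy dynamics s' = s + a plus automaton bookkeeping."""
    s_next = s + a
    q_next = fsa_step(q, s, edges)
    reward = augmented_reward(q, s_next, edges)
    return (s_next, q_next), reward


def rollout(actions, s=0.0, q="q0"):
    """Apply a fixed action sequence and collect (state, automaton state, reward)."""
    log = []
    for a in actions:
        (s, q), r = product_step(s, q, a, EDGES)
        log.append((s, q, r))
    return log


for entry in rollout([1.0, 0.0, 2.0, 0.0]):
    print(entry)
```

Because the reward depends only on the guards leaving the current automaton state, fixing `q` and calling the policy with different MDP states isolates one sub-task at a time.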

VI FSA Guided Reinforcement Learning From Demonstrations

In this section, we introduce our main algorithm: FSA guided reinforcement learning from demonstrations. The algorithm takes as input a scTLTL formula $\phi$, a randomly initialized policy $\pi(s, q)$ and a set of demonstration trajectories $\mathcal{D} = \{\tau_i\}$ that satisfy $\phi$, where $\tau_i = (s_0, a_0), \ldots, (s_T, a_T)$ is the state-action trajectory. The algorithm consists of the following steps:

  1. Construct the FSA augmented MDP $\mathcal{M}_\phi$.

  2. For each demonstration trajectory $\tau_i$, construct the Q-appended demonstration trajectory $\tilde{\tau}_i = ((s_0, q_0), a_0), \ldots, ((s_T, q_T), a_T)$ by finding the corresponding $q_t$ for each $s_t$ using Equation (4). Denote $\tilde{\mathcal{D}} = \{\tilde{\tau}_i\}$.

  3. Perform behavior cloning (supervised learning on the demonstration trajectories) to initialize the policy (details provided in Section VII-B).

  4. Train the agent using any reinforcement learning from demonstration algorithm (such as [13], [5], [12]).

Algorithm 1 shows each step with its inputs and output. We will discuss our choices of behavior cloning and RL algorithms in the next section.

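1: Inputs: scTLTL task specification $\phi$, randomly initialized policy $\pi(s, q)$, a set of demonstration trajectories $\mathcal{D}$.
2: Construct the FSA augmented MDP $\mathcal{M}_\phi$
3: Construct the Q-appended demonstration trajectories $\tilde{\mathcal{D}}$ from $\mathcal{D}$ using Equation (4)
4: Initialize $\pi$ by behavior cloning on $\tilde{\mathcal{D}}$
5: Train $\pi$ on $\mathcal{M}_\phi$ with $\tilde{\mathcal{D}}$ using LfD (LfD stands for any learning from demonstration algorithm)
Algorithm 1 FSA Guided Reinforcement Learning From Demonstrations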
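Constructing the Q-appended demonstrations amounts to replaying each demonstration through the automaton. The sketch below labels every demonstrated state with the automaton state that was active when it was visited, assuming the automaton starts in its initial state and is advanced with guards evaluated on the demonstrated states. The toy red-then-green automaton on a line and the function names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Guard = Callable[[float], float]
Edges = Dict[str, List[Tuple[str, Guard]]]
Demo = List[Tuple[float, float]]                        # [(state, action), ...]
AugmentedDemo = List[Tuple[Tuple[float, str], float]]   # [((state, q), action), ...]

# Hypothetical task: visit the "red" region (x = 1), then the "green" region (x = 3).
EDGES: Edges = {
    "q0": [("q1", lambda s: 0.25 - abs(s - 1.0))],   # guard: in red
    "q1": [("qf", lambda s: 0.25 - abs(s - 3.0))],   # guard: in green
    "qf": [],
}


def fsa_step(q: str, s: float, edges: Edges) -> str:
    """Take the outgoing edge whose guard has positive robustness, else stay in q."""
    for q_next, guard in edges[q]:
        if guard(s) > 0:
            return q_next
    return q


def append_q(demo: Demo, edges: Edges, q_init: str = "q0") -> AugmentedDemo:
    """Pair each demonstrated state with the automaton state active at that time."""
    q = q_init
    augmented = []
    for s, a in demo:
        augmented.append(((s, q), a))
        q = fsa_step(q, s, edges)   # advance the automaton for the next sample
    return augmented


demo = [(0.0, 1.0), (1.0, 2.0), (3.0, 0.0), (3.0, 0.0)]
for sample in append_q(demo, EDGES):
    print(sample)
```

The resulting samples have the same input layout as the product-MDP policy, so they can be placed directly in a demonstration buffer.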
Fig. 3: Finite state automaton generated from the formula for task 1.
Fig. 4: Sample execution of task 1 with FSA (same as Figure 3) transitions shown. The shaded state represents the current automaton state.

VII Experiments

In this section we present some preliminary experimental results using the FSA augmented MDP to learn temporal logic specified tasks.

VII-A Experiment Setup

As shown in Figure 4, we control one arm of a Baxter robot (7 degrees of freedom) to traverse among three regions defined by the red, green and blue disks. The positions of the disks are tracked by our motion capture system and are thus fully observable. Our state space is 16-dimensional and includes the 7 joint angles and the three disk positions relative to the gripper (9-dimensional). Our action space is the 7-dimensional joint velocities. We define three predicates, one for each region, each comparing the distance between the gripper and the corresponding disk against a threshold, which we set to 5 centimeters.

We test our algorithm on two tasks

  • Task 1:
    Description: visit regions red, green, and blue in this order.

  • Task 2:
    Description: Eventually visit regions red, green and blue. Order does not matter.

Figure 3 shows the FSA resulting from the formula for task 1. The FSA for task 2 is similar in nature to that presented in Figure 2 and is therefore not included due to space constraints.

VII-B Algorithm Details

For each task, we collect 50 human demonstration state-action trajectories (each demonstration about 12 seconds long) with randomized initial conditions (arm configuration and position of the regions). Demonstrations are collected by holding Baxter’s gripper in gravity compensation mode while performing the task. Behavior cloning is used to initialize the policy with the following loss function



Here $\pi$ is a deterministic policy represented by a feedforward neural network with 3 layers, each layer consisting of 100 ReLU units, and $N$ is the number of samples. Other behavior cloning losses can also be used [33].
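As a rough illustration, one common choice for a behavior cloning loss is the mean squared error between the policy output and the demonstrated action, averaged over the Q-appended samples. The sketch below assumes that form (the exact loss used in the paper may differ) and uses a hypothetical hand-written policy in place of the neural network.

```python
from typing import Callable, List, Sequence, Tuple

# One Q-appended sample: (MDP state s, automaton state index q, demonstrated action a).
Sample = Tuple[Sequence[float], int, Sequence[float]]
Policy = Callable[[Sequence[float], int], List[float]]


def bc_loss(policy: Policy, samples: List[Sample]) -> float:
    """Mean over samples of the squared distance between policy(s, q) and a."""
    total = 0.0
    for s, q, a in samples:
        predicted = policy(s, q)
        total += sum((p - target) ** 2 for p, target in zip(predicted, a))
    return total / len(samples)


def toy_policy(s: Sequence[float], q: int) -> List[float]:
    """Stand-in for the network: behavior depends on the automaton state q."""
    gain = 1.0 if q == 0 else -1.0
    return [gain * x for x in s]


samples = [
    ([1.0, 2.0], 0, [1.0, 2.0]),   # policy output matches the demonstration
    ([1.0, 0.0], 1, [0.0, 0.0]),   # policy output is off by 1 in the first dimension
]
print(bc_loss(toy_policy, samples))   # 0.5
```

In a real implementation the same quantity would be computed on mini-batches and minimized with a gradient-based optimizer.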

We use deep deterministic policy gradient (DDPG) [30] as our reinforcement learning algorithm. During training, we maintain two replay buffers, one for interaction data and one for demonstration data. At each update step, we sample a batch of experience from the interaction data buffer using prioritized experience replay [34] and another batch from the demonstration data buffer and combine the two batches for one update. In addition, we modify the policy loss to be


Here the DDPG term is the usual DDPG actor loss (a similar technique is used in [5]). During training, we linearly decay the weight on the demonstration term from 0.8 to 0.1 over 30,000 update steps to favor demonstrations in the beginning and the unbiased DDPG loss towards the end (a similar technique is used in [12]). We set the horizon to be 100 steps (5 seconds). Five episodes of exploration data are collected for every 10 updates. We use a learning rate of 0.0003, a discount factor of 0.99, and a batch size of 32 (from both buffers).
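The decay schedule described above can be sketched as follows. The linear interpolation from 0.8 to 0.1 over 30000 updates follows the text. The convex combination of a demonstration loss and the DDPG actor loss is an assumed form shown only for illustration, and the function and argument names are hypothetical.

```python
def demo_weight(update_step: int,
                start: float = 0.8,
                end: float = 0.1,
                decay_steps: int = 30000) -> float:
    """Linearly interpolate from start to end, then hold the final value."""
    fraction = min(max(update_step / decay_steps, 0.0), 1.0)
    return start + fraction * (end - start)


def mixed_policy_loss(demo_loss: float, ddpg_actor_loss: float, update_step: int) -> float:
    """Assumed form: weight the demonstration term by lam and the DDPG term by 1 - lam."""
    lam = demo_weight(update_step)
    return lam * demo_loss + (1.0 - lam) * ddpg_actor_loss


for step in (0, 15000, 30000, 60000):
    print(step, round(demo_weight(step), 3))
```

Early updates are dominated by the demonstration term, while later updates rely mostly on the DDPG actor loss.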

We randomly initialize the joint angles, the automaton state as well as the positions of the regions at reset of each episode in order to achieve generalization over different configurations of the workspace. An episode resets if the gripper comes too close to the table. All of our training is performed in simulation using the V-REP platform [19]. The simulation environment is calibrated to the real world workspace. We set the control frequencies in both the real and simulated robot to be 20 Hz and show that the learned policies transfer directly to the real robot without fine-tuning.

VII-C Comparison Cases

As a comparison, we introduce a binary vector with three digits. A digit in the vector is 1 if the corresponding region has been reached at least once and 0 otherwise (e.g. the digit for red is 1 if the red predicate holds at least once in an episode; likewise for blue and green). The vector is used to track progress towards accomplishing the task. We train each task with the following shaped reward


on the original MDP. We also compare cases with and without demonstration.

Due to the scale difference between rewards provided by the FSA augmented MDP and the shaped reward, we present all learning curves in terms of robustness for a clear comparison. This is because the semantics of the robustness entails that a trajectory evaluating to a higher robustness value achieves better satisfaction of the TL specification (a value greater than zero guarantees satisfaction).

We acknowledge that for any given task, a well-shaped reward that accelerates learning can be provided if enough effort goes into the design and tuning process. However, this effort grows quickly with the complexity of the task. Our goal is to use formal languages to free users of this burden while achieving similar sample efficiency as a shaped reward.

VIII Results and Discussion

In this section, we present our experimental results along with discussions of their implications. Figure 4 shows an example execution of task 1 on Baxter. The automaton serves as a progress tracking mechanism that hierarchically abstracts a temporally dependent task into a set of independent ones.

Fig. 5: Learning curves for Task 1 (left) and Task 2 (right). Steps here refer to environment steps.

As stated in Section VII-C, since we are training with different reward functions, for a fair comparison we sample a batch of 10 trajectories every 25,000 environment steps (robot interaction steps, as opposed to policy update steps) and calculate the robustness for each trajectory. Their means and standard deviations are presented in Figure 5. In the figure, we refer to Algorithm 1 as 'Ours', and learning from only the FSA augmented MDP as 'Ours without demonstration'. The shaped rewards are used to train with the same learning procedure as stated in Section VII-B.

The results in Figure 5 show that our method is able to solve both tasks with and without demonstrations (a task is considered solved if the average robustness stabilizes above zero). However, demonstrations and behavior cloning significantly decreased the time to convergence as well as the variance during training. The agent is also able to learn, very slowly, using the shaped rewards, but is unable to solve either task in the allotted time. The speedup of our method is mainly due to the temporal hierarchy the FSA provides. By adding one discrete dimension (the automaton state) to the state space and randomizing over that dimension during learning, a curriculum is created to help the agent learn a set of simpler sub-tasks building up to the final task. This way the agent is able to visit various states along the task without first having to learn the correct actions leading up to those states.

Fig. 6: Task success rate of the trained policies.

In all comparison cases, learning task 2 is faster than learning task 1. This is because task 1 imposes more constraints on the desired behavior (ordering). It is expected that even with the shaped reward, demonstrations and behavior cloning are able to help bootstrap learning at the initial stages. However, such initialization can be damaged, as shown in Figure 5 left. After training, we evaluate the policies by running 10 trials with randomly initialized robot and workspace configurations. Results in Figure 6 show that the policies resulting from our method (with and without demonstrations) are able to accomplish the tasks relatively reliably, whereas the policies from the shaped rewards struggled.

                                        Task 1    Task 2
Ours                                    36.5      34.3
Ours without demonstration              35.7      34.1
Shaped reward with demonstration        100       93.2
Shaped reward without demonstration     99.6      95.5
TABLE I: Average number of steps to finish the task

It should be noted that there typically will be more than one policy that satisfies Equation (5). However, the discount factor in Equation (8) reduces the number of optimal policies to one (the one that yields a satisfying trajectory in the least number of steps). Table I shows the average number of steps each policy takes to accomplish the corresponding task.

As with any technique based on formal methods, there is a learning curve to understanding formal languages and using them well in writing specifications. We find that the FSA has significantly helped us understand what we are specifying to the agent, which served as an effective means to alleviate reward hacking [35]. In its current state, our framework does not support the specification of persistent tasks [36]. We have also yet to demonstrate tasks specified over both MDP states and actions (e.g. if some state occurs then do something). These are possible extensions for future work.

IX Conclusions

Learning to follow logical instructions can be useful in real life (e.g. following a recipe or the traffic rules). In this work, we proposed a method to combine temporal logic with reinforcement learning from demonstrations, which provides the agent with a temporal hierarchy and task-aligned intrinsic rewards. We showed that, compared to heuristically designed reward functions, our method provides a formalism for task specification and is able to learn with less experience. By toggling the automaton state, our learned policy is able to exhibit different behaviors specified by the intrinsic reward in Equation (7), even though no hierarchy is imposed on the policy architecture (a simple feedforward neural network). For future work, we will take advantage of this characteristic and develop a set of techniques for skill composition and task-space transfer. We will also demonstrate our methods on more complex tasks.