Automata Guided Hierarchical Reinforcement Learning for Zero-shot Skill Composition

10/31/2017 ∙ by Xiao Li, et al. ∙ Boston University

An obstacle that prevents the wide adoption of (deep) reinforcement learning (RL) in control systems is its need for a large number of interactions with the environment in order to master a skill. The learned skill usually generalizes poorly across domains, and re-training is often necessary when presented with a new task. We present a framework that combines techniques from formal methods with hierarchical reinforcement learning (HRL). The set of techniques we provide allows for convenient specification of tasks with complex logic, learns hierarchical policies (meta-controller and low-level controllers) with well-defined intrinsic rewards using any RL method, and is able to construct new skills from existing ones without additional learning. We evaluate the proposed methods in a simple grid world simulation as well as a simulation on a Baxter robot.




1 Introduction

Hierarchical reinforcement learning (HRL) is an effective means of improving sample efficiency and achieving transfer among tasks. The goal is to obtain task-invariant low-level policies; by re-training the meta-policy that schedules over the low-level policies, different skills can be obtained with fewer samples than training from scratch. Heess et al. (2016) have adopted this idea in learning locomotor controllers and have shown successful transfer among simulated locomotion tasks. Oh et al. (2017) have utilized a deep hierarchical architecture for multi-task learning using natural language instructions.

Skill composition is the idea of constructing new skills out of existing ones (and hence their policies) with little to no additional learning. In stochastic optimal control, this idea has been adopted by Todorov (2009) and Da Silva et al. (2009) to construct provably optimal control laws based on linearly solvable Markov decision processes. Haarnoja et al. (2018) have shown in simulated and real manipulation tasks that approximately optimal policies can result from adding the Q-functions of the existing policies.

Temporal logic (TL) is a formal language commonly used in software and digital circuit verification (Baier and Katoen (2008)) as well as formal synthesis (Belta et al. (2017)). It allows for convenient expression of complex behaviors and causal relationships. TL has been used by Tabuada and Pappas (2004), Fainekos et al. (2006) and Fainekos et al. (2005) to synthesize provably correct control policies. Aksaray et al. (2016) have also combined TL with Q-learning to learn satisfiable policies in discrete state and action spaces.

In this work, we focus on hierarchical skill learning and composition. Once a set of skills is acquired, we provide a technique that can synthesize new skills with little to no further interaction with the environment. We adopt the syntactically co-safe truncated linear temporal logic (scTLTL) as the task specification language. Compared to most heuristic reward structures used in the RL literature, a formal specification language has the advantage of semantic rigor and interpretability. Our main contributions are:

  • Compared to existing skill composition methods, we are able to learn and compose logically complex tasks that would otherwise be difficult to express analytically as a reward function. We take advantage of the transformation between scTLTL formulas and finite state automata (FSA) to construct deterministic meta-controllers directly from the task specifications. We show that by adding one discrete dimension to the original state space, structurally simple parameterized policies such as feed-forward neural networks can be used to learn tasks that require complex temporal reasoning.

  • Intrinsic motivation has been shown to help RL agents learn complicated behaviors with fewer interactions with the environment (Singh et al. (2004), Kulkarni et al. (2016), Jaderberg et al. (2016)). However, designing a well-behaved intrinsic reward that aligns with the extrinsic reward takes effort and experience. In our work, we construct intrinsic rewards directly from the input alphabets of the FSA, which guarantees that maximizing each intrinsic reward makes positive progress towards satisfying the entire task specification. From a user’s perspective, the intrinsic rewards are constructed automatically from the TL formula without the need for further reward engineering.

  • In our framework, each FSA represents a hierarchical policy with low-level controllers that can be re-modulated to achieve different tasks. Skill composition is accomplished by taking the product of FSAs. Instead of interpolating/extrapolating among learned skills/latent features, our method is based on graph manipulation of the FSA. Therefore, the compositional outcome is much more transparent. At testing time, the behavior of the policy is strictly enforced by the FSA and therefore safety can be guaranteed if encoded in the specification. We introduce a method that allows learning of such hierarchical policies with any non-hierarchical RL algorithm. Compared with previous work on skill composition, we impose no constraints on the policy representation or the problem class.

2 Preliminaries

2.1 Reinforcement Learning

We start with the definition of a Markov Decision Process.

Definition 1.

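An MDP is defined as a tuple $\mathcal{M} = \langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$, where $S \subseteq \mathbb{R}^n$ is the state space; $A \subseteq \mathbb{R}^m$ is the action space ($S$ and $A$ can also be discrete sets); $p: S \times A \times S \to [0,1]$ is the transition function with $p(s'|s,a)$ being the conditional probability density of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$; $r: S \times A \times S \to \mathbb{R}$ is the reward function with $r(s,a,s')$ being the reward obtained by executing action $a$ at state $s$ and transitioning to $s'$.

Let $T$ be the horizon of the task. The optimal policy $\pi^\star: S \to A$ (or $\pi^\star: S \times A \to [0,1]$ for stochastic policies) that solves the MDP maximizes the expected return, i.e.

$$\pi^\star = \arg\max_{\pi} \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1})\right], \qquad (1)$$

where $\mathbb{E}^{\pi}[\cdot]$ is the expectation following $\pi$. The state-action value function is defined as

$$Q^{\pi}(s, a) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \,\middle|\, s_0 = s, a_0 = a\right] \qquad (2)$$

to be the expected return of choosing action $a$ at state $s$ and following $\pi$ onwards. Assuming the policy is greedy with respect to $Q$, i.e. $\pi(s) = \arg\max_{a} Q(s, a)$, then at convergence, Equation (2) yields

$$Q^\star(s_t, a_t) = \mathbb{E}^{\pi_{explore}}\left[r(s_t, a_t, s_{t+1}) + \gamma \max_{a} Q^\star(s_{t+1}, a)\right], \qquad (3)$$

where $Q^\star$ is the optimal state-action value function and $\gamma \le 1$ is a discount factor that favors near term over long term rewards if smaller than 1. $\pi_{explore}$ can be any exploration policy (it will sometimes be omitted for simplicity of presentation). This is the setting that we will adopt for the remainder of this work.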
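As a minimal sketch of how the fixed-point relation for the optimal state-action value function above is typically used, the tabular update below moves Q(s, a) a small step toward the one-step target r + γ max Q(s', ·). The function name `q_update`, the dictionary layout and the learning rate `alpha` are illustrative assumptions, not notation from the paper.

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, gamma=0.95, alpha=0.1):
    """One off-policy temporal-difference step toward r + gamma * max_b Q(s', b)."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)            # unseen (state, action) pairs default to 0
actions = ["left", "right"]       # hypothetical action set
q_update(Q, s=0, a="right", r=1.0, s_next=1, actions=actions)
```

Because the target uses the maximum over next actions rather than the action actually taken, the transitions can come from any exploration policy.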

2.2 scTLTL and Finite State Automata

We consider tasks specified with syntactically co-safe Truncated Linear Temporal Logic (scTLTL), which is a fragment of truncated linear temporal logic (TLTL) (Li et al. (2016)). The set of allowed operators is

$$\phi := \top \;|\; f(s) < c \;|\; \neg\phi \;|\; \phi \wedge \psi \;|\; \Diamond\phi \;|\; \phi\,\mathcal{U}\,\psi \;|\; \phi\,\mathcal{T}\,\psi \;|\; \bigcirc\phi, \qquad (4)$$

where $\top$ is the True Boolean constant. $f(s) < c$ is a predicate. $\neg$ (negation/not), $\wedge$ (conjunction/and) are Boolean connectives. $\Diamond$ (eventually), $\mathcal{U}$ (until), $\mathcal{T}$ (then), $\bigcirc$ (next) are temporal operators. $\Rightarrow$ (implication) and $\vee$ (disjunction/or) can be derived from the above operators. Compared to TLTL, we exclude the $\Box$ (always) operator to maintain a one-to-one correspondence between an scTLTL formula and a finite state automaton (FSA) defined below.

Definition 2.

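An FSA [1] is defined as a tuple $\mathcal{A} = \langle \mathbb{Q}, \Psi, q_0, p(\cdot|\cdot), \mathcal{F} \rangle$, where $\mathbb{Q}$ is a set of automaton states; $\Psi$ is the input alphabet; $q_0 \in \mathbb{Q}$ is the initial state; $p: \mathbb{Q} \times \mathbb{Q} \to [0,1]$ is a conditional probability defined as

$$p(q_j|q_i) = \begin{cases} 1 & \psi_{q_i,q_j} \text{ is true} \\ 0 & \text{otherwise} \end{cases} \quad \text{or} \quad p(q_j|q_i, s) = \begin{cases} 1 & \rho(s, \psi_{q_i,q_j}) > 0 \\ 0 & \text{otherwise.} \end{cases} \qquad (5)$$

$\mathcal{F} \subseteq \mathbb{Q}$ is a set of final automaton states.

[1] Here we slightly modify the conventional definition of FSA and incorporate the probabilities in Equation (5). For simplicity, we continue to adopt the term FSA.

We denote by $\psi_{q_i,q_j} \in \Psi$ the predicate guarding the transition from $q_i$ to $q_j$. Because $\psi_{q_i,q_j}$ is a predicate without temporal operators, the robustness $\rho(s_{t:t+k}, \psi_{q_i,q_j})$ is only evaluated at $s_t$. Therefore, we use the shorthand $\rho(s_t, \psi_{q_i,q_j}) = \rho(s_{t:t+k}, \psi_{q_i,q_j})$. We abuse the notation $p(\cdot)$ to represent both kinds of transitions when the context is clear. For each scTLTL formula $\phi$, one can construct a corresponding FSA $\mathcal{A}_\phi$. An example of an FSA is provided in Section C.1 in the supplementary material. The translation from a TLTL formula to an FSA can be done automatically with available packages like Lomap (Vasile (2017)).

There exists a real-valued function $\rho(s_{0:T}, \phi)$ called robustness degree that measures the level of satisfaction of trajectory $s_{0:T}$ (here $s_{0:T} = [s_0, \dots, s_T]$ is the state trajectory from time 0 to $T$) with respect to a scTLTL formula $\phi$. $\rho(s_{0:T}, \phi) > 0$ indicates that $s_{0:T}$ satisfies $\phi$ and vice versa (full semantics of scTLTL are provided in Section A in the supplementary material).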
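To make the idea of a robustness degree concrete for the temporal-operator-free predicates that guard FSA edges, the sketch below evaluates robustness at a single state. It assumes predicates of the form f(s) < c with robustness c − f(s), and the min/max/negation rules commonly used in quantitative semantics of temporal logics; the full scTLTL semantics are in the supplementary material and are not reproduced here. All function names are hypothetical.

```python
def pred(f, c):
    """Robustness of the predicate f(s) < c at one state: positive iff it holds."""
    return lambda s: c - f(s)

def neg(rho):
    """Negation flips the sign of the robustness."""
    return lambda s: -rho(s)

def conj(*rhos):
    """A conjunction is only as satisfied as its least satisfied operand."""
    return lambda s: min(r(s) for r in rhos)

def disj(*rhos):
    """A disjunction is as satisfied as its most satisfied operand."""
    return lambda s: max(r(s) for r in rhos)

# Hypothetical 1-D predicates: "within 1 unit of the point 3" and "s < 2.5".
near_goal = pred(lambda s: abs(s - 3.0), 1.0)
left_half = pred(lambda s: s, 2.5)
both = conj(near_goal, left_half)
either = disj(near_goal, left_half)
```

Under these assumptions a positive value means the predicate holds at that state, and the magnitude indicates the margin by which it holds or fails.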

3 Problem Formulation

Problem 1.

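Given an MDP as in Definition 1 with unknown transition dynamics $p(s'|s,a)$ and a scTLTL specification $\phi$ as in Definition 2, find a policy $\pi^\star_\phi$ such that

$$\pi^\star_\phi = \arg\max_{\pi_\phi} \mathbb{E}^{\pi_\phi}\left[\mathbb{1}\big(\rho(s_{0:T}, \phi) > 0\big)\right], \qquad (6)$$

where $\mathbb{1}(\rho(s_{0:T}, \phi) > 0)$ is an indicator function with value 1 if $\rho(s_{0:T}, \phi) > 0$ and 0 otherwise. $\pi^\star_\phi$ is said to satisfy $\phi$.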

Problem 1 defines a policy search problem where the trajectories resulting from following the optimal policy should satisfy the given scTLTL formula in expectation. It should be noted that there typically will be more than one policy that satisfies Equation (6). We use a discount factor to reduce the number of satisfying policies to one (one that yields a satisfying trajectory in the least number of steps). Details will be discussed in the next section.

Problem 2.

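Given two scTLTL formulas $\phi_1$ and $\phi_2$ along with a policy $\pi_{\phi_1}$ that satisfies $\phi_1$ and a policy $\pi_{\phi_2}$ that satisfies $\phi_2$ (and their corresponding state-action value functions $Q(s, q_1, a)$ and $Q(s, q_2, a)$), obtain a policy $\pi_{\phi_\wedge}$ that satisfies $\phi_\wedge = \phi_1 \wedge \phi_2$.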

Problem 2 defines the problem of skill composition. Given two policies each satisfying a scTLTL specification, construct the policy that satisfies the conjunction of the given specifications. Solving this problem is useful when we want to break a complex task into simple and manageable components, learn a policy that satisfies each component and "stitch" all the components together so that the original task is satisfied. It can also be the case that as the scope of the task grows with time, the original task specification is amended with new items. Instead of having to re-learn the task from scratch, we can learn only policies that satisfy the new items and combine them with the old policy.

4 FSA Augmented MDP

Problem 1 can be solved with any RL algorithm using robustness as the terminal reward, as is done by Li et al. (2016). However, in doing so the agent suffers from sparse feedback because a reward signal can only be obtained at the end of each episode. To address this problem, as well as to set up the ground for automata guided HRL, we introduce the FSA augmented MDP.

Definition 3.

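An FSA augmented MDP corresponding to scTLTL formula $\phi$ (constructed from FSA $\langle \mathbb{Q}_\phi, \Psi_\phi, q_{\phi,0}, p_\phi(\cdot|\cdot), \mathcal{F}_\phi \rangle$ and MDP $\langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$) is defined as $\mathcal{M}_\phi = \langle \tilde{S}, A, \tilde{p}(\cdot|\cdot,\cdot), \tilde{r}(\cdot,\cdot), \mathcal{F}_\phi \rangle$ where $\tilde{S} \subseteq S \times \mathbb{Q}_\phi$, $\tilde{p}(\tilde{s}'|\tilde{s}, a)$ is the probability of transitioning to $\tilde{s}'$ given $\tilde{s}$ and $a$,

$$\tilde{p}(\tilde{s}'|\tilde{s}, a) = p\big((s', q')\,|\,(s, q), a\big) = \begin{cases} p(s'|s, a) & p_\phi(q'|q, s) = 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (7)$$

$p_\phi$ is defined in Equation (5). $\tilde{r}: \tilde{S} \times \tilde{S} \to \mathbb{R}$ is the FSA augmented reward function, defined by

$$\tilde{r}(\tilde{s}, \tilde{s}') = \mathbb{1}\big(\rho(s', D^q_\phi) > 0\big), \qquad (8)$$

where $D^q_\phi = \bigvee_{q' \in \Omega_q} \psi_{q,q'}$ represents the disjunction of all predicates guarding the transitions that originate from $q$ ($\Omega_q$ is the set of automata states that are connected with $q$ through outgoing edges).

The goal is to find the optimal policy that maximizes the expected sum of discounted return, i.e.

$$\pi^\star = \arg\max_{\pi} \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde{r}(\tilde{s}_t, \tilde{s}_{t+1})\right], \qquad (9)$$

where $\gamma < 1$ is the discount factor and $T$ is the time horizon.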

The reward function in Equation (8) encourages the system to exit the current automata state and move on to the next, and by doing so eventually reach the final state (property of FSA) which satisfies the TL specification and hence Equation (6). The discount factor in Equation (9) reduces the number of satisfying policies to one.
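The following sketch illustrates one way the augmented state and the intrinsic reward described above could be wrapped around an ordinary environment step. The names `augmented_step`, `env_step` and `edges` are hypothetical, the guards are represented by robustness functions of a single state, and advancing the automaton on the newly reached state is a simplifying choice of this sketch rather than a statement of the paper's exact timing convention.

```python
def augmented_step(env_step, edges, s, q, a):
    """Advance the augmented state (s, q) by one action; return ((s', q'), reward)."""
    s_next = env_step(s, a)
    outgoing = edges.get(q, [])        # [(q_next, guard_robustness_fn), ...]
    # Reward is 1 when s' satisfies the disjunction of the outgoing guards of q.
    reward = 1.0 if any(g(s_next) > 0 for _, g in outgoing) else 0.0
    # Take the first enabled edge; remain in q when no guard holds.
    q_next = next((qn for qn, g in outgoing if g(s_next) > 0), q)
    return (s_next, q_next), reward

# Hypothetical 1-D task "eventually s > 2.5": q0 --(s > 2.5)--> qf
edges = {"q0": [("qf", lambda s: s - 2.5)]}
toy_step = lambda s, a: s + a          # deterministic toy dynamics
state, r1 = augmented_step(toy_step, edges, 1, "q0", 1)
state2, r2 = augmented_step(toy_step, edges, *state, 1)
```

The agent receives feedback every time it leaves an automaton state, instead of only at the end of an episode.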

The FSA augmented MDP can be constructed with any standard MDP and a scTLTL formula, and can be solved with any off-the-shelf RL algorithm. By directly learning the flat policy we bypass the need to define and learn each sub-policy separately. After obtaining the optimal policy $\pi^\star_\phi$, the optimal sub-policy for any $q_i$ can be extracted by executing $\pi^\star_\phi(s_t, q_i)$ without transitioning the automata state, i.e. keeping $q_i$ fixed. The sub-policy is thus

$$\pi^\star_\phi(s_t, q_i) = \arg\max_{a_t} Q^\star(s_t, q_i, a_t). \qquad (10)$$



5 Automata Guided Skill Composition

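In this section, we provide a solution for Problem 2 by constructing the FSA of $\phi_\wedge = \phi_1 \wedge \phi_2$ from that of $\phi_1$ and $\phi_2$ and using $\mathcal{A}_{\phi_\wedge}$ to synthesize the policy for the combined skill. We start with the following definition.

Definition 4. [2]

[2] Details can be found in pro (2011).

Given $\mathcal{A}_{\phi_1}$ and $\mathcal{A}_{\phi_2}$ corresponding to formulas $\phi_1$ and $\phi_2$, the FSA of $\phi_\wedge$ is the product automaton of $\mathcal{A}_{\phi_1}$ and $\mathcal{A}_{\phi_2}$, i.e. $\mathcal{A}_{\phi_\wedge} = \mathcal{A}_{\phi_1} \times \mathcal{A}_{\phi_2} = \langle \mathbb{Q}_{\phi_\wedge}, \Psi_{\phi_\wedge}, q_{\wedge,0}, p_{\phi_\wedge}, \mathcal{F}_{\phi_\wedge} \rangle$ where $\mathbb{Q}_{\phi_\wedge} \subseteq \mathbb{Q}_{\phi_1} \times \mathbb{Q}_{\phi_2}$ is the set of product automaton states, $q_{\wedge,0} = (q_{1,0}, q_{2,0})$ is the product initial state, and $\mathcal{F}_{\phi_\wedge}$ are the final accepting states. Following Definition 2, for states $q' = (q_1, q_2) \in \mathbb{Q}_{\phi_\wedge}$ and $q'' = (q_1', q_2') \in \mathbb{Q}_{\phi_\wedge}$, the transition probability $p_{\phi_\wedge}$ is defined as

$$p_{\phi_\wedge}(q''|q') = \begin{cases} 1 & p_{\phi_1}(q_1'|q_1)\, p_{\phi_2}(q_2'|q_2) = 1 \\ 0 & \text{otherwise.} \end{cases} \qquad (12)$$

An example of a product automaton is provided in Section C.2 in the supplementary material.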
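A small sketch of the product construction may help: a product edge exists only when both component automata have a corresponding edge, and its guard is the conjunction of the two component guards. The dictionary layout, the string guards, the explicit self-loops and the choice of pairs of component final states as product final states are all assumptions of this illustration.

```python
from itertools import product

def product_fsa(fsa1, fsa2):
    """Each FSA: {'init': q0, 'final': set, 'edges': {q: {q_next: guard_str}}}."""
    edges = {}
    for q1, q2 in product(fsa1["edges"], fsa2["edges"]):
        out = {}
        # Pair every outgoing edge of q1 with every outgoing edge of q2.
        for (n1, g1), (n2, g2) in product(fsa1["edges"][q1].items(),
                                          fsa2["edges"][q2].items()):
            out[(n1, n2)] = f"({g1}) & ({g2})"
        edges[(q1, q2)] = out
    return {
        "init": (fsa1["init"], fsa2["init"]),
        "final": set(product(fsa1["final"], fsa2["final"])),
        "edges": edges,
    }

def eventually(name):
    """Hypothetical two-state automaton for 'eventually <name>' with self-loops."""
    return {"init": "q0", "final": {"qf"},
            "edges": {"q0": {"q0": f"!{name}", "qf": name}, "qf": {"qf": "true"}}}

prod = product_fsa(eventually("a"), eventually("b"))
```

In this toy case the product has four states, and the edge from `("q0", "q0")` to `("qf", "qf")` is guarded by both `a` and `b` holding at once.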

For , let , and denote the set of predicates guarding the edges originating from , and respectively. Equation (12) entails that a transition at in the product automaton exists only if corresponding transitions at , exist in and respectively. Therefore, , for (here is a state such that ). Following Equation (11),


Here is the FSA state of at time . are FSA states that are connected to through an outgoing edge. It can be shown that




We provide the derivation in Section B in the supplementary material.

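Equation (15) takes a similar form to Equation (11). Since we have already learned $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$, and the remaining term is nonzero only when there are states where $D^{q_1}_{\phi_1} \wedge D^{q_2}_{\phi_2}$ is true (that is, where outgoing guards of both component automata hold at once), we should obtain a good initialization of $Q^\star_{\phi_\wedge}$ by adding $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$ (a similar technique is adopted by Haarnoja et al. (2018)). This addition of local Q functions is in fact an optimistic estimation of the global Q function; the properties of such Q-decomposition methods are studied by Russell and Zimdars (2003).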

Here we propose an algorithm to obtain the optimal composed Q function $Q^\star_{\phi_\wedge}$ given the already learned $Q^\star_{\phi_1}$, $Q^\star_{\phi_2}$ and the data collected while training them.

Algorithm 1 FSA guided skill composition
1: Inputs: The learned Q functions $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$, the replay pool collected when training $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$, and the product FSA $\mathcal{A}_{\phi_\wedge}$

The Q functions in Algorithm 1 can be a grid representation or a parametrized function. The update function takes in a Q-function, the product FSA, the stored replay buffer and a reward, and performs an off-policy Q update. If the initial state distribution remains unchanged, Algorithm 1 should provide a decent estimate of the composed Q function without needing to further interact with the environment. The intuition is that the experience collected from training $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$ should have well explored the regions in state space that satisfy $\phi_1$ and $\phi_2$, and hence also explored the regions that satisfy $\phi_\wedge$. Having obtained $Q^\star_{\phi_\wedge}$, a greedy policy can be extracted in similar ways to DQN (Mnih et al. (2015)) for discrete actions or DDPG (Silver et al. (2014)) for continuous actions. Details of Algorithm 1 are provided in Section D.5 in the supplementary material.
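The sketch below gives one tabular reading of the composition procedure, inferred from the description in this section and from the step numbers mentioned in the case studies (initialization in step 2, compensation in steps 3 to 5, a final update in step 6). It is an assumption, not a transcription of Algorithm 1; the names `offline_update`, `compose`, `r_overlap` and `r_product`, and the layout of the replay pool, are hypothetical.

```python
from collections import defaultdict

def offline_update(Q, replay, reward_fn, actions, gamma=0.95, alpha=0.1, sweeps=50):
    """Off-policy Q-learning sweeps over stored transitions (x, a, x_next)."""
    for _ in range(sweeps):
        for x, a, x_next in replay:
            target = reward_fn(x, x_next) + gamma * max(Q[(x_next, b)] for b in actions)
            Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q

def compose(Q1, Q2, replay, r_overlap, r_product, actions, sweeps=50):
    """x = (s, q1, q2) is a product state; Q1 and Q2 are dicts keyed by ((s, q), a)."""
    def q_sum(x, a):
        # Step 2 (assumed): initialise the composed Q function with Q1 + Q2.
        s, q1, q2 = x
        return Q1.get(((s, q1), a), 0.0) + Q2.get(((s, q2), a), 0.0)

    # Steps 3-5 (assumed): estimate the overlap term from stored data, subtract it.
    Q_overlap = offline_update(defaultdict(float), replay, r_overlap, actions,
                               sweeps=sweeps)
    visited = {x for x, _, _ in replay} | {xn for _, _, xn in replay}
    Q_and = defaultdict(float)
    for x in visited:
        for a in actions:
            Q_and[(x, a)] = q_sum(x, a) - Q_overlap[(x, a)]

    # Step 6 (assumed): fine-tune on the product-FSA reward using stored data only.
    return offline_update(Q_and, replay, r_product, actions, sweeps=sweeps)

# Hypothetical single stored transition that reaches the product final state.
Q1 = {((0, "q0"), "r"): 0.5}
Q2 = {((0, "q0"), "r"): 0.25}
replay = [((0, "q0", "q0"), "r", (1, "qf", "qf"))]
zero = lambda x, x_next: 0.0
Q_init_only = compose(Q1, Q2, replay, zero, zero, ["r"], sweeps=0)
```

With `sweeps=0` only the initialization is applied, which mirrors the zero-shot case where the sum of the learned Q functions is used directly.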

6 Case Studies

We evaluate the proposed methods in two types of environments. The first is a grid world environment that aims to illustrate the inner workings of our method. The second is a kitchen environment simulated in AI2Thor (Kolve et al. (2017)).

6.1 Grid World

Consider an agent that navigates in a grid world. Its MDP state space is where are its integer coordinates on the grid. The action space is [up, down, left, right, stay]. The transition is such that for each action command, the agent follows that command with probability 0.8 or chooses a random action with probability 0.2. We train the agent on two tasks, and . In English, expresses the requirement that for the horizon of task, regions and need to be reached at least once. The regions are defined by the predicates and . Because the coordinates are integers, and define a point goal rather than regions. expresses a similar task for . Figure 1 shows the FSA for each task.
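The noisy transition described above can be sketched as follows. The grid size, the coordinate convention for "up", and clipping at the walls are assumptions of this illustration; `grid_step` is a hypothetical name.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0), "stay": (0, 0)}

def grid_step(pos, action, size, rng=random, p_follow=0.8):
    """Follow `action` with probability p_follow, otherwise a uniformly random action."""
    if rng.random() >= p_follow:
        action = rng.choice(list(MOVES))
    dx, dy = MOVES[action]
    x = min(max(pos[0] + dx, 0), size - 1)   # clip at the walls (assumed)
    y = min(max(pos[1] + dy, 0), size - 1)
    return (x, y)

next_pos = grid_step((2, 2), "up", size=5, rng=random.Random(0))
```

Note that if the random action is drawn uniformly from all five actions, as assumed here, it can coincide with the commanded one, so the commanded move is executed with probability 0.8 + 0.2/5 = 0.84.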

We apply standard tabular Q-learning (Watkins (1989)) on the FSA augmented MDP of this environment. For all experiments, we use a discount factor of 0.95, learning rate of 0.1, episode horizon of 200 steps, a random exploration policy and a total number of 2000 update steps which is enough to reach convergence.

Figure 1 : FSA and policy for (a) . (b) . (c) . The agent moves in a gridworld with 3 labeled regions. The agent has actions [up, down, left, right, stay] where the directional actions are represented by arrows, stay is represented by the blue dot.

Figure 1 (a) and (b) show the learned optimal policies extracted by . We plot for each and observe that each represents a sub-policy whose goal is given by Equation (8). The FSA effectively acts as a meta-policy. We are able to obtain such meaningful hierarchy without having to explicitly incorporate it in the learning process.
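As a rough illustration of how one sub-policy per automaton state can be read off a single learned Q table, the sketch below takes the greedy action for every MDP state while holding the automaton state fixed. The key layout `(s, q, a)` and the name `sub_policies` are assumptions.

```python
def sub_policies(Q, states, fsa_states, actions):
    """Greedy action for every MDP state, separately for each fixed automaton state q."""
    return {
        q: {s: max(actions, key=lambda a: Q.get((s, q, a), 0.0)) for s in states}
        for q in fsa_states
    }

# Hypothetical Q values for a single MDP state and two automaton states.
Q = {(0, "q0", "left"): 0.1, (0, "q0", "right"): 0.9, (0, "q1", "left"): 0.7}
pi = sub_policies(Q, states=[0], fsa_states=["q0", "q1"], actions=["left", "right"])
```

Each inner dictionary is an ordinary state-to-action map, so it can be plotted or executed on its own.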

Figure 1 (c) shows the composed FSA and policy using Algorithm 1. Prior to composition, we normalized the Q functions by dividing each by its max value to put them in the same range. This is possible because the Q values of both policies have the same meaning (expected discounted edge distance to the final state on the FSA). In this case the initialization step (step 2) is sufficient to obtain the optimal composed policy without further updating. The reason is that there are no overlaps between the regions, so the overlap term vanishes for all states and actions, which renders steps 3, 4, 5 unnecessary. We found that step 6 in Algorithm 1 is also not necessary here.

6.2 AI2Thor

In this section, we apply the proposed methods in a simulated kitchen environment. The goal is to find a user defined object (e.g. an apple) and place it in a user defined receptacle (e.g. the fridge). Our main focus for this experiment is to learn a high level decision-making policy and therefore we assume that the agent can navigate to any desired location.

There are a total of 17 pickupable objects and 39 receptacle objects, which we index from 0 to 55. Our state space depends on these objects and their properties/states. We have a set of 62 discrete actions {pick, put, open, close, look up, look down, navigate(id)} where id can take values from 0 to 55. Detailed descriptions of the environment, state and action spaces are provided in Sections D.1, D.2 and D.3 of the supplementary material.
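The action count can be checked with a few lines: six primitive actions plus one navigation action per indexed object. The string encoding of the actions is only for illustration.

```python
primitive = ["pick", "put", "open", "close", "look up", "look down"]
n_objects = 17 + 39                                  # pickupable + receptacle objects
actions = primitive + [f"navigate({i})" for i in range(n_objects)]
n_actions = len(actions)                             # 6 + 56
```

This gives 56 object indices (0 to 55) and 62 actions in total, matching the numbers stated above.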

We start with a relatively easy task of "find and pick up the apple and put it in the fridge"(which we refer to as task 1) and extend it to "find and pick up any user defined object and put it in any user defined receptacle" (which we refer to as task 2). For each task, we learn with three specifications with increasing prior knowledge encoded in the scTLTL formula. The specifications are referred to as with denoting the task number and denoting the specification number. The higher the more prior knowledge is encoded. We also explore the combination of the intrinsic reward defined in the FSA augmented MDP with a heuristic penalty. Here we penalize the agent for each failed action and denote the learning trials with penalty by . To evaluate automata guided skill composition, we combine task 1 and task 2 and require the composed policy to accomplish both tasks during an episode (we refer to this task as composition task). Details on the specifications are provided in Section D.4 of the supplementary material.

We use a feed forward neural network as the policy and DDQN (Van Hasselt et al. (2016)) with prioritized experience replay (Schaul et al. (2015)) as the learning algorithm. We found that adaptively normalizing the Q function with methods proposed in (van Hasselt et al. (2016)) helps accelerate task composition. Algorithm details are provided in Section D.5 of the supplementary material. For each task, we evaluate the learned policy at various points along the training process by running the policy without exploration for 50 episodes with random initialization. Performance is evaluated by the average task success rate and episode length (if the agent can quickly accomplish the task). We also include the action success rate (if the agent learns not to execute actions that will fail) during training as a performance metric.

Figure 2 : (a) FSA for specification . (b) Agent’s first person view of the environment at each transition to a new FSA state (the apple in the last image is placed on the first bottom shelf). Task success rate, action success rate and mean episode length for (c) task 1. (d) task 2. (e) composition task

Figure 2(a) shows the FSA of specification , and Figure 2(b) illustrates the agent’s first person view at states where transition on the FSA occurs. Note that navigating from the sink (with apple picked up) to the fridge does not invoke progress on the FSA because such instruction is not encoded in the specification. Figure 2(c) shows the learning performance of task 1. We can see that the more elaborate the specification, the higher the task success rate which is as expected ( and fail to learn the task due to sparse reward). It can also be observed that the action penalty helps facilitate the agent to avoid issuing failing actions and in turn reduces the steps necessary to complete the task.

Figure 2(d) shows the results for task 2. Most of the conclusions from task 1 persist. The success rate for task 2 is lower due to the added complexity of the task. The mean episode length is significantly larger than in task 1. This is because the object the agent is required to find is often initialized inside receptacles, so the agent needs to first find the object and then proceed to complete the task. This process is not encoded in the specification and hence relies solely on exploration. An important observation here is that learning with action penalty significantly improves the task success rate. The reason is that completing task 2 may require a large number of steps when the object is hidden in receptacles, and the agent will not have enough time if the action failure rate is high.

Figure 2(e) shows the performance of automata guided skill composition. Here we present results of progressively running Algorithm 1. The figure compares three variants: running only the initialization step (step 2 in the algorithm), running the initialization and compensation steps (steps 3, 4, 5), and running the entire algorithm. As a comparison, we also learn this task from scratch with the FSA augmented MDP. From the figures we can see that the action success rate is not affected by task complexity. Overall, the composed policy considerably outperforms the trained policy (the resultant product FSA for this task has 23 nodes and 110 edges, and is therefore expected to take longer to train). Simply running the initialization step already results in a decent policy. Incorporating the compensation step did not provide a significant improvement. This is most likely due to the lack of MDP states where the compensation term is nonzero. However, running the entire algorithm improves the composed policy by a significant margin because the final step fine-tunes the policy with the true objective and stored experiences. We provide additional discussions in Section D.6 of the supplementary material.

7 Conclusion

We present a framework that integrates the flexibility of reinforcement learning with the explainability and semantic rigor of formal methods. In particular, we allow task specification in scTLTL, an expressive formal language, and construct a product MDP that possesses an intrinsic hierarchical structure. We showed that applying RL methods on the product MDP results in a hierarchical policy whose sub-policies can be easily extracted and re-combined to achieve new tasks in a transparent fashion. In practice, the authors have particularly benefited from the FSA in terms of specification design and behavior prediction, in that mistakes in task expression can be identified before putting in the time and resources for training.