1 Introduction
Hierarchical reinforcement learning (HRL) is an effective means of improving sample efficiency and achieving transfer among tasks. The goal is to obtain task-invariant low-level policies; by retraining the meta-policy that schedules over the low-level policies, different skills can be obtained with fewer samples than training from scratch. Heess et al. (2016) have adopted this idea in learning locomotor controllers and have shown successful transfer among simulated locomotion tasks. Oh et al. (2017) have utilized a deep hierarchical architecture for multi-task learning using natural language instructions.
Skill composition is the idea of constructing new skills out of existing ones (and hence their policies) with little to no additional learning. In stochastic optimal control, this idea has been adopted by Todorov (2009) and Da Silva et al. (2009) to construct provably optimal control laws based on linearly solvable Markov decision processes. Haarnoja et al. (2018) have shown in simulated and real manipulation tasks that approximately optimal policies can result from adding the Q-functions of the existing policies.
Temporal logic (TL) is a formal language commonly used in software and digital circuit verification (Baier and Katoen (2008)) as well as formal synthesis (Belta et al. (2017)). It allows for convenient expression of complex behaviors and causal relationships. TL has been used by Tabuada and Pappas (2004), Fainekos et al. (2006), Fainekos et al. (2005) to synthesize provably correct control policies. Aksaray et al. (2016) have also combined TL with Q-learning to learn satisfiable policies in discrete state and action spaces.
In this work, we focus on hierarchical skill learning and composition. Once a set of skills is acquired, we provide a technique that can synthesize new skills with little to no further interaction with the environment. We adopt syntactically co-safe truncated linear temporal logic (scTLTL) as the task specification language. Compared to most heuristic reward structures used in the RL literature, a formal specification language has the advantage of semantic rigor and interpretability. Our main contributions are:

Compared to existing skill composition methods, we are able to learn and compose logically complex tasks that would otherwise be difficult to express analytically as a reward function. We take advantage of the transformation between scTLTL formulas and finite state automata (FSA) to construct deterministic meta-controllers directly from the task specifications. We show that by adding one discrete dimension to the original state space, structurally simple parameterized policies such as feedforward neural networks can be used to learn tasks that require complex temporal reasoning.

Intrinsic motivation has been shown to help RL agents learn complicated behaviors with fewer interactions with the environment (Singh et al. (2004), Kulkarni et al. (2016), Jaderberg et al. (2016)). However, designing a well-behaved intrinsic reward that aligns with the extrinsic reward takes effort and experience. In our work, we construct intrinsic rewards directly from the input alphabets of the FSA, which guarantees that maximizing each intrinsic reward makes positive progress towards satisfying the entire task specification. From a user's perspective, the intrinsic rewards are constructed automatically from the TL formula without the need for further reward engineering.

In our framework, each FSA represents a hierarchical policy with low-level controllers that can be re-modulated to achieve different tasks. Skill composition is accomplished by taking the product of FSAs. Instead of interpolating/extrapolating among learned skills/latent features, our method is based on graph manipulation of the FSA. Therefore, the compositional outcome is much more transparent. At testing time, the behavior of the policy is strictly enforced by the FSA and therefore safety can be guaranteed if encoded in the specification. We introduce a method that allows learning of such hierarchical policies with any non-hierarchical RL algorithm. Compared with previous work on skill composition, we impose no constraints on the policy representation or the problem class.
2 Preliminaries
2.1 Reinforcement Learning
We start with the definition of a Markov Decision Process.
Definition 1.
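An MDP is defined as a tuple $\mathcal{M} = \langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$, where $S \subseteq \mathbb{R}^n$ is the state space; $A \subseteq \mathbb{R}^m$ is the action space ($S$ and $A$ can also be discrete sets); $p: S \times A \times S \to [0,1]$ is the transition function with $p(s'|s,a)$ being the conditional probability density of taking action $a \in A$ at state $s \in S$ and ending up in state $s' \in S$; $r: S \times A \times S \to \mathbb{R}$ is the reward function with $r(s,a,s')$ being the reward obtained by executing action $a$ at state $s$ and transitioning to $s'$.
Let $T$ be the horizon of the task. The optimal policy $\pi^\star: S \to A$ (or $\pi^\star: S \times A \to [0,1]$ for stochastic policies) that solves the MDP maximizes the expected return, i.e.
$$\pi^\star = \arg\max_{\pi} E^{\pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1})\Big] \tag{1}$$
where $E^{\pi}[\cdot]$ is the expectation following $\pi$. The state-action value function is defined as
$$Q^{\pi}(s,a) = E^{\pi}\Big[\sum_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) \,\Big|\, s_0 = s, a_0 = a\Big] \tag{2}$$
to be the expected return of choosing action $a$ at state $s$ and following $\pi$ onwards. Assuming the policy is greedy with respect to $Q$, i.e. $\pi(s) = \arg\max_a Q(s,a)$, then at convergence, Equation (2) yields
$$Q^\star(s_t, a_t) = E^{\pi_{explore}}\Big[r(s_t, a_t, s_{t+1}) + \gamma \max_{a} Q^\star(s_{t+1}, a)\Big] \tag{3}$$
where $Q^\star$ is the optimal state-action value function and $\gamma \le 1$ is a discount factor that favors near-term over long-term rewards if smaller than 1. $\pi_{explore}$ can be any exploration policy (it will sometimes be omitted for simplicity of presentation). This is the setting that we will adopt for the remainder of this work.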
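As an illustration of Equation (3), the following is a minimal sketch of the sample-based tabular Q-learning update whose fixed point is that equation. The table shape, the function name `q_learning_update`, and the toy transition are assumptions made for illustration only; the default learning rate and discount factor simply reuse the values reported for the grid world experiment.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning step in place and return the table."""
    # Bootstrapped target: immediate reward plus discounted best next value.
    target = r + gamma * np.max(Q[s_next])
    # Move the current estimate a fraction alpha toward the target.
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Hypothetical toy problem: 3 states, 2 actions, all values start at zero.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```

Because the update only needs a transition tuple, the data can come from any exploration policy, which is what makes the off-policy setting above possible.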
2.2 scTLTL and Finite State Automata
We consider tasks specified with syntactically co-safe Truncated Linear Temporal Logic (scTLTL), which is a fragment of truncated linear temporal logic (TLTL) (Li et al. (2016)). The set of allowed operators is
$$\phi := \top \;|\; f(s) < c \;|\; \neg\phi \;|\; \phi \wedge \psi \;|\; \diamondsuit\phi \;|\; \phi\,\mathcal{U}\,\psi \;|\; \phi\,\mathcal{T}\,\psi \;|\; \bigcirc\phi \tag{4}$$
where $\top$ is the True Boolean constant. $f(s) < c$ is a predicate. $\neg$ (negation/not), $\wedge$ (conjunction/and) are Boolean connectives. $\diamondsuit$ (eventually), $\mathcal{U}$ (until), $\mathcal{T}$ (then), $\bigcirc$ (next), are temporal operators. $\Rightarrow$ (implication) and $\vee$ (disjunction/or) can be derived from the above operators. Compared to TLTL, we excluded the $\Box$ (always) operator to maintain a one-to-one correspondence between an scTLTL formula and a finite state automaton (FSA) defined below.
Definition 2.
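An FSA is defined as a tuple $\mathcal{A} = \langle \mathbb{Q}, \Psi, q_0, p(\cdot|\cdot), \mathcal{F} \rangle$, where $\mathbb{Q}$ is a set of automaton states; $\Psi$ is the input alphabet; $q_0 \in \mathbb{Q}$ is the initial state; $p: \mathbb{Q} \times \mathbb{Q} \to [0,1]$ is a conditional probability defined as
$$p(q_j | q_i) = \begin{cases} 1 & \psi_{q_i,q_j} \text{ is true} \\ 0 & \text{otherwise} \end{cases} \quad \text{or} \quad p(q_j | q_i, s) = \begin{cases} 1 & \rho(s, \psi_{q_i,q_j}) > 0 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$
$\mathcal{F}$ is a set of final automaton states. (Footnote 1: here we slightly modify the conventional definition of FSA and incorporate the probabilities in Equation (5). For simplicity, we continue to adopt the term FSA.)
We denote by $\psi_{q_i,q_j} \in \Psi$ the predicate guarding the transition from $q_i$ to $q_j$. Because $\psi_{q_i,q_j}$ is a predicate without temporal operators, the robustness $\rho(s_{t:t+k}, \psi_{q_i,q_j})$ is only evaluated at $s_t$. Therefore, we use the shorthand $\rho(s_t, \psi_{q_i,q_j}) = \rho(s_{t:t+k}, \psi_{q_i,q_j})$. We abuse the notation $p(\cdot)$ to represent both kinds of transitions when the context is clear. For each scTLTL formula $\phi$, one can construct a corresponding FSA $\mathcal{A}_\phi$. An example of an FSA is provided in Section C.1 in the supplementary material. The translation from a TLTL formula to an FSA can be done automatically with available packages like Lomap (Vasile (2017)).
There exists a real-valued function $\rho(s_{0:T}, \phi)$ called robustness degree that measures the level of satisfaction of trajectory $s_{0:T}$ (here $s_{0:T} = [s_0, \ldots, s_T]$ is the state trajectory from time 0 to $T$) with respect to an scTLTL formula $\phi$. $\rho(s_{0:T}, \phi) > 0$ indicates that $s_{0:T}$ satisfies $\phi$ and vice versa (full semantics of scTLTL are provided in Section A in the supplementary material).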
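To make the idea of a robustness degree concrete, here is a minimal sketch for three operators. It assumes the usual quantitative semantics of this logic family (a predicate of the form f(s) < c has robustness c minus f(s), conjunction takes a minimum, eventually takes a maximum over the remaining trajectory); the full semantics are in the supplementary material, so treat this as an illustration rather than the paper's definition. The helper names and the 1-D trajectory are hypothetical.

```python
def predicate(f, c):
    """Robustness of the predicate f(s) < c at time t: positive iff it holds."""
    return lambda traj, t: c - f(traj[t])

def conj(r1, r2):
    """Conjunction: as satisfied as the less satisfied of the two parts."""
    return lambda traj, t: min(r1(traj, t), r2(traj, t))

def eventually(r):
    """Eventually: best robustness reached from time t to the end."""
    return lambda traj, t: max(r(traj, k) for k in range(t, len(traj)))

# Hypothetical 1-D state trajectory.
traj = [0.0, 2.0, 5.0]
near_origin = predicate(lambda x: x, 1.0)     # x < 1
far_right = predicate(lambda x: -x, -4.0)     # x > 4, written as -x < -4
spec = conj(eventually(near_origin), eventually(far_right))
robustness = spec(traj, 0)  # positive means the trajectory satisfies the formula
```

The sign of the returned value decides satisfaction, while its magnitude indicates how far the trajectory is from changing that verdict.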
3 Problem Formulation
Problem 1.
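Given an MDP in Definition 1 with unknown transition dynamics $p(s'|s,a)$ and an scTLTL specification $\phi$ as in Definition 2, find a policy $\pi^\star_\phi$ such that
$$\pi^\star_\phi = \arg\max_{\pi_\phi} E^{\pi_\phi}\big[\mathbb{1}(\rho(s_{0:T}, \phi) > 0)\big] \tag{6}$$
where $\mathbb{1}(\rho(s_{0:T}, \phi) > 0)$ is an indicator function with value $1$ if the robustness degree $\rho(s_{0:T}, \phi) > 0$ and $0$ otherwise. $\pi^\star_\phi$ is said to satisfy $\phi$.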
Problem 1 defines a policy search problem where the trajectories resulting from following the optimal policy should satisfy the given scTLTL formula in expectation. It should be noted that there typically will be more than one policy that satisfies Equation (6). We use a discount factor to reduce the number of satisfying policies to one (one that yields a satisfying trajectory in the least number of steps). Details will be discussed in the next section.
Problem 2.
Given two scTLTL formulas $\phi_1$ and $\phi_2$ along with policy $\pi_{\phi_1}$ that satisfies $\phi_1$ and $\pi_{\phi_2}$ that satisfies $\phi_2$ (and their corresponding state-action value functions $Q^\star_{\phi_1}$ and $Q^\star_{\phi_2}$), obtain a policy $\pi_{\phi_\wedge}$ that satisfies $\phi_\wedge = \phi_1 \wedge \phi_2$.
Problem 2 defines the problem of skill composition. Given two policies each satisfying an scTLTL specification, construct the policy that satisfies the conjunction of the given specifications. Solving this problem is useful when we want to break a complex task into simple and manageable components, learn a policy that satisfies each component and "stitch" all the components together so that the original task is satisfied. It can also be the case that as the scope of the task grows with time, the original task specification is amended with new items. Instead of having to relearn the task from scratch, we can learn only policies that satisfy the new items and combine them with the old policy.
4 FSA Augmented MDP
Problem 1 can be solved with any RL algorithm using robustness as the terminal reward, as is done by Li et al. (2016). However, in doing so the agent suffers from sparse feedback because a reward signal can only be obtained at the end of each episode. To address this problem, as well as to lay the groundwork for automata guided HRL, we introduce the FSA augmented MDP.
Definition 3.
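An FSA augmented MDP corresponding to scTLTL formula $\phi$ (constructed from FSA $\langle \mathbb{Q}_\phi, \Psi_\phi, q_{\phi,0}, p_\phi(\cdot|\cdot), \mathcal{F}_\phi \rangle$ and MDP $\langle S, A, p(\cdot|\cdot,\cdot), r(\cdot,\cdot,\cdot) \rangle$) is defined as $\mathcal{M}_\phi = \langle \tilde{S}, A, \tilde{p}(\cdot|\cdot,\cdot), \tilde{r}(\cdot,\cdot), \mathcal{F}_\phi \rangle$ where $\tilde{S} \subseteq S \times \mathbb{Q}_\phi$, $\tilde{p}(\tilde{s}'|\tilde{s},a)$ is the probability of transitioning to $\tilde{s}'$ given $\tilde{s}$ and $a$,
$$\tilde{p}(\tilde{s}'|\tilde{s},a) = p\big((s',q') \,|\, (s,q), a\big) = \begin{cases} p(s'|s,a) & p_\phi(q'|q,s) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{7}$$
$p_\phi$ is defined in Equation (5). $\tilde{r}: \tilde{S} \times \tilde{S} \to \mathbb{R}$ is the FSA augmented reward function, defined by
$$\tilde{r}(\tilde{s}, \tilde{s}') = \mathbb{1}\big(\rho(s', D^{q}_\phi) > 0\big) \tag{8}$$
where $D^{q}_\phi = \bigvee_{q' \in \Omega_q} \psi_{q,q'}$ represents the disjunction of all predicates guarding the transitions that originate from $q$ ($\Omega_q$ is the set of automata states that are connected with $q$ through outgoing edges).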
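The goal is to find the optimal policy $\pi^\star_\phi$ that maximizes the expected sum of discounted return, i.e.
$$\pi^\star_\phi = \arg\max_{\pi_\phi} E^{\pi_\phi}\Big[\sum_{t=0}^{T-1} \gamma^{t}\, \tilde{r}(\tilde{s}_t, \tilde{s}_{t+1})\Big] \tag{9}$$
where $\gamma < 1$ is the discount factor and $T$ is the time horizon.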
The reward function in Equation (8) encourages the system to exit the current automata state and move on to the next, and by doing so eventually reach the final state (property of FSA) which satisfies the TL specification and hence Equation (6). The discount factor in Equation (9) reduces the number of satisfying policies to one.
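The following is a minimal sketch of how an environment step could be wrapped to produce the augmented state and the reward of Equation (8). Everything named here (`FSA`, `at`, `augmented_step`, the goal cells `A` and `B`, the toy `move_right` dynamics) is hypothetical. Two assumptions are worth flagging: self-loops are left out of the edge lists so that only edges leaving the current automaton state are rewarded, and both the reward and the automaton transition are evaluated at the next MDP state.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[int, int]
Guard = Callable[[State], float]  # robustness of a guard predicate at a state

def at(point: State) -> Guard:
    """Hypothetical guard: positive robustness only at one grid cell."""
    return lambda s: 1.0 if s == point else -1.0

class FSA:
    def __init__(self, edges: Dict[str, List[Tuple[Guard, str]]], q0: str, final: str):
        self.edges, self.q0, self.final = edges, q0, final

    def reward(self, q: str, s_next: State) -> float:
        """1 if s_next satisfies any guard leaving q (self-loops omitted), else 0."""
        return float(any(g(s_next) > 0 for g, _ in self.edges.get(q, [])))

    def step(self, q: str, s_next: State) -> str:
        """Follow the first outgoing edge whose guard holds; otherwise stay in q."""
        for g, q_next in self.edges.get(q, []):
            if g(s_next) > 0:
                return q_next
        return q

def augmented_step(fsa, env_step, s, q, a):
    """One step of the augmented MDP: returns (s', q', intrinsic reward)."""
    s_next = env_step(s, a)
    r = fsa.reward(q, s_next)
    return s_next, fsa.step(q, s_next), r

# Automaton for "visit A and visit B in any order" (assumed example task).
A, B = (2, 2), (5, 5)
fsa = FSA({"q0": [(at(A), "q1"), (at(B), "q2")],
           "q1": [(at(B), "qf")],
           "q2": [(at(A), "qf")]}, q0="q0", final="qf")

move_right = lambda s, a: (s[0] + 1, s[1])  # toy deterministic dynamics
s, q, r = augmented_step(fsa, move_right, (1, 2), fsa.q0, "right")
```

Any off-the-shelf learner can then be run on the pair (s, q) as if it were an ordinary state, which is the point of the construction.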
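The FSA augmented MDP can be constructed with any standard MDP and an scTLTL formula, and can be solved with any off-the-shelf RL algorithm. By directly learning the flat policy $\pi_\phi$ we bypass the need to define and learn each sub-policy separately. After obtaining the optimal policy $\pi^\star_\phi$, the optimal sub-policy for any $q_i$ can be extracted by executing $\pi^\star_\phi(s_t, q_i)$ without transitioning the automata state, i.e. keeping $q_i$ fixed. The sub-policy is thus
$$\pi^\star_\phi(s_t, q_i) = \arg\max_{a_t} Q^\star(s_t, q_i, a_t) \tag{10}$$
where
$$Q^\star(s_t, q_i, a_t) = E^{\pi^\star_\phi}\Big[\mathbb{1}\big(\rho(s_{t+1}, D^{q_i}_\phi) > 0\big) + \gamma \max_{a} Q^\star(s_{t+1}, q_i, a)\Big] \tag{11}$$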
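For a tabular Q-function, extracting a sub-policy amounts to slicing the table at a fixed automaton state and taking a greedy action. The sketch below assumes a hypothetical table indexed as `Q[mdp_state, automaton_state, action]`; the sizes and values are made up for illustration.

```python
import numpy as np

def sub_policy(Q, q):
    """Greedy action for every MDP state with the automaton state held at q."""
    return np.argmax(Q[:, q, :], axis=-1)

# Hypothetical table: 4 MDP states, 2 automaton states, 3 actions.
Q = np.zeros((4, 2, 3))
Q[:, 0, 2] = 1.0  # in automaton state 0, action 2 looks best everywhere
Q[:, 1, 0] = 1.0  # in automaton state 1, action 0 looks best everywhere
policy_q0 = sub_policy(Q, 0)
policy_q1 = sub_policy(Q, 1)
```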
5 Automata Guided Skill Composition
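In this section, we provide a solution for Problem 2 by constructing the FSA of $\phi_\wedge$ from that of $\phi_1$ and $\phi_2$ and using $\phi_\wedge$ to synthesize the policy for the combined skill. We start with the following definition.
Definition 4.
Given $\mathcal{A}_{\phi_1} = \langle \mathbb{Q}_{\phi_1}, \Psi_{\phi_1}, q_{1,0}, p_{\phi_1}, \mathcal{F}_{\phi_1} \rangle$ and $\mathcal{A}_{\phi_2} = \langle \mathbb{Q}_{\phi_2}, \Psi_{\phi_2}, q_{2,0}, p_{\phi_2}, \mathcal{F}_{\phi_2} \rangle$ corresponding to formulas $\phi_1$ and $\phi_2$, the FSA of $\phi_\wedge = \phi_1 \wedge \phi_2$ is the product automaton of $\mathcal{A}_{\phi_1}$ and $\mathcal{A}_{\phi_2}$, i.e. $\mathcal{A}_{\phi_\wedge} = \mathcal{A}_{\phi_1} \times \mathcal{A}_{\phi_2} = \langle \mathbb{Q}_{\phi_\wedge}, \Psi_{\phi_\wedge}, q_0, p_{\phi_\wedge}, \mathcal{F}_{\phi_\wedge} \rangle$ where $\mathbb{Q}_{\phi_\wedge} \subseteq \mathbb{Q}_{\phi_1} \times \mathbb{Q}_{\phi_2}$ is the set of product automaton states, $q_0 = (q_{1,0}, q_{2,0})$ is the product initial state, and $\mathcal{F}_{\phi_\wedge} \subseteq \mathcal{F}_{\phi_1} \times \mathcal{F}_{\phi_2}$ are the final accepting states (details can be found in pro (2011)). Following Definition 2, for states $q = (q_1, q_2) \in \mathbb{Q}_{\phi_\wedge}$ and $q' = (q_1', q_2') \in \mathbb{Q}_{\phi_\wedge}$, the transition probability $p_{\phi_\wedge}$ is defined as
$$p_{\phi_\wedge}(q'|q) = \begin{cases} 1 & p_{\phi_1}(q_1'|q_1)\, p_{\phi_2}(q_2'|q_2) = 1 \\ 0 & \text{otherwise} \end{cases} \tag{12}$$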
An example of product automaton is provided in Section C.2 in the supplementary material.
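The product construction can be sketched in a few lines. The dictionary layout below is a hypothetical representation (edges keyed by source and target state, with the guard stored as a label and self-loops listed explicitly), and taking the accepting set as pairs of accepting states is an assumption that corresponds to conjunction. Each product edge carries the pair of component guards, read as their conjunction.

```python
from itertools import product

def product_fsa(fsa1, fsa2):
    """Product automaton: an edge exists only if both components have one."""
    edges = {}
    pairs = product(fsa1["edges"].items(), fsa2["edges"].items())
    for ((q1, p1), g1), ((q2, p2), g2) in pairs:
        # Guard of the product edge is the conjunction of the two guards.
        edges[((q1, q2), (p1, p2))] = (g1, g2)
    return {
        "states": set(product(fsa1["states"], fsa2["states"])),
        "q0": (fsa1["q0"], fsa2["q0"]),
        "final": set(product(fsa1["final"], fsa2["final"])),
        "edges": edges,
    }

# Two assumed single-goal automata: "eventually a" and "eventually b".
fsa_a = {"states": {"a0", "af"}, "q0": "a0", "final": {"af"},
         "edges": {("a0", "a0"): "!a", ("a0", "af"): "a", ("af", "af"): "true"}}
fsa_b = {"states": {"b0", "bf"}, "q0": "b0", "final": {"bf"},
         "edges": {("b0", "b0"): "!b", ("b0", "bf"): "b", ("bf", "bf"): "true"}}
prod = product_fsa(fsa_a, fsa_b)
```

In this toy case the product has four states, and the edge from the initial pair straight to the accepting pair is guarded by both goals holding at once.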
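For $q = (q_1, q_2) \in \mathbb{Q}_{\phi_\wedge}$, let $\Psi_q$, $\Psi_{q_1}$ and $\Psi_{q_2}$ denote the set of predicates guarding the edges originating from $q$, $q_1$ and $q_2$ respectively. Equation (12) entails that a transition at $q$ in the product automaton $\mathcal{A}_{\phi_\wedge}$ exists only if corresponding transitions at $q_1$, $q_2$ exist in $\mathcal{A}_{\phi_1}$ and $\mathcal{A}_{\phi_2}$ respectively. Therefore, $\psi_{q,q'} = \psi_{q_1,q_1'} \wedge \psi_{q_2,q_2'}$, for $\psi_{q,q'} \in \Psi_q$, $\psi_{q_1,q_1'} \in \Psi_{q_1}$, $\psi_{q_2,q_2'} \in \Psi_{q_2}$ (here $q_i'$ is a state such that $p_{\phi_i}(q_i'|q_i) = 1$). Following Equation (11),
$$Q^\star_{\phi_\wedge}(s_t, q_t, a_t) = E^{\pi^\star_{\phi_\wedge}}\Big[\mathbb{1}\big(\rho(s_{t+1}, D^{q_t^1}_{\phi_1} \vee D^{q_t^2}_{\phi_2}) > 0\big) + \gamma \max_{a} Q^\star_{\phi_\wedge}(s_{t+1}, q_t, a)\Big], \quad D^{q_t^i}_{\phi_i} = \bigvee_{q_t'^{\,i} \in \Omega_{q_t^i}} \psi_{q_t^i, q_t'^{\,i}} \tag{13}$$
Here $q_t^i$ is the FSA state of $\mathcal{A}_{\phi_i}$ at time $t$. $q_t'^{\,i} \in \Omega_{q_t^i}$ are FSA states that are connected to $q_t^i$ through an outgoing edge. It can be shown that
$$Q^\star_{\phi_\wedge}(s_t, q_t, a_t) = Q^\star_{\phi_1}(s_t, q_t^1, a_t) + Q^\star_{\phi_2}(s_t, q_t^2, a_t) - Q^\star_{\wedge}(s_t, q_t, a_t) \tag{14}$$
where
$$Q^\star_{\phi_i}(s_t, q_t^i, a_t) = E^{\pi^\star_{\phi_\wedge}}\Big[\mathbb{1}\big(\rho(s_{t+1}, D^{q_t^i}_{\phi_i}) > 0\big) + \gamma \max_{a} Q^\star_{\phi_i}(s_{t+1}, q_t^i, a)\Big], \quad i = 1, 2 \tag{15}$$
$$Q^\star_{\wedge}(s_t, q_t, a_t) = E^{\pi^\star_{\phi_\wedge}}\Big[\mathbb{1}\big(\rho(s_{t+1}, D^{q_t^1}_{\phi_1} \wedge D^{q_t^2}_{\phi_2}) > 0\big) + \gamma \max_{a} Q^\star_{\wedge}(s_{t+1}, q_t, a)\Big] \tag{16}$$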
We provide the derivation in Section B in the supplementary material.
Equation (15) takes a similar form to Equation (11). Since we have already learned the Q-functions of the two component policies, and the term in Equation (16) is nonzero only when there are states where guards leaving the current automaton states of both component FSAs are true simultaneously, we should obtain a good initialization of the composed Q-function by adding the two learned Q-functions (a similar technique is adopted by Haarnoja et al. (2018)). This addition of local Q-functions is in fact an optimistic estimation of the global Q-function; the properties of such Q-decomposition methods are studied by Russell and Zimdars (2003).
Here we propose an algorithm to obtain the optimal composed Q-function given the already learned component Q-functions and the data collected while training them.
The Q-functions in Algorithm 1 can be grid representations or parametrized functions. The update function takes in a Q-function, the product FSA, the stored replay buffer and a reward, and performs an off-policy Q update. If the initial state distribution remains unchanged, Algorithm 1 should provide a decent estimate of the composed Q-function without needing to further interact with the environment. The intuition is that the experience collected from training the two component policies should have well explored the regions in state space that satisfy each individual specification, and hence also the regions that satisfy their conjunction. Having obtained the composed Q-function, a greedy policy can be extracted in similar ways to DQN (Mnih et al. (2015)) for discrete actions or DPG (Silver et al. (2014)) for continuous actions. Details of Algorithm 1 are provided in Section D.5 in the supplementary material.
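The initialization idea described above (adding the two learned Q-functions) can be sketched for tabular Q-functions as follows. This covers only that additive initialization, not the compensation or fine-tuning steps of Algorithm 1. The table layout `Q[mdp_state, automaton_state, action]`, the function name, and the random placeholder values are all assumptions for illustration.

```python
import numpy as np

def init_composed_q(Q1, Q2):
    """Q_and[s, q1, q2, a] = Q1[s, q1, a] + Q2[s, q2, a] via broadcasting."""
    return Q1[:, :, None, :] + Q2[:, None, :, :]

# Placeholder tables: 6 MDP states, 4 actions, 3 and 2 automaton states.
rng = np.random.default_rng(0)
Q1 = rng.random((6, 3, 4))
Q2 = rng.random((6, 2, 4))

Q_and = init_composed_q(Q1, Q2)
# Greedy composed policy: one action per (MDP state, product automaton state).
greedy = np.argmax(Q_and, axis=-1)
```

Indexing the result by the pair (q1, q2) mirrors the product automaton, whose states are pairs of component states.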
6 Case Studies
We evaluate the proposed methods in two types of environments. The first is a grid world environment that aims to illustrate the inner workings of our method. The second is a kitchen environment simulated in AI2Thor (Kolve et al. (2017)).
6.1 Grid World
Consider an agent that navigates in a grid world. Its MDP state is the pair $(x, y)$ of its integer coordinates on the grid. The action space is [up, down, left, right, stay]. The transition is such that for each action command, the agent follows that command with probability 0.8 or chooses a random action with probability 0.2. We train the agent on two tasks, $\phi_1$ and $\phi_2$. In English, $\phi_1$ expresses the requirement that within the horizon of the task, two regions need to be reached at least once. The regions are defined by predicates over the coordinates. Because the coordinates are integers, these predicates define point goals rather than regions. $\phi_2$ expresses a similar task for a different region. Figure 1 shows the FSA for each task.
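The transition model described above can be sketched as below. The grid size, the clipping at the walls, and the choice to draw the random action uniformly from all five actions are assumptions, since the text does not specify them; `grid_step` and `MOVES` are hypothetical names.

```python
import random

MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0),
         "right": (1, 0), "stay": (0, 0)}

def grid_step(state, action, size=8, slip=0.2, rng=random):
    """Follow the command with probability 1 - slip, else act at random."""
    if rng.random() < slip:
        action = rng.choice(list(MOVES))  # assumed uniform over all actions
    dx, dy = MOVES[action]
    # Assumed wall behavior: moves that would leave the grid are clipped.
    x = min(max(state[0] + dx, 0), size - 1)
    y = min(max(state[1] + dy, 0), size - 1)
    return (x, y)

next_state = grid_step((3, 3), "up")
```

Setting `slip=0.0` gives deterministic dynamics, which is convenient when checking a learned policy by hand.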
We apply standard tabular Q-learning (Watkins (1989)) on the FSA augmented MDP of this environment. For all experiments, we use a discount factor of 0.95, a learning rate of 0.1, an episode horizon of 200 steps, a random exploration policy and a total of 2000 update steps, which is enough to reach convergence.
Figure 1 (a) and (b) show the learned optimal policies, extracted greedily from the learned Q-function as $\pi^\star_\phi(x, y, q) = \arg\max_a Q^\star(x, y, q, a)$. We plot the policy for each automaton state $q_i$ and observe that each represents a sub-policy whose goal is given by Equation (8). The FSA effectively acts as a meta-policy. We are able to obtain such a meaningful hierarchy without having to explicitly incorporate it in the learning process.
Figure 1 (c) shows the composed FSA and policy using Algorithm 1. Prior to composition, we normalized the Q-functions by dividing each by its max value to put them in the same range. This is possible because the Q values of both policies have the same meaning (expected discounted edge distance to the final state on the FSA). In this case the initialization step (step 2) is sufficient to obtain the optimal composed policy without further updating. The reason is that there are no overlaps between the goal regions, so the term in Equation (16) is zero for all states and actions, which renders steps 3, 4, 5 unnecessary. We found that step 6 in Algorithm 1 is also not necessary here.
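The normalization mentioned above is a one-line operation for a tabular Q-function. The sketch below uses a hypothetical helper name and assumes the table has a strictly positive maximum, which holds when rewards are nonnegative and at least one reward has been observed.

```python
import numpy as np

def normalize_q(Q):
    """Scale a Q table so that its largest entry equals 1."""
    q_max = np.max(Q)
    if q_max <= 0:
        raise ValueError("normalization assumes a strictly positive maximum")
    return Q / q_max

Q_raw = np.array([[1.0, 2.0], [4.0, 0.0]])
Q_norm = normalize_q(Q_raw)
```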
6.2 AI2Thor
In this section, we apply the proposed methods in a simulated kitchen environment. The goal is to find a user-defined object (e.g. an apple) and place it in a user-defined receptacle (e.g. the fridge). Our main focus for this experiment is to learn a high-level decision-making policy and therefore we assume that the agent can navigate to any desired location.
There are a total of 17 pickupable objects and 39 receptacle objects which we index from 0 to 55. Our state space depends on these objects and their properties/states. We have a set of 62 discrete actions {pick, put, open, close, look up, look down, navigate(id)} where id can take values from 0 to 55. Detailed descriptions of the environment, state and action spaces are provided in Sections D.1 , D.2 and D.3 of the supplementary material.
We start with a relatively easy task of "find and pick up the apple and put it in the fridge" (which we refer to as task 1) and extend it to "find and pick up any user-defined object and put it in any user-defined receptacle" (which we refer to as task 2). For each task, we learn with three specifications with increasing prior knowledge encoded in the scTLTL formula. Each specification is identified by its task number and its specification number; the higher the specification number, the more prior knowledge is encoded. We also explore the combination of the intrinsic reward defined in the FSA augmented MDP with a heuristic penalty. Here we penalize the agent for each failed action and label the learning trials with penalty accordingly. To evaluate automata guided skill composition, we combine task 1 and task 2 and require the composed policy to accomplish both tasks during an episode (we refer to this task as the composition task). Details on the specifications are provided in Section D.4 of the supplementary material.
We use a feed forward neural network as the policy and DDQN (Van Hasselt et al. (2016)) with prioritized experience replay (Schaul et al. (2015)) as the learning algorithm. We found that adaptively normalizing the Q function with methods proposed in (van Hasselt et al. (2016)) helps accelerate task composition. Algorithm details are provided in Section D.5 of the supplementary material. For each task, we evaluate the learned policy at various points along the training process by running the policy without exploration for 50 episodes with random initialization. Performance is evaluated by the average task success rate and episode length (if the agent can quickly accomplish the task). We also include the action success rate (if the agent learns not to execute actions that will fail) during training as a performance metric.
Figure 2(a) shows the FSA of specification , and Figure 2(b) illustrates the agent’s first person view at states where transition on the FSA occurs. Note that navigating from the sink (with apple picked up) to the fridge does not invoke progress on the FSA because such instruction is not encoded in the specification. Figure 2(c) shows the learning performance of task 1. We can see that the more elaborate the specification, the higher the task success rate which is as expected ( and fail to learn the task due to sparse reward). It can also be observed that the action penalty helps facilitate the agent to avoid issuing failing actions and in turn reduces the steps necessary to complete the task.
Figure 2(d) shows the results for task 2. Most of the conclusions from task 1 persist. The success rate for task 2 is lower due to the added complexity of the task. The mean episode length is significantly larger than in task 1. This is because the object the agent is required to find is often initialized inside receptacles, so the agent needs to first find the object and then proceed to completing the task. This process is not encoded in the specification and hence relies solely on exploration. An important observation here is that learning with action penalty significantly improves the task success rate. The reason is that completing task 2 may require a large number of steps when the object is hidden in receptacles, and the agent will not have enough time if the action failure rate is high.
Figure 2(e) shows the performance of automata guided skill composition. Here we present results of progressively running Algorithm 1. The figure distinguishes three variants: running only the initialization step (step 2 in the algorithm), running the initialization and compensation steps (steps 3, 4, 5), and running the entire algorithm. As comparison, we also learn this task from scratch with the FSA augmented MDP. From the figures we can see that the action success rate is not affected by task complexity. Overall, the composed policy considerably outperforms the trained policy (the resultant product FSA for this task has 23 nodes and 110 edges, and is therefore expected to take longer to train). Simply running the initialization step already results in a decent policy. Incorporating the compensation step did not provide a significant improvement. This is most likely due to the lack of MDP states where the condition that makes the compensation term nonzero holds. However, running the entire algorithm improves the composed policy by a significant margin because the final step fine-tunes the policy with the true objective and stored experiences. We provide additional discussions in Section D.6 of the supplementary material.
7 Conclusion
We present a framework that integrates the flexibility of reinforcement learning with the explainability and semantic rigor of formal methods. In particular, we allow task specification in scTLTL, an expressive formal language, and construct a product MDP that possesses an intrinsic hierarchical structure. We showed that applying RL methods on the product MDP results in a hierarchical policy whose sub-policies can be easily extracted and recombined to achieve new tasks in a transparent fashion. In practice, the authors have particularly benefited from the FSA in terms of specification design and behavior prediction, in that mistakes in task expression can be identified before putting in the time and resources for training.
References
 pro [2011] Lecture notes in formal languages, automata and computation. https://www.andrew.cmu.edu/user/ko/pdfs/lecture3.pdf, 2011.
 Aksaray et al. [2016] Derya Aksaray, Austin Jones, Zhaodan Kong, Mac Schwager, and Calin Belta. Q-learning for robust satisfaction of signal temporal logic specifications. In Decision and Control (CDC), 2016 IEEE 55th Conference on, pages 6565–6570. IEEE, 2016.
 Baier and Katoen [2008] Christel Baier and Joost-Pieter Katoen. Principles of model checking. MIT press, 2008.
 Belta et al. [2017] Calin Belta, Boyan Yordanov, and Ebru Aydin Gol. Formal Methods for Discrete-Time Dynamical Systems. Springer, 2017.
 Da Silva et al. [2009] Marco Da Silva, Frédo Durand, and Jovan Popović. Linear Bellman combination for control of character animation. ACM Transactions on Graphics (TOG), 28(3):82, 2009.
 Fainekos et al. [2005] Georgios E Fainekos, Hadas Kress-Gazit, and George J Pappas. Hybrid controllers for path planning: A temporal logic approach. In Decision and Control, 2005 and 2005 European Control Conference. CDC-ECC'05. 44th IEEE Conference on, pages 4885–4890. IEEE, 2005.
 Fainekos et al. [2006] Georgios E Fainekos, Savvas G Loizou, and George J Pappas. Translating temporal logic to controller specifications. In Decision and Control, 2006 45th IEEE Conference on, pages 899–904. IEEE, 2006.
 Haarnoja et al. [2018] Tuomas Haarnoja, Vitchyr Pong, Aurick Zhou, Murtaza Dalal, Pieter Abbeel, and Sergey Levine. Composable deep reinforcement learning for robotic manipulation. arXiv preprint arXiv:1803.06773, 2018.
 Heess et al. [2016] Nicolas Heess, Greg Wayne, Yuval Tassa, Timothy Lillicrap, Martin Riedmiller, and David Silver. Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182, 2016.
 Jaderberg et al. [2016] Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z Leibo, David Silver, and Koray Kavukcuoglu. Reinforcement learning with unsupervised auxiliary tasks. arXiv preprint arXiv:1611.05397, 2016.
 Kolve et al. [2017] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.
 Kulkarni et al. [2016] Tejas D. Kulkarni, Karthik R. Narasimhan, Ardavan Saeedi, and Joshua B. Tenenbaum. Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation. arXiv preprint arXiv:1604.06057, 2016.
 Li et al. [2016] Xiao Li, Cristian-Ioan Vasile, and Calin Belta. Reinforcement learning with temporal logic rewards. arXiv preprint arXiv:1612.03471, 2016.
 Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. doi: 10.1038/nature14236.
 Oh et al. [2017] Junhyuk Oh, Satinder Singh, Honglak Lee, and Pushmeet Kohli. Zero-Shot Task Generalization with Multi-Task Deep Reinforcement Learning. arXiv preprint arXiv:1706.05064, 2017.
 Russell and Zimdars [2003] Stuart J Russell and Andrew Zimdars. Q-decomposition for reinforcement learning agents. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 656–663, 2003.
 Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.
 Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic Policy Gradient Algorithms. Proceedings of the 31st International Conference on Machine Learning (ICML-14), pages 387–395, 2014.
 Singh et al. [2004] S. Singh, A.G. Barto, and N. Chentanez. Intrinsically motivated reinforcement learning. 18th Annual Conference on Neural Information Processing Systems (NIPS), 17(2):1281–1288, 2004. ISSN 19430604. doi: 10.1109/TAMD.2010.2051031.
 Tabuada and Pappas [2004] Paulo Tabuada and George J Pappas. Linear temporal logic control of linear systems. IEEE Transactions on Automatic Control, 2004.
 Todorov [2009] Emanuel Todorov. Compositionality of optimal control laws. In Advances in Neural Information Processing Systems, pages 1856–1864, 2009.
 Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI, volume 16, pages 2094–2100, 2016.
 van Hasselt et al. [2016] Hado P van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, and David Silver. Learning values across many orders of magnitude. In Advances in Neural Information Processing Systems, pages 4287–4295, 2016.
 Vasile [2017] C Vasile. Github repository, 2017.
 Watkins [1989] Christopher John Cornish Hellaby Watkins. Learning From Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989.