Temporal logic has been developed in computer engineering as a useful formalism for formal specifications [1, 2]. A merit of temporal logics is their resemblance to natural languages, and they have been widely used in several other areas of engineering. In particular, a complicated mission or task in computer-controlled systems such as robots can be described precisely by a temporal logic specification, and many algorithms for synthesizing a controller or a planner that satisfies the specification have been proposed [3, 4, 5, 6]. Linear temporal logic (LTL) is often used as a specification language because of its rich expressiveness. It can express many important ω-regular properties such as liveness, safety, and persistence. It is known that an LTL specification can be converted into an ω-automaton such as a nondeterministic Büchi automaton or a deterministic Rabin automaton [1, 7]. In the synthesis of a control policy for the LTL specification, we model the controlled system by a transition system that abstracts its dynamics, construct a product automaton of the transition system and the ω-automaton corresponding to the LTL specification, and compute a winning strategy of a game over the product automaton.
In general, there are uncertainties in a controlled system, and we often use a Markov decision process (MDP) as a finite-state abstraction of the controlled system. In the case where the transition probabilities are unknown a priori, there are two approaches to the synthesis of the control policy. One is robust control, where we assume that the state transition probabilities lie in uncertainty sets; the other is learning from samples.
Reinforcement learning (RL) is a useful approach to learning an optimal policy from sample behaviors of the controlled system. In RL, we use a reward function that assigns a reward to each transition in the behaviors and evaluate a control policy by the return, that is, the expected (discounted) sum of the rewards along the behaviors. Thus, to apply RL to the synthesis of a control policy for the LTL specification, a key issue is how to design the reward function, which depends on the acceptance condition of the ω-automaton converted from the LTL specification. A reward function based on the acceptance condition of a Rabin automaton has been proposed and applied to a control problem where the controller optimizes a given control cost under the LTL constraint.
Recently, limit-deterministic Büchi automata (LDBAs) have attracted much attention as ω-automata corresponding to LTL specifications. RL-based approaches to the synthesis of a control policy using LDBAs have been proposed in [14, 15, 16, 17]. To deal with the acceptance condition of an LDBA, which accepts behaviors visiting all accepting sets infinitely often, the accepting frontier function was introduced in [14, 16], and a reward function is defined based on it. However, the function is memoryless; that is, it does not retain information about which accepting sets have already been visited, which is important for improving learning performance. In this letter, we propose a novel method to augment an LDBA converted from a given LTL formula. Then, we define a reward function based on the acceptance condition of the product MDP of the augmented LDBA and the controlled system. As a result, we can learn a dynamic control policy that satisfies the LTL specification.
The rest of the letter is organized as follows. Section II reviews MDPs, LTL, and automata. Section III proposes a novel RL-based method for the synthesis of a control policy. Section IV presents a numerical example for which the previous method cannot learn a control policy but the proposed one can.
II-A Markov Decision Process
A (labeled) Markov decision process (MDP) is a tuple M = (S, A, P, s_init, AP, L), where S is a finite set of states, A is a finite set of actions and A(s) ⊆ A denotes the set of possible actions at state s, P : S × A × S → [0, 1] is a transition probability, written P(s' | s, a), such that Σ_{s'∈S} P(s' | s, a) = 1 for any state s ∈ S and any action a ∈ A(s), s_init ∈ S is the initial state, AP is a finite set of atomic propositions, and L : S × A × S → 2^AP is a labeling function that assigns a set of atomic propositions to each transition (s, a, s').
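As a concrete reference point, the tuple above can be mirrored in code. The sketch below is ours, not the paper's; the class name `LabeledMDP`, the container layouts, and the two-state example are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class LabeledMDP:
    """Minimal labeled-MDP container (illustrative, not from the letter)."""
    states: set     # S
    actions: dict   # s -> set of available actions A(s)
    trans: dict     # (s, a) -> {s2: P(s2 | s, a)}
    init: object    # s_init
    aps: set        # AP
    label: dict     # (s, a, s2) -> set of atomic propositions L(s, a, s2)

    def successors(self, s, a):
        """Distribution over next states for taking a in s."""
        return self.trans[(s, a)]

# Tiny two-state example: action 'go' moves 0 -> 1 with probability 0.9.
mdp = LabeledMDP(
    states={0, 1},
    actions={0: {'go'}, 1: {'stay'}},
    trans={(0, 'go'): {1: 0.9, 0: 0.1}, (1, 'stay'): {1: 1.0}},
    init=0,
    aps={'goal'},
    label={(0, 'go', 1): {'goal'}, (0, 'go', 0): set(), (1, 'stay', 1): set()},
)
```

Note that, as in the definition, labels are attached to transitions (s, a, s') rather than to states.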
In the MDP M, an infinite path starting from a state s_0 ∈ S is defined as a sequence ρ = s_0 a_0 s_1 a_1 s_2 ... such that P(s_{i+1} | s_i, a_i) > 0 for any i ∈ N_0, where N_0 is the set of natural numbers including zero. A finite path is a finite sequence s_0 a_0 s_1 ... a_{n-1} s_n. In addition, we sometimes write ρ_{s_0} to emphasize that the path ρ starts from s_0. For a path ρ = s_0 a_0 s_1 ..., we define the corresponding labeled path L(ρ) = L(s_0, a_0, s_1) L(s_1, a_1, s_2) ... . InfPath (resp., FinPath) is defined as the set of infinite (resp., finite) paths starting from s_init in the MDP M. For each finite path ρ, last(ρ) denotes its last state.
A policy on an MDP M is defined as a mapping π : FinPath × A → [0, 1] with Σ_{a ∈ A(last(ρ))} π(ρ, a) = 1 for any ρ ∈ FinPath. A policy π is a positional policy if for any ρ, ρ' ∈ FinPath with last(ρ) = last(ρ') and any a ∈ A(last(ρ)), it holds that π(ρ, a) = π(ρ', a), and there exists a ∈ A(last(ρ)) such that π(ρ, a) = 1.
Let InfPath_π (resp., FinPath_π) be the set of infinite (resp., finite) paths starting from s_init in the MDP M under a policy π. The behavior of the MDP M under the policy π is defined on a probability space over InfPath_π with probability measure Pr_π.
A Markov chain induced by an MDP M with a positional policy π is a tuple MC_π = (S, P_π), where P_π(s' | s) = P(s' | s, a) for the action a ∈ A(s) such that π(s, a) = 1. The state set S of MC_π can be represented as a disjoint union of a set of transient states T_π and h closed irreducible sets of recurrent states R_π^1, ..., R_π^h with h ≥ 1, as S = T_π ∪ R_π^1 ∪ ... ∪ R_π^h. In the following, we say a "recurrent class" instead of a "closed irreducible set of recurrent states" for simplicity.
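For a finite chain, this decomposition can be computed by a reachability argument: a state is recurrent iff every state reachable from it can reach it back (i.e., it lies in a bottom strongly connected component). The helper below is our illustrative sketch, assuming a dict-of-dicts transition matrix.

```python
def recurrent_classes(P):
    """Split the states of a finite Markov chain into transient states and
    recurrent classes (bottom strongly connected components).
    P: dict state -> {next_state: probability}."""
    states = list(P)
    # Compute the reachability closure by fixed-point iteration.
    reach = {s: {s} for s in states}
    changed = True
    while changed:
        changed = False
        for s in states:
            new = set(reach[s])
            for t in P[s]:
                new |= reach[t]
            if new != reach[s]:
                reach[s] = new
                changed = True
    # s is recurrent iff every state reachable from s can reach s back.
    classes, recurrent = [], set()
    for s in states:
        if s in recurrent:
            continue
        if all(s in reach[t] for t in reach[s]):
            cls = frozenset(reach[s])
            classes.append(cls)
            recurrent |= cls
    transient = [s for s in states if s not in recurrent]
    return transient, classes
```

For example, in the chain 0 → {0, 1}, 1 → 2, 2 → 2, states 0 and 1 are transient and {2} is the only recurrent class.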
In an MDP M, we define a reward function R : S × A × S → R_{≥0}, where R_{≥0} is the set of nonnegative real numbers. The function R denotes the immediate scalar bounded reward received after the agent performs an action a at a state s and reaches a next state s' as a result.
For a policy π on an MDP M, any state s ∈ S, and a reward function R, we define the expected discounted reward as
V^π(s) = E^π[ Σ_{n=0}^{∞} γ^n R(S_n, A_n, S_{n+1}) | S_0 = s ],
where E^π denotes the expected value given that the agent follows the policy π from the state s, and γ ∈ [0, 1) is a discount factor. The function V^π is often referred to as a state-value function under the policy π. For any state-action pair (s, a), we define an action-value function under the policy π as
Q^π(s, a) = E^π[ Σ_{n=0}^{∞} γ^n R(S_n, A_n, S_{n+1}) | S_0 = s, A_0 = a ].
For any state s ∈ S, a policy π* is optimal if
π* ∈ arg max_{π ∈ Π^pos} V^π(s),
where Π^pos is the set of positional policies over the state set S.
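For a finite MDP with known probabilities, such an optimal value function can be approximated by standard value iteration over the action-value maximization. The helper below is a generic textbook illustration, not the synthesis algorithm of this letter; all container layouts are our assumptions.

```python
def value_iteration(states, actions, P, R, gamma=0.95, tol=1e-8):
    """Approximate the optimal state-value function
    V*(s) = max_a sum_s2 P(s2|s,a) * (R(s,a,s2) + gamma * V*(s2)).
    P[(s, a)] : dict next_state -> probability
    R[(s, a, s2)] : immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            q = [sum(p * (R[(s, a, s2)] + gamma * V[s2])
                     for s2, p in P[(s, a)].items())
                 for a in actions[s]]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best  # in-place (Gauss-Seidel style) update
        if delta < tol:
            return V
```

With a deterministic two-state example where the only reward is 1 on the transition 0 → 1, the iteration converges to V(0) = 1 and V(1) = 0 for any γ ∈ [0, 1).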
II-B Linear Temporal Logic and Automata
In our proposed method, we use linear temporal logic (LTL) formulas to describe various constraints or properties and to systematically assign corresponding rewards. LTL formulas are constructed from a set of atomic propositions, Boolean operators, and temporal operators. We use the standard notations for the Boolean operators: ⊤ (true), ¬ (negation), and ∧ (conjunction). LTL formulas over a set of atomic propositions AP are defined as
φ ::= ⊤ | α | φ_1 ∧ φ_2 | ¬φ | Xφ | φ_1 U φ_2,
where α ∈ AP, and φ, φ_1, and φ_2 are LTL formulas. Additional Boolean operators are defined as ⊥ = ¬⊤, φ_1 ∨ φ_2 = ¬(¬φ_1 ∧ ¬φ_2), and φ_1 → φ_2 = ¬φ_1 ∨ φ_2. The operators X and U are called "next" and "until", respectively. Using the operator U, we define two temporal operators: (1) eventually, Fφ = ⊤ U φ, and (2) always, Gφ = ¬F¬φ.
Let M be an MDP. For an infinite path ρ = s_0 a_0 s_1 ... of M, let ρ[i] be the i-th state of ρ, i.e., ρ[i] = s_i, and let ρ[i:] be the i-th suffix s_i a_i s_{i+1} ... .
For an LTL formula φ, an MDP M, and an infinite path ρ = s_0 a_0 s_1 ... of M, the satisfaction relation M, ρ ⊨ φ is recursively defined as follows:
M, ρ ⊨ ⊤;
M, ρ ⊨ α ⟺ α ∈ L(s_0, a_0, s_1);
M, ρ ⊨ φ_1 ∧ φ_2 ⟺ M, ρ ⊨ φ_1 and M, ρ ⊨ φ_2;
M, ρ ⊨ ¬φ ⟺ M, ρ ⊭ φ;
M, ρ ⊨ Xφ ⟺ M, ρ[1:] ⊨ φ;
M, ρ ⊨ φ_1 U φ_2 ⟺ there exists j ≥ 0 such that M, ρ[j:] ⊨ φ_2 and M, ρ[i:] ⊨ φ_1 for all 0 ≤ i < j.
The next operator X requires that φ be satisfied by the next suffix ρ[1:] of ρ. The until operator U requires that φ_1 hold true until φ_2 becomes true over the path ρ. In the following, we write ρ ⊨ φ for simplicity without referring to the MDP M.
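The recursive satisfaction relation translates directly into a checker. The sketch below uses finite-trace semantics over a sequence of labelings purely for illustration (the letter labels transitions and works with infinite paths); the tuple-based formula encoding and function names are our assumptions.

```python
def holds(formula, trace, i=0):
    """Evaluate an LTL formula at position i of a finite trace, where
    trace[i] is the set of atomic propositions holding at step i.
    Formula encoding (nested tuples):
      ('true',) | ('ap', p) | ('not', f) | ('and', f, g)
      | ('X', f) | ('U', f, g)"""
    op = formula[0]
    if op == 'true':
        return True
    if op == 'ap':
        return formula[1] in trace[i]
    if op == 'not':
        return not holds(formula[1], trace, i)
    if op == 'and':
        return holds(formula[1], trace, i) and holds(formula[2], trace, i)
    if op == 'X':  # satisfied by the next suffix
        return i + 1 < len(trace) and holds(formula[1], trace, i + 1)
    if op == 'U':  # f holds until g becomes true
        return any(holds(formula[2], trace, k) and
                   all(holds(formula[1], trace, j) for j in range(i, k))
                   for k in range(i, len(trace)))
    raise ValueError(op)

def F(f):  # eventually: true U f
    return ('U', ('true',), f)

def G(f):  # always: not F not f
    return ('not', F(('not', f)))
```

The derived operators F and G are defined exactly by the abbreviations given above.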
For any policy π, we denote by
Pr_π(s ⊨ φ) = Pr_π({ρ ∈ InfPath_π : ρ ⊨ φ})
the probability of all paths starting from s on the MDP M that satisfy an LTL formula φ under the policy π. We say that an LTL formula φ is satisfied by a positional policy π if Pr_π(s_init ⊨ φ) > 0.
Any LTL formula φ can be converted into various automata, namely finite state machines that recognize all words satisfying φ. We first define a generalized Büchi automaton and then introduce a limit-deterministic Büchi automaton.
A transition-based generalized Büchi automaton (tGBA) is a tuple B = (X, x_init, Σ, δ, F), where X is a finite set of states, x_init ∈ X is the initial state, Σ is an input alphabet, δ ⊆ X × Σ × X is a set of transitions, and F = {F_1, ..., F_n} is an acceptance condition, where for each j ∈ {1, ..., n}, F_j ⊆ δ is a set of accepting transitions, called an accepting set.
Let Σ^ω be the set of all infinite words over Σ, and let an infinite run be an infinite sequence r = x_0 σ_0 x_1 σ_1 ... where (x_i, σ_i, x_{i+1}) ∈ δ for any i ∈ N_0. An infinite word w = σ_0 σ_1 ... ∈ Σ^ω is accepted by B if and only if there exists an infinite run r starting from x_0 = x_init such that Inf(r) ∩ F_j ≠ ∅ for each j ∈ {1, ..., n}, where Inf(r) is the set of transitions that occur infinitely often in the run r.
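For an ultimately periodic run (a finite prefix followed by a repeated cycle), the transitions occurring infinitely often are exactly those on the cycle, so the generalized Büchi condition reduces to a finite check. The helper below is our illustrative sketch, with transitions encoded as (state, letter, state) tuples; the finite prefix is irrelevant to acceptance and is therefore not an argument.

```python
def gba_accepts_lasso(cycle, accepting_sets):
    """Check the generalized Büchi condition Inf(r) ∩ F_j ≠ ∅ for all j
    on a lasso-shaped run whose repeated cycle is given as a list of
    (state, letter, state) transitions. Inf(r) is exactly set(cycle)."""
    inf_transitions = set(cycle)
    return all(bool(inf_transitions & F_j) for F_j in accepting_sets)
```

A run is rejected as soon as some accepting set shares no transition with the cycle.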
A tGBA is limit-deterministic (tLDBA) if its state set X can be partitioned into an initial part X_initial and a final part X_final, connected by a single "guess" transition, such that the final part contains all accepting sets and the transitions within each part are deterministic. It is known that, for any LTL formula φ, there exists a tLDBA that accepts all words satisfying φ. In particular, we represent a tLDBA recognizing an LTL formula φ as B_φ, whose input alphabet is given by Σ = 2^AP.
III Reinforcement-Learning-Based Synthesis of Control Policy
We introduce an automaton augmented with binary vectors. The augmented automaton can explicitly represent whether a transition in each accepting set has occurred at least once, and it is used to ensure that transitions in each accepting set occur infinitely often.
Let V = {0, 1}^n be the set of n-dimensional binary-valued vectors, and let 1_n and 0_n be the n-dimensional vectors with all elements 1 and 0, respectively. In order to augment a tLDBA B_φ = (X, x_init, Σ, δ, F), we introduce three functions: visitf : δ × V → V, reset : V → V, and the memory update given by their composition reset ∘ visitf, as follows. For any transition e ∈ δ and any v ∈ V, visitf(e, v) = v', where the j-th element v'_j of v' is 1 if e ∈ F_j or v_j = 1, and 0 otherwise. For any v ∈ V, reset(v) = 0_n if v = 1_n, and reset(v) = v otherwise.
Intuitively, each vector v represents which accepting sets have been visited. The function visitf returns a binary vector whose j-th element is 1 if and only if a transition in the accepting set F_j occurs or the element was already 1. The function reset returns the zero vector if at least one transition in each accepting set has occurred since the latest reset; otherwise, it returns the input vector unchanged.
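Under the assumption that the two bookkeeping maps behave exactly as described above, they can be sketched as follows; the function names and the tuple encoding of the memory vector are our illustrative choices.

```python
def visit_update(v, e, accepting_sets):
    """Set bit j of memory vector v when transition e lies in accepting
    set F_j; bits that are already 1 stay 1 (the visitf map above)."""
    return tuple(1 if (e in F_j or b == 1) else 0
                 for b, F_j in zip(v, accepting_sets))

def reset_memory(v):
    """Reset to the zero vector once every accepting set has been
    visited; otherwise return v unchanged (the reset map above)."""
    return tuple(0 for _ in v) if all(b == 1 for b in v) else v
```

For two accepting sets, visiting a transition of F_1 sets the vector to (1, 0); only after a transition of F_2 also occurs does the vector reach (1, 1) and get reset to (0, 0).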
For a tLDBA B_φ = (X, x_init, Σ, δ, F), its augmented automaton is a tLDBA B̄_φ = (X̄, x̄_init, Σ, δ̄, F̄), where X̄ = X × V, x̄_init = (x_init, 0_n), δ̄ is defined as δ̄ = {((x, v), σ, (x', v')) : (x, σ, x') ∈ δ, v' = reset(visitf((x, σ, x'), v))}, and F̄ = {F̄_1, ..., F̄_n} is defined as F̄_j = {((x, v), σ, (x', v')) ∈ δ̄ : (x, σ, x') ∈ F_j, v_j = 0} for each j ∈ {1, ..., n}, where v_j is the j-th element of v.
Given an augmented tLDBA B̄_φ = (X̄, x̄_init, 2^AP, δ̄, F̄) and an MDP M, a tuple M ⊗ B̄_φ = (S^⊗, A, A^⊗, P^⊗, s^⊗_init, δ^⊗, F^⊗) is a product MDP, where S^⊗ = S × X̄ is the finite set of states, A is the finite set of actions, A^⊗ is the mapping defined as A^⊗((s, x̄)) = A(s), s^⊗_init = (s_init, x̄_init) is the initial state, P^⊗ is the transition probability defined as
P^⊗((s', x̄') | (s, x̄), a) = P(s' | s, a) if (x̄, L(s, a, s'), x̄') ∈ δ̄, and 0 otherwise,
δ^⊗ = {((s, x̄), a, (s', x̄')) : P^⊗((s', x̄') | (s, x̄), a) > 0} is the set of transitions, and F^⊗ = {F^⊗_1, ..., F^⊗_n} is the acceptance condition, where F^⊗_j = {((s, x̄), a, (s', x̄')) ∈ δ^⊗ : (x̄, L(s, a, s'), x̄') ∈ F̄_j} for each j ∈ {1, ..., n}.
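One step of the product construction can be sketched as follows, assuming dictionary encodings for the MDP transition probabilities and labels and for the deterministic transition function of the augmented automaton; all names and container layouts here are illustrative, not the paper's notation.

```python
def product_step(x, action, trans, label, delta, accepting_sets):
    """One step of the product MDP: x = (MDP state s, automaton state q,
    memory vector v). The automaton reads the label of the sampled MDP
    transition; bit j of v is set when a transition of accepting set j
    fires, and v resets to all-zeros once every bit is set.
    trans[(s, a)] : dict s2 -> probability
    label[(s, a, s2)] : input letter read by the automaton
    delta[(q, letter)] : deterministic automaton successor."""
    s, q, v = x
    out = {}
    for s2, p in trans[(s, action)].items():
        sigma = label[(s, action, s2)]
        q2 = delta[(q, sigma)]
        e = (q, sigma, q2)
        # visit update followed by reset, as in the augmentation
        v2 = tuple(1 if (e in F_j or b == 1) else 0
                   for b, F_j in zip(v, accepting_sets))
        if all(b == 1 for b in v2):
            v2 = tuple(0 for _ in v2)
        out[(s2, q2, v2)] = p
    return out
```

The returned dictionary is the distribution over product states reachable in one step, inheriting its probabilities from the underlying MDP.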
The reward function R : S^⊗ × A × S^⊗ → R_{≥0} is defined as
R(s^⊗, a, s^⊗') = r_p if (s^⊗, a, s^⊗') ∈ F^⊗_j for some j ∈ {1, ..., n}, and 0 otherwise,
where r_p is a positive value.
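A literal reading of this sparse reward gives a one-line helper; the default value of r_p and the set-based container are our assumptions.

```python
def reward(e, product_accepting_sets, rp=1.0):
    """Positive reward rp for any product transition e lying in some
    accepting set of the product MDP, and 0 otherwise (rp > 0)."""
    return rp if any(e in F_j for F_j in product_accepting_sets) else 0.0
```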
Using the product MDP M ⊗ B̄_φ and the reward function R, which is based on the acceptance condition of M ⊗ B̄_φ, we show that if there exists a positional policy satisfying the LTL specification φ, then maximizing the expected discounted reward produces a policy satisfying φ.
For a Markov chain MC_π induced by a product MDP M ⊗ B̄_φ with a positional policy π, let S^⊗ = T_π ∪ R^1_π ∪ ... ∪ R^h_π be the decomposition of its state set, where T_π is the set of transient states and R^i_π is the i-th recurrent class, and let R(MC_π) be the set of all recurrent classes in MC_π. Let δ^i_π be the set of transitions in a recurrent class R^i_π, namely δ^i_π = {(s^⊗, a, s^⊗') ∈ δ^⊗ : s^⊗, s^⊗' ∈ R^i_π, π(s^⊗, a) = 1}, and let P^i_π be the transition probability under π restricted to R^i_π.
Lemma 1: For any positional policy π and any recurrent class R^i_π in the Markov chain MC_π, R^i_π satisfies one of the following conditions:
1) δ^i_π ∩ F^⊗_j ≠ ∅ for any j ∈ {1, ..., n};
2) δ^i_π ∩ F^⊗_j = ∅ for any j ∈ {1, ..., n}.
Proof: Suppose that R^i_π satisfies neither condition 1 nor condition 2. Then, there exist j, k ∈ {1, ..., n} with j ≠ k such that δ^i_π ∩ F^⊗_j ≠ ∅ and δ^i_π ∩ F^⊗_k = ∅. In other words, there exists a nonempty proper subset J of {1, ..., n} such that δ^i_π ∩ F^⊗_j ≠ ∅ for any j ∈ J. For any transition e ∈ δ^i_π, the following equation holds by the properties of the recurrent states in R^i_π:
Σ_{m=1}^{∞} f^{(m)}(e) = 1,   (1)
where f^{(m)}(e) is the probability that the transition e occurs again m time steps after the occurrence of e itself. Eq. (1) means that every transition in δ^i_π occurs infinitely often with probability 1, so the agent obtains a reward infinitely often. However, a transition in F^⊗_j requires v_j = 0, and the vector v is reset only after all accepting sets have been visited, which cannot happen in R^i_π since δ^i_π ∩ F^⊗_k = ∅. This contradicts the definition of the acceptance condition of the product MDP M ⊗ B̄_φ.
Lemma 1 implies that, for an LTL formula φ, if a path under a policy does not satisfy φ, then the agent obtains no reward in the recurrent classes; otherwise, there exists at least one recurrent class in which the agent obtains rewards infinitely often.
Theorem 1: Let M ⊗ B̄_φ be the product MDP corresponding to an MDP M and an LTL formula φ. If there exists a positional policy satisfying φ, then there exists a discount factor γ* such that any algorithm that maximizes the expected discounted reward with γ ∈ (γ*, 1) will find a positional policy satisfying φ.
Proof: Suppose that π* is an optimal policy but does not satisfy the LTL formula φ. Then, for any recurrent class R^i_{π*} in the Markov chain MC_{π*} and any accepting set F^⊗_j of the product MDP, δ^i_{π*} ∩ F^⊗_j = ∅ holds by Lemma 1. Thus, the agent under the policy π* can obtain rewards only in the set of transient states T_{π*}. We consider the best scenario under this assumption. Let p_m(s^⊗, s^⊗') be the probability of going to a state s^⊗' in m time steps after leaving the state s^⊗, and let S_R be the set of states in recurrent classes that can be reached from states in T_{π*} by one action. For the initial state s^⊗_init in the set of transient states, it holds that
V^{π*}(s^⊗_init) ≤ r_p Σ_{s^⊗ ∈ T_{π*}} Σ_{m=0}^{∞} γ^m p_m(s^⊗_init, s^⊗),
where the action at each state is selected by π*. By the property of the transient states, for any state s^⊗ in T_{π*}, there exists a bounded positive value c_{s^⊗} such that Σ_{m=0}^{∞} p_m(s^⊗_init, s^⊗) < c_{s^⊗}. Therefore, there exists a bounded positive value c such that V^{π*}(s^⊗_init) < c r_p. Let π̄ be a positional policy satisfying φ. We consider the following two cases.
Case 1: Assume that the initial state s^⊗_init is in a recurrent class R^i_{π̄} for some i. For any accepting set F^⊗_j, δ^i_{π̄} ∩ F^⊗_j ≠ ∅ holds by the definition of π̄. The expected discounted reward for s^⊗_init is given by
V^{π̄}(s^⊗_init) = E^{π̄}[ Σ_{n=0}^{∞} γ^n R(S_n, A_n, S_{n+1}) | S_0 = s^⊗_init ],
where the action at each state is selected by π̄. Since s^⊗_init is in R^i_{π̄}, the agent obtains the reward r_p with positive probability infinitely often. We consider the worst scenario in this case; all states in R^i_{π̄} are positive recurrent because R^i_{π̄} is finite. By the Chapman-Kolmogorov equation, the m-step transition probabilities within R^i_{π̄} are positive for all sufficiently large m, and by the properties of irreducibility and positive recurrence they converge to positive stationary probabilities. Hence, there exist m_0 ∈ N and ε > 0 such that the probability of obtaining the reward r_p at step m is at least ε for any m ≥ m_0, and we have
V^{π̄}(s^⊗_init) ≥ ε r_p Σ_{m=m_0}^{∞} γ^m = ε r_p γ^{m_0} / (1 − γ).
Therefore, for any bounded positive value c and any r_p > 0, there exists γ_1 < 1 such that γ > γ_1 implies
V^{π̄}(s^⊗_init) > c r_p.
Case 2: Assume that the initial state s^⊗_init is in the set of transient states T_{π̄}. Pr_{π̄}(s^⊗_init ⊨ φ) > 0 holds by the definition of π̄. For a recurrent class R^i_{π̄} such that δ^i_{π̄} ∩ F^⊗_j ≠ ∅ for each accepting set F^⊗_j, there exist a number m̄ ∈ N, a state s̄^⊗ in R^i_{π̄}, and a subset of transient states {s^⊗_1, ..., s^⊗_{m̄−1}} ⊆ T_{π̄} such that the agent moves from s^⊗_init through s^⊗_1, ..., s^⊗_{m̄−1} to s̄^⊗ with positive probability under π̄, by the property of transient states. Hence, the agent reaches the recurrent class R^i_{π̄} from s^⊗_init in m̄ steps with some positive probability p̄. Thus, by ignoring rewards obtained in T_{π̄}, we have
V^{π̄}(s^⊗_init) ≥ p̄ γ^{m̄} V^{π̄}(s̄^⊗),
where p̄ > 0 is a constant and V^{π̄}(s̄^⊗) is bounded below as in Case 1. Therefore, for any bounded positive value c and any r_p > 0, there exists γ_2 < 1 such that γ > γ_2 implies V^{π̄}(s^⊗_init) > c r_p.
These results contradict the optimality assumption of π*.
In this section, we evaluate our proposed method and compare it with an existing work. We consider a path planning problem of a robot in an environment consisting of eight rooms and one corridor, as shown in Fig. 1. A state in the corridor is the initial state, and each action corresponds to attempting to move to an adjacent state. If the robot is in the corridor, it moves in the intended direction with probability 0.9 and stays in the same state with probability 0.1. In the states other than the corridor, it moves in the intended direction with probability 0.9 and moves in the opposite direction to the one it intended with probability 0.1. If the robot tries to go outside the environment, it stays in the same state. The labeling function is as follows.
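The stochastic motion model (probability 0.9 of the intended move, 0.1 of slipping the opposite way, staying put at the boundary) can be simulated on a 1-D chain of states standing in for the corridor; the actual room layout of Fig. 1 is not reproduced here, and all parameter names are our illustrative choices.

```python
import random

def step(s, direction, n=9, p_success=0.9, rng=None):
    """One stochastic move on a 1-D chain of n states (a stand-in for the
    corridor in Fig. 1). With probability p_success the robot moves in the
    intended direction (+1 or -1); otherwise it moves the opposite way.
    Moves that would leave the chain keep the robot in place."""
    rng = rng or random.Random(0)  # deterministic default seed for the demo
    d = direction if rng.random() < p_success else -direction
    s2 = s + d
    return s2 if 0 <= s2 < n else s
```

Sampling many steps from the same state empirically recovers the 0.9/0.1 split of the motion model.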