Reinforcement learning (RL) [Sutton] is a useful approach to learning an optimal policy from sample behaviors of a controlled system with inherent stochasticity, e.g., a Markov decision process (MDP) [Puterman]
, when the probabilities associated with the controlled system are unknown a priori. In RL, we use a reward function that assigns a reward to each transition in the behaviors and evaluate a control policy by the return, namely an expected (discounted) sum of the rewards along the behaviors. One of the recent trends is to apply RL to synthesis problems under linear temporal logic (LTL) constraints. LTL is a formal language with rich expressivity and thus suitable for describing complex missions assigned to a controlled system[BK2008, Belta2017].
It is known that any LTL formula can be converted into an -automaton with a Büchi or a Rabin acceptance condition [BK2008]. In many studies on LTL synthesis problems using RL, reward functions are formed systematically from automata corresponding to the LTL specification. This direction was first investigated by Sadigh et al. [Sadigh2014], where they defined a reward function based on the acceptance condition of a deterministic Rabin automaton [BK2008] that accepts all words satisfying the LTL constraint. Reward functions defined on specification automata were also proposed for a deep reinforcement learning method [QDYYM2019] and for an extension of LTL in collaboration with a control barrier function [Li2019].
Recently, a limit-deterministic Büchi automaton (LDBA) or a generalized one (LDGBA) is paid much attention to as an -automaton corresponding to the LTL specification [SEJK2016]. The RL-based approaches to the synthesis of a control policy using LDBAs or LDGBAs have been proposed in [HAK2019, HKAKPL2019, Hahn2019, BWZP2019]. An (LD)GBA has multiple accepting sets and accepts behaviors visiting all accepting sets infinitely often. One possible approach to generalized Büchi acceptance conditions is to convert a GBA into a corresponding BA, which has a single accepting set. The conversion, however, fixes the order of visits to accepting sets of the GBA [BK2008] and causes the sparsity of the reward, which is a critical issue in RL-based controller synthesis. Another approach to RL-based synthesis for generalized Büchi conditions is the accepting frontier function introduced in [HAK2019, HKAKPL2019], based on which the reward function is defined. However, the function is memoryless, that is, it does not provide information of accepting sets that have been visited, which is important to improve learning performance.
In this letter, we propose a novel method to augment an LDGBA converted from a given LTL formula. Then, we define a reward function based on the acceptance condition of the product MDP of the augmented LDGBA and the controlled system represented as the MDP. As a result, we relax the sparsity of rewards compared to using LDBAs. The rest of the letter is organized as follows. Section II reviews an MDP, LTL, and automata. Section III proposes a novel RL-based method for the synthesis of a control policy. Section IV presents a numerical example to show the effectiveness of our proposed method.
For sets and , denotes their concatenation. denotes the infinite concatenation of the set and denotes the finite one. is the set of nonnegative integers. is the set of nonnegative real numbers.
Ii-a Markov Decision Process
A (labeled) Markov decision process (MDP) is a tuple = , where S is a finite set of states, is a finite set of actions, is a transition probability function, is the initial state, is a finite set of atomic propositions, and is a labeling function that assigns a set of atomic propositions to each transition. Let . Note that for any state and action .
In the MDP , an infinite path starting from a state is defined as a sequence such that for any . A finite path is a finite sequence in . In addition, we sometimes represent as to emphasize that starts from . For a path , we define the corresponding labeled path . is defined as the set of infinite (resp., finite) paths starting from in the MDP . For each finite path , denotes its last state.
A policy on an MDP is defined as a mapping . A policy is a positional policy if for any and any , it holds that ; and there exists one action such that if , and for any with .
Let (resp., ) be the set of infinite (resp., finite) paths starting from in the MDP under a policy . The behavior of the MDP under a policy is defined on a probability space .
A Markov chain induced by the MDPwith a positional policy is a tuple , where , for and such that . The state set of can be represented as a disjoint union of a set of transient states and closed irreducible sets of recurrent states with , i.e., [ESS]. In the following, we say a “recurrent class” instead of a “closed irreducible set of recurrent states” for simplicity.
In the MDP , we define a reward function . The function returns the immediate reward received after the agent performs an action at a state and reaches a next state as a result.
For a policy on an MDP , any state , and a reward function , we define the expected discounted reward as
where denotes the expected value given that the agent follows the policy from the state and is a discount factor. The function is often referred to as a state-value function under the policy . For any state-action pair , we define an action-value function under the policy as follows.
For any state , a policy is optimal if
where is the set of positional policies over the state set .
Ii-B Linear Temporal Logic and Automata
In our proposed method, we use linear temporal logic (LTL) formulas to describe various constraints or properties and to systematically assign corresponding rewards. LTL formulas are constructed from a set of atomic propositions, Boolean operators, and temporal operators. We use the standard notations for the Boolean operators: (true), (negation), and (conjunction). LTL formulas over a set of atomic propositions are defined as
where , , and are LTL formulas. Additional Boolean operators are defined as , , and . The operators X and U are called “next” and “until”, respectively. Using the operator U, we define two temporal operators: (1) eventually, and (2) always, .
For an LTL formula and an infinite path of an MDP with , the satisfaction relation is recursively defined as follows.
where be the -th suffix .
The next operator X requires that is satisfied by the next state suffix of . The until operator U requires that holds true until becomes true over the path . In the following, we write for simplicity without referring to the MDP .
For any policy , the probability of all paths starting from on the MDP that satisfy an LTL formula under the policy , or the satisfaction probability under is defined as
We say that an LTL formula is satisfied by a positional policy if
Any LTL formula can be converted into various -automata, namely finite state machines that recognize all infinite words satisfying . We review a generalized Büchi automaton at the beginning, and then introduce a limit-deterministic generalized Büchi automaton [HKAKPL2019].
A transition-based generalized Büchi automaton (tGBA) is a tuple , where is a finite set of states, is the initial state, is an input alphabet including , is a set of transitions, and is an acceptance condition, where for each , is a set of accepting transitions and called an accepting set. We refer to a tGBA with one accepting set as a tBA.
An infinite sequence is called an infinite run if for any . An infinite word is accepted by if and only if there exists an infinite run starting from such that for each , where is the set of transitions that occur infinitely often in the run .
A transition-based limit-deterministic generalized Büchi automaton (tLDGBA) is a tGBA such that is partitioned into two disjoint sets and such that
An -transition enables the tLDGBA to change its state with no input. Then, -transitions reflect a single “guess” from to . Note that by the construction in [SEJK2016], the transitions in each part are deterministic except for -transitions from to . It is known that, for any LTL formula , there exists a tLDGBA that accepts all words satisfying [SEJK2016]. We refer to a tLDGBA with one accepting set as a tLDBA. In particular, we represent a tLDGBA recognizing an LTL formula as , whose input alphabet is given by .
Iii Reinforcement-Learning-Based Synthesis of Control Policy
We introduce an automaton augmented with binary-valued vectors. The automaton can ensure that transitions in each accepting set occur infinitely often.
Let be a set of binary-valued vectors, and let and be the -dimentional vectors with all elements 1 and 0, respectively. In order to augment a tLDGBA , we introduce three functions , , and as follows. For any , , where if and otherwise. For any , if and otherwise. For any , , where for any .
Each vector is called a memory vector and represents which accepting sets have been visited. The function returns a binary vector whose -th element is 1 if and only if a transition in the accepting set occurs. The function returns the zero vector if at least one transition in each accepting set has occurred after the latest reset. Otherwise, it returns the input vector without change.
For a tLDGBA , its augmented automaton is a tLDGBA , where , , , is defined as = , and is defined as for each .
The augmented tLDGBA keeps track of previous visits to the accepting sets of . Intuitively, along a run of , a memory vector is reset to when at least one transition in each accepting set of the original tLDGBA has occurred. We now show that a tLDGBA and its augmented automaton accept the same language. Let be the accepted language of an automaton with the alphabet , namely the set of all infinite words accepted by .
Let and be an arbitrary tLDGBA and its augmentation, respectively. Then, we have .
Recall that . We prove set inclusions in both directions.
Consider any . Then, there exists a run of such that and for each . For the run , we construct a sequence satisfying and for any , where and
Clearly from the construction, we have for any . Thus, is a run of starting from .
We now show that for each . Since for each , we have
for any , where for each . By the construction of , therefore, there are infinitely many indices with . Let be arbitrary nonnegative integers such that , , and for any . Then,
where is the -th element of . Hence, we have for each , which implies .
Consider any . Then, there exists a run of such that and for each , i.e.,
For the run , we construct a sequence such that and for any . It is clear that is a run of starting from . It holds by Eq. (III) that for each , which implies .
The acceptance condition of the tLDGBA is given by , where and . Practically, states in a strongly connected component that contains no accepting transitions can be merged as shown in Fig. 2.
We modify the standard definition of a product MDP to deal with -transitions in the augmented automaton.
Given an augmented tLDGBA and an MDP , a tuple is a product MDP, where is the finite set of states; is the finite set of actions such that , where is the action for the -transition to the state ; is the initial state; is the transition probability function defined as
where and ; is the set of transitions; and is the acceptance condition, where for each .
The reward function is defined as
where is a positive value.
When constructing a tBA from a tGBA, the order of visits to accepting sets of the tGBA is fixed. Consequently, the rewards based on the acceptance condition of the tBA tends to be sparse. The sparsity is critical in RL-based controller synthesis problems. The augmentation of the tGBA relaxes the sparsity since the augmented tGBA has all of the order of visits to all accepting sets of the original tGBA. For the acceptance condition of the tGBA, the size of the state space of the augmented tGBA is about times larger than the tBA constructed from the tGBA. However, the number of accepting transitions to all transitions in the augmented tGBA is much greater than the tBA. Therefore, our proposed method is expected to be better than using the tLDBA in the sense of sample efficiency.
In the following, we show that a positional policy satisfying on the product MDP is synthesized by using the reward function in Definition 10, which is based on the acceptance condition of .
For a Markov chain induced by a product MDP with a positional policy , let be the set of states in , where is the set of transient states and is the recurrent class for each , and let be the set of all recurrent states in . Let be the set of transitions in a recurrent class , namely , and let : be the transition probability function under .
For any policy and any recurrent class in the Markov chain , satisfies one of the following conditions.
Suppose that satisfies neither conditions 1 nor 2. Then, there exist a policy , , and , such that and . In other words, there exists a nonempty and proper subset such that for any . For any transition , the following equation holds by the properties of the recurrent states in [ESS].
where is the probability that the transition reoccurs after it occurs in time steps. Eq. (3) means that each transition in occurs infinitely often. However, the memory vector is never reset in by the assumption. This directly contradicts Eq. (3).
Lemma 1 implies that, for an LTL formula , if a policy does not satisfy , then the agent obtains no reward in recurrent classes; otherwise there exists at least one recurrent class where the agent obtains rewards infinitely often.
For a product MDP of an MDP and an augmented tLDGBA corresponding to a given LTL formula and the reward function given by Definition 10, if there exists a positional policy satisfying on , then there exists a discount factor such that any algorithm that maximizes the expected discounted reward with will find a positional policy satisfying on .
Suppose that is an optimal policy but does not satisfy the LTL formula . Then, for any recurrent class in the Markov chain and any accepting set of the product MDP , holds by Lemma 1. Thus, the agent under the policy can obtain rewards only in the set of transient states. We consider the best scenario in the assumption. Let be the probability of going to a state in time steps after leaving the state , and let be the set of states in recurrent classes that can be transitioned from states in by one action. For the initial state in the set of transient states, it holds that
where the action is selected by . By the property of the transient states, for any state in , there exists a bounded positive value such that [ESS]. Therefore, there exists a bounded positive value such that . Let be a positional policy satisfying . We consider the following two cases.
Assume that the initial state is in a recurrent class for some . For any accepting set , holds by the definition of . The expected discounted reward for is given by
where the action is selected by . Since is in , there exists a positive number [ESS]. We consider the worst scenario in this case. It holds that
whereas all states in are positive recurrent because [ISP]. Obviously, holds for any by the Chapman-Kolmogorov equation [ESS]. Furthermore, we have by the property of irreducibility and positive recurrence [SM]. Hence, there exists such that