# Reinforcement Learning for Temporal Logic Control Synthesis with Probabilistic Satisfaction Guarantees

Reinforcement Learning (RL) has emerged as an efficient method of choice for solving complex sequential decision-making problems in automatic control, computer science, economics, and biology. In this paper we present a model-free RL algorithm to synthesize control policies that maximize the probability of satisfying high-level control objectives given as Linear Temporal Logic (LTL) formulas. Uncertainty is considered in the workspace properties, the structure of the workspace, and the agent actions, giving rise to a Probabilistically-Labeled Markov Decision Process (PL-MDP) with unknown graph structure and stochastic behaviour, which is an even more general case than a fully unknown MDP. We first translate the LTL specification into a Limit Deterministic Büchi Automaton (LDBA), which is then used in an on-the-fly product with the PL-MDP. Thereafter, we define a synchronous reward function based on the acceptance condition of the LDBA. Finally, we show that the RL algorithm delivers a policy that maximizes the satisfaction probability asymptotically. We provide experimental results that showcase the efficiency of the proposed method.

## I Introduction

Temporal logic has been promoted as a formal task-specification language for control synthesis in Markov Decision Processes (MDPs) due to its expressive power, as it can handle a richer class of tasks than classical point-to-point navigation. Such rich specifications include safety and liveness requirements, sequential tasks, coverage, and temporal ordering of different objectives [1, 2, 3, 4, 5]. Control synthesis for MDPs under Linear Temporal Logic (LTL) specifications has also been studied in [6, 7, 8, 9, 10]. Common to these works is that, in order to synthesize policies that maximize the satisfaction probability, exact knowledge of the MDP is required. Specifically, these methods construct a product MDP by composing the MDP that captures the underlying dynamics with a Deterministic Rabin Automaton (DRA) that represents the LTL specification. Then, given the product MDP, probabilistic model checking techniques are employed to design optimal control policies [11, 12].

In this paper, we address the problem of designing optimal control policies for MDPs with unknown stochastic behaviour so that the generated traces satisfy a given LTL specification with maximum probability. Unlike previous work, uncertainty is considered both in the environment properties and in the agent actions, giving rise to a Probabilistically-Labeled MDP (PL-MDP). This model further extends MDPs to provide a way to consider dynamic and uncertain environments. In order to solve this problem, we first convert the LTL formula into a Limit Deterministic Büchi Automaton (LDBA) [13]. It is known that this construction results in an exponential-sized automaton for $\text{LTL}\setminus\text{GU}$, and it results in nearly the same size as a DRA for the rest of LTL. $\text{LTL}\setminus\text{GU}$ is a fragment of linear temporal logic with the restriction that no until operator occurs in the scope of an always operator. On the other hand, the DRAs that are typically employed in relevant work are doubly exponential in the size of the original LTL formula [14]. Furthermore, a Büchi automaton is semantically simpler than a Rabin automaton in terms of its acceptance conditions [15, 10], which makes our algorithm much easier to implement. Once the LDBA is generated from the given LTL property, we construct on-the-fly a product between the PL-MDP and the resulting LDBA and then define a synchronous reward function based on the acceptance condition of the Büchi automaton over the state-action pairs of the product. Using this algorithmic reward shaping procedure, a model-free RL algorithm is introduced, which is able to generate a policy that returns the maximum expected reward. Finally, we show that maximizing the expected accumulated reward entails the maximization of the satisfaction probability.

Related work – A model-based RL algorithm to design policies that maximize the satisfaction probability is proposed in [16, 17]. Specifically, [16] assumes that the given MDP model has unknown transition probabilities and builds a Probably Approximately Correct MDP (PAC MDP), which is composed with the DRA that expresses the LTL property. The overall goal is to calculate the finite-horizon ($T$-step) value function for each state, such that the obtained value is within an error bound from the probability of satisfying the given LTL property. The PAC MDP is generated via an RL-like algorithm, then value iteration is applied to update state values. A similar model-based solution is proposed in [18]: this also hinges on approximating the transition probabilities, which limits the precision of the policy generation process. Unlike the problem that is considered in this paper, the work in [18] is limited to policies whose traces satisfy the property with probability one. Moreover, [16, 17, 18] require learning all transition probabilities of the MDP. As a result, they need a significant amount of memory to store the learned model [19]. This specific issue is addressed in [20], which proposes an actor-critic method for LTL specifications that requires the graph structure of the MDP, but not all transition probabilities. The structure of the MDP allows for the computation of Accepting Maximum End Components (AMECs) in the product MDP, while transition probabilities are generated only when needed by a simulator. By contrast, the proposed method does not require knowledge of the structure of the MDP and does not rely on computing AMECs of a product MDP. A model-free and AMEC-free RL algorithm for LTL planning is also proposed in [21]. Nevertheless, unlike our proposed method, all these cognate contributions rely on the LTL-to-DRA conversion, and uncertainty is considered only in the agent actions, but not in the workspace properties.

In [22] and [23] safety-critical settings in RL are addressed in which the agent has to deal with a heterogeneous set of MDPs in the context of cyber-physical systems. [24] further employs DDL [25], a first-order multi-modal logic for specifying and proving properties of hybrid programs.

The first use of LDBA for LTL-constrained policy synthesis in a model-free RL setup appears in [26, 27]. Specifically, [27] proposes a hybrid neural network architecture combined with LDBAs to handle MDPs with continuous state spaces. The work in [26] has been taken up more recently by [28], which has focused on model-free aspects of the algorithm and has employed a different LDBA structure and reward, which introduce extra states in the product MDP. The authors also do not discuss the complexity of the automaton construction with respect to the size of the formula, but given that the resulting automaton is not a generalised Büchi automaton, it can be expected that the density of the automaton acceptance condition is quite low, which might result in a state-space explosion, particularly if the LTL formula is complex. As we show in the proof for the counterexample in Appendix E, the authors have indeed overlooked that our algorithm is episodic and allows the discount factor to be equal to one. Unlike [26, 27, 28], in this work we consider uncertainty in the workspace properties by employing PL-MDPs.

Summary of contributions – First, we propose a model-free RL algorithm to synthesize control policies for unknown PL-MDPs that maximize the probability of satisfying LTL specifications. Second, we define a synchronous reward function and we show that maximizing the accumulated reward maximizes the satisfaction probability. Third, we convert the LTL specification into an LDBA which, as a result, shrinks the state-space that needs to be explored compared to relevant LTL-to-DRA-based works in finite-state MDPs. Moreover, unlike previous works, our proposed method does not require computation of AMECs of a product MDP, which avoids the quadratic time complexity of such a computation in the size of the product MDP [11, 12].

## II Problem Formulation

Consider a robot that resides in a partitioned environment with a finite number of states. To capture uncertainty in both the robot motion and the workspace properties, we model the interaction of the robot with the environment as a PL-MDP, which is defined as follows.

###### Definition II.1 (Probabilistically-Labeled MDP [9])

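A PL-MDP is a tuple $\mathfrak{M} = (\mathcal{X}, x_0, \mathcal{A}, P_C, \mathcal{AP}, P_L)$, where $\mathcal{X}$ is a finite set of states; $x_0 \in \mathcal{X}$ is the initial state; $\mathcal{A}$ is a finite set of actions. With slight abuse of notation $\mathcal{A}(x)$ denotes the available actions at state $x \in \mathcal{X}$; $P_C : \mathcal{X} \times \mathcal{A} \times \mathcal{X} \to [0,1]$ is the transition probability function so that $P_C(x, a, x')$ is the transition probability from state $x$ to state $x'$ via control action $a$ and $\sum_{x' \in \mathcal{X}} P_C(x, a, x') = 1$, for all $a \in \mathcal{A}(x)$; $\mathcal{AP}$ is a set of atomic propositions; and $P_L : \mathcal{X} \times 2^{\mathcal{AP}} \to [0,1]$ specifies the associated probability. Specifically, $P_L(x, \ell)$ denotes the probability that $\ell \subseteq \mathcal{AP}$ is observed at state $x \in \mathcal{X}$, where $\sum_{\ell \in 2^{\mathcal{AP}}} P_L(x, \ell) = 1$, $\forall x \in \mathcal{X}$.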

The probabilistic map $P_L$ provides a means to model dynamic and uncertain environments. Hereafter, we assume that the PL-MDP is fully observable, i.e., at any time/stage $t$ the current state, denoted by $x_t$, and the observations in state $x_t$, denoted by $\ell_t \in 2^{\mathcal{AP}}$, are known.
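The definition above can be illustrated with a small container that stores the two probability maps and samples one step. This is only a sketch: the class name `PLMDP`, its field names, and the two-state example are assumptions made for illustration and do not come from the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class PLMDP:
    """Toy container for a probabilistically-labeled MDP (illustrative names)."""
    states: list
    actions: dict   # state -> list of available actions
    trans: dict     # (state, action) -> {next_state: probability}
    labels: dict    # state -> {frozenset of propositions: probability}

    def check(self, tol=1e-9):
        """Return True if every transition and label distribution sums to one."""
        dists = list(self.trans.values()) + list(self.labels.values())
        return all(abs(sum(d.values()) - 1.0) <= tol for d in dists)

    def step(self, x, a, rng):
        """Sample the next state, then the label observed in that state."""
        nxt = self._draw(self.trans[(x, a)], rng)
        lab = self._draw(self.labels[nxt], rng)
        return nxt, lab

    @staticmethod
    def _draw(dist, rng):
        outcomes = list(dist)
        return rng.choices(outcomes, weights=[dist[o] for o in outcomes])[0]


# Hypothetical two-state example: the proposition "target" is observed
# in x1 with probability 0.7.
mdp = PLMDP(
    states=["x0", "x1"],
    actions={"x0": ["go"], "x1": ["stay"]},
    trans={("x0", "go"): {"x1": 0.9, "x0": 0.1},
           ("x1", "stay"): {"x1": 1.0}},
    labels={"x0": {frozenset(): 1.0},
            "x1": {frozenset({"target"}): 0.7, frozenset(): 0.3}},
)
```

In the learning setting of this paper the agent would only see the samples returned by `step`; the dictionaries `trans` and `labels` play the role of the unknown environment.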

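At any stage $T \geq 0$ we define the robot's past path as $X_T = x_0 x_1 \dots x_T$, the past sequence of observed labels as $L_T = \ell_0 \ell_1 \dots \ell_T$, where $\ell_t \in 2^{\mathcal{AP}}$, and the past sequence of control actions $A_T = a_0 a_1 \dots a_{T-1}$, where $a_t \in \mathcal{A}(x_t)$. These three sequences can be composed into a complete past run, defined as $R_T = x_0 \ell_0 a_0 x_1 \ell_1 a_1 \dots x_T \ell_T$. We denote by $\mathcal{X}_T$, $\mathcal{L}_T$, and $\mathcal{R}_T$ the set of all possible sequences $X_T$, $L_T$, and $R_T$, respectively.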

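The goal of the robot is to accomplish a task expressed as an LTL formula. LTL is a formal language that comprises a set of atomic propositions $\mathcal{AP}$, the Boolean operators, i.e., conjunction $\wedge$ and negation $\neg$, and two temporal operators, next $\bigcirc$ and until $\cup$. LTL formulas over a set $\mathcal{AP}$ can be constructed based on the following grammar:

$$\phi ::= \text{true} \mid \pi \mid \phi_1 \wedge \phi_2 \mid \neg\phi \mid \bigcirc\phi \mid \phi_1 \cup \phi_2,$$

where $\pi \in \mathcal{AP}$. The other Boolean and temporal operators, e.g., always $\square$, have their standard syntax and meaning. An infinite word $\sigma$ over the alphabet $2^{\mathcal{AP}}$ is defined as an infinite sequence $\sigma = \pi_0 \pi_1 \pi_2 \dots \in (2^{\mathcal{AP}})^{\omega}$, where $\omega$ denotes infinite repetition and $\pi_k \in 2^{\mathcal{AP}}$, $\forall k \geq 0$. The language $\{\sigma \in (2^{\mathcal{AP}})^{\omega} \mid \sigma \models \phi\}$ is defined as the set of words that satisfy the LTL formula $\phi$, where $\models$ is the satisfaction relation [29].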

In what follows, we define the probability that a stationary policy for satisfies the assigned LTL specification. Specifically, a stationary policy for is defined as , where . Given a stationary policy , the probability measure , defined on the smallest -algebra over , is the unique measure defined as where denotes the probability that at time the action will be selected given the current state [11, 30]. We then define the probability of satisfying under policy as [11, 12]

$$P^{\xi}_{\mathfrak{M}}(\phi) = P^{\xi}_{\mathfrak{M}}(\mathcal{R}_{\infty} : \mathcal{L}_{\infty} \models \phi). \tag{1}$$

The problem we address in this paper is summarized as follows.

###### Problem 1

Given a PL-MDP $\mathfrak{M}$ with unknown transition probabilities, unknown label mapping, unknown underlying graph structure, and a task specification captured by an LTL formula $\phi$, synthesize a deterministic stationary control policy $\xi^*$ that maximizes the probability of satisfying $\phi$ captured in (1), i.e., $\xi^* = \arg\max_{\xi} P^{\xi}_{\mathfrak{M}}(\phi)$.¹

¹ The fact that the graph structure is unknown implies that we do not know which transition probabilities are equal to zero. As a result, relevant approaches that require the structure of the MDP, e.g., [20], cannot be applied.

## III A New Learning-for-Planning Algorithm

In this section, we first discuss how to translate the LTL formula $\phi$ into an LDBA $\mathfrak{A}$ (see Section III-A). Then, we define the product MDP $\mathfrak{P}$, constructed by composing the PL-MDP $\mathfrak{M}$ and the LDBA $\mathfrak{A}$ that expresses $\phi$ (see Section III-B). Next, we assign rewards to the product MDP transitions based on the accepting condition of the LDBA $\mathfrak{A}$. As we show later, this allows us to synthesize a policy $\mu^*$ for $\mathfrak{P}$ that maximizes the probability of satisfying the acceptance conditions of the LDBA. The projection of the obtained policy $\mu^*$ over model $\mathfrak{M}$ results in a policy $\xi^*$ that solves Problem 1 (Section III-C).

### III-A Translating LTL into an LDBA

An LTL formula $\phi$ can be translated into an automaton, namely a finite-state machine that can express the set of words that satisfy $\phi$. Conventional probabilistic model checking methods translate LTL specifications into DRAs, which are then composed with the PL-MDP, giving rise to a product MDP. Nevertheless, it is known that this conversion results, in the worst case, in automata that are doubly exponential in the size of the original LTL formula [14]. By contrast, in this paper we propose to express the given LTL property as an LDBA, which results in a much more succinct automaton [13, 15]. This is the key to the reduction of the state-space that needs to be explored; see also Section V.

Before defining the LDBA, we first need to define the Generalized Büchi Automaton (GBA).

###### Definition III.1 (Generalized Büchi Automaton [11])

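A GBA is a structure $\mathfrak{A} = (\mathcal{Q}, q_0, \Sigma, \mathcal{F}, \delta)$ where $\mathcal{Q}$ is a finite set of states, $q_0 \in \mathcal{Q}$ is the initial state, $\Sigma = 2^{\mathcal{AP}}$ is a finite alphabet, $\mathcal{F} = \{\mathcal{F}_1, \dots, \mathcal{F}_f\}$ is the set of accepting conditions where $\mathcal{F}_j \subseteq \mathcal{Q}$, $1 \leq j \leq f$, and $\delta : \mathcal{Q} \times \Sigma \to 2^{\mathcal{Q}}$ is a transition relation.

An infinite run $\theta$ of $\mathfrak{A}$ over an infinite word $\sigma = \pi_0 \pi_1 \pi_2 \dots \in \Sigma^{\omega}$, $\pi_k \in \Sigma$, is an infinite sequence of states $q_k \in \mathcal{Q}$, i.e., $\theta = q_0 q_1 \dots q_k \dots$, such that $q_{k+1} \in \delta(q_k, \pi_k)$. The infinite run $\theta$ is called accepting (and the respective word is accepted by the GBA) if $\text{Inf}(\theta) \cap \mathcal{F}_j \neq \emptyset$, $\forall j \in \{1, \dots, f\}$, where $\text{Inf}(\theta)$ is the set of states that are visited infinitely often by $\theta$.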
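The generalized Büchi acceptance condition can be checked mechanically for ultimately periodic ("lasso") runs, where a finite prefix is followed by a cycle repeated forever. The sketch below assumes that representation; the function name `accepts_lasso` and the state names are hypothetical.

```python
def accepts_lasso(prefix, cycle, accepting_sets):
    """Check generalized Buchi acceptance for the run prefix . cycle^omega.

    Only states on the (non-empty) cycle are visited infinitely often, so the
    prefix plays no role in the check; it is kept to describe the whole run.
    """
    inf_states = set(cycle)
    return all(inf_states & set(F) for F in accepting_sets)


F = [{"q1"}, {"q2"}]
print(accepts_lasso(["q0"], ["q1", "q2"], F))   # True
print(accepts_lasso(["q0", "q2"], ["q1"], F))   # False: q2 is seen only once
```

The second call shows why a single visit to an accepting set is not enough: every set in the family must intersect the states that recur forever.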

###### Definition III.2 (Limit Deterministic Büchi Automaton [13])

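A GBA $\mathfrak{A}$ is limit deterministic if $\mathcal{Q}$ can be partitioned into two disjoint sets $\mathcal{Q} = \mathcal{Q}_N \cup \mathcal{Q}_D$, so that (i) $\delta(q, \pi) \subseteq \mathcal{Q}_D$ and $|\delta(q, \pi)| = 1$, for every state $q \in \mathcal{Q}_D$ and $\pi \in \Sigma$; and (ii) for every $\mathcal{F}_j \in \mathcal{F}$, it holds that $\mathcal{F}_j \subseteq \mathcal{Q}_D$ and there are $\varepsilon$-transitions from $\mathcal{Q}_N$ to $\mathcal{Q}_D$.

An $\varepsilon$-transition allows the automaton to change its state without reading any specific input. In practice, the $\varepsilon$-transitions between $\mathcal{Q}_N$ and $\mathcal{Q}_D$ reflect the “guess” on reaching $\mathcal{Q}_D$: accordingly, if after an $\varepsilon$-transition the associated labels in the accepting set of the automaton cannot be read, or if the accepting states cannot be visited, then the guess is deemed to be wrong, and the trace is disregarded and is not accepted by the automaton. However, if the trace is accepting, then the trace will stay in $\mathcal{Q}_D$ ever after, i.e., $\mathcal{Q}_D$ is invariant.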

###### Definition III.3 (Non-accepting Sink Component)

A non-accepting sink component in an LDBA is a directed graph induced by a set of states such that (1) is strongly connected, (2) does not include all accepting sets , and (3) there exist no other strongly connected set that . We denote the union set of all non-accepting sink components as .

### III-B Product MDP

Given the PL-MDP $\mathfrak{M}$ and the LDBA $\mathfrak{A}$, we define the product MDP $\mathfrak{P}$ as follows.

###### Definition III.4 (Product MDP)

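Given a PL-MDP $\mathfrak{M}$ and an LDBA $\mathfrak{A}$, we define the product MDP $\mathfrak{P} = \mathfrak{M} \times \mathfrak{A}$ as $\mathfrak{P} = (\mathcal{S}, s_0, \mathcal{A}_{\mathfrak{P}}, P_{\mathfrak{P}}, \mathcal{F}_{\mathfrak{P}})$, where (i) $\mathcal{S} = \mathcal{X} \times 2^{\mathcal{AP}} \times \mathcal{Q}$ is the set of states, so that $s = [x, \ell, q] \in \mathcal{S}$, $x \in \mathcal{X}$, $\ell \in 2^{\mathcal{AP}}$, and $q \in \mathcal{Q}$; (ii) $s_0 = [x_0, \ell_0, q_0]$ is the initial state; (iii) $\mathcal{A}_{\mathfrak{P}}$ is the set of actions inherited from the MDP, so that $\mathcal{A}_{\mathfrak{P}}(s) = \mathcal{A}(x)$, where $s = [x, \ell, q]$; (iv) $P_{\mathfrak{P}}$ is the transition probability function, so that

$$P_{\mathfrak{P}}([x, \ell, q], a, [x', \ell', q']) = P_C(x, a, x')\, P_L(x', \ell'), \tag{2}$$

where $[x, \ell, q] \in \mathcal{S}$, $[x', \ell', q'] \in \mathcal{S}$, and $q' = \delta(q, \ell)$; (v) $\mathcal{F}_{\mathfrak{P}} = \{\mathcal{F}^{\mathfrak{P}}_1, \dots, \mathcal{F}^{\mathfrak{P}}_f\}$ is the set of accepting states, where $\mathcal{F}^{\mathfrak{P}}_j = \mathcal{X} \times 2^{\mathcal{AP}} \times \mathcal{F}_j$. In order to handle $\varepsilon$-transitions in the constructed LDBA we have to add the following modifications to the standard definition of the product MDP [15]. First, for every $\varepsilon$-transition to a state $q' \in \mathcal{Q}$ we add an action $\varepsilon_{q'}$ in the product MDP, i.e., $\varepsilon_{q'} \in \mathcal{A}_{\mathfrak{P}}(s)$. Second, the transition probabilities of $\varepsilon$-transitions are given by

$$P_{\mathfrak{P}}(s, a, s') = \begin{cases} 1, & \text{if } (x = x') \wedge (\ell = \ell') \wedge (\delta(q, \varepsilon_{q'}) = q'), \\ 0, & \text{otherwise}, \end{cases} \tag{3}$$

where $s = [x, \ell, q]$ and $s' = [x', \ell', q']$.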
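The two transition rules of the product MDP can be written as a single probability function. The sketch below is an illustration under stated assumptions, not the authors' implementation: the dictionary encodings of the MDP maps and of the automaton, the tuple `("eps", q_target)` for an ε-action, and the convention that the automaton reads the label of the current product state are all choices made here.

```python
def product_prob(s, a, s_next, P_C, P_L, delta):
    """Transition probability of the product MDP, following (2) and (3).

    P_C:   (x, a) -> {x': prob}         assumed encoding of the MDP transitions
    P_L:   x -> {label: prob}           assumed encoding of the label map
    delta: (q, label) -> q' for ordinary moves,
           (q, "eps") -> set of q' reachable by an epsilon-transition
    """
    x, l, q = s
    x2, l2, q2 = s_next
    if isinstance(a, tuple) and a[0] == "eps":
        # Rule (3): only the automaton state changes, with probability one.
        target = a[1]
        allowed = target in delta.get((q, "eps"), ())
        return 1.0 if (x == x2 and l == l2 and q2 == target and allowed) else 0.0
    # Rule (2): the automaton move is deterministic given the label read.
    if delta.get((q, l)) != q2:
        return 0.0
    return P_C.get((x, a), {}).get(x2, 0.0) * P_L.get(x2, {}).get(l2, 0.0)


# Hypothetical data for a quick check.
EMPTY, T = frozenset(), frozenset({"t"})
P_C = {("x0", "go"): {"x1": 0.9, "x0": 0.1}}
P_L = {"x0": {EMPTY: 1.0}, "x1": {T: 0.7, EMPTY: 0.3}}
delta = {("q0", EMPTY): "q0", ("q0", T): "q1", ("q0", "eps"): {"q1"}}

p = product_prob(("x0", EMPTY, "q0"), "go", ("x1", T, "q0"), P_C, P_L, delta)
```

Because each call needs only the current product state, the product never has to be built or stored as a whole, which is the on-the-fly construction the text relies on.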

Given any policy for , we define an infinite run of to be an infinite sequence of states of , i.e., , where . By definition of the accepting condition of the LDBA , an infinite run is accepting, i.e., satisfies with a non-zero probability (denoted by ), if , .

In what follows, we design a synchronous reward function based on the accepting condition of the LDBA so that maximization of the expected accumulated reward implies maximization of the satisfaction probability. Specifically, we generate a control policy that maximizes the probability of (i) reaching the states of from and (ii) the probability that each accepting set will be visited infinitely often.

### III-C Construction of the Reward Function

To synthesize a policy that maximizes the probability of satisfying $\phi$, we construct a synchronous reward function for the product MDP. The main idea is that (i) visiting a set $\mathcal{F}^{\mathfrak{P}}_j$, $1 \leq j \leq f$, yields a positive reward $r > 0$; (ii) revisiting the same set returns zero reward until all other sets $\mathcal{F}^{\mathfrak{P}}_k$, $k \neq j$, are also visited; and (iii) the rest of the transitions have zero rewards. Intuitively, this reward shaping strategy motivates the agent to visit all accepting sets of the LDBA infinitely often, as required by the acceptance condition of the LDBA; see also Section IV.

To formally present the proposed reward shaping method, we first need to introduce the accepting frontier set $\mathbb{A}$, which is initialized as the family set

$$\mathbb{A} = \{\mathcal{F}_k\}_{k=1}^{f}. \tag{4}$$

This set is updated on-the-fly every time a set $\mathcal{F}_j$ is visited as $\mathbb{A} \leftarrow A_F(q, \mathbb{A})$, where $A_F(q, \mathbb{A})$ is the accepting frontier function defined as follows.

###### Definition III.5 (Accepting Frontier Function)

Given an LDBA $\mathfrak{A}$, we define $A_F(q, \mathbb{A})$ as the accepting frontier function, which executes the following operation over any given set $\mathbb{A}$:

$$A_F(q, \mathbb{A}) = \begin{cases} \mathbb{A} \setminus \mathcal{F}_j, & (q \in \mathcal{F}_j) \wedge (\mathbb{A} \neq \mathcal{F}_j), \\ \{\mathcal{F}_k\}_{k=1}^{f} \setminus \mathcal{F}_j, & (q \in \mathcal{F}_j) \wedge (\mathbb{A} = \mathcal{F}_j). \end{cases}$$

In words, given a state $q \in \mathcal{F}_j$ and the set $\mathbb{A}$, $A_F$ outputs a set containing the elements of $\mathbb{A}$ minus those elements that are common with $\mathcal{F}_j$ (first case). However, if $\mathbb{A} = \mathcal{F}_j$, then the output is the family set of all accepting sets of $\mathfrak{A}$ minus those elements that are common with $\mathcal{F}_j$, resulting in a reset of $\mathbb{A}$ to (4) minus those elements that are common with $\mathcal{F}_j$ (second case). Intuitively, $\mathbb{A}$ always contains those accepting sets that still need to be visited at a given time, and in this sense the reward function is synchronous with the LDBA accepting condition.

Given the accepting frontier set $\mathbb{A}$, we define the following reward function

$$R(s, a) = \begin{cases} r, & \text{if } q' \in \mathbb{A},\ s' = (x', \ell', q'), \\ 0, & \text{otherwise}. \end{cases} \tag{5}$$

In (5), $s'$ is the state of the product MDP that is reached from state $s$ by taking action $a$, and $r > 0$ is an arbitrary positive reward. In this way the agent is guided to visit all accepting sets infinitely often and, consequently, satisfy the given LTL property.
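The interplay between the accepting frontier set and the reward in (5) can be traced on a short sequence of automaton states. The sketch below is a literal reading of the two cases in Definition III.5; leaving the frontier unchanged when the state belongs to no frontier set is an assumption added here, and all function names are hypothetical.

```python
def accepting_frontier(q, frontier, all_sets):
    """Update the frontier after the automaton reaches state q."""
    hit = frozenset(F for F in frontier if q in F)
    if not hit:
        return frontier            # assumption: nothing to remove, no change
    rest = frontier - hit          # first case: drop the sets just visited
    # second case: the frontier is exhausted, so reset it minus the visited sets
    return rest if rest else all_sets - hit


def reward(q_next, frontier, r=1.0):
    """Reward (5): positive only if q_next lies in a set still on the frontier."""
    return r if any(q_next in F for F in frontier) else 0.0


def run_rewards(q_sequence, accepting_sets, r=1.0):
    """Rewards collected along a sequence of automaton states."""
    all_sets = frozenset(frozenset(F) for F in accepting_sets)
    frontier = all_sets
    out = []
    for q in q_sequence:
        out.append(reward(q, frontier, r))
        frontier = accepting_frontier(q, frontier, all_sets)
    return out


rewards = run_rewards(["q1", "q1", "q2", "q2", "q1"], [{"q1"}, {"q2"}])
print(rewards)  # [1.0, 0.0, 1.0, 0.0, 1.0]
```

Revisiting `q1` before `q2` earns nothing, which is the behaviour described in item (ii) of the main idea. Note that with a single accepting set this literal reading resets to an empty frontier; the sketch does not try to handle that case.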

###### Remark III.6

The initial and accepting components of the LDBA proposed in [13] (as used in this paper) are both deterministic. By Definition III.2, the discussed LDBA is indeed a limit-deterministic automaton, however notice that the obtained determinism within its initial part is stronger than that required in the definition of LDBA. Thanks to this feature of the LDBA structure, in our proposed algorithm there is no need to “explicitly build” the product MDP and to store all its states in memory. The automaton transitions can be executed on-the-fly, as the agent reads the labels of the MDP states.

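Given $\mathfrak{P}$, we compute a stationary deterministic policy $\mu^*$ that maximizes the expected accumulated return, i.e.,

$$\mu^*(s) = \arg\max_{\mu \in \mathcal{D}} U^{\mu}(s), \tag{6}$$

where $\mathcal{D}$ is the set of all stationary deterministic policies over $\mathcal{S}$, and

$$U^{\mu}(s) = \mathbb{E}^{\mu}\left[\sum_{n=0}^{\infty} \gamma^{n} R(s_n, \mu(s_n)) \,\middle|\, s_0 = s\right], \tag{7}$$

where $\mathbb{E}^{\mu}[\cdot]$ denotes the expected value given that the product MDP follows the policy $\mu$ [30], $0 \leq \gamma \leq 1$ is the discount factor, and $s_0, \dots, s_n$ is the sequence of states generated by policy $\mu$ up to time step $n$, initialized at $s_0 = s$. Note that the optimal policy is stationary as shown in the following result.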
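As a worked illustration of (7), suppose (purely as an assumption for this example) that under some policy the run is deterministic and the reward $r$ of (5) is collected exactly once every $k$ steps, starting at step $n = 0$, with $0 < \gamma < 1$. The return then reduces to a geometric series:

```latex
U^{\mu}(s) \;=\; \sum_{m=0}^{\infty} \gamma^{mk}\, r \;=\; \frac{r}{1-\gamma^{k}},
\qquad 0 < \gamma < 1 .
```

Under this assumption a shorter period $k$ between rewarded visits gives a larger return, so the discounted objective favours policies that cycle through the accepting sets quickly.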

###### Theorem III.7 ([30])

In any finite-state MDP, such as $\mathfrak{P}$, if there exists an optimal policy, then that policy is stationary and deterministic.

In order to construct $\mu^*$, we employ episodic Q-learning (QL), a model-free RL scheme described in Algorithm 1.² Specifically, Algorithm 1 requires as inputs (i) the LDBA $\mathfrak{A}$, (ii) the reward function $R$ defined in (5), and (iii) the hyper-parameters of the learning algorithm.

² Note that any other off-the-shelf model-free RL algorithm can also be used within Algorithm 1, including any variant of the class of temporal difference learning algorithms [19].

Observe that in Algorithm 1 we use an action-value function $Q(s, a)$ to evaluate $\mu$ instead of $U^{\mu}(s)$, since the MDP is unknown. The action-value function can be initialized arbitrarily. Also, we define a function that counts the number of times that action $a$ has been taken at state $s$. The policy $\mu$ is selected to be an $\epsilon$-greedy policy, which means that with probability $1 - \epsilon$ the greedy action $\arg\max_{a} Q(s, a)$ is taken, and with probability $\epsilon$ a random action is selected. Every episode terminates when the current state of the automaton enters the set of non-accepting sink components (Definition III.3) or when the iteration number in the episode reaches a certain threshold. Note that $\mu$ asymptotically converges to the optimal greedy policy $\mu^*(s) = \arg\max_{a} Q^*(s, a)$, where $Q^*$ is the optimal $Q$ function. Further, $Q^*(s, \mu^*(s)) = U^{\mu^*}(s)$, where $U^{\mu^*}$ is the optimal value function that could have been computed via Dynamic Programming (DP) if the MDP were fully known [19, 31, 32]. Projection of $\mu^*$ onto the state-space of the PL-MDP yields the finite-memory policy $\xi^*$ that solves Problem 1.
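The ingredients described above (an action-value table, a visit counter, ε-greedy exploration, and the two termination conditions) can be put together in a generic tabular Q-learning loop. This is a minimal sketch and not the paper's Algorithm 1: the step-size schedule `1 / n(s, a)`, the default hyper-parameters, and the three-state toy environment are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def q_learning(step, actions, s0, episodes=500, horizon=20,
               gamma=0.9, eps=0.2, seed=0):
    """Episodic tabular Q-learning with epsilon-greedy exploration.

    step(s, a) -> (next_state, reward, done); `done` stands in for reaching
    a non-accepting sink, `horizon` for the per-episode iteration threshold.
    """
    rng = random.Random(seed)
    Q = defaultdict(float)   # action-value table, initialised to zero
    n = defaultdict(int)     # visit counter for state-action pairs
    for _ in range(episodes):
        s = s0
        for _ in range(horizon):
            acts = actions(s)
            if rng.random() < eps:
                a = rng.choice(acts)                      # explore
            else:
                a = max(acts, key=lambda b: Q[(s, b)])    # exploit
            s2, r, done = step(s, a)
            n[(s, a)] += 1
            alpha = 1.0 / n[(s, a)]                       # assumed step size
            target = r + gamma * max(Q[(s2, b)] for b in actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
            if done:
                break
    return Q


# Hypothetical toy problem: "right" cycles 0 -> 1 -> 0 and is rewarded on the
# way back to 0; "left" falls into a sink that ends the episode.
def toy_step(s, a):
    if a == "left":
        return "sink", 0.0, True
    return (1, 0.0, False) if s == 0 else (0, 1.0, False)


def toy_actions(s):
    return ["right", "left"]


Q = q_learning(toy_step, toy_actions, 0)
policy = {s: max(toy_actions(s), key=lambda b: Q[(s, b)]) for s in (0, 1)}
```

In the setting of this paper, `step` would execute the PL-MDP, advance the LDBA on the observed label, and return the reward computed from the accepting frontier set.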

[Algorithm 1: episodic Q-learning on the product MDP]

## IV Analysis of the Algorithm

In this section, we show that the policy generated by Algorithm 1 maximizes (1), i.e., the probability of satisfying the property $\phi$. Furthermore, we show that, unlike existing approaches, our algorithm can produce the best available policy if the property cannot be satisfied. To prove these claims, we need to show the following results. All proofs are presented in the Appendix. First, we show that the accepting frontier set $\mathbb{A}$ is time-invariant. This is needed to ensure that the LTL formula is satisfied over the product MDP by a stationary policy.

###### Proposition IV.1

For an LTL formula $\phi$ and its associated LDBA $\mathfrak{A}$, the accepting frontier set $\mathbb{A}$ is time-invariant at each state of $\mathfrak{A}$.

As stated earlier, since QL is proved to converge to the optimal Q-function [19], it can synthesize an optimal policy with respect to the given reward function. The following result shows that the optimal policy produced by Algorithm 1 satisfies the given LTL property.

###### Theorem IV.2

Assume that there exists at least one deterministic stationary policy in $\mathfrak{P}$ whose traces satisfy the property $\phi$ with positive probability. Then the traces of the optimal policy $\mu^*$ defined in (6) satisfy $\phi$ with positive probability as well.

Next we show that $\mu^*$ and subsequently its projection $\xi^*$ maximize the satisfaction probability.

###### Theorem IV.3

If an LTL property $\phi$ is satisfiable by the PL-MDP $\mathfrak{M}$, then the optimal policy $\mu^*$ that maximizes the expected accumulated reward, as defined in (6), maximizes the probability of satisfying $\phi$, defined in (1), as well.

Next, we show that if there does not exist a policy that satisfies the LTL property $\phi$, Algorithm 1 will find the policy that is the closest one to property satisfaction. To this end, we first introduce the notion of closeness to satisfaction.

###### Definition IV.4 (Closeness to Satisfaction)

Assume that two policies $\mu_1$ and $\mu_2$ do not satisfy the property $\phi$. Consequently, there are accepting sets in the automaton that have no intersection with runs of the induced Markov chains $\mathfrak{P}^{\mu_1}$ and $\mathfrak{P}^{\mu_2}$. The policy $\mu_1$ is closer to satisfying the property if runs of $\mathfrak{P}^{\mu_1}$ have more intersections with accepting sets of the automaton than runs of $\mathfrak{P}^{\mu_2}$.

###### Corollary IV.5

If there does not exist a policy in the PL-MDP $\mathfrak{M}$ that satisfies the property $\phi$, then the proposed algorithm yields a policy that is closest to satisfying $\phi$.

## V Experiments

In this section we present three case studies, implemented on MATLAB R2016a on a computer with an Intel Xeon CPU at 2.93 GHz and 4 GB RAM. In the first two experiments, the environment is represented as a discrete grid world, as illustrated in Figure 1. The third case study is an adaptation of the well-known Atari game Pacman (Figure 2), which is initialized in a configuration that is quite hard for the agent to solve.

The first case study pertains to a temporal logic planning problem in a dynamic and unknown environment with AMECs, while the second one does not admit AMECs. Note that the majority of existing algorithms fail to provide a control policy when AMECs do not exist [8, 34, 20], or result in control policies without satisfaction guarantees [18].

The LTL formula considered in the first two case studies is the following:

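$$\phi_1 = \Diamond(\text{target1}) \wedge \square\Diamond(\text{target2}) \wedge \square\Diamond(\text{user}) \wedge (\neg\text{user} \cup \text{target2}) \wedge \square(\neg\text{obs}). \tag{8}$$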

In words, this LTL formula requires the robot to (i) eventually visit target 1 (expressed as $\Diamond(\text{target1})$); (ii) visit target 2 infinitely often and take a picture of it ($\square\Diamond(\text{target2})$); (iii) visit a user infinitely often where, say, the collected pictures are uploaded (captured by $\square\Diamond(\text{user})$); (iv) avoid visiting the user until a picture of target 2 has been taken ($\neg\text{user} \cup \text{target2}$); and (v) always avoid obstacles (captured by $\square(\neg\text{obs})$).

The LTL formula (8) can be expressed as a DRA with states. On the other hand, a corresponding LDBA has states (fewer, as expected), which results in a significant reduction of the state space that needs to be explored.

The interaction of the robot with the environment is modeled by a PL-MDP with states and actions per state. The actions space is . We assume that the targets and the user are dynamic, i.e., their location in the environment varies probabilistically. Specifically, their presence in a given region is determined by the unknown function from Definition II.1 (Figure 1).

The LTL formula specifying the task for Pacman (third case study) is:

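$$\phi_2 = \Diamond[(\text{food1} \wedge \Diamond\text{food2}) \vee (\text{food2} \wedge \Diamond\text{food1})] \wedge \square(\neg\text{ghost}). \tag{9}$$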

Intuitively, the agent is tasked with (i) eventually eating food1 and then food2 (or vice versa), while (ii) avoiding any contact with the ghosts. This LTL formula corresponds to a DRA with states and to an LDBA with states. The agent can execute actions per state and if the agent hits a wall by taking an action it remains in the previous location. The ghosts dynamics are stochastic: with a probability each ghost chases the Pacman (often referred to as “chase mode”), and with its complement it executes a random action (“scatter mode”).

In the first case study, we assume that there is no uncertainty in the robot actions. In this case, it can be verified that AMECs exist. Figure 3(a) illustrates the evolution of over episodes, where denotes the -greedy policy. The optimal policy was constructed in approximately minutes. A sample path of the robot with the projection of optimal control strategy onto , i.e. policy , is given in Figure 1 (red path).

In the second case study, we assume that the robot is equipped with a noisy controller and, therefore, it can execute the desired action with probability , whereas a random action among the other available ones is taken with a probability of . In this case, it can be verified that AMECs do not exist. Intuitively, the reason why AMECs do not exist is that there is always a non-zero probability with which the robot will hit an obstacle while it travels between the access point and target and, therefore, it will violate . Figure 3(b) shows the evolution of over episodes for the -greedy policy. The optimal policy was synthesized in approximately hours.
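The noisy controller described above can be simulated with a few lines. The sketch below is an assumption-laden illustration: the success probability is a free parameter here (named `p_success`, a hypothetical name) because its value is not given in the text, and the action names are made up.

```python
import random


def noisy_action(desired, available, p_success, rng):
    """Return the action actually executed by a noisy controller.

    With probability p_success the desired action is executed; otherwise one
    of the other available actions is chosen uniformly at random.
    """
    others = [a for a in available if a != desired]
    if not others or rng.random() < p_success:
        return desired
    return rng.choice(others)


rng = random.Random(0)
moves = ["up", "down", "left", "right"]   # hypothetical action names
executed = [noisy_action("up", moves, 0.8, rng) for _ in range(1000)]
```

Because every action can be replaced by a different one with non-zero probability, a path past obstacles can never be followed with certainty, which matches the intuition given for why AMECs do not exist in this case study.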

In the third experiment, there is no uncertainty in the execution of actions, namely the motion of the Pacman agent is deterministic. Figure 3(c) shows the evolution of over 186000 episodes where denotes the -greedy policy. On the other hand, the use of standard Q-learning (without LTL guidance) would require either to construct a history-dependent reward for the PL-MDP as a proxy for the considered LTL property, which is very challenging for complex LTL formulas, or to perform exhaustive state-space search with static rewards, which is evidently quite wasteful and failed to generate an optimal policy in our experiments.

Note that given the policy for the PL-MDP, probabilistic model checkers, such as PRISM [35], or standard Dynamic Programming methods can be employed to compute the probability of satisfying . For instance, for the first case study, the synthesized policy satisfies with probability , while for the second case study, the satisfaction probability is , since AMECs do not exist. For the same reason, even if the transition probabilities of the PL-MDP are known, PRISM could not generate a policy for the second case study. Nevertheless, the proposed algorithm can synthesize the closest-to-satisfaction policy, as shown in Corollary IV.5.

## VI Conclusions

In this paper we have proposed a model-free reinforcement learning (RL) algorithm to synthesize control policies that maximize the probability of satisfying high-level control objectives captured by LTL formulas. The interaction of the agent with the environment has been captured by an unknown probabilistically-labeled Markov Decision Process (MDP). We have shown that the proposed RL algorithm produces a policy that maximizes the satisfaction probability. We have also shown that even if the assigned specification cannot be satisfied, the proposed algorithm synthesizes the best possible policy. We have provided evidence via numerical experiments on the efficiency of the proposed method.

## References

• [1] G. E. Fainekos, H. Kress-Gazit, and G. J. Pappas, “Hybrid controllers for path planning: A temporal logic approach,” in CDC and ECC, December 2005, pp. 4885–4890.
• [2] H. Kress-Gazit, G. E. Fainekos, and G. J. Pappas, “Temporal-logic-based reactive mission and motion planning,” IEEE Transactions on Robotics, vol. 25, no. 6, pp. 1370–1381, 2009.
• [3] A. Bhatia, L. E. Kavraki, and M. Y. Vardi, “Sampling-based motion planning with temporal goals,” in ICRA, 2010, pp. 2689–2696.
• [4] Y. Kantaros and M. M. Zavlanos, “Sampling-based optimal control synthesis for multi-robot systems under global temporal tasks,” IEEE Transactions on Automatic Control, 2018. [Online]. Available: DOI:10.1109/TAC.2018.2853558
• [5] ——, “Distributed intermittent connectivity control of mobile robot networks,” IEEE Transactions on Automatic Control, vol. 62, no. 7, pp. 3109–3121, 2017.
• [6] X. C. Ding, S. L. Smith, C. Belta, and D. Rus, “MDP optimal control under temporal logic constraints,” in CDC and ECC, 2011, pp. 532–538.
• [7] E. M. Wolff, U. Topcu, and R. M. Murray, “Robust control of uncertain Markov decision processes with temporal logic specifications,” in CDC, 2012, pp. 3372–3379.
• [8] X. Ding, S. L. Smith, C. Belta, and D. Rus, “Optimal control of Markov decision processes with linear temporal logic constraints,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1244–1257, 2014.
• [9] M. Guo and M. M. Zavlanos, “Probabilistic motion planning under temporal tasks and soft constraints,” IEEE Transactions on Automatic Control, 2018.
• [10] I. Tkachev, A. Mereacre, J.-P. Katoen, and A. Abate, “Quantitative model-checking of controlled discrete-time Markov processes,” Information and Computation, vol. 253, pp. 1–35, 2017.
• [11] C. Baier and J.-P. Katoen, Principles of model checking. MIT Press, 2008.
• [12] E. M. Clarke, O. Grumberg, D. Kroening, D. Peled, and H. Veith, Model Checking, 2nd ed. MIT Press, 2018.
• [13] S. Sickert, J. Esparza, S. Jaax, and J. Křetínský, “Limit-deterministic Büchi automata for linear temporal logic,” in CAV. Springer, 2016, pp. 312–332.
• [14] R. Alur and S. La Torre, “Deterministic generators and games for LTL fragments,” TOCL, vol. 5, no. 1, pp. 1–25, 2004.
• [15] S. Sickert and J. Křetínský, “MoChiBA: Probabilistic LTL model checking using limit-deterministic Büchi automata,” in ATVA. Springer, 2016, pp. 130–137.
• [16] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” in Robotics: Science and Systems X, 2014.
• [17] T. Brázdil, K. Chatterjee, M. Chmelík, V. Forejt, J. Křetínský, M. Kwiatkowska, D. Parker, and M. Ujma, “Verification of Markov decision processes using learning algorithms,” in ATVA. Springer, 2014, pp. 98–114.
• [18] D. Sadigh, E. S. Kim, S. Coogan, S. S. Sastry, and S. A. Seshia, “A learning based approach to control synthesis of Markov decision processes for linear temporal logic specifications,” in CDC. IEEE, 2014, pp. 1091–1096.
• [19] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, Cambridge, 1998, vol. 1.
• [20] J. Wang, X. Ding, M. Lahijanian, I. C. Paschalidis, and C. A. Belta, “Temporal logic motion control using actor–critic methods,” The International Journal of Robotics Research, vol. 34, no. 10, pp. 1329–1344, 2015.
• [21] Q. Gao, D. Hajinezhad, Y. Zhang, Y. Kantaros, and M. M. Zavlanos, “Reduced variance deep reinforcement learning with temporal logic specifications,” 2019 (to appear).
• [22] N. Fulton and A. Platzer, “Verifiably safe off-model reinforcement learning,” arXiv preprint arXiv:1902.05632, 2019.
• [23] N. Fulton, “Verifiably safe autonomy for cyber-physical systems,” Ph.D. dissertation, Carnegie Mellon University Pittsburgh, PA, 2018.
• [24] N. Fulton and A. Platzer, “Safe reinforcement learning via formal methods: Toward safe control through proof and learning,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
• [25] A. Platzer, “Differential dynamic logic for hybrid systems,” Journal of Automated Reasoning, vol. 41, no. 2, pp. 143–189, 2008.
• [26] M. Hasanbeig, A. Abate, and D. Kroening, “Logically-constrained reinforcement learning,” arXiv preprint arXiv:1801.08099, 2018.
• [27] ——, “Logically-constrained neural fitted Q-iteration,” in AAMAS, 2019, pp. 2012–2014.
• [28] E. M. Hahn, M. Perez, S. Schewe, F. Somenzi, A. Trivedi, and D. Wojtczak, “Omega-regular objectives in model-free reinforcement learning,” arXiv preprint arXiv:1810.00950, 2018.
• [29] A. Pnueli, “The temporal logic of programs,” in Foundations of Computer Science. IEEE, 1977, pp. 46–57.
• [30] M. L. Puterman, Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
• [31] A. Abate, M. Prandini, J. Lygeros, and S. Sastry, “Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems,” Automatica, vol. 44, no. 11, pp. 2724–2734, 2008.
• [32] A. Abate, J.-P. Katoen, J. Lygeros, and M. Prandini, “Approximate model checking of stochastic hybrid systems,” European Journal of Control, vol. 16, no. 6, pp. 624–641, 2010.
• [33]
• [34] J. Fu and U. Topcu, “Probably approximately correct MDP learning and control with temporal logic constraints,” arXiv preprint arXiv:1404.7073, 2014.
• [35] M. Kwiatkowska, G. Norman, and D. Parker, “PRISM 4.0: Verification of probabilistic real-time systems,” in CAV. Springer, 2011, pp. 585–591.
• [36] R. Durrett, Essentials of stochastic processes. Springer, 1999, vol. 1.
• [37] V. Forejt, M. Kwiatkowska, and D. Parker, “Pareto curves for probabilistic model checking,” in ATVA. Springer, 2012, pp. 317–332.
• [38] E. A. Feinberg and J. Fei, “An inequality for variances of the discounted rewards,” Journal of Applied Probability, vol. 46, no. 4, pp. 1209–1212, 2009.