Overcoming Exploration: Deep Reinforcement Learning in Complex Environments from Temporal Logic Specifications

by   Mingyu Cai, et al.

We present a Deep Reinforcement Learning (DRL) algorithm for a task-guided robot with unknown continuous-time dynamics deployed in a large-scale complex environment. Linear Temporal Logic (LTL) is applied to express a rich robotic specification. To overcome the environmental challenge, we propose a novel path planning-guided reward scheme that is dense over the state space, and crucially, robust to infeasibility of computed geometric paths due to the unknown robot dynamics. To facilitate LTL satisfaction, our approach decomposes the LTL mission into sub-tasks that are solved using distributed DRL, where the sub-tasks are trained in parallel, using Deep Policy Gradient algorithms. Our framework is shown to significantly improve performance (effectiveness, efficiency) and exploration of robots tasked with complex missions in large-scale complex environments.



1 Introduction

Deep Reinforcement Learning (DRL) employs neural networks to find optimal policies for unknown dynamic systems by maximizing long-term rewards (mnih2015human). In principle, DRL offers a method to learn such policies based on the exploration vs. exploitation trade-off (sutton2018reinforcement), but the inefficiency of the required exploration has prohibited its usage in real-world applications. To collect non-zero rewards, existing DRL algorithms for robotic tasks simply explore the environment using noisy policies. As the environment becomes complex and large-scale, such naive exploration strategies become less effective, and the problem becomes even more challenging for complex robotic tasks. Consequently, effective exploration techniques in DRL are expected to address two common challenges: (1) identifying states that are worth exploring with respect to the navigation tasks; and (2) guiding the exploration toward the satisfaction of the navigation tasks.

Related Work: Motivated by task-guided planning and control, formal languages have been shown to be efficient tools for expressing a diverse set of high-level specifications (baier2008). For unknown dynamics, temporal logic based rewards have been developed and integrated with various DRL algorithms. In particular, deep Q-learning is employed in (Icarte2018; Camacho2019; hasanbeig2019deepsynth; xu2020joint) over discrete action spaces. For continuous state-action spaces, the authors in (Li2019; vaezipoor2021ltl2action; icarte2022reward) utilize actor-critic algorithms, e.g., proximal policy optimization (PPO) (schulman2017proximal), for policy optimization, validated in robotic manipulation and Safety Gym, respectively. All aforementioned works study LTL specifications only over finite horizons. To facilitate defining LTL tasks over infinite horizons, recent works (Cai2021modular; cai2021safe) improve the results from (hasanbeig2020deep) by converting LTL into a novel automaton, i.e., the E-LDGBA, a variant of the Limit Deterministic Generalized Büchi Automaton (LDGBA) (Sickert2016). To improve performance for long-term (infinite-horizon) satisfaction, they propose a modular architecture of Deep Deterministic Policy Gradient (DDPG) (lillicrap2015continuous) that decomposes the global mission into sub-tasks. However, none of the existing works can handle large-scale, complex environments, since an LTL-based reward requires the RL-agent to visit the regions of interest toward LTL satisfaction. Consequently, such rewards are too sparse to be discovered through exploration in challenging environments.

Many prior works (schulman2015trust; lillicrap2015continuous; schulman2017proximal; lowe2017multi; haarnoja2018soft) employ noise-based exploration policies for the actor-critic method, integrated with different versions of DPGs, whereas their sampling efficiency relies mainly on the density of the specified rewards. Recent works (hester2018deep; nair2018overcoming) leverage human demonstrations to address exploration issues in robotic manipulation. On the other hand, the works in (vecerik2017leveraging; fujimoto2019off) focus on effectively utilizing the dataset stored in the replay buffer to speed up the learning process. However, these advances all assume a prior dataset is given beforehand, and they are not applicable to learning from scratch. The work in (liu2021cooperative) proposes a multi-agent exploration method based on restricting and expanding the state space, but it remains unclear whether each agent handles its own independent responsibilities.

Contribution: Intuitively, the most effective way of addressing the above challenges is to increase the density of rewards, so that more effective data is collected during learning and the portion of transitions with positive rewards in the replay buffer is dramatically increased. Motivated by this, our paper proposes an effective exploration technique based on the optimal Rapidly-exploring Random Tree (RRT*) (karaman2011sampling) sampling-based method, which is a powerful paradigm for point-to-point navigation in large-scale, complex environments. In contrast to existing RRT* works that assume controllers for following the optimal paths are available, we consider the robotic system as a black box and design novel dense RRT*-rewards such that the optimal policies for goal-reaching missions are learned by DPGs. Furthermore, the high-level robotic tasks are expressed by rich Linear Temporal Logic (LTL) (baier2008) specifications. We extend the exploration strategy to propose a distributed sampling-based learning framework that is able to find the optimal policies for LTL satisfaction, subject to environmental challenges. Our framework integrates actor-critic RL and RRT* methods, and synthesizes optimal policies over continuous state-action spaces. In summary, the main contributions of the paper are as follows:

  • To the best of our knowledge, this is the first work to bridge the gap between sampling-based planning and DPGs, to synthesize policies that handle complex and large-scale environments. In particular, an RRT*-based reward is introduced to complete the standard goal-reaching navigation task; it is dense and efficient in the configuration space for overcoming exploration.

  • Although the exploration procedure relies on the optimal geometric paths without considering the dynamics, our novel reward scheme is robust to the infeasible RRT* paths, and enables the robot to not strictly follow them. Rigorous analysis is provided for the optimal policy regarding goal-reaching satisfaction.

  • We propose a cooperative team of distributed DPGs that is trained synchronously. This architecture is integrated with the exploration guidance of the RRT*-based reward for LTL satisfaction, to address both challenging environments and complex tasks. The framework is flexible, and it is possible to generalize it to other formal languages or learning-based navigation.

  • We validate our algorithm through case studies to show its better performance compared to point-to-point DPG algorithms in the literature. We show that it achieves significant improvements in exploration with respect to the environmental challenges.

2 Preliminaries

The evolution of a continuous-time dynamic system starting from an initial state $x_0$ is given by

$\dot{x} = f(x) + g(x)u, \qquad (1)$

where $x \in X$ is the state vector in the compact set $X \subset \mathbb{R}^n$ and $u \in U \subseteq \mathbb{R}^m$ is the control input. The functions $f$, $g$ are locally Lipschitz continuous and unknown.

Consider a robot with the unknown dynamics in (1), operating in an environment that is represented by a compact subset $W \subset \mathbb{R}^w$ as a geometric space (the workspace of the robot). The relation between $X$ and $W$ is defined by the projection $\mathrm{Proj}: X \to W$. The space $W$ contains regions of interest that are labeled by a set of atomic propositions $\Pi$, with the labeling function $L_W: W \to 2^{\Pi}$. Let $L_X$ be the induced labeling function over $X$, i.e., $L_X(x) = L_W(\mathrm{Proj}(x))$.

Reinforcement Learning: The interaction between an RL-agent and the unknown dynamic system with state space $X$ is abstracted as a continuous labeled Markov Decision Process (cl-MDP) (thrun2002). A cl-MDP is a tuple $\mathcal{M} = (X, X_0, U, p, \Pi, L_X, R, \gamma)$, where $X \subset \mathbb{R}^n$ is the continuous state space, $X_0 \subseteq X$ is the set of initial states, $U \subseteq \mathbb{R}^m$ is the continuous action space, $p$ represents the unknown system dynamics as a distribution, $\Pi$ is the set of atomic propositions, $L_X$ is the labeling function, $R$ is the reward function, and $\gamma \in (0, 1]$ is the discount factor. The distribution $p$ is a Borel-measurable conditional transition kernel, s.t. $p(\cdot \mid x, u)$ is a probability measure of the next state, given the current state $x$ and action $u$, over the Borel space $(X, \mathcal{B}(X))$, where $\mathcal{B}(X)$ is the set of all Borel sets on $X$.

Let $\pi$ denote a policy, which is either deterministic, i.e., $\pi: X \to U$, or stochastic, i.e., $\pi: X \to \mathcal{P}(U)$, mapping states to distributions over actions. At each episode, the initial state of the robot in $X_0$ is denoted by $x_0$. At each time step $t$, the agent observes the state $x_t$ and executes an action $u_t$ according to the policy $\pi$; the environment returns the next state $x_{t+1}$ sampled from $p(\cdot \mid x_t, u_t)$. The process is repeated until the episode is terminated. The objective of the robot is to learn an optimal policy $\pi^*$ that maximizes the expected discounted return $\mathbb{E}^{\pi}\big[\sum_{t=0}^{\infty} \gamma^t R(x_t, u_t, x_{t+1})\big]$ under the policy $\pi$.
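As a minimal sketch, the discounted return of one finite episode can be computed by backward accumulation (the reward sequence and discount factor below are illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return G = sum_t gamma^t * r_t for one finite episode."""
    g = 0.0
    for r in reversed(rewards):  # Horner-style backward accumulation
        g = r + gamma * g
    return g
```

For example, `discounted_return([1.0, 1.0], gamma=0.5)` evaluates to 1.5.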

Linear Temporal Logic (LTL): LTL is a formal language to describe complex properties and high-level specifications of a system. The ingredients of an LTL formula are a set of atomic propositions and combinations of Boolean and temporal operators. The syntax of LTL is defined as (baier2008)

$\varphi := \mathrm{true} \mid a \mid \varphi_1 \wedge \varphi_2 \mid \neg \varphi \mid \bigcirc \varphi \mid \varphi_1 \,\mathcal{U}\, \varphi_2,$

where $a \in \Pi$ is an atomic proposition, and $\mathrm{true}$, negation $\neg$, and conjunction $\wedge$ are propositional logic operators, while next $\bigcirc$ and until $\mathcal{U}$ are temporal operators. Alongside the standard operators introduced above, other propositional logic operators, such as $\mathrm{false}$, disjunction $\vee$, and implication $\rightarrow$, and temporal operators, such as always $\square$ and eventually $\lozenge$, are derived from the standard ones.

The semantics of an LTL formula are interpreted over words, where a word is an infinite sequence $o = o_0 o_1 \ldots$, with $o_i \in 2^{\Pi}$ for all $i \geq 0$, and $2^{\Pi}$ represents the power set of $\Pi$. The satisfaction of an LTL formula $\varphi$ by the word $o$ is denoted by $o \models \varphi$. For an infinite word starting from step $t$, let $o_t$ denote its value at step $t$, and let $o[t..]$ denote the word starting from step $t$. The formal semantics of LTL can be found in (baier2008).

In this work, we restrict our attention to LTL formulas that exclude the next temporal operator $\bigcirc$, which is not meaningful for continuous state-action spaces (kloetzer2008fully).
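As a hedged illustration of the derived operators, the following sketch checks "eventually" and "always" for a single atomic proposition over a finite prefix of a word; full LTL semantics over infinite words require automata, as used later in the paper:

```python
def eventually(word, prop):
    """F prop over a finite word (list of label sets): prop holds at some step."""
    return any(prop in labels for labels in word)

def always(word, prop):
    """G prop over a finite word: prop holds at every step."""
    return all(prop in labels for labels in word)
```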

3 Problem Formulation

Consider a cl-MDP $\mathcal{M}$. The induced path under a policy $\pi$ over $\mathcal{M}$ is $x^{\pi} = x_0 x_1 \ldots$, where $x_{t+1} \sim p(\cdot \mid x_t, u_t)$ with $u_t$ generated by $\pi$. Let $L_X(x^{\pi}) = l_0 l_1 \ldots$ be the sequence of labels associated with $x^{\pi}$, such that $l_t = L_X(x_t)$. We denote the satisfaction relation of the induced trace with $\varphi$ by $L_X(x^{\pi}) \models \varphi$. The probability of satisfying $\varphi$ under the policy $\pi$, starting from an initial state $x_0$, is defined as the probability measure of the set of admissible paths from $x_0$ under $\pi$ whose traces satisfy $\varphi$; the detailed computation can be found in (baier2008). The transition distributions of $\mathcal{M}$ are unknown due to the unknown dynamics, and DRL algorithms are employed to learn the optimal control policies.

In this paper, the cl-MDP $\mathcal{M}$ captures the interactions between a complex environment with geometric space $W$ and an unknown dynamic system. Note that explicitly constructing a cl-MDP is impossible, due to the continuous state-action space. We instead track the cl-MDP on-the-fly (abstraction-free), according to the evolution of the dynamic system operating in $W$.

Problem 1.

Consider a goal region and obstacles in $W$, labeled as $\mathit{goal}$ and $\mathit{obs}$, respectively. The standard obstacle-free goal-reaching mission is expressed as the LTL formula $\varphi_{\mathrm{p2p}} = \lozenge\, \mathit{goal} \wedge \square \neg \mathit{obs}$. We aim at finding the optimal policy $\pi^*$ of $\mathcal{M}$ that satisfies the specification, i.e., $L_X(x^{\pi^*}) \models \varphi_{\mathrm{p2p}}$.

For Problem 1, typical learning-based algorithms only assign positive rewards when the robot reaches the goal region, resulting in exploration issues for DRL caused by the environmental challenge. Building on Problem 1, we address another challenge derived from complex tasks.

Problem 2.

Consider a set of labeled goal regions in $W$. The safety-critical task is expressed as $\varphi = \varphi_g \wedge \square \neg \mathit{obs}$, where $\varphi_g$ is a general LTL specification over the goal-region labels. The objective is to synthesize the optimal policy $\pi^*$ of $\mathcal{M}$ satisfying the task, i.e., $L_X(x^{\pi^*}) \models \varphi$. Note that $\varphi_{\mathrm{p2p}}$ is a special (simple) case of $\varphi$.

Assumption 1. Let $W_{\mathrm{free}}$ denote the obstacle-free space. We assume that there exists at least one policy that drives the robot toward the regions of interest in $W_{\mathrm{free}}$. This is reasonable, since the assumption ensures the existence of policies satisfying a given valid LTL specification.

Figure 1: Consider an autonomous vehicle operating in a complex and large-scale environment. It is challenging to learn optimal policies, as described in Example 1.
Example 1.

Consider an autonomous vehicle as an RL-agent with unknown dynamics that is tasked with a goal-reaching specification, shown in Fig. 1 (a). If the RL-agent only receives a reward after reaching the goal region, it is difficult to collect effective data with positive rewards using noisy policies during learning. The problem becomes even more challenging in Fig. 1 (b), where the LTL task requires the agent to safely visit several regions sequentially, infinitely many times.

4 Overcoming Exploration

To overcome the exploration challenges of DRL, this section proposes a solution to Problem 1 by designing a sampling-based reward scheme. In Sec. 4.1, we briefly introduce the geometric sampling-based algorithm used to generate an optimal path. Based on that, Sec. 4.2 develops a novel dense reward to overcome the exploration challenges. Sec. 4.3 introduces DPGs, a family of DRL methods, and provides rigorous analysis of task satisfaction using the optimal policies.

4.1 Geometric RRT*

The standard optimal RRT* method (karaman2011sampling) is a variant of RRT (lavalle2001randomized). Both RRT and RRT* are able to handle path planning in complex and large-scale environments. Due to its optimality, we choose RRT* over RRT to improve the performance of the learned optimal policies. Since the dynamic system is unknown, we use the geometric RRT* method that incrementally builds a tree $\mathcal{G} = (V, E)$ in $W$, where $V$ is the set of vertices and $E$ is the set of edges. If $V$ intersects the goal region, we find a geometric trajectory completing the goal-reaching task. The detailed procedure of geometric RRT* is described in Appendix A.1. Here, we briefly introduce two of the functions of the geometric RRT* method, which are used to explain our approach in the next sections.

Distance and cost: The function $\mathrm{Dist}(x_1, x_2)$ is the metric that computes the geometric Euclidean distance between two states. The function $\mathrm{Cost}(x)$ returns the length of the path in the tree from the initial state to the input state $x$.

Steering: Given two states $x_1, x_2 \in W$, the function $\mathrm{Steer}(x_1, x_2)$ returns a state $x_{\mathrm{new}}$ such that $x_{\mathrm{new}}$ lies on the geometric line connecting $x_1$ to $x_2$, and its distance from $x_1$ is at most $\eta$, i.e., $\mathrm{Dist}(x_1, x_{\mathrm{new}}) \leq \eta$, where $\eta > 0$ is a user-specified parameter. In addition, the state $x_{\mathrm{new}}$ must satisfy $\mathrm{Dist}(x_{\mathrm{new}}, x_2) \leq \mathrm{Dist}(x_1, x_2)$.
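A minimal 2-D sketch of the steering primitive, assuming Euclidean distance and states represented as (x, y) tuples (the names are illustrative, not the authors' implementation):

```python
import math

def steer(x_near, x_rand, eta):
    """Return the point on the segment x_near -> x_rand at distance
    at most eta from x_near (x_rand itself if it is already closer)."""
    dx, dy = x_rand[0] - x_near[0], x_rand[1] - x_near[1]
    d = math.hypot(dx, dy)
    if d <= eta:
        return x_rand
    return (x_near[0] + eta * dx / d, x_near[1] + eta * dy / d)
```

For example, steering from the origin toward (10, 0) with eta = 2 yields (2.0, 0.0).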

Given the tree generated by the RRT* method, if there exists at least one node located within the goal region, we find the optimal state trajectory satisfying the task as a sequence of geometric states $P^* = x_0^* x_1^* \ldots x_n^*$, where $x_0^*$ is the initial state and $x_n^*$ lies in the goal region.

4.2 Sampling-based reward

Here we use the optimal geometric trajectory and the properties of the generated tree to synthesize the reward scheme. First, let $x_i^*$ denote a state of $P^*$, and let the total length of $P^*$ in the tree be $L$. We define the distance from each state to the destination as $d(x_i^*) = L - \mathrm{Cost}(x_i^*)$. Based on this distance, we design the RRT* reward scheme to guide the learning progress towards the satisfaction of the goal-reaching task. Since passing through an exact state in the continuous state space is challenging for robots, we define a norm r-ball for each state, $\mathcal{B}(x_i^*, r) = \{x \in W \mid \lVert x - x_i^* \rVert \leq r\}$, where $x_i^*$ is the center and $r$ is the radius, to allow the robot to pass through the neighboring area of the state. For simplicity, we select $r$ based on the steering parameter $\eta$ of the geometric RRT*, such that adjacent r-balls along the optimal trajectory do not intersect each other.
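The per-waypoint distance-to-destination and the r-ball membership test can be sketched as follows (waypoints as 2-D tuples; assumes Python 3.8+ for `math.dist`; names are illustrative):

```python
import math

def remaining_distances(path):
    """dist[i] = length of the polyline from waypoint i to the final waypoint."""
    segs = [math.dist(path[i], path[i + 1]) for i in range(len(path) - 1)]
    out, acc = [], 0.0
    for s in reversed(segs):       # accumulate from the goal backwards
        acc += s
        out.append(acc)
    out.reverse()
    out.append(0.0)                # the goal waypoint itself
    return out

def in_r_ball(x, center, r):
    """Membership test for the r-ball around a waypoint."""
    return math.dist(x, center) <= r
```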

We develop a progression function to track whether the current state is getting closer to the goal region, by following the sequence of r-balls along $P^*$. For the cl-MDP capturing the interactions between the robot and the environment, the intuition behind the sampling-based reward design is to assign a positive reward whenever the robot gets geometrically closer to the goal region along the optimal path obtained by RRT* (Alg. 2).

During each episode of learning, the state-action sequence up to the current time $t$ is projected into the state sequence $x_{0:t}$ and the action sequence $u_{0:t}$, respectively. We define the progression energy of $x_{0:t}$ as the minimum distance along the optimal path towards the destination achieved up to step $t$. The objective of the reward is to drive the robot so that this energy keeps dropping. However, employing a function that depends on the history of the trajectory results in a non-Markovian reward function (Icarte2018), while the policy only takes the current state as input and cannot distinguish the progress achieved by different histories.

To address the issue, inspired by the product MDP (baier2008), given the history $x_{0:t}$, we deterministically keep track of the index $k_t$ of the center state of the visited r-ball regions with minimum distance to the destination. If none of the r-balls has been visited up to step $t$, $k_t$ keeps its initial default value. Then, the current state is embedded with the index as a product state $x_t^{\times} = (x_t, k_t)$, which is taken as the input of the policy, i.e., $u_t = \pi(x_t^{\times})$. Note that we can treat the embedded component $k_t$ as the state of a deterministic automaton (baier2008). The relevant analysis can be found in Appendix A.2.

Let $R$ denote the episodic reward function. We propose a novel scheme (3) to assign the Markovian reward with respect to the product state, where $r_p$ is a positive constant reward granted when the robot progresses to a new r-ball, $r_g$ is a boosted positive constant awarded when the robot reaches the destination, and $r_n$ is the negative constant reward assigned when the robot violates the safety part of the task, i.e., collides with an obstacle. Note that if the robot crosses both obstacles and r-balls, it receives the negative reward $r_n$, which has first priority. This setting does not restrict the selection of the parameter $r$ (the radius of the r-balls) in implementations.
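A self-contained sketch of a reward scheme in the spirit of (3), with illustrative parameter names (`r_pos`, `r_goal`, `r_neg`) and obstacles modeled as discs; the returned index realizes the product-state bookkeeping described above:

```python
import math

def rrt_reward(x, k_prev, centers, r, obstacles, obs_r,
               r_pos=1.0, r_goal=50.0, r_neg=-10.0):
    """Return (reward, updated index) for product state (x, k_prev).
    centers: r-ball centers along the optimal RRT* path, goal last."""
    # Safety violation has first priority, even if x also lies in an r-ball.
    if any(math.dist(x, o) <= obs_r for o in obstacles):
        return r_neg, k_prev
    # Progress: entering an r-ball strictly closer to the goal pays r_pos;
    # the final r-ball (the destination) pays the boosted r_goal.
    k = k_prev
    for i in range(k_prev + 1, len(centers)):
        if math.dist(x, centers[i]) <= r:
            k = i
    if k > k_prev:
        return (r_goal if k == len(centers) - 1 else r_pos), k
    return 0.0, k
```

Note that the index only moves forward, so a state revisiting an earlier r-ball collects no further reward, mirroring the progression idea above.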

Example 2.

As shown in Fig. 1 (a), we apply the RRT* method to generate the optimal trajectory (colored gray) in the challenging environment. Then, we construct the sequence of r-balls along it, and apply the reward design (3) to guide the learning and overcome exploration.

Remark 1.

Since the geometric RRT* does not consider the dynamic system, the optimal path may be infeasible for the robot to follow exactly under any policy. Thanks to our robust RRT* reward, the robot is not required to strictly follow the optimal path. Instead, in order to receive the positive reward, the robot only needs to move towards the destination and pass by some of the r-balls along the optimal path.

Finally, we can formally verify the performance of the reward design (3) as follows:

Theorem 1.

If Assumption 1 holds, by selecting $r_g$ to be sufficiently larger than $r_p$, i.e., $r_g \gg r_p$, any algorithm that optimizes the expected return is guaranteed to find the optimal policy $\pi^*$ satisfying the LTL task, i.e., $L_X(x^{\pi^*}) \models \varphi_{\mathrm{p2p}}$.

The proof is presented in Appendix A.3. Theorem 1 provides a theoretical guarantee for the optimization performance, allowing us to apply practical algorithms to find the approximately optimal policy in continuous space.

4.3 Deep Policy Gradient (DPG)

Based on Theorem 1, we apply advanced DRL methods to find the optimal policy. Consider a policy $\pi_{\theta}$, parameterized by $\theta$. Deep Q-learning (mnih2015human) is a DRL approach that aims to find the optimal policy by optimizing the parameters to maximize the expected discounted return $J(\theta) = \mathbb{E}^{\pi_{\theta}}\big[\sum_{t=0}^{\infty} \gamma^t R_t\big]$. It learns the optimal policy by minimizing the loss function:

$\mathcal{L}(\beta) = \mathbb{E}_{(x, u, R, x') \sim \mathcal{B}}\big[(y - Q_{\beta}(x, u))^2\big], \qquad (4)$

where $\mathcal{B}$ is the replay buffer that stores experience tuples $(x, u, R, x')$, $Q_{\beta}$ is the state-action value function parameterized by $\beta$, and $y = R + \gamma \max_{u'} Q_{\beta}(x', u')$ is the target value. To learn policies over continuous state-action spaces, this work focuses on DPG-based RL methods, which estimate the gradient of the expected return $J(\theta)$ with respect to the policy, based on sampled trajectories. For a stochastic policy, the gradient is estimated as:

$\nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim \rho^{\pi},\, u \sim \pi_{\theta}}\big[\nabla_{\theta} \log \pi_{\theta}(u \mid x)\, Q^{\pi}(x, u)\big], \qquad (5)$

where $\rho^{\pi}$ is the state distribution. It is possible to extend the DPG framework to deterministic policies (silver2014deterministic). In particular, we formulate the gradient as

$\nabla_{\theta} J(\theta) = \mathbb{E}_{x \sim \rho^{\pi}}\big[\nabla_{\theta} \pi_{\theta}(x)\, \nabla_{u} Q^{\pi}(x, u)\big|_{u = \pi_{\theta}(x)}\big]. \qquad (6)$
The DPG theorem gives rise to several practical algorithms, leading to a variety of actor-critic methods (sutton2018reinforcement). As observed in (4), all DPGs, e.g., (5) and (6), rely on effective data in the replay buffer, or on the sample efficiency of the state distribution, to minimize the loss function. In this paper, the sampling-based reward improves the exploration performance of any actor-critic algorithm using noisy policies, due to its high reward density.
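The reliance on replay-buffer quality can be made concrete with a minimal buffer sketch that exposes the fraction of nonzero-reward transitions, the quantity the RRT* reward is designed to raise (illustrative, not any specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal FIFO buffer of (state, action, reward, next_state) tuples."""
    def __init__(self, capacity):
        self.buf = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.buf.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement, as in standard off-policy DPGs.
        return random.sample(list(self.buf), batch_size)

    def reward_density(self):
        """Fraction of stored transitions carrying a nonzero reward."""
        if not self.buf:
            return 0.0
        return sum(1 for t in self.buf if t[2] != 0.0) / len(self.buf)
```

With a sparse goal-only reward this density stays near zero in large environments; the dense RRT* reward increases it, which is the core empirical claim of this section.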

Theorem 2.

If Assumption 1 holds, by selecting $r_g$ to be sufficiently larger than $r_p$, i.e., $r_g \gg r_p$, a suitable DPG algorithm that optimizes the expected return finds, in the limit, the optimal parameterized policy $\pi_{\theta^*}$ satisfying the LTL task, i.e., $L_X(x^{\pi_{\theta^*}}) \models \varphi_{\mathrm{p2p}}$.

Theorem 2 is an immediate result of Theorem 1 and the nature of nonlinear regression in deep neural networks. In practice, the number of episodes and steps is limited, and training has to be stopped eventually.

5 LTL Task Satisfaction

This section extends the approach of Sec. 4 to satisfy the general LTL task of Problem 2. Sec. 5.1 describes how to generate the optimal path, and Sec. 5.2 explains how to integrate distributed DPGs to learn the optimal policy.

5.1 Geometric TL-RRT*

Due to the unknown dynamic system, we define a transition system over the geometric space $W$, referred to as a Geometric-Weighted Transition System (G-WTS).

Definition 1.

A G-WTS of $W$ is a tuple $\mathcal{W} = (W, w_0, \to_W, \Pi, L_W, C_W)$, where $W$ is the configuration space of the robot, $w_0$ is the initial state of the robot, $\to_W \subseteq W \times W$ is the geometric transition relation s.t. $(w, w') \in \to_W$ if the straight line connecting $w$ to $w'$ is collision-free, $\Pi$ is the set of atomic propositions labeling the regions, $L_W$ is the labeling function that returns the atomic propositions satisfied at a location $w$, and $C_W$ is the geometric Euclidean distance, i.e., $C_W(w, w') = \mathrm{Dist}(w, w')$.

Let $\tau = w_0 w_1 \ldots$ denote a valid run of $\mathcal{W}$. An LTL formula $\varphi$ can be converted into a Non-deterministic Büchi Automaton (NBA) to verify its satisfaction.

Definition 2.

(vardi1986automata) An NBA over $2^{\Pi}$ is a tuple $\mathcal{A} = (Q, Q_0, \Sigma, \delta, Q_F)$, where $Q$ is the set of states, $Q_0 \subseteq Q$ is the set of initial states, $\Sigma = 2^{\Pi}$ is the finite alphabet, $\delta: Q \times \Sigma \to 2^{Q}$ is the transition relation, and $Q_F \subseteq Q$ is the set of accepting states.

A valid infinite run of $\mathcal{A}$ is called accepting if it intersects $Q_F$ infinitely often. Infinite words generated from an accepting run satisfy the corresponding LTL formula. An LTL formula is converted into an NBA using the tool of (gastin2001fast). As in (kantaros2020stylus), we can prune the infeasible transitions of the resulting NBA to obtain a truncated NBA.

Definition 3.

Given the G-WTS $\mathcal{W}$ and the NBA $\mathcal{A}$, the product Büchi automaton (PBA) is a tuple $\mathcal{P} = \mathcal{W} \times \mathcal{A} = (S, S_0, \to_P, S_F, C_P, L_P)$, where $S = W \times Q$ is the (infinite) set of product states, $S_0 = \{w_0\} \times Q_0$ is the set of initial states, $\to_P$ is the transition relation defined by the rule: $(w, q) \to_P (w', q')$ iff $(w, w') \in \to_W$ and $q' \in \delta(q, L_W(w))$, $S_F = W \times Q_F$ is the set of accepting states, $C_P$ is the cost function defined as the cost in the configuration space, e.g., $C_P((w, q), (w', q')) = C_W(w, w')$, and $L_P$ is the labeling function s.t. $L_P((w, q)) = L_W(w)$.

A valid trace of a PBA is called accepting if it visits $S_F$ infinitely often, referred to as the acceptance condition. Its accepting words satisfy the corresponding LTL formula $\varphi$. Let $p = s_0 s_1 \ldots$ denote an accepting trace, and let $p|_W = w_0 w_1 \ldots$ be its projection onto the workspace. Using the projection, we extract the geometric trajectory that satisfies the LTL formula; more details are presented in (baier2008). Therefore, the planning objective is to find an accepting path of the PBA with minimum accumulative configuration cost $\sum_i C_P(s_i, s_{i+1})$.

However, the state spaces of the G-WTS and the PBA are both infinite. Consequently, we cannot apply a graph-search method to a PBA with infinitely many states. Thanks to the TL-RRT* algorithm (luo2021abstraction), which provides an abstraction-free method, we can incrementally build trees that explore the product state space and find the feasible optimal accepting path. The procedure applies the sampling-based method over the PBA, and is inspired by the fact that the accepting run is a lasso-type sequence in prefix-suffix form $p = p^{\mathrm{pre}}[p^{\mathrm{suf}}]^{\omega}$, where the prefix part $p^{\mathrm{pre}}$ is executed only once, and the suffix part $p^{\mathrm{suf}}$ is repeated infinitely often.
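The prefix-suffix structure can be sketched as a generator of the first n symbols of the induced infinite word (purely illustrative):

```python
def lasso_word(prefix, suffix, n):
    """First n elements of the infinite sequence prefix . suffix^omega."""
    assert suffix, "suffix must be non-empty to repeat forever"
    out = list(prefix)
    while len(out) < n:
        out.extend(suffix)
    return out[:n]
```

For example, `lasso_word(["a"], ["b", "c"], 5)` yields `["a", "b", "c", "b", "c"]`.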

Following this idea, we build trees for the prefix and suffix paths, respectively. To satisfy the acceptance condition, the set of goal states of the prefix tree is defined over the accepting states reachable in the collision-free configuration space. After obtaining the prefix tree, we compute the optimal prefix path reaching such an accepting state from the root $w_0$. The suffix tree is built by treating that accepting state as its root, and its goal states collect all states that can reach the accepting state via one transition; this ensures a feasible cyclic path matching the suffix structure. Finally, we search for the optimal suffix path over the suffix tree.

1:  Input: LTL task $\varphi$, black-box dynamic system;
2:  Initialize: Geometric space , Primitives of TL-RRT*;
3:  Convert into NBA
4:  Build the incremental trees for the PBA geometrically, based on Definition 1 and Definition 3
5:  Generate the optimal trajectory
6:  Reformulate the trajectory into the modular form
7:  for each sub-task do
8:     Construct the RRT* reward based on (3) for the sub-task
9:     Assign an actor-critic DPG, e.g., DDPG or PPO, for the sub-task
10:  end for
11:  Assign the rewards (3) and DPGs for each sub-task.
12:  Train the distributed DPGs in parallel
13:  Extract the optimal policy from each DPG
14:  Concatenate all optimal policies in the modular form
Algorithm 1 LTL-RRT*-Distributed DPGs

5.2 Distributed DPGs

In this section, we first employ the optimal geometric path from Sec. 5.1 to propose a distributed reward scheme. Since the policy gradient strategy suffers from variance issues and only finds the optimal policy in the limit (see Theorem 2), instead of directly applying the reward design (3) to the whole path, we decompose the path into sub-tasks.

To do so, we modularly divide the optimal trajectory into separate consecutive segments such that all product states of each sub-trajectory share the same automaton component. Each segment can be treated as a collision-free goal-reaching problem, whose goal region is the labeled region containing the final geometric state of the segment. An example of the optimal decomposition can be found in Fig. 3. The lasso-type optimal trajectory is thus reformulated as a sequence of goal-reaching segments. For the cl-MDP, we treat each segment as a task solved as in Sec. 4.2.
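The decomposition into segments sharing the same automaton component can be sketched with `itertools.groupby` over product states (x, q); the data layout below is hypothetical:

```python
from itertools import groupby

def decompose(product_trace):
    """Split a trace of (x, q) product states into consecutive segments
    whose states share the same automaton component q."""
    return [
        (q, [x for x, _ in group])
        for q, group in groupby(product_trace, key=lambda s: s[1])
    ]
```

Each returned segment is then a goal-reaching sub-task trained by its own DPG.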

In particular, we propose a collaborative team of RRT* rewards (3), one for each sub-task, and assign distributed DPGs that are trained in parallel. After training, we concatenate the learned policy of each sub-task to obtain the global optimal policy. The overall procedure is summarized in Alg. 1, and an illustrative diagram with rigorous analysis is presented in Appendix A.4.

Remark 2.

Applying sampling-based methods or reachability-based control synthesis for LTL satisfaction has been investigated in (vasile2020reactive; kantaros2020stylus; srinivasan2020control; luo2021abstraction). These works all assume a known dynamic system. In contrast, from an AI perspective treating the system as an unknown black box, we use the geometric RRT* path to synthesize dense rewards, and apply distributed DPGs to learn the optimal policies.

6 Experimental Results

We evaluate the RRT* reward and the distributed DPGs by applying the framework to satisfy LTL tasks of the goal-reaching form and the general form, respectively, where the former is a special case of the latter. The evaluation focuses on large-scale complex environments that generalize simple ones, e.g., (lowe2017multi), to illustrate the performance. Finally, we show that our algorithm improves the success rates of task satisfaction over both infinite and finite horizons in complex environments, and significantly reduces the training time for tasks over finite horizons. Detailed descriptions of the environments and LTL tasks are given below.

Baseline Approaches: We refer to our distributed framework as "D-RRT*" or "RRT*", and compare it against three baselines: (i) the TL-based rewards in (hasanbeig2020deep; Cai2021modular), referred to as "TL", for single LTL tasks over infinite horizons, which have shown excellent performance in non-complex environments and generalize the finite-horizon cases in the existing literature (Li2019; vaezipoor2021ltl2action; icarte2022reward); (ii) similar to (qin2021learning), for the goal-reaching task, the baseline referred to as "NED" uses the negative Euclidean distance between the robot and the destination, plus the goal-reaching boosted value, as the rewards; (iii) for a complex LTL task, instead of decomposition, the baseline referred to as "G-RRT*" directly applies the reward scheme (3) to the global trajectory. For fair comparison, baseline "NED" only records the boosted positive reward and the negative rewards of obstacle collisions.

Figure 2: Baseline comparison for the goal-reaching task in the environment of Fig. 3. (Left) Baselines using DDPG. (Right) Baselines using PPO.

6.1 Goal-Reaching We first consider LTL task satisfaction in the goal-reaching form in the environment of Fig. 3. We implement all baselines with both DDPG and PPO, running a fixed number of episodes, each with a bounded number of steps. The learning results are shown in Fig. 2. We observe that "TL", with point-to-point sparse rewards, and "NED", which minimizes the Euclidean distance, perform poorly in complex environments due to exploration issues. With the RRT*-based reward, our framework can be integrated with either the off-policy or the on-policy DPG. The consecutive r-balls of rewards in the configuration space make PPO perform better in Fig. 2, since they allow efficient evaluation of the current policies.

Figure 3: Decomposition example. (Left) Truncated NBA of the LTL formula; (Right) Decomposed sub goal-reaching tasks.

6.2 LTL Satisfaction For general LTL tasks, we first show the decomposition process for an LTL task over infinite horizons, which requires the robot to visit the goal regions infinitely often. The resulting truncated NBA and the decomposed trajectories of TL-RRT* are shown in Fig. 3, where the decomposed sub-paths are expressed as goal-reaching segments, such that distributed DPGs are applied to train the optimal sub-policies for each one in parallel.

Considering more complex LTL formulas, we use the environment shown in Fig. 3 (b). We increase the complexity by randomly sampling obstacle-free goal regions and setting a richer specification. For all tasks over infinite horizons, we apply DDPG to compare our framework with the baselines "TL" and "G-RRT*". The results are shown in Fig. 4. We observe that the "TL" method is sensitive to the environment, and when the optimal trajectories become complicated due to the complexity of the LTL tasks, "G-RRT*" easily converges to sub-optimal solutions.

The finite-horizon forms of the above tasks are obtained by removing the "always" operator from the recurrence part of each formula. This shows that our framework generalizes LTL satisfaction over both finite and infinite horizons.

Figure 4: (a)–(c) Results of DDPG for each task in its specific environment using the baselines "G-RRT*" and "TL"; (d) Training-time comparison for tasks over finite horizons.

6.3 Performance Evaluation We run repeated trials of the learned policies for each sub-task and record the success rates of satisfaction for all global tasks. The results are shown in Table 1 of Appendix A.5. We see that, in complex environments, the success rates of all other baselines are low, while our method achieves near-perfect success rates. The effectiveness of the learned policies is thus significantly improved.

Since our algorithm learns to complete the task faster, each episode terminates earlier during learning for tasks over finite horizons. To illustrate the efficiency, we repeat the training process for each task in its corresponding complex environment and record the average training time compared with the baseline modular "TL" (Cai2021modular). The results in Fig. 4 (d) show that the learning efficiency is optimized and the training time is reduced. In practice, we can apply distributed computation to train each local DPG simultaneously for complicated global tasks, alleviating the learning burden.
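The parallel training of decomposed sub-tasks can be sketched with standard Python multiprocessing. Here `train_subtask` is a hypothetical stand-in for one local DPG run; a real implementation would launch DDPG/PPO inside it and return the learned weights.

```python
from multiprocessing import Pool

def train_subtask(subtask_id):
    # Stand-in for one distributed DPG run on a decomposed sub-path;
    # a real implementation would train DDPG/PPO here and return weights.
    return subtask_id, f"policy_{subtask_id}"

def train_all(subtask_ids):
    # Train all sub-policies in parallel, one worker per sub-task.
    with Pool(processes=len(subtask_ids)) as pool:
        return dict(pool.map(train_subtask, subtask_ids))

if __name__ == "__main__":
    policies = train_all([0, 1, 2, 3])
```

Because the sub-tasks are independent after decomposition, wall-clock training time scales with the slowest sub-task rather than with the sum of all of them.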

7 Conclusion

Applying DPG algorithms in complex environments produces vastly different behaviors and often fails to complete complex tasks. A persistent problem is the exploration phase of the learning process, which limits applications to real-world robotic systems. This paper provides a novel path-planning-based reward scheme that alleviates this problem, significantly improving reward density and generating optimal policies satisfying complex specifications in challenging environments. To facilitate rich high-level specifications, we develop an optimal decomposition strategy for the global LTL task, allowing all sub-tasks to be trained in parallel and optimizing efficiency. The main limitation of our approach is its reliance on geometric motion planning algorithms, which restricts its application to some systems. In future work, we aim to enforce safety-critical requirements during learning and to investigate multi-agent systems.


Appendix A Appendix (Supplementary Materials)

A.1 Summary of Geometric RRT*

Before discussing the algorithm in detail, we introduce a few algorithmic primitives:

Random sampling: The sampling function provides independent, uniformly distributed random samples of states from the geometric space.
Distance and cost: The function is the metric that returns the geometric Euclidean distance. The function returns the length of the path from the initial state to the input state.

Nearest neighbor: Given the set of vertices in the tree and a state, the function returns the vertex in the tree closest to that state under the distance metric.

Steering: Given two states, the function returns a state that lies on the geometric line connecting them and whose distance from the first state is at most a user-specified step size. The function also returns the straight line connecting the two states.

Collision check: A function that detects whether a state trajectory lies in the obstacle-free portion of the space.
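A minimal collision check can sample points along the straight segment between two states; the circular obstacle representation (center, radius) below is a layout of our choosing for illustration, not the paper's environment model.

```python
import math

def collision_free(p, q, obstacles, steps=20):
    """Sample points along the straight segment p->q and check that
    each lies outside every circular obstacle given as ((cx, cy), r)."""
    for i in range(steps + 1):
        t = i / steps
        x = (p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1]))
        for (cx, cy), r in obstacles:
            if math.hypot(x[0] - cx, x[1] - cy) <= r:
                return False  # the sampled point penetrates an obstacle
    return True
```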

Near nodes: Given the set of vertices in the tree and a state, the function returns the set of vertices whose cost to that state is below a threshold radius, where the radius shrinks with the number of vertices n in the tree, depends on the dimension d of the configuration space, and is scaled by a constant.
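The shrinking neighborhood radius used in RRT* typically takes the form min(γ (log n / n)^{1/d}, η); a direct sketch follows, where the γ and η defaults are placeholder values of our choosing.

```python
import math

def near_radius(n, d, gamma=2.0, eta=1.0):
    """RRT* neighborhood radius: shrinks as the tree grows so that
    the expected number of near nodes stays logarithmic in n."""
    if n < 2:
        return eta  # degenerate tree: fall back to the max step size
    return min(gamma * (math.log(n) / n) ** (1.0 / d), eta)
```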

Optimal path: Given two states, the function returns a locally optimal trajectory connecting them.

Alg. 2 proceeds as follows. First, the graph is initialized with the initial state (line 1). Then a state is sampled from the free space (line 3), the nearest node in the tree is found (line 4) and extended toward the sample, yielding a new state together with the straight line connecting them (line 5). If this line is collision-free (line 6), the algorithm iterates over all near neighbors of the new state and finds the parent with the lowest cost to reach it (lines 7–15). The tree is then updated with the new state (lines 16–17), and the algorithm rewires the near nodes using Alg. 3 (line 18). Alg. 3 iterates over the near neighbors of the new state and updates the parent of a near node to the new state if the cost of reaching it through the new state is less than its current cost.

1:  Initialize: ; ,
2:  for   do
4:     ,
5:     ,
6:     if  then
8:        , ,
9:        for  do
11:           if  then
13:              ;
14:           end if
15:        end for
18:         Rewire
19:     end if
20:  end for
21:  return
Algorithm 2 Geometric RRT* (, )
1:  for  do
3:     if  then
4:        if  then
8:        end if
9:     end if
10:  end for
11:  return
Algorithm 3 Rewire(, , )
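The pseudocode above can be condensed into a minimal obstacle-free 2D sketch. The names, the gamma constant, the fixed step size, and the omission of cost propagation to descendants after rewiring are our simplifications, not the paper's exact Alg. 2–3.

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def steer(x_near, x_rand, eta=0.5):
    # Move from x_near toward x_rand by at most the step size eta.
    d = dist(x_near, x_rand)
    if d <= eta:
        return x_rand
    t = eta / d
    return (x_near[0] + t * (x_rand[0] - x_near[0]),
            x_near[1] + t * (x_rand[1] - x_near[1]))

def rrt_star(x_init, n_iter=500, bounds=(0.0, 10.0), gamma=3.0, seed=0):
    """Minimal geometric RRT* in an obstacle-free 2D square.
    parent/cost dicts encode the tree; cost propagation to the
    descendants of rewired nodes is omitted for brevity."""
    rng = random.Random(seed)
    parent = {x_init: None}
    cost = {x_init: 0.0}
    for _ in range(n_iter):
        x_rand = (rng.uniform(*bounds), rng.uniform(*bounds))
        x_nearest = min(parent, key=lambda v: dist(v, x_rand))
        x_new = steer(x_nearest, x_rand)
        if x_new in parent:
            continue
        n = len(parent)
        r = min(gamma * (math.log(n + 1) / (n + 1)) ** 0.5, 0.5)
        near = [v for v in parent if dist(v, x_new) <= r]
        # Choose the lowest-cost parent among the near nodes (lines 7-15).
        x_min = x_nearest
        c_min = cost[x_nearest] + dist(x_nearest, x_new)
        for v in near:
            c = cost[v] + dist(v, x_new)
            if c < c_min:
                x_min, c_min = v, c
        parent[x_new], cost[x_new] = x_min, c_min  # lines 16-17
        # Rewire: reroute near nodes through x_new when cheaper (Alg. 3).
        for v in near:
            c = c_min + dist(x_new, v)
            if c < cost[v]:
                parent[v], cost[v] = x_new, c
    return parent, cost
```

In an environment with obstacles, the `collision_free` primitive would gate both the extension and the rewiring steps, exactly as in lines 6 and the rewire loop of the pseudocode.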

Figure 5: General diagram of the LTL-RRT*-Distributed DPGs method explained in Alg. 1.

A.2 Analysis of Product State

In this section, we show that the product state can be used with RL algorithms and that the reward in (3) is Markovian for the product state. Let the set of all sequential indexes of the r-balls along an optimal trajectory be given, and define a function that returns the index at each time step, where the output of the function satisfies the following requirements:

Consequently, given an input, the function generates a deterministic output, so we can regard the tuple as a deterministic automaton without accepting states (baier2008). The product state is derived from a product structure defined as:

Definition 4.

The product between the cl-MDP