1 Introduction
Deep Reinforcement Learning (DRL) employs neural networks to find optimal policies for unknown dynamic systems by maximizing long-term rewards
(mnih2015human). In principle, DRL offers a method to learn such policies based on the exploration vs. exploitation trade-off (sutton2018reinforcement), but the inefficiency of the required exploration has prohibited its usage in real-world applications. To collect non-zero rewards, existing DRL algorithms for robotic tasks simply explore the environment using noisy policies. As the environment becomes complex and large-scale, such naive exploration strategies become less effective, and the problem is even more challenging for complex robotic tasks. Consequently, exploration techniques in DRL are expected to deal with two common challenges: (1) identifying states that are worth exploring with respect to the navigation tasks; and (2) guiding the exploration toward the satisfaction of the navigation tasks.

Related Work: Motivated by task-guided planning and control, formal languages have been shown to be efficient tools for expressing a diverse set of high-level specifications (baier2008). For unknown dynamics, temporal-logic-based rewards have been developed and integrated with various DRL algorithms. In particular, deep Q-learning is employed in (Icarte2018; Camacho2019; hasanbeig2019deepsynth; xu2020joint) over discrete action spaces. For continuous state-action spaces, the authors in (Li2019; vaezipoor2021ltl2action; icarte2022reward) utilize actor-critic algorithms, e.g., proximal policy optimization (PPO) (schulman2017proximal), for policy optimization, validated in robotic manipulation and Safety Gym, respectively. All aforementioned works only study LTL specifications over finite horizons. To facilitate defining LTL tasks over infinite horizons, recent works (Cai2021modular; cai2021safe) improve the results of (hasanbeig2020deep) by converting LTL into a novel automaton, the E-LDGBA, a variant of the Limit-Deterministic Generalized Büchi Automaton (LDGBA) (Sickert2016).
To improve performance for long-term (infinite-horizon) satisfaction, they propose a modular architecture of Deep Deterministic Policy Gradient (DDPG) (lillicrap2015continuous) that decomposes global missions into subtasks. However, none of the existing works can handle large-scale, complex environments, since an LTL-based reward requires the RL agent to visit the regions of interest relevant to the LTL satisfaction; such rewards are too sparse for effective exploration in challenging environments.
Many prior works (schulman2015trust; lillicrap2015continuous; schulman2017proximal; lowe2017multi; haarnoja2018soft) employ noise-based exploration policies for the actor-critic method, integrated with different versions of DPGs, whose sampling efficiency relies mainly on the density of the specified rewards. Recent works (hester2018deep; nair2018overcoming) leverage human demonstrations to address exploration issues in robotic manipulation. On the other hand, the works in (vecerik2017leveraging; fujimoto2019off) focus on effectively utilizing the dataset stored in the replay buffer to speed up the learning process. However, these advances all assume a prior dataset is given beforehand, and they are not applicable to learning from scratch. The work in (liu2021cooperative) proposes a multi-agent exploration method by restricting and expanding the state space, but it remains unclear whether each agent handles its own independent responsibilities.
Contribution: Intuitively, the most effective way of addressing the above challenges is to optimize the density of rewards so that more effective data are collected during learning, dramatically increasing the portion of transitions with positive rewards in the replay buffer. Motivated by this, our paper proposes an effective exploration technique based on the Rapidly-exploring Random Tree (RRT*) sampling-based method (karaman2011sampling), a powerful paradigm for point-to-point navigation in large-scale, complex environments. In contrast to existing RRT* works that assume controllers for following the optimal paths are available, we treat the robotic system as a black box and design novel dense RRT* rewards such that the optimal policies for general goal-reaching missions are learned by DPGs. Furthermore, high-level robotic tasks are expressed by rich Linear Temporal Logic (LTL) (baier2008) specifications. We extend the exploration strategy to propose a distributed sampling-based learning framework that finds the optimal policies for LTL satisfaction, subject to environmental challenges. Our framework integrates actor-critic RL and RRT* methods, and synthesizes optimal policies over continuous state-action spaces. In summary, the main contributions of the paper are as follows:

To the best of our knowledge, this is the first work to bridge the gap between sampling-based planning and DPGs to synthesize policies that handle complex and large-scale environments. In particular, an RRT*-based reward is introduced for the standard goal-reaching navigation task, which is dense and efficient in the configuration space for overcoming exploration challenges.

Although the exploration procedure relies on optimal geometric paths that do not consider the dynamics, our novel reward scheme is robust to infeasible RRT* paths and does not require the robot to follow them strictly. Rigorous analysis of the optimal policy is provided regarding goal-reaching satisfaction.

We propose a cooperative team of distributed DPGs that is trained synchronously. This architecture is integrated with the exploration guidance of the RRT*-based reward for LTL satisfaction, to address both challenging environments and complex tasks. The framework is flexible and can be generalized to other formal languages or learning-based navigation.

We validate our algorithm through case studies, showing better performance than the point-to-point DPG algorithms in the literature and significant improvements in exploration under environmental challenges.
2 Preliminaries
The evolution of a continuous-time dynamic system starting from an initial state is given by

(1) $\dot{x} = f(x) + g(x)u$,

where $x$ is the state vector in the compact set $\mathcal{X} \subset \mathbb{R}^n$ and $u \in \mathcal{U}$ is the control input. The functions $f$, $g$ are locally Lipschitz continuous and unknown.

Consider a robot with the unknown dynamics in (1), operating in an environment that is represented by a compact subset $\mathcal{W}$ as a geometric space (the workspace of the robot). The relation between $\mathcal{X}$ and $\mathcal{W}$ is defined by the projection $\mathrm{Proj}: \mathcal{X} \to \mathcal{W}$. The space $\mathcal{W}$ contains regions of interest that are labeled by a set of atomic propositions $\Pi$, with the labeling function $L_{\mathcal{W}}: \mathcal{W} \to 2^{\Pi}$. Let $L: \mathcal{X} \to 2^{\Pi}$ be the labeling function over $\mathcal{X}$, i.e., $L = L_{\mathcal{W}} \circ \mathrm{Proj}$.
Reinforcement Learning: The interactions between an agent and the unknown dynamic system with state-space $\mathcal{X}$ are abstracted as a continuous labeled Markov Decision Process (cl-MDP) (thrun2002). A cl-MDP is a tuple $\mathcal{M} = (S, S_0, A, p_S, \Pi, L, R, \gamma)$, where $S$ is a continuous state space, $S_0$ is a set of initial states, $A$ is a continuous action space, $p_S$ represents the unknown system dynamics as a distribution, $\Pi$ is the set of atomic propositions, $L$ is the labeling function, $R$ is the reward function, and $\gamma \in (0, 1)$ is the discount factor. The distribution $p_S$ is a Borel-measurable conditional transition kernel, such that $p_S(\cdot \mid s, a)$ is a probability measure of the next state given the current state $s$ and action $a$ over the Borel space $(S, \mathcal{B}(S))$, where $\mathcal{B}(S)$ is the set of all Borel sets on $S$. Let $\pi$ denote a policy, which is either deterministic, i.e., $\pi: S \to A$, or stochastic, i.e., mapping states to distributions over actions. At each episode, the initial state of the robot is denoted by $s_0 \in S_0$. At each time step $t$, the agent observes the state $s_t$, executes an action $a_t$ according to the policy $\pi$, and receives the next state $s_{t+1}$ sampled from $p_S(\cdot \mid s_t, a_t)$. The process is repeated until the episode is terminated. The objective of the robot is to learn an optimal policy $\pi^*$ that maximizes the expected discounted return $\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$ under the policy $\pi$.
Linear Temporal Logic (LTL): LTL is a formal language for describing complex properties and high-level specifications of a system. The ingredients of an LTL formula are a set of atomic propositions and combinations of Boolean and temporal operators. The syntax of LTL is defined as (baier2008)

$\varphi := \text{true} \mid a \mid \varphi_1 \wedge \varphi_2 \mid \neg \varphi \mid \bigcirc \varphi \mid \varphi_1 \,\mathcal{U}\, \varphi_2,$

where $a \in \Pi$ is an atomic proposition, true, negation $\neg$, and conjunction $\wedge$ are propositional logic operators, and next $\bigcirc$ and until $\mathcal{U}$ are temporal operators. Alongside the standard operators introduced above, other propositional logic operators, such as false, disjunction $\vee$, and implication $\rightarrow$, and temporal operators, such as always $\Box$ and eventually $\Diamond$, are derived from the standard ones.
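For instance, the derived temporal operators can be expressed in terms of the standard ones as follows (a standard construction, cf. (baier2008)):

```latex
\Diamond \varphi \;\equiv\; \text{true} \,\mathcal{U}\, \varphi,
\qquad
\Box \varphi \;\equiv\; \neg \Diamond \neg \varphi,
\qquad
\varphi_1 \rightarrow \varphi_2 \;\equiv\; \neg \varphi_1 \vee \varphi_2 .
```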
The semantics of an LTL formula are interpreted over words, where a word is an infinite sequence $\sigma = \sigma_0 \sigma_1 \ldots$, with $\sigma_i \in 2^{\Pi}$ for all $i \geq 0$, and $2^{\Pi}$ represents the power set of $\Pi$. The satisfaction of an LTL formula $\varphi$ by the word $\sigma$ is denoted by $\sigma \models \varphi$. For an infinite word $\sigma$ starting from step $0$, let $\sigma[t]$ denote the value at step $t$, and let $\sigma[t{:}]$ denote the word starting from step $t$. The semantics of LTL are defined as in (baier2008).
In this work, we restrict our attention to LTL formulas that exclude the next temporal operator, which is not meaningful for continuous state-action spaces (kloetzer2008fully).
3 Problem Formulation
Consider a cl-MDP $\mathcal{M}$. The induced path under a policy $\pi$ over $\mathcal{M}$ is $s_{\infty} = s_0 s_1 \ldots$, where $s_{t+1}$ is sampled from $p_S(\cdot \mid s_t, a_t)$ with $a_t$ drawn from $\pi$. Let $L(s_{\infty}) = l_0 l_1 \ldots$ be the sequence of labels associated with $s_{\infty}$, such that $l_t = L(s_t)$. We denote the satisfaction relation of the induced trace with $\varphi$ by $L(s_{\infty}) \models \varphi$. The probability of satisfying $\varphi$ under the policy $\pi$, starting from an initial state $s_0$, is defined over the set of admissible paths from $s_0$ under $\pi$; the detailed computation of this probability can be found in (baier2008). The transition distributions of $\mathcal{M}$ are unknown due to the unknown dynamics, and DRL algorithms are employed to learn the optimal control policies.
In this paper, the cl-MDP captures the interactions between a complex environment with geometric space $\mathcal{W}$ and an unknown dynamic system. Note that explicitly constructing the cl-MDP is impossible due to the continuous state-action space. Instead, we track the cl-MDP on-the-fly (abstraction-free), according to the evolution of the dynamic system operating in $\mathcal{W}$.
Problem 1.
Consider a goal region and obstacles in $\mathcal{W}$, labeled as $\text{goal}$ and $\text{obs}$, respectively. The standard obstacle-free goal-reaching mission is expressed as the LTL formula $\Diamond \text{goal} \wedge \Box \neg \text{obs}$. We aim to find the optimal policy that satisfies this specification.
For Problem 1, typical learning-based algorithms only assign positive rewards when the robot reaches the goal region, resulting in the exploration issues of DRL caused by environmental challenges. Building on Problem 1, we account for another challenge derived from complex tasks.
Problem 2.
Consider a set of labeled goal regions in $\mathcal{W}$. The safety-critical task is expressed as the conjunction of a general LTL specification with the obstacle-avoidance requirement $\Box \neg \text{obs}$. The objective is to synthesize the optimal policy satisfying this task. Note that the goal-reaching task of Problem 1 is a special (simple) case.
Assumption 1. Let $\mathcal{W}_{\text{free}}$ denote the obstacle-free space. We assume that there exists at least one policy that drives the robot toward the regions of interest in $\mathcal{W}_{\text{free}}$. This assumption is reasonable, since it only requires that the given LTL specification is valid, i.e., satisfiable by some policy.
Example 1.
Consider an autonomous vehicle as an RL agent with unknown dynamics that is tasked with a goal-reaching specification, shown in Fig. 1 (a). If the RL agent only receives a reward after reaching the goal region, it is hard to collect effective data with positive rewards using noisy policies during learning. The problem becomes even more challenging in Fig. 1 (b), where the LTL task requires the agent to safely visit regions sequentially, infinitely many times.
4 Overcoming Exploration
To overcome the exploration challenges of DRL, this section proposes a solution to Problem 1 by designing a sampling-based reward scheme. Sec. 4.1 briefly introduces the geometric sampling-based algorithm used to generate an optimal path. Based on that, Sec. 4.2 develops a novel dense reward to overcome the exploration challenges. Sec. 4.3 introduces DPGs, variants of DRL, and provides rigorous analysis of task satisfaction under the optimal policies.
4.1 Geometric RRT*
The standard optimal RRT* method (karaman2011sampling) is a variant of RRT (lavalle2001randomized). Both RRT and RRT* are able to handle path planning in complex and large-scale environments. Due to its asymptotic optimality, we choose RRT* over RRT to improve the performance of the learned policies. Since the dynamic system is unknown, we use the geometric RRT* method that incrementally builds a tree $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ in $\mathcal{W}$, where $\mathcal{V}$ is the set of vertices and $\mathcal{E}$ is the set of edges. If $\mathcal{V}$ intersects the goal region, we find a geometric trajectory that completes the task. The detailed procedure of geometric RRT* is described in Appendix A.1. Here, we briefly introduce two of the functions in the geometric RRT* method, which are used in explaining our approach in the next sections.
Distance and cost: The distance function is the metric that computes the geometric Euclidean distance between two states. The cost function returns the length of the path in the tree from the initial state to the input state.
Steering: Given two states, the steering function returns a state that lies on the geometric line connecting the first state to the second, and whose distance from the first state is at most $\eta$, where $\eta$ is a user-specified parameter. In addition, the returned state must make progress toward the second state.
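As an illustration, a minimal 2-D steering function can be sketched as follows (the function name and the tuple representation of states are ours, not the paper's):

```python
import math

def steer(x_nearest, x_rand, eta):
    """Return a point on the segment from x_nearest toward x_rand,
    at Euclidean distance at most eta from x_nearest."""
    dx = x_rand[0] - x_nearest[0]
    dy = x_rand[1] - x_nearest[1]
    d = math.hypot(dx, dy)
    if d <= eta:
        return x_rand  # the sample is already within reach
    # otherwise move exactly eta along the connecting line
    return (x_nearest[0] + eta * dx / d, x_nearest[1] + eta * dy / d)
```

For example, steering from the origin toward a sample 10 units away with $\eta = 2$ returns the point 2 units along that line.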
Given the tree generated by the RRT* method, if at least one node is located within the goal region, we extract the optimal state trajectory satisfying the task as a sequence of geometric states, from the initial state to a state inside the goal region.
4.2 Sampling-based reward
Here we use the optimal geometric trajectory and the properties of the generated tree to synthesize the reward scheme. First, consider a state of the optimal trajectory, and let the total length of the trajectory in the tree be given. We define the distance from each state of the trajectory to the destination as the remaining path length from that state to the final state. Based on this distance, we design the RRT* reward scheme to guide the learning progress toward task satisfaction. Since passing through an exact state in the continuous state-space is challenging for robots, we define a norm r-ball around each state of the trajectory, with the state as its center and radius $r$, to allow the robot to pass through the neighboring area of the state. For simplicity, we select $r$ based on the steering parameter $\eta$ of the geometric RRT*, such that adjacent r-balls along the optimal trajectory do not intersect each other.
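The per-waypoint distance-to-destination described above can be computed with a single backward pass over the trajectory; the following sketch (our own helper, with 2-D waypoints as tuples) illustrates the idea:

```python
import math

def remaining_distances(path):
    """For each waypoint of the geometric trajectory, return the length
    of the path from that waypoint to the final waypoint (destination)."""
    seg = [math.dist(a, b) for a, b in zip(path, path[1:])]
    rem = [0.0]  # the destination is at distance 0 from itself
    for s in reversed(seg):
        rem.append(rem[-1] + s)
    return list(reversed(rem))
```

For instance, `remaining_distances([(0, 0), (3, 0), (3, 4)])` yields `[7.0, 4.0, 0.0]`: the first waypoint is 3 + 4 = 7 units from the destination along the path.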
We develop a progression function to track whether the current state is getting closer to the goal region, by following the sequence of r-balls along the optimal trajectory:
(2) 
For the cl-MDP capturing the interactions between the robot and the environment, the intuition behind the sampling-based reward design is to assign a positive reward whenever the robot gets geometrically closer to the goal region along the optimal path obtained by RRT* (Alg. 2).
During each episode of learning, the state-action sequence up to the current time is projected into its state and action sequences, respectively. We define the progression energy as the minimum distance along the optimal path toward the destination achieved up to the current step. The objective of the reward is to drive the robot such that the progression energy keeps dropping. However, employing a function that depends on the history of the trajectory for the reward design results in a non-Markovian reward function (Icarte2018), while the policy only takes the current state as input and cannot distinguish the progress achieved by different histories.
To address this issue, inspired by the product MDP (baier2008), given the history, we deterministically track the index of the center of the visited r-ball region with the minimum distance to the destination. If none of the r-balls has been visited so far, the index is set to a default initial value. Then, the current state is embedded with the index as a product state, which is taken as the input of the policy. Note that we can treat the embedded component as the state of a deterministic automaton (baier2008). The relevant analysis can be found in Appendix A.2.
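A minimal sketch of this index-tracking rule (our own helper names; 2-D states as tuples, `-1` standing in for "no ball visited yet") is:

```python
import math

def update_index(k_prev, state, centers, radius):
    """Track the largest (i.e., closest-to-goal) r-ball index visited so far.
    `centers` lists the r-ball centers in order along the optimal path;
    `k_prev` is the previously tracked index (-1 if no ball visited yet).
    The update is deterministic and monotone: the index never decreases."""
    best = k_prev
    for i, c in enumerate(centers):
        if i > best and math.dist(state, c) <= radius:
            best = i
    return best
```

Because the output depends only on the previous index and the current state, the index component evolves as a deterministic automaton, which is what makes the reward below Markovian on the product state.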
Let $R$ denote the episodic reward function. We propose a novel scheme that assigns the Markovian reward with respect to the product state as:
(3) 
where the first value is a positive constant reward assigned for progression, the second is a boosted positive constant awarded when the robot reaches the destination, and the third is a negative constant reward assigned when the robot violates the safety requirement, i.e., collides with an obstacle. Note that if the robot crosses both an obstacle and an r-ball, it receives the negative reward, which has first priority. This setting does not restrict the selection of the r-ball radius parameter in implementations.
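The priority ordering of the scheme in (3) can be sketched as follows (the constants and names are illustrative placeholders, not the paper's exact values):

```python
def rrt_reward(k_prev, k_new, at_goal, in_obstacle,
               r_pos=1.0, r_goal=50.0, r_neg=-10.0):
    """Markovian reward on the product state: obstacle violation has first
    priority; reaching the destination earns the boosted reward; advancing
    to a new r-ball along the optimal path earns the positive constant."""
    if in_obstacle:          # safety violation dominates everything
        return r_neg
    if at_goal:              # boosted reward at the destination
        return r_goal
    if k_new > k_prev:       # progressed to an r-ball closer to the goal
        return r_pos
    return 0.0               # no progress this step
```

Because progress is measured by the tracked index rather than the raw state, the same state revisited without new progress earns nothing, which prevents the robot from farming rewards by oscillating inside one r-ball.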
Example 2.
Remark 1.
Since geometric RRT* does not consider the system dynamics, the optimal path may be infeasible for the robot to follow exactly under any policy. Thanks to our robust RRT* reward, the robot is not required to follow the optimal path strictly. Instead, to receive the positive rewards, the robot only needs to move toward the destination and pass by some of the r-balls along the optimal path.
Finally, applying the reward design (3), we can formally characterize the performance of the resulting policy as follows:
Theorem 1.
If Assumption 1 holds, by selecting the boosted goal reward to be sufficiently larger than the progression reward, any algorithm that optimizes the expected return is guaranteed to find the optimal policy satisfying the goal-reaching LTL task.
4.3 Deep Policy Gradient (DPG)
Based on Theorem 1, we apply advanced DRL methods to find the optimal policy. Consider a policy $\pi_{\theta}$ parameterized by $\theta$. Deep Q-learning (mnih2015human) is a DRL approach that aims to find the optimal policy by optimizing the parameters $\theta$ to maximize the expected discounted return. It learns the optimal policy by minimizing the loss function:
(4) 
where $\mathcal{B}$ is the replay buffer that stores experience tuples, $Q$ is the state-action value function with its own parameters, and the remaining term is the corresponding target value. To learn policies over continuous state-action spaces, this work focuses on DPG-based RL methods, which estimate the optimal expected return with respect to the policy based on sampled trajectories. For a stochastic policy, the gradient is estimated as follows:

(5)
where the expectation is taken over the state distribution. It is possible to extend the DPG framework to deterministic policies (silver2014deterministic). In particular, we formulate the deterministic gradient as
(6) 
The DPG theorem gives rise to several practical algorithms, leading to a variety of actor-critic methods (sutton2018reinforcement). As observed in (4), all DPGs, e.g., (5) and (6), rely on effective data in the replay buffer, or on the sample efficiency of the state distribution, to minimize the loss function. In this paper, the sampling-based reward improves the exploration performance of any actor-critic algorithm using noisy policies, due to its high reward density.
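To make the deterministic gradient in (6) concrete, the following toy sketch (our own 1-D construction, not from the paper) applies the DPG update $\nabla_\theta J = \mathbb{E}_s\!\left[\nabla_\theta \pi_\theta(s)\, \nabla_a Q(s,a)\big|_{a=\pi_\theta(s)}\right]$ with a linear policy $a = \theta s$ and a known critic $Q(s,a) = -(a-s)^2$, whose maximizer is $a = s$:

```python
def dpg_step(theta, states, lr=0.1):
    """One deterministic policy gradient ascent step on a toy problem:
    policy a = theta * s, critic Q(s, a) = -(a - s)**2 (maximized at a = s)."""
    grad = 0.0
    for s in states:
        a = theta * s
        dq_da = -2.0 * (a - s)   # gradient of Q w.r.t. the action
        dpi_dtheta = s           # gradient of the policy w.r.t. theta
        grad += dpi_dtheta * dq_da
    grad /= len(states)
    return theta + lr * grad

theta = 0.0
for _ in range(200):
    theta = dpg_step(theta, [0.5, 1.0, 1.5])
# theta converges toward 1.0, recovering the optimal policy a = s
```

In practice the critic is learned via (4) rather than given in closed form, but the update structure — chain-ruling the critic's action gradient through the policy — is exactly the one in (6).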
Theorem 2.
If Assumption 1 holds, by selecting the boosted goal reward to be sufficiently larger than the progression reward, a suitable DPG algorithm that optimizes the expected return finds the optimal parameterized policy satisfying the LTL task in the limit.
5 LTL Task Satisfaction
This section extends the approach of Sec. 4 to satisfy the general LTL task of Problem 2. Sec. 5.1 describes how to generate the optimal path, and Sec. 5.2 explains how to integrate distributed DPGs to learn the optimal policy.
5.1 Geometric TL-RRT*
Since the dynamic system is unknown, we define a transition system over the geometric space $\mathcal{W}$, referred to as a Geometric Weighted Transition System (GWTS).
Definition 1.
A GWTS is a tuple consisting of: the configuration space of the robot; the initial state of the robot; the geometric transition relation, under which a transition between two states exists if the straight line connecting them is collision-free; the set of atomic propositions labeling the regions; the labeling function, which returns the atomic propositions satisfied at a location; and the geometric Euclidean distance as the weight function.
Let $\tau$ denote a valid run of the GWTS. An LTL formula can be converted into a Nondeterministic Büchi Automaton (NBA) to verify its satisfaction.
Definition 2.
(vardi1986automata) An NBA over $2^{\Pi}$ is a tuple $B = (Q, Q_0, \Sigma, \delta, F)$, where $Q$ is the set of states, $Q_0 \subseteq Q$ is the set of initial states, $\Sigma = 2^{\Pi}$ is the finite alphabet, $\delta: Q \times \Sigma \to 2^{Q}$ is the transition relation, and $F \subseteq Q$ is the set of accepting states.
A valid infinite run of an NBA is called accepting if it intersects the set of accepting states infinitely often. Infinite words generated from an accepting run satisfy the corresponding LTL formula. An LTL formula can be converted into an NBA using the tool of (gastin2001fast). As in (kantaros2020stylus), we can prune the infeasible transitions of the resulting NBA to obtain a truncated NBA.
Definition 3.
Given the GWTS and the NBA, the product Büchi automaton (PBA) is the tuple consisting of: the set of (infinitely many) product states; the set of initial states; the transition relation, under which a product transition exists when the geometric transition is valid and the automaton transition is enabled by the label of the target location; the set of accepting product states, whose automaton components are accepting; the cost function, defined as the cost in the configuration space, e.g., the Euclidean length of the geometric transition; and the labeling function inherited from the GWTS.
A valid trace of a PBA is called accepting if it visits the accepting set infinitely often, which is referred to as the acceptance condition. Its accepting words satisfy the corresponding LTL formula. Given an accepting trace, its projection onto the workspace yields a geometric trajectory that satisfies the LTL formula; more details are presented in (baier2008). Therefore, the planning objective is to find an accepting path of the PBA with minimum accumulated configuration cost.
However, the state-spaces of the GWTS and the PBA are both infinite. Consequently, we cannot apply a graph-search method to a PBA with infinitely many states. Thanks to the abstraction-free TL-RRT* algorithm (luo2021abstraction), we can instead incrementally build trees that explore the product state-space and find a feasible optimal accepting path. The procedure applies the sampling-based method over the PBA, and is inspired by the fact that an accepting run is a lasso-type sequence with a prefix-suffix structure, where the prefix part is executed only once and the suffix part is repeated infinitely often.
Following this idea, we build trees for the prefix and suffix paths, respectively. To satisfy the acceptance condition, the set of goal states of the prefix tree consists of the accepting product states whose geometric components lie in the collision-free configuration space. After obtaining the prefix tree, we compute the optimal prefix path reaching such a goal state from the root. The suffix tree is then built by treating that goal state as its root; its goal states collect all product states that can reach the root state via one transition, which ensures a feasible cyclic path matching the suffix structure. Finally, we search for the optimal suffix path.
(7) 
(8) 
5.2 Distributed DPGs
In this section, we first employ the optimal geometric path obtained in Sec. 5.1 to propose a distributed reward scheme. Since the policy gradient strategy suffers from high variance and only finds the optimal policy in the limit (see Theorem 2), instead of directly applying the reward design (3) to the whole path, we decompose it into subtasks. To do so, we modularly divide the optimal lasso-type trajectory into separate consecutive segments, each of which shares the same automaton component, i.e., all states of each sub-trajectory have the same automaton component. Each segment can then be treated as a collision-free goal-reaching problem, whose goal region is the labeled region containing the segment's final geometric state. An example of the optimal decomposition can be found in Fig. 3. For the cl-MDP, we treat each segment as a goal-reaching task solved as in Sec. 4.2.
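The segmentation rule — splitting the product-state path wherever the automaton component changes — can be sketched as follows (the function name and the list-of-tuples representation are ours):

```python
def decompose(product_path):
    """Split a product-state path [(q, x), ...] -- automaton component q,
    geometric state x -- into consecutive segments sharing the same q.
    Each segment becomes one collision-free goal-reaching subtask."""
    segments = []
    for q, x in product_path:
        if segments and segments[-1][0] == q:
            segments[-1][1].append(x)   # same automaton component: extend
        else:
            segments.append((q, [x]))   # component changed: new subtask
    return segments
```

Each returned segment keeps its automaton component alongside its geometric sub-trajectory, so a dedicated DPG (with its own RRT* reward) can be assigned to each one and trained in parallel.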
In particular, we propose a collaborative team of RRT* rewards of the form (3), one for each subtask, and assign a distributed DPG to each subtask, trained in parallel. After training, we concatenate the learned policies of all subtasks to obtain the global optimal policy. The overall procedure is summarized in Alg. 1, and an illustrative diagram with rigorous analysis is presented in Appendix A.4.
Remark 2.
Applying sampling-based methods or reachability-based control synthesis for LTL satisfaction has been investigated in (vasile2020reactive; kantaros2020stylus; srinivasan2020control; luo2021abstraction). These works all assume a known dynamic system. In contrast, treating the system as an unknown black box from an AI perspective, we use the geometric RRT* path to synthesize dense rewards and apply distributed DPGs to learn the optimal policies.
6 Experimental Results
We evaluate the RRT* reward and the distributed DPGs by applying the framework to satisfy goal-reaching and general LTL tasks, respectively, where the former is a special case of the latter. The evaluation focuses on large-scale, complex environments that generalize the simple ones, e.g., (lowe2017multi), to illustrate the performance. We show that our algorithm improves the success rates of task satisfaction over both infinite and finite horizons in complex environments, and significantly reduces training time for tasks over finite horizons. Detailed descriptions of the environments and LTL tasks are given below.
Baseline Approaches: We refer to our distributed framework as "D-RRT*" or "RRT*", and compare it against three baselines. (i) The TL-based rewards in (hasanbeig2020deep; Cai2021modular), referred to as "TL", for single LTL tasks over infinite horizons, have shown excellent performance in non-complex environments and generalize the finite-horizon cases in the existing literature (Li2019; vaezipoor2021ltl2action; icarte2022reward). (ii) Similar to (qin2021learning), for the goal-reaching task, the baseline referred to as "NED" uses the negative Euclidean distance between the robot and the destination, plus the boosted goal-reaching value, as the reward. (iii) For a complex LTL task, the baseline referred to as "G-RRT*" directly applies the reward scheme (3) to the global trajectory instead of decomposing it. For fair comparison, the baseline "NED" only records the boosted positive reward and the negative rewards of obstacle collisions.
6.1 Goal-Reaching

We first consider LTL task satisfaction of the goal-reaching form in the environment of Fig. 3. We implement all baselines with both DDPG and PPO. We run a fixed number of episodes, each with a maximum number of steps. The learning results are shown in Fig. 2. We observe that "TL", with point-to-point sparse rewards, and "NED", which minimizes the Euclidean distance, perform poorly in complex environments due to exploration issues. With the RRT*-based reward, our framework can be integrated with either the off-policy DPG or the on-policy DPG. The consecutive r-balls of rewards in the configuration space make PPO perform better in Fig. 2, since they allow efficient evaluation of the current policies.
6.2 LTL Satisfaction

For general LTL tasks, we first show the decomposition process for an LTL task over infinite horizons, which requires the agent to visit the labeled goal regions infinitely often. The resulting truncated NBA and the decomposed trajectories of TL-RRT* are shown in Fig. 3, where the decomposed sub-paths are such that distributed DPGs can train the optimal sub-policies for each of them in parallel.
Considering more complex LTL formulas, we use the environment shown in Fig. 3 (b). We increase the complexity by randomly sampling obstacle-free goal regions and setting a richer specification. For all tasks over infinite horizons, we apply DDPG to compare our framework with the baselines "TL" and "G-RRT*". The results are shown in Fig. 4; we observe that the "TL" method is sensitive to the environments, and that when the optimal trajectories become complicated due to the complexity of the LTL tasks, "G-RRT*" easily converges to suboptimal solutions.
The finite-horizon forms of the tasks are obtained by removing the "always" operator from the recurrence part of each formula. The results show that our framework generalizes LTL satisfaction over both finite and infinite horizons.
6.3 Performance Evaluation

We statistically run repeated trials of the learned policies for each subtask and record the success rates of satisfaction for all global tasks. The results are shown in Table 1 of Appendix A.5. We see that in complex environments, the success rates of all other baselines are low, while our method achieves high success rates; the performance is thus significantly improved.
Since our algorithm learns to complete the task faster, each episode terminates earlier during learning for tasks over finite horizons. To illustrate the efficiency, we repeat the training process several times for the finite-horizon tasks in their corresponding complex environments, and record the average training time compared with the modular "TL" baseline (Cai2021modular). The results in Fig. 4 (d) show that our method improves the learning efficiency and reduces the training time. In practice, we can train each local DPG simultaneously on distributed computing resources for complicated global tasks to alleviate the learning burden.
7 Conclusion
Applying DPG algorithms to complex environments produces vastly different behaviors and often fails to complete complex tasks. A persistent problem is the exploration phase of the learning process, which limits applications to real-world robotic systems. This paper provides a novel path-planning-based reward scheme to alleviate this problem, significantly improving reward density and generating optimal policies that satisfy complex specifications in challenging environments. To facilitate rich high-level specifications, we develop an optimal decomposition strategy for the global LTL task, allowing all subtasks to be trained in parallel and optimizing efficiency. The main limitation of our approach is its reliance on geometric motion-planning algorithms, which limits its application to some systems. In the future, we aim to achieve safety-critical requirements during learning and to investigate multi-agent systems.
References
Appendix A Appendix (Supplementary Materials)
A.1 Summary of Geometric RRT*
Before discussing the algorithm in detail, it is necessary to introduce a few algorithmic primitives, as follows:
Random sampling: The sampling function provides independent, uniformly distributed random samples of states from the geometric space.

Distance and cost: The distance function is the metric that returns the geometric Euclidean distance. The cost function returns the length of the path from the initial state to the input state.
Nearest neighbor: Given the set of vertices in the tree and a state, the nearest-neighbor function returns the vertex closest to that state under the distance metric.
Steering: Given two states, the steering function returns a state that lies on the geometric line connecting the first state to the second, with distance at most $\eta$ from the first, where $\eta$ is a user-specified parameter; the returned state must also make progress toward the second state. The function additionally returns the straight line connecting the first state to the returned state.
Collision check: A function that detects whether a state trajectory lies in the obstacle-free portion of the space.
Near nodes: Given the set of vertices in the tree and a state, the near-nodes function returns the set of vertices within a threshold distance of that state, where the threshold shrinks with the number of vertices $n$ in the tree at a rate of order $(\log n / n)^{1/d}$, with $d$ the dimension of the configuration space and a constant factor.
Optimal path: Given two states in the tree, the function returns the locally optimal trajectory between them.
Alg. 2 proceeds as follows. First, the tree is initialized with the initial state as its only vertex and no edges (line 1). Then a state is sampled from the geometric space (line 3), the nearest node in the tree is found (line 4) and extended toward the sample, yielding a new state together with the straight line connecting them (line 5). If the line is collision-free (line 6), the algorithm iterates over all near neighbors of the new state and finds the neighbor through which the new state is reached with the lowest cost (lines 7-15). The tree is then updated with the new state (lines 16-17), and the algorithm rewires the near nodes using Alg. 3 (line 18). Alg. 3 iterates over the near neighbors of the new state and updates the parent of a near node to the new state if the cost of reaching that node through the new state is less than its current cost.
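Under the primitives above, a minimal, self-contained 2-D sketch of the RRT* loop (our own simplified implementation: uniform sampling, no obstacles, and a fixed near-radius of $2\eta$ in place of the shrinking radius) is:

```python
import math, random

def rrt_star(start, goal, bounds, eta=1.0, goal_r=0.5, iters=2000, seed=0):
    """Minimal 2-D geometric RRT* (no obstacles), following the steps of
    Alg. 2: sample, find nearest, steer, choose lowest-cost parent, rewire.
    `bounds` is ((xmin, xmax), (ymin, ymax))."""
    rng = random.Random(seed)
    nodes = [start]
    parent = {0: None}
    cost = {0: 0.0}

    def near(i_new, r):
        # all vertices within radius r of nodes[i_new], excluding itself
        return [i for i in range(len(nodes))
                if i != i_new and math.dist(nodes[i], nodes[i_new]) <= r]

    for _ in range(iters):
        x_rand = (rng.uniform(*bounds[0]), rng.uniform(*bounds[1]))
        i_nearest = min(range(len(nodes)),
                        key=lambda i: math.dist(nodes[i], x_rand))
        d = math.dist(nodes[i_nearest], x_rand)
        if d == 0.0:
            continue
        t = min(1.0, eta / d)  # steer: move at most eta toward the sample
        x_new = (nodes[i_nearest][0] + t * (x_rand[0] - nodes[i_nearest][0]),
                 nodes[i_nearest][1] + t * (x_rand[1] - nodes[i_nearest][1]))
        nodes.append(x_new)
        i_new = len(nodes) - 1
        # choose the parent that yields the lowest cost-to-come (lines 7-15)
        candidates = near(i_new, 2 * eta) or [i_nearest]
        best = min(candidates,
                   key=lambda i: cost[i] + math.dist(nodes[i], x_new))
        parent[i_new] = best
        cost[i_new] = cost[best] + math.dist(nodes[best], x_new)
        # rewire: adopt x_new as parent wherever it shortens a path (Alg. 3)
        for i in near(i_new, 2 * eta):
            c = cost[i_new] + math.dist(x_new, nodes[i])
            if c < cost[i]:
                parent[i] = i_new
                cost[i] = c
    # extract the lowest-cost path into the goal ball, if one exists
    goal_ids = [i for i, p in enumerate(nodes) if math.dist(p, goal) <= goal_r]
    if not goal_ids:
        return None
    i = min(goal_ids, key=cost.get)
    path = []
    while i is not None:
        path.append(nodes[i])
        i = parent[i]
    return path[::-1]
```

A full implementation would additionally perform the collision check on each steering segment and use the shrinking near-radius to obtain the asymptotic-optimality guarantee; the sketch only illustrates the control flow of Alg. 2 and the rewiring of Alg. 3.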
A.2 Analysis of Product State
In this section, we show that the product state can be used with RL algorithms and that (3) is a Markovian reward for the product state. Let the set of all sequential indexes of the r-balls of an optimal trajectory be given, and define a function that returns the tracked index at each time step, whose output satisfies the stated requirements. Consequently, given an input, this function generates a deterministic output, so that we can regard the index component as a deterministic automaton without accepting states (baier2008). The product state is derived from a product structure defined as:
Definition 4.
The product between the cl-MDP