I Introduction
Motion planning is a well-known fundamental challenge in mobile robotics [3]. Depending on the dynamic nature of the environment and the physical constraints that are considered, it quickly becomes complex. Moreover, in many real-world applications mobile robots face highly dynamic environments and frequent changes in their immediate surroundings that require quick reaction times.
Consider the soccer application depicted in Fig. 1. At least one player (blue) tries to score a goal while at least one mobile robot (orange) defends. A real-time locating system (RTLS) tracks the ball, the player kinematics, and the mobile robots, and delivers a stream of their positions [5, 18]. A high-level tactical analysis monitors all trajectories and generates a stream of target positions per robot so that the goal is best shielded at any time. As inertia limits the robots' movement flexibility, we need to keep them constantly moving. Therefore, the tactical analysis monitors the players' movement patterns and also generates appropriate orientations and velocities for the prospective target positions. Together these describe a target state that is then used to estimate a trajectory for the robot (even before it reaches its current target state).

Sampling methods [12] estimate such trajectories by reducing the continuous control problem to a discrete search problem. But the reduction in complexity comes with a loss in optimality and completeness of the solution, which in turn can lead to unsatisfying results in such highly dynamic environments [12]. While it is possible to find such a trajectory through analytical methods [16], it becomes significantly more challenging if we also consider the kinematic and dynamic side constraints. This leads to a nonlinear optimization problem that can no longer be solved analytically fast enough for such interactive scenarios.
To address the full motion estimation problem we use recent advances in deep reinforcement learning and teach a robot to generate feasible trajectories that obey kinematic and dynamic constraints in real time. A simulation environment models the environmental and physical constraints of a real robotic platform. We generate random target states and let a deep reinforcement learning (RL) agent based on Deep Deterministic Policy Gradient (DDPG) [15] interact with the simulator on a trial-and-error basis to solve the motion planning problem within a continuous state and action space.
The remainder of this paper is structured as follows. Sect. II discusses related work and Sect. III provides background information. Next, we present our main contributions: we formalize the problem and describe the implementation of our reward system and RL agent in Sect. IV. Sect. V evaluates our approach with a deep RL agent within a simulation environment that also models the robot's and environmental physics, before Sect. VI concludes.
II Related Work
Trajectory planning [6] generates a sequence of control inputs based on a geometric path and a set of kinematic or dynamic constraints. The geometric path comes from path planning methods. Both together solve the motion planning problem. While the two are typically decoupled [3], some approaches solve the motion planning problem directly [3].
Sampling-based approaches discretize the action and state space and transform the underlying control problem into a graph search problem [14], which allows for the use of precomputed motion primitives [9, 20, 19]. Further extensions propose better-suited cost functions [13] for the search algorithms; e.g., state lattice planners [23, 26] sample the state space with a deterministic lattice pattern. Randomness [12] helps to reduce generated trajectories that are not relevant for the given search problem. While sampling-based approaches are not complete, they offer probabilistic completeness [10, 12], i.e., the probability to find a solution grows with the runtime. However, as sampling-based approaches discretize the state and action space, they require a high amount of computational resources [16] to account for all possible motions. In general, a higher number of motion primitives leads to a rapid growth in time complexity [17]. Realistic motion planners need thousands of such primitives and are therefore not suitable for interaction [17].

Interpolation-based methods use way points to compute a path with higher trajectory continuity. In the simplest case they interpolate with lines and circles. More elaborate approaches use clothoid curves [1], polynomial curves [7], and Bézier curves [21]. Clothoid curves allow for the definition of trajectories based on linear changes in curvature, since their curvature varies linearly with their arc length [1]. Polynomial curves are commonly used to also take side constraints into account, since their coefficients are defined by the constraints of their beginning and ending segments [7]. Bézier curves are defined by a set of control points and represent parametric curves [21]. However, the resulting trajectories are not necessarily optimal, and although these methods may generate smooth trajectories, they may not obey kinodynamic side constraints.

Numerical optimization extends both sampling- and interpolation-based algorithms. Kinematic and dynamic constraints require an additional optimization step. Numerical optimization generates trajectories based on a differentiable cost function and side constraints. Under convex cost and constraint functions [16] such methods find globally optimal trajectories [24]. They either optimize a suboptimal trajectory [4] or compute a trajectory on predefined constraints [25]. Such planners generate continuous trajectories by optimizing a function that considers planning parameters like position, velocity, acceleration, and jerk [25, 24]. However, the downside of the additional optimization step is the rise in time complexity. While a suboptimal trajectory can be computed quickly, an optimal solution is time consuming [11] and therefore less applicable for time-critical tasks.
Since we need a fast and frequent (re-)computation of (ideally near-optimal) trajectories that obey kinodynamic side constraints, existing methods are not sufficient. Instead, our deep reinforcement learning agent is trained offline and delivers trajectories satisfying all side constraints in real time, given a continuous action and state space.
III Background on Reinforcement Learning
The key idea behind reinforcement learning is an agent that interacts with the environment, see Fig. 2. In each time step t the agent receives an observation o_t, i.e., some (partial) information on the current state, and selects an action a_t based on the observation. The environment then rewards or punishes this action with a reward r_t.
III-A Markov Decision Process
A reinforcement learning task that satisfies the Markov property, i.e., that the probabilities of future states only depend on the current state s_t but not on previous events, is called a Markov Decision Process (MDP). With a set of states S, actions A, and rewards R, a controlled process with Markov dynamics at time t is defined by the transition probability

p(s_{t+1} | s_t, a_t),

a probability distribution that models the transition dynamics (as state transitions may often be probabilistic), and by

r(s_t, a_t, s_{t+1}),

the expected reward if we choose action a_t in s_t and end up in s_{t+1}.

In fully observed environments an observation o_t completely describes the underlying (real) state of the environment s_t. However, in partially observable environments the agent must estimate the real state based on o_t or a set of past and present observations. For simplification we consider a fully observable environment, i.e., we assume o_t = s_t.
III-B Reinforcement Learning
We use the state-value function V^π(s) to denote the value of state s under a given policy π, i.e., the expected total reward given that the agent starts from state s and behaves according to π. Accordingly, the value of a state is defined as the (expected) sum of discounted future rewards

V^π(s) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ],

where E_π denotes the expected value under π, t is any time step, and γ ∈ [0, 1] is the discount factor that favors immediate rewards over future rewards (and that also determines how far in the future rewards are considered). The goal of reinforcement learning lies in the maximization of the expected discounted future reward from the initial state.

The policy π(a|s) maps a state s to a probability distribution over the available actions a ∈ A. Since the future actions of the agent are defined by the policy and the state transitions by the transition dynamics, we can estimate the future reward that we can expect in any given state.

Similarly, we can define the value of taking action a in state s under a policy π as the expected return starting from s, taking the action a, and thereafter following policy π:

Q^π(s, a) = E_π[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ],

which is called the (state-)action-value function for policy π. Its recursive formulation describes the relationship between the value of a state and the value of its successor states and is often called the Bellman equation for Q^π.
Learning the optimal action-value function is one of the central goals of reinforcement learning. Popular methods are dynamic programming, on- and off-policy Monte Carlo methods, temporal-difference learning, and Q-learning.
III-C Deep Deterministic Policy Gradient (DDPG)
Value-based reinforcement learning such as Q-learning has poor convergence properties, as slow changes in the value estimate have a big influence on the policy. Policy-based reinforcement learning instead updates the policy directly, but often converges to local optima.
Actor-critic methods combine the best of both worlds, see Fig. 3. Instead of directly estimating the state-action-value function we use two separate components: while a critic estimates the value of state-action pairs, an actor takes a state and estimates the best action from within this state.
The combination of an actor-critic framework and deep neural networks allows the application of Q-learning to a continuous state and action space [15]. DDPG uses a parameterized actor function μ(s | θ^μ) and critic function Q(s, a | θ^Q), where θ^μ and θ^Q are the weight parameters of the neural networks that approximate these functions. The critic function is learned using the Bellman equation and standard backpropagation. As learning through neural networks in RL is unstable, DDPG uses the concept of target networks, i.e., μ' and Q', to ensure that the critic's network updates only change slowly. The target networks are initialized by copying the actor and critic networks and are then updated softly through

θ' ← τ θ + (1 − τ) θ', with τ ≪ 1.
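The soft target update above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: network weights are represented as flat lists of floats, and the value of τ is a placeholder.

```python
# Sketch of DDPG's soft target-network update, theta' <- tau*theta + (1-tau)*theta'.
# Weights are flat lists of floats for illustration only.

def soft_update(target_weights, source_weights, tau=0.001):
    """Move the target weights a small step toward the source weights."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_weights, source_weights)]

# Target networks start as exact copies of the trained networks ...
critic_weights = [0.5, -1.0, 2.0]
target_critic = list(critic_weights)

# ... and afterwards follow them only slowly, which stabilizes the
# targets used in the Bellman backup.
critic_weights = [1.5, 0.0, 2.0]   # pretend one training step happened
target_critic = soft_update(target_critic, critic_weights, tau=0.1)
print(target_critic)  # each entry moved 10% toward the new weights
```

With τ = 0.1 each target weight moves a tenth of the way toward its source weight per update; DDPG typically uses a much smaller τ so the targets drift slowly.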
To train the actor we apply the chain rule to the expected return from the start distribution with respect to the actor parameters θ^μ. To better explore the action space, we sample noise from a normal distribution and add it to the actor policy.
To allow for minibatch learning (learning from a batch of transitions) and off-policy learning, and to approximately satisfy the i.i.d. assumption (independent and identically distributed samples), DDPG uses a replay buffer that stores transitions after each interaction, from which we later sample for training.
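A minimal replay buffer of this kind can be sketched as follows (an assumed sketch, not the authors' implementation): transitions are appended after every interaction and sampled uniformly, which breaks the temporal correlation of consecutive steps.

```python
# Minimal replay buffer for off-policy minibatch learning.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity):
        # a bounded deque: once full, the oldest transitions fall out
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform sampling approximates i.i.d. minibatches
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=1000)
for t in range(50):
    buf.add(t, 0.0, -1.0, t + 1, False)
batch = buf.sample(8)
print(len(batch))  # -> 8
```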
IV Method
IV-A Kinematic Model
Although our application assumes a differential drive mobile robot, we model the robot by a unicycle with a clearer physical interpretation, as both kinematics are equivalent [22]. Hence, its configuration is given by q = (x, y, θ), with wheel orientation θ and the contact point of the wheel with the ground (x, y). The kinematic model is described by

ẋ = v cos θ, ẏ = v sin θ, θ̇ = ω,

where v is the velocity and ω is the steering velocity.
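In a discrete-time simulation, these kinematics can be integrated with a simple Euler step, as sketched below (variable names and the step size are illustrative, not the paper's values):

```python
# Euler step of the unicycle model: x' = v cos(theta), y' = v sin(theta),
# theta' = omega.
import math

def unicycle_step(x, y, theta, v, omega, dt):
    """Integrate the unicycle kinematics for one time step dt."""
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += omega * dt
    return x, y, theta

# Driving straight along the x-axis for one step of 0.1 s at 1 m/s:
x, y, theta = unicycle_step(0.0, 0.0, 0.0, v=1.0, omega=0.0, dt=0.1)
print(x, y, theta)  # -> 0.1 0.0 0.0
```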
IV-B Dynamic Constraints
We also consider physical limitations known from real robot platforms. We limit the linear acceleration a and the angular acceleration α by

|a| ≤ a_max, |α| ≤ α_max

in each time step. As we need to prevent the robot from tipping over, we also include a centrifugal force limit

|v · ω| ≤ c_max.

We also restrict the steering to a maximum velocity v_max and a maximum angular rate ω_max.
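One way to enforce such constraints per time step is sketched below. This is our reading, not the paper's implementation: accelerations are clipped to their bounds, and the velocity is reduced whenever the centrifugal term |v · ω| would exceed its limit. All bound values are placeholders.

```python
# Hedged sketch of per-step constraint enforcement. Bound values are
# placeholders, not the paper's.
import math

A_MAX = 2.0        # max linear acceleration (placeholder)
ALPHA_MAX = 3.0    # max angular acceleration (placeholder)
CF_MAX = 4.0       # max centrifugal term |v * omega| (placeholder)

def clip(value, limit):
    return max(-limit, min(limit, value))

def apply_constraints(v, omega, a, alpha, dt):
    # clip the commanded accelerations to their bounds
    a = clip(a, A_MAX)
    alpha = clip(alpha, ALPHA_MAX)
    v += a * dt
    omega += alpha * dt
    # scale the velocity down if the turn would tip the robot over
    if abs(v * omega) > CF_MAX and omega != 0.0:
        v = math.copysign(CF_MAX / abs(omega), v)
    return v, omega

# Commanded acceleration 10 m/s^2 is clipped to 2 m/s^2; the resulting
# v = 3.2 m/s at omega = 2 rad/s violates the centrifugal limit and is
# reduced to 2.0 m/s.
v, omega = apply_constraints(v=3.0, omega=2.0, a=10.0, alpha=0.0, dt=0.1)
```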
IV-C Problem Formulation
In accordance with our kinematic model we describe a state by s = (x, y, v, θ), where x and y are the robot's Cartesian coordinates, v its linear velocity, and θ its forward orientation.
To formalize our motion planning problem, consider Fig. 4. The robot's task is to reach a goal state s_g from a start state s_0. We consider a control-affine nonlinear dynamic system with drift dynamics and stochastic effectiveness of control inputs. Both are unknown, but are assumed to be locally Lipschitz functions, i.e., their rate of change is locally bounded. We also assume the control effectiveness function to be bounded.
Our RL agent receives a new observation o_t and a reward r_t per time step for the current and past actions from our environment. We define an observation by o_t = (d, φ, v_err, θ_err, v, ω), where (d, φ) is the polar coordinate of the goal position in the robot's frame, v_err is the residual of the current to the goal velocity, θ_err is the residual of the (global) current to the goal orientation, and v and ω are the actual velocity and rotation rate of the robot. The latter two ensure a fully observable environment for our agent. (They can also be omitted if we add a memory to the agent, e.g., with a recurrent neural network [8], or if we stack some past observations together.)

We relax the motion planning constraints and allow the robot to reach the goal state both forwards and backwards as long as its motion vector is correct. The residual of the velocity is therefore computed with respect to both driving directions, and afterwards we normalize the orientation residual θ_err to (−π, π].
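Computing such an observation from the current and goal states can be sketched as follows. The state layout (x, y, v, θ) and the angle-wrapping convention are assumptions for illustration:

```python
# Sketch of building the observation: goal position in polar coordinates
# of the robot's frame, velocity and orientation residuals, plus the
# robot's own velocity and rotation rate.
import math

def wrap_angle(a):
    """Normalize an angle to (-pi, pi]."""
    return -((-a + math.pi) % (2.0 * math.pi) - math.pi)

def observation(state, goal, omega):
    x, y, v, theta = state
    gx, gy, gv, gtheta = goal
    dx, dy = gx - x, gy - y
    # polar coordinates of the goal in the robot's local frame
    dist = math.hypot(dx, dy)
    bearing = wrap_angle(math.atan2(dy, dx) - theta)
    v_err = gv - v                       # velocity residual
    theta_err = wrap_angle(gtheta - theta)  # orientation residual
    return (dist, bearing, v_err, theta_err, v, omega)

# Robot at the origin facing +x with 0.5 m/s; goal at (1, 1) with
# 1 m/s and orientation pi/2:
obs = observation(state=(0.0, 0.0, 0.5, 0.0),
                  goal=(1.0, 1.0, 1.0, math.pi / 2), omega=0.0)
```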
Our simulation environment returns an immediate reward r_t that is based on the error d_t, the Euclidean distance between the current state s_t and the goal state s_g.
We end an episode if either the error is lower than a threshold or if we reach a maximum number of steps. In the former case, the agent receives an additional fixed reward r_goal at the last step to enforce this behavior, as the immediate reward alone does not tell the agent how time-efficient its trajectory has been.
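One possible reading of this reward scheme is sketched below. The exact functional form of the immediate reward is our assumption (negative distance to the goal); the threshold, step limit, and bonus are placeholder values:

```python
# Assumed reward scheme: negative distance as immediate reward, plus a
# fixed terminal bonus R_GOAL on success. All values are placeholders.

R_GOAL = 10.0      # terminal bonus (placeholder)
THRESHOLD = 0.05   # success threshold on the error (placeholder)
MAX_STEPS = 300    # episode step limit (placeholder)

def reward_and_done(error, step):
    if error < THRESHOLD:
        # success: the bonus rewards reaching the goal at all, while the
        # discounting of future rewards rewards reaching it quickly
        return -error + R_GOAL, True
    if step >= MAX_STEPS:
        return -error, True   # timeout: immediate reward only
    return -error, False

r, done = reward_and_done(error=0.01, step=42)
print(r, done)  # -> 9.99 True
```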
Fig. 5 shows how a discounted reward (orange line) helps to score for time-optimality. The closer a state is to our final goal state s_g, the higher is the discounted reward for the agent. This encourages the agent to move closer to s_g at a later point. The orange curve is mainly influenced by the final reward r_goal if the agent has reached the goal state s_g. If the agent does not reach the goal state, the orange curve uses the discounted immediate reward, which is much smaller. However, this smaller reward initially helps the agent to find the goal state in the search space.
V Evaluation
We use a simulator (Sect. V-A) to train our RL agent (Sect. V-B). We evaluate the optimization of our agent during training (Sect. V-C) before we show the efficiency of our RL-based approach (Sect. V-D). We also evaluate our approach in a complex use-case scenario (Sect. V-E).
V-A Simulation and Training Setup
To train and test our RL-based motion planner we implemented the environment, i.e., the kinodynamic constraints, the kinematic model, our reward system, and the action, observation, and state scheme, within the OpenAI Gym framework [2]. We implemented a discrete form with a fixed step size and applied the kinematic constraints from Sect. IV-B, i.e., bounds on the velocity, angular rate, accelerations, and centrifugal force.
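A skeleton of such an environment, in the spirit of the OpenAI Gym reset/step API but written without the gym dependency so it stays self-contained, might look as follows. Dynamics, bounds, and the reward are simplified placeholders for the scheme described in Sect. IV:

```python
# Gym-style environment skeleton (no gym dependency). All numeric
# values are placeholders, not the paper's settings.
import math
import random

class UnicycleEnv:
    DT = 0.05          # step size (placeholder)
    MAX_STEPS = 300    # episode limit (placeholder)

    def reset(self):
        # start at the origin facing +x with a random velocity;
        # the goal is sampled around the origin
        self.state = [0.0, 0.0, 0.0, random.uniform(0.0, 1.0)]  # x, y, th, v
        self.goal = [random.uniform(-3, 3), random.uniform(-3, 3)]
        self.steps = 0
        return self._obs()

    def _obs(self):
        x, y, th, v = self.state
        dx, dy = self.goal[0] - x, self.goal[1] - y
        # simplified observation: distance, bearing, velocity
        return (math.hypot(dx, dy), math.atan2(dy, dx) - th, v)

    def step(self, action):
        a, alpha = action                      # linear/angular accel.
        x, y, th, v = self.state
        v += a * self.DT
        th += alpha * self.DT
        x += v * math.cos(th) * self.DT
        y += v * math.sin(th) * self.DT
        self.state = [x, y, th, v]
        self.steps += 1
        obs = self._obs()
        done = obs[0] < 0.05 or self.steps >= self.MAX_STEPS
        return obs, -obs[0], done, {}          # obs, reward, done, info

env = UnicycleEnv()
obs = env.reset()
obs, reward, done, info = env.step((0.5, 0.0))
```

The agent interacts with such an environment purely through `reset` and `step`, which is what makes the trial-and-error training loop generic.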
To let our agent learn, we generate random pairs of start and goal states. We transform them such that the start state is at the origin with its initial orientation pointing along the positive x-axis, and sample its velocity from a uniform distribution. We generate the goal state in relation to this local frame, with a position sampled uniformly within a maximum distance to the origin. We also sample the goal orientation and velocity uniformly. Note that we allow the agent to reach the goal state backwards or forwards, i.e., the orientation then turns by π with a negative velocity. The agent gets a fixed reward r_goal if it reaches the final goal state within a threshold. We used a grid search to tune r_goal. We end an episode if the agent needs more than the maximum number of steps to reach the goal. In this case, the last reward is the immediate reward r_t.
For both training and testing, the agent and the simulator run on a desktop machine equipped with an Intel Core i7-7700 CPU @ 3.60GHz (4 cores, 8 threads), 32GB memory, and an Nvidia GeForce GTX 1070 with 8GB memory. We implemented all our algorithms in Python.
V-B Training Parameters of the RL Agents
Our DDPG agent uses two separate networks for the actor and the critic that have a similar design. Both use an input layer, three fully connected hidden layers (with 200 neurons each), and an output layer. The hidden layers are necessary to approximate the nonlinearities in the value function. Both the actor and the critic receive the observation vector o_t at the input layer. The actor output is additionally connected to the second hidden layer of the critic. We use a linear activation for the critic's output layer and a tanh activation in all other cases. We initialize all biases with a constant value and the weights of both the critic and the actor by sampling from normal distributions. For the training using backpropagation we use the ADAM optimizer. The exploration in our DDPG agent uses a simple ε-greedy scheme. We evaluated combinations of the following hyperparameters with a grid search to find the best setting for the training: the learning rates of the actor and the critic, the discount factor γ, the batch size (which had no influence on the result and mainly affects the training time), the exploration probability ε, the exploration variance σ, the memory size, and the soft update factor τ.

For the test we set the exploration probability to zero to fully exploit the policy. While the immediate output of the actor is a greedy action, we select a non-greedy action by sampling from a normal distribution with the greedy action as its mean: a_t ∼ N(μ(o_t), σ²). The variance σ in training defines the randomness of the action: with a higher variance we are more likely to choose random actions.
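This non-greedy action selection can be sketched in a few lines (the variance schedule and action bounds are assumptions for illustration):

```python
# Non-greedy action selection: sample around the actor's greedy action
# from a normal distribution and clip to the valid action range.
import random

def explore(greedy_action, sigma, low=-1.0, high=1.0):
    noisy = random.gauss(greedy_action, sigma)   # mean = greedy action
    return max(low, min(high, noisy))            # clip to action bounds

random.seed(0)
a_train = explore(0.3, sigma=0.5)   # training: random around the mean
a_test = explore(0.3, sigma=0.0)    # testing: sigma = 0 -> greedy action
print(a_test)  # -> 0.3
```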
V-C Optimization during Training
To understand how our agent optimizes the motion planning, we train the agent in the 4D state space and analyze the error in the particular dimensions over the training episodes.
Fig. 6 shows the distance/error of the 4D agent per state dimension, i.e., the positional distance, the angular distance, and the velocity distance, over the training episodes. As we here only focus on the optimization of each individual dimension, we scale the y-axis to show the results in percentages of the respective maximal distances. The episodes that the agent ran until the dashed line at episode #250 are (until that point) only used to fill the replay memory. They are generated using the randomly initialized policy.
Afterwards, the agent samples from the replay memory and starts training with large absolute positional, angular, and velocity errors. The agent optimizes for the positional error first: it drops sharply after only 500 episodes, i.e., 250 episodes into training. The optimization of the angular error comes with a little delay: it drops at around episode 1,000. The agent optimizes for the velocity last: its error drops after 1,700 episodes. Interestingly, velocity and angular distance compete for a minimal error until the end. Note that the errors are higher during training, as the agent applies a non-greedy behavior.
V-D Accuracy Results
The agent's task is to reach a goal state with time- and way-efficient controls. However, as the 4D problem is complex, we break it down into subproblems on which we analyze our agent's behavior.
We separately train an agent and evaluate its accuracy within the 2D (only position), 3Da (position and orientation), 3Db (position and velocity), and 4D (complete goal state) problems. Hence, we trained four agents using the same reward function for 4,000 episodes each. The total training time was 45 minutes for all agents in parallel.
We later evaluated the performance of the agents with randomly sampled start and goal state pairs, see Sect. V-A. Fig. 7 shows the results for the four agents. For each agent we separately plot the errors in the different dimensions.
The positional error is lowest for the 2D problem and increases as the problem becomes more complex (3Da, 3Db, 4D). The mean angular error in the 3Da problem also increases a bit for the 4D problem. The velocity error behaves similarly: it increases from the 3Db to the 4D case. Please note that the plot shows the median error, which is a bit smaller.
What we do not see here is that the agents achieved high success rates (i.e., an error below a threshold of 0.5) on all four problems over a total number of 1,000 testing episodes. With increasing dimensions the problem becomes more complex and harder to solve. However, even the 4D agent solves the complex motion planning task in most of the cases and satisfies all kinodynamic side constraints at any time.
We also compared the time-efficiency of the generated trajectories to a cubic spline derived from a velocity ramp. The spline's velocity starts from the initial velocity in the start state, increases to the maximum possible velocity, and decreases to the target velocity at the latest point. This achieves maximal linear acceleration and maximal velocity by design. But as the spline interpolation does not consider kinematic constraints, most of the generated trajectories are not feasible. However, we use them to estimate a baseline for the duration of the generated trajectories.
We measured the mean and standard deviation of the duration ratio between the spline baseline and our agent for the different subproblems. Our RL approach generates faster trajectories than the spline-based approach in the 2D, 3Da, and 3Db cases, but the 4D agent's trajectories take longer. However, this is the trade-off between time-efficiency and kinodynamic side constraints, as the trajectories of the RL agent are feasible while the spline-based trajectories are not.

V-E Application Results
To evaluate our approach for the use case we initially introduced, we implemented a composite task that continuously updates the goal states, see Fig. 8. We generate four goal states a priori. The green vectors describe the states: the goal position is the start of the vector, the length describes the velocity (also labeled), and the vector's orientation the goal orientation of the robot at that state. The agents did not explicitly see this scenario or such a situation within their training phase (although similar combinations might have been included in the training episodes).
Fig. 8 shows the trajectories of the trained agents from Sect. V-D. The dotted line denotes the spline-based trajectory, which, again, is not guaranteed to satisfy the constraints at any time. We see that the lower dimensional 2D agent steers to the desired goal positions but generates unsteady trajectories near the goal states. Considering the introductory use case, a motion planner that generates trajectories only based on positions will fail to react fast enough (as it introduces in-place rotations and zero velocities, see the first goal point). The 3Da agent behaves much more smoothly and generates correct orientations, but fails at the last goal point, where the orientation is achieved by an in-place rotation. The 3Db agent is better and generates a smooth trajectory. However, it may still fail if the tactical analysis updates the goal state, which then might be at an unprofitable position.
The 4D agent, however, not only gets close to the spline baseline but also considers all kinodynamic constraints, which the spline violates between the third and fourth goal point. The agent achieves a smooth and way-efficient steering to the goal points only by exploiting the learned behavior.
VI Conclusion
This paper presents a motion planner based on deep reinforcement learning. In contrast to previous research in motion planning, our algorithm does not compute trajectory plans a priori or on demand but rather delivers a stream of control commands that we can use directly to steer a robot. Our approach combines a reward function, a state representation, and a framework that we used to train a Deep Deterministic Policy Gradient (DDPG) agent.
We evaluated our approach in a simulation environment and compared it to a spline interpolation. We also demonstrated the applicability of our approach in a dynamic use case that continuously provides updated goal states.
Acknowledgements
This work was supported by the Bavarian Ministry for Economic Affairs, Infrastructure, Transport and Technology through the Center for Analytics-Data-Applications (ADA-Center) within the framework of "BAYERN DIGITAL II".
References
 [1] (2014) Realtime approximation of clothoids with bounded error for path planning applications. IEEE Trans. Robotics 30 (2), pp. 507–515. Cited by: §II.
 [2] (2016) OpenAI gym. External Links: arXiv:1606.01540 Cited by: §VA.
 [3] (2005) Principles of robot motion: theory, algorithms, and implementation. MIT press. Cited by: §I, §II.
 [4] (2010) Path planning for autonomous vehicles in unknown semistructured environments. Intl. J. Robotics Research 29 (5), pp. 485–501. Cited by: §II.
 [5] (2018) Recurrent Neural Networks on Drifting TimeofFlight Measurements. In 9th Intl. Conf. Indoor Positioning and Indoor Navigation, Cited by: §I.
 [6] (2012) Trajectory planning in robotics. Mathematics in Comp. Sc. 6 (3), pp. 269–279. Cited by: §II.
 [7] (2010) Maneuverbased trajectory planning for highly autonomous vehicles on real road with traffic and driver interaction. IEEE Trans. Intelligent Transportation Systems 11 (3), pp. 589–606. Cited by: §II.
 [8] (2015) Memorybased control with recurrent neural networks. CoRR abs/1512.04455. Cited by: footnote 1.
 [9] (2008) State space sampling of feasible motions for highperformance mobile robot navigation in complex environments. J. Field Robotics 25 (67), pp. 325–345. Cited by: §II.
 [10] (2006) On the probabilistic foundations of probabilistic roadmap planning. Intl. J. Robotics Research 25 (7), pp. 627–643. Cited by: §II.
 [11] (2011) STOMP: stochastic trajectory optimization for motion planning. In IEEE Intl. Conf. Robotics and Automation, pp. 4569–4574. Cited by: §II.
 [12] (2001) Randomized kinodynamic planning. Intl. J. Robotics Research 20 (5), pp. 378–400. Cited by: §I, §II.
 [13] (2006) Planning algorithms. Cambridge University Press, New York, NY. Cited by: §II.
 [14] (2008) Anytime search in dynamic graphs. Artificial Intelligence 172 (14), pp. 1613–1643. Cited by: §II.
 [15] (2015) Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971. Cited by: §I, §IIIC.
 [16] (2018) Hierarchical trajectory planning of an autonomous car based on the integration of a sampling and an optimization method. Trans. Intel. Transp. Sys. 19 (2). Cited by: §I, §II, §II.
 [17] (2005) Current issues in samplingbased motion planning. In Proc. 11th Intl. Symp. Robotics Research, San Francisco, CA, pp. 36–54. Cited by: §II.

 [18] (2019) A Deep Learning Approach to Position Estimation from Channel Impulse Responses. Sensors 19 (5), pp. 1064:1–1064:23. Cited by: §I.
 [19] (2011) Kinodynamic motion planning with state lattice motion primitives. In Proc. IEEE/RSJ Intl. Conf. Intelligent Robots and Systems, San Francisco, CA, pp. 2172–2179. Cited by: §II.
 [20] (2009) Differentially constrained mobile robot motion planning in state lattices. J. Field Robotics 26 (3), pp. 308–333. Cited by: §II.
 [21] (2014) Dynamic trajectory generation using continuouscurvature algorithms for door to door assistance vehicles. In Proc. IEEE Symp. Intelligent Vehicles, Dearborn, MI, pp. 510–515. Cited by: §II.
 [22] (2009) Robotics–modelling, planning and control. Springer, London, UK. Cited by: §IVA.
 [23] (2015) State latticebased motion planning for autonomous onroad driving. Ph.D. Thesis, Freie Universität Berlin. Cited by: §II.
 [24] (2014) Trajectory planning for bertha—a local, continuous method. In Proc. IEEE Symp. Intelligent Vehicles, St. Louis, USA, pp. 450–457. Cited by: §II.
 [25] (2014) Making bertha drive—an autonomous journey on a historic route. IEEE Intelligent Transportation Systems Magazine 6 (2), pp. 8–20. Cited by: §II.
 [26] (2009) Spatiotemporal state lattices for fast trajectory planning in dynamic onroad driving scenarios. In Proc. IEEE/RSJ Intl. Conf. Int. Robots and Sys., St. Louis, MO, pp. 1879–1884. Cited by: §II.