SABER: Data-Driven Motion Planner for Autonomously Navigating Heterogeneous Robots

08/03/2021
by   Alexander Schperberg, et al.

We present an end-to-end online motion planning framework that uses a data-driven approach to navigate a heterogeneous robot team towards a global goal while avoiding obstacles in uncertain environments. First, we use stochastic model predictive control (SMPC) to calculate control inputs that satisfy robot dynamics and consider uncertainty during obstacle avoidance with chance constraints. Second, recurrent neural networks, trained on uncertainty outputs of various simultaneous localization and mapping algorithms, provide a quick estimate of the future state uncertainty considered in the SMPC finite-time horizon solution. When two or more robots are in communication range, these uncertainties are updated using a distributed Kalman filtering approach. Lastly, a Deep Q-learning agent serves as a high-level path planner, providing the SMPC with target positions that move the robots towards a desired global goal. Our complete methods are demonstrated on a ground and an aerial robot simultaneously (code available at: https://github.com/AlexS28/SABER).



I Introduction

The field of robotics has made remarkable progress in providing diverse sets of robotic platforms with different physical properties, sensor configurations, and locomotion capabilities (e.g., climbing, running, or flying). Thus, developing new planning algorithms that can be ubiquitously applied to a team of heterogeneous robots is a worthwhile endeavor, applicable to a wide range of tasks from search and rescue to space exploration. However, for multi-agent planners to be used in unknown and uncertain environments, they should consider complex robot dynamics and the uncertainty arising from imperfect exteroceptive and proprioceptive sensor measurements, update that uncertainty when robots come into communication range, avoid obstacle collisions, and realize the desired multi-agent behavior.

One common approach is to fully address some but not all of the described requirements while assuming the rest can either be satisfied in future work or combined with other existing methods. However, combining multiple approaches into a unified motion planning framework satisfying all requirements can be nontrivial, requiring close examination of the overall feasibility and performance of such a complex system. Thus, in this work, we examine the feasibility and performance of an end-to-end motion planning framework that addresses the above requirements, termed 'Stochastic model predictive control for Autonomous Bots in uncertain Environments using Reinforcement learning' (SABER).

Fig. 1: SABER framework.

SABER combines controls (stochastic model predictive control), vision (simultaneous localization and mapping), and machine learning (RNN and DQN), to provide local and globally optimized solutions in unknown and uncertain environments.

Summary of Our Contributions

(1) SABER is an end-to-end motion planning framework for a team of heterogeneous robots that unifies controls, vision, and machine learning approaches to plan paths that account for safety, optimality, and global solutions (our complete framework is shown on a UGV-UAV team).

(2) Cooperative localization is used when robots are in communication range and may involve both non-Gaussian and Gaussian measurement noise; each agent’s state uncertainty is modeled with a recurrent neural network (RNN) trained on the outputs of a simultaneous localization and mapping (SLAM) algorithm matched to that agent’s sensor configuration.

(3) Instead of simple heuristics when sampling the map for target positions, we employ Deep Q-learning (DQN) for high-level path planning, which is easily modifiable for learning desired multi-agent behavior and finds global solutions (DQN scalability for more than two robots is also evaluated).

Fig. 2: SABER Algorithm. This figure demonstrates the overall SABER planning algorithm in the testing phase, which can plan paths for one or more robots simultaneously. At timestep k, the environment provides information to robots that carry either a LiDAR or an RGB camera and IMU; for the LiDAR configuration, a particle-filter SLAM is implemented, while for the RGB configuration, Visual-Inertial Odometry SLAM (VIO SLAM) is implemented. The sensors provide either scans or distance-to-feature information as inputs to a recurrent neural network model, which outputs the propagation of state uncertainty for future timesteps. If two or more robots are within communication range, a distributed Kalman filter updates the current and future states and their uncertainties to a more accurate estimate. These updated states and uncertainties are used to update the chance constraints for obstacle avoidance. These constraints are then considered by a stochastic MPC controller, which follows a target position provided by a deep Q-learning (DQN) agent that aims to move the robot towards a global goal. The DQN uses the relative distances between the robots and the respective obstacles as its states, provides a target position for all robots as its actions, and is trained on several different maps with obstacles randomly distributed in each. Note that the SMPC, SLAM, and RNN components run on each robot individually, whereas the DQN runs on a centralized base (which may be on the robot itself).

II Related Works

Several works that examine planning for heterogeneous robots (typically composed of a two robot UGV-UAV team) have focused mainly on fusing different sensor data to build a unified global map [heter_3], integrating several components such as path planning, sensor fusion, mapping, and motion control towards a single framework [heter_1], or strictly analyzing multi-agent localization (i.e., multi-SLAM) [heter_2]. While we also consider a UGV-UAV team as done in the above works, here, we are more concerned with the feasibility (i.e., computation time) of such a complex system, and also in how uncertainty is not only estimated for robots with different sensor configurations, but how it’s tightly coupled with a local stochastic model predictive controller (SMPC) towards coordinating multi-agent behavior.

We also seek to address the major challenge for multi-agent (or even single-agent) planners, which is to estimate a path that is both safe (e.g., considers uncertainty in agent/obstacle avoidance) and fast (e.g., finding the shortest path to the goal). This is significant because conservative approaches to safety would lead to over avoidance or non-optimal solutions, and high-risk behavior may cause undesired collisions. Currently, few motion planners fully investigate this problem. For example, in [wurm_coordinating_2013], the cost of reaching a target position for each robot in a heterogeneous team depends on its individual characteristics (e.g., varying sensors, travelling speed, and payloads). However, by not considering uncertainty in their planner, their cost-to-go function can be significantly affected by disturbances. Conversely, a multi-agent planner that does consider probabilistically-safe motion planning can be found in [bajcsy_scalable_2019]. Still, their planner may lead to conservative solutions as they assume a worst-case behavior approach to safety. SABER addresses the problem of avoiding obstacles without over avoidance by using an RNN, which predicts and propagates future state uncertainty dynamically and does not make the assumption that uncertainty increases when no future measurements are received [Schperberg_rnn]. The downside with an RNN (which is typical with learning-based approaches) is that the accuracy of the uncertainty estimates is directly correlated with the quality of the data collection.

The geometric representation of obstacles is also critical for planning. For example, FASTER [tordesillas_faster_2019] is a decentralized and asynchronous planner where obstacles are represented as outer polyhedrals (estimated from convex decomposition) and applied as constraints into the optimization. In SABER, we also represent obstacles as polyhedral constraints for each timestep; however, we decompose them into a disjunction of linear chance constraints (thus, obstacle 'size and location' are a function of exteroceptive and proprioceptive uncertainty). Using chance constraints in motion planning is not new, and has been shown with success in single-agent planners such as [blackmore_chance-constrained_2011] and [risk_aware]. In this work, our chance constraints are also influenced by the cross-communication uncertainty of a heterogeneous robot team (see III-B). Additionally, while obstacle constraints can be explicitly used in the optimization, other works show a learning-based approach to avoid collisions by modeling the distribution of promising regions for travel [region_learning] or predicting the separation distance between the robot and its surroundings [chase_kew_neural_2021]. Our work is a hybrid approach, where we use the RNN to predict uncertainty of state estimates (which affects the 'size' of polyhedral obstacles), but still use these obstacles as constraints in our SMPC optimization. This choice sacrifices computation time, but may be more generalizable to environments not observed in training and prevent collisions.

Finding a suitable path to the goal also has a wide array of different solutions. Most commonly, sampling-based methods that do not consider workspace topology (such as grid-sampling [grid_sample] or the rapidly-exploring random tree (RRT) and its variants [gammell_informed_2014]) can lead to very dense roadmaps and may not scale well when the shortest path to the goal is desired. This issue has been investigated in [probabilistic_roadmaps], which uses a self-supervised learning approach to build sparse probabilistic roadmaps (PRM) for biased sampling (sampling only regions the robot is likely to safely travel). Moreover, using a learning-based approach for path planning also has the potential for integrating semantic behavior that can be gained from multi-agent coordination, as evidenced in [multi_agent_rendezvous] and [Baker2020Emergent]. Motivated by these works, we use a DQN for high-level planning, where simple modifications of a reward function can yield desired multi-agent behavior (e.g., rewarding agents based on proximity or reaching the goal concurrently). Nevertheless, the tradeoff of using a DQN compared to sampling-based methods is that it cannot guarantee asymptotic optimality (e.g., RRT-star) or probabilistic completeness (e.g., PRM). However, for our DQN, we are primarily concerned with feasibility relative to our application (i.e., finding a near-optimal path in a computationally efficient manner while satisfying multi-agent behavior).

1  Initialize robot state, global goal, prediction horizon, robot size, timestep k, state uncertainty, error to goal, and number of robots
2  while the error to the goal exceeds a tolerance do
3        if LiDAR configuration:
4            localize with particle-filter SLAM
5            predict future state uncertainties over the horizon with the RNN
6        if RGB camera configuration:
7            localize with VIO SLAM and predict future uncertainties with the RNN
8        if two or more robots are in communication range:
9            update states and uncertainties with the distributed Kalman filter
10       obtain the next target position from the DQN
11       update the chance constraints with the predicted uncertainties
12       solve the SMPC over the horizon and apply the first control input
13       increment k and recompute the error to the goal
14 end while
Algorithm 1 SABER

III Methods

The SABER framework contains both learning components (requiring data collection) and non-learning components (traditional control schemes). The non-learning components consist of an SMPC and a distributed Kalman filter, while the learning components consist of an RNN and a DQN agent. The RNN and DQN are trained separately and offline before being implemented into the overall system for online deployment (note that the RNN is supervised by the SMPC controller on each robot, while the DQN is not integrated with any other component during training). Overall, the algorithm is structured as an SMPC problem, which moves a robot toward a target location as formulated in III-A. Using state uncertainties and obstacle locations, obstacles are represented as chance constraints within the SMPC cost function (III-A1, III-A2). If two or more robots are in communication range, their state and uncertainty values are updated using a distributed Kalman filtering approach as described in III-B. To quickly propagate state uncertainties for future timesteps, we use different RNN models based on each robot’s sensor configuration, as explained in III-C. In III-D, we formulate a DQN approach, providing the SMPC with target locations that help generate trajectories moving the robots toward a global goal and preventing local-minima solutions. See Algorithm 1, Fig. 2, or the attached video (https://youtu.be/EKCCQtN5Z6A) for an overview of the methods, and IV-A for implementation details.
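To make the control flow concrete, the following sketch outlines one possible per-timestep loop in Python. All helper callables (slam_update, rnn_predict, dkf_update, dqn_target, smpc_solve, pairs_in_comm_range) are hypothetical placeholders standing in for the modules described in III-A through III-D, not the released implementation.

```python
# Hedged sketch of the SABER per-timestep loop (Algorithm 1 / Fig. 2).
# The callables passed in are hypothetical placeholders for the modules
# described in Sections III-A through III-D.

def saber_step(robots, goals, horizon,
               slam_update, rnn_predict, dkf_update, dqn_target, smpc_solve,
               pairs_in_comm_range):
    # 1. Localize each robot with its sensor-specific SLAM backend and
    #    predict how its state uncertainty propagates over the MPC horizon.
    for r in robots:
        r.state, r.cov = slam_update(r)            # PF SLAM (LiDAR) or VIO SLAM (RGB)
        r.future_covs = rnn_predict(r, horizon)    # Section III-C

    # 2. If robots can communicate, refine states/uncertainties jointly.
    for ri, rj in pairs_in_comm_range(robots):
        dkf_update(ri, rj)                         # Section III-B

    # 3. The high-level planner proposes the next target positions.
    targets = dqn_target(robots, goals)            # Section III-D

    # 4. Each robot solves its own SMPC with chance constraints built from
    #    the predicted uncertainties and applies only the first control input.
    for r, target in zip(robots, targets):
        u_seq = smpc_solve(r.state, r.future_covs, target, horizon)  # Section III-A
        r.apply_control(u_seq[0])
```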

III-A Stochastic Model Predictive Control Formulation

The goal of the cost function (equation (1)) is to find the optimal control inputs that minimize the deviation of the current and predicted states from a reference state or trajectory while under equality and inequality constraints, where the current state is given by the results of localization for each robot at the current timestep (the state-error and control-effort weighting matrices, along with the terminal weight, are described further below):

(1)
s.t. (2)
(3)
(4)

Obstacle Constraints:

(5)

Constraint (2) represents multiple shooting constraints, which ensure that the next state is equivalent to a time-invariant linear discretized model defined by the robot’s dynamic matrices. Constraint (3) captures uncertainty in the state as well as additive non-unit-variance random Gaussian noise. Note that the propagation of state uncertainty is received from an RNN model (Section III-C) and can be affected by cooperative localization (Section III-B) if multiple robots are in communication range at the current timestep.

Limits on state and control variables are imposed by constraint (4). For robustness [rawlings_model_2017], a terminal cost is included, with a weighting matrix that can be obtained by solving the discrete-time Riccati equation (6):

(6)
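As a concrete illustration, the terminal weight can be obtained numerically with a standard discrete-time algebraic Riccati solver. The sketch below uses SciPy; the A, B, Q, R values are illustrative placeholders (a 1D double integrator), not the paper's UGV/UAV matrices.

```python
# Hedged sketch: obtaining a terminal weighting matrix P from the
# discrete-time algebraic Riccati equation referenced by equation (6).
import numpy as np
from scipy.linalg import solve_discrete_are

dt = 0.1
A = np.array([[1.0, dt],
              [0.0, 1.0]])          # simple double-integrator dynamics (placeholder)
B = np.array([[0.5 * dt**2],
              [dt]])
Q = np.diag([10.0, 1.0])            # state-error weight (placeholder)
R = np.array([[1.0]])               # control-effort weight (placeholder)

# P satisfies  P = A'PA - A'PB (R + B'PB)^{-1} B'PA + Q
P = solve_discrete_are(A, B, Q, R)
print(P)                            # terminal cost term x_N' P x_N
```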

III-A1 Chance Constraints for Obstacle Avoidance

Constraint (5) represents chance constraints that enable obstacle avoidance subject to uncertainty in convex regions, as done in [blackmore_chance-constrained_2011]. This constraint can be rewritten as a disjunction of linear constraints for each obstacle:

(7)

where the disjunction ranges over the set of linear constraints belonging to each obstacle and over the set of timesteps in the MPC prediction horizon; each linear constraint is defined by the vector normal to its line, directed toward the robot state, together with the radius/size of the robot and a tightening term given by:

(8)

It is important to note that the degree of 'risk' can be controlled by changing the risk parameter associated with each obstacle: lower values lead to more evasive behavior (the robot moves further away from obstacles), while higher values lead to riskier behavior (the robot moves closer to obstacles).

If obstacles are assumed circular, we can use the following equation, where the constraint normal reduces to an identity vector applied to the robot’s center position, and only a single constraint value needs to be calculated per obstacle:

(9)
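For intuition, the sketch below shows the standard deterministic equivalent of a Gaussian linear chance constraint in the style of [blackmore_chance-constrained_2011]: the constraint boundary is tightened by a margin that grows with the state covariance (here supplied by the RNN). The exact form of equation (8) may differ in notation; the numbers are illustrative.

```python
# Hedged sketch of a Blackmore-style linear chance constraint.
# For a Gaussian state x ~ N(mu, Sigma), enforcing  a^T x <= b  with
# probability at least 1 - delta is equivalent to a tightened deterministic
# constraint  a^T mu <= b - c  (a standard form; notation may differ).
import numpy as np
from scipy.special import erfinv

def tightened_constraint(a, b, mu, Sigma, delta):
    """Return (lhs, rhs) of the deterministic equivalent a^T mu <= b - c."""
    c = np.sqrt(2.0 * a @ Sigma @ a) * erfinv(1.0 - 2.0 * delta)
    return a @ mu, b - c

# Example: position uncertainty inflates the margin to an obstacle face.
a = np.array([1.0, 0.0])          # constraint normal (keep x <= b)
b = 2.0                           # obstacle face location
mu = np.array([1.5, 0.0])         # mean robot position
Sigma = np.diag([0.05, 0.05])     # position covariance (e.g., from the RNN)
lhs, rhs = tightened_constraint(a, b, mu, Sigma, delta=0.05)
print(lhs <= rhs)                 # constraint satisfied with >= 95% confidence
```

Lowering delta enlarges the margin (more evasive behavior), while raising it shrinks the margin (riskier behavior), matching the discussion above.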

III-A2 Mixed-Integer Nonlinear Programming

To more effectively handle the disjunctive convex program for polygon-shaped obstacles introduced by (7), we can convert these constraints into a mixed-integer format (we assume the line constraints are in the x-y plane; the same equations can be used for the x-z and y-z planes, respectively):

(10)
(11)
(12)
(13)
(14)
(15)
(16)

where the slope and y-intercept define each line constraint belonging to an obstacle, the x and y positions are retrieved from the robot state, and the coordinates of the point on the line constraint closest to the robot are given by equations (11) and (12). The dist(*) function approximates the distance between the robot and one of the linear constraints of an obstacle, as shown in equation (13). Equation (14) returns a positive distance if the robot is 'outside' the obstacle boundary and a negative distance if the robot is 'inside' the obstacle boundary (a negative distance would cause the line constraint to fail). By definition of constraint (10), only one or more of the line constraints need to be satisfied per obstacle, which is ensured by using binary integer variables under constraints (15) and (16) (e.g., if the binary variable of a line constraint equals one, the robot is outside this obstacle). For simplicity, we assume a 'perfect' object detection system: if the robot is close enough to an obstacle, the obstacle is automatically 'seen' and embedded into the SMPC cost function.
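The per-line geometry behind equations (11)-(14) can be sketched as below; this only evaluates the closest point, signed distance, and the "at least one line satisfied" disjunction numerically, whereas the actual SMPC encodes the disjunction (constraints (10), (15), (16)) with binary variables inside the MINLP solver. Line parameters and the example obstacle are illustrative.

```python
# Hedged sketch of equations (11)-(14): closest point on the line y = m*x + c
# to the robot, its distance, and a sign that is positive when the robot is
# on the obstacle's outside.
import numpy as np

def signed_line_distance(px, py, m, c, outward_normal):
    # Closest point on the line to (px, py)  -- equations (11), (12).
    xc = (px + m * (py - c)) / (m**2 + 1.0)
    yc = m * xc + c
    dist = np.hypot(px - xc, py - yc)          # equation (13)
    # Positive if the robot lies on the side the outward normal points to.
    side = np.sign(np.dot(outward_normal, [px - xc, py - yc]))
    return side * dist                          # equation (14)

def outside_obstacle(px, py, lines, robot_radius):
    """Disjunction over an obstacle's line constraints: at least one
    signed distance must exceed the robot radius (cf. constraint (10))."""
    return any(signed_line_distance(px, py, m, c, n) >= robot_radius
               for (m, c, n) in lines)

# Example: two horizontal faces of a square obstacle spanning y in [1.5, 2.5]
# (the vertical faces are omitted for brevity); robot at (0.5, 0.5).
faces = [(0.0, 1.5, np.array([0.0, -1.0])),   # bottom face, outward normal -y
         (0.0, 2.5, np.array([0.0,  1.0]))]   # top face, outward normal +y
print(outside_obstacle(0.5, 0.5, faces, robot_radius=0.2))  # True
```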

III-B Cooperative Multi-Agent Localization

While the propagation of uncertainty for each robot is calculated using an RNN (see III-C), updating the uncertainty after information is exchanged with another robot is done using a distributed Kalman filtering approach [coop_local2000]. Thus, when two or more robots are in communication range (as pre-specified by the user), their individual uncertainty estimates should be updated to correctly reflect the gain from additional sensor information.

Equations (17)-(25) describe how the pose of one robot is updated while in communication range of another robot. The same equations can be further extrapolated to consider additional robots, as explained in [coop_local2000].
For robots i and j:
Propagation:

(17)
(18)

Update:

(19)
(20)
(21)

Update A (first time robots meet):

(22)
(23)

Update B (all other times robots meet):

(24)
(25)

where the update uses the relative pose measurement between robots i and j (obtained from current localization, and also retrievable from the SMPC solution) together with the relative measurement noise between the two robots.
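As an illustration of the kind of fusion [coop_local2000] describes, the sketch below performs a generic two-robot relative-measurement Kalman update on a stacked joint state; it is not the paper's exact equations (17)-(25), and for simplicity it neglects the cross-correlations that the distributed formulation tracks.

```python
# Hedged sketch of a two-robot relative-measurement Kalman update,
# in the spirit of equations (17)-(25).
import numpy as np

def relative_update(x_i, P_i, x_j, P_j, z_ij, R_ij):
    """Fuse a relative measurement z_ij ~ (x_j - x_i) into both robots.

    x_i, x_j : (n,) state estimates of robots i and j
    P_i, P_j : (n,n) covariances (cross-correlations neglected here)
    z_ij     : (n,) measured relative pose of j with respect to i
    R_ij     : (n,n) relative measurement noise covariance
    """
    n = x_i.shape[0]
    x = np.concatenate([x_i, x_j])                 # joint state
    P = np.block([[P_i, np.zeros((n, n))],
                  [np.zeros((n, n)), P_j]])        # joint covariance
    H = np.hstack([-np.eye(n), np.eye(n)])         # z = x_j - x_i + noise
    S = H @ P @ H.T + R_ij
    K = P @ H.T @ np.linalg.inv(S)                 # Kalman gain
    x = x + K @ (z_ij - H @ x)
    P = (np.eye(2 * n) - K @ H) @ P
    return x[:n], P[:n, :n], x[n:], P[n:, n:]

# Example: a well-localized UGV helps a poorly localized UAV.
x_ugv, P_ugv = np.array([0.0, 0.0]), np.diag([0.01, 0.01])
x_uav, P_uav = np.array([1.2, 0.8]), np.diag([0.50, 0.50])
z = np.array([1.0, 1.0])                           # measured relative position
x_ugv, P_ugv, x_uav, P_uav = relative_update(x_ugv, P_ugv, x_uav, P_uav,
                                             z, np.diag([0.02, 0.02]))
print(np.diag(P_uav))                              # UAV uncertainty shrinks
```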

III-C Recurrent Neural Network for Uncertainty Propagation

Fig. 3: Network structures. We show the RNN structure used to model an EKF from a VIO SLAM algorithm in (A), or a particle-filter SLAM algorithm in (B) (III-C). The inputs are shown in orange, and correspond to either features/robot position (using VIO SLAM) or LiDAR scans/robot position (using particle-filter SLAM). The outputs are shown in red, and correspond to the x-y covariance matrix (which represents uncertainty in x-y position). The layer type is color coded below, where green represents a simple RNN layer, and purple a dense layer.
Fig. 4: Training loss. Here we show the training loss for the RNNs, which were trained on uncertainty covariance outputs (in position) of a Visual-Inertial SLAM in (A) and a particle-filter SLAM in (B). The training was done using 500 epochs and on 4 different maps. Note that the noise observed in (B) may be due to the particle-filter estimations/simplifications done in [Gmapping].

Because our SMPC calculates the optimal control over a prediction horizon, we must also provide as input the propagation of state uncertainty at each timestep of that horizon. As described in more detail in [Schperberg_rnn], an RNN can provide a computationally fast prediction of future state uncertainties (as it only requires a small network) and can operate in continuous space, making it ideal for online replanning in complex environments. This is achieved because the RNN can model the behavior of a filter (e.g., particle filter or EKF) from SLAM algorithms. However, in this study, we have multiple robots with different sensor configurations, which requires training a separate RNN model for each.

In this work, we estimate state uncertainty using two different SLAM algorithms, a Rao-Blackwellized particle-filter SLAM (LiDAR camera configuration) and a Visual-Inertial Odometry (VIO) SLAM (RGB camera configuration). In the particle-filter case, the following equation is used for factorization:

(26)

where the factorization is over the robot’s trajectory, the given observations, and the odometry measurements, yielding the joint posterior estimate of the map (see [AMCL]). For training the RNN in the particle-filter SLAM case, we have 362 input units for each timestep k. The first 2 units are the center position of the robot (x and y), and the other 360 units represent the range distances from LiDAR scans (i.e., for timestep k, we have 360 relative scan distances). The output layer, which uses a linear activation function, corresponds to the robot’s 2x2 x-y covariance matrix, flattened into a 4x1 array or 4 output units. For the VIO SLAM configuration, we used the same methods for training the RNN as described in [Schperberg_rnn]. See Fig. 3 for an overview of the network structure, and IV-A for further implementation details.
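A minimal sketch of the particle-filter-SLAM RNN described above is given below: 362 inputs per timestep (x-y position plus 360 LiDAR ranges), a simple RNN layer followed by dense layers (cf. Fig. 3), and 4 linear outputs for the flattened 2x2 covariance, trained with Adamax and MSE as in IV-A. The hidden sizes, sequence length, and dummy data are assumptions, not the paper's values.

```python
# Hedged sketch of the RNN used to model particle-filter SLAM uncertainty.
import numpy as np
import tensorflow as tf

SEQ_LEN, N_INPUTS, N_OUTPUTS = 10, 362, 4   # sequence length is an assumption

model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(SEQ_LEN, N_INPUTS)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(N_OUTPUTS, activation="linear"),  # flattened 2x2 covariance
])
# IV-A: Adamax optimizer, mean-squared-error loss.
model.compile(optimizer="adamax", loss="mse")

# Dummy batch: sequences of (position, scans) -> flattened covariance.
X = np.random.rand(8, SEQ_LEN, N_INPUTS).astype("float32")
y = np.random.rand(8, N_OUTPUTS).astype("float32")
model.fit(X, y, epochs=1, verbose=0)
# The predicted covariance must be projected to positive semi-definite
# before use in the SMPC, per IV-A and [Schperberg_rnn].
print(model.predict(X[:1], verbose=0).shape)   # (1, 4)
```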

III-D Deep Q-Learning (DQN) for Global Planning

Fig. 5: DQN Training and Testing Procedure. In (A), we show the neural network structure used in our DQN algorithm (III-D). The network maps the inputs (i.e., states or relative distances between robots and obstacles/goals) to the outputs (i.e., actions or next target positions for the robots). The states and actions are connected by a linear neural network model (blue). In (B) we visually show the training process of the DQN for a 2 to 5 robot team, where all robots were trained to go to the goal location while avoiding obstacles (obstacles are randomized for each episode). The average rewards (calculated from 25 episodes at a time and divided by number of robots) are shown across the 35,000 episodes of training (training time was 5 hours). In (C) we show an example of how the environment can be transcribed into a 2D plot and apply the DQN to traverse multiple robots toward their goals. We also allow the UAV to fly over the obstacles (and assume we know the height of the obstacle a priori) while the UGV must avoid it. In (D) we show another example of our DQN, but this time the 2-robot UAV and UGV team have separate goal locations, and we add a reward incentive when both robots are near each other at each timestep.

III-D1 DQN formulation

To provide local target positions that direct the robots toward a global goal, we implement a DQN agent and use the Bellman equation:

(27)

where the Bellman equation is expressed in terms of the state, action, learning rate, number of robots, and reward function (the number of grid spaces required to represent each obstacle enters the reward and is described further below). The idea of DQN is to use the Bellman equation (27) and a function approximator (a neural network) to reduce the loss function (we use the Adam optimizer) [Q-learning].
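For reference, a standard textbook form of the Q-learning update that equation (27) instantiates is shown below; the paper's exact symbols may differ, and the discount factor shown is the usual one.

```latex
Q(s_k, a_k) \leftarrow Q(s_k, a_k)
  + \alpha \left[ r_k + \gamma \max_{a'} Q(s_{k+1}, a') - Q(s_k, a_k) \right]
```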

Ultimately, the goal of the DQN agent is to generate target positions at each timestep k for multiple robots and to move them towards a global goal position. To accomplish this and to generalize our methods to any map (or to obstacle locations that change after each episode), we use as our states the relative distances between the robot and the surrounding obstacles and the relative distance between the robot’s position and the global goal (the robot’s position is taken as its x and y coordinates). Our DQN agent is trained on a 2D grid map, where the robots and obstacles are represented as 1 m squares within the grid. Thus, the actions permitted for each robot are an 8-directional x-y movement (or no movement) at each timestep. The reward function is simply formulated by providing a positive reward if the robot gets to the goal and a negative reward if the robot hits an obstacle, which terminates the episode (we allow the UAV to 'fly' over some obstacles by modifying its reward function so that it receives no penalty for hitting those obstacles). To test different behaviors, we also added a reward at each timestep when both robots are within a 2 meter distance of each other (see (D) in Fig. 5).

We also assume the dimensions of each obstacle are known a priori (estimating them without this assumption may require an object detection pipeline). Thus, by knowing the height of each obstacle, a UAV can use this value in its chance constraint to fly over obstacles. Finally, mapping our states to our actions is done through a linear neural network. For an overview of the network structure and the training/testing process, see Fig. 5.
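A minimal sketch of the Q-network and reward shaping described above follows. The number of obstacles, hidden width, state encoding (per-obstacle (dx, dy) offsets), and reward magnitudes are illustrative assumptions, not the trained configuration; only the linear network, the 9 actions (8 directions plus staying put), and the sparse goal/collision/proximity rewards come from the text.

```python
# Hedged sketch of the DQN's Q-network and reward shaping (Section III-D).
import numpy as np
import tensorflow as tf

N_OBSTACLES = 5                      # assumption for illustration
STATE_DIM = 2 * N_OBSTACLES + 2      # relative (dx, dy) to obstacles + to goal
N_ACTIONS = 9                        # 8-directional move or stay

q_net = tf.keras.Sequential([        # "linear neural network model" (Fig. 5A)
    tf.keras.layers.Dense(64, activation="linear", input_shape=(STATE_DIM,)),
    tf.keras.layers.Dense(N_ACTIONS, activation="linear"),  # Q-value per action
])
q_net.compile(optimizer="adam", loss="mse")    # Adam, per Section III-D1

def reward(robot_xy, goal_xy, obstacle_cells, can_fly_over, both_robots_close):
    """Sparse reward: + for reaching the goal, - for hitting an obstacle
    (unless the UAV may fly over it), small bonus for staying close."""
    cell = tuple(np.round(robot_xy).astype(int))
    if cell == tuple(np.round(goal_xy).astype(int)):
        return 1.0
    if cell in obstacle_cells and not can_fly_over:
        return -1.0                               # episode terminates
    return 0.1 if both_robots_close else 0.0      # proximity incentive (Fig. 5D)

s = np.random.rand(1, STATE_DIM).astype("float32")
print(q_net.predict(s, verbose=0).argmax())       # greedy action index
print(reward(np.array([3.0, 4.0]), np.array([3.0, 4.0]), {(1, 1)}, False, False))
```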

IV Experimental Validation

Fig. 6: SABER Algorithm Results. This figure demonstrates the overall SABER planning algorithm. In (A) and (B) we first show the capability of the SMPC-RNN to navigate the UGV and UAV in a densely populated space. In (C), we show that the SMPC-RNN of the UGV cannot get to the goal state because of the occurrence of a local minimum. However, with a DQN (which provides a global path illustrated by triangles), the UGV (orange) can correctly maneuver around the obstacle. The UAV (purple) can simply use its z-axis to fly above the obstacle. A more complex example is shown in (D), where both robots are directed towards different goal locations simultaneously.
Fig. 7: RNN Results. In this figure, we show the true covariance value (where 'xx' represents the covariance of the center of mass in the x position, as an example) in blue and the predicted covariance in orange for about 430 seconds of data. This is done when modeling the uncertainty using VIO SLAM (A) and particle-filter SLAM (B) on a test map. Note that the predicted and true values almost perfectly align, demonstrating the RNN’s ability to make valid uncertainty predictions. Lastly, the graphs show explicitly that when more features are tracked (green arrow), the estimated uncertainty is lower, while fewer tracked features correspond to higher uncertainty (red arrow).

IV-A Implementation details

SABER is demonstrated on a UGV (Turtlebot3) equipped with a 360-degree LiDAR camera and a UAV (quadrotor) equipped with an RGB camera, in a Gazebo simulation running in real time with additive noise. The dynamic equation of motion for the UGV assumes states composed of center-of-mass position and heading angle, actions composed of linear and angular velocity, and the corresponding matrices for the discretized equation. The UAV assumes similar dynamics as in [quadrotor], where the states are center-of-mass position, linear velocity (x, y, and z components), angle, and angular velocity (we consider pitch and roll but keep yaw fixed), and the actions are composed of thrust. The A and B matrices and their parameters are fully described in [quadrotor].
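For the UGV, one common discretization of the unicycle model, linearized about the current heading, is sketched below; the paper's exact A and B entries are not reproduced here, so treat these matrices as an assumption for illustration.

```python
# Hedged sketch of a common discretization for the UGV's unicycle model
# with state [px, py, theta] and input [v, omega], linearized about theta.
import numpy as np

def ugv_discrete_matrices(theta, dt=0.1):
    A = np.eye(3)                                  # position/heading carry over
    B = np.array([[np.cos(theta) * dt, 0.0],       # v moves along the heading
                  [np.sin(theta) * dt, 0.0],
                  [0.0,                dt ]])      # omega changes the heading
    return A, B

A, B = ugv_discrete_matrices(theta=np.pi / 4)
x = np.array([0.0, 0.0, np.pi / 4])
u = np.array([0.5, 0.1])                           # linear, angular velocity
print(A @ x + B @ u)                               # next state x_{k+1}
```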

To solve the cost function (1), we use the mixed-integer nonlinear programming (MINLP) solver 'bonmin' [bonami_algorithmic_2008], as we need to consider both integer and continuous variables. Constraints and problem formulation were set up using CasADi [Casadi] running on a laptop with an Intel Core i7-8850H CPU and an NVIDIA Quadro P3200 GPU.

To collect data for training our RNN model for the UGV, we used the GMapping package [Gmapping] to create a map of the environment, and then implemented the AMCL package [AMCL] to track the robot’s pose and obtain the uncertainty covariances using this map with particle-filter SLAM (uncertainty outputs from the filter are considered the 'ground truth'). In the UAV case, we used the XIVO SLAM package [xiao-semantic-mapping] to make localization and covariance estimations (XIVO uses an Extended Kalman Filter). The Adamax optimizer was used for training, and the Mean Squared Error (MSE) was implemented as the loss function for both RNN models (covariance matrices need to be converted to be positive semi-definite during usage, see [Schperberg_rnn]). For training (for both models), we created 4 different maps with obstacles randomly distributed, where the robots traverse these maps via the SMPC. Note that, by definition of a particle filter and an EKF, the former assumes non-Gaussian noise while the latter assumes Gaussian noise.

IV-B Results

Fig. 8: System performance. We compare the results of SABER (SMPC-RNN-DQN), using the same map as (D) in Fig. 6, against baselines for the UAV (A) and UGV (B), with distance to goal vs. time as our metric. See IV-B2 for details.

IV-B1 Analysis of learning components and their significance

The training loss for the RNN networks is shown in Fig. 4. The RNNs for both SLAM algorithms were trained for 500 epochs (validation loss was observed to be close to the training loss) and show a strong correlation between the true covariance (the SLAM output used in training) and the predicted covariance. We also demonstrate that our RNN can model the behavior of covariance outputs generated by different sensor configurations (i.e., SLAM algorithms), as seen in (A) and (B) of Fig. 7 (e.g., we observed an increase in uncertainty for VIO SLAM when too close to obstacles, while seeing the opposite behavior for particle-filter SLAM). Importantly, this result indicates that the RNN can model both non-Gaussian (particle-filter) and Gaussian (EKF) noise. By modeling the propagation of uncertainty of different measuring systems and integrating them into the SMPC prediction horizon through chance constraints, we ensure control values that avoid obstacle collisions (without over-avoidance) for each individual robot.

In (B) of Fig. 5, we show that during training of our DQN, the rewards increase over a majority of the 35,000 episodes, reaching a steady state at approximately 27,000 episodes (except for the 5-robot team). In Table I, we also compared our DQN to other 2D baseline algorithms (we chose resolution settings for RRT, RRT-star, and A-star so that their path lengths were similar to the DQN’s, then evaluated their computation times). The results indicate that, on average, the DQN had the lowest computation time and was comparable to A-star in terms of its path length to the goal (considered optimal for our map resolution). Unlike the baseline algorithms, our DQN also has the additional functionality of learning multi-agent semantic behavior: it was successful at moving the robots simultaneously to their goal points as shown in (C), and, as expected, the robots stayed closer to each other when rewarded based on close proximity, as shown in (D) of Fig. 5. Although it would be possible to modify the baseline planners to consider multi-agent planning, the learning-based approach captures potential semantics between agents that would be difficult to quantify and define beforehand within a cost function.

High-level planner analysis (100 trials on map (D) of Fig. 6)
Planner     Path length (m)   std dev (m)   Solve time (s)   std dev (s)
RRT         10.83             1.28          0.120            0.091
RRT-star    10.40             1.20          0.084            0.073
A-star       9.66             0.00          0.154            0.002
DQN          9.68             0.00          0.051            0.001

Average computation time of SABER components (10 minutes of data)
Component        SMPC     RNN      VIO SLAM   PF SLAM   SABER
Solve time (s)   0.0559   0.0212   0.030      0.073     0.193
std dev (s)      0.0187   0.0077   0.003      0.005     0.110
TABLE I: High-level planner comparison and SABER component computation times.

The limitation of our DQN is that it currently plans in 2D rather than 3D space, which helps lower training time and improve convergence (more complex DQN formulations may be necessary if the problem is scaled to 3D, assumes a map bigger than 10x10 m, or uses more than 3 robots; see (B) of Fig. 5). However, since we use an SMPC to avoid obstacles in 3D space using chance constraints, a 2D planner is sufficient for our application.

IV-B2 Validation of the SABER algorithm

The complete planner is exemplified in Fig. 6 on a UGV-UAV team. We show that the SMPC of the UGV and UAV uses the state uncertainties estimated by an RNN to avoid colliding with obstacles in dense maps (see (A) and (B)). A special case is also shown in (C), where the SMPC of the UGV (without the DQN) reaches a local-minimum solution and is stuck behind the obstacle. However, when using the DQN’s proposed path, the UGV can successfully reach its global goal. Note that the SMPC, and not the DQN, considers both the dynamics of the robots and the uncertainty provided by their RNN models; thus, the actual path (shown in orange/purple) will differ slightly from the proposed DQN path (marked by triangles). We also evaluate the average computation time of the SABER algorithm (and its individual components) in Table I, showing a total computation time of 0.193 seconds per timestep. Lastly, in Fig. 8, we verify that SABER performs best compared to several baselines (using distance to goal vs. time as our metric on map (D) of Fig. 6). The figure shows that MPC alone (where no uncertainty is considered) causes an obstacle collision for the UGV and UAV (this likely occurred because the simulation contains noise, and the MPC is unaware of this noise during optimization). Both robots are able to get to their goals using a naïve stochastic MPC (uncertainty is considered by artificially inflating all obstacle boundaries), but due to over-avoidance they take longer to reach their goals. Adding our RNN to the SMPC allows both robots to reach their goals more quickly (uncertainty is now more accurately propagated within the SMPC prediction horizon). However, we observed that without a global planner, both robots run into local-minima issues (i.e., robots would sometimes get stuck behind an obstacle for some time before reaching their goals). With a global planner (e.g., DQN, A-star), the robots reach their goals in the quickest manner (avoiding local-minima issues). Although A-star and the DQN provide near-optimal paths, the reason the DQN shows slightly better improvements (including better computation time) over A-star is that the DQN also accounts for multi-agent behavior, where robots were trained to stay close to each other when possible before reaching their respective goals (staying in close proximity decreases uncertainty via the Kalman filter).

V Conclusion

In this work, we demonstrated that the SABER algorithm (which combines several fields of robotics, including controls, vision, and machine learning, into a single framework) is computationally feasible (approximately 0.2 seconds per timestep, see Table I) and plans paths for heterogeneous robots to reach a global goal while satisfying diverse dynamics, constraints, and consideration of uncertainty. In future work, we plan to relax the assumption of a perfect object detection system and will focus on expanding our DQN to consider more complex behavior and tasks.

References