A long-awaited promise of mobile robotics, almost from its inception, has been the autonomous exploration of environments inhospitable to humans. Since at least the early 2000s, work has been ongoing to autonomously explore the particularly dangerous and complex environment of subterranean mines [Thrun2004] [Morris2005]. Mines present a wide variety of challenges to virtually all aspects of robotics, including challenging and complex terrain, inherent GNSS and communication denial, and sensory degradation. In addition, if robots are to be used practically in the field, procedures for robot deployment and operator control must be highly automated, robust, and reliable.
One of the most significant constraint in underground robotics is communications [Yarkan2009]. Limited bandwidth and connectivity require robotic systems to operate successfully even without any real-time feedback from the human operator. All subsystems must run entirely on-board on relatively compute-limited platforms. Furthermore, significant attention must be given to data compression, transmission, and storage in order to provide the human operator with the most possible information over a limited-bandwidth link.
In addition, underground environments contain a wide variety of terrains: from concrete, to deep mud, to pitted railroad ties, to vertical shafts over kilometers of tunnel. Since no single platform, or type of platform, is perfectly suited to all of these environments, we propose a loosely-coupled heterogeneous multi-robot system consisting of Micro-Aerial Vehicles (MAVs) and legged quadrupeds. Recent work has noted the capabilities for MAVs in underground environments to traverse areas rapidly that ground-based systems cannot reach [Papachristos2019] [Li2018] [Ozaslan2018]. However, physical limitations limit flight time, mission length, and payload. We therefore additionally employ Ghost Robotics (GR) Vision 60 quadrupeds (Fig. 1) to enable long duration missions with a larger sensor payload. Quadrupeds have the potential capability to traverse highly complex and unstructured terrain more easily than treaded or wheeled robots, making them suited for underground environments [Zimroz2019], but they have been minimally utilized in real-world autonomous applications. In a work similar to ours, Bellicoso et al. [Bellicoso2018] autonomously navigate outdoor environments with a quadruped, but they (a) assume GNSS availability and (b) assume a pre-existing global map or manual path definition.
In this work, we discuss the system design for the legged robots, but note that the detection and communication portions of our approach are platform-agnostic and are utilized across a heterogeneous suite of robots.
The primary contributions of our work are as follows:
We present the architecture and detail the components of our system for autonomous exploration of tunnel systems using legged quadruped robots.
We discuss communication, detection, and mapping systems that enable rapid situational awareness for a single operator in communication-challenged environments.
We show results from testing in a motion capture space in the lab, as well as NIOSH at the SubT Tunnel Circuit, including multiple autonomous traversals of hundreds of meters in different mines. To our knowledge, these experiments represent the first long-duration entirely autonomous exploration undertaken by legged robots.
Ii System Overview
Ii-a Hardware Architecture
We chose the NVidia Jetson AGX Xavier as the primary computer for the legged platform. The Xavier contains 8 ARM CPU cores as well as 512 GPU CUDA cores with 16GB of shared RAM. We can therefore distribute the software load between the CPU and GPU, which is particularly valuable for tasks such as mapping and image processing.
The hardware architecture is shown in Fig. 2
. Our primary exteroceptive sensor is an Ouster OS-1 64-beam LiDAR, which is used for navigation, planning, and mapping. Two pmd picoflexx time-of-flight sensors are used for short-range depth sensing in front of the robot to aid in local planning. We employ an Intel RealSense T265 stereo tracking camera for local high-speed pose estimation. A StereoLabs Zed Mini stereo camera and FLIR Boson IR camera are the primary sensors used for object detection.
For communication, we use a Rajant dual-band mesh networking system. The Xavier additionally communicates with the Vision 60 mainboard via UDP over Ethernet to send high-level navigation commands in the form of body-frame twists.
Ii-B Software Architecture
Our full software stack is built on ROS. Because communication is intermittent and unreliable in subterranean environments, all robots and external computers run separate ROS masters. Communication between robots occurs exclusively via the distributed database system (Sec. VIII), and communication with the basestation via a combination of the distributed database and a separate mechanism for sending commands and obtaining telemetry via TCP in real time. Subsystems, packaged as ROS nodes, are launched, stopped, and monitored by a launch manager systemctl daemon. In this fashion, the operator can individually stop and start the robot’s subsystems, as well as verify that each is operating as expected without needing to be directly connected to the robot. This interface is pictured in Fig. 4. This approach proved to be extremely useful for rapid startup, shutdown, and debug of systems in the field.
A block diagram of the software architecture is shown in Fig. 3. Each subsystem is largely independent, enabling a high degree of robustness. We made an early decision to separate the lower level autonomy systems, such as the local planner and controller, from higher level systems such as the mapper and state machine. Therefore, even if the global mapper fails or diverges, the robot remains capable of locally determining feasible directions to travel. This robustness was found to be very useful in experiments.
The goal of SubT is to localize as many artifacts as possible within a single time-limited run, a task which lends itself strongly towards parallelization. In implementing multi-robot systems within constrained and confined subterranean environments, there are a number of spatial, temporal, and communications challenges in addition to the necessity of handling robot failure. Spatial challenges include mitigating robot-to-robot interference while avoiding redundant exploration. The exploration time for each robot cannot exceed its battery limit and any exploration strategy must ultimately allow the robot to convey the necessary information to the base station before run termination.
Given that communication is challenging, low-bandwidth, and intermittent in subterranean environments, we develop a distributed system so each robot behaves exclusively based on its own observations and information. Simply put, we view communication as a luxury and focus on developing fully autonomous systems whose behaviors are triggered by individual high-level state machines.
Iii-a State Machine
We define multiple states for individual robots, namely START, EXPLORE, STOP, MOVEBASE, and ESTOP. The state machine begins in START while the robot stays in the staging area waiting for operator’s command before starting mission. In this state the operator can set the time limit for robot exploration before returning as well as the turn specification (Sec. IV-B). This time limit can additionally be changed as long as the robot can communicate with the base station. When the operator commands, the robot enters EXPLORE and the robot actively searches for tunnels to follow, as detailed in Sec. IV-B. The robot returns to the base when the elapsed time exceeds the user-defined time limit, and MOVEBASE is entered. STOP and ESTOP can be manually triggered to either simply pause or sit down the robot. For Tunnel Circuit, we deployed robots one-by-one with decreasing time limits in order to mitigate inter-robot interference and maintain communication through the mesh network. Extending this system by allowing inter-robot collaboration could improve performance and remains as a future work.
Iii-B User Interface
To monitor the states of the robots and artifact positions, we develop a top-down view user interface as shown in Fig. 4. The interface shows the reconstructed pointcloud, topological graphs and detections for each robot, statuses, and object detection thumbnails.
Iv High Level Navigation
Due to our decision to decouple exploration from mapping, our exploration algorithm was developed to operate entirely locally. A tunnel exploration algorithm must be able to both determine potential tunnels for exploration, as well as decide which tunnel at any given time to explore.
Iv-a Tunnel Detection
It is natural to represent tunnels at a high level by a graph as in [Morris2005], as the confined nature of the environment results in high degrees of sparsity. For exploration, detection of crossroads is of particular significance [Silver2006] as they represent decision points. We use instantaneous depth panoramas generated by a single 3D LiDAR scan as the primary input to our exploration algorithm due to several useful properties. A large section of distant points, corresponding to a connected component in the depth panorama, represents a tunnel. Therefore, we threshold the depth panorama at a target distance, detect connected components of sufficient area, and find the corresponding centroids. These centroids represent the azimuths of potential tunnel axes emanating from the robot’s current position. We then track these centroids over time using an EKF, associating centroid measurements to tracked tunnels by checking whether the detected centroid is within the tunnel bearing uncertainty. If tunnel uncertainty drops below a threshold, the detection is published as a "new frontier". In order to avoid inadvertent backtracking, we track the recent pose history and project these bearings onto the depth panorama, ignoring centroids near them. This can be seen in Fig. 5
Iv-B Exploration Behavior
We continuously track the tunnel closest to the robot’s current heading. If the local terrain planner informs the exploration controller that the cost of traversal is too high, as described in Sec. V-A, then the frontier is marked as untraversable and the next closest frontier selected. If there are no other frontiers immediately available, a return to base is triggered.
In order to ensure multiple robots take different trajectories, we expose simple mission specifications to the operator in the form of turn sequences such as "Left Right Right". The robot queues up these turns, and if a new tunnel is detected on the relevant side of the robot, that one is automatically selected as the new current tunnel and the turn command is popped off the top of the queue. Once the queue is emptied, the robot simply continues along the current tunnel until it becomes incapable of continuing or exploration times out.
Iv-C Return to Base Behavior
In order to return to the entrance to the tunnel we leverage the mapper’s topological graph as described in Sec. VI
. By performing a graph search, we can find a path from the current keyframe to the start keyframe. Since keyframes are typically evenly spaced, we traverse the graph using breadth-first search. Once a path is found, we compute a vector from our current position estimate to the next vertex in the path (which we subsequently update as we get nearby). This vector is fed to our existing planning framework of tunnel following. When the direction is similar to the direction of a detected tunnel, we follow the tunnel as in sectionIV-B. When it is above the angular difference threshold, we ignore detected tunnels and plan paths in that direction (as in the case of turning around at the beginning of the path). This ’go to node’ behavior could be used without modification to navigate to other nodes in the pose graph, such as nodes adjacent to unexplored frontiers, but in practice was used only to return to the start node.
V Local Planning and Control
It is common, when considering the motion planning problem for ground robots, to encode the local environment as a 2-dimensional costmap as in [Gerkey2008]. The cost, effectively the inverse of some measure of traversability. The problem of local planning is then broken into two primary pieces: quantifying terrain traversability and planning feasible trajectories on the generated cost map. Due to imperfect odometry and our decision to minimally couple planning and mapping, we adopt the approach of [Fankhauser2018] and continually center the costmap on the robot’s current position. We additionally model traversability as a function of the magnitude of the terrain gradient, which makes costmap integration extremely simple. The full planning pipeline is shown in Fig. 6.
V-a Terrain Mapping
V-A1 Height Map Generation
In generating the terrain map, we begin with the fused pointcloud generated from the Ouster and ToF sensor data. We then subdivide the X-Y plane of the region of interest around the robot into a grid. For each grid cell, we compute , where is a measure of the quantity and closeness to of points in at height near the cell centered at and is a similar quantity, weighted by height. Explicitly,
where we define to be the Cartesian distance from the center of the grid cell at to , such that , denotes the z-coordinate of the point, and and are thresholds. We then find , which is the value giving the smallest local maximum of above some threshold .
This approach has a number of advantages:
Ceilings are handled smoothly, since we take the lowest surface in the given grid cell.
We smoothly interpolate between the sparse lidar points, with the maximum distance given by.
We filter out sparse points with a strength determined by . Note that this filter will fail if a point is exactly at the center of a grid cell, but in practice we found this filter to work reasonably well.
Terrain heights are continuous, since we take the average of points in the height bin at the end. This is particularly useful when computing the gradient, as we do later.
V-A2 Configuration Space Costmap
Once we have the discretized local heightmap, it is trivial to compute the gradient on it. We motivate this operation by noting that for ground robots, the traversability of the terrain effectively comes down to steepness and roughness, both of which are well-captured in the gradient. We consider only the magnitude of the gradient at each point. Additionally, we make the conservative assumption that regions with unknown gradient have infinite cost.
A key motivation for using the gradient to compute traversability is that the absolute elevation is effectively filtered out. We can therefore make the assumption that all robot motion is in the 2D X-Y plane when integrating without significant error. Let be the robot pose projected into SE(2) at time . Let the gradient map be . We then have
We then fuse the propagated old map and the new measured map with
We now convert the gradient map in Cartesian space into configuration space by modelling the robot as a 2D rectangular region centered on with angle . We then define
which is the most conservative possible traversability assumption. If any are unknown (NaN), we set , again making the conservative assumption that all unknown space is untraversable. The configuration space conversion lends itself to high degrees of parallelization, so we perform this step on the GPU. We additionally precompute the bounds of in a lookup table, so the entire configuration space generation takes ~10 ms on the Xavier. As a final step, we blur the configuration space using a 1D Gaussian kernel along the y axis in order to push the robot towards the center of the tunnel.
V-B Terrain Planning
All planning is now performed on the configuration space cost volume. The planner can be defined to be a function where is a 2D direction and is the optimal time parametrized trajectory. We additionally define
where is the cost function, is the cost associated with commanded direction, and is the set of admissible trajectories. This hybrid planner allows us to impose both "hard" and "soft" constraints on . We define to be the set of trajectories which never go through a cell in configuration space of cost greater than a threshold.
where . These terms correspond to a rotation cost, distance cost, traversability cost, sidestep cost, and reversal cost. Notably, we additionally penalize sidestepping on the robot as it is significantly less stable while sidestepping when compared to moving directly forward. We strongly penalize backwards motion since no sensors on the robot can instantaneously see backwards. In order to direct the path in the desired direction, we add the direction cost where is a unit vector in the desired direction.
We run Dijkstra’s algorithm over configuration space to plan the initial path, terminating the search once we have reached any point sufficiently far along . Once the path is found, we recursively prune the planned path to remove all waypoints that do not increase the overall cost and are in , linearly interpolating between waypoints. Pruning smooths the final trajectory, removing the jagged turns that are an artifact of planning in discretized Cartesian space.
Once the planned path has been generated, we time parametrize it assuming fixed linear and angular velocities. At every time we then compute as well as , the current actual state of the planner. We finally simple use a simple proportional controller to command a twist proportional to in order to track the trajectory.
The mapper is based on an atlas framework [Bosse03anatlas] with a pose graph [Dellaert2012FactorGA] skeleton linking together local maps. The local maps support small-scale loop closures by virtue of powering local frame-to-model tracking in the spirit of KinectFusion [Newcombe:KinectFusion:2011], while the pose graph enables large-scale loop closures. The local map representation is a panoramic keyframe [Meilland13b, Taylor:DepthPanoramas:2015] recording depth, surface normal, and confidence for each pixel, Fig. 7. Panoramic images efficiently represent free space, as it is implicitly encoded along the ray to each recorded range measurement, without sacrificing much representational power when used for maps gathered over limited translational excursions.
The LiDAR used as the primary mapping sensor rotates a collection of range finding devices around a central axis, and was configured to produce 360°sweeps at 10Hz. Robot motion over these 100ms intervals can be significant, so an initial rotation correction based on a time-synchronized IMU is used to “de-rotate” measurements to the time when the sweep began. The rigid motion of each sweep is initialized by the pose change reported by an Intel RealSense stereo camera maintaining its own independent pose estimate, then optimized to register the sweep to the active local map with projective iterative closest point (ICP) [Blais:ProjectiveICP:1995] with a point-to-plane metric [Chen:ICP:1992]. The registered sweep is used to update the active local map by averaging range estimates that are within a small threshold, or decreasing the confidence of an estimate with which the new observation disagrees. A new local map is initialized from the previous local map whenever the current map can not represent enough of a new sweep due to occlusions or changes in the environment, or if the current pose is above a threshold distance from the local map origin.
The mapping pipeline typically runs at 50Hz on the Nvidia Xavier, with most of the computation handled by the GPU. When a new local map is created, a pose graph including the just-finished active map is optimized using GTSAM. The depth image for the newly finished local map is then downsampled and losslessly compressed as a 16-bit PNG image for transmission back to the base station for use in generating a traditional point cloud visualization. Because the global point cloud is assembled from individual keyframes as part of the visualization procedure, local map poses may be updated on each robot by its ongoing map optimization without requiring that a full global map be re-transmitted. Instead, optimized pose graphs are periodically shared among the robots, as these involve very little data compared to the depth panoramas.
Vii Object Detection
Vii-a Vision Based Detection
The overall objective of the Tunnel Circuit is to detect, localize and communicate global artifact locations to DARPA. The artifact types are known in advance, and are fire-extinguisher, backpack, hand-drill, cellphone and survivor; examples are shown in Fig. 8. We achieve artifact recognition using the ERFNet [romera2017erfnet]deep learning based semantic segmentation network. We selected this network architecture because of its compact and efficient design, allowing fast inference on our compute hardware [MAVNet].
We collected a large corpus of training data from various challenging environments such as mines, basements, cluttered indoor and outdoor spaces from a variety of sensors including three different types of cellphones and a number of Stereolabs Zed Mini cameras over the course of a year. We also collected human annotated per-pixel labels for each of these images. Additionally, to add more variation to our dataset, we extracted relevant image frames from over 25 YouTube videos of caves and mines. To a fraction of these images, we programmatically and artificially pasted artifacts into these scenes and automatically generated labels. This resulted in an annotated dataset of 3416 images of resolution .
For performance reasons we trained our network at half the original resolution using the dataset described above, splitting our dataset such that training and test data are from different environments. We then selected the model with the lowest test accuracy (mean intersection over union) over the relevant classes.
The output of our semantic segmentation network, along with the depth image acquired from the Stereolabs Zed Mini passive stereo camera is used to identify the artifact’s 3D position relative to the camera frame. These 3D positions are then transformed to the nearest key-frame from our mapping algorithm.
Our segmentation network runs at ~5Hz and we usually receive multiple detections of the same artifact. To filter these artifact detections and reduce the number of detections communicated back to the base-station, a filter is designed that clusters multiple detections based on their 3D positions relative to the last keyframe. Artifact detections of the same class that are detected within some distance threshold are collapsed to the same artifact instance and are returned to the human operator with a high confidence score. This confidence value is calculated as the ratio of the number of detections that are in agreement with our threshold to the total number of detections within a particular time-window. However, artifacts that exist as outliers, i.e those with higher error in calculated 3D position or false positives from the neural network are also returned to the operator for final decision making.
Vii-B Cellphone Detection
The bluetooth detector consisted of a HC-05 bluetooth module on a FT232RL FTDI USB to TTL Serial adapter. The HC-05 is used in command mode to scan for bluetooth enabled devices. Identified devices are then logged by their unique MAC ID and strength of bluetooth signal using the RSSI values returned during the scan. Detections are pushed to the database each time a stronger signal corresponding to a given device was detected, tagged with the current pose of the robot.
Viii Distributed Database
As previously discussed, communication presents a key challenge in underground environments. Furthermore, the information generated by one robot may be useful for other robots deployed in the field, as we described in Sec. III. Therefore, we opted to use a network and data sharing architecture in which all nodes share the same information. Both robots and the basestation are considered nodes in our network architecture. We refer to this module as the distributed database. Similar architectures have been proposed in the literature to coordinate distributed autonomous agents [cowley2004distributed] [ferrer2018blockchain].
The database module running in each node (Fig. 9), is composed of three key elements:
The database server: provides a simple in-memory database, as well as an API to write and read data to it.
The translators: receive messages from other modules in the robot, and execute the API calls required to store and retrieve the data.
The communication channels: provide a peer-to-peer link with other nodes in the network, allowing them to synchronize data.
The database stores messages using the format specified in Tab. I. Each message is identified by its unique hash which is calculated when the message is recorded. The length of the hash has been fixed to
bits, such that the probability of a hash collision is less thanwhen more than messages are inserted into the database. On average, each robot inserted messages during individual runs in the NIOSH mine, well within this bound.
While the database is shared by all the nodes in the network, each robot is assigned a different table. Therefore, a robot only has read/write access to its own table, but it has read access to the data published by other nodes in the network. A robot may use this information to modify its behavior depending on the behavior of the other robots. The only exception to this rule is the basestation, which may overwrite the Ack field to indicate that a message has been propagated to the operator.
For data integrity reasons, messages written into the database cannot be erased. Nevertheless, the same message may be overwritten modifying its data, as well as other fields. For instance, when a robot expands the map by adding new keypoints.
|Field||Data Type||Length (in Bytes)|
The communication channels are written on top of ZeroMQ (ZMQ). ZMQ provides a network framework to send messages between nodes, reducing data retransmission when robots lose communications momentarily. Our architecture is designed to follow the Request/Reply model of ZMQ, which is the architecture that proved the best reliability during our testing.
Viii-B Synchronization Procedure
While the Rajant radios that are used include an automatic meshing system, we opt to use a peer-to-peer approach to reduce the amount of data transmitted in the network. The synchronization procedure, run in each synchronization channel, is depicted in Fig. 10. Either endpoint of the network can act as a client or a server. The channel provides unidirectional synchronization such that the client is pulling information from the server, but not the opposite.
Synchronization is triggered when a direct link with good bandwidth is measured between two nodes. We use the Rajant API to detect this condition.
The client requests a list of available hashes in the server. The list of hashes is sent using the order stipulated in the priority field. This ensures the right order of messages when there are dependencies between them and that important data is sent first.
The client requests specific hashes in the server database. After receiving the required messages, the hash is verified and the data is inserted into its own database.
The communication process is driven by the client, and the server only replies to specific commands. This architecture is specially resilient to communication loss: when a transmission is interrupted, the client may resume the communication by requesting a specific message from the server.
Ix Experimental Results
Ix-a Lab Experiments
In order to validate our approach in controlled circumstances, we performed a short autonomous mission in an outdoor motion capture system using a simulated tunnel and a short mission time-out. We additionally set up a drill, backpack, and fire extinguisher in the environment. Fig. 12 shows the overlaid mapper and ground truth (motion capture) trajectories as well as the detection locations against ground truth. All objects are localized well within a meter, and the total error after a ~30m traversal was on the order of a meter, giving the mapper ~3% error. Some false positives are shown, such as the backpack being incorrectly identified as a fire extinguisher in some frames, but these cases are easily filtered out by the human operator.
Ix-B Mine Experiments
We additionally performed a number of tests as part of the DARPA SubT Tunnel Circuit at the NIOSH Research mine. In the Experimental Mine (Fig. 11), 4 artifacts were correctly detected over the course of a 165m traversal as well as some false positives. However, at this mine, we were unable to localize these artifacts within 5 meters due to the spatial smoothness of the mine corridor and our reliance on LiDAR for localization. At the end of this run, the robot fell by putting a foot in a small hole not seen in the terrain map due to the relatively low terrain map resolution of 10cm. Nonetheless, given that each step the robot takes is approximately 10cm, a 165m traverse corresponds to over a thousand successful steps taken by the planner. In the Safety Research Mine, the robot traversed 240m before slipping and falling on gravel. Additionally, after ~100m, a cell phone was correctly detected and reported to within 5m, corresponding to less than 5% error.
In this work, we have presented the exploration, planning, mapping, and detection technologies to autonomously explore, map, and detect artifacts in underground tunnel environments using legged robots. We have additionally demonstrated the feasibility and functionality of such a system in controlled lab experiments as well as extensive field testing.