TerraPN: Unstructured terrain navigation using Online Self-Supervised Learning

by   Adarsh Jagan Sathyamoorthy, et al.

We present TerraPN, a novel method that learns the surface properties (traction, bumpiness, deformability, etc.) of complex outdoor terrains directly from robot-terrain interactions through self-supervised learning, and uses them for autonomous robot navigation. Our method uses RGB images of terrain surfaces and the robot's velocities as inputs, and the IMU vibrations and odometry errors experienced by the robot as labels for self-supervision. Our method computes a surface cost map that differentiates smooth, high-traction surfaces (low navigation costs) from bumpy, slippery, deformable surfaces (high navigation costs). We compute the cost map by non-uniformly sampling patches from the input RGB image by detecting boundaries between surfaces, resulting in low inference times (a 47.27% reduction over uniform sampling) compared to segmentation methods. We present a novel navigation algorithm that accounts for a surface's cost, computes cost-based acceleration limits for the robot, and computes dynamically feasible, collision-free trajectories. TerraPN's surface cost prediction can be trained in about 25 minutes for five different surfaces, compared to several hours for previous learning-based segmentation methods. In terms of navigation, our method outperforms previous works in terms of success rates (up to 35.84% higher), vibration costs (up to 21.52% lower), and slowing the robot on bumpy, deformable surfaces (up to 46.76% lower mean velocity) in different scenarios.




I Introduction

Autonomous robots are currently being used for a variety of outdoor applications such as food/grocery delivery, agriculture, surveillance, and planetary exploration. Navigation methods designed for these applications need to account for a terrain’s geometric properties such as slope or elevation changes, as well as its surface characteristics such as texture, bumpiness (level of undulations), softness/deformability, etc. to compute smooth and efficient robot trajectories [weerakoon2021terp, sriram-siva-1, kahn2020badgr].

This is because, apart from a terrain’s slope and elevation, its surface properties (texture, bumpiness, and deformability) govern its navigability for a robot. For instance, a surface’s texture determines the traction experienced by the robot, its bumpiness determines the vibrations experienced, and deformability [where-should-i-walk, rover-cmu] determines whether a robot could get stuck or experience wheel slips (e.g. in mud). Other factors that affect navigability over a terrain involve the robot’s properties such as dynamics, inertia, physical dimensions, velocity limits, etc.

Therefore, a key issue for performing smooth robot navigation on different terrains is perceiving and characterizing the surface properties. To this end, many works in computer vision, specifically semantic segmentation [guan2021ganav, ttm, geo-visual] based on supervised learning, have demonstrated good terrain classification capabilities on RGB images. However, they rely on large image datasets with different classes of terrains annotated by humans. Such datasets do not account for a robot’s properties and might misclassify a traversable terrain for a robot as non-traversable (or vice-versa). In addition, their classification outputs must be converted into quantities that measure a terrain’s degree of navigability (costs) to be used for planning and navigation [semantic-mapping, trav-analysis-terrain-mapping].

Figure 1: The trajectories of our method, TerraPN (green, yellow, red corresponding to fast, intermediate and slow speeds respectively), DWA [DWA] (blue), TERP [weerakoon2021terp] (orange), OCRNet-based (pink), and PSPNet-based (purple) navigation schemes in unstructured terrain. We observe that TerraPN’s trajectories navigate the robot on smooth surfaces (asphalt) as much as possible with high velocities and adapt the velocity to different surfaces based on their navigability costs. Other methods directly drive the robot towards its goal with its maximum velocity. In some cases (PSPNet-based), the trajectories wander since the segmentation falters when the surface becomes too unstructured.

To overcome the aforementioned limitations, navigability costs could be directly predicted using input and label data collected in the real world through regression using self-supervised learning [where-should-i-walk, model-error]. That is, an image (input) can be associated with data vectors collected through other sensors such as Inertial Measurement Units (IMU) and wheel encoders on the robot (labels) instead of a human-provided label/annotation. Such label vectors generated from real-world sensor data could lead to a more accurate characterization of the robot-terrain interaction than a human-perceived ground truth. Therefore, the self-supervised learning paradigm could be used to learn surface characteristics quickly and efficiently.

Existing works that use self-supervised learning (online and offline) in the outdoor domain have predominantly focused on unstructured obstacles detection [traverse-classific-unsupervised-online-visual-learning, kahn2020badgr], roadway and horizon detection [optical-flow-SS], or long-range terrain classification into various categories [long-range-from-short-range, classifier-ensembles, forested-terrain-SS]. Therefore, there is a lack of online self-supervised learning methods that can characterize a terrain’s surface properties. Many previous works also train their network offline by first collecting data before training on a powerful GPU [where-should-i-walk, kahn2020badgr]. As a result, these methods can take a long time to re-train for a new kind of terrain.

Main Contributions: We present TerraPN (Terrain cost Prediction and Navigation), a method that uses online self-supervised learning to predict a navigability cost map (or surface cost map) and uses it for efficient robot navigation in outdoor terrains. For training TerraPN’s surface cost prediction network, a robot autonomously collects inputs (RGB image patches, robot velocities) and labels (IMU vibrations, and robot’s odometry errors) by traversing various surfaces with different velocities. Our network learns the correlations between a surface’s visual properties (color, texture, etc.) with its attributes such as traction, bumpiness, and deformability and encodes them in the form of surface costs. TerraPN does not depend on human-annotated datasets and requires minimal human supervision during data collection and training. Using the cost map, TerraPN adapts the Dynamic Window Approach (DWA) [DWA], a method that guarantees dynamically feasible robot trajectories, for outdoor navigation. The novel aspects of our approach include:

  • A novel method that trains a neural network in an online self-supervised manner to learn a terrain’s surface properties (texture, bumpiness, deformability, etc.) and computes a robot-specific 2D surface cost map. The predicted cost map is a concatenation of patches of costs corresponding to different input RGB patches. Our network trains in about 25 minutes for 5 different surfaces, compared to segmentation methods that require 3-10 hours of offline training to achieve similar levels of performance suitable for navigation. Using patches of RGB images leads to a low-dimensional input formulation for our network.

  • An algorithm to non-uniformly sample patches from the input image to obtain improved navigability cost representations at the boundaries between surfaces and maintain low inference times. Non-uniform sampling leads to using a lower number of patches per image to predict the costs for the surfaces in a scene compared to uniform sampling. This results in a reduction of 47.27% in our method’s inference time on a mobile commodity GPU.

  • A novel outdoor navigation algorithm that computes dynamically feasible collision-free trajectories by accounting for different surface costs. Our method adapts DWA’s [DWA] formulation by modulating the robot’s acceleration limits for different surfaces and computing trajectories with lower surface costs compared to DWA’s trajectories. This results in improved success rates (up to 35.84% higher), vibration costs (up to 21.52% lower), and mean velocity on high-cost surfaces (up to 46.76% lower) for different outdoor scenarios. We implement our method on a Clearpath Husky robot and evaluate it in real-world unstructured scenarios with previously unseen surfaces.

We use a learning-based approach for the cost map prediction and a model-based (extending DWA [DWA]) approach for navigation. Therefore, TerraPN combines the benefits of accurate characterization of robot-terrain interaction and guaranteed low surface cost, dynamically feasible navigation.

II Related Works

In this section, we discuss previous works in computer vision for characterizing a terrain’s traversability and methods for outdoor navigation.

II-A Characterizing Traversability

Traditional vision works have used methods such as Markov Random fields [prob-terrain-class] and triangular mesh reconstruction [3d-mesh] on 3D point clouds to analyze surface roughness and traversability. Learning-based works for traversability prediction fall into a combination of supervised/unsupervised and classification/regression categories.

II-A1 Supervised Methods

Works in pixel-wise semantic segmentation classify a terrain into multiple predefined classes such as traversable, non-traversable, forbidden, etc. [guan2021ganav, ttm, geo-visual]. To this end, fusing a terrain’s semantic and geometric features has also been studied [geo-visual]. These works are typically supervised and utilize large hand-labeled datasets of images [rugd, rellis] to train classifiers. However, manually annotating datasets is time- and labor-intensive, not scalable to large amounts of data, and may not be applicable for robots of different sizes, inertias, and dynamics [anomaly-detection]. Such methods also assume that visually similar terrains have the same traversability [traverse-classific-unsupervised-online-visual-learning] without considering the robot’s dynamics, velocities, or other constraints.

II-A2 Self-supervised Methods

Unsupervised learning-based methods overcome the need for such datasets by automating the labeling process. They collect terrain-interaction data such as forces/torques [where-should-i-walk], contact vibrations [rover-cmu], acoustic data [proprioceptive-sensor], vertical acceleration experienced [multi-sensor-correlation], and stereo depth [long-range-from-short-range, classifier-ensembles], and associate them with visual features (RGB data) for self-supervision or reinforcement learning [weerakoon2021terp]. Other works have correlated 3D elevation maps and egocentric RGB images [trav-analysis-terrain-mapping] or overhead RGB images [overhead-images] for classification.

Few methods have performed self-supervised regression, where for each RGB pixel in outdoor terrain, the corresponding force-torque measurements [where-should-i-walk], trajectory model error [model-error], and resistive forces experienced [pliable] were predicted post-training.

II-B Outdoor Navigation

Early works in outdoor navigation proposed using the binary classification of obstacles versus free space [Laubach] and potential fields [Shimoda_potential] for outdoor collision avoidance. With the advent of deep learning, methods to estimate navigability/energy cost in uneven terrains through imitation learning [silver2010learning, sriram-siva-1] using egocentric sensor data and a priori environmental information [zakharov2020energy] have been proposed. Siva et al. [siva2021robot] addressed navigational setbacks due to wheel slip and reduced tire pressure in outdoor terrains by learning compensatory behaviors.

BADGR [kahn2020badgr] presents an end-to-end DRL-based navigation policy that learns the correlation between events (collisions, bumpiness, and change in position) and the actions performed by the robot. Since learning-based approaches cannot guarantee any optimality in terms of a navigation metric (minimal cost, path length, dynamic feasibility, etc.), [weerakoon2021terp] proposed a hybrid model of spatial attention for perception and a dynamically feasible DRL method for navigation.

III Background and Problem Formulation

In this section, we define the problem formulation and provide some background on DWA [DWA].

Figure 2: TerraPN’s overall architecture. Our method uses RGB images and a set of robot velocities to predict a surface cost map for the robot. In the pre-processing step, a classical image segmentation method (weak segmentation) is used to differentiate regions with different pixel intensities. A non-uniform patch sampling scheme is then applied where large/small patches are extracted from regions that have the same/varied pixel intensities. The patches are passed into our self-supervised network (see Fig. 4), which predicts a label vector. We compute the surface cost as the weighted norm of this vector, create patches with the cost values, and concatenate them based on their locations on the RGB image to form the cost map (blue implies low, yellow implies high costs). The cost map is then used by our novel navigation algorithm to compute dynamically feasible velocities with low surface costs in real-time. This velocity computation task is performed iteratively at every time step for a given image and velocity vector input pair.

III-A Problem Formulation

The problem of navigating in outdoor environments with various surface properties can be divided into two stages: surface cost prediction, and low-cost navigation. To predict surface costs, we train a neural network that uses $n \times n$ RGB image patches and a set of the robot’s linear and angular velocities $(v, \omega)$ as inputs, and a label vector consisting of two dimensions corresponding to IMU measurements, and two to the robot’s odometry errors. The IMU measurements and odometry errors characterize a surface’s traction, bumpiness, and deformability (see Section IV-B). After training, we use the neural network’s predicted label to construct a 2D surface cost map. The cost map is a non-uniformly discretized concatenation of patches that depend on the surface boundaries in the input image.

Next, the cost map is used by the navigation component to compute dynamically feasible trajectories with low surface costs. We use $i, j$ for denoting various indices, and $x, y$ to denote positions relative to different coordinate frames. Our method’s overall architecture is shown in Fig. 2.

III-B Dynamic Window Approach

DWA is a model-based collision avoidance algorithm that guarantees dynamically feasible robot velocities. However, its formulation implicitly assumes that the robot traverses on a uniform, smooth navigable surface. TerraPN’s navigation component modifies and extends DWA’s formulation for outdoor navigation by accounting for surface costs.

DWA’s formulation involves two stages: 1. computing a collision-free, dynamically feasible velocity search space, and 2. choosing a velocity in the search space to maximize an objective function.

In the first stage, all possible linear and angular velocities in $[0, v_{max}]$ and $[-\omega_{max}, \omega_{max}]$, respectively, are considered for the search space $V_s$. Here, $v_{max}$ and $\omega_{max}$ represent the robot’s maximum achievable linear and angular velocities respectively. Next, all the $(v, \omega)$ pairs that prevent a collision in $V_s$ are used to form the admissible velocity set $V_a$. Lastly, the velocity pairs that are reachable, accounting for the robot’s acceleration limits within a short time interval $\Delta t$, are considered to construct a dynamic window set $V_d$. The resulting search space is constructed as $V_r = V_s \cap V_a \cap V_d$.

In the second stage, DWA searches for the pair $(v^*, \omega^*) \in V_r$ that maximizes the following objective function.

$$G(v, \omega) = \alpha \cdot head(.) + \beta \cdot dist(.) + \gamma \cdot vel(.) \tag{1}$$

Here, head() measures the progress towards the robot’s goal, dist() is the distance to the closest obstacle when executing a certain $(v, \omega)$, and vel() measures the forward velocity of the robot and encourages higher velocities. $\alpha, \beta, \gamma$ are the weights of the components, and the arguments $(v, \omega)$ on the RHS are omitted for clarity.
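The two DWA stages described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the use of `dist(v, w) <= 0` as the admissibility test, and the callables `head`, `dist`, and `vel` are assumptions introduced for the example.

```python
import math

def dynamic_window(v, w, v_max, w_max, acc_v, acc_w, dt):
    """Velocities reachable from (v, w) within dt, clipped to the robot's limits."""
    v_lo = max(0.0, v - acc_v * dt)
    v_hi = min(v_max, v + acc_v * dt)
    w_lo = max(-w_max, w - acc_w * dt)
    w_hi = min(w_max, w + acc_w * dt)
    return v_lo, v_hi, w_lo, w_hi

def dwa_select(candidates, head, dist, vel, alpha, beta, gamma):
    """Return the admissible (v, w) pair with the highest weighted objective.

    head, dist, vel are caller-supplied functions of (v, w); a pair is
    treated as inadmissible when dist(v, w) <= 0 (assumed collision test).
    """
    best, best_score = None, -math.inf
    for v, w in candidates:
        d = dist(v, w)
        if d <= 0.0:
            continue  # this pair would lead to a collision
        score = alpha * head(v, w) + beta * d + gamma * vel(v, w)
        if score > best_score:
            best, best_score = (v, w), score
    return best
```

In practice the candidate list would be a grid of samples drawn from the window returned by `dynamic_window`.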

IV TerraPN: Surface Cost Prediction

In this section, we describe the components in computing a terrain’s surface cost map: 1. data collection, 2. network architecture and training, 3. cost map generation using variable sampling.

IV-A Autonomous Data Collection

To generate the input and label data for training the cost map prediction network (Fig. 4), we autonomously collect raw sensor data from an RGB camera, the robot’s odometry, a 6-DOF IMU, and a 3D lidar on different surfaces. The robot performs a set of maneuvers in two different speed ranges, slow and fast, each defined by a range of linear velocities (in m/s) and angular velocities (in rad/s). The maneuvers include: 1. moving along a rectangular path, 2. moving in a serpentine trajectory, and 3. random motion.

The maneuvers are designed to cover all the $(v, \omega)$ pairs within the robot’s velocity limits to emulate different terrain interactions while the data is collected. If the robot encounters an obstacle, it switches from executing the maneuvers to avoiding a collision using DWA [DWA].

IV-B Computing Inputs and Labels

As mentioned in Section III-A, our network uses an $n \times n$ patch that is cropped from the center bottom of the collected full-sized image of size $w \times h$. To generate the velocity input, the linear and angular velocities for a fixed number of past time instances are obtained from the robot’s odometry and reshaped into a vector. The velocity vector’s dimensions are chosen such that the image input does not dominate the network’s predictions.

Our label vector consists of IMU and odometry error components. They are robot-specific and implicitly encode the robot’s dynamics, inertia, etc., and its interactions with the terrain. To generate the IMU component, we apply Principal Component Analysis (PCA) to reduce the dimensions of the collected 6-dimensional IMU data (linear accelerations and angular velocities) to its two principal components. From Fig. 3, we observe that the variances ($\sigma_{PC_1}$ and $\sigma_{PC_2}$) of the data along the principal components help differentiate various surfaces in terms of their bumpiness. Additionally, for the same surface, higher velocities lead to higher variances in the data (justifying the need to consider velocities as inputs).

Figure 3: Sample results of the PCA applied on the 6-dimensional IMU data. The variances of the data along the two principal axes for [Left] two different surfaces concrete (red), and grass (green), and [Right] two different velocity ranges for the same surface. We observe that the variances can differentiate various surfaces and speed levels.

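To generate the odometry error component of the label vector, the distance traveled by the robot ($d_{odom}$) and its change in orientation ($\Delta\theta_{odom}$) in a time interval $\Delta t$ are obtained from the robot’s odometry. We obtain the same data ($d_{lidar}$, $\Delta\theta_{lidar}$) from a 3D lidar-based odometry and mapping system [legoloam]. The distance and orientation change errors are calculated as,

$$e_d = |d_{odom} - d_{lidar}|, \qquad e_\theta = |\Delta\theta_{odom} - \Delta\theta_{lidar}|$$

This component of the label vector differentiates surfaces with high deformability or poor traction where, if the robot’s wheels get stuck or slip (e.g. in mud), $d_{lidar} \approx 0$ and $\Delta\theta_{lidar} \approx 0$, whereas $d_{odom}$ and $\Delta\theta_{odom}$ would have high values. With $\sigma_{PC_1}$ and $\sigma_{PC_2}$ denoting the variances of the IMU data along its two principal components, the final label vector is given by,

$$\mathbf{l} = [\sigma_{PC_1}, \ \sigma_{PC_2}, \ e_d, \ e_\theta]^\top \tag{4}$$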
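As a small illustration of how the four label components fit together, the sketch below assembles a label from already-computed PCA variances and the two odometry sources. The absolute-difference form of the errors and all function and argument names are assumptions made for this example; the PCA step itself is omitted.

```python
def odometry_errors(d_odom, dtheta_odom, d_lidar, dtheta_lidar):
    """Disagreement between wheel odometry and lidar odometry over one interval.

    Assumption: errors are plain absolute differences.
    """
    return abs(d_odom - d_lidar), abs(dtheta_odom - dtheta_lidar)

def label_vector(var_pc1, var_pc2, d_odom, dtheta_odom, d_lidar, dtheta_lidar):
    """Four-dimensional label: two IMU variances followed by two odometry errors."""
    e_d, e_theta = odometry_errors(d_odom, dtheta_odom, d_lidar, dtheta_lidar)
    return [var_pc1, var_pc2, e_d, e_theta]

# A wheel-slip case: the wheels report 1.0 m of travel, the lidar only 0.25 m.
slip_label = label_vector(0.2, 0.1, 1.0, 0.5, 0.25, 0.5)
```

A large third component, as in the slip case, is what marks a surface as deformable or low-traction.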


IV-C Network Architecture and Online Training

The network (see Fig. 4) is trained to predict the vector in equation 4 given the image and velocity inputs. The architecture uses a series of 2D convolutions with skip connections and batch normalization on the image input, and several fully connected layers with dropout and batch normalization for the velocity input, as shown in Fig. 4. In the image stream, after the initial convolution operation with ReLU activation, the image is connected to four residual blocks and one linear layer with batch normalization. Since we have limited data when collecting labels and training the network online, we use residual connections to ensure that the network does not overfit the collected data despite its many layers and parameters. Dropout and batch normalization layers also improve generalization capabilities and avoid overfitting.

Figure 4: Our novel two-stream network architecture. One stream encodes image patches (green) and the other stream processes the linear and angular velocities (yellow). The feature embeddings from each stream are concatenated and finally passed into a set of linear layers (blue), and the final predictions are the estimated IMU readings and the odometry errors associated with this image patch.

As the robot autonomously collects sensor data on different surfaces, the inputs and labels are generated and shuffled. The network’s training starts and runs online once sufficiently varied data has been collected. Training takes 20-25 minutes for the five surfaces considered (see Table II).

IV-D Navigability Cost Calculation

Based on the vector predicted by the network ($\hat{\mathbf{l}}$), the navigability cost $c$ for a given RGB patch and velocity vector is computed as the weighted norm of $\hat{\mathbf{l}}$,

$$c = \sqrt{\hat{\mathbf{l}}^\top W \, \hat{\mathbf{l}}}$$

Here, $W$ is a diagonal matrix with positive weights. To make navigability cost predictions on a full-sized RGB image of size $w \times h$, the image is first resized into new dimensions $w' \times h'$ as follows,

$$w' = n \cdot \left\lfloor \frac{w}{n} \right\rceil, \qquad h' = n \cdot \left\lfloor \frac{h}{n} \right\rceil$$

where $\lfloor \cdot \rceil$ is the nearest integer operator. Next, non-overlapping $n \times n$ patches are cropped along the width and height of the image. This batch of patches is passed as input along with a batch containing the input velocity vectors to obtain the navigability cost predictions $c_{i,j}$ for the different regions of the resized image. The predicted costs are then normalized to a fixed range.

Finally, the surface navigability cost map $C$ is constructed by vertically and horizontally concatenating $n \times n$ patches filled with the values of $c_{i,j}$ corresponding to the different regions. $C$ is then resized back to $w \times h$.
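The cost-map construction can be traced with a short, dependency-free sketch. It assumes per-patch label predictions are already available; the helper names, the row-major patch ordering, and the choice of $[0, 1]$ as the normalization range are assumptions of this example.

```python
import math

def nearest_int(x):
    """Nearest-integer operator (halves round up)."""
    return int(math.floor(x + 0.5))

def resized_dims(w, h, n):
    """Image size rounded to the nearest multiple of the patch size n."""
    return n * nearest_int(w / n), n * nearest_int(h / n)

def weighted_norm(label, weights):
    """sqrt(l^T W l) for a diagonal W given as a list of positive weights."""
    return math.sqrt(sum(wt * l * l for wt, l in zip(weights, label)))

def normalize(costs):
    """Min-max scale costs to [0, 1] (range chosen for this sketch)."""
    lo, hi = min(costs), max(costs)
    if hi == lo:
        return [0.0 for _ in costs]
    return [(c - lo) / (hi - lo) for c in costs]

def build_cost_map(patch_costs, rows, cols, n):
    """Tile one n x n block per patch cost (row-major) into a 2D grid."""
    grid = []
    for r in range(rows):
        row_pixels = []
        for c in range(cols):
            row_pixels.extend([patch_costs[r * cols + c]] * n)
        for _ in range(n):
            grid.append(list(row_pixels))
    return grid
```

With the values reported in Section VI-A ($n = 50$, $w = 640$, $h = 480$), `resized_dims(640, 480, 50)` gives a 650 x 500 image, i.e. a 13 x 10 grid of patches under uniform sampling.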

IV-E Non-uniform Patch Sampling

Although using patches reduces the input dimensionality of our network and helps train it faster, it could result in regions with multiple surfaces (boundaries) in the full-sized image having inaccurate costs depending on the patch size. Therefore, we employ a non-uniform patch sampling technique to obtain finer patches in multi-surface regions (boundaries). Conversely, portions with a single surface can be sampled with larger patch sizes. This reduces the total number of patches used for cost prediction on the full-sized image, thus maintaining the method’s inference rate.

IV-E1 Weak Segmentation

To detect regions with multiple surfaces and differentiate them in a scene, we use a weak segmentor which is based on classical image segmentation methods. The weak segmentation method is computationally light and may not have pixel-level precision, but it sufficiently demarcates the boundaries of various terrains in the input image. First, the Sobel edge detector is applied to the grayscale input image of the scene, and the histogram of the result is computed. Next, based on the Bayesian information criterion [schwarz1978estimating], a Gaussian mixture model is fit to the histogram, and the mean of each Gaussian curve is used as a marker/threshold ($m$) to differentiate the regions of pixels with different intensity levels in the image. Finally, the watershed filter [watershed] is applied to highlight the regions of different intensities to obtain the weak segmentation output (see Fig. 2).

IV-E2 Selecting Sampling Patch Size

We consider the patches in the weak segmentation output and the intensity of pixels within them. If a patch larger than $n \times n$ satisfies the condition in equation 7, smaller patches are not considered for cost prediction.

$$N_{\geq m} \geq \tau \tag{7}$$

where $N_{\geq m}$ is the number of pixels in the patch with intensity greater than or equal to the marker $m$, and $\tau$ is a threshold. This condition ensures that when a large patch has a significant number of pixels with the same intensity, implying the presence of a single surface, smaller patches are not used for cost prediction. The larger patch is resized to $n \times n$ before passing into the network.
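One way to picture the patch-size selection is as a recursive split that stops as soon as a region looks like a single surface. The sketch below is an illustration under several assumptions that are not taken from the paper: the quadtree-style halving, the threshold expressed as a fraction `frac` of the patch area, and the function names.

```python
def is_single_surface(patch, marker, threshold):
    """True when enough pixels in the patch reach the marker intensity."""
    count = sum(1 for row in patch for px in row if px >= marker)
    return count >= threshold

def sample_patches(seg, x, y, size, min_size, marker, frac):
    """Return (x, y, size) squares covering the region of the segmented image.

    A square is kept whole if it passes the single-surface test or is already
    at the minimum size; otherwise it is split into four quadrants
    (assumed splitting rule for this sketch).
    """
    patch = [row[x:x + size] for row in seg[y:y + size]]
    threshold = frac * size * size
    if size <= min_size or is_single_surface(patch, marker, threshold):
        return [(x, y, size)]
    half = size // 2
    out = []
    for dy in (0, half):
        for dx in (0, half):
            out.extend(sample_patches(seg, x + dx, y + dy, half, min_size, marker, frac))
    return out

uniform = [[9] * 4 for _ in range(4)]        # one surface everywhere
boundary = [[9, 9, 0, 0] for _ in range(4)]  # two surfaces side by side
```

On `uniform` a single large patch suffices, whereas `boundary` is split into smaller patches, which mirrors the reduction in patch count that keeps inference time low.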

V TerraPN: Navigation

In this section, we explain how the computed surface cost map is used by TerraPN’s navigation to adapt DWA for outdoor navigation and compute dynamically feasible, low surface cost velocities.

V-A Trajectory Navigability Cost

To adapt DWA to outdoor terrains, the trajectory corresponding to a $(v, \omega)$ pair must be associated with a surface cost. The trajectory for a given $(v, \omega)$ pair relative to a coordinate frame attached to the robot’s center (X-axis pointing forward, Y-axis pointing left) is calculated as,

$$traj^{R}(v, \omega) = \{(x_k, y_k)\}_{k=t_0}^{t_0+T}, \quad x_{k+1} = x_k + v \cos(\theta_k)\,\Delta t, \quad y_{k+1} = y_k + v \sin(\theta_k)\,\Delta t, \quad \theta_{k+1} = \theta_k + \omega\,\Delta t$$

Here, $t_0$ is the initial time instant and $T$ is the number of time steps used for extrapolating the trajectory. This trajectory is then transformed relative to the camera frame attached to the robot using a homogeneous transformation matrix $H^{C}_{R}$ as $traj^{C} = H^{C}_{R} \cdot traj^{R}$. Next, the trajectory is converted to correspond to the image/pixel coordinates of the cost map, i.e., $traj^{I}$, using the camera’s intrinsic parameters. The navigability cost for a velocity pair can be then computed from $traj^{I}$ as,

$$sur(v, \omega) = \sum_{(x, y) \in traj^{I}(v, \omega)} cost(x, y)$$

Here, cost() is the surface cost at a given pixel’s coordinates.
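The trajectory cost can be illustrated by rolling out a constant velocity pair and summing the cost-map values it passes over. In this sketch the unicycle rollout is an assumed motion model, and `to_pixel` is a hypothetical stand-in for the homogeneous transformation and camera-intrinsics projection, which depend on the actual robot and camera.

```python
import math

def rollout(v, w, steps, dt):
    """Extrapolate (x, y) points in the robot frame (X forward, Y left)
    under a constant (v, w), using a simple unicycle model."""
    x = y = theta = 0.0
    points = []
    for _ in range(steps):
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += w * dt
        points.append((x, y))
    return points

def trajectory_cost(points, to_pixel, cost_map):
    """Sum the cost-map values under the projected trajectory points.

    to_pixel maps a robot-frame (x, y) to a (row, col) pixel index;
    points that project outside the map are skipped.
    """
    total = 0.0
    for x, y in points:
        row, col = to_pixel(x, y)
        if 0 <= row < len(cost_map) and 0 <= col < len(cost_map[0]):
            total += cost_map[row][col]
    return total
```

For instance, with a toy one-row cost map and `to_pixel = lambda x, y: (0, int(x))`, a straight rollout accumulates the cost of each cell it enters.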

V-B Variable Acceleration Limits

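Robot navigation methods typically consider a constant range of linear ($[-\dot{v}_{max}, \dot{v}_{max}]$) and angular ($[-\dot{\omega}_{max}, \dot{\omega}_{max}]$) accelerations. Our formulation varies the linear and angular acceleration limits available for planning depending on the properties of the surface on which the robot is traversing such that $\dot{v} \in [-\dot{v}_{max}, \dot{v}_{lim}]$ and $\dot{\omega} \in [-\dot{\omega}_{max}, \dot{\omega}_{lim}]$, with $\dot{v}_{lim} \leq \dot{v}_{max}$ and $\dot{\omega}_{lim} \leq \dot{\omega}_{max}$.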

This is done because, intuitively, the robot accelerating on a smooth surface (e.g., concrete, asphalt) would lead to a low navigability cost. Therefore, the robot can proceed towards its goal faster. Whereas on a bumpy surface or one with poor traction, (e.g., tiled surface, dry leaves), accelerating would lead to high vibrations and the risk of getting stuck (e.g. mud). We do not limit the maximum deceleration available to the robot since it may have to slow down to avoid obstacles or while moving on a rough surface.

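First, we divide the trajectory corresponding to the robot’s current $(v, \omega)$ into two halves and calculate the cost for the second half as follows,

$$sur_{half}(v, \omega) = \sum_{k = t_0 + T/2}^{t_0 + T} cost(x_k, y_k)$$

where $(x_k, y_k)$ are the trajectory’s pixel coordinates, $t_0$ is the initial time instant, and $T$ is the number of extrapolation time steps. We limit the robot’s maximum linear and angular accelerations ($\dot{v}_{lim}$, $\dot{\omega}_{lim}$) as a decreasing function of this navigability cost.

If $sur_{half}$ is low (low-cost surface), the robot is allowed to accelerate towards its goal, while a high $sur_{half}$ restricts the robot from speeding up. Considering only the second half of the trajectory also implicitly accounts for transitions between surfaces. Using these acceleration limits, a new dynamic window $V_d'$ is constructed. The new resultant search space is calculated as $V_r' = V_s \cap V_a \cap V_d'$, where $V_s$ and $V_a$ are DWA’s velocity search space and admissible velocity set.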
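The effect of cost-dependent acceleration limits on the dynamic window can be sketched as follows. The exact mapping from cost to limit is not reproduced here; the linear scaling `1 - cost`, the use of a mean (rather than a sum) so that the cost stays in $[0, 1]$, and all names are assumptions made purely for illustration.

```python
def second_half_cost(costs_along_traj):
    """Mean surface cost over the second half of a trajectory
    (mean chosen so that normalized costs stay in [0, 1])."""
    tail = costs_along_traj[len(costs_along_traj) // 2:]
    return sum(tail) / len(tail)

def acceleration_limits(cost, acc_v_max, acc_w_max):
    """Shrink the allowed accelerations as the cost grows (assumed linear rule)."""
    scale = max(0.0, min(1.0, 1.0 - cost))
    return scale * acc_v_max, scale * acc_w_max

def terrain_dynamic_window(v, w, v_max, w_max, acc_v_max, acc_w_max, cost, dt):
    """Dynamic window with cost-limited acceleration.

    Linear deceleration keeps the full limit acc_v_max so the robot can
    always slow down; the angular window is simply narrowed symmetrically,
    which is a simplification of this sketch.
    """
    acc_v, acc_w = acceleration_limits(cost, acc_v_max, acc_w_max)
    v_lo = max(0.0, v - acc_v_max * dt)
    v_hi = min(v_max, v + acc_v * dt)
    w_lo = max(-w_max, w - acc_w * dt)
    w_hi = min(w_max, w + acc_w * dt)
    return v_lo, v_hi, w_lo, w_hi
```

On a maximum-cost surface the upper velocity bound collapses to the current velocity, so the robot can hold speed or brake but cannot speed up.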

V-C Optimization

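Finally, a $(v^*, \omega^*) \in V_r'$ which maximizes the objective function

$$G'(v, \omega) = \alpha \cdot head(.) + \beta \cdot dist(.) + \gamma \cdot vel(.) - \delta \cdot sur(.) \tag{14}$$

is chosen. Here, $\alpha, \beta, \gamma, \delta$ are weights for each component.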

Lemma V.1.

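TerraPN’s navigation computes collision-free, dynamically feasible trajectories with surface costs that are always less than or equal to the surface cost of DWA’s trajectory.

Proof. Let $G$ and $G'$ denote the objective functions in equations 1 and 14, let $(v_D, \omega_D)$ be the velocity pair that maximizes $G$, and let $(v_T, \omega_T)$ be the pair that maximizes $G'$. From equations 1 and 14, we get $G'(v, \omega) = G(v, \omega) - \delta \cdot sur(v, \omega)$, where $sur(v, \omega)$ is the trajectory’s surface cost and $\delta$ is its weight. We know that $G'(v_T, \omega_T) \geq G'(v_D, \omega_D)$. Rearranging the terms, we get $G(v_T, \omega_T) - G(v_D, \omega_D) \geq \delta \left( sur(v_T, \omega_T) - sur(v_D, \omega_D) \right)$.

Since $(v_D, \omega_D)$ maximizes $G$, in the LHS, $G(v_T, \omega_T) - G(v_D, \omega_D) \leq 0$, and therefore $sur(v_T, \omega_T) \leq sur(v_D, \omega_D)$. This inequality holds since $\delta > 0$. The dynamic feasibility of TerraPN’s velocities follows from the fact that the acceleration limits never exceed the robot’s maximum accelerations, so the new dynamic window is contained in DWA’s. ∎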
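The inequality in the lemma can be checked numerically on a toy example. The sketch assumes, as the proof implicitly does, that both objectives are maximized over the same candidate set; the base objective and the per-candidate surface costs below are made-up illustrative values.

```python
def select(candidates, base_score, sur, delta):
    """Pick the (v, w) pair maximizing base_score - delta * sur."""
    return max(candidates, key=lambda c: base_score(*c) - delta * sur(*c))

# Toy candidate set with hypothetical surface costs per velocity pair.
candidates = [(0.2, 0.0), (0.6, 0.0), (0.6, 0.3)]
toy_sur = {(0.2, 0.0): 0.1, (0.6, 0.0): 0.9, (0.6, 0.3): 0.2}

def base(v, w):
    """Stand-in for DWA's objective: prefers fast, straight motion."""
    return v - 0.1 * abs(w)

def sur(v, w):
    return toy_sur[(v, w)]

dwa_choice = select(candidates, base, sur, delta=0.0)      # plain DWA
terrapn_choice = select(candidates, base, sur, delta=1.0)  # with surface cost term
```

Here plain DWA drives straight over the costly surface, while the cost-aware objective picks the slightly curved path whose surface cost is lower, consistent with the lemma.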

A complete pseudocode of our algorithm can be found in [terrapn-arxiv].

VI Results and Evaluations

We detail our method’s implementation, our evaluation metrics, the different environments we tested in, and compare with other methods in this section.

VI-A Implementation

Our self-supervised learning network is implemented using Tensorflow. A Clearpath Husky robot mounted with a VLP-16 lidar and Intel Realsense camera is used to evaluate our method in real-world scenarios. The lidar is only used for the initial lidar-based odometry data collection and to provide 2D scans for detecting obstacles for navigation. For processing, the robot is equipped with a laptop with an Intel i9 processor and Nvidia RTX2080 GPU.

In our formulation, we use n = 50, w = 640, h = 480. The velocity and acceleration limits are set according to the Husky’s limits. We train our cost map prediction on 5 surfaces: 1. concrete, 2. tiles, 3. grass, 4. asphalt, and 5. fallen yellow leaves. Each surface offers different levels of traction, bumpiness and deformability.

VI-B Navigation Evaluation Metrics

We use the following metrics for our comparisons.
Success Rate - The number of times the robot reached its goal while avoiding getting stuck or colliding over the total number of trials.

Normalized Trajectory length - The robot’s trajectory length normalized over the straight-line distance to the goal for all the successful trials.

Vibration Cost - The summation of the vertical motion’s gradient experienced by the robot along its trajectory.

Mean Velocity - The robot’s average velocity along its trajectory as it traverses various surfaces.
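The four metrics can be computed from logged trial data roughly as follows. This is a sketch under stated assumptions: the vibration cost is interpreted as the sum of absolute finite differences of a vertical-motion signal, and the input formats (lists of booleans, 2D points, samples) are chosen for the example rather than taken from the paper.

```python
import math

def success_rate(outcomes):
    """Percentage of trials (True = goal reached without stuck/collision)."""
    return 100.0 * sum(1 for ok in outcomes if ok) / len(outcomes)

def path_length(points):
    """Total length of a polyline given as (x, y) points."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def normalized_trajectory_length(points, goal):
    """Path length divided by the straight-line start-to-goal distance."""
    sx, sy = points[0]
    return path_length(points) / math.hypot(goal[0] - sx, goal[1] - sy)

def vibration_cost(vertical_signal, dt):
    """Sum of absolute gradients of the vertical motion (assumed definition)."""
    return sum(abs(b - a) / dt for a, b in zip(vertical_signal, vertical_signal[1:]))

def mean_velocity(speeds):
    """Average speed over the samples logged along the trajectory."""
    return sum(speeds) / len(speeds)
```

A normalized trajectory length of 1 corresponds to a straight-line path, which is why values close to 1 are preferred in Table I.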

We compare our method with DWA [DWA], and TERP [weerakoon2021terp] which navigates the robot based on the elevation maps in the environment. We also compare with two methods that use OCRNet [ocrnet], and PSPNet [pspnet] for semantic segmentation and the Dijkstra’s algorithm for waypoint computation and navigation. OCRNet and PSPNet classify different surfaces into several discrete classes based on their degree of navigability. Different cost values are associated with each class and used for waypoint computation using Dijkstra’s algorithm. The OCRNet and PSPNet networks were trained on the RELLIS-3D outdoor dataset which contains all the surfaces TerraPN’s cost prediction network is trained for.

VI-C Testing Scenarios

We evaluate and compare TerraPN’s navigation with prior methods in five scenarios. We characterize each scenario’s difficulty based on the number of surfaces, the level of unstructuredness, and whether they were previously seen during TerraPN’s cost prediction training.
Scenario 1: Two trained surfaces (concrete and grass). See Fig. 6a.

Scenario 2: Three surfaces (concrete, asphalt and rocks) with one untrained surface (rocks). See Fig. 6b.

Scenario 3: Four surfaces (tiles, concrete, mud, grass) with one untrained surface (mud) where the robot could get stuck. See Fig. 6c.

Scenario 4: Four surfaces (asphalt, rocks, discolored grass, unstructured dry brown leaves with mud). The grass and dry leaves surfaces had undergone considerable seasonal changes. See Fig. 1.

Scenario 5: Untrained, highly unstructured surface with dry brown leaves, broken branches, mud, etc. See Fig. 6d.

VI-D Analysis and Comparisons

Figure 5: Surface cost map predictions for different scenarios with 2-3 surfaces. We consider non-uniformly sampled patches on the RGB images and feed them into our network along with the robot’s velocities. We observe that surfaces are differentiated well based on their navigability (dark blue denotes low cost and therefore better navigable surface and lighter, yellow shades denote high cost). We observe that our cost predictions are fairly accurate even for regions not seen during training (person’s legs, and rocks).
Figure 6: Robot trajectories when navigating in different unstructured terrains using our method, TerraPN (green, yellow, red corresponding to different speed levels), DWA [DWA] (blue), TERP [weerakoon2021terp] (orange), OCRNet-based (pink), PSPNet-based (purple) navigation schemes. (a) Scenario 1; (b) Scenario 2; (c) Scenario 3; (d) Scenario 5. We observe that TerraPN generates relatively shorter or comparable trajectories which navigate the robot on low-cost surfaces (concrete, asphalt, etc). Further, TerraPN varies the velocity appropriately when the robot encounters a relatively rough surface such as grass, rocks, dry leaves, or mud, while other methods always navigate the robot at its maximum speed on all surfaces. This leads to high vibrations and odometry errors in the robot. In (d), TerraPN moves slowly (red) throughout its progress towards its goal. TERP takes a much longer trajectory to the goal based on elevation changes. All other methods fail to reach the goal either due to incorrect segmentation (OCRNet, PSPNet) or high odometry errors (DWA).

| Metric | Method | Scenario 1 | Scenario 2 | Scenario 3 | Scenario 4 |
|---|---|---|---|---|---|
| Success Rate (%) (higher is better) | DWA [DWA] | 100 | 70 | 79 | 53 |
| | TERP [weerakoon2021terp] | 100 | 77 | 78 | 61 |
| | OCRNet [ocrnet] | 94 | 72 | 73 | 58 |
| | PSPNet [pspnet] | 92 | 75 | 72 | 56 |
| | TerraPN | 100 | 88 | 85 | 72 |
| Norm. Traj. Length (close to 1 is better) | DWA [DWA] | 0.965 | 1.001 | 1.076 | 1.141 |
| | TERP [weerakoon2021terp] | 0.996 | 1.271 | 1.158 | 1.229 |
| | OCRNet [ocrnet] | 1.389 | 1.084 | 1.403 | 1.287 |
| | PSPNet [pspnet] | 1.241 | 1.066 | 1.524 | 1.369 |
| | TerraPN | 1.147 | 1.034 | 1.127 | 1.261 |
| Vibration Cost (lower is better) | DWA [DWA] | 2.334 | 1.678 | 1.518 | 3.652 |
| | TERP [weerakoon2021terp] | 1.199 | 1.279 | 1.627 | 4.156 |
| | OCRNet [ocrnet] | 0.893 | 2.115 | 1.393 | 4.378 |
| | PSPNet [pspnet] | 0.967 | 2.384 | 1.424 | 4.456 |
| | TerraPN | 0.766 | 1.329 | 1.166 | 2.886 |
| Mean Velocity (lower is better) | DWA [DWA] | 0.581 | 0.564 | 0.531 | 0.542 |
| | TERP [weerakoon2021terp] | 0.544 | 0.506 | 0.525 | 0.522 |
| | OCRNet [ocrnet] | 0.561 | 0.532 | 0.529 | 0.515 |
| | PSPNet [pspnet] | 0.548 | 0.526 | 0.509 | 0.513 |
| | TerraPN | 0.347 | 0.336 | 0.271 | 0.296 |

Table I: Relative performance of our method TerraPN compared to other methods on various metrics. We observe that TerraPN leads to the highest success rates in all the scenarios. Further, TerraPN results in vibration cost decrease up to 21.52%, mean velocity reduction up to 46.76%, and shorter or comparable trajectory lengths than the other segmentation methods in most of the scenarios. DWA and TERP take shorter trajectories in some cases since they directly move towards the goal in the absence of obstacles and elevation changes, without considering surface properties.
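
The mean-velocity figure quoted in the caption can be checked against the table entries. The short Python sketch below recomputes the relative reduction for Scenario 3; the helper name `relative_reduction` is ours and is not taken from the paper. Under this reading, the 46.76% figure appears to correspond to the comparison with PSPNet in Scenario 3.

```python
def relative_reduction(baseline: float, ours: float) -> float:
    """Return how much lower `ours` is than `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline

# Mean velocities for Scenario 3, copied from Table I.
baselines = {"DWA": 0.531, "TERP": 0.525, "OCRNet": 0.529, "PSPNet": 0.509}
terrapn = 0.271

# Relative reduction of TerraPN's mean velocity against each baseline.
reductions = {
    name: round(relative_reduction(value, terrapn), 2)
    for name, value in baselines.items()
}
print(reductions)
```

The same helper applies to any other "lower is better" row of the table.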

| Method | Inference Time (sec) | Training Time |
|---|---|---|
| OCRNet | 0.052 | 10 hrs 5 mins |
| PSPNet | 0.045 | 2 hrs 47 mins |
| CGNet | 0.015 | 9 hrs 53 mins |
| TerraPN-50 | 0.055 | 20-25 mins |
| TerraPN-non-uniform | 0.029 | 20-25 mins |

Table II: TerraPN’s inference and training times compared with existing semantic segmentation methods, all executed on a laptop with an Nvidia RTX 2080 GPU. We observe that TerraPN’s non-uniform sampling approach reduces the inference time significantly compared to PSPNet, OCRNet, and uniform sampling of patches (TerraPN-50). Although CGNet outperforms all the other methods in terms of inference time, it needs hours of training to achieve satisfactory segmentation accuracy. TerraPN, in contrast, achieves cost prediction results suitable for outdoor navigation within minutes of training for 5 different surfaces.

Vi-D1 Cost Map Prediction Performance

Fig. 5 shows the results of our cost map prediction in scenarios with various surfaces, both seen during training (tiles, grass) and unseen (soil, rocks, human obstacles). For unseen surfaces, our network typically predicts higher costs for navigation, which results in cautious, slow trajectories near such surfaces. We observe a clear differentiation of the surfaces (blue denoting low-cost surfaces) based on the robot’s current velocity. Additionally, non-uniform patch sampling uses smaller patches in certain regions of interest such as the boundaries between different surfaces, and even over a pedestrian’s shoes. In all other regions, larger patches are used, achieving a good balance between the fineness of the cost map and computation time.
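
To make the idea of non-uniform patch sampling concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a quadtree-style subdivision in which a patch is split into four whenever it contains a strong intensity edge. The boundary detector (plain gradient magnitude), the threshold value, and the function names `edge_strength` and `sample_patches` are all illustrative assumptions; only the side lengths 50, 100, and 200 come from the text.

```python
from collections import Counter

import numpy as np


def edge_strength(gray: np.ndarray) -> np.ndarray:
    """Gradient magnitude, used here as an ASSUMED stand-in for boundary detection."""
    gy, gx = np.gradient(gray.astype(float))
    return np.hypot(gx, gy)


def sample_patches(gray, max_side=200, min_side=50, threshold=20.0):
    """Return (top, left, side) patches; patches containing an edge are subdivided."""
    edges = edge_strength(gray)
    height, width = gray.shape
    patches = []

    def split(top, left, side):
        region = edges[top:top + side, left:left + side]
        if region.size == 0:  # patch lies entirely outside the image
            return
        if side > min_side and region.max() > threshold:
            half = side // 2
            for d_top in (0, half):
                for d_left in (0, half):
                    split(top + d_top, left + d_left, half)
        else:
            patches.append((top, left, side))

    for top in range(0, height, max_side):
        for left in range(0, width, max_side):
            split(top, left, max_side)
    return patches


# Toy image: a dark strip (50 px wide) next to a bright surface.
image = np.full((400, 400), 255, dtype=np.uint8)
image[:, :50] = 0
patches = sample_patches(image)
counts = Counter(side for _, _, side in patches)
print(len(patches), dict(counts))
```

On this toy image the sketch produces 22 patches (16 of side 50 around the boundary, 4 of side 100, and 2 of side 200), compared with 64 patches for a uniform 50-pixel grid. Fewer patches mean fewer network evaluations per frame, which is the effect described above.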

Vi-D2 Navigation Performance

We evaluate TerraPN’s navigation performance both quantitatively (Table I) and qualitatively (Figs. 1 and 6) in our test scenarios. We do not report the metrics for scenario 5 since the other methods mostly failed to reach the goal. Quantitatively, we observe that TerraPN achieves the highest success rate in every scenario (tied with DWA and TERP at 100% in scenario 1). Other methods failed to reach the goal by traversing surfaces where the robot got stuck, such as rocks (scenario 2) or mud (scenarios 3 and 4). In the cases of OCRNet and PSPNet, the trajectories also wandered away from the goal due to the misclassification of certain surfaces as forbidden.

In terms of trajectory lengths, TerraPN’s trajectories are comparable to or shorter than those of the navigation methods based on OCRNet [ocrnet] and PSPNet [pspnet], whose trajectories wander as explained above. DWA and TERP take shorter trajectories in some cases since they move directly towards the goal in the absence of obstacles and major elevation changes, without considering surface variations. However, in the cases where the normalized length is less than 1, DWA and TERP wrongly assumed that the robot had reached its goal because of odometry errors caused by wheel slip while navigating at high velocities on high-cost surfaces. Although TerraPN’s trajectories deviate to avoid high-cost surfaces, their normalized lengths are close to one. This is due to the influence of the head() term in the objective function.
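
As an illustration of why a normalized length below 1 signals a premature stop, here is a minimal sketch. It assumes that the metric is the travelled path length divided by the straight-line start-to-goal distance, which is a common definition but is not spelled out in this section. The function names and the toy trajectories are hypothetical.

```python
import math


def path_length(points):
    """Sum of Euclidean distances between consecutive (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


def normalized_length(points, start, goal):
    """Path length divided by the straight-line start-to-goal distance (ASSUMED definition)."""
    return path_length(points) / math.dist(start, goal)


start, goal = (0.0, 0.0), (10.0, 0.0)

# A detour around a high-cost surface that still ends at the goal.
detour = [(0.0, 0.0), (5.0, 1.0), (10.0, 0.0)]

# A straight run that stops 0.5 m short, e.g. because wheel slip inflated the odometry.
stopped_short = [(0.0, 0.0), (9.5, 0.0)]

detour_ratio = normalized_length(detour, start, goal)
short_ratio = normalized_length(stopped_short, start, goal)
print(round(detour_ratio, 3), round(short_ratio, 3))
```

Under this definition, any trajectory that actually reaches the goal has a ratio of at least 1, so a value below 1 can only arise when the robot stops before the true goal.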

TerraPN also outperforms the other methods in terms of vibration costs in three scenarios, since its trajectories stayed on smooth, low-cost surfaces as much as possible (see Figs. 1 and 6) at appropriate speeds. In scenario 2, TERP outperforms TerraPN due to the presence of concrete curbs. TerraPN could not distinguish the elevation change between the ground and the curb, and moved over its edge near the rocks (see Fig. 6b) in some trials. Since TERP uses elevation maps to compute cost maps, it takes an overly conservative trajectory to avoid the curb. Although OCRNet and PSPNet possess semantic information about the terrains, they have higher vibration costs than TerraPN due to their longer, meandering trajectories.

In terms of mean velocity, we observe that TerraPN varies the robot’s speed when navigating on different surfaces in the various scenarios (see the yellow and red trajectories in Figs. 1 and 6). All other methods move towards the robot’s goal close to its maximum velocity (0.6 m/s). TerraPN’s overall lower mean velocity varies depending on the number and type of surfaces in the scenario (which is also reflected in the vibration cost). The regions where TerraPN slowed down the robot can be seen in Figs. 1 and 6. In the highly unstructured scenario 5, only TERP and TerraPN reached the robot’s goal. TerraPN moved at a slow, cautious speed towards the goal, while TERP took a slow, winding path based on the elevation changes it sensed in the terrain. DWA moved fast and directly towards the goal, and stopped short of it due to odometry errors caused by the surface’s poor traction.

Vi-D3 Inference and Training Times

Table II compares TerraPN’s cost prediction inference time with various semantic segmentation methods trained for outdoor environments. With uniform sampling of patches (TerraPN-50), TerraPN’s inference time is comparable to OCRNet and PSPNet. However, our non-uniform patch sampling, which uses patches of side lengths 50, 100, and 200, reduces the inference time by 47.27% compared to uniform sampling (0.029 s versus 0.055 s). For comparison, we also include CGNet [cgnet], an extremely lightweight segmentation network in terms of parameter count.

However, TerraPN’s cost prediction network trains in minutes (20-25 minutes) to achieve a performance suitable for navigation, compared to the hours needed to train the other networks to a satisfactory segmentation accuracy. Additionally, using TerraPN’s cost map leads to superior navigational performance, as observed before. Additional results and analysis can be found in [terrapn-arxiv].
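
The inference-time comparison can be reproduced from the Table II entries with a few lines of Python. The helper names below are ours, and the frames-per-second figure assumes one cost map per frame with no other per-frame overhead, which is a simplification.

```python
# Inference times in seconds, copied from Table II.
inference_time = {
    "OCRNet": 0.052,
    "PSPNet": 0.045,
    "CGNet": 0.015,
    "TerraPN-50": 0.055,
    "TerraPN-non-uniform": 0.029,
}


def percent_faster(reference: str, method: str = "TerraPN-non-uniform") -> float:
    """Relative reduction in inference time of `method` versus `reference`, in percent."""
    ref, new = inference_time[reference], inference_time[method]
    return round(100.0 * (ref - new) / ref, 2)


def frames_per_second(method: str) -> float:
    """Upper bound on throughput if inference were the only per-frame cost."""
    return round(1.0 / inference_time[method], 1)


for name in ("TerraPN-50", "OCRNet", "PSPNet", "CGNet"):
    print(name, percent_faster(name))
print(frames_per_second("TerraPN-non-uniform"))
```

A negative value, as for CGNet, indicates that the reference method is faster than non-uniform TerraPN, in line with the observation that CGNet has the lowest inference time.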

Vii Conclusions, Limitations and Future Work

We present TerraPN, a novel approach that uses self-supervised learning to identify the surface characteristics in complex outdoor environments to perform autonomous robot navigation. Our method incorporates RGB images of surfaces and the robot’s velocities as inputs, and the IMU vibrations and odometry errors encountered by the robot while traversing on a surface as labels for self-supervision. The trained self-supervised network outputs a surface navigability cost map that differentiates smooth, high-traction surfaces from bumpy, deformable surfaces. We introduce a novel navigation algorithm that accounts for the surface cost, computes cost-based acceleration limits for the robot, and computes dynamically feasible and collision-free trajectories. We validate our approach on real-world unstructured terrains and compare it with the state-of-the-art navigation techniques on various navigation metrics.

Our approach has a few limitations. TerraPN must navigate on a surface to estimate the corresponding navigability costs, so this strategy cannot be applied to completely non-traversable surfaces (e.g. swampy terrains). Significant lighting changes in the environment could adversely affect the surface cost prediction. Further analysis of the number of surfaces our method can learn effectively must be conducted. TerraPN cannot distinguish subtle elevation changes in the same or similar surfaces because depth and elevation inputs are unavailable to the system. To this end, an intelligent multi-sensor fusion-based approach could be utilized in the future to identify both surface and geometric properties in challenging outdoor conditions.