Navigation-Oriented Scene Understanding for Robotic Autonomy: Learning to Segment Driveability in Egocentric Images

by   Galadrielle Humblot-Renaux, et al.
Aalborg University

This work tackles scene understanding for outdoor robotic navigation, solely relying on images captured by an on-board camera. Conventional visual scene understanding interprets the environment based on specific descriptive categories. However, such a representation is not directly interpretable for decision-making and constrains robot operation to a specific domain. Thus, we propose to segment egocentric images directly in terms of how a robot can navigate in them, and tailor the learning problem to an autonomous navigation task. Building around an image segmentation network, we present a generic and scalable affordance-based definition consisting of 3 driveability levels which can be applied to arbitrary scenes. By encoding these levels with soft ordinal labels, we incorporate inter-class distances during learning which improves segmentation compared to standard one-hot labelling. In addition, we propose a navigation-oriented pixel-wise loss weighting method which assigns higher importance to safety-critical areas. We evaluate our approach on large-scale public image segmentation datasets spanning off-road and urban scenes. In a zero-shot cross-dataset generalization experiment, we show that our affordance learning scheme can be applied across a diverse mix of datasets and improves driveability estimation in unseen environments compared to general-purpose, single-dataset segmentation.




I Introduction

A robot left to roam “in the wild” needs to know where it can go, and what to avoid. This is especially challenging when navigating outdoors and exclusively relying on images captured by an on-board camera to interpret the scene: the robot may traverse vast areas with unfamiliar terrain, unexpected obstacles or challenging environmental conditions which degrade its view, yet should still be able to identify safe and suitable terrain to drive on. In this work, we tackle the first step in a vision-based autonomous navigation pipeline: interpreting raw images capturing a robot’s immediate surroundings into useful representations for decision-making. We note that image-based outdoor scene understanding remains widely tackled as a domain-specific object recognition task with the aim of maximizing performance on individual benchmarks [38, 39]. These object-based approaches make implicit assumptions about the environment content and structure (such as the presence of a road or surrounding traffic), scaling poorly to unseen obstacles or unstructured scenes [20]. Relying on such descriptive scene representations also requires an over-arching logic to determine where a robot can safely drive, and involves fine-grained recognition of objects which are simply not relevant for control (aside from the fact that they should be avoided).

Bypassing these concerns entirely, end-to-end “behaviour reflex” approaches propose to directly regress the vehicle steering angle from a camera capture [37]. However, such a level of abstraction cannot capture the ambiguity or complexity of driving decisions (e.g. in case of a cross-road or open area) [8, 3], and has shown serious limitations when faced with an increasing number of dynamic obstacles [9]. In fact, a recent in-depth study on vision for robotic action demonstrates the importance of equipping agents with explicit visual representations for urban and off-road navigation, with semantic scene segmentation in particular bringing significant improvement in task performance and generalization [46]. Boiling down the number of semantic classes to those essential for navigation has also shown promising results in urban driving tasks [3, 28, 4].

Building on these considerations and findings, our aim is to learn a compact, task-oriented scene representation for navigation which retains spatial detail for planning while abstracting away from appearance-based descriptions. We argue that for open-world navigation outside of constrained, predictable environments, visual scene understanding for robotics cannot simply be reduced to learning specific object classes on a single dataset. We draw from the concept of visual affordance, which interprets a scene directly in terms of potential action rather than descriptive labels. This concept has led to interesting developments in robotic perception, improving generalization and decision making in a wide range of tasks [1]. However, the bulk of work in this direction pertains to object manipulation or human-robot interaction, and is largely limited to indoor scenes [16]. Existing approaches carrying affordances to the outdoors often rely on pre-defined visual cues which do not apply outside of a highly constrained traffic scenario [8], or on real-world exploration [18], a time- and resource-intensive process which remains impractical outside of static and small-scale off-road environments due to the necessity of driving and experiencing collisions for learning.

In contrast, we approach visual affordance learning as a fully supervised image segmentation problem, leveraging the abundance of large-scale scene understanding datasets. Given a single RGB image captured on-board a robot or vehicle, we propose to interpret the scene at the pixel-level in terms of driveability - an affordance which is both useful for robotic decision-making, and generic to enable task-oriented perception in a wide range of scenes with a single model. We present a deep image segmentation framework which learns to segment driveability from a combination of existing datasets without the need for additional labelling or robotic exploration, and without constraining perception to a single domain. Our contributions can be summarized as follows:

  1. a driveability affordance which does not make any assumptions about the types of obstacles in the scene or the presence of an explicit path, widening the learning problem beyond structured, predictable landscapes;

  2. a soft ordinal label encoding which incorporates the ambiguity and order between levels of driveability during learning, with some areas in the image being more driveable than others;

  3. a loss weighting scheme which, rather than treating all pixels as equally important for navigation, concentrates learning in safety-critical areas while allowing leniency around object outlines and distant scene background;

  4. a challenging experimental procedure: beyond same-dataset testing, we evaluate the generalizability of our approach on three unseen datasets, including the WildDash benchmark [44] which captures a large variety of difficult driving scenarios across 100+ countries.

This navigation-oriented learning scheme is, to the best of our knowledge, the first attempt at incorporating an inter-class ranking in a scene understanding task, taking into account both the type and location of mistakes during learning to improve task-oriented pixel classification.

II Related work

II-A Functional image segmentation for navigation

Image segmentation for autonomous driving in urban scenes is extensively studied [17]. However, task-driven approaches pre-suppose an explicit path in the scene. For instance, the “path proposal” method in [3] shows remarkable results along city streets, but fails in ambiguous intersections or open areas. Furthermore, reducing navigation-oriented scene understanding to a road vs. non-road problem [38] does not capture the varying degrees of driveability which may be relevant for robotic applications traversing diverse terrain [18]. Outside of an urban landscape, existing works predicting dense navigational affordance from RGB images remain scarce. Among them, [30] shows how pixel-wise navigability affordance maps can be learned in dynamic environments to remarkably improve goal-directed planning - however, the agent is contained in a first-person shooter simulator in which exposure to potential hazards through trial and error is acceptable. Out in the real world, [33] and more recently [23] demonstrate how robotic affordances can be learned at the pixel-level from RGB images in a fully supervised manner with ground truth segmentation masks. These works are similar to our approach in that they generate affordance labels from existing object labels - however, they only consider indoor environments which are largely structured, static and well-lit. In contrast, our approach considers challenging outdoor scenes and applies task-specific strategies for learning a single type of affordance.

II-B Soft ordinal segmentation

In a classification task, instead of encoding ground truth labels as a one-hot vector, they can be softened to describe class membership probabilities. In their most generic form, soft labels can be generated via label smoothing, where the ground truth is taken as a weighted average between the original one-hot target and a uniform distribution across all classes - in [36], the authors show this to have a regularizing effect, discouraging high-confidence predictions, and thus aiding generalization. However, uniform label smoothing treats each incorrect class as equally probable. More interestingly, soft labelling has been used to express known relationships between classes (e.g. similarity [45] or hierarchy [11, 5]) or to capture natural ambiguity in the data (e.g. at the borders of segmentation masks [15] or due to inconsistent/subjective labels from multiple annotators [24]).

Soft labelling schemes have successfully been applied to semantic segmentation in prior work [13, 15]; however, these works only consider a binary classification case, and generate pixel labels not based on inter-class relations, but on spatial location, to capture ambiguity along object boundaries. Conversely, works which demonstrate the use of soft labels for ordinal classification [13, 11, 22] apply it to other tasks such as full-image ranking (e.g. age estimation, aesthetic quality prediction or medical diagnosis) or pixel-wise regression (e.g. depth estimation). To the best of our knowledge, the use of soft labels for ordinal scene segmentation has not yet been explored.

II-C It’s (not) all in the details

Pixels along object contours are highly ambiguous and thus the most difficult to learn: in semantic segmentation state-of-the-art, refinement of scene detail is therefore typically of primary concern [21]. However, our aim is entirely different: we argue that for navigation, sharp, precise contours are not necessary to achieve a functional understanding of the scene, therefore rather than getting lost in detail, we focus learning away from boundary pixels, towards areas in the camera’s immediate vicinity. Our approach for selectively emphasizing relevant pixels during learning is largely inspired by the loss weight maps proposed in [32], designed to refine segmentation outlines in medical images. We adapt its formulation to our task, and introduce a notion of image depth to distinguish between close-range and background scene elements. Furthermore, unlike [32], our work includes quantitative and qualitative evaluation of the proposed loss weighting scheme compared to a standard uniformly-weighted approach.

II-D Cross-dataset learning

Training models across a combination of heterogeneous datasets has been shown to improve generalization to unseen domains and challenging benchmarks in image segmentation [20] and other pixel-wise prediction tasks [31]. However, inconsistent object-based labelling across semantic segmentation datasets prevents training data from being naively combined. [20] addresses this issue by building a detailed unified taxonomy spanning indoor and outdoor scenes. In contrast, rather than increasing the label space to accommodate an ever-increasing number of object labels, we bridge the gap across datasets by mapping narrow descriptive classes to a much broader and functional class definition.

III Approach

Our approach primarily revolves around how we generate soft pixel-wise driveability labels for training, and how we scale the loss on a per-pixel basis during learning. First, we describe how we map pixel annotations from existing semantic segmentation datasets to a 3-level driveability affordance. We then explain how these affordance labels are softened to model the levels’ hierarchy. Lastly, we present our loss weighting scheme which selectively emphasizes the areas most relevant to navigation.

III-A From object semantics to driveability labels

Loosely inspired by the broad semantic classes defined in [3] for urban scene understanding and the robotic action plausibility ratings proposed in [24], we define three classes to characterize the driveability level of a pixel:

  • Preferable: where we expect the robot to drive;

  • Possible, but not preferable: areas which are technically navigable but more challenging or less suitable, and would not be chosen as a first resort;

  • Impossible or undesirable: any part of the scene which is unreachable (e.g. the sky) or should be unconditionally avoided (obstacles, hazardous terrain).

Fig. 1: Example of a pixel-annotated outdoor scene from the IDD dataset [41] (top), with panels (a) original object labels, (b) driveability labels and (c) loss weight map. We map the original classes to driveability levels (Section III-A), and compute a pixel-wise loss weight map from the ground truth mask (Section III-C).

Fig. 2: Label class probabilities with a standard one-hot encoding (left) vs. the SORD labelling scheme (right).

Figure 1(b) shows how an outdoor scene can be segmented according to this definition. Unlike other types of visual affordances which often inherit the challenges of multi-label learning [16] (e.g. due to objects having multiple possible functions or uses), our driveability levels can be tackled as a single-label pixel classification problem. Indeed, the degree of driveability can be reasonably inferred from the type of object in the image: a car will always be impossible to drive on; and although it is possible to drive on sand, it is preferable to drive on pavement. Thus, to generate ground truth labels, we perform a direct mapping from the original semantic classes of existing pixel-annotated datasets to driveability levels, based on the type of scene element. Essentially, any obstacle or non-ground surface is considered impossible to drive on. Paved areas or paths are considered fully driveable i.e. preferable, while other terrains (e.g. grass, sand) or areas on the side of the path (e.g. sidewalk) are assigned to the possible level. As discussed in [1], such a mapping from descriptive semantic labels to affordance is somewhat reductive as it does not take any contextual information into account - however, it remains a common approach, since it enables fully-supervised learning without the need for manual affordance labelling.
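The class-to-level mapping described above can be sketched as a simple lookup table. This is a minimal illustration, not the authors' mapping: the class names, the integer encoding of the levels, the `VOID` value and the `to_driveability` helper are all assumptions, and only the examples named in the text (car, sky, sand, grass, sidewalk, paved road or path) guide the assignments.

```python
# Hypothetical integer encoding of the three driveability levels.
IMPOSSIBLE, POSSIBLE, PREFERABLE = 0, 1, 2
VOID = 255  # assumed marker for unlabeled pixels

# Illustrative mapping; real datasets each need their own table.
CLASS_TO_LEVEL = {
    "road": PREFERABLE,
    "path": PREFERABLE,
    "grass": POSSIBLE,
    "sand": POSSIBLE,
    "sidewalk": POSSIBLE,
    "car": IMPOSSIBLE,
    "building": IMPOSSIBLE,
    "sky": IMPOSSIBLE,
}


def to_driveability(mask, id_to_name):
    """Convert a 2-D mask of dataset class ids to driveability levels.

    Class ids with no known name or no mapping become VOID.
    """
    return [
        [CLASS_TO_LEVEL.get(id_to_name.get(class_id), VOID) for class_id in row]
        for row in mask
    ]
```

Keeping one such table per dataset is what allows heterogeneously labelled datasets to share a single 3-level target.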

III-B From one-hot to soft ordinal labels

Intuitively, mis-classifying an area which is preferable (e.g. the path) to drive on as impossible should be penalized more heavily than classifying it as possible. However, a standard one-hot encoding coupled with a categorical loss function does not capture this distinction during learning: mis-classifications are treated as equally severe regardless of the target. Therefore, to incorporate a notion of distance between the driveability levels, we implement the Soft Ordinal vectors (or SORD) labelling scheme proposed in prior work: standard one-hot encoded labels are converted to a softmax-normalized probability distribution based on a ranking definition, such that the target class has the highest probability and the other probabilities encode a distance from the target class. Given a set of ranks $\mathcal{R} = \{r_1, \dots, r_N\}$ (one per driveability level), a SORD ground truth label $\mathbf{y} = (y_1, \dots, y_N)$ is generated based on a target rank $r_t \in \mathcal{R}$ as follows:

$$ y_i = \frac{e^{-\phi(r_t, r_i)}}{\sum_{k=1}^{N} e^{-\phi(r_t, r_k)}} \quad \forall r_i \in \mathcal{R} \tag{1} $$

where $\phi$ is a metric penalty function which penalizes deviation from the target rank $r_t$. As inter-rank distances approach infinity, $\mathbf{y}$ reduces to a one-hot encoded vector; as the distances approach 0, $\mathbf{y}$ approaches a uniform probability distribution. For this application, we consider a simple ranking definition, where the levels {impossible, possible, preferable} are mapped to increasing ranks $r_1 < r_2 < r_3$ (least to most driveable). We define the metric penalty function as the square log difference (SLD) $\phi(r_t, r_i) = (\log r_t - \log r_i)^2$, which reduces the penalty with increasing rank. Figure 2 shows the resulting asymmetrical SORD label encoding for each of the 3 possible driveability targets, compared to a one-hot categorical encoding. Compared to absolute difference for instance, SLD shifts the middle rank possible much “closer” to preferable than to impossible: in other words, the distinction between obstacles and driveable areas is much more clear-cut than the blurry line between driveable areas which are preferable or not. Intuitively, this mirrors the ambiguity that a human annotator would face when labelling images: we are much less likely to hesitate when categorizing an area as obstacle vs. non-obstacle than when determining whether a driveable area is preferable or not.
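To make the soft ordinal encoding concrete, the sketch below builds a softmax-normalized label from a rank list and a squared log difference penalty. The rank values (1, 2, 3) are an assumption made for illustration, since the exact values are not recoverable from the text here, and the function names are hypothetical.

```python
import math

LEVELS = ("impossible", "possible", "preferable")
RANKS = (1.0, 2.0, 3.0)  # assumed rank values, least to most driveable


def sld(r_t, r_i):
    """Squared log difference between a target rank and another rank."""
    return (math.log(r_t) - math.log(r_i)) ** 2


def sord_label(target, ranks=RANKS, penalty=sld):
    """Soft ordinal label for the level at index `target`.

    Each entry is exp(-penalty) normalized over all ranks, so the
    target level always receives the highest probability.
    """
    scores = [math.exp(-penalty(ranks[target], r)) for r in ranks]
    total = sum(scores)
    return [s / total for s in scores]
```

Under these assumed ranks, the label for a possible pixel places more probability on preferable than on impossible, which is the asymmetry the SLD penalty is meant to produce.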

III-C Loss formulation

Following [11], we take the loss per pixel as the Kullback-Leibler (KL) divergence between the predicted class probability vector and the SORD label from (1). However, rather than giving each pixel equal contribution in the total loss, we adapt the approach in [32] and formulate a pixel-wise loss weighting scheme which assigns higher importance to the pixels most relevant for driving decisions. Given a pixel location in the ground truth mask, we compute a weight map, defined in (2), which depends on the pixel's Euclidean distance to the closest boundary, on its vertical position (height) in the image, and on an experimentally defined constant. The height map is used to scale the rate at which the pixel weight increases when moving away from a boundary, and as a pixel-wise multiplication factor which assigns higher weight to lower pixels. It serves as a naive placeholder for depth data, under the simple assumption that the lower a pixel in the image, the closer it is to the camera.

We generate a weight map from a ground truth mask in three steps:

  1. the height map is pre-computed for all possible pixel locations based on the image height H, such that pixels in the lowest row of the image have the value 1 and top row pixels have a value of 0.

  2. for computing the distance map, we first perform edge detection on the gray-scaled ground truth mask, binarize the edge map, and apply a kernel-based distance transform [6].

  3. the weight map is then computed following (2), and min-max normalized.

Figure 1(c) shows an example of the resulting weight map computed from a ground truth driveability segmentation mask. During learning, the weight map is applied to the loss per pixel via element-wise multiplication.
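The three steps can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the paper's formulation: the combination `h * (1 - exp(-d * h / sigma))` is a stand-in chosen only to match the qualitative description (zero weight on boundaries, weight growing with distance at a height-scaled rate, lower rows weighted more), `sigma` is a hypothetical constant, and a brute-force nearest-boundary search replaces the edge detection and distance transform used in the text.

```python
import math


def height_map(H):
    """Per-row height value: 0 for the top row, 1 for the bottom row."""
    return [y / (H - 1) for y in range(H)]


def boundary_pixels(mask):
    """Pixels with at least one 4-neighbour carrying a different label."""
    H, W = len(mask), len(mask[0])
    edges = []
    for y in range(H):
        for x in range(W):
            for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < H and 0 <= nx < W and mask[ny][nx] != mask[y][x]:
                    edges.append((y, x))
                    break
    return edges


def weight_map(mask, sigma=1.0):
    """Assumed stand-in weighting, min-max normalized to [0, 1]."""
    H, W = len(mask), len(mask[0])
    h = height_map(H)
    edges = boundary_pixels(mask)
    far = math.hypot(H, W)  # fallback distance when no boundary exists
    raw = []
    for y in range(H):
        row = []
        for x in range(W):
            d = min((math.hypot(y - ey, x - ex) for ey, ex in edges), default=far)
            row.append(h[y] * (1.0 - math.exp(-d * h[y] / sigma)))
        raw.append(row)
    lo = min(min(r) for r in raw)
    hi = max(max(r) for r in raw)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat map
    return [[(v - lo) / span for v in r] for r in raw]
```

The brute-force search is quadratic in the number of pixels, so it is only suitable for small masks; a library distance transform would be the practical choice at full resolution.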

IV Experimental set-up

IV-A Architecture and hyper-parameters

For pixel-wise classification, we pick SegNet [2] as a base network, similarly to [3]. Our variant applies drop-out (rate of 0.5) in the six deepest encoder and decoder blocks for regularization, and reduces the number of convolutional layers in each block to 2 (as opposed to 3 in the deepest blocks of VGG-16 [35]), resulting in a total of 20 convolutional layers. We measure a forward pass time of 32 ms on the NVIDIA TITAN X for a single sample. In all our experiments, we train SegNet using Adam optimization [19] to minimize the KL divergence. Unlabeled/void pixels are ignored: the batch loss is computed as the sum of the loss per non-void pixel, divided by the number of non-void pixels in the batch. Samples are fed to the network in shuffled mini-batches of size 8, and the best model is selected based on minimal validation loss.
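The batch loss described above (per-pixel KL divergence, void pixels skipped, sum divided by the number of non-void pixels) can be sketched without any deep learning framework. The representation is an assumption made for illustration: pixels are flattened into lists, a void pixel is marked by a `None` target, and the optional `weights` argument stands in for the per-pixel loss weight map.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete probability vectors."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, pred)
        if t > 0
    )


def batch_loss(targets, preds, weights=None):
    """Mean weighted KL divergence over non-void pixels.

    targets: soft label vectors, or None for void pixels (assumed marker)
    preds:   predicted class probability vectors
    weights: optional per-pixel weights applied by multiplication
    """
    total, count = 0.0, 0
    for i, (t, p) in enumerate(zip(targets, preds)):
        if t is None:
            continue  # void pixels contribute neither loss nor count
        w = 1.0 if weights is None else weights[i]
        total += w * kl_divergence(t, p)
        count += 1
    return total / count if count else 0.0
```

Dividing by the number of non-void pixels rather than by the batch size keeps the loss scale comparable between densely and sparsely annotated images.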

IV-B Cross-domain datasets

Scene type               Training & validation (# images)   Testing (# images)
urban driving            Cityscapes [10] (3,484)            Kitti [14] (200)
                         BDD [43] (8,000)
                         Mapillary [29] (20,000)
                         ACDC [34] (2,006)
unstructured / off-road  RUGD [42] (5,492)                  Freiburg Forest [40] (366)
                         YCOR [26] (1,076)
                         TAS500 [27] (540)
mixed                    IDD [41] (8,089)                   WildDash [44] (4,256)

TABLE I: Cross-domain combination of image segmentation datasets used in our zero-shot cross-dataset experiment.

Our approach is entirely data-driven: accurate estimates of driveability in unconstrained environments require challenging samples to be included during training. For evaluating generalization to new environments with our method, we adopt a similar zero-shot cross-dataset strategy to [20]: models are trained on a combination of cross-domain datasets, and evaluated on a separate combination of datasets which have never been seen during training or validation.

We select outdoor scene understanding datasets with pixel-level annotations and RGB images captured by a vehicle or mobile robot, as outlined in Table I. For training, we include Cityscapes [10], a widespread benchmark featuring “clean” scenes, as well as more recent driving datasets covering a wide range of environmental conditions, sensor characteristics and geographical contexts including Mapillary [29], Berkeley DeepDrive (BDD) [43] and ACDC [34]. Outside of urban landscapes, RUGD [42], YCOR [26] and TAS500 [27] cover off-road grassy environments. Lastly, IDD [41] brings a unique challenge since it was captured in unstructured Indian traffic and rural scenes. For evaluation, we select 3 datasets with varying levels of difficulty. Kitti [14] is a small-scale benchmark of “clean” city driving scenes. Freiburg Forest [40] was captured by a mobile robot traversing forested trails, with some challenging illumination conditions, but no dynamic obstacles. WildDash [44] was specifically designed as a difficult test set for evaluating robustness to visual driving hazards in diverse environments. We use each dataset’s official train/validation split during learning, and full datasets during testing, resulting in a total of 42,759 / 5,939 / 4,822 images for training/validation/testing respectively.

Note that these 11 datasets were annotated under different sets of semantic classes, but mapping their original object labels to a generic notion of driveability allows us to bridge this semantic gap during training and evaluation. During learning, each driveability level is informed by samples from all 8 training datasets. To counteract the imbalance in dataset size, similarly to [20], mini-batches are constructed with an equal number of samples (1 in our case) from each dataset.
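A dataset-balanced batch construction of this kind can be sketched as below. The generator name, the dictionary-of-lists representation and the seeding are assumptions for illustration; sampling is independent per batch, so small datasets are revisited more often than large ones.

```python
import random


def balanced_batches(datasets, per_dataset=1, n_batches=3, seed=0):
    """Yield mini-batches with `per_dataset` samples from every dataset.

    datasets: dict mapping a dataset name to a list of samples
    Each batch item is a (dataset_name, sample) pair.
    """
    rng = random.Random(seed)
    names = sorted(datasets)
    for _ in range(n_batches):
        batch = []
        for name in names:
            picks = rng.sample(datasets[name], per_dataset)
            batch.extend((name, sample) for sample in picks)
        rng.shuffle(batch)  # avoid a fixed dataset order inside the batch
        yield batch
```

With 8 training datasets and `per_dataset=1`, each batch has 8 samples, which matches the mini-batch size stated in the architecture section.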

IV-C Data preparation

While it is commonplace to preserve color information in input images for scene understanding [39, 17], we speculate that color may add unnecessary or distracting information when trying to learn such an abstract concept as driveability. Thus, we investigate the importance of color in our experiments by comparing the standard RGB representation with a single-channel grayscale input. Input size is also an important consideration, with a trade-off between computational cost and segmentation detail. Resizing images to fixed dimensions is common practice, especially when learning from a combination of datasets [20]. For our affordance learning task, retaining a high level of detail is not a primary concern, but incorporating global context is crucial [23]. Therefore, we opt for a small input resolution, with the same width as in [2] but a wider aspect ratio to accommodate wide-FOV datasets. During training, input samples are randomly augmented on-the-fly with geometric (horizontal flip, rotation, crop, perspective transform, grid-based distortion) and photometric (brightness, contrast, tone curve and color manipulation) transformations, each applied with a fixed probability. See [7] for a detailed description.

IV-D Training procedure

Our driveability model is trained in three stages:

  1. Pre-training: as a starting point, we train SegNet on Cityscapes to predict the 30 original object classes in the dataset [10]. We refer to this model as Cityscapesobj. Note that this model is trained under a standard learning scheme (one-hot labels, uniformly weighted loss), and thus can be substituted with other pre-trained scene segmentation models.

  2. Transfer learning: to learn our 3-level driveability definition from a combination of datasets, the last convolutional layer of Cityscapesobj is re-initialized with 3 output channels and trained under the SORD labelling scheme until convergence.

  3. Loss weighting (LW): implemented as a final training stage to consolidate the segmentation, while maintaining the same labelling scheme. Weight maps are generated with fixed constants, one of which is the same as in [32].

IV-E Evaluation metrics

Considering the particular context of autonomous navigation, in which under-segmentation of obstacles and over-segmentation of driveable areas pose safety risks, we select two segmentation metrics of interest: pixel-wise recall (R) for the level, and precision (P) for . We also introduce a weighted version of these metrics Rw and Pw where the contribution of each pixel is weighted based on the LW map.

In addition to these binary metrics, we report two distance-based metrics. Root-mean-square error (RMSE) captures the degree of error between predicted and ground truth driveability levels, with heavier penalty for large error (confusion between the impossible and preferable levels). Inspired by [5], we also compute a measure of mistake severity (MS) as the mean absolute error of incorrect predictions; note that MS is fully decoupled from accuracy. We normalize MS per pixel, such that it ranges from 0 to 1: mis-classifying a pixel as an adjacent level yields a MS of 0, while confusing the impossible and preferable levels yields a MS of 1.
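The two distance-based metrics can be sketched on flattened lists of level indices. The integer encoding (0 = impossible, 1 = possible, 2 = preferable) and the function names are assumptions made for illustration; the reported RMSE values may rely on a different scaling of the levels.

```python
import math


def rmse(pred, gt):
    """Root-mean-square error between predicted and true level indices."""
    squared = [(p - g) ** 2 for p, g in zip(pred, gt)]
    return math.sqrt(sum(squared) / len(squared))


def mistake_severity(pred, gt, n_levels=3):
    """Mean normalized absolute error over incorrect pixels only.

    An adjacent-level mistake scores 0 and a confusion between the two
    extreme levels scores 1. Requires n_levels >= 3.
    """
    errors = [abs(p - g) for p, g in zip(pred, gt) if p != g]
    if not errors:
        return 0.0  # no mistakes, so no severity to report
    max_error = n_levels - 1
    return sum((e - 1) / (max_error - 1) for e in errors) / len(errors)
```

Because correct pixels are excluded before averaging, two models with very different accuracy can share the same MS, which is what decoupling from accuracy means here.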

Fig. 3: Examples of out-of-dataset predictions by the proposed model, trained on the cross-domain dataset with soft driveability labels and loss weighting.

V Results

We first present our main results which validate our navigation-oriented learning scheme. We then draw parallels to existing work, compare our approach to an object-based baseline model, and comment on the effect of input color to motivate the use of grayscale images in our experiments. Lastly, we show some limitations of our approach.

Cross-domain driveability segmentation: Figure 3 shows predictions by our proposed cross-domain model on out-of-dataset samples from the three unseen test sets, and we include a video showing additional qualitative results as supplementary material. Our driveability model produces sensible driveability estimates across a wide range of navigation scenarios, with variations in scene layout and contents, lighting and weather conditions, as well as camera characteristics and perspectives. Table II reports quantitative performance, with a comparison between a model trained under our proposed training scheme (Section IV-D), and a standard model trained with one-hot labels in the transfer learning stage and uniformly-weighted pixel-wise loss.

Test data        Learning   R %    Rw %   P %    Pw %   MS %   RMSE
Validation set   one-hot    98.41  97.77  93.20  94.32  15.28  0.283
                 SORD + LW  98.33  97.88  93.75  94.71   9.15  0.278
Kitti            one-hot    98.72  98.17  87.86  90.18  10.79  0.311
                 SORD + LW  98.79  98.64  89.44  91.09   7.43  0.304
Freiburg Forest  one-hot    94.15  89.98  85.19  88.27   1.75  0.284
                 SORD + LW  97.57  96.98  86.29  89.26   0.69  0.258
WildDash         one-hot    98.71  98.07  91.63  93.68  30.27  0.396
                 SORD + LW  98.58  98.08  94.01  95.48  18.54  0.380
TABLE II: Same-dataset and zero-shot generalization performance of our cross-domain driveability models.

Table II shows our navigation-oriented learning scheme to be effective at bringing down RMSE on the validation set and on every unseen test dataset, while also reducing mistake severity by at least 30% compared to a standard one-hot model. SORD+LW training consistently improves segmentation in safety-critical areas, as indicated by the weighted obstacle recall and precision scores. We note the most significant quantitative improvement in generalization performance for Freiburg Forest’s highly unstructured environments, where our method helps disambiguate the fuzzy transitions between path, grass and surrounding vegetation without getting lost in details. Looking at the overall aspect of the segmentation across test samples, we find that SORD labelling produces smoother contours and less spotty segmentation, and encourages cautious, low-stakes predictions especially for ambiguous border pixels. As can be seen in the examples of Figure 4, this visually manifests as a layer of possible pixels around non-driveable areas, rather than sharp transitions between the impossible and preferable levels. While this deviates from what ground truth masks look like, we consider it beneficial for navigation, since it essentially adds a safe margin around obstacles. Finally, the addition of LW, which concentrates learning away from distant details towards close-range and non-border areas, results in a more approximate but cohesive segmentation.

Fig. 4: Selected crops of out-of-dataset predictions by the cross-domain driveability model, showing the qualitative effect of the soft labelling and loss weighting training schemes.

Comparison to the state-of-the-art: Our unique 3-level driveability definition makes direct comparison to existing scene segmentation methods difficult. To the best of our knowledge, the closest candidates for comparison are the segmentation results reported in [3] for the general obstacle class, which is defined similarly to our impossible level. The authors train a SegNet model on 50k custom-labeled images from the Oxford RobotCar dataset [25], and report pixel-wise obstacle recall scores ranging from 93.71% (night scenes) to 97.01% (sunny weather) on the same dataset. Ground truth data is not made publicly available, preventing us from evaluating our model on the same benchmark. However, we note that our cross-domain driveability model, trained on a similar number of images, achieves superior recall of over 98.5% on both of our autonomous driving test datasets (Kitti and WildDash) despite not having been exposed to any images from these datasets during training, and despite WildDash being particularly visually challenging and geographically diverse compared to Oxford RobotCar. Furthermore, while [3] fails to predict viable path segmentations in ambiguous road configurations (e.g. tight turns in intersections) and does not show results in road-free scenes, the examples in Figure 3 show that our model successfully identifies preferable driveable areas even in the absence of a structured lane, while falling back to the possible level in open unstructured areas.

Train data          Learning   Cityscapes (val)   Kitti   Freiburg   WildDash
Cityscapes_obj      one-hot    0.210              0.353   0.660      0.491
Cityscapes_driv     SORD+LW    0.207              0.317   0.491      0.402
Cross-domain_driv   SORD+LW    0.226              0.304   0.258      0.380
TABLE III: RMSE of our model compared to single-dataset baselines (lower is better).

Comparison to an object-based single-dataset baseline: The conventional approach to semantic scene segmentation consists of learning object descriptions on a single dataset. In contrast, our driveability definition allows us to combine heterogeneously labelled datasets during training. To show the benefit of our approach for generalization to new scenes, we take Cityscapes_obj as an experimental baseline, and map its object-based predictions to driveability levels for evaluation. We then apply the transfer learning and LW stages to learn driveability on Cityscapes (Cityscapes_driv). Table III compares our cross-domain model with these two single-dataset baselines. Comparing the two Cityscapes models, we note that learning driveability with SORD+LW consistently reduces same-dataset and generalization error compared to a one-hot object-based approach, with the most notable improvement for transfer from Cityscapes to Freiburg Forest. Extending the findings in [20], our results show cross-domain learning to be beneficial for segmenting driveability in out-of-dataset images: learning driveability across a diverse 8-dataset combination reduces generalization error across all 3 unseen test datasets. While the performance of the Cityscapes models drops when faced with Freiburg Forest’s unstructured scenes, the cross-domain models maintain an RMSE below 0.4 (and pixel accuracy above 90%) across all test sets.
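The evaluation step described above, where object-based predictions are mapped to driveability levels before scoring, can be sketched as a lookup followed by an error computation. The class names, the class-to-level table and the conservative default for unknown classes are hypothetical, and the RMSE here is simply taken over integer level ranks, which may differ from the exact definition used for Table III.

```python
import math

PREFERABLE, POSSIBLE, IMPOSSIBLE = 0, 1, 2

# Hypothetical object-class-to-driveability table, for illustration only.
CLASS_TO_LEVEL = {
    "road": PREFERABLE,
    "sidewalk": POSSIBLE,
    "terrain": POSSIBLE,
    "building": IMPOSSIBLE,
    "person": IMPOSSIBLE,
    "car": IMPOSSIBLE,
}


def to_driveability(class_map, default=IMPOSSIBLE):
    """Map a 2D grid of object class names to driveability levels.

    Unknown classes fall back to `default` (assumed here: impossible).
    """
    return [[CLASS_TO_LEVEL.get(name, default) for name in row] for row in class_map]


def rank_rmse(pred_levels, true_levels):
    """Root mean squared error between predicted and true level ranks."""
    squared = [
        (p - t) ** 2
        for pred_row, true_row in zip(pred_levels, true_levels)
        for p, t in zip(pred_row, true_row)
    ]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    predicted = to_driveability([["road", "sidewalk"], ["car", "sky"]])
    print(predicted, rank_rmse(predicted, [[0, 0], [2, 2]]))
```

Because the table is the only dataset-specific part, the same evaluation code can score models trained on differently labelled datasets.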

Does color matter? On unseen samples from a known dataset or from a dataset captured in ideal urban scenarios (Cityscapes and Kitti in Table IV), color brings a small improvement in segmentation. However, interestingly, we note a significant performance gap in favour of grayscale models when evaluating zero-shot generalization to challenging new scenes (Freiburg Forest and WildDash in Table IV). While grayscale models are blind to dataset-specific color palettes, RGB models seem to make color-class associations (e.g. dark gray for the driveable road) which may not hold in different outdoor environments (e.g. brown paths in Freiburg Forest).

Train data          Input   Cityscapes (val)   Kitti   Freiburg   WildDash
Cityscapes_obj      RGB     99.51              99.30   89.11      92.62
Cityscapes_obj      Gray    98.91              97.96   92.53      93.36
Cross-domain_driv   RGB     99.33              98.91   91.55      92.36
Cross-domain_driv   Gray    99.32              98.72   94.11      98.71
TABLE IV: Effect of input color on recall (%) for one-hot models.
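As an illustration of what a grayscale input could look like in practice, the sketch below converts RGB pixels to luminance with the ITU-R BT.601 coefficients and optionally replicates the value across three channels so that a network expecting RGB input can be reused unchanged. Both the choice of coefficients and the channel replication are assumptions; the conversion used for the Gray rows of Table IV is not specified here.

```python
# ITU-R BT.601 luma coefficients, a common RGB-to-grayscale convention.
LUMA = (0.299, 0.587, 0.114)


def to_gray(pixel):
    """Convert one (R, G, B) pixel with 0-255 channels to a gray value."""
    r, g, b = pixel
    value = LUMA[0] * r + LUMA[1] * g + LUMA[2] * b
    return max(0, min(255, round(value)))


def gray_image(rgb_image, replicate=True):
    """Convert a 2D grid of RGB pixels to grayscale.

    With replicate=True each gray value is repeated over three channels
    (an assumption that keeps the input shape of an RGB network).
    """
    out = []
    for row in rgb_image:
        grays = [to_gray(pixel) for pixel in row]
        out.append([(g, g, g) for g in grays] if replicate else grays)
    return out


if __name__ == "__main__":
    print(gray_image([[(128, 128, 128), (139, 90, 43)]], replicate=False))
```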

Failure cases: Although learning driveability across a diverse combination of datasets improves robustness to WildDash’s challenging visual hazards, our method nevertheless inherits the limitations of monocular RGB vision, with poor results in extreme weather or illumination conditions. In Figure 5 for instance, the first image is too dark and foggy to estimate driveability, especially with the unfamiliar car dashboard taking up a large portion of the image. In the second example, the model overlooks small but important animals on the road. This raises a key contradiction faced during learning: the model is encouraged to ignore road irregularities such as potholes, shadows and lane markings (since all of these are considered driveable), yet should still be able to identify anomalies and potential hazards on the robot’s path. Distinguishing flat textures from obstacles can be tricky in the absence of depth cues, especially with our approach, which encourages learning of general representations and explicitly reduces attention to detail with loss weighting.
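To illustrate how loss weighting can reduce attention to detail, the sketch below builds a per-pixel weight map that grows with the distance to the nearest class border and with proximity to the bottom of the image, taken here as a stand-in for close range. This is not the authors' formulation: the two-pass city-block distance, the parameters `border_cap`, `border_floor` and `far_weight`, and the linear row weighting are all hypothetical choices.

```python
def border_mask(labels):
    """True where a pixel has a 4-neighbour with a different label."""
    h, w = len(labels), len(labels[0])
    mask = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = True
                    break
    return mask


def cityblock_distance(mask):
    """Two-pass city-block distance from each pixel to the nearest True pixel."""
    h, w = len(mask), len(mask[0])
    dist = [[0 if mask[y][x] else float("inf") for x in range(w)] for y in range(h)]
    for y in range(h):  # forward pass: propagate from top and left
        for x in range(w):
            if y > 0:
                dist[y][x] = min(dist[y][x], dist[y - 1][x] + 1)
            if x > 0:
                dist[y][x] = min(dist[y][x], dist[y][x - 1] + 1)
    for y in range(h - 1, -1, -1):  # backward pass: propagate from bottom and right
        for x in range(w - 1, -1, -1):
            if y < h - 1:
                dist[y][x] = min(dist[y][x], dist[y + 1][x] + 1)
            if x < w - 1:
                dist[y][x] = min(dist[y][x], dist[y][x + 1] + 1)
    return dist


def loss_weights(labels, border_cap=3, border_floor=0.1, far_weight=0.25):
    """Per-pixel weights: low near class borders and near the top of the image."""
    h = len(labels)
    dist = cityblock_distance(border_mask(labels))
    weights = []
    for y in range(h):
        # Assumption: lower rows of an egocentric image are closer to the robot.
        row_w = far_weight + (1 - far_weight) * (y / (h - 1) if h > 1 else 1.0)
        weights.append([
            row_w * (border_floor + (1 - border_floor) * min(d, border_cap) / border_cap)
            for d in dist[y]
        ])
    return weights


if __name__ == "__main__":
    for row in loss_weights([[0, 0, 2, 2]] * 4):
        print([round(value, 2) for value in row])
```

The resulting map would multiply a per-pixel loss before averaging. Small objects consist mostly of border pixels under such a scheme, which is consistent with the tendency to overlook them noted above.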

Fig. 5: Examples of unacceptable predictions on WildDash [44] images.

VI Discussion

In defining a simple mapping between object classes and driveability, we bridge the semantic gap between datasets to allow joint cross-domain training, and bypass the need for manual labelling. However, this mapping assumes the original object classes to be specific enough to capture important degrees of driveability. In practice, certain datasets encapsulate all forms of vegetation or the entire sidewalk as a single class for instance, while others distinguish between the curb (which may not be driveable for small-wheeled robots) and the walkable area, or between trees, bushes and traversable grass. Furthermore, our mapping is blind to contextual information: the sidewalk may be the preferable path for a “pedestrian” robot, but only a possible last resort for an autonomous vehicle driving on the road; a dirt path may be preferable to drive on in a forest, but not a route of choice in a city. Incorporating such scene- and application-dependent context during learning is an important direction for further research. Future work will also investigate how driveability can be learned from multi-modal image data to improve scene understanding in poor visibility conditions.

VII Conclusions

We have presented a simple yet effective method for learning to segment driveability across a wide range of outdoor scenes for open-world robotic navigation. Compared to a general-purpose approach which segments the scene in terms of object-based descriptions, treats all pixels and mistakes equally, and is constrained to a specific domain, our learning framework allows cross-dataset training and includes modifications of the ground truth pixel labels and loss formulation which are tailored to a driving task, resulting in quantitative and qualitative improvements in functional scene segmentation of unseen environments.

While we have approached robotic perception purely as an offline computer vision task, our method produces generic driveability affordance maps which, when paired with a planning module, are specifically intended to enable visual navigation. When incorporated into a real system, the ground truth mapping from object classes to driveability and types of environments seen during training can easily be adapted to the task at hand or robot capabilities. For resource-constrained platforms, the segmentation architecture used in our experiments can be replaced with a more lightweight variant. We stress the versatility of our method: it only requires an RGB image as input, can be trained with readily available datasets, and is applicable to arbitrary scenes. Thus, we believe the direction and findings of our work to be pertinent across a wide range of robotic applications.


  • [1] P. Ardón, È. Pairet, K. S. Lohan, S. Ramamoorthy, and R. P. A. Petrick (2020) Affordances in robotic tasks - a survey. arXiv:2004.07400.
  • [2] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE TPAMI 39 (12), pp. 2481–2495.
  • [3] D. Barnes, W. Maddern, and I. Posner (2017) Find your own way: weakly-supervised segmentation of path proposals for urban autonomy. In ICRA, pp. 203–210.
  • [4] A. Behl, K. Chitta, A. Prakash, E. Ohn-Bar, and A. Geiger (2020) Label efficient visual abstractions for autonomous driving. In IROS, pp. 2338–2345.
  • [5] L. Bertinetto, R. Mueller, K. Tertikas, S. Samangooei, and N. A. Lord (2020) Making better mistakes: leveraging class hierarchies with deep networks. In CVPR, pp. 12503–12512.
  • [6] G. Borgefors (1986) Distance transformations in digital images. Comput. Vis. Graph. Image Process. 34, pp. 344–371.
  • [7] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, M. Druzhinin, and A. A. Kalinin (2020) Albumentations: fast and flexible image augmentations. Information 11 (2). ISSN 2078-2489.
  • [8] C. Chen, A. Seff, A. Kornhauser, and J. Xiao (2015) DeepDriving: learning affordance for direct perception in autonomous driving. In ICCV, pp. 2722–2730.
  • [9] F. Codevilla, E. Santana, A. Lopez, and A. Gaidon (2019) Exploring the limitations of behavior cloning for autonomous driving. In ICCV, pp. 9328–9337.
  • [10] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223.
  • [11] R. Díaz and A. Marathe (2019) Soft labels for ordinal regression. In CVPR, pp. 4733–4742.
  • [12] A. Galstyan and P. R. Cohen (2008) Empirical comparison of “hard” and “soft” label propagation for relational classification. In Inductive Logic Programming, pp. 98–111. ISBN 978-3-540-78469-2.
  • [13] B. Gao, C. Xing, C. Xie, J. Wu, and X. Geng (2017) Deep label distribution learning with label ambiguity. IEEE TIP 26 (6), pp. 2825–2838.
  • [14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun (2013) Vision meets robotics: the KITTI dataset. IJRR.
  • [15] C. Gros, A. Lemay, and J. Cohen-Adad (2021) SoftSeg: advantages of soft versus binary training for image segmentation. Medical Image Analysis 71, pp. 102038.
  • [16] M. Hassanin, S. Khan, and M. Tahtali (2021) Visual affordance and function understanding: a survey. ACM Comput. Surv. 54 (3). ISSN 0360-0300.
  • [17] J. Janai, F. Güney, A. Behl, and A. Geiger (2020) Computer vision for autonomous vehicles: problems, datasets and state of the art. Foundations and Trends® in Computer Graphics and Vision 12 (1–3), pp. 1–308. ISSN 1572-2740.
  • [18] G. Kahn, P. Abbeel, and S. Levine (2021) BADGR: an autonomous self-supervised learning-based navigation system. IEEE Robotics and Automation Letters 6 (2), pp. 1312–1319.
  • [19] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In ICLR, Y. Bengio and Y. LeCun (Eds.).
  • [20] J. Lambert, Z. Liu, O. Sener, J. Hays, and V. Koltun (2020) MSeg: a composite dataset for multi-domain semantic segmentation. In CVPR, pp. 2876–2885.
  • [21] X. Li, Z. Liu, P. Luo, C. C. Loy, and X. Tang (2017) Not all pixels are equal: difficulty-aware semantic segmentation via deep layer cascade. In CVPR, pp. 6459–6468.
  • [22] X. Liu, X. Han, Y. Qiao, Y. Ge, S. Li, and J. Lu (2019) Unimodal-uniform constrained Wasserstein training for medical diagnosis. In ICCVW, pp. 332–341.
  • [23] T. Lüddecke, T. Kulvicius, and F. Wörgötter (2019) Context-based affordance segmentation from 2D images for robot actions. Robotics and Autonomous Systems 119, pp. 92–107. ISSN 0921-8890.
  • [24] T. Lüddecke and F. Wörgötter (2020) Fine-grained action plausibility rating. Robotics and Autonomous Systems 129, pp. 103511. ISSN 0921-8890.
  • [25] W. Maddern, G. Pascoe, C. Linegar, and P. Newman (2017) 1 Year, 1000km: The Oxford RobotCar Dataset. IJRR 36 (1), pp. 3–15.
  • [26] D. Maturana, P. Chou, M. Uenoyama, and S. Scherer (2018) Real-time semantic mapping for autonomous off-road navigation. In Field and Service Robotics, pp. 335–350.
  • [27] K. A. Metzger, P. Mortimer, and H. Wuensche (2021) A fine-grained dataset and its efficient semantic segmentation for unstructured driving scenarios. In ICPR.
  • [28] M. Müller, A. Dosovitskiy, B. Ghanem, and V. Koltun (2018) Driving policy transfer via modularity and abstraction. In 2nd Annual Conference on Robot Learning (CoRL), Proceedings of Machine Learning Research, Vol. 87, pp. 1–15.
  • [29] G. Neuhold, T. Ollmann, S. Rota Bulò, and P. Kontschieder (2017) The Mapillary Vistas dataset for semantic understanding of street scenes. In ICCV.
  • [30] W. Qi, R. T. Mullapudi, S. Gupta, and D. Ramanan (2020) Learning to move with affordance maps. In ICLR.
  • [31] R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V. Koltun (2020) Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE TPAMI, pp. 1–1.
  • [32] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
  • [33] A. Roy and S. Todorovic (2016) A multi-scale CNN for affordance segmentation in RGB images. In ECCV, pp. 186–201. ISBN 978-3-319-46493-0.
  • [34] C. Sakaridis, D. Dai, and L. V. Gool (2021) ACDC: the adverse conditions dataset with correspondences for semantic driving scene understanding. arXiv:2104.13395.
  • [35] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR.
  • [36] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In CVPR, pp. 2818–2826.
  • [37] A. Tampuu, T. Matiisen, M. Semikin, D. Fishman, and N. Muhammad (2020) A survey of end-to-end driving: architectures and training methods. IEEE TNNLS, pp. 1–21.
  • [38] M. Teichmann, M. Weber, M. Zöllner, R. Cipolla, and R. Urtasun (2018) MultiNet: real-time joint semantic reasoning for autonomous driving. In IEEE Intelligent Vehicles Symposium (IV), pp. 1013–1020.
  • [39] A. Valada, J. Vertens, A. Dhall, and W. Burgard (2017) AdapNet: adaptive semantic segmentation in adverse environmental conditions. In ICRA, pp. 4644–4651.
  • [40] A. Valada, G. L. Oliveira, T. Brox, and W. Burgard (2017) Deep multispectral semantic scene understanding of forested environments using multimodal fusion. In 2016 International Symposium on Experimental Robotics (ISER), pp. 465–477. ISBN 978-3-319-50115-4.
  • [41] G. Varma, A. Subramanian, A. Namboodiri, M. Chandraker, and C. V. Jawahar (2019) IDD: a dataset for exploring problems of autonomous navigation in unconstrained environments. In WACV, pp. 1743–1751.
  • [42] M. Wigness, S. Eum, J. G. Rogers, D. Han, and H. Kwon (2019) A RUGD dataset for autonomous navigation and visual perception in unstructured outdoor environments. In IROS, pp. 5000–5007.
  • [43] F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) BDD100K: a diverse driving dataset for heterogeneous multitask learning. In CVPR.
  • [44] O. Zendel, K. Honauer, M. Murschitz, D. Steininger, and G. F. Dominguez (2018) WildDash - creating hazard-aware benchmarks. In ECCV.
  • [45] C. Zhang, P. Jiang, Q. Hou, Y. Wei, Q. Han, Z. Li, and M. Cheng (2021) Delving deep into label smoothing. IEEE TIP 30, pp. 5984–5996. ISSN 1941-0042.
  • [46] B. Zhou, P. Krähenbühl, and V. Koltun (2019) Does computer vision matter for action? Science Robotics 4.