Memory-based Semantic Segmentation for Off-road Unstructured Natural Environments

08/12/2021 · Youngsaeng Jin, et al. · Drexel University, Korea University

With the availability of many datasets tailored for autonomous driving in real-world urban scenes, semantic segmentation for urban driving scenes has achieved significant progress. However, semantic segmentation for off-road, unstructured environments is not widely studied. Directly applying existing segmentation networks often results in performance degradation, as they cannot overcome intrinsic problems in such environments, such as illumination changes. In this paper, a built-in memory module for semantic segmentation is proposed to overcome these problems. The memory module stores significant representations of training images as memory items. While the encoder already embeds similar inputs close together, the proposed memory module is specifically designed to cluster instances of the same class even when there are significant variances in the embedded features. It therefore helps segmentation networks better cope with unexpected illumination changes. A triplet loss is used during training to minimize redundancy among the memory items so that they store discriminative representations. The proposed memory module is general and can be adopted in a variety of networks. We conduct experiments on the Robot Unstructured Ground Driving (RUGD) dataset and the RELLIS dataset, which are collected from off-road, unstructured natural environments. Experimental results show that the proposed memory module improves the performance of existing segmentation networks and contributes to capturing unclear objects over various off-road, unstructured natural scenes with equivalent computational cost and network parameters. As the proposed method can be integrated into compact networks, it presents a viable approach for resource-limited small autonomous platforms.


I Introduction

Semantic segmentation is a fundamental yet significant task for scene understanding. It aims to assign a semantic label to each pixel in an image. As semantic segmentation provides diverse information including the categories, locations, and shapes of objects, it is critical to a variety of real-world applications, such as robot vision [1], autonomous driving [2], and medical diagnosis [3]. However, achieving high accuracy on this task is challenging due to the variety of labels and shapes.

With the development of deep neural networks, Shelhamer et al. [4] proposed the Fully Convolutional Network (FCN), which attained an impressive improvement in semantic segmentation accuracy. Due to its effectiveness, the FCN is employed as a core framework in state-of-the-art semantic segmentation methods. Some efforts [5, 6, 7] aggregate multi-scale contextual information to capture multi-scale objects. Other efforts [8, 9, 10] use attention mechanisms [11, 12] to capture richer global contextual information.

Another factor driving progress in semantic segmentation is the availability of dedicated datasets [13, 14]. These datasets were collected from various real-world environments and have been provided to advance the technique in real-world applications. In particular, several datasets tailored for autonomous driving in real-world urban scenes [15, 16, 17, 18] have been provided and exploited as a primary source of data for navigating autonomous vehicles. As urban scenes are structured environments with low variation in scenes and illumination, they are relatively easy to segment precisely. Thus, significant development for autonomous driving in urban environments has been achieved.

Fig. 1: Image samples from the RUGD dataset [19], collected from off-road, unstructured environments. The numbers below each image are the average pixel intensities of the 'sky' and 'tree' categories. Images in this dataset cover a wide range of scenes, and their illumination is inconsistent.

However, when navigating in off-road, unstructured natural environments, an autonomous platform faces formidable challenges in recognizing its surroundings and the objects therein. Such scenarios involve not only a wide range of scenes but also significant illumination changes, as shown by the examples in Fig. 1. Unfortunately, because many factors, such as camera sensitivity and lighting conditions, cause illumination changes, these changes are unavoidable during navigation. At worst, an object becomes as bright (or as dark) as its surrounding regions and is hard to distinguish from them, as with the building in the 'Test sample' of Fig. 2. Therefore, since illumination plays a critical role in capturing appearance, inconsistent illumination results in performance degradation.

Motivated by the above issues, in this paper we propose a built-in memory module to improve semantic segmentation accuracy for off-road, unstructured natural environments. The memory module stores significant representations of training images as memory items. The memory items are then recalled to cluster instances of the same class closer together within the embedding space learned from the training images. Therefore, the memory module mitigates significant variances in embedded features, and segmentation networks equipped with it better cope with unexpected illumination changes, as illustrated in Fig. 2. In our experimental configuration, the proposed memory module follows the encoder and refines the global contextual features (the encoder output feature maps) using memory items. The decoder then takes the refined features and produces the segmentation masks. The memory module contains only a few items so as not to increase the computational cost. In addition, the triplet loss [20] is used to push the items apart and minimize their redundancy. The proposed memory module is general and can thus be adopted in a wide range of networks.

Fig. 2: The memory module shifts representations of samples with significant variances into the learned embedding space so that the overall segmentation networks better deal with the unexpected illumination changes.

Experiments are conducted on the Robot Unstructured Ground Driving (RUGD) dataset [19] and the RELLIS dataset [21], which are collected by an unmanned ground robot from off-road, unstructured natural environments. The quantitative results show that the proposed memory module improves the performance of existing segmentation networks with equivalent computational cost and network parameters, regardless of whether the networks are compact or complex. The qualitative results demonstrate its effectiveness in capturing unclear objects over a variety of off-road, unstructured scenes. Applied to compact segmentation networks, our memory module improves outdoor scene segmentation in real-time operation, thus enabling better autonomous navigation for resource-limited small autonomous platforms.

II Related Work

II-A Semantic Segmentation

Driven by the development of Convolutional Neural Networks (CNNs), current semantic segmentation methods typically employ deep CNNs (e.g., ResNet [22], ResNeXt [23], etc.) as an encoder to extract feature representations. To improve performance, various decoder modules have been proposed to produce precise segmentation masks.

A variety of early methods [4, 24, 6, 25, 26] have been proposed, but these approaches often resulted in poor performance due to their model simplicity. To boost performance, more advanced methods have been developed. The PSPNet [5] and Deeplab [7, 27] frameworks incorporate multi-scale contextual information using spatial pyramid modules, while some methods [28, 29, 30] use dense layers at the decoder end to fuse multi-level features and benefit from feature reuse. Other efforts [9, 31, 32, 8, 10, 33, 34] exploit attention mechanisms to capture long-range dependencies for richer global contextual information. In addition, gate mechanisms [35] have been employed to selectively fuse multi-level or multi-modal features to further improve performance [36, 37, 38, 30, 39]. The effectiveness of these methods has been verified on datasets collected from structured environments (e.g., Cityscapes [17], ADE20K [14], etc.), but their effectiveness in off-road, unstructured environments has not yet been verified.

II-B Memory Networks

Graves et al. [40] introduced the Neural Turing Machine, which combines a neural network with an external memory bank to extend the network's capability. The external memory is jointly trained with the main branch, and the combined architecture uses an attention process to selectively read from and write to the memory. Due to its flexibility, it has been adapted to a variety of tasks, such as few-shot learning [41, 42], video summarization [43], image captioning [44], and anomaly detection [45]. In our method, a memory module is exploited for semantic segmentation to deal with unexpected illumination changes in off-road, unstructured environments by storing significant representations of training images and recalling those representations to correct significant variances in embedded features.

III Method

In this section, we first provide an overview of the semantic segmentation framework. Then we present the proposed memory module and the loss function.

Fig. 3: Illustration of the overall architecture. $\mathcal{L}_{ce}$ and $\mathcal{L}_{tri}$ denote the cross-entropy loss and the triplet loss, respectively.

III-A Overview of Semantic Segmentation Framework

As deep convolutional neural networks are capable of extracting salient features from input images, they are employed as the encoder. To alleviate the loss of detail, the encoder produces high-resolution feature maps by replacing some of the last strided convolutions [46, 47] with dilated convolutions [48]. The ratio of the input image spatial resolution to the encoder output resolution, denoted as the output stride (OS), is typically 8 or 16 (a lower OS improves segmentation accuracy but requires more computation). Starting from an input image of height $H$ and width $W$, the encoder produces global contextual feature maps $X \in \mathbb{R}^{C \times h \times w}$ at its final layer, where $C$, $h$, and $w$ are the number of channels, the height, and the width, respectively. A decoder then takes the feature maps as input to produce a segmentation mask. The decoder is often a sub-network, such as the Atrous Spatial Pyramid Pooling (ASPP) module in Deeplabv3 [7], that refines the feature maps to improve performance and produces the prediction map. Finally, the prediction map is bilinearly upsampled to the resolution of the input image to obtain the final segmentation result $Y \in \mathbb{R}^{N_c \times H \times W}$, where $N_c$ is the number of categories. Although the decoder often improves performance, it is ineffective if the encoder output feature maps do not properly represent the input image. To mitigate this issue, we propose a memory module that refines the feature maps using memory items before feeding them into the decoder, as depicted in Fig. 3. The details of the memory module are presented in the following subsection.
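To make the data flow concrete, the following PyTorch-style sketch shows one way the memory module could be slotted between an encoder and a decoder. The class and attribute names (`MemorySegNet`, `memory_module`) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemorySegNet(nn.Module):
    """Encoder -> memory refinement -> decoder -> bilinear upsampling (sketch)."""

    def __init__(self, encoder, decoder, memory_module):
        super().__init__()
        self.encoder = encoder        # e.g., a dilated ResNet18 backbone (OS 16)
        self.memory = memory_module   # the proposed read/write memory
        self.decoder = decoder        # e.g., an ASPP head producing N_c-channel logits

    def forward(self, image):
        h, w = image.shape[-2:]
        feats = self.encoder(image)    # X in R^{B x C x h' x w'}
        refined = self.memory(feats)   # memory read plus residual scaling
        logits = self.decoder(refined) # per-pixel class scores
        # Upsample the prediction map back to the input resolution for the final mask.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)
```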

III-B Memory Module

The memory module performs read and write operations. The read operation refines the encoder output feature maps using the stored memory items, while the write operation updates the memory items according to the encoder output feature maps. The write operation is conducted only during the training phase. The read operation (Fig. 4) is presented first, followed by the write operation (Fig. 5).

III-B1 Read

Given the encoder output feature maps of an image, viewed as a set of individual features $X = \{x_i\}_{i=1}^{N}$ with $N = h \times w$ and $x_i \in \mathbb{R}^{C}$ at each spatial position, they are refined by the memory items $M = \{m_j\}_{j=1}^{K}$, where each $m_j \in \mathbb{R}^{C}$ is a memory item. The read operation is based on addressing weights, which are obtained from the cosine similarities between each individual feature $x_i$ (for all $i$) and all memory items, followed by a softmax function (the cosine similarity delivers the best performance compared to other similarity functions, such as the Manhattan distance, which is roughly 0.8% lower in terms of mIoU under the same experimental settings). Thus, the addressing weight of an individual feature $x_i$ to the memory item $m_j$ is

$$ w_{i,j} = \frac{\exp\big(d(x_i, m_j)\big)}{\sum_{j'=1}^{K} \exp\big(d(x_i, m_{j'})\big)}, \qquad (1) $$

where the cosine similarity is computed as

$$ d(x_i, m_j) = \frac{x_i^{\top} m_j}{\lVert x_i \rVert \, \lVert m_j \rVert}. \qquad (2) $$

As the proposed memory module contains a small number of memory items for compactness, each individual feature addresses all items for diverse representations instead of only the single most similar item. For the feature $x_i$, the memory module produces the refined feature $\hat{x}_i$ through a weighted sum of all memory items with the corresponding addressing weights:

$$ \hat{x}_i = \sum_{j=1}^{K} w_{i,j} \, m_j. \qquad (3) $$

Fig. 4: Illustration of the memory read process. To read the memory items, the cosine similarities between each individual feature $x_i$ and all memory items are computed as in (1), and then a weighted sum of the items with the corresponding addressing weights is applied as in (3) to obtain the refined feature $\hat{x}_i$. The circled 's' and 'w' denote the cosine similarity operation and the weighted sum, respectively.

Instead of only feeding the refined feature maps $\hat{X}$ into the decoder, they are multiplied by a scale parameter $\gamma$ and added to the original feature maps $X$, as $X$ also contains significant information about the input image. Thus, the feature maps fed into the decoder are given by

$$ \tilde{X} = X + \gamma \hat{X}. \qquad (4) $$

The parameter $\gamma$ is a trainable scalar initialized to 0.1.
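As a concrete illustration, the following PyTorch-style sketch implements the read operation of (1)-(4) for a batch of feature maps. The `MemoryRead` class and its tensor names are illustrative assumptions rather than the authors' code, and the items are stored as a non-gradient parameter on the assumption that they are updated only by the write operation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryRead(nn.Module):
    """Refine encoder features with K memory items, as in (1)-(4)."""

    def __init__(self, num_items: int, channels: int):
        super().__init__()
        # K memory items, each a C-dimensional vector (updated by the write operation).
        self.items = nn.Parameter(torch.randn(num_items, channels), requires_grad=False)
        # Trainable residual scale gamma, initialized to 0.1 as in the paper.
        self.gamma = nn.Parameter(torch.tensor(0.1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        x = feats.flatten(2).transpose(1, 2)                                  # (B, N, C), N = h*w
        # (1)-(2): cosine similarity between every feature and every item, then softmax.
        sim = F.normalize(x, dim=-1) @ F.normalize(self.items, dim=-1).t()   # (B, N, K)
        weights = sim.softmax(dim=-1)
        # (3): weighted sum of all memory items per feature.
        refined = weights @ self.items                                        # (B, N, C)
        refined = refined.transpose(1, 2).reshape(b, c, h, w)
        # (4): residual combination with the original features.
        return feats + self.gamma * refined
```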

III-B2 Write

The write operation is the process of updating the memory items using the individual features $x_i$. Unlike the read operation, in which all memory items are involved in refining each individual feature, the write operation updates each memory item using only the subset of individual features that are most similar to the target memory item. Thus, given a target memory item $m_j$, we first look for the features whose highest addressing weight falls on $m_j$ according to the similarities computed in (2):

$$ U_j = \{\, i \mid j = \arg\max_{j'} w_{i,j'} \,\}, \qquad (5) $$

where $U_j$ contains the indexes of the features that have their highest addressing weight on $m_j$. Similar to (1), the update weight of the memory item $m_j$ to an individual feature $x_i$ is computed as

$$ v_{j,i} = \frac{\exp\big(d(m_j, x_i)\big)}{\sum_{i'=1}^{N} \exp\big(d(m_j, x_{i'})\big)}, \qquad (6) $$

where the cosine similarity is computed as

$$ d(m_j, x_i) = \frac{m_j^{\top} x_i}{\lVert m_j \rVert \, \lVert x_i \rVert}. \qquad (7) $$

Moreover, the update weight is re-normalized by the maximum weight over the features in the set $U_j$:

$$ \hat{v}_{j,i} = \frac{v_{j,i}}{\max_{i' \in U_j} v_{j,i'}}. \qquad (8) $$

Fig. 5: Illustration of the memory write process. Given the memory item $m_j$, we look for the set $U_j$ containing the indexes of individual features that have their highest addressing weight on $m_j$, as in (5). We then compute the update weights in (6)-(8), and the memory item is updated by a weighted sum of the individual features in $U_j$ with the corresponding update weights, as in (9). The circled '+' denotes element-wise summation.

Finally, the memory item is updated by the features in the set $U_j$ through a weighted sum with the corresponding update weights:

$$ m_j \leftarrow f\Big( m_j + \sum_{i \in U_j} \hat{v}_{j,i} \, x_i \Big), \qquad (9) $$

where $f(\cdot)$ is the L2 normalization function. If all individual features were involved in updating the memory items, the update weights on similar features would be diminished because small weights would also be assigned to uncorrelated features. As a result, the memory items could be updated improperly. Thus, we adopt the above updating strategy to write the memory items.
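For illustration only, the sketch below shows one way the write operation of (5)-(9) could be implemented in PyTorch. The `memory_write` helper and its argument names are assumptions, and it deliberately updates the items without gradients, mirroring the training-time-only write described above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def memory_write(items: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """Update K memory items (K, C) from flattened features (N, C), as in (5)-(9)."""
    sim = F.normalize(feats, dim=-1) @ F.normalize(items, dim=-1).t()  # (N, K)
    read_w = sim.softmax(dim=-1)         # addressing weights, as in (1)
    nearest = read_w.argmax(dim=-1)      # (5): the item each feature addresses most
    update_w = sim.softmax(dim=0)        # (6)-(7): per-item softmax over features
    new_items = items.clone()
    for j in range(items.size(0)):
        idx = (nearest == j).nonzero(as_tuple=True)[0]
        if idx.numel() == 0:             # no feature selected this item in the batch
            continue
        v = update_w[idx, j]
        v = v / v.max()                  # (8): re-normalize by the maximum weight in U_j
        # (9): weighted sum of the selected features, then L2 normalization.
        new_items[j] = F.normalize(items[j] + (v.unsqueeze(1) * feats[idx]).sum(dim=0), dim=0)
    return new_items
```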

III-C Optimization

To train the proposed model, we use the 2D multi-class cross-entropy loss for semantic segmentation,

$$ \mathcal{L}_{ce} = -\sum_{p} \sum_{c=1}^{N_c} y_{p,c} \log \hat{y}_{p,c}, \qquad (10) $$

where $y_{p,c}$ and $\hat{y}_{p,c}$ are the true category label and the predicted segmentation probability for pixel $p$ and category $c$. In addition, in order to reduce the redundancy of the memory items, the triplet loss is used to push the items apart:

$$ \mathcal{L}_{tri} = \sum_{i} \max\big( \lVert x_i - m_{pos} \rVert_2^2 - \lVert x_i - m_{neg} \rVert_2^2 + \alpha,\; 0 \big), \qquad (11) $$

where $m_{pos}$ and $m_{neg}$ are the first and second most similar memory items to the feature $x_i$ according to the addressing weights computed in (1), and $\alpha$, set to 1.0 in our experiments, is the margin between the two items. To minimize the triplet loss, the feature $x_i$ should be close to $m_{pos}$ while far away from $m_{neg}$. Thus, the overall loss function consists of the two loss terms,

$$ \mathcal{L} = \mathcal{L}_{ce} + \lambda \, \mathcal{L}_{tri}, \qquad (12) $$

where $\lambda$ is set to 0.05 and the model is trained end-to-end to minimize the overall loss.
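A minimal sketch of how the combined objective in (10)-(12) could be computed in PyTorch is given below. The variable names and the use of `nn.CrossEntropyLoss` and `F.triplet_margin_loss` (which uses a Euclidean rather than squared-Euclidean margin) are assumptions about one reasonable implementation, not the authors' exact code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def total_loss(logits, labels, feats, items, lam=0.05, margin=1.0):
    """Cross-entropy (10) plus a triplet term (11)-(12) that spreads the memory items.

    logits: (B, N_c, H, W) segmentation scores; labels: (B, H, W) class indices
    feats:  (B*N, C) flattened encoder features; items: (K, C) memory items
    """
    ce = nn.CrossEntropyLoss()(logits, labels)                         # (10)
    # Rank memory items by addressing weight (cosine similarity) for every feature.
    sim = F.normalize(feats, dim=-1) @ F.normalize(items, dim=-1).t()  # (B*N, K)
    top2 = sim.topk(2, dim=-1).indices
    pos, neg = items[top2[:, 0]], items[top2[:, 1]]                    # 1st / 2nd nearest items
    tri = F.triplet_margin_loss(feats, pos, neg, margin=margin)        # (11), Euclidean margin
    return ce + lam * tri                                              # (12)
```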

IV Experiments

To evaluate the proposed method, a series of experiments is conducted. The dataset and implementation details are introduced first. Then, ablation studies are conducted to find the experimental settings that achieve the best performance. Finally, quantitative and qualitative results are presented.

IV-A Dataset

The RUGD dataset [19] is tailored for semantic segmentation in unstructured environments and focuses on off-road autonomous navigation scenarios. It was collected from a Clearpath Husky ground robot traversing a variety of natural, unstructured, and semi-urban areas. The scenes contain no discernible geometric edges or vanishing points, and semantic boundaries are highly irregular and convoluted; as such, these off-road driving scenarios present a variety of challenges. The dataset contains 4,759 and 1,964 images for the training and testing sets, respectively, and has 24 categories including vehicle, building, sky, grass, etc. The resolution of the images is 688×550. The ablation studies are conducted on these training and testing sets.

The RELLIS dataset [21] is another dataset tailored for semantic segmentation in off-road environments. It was collected on the RELLIS Campus of Texas A&M University and presents challenges to existing algorithms related to class imbalance and environmental topography. It contains 3,302 and 1,672 images for the training and testing sets, respectively, and has 19 categories including fence, vehicle, rubble, etc. The resolution of the images is 1920×1200. Due to the limitation of computing resources, the images were randomly cropped to 640×640 during training.

IV-B Implementation Details

IV-B1 Training Settings

Following [32], a poly learning rate policy is adopted: the initial learning rate is set to 0.01, and the learning rate at each iteration is the initial learning rate multiplied by $(1 - \frac{iter}{iter_{max}})^{0.9}$. The momentum and weight decay are set to 0.9 and 0.0001, respectively. The networks are trained with a mini-batch size of 8 per GPU using stochastic gradient descent (SGD) for 150 epochs. As in existing methods, the parameters of the encoder are initialized from weights pretrained on ImageNet [49], while those of the decoder and the memory module are randomly initialized. To avoid overfitting, data augmentation is applied during training, including horizontal flipping, scaling (from 0.5 to 2.0), and rotation (from -10° to 10°).
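For reference, a small sketch of how the poly schedule might be wired up in PyTorch is shown below. The exponent of 0.9 is the conventional choice for this policy, and the `LambdaLR` wiring with a placeholder model is an assumption, not the authors' training script.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

# Placeholder network; in practice this is any segmentation model with the memory module.
model = torch.nn.Conv2d(3, 24, kernel_size=1)

iters_per_epoch = 4759 // 8        # RUGD training images / mini-batch size (assumed)
max_iter = 150 * iters_per_epoch   # 150 training epochs

optimizer = SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)
# Poly policy: lr = base_lr * (1 - iter / max_iter) ** 0.9, stepped once per iteration.
scheduler = LambdaLR(optimizer, lr_lambda=lambda it: (1.0 - it / max_iter) ** 0.9)
```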

IV-B2 Networks

To verify the effectiveness of the proposed memory module, it is adopted in a variety of networks, such as PSPNet [5], Deeplabv3 [7], and DANet [9], with encoders of various depths, such as MobileNetv2 [50], ResNet18 [22], and HRNet [51]. The decoder denoted as 'Upsampling' consists of a single convolutional layer that produces the prediction map, followed by bilinear upsampling to the input resolution for the final segmentation result. The ablation studies are conducted on Deeplabv3 with the lightweight encoder ResNet18 at OS 16.

Fig. 6: The effect of the number of memory items in the proposed memory module. The memory module with 24 items yields the best performance.

IV-C Ablation Study

IV-C1 Effect of the Number of Memory Items

To find the optimal number of memory items, we conduct experiments with different numbers of memory items; the results are shown in Fig. 6. We observe that the memory with 24 items, which is identical to the number of categories, yields the best performance, outperforming the baseline by 1.59% in terms of mIoU. Although the memory module with fewer than 24 items still outperforms the baseline, it cannot deliver representations diverse enough to cover the wide range of scenes and all categories. With too many items, the module struggles to focus on the relevant items, which leads to performance degradation.

IV-C2 Effect of the Triplet Loss

As the triplet loss contributes to the separateness of features, we control the separateness of the memory items, and thus their ability to store discriminative representations, by weighting the triplet loss with a constant scalar $\lambda$. As shown in Table I, we vary $\lambda$ from 0 to 0.2, and $\lambda = 0.05$ achieves the best performance.

To analyze the effectiveness of the triplet loss more clearly, we visualize the cosine similarities of all pairs of items without/with the triplet loss in Fig. 7. We observe that the triplet loss makes the items less similar to each other. The average cosine similarities over all pairs of items without and with the triplet loss are 0.50 and 0.19, respectively. These results indicate that the triplet loss allows the memory module to reduce redundancy and store discriminative representations, which improves segmentation accuracy.
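The pairwise-similarity analysis behind Fig. 7 can be reproduced with a few lines. The sketch below is an assumed helper, not the authors' plotting code; it returns the K×K similarity matrix and the average over distinct pairs of items.

```python
import torch
import torch.nn.functional as F

def item_similarity(items: torch.Tensor):
    """Cosine similarity between all pairs of memory items (K, C)."""
    normed = F.normalize(items, dim=-1)
    sim = normed @ normed.t()                        # (K, K), diagonal entries are 1
    k = items.size(0)
    off_diag = sim[~torch.eye(k, dtype=torch.bool)]  # exclude self-similarity
    return sim, off_diag.mean().item()               # matrix for Fig. 7-style plots, mean value
```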

Fig. 7: Visualization of the cosine similarities of all pairs of memory items without (left) and with (right) the triplet loss. The result demonstrates that the triplet loss makes the items highly discriminative and less redundant.
λ 0 0.01 0.05 0.1 0.2
mIoU 34.44 34.42 35.07 34.81 33.68
TABLE I: Results with different loss weights λ on the triplet loss. Empirically, λ = 0.05 yields the best performance.

IV-D Results

To verify the effectiveness of our memory module, it is applied to diverse decoders with either compact (e.g., ResNet18 and MobileNetv2) or complex (e.g., HRNet and ResNet50) encoders. Table II presents the segmentation performance (mIoU), the number of network parameters (#Param), and the computational cost (GFLOPs, computed with the PyTorch code at https://github.com/sovrasov/flops-counter.pytorch). We observe that our memory module improves performance across different networks, regardless of whether they are compact or complex. As the proposed memory module is compact and non-parametric, the baselines with our memory module keep the same number of network parameters and equivalent GFLOPs as the baselines. More importantly, 'ResNet18 + Deeplabv3' with our memory module outperforms the heavier networks 'HRNet + Upsampling' and 'ResNet50 + Deeplabv3' without the memory module. This demonstrates that the proposed memory module contributes to significant performance improvement and enables lighter networks to perform as well as more complex ones.
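For completeness, complexity numbers like those in Table II can be obtained roughly as follows. This uses the `ptflops` package from the repository cited above; the exact call shown (placeholder model, input size, flags) is an assumption about how such numbers are typically measured rather than the authors' measurement script.

```python
# pip install ptflops
import torchvision
from ptflops import get_model_complexity_info

# Placeholder backbone; in practice this is any network from Table II (with or without the memory module).
model = torchvision.models.segmentation.deeplabv3_resnet50(weights=None)

macs, params = get_model_complexity_info(
    model, (3, 550, 688),  # RGB input at the RUGD resolution (H, W assumed)
    as_strings=True, print_per_layer_stat=False,
)
print(f"Complexity: {macs}, parameters: {params}")
```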

Fig. 8 and Fig. 9 show some visualization results from a compact network (MobileNetv2 + Upsampling) and a complex network (ResNet50 + Deeplabv3), respectively. The results demonstrate the effectiveness of our method in capturing unclear regions and objects, highlighted in the images, over various off-road, unstructured natural environments.

Encoder Decoder mIoU GFLOPs #Param
MobileNetv2 Upsampling 32.12 4.65 2.64M
MobileNetv2 Upsampling (Ours) 32.78 4.66 2.64M
ResNet18 Upsampling 32.20 23.90 12.96M
ResNet18 Upsampling (Ours) 33.30 23.92 12.96M
HRNet Upsampling 36.49 137.48 65.86M
HRNet Upsampling (Ours) 37.23 137.74 65.86M
ResNet18 PSPNet 33.42 27.83 16.77M
ResNet18 PSPNet (Ours) 34.13 27.85 16.77M
ResNet18 Deeplabv3 33.48 28.60 16.5M
ResNet18 Deeplabv3 (Ours) 35.07 28.62 16.5M
ResNet18 DANet 33.02 25.61 13.27M
ResNet18 DANet (Ours) 33.99 25.64 13.27M
ResNet18 Deeplabv3 35.98 93.37 16.5M
ResNet18 Deeplabv3 (Ours) 37.04 93.43 16.5M
ResNet50 Deeplabv3 36.77 242.94 42.13M
ResNet50 Deeplabv3 (Ours) 37.71 243.21 42.13M
TABLE II: Comparison of segmentation results with and without our memory module on the RUGD test set. '(Ours)' denotes the network with the memory module. The HRNet encoder uses OS 4 and the second ResNet18 + Deeplabv3 pair uses OS 8 (all other encoders use OS 16). The GFLOPs are computed for an input size of 688×550.

Fig. 8: Visualization of semantic segmentation results from a compact network (MobileNetv2 + Upsampling). Our method is superior at capturing unclear objects, such as the fences in the top two samples and the building in the bottom sample.

Fig. 9: Visualization of semantic segmentation results from a complex network (ResNet50 + Deeplabv3). Our method is superior at capturing unclear objects, such as the vehicle in the top sample, the grass in the third sample, and the buildings in the remaining samples.

IV-E RELLIS

The effectiveness of our method is further validated on the RELLIS dataset. Compared to the RUGD dataset, the RELLIS dataset does not contain frame sequences with significant illumination changes, so the overall quality of the captured images is better than that of the RUGD dataset. However, the RELLIS images contain scenes with wide unobstructed views, so distant objects are captured by only a small number of pixels, making accurate semantic segmentation of such objects difficult. Table III summarizes the test results on a variety of networks and clearly demonstrates that our method improves each of the tested networks. Visualization results from the network 'ResNet18 + Deeplabv3' on RELLIS testing images are shown in Fig. 10. While the network without the memory module has difficulty accurately segmenting the fence post and the distant vehicles (especially the one on the left), the network with our proposed memory module accurately segments those distant objects.

Encoder Decoder mIoU
MobileNetv2 Upsampling 37.26
MobileNetv2 Upsampling (Ours) 37.89
MobileNetv2 Deeplabv3 38.67
MobileNetv2 Deeplabv3 (Ours) 39.24
ResNet18 PSPNet 38.52
ResNet18 PSPNet (Ours) 39.97
ResNet18 DANet 38.92
ResNet18 DANet (Ours) 40.25
ResNet18 Deeplabv3 38.66
ResNet18 Deeplabv3 (Ours) 40.10
ResNet18 Deeplabv3 40.76
ResNet18 Deeplabv3 (Ours) 41.62
ResNet50 Deeplabv3 43.97
ResNet50 Deeplabv3 (Ours) 45.61
TABLE III: Comparison of segmentation results with and without our memory module on the RELLIS test set. '(Ours)' denotes the network with the memory module. The second ResNet18 + Deeplabv3 pair uses OS 8 (all other encoders use OS 16).

Fig. 10: Visualization of semantic segmentation results from the network "ResNet18 + Deeplabv3". Our method is superior at capturing distant objects, such as the fence post in the left sample and the distant vehicles (especially the one on the left) in the right sample.

V Conclusions

In this paper, a built-in memory module was proposed to improve semantic segmentation performance in off-road, unstructured natural environments by refining global contextual feature maps. The memory module stored significant representations of the training images as memory items. The memory items were then recalled to cluster instances of the same class together within the learned embedding space, even when there were significant variances in the embedded features from the encoder. Thus, the memory module contributed to handling the unexpected illumination changes that make objects unclear. Considering real-time navigation of an autonomous platform, the memory module contained only a small number of memory items so as not to increase the computational cost (GFLOPs). To make the best use of the memory module, the triplet loss was employed to minimize redundancy among the memory items so that they stored discriminative representations. We demonstrated the effectiveness of the proposed memory module by applying it to several existing networks. It improved performance while barely affecting efficiency, and the qualitative results showed that our memory module contributed to capturing unclear objects over various off-road, unstructured natural environments. As the proposed method can be integrated into compact networks, it presents a viable approach for resource-limited small autonomous platforms.

References