Segmentation of Roads in Satellite Images using specially modified U-Net CNNs

by Jonas Bokstaller et al.
ETH Zurich

The image classification problem has been investigated in depth by the research community, both with classical computer vision algorithms and with neural networks. The aim of this paper is to build an image classifier for satellite images of urban scenes that identifies the portions of the images in which a road is located, separating these portions from the rest. Unlike conventional computer vision algorithms, convolutional neural networks (CNNs) provide accurate and reliable results on this task. Our approach uses a sliding window to extract patches from the whole image, data augmentation to generate additional training and testing data, and a series of specially modified U-Net CNNs. The proposed technique outperforms all other baselines tested in terms of the mean F-score metric.




I Introduction

The task of image segmentation consists of assigning the right label to each pixel so that the image is partitioned into different parts; in this work, a satellite image is separated into road segments and background (e.g. buildings, trees, grass). This task is more important than ever because of recent developments in autonomous vehicles, which require precise guidance for safe driving.

Advances in computational infrastructure and breakthroughs in artificial neural network (ANN) architectures have made it practical to tackle this partitioning task with machine learning. In order to process the amount of information in a single image efficiently, we use convolutional neural networks (CNNs) [7], because they split, downscale and upscale the image while applying different filters in an efficient way. This enables the use of such networks for real-time image segmentation with high accuracy.

As training data we have satellite images as well as corresponding ground-truth maps that label each pixel as road or background. The task of the paper is to predict the labels of new satellite images in chunks of 16x16 pixels. The test images have dimensions 608x608 pixels, so we have to predict 38x38 = 1444 chunks per image. In order to compare the performance of different models and approaches, the mean F-score is used. The road label is assigned to a chunk if the proportion of road pixels in the chunk exceeds a threshold of 25%.
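The chunk-labeling rule above can be sketched in a few lines of numpy; the function name and the toy mask are illustrative, not from the paper's code:

```python
import numpy as np

def chunk_labels(mask, chunk=16, threshold=0.25):
    """Label each chunk x chunk block of a binary ground-truth mask as
    road (1) if its fraction of road pixels exceeds the threshold."""
    h, w = mask.shape
    labels = np.zeros((h // chunk, w // chunk), dtype=int)
    for i in range(0, h, chunk):
        for j in range(0, w, chunk):
            block = mask[i:i + chunk, j:j + chunk]
            labels[i // chunk, j // chunk] = int(block.mean() > threshold)
    return labels

# A 608x608 test image yields a 38x38 grid of chunks (1444 in total).
mask = np.zeros((608, 608), dtype=np.uint8)
mask[:, 300:316] = 1            # a vertical 16-pixel-wide "road"
labels = chunk_labels(mask)
print(labels.shape)             # (38, 38)
```

Note that a chunk containing exactly 25% road pixels is labeled background, since the rule requires strictly exceeding the threshold.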

Fig. 1: Example from the provided training data. On the left there is the input satellite image and on the right its respective ground-truth.

This paper gives the reader a brief overview of the different methods we implemented, the difficulties we encountered during testing, and our final solution, which consists of a set of specially modified U-Net CNNs.

II Models and Methods

II-A General Workflow

Given the training set of 100 images and their corresponding masks, the task requires classifying the patches of the test set as either road or background. To conduct this classification, we tested three baseline models:

  1. The sliding window approach, which conducts a binary classification for each patch with the given window size using convolutional neural networks.

  2. The segmentation approach based on U-Net to generate pixel-level classification.

  3. A transfer-learning strategy for segmentation. The encoder of the U-Net is adapted from the base model of MobileNetV2; parameters pre-trained on ImageNet were used for initialization.

In addition, several data pre-processing techniques and training strategies were applied. Our final solution averages the outputs of three U-Net-based models that incorporate improvements to the original U-Net architecture. This combination significantly improved the F-score on both the validation and the test set.

II-B Data Pre-processing and Augmentation

As the number of aerial images in the training set is relatively small (only 100 images with corresponding masks), we generated 900 patches from them. These patches are used throughout the second and third baseline solutions and the final one. Furthermore, image data augmentation techniques were used to overcome the impact of the limited training data size. We augmented the images by rotation, by shifting the width and height of the image center, by shearing and zooming, and by random flipping in both the horizontal and vertical directions. The corresponding masks were transformed with the same operations.
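The key constraint above is that every geometric transform must be applied jointly to the image and its mask. A minimal numpy stand-in for the flip/rotation part of such a pipeline (the paper's actual implementation used Keras-style generators; `augment_pair` is an illustrative name) could look like this:

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one random flip/rotation jointly to an image and its mask,
    so the pixel-level labels stay aligned with the transformed image."""
    k = int(rng.integers(0, 4))            # rotate by 0/90/180/270 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:                 # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:                 # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    return image.copy(), mask.copy()

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))
msk = (rng.random((64, 64)) > 0.5).astype(np.uint8)
aug_img, aug_msk = augment_pair(img, msk, rng)
```

Because flips and 90-degree rotations only permute pixels, the fraction of road pixels in the mask is unchanged, which makes the transform easy to sanity-check.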

II-C Network Architecture

II-C1 Baseline solution 1: sliding-window approach

This approach is based on the work of Gouk et al. [5]. The main idea is to classify each patch together with a padding region around it, which provides the model with enough context information. We then scan this window over the image in order to classify each patch. The patch plus its padding is fed into a convolutional neural network with 5 layers in total. To reduce overfitting, Leaky ReLU was used as the activation function, and Dropout [11] was added with a probability of 0.5.


Fig. 2: Patch with the padding around it, as used in the sliding-window approach. The padding provides the model with enough contextual information. The stride of the sliding window equals the patch size, namely 16x16.

To further reduce overfitting and increase the number of training windows, we used a data augmentation strategy different from the one described above. First, a patch is randomly selected as a training instance and the window around that patch is formed. The window then undergoes random flipping and rotation, which further enhances the model's generalization ability. If the patch lies at the image border, the padding is obtained by mirroring the image around its edge.
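The window extraction with mirrored border padding can be sketched with `np.pad` in reflect mode; the 16-pixel context padding here is an assumption chosen so that each window is three patches wide:

```python
import numpy as np

def extract_windows(image, patch=16, pad=16):
    """Extract one (patch + 2*pad)-sized window per 16x16 patch.
    Border patches get their context by mirroring the image at its edge."""
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    windows = []
    for i in range(0, image.shape[0], patch):
        for j in range(0, image.shape[1], patch):
            # window = the patch plus `pad` pixels of context on each side
            windows.append(padded[i:i + patch + 2 * pad,
                                  j:j + patch + 2 * pad])
    return np.stack(windows)

img = np.random.rand(608, 608, 3)
wins = extract_windows(img)
print(wins.shape)   # (1444, 48, 48, 3)
```

Each of the 1444 chunks of a test image thus becomes one classifier input, and no window ever reads outside the padded image.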

II-C2 Baseline solution 2: U-Net

We implemented the original U-Net architecture [9] for this task (see Figure 3). This is a small and easily trainable fully convolutional neural network, originally designed for binary segmentation of biomedical images. The network consists of two paths, an encoding path and a decoding path. In the first half, the encoding path, the operations are 2D convolutions followed by a non-linear activation function, ELU [3]. The feature map size is retained in each convolution by padding. Max pooling is then used to reduce the feature maps along the encoding path; after each such step the number of feature channels is doubled. The U-Net then uses an expansion (decoding) path to up-sample the feature maps and eventually create a segmentation mask of the same size as the input image. This path consists of sequences of up-sampling in the form of transposed convolutions, each followed by a concatenation with the feature maps copied from the encoding path. Finally, a convolution with kernel size (1,1) and a Softmax function are applied, mapping each pixel of the input aerial image to one of two classes, road or non-road.
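The shape bookkeeping of one encoder/decoder level can be traced in numpy (a sketch only: the learned convolutions are omitted, and nearest-neighbour up-sampling stands in for the transposed convolution):

```python
import numpy as np

def pool2x2(x):
    """2x2 max pooling over an (H, W, C) feature map."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample2x(x):
    """Nearest-neighbour 2x up-sampling (stand-in for a transposed conv)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

# Shape flow through one U-Net level:
feat = np.random.rand(400, 400, 64)          # encoder feature map
down = pool2x2(feat)                         # (200, 200, 64): halved spatially
deep = np.random.rand(200, 200, 128)         # after convs double the channels
up = upsample2x(deep)                        # (400, 400, 128): back up
skip = np.concatenate([feat, up], axis=-1)   # (400, 400, 192) skip connection
print(down.shape, up.shape, skip.shape)
```

The concatenation is what lets the decoder combine coarse semantic features with the fine spatial detail preserved in the encoder.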

Fig. 3: The structure of the modified U-Net architecture.

II-C3 Baseline solution 3: U-Net with MobileNetV2 trained on ImageNet

In order to extract robust features, we further applied transfer learning, specifically using pre-trained parts of a lightweight model, MobileNetV2 [10], as the encoder path of the U-Net. This is especially suitable here because the task has a rather limited training set, and the method does not require much computational resource. The base model, MobileNetV2, is trained on a large dataset from another task, in this case the ImageNet dataset [4], so that the weights of its convolutional layers (especially the first few) are more robust at extracting features that transfer to the road-extraction task. The fundamental architecture of this baseline remains the U-Net: we adapted its encoder from the convolutional layers of pre-trained MobileNetV2, and four intermediate output layers of MobileNetV2 were concatenated to the transposed convolutional layers of the decoder to build the skip connections.

II-C4 Final solution: improvements on U-Net

We made two major improvements to the U-Net architecture and implemented three different models (Figure 4). The final submitted result was generated by averaging the mask values produced by the three models (after the Softmax function); classification was then done with a threshold of 0.25.
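The ensemble step is a plain per-pixel average followed by a threshold; a minimal sketch with hypothetical model outputs:

```python
import numpy as np

def ensemble_predict(prob_maps, threshold=0.25):
    """Average per-pixel road probabilities from several models
    (softmax outputs) and threshold the mean to get a binary mask."""
    mean = np.mean(prob_maps, axis=0)
    return (mean > threshold).astype(np.uint8)

# Three toy model outputs for a 4x4 region:
p1 = np.full((4, 4), 0.20)
p2 = np.full((4, 4), 0.30)
p3 = np.full((4, 4), 0.40)
mask = ensemble_predict([p1, p2, p3])
print(mask[0, 0])   # mean 0.30 > 0.25 -> 1
```

Averaging probabilities before thresholding lets a confident model outvote an uncertain one, which simple majority voting on binary masks would not.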

Fig. 4: The major improvements on the U-Net architecture: (a) addition of an identity mapping before the final non-linear activation in each convolution block, as in ResNet; (b) substitution of the original bottleneck with dilated convolutions.

First, we added local skip connections (similar to those of ResNet [6]) to the convolution blocks in both the encoding and decoding pathways of the U-Net. This is done by adding an identity mapping of the current layer's input to its output values. Each modified convolution block takes the following general form:

    y_l = f( F(x_l) + h(x_l) )

where x_l and y_l are the l-th level's input and output tensors, F is the residual function (the original convolutions followed by batch normalization and non-linear activation), f is the activation function, and h is the identity mapping function, which in our solution is a simple 2D convolution with kernel size 1. This added operation addresses the gradient degradation problem by letting the gradient backpropagate through the additional identity-mapping path. It also facilitates training and makes use of the lower-level semantic features extracted in the previous layers.
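The block form y = f(F(x) + h(x)) can be demonstrated in numpy; here h is a 1x1 convolution (a per-pixel linear map across channels), f is ReLU, and the residual branch F is a toy stand-in for the real convolution + batch-norm + activation stack:

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: a per-pixel linear map across channels,
    (H, W, C_in) @ (C_in, C_out) -> (H, W, C_out)."""
    return x @ w

def residual_block(x, residual_fn, w_identity):
    """y = f(F(x) + h(x)) with f = ReLU, as in the modified conv blocks."""
    return np.maximum(residual_fn(x) + conv1x1(x, w_identity), 0.0)

rng = np.random.default_rng(0)
x = rng.random((8, 8, 16))
w = rng.random((16, 32))                  # 1x1 conv mapping 16 -> 32 channels
F = lambda t: np.tanh(conv1x1(t, w))      # toy stand-in residual branch
y = residual_block(x, F, w)
print(y.shape)   # (8, 8, 32)
```

The 1x1 convolution in the shortcut is what lets the identity branch match the output channel count when F changes the number of channels.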

The second improvement is the substitution of the central bottleneck of the U-Net with a series of dilated convolution operations. The summation of these convolutions is fed into the decoder path. Previous work [8], [12] reports that adding dilated convolution modules counteracts the loss of picture resolution: the expanded receptive fields maintain per-pixel classification accuracy while generating large-scale feature maps with rich context information.
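The receptive-field growth from dilation is easy to quantify. For a stack of stride-1 convolutions, each layer with kernel size k and dilation d adds (k-1)*d pixels of context; the dilation rates below are illustrative, not the paper's exact configuration:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field of a stack of stride-1 dilated convolutions:
    rf = 1 + sum over layers of (k - 1) * d."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# Three plain 3x3 convolutions vs. three dilated ones (rates 1, 2, 4):
plain = receptive_field([3, 3, 3], [1, 1, 1])     # 7
dilated = receptive_field([3, 3, 3], [1, 2, 4])   # 15
print(plain, dilated)
```

The dilated stack more than doubles the context seen per output pixel at the same parameter count, which is exactly the trade-off the bottleneck substitution exploits.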

In total, we implemented three models based on the above improvements. They are summarized in Table I.

Model name Description
unet-32 32 channels on the first layer, local skip connections
unet-64 64 channels on the first layer, local skip connections
unet-dilated parallel dilated convolutions as bottleneck, local skip connections
TABLE I: Models used in final solution

II-D Loss function and evaluation metrics

The models were trained to optimize the smoothed Dice loss, defined as

    L_dice(p, g) = 1 - (2 * Σ_i p_i g_i + s) / (Σ_i p_i + Σ_i g_i + s)

where p denotes the mask values predicted by the model, g is the ground truth, and s is the smoothing coefficient, set to 1. Model performance is evaluated with the IoU (intersection over union) and the F1 score, where TP is the number of true positives and FP and FN are the numbers of false positives and false negatives:

    IoU = TP / (TP + FP + FN),    F1 = 2 TP / (2 TP + FP + FN)
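The loss and both metrics are a few lines of numpy each; a sketch following the definitions above (function names are illustrative):

```python
import numpy as np

def smoothed_dice_loss(pred, truth, smooth=1.0):
    """1 - (2*sum(p*g) + s) / (sum(p) + sum(g) + s), with s = 1."""
    inter = np.sum(pred * truth)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(truth) + smooth)

def iou_f1(pred, truth):
    """IoU and F1 computed from binary masks via TP/FP/FN counts."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    return tp / (tp + fp + fn), 2 * tp / (2 * tp + fp + fn)

truth = np.array([[1, 1, 0, 0]])
pred = np.array([[1, 0, 1, 0]])
iou, f1 = iou_f1(pred, truth)
print(iou, f1)   # 1/3 and 0.5
```

Note the two metrics are monotonically related (F1 = 2*IoU / (1 + IoU)), so they rank models identically; the smoothing term keeps the loss finite when both masks are empty.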

II-E Training

We split the data into a training and a validation set. The initial learning rate is 0.0001; it is reduced by a factor of 0.5 if the validation loss improves by less than 0.0002 over 5 epochs. The model was built in Keras [2] and trained with a batch size of 8 for at most 100 epochs. When training the unet-32 and unet-64 models (described in Table I), dropout was added after each concatenation with probability 0.2. Batch normalization was used to accelerate training (as illustrated in Figure 4), placed after each 2D convolution and before the non-linear activation functions. The baseline solutions were trained on Google Colaboratory [1], and the final solution models were trained on a cluster using one NVIDIA V100 GPU and 28 Intel Xeon E5-2690 v4 CPUs.
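The learning-rate schedule described above matches the behavior of a plateau-based scheduler (in Keras this is typically the `ReduceLROnPlateau` callback). A plain-Python replay of that logic, under the stated hyperparameters:

```python
def plateau_schedule(val_losses, lr0=1e-4, factor=0.5,
                     patience=5, min_delta=2e-4):
    """Halve the learning rate whenever the validation loss has not
    improved by at least min_delta for `patience` consecutive epochs.
    Returns the learning rate in effect after each epoch."""
    lr, best, wait = lr0, float("inf"), 0
    history = []
    for loss in val_losses:
        if best - loss > min_delta:       # meaningful improvement
            best, wait = loss, 0
        else:                             # plateau: count stalled epochs
            wait += 1
            if wait >= patience:
                lr, wait = lr * factor, 0
        history.append(lr)
    return history

# Loss stalls after epoch 2, so the rate halves once patience runs out:
lrs = plateau_schedule([0.10, 0.05, 0.049, 0.049, 0.049,
                        0.049, 0.049, 0.049])
print(lrs[-1])   # 5e-05
```

Tying the decay to the validation loss rather than a fixed epoch count keeps the rate high while the model is still improving.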

III Results

III-A Key results from baselines

III-A1 Sliding window approach results

Hyperparameters were fine-tuned for this approach, and the mean F-score on the test set was around 0.85.

III-A2 Transfer learning results

The IoU on the validation set reached 0.7991, which does not meet typical standards for road extraction. Given that the outcome of transfer learning depends on the similarity between the pre-training dataset and the target task, the model conditioned on ImageNet may only find suboptimal parameters for road segmentation.

III-B Final solution results

All three models proposed above showed a clear improvement over the traditional U-Net architecture. Detailed performance metrics on the validation and test sets are listed below:

Model val_loss val_f1 test_f1
unet-32 0.0826 0.9174 0.8780
unet-64 0.0368 0.9632 0.8959
unet-dilated 0.0334 0.9666 0.9015
TABLE II: Performance metrics of proposed models

Averaging the mask values predicted by these models yields a higher F-score on the test set (0.9027), which was chosen for submission.

IV Discussion

In this work we first found that an encoder-decoder architecture such as U-Net effectively couples semantic information at different levels, giving clearer and more accurate results than classifying each patch with a simple 5-layer CNN. By further substituting the U-Net bottleneck with a parallel dilated-convolution structure and adding residual-like skip connections, we clearly exceeded the performance of the baseline solutions. However, there is still room for improvement in each of the proposed models. The gap between the scores on the test and validation sets suggests that the lack of training data is not yet fully resolved; possible remedies are image sharpening with filters and color jittering. In addition, different selections of the layer parameters transferred into the U-Net, as well as different tasks on which the base model is pre-trained, need to be explored further, as both can affect overall model performance and the training time until convergence.


  • [1] E. Bisong (2019) Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform, pp. 59–64. Cited by: §II-E.
  • [2] F. Chollet et al. (2015) Keras. Cited by: §II-E.
  • [3] D. Clevert, T. Unterthiner, and S. Hochreiter (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289. Cited by: §II-C2.
  • [4] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. Cited by: §II-C3.
  • [5] H. Gouk and A. Blake (2014) Fast sliding window classification with convolutional neural networks. pp. 114–118. Cited by: §II-C1.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Cited by: §II-C4.
  • [7] Y. LeCun, P. Haffner, and Y. Bengio (2000) Object recognition with gradient-based learning. Cited by: §I.
  • [8] S. Piao and J. Liu (2019) Accuracy improvement of UNet based on dilated convolution. In Journal of Physics: Conference Series, Vol. 1345, pp. 052066. Cited by: §II-C4.
  • [9] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241. Cited by: §II-C2.
  • [10] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520. Cited by: §II-C3.
  • [11] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §II-C1.
  • [12] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §II-C4.