The task of image segmentation consists of assigning the correct label to each pixel so that the image is partitioned into meaningful regions; in this work, a satellite image is partitioned into road segments and background (e.g. buildings, trees, grass). This task is more important than ever because of recent developments in autonomous vehicles, which require precise guidance for safe driving.
Advances in computational infrastructure and breakthroughs in artificial neural network (ANN) architectures have made machine learning a natural tool for this partitioning task. To process the amount of information in a single image efficiently, we use convolutional neural networks (CNNs), which split, downscale and upscale the image while applying different filters in an efficient way. This enables the use of such networks for real-time image segmentation with high accuracy.
As training data we have satellite images together with corresponding ground-truth maps that label each pixel as road or background. The task of the paper is to predict the labels of new satellite images in chunks of 16x16 pixels. The test images have dimensions 608x608 pixels, so we have to predict 38 x 38 = 1444 chunks per image. To compare the performance of different models and approaches, the mean F-score is used. A chunk is assigned the road label if the proportion of road pixels in it exceeds the threshold of 25%.
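As an illustration, the chunk labeling rule above can be sketched in a few lines of numpy; the helper name is ours, not code from the paper:

```python
import numpy as np

def patch_labels(mask, patch_size=16, threshold=0.25):
    """Label each patch_size x patch_size chunk of a binary ground-truth
    mask as road (1) if the proportion of road pixels exceeds the
    threshold, else background (0)."""
    h, w = mask.shape
    labels = np.zeros((h // patch_size, w // patch_size), dtype=int)
    for i in range(0, h, patch_size):
        for j in range(0, w, patch_size):
            patch = mask[i:i + patch_size, j:j + patch_size]
            labels[i // patch_size, j // patch_size] = int(patch.mean() > threshold)
    return labels

# A 608x608 mask yields 38x38 = 1444 chunk labels.
mask = np.zeros((608, 608), dtype=np.uint8)
mask[:, :300] = 1  # left part of the image is road
print(patch_labels(mask).size)  # 1444
```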
This paper gives the reader a brief overview of the different methods we implemented, the difficulties we encountered during testing, and our final solution, which consists of a set of specially modified U-Net CNNs.
II Models and Methods
II-A General Workflow
Given the training set of 100 images and corresponding masks with size , the task requires classifying the patches of the test set into either road or background. To conduct such classification, we have tested three baseline models:
The sliding window approach, which conducts a binary classification for each patch with the given window size using convolutional neural networks.
The segmentation approach based on U-Net to generate pixel-level classification.
The transfer learning approach, which uses the convolutional layers of a pre-trained MobileNetV2 as the encoder path of U-Net.
In addition, several data pre-processing techniques and training strategies were applied. Our final solution averages the outputs of three U-Net-based models that improve on the network architecture of the original U-Net. This combination significantly improved the F-score on both the validation and the test set.
II-B Data Pre-processing and Augmentation
As the number of aerial images in the training set is relatively small (only 100 images and corresponding masks), we generated 900 patches from them. These patches are used throughout the second and third baseline solutions and the final one. Furthermore, image data augmentation techniques were used to overcome the impact of the limited training data size. We augmented the images by rotation, shifting the image center in width and height, shearing and zooming, and random flipping in both horizontal and vertical directions. The corresponding masks were transformed with the same operations.
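The key point is that image and mask must be transformed jointly with the same random parameters. A minimal numpy sketch of the flip/rotation part (shift, shear and zoom would use the same shared-parameter idea; the helper name is ours):

```python
import numpy as np

def augment_pair(image, mask, rng):
    """Apply one random flip/rotation jointly to an image and its mask,
    so that pixel labels stay aligned with pixels."""
    k = rng.integers(0, 4)          # rotate by k * 90 degrees
    image, mask = np.rot90(image, k), np.rot90(mask, k)
    if rng.random() < 0.5:          # horizontal flip
        image, mask = np.fliplr(image), np.fliplr(mask)
    if rng.random() < 0.5:          # vertical flip
        image, mask = np.flipud(image), np.flipud(mask)
    return image, mask

rng = np.random.default_rng(0)
img = np.arange(16).reshape(4, 4)
aug_img, aug_mask = augment_pair(img, img.copy(), rng)
# Identical inputs stay identical after the shared transform.
print(np.array_equal(aug_img, aug_mask))  # True
```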
II-C Network Architecture
II-C1 Baseline solution 1: sliding-window approach
This approach is based on the work of Gouk et al.
The main idea behind this approach is to classify each patch together with a padding region around it, which provides the model with enough context information. This window is scanned over the image to classify each patch in turn. The patch plus its padding is fed into a convolutional neural network with 5 layers in total. To avoid overfitting, Leaky ReLU was used as the activation function, and Dropout was added with a probability of 0.5.
To further reduce overfitting and increase the number of training windows, we used a data augmentation strategy different from the one described above. First, a patch is randomly selected as a training instance and the window containing it is extracted. The window then undergoes random flipping and rotation, further enhancing the model's generalization ability. If the patch lies at the image border, the padding is a mirrored image around the edge.
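The mirrored-padding window extraction can be sketched as follows; the window and padding sizes here are illustrative choices, not necessarily the paper's exact values:

```python
import numpy as np

def extract_window(image, i, j, patch=16, pad=16):
    """Extract the (i, j)-th patch together with its surrounding
    context window. Borders are handled by mirroring the image
    around its edge (np.pad with mode='reflect')."""
    padded = np.pad(image, pad, mode="reflect")
    top, left = i * patch, j * patch   # patch offsets in the padded image
    size = patch + 2 * pad
    return padded[top:top + size, left:left + size]

img = np.arange(64 * 64).reshape(64, 64)
win = extract_window(img, 0, 0)
print(win.shape)  # (48, 48): a 16x16 patch plus 16 pixels of context
```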
II-C2 Baseline solution 2: U-Net
We implemented the original U-Net architecture for this task (see Figure 3). This is a small and easily trainable fully convolutional neural network, originally designed for binary segmentation of biomedical images. The network consists of two paths, the encoding path and the decoding path. In the first half, the encoding path, the operations are 2D convolutions followed by a non-linear activation function, ELU.
The feature map size is retained in each convolution by padding. Max pooling is then used to reduce the feature map size along the encoding path; after each such step, the number of feature channels is doubled. The U-Net then uses an expansion path (the decoding path) to up-sample the feature maps and eventually create a segmentation mask of the same size as the input image. This path consists of sequences of up-sampling in the form of transposed convolutions, followed by concatenations with feature maps copied from the encoding path. Finally, a convolution with kernel size (1,1) and a Softmax function are applied, mapping each pixel of the input aerial image to one of two classes, road or non-road.
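The shape bookkeeping of the encoding path ('same'-padded convolutions keep the spatial size, 2x2 pooling halves it) can be checked with a tiny numpy sketch; this is an illustration of the two operations, not the actual network code:

```python
import numpy as np

def conv2d_same(x, kernel):
    """'Same'-padded 2D convolution: the output feature map keeps the
    input's spatial size, as in the padded U-Net convolutions."""
    kh, kw = kernel.shape
    padded = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2(x):
    """2x2 max pooling: halves each spatial dimension."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.random.default_rng(0).random((64, 64))
feat = conv2d_same(x, np.ones((3, 3)) / 9.0)
print(feat.shape, max_pool2(feat).shape)  # (64, 64) (32, 32)
```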
II-C3 Baseline solution 3: U-Net with MobileNetV2 trained on ImageNet
To extract robust features, we further applied transfer learning, specifically using pre-trained parts of a lightweight model, MobileNetV2, as the encoder path of U-Net. This is especially applicable here because the task has a rather limited training set, and the method does not require much computational resource. The base model, MobileNetV2, is trained on a large dataset for a different task, in this case the ImageNet dataset, so that the weights of its convolutional layers (especially the first few) are robust feature extractors that transfer to the road extraction task. The fundamental architecture of this baseline is U-Net; we adapted its encoder from the convolutional layers of the pre-trained MobileNetV2, with the encoder output taken from an intermediate layer of MobileNetV2. Four intermediate output layers of MobileNetV2 were concatenated to the transposed convolutional layers of the decoder to build skip connections.
II-C4 Final solution: improvements on U-Net
We made two major improvements to the U-Net architecture and implemented three different models (Figure 4). The final submitted result was generated by averaging the mask values produced by the three models (after the Softmax function); classification was then done with a threshold of 0.25.
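The averaging-and-thresholding step is straightforward; a minimal numpy sketch (the helper name is ours):

```python
import numpy as np

def ensemble_predict(soft_masks, threshold=0.25):
    """Average the soft (post-Softmax) road-probability masks from the
    individual models, then threshold to obtain the final binary mask."""
    avg = np.mean(np.stack(soft_masks), axis=0)
    return (avg > threshold).astype(np.uint8)

# Toy 2x2 probability masks from three hypothetical models.
m1 = np.array([[0.9, 0.1], [0.2, 0.0]])
m2 = np.array([[0.8, 0.3], [0.1, 0.1]])
m3 = np.array([[0.7, 0.2], [0.0, 0.2]])
print(ensemble_predict([m1, m2, m3]))  # [[1 0] [0 0]]
```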
First, we added local skip connections (similar to those of ResNet) to the convolution blocks in both the encoding and decoding pathways of U-Net. This is done by adding an identity mapping of the current layer's input to its output values. Each modified convolution block takes the following general form:
$$x_{l+1} = \sigma\bigl(\mathcal{F}(x_l) + h(x_l)\bigr)$$
where $x_l$ and $x_{l+1}$ are the $l$-th level's input and output tensors, $\mathcal{F}$ is the residual function (the original convolutions followed by batch normalization and nonlinear activation), $\sigma$ is the activation function, and $h$ is the identity mapping function, which in our solution is a simple 2D convolution with kernel size 1. This added operation addresses the gradient degradation problem by letting the gradient backpropagate through the additional identity mapping. It also facilitates training, and utilizes the lower-level semantic features extracted in the previous layers.
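A toy numpy version of such a block may make the structure concrete; this is a sketch under our own naming, with the kernel-size-1 convolution implemented as a per-pixel channel mixing:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, residual_fn, identity_weight):
    """Modified convolution block: output = activation(F(x) + h(x)),
    where h is a 1x1 convolution. For x of shape (H, W, C_in) and
    identity_weight of shape (C_in, C_out), the 1x1 convolution is
    just a matrix product over the channel axis."""
    shortcut = x @ identity_weight
    return relu(residual_fn(x) + shortcut)

# Sanity check: with a zero residual function and an identity weight
# matrix, the block reduces to an activation of the input.
x = np.random.default_rng(0).standard_normal((8, 8, 4))
out = residual_block(x, lambda t: np.zeros_like(t), np.eye(4))
print(np.array_equal(out, relu(x)))  # True
```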
The second improvement is the substitution of the central bottleneck of U-Net with a series of dilated convolution operations, whose summation is fed into the decoder path. Previous work reports that adding dilated convolution modules can solve the problem of degraded picture resolution: the expanded receptive fields maintain per-pixel classification accuracy while generating large-scale feature maps with rich context information.
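How dilation expands the receptive field without adding parameters can be seen in one dimension; a minimal sketch with our own helper name:

```python
import numpy as np

def dilated_conv1d(x, kernel, rate):
    """Valid 1D convolution with dilation: kernel taps are spaced
    `rate` samples apart, so a size-3 kernel covers a receptive
    field of 2*rate + 1 samples with the same 3 weights."""
    k = len(kernel)
    span = (k - 1) * rate + 1
    return np.array([
        sum(kernel[t] * x[i + t * rate] for t in range(k))
        for i in range(len(x) - span + 1)
    ])

x = np.arange(16, dtype=float)
for rate in (1, 2, 4):
    out = dilated_conv1d(x, np.ones(3), rate)
    # receptive field grows (3, 5, 9) while the kernel stays size 3
    print(rate, (3 - 1) * rate + 1, len(out))
```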
In all, we implemented three models based on the above improvements. They are summarized in Table I.
|Model|Description|
|unet-32|32 channels on the first layer, local skip connections|
|unet-64|64 channels on the first layer, local skip connections|
|unet-dilated|parallel dilated convolutions as bottleneck, local skip connections|
II-D Loss function and evaluation metrics
The models were trained to optimize the smoothed dice loss, defined as follows:
$$\mathcal{L}_{\text{dice}} = 1 - \frac{2\sum_i p_i g_i + s}{\sum_i p_i + \sum_i g_i + s}$$
where $p$ is the mask predicted by the model, $g$ is the ground truth, and $s$ is the smoothing coefficient, selected as 1. Model performance is evaluated with the following metrics, where $TP$ denotes true positives and $FP$, $FN$ denote false positives and false negatives:
$$IoU = \frac{TP}{TP + FP + FN}, \qquad F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$
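These quantities are easy to compute directly; a short numpy sketch of the smoothed dice loss and of IoU/F1 for binary masks (our own helper names):

```python
import numpy as np

def dice_loss(pred, truth, smooth=1.0):
    """Smoothed dice loss between a soft predicted mask and a
    binary ground-truth mask; 0 for a perfect prediction."""
    inter = np.sum(pred * truth)
    return 1.0 - (2.0 * inter + smooth) / (np.sum(pred) + np.sum(truth) + smooth)

def iou_f1(pred, truth):
    """IoU and F1 score for binary masks, via TP/FP/FN counts."""
    tp = np.sum((pred == 1) & (truth == 1))
    fp = np.sum((pred == 1) & (truth == 0))
    fn = np.sum((pred == 0) & (truth == 1))
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

truth = np.array([[1, 1], [0, 0]])
print(dice_loss(truth.astype(float), truth))  # 0.0
iou, f1 = iou_f1(truth, truth)
print(float(iou), float(f1))  # 1.0 1.0
```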
We split the data into training and validation sets and trained with a batch size of 8 for at most 100 epochs. When training the unet-32 and unet-64 models (described in Table I), dropout was added after each concatenation with probability 0.2. Batch normalization was performed to accelerate training (as illustrated in Figure 4), placed after each 2D convolution and before the nonlinear activation function. The baseline solutions were trained on Google Colaboratory, and the final solution models were trained on a cluster with one NVIDIA V100 GPU and 28 Intel Xeon E5-2690 v4 CPUs.
III-A Key results from baselines
III-A1 Sliding window approach results
Hyperparameters were fine-tuned for this approach, and the mean F-score on the test set was around 0.85.
III-A2 Transfer learning results
The IoU on the validation set reaches 0.7991, which does not meet typical standards for road extraction. Given that the outcome of transfer learning also depends on the similarity between the pre-training dataset and the target task, a model pre-trained on ImageNet may only find suboptimal parameters for road segmentation.
III-B Final solution results
We observed a clear improvement over the traditional U-Net architecture for all three models proposed above. Detailed performance metrics on the validation and test sets are listed below:
Averaging the mask values predicted by these models yields a higher F-score on the test set (0.9027), and this ensemble was chosen for submission.
In this work we first found that an encoder-decoder architecture such as U-Net effectively couples semantic information at different levels, giving clearer and more accurate results than classifying each patch with a simple 5-layer CNN. By further substituting the U-Net bottleneck with a parallel dilated convolution structure and adding residual-like skip connections, we clearly exceeded the performance of the baseline solutions. However, there is still room for improvement in each of the proposed models. The gap between the scores on the test set and the validation set suggests that the lack of training data is still not fully resolved; possible remedies are image sharpening with filters and color jittering. Also, different selections of the layers transferred into U-Net, as well as different tasks for pre-training the base model, need to be explored further, since they can affect overall model performance and the time required to train until convergence.
- Building Machine Learning and Deep Learning Models on Google Cloud Platform, pp. 59–64.
- (2015) Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.
- (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
- (2014) Fast sliding window classification with convolutional neural networks. pp. 114–118.
- (2016) Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645.
- (2000) Object recognition with gradient-based learning.
- (2019) Accuracy improvement of UNet based on dilated convolution. In Journal of Physics: Conference Series, Vol. 1345, p. 052066.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
- (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), pp. 1929–1958.
- (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.