Vehicle Instance Segmentation from Aerial Image and Video Using a Multi-Task Learning Residual Fully Convolutional Network

05/26/2018 · by Lichao Mou, et al. · DLR

Object detection and semantic segmentation are two main themes in object retrieval from high-resolution remote sensing images, both of which have recently achieved remarkable performance by riding the wave of deep learning and, more notably, convolutional neural networks (CNNs). In this paper, we are interested in a novel, more challenging problem of vehicle instance segmentation, which entails identifying, at a pixel level, where the vehicles appear as well as associating each pixel with a physical instance of a vehicle. In contrast, vehicle detection and semantic segmentation each concern only one of the two. We propose to tackle this problem with a semantic boundary-aware multi-task learning network. More specifically, we utilize the philosophy of residual learning (ResNet) to construct a fully convolutional network that is capable of harnessing multi-level contextual feature representations learned from different residual blocks. We theoretically analyze and discuss why residual networks can produce better probability maps for pixel-wise segmentation tasks. Then, based on this network architecture, we propose a unified multi-task learning network that can simultaneously learn two complementary tasks, namely, segmenting vehicle regions and detecting semantic boundaries. The latter subproblem is helpful for differentiating closely spaced vehicles, which are usually not correctly separated into instances. Currently, the only datasets with pixel-wise annotation suitable for vehicle extraction are the ISPRS dataset and the IEEE GRSS DFC2015 dataset over Zeebrugge, both of which target semantic segmentation. Therefore, we built a new, more challenging dataset for vehicle instance segmentation, called the Busy Parking Lot UAV Video dataset, and we make it publicly available so that it can be used to benchmark future vehicle instance segmentation algorithms.




I Introduction

The last decade has witnessed dramatic progress in modern remote sensing technologies – along with the launch of small, cheap commercial high-resolution satellites and the now widespread availability of unmanned aerial vehicles (UAVs) – which facilitates a diversity of applications, such as urban management [1, 2, 3, 4], monitoring of land changes [5, 6, 7, 8], and traffic monitoring [9, 10]. Among these applications, object extraction from very high-resolution remote sensing images/videos has gained increasing attention in the remote sensing community in recent years, particularly vehicle extraction, due to successful civil applications. Vehicle extraction, however, is still a challenging task, mainly because it is easily affected by several factors, such as vehicle appearance variation, the effects of shadow, illumination changes, and complicated, cluttered backgrounds. Existing vehicle extraction approaches can be roughly divided into two categories: vehicle detection and vehicle semantic segmentation.

Fig 1: An illustration of different vehicle extraction methods. From left to right and top to bottom: input image, vehicle detection, semantic segmentation, and vehicle instance segmentation. The challenge of vehicle instance segmentation is that some vehicles are segmented incorrectly: while most pixels belonging to the vehicle category are identified correctly, they are not correctly separated into instances (see arrows in the lower left image).

I-A Vehicle Detection

The goal of vehicle detection is to detect all instances of vehicles and localize them in the image, typically in the form of bounding boxes with confidence scores. Traditionally, this topic was addressed by works that use low-level, hand-crafted visual features (e.g., color histograms, texture features, the scale-invariant feature transform (SIFT), and histograms of oriented gradients (HOG)) together with classifiers. For example, in [11], the authors incorporate multiple visual features, namely local binary patterns (LBP), HOG, and opponent histograms, for vehicle detection from high-resolution aerial images. Moranduzzo and Melgani [12] first use SIFT to detect interest points of vehicles and then train a support vector machine (SVM) to classify these interest points into vehicle and non-vehicle categories based on the SIFT descriptors. They later present an approach [13] that performs filtering operations in horizontal and vertical directions to extract HOG features and yields vehicle detections after the computation of a similarity measure, using a catalog of vehicles as a reference. In [14], the authors make use of an integral channel concept, with Haar-like features and an AdaBoost classifier in a soft-cascade structure, to achieve fast and robust vehicle detection.

The aforementioned approaches mainly rely on hand-crafted features for constructing a classification system. Recently, as an important branch of the deep learning family, the convolutional neural network (CNN) has become the method of choice in many computer vision and remote sensing problems [15, 16, 17, 18, 19] (e.g., object detection) due to its ability to automatically extract mid- and high-level abstract features from raw images for pattern recognition purposes. Chen et al. [20] propose a vehicle detection model, called the hybrid deep neural network, which combines a sliding window technique with a CNN. The main insight behind their model is to divide the feature maps of the last convolutional layer into different scales, allowing for the extraction of multi-scale features for vehicle detection. In [21], the authors segment an input image into homogeneous superpixels that can be considered vehicle candidate regions, make use of a pre-trained deep CNN to extract features, and train a linear SVM to classify these candidate regions into vehicle and non-vehicle classes.

I-B Vehicle Semantic Segmentation

Vehicle semantic segmentation aims to label each pixel in an image as belonging to the vehicle class or to other categories (e.g., building, tree, low vegetation, etc.). In comparison with vehicle detection, it gives more accurate, pixel-wise extraction results. More recently, progress in deep CNNs, particularly fully convolutional networks (FCNs), has made it possible to achieve end-to-end vehicle semantic segmentation. For instance, Audebert et al. [22] propose a deep learning-based “segment-before-detect” method for semantic segmentation and subsequent classification of several types of vehicles in high-resolution remote sensing images. The use of SegNet [23] in this method is capable of producing pixel-wise annotations for vehicle semantic mapping. In addition, several recent works on semantic segmentation of high-resolution aerial imagery also involve vehicle segmentation. In [24], the authors focus on class imbalance, which often poses a problem for semantic segmentation in remote sensing images, since small objects (e.g., vehicles) are less prioritized in an effort to achieve good overall accuracy. To address this problem, they train FCNs using a cross-entropy loss function weighted with median frequency balancing, as proposed by Eigen and Fergus.

I-C Is Semantic Segmentation Good Enough for Vehicle Extraction?

The existence of “touching” vehicles in a remote sensing image makes it quite hard for most vehicle semantic segmentation methods to separate objects individually, while in most cases, we need to know not only which pixels belong to vehicles (vehicle semantic segmentation problem) but also the exact number of vehicles (vehicle detection task). This drives us to examine instance-oriented vehicle segmentation.

Vehicle instance segmentation seeks to identify the semantic class of each pixel (i.e., vehicle or non-vehicle) as well as associate each pixel with a physical instance of a vehicle. This is contrasted with vehicle semantic segmentation, which is only concerned with the above-mentioned first task. In this work, we are interested in vehicle instance segmentation in a complex, cluttered, and challenging background from aerial images and videos. Moreover, since deep networks have recently been very successful in a variety of remote sensing applications, from hyper/multi-spectral image analysis to interpretation of high-resolution aerial images to multimodal data fusion [15], in this paper, we would like to use an end-to-end network to achieve vehicle instance segmentation. Our work contributes to the literature in three major respects:

  • So far, most studies in the remote sensing community have focused on object detection and semantic segmentation in high-resolution remote sensing imagery; instance segmentation has rarely been addressed. In a pioneering work moving from semantic segmentation to instance segmentation, Audebert et al. [22] developed a three-stage segment-before-detect framework. In this paper, we address the vehicle instance segmentation problem with an end-to-end learning framework.

  • In order to facilitate progress in the field of vehicle instance segmentation in high-resolution aerial images/videos, we provide a new, challenging dataset that presents a high range of variation – with a diversity of vehicle appearances, effects of shadow, a cluttered background, and extremely close vehicle distances – for producing quantitative measurements and comparing among approaches.

  • We present a semantic boundary-aware unified multi-task learning fully convolutional network, which is end-to-end trainable, for vehicle instance segmentation. Inspired by several recent works [26, 27, 28], we exploit ResNet [29] to construct the feature extractor of the whole network. In this paper, we theoretically analyze and discuss why residual networks can produce better probability maps for pixel-wise prediction tasks. The proposed multi-task learning network creates two separate, yet identical branches to jointly optimize two complementary tasks – namely, vehicle semantic segmentation and semantic boundary detection. The latter subproblem is beneficial for differentiating vehicles with an extremely close distance and further improving instance segmentation performance.

The remainder of this paper is organized as follows. After the introductory Section I, detailing vehicle extraction from high-resolution remote sensing imagery, we enter Section II, dedicated to the details of the proposed semantic boundary-aware multi-task learning network for vehicle instance segmentation. Section III then provides dataset information, the network setup, and experimental results and discussion. Finally, Section IV concludes the paper.

II Methodology

We formulate the vehicle instance segmentation task as two subproblems, namely vehicle detection and semantic segmentation. The training set is denoted by S = {(X_i, Y_i, Z_i)}, i = 1, ..., N, where N is the number of training samples. Since we consider each image independently, the subscript i is dropped hereafter for notational simplicity. X represents a raw input image, Y denotes its corresponding manually annotated pixel-wise segmentation mask, and Z is the instance label, where Z_k indicates the set of pixels inside the k-th region (regions in the image satisfy R_i ∩ R_j = ∅ for i ≠ j and ∪_k R_k = R, where R is the whole image region). K is the total number of vehicle instances in the image, and Z_0 is the background area; for other values of k, Z_k denotes the corresponding vehicle instance. Note that instance labels only count vehicle instances, thus they are commutative. Our aim is to segment vehicles while ensuring that all instances are differentiated. In this work, we approximate vehicle detection by semantic boundary detection, i.e., detecting the boundaries of each object instance in the images; compared to edge detection, it focuses more on the association of boundaries with their object instances. We generate semantic boundary labels B from Y to train a boundary detector, where B is a binary map of the same size as Y whose pixels equal 1 when they belong to a boundary.
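As a minimal sketch of how such boundary labels can be derived from an instance mask (pure Python; the function name and the 4-connectivity choice are ours, not specified by the paper), a vehicle pixel is marked as a boundary pixel when any neighbour carries a different instance label:

```python
def boundary_labels(instance_mask):
    """Derive a binary semantic-boundary map from an instance label map
    (0 = background, 1..K = vehicle instances).

    A pixel is a boundary pixel if it lies on a vehicle and at least one
    of its 4-neighbours carries a different label (another instance or
    the background).
    """
    h, w = len(instance_mask), len(instance_mask[0])
    boundary = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if instance_mask[y][x] == 0:
                continue  # background pixels never form semantic boundaries
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < h and 0 <= nx < w
                        and instance_mask[ny][nx] != instance_mask[y][x]):
                    boundary[y][x] = 1
                    break
    return boundary
```

Touching instances thus produce a thin boundary ridge between them, which is exactly the cue the boundary branch is trained to detect.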

In this section, we describe in detail our proposed semantic boundary-aware multi-task learning network for accurate vehicle instance segmentation. We start by introducing the FCN architecture for end-to-end semantic segmentation in Section II-A. Furthermore, we propose to exploit multi-level contextual feature representations, generated by different stages of a residual network, to construct a residual FCN for producing better likelihood maps of vehicle regions or semantic boundaries (see Section II-B). Then, in Section II-C, we elaborate the semantic boundary-aware unified multi-task learning network drawn from the residual FCN for effective instance segmentation by jointly optimizing the complementary tasks.

Fig 2: The network architecture of the ResFCN we use, as illustrated in Section II-B. We incorporate multi-level contextual features from the last three residual blocks of a classification ResNet, since making use of information from fairly early, fine-grained layers is beneficial for segmenting small objects such as vehicles. To get the desired full-resolution output, we use convolutional layers followed by upsampling operations to upsample back to the spatial resolution of the input image. Then, predictions from different residual blocks are fused together with a summing operation.

II-A Fully Convolutional Network for Semantic Segmentation

Long et al. [30] first proposed the FCN architecture for semantic segmentation tasks, which is both efficient and effective. Later, several extensions of the FCN model were proposed to improve semantic segmentation performance. To name a few, in [31], the authors removed some of the max-pooling operations and, accordingly, introduced atrous/dilated convolutions in their network, which expand the field of view without increasing the number of parameters. As post-processing, a dense conditional random field (CRF) was trained separately to refine the estimated category score maps for further improvement. Zhang et al. introduced a network that combines an FCN with CRF-based probabilistic graphical modeling, formulating mean-field approximate inference for a CRF with Gaussian pairwise potentials as a recurrent neural network (RNN).

II-B Residual Fully Convolutional Network (ResFCN)

Here, we first explain how to construct a ResFCN according to existing works in the literature, mainly ResNet [29] and the FCN [30]. Then, we theoretically analyze why the ResFCN is able to offer better performance than other FCNs based on traditional feedforward network architectures (e.g., VGG Nets [33]).

Network design. Several recent studies in computer vision [26, 27, 28] have shown that ResNet [29] is capable of offering better features for pixel-wise prediction tasks such as semantic segmentation [26, 27] and depth estimation [28]. We therefore make use of ResNet to construct the segmentation network in our work. We initialize the ResFCN from the original version of ResNet [29] instead of the newly presented pre-activation version [34]. Unlike [30], we directly remove the fully connected layers from the original ResNet rather than convolutionalizing them to make one prediction per spatial location. Moreover, we keep the first convolutional layer and the max-pooling layer, which enlarge the field of view of the feature representations. One recent trend in network architecture design is stacking convolutional layers with small convolution kernels throughout the entire network, because stacked small kernels are more efficient than a single large filter, given the same computational complexity. However, a recent study [35] found that large filters also play an important role when classification and localization tasks are performed simultaneously. This can be easily understood through the analogy of individuals commonly confirming the category of a pixel by referring to its surrounding context region.
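The small-kernel-versus-large-kernel trade-off can be made concrete by computing receptive fields. A small helper (ours, for illustration; it assumes no dilation):

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field of a stack of convolutional layers.

    Each layer grows the receptive field by (k - 1) * jump, where jump
    is the product of the strides of all preceding layers.
    """
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump
        jump *= s
    return rf
```

Two stacked 3x3 convolutions cover the same 5x5 window as a single 5x5 kernel, and three cover 7x7, with fewer parameters; the point of [35] is that a single large filter nevertheless helps when classification and localization must be performed simultaneously.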

By now, the output feature maps have only a small fraction of the resolution of the original input image, which is too low to precisely differentiate individual pixels. To deal with this problem, Long et al. [30] made use of backwards strided convolutions that upsample the feature maps and output score masks. The motivation behind this is that the convolutional layers and max-pooling layers focus on extracting high-level abstract features, whereas the backwards strided convolutions estimate the score masks in a pixel-wise way. Ghiasi et al. [36] proposed a multi-resolution reconstruction architecture based on a Laplacian pyramid that uses skip connections from higher-resolution feature maps and multiplicative gating to successively refine segment boundaries reconstructed from lower-resolution maps. Inspired by these works, in this paper we exploit multi-level contextual feature representations that include information from different residual blocks (i.e., different levels of contextual information). Fig. 2 illustrates the ResFCN architecture we use with multi-level contextual features. More specifically, we incorporate feature representations from the last three residual blocks of the original ResNet, since making use of information from fairly early, fine-grained layers is beneficial for segmenting small objects such as vehicles. To get the desired full-resolution output, we use a convolutional layer that adaptively squashes the number of channels down to the number of labels (1 for binary classification), take advantage of an upsampling operation to upsample back to the spatial resolution of the input image, and make predictions based on contextual cues from the given fields of view. Then, these predictions are fused together with a summing operation, and the final segmentation results are generated after sigmoid classification.
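A toy numpy sketch of this fusion head (illustrative only: nearest-neighbour upsampling stands in for the learned upsampling, and a per-block weight vector plays the role of the channel-squashing convolution; all names are ours):

```python
import numpy as np

def upsample_nn(score, factor):
    """Nearest-neighbour upsampling of a 2-D score map."""
    return np.repeat(np.repeat(score, factor, axis=0), factor, axis=1)

def fuse_predictions(feature_maps, weights, factors):
    """Fuse per-block score maps into one full-resolution probability map.

    feature_maps: list of (H_i, W_i, C_i) arrays from residual blocks.
    weights:      list of (C_i,) vectors standing in for the convolution
                  that squashes channels down to a single score.
    factors:      upsampling factor per block back to input resolution.
    """
    fused = None
    for fmap, w, f in zip(feature_maps, weights, factors):
        score = fmap @ w               # channel squashing -> (H_i, W_i)
        score = upsample_nn(score, f)  # back to input resolution
        fused = score if fused is None else fused + score
    return 1.0 / (1.0 + np.exp(-fused))  # sigmoid -> probabilities
```

Summing the per-block scores before the sigmoid lets coarse context and fine detail vote jointly on every pixel, which is the intended effect of the multi-level fusion.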

Fig 3: Overall architecture of the proposed semantic boundary-aware ResFCN. We propose to use such a unified multi-task learning network for vehicle instance segmentation, which creates two separate, yet identical branches to jointly optimize two complementary tasks, namely, vehicle semantic segmentation and semantic boundary detection. The latter subproblem is beneficial for differentiating “touching” vehicles and further improving the instance segmentation performance.

Why residual learning? Until recently, the majority of feedforward networks, like AlexNet [37] and VGG Nets [33], were made up of a linear sequence of layers. Let x_{l-1} and x_l denote the input and output of the l-th layer/block, respectively. Each layer in such a network learns a mapping function F:

x_l = F(x_{l-1}; W_l),     (1)

where W_l denotes the parameters of the l-th layer. This kind of network is often referred to as a traditional feedforward network.

According to a study by He et al. [29], simply deepening traditional feedforward networks usually leads to an increase in training and test errors (the so-called degradation problem). A residual learning-based network is composed of a sequence of residual blocks and exhibits significantly improved training characteristics, making network depths attainable that were previously out of reach. The output of the l-th residual block in a ResNet can be computed as

x_l = x_{l-1} + F(x_{l-1}; W_l),     (2)

where F is the residual function, parametrized by W_l. The core insight of ResNet is that a shortcut connection from the input to the output bypasses two or more convolutional layers by performing an identity mapping, whose result is then added to the output of the stacked convolutions. By doing so, F only computes a residual instead of computing the output directly.
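The residual recurrence and its telescoped form can be checked numerically on a toy chain (a scalar residual function stands in for the stacked convolutions; everything here is illustrative):

```python
import numpy as np

def residual_block(x, w):
    """Toy residual block: identity shortcut plus F(x; w) = w * relu(x)."""
    return x + w * np.maximum(x, 0.0)

x = np.array([1.0, -2.0, 3.0])
ws = [0.1, 0.2, 0.3]

# Block-by-block forward pass, accumulating the residuals on the side.
out = x.copy()
residual_sum = np.zeros_like(x)
for w in ws:
    residual_sum += w * np.maximum(out, 0.0)  # F(x_i; W_{i+1})
    out = residual_block(out, w)

# Telescoped form of Eq. (3): x_L = x_l + sum of the residuals.
assert np.allclose(out, x + residual_sum)
```

The identity shortcut is what makes the telescoped sum exact: each block only adds a residual on top of whatever arrived from below.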

In the experiments, we found that the ResFCN offers better performance than FCNs based on traditional feedforward architectures, such as VGG-FCN. What is the reason behind this? To answer this question, we need to go deeper. From Eq. (2), we can easily derive the following recurrence formula:

x_L = x_l + Σ_{i=l}^{L-1} F(x_i; W_{i+1}),     (3)

for any deeper residual block L and any shallower residual block l. Eq. (3) shows that the ResFCN creates a direct path for propagating information from shallow layers (i.e., x_l) through the entire network. Several recent studies [38, 39]

that attempt to reveal what is learned by CNNs show that deeper layers exploit filters to grasp global high-level information, while shallower layers capture low-level details, such as object boundaries and edges, which are of great importance in small object detection/segmentation. In addition, when we dive into the backward propagation process, the chain rule of backpropagation applied to Eq. (3) yields

∂E/∂x_l = ∂E/∂x_L · (1 + ∂/∂x_l Σ_{i=l}^{L-1} F(x_i; W_{i+1})),     (4)

where E is the loss function of the network. As exhibited in Eq. (4), the gradient can be decomposed into two additive terms: a term that passes information through the weight layers, and a term (∂E/∂x_L) that propagates directly without involving any weight layers. The latter term ensures that information can also be directly propagated back to any shallower residual block l.
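The additive gradient decomposition of Eq. (4) can be verified numerically on a toy scalar chain (tanh stands in for the residual branch; names and values are ours):

```python
import math

def forward(x, ws):
    """Scalar residual chain x <- x + w * tanh(x); returns the final
    activation x_L and the accumulated residual sum."""
    s = 0.0
    for w in ws:
        f = w * math.tanh(x)
        s += f
        x = x + f
    return x, s

ws = [0.5, -0.3, 0.2]
x0, eps = 0.7, 1e-6

# Central-difference gradients of the output and of the residual sum.
g_out = (forward(x0 + eps, ws)[0] - forward(x0 - eps, ws)[0]) / (2 * eps)
g_res = (forward(x0 + eps, ws)[1] - forward(x0 - eps, ws)[1]) / (2 * eps)

# Eq. (4): gradient = direct identity term (1) + term through the weights.
assert abs(g_out - (1.0 + g_res)) < 1e-8
```

However deep the chain, the "1" term survives, which is why low-level gradient signal reaches shallow blocks undiminished.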

In brief, the properties of the forward and backward propagation procedures of the ResFCN make it possible to shuttle the low-level visual information directly across the network, which is quite helpful for our vehicle (small object) instance segmentation tasks.

II-C Semantic Boundary-Aware ResFCN

By exploiting the multi-level contextual features, the ResFCN is capable of producing good likelihood maps of vehicles. It is, however, still difficult to differentiate vehicles with a very close distance by only leveraging the probability of vehicles, due to the ambiguity in “touching” regions. This is rooted in the loss of spatial details caused by max-pooling layers (downsampling) along with feature abstraction. The semantic boundaries of vehicles provide good complementary cues that can be used for separating instances.

Some approaches in computer vision and remote sensing have explored modeling segmentation and boundary prediction jointly in a combinatorial framework. For example, Kirillov et al. [40] propose InstanceCut, which represents instance segmentation by two modalities, namely a semantic segmentation and all instance boundaries. The former is computed from a CNN for semantic segmentation, and the latter is derived from an instance-aware edge detector. However, this approach does not address end-to-end learning. In the remote sensing community, Marmanis et al. [41] propose a two-step model that learns a CNN to separately output edge likelihoods at multiple scales from color-infrared (CIR) and height data. Then, the boundaries detected with each source are added as an extra channel to each source, and a network is trained for semantic segmentation purposes. The intuition behind this work is that using predicted boundaries helps to achieve sharper segmentation maps. In contrast, we train one end-to-end network that takes color images as input and predicts both segmentation maps and object boundaries, in order to improve segmentation performance at the instance level.

To this end, we train a deep semantic boundary-aware ResFCN for effective vehicle instance segmentation (i.e., segmenting the vehicles and splitting clustered instances into individual ones). Fig. 3 shows an overview of the proposed network. Specifically, instead of treating vehicle segmentation as an independent, single task, we formulate a unified multi-task learning network architecture that exploits complementary information (i.e., vehicle regions and semantic boundaries) and simultaneously learns to detect vehicle regions and their corresponding semantic boundaries. As shown in Fig. 3, the feature representations extracted from multiple residual blocks are upsampled in two separate, yet identical branches to predict the semantic segmentation masks of vehicles and semantic boundaries, respectively. In each branch, the mask is estimated by the ResFCN with multi-level contextual features, as illustrated in Section II-B. Since each branch has only two categories (foreground/vehicles vs. background, and semantic boundaries vs. non-boundaries), sigmoid activations and binary cross-entropy losses are used to train the two branches. Formally, network training is formulated as a pixel-level binary classification problem with respect to the ground truth segmentation masks, including vehicle instances and semantic boundaries, as follows:

L = L_region + λ L_boundary,     (5)

where L_region and L_boundary denote the binary cross-entropy losses for estimating vehicle regions and semantic boundaries, respectively, and λ weights the boundary term. We train the network using this joint loss, and the final instance segmentation map is produced by the first branch of the network in the test phase. Vehicle instances are obtained by computing connected regions in the predicted segmentation map: inside a region, pixels belong to the same vehicle, while different regions correspond to different instances. Our motivation is that jointly estimating the segmentation and boundary maps in a multi-task network with such a joint loss offers a better segmentation result at the instance level for aerial images. Note that we do not make use of any post-processing operations, such as fusing the segmentation and boundary maps, as we want to directly evaluate the performance of this network architecture.
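A minimal numpy sketch of such a joint objective (function names are ours; the boundary weight, written lam here, corresponds to the weighting parameter whose sensitivity the paper analyzes in Fig. 5 and sets to 0.1):

```python
import numpy as np

def bce(p, y, eps=1e-7):
    """Mean pixel-wise binary cross-entropy between probabilities p and
    binary targets y; clipping avoids log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def joint_loss(p_region, y_region, p_boundary, y_boundary, lam=0.1):
    """Joint multi-task loss: region BCE plus weighted boundary BCE."""
    return bce(p_region, y_region) + lam * bce(p_boundary, y_boundary)
```

Because both branches share the feature extractor, gradients from the boundary term regularize the same intermediate representations used for region segmentation.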

Note that the multi-task learning network is optimized in an end-to-end fashion. This joint multi-task training procedure has several merits. First, in the application of vehicle instance segmentation, the multi-task learning network architecture is able to provide complementary semantic boundary information, which is helpful in differentiating the clustered vehicles, improving the instance-level segmentation performance. Second, the discriminative capability of the network’s intermediate feature representations can be improved by this architecture because of multiple regularizations on correlated tasks. Therefore, it can increase the robustness of instance segmentation performance.
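The instance extraction step described above (computing connected regions of the predicted binary segmentation map) can be sketched with a plain breadth-first flood fill; the 4-connectivity is our choice for illustration, as the paper does not state which connectivity it uses:

```python
from collections import deque

def label_instances(seg):
    """Label 4-connected foreground regions of a binary segmentation map.

    Returns a map of instance ids (0 = background, 1..K = vehicles) and
    the instance count K.
    """
    h, w = len(seg), len(seg[0])
    labels = [[0] * w for _ in range(h)]
    k = 0
    for y in range(h):
        for x in range(w):
            if seg[y][x] and not labels[y][x]:
                k += 1  # found a new, unlabelled vehicle region
                queue = deque([(y, x)])
                labels[y][x] = k
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx),
                                   (cy, cx - 1), (cy, cx + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and seg[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = k
                            queue.append((ny, nx))
    return labels, k
```

Each returned id corresponds to one predicted vehicle, so K directly yields the vehicle count used in the instance-level evaluation.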

III Experimental Results and Discussion

III-A Datasets

III-A1 ISPRS Potsdam

The ISPRS Potsdam Semantic Labeling dataset [42] is an open benchmark dataset provided online. The dataset consists of 38 ortho-rectified aerial IRRGB images with a 5 cm spatial resolution and corresponding DSMs generated by dense image matching, taken over the city of Potsdam, Germany. Comprehensive manually annotated pixel-wise segmentation masks are provided as ground truth for 24 tiles, which are available for training and validation. The other 14 remain unreleased and are kept by the challenge organizers for testing purposes. We randomly selected 5 tiles (image numbers: 2_12, 5_12, 7_7, 7_8, 7_9) from the 24 training images and used them as the test set in our experiments (cf. Fig. 4). The resolution is downsampled to 15 cm/pixel to match the subsequent video dataset. The input to the networks contains only the red, green, and blue channels, and all results reported on this dataset refer to the aforementioned test set. Table I provides details about this dataset for our experiments.

III-A2 Busy Parking Lot

The task of vehicle instance segmentation currently lacks a compelling and challenging benchmark dataset for producing quantitative measurements and comparing approaches. While the ISPRS Potsdam dataset has clearly boosted research in semantic segmentation of high-resolution aerial imagery, it is not as challenging as certain practical scenes, such as a busy parking lot, where vehicles are often parked so close that it is quite hard to separate them, particularly from an aerial view. To this end, in this work, we propose the new, challenging Busy Parking Lot UAV Video dataset that we built for the vehicle instance segmentation task. The UAV video was acquired by a camera onboard a UAV covering the parking lot of Woburn Mall in Woburn, Massachusetts, USA. The video has a spatial resolution of about 15 cm per pixel at 24 frames per second and a length of 60 seconds. We manually annotated pixel-wise instance segmentation masks for 5 frames (at 1, 15, 30, 45, and 59 seconds); i.e., the annotation is dense in space and sparse in time to allow for the evaluation of methods on this long sequence (cf. Fig. 6). The Busy Parking Lot dataset is challenging because it presents a high range of variations, with a diversity of vehicle colors, effects of shadow, several slightly blurred regions, and vehicles that are parked extremely close together. We train networks on the ISPRS Potsdam dataset and then perform vehicle instance segmentation with the trained networks on this video dataset. Details regarding this dataset are shown in Table II.

Fig 4: Image #5_12 from the ISPRS Potsdam dataset for vehicle instance segmentation as well as three zoomed-in areas.
                 Training Set | Test Set
                              |   2_12     5_12     7_7     7_8     7_9
Vehicle Count           4,433 |    123      427     301     309     305
Number of Pixels    1,184,789 | 36,236  122,332  76,892  77,669  74,404
TABLE I: Vehicle Counts and Number of Vehicle Pixels in ISPRS Potsdam Dataset
Frame@1s Frame@15s Frame@30s Frame@45s Frame@59s
Vehicle Count 511 492 502 484 479
Number of Pixels 257,462 235,560 240,607 235,448 226,697
TABLE II: Vehicle Counts and Number of Vehicle Pixels in Busy Parking Lot UAV Video Dataset

III-B Training Details

The network training is based on the TensorFlow framework. We chose Nesterov Adam [43, 44] as the optimizer to train the network, since, for this task, it shows much faster convergence than standard stochastic gradient descent (SGD) with momentum [45] or Adam [46]. We fixed almost all of the parameters of Nesterov Adam as recommended in [43], with a schedule decay of 0.004, and make use of a fairly small learning rate. All weights in the newly added layers are initialized with a Glorot uniform initializer [47] that draws samples from a uniform distribution.

In our experiments, we note that the pixel-wise F1 score of the network is not very sensitive to the weighting parameter λ of the joint loss, while the instance-level performance is relatively sensitive to it. Based on the sensitivity analysis (cf. Fig. 5), we set λ to 0.1.

Fig 5: A sensitivity analysis for the weighting parameter λ on the ISPRS Potsdam dataset.

The networks are trained on the training set of the ISPRS Potsdam dataset to predict instance segmentation maps. The training set has only 931 unique patches. We make use of data augmentation to increase the number of training samples: the RGB patches and the corresponding pixel-wise ground truth are transformed by horizontally and vertically flipping three-quarters of the patches. By doing so, the number of training samples increases to 14,896. To monitor overfitting during training, we randomly select 10% of the training samples as the validation set, i.e., splitting the training set into 13,406 training and 1,490 validation pairs. We train the network for 50 epochs and make use of early stopping to avoid overfitting. Moreover, we use fairly small mini-batches of 8 image pairs because, in a sense, every pixel is a training sample. We train our network on a single NVIDIA GeForce GTX TITAN with 12 GB of GPU memory, which takes about two hours.
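The flip-based augmentation can be sketched as follows (a simplified version that always emits all four flip variants of a pair, whereas the paper flips three-quarters of the patches; names are ours):

```python
import numpy as np

def augment(image, mask):
    """Generate flip variants of an (image, mask) training pair:
    original, vertical flip, horizontal flip, and both combined.

    The same transform is applied to image and ground truth so that
    pixel-wise labels stay aligned.
    """
    pairs = []
    for flip_h in (False, True):
        for flip_v in (False, True):
            img, m = image, mask
            if flip_h:
                img, m = img[:, ::-1], m[:, ::-1]  # mirror left-right
            if flip_v:
                img, m = img[::-1, :], m[::-1, :]  # mirror top-bottom
            pairs.append((img, m))
    return pairs
```

Flips are a safe choice for nadir aerial imagery because vehicle appearance has no preferred up/down or left/right orientation from above.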

Fig 6: Frame@1s from the proposed Busy Parking Lot UAV Video dataset for vehicle instance segmentation. Four zoomed-in areas are shown on the bottom.
Fig 7: Instance segmentation results of ISPRS Potsdam dataset (from left to right): ground truth, VGG-FCN, Inception-FCN, Xception-FCN, ResFCN, and B-ResFCN (different colors denote individual vehicle objects). The three areas are derived from Fig. 4.
Model OA OA (eroded) F1 score F1 score (eroded)
ResFCN 99.79 99.89 93.43 95.66
B-ResFCN 99.79 99.89 93.44 95.87
TABLE III: Pixel-level OAs and F1-scores for the car class on ISPRS Potsdam dataset

III-C Qualitative Evaluation

Some vehicle instance segmentation results are shown in Fig. 7 (test set of the ISPRS Potsdam dataset) and Fig. 9 (Busy Parking Lot dataset) in order to qualitatively illustrate the efficacy of our model. First, we compare various CNN variants used in the FCN architecture to determine which one is best suited for our task. In Fig. 7, we qualitatively investigate the accuracy of the predicted instance segmentation maps on the ISPRS Potsdam dataset, using the FCN architecture with leading CNN variants, namely VGG[33]-FCN, Inception[48]-FCN, Xception[49]-FCN, and ResFCN. We implement VGG-FCN, Inception-FCN, and Xception-FCN by fusing the output feature maps of the last three convolutional blocks, as we do for ResFCN (cf. Section II-B). From the segmentation results, we can see an improvement in quality from VGG-FCN to ResFCN. Moreover, on the Busy Parking Lot dataset, ResFCN also demonstrates a fairly strong ability to generalize to an “unseen” scene outside the training dataset (see Fig. 9). However, in the segmentation results produced by all the aforementioned networks, some vehicles cannot be separated due to the extremely small distances between them. The situation deteriorates further when the imagery suffers from shadow effects, as in the cases shown in the zoomed-in areas of Fig. 9. On the other hand, to identify the role of the semantic boundary component of the proposed unified multi-task learning network architecture, we also performed an ablation study, comparing against networks that rely only on the vehicle prediction branch. In comparison with ResFCN, the semantic boundary-aware ResFCN (B-ResFCN) is able to clearly separate those “touching” cars, which qualitatively highlights the superiority of the semantic boundary-aware network in exploiting complementary information under a unified multi-task learning architecture. Fig. 8 shows a couple of example segmentations using the proposed B-ResFCN on several frames of the Busy Parking Lot dataset.
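As a rough sketch of how such multi-level fusion operates, the per-pixel score maps from the last three blocks can be brought to a common resolution and summed. The nearest-neighbour upsampling and the map sizes below are simplifying assumptions for illustration, not the learned upsampling actually used in the networks:

```python
import numpy as np

def upsample(x, factor):
    # nearest-neighbour upsampling, standing in for learned deconvolution
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# hypothetical per-pixel vehicle score maps from the last three
# convolutional blocks (coarser maps come from deeper blocks)
rng = np.random.default_rng(0)
s1 = rng.random((64, 64))   # finest block
s2 = rng.random((32, 32))   # one stride level deeper
s3 = rng.random((16, 16))   # deepest block

# fuse multi-level context by summing at the finest resolution
fused = s1 + upsample(s2, 2) + upsample(s3, 4)  # shape (64, 64)
```

The sum combines fine spatial detail from shallow blocks with the larger receptive fields of deep blocks, which is the intuition behind fusing the three outputs.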

Model | 2_12 | 5_12 | 7_7 | 7_8 | 7_9
VGG-FCN | 66.04/70.00/62.50 | 57.00/61.45/53.14 | 59.21/61.95/56.70 | 57.21/66.84/50.00 | 61.31/65.91/57.31
B-VGG-FCN | 70.27/68.42/72.22 | 69.85/67.42/72.47 | 71.03/68.47/73.79 | 67.96/66.86/69.09 | 66.47/60.96/73.08
Inception-FCN | 51.91/55.45/48.80 | 31.65/37.42/27.42 | 40.00/43.41/37.08 | 27.79/31.70/24.74 | 40.87/45.02/37.42
B-Inception-FCN | 55.15/50.61/60.58 | 46.14/47.42/44.92 | 53.81/52.91/54.75 | 43.47/42.45/44.54 | 50.74/47.49/54.47
Xception-FCN | 96.92/98.21/95.65 | 83.55/81.11/86.14 | 93.33/94.59/92.11 | 92.05/93.10/91.01 | 93.92/96.59/91.40
B-Xception-FCN | 97.00/100/94.17 | 88.40/88.60/88.19 | 93.65/96.47/91.00 | 93.58/97.54/89.94 | 94.63/97.50/91.92
ResFCN | 97.93/100/95.93 | 83.88/80.84/87.15 | 94.72/96.86/92.67 | 95.62/97.93/93.42 | 95.25/96.23/94.30
B-ResFCN | 98.31/100/96.67 | 88.57/87.08/90.11 | 96.43/97.12/95.74 | 95.19/97.88/92.64 | 95.76/97.83/93.77

TABLE IV: Detection Results of Different Networks on ISPRS Potsdam Semantic Labeling Dataset (each cell: Instance-level F1 Score / Precision / Recall)

Model | Frame@1s | Frame@15s | Frame@30s | Frame@45s | Frame@59s
Inception-FCN | 15.48/60.00/8.89 | 15.67/51.09/9.25 | 13.92/43.43/8.29 | 11.56/41.98/6.71 | 7.75/39.29/4.30
B-Inception-FCN | 17.74/62.50/10.34 | 19.84/58.72/11.94 | 18.71/51.69/11.42 | 17.84/55.34/10.63 | 10.63/51.67/5.93
Xception-FCN | 87.25/86.82/87.69 | 87.27/85.28/89.36 | 86.58/84.14/89.16 | 87.10/84.82/89.50 | 75.65/74.12/77.25
B-Xception-FCN | 91.43/89.72/93.20 | 90.15/86.80/93.78 | 90.12/87.69/92.70 | 90.35/87.64/93.22 | 88.30/84.24/92.77
ResFCN | 88.73/89.71/87.77 | 89.43/89.76/89.10 | 90.43/91.38/89.50 | 88.81/88.69/88.92 | 87.10/90.23/84.17
B-ResFCN | 93.29/95.16/91.50 | 92.55/91.52/93.61 | 93.62/94.02/93.22 | 93.06/94.33/91.83 | 94.54/95.28/93.81

TABLE V: Detection Results of Different Methods on the Proposed Busy Parking Lot UAV Video Dataset (each cell: Instance-level F1 Score / Precision / Recall)

III-D Quantitative Evaluation

To verify the effectiveness of the networks used, we report in Table III the pixel-level overall accuracies (OAs) and F1 scores for the car class on our test set of the ISPRS Potsdam dataset and compare them with state-of-the-art methods. These metrics are calculated on the full reference and on an alternative ground truth obtained by eroding the object boundaries with a circular disk of 3 pixel radius. The current state of the art, CASIA2 (in the leaderboard), obtains an F1 score of 96.2% for vehicle segmentation on the held-out test set (which is different from the validation set we use) using IRRG data. Our B-ResFCN is competitive, with an F1 score of 95.87% obtained using RGB information only on our own test set. This indicates that the trained network can be regarded as a good, competitive model for the follow-up experiments. Note that the pixel-wise OA and F1 score can only evaluate segmentation performance at the pixel level rather than the instance level; therefore, they are not actually suitable for our task.
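The eroded-reference protocol can be reproduced with a small numpy sketch. The naive erosion below stands in for a morphological library call, and the toy mask is hypothetical; the 3-pixel disk radius follows the text:

```python
import numpy as np

def disk(radius):
    # binary disk structuring element of the given pixel radius
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return (x * x + y * y) <= radius * radius

def erode(mask, selem):
    # naive binary erosion: a pixel survives only if the structuring
    # element, centred on it, fits entirely inside the mask
    r = selem.shape[0] // 2
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            window = padded[i:i + 2 * r + 1, j:j + 2 * r + 1]
            out[i, j] = np.all(window[selem])
    return out

# toy ground-truth mask: a 10x10 square object in a 16x16 tile
gt = np.zeros((16, 16), dtype=bool)
gt[3:13, 3:13] = True
gt_eroded = erode(gt, disk(3))  # boundary pixels excluded from scoring
```

Pixel metrics computed against `gt_eroded` ignore the uncertain object boundaries, which is why the alternative ground truth yields more forgiving scores.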

To quantitatively evaluate the performance of different approaches for vehicle segmentation at instance level, the evaluation criteria we use are instance-level F1 score, precision, recall, and Dice similarity coefficient. The first three criteria consider the performance of vehicle detection, and the last validates the performance of instance-level segmentation.

III-D1 Detection

For the vehicle detection evaluation, we employ the instance-level F1 score (note that the instance-level F1 score is different from the pixel-wise F1 score used by the ISPRS semantic labeling evaluation), which is the harmonic mean of the instance-level precision P and recall R, defined as:

F1 = 2PR / (P + R), with P = N_TP / (N_TP + N_FP) and R = N_TP / (N_TP + N_FN),

where N_TP, N_FP, and N_FN are the number of true positives, false positives, and false negatives, respectively. Here, the ground truth for each segmented vehicle is the object in the manually labeled segmentation mask that has maximum overlap with the segmented vehicle. When calculating N_TP and N_FP, a segmented vehicle that intersects with at least 50% of its ground truth is considered a true positive; otherwise it is regarded as a false positive. For N_FN, a false negative indicates a ground truth object that has less than 50% of its area overlapped by its corresponding segmented vehicle or has no corresponding segmented vehicle.
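The matching rule above can be sketched in numpy. The list-of-boolean-masks input format and the toy example are hypothetical, while the maximum-overlap matching and the 50% criterion follow the text:

```python
import numpy as np

def instance_prf(segmented, ground_truth):
    """Instance-level F1/precision/recall under the 50%-overlap rule.

    Both arguments are lists of boolean instance masks, one mask per
    vehicle (a hypothetical input format for illustration).
    """
    tp = 0
    matched = set()
    for s in segmented:
        # the ground truth of a segmented vehicle is the labeled object
        # with maximum pixel overlap
        overlaps = [np.logical_and(s, g).sum() for g in ground_truth]
        if not ground_truth or max(overlaps) == 0:
            continue  # no corresponding ground truth -> false positive
        k = int(np.argmax(overlaps))
        if overlaps[k] >= 0.5 * ground_truth[k].sum():
            tp += 1  # covers at least 50% of its ground truth object
            matched.add(k)
    fp = len(segmented) - tp
    fn = len(ground_truth) - len(matched)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return f1, p, r

# toy example: two ground-truth vehicles, one good and one poor prediction
g1 = np.zeros((8, 8), bool); g1[0:4, 0:4] = True
g2 = np.zeros((8, 8), bool); g2[4:8, 4:8] = True
p1 = g1.copy()                                     # full overlap -> TP
p2 = np.zeros((8, 8), bool); p2[4:6, 4:5] = True   # 2/16 of g2 -> FP
f1, p, r = instance_prf([p1, p2], [g1, g2])
```

In this toy case one true positive, one false positive, and one false negative give P = R = F1 = 0.5.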

The detection results of the different networks on the ISPRS Potsdam dataset and the Busy Parking Lot scene are shown in Table IV and Table V, respectively. Among the networks without the semantic boundary component, ResFCN surpasses all other models (VGG-FCN, Inception-FCN, and Xception-FCN), highlighting the strength of the residual learning-based FCN architecture with multi-level contextual feature representations in our task. The network with the semantic boundary component, i.e., B-ResFCN, achieves the best results on most test images of the ISPRS Potsdam scene and surpasses the others by a significant margin on the Busy Parking Lot dataset, demonstrating the effectiveness of the semantic boundary-aware multi-task learning network for this instance segmentation problem. From Table IV and Table V, we observe that all networks yield considerably lower instance-level F1 scores, precision, and recall on the Busy Parking Lot dataset than on the ISPRS Potsdam dataset. This is mainly due to the different difficulty levels of the two datasets: high-density parking, strong lighting conditions, pronounced shadow effects, and slightly blurry image quality cause the networks to perform worse on the proposed dataset than on the Potsdam scene.

Model | Frame@1s | Frame@15s | Frame@30s | Frame@45s | Frame@59s
Inception-FCN | 26.81 | 26.06 | 25.68 | 22.89 | 23.77
B-Inception-FCN | 32.37 | 33.07 | 33.34 | 30.44 | 31.26
Xception-FCN | 72.74 | 72.74 | 72.85 | 72.47 | 71.31
B-Xception-FCN | 77.31 | 77.50 | 77.22 | 77.13 | 76.32
ResFCN | 71.17 | 71.47 | 71.76 | 68.82 | 72.73
B-ResFCN | 78.84 | 77.33 | 79.13 | 77.83 | 79.39

TABLE VI: Segmentation Results of Different Methods on Busy Parking Lot UAV Video Dataset (Instance-level Dice Similarity Coefficient)
Model | 2_12 | 5_12 | 7_7 | 7_8 | 7_9
VGG-FCN | 58.88 | 45.79 | 53.13 | 51.09 | 54.25
B-VGG-FCN | 71.48 | 64.48 | 74.54 | 70.43 | 69.47
Inception-FCN | 52.79 | 34.37 | 37.15 | 35.08 | 44.22
B-Inception-FCN | 55.26 | 35.69 | 46.76 | 37.33 | 47.14
Xception-FCN | 90.05 | 73.05 | 84.84 | 84.58 | 86.54
B-Xception-FCN | 91.44 | 75.47 | 85.12 | 88.64 | 87.95
ResFCN | 91.97 | 77.68 | 89.10 | 89.78 | 89.65
B-ResFCN | 93.80 | 77.72 | 90.61 | 91.19 | 90.66

TABLE VII: Segmentation Results of Different Methods on ISPRS Potsdam Semantic Labeling Dataset (Instance-level Dice Similarity Coefficient)
Fig. 8: Example segmentations using the proposed B-ResFCN on several frames of the Busy Parking Lot dataset.
Fig. 9: Instance segmentation maps of the Busy Parking Lot dataset (from left to right): ground truth, Inception-FCN, Xception-FCN, ResFCN, and B-ResFCN (different colors denote individual vehicles). The four areas are derived from Fig. 6.

III-D2 Segmentation

The Dice similarity coefficient is often used to evaluate segmentation performance. Given a set of pixels S denoted as a segmented vehicle and a set of pixels G annotated as a ground truth object, the Dice similarity coefficient is defined as:

D(G, S) = 2|G ∩ S| / (|G| + |S|).

This, however, is not suitable for segmentation evaluation on individual objects (i.e., instance segmentation). Instead, in this paper, an instance-level Dice similarity coefficient is defined and employed as:

D_instance = (1/2) [ Σ_{i=1}^{n_S} ω_i D(G_i, S_i) + Σ_{j=1}^{n_G} ω̃_j D(G̃_j, S̃_j) ],

where S_i, G_i, G̃_j, and S̃_j are the i-th segmented vehicle, the ground truth object that maximally overlaps S_i, the j-th ground truth object, and the segmented vehicle that maximally overlaps G̃_j, respectively. n_S and n_G respectively denote the total numbers of segmented vehicles and ground truth objects. Furthermore, ω_i and ω̃_j are weighting coefficients and can be calculated as:

ω_i = |S_i| / Σ_{m=1}^{n_S} |S_m|,   ω̃_j = |G̃_j| / Σ_{n=1}^{n_G} |G̃_n|.

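As a cross-check of the definitions above, the instance-level Dice can be computed with a short numpy sketch. The list-of-masks input format and the toy masks are hypothetical; matching is by maximum pixel overlap, as in the text:

```python
import numpy as np

def dice(g, s):
    # standard Dice coefficient: 2|G ∩ S| / (|G| + |S|)
    return 2.0 * np.logical_and(g, s).sum() / (g.sum() + s.sum())

def instance_dice(segmented, ground_truth):
    """Instance-level Dice: area-weighted Dice over segmented objects,
    averaged with the symmetric term over ground-truth objects."""
    total_s = sum(s.sum() for s in segmented)
    total_g = sum(g.sum() for g in ground_truth)
    term_s = sum(
        (s.sum() / total_s)
        * dice(max(ground_truth, key=lambda g: np.logical_and(s, g).sum()), s)
        for s in segmented
    )
    term_g = sum(
        (g.sum() / total_g)
        * dice(g, max(segmented, key=lambda s: np.logical_and(g, s).sum()))
        for g in ground_truth
    )
    return 0.5 * (term_s + term_g)

# sanity check: a perfect two-vehicle segmentation scores 1.0
a = np.zeros((8, 8), bool); a[0:4, 0:4] = True
b = np.zeros((8, 8), bool); b[4:8, 4:8] = True
score = instance_dice([a, b], [a.copy(), b.copy()])
```

Weighting by object area means large vehicles influence the score more than small ones, on both the segmented and the ground-truth side.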
Table VII and Table VI show the segmentation results of the different approaches on the Potsdam scene and the Busy Parking Lot dataset, respectively. Our B-ResFCN achieves the best performance on both datasets. Compared to ResFCN, there is a 1.16% increase in the instance-level Dice similarity coefficient on the Potsdam dataset and a 7.31% improvement on the Busy Parking Lot scene. The numbers in these two tables also show that the networks perform worse on the Busy Parking Lot dataset than on the Potsdam scene, which is in line with our intention of proposing a more challenging benchmark dataset for the vehicle instance segmentation problem. In addition, it is worth noting that essentially all networks with a boundary component offer better instance segmentations than those without it, meaning that multi-task learning is useful across different CNN variants in our task.

IV Conclusion

In this paper, we proposed a semantic boundary-aware unified multi-task learning residual fully convolutional network to handle a novel problem, namely vehicle instance segmentation. In particular, the proposed network harnesses multi-level contextual features learned from different residual blocks of a residual network architecture to produce better pixel-wise likelihood maps, and we theoretically analyze the reason behind this. Furthermore, our network creates two separate yet identical branches to simultaneously predict the semantic segmentation masks of vehicles and their semantic boundaries. The joint learning of these two problems is beneficial for separating “touching” vehicles, which are often not correctly differentiated into instances. The network is validated on a large high-resolution aerial image dataset, the ISPRS Potsdam Semantic Labeling dataset, and on the proposed Busy Parking Lot UAV Video dataset. To quantitatively evaluate the performance of different approaches to vehicle instance segmentation, we advocate using the instance-level F1 score, precision, recall, and Dice similarity coefficient as evaluation criteria, instead of the traditional pixel-wise overall accuracy (OA) and F1 score used for semantic segmentation. Both visual and quantitative analyses of the experimental results demonstrate the effectiveness of our approach.


Acknowledgment

The authors would like to thank the ISPRS for making the Potsdam dataset available.


  • [1] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, 2017.
  • [2] L. Mou, X. Zhu, M. Vakalopoulou, K. Karantzalos, N. Paragios, B. Le Saux, G. Moser, and D. Tuia, “Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS Data Fusion Contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3435–3447, 2017.
  • [3] N. Audebert, B. Le Saux, and S. Lefèvre, “Fusion of heterogeneous data in convolutional networks for urban semantic labeling,” in Joint Urban Remote Sensing Event (JURSE), 2017.
  • [4] L. Mou and X. X. Zhu, “RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images,” arXiv:1805.02091, 2018.
  • [5] M. Vakalopoulou, K. Karantzalos, N. Komodakis, and N. Paragios, “Graph-based registration, change detection, and classification in very high resolution multitemporal remote sensing data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 7, pp. 2940–2951, 2016.
  • [6] D. Wen, X. Huang, L. Zhang, and J. A. Benediktsson, “A novel automatic change detection method for urban high-resolution remotely sensed imagery based on multiindex scene representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 1, pp. 609–625, 2016.
  • [7] C. Wu, B. Du, X. Cui, and L. Zhang, “A post-classification change detection method based on iterative slow feature analysis and bayesian soft fusion,” Remote Sensing of Environment, vol. 199, pp. 241–255, 2017.
  • [8] H. Lyu, H. Lu, and L. Mou, “Learning a transferable change rule from a recurrent neural network for land cover change detection,” Remote Sensing, vol. 8, no. 6, p. 506, 2016.
  • [9] L. Mou and X. X. Zhu, “Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
  • [10] G. Kopsiaftis and K. Karantzalos, “Vehicle detection and traffic density monitoring from very high resolution satellite video data,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2015.
  • [11] W. Shao, W. Yang, G. Liu, and J. Liu, “Car detection from high-resolution aerial imagery using multiple features,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2012.
  • [12] T. Moranduzzo and F. Melgani, “Automatic car counting method for unmanned aerial vehicle images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 3, pp. 1635–1647, 2014.
  • [13] ——, “Detecting cars in UAV images with a catalog-based approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014.
  • [14] K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 9, pp. 1938–1942, 2015.
  • [15] X. X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
  • [16] L. Mou and X. X. Zhu, “IM2HEIGHT: Height estimation from single monocular imagery via fully residual convolutional-deconvolutional network,” arXiv:1802.10249, 2018.
  • [17] L. Mou, P. Ghamisi, and X. X. Zhu, “Unsupervised spectral–spatial feature learning via deep residual conv–deconv network for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 1, pp. 391–406, 2018.
  • [18] L. Mou, L. Bruzzone, and X. X. Zhu, “Learning spectral-spatial-temporal features via a recurrent convolutional neural network for change detection in multispectral imagery,” arXiv:1803.02642, 2018.
  • [19] L. Mou, P. Ghamisi, and X. Zhu, “Deep recurrent neural networks for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3639–3655, 2017.
  • [20] X. Chen, S. Xiang, C.-L. Liu, and C.-H. Pan, “Vehicle detection in satellite images by hybrid deep convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1797–1801, 2014.
  • [21] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, “Deep learning approach for car detection in UAV imagery,” Remote Sensing, vol. 9, no. 4, p. 312, 2017.
  • [22] N. Audebert, B. Le Saux, and S. Lefèvre, “Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images,” Remote Sensing, vol. 9, no. 4, p. 368, 2017.
  • [23] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep convolutional encoder-decoder architecture for image segmentation,” arXiv:1511.00561, 2015.
  • [24] M. Kampffmeyer, A.-B. Salberg, and R. Jenssen, “Detection of small objects, land cover mapping and modelling of uncertainty in urban remote sensing images using deep convolutional neural networks,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) Workshop, 2016.
  • [25] D. Eigen and R. Fergus, “Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [26] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [27] Z. Wu, C. Shen, and A. van den Hengel, “High-performance semantic segmentation using very deep fully convolutional networks,” arXiv:1604.04339, 2016.
  • [28] I. Laina, C. Rupprecht, V. Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” in International Conference on 3D Vision (3DV), 2016.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [30] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [31] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” arXiv:1606.00915, 2016.
  • [32] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in IEEE International Conference on Computer Vision (ICCV), 2015.
  • [33] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in IEEE International Conference on Learning Representation (ICLR), 2015.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun, “Identity mappings in deep residual networks,” in European Conference on Computer Vision (ECCV), 2016.
  • [35] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large kernel matters – improve semantic segmentation by global convolutional network,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [36] G. Ghiasi and C. C. Fowlkes, “Laplacian pyramid reconstruction and refinement for semantic segmentation,” in European Conference on Computer Vision (ECCV), 2016.
  • [37] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems (NIPS), 2012.
  • [38] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision (ECCV), 2014.
  • [39] A. Mahendran and A. Vedaldi, “Understanding deep image representations by inverting them,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [40] A. Kirillov, E. Levinkov, B. Andres, B. Savchynskyy, and C. Rother, “Instancecut: from edges to instances with multicut,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [41] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla, “Classification with an edge: Improving semantic image segmentation with boundary detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 135, pp. 158–172, 2018.
  • [42] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, “The ISPRS benchmark on urban object classification and 3D building reconstruction,” ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 3, pp. 293–298, 2012.
  • [43] T. Dozat, “Incorporating Nesterov momentum into Adam,” online.
  • [44] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” in International Conference on Machine Learning (ICML), 2013.
  • [45] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
  • [46] D. P. Kingma and J. L. Ba, “Adam: A method for stochastic optimization,” in IEEE International Conference on Learning Representations (ICLR), 2015.
  • [47] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
  • [48] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [49] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2017.