Dynamic Multi-Scale Semantic Segmentation based on Dilated Convolutional Networks

04/11/2018 ∙ by Keiller Nogueira, et al. ∙ Grenoble Institute of Technology

Semantic segmentation requires methods capable of learning high-level features while dealing with large volumes of data. Towards this goal, Convolutional Networks can learn specific and adaptable features based on the data. However, these networks are not capable of processing a whole remote sensing image, given its huge size. To overcome this limitation, the image is processed using fixed-size patches. The input patch size is usually either defined empirically (by evaluating several sizes) or imposed (by network constraints). Both strategies suffer from drawbacks and may not lead to the best patch size. To alleviate this problem, several works exploited multi-scale information by combining networks or layers. This process increases the number of parameters, resulting in a model that is more difficult to train. In this work, we propose a novel technique to perform semantic segmentation of remote sensing images that exploits a multi-scale paradigm without increasing the number of parameters while defining, at training time, the best patch size. The main idea is to train a dilated network with distinct patch sizes, allowing it to capture multi-scale characteristics from heterogeneous contexts. While processing these varying patches, the network provides a score for each patch size, helping in the definition of the best size for the current scenario. A systematic evaluation of the proposed algorithm is conducted using four high-resolution remote sensing datasets with very distinct properties. Our results show that the proposed algorithm provides improvements in pixelwise classification accuracy when compared to state-of-the-art methods.




I Introduction

The increased accessibility to high spatial resolution data provided by new sensor technologies has opened new horizons to the remote sensing community, allowing a better understanding of the Earth’s surface. Towards such understanding, one of the most important tasks is semantic labeling (or segmentation), which may be stated as the task of assigning a semantic category to every pixel in an image. Semantic segmentation allows the creation of thematic maps that help in the comprehension of a scene. In fact, semantic labeling has been an essential task for the remote sensing community [1] given that its outcome, the thematic map, generates useful information capable of assisting decision making in a wide range of fields, including environmental monitoring, agriculture [2], disaster relief [3], and urban planning [4].

Given the importance of such task, several methods [5, 6] have been proposed for the semantic segmentation of remote sensing images. The current state-of-the-art method for semantic segmentation is based on a resurgent approach, called deep learning [7], that can learn specific and adaptable spatial features directly from the images. Specifically, deep learning aims at designing end-to-end trainable neural networks, i.e., systems that map raw input into an output space depending on the task. These systems are capable of learning features and classifiers (in distinct layers) and adjust the parameters, at running time, based on accuracy, giving more importance to one layer than another depending on the problem. This end-to-end feature learning (e.g., from image pixels to semantic labels) is the great advantage of deep learning when compared to previous state-of-the-art methods [8], such as low-level [9] and mid-level (e.g. Bag of Visual Words [10]) descriptors.

Among all networks, a specific type, called Convolutional (Neural) Networks, ConvNets or CNNs [7], is the most traditional one for learning visual features in computer vision applications, as well as remote sensing. This type of network relies on the natural stationary property of an image, i.e., the information learned in one part of the image can be used to describe any other region of the image. Furthermore, ConvNets usually obtain different levels of abstraction for the data, ranging from local low-level information in the initial layers (e.g., corners and edges), to more semantic descriptors, mid-level information (e.g., object parts) in intermediate layers, and high-level information (e.g., whole objects) in the final layers.

Although originally proposed for image classification, these ConvNets were adapted to the semantic labeling task by outputting a dense prediction, i.e., by producing another image (usually with the same resolution as the input) in which each pixel is associated with a semantic class. Based on this idea, several networks [11, 12, 13] achieved state-of-the-art results for the labeling task in the computer vision domain. Because of their success, these approaches were naturally introduced into the remote sensing scenario. Although somewhat successful in this domain, these approaches could be improved if some differences for aerial images were taken into account. Specifically, the main difference concerns the definition of spatial context. In classical computer vision applications, the spatial context is restricted by the scene. In the case of remote sensing images, the context is typically delimited by an input patch (mainly because of memory constraints, given the huge size of remote sensing images). Therefore, the definition of the best input patch size is of vital importance for the network, given that small patches may not bring enough information to allow the network to capture the patterns, while larger patches could lead to semantically mixed information, which could affect the performance of the ConvNet. In the literature, this patch size is usually defined using one of two approaches: (i) empirically [14, 4], by evaluating several sizes and selecting the best one, which is a very expensive process, given that, for each size, a new network must be trained (without any guarantee of finding the best patch configuration), and (ii) imposed [15, 16], in which the patch size is defined by network constraints (i.e., changing the patch size implies modifying the architecture). The latter could be a serious limitation, given that the patch size required by the network may not be even close to the optimal one. Hence, both current strategies suffer from drawbacks and may not lead to the best patch size.

An attempt to alleviate such dependence on the patch size is to aggregate multi-scale information. (The definition of multi-scale information is very broad and abstract; in this work, multi-scale refers to differences in spatial context, i.e., any method that aggregates images with distinct contexts is aggregating multi-scale information.) The multi-scale paradigm has been proven to be essential for segmentation methods [17, 14], given that it allows the model to extract and capture patterns of varying granularity, helping the method to aggregate more useful information. Therefore, several works [18, 15, 19, 16, 20, 21] incorporate the benefits of the multi-scale paradigm in their architectures using different approaches. Some of them [18, 20, 15] train several distinct layers or networks, one for each scale, and combine them for the final prediction. Others [16, 19, 21] extract and merge features from distinct layers in order to aggregate multi-scale information. Independently of the approach, aggregating multi-scale information in these ways adds parameters to the final model, resulting in a more complex learning process [7].

In this work, we propose a novel technique to perform semantic segmentation of remote sensing images that exploits the multi-scale paradigm without increasing the number of parameters while adaptively defining the optimal patch size for the inference stage. Specifically, this technique is based upon an architecture composed exclusively of dilated convolutions [22], which are capable of processing input patches of varying sizes without distinction, given that they learn the patterns without downsampling the input. In fact, the multi-scale information is aggregated into the model by allowing it to be trained using patches of varying sizes (and contexts), a process that does not require any combination of distinct networks or layers (a common process in deep learning-based multi-scale approaches), resulting in a method with fewer parameters that is easier to train. Furthermore, during the training stage, the network gives a score (based on accuracy or loss) to each patch size. Then, in the prediction phase, the process selects the patch size with the best score to perform the segmentation. Therefore, unlike empirically selecting the best patch size, which requires a new network to be trained for each evaluated patch size (increasing the computational complexity and training time), the proposed technique evaluates several patch sizes during the training stage and selects the best one for the inference phase with a single training procedure. Aside from the aforementioned advantages, the proposed networks can be fine-tuned for any semantic segmentation application, since they do not depend on the patch size to process the data. This allows other applications to benefit from the patterns extracted by our models, a very interesting feature, especially when working with small amounts of labeled data [23].

In practice, these are the contributions of this work:

  • Our main contribution is a novel approach that performs remote sensing semantic segmentation by doing a unique training procedure that aggregates multi-scale information while determining the best input patch size for the inference stage,

  • Network architectures capable of performing semantic segmentation of remote sensing datasets with distinct properties, and that can be trained or fine-tuned for any semantic segmentation application.

The paper is structured as follows. Related works are presented in Section II while the concept of dilated convolutions is introduced in Section III. We explain the proposed technique in Section IV. Section V presents the experimental protocol and Section VI reports and discusses the obtained results. Finally, in Section VII we conclude the paper and point out promising directions for future work.

II Related Work

As introduced, deep learning has made its way into the remote sensing community, mainly due to its success in several computer vision tasks. Towards a better understanding of the Earth’s surface, a myriad of techniques [18, 15, 19, 14, 16, 20, 21] have been proposed to perform semantic segmentation in remote sensing images. Based on previous successful models [24, 11], several of the proposed works exploit the benefits of the multi-scale paradigm.

In [15], the authors fine-tuned a deconvolutional network (based on SegNet [12]) using fixed size patches. To incorporate multi-scale knowledge into the learning process, they proposed a multi-kernel technique at the last convolutional layer. Specifically, the last layer is decomposed into three branches. Each branch processes the same feature maps but using distinct filter sizes generating different outputs which are combined into the final dense prediction. They argue that these different scales smooth the final predictions due to the combination of distinct fields of view and spatial context.

Sherrah [14] proposed methods based on fully convolutional networks [11]. The first architecture was purely based on the fully convolutional paradigm, i.e., the network has several downsampling layers (generating a coarse map) and a final bilinear interpolation layer, which is responsible for restoring the coarse map into a dense prediction. In the second strategy, the previous network was adapted by replacing the downsampling layers with dilated convolutions, allowing the network to maintain the full resolution of the image. Finally, the last strategy evaluated by the authors was to fine-tune [23] pre-trained networks over the remote sensing datasets. None of the aforementioned strategies exploit the benefits of the multi-scale paradigm. Furthermore, these techniques were evaluated using several input patch sizes, with the final architectures processing patches of one of two sizes depending on the dataset.

Marcu et al. [18] combined the outputs of a dual-stream network in order to aggregate multi-scale information for semantic segmentation. Specifically, each network processes the image using patches of a distinct size, i.e., one network processes larger patches (in which the global context is considered) while the other processes smaller patches (where local context is taken into account). The outputs of these architectures are combined in a later stage using another network. Although they can train the network jointly, in an end-to-end process, the number of parameters is so large that only small batch sizes can be used (10 patches per batch). In [20], the authors proposed a multi-scale semantic segmentation method combining ConvNets, hand-crafted descriptors, and Conditional Random Fields [25]. Specifically, they trained three ConvNets, each with a different patch size. Features extracted from these networks are combined with hand-crafted ones and classified using a random forest classifier. Finally, Conditional Random Fields [25] are used as a post-processing method in an attempt to improve the final results.

In [16], the authors proposed a multi-scale method that combines boundary detection with deconvolution networks (specifically, based on SegNet [12]). The main contribution of this work is the Class-Boundary (CB) network, which is responsible for helping the proposed methods give more attention to the boundaries. Based on this CB network, they proposed several methods. The first uses three networks that receive as input the same image but with different resolutions (as well as the output of the corresponding CB network) and output the label predictions, which are aggregated, in a subsequent fusion stage, generating the final label. They also experimented with fully convolutional architectures [11] (with several skip layers in order to aggregate multi-scale information) and with an ensemble of several architectures. All aforementioned networks initially receive fixed-size patches. Maggiori et al. [19] proposed a multi-scale method that performs labeling segmentation based on upsampled and concatenated features extracted from distinct layers of a fully convolutional network [11]. Specifically, the network, which receives as input patches of one of two sizes (depending on the dataset), is composed of several convolutional and pooling layers, which downsample the input image. Downsampled feature maps extracted from several layers are then upsampled, concatenated, and finally classified by another convolutional layer. This strategy somewhat resembles DenseNets [26], with the final layer having connections to previous layers. Wang et al. [21] proposed to extract features from distinct layers of the network to capture multi-scale low- and high-level information. They fine-tuned a ResNet-101 [27] to extract salient information from patches. Feature maps are then extracted from intermediate layers, combined with entropy maps, and upsampled to generate the final dense prediction.

In this work, we perform semantic segmentation by exploiting a multi-scale network composed uniquely of dilated convolutions. The main differences that may be pointed out with respect to the aforementioned works are: (i) the proposed technique exploits a fully convolutional network that does not downsample the input data (a common process performed in most works [11, 22, 16]), (ii) the multi-scale strategy is exploited during the training process without any modification of the network or combination of several architectures, and (iii) instead of evaluating possible patch sizes (to find the best one) or using a patch size determined by network constraints (which may not be the best one), the proposed algorithm determines the best patch size adaptively at training time.

III Dilated ConvNets

Dilated convolutions were originally proposed for the computation of the wavelet transform [28] and employed in the deep learning context (as an alternative to deconvolution layers) mainly for semantic segmentation [22, 29, 14]. In dilated convolutional layers, filter weights are employed differently when compared to standard convolutions. Specifically, filters of this layer may have gaps (or “holes”) between their parameters. These gaps, inserted according to the dilation rate $r$, enlarge the convolutional kernel but preserve the number of trainable parameters, since the inserted holes are not considered in the convolution process. Therefore, the final alignment of the kernel weights is defined by this dilation rate $r$.

Formally, for each location $i$, the output $y[i]$ of a one-dimensional dilated convolution, given as input a signal $x[i]$ and a filter $w[k]$ of length $K$, is calculated as:

$$y[i] = \sum_{k=1}^{K} x[i + r \cdot k] \, w[k]$$

The dilation rate parameter $r$ corresponds to the stride with which the input signal is sampled. As illustrated by Figure 1, smaller rates result in a more clustered filter (in fact, rate 1 generates a kernel identical to the standard convolution) while larger rates expand the filter, producing a larger kernel with several gaps. Since this whole process is independent of the input data, changing the dilation rate does not impact the resolution of the layer outcome, i.e., in a dilated convolution, independent of the rate, input and output have the same resolution (considering appropriate stride and padding).
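The one-dimensional dilated convolution described above can be sketched in plain Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the paper: indices are 0-based, the filter has odd length and is centered on each location, and positions outside the signal are treated as zeros so that the output keeps the input length. The function name `dilated_conv1d` is hypothetical.

```python
def dilated_conv1d(x, w, rate=1):
    """Sum of x[i + rate * offset] * w[k] over the filter taps, for every location i.

    Assumptions (not from the paper): odd-length filter centered on i,
    zero padding outside the signal, stride 1.
    """
    if len(w) % 2 == 0:
        raise ValueError("this sketch assumes an odd-length filter")
    half = len(w) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, wk in enumerate(w):
            j = i + rate * (k - half)  # input position sampled by tap k
            if 0 <= j < n:             # positions outside the signal count as zero
                acc += x[j] * wk
        out.append(acc)
    return out


signal = [1, 2, 3, 4, 5, 6]
taps = [1, 0, -1]
y_rate1 = dilated_conv1d(signal, taps, rate=1)  # identical to a standard convolution
y_rate2 = dilated_conv1d(signal, taps, rate=2)  # same three weights, wider spacing
```

With either rate the output has as many samples as the input, which is the resolution-preserving behaviour the text refers to.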

(a) Rate 1
(b) Rate 2
(c) Rate 3
Fig. 1: Example of dilated convolutions. Dilation supports expansion of the receptive field without loss of resolution or coverage of the input.

By enlarging the filter (with such gaps), the network expands its receptive field (since the weights will be arranged in a more sparse shape) but preserves the resolution and no downsampling in the data is performed. Hence, this process has several advantages, such as: (i) supports the expansion of the receptive field without increasing the number of trainable parameters per layer [22], which reduces the computational burden, and (ii) preserves the feature map resolution, which may help the network to extract even more useful information from the data, mainly of small objects.
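To make the "holes" concrete, the following sketch builds the enlarged kernel explicitly by placing the original weights on a sparser grid. It is only an illustration: real frameworks do not materialize the zeros, and the helper name `dilate_kernel` is hypothetical.

```python
def dilate_kernel(kernel, rate):
    """Spread the weights of a square k x k kernel with `rate - 1` holes between them."""
    k = len(kernel)
    size = (k - 1) * rate + 1          # spatial extent of the dilated kernel
    out = [[0] * size for _ in range(size)]
    for a in range(k):
        for b in range(k):
            out[a * rate][b * rate] = kernel[a][b]
    return out


def count_weights(kernel):
    """Number of non-zero entries (the trainable weights in this toy kernel)."""
    return sum(1 for row in kernel for v in row if v != 0)


base = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rate2 = dilate_kernel(base, 2)   # 5 x 5 extent, still 9 weights
rate3 = dilate_kernel(base, 3)   # 7 x 7 extent, still 9 weights
```

The extent grows with the rate while the number of weights stays fixed, matching advantage (i) above.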

To better understand the aforementioned advantages, a comparison between dilated and standard convolution is presented in Figure 2. Given an image, the first network (in red) performs a downsampling operation (that reduces the resolution by a factor of 2) and a convolution, using a horizontal Gaussian derivative as kernel. The obtained low-resolution feature map is then enlarged by an upsampling operation with a factor of 2, which restores the original resolution but not the information lost during the downsampling process. The second network (blue) computes the response of a dilated convolution on the original image. In this case, the same kernel was used but rearranged according to the dilation rate. Although the filter size increases, only non-zero values are taken into account when performing the convolution. Therefore, the number of filter parameters and the number of operations per position stay constant. Furthermore, it is possible to observe that salient features are better represented by the dilated model, since no downsampling is performed over the input data.

Fig. 2: Comparison between dilated and standard convolutions. The top (red) row presents the feature extraction process using a standard convolution over a downsampled image, followed by an upsampling step that recovers the input resolution. The bottom (blue) row presents the feature extraction process using a dilated convolution applied directly to the input (without downsampling). The outcomes show the benefits of dilated convolutions over standard ones.
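The comparison in Figure 2 can be mimicked on a one-dimensional toy signal. The sketch below is an illustration under assumptions that differ from the figure: it uses a simple difference filter instead of a Gaussian derivative, nearest-neighbour upsampling, and a dilation rate of 2 chosen to match the downsampling factor. All function names are hypothetical.

```python
def conv1d_same(x, w, rate=1):
    """Centered, zero-padded, stride-1 (dilated) convolution with an odd-length filter."""
    half = len(w) // 2
    n = len(x)
    out = []
    for i in range(n):
        acc = 0
        for k, wk in enumerate(w):
            j = i + rate * (k - half)
            if 0 <= j < n:
                acc += x[j] * wk
        out.append(acc)
    return out


def downsample(x, factor=2):
    return x[::factor]


def upsample_nearest(x, factor=2):
    return [v for v in x for _ in range(factor)]


x = [0, 1, 4, 9, 16, 25, 36, 49]
w = [1, 0, -1]

# Red path: downsample, convolve, upsample back to the input length.
low_res = conv1d_same(downsample(x), w)
restored = upsample_nearest(low_res)

# Blue path: dilated convolution (rate 2) directly on the full-resolution signal.
dilated = conv1d_same(x, w, rate=2)
```

On this toy input, the dilated response at every second position coincides with the low-resolution response, while the positions in between carry responses of their own; the upsampled path can only repeat its neighbours there.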

IV Dynamic Multi-Scale Dilated Convolution

In this section, we present the proposed method for dynamic multi-scale semantic segmentation of remote sensing images. The proposed methodology is presented in Section IV-A while the network architectures are described in Section IV-B.

IV-A Dynamic Multi-Scale Algorithm

As introduced, we propose a novel method to perform semantic segmentation of remote sensing images that (i) exploits the multi-scale paradigm without increasing the number of trainable parameters of the network, and (ii) defines, at training time, the best patch size that should be exploited by the network in the test phase.

The multi-scale paradigm is aggregated into the model by allowing the ConvNet to explore, in the training step, patches of multiple sizes at every new batch processed by the network. Specifically, in the training phase, the processing of each batch may be divided into five main steps, as presented in Figure 3: (i) a new patch size is randomly selected from a distribution (which may be any valid distribution, such as uniform or multinomial), (ii) a new batch is created based on patches of the currently selected size, (iii) the ConvNet processes this batch normally, using it to learn the parameters through the backpropagation algorithm, (iv) for the current batch, the network outputs the prediction maps, which are used to calculate some statistics (such as loss and accuracy) that represent how well the current batch has performed, and (v) these statistics are used to update the score of the currently selected size.

All these steps are repeated during the training process until convergence or until a maximum number of iterations is reached. Therefore, during training, a new size is drawn for every batch and used to create the patches that compose it. These patches of multiple sizes contribute multi-scale information, allowing the network to capture and extract features by considering distinct context regions.

The second benefit of the proposed method is almost a direct application of the patch scores created in the training phase. Specifically, in the prediction phase, scores over the patch sizes (created during the training phase, step v) are averaged and analyzed. The best patch size (which corresponds to the highest or lowest score, for accuracy and loss, respectively) is then selected and used to create patches for the input images. The network still outputs the prediction maps but no updates in the patch scores are performed.

Fig. 3: Example of a training batch procedure for the proposed method. This whole process is repeated for each batch.
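The five training steps and the selection performed in the prediction phase can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the score of a size is kept as the list of its per-batch statistics and averaged at selection time, the sampling distribution is uniform, and the network update is replaced by a hypothetical callback `train_step` that returns a made-up accuracy. The candidate sizes 25, 50, and 100 are illustrative values only.

```python
import random
from collections import defaultdict


def train_dynamic_multiscale(patch_sizes, num_batches, train_step, seed=0):
    """Run the per-batch procedure and return the recorded statistics per patch size."""
    rng = random.Random(seed)
    scores = defaultdict(list)
    for _ in range(num_batches):
        size = rng.choice(patch_sizes)  # (i) sample a patch size (uniform here)
        stat = train_step(size)         # (ii)-(iv) build batch, backpropagate, measure
        scores[size].append(stat)       # (v) update the score of the sampled size
    return dict(scores)


def select_best_size(scores, higher_is_better=True):
    """Average the statistics of each size; pick the highest (accuracy) or lowest (loss)."""
    means = {size: sum(v) / len(v) for size, v in scores.items() if v}
    pick = max if higher_is_better else min
    return pick(means, key=means.get)


def fake_train_step(size):
    # Hypothetical stand-in for a real network update; returns a made-up batch accuracy.
    return {25: 0.70, 50: 0.85, 100: 0.80}[size]


scores = train_dynamic_multiscale([25, 50, 100], num_batches=30,
                                  train_step=fake_train_step)
best_size = select_best_size(scores)  # size used to cut patches at prediction time
```

Passing `higher_is_better=False` covers the case in which the recorded statistic is a loss rather than an accuracy.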

IV-B Architectures

Dilated convolutions [22] can process input images of any size and therefore fit perfectly in the proposed multi-scale methodology, given that a network composed of such layers is capable of processing the input without downsampling it. This creates the possibility of processing patches of any size without constraints. Although these layers have the advantage of computing feature responses at the original image resolution, a network composed uniquely of dilated convolutions would be costly to train, especially when processing entire (large) scenes. However, as previously mentioned, processing an entire remote sensing image is not possible (because of its huge size); therefore, splitting the image into small patches is already necessary, which naturally alleviates the cost of training.

Although, in this work, we explore networks composed of dilated convolutions, other types of ConvNets could be used, such as fully convolutional [11] and deconvolutional [13, 12] networks. These networks can also process patches of varying size, but they have restrictions related to a high variation of patch sizes. Specifically, these networks need to receive a patch large enough to allow the generation of a coarse map, which is then upsampled to the original size. If the patch is too small, the network could fall into a situation where it is not possible to create the coarse map and, consequently, the final upsampled map. This problem is overcome by dilated ConvNets [22], which can process patches of any size in the same way, without distinction, always outputting results with the same resolution as the input data (given proper configurations, such as stride and padding). This property is essential to allow the patch size to vary from very small values to larger ones.
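The size constraint discussed above follows from the usual output-size arithmetic of convolution and pooling layers. The sketch below is illustrative only: it assumes the standard formula out = floor((n + 2p - d(k - 1) - 1) / s) + 1, and the three unpadded 2 × 2, stride-2 pooling layers used for the downsampling network are a hypothetical configuration, not one taken from the cited works.

```python
def output_size(n, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    extent = dilation * (kernel - 1) + 1  # span covered by the (dilated) kernel
    if n + 2 * padding < extent:
        return 0                          # the kernel does not fit: no valid output
    return (n + 2 * padding - extent) // stride + 1


def coarse_map_size(n, num_poolings=3):
    """Size of the coarse map after a chain of unpadded 2x2, stride-2 poolings."""
    for _ in range(num_poolings):
        n = output_size(n, kernel=2, stride=2)
    return n


# A stride-1 dilated 3x3 layer with padding equal to its rate keeps any patch size.
preserved = all(
    output_size(n, kernel=3, stride=1, padding=r, dilation=r) == n
    for n in (1, 7, 25, 100)
    for r in range(1, 7)
)

tiny_patch = coarse_map_size(7)    # too small: the coarse map vanishes
large_patch = coarse_map_size(64)  # large enough to leave a coarse map
```

Under these assumptions, the dilated layer accepts even a single-pixel patch, while the downsampling chain needs a minimum patch size before any coarse map exists.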

(a) Dilated ConvNet with max rate 6
(b) Densely Dilated ConvNet
(c) Dilated ConvNet with max rate 6 and pooling
(d) Dilated ConvNet with max rate 8 and pooling
Fig. 4: Dilated Convolution Neural Network architectures.

Considering this, four architectures, illustrated in Figure 4, have been developed and evaluated in this work. The first network, presented in Figure 4(a), is composed of seven layers: six dilated convolutions (which are responsible for capturing the patterns of the input images) and a final convolution layer, which is responsible for generating the dense predictions. There is no pooling or normalization in this network, and all layers have stride 1. Specifically, the first two convolutions have dilation rates 1 and 2, respectively. The following two convolutions use rates 3 and 4, while the last two convolutions have smaller filters but dilation rates 5 and 6. Because this network has 6 layers responsible for the feature extraction, it will be referenced as Dilated6. The second network (Figure 4(b)) is based on densely connected networks [26], which recently achieved outstanding results on the image classification task. This network is very similar to the first one, having the same number of layers. The main difference between these networks is that each layer receives as input the feature maps of all preceding layers. Hence, the last layer has access to all feature maps generated by all other layers of the network. This process allows the network to combine feature maps with distinct levels of abstraction, supporting the capture and learning of a wide range of feature combinations. Because this network has 6 layers responsible for the feature extraction and is densely connected, it will be referenced in this work as DenseDilated6. The third network, presented in Figure 4(c), has the same configuration as Dilated6, but with pooling layers between consecutive convolutional layers. Given a specific combination of stride and padding, no downsampling is performed over the inputs in these pooling layers. Because of the number of layers and the pooling layers, this network will be referenced hereafter as Dilated6Pooling. The last network (Figure 4(d)) is an extension of the previous one, having 8 dilated convolutions instead of only 6. The last two layers have smaller filters but dilation rates 7 and 8. There are pooling layers between all convolutional ones. Given that this network has 8 dilated convolutional and pooling layers, it will be referenced hereafter as Dilated8Pooling.
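To give a feeling for how much context such stacks of dilated layers can see, the sketch below computes the receptive field of a stride-1 stack from its kernel sizes and dilation rates. The dilation rates (1 to 6 and 1 to 8) come from the description above, but the kernel size is a hypothetical value: a uniform 3 × 3 filter is assumed for every layer purely for illustration, whereas the text indicates that the real filter sizes differ across layers. The final prediction layer and the pooling layers are ignored.

```python
def receptive_field(layers):
    """Receptive field (one dimension) of a stack of stride-1 dilated convolutions.

    `layers` is a sequence of (kernel_size, dilation_rate) pairs; each layer
    widens the field by (kernel_size - 1) * dilation_rate.
    """
    rf = 1
    for kernel, rate in layers:
        rf += (kernel - 1) * rate
    return rf


K = 3  # hypothetical kernel size, NOT taken from the paper

dilated6_like = [(K, r) for r in range(1, 7)]  # rates 1..6, as in Dilated6
dilated8_like = [(K, r) for r in range(1, 9)]  # rates 1..8, as in Dilated8Pooling

rf6 = receptive_field(dilated6_like)
rf8 = receptive_field(dilated8_like)
```

With larger filters in the early layers, as the text suggests, the actual receptive fields would be wider than these illustrative numbers.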

V Experimental Setup

In this section, we present the experimental setup. Specifically, Section V-A presents the datasets employed in this work. Baselines are described in Section V-B while the experimental protocol is introduced in Section V-C.

V-A Datasets

To better evaluate the effectiveness of the proposed method, we carried out experiments on four high-resolution remote sensing datasets with very distinct properties. The first one is an agricultural dataset composed of multispectral high-resolution scenes of coffee crops and non-coffee areas. The others are urban datasets whose objective is mapping targets such as roads, buildings, and cars. The first urban dataset is the GRSS Data Fusion contest dataset (consisting of very high-resolution images), while the other two are the Vaihingen and Potsdam datasets, provided in the framework of the 2D semantic labeling contest organized by the ISPRS Commission III (http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html) and composed of multispectral high-resolution images.

V-A1 Coffee Dataset

This dataset [30] is composed of 5 images taken by the SPOT sensor in 2005 over a famous coffee grower county (Monte Santo) in the State of Minas Gerais, Brazil. Each image has pixels with green, red, and near-infrared bands (in this order), which are the most useful and representative ones for discriminating vegetation areas [17]. More specifically, the dataset consists of 1,250,000 pixels classified into two classes: coffee (637,544 pixels or 51%) and non-coffee (612,456 pixels or 49%). Figure VI-F1 presents the images and ground-truths of this dataset.

This dataset is very challenging for several reasons, including: (i) high intraclass variance, caused by different crop management techniques, (ii) scenes with different plant ages, since coffee is an evergreen culture, and (iii) images with spectral distortions caused by shadows, since the South of Minas Gerais is a mountainous region.

V-A2 GRSS Data Fusion Dataset

Proposed for the 2014 IEEE GRSS Data Fusion Contest, this dataset [31] is composed of two (training and testing) fine-resolution visible (RGB) images that cover an urban area near Thetford Mines in Quebec, Canada. Both training and testing images have meter of spatial resolution, with the former having pixels of resolution while the latter has pixels. Training and testing images, as well as the respective ground-truths, are presented in Figure 7.

Pixels are categorized into seven classes: trees, vegetation, road, bare soil, red roof, gray roof, and concrete roof. The dataset is not balanced, as can be seen in Table I. It is important to highlight that not all pixels are classified into one of these categories, with some pixels considered as uncategorized or unclassified.

                      Train                Test
Classes           #Pixels       %      #Pixels       %
Road              112,457   19.83      808,490   55.77
Trees              27,700    4.89      100,528    6.93
Red roof           45,739    8.05      136,323    9.40
Grey roof          53,520    9.44      142,710    9.84
Concrete roof      97,821   17.25      109,423    7.55
Vegetation        185,242   32.65      102,948    7.10
Bare soil          44,738    7.89       49,212    3.41
Total             567,217  100.00    1,449,634  100.00
TABLE I: Number of pixels per class for the GRSS Data Fusion dataset.

V-A3 Vaihingen Dataset

As introduced, this dataset [32] was released for the 2D semantic labeling contest of the International Society for Photogrammetry and Remote Sensing (ISPRS). It is composed of a total of 33 image tiles (with an average size of pixels), which are densely classified into six possible labels: impervious surfaces, building, low vegetation, tree, car, and clutter/background. The pixel distribution for the labeled images can be seen in Table II.

Sixteen of these images are fully annotated while the remaining ones compose the test set (which are the ones evaluated by the contest). Each of these images is composed of near infrared, red and green channels (in this order) with a spatial resolution of meter. A Digital Surface Model (DSM) coregistered to the image data was also provided, allowing the creation of a normalized DSM (nDSM) by [33]. In this work, we use the spectral information (NIR-R-G) and the nDSM. Specifically, the input data for the method consists of 4 dimensions: NIR-R-G and nDSM. Examples of the Vaihingen Dataset can be seen in Figure VI-F3.

                           Vaihingen               Potsdam
Classes                  #Pixels       %        #Pixels       %
Impervious Surfaces   21,815,349   27.94    245,930,445   28.46
Building              20,417,332   26.15    230,875,852   26.72
Low Vegetation        16,272,917   20.84    203,358,663   23.54
Tree                  18,110,438   23.19    126,352,970   14.62
Car                      945,687    1.21     14,597,667    1.69
Clutter/Background       526,083    0.67     42,884,403    4.96
Total                 78,087,806  100.00    864,000,000  100.00
TABLE II: Number of pixels per class for the ISPRS datasets, i.e., Vaihingen and Potsdam.

V-A4 Potsdam Dataset

Also proposed for the 2D semantic labeling contest, this dataset [34] has 38 tiles of the same size ( pixels), with a spatial resolution of meter. From the available tiles, 24 are densely annotated (with the same classes as the Vaihingen dataset), and their pixel distribution is presented in Table II. This dataset consists of 4-channel images (near infrared, red, green and blue), a Digital Surface Model (DSM) and a normalized DSM (nDSM). In this work, all spectral channels plus the nDSM are used as input for the ConvNet, resulting in 5-dimensional input data. Some samples of these images are presented in Figure VI-F4.

V-B Baselines

For the coffee dataset, we employed the Cascaded Convolutional Neural Network (CCNN) [2] as baseline. This method employs a multi-scale strategy by aggregating several ConvNets in order to perform classification of fixed size tiles towards a final segmentation of the image. For the GRSS Data Fusion Dataset, we employed, as baseline, the work of Santana et al. [35]. Their algorithm extracts features with many levels of context from superpixels by exploiting different layers of a pre-trained convolutional network, which are then combined in order to aggregate multi-scale information.

Aside from this, for both aforementioned datasets, we also consider as a baseline the method conceived by [30], in which specific networks are used to perform segmentation using the pixelwise paradigm, i.e., each pixel is classified independently by the classifier. Also, for these two datasets, we considered as baselines: (i) Fully Convolutional Networks (FCN) [11]. In this case, the pixelwise architectures proposed by [30] were converted into fully convolutional networks and exploited as baselines. (ii) Deconvolutional networks [12, 13]. Again, the pixelwise architectures proposed by [30] were converted into deconvolutional networks (based on the well-known SegNet [12] architecture) and exploited as baselines in this work. (iii) Dilated network [22], which is, in this case, the Dilated6 (Figure (c)c). All these networks were trained traditionally using patches of constant size defined according to the set of experiments of [30], i.e., patches of and for Coffee and GRSS Data Fusion datasets, respectively.

For the remaining datasets (Vaihingen and Potsdam), we refer to the official results published on the challenge website (http://www2.isprs.org/vaihingen-2d-semantic-labeling-contest.html and http://www2.isprs.org/potsdam-2d-semantic-labeling.html) as baselines for our proposed work.

V-C Experimental Protocol

For the Coffee [30] and the GRSS Data Fusion [31] datasets, we employed the same protocol as [30]. Specifically, for the former dataset, we conducted a five-fold cross-validation to assess the performance of the proposed algorithm. In this strategy, five runs are executed, where, at each run, three coffee scenes are used for training, one is used for validation, and the remaining one is used for testing. The results reported are the average metric of the five runs followed by its corresponding standard deviation. For the GRSS Data Fusion dataset, one image was used for training while the other was used for testing, since this dataset has a clear definition of training/testing.
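The five-run protocol for the Coffee dataset can be illustrated with a short sketch. The rotation scheme and the scene names below are assumptions made only for illustration; the text states the 3/1/1 split per run but not which scene plays which role in each run.

```python
def rotating_folds(scenes):
    """Yield (train, validation, test) splits by rotating the scene list.

    Hypothetical helper: with five scenes, each run uses three scenes for
    training, one for validation and one for testing.
    """
    n = len(scenes)
    for run in range(n):
        rotated = scenes[run:] + scenes[:run]
        # The last scene is held out for testing, the one before it for
        # validation, and the remaining scenes are used for training.
        yield rotated[:-2], rotated[-2], rotated[-1]


# Placeholder scene identifiers (not the dataset's real file names).
scenes = ["scene_1", "scene_2", "scene_3", "scene_4", "scene_5"]
folds = list(rotating_folds(scenes))
for train, validation, test in folds:
    print("train:", train, "| validation:", validation, "| test:", test)
```

With this rotation every scene is used exactly once for testing, so the reported mean and standard deviation cover all scenes.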

For Vaihingen [32] and Potsdam [34] datasets, we followed the protocol proposed by [4]. For the Vaihingen dataset, 11 out of the 16 annotated images were used to train the network. The 5 remaining images (with IDs 11, 15, 28, 30, 34) were employed to validate and evaluate the segmentation generalization accuracy. For the Potsdam dataset, 18 (out of 24) images were used for training the proposed technique. The remaining 6 images (with IDs 02_12, 03_12, 04_12, 05_12, 06_12, 07_12) were employed for validation of the method.

Four metrics were evaluated [36] for the datasets: overall accuracy, average accuracy, kappa index and F1 score. Overall accuracy considers the global aspects of the classification, i.e., it takes into account all correctly classified pixels indistinctly. On the other hand, average accuracy reports the average (per-class) ratio of correctly classified samples, i.e., it outputs an average of the accuracy of each class. Kappa index measures the agreement between the reference and the prediction map. Finally, F1 score is defined as the harmonic mean of precision and recall. These metrics were selected based on their diversity: overall accuracy and kappa are biased toward large classes (the relevance of classes with a small number of samples is canceled out by those with a large number) while average accuracy and F1 are calculated specifically for each class and, therefore, are independent of class size. Hence, the presented results are always some combination of such metrics in order to provide enough information about the effectiveness of the proposed method.
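As a minimal sketch of how these four metrics relate, the function below derives them from a confusion matrix. The function name and the example matrix are illustrative, and averaging the per-class F1 scores (macro F1) is an assumption, since the text does not state how the per-class values are combined.

```python
def segmentation_metrics(conf):
    """Compute metrics from a confusion matrix.

    conf[i][j] is the number of pixels of reference class i that were
    predicted as class j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    row_sums = [sum(row) for row in conf]                             # reference pixels per class
    col_sums = [sum(conf[i][j] for i in range(n)) for j in range(n)]  # predicted pixels per class
    diag = [conf[i][i] for i in range(n)]                             # correctly classified pixels

    overall = sum(diag) / total
    recall = [diag[i] / row_sums[i] if row_sums[i] else 0.0 for i in range(n)]
    precision = [diag[i] / col_sums[i] if col_sums[i] else 0.0 for i in range(n)]
    average = sum(recall) / n

    # Kappa compares the observed agreement with the agreement expected by chance.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    kappa = (overall - expected) / (1.0 - expected) if expected < 1.0 else 0.0

    # Per-class F1 (harmonic mean of precision and recall), then macro-averaged.
    f1 = [2 * p * r / (p + r) if p + r else 0.0 for p, r in zip(precision, recall)]
    return {
        "overall_accuracy": overall,
        "average_accuracy": average,
        "kappa": kappa,
        "f1": sum(f1) / n,
    }


# Imbalanced two-class toy example: the large class dominates overall accuracy.
example = segmentation_metrics([[90, 10], [20, 30]])
print(example)
```

In this toy case the overall accuracy (0.80) is higher than the average accuracy (0.75), because the larger class is classified better than the smaller one.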

The proposed method and network were implemented using TensorFlow [37], a framework conceived to allow an efficient exploitation of deep learning with Graphics Processing Units (GPUs). All experiments were performed on a 64-bit Intel i7 4960X machine with a 3.6GHz clock and 64GB of RAM. Four GeForce GTX Titan X GPUs with 12GB of memory, under CUDA version 8.0, were employed in this work. Note, however, that each GPU was used independently and that all networks proposed here can be trained using only one GPU. Ubuntu version 16.04.3 LTS was used as the operating system. (Footnote: As soon as the paper is accepted for publication, we will make all code and trained models publicly available.)

The ConvNet and its parameters were adjusted by considering a full set of experiments guided by [38]. After all the setup experiments, the best values for the hyperparameters, presented in Table III, were determined for each dataset. The number of iterations increases with the complexity of the dataset in order to ensure convergence. In the proposed models, the learning rate, responsible for determining how much an updating step influences the current value of the network weights, starts with a high value and is reduced during the training phase using exponential decay [37] with parameters defined according to the last column of Table III.
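A minimal sketch of such a schedule is given below. It assumes that an entry such as 0.5/50,000 in the last column of Table III means a decay rate of 0.5 applied every 50,000 steps, and that the initial learning rate is 0.01; both are readings of the table, not statements from the text. The formula follows the usual exponential decay form, rate = initial_rate * decay_rate^(step / decay_steps), and the `staircase` flag is an optional variant included only for illustration.

```python
def exponential_decay(initial_rate, step, decay_rate=0.5, decay_steps=50_000, staircase=False):
    """Learning rate after `step` training iterations.

    With staircase=True the rate drops in discrete jumps every `decay_steps`
    iterations; otherwise it decays smoothly.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_rate * decay_rate ** exponent


# Assumed initial rate of 0.01, printed at a few points of a 150,000-step run.
for step in (0, 50_000, 100_000, 150_000):
    print(step, exponential_decay(0.01, step))
```

Under these assumptions the rate is halved every 50,000 iterations, so a 150,000-iteration run ends at one eighth of the initial rate.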

Exponential Decay
Coffee Dataset 0.01 0.001 150,000 0.5/50,000
GRSS Data Fusion Dataset 0.01 0.005 200,000 0.5/50,000
Vaihingen Dataset 0.01 0.01 500,000 0.5/50,000
Potsdam Dataset 0.01 0.01 500,000 0.5/50,000
TABLE III: Hyperparameters employed in each dataset.

VI Results and Discussion

In this section, we present and discuss the obtained results. Specifically, we first analyze the parameters of the proposed technique: Section VI-A presents the results achieved using different patch distributions, Section VI-B analyzes distinct functions to update the patch size score, and Section VI-C evaluates distinct ranges for the patch size. Afterwards, a convergence analysis of the proposed technique is performed in Section VI-D. Then, a comparison between networks trained with the proposed and standard techniques is presented in Section VI-E. Finally, a comparison with the state-of-the-art is reported in Section VI-F.

VI-A Patch Distribution Analysis

As explained in the beginning of Section IV-A, the algorithm receives as input a list of possible patch sizes and a corresponding distribution. In fact, any distribution could be used, including uniform or multinomial. Given the influence of this distribution over the proposed algorithm, experiments have been conducted to determine the most appropriate distribution in our case. Towards this, we selected and compared three distinct distributions. The first is the uniform distribution over a range of values, i.e., given two extreme points, all intermediate values inside this range should have the same probability of being selected. The second is the uniform distribution but over selected values (and not a range). In this case, referred to as uniform fixed, the probability distribution is equally divided among the given values (the intermediate points have no probability of being selected). The last distribution evaluated is the multinomial. In this case, ordinary values inside a range have the same probability but several given points have twice the chance of being selected.

The main difference between the evaluated distributions is related to the prior knowledge of the application. In the uniform distribution, no prior knowledge is assumed, and all patch sizes from the input range have the same probability, taking more time to converge the model. The uniform fixed distribution assumes a good knowledge of the application and only pre-defined patch sizes can be (equally) selected and evaluated, taking less time to converge the model. The multinomial distribution tries to blend previous ideas. Assuming certain prior knowledge of the application, the multinomial distribution weights the probabilities allowing the network to give more attention to specific pre-defined patch sizes but without discarding the others. If prior intuition is confirmed, these pre-defined patch sizes are randomly selected more often and the network should converge faster. Otherwise, the proposed process is still able to use other (non pre-defined) patch sizes and converge the network anyway.

Results of this analysis can be seen in Table IV. Note that all experiments were performed using the Coffee dataset [30], Dilated6 network (Figure (a)a), accuracy as score function, and the hyperparameters presented in Table III. In these experiments, the patch size varied from 25 to 50. Specifically, for the uniform distribution, any value between 25 and 50 has the same probability of being selected, while for the multinomial distribution, all values have some chance of being selected, but the two extreme sizes (25 and 50) have twice the probability. For the uniform fixed, these two patch sizes split the total probability and each one has a 50% chance of being selected. Overall, the variation of the distribution has no serious impact on the final outcome, since results are statistically equal. However, given its simplicity and faster convergence, for the remainder of this work, results will be reported using the uniform fixed distribution.
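The three distributions can be sketched as follows. The function names are hypothetical, and the default values (sizes from 25 to 50, with 25 and 50 as the highlighted points) simply mirror the Coffee experiment described above; this is an illustration of the sampling idea, not the authors' implementation.

```python
import random


def patch_size_distribution(kind, low=25, high=50, highlighted=(25, 50)):
    """Return (sizes, probabilities) for one of the three evaluated distributions."""
    if kind == "uniform":
        # Every size inside the range is equally likely.
        sizes = list(range(low, high + 1))
        weights = [1.0] * len(sizes)
    elif kind == "uniform_fixed":
        # Only the given sizes can be drawn, each with the same probability.
        sizes = list(highlighted)
        weights = [1.0] * len(sizes)
    elif kind == "multinomial":
        # Every size can be drawn, but the highlighted ones are twice as likely.
        sizes = list(range(low, high + 1))
        weights = [2.0 if s in highlighted else 1.0 for s in sizes]
    else:
        raise ValueError(f"unknown distribution: {kind}")
    total = sum(weights)
    return sizes, [w / total for w in weights]


def sample_patch_size(kind, rng=random):
    """Draw the patch size to be used for the next training batch."""
    sizes, probabilities = patch_size_distribution(kind)
    return rng.choices(sizes, weights=probabilities, k=1)[0]


for kind in ("uniform", "uniform_fixed", "multinomial"):
    print(kind, [sample_patch_size(kind) for _ in range(5)])
```

Because the uniform fixed variant only ever draws from the listed sizes, each size is visited more often during training, which is consistent with the faster convergence reported for it.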

Distribution | Overall Accuracy | Kappa | Average Accuracy | F1 Score
Uniform | 86.13 ± 2.39 | 69.39 ± 3.48 | 84.81 ± 1.65 | 84.58 ± 1.90
Uniform Fixed | 86.27 ± 1.44 | 69.41 ± 2.01 | 84.85 ± 1.66 | 84.62 ± 1.06
Multinomial | 86.06 ± 1.68 | 68.94 ± 2.94 | 84.56 ± 2.00 | 84.39 ± 1.51
TABLE IV: Results over different distributions.

VI-B Score Function Analysis

As presented in Figure 3, the last step is responsible for updating the score of the patch sizes that can be selected in the testing stage. In this work, we evaluated two possible score functions that could be employed in this step: the loss and the accuracy. In the first case, the loss is a measure (obtained using cross entropy [7] in this case) that represents the error between the ground-truth and the network predictions. In the second case, the score is represented by the classification accuracy [36] of the images.

To analyze the most appropriate score function, experiments were performed varying only this particular parameter and maintaining the remaining ones. Specifically, these experiments were conducted using: the Coffee dataset [30], Dilated6 network (Figure (a)a), uniform fixed distribution (over patch sizes 25 and 50), and the same hyperparameters presented in Table III. Results can be seen in Table V. The table shows that both score functions achieved similar results, being statistically equal. However, since the accuracy score is more intuitive, for the remainder of this work, results will be reported using this function.
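The bookkeeping behind such a score can be sketched as below. This is only an illustration under an assumed update rule (a running mean of the per-batch value for each patch size); the actual update used by the method is defined in Section IV and may differ. The class and method names are hypothetical.

```python
class PatchSizeScores:
    """Keep one score per patch size and pick the best size for testing."""

    def __init__(self, patch_sizes, higher_is_better=True):
        # higher_is_better=True suits accuracy; use False when the score is a loss.
        self.totals = {size: 0.0 for size in patch_sizes}
        self.counts = {size: 0 for size in patch_sizes}
        self.higher_is_better = higher_is_better

    def update(self, patch_size, value):
        """Record the accuracy (or loss) obtained on a batch of this patch size."""
        self.totals[patch_size] += value
        self.counts[patch_size] += 1

    def mean(self, patch_size):
        return self.totals[patch_size] / self.counts[patch_size]

    def best(self):
        """Return the patch size with the best mean score among those seen so far."""
        seen = [size for size, count in self.counts.items() if count > 0]
        if not seen:
            raise ValueError("no patch size has been scored yet")
        pick = max if self.higher_is_better else min
        return pick(seen, key=self.mean)


# Accuracy as score: the size with the highest mean accuracy is selected.
accuracy_scores = PatchSizeScores([25, 50])
accuracy_scores.update(25, 0.80)
accuracy_scores.update(50, 0.90)
accuracy_scores.update(25, 0.84)
print(accuracy_scores.best())
```

Switching between the two score functions only changes the direction of the comparison: accuracy is maximized, while loss is minimized.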

Score Function | Overall Accuracy | Kappa | Average Accuracy | F1 Score
Accuracy | 86.27 ± 1.44 | 69.41 ± 2.01 | 84.85 ± 1.66 | 84.62 ± 1.06
Loss | 86.15 ± 1.96 | 69.16 ± 3.41 | 84.68 ± 2.02 | 84.49 ± 1.76
TABLE V: Results over different score functions.

VI-C Range Analysis

Although the presented approach is proposed to select automatically the best patch size, in training time, avoiding lots of experiments to adjust such size (as done in several works [30, 20, 4]), in this section, the patch size range is analyzed in order to determine the robustness of the method.

This range is evaluated on all datasets except Potsdam. That dataset is very similar to the Vaihingen one; therefore, the analysis and decisions made over the latter are also applicable to Potsdam. Furthermore, in order to evaluate the Vaihingen dataset, a validation set, created according to [4], was used. Experiments were conducted varying only the patch size range while maintaining the remaining configurations. Particularly, the experiments employed the same hyperparameters (presented in Table III), the Dilated6 network (Figure (a)a), and the uniform fixed distribution.

Table VI presents the obtained results. Each dataset was evaluated in several ranges, selected based on previous works [30, 4]. Specifically, each dataset was evaluated in a large range (comprising from small to large sizes) and subsets of such range. Table VI also presents the most selected patch size (for the testing phase) for each experiment, giving some insights about how the proposed method behaves during such step.

For the Coffee dataset [30], obtained results are very similar, being considered statistically equal. Hence, any patch size range could be selected for further experiments, showing the robustness of the proposed algorithm which yielded similar results independently of the patch size range. Because of processing time, in this case, patch size range was selected and used in all further experiments.

For the remaining datasets, a specific range achieved the best result. For the GRSS Data Fusion dataset [31], the best result was obtained when considering the largest range (7, 14, 21, 28, 35, 42, 49, 56, 63, 70), i.e., the range varying from small to large patch sizes. For Vaihingen [32], the intermediate range (45, 55, 65, 75, 85) achieved the best result. Therefore, in these cases, such ranges were selected and used in the remaining experiments of this work. However, as can be seen in Table VI, other ranges also produce competitive results and could be used without a high loss of performance, which confirms the robustness of the proposed method to the patch size range, allowing it to process images without the need to experimentally search for the best patch size configuration.

In terms of patch size selection, the algorithm really varies depending on the experiment. For the Coffee dataset, the most selected patch sizes were 50 and 75, showing a trend towards such interval. For the remaining datasets, larger patches were favored in our experiments. This may be justified by the fact that urban areas have complex interactions and larger patches allow the network to capture more information about the context. Though the best patch size is really dependent on the experiment, current results showed that the proposed approach is able to learn and select the best patch size in processing time producing interesting outcomes when compared to state-of-the-art works, a fact reconfirmed in Section VI-F.

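Dataset | Patch Size Range | Most Selected Patch Size | Overall Accuracy | Kappa | Average Accuracy | F1 Score
Coffee | 25,50 | 50 | 86.27 ± 1.44 | 69.41 ± 2.01 | 84.85 ± 1.66 | 84.62 ± 1.06
 | 50,75 | 50 | 87.32 ± 1.82 | 71.89 ± 1.46 | 85.59 ± 1.59 | 87.26 ± 1.73
 | 75,100 | 75 | 86.07 ± 1.95 | 70.34 ± 1.91 | 85.91 ± 1.68 | 86.14 ± 1.77
 | 25,50,75,100,125 | 75 | 87.11 ± 1.74 | 71.20 ± 2.23 | 85.17 ± 1.52 | 86.85 ± 1.69
GRSS | 7,14,21,28,35 | 35 | 87.93 | 80.55 | 85.87 | 87.22
 | 28,35,42,49,56 | 49 | 87.71 | 81.47 | 85.26 | 88.48
 | 42,49,56,63,70 | 70 | 88.33 | 83.21 | 88.04 | 89.63
 | 7,14,21,28,35,42,49,56,63,70 | 70 | 90.10 | 85.22 | 90.13 | 90.37
Vaihingen | 25,45,55,65 | 65 | 86.60 | 83.18 | 71.03 | 71.77
 | 45,55,65,75,85 | 85 | 88.66 | 84.97 | 71.96 | 72.82
 | 25,45,55,65,85,95,100 | 95 | 87.44 | 84.27 | 71.30 | 71.97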
TABLE VI: Results of the proposed approach when varying the input range of patch sizes. For Vaihingen, a validation set (created according to [4]) is employed. Bold patch size ranges were selected for all further experiments.

VI-D Convergence Analysis

In this section, we analyze the convergence of the proposed technique. Figure 5 presents the convergence of the datasets using the Dilated6 network, accuracy as score function, uniform fixed, and hyperparameters presented in Table III. According to the figure, the loss and accuracy vary significantly at the beginning of the process but, with the reduction of the learning rate, the networks converge independently of the use of distinct patch sizes. Moreover, the test/validation accuracy (green line) converges and stabilizes showing that the networks can learn to extract features from patches of multiple sizes while selecting the best patch size for testing.

Fig. 5: Convergence of the Dilated6 network for all datasets. Panels: (a) Coffee dataset, fold 1; (b) GRSS Data Fusion dataset; (c) Vaihingen dataset; (d) Potsdam dataset. For the Coffee dataset, only fold 1 is reported. For the Vaihingen and Potsdam datasets, the validation set (created according to [4]) is reported.

VI-E Performance Analysis

To analyze the efficiency of the proposed algorithm, several experiments were conducted comparing the same network trained using two distinct methods: (i) the traditional training process [7], in which the network is trained using patches of constant size, without any variation. This method is the standard one when it comes to neural networks and is the most exploited in the literature for training deep learning-based techniques. Also, this is the approach used to empirically select the best patch size, which is traditionally done by training several networks, one for each considered patch size. (ii) the proposed dynamic training process, in which the network is trained with patches of varying size.

Two datasets were selected to be evaluated using these training strategies: (i) the GRSS Data Fusion dataset, which has the largest patch size range (according to Section VI-C), allowing a better comparison between the training strategies, and (ii) the Vaihingen dataset, which is very similar to the Potsdam one and, therefore, allows the conclusions to be applied to the latter.

Specifically, in these experiments, Dilated6 network (Figure (a)a) is trained using both strategies. For the proposed dynamic training process, previous experiments were taken into account, i.e., this process uses uniform fixed, patch ranging according to Section VI-C, accuracy as score function, and hyperparameters presented in Table III. After training the network using varying patch sizes, each possible patch size is processed independently by the inference phase, resulting in several outcomes, one for each possible patch size. Concerning the traditional training process, the same network was trained using each of the possible patch sizes.

Results can be seen in Figure 6. Outcomes concerning the Vaihingen dataset are related to the validation set, created according to [4]. As can be seen from the results, networks trained with the proposed approach outperform the models trained with the traditional training process. This shows the ability of the proposed method to capture multi-scale information from distinct context patches, which improves the performance of the final model, independently of the patch size used during the inference stage.

Fig. 6: Comparison between the dilated network trained using the proposed and the traditional method. Panels: (a) GRSS Data Fusion dataset; (b) Vaihingen dataset.

VI-F State-of-the-art Comparison

VI-F1 Coffee dataset

Using analysis performed on previous sections, we have conducted several experiments over the Coffee dataset [30]. Results of these experiments, as well as state-of-the-art baselines, are presented in Table VII. In order to allow a visual comparison, prediction maps for the Coffee dataset using different networks trained with the proposed method are presented in Figure VI-F1.

Overall, all baselines produced similar results. While the pixelwise network [30] yielded a slightly worse result with higher standard deviation, all other baselines reached basically the same level of performance, with a smaller standard deviation. This may be justified by the fact that the pixelwise network does not learn much information about the pixel interaction (since each pixel is processed independently), while the other methods process and classify a set of pixels simultaneously. Because of these similar results, all baselines are statistically equivalent.

This same behavior may be seen among the networks trained with the proposed methodology. Although these networks achieved comparable results, such models outperformed the baselines. Furthermore, the Dilated6 trained with the proposed dynamic method produced better results than the same network trained with traditional training process (mainly in the Kappa Index). These results show the effectiveness of the proposed technique that produces state-of-the-art outcomes by capturing multi-scale information while selecting the best patch size, two great advantages when compared to the traditional training process.

Traditional | Pixelwise [30] | 81.72 ± 2.38 | 62.75 ± 7.42
 | CCNN [2] | 82.80 ± 2.30 | 64.60 ± 4.34
 | FCN [11] | 83.25 ± 2.47 | 66.00 ± 3.55
 | Deconvolution Network [12] | 82.61 ± 2.05 | 65.56 ± 3.47
 | Dilated network (Dilated6Pooling) | 82.52 ± 1.14 | 66.14 ± 2.27
Dynamic | Dilated6 | 84.79 ± 1.66 | 69.41 ± 2.01
 | DenseDilated6 | 85.88 ± 2.34 | 71.51 ± 2.74
 | Dilated6Pooling | 85.77 ± 1.74 | 72.27 ± 1.38
 | Dilated8Pooling | 86.67 ± 1.39 | 73.78 ± 1.87
TABLE VII: Results for the Coffee dataset.
Figure: Two images of the Coffee Dataset, their respective ground-truths and the prediction maps generated by the proposed algorithm. Columns: Original Image, Ground-Truth, Dilated6, DenseDilated6, Dilated6Pooling, Dilated8Pooling. Legend – White: Coffee areas. Black: Non Coffee areas.

VI-F2 GRSS Data Fusion Dataset

We also performed several experiments on the GRSS Data Fusion Contest dataset [31] considering all analysis carried out in previous sections. Experimental results, as well as baselines, are presented in Table VIII. The prediction maps obtained for the test set are presented in Figure 7.

Overall, Dilated6 produced the best result among all approaches. In general, networks trained with the proposed method outperformed the baselines. Moreover, the Dilated6 trained with the proposed dynamic technique outperformed the baseline composed of the same network trained using the traditional training process, corroborating previous conclusions.

Among the baseline methods, although all achieved comparable results, the best outcome was yielded by the Deep Contextual [35]. This method also leverages multi-scale information, since it combines features extracted from distinct layers of pre-trained ConvNets. When comparing this method with the best result of the proposed technique (Dilated6), one may see the gap between the models, since the proposed method improves the results for all metrics when compared to the Deep Contextual [35] approach. This reaffirms the effectiveness of the proposed dynamic method.

Training | Method | Overall Accuracy | Average Accuracy | Kappa
Traditional | Pixelwise [30] | 85.04 | 86.52 | 78.18
 | FCN [11] | 83.27 | 87.45 | 76.10
 | Deconvolution Network [12] | 82.15 | 86.24 | 75.04
 | Dilated network (Dilated6Pooling) | 83.96 | 83.83 | 76.12
 | Deep Contextual [35] | 85.45 | 88.33 | 79.01
Dynamic | Dilated6 | 90.10 | 90.13 | 85.22
 | DenseDilated6 | 88.66 | 80.62 | 81.80
 | Dilated6Pooling | 88.05 | 86.12 | 81.81
 | Dilated8Pooling | 89.03 | 85.31 | 83.08
TABLE VIII: Results for the GRSS Data Fusion dataset.
Fig. 7: The GRSS Data Fusion training and test images, their respective ground-truths and the prediction maps generated by the proposed algorithm. Panels: (a) Test; (b) Dilated6; (c) DenseDilated6.

VI-F3 Vaihingen Dataset

As introduced in Section V-B, official results for the Vaihingen dataset are reported only by the challenge organization, which withheld some images that are used for testing the submitted algorithms. Therefore, one must submit the proposed algorithm to have it evaluated. In our case, following the previous analysis, we submitted five approaches: the first four are related to each network presented in Section IV trained with the 6 classes (which are represented in the official results as UFMG_1 to 4), and the fifth one, represented in the official results as UFMG_5, is the Dilated8 network (Figure (d)d) trained with only 5 classes, i.e., all labels except the clutter/background one. This last submission is due to the lack of training data for that class, which corresponds to only 0.67% of the dataset (as stated in Table II). It is important to note that none of the submissions related to the proposed work uses any post-processing, such as Conditional Random Fields (CRF) [25].

Some official results reported by the organization are summarized in Table IX. In addition to our results, this table also compiles the best results of each work with enough information to make a fair comparison, i.e., in which the proposed approach is minimally explained.

It is possible to notice that the proposed work yielded competitive results. The best result, in terms of overall accuracy, was achieved by DLR_9 [16] and GSN3 [21]. Our best result (UFMG_4) appears in fifth place, yielding 89.4% of overall accuracy and outperforming several methods, such as ADL_3 [39] and RIT_L8 [40], that also tried to aggregate multi-scale information. However, as can be seen in Table IX and Figure 8a, while the others have a larger number of trainable parameters, our approach has only 2 million, which makes it less prone to overfitting and, consequently, easier to train, showing that the proposed method really helps in extracting all feasible information from the data using limited architectures (in terms of parameters). In fact, the number of parameters of the network is so relevant that the authors of the DLR_9 submission [16], one of the best results but with a higher number of parameters, do not recommend their proposed method for practical usage because of its memory consumption and expensive training phase. Furthermore, the obtained results, which do not involve any post-processing, are better than others, such as DST_2 [14], that employ CRF as a post-processing method, which shows the potential of dilated convolutions in aggregating refined information.

Aside from this, the proposed work (UFMG_5) achieved the best result (82.5% of F1 Score) in the car class, which is one of the most difficult classes of this dataset when compared to others (such as building) because of its composition (small objects) and its high intraclass variance (caused by a great variety of models and colors). This may be justified by the fact that the proposed network does not downsample the input image, preserving important details for such classes composed of small objects. However, this submission ignores the clutter/background class, which could be considered an advantage that makes the comparison unfair. Still, there are other works following the same training protocol (i.e., ignoring the clutter/background class), such as INR [19], and such works have not achieved accuracy as good as that of the proposed work. Furthermore, still considering the car class, the second best result (81.3% of F1 Score) is also yielded by our proposed work (UFMG_4), which employs all classes during the training phase. This shows the effectiveness and robustness of our work, mainly for classes related to small objects.

Method #Parameters F1 Score Overall Accuracy
Tree Car
DLR_9 [16] 92.4 95.2 83.9 89.9 81.2 90.3
GSN3 [21] 92.3 95.2 84.1 90.0 79.3 90.3
ONE_7 [15] 91.0 94.5 84.4 89.9 77.8 89.8
INR [19] 91.1 94.7 83.4 89.3 71.2 89.5
UFMG_4 (Dilated8Pooling) 91.1 94.5 82.9 88.8 81.3 89.4
UFMG_5 (Dilated8Pooling) 91.0 94.6 82.7 88.9 82.5 89.3
UFMG_1 (Dilated6) 90.5 94.1 82.5 89.0 78.5 89.1
DST_2 [14] 90.5 93.7 83.4 89.2 72.6 89.1
UFMG_2 (DenseDilated6) 90.7 94.3 82.5 88.5 77.4 89.0
UFMG_3 (Dilated6Pooling) 90.6 93.4 82.4 88.5 79.8 88.8
ADL_3 [39] 89.5 93.2 82.3 88.2 63.3 88.0
RIT_2 [41] 90.0 92.6 81.4 88.4 61.1 88.0
RIT_L8 [40] 89.6 92.2 81.6 88.6 76.0 87.8
UZ_1 [4] 89.2 92.5 81.6 86.9 57.3 87.3
TABLE IX: Official results for the Vaihingen dataset.
Figure: Example predictions for the validation set of the Vaihingen dataset. Columns: Image, nDSM, Ground-Truth, Dilated6, DenseDilated6, Dilated6 Pooling, Dilated8 Pooling, Dilated8 Pooling. Legend – White: impervious surfaces. Blue: buildings. Cyan: low vegetation. Green: trees. Yellow: cars. Red: clutter, background.

Figure: Example predictions for the test set of the Vaihingen dataset. Columns: Image, nDSM, Dilated6, DenseDilated6, Dilated6 Pooling, Dilated8 Pooling, Dilated8 Pooling. Legend – White: impervious surfaces. Blue: buildings. Cyan: low vegetation. Green: trees. Yellow: cars. Red: clutter, background.

(a) Vaihingen
(b) Potsdam
Fig. 8: Comparison, in terms of overall accuracy and number of trainable parameters, between proposed and existing networks for Vaihingen and Potsdam datasets. Ideal architectures should be in the top left corner, with fewer parameters but higher accuracy. Since the x axis is logarithmic, a change of only 0.3 in this axis is equivalent to more than 1 million new parameters in the model.

VI-F4 Potsdam Dataset

As for the Vaihingen dataset, official results for the Potsdam dataset are reported only by the challenge organization. For this dataset, we have four submissions, one for each network presented in Section IV trained with the 6 classes (which are represented in the official results as UFMG_1 to 4). In this dataset, there is no need to disregard the clutter/background class, since it has a sufficient number of samples (4.96%). As before, none of the submissions related to the proposed work uses any post-processing.

Table X summarizes some results reported by the organizers. Again, in addition to our results, the table also compiles the best results of each work with enough information to make a fair comparison.

The proposed work achieved competitive results, appearing in third place according to the overall accuracy. DST_5 [14] and RIT_L7 [40] are the best results in terms of overall accuracy. However, they have a larger number of trainable parameters when compared to our proposed networks, as can be seen in Figure 8b. This outcome corroborates previous results, reaffirming the obtained conclusions.

Method #Parameters F1 Score Overall Accuracy
Tree Car
DST_5 [14] 92.5 96.4 86.7 88.0 94.7 90.3
RIT_L7 [40] 91.2 94.6 85.1 85.1 92.8 88.4
UFMG_4 (Dilated8Pooling) 90.8 95.6 84.4 84.3 92.4 87.9
UFMG_3 (Dilated6Pooling) 90.5 95.6 83.3 82.6 90.8 87.2
UFMG_1 (Dilated6) 90.1 95.6 83.7 82.4 91.3 87.0
KLab_2 [42] 89.7 92.7 83.7 84.0 92.1 86.7
UFMG_2 (DenseDilated6) 88.7 95.3 83.1 80.8 90.8 85.8
UZ_1 [4] 89.3 95.4 81.8 80.5 86.5 85.8
TABLE X: Official results for the Potsdam dataset.
Figure: Example predictions for the validation set of the Potsdam dataset. Columns: Image, nDSM, Ground-Truth, Dilated6, DenseDilated6, Dilated6 Pooling, Dilated8 Pooling, Dilated8 Pooling. Legend – White: impervious surfaces. Blue: buildings. Cyan: low vegetation. Green: trees. Yellow: cars. Red: clutter, background.

Figure: Example predictions for the test set of the Potsdam dataset. Columns: Image, nDSM, Dilated6, DenseDilated6, Dilated6 Pooling, Dilated8 Pooling, Dilated8 Pooling. Legend – White: impervious surfaces. Blue: buildings. Cyan: low vegetation. Green: trees. Yellow: cars. Red: clutter, background.

VII Conclusions

In this paper, we propose a novel approach based on Convolutional Neural Networks to perform semantic segmentation of remote sensing scenes. The method exploits networks composed uniquely of dilated convolution layers that do not downsample the input. Based on these networks and their no downsampling property, the proposed approach: (i) employs, in the training phase, patches of different sizes, allowing the networks to capture multi-scale characteristics given the distinct context size, and (ii) updates a score for each of these patch sizes in order to select the best one during the testing phase.
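The no-downsampling property can be illustrated with a one-dimensional toy example. The sketch below is a plain-Python dilated convolution with zero padding and stride 1; it is not the networks used in this work, and the function name, kernel and inputs are illustrative only. It shows that the output has the same length as the input for any input size, which is what allows the same weights to process patches of different sizes.

```python
def dilated_conv1d_same(signal, kernel, dilation):
    """1-D dilated convolution (cross-correlation form), stride 1, zero padding.

    The kernel taps are spaced `dilation` positions apart, so a kernel of
    length k covers a span of (k - 1) * dilation + 1 input positions.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    center = k // 2
    output = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            index = i + (j - center) * dilation
            # Positions outside the signal are treated as zeros (padding).
            if 0 <= index < len(signal):
                acc += weight * signal[index]
        output.append(acc)
    return output


# The same kernel handles inputs of different lengths without resizing.
kernel = [1.0, 1.0, 1.0]
for length in (25, 50):
    result = dilated_conv1d_same([1.0] * length, kernel, dilation=2)
    print(length, len(result))
```

Increasing the dilation widens the context seen by each output position without adding weights, since the kernel length stays the same.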

We performed experiments on four high-resolution remote sensing datasets with very distinct properties: (i) Coffee dataset [30], composed of multispectral high-resolution scenes of coffee crops and non-coffee areas, (ii) GRSS Data Fusion dataset [31], consisting of very high-resolution visible spectrum images, (iii) Vaihingen dataset [32], composed of multispectral high-resolution images and a normalized Digital Surface Model, and (iv) Potsdam dataset [34], also composed of multispectral high-resolution images and a normalized Digital Surface Model.

Experimental results have shown that our method is effective and robust. It achieved state-of-the-art results on two datasets (Coffee and GRSS Data Fusion), outperforming several techniques (such as fully convolutional [11] and deconvolutional networks [12]) that also exploit the multi-scale paradigm. This shows the potential of the proposed method for learning multi-scale information using patches of multiple sizes.

For the other datasets (Vaihingen and Potsdam), although the proposed technique did not achieve state-of-the-art results, it yielded competitive ones. In fact, our approach outperformed some relevant baselines that exploit post-processing techniques (although we did not employ any) and other multi-scale strategies. Among all methods, the proposed one has the fewest parameters and is, therefore, less prone to overfitting and, consequently, easier to train. At the same time, it produces one of the highest accuracies, which shows the effectiveness of the proposed technique in extracting all feasible information from the data using limited architectures (in terms of parameters). Furthermore, the proposed technique achieved one of the best results for the car class, which is one of the most difficult classes of these datasets because of its composition (small objects). This demonstrates the benefits of processing the input image without downsampling it, a process that preserves important details for classes that are composed of small objects.

Beyond this, the proposed networks can be fine-tuned for any semantic segmentation application, since they do not depend on the patch size to process the data. This allows other applications to benefit from the patterns extracted by our models, which is especially important when working with small amounts of labeled data [23].
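The independence from the patch size follows from the no-downsampling property, which a one-dimensional analogue can illustrate. The sketch below is not the architecture evaluated in this paper: it assumes stride-1 dilated convolutions with zero padding, an odd kernel size, and hypothetical helper names (`dilated_conv1d`, `apply_stack`, `receptive_field`). It shows that the same weights process inputs of any length and return an output of the same length, while the receptive field grows with the dilation rates.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Stride-1 dilated convolution with zero padding (output length == input length)."""
    k = len(kernel)
    assert k % 2 == 1, "this sketch assumes an odd kernel size"
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Kernel taps are spaced `dilation` samples apart around position i.
            idx = i + (j - k // 2) * dilation
            if 0 <= idx < len(signal):  # taps outside the signal read zero padding
                acc += weight * signal[idx]
        out.append(acc)
    return out


def apply_stack(signal, kernel, dilations):
    """Apply one dilated layer per dilation rate, reusing the same kernel."""
    for dilation in dilations:
        signal = dilated_conv1d(signal, kernel, dilation)
    return signal


def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of a stack of stride-1 dilated layers."""
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


if __name__ == "__main__":
    kernel = [1.0, 1.0, 1.0]
    for length in (25, 50, 75):  # three "patch sizes", same weights
        output = apply_stack([1.0] * length, kernel, [1, 2, 4])
        print(length, len(output))
    print(receptive_field(3, [1, 2, 4]))
```

Since no layer reduces the resolution, every input position keeps its own output position, which is consistent with the preservation of small objects discussed above.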

The presented conclusions open some opportunities towards a simplified use of deep learning methods for better understanding the Earth’s surface, which is still needed for remote sensing applications, such as agriculture or urban planning. In the future, we plan to better analyze the relation between the number of classes in the dataset and the number of parameters in the ConvNet.


This work was partially financed by CNPq (grant 312167/2015-6), CAPES (grant 88881.131682/2016-01), and Fapemig (APQ-00449-17). The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX TITAN X GPU used for this research.


  • [1] J. A. Richards and J. Richards, Remote sensing digital image analysis.   Springer, 1999, vol. 3.
  • [2] K. Nogueira, W. R. Schwartz, and J. A. dos Santos, “Coffee crop recognition using multi-scale convolutional neural networks,” in Iberoamerican Congress on Pattern Recognition, 2015, pp. 67–74.
  • [3] D. Fustes, D. Cantorna, C. Dafonte, B. Arcay, A. Iglesias, and M. Manteiga, “A cloud-integrated web platform for marine monitoring using gis and remote sensing,” Future Generation Computer Systems, vol. 34, pp. 155–160, 2014.
  • [4] M. Volpi and D. Tuia, “Dense semantic labeling of subdecimeter resolution images with convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 881–893, 2017.
  • [5] V. Dey, Y. Zhang, and M. Zhong, A review on image segmentation techniques with remote sensing perspective, 2010.
  • [6] X. Zhu, D. Tuia, L. Mou, G.-S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: a review,” arXiv preprint arXiv:1710.03959, 2017.
  • [7] I. Goodfellow, Y. Bengio, and A. Courville, “Deep learning,” 2016, http://www.deeplearningbook.org.
  • [8] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
  • [9] O. A. Penatti, E. Valle, and R. d. S. Torres, “Comparative study of global color and texture descriptors for web image retrieval,” Journal of Visual Communication and Image Representation, vol. 23, no. 2, pp. 359–380, 2012.
  • [10] J. Sivic, A. Zisserman et al., “Video google: A text retrieval approach to object matching in videos.” in IEEE International Conference on Computer Vision, vol. 2, no. 1470, 2003, pp. 1470–1477.
  • [11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE/CVF Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
  • [12] V. Badrinarayanan, A. Handa, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling,” arXiv preprint arXiv:1505.07293, 2015.
  • [13] H. Noh, S. Hong, and B. Han, “Learning deconvolution network for semantic segmentation,” in IEEE/CVF Computer Vision and Pattern Recognition, 2015, pp. 1520–1528.
  • [14] J. Sherrah, “Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery,” arXiv preprint arXiv:1606.02585, 2016.
  • [15] N. Audebert, B. Le Saux, and S. Lefèvre, “Semantic segmentation of earth observation data using multimodal and multi-scale deep networks,” in Asian Conference on Computer Vision.   Springer, 2016, pp. 180–196.
  • [16] D. Marmanis, K. Schindler, J. D. Wegner, S. Galliani, M. Datcu, and U. Stilla, “Classification with an edge: improving semantic image segmentation with boundary detection,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 135, pp. 158–172, 2018.
  • [17] J. A. d. dos Santos, P.-H. Gosselin, S. Philipp-Foliguet, R. d. S. Torres, and A. X. Falcão, “Multiscale classification of remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 10, pp. 3764–3775, 2012.
  • [18] A. Marcu and M. Leordeanu, “Dual local-global contextual pathways for recognition in aerial imagery,” arXiv preprint arXiv:1605.05462, 2016.
  • [19] E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez, “High-resolution semantic labeling with convolutional neural networks,” arXiv preprint arXiv:1611.01962, 2016.
  • [20] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. van den Hengel, “Semantic labeling of aerial and satellite imagery,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 7, pp. 2868–2881, 2016.
  • [21] H. Wang, Y. Wang, Q. Zhang, S. Xiang, and C. Pan, “Gated convolutional neural network for semantic segmentation in high-resolution images,” Remote Sensing, vol. 9, no. 5, p. 446, 2017.
  • [22] F. Yu and V. Koltun, “Multi-scale context aggregation by dilated convolutions,” 2015.
  • [23] K. Nogueira, O. A. Penatti, and J. A. d. Santos, “Towards better exploiting convolutional neural networks for remote sensing scene classification,” Pattern Recognition, vol. 61, no. 1, pp. 539–556, 2017.
  • [24] I. Kokkinos, “Pushing the boundaries of boundary detection using deep learning,” arXiv preprint arXiv:1511.07386, 2015.
  • [25] J. Lafferty, A. McCallum, and F. C. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001.
  • [26] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten, “Densely connected convolutional networks,” vol. 1, no. 2, p. 3, 2017.
  • [27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE/CVF Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [28] M. Holschneider, R. Kronland-Martinet, J. Morlet, and P. Tchamitchian, “A real-time algorithm for signal analysis with the help of the wavelet transform,” in Wavelets.   Springer, 1990, pp. 286–297.
  • [29] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” arXiv preprint arXiv:1606.00915, 2016.
  • [30] K. Nogueira, M. Dalla Mura, J. Chanussot, W. R. Schwartz, and J. A. dos Santos, “Learning to semantically segment high-resolution remote sensing images,” in International Conference on Pattern Recognition, December 2016.
  • [31] W. Liao, X. Huang, F. Van Coillie, S. Gautama, A. Pižurica, W. Philips, H. Liu, T. Zhu, M. Shimoni, G. Moser et al., “Processing of multiresolution thermal hyperspectral and digital color data: Outcome of the 2014 ieee grss data fusion contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2984–2996, 2015.
  • [32] “International society for photogrammetry and remote sensing (isprs),” http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html, accessed: 2018-06-18.
  • [33] M. Gerke, “Use of the stair vision library within the isprs 2d semantic labeling benchmark (vaihingen),” in ITC, Univ. of Twente, Tech. Rep., 2015.
  • [34] “International society for photogrammetry and remote sensing (isprs),” http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html, accessed: 2018-06-18.
  • [35] T. M. Santana, K. Nogueira, A. M. Machado, and J. A. dos Santos, “Deep contextual description of superpixels for aerial urban scenes classification,” in IEEE International Geoscience and Remote Sensing Symposium, 2017, pp. 1–9.
  • [36] R. G. Congalton and K. Green, Assessing the accuracy of remotely sensed data: principles and practices.   CRC press, 2008.
  • [37] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available:
  • [38] Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade.   Springer, 2012, pp. 437–478.
  • [39] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. Van-Den Hengel, “Effective semantic pixel labelling with convolutional networks and conditional random fields,” in IEEE/CVF Computer Vision and Pattern Recognition Workshop.   IEEE, 2015, pp. 36–43.
  • [40] Y. Liu, S. Piramanayagam, S. T. Monteiro, and E. Saber, “Dense semantic labeling of very-high-resolution aerial imagery and lidar with fully convolutional neural networks and higher-order crfs,” in IEEE/CVF Computer Vision and Pattern Recognition Workshop, 2017.
  • [41] S. Piramanayagam, W. Schwartzkopf, F. Koehler, and E. Saber, “Classification of remote sensed images using random forests and deep learning framework,” in Image and Signal Processing for Remote Sensing XXII, vol. 10004.   International Society for Optics and Photonics, 2016, p. 100040L.
  • [42] J. Sherrah, “Fully convolutional networks for dense semantic labelling of high-resolution aerial imagery,” arXiv preprint arXiv:1606.02585, 2016.