Encoder-Decoder based CNN and Fully Connected CRFs for Remote Sensed Image Segmentation

10/14/2019 ∙ by Vikas Agaradahalli Gurumurthy, et al.

With the advancement of remote-sensed imaging, large volumes of very high resolution land cover images can now be obtained. Automation of object recognition in these 2D images, however, is still a key issue. High intra-class variance and low inter-class variance in Very High Resolution (VHR) images hamper the accuracy of prediction in object recognition tasks. Most of the recent successful techniques in computer vision tasks are based on deep supervised learning. In this work, a deep Convolutional Neural Network (CNN) based on a symmetric encoder-decoder architecture with skip connections is employed for the 2D semantic segmentation of the most common land cover object classes: impervious surface, buildings, low vegetation, trees and cars. Atrous convolutions are employed to obtain a large receptive field in the proposed CNN model. Further, the CNN outputs are post-processed using a Fully Connected Conditional Random Field (FCRF) model to refine the CNN pixel label predictions. The proposed CNN-FCRF model achieves an overall accuracy of 90.5% on the ISPRS Vaihingen Dataset.




I Introduction

An important intermediate step in automating the generation and updating of maps from raw images is semantic segmentation. Semantic segmentation is the task of assigning to each pixel in the input image the class label it most likely belongs to. Though numerous works in the past decades have contributed to improving image segmentation techniques, complete automation of map generation is yet to be achieved. In an urban scene, disparate objects with similar visual/spectral signatures and homogeneous objects with varied visual/spectral signatures pose a challenge to segmentation algorithms.

Since their advent, convolutional neural networks (CNNs) have become the benchmark in computer vision tasks. They have outperformed traditional methods in tasks such as regression, classification, object detection and semantic segmentation. AlexNet [5], a CNN based model for image classification, won the ImageNet challenge in 2012. [8] converted a CNN trained for classification into a fully convolutional network (FCN) that could be trained end-to-end, pixel-to-pixel, for semantic segmentation. Their FCN model achieved state-of-the-art performance on the PASCAL VOC, NYUDv2 and SIFT Flow datasets.

Several works have adopted the supervised learning approach based on CNNs for analysing remote sensed images. [6] compared an ensemble of 1D-CNNs with convolutions in the spectral domain and an ensemble of 2D-CNNs with convolutions in the spatial domain to obtain pixel-by-pixel predictions of class labels. Their work concluded that the ensemble of 2D-CNNs is superior. [3] used more efficient FCN based models, SharpMask [12] and RefineNet [7], to benchmark their multispectral dataset (RIT-18). These models improve on standard skip architectures to merge features from shallow layers. Skip connections are known to facilitate gradient propagation across long-range connections and to refine class boundaries in the segmentation output. [3] observed that pre-training the models with synthetic data prior to training with actual data increased performance. [10] used an extension of the architecture in [8]. They used two separate pathways with the same architecture for image and digital elevation model (DEM) data, and merged the spectral and height features shortly before the final layer that outputs the class probabilities. Predictions were averaged from several trained models of the same architecture with different initializations, and fully connected conditional random fields were employed for post-processing.

Given the high intra-class variance and low inter-class variance in VHR imagery, it is intuitive to use a large receptive field that incorporates features from a larger context rather than a small local context for segmentation. Increasing the number of convolution layers or the size of the convolutional filters are well known methods to increase the receptive field. [1] introduced dilated/atrous convolution for time efficient image segmentation. Atrous convolutions expand the receptive field by inserting gaps in the convolution filters while keeping the computational budget constant. [9] introduced the concept of effective receptive field and showed that atrous convolution increases the effective receptive field. In this work, atrous convolutions are adopted in the CNN model to obtain a large receptive field. The proposed atrous convolution based model has a symmetric encoder-decoder architecture with skip connections. Upsampling in the decoder is achieved using transpose convolutions with overlapping stride.
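The effect of dilation on the receptive field can be illustrated with a short calculation. The sketch below is not the architecture of this paper: the two four-layer stacks and the helper names are hypothetical, chosen only to show that a 3×3 filter with dilation rate 2 spans 5 pixels per axis while keeping the same 9 weights.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Spatial extent (along one axis) covered by a dilated filter."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def receptive_field(layers) -> int:
    """Theoretical receptive field of a stack of convolution layers.

    `layers` is a list of (kernel_size, stride, dilation) tuples,
    ordered from the input towards the output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride, dilation in layers:
        rf += (effective_kernel_size(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks of four stride-1 3x3 convolutions (illustration only).
standard_stack = [(3, 1, 1)] * 4
atrous_stack = [(3, 1, 2)] * 4

print(effective_kernel_size(3, 2))      # 5
print(receptive_field(standard_stack))  # 9
print(receptive_field(atrous_stack))    # 17
```

Both stacks use the same number of weights per filter; only the spacing between the filter taps differs.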

Fig. 1: Architecture of the symmetric encoder-decoder model with atrous convolutions.

Conditional Random Fields (CRFs) are popularly employed as a post-processing step to smooth noisy segmentation outputs. [11] used an edge-sensitive binary CRF to refine segmentation results from a CNN. [4] proposed an efficient approximate inference algorithm for Fully Connected Conditional Random Fields (FCRF). Their results demonstrate that accounting for long range dependencies with dense pixel-level connectivity significantly improves segmentation accuracy. In this work, the FCRF algorithm proposed by [4] is integrated on top of the CNN model. The atrous convolution based model with FCRF post-processing provides a competitive overall accuracy of 90.5% on the ISPRS 2D semantic labeling Vaihingen dataset.

The remainder of the paper is organized as follows. Section II describes the dataset. The models considered are detailed in Section III. The experimentation details and results are provided in Section IV. Conclusions are drawn in Section V.

II Dataset

The Vaihingen 2D dataset provided by the semantic labeling contest (http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html) of the ISPRS WG III/4 is utilized in this work for validation of the proposed methods. Vaihingen is a small village in Germany with many detached and small multi-storied buildings. The dataset consists of 33 image tiles extracted from a large orthomosaic image with a ground sampling distance of 9 cm. The images are provided as 8-bit TIFF files with three spectral bands: near infrared, red, green (IRRG). The label images, classified manually at pixel level, are provided as 8-bit TIFF files. Pixels are categorized into one of the following classes: impervious surface, building, low vegetation, tree, car, background. In addition to the images and label images, the normalized Digital Surface Model (nDSM) images provided by [2] are used for training the CNN models. The nDSM images are provided as 8-bit JPEG files.

III Models for Segmentation

III-A CNN model with atrous convolutions

The proposed CNN model with atrous convolution operations is depicted in Figure 1. It has an encoder-decoder architecture with skip connections. The description of the colored blocks is provided in Figure 1. The model includes 4 atrous convolution blocks, as depicted, with the rationale of increasing the receptive field. Each convolution operation is followed by batch-normalization and non-linear activation using the ReLU function. The terminating convolution is followed by softmax activation. Filters of size 3×3 are used for convolutions and atrous convolutions. A dilation rate of 2 is used for atrous convolutions. The terminating convolution operation uses 6 (number of classes) filters of size 1×1. Transpose convolutions are employed to undo the downsampling by maxpooling operations. They use filters of size 5×5 with stride 2 and are followed by batch-normalization and ReLU activation. Skip connections are included to merge feature maps from encoder convolutions with feature maps of corresponding dimensions output from transpose convolutions. This model is referred to as the atrous convolution model in the rest of the paper.

The segmentation results from the atrous convolution model are compared with those from a deep CNN model that lacks atrous convolutions, referred to as the standard convolution model. The standard convolution model also differs in its transpose convolutions: they use filters of size 2×2 with stride 2, and no batch-normalization or non-linear activation follows them. Except for these differences, the architecture of the standard convolution model is the same as that of the atrous convolution model.
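The difference between the two upsampling variants (5×5 filters with stride 2 in the atrous convolution model, 2×2 filters with stride 2 in the standard convolution model) can be made concrete with the usual transposed convolution size arithmetic. This is a minimal sketch: the padding and output-padding values are assumptions chosen so that both variants exactly double the feature map size, and the 64-pixel input is an arbitrary example; neither is stated in the paper.

```python
def transpose_conv_output_size(in_size, kernel, stride, padding=0, output_padding=0):
    """Output length along one axis of a transposed convolution (dilation 1)."""
    return (in_size - 1) * stride - 2 * padding + kernel + output_padding


def max_overlap(kernel, stride):
    """Maximum number of input positions contributing to one output position."""
    return -(-kernel // stride)  # ceiling division


# Assumed paddings (not from the paper) so that a 64-pixel map becomes 128 pixels.
large = transpose_conv_output_size(64, kernel=5, stride=2, padding=2, output_padding=1)
small = transpose_conv_output_size(64, kernel=2, stride=2)

print(large, small)       # 128 128
print(max_overlap(5, 2))  # 3: neighbouring filter footprints overlap
print(max_overlap(2, 2))  # 1: footprints tile the output without overlap
```

With a 2×2 filter and stride 2, every output pixel is produced from a single input pixel, whereas the 5×5 filter lets up to three input positions per axis contribute to each output pixel.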

III-B Fully connected conditional random field

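The standard convolution and atrous convolution models are integrated with the fully connected conditional random field (FCRF) model (https://github.com/lucasb-eyer/pydensecrf) proposed by [4] for post-processing. The model energy function is given by Equation 1.

$$E(\mathbf{x}) = \sum_{i} \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) \tag{1}$$

$\psi_u(x_i)$ is the unary potential, evaluated as the negative logarithm of the softmax probabilities ($\psi_u(x_i) = -\log P(x_i)$). The pairwise potential is $\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j)$, where $k^{(m)}$ are the Gaussian kernels, which depend on the feature vectors ($\mathbf{f}_i$, $\mathbf{f}_j$) for pixels i and j in an arbitrary feature space, $w^{(m)}$ are weight parameters and $\mu$ is the Potts compatibility function. The kernels, based on position ($p$) and color ($I$) terms, adopted in the model are defined in Equation 2.

$$\sum_{m} w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\alpha^2} - \frac{|I_i - I_j|^2}{2\theta_\beta^2}\right) + w^{(2)} \exp\left(-\frac{|p_i - p_j|^2}{2\theta_\gamma^2}\right) \tag{2}$$

$\theta_\alpha$, $\theta_\beta$, $\theta_\gamma$ are the standard deviations of the Gaussian kernels. The message passing step under a mean field approximation to the CRF distribution can be expressed as Gaussian filtering in feature space. Efficient high-dimensional filtering algorithms reduce the complexity of message passing, resulting in an approximate inference algorithm that is significantly faster. For further details, readers are referred to the original paper [4].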
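The potentials described above can be evaluated directly for a single pair of pixels. The sketch below assumes the two-kernel form of [4] (an appearance kernel on position and color, and a smoothness kernel on position alone). All function names and every numeric value are hypothetical; they are not the settings used in this work, and the code does not perform mean field inference.

```python
import math


def unary_potential(softmax_probs, eps=1e-12):
    """Negative log of the softmax probabilities of one pixel."""
    return [-math.log(max(p, eps)) for p in softmax_probs]


def pairwise_kernel(pos_i, pos_j, col_i, col_j,
                    w_appearance, w_smoothness,
                    theta_alpha, theta_beta, theta_gamma):
    """Weighted sum of the appearance and smoothness Gaussian kernels."""
    d_pos = sum((a - b) ** 2 for a, b in zip(pos_i, pos_j))
    d_col = sum((a - b) ** 2 for a, b in zip(col_i, col_j))
    appearance = math.exp(-d_pos / (2 * theta_alpha ** 2)
                          - d_col / (2 * theta_beta ** 2))
    smoothness = math.exp(-d_pos / (2 * theta_gamma ** 2))
    return w_appearance * appearance + w_smoothness * smoothness


def potts(label_i, label_j):
    """Potts compatibility: a penalty applies only when the labels differ."""
    return 1.0 if label_i != label_j else 0.0


# Hypothetical parameter values, for illustration only.
params = dict(w_appearance=4.0, w_smoothness=3.0,
              theta_alpha=60.0, theta_beta=10.0, theta_gamma=3.0)

similar = pairwise_kernel((10, 10), (11, 10), (120, 80, 60), (122, 81, 60), **params)
different = pairwise_kernel((10, 10), (11, 10), (120, 80, 60), (20, 200, 30), **params)

# Penalty for giving two adjacent pixels different labels.
print(potts(0, 1) * similar, potts(0, 1) * different)
```

Two adjacent pixels with similar colors incur a larger penalty for disagreeing labels than two adjacent pixels with very different colors, which is what pulls label boundaries towards image edges.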


IV Experimentation

The experiment is carried out with the standard convolution and atrous convolution models, along with their FCRF-integrated variants for post-processing. All models are trained using the training split of the ISPRS 2D semantic labeling contest Vaihingen dataset, and the test split is used to evaluate the trained models. The F1-score and overall accuracy, obtained from the accumulated confusion matrix, are used to evaluate performance.

IV-A Training

The image tiles in the dataset labeled 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, 37 were used for training the CNN models. Each image tile is a true orthophoto (TOP) extracted from a large true orthomosaic photo. Because of computational constraints, and to enable the use of a relatively large mini-batch size, the image tiles were cropped into patches and subjected to augmentation. This resulted in a dataset consisting of 16244 patches of size 128×128. The patches were split randomly into training and validation sets in the ratio 3:1 during training. The input to the CNN models was the IRRG image concatenated with the nDSM along the channel dimension. The softmax output from the model along with the one-hot encoded labels was used to calculate the weighted cross entropy loss (Equation 3).

$$L = -\sum_{(i,j)} \sum_{c} w_c \, y_{ij}^{(c)} \log p_{ij}^{(c)} \tag{3}$$

where $L$ is the weighted cross entropy loss, $w$ is the array of weights associated with each class, $y_{ij}$ is the one-hot encoded class label for the pixel at $(i, j)$ and $p_{ij}$ is the softmax probabilities obtained from the CNN model. Backpropagation and the Adam optimizer were used to update the parameters during training. A mini-batch size of 16 with a learning rate of 1e-4 was used. The weights ($w$) associated with the background, building, car, impervious surface, low vegetation and tree classes are [5, 1, 100, 1, 2, 1]. The choice of weights is driven by class imbalance and common misclassifications.
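The effect of the class weights can be seen in a small numerical sketch of the weighted cross entropy. The class order and weights follow the text; the function name, the averaging over pixels (which differs from a plain sum only by a constant factor) and the example probabilities are assumptions made for illustration.

```python
import math

# Class order from the text: background, building, car,
# impervious surface, low vegetation, tree.
CLASS_WEIGHTS = [5, 1, 100, 1, 2, 1]


def weighted_cross_entropy(probs, one_hot, weights=CLASS_WEIGHTS, eps=1e-12):
    """Weighted cross entropy, averaged over pixels (averaging is an assumption).

    probs   -- list of per-pixel softmax probability lists
    one_hot -- list of per-pixel one-hot label lists
    """
    total = 0.0
    for p, y in zip(probs, one_hot):
        total -= sum(w * y_c * math.log(max(p_c, eps))
                     for w, y_c, p_c in zip(weights, y, p))
    return total / len(probs)


# The same softmax output (hypothetical values) scored against two labels.
probs = [[0.1, 0.4, 0.4, 0.05, 0.03, 0.02]]
building = [[0, 1, 0, 0, 0, 0]]
car = [[0, 0, 1, 0, 0, 0]]

loss_building = weighted_cross_entropy(probs, building)
loss_car = weighted_cross_entropy(probs, car)
print(loss_building, loss_car)
```

For equally confident predictions, an error on a car pixel costs 100 times as much as an error on a building pixel, which counteracts the small number of car pixels in the training data.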

Fig. 2: Patchwise prediction scheme

IV-B Evaluation

The standard convolution (SC) and atrous convolution (AC) models, and their FCRF-integrated variants (SC-FCRF and AC-FCRF), are evaluated using the test split provided by the dataset. The image tiles labeled 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, 38 in the dataset form the testing set. During testing, the images are converted to patches of size 256×256 with 50% overlap between adjacent patches. Only the central 128×128 region of each predicted 256×256 output from the CNN model is retained. The patch-wise predictions are put together to obtain a segmentation map with size equal to the input size. An illustration of the patch-wise prediction scheme is presented in Fig. 2. The full segmentation maps from both CNN models are processed using the FCRF model discussed in Section III-B. Post-processing smooths the segmentation map by removing isolated noisy regions and improves class boundaries.
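The patch-wise prediction scheme (256×256 patches with 50% overlap, keeping only the central 128×128 region of each prediction) can be sketched as below. Border handling is not described in the text, so this sketch assumes the image is padded so that patches may start at negative coordinates. The function names and the `predict_core` callback, a stand-in for the CNN, are hypothetical.

```python
def patch_windows(height, width, patch=256, core=128):
    """Yield (patch_top, patch_left, core_top, core_left) in image coordinates.

    Patches advance by `core` pixels, i.e. 50% overlap when patch == 2 * core.
    Patch origins can be negative: the image is assumed to be padded.
    """
    margin = (patch - core) // 2
    for core_top in range(0, height, core):
        for core_left in range(0, width, core):
            yield core_top - margin, core_left - margin, core_top, core_left


def stitch(height, width, predict_core, patch=256, core=128):
    """Assemble a full-size label map from the central regions of patch predictions.

    predict_core(patch_top, patch_left) must return a core x core grid of labels
    for the central region of the patch whose top-left corner is given.
    """
    out = [[None] * width for _ in range(height)]
    for top, left, core_top, core_left in patch_windows(height, width, patch, core):
        labels = predict_core(top, left)
        for r in range(core_top, min(core_top + core, height)):
            for c in range(core_left, min(core_left + core, width)):
                out[r][c] = labels[r - core_top][c - core_left]
    return out


# Toy run with tiny sizes: a 5x7 "image", 4x4 patches, 2x2 central regions.
# The dummy predictor labels each pixel with the origin of its source patch.
toy = stitch(5, 7, lambda top, left: [[(top, left)] * 2 for _ in range(2)],
             patch=4, core=2)
print(toy[0][0], toy[4][6])  # (-1, -1) (3, 5)
```

Every output pixel is written exactly once, and always from the centre of a patch, so predictions made near patch borders, where less context is available, are discarded.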

The performance of the CNN models and their FCRF-integrated variants is measured using the F1-score and overall accuracy metrics. The definition of the metrics is provided in Equation 4. A confusion matrix is obtained for the segmentation map of each test image tile. The confusion matrix contains references along the row direction and predictions along the column direction. True Positive (TP) pixels are obtained from the principal diagonal elements, False Positive (FP) pixels are evaluated as the sum per column excluding the principal diagonal element, and False Negative (FN) pixels are evaluated as the sum per row excluding the principal diagonal element. The metrics are evaluated using the accumulated confusion matrix obtained by summing the confusion matrices for individual image tiles. The overall accuracy reported is the ratio of the trace to the sum of all elements of the accumulated confusion matrix. The evaluation is carried out using the labels with eroded boundaries provided in the dataset.

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \tag{4}$$
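The evaluation procedure described above (per-tile confusion matrices with references along rows and predictions along columns, summed before computing the metrics) can be sketched as follows. The two 2-class matrices are made-up toy values, not results from this work, and the function names are illustrative.

```python
def accumulate(confusions):
    """Element-wise sum of per-tile confusion matrices."""
    n = len(confusions[0])
    total = [[0] * n for _ in range(n)]
    for cm in confusions:
        for r in range(n):
            for c in range(n):
                total[r][c] += cm[r][c]
    return total


def f1_scores(cm):
    """Per-class F1 from a confusion matrix (rows: reference, columns: prediction)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[r][k] for r in range(n)) - tp  # column sum minus diagonal
        fn = sum(cm[k]) - tp                       # row sum minus diagonal
        denom = 2 * tp + fp + fn                   # F1 = 2TP / (2TP + FP + FN)
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def overall_accuracy(cm):
    """Trace of the confusion matrix divided by the sum of all its elements."""
    trace = sum(cm[k][k] for k in range(len(cm)))
    return trace / sum(sum(row) for row in cm)


# Toy per-tile confusion matrices (made-up numbers).
tile_a = [[8, 2], [1, 9]]
tile_b = [[2, 0], [1, 7]]

accumulated = accumulate([tile_a, tile_b])
print(accumulated)                    # [[10, 2], [2, 16]]
print(f1_scores(accumulated))
print(overall_accuracy(accumulated))
```

Accumulating the matrices before computing the metrics weights every pixel equally, so large tiles influence the scores more than small ones, unlike an average of per-tile scores.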


IV-C Results

Model     Building F1   Car F1   Imp. surf. F1   Low veg. F1   Tree F1   Overall accuracy
SC        94.2          83.5     90.7            81.6          88.2      88.9
AC        94.8          83.1     91.5            83.1          88.9      89.8
SC-FCRF   94.9          82.3     91.6            83.6          89.1      89.9
AC-FCRF   95.3          81.5     92.2            84.6          89.6      90.5
TABLE I: F1-scores and overall accuracies (in %)

Class-wise F1-scores and overall accuracies for the models considered are tabulated in Tab. I. The AC model improves on the SC model in overall accuracy (89.8% against 88.9%) and in the F1-score of every class except car. Post-processing the segmentation maps with FCRF, to remove isolated noisy regions and misclassifications and to refine segmentation boundaries, further increases the overall accuracy. The SC-FCRF and AC models provide similarly good results. The AC-FCRF model delivers the best results, with the highest F1-score for every class except car and the highest overall accuracy. The accumulated confusion matrix, normalized with respect to the reference, for the test image tiles obtained using the best performing model is presented in Tab. II.


Reference    Building   Car     Imp. surf.   Low veg.   Tree
Building     95.56      0.02    3.26         0.84       0.31
Car          7.44       73.14   18.57        0.43       0.29
Imp. surf.   2.37       0.14    93.76        2.92       0.81
Low veg.     1.71       0       4.61         83.89      9.79
Tree         0.33       0       1.17         8.84       89.65
TABLE II: Normalized accumulated confusion matrix (in %; rows are reference classes, columns are predicted classes) for the test set using AC-FCRF model

Visualization of segmentation maps from the AC and AC-FCRF models is provided in Fig. 3. The AC-FCRF model improves overall accuracy by 0.7 percentage points over the AC model (from 89.8% to 90.5%). Regions in the segmentation maps highlighted with colored circles show the improvement in prediction due to post-processing with the FCRF model. In the segmentation map of the first image, the boundaries of buildings are refined and some misclassified pixels are corrected. In the segmentation map of the second image, isolated noisy regions within segmented buildings are removed. It can be seen from Fig. 3 and Tab. I that the FCRF model improves the results both visually and quantitatively.

Employing a large receptive field has had a positive impact on performance in [13] and [14]. Here too, the AC model achieves higher performance than the SC model; its key elements are atrous/dilated convolutions and transpose convolutions with a large filter and overlapping stride. Increasing the receptive field is thus shown empirically to increase prediction accuracy on a dataset with high intra-class variance and low inter-class variance. Further, post-processing the segmentation maps from the AC model has provided an overall accuracy of 90.5%. The results obtained using the AC-FCRF model on the ISPRS 2D Vaihingen dataset are competitive.








Fig. 3: Comparison of segmentation output from atrous convolution model (AC) and its FCRF integrated variant (AC-FCRF).

V Conclusion

In this paper, a deep CNN model is proposed for semantic segmentation of remote sensed images. The proposed CNN model has a symmetric encoder-decoder architecture with skip connections and is integrated with an FCRF model for post-processing. Atrous convolutions were employed to obtain a large receptive field. Also, transpose convolutions with a large filter and overlapping stride were used for upsampling. The FCRF model adopted for post-processing accounts for long range dependencies with dense pixel connectivity and refines the CNN segmentation outputs. Experimental results on the ISPRS Vaihingen dataset are promising. A competitive overall accuracy of 90.5% was obtained using the proposed model.


  • [1] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2014) Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR abs/1412.7062.
  • [2] M. Gerke, T. Speldekamp, C. Fries, and C. Gevaert (2015-04) Automatic semantic labelling of urban areas using a rule-based approach and realized with MeVisLab.
  • [3] R. Kemker, C. Salvaggio, and C. Kanan (2018) Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS Journal of Photogrammetry and Remote Sensing. ISSN 0924-2716.
  • [4] P. Krähenbühl and V. Koltun (2012) Efficient inference in fully connected CRFs with Gaussian edge potentials. CoRR abs/1210.5644.
  • [5] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.), pp. 1097–1105.
  • [6] N. Kussul, M. Lavreniuk, S. Skakun, and A. Shelestov (2017-05) Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters 14 (5), pp. 778–782. ISSN 1545-598X.
  • [7] G. Lin, A. Milan, C. Shen, and I. Reid (2017-07) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. pp. 5168–5177.
  • [8] J. Long, E. Shelhamer, and T. Darrell (2014) Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038.
  • [9] W. Luo, Y. Li, R. Urtasun, and R. S. Zemel (2017) Understanding the effective receptive field in deep convolutional neural networks. CoRR abs/1701.04128.
  • [10] D. Marmanis, J. D. Wegner, S. Galliani, K. Schindler, M. Datcu, and U. Stilla (2016-06) Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences III-3, pp. 473–480.
  • [11] S. Paisitkriangkrai, J. Sherrah, P. Janney, and A. Van-Den Hengel (2015-06) Effective semantic pixel labelling with convolutional networks and conditional random fields. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 36–43. ISSN 2160-7516.
  • [12] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. In Computer Vision – ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), Cham, pp. 75–91. ISBN 978-3-319-46448-0.
  • [13] G. Seif and D. Androutsos (2018) Large receptive field networks for high-scale image super-resolution. CoRR abs/1804.08181.
  • [14] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. CoRR abs/1511.07122.