An important intermediate step in the automation of map generation and updating from raw images is semantic segmentation. Semantic segmentation is the task of assigning each pixel in an input image the class label it is most likely to belong to. Though numerous works in the past decades have contributed to improving image segmentation techniques, complete automation of map generation is yet to be achieved. In an urban scene, disparate objects with similar visual/spectral signatures and homogeneous objects with varied visual/spectral signatures pose a challenge to segmentation algorithms.
Since their advent, convolutional neural networks (CNNs) have become the benchmark in computer vision tasks, outperforming traditional methods in regression, classification, object detection and semantic segmentation. AlexNet, a CNN-based model for image classification, won the ImageNet challenge in 2012. Long et al. converted a CNN trained for classification into a fully convolutional network (FCN) that could be trained end-to-end, pixel-to-pixel, for semantic segmentation; their FCN model achieved state-of-the-art performance on the PASCAL VOC, NYUDv2 and SIFT Flow datasets. Several works have adopted the supervised learning approach based on CNNs for analysing remotely sensed images. Kussul et al. compared an ensemble of 1D-CNNs with convolutions in the spectral domain against an ensemble of 2D-CNNs with convolutions in the spatial domain to obtain pixel-by-pixel class label predictions, and concluded that the ensemble of 2D-CNNs is superior.
Kemker et al. used more efficient FCN-based models, SharpMask and RefineNet, to benchmark their multispectral dataset (RIT-18). These models improve on standard skip architectures to merge features from shallow layers; skip connections are known to facilitate gradient propagation across long-range connections and to refine class boundaries in the segmentation output. They observed that pre-training the models with synthetic data prior to training with actual data increased performance. Another work used an extension of the FCN architecture: two pathways with the same design processed the image and digital elevation model (DEM) data separately, and the spectral and height features were merged shortly before the final layer that outputs the class probabilities. Predictions were averaged over several trained models of the same architecture with different initializations, and fully connected conditional random fields were employed for post-processing.
Given the high intra-class variance and low inter-class variance in VHR imagery, it is intuitive to use a large receptive field that incorporates features from a larger context rather than a small local one. Increasing the number of convolution layers or the size of the convolutional filters are well-known ways to enlarge the receptive field. Dilated/atrous convolutions were introduced for time-efficient image segmentation: they expand the receptive field by inserting gaps in the convolution filters while keeping the computational budget constant. Luo et al. introduced the concept of the effective receptive field and showed that atrous convolutions increase it. In this work, atrous convolutions are adopted in the CNN model to obtain a large receptive field. The proposed atrous convolution based model has a symmetric encoder-decoder architecture with skip connections. Upsampling in the decoder is achieved using transpose convolutions with overlapping stride.
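The growth of the (theoretical) receptive field under dilation can be sketched with the standard layer-by-layer recurrence; the function below is illustrative and not part of the paper:

```python
# Sketch (not from the paper): receptive-field growth for a stack of
# (possibly dilated) convolutions, using the standard recurrence
#   r_l = r_{l-1} + (k_l - 1) * d_l * j_{l-1},   j_l = j_{l-1} * s_l
# where k = kernel size, d = dilation rate, s = stride, j = cumulative stride.

def receptive_field(layers):
    """layers: iterable of (kernel_size, dilation, stride) tuples."""
    r, j = 1, 1  # receptive field and cumulative stride at the input
    for k, d, s in layers:
        r += (k - 1) * d * j
        j *= s
    return r

# Three standard 3x3 convolutions (dilation 1, stride 1):
print(receptive_field([(3, 1, 1)] * 3))  # -> 7
# Three 3x3 atrous convolutions with dilation rate 2, at the same
# per-layer computational cost:
print(receptive_field([(3, 2, 1)] * 3))  # -> 13
```

The dilated stack covers nearly twice the context for the same number of weights, which is the rationale for adopting atrous convolutions here.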
Conditional Random Fields (CRFs) are popularly employed as a post-processing step to smooth noisy segmentation outputs. One approach used an edge-sensitive binary CRF to refine segmentation results from a CNN. Krähenbühl and Koltun proposed an efficient approximate inference algorithm for fully connected conditional random fields (FCRFs). Their results demonstrate that accounting for long-range dependencies with dense pixel-level connectivity significantly improves segmentation accuracy. In this work, that FCRF algorithm is integrated on top of the CNN model. The atrous convolution based model with FCRF post-processing provides a competitive overall accuracy of 90.5% on the ISPRS 2D semantic labeling Vaihingen dataset.
The Vaihingen 2D dataset provided by the semantic labeling contest (http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html) of the ISPRS WG III/4 is utilized in this work for validation of the proposed methods. Vaihingen is a small village in Germany with many detached and small multi-storied buildings. The dataset consists of 33 image tiles extracted from a large orthomosaic image with a ground sampling distance of 9 cm. The images are provided as 8-bit TIFF files with three spectral bands: near infrared, red, green (IRRG). The label images, classified manually at the pixel level, are provided as 8-bit TIFF files. Pixels are categorized into one of the following classes: impervious surface, building, low vegetation, tree, car, background. In addition to the images and label images, the normalized Digital Surface Model (nDSM) images provided with the benchmark are used for training the CNN models. The nDSM images are provided as 8-bit JPEG files.
III Models for Segmentation
III-A CNN model with atrous convolutions
The proposed CNN model with atrous convolution operations is depicted in Figure 1. It has an encoder-decoder architecture with skip connections; the colored blocks are described in Figure 1. The model includes 4 atrous convolution blocks, with the rationale of increasing the receptive field. Each convolution operation is followed by batch normalization and non-linear activation using the ReLU function; the terminating convolution is followed by softmax activation. Filters of size 3×3 are used for the convolutions and atrous convolutions, with a dilation rate of 2 for the atrous convolutions. The terminating convolution operation uses 6 (the number of classes) filters of size 1×1. Transpose convolutions are employed to undo the downsampling by the max-pooling operations; they use filters of size 5×5 with stride 2 and are followed by batch normalization and ReLU activation. Skip connections merge feature maps from encoder convolutions with feature maps of corresponding dimensions output by the transpose convolutions. This model is referred to as the atrous convolution model in the rest of the paper.
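Because the transpose convolutions use a 5×5 filter with stride 2, contributions from neighbouring input positions overlap in the upsampled output. A minimal 1-D sketch of transpose convolution (illustrative only, not the model code) makes the overlap visible:

```python
# Illustrative 1-D transpose convolution: each input value scatters a
# copy of the filter into the output at offset i * stride.
def transpose_conv1d(x, w, stride):
    out = [0.0] * ((len(x) - 1) * stride + len(w))
    for i, v in enumerate(x):
        for j, wj in enumerate(w):
            out[i * stride + j] += v * wj
    return out

# With filter length 5 > stride 2, adjacent inputs write to overlapping
# output positions (the interior values accumulate multiple contributions):
print(transpose_conv1d([1, 1, 1], [1, 1, 1, 1, 1], 2))
# -> [1.0, 1.0, 2.0, 2.0, 3.0, 2.0, 2.0, 1.0, 1.0]
```

The overlap means every output pixel receives contributions from more than one input position, which helps smooth the blockiness that non-overlapping upsampling can introduce.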
The segmentation results from the atrous convolution model are compared with those from a deep CNN model that lacks atrous convolutions, referred to as the standard convolution model. In the standard convolution model, filters of size 2×2 with stride 2 are used for the transpose convolutions, and no batch normalization or non-linear activation follows them. Except for these differences, the architecture of the standard convolution model is the same as that of the atrous convolution model.
III-B Fully connected conditional random field
The standard convolution and atrous convolution models are integrated with the fully connected conditional random field (FCRF) model (https://github.com/lucasb-eyer/pydensecrf) proposed by Krähenbühl and Koltun for post-processing. The model energy function is given by Equation 1:

E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j) \quad (1)

\psi_u(x_i) is the unary potential, evaluated as the negative logarithm of the softmax probabilities (\psi_u(x_i) = -\log P(x_i)). The pairwise potential is \psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_m w^{(m)} k^{(m)}(\mathbf{f}_i, \mathbf{f}_j), where the k^{(m)} are Gaussian kernels which depend on the feature vectors (\mathbf{f}_i, \mathbf{f}_j) for pixels i and j in an arbitrary feature space, the w^{(m)} are weight parameters and \mu is the Potts compatibility function. The kernels adopted in the model, based on position (\mathbf{p}) and color (\mathbf{I}) terms, are defined in Equation 2:

k(\mathbf{f}_i, \mathbf{f}_j) = w^{(1)} \exp\left(-\frac{|\mathbf{p}_i - \mathbf{p}_j|^2}{2\theta_\alpha^2} - \frac{|\mathbf{I}_i - \mathbf{I}_j|^2}{2\theta_\beta^2}\right) + w^{(2)} \exp\left(-\frac{|\mathbf{p}_i - \mathbf{p}_j|^2}{2\theta_\gamma^2}\right) \quad (2)

\theta_\alpha, \theta_\beta and \theta_\gamma are the standard deviations of the Gaussian kernels. The message passing step under a mean field approximation to the CRF distribution can be expressed as Gaussian filtering in feature space. Efficient high-dimensional filtering algorithms reduce the complexity of message passing, resulting in an approximate inference algorithm that is significantly faster. For further details, readers are referred to the original paper.
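As a rough illustration of the mean-field scheme, the toy below runs brute-force message passing on a handful of pixels with a single position-only Gaussian kernel and a Potts compatibility. It is an assumed simplification for exposition, not the efficient filtering-based inference of the original paper:

```python
import math

def mean_field_fcrf(unary, pos, theta=1.0, w=1.0, iters=5):
    """Mean-field inference for a tiny fully connected CRF.
    unary: per-pixel softmax distributions; pos: per-pixel (x, y) coords."""
    n, labels = len(unary), len(unary[0])
    # Gaussian position kernel, computed by brute force (O(n^2); the
    # paper's FCRF replaces this with efficient high-dimensional filtering).
    k = [[math.exp(-((pos[i][0] - pos[j][0]) ** 2 +
                     (pos[i][1] - pos[j][1]) ** 2) / (2 * theta ** 2))
          for j in range(n)] for i in range(n)]
    Q = [row[:] for row in unary]  # initialise beliefs with the unaries
    for _ in range(iters):
        new_Q = []
        for i in range(n):
            # Message passing: kernel-weighted sum of other pixels' beliefs.
            msg = [sum(k[i][j] * Q[j][l] for j in range(n) if j != i)
                   for l in range(labels)]
            # Potts compatibility: penalise belief mass on *other* labels.
            scores = [math.log(unary[i][l] + 1e-9) - w * (sum(msg) - msg[l])
                      for l in range(labels)]
            z = sum(math.exp(s) for s in scores)
            new_Q.append([math.exp(s) / z for s in scores])
        Q = new_Q
    return Q

# A weakly mislabeled middle pixel is flipped to agree with its
# confident neighbours:
unary = [[0.9, 0.1], [0.45, 0.55], [0.9, 0.1]]
Q = mean_field_fcrf(unary, pos=[(0, 0), (1, 0), (2, 0)])
print([max(range(2), key=q.__getitem__) for q in Q])  # -> [0, 0, 0]
```

This is the qualitative effect exploited for post-processing: isolated noisy predictions are pulled toward the labels of nearby (and similarly colored, when the bilateral kernel is included) pixels.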
Experiments are carried out with the standard convolution and atrous convolution models, along with their FCRF-integrated variants for post-processing. All models are trained using the training split of the ISPRS 2D semantic labeling contest Vaihingen dataset, and the test split of the dataset is used for evaluation. The F1-score and overall accuracy metrics, obtained from the accumulated confusion matrix, are used to evaluate performance.
The image tiles in the dataset labeled 1, 3, 5, 7, 11, 13, 15, 17, 21, 23, 26, 28, 30, 32, 34, 37 were used for training the CNN models. Each image tile is a true orthophoto (TOP) extracted from a large true orthomosaic photo. The image tiles were cropped into patches and subjected to augmentation to work around a computational bottleneck and to enable a relatively large mini-batch size. This resulted in a dataset of 16244 patches of size 128×128. The patches were randomly split into training and validation sets in the ratio 3:1. The input to the CNN models was the IRRG image concatenated with the nDSM along the channel dimension. The softmax output from the model, along with the one-hot encoded labels, was used to calculate the weighted cross entropy loss (Equation 3):

\mathcal{L} = -\sum_{i,j} \sum_{c} w_c \, y_{ijc} \log(p_{ijc}) \quad (3)

where \mathcal{L} is the weighted cross entropy loss, w is the array of weights associated with each class, y_{ijc} is the one-hot encoded class label for the pixel at (i, j) and p_{ijc} is the softmax probability obtained from the CNN model. Backpropagation with the Adam optimizer was used to update the parameters during training. A mini-batch size of 16 and a learning rate of 1e-4 were used. The weights (w) associated with the background, building, car, impervious surface, low vegetation and tree classes are [5, 1, 100, 1, 2, 1]. The choice of weights is driven by class imbalance and common misclassifications.
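The loss and the effect of the class weights can be sketched as follows; the function is illustrative (a per-pixel sum matching Equation 3, averaged over pixels), not the training code:

```python
import math

# Class weights from the paper, ordered as
# [background, building, car, impervious surface, low vegetation, tree].
W = [5, 1, 100, 1, 2, 1]

def weighted_cross_entropy(softmax, onehot, weights=W):
    """Mean weighted cross entropy over a batch of pixels.
    softmax: per-pixel predicted class probabilities,
    onehot:  per-pixel one-hot reference labels."""
    total = 0.0
    for p, y in zip(softmax, onehot):
        total -= sum(w * t * math.log(q + 1e-9)
                     for w, t, q in zip(weights, y, p))
    return total / len(softmax)

# The same low probability (0.1) on the true class costs far more for a
# car pixel (weight 100) than for a building pixel (weight 1):
car = weighted_cross_entropy([[0.0, 0.0, 0.1, 0.3, 0.3, 0.3]],
                             [[0, 0, 1, 0, 0, 0]])
building = weighted_cross_entropy([[0.0, 0.1, 0.0, 0.3, 0.3, 0.3]],
                                  [[0, 1, 0, 0, 0, 0]])
print(car > building)  # -> True
```

This weighting forces the optimizer to pay attention to the rare car class, which would otherwise contribute almost nothing to the loss.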
The standard convolution (SC) and atrous convolution (AC) models, and their FCRF-integrated variants (SC-FCRF and AC-FCRF), are evaluated using the test split provided by the dataset. The image tiles labeled 2, 4, 6, 8, 10, 12, 14, 16, 20, 22, 24, 27, 29, 31, 33, 35, 38 form the testing set. During testing, the images are converted into patches of size 256×256 with 50% overlap between adjacent patches. Only the central 128×128 region of the predicted 256×256 output from the CNN model is retained, and the patch-wise predictions are put together to obtain a segmentation map with size equal to the input size. The patch-wise prediction scheme is illustrated in Fig. 2. The full segmentation maps from both CNN models are processed using the FCRF model discussed in Section III-B. Post-processing smooths the segmentation map by removing isolated noisy regions and improves class boundaries.
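The 50% overlap plus central-crop stitching can be sketched along one axis. This is an illustrative reconstruction; the handling of image borders (e.g. padding or mirroring so border pixels also get full context) is assumed rather than specified by the paper:

```python
def patch_layout(length, patch=256, core=128):
    """Patch start offsets along one axis at 50% overlap (stride = patch // 2),
    with the interval of each patch whose central `core` pixels are kept."""
    stride = patch // 2
    margin = (patch - core) // 2
    layout = []
    start = 0
    while start + patch <= length:
        layout.append((start, start + margin, start + margin + core))
        start += stride
    return layout

# For a 512-pixel axis, the kept central regions tile seamlessly:
for start, lo, hi in patch_layout(512):
    print(start, (lo, hi))
# 0   (64, 192)
# 128 (192, 320)
# 256 (320, 448)
```

Each retained 128-pixel core abuts the next exactly, so every interior pixel is predicted with full surrounding context and no seams.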
The performance of the CNN models and their FCRF-integrated variants is measured using the F1-score and overall accuracy metrics, defined in Equation 4:

\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \mathrm{OA} = \frac{\sum_k TP_k}{\text{total number of pixels}} \quad (4)

A confusion matrix is obtained for the segmentation map of each test image tile, with references along the row direction and predictions along the column direction. True Positive (TP) pixels are obtained from the principal diagonal elements, False Positive (FP) pixels are evaluated as the per-column sum excluding the principal diagonal element, and False Negative (FN) pixels as the per-row sum excluding the principal diagonal element. The metrics are evaluated using the accumulated confusion matrix obtained by summing the confusion matrices of the individual image tiles; the reported overall accuracy is the ratio of the trace to the sum of all elements of the accumulated confusion matrix. The evaluation is carried out using the labels with eroded boundaries provided in the dataset.
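The metric computation from an accumulated confusion matrix can be sketched as follows (illustrative, using the row = reference, column = prediction convention above):

```python
def metrics(cm):
    """Per-class F1 and overall accuracy from a confusion matrix whose
    rows are reference classes and columns are predicted classes."""
    n = len(cm)
    tp = [cm[k][k] for k in range(n)]                                  # diagonal
    fp = [sum(cm[r][k] for r in range(n)) - tp[k] for k in range(n)]   # column sums - diag
    fn = [sum(cm[k]) - tp[k] for k in range(n)]                        # row sums - diag
    f1 = [2 * tp[k] / (2 * tp[k] + fp[k] + fn[k]) for k in range(n)]
    oa = sum(tp) / sum(sum(row) for row in cm)                         # trace / total
    return f1, oa

# Toy 2-class accumulated confusion matrix:
f1, oa = metrics([[90, 10], [20, 80]])
print(round(oa, 2))  # -> 0.85
```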
[Table I: class-wise F1-scores (Building, Car, Imp. surf., Low veg., Tree) and overall accuracy for each model]
Class-wise F1-scores and overall accuracies for the models considered are tabulated in Tab. I. The performance metrics clearly indicate that the AC model outperforms the SC model. Post-processing the segmentation maps with FCRF, which removes isolated noisy regions and misclassifications and refines segmentation boundaries, further increases the overall accuracy. The SC-FCRF and AC models provide similarly good results. The AC-FCRF model delivers the best results, with the highest F1-score for most classes and the highest overall accuracy. The accumulated confusion matrix for the test image tiles, normalized with respect to the reference and obtained using the best performing model, is presented in Tab. II.
Visualization of segmentation maps from the AC and AC-FCRF models is provided in Fig. 3. The AC-FCRF model provides an increase in overall accuracy of 0.7% over the AC model. Regions in the segmentation maps highlighted with colored circles show the improvement in prediction caused by post-processing with the FCRF model. The boundary of buildings is refined and some misclassified pixels have been corrected in the segmentation map of the first image; isolated noisy regions within segmented buildings are removed in the segmentation map of the second image. It can be seen from Fig. 3 and Tab. I that the FCRF model improves the results both visually and quantitatively.
Employing a large receptive field has had a positive impact on performance in prior work. The AC model achieves higher performance than the SC model; the key elements of the AC model are the atrous/dilated convolutions and the transpose convolutions with a large filter and overlapping stride. Increasing the receptive field has empirically been shown to increase prediction accuracy on a dataset with high intra-class variance and low inter-class variance. Further, post-processing the segmentation maps from the AC model yields an overall accuracy of 90.5%. The results obtained using the AC-FCRF model on the ISPRS 2D Vaihingen dataset are competitive.
In this paper, a deep CNN model is proposed for semantic segmentation of remotely sensed images. The proposed CNN model has a symmetric encoder-decoder architecture with skip connections and is integrated with an FCRF model for post-processing. Atrous convolutions were employed to obtain a large receptive field, and transpose convolutions with a large filter and overlapping stride were used for upsampling. The FCRF model adopted for post-processing accounts for long-range dependencies with dense pixel connectivity and refines the CNN segmentation outputs. Experimental results on the ISPRS Vaihingen dataset are promising: a competitive overall accuracy of 90.5% was obtained using the proposed model.
- Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K. and Yuille, A. L. (2014). Semantic image segmentation with deep convolutional nets and fully connected CRFs. CoRR abs/1412.7062.
- (2015). Automatic semantic labelling of urban areas using a rule-based approach and realized with MeVisLab.
- Kemker, R., Salvaggio, C. and Kanan, C. Algorithms for semantic segmentation of multispectral remote sensing imagery using deep learning. ISPRS Journal of Photogrammetry and Remote Sensing.
- Krähenbühl, P. and Koltun, V. (2012). Efficient inference in fully connected CRFs with Gaussian edge potentials. CoRR abs/1210.5644.
- Krizhevsky, A., Sutskever, I. and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- Kussul, N., Lavreniuk, M., Skakun, S. and Shelestov, A. (2017). Deep learning classification of land cover and crop types using remote sensing data. IEEE Geoscience and Remote Sensing Letters 14(5), pp. 778–782.
- Lin, G., Milan, A., Shen, C. and Reid, I. (2017). RefineNet: multi-path refinement networks for high-resolution semantic segmentation. pp. 5168–5177.
- Long, J., Shelhamer, E. and Darrell, T. (2014). Fully convolutional networks for semantic segmentation. CoRR abs/1411.4038.
- Luo, W., Li, Y., Urtasun, R. and Zemel, R. (2017). Understanding the effective receptive field in deep convolutional neural networks. CoRR abs/1701.04128.
- Marmanis, D., Wegner, J. D., Galliani, S., Schindler, K., Datcu, M. and Stilla, U. (2016). Semantic segmentation of aerial images with an ensemble of CNNs. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences III-3, pp. 473–480.
- Paisitkriangkrai, S., Sherrah, J., Janney, P. and van den Hengel, A. (2015). Effective semantic pixel labelling with convolutional networks and conditional random fields. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 36–43.
- Pinheiro, P. O., Lin, T.-Y., Collobert, R. and Dollár, P. (2016). Learning to refine object segments. In Computer Vision – ECCV 2016, Cham, pp. 75–91.
- Large receptive field networks for high-scale image super-resolution. CoRR abs/1804.08181.
- Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. CoRR abs/1511.07122.