Developments in the systematic collection and organization of remote sensing imagery have resulted in several high-resolution aerial imagery datasets. Information extracted from aerial imagery plays a key role in urban planning, disaster prevention, and change detection, and building detection is a crucial component of these applications. Depending on the geographical region and conditions, building structures have different shapes and sizes. This challenge is specifically addressed by Maggiori et al. [1], who created a dataset of labeled aerial imagery from different locations, such that a model trained on a variety of sources generalizes to the segmentation task. Semantic segmentation in aerial imagery is challenging due to variable lighting conditions, diverse shapes and sizes, and large intra-class variations. In this research, we address the problem of improving building segmentation by utilizing attentive multi-scale pathways. Each of the paths exploits non-local neighborhoods to account for buildings of varying sizes. This allows our network to learn long-range dependencies at various scales with minimal computation costs. In addition to the attentive multi-scale pathways, we also incorporate a channel-wise attention module to model interdependencies across channels. In summary, our contributions are as follows.
We introduce a self-attention-based contextual pyramid attention (CPA) module that accounts for various building sizes to segment buildings in aerial images. The proposed module outperforms current state-of-the-art methods by about 2 IoU points and the FCN baseline by 12.6 points.
Through experiments, we also show that our base model offers competitive performance to current state-of-the-art methods, while having much lower inference costs. We also provide ablation studies on the impact of our proposed module and other comparisons.
2 Related Work
Convolutional Neural Networks (CNNs) have improved the performance of semantic segmentation significantly in recent years. Encoder-decoder architectures are a popular choice for segmentation. Using fully convolutional architectures for segmentation was first proposed in [2]. Architectures such as FCN, U-Net, DenseNet, etc., are often applied to obtain higher-quality segmentations [2, 3, 4]. Nowadays, most segmentation architectures are designed around a pre-trained backbone such as VGGNet or ResNet, combined with a complex decoder such as Pyramid Scene Parsing (PSP) or Atrous Spatial Pyramid Pooling (ASPP) [5, 6, 7, 8, 9, 10].
Due to the unique nature of remote sensing imagery, several custom architectures and loss functions have been developed for aerial image building segmentation [11, 12]. In [11], a multi-task learning approach is introduced, where a distance-transform loss function is applied in conjunction with the cross-entropy loss. Similar to our work, [13] proposes to combine local and global features to improve semantic segmentation of buildings in aerial imagery. However, the combination of two VGGNets makes the overall inference computationally expensive. Our proposed approach circumvents this by splitting only the last convolutional block to operate at multiple scales. In [14], a joint multi-stage multi-task approach is used, where the first stage trains a segmentation network and the second stage trains for geo-localization using a multi-task loss function. Apart from these, post-processing techniques such as Conditional Random Fields (CRFs) and test-time augmentations are applied to improve segmentation performance. In [15], a recurrent network is applied within a fully convolutional network, which exploits a decoder fusing features from the encoder layers in a fashion similar to feature pyramid networks (FPN) [16]. An FPN utilizes a pyramidal hierarchy of features extracted from an encoder that is later combined with the decoder via lateral connections. However, as FPNs utilize only a few of the earlier layers of the encoder for the lateral connections, the features may not be as rich as those extracted from deeper layers.
A common practice to retain feature resolution is the use of dilated convolutions. Due to the larger feature resolution and the repeated object patterns in aerial imagery, capturing long-range relations is beneficial. However, a feature map at a single scale may not be able to capture objects of all sizes. For example, spatial relationships across large buildings are easier to capture when feature maps have a low resolution, whereas for smaller buildings, a larger feature resolution is better suited to capture fine-grained details. Using a decoder such as an FPN is beneficial for obtaining high-resolution features; however, as these features are not obtained from deeper layers, they lack rich semantics. The approach proposed in this research is suited to these properties, which are typically exhibited in aerial imagery. In particular, we address the problem of improving semantic segmentation of buildings at various sizes.
The architecture is composed of an encoder and decoder, similar to other segmentation networks. An overview of the proposed method is shown in Fig. 2. Each component of our model is described in the following subsections.
Encoder. We consider three backbones for training: ResNet18, ResNet101, and Squeeze-and-Excitation ResNeXt101 (SE-ResNeXt101). ResNet18 serves as the lightweight architecture suited for fast inference, whereas SE-ResNeXt101 offers higher performance at additional computation cost. SE-ResNeXt101 incorporates a Squeeze-and-Excitation module along with aggregated residual connections.
Decoder. The decoder is a Feature Pyramid Network (FPN), similar to the Semantic FPN in [17]. The FPN combines features at different spatial resolutions through lateral connections to a top-down decoder. The lateral connections are established by combining the bottom-up outputs with each corresponding level of the top-down decoder. Each of the combined top-down decoder features is upsampled to a common fraction of the input resolution and combined via summation to produce rich features, which are then transformed into a segmentation mask.
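To make the decoder concrete, the following is a minimal PyTorch sketch of a Semantic-FPN-style decoder. The class name, channel widths, and interpolation modes are illustrative assumptions, not the exact configuration of our model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticFPNDecoder(nn.Module):
    """Minimal FPN-style decoder sketch: lateral 1x1 convs, a top-down
    pathway with summation, and fusion of all levels into one mask."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), fpn_dim=256,
                 num_classes=2):
        super().__init__()
        # Lateral 1x1 convolutions project bottom-up features to a
        # common channel width (channel counts here match ResNet stages).
        self.laterals = nn.ModuleList(
            [nn.Conv2d(c, fpn_dim, kernel_size=1) for c in in_channels])
        self.classifier = nn.Conv2d(fpn_dim, num_classes, kernel_size=1)

    def forward(self, feats):
        # feats: bottom-up features, highest resolution first.
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample the coarser level and add it to
        # the next lateral connection.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        # Merge all pyramid levels at the finest resolution via summation.
        target = laterals[0].shape[-2:]
        fused = sum(F.interpolate(l, size=target, mode="bilinear",
                                  align_corners=False) for l in laterals)
        return self.classifier(fused)
```

The summation-based fusion keeps the decoder cheap, which is consistent with the low inference costs reported later.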
3.2 Contextual Pyramid Attention block
The proposed Contextual Pyramid Attention block is composed of two parts: a contextual attention module that captures long-range spatial dependencies, and a channel-wise attention module. Both proposed attention modules are based on self-attention [18, 19].
Contextual attention. Buildings in aerial imagery appear in various sizes, and structural features are often redundant within a given region (e.g., repeated house structures in a neighborhood). This non-local information is not immediately processed by convolutions, as they operate in a limited region defined by the kernel size. Self-attention, on the other hand, can capture long-range dependencies, modeling the relation between repeated structures. It is standard practice in semantic segmentation to use dilated convolutions in the deeper layers to retain a larger feature resolution. This is beneficial for smaller buildings, and furthermore, self-attention blocks can capture fine-grained relations across the image. However, capturing the contextual information of buildings of different sizes is challenging for two reasons. First, when dilated convolutions retain the feature resolution, self-attention may model spatial relations that are too fine-grained for larger buildings, which may not provide sufficient spatial context. Second, if dilated convolutions are not applied to retain the feature resolution, the network may fail to capture relations between smaller buildings. Our motivation for using multi-scale pathways stems from these two reasons.
Method. Given a feature $X \in \mathbb{R}^{C \times H \times W}$, we apply $1 \times 1$ convolutions to generate key, query, and value features $K, Q, V \in \mathbb{R}^{\tilde{C} \times H \times W}$. The tensors $K$, $Q$, and $V$ are reshaped into $\mathbb{R}^{\tilde{C} \times N}$, where $N = H \times W$. The self-attention (SA) operation is defined as
$$\mathrm{SA}(K, Q, V) = \gamma \, V \, \mathrm{softmax}(K^{T} Q),$$
where the result is then reshaped into $\mathbb{R}^{\tilde{C} \times H \times W}$ and $\gamma$ is a learned parameter. The output of the self-attention block is the sum of $\mathrm{SA}(K, Q, V)$ and the input $X$. The resolution of $X$ is retained by dilated convolutions. To capture contextual long-range dependencies, we operate on the deep features $X$ at various scales $s \in S$, downsampling $X$ by a factor $s$ before the self-attention and upsampling the result back to the original feature resolution. The contextual attention is the sum over all scales,
$$\mathrm{CA}(X) = \sum_{s \in S} U_{s}\big(\mathrm{SA}(K_{s}, Q_{s}, V_{s})\big),$$
where $K_{s}$, $Q_{s}$, $V_{s}$ are computed from the downsampled features and $U_{s}$ denotes upsampling by a factor $s$.
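The multi-scale pathway can be sketched in plain NumPy as follows. The weight shapes, the scale set (1, 2, 4), the residual handling, and the average-pool/nearest-neighbour resampling choices are illustrative assumptions; the $1 \times 1$ convolutions are written as matrix products over flattened features.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wk, Wq, Wv, gamma):
    """x: (C, H, W); Wk, Wq, Wv: (Ct, C) weights acting as 1x1 convs."""
    C, H, W = x.shape
    flat = x.reshape(C, H * W)                    # (C, N), N = H * W
    K, Q, V = Wk @ flat, Wq @ flat, Wv @ flat     # each (Ct, N)
    affinity = softmax(K.T @ Q, axis=0)           # (N, N) spatial affinity
    return gamma * (V @ affinity).reshape(-1, H, W)

def downsample(x, s):
    """Average-pool each channel by factor s (H, W divisible by s)."""
    C, H, W = x.shape
    return x.reshape(C, H // s, s, W // s, s).mean(axis=(2, 4))

def upsample(x, s):
    """Nearest-neighbour upsampling by factor s."""
    return x.repeat(s, axis=1).repeat(s, axis=2)

def contextual_attention(x, weights, scales=(1, 2, 4), gamma=0.05):
    """Sum of self-attention responses over several spatial scales.
    The residual sum with the input features is applied afterwards."""
    out = 0.0
    for s in scales:
        xs = downsample(x, s)
        out = out + upsample(self_attention(xs, *weights, gamma), s)
    return out
```

Running self-attention on downsampled copies shrinks the $N \times N$ affinity matrix quadratically, which is why the multi-scale pathways add only minimal computation.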
Channel-wise attention. Contextual attention employs both spatial axes to establish interdependencies and model spatial information. However, higher-level class or object information is prevalent across channels, and hence we apply channel-wise attention to model relations across them. The last feature block of ResNet is large, with 2,048 channels, and performing self-attention based operations on it is computationally expensive. Therefore, the features are compressed to $1/r$ of the total number of channels by $1 \times 1$ convolutions to obtain the features $\hat{X} \in \mathbb{R}^{\tilde{C} \times H \times W}$ with $\tilde{C} = C/r$, where $r$ is the compression factor across channels (note the difference with the scale factor $s$). The key, query, and value are all the same in this case ($K = Q = V = \hat{X}$). To apply self-attention, the features are reshaped into $\mathbb{R}^{\tilde{C} \times N}$, where $N = H \times W$. The channel-wise attention is now
$$\mathrm{CH}(\hat{X}) = \gamma \, \mathrm{softmax}(\hat{X} \hat{X}^{T}) \, \hat{X}.$$
The final output is the sum of $\mathrm{CH}(\hat{X})$ and $\hat{X}$. Note that the affinity matrices (the inputs to the softmax) generated by the two attention mechanisms have different shapes: the contextual attention affinity matrix has a size of $N \times N$, whereas the channel-wise affinity matrix has a size of $\tilde{C} \times \tilde{C}$. The final contextual pyramid attention block is the sum of both the contextual and channel-wise attention outputs.
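The channel-wise path admits an equally short NumPy sketch; the compression weights and the residual placement are illustrative assumptions, with the $1 \times 1$ compression again written as a matrix product.

```python
import numpy as np

def softmax(z, axis):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(x, Wc, gamma=1.0):
    """x: (C, H, W); Wc: (C//r, C) compressing 1x1-conv weights."""
    C, H, W = x.shape
    f = Wc @ x.reshape(C, H * W)            # compressed features, (C//r, N)
    affinity = softmax(f @ f.T, axis=-1)    # (C//r, C//r) channel affinity
    out = gamma * (affinity @ f)            # channels re-weighted by affinity
    return (out + f).reshape(-1, H, W)      # residual sum, back to (C//r, H, W)
```

Because the affinity is $\tilde{C} \times \tilde{C}$ rather than $N \times N$, this path stays cheap even on high-resolution feature maps.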
4.1 Dataset and Evaluation
Dataset. The Inria Aerial Image Labeling Dataset [1] consists of 360 RGB ortho-rectified aerial images at a spatial resolution of 0.3 meters. Each ortho-rectified image has a resolution of 5000 × 5000 pixels, covering a surface of 1500 × 1500 meters. The dataset comprises 10 different cities, of which 5 are available for training. The images are taken under different urban conditions in various cities in Europe and America. The ground truth is available only for the training set and contains two classes: building and non-building. For fair comparison, we follow the same evaluation protocol as in [11, 22, 23]: images 1–5 of each location are used for validation and images 6–36 for training.
For evaluation, we use both Intersection over Union (IoU) and accuracy. IoU is measured on the building class and is the ratio of true-positive pixels to the union of all pixels labeled as building in either the ground truth or the predictions. Accuracy is the ratio of correctly classified pixels to the total number of pixels. Each location is evaluated both separately and jointly on these two metrics.
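The two metrics reduce to a few lines of NumPy on binary masks; the function name is ours, used only for illustration.

```python
import numpy as np

def building_metrics(pred, gt):
    """pred, gt: binary masks (1 = building). Returns (IoU, accuracy)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()   # true-positive pixels
    union = np.logical_or(pred, gt).sum()    # positive in gt or prediction
    iou = inter / union if union else 1.0    # empty masks count as perfect
    acc = (pred == gt).mean()                # fraction of correct pixels
    return iou, acc
```

Note that on tiles with few buildings, accuracy can be near-perfect while IoU remains low, which is relevant to the Kitsap Co. discussion below.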
Implementation details. Each of the networks is initialized with weights pretrained on ImageNet [25]. The models are trained using the Adam optimizer [26] with a learning rate of 0.00001. The objective function is cross-entropy, and each model is trained for 35 epochs. The input images have a size of 500 × 500 pixels and are randomly rotated by 90° or flipped horizontally and vertically during training. Images are also augmented with minor changes in brightness and contrast. The image augmentations are applied using Albumentations [27], and the models are implemented in PyTorch. The contextual and self-attention weights are initialized to 1 and 0.05, respectively.
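As a rough sketch, the augmentations described above could be configured in Albumentations as follows; the probabilities and brightness/contrast limits are illustrative assumptions, not the values used in our experiments.

```python
import albumentations as A

# Rotation/flip and photometric augmentations as described in the text;
# all probabilities and limits below are illustrative, not the paper's.
train_transform = A.Compose([
    A.RandomRotate90(p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.1, p=0.3),
])

# Applied jointly to an image and its mask:
# augmented = train_transform(image=image, mask=mask)
```

Passing the mask through the same transform keeps the geometric augmentations aligned between image and ground truth.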
4.2 Comparison with state-of-the-art
With the proposed module, we observe an improvement of 3–5% IoU across all datasets except Kitsap Co. Our method with a simple backbone such as ResNet18 offers competitive performance with current state-of-the-art results, and using a deeper backbone such as ResNet101 or SE-ResNeXt101 further improves performance. Compared to the previous state of the art, ICT-Net [24], which uses a U-Net with dense blocks and SE blocks, we observe a gain of 1.8 IoU points. This improvement is larger than the gains reported in recent years. Our method does not rely on post-processing or multi-task learning; however, applying these may further improve performance. A visual comparison of our results is shown in Fig. 1. We note that on Kitsap Co., which contains the fewest buildings, our performance decreases slightly. We observe very high accuracy (99.32%), whereas the IoU is lower, indicating a skew toward classifying the background correctly. On visual inspection, we noticed errors in the ground truth, where background is annotated as buildings. This incorrect ground truth, combined with the sparsely covered building regions, penalizes the IoU metric, thereby lowering performance.
4.3 Ablation studies
To study the impact of attention, an ablation study is conducted without any attention, with self-attention, and with the proposed CPA. The results are shown in Table 1 (top-right). The IoU improves by 0.87 and 1.54 points with the addition of self-attention and CPA, respectively, over the ResNet-FPN. Furthermore, we conduct experiments to study the impact of the different models on computation costs (Table 1, mid-right). For a tile of 5000 × 5000 pixels, ResNet18-FPN-CPA takes only 4.67 seconds on a GTX 1080Ti GPU, whereas ResNet101 and SE-ResNeXt101 take 12.16 and 14.07 seconds, respectively. In [24], a tile takes 160 seconds to process; however, they use a K80 GPU, which is typically two times slower than our setup. Even so, compared to [24], our largest model (SE-ResNeXt101) is five times faster at inference while offering higher performance.
We have presented a novel and effective method to improve building segmentation in aerial imagery. It consists of contextual pyramid and channel-wise attention blocks that model long-range dependencies across spatial contexts to account for buildings of different sizes. The contextual pyramid attention combines contextual information at multiple scales with minimal computation overhead, whereas the channel-wise attention captures interdependencies across channels. The proposed method achieves state-of-the-art performance on the Inria Aerial Image Labeling dataset, improving by 1.8 and 12.6 IoU points over the previous state of the art and the FCN baseline, respectively. Furthermore, we perform ablation experiments to study the impact of the CPA module and provide comparisons of computation costs. Our model offers high performance with low inference times and without any post-processing.
-  Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, and Pierre Alliez, “Can semantic labeling methods generalize to any city? the inria aerial image labeling benchmark,” in 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS). IEEE, 2017, pp. 3226–3229.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
-  Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, “Densely connected convolutional networks,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Simon Jégou, Michal Drozdzal, David Vazquez, Adriana Romero, and Yoshua Bengio, “The one hundred layers tiramisu: Fully convolutional densenets for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 11–19.
-  Karen Simonyan and Andrew Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.
-  Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
-  Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia, “Pyramid scene parsing network,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.
-  Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 801–818.
-  Benjamin Bischke, Patrick Helber, Joachim Folz, Damian Borth, and Andreas Dengel, “Multi-task learning for segmentation of building footprints with deep neural networks,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 1480–1484.
-  Clint Sebastian, Raffaele Imbriaco, Egor Bondarev, and Peter HN de With, “Adversarial loss for semantic segmentation of aerial imagery,” arXiv preprint arXiv:2001.04269, 2020.
-  Alina Marcu and Marius Leordeanu, “Dual local-global contextual pathways for recognition in aerial imagery,” arXiv preprint arXiv:1605.05462, 2016.
-  Alina Marcu, Dragos Costea, Emil Slusanschi, and Marius Leordeanu, “A multi-stage multi-task neural network for aerial scene interpretation and geolocalization,” ArXiv, vol. abs/1804.01322, 2018.
-  Lichao Mou and Xiao Xiang Zhu, “Rifcn: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images,” arXiv preprint arXiv:1805.02091, 2018.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
-  Alexander Kirillov, Ross Girshick, Kaiming He, and Piotr Dollár, “Panoptic feature pyramid networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6399–6408.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” in Advances in neural information processing systems, 2017, pp. 5998–6008.
-  Hyeonseob Nam, Jung-Woo Ha, and Jeonghee Kim, “Dual attention networks for multimodal reasoning and matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 299–307.
-  Vinod Nair and Geoffrey E Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.
-  Sergey Ioffe and Christian Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
-  Alina Marcu, Dragos Costea, Emil Slusanschi, and Marius Leordeanu, “A multi-stage multi-task neural network for aerial scene interpretation and geolocalization,” arXiv preprint arXiv:1804.01322, 2018.
-  Andrew Khalel and Motaz El-Saban, “Automatic pixelwise object labeling for aerial imagery using stacked u-nets,” arXiv preprint arXiv:1803.04953, 2018.
-  Bodhiswatta Chatterjee and Charalambos Poullis, “Semantic segmentation from remote sensor data and the exploitation of latent learning for classification of auxiliary tasks,” arXiv preprint arXiv:1912.09216, 2019.
-  Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition. Ieee, 2009, pp. 248–255.
-  Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  A. Buslaev, A. Parinov, E. Khvedchenya, V. I. Iglovikov, and A. A. Kalinin, “Albumentations: fast and flexible image augmentations,” ArXiv e-prints, 2018.