Improving Building Segmentation for Off-Nadir Satellite Imagery

09/08/2021, by Hanxiang Hao, et al.

Automatic building segmentation is an important task for satellite imagery analysis and scene understanding. Most existing segmentation methods focus on the case where the images are taken from directly overhead (i.e., low off-nadir/viewing angle). These methods often fail to provide accurate results on satellite images with larger off-nadir angles due to the higher noise level and lower spatial resolution. In this paper, we propose a method that is able to provide accurate building segmentation for satellite imagery captured from a large range of off-nadir angles. Based on Bayesian deep learning, we explicitly design our method to learn the data noise via aleatoric and epistemic uncertainty modeling. Satellite image metadata (e.g., off-nadir angle and ground sample distance) is also used in our model to further improve the result. We show that with uncertainty modeling and metadata injection, our method achieves better performance than the baseline method, especially for noisy images taken from large off-nadir angles.


1 Introduction

Object segmentation for satellite imagery has been studied extensively because of the availability of large-scale datasets [10, 37, 15, 4, 9] and computational resources. Although many existing methods achieve accurate segmentation results, using them in real-world applications is still challenging. Unlike many segmentation tasks for natural images, such as the COCO dataset [31] and the Cityscapes dataset [6], real-world object segmentation for satellite imagery must identify small, visually heterogeneous objects (e.g., cars and buildings) with varying orientation and density [37]. For example, even humans find it hard to detect the small buildings inside the forest area in Figure 1, because of the low lighting and the similarity in color between the buildings and the surrounding trees. Furthermore, due to changes in the satellite viewing angle, the appearance of target objects can vary dramatically, including changes in lighting intensity, object resolution, and image noise level. As the input images in Figure 1 show, from a small viewing angle (first row) to a large viewing angle (second row), the overall image intensity and image quality change significantly. Therefore, to successfully use segmentation models in real-world applications, these challenges must be addressed.

Figure 2: Illustration of satellite off-nadir angle.

Many existing satellite imagery segmentation methods directly adopt approaches originally designed for natural-image object segmentation without considering the previously mentioned challenges. Since most publicly available datasets for satellite image segmentation consist of images taken nearly directly overhead (at-nadir images) [10, 15, 4, 9], these existing methods are able to produce accurate results. However, as mentioned earlier, accurate results on such data do not guarantee that these methods can be successfully used in real-world applications. To address this issue, in this work we consider the more challenging SpaceNet 4 multi-view overhead imagery dataset [37] for building segmentation, which focuses on noisy data due to large off-nadir angles. As shown in Figure 2, the satellite off-nadir angle (viewing angle) is the angle, measured at the satellite, between the nadir point directly below it and the center of the imaged scene [37]. Considering images with large off-nadir angles moves us one step closer to real-world applications, because many satellite images collected during disaster responses or other urgent situations involve large off-nadir angles. For example, the first set of satellite images taken of Puerto Rico after Hurricane Maria was obtained at an off-nadir angle of roughly 52° [8]. A large off-nadir angle can cause a significant deterioration in image quality: as shown in Figure 1, compared to the image with the smaller off-nadir angle, the image with the larger off-nadir angle is blurrier and noisier. A large off-nadir angle can also change object appearance. In the same figure, with the smaller off-nadir angle only building roofs are visible, but with the larger off-nadir angle both roofs and facades are visible, which changes the apparent building area in the image. In the SpaceNet 4 dataset, images of the same scene are taken at different off-nadir angles. All building annotations are labeled based on the images with the smallest magnitude of off-nadir angle (7.8°), and the images at all other off-nadir angles use the same labels as ground truth during training. Therefore, the change of building appearance with off-nadir angle has an adverse effect when training the model, because the ground truth annotations become inaccurate. These challenges are similar to those in domain adaptation, where reliable data is available for training in one scenario, but the model must adapt to data collected under different scenarios (different lighting, image noise, or annotation accuracy conditions).

In order to solve these challenges posed by the SpaceNet 4 dataset and real-world applications, we present a building segmentation method with uncertainty modeling and satellite image metadata injection. Our method provides accurate segmentation results when training with noisy images and inaccurate ground truth annotations. More specifically, based on Bayesian deep learning, the proposed method is designed to capture both model and data uncertainty so that image regions with a higher uncertainty level can be discounted. For example, as shown in Figure 1 (we provide more detail in Section 3.1), our uncertainty maps highlight the areas with larger image noise (e.g., building boundaries, due to image blur and inaccurate annotation). As the off-nadir angle increases (from the first row to the second row), the uncertainty level increases, indicating higher data noise from both the image and the annotation. Furthermore, satellite image metadata is also considered in our method, as it usually contains useful information for improving model performance. In this work, we use ground sample distance (GSD) and off-nadir angle as input metadata. GSD describes the spatial resolution of the image, and a larger GSD usually indicates blurrier and noisier images. As mentioned earlier, different off-nadir angles can also cause changes in image quality. We propose two metadata injection methods in Section 3.2 to show the effectiveness of using metadata in building segmentation. The main contributions of this paper are summarized as follows:

  • we design a building segmentation model that is able to capture both model uncertainty (epistemic uncertainty) and data uncertainty (aleatoric uncertainty);

  • a concatenation-based metadata injection method is developed for using satellite image metadata to improve building segmentation;

  • alternatively, we also propose a metadata injection method using the Affine Combination Module (ACM) for multi-resolution injection;

  • based on our experimental analysis and ablation study, we show that the proposed method is able to achieve a better performance than the baseline method, especially for noisy images taken at large off-nadir angles.

2 Related Work

In this section, we will review the previous work for satellite image building segmentation as well as the methods using uncertainty modeling and metadata injection in satellite imagery.

Building segmentation for satellite imagery. In this paper, we treat building segmentation as a binary semantic segmentation task. (Some previous work treats it as an instance segmentation task; here we focus on semantic segmentation.) Many recent approaches (including our proposed method) are based on the U-Net structure [35], because of its good performance in many computer vision tasks [30, 1, 21, 20, 3]. Here we briefly review several U-Net-based methods of building segmentation for satellite imagery. A large receptive field is important for a segmentation model to detect buildings of different sizes, so many methods improve the original U-Net with techniques that enlarge the receptive field. Zhang and Wang [40] extend the U-Net model with dense connections [19] and dilated convolutional layers [39, 3] to reach a large receptive field for capturing the information of large objects. Liu et al. [32] incorporate a feature pyramid scene parsing (PSP) network [41] with U-Net to further increase the receptive field; they use the PSP module to replace the bottleneck layer of U-Net so that multi-scale features can be used to extract building footprints of different sizes. Jing et al. [22] design a spatial pyramid dilated network for building segmentation by combining the aforementioned PSP network with dilated convolution. In this work, as discussed previously, instead of focusing on improving performance on at-nadir images, we aim at the problem of adapting to real-world applications: building segmentation for images with large off-nadir angles, which tend to be noisier and blurrier than at-nadir images.

Uncertainty modeling for satellite imagery analysis. Using Bayesian deep learning to model uncertainty has already been explored in satellite imagery analysis. Kampffmeyer et al. [23] first introduced Monte Carlo dropout [12] to capture model uncertainty for small-object segmentation. Although dropout is rarely used in convolutional neural networks (CNNs) because it empirically tends to deteriorate performance, they show that adding dropout layers to their fully convolutional encoder-decoder model, together with Monte Carlo integration during inference, achieves better performance. Our proposed method also uses Monte Carlo dropout; see Section 3.1.1 for more information. Inspired by this, Bischke et al. [2] proposed to use model uncertainty to address the class imbalance issue in satellite image segmentation: the predicted uncertainty for each class is used as a weight in the cross-entropy loss to account for model uncertainty caused by class imbalance. In this work, we use not only the model uncertainty (epistemic uncertainty) as in the previous work, but also the data uncertainty (aleatoric uncertainty), enabling our segmentation model to learn from noisy data.

Injecting metadata for satellite imagery analysis. Satellite image metadata can be used in many satellite imagery analysis tasks, as it usually contains information that improves model performance. Pritt and Chern [33] use a variety of satellite metadata, including GSD, off-nadir angle, longitude, and latitude, for image classification in satellite imagery. They use an ensemble of CNN models for image feature extraction; the CNN features are then concatenated with the normalized metadata and fed into fully-connected layers for classification. In Section 3.2.1, we provide a similar concatenation-based metadata injection method that improves the metadata feature extraction using multi-layer perceptrons. Christie et al. [5] proposed a similar model that fuses CNN features with normalized metadata for multi-temporal satellite image sequences. Different from the previous work, instead of feeding the fused features to fully-connected layers, they feed these features into a long short-term memory (LSTM) model to accumulate temporal information across frames and obtain the final classification result. In this work, besides the aforementioned concatenation-based method, we also present an Affine Combination Module-based metadata injection that injects metadata at multiple feature resolutions, as described in Section 3.2.2.

3 Method

In this section, we introduce our building segmentation method with uncertainty modeling and satellite image metadata injection. As shown in Figure 3, the proposed method is based on U-Net [35] and has multiple outputs. As described in Section 3.1, modeling uncertainty enables our method to discount noisy pixels caused by (1) blurry or noisy images and (2) inaccurate data annotation. Injecting satellite image metadata, such as ground sample distance (GSD) and off-nadir angle, provides the model with more information to improve its performance; we present two metadata injection approaches in Section 3.2.

Figure 3: The block diagram of the proposed method with uncertainty modeling and concatenation-based metadata injection. $q(W)$ is the dropout variational distribution described in Section 3.1.1.

3.1 Modeling Uncertainty via Bayesian Deep Learning

Unlike standard deep learning methods, Bayesian deep learning (Bayesian DL) provides a model with the ability to ignore certain data points based on uncertainty. In Bayesian DL, there are two types of uncertainty one can model:

  • Epistemic Uncertainty describes the uncertainty in the model itself, caused by the model not having learned from enough (or sufficiently varied) training data. For example, a segmentation model might miss building areas with colors or textures it has rarely seen. This type of uncertainty can usually be reduced as more training data becomes available.

  • Aleatoric Uncertainty describes the uncertainty inherent in the data (e.g., image/sensor noise). Aleatoric uncertainty can be further categorized into homoscedastic uncertainty, which is constant across the entire dataset, and heteroscedastic uncertainty, which varies per input data point (each pixel in our case). In this work, we consider heteroscedastic aleatoric uncertainty in order to accurately model the data noise of each input image.

In the following sections, we review the methods for modeling epistemic uncertainty [11] and aleatoric uncertainty [25], followed by our proposed approach for combining both uncertainties in one model.

3.1.1 Epistemic Uncertainty

In Bayesian DL, to capture the uncertainty from the model (epistemic uncertainty), we place a distribution over the model parameters. For example, the prior distribution over the weights $W$ of a fully-connected layer can be modeled as a standard Gaussian, $W \sim \mathcal{N}(0, I)$. This differs from a standard deep learning model, which uses deterministic parameters. In Bayesian DL, for each forward pass, during both training and testing, the model parameters differ due to parameter sampling. Formally, we formulate our building segmentation model as:

$p(y \mid x, W) = \sigma\!\left(f^{W}(x)\right)$  (1)

where $x$ is the input image, $y$ is the output class label (in our case, a binary label indicating foreground or background), $f^{W}$ is our Bayesian DL model with sampled parameters $W$, and $\sigma(\cdot)$ is the sigmoid function applied to each input element.

Estimating the model posterior over the entire training set is intractable [12, 11]. To approximate this posterior distribution, following [12, 11, 25, 24], we use dropout variational inference, which is performed by placing a dropout layer before every convolutional layer (or fully-connected layer). Since dropout can be formulated as a Bernoulli trial that randomly sets model parameters to zero, [12, 11] show that this dropout distribution over the model parameters, $q(W)$, can be used to approximate the model posterior. This is done by minimizing their Kullback-Leibler (KL) divergence via the following loss function during training:

$\mathcal{L} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{c}\!\left(f^{\hat{W}_i}(x_i),\, y_i\right) + \lambda \sum_{l} \lVert W_l \rVert^{2}$  (2)

where $(x_i, y_i)$ is a pair of training image and its corresponding ground truth label mask, $\mathcal{L}_{c}$ is a classification loss (binary cross entropy in our case), $f^{\hat{W}_i}$ is our model with parameters $\hat{W}_i$ sampled from the dropout distribution $q(W)$, and $\lambda$ is a non-trainable hyper-parameter as described in [12]. The second term of Equation 2 can be implemented using weight decay [27], which was originally designed for model regularization.
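In PyTorch terms, Equation 2 reduces to an ordinary classification loss plus L2 weight decay; the following minimal sketch is our own illustration (the stand-in network and hyper-parameter values are placeholders, not the paper's exact settings):

```python
import torch

# BCEWithLogitsLoss implements the classification term of Equation 2.
criterion = torch.nn.BCEWithLogitsLoss()

def training_step(net, optimizer, image, mask):
    # In train mode the dropout layers stay active, so each forward pass
    # effectively uses parameters sampled from the dropout distribution q(W).
    net.train()
    optimizer.zero_grad()
    loss = criterion(net(image), mask)
    loss.backward()
    optimizer.step()
    return loss.item()

# The KL (regularization) term of Equation 2 is realized as weight decay [27].
net = torch.nn.Sequential(torch.nn.Conv2d(4, 1, 3, padding=1))  # stand-in for the U-Net
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, weight_decay=1e-5)
```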

During inference, we can estimate the final prediction distribution for a testing image via Monte Carlo integration, as proposed in [12, 11]:

$p(y \mid x) \approx \frac{1}{T}\sum_{t=1}^{T} \sigma\!\left(f^{\hat{W}_t}(x)\right)$  (3)

where $\hat{W}_t$ are the model parameters from each Monte Carlo sample and $T$ is the total number of samples. Equation 3 is referred to as Monte Carlo dropout, as proposed in [12]. Epistemic uncertainty can be visualized by computing the variance of the Monte Carlo samples:

$\mathrm{Var}(y) \approx \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t \odot \hat{y}_t \;-\; \bar{y} \odot \bar{y}$  (4)

where $\odot$ is the Hadamard product for element-wise multiplication, $\hat{y}_t = \sigma(f^{\hat{W}_t}(x))$, and $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} \hat{y}_t$.
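As a minimal PyTorch sketch of Equations 3 and 4 (the helper names and default sample count are ours, not the authors' released code; the key detail is that dropout remains stochastic at test time):

```python
import torch

def enable_mc_dropout(net: torch.nn.Module) -> None:
    # Put the network in eval mode (freezing batch norm statistics)
    # but keep the dropout layers stochastic.
    net.eval()
    for module in net.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.train()

@torch.no_grad()
def mc_dropout_predict(net, x, num_samples: int = 10):
    """Monte Carlo dropout: average T stochastic forward passes (Eq. 3) and
    use the sample variance as the epistemic uncertainty map (Eq. 4)."""
    enable_mc_dropout(net)
    probs = torch.stack([torch.sigmoid(net(x)) for _ in range(num_samples)])
    mean = probs.mean(dim=0)                               # (B, 1, H, W)
    epistemic = (probs * probs).mean(dim=0) - mean * mean  # E[y^2] - E[y]^2
    return mean, epistemic
```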

As shown in Figure 3, we model the epistemic uncertainty by placing dropout layers before just the first three decoder layers, instead of before all convolutional layers. Since we use a ResNet-34 model [18] pretrained on ImageNet [7] as the CNN encoder, we treat feature extraction as a deterministic process; therefore, no dropout layers are used in the CNN encoder. We model only the first three decoder layers as stochastic processes, by placing a dropout layer before each convolutional layer in each upsampling block. We do not add dropout layers to the last two decoder layers; this reduces the output noise caused by the limited number of Monte Carlo samples during inference (Equation 3).
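A sketch of one such stochastic upsampling block is shown below; the channel sizes are illustrative, and the dropout rate follows the setting of [25] rather than a value stated here:

```python
import torch.nn as nn

class UpBlockMCDropout(nn.Sequential):
    """Decoder block with a dropout layer placed before the convolution,
    so the block stays stochastic during Monte Carlo inference."""
    def __init__(self, in_ch: int, out_ch: int, p: float = 0.5):
        super().__init__(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Dropout2d(p),  # re-sampled on every forward pass
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
```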

3.1.2 Aleatoric Uncertainty

Aleatoric uncertainty captures the noise in the training data. As described previously, in this work we consider heteroscedastic aleatoric uncertainty, which captures the noise of each pixel in an input image. We use two additional convolutional layers placed on top of the last decoder layer to obtain the classification logit $z$ and the aleatoric uncertainty $u$, as shown in Figure 3. We use the predicted aleatoric uncertainty during training to discount pixels with larger uncertainty and focus on pixels with less uncertainty. To achieve this, as proposed in [25], we corrupt the predicted logits with Gaussian random noise whose standard deviation is the predicted aleatoric uncertainty. More specifically, we modify Equation 1 by placing a Gaussian distribution over the predicted logits:

$\hat{y}_{i,j} = \sigma\!\left(\hat{z}_{i,j}\right)$, where $\hat{z}_{i,j} \sim \mathcal{N}\!\left(z_{i,j},\, u_{i,j}^{2}\right)$  (5)

Note that $i$ and $j$ are the pixel coordinates of the output logit $z$ and the aleatoric uncertainty $u$; we denote the model output $f^{W}(x)$ at pixel $(i, j)$ by $z_{i,j}$ for simplicity. From Equation 5, with larger aleatoric uncertainty the Gaussian-corrupted logit $\hat{z}_{i,j}$ tends to be noisier, which encourages the model to discount this "random" prediction. With smaller aleatoric uncertainty, $\hat{z}_{i,j}$ stays close to the original predicted logit $z_{i,j}$, which makes the model focus on this prediction. Since we use Gaussian corruption, we can facilitate our implementation using the Gaussian reparameterization trick:

$\hat{z}_{i,j} = z_{i,j} + u_{i,j}\,\epsilon$, where $\epsilon \sim \mathcal{N}(0, 1)$  (6)

During training, to capture both uncertainties, we replace the classification loss $\mathcal{L}_{c}$ in Equation 2 with a binary cross entropy loss on the Gaussian-corrupted output:

$\mathcal{L}_{c} = -\frac{1}{HW}\sum_{i,j}\left[\, y_{i,j}\log \hat{y}_{i,j} + (1 - y_{i,j})\log\!\left(1 - \hat{y}_{i,j}\right)\right]$  (7)

where $y_{i,j}$ is the ground truth label at pixel coordinates $(i, j)$, $\hat{y}_{i,j}$ is as shown in Equation 5, and the sum runs over all $H \times W$ pixels. We then obtain the final loss function for learning both epistemic and aleatoric uncertainty as:

$\mathcal{L}_{\text{total}} = \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{c}^{(i)} + \lambda \sum_{l} \lVert W_l \rVert^{2}$  (8)

where $\mathcal{L}_{c}^{(i)}$ is Equation 7 evaluated on the $i$-th training image.

We do not need aleatoric uncertainty during inference, as it is used for ignoring noisy pixels during training.
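A common implementation of Equations 5-7 follows [25] in predicting the log-variance so the standard deviation stays positive; the sketch below (with our own names and noise-sample count) averages the binary cross entropy over several Gaussian corruptions of the logits:

```python
import torch
import torch.nn.functional as F

def aleatoric_bce_loss(logits, log_var, target, num_noise_samples: int = 10):
    """Heteroscedastic aleatoric loss: corrupt the predicted logits with
    Gaussian noise whose per-pixel std is the predicted uncertainty
    (Eq. 6), then average the BCE of Eq. 7 over the noise samples."""
    std = torch.exp(0.5 * log_var)
    loss = 0.0
    for _ in range(num_noise_samples):
        eps = torch.randn_like(logits)
        corrupted = logits + std * eps  # reparameterization trick
        loss = loss + F.binary_cross_entropy_with_logits(corrupted, target)
    return loss / num_noise_samples
```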

3.2 Metadata Injection

Satellite image metadata contains useful information to support many computer vision tasks, such as using solar and satellite azimuth and elevation angles for shadow detection and building height estimation [13, 29, 14, 36, 16]. In this paper, we consider two types of metadata to improve the building segmentation result: (1) ground sample distance (GSD); and (2) off-nadir angle. GSD describes the spatial resolution of the image; a larger GSD indicates blurrier and noisier images due to lower image resolution. Off-nadir angle describes the viewing angle of the satellite camera and a larger off-nadir angle can also cause lower image resolution. In the following sections, we will provide two metadata injection approaches to improve the baseline U-Net model.

3.2.1 Metadata Injection via Feature Concatenation

As shown in Figure 3, we first pass the metadata vector $m$ through a multi-layer perceptron (MLP) to obtain an output vector $v \in \mathbb{R}^{C_m}$ for feature extraction and dimension expansion. We then combine the metadata feature vector with the image features $F \in \mathbb{R}^{C \times H \times W}$ obtained from the last CNN encoder layer. To combine metadata and image features, we repeat the metadata feature vector over the spatial dimensions to match the shape of the image features, obtaining $V \in \mathbb{R}^{C_m \times H \times W}$. We then concatenate the features along the channel dimension as $[F; V] \in \mathbb{R}^{(C + C_m) \times H \times W}$. The final features are obtained by linearly projecting the channel dimension back to the input channel dimension, $F' = g([F; V]) \in \mathbb{R}^{C \times H \times W}$, where $g(\cdot)$ is applied at each spatial location and can be implemented as a convolutional layer with kernel size $1 \times 1$. We refer to this concatenation-based approach as MetaCat.

3.2.2 Metadata Injection via Affine Combination Module

Figure 4: The block diagram of the proposed method with uncertainty modeling and ACM-based metadata injection.

As described above, the concatenation-based metadata injection method combines the metadata and image features by channel-wise concatenation followed by a linear projection layer. Doing so modifies the image features with the metadata features evenly at every location in the spatial ($H \times W$) dimensions. Intuitively, however, not all image features need to be modified; for example, since we focus on building segmentation, a large forest area should not be modified. To locate the regions that actually need modification, we use the Affine Combination Module (ACM) [28] for metadata injection, as shown in Figure 4. As the name indicates, ACM is based on affine transforms and can be formulated as follows:

$F' = F \odot W(m) + b(m)$  (9)

where $F$ is the image features obtained from the CNN encoder, $m$ is either the repeated metadata features described in Section 3.2.1 or the features from the previous decoder layer, and $W(\cdot)$ and $b(\cdot)$ are convolutional layers as proposed in [28]. From Equation 9, the term $F \odot W(m)$ can be considered the metadata-relevant information, since it directly interacts with the metadata features (or the previous decoder features). The term $b(m)$ can be considered metadata-irrelevant information that is not modified by the metadata features (or the previous decoder features). As the results in Section 4.2 indicate, with ACM we can explicitly decouple the metadata-relevant and metadata-irrelevant information rather than relying on the model to learn this separation implicitly. Following the design of [28], we use multiple ACMs at different feature resolutions in our decoder without changing other parts of the model, as shown in Figure 4. We refer to this ACM-based approach as MetaACM.
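A sketch of the module (the depth and kernel sizes of the $W(\cdot)$ and $b(\cdot)$ branches loosely follow [28] and should be treated as illustrative):

```python
import torch.nn as nn

class ACM(nn.Module):
    """Affine Combination Module (Eq. 9): F' = F * W(m) + b(m), where the
    conditioning features m are the repeated metadata features or the
    previous decoder features."""
    def __init__(self, feat_ch: int, cond_ch: int):
        super().__init__()
        self.scale = nn.Sequential(  # W(.)
            nn.Conv2d(cond_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )
        self.shift = nn.Sequential(  # b(.)
            nn.Conv2d(cond_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, feats, cond):
        # feats * scale(cond) carries the metadata-relevant information;
        # shift(cond) carries the metadata-irrelevant information.
        return feats * self.scale(cond) + self.shift(cond)
```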

4 Experiment

In this section, we describe the dataset and the implementation details in Section 4.1, and provide experimental results with analysis in Section 4.2.

4.1 Dataset and Experiment Setting

In this work, we use the SpaceNet 4 dataset [37], which is designed for building segmentation over a large range of off-nadir angles. It contains 4-channel RGB-NIR (Near-Infrared) images of size 900×900 pixels. Each distinct location in the dataset is imaged at off-nadir angles ranging from 7.8° to 54° in magnitude. We partition the dataset into training, validation, and testing sets, ensuring that all images of the same location are assigned to the same partition; this prevents different partitions from sharing images of the same scene.
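A location-aware split can be implemented with a simple grouping step; in the sketch below, the record fields and split ratios are our own placeholders:

```python
import random
from collections import defaultdict

def split_by_location(image_records, ratios=(0.8, 0.1, 0.1), seed=0):
    """Group image records by scene location before splitting, so that no
    location appears in more than one partition. Each record is assumed
    to carry a 'location_id' key."""
    by_location = defaultdict(list)
    for record in image_records:
        by_location[record["location_id"]].append(record)

    locations = sorted(by_location)
    random.Random(seed).shuffle(locations)

    n_train = int(ratios[0] * len(locations))
    n_val = int(ratios[1] * len(locations))
    parts = {
        "train": locations[:n_train],
        "val": locations[n_train:n_train + n_val],
        "test": locations[n_train + n_val:],
    }
    return {k: [r for loc in v for r in by_location[loc]] for k, v in parts.items()}
```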

Figure 5: Illustration of the building segmentation annotation issue in the original dataset. The light white area is the annotated ground truth area. The first row shows the annotation from the original dataset and the second row shows the annotation we manually labeled.

As mentioned in [37], the building annotations in the SpaceNet 4 dataset are obtained from the images with the smallest (in magnitude) off-nadir angle (7.8°), and the same annotations are used for the other images with different off-nadir angles. As shown in the first row of Figure 5, due to the change of viewing angle, the appearance, especially of tall buildings, changes significantly. For example, at the smaller off-nadir angle only the building roof is visible, but at larger off-nadir angles both the roof and the facade are visible, which makes the original annotations inaccurate. Although the proposed method is designed to deal with noisy images and annotations, in order to have an accurate testing evaluation, we manually relabel the testing images with large off-nadir angles, as shown in the second row of Figure 5.

To ensure a fair comparison between the proposed method and the baseline U-Net, all of our experiments use the same setting, which we now describe. The downsampling blocks (yellow blocks) in Figure 3 and Figure 4 are the residual blocks of a ResNet-34 model [18] pretrained on ImageNet [7]. The upsampling blocks (dark green blocks) consist of bilinear upsampling → convolution → batch normalization → ReLU. The upsampling blocks with dropout (light green blocks) consist of bilinear upsampling → dropout → convolution → batch normalization → ReLU. Following [25], the dropout rate is set to 0.5. The MLP for metadata feature extraction consists of three blocks, where each block is a fully-connected layer followed by a leaky ReLU. During training, to allow for the larger batch size required by batch normalization, we resize the input images to a lower resolution. The ADAM optimizer [26] with a linearly decaying learning rate is used, and all experiments are trained for 1 million iterations. As mentioned in Section 3.1.1, modeling epistemic uncertainty requires weight decay during training; for a fair comparison, we use the same weight decay factor in all experiments. For the Monte Carlo integration during inference, following [25], we fix the number of samples (we analyze this parameter in the following section).
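The optimizer setup can be sketched as follows (the learning rate and weight decay values are placeholders, not the paper's exact settings):

```python
import torch

def make_optimizer_and_scheduler(net, total_iters=1_000_000,
                                 lr=1e-4, weight_decay=1e-5):
    # Weight decay doubles as the KL/regularization term of Equation 2.
    optimizer = torch.optim.Adam(net.parameters(), lr=lr,
                                 weight_decay=weight_decay)
    # Linear decay of the learning rate over the course of training.
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: max(0.0, 1.0 - step / total_iters))
    return optimizer, scheduler
```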

Figure 6: Testing F1 scores with different off-nadir angles. The average F1 scores of all off-nadir angles are shown in the legend.

4.2 Experimental Result and Analysis

We start by evaluating the use of uncertainty modeling and metadata injection (we consider MetaCat first and compare MetaCat with MetaACM later). Figure 6 shows the F1 scores at different off-nadir angles on the testing set. Compared with the baseline U-Net, uncertainty modeling brings a slight improvement across most off-nadir angles. Adding the metadata injection layer further improves performance, especially for large off-nadir angles and negative off-nadir angles. As mentioned in [37], due to the data collection process, the images with large negative off-nadir angles have very different lighting conditions and shadows. Since most of the images are collected at positive off-nadir angles, the baseline method suffers from unbalanced data during training. With metadata injection and uncertainty modeling, the proposed method is better able to deal with these changes in lighting and shadows.

Figure 7: Result comparison of the baseline U-Net and the proposed method with uncertainty modeling and metadata injection. The input images are taken at the largest off-nadir angles. The red circles highlight the improvement of the proposed method compared to the baseline U-Net.

Figure 7 shows three testing examples captured at the largest off-nadir angles to visualize the improvement of the proposed method over the baseline U-Net. Comparing against the ground truth, the proposed method detects the building areas more accurately even under this high noise-level condition. For instance, in the first example, the baseline U-Net fails to differentiate the parking lot from the building area in the top-left of the input image (highlighted by the red circle), while the proposed method segments the area correctly. In the epistemic uncertainty map, the proposed method reports higher uncertainty there, indicating that the predictions for these class-ambiguous pixels are not reliable. Similar examples can be found in the highlighted areas of the second and third images. From the aleatoric uncertainty, we can also see that the input data has higher data noise around the forest regions than around the building regions, due to the larger appearance variance of forests compared to buildings. Our model therefore focuses more on the building regions during training to avoid the adverse effect of the frequent appearance changes in the forest regions. Unlike aleatoric uncertainty, epistemic uncertainty concentrates around buildings and other man-made structures (e.g., roads). It highlights the areas where the predictions are not reliable, such as building boundaries, due to image blur and noise. Figure 1 shows the prediction difference for two images of the same scene at different off-nadir angles. Overall, the aleatoric uncertainty increases significantly from the small to the large off-nadir angle due to the higher noise in the input image. Although the epistemic uncertainty increases less, the area it highlights does get larger. Appendix A shows results at more off-nadir angles for the same scene for comparison.

Experiment Nadir Off-Nadir Very Off-Nadir Overall
None 0.7820 0.7450 0.6335 0.7219
Aleatoric 0.7822 0.7448 0.6499 0.7275
Epistemic 0.7824 0.7424 0.6380 0.7229
Both 0.7822 0.7429 0.6415 0.7249
Table 1: F1 scores for the ablation study of uncertainty modeling. All of the listed experiments are based on U-Net with concatenation-based metadata injection. None means no uncertainty modeling.
Figure 8: Ablation study of Monte Carlo dropout. F1 scores for different numbers of Monte Carlo samples are shown for all images from the validation set (left) and for the images in Very Off-Nadir category (right).

We also provide an ablation study of uncertainty modeling to show that modeling both uncertainties does not necessarily yield the best result. Due to the close performance of the compared experiments, we group the F1 scores by off-nadir angle range in Table 1 to better visualize the differences. As defined in [37], we group the images into three categories based on the off-nadir angle:

  • Nadir: off-nadir angle of 25° or less;

  • Off-Nadir: 26° to 40°;

  • Very Off-Nadir: greater than 40°.

As shown in Table 1, the best performance in each category does not come from the experiment with both uncertainties; the effectiveness of uncertainty modeling can therefore depend on the dataset and task. Furthermore, as shown in the highlighted cells, for the Very Off-Nadir category, all experiments with uncertainty modeling achieve much better performance than the method without it. This confirms that uncertainty modeling improves model performance when larger data noise is present.

Figure 8 shows the effect of the number of samples in the Monte Carlo integration, measured on our validation set. The Regular Dropout experiment uses dropout purely as a regularization method, meaning that dropout is only active during training. The No Dropout experiment does not use dropout during either training or testing. From the overall F1 score plot (left) and the F1 score plot for the Very Off-Nadir category (right), the performance stops improving beyond a moderate number of samples, which shows that our choice for the number of Monte Carlo samples is reasonable. Furthermore, Monte Carlo dropout achieves a better result than both the regular dropout and no dropout experiments. Among the three, regular dropout performs worst, matching the observation in [12] that naively adding dropout layers to a CNN tends to degrade performance.
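Such a sweep can be reproduced by reusing the mc_dropout_predict sketch from Section 3.1.1; in the snippet below, net and val_loader are assumed to exist, and the F1 helper is our own:

```python
import torch

def f1_score(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> float:
    """Pixel-wise F1 between binary prediction and ground-truth masks."""
    tp = (pred * target).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (target.sum() + eps)
    return (2 * precision * recall / (precision + recall + eps)).item()

for num_samples in (1, 2, 5, 10, 20, 50):
    scores = []
    for image, mask in val_loader:  # validation images and ground-truth masks
        mean, _ = mc_dropout_predict(net, image, num_samples)
        scores.append(f1_score((mean > 0.5).float(), mask))
    print(f"T={num_samples}: mean F1 = {sum(scores) / len(scores):.4f}")
```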

Experiment Nadir Off-Nadir Very Off-Nadir Overall
None 0.7752 0.7359 0.6347 0.7180
MetaCat 0.7822 0.7429 0.6415 0.7249
MetaACM 0.7758 0.7382 0.6419 0.7197
Table 2: F1 scores for ACM-based and concatenation-based metadata injection. All of the listed experiments are based on U-Net with uncertainty modeling of both aleatoric and epistemic uncertainties. None means no metadata injection.

We compare the ACM-based (MetaACM) and concatenation-based (MetaCat) metadata injection methods in Table 2. Overall, MetaCat achieves better performance than MetaACM. Compared with the method without metadata injection, MetaCat shows a clear improvement in all three off-nadir angle categories. Although MetaACM does not improve much on the lower off-nadir angle images, it achieves the best performance in the Very Off-Nadir category.

Figure 9: Illustration of ACM feature maps obtained from the last decoder layer. The inpainted results are obtained by thresholding the normalized ACM product map (green region) with a threshold value of 0.5. The first row shows a case with a small off-nadir angle; the second row shows the same scene at a large off-nadir angle.

Figure 9 shows the ACM feature maps obtained from the last decoder layer. Following [28], these feature maps are visualized by averaging along the channel dimension. The fourth column shows the $F \odot W(m)$ product map from Equation 9. As discussed in Section 3.2.2, this feature map should highlight the metadata-relevant information, since it directly interacts with the metadata features (or the previous decoder features). The first row of Figure 9 shows a case with a small off-nadir angle: its map mainly covers the entire building area, according to the inpainted result in the last column. With a large off-nadir angle, however, the map highlights the lower side of the building area, as shown in the second row. At larger off-nadir angles, building facades become visible, which increases the apparent building area compared to the small off-nadir case; ACM highlights the facades (the lower side of the building area) to improve the prediction in those regions. This is consistent with our observation in Table 2 that MetaACM significantly improves performance in the Very Off-Nadir category.

Figure 10: Resized ACM maps for different decoder layers. The ACM map from decoder layer 1 has the lowest resolution, which increases by a factor of 2 after each decoder layer.

Figure 10 shows the ACM maps from different decoder layers. The feature maps from different decoder layers attend to different parts of the image; the design of MetaACM enables the model to locate different areas at different feature resolutions. This matters for metadata injection: if we modify the image features with the metadata features only at the lowest resolution (as in MetaCat), those modifications affect a large area in the final full-resolution result. For example, the bottleneck layer has a much lower resolution than the final output. Considering only the effect of the upsampling operators (ignoring the growth of the receptive field caused by convolutions), any modification of a single bottleneck feature affects a block of at least $2^{k} \times 2^{k}$ pixels in the final result after $k$ factor-2 upsamplings; such modifications are not precise enough for buildings much smaller than that block (see the numerical sketch below). Therefore, injecting metadata features into the image features at different resolutions is important for refining small buildings. To show the effectiveness of the proposed ACM-based metadata injection method, we also evaluate it on a different backbone model, U²-Net [34]; see Appendix B for details.
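As a quick numerical check of this upsampling argument (the 16×16 bottleneck and five factor-2 upsamplings are assumed for illustration):

```python
import torch
import torch.nn.functional as F

bottleneck = torch.zeros(1, 1, 16, 16)
bottleneck[0, 0, 8, 8] = 1.0  # modify a single bottleneck feature

x = bottleneck
for _ in range(5):  # five factor-2 upsamplings: 16x16 -> 512x512
    x = F.interpolate(x, scale_factor=2.0, mode="nearest")

# With nearest-neighbor upsampling the footprint is exactly 2^5 x 2^5;
# bilinear interpolation spreads the change over an even larger region.
print((x > 0).sum().item())  # 1024 = 32 * 32
```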

5 Conclusion

In this paper, we propose a method that provides accurate building segmentation despite the data noise caused by large off-nadir angles. Both aleatoric and epistemic uncertainty are modeled, enabling our model to learn from noisy training data: based on the predicted uncertainty level, the proposed method learns to discount areas with larger uncertainty and focus on areas with less uncertainty. Satellite image metadata is also used to further improve performance; we propose concatenation-based and ACM-based metadata injection methods to effectively use metadata for the building segmentation task. Through experimental analysis and an ablation study, we show that the proposed method achieves a clear improvement over the baseline method, especially for noisy images taken at large off-nadir angles.

Acknowledgments

This material is based on research sponsored by Lockheed Martin Space. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Lockheed Martin Space.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017-01) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495. Note: External Links: Document, Link Cited by: §2.
  • [2] B. Bischke, P. Helber, D. Borth, and A. Dengel (2018-07) Segmentation of imbalanced classes in satellite imagery using adaptive uncertainty weighted class loss. IEEE International Geoscience and Remote Sensing Symposium (), pp. 6191–6194. Note: Valencia, Spain External Links: Document, Link Cited by: §2.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018-04) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848. Note: External Links: Document, Link Cited by: §2.
  • [4] M. T. Chiu, X. Xu, Y. Wei, Z. Huang, A. G. Schwing, R. Brunner, H. Khachatrian, H. Karapetyan, I. Dozier, G. Rose, D. Wilson, A. Tudor, N. Hovakimyan, T. S. Huang, and H. Shi (2020-06) Agriculture-vision: a large aerial image database for agricultural pattern analysis. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 2825–2835. Note: External Links: Document, Link Cited by: §1, §1.
  • [5] G. Christie, N. Fendley, J. Wilson, and R. Mukherjee (2018-06) Functional map of the world. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 6172–6180. Note: Salt Lake City, UT External Links: Document, Link Cited by: §2.
  • [6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016-06) The cityscapes dataset for semantic urban scene understanding. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 3213–3223. Note: Las Vegas, NV External Links: Document, Link Cited by: §1.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009-06) ImageNet: a large-scale hierarchical image database. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 248–255. Note: Miami, FL External Links: Document, Link Cited by: §3.1.1, §4.1.
  • [8] (2019)(Website) External Links: Link Cited by: §1.
  • [9] A. V. Etten, D. Hogan, J. Martinez-Manso, J. Shermeyer, N. Weir, and R. Lewis (2021-02) The multi-temporal urban development spacenet dataset. arXiv 2102.04420 (), pp. . Note: External Links: Document, Link Cited by: §1, §1.
  • [10] A. V. Etten, D. Lindenbaum, and T. M. Bacastow (2018-08) SpaceNet: a remote sensing dataset and challenge series. arXiv 1807.01232 (), pp. . Note: External Links: Document, Link Cited by: §1, §1.
  • [11] Y. Gal and Z. Ghahramani (2016-05) Bayesian convolutional neural networks with bernoulli approximate variational inference. International Conference on Learning Representations Workshop (), pp. . Note: San Juan, Puerto Rico External Links: Document, Link Cited by: §3.1.1, §3.1.
  • [12] Y. Gal and Z. Ghahramani (2016-06) Dropout as a bayesian approximation: representing model uncertainty in deep learning. International Conference on Machine Learning 48 (), pp. 1050–1059. Note: New York City, NY External Links: Document, Link Cited by: §2, §3.1.1, §4.2.
  • [13] R. Gouiaa and J. Meunier (2014-11) Shadow analysis technique for extraction of building height using high resolution satellite single image and accuracy assessment. ISPRS Journal of Photogrammetry and Remote Sensing XL-8 (), pp. 1185–1192. Note: External Links: Document, Link Cited by: §3.2.
  • [14] R. Gouiaa and J. Meunier (2014-05) 3D reconstruction by fusioning shadow and silhouette information. IEEE Canadian Conference on Computer and Robot Vision (), pp. 378–384. Note: Montreal, Canada External Links: Document, Link Cited by: §3.2.
  • [15] R. Gupta, B. Goodman, N. Patel, R. Hosfelt, S. Sajeev, E. Heim, J. Doshi, K. Lucas, H. Choset, and M. Gaston (2019-06) Creating xbd: a dataset for assessing building damage from satellite imagery. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (), pp. . Note: Long Beach, California External Links: Document, Link Cited by: §1, §1.
  • [16] H. Hao, S. Baireddy, E. Bartusiak, M. Gupta, K. LaTourette, L. Konz, M. Chan, M. L. Comer, and E. J. Delp (2021-04) Building height estimation via satellite metadata and shadow instance detection. Automatic Target Recognition XXXI 11729 (), pp. 175–190. Note: External Links: Document, Link Cited by: §3.2.
  • [17] H. Hao, S. Baireddy, K. LaTourette, L. Konz, M. Chan, M. L. Comer, and E. J. Delp (2021-11) Improving building segmentation using uncertainty modeling and metadata injection. SIGSPATIAL’21: ACM International Conference on Advances in Geographic Information Systems (), pp. . Note: Beijing, China External Links: Document, Link Cited by: footnote 1.
  • [18] K. He, X. Zhang, S. Ren, and J. Sun (2016-06) Deep residual learning for image recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 770–778. Note: Las Vegas, NV External Links: Document, Link Cited by: §3.1.1, §4.1.
  • [19] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017-07) Densely connected convolutional networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 2261–2269. Note: Honolulu, HI External Links: Document, Link Cited by: §2.
  • [20] V. Iglovikov and A. Shvets (2018-08) TernausNet: u-net with vgg11 encoder pre-trained on imagenet for image segmentation. arXiv 1801.05746 (), pp. . Note: External Links: Document, Link Cited by: §2.
  • [21] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017-07) Image-to-image translation with conditional adversarial networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 5967–5976. Note: Honolulu, HI External Links: Document, Link Cited by: §2.
  • [22] H. Jing, X. Sun, Z. Wang, K. Chen, W. Diao, and K. Fu (2021-04) Fine building segmentation in high-resolution sar images via selective pyramid dilated network. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (), pp. 1–1. Note: External Links: Document, Link Cited by: §2.
  • [23] M. Kampffmeyer, A. Salberg, and R. Jenssen (2016-06) Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (), pp. 680–688. Note: Las Vegas, NV External Links: Document, Link Cited by: §2.
  • [24] A. Kendall, V. Badrinarayanan, and R. Cipolla (2017-09) Bayesian segnet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. British Machine Vision Conference (), pp. 57.1–57.12. Note: London, United Kingdom External Links: Document, Link Cited by: §3.1.1.
  • [25] A. Kendall and Y. Gal (2017-12) What uncertainties do we need in bayesian deep learning for computer vision?. Conference on Neural Information Processing Systems 30 (), pp. . Note: Long Beach, CA External Links: Document, Link Cited by: §3.1.1, §3.1.2, §3.1, §4.1.
  • [26] D. P. Kingma and J. Ba (2015-05) Adam: a method for stochastic optimization. International Conference on Learning Representations (), pp. . Note: San Diego, CA External Links: Document, Link Cited by: §4.1.
  • [27] A. Krogh and J. A. Hertz (1991-12) A simple weight decay can improve generalization. Conference on Neural Information Processing Systems (), pp. 950–957. Note: Denver, Colorado External Links: Document, Link Cited by: §3.1.1.
  • [28] B. Li, X. Qi, T. Lukasiewicz, and P. H.S. Torr (2020-06) ManiGAN: text-guided image manipulation. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 7877–7886. Note: Seattle, WA External Links: Document, Link Cited by: §3.2.2, §4.2.
  • [29] G. Liasis and S. Stavrou (2016) Satellite images analysis for shadow detection and building height estimation. ISPRS Journal of Photogrammetry and Remote Sensing 19 (), pp. 437–450. Note: External Links: Document, Link Cited by: §3.2.
  • [30] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017-07) Feature pyramid networks for object detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 936–944. Note: Honolulu, HI External Links: Document, Link Cited by: §2.
  • [31] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014-09) Microsoft coco: common objects in context. European Conference on Computer Vision (), pp. 740–755. Note: Zurich, Switzerland External Links: Document, Link Cited by: §1.
  • [32] W. Liu, J. Xu, Z. Guo, E. Li, X. Li, L. Zhang, and W. Liu (2021-01) Building footprint extraction from unmanned aerial vehicle images via pru-net: application to change detection. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 14 (), pp. 2236–2248. Note: External Links: Document, Link Cited by: §2.
  • [33] M. Pritt and G. Chern (2017-07) Satellite image classification with deep learning. IEEE Applied Imagery Pattern Recognition Workshop (), pp. 1–7. Note: Washington, DC External Links: Document, Link Cited by: §2.
  • [34] X. Qin, Z. Zhang, C. Huang, M. Dehghan, O. Zaiane, and M. Jagersand (2020) U²-Net: going deeper with nested u-structure for salient object detection. Pattern Recognition 106 (), pp. 107404. Note: External Links: Document, Link Cited by: §4.2, Figure 12, Appendix B: U²-Net Result.
  • [35] O. Ronneberger, P. Fischer, and T. Brox (2015-06) U-net: convolutional networks for biomedical image segmentation. Medical Image Computing and Computer-Assisted Intervention 9351 (), pp. 234–241. Note: External Links: Document, Link Cited by: §2, §3.
  • [36] A. Trekin, V. Ignatiev, and P. Yakubovskiy (2020) Deep neural networks for determining the parameters of buildings from single-shot satellite imagery. Journal of Computer and Systems Sciences International 59 (), pp. 755–767. Note: External Links: Document, Link Cited by: §3.2.
  • [37] N. Weir, D. Lindenbaum, A. Bastidas, A. Etten, V. Kumar, S. Mcpherson, J. Shermeyer, and H. Tang (2019-10) SpaceNet mvoi: a multi-view overhead imagery dataset. IEEE/CVF International Conference on Computer Vision (), pp. 992–1001. Note: Seoul, Korea External Links: Document, Link Cited by: §1, §1, §4.1, §4.1, §4.2, §4.2.
  • [38] S. Xie and Z. Tu (2017) Holistically-nested edge detection. International Journal of Computer Vision 125 (1–3), pp. 3–18. Note: External Links: Document, Link Cited by: Appendix B: U-Net Result.
  • [39] F. Yu and V. Koltun (2016-04) Multi-scale context aggregation by dilated convolutions. arXiv 1511.07122 (), pp. . Note: External Links: Document, Link Cited by: §2.
  • [40] Z. Zhang and Y. Wang (2019-03) JointNet: a common neural network for road and building extraction. Remote Sensing 11 (6), pp. 696. Note: External Links: Document, Link Cited by: §2.
  • [41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017-07) Pyramid scene parsing network. IEEE/CVF Conference on Computer Vision and Pattern Recognition (), pp. 6230–6239. Note: Honolulu, HI External Links: Document, Link Cited by: §2.

Appendix A: Result Comparison with Different Off-Nadir Angles

Figure 11: Results of the proposed method for the images taken from different off-nadir angles. The results are obtained from the model with uncertainty modeling and concatenation-based metadata injection.

Figure 11 shows results of the proposed method with uncertainty modeling and concatenation-based metadata injection. All four input images are taken of the same scene but at different off-nadir angles, from small to very large. From the aleatoric and epistemic uncertainty maps, we can see that a larger off-nadir angle leads to higher uncertainty. More specifically, as the off-nadir angle grows, the aleatoric uncertainty increases in the non-building regions, such as forests and parking lots, since these areas usually contain larger appearance variance. By predicting higher uncertainty in these regions, our method can discount them and focus more on the building areas. The epistemic uncertainty mainly highlights the building boundaries, since the predictions there are less reliable than in other regions. As the off-nadir angle gets larger, the highlighted building boundaries get thicker, because the images get noisier and blurrier. Therefore, with uncertainty modeling, the proposed method is able to learn from the image areas with less uncertainty without being degraded by the noisy areas.

Appendix B: U²-Net Result

Figure 12: The block diagram of the proposed U²-Net [34] with uncertainty modeling and concatenation-based metadata injection.
Experiment Nadir Off-Nadir Very Off-Nadir Overall
None 0.8019 0.7447 0.6185 0.7259
Uncertainty 0.8081 0.7588 0.6305 0.7356
Uncertainty + MetaCat 0.8080 0.7580 0.6137 0.7304
Uncertainty + MetaACM 0.8163 0.7700 0.6348 0.7426
Table 3: F1 scores of U²-Net with uncertainty modeling and metadata injection. None means no metadata injection and no uncertainty modeling. The experiments with Uncertainty use both aleatoric and epistemic uncertainties.

The proposed uncertainty modeling and metadata injection methods can be extended to other backbone models. As shown in Figure 12, we can apply the proposed methods to U²-Net [34], a modified version of the original U-Net that uses a two-level nested U-structure to enlarge the receptive field in each encoder/decoder block. Moreover, deep supervision [38] (outputting multiple masks from different decoder blocks) is used to enforce the integration of multi-level deep features and further improve performance. Please see the original paper [34] for the detailed design of U²-Net. As with the U-Net backbone, we apply epistemic uncertainty modeling (Monte Carlo dropout layers) in the first three decoder blocks of U²-Net, and split the final layer into two branches to learn the aleatoric uncertainty map. Figure 12 shows the model with concatenation-based metadata injection; the ACM-based metadata injection variant can be obtained in a manner similar to the block diagram shown in Figure 4.

Table 3 shows the results of U²-Net with uncertainty modeling and the different metadata injection approaches. The proposed uncertainty modeling and metadata injection methods improve the original U²-Net model, especially for the cases with large off-nadir angles (with the exception of the Very Off-Nadir case in the concatenation-based metadata injection experiment). The experiment with both uncertainty modeling and ACM-based metadata injection achieves the best performance in every off-nadir angle category, which confirms the benefit of using multi-level features in metadata injection. These experiments show that the proposed uncertainty modeling and metadata injection methods improve the performance of both U-Net and U²-Net.