LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation

10/17/2021 ∙ by Junjue Wang, et al. ∙ Wuhan University

Deep learning approaches have shown promising results in remote sensing high spatial resolution (HSR) land-cover mapping. However, urban and rural scenes can show completely different geographical landscapes, and the inadequate generalizability of these algorithms hinders city-level or national-level mapping. Most of the existing HSR land-cover datasets mainly promote research on semantic representation learning, while ignoring model transferability. In this paper, we introduce the Land-cOVEr Domain Adaptive semantic segmentation (LoveDA) dataset to advance both semantic and transferable learning. The LoveDA dataset contains 5987 HSR images with 166768 annotated objects from three different cities. Compared to the existing datasets, the LoveDA dataset encompasses two domains (urban and rural), which brings considerable challenges due to: 1) multi-scale objects; 2) complex background samples; and 3) inconsistent class distributions. The LoveDA dataset is suitable for both land-cover semantic segmentation and unsupervised domain adaptation (UDA) tasks. Accordingly, we benchmarked the LoveDA dataset on eleven semantic segmentation methods and eight UDA methods. Some exploratory studies, including multi-scale architectures and strategies, additional background supervision, and pseudo-label analysis, were also carried out to address these challenges. The code and data are available at https://github.com/Junjue-Wang/LoveDA.

1 Introduction

With the continuous development of society and the economy, the human living environment is gradually being differentiated, and can be divided into urban and rural zones (un2020recommendation). High spatial resolution (HSR) remote sensing technology can help us to better understand the geographical and ecological environment. Specifically, land-cover semantic segmentation in remote sensing is aimed at determining the land-cover type at every image pixel. The existing HSR land-cover datasets, such as the Gaofen Image Dataset (GID) (GID), DeepGlobe (demir2018deepglobe), Zeebruges (Zeebruges), and Zurich Summer (volpi2015semantic), contain large-scale images with pixel-wise annotations, thus promoting the development of fully convolutional networks (FCNs) in the field of remote sensing (RSNet; 9530280). However, these datasets are designed only for semantic segmentation, and they ignore the diverse styles among geographic areas. For urban and rural areas, in particular, the manifestation of the land cover is completely different, in terms of class distribution, object scale, and pixel spectra. In order to improve the model generalizability for large-scale land-cover mapping, appropriate datasets are required.

In this paper, we introduce an HSR dataset for Land-cOVEr Domain Adaptive semantic segmentation (LoveDA), for use in two challenging tasks: semantic segmentation and UDA. Compared with the UDA datasets (clan; adaptseg) that use simulated images, the LoveDA dataset contains real urban and rural remote sensing images. Exploring the use of deep transfer learning methods on this dataset will be a meaningful way to promote large-scale land-cover mapping. The major characteristics of this dataset are summarized as follows: 1) Multi-scale objects. The HSR images were collected from 18 complex urban and rural scenes, covering three different cities in China. The objects in the same category lie in completely different geographical landscapes in the different scenes, which increases the scale variation. 2) Complex background samples. The remote sensing semantic segmentation task always faces complex background samples (i.e., land-cover objects that are not of interest) (pang2019mathcal; zheng2020foreground), which is particularly the case in the LoveDA dataset. The high resolution and the different complex scenes bring richer details, as well as a larger intra-class variance, for the background samples. 3) Inconsistent class distributions. The urban and rural scenes have different class distributions. The urban scenes, with their high population densities, contain many artificial objects, such as buildings and roads. In contrast, the rural scenes include more natural elements, such as forest and water. Compared with the UDA datasets in general computer vision (venkateswara2017deep; peng2019moment), the LoveDA dataset focuses on the style differences of the geographical environments. The inconsistent class distributions pose a special challenge for the UDA task.

As the LoveDA dataset was built with two tasks in mind, both advanced semantic segmentation and UDA methods were evaluated. Several exploratory experiments were also conducted to solve the particular challenges inherent in this dataset, and to inspire further research. A stronger representational architecture and UDA method are needed to jointly promote large-scale land cover mapping.

2 Related Work

| Image level | Resolution (m) | Dataset | Year | Sensor | Area (km²) | Classes | Image width | Images | Task |
|---|---|---|---|---|---|---|---|---|---|
| Meter level | 10 | LandCoverNet (alemohammad2020landcovernet) | 2020 | Sentinel-2 | 30000 | 7 | 256 | 1980 | SS |
| Meter level | 4 | GID (GID) | 2020 | GF-2 | 75900 | 5 | 4800–6300 | 150 | SS |
| Sub-meter level | 0.25–0.5 | LandCover.ai (boguszewski2021landcover) | 2020 | Airborne | 216.27 | 3 | 4200–9500 | 41 | SS |
| Sub-meter level | 0.6 | Zurich Summer (volpi2015semantic) | 2015 | QuickBird | 9.37 | 8 | 622–1830 | 20 | SS |
| Sub-meter level | 0.5 | DeepGlobe (demir2018deepglobe) | 2018 | WorldView-2 | 1716.9 | 7 | 2448 | 1146 | SS |
| Sub-meter level | 0.05 | Zeebruges (Zeebruges) | 2018 | Airborne | 1.75 | 8 | 10000 | 7 | SS |
| Sub-meter level | 0.05 | ISPRS Potsdam | 2013 | Airborne | 3.42 | 6 | 6000 | 38 | SS |
| Sub-meter level | 0.09 | ISPRS Vaihingen | 2013 | Airborne | 1.38 | 6 | 1887–3816 | 33 | SS |
| Sub-meter level | 0.07 | AIRS (chen2019) | 2019 | Airborne | 475 | 2 | 10000 | 1047 | SS |
| Sub-meter level | 0.5 | SpaceNet (van2018spacenet) | 2017 | WorldView-2 | 2544 | 2 | 406–439 | 6000 | SS |
| Sub-meter level | 0.3 | LoveDA (Ours) | 2021 | Spaceborne | 536.15 | 7 | 1024 | 5987 | SS, UDA |

  • The abbreviations are: SS – semantic segmentation; UDA – unsupervised domain adaptation.

Table 1: Comparison between LoveDA and the main land-cover semantic segmentation datasets.

2.1 Land-cover semantic segmentation datasets

Land-cover semantic segmentation, as a long-standing research topic, has been widely explored over the past decades. The early research relied on low- and medium-resolution datasets, such as MCD12Q1 (sulla2018user), the National Land Cover Database (NLCD) (jin2019national), GlobeLand30 (jun2014open), LandCoverNet (alemohammad2020landcovernet), etc. However, these studies all focused on large-scale mapping and analysis from a macro level. With the advancement of remote sensing technology, massive HSR images are now being obtained on a daily basis from both spaceborne and airborne platforms. Due to the advantages of clear geometric structure and fine texture, HSR land-cover datasets are tailored to specific scenes at a micro level. As shown in Table 1, datasets such as ISPRS Potsdam (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-potsdam.html), ISPRS Vaihingen (http://www2.isprs.org/commissions/comm3/wg4/2d-sem-label-vaihingen.html), Zurich Summer (volpi2015semantic), and Zeebruges (Zeebruges) are designed for urban parsing. These datasets only contain a small number of annotated images and cover limited areas. In contrast, DeepGlobe (demir2018deepglobe) and LandCover.ai (boguszewski2021landcover) focus on rural areas at a larger scale, in which the homogeneous areas contain few man-made structures. The GID dataset (GID) was collected with the Gaofen-2 satellite from different cities in China. Although the LandCoverNet and GID datasets contain both urban and rural areas, the geo-locations of the released images are private, so the urban and rural areas cannot be separated. In addition, the city identifications in the released GID images have already been removed, which makes it hard to perform UDA tasks. Considering the limited coverage and the annotation cost, the existing HSR datasets mainly promote research on improving land-cover segmentation accuracy, ignoring transferability. Compared with the land-cover datasets, the iSAID dataset (waqas2019isaid) focuses on the semantic segmentation of key objects; the different study objects bring different challenges for the different remote sensing tasks.

These HSR land-cover datasets have all promoted the development of semantic segmentation, and many variants of FCNs (long2015fully) have been evaluated on them (RSNet; chen2019collaborative; dong2020spectral; duan2020local). Recently, some UDA methods have been developed using a combination of two public datasets (yan2019triplet). However, directly utilizing combined datasets can result in two problems: 1) Insufficient common categories. Different datasets are designed for different purposes, and the insufficient common categories limit further exploration. 2) Inconsistent annotation granularity. The different spatial resolutions and labeling styles lead to different annotation granularities, which can result in unreliable conclusions. Compared with the existing datasets, the LoveDA dataset encompasses two domains (urban and rural), representing a novel UDA task for land-cover mapping.

2.2 Unsupervised domain adaptation

For natural images, UDA is aimed at transferring a model trained on the source domain to the target domain. Some conventional image classification studies (sun2016deep; tzeng2014deep; long2015learning) have directly minimized the discrepancy of the feature distributions to extract domain-invariant features. The recent works have mainly proceeded in two directions, i.e., adversarial training and self-training.

Adversarial training. In adversarial training, the architecture includes a feature extractor and a discriminator. The extractor aims to learn domain-invariant features, while the discriminator attempts to distinguish these features. For semantic segmentation, Tsai et al. (adaptseg) considered that the semantic outputs contain spatial similarities between the different domains, and adapted the structured output space for segmentation (AdaptSeg) with adversarial learning. Luo et al. (clan) introduced a category-level adversarial network (CLAN) to align each class with an adaptive adversarial loss. Differing from the binary discriminators, Wang et al. (fada) proposed a fine-grained adversarial learning framework for domain adaptive semantic segmentation (FADA), aligning the class-level features. From the aspect of structure, the transferable normalization (TransNorm) method (wang2019transferable) was proposed to enhance the transferability of the FCN-based feature extractors. All these advanced adversarial learning methods were implemented on the LoveDA dataset for evaluation.
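To make this recipe concrete, the following is a minimal PyTorch sketch of output-space adversarial adaptation in the spirit of AdaptSeg; the `segmenter`, `discriminator`, optimizers, and the single adversarial weight `lambda_adv` are illustrative placeholders rather than the exact published configurations.

```python
import torch
import torch.nn.functional as F

def train_step(segmenter, discriminator, opt_seg, opt_d,
               src_images, src_labels, tgt_images, lambda_adv=0.001):
    """One step of output-space adversarial adaptation (AdaptSeg-style sketch)."""
    SRC, TGT = 0.0, 1.0  # domain labels for the discriminator

    # 1) Train the segmenter: supervised loss on source + fool the discriminator.
    opt_seg.zero_grad()
    src_logits = segmenter(src_images)
    seg_loss = F.cross_entropy(src_logits, src_labels)

    tgt_logits = segmenter(tgt_images)
    d_out = discriminator(F.softmax(tgt_logits, dim=1))
    # Encourage target predictions to look source-like to the discriminator.
    adv_loss = F.binary_cross_entropy_with_logits(d_out, torch.full_like(d_out, SRC))
    (seg_loss + lambda_adv * adv_loss).backward()
    opt_seg.step()

    # 2) Train the discriminator to separate source from target predictions.
    opt_d.zero_grad()
    d_src = discriminator(F.softmax(src_logits.detach(), dim=1))
    d_tgt = discriminator(F.softmax(tgt_logits.detach(), dim=1))
    d_loss = (F.binary_cross_entropy_with_logits(d_src, torch.full_like(d_src, SRC))
              + F.binary_cross_entropy_with_logits(d_tgt, torch.full_like(d_tgt, TGT)))
    d_loss.backward()
    opt_d.step()
    return seg_loss.item(), d_loss.item()
```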

Self-training. Self-training involves alternately generating pseudo-labels on the target data and fine-tuning the model. Recently, the self-training UDA methods have focused on improving the quality of the pseudo-labels (crst; zhang2019category). Lian et al. (lian2019constructing) designed the self-motivated pyramid curriculum (PyCDA) to observe the target properties and fuse multi-scale features. Zou et al. (cbst) proposed a class-balanced self-training (CBST) strategy to sample pseudo-labels, thus avoiding the dominance of the large classes. Mei et al. (IAST) used an instance adaptive self-training (IAST) selector for sample balance. In addition to testing these self-training methods on the LoveDA dataset, we also performed a pseudo-label analysis for CBST.

UDA in the remote sensing community. The early UDA methods focused on scene classification tasks (othman2017domain; lu2019multisource). Recently, adversarial training (iqbal2020weakly; Tasar_2020_CVPR_Workshops) and self-training (GID) have been studied for UDA land-cover semantic segmentation. These methods follow the general UDA approaches of the computer vision field, with some improvements. However, with only the public datasets, the advancement of UDA algorithms has been limited by the insufficient shared categories and the inconsistent annotation granularity. To this end, the LoveDA dataset is proposed as a more challenging benchmark, to promote future research into remote sensing UDA algorithms and applications.

3 Dataset Description

3.1 Image Distribution and Division

Figure 1: Overview of the dataset distribution. The images were collected from Nanjing, Changzhou, and Wuhan cities, covering 18 different administrative districts.

The LoveDA dataset was constructed using 0.3 m images obtained from Nanjing, Changzhou, and Wuhan in July 2016, in total covering 536.15 km² (Figure 1). The historical images were obtained from the Google Earth platform. As each research area has its own planning strategy, the urban-rural ratio is inconsistent (yearbook).

Data from the rural and urban areas were collected referring to the “Urban and Rural Division Code” issued by the National Bureau of Statistics. Nine urban areas were selected from different economically developed districts, which are all densely populated (>1000 people/km²) (yearbook). The other nine rural areas were selected from undeveloped districts. The spatial resolution is 0.3 m, with red, green, and blue bands. After geometric registration and pre-processing, each area is covered by 1024 × 1024 images, without overlap. Considering Tobler’s First Law, i.e., everything is related to everything else, but near things are more related than distant things (tobler1970computer), the training, validation, and test sets were split so that they were spatially independent (Figure 1), thus enhancing the difference between the split sets. Two tasks can be evaluated on the LoveDA dataset: 1) Semantic segmentation. There are eight areas for training, and the others are for validation and testing. The training, validation, and test sets cover both urban and rural areas. 2) Unsupervised domain adaptation. The UDA process considers two cross-domain adaptation sub-tasks: a) Urban → Rural. The images from the Qinhuai, Qixia, Jianghan, and Gulou areas are included in the source training set. The images from Liuhe and Huangpi are included in the validation set. The Jiangning, Xinbei, and Liyang images are included in the test set. The Oracle setting is designed to test the upper limit of accuracy in a single domain (peng2018visda); hence, its training images were collected from the Pukou, Lishui, Gaochun, and Jiangxia areas. b) Rural → Urban. The images from the Pukou, Lishui, Gaochun, and Jiangxia areas are included in the source training set. The images from Yuhuatai and Jintan are used for the validation set. The Jianye, Wuchang, and Wujin images are used for the test set. In the Oracle setting, the training images cover the Qinhuai, Qixia, Jianghan, and Gulou areas.
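For quick reference, the urban ↔ rural division described above can be restated as a simple configuration. The dictionary below is only an illustrative structure (a hypothetical config, not the official LoveDA data loader); the region lists are copied from this section.

```python
# Illustrative restatement of the UDA splits described above
# (a hypothetical config, not the official LoveDA data loader).
LOVEDA_UDA_SPLITS = {
    "urban_to_rural": {
        "source_train": ["Qinhuai", "Qixia", "Jianghan", "Gulou"],   # urban
        "val": ["Liuhe", "Huangpi"],                                 # rural
        "test": ["Jiangning", "Xinbei", "Liyang"],                   # rural
        "oracle_train": ["Pukou", "Lishui", "Gaochun", "Jiangxia"],  # rural
    },
    "rural_to_urban": {
        "source_train": ["Pukou", "Lishui", "Gaochun", "Jiangxia"],  # rural
        "val": ["Yuhuatai", "Jintan"],                               # urban
        "test": ["Jianye", "Wuchang", "Wujin"],                      # urban
        "oracle_train": ["Qinhuai", "Qixia", "Jianghan", "Gulou"],   # urban
    },
}
```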

With the division of these images, a comprehensive annotation pipeline was adopted, including professional annotators and strict inspection procedures (waqas2019isaid). Further details of the data division and annotation can be found in §A.1.

3.2 Statistics for LoveDA

Some statistics of the LoveDA dataset are analyzed in this section. To allow a comparison with the existing public HSR land-cover datasets, the numbers of labeled objects and pixels were counted. As shown in Figure 2, the proposed LoveDA dataset contains the largest number of labeled pixels, as well as the largest number of land-cover objects, which shows its advantage in data diversity. There are a large number of buildings because the urban scenes have large populations (Figure 2). As shown in Figure 2, the background class contains the most pixels, with complex samples (pang2019mathcal; zheng2020foreground). The complex background samples have a larger intra-class variance in the complex scenes, and cause serious false alarms.

Figure 2: Statistics for the pixels and objects in the LoveDA dataset. (a) Number of objects vs. number of pixels; the radius of the circles represents the number of classes. (b) Histogram of the number of objects for each class. (c) Histogram of the number of pixels for each class.

Figure 3: Statistics for the urban and rural scenes in Nanjing City. (a) Class distributions. (b) Spectral values: the mean and standard deviation for five urban and five rural areas are reported. (c) Building scales: the distribution of the building sizes in the Jianye (urban) and Lishui (rural) scenes.

3.3 Differences Between Urban and Rural Scenes

During the process of urbanization, cities differentiate into rural and urban forms. In this section, we list the main differences, which reveal the meaning and the challenges of the remote sensing UDA task. For Nanjing City, the main differences come from the shape, layout, scale, spectra, and class distribution. As shown in Figure 1, the buildings in the urban areas are neatly arranged, with various shapes, while the buildings in the rural areas are disordered, with simpler shapes. The roads are wide in the urban scenes, but narrow in the rural scenes. Water is often presented in the form of large-scale rivers or lakes in the urban scenes, while small-scale ponds and ditches are common in the rural scenes. Agricultural land is found in the gaps between the buildings in the urban scenes, but occurs in a large-scale and continuously distributed form in the rural scenes.

For the class distribution, spectra, and scale, the related statistics are reported in Figure 3. The urban areas always contain more man-made objects, such as buildings and roads, due to their high population density (Figure 3(a)). In contrast, the rural areas have more agricultural land. The inconsistent class distributions between the urban and rural scenes increase the difficulty of model generalization. For the spectral statistics, the mean values are similar (Figure 3(b)). Because of the large-scale homogeneous geographical areas, such as agriculture and water, the rural images have lower standard deviations. As shown in Figure 3(c), most of the buildings have relatively small scales in the rural areas, presenting a “long tail” phenomenon, whereas the buildings in the urban scenes have a larger size variance. Scale differences also exist in the other categories, as shown in Figure 1. The multi-scale objects require the models to have multi-scale capture capabilities. When faced with large-scale land-cover mapping tasks, the differences between the urban and rural scenes bring new challenges for the model transferability.
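Class-distribution statistics of this kind can be reproduced directly from the label masks. The snippet below is a minimal sketch, assuming single-channel PNG masks whose pixel values index the seven classes; the exact value mapping should be checked against the released LoveDA label encoding.

```python
import numpy as np
from PIL import Image
from pathlib import Path

CLASSES = ["background", "building", "road", "water",
           "barren", "forest", "agriculture"]

def class_pixel_histogram(mask_dir):
    """Count pixels per land-cover class over a directory of label masks.

    Assumes single-channel masks whose values index CLASSES; adapt the
    mapping to the actual LoveDA label encoding before use.
    """
    counts = np.zeros(len(CLASSES), dtype=np.int64)
    for path in Path(mask_dir).glob("*.png"):
        mask = np.asarray(Image.open(path))
        counts += np.bincount(mask.ravel(),
                              minlength=len(CLASSES))[:len(CLASSES)]
    return dict(zip(CLASSES, counts))
```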

4 Experiments

4.1 Semantic Segmentation

For the semantic segmentation task, the general architectures, as well as their variants, and particularly those most often used in remote sensing, were tested on the LoveDA dataset. Specifically, the selected networks were: UNet (ronneberger2015u), UNet++ (zhou2018unet++), LinkNet (chaurasia2017linknet), DeepLabV3+ (chen2018encoder), PSPNet (zhao2017pyramid), FCN8S (long2015fully), PAN (li2018pyramid), Semantic-FPN (kirillov2019panoptic), HRNet (wang2020deep), FarSeg (zheng2020foreground), and FactSeg (ma2021factseg). Following the common practice (wang2020deep; long2015fully), we use the intersection over union (IoU) to report the semantic segmentation accuracy. With respect to the IoU for each class, the mIoU represents the mean of the IoUs over all the categories. The inference speed is reported for a single input (repeated 500 times), in frames per second (FPS).
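For reference, the snippet below sketches how the per-class IoU and the mIoU can be accumulated from predictions; it is a generic implementation of the standard definitions, not code taken from the LoveDA repository.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=7):
    """Accumulate a confusion matrix from flat prediction/ground-truth arrays."""
    mask = (gt >= 0) & (gt < num_classes)  # ignore invalid pixels
    idx = num_classes * gt[mask].astype(int) + pred[mask].astype(int)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes,
                                                                num_classes)

def iou_per_class(cm):
    """IoU_c = TP_c / (TP_c + FP_c + FN_c); the mIoU is the mean over classes."""
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    return tp / np.maximum(tp + fp + fn, 1)

# miou = iou_per_class(cm).mean()
```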

| Method | Backbone | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU (%) | Speed (FPS) |
|---|---|---|---|---|---|---|---|---|---|---|
| FCN8S (long2015fully) | VGG16 | 42.60 | 49.51 | 48.05 | 73.09 | 11.84 | 43.49 | 58.30 | 46.69 | 86.02 |
| DeepLabV3+ (chen2018encoder) | ResNet50 | 42.97 | 50.88 | 52.02 | 74.36 | 10.40 | 44.21 | 58.53 | 47.62 | 75.33 |
| PAN (li2018pyramid) | ResNet50 | 43.04 | 51.34 | 50.93 | 74.77 | 10.03 | 42.19 | 57.65 | 47.13 | 61.09 |
| UNet (ronneberger2015u) | ResNet50 | 43.06 | 52.74 | 52.78 | 73.08 | 10.33 | 43.05 | 59.87 | 47.84 | 71.35 |
| UNet++ (zhou2018unet++) | ResNet50 | 42.85 | 52.58 | 52.82 | 74.51 | 11.42 | 44.42 | 58.80 | 48.20 | 27.22 |
| Semantic-FPN (kirillov2019panoptic) | ResNet50 | 42.93 | 51.53 | 53.43 | 74.67 | 11.21 | 44.62 | 58.68 | 48.15 | 73.98 |
| PSPNet (zhao2017pyramid) | ResNet50 | 44.40 | 52.13 | 53.52 | 76.50 | 9.73 | 44.07 | 57.85 | 48.31 | 74.81 |
| LinkNet (chaurasia2017linknet) | ResNet50 | 43.61 | 52.07 | 52.53 | 76.85 | 12.16 | 45.05 | 57.25 | 48.50 | 67.01 |
| FarSeg (zheng2020foreground) | ResNet50 | 43.09 | 51.48 | 53.85 | 76.61 | 9.78 | 43.33 | 58.90 | 48.15 | 66.99 |
| FactSeg (ma2021factseg) | ResNet50 | 42.60 | 53.63 | 52.79 | 76.94 | 16.20 | 42.92 | 57.50 | 48.94 | 65.58 |
| HRNet (wang2020deep) | W32 | 44.61 | 55.34 | 57.42 | 73.96 | 11.07 | 45.25 | 60.88 | 49.79 | 16.74 |

Table 2: Semantic segmentation results (IoU per category, %) obtained on the test set of LoveDA.
| Method | Baseline mIoU (%) | +MSTr | +MSTr&Te |
|---|---|---|---|
| DeepLabV3+ | 47.62 | 50.27 | 51.17 |
| UNet | 48.00 | 50.66 | 51.28 |
| SFPN | 48.15 | 50.19 | 51.09 |
| HRNet | 49.79 | 52.15 | 52.72 |

Table 3: Multi-scale augmentation during training (MSTr) and during both training and testing (MSTr&Te).
Figure 4: Representative confusion matrices for the semantic segmentation experiments: (a) Semantic-FPN; (b) HRNet.

Implementation details. The data splits followed Table 8 in §A.1. During the training, we used the stochastic gradient descent (SGD) optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate was initially set to 0.01, and a ‘poly’ schedule with a power of 0.9 was applied. The number of training iterations was set to 15k, with a batch size of 16. For the data augmentation, 512 × 512 patches were randomly cropped from the raw images, with random mirroring and rotation. The backbones used in all the networks were pre-trained on ImageNet.
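As a small illustration, a minimal sketch of the ‘poly’ learning-rate schedule described above is:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' decay: lr = base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / float(max_iter)) ** power

# Example: halfway through training, the rate has decayed to roughly half.
# poly_lr(0.01, 7500, 15000)  ->  ~0.0054
```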

Multi-scale architectures and strategies. As ground objects show considerable scale variance, especially in complex scenes (§3.3), we analyzed the multi-scale architectures and strategies. There are three noticeable observations from Table 2: 1) UNet++ outperforms UNet, due to its nested cross-scale connections between different scales. 2) Among the different fusion strategies, UNet++, Semantic-FPN, LinkNet, and HRNet outperform DeepLabV3+. This demonstrates that cross-layer fusion works better than in-module fusion. 3) HRNet outperforms the other methods, due to its sophisticated architecture, where features are repeatedly exchanged across different scales. As shown in Table 3, multi-scale augmentation (with scales of {0.5, 0.75, 1.0, 1.25, 1.5, 1.75}) was conducted during training (MSTr), significantly improving the performance of the different methods. In the implementation, the multi-scale inference adopts multi-scale inputs and ensembles the rescaled multiple outputs using a simple mean function. With its further use in the testing process (MSTr&Te), all the methods were further improved. As for multi-scale fusion, multi-level and multi-scale architecture search (RSNet) may also become an effective solution.
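A minimal sketch of this mean-ensemble inference is given below; the scale set follows the text, while averaging softmax probabilities (rather than raw logits) is an assumption about the implementation.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def multi_scale_inference(model, image,
                          scales=(0.5, 0.75, 1.0, 1.25, 1.5, 1.75)):
    """Run the model at several input scales and average the rescaled outputs.

    `image` is an (N, 3, H, W) tensor; the ensemble is a simple mean of the
    per-scale softmax maps, as described in the text.
    """
    n, _, h, w = image.shape
    prob_sum = 0.0
    for s in scales:
        scaled = F.interpolate(image, scale_factor=s, mode="bilinear",
                               align_corners=False)
        logits = model(scaled)
        logits = F.interpolate(logits, size=(h, w), mode="bilinear",
                               align_corners=False)
        prob_sum = prob_sum + logits.softmax(dim=1)
    return (prob_sum / len(scales)).argmax(dim=1)  # (N, H, W) class map
```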

Additional background supervision. The complex background samples cause serious false alarms in HSR imagery semantic segmentation (zheng2020foreground; everingham2015pascal). As shown in Figure 4, the confusion matrices show that many objects are misclassified into background, which is consistent with our analysis in §3.2. Based on Semantic-FPN, we designed an additional background supervision to address this problem. A Dice loss (milletari2016v) and a binary cross-entropy loss were utilized, with corresponding modulation factors α and β. We calculated the total loss as: L_total = L_ce + α·L_dice + β·L_bce, where L_ce denotes the original cross-entropy loss, and L_dice and L_bce are computed on the background class. Table 4 and Table 5 additionally report the precision (P), recall (R), and F1-score (F1) of the background class with varying modulation factors; the standard deviations over three runs are also reported. Table 4 shows that the addition of the binary cross-entropy loss improves the background accuracy and the overall performance. The combination of α = 0.5 and β = 0.7 performs well, because the two losses optimize the background class from different directions. In the future, the spatial attention mechanism (mou2019relation) may improve the background further, with adaptive weights.
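The combined objective can be sketched as follows. Restricting the two auxiliary losses to the background channel follows the description above; the default α and β are simply the best-performing combination from Table 5, and the Dice formulation shown is one common variant.

```python
import torch
import torch.nn.functional as F

def dice_loss(prob, target, eps=1.0):
    """Soft Dice loss on a single-channel probability map."""
    inter = (prob * target).sum()
    return 1.0 - (2.0 * inter + eps) / (prob.sum() + target.sum() + eps)

def total_loss(logits, labels, alpha=0.5, beta=0.7, background_id=0):
    """L_total = L_ce + alpha * L_dice + beta * L_bce on the background class.

    A sketch of the additional background supervision; alpha and beta are the
    modulation factors varied in Tables 4 and 5.
    """
    l_ce = F.cross_entropy(logits, labels)
    bg_prob = logits.softmax(dim=1)[:, background_id]     # (N, H, W)
    bg_target = (labels == background_id).float()
    l_dice = dice_loss(bg_prob, bg_target)
    l_bce = F.binary_cross_entropy(bg_prob, bg_target)
    return l_ce + alpha * l_dice + beta * l_bce
```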

| β | Background P (%) | Background R (%) | Background F1 (%) | mIoU (%) |
|---|---|---|---|---|
| 0 | 55.46 | 61.01 | 59.86 | 48.15 ± 0.17 |
| 0.2 | 57.70 | 63.36 | 60.39 | 48.50 ± 0.13 |
| 0.5 | 56.92 | 65.86 | 61.06 | 48.85 ± 0.15 |
| 0.7 | 57.73 | 64.62 | 61.98 | 48.74 ± 0.19 |
| 0.9 | 57.30 | 64.05 | 60.48 | 48.26 ± 0.14 |
| 1.0 | 58.43 | 62.64 | 60.46 | 48.14 ± 0.18 |

Table 4: Varied β for L_bce.
| α | β | Background P (%) | Background R (%) | Background F1 (%) | mIoU (%) |
|---|---|---|---|---|---|
| 0.2 | 0.5 | 56.68 | 64.82 | 60.47 | 48.97 ± 0.16 |
| 0.5 | 0.5 | 56.88 | 65.16 | 60.96 | 49.23 ± 0.09 |
| 0.7 | 0.5 | 57.13 | 65.31 | 60.93 | 49.68 ± 0.14 |
| 0.2 | 0.7 | 56.91 | 66.03 | 61.13 | 49.69 ± 0.17 |
| 0.5 | 0.7 | 57.14 | 66.21 | 61.34 | 50.08 ± 0.15 |
| 0.7 | 0.7 | 56.68 | 65.52 | 60.78 | 49.48 ± 0.13 |

Table 5: Varied α for L_dice (with β near its optimum).
Figure 5: Semantic segmentation results on images from the LoveDA test set in the Liuhe (Rural) area: (a) image, (b) ground truth, (c) FCN8S, (d) DeepLabV3+, (e) PSPNet, (f) UNet, (g) UNet++, (h) PAN, (i) Semantic-FPN, (j) LinkNet, (k) FarSeg, and (l) HRNet. Some small-scale objects, such as buildings and scattered trees, are hard to recognize. The forest and agriculture classes are easy to misclassify, due to their similar spectra.

Visualization. Some representative results are shown in Figure 5. With its shallow backbone (VGG16), FCN8S can hardly recognize the roads, due to its lack of feature extraction capability. The other methods, which utilize deeper layers, produce better results. Because of the disorderly arrangement and varied scales, the edges of the buildings are hard to extract accurately. Some small-scale objects, such as buildings and scattered trees, are easy to miss. In contrast, the water class achieves higher accuracies for all the methods. This is because water has a strong spectral homogeneity and a low intra-class variance (GID). Forest is easily misclassified as agriculture, because these classes have similar spectra. Because of its high-resolution retention and multi-scale fusion, HRNet produces the best visualization results, especially in the details.

4.2 Unsupervised Domain Adaptation

The advanced UDA methods were evaluated on the LoveDA dataset. In addition to the original metric-based approach of MCD (tzeng2014deep), two mainstream UDA approaches were tested, i.e., adversarial training (AdaptSeg (adaptseg), CLAN (clan), TransNorm (wang2019transferable), and FADA (fada)) and self-training (CBST (cbst), PyCDA (lian2019constructing), and IAST (IAST)).

| Domain | Method | Type | Background | Building | Road | Water | Barren | Forest | Agriculture | mIoU (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Rural → Urban | Oracle | – | 48.18 | 52.14 | 56.81 | 85.72 | 12.34 | 36.70 | 35.66 | 46.79 |
| Rural → Urban | Source only | – | 43.30 | 25.63 | 12.70 | 76.22 | 12.52 | 23.34 | 25.14 | 31.27 |
| Rural → Urban | MCD (tzeng2014deep) | – | 43.60 | 15.37 | 11.98 | 79.07 | 14.13 | 33.08 | 23.47 | 31.53 |
| Rural → Urban | AdaptSeg (adaptseg) | AT | 42.35 | 23.73 | 15.61 | 81.95 | 13.62 | 28.70 | 22.05 | 32.68 |
| Rural → Urban | FADA (fada) | AT | 43.89 | 12.62 | 12.76 | 80.37 | 12.70 | 32.76 | 24.79 | 31.41 |
| Rural → Urban | CLAN (clan) | AT | 43.41 | 25.42 | 13.75 | 79.25 | 13.71 | 30.44 | 25.80 | 33.11 |
| Rural → Urban | TransNorm (wang2019transferable) | AT | 38.37 | 5.04 | 3.75 | 80.83 | 14.19 | 33.99 | 17.91 | 27.73 |
| Rural → Urban | PyCDA (lian2019constructing) | ST | 38.04 | 35.86 | 45.51 | 74.87 | 7.71 | 40.39 | 11.39 | 36.25 |
| Rural → Urban | CBST (cbst) | ST | 48.37 | 46.10 | 35.79 | 80.05 | 19.18 | 29.69 | 30.05 | 41.32 |
| Rural → Urban | IAST (IAST) | ST | 48.57 | 31.51 | 28.73 | 86.01 | 20.29 | 31.77 | 36.50 | 40.48 |
| Urban → Rural | Oracle | – | 37.18 | 52.74 | 43.74 | 65.89 | 11.47 | 45.78 | 62.91 | 45.67 |
| Urban → Rural | Source only | – | 24.16 | 37.02 | 32.56 | 49.42 | 14.00 | 29.34 | 35.65 | 31.74 |
| Urban → Rural | MCD (tzeng2014deep) | – | 25.61 | 44.27 | 31.28 | 44.78 | 13.74 | 33.83 | 25.98 | 31.36 |
| Urban → Rural | AdaptSeg (adaptseg) | AT | 26.89 | 40.53 | 30.65 | 50.09 | 16.97 | 32.51 | 28.25 | 32.27 |
| Urban → Rural | FADA (fada) | AT | 24.39 | 32.97 | 25.61 | 47.59 | 15.34 | 34.35 | 20.29 | 28.65 |
| Urban → Rural | CLAN (clan) | AT | 22.93 | 44.78 | 25.99 | 46.81 | 10.54 | 37.21 | 24.45 | 30.39 |
| Urban → Rural | TransNorm (wang2019transferable) | AT | 19.39 | 36.30 | 22.04 | 36.68 | 14.00 | 40.62 | 3.30 | 24.62 |
| Urban → Rural | PyCDA (lian2019constructing) | ST | 12.36 | 38.11 | 20.45 | 57.16 | 18.32 | 36.71 | 41.90 | 32.14 |
| Urban → Rural | CBST (cbst) | ST | 25.06 | 44.02 | 23.79 | 50.48 | 8.33 | 39.16 | 49.65 | 34.36 |
| Urban → Rural | IAST (IAST) | ST | 29.97 | 49.48 | 28.29 | 64.49 | 2.13 | 33.36 | 61.37 | 38.44 |

  • The abbreviations are: AT – adversarial training methods; ST – self-training methods.

Table 6: Unsupervised domain adaptation results (IoU, %) obtained on the test set of the LoveDA dataset.

Implementation details. All the UDA methods adopted the same feature extractor and discriminator, following the common practice (adaptseg; clan; fada). Specifically, DeepLabV2 (deeplabv2) with ResNet50 was utilized as the extractor, and the discriminator was constructed from fully convolutional layers (adaptseg). For the adversarial training (AT) methods, the classification and discriminator learning rates were set to 0.01 and 1e-4, respectively. The Adam optimizer was used for the discriminator, with momentum parameters of 0.9 and 0.99. The number of training iterations was set to 10k, with a batch size of 16; eight source images and eight target images were alternately input. The other settings, including the learning schedule, were the same as in the semantic segmentation experiments. For the self-training (ST) methods, the classification learning rate was set to 0.01. Full implementation details are provided in §A.4.

Benchmark results. As shown in Table 6, the Oracle setting obtains the best overall performance. However, DeepLabV2 loses its effectiveness under the domain divergence, as shown by the Source-only results. In the Rural → Urban experiments, the accuracies of the artificial classes (building and road) drop more than those of the natural classes (forest and agriculture). Because of the inconsistent class distribution, the Urban → Rural experiments show the opposite results. The transfer learning methods relatively improve the model transferability. Noticeably, TransNorm obtains the lowest mIoUs. This is because the source and target images were obtained by the same sensor, and their spectral statistics are similar (Figure 3(b)). The rural and urban domains require similar normalization weights, so the adaptive normalization can lead to optimization conflicts (more analysis is provided in §A.6). The ST methods achieve better performance because they address the class imbalance problem with pseudo-label generation.

Inconsistent class distributions. It is notable that the ST methods surpass the AT methods in the cross-domain adaptation experiments. We conclude that the main reason for this is the extremely inconsistent class distributions (Figure 3(a)). The rural scenes only contain a few artificial samples and large-scale natural objects. In contrast, the urban scenes have a mixture of buildings and roads, with few natural objects. The AT methods cannot address this difficulty, and thus report lower accuracies. However, differing from the AT methods, the ST methods generate pseudo-labels on the target images. With the addition of diverse target samples, the class distribution divergence is reduced during the training. Overall, the ST methods show more potential for the UDA land-cover semantic segmentation task. In the Urban → Rural experiments, all the UDA methods show negative transfer effects for the road class. Hence, more tailored UDA methods are worth exploring to face these special challenges.

Visualization. The qualitative results for the Rural → Urban experiments are shown in Figure 6. The Oracle result successfully recognizes the buildings and roads, and is the closest to the ground truth. According to Table 2, it could be further improved by using a more robust backbone. The ST methods (j)–(l) produce better results than the AT methods (f)–(i), but there is still much room for improvement. The large-scale mapping visualizations are provided in §A.7.

Figure 6: Visual results for the Rural → Urban experiments: (a) image, (b) ground truth, (c) Oracle, (d) Source only, (e) MCD, (f) AdaptSeg, (g) CLAN, (h) TransNorm, (i) FADA, (j) PyCDA, (k) CBST, and (l) IAST. Panels (f)–(i) and (j)–(l) were obtained from the AT and ST methods, respectively. The ST methods produce better results than the AT methods.

Pseudo-label analysis for CBST. As the pseudo samples are important for addressing the inconsistent class distribution problem, we varied the target class proportion p in CBST, which is the hyper-parameter controlling the number of pseudo samples. The mean F1-score (mF1) and mIoU are reported in Table 7. Without pseudo-label learning (p = 0), the model degenerates into the Source-only setting and achieves a low accuracy. The range of well-performing values of p is relatively large, which shows that CBST is not overly sensitive to this hyper-parameter in the remote sensing UDA task.

| p | 0 | 0.01 | 0.05 | 0.1 | 0.5 | 0.7 | 0.9 | 1.0 |
|---|---|---|---|---|---|---|---|---|
| mF1 (%) | 46.81 | 45.24 | 48.50 | 50.93 | 56.30 | 51.23 | 51.03 | 49.43 |
| mIoU (%) | 32.94 | 32.18 | 34.46 | 36.84 | 41.32 | 37.12 | 37.02 | 35.47 |

Table 7: Varied target class proportion p (Rural → Urban).
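For illustration, the following NumPy sketch shows how a target class proportion p can be turned into per-class confidence thresholds in a CBST-style selector; it is a simplified re-implementation of the idea, not the reference CBST code.

```python
import numpy as np

def class_balanced_thresholds(probs, preds, p):
    """Per-class confidence thresholds for CBST-style pseudo-label selection.

    For each class, keep the most confident fraction `p` of the pixels that
    were predicted as that class; pixels below their class threshold are
    ignored when fine-tuning on the target domain. `probs` is a (C, H, W)
    softmax map and `preds` the (H, W) argmax prediction.
    """
    num_classes = probs.shape[0]
    conf = probs.max(axis=0)                      # confidence of the argmax class
    thresholds = np.ones(num_classes)             # default: select nothing
    for c in range(num_classes):
        c_conf = np.sort(conf[preds == c])[::-1]  # descending confidence
        if len(c_conf) > 0 and p > 0:
            k = max(int(len(c_conf) * p) - 1, 0)
            thresholds[c] = c_conf[k]
    return thresholds  # pseudo-label y=c is kept where conf >= thresholds[c]
```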

5 Conclusion

The differences between urban and rural scenes limit the generalization of deep learning approaches in land-cover mapping. In order to address this problem, we have built an HSR dataset for Land-cOVEr Domain Adaptive semantic segmentation (LoveDA). The LoveDA dataset reflects three challenges in large-scale remote sensing mapping: multi-scale objects, complex background samples, and inconsistent class distributions. State-of-the-art methods were evaluated on the LoveDA dataset, revealing its challenges. In addition, exploratory studies on multi-scale architectures and strategies, additional background supervision, and pseudo-label analysis were conducted to find alternative ways to address these challenges.

6 Acknowledgments

This work was supported by National Key Research and Development Program of China under Grant No. 2017YFB0504202, National Natural Science Foundation of China under Grant Nos. 41771385, 41801267, and the China Postdoctoral Science Foundation under Grant 2017M622522. This work was supported by the Nanjing Bureau of Surveying and Mapping.

7 Broader Impact

This work offers a free and open dataset, with the purpose of advancing land-cover semantic segmentation in the area of remote sensing. We also provide two benchmarked tasks with three considerable challenges. This will allow other researchers to easily build on this work and create new and enhanced capabilities. The authors do not foresee any negative societal impacts of this work. A potential positive societal impact may arise from the development of generalizable models that can accurately produce large-scale HSR land-cover maps. This could help to reduce the manpower and material resource consumption of surveying and mapping.


Appendix A Appendix

A.1 Annotation Procedure and Data Division

The seven common land-cover types were developed according to the “Data Regulations and Collection Requirements for the General Survey of Geographical Conditions”, i.e., the building, road, water, barren, forest, agriculture, and background classes. Using the ArcGIS geo-spatial software, all the images were annotated by professional remote sensing annotators. With the division of these images, a comprehensive annotation pipeline was adopted, referring to (waqas2019isaid). The annotators labeled all the objects belonging to the six categories (except background) using polygon features. For the 18 selected areas, it took approximately 24.6 h to finish the annotation of a single area, resulting in a time cost of 442.8 man-hours in total. After the first round of labeling, self-examination and cross-examination were conducted, correcting the false labels, missing objects, and inaccurate boundaries. The team supervisors then randomly sampled 600 images for quality inspection, and the unqualified annotations were refined by the annotators. Finally, several statistics (e.g., object numbers per image, object areas, etc.) were computed to double-check for outliers. Based on DeepLabV3, preliminary experiments were conducted to ensure the validity of the annotations.

| Domain | City | Region | #Images | Train | Val | Test |
|---|---|---|---|---|---|---|
| Urban | Nanjing | Qixia | 320 | ✓ | | |
| Urban | Nanjing | Gulou | 320 | ✓ | | |
| Urban | Nanjing | Qinhuai | 336 | ✓ | | |
| Urban | Nanjing | Yuhuatai | 357 | | ✓ | |
| Urban | Nanjing | Jianye | 357 | | | ✓ |
| Urban | Changzhou | Jintan | 320 | | ✓ | |
| Urban | Changzhou | Wujin | 320 | | | ✓ |
| Urban | Wuhan | Jianghan | 180 | ✓ | | |
| Urban | Wuhan | Wuchang | 143 | | | ✓ |
| Rural | Nanjing | Pukou | 320 | ✓ | | |
| Rural | Nanjing | Gaochun | 336 | ✓ | | |
| Rural | Nanjing | Lishui | 336 | ✓ | | |
| Rural | Nanjing | Liuhe | 320 | | ✓ | |
| Rural | Nanjing | Jiangning | 336 | | | ✓ |
| Rural | Changzhou | Liyang | 320 | | | ✓ |
| Rural | Changzhou | Xinbei | 320 | | | ✓ |
| Rural | Wuhan | Jiangxia | 374 | ✓ | | |
| Rural | Wuhan | Huangpi | 672 | | ✓ | |
| Total | | | 5987 | 2522 | 1669 | 1796 |

Table 8: The division of the LoveDA dataset. The split assignments follow the description in §3.1.

A.2 Top performances compared with other datasets

In order to demonstrate how challenging the proposed dataset is compared with the other land-cover datasets, we investigated the current research and report the top performances on the different datasets in Table 9. The advanced HRNet achieved the lowest top performance on the LoveDA dataset, showing the difficulty of this dataset.

| Dataset | Method | Top mIoU (%) |
|---|---|---|
| GID | (RSNet) | 93.54 |
| DeepGlobe | (Tian_2018_CVPR_Workshops) | 52.24 |
| ISPRS Potsdam | (mou2019relation) | 82.38 |
| ISPRS Vaihingen | (mou2019relation) | 79.76 |
| LoveDA | HRNet (Table 2) | 49.79 |

Table 9: Top performances on LoveDA compared with the other datasets.

A.3 Instance Differences Between Urban and Rural Areas

For the LoveDA dataset, the differences between the urban and rural areas at the instance level are shown in Figure 7. Similar to the pixel-level analysis in §3.3, the instances across the domains are imbalanced. Specifically, the urban areas have more buildings and fewer instances of agricultural land, while the rural areas have more instances of agricultural land. This again highlights the inconsistent class distribution problem between the different domains.

Figure 7: Instance differences between urban and rural areas.

A.4 Implementation Details

All the networks were implemented under the PyTorch framework, using an NVIDIA RTX TITAN GPU with 24 GB of memory. The backbones used in all the networks were pre-trained on ImageNet. The number of training iterations was set to 10k, with a batch size of 16. The eight source images and eight target images were alternately input, and the other settings were the same as in the semantic segmentation experiments. As for the self-training (ST) methods, the pseudo-label generation hyper-parameters remained the same as in the original literature, and the classification learning rate was set to 0.01. All the ST-based networks were trained in two stages: 1) the models were first trained only on the source images, for initialization; and 2) the pseudo-labels were then updated periodically during the remaining training process. Considering the training stability, the IAST method was given a longer initialization stage in the Urban → Rural experiments.

All the networks were re-implemented following the original literature. The segmentation models followed the default settings in (adaptseg), including a modified ResNet50 and atrous spatial pyramid pooling (ASPP) (deeplabv2). By using dilated convolutions, the stride of the last two down-sampling stages was modified from 2 to 1, so that the final output stride of the feature map was 8.

Following (adaptseg), the discriminator was made up of five convolutional layers with 4 × 4 kernels and a stride of 2, where the channel numbers were {64, 128, 256, 512, 1}, respectively. Each convolution, except the last, was followed by a Leaky ReLU with the negative slope set to 0.2. Bilinear interpolation was used to re-scale the output to the size of the input.
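A PyTorch sketch of this discriminator, matching the description above, is shown below; the base width `ndf` = 64 gives the channel progression {64, 128, 256, 512, 1}.

```python
import torch.nn as nn

def build_discriminator(num_classes=7, ndf=64, negative_slope=0.2):
    """Fully convolutional discriminator as described above: five 4x4,
    stride-2 convolutions, each (except the last) followed by a Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(num_classes, ndf, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(negative_slope, inplace=True),
        nn.Conv2d(ndf, ndf * 2, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(negative_slope, inplace=True),
        nn.Conv2d(ndf * 2, ndf * 4, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(negative_slope, inplace=True),
        nn.Conv2d(ndf * 4, ndf * 8, kernel_size=4, stride=2, padding=1),
        nn.LeakyReLU(negative_slope, inplace=True),
        nn.Conv2d(ndf * 8, 1, kernel_size=4, stride=2, padding=1),  # domain logit map
    )
```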

As for the hyper-parameter settings, the adversarial scale factor λ_adv was set to 0.001, following (clan; fada). With respect to the two segmentation outputs in (adaptseg), λ_adv1 and λ_adv2 were set to 0.001 and 0.0002, respectively. The weight discrepancy loss was used in CLAN (clan), with its default settings adopted. FADA (fada) adopts a temperature to encourage a soft probability distribution over the classes, which was set to 1.8 by default. The pseudo-label confidence in PyCDA (lian2019constructing) was kept at its default value. The pseudo-label related hyper-parameters for IAST remained the same as in (IAST). The target class proportion p in CBST was set separately when transferring to the rural and urban domains; for the Rural → Urban direction, p = 0.5 performed best (Table 7).

A.5 Error bar visualization for the UDA experiments

In order to make the results more convincing and reproducible, we ran all the UDA methods five times, using random seeds. The error bars for the UDA experiments are shown in Figure 8. The adversarial training methods show smaller error fluctuations than the self-training methods. This is because the self-training methods assign and update the pseudo-labels alternately, which brings greater randomness. Hence, for the self-training methods, we suggest that three or more repeated runs are preferred, to provide more convincing results.

Figure 8: Error bar visualization for the UDA experiments.

A.6 Batch Normalization Statistics in the Different Domains

The batch normalization (BN) statistics are shown in Figure 9. We observe that, in the Oracle source and target settings, the model has similar BN statistics in both mean and variance. This demonstrates that the gap between the source and target domains does not lie in the BN layers, which differs from the conclusion in (wang2019transferable). Hence, the modification of the BN statistics can have a negative effect, as in TransNorm (wang2019transferable), where the target BN statistics end up far from those of the Oracle target model. This observation is consistent with the results listed in Table 6. We speculate that the cause of this failure is that, in the combined and simulated dataset UDA experiments (fada; clan; wang2019transferable), the source and target domains have large spectral differences, and thus require domain-specific BN statistics. However, the LoveDA dataset consists of real data obtained from the same sensor at the same time. The spectral difference between the source and target domains is very small (Figure 3(b)), so the BN statistics are very similar (Figure 9).
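Statistics of this kind can be extracted with a few lines of PyTorch; the sketch below simply walks a model and records the running mean and variance of every BN layer, as used to produce the comparison in Figure 9.

```python
import torch.nn as nn

def collect_bn_statistics(model):
    """Gather the running mean/variance of every BatchNorm2d layer in a model."""
    stats = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.BatchNorm2d):
            stats[name] = {
                "running_mean": module.running_mean.detach().cpu().clone(),
                "running_var": module.running_var.detach().cpu().clone(),
            }
    return stats
```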

Figure 9: Statistics of the running mean (RM) and running variance (RV) of the batch normalization layers in the four residual stages (Layers 1–4) of ResNet50. The two Oracle models and TransNorm in the Urban → Rural experiments are shown.

A.7 Large-scale Visualizations on the UDA Test Set

The large-scale visualizations are shown in Figure 10. Compared with the baseline, CBST produces better results in large-scale mapping, which highlights the importance of developing UDA methods. However, CBST still has a lot of room for improvement, and more tailored UDA algorithms need to be developed on the LoveDA dataset.

Figure 10: Large-scale visualizations on the UDA test set (Rural → Urban): (a) the baseline and (b) CBST on the Wujin area.