Change detection (CD) is one of the most widely used interpretation techniques in remote sensing and has been studied intensively over the years [Singh1989]. Nonetheless, most traditional CD models exploit only low-level features of multispectral images, which are insufficient to represent the key information in the original images. Recently, deep learning (DL) has shown great promise in computer vision and in the interpretation of remote sensing images. Hence, a number of DL-based CD methods have been developed [Zhu2017, Chen2019].
However, training these DL-based CD methods requires a large amount of labeled data, and manually labeling samples is labor-intensive, especially for remote sensing images. Besides, deep networks are often task-specific; in other words, they generalize relatively poorly. Moreover, owing to factors such as noise and distortions, sensor characteristics, and imaging conditions, the data distributions of different CD data sets are often quite dissimilar. Thus, a deep network trained on one multi-temporal data set with abundant labeled samples suffers degraded performance when it is transferred to a new multi-temporal data set, which makes it unavoidable to manually label numerous samples in the new data set. Nowadays, massive amounts of remote sensing images are available from satellite sensors, and these images provide diverse and abundant information about the covered regions. Therefore, there is a strong incentive to develop an efficient CD model that is trained on a data set (source domain) with enough labeled data but can be easily transferred to a new data set (target domain) with very limited (or even no) labeled data. This can be defined as a domain adaptation problem in the area of change detection.
Considering the above issues, this paper proposes a novel deep network architecture called DSDANet for cross-domain CD. By incorporating a domain discrepancy metric, MK-MMD, into the network architecture, DSDANet can learn transferable features under which the distributions of the two domains are similar. To the best of the authors’ knowledge, this is the first time that such a deep network based on domain adaptation has been designed for CD in multispectral images.
Owing to many factors, the probability distributions characterizing the source domain and the target domain are dissimilar. Because only limited (or no) labeled data are available in the target domain, it is challenging to construct a model that can match the two domains and learn a transferable representation. An efficient and common approach is to combine the CD error with a domain discrepancy metric.
A widely used metric is the maximum mean discrepancy (MMD). MMD is a nonparametric kernel-based metric that measures the distance between two distributions in a reproducing kernel Hilbert space (RKHS). When the RKHS is universal, MMD approaches zero as the distributions of the two domains become the same.
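As a small illustration of how MMD behaves, the following sketch computes the standard biased empirical estimate of the squared MMD with a single Gaussian kernel. The kernel choice, the bandwidth, the function names, and the toy samples are assumptions made for illustration; they are not taken from the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between two sample sets."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Hypothetical 2-D feature vectors: one set near the origin, one shifted copy.
near_origin = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
shifted = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1]]

same_value = mmd_squared(near_origin, near_origin)   # identical samples
shifted_value = mmd_squared(near_origin, shifted)    # clearly different samples
print(same_value, shifted_value)
```

Identical sample sets give a value of zero, while the shifted set gives a clearly positive value, which is the behavior the metric is used for here.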
Nonetheless, it is difficult to find an optimal RKHS, and the representation ability of a single kernel is limited. It is reasonable to assume that the kernel of the optimal RKHS can be expressed as a linear combination of single kernels; thus, the multi-kernel variant of MMD, called MK-MMD [Gretton2012], is introduced.
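Considering a source data set $X_s$ and a target data set $X_t$, the formulation of MK-MMD is defined as

$$d_k^2(X_s, X_t) = \left\| \mathbb{E}[\phi(X_s)] - \mathbb{E}[\phi(X_t)] \right\|_{\mathcal{H}_k}^2 \qquad (1)$$

where $\|\cdot\|_{\mathcal{H}_k}$ is the RKHS norm and $\phi$ is the feature map induced by the multi-kernel $k$, which is defined as the linear combination of $m$ positive semi-definite kernels $\{k_u\}_{u=1}^{m}$:

$$\mathcal{K} = \left\{ k = \sum_{u=1}^{m} \beta_u k_u \;:\; \sum_{u=1}^{m} \beta_u = 1,\; \beta_u \geq 0,\; \forall u \right\} \qquad (2)$$

where each $k_u$ is associated uniquely with an RKHS $\mathcal{H}_u$, and we assume the kernels are bounded. By leveraging diverse kernels, MK-MMD gains representation ability over a single-kernel MMD.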
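To make the multi-kernel construction concrete, the sketch below builds a multi-kernel as a convex combination of Gaussian kernels with different bandwidths and plugs it into a biased empirical MMD estimate. The Gaussian family, the bandwidths, the coefficient values, and all function names are illustrative assumptions rather than the settings used in the paper.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel with the given bandwidth."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def multi_kernel(x, y, bandwidths, betas):
    """Convex combination k = sum_u beta_u * k_u of Gaussian kernels."""
    if len(betas) != len(bandwidths):
        raise ValueError("one coefficient is needed per kernel")
    if any(b < 0 for b in betas) or abs(sum(betas) - 1.0) > 1e-9:
        raise ValueError("betas must be non-negative and sum to 1")
    return sum(b * gaussian_kernel(x, y, bw) for b, bw in zip(betas, bandwidths))


def mk_mmd_squared(source, target, bandwidths, betas):
    """Biased empirical estimate of the squared MK-MMD."""

    def mean_kernel(xs, ys):
        total = sum(multi_kernel(x, y, bandwidths, betas) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Hypothetical features and kernel settings.
source = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
target = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1]]
bandwidths = [0.5, 1.0, 2.0]
betas = [0.2, 0.5, 0.3]

print(mk_mmd_squared(source, target, bandwidths, betas))
```

Because the estimate is linear in the kernel, the multi-kernel value equals the same convex combination of the single-kernel values, so the coefficients directly control how much each bandwidth contributes.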
If the network can learn a domain-invariant representation that minimizes the MK-MMD between two domains, it can be easily transferred to the target domain with sparsely labeled data.
2.2 Network Architecture
With MK-MMD introduced for domain adaptation, the structure of the proposed DSDANet is shown in Fig. 1. Consider a source data set $X_s = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$ with enough labeled data and a target data set $X_t = \{x_j^t\}_{j=1}^{n_t}$ without labels, where $x_i$ is an image patch centered on the $i$-th pixel and $y_i$ is the corresponding label of the $i$-th pixel. For each image patch-pair in both domains, the spatial-spectral features $f^{T_1}$ and $f^{T_2}$ of the two acquisition dates are extracted by cascaded convolutional layers and max-pooling layers.
After that, the absolute value of the difference between the multi-temporal spatial-spectral features is calculated. Since the two branches of DSDANet share weights, this operation highlights the change information.
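The effect of weight sharing followed by an absolute difference can be illustrated with a toy example. In the sketch below, a single linear map with a ReLU stands in for one convolutional branch; the weights, patch values, and function names are hypothetical and much simpler than the actual network.

```python
def shared_branch(patch, weights):
    """Toy stand-in for one weight-shared branch: linear map followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, patch))) for row in weights]


def difference_features(patch_t1, patch_t2, weights):
    """Absolute difference of the features extracted from the two dates."""
    features_t1 = shared_branch(patch_t1, weights)
    features_t2 = shared_branch(patch_t2, weights)
    return [abs(a - b) for a, b in zip(features_t1, features_t2)]


# Hypothetical shared weights (2 output features, 4 spectral bands).
weights = [[0.5, -0.2, 0.1, 0.3], [-0.4, 0.6, 0.2, -0.1]]

# An unchanged pixel has the same spectrum at both dates.
unchanged = difference_features([0.2, 0.4, 0.1, 0.3], [0.2, 0.4, 0.1, 0.3], weights)
# A changed pixel has a different spectrum at the second date.
changed = difference_features([0.2, 0.4, 0.1, 0.3], [0.9, 0.1, 0.8, 0.7], weights)

print(unchanged, changed)
```

Because both dates pass through the same weights, identical inputs always produce a zero difference vector, so any nonzero response can be attributed to a change in the input rather than to a mismatch between the branches.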
As is well known, the deep features learned by a CNN transition from general to specific as the network goes deeper. In the last few fully connected (FC) layers in particular, there is a transferability gap between features learned from different domains that is hard to bridge: a network trained in the source domain cannot be transferred reliably to the target domain merely by fine-tuning with sparse labeled target data. Therefore, MK-MMD is adopted to make the network learn domain-invariant features from the two domains. An intuitive idea is to apply MK-MMD to the penultimate FC layer, which directly makes the classifier adaptive to both domains. However, since a single layer may not cope with the domain distribution bias, MK-MMD is embedded into the two FC layers in front of the classifier. Since we aim to construct a network that is trained on the source CD data set but also performs well on the target task, the loss function of DSDANet is

$$\min_{\Theta} \; \frac{1}{n_s} \sum_{i=1}^{n_s} J\left(\theta(x_i^s), y_i^s\right) + \lambda \sum_{l=l_1}^{l_2} d_k^2\left(H_s^l, H_t^l\right) \qquad (3)$$

where $J$ is the CD loss on the labeled source data, $n_s$ is the number of labeled source samples, $\theta(x_i^s)$ is the network prediction for the source sample $x_i^s$ with label $y_i^s$, $l$ is the layer index, $d_k^2(H_s^l, H_t^l)$ is the MK-MMD between the two domains on the features in the $l$-th layer, and $\lambda$ denotes a domain adaptation penalty parameter.
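The structure of this objective, a supervised term on the source domain plus a weighted sum of per-layer discrepancies, can be sketched as follows. Cross-entropy is assumed as the CD loss, which the text does not specify, and the function names, probabilities, and discrepancy values are placeholders.

```python
import math


def cross_entropy(probs, label):
    """Negative log-likelihood of the true class (assumed form of the CD loss)."""
    return -math.log(max(probs[label], 1e-12))


def combined_loss(source_probs, source_labels, layer_discrepancies, penalty):
    """Mean CD loss on labeled source samples plus penalty * sum of layer MK-MMDs."""
    cd_loss = sum(
        cross_entropy(p, y) for p, y in zip(source_probs, source_labels)
    ) / len(source_labels)
    return cd_loss + penalty * sum(layer_discrepancies)


# Hypothetical softmax outputs for two source samples (unchanged=0, changed=1).
source_probs = [[0.9, 0.1], [0.2, 0.8]]
source_labels = [0, 1]
# Hypothetical MK-MMD values measured on the two adapted FC layers.
layer_discrepancies = [0.3, 0.1]

total = combined_loss(source_probs, source_labels, layer_discrepancies, penalty=0.5)
print(total)
```

Setting the penalty to zero recovers plain supervised training on the source domain, while a larger penalty pushes the optimizer toward features whose distributions match across domains.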
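In the training procedure, two types of parameters need to be learned: the network parameters $\Theta$ and the kernel coefficients $\beta$. However, computing MK-MMD with the kernel trick costs $O(n^2)$, which is unacceptable for deep networks on large-scale data sets and makes the training procedure more difficult. Therefore, the unbiased estimate of MK-MMD [Gretton2012] is used to reduce the computational cost from $O(n^2)$ to $O(n)$, which can be formulated as

$$d_k^2\left(H_s^l, H_t^l\right) = \frac{2}{n_s} \sum_{i=1}^{n_s/2} g_k\left(z_i^l\right) \qquad (4)$$

where $z_i^l = \left(h_{2i-1}^{s,l}, h_{2i}^{s,l}, h_{2i-1}^{t,l}, h_{2i}^{t,l}\right)$ is a quad-tuple evaluated by the multi-kernel $k$ as $g_k(z_i^l) = k(h_{2i-1}^{s,l}, h_{2i}^{s,l}) + k(h_{2i-1}^{t,l}, h_{2i}^{t,l}) - k(h_{2i-1}^{s,l}, h_{2i}^{t,l}) - k(h_{2i}^{s,l}, h_{2i-1}^{t,l})$, and $h^l$ denotes the learned features in the $l$-th layer.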
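The linear-time estimate can be sketched in a few lines: consecutive source and target features are grouped into quad-tuples, and the kernel is evaluated only four times per tuple. The sketch assumes equally sized source and target batches and a single Gaussian kernel in place of the multi-kernel; the names and toy features are hypothetical.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel, used here as a stand-in for the multi-kernel."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def g_k(quad, kernel):
    """Kernel evaluation on one quad-tuple (s1, s2, t1, t2)."""
    s1, s2, t1, t2 = quad
    return kernel(s1, s2) + kernel(t1, t2) - kernel(s1, t2) - kernel(s2, t1)


def linear_mmd_squared(source_feats, target_feats, kernel=gaussian_kernel):
    """Linear-time unbiased estimate: the average of g_k over disjoint quad-tuples."""
    n = min(len(source_feats), len(target_feats))
    n -= n % 2  # use an even number of samples so they pair up
    if n < 2:
        raise ValueError("at least two samples per domain are required")
    quads = [
        (source_feats[i], source_feats[i + 1], target_feats[i], target_feats[i + 1])
        for i in range(0, n, 2)
    ]
    # (2 / n) * sum over n / 2 quad-tuples is the mean of g_k over the tuples.
    return sum(g_k(q, kernel) for q in quads) / len(quads)


# Hypothetical layer features for four source and four target samples.
source_feats = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.05, 0.0]]
target_feats = [[2.0, 2.0], [2.1, 1.9], [1.9, 2.1], [2.05, 2.0]]

print(linear_mmd_squared(source_feats, target_feats))
```

Each sample is touched a constant number of times, so the cost grows linearly with the batch size, which is what makes the estimate usable inside mini-batch training.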
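As for the kernel parameters $\beta$, the optimal coefficient $\beta_u$ for each kernel $k_u$ can be sought by jointly maximizing $d_k^2(H_s^l, H_t^l)$ itself and minimizing its variance, which results in the optimization

$$\max_{k \in \mathcal{K}} \; d_k^2\left(H_s^l, H_t^l\right) \sigma_k^{-2} \qquad (5)$$

where $\mathcal{K}$ is the family of convex combinations of the single kernels and $\sigma_k^2 = \mathbb{E}_z\left[g_k^2(z)\right] - \left(\mathbb{E}_z\left[g_k(z)\right]\right)^2$ is the estimation variance. This optimization can finally be solved as a quadratic program (QP) [Gretton2012].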
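The trade-off behind this criterion can be illustrated without a QP solver. The sketch below replaces the QP with a brute-force grid search over the coefficients of two kernels, which is a deliberate simplification and not the procedure used in the paper. It takes precomputed per-tuple kernel evaluations as input and picks the convex combination with the largest ratio of mean to variance; all names and values are hypothetical.

```python
def select_kernel_weights(g_values, steps=20, eps=1e-8):
    """Grid search over (beta_0, beta_1) on the simplex for two kernels.

    g_values[u][i] is the evaluation of kernel u on quad-tuple i.
    The score is mean / (variance + eps) of the combined evaluations.
    """
    if len(g_values) != 2:
        raise ValueError("this sketch only handles two kernels")
    n = len(g_values[0])
    best_beta, best_score = None, float("-inf")
    for step in range(steps + 1):
        beta_0 = step / steps
        beta = (beta_0, 1.0 - beta_0)
        combined = [
            beta[0] * g_values[0][i] + beta[1] * g_values[1][i] for i in range(n)
        ]
        mean = sum(combined) / n
        variance = sum((g - mean) ** 2 for g in combined) / n
        score = mean / (variance + eps)
        if score > best_score:
            best_beta, best_score = beta, score
    return best_beta, best_score


# Hypothetical evaluations: both kernels have the same mean discrepancy,
# but the second one is much noisier across quad-tuples.
g_values = [
    [1.0, 1.0, 1.0, 1.0],
    [2.0, 0.0, 2.0, 0.0],
]

beta, score = select_kernel_weights(g_values)
print(beta, score)
```

With equal mean discrepancies, the search puts all weight on the kernel whose estimate is stable, which reflects the intent of maximizing the discrepancy while minimizing its variance.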
By alternately using stochastic gradient descent (SGD) to update $\Theta$ and solving the QP to optimize $\beta$, DSDANet gradually learns a transferable representation from labeled source data and unlabeled target data. By minimizing Eq. 3, the marginal distributions $P_s(X)$ and $P_t(X)$ of the two domains become very similar, yet the conditional distributions $P_s(Y \mid X)$ and $P_t(Y \mid X)$ of the two domains may still differ slightly. Thus, a very small amount of labeled target data is selected to fine-tune the classifier of DSDANet. Compared with the abundant labeled data in the source domain, the labeled data provided by the target domain is very limited, so this procedure can be regarded as semi-supervised learning.
3.1 General Information
The data set used as the source domain is the WH data set captured by GF-2, as shown in Fig. 2. The two images are 1000 × 1000 pixels in size with four spectral bands and a spatial resolution of 4 m.
The data sets adopted as the target domains are the HY data set and the QU data set, as shown in Fig. 3. The HY data set was also captured by GF-2, with a size of 1000 × 1000 pixels. The second target data set, denoted as QU, was acquired by QuickBird with four spectral bands and a spatial resolution of 2.4 m. Both images in this data set are 358 × 280 pixels. Since WH and QU were acquired by different sensors, which leads to different spatial resolutions and statistical characteristics, the data distributions of these two data sets are significantly different.
In the training procedure, we randomly select 10% of the samples (50416 in total) from the source domain as labeled training samples. We then train DSDANet with the labeled source training samples and all target samples without labels. After training, we select only 200 labeled samples from each target domain to fine-tune the classifier. Compared with the labeled source data, the labeled data provided by the target domain is sparse.
To evaluate the proposed method, we compare it with CVA [Sharma2007] and SVM. To further evaluate the effectiveness of MK-MMD, we compare DSDANet with variants that do not perform domain adaptation: directly inferring on target data without fine-tuning (DSCNet-v1), training directly on the labeled target data instead of on the source domain (DSCNet-v2), and fine-tuning with labeled target data but without MK-MMD (DSCNet-v3).
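For readers unfamiliar with the CVA baseline, change vector analysis marks a pixel as changed when the magnitude of its spectral difference vector between the two dates is large. The sketch below uses a fixed, hand-picked threshold, which is an assumption; the text does not say how the threshold is chosen for the comparison, and the image values and function name are hypothetical.

```python
import math


def cva_change_map(image_t1, image_t2, threshold):
    """Binary change map from the magnitude of per-pixel spectral change vectors.

    Each image is indexed as image[row][col] and holds a list of band values.
    """
    change_map = []
    for row_t1, row_t2 in zip(image_t1, image_t2):
        out_row = []
        for pixel_t1, pixel_t2 in zip(row_t1, row_t2):
            magnitude = math.sqrt(
                sum((b2 - b1) ** 2 for b1, b2 in zip(pixel_t1, pixel_t2))
            )
            out_row.append(1 if magnitude > threshold else 0)
        change_map.append(out_row)
    return change_map


# Hypothetical 2 x 2 images with four spectral bands each.
image_t1 = [
    [[0.1, 0.2, 0.3, 0.4], [0.1, 0.2, 0.3, 0.4]],
    [[0.5, 0.5, 0.5, 0.5], [0.2, 0.2, 0.2, 0.2]],
]
image_t2 = [
    [[0.1, 0.2, 0.3, 0.4], [0.6, 0.7, 0.8, 0.9]],
    [[0.5, 0.5, 0.5, 0.6], [0.2, 0.2, 0.2, 0.2]],
]

change_map = cva_change_map(image_t1, image_t2, threshold=0.5)
print(change_map)
```

Since CVA operates on raw spectral values of single pixels, it uses neither spatial context nor learned features, which is the limitation the deep models in the comparison are meant to address.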
3.2 Experimental Results
The binary change maps obtained by the different methods on the HY data set are shown in Fig. 4. It can be observed that the proposed model generates the best CD result, with more complete changed regions and less noise. For the QU data set, even though the distributions of the two domains are significantly different owing to the different characteristics of the two sensors, DSDANet can still generate an accurate binary change map. This implies that, by embedding the data distributions into the optimal RKHS and minimizing the distance between them, the network is capable of learning a domain-invariant representation from labeled source data and unlabeled target data and can be easily transferred from one CD data set to another.
The quantitative results are listed in Table 1. Because the very limited labeled target data cannot cover all kinds of changed and unchanged land-cover types, fine-tuning without domain adaptation also performs poorly. By contrast, DSDANet achieves the best OA and KC on both target data sets.
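Assuming that OA and KC denote overall accuracy and the kappa coefficient, as is common in the CD literature, both can be computed from the binary confusion matrix as sketched below. The label vectors and the function name are made up for illustration and are unrelated to the values in Table 1.

```python
def overall_accuracy_and_kappa(predicted, reference):
    """Overall accuracy and Cohen's kappa for binary change maps (1 = changed)."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(reference)
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    tn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 0)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)

    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column marginals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one class is present")
    kappa = (accuracy - expected) / (1.0 - expected)
    return accuracy, kappa


# Hypothetical flattened change maps for eight pixels.
predicted = [1, 1, 0, 0, 1, 0, 0, 0]
reference = [1, 0, 0, 0, 1, 1, 0, 0]

accuracy, kappa = overall_accuracy_and_kappa(predicted, reference)
print(accuracy, kappa)
```

Kappa discounts the agreement that would occur by chance, which matters in CD because unchanged pixels usually dominate and can inflate the overall accuracy.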
In this paper, a novel network architecture called DSDANet is proposed for cross-domain CD in multispectral images. By restricting the domain discrepancy with MK-MMD and optimizing both the network parameters and the kernel coefficients, DSDANet can learn a transferable representation from labeled source data and unlabeled target data, which efficiently bridges the discrepancy between the two domains. The experimental results on two target data sets demonstrate the effectiveness of the proposed DSDANet in cross-domain CD. Even when the data distributions of the two domains are significantly different, DSDANet needs only sparse labeled data from the target domain to fine-tune the classifier, which makes it well suited to practical production environments.