Hyperspectral images (HSIs) capture detailed spectral information measured in contiguous bands of the electromagnetic spectrum [1, 2, 3] and have been widely used in various remote sensing applications, such as environmental monitoring and mineral exploration. One fundamental challenge in these applications is to assign a unique label to each pixel in the image, a task known as HSI classification. When the problem is treated as supervised learning and solved using machine learning methods (including random forest, support vector machine (SVM), Laplacian SVM (LapSVM) [8, 9, 10, 11] and support tensor machine (STM)), a large number of labeled samples is required due to the high dimensionality of hyperspectral data. This would require extensive and expensive field data collection campaigns. Consequently, only a small quantity of labeled samples is available in most practical applications of HSI classification. To address this problem, several machine learning and feature extraction methods have been widely applied to hyperspectral data, such as active learning (AL) [13, 14, 15, 16, 17], semi-supervised learning (SSL) [18, 19, 15], spectral-spatial classification [20, 21, 22, 23], domain adaptation (DA) [24, 25, 3] and, more recently, deep learning based techniques [26, 27, 28]. In this paper, we focus on applying DA to HSI classification.
According to the machine learning and pattern recognition literature, DA refers to the problem of adapting a model trained on a source domain to a target domain. When applied to HSI classification, DA aims to generate an accurate classification map of the target HSI by exploiting the knowledge learned on the source HSI. Unsupervised DA refers to the case where no labeled samples are available in the target domain, whereas semi-supervised DA refers to the case where a few labeled target samples are available. Further, heterogeneous DA (HDA) refers to the case where the feature dimensions of the two domains are different. Since we assume that a limited amount of labeled samples is available in the target HSI, we focus on semi-supervised HDA for HSI classification. Although several HDA methods based on deep learning have been proposed for visual and remote sensing applications [29, 30, 31], the feature representation ability of deep learning models strongly depends on the availability of a large number of training samples [32, 33]. It is therefore difficult to obtain a reliable deep learning model with very few samples in hyperdimensional feature spaces. Given the assumption of a limited number of training samples, in this paper we focus on handcrafted features for HDA.
In the HDA literature, one of the simplest feature-based approaches is feature augmentation, whose extended versions, called heterogeneous feature augmentation (HFA) and semi-supervised HFA (SHFA), have been proposed recently. A robust domain adaptation low-rank reconstruction method has also been introduced, where a transformed intermediate representation of the samples in the source domain is linearly reconstructed by the target samples. Other authors align domains with canonical correlation analysis (CCA) and then perform change detection; this approach has been extended to a kernelized and semi-supervised version for change detection with different sensors. A supervised multi-view canonical correlation analysis ensemble has been presented to address HDA problems, and the cross-domain landmark selection (CDLS) method learns representative cross-domain landmarks for deriving a proper feature subspace for adaptation and classification purposes. Different from the above feature-based category, several studies employ manifold learning to preserve the original geometry. The method of domain adaptation using manifold alignment (DAMA) can reuse labeled data from multiple source domains in the target domain even when the input domains do not share any common features or instances. In semi-supervised manifold alignment (SSMA), both domains are matched through manifold alignment while preserving label (dis)similarities and the geometric structures of the single manifold in both domains; more recently, the kernelized manifold alignment (KEMA) has been introduced. In addition, a deep feature alignment neural network has been proposed to carry out domain adaptation, where discriminative features for the source and target domains are extracted using deep convolutional recurrent neural networks and then aligned with each other layer by layer. Finally, a kernel-based domain-invariant feature selection method has been proposed for the classification of hyperspectral images, where a novel measure of data shift for evaluating the domain stability is defined.
As stated earlier, it is not feasible to obtain a large amount of labeled target samples in practical applications. On the other hand, if a sufficient number of labeled samples were available in the target HSI, an accurate classification map could be achieved with newly developed deep learning methods. Therefore, it is reasonable to assume that only limited labeled samples can be used in the semi-supervised HDA problem. In order to address the problem and obtain better classification performance, two key sub-problems should be solved, i.e. how to obtain more reliable pseudo-labeled target samples for adaptation and how to achieve better adaptation with these samples.
In this paper, random walker (RW)-based pseudolabeling and cluster canonical correlation analysis (C-CCA) are employed to solve these two problems, respectively. The RW-based pseudolabeling algorithm has been shown to be effective for extracting high-confidence samples, whereas C-CCA uses all pairwise correspondences within a cluster across the two domains and results in cluster segregation. Fig. 1 illustrates the difference between CCA and C-CCA. It is clear that CCA requires paired samples and can hardly be applied directly when multiple clusters of samples in the source domain correspond to several clusters of samples in the target domain.
In the proposed approach, the two algorithms work in a collaborative manner, i.e. RW-based pseudolabeling is employed to extract target samples with high confidence, whereas C-CCA is employed for cross-domain learning; the projected samples are then used for RW-based pseudolabeling. The proposed method is therefore denoted as cross-domain collaborative learning (CDCL). As shown in Fig. 2, the proposed method is based on an iterative process consisting of three main components, i.e. RW-based pseudolabeling, cross-domain learning via C-CCA and classification using the extended RW (ERW) algorithm. Firstly, given the initially labeled target samples as the training set, RW-based pseudolabeling is employed to update the training set and extract target clusters by fusing the segmentation results obtained by the RW and ERW classifiers. Secondly, cross-domain learning via C-CCA is applied using the labeled source samples and the target clusters. The unlabeled target samples are then classified, with the estimated probability maps, using the model trained in the projected correlation subspace. Both the target clusters and the estimated probability maps are then used for updating the training set again via RW-based pseudolabeling. Finally, when the iterative process converges, the classification map is obtained by the ERW classifier using the final training set and the estimated probability maps. Comprehensive experiments on four publicly available benchmark HSIs have been conducted to demonstrate the effectiveness of the proposed algorithm.
The rest of the paper is organized as follows. The C-CCA and RW algorithms are reviewed in Section II. The proposed CDCL methodology is presented in Section III. Section IV describes the experimental datasets and setup. Results and discussions are presented in Section V. Section VI summarizes the contributions of our research.
II. Background Algorithms
This section briefly describes background algorithms, i.e. C-CCA and RW algorithms.
Let us consider two sets of labeled samples extracted from the source and the target domains, respectively. Each set is divided into corresponding clusters. The aim of C-CCA is to find a pair of projections, one per domain, such that the correlation between the projections of corresponding clusters is maximized and the clusters are well separated.
In C-CCA, a one-to-one correspondence between all pairs of samples in a given cluster across the two sets is established, and thereafter standard CCA is used to learn the projections. The C-CCA problem is written as

$$(\mathbf{w}_s^*, \mathbf{w}_t^*) = \arg\max_{\mathbf{w}_s, \mathbf{w}_t} \frac{\mathbf{w}_s^{\top} \Sigma_{st} \mathbf{w}_t}{\sqrt{\mathbf{w}_s^{\top} \Sigma_{ss} \mathbf{w}_s}\,\sqrt{\mathbf{w}_t^{\top} \Sigma_{tt} \mathbf{w}_t}},$$

where the covariance matrices $\Sigma_{ss}$, $\Sigma_{tt}$ and $\Sigma_{st}$ are computed over the $N$ cross-set correspondences, $N$ being the total number of such correspondences. The problem can be solved as an eigenvalue problem, as in CCA.
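As a concrete sketch of this solution, the snippet below (ours, not the authors' implementation) solves standard CCA as an eigenvalue problem on mean-centered data, with a small ridge term for numerical stability, and builds the all-pairs within-cluster correspondences that turn CCA into C-CCA; all function names are illustrative.

```python
import numpy as np

def cca_projections(Xs, Xt, reg=1e-6):
    """Solve standard CCA for paired samples as an eigenvalue problem.

    Xs: (n, ds) source samples; Xt: (n, dt) paired target samples.
    Returns source/target projection matrices and canonical correlations."""
    Xs = Xs - Xs.mean(axis=0)
    Xt = Xt - Xt.mean(axis=0)
    n = Xs.shape[0]
    Css = Xs.T @ Xs / n + reg * np.eye(Xs.shape[1])   # source covariance
    Ctt = Xt.T @ Xt / n + reg * np.eye(Xt.shape[1])   # target covariance
    Cst = Xs.T @ Xt / n                               # cross covariance
    # Eigenvectors of Css^{-1} Cst Ctt^{-1} Cts give the source projections,
    # with eigenvalues equal to the squared canonical correlations.
    M = np.linalg.solve(Css, Cst) @ np.linalg.solve(Ctt, Cst.T)
    eigvals, Ws = np.linalg.eig(M)
    order = np.argsort(-eigvals.real)
    corr = np.sqrt(np.clip(eigvals.real[order], 0.0, 1.0))
    Ws = Ws.real[:, order]
    # Target projections follow (up to scale) from Wt ∝ Ctt^{-1} Cts Ws.
    Wt = np.linalg.solve(Ctt, Cst.T) @ Ws
    Wt /= np.linalg.norm(Wt, axis=0, keepdims=True)
    return Ws, Wt, corr

def cluster_pairs(Xs, ys, Xt, yt):
    """C-CCA pairing: expand each shared cluster into all cross-domain
    sample pairs, so that plain CCA can be applied to the paired sets."""
    Ps, Pt = [], []
    for c in np.unique(ys):
        A, B = Xs[ys == c], Xt[yt == c]
        ia, ib = np.meshgrid(np.arange(len(A)), np.arange(len(B)), indexing="ij")
        Ps.append(A[ia.ravel()])
        Pt.append(B[ib.ravel()])
    return np.vstack(Ps), np.vstack(Pt)
```

Running `cca_projections` on the output of `cluster_pairs` is the essence of C-CCA: the pairing step is what injects the cluster structure into the otherwise standard CCA solution.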
The RW algorithm was initially designed for general image segmentation based on a small set of labeled pixels. The algorithm assigns each unlabeled pixel to the label that a random walker starting from that pixel would be most likely to reach first. Specifically, it considers an image as a graph $G = (V, E)$ with vertices $V$ and edges $E$, where the vertices represent the pixels in the image and the edges represent the links connecting adjacent pixels. The structure of the image intensities is encoded by the edge weights. The edge weight between the $i$-th and $j$-th pixels is defined as $w_{ij} = \exp(-\beta (g_i - g_j)^2)$, where $g_i$ indicates the image intensity at pixel $i$ and $\beta$ is a free parameter that controls the smoothness of the graph edges. The corresponding Laplacian matrix of the graph is denoted as $L$.
The vertices of the image can be divided into a labeled set $V_L$ and an unlabeled set $V_U$, where each pixel in $V_L$ has been assigned a label from the set of classes. Given the intensity representation of the image and the labeled set, the RW algorithm determines the probability that a random walker starting at an unlabeled pixel will first reach a labeled pixel of each class. The set of probabilities is obtained analytically, in closed form, by minimizing the RW energy function. By assigning each pixel to the label with the largest probability, a high-quality image segmentation is obtained.
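The closed-form computation just described can be sketched as follows; this is an illustrative implementation assuming the usual Gaussian edge weighting and a 4-connected grid (the names and the `beta` default are ours):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def rw_probabilities(img, seed_idx, seed_lab, beta=90.0):
    """Random walker probabilities on a 4-connected pixel grid.

    img: (H, W) intensities; seed_idx: flat indices of labeled pixels;
    seed_lab: their labels in {0..K-1}. Returns the unlabeled indices and
    an (n_unlabeled, K) matrix of first-arrival probabilities."""
    H, W = img.shape
    n = H * W
    g = img.ravel()
    idx = np.arange(n).reshape(H, W)
    # Horizontal and vertical edges of the lattice.
    e1 = np.concatenate([idx[:, :-1].ravel(), idx[:-1, :].ravel()])
    e2 = np.concatenate([idx[:, 1:].ravel(), idx[1:, :].ravel()])
    w = np.exp(-beta * (g[e1] - g[e2]) ** 2)          # Gaussian edge weights
    A = sparse.coo_matrix((np.r_[w, w], (np.r_[e1, e2], np.r_[e2, e1])),
                          shape=(n, n)).tocsr()
    L = (sparse.diags(np.asarray(A.sum(axis=1)).ravel()) - A).tocsr()  # Laplacian
    unl = np.setdiff1d(np.arange(n), seed_idx)
    K = int(seed_lab.max()) + 1
    M = np.zeros((len(seed_idx), K))
    M[np.arange(len(seed_idx)), seed_lab] = 1.0       # one-hot seed labels
    # Combinatorial Dirichlet problem: L_U x = -B m for each class indicator.
    Lu = L[unl][:, unl].tocsc()
    B = L[unl][:, seed_idx]
    probs = splu(Lu).solve(-(B @ M))
    return unl, probs
```

Each row of the returned probability matrix sums to one, and taking the per-pixel argmax over the columns yields the segmentation.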
When RW is directly applied to HSI classification, the spectral information can hardly be integrated into the energy function. To address this problem, an ERW-based spectral-spatial algorithm has been proposed, which introduces an aspatial energy function defined through a diagonal matrix whose entries are the initial class probabilities of the pixels; these probabilities can be estimated by applying an SVM classifier to the HSI. The combined energy function of the ERW algorithm adds the aspatial term to the RW energy, weighted by a free parameter $\gamma$ that controls the dynamic range of the aspatial function. Similar to the RW solution, the set of probabilities in ERW can be estimated by solving a system of linear equations. Given the optimized probabilities, each unlabeled pixel is assigned the label with the largest probability.
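A minimal sketch of the combined system, assuming the aspatial term simply anchors each unlabeled pixel to its prior class probabilities with weight `gamma` (a simplification of the referenced ERW formulation; all names are ours):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def erw_probabilities(L, seed_idx, seed_lab, prior, gamma=0.1):
    """Extended random walker sketch: the RW Dirichlet problem augmented by
    an aspatial term pulling each unlabeled pixel toward its prior class
    probabilities (e.g. SVM outputs).

    L: (n, n) sparse graph Laplacian; prior: (n, K) class probabilities."""
    n, K = prior.shape
    unl = np.setdiff1d(np.arange(n), seed_idx)
    M = np.zeros((len(seed_idx), K))
    M[np.arange(len(seed_idx)), seed_lab] = 1.0
    # (L_U + gamma I) x_k = -B m_k + gamma prior_k, for each class k.
    Lu = (L[unl][:, unl] + gamma * sparse.eye(len(unl))).tocsc()
    rhs = -(L[unl][:, seed_idx] @ M) + gamma * prior[unl]
    return unl, splu(Lu).solve(rhs)
```

With `gamma = 0` the system reduces to plain RW; increasing `gamma` shifts the solution toward the spectral prior, which is the role the dynamic-range parameter plays in the combined energy.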
III. Proposed Method
III-A Problem Definition
Assume that a set of labeled training samples is available in the source domain, while the samples in the target domain are divided into a labeled set and an unlabeled set. In this paper we consider only the semi-supervised heterogeneous problem, thus we assume that the number of labeled target samples is very small and that the feature dimensions of the two domains differ. In the following, these sets are referred to as the labeled source samples, the labeled target samples and the unlabeled target samples, respectively.
As shown in Fig. 2, the proposed algorithm is based on an iterative process including three main components, i.e. RW-based pseudolabeling, cross-domain learning via C-CCA and ERW-based classification. Note that both RW-based pseudolabeling and ERW-based classification require a training set and probability maps (which measure the probabilities that each sample of the target HSI belongs to the different classes). The pseudolabeling procedure is introduced to extract labeled samples with high confidence, which serve as target clusters for C-CCA, and more reliably labeled samples for updating the training set. Specifically, the strategy of RW-based label verification is applied to obtain reliable pseudolabeling results.
In the following, the RW-based pseudolabeling will be first described. Then, the details of the proposed method will be introduced.
III-B RW-based Pseudolabeling
Given the training set and the estimated probability maps, RW-based pseudolabeling of target samples consists of the following five steps:
1) Graph construction: In order to make full use of the spatial information, the first principal component (PC) of the hyperspectral image is used to construct a weighted graph $G = (V, E)$. Here, the vertices refer to the sample values in the first PC, and the edges refer to the links connecting adjacent samples (eight neighbors are considered for each sample). A weight $w_{ij} = \exp(-\beta (g_i - g_j)^2)$ is defined for each edge to model the difference between adjacent samples in the weighted graph, where $\beta$ is a free parameter.
2) RW segmentation: Once the graph representation and the training set are available, the RW probabilities can be obtained directly by minimizing the RW energy function. The segmentation result is then obtained by choosing, for each sample, the label with the maximum probability.
3) ERW segmentation: Given the graph representation, the training set and the initial probability maps, the ERW probabilities can be optimized by minimizing the combined energy function; recall that a free parameter controls the dynamic range of the aspatial function. Once the optimized probability maps are obtained, the segmentation result is computed by choosing, for each sample, the label corresponding to the maximum probability.
4) Label verification: After obtaining the RW and ERW segmentation results, label verification is employed to extract sample candidates for the subsequent updating of the training set and target clusters. As illustrated in Fig. 3, the two segmentation results are compared to verify the confidence of the unlabeled samples in the target HSI. Specifically, samples assigned the same label in both segmentations are considered high-confidence candidates. The rationale of this strategy is twofold. First, RW and ERW take complementary decisions: the RW algorithm is based only on the spatial correlation among adjacent samples, whereas the ERW algorithm combines spectral information with the spatial correlations of adjacent samples. Second, the core idea is similar to voting-based decision fusion, i.e. if different classifiers take the same decision for a sample, that decision is assumed to be more reliable.
5) Training set and cluster updating: Although the candidate samples extracted by the label verification strategy have high confidence, the training set and target clusters are expected to include as many correctly labeled samples as possible. In order to ensure the accuracy of the training set, unlabeled samples among the candidates are selected according to the modified breaking ties (MBT)-based query strategy. Specifically, the MBT strategy selects the samples maximizing the ERW probability of their predicted class, and these samples are added to the training set with their predicted labels. In addition, a candidate sample whose largest probability exceeds the mean probability of its predicted class is used for cluster extraction.
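Steps 4) and 5) can be sketched together as follows; the per-class top-q confidence selection stands in for the MBT query, and all names and array layouts are illustrative rather than the authors' code:

```python
import numpy as np

def pseudolabel(rw_seg, erw_seg, erw_prob, query_size=10):
    """Fuse RW/ERW segmentations (label verification) and select
    high-confidence pseudolabels per class (stand-in for the MBT query).

    Returns indices to add to the training set and per-class cluster indices."""
    agree = rw_seg == erw_seg                 # step 4: label verification
    conf = erw_prob.max(axis=1)               # confidence of the predicted label
    new_train, clusters = [], []
    for c in np.unique(erw_seg):
        cand = np.where(agree & (erw_seg == c))[0]
        if cand.size == 0:
            continue
        # Step 5a: add the top-q most confident candidates of class c.
        new_train.append(cand[np.argsort(-conf[cand])[:query_size]])
        # Step 5b: cluster extraction, candidates above the mean class confidence.
        clusters.append(cand[conf[cand] > conf[cand].mean()])
    return np.concatenate(new_train), clusters
```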
III-C Details of the Proposed Technique
As illustrated in Algorithm 1, the proposed algorithm, denoted as cross-domain collaborative learning (CDCL), combines RW-based pseudolabeling and C-CCA, with the training set and target clusters updated iteratively. The details are as follows.
1) RW-based pseudolabeling: As illustrated in Fig. 2, pseudolabeling is applied twice in each iteration, i.e. before and after C-CCA. First, the probability estimation for pseudolabeling is obtained by training a linear SVM classifier on the current training set. The obtained probability maps are then employed to extract the target clusters and update the training set. Note that the initial training set contains only the initially labeled target samples. Second, after C-CCA using the labeled source samples and the target clusters, the probability maps are re-estimated by a linear SVM trained on the projected labeled samples of both domains. Given these newly estimated probability maps, pseudolabeling is applied again to update the training set. In summary, the training set is updated twice and the target clusters are computed only once in a single iteration.
2) Cross-domain learning via C-CCA: Given the labeled source samples and the target clusters, pairs of projection vectors with corresponding correlation coefficients are derived via C-CCA. Note that the dimensionality of the obtained subspace is smaller than the dimensionality of either domain. Higher correlation coefficients indicate better correlation between samples projected from the two domains, and thus better transfer ability. In order to obtain a correlation subspace with good transfer ability, we fix the threshold on the correlation coefficient at 0.5 and keep the corresponding projection vectors. After projecting all samples of both domains onto the correlation subspace, the unlabeled target samples are classified, with estimated probability maps, by a linear SVM trained on the projected labeled samples of both domains. Although non-linear classifiers such as the SVM with RBF kernel generally outperform linear classifiers in classification tasks, the optimal parameters of such a classifier tuned on source samples usually perform worse than expected on target samples in the DA context. By contrast, a linear kernel can capture the original relationships between samples from different domains.
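The thresholding and classification step might look as follows, assuming projection matrices `Ws`, `Wt` and canonical correlations `corr` produced by a C-CCA solver; scikit-learn's linear `SVC` stands in for the linear SVM, and all names are ours:

```python
import numpy as np
from sklearn.svm import SVC

def project_and_classify(Ws, Wt, corr, Xs, ys, Xt_lab, yt_lab, Xt_unl, thr=0.5):
    """Keep projection pairs whose canonical correlation exceeds thr (0.5 in
    the paper), project both domains onto the correlation subspace, and train
    a linear SVM on the pooled labeled samples of both domains."""
    keep = corr > thr
    Zs, Zl, Zu = Xs @ Ws[:, keep], Xt_lab @ Wt[:, keep], Xt_unl @ Wt[:, keep]
    clf = SVC(kernel="linear", probability=True)
    clf.fit(np.vstack([Zs, Zl]), np.concatenate([ys, yt_lab]))
    # Class predictions and probability maps for the unlabeled target samples.
    return clf.predict(Zu), clf.predict_proba(Zu)
```

The returned probability maps are exactly what the subsequent RW-based pseudolabeling and ERW classification steps consume.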
3) ERW-based classification: When the iterative process of RW-based pseudolabeling and C-CCA converges, the classification map is obtained by ERW using the estimated probability maps and the final training set.
III-D Performance Analysis and Convergence
The classification ability and convergence of the proposed CDCL method are analyzed as follows:
1) Given the labeled source samples and the initial target training set, the classification ability of the proposed method relies on two factors, i.e. the transfer ability of C-CCA and the ERW-based classification. The transfer ability of C-CCA depends on the number of samples in the target clusters and on their labeling accuracy, whereas ERW-based classification requires a good estimation of the probability maps and of the training set to achieve higher accuracy. In each iteration, the samples with the highest confidence are added to the training set and several samples are extracted by label verification as target clusters. If both are accurate, good cross-domain learning is achieved. Since the classifier is then trained using labeled samples from both domains, it performs better when used for RW-based pseudolabeling. Therefore, more reliable samples are added to the training set, ensuring that the training set and the target clusters are accurately updated in the next iteration. With both updated iteratively, a good classification result can be obtained with the proposed method.
2) As stated above, higher classification accuracy with the proposed method is easily obtained under the assumption of reasonably accurate training sets and target clusters. Note, however, that since C-CCA is based on the pairwise correspondences within a cluster across domains, the source clusters are expected to be aligned with the corresponding target clusters even if there are a few mislabeled samples in the target clusters. As the iterative process continues, both RW and ERW segmentation results approach the ground truth, so more samples are extracted as candidates via label verification and, in turn, as target clusters. Since only samples whose probability is larger than the mean probability of their predicted class are retained, the number of samples in the target clusters is smaller than the number of all unlabeled samples. In fact, this number can hardly increase monotonically with iterations due to the inconsistency between the segmentation results obtained by the RW and ERW algorithms. Therefore, if the increase in the number of cluster samples is less than 5% of the total number of unlabeled samples, we consider that convergence has been reached.
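The stopping rule then reduces to a one-line check (illustrative names):

```python
def converged(n_prev, n_now, n_unlabeled, tol=0.05):
    """CDCL stopping rule: the growth of the extracted cluster set falls
    below tol (5% in the paper) of the total number of unlabeled samples."""
    return (n_now - n_prev) < tol * n_unlabeled
```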
IV. Experimental Data and Setup
IV-A Dataset Description
[Table I: No., class, color, and numbers of labeled samples for the Pavia University and Pavia Center images.]
The first dataset consists of two hyperspectral images collected by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia and the Pavia City Center. The Pavia City Center image contains 102 spectral bands and has a size of 1096 × 492 pixels. The Pavia University image instead contains 103 spectral reflectance bands and has a size of 610 × 340 pixels. Only the seven classes shared by both images are considered herein. In the experiments, the Pavia University image is considered as the source domain and the Pavia City Center image as the target domain, or vice versa; these two cases are denoted as Univ/Center and Center/Univ, respectively. Note that manually selected training maps (TM) are publicly available and widely used in related publications [51, 46, 1, 52]. The color composite image, ground truth (GT) and TM of the Pavia dataset are illustrated in Fig. 4, whereas the corresponding numbers of labeled samples are detailed in Table I.
The second dataset consists of two hyperspectral images captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) over Salinas Valley, California and Northwest Indiana. The Salinas image contains 224 bands of 512 × 217 pixels, 20 of which are discarded as water absorption bands. Fig. 5(a-b) shows the color composite image and the GT of the Salinas dataset, in which 16 different classes represent mostly different types of crops. After removing 20 spectral bands due to noise and water absorption, the Indian Pines image contains 200 bands of 145 × 145 pixels, with a spatial resolution of 20 m per pixel. The color composite image and the GT containing 16 different classes are presented in Fig. 5(c-d). The classes of both images are listed in Table I with the corresponding numbers of samples. Since we mainly focus on the HDA problem, a low-dimensional image is considered as the source domain, obtained by clustering the spectral space of the original data for each image. Specifically, the original bands of the HSI are clustered into 50 groups using the K-means algorithm, and the mean value of each cluster is considered as a new spectral band, providing a total of 50 new bands. The corresponding cases are denoted as the Salinas and Indian cases, respectively.
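The construction of the low-dimensional source image can be sketched as follows; since the K-means settings are not specified above, the initialization and iteration count here are our own assumptions:

```python
import numpy as np

def cluster_bands(hsi, n_groups=50, n_iter=20, seed=0):
    """Reduce an (H, W, B) hyperspectral cube to n_groups bands by K-means
    clustering of the spectral bands (each band is one point in an
    H*W-dimensional space) and averaging the bands within each cluster."""
    H, W, B = hsi.shape
    X = hsi.reshape(-1, B).T.astype(float)        # one row per spectral band
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(B, n_groups, replace=False)].copy()
    for _ in range(n_iter):                       # plain Lloyd iterations
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = d.argmin(1)
        for k in range(n_groups):
            if np.any(assign == k):
                centers[k] = X[assign == k].mean(0)
    # Each new band is the mean of the original bands in its cluster.
    out = np.stack([X[assign == k].mean(0) if np.any(assign == k) else centers[k]
                    for k in range(n_groups)])
    return out.T.reshape(H, W, n_groups)
```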
IV-B Experimental Setup
In order to make a general comparison, the default parameters of the ERW classifier given in [49, 46] are adopted for the proposed algorithm, and the free parameters of the RW and ERW steps are set accordingly. In addition, the threshold on the correlation coefficients and the query size are set to 0.5 and 10, respectively, in all experiments. The regularization parameter of the linear SVM in our method is tuned with 5-fold cross-validation.
Several approaches of semi-supervised HDA proposed for visual and remote sensing applications are employed as baseline methods:
CCA: CCA aligns the two domains by using the same number of labeled samples from the source and target domains. Specifically, a random selection of samples from the source or target domain is applied to ensure pairwise correspondences between domains.
C-CCA: C-CCA is directly employed using the labeled samples of both domains.
DAMA: DAMA adopts a linear projection to match the differences between the source and target subspaces.
SSMA: SSMA carries out adaptation through manifold alignment while preserving label (dis)similarities and the geometric structures of the single manifold in both domains.
KEMA: KEMA is a kernelized version of SSMA.
SHFA: SHFA simultaneously learns the target classifier and infers the target labels in an augmented common feature space.
CDLS: CDLS jointly explores a domain-invariant feature subspace and identifies cross-domain landmarks.
| ||Univ/Center||Center/Univ||Salinas||Indian|
|TR_S||50||10, 20, 50||50||5, 10, 15|
|TR_T||2||2, 3, 5||2||2, 3, 5|
Moreover, several methods applied only to the target domain are also employed as baselines:
No Adaptation (NA): NA is a basic baseline that learns a linear SVM using only the initially labeled target samples.
LapSVM: LapSVM is a typical baseline for semi-supervised classification; the one-vs-one strategy for the linear SVM is applied for fair comparison.
ERW: ERW carries out classification using the initial probabilities learned by a linear SVM and the initially labeled target samples.
The threshold on the correlation coefficient for CCA and C-CCA is set to 0.5. A common free parameter of DAMA, SSMA and KEMA is set to 0.9, whereas the optimal dimensionality of the final projection for the three methods is cross-validated by exploiting the labeled source and target samples. Once the samples are projected onto the new subspace, the final classification results of CCA, C-CCA, DAMA, SSMA and KEMA are obtained by training a linear SVM using the labeled samples of both domains, with the regularization parameter tuned as for the proposed method. The parameters of SHFA are tuned as in the original publication. The dimensionality of PCA in CDLS is set to 30, whereas the other parameters of CDLS are tuned as in the original publication. The parameters of LapSVM are fixed in all experiments, whereas the parameters of ERW are set as in the proposed method for fair comparison.
In a practical application, the number of labeled samples in the target HSI is typically not enough to learn a reliable classifier, whereas the amount of labeled samples in the source HSI is relatively larger. To model this scenario, we randomly select a limited amount of samples from the target HSI as labeled. Table II lists the settings of training and test samples used in our experiments, which consist of three parts: 1) training samples (labeled) from the source HSI (TR_S); 2) training samples (labeled) from the target HSI (TR_T); and 3) test samples (unlabeled) from the target HSI (TE_T). The integers (i.e., 2, 3, 5) in Table II represent the number of samples per class, whereas the percentages refer to the ratio of training or testing samples. For example, the setting of the Univ/Center case means that 50 labeled source samples and 2 labeled target samples per class are selected as training samples. Note that the testing samples of the four cases are selected from the corresponding target ground truth. The training samples for the Pavia dataset are selected from the publicly available training maps (see Fig. 4), whereas the training samples for the Salinas and Indian datasets are selected from the ground truth. To assess the effect of varying amounts of training samples in both domains, various settings of TR_S and TR_T are applied for the Center/Univ and Indian cases. For each setting in Table II, 50 trials of the classification have been performed to ensure stability of the results. The classification results are evaluated in terms of Overall Accuracy (OA), Average Accuracy (AA) and the Kappa statistic. All our experiments have been conducted using Matlab R2017b on a desktop PC equipped with an Intel Core i5 CPU (at 3.1 GHz) and 8 GB of RAM.
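The three evaluation metrics follow standard definitions and can be computed from the confusion matrix as in the sketch below (function name ours):

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Overall Accuracy, Average Accuracy and the Kappa statistic computed
    from the confusion matrix of a classification map."""
    C = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    n = C.sum()
    oa = np.trace(C) / n                               # overall accuracy
    aa = np.mean(np.diag(C) / C.sum(axis=1))           # mean per-class accuracy
    pe = (C.sum(axis=0) @ C.sum(axis=1)) / n ** 2      # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa
```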
V. Results and Discussions
[Table: method, metrics and results for different numbers of source/target training samples (per class).]
V-A Results of Univ/Center Case
To illustrate the effectiveness of the proposed CDCL on the whole HSI, an experiment is performed with the TR_T and TR_S setting in Table II and all unlabeled samples of the Pavia Center HSI as TE_T. Fig. 6(a)-(k) shows the classification results obtained by the different methods, i.e. CCA, C-CCA, DAMA, SSMA, KEMA, SHFA, CDLS, NA, LapSVM, ERW and the proposed CDCL, whereas Fig. 6(l) shows the corresponding ground truth. From this figure, it can be seen that the CDCL method effectively removes the noise present in the NA and ERW classification results. Furthermore, the CDCL method obtains the highest OA = 91.03%. Table III reports the results of the different methods in terms of individual class accuracies and of the mean and standard deviation of the OA, AA and Kappa statistics using the setting in Table II. The following observations can be made:
The CDCL method gives the highest classification accuracies for “Baresoil”, “Bricks” and “Bitumen” classes. Moreover, the CDCL method also shows the best performance in terms of OA = 83.24%, AA = 82.29%, and Kappa = 80.00%.
The results of KEMA and SHFA are comparable and better than those of the other HDA methods, whereas the CCA method performs worst due to the fact that only a part of the labeled samples is used.
The NA method outperforms the LapSVM and ERW methods, and even all the baseline HDA methods. It can be concluded that the knowledge of the Pavia University data can hardly be transferred well to the Center data with limited labeled target samples. In addition, both CDCL and ERW perform worse than the NA method on the “Meadows” and “Shadows” classes, confirming the relationship between the ERW and CDCL methods.
V-B Results of Center/Univ Case
Table IV illustrates the OAs, AAs, Kappa statistics and the corresponding standard errors obtained by the proposed CDCL method and the baseline methods for the Center/Univ case. The experiments are performed with the different numbers of source and target training samples listed in Table II. The following observations can be drawn:
When the number of labeled source and target samples increases, the mean OAs, AAs and Kappa statistics of most methods increase as expected. The increasing trend of the mean OAs with more target training samples confirms that 50 trials are enough to achieve stable results. Moreover, the standard errors of the OAs, AAs and Kappa statistics are higher for smaller numbers of labeled samples.
The CDCL method gives the highest classification accuracies for all numbers of training samples. Specifically, the mean OAs of the NA, ERW and CDCL methods are in the ranges 58.88%-67.4%, 70.05%-83.12% and 72.35%-85.66%, respectively. Further, when only 10 labeled source samples per class are used for training, the CDCL method yields 2.30%, 7.11% and 2.48% higher mean OAs than ERW with 2, 3 and 5 target samples per class, respectively.
Fig. 7 reports the individual class accuracies for the Center/Univ case obtained by the C-CCA, NA, ERW and CDCL methods using different numbers of labeled samples, assessed by the mean accuracies (main curves) and their standard errors (shaded area around each curve). The classification accuracies of the 7 classes (“asphalt”, “meadows”, “trees”, “baresoil”, “bricks”, “bitumen”, “shadows”) are shown in Fig. 7(a-g), respectively; the abscissas represent the different settings of the number of training samples in Table IV. The CDCL method outperforms the C-CCA, NA and ERW methods on the “asphalt” (a), “meadows” (b), “baresoil” (d) and “bitumen” (f) classes, and shows accuracy comparable with the ERW method on the “bricks” class (e), yielding a better overall classification accuracy. Further, the ERW method performs worse than NA on the “trees” (c) and “shadows” (g) classes, resulting in low accuracies of the CDCL method on these two classes.
V-C Results of Salinas Case
[Table: method, metrics and results for different numbers of source/target training samples (per class).]