Tensor Alignment based Domain Adaptation for Hyperspectral Image Classification

08/29/2018 ∙ by Yao Qin, et al. ∙ Università di Trento

This paper presents a tensor alignment (TA) based domain adaptation method for hyperspectral image (HSI) classification. To be specific, the HSIs in both domains are first segmented into superpixels, and tensors of both domains are constructed to include the neighboring samples from each single superpixel. Then the subspace invariance between the two domains is modeled by projection matrices, and the original tensors are projected into the invariant tensor subspace as core tensors with lower dimensions by applying Tucker decomposition. To preserve the geometric information of the original tensors, a manifold regularization term on the core tensors is incorporated into the decomposition process. The projection matrices and core tensors are solved in an alternating optimization manner, and the convergence of the TA algorithm is analyzed. In addition, a post-processing strategy based on pure-sample extraction from each superpixel is defined to further improve classification performance. Experimental results on four real HSIs demonstrate that the proposed method achieves better performance than state-of-the-art subspace learning methods when only a limited number of labeled source samples is available.

I Introduction

In the past decades, extensive research efforts have been devoted to hyperspectral remote sensing, since hyperspectral data contain detailed spectral information measured in contiguous bands of the electromagnetic spectrum [1, 2, 3]. Due to the discriminative spectral information of such data, they have been used for a wide variety of applications, including agricultural monitoring [4], mineral exploration [5], etc. One fundamental challenge in these applications is how to generate accurate land-cover maps. Although supervised learning for hyperspectral image (HSI) classification has been extensively developed in the literature (including random forest [6], support vector machine (SVM) [7], Laplacian SVM (LapSVM) [8, 9, 10], decision trees [11] and support tensor machine (STM) [12]), sufficient labeled training samples must be available to obtain satisfactory classification results. This would require extensive and expensive field data collection campaigns. Furthermore, with the advance of newly developed spaceborne hyperspectral sensors, large numbers of HSIs are easily collected, and it is not feasible to label samples of these images in a timely manner as references for training. Therefore, only limited labeled samples are available in most real applications of hyperspectral classification.

According to statistical learning theory, the data to be classified are expected to follow the same probability distribution function (PDF) as the training data. However, since the physical conditions (i.e., illumination, atmosphere, sensor parameters, etc.) can hardly be the same when collecting data, the PDFs of training and testing data tend to be different (but related) [13]. How to apply the labeled samples of the original HSI to the related HSI is therefore challenging in such cases. These problems can be addressed by adapting models trained on a limited number of source samples (source domain) to new but related target samples (target domain). The problem should be further studied for the development of hyperspectral applications.

According to the machine learning and pattern recognition literature, the problem of adapting a model trained on a source domain to a target domain is referred to as transfer learning or domain adaptation (DA) [13]. The main idea of transfer learning is to adapt the knowledge learned in one task to a related but different task. Excellent reviews of transfer learning can be found in [14, 15]. In general, transfer learning is divided into four categories based on the properties of domains and tasks, i.e., DA, multi-task learning, unsupervised transfer learning and self-taught learning. In fact, DA has the greatest impact on practical applications. When applied to classification problems, DA aims to generate accurate classification results for target samples by utilizing the knowledge learned on the labeled source samples. According to [3], DA techniques for remote sensing applications can be roughly categorized as selection of invariant features, adaptation of data distributions, adaptation of the classifier, and adaptation of the classifier by active learning (AL).

In our case of HSI classification, we focus on the second category, i.e., adaptation of data distributions, in which the data distributions of both domains are made as similar as possible so that the classifier can remain unchanged. Despite the fact that several DA methods have been proposed for HSI classification, they treat HSIs as collections of individual samples, which renders them incapable of reflecting and preserving the important spatial consistency of neighboring samples. In this paper, to exploit the spatial information in a natural and efficient way, tensorial processing is utilized, which treats HSIs as three-dimensional (3D) tensors. Tensor arithmetic is a generalization of matrix and vector arithmetic, and is particularly well suited to represent multilinear relationships that neither vector nor matrix algebra can capture naturally [16, 17]. The power of tensorial processing for improving classification performance without DA has been proved in [18, 19, 20, 21, 22]. Similarly, when we apply tensorial processing to HSIs in DA, the multilinear relationships between neighboring samples in both HSIs are well captured and preserved, whereas conventional DA methods based on vectorization deal with single samples. Tensor-based DA methods have demonstrated their efficacy and efficiency on cross-domain visual recognition tasks [17, 23], whereas there are few published works on DA using tensorial processing of HSIs.

To be specific, we propose a tensor alignment (TA) method for DA, which can be divided into two steps. First, the original HSI data cubes in both domains are divided into small superpixels, and each central sample is represented as a 3D tensor consisting of samples from the same superpixel. In this way, each tensor is expected to include more samples from the same class. Since tensors act as “basic elements” in the TA method, we believe that high purity of the tensors brings better adaptation performance. Second, taking into account the computational cost, we randomly select part of the target tensors in the process of TA to identify the invariant subspace between the two domains in the form of projection matrices. This is done on the source and selected target tensors, and the subspace shared by both domains is obtained by utilizing three projection matrices with the original geometry preserved. The solution is obtained through Tucker decomposition [24] with an orthogonality constraint on the projection matrices, and the original tensors are represented by core tensors in the shared subspace. Fig. 1 illustrates the manifold regularized TA method with a 1-Nearest Neighbor (1NN) geometry preserved.

In addition to the TA method, after generating the classification map, a post-processing strategy based on pure-sample extraction from each superpixel is employed to improve performance. The pure samples in a superpixel have similar spectral features and likely belong to the same class. Therefore, if most pure samples in a superpixel are classified as a given class, it is probable that the remaining pure samples belong to the same class. Since samples in one superpixel may belong to two or even more classes and there are always classification errors in DA, the ratio of pure samples predicted as the same class might be reduced if we extract more pure samples. Therefore, we extract the pure samples by fixing this ratio at 0.7. Specifically, the final pure samples are extracted by first projecting the samples in each superpixel onto the principal component axes and then including more samples in the middle range of the axes until the ratio reaches 0.7. In this way the consistency of classification results on pure samples is enforced. To sum up, the main contributions of our work lie in the following two aspects:
1) We propose a manifold regularized tensor alignment (TA) method for DA and develop an efficient iterative algorithm to find the solution. Moreover, we analyze the convergence properties of the proposed algorithm as well as its computational complexity.
2) We introduce a pure-sample extraction strategy as post-processing to further improve the classification performance.
Comprehensive experiments on four publicly available benchmark HSIs have been conducted to demonstrate the effectiveness of the proposed algorithm.

The rest of the paper is organized as follows. Related works on adaptation of data distributions, tensorial processing of HSIs and multilinear algebra are reviewed in Section II. The proposed TA methodology is presented in Section III, while the pure-sample extraction strategy for classification improvement is outlined in Section IV. Section V describes the experimental datasets and setup. Results and discussions are presented in Section VI. Section VII summarizes the contributions of our research.

Fig. 1: Illustration of the manifold regularized tensor alignment method. There are 5 tensor objects for each class in the source domain and 3 tensor objects for each class in the target domain. The shared subspace is obtained by utilizing 3 projection matrices with the original geometry preserved. Each arrow represents the 1NN relationship between tensors. Best viewed in color.

II Related Work

This section briefly describes important studies related to the adaptation of data distributions, tensorial processing of hyperspectral data and basic concepts in multilinear algebra.

II-A Adaptation of Data Distributions

Several methods for the adaptation of data distributions focus on subspace learning, where the projected data from both domains are well aligned. Then, the same classifier (or regressor) is expected to be suitable for both domains. In [25], the data alignment is achieved through principal component analysis (PCA) or kernel PCA (KPCA). In [26], a PCA-based subspace alignment (SA) algorithm is proposed, where the source subspace is aligned as closely as possible to the target subspace using a matrix transformation. In [17], features from a convolutional neural network are treated as tensors and their invariant subspace is obtained through Tucker decomposition. In [27], the authors align domains with canonical correlation analysis (CCA) and then perform change detection. The approach is extended to a kernelized and semisupervised version in [28], where the authors perform change detection with different sensors. In [29], a supervised multi-view canonical correlation analysis ensemble is presented to address heterogeneous domain adaptation problems.

A few studies assume that data from both domains lie on the Grassmann manifold, where data alignment is conducted. In [30], the sampling geodesic flow (SGF) method is introduced, in which a finite number of intermediate subspaces are sampled along the geodesic path connecting the source subspace and the target subspace. The geodesic flow kernel (GFK) method in [31] models an infinite number of subspaces characterizing the incremental changes between the two domains. Along this line, the GFK support vector machine in [32] shows the performance of GFK in nonlinear feature transfer tasks. A GFK-based hierarchical subspace learning strategy for DA is proposed in [33], and an iterative coclustering technique applied to the subspace obtained by GFK is proposed in [34].

Other studies hold the view that the subspaces of both domains can be reconstructed or clustered under low-rank constraints. The reconstruction matrix is enforced to be low-rank, and a sparse matrix is used to represent noise and outliers. In [35], a robust domain adaptation low-rank reconstruction (RDALRR) method is proposed, where a transformed intermediate representation of the samples in the source domain is linearly reconstructed by the target samples. In [36], the low-rank transfer subspace learning (LTSL) method is proposed, where transformations are applied to both domains to resolve the disadvantages of RDALRR. In [37], a low-rank and sparse representation (LRSR) method is presented by additionally enforcing the reconstruction matrix to be sparse. To obtain a better reconstruction matrix, structured domain adaptation (SDA) in [38] utilizes a block-diagonal matrix to iteratively guide the computation. Different from the above methods, latent sparse domain transfer (LSDT) in [39] is inspired by subspace clustering, while the low-rank reconstruction and instance weighting label propagation algorithm in [40] attempts to find new representations for the samples in different classes of the source domain through multiple linear transformations.

Other methods focus on feature extraction strategies that minimize predefined distance measures, e.g., the Maximum Mean Discrepancy (MMD) or the Bregman divergence. In [41], transfer component analysis (TCA) tries to learn transfer components across domains in a Reproducing Kernel Hilbert Space (RKHS) using MMD. It is then applied to remote sensing images in [42]. TCA is further improved by joint domain adaptation (JDA), where both the marginal and conditional distributions are adapted in a dimensionality reduction procedure [43]. Furthermore, transfer joint matching (TJM) aims to reduce the domain difference by jointly matching the features and reweighting the instances across domains [44]. Recently, joint geometrical and statistical alignment (JGSA) has been presented, which reduces the MMD while forcing both projection matrices to be close [45]. In [46] and [47], the authors transfer category models trained on landscape views to aerial views for high-resolution remote sensing images by reducing the MMD.

Different from the above category for feature extraction, several studies employ manifold learning to preserve the original geometry. In [48], both domains are matched through manifold alignment while preserving label (dis)similarities and the geometric structures of the single manifolds. The algorithm is extended to a kernelized version in [49]. Spatial information of HSI data is taken into account for manifold alignment in [50]. In [51], both local and global geometric characteristics of both domains are preserved and bridging pairs are extracted for alignment. In addition to manifold learning, the manifold regularized domain adaptation (MRDA) method integrates spatial information and the overall mean coincidence method to improve prediction accuracy [52]. Beyond classical subspace learning, manifold assumption and feature extraction methods, several other approaches are proposed in the literature, such as class centroid alignment [53], histogram matching [54] and graph matching [55].

II-B Tensorial Processing of Hyperspectral Data

A number of tensor-based methods have been applied to HSI processing to fully exploit both spatial and spectral information. The texture features of HSIs at different scales, frequencies and orientations are successfully extracted by the 3D discrete wavelet transform (3D-DWT) [56]. The gray-level co-occurrence matrix is extended to its 3D version in [57] to improve classification performance. The tensor discriminative locality alignment (TDLA) algorithm optimizes the discriminative local information for feature extraction [58], while the local tensor discriminant analysis (LTDA) technique is employed in [59] for spectral-spatial feature extraction. The high-order structure of HSIs along all dimensions is fully exploited by superpixel tensor sparse coding to better understand the data in [60]. Moreover, several conventional 2D methods have been extended to 3D for HSI processing, such as the 3D extension of empirical mode decomposition in [61, 62]. The modified tensor locality preserving projection (MTLPP) algorithm is presented for HSI dimensionality reduction and classification in [63].

II-C Notations and Basics of Multilinear Algebra

A tensor is a multi-dimensional array that generalizes the matrix representation. Vectors and matrices are first- and second-order tensors, respectively. In this paper, we use lowercase letters (e.g., $x$), boldface lowercase letters (e.g., $\mathbf{x}$) and boldface capital letters (e.g., $\mathbf{X}$) to denote scalars, vectors and matrices, respectively. Tensors of order 3 or higher are denoted by boldface Euler script letters (e.g., $\mathcal{X}$). The Kronecker product, the Frobenius norm and the vectorization operator are denoted by $\otimes$, $\|\cdot\|_F$ and $\mathrm{vec}(\cdot)$, respectively, and $\mathrm{tr}(\cdot)$ denotes the trace of a matrix.

An $M$-th order tensor is denoted by $\mathcal{X} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$. The corresponding $n$-mode matricization of the tensor $\mathcal{X}$, denoted by $\mathbf{X}_{(n)}$, unfolds the tensor with respect to mode $n$. The operation between the tensor $\mathcal{X}$ and a matrix $\mathbf{U} \in \mathbb{R}^{J_n \times I_n}$ is the $n$-mode product, denoted by $\mathcal{X} \times_n \mathbf{U}$, which is a tensor of size $I_1 \times \cdots \times I_{n-1} \times J_n \times I_{n+1} \times \cdots \times I_M$. Briefly, we write the product of the tensor $\mathcal{X}$ with a set of projection matrices $\{\mathbf{U}^{(m)}\}_{m=1}^{M}$ excluding the $n$-mode as:

$\mathcal{X} \times_{-n} \{\mathbf{U}\} = \mathcal{X} \times_1 \mathbf{U}^{(1)} \cdots \times_{n-1} \mathbf{U}^{(n-1)} \times_{n+1} \mathbf{U}^{(n+1)} \cdots \times_M \mathbf{U}^{(M)}$   (1)

Tucker decomposition is one of the most well-known decomposition models for tensor analysis. It decomposes an $M$-mode tensor $\mathcal{X}$ into a core tensor $\mathcal{G}$ multiplied by a set of projection matrices $\{\mathbf{U}^{(m)}\}_{m=1}^{M}$, with the objective function defined as follows:

$\min_{\mathcal{G}, \{\mathbf{U}\}} \left\| \mathcal{X} - \mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_M \mathbf{U}^{(M)} \right\|_F^2$   (2)

where $\mathcal{G} \in \mathbb{R}^{J_1 \times J_2 \times \cdots \times J_M}$ and $\{\mathbf{U}\}$ represents $\{\mathbf{U}^{(m)}\}_{m=1}^{M}$ with $\mathbf{U}^{(m)} \in \mathbb{R}^{I_m \times J_m}$. For simplicity, we denote $\mathcal{G} \times_1 \mathbf{U}^{(1)} \times_2 \mathbf{U}^{(2)} \cdots \times_M \mathbf{U}^{(M)}$ as $\mathcal{G} \times \{\mathbf{U}\}$.

By applying the $n$-mode unfolding, Eq. (2) can alternatively be written as

$\min \left\| \mathbf{X}_{(n)} - \mathbf{U}^{(n)} \mathbf{G}_{(n)} \mathbf{U}_{\otimes -n}^{T} \right\|_F^2$   (3)

where $\mathbf{G}_{(n)}$ denotes the $n$-mode unfolding of $\mathcal{G}$, and $\mathbf{U}_{\otimes -n}$ denotes $\mathbf{U}^{(M)} \otimes \cdots \otimes \mathbf{U}^{(n+1)} \otimes \mathbf{U}^{(n-1)} \otimes \cdots \otimes \mathbf{U}^{(1)}$. The vectorization of (3) can be formulated as

$\min \left\| \mathrm{vec}(\mathbf{X}_{(n)}) - \left( \mathbf{U}_{\otimes -n} \otimes \mathbf{U}^{(n)} \right) \mathrm{vec}(\mathbf{G}_{(n)}) \right\|_F^2$   (4)

Note that regularizations of $\{\mathbf{U}\}$ and $\mathcal{G}$ are ignored in the above equations.
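To make these operations concrete, the following minimal NumPy sketch (not part of the paper; shapes, names and values are illustrative) implements the n-mode unfolding, the n-mode product, and a Tucker-style reconstruction as in Eq. (2).

```python
import numpy as np

def unfold(X, n):
    """n-mode matricization: move mode n to the front and flatten the rest."""
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def fold(Xn, n, shape):
    """Inverse of unfold for a tensor with the given full shape."""
    full = [shape[n]] + [s for i, s in enumerate(shape) if i != n]
    return np.moveaxis(Xn.reshape(full), 0, n)

def mode_n_product(X, U, n):
    """n-mode product X x_n U, where U.shape[1] must equal X.shape[n]."""
    shape = list(X.shape)
    shape[n] = U.shape[0]
    return fold(U @ unfold(X, n), n, shape)

# Tucker-style reconstruction G x_1 U1 x_2 U2 x_3 U3 as in Eq. (2);
# the dimensions below are arbitrary illustrative values.
I1, I2, I3 = 7, 7, 103          # e.g., spatial window x spectral bands
J1, J2, J3 = 7, 7, 20           # core (projected) dimensions
G = np.random.randn(J1, J2, J3)
U = [np.linalg.qr(np.random.randn(I, J))[0] for I, J in [(I1, J1), (I2, J2), (I3, J3)]]

X_hat = G
for n, Un in enumerate(U):
    X_hat = mode_n_product(X_hat, Un, n)
print(X_hat.shape)              # (7, 7, 103)
```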

III Proposed Tensor Alignment Approach

III-A Problem Definition

Assume that we have $N_s$ tensor samples $\{\mathcal{X}^s_i\}_{i=1}^{N_s}$ of mode 3 in the source domain, where $\mathcal{X}^s_i \in \mathbb{R}^{I_1 \times I_2 \times I_3}$. Similarly, the $N_t$ tensors in the target domain are denoted as $\{\mathcal{X}^t_j\}_{j=1}^{N_t}$. In this paper, we consider only the homogeneous DA problem, thus we assume that both domains share the same dimensions $I_1 \times I_2 \times I_3$. In the context of DA, we follow the idea of representing the subspace invariance between the two domains as projection matrices $\{\mathbf{U}^{(m)}\}_{m=1}^{3}$, where $\mathbf{U}^{(m)} \in \mathbb{R}^{I_m \times J_m}$. Intuitively, we propose to conduct subspace learning on the tensor samples of both domains with manifold regularization. Fig. 1 shows that the shared subspace is obtained by utilizing the 3 projection matrices. By performing Tucker decomposition simultaneously, the tensor samples in both domains are represented by the corresponding core tensors $\mathcal{G}^s$ and $\mathcal{G}^t$ with smaller dimensions in the shared subspace. The geometric information should be preserved as much as possible by enforcing manifold regularization during subspace learning. In the next subsection, we introduce how to construct HSI tensors and perform tensor subspace alignment.

Fig. 2: Illustration of (a) superpixel segmentation of the Pavia University data and (b) how to determine the neighbors of the central (training/testing) sample. (b1) A patch centered at the central sample. (b2) Conventional methods take the whole patch as neighbors (see the purple area). (b3) Strategy used in our method.

III-B Tensor Construction

DA is achieved by tensor alignment, where HSI tensors are regarded as “basic elements”. Therefore, HSI tensors are expected to be as pure as possible. An ideal tensor should consist of samples belonging to the same class. However, spatial square patches centered at the training (or testing) samples can contain samples from different classes when extracted at the edge between different classes. To obtain pure tensors, superpixel segmentation [64] is performed on the first three principal components of the HSI. Fig. 2(a) shows the segmentation result of Pavia University data as an illustrative example. Then, samples surrounding the central sample in the same superpixel are utilized to form the tensor [see Fig. 2(b)]. This strategy can preserve the spectral-spatial structure of the HSI cube and takes into account the dissimilarity of various neighborhoods, particularly at the edges of different classes.
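The paper does not give code for this step; the sketch below is one plausible NumPy implementation, where the handling of window pixels that fall outside the superpixel (they are replaced by the central spectrum) is our own assumption.

```python
import numpy as np

def build_tensor(H, S, r, c, w=7):
    """Form a w x w x bands tensor around pixel (r, c).

    H: (rows, cols, bands) HSI cube; S: (rows, cols) superpixel label map.
    Neighbors outside the image or belonging to a different superpixel than
    (r, c) are replaced by the central spectrum (one plausible choice; the
    paper only states that neighbors from the same superpixel are used).
    """
    rows, cols, _ = H.shape
    half = w // 2
    T = np.tile(H[r, c], (w, w, 1))          # initialize with the central spectrum
    for i in range(-half, half + 1):
        for j in range(-half, half + 1):
            rr, cc = r + i, c + j
            if 0 <= rr < rows and 0 <= cc < cols and S[rr, cc] == S[r, c]:
                T[i + half, j + half] = H[rr, cc]
    return T
```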

III-C Method Formulation

After the construction of the tensors, two weight matrices, one per domain, are computed to enforce manifold regularization. We compute the source weight matrix $\mathbf{W}^s$ in a supervised manner using the labels of the source tensors, while the 10 nearest neighboring samples are searched via the spectral angle measure (SAM) for the target weight matrix $\mathbf{W}^t$. Then, binary weights are employed for the weight matrix construction. To reduce the computational cost, dimensionality reduction of all tensors is conducted via multilinear PCA (MPCA). Given the tensor samples and the weight matrices of both domains, the final optimization problem is defined as:

$\begin{aligned} \min_{\{\mathbf{U}\}, \{\mathcal{G}^s_i\}, \{\mathcal{G}^t_j\}} \; & \sum_{i} \left\| \mathcal{X}^s_i - \mathcal{G}^s_i \times \{\mathbf{U}\} \right\|_F^2 + \sum_{j} \left\| \mathcal{X}^t_j - \mathcal{G}^t_j \times \{\mathbf{U}\} \right\|_F^2 \\ & + \lambda \Big( \sum_{i,j} w^s_{ij} \left\| \mathcal{G}^s_i - \mathcal{G}^s_j \right\|_F^2 + \sum_{i,j} w^t_{ij} \left\| \mathcal{G}^t_i - \mathcal{G}^t_j \right\|_F^2 \Big) \quad \text{s.t.} \; \mathbf{U}^{(m)T}\mathbf{U}^{(m)} = \mathbf{I}, \; m = 1, 2, 3 \end{aligned}$   (5)

where $\lambda$ is a nonnegative parameter controlling the importance of the manifold regularization, and $w^s_{ij}$ and $w^t_{ij}$ are the weights between the $i$-th and $j$-th tensors in $\mathbf{W}^s$ and $\mathbf{W}^t$, respectively. When $\lambda$ is small, the objective function depends mainly on the minimization of the reconstruction errors for all tensor objects. When it is large, the objective function depends mainly on the preservation of the tensor geometry information.
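As an illustration of the manifold term, the following hedged sketch builds a binary SAM-based k-NN weight matrix and the corresponding graph Laplacian; which feature vectors the SAM is computed on (here, vectorized tensors or their central spectra) is an assumption, since the paper only specifies the measure and k = 10.

```python
import numpy as np

def sam_knn_weights(X, k=10):
    """Binary k-NN weight matrix using the spectral angle measure (SAM).

    X: (n_samples, n_features) array of feature vectors.
    Returns the symmetric weight matrix W and the graph Laplacian L = D - W.
    """
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.clip(Xn @ Xn.T, -1.0, 1.0)
    angle = np.arccos(cos)                       # SAM distance matrix
    np.fill_diagonal(angle, np.inf)              # exclude self-matches
    W = np.zeros_like(angle)
    idx = np.argsort(angle, axis=1)[:, :k]       # k nearest neighbors per row
    rows = np.repeat(np.arange(X.shape[0]), k)
    W[rows, idx.ravel()] = 1.0
    W = np.maximum(W, W.T)                       # symmetrize (binary weights)
    L = np.diag(W.sum(axis=1)) - W
    return W, L
```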

III-D Optimization

The problem in (5) can be solved by alternately updating the projection matrices $\{\mathbf{U}\}$ and the core tensors $\{\mathcal{G}^s\}$ and $\{\mathcal{G}^t\}$ until the objective function converges. Herein, by applying the $n$-mode unfolding according to (3), we obtain the following problem

(6)

where the superscript $d$ is introduced to denote the source ($d = s$) and target ($d = t$) domains for simplicity.

III-D1 Updating $\mathcal{G}^s$ and $\mathcal{G}^t$

When $\{\mathbf{U}\}$ is fixed, by applying the vectorization shown in (4), the problem is formulated as

(7)

where . The matrix form of the above equation can be written as

(8)

where and are matrices in which the -th columns are and , respectively. The denotes the Laplacian matrix, and . Formally, we transform the problem above into the following optimization formulation

(9)

For simplicity and better illustration, , and . Let be the singular value decomposition of and denote as . Note that no information of is lost in the transformation because is invertible. Then we have

(10)

where , and . Based on the properties of trace and F-norm, we reformulate it as

(11)

where . We denote the -th row of matrix as . Then the problem above can be rewritten as

(12)

When only considering , we have

(13)

where is a positive definite matrix. This is an unconstrained quadratic programming problem and can be easily solved by setting the derivative to zero. The optimal core is obtained by updating all of its rows in this manner. When both $\mathcal{G}^s$ and $\mathcal{G}^t$ are updated, the core tensors are recovered by applying tensorization.
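Each row update above reduces to an unconstrained positive definite quadratic program; a generic sketch of that closed-form step is given below (the actual matrices depend on definitions omitted in the derivation).

```python
import numpy as np

def solve_quadratic_row(A, b):
    """Minimize f(g) = g^T A g - 2 b^T g for a positive definite A.

    Setting the derivative 2(A g - b) to zero gives the closed-form
    solution g = A^{-1} b, computed here with a linear solve.
    """
    return np.linalg.solve(A, b)
```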

III-D2 Updating $\{\mathbf{U}\}$

When the core tensors $\mathcal{G}^s$ and $\mathcal{G}^t$ are fixed, we first rewrite the problem in (5) as

(14)

where the two terms denote the concatenations of the sample tensors and of the core tensors in each domain, respectively. Similar to most tensor decomposition algorithms, the solution for $\{\mathbf{U}\}$ is obtained by updating one projection matrix with the others fixed. By using the $n$-mode unfolding, the problem is derived as the following constrained objective function:

(15)

which can be effectively solved by utilizing the singular value decomposition (SVD) of the corresponding matrix. Please refer to the Appendix for the proof. For an efficient computation, $\mathbf{U}^{(n)}$ is computed in the implementation as follows:

(16)
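The orthogonality-constrained step is the classical orthogonal Procrustes problem; a hedged sketch of that standard SVD-based solution follows (how the input matrix is assembled from the unfoldings is not reproduced here).

```python
import numpy as np

def orthogonal_factor_update(M):
    """Solve max_U tr(U^T M) s.t. U^T U = I (orthogonal Procrustes).

    With the thin SVD M = P S Q^T, the maximizer is U = P Q^T.
    """
    P, _, Qt = np.linalg.svd(M, full_matrices=False)
    return P @ Qt
```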

Based on the derived solutions of the projection matrices and core tensors, the proposed method is summarized in Algorithm 1. Since the objective function in (5) is non-convex with respect to the projection matrices and the core tensors, we initialize the projection matrices by solving a conventional Tucker decomposition problem in order to obtain a good stationary solution.

Algorithm 1 Tensor Alignment
Input: tensor sets in both domains, regularization parameter $\lambda$ and dimensions of the core tensors
Output: core tensor sets in both domains, projection matrices $\{\mathbf{U}\}$
1: Compute the two graph weight matrices;
2: Initialize $\{\mathbf{U}\}$ using Tucker decomposition;
3: while the optimization in (5) has not converged do
4:     update the core tensors by solving (13);
5:     update $\{\mathbf{U}\}$ by alternately solving (15);
6: end while
7: return the core tensor sets and $\{\mathbf{U}\}$.
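A skeleton of Algorithm 1's control flow is sketched below; the update and objective functions are placeholders for the closed-form steps (13) and (15) and the objective (5), and the HOSVD-style initialization is a common choice assumed here rather than taken from the paper.

```python
import numpy as np

def unfold(X, n):
    return np.moveaxis(X, n, 0).reshape(X.shape[n], -1)

def init_factors(tensors, core_dims):
    """HOSVD-style initialization: leading left singular vectors of each mode unfolding."""
    U = []
    for n, Jn in enumerate(core_dims):
        Xn = np.concatenate([unfold(X, n) for X in tensors], axis=1)
        Un, _, _ = np.linalg.svd(Xn, full_matrices=False)
        U.append(Un[:, :Jn])
    return U

def tensor_alignment(src, tgt, lam, core_dims, update_cores, update_factors,
                     objective, max_iter=15, tol=1e-4):
    """Control-flow skeleton of Algorithm 1; the graph weight matrices (step 1)
    are assumed to be precomputed inside the update callables."""
    U = init_factors(src + tgt, core_dims)                 # step 2
    prev = np.inf
    for _ in range(max_iter):                              # step 3
        Gs, Gt = update_cores(src, tgt, U, lam)            # step 4: solve (13)
        U = update_factors(src, tgt, Gs, Gt, lam)          # step 5: solve (15)
        obj = objective(src, tgt, Gs, Gt, U, lam)
        if prev - obj < tol:                               # step 6: convergence of (5)
            break
        prev = obj
    return Gs, Gt, U                                       # step 7
```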

III-E Convergence and Computational Complexity Analysis

Formally, the objective function of the optimization problem in (5) is denoted as $f(\{\mathbf{U}\}, \{\mathcal{G}\})$. In (13), we update $\mathcal{G}^s$ and $\mathcal{G}^t$ with $\{\mathbf{U}\}$ fixed, i.e., we solve $\{\mathcal{G}\}_{k+1} = \arg\min_{\{\mathcal{G}\}} f(\{\mathbf{U}\}_k, \{\mathcal{G}\})$. Since both of them have a closed-form solution, we have $f(\{\mathbf{U}\}_k, \{\mathcal{G}\}_{k+1}) \le f(\{\mathbf{U}\}_k, \{\mathcal{G}\}_k)$ for any iteration $k$. Similarly, given the closed-form solution of the optimal $\{\mathbf{U}\}_{k+1}$, we have $f(\{\mathbf{U}\}_{k+1}, \{\mathcal{G}\}_{k+1}) \le f(\{\mathbf{U}\}_k, \{\mathcal{G}\}_{k+1})$. Therefore, $f$ decreases monotonically over the iterations, assuring the convergence of the proposed algorithm. As shown in Section VI, the proposed algorithm achieves convergence in less than 15 iterations.

The computational complexity mainly consists of two parts: the unconstrained optimization problem in (9) and the orthogonality-constrained problem in (15). The number of iterations for updating the core tensors is denoted as , while is the average number of iterations for updating $\{\mathbf{U}\}$ following each update of the cores. For simplicity, the vectorization dimensionalities of the original and core tensors are denoted as and , respectively. Firstly, the complexity of (9) consists of the SVD in (10) and the matrix inverse in (13). The corresponding complexities are and , respectively. Secondly, given the SVD in (15), $\{\mathbf{U}\}$ is updated with complexity . In total, the complexity of the TA method is .

IV Pure Samples Based Classification Improvement

Once the projection matrices are computed, the source and target tensors are represented as core tensors $\mathcal{G}^s$ and $\mathcal{G}^t$, respectively. The predicted map of the target HSI can then be easily obtained by a supervised classifier. It is notable that only part of the target tensors are exploited for domain adaptation, and that superpixel segmentation contributes only to tensor construction in the whole process. In order to further exploit all target tensors and the superpixel segmentation, in this section a strategy based on pure-sample extraction is introduced to improve the classification performance.

We first introduce the PCA-based method used for extracting the pure samples. As suggested in [65], for each superpixel we perform PCA and choose the first three principal components as the projection axes. Then, we project all samples onto these three principal components. For each projection axis, we normalize the projection values to the range [0, 1]. Since samples belonging to the same class in each superpixel have similar spectral signatures, these samples are likely to be projected into the middle of this range, rather than toward the extreme values 0 and 1. Given a threshold (e.g., 0.9), if the normalized projection of a sample is larger than the threshold or smaller than one minus the threshold, a nonzero weight is assigned to the sample; otherwise, its weight is set to 0. In this way, each pixel is represented by three weights, one per component. Finally, the sum of all weights for each sample is regarded as its purity index, and the samples with a purity index equal to 0 are extracted as pure samples. Illustrative examples of pure-sample extraction are shown in Fig. 3, where the threshold is set to 0.7. After the extraction of the pure samples in the target HSI, we can apply the strategy for performance improvement.
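A hedged NumPy sketch of this extraction for a single superpixel is given below; the exact weighting scheme is not fully recoverable from the text above, so extreme projections simply receive a flag of 1 and the purity index is the sum of the flags (0 means pure).

```python
import numpy as np

def extract_pure_samples(pixels, t=0.7):
    """Flag 'pure' samples inside one superpixel.

    pixels: (n, bands) spectra of the superpixel.
    A sample is kept as pure when its normalized projection onto each of
    the first three principal components stays in the middle range [1-t, t].
    """
    centered = pixels - pixels.mean(axis=0)
    # principal axes = leading right singular vectors of the centered data
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ Vt[:3].T                          # (n, up to 3) projections
    pmin, pmax = proj.min(axis=0), proj.max(axis=0)
    norm = (proj - pmin) / (pmax - pmin + 1e-12)        # normalize to [0, 1]
    flags = ((norm > t) | (norm < 1 - t)).astype(int)   # extreme projection -> 1
    purity = flags.sum(axis=1)
    return purity == 0                                  # boolean mask of pure samples
```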

Fig. 3: (a) Superpixel segmentation of the Pavia University data. (b) Illustration of 3 superpixels. (c) PCA-based pure samples extraction of the 3 superpixels. Here, the min-max axis is the first principal component vector and the threshold is set as 0.7. (d) Results of pure samples for 3 superpixels. (Best viewed in color).

The pure samples in each superpixel are expected to belong to the same class. However, some samples are always predicted as a different class in the testing stage. Therefore, if most of the pure samples in one superpixel are predicted as a given class, it is reasonable to believe that the remaining pure samples belong to that class as well. Indeed, this idea is similar to spatial filtering, which also exploits spatial consistency. Since samples in one superpixel may belong to two or even more classes and there are always classification errors in DA, the ratio of pure samples predicted as the same class might be reduced if we extract more pure samples. Therefore, we extract the pure samples while fixing the ratio of pure samples predicted as the same class at 0.7. If 70% of the pure samples are predicted to belong to a given class, then the remaining 30% of the pure samples are changed to that class. To find the optimal set of pure samples, a greedy algorithm is applied to extract as many pure samples as possible while satisfying the constraint on this ratio:

(17)

where the ratio refers to the fraction of pure samples predicted as the same class. Intuitively, the set of extracted pure samples should be as large as possible (to include more samples for the purpose of improving classification). Meanwhile, it should not be too large, otherwise samples belonging to different classes would be included. We iteratively reduce the threshold by 0.01 to include more pure samples until the ratio of pure samples predicted as the same class reaches 70%. Then the predicted labels of the remaining 30% of the pure samples are replaced by the class predicted for the other 70%. We denote this strategy as TA_P for short. Although the strategy is simple, the experimental results in Section VI reveal that remarkable margins are gained by TA_P over the proposed TA method.
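The relabeling step can then be sketched as below (a hedged illustration reusing extract_pure_samples from the previous sketch; the direction of the threshold sweep and the per-superpixel loop are our assumptions about details the text leaves open).

```python
import numpy as np

def relabel_pure_samples(pred, superpixels, spectra, ratio=0.7):
    """TA_P-style post-processing sketch (names and details are illustrative).

    pred: non-negative integer class labels for all target pixels.
    For each superpixel the pure-sample set is greedily enlarged (here by
    widening the middle band of extract_pure_samples, one possible
    parametrization) while the majority-class ratio among the extracted
    samples stays at or above `ratio`; the minority pure samples are then
    relabeled to the majority class.
    """
    new_pred = pred.copy()
    for sp in np.unique(superpixels):
        idx = np.where(superpixels == sp)[0]
        best = None
        for t in np.arange(0.70, 1.00, 0.01):        # wider band -> more pure samples
            mask = extract_pure_samples(spectra[idx], t=t)
            if not mask.any():
                continue
            labels = pred[idx][mask]
            majority = np.bincount(labels).argmax()
            if np.mean(labels == majority) < ratio:
                break                                 # keep the last acceptable set
            best = (mask, majority)
        if best is not None:
            mask, majority = best
            new_pred[idx[mask]] = majority
    return new_pred
```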

V Data Description and Experimental Setup

Fig. 4: ROSIS Pavia dataset used in our experiments. (a) Color composite image and (b) ground truth of the university scene; (c) color composite image and (d) ground truth of city center scene.
Fig. 5: Houston GRSS2013 dataset used in our experiments. (a) Color composite image and (b) ground truth of the left dataset; (c) color composite image and (d) ground truth of the right dataset.

V-A Dataset Description

The first dataset consists of two hyperspectral images acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over the University of Pavia and the Pavia City Center (see Fig. 4). The Pavia City Center image contains 102 spectral bands and has a size of 1096×492 pixels. The Pavia University image contains 103 spectral bands and has a size of 610×340 pixels. Seven classes shared by both images are considered in our experiments. The number of labeled pixels available is detailed in Table I. In the experiments, the Pavia University image was considered as the source domain and the Pavia City Center image as the target domain, or vice versa. These two cases are denoted as univ/center and center/univ, respectively. Note that only 102 spectral bands of the Pavia University image were used for adaptation.

The second dataset is the GRSS2013 hyperspectral image consisting of 144 spectral bands. The data were acquired by the National Science Foundation (NSF)-funded Center for Airborne Laser Mapping (NCALM) over the University of Houston campus and the neighboring urban area. Originally, the dataset has a size of 349×1905 pixels and its ground truth includes 15 land-cover types. Similarly to the previous case, we consider two disjoint sub-images [Fig. 5(a)] and [Fig. 5(b)], respectively. For ease of reference, we name the two cases left/right and right/left. These sub-images share eight classes in the ground truth: “healthy grass”, “stressed grass”, “trees”, “soil”, “residential”, “commercial”, “road” and “parking lot 1”. The classes are listed in Table I with the corresponding number of samples.

No. Class Color in Fig. 4 Pavia University Pavia Center
1 Asphalt 6631 7585
2 Meadows 18649 2905
3 Trees 3064 6508
4 Baresoil 5029 6549
5 Bricks 3682 2140
6 Bitumen 1330 7287
7 Shadows 947 2165
No. Class Color in Fig. 5 Left Right
1 Healthy grass 547 704
2 Stressed grass 569 685
3 Trees 451 793
4 Soil 760 482
5 Residential 860 408
6 Commercial 179 1065
7 Road 697 555
8 Parking Lot 1 713 520
TABLE I: Number of Labeled Samples Available for the Pavia Dataset (top) and the GRSS2013 Dataset (bottom).

V-B Experimental Setup

To investigate the classification performance of the proposed methods, an SVM with linear kernel is employed as the supervised classifier. In detail, it is trained on the labeled source samples and tested on the unlabeled target samples. Although a classifier such as the SVM with Gaussian kernel performs better in a standard classification task, the optimal parameters of such a classifier tuned on source samples usually perform worse than expected on target samples in the context of DA. On the other hand, the simple linear kernel is not biased by parameter tuning and can capture the original relationships between samples from different domains. The free parameter C of the linear SVM is tuned in the range [0.001, 1000] by 5-fold cross-validation.
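A hedged scikit-learn sketch of this classification protocol is shown below; X_src, y_src and X_tgt are placeholders for the (projected) source features, source labels and target features.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def train_and_predict(X_src, y_src, X_tgt):
    # Linear SVM trained on labeled source samples; C selected by 5-fold
    # cross-validation over a logarithmic grid spanning roughly 0.001-1000.
    grid = {"C": np.logspace(-3, 3, 7)}
    clf = GridSearchCV(SVC(kernel="linear"), grid, cv=5)
    clf.fit(X_src, y_src)
    return clf.predict(X_tgt)
```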

Several unsupervised DA approaches for visual and remote sensing applications are employed as baseline methods:
- Source (SRC): SRC is the first baseline method, which trains the classifier directly on the labeled source samples.
- Target (TGT): TGT is the second baseline, which trains the classifier directly on labeled target samples (upper bound on performance).
- PCA: PCA treats source and target samples as a single domain.
- GFK: GFK proposes a closed-form solution to bridge the subspaces of the two domains using a geodesic flow on a Grassmann manifold.
- SA: SA directly adopts a linear projection to match the differences between the source and target subspaces. Our approach is closely related to this method.
- TCA: TCA carries out adaptation by learning transfer components across domains in an RKHS using MMD.

TABLE II: Classification Results for the Pavia Dataset with Different Numbers of Labeled Source Samples (5 to 40 per class). OA and Kappa are reported for SRC, TGT, PCA, GFK, SA, TCA and the proposed TA and TA_P in both the univ/center and center/univ cases. The first three best results of mean OAs for each column are reported in italic bold, underlined bold and bold, respectively. The proposed approaches outperform all the baseline DA methods. [The numerical entries of the table were lost in extraction and are not reproduced here.]

The parameters of GFK, SA and TCA are tuned as in [31], [26] and [41], respectively. The dimension of the final features in PCA is set the same as in SA. The main parameters of the TA method are the window size, the tensor dimensionality after MPCA, the core tensor dimensionality after TA and the manifold regularization parameter $\lambda$. These parameters are fixed in all experiments, with $\lambda$ set to 1e-3. Note that setting the spectral dimensionality to 20 and keeping the spatial dimensionality unchanged in MPCA guarantees that 99% of the energy is preserved and that the spatial information is well kept.

Given the computational cost of TA, we explore the adaptation ability of TA and TA_P with limited samples by randomly selecting tensors in both domains in each trial. To be specific, different numbers of tensors per class ([5 10 15 20 25 30 35 40] for the Pavia dataset and [3 4 5 6 7 8 9 10] for the GRSS2013 dataset) from the source domain, and 100 tensors per class from the target domain, are randomly selected for adaptation. After obtaining the projection matrices, the SVM classifier with linear kernel is trained on the selected source tensors and tested on all unlabeled target tensors. Regarding the SRC and TGT methods, the central samples of the selected source tensors and the same number of samples per class randomly selected from the target domain are employed for training, respectively. In the setting of the other DA baseline methods, the source tensors are vectorized as source samples and all target samples are used for adaptation. In the training stage, only the central samples are used as labeled samples. Taking the case of 10 source tensors per class as an example, 250 source samples per class are available for adaptation in the DA baselines, while only the 10 central source samples per class are used for training the classifier. For each setting with the same number of labeled source samples, 100 trials of the classification have been performed to ensure the stability of the results. The classification results are evaluated using the Overall Accuracy (OA), the Kappa statistic and the F-measure (harmonic mean of user's and producer's accuracies). All our experiments have been conducted using Matlab R2017b on a desktop PC equipped with an Intel Core i5 CPU (at 3.1 GHz) and 8 GB of RAM.

VI Results and Discussions

VI-A Classification Performances

Classification results for the GRSS2013 dataset with different numbers of labeled source samples (3 to 10 per class), reporting OA and Kappa for SRC, TGT, PCA, GFK, SA, TCA and the proposed TA and TA_P in the left/right and right/left cases. [The numerical entries of this table were lost in extraction, and the extracted section ends here.]