1 Introduction
A very challenging problem in the remote sensing community is to generate land-cover maps that semantically characterize Earth's surface DBLP:journals/lgrs/AlajlanPMF14 ; Li:2015 ; Li:2016 ; Tong:2016 ; Tang:2016 ; Li:2017 . As one of the most widely used approaches, hyperspectral image (HSI) classification has recently gained popularity and attracted research interest from other scientific disciplines such as image processing, machine learning, and computer vision Plaza:2009 ; Tarabalka:2012 ; Li:2014 ; Samat:2016 ; Ye:2016 ; Yan:2016 ; Zhang:2016 ; Shao:2017 ; Appice:2017 . Most of these studies rely on supervised learning, which has shown promising classification performance in practice. However, supervised methods usually require many labeled samples to train the classifier's parameters properly, which is expensive and time-consuming in real-world applications. Moreover, the high dimensionality of HSI makes it difficult to learn a reliable classifier from only a few labeled samples Ifarraguerri:2000 ; DBLP:journals/tgrs/RajanGC08 .
To address these issues, one feasible way is to exploit the available information from other geographical areas with abundant labeled samples (regarded as the source domain). However, these areas usually differ from the target one, and shifts in data distribution are common, especially for image data with underlying structures Chen:2015 . From a machine learning perspective, this shift problem can be modeled by transfer learning, especially by domain adaptation (DA) approaches. In such a scenario, it is usually assumed that the source domain and the target domain possess similar characteristics, i.e., they share the same set of label classes or correlated class distributions DBLP:journals/pami/BruzzoneM10 .
A number of DA techniques have been adopted in HSI classification tasks DBLP:journals/pami/BruzzoneM10 ; zhuo ; DBLP:journals/tgrs/BruzzoneF01 , where both labeled samples from the source domain and unlabeled samples from the target domain are exploited to train a classifier for the target domain. DBLP:journals/pami/BruzzoneM10 proposed a DA framework named DASVM, which extended the Transductive SVM (TSVM) to progressively label unlabeled target samples and remove some source samples. In zhuo , a multiple-kernel DA technique was designed to learn a discriminative model by simultaneously minimizing both the SVM structural risk function and the distribution mismatch. The DA work of DBLP:journals/tgrs/BruzzoneF01 adapted the maximum-likelihood classifier to the target domain by updating the classifier parameters.
Most existing domain adaptation works assume that labeled samples are available only for the source domain and not for the target domain, or that only a few labeled samples exist in the target domain. In fact, labeled information in the target domain can directly and notably improve the classification performance. To reduce the cost of collecting labeled data, an effective solution for discriminative classifier training is to generate the labeled data interactively using the active learning (AL) technique. In AL, some samples from the target domain are selected and labeled by the user; together with the existing training set in the source domain, they form a new training set that adapts the classifier to the target domain DBLP:conf/igarss/PerselloB11 . A number of studies have already taken advantage of the AL strategy in HSI classification 5764734 ; DBLP:journals/tgrs/PerselloB12 ; Dai:2007:BTL:1273496.1273521 ; Tuia2011 . However, they usually either focus only on the target domain without exploiting the useful information in the source domain, or fail to capture the dataset shift because their active query selection does not fully utilize the domain correlations.
In this paper, we propose a novel framework named multi-kernel learning with active learning (MKLAL for short) for HSI classification, which combines the AL and DA techniques based on multiple kernels and helps compensate for the data shift that occurs between two image acquisitions. The main idea of our method is to retrain a multi-kernel classifier using the labeled samples from the source domain together with the user-labeled samples selected from the target domain, and to repeat the retraining until the classifier converges to the desired level of performance. On the one hand, the AL technique enables us to fully utilize the target information. On the other hand, DA based on multi-kernel learning offers a good measure of the distribution distance across domains, which helps us determine which kernel space best fits the data of the two domains and thus select the informative samples. To illustrate the performance of our MKLAL framework, we choose margin sampling (MS) schohn2000less DBLP:journals/prl/MitraSP04 DBLP:journals/tgrs/DemirPB11 , a simple and commonly used uncertainty criterion, as the active learning strategy DBLP:journals/spm/CampsVallsTBB14 DBLP:journals/pieee/CrawfordTY13 DBLP:journals/tgrs/DemirPB11 for selecting the most informative pixels. In the following, we therefore refer to this instance of MKLAL as multi-kernel learning with margin sampling (MKLMS for short). We conducted extensive experiments on HSI classification over two hyperspectral datasets, and the experimental results demonstrate the effectiveness of the proposed framework.
The rest of this paper is organized as follows. Section 2 reviews the related work. Section 3 presents our framework for hyperspectral image classification and elaborates on its components, including multi-kernel learning, domain adaptation, and the MS active learning strategy. In Section 4, we present and discuss the experimental results. Finally, we conclude in Section 5.
2 Related Works
In the literature, the support vector machine (SVM) has become one of the most successful techniques for hyperspectral image classification Melgani:2004 , mainly because SVMs can deal with high-dimensional and noisy data with the help of a sparse set of support vectors Xu:2017 and powerful kernel tricks Liu:2014:pr . This has been demonstrated in many applications, such as biological problems involving high-dimensional, noisy data, for which SVMs are known to behave well compared to other statistical or machine learning methods Bandyopadhyay:2007 . Therefore, many kernel-based SVM methods have been studied to capture the complex semantic structure of hyperspectral images. Basically, these methods first map data from the original feature space to a high-dimensional kernel space, and then solve a linear classification problem in the mapped space. With the supervised information from many labeled samples, kernel-based solutions have demonstrated excellent accuracy and robustness in hyperspectral data classification Camps:2005 ; Camps:2006 ; Camps:2007 ; Liu:2016 ; Li:2016 .
Most of these existing HSI classification methods assume that labeled samples are available for the domains of interest. However, in practice hyperspectral images often cover different domains, and some of them do not have sufficient labeled samples because it is usually expensive to collect labels for all domains. Training a separate classifier for each domain therefore becomes infeasible. At the same time, due to the spectral shifts among the domains, a model trained in one domain cannot directly fit the other domains. This limits the power of traditional HSI classification methods. To address the problem, domain adaptation (DA) serves as a successful strategy that transfers a well-trained classifier from a source domain to a different yet related target domain Tuia:2016 .
In DA-based HSI classification, both labeled samples from the source domain and unlabeled samples from the target domain are exploited to train a classifier for the target domain. Such a technique has been shown to avoid expensive and time-consuming labelling efforts while achieving satisfying classification performance across domains DBLP:journals/pami/BruzzoneM10 ; zhuo ; DBLP:journals/tgrs/BruzzoneF01 .
There are several ways in DA research to transfer classifiers between the source and target domains. One typical strategy is to make the data distributions more similar across the domains, so that a single model can classify both the source and target domains. zhuo designed a multiple-kernel DA technique to learn a discriminative model by simultaneously minimizing both the SVM structural risk and the distribution divergence. DBLP:journals/tgrs/BruzzoneF01 adapted the maximum-likelihood classifier to the target domain by updating the classifier parameters. Despite the promising progress achieved by DA-based HSI classification methods, their performance still falls short of the desired level without supervised information from the target domain, mainly due to the divergence between the source and target domains.
Since labeled information in the target domain can directly and notably improve the classification performance, exploiting a small number of labeled target samples can boost performance while adding only a limited cost for label collection. Active learning (AL) strategies have been widely studied in recent years to tackle this challenging task Tong:2002 ; Jain:2010 ; Huang:2014 ; Liu:2016:cvpr , with the aim of exploiting the information available in unlabeled data to enrich the labeled data. In the AL process, originally unlabeled samples are selected according to a specific informativeness measure and are then labeled by a user.
Due to its promising performance, many efforts have been devoted to integrating AL into domain adaptation based HSI classification DBLP:conf/igarss/PerselloB11 ; DBLP:journals/tgrs/PerselloB12 ; Tuia2011 ; 6353565 ; 5764734 . DBLP:journals/tgrs/PerselloB12 offered an iterative AL process that adds the most informative samples to the training set while removing the source-domain samples that do not fit the distributions of the classes in the target domain. 6353565 is a framework that efficiently combines the DA and AL techniques: the most informative pixels are sampled with active queries from the target image while the obtained classifier is adapted using a transfer learning strategy. In Tuia2011 , a DA framework equipped with active selection sought training samples in unknown areas using strategies based on uncertainty and clustering. These active learning methods alleviate the cost of collecting labeled data by generating the labels interactively.
3 Active Multi-Kernel Domain Adaptation
First of all, we introduce the notation adopted throughout this paper. Let $D^s = \{x_i^s\}_{i=1}^{n_s}$ be the labeled samples (i.e., pixels) in the source domain with the corresponding labels $\{y_i^s\}_{i=1}^{n_s}$, $D^t = \{x_i^t\}_{i=1}^{n_t}$ be the unlabeled samples in the target domain, and $D^l = \{(x_i, y_i)\}_{i=1}^{n}$ be all the labeled samples from both source and target domains, which will be iteratively updated in our active multi-kernel domain adaptation. Here, we adopt a binary classifier for simplicity and ease of understanding, so each label $y_i \in \{-1, +1\}$. In practical scenarios, this can be extended to the multiclass problem using the one-against-all strategy.
The flowchart in Fig. 1 outlines the general procedure, including the updating of the base kernels and the maximum mean discrepancy. It consists of two parts. The first (the top part) is the active learning for target data labelling, in which the MS selection criterion heuristically updates the labeled dataset from both source and target domains. The second (the bottom part) is the retraining of the multi-kernel classifier based on the adaptation from the source domain to the target domain. In active learning, the most interesting candidates for labelling are the ones that fall within the margin of the current classifier, as they are the most likely to become new support vectors devis1 . The most interesting candidates from the target domain are therefore identified using the margin sampling (MS) strategy. After annotators assign their true labels, these candidates are added to the training data in the target domain to obtain a better MKL classifier.
3.1 Multi-Kernel Learning for HSI Classification
The multiple kernel learning (MKL) framework has proved powerful for support vector machine (SVM) based classification in the literature Vishwanathan:2010 . Therefore, in this paper we employ the MKL technique for HSI classification. Specifically, we seek a decision function of the form $f(x) + b = \sum_{m=1}^{M} f_m(x) + b$, where each function $f_m$ belongs to a different reproducing kernel Hilbert space (RKHS) $\mathcal{H}_m$. Within this functional framework, the well-known SimpleMKL method proposes a weighted 2-norm regularization formulation with an $\ell_1$ constraint on the combination weights, which encourages sparsity simplemkl . Given the combination, it solves a standard SVM optimization problem with a kernel defined as a linear combination of multiple kernels. Supposing the labeled set $D^l = \{(x_i, y_i)\}_{i=1}^{n}$ contains $n$ labeled samples (or pixels), the MKL-based SVM problem can be addressed by solving the following convex problem, which we refer to as the primal MKL problem in SimpleMKL:
$$\min_{\{f_m\},\, b,\, \xi,\, d} \ \frac{1}{2}\sum_{m=1}^{M}\frac{1}{d_m}\|f_m\|_{\mathcal{H}_m}^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\Big(\sum_{m=1}^{M} f_m(x_i) + b\Big) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ \forall i,\quad \sum_{m=1}^{M} d_m = 1,\ \ d_m \ge 0\ \ \forall m \tag{1}$$
where $d = [d_1, \ldots, d_M]^T$ are the linear combination coefficients for the $M$ components, and each $d_m$ controls the contribution of the corresponding component $f_m$ in the objective function. A smaller $d_m$ means a smoother $f_m$, as measured by $\|f_m\|_{\mathcal{H}_m}$ (note that when $d_m = 0$, $\|f_m\|_{\mathcal{H}_m}$ must also be zero for the objective value to be finite). The constraint on $d$ corresponds to the popular $\ell_1$-norm sparsity constraint, which forces some $d_m$ to be zero and thus encourages sparse basis kernel expansions. Since the above formulation is convex and differentiable, the problem can be solved by a simple gradient method simplemkl .
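In SVM-based classification, we usually adopt a classifier based on a linear projection. Specifically, we use the form $f_m(x) = w_m^T \phi_m(x)$, where $w_m$ is the projection vector and $\phi_m$ is the nonlinear feature mapping function that induces the kernel function $k_m$, i.e., $k_m(x_i, x_j) = \phi_m(x_i)^T \phi_m(x_j)$. Based on the linear classifier formulation, the above MKL problem can be rewritten as follows:
$$\min_{\{w_m\},\, b,\, \xi,\, d} \ \frac{1}{2}\sum_{m=1}^{M}\frac{1}{d_m}\|w_m\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\Big(\sum_{m=1}^{M} w_m^T\phi_m(x_i) + b\Big) \ge 1 - \xi_i,\ \ \xi_i \ge 0\ \ \forall i,\quad \sum_{m=1}^{M} d_m = 1,\ \ d_m \ge 0\ \ \forall m \tag{2}$$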
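Such a formulation is quite similar to the standard SVM: given the combination weights $d$, its dual problem can be obtained using the Lagrangian multipliers $\alpha$ satisfying the KKT conditions. According to the theoretical result in simplemkl , the above optimization problem can be turned into its associated dual problem as follows:
$$\min_{d} \ \max_{\alpha} \ \mathbf{1}^T\alpha - \frac{1}{2}(\alpha \circ y)^T\Big(\sum_{m=1}^{M} d_m K_m\Big)(\alpha \circ y) \quad \text{s.t.} \quad \alpha^T y = 0,\ \ 0 \le \alpha_i \le C\ \ \forall i,\quad \sum_{m=1}^{M} d_m = 1,\ \ d_m \ge 0\ \ \forall m \tag{3}$$
where $\alpha = [\alpha_1, \ldots, \alpha_n]^T$, $y = [y_1, \ldots, y_n]^T$, $K_m$ is the kernel matrix defined for the labeled data by $k_m$, and $\circ$ is the Hadamard product operator, which performs the product in an element-wise manner.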
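As a minimal sketch of the quantities appearing in the dual problem, the following NumPy snippet forms the combined kernel as a weighted sum of base kernel matrices and evaluates the SVM dual objective for given dual variables. The function names `combine_kernels` and `dual_objective` and the toy matrices are illustrative assumptions, not part of the original method description; an actual solver would maximize this objective over the dual variables subject to the box and equality constraints.

```python
import numpy as np

def combine_kernels(kernels, d):
    """Return sum_m d_m * K_m for base kernel matrices and simplex weights d."""
    d = np.asarray(d, dtype=float)
    if np.any(d < 0) or not np.isclose(d.sum(), 1.0):
        raise ValueError("weights must be non-negative and sum to 1")
    # Contract the weight vector with the first axis of the (M, n, n) stack.
    return np.tensordot(d, np.asarray(kernels, dtype=float), axes=1)

def dual_objective(alpha, y, K):
    """SVM dual objective: 1^T alpha - 0.5 * (alpha∘y)^T K (alpha∘y)."""
    alpha = np.asarray(alpha, dtype=float)
    ay = alpha * np.asarray(y, dtype=float)  # Hadamard product alpha∘y
    return float(alpha.sum() - 0.5 * ay @ K @ ay)

# Toy example with two base kernels on two samples (values are made up).
K1 = np.eye(2)
K2 = np.ones((2, 2))
K = combine_kernels([K1, K2], [0.25, 0.75])
value = dual_objective([1.0, 1.0], [1, -1], K)
```

Keeping the weights on the simplex makes the combined matrix a convex combination of the base kernels, so it remains positive semidefinite whenever every base kernel is.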
3.2 Domain Adaptation based on MKL
In our MKL framework, besides the label information from the training samples, we further consider the correlation between the source and target domains for HSI classification, and incorporate the domain adaptation technique into the MKL-based classification. Intuitively, since we have more training data from the source domain than from the target domain, it is natural to seek the best transfer between the two domains. Therefore, we attempt to reduce the distribution discrepancy when we transfer from the source domain to the target one.
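To address this issue, we follow duan1 and extend the SimpleMKL formulation with the Maximum Mean Discrepancy (MMD). The MMD measures the mismatch as the distance between the means of the samples from the source domain and from the target domain in an RKHS, namely,
$$\mathrm{MMD}^2(D^s, D^t) = \Big\| \frac{1}{n_s}\sum_{i=1}^{n_s}\phi(x_i^s) - \frac{1}{n_t}\sum_{i=1}^{n_t}\phi(x_i^t) \Big\|^2 \tag{4}$$
where $n_s$ and $n_t$ are the numbers of source and target samples, respectively. If we define the $(n_s + n_t) \times (n_s + n_t)$ matrix $S$:
$$S_{ij} = \begin{cases} \dfrac{1}{n_s^2}, & x_i, x_j \in D^s \\ \dfrac{1}{n_t^2}, & x_i, x_j \in D^t \\ -\dfrac{1}{n_s n_t}, & \text{otherwise} \end{cases} \tag{5}$$
then $\mathrm{MMD}^2(D^s, D^t)$ will turn to
$$\mathrm{MMD}^2(D^s, D^t) = \mathrm{tr}(K S) \tag{6}$$
where $K = \sum_{m=1}^{M} d_m K_m$, with each kernel component $K_m$ induced from $\phi_m$ defined as
$$K_m = \begin{bmatrix} K_m^{s,s} & K_m^{s,t} \\ (K_m^{s,t})^T & K_m^{t,t} \end{bmatrix} \tag{7}$$
where $K_m^{s,s}$, $K_m^{t,t}$, and $K_m^{s,t}$ are the kernel matrices defined for the source domain, the target domain, and the cross domain from the source domain to the target domain, respectively. The above formulation indicates that MMD serves as a good measure of the distribution discrepancy in kernel space. For more details about MMD, readers can refer to Borgwardt MMD .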
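The trace form of the MMD can be checked numerically. The sketch below, a minimal illustration rather than the authors' implementation, builds the weighting matrix as an outer product of a vector holding 1/n_s for source samples and -1/n_t for target samples, and computes the squared MMD from a joint kernel matrix whose rows are ordered source first, then target. The function names and the toy data are assumptions made for the example.

```python
import numpy as np

def mmd_weight_vector(n_s, n_t):
    """Vector s with 1/n_s for source rows and -1/n_t for target rows."""
    return np.concatenate([np.full(n_s, 1.0 / n_s), np.full(n_t, -1.0 / n_t)])

def mmd_squared(K, n_s, n_t):
    """Squared MMD as s^T K s, which equals tr(K S) with S = s s^T."""
    s = mmd_weight_vector(n_s, n_t)
    return float(s @ K @ s)

# Toy check with a linear kernel: source points {0, 2}, target point {3}.
X = np.array([[0.0], [2.0], [3.0]])
K = X @ X.T
s = mmd_weight_vector(2, 1)
S = np.outer(s, s)
mmd_direct = mmd_squared(K, 2, 1)
mmd_trace = float(np.trace(K @ S))
```

With a linear kernel the feature map is the identity, so the result is simply the squared distance between the two sample means (here 1 and 3). Computing `s @ K @ s` avoids forming the full matrix product needed for the trace.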
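With the discrepancy measurement MMD, we can now present the final formulation for DA based on MKL, following duan1 , as:
$$\min_{d} \ \max_{\alpha} \ \frac{1}{2}\big(d^T p\big)^2 + \theta\Big(\mathbf{1}^T\alpha - \frac{1}{2}(\alpha \circ y)^T\Big(\sum_{m=1}^{M} d_m K_m\Big)(\alpha \circ y)\Big) \quad \text{s.t.} \quad \alpha^T y = 0,\ \ 0 \le \alpha_i \le C\ \ \forall i,\quad \sum_{m=1}^{M} d_m = 1,\ \ d_m \ge 0\ \ \forall m \tag{8}$$
where $p = [p_1, \ldots, p_M]^T$ with $p_m = \mathrm{tr}(K_m S)$ computed on the source and target samples, so that $d^T p$ is the MMD of the combined kernel; the $K_m$ in the second term is the kernel matrix on the labeled data, and $\theta$ is a positive parameter that controls the balance between the domain adaptation and the classification accuracy. Such a formulation forces the learnt classifier to simultaneously predict the labels accurately and preserve the semantic relations across domains.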
By introducing domain adaptation, we can explore the available knowledge on a given source domain to develop a classifier built on the target domain where a priori information is not available. This will significantly reduce the heavy requirement of labeled samples in the target domain. Subsequently, our work can enjoy the capability of transferring the model to different target domains easily.
3.3 Alternating Optimization
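There are two variables involved in the above formulation, i.e., the variable $d$ for the MKL in domain adaptation, and the variable $\alpha$ for the classifier. We employ the reduced gradient descent procedure proposed in simplemkl to iteratively update the linear combination coefficients $d$ and the dual variable $\alpha$. The optimization consists of two main steps, each of which can be solved using simple and efficient existing techniques.
$\alpha$ step: with $d$ fixed, problem (8) reduces with respect to $\alpha$ to a standard SVM dual problem with the combined kernel $\sum_{m} d_m K_m$, which can be solved by an existing SVM solver.
$d$ step: with the learnt $\alpha$, we can rewrite problem (8) with respect to $d$ as follows
$$\min_{d} \ \frac{1}{2} d^T p\, p^T d - \theta\, g^T d \quad \text{s.t.} \quad \sum_{m=1}^{M} d_m = 1,\ \ d_m \ge 0\ \ \forall m \tag{9}$$
with $p_m = \mathrm{tr}(K_m S)$ and $g = [g_1, \ldots, g_M]^T$, $g_m = \frac{1}{2}(\alpha \circ y)^T K_m (\alpha \circ y)$. Since the above problem is convex with respect to $d$, we can directly apply the second-order gradient descent method to solve it.
To obtain the optimal solution, we alternately update the linear coefficients $d$ and the dual variable $\alpha$ for a few iterations. With these parameters, the final classifier based on the multiple kernels can be formulated as follows
$$f(x) = \sum_{i=1}^{n} \alpha_i y_i \sum_{m=1}^{M} d_m k_m(x_i, x) + b \tag{10}$$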
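To make the weight-update step concrete, here is a hedged sketch that assumes the kernel-weight subproblem has the form of minimizing 0.5 (p^T d)^2 - theta g^T d over the probability simplex, where p collects the per-kernel MMD values and g collects the per-kernel quadratic terms of the dual. It uses plain projected gradient descent as a simpler stand-in for the reduced-gradient and second-order updates mentioned in the text; the function names, step size, and toy inputs are all assumptions for illustration.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {d : d >= 0, sum(d) = 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # sort in descending order
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u * k > css - 1.0)[0][-1]
    tau = (css[rho] - 1.0) / (rho + 1)        # shift that makes the result sum to 1
    return np.maximum(v - tau, 0.0)

def d_step(p, g, theta, d0, lr=0.1, n_iter=200):
    """Projected gradient descent on 0.5*(p^T d)^2 - theta*g^T d (assumed form)."""
    p = np.asarray(p, dtype=float)
    g = np.asarray(g, dtype=float)
    d = project_simplex(d0)
    for _ in range(n_iter):
        grad = p * (p @ d) - theta * g        # gradient of the assumed objective
        d = project_simplex(d - lr * grad)
    return d

# Toy case: kernel 0 has a large MMD (p=1), kernel 1 has none (p=0), and the
# classification term is switched off (g=0), so all weight moves to kernel 1.
d_opt = d_step(p=[1.0, 0.0], g=[0.0, 0.0], theta=1.0, d0=[0.5, 0.5], lr=0.5)
```

In a full alternating scheme this step would be interleaved with an SVM solve on the combined kernel to refresh the dual variables, and `g` would be recomputed from them before each weight update.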
3.4 Active Learning
The domain adaptation based on MKL helps fully utilize the training data from different domains to obtain a discriminative classifier for HSI. However, as mentioned above, there exists some divergence between the source and the target domains, which limits the power of the domain adaptation. To exploit as much semantic information from the target domain as possible while relying on as few labeled samples as possible, active learning serves as a promising solution that improves the classification performance with a small set of selected labeled samples Tong:2002 ; Jain:2010 ; Liu:2016:cvpr ; Samat:2016 .
Specifically, in our proposed framework we alternate between the training and active learning stages: the active learning stage selects the most informative samples from the target domain, labels them, and adds them to the training set $D^l$, and the training stage then retrains the MKL classifier using the updated dataset, which now contains more labeled samples from the target domain.
In active learning, the selection criterion is quite important for the classification performance. In our framework we adopt the SVM model for HSI classification, which basically seeks the hyperplane that gives the maximum margin. Therefore, in a one-against-all setting for multiclass problems devis2 , a margin sampling (MS) strategy is employed to heuristically select the best $q$ points (i.e., the points closest to the hyperplane of the classifier learnt in the last iteration) from the remaining unlabeled dataset $D^t$ according to the following ranking criterion for each candidate $x$:
$$\hat{x} = \arg\min_{x \in D^t} \ \min_{c} \ |f_c(x)| \tag{11}$$
Here $f_c(x)$ is the distance of the sample $x$ to the hyperplane defined for class $c$, computed with its dual variables $\alpha$ and the training data in $D^l$. Note that when $n_t$ is very large, i.e., there are a huge number of unlabeled samples in the target domain, the above ranking will be quite time-consuming due to the expensive computation of the distances for all samples. In this case, we can employ existing speedup techniques such as the popular locality sensitive hashing Jain:2010 ; Liu:2016:cvpr .
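The ranking rule of margin sampling can be sketched in a few lines. The snippet below assumes the one-against-all decision values of the current classifier have already been computed for every unlabeled target sample and stored in an array of shape (number of candidates, number of classes); the function name `margin_sampling` and the toy values are illustrative assumptions.

```python
import numpy as np

def margin_sampling(decision_values, q):
    """Return indices of the q candidates closest to any class hyperplane.

    decision_values: array of shape (n_candidates, n_classes) holding the
    signed one-against-all outputs f_c(x) of the current classifier.
    """
    scores = np.asarray(decision_values, dtype=float)
    # Distance to the nearest hyperplane over all classes.
    margins = np.min(np.abs(scores), axis=1)
    # Smallest margins first; a stable sort keeps ties in input order.
    order = np.argsort(margins, kind="stable")
    return order[:q]

# Four unlabeled candidates, two one-against-all classifiers (made-up values).
f = np.array([[2.0, -3.0],
              [0.1, -1.0],
              [-0.5, 4.0],
              [1.0, 1.0]])
picked = margin_sampling(f, q=2)
```

The selected indices would then be sent to the annotator, moved from the unlabeled pool to the labeled set, and the classifier retrained before the next query round.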
Algorithm 1 lists the detailed procedures of our proposed active MKLMS framework.
4 Experiments
4.1 Settings and Protocol
In the experiments, we employ two hyperspectral datasets to evaluate the proposed method, which adapts the multi-kernel classifier trained on the area S (considered as the source domain) to the spatially separate area T (considered as the target domain). The first dataset is Pavia center (northern Italy) containing pixels. Fig. 2(a) shows the ground-truth map with the five classes of interest (water, trees, meadow, soil and tile) available for the scene, displayed in the form of a class assignment for each labeled pixel. These classes have been included in a labeled data set of 126,069 samples extracted by visual inspection. The second hyperspectral dataset is the university area, whose image size is in pixels. Similar to the Pavia center, it is also divided into nine classes. Fig. 2(b) presents the five reference classes of interest, i.e., asphalt, meadows, gravel, trees, and bricks. In total, 34,125 labeled samples are available for the university area scene.
On each dataset, 20 samples of each class (i.e., 100 samples in total for five classes) were randomly selected as the initial training set to obtain a multi-kernel classifier. To suppress the randomness in the evaluation, all the results are averaged over ten runs, i.e., we sample ten different initial training sets from the source domain. The active learning process runs with two settings: selecting and adding $q=10$ and $q=20$ pixels into the training set per iteration. For these settings, we respectively repeat 40 and 30 iterations for the Pavia center image, and 30 for the University area in all cases.
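As to the multi-kernel classifier, we follow the common practice for multiclass classification, in which one-versus-all classifiers are trained with four types of kernels (i.e., $M = 4$): the Gaussian kernel $k(x_i, x_j) = \exp(-\gamma D^2(x_i, x_j))$, the Laplacian kernel $k(x_i, x_j) = \exp(-\sqrt{\gamma}\, D(x_i, x_j))$, the inverse square distance kernel $k(x_i, x_j) = 1 / (\gamma D^2(x_i, x_j) + 1)$, and the inverse distance kernel $k(x_i, x_j) = 1 / (\sqrt{\gamma}\, D(x_i, x_j) + 1)$, where $D(x_i, x_j)$ denotes the distance between $x_i$ and $x_j$. Here $\gamma$ represents the kernel parameter, for which we use the default value $\gamma_0 = 1/A$, where $A$ is the mean value of the squared distances between all training samples.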
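A minimal sketch of the four base kernels is given below. The closed forms are an assumption: they follow the commonly used definitions of the Gaussian, Laplacian, inverse square distance, and inverse distance kernels with Euclidean distance, and the default parameter is taken as the reciprocal of the mean squared distance over distinct sample pairs. The function names and the two-point toy data are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(X, Y):
    """Squared Euclidean distances between rows of X and rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    sq = (X ** 2).sum(1)[:, None] + (Y ** 2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off

def default_gamma(X):
    """Assumed default: 1 / mean squared distance over distinct sample pairs."""
    D2 = pairwise_sq_dists(X, X)
    upper = np.triu_indices(len(D2), k=1)  # pairs (i, j) with i < j
    return 1.0 / D2[upper].mean()

def base_kernels(X, Y, gamma):
    """Return the four base kernel matrices as a dict (assumed forms)."""
    D2 = pairwise_sq_dists(X, Y)
    D = np.sqrt(D2)
    root = np.sqrt(gamma)
    return {
        "gaussian": np.exp(-gamma * D2),
        "laplacian": np.exp(-root * D),
        "inverse_square_distance": 1.0 / (gamma * D2 + 1.0),
        "inverse_distance": 1.0 / (root * D + 1.0),
    }

# Two 1-D samples at distance 1, so the default gamma equals 1.
X = np.array([[0.0], [1.0]])
gamma0 = default_gamma(X)
kernels = base_kernels(X, X, gamma0)
```

All four kernels equal 1 on the diagonal and decay with distance, so they differ mainly in how heavy their tails are, which is what lets the learnt weights pick the kernel space that best matches the two domains.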
We compare our method with classic active learning methods using different heuristics, namely margin sampling (MS), random sampling (RS), and MKLRS (MKL with the random sampling heuristic), in terms of the classification accuracy (%) and the Kappa statistic. The Kappa statistic is a metric that compares an observed accuracy with an expected accuracy (random chance) Viera:2005 . Because it takes into account random chance (agreement with a random classifier), it is generally a more robust measure than a simple percent agreement calculation. We adopt it for a comprehensive evaluation of the proposed method. For these active learning methods, a small set of labeled data in the source domain together with the sequentially selected samples in the target domain is treated as the training data. In addition, we adopt traditional methods without an active learning process (non-AL for short) as baselines, namely the single kernel version of DTMKL (SKV), the classic SVM, and domain transfer multi-kernel learning (DTMKL) zhuo , which utilize all the labeled samples in the source domain to train the classifier for the target domain. In the single-kernel methods we simply choose the Gaussian kernel as the default one. For the SVM-based methods, the popular LIBSVM is applied in the experiments, and the optimal SVM parameters are found using grid search in a ten-fold cross-validation manner, where the parameter and .
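For readers unfamiliar with the Kappa statistic, the following sketch computes Cohen's kappa from a confusion matrix: the observed agreement is the fraction of correctly classified samples, and the expected agreement is the chance agreement implied by the row and column marginals. The function name and the toy labels are illustrative assumptions; class labels are assumed to be integers starting at 0.

```python
import numpy as np

def cohen_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) from integer class labels."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true, y_pred), 1)        # confusion matrix counts
    n = cm.sum()
    p_o = np.trace(cm) / n                    # observed accuracy
    p_e = (cm.sum(axis=1) @ cm.sum(axis=0)) / n ** 2  # chance agreement
    if np.isclose(p_e, 1.0):
        return 1.0                            # degenerate single-class case
    return float((p_o - p_e) / (1.0 - p_e))

# Three of four predictions are correct, but half of that is expected by chance.
kappa = cohen_kappa([0, 0, 1, 1], [0, 0, 1, 0], n_classes=2)
```

In this toy case the overall accuracy is 0.75 while kappa is 0.5, which illustrates why kappa is reported alongside OA: it discounts the agreement a random labelling with the same marginals would already achieve.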
Method  # target samples
        0       40      100     200
RS      47.770  86.200  90.445  93.865
MS      47.770  87.405  95.685  96.255
MKLRS   47.930  92.250  94.275  94.655
MKLMS   47.945  94.765  98.610  98.235
Method  # source/target samples  OA              Kappa
SKV     2723 / 0                 46.550          0.358
DTMKL   2723 / 0                 46.850          0.360
SVM     2723 / 0                 49.750          0.388
RS      100 / 300                92.220 ± 5.288  0.882 ± 0.085
MS      100 / 300                96.140 ± 2.427  0.943 ± 0.036
MKLRS   100 / 300                92.960 ± 2.383  0.897 ± 0.035
MKLMS   100 / 300                97.930 ± 0.658  0.970 ± 0.010
region size  Method  # source/target samples  OA              Kappa
3842         MKLRS   80 / 300                 93.244 ± 0.215  0.914 ± 0.008
3842         MKLRS   100 / 300                92.825 ± 0.153  0.922 ± 0.013
3842         MKLMS   80 / 300                 98.301 ± 0.322  0.975 ± 0.010
3842         MKLMS   100 / 300                97.467 ± 0.235  0.963 ± 0.002
5338         MKLRS   80 / 300                 90.132 ± 0.056  0.903 ± 0.006
5338         MKLRS   100 / 300                90.098 ± 0.028  0.892 ± 0.016
5338         MKLMS   80 / 300                 94.200 ± 0.032  0.923 ± 0.005
5338         MKLMS   100 / 300                94.067 ± 0.005  0.912 ± 0.016
4.2 Results and Analysis
4.2.1 Experiments on Pavia Center Dataset
Fig. 4(a) plots the overall accuracy (OA) of the classification obtained by four different AL methods (RS, MS, MKLRS, and MKLMS) with respect to the number of labeled training samples for the Pavia center scene. From the figure, we first notice that when the training set is very small, the proposed MKLMS provides a lower OA than MKLRS. But as the number of training samples increases, e.g., to 140 at the fourth iteration, MKLMS substantially improves its classification accuracy and achieves the best performance among all methods. We can also notice that MKLMS converges quickly, after about 10 iterations with 100 target-domain pixels selected, where its OA curve (in red) reaches 97.4% (a gain of about 4.6% over MKLRS). This performance is never reached by the three baseline approaches during the 40 active learning iterations.
Table 3 further investigates the OA for different numbers of samples added during the AL process. As expected, adding more samples leads to a higher classification accuracy. Moreover, the proposed MKLMS behaves better than the other approaches, with an obvious performance gap. Table 2 reports the results of three non-AL methods and four active learning based methods, in terms of OA, Kappa, and their standard deviations. Here, the non-AL algorithms (SKV, classic SVM and DTMKL) utilize the whole source domain (i.e., 2723 samples in Pavia Center) as the training set to adapt the learnt model to the target domain. We can notice that even with many more training samples, the non-AL methods cannot outperform the AL strategies with the DA technique, and do not even reach 50% OA in Table 2. This is because there exists a huge distribution variation between the source and the target domains. Instead, our proposed method MKLMS incorporates active learning to select the most informative samples and reduce the distribution discrepancy, and thus reaches 97.4% OA with 400 samples in total, including 100 samples from the source domain and 300 samples from the target domain. This observation clearly verifies the effectiveness of our strategy combining MKL and AL.
To evaluate the performance comprehensively, we randomly select more and different source regions from the two datasets and keep the same target image as in the prior experiments. The results further confirm that our proposed method with the MS strategy (MKLMS) again obtains the best performance. This means that there exists an obvious domain shift, and that our method can capture the correlations, select the most informative samples for a better classifier, and consistently achieve the best performance. We also compare our method with the classification methods SKV, DTMKL, and SVM. With the help of the samples from the target domain, RS, MS, MKLRS, and MKLMS significantly improve the performance. Moreover, by considering the domain shift, MKLMS obtains the best performance in most cases. This further confirms that our method not only leverages the information from the target domain, but also exploits the information from the source domain, and thus shows robust and better performance in practice.
To further investigate the effect of the source region size, we gradually enlarge the source region from 3842 samples to 5338 samples on Pavia Center, and keep the target region unchanged. Fig. 3 illustrates the region growth, and Table 3 lists the performance using different numbers of labeled samples in our framework, where we can see that MKLMS consistently outperforms MKLRS and achieves the best performance compared to the other baselines (in Table 2) in all cases.
Method  # target samples (q=20)
        0       40      100     200
RS      62.290  87.385  89.955  91.030
MS      62.290  83.600  88.910  90.785
MKLRS   48.575  88.625  90.320  91.850
MKLMS   42.525  89.830  92.625  96.925
Method  # source/target samples  OA               Kappa
SKV     1907 / 0                 48.350           0.3385
DTMKL   1907 / 0                 44.800           0.298
SVM     1907 / 0                 64.650           0.526
RS      100 / 300                91.430 ± 0.4264  0.871 ± 0.006
MS      100 / 300                91.610 ± 0.5592  0.873 ± 0.009
MKLRS   100 / 300                93.480 ± 1.214   0.901 ± 0.019
MKLMS   100 / 300                97.070 ± 0.343   0.956 ± 0.005
region size  Method  # source/target samples  OA              Kappa
2842         MKLRS   100 / 300                94.825 ± 0.825  0.922 ± 0.013
2842         MKLRS   200 / 300                94.553 ± 0.157  0.918 ± 0.018
2842         MKLMS   100 / 300                97.250 ± 0.235  0.956 ± 0.003
2842         MKLMS   200 / 300                96.950 ± 0.453  0.952 ± 0.004
4623         MKLRS   100 / 300                93.550 ± 1.55   0.902 ± 0.024
4623         MKLRS   200 / 300                93.950 ± 0.05   0.909 ± 0.001
4623         MKLMS   100 / 300                96.879 ± 0.225  0.953 ± 0.004
4623         MKLMS   200 / 300                96.825 ± 0.225  0.952 ± 0.003
4.2.2 Experiments on University Area Dataset
To comprehensively evaluate the proposed method, in Fig. 4(b) we further display the learning curves obtained on the University Area dataset using the four active learning methods. The proposed MKLMS algorithm clearly obtains the best results, which demonstrates its advantage. Similar to the results on Pavia center, at the beginning the proposed MKLMS lies 15% below the traditional methods (MS and RS) without adaptation. However, after three active iterations the MKLMS algorithm reaches nearly 90% OA. At the same time, it can be observed that MKLRS also achieves a promising learning curve, which is the closest to that of MKLMS among all methods. The remaining gap between the two confirms that selecting samples near the decision boundary helps boost the classification performance and significantly outperforms random sampling. In this figure, all methods converge quickly after adding 150 samples from the target domain, and we can see that using $q=20$ we achieve much more efficiency, with fewer active iterations, than using $q=10$. Moreover, in all cases our proposed method MKLMS achieves the best performance, i.e., nearly 97.0% OA, compared with only about 95.0% using MKLRS and 91.5% using MS and RS.
In Table 4, we also list the classification accuracy of the four methods when varying the number (ranging from 0 to 200) of samples selected from the target domain. From the table, we can see that MKLMS sustains the highest OA once target samples are added (from 40 samples onward), although it starts below the other methods before any target sample is labeled. Table 5 further reports the OA and Kappa results of three non-AL and four AL classification methods. For the non-AL algorithms, we still use the whole source domain (i.e., 1907 samples in University Area) as the training set to train the classifier. By comparing the basic classification methods and the active learning methods with DA, we can see that there exists a significant domain shift between the source and target domains. Without considering the divergence between the source and target domains, all the non-AL methods that directly transfer the domain knowledge obtain unsatisfying performance in both OA and Kappa. Owing to the AL strategy, our proposed approach MKLMS relies on only a few training samples and obtains promising results. For example, in Fig. 4(b) with $q=10$ it takes only 15 iterations (150 samples are labeled and added into the training set) to converge at the level of 96.9% accuracy, while with $q=20$, 140 samples are selected in total. By comparing its performance to the three non-AL classification methods, we can conclude that the active learning strategy is very helpful for discriminative classifier learning when there exists a large distribution deviation from the source domain to the target domain. In Table 6 we also investigate the performance with respect to different source region sizes, and reach the similar conclusion that our proposed method robustly achieves promising performance in different scenarios.
5 Conclusion
This paper presents MKLAL, a novel active learning framework based on domain adaptation for hyperspectral image (HSI) classification. It fully utilizes the label information from the auxiliary domains by compensating for domain distribution shifts in a sequential active learning manner, which not only significantly boosts the classification accuracy but also saves expensive computation and human labelling effort. Extensive experiments on two popular HSI datasets demonstrate that when a large bias exists between the source and target domains, a conventional DA classifier cannot guarantee satisfactory performance. Instead, by actively expanding the training set with little labelling effort, our method efficiently improves the classification accuracy.
6 Acknowledgement
This work was supported by the National Natural Science Foundation of China (61402026 and 61572388), the Key R&D Program - The Key Industry Innovation Chain of Shaanxi (Grant No. 2017ZDCXLGY050402 and Grant No. 2017ZDCXLGY0502), the Beijing Municipal Science and Technology Commission (Z171100000117022), and the Foundation of State Key Lab of Software Development Environment (SKLSDE2016ZX04).
References
 (1) N. Alajlan, E. Pasolli, F. Melgani, A. Franzoso, Large-scale image classification using active learning, IEEE Geosci. Remote Sensing Lett. 11 (1) (2014) 259–263.
 (2) N. Li, H. Zhao, P. Huang, G. Jia, X. Bai, A novel logistic multiclass supervised classification model based on multifractal spectrum parameters for hyperspectral data, Int. J. Comput. Math. 92 (4) (2015) 836–849.
 (3) Z. Li, C. Li, C. Deng, J. Li, Hyperspectral image super-resolution using sparse spectral unmixing and low-rank constraints, in: IEEE IGARSS, 2016, pp. 7224–7227.
 (4) L. Tong, J. Zhou, Y. Qian, X. Bai, Y. Gao, Nonnegative-matrix-factorization-based hyperspectral unmixing with partially known endmembers, IEEE Trans. Geoscience and Remote Sensing 54 (11) (2016) 6531–6544.
 (5) Y. Tang, E. Fan, C. Yan, X. Bai, J. Zhou, Discriminative weighted band selection via one-class SVM for hyperspectral imagery, in: IEEE IGARSS, 2016, pp. 2765–2768.
 (6) Hyperspectral image reconstruction by deep convolutional neural network for classification, Pattern Recognition 63 (2017) 371–383.
 (7) J. Plaza, A. Plaza, R. Perez, P. Martinez, On the use of small training sets for neural network-based characterization of mixed pixels in remotely sensed hyperspectral images, Pattern Recognition 42 (11) (2009) 3032–3045.
 (8) Y. Tarabalka, J. Chanussot, J. Benediktsson, Segmentation and classification of hyperspectral images using watershed transformation, Pattern Recognition 43 (7) (2010) 2367–2379.
 (9) W. Li, S. Prasad, J. E. Fowler, Hyperspectral image classification using gaussian mixture models and markov random fields, IEEE Geoscience and Remote Sensing Letters 11 (1) (2014) 153–157.
 (10) A. Samat, J. Li, S. Liu, P. Du, Z. Miao, J. Luo, Improved hyperspectral image classification by active learning using predesigned mixed pixels, Pattern Recognition 51 (2016) 43–58.
 (11) M. Ye, Y. Qian, J. Zhou, Y. Tang, Dictionary learning based feature level domain adaptation for cross-scene hyperspectral image classification, IEEE Trans. Geoscience and Remote Sensing (2016) 1544–1562.
 (12) C. Yan, X. Bai, P. Ren, L. Bai, W. Tang, J. Zhou, Band weighting via maximizing interclass distance for hyperspectral image classification, IEEE Geosci. Remote Sensing Lett. 13 (7) (2016) 922–925.
 (13) E. Zhang, X. Zhang, L. Jiao, L. Li, B. Hou, Spectral–spatial hyperspectral image ensemble classification via joint sparse representation, Pattern Recognition 59 (2016) 42–54.
 (14) Probabilistic class structure regularized sparse representation graph for semi-supervised hyperspectral image classification, Pattern Recognition 63 (2017) 102–114.
 (15) A novel spectral-spatial co-training algorithm for the transductive classification of hyperspectral imagery data, Pattern Recognition 63 (2017) 229–245.
 (16) A. Ifarraguerri, C.-I. Chang, Unsupervised hyperspectral image analysis with projection pursuit, IEEE Transactions on Geoscience and Remote Sensing 38 (6) (2000) 2529–2538.
 (17) S. Rajan, J. Ghosh, M. M. Crawford, An active learning approach to hyperspectral data classification, IEEE T. Geoscience and Remote Sensing 46 (4) (2008) 1231–1242.
 (18) C. Chen, S. Li, H. Qin, A. Hao, Structure-sensitive saliency detection via multilevel rank analysis in intrinsic feature space, IEEE Transactions on Image Processing 24 (8) (2015) 2303–2316.
 (19) L. Bruzzone, M. Marconcini, Domain adaptation problems: A DASVM classification technique and a circular validation strategy, IEEE Trans. Pattern Anal. Mach. Intell. 32 (5) (2010) 770–787.
 (20) Z. Sun, C. Wang, H. Wang, J. Li, Learn multiple-kernel svms for domain adaptation in hyperspectral data, IEEE Geosci. Remote Sensing Lett. 10 (5) (2013) 1224–1228.
 (21) L. Bruzzone, D. Fernández-Prieto, Unsupervised retraining of a maximum likelihood classifier for the analysis of multitemporal remote sensing images, IEEE T. Geoscience and Remote Sensing 39 (2) (2001) 456–460.
 (22) C. Persello, L. Bruzzone, A novel active learning strategy for domain adaptation in the classification of remote sensing images, in: 2011 IEEE International Geoscience and Remote Sensing Symposium, 2011, pp. 3720–3723.
 (23) D. Tuia, E. Pasolli, W. Emery, Dataset shift adaptation with active queries, in: Urban Remote Sensing Event, 2011, pp. 121–124.
 (24) C. Persello, L. Bruzzone, Active learning for domain adaptation in the supervised classification of remote sensing images, IEEE T. Geoscience and Remote Sensing 50 (11) (2012) 4468–4483.
 (25) W. Dai, Q. Yang, G.-R. Xue, Y. Yu, Boosting for transfer learning, in: Proceedings of the 24th International Conference on Machine Learning, ICML ’07, ACM, New York, NY, USA, 2007, pp. 193–200.
 (26) D. Tuia, E. Pasolli, W. Emery, Using active learning to adapt remote sensing image classifiers, Remote Sensing of Environment 115 (9) (2011) 2232–2242.
 (27) G. Schohn, D. Cohn, Less is more: Active learning with support vector machines, in: ICML, 2000, pp. 839–846.
 (28) P. Mitra, B. U. Shankar, S. K. Pal, Segmentation of multispectral remote sensing images using active support vector machines, Pattern Recognition Letters 25 (9) (2004) 1067–1074.
 (29) B. Demir, C. Persello, L. Bruzzone, Batch-mode active-learning methods for the interactive classification of remote sensing images, IEEE T. Geoscience and Remote Sensing 49 (3) (2011) 1014–1031.
 (30) G. Camps-Valls, D. Tuia, L. Bruzzone, J. A. Benediktsson, Advances in hyperspectral image classification: Earth monitoring with statistical learning methods, IEEE Signal Process. Mag. 31 (1) (2014) 45–54.
 (31) M. M. Crawford, D. Tuia, H. L. Yang, Active learning: Any value for classification of remotely sensed data?, Proceedings of the IEEE 101 (3) (2013) 593–608.
 (32) F. Melgani, L. Bruzzone, Classification of hyperspectral remote sensing images with support vector machines, IEEE Transactions on Geoscience and Remote Sensing 42 (8) (2004) 1778–1790.
 (33) J. Xu, X. Liu, Z. Huo, C. Deng, F. Nie, H. Huang, Multiclass support vector machine via maximizing multiclass margins, in: IJCAI, 2017, pp. 3154–3160.
 (34) X. Liu, J. He, B. Lang, Multiple feature kernel hashing for large-scale visual search, Pattern Recognition 47 (2) (2014) 748–757.
 (35) S. Bandyopadhyay, S. Bandyopadhyay, Analysis of Biological Data: A Soft Computing Approach  Vol. 3, World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2007.
 (36) G. Camps-Valls, L. Bruzzone, Kernel-based methods for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 43 (6) (2005) 1351–1362.
 (37) G. Camps-Valls, L. Gomez-Chova, J. Munoz-Mari, J. Vila-Frances, J. Calpe-Maravilla, Composite kernels for hyperspectral image classification, IEEE Geoscience and Remote Sensing Letters 3 (1) (2006) 93–97.
 (38) G. Camps-Valls, T. V. B. Marsheva, D. Zhou, Semi-supervised graph-based hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 45 (10) (2007) 3044–3054.
 (39) T. Liu, Y. Gu, X. Jia, J. A. Benediktsson, J. Chanussot, Class-specific sparse multiple kernel learning for spectral-spatial hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 54 (12) (2016) 7351–7365.
 (40) D. Tuia, C. Persello, L. Bruzzone, Domain adaptation for the classification of remote sensing data: An overview of recent advances, IEEE Geoscience and Remote Sensing Magazine 4 (2) (2016) 41–57.
 (41) S. Tong, D. Koller, Support vector machine active learning with applications to text classification, J. Mach. Learn. Res. 2 (2002) 45–66.
 (42) P. Jain, S. Vijayanarasimhan, K. Grauman, Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning, in: Advances in Neural Information Processing Systems, 2010, pp. 928–936.
 (43) L. Huang, Y. Liu, X. Liu, X. Wang, B. Lang, Graph-based active semi-supervised learning: A new perspective for relieving multi-class annotation labor, in: IEEE ICME, 2014, pp. 1–6.
 (44) X. Liu, X. Fan, C. Deng, Z. Li, H. Su, D. Tao, Multilinear hyperplane hashing, in: IEEE CVPR, 2016, pp. 1–9.
 (45) G. Matasci, D. Tuia, M. Kanevski, SVM-based boosting of active learning strategies for efficient domain adaptation, Selected Topics in Applied Earth Observations and Remote Sensing, IEEE Journal of 5 (5) (2012) 1335–1343.
 (46) D. Tuia, F. Ratle, F. Pacifici, M. F. Kanevski, W. J. Emery, Active learning methods for remote sensing image classification, IEEE T. Geoscience and Remote Sensing 47 (7) (2009) 2218–2232.
 (47) S. V. N. Vishwanathan, Z. Sun, N. Theera-Ampornpunt, M. Varma, Multiple kernel learning and the smo algorithm, in: Proceedings of the 23rd International Conference on Neural Information Processing Systems, 2010, pp. 2361–2369.
 (48) A. Rakotomamonjy, F. Bach, S. Canu, Y. Grandvalet, Simplemkl, Journal of Machine Learning Research 9 (2008) 2491–2521.
 (49) L. Duan, D. Xu, I. W. Tsang, J. Luo, Visual event recognition in videos by learning from web data, IEEE Trans. Pattern Anal. Mach. Intell. 34 (9) (2012) 1667–1680.
 (50) K. M. Borgwardt, A. Gretton, M. J. Rasch, H. Kriegel, B. Schölkopf, A. J. Smola, Integrating structured biological data by kernel maximum mean discrepancy, in: Proceedings 14th International Conference on Intelligent Systems for Molecular Biology, 2006, pp. 49–57.
 (51) D. Tuia, M. Volpi, L. Copa, M. F. Kanevski, J. Muñoz-Marí, A survey of active learning algorithms for supervised remote sensing image classification, J. Sel. Topics Signal Processing 5 (3) (2011) 606–617.
 (52) A. Viera, J. Garrett, Understanding interobserver agreement: The kappa statistic, Family Medicine 37 (5) (2005) 360–363.