Deep Manifold Embedding for Hyperspectral Image Classification

12/24/2019 ∙ by Zhiqiang Gong, et al. ∙ 14

Deep learning methods have played an increasingly important role in hyperspectral image classification. However, general deep learning methods mainly exploit the information of each sample itself or the pairwise information between samples, while ignoring the intrinsic data structure of the whole data. To tackle this problem, this work develops a novel deep manifold embedding method (DMEM) for hyperspectral image classification. First, each class in the image is modelled as a specific nonlinear manifold, and the geodesic distance is used to measure the correlation between samples. Then, based on hierarchical clustering, the manifold structure of the data is captured and each nonlinear data manifold is divided into several sub-classes. Finally, considering the distribution of each sub-class and the correlation between different sub-classes, the DMEM is constructed to preserve the estimated geodesic distances on the data manifold between the learned low-dimensional features of different samples. Experiments over three real-world hyperspectral image datasets demonstrate the effectiveness of the proposed method.


I Introduction

Hyperspectral images contain hundreds of spectral bands that characterize different materials, making it possible to discriminate different objects with this plentiful spectral information, and they have proven their importance in the literature of remote sensing and computer vision [7, 51, 52]. As an important hyperspectral data task, hyperspectral image classification aims to assign a unique land-cover label to each pixel and is a key technique in many real-world applications, such as urban planning [15], military applications [8], and others. However, hyperspectral image classification is still a challenging task. There usually exists high nonlinearity among the samples within each class; therefore, how to effectively model and represent the samples of each class tends to be a difficult problem. Besides, the great overlap between the spectral channels of different classes in the hyperspectral image multiplies the difficulty of obtaining discriminative features from the samples.

Deep models have demonstrated their potential to model the nonlinearity of samples [23, 47, 13]. They can learn adaptively from the information in the training samples and extract the differences between classes. Due to this good performance, this work takes advantage of a deep model to extract features from the hyperspectral image. However, large amounts of training samples are required to guarantee a good performance of a deep model, while only a limited number of training samples is usually available in many computer vision tasks, especially in the literature of hyperspectral image classification. Therefore, how to construct the training loss and fully utilize the data information with a certain number of training samples becomes the essential and key problem for effective deep learning.

The softmax loss, namely the softmax cross-entropy loss, is widely applied in prior works. It is formulated as the cross entropy between the posterior probability and the class label of each sample [38], and it mainly takes advantage of the point-to-point information of each sample itself. Several variants that utilize the distance information between each sample pair or among each triplet have been proposed. These losses, such as the contrastive loss [9] and the triplet loss [32], have made great strides in improving the representational ability of the CNN model. However, these prior losses, which we call sample-wise methods, mainly utilize the data information of each sample itself or between samples and ignore the intrinsic data structure. In other words, these sample-wise methods only consider common, simple information and ignore the special intrinsic data structure of the hyperspectral image for the task at hand.

Establishing a good model for the hyperspectral image is the premise of making use of the intrinsic data structure in deep learning. Generally, ways to model the hyperspectral image can be broadly divided into two classes: parametric models and non-parametric models. Typical parametric models for hyperspectral images are usually constructed from a probabilistic model, such as the multivariate Gaussian distribution. This class of model has been successfully applied in the literature of hyperspectral target detection [57] and anomaly detection [49]. Generally, parameter estimation with the training data is essential under these parametric models [43]. The other class of models usually makes use of the information provided by the training data directly, without modelling the class data [18]. These nonparametric models are usually based on mutual information and are suitable for general cases since they do not assume anything about the shape of the class data density functions. In this work, the manifold model, which plays an important role among nonparametric models and better fits the high dimensionality of the hyperspectral image, is applied to model the image for the current task.

Manifold learning has been widely applied in many computer vision tasks, such as face recognition [43, 44] and image classification [28], as well as in the literature of hyperspectral imagery [42, 29]. Generally, a data manifold follows the law of manifold distribution: in real-world applications, high-dimensional data of the same class usually lies close to a low-dimensional manifold [21]. Therefore, hyperspectral images, which provide a dense spectral sampling at each pixel, possess a good intrinsic manifold structure. This work aims to develop a novel deep manifold embedding method (DMEM) for hyperspectral image classification to make use of the data manifold structure and preserve the intrinsic data structure in the obtained low-dimensional features.

In addition to the law of manifold distribution, a data manifold usually follows another property, namely the law of cluster distribution: the different sub-classes of a certain class in the high-dimensional data correspond to different probability distributions on the manifold [22]. Furthermore, these probability distributions are far enough apart to distinguish the sub-classes. Therefore, based on the geodesic distances between the samples, we divide each class in the hyperspectral image into several sub-classes. Then, we develop the DMEM according to the following two principles.

  1. Based on multivariate statistical analysis, the deep manifold embedding is constructed to encourage the features from each sub-class to follow a certain distribution and thereby preserve the intrinsic structure in the low-dimensional feature space.

  2. Motivated by the idea of maximizing the “manifold margin” in manifold discriminant analysis [43], an additional diversity-promoting term is developed to increase the margin between sub-classes from different data manifolds.

Overall, the main contributions of this work are threefold. Firstly, this work models the hyperspectral image with nonlinear manifolds and takes advantage of the intrinsic manifold structure of the hyperspectral image in the deep learning process. Secondly, this work formulates a novel training loss based on manifold embedding in deep learning for hyperspectral image classification, so that the intrinsic manifold structure can be preserved in the low-dimensional features. Finally, a thorough comparison with different sample-based embeddings and losses is provided.

The rest of this paper is arranged as follows. Section II briefly reviews existing works on the topics of manifold learning and general deep learning. Section III gives a detailed description of the proposed method, which embeds the manifold model in deep learning for hyperspectral image classification. Section IV presents the experimental results and comparisons to validate the effectiveness of the proposed method. Finally, we conclude this work with some discussions in Section V.

II Related Work

In this section, we review two topics that are closely related to this paper. First, deep learning methods are briefly introduced, since they motivate this work. Then, manifold learning in prior works is investigated, which is the work most directly related to the proposed method.

II-A Deep Learning

Deep learning methods capture the data information from the training samples under a fixed criterion given by the loss function. These loss functions are mainly based on sample-wise information and can be divided into two classes according to their criteria.

The first is the one-to-one correspondence criterion, which measures the difference between the prediction and the corresponding label of each sample. The typical representative is the widely used softmax loss. Several variants have been developed to boost the performance of the general softmax loss. For example, Liu et al. [27] develops the large-margin softmax loss (L-Softmax), which utilizes a simple angle-margin regularization to achieve a classification angle margin between different classes. The work of Liu et al. [26] is also of this type and improves the L-Softmax by normalizing the weights. Wang et al. [39] rethinks the softmax loss from the cosine perspective and constructs the large-margin cosine loss. Wan et al. [38] introduces a distribution prior on the learned features and constructs the Gaussian Mixture (GM) loss. A classification margin and a likelihood regularization can also be imposed on the GM loss to model the features more accurately. All these works utilize the information from different samples independently.

The second uses inter-sample information. The principle of these works is to decrease the Euclidean distances between samples with the same class label and to increase the distances between samples from different classes. Hadsell et al. [9] first develops the contrastive loss to utilize the information of image pairs. Schroff et al. [32] constructs triplet data and formulates the triplet loss. Sohn [34] further considers N-pair sampling rather than triplet sampling. Wang et al. [41] reformulates the correlation of triplet data from an angular point of view and develops the angular loss with triplet sampling. As variants of the contrastive loss, Song et al. [35] proposes the structured loss by taking advantage of the intrinsic structure within the mini-batch, and Zhang et al. [48] makes use of the harmonic range within each class to handle imbalanced data through the developed range loss. As a further improvement, the center loss developed by Wen et al. [46] utilizes the center point of each class to formulate the image pairs within the class. By adding more image pairs, Zhe et al. [50] pushes the contrastive loss to a class-wise loss.

These former losses only consider common, simple information from the training samples and ignore the intrinsic information within the data. In particular, for the task at hand, there exist high nonlinearity and great overlap in the high-dimensional hyperspectral data. Under these circumstances, using the special intrinsic data manifold structure within the hyperspectral image is particularly important and makes the learned model fit the image better. This is also the direct motivation of the method developed in this work.

II-B Manifold Learning

Manifold learning is the research topic of learning from data a latent space that represents the input space. It can not only grasp the hidden structure of the data, but also generate low-dimensional features by nonlinear mapping. A large number of manifold learning methods have already been proposed, such as Isometric Feature Mapping (ISOMAP) [40, 37], Laplacian Eigenmaps [2, 37], Locally Linear Embedding (LLE) [31], Semidefinite Embedding [45], Manifold Discriminant Analysis [43, 44], and RSR-ML [10]. With the development of deep learning, some works have incorporated manifolds into deep models [1, 56, 28, 17]. Zhu et al. [56] develops the automated transform by manifold approximation (AUTOMAP), which learns a near-optimal reconstruction mapping through manifold learning. Lu et al. [28] and Aziere et al. [1] mainly apply manifold learning in deep ensembles and consider the manifold similarity relationships between different CNNs. Iscen et al. [17] utilizes manifolds to implement metric learning without labels.

These manifold learning methods are mainly applied in natural image processing tasks, such as face recognition [10], natural image classification [44], and image retrieval [1]. Only a few works, such as [29] and [42], focus on the hyperspectral image classification task. Among these works, Ma et al. [29] only combines local manifold learning with the k-nearest-neighbor classifier. Wang et al. [42] uses manifold ranking for salient band selection. None of these works considers the intrinsic manifold structure of the hyperspectral image in the training process. Faced with the current task, this work tries to develop a novel deep manifold embedding that promotes the learned deep model to capture the intrinsic data structure of the hyperspectral image and further preserves the manifold structure in the low-dimensional features. In the following, we introduce the developed deep manifold embedding in detail.

Fig. 1: Flowchart of the proposed deep manifold embedding for hyperspectral image classification.

III Manifold Embedding in Deep Learning

Let $X = \{x_1, x_2, \ldots, x_N\}$ denote the training samples of the hyperspectral image and $y_i \in \Gamma$ the corresponding class label of $x_i$, where $N$ defines the number of training samples, $\Gamma$ stands for the set of class labels, and $C = |\Gamma|$ represents the number of classes in the image.

III-A Manifold Structure within the Hyperspectral Image

Denote by $X_c$ the set of samples from the $c$-th class and by $N_c$ the number of samples in the $c$-th class.

Following the law of manifold distribution, the samples of each class in the hyperspectral image are assumed to lie on a certain nonlinear manifold. As introduced above, the nonlinear manifold obeys the law of cluster distribution. Therefore, each class can be divided into several sub-classes, and each sub-class is supposed to follow a certain probability distribution. Generally, closer samples on the manifold are supposed to belong to the same sub-class, namely the same probability distribution. This work uses a measurement other than the Euclidean distance to measure the distance between samples on the manifold.

0:  Training samples and labels of each class, number of sub-classes $K$, number of neighbors $k$
0:  Sub-classes of each class
1:  for each class $c$ do
2:     Construct the undirected graph over the $c$-th class with the samples of the class as nodes.
3:     Compute the weights of the edges on the graph using (2).
4:     Compute the distance matrix over the manifold using (3) through the Dijkstra algorithm.
5:     while more than $K$ sub-classes remain do
6:        Combine the nearest two point sets in the distance matrix into a new set.
7:        Update the distance matrix with the newly established set.
8:     end while
9:  end for
10:  return the sub-classes of each class
Algorithm 1 Extracting Manifold Structure via Hierarchical Clustering

Consider the $c$-th class in the image. To separate the samples of each class into different sub-classes, all the samples of the class are used to formulate an undirected graph. Let $G_c = (V_c, E_c)$ denote the graph over the $c$-th class, where $V_c$ is the set of nodes in the graph and $E_c$ is the set of edges in the graph.

The distance between a sample and its $k$ nearest neighbors is assumed to lie on an approximately linear patch of the manifold and can therefore be calculated under the Euclidean distance,

$d(x_i, x_j) = \| x_i - x_j \|_2.$  (1)

Then, the weights of the edges of the undirected graph on the $c$-th class are defined as follows:

$w_{ij} = \begin{cases} d(x_i, x_j), & \text{if } x_j \text{ is among the } k \text{ nearest neighbors of } x_i \text{ or vice versa,} \\ +\infty, & \text{otherwise.} \end{cases}$  (2)

On the data manifold, the geodesic distance [33] can be used to measure the distance between different samples. The geodesic distance on the manifold can be approximated by the shortest path on the graph $G_c$. Then, the distance between samples $x_i$ and $x_j$ on the manifold can be calculated by

$d_M(x_i, x_j) = \min_{P} \sum_{l=1}^{|P|-1} w_{p_l p_{l+1}},$  (3)

where the minimum is taken over all paths $P = (p_1, p_2, \ldots, p_{|P|})$ along the edges of $G_c$ with $p_1 = i$ and $p_{|P|} = j$.

This work uses the Dijkstra algorithm [4] to solve the optimization in Eq. 3. Then, the distance matrix over the data manifold of the $c$-th class can be formulated from the pairwise distances between the samples.
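The geodesic-distance computation above can be sketched in a few lines. The following is a minimal numpy illustration, not the authors' code: `knn_graph` builds the weighted k-nearest-neighbor graph of Eqs. 1 and 2, and `geodesic_distances` runs Dijkstra's algorithm from one source sample to approximate Eq. 3; both function names are our own.

```python
import heapq

import numpy as np

def knn_graph(X, k):
    """Weighted undirected kNN graph: each sample is linked to its k nearest
    neighbours with the Euclidean distance as edge weight (Eqs. 1-2)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adj = {i: {} for i in range(n)}
    for i in range(n):
        for j in np.argsort(d[i])[1:k + 1]:  # skip the sample itself
            adj[i][int(j)] = d[i, j]
            adj[int(j)][i] = d[i, j]         # keep the graph symmetric
    return adj

def geodesic_distances(adj, src):
    """Single-source shortest-path (Dijkstra) distances on the graph,
    approximating the geodesic distance on the manifold (Eq. 3)."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, w in adj[u].items():
            if du + w < dist.get(v, float("inf")):
                dist[v] = du + w
                heapq.heappush(heap, (du + w, v))
    return dist
```

Calling `geodesic_distances` once with each sample as source yields the full pairwise distance matrix used by the subsequent clustering.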

Here, for each class, we divide the whole set of training samples of the class into $K$ sub-classes. Denote $S_{c,1}, S_{c,2}, \ldots, S_{c,K}$ as the sub-classes of the $c$-th class. The samples in each sub-class are supposed to be close enough to each other. Then, these sub-classes are constructed under the following optimization:

$\min_{S_{c,1}, \ldots, S_{c,K}} \sum_{t=1}^{K} \sum_{x_i, x_j \in S_{c,t}} d_M(x_i, x_j).$  (4)

Under the optimization in Eq. 4, we obtain the sub-classes with the smallest geodesic distances between the samples within each sub-class. Hierarchical clustering can be used to solve the optimization. The whole procedure is outlined in Algorithm 1.
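The merging loop of Algorithm 1 (steps 5-8) can be sketched as a plain agglomerative clustering over a precomputed geodesic distance matrix. This is an illustration under our own assumptions: single linkage is used as the set-to-set distance, and the function name is hypothetical.

```python
import numpy as np

def hierarchical_subclasses(D, n_subclasses):
    """Merge the two nearest point sets of the distance matrix D until only
    `n_subclasses` sets remain (steps 5-8 of Algorithm 1)."""
    clusters = [{i} for i in range(len(D))]
    while len(clusters) > n_subclasses:
        best = (float("inf"), 0, 1)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single-linkage set-to-set distance (an assumption here)
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] |= clusters[b]  # combine the two nearest sets
        del clusters[b]
    return clusters
```

For realistic sample counts, a library routine over the condensed distance matrix would be the practical choice; the explicit loop above only mirrors the algorithm's steps.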

III-B Deep Manifold Embedding

This work selects the CNN model as the feature extraction model for the hyperspectral image. Denote $f_i$ as the extracted feature of sample $x_i$ from the CNN model. Then, the obtained features can be regarded as the global low-dimensional coordinates under the nonlinear CNN mapping. Besides, as Fig. 1 shows, the deep manifold embedding constructs the global low-dimensional coordinates so as to preserve the estimated distances on the manifold.

From the law of cluster distribution, we know that different sub-classes correspond to different probability distributions over the manifold. To preserve the estimated geodesic distances, the extracted features of the samples in each sub-class, viewed in the low-dimensional coordinates, are also expected to follow the same distribution.

As processed in the former subsection, suppose $S_{c,1}, \ldots, S_{c,K}$ are the sub-classes of the $c$-th class, and let $n$ be the number of samples in the sub-class $S_{c,t}$. If not specified, in the following we write $S$ for $S_{c,t}$. Then, $F = \{f_1, f_2, \ldots, f_n\}$ is the set of the learned features of the samples in $S$. The problem of promoting the features in $F$ to follow the same distribution can be transformed into requiring that each $f_i$ follows the distribution constructed by all the other features in $F$ under a certain degree of confidence.

Therefore, given $f_i \in F$, suppose all the other features in $F$ follow the multivariate Gaussian distribution $\mathcal{N}(\mu, \Sigma)$ over $\mathbb{R}^d$, where $d$ is the dimension of the learned features. Then,

$p(f) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left( -\tfrac{1}{2} (f - \mu)^{T} \Sigma^{-1} (f - \mu) \right).$  (5)

Under the confidence level $\alpha$, when

$(f_i - \mu)^{T} \Sigma^{-1} (f_i - \mu) \le \chi_d^2(\alpha),$  (6)

$f_i$ can be seen as a sample from the distribution $\mathcal{N}(\mu, \Sigma)$. For simplicity, we assume that different dimensions of the feature are independent and have the same variance, namely that the covariance is $\Sigma = \sigma^2 I$, where $I$ represents the identity matrix. Besides, the unbiased estimation of the mean value $\mu$ from the other features is

$\hat{\mu}_i = \frac{1}{n-1} \sum_{j \ne i} f_j.$  (7)

Then, the penalization from $f_i$ can be formulated as the negative log-likelihood

$\ell_i = \frac{1}{2\sigma^2} \| f_i - \hat{\mu}_i \|_2^2 + C_0,$  (8)

where $C_0$ is the constant term. Since

$f_i - \hat{\mu}_i = \frac{n}{n-1} \left( f_i - \bar{f} \right), \quad \bar{f} = \frac{1}{n} \sum_{j=1}^{n} f_j,$  (9)

we can ignore the constant terms and use the following penalization to replace that in Eq. 8:

$\ell_i = \| f_i - \bar{f} \|_2^2.$  (10)

Then, the loss for the deep manifold embedding can be written as the sum of these penalizations over all sub-classes of all classes:

$L_M = \sum_{c=1}^{C} \sum_{t=1}^{K} \sum_{f_i \in S_{c,t}} \| f_i - \bar{f}_{c,t} \|_2^2,$  (11)

where $\bar{f}_{c,t}$ denotes the mean of the features in the sub-class $S_{c,t}$.
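In this simplified form, the embedding term is a pull-to-mean penalty per sub-class. A minimal numpy sketch, assuming features are stacked row-wise and sub-class membership is given as one integer label per row (the function name is ours):

```python
import numpy as np

def manifold_embedding_loss(features, subclass_ids):
    """Sum over sub-classes of the squared distances of each learned
    feature to its sub-class mean (the penalty of Eq. 10, summed as in
    Eq. 11)."""
    loss = 0.0
    for s in np.unique(subclass_ids):
        F = features[subclass_ids == s]          # features of one sub-class
        loss += np.sum((F - F.mean(axis=0)) ** 2)
    return float(loss)
```

During training this term is added to the network's softmax loss, as sketched in Fig. 1.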

To further improve the performance of the manifold embedding, we introduce a diversity-promoting term to enlarge the distance between sub-classes from different classes. The distance between sub-classes can be measured by a set-to-set distance. This work uses the Hausdorff distance, the maximum distance from a set to the nearest point of the other set [30], to measure the distance between different sub-classes, since this measurement considers the whole shape of the data set as well as the positions of the samples in the set.

Suppose $S_1$ is a sub-class from the $c$-th class and $S_2$ is a sub-class from the $c'$-th class with $c \ne c'$; then the Hausdorff distance between the two sub-classes can be calculated by

$d_H(S_1, S_2) = \max\left\{ \max_{f_i \in S_1} \min_{f_j \in S_2} \| f_i - f_j \|_2,\; \max_{f_j \in S_2} \min_{f_i \in S_1} \| f_i - f_j \|_2 \right\}.$  (12)

Then, the diversity-promoting term [6] can be formulated as

$L_D = \sum_{(S_1, S_2)} \max\left( 0,\; \delta - d_H(S_1, S_2) \right),$  (13)

where the sum runs over pairs of sub-classes from different classes and $\delta$ is a positive value which represents the margin.
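The set-to-set machinery above can be sketched directly. A small numpy illustration with hypothetical function names: `hausdorff` implements the distance of Eq. 12 on two stacked feature sets, and `diversity_term` the hinge penalty of Eq. 13 for a given list of cross-class sub-class pairs and margin.

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance of Eq. 12: the largest distance from a point in
    one set to its nearest point in the other set (rows are features)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def diversity_term(subclass_pairs, margin):
    """Hinge penalty of Eq. 13: sub-class pairs from different classes are
    pushed apart until their Hausdorff distance reaches the margin."""
    return sum(max(0.0, margin - hausdorff(A, B)) for A, B in subclass_pairs)
```

Pairs whose Hausdorff distance already exceeds the margin contribute nothing, so only sub-classes from different classes that lie too close are penalized.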

Based on Eqs. 11 and 13, the final loss for the proposed DMEM can be written as

$L_{DMEM} = L_M + \gamma L_D,$  (14)

where $\gamma$ stands for the tradeoff parameter.

III-C Optimization

Just as in general deep learning methods, stochastic gradient descent (SGD) and back propagation (BP) are used for the training of the developed deep manifold embedding [11]. The key step is to calculate the derivative of the loss with respect to (w.r.t.) the features $f_i$.

Based on the chain rule, the gradients of the loss w.r.t. the network parameters $\theta$ can be calculated through the gradients w.r.t. the features:

$\frac{\partial L_{DMEM}}{\partial \theta} = \sum_{i} \frac{\partial L_{DMEM}}{\partial f_i} \cdot \frac{\partial f_i}{\partial \theta}.$  (15)

Then, for the manifold embedding term we have

$\frac{\partial L_M}{\partial f_i} = 2 \sum_{c=1}^{C} \sum_{t=1}^{K} \mathbb{1}(f_i \in S_{c,t}) \left( f_i - \bar{f}_{c,t} \right),$  (16)

where $\mathbb{1}(\cdot)$ represents the indicative function.

The gradient of the diversity-promoting term w.r.t. the features can be derived from Eq. 13 in a similar, piecewise manner:

(17)

We summarize the computation of loss functions and gradients in Algorithm 2.

0:  Features of the mini-batch, sub-classes of each class, hyperparameters (tradeoff and margin)
0:  Loss and gradients of the DMEM
1:  Compute the loss for deep manifold embedding in each sub-class in the mini-batch using Eq. 10.
2:  Compute the loss of the manifold embedding term using Eq. 11.
3:  Compute the Hausdorff distance between sub-classes from different classes using Eq. 12.
4:  Compute the diversity-promoting term using Eq. 13.
5:  Compute the gradients of the manifold embedding term using Eq. 16.
6:  Compute the gradients of the diversity-promoting term using Eq. 17.
7:  return the loss and the gradients.
Algorithm 2 Calculate Gradient for DMEM
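Step 5 of Algorithm 2 admits a compact sketch. The following numpy illustration (function name ours) relies on the fact that, for a pull-to-mean loss, the deviations from a sub-class mean sum to zero, so the gradient reduces to twice the deviation of each feature from its sub-class mean even though the mean itself depends on the feature.

```python
import numpy as np

def embedding_loss_grad(features, subclass_ids):
    """Gradient of sum_i ||f_i - mean||^2 w.r.t. each feature f_i: the
    mean-dependence cancels, leaving 2 * (f_i - mean) per sub-class."""
    grad = np.zeros_like(features)
    for s in np.unique(subclass_ids):
        mask = subclass_ids == s
        grad[mask] = 2.0 * (features[mask] - features[mask].mean(axis=0))
    return grad
```

The result agrees with a finite-difference check, which is a useful sanity test when wiring such a custom loss into a framework's backward pass.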

IV Experimental Results

In this section, intensive experiments are conducted to demonstrate the effectiveness of the proposed method. First, the datasets used in this work are introduced. Then, the experimental setup is detailed, and the experimental results are presented and analyzed.

IV-A Datasets

To further validate the effectiveness of the proposed method, this work conducts experiments over three real-world hyperspectral images [16], namely the Pavia University, the Indian Pines, and the Salinas Scene data.

  1. The Pavia University data was acquired by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor during a flight campaign over Pavia, northern Italy. The image has a geometric resolution of 1.3 m/pixel. A total of 42,776 labelled samples divided into 9 land-cover classes are used for the experiments, and each sample has 103 spectral bands ranging from 0.43 to 0.86 μm.

  2. The Indian Pines data was gathered by the 224-band AVIRIS sensor, covering 0.4 to 2.5 μm, over the Indian Pines test site in north-western Indiana, with a spatial resolution of 20 m/pixel. After removing the 24 water absorption bands, 200 bands are retained. 16 classes of agriculture, forest, and vegetation with a total of 10,249 labelled samples are included in the experiments.

  3. The Salinas Scene data was also collected by the 224-band AVIRIS sensor with a spectral coverage from 0.4 to 2.5 μm, but over Salinas Valley, California, with a spatial resolution of 3.7 m/pixel. As with the Indian Pines scene, 20 water absorption bands are discarded. 16 classes of interest, including vegetables, bare soils, and vineyard fields, with a total of 54,129 labelled samples are chosen for the experiments.

IV-B Experimental Setup

Four parameters need to be determined in the experiments, namely the balance between the manifold embedding term and the diversity-promoting term, the balance between the manifold embedding term and the softmax loss, the number of sub-classes, and the number of neighbors. The first two are set empirically. As for the number of sub-classes and the number of neighbors, extensive experiments have been conducted to choose the best values: we set the two variables to different values and then check the resulting performance.

Caffe [19] is chosen as the deep learning framework to implement the developed method for hyperspectral image classification. This work adopts the simple CNN architecture shown in Fig. 2 to provide the nonlinear mapping to the low-dimensional features of the data manifold. The learning rate, number of iterations, and training batch size are set to 0.001, 60000, and 84, respectively. The tradeoff parameter in the deep manifold embedding is set to 0.0001. As Fig. 2 shows, this work takes advantage of the neighboring pixels to extract both spatial and spectral information from the image.

In the experiments, we choose 200 samples per class for training and the remainder for testing over the Pavia University and Salinas Scene data, while over the Indian Pines data we select 20 percent of the samples per class for training and the others for testing. To objectively evaluate the classification performance, the metrics of overall accuracy (OA), average accuracy (AA), and the Kappa coefficient are adopted. All results are reported as the average value and standard deviation over ten runs of training and testing. The code for the implementation of the proposed method will be released soon at http://github.com/shendu-sw/deep-manifold-embedding.

Fig. 2: Architecture of the CNN model for hyperspectral image classification. The CNN is jointly trained with the softmax loss and the developed manifold embedding loss.

IV-C General Performance

First, we present the general performance of the developed manifold embedding for hyperspectral image classification. In this set of experiments, the number of sub-classes is set to 5 and the number of neighbors is set to 5. A common machine with a 3.6-GHz Intel Core i7 CPU, 64-GB memory, and an NVIDIA GeForce GTX 1080 GPU was used to test the performance of the proposed method. The proposed method took about 2196 s over the Pavia University data, 2314 s over the Indian Pines data, and 2965 s over the Salinas Scene data. It should be noted that the developed manifold embedding is implemented on the CPU, and the computational performance could be remarkably improved by porting the code to the GPU.

Tables I, II, and III show the general performance over the Pavia University, Indian Pines, and Salinas Scene data, respectively. These tables show the classification accuracy of each class as well as the OA, AA, and Kappa obtained by SVM-POLY, the CNN trained with the softmax loss, and the CNN trained with the proposed method. From these tables, we can see that the CNN model provides a more discriminative representation of the hyperspectral image than handcrafted features. Furthermore, the performance of the CNN model is significantly improved when trained with the proposed method rather than with the softmax loss only: over the Pavia University data, the CNN model with the manifold embedding obtains a higher accuracy than the CNN with the softmax loss only, and over the Indian Pines and the Salinas Scene data the proposed method likewise outperforms the CNN with the general softmax loss.

It should be noted that constructing the data manifold structure requires a certain number of samples. Over the Salinas Scene and the Pavia University data, the classification accuracies of all classes are improved by the proposed method. However, over the Indian Pines data, the classification accuracies of some classes, such as alfalfa, corn, grass pasture mowed, oats, and wheat, are decreased by the developed method. The reason is that the number of training samples in these classes is quite small while that of the other classes is quite large in comparison. In particular, only four samples from the oats class are used for training. So few training samples cannot model the data manifold structure and may even have a negative effect on the classification performance.

To further validate the effectiveness of the developed method, this work uses McNemar's test [5], which is based on the standardized normal test statistic, for deeper comparisons in the statistical sense. The statistic can be computed by

$Z_{ij} = \frac{f_{ij} - f_{ji}}{\sqrt{f_{ij} + f_{ji}}},$  (18)

where $f_{ij}$ describes the number of samples correctly classified by the $i$-th method but wrongly classified by the $j$-th method. Therefore, $Z_{ij}$ measures the pairwise statistical significance between the $i$-th and $j$-th methods. At the widely used 5% level of confidence, the difference in accuracy between two methods is statistically significant if $|Z_{ij}| > 1.96$.
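The test statistic of Eq. 18 is a one-liner; a small sketch follows (the function name is ours). Passing the two disagreement counts returns the standardized statistic, which is significant at the 5% level when its absolute value exceeds 1.96.

```python
import math

def mcnemar_z(f_ij, f_ji):
    """McNemar test statistic of Eq. 18: f_ij counts samples classified
    correctly by method i but wrongly by method j, and f_ji vice versa."""
    return (f_ij - f_ji) / math.sqrt(f_ij + f_ji)
```

Only the disagreements between the two classifiers enter the statistic; samples that both methods classify correctly (or both misclassify) are irrelevant to the comparison.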

TABLE I: Classification accuracies (%) (OA, AA, and Kappa) of SVM-POLY, the CNN, and the proposed method achieved on the Pavia University data, together with the per-class accuracies C1–C9 and the value of McNemar's test. The results from the CNN are trained with the softmax loss.

From these tables, it can also be noted that, when the proposed method is compared with the CNN trained with the general softmax loss, the McNemar's test value reaches 20.80, 4.48, and 12.67 over the Pavia University, Indian Pines, and Salinas Scene data, respectively. This indicates that the improvement of the developed deep manifold embedding on the performance of the CNN is statistically significant.

TABLE II: Classification accuracies (%) (OA, AA, and Kappa) of SVM-POLY, the CNN, and the proposed method achieved on the Indian Pines data, together with the per-class accuracies C1–C16.
TABLE III: Classification accuracies (%) (OA, AA, and Kappa) of SVM-POLY, the CNN, and the proposed method achieved on the Salinas Scene data, together with the per-class accuracies C1–C16.

IV-D Effects of Different Numbers of Training Samples

Since the number of training samples can significantly affect the construction of the data manifold, this subsection further validates the performance of the developed deep manifold embedding under different numbers of training samples. For the Pavia University and the Salinas Scene data, the number of training samples per class is varied over a range of values. For the Indian Pines data, we choose 1%, 2%, 5%, 10%, and 20% of the samples for training, respectively. In this set of experiments, the number of sub-classes and the number of neighbors are both set to 5.

Fig. 3 shows the classification performance of the developed method with different numbers of training samples, and Fig. 4 shows the corresponding McNemar's test value between the CNN trained with the proposed method and the CNN trained with the softmax loss only. From the figures, we can draw the following conclusions.

  1. The developed manifold embedding method can take advantage of the data manifold property within the hyperspectral image and preserve the manifold structure in the low-dimensional features, which improves the representational ability of the CNN model. Fig. 3 shows that the proposed method obtains better performance over all three datasets under different numbers of training samples. Moreover, Fig. 4 shows that the corresponding McNemar's test value over the three datasets is higher than 1.96, which means that the improvement of the proposed method is significant in the statistical sense.

  2. With fewer training samples, the effectiveness of the developed method becomes limited. Fig. 3 shows that the classification accuracy curves over each data set tend to come closer to each other. Besides, from Fig. 4 it can be found that when the number of samples is limited, the test value fluctuates, which indicates that the effectiveness is negatively affected by the limited number of training samples. Just as the former subsection shows, this is because constructing the data manifold requires a certain number of training samples. On the contrary, too few samples may construct a false data manifold and negatively affect the performance.

Fig. 3: Classification performance of the proposed method under different numbers of training samples over (a) Pavia University; (b) Indian Pines; (c) Salinas Scene data.
Fig. 4: The value of McNemar's test between the CNN trained with the deep manifold embedding and the CNN trained with the softmax loss under different numbers of training samples over (a) Pavia University; (b) Indian Pines; (c) Salinas Scene data.

IV-E Effects of the Number of Sub-Classes

This subsection shows the performance of the developed method under different numbers of sub-classes. In the experiments, the number of sub-classes is varied over a range of values while the number of neighbors is set to 5. Fig. 5 presents the experimental results over the three data sets.

From the figure, we can find that a proper number of sub-classes can guarantee a good performance of the developed manifold embedding method. From Fig. 5, it can be found that a good choice of the number of sub-classes achieves 99.52% OA over the Pavia University data, while a poor choice leads to a lower accuracy. For the Indian Pines data, as Fig. 5 shows, the best choice of the number of sub-classes makes the classification accuracy peak, while a poor choice only achieves 99.31% OA. Besides, as Fig. 5 shows, the proposed method also performs best over the Salinas Scene data for a proper number of sub-classes. Generally, cross validation can be applied to select a proper number of sub-classes in real-world applications.

Fig. 5: Classification performance of the proposed method under different choices of the number of sub-classes over (a) Pavia University; (b) Indian Pines; (c) Salinas Scene data.

Iv-F Effects of the Number of neighbors

Just as the number of sub-classes, the number of neighbors also plays an important role in the developed method. Generally, an extremely small number of neighbors would make the constructed data manifold extremely “steep”, while an extremely large one would make the data manifold overly smooth. This subsection discusses the performance of the developed method under different numbers of neighbors, chosen from a range of candidate values. We also present the results when the number of neighbors approaches infinity, namely when all the samples are measured by the Euclidean distance. In this set of experiments, the number of sub-classes is set to 5. Fig. 6 shows the classification results of the proposed method under different choices of the number of neighbors over the three data sets, respectively. Inspecting the tendencies in Fig. 6, we can note the following.

Firstly, different numbers of neighbors can also significantly affect the performance of the developed method; over all three data sets, the proposed method performs best when the number of neighbors is set to 5. Besides, using the geodesic distance rather than the Euclidean distance improves the performance of the deep manifold embedding method. As Fig. 6 shows, the proposed method achieves 99.52% over Pavia University data under the geodesic distance, higher than 99.35% under the Euclidean distance. Over Indian Pines, just as Fig. 6 shows, the proposed method under the geodesic distance obtains an accuracy of 99.51%, outperforming that under the Euclidean distance (99.32%). From Fig. 6, it can be noted that over Salinas Scene data, the proposed method under the geodesic distance achieves 97.80%, better than 97.51% under the Euclidean distance.
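The geodesic distance used here is estimated on a k-nearest-neighbor graph with Dijkstra's algorithm [4], in the spirit of Isomap. A minimal pure-Python sketch (function and variable names are illustrative) shows why the choice of k matters: once k is large enough that every pair is directly connected, the estimate collapses to the plain Euclidean distance.

```python
import heapq
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances on the data manifold: connect each
    sample to its k nearest neighbors with Euclidean-weighted edges,
    then run Dijkstra's algorithm from every source. A sketch only.
    """
    n = len(points)
    # Symmetric k-NN graph.
    adj = [[] for _ in range(n)]
    for i in range(n):
        nbrs = sorted(range(n), key=lambda j: math.dist(points[i], points[j]))
        for j in nbrs[1:k + 1]:  # skip nbrs[0], which is i itself
            d = math.dist(points[i], points[j])
            adj[i].append((j, d))
            adj[j].append((i, d))
    # All-pairs shortest paths via Dijkstra from each source.
    geo = [[math.inf] * n for _ in range(n)]
    for s in range(n):
        geo[s][s] = 0.0
        pq = [(0.0, s)]
        while pq:
            d, u = heapq.heappop(pq)
            if d > geo[s][u]:
                continue
            for v, w in adj[u]:
                if d + w < geo[s][v]:
                    geo[s][v] = d + w
                    heapq.heappush(pq, (d + w, v))
    return geo

# Points along a curve: the geodesic between the end points follows the
# chain of neighbors and exceeds the straight-line distance.
pts = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
geo = geodesic_distances(pts, 1)
```

Here `geo[0][2]` equals the path length through the middle point, which is longer than the direct Euclidean distance between the two end points.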

Fig. 6: Classification performance of the proposed method under different choices of the number of neighbors over (a) Pavia University; (b) Indian Pines; (c) Salinas Scene data. “-” represents the case where all the samples are measured by the Euclidean distance.

Iv-G Comparisons with the Samples-based Loss

This work also compares the developed deep manifold embedding with other recent sample-based losses. Here, we choose three representative losses from prior works, namely the softmax loss, the center loss [46], and the structured loss [35]. Table IV lists the comparison results over the three data sets, respectively.

From the table, we can find that the proposed deep manifold embedding, which takes advantage of the data manifold property within the hyperspectral image and preserves the manifold structure in the low dimensional features, fits the classification task better than these sample-based losses. Over the Pavia University data, the proposed method obtains an accuracy of 99.52%, outperforming the CNN trained with the softmax loss (98.61%), the center loss (99.28%), and the structured loss (99.27%). Over the Salinas Scene and Indian Pines data, the proposed method also outperforms these prior sample-based losses (see the table for details).

Data Methods OA(%) AA(%) KAPPA(%)
PU Softmax Loss 15.77
Center Loss 6.03
Structured Loss 6.22
Proposed Method
IP Softmax Loss 4.48
Center Loss 3.01
Structured Loss 3.83
Proposed Method
SA Softmax Loss 12.67
Center Loss 6.42
Structured Loss 7.05
Proposed Method

TABLE IV: Comparisons with other sample-based losses. This work selects the softmax loss, the center loss [46], and the structured loss [35]. PU, IP, and SA stand for the Pavia University, the Indian Pines, and the Salinas Scene data, respectively.
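For contrast with the proposed embedding, the sample-based baselines act on each sample (or pair) in isolation. As an illustration, the center loss [46] simply pulls every feature toward its class center; a minimal sketch with illustrative names and toy values:

```python
def center_loss(features, labels, centers):
    """Center loss: L_c = 1/2 * sum_i ||x_i - c_{y_i}||^2.

    The loss sees each sample and its class center independently, with
    no notion of the manifold structure the proposed embedding
    preserves, which is the contrast Table IV quantifies.
    """
    total = 0.0
    for x, y in zip(features, labels):
        total += 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return total

# Two unit-norm features, each one unit away from its class center.
feats   = [(1.0, 0.0), (0.0, 1.0)]
labels  = [0, 1]
centers = {0: (0.0, 0.0), 1: (0.0, 0.0)}
loss = center_loss(feats, labels, centers)
```

In training, the centers themselves are updated alongside the network parameters; the sketch above only evaluates the loss for fixed centers.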

Furthermore, we present the classification maps obtained by different methods in Figs. 7, 8, and 9 over the Pavia University, Indian Pines, and Salinas Scene data, respectively. Comparing Figs. 7(c) and 7(f), 8(c) and 8(f), and 9(c) and 9(f), it can be easily noted that training with the deep manifold embedding improves the performance of the CNN model. Besides, comparing Figs. 7(d) and 7(f), 8(d) and 8(f), and 9(d) and 9(f), we can find that the deep manifold embedding, which takes advantage of the manifold structure, models the hyperspectral image better than the center loss. Comparing Figs. 7(e) and 7(f), 8(e) and 8(f), and 9(e) and 9(f), we can also note that the proposed method significantly decreases the classification errors obtained with the structured loss.

Fig. 7: Pavia University classification maps by different methods with 200 samples per class for training (overall accuracies). (a) groundtruth; (b) SVM (86.54%); (c) CNN with softmax loss (98.91%); (d) CNN with center loss (99.25%); (e) CNN with structured loss (99.42%); (f) CNN with developed manifold embedding loss (99.66%); (g) map color.
Fig. 8: Indian Pines classification maps by different methods with 20% of samples per class for training (overall accuracies). (a) groundtruth; (b) SVM (88.77%); (c) CNN with softmax loss (98.72%); (d) CNN with center loss (99.38%); (e) CNN with Structured loss (99.45%); (f) CNN with developed manifold embedding loss (99.61%); (g) map color.
Fig. 9: Salinas Scene classification maps by different methods with 200 samples per class for training (overall accuracies). (a) groundtruth; (b) SVM (90.69%); (c) CNN with softmax loss (97.06%); (d) CNN with center loss (97.05%); (e) CNN with structured loss (97.41%); (f) CNN with developed manifold embedding loss (98.03%); (g) map color.

Iv-H Comparisons with the State-of-the-Art Methods

To further validate the effectiveness of the proposed manifold embedding method for hyperspectral image classification, we make comparisons with a number of state-of-the-art methods. Tables V, VI, and VII list the comparison results under the same experimental setups over the three data sets, respectively. It should be noted that the results in these tables are taken from the literature where each method was first developed.

Over Pavia University data, the developed method obtains 99.52% OA, outperforming D-DBN-PF (93.11% OA) [51], CNN-PPF (96.48% OA) [24], Contextual DCNN (97.31% OA) [20], SSN (99.36% OA) [55], ML-based Spec-Spat (99.34% OA) [3], and DPP-DML-MS-CNN (99.46% OA) [7]. Besides, over Salinas Scene data and Indian Pines data, the developed method also provides competitive results (see Tables VI and VII for details). To sum up, the joint supervision of the developed manifold embedding loss and the softmax loss can always enhance the deep models' ability to extract discriminative representations and obtains comparable or even better results when compared with other state-of-the-art methods.

Methods OA(%) AA(%) KAPPA(%)

SVM-POLY
D-DBN-PF [51]
CNN-PPF [24]
Contextual DCNN [20]
SSN [55]
ML-based Spec-Spat [3]
DPP-DML-MS-CNN [7]
Proposed Method

TABLE V: Classification performance of different methods over Pavia University data in the most recent literature (200 samples per class for training).
Methods OA(%) AA(%) KAPPA(%)

R-ELM [25]
DEFN [36]
DRN [12]
MCMs+2DCNN [14]
Proposed Method (10%)
SVM-POLY
SSRN [53]
MCMs+2DCNN [14]
Proposed Method (20%)

TABLE VI: Classification performance of different methods over Indian Pines data in the most recent literature. The percentage in brackets indicates the proportion of training samples per class.
Methods OA(%) AA(%) KAPPA(%)

SVM-POLY
CNN-PPF [24]
Contextual DCNN [20]
Spec-Spat [54]
DPP-DML-MS-CNN [7]
Proposed Method

TABLE VII: Classification performance of different methods over Salinas Scene data in the most recent literature (200 samples per class for training).

V Conclusion and Discussion

The data structure is a critical factor that influences deep learning performance. In this paper, we take advantage of the data manifold to model the intrinsic data structure within the hyperspectral image and develop a novel deep manifold embedding method (DMEM) to preserve the manifold structure in the low dimensional features. Using the intrinsic data structure does help to improve the performance of the deep model, and the experimental results have validated the effectiveness of the developed DMEM.

As future work, it would be interesting to investigate the effectiveness of the manifold embedding on other hyperspectral imaging tasks, such as hyperspectral target detection. Besides, further consideration should be given to embedding the manifold structure in other forms. Finally, exploring other data structures that can significantly affect deep learning performance is another important future topic.

References

  • [1] N. Aziere and S. Todorovic (2019) Ensemble deep manifold similarity learning using hard proxies. In CVPR, pp. 7299–7307. Cited by: §II-B, §II-B.
  • [2] M. Belkin and P. Niyogi (2002) Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS, pp. 585–591. Cited by: §II-B.
  • [3] G. Cheng, Z. Li, J. Han, X. Yao, and L. Guo (2018) Exploring hierarchical convolutional features for hyperspectral image classification. IEEE TGRS 56 (11), pp. 6712–6722. Cited by: §IV-H, TABLE V.
  • [4] E. W. Dijkstra (1959) A note on two problems in connexion with graphs. Numerische Mathematik 1 (1), pp. 269–271. Cited by: §III-A.
  • [5] G. Foody (2004) Thematic map comparison: evaluating the statistical significance of differences in classification accuracy. Photogrammetric Engineering and Remote Sensing 70 (5), pp. 627–633. Cited by: §IV-C.
  • [6] Z. Gong, P. Zhong, and W. Hu (2019) Diversity in machine learning. IEEE Access 7 (1), pp. 64323–64350. Cited by: §III-B.
  • [7] Z. Gong, P. Zhong, Y. Yu, W. Hu, and S. Li (2019) A cnn with multiscale convolution and diversified metric for hyperspectral image classification. IEEE TGRS 57 (6), pp. 3599–3618. Cited by: §I, §IV-H, TABLE V, TABLE VII.
  • [8] Y. Gu, J. Chanussot, X. Jia, and J. A. Benediktsson (2017) Multiple kernel learning for hyperspectral image classification: a review. IEEE TGRS 55 (11), pp. 6547–6565. Cited by: §I.
  • [9] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In CVPR, pp. 1735–1742. Cited by: §I, §II-A.
  • [10] M. T. Harandi, M. Salzmann, and R. Hartley (2014) From manifold to manifold: geometry-aware dimensionality reduction for spd matrics. In ECCV, pp. 17–32. Cited by: §II-B, §II-B.
  • [11] S. S. Haykin (2009) Neural networks and learning machines. Prentice Hall, New York. Cited by: §III-C.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: TABLE VI.
  • [13] N. He, L. Fang, S. Li, J. Plaza, and A. Plaza (2019) Skip-connected covariance network for remote sensing scene classification. IEEE TNNLS. Cited by: §I.
  • [14] N. He, M. E. Paoletti, J. M. Haut, L. Fang, S. Li, A. Plaza, and J. Plaza (2018) Feature extraction with multiscale covariance maps for hyperspectral image classification. IEEE TGRS 57 (2), pp. 755–769. Cited by: TABLE VI.
  • [15] U. Heiden, W. Heldens, S. Roessner, K. Segl, T. Esch, and A. Mueller (2019) Urban structure type characterization using hyperspectral remote sensing and height information. Landscape and Urban Planning 105 (4), pp. 361–375. Cited by: §I.
  • [16] Hyperspectral data, accessed on Aug. 18, 2019. Note: http://www.ehu.eus/ccwintco/index.php?title=Hyperspectral_Remote_Sensing_Scenes Cited by: §IV-A.
  • [17] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum (2018) Mining on manifolds: metric learning without labels. In CVPR, pp. 7642–7651. Cited by: §II-B.
  • [18] X. Jia, B. C. Kuo, and M. M. Crawford (2013) Feature mining for hyperspectral image classification. Proceedings of the IEEE 101 (3), pp. 676–697. Cited by: §I.
  • [19] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, and et. al (2014) Caffe: convolutional architecture for fast feature embedding. In ACM MM, pp. 675–678. Cited by: §IV-B.
  • [20] H. Lee and H. Kwon (2017) Going deeper with contextual cnn for hyperspectral image classification. IEEE TIP 26 (10), pp. 4843–4855. Cited by: §IV-H, TABLE V, TABLE VII.
  • [21] N. Lei, Z. Luo, S. Yau, and D. X. Gu (2018) Geometric understanding of deep learning. arXiv preprint arXiv: 1805.10451. Cited by: §I.
  • [22] N. Lei, K. Su, L. Cui, S. T. Yau, and X. D. Gu (2019) A geometric view of optimal transportation and generative model. Computer Aided Geometric Design 68, pp. 1–21. Cited by: §I.
  • [23] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson (2019) Deep learning for hyperspectral image classification: an overview. IEEE TGRS. Cited by: §I.
  • [24] W. Li, G. Wu, F. Zhang, and Q. Qu (2016) Hyperspectral image classification using deep pixel-pair features. IEEE TGRS 55 (2), pp. 844–853. Cited by: §IV-H, TABLE V, TABLE VII.
  • [25] Y. Li, W. Xie, and H. Li (2017) Hyperspectral image reconstruction by deep convolutional neural network for classification. Pattern Recognition 63, pp. 371–383. Cited by: TABLE VI.
  • [26] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In CVPR, pp. 212–220. Cited by: §II-A.
  • [27] W. Liu, Y. Wen, Z. Yu, and M. Yang (2016) Large-margin softmax loss for convolutional neural networks. In ICCV, Cited by: §II-A.
  • [28] J. Lu, G. Wang, W. Deng, P. Moulin, and J. Zhou (2015) Multi-manifold deep metric learning for image set classification. In CVPR, pp. 1137–1145. Cited by: §I, §II-B.
  • [29] L. Ma, M. M. Crawford, and J. Tian (2010) Local manifold learning-based k-nearest-neighbor for hyperspectral image classification. IEEE TGRS 48 (11), pp. 4099–4109. Cited by: §I, §II-B.
  • [30] G. Rote (2019) Computing the minimum hausdorff distance between two point sets on a line under translation. Information Processing Letters 38 (3), pp. 123–127. Cited by: §III-B.
  • [31] S. Roweis and L. Saul (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290 (5500), pp. 2323–2326. Cited by: §II-B.
  • [32] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823. Cited by: §I, §II-A.
  • [33] G. Shamai and R. Kimmel (2017) Geodesic distance descriptors. In CVPR, pp. 6410–6418. Cited by: §III-A.
  • [34] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pp. 1857–1865. Cited by: §II-A.
  • [35] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In CVPR, pp. 4004–4012. Cited by: §II-A, §IV-G, TABLE IV.
  • [36] W. Song, S. Li, L. Fang, and T. Lu (2018) Hyperspectral image classification with deep feature fusion network. IEEE TGRS 56 (6), pp. 3173–3184. Cited by: TABLE VI.
  • [37] A. Talwalkar, S. Kumar, and H. Rowley (2008) Large-scale manifold learning. In CVPR, pp. 1–8. Cited by: §II-B.
  • [38] W. Wan, Y. Zhong, T. Li, and J. Chen (2018) Rethinking feature distribution for loss functions in image classification. In CVPR, pp. 9117–9126. Cited by: §I, §II-A.
  • [39] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, and J. Zhou (2018) CosFace: large margin cosine loss for deep face recognition. In CVPR, pp. 5265–5274. Cited by: §II-A.
  • [40] J. B. Wang, V. de Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), pp. 2319–2323. Cited by: §II-B.
  • [41] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In ICCV, pp. 2593–2601. Cited by: §II-A.
  • [42] Q. Wang, J. Lin, and Y. Yuan (2016) Salient band selection for hyperspectral image classification via manifold ranking. IEEE TNNLS 27 (6), pp. 1279–1289. Cited by: §I, §II-B.
  • [43] R. Wang and X. Chen (2009) Manifold discriminant analysis. In CVPR, pp. 429–436. Cited by: item 2, §I, §I, §II-B.
  • [44] R. Wang, S. Shan, X. Chen, Q. Dai, and W. Gao (2012) Manifold-manifold distance and its application to face recognition with image sets. IEEE TIP 21 (10), pp. 4466–4479. Cited by: §I, §II-B, §II-B.
  • [45] K. Q. Weinberger and L. K. Saul (2006) Unsupervised learning of image manifolds by semidefinite programming. IJCV 70 (1), pp. 77–90. Cited by: §II-B.
  • [46] Y. Wen, K. Zhang, Z. Li, and Y. Qiao (2016) A discriminative feature learning approach for deep face recognition. In ECCV, pp. 499–515. Cited by: §II-A, §IV-G, TABLE IV.
  • [47] Y. Yuan, L. Mou, and X. Lu (2015) Scene recognition by manifold regularized deep learning architecture. IEEE TNNLS 26 (10), pp. 2222–2233. Cited by: §I.
  • [48] X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao (2017) Range loss for deep face recognition with long-tailed training data. In ICCV, pp. 5409–5418. Cited by: §II-A.
  • [49] R. Zhao, B. Du, and L. Zhang (2017) Hyperspectral anomaly detection via a sparsity score estimation framework. IEEE TGRS 55 (6), pp. 3208–3222. Cited by: §I.
  • [50] X. Zhe, S. Chen, and H. Yan (2019) Deep class-wise hashing: semantics-preserving hashing via class-wise loss. IEEE TNNLS. Cited by: §II-A.
  • [51] P. Zhong, Z. Gong, S. Li, and C. B. Schonlieb (2017) Learning to diversify deep belief networks for hyperspectral image classification. IEEE TGRS 55 (6), pp. 3516–3530. Cited by: §I, §IV-H, TABLE V.
  • [52] P. Zhong, Z. Gong, and J. Shan (2019) Multiple instance learning for multiple diverse hyperspectral target characterizations. IEEE TNNLS. Cited by: §I.
  • [53] Z. Zhong, J. Li, Z. Luo, and M. Chapman (2018) Spectral-spatial residual network for hyperspectral image classification: a 3-d deep learning framework. IEEE TGRS 56 (2), pp. 847–858. Cited by: TABLE VI.
  • [54] P. Zhou, J. Han, G. Cheng, and B. Zhang (2019) Learning compact and discriminative stacked autoencoder for hyperspectral image classification. IEEE TGRS. Cited by: TABLE VII.
  • [55] Y. Zhou and Y. Wei (2016) Learning hierarchical spectral-spatial features for hyperspectral image classification. IEEE CYB 46 (7), pp. 1667–1678. Cited by: §IV-H, TABLE V.
  • [56] B. Zhu, J. Z. Liu, S. F. Cauley, B. R. Rosen, and M. S. Rosen (2018) Image reconstruction by domain-transform manifold learning. Nature 555 (7697), pp. 487–487. Cited by: §II-B.
  • [57] Z. Zou and Z. Shi (2015) Hierachical suppression method for hyperspectral target detection. IEEE TGRS 54 (1), pp. 330–342. Cited by: §I.