Statistical Loss and Analysis for Deep Learning in Hyperspectral Image Classification

12/28/2019, by Zhiqiang Gong et al.

Nowadays, deep learning methods, especially convolutional neural networks (CNNs), have shown impressive performance in extracting abstract and high-level features from hyperspectral images. However, the general training process of CNNs mainly considers pixel-wise information or the correlation between samples to formulate the penalization, while ignoring the statistical properties of the hyperspectral image, especially the spectral variability of each class. These samples-based penalizations lead to uncertainty in the training process due to the imbalanced and limited number of training samples. To overcome this problem, this work characterizes each class of the hyperspectral image as a statistical distribution and further develops a novel statistical loss defined over the distributions, not directly over the samples, for deep learning. Based on the Fisher discrimination criterion, the loss penalizes the sample variance of each class distribution to decrease the intra-class variance of the training samples. Moreover, an additional diversity-promoting condition is added to enlarge the inter-class variance between different class distributions, which could better discriminate samples from different classes in the hyperspectral image. Finally, the statistical estimation form of the statistical loss is developed with the training samples through multivariate statistical analysis. Experiments over real-world hyperspectral images show the effectiveness of the developed statistical loss for deep learning.


Code Repositories

statistical-loss

This repository is an extension of Caffe for the paper "Statistical Loss and Analysis for Deep Learning in Hyperspectral Image Classification" (TNNLS).


I Introduction

With the development of new and advanced space-borne and airborne sensors, large amounts of hyperspectral images, which contain hundreds of spectral channels, are available [5, 36]. The high-dimensional spectral bands in the image make it possible to obtain plentiful spectral information to discriminate different objects [1, 31, 35]. However, the strong similarity between the spectra of different objects makes the image processing task challenging. Besides, the increasing dimensionality of the hyperspectral image and the limited number of training samples further increase the difficulty of obtaining discriminative features from the image. Therefore, faced with these circumstances, spatial features are usually incorporated into the representation [16, 3]. However, modelling discriminative spatial and spectral features is not so simple, and there have been increasing efforts to explore effective spectral-spatial methods for hyperspectral image classification.

Recently, deep models with multiple layers have demonstrated their potential in modelling both the spectral and the spatial features of the hyperspectral image [22, 24, 17, 37]. In particular, CNNs, which can capture both local and global information from the objects, have presented good performance and have been widely applied in hyperspectral image processing tasks. Extended CNNs with multi-scale convolutions [5] and spectral-spatial residual blocks [38] have also been developed to improve the representational ability of CNNs. Given this good performance, this work takes advantage of the CNN model to extract deep spectral-spatial features from the hyperspectral image.

The key problem for deep representation is how to train a good model. Generally, a good training process is guaranteed by a proper definition of the training loss. Common training losses are constructed directly with the training samples and can be broadly divided into two classes. The first class of losses mainly penalizes the discrepancy between the predicted and the true label of each sample for the training of the deep model, such as the commonly used softmax loss [18, 14]. However, these losses only take advantage of the pixel-wise information of the hyperspectral image while ignoring the correlation between different samples. The second class focuses on penalizing the correlation between samples [5, 6, 20]. These losses penalize the Euclidean distances [8, 2] or the angles [30] between sample pairs [26, 32] or among sample triplets [25] and usually provide better performance than the first class. In real-world applications, the CNN is usually trained under the joint supervisory signals of losses from both classes for an effective deep representation.

Even though these samples-based losses have been successfully applied in the training of deep models, two shortcomings arise when they are used for hyperspectral image classification. First, these methods mainly consider the pixel-wise information of each training sample or the pairwise and triplet correlations between different samples, which makes the training process susceptible to the imbalanced and limited number of training samples. This increases the randomness and uncertainty of the training process. Second, these methods do not take the statistical properties of the hyperspectral image into consideration. In particular, there exist spectral variability within each class and seriously overlapped spectra between different classes in the image. These intrinsic properties could play an important role in providing an effective training process for deep learning.

To overcome these problems, this work models each class in the image as a certain probabilistic model and formulates the penalization with the class distributions rather than directly with the samples. The distributions-based loss can reduce the uncertainty caused by the imbalanced and limited number of training samples and further improve the ability of the learned model to extract discriminative features from the image. Specifically, this work uses multivariate normal distributions to model the different classes in the image.

Based on the probabilistic models and multivariate statistical analysis, this work develops a novel statistical loss for deep learning in hyperspectral image classification. Based on the Fisher discrimination criterion [33], the developed statistical loss penalizes the sample variance of each class distribution to decrease the spectral variability of each class. Moreover, a diversity-promoting condition [23] is added to the statistical loss to enlarge the inter-class variance between different class distributions. Finally, through multivariate statistical analysis, the statistical estimation form of the statistical loss is developed with the training samples. As a result, the learned deep model can be more powerful in extracting discriminative features from the image. Overall, the major contributions of this paper are listed as follows.

  • This work models the hyperspectral image with a probabilistic model and characterizes each class in the image as a certain sampling distribution to take advantage of the statistical properties of the image, so as to formulate the penalization with the class distributions.

  • Based on multivariate statistical analysis and the Fisher discrimination criterion, we develop a novel statistical loss that decreases the spectral variability of each class while enlarging the variance between different class distributions.

  • Extensive experiments over the real-world hyperspectral image data sets demonstrate the effectiveness and practicability of the developed method and its superiority when compared with other recent samples-based methods.

Fig. 1: Comparison of the statistical loss and the samples-based loss. The spectral curves shown are from the meadows and bare soil classes in the Pavia University data.

II Motivation

II-A Statistical Properties of the Hyperspectral Image

Hyperspectral remote sensing measures the radiance of the materials within each pixel area at a very large number of contiguous spectral wavelength bands [21]. Space-borne or airborne sensors gather this spectral information and provide hyperspectral images with hundreds of spectral bands. Since each pixel describes the energy reflected by surface materials and presents the intensity of this energy in different parts of the spectrum, each pixel contains a high-resolution spectrum, which can be used to identify the materials in the pixel through the analysis of reflectance or emissivity.

Unfortunately, a theoretically perfect fixed spectrum for any given material does not exist [23]. Due to variations in the material surface, the spectra observed from samples of the same class are generally not identical. The measured spectra corresponding to pixels of the same class present an inherent spectral variability that prevents the characterization of homogeneous surface materials by unique spectral signatures. As the spectral curves in Fig. 1 show, each class in a typical hyperspectral image exhibits remarkable spectral variability, and different classes show serious overlapping of their sets of spectra. Besides, most spectra appearing in real applications are random; therefore, their statistical variability is better described using probabilistic models.

The learned features of the objects in the image present similar characteristics. Since CNNs have demonstrated their potential in extracting discriminative features from the image [5, 19], this work will use the CNN model to extract deep features from the image. The features extracted by the CNNs can be seen as linear or nonlinear mappings of the objects. Therefore, the features from the same class also show obvious variability and can be described by probabilistic models.

For the task at hand, the probabilistic models are defined over high-dimensional features. Therefore, multivariate statistical analysis, which is concerned with analyzing and understanding data in high dimensions, is necessary and well suited to the image processing task we face [15]. Then, based on the Fisher discrimination criterion and multivariate statistical analysis, this work focuses on modelling each class of the hyperspectral image as a specific probabilistic model and further develops a novel statistical loss to extract discriminative features from the image.

Even though the hyperspectral image possesses good statistical properties, to the best of our knowledge, this is the first work to take the statistical properties of the hyperspectral image into consideration and to develop the loss with the distributions, not directly with the samples, for deep learning. In the following, we provide a detailed comparison between the developed distributions-based loss and the samples-based losses.

II-B Distributions-based Loss vs. Samples-based Loss

The samples-based losses mainly consider the pixel-wise information or penalize the correlation between sample pairs [8] or triplets [25] for deep learning. These losses attempt to obtain good representations of the image by decreasing the distances between samples from the same class and increasing the distances between samples from different classes. However, the performance of these samples-based losses is seriously influenced by the imbalanced and limited training samples, which leads to uncertainty and randomness in the training process. Fig. 1 shows the flowchart of the training process with these samples-based losses. As the figure shows, there may exist overlaps between the obtained features of different classes. Besides, the variability of the learned features within each class would still be too large.

Different from these samples-based losses, the distributions-based loss characterizes each class of the image as a certain probabilistic model and considers the class relationships through the distributions under the Fisher discrimination criterion. Since we model the correlation based on the class distributions, the problems caused by the imbalance and limited number of the training samples can be alleviated. This has positive effects on obtaining discriminative features from the image. As presented in Fig. 1, with the statistical loss obtained through multivariate statistical analysis, the spectral variability of the learned features within each class is decreased and different class distributions can be better separated. This makes the learned features sufficiently discriminative, and thus the classification performance can be significantly improved. In the following, we introduce the construction of the statistical loss for deep learning in detail.

III Statistical Loss and Analysis for Deep Learning

Let us denote X = \{x_1, x_2, \ldots, x_N\} as the set of training samples of the hyperspectral image, where N is the number of training samples, and y_i as the corresponding label of the sample x_i, where y_i \in \{1, 2, \ldots, K\} and K is the number of sample classes.

III-A Characterizing the Hyperspectral Image with a Probabilistic Model

A reasonable and commonly used probabilistic model for such spectral data in the hyperspectral image is the multivariate normal distribution. It has already presented impressive performance in modelling targets and backgrounds as random vectors with multivariate normal distributions for hyperspectral target detection [40] and hyperspectral anomaly detection [34]. For the task at hand, the features of different classes extracted from the CNN model will also be modelled with multivariate normal distributions.

Given a d-dimensional random variable x that follows a certain multivariate distribution, the random variable x is multivariate normal if its probability density function (pdf) has the form

p(x) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right),    (1)

where \mu \in \mathbb{R}^d describes the mean of the distribution and \Sigma \in \mathbb{R}^{d \times d}, which is a positive definite matrix, represents the covariance matrix of the distribution. Generally, the multivariate normal distribution is written as N(\mu, \Sigma).

In this work, each class k (k = 1, 2, \ldots, K) is modelled by a certain multivariate normal distribution with a mean of \mu_k and a covariance of \Sigma_k, which can be written as N(\mu_k, \Sigma_k) with \mu_k \in \mathbb{R}^d and \Sigma_k \in \mathbb{R}^{d \times d}, where d represents the dimension of the features obtained from the CNN model and K denotes the number of classes in the hyperspectral image. Obviously, the sampling distributions corresponding to different classes in the hyperspectral image are independent of each other.
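As a concrete illustration, the following NumPy sketch (ours, not part of the released Caffe code; all names are illustrative) evaluates the pdf of Eq. 1 for a feature vector under a given class distribution N(\mu_k, \Sigma_k):

import numpy as np

def mvn_pdf(x, mu, sigma):
    # Evaluate the multivariate normal pdf of Eq. 1 at a d-dimensional point x.
    d = mu.shape[0]
    diff = x - mu
    # (2*pi)^(-d/2) * |Sigma|^(-1/2) * exp(-0.5 * (x - mu)^T Sigma^{-1} (x - mu))
    norm_const = (2.0 * np.pi) ** (-d / 2.0) * np.linalg.det(sigma) ** (-0.5)
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(sigma, diff))

# Example: a 3-dimensional class distribution N(mu_k, Sigma_k).
mu_k = np.zeros(3)
sigma_k = np.eye(3)
print(mvn_pdf(np.array([0.1, -0.2, 0.05]), mu_k, sigma_k))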

III-B Construction of the Statistical Loss

As Fig. 2 shows, this work formulates the loss function based on the Fisher discrimination criterion [33]. Under the criterion, we penalize the sample variance of each class distribution to decrease the intra-class variance; the problem can then be formulated as the following optimization,

\min_{W} \sum_{k=1}^{K} \mathrm{tr}(\Sigma_k),    (2)

where \mathrm{tr}(\cdot) means the trace of a matrix and W denotes the set of the parameters in the CNN model.

Moreover, to further improve the performance, this work adds an additional diversity-promoting condition to repulse different class distributions from each other. The diversity-promoting condition can be formulated as

\|\mu_i - \mu_j\|_2^2 \geq \tau, \quad i \neq j,    (3)

where \tau is a positive value and i and j represent different classes of the image.

Therefore, from the statistical view, we characterize the feature correlation of the hyperspectral image with the probabilistic model and develop the statistical loss as follows,

\min_{W} \sum_{k=1}^{K} \mathrm{tr}(\Sigma_k) \quad \mathrm{s.t.} \quad \|\mu_i - \mu_j\|_2^2 \geq \tau, \; \forall i \neq j.    (4)

Under the optimization in Eq. 4, the intra-class variance of the obtained features is decreased. Besides, the diversity-promoting condition increases the variance between different class distributions. Thus, the learned features can be more discriminative for separating different samples.

To solve the optimization in Eq. 4 with the training samples, this work statistically estimates the optimization through multivariate statistical analysis and develops the estimated statistical loss for hyperspectral image classification.

III-C Statistical Estimation for the Statistical Loss

Generally, in the training process of CNNs, training batches are constructed to estimate the CNN model accurately. A training batch consists of a batch of randomly selected training samples, which enables parallelization of the training process [6]. Obviously, a training batch can be regarded as a sampling from the class distributions of the hyperspectral image.

Fig. 2: Flowchart of the construction of the developed statistical loss.

Given a training batch B, denote f_i as the feature of sample x_i extracted from the deep model, and let B_k represent the samples of the k-th class in the batch. Then, F_k = \{f_1^{(k)}, f_2^{(k)}, \ldots, f_{n_k}^{(k)}\} denotes the extracted features of the k-th class, where n_k is the number of samples in the class. Therefore, the features in F_k follow the class distribution N(\mu_k, \Sigma_k).

III-C1 Estimate of the Intra-class Term

The unbiased estimate \bar{f}_k of the distribution mean \mu_k of the k-th class in the batch can be calculated as

\bar{f}_k = \frac{1}{n_k} \sum_{i=1}^{n_k} f_i^{(k)}.    (5)

Define the scatter matrix S_k of the k-th class as

S_k = \sum_{i=1}^{n_k} (f_i^{(k)} - \bar{f}_k)(f_i^{(k)} - \bar{f}_k)^T.    (6)

Then, for the k-th class, the unbiased estimate \hat{\Sigma}_k of the covariance matrix can be formulated as

\hat{\Sigma}_k = \frac{1}{n_k - 1} S_k.    (7)

We use \hat{\Sigma}_k to estimate the covariance matrix \Sigma_k. Then,

\mathrm{tr}(\Sigma_k) \approx \mathrm{tr}(\hat{\Sigma}_k) = \frac{1}{n_k - 1} \mathrm{tr}(S_k).    (8)

Besides, \mu_k can be estimated by \bar{f}_k. Therefore, \mathrm{tr}(\Sigma_k) is estimated by

\frac{1}{n_k - 1} \sum_{i=1}^{n_k} \| f_i^{(k)} - \bar{f}_k \|_2^2.    (9)
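The per-class estimates of Eqs. 5-9 can be summarized in a minimal NumPy sketch; the function name and array layout are our own assumptions rather than the paper's Caffe implementation:

import numpy as np

def intra_class_penalty(features_k):
    # features_k: (n_k, d) array holding the CNN features of the k-th class in a batch.
    n_k = features_k.shape[0]
    f_bar = features_k.mean(axis=0)      # Eq. 5: unbiased estimate of the class mean
    diff = features_k - f_bar
    scatter = diff.T @ diff              # Eq. 6: scatter matrix S_k
    cov_hat = scatter / (n_k - 1)        # Eq. 7: unbiased covariance estimate
    return np.trace(cov_hat)             # Eqs. 8-9: estimated tr(Sigma_k)

# Example with 8 random 16-dimensional features of one class.
rng = np.random.default_rng(0)
print(intra_class_penalty(rng.normal(size=(8, 16))))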

III-C2 Estimate of the Diversity-promoting Condition

Consider the i-th and the j-th class. The i-th class follows the multivariate normal distribution N(\mu_i, \Sigma_i) and the j-th class follows N(\mu_j, \Sigma_j). Obviously, the two class distributions are independent of each other. This work uses statistical hypothesis testing to estimate the diversity-promoting condition in Eq. 3.

To estimate the condition, two well-known multivariate distributions, namely the Wishart distribution and the Hotelling T^2 distribution, are necessary.

The Wishart distribution plays a prominent role in the analysis of estimated covariance matrices. Assume z_1, z_2, \ldots, z_n are independent random vectors which follow the same d-dimensional multivariate normal distribution N(0, \Sigma). Then the random matrix M = \sum_{i=1}^{n} z_i z_i^T follows the d-dimensional Wishart distribution with n degrees of freedom, which can be written as

M \sim W_d(n, \Sigma).    (10)

It should be noted that the Wishart distribution satisfies the following additivity property. If the statistics M_1 \sim W_d(n_1, \Sigma) and M_2 \sim W_d(n_2, \Sigma) are independent of each other, then

M_1 + M_2 \sim W_d(n_1 + n_2, \Sigma).    (11)

The Hotelling T^2 distribution is essential to hypothesis testing in multivariate statistical analysis. Suppose that y \sim N_d(0, \Sigma) is independent of M \sim W_d(n, \Sigma), and denote the statistic T^2 = n\, y^T M^{-1} y. Then the statistic T^2 is defined to follow the Hotelling T^2 distribution with degrees of freedom d and n, which can be formulated as

T^2 \sim T^2(d, n).    (12)

It should be noted that the former Wishart and Hotelling T^2 are fixed distributions once the degrees of freedom are given. The two distributions will play an important role in the following estimation.

Traditionally, a statistical hypothesis is an assertion or conjecture concerning one or more populations. It should be noted that the rejection of a hypothesis implies that the sample evidence refutes it. That is to say, rejection means that there is a small probability of obtaining the sample information observed when, in fact, the hypothesis is true [29]. The structure of hypothesis testing is formulated with the use of the null hypothesis H_0 and the alternative hypothesis H_1. Generally, the rejection of H_0 leads to the acceptance of the alternative hypothesis H_1.

For simplicity, this work sets the \tau in Eq. 4 to 0. Therefore, from the statistical hypothesis view, we may then re-state the diversity-promoting condition as the following two competing hypotheses:

H_0: \mu_i = \mu_j,    (13)
H_1: \mu_i \neq \mu_j.    (14)

The scatter matrices S_i and S_j of the i-th and the j-th class are defined as Eq. 6 shows. From the definition of the Wishart distribution, it can be noted that

S_i \sim W_d(n_i - 1, \Sigma_i),    (15)
S_j \sim W_d(n_j - 1, \Sigma_j).    (16)

Since all the samples of different classes come from the same hyperspectral image, just as assumed in many hyperspectral target recognition tasks [21], different class distributions are supposed to share the same covariance matrix, namely \Sigma_i = \Sigma_j = \Sigma. Therefore, based on the property of the Wishart distribution in Eq. 11, the statistic S_i + S_j follows

S_i + S_j \sim W_d(n_i + n_j - 2, \Sigma).    (17)

Moreover, from the definition of the multivariate normal distribution, we can find that the statistic defined as y = \sqrt{\frac{n_i n_j}{n_i + n_j}} (\bar{f}_i - \bar{f}_j) follows, under the null hypothesis, the multivariate normal distribution

y \sim N_d(0, \Sigma).    (18)

Furthermore, denote the statistic

T_{ij}^2 = (n_i + n_j - 2)\, y^T (S_i + S_j)^{-1} y.    (19)

Then, according to the definition of the Hotelling T^2, it can be noted that the statistic in Eq. 19 follows, under the null hypothesis,

T_{ij}^2 \sim T^2(d, n_i + n_j - 2).    (20)

Therefore, at the 1 - \alpha level of confidence, if T_{ij}^2 \leq T_\alpha^2(d, n_i + n_j - 2), we accept the null hypothesis H_0 and reject the alternative hypothesis H_1; otherwise, if T_{ij}^2 > T_\alpha^2(d, n_i + n_j - 2), we accept the alternative hypothesis H_1 and reject the null hypothesis H_0.

Since the alternative hypothesis H_1 is what we seek, the diversity-promoting condition can be transformed into the following one,

T_{ij}^2 \geq T_\alpha^2(d, n_i + n_j - 2), \quad i \neq j.    (21)
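The inter-class statistic of Eqs. 18-20 can be computed for a pair of classes in a batch as in the NumPy sketch below; it assumes n_i + n_j - 2 >= d so that the pooled scatter matrix is invertible, and the function name is ours:

import numpy as np

def hotelling_t2(feat_i, feat_j):
    # feat_i: (n_i, d) features of class i; feat_j: (n_j, d) features of class j.
    n_i, n_j = feat_i.shape[0], feat_j.shape[0]
    fbar_i, fbar_j = feat_i.mean(axis=0), feat_j.mean(axis=0)
    s_i = (feat_i - fbar_i).T @ (feat_i - fbar_i)    # scatter matrices, Eq. 6
    s_j = (feat_j - fbar_j).T @ (feat_j - fbar_j)
    pooled = s_i + s_j                               # ~ W_d(n_i + n_j - 2, Sigma), Eq. 17
    y = np.sqrt(n_i * n_j / (n_i + n_j)) * (fbar_i - fbar_j)   # Eq. 18
    # Eq. 19: T^2 = (n_i + n_j - 2) * y^T (S_i + S_j)^{-1} y
    return (n_i + n_j - 2) * y @ np.linalg.solve(pooled, y)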

III-C3 Formulate the Statistical Loss

Denote the threshold in Eq. 21 as T_\alpha^2 for brevity. Then, based on Eq. 9 and Eq. 21, the optimization problem in Eq. 4 can be transformed into

\min_{W} \sum_{k=1}^{K} \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \|f_i^{(k)} - \bar{f}_k\|_2^2 \quad \mathrm{s.t.} \quad T_{ij}^2 \geq T_\alpha^2, \; \forall i \neq j.    (22)

By the method of Lagrange multipliers, the statistical loss for the hyperspectral image can be formulated from Eq. 22 as

L = \sum_{k=1}^{K} \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \|f_i^{(k)} - \bar{f}_k\|_2^2 + \lambda \sum_{i \neq j} \left( T_\alpha^2 - T_{ij}^2 \right),    (23)

where \lambda is the tradeoff parameter.

Besides, T_\alpha^2 is a constant value that is irrelevant to the training samples; therefore, we treat it as a constant positive value and omit it. Then, Eq. 23 can be re-formulated as

L = \sum_{k=1}^{K} \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \|f_i^{(k)} - \bar{f}_k\|_2^2 - \lambda \sum_{i \neq j} T_{ij}^2,    (24)

where \lambda represents a positive value. Therefore, Eq. 24 defines the statistical loss for deep learning in this work. Fig. 2 shows the detailed process to formulate the statistical loss. Under the statistical loss, the learned model can be more discriminative for the hyperspectral image.
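Putting the two parts together, the sketch below assembles a batch-wise statistical loss in the form reconstructed in Eq. 24: the intra-class trace term minus lambda times the sum of pairwise T^2 statistics. Since the exact expressions could not be fully recovered from the extracted text, this is a plausible reading rather than the authors' exact implementation; it also assumes every class in the batch has at least two samples and that the pooled scatter matrices are invertible.

import numpy as np

def statistical_loss(features, labels, lam=0.01):
    # features: (n, d) batch features; labels: (n,) integer class labels; lam: diversity weight.
    classes = np.unique(labels)
    intra, means, scatters, counts = 0.0, {}, {}, {}
    for k in classes:
        fk = features[labels == k]
        f_bar = fk.mean(axis=0)
        s_k = (fk - f_bar).T @ (fk - f_bar)
        means[k], scatters[k], counts[k] = f_bar, s_k, fk.shape[0]
        intra += np.trace(s_k) / (counts[k] - 1)             # intra-class term, Eq. 9
    diversity = 0.0
    for a in range(len(classes)):
        for b in range(a + 1, len(classes)):
            i, j = classes[a], classes[b]
            n_i, n_j = counts[i], counts[j]
            y = np.sqrt(n_i * n_j / (n_i + n_j)) * (means[i] - means[j])
            pooled = scatters[i] + scatters[j]
            t2 = (n_i + n_j - 2) * y @ np.linalg.solve(pooled, y)   # Eq. 19
            diversity += t2
    return intra - lam * diversity                           # assumed form of Eq. 24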

IV Training

Generally, the deep model is trained with the stochastic gradient descent method, and back propagation is used in the training process of the model [9]. Therefore, the main problem in implementing the developed statistical loss for the hyperspectral image classification task is to compute the derivative of the statistical loss w.r.t. the features extracted from the training samples.

As defined in Section III-C, the statistical loss can be formulated as

L = L_1 + \lambda L_2,    (25)

where

L_1 = \sum_{k=1}^{K} \frac{1}{n_k - 1} \sum_{i=1}^{n_k} \|f_i^{(k)} - \bar{f}_k\|_2^2,    (26)
L_2 = -\sum_{i \neq j} T_{ij}^2.    (27)

According to the chain rule, the gradient of the statistical loss w.r.t. the feature f_i of a training sample can be formulated as

\frac{\partial L}{\partial f_i} = \frac{\partial L_1}{\partial f_i} + \lambda \frac{\partial L_2}{\partial f_i}.    (28)

The partial derivative of L_1 w.r.t. f_i can be easily computed by

\frac{\partial L_1}{\partial f_i} = \sum_{k=1}^{K} \frac{2}{n_k - 1} \left( f_i - \bar{f}_k \right) \mathbb{1}(y_i = k),    (29)

where f_i is the learned feature of the training sample x_i from the CNN model and \mathbb{1}(\cdot) denotes the indicator function.

Besides, the partial derivative of L_2 w.r.t. f_i can be calculated by

\frac{\partial L_2}{\partial f_i} = -\sum_{j \neq y_i} \frac{\partial T_{y_i j}^2}{\partial f_i}.    (30)

Therefore, the key process is to calculate the following derivative:

\frac{\partial T_{ij}^2}{\partial f_l}, \quad f_l \in F_i \cup F_j.    (31)

The intermediate derivatives needed for Eq. 31 can be computed as in Eqs. 32 and 33, where I_d represents the identity matrix. Based on Eqs. 30-33, the partial derivative of L_2 w.r.t. f_l can then be calculated as in Eq. 34; detailed computations of the gradients are shown in the attachment.

Through back propagation with the preceding equations, the CNN model can be trained with the training samples and discriminative features can be learned from the hyperspectral image. The detailed training process of the developed method is shown in Algorithm 1. It should also be noted that the whole CNN is trained under the joint supervisory signals of the softmax loss and our statistical loss.

Input:  Training samples and labels; W_l, b_l as the parameters of the l-th convolutional layer; \theta as the parameters in the softmax layer; learning rate \eta_t.
Output:  W, \theta.
1:  Initialize the parameters in the l-th convolutional layer, where W_l is initialized from a Gaussian distribution with a standard deviation of 0.01 and b_l is set to 0.
2:  while not converged do
3:     t \leftarrow t + 1.
4:     Construct the training batch randomly.
5:     Obtain the deep features f_i of the samples with the CNN model specified by W.
6:     Compute the penalization L_1 using Eq. 26.
7:     Compute the penalization L_2 of the diversity-promoting term using Eq. 27.
8:     Compute the statistical loss by L = L_1 + \lambda L_2.
9:     Compute the joint loss by L_{joint} = L_S + \beta L, where L_S is the penalization from the softmax loss and \beta is the tradeoff parameter.
10:    Compute the derivative of L_1 w.r.t. f_i in the batch using Eq. 29.
11:    Compute the derivative of L_2 w.r.t. f_i in the batch as Eq. 34 shows.
12:    Update the parameters \theta by \theta \leftarrow \theta - \eta_t \cdot \partial L_{joint} / \partial \theta.
13:    Update the parameters W by W \leftarrow W - \eta_t \cdot \partial L_{joint} / \partial W.
14:  end while
15:  return  W, \theta
Algorithm 1 Training process of the developed method
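Algorithm 1 can also be summarized as a short training loop. The original implementation is in Caffe with the hand-derived gradients of Eqs. 28-34; the PyTorch stand-in below instead relies on automatic differentiation, the model is assumed to return both the penultimate features and the class logits, statistical_loss refers to a differentiable version of the function sketched in Section III-C3, and the joint-loss weight beta and the optimizer settings are illustrative assumptions:

import torch

def train(model, loader, statistical_loss, lam=0.01, beta=1.0, lr=0.001, max_steps=60000):
    # Joint supervision: softmax (cross-entropy) loss plus the statistical loss (Algorithm 1, line 9).
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for patches, labels in loader:             # randomly constructed training batches
            features, logits = model(patches)      # deep features and class scores
            loss_softmax = torch.nn.functional.cross_entropy(logits, labels)
            loss_stat = statistical_loss(features, labels, lam)   # L = L1 + lam * L2
            loss = loss_softmax + beta * loss_stat
            opt.zero_grad()
            loss.backward()                        # back propagation
            opt.step()
            step += 1
            if step >= max_steps:
                break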

V Experimental Results

V-A Experimental Datasets and Experimental Setups

To further validate the effectiveness of the developed statistical loss, this work conducts experiments over real-world hyperspectral image data sets, namely Pavia University and Indian Pines (more results can be seen in the attachment). We also compare the experimental results with other state-of-the-art methods, including the most recent samples-based losses, to show the advantage of the proposed method. In addition, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient are chosen as the measurements to evaluate the classification performance. All the results in this work are reported as the average value and standard deviation of ten runs of training and testing. For each of the ten experiments, the training and testing sets are randomly selected.
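For reference, OA, AA, and the Kappa coefficient can be computed from a confusion matrix as in the short helper below (ours, not from the released code):

import numpy as np

def oa_aa_kappa(conf):
    # conf: (K, K) confusion matrix with rows as true classes and columns as predictions.
    total = conf.sum()
    oa = np.trace(conf) / total                                   # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))                # mean of per-class accuracies
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2   # chance agreement
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa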

The Pavia University data [28] was gathered by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor with a spatial resolution of 1.3 m per pixel. It consists of 610 x 340 pixels, of which a total of 42,776 labelled samples divided into nine classes have been chosen for the experiments. Each pixel denotes a sample and consists of 115 bands with a spectral coverage ranging from 0.43 to 0.86 μm. Twelve spectral bands are abandoned due to noise, and the remaining 103 channels are used for the experiments.

The Indian Pines data [12] was collected by the 224-band AVIRIS sensor, covering the range from 0.4 to 2.5 μm, over the Indian Pines test site in north-western Indiana. It consists of 145 x 145 pixels, and the corrected Indian Pines data retains 200 bands after 24 bands covering the region of water absorption are removed. Sixteen land-cover classes with a total of 10,249 labelled samples are selected from the data for the experiments.

Fig. 3: The deep structure adopted in this work to implement the proposed method for hyperspectral image classification. The whole CNN is trained under the joint supervisory signals of the softmax loss and our statistical loss.

Caffe is chosen as the deep learning framework to implement the proposed method [13]. Since this work mainly tests the effectiveness of the developed statistical loss, we use the CNN model shown in Fig. 3 for all the experiments in this work. The learning rate, number of training iterations, and training batch size are set to 0.001, 60000, and 84, respectively. As Fig. 3 shows, this work uses the neighborhoods of each pixel to incorporate the spatial information. In the experiments, we choose 200 samples per class for training and the remainder for testing over the Pavia University data, while over the Indian Pines data we select 20% of the samples per class for training. The code for the implementation of the proposed method will be released soon at http://github.com/shendu-sw/statistical-loss.

V-B General Performance

At first, we present a brief overview of the merits of the developed statistical loss for hyperspectral image classification. In this set of experiments, the diversity weight \lambda is fixed to 0.01. A general machine with a 4.00 GHz Intel Core(TM) i7-6700K CPU, 64 GB of memory, and a GeForce GTX 1080 GPU is used to run the proposed method. The proposed method implemented through Caffe took about 1146 s over the Pavia University data and 1610 s over the Indian Pines data. It should be noted that this work implements the developed statistical loss on the CPU, and the computational performance could be remarkably improved by modifying the code to run the developed method on GPUs.

TABLE I: Classification accuracies (%) (per-class accuracies C1-C9, OA, AA, and Kappa) of SVM-POLY, CNN, and the proposed method on the Pavia University data. The CNN results are obtained with a model trained with the softmax loss; z represents the value of McNemar's test.
TABLE II: Classification accuracies (%) (per-class accuracies C1-C16, OA, AA, and Kappa) of SVM-POLY, CNN, and the proposed method on the Indian Pines data.

Tables I and II show the classification results over the two datasets separately. For the Pavia University data, C1, C2, ..., C9 represent asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows, respectively. For the Indian Pines data, C1, C2, ..., C16 stand for alfalfa, corn-no-till, corn-min-till, corn, grass-pasture, grass-trees, grass-pasture-mowed, hay-windrowed, oats, soybeans-no-till, soybeans-min-till, soybeans-clean, wheat, woods, buildings-grass-trees-drives, and stone-steel-towers, respectively. It can be noted that the developed method obtains a better performance than SVM. More importantly, the CNN learned with the statistical loss achieves an accuracy of 99.51% ± 0.09% over Pavia University, which is much higher than that obtained with the general softmax loss (98.61% ± 0.35%). Besides, for Indian Pines, the proposed method decreases the error rate by 47.42% when compared with the general softmax loss. The statistical loss can take advantage of the statistical properties of the hyperspectral image and embed the information of the class distributions of the hyperspectral image in the deep learning process. Thus, the learned deep model can better represent the hyperspectral image and further provide a better classification performance.

Furthermore, we use McNemar's test, which is based upon the standardized normal test statistic [7], as the statistical analysis method to demonstrate whether the developed statistical loss improves the classification performance in the statistical sense. The statistic is computed by

z_{ij} = \frac{f_{ij} - f_{ji}}{\sqrt{f_{ij} + f_{ji}}},    (35)

where z_{ij} measures the pairwise statistical significance of the difference between the accuracies of the i-th and the j-th methods, and f_{ij} denotes the number of samples that are classified correctly by the i-th method but wrongly by the j-th method. At the 95% level of confidence, the difference in accuracies between two methods is statistically significant if |z_{ij}| > 1.96.

From Tables I and II, we can find that z reaches 15.50 and 4.48 over Pavia University and Indian Pines, respectively, which means that the improvement in performance by the developed statistical loss is statistically significant.
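A small sketch of the McNemar's test of Eq. 35, computed from the predictions of two classifiers on the same test set (the variable names are ours):

import numpy as np

def mcnemar_z(pred_a, pred_b, truth):
    # z > 1.96 means method A improves over method B significantly at the 95% level of confidence.
    a_right, b_right = pred_a == truth, pred_b == truth
    f_ab = np.sum(a_right & ~b_right)    # correct by method A, wrong by method B
    f_ba = np.sum(~a_right & b_right)    # wrong by method A, correct by method B
    return (f_ab - f_ba) / np.sqrt(f_ab + f_ba)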

V-C Effects of Different Number of Training Samples

The former subsection has demonstrated the effectiveness of the developed statistical loss for the hyperspectral image under the experimental setups given in Section V-A. This subsection further evaluates the performance of the developed method under different numbers of training samples. For the Pavia University data, we choose the number of training samples per class from a set of values, while for the Indian Pines data, we select 1%, 2%, 5%, 10%, and 20% of the samples per class for training. It should be noted that in these experiments, the diversity weight \lambda is set to 0.01. Fig. 4 presents the classification performance of the developed method with different numbers of training samples over the two data sets. Furthermore, we present the value of McNemar's test between the CNN trained with the general softmax loss and the one trained with the statistical loss under different numbers of training samples in Fig. 5. Inspecting the tendencies in Figs. 4 and 5, we can note that the following hold.

Fig. 4: Classification performance with different numbers of training samples per class over (a) Pavia University; (b) Indian Pines.

Firstly, the accuracies obtained by the CNN with the proposed method are remarkably improved when compared with the CNN trained with the general softmax loss only. From Fig. 5, we can find that all the improvements by the developed method are statistically significant when compared with the general softmax loss. Particularly, the accuracy is increased from 74.62% to 86.97% under 10 training samples per class over Pavia University and from 67.50% to 79.60% under 1% of training samples per class over Indian Pines. Secondly, the classification performance of the learned model is significantly improved with the increase of the training samples. Finally, it can be noted that the developed statistical loss shows a definite improvement of the learned model under a limited number of training samples. As shown in Fig. 5, the value of McNemar's test increases significantly when the number of training samples decreases. The z value can even reach 59.74 under 10 training samples per class over Pavia University and 28.03 under 1% of training samples per class over Indian Pines. The statistical loss is constructed with the class distributions, not directly with the samples. Therefore, even under limited training samples, the statistical loss can exploit more class information through the class distributions and provide a marked improvement in classification performance. This indicates that the proposed method provides another way to train an effective CNN model with limited training samples.

Fig. 5: The Mcnemar’s test between the general softmax loss and the proposed method under different number of training samples over (a) Pavia University; (b) Indian pines.

Furthermore, we show the classification maps of different methods under 200 training samples per class over the Pavia University data and 20% of training samples per class over the Indian Pines data in Figs. 6 and 7, respectively. Comparing Fig. 6(c) with Fig. 6(f), and Fig. 7(c) with Fig. 7(f), we can find that, with the statistical loss, the classification errors are remarkably decreased over both datasets. Besides, comparing Fig. 6(b) with Fig. 6(f), and Fig. 7(b) with Fig. 7(f), it can be noted that the developed method can learn a model that fits the image better than general handcrafted features.

Fig. 6: Pavia University classification maps by different methods with 200 samples per class for training (overall accuracies). (a) ground truth; (b) SVM (89.2%); (c) CNN with softmax loss (98.25%); (d) CNN with center loss (99.44%); (e) CNN with structured loss (99.25%); (f) CNN with the developed statistical loss (99.64%); (g) map color.
Fig. 7: Indian Pines classification maps by different methods with 20% of samples per class for training (overall accuracies). (a) ground truth; (b) SVM (88.15%); (c) CNN with softmax loss (98.87%); (d) CNN with center loss (98.91%); (e) CNN with structured loss (99.31%); (f) CNN with the developed statistical loss (99.48%); (g) map color.

V-D Effects of Diversity Weight

As mentioned in Section III-C, \lambda represents the tradeoff parameter between the optimization term and the diversity term. The value of \lambda can also affect the performance of the developed statistical loss. In this set of experiments, we evaluate the performance of the proposed method with different values of \lambda. Fig. 8 shows the classification performance with different \lambda over the Pavia University and Indian Pines data, respectively.

Fig. 8: Classification performance of the proposed method with different diversity weights \lambda over (a) Pavia University; (b) Indian Pines. The baseline shown represents the results obtained with the general softmax loss only.

We can find that the statistical loss can provide a better performance with a larger \lambda. However, an excessively large \lambda has negative effects on the performance of the statistical loss. Generally, increasing \lambda encourages different class distributions to repulse each other, and therefore the learned features can be more discriminative for separating different objects. However, an excessively large \lambda focuses too much attention on the diversity among different classes while ignoring the variance of each class distribution. This could increase the intra-class variance of each class and show negative effects on the classification performance. More importantly, from Fig. 8, it can be noted that the proposed method performs best (99.51%) when \lambda is set to 0.01 over the Pavia University data, while for the Indian Pines data the best accuracy reaches 99.49% at its optimal \lambda. In practice, cross validation can be used to select a proper \lambda to satisfy the specific requirements of the developed statistical loss over different datasets.

V-E Comparisons with Other Samples-based Losses

This work also compares the developed statistical loss with other recent samples-based losses. We select the center loss [32] and the structured loss [26], which characterize the pair-wise correlation between the training samples, as the benchmarks. Table III shows the comparison results over the Pavia University and the Indian Pines data.

From the table, we can find that the developed statistical loss, which formulates the penalization with the class distributions, fits the classification task better than the center loss and the structured loss. Using 200 samples per class for training on the Pavia University data, the OA achieved by the statistical loss outperforms those obtained by the center loss and the structured loss. For the Indian Pines data, the OA it obtains with 20% training samples is also higher than the OAs obtained by the center loss and the structured loss. Moreover, it can also be noted that z reaches 5.52 and 5.68 when the statistical loss is compared with the center loss and the structured loss over Pavia University, and 2.64 and 3.56 over Indian Pines. This means that, by McNemar's test, the improvement of the developed statistical loss over the other samples-based losses is statistically significant.

Besides, comparing the statistical loss with these samples-based losses under a limited number of training samples, we can also find that the deep model obtains a significant improvement with the developed method. The reason is that the statistical loss is constructed with the class distributions and can use more class information in the training process, while the samples-based losses are constructed directly with the training samples. In conclusion, the developed statistical loss, which is formulated with the class distributions, achieves superior performance when compared with other samples-based losses in the literature of hyperspectral image classification.

The classification maps of the CNN models learned with the center loss and the structured loss over the two datasets are also shown in Figs. 6 and 7, respectively. Comparing Fig. 6(d) with Fig. 6(f) and Fig. 7(d) with Fig. 7(f), it can be easily found that the CNN model with the statistical loss can better model the hyperspectral image than that with the center loss. Besides, comparing Fig. 6(e) with Fig. 6(f) and Fig. 7(e) with Fig. 7(f), we can see that the statistical loss significantly decreases the classification errors compared with the structured loss.

Data  Training set    Method           z
PU    10 per class    Softmax Loss     59.74
PU    10 per class    Center Loss      27.17
PU    10 per class    Structured Loss  37.84
PU    20 per class    Softmax Loss     43.92
PU    20 per class    Center Loss      21.81
PU    20 per class    Structured Loss  31.28
PU    200 per class   Softmax Loss     15.50
PU    200 per class   Center Loss      5.52
PU    200 per class   Structured Loss  5.68
IP    1%              Softmax Loss     28.03
IP    1%              Center Loss      16.97
IP    1%              Structured Loss  20.93
IP    2%              Softmax Loss     21.54
IP    2%              Center Loss      12.98
IP    2%              Structured Loss  16.27
IP    20%             Softmax Loss     4.48
IP    20%             Center Loss      2.64
IP    20%             Structured Loss  3.56

TABLE III: Comparisons with other sample-wise losses. This work selects the generally used softmax loss; furthermore, it compares the developed statistical loss with the most recent sample-based losses, namely the center loss [32] and the structured loss [26]. z denotes the McNemar's test value between the developed statistical loss and each compared loss under the corresponding training set. PU represents the Pavia University data and IP represents the Indian Pines data.

V-F Comparisons with the Most Recent Methods

To further validate the effectiveness of the developed statistical loss for hyperspectral image classification, we compare it with state-of-the-art methods. Tables IV and V show the comparisons over the two datasets, respectively. The experimental results in each table are obtained under the same experimental setups, and we directly use the results reported in the literature where each method was first developed.

Methods compared: SVM-POLY; D-DBN-PF [35]; CNN-PPF [18]; Contextual DCNN [16]; SSN [39]; ML-based Spec-Spat [4]; DPP-DML-MS-CNN [5]; Proposed Method. OA (%), AA (%), and Kappa (%) are reported for each.

TABLE IV: Classification performance of different methods over the Pavia University data in the most recent literature (200 training samples per class for training).
Methods compared over the Indian Pines data, first group (10% of samples per class for training): R-ELM [19]; DEFN [27]; DRN [10]; MCMs+2DCNN [11]; Proposed Method. Second group (20% of samples per class for training): SVM-POLY; SSRN [38]; MCMs+2DCNN [11]; Proposed Method. OA (%), AA (%), and Kappa (%) are reported for each.