This repository is an extension of Caffe for the paper "Statistical Loss and Analysis for Deep Learning in Hyperspectral Image Classification" (TNNLS).
Nowadays, deep learning methods, especially convolutional neural networks (CNNs), have shown impressive performance in extracting abstract, high-level features from hyperspectral images. However, the general training process of CNNs mainly considers pixel-wise information or sample correlations to formulate the penalization, while ignoring the statistical properties of the hyperspectral image, especially the spectral variability of each class. These samples-based penalizations lead to uncertainty in the training process because of the imbalanced and limited number of training samples. To overcome this problem, this work characterizes each class in the hyperspectral image as a statistical distribution and further develops a novel statistical loss defined over the distributions, not directly over the samples, for deep learning. Based on the Fisher discrimination criterion, the loss penalizes the sample variance of each class distribution to decrease the intra-class variance of the training samples. Moreover, an additional diversity-promoting condition is added to enlarge the inter-class variance between different class distributions, which better discriminates samples from different classes in the hyperspectral image. Finally, the statistical estimation form of the statistical loss is developed from the training samples through multivariate statistical analysis. Experiments over real-world hyperspectral images show the effectiveness of the developed statistical loss for deep learning.
With the development of new and advanced space-borne and airborne sensors, large amounts of hyperspectral images, which contain hundreds of spectral channels, have become available [5, 36]. The high-dimensional spectral bands make it possible to obtain plentiful spectral information to discriminate different objects [1, 31, 35]. However, the great similarity between the bands of different objects makes image processing a challenging task. Besides, the increasing dimensionality of hyperspectral images and the limited number of training samples multiply the difficulties of obtaining discriminative features from the image. Faced with these circumstances, spatial features are usually incorporated into the representation [16, 3]. However, modelling discriminative spatial and spectral features is not simple, and there have been increasing efforts to explore effective spectral-spatial methods for hyperspectral image classification.
Recently, deep models with multiple layers have demonstrated their potential in modelling both the spectral and spatial features of the hyperspectral image [22, 24, 17, 37]. In particular, CNNs, which can capture both local and global information from the objects, have presented good performance and been widely applied in hyperspectral image processing tasks. Extended CNNs with multi-scale convolutions and spectral and spatial residual blocks have also been developed to improve the representational ability of CNNs. Therefore, due to this good performance, this work takes advantage of the CNN model to extract deep spectral-spatial features from the hyperspectral image.
The essential problem for deep representation is how to train a good model. Generally, a good training process is guaranteed by a proper definition of the training loss. Common training losses are constructed directly from the training samples and can be broadly divided into two classes. The first class of losses mainly penalizes the discrepancy between the predicted and the real label of each sample, such as the widely used softmax loss [18, 14]. However, these losses only take advantage of the pixel-wise information from the hyperspectral image while ignoring the correlation between different samples. The other class focuses on penalizing the samples' correlation [5, 6, 20]. These losses penalize the Euclidean distances [8, 2] or the angular distances between sample pairs [26, 32] or among sample triplets, and usually provide a better performance than the first class. In real-world applications, the CNN is usually trained under the joint supervisory signals of losses from both classes for an effective deep representation.
Even though these samples-based losses have been successfully applied in the training of deep models, they exhibit two shortcomings when used for hyperspectral image classification. First, these methods mainly consider the pixel-wise information of each training sample or the pairwise and triplet correlations between different samples, which makes the training process susceptible to the imbalanced and limited number of training samples. This increases the randomness and uncertainty of the training process. Second, these methods do not take the statistical properties of the hyperspectral image into consideration. In particular, there exists spectral variability within each class and serious spectral overlap between different classes in the image. These intrinsic properties could play an important role in providing an effective training process for deep learning.
To overcome these problems, this work models each class in the image as a certain probabilistic model and formulates the penalization with the class distributions, not directly with the samples. The distributions-based loss can reduce the uncertainty caused by the imbalanced and limited number of training samples and further improve the ability of the learned model to extract discriminative features from the image. Specifically, this work uses multivariate normal distributions to model the different classes in the image.
Under these probabilistic models and multivariate statistical analysis, this work develops a novel statistical loss for deep learning in hyperspectral image classification. Based on the Fisher discrimination criterion, the developed statistical loss penalizes the sample variance of each class distribution to decrease the spectral variability of each class. Moreover, a diversity-promoting condition is added to the statistical loss to enlarge the inter-class variance between different class distributions. Finally, through multivariate statistical analysis, the statistical estimation form of the statistical loss is developed from the training samples. As a result, the learned deep model can be more powerful in extracting discriminative features from the image. Overall, the major contributions of this paper are listed as follows.
This work models the hyperspectral image with the probabilistic model and characterizes each class from the image as a certain sampling distribution to take advantage of the statistical properties of the image, so as to formulate the penalization with the class distributions.
Based on multivariate statistical analysis and the Fisher discrimination criterion, we develop a novel statistical loss that decreases the spectral variability of each class while enlarging the variance between different class distributions.
Extensive experiments over the real-world hyperspectral image data sets demonstrate the effectiveness and practicability of the developed method and its superiority when compared with other recent samples-based methods.
Hyperspectral remote sensing measures the radiance of the materials within each pixel area at a very large number of contiguous spectral wavelength bands. Space-borne or airborne sensors gather this spectral information and provide hyperspectral images with hundreds of spectral bands. Since each pixel describes the energy reflected by surface materials and presents the intensity of that energy in different parts of the spectrum, each pixel contains a high-resolution spectrum, which can be used to identify the materials in the pixel through the analysis of reflectance or emissivity.
Unfortunately, a theoretically perfect fixed spectrum for any given material does not exist. Due to variations in the material surface, the spectra observed from samples of the same class are generally not identical. The measured spectra corresponding to pixels of the same class present an inherent spectral variability that prevents the characterization of homogeneous surface materials by unique spectral signatures. As the spectral curves in Fig. 1 show, each class in a typical hyperspectral image exhibits remarkable spectral variability, and different classes show serious overlapping of their sets of spectra. Besides, most spectra appearing in real applications are random. Therefore, their statistical variability is better described using probabilistic models.
As mentioned above, this work uses the CNN model to extract deep features from the image. The features extracted by CNNs can be seen as linear or nonlinear mappings of the objects. Therefore, the features from the same class also show obvious variability and can likewise be described by probabilistic models.
For the task at hand, the probabilistic models are defined over high-dimensional features. Therefore, multivariate statistical analysis, which is concerned with analyzing and understanding data in high dimensions, is necessary and well suited to the image processing task we face. Based on the Fisher discrimination criterion and multivariate statistical analysis, this work will model each class in the hyperspectral image as a specific probabilistic model and further develop a novel statistical loss to extract discriminative features from the image.
Even though the hyperspectral image possesses good statistical properties, to the best of our knowledge, this work is the first to take the statistical properties of the hyperspectral image into consideration and develop the loss with the distributions, not directly with the samples, for deep learning. In the following, we provide a detailed comparison between the developed distributions-based loss and samples-based losses.
Samples-based losses mainly consider pixel-wise information or penalize the correlation between sample pairs or triplets for deep learning. These losses attempt to obtain good representations of the image by decreasing the distances between samples from the same class and increasing the distances between samples from different classes. However, the performance of these samples-based losses is seriously influenced by the imbalanced and limited training samples, which leads to uncertainty and randomness in the training process. Fig. 1 shows the flowchart of the training process under these samples-based losses. As the figure shows, there may exist overlaps between the obtained features of different classes. Besides, the variability of the learned features within each class may remain too large.
Different from these samples-based losses, the distributions-based loss characterizes each class in the image as a certain probabilistic model and considers the class relationships between the distributions under the Fisher discrimination criterion. Since we model the correlation based on the class distributions, the problems caused by the imbalanced and limited training samples can be alleviated. This has positive effects on obtaining discriminative features from the image. As presented in Fig. 1, with the statistical loss derived from multivariate statistical analysis, the spectral variability of the learned features within each class is decreased and different class distributions can be better separated. This makes the learned features sufficiently discriminative, and thus the classification performance can be significantly improved. In the following, we introduce the construction of the statistical loss for deep learning in detail.
Let X = {x_1, x_2, …, x_N} denote the set of training samples of the hyperspectral image, where N is the number of training samples, and let y_i ∈ {1, 2, …, C} denote the label of sample x_i, where C is the number of sample classes.
A reasonable and commonly used probabilistic model for such spectral data in the hyperspectral image is the multivariate normal distribution. Modelling targets and backgrounds as random vectors with multivariate normal distributions has already shown impressive performance in hyperspectral target detection and hyperspectral anomaly detection. For the task at hand, the features of the different classes extracted by the CNN model will also be modelled with multivariate normal distributions.
Consider a d-dimensional random variable x which follows a certain multivariate distribution. The random variable x is multivariate normal if its probability density function (pdf) has the form

p(x) = (2π)^{-d/2} |Σ|^{-1/2} exp( -(1/2) (x - μ)^T Σ^{-1} (x - μ) ),

where μ ∈ R^d describes the mean of the distribution and Σ, which is a positive-definite matrix, represents the covariance matrix of the distribution. Generally, the multivariate normal distribution is written as N(μ, Σ).
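As a quick numerical check of this density, the closed-form pdf can be evaluated with NumPy and compared against SciPy's implementation; the mean, covariance, and query point below are hypothetical values chosen purely for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 3-dimensional class distribution N(mu, Sigma)
mu = np.array([0.2, -0.1, 0.5])
Sigma = np.array([[1.0, 0.2, 0.0],
                  [0.2, 1.5, 0.3],
                  [0.0, 0.3, 0.8]])
x = np.zeros(3)  # query point

# Closed-form pdf: (2*pi)^(-d/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)' Sigma^{-1} (x-mu))
d = mu.size
diff = x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
    / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

# Cross-check against SciPy's reference implementation
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
```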
In this work, each class k is modelled by a certain multivariate normal distribution with mean μ_k and covariance Σ_k, written as N(μ_k, Σ_k) for k = 1, 2, …, C, where the dimension of the distribution equals the dimension d of the features obtained from the CNN model and C denotes the number of classes in the hyperspectral image. The sampling distributions corresponding to different classes in the hyperspectral image are assumed independent of each other.
As Fig. 2 shows, this work formulates the loss function based on the Fisher discrimination criterion. Under this criterion, we penalize the sample variance of each class distribution to decrease the intra-class variance; the problem can be formulated as the following optimization:

min_θ Σ_{k=1}^{C} tr(Σ_k),

where tr(·) denotes the trace of a matrix and θ denotes the set of parameters in the CNN model.
Moreover, to further improve the performance, this work adds an additional diversity-promoting condition to repulse different class distributions from each other. The diversity-promoting term can be formulated as

||μ_i - μ_j||_2^2 ≥ τ, i ≠ j,

where τ is a positive value and i and j represent different classes of the image.
Therefore, from the statistical view, we characterize the feature correlation of the hyperspectral image with the probabilistic model and develop the statistical loss as the constrained optimization

min_θ Σ_{k=1}^{C} tr(Σ_k)  s.t.  ||μ_i - μ_j||_2^2 ≥ τ, ∀ i ≠ j.   (4)
Under the optimization in Eq. 4, the intra-class variance of the obtained features is decreased. Besides, the diversity-promoting condition increases the variance between different class distributions. Thus, the learned features can be more discriminative to separate different samples.
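To make the objective concrete, the following sketch (not the authors' implementation) evaluates a plain-NumPy version of this loss, with the constraint relaxed into a hinge penalty; the tradeoff weight `lam` and margin `tau` are hypothetical values chosen for illustration.

```python
import numpy as np

def statistical_loss(feats_by_class, lam=0.01, tau=1.0):
    """Intra-class variance (trace of each class's sample covariance)
    plus a hinge that pushes class means at least tau apart."""
    means = []
    intra = 0.0
    for F in feats_by_class:                      # F: (n_k, d) features of one class
        mu_hat = F.mean(axis=0)
        centered = F - mu_hat
        intra += np.trace(centered.T @ centered) / (len(F) - 1)
        means.append(mu_hat)
    diversity = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            gap = np.sum((means[i] - means[j]) ** 2)
            diversity += max(0.0, tau - gap)      # penalize means closer than tau
    return intra + lam * diversity

# Two perfectly compact, well-separated classes give zero loss
loss = statistical_loss([np.zeros((5, 3)), np.full((5, 3), 10.0)])
```

Minimizing the first term shrinks each class around its mean, while the hinge term only activates when two class means fall closer than the margin.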
To solve the optimization in Eq. 4 with the training samples, this work statistically estimates the optimization through multivariate statistical analysis and develops the estimated statistical loss for hyperspectral image classification.
Generally, in the training of CNNs, training batches are constructed to estimate the CNN model accurately. A training batch consists of a set of randomly selected training samples, which enables parallelization of the training process. Obviously, a training batch can be viewed as a sampling from the class distributions of the hyperspectral image.
Given a training batch B, denote f_i as the feature of sample x_i extracted from the deep model, and let B_k represent the samples of the k-th class in the batch. Then F_k = {f_i : x_i ∈ B_k} denotes the extracted features of the k-th class, where n_k is the number of samples in the class. The features in F_k follow the class distribution N(μ_k, Σ_k).
The unbiased estimate of the distribution mean of the k-th class in B can be calculated as

μ̂_k = (1/n_k) Σ_{f_i ∈ F_k} f_i.
Define the scatter matrix of the k-th class as

A_k = Σ_{f_i ∈ F_k} (f_i - μ̂_k)(f_i - μ̂_k)^T.

Then, for the k-th class, the unbiased estimate of the covariance matrix can be formulated as

Σ̂_k = A_k / (n_k - 1).
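These estimators are straightforward to compute; the sketch below builds them from synthetic features (hypothetical sizes n_k = 25, d = 6) and cross-checks the unbiased covariance estimate A_k / (n_k - 1) against NumPy's `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(1)
F_k = rng.normal(size=(25, 6))        # hypothetical features of one class

mu_hat = F_k.mean(axis=0)             # unbiased estimate of the class mean
centered = F_k - mu_hat
A_k = centered.T @ centered           # scatter matrix of the class
Sigma_hat = A_k / (F_k.shape[0] - 1)  # unbiased covariance estimate

# Estimate of tr(Sigma_k): the quantity penalized by the intra-class term
trace_term = np.trace(A_k) / (F_k.shape[0] - 1)
```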
We use Σ̂_k to estimate the covariance matrix Σ_k; then tr(Σ_k) is estimated by tr(A_k) / (n_k - 1). Besides, μ_k can be estimated by μ̂_k. Therefore, Σ_{k=1}^{C} tr(Σ_k) is estimated by Σ_{k=1}^{C} tr(A_k) / (n_k - 1).
Consider the i-th and the j-th classes. The i-th class follows the multivariate normal distribution N(μ_i, Σ_i) and the j-th class follows N(μ_j, Σ_j). Obviously, the two class distributions are independent of each other. This work uses statistical hypothesis testing to estimate the condition ||μ_i - μ_j||_2^2 ≥ τ.
To estimate this condition, two famous multivariate distributions, namely the Wishart distribution and the Hotelling T² distribution, are necessary.
The Wishart distribution plays a prominent role in the analysis of estimated covariance matrices. Assume x_1, x_2, …, x_n are independent and follow the same d-dimensional multivariate normal distribution N(0, Σ). Denote W = Σ_{i=1}^{n} x_i x_i^T. Then the random matrix W follows the d-dimensional Wishart distribution with n degrees of freedom, which can be written as W ~ W_d(n, Σ).

It should be noted that the Wishart distribution satisfies the following additivity property: if the statistics W_1 ~ W_d(n_1, Σ) and W_2 ~ W_d(n_2, Σ) are independent of each other, then W_1 + W_2 ~ W_d(n_1 + n_2, Σ).
The Hotelling T² distribution is essential to hypothesis testing in multivariate statistical analysis. Suppose that x ~ N(0, Σ) is independent of W ~ W_d(n, Σ). Then the statistic T² = n x^T W^{-1} x is defined to follow the Hotelling T² distribution with n degrees of freedom, which can be written as T² ~ T²(d, n).
It should be noted that the Wishart and Hotelling T² distributions are fixed once their degrees of freedom are given. The two distributions play an important role in the following estimation.
Traditionally, a statistical hypothesis is an assertion or conjecture concerning one or more populations. The rejection of a hypothesis implies that the sample evidence refutes it; that is, rejection means there is only a small probability of obtaining the observed sample information when the hypothesis is in fact true. The structure of hypothesis testing is formulated with a null hypothesis H_0 and an alternative hypothesis H_1. Generally, the rejection of H_0 leads to the acceptance of H_1.
For simplicity, this work sets the τ in Eq. 4 to 0. Therefore, from the hypothesis-testing view, we may re-state the condition as the following two competing hypotheses: H_0: μ_i = μ_j versus H_1: μ_i ≠ μ_j.
The scatter matrices A_i and A_j of the i-th and the j-th classes are defined as in Eq. 6. From the definition of the Wishart distribution, it can be noted that A_k ~ W_d(n_k - 1, Σ_k).
Since all the samples of the different classes come from the same hyperspectral image, just as assumed in many hyperspectral target recognition tasks, different class distributions are supposed to share the same covariance matrix, namely Σ_i = Σ_j = Σ. Therefore, based on the additivity property of the Wishart distribution in Eq. 11, the statistic A = A_i + A_j follows

A ~ W_d(n_i + n_j - 2, Σ).
Moreover, from the definition of the multivariate normal distribution, we can find that under H_0 the statistic defined as

u = sqrt( n_i n_j / (n_i + n_j) ) (μ̂_i - μ̂_j)

follows the multivariate normal distribution u ~ N(0, Σ).
Furthermore, denote the statistic

T² = (n_i + n_j - 2) u^T A^{-1} u.

Then, according to the definition of the Hotelling T² distribution, the statistic T² in Eq. 19 follows

T² ~ T²(d, n_i + n_j - 2).
Therefore, at the 1 - α level of confidence, if T² ≤ T²_α(d, n_i + n_j - 2), we accept the null hypothesis H_0 and reject the alternative hypothesis H_1; otherwise, if T² > T²_α(d, n_i + n_j - 2), we accept the alternative hypothesis H_1 and reject the null hypothesis H_0.
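The whole two-sample test can be sketched end-to-end with NumPy and SciPy, using the standard relation between Hotelling's T² and the F distribution to obtain the decision; the feature dimension, sample sizes, and mean shift below are hypothetical values, not the paper's data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
d, ni, nj = 3, 30, 35
Fi = rng.normal(0.0, 1.0, size=(ni, d))   # hypothetical features of class i
Fj = rng.normal(1.5, 1.0, size=(nj, d))   # class j, with a shifted mean

def scatter(F):
    c = F - F.mean(axis=0)
    return c.T @ c

A = scatter(Fi) + scatter(Fj)             # pooled scatter ~ W_d(ni + nj - 2, Sigma)
u = np.sqrt(ni * nj / (ni + nj)) * (Fi.mean(axis=0) - Fj.mean(axis=0))
n = ni + nj - 2
T2 = n * u @ np.linalg.solve(A, u)        # Hotelling T^2 statistic

# Under H0, (n - d + 1) / (n * d) * T^2 follows F(d, n - d + 1)
F_stat = (n - d + 1) / (n * d) * T2
p_value = stats.f.sf(F_stat, d, n - d + 1)
reject_H0 = p_value < 0.05                # accept H1: the class means differ
```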
Since the alternative hypothesis H_1 is what we seek, the condition can be transformed into the requirement that the statistic T² exceed the critical value T²_α(d, n_i + n_j - 2) for every pair of different classes i and j.
By the Lagrange multiplier method, the statistical loss for the hyperspectral image can be formulated from Eq. 22 by combining the estimated intra-class term with the pairwise T² condition, where λ is the tradeoff parameter.
Besides, T²_α is a constant value that is irrelevant to the training samples; therefore, we set it as a constant positive value, and Eq. 23 can be re-formulated accordingly. Eq. 24 defines the statistical loss for deep learning in this work. Fig. 2 shows the detailed process of formulating the statistical loss. Under the statistical loss, the learned model can be more discriminative for the hyperspectral image.
Generally, the deep model is trained with the stochastic gradient descent method, and back propagation is used in the training process. Therefore, the main problem in implementing the developed statistical loss for hyperspectral image classification is to compute the derivative of the statistical loss w.r.t. the features extracted from the training samples.
As defined in Section III-C, the statistical loss takes the form of Eq. 24. According to the chain rule, the gradients of the statistical loss w.r.t. the extracted features can be formulated as the composition of the partial derivatives given below.
The partial derivative of the loss w.r.t. each class term can be easily computed, where f_i is the learned feature of training sample x_i from the CNN model and 1(·) denotes the indicator function.
Besides, the partial derivative of the loss w.r.t. the estimated class statistics (the class means and scatter matrices) can be calculated by the chain rule. Therefore, the key step is to compute the derivatives of these statistics w.r.t. the extracted features, which admit closed forms in which I represents the identity matrix.
Through back propagation with the preceding equations, the CNN model can be trained with the training samples and discriminative features can be learned from the hyperspectral image. The detailed training process of the developed method is shown in Algorithm 1. It should also be noted that the whole CNN is trained under the joint supervisory signals of the softmax loss and our statistical loss.
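For the intra-class trace term, the gradient w.r.t. each feature has a simple closed form: because the per-class centered features sum to zero, the derivative of Σ_j ||f_j - μ̂_k||² w.r.t. f_i reduces to 2(f_i - μ̂_k). The sketch below (synthetic features, not the authors' Caffe code) verifies this with a finite-difference check.

```python
import numpy as np

rng = np.random.default_rng(3)
F = rng.normal(size=(12, 5))   # hypothetical features of one class

def intra_loss(F):
    """Trace of the scatter matrix: sum of squared distances to the class mean."""
    centered = F - F.mean(axis=0)
    return np.trace(centered.T @ centered)

# Analytic gradient: 2 * (f_i - mean); the mean's dependence on f_i cancels
grad_analytic = 2 * (F - F.mean(axis=0))

# Finite-difference check of a single entry
eps = 1e-6
F_pert = F.copy()
F_pert[0, 0] += eps
grad_fd = (intra_loss(F_pert) - intra_loss(F)) / eps
```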
To further validate the effectiveness of the developed statistical loss, this work conducts experiments over two real-world hyperspectral image data sets, namely Pavia University and Indian Pines (more results can be found in the attachment). We also compare the experimental results with other state-of-the-art methods, including the most recent samples-based losses, to show the advantage of the proposed method. In addition, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient are chosen as the measurements to evaluate the classification performance. All the results in this work are the average value and standard deviation of ten runs of training and testing. For each of the ten experiments, the training and testing sets are randomly selected.
The Pavia University data were gathered by the Reflective Optics System Imaging Spectrometer (ROSIS-3) sensor with a spatial resolution of 1.3 m per pixel. The image consists of 610 × 340 pixels, of which a total of 42,776 labelled samples divided into nine classes have been chosen for the experiments. Each pixel denotes a sample and consists of 115 bands with a spectral coverage ranging from 0.43 to 0.86 μm. Twelve spectral bands are discarded due to noise, and the remaining 103 channels are used for the experiments.
The Indian Pines data were collected by the 224-band AVIRIS sensor, ranging from 0.4 to 2.5 μm, over the Indian Pines test site in north-western Indiana. The image consists of 145 × 145 pixels, and the corrected Indian Pines data retain 200 bands after 24 bands covering the region of water absorption are removed. Sixteen land-cover classes with a total of 10,249 labelled samples are selected from the data for the experiments.
Caffe is chosen as the deep learning framework to implement the proposed method. Since this work mainly tests the effectiveness of the developed statistical loss, we use the CNN model shown in Fig. 3 for all the experiments in this work. The learning rate, number of training iterations, and training batch size are set to 0.001, 60,000, and 84, respectively. As Fig. 3 shows, this work uses spatial neighborhoods to incorporate the spatial information. In the experiments, we choose 200 samples per class for training and the remainder for testing over Pavia University, while over the Indian Pines data we select 20% of the samples per class for training. The code for the implementation of the proposed method will be released soon at http://github.com/shendu-sw/statistical-loss.
First, we present a brief overview of the merits of the developed statistical loss for hyperspectral image classification. In this set of experiments, the diversity weight is fixed at 0.01. A general machine with a 4.00 GHz Intel Core i7-6700K CPU, 64 GB of memory, and a GeForce GTX 1080 GPU is used to run the proposed method. The proposed method implemented in Caffe took about 1146 s over Pavia University and 1610 s over the Indian Pines data. It should be noted that this work implements the developed statistical loss on the CPU, and the computational performance could be remarkably improved by modifying the code to run the developed method on GPUs.
Tables I and II show the classification results over the two datasets, respectively. For the Pavia University data, C1, C2, …, C9 represent asphalt, meadows, gravel, trees, metal sheets, bare soil, bitumen, bricks, and shadows, respectively. For the Indian Pines data, C1, C2, …, C16 stand for alfalfa, corn-no-till, corn-min-till, corn, grass-pasture, grass-trees, grass-pasture-mowed, hay-windrowed, oats, soybeans-no-till, soybeans-min-till, soybeans-clean, wheat, woods, buildings-grass-trees-drives, and stone-steel-towers, respectively. It can be noted that the developed method obtains a better performance than SVM. More importantly, the CNN learned with the statistical loss achieves an accuracy of 99.51% ± 0.09% over Pavia University, which is much higher than that with the general softmax loss (98.61% ± 0.35%). Besides, for Indian Pines, the proposed method decreases the error rate by 47.42% when compared with the general softmax loss. The statistical loss takes advantage of the statistical properties of the hyperspectral image and embeds the information of the class distributions in the deep learning process. Thus, the learned deep model can better represent the hyperspectral image and further provide better classification performance.
Furthermore, we use McNemar's test, which is based upon the standardized normal test statistic, as the statistical analysis method to demonstrate whether the developed statistical loss improves the classification performance in the statistical sense. The statistic is computed by

Z_{ij} = (f_{ij} - f_{ji}) / sqrt(f_{ij} + f_{ji}),

where Z_{ij} measures the pairwise statistical significance of the difference between the accuracies of the i-th and j-th methods, and f_{ij} denotes the number of samples classified correctly by the i-th method but wrongly by the j-th method. At the 95% level of confidence, the difference in accuracy between two methods is statistically significant if |Z_{ij}| > 1.96.
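The statistic is simple to compute from the two disagreement counts; the counts below are hypothetical, purely to illustrate the decision rule.

```python
import math

def mcnemar_z(f_ij, f_ji):
    """Standardized McNemar statistic: f_ij counts samples classified
    correctly by method i but wrongly by method j; f_ji the reverse."""
    return (f_ij - f_ji) / math.sqrt(f_ij + f_ji)

# Hypothetical disagreement counts between two classifiers
z = mcnemar_z(120, 60)
significant = abs(z) > 1.96   # statistically significant at 95% confidence
```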
The former subsection has demonstrated the effectiveness of the developed statistical loss for the hyperspectral image under the experimental setups given in Section V-A. This subsection further evaluates the performance of the developed method under different numbers of training samples. For the Pavia University data, we vary the number of training samples per class over a set ranging from 10 to 200. For the Indian Pines data, we select 1%, 2%, 5%, 10%, and 20% of the samples per class for training. In these experiments, the diversity weight is set to 0.01. Fig. 4 presents the classification performance of the developed method with different numbers of training samples over the two data sets. Furthermore, Fig. 5 presents the McNemar's test values, with different numbers of training samples, between the CNNs trained with the general softmax loss and with the statistical loss. Inspecting the tendencies in Figs. 4 and 5, we can note the following.
Firstly, the accuracies obtained by the CNN with the proposed method are remarkably improved when compared with the CNN trained with the general softmax loss only. From Fig. 5, we can find that all the improvements by the developed method are statistically significant. In particular, the accuracy increases from 74.62% to 86.97% with 10 training samples per class over Pavia University, and from 67.50% to 79.60% with 1% of training samples per class over Indian Pines. Secondly, the classification performance of the learned model improves significantly as the number of training samples increases. Finally, it can be noted that the developed statistical loss yields a definite improvement of the learned model under a limited number of training samples. As shown in Fig. 5, the value of McNemar's test increases significantly as the number of training samples decreases. The Z value can even reach 59.74 with 10 training samples per class over Pavia University and 28.03 with 1% of training samples per class over Indian Pines. The statistical loss is constructed with the class distributions, not directly with the samples. Therefore, even with limited training samples, the statistical loss can learn more class information from the class distributions and provide a marked improvement in classification performance. This indicates that the proposed method provides another way to train an effective CNN model with limited training samples.
Furthermore, we show the classification maps from the different methods, obtained with 200 training samples per class over the Pavia University data and 20% of training samples per class over Indian Pines, in Figs. 6 and 7, respectively. Comparing the corresponding panels of Figs. 6 and 7, we can find that, with the statistical loss, the classification errors are remarkably decreased over both datasets. Besides, it can be noted that the developed method learns a model that fits the image better than general handcrafted features.
As mentioned in Section III-C, λ represents the tradeoff parameter between the optimization term and the diversity term. The value of λ can also affect the performance of the developed statistical loss. In this set of experiments, we evaluate the performance of the proposed method with different values of λ. Fig. 8 shows the classification performance with different λ over the Pavia University and Indian Pines data, respectively.

We can find that the statistical loss provides a better performance with a larger λ. However, an excessively large λ has negative effects on the performance of the statistical loss. Generally, increasing λ encourages different class distributions to repulse each other, and therefore the learned features can be more discriminative in separating different objects. However, an excessively large λ focuses too much attention on the diversity among different classes while ignoring the variance of each class distribution. This could increase the intra-class variance of each class and harm the classification performance. More importantly, from Fig. 8 it can be noted that the proposed method performs the best (99.51%) when λ is set to 0.01 over the Pavia University data, while for the Indian Pines data the accuracy reaches 99.49% at the best-performing λ. In practice, cross-validation can be used to select a proper λ that satisfies the specific requirements of the developed statistical loss over different datasets.
This work also compares the developed statistical loss with other recent samples-based losses. We select the center loss and the structured loss as benchmarks that characterize the pairwise correlation between training samples. Table III shows the comparison results over the Pavia University and Indian Pines data.
From the table, we can find that the developed statistical loss, which formulates the penalization with the class distributions, fits the classification task better than the center loss and the structured loss. Using 200 samples per class for training over the Pavia University data, the statistical loss achieves an OA that outperforms those of the center loss and the structured loss, while for the Indian Pines data it obtains a higher OA with 20% training samples than both the center loss and the structured loss. Moreover, the Z value of McNemar's test reaches 5.52 and 5.68 when compared with the center loss and the structured loss over Pavia University, and 2.64 and 3.56 over Indian Pines. This means that, by McNemar's test, the improvement of the developed statistical loss over these samples-based losses is statistically significant.
Besides, comparing the statistical loss with these samples-based losses under a limited number of training samples, we can also find that the deep model obtains a significant improvement with the developed method. The reason is that the statistical loss is constructed with the class distributions and can use more class information in the training process, while the samples-based losses are constructed directly with the training samples. In conclusion, the developed statistical loss, which is formulated with the class distributions, achieves superior performance when compared with other samples-based losses for hyperspectral image classification.
The classification maps from the CNN models learned with the center loss and the structured loss over the two datasets are also shown in Figs. 6 and 7, respectively. Comparing the corresponding panels, it can easily be found that the CNN model with the statistical loss models the hyperspectral image better than that with the center loss, and that the statistical loss significantly decreases the classification errors compared with the structured loss.
McNemar's Z values between the statistical loss and the softmax loss over Pavia University (PU):

| Data | Training samples | Compared loss | Z |
| PU | 10 per class | Softmax Loss | 59.74 |
| PU | 20 per class | Softmax Loss | 43.92 |
| PU | 200 per class | Softmax Loss | 15.50 |
To further validate the effectiveness of the developed statistical loss for hyperspectral image classification, we compare it with state-of-the-art methods. Tables III and IV show the comparisons over the two datasets, respectively. The experimental results in each table are obtained under the same experimental setups, and we directly quote the results from the literature in which each method was first developed.
(The compared methods include the Contextual DCNN and the ML-based Spec-Spat method; see Tables III and IV for the full comparison.)