Multiscale Principle of Relevant Information for Hyperspectral Image Classification

07/13/2019 · Yantao Wei et al. · Central China Normal University and University of Florida

This paper proposes a novel architecture, termed multiscale principle of relevant information (MPRI), to learn discriminative spectral-spatial features for hyperspectral image (HSI) classification. MPRI inherits the merits of the principle of relevant information (PRI) to effectively extract multiscale information embedded in the given data, and also takes advantage of the multilayer structure to learn representations in a coarse-to-fine manner. Specifically, MPRI performs spectral-spatial pixel characterization (using PRI) and feature dimensionality reduction (using regularized linear discriminant analysis) iteratively and successively. Extensive experiments on four benchmark data sets demonstrate that MPRI outperforms existing state-of-the-art HSI classification methods (including deep learning based ones) qualitatively and quantitatively, especially in the scenario of limited training samples.


I Introduction

With the rapid development of hyperspectral imaging techniques, current sensors often have high spectral and spatial resolution [1]. For example, the ROSIS sensor offers a spectral resolution finer than 4 nm and a spatial resolution of 1.3 m per pixel [2, 3]. The increased spectral and spatial resolution enables us to accurately discriminate diverse materials of interest. As a result, hyperspectral images (HSIs) have been widely used in many practical applications, such as precision agriculture, environmental management, mining and mineralogy [1]. Among these applications, HSI classification, which aims to assign each pixel of an HSI a unique class label, has attracted increasing attention in recent years. However, the unfortunate combination of high-dimensional spectral features and limited ground truth samples, as well as varying atmospheric scattering conditions, makes HSI data inherently highly nonlinear and difficult to categorize [4].

Early HSI classification methods straightforwardly apply conventional dimensionality reduction techniques, such as principal component analysis (PCA) [5] and linear discriminant analysis (LDA) [6], in the spectral domain to learn discriminative spectral features. Although these methods are conceptually simple and easy to implement, they neglect the spatial information, a complement to spectral behavior that has been demonstrated effective in augmenting HSI classification performance [1, 7]. To address this limitation, Chen et al. [8] proposed the joint sparse representation (JSR) to incorporate spatial neighborhood information of pixels. Soltani-Farani et al. [9] designed spatial aware dictionary learning (SADL), a structured dictionary learning model that incorporates both spectral and spatial information. Kang et al. suggested using an edge-preserving filter (EPF) to improve the spatial structure of HSI [10] and also introduced PCA to encourage the separability of the new representations [11]. A similar idea appears in Pan et al. [12], in which the EPF is substituted with a hierarchical guidance filter. Although these methods perform well, the discriminative power of their extracted spatial-spectral features is far from satisfactory when tested on challenging land covers.

A recent trend is to use deep neural networks (DNNs), such as autoencoders (AEs) [13] and convolutional neural networks (CNNs) [14], to learn discriminative spectral-spatial features [15]. Although deep features usually demonstrate superior discriminative power over hand-crafted features in different computer vision or image processing tasks, existing DNN based HSI classification methods either improve the performance only marginally or require significantly more labeled data [16]. On the other hand, collecting labeled data is always difficult and expensive in the remote sensing community [3]. Admittedly, transfer learning has the potential to alleviate the problem of limited labeled data, but it remains an open problem to establish reliable relevance between the target domain and the source domain, due to the large variations between HSIs obtained by different sensors with unmatched imaging bands and resolutions [17].

Different from previous work, this paper presents a novel architecture, termed multiscale principle of relevant information (MPRI), to learn discriminative spectral-spatial features for HSI classification. MPRI inherits the merits of the principle of relevant information (PRI) [18, Chapter 8] [19, Chapter 3] to effectively extract multiscale information from the given data, and also takes advantage of the multilayer structure to learn representations in a coarse-to-fine manner. To summarize, the major contributions of this work are threefold.

  • We demonstrate the capability of PRI, which originates from information theoretic learning (ITL) [18], to characterize 3D pictorial structures in HSI data.

  • We generalize PRI into a multilayer structure to extract hierarchical representations for HSI classification. A multiscale scheme is also incorporated to model both local and global structures.

  • MPRI outperforms state-of-the-art HSI classification methods based on classical machine learning models (e.g., PCA-EPF [11] and HIFI [12]) by a large margin. Using significantly fewer labeled samples, MPRI also achieves almost the same classification accuracy as existing deep learning techniques (e.g., SAE-LR [20] and 3D-CNN [21]).

The remainder of this paper is organized as follows. Section II reviews the basic objective of PRI and formulates PRI under the ITL framework. The architecture and optimization of our proposed MPRI are elaborated in Section III. Section IV shows experimental results on four popular HSI data sets. Finally, Section V draws the conclusion.

II Elements of Rényi’s α-entropy and the principle of relevant information

Before presenting our method, we start with a brief review of the general idea and the objective of PRI, and then formulate this objective under the ITL framework.

II-A PRI: the general idea and its objective

Suppose we are given a random variable $X$ with a known probability density function (PDF) $g$, from which we want to learn a reduced statistical representation characterized by a random variable $Y$ with PDF $f$. The PRI [18, Chapter 8] [19, Chapter 3] casts this problem as a trade-off between the entropy $H(Y)$ of $Y$ and its descriptive power about $X$ in terms of their divergence $D(f\|g)$. Therefore, for a fixed PDF $g$, the objective of PRI is given by:

$$\min_{f}\; H(Y) + \beta D(f\|g), \qquad (1)$$

where $\beta$ is a hyper-parameter controlling the amount of relevant information that $Y$ can extract from $X$. Note that the minimization of entropy can be viewed as a means of finding the statistical regularities in the outcomes of a process, whereas the minimization of an information theoretic divergence, such as the Kullback-Leibler divergence [22] or the Chernoff divergence [23], ensures that such regularities are closely related to $X$. The PRI is similar in spirit to the Information Bottleneck (IB) method [24], but the formulation is different because PRI does not require an observed relevant (or auxiliary) variable, and the optimization is done directly on the random variable $Y$, which provides a set of solutions related to the principal curves [25] of $g$, as will be demonstrated below.

II-B Formulation of PRI using Rényi’s entropy functional

In information theory, a natural extension of the well-known Shannon’s entropy is Rényi’s $\alpha$-entropy [26]. For a random variable $X$ with PDF $f(x)$ in a finite set $\mathcal{X}$, the $\alpha$-entropy $H_{\alpha}(X)$ of $X$ is defined as:

$$H_{\alpha}(X) = \frac{1}{1-\alpha}\log\int_{\mathcal{X}} f^{\alpha}(x)\,dx. \qquad (2)$$

On the other hand, motivated by the famed Cauchy-Schwarz (CS) inequality:

$$\Big(\int f(x)\,g(x)\,dx\Big)^{2} \leq \int f^{2}(x)\,dx \int g^{2}(x)\,dx, \qquad (3)$$

with equality if and only if $f$ and $g$ are linearly dependent, a measure of the “distance” between two PDFs can be defined, which was named the CS divergence [27]:

$$D_{CS}(f\|g) = -\log\frac{\big(\int f(x)\,g(x)\,dx\big)^{2}}{\int f^{2}(x)\,dx\int g^{2}(x)\,dx} = 2H_{2}(f;g) - H_{2}(f) - H_{2}(g), \qquad (4)$$

where the term $H_{2}(f;g) = -\log\int f(x)\,g(x)\,dx$ is also called the quadratic cross entropy [18], and $H_{2}(\cdot)$ denotes Rényi’s quadratic ($\alpha = 2$) entropy.

Combining Eqs. (2) and (4), the PRI under the 2nd-order Rényi’s entropy can be formulated as:

$$\min_{f}\; H_{2}(Y) + \beta D_{CS}(f\|g) = \min_{f}\;(1-\beta)H_{2}(Y) + 2\beta H_{2}(f;g), \qquad (5)$$

where the second equality holds because the extra term $-\beta H_{2}(g)$ is a constant with respect to $f$.

Given samples $\{x_i\}_{i=1}^{N}$ and $\{y_i\}_{i=1}^{N}$, both in $\mathbb{R}^{d}$, drawn i.i.d. from $g$ and $f$, respectively, and using the Parzen-window density estimator [28] with Gaussian kernel $G_{\sigma}(\cdot) = \exp\big(-\frac{\|\cdot\|^{2}}{2\sigma^{2}}\big)$, Eq. (5) can be simplified as [19]:

$$\min_{\{y_i\}}\; -(1-\beta)\log\Big(\frac{1}{N^{2}}\sum_{i,j=1}^{N} G_{\sigma}(y_i - y_j)\Big) - 2\beta\log\Big(\frac{1}{N^{2}}\sum_{i,j=1}^{N} G_{\sigma}(y_i - x_j)\Big). \qquad (6)$$

It turns out that the value of $\beta$ defines various levels of information reduction, ranging from the data mean value ($\beta = 0$) and clustering ($\beta = 1$), through principal curves [25] extraction at different dimensions, to vector quantization that recovers the initial data as $\beta \to \infty$ [18, 19]. Hence, the PRI achieves effects similar to a moment decomposition of the PDF, controlled by a single parameter $\beta$, using a data driven optimization approach. See Fig. 1 for an example. Note that, despite its strategic flexibility in finding reduced structure of given data, the possible applications of PRI are mostly unknown to practitioners [18].
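To make the estimator in Eq. (6) concrete, below is a minimal NumPy sketch of the empirical PRI objective; the helper names (`gaussian_gram`, `pri_objective`) are our own illustrative choices rather than part of any released implementation.

```python
import numpy as np

def gaussian_gram(a, b, sigma):
    """Pairwise Gaussian kernel values G_sigma(a_i - b_j) for row-sample arrays."""
    d2 = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def pri_objective(y, x, beta, sigma):
    """Empirical PRI objective of Eq. (6).

    v_y  : information potential of Y, (1/N^2) * sum_ij G_sigma(y_i - y_j);
           its negative log estimates H_2(Y).
    v_yx : cross information potential between Y and X; its negative log
           estimates the quadratic cross entropy H_2(f; g).
    """
    v_y = gaussian_gram(y, y, sigma).mean()
    v_yx = gaussian_gram(y, x, sigma).mean()
    return -(1.0 - beta) * np.log(v_y) - 2.0 * beta * np.log(v_yx)

# Toy usage on synthetic data, initializing Y at the observed samples X.
x = np.random.default_rng(0).normal(size=(50, 3))
print(pri_objective(x.copy(), x, beta=2.0, sigma=0.5))
```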

Fig. 1: Illustration of the structures revealed by the PRI for (a) the Intersect data set. As the value of $\beta$ increases, the solution passes through (b) a single point, (c) modes, (d) and (e) principal curves at different dimensions, and in the extreme case of (f) we get back the data themselves as the solution.

III Multiscale principle of relevant information (MPRI) for HSI classification

In this section, we present MPRI for HSI classification. MPRI stacks multiple spectral-spatial feature learning units, each of which consists of multiscale PRI and a regularized LDA [29]. The architecture of MPRI is shown in Fig. 2.

Fig. 2: The architecture of the multiscale principle of relevant information (MPRI) for HSI classification. The spectral-spatial feature learning unit is marked with a red dashed rectangle. The spectral-spatial features are extracted by performing PRI (at multiple scales) and LDA iteratively and successively on the HSI data cube (after normalization). Finally, features from each unit are concatenated and fed into a k-nearest neighbors (KNN) classifier to predict pixel labels. This plot only demonstrates a 3-layer MPRI, but the number of layers can be increased or decreased flexibly.

To the best of our knowledge, apart from performing band selection (e.g., [30]) or measuring spectral variability (e.g., [31]), information theoretic principles have seldom been investigated for learning discriminative spectral-spatial features for HSI classification. The most similar work to ours is [32], in which the authors use the criterion of minimum redundancy maximum relevance (MRMR) [33] to extract linear features. However, owing to the poor approximation used to estimate multivariate mutual information, the performance of [32] is only slightly better than that of basic linear discriminant analysis (LDA) [6].

III-A Spectral-Spatial Feature Learning Unit

Let $\mathbf{Z} \in \mathbb{R}^{M \times N \times B}$ be the raw 3D HSI data cube, where $M$ and $N$ are the spatial dimensions and $B$ is the number of spectral bands. For a target spectral vector $z_{ij} \in \mathbb{R}^{B}$, we extract a local cube (denoted $\mathbf{X}$) from $\mathbf{Z}$ using a sliding window of width $w$ centered at $z_{ij}$, i.e., $\mathbf{X} = \{z_{pq}\}$ with $|p - i| \leq \lfloor w/2 \rfloor$ and $|q - j| \leq \lfloor w/2 \rfloor$, where $\lfloor \cdot \rfloor$ denotes rounding to the nearest integer. We obtain the spectral-spatial characterization $\mathbf{Y} = \{y_k\}_{k=1}^{n}$ ($n = w^{2}$) from $\mathbf{X}$ using PRI via the following objective:

$$\min_{\mathbf{Y}}\; -(1-\beta)\log\Big(\frac{1}{n^{2}}\sum_{i,j=1}^{n} G_{\sigma}(y_i - y_j)\Big) - 2\beta\log\Big(\frac{1}{n^{2}}\sum_{i,j=1}^{n} G_{\sigma}(y_i - x_j)\Big), \qquad (7)$$

and the new representation for $z_{ij}$ is exactly the vector in the center of $\mathbf{Y}$.

Denote $V(\mathbf{Y}) = \frac{1}{n^{2}}\sum_{i,j=1}^{n} G_{\sigma}(y_i - y_j)$ and $V(\mathbf{Y};\mathbf{X}) = \frac{1}{n^{2}}\sum_{i,j=1}^{n} G_{\sigma}(y_i - x_j)$. Taking the derivative of Eq. (7) with respect to $y_k$ and equating it to zero, we have:

$$\frac{1-\beta}{V(\mathbf{Y})}\sum_{j=1}^{n} G_{\sigma}(y_k - y_j)(y_j - y_k) + \frac{\beta}{V(\mathbf{Y};\mathbf{X})}\sum_{j=1}^{n} G_{\sigma}(y_k - x_j)(x_j - y_k) = 0. \qquad (8)$$

Rearranging Eq. (8), we have:

$$y_k\Big[\frac{1-\beta}{V(\mathbf{Y})}\sum_{j=1}^{n} G_{\sigma}(y_k - y_j) + \frac{\beta}{V(\mathbf{Y};\mathbf{X})}\sum_{j=1}^{n} G_{\sigma}(y_k - x_j)\Big] = \frac{1-\beta}{V(\mathbf{Y})}\sum_{j=1}^{n} G_{\sigma}(y_k - y_j)\,y_j + \frac{\beta}{V(\mathbf{Y};\mathbf{X})}\sum_{j=1}^{n} G_{\sigma}(y_k - x_j)\,x_j. \qquad (9)$$

Dividing both sides of Eq. (9) by the bracketed coefficient of $y_k$, we obtain the fixed point update rule for $y_k$:

$$y_k^{(\tau+1)} = \frac{\dfrac{1-\beta}{V(\mathbf{Y}^{(\tau)})}\sum_{j=1}^{n} G_{\sigma}\big(y_k^{(\tau)} - y_j^{(\tau)}\big)\,y_j^{(\tau)} + \dfrac{\beta}{V(\mathbf{Y}^{(\tau)};\mathbf{X})}\sum_{j=1}^{n} G_{\sigma}\big(y_k^{(\tau)} - x_j\big)\,x_j}{\dfrac{1-\beta}{V(\mathbf{Y}^{(\tau)})}\sum_{j=1}^{n} G_{\sigma}\big(y_k^{(\tau)} - y_j^{(\tau)}\big) + \dfrac{\beta}{V(\mathbf{Y}^{(\tau)};\mathbf{X})}\sum_{j=1}^{n} G_{\sigma}\big(y_k^{(\tau)} - x_j\big)}, \qquad (10)$$

where $\tau$ is the iteration number.

We scan the whole 3D cube with a sliding window of width $w$ targeted at each spectral vector to get the final spectral-spatial characterization of $\mathbf{Z}$. Besides, we also introduce two modifications to increase the discriminative power of the new representation. First, different values of $w$ (3, 5, 7, 9, 11, 13 in this work) are used to model both local and global structures. Second, to reduce the redundancy of the raw features constructed by concatenating the PRI representations at multiple scales, we further perform a regularized LDA [29].
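As a concrete companion to the derivation above, here is a minimal NumPy sketch of the fixed point rule in Eq. (10) for a single local cube; it reuses `gaussian_gram` from the sketch in Section II-B, and the function name and default iteration count are our own assumptions.

```python
import numpy as np  # gaussian_gram as defined in the Section II-B sketch

def pri_fixed_point(x, beta, sigma, n_iters=10):
    """Iterate the fixed point rule of Eq. (10) on one local cube.

    x : (n, B) array holding the n = w^2 spectral vectors of the window.
    Returns Y with the same shape, initialized at Y = X.
    """
    y = x.copy()
    for _ in range(n_iters):
        g_yy = gaussian_gram(y, y, sigma)          # G_sigma(y_k - y_j)
        g_yx = gaussian_gram(y, x, sigma)          # G_sigma(y_k - x_j)
        w_y = (1.0 - beta) / g_yy.mean() * g_yy    # weights scaled by 1/V(Y)
        w_x = beta / g_yx.mean() * g_yx            # weights scaled by 1/V(Y;X)
        num = w_y @ y + w_x @ x                    # numerator of Eq. (10)
        den = w_y.sum(axis=1, keepdims=True) + w_x.sum(axis=1, keepdims=True)
        y = num / den                              # update all y_k at once
    return y
```

Consistent with Section II, setting β = 0 contracts Y toward a single point, while β = 1 cancels the self term and reduces to a mode-seeking, mean-shift-like update on the Parzen density of X.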

III-B Stacking Multiple Units

In order to characterize spectral-spatial structures in a coarse-to-fine manner, we stack multiple spectral-spatial feature learning units described in Section III-A to constitute a multilayer structure, and concatenate the representations from each layer to form the final spectral-spatial representation. We finally feed this representation into a standard k-nearest neighbors (KNN) classifier; a skeletal sketch of the resulting pipeline is given below.
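To show how the units compose into the full architecture, the following skeleton stitches the pieces together with scikit-learn; the helper `pri_scale`, the shrinkage LDA and 1-NN choices, the illustrative β values, and the reflect padding are all assumptions layered on the earlier sketches (`pri_fixed_point`, `gaussian_gram`), not the authors' implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier

def pri_scale(cube, w, beta, sigma):
    """PRI characterization of every pixel at one window width w (Eq. (10))."""
    m, n, b = cube.shape
    r, out = w // 2, np.empty_like(cube)
    pad = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    for i in range(m):
        for j in range(n):
            patch = pad[i:i + w, j:j + w].reshape(-1, b)
            out[i, j] = pri_fixed_point(patch, beta, sigma)[(w * w) // 2]  # center
    return out

def mpri_predict(cube, labels, train_mask, widths=(3, 5, 7, 9, 11, 13),
                 betas=(2.0, 3.0, 4.0), sigma=0.5, n_layers=3):
    """Stack units (multiscale PRI -> regularized LDA), then classify with KNN.

    betas and sigma are illustrative placeholders; sigma would be tuned via
    Silverman's rule (Section IV). labels/train_mask are flattened over pixels.
    """
    feats, layer_in = [], cube
    for _ in range(n_layers):
        raw = np.concatenate([pri_scale(layer_in, w, b, sigma)
                              for w in widths for b in betas], axis=-1)
        flat = raw.reshape(-1, raw.shape[-1])
        lda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")
        lda.fit(flat[train_mask], labels[train_mask])
        layer_in = lda.transform(flat).reshape(cube.shape[0], cube.shape[1], -1)
        feats.append(layer_in.reshape(flat.shape[0], -1))
    z = np.concatenate(feats, axis=1)              # concatenate all layers
    knn = KNeighborsClassifier(n_neighbors=1)
    knn.fit(z[train_mask], labels[train_mask])
    return knn.predict(z)
```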

The interpretation of DNNs as a way of creating successively better representations of the data has already been suggested and explored by many (e.g., [34]). Most recently, Shwartz-Ziv and Tishby [35] put forth an interpretation of DNNs as creating sufficient representations of the data that are increasingly minimal. To gain an intuitive understanding of the inner mechanism of our deep architecture, we plot the 2D projections (obtained with t-SNE [36]) of features learned at different layers in Fig. 3. Similar to DNNs, MPRI creates successively more faithful and separable representations in deeper layers. Moreover, the deeper features can discriminate within-class samples in different geographic regions, even though we do not manually incorporate geographic information in the training.

Fig. 3: 2D projection of features in different layers learned by MPRI on the Indian Pines data set [37]. Features of “Woods” in the 1st layer, the 2nd layer, and the 3rd layer are marked with red rectangles in (a)-(c). Similarly, features of “Grass-pasture” are marked with magenta ellipses in (d)-(f). (g) shows the locations of “Region 1” and “Region 2”. (h) shows the locations of “Region 3”, “Region 4” and “Region 5”. (i) shows the class legend.

IV Experimental Results

We conduct four groups of experiments to demonstrate the effectiveness and superiority of our proposed MPRI. Specifically, we first perform a simple test to determine a reliable range for the value of β in PRI and for the number of layers in MPRI. Then, we implement MPRI and several of its degraded variants to analyze and evaluate the component-wise contributions to the performance gain. Following this, we evaluate MPRI against state-of-the-art HSI classification methods based on classical machine learning models on benchmark data sets, using both visual and quantitative evaluations. Finally, we compare MPRI with existing DNN based methods to further reveal our advantage.

Four popular data sets, namely the Indian Pines [37], the Pavia University, the Pavia Center and the Kennedy Space Center, are selected in this work. We summarize the properties of each data set in Table I. Three metrics are used for quantitative evaluation [2]: overall accuracy (OA), average accuracy (AA) and the kappa coefficient κ. OA is computed as the percentage of correctly classified test pixels, AA is the mean of the percentages of correctly classified pixels for each class, and κ involves both omission and commission errors and gives a good representation of the overall performance of the classifier.
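For completeness, the three metrics can be computed from the confusion matrix as in the short sketch below; this is a minimal helper under our own naming, using the standard definitions of OA, AA, and Cohen's kappa.

```python
import numpy as np

def hsi_scores(y_true, y_pred, n_classes):
    """Overall accuracy (OA), average accuracy (AA) and the kappa coefficient."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    cm = np.zeros((n_classes, n_classes))
    np.add.at(cm, (y_true, y_pred), 1)                  # confusion matrix
    oa = np.trace(cm) / cm.sum()                        # fraction of correct pixels
    aa = np.mean(np.diag(cm) / cm.sum(axis=1))          # mean per-class accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / cm.sum() ** 2  # chance agreement
    kappa = (oa - pe) / (1.0 - pe)                      # OA corrected for chance
    return oa, aa, kappa
```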

Data set       Indian Pines   Pavia University   Pavia Center   Kennedy Space Center
Sensor         AVIRIS         ROSIS              ROSIS-3        AVIRIS
Spatial size   145 × 145      610 × 340          1096 × 715     512 × 614
Bands (used)   200            103                102            176
Classes        16             9                  9              13
TABLE I: Details of the data sets.

For our method, the values of the kernel width $\sigma$ in PRI were tuned around the multivariate Silverman's rule-of-thumb [38]: $\sigma = \sigma_X \big(\frac{4}{(d+2)N}\big)^{1/(d+4)}$, where $N$ is the sample size, $d$ is the variable dimensionality, and $\sigma_X$ ranges between the smallest and the largest standard deviation among the dimensions of the variable. For example, on the Indian Pines data set, this rule yields an interval of candidate kernel widths in the first layer, from which we select $\sigma$. On the other hand, the PRI in each layer is optimized with a small fixed number of fixed-point iterations, which has been observed to be sufficient to provide desirable performance.
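In code, the rule of thumb yields an interval of candidate kernel widths as below; this is a minimal sketch assuming the textbook multivariate Silverman constant $(4/((d+2)N))^{1/(d+4)}$ and per-dimension sample standard deviations.

```python
import numpy as np

def silverman_width_range(x):
    """Candidate Gaussian kernel widths for PRI from the multivariate Silverman
    rule of thumb: sigma_dim * (4 / ((d + 2) * N)) ** (1 / (d + 4)), spanning
    the smallest and largest per-dimension standard deviations."""
    n, d = x.shape
    factor = (4.0 / ((d + 2) * n)) ** (1.0 / (d + 4))
    stds = x.std(axis=0, ddof=1)
    return stds.min() * factor, stds.max() * factor

# Example: width interval for a 13 x 13 window of 30-dimensional features.
lo, hi = silverman_width_range(np.random.default_rng(1).normal(size=(169, 30)))
```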

IV-A Parameter Analysis

IV-A1 Effects of the parameter β in PRI

The parameter β in PRI balances the trade-off between the regularity of the extracted representation and its descriptive power for the given data. We illustrate the values of OA, AA, and κ for MPRI with respect to different values of β in Fig. 4(a). As can be seen, these quantitative values are initially stable, but decrease when β becomes large. Moreover, the value of AA drops more drastically than that of OA or κ. A likely interpretation is that when training samples are limited, many classes have only a few labeled samples (e.g., the minority classes Oats, Grass-pasture-mowed, and Alfalfa). An unsuitable value of β may severely degrade the classification accuracy in these classes, thereby decreasing AA first.

Fig. 4: (a) Quantitative evaluation with different values of β. (b) Quantitative evaluation with different numbers of layers.

The corresponding classification maps are shown in Fig. 5. It is obvious that the smaller the β, the smoother the results achieved by MPRI. This is because a large β encourages a small divergence between the extracted representation and the original HSI data. For example, in the scenario of β = 0, PRI clusters both spectral and spatial structures into a single point (the data mean) that has no discriminative power. By contrast, in the scenario of β → ∞, the extracted representation reverts to the HSI data itself (to minimize their divergence), such that PRI fits all noisy and irregular structures.

Fig. 5: Classification maps of MPRI with the values of β increasing from (a) to (h).

From the above analysis, extremely large and extremely small values of β are not interesting for HSI classification. Moreover, the results also suggest that a moderate β strikes a good trade-off between preserving relevant spatial information (such as edges in classification maps) and filtering out unnecessary details. Unless otherwise specified, the PRI mentioned in the following experiments uses three moderate values of β, and the final representation of PRI is formed by concatenating the representations obtained from each β.

IV-A2 Effects of the number of layers

We then illustrate the values of OA, AA and κ for MPRI with respect to different numbers of layers in Fig. 4(b). The corresponding classification maps are shown in Fig. 6. Similar to existing deep architectures, stacking more layers (within a reasonable range) increases performance. If we keep the input data size the same, more layers (beyond a certain number) no longer increase the performance, and the classification maps become over-smooth. This work uses a 3-layer MPRI because it provides favorable visual and quantitative results.

Fig. 6: Classification maps of MPRI with (a) 1 layer; (b) 2 layers; (c) 3 layers; (d) 4 layers; (e) 5 layers; (f) 6 layers; (g) 7 layers; and (h) 8 layers.

IV-B Evaluation of component-wise contributions

Before systematically evaluating the performance of MPRI, we first compare it with its degraded baseline variants to demonstrate the component-wise contributions to the performance gain. The results are summarized in Table II. As can be seen, models that only consider one attribute (i.e., multi-layer, multi-scale or multi-β) improve the performance marginally. Moreover, it is interesting to find that multi-layer and multi-scale play more significant roles than multi-β. One possible reason is that the representations learned from different β contain redundant information with respect to class labels. However, either the combination of multi-layer and multi-β or the combination of multi-scale and multi-β obtains remarkable improvements. Our MPRI performs the best, as expected. This result indicates that multi-layer, multi-scale and multi-β are all essential for the problem of HSI classification.

Multi-layer   Multi-scale   Multi-β   OA      AA      κ
–             –             –         76.25   72.74   0.729
✓             –             –         80.86   79.33   0.7814
–             ✓             –         81.28   76.25   0.786
–             –             ✓         76.61   73.09   0.733
✓             ✓             –         83.30   77.89   0.809
✓             –             ✓         86.94   85.35   0.851
–             ✓             ✓         92.80   91.16   0.918
✓             ✓             ✓         94.00   91.20   0.932
TABLE II: Quantitative evaluation of our MPRI (the last row) and its degraded baseline variants in terms of OA, AA, and κ. “✓” denotes that the model contains the corresponding module. The best two performances are marked in bold and underlined, respectively.

IV-C Comparison with methods using classical machine learning models

Having illustrated component-wise contributions of MPRI, we compare it with several state-of-the-art methods using classical machine learning models, including EPF [10], MPM-LBP [39], SADL [9], MFL [40], PCA-EPF [11], and HIFI [12].

Class EPF MPM-LBP SADL MFL PCA-EPF HIFI MPRI
1 10.00 51.73 75.77 66.92 98.76 87.33 74.44
2 74.41 80.98 83.89 80.58 88.87 88.34 92.48
3 89.35 68.46 81.57 76.41 88.15 89.00 95.68
4 79.01 44.76 64.15 26.46 87.08 81.85 72.67
5 95.15 80.86 90.12 80.76 94.78 83.97 89.98
6 73.93 97.87 97.34 91.20 92.81 96.90 96.91
7 50.00 29.20 100.0 67.20 80.30 96.30 96.67
8 86.57 98.37 99.65 96.41 99.83 98.57 100.0
9 10.00 40.53 93.16 32.11 76.45 96.84 81.05
10 77.92 75.79 85.79 82.59 90.10 88.80 92.68
11 71.76 90.25 91.55 90.50 92.65 90.19 94.90
12 74.80 79.53 66.27 70.70 86.78 87.30 87.93
13 98.02 99.47 98.55 96.28 96.40 99.30 98.40
14 91.21 96.36 96.34 96.47 97.72 98.24 98.59
15 78.84 63.17 84.78 74.11 94.11 81.83 93.89
16 87.15 50.54 90.32 73.01 89.67 98.24 92.97
OA 78.44 83.84 88.19 84.32 91.77 90.87 94.00
AA 71.76 71.74 87.45 75.11 90.90 91.44 91.20
κ 0.750 0.814 0.865 0.821 0.906 0.896 0.932
TABLE III: Classification accuracies (%) of different methods on Indian Pines data set. 2% of labeled samples per class were randomly selected for training. The best two performances are marked in bold and underlined, respectively.
Fig. 7: Classification maps on Indian Pines data set. (a) Ground truth; (b) SADL; (c) MFL; (d) PCA-EPF; (e) HIFI; (f) MPRI.

Tables III-VI summarize the quantitative evaluation results of the different methods. For each method, we report its classification accuracy in each land cover category as well as the overall OA, AA and κ values across all categories. To avoid biased evaluation, we average the results over repeated independent runs. Obviously, MPRI achieves the best or second best performance on most items. Taking the OA values in Table VI as an example, our proposed MPRI achieves a performance gain of 6.30%, 8.24%, 5.84%, 15.71%, 3.98%, and 8.25% over EPF, MPM-LBP, SADL, MFL, PCA-EPF and HIFI, respectively. These results suggest that MPRI is able to learn more discriminative spectral-spatial representations than its counterparts based on classical machine learning models.

The classification maps of the different methods on the four data sets are shown in Figs. 7-10, which further corroborate the above quantitative evaluations. The maps of EPF and MPM-LBP are omitted due to their relatively lower quantitative scores. It is easy to observe that our proposed MPRI significantly improves the region uniformity (see the small regions marked with dashed borders) and the edge preservation (see the small regions marked by solid line rectangles), two criteria that are critical for evaluating classification maps [11]. By contrast, the other methods either fail to preserve local details (such as edges) of different classes (e.g., MFL) or introduce noise in uniform regions (e.g., SADL, PCA-EPF and HIFI).

To evaluate the robustness of our method with respect to the number of training samples, we show, in Fig. 11, the OA values of different methods over a range of percentages (or numbers) of training samples per class. As expected, the more training samples, the better the classification accuracy. However, MPRI is consistently superior to its counterparts, especially when the training samples are limited.

Class EPF MPM-LBP SADL MFL PCA-EPF HIFI MPRI
1 93.05 97.38 92.66 98.05 96.56 92.37 97.75
2 95.37 99.45 98.92 99.61 98.67 98.54 99.88
3 96.87 79.02 74.65 74.35 87.43 80.40 89.52
4 99.46 90.62 93.50 89.78 98.34 81.61 94.05
5 98.09 97.52 99.38 98.45 99.44 96.39 98.87
6 98.52 93.37 94.57 95.04 99.00 90.49 96.26
7 99.54 82.95 77.26 94.12 94.29 87.67 99.02
8 87.63 91.52 77.73 93.11 92.22 89.71 91.69
9 96.88 98.94 99.14 91.54 98.36 99.92 97.18
OA 95.06 95.42 93.28 95.83 97.11 93.40 97.31
AA 96.16 92.31 89.76 92.67 96.21 90.78 96.03
κ 0.935 0.940 0.912 0.945 0.962 0.912 0.965
TABLE IV: Classification accuracies (%) of different methods on Pavia University data set. 1% of labeled samples per class were randomly selected for training. The best two performances are marked in bold and underlined, respectively.
Fig. 8: Classification maps on University of Pavia data set. (a) Ground truth; (b) SADL; (c) MFL; (d) PCA-EPF; (e) HIFI; (f) MPRI.
Class EPF MPM-LBP SADL MFL PCA-EPF HIFI MPRI
1 100.0 99.10 98.79 99.43 100.0 99.57 100.0
2 99.39 94.57 92.66 87.02 97.21 92.63 97.76
3 86.78 95.83 96.63 95.83 88.49 96.16 99.86
4 98.28 81.94 98.32 99.46 97.74 93.70 100.0
5 96.64 97.75 97.91 99.43 99.54 98.32 98.55
6 77.48 95.21 95.51 95.05 93.93 98.39 99.83
7 95.29 94.51 98.24 96.39 99.84 94.00 97.85
8 100.0 99.62 98.76 99.34 98.61 99.72 100.0
9 100.0 100.0 100.0 97.23 100.0 99.95 97.31
OA 97.08 97.85 98.08 98.01 99.01 98.48 99.57
AA 94.87 95.39 97.42 96.47 97.26 96.94 99.02
κ 0.948 0.961 0.965 0.964 0.982 0.972 0.992
TABLE V: Classification accuracies (%) of different methods on the Pavia Center data set with a fixed training and testing split. The best two performances are marked in bold and underlined, respectively.
Fig. 9: Classification maps on Pavia Center data set. (a) Ground truth; (b) SADL; (c) MFL; (d) PCA-EPF; (e) HIFI; (f) MPRI.
Class EPF MPM-LBP SADL MFL PCA-EPF HIFI MPRI
1 97.30 95.78 86.59 72.21 99.54 91.76 99.02
2 92.49 87.65 87.48 51.13 80.59 86.68 87.98
3 88.49 95.14 80.88 84.66 93.32 92.83 93.78
4 65.40 38.54 63.16 42.55 84.36 60.81 68.26
5 73.63 77.95 88.33 84.42 70.05 78.97 83.01
6 85.64 49.15 80.67 88.75 87.93 73.39 91.16
7 72.30 95.00 94.50 95.70 98.77 98.20 100.0
8 68.43 74.39 85.94 71.62 74.05 78.40 93.17
9 94.84 90.14 95.94 88.27 96.67 78.49 91.94
10 83.64 92.08 84.29 49.50 81.13 75.44 94.76
11 98.34 89.81 89.08 85.99 89.71 85.22 92.95
12 87.02 65.78 81.16 68.17 91.03 77.17 87.11
13 99.92 98.73 94.41 97.94 99.92 98.22 99.01
OA 86.50 84.56 86.96 77.09 88.82 84.55 92.80
AA 85.19 80.78 85.57 75.46 88.24 82.74 90.94
κ 0.850 0.828 0.855 0.746 0.876 0.828 0.920
TABLE VI: Classification accuracies (%) of different methods on the Kennedy Space Center data set. A fixed number of labeled samples per class was randomly selected for training. The best two performances are marked in bold and underlined, respectively.
Fig. 10: Classification maps on Kennedy Space Center data set. (a) Ground truth; (b) SADL; (c) MFL; (d) PCA-EPF; (e) HIFI; (f) MPRI.
Fig. 11: OA values of different methods with respect to different percentages (or numbers) of training samples per class on (a) Indian Pines; (b) Pavia University; and (c) Kennedy Space Center. The results on Pavia Center are omitted, because the training and testing samples are fixed.

IV-D Comparison with methods using DNNs

We finally compare MPRI with existing DNN based methods on the Indian Pines data set. These include 3D-CNN [16], R-3D-CNN [16], DBN [41], SAE [20], NSSNet [42], R-VCANet [43], CNN-MRF [44], SRCL [45], RPNet [46] and DDNSA [47]. Among them, R-VCANet, CNN-MRF, SRCL, RPNet and DDNSA are specially designed for HSI classification in the scenario of limited training samples. The quantitative assessment is reported in Table VII, where the results of all competitors come from the references indicated. As can be seen, MPRI achieves equivalent or superior performance compared with its DNN based counterparts, while using the least amount of training samples. The only exception is R-3D-CNN. However, one should note that R-3D-CNN requires 7 times more training samples than ours, which is, unfortunately, a nearly impractical condition in the remote sensing community [48]. The classification maps of two example methods (RPNet and CNN-MRF) are shown in Fig. 12. Again, MPRI improves the region uniformity and edge preservation.

Method Ratio OA AA κ
3D-CNN 70% 98.92 97.31 N/A
R-3D-CNN 70% 99.50 99.42 N/A
DBN 50% 95.95 95.45 0.954
SAE 50% 92.58 90.38 0.915
NSSNet 50% 96.08 96.40 0.955
R-VCANet 10% 97.90 97.91 0.976
CNN-MRF 10% 97.10 90.53 0.969
SRCL 10% 97.58 88.88 0.972
RPNet 10% 96.41 92.09 0.959
DDNSA 10% 98.23 98.08 0.980
MPRI 10% 98.73 97.94 0.986
TABLE VII: Classification accuracies (%) of MPRI and different DNN based methods on Indian Pines data set. “Ratio” denotes the percentage of training samples per class. The best two performances are marked in bold and underlined, respectively.
Fig. 12: Classification maps of (a) CNN-MRF, (b) RPNet, and (c) MPRI on Indian Pines data set with 10% training samples per class.

V Conclusions

This paper proposes multiscale principle of relevant information (MPRI), a multilayer, multiscale architecture, for hyperspectral image (HSI) classification. Experimental results on four benchmark data sets demonstrate that MPRI is able to learn discriminative representations from 3D spectral-spatial data with significantly fewer training samples. Moreover, MPRI enjoys an intuitive geometric interpretation, and it also promotes the region uniformity and edge preservation of classification maps.

In the future, we intend to speed up the optimization of PRI. This is because the current information theoretic learning (ITL) estimators (e.g., the estimators of $V(\mathbf{Y})$ and $V(\mathbf{Y};\mathbf{X})$ in Section III) are computationally expensive, with a cost that grows quadratically with the number of samples. This drawback is not severe in our application, since the number of samples in a local HSI data cube is significantly less than a few thousand. However, it still limits the applicability of PRI in the large-sample scenarios commonly encountered in the machine learning and signal processing communities. One possible solution is to approximate the Gram matrix with a rank-deficient decomposition [49].

References

  • [1] L. He et al., “Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 3, pp. 1579–1597, 2018.
  • [2] X. Cao et al., “Hyperspectral image classification with markov random fields and a convolutional neural network,” IEEE Trans. Image Processing, vol. 27, no. 5, pp. 2354–2367, 2018.
  • [3] A. Zare et al., “Discriminative multiple instance hyperspectral target characterization,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 10, pp. 2342–2354, 2018.
  • [4] P. Ghamisi et al., “Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 37–78, 2017.
  • [5] A. Plaza et al., “Dimensionality reduction and classification of hyperspectral image data using sequences of extended morphological transformations,” IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 466–479, 2005.
  • [6] Q. Du, “Modified fisher’s linear discriminant analysis for hyperspectral imagery,” IEEE Geosci. Remote Sens. Lett., vol. 4, no. 4, pp. 503–507, 2007.
  • [7] P. Ghamisi et al., “A survey on spectral–spatial classification techniques based on attribute profiles,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 5, pp. 2335–2353, 2015.
  • [8] Y. Chen et al., “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Trans. Geosci. Remote Sens., vol. 49, no. 10, pp. 3973–3985, 2011.
  • [9] A. Soltani-Farani et al., “Spatial-aware dictionary learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 1, pp. 527–541, 2015.
  • [10] X. Kang et al., “Spectral–spatial hyperspectral image classification with edge-preserving filtering,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2666–2677, May 2014.
  • [11] ——, “Pca-based edge-preserving features for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 12, pp. 7140–7151, 2017.
  • [12] B. Pan et al., “Hierarchical guidance filtering-based ensemble classification for hyperspectral images,” IEEE Trans. Geosci. Remote Sens., vol. 55, no. 7, pp. 4177–4189, 2017.
  • [13] X. Ma et al., “Spectral–spatial classification of hyperspectral image based on deep auto-encoder,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 9, no. 9, pp. 4073–4085, 2016.
  • [14] Y. Chen et al., “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 10, pp. 6232–6251, 2016.
  • [15] Z. Zhong et al., “Spectral–spatial residual network for hyperspectral image classification: A 3-d deep learning framework,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 2, pp. 847–858, 2018.
  • [16] X. Yang et al., “Hyperspectral image classification with deep learning models,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 9, pp. 5408–5423, 2018.
  • [17] X. Zhu et al., “Deep learning in remote sensing: a comprehensive review and list of resources,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 4, pp. 8–36, 2017.
  • [18] J. C. Principe, Information theoretic learning: Renyi’s entropy and kernel perspectives.   Springer Science & Business Media, 2010.
  • [19] S. M. Rao, “Unsupervised learning: An information theoretic framework,” Ph.D. dissertation, University of Florida, 2008.

  • [20] Y. Chen et al., “Deep learning-based classification of hyperspectral data,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 7, no. 6, pp. 2094–2107, 2014.
  • [21] Y. Li et al., “Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network,” Remote Sensing, vol. 9, no. 1, p. 67, 2017.
  • [22] S. Kullback et al., “On information and sufficiency,” The Ann. Math. Stat., vol. 22, no. 1, pp. 79–86, 1951.
  • [23] H. Chernoff et al., “A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations,” The Ann. Math. Stat., vol. 23, no. 4, pp. 493–507, 1952.
  • [24] N. Tishby et al., “The information bottleneck method,” arXiv preprint physics/0004057, 2000.
  • [25] T. Hastie et al., “Principal curves,” J. Am. Stat. Assoc., vol. 84, no. 406, pp. 502–516, 1989.
  • [26] A. Rényi, “On measures of entropy and information,” in Proc. 4th Berkeley Sympos. Math. Statist. and Prob., vol. 1, 1961, pp. 547–561.
  • [27] R. Jenssen et al., “The cauchy–schwarz divergence and parzen windowing: Connections to graph theory and mercer kernels,” J. Franklin Institute, vol. 343, no. 6, pp. 614–629, 2006.
  • [28] E. Parzen, “On estimation of a probability density function and mode,” The Ann. Math. Stat., vol. 33, no. 3, pp. 1065–1076, 1962.
  • [29] T. V. Bandos et al., “Classification of hyperspectral images with regularized linear discriminant analysis,” IEEE Trans. Geosci. Remote Sens., vol. 47, no. 3, pp. 862–873, 2009.
  • [30] S. Yu et al., “Multivariate extension of matrix-based Rényi’s α-order entropy functional,” arXiv preprint arXiv:1808.07912, 2018.
  • [31] C.-I. Chang, “An information-theoretic approach to spectral variability, similarity, and discrimination for hyperspectral image analysis,” IEEE Trans. Inform. Theory, vol. 46, no. 5, pp. 1927–1932, 2000.
  • [32] M. Kamandar et al., “Linear feature extraction for hyperspectral images based on information theoretic learning,” IEEE Geosci. Remote Sens. Lett., vol. 10, no. 4, pp. 702–706, 2013.
  • [33] H. Peng et al., “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, 2005.
  • [34] A. Achille and S. Soatto, “Information dropout: Learning optimal representations through noisy computation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 12, pp. 2897–2905, 2018.
  • [35] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural networks via information,” arXiv preprint arXiv:1703.00810, 2017.
  • [36] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, pp. 2579–2605, 2008.
  • [37] M. F. Baumgardner et al., “220 band aviris hyperspectral image data set: June 12, 1992 indian pine test site 3,” Sep 2015. [Online]. Available: https://purr.purdue.edu/publications/1947/1
  • [38] B. W. Silverman, Density estimation for statistics and data analysis.   Chapman & Hall, 1986.
  • [39] J. Li et al., “Spectral-spatial classification of hyperspectral data using loopy belief propagation and active learning,” IEEE Trans. Geosci. Remote Sens., vol. 51, no. 2, pp. 844–856, Feb. 2013.
  • [40] ——, “Multiple feature learning for hyperspectral image classification,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 3, pp. 1592–1606, 2015.
  • [41] Y. Chen et al., “Spectral-spatial classification of hyperspectral data based on deep belief network,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 8, no. 6, pp. 2381–2392, 2015.
  • [42] B. Pan et al., “Hyperspectral image classification based on nonlinear spectral–spatial network,” IEEE Geosci. Remote Sens. Lett., vol. 13, no. 12, pp. 1782–1786, 2016.
  • [43] ——, “R-vcanet: a new deep-learning-based hyperspectral image classification method,” IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 10, no. 5, pp. 1975–1986, 2017.
  • [44] X. Cao et al., “Hyperspectral image classification with markov random fields and a convolutional neural network,” IEEE Trans. Image Processing, vol. 27, no. 5, pp. 2354–2367, 2018.
  • [45] S. Hao et al., “A deep network architecture for super-resolution-aided hyperspectral image classification with classwise loss,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 8, pp. 4650–4663, 2018.
  • [46] Y. Xu et al., “Hyperspectral image classification via a random patches network,” ISPRS J. Photogramm, vol. 142, pp. 344–357, 2018.
  • [47] X. Ma et al., “Hyperspectral image classification based on deep deconvolution network with skip architecture,” IEEE Trans. Geosci. Remote Sens., vol. 56, no. 8, pp. 4781–4791, 2018.
  • [48] P. Ghamisi et al., “New frontiers in spectral-spatial hyperspectral image classification: The latest advances based on mathematical morphology, markov random fields, segmentation, sparse representation, and deep learning,” IEEE Geosci. Remote Sens. Mag., vol. 6, no. 3, pp. 10–43, 2018.
  • [49] L. G. Sánchez Giraldo et al., “An efficient rank-deficient computation of the principle of relevant information,” in IEEE ICASSP, 2011, pp. 2176–2179.