Hyperspectral Image Classification with Attention Aided CNNs

05/25/2020, by Renlong Hang et al. (University of Maryland; University of Missouri-Kansas City)

Convolutional neural networks (CNNs) have been widely used for hyperspectral image classification. As a common processing step, small cubes are first cropped from the hyperspectral image and then fed into CNNs to extract spectral and spatial features. It is well known that different spectral bands and spatial positions in the cubes have different discriminative abilities. If fully explored, this prior information can help improve the learning capacity of CNNs. Along this direction, we propose an attention aided CNN model for spectral-spatial classification of hyperspectral images. Specifically, a spectral attention sub-network and a spatial attention sub-network are proposed for spectral and spatial classification, respectively. Both of them are based on the traditional CNN model and incorporate attention modules that help the networks focus on more discriminative channels or positions. In the final classification phase, the spectral classification result and the spatial classification result are combined via an adaptively weighted summation method. To evaluate the effectiveness of the proposed model, we conduct experiments on three standard hyperspectral datasets. The experimental results show that the proposed model achieves superior performance compared to several state-of-the-art CNN-related models.


I Introduction

Similar to the semantic segmentation task for natural images, hyperspectral image classification aims at assigning one of several pre-defined categories to each pixel. An important issue for this task is how to represent each pixel effectively. Hyperspectral sensors are capable of capturing the spectral signature of each material along different spectral bands [39, 15]. This rich spectral information can be used as a feature representation for each pixel in hyperspectral images [9]. However, because of the spectral variability among materials of the same class and the spectral similarity between materials of different classes, using spectral features alone easily produces misclassified pixels, which appear as "salt-and-pepper" noise in the classification maps. In addition to spectral information, hyperspectral images also contain spatial information. Exploiting spectral and spatial features jointly can dramatically alleviate this misclassification issue, which has therefore become a hot research topic in the field of hyperspectral image classification [14, 8].

Most traditional feature extraction techniques depend on criteria pre-designed by human experts [19, 16]. For example, Hong et al. [16] were the first to treat remote sensing image classification as a cross-modality learning problem and proposed a milestone common subspace learning algorithm. However, it is difficult for such hand-crafted features to thoroughly explore the intrinsic properties of the data. In recent years, deep learning has demonstrated overwhelming superiority in numerous computer vision fields [21]. In contrast with traditional feature extraction techniques, it combines feature extraction and image classification in a unified framework and lets the data itself drive the optimization of this end-to-end model, thus yielding more robust and discriminative features. Due to its powerful feature learning ability, deep learning has naturally been employed for the classification of hyperspectral images [45, 22]. Typical deep learning models include autoencoders [4, 31, 26], recurrent neural networks (RNNs) [25, 44, 11], and convolutional neural networks (CNNs) [3, 23, 10]. The inputs of autoencoders and RNNs are vectors. Therefore, they can easily be adopted for spectral feature extraction, but lose useful information when applied to spatial feature extraction. In comparison, CNNs are able to handle both spectral and spatial features flexibly, and have thus become the most popular deep learning models for hyperspectral image classification.

According to the input information of the networks, existing CNN models can be grouped into two classes: spectral CNNs and spectral-spatial CNNs. Spectral CNNs aim at extracting spectral features for each pixel in hyperspectral images. For example, in [18], Hu et al. designed a 1-D CNN model for extracting features from the spectral information of each pixel. It mainly consists of one convolutional layer and two fully-connected layers. Since the number of training pixels is often small, the proposed CNN model is not very deep, which limits the feature representation ability of 1-D CNNs. In order to address this issue, a novel pixel-pair method was proposed in [23]. By recasting the pixel-wise learning problem as a pixel-pair counterpart, the number of training samples is significantly increased. Thus, a deeper 1-D CNN model with ten convolutional layers was successfully trained, improving the spectral classification results compared to the shallow CNN model [18]. In [35] and [36], Wu and Prasad proposed to combine a 1-D CNN and an RNN. Specifically, they fed the spectral features learned by a 1-D CNN into an RNN to further fuse the features and enhance their discriminative ability.

Different from spectral CNNs, the purpose of spectral-spatial CNNs is to extract spectral and spatial features simultaneously from hyperspectral images. An intuitive method is to adopt 3-D convolutional kernels to construct the CNN model [3, 24, 29], so that the rich spectral and spatial information can be integrated in each convolutional layer. However, these 3-D convolutional operators are often time-consuming or require large numbers of parameters. To alleviate this problem, many works have attempted to design two-branch networks, where one branch focuses on spectral feature extraction and the other on spatial feature extraction. Their results are then combined using different kinds of fusion strategies. For example, in [40] and [38], a parallel two-branch framework was proposed, where a 1-D CNN and a 2-D CNN were designed to extract spectral and spatial features, respectively. For the 2-D CNN, the inputs were constructed by extracting a few principal components [42], so the computational cost was significantly reduced compared to 3-D CNNs. In [43], a serial two-branch framework was designed. It first applies several convolutions to extract spectral features and then feeds them into several 2-D convolutions to extract spatial features.

Similar to traditional feature extraction techniques, spectral-spatial CNNs often achieve better classification performance than spectral CNNs because of the joint exploitation of spectral and spatial information. Therefore, we focus on spectral-spatial CNNs in this paper. In general, small cubes are first cropped from the hyperspectral image and then fed into spectral-spatial CNNs to extract features. However, it is well known that different spectral bands and spatial positions in the cubes have different discriminative abilities. If fully explored, this prior information will help improve the learning capacity of CNNs. Recently, attention mechanisms have been widely employed in language modeling [28, 37, 41] and computer vision tasks [32, 17, 34]. Their success mainly rests on the reasonable assumption that human vision tends to focus only on selective parts of the whole visual space when and where needed [2]. Very recently, similar ideas have been explored in the field of hyperspectral image processing [7, 13, 27]. Inspired by them, we propose an attention aided spectral-spatial CNN model for hyperspectral image classification. Our goal is to enhance the representation capacity of CNNs by using attention mechanisms, making CNNs focus on more discriminative spectral bands and spatial positions while suppressing unnecessary ones.

During the last few years, numerous attention models have been developed. In [33], Wang et al. designed a spatial attention module in which the response at each position is derived from a weighted summation of the features at all positions, implemented with several convolutions. In [17], Hu et al. proposed a channel attention module with two fully-connected layers to adaptively recalibrate channel-wise feature responses. For hyperspectral image classification, only a limited number of training samples is available, so lightweight attention modules are preferred. In [34], a convolutional layer was employed to construct a spatial attention module. Motivated by it, we also use small convolutional layers to design our spectral and spatial attention modules. Specifically, our spatial attention module mainly comprises one channel-aggregation convolution and two small convolutional layers; the channel-aggregation convolution reduces the number of channels in the 3-D feature maps to 1. Similarly, our spectral attention module mainly comprises an average pooling layer and two small convolutional layers; the average pooling layer reduces the spatial size of the 3-D feature maps to 1x1. More importantly, in both the spectral and spatial attention modules, we use an output layer to help them learn more discriminative features. Based on these two kinds of attention modules, we construct two attention sub-networks. One incorporates spectral attention modules into a 2-D CNN for spectral feature extraction and classification, while the other incorporates spatial attention modules into another 2-D CNN for spatial feature extraction and classification. Our major contributions can be summarized as follows.

  1. We propose a two-branch spectral-spatial attention network for hyperspectral image classification. Compared to existing CNNs, our model incorporates attention modules into each convolutional layer, making the CNNs focus on more discriminative channels and spatial positions while suppressing unnecessary ones. In the classification phase, the results of the two branches are fused via an adaptively weighted summation method.

  2. Considering the limited number of training samples, we propose a lightweight spectral attention module built from two convolutional operators. Before the convolutional layers, we use global average pooling to reduce the effects of spatial information. More importantly, we add an output layer to the module to aid its learning process.

  3. Similar to the spectral attention module, we also use two convolutional layers to construct the spatial attention module. Instead of using pooling operators, we adopt one convolutional layer for reducing the number of channels to 1. Also, an output layer is added to guide the learning process of the spatial module.

The rest of this paper is organized as follows. Section II presents the proposed model in detail, including the structure of CNNs, the attention modules, and the network training method. Section III describes the experimental data and results. Finally, the conclusions of this paper are summarized in Section IV.

II Methodology

II-A Framework of the Proposed Model


Fig. 1: Flowchart of the proposed model. Note that the numbers represent the size of each layer for the Houston 2018 data.

Fig. 1 shows the flowchart of the proposed model. It mainly consists of two branches: the spectral attention sub-network and the spatial attention sub-network. Different from widely used CNNs, the spectral and spatial attention sub-networks incorporate attention modules to refine the feature maps in each convolutional layer, thus enhancing the learning ability of CNNs. Specifically, for a given pixel, a small cube centered at it is first extracted. Then, the cube is fed into the spectral attention sub-network and the spatial attention sub-network simultaneously to obtain two classification results. Finally, a weighted summation method is employed to combine these two results. Assume that there are $C$ classes to discriminate, and let $\mathbf{y}_{spe}$ and $\mathbf{y}_{spa}$ represent the output results of the spectral attention sub-network and the spatial attention sub-network, respectively. The final output of the proposed model is:

$\mathbf{y} = \alpha \mathbf{y}_{spe} + \beta \mathbf{y}_{spa}$    (1)

where $\alpha$ and $\beta$ are the weighting parameters. They can be adaptively learned during the optimization process of the whole network. The $c$-th element of $\mathbf{y}$ denotes the probability that the given pixel belongs to the $c$-th category.
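To make the adaptive fusion concrete, the following is a minimal PyTorch sketch of the weighted summation in Eq. (1). The module name, the initial values of the two weights, and their parameterization as scalar nn.Parameter objects are illustrative assumptions rather than the authors' released implementation; only the form of the weighted sum follows the text.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Adaptively weighted summation of the two sub-network outputs (Eq. (1))."""

    def __init__(self):
        super().__init__()
        # Both weights are learned jointly with the rest of the network
        # during fine-tuning; 0.5 is an assumed initial value.
        self.alpha = nn.Parameter(torch.tensor(0.5))
        self.beta = nn.Parameter(torch.tensor(0.5))

    def forward(self, y_spe, y_spa):
        # y_spe, y_spa: (batch, num_classes) outputs of the spectral and
        # spatial attention sub-networks.
        return self.alpha * y_spe + self.beta * y_spa
```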

II-B The Structure of CNNs

As shown in Fig. 1, the spectral attention sub-network and the spatial attention sub-network consist of CNNs and attention modules. In this subsection, we present the basic structure of the adopted CNNs; the attention modules are described in detail in the next subsection.

For hyperspectral images, only a limited number of training samples is available, so it is difficult to train very deep CNNs. Following [3], we apply three convolutional layers to construct the spectral and spatial sub-networks. Each convolutional layer is sequentially followed by a batch normalization layer, which regularizes and accelerates the training process, and a rectified linear unit (ReLU), which learns a nonlinear representation. Before the second and the third convolutional layers, a max-pooling layer is adopted to reduce the data variance and the computational complexity. All convolutional layers share the same small kernel size, and the channel numbers from the first to the third convolutional layer are 32, 64, and 128, respectively.

For the $l$-th convolutional layer, the $j$-th feature map can be represented as:

$\mathbf{X}_j^{l} = f\big(\sum_{i} \mathbf{X}_i^{l-1} * \mathbf{W}_{ij}^{l} + b_j^{l}\big)$    (2)

where $\mathbf{X}_i^{l-1}$ is the $i$-th feature map at the $(l-1)$-th layer (for $l = 1$, $\mathbf{X}_i^{0}$ is the $i$-th spectral band of the original input cube), $\mathbf{W}_{ij}^{l}$ is the convolutional kernel, '$*$' is the convolutional operator, $b_j^{l}$ is the bias, and $f(\cdot)$ is the ReLU activation function. Note that the spatial size of $\mathbf{X}_j^{l}$ is kept the same as that of $\mathbf{X}_i^{l-1}$ via a padding operator. In Equation (2), the convolutional operator and the summation operator learn the spatial features and aggregate the spectral features, respectively.
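A minimal PyTorch sketch of this three-layer backbone is given below. The 3x3 kernel size and the global max-pooling plus fully-connected output head are assumptions (the kernel size is not stated above, and the head mirrors the output-branch design of Section II-C); the channel numbers (32, 64, 128) and the max-pooling before the second and third convolutions follow the text.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel_size=3):
    # Convolution -> batch normalization -> ReLU; padding keeps the spatial size.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class PlainCNN(nn.Module):
    """Three-layer 2-D CNN backbone shared by both sub-networks (a sketch)."""

    def __init__(self, in_bands, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_bands, 32),
            nn.MaxPool2d(2),            # pooling before the second convolution
            conv_block(32, 64),
            nn.MaxPool2d(2),            # pooling before the third convolution
            conv_block(64, 128),
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveMaxPool2d(1),    # global max pooling
            nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        # x: (batch, in_bands, H, W) cropped hyperspectral cube.
        return self.classifier(self.features(x))
```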

II-C Attention Modules

It is well known that the spectral responses in different bands may vary greatly for the same object, which means the discriminative abilities of different bands are diverse. In addition, different positions of the extracted cube carry different semantic information; for example, object edges are generally more discriminative than other positions. If such prior information can be fully explored, the learning ability of the spectral and spatial sub-networks will be improved. In this paper, we design two classes of attention modules to achieve this goal: the spectral attention module and the spatial attention module. The spectral attention module makes the spectral sub-network focus on more discriminative channels while suppressing unnecessary ones. Similarly, the spatial attention module makes the spatial sub-network pay more attention to semantically important positions.

Spectral Attention. As shown in Fig. 2(a), the spectral attention module is constructed by exploiting the inter-channel relationships of feature maps. Given an intermediate feature map $\mathbf{F}^{l} \in \mathbb{R}^{C_l \times H_l \times W_l}$, where $C_l$, $H_l$, and $W_l$ represent the channel number, the height, and the width of $\mathbf{F}^{l}$, respectively, a global average-pooling layer is first applied to squeeze its spatial dimensions. Then, two 1-D convolutional layers are employed to generate a spectral attention map $\mathbf{M}_{spe}^{l} \in \mathbb{R}^{C_l \times 1 \times 1}$, which can be formulated as follows:

$\mathbf{M}_{spe}^{l} = \sigma\big(\mathbf{W}_2^{l} * \delta(\mathbf{W}_1^{l} * \mathrm{AvgPool}(\mathbf{F}^{l}))\big)$    (3)

where $\sigma$ denotes the sigmoid function, $\delta$ is the ReLU function, $\mathrm{AvgPool}(\mathbf{F}^{l})$ represents the feature map obtained by the global average-pooling operator, and $\mathbf{W}_1^{l}$ and $\mathbf{W}_2^{l}$ are the first and the second convolution kernels, respectively. Note that padding operators are employed in each convolutional layer so that the output length equals $C_l$. Since $C_l$ increases as $l$ changes from 1 to 3, larger kernel sizes $k$ are used as $l$ increases. Specifically, $k$ is set to 3, 5, and 7 when $l$ equals 1, 2, and 3, respectively.

After acquiring $\mathbf{M}_{spe}^{l}$, we apply it to refine the original feature map as follows:

$\tilde{\mathbf{F}}^{l} = \mathbf{M}_{spe}^{l} \otimes \mathbf{F}^{l}$    (4)

where '$\otimes$' represents element-wise multiplication. During multiplication, the values in $\mathbf{M}_{spe}^{l}$ are expanded (copied) along the spatial dimension. Finally, $\tilde{\mathbf{F}}^{l}$ is adopted as the input of the $(l+1)$-th convolutional layer and of an output branch (yellow colors in Fig. 2). The output branch is comprised of a global max-pooling layer and a fully-connected layer. This branch mainly has two purposes: the first is to provide supervised information for the spectral attention module, ensuring the discriminative ability of the refined feature map; the other is to add a regularization term to the loss function (discussed in Subsection II-D), alleviating the overfitting problem during network training.
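Below is a minimal PyTorch sketch of the spectral attention module (Eqs. (3)-(4)) together with its output branch. The class and argument names are illustrative, and the single-channel Conv1d parameterization of the two 1-D convolutions along the channel axis is an assumption consistent with the lightweight design described above.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Spectral attention module with an auxiliary output branch (a sketch).

    kernel_size is 3, 5, or 7 for the first, second, and third convolutional
    layers, respectively, following the text above.
    """

    def __init__(self, channels, num_classes, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2                           # keep the output length equal to `channels`
        self.conv1 = nn.Conv1d(1, 1, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(1, 1, kernel_size, padding=pad)
        self.aux_fc = nn.Linear(channels, num_classes)   # output branch: global max pool + FC

    def forward(self, x):
        # x: (B, C, H, W) feature map from the preceding convolutional layer.
        z = x.mean(dim=(2, 3)).unsqueeze(1)              # global average pooling -> (B, 1, C)
        m = torch.sigmoid(self.conv2(torch.relu(self.conv1(z))))  # Eq. (3), (B, 1, C)
        m = m.transpose(1, 2).unsqueeze(-1)              # reshape to (B, C, 1, 1)
        refined = x * m                                  # Eq. (4): broadcast over the spatial dims
        aux = self.aux_fc(torch.amax(refined, dim=(2, 3)))  # auxiliary class scores
        return refined, aux
```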


Fig. 2: Attention modules. (a) Spectral attention module. (b) Spatial attention module.

Spatial Attention. Similar to the spectral attention module, the spatial attention module is built by taking advantage of the inter-spatial relationships of feature maps. As shown in Fig. 2(b), a convolutional layer is first used to aggregate the information of $\mathbf{F}^{l}$ along the channel direction, generating a 2-D feature map $\mathbf{S}^{l} \in \mathbb{R}^{H_l \times W_l}$. Then, two 2-D convolutional layers are applied to derive a spatial attention map $\mathbf{M}_{spa}^{l}$, which can be formulated as:

$\mathbf{M}_{spa}^{l} = \sigma\big(\mathbf{W}_4^{l} * \delta(\mathbf{W}_3^{l} * \mathbf{S}^{l})\big)$    (5)

where $\mathbf{W}_3^{l}$ and $\mathbf{W}_4^{l}$ have the same kernel size $k$. Again, padding operators are employed in each convolutional layer to avoid changing the spatial sizes. Since the spatial size decreases as $l$ increases, $k$ is set to 7, 5, and 3 when $l$ equals 1, 2, and 3, respectively. After that, $\mathbf{M}_{spa}^{l}$ can be used to recalibrate $\mathbf{F}^{l}$ as:

$\tilde{\mathbf{F}}^{l} = \mathbf{M}_{spa}^{l} \otimes \mathbf{F}^{l}$    (6)

During multiplication, the values in $\mathbf{M}_{spa}^{l}$ are expanded (copied) along the channel dimension. Finally, $\tilde{\mathbf{F}}^{l}$ is fed into the $(l+1)$-th convolutional layer and an output branch. Here, we use an adaptive max-pooling layer in the output branch, which means that the output size is fixed for inputs of any size; a different fixed output size is used when $l$ equals 1, 2, and 3.
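Analogously, a minimal PyTorch sketch of the spatial attention module (Eqs. (5)-(6)) follows. The 1x1 kernel of the channel-aggregation convolution and the pooled output size of the auxiliary branch are assumptions, since these values are not specified above; the 7/5/3 kernel sizes for the two 2-D convolutions follow the text.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention module with an auxiliary output branch (a sketch)."""

    def __init__(self, channels, num_classes, kernel_size=7, pool_size=4):
        super().__init__()
        pad = kernel_size // 2                                 # keep the spatial size unchanged
        self.squeeze = nn.Conv2d(channels, 1, kernel_size=1)   # aggregate channels into one map
        self.conv1 = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.conv2 = nn.Conv2d(1, 1, kernel_size, padding=pad)
        self.pool = nn.AdaptiveMaxPool2d(pool_size)            # fixed output size (assumed value)
        self.aux_fc = nn.Linear(channels * pool_size * pool_size, num_classes)

    def forward(self, x):
        # x: (B, C, H, W) feature map from the preceding convolutional layer.
        s = self.squeeze(x)                                       # (B, 1, H, W)
        m = torch.sigmoid(self.conv2(torch.relu(self.conv1(s))))  # Eq. (5), (B, 1, H, W)
        refined = x * m                                           # Eq. (6): broadcast over channels
        aux = self.aux_fc(torch.flatten(self.pool(refined), 1))   # auxiliary class scores
        return refined, aux
```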

II-D Network Training

To train the proposed model effectively, we adopt a two-step strategy. The first step is to pre-train the two sub-networks independently, while the second step is to add the weighted summation layer and fine-tune the whole network. Assume the output result of the $l$-th attention module for the $n$-th training sample is $\mathbf{o}_n^{l}$, and its ground-truth label is $\mathbf{t}_n$. The loss value can then be formulated as:

$L_1 = \sum_{l=1}^{3} \lambda_l \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(\mathbf{o}_n^{l}, \mathbf{t}_n)$    (7)

where $N$ denotes the total number of training pixels and $\mathcal{L}$ represents the loss function. Without loss of generality, the cross-entropy loss function is chosen. Since the deeper convolutional layers are expected to capture more discriminative features, we set larger weights for them. Thus, $\lambda_1$, $\lambda_2$, and $\lambda_3$ are empirically set to 0.01, 0.1, and 1, respectively. During the pre-training process, we choose the gradient descent algorithm to optimize $L_1$. After that, we apply Equation (1), where $\mathbf{y}_{spe}$ and $\mathbf{y}_{spa}$ are replaced by the respective outputs of the two pre-trained sub-networks, to re-calculate the output value $\tilde{\mathbf{y}}_n$. Based on $\tilde{\mathbf{y}}_n$, we can update the loss value as follows:

$L_2 = \frac{1}{N} \sum_{n=1}^{N} \mathcal{L}(\tilde{\mathbf{y}}_n, \mathbf{t}_n)$    (8)

Again, the gradient descent algorithm is used to optimize $L_2$ during the fine-tuning process.
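As a sketch of the pre-training objective in Eq. (7), the helper below sums the cross-entropy losses of the three auxiliary branches with the weights stated above; the function name and the list-based interface are illustrative assumptions.

```python
import torch.nn.functional as F

# Larger weights for deeper attention modules, as stated above.
LAMBDAS = (0.01, 0.1, 1.0)

def pretrain_loss(aux_outputs, labels):
    """Weighted cross-entropy loss over the three attention-module branches (Eq. (7)).

    aux_outputs: list of three (batch, num_classes) tensors, one per attention module.
    labels: (batch,) tensor of ground-truth class indices.
    F.cross_entropy averages over the batch, which supplies the 1/N factor.
    """
    return sum(w * F.cross_entropy(out, labels)
               for w, out in zip(LAMBDAS, aux_outputs))
```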

III Experiments

III-A Data Description and Experimental Setup

Three hyperspectral datasets are used in the experiments. The first is Houston 2013, which was collected over the University of Houston campus in June 2012 [6] and contains 144 spectral bands. Fig. 3 shows a three-channel image as well as the training and test maps of the Houston 2013 data. As shown in the figure, there are 15 different land-cover classes. Table I reports the detailed pixel distribution of each class. The second dataset is Houston 2018 [30], which has a larger spatial size but fewer spectral bands (48) than the Houston 2013 data. There are 20 different land-cover classes to discriminate. The detailed numbers of training and test pixels in each class and their spatial distributions are given in Table II and Fig. 4, respectively. The last dataset is HyRANK (http://www2.isprs.org/commissions/comm3/wg4/HyRANK.html), which is comprised of two hyperspectral images: Dioni and Loukia. Both of them contain 176 spectral bands. The available pixels in Dioni are used as the training set, while those in Loukia are used as the test set. Seven land-cover classes common to both images are pre-defined: Dense Urban Fabric, Non Irrigated Arable Land, Olive Groves, Dense Sclerophyllous Vegetation, Sparse Sclerophyllous Vegetation, Sparsely Vegetated Areas, and Water. Table III gives the detailed number of available pixels in each class, and Fig. 5 shows the three-channel images and their pixel distributions.

Class No. Class Name Training Test Percent
1 Healthy grass 198 1053 15.83%
2 Stressed grass 190 1064 15.15%
3 Synthetic grass 192 505 27.55%
4 Tree 188 1056 15.11%
5 Soil 186 1056 14.98%
6 Water 182 143 56.00%
7 Residential 196 1072 15.46%
8 Commercial 191 1053 15.35%
9 Road 193 1059 15.42%
10 Highway 191 1036 15.57%
11 Railway 181 1054 14.66%
12 Parking lot 1 192 1041 15.57%
13 Parking lot 2 184 285 39.23%
14 Tennis court 181 247 42.29%
15 Running track 187 473 28.33%
- Total 2832 12197 18.84%
TABLE I: Pixel distributions on the Houston 2013 data. Note that ‘Percent’ denotes the proportion of training pixels relative to the total number of available pixels.

Fig. 3: Houston 2013 data visualization. (a) False-color image. (b) Training data visualization. (c) Test data visualization.

Class No. Class Name Training Test Percent
1 Healthy grass 1458 8341 14.88%
2 Stressed grass 4316 28186 13.28%
3 Synthetic grass 331 353 48.39%
4 Evergreen Trees 2005 11583 14.76%
5 Deciduous Trees 676 4372 13.39%
6 Soil 1757 2759 38.91%
7 Water 147 119 55.26%
8 Residential 3809 35953 9.58%
9 Commercial 2789 220895 1.25%
10 Road 3188 42622 6.96%
11 Sidewalk 2699 31303 7.94%
12 Crosswalk 225 1291 14.84%
13 Major Thoroughfares 5193 41165 11.20%
14 Highway 700 9149 7.11%
15 Railway 1224 5713 17.64%
16 Paved Parking Lot 1179 10296 10.27%
17 Gravel Parking Lot 127 22 85.23%
18 Cars 848 5730 12.89%
19 Trains 493 4872 9.19%
20 Seats 1313 5511 19.24%
- Total 34477 470235 6.83%
TABLE II: Pixel distributions on the Houston 2018 data. Note that ‘Percent’ denotes the proportion of training pixels relative to the total number of available pixels.

Fig. 4: Houston 2018 data visualization. (a) False-color image. (b) Training data visualization. (c) Test data visualization.
Class No. Class Name Dioni Loukia
1 Dense Urban Fabric 1262 288
2 Non Irrigated Arable Land 614 542
3 Olive Groves 1768 1401
4 Dense Sclerophyllous Vegetation 5035 3793
5 Sparse Sclerophyllous Vegetation 6374 2803
6 Sparsely Vegetated Areas 1754 404
7 Water 1612 1393
- Total 18419 10624
TABLE III: Pixel distributions on the HyRANK data.

Fig. 5: HyRANK data visualization. (a) and (b) are the false-color image and the available pixel map on the Dioni data. (c) and (d) are the false-color image and the available pixel map on the Loukia data.

In order to effectively analyze the performance of the proposed model, we conduct two different kinds of experiments on these three datasets. The first evaluates the effects of different components of the proposed model, including the spectral attention modules and the spatial attention modules. The second compares the proposed model with some state-of-the-art CNN-related models. All the models are implemented in PyTorch on a computer with 32 GB RAM and a GTX TITAN X graphics card. The input cube size is empirically set to a small value, because larger sizes lead to a serious overlap problem [12, 20]. The optimizer is Adam with default parameters. The learning rate, the batch size, and the number of training epochs are set to 0.001, 128, and 200, respectively. To alleviate the effects of random initialization on the performance, each experiment is repeated 10 times, and the average performance is reported. To quantitatively evaluate each model, we adopt the overall accuracy (OA), the average accuracy (AA), the per-class accuracy, the Kappa coefficient, and the F1 score as indicators.
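For reference, the snippet below computes the adopted indicators from predicted and true label arrays. It is a small NumPy helper written for this article rather than code from the authors, and the macro-averaged F1 is an assumption about how the F1 score is aggregated.

```python
import numpy as np

def classification_metrics(y_true, y_pred, num_classes):
    """Compute OA, AA, per-class accuracy, Kappa, and macro F1 from label arrays."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1                                      # confusion matrix

    total = cm.sum()
    oa = np.trace(cm) / total                              # overall accuracy
    per_class = np.diag(cm) / cm.sum(axis=1).clip(min=1)   # per-class accuracy (recall)
    aa = per_class.mean()                                  # average accuracy
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)                           # Kappa: chance-corrected agreement
    precision = np.diag(cm) / cm.sum(axis=0).clip(min=1)
    f1 = (2 * precision * per_class / (precision + per_class + 1e-12)).mean()
    return oa, aa, per_class, kappa, f1
```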

III-B Model Analysis

III-B1 Effects of different components


Fig. 6: Effects of different components on the classification performance. (a) Houston 2013 data. (b) Houston 2018 data. (c) HyRANK data.

Different from traditional CNN models, our proposed model incorporates spectral attention and spatial attention modules into the CNN. In this subsection, we test the effectiveness of these attention modules. Specifically, we adopt the convolutional network without any attention modules as a baseline model (abbreviated as Plain). It has three convolutional layers and an output layer. All convolutional layers share the same kernel size, and the channel numbers from the first to the third convolutional layer are sequentially set to 32, 64, and 128. We compare Plain with the spectral attention sub-network (abbreviated as SpeAtt), the spatial attention sub-network (abbreviated as SpaAtt), and their integrated two-branch network (abbreviated as SSAtt).

Fig. 6 shows the classification performance of the different models in terms of OA, AA, and Kappa values, with different colors denoting different evaluation indicators. From this figure, we can observe that SpeAtt and SpaAtt achieve similar performance in most cases, and both of them outperform Plain on all three datasets, which confirms the effectiveness of the proposed spectral and spatial attention modules. In addition, by combining the results of SpeAtt and SpaAtt, our proposed model SSAtt further improves the classification performance, indicating that SSAtt can exploit the complementary information between SpeAtt and SpaAtt.

III-B2 Effects of attention numbers

Model First Second Third OA
Plain ✗ ✗ ✗ 87.98
SpeAtt1 ✓ ✗ ✗ 88.40
SpeAtt1 ✗ ✓ ✗ 88.75
SpeAtt1 ✗ ✗ ✓ 88.53
SpeAtt2 ✓ ✓ ✗ 88.98
SpeAtt2 ✓ ✗ ✓ 88.73
SpeAtt2 ✗ ✓ ✓ 89.14
SpeAtt3 ✓ ✓ ✓ 89.69
TABLE IV: Effects of different numbers of attention modules on the SpeAtt model.
Model First Second Third OA
Plain ✗ ✗ ✗ 87.98
SpaAtt1 ✓ ✗ ✗ 88.34
SpaAtt1 ✗ ✓ ✗ 88.77
SpaAtt1 ✗ ✗ ✓ 88.30
SpaAtt2 ✓ ✓ ✗ 89.20
SpaAtt2 ✓ ✗ ✓ 88.47
SpaAtt2 ✗ ✓ ✓ 89.07
SpaAtt3 ✓ ✓ ✓ 89.61
TABLE V: Effects of different numbers of attention modules on the SpaAtt model.

The previous subsection evaluated the effectiveness of the different kinds of attention modules, but the number of attention modules may also affect the classification performance. Therefore, in this subsection, we take the Houston 2013 data as an example and comprehensively test the effect of the number of attention modules on the classification performance. Since the total number of convolutional layers in the proposed model is three, the maximal number of attention modules is also three. Here, we vary the number of attention modules from one to three and record the classification performance of the SpeAtt and SpaAtt models with different combinations.

Table IV reports the classification performance achieved by SpeAtt with different numbers of attention modules. The subscript numbers refer to the number of attention modules used by each model. The symbol ‘✓’ indicates that the corresponding layer contains an attention module, while the symbol ‘✗’ indicates that it does not. From this table, it can be observed that when only the first convolutional layer adopts the spectral attention module, the performance is inferior to that obtained when only the second or only the third layer does. This is because the first convolutional layer has less discriminative features than the other layers. In comparison with the second convolutional layer, the third one contains more discriminative features, but its room for improvement is smaller than that of the second one. Therefore, when equipped with the attention module, the second layer obtains the highest OA. Different from SpeAtt1, SpeAtt2 adopts attention modules on two layers. Due to the exploitation of more attention modules, SpeAtt2 is able to achieve higher OAs than its SpeAtt1 counterparts. Similarly, SpeAtt3 is better than SpeAtt2.

Similar to Table IV, Table V shows the classification performance of SpaAtt using different numbers of attention modules. Again, SpaAtt3 obtains the best performance, because every convolutional layer has been refined by a spatial attention module. In addition, SpaAtt2 is superior to its SpaAtt1 counterparts. For example, SpaAtt2 with attention modules in the first and the second layers is better than both the first (i.e., the second row in Table V) and the second SpaAtt1 models. Among the SpaAtt1 models, placing the attention module on the third convolutional layer is inferior to the other choices, because that layer has less spatial information. Although the first convolutional layer has more spatial information than the second one, its feature representation ability is not good enough. This is why the second SpaAtt1 model achieves a higher OA than the first one.

III-B3 Analysis of α and β

Data Band Number α β
Houston 2013 144 0.5030 0.4970
Houston 2018 48 0.4803 0.5197
HyRANK 176 0.5208 0.4792
TABLE VI: Values of α and β on different datasets.

Since the whole network is optimized by the gradient descent algorithm, α and β are optimized along with it. Table VI shows the final values of α and β on the three datasets. It is interesting to observe that the optimal α and β values vary across datasets, because the datasets have different spatial and spectral resolutions. In particular, both the Houston 2013 and HyRANK data contain more than 100 spectral bands, which provide more discriminative information than the spatial domain. Therefore, α is relatively larger than β for them, especially for the HyRANK data. On the contrary, Houston 2018 has only 48 spectral bands but rich spatial information, making β larger than α. Based on these observations, adaptively optimizing α and β is a better choice than empirically fixing them. Although it takes time to optimize them, empirically choosing them for each dataset would also cost time.

Class No. PPF 2DCNN ECNN GCNN 3DCNN SSRN MSDNSA SSAtt
1 84.20 56.73 87.49 87.47 78.63 81.48 82.72 82.54
2 96.00 64.68 80.99 86.01 93.23 92.48 99.81 99.92
3 98.61 44.67 87.72 78.22 40.99 98.02 89.70 86.48
4 94.89 59.05 90.43 85.02 97.44 98.11 95.08 99.57
5 96.67 68.31 100 99.89 87.31 99.91 94.89 99.61
6 82.94 70.21 97.90 89.44 79.02 95.80 95.80 83.71
7 82.67 82.57 90.48 90.19 90.49 89.46 85.63 89.92
8 52.69 52.12 58.51 74.44 59.83 69.90 85.57 81.94
9 78.21 70.35 79.77 84.42 81.11 84.04 86.02 85.99
10 72.78 60.37 64.28 63.61 69.59 82.34 60.33 88.81
11 87.38 72.16 78.37 80.06 75.14 93.17 87.67 90.53
12 80.52 44.63 78.29 87.30 82.23 90.80 90.78 89.48
13 70.88 87.02 76.84 85.06 82.11 72.98 90.88 86.81
14 99.51 96.92 99.19 100 80.57 99.19 99.60 95.34
15 94.71 14.93 77.04 56.95 39.11 94.08 94.71 85.77
OA 83.84 61.85 84.04 84.12 78.19 89.46 87.78 90.38
AA 84.84 62.98 83.33 82.94 75.79 89.45 89.28 89.76
Kappa 82.53 58.64 82.54 82.51 76.27 88.58 86.73 89.55
F1 85.72 60.92 82.35 81.27 76.79 88.03 88.69 90.67
Training(s) 779.47 21.46 21.55 22.54 11472.14 715.44 4994.82 180.36
Test(s) 0.30 0.12 0.13 0.16 77.21 5.10 120.12 0.78
TABLE VII: Classification results (%) and computation time (seconds) of different models on the Houston 2013 data.

Fig. 7: Classification maps achieved by seven different models on the Houston 2013 data. (a) Test data map. (b) 2DCNN. (c) ECNN. (d) GCNN. (e) 3DCNN. (f) SSRN. (g) MSDNSA. (h) SSAtt.

Fig. 8: Classification maps achieved by seven different models on the Houston 2018 data. (a) Test data map. (b) 2DCNN. (c) ECNN. (d) GCNN. (e) 3DCNN. (f) SSRN. (g) MSDNSA. (h) SSAtt.
Class No. 2DCNN ECNN GCNN 3DCNN SSRN MSDNSA SSAtt
1 20.72 66.62 54.71 58.76 71.38 63.28 68.83
2 62.45 75.60 85.06 90.06 81.88 88.83 90.54
3 44.48 89.52 86.40 92.35 98.87 99.15 88.10
4 83.60 95.46 94.16 97.78 91.70 95.53 95.49
5 42.18 41.72 50.09 55.95 38.76 49.89 55.10
6 28.63 32.58 33.06 31.10 40.40 38.35 30.12
7 0 0 31.09 29.41 24.16 30.25 30.25
8 87.93 86.44 82.22 89.30 82.74 83.56 86.37
9 53.44 64.10 65.67 58.74 75.76 73.74 77.39
10 41.77 61.46 57.09 55.41 54.86 45.40 54.74
11 45.35 59.02 66.04 60.55 65.46 67.46 63.32
12 2.87 1.47 7.67 14.41 9.39 11.46 12.70
13 44.25 50.36 51.13 55.18 50.76 47.96 50.34
14 41.84 84.06 74.49 57.55 34.43 35.38 46.91
15 60.21 69.19 68.28 67.01 67.73 64.76 65.31
16 79.98 88.33 89.28 87.10 74.18 77.82 85.01
17 100 95.45 100 100 100 100 100
18 44.10 38.03 43.23 57.03 79.55 73.30 66.58
19 95.87 90.68 92.34 85.76 85.09 80.17 94.68
20 41.23 70.57 85.96 67.50 61.67 81.98 74.14
OA 54.59 65.99 67.05 64.19 70.52 69.30 72.57
AA 51.04 63.03 65.89 65.55 64.44 65.41 66.80
Kappa 44.97 57.64 59.04 56.24 62.39 61.01 64.89
F1 39.81 50.15 50.67 51.38 54.69 54.49 58.02
Training(s) 242.76 265.33 280.56 6193.41 1155.94 23685.47 1060.20
Test(s) 4.66 5.20 6.12 132.18 51.49 1526.66 16.77
TABLE VIII: Classification results (%) and computation time (seconds) of different models on the Houston 2018 data.
Class No. 2DCNN ECNN GCNN 3DCNN SSRN MSDNSA SSAtt
1 44.79 17.01 35.07 45.14 3.82 59.03 82.99
2 75.28 0.92 39.30 74.54 9.41 54.43 21.59
3 47.39 0.21 12.21 36.12 10.71 0 30.34
4 66.86 94.99 88.53 73.64 76.43 68.05 60.93
5 3.28 2.64 4.07 3.14 42.99 51.02 50.16
6 59.65 86.14 61.63 50.99 70.30 4.70 81.44
7 100 100 100 100 100 100 100
OA 51.42 51.53 52.70 51.96 56.41 55.42 58.55
AA 56.75 43.13 48.69 54.79 44.81 48.17 61.06
Kappa 40.78 34.60 37.96 40.91 43.97 43.54 47.88
F1 45.49 31.78 40.60 44.99 40.12 40.61 51.61
Training(s) 136.61 140.51 141.05 17326.13 8474.97 37695.25 272.56
Test(s) 0.059 0.076 0.080 97.62 8.58 122.94 0.23
TABLE IX: Classification results (%) and computation time (seconds) of different models on the HyRANK data.

Fig. 9: Classification maps achieved by seven different models on the HyRANK data. (a) Test data map. (b) 2DCNN. (c) ECNN. (d) GCNN. (e) 3DCNN. (f) SSRN. (g) MSDNSA. (h) SSAtt.

III-C Model Comparison

Since our proposed model SSAtt is based on CNNs, we compare it with seven state-of-the-art CNN-related models to evaluate its performance. These models include the Pixel-Pair CNN (PPF) in [23], 2DCNN in [3], the Attribute Profile based CNN (ECNN) in [1], the Gabor Filtering based CNN (GCNN) in [5], 3DCNN in [3], the Spectral-Spatial Residual Network (SSRN) in [43], and the Dense Convolutional Network with Spectral-Wise Attention Mechanism (MSDNSA) in [7]. For these models, we adopt the network structures from the original papers and re-implement them ourselves on the three datasets.

III-C1 Quantitative comparisons

Table VII reports the classification performance in terms of OA, AA, Kappa, F1 score, and per-class accuracy achieved by the different models on the Houston 2013 data. Note that the numbers in boldface indicate the best result in each row. Several conclusions can be drawn from this table. First of all, PPF achieves satisfactory classification results because of its deep network structure (i.e., ten convolutional layers) and its novel data augmentation strategy. However, PPF is a spectral classification model that ignores spatial information, which makes it difficult to discriminate land-cover classes with similar materials. For instance, the ‘Commercial’ class, whose accuracy is only 52.69%, can easily be misclassified as ‘Residential’. Second, among the 2-D convolutional networks (i.e., 2DCNN, ECNN, and GCNN), 2DCNN obtains inferior performance in most classes compared to ECNN and GCNN. This can be explained by the loss of discriminative spectral information in the 2DCNN model. Instead of using only the first principal component as input as in 2DCNN, ECNN and GCNN adopt more principal components as inputs and extract spatial features from them, thus improving the classification performance. Third, among the 3-D convolutional networks (i.e., 3DCNN, SSRN, and MSDNSA), 3DCNN significantly improves on 2DCNN, but the improvement is still smaller than expected. One possible reason is that 3DCNN has a large number of parameters, while the available training samples are not sufficient to train it; another is that the convolutional features are not fully exploited. In comparison with 3DCNN, SSRN adopts smaller convolutional kernels to reduce the number of training parameters and residual structures to combine features from different convolutional layers, while MSDNSA uses densely-connected structures and spectral attention modules to integrate and refine convolutional features, respectively. Therefore, they improve on 3DCNN by a large margin in terms of OA, AA, Kappa, and F1 score. Last but not least, among the seven comparison models, SSRN achieves the best OA, AA, and Kappa values. Benefiting from the designed attention modules, which enhance the discriminative information and suppress the unnecessary information in the spectral and spatial domains, our proposed model SSAtt further improves these values, which confirms its effectiveness.

Table VIII and Table IX compare the classification performance of the different models on the Houston 2018 data and the HyRANK data, respectively. Note that we do not re-implement the PPF model on these two datasets, because the pixel-pair method would generate more than 10 million training samples, which exceeds the computational capacity of our GPU. As with Table VII, similar conclusions can be drawn from Table VIII and Table IX. Among the 2-D convolutional networks, 2DCNN does not perform as well as the other two models in terms of OA. Among the 3-D convolutional networks, 3DCNN achieves inferior classification results in terms of OA and Kappa values compared to SSRN and MSDNSA. When comparing the 2-D and 3-D convolutional networks, SSRN is a relatively better model in terms of OA and Kappa values. In comparison with SSRN, the SSAtt model improves the OA, AA, Kappa, and F1 scores on both datasets. These observations further validate the effectiveness of the proposed model.

III-C2 Qualitative comparisons

In addition to the quantitative results in Table VII to Table IX, we also qualitatively examine the classification maps produced by the seven different models. Fig. 7, Fig. 8, and Fig. 9 show the classification maps on the Houston 2013 data, the Houston 2018 data, and the HyRANK data, respectively. In these figures, different colors correspond to different land-cover classes. When comparing the classification maps in sub-figures (b)-(h) with the ground-truth map in sub-figure (a), it can be observed that the proposed model SSAtt obtains more reasonable maps than the other comparison models, which indicates its superiority. However, all of the classification maps in sub-figures (b)-(h) appear slightly over-smoothed. This is because the inputs of these CNN-related models are cubes centered at each pixel, which makes the boundary pixels between two objects prone to misclassification.

III-C3 Computation time

In Table VII to Table IX, the last two rows record the training and test time of the different models. It can be observed that the training stage costs much more time than the test stage for each model. Specifically, 2DCNN is the most efficient model, because it only processes one component extracted from the whole set of spectral bands. Nevertheless, its classification performance is significantly inferior to the 3-D CNN-related models (i.e., 3DCNN, SSRN, and MSDNSA) and the proposed SSAtt model in most cases. On the contrary, although the 3-D CNN-related models are generally able to achieve satisfactory results, their computation costs are very high due to the simultaneous convolution operations in both the spectral and spatial domains. Taking the Houston 2013 data as an example, it takes more than ten thousand seconds to train the 3DCNN model and about five thousand seconds to train the MSDNSA model, while the other models cost only hundreds of seconds to train. Different from the 3-D CNN-related models, PPF only deals with spectral information, but it still takes much time to train on the Houston 2013 data. This is caused by its pair-wise classification strategy, which dramatically increases the number of training samples. In summary, SSRN offers the best balance between computation time and classification performance among the seven compared models. In comparison with SSRN, our proposed model SSAtt takes less time to train and test on all three datasets.

IV Conclusions

In this paper, we proposed a hyperspectral image classification method using an attention aided CNN model. We first designed two different classes of attention modules (i.e., the spectral attention module and the spatial attention module) using small convolutional layers. Then, we incorporated them into the original CNN to construct a spectral attention sub-network and a spatial attention sub-network, which can focus on more discriminative information in the spectral domain and the spatial domain, respectively. Finally, an adaptively weighted fusion method was employed to combine the complementary information from these two sub-networks. In order to validate the effectiveness of the proposed model, we conducted two kinds of experiments on three different datasets. The first analyzed the effects of the different attention modules, and the second compared the performance of the proposed model with several state-of-the-art CNN-related models. The experimental results show that, with the aid of the attention modules, both the spectral attention sub-network and the spatial attention sub-network obtain higher performance than the original CNN, and their integrated model further improves the performance. In comparison with the state-of-the-art models, the proposed model achieves the best performance in terms of OA, AA, Kappa, and F1 scores, and offers a good balance between classification performance and computation time.

References

  • [1] E. Aptoula, M. C. Ozdemir, and B. Yanikoglu (2016) Deep learning with attribute profiles for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters 13 (12), pp. 1970–1974. Cited by: §III-C.
  • [2] L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T. Chua (2017) SCA-CNN: spatial and channel-wise attention in convolutional networks for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5659–5667. Cited by: §I.
  • [3] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi (2016) Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing 54 (10), pp. 6232–6251. Cited by: §I, §I, §II-B, §III-C.
  • [4] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu (2014) Deep learning-based classification of hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6), pp. 2094–2107. Cited by: §I.
  • [5] Y. Chen, L. Zhu, P. Ghamisi, X. Jia, G. Li, and L. Tang (2017) Hyperspectral images classification with gabor filtering and convolutional neural network. IEEE Geoscience and Remote Sensing Letters 14 (12), pp. 2355–2359. Cited by: §III-C.
  • [6] C. Debes, A. Merentitis, R. Heremans, J. Hahn, N. Frangiadakis, T. van Kasteren, W. Liao, R. Bellens, A. Pižurica, S. Gautama, et al. (2014) Hyperspectral and lidar data fusion: outcome of the 2013 grss data fusion contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7 (6), pp. 2405–2418. Cited by: §III-A.
  • [7] B. Fang, Y. Li, H. Zhang, and J. C. Chan (2019) Hyperspectral images classification based on dense convolutional networks with spectral-wise attention mechanism. Remote Sensing 11 (2), pp. 159. Cited by: §I, §III-C.
  • [8] P. Ghamisi, E. Maggiori, S. Li, R. Souza, Y. Tarablaka, G. Moser, A. De Giorgi, L. Fang, Y. Chen, M. Chi, et al. (2018) New frontiers in spectral-spatial hyperspectral image classification: the latest advances based on mathematical morphology, markov random fields, segmentation, sparse representation, and deep learning. IEEE Geoscience and Remote Sensing Magazine 6 (3), pp. 10–43. Cited by: §I.
  • [9] P. Ghamisi, J. Plaza, Y. Chen, J. Li, and A. J. Plaza (2017) Advanced spectral classifiers for hyperspectral images: a review. IEEE Geoscience and Remote Sensing Magazine 5 (1), pp. 8–32. Cited by: §I.
  • [10] R. Hang, Z. Li, P. Ghamisi, D. Hong, G. Xia, and Q. Liu (2020) Classification of hyperspectral and lidar data using coupled cnns. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §I.
  • [11] R. Hang, Q. Liu, D. Hong, and P. Ghamisi (2019) Cascaded recurrent neural networks for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §I.
  • [12] R. Hänsch, A. Ley, and O. Hellwich (2017) Correct and still wrong: the relationship between sampling strategies and the estimation of the generalization error. In IEEE International Geoscience and Remote Sensing Symposium, pp. 3672–3675. Cited by: §III-A.
  • [13] J. M. Haut, M. E. Paoletti, J. Plaza, A. Plaza, and J. Li (2019) Visual attention-driven hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57 (10), pp. 8065–8080. Cited by: §I.
  • [14] L. He, J. Li, C. Liu, and S. Li (2018) Recent advances on spectral–spatial hyperspectral image classification: an overview and new guidelines. IEEE Transactions on Geoscience and Remote Sensing 56 (3), pp. 1579–1597. Cited by: §I.
  • [15] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu (2018) An augmented linear mixing model to address spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing 28 (4), pp. 1923–1938. Cited by: §I.
  • [16] D. Hong, N. Yokoya, J. Chanussot, and X. X. Zhu (2019) Cospace: common subspace learning from hyperspectral-multispectral correspondences. IEEE Transactions on Geoscience and Remote Sensing 57 (7), pp. 4349–4359. Cited by: §I.
  • [17] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132–7141. Cited by: §I, §I.
  • [18] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li (2015) Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors 2015. Cited by: §I.
  • [19] J. Kang, D. Hong, J. Liu, G. Baier, N. Yokoya, and B. Demir (2020) Learning convolutional sparse coding on complex domain for interferometric phase restoration. IEEE Transactions on Neural Networks and Learning Systems. Note: DOI:10.1109/TNNLS.2020.2979546 Cited by: §I.
  • [20] J. Lange, G. Cavallaro, M. Götz, E. Erlingsson, and M. Riedel (2018) The influence of sampling methods on pixel-wise hyperspectral image classification with 3d convolutional neural networks. In IEEE International Geoscience and Remote Sensing Symposium, pp. 2087–2090. Cited by: §III-A.
  • [21] Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436. Cited by: §I.
  • [22] S. Li, W. Song, L. Fang, Y. Chen, P. Ghamisi, and J. A. Benediktsson (2019) Deep learning for hyperspectral image classification: an overview. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §I.
  • [23] W. Li, G. Wu, F. Zhang, and Q. Du (2017) Hyperspectral image classification using deep pixel-pair features. IEEE Transactions on Geoscience and Remote Sensing 55 (2), pp. 844–853. Cited by: §I, §I, §III-C.
  • [24] Y. Li, H. Zhang, and Q. Shen (2017) Spectral–spatial classification of hyperspectral imagery with 3d convolutional neural network. Remote Sensing 9 (1), pp. 67. Cited by: §I.
  • [25] Q. Liu, F. Zhou, R. Hang, and X. Yuan (2017) Bidirectional-convolutional lstm based spectral-spatial feature learning for hyperspectral image classification. Remote Sensing 9 (12), pp. 1330. Cited by: §I.
  • [26] X. Ma, H. Wang, and J. Geng (2016) Spectral–spatial classification of hyperspectral image based on deep auto-encoder. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 9 (9), pp. 4073–4085. Cited by: §I.
  • [27] X. Mei, E. Pan, Y. Ma, X. Dai, J. Huang, F. Fan, Q. Du, H. Zheng, and J. Ma (2019) Spectral-spatial attention networks for hyperspectral image classification. Remote Sensing 11 (8), pp. 963. Cited by: §I.
  • [28] V. Mnih, N. Heess, A. Graves, et al. (2014) Recurrent models of visual attention. In Advances in Neural Information Processing Systems, pp. 2204–2212. Cited by: §I.
  • [29] M. E. Paoletti, J. M. Haut, R. Fernandez-Beltran, J. Plaza, A. J. Plaza, and F. Pla (2018) Deep pyramidal residual networks for spectral-spatial hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57 (2), pp. 1–15. Cited by: §I.
  • [30] B. Rasti, D. Hong, R. Hang, P. Ghamisi, X. Kang, J. Chanussot, and J. A. Benediktsson (2020) Feature extraction for hyperspectral imagery: the evolution from shallow to deep. arXiv preprint arXiv:2003.02822. Cited by: §III-A.
  • [31] C. Tao, H. Pan, Y. Li, and Z. Zou (2015) Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification. IEEE Geoscience and Remote Sensing Letters 12 (12), pp. 2438–2442. Cited by: §I.
  • [32] F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang (2017) Residual attention network for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164. Cited by: §I.
  • [33] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §I.
  • [34] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In Proceedings of the European Conference on Computer Vision, pp. 3–19. Cited by: §I, §I.
  • [35] H. Wu and S. Prasad (2017) Convolutional recurrent neural networks forhyperspectral data classification. Remote Sensing 9 (3), pp. 298. Cited by: §I.
  • [36] H. Wu and S. Prasad (2018) Semi-supervised deep learning using pseudo labels for hyperspectral image classification. IEEE Transactions on Image Processing 27 (3), pp. 1259–1270. Cited by: §I.
  • [37] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International Conference on Machine Learning, pp. 2048–2057. Cited by: §I.
  • [38] X. Xu, W. Li, Q. Ran, Q. Du, L. Gao, and B. Zhang (2018) Multisource remote sensing data classification based on convolutional neural network. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 937–949. Cited by: §I.
  • [39] Y. Xu, Z. Wu, J. Chanussot, and Z. Wei (2019) Nonlocal patch tensor sparse representation for hyperspectral image super-resolution. IEEE Transactions on Image Processing 28 (6), pp. 3034–3047. Cited by: §I.
  • [40] J. Yang, Y. Zhao, and J. C. Chan (2017) Learning and transferring deep joint spectral–spatial features for hyperspectral classification. IEEE Transactions on Geoscience and Remote Sensing 55 (8), pp. 4729–4742. Cited by: §I.
  • [41] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola (2016) Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 21–29. Cited by: §I.
  • [42] W. Zhao and S. Du (2016) Spectral–spatial feature extraction for hyperspectral image classification: a dimension reduction and deep learning approach. IEEE Transactions on Geoscience and Remote Sensing 54 (8), pp. 4544–4554. Cited by: §I.
  • [43] Z. Zhong, J. Li, Z. Luo, and M. Chapman (2018) Spectral–spatial residual network for hyperspectral image classification: a 3-d deep learning framework. IEEE Transactions on Geoscience and Remote Sensing 56 (2), pp. 847–858. Cited by: §I, §III-C.
  • [44] F. Zhou, R. Hang, Q. Liu, and X. Yuan (2019) Hyperspectral image classification using spectral-spatial lstms. Neurocomputing 328, pp. 39–47. Cited by: §I.
  • [45] X. Zhu, D. Tuia, L. Mou, G. Xia, L. Zhang, F. Xu, and F. Fraundorfer (2017) Deep learning in remote sensing: a comprehensive review and list of resources. IEEE Geoscience and Remote Sensing Magazine 5 (4), pp. 8–36. Cited by: §I.