With the development of imaging technology, current hyperspectral sensors can fully portray the surface of the Earth using hundreds of continuous and narrow spectral bands, ranging from the visible spectrum to the short-wave infrared spectrum. The generated hyperspectral image (HSI) is often considered as a three-dimensional cube. The first two dimensions are spatial and record the location of each object. The third dimension is spectral and captures the spectral signature (reflective or emissive properties) of each material in different bands along the electromagnetic spectrum. Using such rich information, HSIs have been widely used in various applications, such as land cover/land use classification, precision agriculture, and change detection. For these applications, one basic but important procedure is HSI classification, whose goal is to assign one of the candidate class labels to each pixel.
In order to acquire accurate classification results, numerous methods have been proposed. For example, one can directly consider the rich spectral signature as features and feed them into advanced classifiers, such as support vector machine (SVM) and extreme learning machine. However, due to the dense spectral sampling of HSIs, there may exist some redundant information among adjacent spectral bands. This easily leads to the so-called curse of dimensionality (the Hughes effect), which causes a sudden drop in classification accuracy when the large number of spectral channels is not balanced by a sufficient number of training samples. Therefore, a large number of works have been proposed to mine discriminative features from the high-dimensional spectral signature. Popular models include principal component analysis (PCA), linear discriminant analysis (LDA) [6, 7, 8], and graph embedding [9, 10, 11]. Besides, representation-based models have also been applied to HSI classification in recent years. Sparse representation was proposed to learn discriminative features from HSIs, and collaborative representation has also been widely explored. In these models, an input spectral signature is usually represented by a linear combination of atoms from a dictionary, and the classification result can be derived from the reconstruction residual without needing to train extra classifiers, a step that often costs much time.
Although the aforementioned models have demonstrated their effectiveness in the field of HSI classification, some drawbacks remain to be addressed. For the traditional feature extraction models, we need to pre-define a mining criterion (e.g., maximizing the between-class scatter matrix in LDA), which heavily depends on the domain knowledge and experience of experts. For the representation-based models, the goal is to reconstruct the input signal, which leads to representations that are sub-optimal for classification. Additionally, all of them can be considered as shallow models, which limits their potential to learn high-level semantic features. Recently, deep learning has attracted much attention [18, 19, 20, 21], [22]. The goal of deep learning is to learn nonlinear, high-level semantic features from data in a hierarchical manner.
Due to the effects of multi-path scattering and the heterogeneity of sub-pixel constituents, HSI data often lie in a nonlinear and complex feature space. Deep learning can be naturally adopted to deal with this issue. In the past few years, many deep learning models have been successfully applied to HSI classification. For example, in [26, 27, 28], the autoencoder model was used to learn deep features directly from the high-dimensional spectral signature. Similar to the autoencoder, the deep belief network was also explored to extract spectral features [29, 30, 31]. However, both of them are fully-connected networks, which contain large numbers of parameters to train. In contrast, convolutional neural networks (CNNs) have local connection and weight sharing properties, thus largely reducing the number of trainable parameters [32, 33, 34]. Hu et al. proposed to use a one-dimensional CNN to learn and represent the spectral information. This model comprises an input layer, a convolutional layer, a pooling layer, a fully-connected layer and an output layer. The whole model is trained in an end-to-end manner and achieves satisfactory results for HSI classification.
Besides spectral information, HSIs also contain rich spatial information. How to combine the two has been an active research topic in the field of HSI classification. One potential method is to extend a spectral classification model into its spectral-spatial counterpart. For instance, in [38, 39, 40], a three-dimensional CNN was employed for spectral-spatial classification of HSIs. However, because convolution is performed simultaneously in the spectral and spatial domains, the computational complexity increases dramatically. In addition, the number of trainable parameters in three-dimensional CNNs is also a problem. In order to perform three-dimensional convolution, the dimensionality of the input and the dimensionality of the kernel (filter) should be equal, which heavily increases the number of parameters. Another candidate method for spectral-spatial classification is based on two-branch networks, in which one branch performs spectral classification and the other performs spatial classification. In [41, 42, 43], a one-dimensional CNN or an autoencoder was used to learn spectral features, and a two-dimensional CNN was designed to learn spatial features. These two kinds of features are then integrated via feature-level fusion or decision-level fusion. For the two-dimensional CNNs, only a few principal components were extracted and used as inputs, thus reducing the computational cost compared to three-dimensional CNNs.
Most existing models can be considered as vector-based methodologies. Recently, a few works have attempted to regard HSIs as sequential data, so recurrent neural networks (RNNs) were naturally used to learn features. Wu et al. proposed using an RNN to extract spectral features from HSIs. In other works, a variant of RNN using long short-term memory (LSTM) units was designed to learn spectral-spatial features from HSIs, and another variant using gated recurrent units (GRUs) was also employed. Compared to the widely explored CNN models, RNNs have several advantages. For example, the key component of CNNs is the convolutional operator. Because of its limited kernel size, a one-dimensional CNN can only learn local spectral dependencies and easily ignores the effects of non-adjacent spectral bands. In contrast, RNNs, especially those using GRU or LSTM units, take the spectral bands as input one by one via recurrent operators, thus capturing the relationships across all spectral bands. Besides, RNNs often have fewer parameters to train than CNNs, so they tend to be more efficient in the training and inference phases.
Relying on the powerful ability of RNNs to learn from sequential data, current RNN-related models often simply feed all spectral bands into the network, which may not fully exploit the redundant and complementary properties of HSIs. The redundant information between adjacent spectral bands increases the computational burden of RNNs without improving the classification results. Sometimes such redundancy may even reduce the classification accuracy, since it increases within-class variances and decreases between-class variances in the feature space. Besides, it may also make complementary information more difficult to learn. To address these issues, we propose a cascaded RNN model using gated recurrent units (GRUs) in this paper. This model mainly consists of two RNN layers. The first RNN layer focuses on reducing the redundant information of adjacent spectral bands. The reduced information is then fed into the second RNN layer to learn complementary features. Besides, in order to improve the discriminative ability of the learned features, we design two strategies for the proposed model. Finally, we also extend the proposed model to its spectral-spatial version by incorporating some convolutional layers. The major contributions of this paper are summarized as follows.
We propose a cascaded RNN model with GRUs for HSI classification. Compared to the existing RNN-related models, our model can sufficiently consider the redundant and complementary information of HSIs via two RNN layers. The first one is to reduce redundancy and the second one is to learn complementarity. These two layers are integrated together to generate an end-to-end trainable model.
In order to learn more discriminative features, we design two strategies to construct connections between the first RNN layer and the output layer. The first strategy is the weighted fusion of features from two layers, and the second one is the weighted combination of different loss functions from two layers. Their weights can be adaptively learned from data itself.
To capture the spectral and spatial features simultaneously, we further extend the proposed model to its spectral-spatial counterpart. A few convolutional layers are integrated into the proposed model to learn spatial features from each band, and these features are then combined together via recurrent operators.
The rest of this paper is organized as follows. Section II describes the details of the proposed models, including a brief introduction of RNN, and the structure of the proposed model as well as its modifications. The descriptions of data sets and experimental results are given in Section III. Finally, Section IV concludes this paper.
As shown in Fig. 1, the proposed cascaded RNN model mainly consists of four steps. For a given pixel, we first divide its spectral bands into different groups. Then, for each group, we consider the spectral bands in it as a sequence, which is fed into an RNN layer to learn features. After that, the learned features from each group are again regarded as a sequence and fed into another RNN layer to learn their complementary information. Finally, the output of the second RNN layer is connected to a softmax layer to derive the classification result.
II-A Review of RNN
RNN has been widely used for sequential data analysis, such as speech recognition and machine translation. Assume that we have a sequence $\mathbf{x} = (x_1, x_2, \ldots, x_T)$, where $x_t$ generally represents the information at the $t$-th time step. When applying RNN to HSI classification, $x_t$ corresponds to the spectral value at the $t$-th band. For RNN, the output of the hidden layer at time $t$ is

$$\mathbf{h}_t = \phi(\mathbf{W}_{hi} x_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h), \tag{1}$$

where $\phi$ is a nonlinear activation function such as the logistic sigmoid or hyperbolic tangent function, $\mathbf{b}_h$ is a bias vector, $\mathbf{h}_{t-1}$ is the output of the hidden layer at the previous time step, and $\mathbf{W}_{hi}$ and $\mathbf{W}_{hh}$ denote the weight matrices from the current input layer to the hidden layer and from the previous hidden layer to the current hidden layer, respectively. From this equation, we can observe that the recurrent connection builds contextual relationships in the time domain. Ideally, $\mathbf{h}_T$ can capture most of the temporal information of the sequence.

For classification tasks, $\mathbf{h}_T$ is often fed into an output layer, and the probability that the sequence belongs to the $i$-th class can be derived by using a softmax function. These processes can be formulated as

$$\mathbf{O} = \mathbf{W}_{oh} \mathbf{h}_T + \mathbf{b}_o, \qquad P(y = i \mid \theta, b) = \frac{e^{\theta_i \mathbf{O} + b_i}}{\sum_{j=1}^{C} e^{\theta_j \mathbf{O} + b_j}}, \tag{2}$$

where $\mathbf{b}_o$ is a bias vector, $\mathbf{W}_{oh}$ is the weight matrix from the hidden layer to the output layer, $\theta$ and $b$ are parameters of the softmax function, and $C$ is the number of classes to discriminate. All of the weight parameters in Equations (1) and (2) can be trained using the following loss function

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right], \tag{3}$$

where $N$ is the number of training samples, and $y_i$ and $\hat{y}_i$ are the true label and the predicted label of the $i$-th training sample, respectively. This function can be optimized using the backpropagation through time (BPTT) algorithm.
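As a minimal sketch of the recurrence and the softmax output described in this subsection, the following pure-Python example feeds a scalar spectral sequence through a tanh hidden layer and turns the last hidden state into class probabilities. The function names, the toy weights, and the zero initial hidden state are assumptions made for illustration, and the separate softmax parameters are folded into the output layer for brevity. The printed loss is the negative log-probability of the true class, used here as a simple stand-in for the loss function of the text.

```python
import math

def rnn_last_hidden(sequence, w_in, w_rec, bias):
    """Apply h_t = tanh(w_in * x_t + w_rec @ h_{t-1} + bias) and return the last h_t."""
    hidden = [0.0] * len(bias)  # assumption: the hidden state starts at zero
    for x_t in sequence:
        hidden = [
            math.tanh(
                w_in[j] * x_t
                + sum(w_rec[j][m] * hidden[m] for m in range(len(hidden)))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden

def class_probabilities(h_last, w_out, b_out):
    """Linear output layer followed by a numerically stable softmax."""
    scores = [sum(w * h for w, h in zip(row, h_last)) + b
              for row, b in zip(w_out, b_out)]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: four bands, two hidden units, three classes (all values are made up).
spectrum = [0.12, 0.35, 0.80, 0.41]
w_in = [0.5, -0.3]
w_rec = [[0.1, 0.2], [-0.2, 0.1]]
bias = [0.0, 0.1]
w_out = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
b_out = [0.0, 0.0, 0.0]

h_last = rnn_last_hidden(spectrum, w_in, w_rec, bias)
probs = class_probabilities(h_last, w_out, b_out)
true_class = 1
loss = -math.log(probs[true_class])
print(h_last, probs, loss)
```

In practice the weights would be learned with BPTT rather than fixed by hand; the sketch only illustrates the forward computation.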
II-B Cascaded RNNs
HSIs can be described as a three-dimensional matrix $\mathbf{X} \in \mathbb{R}^{m \times n \times k}$, where $m$, $n$ and $k$ represent the width, height and number of spectral bands, respectively. A given pixel $\mathbf{x} \in \mathbb{R}^{k}$ can be considered as a sequence whose length is $k$, so RNN can be naturally employed to learn spectral features. However, HSIs often contain hundreds of bands, which makes $\mathbf{x}$ a very long sequence. Such a long sequence increases the training difficulty, since the gradients tend to either vanish or explode. To address this issue, one popular method is to design a more sophisticated activation function using gating units, such as the LSTM unit and the GRU. Compared to the LSTM unit, the GRU has fewer parameters, which may make it more suitable for HSI classification because only a limited number of training samples is usually available. Therefore, we select GRU as the basic unit of RNN in this paper.
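The core components of GRU are two gating units that control the flow of information inside the unit. Instead of using Equation (1), the activation of the hidden layer for band $t$ is now formulated as

$$\mathbf{h}_t = (1 - u_t)\, \mathbf{h}_{t-1} + u_t\, \tilde{\mathbf{h}}_t, \tag{4}$$

where $u_t$ is the update gate, which can be derived by

$$u_t = \sigma(w_u x_t + \mathbf{v}_u \mathbf{h}_{t-1}), \tag{5}$$

where $\sigma$ is a sigmoid function, $w_u$ is a weight value, and $\mathbf{v}_u$ is a weight vector. Similarly, $\tilde{\mathbf{h}}_t$ can be computed by

$$\tilde{\mathbf{h}}_t = \tanh\left(w_h x_t + \mathbf{V}_h (r_t \odot \mathbf{h}_{t-1})\right), \tag{6}$$

where $\odot$ denotes an element-wise multiplication, and $r_t$ is the reset gate, which can be derived by

$$r_t = \sigma(w_r x_t + \mathbf{v}_r \mathbf{h}_{t-1}). \tag{7}$$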
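The gating mechanism can be illustrated with a single GRU update written in plain Python. This is a sketch under stated assumptions rather than the exact unit used in the experiments: the gates are vector-valued (one gate value per hidden unit, as in the common GRU formulation), biases are omitted, the new state is the interpolation (1 - u) * h_prev + u * candidate, and the parameter names in the dictionary `p` are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gru_step(x_t, h_prev, p):
    """One GRU update for a scalar band value x_t and hidden state h_prev."""
    size = len(h_prev)
    vu = matvec(p["V_u"], h_prev)
    vr = matvec(p["V_r"], h_prev)
    # Update gate: how much of the candidate state replaces the old state.
    u = [sigmoid(p["w_u"][j] * x_t + vu[j]) for j in range(size)]
    # Reset gate: how much of the old state is visible to the candidate.
    r = [sigmoid(p["w_r"][j] * x_t + vr[j]) for j in range(size)]
    gated = [r_j * h_j for r_j, h_j in zip(r, h_prev)]
    vh = matvec(p["V_h"], gated)
    candidate = [math.tanh(p["w_h"][j] * x_t + vh[j]) for j in range(size)]
    return [(1.0 - u_j) * h_j + u_j * c_j
            for u_j, h_j, c_j in zip(u, h_prev, candidate)]

def zero_params(size):
    """All-zero parameters, convenient for checking the update by hand."""
    zeros = [[0.0] * size for _ in range(size)]
    return {"w_u": [0.0] * size, "w_r": [0.0] * size, "w_h": [0.0] * size,
            "V_u": zeros, "V_r": zeros, "V_h": zeros}

# With zero weights both gates equal 0.5 and the candidate is 0,
# so the new state is exactly half of the previous one.
h_new = gru_step(0.3, [1.0, -2.0], zero_params(2))
print(h_new)
```

Running the step over all bands of a pixel and keeping the last hidden state gives the spectral feature that the following paragraphs build on.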
Due to the dense spectral sampling of hyperspectral sensors, adjacent bands in HSIs have some redundancy while non-adjacent bands have some complementarity. In order to take account of such information comprehensively, we propose a cascaded RNN model. Specifically, we divide the spectral sequence $\mathbf{x}$ into $l$ sub-sequences $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_l$, each of which consists of adjacent spectral bands. Except for the last sub-sequence $\mathbf{z}_l$, the length of each sub-sequence is $d = \lfloor k/l \rfloor$, which denotes the largest integer less than or equal to $k/l$. Thus, the $i$-th sub-sequence $\mathbf{z}_i$, $i \in \{1, 2, \ldots, l\}$, is comprised of the following bands

$$\mathbf{z}_i = \begin{cases} (x_{(i-1)d+1}, \ldots, x_{id}), & i \neq l, \\ (x_{(i-1)d+1}, \ldots, x_{k}), & i = l. \end{cases} \tag{8}$$

Then, we feed the sub-sequences into the first-layer RNNs, one RNN per sub-sequence. These RNNs have the same structure and share parameters, thus reducing the number of parameters to train. In the sub-sequence $\mathbf{z}_i$, each band has an output from GRU. We use the output of the last band as the final feature representation of $\mathbf{z}_i$, which can be denoted as $\mathbf{F}^{(1)}_i \in \mathbb{R}^{H_1}$, where $H_1$ is the size of the hidden layer in the first-layer RNN. After that, we combine $\mathbf{F}^{(1)}_1, \ldots, \mathbf{F}^{(1)}_l$ to generate another sequence $\mathbf{F}^{(1)}$ whose length is $l$. This sequence is fed into the second-layer RNN to learn the complementary information. Similar to the first-layer RNNs, we use the output of GRU at the last time step as the learned feature $\mathbf{F}^{(2)}$. To get a classification result for $\mathbf{x}$, we input $\mathbf{F}^{(2)}$ into an output layer whose size equals the number of candidate classes $C$. Both RNN layers have many weight parameters. We choose Equation (3) as the loss function and use the BPTT algorithm to optimize the two layers simultaneously.
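The grouping rule and the two-level recurrence can be sketched as follows. The splitting follows the description above (every group has floor(k / l) adjacent bands and the last group also takes the leftover bands, which is one reading of the text). The two recurrent steps are deliberately simple tanh stand-ins rather than GRUs, and all function names and constants are hypothetical; the point is only to show how shared first-layer steps produce one feature per group and how the second layer consumes those features as a new sequence.

```python
import math

def split_bands(spectrum, num_groups):
    """Split a spectrum into num_groups runs of adjacent bands."""
    k = len(spectrum)
    d = k // num_groups  # length of every group except the last
    if d == 0:
        raise ValueError("num_groups must not exceed the number of bands")
    groups = [spectrum[i * d:(i + 1) * d] for i in range(num_groups - 1)]
    groups.append(spectrum[(num_groups - 1) * d:])  # last group takes the remainder
    return groups

def run_recurrent(sequence, step, h0):
    """Feed a sequence through a recurrent step and return the last hidden state."""
    h = h0
    for item in sequence:
        h = step(item, h)
    return h

def cascaded_features(spectrum, num_groups, first_step, second_step, h1_size, h2_size):
    groups = split_bands(spectrum, num_groups)
    # The same first_step (shared parameters) is applied to every group.
    group_feats = [run_recurrent(g, first_step, [0.0] * h1_size) for g in groups]
    # The per-group features form a new sequence of length num_groups.
    return run_recurrent(group_feats, second_step, [0.0] * h2_size)

# Stand-in recurrent steps (assumption: a real model would use GRU updates here).
def toy_first_step(x_t, h):
    return [math.tanh(0.5 * x_t + 0.3 * h_j) for h_j in h]

def toy_second_step(feat, h):
    drive = sum(feat) / len(feat)
    return [math.tanh(drive + 0.3 * h_j) for h_j in h]

example_groups = split_bands(list(range(10)), 3)
pixel = [0.01 * b for b in range(103)]  # a made-up 103-band spectrum
feature = cascaded_features(pixel, 8, toy_first_step, toy_second_step, 4, 3)
print(example_groups, [len(g) for g in split_bands(pixel, 8)], feature)
```

With 103 bands and eight groups, the first seven groups hold 12 bands each and the last one holds the remaining 19.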
II-C Improvement for Cascaded RNNs
As described in subsection II-B, the second-layer RNN is directly connected to the output layer, so it may be optimized better than the first-layer RNNs. However, the performance of the first-layer RNNs affects the second-layer RNN. In order to improve the discriminative ability of the first-layer features $\mathbf{F}^{(1)}_i$, an intuitive method is to construct relations between the first-layer RNNs and the output layer. Here, we propose two strategies to achieve this goal. The first strategy is based on the feature-level connection shown in Fig. 2. Instead of feeding only the output of the second-layer RNN into the output layer, we feed all the output features of the first- and second-layer RNNs in a weighted concatenation manner. Specifically, the input of the output layer is computed as follows

$$\mathbf{F} = \left[ w_1 \mathbf{F}^{(1)}_1, w_2 \mathbf{F}^{(1)}_2, \ldots, w_l \mathbf{F}^{(1)}_l, w_{l+1} \mathbf{F}^{(2)} \right], \tag{9}$$

where $w_1, w_2, \ldots, w_l$ are fusion weights for the first-layer RNNs, and $w_{l+1}$ is the fusion weight for the second-layer RNN, whose output feature is $\mathbf{F}^{(2)}$. These weights can be integrated into the whole network, and their optimal values are automatically learned from data. As in the original two-layer RNN model, we use Equation (3) to construct the loss function and the BPTT algorithm to optimize it.
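A small sketch of the weighted concatenation used by the feature-level strategy is given below. The fusion weights are fixed numbers here purely for illustration, whereas in the model they are learned together with the other parameters; the function name and argument layout are assumptions.

```python
def fuse_features(first_layer_feats, second_layer_feat, weights):
    """Scale each feature vector by its fusion weight and concatenate the results.

    first_layer_feats: list of l feature vectors from the first-layer RNNs
    second_layer_feat: feature vector from the second-layer RNN
    weights: l + 1 scalars, the last one belonging to the second layer
    """
    if len(weights) != len(first_layer_feats) + 1:
        raise ValueError("need one weight per first-layer feature plus one")
    fused = []
    for w, feat in zip(weights, first_layer_feats + [second_layer_feat]):
        fused.extend(w * value for value in feat)
    return fused

fused = fuse_features([[1.0, 2.0], [3.0, 4.0]], [5.0, 6.0, 7.0], [0.5, 2.0, 1.0])
print(fused)  # [0.5, 1.0, 6.0, 8.0, 5.0, 6.0, 7.0]
```

The fused vector has length l times the first-layer hidden size plus the second-layer hidden size, which is therefore the input size of the output layer under this strategy.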
Different from the first improvement strategy, our second strategy is based on the output-level connection. As shown in Fig. 3, we feed the features extracted by the first-layer RNNs into their own output layers, so that the first-layer RNNs can learn more discriminative features. Combining these features using the second-layer RNN then results in a better second-layer feature $\mathbf{F}^{(2)}$. In particular, each first-layer feature $\mathbf{F}^{(1)}_i$ is fed into an output layer to construct a loss function $\mathcal{L}_i$. Meanwhile, we also feed $\mathbf{F}^{(2)}$ into an output layer and construct another loss function $\mathcal{L}_{l+1}$. After that, a weighted summation is used to combine them, which can be formulated as

$$\mathcal{L} = \sum_{i=1}^{l} w_i \mathcal{L}_i + w_{l+1} \mathcal{L}_{l+1}, \tag{10}$$

where $w_i$ and $w_{l+1}$ are fusion weights, and $\mathcal{L}_i$ and $\mathcal{L}_{l+1}$ are derived from Equation (3). The final loss function can be optimized using the BPTT algorithm. In the prediction phase, we can delete the output layers of the first-layer RNNs and use the output of the second-layer RNN as the final classification result.
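The output-level strategy amounts to a weighted sum of per-head losses, which the following sketch makes concrete for a single training sample. The per-head loss is the negative log-probability of the true class, used as a simple stand-in for the loss of Equation (3); the weights are fixed here, although the text states that they are learned, and all names are hypothetical.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class for one sample."""
    return -math.log(probs[label])

def combined_loss(first_layer_probs, second_layer_probs, label, weights):
    """Weighted sum of the losses of all first-layer heads and the second-layer head."""
    losses = [cross_entropy(p, label) for p in first_layer_probs]
    losses.append(cross_entropy(second_layer_probs, label))
    if len(weights) != len(losses):
        raise ValueError("need one weight per output layer")
    return sum(w * value for w, value in zip(weights, losses))

# Two first-layer heads and one second-layer head, two classes, true class 1.
total_loss = combined_loss([[0.5, 0.5], [0.25, 0.75]], [0.1, 0.9], 1, [1.0, 1.0, 2.0])
print(total_loss)
```

At prediction time only the second-layer head is kept, so the extra heads add cost during training but not during inference.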
II-D Spectral-spatial Cascaded RNNs
Due to the effects of atmosphere, instrument noise, and natural spectrum variations, materials from the same class may have very different spectral responses, while those from different classes may have similar spectral responses. If we only use the spectral information, the resulting classification maps will have many outliers, which is known as the “salt and pepper” phenomenon. As three-dimensional cubes, HSIs also have rich spatial information, which can be used as a complement to address this issue. Among numerous deep learning models, CNNs have demonstrated their superiority in spatial feature extraction. A typical two-dimensional CNN has been designed to extract spatial features from HSIs, with the first principal component of the HSI as its input.
Inspired by the two-dimensional CNN model, we extend the cascaded RNN model to its spectral-spatial version by adding some convolutional layers. Fig. 4 shows the flowchart of the proposed spectral-spatial cascaded RNN model. For a given pixel $\mathbf{x}$, we select a small cube centered at it. Then, we split this cube into $k$ matrices $\mathbf{X}_1, \ldots, \mathbf{X}_k$ across the spectral domain, one per band. Each $\mathbf{X}_t$ is fed into several convolutional layers to learn spatial features. Following that two-dimensional CNN model, we also use three convolutional layers, and the first two layers are followed by pooling layers. The input size is . The sizes of the three convolutional filters are , and , respectively. After these convolutional operators, each $\mathbf{X}_t$ generates a spatial feature vector $\mathbf{s}_t$. Similar to the cascaded RNN model, we can consider $(\mathbf{s}_1, \ldots, \mathbf{s}_k)$ as a sequence whose length is $k$. This sequence is divided into $l$ sub-sequences, which are fed into the first-layer RNNs to reduce the redundancy inside each sub-sequence. The outputs of the first-layer RNNs are combined again to generate another sequence, which is fed into the second-layer RNN to learn complementary information.
Compared to the cascaded RNN model, the spectral-spatial cascaded RNN model is deeper and more difficult to train. Therefore, we propose a transfer learning method to train it. Specifically, we first pre-train the convolutional layers using all of the band matrices $\mathbf{X}_t$. To do so, we replace the two-layer RNNs by an output layer whose size is the number of classes $C$, and we assume that the label of each $\mathbf{X}_t$ equals the label of its corresponding pixel $\mathbf{x}$. With $N$ training pixels and $k$ bands, this yields $N \times k$ samples, which are used to train the convolutional layers. After that, the weights of these convolutional layers are fixed and the training samples are used again to train the two-layer RNNs. Finally, the whole network is fine-tuned starting from the learned parameters.
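The construction of the pre-training set can be sketched as follows: every band patch of a labelled cube inherits the label of the centre pixel, so N cubes with k bands give N times k samples. The band-first cube layout, the function name, and the stage descriptions are assumptions for illustration; the three stages simply restate the training schedule described above.

```python
def band_pretraining_samples(cubes, labels):
    """Turn N labelled cubes with k band patches each into N * k labelled patches.

    cubes: list of cubes, each cube being a list of k two-dimensional patches
    labels: class label of the centre pixel of each cube
    """
    samples = []
    for cube, label in zip(cubes, labels):
        for band_patch in cube:
            # Each band patch is assumed to share the label of its pixel.
            samples.append((band_patch, label))
    return samples

# Training schedule restated as data: which parts are updated at each stage.
TRAINING_STAGES = (
    ("pretrain_conv", "conv layers + temporary output layer, RNNs removed"),
    ("train_rnns", "two-layer RNNs only, conv weights fixed"),
    ("fine_tune", "whole network"),
)

# Two made-up cubes with three bands each and 2 x 2 patches.
cube_a = [[[0.1, 0.2], [0.3, 0.4]] for _ in range(3)]
cube_b = [[[0.5, 0.6], [0.7, 0.8]] for _ in range(3)]
samples = band_pretraining_samples([cube_a, cube_b], [4, 7])
print(len(samples), [label for _, label in samples])
```

Treating each band patch as an independent sample enlarges the pre-training set by a factor of k, which is what makes pre-training the convolutional layers feasible with few labelled pixels.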
III-A Data Description
Our experiments are conducted on two HSIs, which are widely used to evaluate classification algorithms.
Indian Pines Data: The first data set was acquired by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana, USA, on June 12, 1992. The original data set contains 224 spectral bands. We utilize 200 of them after removing four bands containing zero values and 20 noisy bands affected by water absorption. The spatial size of the image is 145 × 145 pixels, and the spatial resolution is 20 m. The numbers of training and test pixels are reported in Table I. Fig. 5 shows the false-color image, as well as the training and test maps of this data set.
Table I: Numbers of training and test pixels for each class of the Indian Pines data (columns: Class No., Class Name, Training, Test).
Pavia University Scene Data: The second data set was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy, on July 8, 2002. The original image was recorded with 115 spectral channels ranging from 0.43 to 0.86 μm. After removing noisy bands, 103 bands are used. The image size is 610 × 340 pixels with a spatial resolution of 1.3 m. There are nine land cover classes with more than 1000 labeled pixels for each class. The numbers of pixels for training and test are listed in Table II. Their corresponding distribution maps are shown in Fig. 6.
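Table II: Numbers of training and test pixels for each class of the Pavia University data (columns: Class No., Class Name, Training, Test).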
III-B Experimental Setup
In order to highlight the effectiveness of our proposed models, we compare them with SVM, one-dimensional CNN (1D-CNN), two-dimensional CNN (2D-CNN), and the original RNN using GRU (RNN). For simplicity, the cascaded RNN model using GRUs is abbreviated as CasRNN; the two improvement methods of CasRNN based on feature-level and output-level connections are abbreviated as CasRNN-F and CasRNN-O, respectively; the spectral-spatial CasRNN is abbreviated as SSCasRNN. Some of their explanations are summarized as follows.
SVM: The input of SVM is the original spectrum signature. We choose Gaussian kernel as its kernel function. The penalty parameter and the spread of the Gaussian kernel are selected from a candidate set using a fivefold cross-validation method.
RNN: GRU is used as the basic unit of RNN. The number of hidden nodes is chosen from a candidate set via a fivefold cross-validation method.
The deep learning models are implemented with the PyTorch framework. To optimize them, we use a mini-batch stochastic gradient descent algorithm. The batch size, the learning rate and the number of training epochs are set to 64, 0.001 and 300, respectively. For SVM, we use the LIBSVM package in MATLAB. All of the experiments are run on a personal computer with an Intel Core i7-4790 3.60 GHz processor, 32 GB of RAM, and a GTX TITAN X graphics card.
The classification performance of each model is evaluated by the overall accuracy (OA), the average accuracy (AA), the per-class accuracy, and the Kappa coefficient. OA is the ratio of the number of correctly classified pixels to the total number of pixels in the test set, AA is the average of the per-class accuracies, and Kappa is the percentage of agreement corrected by the amount of agreement that would be expected purely by chance.
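For concreteness, the three summary scores can be computed from a confusion matrix as in the sketch below. The function name and the toy matrix are illustrative assumptions, rows are taken to be true classes and columns predicted classes, and every class is assumed to have at least one test pixel.

```python
def classification_scores(confusion):
    """Return (OA, AA, Kappa) from a square confusion matrix.

    confusion[i][j] counts test pixels of true class i predicted as class j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))
    oa = correct / total
    # Per-class accuracy: correctly classified pixels over all pixels of that class.
    per_class = [confusion[i][i] / sum(confusion[i]) for i in range(n)]
    aa = sum(per_class) / n
    # Agreement expected by chance, from the row and column marginals.
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    expected = sum(sum(confusion[i]) * col_sums[i] for i in range(n)) / (total * total)
    kappa = (oa - expected) / (1.0 - expected)
    return oa, aa, kappa

# Toy two-class example: OA = 0.85, AA = 0.85, chance agreement = 0.5, Kappa = 0.7.
oa, aa, kappa = classification_scores([[40, 10], [5, 45]])
print(oa, aa, kappa)
```

OA and AA differ when the classes are unbalanced, because AA gives every class the same weight regardless of how many test pixels it has.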
III-C Parameter Analysis
There are three important hyperparameters in the proposed models: the number of sub-sequences $l$, and the sizes of the hidden layers in the first-layer RNN and the second-layer RNN. To test their effects on the classification performance, we first fix $l$ and select the sizes of the hidden layers from a candidate set. Then, we fix the sizes of the hidden layers and choose $l$ from another candidate set. Since the same hyperparameter values are used for CasRNN and its two improvements (i.e., CasRNN-F and CasRNN-O), we only show the performance of CasRNN here, in Fig. 7. In this three-dimensional diagram, the first two axes correspond to the numbers of hidden nodes in the first-layer RNN and the second-layer RNN, respectively, while the third axis represents the classification accuracy OA. From this figure, we can observe that when and , CasRNN can achieve better OA than the other values on the Indian Pines data. The best OA appears when and . For the Pavia University data, OA varies somewhat more than on the Indian Pines data, but we can still find the best value when and . Similarly, Fig. 8 shows the OA values achieved by SSCasRNN using different hidden sizes. We can see the optimal parameter values are for the Indian Pines data, and for the Pavia University data, respectively.
Fig. 9 and Fig. 10 evaluate the effects of the number of sub-sequences $l$ on classifying the Indian Pines and the Pavia University data sets, respectively. In these figures, different colors represent different models, namely CasRNN, CasRNN-F, CasRNN-O and SSCasRNN. As $l$ increases, the OAs achieved by these models tend to first increase and then decrease. Given the same $l$, SSCasRNN significantly outperforms the other three models. For the Indian Pines data, the maximal OAs of the four models appear at the same $l$, so the optimal value is set to 10 for all of them. Different from the Indian Pines data, the four models have different optimal values on the Pavia University data. As shown in Fig. 10, the optimal value is 4 for SSCasRNN, and 8 for the other three models.
III-D Performance Comparison
In this section, we report quantitative and qualitative results of our proposed models and compare them with the other state-of-the-art models. Table III reports the detailed classification results of different models on the Indian Pines data, including OA, AA, Kappa and class-specific accuracy. The bold fonts in each row denote the best results. Several conclusions can be drawn from this table. First, if we directly input all spectral bands into RNN, its OA, AA and Kappa values are 69.82%, 75.42% and 65.87%, respectively, which are all lower than those achieved by the SVM and 1D-CNN models. This indicates that RNN cannot fully exploit the long spectral sequence of HSIs. In contrast, by considering the redundant and complementary properties of the spectral signature, our proposed model CasRNN improves the performance of RNN by about 4 percentage points, thus outperforming SVM and 1D-CNN. Second, compared to CasRNN, CasRNN-F and CasRNN-O obtain better results, which validates the effectiveness of the two improvement strategies. In terms of per-class accuracy, CasRNN-F increases almost all of them in comparison with CasRNN, so it might be more powerful than CasRNN-O on the Indian Pines data. Third, compared to the spectral classification models, 2D-CNN significantly improves the classification results by about 10 percentage points. This means that spatial information is very important for the Indian Pines data, because there are many large and homogeneous objects, as shown in Fig. 5(c). By incorporating the spatial information into the CasRNN model, our proposed model SSCasRNN further increases the performance to above 90%. Besides, it obtains the highest accuracies in 15 different classes, which confirms the effectiveness of SSCasRNN.
In addition to the quantitative results, we also visualize the classification results of different models in Fig. 11. Different colors in this figure correspond to different classes. Compared to the ground-truth map in Fig. 5(c), spectral classification models (i.e., SVM, 1D-CNN, RNN, CasRNN, CasRNN-F and CasRNN-O) have many outliers in their classification maps due to the spectral variability of materials. This phenomenon can be alleviated by 2D-CNN, because it makes use of the spatial contextual information instead of the spectral information. For homogeneous regions, especially large objects, 2D-CNN performs very well. However, it easily results in an over-smoothing problem, especially for small objects, as demonstrated in Fig. 11(g). Different from 2D-CNN and the spectral models, SSCasRNN takes advantage of spectral and spatial information simultaneously. As shown in Fig. 11(h), it has significantly fewer outliers than the spectral models, and retains more boundary details of objects than 2D-CNN.
Table IV and Fig. 12 show the classification results of different models on the Pavia University data. Similar conclusions can be drawn from them. Among the spectral models, CasRNN is better than RNN, while CasRNN-F and CasRNN-O are superior to CasRNN. All of these models exhibit the “salt and pepper” phenomenon in their classification maps. Compared to the best spectral model, 2D-CNN improves OA and Kappa by more than 5 percentage points. In addition, it generates fewer outliers and leads to a more homogeneous classification map. Nevertheless, without using the spectral information, its performance is not very high, and the classification map is easily over-smoothed. By combining the spectral and spatial information, our proposed model SSCasRNN can alleviate these issues. It improves OA from 86.18% to 90.30%, and generates more details in the classification map. However, in comparison with the Indian Pines data, the classification results achieved by SSCasRNN are still not very high. One possible reason is that there exist many small objects in the Pavia University data, which increases the difficulty of exploring spatial features.
In this paper, we proposed a cascaded RNN model for HSI classification. Compared to the original RNN model, our proposed model can fully explore the redundant and complementary information of the high-dimensional spectral signature. Based on it, we designed two improvement strategies by constructing connections between the first-layer RNN and the output layer, thus generating more discriminative spectral features. Additionally, considering the importance of spatial information, we further extended the proposed model into its spectral-spatial version to learn spectral and spatial features simultaneously. To test the effectiveness of the proposed models, we compared them with several state-of-the-art models on two widely used HSIs. The experimental results demonstrate that the cascaded RNN model can obtain higher performance than RNN, and its modifications can further improve the performance. Besides, we also thoroughly evaluated the effects of different hyperparameters on the classification performance of the proposed models, including the hidden sizes and the number of sub-sequences. In the future, more experiments will be conducted to validate the effectiveness of our proposed models. In addition, more powerful spectral-spatial models will be explored. Since the sizes and shapes of different objects vary, using the patches or cubes with same sizes as inputs easily leads to the loss of spatial information.
-  Pedram Ghamisi, Naoto Yokoya, Jun Li, Wenzhi Liao, Sicong Liu, Javier Plaza, Behnood Rasti, and Antonio Plaza, “Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 37–78, 2017.
-  Giorgos Mountrakis, Jungho Im, and Caesar Ogole, “Support vector machines in remote sensing: A review,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 66, no. 3, pp. 247–259, 2011.
-  Mariana Belgiu and Lucian Drăguţ, “Random forest in remote sensing: A review of applications and future directions,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 114, pp. 24–31, 2016.
-  Wei Li, Chen Chen, Hongjun Su, and Qian Du, “Local binary patterns and extreme learning machine for hyperspectral imagery classification.,” IEEE Trans. Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3681–3693, 2015.
-  Xiuping Jia, Bor-Chen Kuo, and Melba M Crawford, “Feature mining for hyperspectral image classification,” Proceedings of the IEEE, vol. 101, no. 3, pp. 676–697, 2013.
-  Wenzhi Liao, Aleksandra Pizurica, Paul Scheunders, Wilfried Philips, and Youguo Pi, “Semisupervised local discriminant analysis for feature extraction in hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 1, pp. 184–198, 2013.
-  Renlong Hang, Qingshan Liu, Huihui Song, and Yubao Sun, “Matrix-based discriminant subspace ensemble for hyperspectral image spatial–spectral feature fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 2, pp. 783–794, 2016.
-  Renlong Hang, Qingshan Liu, Yubao Sun, Xiaotong Yuan, Hucheng Pei, Javier Plaza, and Antonio Plaza, “Robust matrix discriminative analysis for feature extraction from hyperspectral images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 5, pp. 2002–2011, 2017.
-  Dalton Lunga, Saurabh Prasad, Melba M Crawford, and Okan Ersoy, “Manifold-learning-based feature extraction for classification of hyperspectral data: A review of advances in manifold learning,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 55–66, 2014.
-  Renlong Hang and Qingshan Liu, “Dimensionality reduction of hyperspectral image using spatial regularized local graph discriminant embedding,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 9, pp. 3262–3271, 2018.
-  Wenzhi Zhao, William Emery, Yanchen Bo, and Jiage Chen, “Land cover mapping with higher order graph-based co-occurrence model,” Remote Sensing, vol. 10, no. 11, pp. 1713, 2018.
-  Yi Chen, Nasser M Nasrabadi, and Trac D Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3973–3985, 2011.
-  Leyuan Fang, Shutao Li, Xudong Kang, and Jón Atli Benediktsson, “Spectral–spatial hyperspectral image classification via multiscale adaptive sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 12, pp. 7738–7749, 2014.
-  Jiayi Li, Hongyan Zhang, Yuancheng Huang, and Liangpei Zhang, “Hyperspectral image classification by nonlocal joint collaborative representation with a locally adaptive dictionary,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 6, pp. 3707–3719, 2014.
-  Wei Li and Qian Du, “Joint within-class collaborative representation for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2200–2208, 2014.
-  Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436, 2015.
-  Jürgen Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
-  Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
-  Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems, 2014, pp. 3104–3112.
-  Liangpei Zhang, Lefei Zhang, and Bo Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
-  Xiao Xiang Zhu, Devis Tuia, Lichao Mou, Gui-Song Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer, “Deep learning in remote sensing: a comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
-  Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
-  Chao Tao, Hongbo Pan, Yansheng Li, and Zhengrou Zou, “Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 12, pp. 2438–2442, 2015.
-  Xiaorui Ma, Hongyu Wang, and Jie Geng, “Spectral–spatial classification of hyperspectral image based on deep auto-encoder,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 9, pp. 4073–4085, 2016.
-  Yushi Chen, Xing Zhao, and Xiuping Jia, “Spectral–spatial classification of hyperspectral data based on deep belief network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2381–2392, 2015.
-  Xichuan Zhou, Shengli Li, Fang Tang, Kai Qin, Shengdong Hu, and Shujun Liu, “Deep learning with grouped features for spatial spectral classification of hyperspectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 1, pp. 97–101, 2017.
-  Ping Zhong, Zhiqiang Gong, Shutao Li, and Carola-Bibiane Schönlieb, “Learning to diversify deep belief networks for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3516–3530, 2017.
-  Yunsong Li, Weiying Xie, and Huaqing Li, “Hyperspectral image reconstruction by deep convolutional neural network for classification,” Pattern Recognition, vol. 63, pp. 371–383, 2017.
-  Wenzhi Zhao, Shihong Du, and William J Emery, “Object-based convolutional neural network for high-resolution imagery classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 7, pp. 3386–3396, 2017.
-  Mengmeng Zhang, Wei Li, and Qian Du, “Diverse region-based CNN for hyperspectral image classification,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623–2634, 2018.
-  Wei Hu, Yangyu Huang, Li Wei, Fan Zhang, and Hengchao Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015, 2015.
-  Lin He, Jun Li, Chenying Liu, and Shutao Li, “Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 3, pp. 1579–1597, 2018.
-  Pedram Ghamisi, Emmanuel Maggiori, Shutao Li, Roberto Souza, Yuliya Tarabalka, Gabriele Moser, Andrea De Giorgi, Leyuan Fang, Yushi Chen, Mingmin Chi, et al., “New frontiers in spectral-spatial hyperspectral image classification: The latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning,” IEEE Geoscience and Remote Sensing Magazine, vol. 6, no. 3, pp. 10–43, 2018.
-  Yushi Chen, Hanlu Jiang, Chunyang Li, Xiuping Jia, and Pedram Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 6232–6251, 2016.
-  Ying Li, Haokui Zhang, and Qiang Shen, “Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sensing, vol. 9, no. 1, pp. 67, 2017.
-  Cheng Shi and Chi-Man Pun, “Superpixel-based 3D deep neural networks for hyperspectral image classification,” Pattern Recognition, vol. 74, pp. 600–616, 2018.
-  Jingxiang Yang, Yong-Qiang Zhao, and Jonathan Cheung-Wai Chan, “Learning and transferring deep joint spectral–spatial features for hyperspectral classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4729–4742, 2017.
-  Xiaodong Xu, Wei Li, Qiong Ran, Qian Du, Lianru Gao, and Bing Zhang, “Multisource remote sensing data classification based on convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 937–949, 2018.
-  Siyuan Hao, Wei Wang, Yuanxin Ye, Tingyuan Nie, and Lorenzo Bruzzone, “Two-stream deep architecture for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2349–2361, 2018.
-  Hao Wu and Saurabh Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sensing, vol. 9, no. 3, pp. 298, 2017.
-  Qingshan Liu, Feng Zhou, Renlong Hang, and Xiaotong Yuan, “Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification,” Remote Sensing, vol. 9, no. 12, pp. 1330, 2017.
-  Feng Zhou, Renlong Hang, Qingshan Liu, and Xiaotong Yuan, “Hyperspectral image classification using spectral-spatial LSTMs,” Neurocomputing, 2018.
-  Feng Zhou, Renlong Hang, Qingshan Liu, and Xiaotong Yuan, “Integrating convolutional neural network and gated recurrent unit for hyperspectral image spectral-spatial classification,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 409–420.
-  Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
-  Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
-  Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.