I Introduction
With the development of imaging technology, current hyperspectral sensors can fully portray the surface of the Earth using hundreds of continuous and narrow spectral bands, ranging from the visible to the shortwave infrared spectrum. The generated hyperspectral image (HSI) is often considered as a three-dimensional cube. The first two dimensions are spatial, recording the location of each object. The third is the spectral dimension, which captures the spectral signature (reflective or emissive properties) of each material in different bands along the electromagnetic spectrum [1]. Using such rich information, HSIs have been widely applied to various applications, such as land cover/land use classification, precision agriculture, and change detection. For these applications, one basic but important procedure is HSI classification, whose goal is to assign a candidate class label to each pixel.
In order to acquire accurate classification results, numerous methods have been proposed. For example, one can directly consider the rich spectral signature as features and feed them into advanced classifiers, such as support vector machine (SVM)
[2, 3] and extreme learning machine [4]. However, due to the dense spectral sampling of HSIs, there may exist redundant information among adjacent spectral bands. This easily leads to the so-called curse of dimensionality (the Hughes effect), which causes a sudden drop in classification accuracy when the high number of spectral channels is not balanced by a sufficient number of training samples. Therefore, a large number of works have been proposed to mine discriminative features from the high-dimensional spectral signature
[5]. Popular models include principal component analysis (PCA), linear discriminant analysis (LDA) [6, 7, 8], and graph embedding [9, 10, 11]. Besides, representation-based models have also been applied to HSI classification in recent years. In [12] and [13], sparse representation was proposed to learn discriminative features from HSIs. Similarly, collaborative representation has also been widely explored [14, 15]. In these models, an input spectral signature is represented by a linear combination of atoms from a dictionary, and the classification result can be derived from the reconstruction residual without training extra classifiers, which often costs much time.

Although the aforementioned models have demonstrated their effectiveness in the field of HSI classification, some drawbacks remain. For the traditional feature extraction models, we need to predefine a mining criterion (e.g., maximizing the between-class scatter matrix in LDA), which heavily depends on domain knowledge and expert experience. For the representation-based models, the goal is to reconstruct the input signal, leading to suboptimal representations for classification. Additionally, all of them can be considered shallow models, which limits their potential to learn high-level semantic features. Recently, deep learning
[16, 17], a very active research topic in machine learning, has shown huge superiority in most fields of computer vision
[18, 19, 20, 21] and natural language processing
[22, 23]. The goal of deep learning is to learn nonlinear, high-level semantic features from data in a hierarchical manner.

Due to the effects of multipath scattering and the heterogeneity of subpixel constituents, HSI data often lie in a nonlinear and complex feature space. Deep learning can be naturally adopted to deal with this issue [24, 25]. In the past few years, many deep learning models have been successfully applied to HSI classification. For example, in [26, 27, 28]
, the autoencoder model has been used to learn deep features directly from the high-dimensional spectral signature. Similar to the autoencoder, the deep belief network has also been explored to extract spectral features
[29, 30, 31]. However, both of them are fully-connected networks, which contain large numbers of parameters to train. Different from them, convolutional neural networks (CNNs) have local connection and weight sharing properties, thus largely reducing the number of trainable parameters
[32, 33, 34]. In [35], Hu et al. proposed to use a one-dimensional CNN to learn and represent the spectral information. This model is comprised of an input layer, a convolutional layer, a pooling layer, a fully-connected layer, and an output layer. The whole model is trained in an end-to-end manner, thus achieving satisfying results for HSI classification.

Besides spectral information, HSIs also have rich spatial information. How to combine them has been an active research topic in the field of HSI classification [36, 37]. One potential method is to extend a spectral classification model into its spectral-spatial counterpart. For instance, in [38, 39, 40], a three-dimensional CNN was employed for spectral-spatial classification of HSIs. However, due to the simultaneous convolution in both the spectral and spatial domains, the computational complexity is dramatically increased. In addition, the number of trainable parameters in three-dimensional CNNs is also a problem: in order to perform three-dimensional convolution, the dimensionality of the kernel (filter) must match that of the input, which heavily increases the number of parameters. Another candidate method for spectral-spatial classification is based on two-branch networks, with one branch for spectral classification and the other for spatial classification. In [41, 42, 43], a one-dimensional CNN or autoencoder was used to learn spectral features and a two-dimensional CNN was designed to learn spatial features. These two kinds of features are then integrated via feature-level or decision-level fusion. For the two-dimensional CNNs, only a few principal components are extracted and used as inputs, thus reducing the computational cost compared to three-dimensional CNNs.
Most existing models can be considered vector-based methodologies. Recently, a few works attempted to regard HSI pixels as sequential data, so recurrent neural networks (RNNs) were naturally used to learn features. In [44], Wu et al. proposed using an RNN to extract spectral features from HSIs. In [45] and [46]
, a variant of RNN using long short-term memory (LSTM) units was designed to learn spectral-spatial features from HSIs. In
[47], another variant of RNN using gated recurrent units (GRUs) was employed. Compared to the widely explored CNN models, RNNs have several advantages. The key component of CNNs is the convolutional operator; due to its limited kernel size, one-dimensional CNNs can only learn local spectral dependencies while easily ignoring the effects of non-adjacent spectral bands. In contrast, RNNs, especially those using GRU or LSTM units, input spectral bands one by one via recurrent operators, thus capturing relationships across the whole set of spectral bands. Besides, RNNs often have fewer parameters to train than CNNs, so they are more efficient in the training and inference phases.

Benefiting from their powerful ability to learn from sequential data, current RNN-related models often simply input the whole set of spectral bands to the network, which may not fully explore the redundant and complementary properties of HSIs. The redundant information between adjacent spectral bands increases the computational burden of RNNs without improving the classification results. Sometimes such redundancy may even reduce the classification accuracy, since it increases within-class variance and decreases between-class variance in the feature space. Besides, it may also increase the difficulty of learning complementary information. To address these issues, we propose a cascaded RNN model using gated recurrent units (GRUs) in this paper. This model mainly consists of two RNN layers. The first RNN layer focuses on reducing the redundant information of adjacent spectral bands. The reduced information is then fed into the second RNN layer to learn complementary features. Besides, in order to improve the discriminative ability of the learned features, we design two strategies for the proposed model. Finally, we also extend the proposed model to its spectral-spatial version by incorporating several convolutional layers.
The major contributions of this paper are summarized as follows.

We propose a cascaded RNN model with GRUs for HSI classification. Compared to the existing RNN-related models, our model can sufficiently consider the redundant and complementary information of HSIs via two RNN layers: the first reduces redundancy and the second learns complementarity. These two layers are integrated to generate an end-to-end trainable model.

In order to learn more discriminative features, we design two strategies to construct connections between the first RNN layer and the output layer. The first strategy is the weighted fusion of features from the two layers, and the second is the weighted combination of their loss functions. The weights can be adaptively learned from the data.

To capture spectral and spatial features simultaneously, we further extend the proposed model to its spectral-spatial counterpart. A few convolutional layers are integrated into the proposed model to learn spatial features from each band, and these features are then combined via recurrent operators.
The rest of this paper is organized as follows. Section II describes the details of the proposed models, including a brief introduction of RNN, and the structure of the proposed model as well as its modifications. The descriptions of data sets and experimental results are given in Section III. Finally, Section IV concludes this paper.
II Methodology
As shown in Fig. 1, the proposed cascaded RNN model mainly consists of four steps. For a given pixel, we first divide its spectral bands into different groups. Then, for each group, we consider the spectral bands in it as a sequence, which is fed into an RNN layer to learn features. After that, the learned features from all groups are again regarded as a sequence and fed into another RNN layer to learn their complementary information. Finally, the output of the second RNN layer is connected to a softmax layer to derive the classification result.
II-A Review of RNN
RNN has been widely used for sequential data analysis, such as speech recognition and machine translation [23, 48]. Assume that we have a sequence $\mathbf{x} = (x_1, x_2, \dots, x_T)$, where $x_t$ generally represents the information at the $t$-th time step. When applying RNN to HSI classification, $x_t$ corresponds to the spectral value at the $t$-th band. For RNN, the output of the hidden layer at time $t$ is

$$\mathbf{h}_t = \varphi(\mathbf{W}x_t + \mathbf{U}\mathbf{h}_{t-1} + \mathbf{b}), \tag{1}$$

where $\varphi$ is a nonlinear activation function such as the logistic sigmoid or hyperbolic tangent, $\mathbf{b}$ is a bias vector, $\mathbf{h}_{t-1}$ is the output of the hidden layer at the previous time step, and $\mathbf{W}$ and $\mathbf{U}$ denote the weight matrices from the current input to the hidden layer and from the previous hidden layer to the current hidden layer, respectively. From this equation, we can observe that contextual relationships in the time domain are constructed via the recurrent connection. Ideally, $\mathbf{h}_T$ can capture most of the temporal information of the sequence.

For classification tasks, $\mathbf{h}_T$
is often fed into an output layer, and the probability that the sequence belongs to the $k$-th class can be derived using a softmax function. These processes can be formulated as

$$P(y = k \mid \mathbf{x}) = \mathrm{softmax}_k(\mathbf{V}\mathbf{h}_T + \mathbf{c}) = \frac{\exp(\mathbf{v}_k^{\top}\mathbf{h}_T + c_k)}{\sum_{j=1}^{K}\exp(\mathbf{v}_j^{\top}\mathbf{h}_T + c_j)}, \tag{2}$$

where $\mathbf{c}$ is a bias vector, $\mathbf{V}$ is the weight matrix from the hidden layer to the output layer, $\mathbf{v}_k$ and $c_k$ (the $k$-th row of $\mathbf{V}$ and the $k$-th entry of $\mathbf{c}$) are the parameters of the softmax function, and $K$ is the number of classes to discriminate. All of the weight parameters in Equations (1) and (2) can be trained using the following loss function:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log P(y = y_i \mid \mathbf{x}_i), \tag{3}$$

where $N$ is the number of training samples, and $y_i$ and $\hat{y}_i = \arg\max_k P(y = k \mid \mathbf{x}_i)$ are the true label and the predicted label of the $i$-th training sample, respectively. This function can be optimized using the backpropagation through time (BPTT) algorithm.
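As a concrete reference, Eqs. (1)-(3) can be sketched in a few lines of PyTorch. This is a minimal sketch with illustrative sizes; `VanillaRNNClassifier` and its dimensions are our own naming, not the paper's.

```python
import torch
import torch.nn as nn

class VanillaRNNClassifier(nn.Module):
    """Minimal RNN of Eqs. (1)-(2): h_t = tanh(W x_t + U h_{t-1} + b),
    with a linear output layer whose logits feed a softmax."""
    def __init__(self, input_size, hidden_size, num_classes):
        super().__init__()
        self.W = nn.Linear(input_size, hidden_size)               # W x_t + b
        self.U = nn.Linear(hidden_size, hidden_size, bias=False)  # U h_{t-1}
        self.out = nn.Linear(hidden_size, num_classes)            # V h_T + c

    def forward(self, x):                     # x: (batch, T, input_size)
        h = torch.zeros(x.size(0), self.U.in_features)
        for t in range(x.size(1)):            # recurrent connection, Eq. (1)
            h = torch.tanh(self.W(x[:, t]) + self.U(h))
        return self.out(h)                    # logits; softmax is inside the loss

model = VanillaRNNClassifier(input_size=1, hidden_size=8, num_classes=3)
x = torch.randn(4, 20, 1)                     # 4 pixels, 20 bands, one value each
logits = model(x)
# Eq. (3): cross-entropy loss; autograd unrolls the loop, i.e., BPTT.
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1, 2, 0]))
```

Calling `loss.backward()` then propagates gradients through every time step of the loop, which is exactly what BPTT denotes.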
II-B Cascaded RNNs
HSIs can be described as a three-dimensional matrix $\mathcal{X} \in \mathbb{R}^{W \times H \times B}$, where $W$, $H$ and $B$ represent the width, height and number of spectral bands, respectively. A given pixel $\mathbf{x} = (x_1, x_2, \dots, x_B)$ can be considered as a sequence of length $B$, so RNN can be naturally employed to learn spectral features. However, HSIs often contain hundreds of bands, making $\mathbf{x}$ a very long sequence. Such a long sequence increases the training difficulty, since the gradients tend to either vanish or explode [49]. To address this issue, one popular method is to design a more sophisticated activation function using gating units, such as the LSTM unit and the GRU [50]. Compared to the LSTM unit, the GRU has fewer parameters [49], which may be more suitable for HSI classification because of the usually limited number of training samples. Therefore, we select the GRU as the basic unit of RNN in this paper.
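The parameter saving that motivates the GRU choice is easy to verify: for the same hidden size, a GRU layer has three gate blocks against an LSTM's four, i.e., roughly three quarters of the parameters. A quick PyTorch check (the sizes are illustrative):

```python
import torch.nn as nn

def n_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Same input/hidden sizes; only the recurrent cell type differs.
gru = nn.GRU(input_size=1, hidden_size=64)
lstm = nn.LSTM(input_size=1, hidden_size=64)
ratio = n_params(gru) / n_params(lstm)   # 3 gate blocks vs. 4
```

With identical sizes the ratio is exactly 3/4, independent of the hidden size.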
The core components of the GRU are two gating units that control the flow of information inside the unit. Instead of using Equation (1), the activation of the hidden layer for band $t$ is now formulated as

$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t, \tag{4}$$

where $\odot$ denotes element-wise multiplication and $\mathbf{z}_t$ is the update gate, which can be derived by

$$\mathbf{z}_t = \sigma(\mathbf{W}_z x_t + \mathbf{U}_z \mathbf{h}_{t-1} + \mathbf{b}_z), \tag{5}$$

where $\sigma$ is the sigmoid function, $\mathbf{W}_z$ is a weight vector (as $x_t$ is a scalar band value), and $\mathbf{U}_z$ is a weight matrix. Similarly, the candidate activation $\tilde{\mathbf{h}}_t$ can be computed by

$$\tilde{\mathbf{h}}_t = \tanh\big(\mathbf{W} x_t + \mathbf{U}(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}\big), \tag{6}$$

where $\mathbf{r}_t$ is the reset gate, which can be derived by

$$\mathbf{r}_t = \sigma(\mathbf{W}_r x_t + \mathbf{U}_r \mathbf{h}_{t-1} + \mathbf{b}_r). \tag{7}$$
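One GRU step can be written directly from Eqs. (4)-(7). This is a sketch; for generality the weights below treat $x_t$ as a vector of size `in_dim` rather than a scalar band value, and the random parameters are placeholders.

```python
import torch

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, W, U, b):
    """One GRU update following Eqs. (4)-(7)."""
    z = torch.sigmoid(x_t @ Wz + h_prev @ Uz + bz)        # update gate, Eq. (5)
    r = torch.sigmoid(x_t @ Wr + h_prev @ Ur + br)        # reset gate,  Eq. (7)
    h_cand = torch.tanh(x_t @ W + (r * h_prev) @ U + b)   # candidate,   Eq. (6)
    return (1 - z) * h_prev + z * h_cand                  # Eq. (4)

in_dim, hid = 1, 8
def rand_params():
    return torch.randn(in_dim, hid), torch.randn(hid, hid), torch.zeros(hid)

Wz, Uz, bz = rand_params()
Wr, Ur, br = rand_params()
W, U, b = rand_params()
h = gru_step(torch.randn(4, in_dim), torch.zeros(4, hid),
             Wz, Uz, bz, Wr, Ur, br, W, U, b)
```

Because Eq. (4) interpolates between $\mathbf{h}_{t-1}$ and a $\tanh$-bounded candidate, the new state stays in $(-1, 1)$ when the previous state does.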
Due to the dense spectral sampling of hyperspectral sensors, adjacent bands in HSIs contain some redundancy, while non-adjacent bands contain some complementarity. In order to take such information into account comprehensively, we propose a cascaded RNN model. Specifically, we divide the spectral sequence $\mathbf{x} = (x_1, \dots, x_B)$ into $n$ subsequences $(\mathbf{s}_1, \dots, \mathbf{s}_n)$, each of which consists of adjacent spectral bands. Except for the last subsequence $\mathbf{s}_n$, the length of each subsequence is $L = \lfloor B/n \rfloor$, where $\lfloor \cdot \rfloor$ denotes the largest integer less than or equal to its argument. Thus, for $i < n$, the $i$-th subsequence $\mathbf{s}_i$ is comprised of the following bands:

$$\mathbf{s}_i = \big(x_{(i-1)L+1}, x_{(i-1)L+2}, \dots, x_{iL}\big). \tag{8}$$

Then, we feed the $n$ subsequences into the first-layer RNNs, respectively. These RNNs have the same structure and share parameters, thus reducing the number of parameters to train. Within the subsequence $\mathbf{s}_i$, each band produces an output from the GRU. We use the output at the last band as the final feature representation of $\mathbf{s}_i$, denoted as $\mathbf{h}^i \in \mathbb{R}^{d_1}$, where $d_1$ is the size of the hidden layer in the first-layer RNN. After that, we combine $(\mathbf{h}^1, \dots, \mathbf{h}^n)$ into another sequence of length $n$, which is fed into the second-layer RNN to learn the complementary information. Similar to the first-layer RNNs, we use the output of the GRU at the last time step as the learned feature $\bar{\mathbf{h}}$. To get a classification result for $\mathbf{x}$, we input $\bar{\mathbf{h}}$ into an output layer whose size equals the number of candidate classes $K$. Both RNN layers have many weight parameters; we choose Equation (3) as the loss function and use the BPTT algorithm to optimize them simultaneously.
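A compact sketch of this two-layer design in PyTorch; the hidden sizes are illustrative, and the single shared `gru1` plays the role of the parameter-sharing first-layer RNNs.

```python
import torch
import torch.nn as nn

class CascadedRNN(nn.Module):
    """Sketch of the cascaded model: a shared first-layer GRU reduces each
    subsequence of adjacent bands to one feature h_i; a second-layer GRU then
    models complementarity across the n subsequence features."""
    def __init__(self, n_sub, hidden1, hidden2, num_classes):
        super().__init__()
        self.n_sub = n_sub
        self.gru1 = nn.GRU(1, hidden1, batch_first=True)   # shared parameters
        self.gru2 = nn.GRU(hidden1, hidden2, batch_first=True)
        self.out = nn.Linear(hidden2, num_classes)

    def forward(self, x):                  # x: (batch, B) spectral vectors
        L = x.size(1) // self.n_sub        # Eq. (8): near-equal-length groups
        feats = []
        for i in range(self.n_sub):        # the last group keeps the remainder
            sub = x[:, i * L:(i + 1) * L] if i < self.n_sub - 1 else x[:, i * L:]
            _, h = self.gru1(sub.unsqueeze(-1))   # last hidden state = h_i
            feats.append(h[-1])
        seq = torch.stack(feats, dim=1)    # (batch, n_sub, hidden1)
        _, h2 = self.gru2(seq)             # last hidden state = h-bar
        return self.out(h2[-1])            # logits for the softmax layer

model = CascadedRNN(n_sub=5, hidden1=16, hidden2=32, num_classes=9)
logits = model(torch.randn(4, 103))        # e.g., four 103-band pixels
```

Training it amounts to minimizing the cross-entropy of Eq. (3) over `logits`, which backpropagates through both GRU layers at once.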
II-C Improvement for Cascaded RNNs
As described in Subsection II-B, the second-layer RNN is directly connected to the output layer, so it may be optimized better than the first-layer RNNs. However, the performance of the first-layer RNNs affects the second-layer RNN. In order to improve the discriminative ability of the learned feature $\bar{\mathbf{h}}$, an intuitive method is to construct relations between the first-layer RNNs and the output layer. Here, we propose two strategies to achieve this goal. The first strategy is based on the feature-level connection shown in Fig. 2. Instead of feeding only the output of the second-layer RNN into the output layer, we feed all the output features from the first- and second-layer RNNs in a weighted concatenation manner. Specifically, the input of the output layer is computed as

$$\mathbf{h}_{\mathrm{fuse}} = \big[\alpha_1\mathbf{h}^1, \alpha_2\mathbf{h}^2, \dots, \alpha_n\mathbf{h}^n, \beta\bar{\mathbf{h}}\big], \tag{9}$$

where $\alpha_1, \dots, \alpha_n$ are the fusion weights for the first-layer RNNs, and $\beta$ is the fusion weight for the second-layer RNN. These weights are integrated into the whole network, and their optimal values are automatically learned from the data. As in the original two-layer RNN model, we use Equation (3) to construct the loss function and the BPTT algorithm to optimize it.
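The weighted concatenation of Eq. (9) can be realized with learnable scalars. A sketch with illustrative names: `first_feats` stacks the first-layer features and `second_feat` is the second-layer feature.

```python
import torch
import torch.nn as nn

class FeatureFusionHead(nn.Module):
    """Feature-level connection (first strategy): scale each first-layer
    feature by a learnable alpha_i and the second-layer feature by beta,
    then classify the concatenation, as in Eq. (9)."""
    def __init__(self, n_sub, hidden1, hidden2, num_classes):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(n_sub))   # learned from the data
        self.beta = nn.Parameter(torch.ones(1))
        self.out = nn.Linear(n_sub * hidden1 + hidden2, num_classes)

    def forward(self, first_feats, second_feat):
        # first_feats: (batch, n_sub, hidden1); second_feat: (batch, hidden2)
        weighted = first_feats * self.alpha.view(1, -1, 1)
        fused = torch.cat([weighted.flatten(1), self.beta * second_feat], dim=1)
        return self.out(fused)

head = FeatureFusionHead(n_sub=5, hidden1=16, hidden2=32, num_classes=9)
logits = head(torch.randn(4, 5, 16), torch.randn(4, 32))
```

Since `alpha` and `beta` are `nn.Parameter`s, gradient descent on the usual loss tunes them jointly with the rest of the network.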
Different from the first improvement strategy, our second strategy is based on the output-level connection. As shown in Fig. 3, we feed the features extracted by the first-layer RNNs into their own output layers, so that they can learn more discriminative features; combining these features via the second-layer RNN then results in a better $\bar{\mathbf{h}}$. In particular, we input the first-layer features $(\mathbf{h}^1, \dots, \mathbf{h}^n)$ into output layers and construct a loss function $\mathcal{L}_1$. Meanwhile, we also input $\bar{\mathbf{h}}$ into an output layer and construct another loss function $\mathcal{L}_2$. After that, a weighted summation combines them:

$$\mathcal{L} = \lambda_1\mathcal{L}_1 + \lambda_2\mathcal{L}_2, \tag{10}$$

where $\lambda_1$ and $\lambda_2$ are fusion weights, and $\mathcal{L}_1$ and $\mathcal{L}_2$ are derived from Equation (3). The final loss function can be optimized using the BPTT algorithm. In the prediction phase, we delete the output layers of the first-layer RNNs and use the output of the second-layer RNN as the final classification result.
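The combined objective of Eq. (10) is a plain weighted sum of cross-entropies. In the minimal sketch below, random logits stand in for the real network outputs, and fixed λ values are used for simplicity (they could equally be learnable parameters).

```python
import torch
import torch.nn as nn

# Output-level connection (second strategy): auxiliary softmax heads on the
# first-layer features contribute L1; the second-layer head contributes L2.
ce = nn.CrossEntropyLoss()
labels = torch.tensor([0, 1, 2, 0])
aux_logits = [torch.randn(4, 9, requires_grad=True) for _ in range(5)]
main_logits = torch.randn(4, 9, requires_grad=True)

lam1, lam2 = 0.5, 1.0                      # fusion weights of Eq. (10)
loss = (lam1 * sum(ce(l, labels) for l in aux_logits)
        + lam2 * ce(main_logits, labels))
loss.backward()                            # gradients reach both layers
```

At prediction time only `main_logits` would be used; the auxiliary heads exist purely to shape the first-layer features during training.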
II-D Spectral-Spatial Cascaded RNNs
Due to the effects of the atmosphere, instrument noise, and natural spectral variation, materials from the same class may have very different spectral responses, while those from different classes may have similar spectral responses. If we only use the spectral information, the resulting classification maps will have many outliers, which is known as the “salt and pepper” phenomenon. As a three-dimensional cube, HSIs also have rich spatial information, which can be used as a complement to address this issue. Among numerous deep learning models, CNNs have demonstrated their superiority in spatial feature extraction. In
[38], a typical two-dimensional CNN is designed to extract spatial features from HSIs, using the first principal component of the HSI as input.

Inspired by the two-dimensional CNN model, we extend the cascaded RNN model to its spectral-spatial version by adding several convolutional layers. Fig. 4 shows the flowchart of the proposed spectral-spatial cascaded RNN model. For a given pixel, we select a small cube centered at it. Then, we split this cube into $B$ matrices across the spectral domain. Each matrix $\mathbf{X}_t$ is fed into several convolutional layers to learn spatial features. The same as [38], we use three convolutional layers, where the first two are followed by pooling layers; the input size and the sizes of the three convolutional filters follow the settings in [38]. After these convolutional operators, each $\mathbf{X}_t$ generates a spatial feature vector $\mathbf{f}_t$. Similar to the cascaded RNN model, we can consider $(\mathbf{f}_1, \dots, \mathbf{f}_B)$ as a sequence of length $B$. This sequence is divided into $n$ subsequences, which are fed into the first-layer RNNs to reduce the redundancy inside each subsequence. The outputs of the first-layer RNNs are combined again to generate another sequence, which is fed into the second-layer RNN to learn the complementary information.
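The per-band convolutional front end can be sketched as follows; the channel counts, kernel sizes, pooling, and patch size here are illustrative, not the exact configuration of [38].

```python
import torch
import torch.nn as nn

class BandCNN(nn.Module):
    """Per-band spatial feature extractor: every band of the input cube is an
    s x s patch pushed through three small conv layers, the first two followed
    by pooling, producing one feature vector f_t per band."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),       # collapse the spatial grid
        )

    def forward(self, cube):               # cube: (batch, B, s, s)
        b, B, s, _ = cube.shape
        x = cube.reshape(b * B, 1, s, s)   # all bands share the same CNN
        f = self.net(x).flatten(1)         # (batch * B, feat_dim)
        return f.reshape(b, B, -1)         # one spatial feature per band

feats = BandCNN()(torch.randn(2, 103, 16, 16))   # ready for the cascaded RNNs
```

The returned `(batch, B, feat_dim)` tensor is exactly the band-wise sequence that the cascaded RNN layers then split into subsequences.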
Compared to the cascaded RNN model, the spectral-spatial cascaded RNN model is deeper and more difficult to train. Therefore, we propose a transfer learning method to train it. Specifically, we first pre-train the convolutional layers using all of the band matrices. To do so, we replace the two-layer RNNs by an output layer whose size is the number of classes $K$, and assume that the label of each band matrix equals the label of its corresponding pixel. The resulting samples are used to train the convolutional layers. After that, the weights of these convolutional layers are fixed, and the training samples are used again to train the two-layer RNNs. Finally, the whole network is fine-tuned based on the learned parameters.

III Experiments
III-A Data Description
Our experiments are conducted on two HSIs, which are widely used to evaluate classification algorithms.
Indian Pines Data: The first data set was acquired by the AVIRIS sensor over the Indian Pines test site in northwestern Indiana, USA, on June 12, 1992. The original data set contains 224 spectral bands. We utilize 200 of them after removing four bands containing zero values and 20 noisy bands affected by water absorption. The spatial size of the image is 145×145 pixels, and the spatial resolution is 20 m. The numbers of training and test pixels are reported in Table I. Fig. 5 shows the false-color image, as well as the training and test maps of this data set.
Class No.  Class Name  Training  Test 

1  Corn-notill  50  1384 
2  Corn-mintill  50  784 
3  Corn  50  184 
4  Grass-pasture  50  447 
5  Grass-trees  50  697 
6  Hay-windrowed  50  439 
7  Soybean-notill  50  918 
8  Soybean-mintill  50  2418 
9  Soybean-clean  50  564 
10  Wheat  50  162 
11  Woods  50  1244 
12  Buildings-grass-trees  50  330 
13  Stone-steel-towers  50  45 
14  Alfalfa  15  39 
15  Grass-pasture-mowed  15  11 
16  Oats  15  5 
  Total  695  9671 
Pavia University Scene Data: The second data set was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy, on July 8, 2002. The original image was recorded with 115 spectral channels ranging from 0.43 to 0.86 μm. After removing noisy bands, 103 bands are used. The image size is 610×340 pixels, with a spatial resolution of 1.3 m. There are nine land cover classes, each with more than 1000 labeled pixels. The numbers of pixels for training and test are listed in Table II. Their corresponding distribution maps are shown in Fig. 6.
Class No.  Class Name  Training  Test 

1  Asphalt  548  6631 
2  Meadows  540  18649 
3  Gravel  392  2099 
4  Trees  524  3064 
5  Metal sheets  265  1345 
6  Bare Soil  532  5029 
7  Bitumen  375  1330 
8  Bricks  514  3682 
9  Shadows  231  947 
  Total  3921  42776 
III-B Experimental Setup
In order to highlight the effectiveness of the proposed models, we compare them with SVM, a one-dimensional CNN (1DCNN), a two-dimensional CNN (2DCNN), and the original RNN using GRUs (RNN). For simplicity, the cascaded RNN model using GRUs is abbreviated as CasRNN; the two improved versions of CasRNN based on feature-level and output-level connections are abbreviated as CasRNNF and CasRNNO, respectively; the spectral-spatial CasRNN is abbreviated as SSCasRNN. Details of some of these models are summarized as follows.

SVM: The input of SVM is the original spectral signature. We choose the Gaussian kernel as the kernel function. The penalty parameter and the spread of the Gaussian kernel are selected from a candidate set using five-fold cross-validation.

1DCNN: The structure of 1DCNN is the same as that in [35]. It contains an input layer, a convolutional layer with 20 kernels, a max-pooling layer, a fully-connected layer with 100 hidden nodes, and an output layer.

RNN: The GRU is used as the basic unit of RNN. The number of hidden nodes is chosen from a candidate set via five-fold cross-validation.
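For the SVM baseline above, the paper uses LIBSVM in MATLAB; an equivalent five-fold cross-validated search over the penalty and kernel spread can be sketched with scikit-learn. The candidate grids and toy data below are illustrative, not the paper's actual sets.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy stand-in data: 100 "pixels" with 103 spectral features, 3 classes.
rng = np.random.RandomState(0)
X = rng.rand(100, 103)
y = np.arange(100) % 3

# Five-fold cross-validated selection of C (penalty) and gamma (spread).
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
best = grid.best_params_
```

`grid.best_estimator_` is then refit on the full training set with the selected pair, mirroring the usual LIBSVM workflow.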
The deep learning models are constructed with the PyTorch framework. To optimize them, we use a mini-batch stochastic gradient descent algorithm. The batch size, the learning rate, and the number of training epochs are set to 64, 0.001, and 300, respectively. For SVM, we use the LIBSVM package in MATLAB. All of the experiments are implemented on a personal computer with an Intel Core i7-4790 3.60 GHz processor, 32 GB of RAM, and a GTX TITAN X graphics card.
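The shared optimization setup can be sketched as a standard PyTorch loop. The random tensors and the small MLP below are placeholders; only the batch size and learning rate match the text, and the epoch count is truncated for brevity.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data: 256 "pixels" with 103 bands, 9 classes.
X, y = torch.randn(256, 103), torch.randint(0, 9, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(103, 32), nn.ReLU(), nn.Linear(32, 9))
opt = torch.optim.SGD(model.parameters(), lr=0.001)   # mini-batch SGD
criterion = nn.CrossEntropyLoss()

for epoch in range(3):                     # the paper trains for 300 epochs
    for xb, yb in loader:
        opt.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        opt.step()
```

Swapping the placeholder MLP for any of the RNN models above leaves the loop unchanged, which is why a single setup serves all the deep baselines.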
The classification performance of each model is evaluated by the overall accuracy (OA), the average accuracy (AA), the per-class accuracy, and the Kappa coefficient. OA is the ratio of the number of correctly classified pixels to the total number of pixels in the test set, AA is the average of the accuracies over all classes, and Kappa is the percentage of agreement corrected by the amount of agreement that would be expected purely by chance.
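For reference, all three summary metrics can be computed from the confusion matrix. A small sketch with a toy two-class matrix:

```python
import numpy as np

def classification_metrics(conf):
    """OA, AA and Kappa from a confusion matrix (rows: true, cols: predicted)."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    oa = np.trace(conf) / total                          # overall accuracy
    aa = np.mean(np.diag(conf) / conf.sum(axis=1))       # mean per-class accuracy
    # Chance agreement: product of marginal totals, summed over classes.
    pe = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1 - pe)                         # chance-corrected OA
    return oa, aa, kappa

oa, aa, kappa = classification_metrics([[40, 10], [5, 45]])
```

For this toy matrix OA and AA are both 0.85 while Kappa is 0.70, illustrating how Kappa discounts the 0.5 agreement expected by chance.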
III-C Parameter Analysis
There are three important hyperparameters in the proposed models: the number of subsequences $n$, and the sizes of the hidden layers in the first-layer RNN and the second-layer RNN. To test their effects on the classification performance, we first fix $n$ and select the hidden-layer sizes from a candidate set. Then, we fix the hidden-layer sizes and choose $n$ from another candidate set. Since the same hyperparameter values are used for CasRNN and its two improvements (i.e., CasRNNF and CasRNNO), we only demonstrate the performance of CasRNN here, as shown in Fig. 7. In this three-dimensional diagram, the first two axes correspond to the numbers of hidden nodes in the first-layer RNN and the second-layer RNN, while the third axis represents the classification accuracy (OA). From this figure, we can observe that moderate hidden sizes lead to better OA than the other values on the Indian Pines data, with the best OA appearing at the setting marked in Fig. 7. For the Pavia University data, OA varies more than on the Indian Pines data, but a best setting can still be identified. Similarly, Fig. 8 shows the OA values achieved by SSCasRNN using different hidden sizes, from which the optimal values for the two data sets can be read.

Class No.  SVM  1DCNN  RNN  CasRNN  CasRNNF  CasRNNO  2DCNN  SSCasRNN 

1  64.31  61.34  64.74  68.35  68.93  68.21  82.51  86.99 
2  70.92  60.33  61.35  64.8  67.6  67.35  88.14  98.72 
3  84.78  80.43  74.46  77.17  83.7  85.87  100  100 
4  91.05  89.04  83.45  91.50  90.60  89.93  94.85  94.41 
5  85.94  90.53  77.04  79.34  80.49  80.92  85.80  97.42 
6  93.62  96.13  87.70  92.03  92.94  92.94  99.77  100 
7  69.17  72.11  76.03  74.84  78.54  79.30  82.35  87.15 
8  52.90  54.47  60.79  67.41  67.49  66.91  73.86  85.98 
9  76.60  75.71  61.17  65.60  67.02  65.43  86.00  87.23 
10  97.53  99.83  93.21  95.06  96.91  98.15  100  100 
11  77.49  80.87  81.67  83.28  90.03  86.09  94.53  97.51 
12  73.33  78.48  55.45  54.85  67.88  54.55  97.27  99.70 
13  100  91.11  86.67  93.33  95.56  93.33  100  100 
14  87.18  94.87  69.23  76.92  84.61  76.92  97.44  100 
15  90.91  90.91  90.91  90.91  90.91  90.91  100  100 
16  100  100  80.00  100  100  80  100  100 
OA  70.55  70.79  69.82  73.49  75.85  74.60  85.43  91.79 
AA  82.23  82.23  75.24  79.71  82.70  79.80  92.66  95.94 
Kappa  66.90  67.07  65.87  69.91  72.57  71.19  83.49  90.62 
Class No.  SVM  1DCNN  RNN  CasRNN  CasRNNF  CasRNNO  2DCNN  SSCasRNN 

1  84.74  80.94  81.51  82.34  83.56  83.52  77.39  89.82 
2  64.50  70.37  62.58  67.13  70.65  71.37  98.89  96.06 
3  72.56  77.32  64.65  60.51  68.75  64.51  56.74  78.89 
4  97.13  85.93  98.89  98.63  98.11  98.43  92.75  95.89 
5  99.55  99.70  99.26  99.41  99.55  99.33  99.78  100 
6  93.30  93.26  88.90  84.97  88.29  89.08  47.27  57.67 
7  91.28  95.41  92.63  90.60  76.54  91.13  80.08  80.53 
8  91.99  84.47  91.04  92.23  86.04  93.54  96.69  96.80 
9  95.56  92.08  95.35  94.40  95.35  94.72  96.30  95.99 
OA  78.75  79.55  76.58  78.03  79.56  80.86  86.18  90.30 
AA  87.85  86.61  86.09  85.58  85.21  87.29  82.88  87.97 
Kappa  73.62  74.28  71.02  72.55  74.31  75.93  81.22  86.26 
Figs. 9 and 10 evaluate the effects of $n$ on classifying the Indian Pines and the Pavia University data sets, respectively. In these figures, different colors represent different models: CasRNN, CasRNNF, CasRNNO, and SSCasRNN. As $n$ increases, the OAs achieved by these models tend to first increase and then decrease. Given the same $n$, SSCasRNN significantly outperforms the other three models. For the Indian Pines data, the maximal OAs of the four models appear at the same $n$, so their optimal values are set to 10. Different from the Indian Pines data, the four models have different optimal values of $n$ on the Pavia University data. As shown in Fig. 10, the optimal value is 4 for SSCasRNN, and 8 for the other three models.
III-D Performance Comparison
In this section, we report quantitative and qualitative results of the proposed models and compare them with the other state-of-the-art models. Table III reports the detailed classification results of different models on the Indian Pines data, including OA, AA, Kappa, and per-class accuracies. The bold fonts in each row denote the best results. Several conclusions can be drawn from this table. First, if we directly input the whole set of spectral bands into RNN, its OA, AA, and Kappa values are 69.82%, 75.24%, and 65.87%, respectively, which are all lower than those achieved by the SVM and 1DCNN models. This indicates that RNN cannot fully explore the long spectral sequences of HSIs. On the contrary, by considering the redundant and complementary properties of the spectral signature, our proposed CasRNN improves the OA of RNN by almost 4 percentage points, thus outperforming SVM and 1DCNN. Second, compared to CasRNN, CasRNNF and CasRNNO obtain better results, which validates the effectiveness of the two improvement strategies. In terms of per-class accuracy, CasRNNF increases almost all of them in comparison with CasRNN, so it might be more powerful than CasRNNO on the Indian Pines data. Third, compared to the spectral classification models, 2DCNN significantly improves the classification results by about 10 percentage points. This means that the consideration of spatial information is very important on the Indian Pines data, because there are many large and homogeneous objects, as shown in Fig. 5(c). By incorporating the spatial information into the CasRNN model, our proposed SSCasRNN further increases the OA to above 90%. Besides, it obtains the highest accuracies in 15 of the 16 classes, which sufficiently confirms the effectiveness of SSCasRNN.
In addition to the quantitative results, we also visualize the classification maps of different models in Fig. 11. Different colors in this figure correspond to different classes. Compared to the ground-truth map in Fig. 5(c), the spectral classification models (i.e., SVM, 1DCNN, RNN, CasRNN, CasRNNF, and CasRNNO) produce many outliers in the classification map due to the spectral variability of materials. This phenomenon is alleviated by 2DCNN, because it makes use of spatial contextual information instead of spectral information. For homogeneous regions, especially large objects, 2DCNN performs very well. However, it easily results in an over-smoothing problem, especially for small objects, as demonstrated in Fig. 11(g). Different from 2DCNN and the spectral models, SSCasRNN takes advantage of spectral and spatial information simultaneously. As shown in Fig. 11(h), it has significantly fewer outliers than the spectral models, and retains more object boundary details than 2DCNN.
Table IV and Fig. 12 present the classification results of different models on the Pavia University data, from which similar conclusions can be drawn. Among the spectral models, CasRNN is better than RNN, while CasRNNF and CasRNNO are superior to CasRNN. All of these models exhibit the “salt and pepper” phenomenon in their classification maps. Compared to the best spectral model, 2DCNN improves OA and Kappa by more than 5 percentage points. In addition, it generates fewer outliers and leads to a more homogeneous classification map. Nevertheless, without using the spectral information, its performance is not very high, and the classification map is easily over-smoothed. By combining the spectral and spatial information, our proposed SSCasRNN alleviates these issues: it improves OA from 86.18% to 90.30%, and preserves more details in the classification map. However, in comparison with the Indian Pines data, the classification results achieved by SSCasRNN are still not very high. One possible reason is that there are many small objects in the Pavia University data, which increases the difficulty of exploring spatial features.
IV Conclusions
In this paper, we proposed a cascaded RNN model for HSI classification. Compared to the original RNN model, the proposed model can fully explore the redundant and complementary information of the high-dimensional spectral signature. Based on it, we designed two improvement strategies that construct connections between the first-layer RNNs and the output layer, thus generating more discriminative spectral features. Additionally, considering the importance of spatial information, we further extended the proposed model into its spectral-spatial version to learn spectral and spatial features simultaneously. To test the effectiveness of the proposed models, we compared them with several state-of-the-art models on two widely used HSIs. The experimental results demonstrate that the cascaded RNN model obtains higher performance than RNN, and that its modifications further improve the performance. Besides, we also thoroughly evaluated the effects of different hyperparameters on the classification performance of the proposed models, including the hidden sizes and the number of subsequences. In the future, more experiments will be conducted to validate the effectiveness of the proposed models, and more powerful spectral-spatial models will be explored, since using patches or cubes of the same size as inputs easily loses spatial information when the sizes and shapes of objects vary.
References
 [1] Pedram Ghamisi, Naoto Yokoya, Jun Li, Wenzhi Liao, Sicong Liu, Javier Plaza, Behnood Rasti, and Antonio Plaza, “Advances in hyperspectral image and signal processing: A comprehensive overview of the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 37–78, 2017.
 [2] Giorgos Mountrakis, Jungho Im, and Caesar Ogole, “Support vector machines in remote sensing: A review,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 66, no. 3, pp. 247–259, 2011.
 [3] Mariana Belgiu and Lucian Drăguţ, “Random forest in remote sensing: A review of applications and future directions,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 114, pp. 24–31, 2016.
 [4] Wei Li, Chen Chen, Hongjun Su, and Qian Du, “Local binary patterns and extreme learning machine for hyperspectral imagery classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3681–3693, 2015.
 [5] Xiuping Jia, Bor-Chen Kuo, and Melba M Crawford, “Feature mining for hyperspectral image classification,” Proceedings of the IEEE, vol. 101, no. 3, pp. 676–697, 2013.
 [6] Wenzhi Liao, Aleksandra Pizurica, Paul Scheunders, Wilfried Philips, and Youguo Pi, “Semisupervised local discriminant analysis for feature extraction in hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 1, pp. 184–198, 2013.
 [7] Renlong Hang, Qingshan Liu, Huihui Song, and Yubao Sun, “Matrix-based discriminant subspace ensemble for hyperspectral image spatial–spectral feature fusion,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 2, pp. 783–794, 2016.
 [8] Renlong Hang, Qingshan Liu, Yubao Sun, Xiaotong Yuan, Hucheng Pei, Javier Plaza, and Antonio Plaza, “Robust matrix discriminative analysis for feature extraction from hyperspectral images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 5, pp. 2002–2011, 2017.
 [9] Dalton Lunga, Saurabh Prasad, Melba M Crawford, and Okan Ersoy, “Manifold-learning-based feature extraction for classification of hyperspectral data: A review of advances in manifold learning,” IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 55–66, 2014.
 [10] Renlong Hang and Qingshan Liu, “Dimensionality reduction of hyperspectral image using spatial regularized local graph discriminant embedding,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 9, pp. 3262–3271, 2018.
 [11] Wenzhi Zhao, William Emery, Yanchen Bo, and Jiage Chen, “Land cover mapping with higher order graph-based co-occurrence model,” Remote Sensing, vol. 10, no. 11, pp. 1713, 2018.
 [12] Yi Chen, Nasser M Nasrabadi, and Trac D Tran, “Hyperspectral image classification using dictionary-based sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 10, pp. 3973–3985, 2011.
 [13] Leyuan Fang, Shutao Li, Xudong Kang, and Jón Atli Benediktsson, “Spectral–spatial hyperspectral image classification via multiscale adaptive sparse representation,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 12, pp. 7738–7749, 2014.
 [14] Jiayi Li, Hongyan Zhang, Yuancheng Huang, and Liangpei Zhang, “Hyperspectral image classification by nonlocal joint collaborative representation with a locally adaptive dictionary,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 6, pp. 3707–3719, 2014.
 [15] Wei Li and Qian Du, “Joint withinclass collaborative representation for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 6, pp. 2200–2208, 2014.
 [16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
 [17] Jürgen Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
 [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
 [19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [20] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
 [21] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
 [22] Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa, “Natural language processing (almost) from scratch,” Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
 [23] Ilya Sutskever, Oriol Vinyals, and Quoc V Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.
 [24] Liangpei Zhang, Lefei Zhang, and Bo Du, “Deep learning for remote sensing data: A technical tutorial on the state of the art,” IEEE Geoscience and Remote Sensing Magazine, vol. 4, no. 2, pp. 22–40, 2016.
 [25] Xiao Xiang Zhu, Devis Tuia, Lichao Mou, GuiSong Xia, Liangpei Zhang, Feng Xu, and Friedrich Fraundorfer, “Deep learning in remote sensing: a comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, 2017.
 [26] Yushi Chen, Zhouhan Lin, Xing Zhao, Gang Wang, and Yanfeng Gu, “Deep learningbased classification of hyperspectral data,” IEEE Journal of Selected topics in applied earth observations and remote sensing, vol. 7, no. 6, pp. 2094–2107, 2014.
 [27] Chao Tao, Hongbo Pan, Yansheng Li, and Zhengrou Zou, “Unsupervised spectral–spatial feature learning with stacked sparse autoencoder for hyperspectral imagery classification,” IEEE Geoscience and remote sensing letters, vol. 12, no. 12, pp. 2438–2442, 2015.
 [28] Xiaorui Ma, Hongyu Wang, and Jie Geng, “Spectral–spatial classification of hyperspectral image based on deep autoencoder,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 9, no. 9, pp. 4073–4085, 2016.
 [29] Yushi Chen, Xing Zhao, and Xiuping Jia, “Spectral–spatial classification of hyperspectral data based on deep belief network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2381–2392, 2015.
 [30] Xichuan Zhou, Shengli Li, Fang Tang, Kai Qin, Shengdong Hu, and Shujun Liu, “Deep learning with grouped features for spatial spectral classification of hyperspectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 1, pp. 97–101, 2017.
 [31] Ping Zhong, Zhiqiang Gong, Shutao Li, and Carola-Bibiane Schönlieb, “Learning to diversify deep belief networks for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3516–3530, 2017.
 [32] Yunsong Li, Weiying Xie, and Huaqing Li, “Hyperspectral image reconstruction by deep convolutional neural network for classification,” Pattern Recognition, vol. 63, pp. 371–383, 2017.
 [33] Wenzhi Zhao, Shihong Du, and William J Emery, “Objectbased convolutional neural network for highresolution imagery classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 7, pp. 3386–3396, 2017.
 [34] Mengmeng Zhang, Wei Li, and Qian Du, “Diverse regionbased cnn for hyperspectral image classification,” IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623–2634, 2018.
 [35] Wei Hu, Yangyu Huang, Li Wei, Fan Zhang, and Hengchao Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015, 2015.
 [36] Lin He, Jun Li, Chenying Liu, and Shutao Li, “Recent advances on spectral–spatial hyperspectral image classification: An overview and new guidelines,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 3, pp. 1579–1597, 2018.
 [37] Pedram Ghamisi, Emmanuel Maggiori, Shutao Li, Roberto Souza, Yuliya Tarabalka, Gabriele Moser, Andrea De Giorgi, Leyuan Fang, Yushi Chen, Mingmin Chi, et al., “New frontiers in spectral-spatial hyperspectral image classification: The latest advances based on mathematical morphology, Markov random fields, segmentation, sparse representation, and deep learning,” IEEE Geoscience and Remote Sensing Magazine, vol. 6, no. 3, pp. 10–43, 2018.
 [38] Yushi Chen, Hanlu Jiang, Chunyang Li, Xiuping Jia, and Pedram Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 6232–6251, 2016.
 [39] Ying Li, Haokui Zhang, and Qiang Shen, “Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network,” Remote Sensing, vol. 9, no. 1, pp. 67, 2017.
 [40] Cheng Shi and Chi-Man Pun, “Superpixel-based 3D deep neural networks for hyperspectral image classification,” Pattern Recognition, vol. 74, pp. 600–616, 2018.
 [41] Jingxiang Yang, YongQiang Zhao, and Jonathan CheungWai Chan, “Learning and transferring deep joint spectral–spatial features for hyperspectral classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 8, pp. 4729–4742, 2017.
 [42] Xiaodong Xu, Wei Li, Qiong Ran, Qian Du, Lianru Gao, and Bing Zhang, “Multisource remote sensing data classification based on convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 2, pp. 937–949, 2018.
 [43] Siyuan Hao, Wei Wang, Yuanxin Ye, Tingyuan Nie, and Lorenzo Bruzzone, “Twostream deep architecture for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2349–2361, 2018.
 [44] Hao Wu and Saurabh Prasad, “Convolutional recurrent neural networks for hyperspectral data classification,” Remote Sensing, vol. 9, no. 3, pp. 298, 2017.
 [45] Qingshan Liu, Feng Zhou, Renlong Hang, and Xiaotong Yuan, “Bidirectional-convolutional LSTM based spectral-spatial feature learning for hyperspectral image classification,” Remote Sensing, vol. 9, no. 12, pp. 1330, 2017.
 [46] Feng Zhou, Renlong Hang, Qingshan Liu, and Xiaotong Yuan, “Hyperspectral image classification using spectral-spatial LSTMs,” Neurocomputing, 2018.
 [47] Feng Zhou, Renlong Hang, Qingshan Liu, and Xiaotong Yuan, “Integrating convolutional neural network and gated recurrent unit for hyperspectral image spectral-spatial classification,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2018, pp. 409–420.
 [48] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013, pp. 6645–6649.
 [49] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
 [50] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using rnn encoder–decoder for statistical machine translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 2014, pp. 1724–1734.