Remote sensing plays an important role in earth observation and has many applications in agriculture and the military [12, 9]. Among the various remote sensing technologies, hyperspectral image (HSI) classification is a fundamental and essential technique. Captured by hyperspectral remote sensing imagers, HSIs with hundreds of bands contain much richer spectral information than ordinary remote sensing images, and the combination of spatial and rich spectral information makes HSIs very useful for distinguishing ground-cover objects. As a result, HSI classification is widely applied in various scenarios, e.g., mineral exploration, plant stress detection, and environmental science. However, in HSIs, a feature vector containing hundreds of bands can be extracted from each spatial pixel location. Such high-dimensional features help classify ground objects on the one hand, but increase the difficulty of feature extraction on the other. Therefore, it is worth exploring how to efficiently extract features from HSIs.
Early HSI classification methods mainly focused on extracting spectral and spatial information through traditional feature descriptors. These algorithms can be roughly divided into two categories. The first relies merely on spectral information, for instance, classification based on the distance and similarity between spectral features [23, 25]. The second uses both spectral and spatial information. More specifically, methods of this kind usually perform pixel-level classification based on spectral information first, and then use models such as Markov random fields to refine the classification results according to spatial information. Apart from this, some methods apply 3D filters to extract joint spatial-spectral features directly. Due to the unique imaging characteristics of HSIs, different objects may have similar spectral features, and the same object at different locations may exhibit different spectral features. Such cases are difficult to classify using spectral features alone. Consequently, methods that use both spectral and spatial information achieve higher accuracy than methods relying only on spectral information.
Since 2012, deep learning has developed rapidly and achieved remarkable results in various fields. Inspired by this, researchers have applied deep learning methods to HSI classification and obtained impressive performance. Like traditional HSI classification methods, these approaches primarily focus on jointly utilizing spatial and spectral features. In 2013, Lin et al. first introduced deep learning to the HSI classification task. Specifically, this work uses PCA to reduce the dimensionality of the HSI from hundreds of spectral dimensions to dozens. After that, a neighborhood of the pixel to be classified is cropped from the compressed HSI data and stretched into a feature vector. Finally, this feature vector is fed to a stacked autoencoder (SAE) to produce the deep spatial-spectral feature. From 2014 to 2015, Chen et al. added a spectral channel on top of this framework. This additional channel directly takes the spectral features extracted from the pixel to be classified as input, and its output is integrated with the spatial-spectral channel to form a dual-channel structure. SAE and deep belief network (DBN) models are then used for feature extraction, and their extracted features are fused at the end of the dual-channel network structure [4, 5]. In the same period, other methods applied 1D-CNNs and 2D-CNNs to HSI classification. Specifically, 1D-CNNs are used to extract deep spectral features [37, 10], and 2D-CNNs are employed to extract deep spatial features from HSI blocks that have been compressed along the spectral dimension [20, 33]. After 2017, deep HSI classification methods primarily focused on extracting spatial-spectral features. Some works construct a dual-channel network structure to obtain spectral features and spatial features separately, and then merge them to form spatial-spectral features. Additionally, the 3D-CNN is a popular choice for capturing joint spatial-spectral features directly [13, 3].
Since 2017, various optimized 3D-CNNs have been applied to the HSI classification task [40, 35], and some transfer learning methods have also been introduced into HSI classification [35, 32].
Deep HSI classification methods exploit the ability of deep networks to learn robust features on their own, and they show a significant advantage in classification performance over traditional HSI classification algorithms. However, these approaches all face the same problem: their network structures are manually designed. For deep learning methods, designing an efficient network structure is difficult, time-consuming, and labor-intensive, and requires a large number of verification experiments. This problem is even more serious in HSI classification. Because HSI data differ greatly in the number of bands, spectral range, and spatial resolution, the suitable structures also differ across HSI datasets. Therefore, it is usually necessary to design different network structures for different HSI data.
Moving beyond manually designed network architectures, Neural Architecture Search (NAS) techniques seek to automate this process and find not only good architectures, but also their associated weights for a given image classification task. NAS thus offers a way to free people from the heavy work of network architecture design. Chen et al. first introduced DARTS into the HSI classification task. This work compressed the spectral dimension of HSIs to tens of dimensions through point-wise convolution, and then directly used DARTS to search for a 2D CNN suited to a specific HSI dataset. Later, Zhang et al. made an in-depth analysis of the structural characteristics of HSIs and proposed 3D-ANAS. In their work, a 3D asymmetric CNN is automatically designed under a pixel-to-pixel classification framework, which removes the redundant operations of the previous patch-to-pixel classification framework and significantly improves inference speed. Compared with previous work, 3D-ANAS shows clear advantages in terms of accuracy and inference speed.
In this work, we further improve 3D-ANAS in two respects. 1) In 3D-ANAS, an asymmetric decomposed convolution is introduced into the search space to account for the difference between the spatial resolution and the spectral resolution of the HSI. However, this distinction between space and spectrum is only reflected at the operation level and is not flexible enough at the search space level. Such an approach has difficulty incorporating classic hand-designed experience. For example, in SSRN, the operations are completely separated into spectral processing and spatial processing. In this article, we therefore construct a new and more efficient search space, with more freedom in handling the differences between spatial and spectral information. 2) Inspired by the superior performance of the transformer model, we graft a transformer module onto the end of the CNN model for the HSI classification task for the first time. Before classification, we capture the relative relationship of pixels at different spatial positions and use this relationship to refine the spatial-spectral features and achieve better classification accuracy. The main contributions of this work include the following three aspects:
By analyzing the characteristics of HSIs, we propose a NAS algorithm to automatically design a ConvNet for HSI classification. Specifically, we propose a novel hybrid search space that contains two kinds of cells: the space dominated cell and the spectrum dominated cell. The space dominated cell includes 3D convolutions and a large number of 2D spatial convolutions, while the spectrum dominated cell consists of 3D convolutions and a large number of 2D spectral convolutions. The entire search space is built on these two cells and can be divided into an inner and an outer space. The inner space determines the topology within the cell, and the outer space decides whether the space dominated cell or the spectrum dominated cell is selected for a specific layer.
To further improve the classification accuracy, we graft the emerging transformer module onto the automatically designed ConvNet to add global information to the locally focused features learned by the ConvNet. Because we adopt a pixel-to-pixel classification framework, the transformer module can be seamlessly grafted onto the last layer of the ConvNet. Such a grafted structure takes advantage of the transformer's ability to capture the correlation between pixels, while avoiding the difficulties of training a complete transformer.
Experimental results on three typical HSI classification datasets, namely Pavia Centre, Pavia University, and Houston University, validate that the proposed approach clearly improves the classification accuracy of automatically designed HSI classification approaches.
The rest of this paper is organized as follows. Section II reviews related work. Our approach is elaborated in Section III. Section IV gives the implementation details and extensively evaluates the proposed 3D-ANASV2 approach against state-of-the-art competitors. Finally, we conclude this work in Section V.
II Related Work
II-A Hyperspectral Image Classification via CNNs
With the rapid development of deep learning in recent years, many works have introduced deep learning techniques into HSI classification algorithms to improve their robustness and accuracy. Among the various deep learning models, SAE and DBN were the first to be introduced into HSI classification. Such methods usually first use PCA to compress the hundreds of bands of an HSI to dozens of bands. This preprocessing operation reduces the dimensionality of the HSI data while keeping the primary information. Then an SAE or a DBN is applied to extract the spatial-spectral features from the neighborhood region [14, 4, 5, 19]. Recent years have witnessed growing interest in using CNNs for HSI classification. Compared with SAE, DBN, and other deep networks constructed layer by layer, the structure of CNNs is more flexible. HSI classification methods based on CNNs also achieve better performance than those based on SAE and DBN.
The development of HSI classification based on CNNs has mainly gone through three stages. From 2015 to early 2016, researchers mostly focused on HSI classification based on 1D-CNNs and 2D-CNNs. Methods based on 1D-CNNs generally perform convolution along the spectral dimension to extract spectral features. Mei et al. constructed a 1D-CNN to perform convolution along the spectral direction and extract deep spectral features. Another 1D-CNN based HSI classification approach extracts nine spectral vectors from the pixel to be classified and its eight neighbors. These spectral vectors are then fed to a 1D-CNN, generating the spatial-spectral features for HSI classification. Beyond methods based on 1D-CNNs, a series of 2D-CNN based HSI classification approaches have shown good prospects. Intuitively, the regions surrounding a pixel provide additional visual information that facilitates classification. After compressing the HSI to a low dimension, 2D-CNN based methods crop a neighborhood patch around the pixel to be classified. This patch is then fed to the 2D-CNN to extract the spatial-spectral features. Makantasis et al. used R-PCA to reduce the dimensionality of HSIs, and then designed a 2D-CNN to extract features. Yue et al. adopted PCA to compress the dimension of HSIs to three, and then proposed a 2D-CNN to exploit the spatial-spectral features. Compared with the 1D-CNN based approaches, methods based on 2D-CNNs achieve higher accuracy. However, the classification results of methods that use only 2D-CNNs may not preserve structural information very well. Their visual results are much smoother than those of 1D-CNN methods.
The second development stage mainly focused on combining 1D-CNNs and 2D-CNNs to perform HSI classification. By taking advantage of both, the dual-channel CNN structure can further improve the accuracy of HSI classification. Zhang et al. presented a dual-channel CNN, in which a 1D-CNN extracts spectral features from a 3×3 window and a 2D-CNN extracts spatial features from the PCA-compressed HSI within a 41×41 window. On this basis, a pyramid structure is constructed to obtain the spatial-spectral features by fusing multi-layer features. At almost the same time, a similar structure was proposed. It uses the 1D-CNN to process the spectral features of only one pixel, and adopts transfer learning to address the small sample problem in model training.
The third stage is the 3D-CNN stage. Inspired by the 3D structure of HSIs, 3D-CNNs have gradually been adopted in HSI classification approaches. Such methods directly construct 3D-CNNs to extract the spatial-spectral features. Compared with dual-channel CNNs, 3D-CNNs are usually simpler, more intuitive, and more powerful [13, 3]. In recent years, optimizing the structure of 3D-CNNs for HSI classification has become the mainstream. Examples include efficient residual structures and lightweight designs. Based on the classical residual structure, Zhong et al. integrated spectral residual and spatial residual modules, and then constructed the HSI classification model SSRN based on these two residual modules.
To handle the small sample problem of HSI classification, Zhang et al. developed a lightweight 3D-CNN to optimize the model structure. To make training more effective, the same work also proposed two transfer learning strategies (cross-sensor and cross-modality). Zhao et al. proposed a lightweight spectral-spatial convolution HSI classification module (LS2CM) to reduce network parameters and computational complexity. In addition, Jia et al. employed spatial-spectral Schroedinger eigenmaps (SSSE) for feature extraction together with a dual-scale convolution (DSC) module. These two designs greatly decrease the number of model parameters.
II-B Hyperspectral Image Classification via Neural network architecture search
To overcome the heavy burden of manually designing network architectures, researchers have turned their attention to NAS, which can automatically and efficiently discover neural architectures suitable for specific tasks. Recent years have witnessed the success of NAS algorithms in many general computer vision tasks, such as image classification, object detection, and semantic segmentation. So far, the development of NAS has gone through three phases: architecture search based on evolutionary algorithms (EA), architecture search based on reinforcement learning (RL), and architecture search based on gradients. RL based methods [41, 39] often use a recurrent neural network (RNN) as a meta-controller to generate potential architectures. In NAS methods based on EA [24, 16, 26], a series of randomly constructed models is evolved into a better architecture. However, most RL and EA methods suffer from heavy computational cost and are inefficient in the searching stage. Gradient-based NAS methods were proposed more recently and alleviate this problem to some extent. The first such attempt is DARTS. Unlike the EA and RL based methods, which train many candidate networks, DARTS trains only one super network in the searching phase, reducing the training workload significantly.
Inspired by DARTS, Chen et al. proposed 3D Auto-CNN for HSI classification. In the preprocessing stage, 3D Auto-CNN heavily compresses the spectral dimension of raw HSIs through point-wise convolution. The search space of 3D Auto-CNN is in fact made up of 2D convolution operations. Very recently, Zhang et al. put forward 3D-ANAS under a pixel-to-pixel classification framework, where all operations in the hierarchical search space have a 3D structure. Besides, the widths of the networks can be adjusted adaptively according to the characteristics of different HSIs. Unfortunately, 3D-ANAS still has two shortcomings:
Previous works, such as SSRN, have indicated that learning the spectral and spatial representations separately is beneficial for extracting more discriminative features. Although the various asymmetric convolutions in the search space of 3D-ANAS allow fine-tuning of the convolution kernel size and receptive field along the spectral and spatial dimensions, this adjustment is limited to the inside of a cell. Adjusting the proportions of spectral and spatial convolutions across the entire network is infeasible in this framework.
The pure convolutional structure mainly focuses on local neighborhood information and ignores the global connections across the whole input patch, which have been shown to be useful by work on Non-Local networks.
To overcome the two issues mentioned above, we propose a new NAS method for HSI classification. Specifically, to address the first issue, we design a hybrid search space that consists of two kinds of cells: the space dominated cell and the spectrum dominated cell. The hybrid search space is more flexible in selecting spatial or spectral convolutions than the search space proposed in 3D-ANAS. To address the second issue, a light transformer structure is grafted onto the end of the CNN, playing a role similar to that of a CRF in capturing the connections between pixels.
II-C Vision Transformer
Through an in-depth analysis of the attention mechanism, Vaswani et al. proposed the Transformer model. Compared with the RNN models previously applied to NLP problems, the Transformer significantly improves computational efficiency, because its structure can process the elements of a sequence in parallel. Besides, compared with RNNs, the Transformer inherits and further extends the ability to capture the relationships between elements in the sequence. As a result, the introduction of the Transformer has greatly promoted the development of the NLP field.
In recent years, transformer models have been adopted in image processing and achieved very promising performance. Dosovitskiy et al. proposed ViT, where the image is cut into patches and the patches are arranged into an input sequence for feature extraction. To remain sensitive to the position of the patches, a position embedding is introduced in ViT. Besides, an additional class token is designed to perform the final classification. ViT's success in fundamental visual tasks has greatly inspired the CV field. Although the performance of ViT is relatively good, some problems remain; for instance, ViT has low computational efficiency and is hard to train. To alleviate the training difficulty, Touvron et al. proposed using knowledge distillation to train ViT models, and achieved competitive accuracy with less pre-training data. To reduce the computational cost and improve inference speed, Liu et al. proposed the Swin Transformer. The Swin Transformer limits the calculation of attention to pixels within a small window, which reduces the amount of computation. Moreover, a shifted-window based MSA is proposed, which lets attention cross different windows. The Swin Transformer has achieved higher accuracy than previous CNN models on tasks such as dense prediction. Very recently, after a detailed analysis of the working principles of CNNs and transformers, Graham et al. mixed the two in their LeViT model, which significantly outperforms previous CNNs and ViT models with respect to the speed/accuracy tradeoff.
In fusing the strengths of the CNN and the transformer model, our work is closely related to LeViT. The difference is that the main body of our network still relies on an automatically designed CNN. In LeViT, the transformer part is also the main part of feature extraction, and the high-level CNN structure is replaced with an equivalent transformer structure. In our work, the transformer is only used to further capture spatial relationships on top of the features extracted by the CNN.
III Proposed Method
In this section, the proposed method is introduced in detail. First, as the proposed method contains more steps than previous deep learning based HSI classification approaches, we briefly introduce the overall workflow. Next, we elaborate on the proposed hybrid search space and compare it with the search space proposed in 3D-ANAS. We then explain the reason for grafting a transformer module onto the searched ConvNet and present the architecture of the grafted transformer module. Finally, we briefly introduce our training process.
III-A Overall work flow
As shown in Fig. 1, the workflow of the entire classification framework can be divided into the following steps:
1) Sample extraction. Some pixels are randomly extracted from the whole HSI according to certain proportions and rules. The collected sample pixels are divided into a training set and a validation set. The training set is further divided into two parts, one for the searching stage and the other for training. The remaining pixels are reserved as the test set.
2) Searching. The collected training samples are fed into the CNN super network built by stacking the space dominated cell and the spectrum dominated cell. The training objective is to minimize the loss between the predicted labels and the ground truth. The prediction accuracy of the network is checked on the validation set at a fixed interval, and the loss and validation accuracy are recorded.
3) Deducing the final network and grafting the transformer. The weights of the super network with the highest validation accuracy are used to deduce the final component network. According to these weights, the cell type and the topology inside the cell are fixed for each layer. Besides, a flexible transformer structure is grafted onto the end of the CNN to capture the relationship between pixels.
4) Component network parameter optimization. The training part of the training set is used to optimize the grafted CNN-Transformer network, using the same loss as in the searching stage.
5) Testing. After training, the model with the highest validation accuracy and the smallest loss is evaluated on the test set.
III-B The proposed hybrid search space
Cell structure: In 3D-ANAS, the authors already noticed that processing spatial and spectral information separately performs better than using 3D convolution alone. In their work, classification accuracy is improved by introducing an asymmetric search space. In this work, we further extend this finding and propose a mixed search space, which consists of space dominated cells and spectrum dominated cells, as shown in Fig. 2. The space dominated cell contains only spatial convolutions and 3D convolutions, and the spectrum dominated cell includes spectral convolutions and 3D convolutions. After searching, each layer keeps only one cell, either spatial or spectral, and different layers do not share the cell structure. Specifically, the space dominated cell includes the following operations:
The spectrum dominated cell includes the following operations:
Architecture searching strategy: The network structure search can be divided into an inner search and an outer search. The outer search determines the cell type of each layer, and the inner search determines the internal topology of the cell. The finally searched multi-layer network may contain different cell structures, and every cell contains a sequence of nodes. The inputs of each node consist of the outputs of all previous nodes and the two inputs of the current cell. Assuming that each path in a cell contains all the candidate operations, the output of a node is the weighted combination, over its input paths, of the outputs of the candidate convolution operations on each path.
Each candidate convolution operation has a corresponding weight. This weight is learnt through the inner search according to the back-propagated gradient. The output of a cell, indexed by its cell type and its layer number, is obtained from the outputs of its nodes.
When optimizing the internal topology of a cell, the outer selection of cell types is also ongoing. Specifically, two kinds of cells are provided for each layer, focusing on spatial information and spectral information respectively. In each layer, the outputs of the two types of cells are combined through two learnable weights, one per cell type, so that the output of the layer is the weighted combination of the two cell outputs.
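The inner and outer relaxations described above can be illustrated with a small numerical sketch. It is only a sketch under assumptions: the softmax normalization of the weights follows the usual DARTS convention and is assumed here, features are reduced to single floats, and the names `mixed_path`, `node_output`, and `layer_output` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_path(x, ops, alphas):
    """Inner relaxation on one path: weighted sum of all candidate operations.

    Assumption: the raw architecture weights are normalized with a softmax.
    """
    weights = softmax(alphas)
    return sum(w * op(x) for w, op in zip(weights, ops))


def node_output(inputs, ops, alphas_per_input):
    """A node sums the mixed outputs of all of its input paths."""
    return sum(mixed_path(x, ops, a) for x, a in zip(inputs, alphas_per_input))


def layer_output(spatial_out, spectral_out, betas):
    """Outer relaxation: blend the two cell types with two learnable weights."""
    w_spatial, w_spectral = softmax(betas)
    return w_spatial * spatial_out + w_spectral * spectral_out
```

With equal weights, `layer_output(1.0, 3.0, [0.0, 0.0])` gives the plain average of the two cell outputs; during the search, gradient updates to the weights shift this balance toward one cell type.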
III-C The structure of Transformer
After the stage of searching for network architectures, we build a compact network according to the learnt structure parameters. Specifically, the structure parameters consist of inner structure parameters
After searching for the network architecture, each layer keeps only one compact cell, in which each node retains only the two most valuable input paths. In the previous 3D-ANAS, the convolution operations capture local information well but lack the perception of global information. Previous work has shown that, compared to a pure transformer structure, grafting the transformer structure after a CNN is beneficial for capturing detailed information, such as textures and contours. Therefore, on the basis of 3D-ANAS, we also improve the decoder design. As depicted in Fig. 3, a flexible transformer structure is introduced to better explore the spatial relationship between pixels in HSIs.
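The step of reducing the super network to a compact one can be sketched as follows. This is a minimal illustration, assuming that a path is scored by its largest operation weight (the usual DARTS heuristic) and that the cell type with the larger outer weight is kept; the function name and the data layout are hypothetical.

```python
def derive_layer(beta_spatial, beta_spectral, node_alphas, op_names, keep=2):
    """Derive one compact layer from learnt structure weights.

    node_alphas: one dict per node, mapping an input name to the list of
    operation weights on that path (same order as op_names).
    Assumption: a path is scored by its largest operation weight.
    """
    # Outer decision: keep the cell type with the larger weight.
    cell_type = "spatial" if beta_spatial >= beta_spectral else "spectral"

    cell = []
    for paths in node_alphas:
        scored = []
        for source, weights in paths.items():
            best = max(range(len(weights)), key=weights.__getitem__)
            scored.append((weights[best], source, op_names[best]))
        # Inner decision: keep only the `keep` strongest input paths.
        scored.sort(key=lambda item: item[0], reverse=True)
        cell.append([(source, op) for _, source, op in scored[:keep]])
    return cell_type, cell
```

For a node with three candidate inputs, only the two paths with the strongest operations survive, each reduced to a single operation.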
As preprocessing for the transformer structure, the feature map from the encoder is reshaped and transposed before being split into a sequence. The attention inputs are then calculated through a linear layer and a batch normalization layer and fed into the attention layer, whose output is computed according to Eq. 4:
where the attention scores are scaled according to the dimensionality of the attention inputs and offset by the relative position embedding (RPB).
The RPB is built from a translation-invariant attention bias.
The output of the transformer is generated as in Eq. 6 and then reshaped to the same dimensionality as the input:
in which a multi-layer perceptron and a batch normalization layer are applied together with an activation function. Specifically, the Hardswish function is employed in this work.
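As an illustration of the attention step, the sketch below implements single-head scaled dot-product attention with an additive, translation-invariant bias, together with the Hardswish activation. It is a sketch under assumptions: adding the bias to the scores before the softmax follows the LeViT-style formulation and is assumed here, the bias table is indexed by absolute row and column offsets, and all function names are hypothetical.

```python
import math


def hardswish(x):
    """Hardswish activation: x * ReLU6(x + 3) / 6."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0


def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def translation_invariant_bias(positions, table):
    """Bias that depends only on the absolute offset between two pixels."""
    return [[table[(abs(r1 - r2), abs(c1 - c2))] for (r2, c2) in positions]
            for (r1, c1) in positions]


def attention_with_bias(q, k, v, bias):
    """softmax(q k^T / sqrt(d) + bias) v for lists of token vectors."""
    d = len(q[0])
    out = []
    for i, q_i in enumerate(q):
        scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
                  for j, k_j in enumerate(k)]
        weights = softmax(scores)
        out.append([sum(weights[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out
```

Because the bias depends only on pixel offsets, the same small table is reused for every pair of positions, which is also why the number of tokens has to stay fixed between training and inference.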
III-D Training Process
In this work, we follow the pixel-to-pixel classification framework of 3D-ANAS. Therefore, to fairly verify the effectiveness of the proposed contributions, we apply the same sampling rules and the same searching and training strategies as 3D-ANAS. After taking a 3D image cube from the raw HSI and predicting the class of each 2D position in the cube, the cross-entropy loss is calculated according to the sparse training label map.
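A loss over a sparse label map can be sketched as a cross entropy that skips unlabeled positions. This is a minimal sketch, assuming unlabeled pixels are marked with a sentinel value; the sentinel `-1` and the function name are hypothetical.

```python
import math


def sparse_cross_entropy(logits, labels, ignore_label=-1):
    """Mean cross entropy over labeled positions only.

    logits[r][c] is a list of class scores for position (r, c);
    labels[r][c] is a class index, or ignore_label where no label exists.
    """
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, labels):
        for scores, label in zip(logit_row, label_row):
            if label == ignore_label:
                continue  # unlabeled pixel: contributes nothing to the loss
            m = max(scores)
            log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
            total += log_sum - scores[label]
            count += 1
    return total / count if count else 0.0
```

Only the few labeled pixels in each cube contribute to the loss, while the network still predicts a class for every position.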
Experiments are conducted on a server with an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz, 512 GB of memory, and an Nvidia Tesla V100 32 GB graphics card. The training and testing experiments were implemented using the open-source framework PyTorch 1.8 (https://pytorch.org/docs/1.8.0/).
IV-A Data Description
To evaluate the effectiveness of the proposed NAS algorithm, we conduct comparison experiments on three representative HSI datasets, namely Pavia University, Pavia Centre, and Houston University. The false color composites and ground truth maps of these three HSIs are presented in Fig. 7-9, respectively. The corresponding sample distribution information is listed in Table I-III.
Pavia University and Pavia Centre were captured by the ROSIS-3 sensor in 2001 during a flight campaign over Pavia, Northern Italy. Due to low SNR, some frequency bands were removed. The remaining 103 channels are used for classification. These datasets have the same geometric resolution of 1.3 meters. Each dataset covers nine land cover categories, some of which are shared between the two datasets. More details can be found in Fig. 7 and Fig. 8. Pavia University consists of 610 × 340 pixels and Pavia Centre covers 1096 × 715 pixels.
Houston University was captured by the ITRES-CASI 1500 hyperspectral imager over the University of Houston campus and the neighboring urban area. Compared with the previous two datasets, Houston University has a lower spatial resolution but a much higher spectral resolution. Its spatial resolution is 2.5 m and it contains 144 spectral bands covering the wavelength range of 360–1050 nm. This dataset also covers a wider area and has a greater variety of land cover objects. It consists of 349 × 1905 pixels and includes 15 land-cover classes of interest.
TABLE I (Pavia University)
|Class||Land Cover Type||No. of Samples|
|5||Painted Metal Sheets||1345|
TABLE II (Pavia Centre)
|Class||Land Cover Type||No. of Samples|
TABLE III (Houston University)
|Class||Land Cover Type||No. of Samples|
|12||Parking Lot 1||1233|
|13||Parking Lot 2||469|
|Setting||Dataset||Training||Validation||Test||Training ratio|
|20 pixels/class||Pavia C||180||90||7186||2.41%|
|30 pixels/class||Pavia C||270||90||7096||3.62%|
IV-B Experiment Design
To validate the effectiveness of the proposed NAS algorithm, we conduct experiments in two different settings. In setting one, 20 and 10 labeled pixels are randomly extracted from each category to build the training set and the validation set, respectively. The remaining labeled pixels are used as the test set. In setting two, the number of training samples per class is increased to 30, and everything else is kept the same as in setting one. More details about the sample distribution are listed in Table IV. To ensure the fairness and stability of the comparison, we repeat each experiment five times and take the average values as the final results.
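The per-class sampling of the two settings can be sketched as below. It is an illustrative sketch only: the function name, the seed handling, and the data layout (a dictionary from class index to labeled pixel coordinates) are assumptions. The counts are consistent with the Pavia C rows of the sample table: nine classes give 9 × 20 = 180 training and 9 × 10 = 90 validation pixels, and 180 / (180 + 90 + 7186) ≈ 2.41%.

```python
import random


def split_labeled_pixels(pixels_by_class, n_train, n_val, seed=0):
    """Randomly draw n_train and n_val pixels per class; the rest is the test set.

    pixels_by_class maps a class index to a list of labeled pixel coordinates.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for cls, pixels in pixels_by_class.items():
        shuffled = list(pixels)
        rng.shuffle(shuffled)
        train += [(p, cls) for p in shuffled[:n_train]]
        val += [(p, cls) for p in shuffled[n_train:n_train + n_val]]
        test += [(p, cls) for p in shuffled[n_train + n_val:]]
    return train, val, test
```

Repeating the split with different seeds and averaging the resulting accuracies corresponds to the five repetitions described above.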
IV-C Implementation Details
Similar to 3D-ANAS, the proposed method has two optimization stages and one inference stage. In this section, we introduce the settings used in these stages on the three datasets. For brevity, settings that are consistent with the baseline 3D-ANAS are not repeated here.
Searching: For the three datasets, we construct three different supernets that share the same outline structure. Specifically, in the outer structure, each supernet consists of four layers of super cells, and each layer is made up of two different super cells: a space dominated super cell and a spectrum dominated super cell. In the inner structure, each cell has a sequence of three nodes. The entire searching process is carried out on an NVIDIA V100 card with 32 GB of memory. For Pavia University and Pavia Centre, we crop patches with a spatial resolution of 24×24 as searching samples, and the batch size is set to 6. On Houston University, the crop size of the patches is set to 14×14, and the batch size is 5. On all three datasets, the Adam optimizer with both learning rate and weight decay set to 0.001 is used to optimize the architecture parameters. The standard SGD optimizer is applied to update the supernet parameters (the learnable kernels in the candidate operations), where momentum and weight decay are set to 0.9 and 0.0003, respectively. The learning rate decays from 0.025 to 0.001 according to the cosine annealing strategy. For Pavia University and Pavia Centre, the first 15 epochs form the warm-up stage, in which we only optimize the supernet parameters. Because Houston University is more challenging, we use 30 warm-up epochs. After the warm-up stage, we alternately update the architecture parameters and the supernet parameters in each iteration.
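The search schedule described above can be sketched as follows. This is a minimal sketch: the closed form is the standard cosine annealing formula, the total number of search epochs is not stated and is therefore left as a parameter, and the helper names are hypothetical.

```python
import math


def cosine_lr(epoch, total_epochs, lr_max=0.025, lr_min=0.001):
    """Cosine annealing from lr_max (epoch 0) down to lr_min (last epoch)."""
    cosine = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


def parameters_to_update(epoch, warmup_epochs):
    """During warm-up only the supernet weights are trained; afterwards the
    architecture parameters and the supernet weights are updated alternately."""
    if epoch < warmup_epochs:
        return ["supernet"]
    return ["architecture", "supernet"]
```

With 15 warm-up epochs, for example, `parameters_to_update(14, 15)` returns only the supernet weights, while from epoch 15 onward both parameter groups are updated.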
Grafted network optimization: We crop patches with a spatial resolution of 32×32 to train the final grafted network. Random cropping, flipping, and rotation are used as data augmentation strategies. The batch size is set to 12 for Pavia University and Pavia Centre, and to 16 for Houston University. At this stage we use the SGD optimizer. The initial learning rate is set to 0.1 and decays according to the poly learning rate policy with a power of 0.9. The performance of the network is validated every 100 iterations.
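For reference, the poly learning rate policy is commonly written as below. The symbols are assumptions introduced here: $t$ is the current iteration and $T$ the total number of training iterations, which is not stated in the text.

```latex
\mathrm{lr}(t) = \mathrm{lr}_{0}\left(1 - \frac{t}{T}\right)^{0.9},
\qquad \mathrm{lr}_{0} = 0.1,\quad 0 \le t \le T .
```

The rate thus starts at 0.1 and decreases smoothly to zero at the final iteration.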
For the grafted framework based on mixed CNN and transformer, we introduce an overlap inference strategy (OV) to further improve the performance. Specifically, we use a sliding window whose stride is half of the window size to crop small blocks, and feed the cropped blocks into the compact network. The average of the predictions in each overlapping area is taken as the final prediction. Because the structure we designed requires the input sequence to have a fixed length, the number of tokens in our transformer module is fixed, and the image blocks must be at the same scale during training and verification. For this reason, the multi-scale verification method (MS) is not adopted here; using the OV strategy alone already achieves promising performance. Relaxing this restriction is left as future work.
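The overlap inference strategy can be illustrated with the following pure-Python sketch. It is not the authors' code: `window_starts`, `overlap_inference`, and `predict_fn` are hypothetical names, and the handling of image borders (adding one extra window aligned with the border when the stride does not divide the image evenly) is an assumption, since the text does not describe it.

```python
def window_starts(length, window):
    """Start offsets of windows with stride window // 2 that cover [0, length)."""
    if length < window:
        raise ValueError("image must be at least as large as the window")
    stride = max(1, window // 2)
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] != length - window:
        starts.append(length - window)  # assumed: extra window to cover the border
    return starts


def overlap_inference(image, window, predict_fn, num_classes):
    """Average per-pixel class scores over overlapping fixed-size windows.

    image: H x W nested list of pixels.
    predict_fn(block): returns window x window x num_classes scores.
    Returns (label map, per-pixel count of windows that covered each pixel).
    """
    height, width = len(image), len(image[0])
    sums = [[[0.0] * num_classes for _ in range(width)] for _ in range(height)]
    counts = [[0] * width for _ in range(height)]
    for top in window_starts(height, window):
        for left in window_starts(width, window):
            block = [row[left:left + window] for row in image[top:top + window]]
            scores = predict_fn(block)
            for i, score_row in enumerate(scores):
                for j, pixel_scores in enumerate(score_row):
                    counts[top + i][left + j] += 1
                    for c, score in enumerate(pixel_scores):
                        sums[top + i][left + j][c] += score
    labels = []
    for i in range(height):
        row = []
        for j in range(width):
            averaged = [s / counts[i][j] for s in sums[i][j]]
            row.append(max(range(num_classes), key=averaged.__getitem__))
        labels.append(row)
    return labels, counts


# Toy example: each "pixel" already stores its class id, and the stand-in
# model returns one-hot scores, so averaging must reproduce the input.
toy_image = [[(i + j) % 3 for j in range(8)] for i in range(8)]
toy_model = lambda block: [[[1.0 if c == px else 0.0 for c in range(3)]
                            for px in row] for row in block]
labels, counts = overlap_inference(toy_image, 4, toy_model, 3)
```

Because every block has the same size, this scheme is compatible with a transformer module that expects a fixed number of tokens.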
|Models||3D-LWNet||1-D Auto-CNN||3-D Auto-CNN||3D-ANAS||Mix-Nas||MixT-NAS||MixT-NAS, OV|
IV-D Comparison with state-of-the-art methods
In this section, we compare the proposed 3D-ANASV2 with four other recent CNN-based HSI classification methods. The code for all comparison methods comes from the official repositories: 3D-LWNet (https://github.com/hkzhang91/LWNet), 1-D Auto-CNN and 3-D Auto-CNN (https://github.com/YushiChen/Auto-CNN-HSI-Classification), and 3D-ANAS (https://github.com/hkzhang91/3D-ANAS). TABLES V-X list the results of the comparative experiments, and Figs. 7-9 show the corresponding visual results.
The performance on Pavia University is listed in TABLES V and VI, and the corresponding visual comparison is shown in Fig. 7. From these results, we can draw the following conclusions: 1) Compared with the method based on 1D CNN, the methods based on 3D CNN usually perform better, because jointly using the spectral and spatial information improves classification accuracy. Moreover, compared with 3D-ANAS, which uses only 3D convolutions, the proposed method introduces a hybrid 2D-3D search space with separated spatial and spectral convolutions and achieves higher classification accuracy. 2) After grafting the transformer structure, the proposed 3D-ANASV2 achieves better performance than the other auto-designed methods. For example, 3D-ANASV2 achieves 98.03% OA, 98.41% AA, and 97.39% K when 20 training samples are extracted from each category, which are 2.29%, 1.81%, and 3.02% higher than 3D-ANAS, respectively. 3) The overlap inference strategy further improves the performance. As shown in TABLE V, using OV increases OA, AA, and K by 0.74%, 0.40%, and 0.98%, respectively.
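For readers less familiar with the three metrics, the sketch below computes them from a confusion matrix, assuming the standard definitions of overall accuracy (OA), average per-class accuracy (AA), and Cohen's kappa coefficient (K). The function name `classification_metrics` and the toy matrix are illustrative and are not taken from the experiments.

```python
def classification_metrics(confusion):
    """Return (OA, AA, kappa) for confusion[i][j] = class-i samples predicted as j."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    diagonal = [confusion[i][i] for i in range(n_classes)]
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]

    oa = sum(diagonal) / total  # fraction of all samples classified correctly
    per_class = [d / r for d, r in zip(diagonal, row_sums) if r > 0]
    aa = sum(per_class) / len(per_class)  # mean of per-class accuracies
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    kappa = (oa - expected) / (1.0 - expected)  # agreement beyond chance
    return oa, aa, kappa


# Toy two-class confusion matrix (hypothetical counts).
oa, aa, kappa = classification_metrics([[5, 1], [2, 4]])
print(f"OA={oa:.2f}  AA={aa:.2f}  K={kappa:.2f}")
```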
To save space, we only present the visual results using 30 training samples per class in Fig. 7. To clearly illustrate the differences, we place a partially enlarged patch in the upper right corner of each result map. The enlarged patches show that the 3D-ANASV2 variants produce fewer misclassified pixels. Some Asphalt pixels (class 1, cyan) are incorrectly classified as Self-Blocking Bricks (class 8, red) by 3D-LWNet and 3-D Auto-CNN, and many pixels belonging to Self-Blocking Bricks are incorrectly classified as Meadows (class 2, green) by 3D-ANAS. In the results of the 3D-ANASV2 variants, however, all pixels belonging to Asphalt and Self-Blocking Bricks are correctly classified.
TABLES VII and VIII collect the comparison results on Pavia Centre, and Fig. 8 shows the visual results for qualitative analysis. Compared with the results on Pavia University, the accuracy of all seven methods improves to some extent, and the proposed 3D-ANASV2 still attains the best performance. As Fig. 8 shows, the 3D-ANASV2 variants misclassify significantly fewer Bitumen pixels as Self-Blocking Bricks than the other methods. Although the 3D-ANASV2 variant with only the improved spatial-spectrum search space still makes some false predictions on the Bitumen class, introducing the transformer handles these cases well.
The comparison results on Houston University are shown in TABLES IX and X and Fig. 9. Compared with the first two datasets, Houston University contains more spectral bands and more object categories, so the classification accuracy of all methods on this dataset is relatively low, and the performance of the different methods varies considerably. As shown in Fig. 9, the result map of 1-D Auto-CNN clearly shows the structural outlines of different buildings, for example the dark red part of the partially enlarged area (Commercial, class 8). However, many misclassified pixels are scattered throughout the result map like salt-and-pepper noise, resulting in relatively poor visual quality. In contrast, 3-D Auto-CNN produces very smooth results in which the structural outlines are almost lost. 3D-ANAS and 3D-ANASV2 keep a relatively good balance between visual smoothness and preserving the contour structure, and perform better than the other algorithms. As shown in the enlarged image in Fig. 9, 3D-ANAS misclassifies some land pixels as stressed grass and highway, while 3D-ANASV2 has very few misclassified pixels. TABLES IX and X show that the results of 3D-ANASV2 are better than those of 3D-ANAS with both 20 and 30 training samples per class. Specifically, with 20 training samples per category, the OA, AA, and K of 3D-ANASV2 are 86.97%, 88.77%, and 85.92%, respectively, which are significantly higher than those of 3D-ANAS. When the training samples per category increase to 30, the advantage of 3D-ANASV2 over 3D-ANAS is more obvious: OA, AA, and K increase by 4.23%, 4.60%, and 4.98%, respectively.
|Dataset||Search Space||Model Size||Transformer||OA (%)||AA (%)||K (%)|
|Houston University||Spectrum||1.41 MB||✗||83.04||85.25||81.68|
|Houston University||Space||1.48 MB||✗||84.85||86.90||83.63|
|Houston University||Spectrum + Space||1.44 MB||✗||86.22||87.93||85.10|
|Houston University||Spectrum||9.98 MB||✓||88.18||89.75||87.22|
|Houston University||Space||10.04 MB||✓||89.53||90.96||88.67|
|Houston University||Spectrum + Space||10.00 MB||✓||90.04||91.89||89.64|
IV-E Ablation study
HSIs differ in spatial and spectral resolution, and during the searching stage different layers tend to select different types of cells. We speculate that keeping only the space dominated cell or only the spectrum dominated cell would hurt performance, even though both kinds of cells contain 3D convolution. Ablation experiments are therefore conducted to verify the effectiveness of the space-spectrum separation search space. We also compare the classification accuracy of the model with and without the transformer. The experiments are carried out on the most challenging dataset, Houston University, with 30 training samples per class.
Retaining only the space dominated cell would lose spectral information, while retaining only the spectrum dominated cell would lose spatial information. As shown in TABLE XI, the proposed method with the mixed search space achieves the highest accuracy, and the accuracy is lowest when only the spectral search space is retained. Combining different types of cells lets the network mine spatial and spectral information jointly in HSI classification tasks. In addition, introducing the transformer improves OA by about 3.8 to 5.1 percentage points under every search space setting, while the model size grows from about 1.4 MB to about 10 MB. This illustrates the importance of fully mining the associated information between pixels in the classification of HSIs.
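The size of the transformer's contribution can be read directly from the ablation table. The short sketch below recomputes the OA gain for each search space setting from the tabulated values; the variable names are illustrative.

```python
# OA (%) on Houston University with 30 training samples per class:
# search space -> (without transformer, with transformer), copied from the table.
oa_by_search_space = {
    "Spectrum": (83.04, 88.18),
    "Space": (84.85, 89.53),
    "Spectrum + Space": (86.22, 90.04),
}

# Gain in percentage points obtained by grafting the transformer.
gains = {
    name: round(with_transformer - without_transformer, 2)
    for name, (without_transformer, with_transformer) in oa_by_search_space.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain:.2f} OA")
```

The gain is largest for the spectrum-only search space and smallest for the mixed one, which already has the highest accuracy without the transformer.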
IV-F Architecture analysis
In this section, the architectures searched by 3D-ANASV2 are shown in Fig. 10 and analyzed. Since the three datasets have different spectral and spatial resolutions, and land covers, we searched for the architecture on each dataset, separately. Although these three architectures are different in topology and operations, they also have some common characteristics:
1) 2D spatial convolution and 2D spectral convolution play important roles in the final selected operations. As introduced in Section III-B, the search space we construct for searching the internal topology includes not only 2D spatial convolution and 2D spectral convolution, but also 3D convolution. Even so, 3D-ANASV2 tends to build a network that mixes 2D and 3D convolution operations. In most cases, 2D convolutions are the main operations and 3D convolutions play a complementary role. The proportions of 2D convolution operations in the final networks designed on Pavia Centre and Pavia University are 41.67% and 52.78%, respectively. In the architecture searched for Houston University, 2D convolution operations account for 44.44% of all operations. The proportions of 3D convolution operations are 13.89%, 8.33%, and 16.67% on Pavia Centre, Pavia University, and Houston University, respectively. This shows that although 3D convolution fits the data characteristics of HSIs, using it as extensively as traditional algorithms do is not necessary. The 2D-3D mixed network architecture we searched has fewer parameters and higher parameter utilization than models of the same scale.
2) In the final network, 3D convolution operations are distributed from the beginning to the end. However, 2D spectral convolutions dominate in the shallow layers, while 2D spatial convolutions are relatively numerous in the deep layers, as shown in Fig. 10. In the architectures for Pavia Centre and Pavia University, spectral convolutions account for the majority in the first two layers, but spatial convolutions account for the highest proportion in the last two layers. In the final architecture for Houston University, the first three layers are almost all spectral convolutions, and only the last layer of the network is mainly spatial convolution. In classic HSI classification networks such as SSRN, spectral convolution is always performed first, followed by spatial convolution. Our experimental results are consistent with this manual design experience.
3) With the enrichment of spectral information, the proportion of spectral convolution in the final network gradually increases. In different HSIs, the richness of spectral information and spatial information differs considerably. For example, both Pavia University and Pavia Centre have only 102 bands, while Houston University has 144 bands. Traditional 3D convolution pays the same attention to spatial information and spectral information. The space-spectrum separation search space we propose can flexibly adjust the ratio of spatial and spectral convolutions according to the proportion of spatial and spectral information in the data itself. As shown in Fig. 10, although the search space is exactly the same, 2D spectral convolution and 3D convolution account for 33.33% and 25.00% of operations in Pavia University and Pavia Centre, respectively. In the architecture for Houston University, this proportion rises to 44.44%. This may be because Houston University has a larger number of spectral bands and requires more spectral convolutions to extract its rich spectral information. The experimental results show that the spatial-spectrum separation search space adapts well to the characteristics of different data.
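As a sanity check on the proportions quoted in the three observations above, the sketch below searches for the smallest total number of operations for which every reported percentage corresponds to a whole number of operations. This is an arithmetic consistency check only: the helper names are hypothetical, and the resulting total is an inference from the rounded percentages, not a figure stated in the text.

```python
# Percentages reported in the architecture analysis (two decimal places).
reported = [41.67, 52.78, 44.44, 13.89, 8.33, 16.67, 33.33, 25.00]


def fits(percent, total):
    """True if some integer count out of `total` rounds to `percent`."""
    count = round(percent * total / 100)
    return abs(round(100 * count / total, 2) - percent) < 1e-9


def smallest_total(percentages, limit=200):
    """Smallest operation count consistent with all rounded percentages."""
    for total in range(1, limit + 1):
        if all(fits(p, total) for p in percentages):
            return total
    return None


total = smallest_total(reported)
counts = {p: round(p * total / 100) for p in reported}
print(total, counts)
```

Under this check, all eight percentages are consistent with whole-number operation counts out of a common total, which supports reading them as shares of the same set of operations in each searched network.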
V Conclusion
In this paper, we have proposed an auto-designed HSI classification method based on a CNN-Transformer mixed framework. The proposed 3D-ANASV2 has been comprehensively compared with a manually designed CNN-based HSI classification method (3D-LWNet) and automatically designed CNN-based methods (1-D Auto-CNN, 3-D Auto-CNN, and 3D-ANAS) on three typical public HSI datasets. The experimental results show that 3D-ANASV2 outperforms the other state-of-the-art deep-learning-based algorithms. Additionally, ablation studies have been carried out to verify the effectiveness of the proposed spatial-spectral search space and the grafted transformer. The results of the ablation studies demonstrate that 3D-ANASV2 does find a locally optimal architecture in the architecture search space and that pixel-to-pixel classification is beneficial for improving inference speed. Compared with the pure CNN HSI classification framework, the CNN-Transformer mixed framework captures the global connections between pixels. In future work, we will focus on designing a more efficient neural architecture search approach to automatically design a full transformer architecture for HSI classification.
- (2014) Detection of early plant stress responses in hyperspectral images. ISPRS Journal of Photogrammetry and Remote Sensing 93, pp. 98–111.
- (2018) Hyperspectral remote sensing applied to mineral exploration in southern Peru: a multiple data integration approach in the Chapi Chiara gold prospect. International Journal of Applied Earth Observation and Geoinformation 64, pp. 287–300.
- (2016-Oct.) Deep feature extraction and classification of hyperspectral images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 54 (10), pp. 6232–6251.
- (2014-Jun.) Deep learning-based classification of hyperspectral data. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 7 (6), pp. 2094–2107.
- Spectral–spatial classification of hyperspectral data based on deep belief network. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 8 (6), pp. 2381–2392.
- Automatic design of convolutional neural network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57 (9), pp. 7048–7066.
- (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
- (2021) LeViT: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136.
- (2017) Multiple kernel learning for hyperspectral image classification: a review. IEEE Transactions on Geoscience and Remote Sensing 55 (11), pp. 6547–6565.
- (2015) Deep convolutional neural networks for hyperspectral image classification. Journal of Sensors 2015.
- (2020) A lightweight convolutional neural network for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing.
- (2018) Modern trends in hyperspectral image analysis: a review. IEEE Access 6, pp. 14118–14129.
- (2017) Spectral–spatial classification of hyperspectral imagery with 3D convolutional neural network. Remote Sensing 9 (1), pp. 67.
- Spectral-spatial classification of hyperspectral image using autoencoders. In 2013 9th International Conference on Information, Communications & Signal Processing, pp. 1–5.
- (2019) Auto-DeepLab: hierarchical neural architecture search for semantic image segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 82–92.
- (2018) Hierarchical representations for efficient architecture search.
- (2019) DARTS: differentiable architecture search.
- (2021) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030.
- (2016-Sept.) Spectral–spatial classification of hyperspectral image based on deep auto-encoder. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 9 (9), pp. 4073–4085.
- Deep supervised learning for hyperspectral data classification through convolutional neural networks. In Proc. IEEE Conf. Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 4959–4962.
- (2016-Jul.) Integrating spectral and spatial information into deep convolutional neural networks for hyperspectral classification. In Proc. IEEE Conf. Int. Geosci. Remote Sens. Symp. (IGARSS), pp. 5067–5070.
- (2021) A lightweight spectral-spatial convolution module for hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters.
- (2001) A linear constrained distance-based discriminant analysis for hyperspectral image classification. Pattern Recognition 34 (2), pp. 361–373.
- (2019) Regularized evolution for image classifier architecture search. In Proc. AAAI Conf. Artificial Intell., Vol. 33, pp. 4780–4789.
- (2008) Supervised classification of remotely sensed imagery using a modified k-NN technique. IEEE Transactions on Geoscience and Remote Sensing 46 (7), pp. 2112–2125.
- Efficient residual dense block search for image super-resolution. In AAAI, pp. 12007–12014.
- (2014) Supervised spectral–spatial hyperspectral image classification with weighted Markov random fields. IEEE Transactions on Geoscience and Remote Sensing 53 (3), pp. 1490–1503.
- Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357.
- (2018) Survey of hyperspectral earth observation applications from space in the Sentinel-2 context. Remote Sensing 10 (2), pp. 157.
- (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- (2020) NAS-FCOS: fast neural architecture search for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- (2017-Aug.) Learning and transferring deep joint spectral–spatial features for hyperspectral classification. IEEE Trans. Geosci. Remote Sens. 55 (8), pp. 4729–4742.
- (2015) Spectral–spatial classification of hyperspectral images using deep convolutional neural networks. Remote Sensing Letters 6 (6), pp. 468–477.
- (2021) 3D-ANAS: 3D asymmetric neural architecture search for fast hyperspectral image classification. arXiv preprint arXiv:2101.04287.
- (2019) Hyperspectral classification based on lightweight 3-D-CNN with transfer learning. IEEE Transactions on Geoscience and Remote Sensing 57 (8), pp. 5813–5828.
- (2017) Spectral-spatial classification of hyperspectral imagery using a dual-channel convolutional neural network. Remote Sensing Letters 8 (5), pp. 438–447.
- (2016) Spectral-spatial classification of hyperspectral imagery based on deep convolutional network. In 2016 International Conference on Orange Technologies (ICOT), pp. 44–47.
- (2014-Apr.) An adaptive memetic fuzzy clustering algorithm with spatial information for remote sensing imagery. IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens. 7 (4), pp. 1235–1248.
- (2018) Practical block-wise neural network architecture generation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2423–2432.
- (2018-Feb.) Spectral–spatial residual network for hyperspectral image classification: a 3-D deep learning framework. IEEE Trans. Geosci. Remote Sens. 56 (2), pp. 847–858.
- (2018) Learning transferable architectures for scalable image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 8697–8710.