## I Introduction

The rapid development of optics and photonics has significantly advanced hyperspectral techniques. As a result, hyperspectral images, which consist of hundreds of contiguous bands and contain large amounts of useful information, can be easily acquired [Chen2015Spectral, Zhong2019Multiple]. Over the past few decades, hyperspectral image classification has played an important role in various fields, such as military target detection, vegetation monitoring, and disaster prevention and control.

Up to now, diverse approaches have been proposed for classifying the pixels of a hyperspectral image into land-cover categories. Early methods are mainly based on conventional pattern recognition techniques, such as the nearest neighbor classifier and linear classifiers. Among these conventional methods, $k$-nearest neighbor [Li2010Local] has been widely used due to its simplicity in both theory and practice. Support Vector Machine (SVM) [Kuo2010Spatial] also performs robustly and satisfactorily with high-dimensional hyperspectral data. In addition, graph-based methods [Shi2013Supervised], extreme learning machine [Wei2015Local], sparse representation-based classifiers [Yi2011Hyperspectral], and many other methods have been employed to improve the performance of hyperspectral image classification. Nevertheless, it is difficult to distinguish different land-cover categories accurately using spectral information alone [Hang2016Matrix]. Observing that spatially neighboring pixels usually carry correlated information within a smooth spatial domain, many researchers have resorted to spectral-spatial classification methods, and several models have been proposed to exploit such local continuity [Zhang2018Simultaneous, Zhong2017Discriminant]. For example, Markov Random Field (MRF)-based models [Wang2005A] have been widely used for exploiting spatial information and have achieved great popularity. In MRF-based models, spatial information is usually regarded as a prior before optimizing an energy function via posterior maximization. Meanwhile, morphological profile-based methods [Fauvel2008Spectral, Song2014Remotely] have also been proposed to effectively combine spatial and spectral information.

However, the aforementioned methods are all based on handcrafted spectral-spatial features [Zhang2018Diverse], which heavily depend on professional expertise and are quite empirical. To address this defect, deep learning [Liu2016Deep, Fan2014Saliency, Zhong2017Learning, Mou2017Deep] has been extensively employed for hyperspectral image classification and has attracted increasing attention for its strong representation ability. The main reason is that deep learning methods can automatically obtain abstract high-level representations by gradually aggregating low-level features, which avoids complicated feature engineering [Yang2018Hyperspectral]. The first attempt to use deep learning for hyperspectral image classification was made by Chen *et al.* [Chen2014Deep], who built a stacked autoencoder for high-level feature extraction. Subsequently, Mou *et al.* [Mou2017Deep] first employed a Recurrent Neural Network (RNN) for hyperspectral image classification. Besides, Ma *et al.* [Ma2015Hyperspectral] attempted to learn spectral-spatial features via a deep learning architecture, fine-tuning the network with a supervised strategy. Recently, the Convolutional Neural Network (CNN) has emerged as a powerful tool for hyperspectral image classification [Chen2016Deep, Hao2018Two, Zhao2016Spectral]. For instance, Jia *et al.* [Jia2016Convolutional] employed a CNN to extract spectral features and achieved performance superior to SVM. In addition, Hu *et al.* [Hu2015Deep] proposed a five-layer 1-D CNN to classify hyperspectral images directly in the spectral domain. In these methods, the convolution operation is mainly applied to the spectral domain while the spatial details are largely neglected. Another set of deep learning approaches performs hyperspectral image classification by incorporating spectral-spatial information. For example, Makantasis *et al.* [Makantasis2015Deep] encoded spectral-spatial information with a CNN and conducted classification with a multi-layer perceptron. Besides, Zhang *et al.* [Zhang2017Spectral] proposed a multi-dimensional CNN to automatically extract hierarchical spectral features and spatial features. Furthermore, Lee *et al.* [Lee2017Going] designed a novel contextual deep CNN, which is able to optimally explore contextual interactions by exploiting the local spectral-spatial relationship among spatially neighboring pixels. Specifically, the joint exploitation of spectral-spatial information is achieved by a multi-scale convolutional filter bank.

Although the existing CNN-based methods have achieved good performance to some extent, they still suffer from several drawbacks. Specifically, conventional CNN models only conduct convolution on regular square regions, so they cannot adaptively capture the geometric variations of different object regions in a hyperspectral image. Besides, the weights of each convolution kernel are identical when convolving all image patches. As a result, the information of class boundaries may be lost during feature abstraction, and misclassifications are likely to occur due to the inflexible convolution kernel. In other words, convolution kernels with fixed shape, size, and weights are not adaptive to all the regions in a hyperspectral image. Apart from that, CNN-based methods often take a long time to train because of their large number of parameters.

Consequently, in this paper, we propose to utilize the recently proposed Graph Convolutional Network (GCN) [Defferrard2016Convolutional, Kipf2016Semi] for hyperspectral image classification. GCN operates on a graph and is able to aggregate and transform feature information from the neighbors of every graph node. The convolution operation of GCN is therefore adaptively governed by the neighborhood structure of the graph, so GCN is applicable to non-Euclidean irregular data based on the predefined graph. Besides, both node features and local graph structure can be encoded by the learned hidden layers, so GCN is able to thoroughly exploit the image features and flexibly preserve the class boundaries.

Nevertheless, the direct use of traditional GCN for hyperspectral image classification is still inadequate. Since hyperspectral data is often contaminated by noise, the initial input graph may not be accurate. Specifically, the edge weight of pairwise pixels may not represent their intrinsic similarity, which makes the input graph less than optimal. Furthermore, traditional GCN can only use the spectral features of image pixels without incorporating the spatial context which is actually of great significance in hyperspectral images. Additionally, the computational complexity of traditional GCN will be unacceptable when the number of pixels gets too large. To tackle these difficulties in applying GCN to hyperspectral image classification, we propose a new type of GCN called ‘Multi-scale Dynamic GCN’ (MDGCN). Instead of utilizing a predefined fixed graph for convolution, we design a dynamic graph convolution operation, by which the similarity measures among pixels can be updated by fusing current feature embeddings. Consequently, the graph can be gradually refined during the convolution process of GCN, which will in turn make the feature embeddings more accurate. The processes of graph updating and feature embedding alternate, which work collaboratively to yield faithful graph structure and promising classification results. To take the multi-scale cues into consideration, we construct multiple graphs with different neighborhood scales so that the spatial information at different scales can be fully exploited [He2017Multi]. Different from commonly used GCN models which utilize only one fixed graph, the multi-scale design enables MDGCN to extract spectral-spatial features with varied receptive fields, by which the comprehensive contextual information from different levels can be incorporated. Moreover, due to the large number of pixels brought by the high spatial resolution of hyperspectral images, the computational complexity of network training can be extremely high. 
To mitigate this problem, we group the raw pixels into a certain number of homogeneous superpixels and treat each superpixel as a graph node. As a result, the number of nodes in each graph will be significantly reduced, which also helps to accelerate the subsequent convolution process.

To sum up, the main contributions of the proposed MDGCN are as follows: First, we propose a novel dynamic graph convolution operation, which can reduce the impact of a bad predefined graph. Secondly, multi-scale graph convolution is utilized to extensively exploit the spatial information and acquire better feature representation. Thirdly, the superpixel technique is involved in our proposed MDGCN framework, which significantly reduces the complexity of model training. Finally, the experimental results on three typical hyperspectral image datasets show that MDGCN achieves state-of-the-art performance when compared with the existing methods.

## II Related Works

In this section, we review some representative works on hyperspectral image classification and GCN, as they are related to this work.

### II-A Hyperspectral Image Classification

As a traditional yet important remote sensing technique, hyperspectral image classification has been intensively investigated and many related methods have been proposed, such as Bayesian methods [8390931] [709601], and kernel methods [Maji2008Classification]. Particularly, SVM has shown impressive classification performance with limited labeled examples [Mercier2003Support]. However, SVM treats every pixel (i.e., example) independently and fails to exploit the correlations among different image pixels. To address this limitation, spatial information is introduced. For instance, by directly incorporating spatial information into kernel design, Camps-Valls *et al.* [Camps2006Composite] used SVM with composite kernels for hyperspectral image classification. Besides, filtering-based methods have also been applied to spectral-spatial classification. In [Lin2016A], Lin *et al.* designed a three-dimensional filter with a Gaussian kernel and its derivative for spectral-spatial information extraction. After that, they also proposed a discriminative low-rank Gabor filtering for spectral-spatial information extraction [Lin2017Discriminative]. Additionally, MRF has been commonly used to exploit spatial context for hyperspectral image classification with the assumption that spatially neighboring pixels are more likely to take the same label [5997308]. However, when neighboring pixels are highly correlated, the standard neighbor determination approaches will degrade the MRF models due to the insufficient number of pixels they contain [Fauvel2013Advances]. Therefore, instead of modeling the joint distribution of spatially neighboring pixels, conditional random field directly models the class posterior probability given the hyperspectral image and has achieved encouraging performance [Zhang2012Simplified, Zhong2014A].

The aforementioned methods simply employ various manually extracted spectral-spatial features to represent the pixels, which highly depends on experts' experience and does not generalize well. In contrast, deep learning-based methods [Ma2015Hyperspectral, Zhao2015On], which can generate features automatically, have recently attracted increasing attention in hyperspectral image classification. The first attempt can be found in [Chen2014Deep], where a stacked autoencoder was utilized for high-level feature extraction. Subsequently, Li *et al.* [Tong2015Classification] used restricted Boltzmann machine and deep belief network for hyperspectral image feature extraction and pixel classification, by which the information contained in the original data can be well retained. Meanwhile, the RNN model has been applied to hyperspectral image classification [Shi2018Multi]. In [Shi2018Multi], Shi *et al.* exploited multi-scale spectral-spatial features via a hierarchical RNN, which can learn the spatial dependency of non-adjacent image patches in a two-dimensional spatial domain. Among these deep learning methods, CNN, which needs fewer parameters than fully-connected networks with the same number of hidden layers, has drawn great attention for its breakthrough in hyperspectral image classification. For example, in [Hu2015Deep, Ghamisi2017A], CNN was used to extract spectral features and performed better than SVM. Nonetheless, exploiting spatial information is of great importance in hyperspectral image classification, and many CNN-based methods have explored this aspect. For instance, Yang *et al.* [Yang2016Hyperspectral] proposed a two-channel deep CNN to jointly learn spectral-spatial features from hyperspectral images, where the two channels are used for learning spectral and spatial features, respectively. Besides, in [Yue2015Spectral], Yue *et al.* projected hyperspectral data onto several principal components before adopting CNN to extract spectral-spatial features. In the recent work of Li *et al.* [Li2016Hyperspectral], a deep CNN is used to learn pixel-pair features, and the classification results of pixels in different pairs from the neighborhood are then fused. Additionally, Zhang *et al.* [Zhang2018Diverse] proposed a deep CNN model based on diverse regions, which employs different local or global region inputs to learn a joint representation of each pixel. Although CNN-based hyperspectral image classification methods can extract spectral-spatial features automatically, the effectiveness of the obtained features is still restricted by some issues. For example, they simply apply fixed convolution kernels to different regions in a hyperspectral image, which does not consider the geometric appearance of various local regions and may result in undesirable misclassifications.

### II-B Graph Convolutional Network

The concept of a neural network for graph data was first proposed by Gori *et al.* [Gori2005A]; its advantage over CNN and RNN is that it can work on graph-structured non-Euclidean data. Specifically, the Graph Neural Network (GNN) can collectively aggregate the node features in a graph and properly embed the entire graph in a new discriminative space. Subsequently, Scarselli *et al.* [Franco2009The] made GNN trainable by a supervised learning algorithm for practical data. However, their algorithm is computationally expensive and runs inefficiently on large-scale graphs. Therefore, Bruna *et al.* [Bruna2014Spectral] developed the operation of 'graph convolution' based on spectral properties, which convolves on the neighborhood of every graph node and produces a node-level output. After that, many extensions of graph convolution have been investigated and have achieved advanced results [Dai2016Discriminative, Monti2017Geometric]. For instance, Hamilton *et al.* [Hamilton2017Inductive] presented an inductive framework called 'GraphSAGE', which leverages node features to effectively generate node embeddings for previously unseen data. Apart from this, Defferrard *et al.* [Defferrard2016Convolutional] proposed a formulation of CNNs in the context of spectral graph theory. Based on their work, Kipf and Welling [Kipf2016Semi] proposed a fast localized approximation of the convolution, which makes the GCN model able to encode both graph structure and node features. In their work, GCN was simplified by a first-order approximation of graph spectral convolution, which leads to more efficient filtering operations.

With the rapid development of graph convolution theories, GCN has been widely applied to various applications, such as recommender systems [Ying2018Graph] and semantic segmentation [Qi20173D]. Besides, to the best of our knowledge, GCN has been deployed for hyperspectral image classification in only one prior work [8474300]. However, [8474300] only utilizes a fixed graph during the node convolution process, so the intrinsic relationship among the pixels cannot be precisely reflected. Moreover, the neighborhood size in their method is also fixed, and thus the spectral-spatial information in different local regions cannot be flexibly captured. To cope with these issues, we propose a novel dynamic multi-scale GCN which dynamically updates the graphs and fuses multi-scale spectral-spatial information for hyperspectral image classification. As a result, accurate node embeddings can be acquired, which ensures satisfactory classification performance.

## III The Proposed Method

This section details our proposed MDGCN model (see Fig. 1). When an input hyperspectral image is given, it is pre-processed by the Simple Linear Iterative Clustering (SLIC) algorithm [Radhakrishna2012SLIC] to be segmented into several homogeneous superpixels. Then, graphs are constructed over these superpixels at different spatial scales. After that, convolutions are conducted on these graphs, which simultaneously aggregate multi-scale spectral-spatial features and gradually refine the input graphs. The superpixels potentially belonging to the same class will ideally be clustered together in the embedding space. Finally, the classification result is produced by the well-trained network. Next we detail the critical steps of our MDGCN by explaining the superpixel segmentation (Section III-A), presenting the GCN backbone (Section III-B), elaborating the dynamic graph evolution (Section III-C), and describing the multi-scale manipulation (Section III-D).

### III-A Superpixel Segmentation

A hyperspectral image usually contains hundreds of thousands of pixels, which may result in unacceptable computational complexity for the subsequent graph convolution and classification. To address this problem, we adopt a segmentation algorithm named SLIC [Radhakrishna2012SLIC] to segment the entire image into a small number of compact superpixels, where each superpixel represents a homogeneous image region with strong spectral-spatial similarity. Concretely, the SLIC algorithm starts from an initial grid on the image and then creates the segmentation by iteratively growing the local clusters using a $k$-means algorithm. When the segmentation is finished, each superpixel, rather than each pixel of the input image, is treated as a graph node. Therefore, the number of graph nodes can be significantly reduced and the computational efficiency can be improved. Here the feature of each node (i.e., superpixel) is the average spectral signature of the pixels contained in the corresponding superpixel. Another advantage of the superpixel segmentation is that the generated superpixels also help to preserve the local structural information of a hyperspectral image, as nearby pixels with high spatial consistency have a large probability of belonging to the same land-cover type (i.e., label).
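The node features described above can be sketched as follows. This is a minimal NumPy illustration, assuming a segmentation map is already available (one integer superpixel label per pixel, for example from any SLIC implementation). The function name `superpixel_features` and the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def superpixel_features(image, segments):
    """Average the spectra of all pixels in each superpixel.

    image:    (H, W, B) array of spectral signatures.
    segments: (H, W) integer array with labels 0..n-1 (assumed contiguous,
              every label used at least once).
    Returns an (n, B) array holding one mean spectrum per superpixel.
    """
    n_bands = image.shape[-1]
    flat_pixels = image.reshape(-1, n_bands)
    flat_labels = segments.reshape(-1)
    n_nodes = flat_labels.max() + 1
    sums = np.zeros((n_nodes, n_bands))
    np.add.at(sums, flat_labels, flat_pixels)   # accumulate spectra per label
    counts = np.bincount(flat_labels, minlength=n_nodes)
    return sums / counts[:, None]

# Toy 2 x 2 image with 3 bands, split into two superpixels (rows).
image = np.array([[[1.0, 2.0, 3.0], [3.0, 4.0, 5.0]],
                  [[10.0, 10.0, 10.0], [20.0, 20.0, 20.0]]])
segments = np.array([[0, 0],
                     [1, 1]])
features = superpixel_features(image, segments)
```

Each row of `features` then serves as the feature vector of one graph node, so the graph size depends on the number of superpixels rather than the number of pixels.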

### III-B Graph Convolutional Network

GCN [Kipf2016Semi] is a multi-layer neural network which operates directly on a graph and generates node embeddings by gradually fusing the features in the neighborhood. Different from traditional CNN, which only applies to data represented by regular grids, GCN is able to operate on data with arbitrary non-Euclidean structure. Formally, an undirected graph is defined as $\mathcal{G}=(\mathcal{V},\mathcal{E})$, where $\mathcal{V}$ and $\mathcal{E}$ are the sets of nodes and edges, respectively. The notation $\mathbf{A}$ denotes the adjacency matrix of $\mathcal{G}$, which indicates whether each pair of nodes is connected and can be calculated as

$$
\mathbf{A}_{ij}=\begin{cases} e^{-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|^2} & \text{if } \mathbf{x}_i\in N(\mathbf{x}_j) \text{ or } \mathbf{x}_j\in N(\mathbf{x}_i), \\ 0 & \text{otherwise}, \end{cases} \tag{1}
$$

where the parameter $\gamma$ is empirically set to 0.2 in the experiments, $\mathbf{x}_i$ represents a superpixel, and $N(\mathbf{x}_j)$ is the set of neighbors of the example $\mathbf{x}_j$.
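A small sketch of the graph construction may help here. It assumes that edge weights follow a Gaussian kernel $e^{-\gamma\|\mathbf{x}_i-\mathbf{x}_j\|^2}$ between neighboring superpixels and are zero elsewhere, with $\gamma=0.2$ as stated in the text. The neighbor dictionary and the name `build_adjacency` are hypothetical; how spatial neighbors are found is left outside the sketch.

```python
import numpy as np

def build_adjacency(features, neighbors, gamma=0.2):
    """Symmetric weighted adjacency over superpixels.

    features:  (n, B) array of node features.
    neighbors: dict mapping node index -> iterable of neighbor indices.
    An edge exists if either node lists the other as a neighbor.
    """
    n = features.shape[0]
    adjacency = np.zeros((n, n))
    for i, nbrs in neighbors.items():
        for j in nbrs:
            if i == j:
                continue
            dist_sq = np.sum((features[i] - features[j]) ** 2)
            weight = np.exp(-gamma * dist_sq)
            adjacency[i, j] = weight
            adjacency[j, i] = weight   # the "or" condition makes A symmetric
    return adjacency

# Three toy nodes; node 0 lists 1, node 1 lists 2, node 2 lists nobody.
features = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
neighbors = {0: [1], 1: [2], 2: []}
A = build_adjacency(features, neighbors)
```

Nodes with similar spectra receive weights close to 1, while dissimilar neighbors are only weakly connected.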

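To conduct node embeddings for $\mathcal{G}$, spectral filtering on graphs is defined, which can be expressed as a signal $\mathbf{x}$ filtered by $g_{\theta}=\mathrm{diag}(\theta)$ in the Fourier domain, namely

$$
g_{\theta}\star\mathbf{x}=\mathbf{U}g_{\theta}\mathbf{U}^{\top}\mathbf{x}, \tag{2}
$$

where $\mathbf{U}$ is the matrix composed of the eigenvectors of the normalized graph Laplacian $\mathbf{L}=\mathbf{I}-\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}=\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\top}$. Here $\boldsymbol{\Lambda}$ is a diagonal matrix containing the eigenvalues of $\mathbf{L}$, $\mathbf{D}$ is the degree matrix with $\mathbf{D}_{ii}=\sum_{j}\mathbf{A}_{ij}$, and $\mathbf{I}$ denotes the identity matrix with proper size throughout this paper. Then, we can understand $g_{\theta}$ as a function of the eigenvalues of $\mathbf{L}$, i.e., $g_{\theta}(\boldsymbol{\Lambda})$. To reduce the computational cost of the eigendecomposition in Eq. (2), Hammond *et al.* [Hammond2009Wavelets] approximated $g_{\theta}(\boldsymbol{\Lambda})$ by a truncated expansion in terms of Chebyshev polynomials $T_k(x)$ up to the $K$-th order, which is

$$
g_{\theta'}(\boldsymbol{\Lambda})\approx\sum_{k=0}^{K}\theta'_{k}T_{k}(\tilde{\boldsymbol{\Lambda}}), \tag{3}
$$

where $\theta'$ is a vector of Chebyshev coefficients and $\tilde{\boldsymbol{\Lambda}}=\frac{2}{\lambda_{\max}}\boldsymbol{\Lambda}-\mathbf{I}$, with $\lambda_{\max}$ being the largest eigenvalue of $\mathbf{L}$. According to [Hammond2009Wavelets], the Chebyshev polynomials are defined as $T_k(x)=2xT_{k-1}(x)-T_{k-2}(x)$ with $T_0(x)=1$ and $T_1(x)=x$. Therefore, the convolution of a signal $\mathbf{x}$ by the filter $g_{\theta'}$ can be written as

$$
g_{\theta'}\star\mathbf{x}\approx\sum_{k=0}^{K}\theta'_{k}T_{k}(\tilde{\mathbf{L}})\mathbf{x}, \tag{4}
$$

where $\tilde{\mathbf{L}}=\frac{2}{\lambda_{\max}}\mathbf{L}-\mathbf{I}$ denotes the scaled Laplacian matrix. Eq. (4) can be easily verified by using the fact $(\mathbf{U}\boldsymbol{\Lambda}\mathbf{U}^{\top})^{k}=\mathbf{U}\boldsymbol{\Lambda}^{k}\mathbf{U}^{\top}$. It can be observed that this expression is a $K$-th order polynomial in the Laplacian (i.e., it is $K$-localized). That is to say, the filtering only depends on the nodes which are at most $K$ steps away from the central node. In this paper, we consider the first-order neighborhood, i.e., $K=1$, and thus Eq. (4) becomes a linear function on the graph Laplacian spectrum with respect to $\mathbf{L}$.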
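To make the localization property concrete, the following sketch applies a truncated Chebyshev filter to a signal on a small path graph using the three-term recurrence. It is an illustration under the standard definitions (normalized Laplacian, scaled by its largest eigenvalue); the function names and coefficient values are hypothetical.

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^{-1/2} A D^{-1/2} for a graph without isolated nodes."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def chebyshev_filter(A, x, theta):
    """Apply sum_k theta[k] * T_k(L_tilde) @ x via the Chebyshev recurrence."""
    L = normalized_laplacian(A)
    lam_max = np.linalg.eigvalsh(L).max()
    L_tilde = (2.0 / lam_max) * L - np.eye(A.shape[0])
    t_prev, t_curr = x, L_tilde @ x            # T_0 x and T_1 x
    out = theta[0] * t_prev
    if len(theta) > 1:
        out = out + theta[1] * t_curr
    for k in range(2, len(theta)):
        # T_k x = 2 L_tilde T_{k-1} x - T_{k-2} x
        t_prev, t_curr = t_curr, 2.0 * (L_tilde @ t_curr) - t_prev
        out = out + theta[k] * t_curr
    return out

# Path graph 0 - 1 - 2 - 3 with an impulse on node 0.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
x = np.array([1.0, 0.0, 0.0, 0.0])
y_k1 = chebyshev_filter(A, x, theta=[0.5, 0.3])        # K = 1
y_k2 = chebyshev_filter(A, x, theta=[0.5, 0.3, 0.2])   # K = 2
```

With $K=1$ the impulse only reaches node 1, and with $K=2$ it reaches node 2 but not node 3, which is the $K$-localized behavior described above.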

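After that, we can build a neural network based on graph convolutions by stacking multiple convolutional layers in the form of Eq. (4), where each layer is followed by an element-wise non-linear operation softplus($\cdot$) [7280459]. In this way, we can acquire a diverse class of convolutional filter functions by stacking multiple layers with the same configuration. With the linear formulation, Kipf and Welling [Kipf2016Semi] further approximated $\lambda_{\max}\approx 2$, as the neural network parameters can adapt to this change in scale during the training process. Therefore, Eq. (4) can be simplified to

$$
g_{\theta'}\star\mathbf{x}\approx\theta'_{0}\mathbf{x}+\theta'_{1}(\mathbf{L}-\mathbf{I})\mathbf{x}=\theta'_{0}\mathbf{x}-\theta'_{1}\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\mathbf{x}, \tag{5}
$$

where $\theta'_{0}$ and $\theta'_{1}$ are two free parameters. Since reducing the number of parameters helps to address overfitting, Eq. (5) is converted to

$$
g_{\theta}\star\mathbf{x}\approx\theta\left(\mathbf{I}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\right)\mathbf{x} \tag{6}
$$

by letting $\theta=\theta'_{0}=-\theta'_{1}$. Since $\mathbf{I}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}$ has eigenvalues in the range $[0,2]$, repeatedly applying this operator will lead to numerical instabilities and exploding/vanishing gradients in a deep neural network. To cope with this problem, Kipf and Welling [Kipf2016Semi] performed the renormalization trick $\mathbf{I}+\mathbf{D}^{-\frac{1}{2}}\mathbf{A}\mathbf{D}^{-\frac{1}{2}}\rightarrow\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}$ with $\tilde{\mathbf{A}}=\mathbf{A}+\mathbf{I}$ and $\tilde{\mathbf{D}}_{ii}=\sum_{j}\tilde{\mathbf{A}}_{ij}$. As a result, the convolution operation of the GCN model can be expressed as

$$
\mathbf{H}^{(l)}=\sigma\left(\tilde{\mathbf{D}}^{-\frac{1}{2}}\tilde{\mathbf{A}}\tilde{\mathbf{D}}^{-\frac{1}{2}}\mathbf{H}^{(l-1)}\mathbf{W}^{(l)}\right), \tag{7}
$$

where $\mathbf{H}^{(l)}$ is the output (namely, embedding result) of the $l$-th layer; $\sigma(\cdot)$ represents an activation function, such as the softplus function [7280459] used in this paper; and $\mathbf{W}^{(l)}$ denotes the trainable weight matrix of the $l$-th layer.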
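The layer-wise propagation rule of Kipf and Welling can be sketched in a few lines of NumPy. This is a forward pass only, with random weights standing in for trained parameters. The layer sizes and the helper names `renormalized_adjacency` and `gcn_layer` are illustrative assumptions, not the authors' TensorFlow implementation.

```python
import numpy as np

def softplus(z):
    """Numerically stable softplus, log(1 + exp(z))."""
    return np.logaddexp(0.0, z)

def renormalized_adjacency(A):
    """D_tilde^{-1/2} (A + I) D_tilde^{-1/2}, i.e. the renormalization trick."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, activation=softplus):
    """One propagation step: activation(A_hat @ H @ W)."""
    return activation(A_hat @ H @ W)

rng = np.random.default_rng(0)
A = np.array([[0.0, 0.8, 0.0],
              [0.8, 0.0, 0.4],
              [0.0, 0.4, 0.0]])
H0 = rng.normal(size=(3, 5))           # 3 nodes, 5 input features
W1 = rng.normal(size=(5, 20)) * 0.1    # 20 hidden units
W2 = rng.normal(size=(20, 4)) * 0.1    # 4 output classes (illustrative)

A_hat = renormalized_adjacency(A)
H1 = gcn_layer(A_hat, H0, W1)
H2 = gcn_layer(A_hat, H1, W2)
```

Adding the self-loops before normalizing keeps the spectrum of the propagation matrix within $[-1, 1]$, which is what makes stacking several such layers numerically stable.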

### III-C Dynamic Graph Evolution

As mentioned in the introduction, one major disadvantage of conventional GCN is that the graph is fixed throughout the convolution process, which will degrade the final classification performance if the input graph is not accurate. To remedy this defect, in this paper, we propose a dynamic GCN in which the graph can be gradually refined during the convolution process. The main idea is to find an improved graph by fusing the information of current data embeddings and the graph used in the previous layer.

We denote the adjacency matrix in the layer by . The embedding kernel encodes the pairwise similarity of the embeddings generated from the layer. For each data point , it is reasonable to assume that

obeys Gaussian distribution with the covariance

and unknown mean , i.e., . Based on the definitions above, the fused kernel can be obtained by linearly combining and , namely(8) |

where is the weight assigned to the embedding kernel . The operation in Eq. (8) actually corresponds to the addition operator: with being the embedding result of the data point and being the fused result. From Eq. (8), we can see that this fusion technique utilizes the information of the embedding results encoded in and also the previous adjacency matrix to refine the graph. The advantages of such strategy are two-fold: firstly, the introduction of embedding information helps to find a more accurate graph; and secondly, the improved graph will in turn make the embeddings more discriminative. However, there are still some problems regarding the fused kernel . The fusion in Eq. (8) will lead to performance degradation if the embeddings are not sufficiently accurate to characterize the intrinsic similarity of the input data. As a result, according to [Bo2012Unsupervised], we need to re-emphasize the inherent structure among the input data carried by the initial adjacency matrix. Therefore, we do the following projection on the fused result by using the initial adjacency matrix , which leads to

(9) |

where

denotes white noise, i.e.,

; the parameter is used to control the relative importance of . With this projection, we have:(10) |

where is an identity matrix. Therefore, the marginal distribution of is

(11) |

Since is Gaussian distributed with the covariance , the graph can be dynamically updated as

(12) |

To start Eq. (12), here for the first graph convolutional layer is the initial adjacency matrix . By expanding Eq. (12), each element of can be written as

(13) |

where is the inner product of two vectors. By observing Eq. (13), we can find that is similar to if they have many common neighbors.
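One refinement step can be sketched as below, under explicitly assumed forms that are inferred from the description rather than copied from the paper: the embedding kernel is taken to be the linear kernel $\mathbf{H}\mathbf{H}^{\top}$, the fused kernel is $\mathbf{F}=\mathbf{A}^{(l)}+\alpha\,\mathbf{H}\mathbf{H}^{\top}$, and the updated graph is the covariance of a linear-Gaussian projection through the initial adjacency matrix, $\mathbf{A}\mathbf{F}\mathbf{A}^{\top}+\beta^{2}\mathbf{I}$. The function name and the values of `alpha` and `beta` are hypothetical.

```python
import numpy as np

def refine_graph(A_init, A_prev, H, alpha=0.6, beta=0.2):
    """One graph-refinement step (assumed form, see the text above).

    A_init: (n, n) initial adjacency used for the projection.
    A_prev: (n, n) adjacency used by the current layer.
    H:      (n, d) embeddings produced by the current layer.
    """
    K_E = H @ H.T                      # assumed linear embedding kernel
    F = A_prev + alpha * K_E           # fused kernel
    n = A_init.shape[0]
    # Covariance of A_init @ x_F + beta * noise when cov(x_F) = F.
    return A_init @ F @ A_init.T + beta ** 2 * np.eye(n)

# Nodes 0 and 2 are not connected but share the neighbor 1.
A0 = np.array([[0.0, 1.0, 0.0],
               [1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
A1 = refine_graph(A0, A0, H)
```

In this toy case the refined similarity between nodes 0 and 2 becomes positive even though they were not connected initially, because both are adjacent to node 1.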

### III-D Multi-Scale Manipulation

Multi-scale information has been widely demonstrated to be useful for hyperspectral image classification problems [Srivastava2014Dropout, 7729625]. This is because the objects in a hyperspectral image usually have different geometric appearances, and the contextual information revealed by different scales helps to exploit the abundant local properties of image regions at diverse levels. In our method, the multi-scale spectral-spatial information is captured by constructing graphs at different neighborhood scales. Specifically, at the scale $s$, every superpixel is connected to its $s$-hop neighbors. Fig. 2 exhibits the 1-hop and 2-hop neighbors of a central example $\mathbf{x}_i$ to illustrate the multi-scale design. Then, the receptive field of $\mathbf{x}_i$ at the scale $s$ is formed as

$$
R_{s}(\mathbf{x}_i)=R_{s-1}(\mathbf{x}_i)\cup R_{1}\left(R_{s-1}(\mathbf{x}_i)\right), \tag{14}
$$

where $R_{0}(\mathbf{x}_i)=\mathbf{x}_i$ and $R_{1}(\mathbf{x}_i)$ is the set of 1-hop neighbors of $\mathbf{x}_i$. Considering both effectiveness and efficiency, in our method we construct the graphs at the scales $s=1$, $s=2$, and $s=3$, respectively. Therefore, the formulation of the graph convolutional layer is expressed as

$$
\mathbf{H}_{s}^{(l)}=\sigma\left(\mathbf{A}_{s}^{(l)}\mathbf{H}_{s}^{(l-1)}\mathbf{W}_{s}^{(l)}\right), \tag{15}
$$

where $\mathbf{A}_{s}^{(l)}$, $\mathbf{H}_{s}^{(l)}$, and $\mathbf{W}_{s}^{(l)}$ denote the adjacency matrix, the output matrix, and the trainable weight matrix of the $l$-th graph convolutional layer at the scale $s$. Note that the input matrix $\mathbf{H}_{s}^{(0)}$ is shared by all scales. Based on Eq. (15), the output of MDGCN can be obtained by

$$
\mathbf{O}=\sum_{s}\mathbf{H}_{s}^{(L)}, \tag{16}
$$

where $L$ is the number of graph convolutional layers shared by all scales, and $\mathbf{O}$ is the output of MDGCN. The convolution process of MDGCN is summarized in Algorithm 1. In our model, the cross-entropy error is adopted to penalize the difference between the network output and the labels of the original labeled examples, which is

$$
\mathcal{L}=-\sum_{g\in\mathbf{y}_{G}}\sum_{f=1}^{C}\mathbf{Y}_{gf}\ln\mathbf{O}_{gf}, \tag{17}
$$

where $\mathbf{y}_{G}$ is the set of indices corresponding to the labeled examples, $C$ denotes the number of classes, and $\mathbf{Y}$ denotes the label matrix. Similar to [Kipf2016Semi], the network parameters here are learned by using full-batch gradient descent, where all superpixels are utilized to perform gradient descent in each iteration. The implementation details of our MDGCN are shown in Algorithm 2.
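The two ingredients of this subsection, hop-based receptive fields and a loss restricted to labeled nodes, can be illustrated with a short sketch. It rests on several assumptions that the text does not confirm: the receptive field at a given scale contains the central superpixel plus everything reachable within that many hops, the per-scale outputs are fused by summation, and a softmax is applied to the fused scores before the cross-entropy. All names and toy values are hypothetical.

```python
from collections import deque
import numpy as np

def receptive_field(neighbors, center, scale):
    """Return all nodes within `scale` hops of `center` (center included)."""
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == scale:
            continue                      # do not expand beyond the scale
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

def masked_cross_entropy(outputs, labels_onehot, labeled_idx):
    """Cross-entropy summed over labeled nodes; `outputs` are raw scores."""
    scores = outputs[labeled_idx]
    scores = scores - scores.max(axis=1, keepdims=True)   # stabilize softmax
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -np.sum(labels_onehot[labeled_idx] * log_probs)

# Chain of superpixels 0 - 1 - 2 - 3 - 4 (hypothetical spatial adjacency).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
fields = {s: receptive_field(neighbors, 2, s) for s in (1, 2, 3)}

# Two per-scale score matrices for 3 nodes and 2 classes, fused by summation.
per_scale_outputs = [np.zeros((3, 2)), np.zeros((3, 2))]
fused = sum(per_scale_outputs)
Y = np.array([[1, 0], [0, 0], [0, 1]])    # node 1 is unlabeled
loss = masked_cross_entropy(fused, Y, labeled_idx=[0, 2])
```

Unlabeled nodes still take part in the graph convolution, but only the rows listed in `labeled_idx` contribute to the loss.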

## IV Experimental Results

In this section, we conduct extensive experiments to validate the effectiveness of the proposed MDGCN method, and also provide the corresponding algorithm analyses. To be specific, we first compare MDGCN with other state-of-the-art approaches on three publicly available hyperspectral image datasets, where four metrics including per-class accuracy, overall accuracy (OA), average accuracy (AA), and kappa coefficient are adopted. Then, we demonstrate that both the multi-scale manipulation and the dynamic graph design in our MDGCN are beneficial to obtaining promising performance. After that, we validate the effectiveness of our method in dealing with the boundary regions. Finally, we compare the computational time of various methods to show the efficiency of our algorithm.
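For reference, the four metrics can be computed from a confusion matrix as in the sketch below, using the standard definitions of OA, AA, and Cohen's kappa. The function name and the toy predictions are illustrative only.

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes):
    """Return per-class accuracy, OA, AA, and Cohen's kappa."""
    confusion = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(confusion, (y_true, y_pred), 1)   # rows: truth, cols: prediction
    total = confusion.sum()
    per_class = np.diag(confusion) / confusion.sum(axis=1)
    oa = np.trace(confusion) / total
    aa = per_class.mean()
    # Chance agreement from the row and column marginals.
    expected = (confusion.sum(axis=0) * confusion.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return per_class, oa, aa, kappa

y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 2, 0])
per_class, oa, aa, kappa = classification_metrics(y_true, y_pred, 3)
```

OA weights every test pixel equally, AA weights every class equally, and kappa discounts the agreement expected by chance, so the three can diverge noticeably on imbalanced datasets.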

### IV-A Datasets

The performance of the proposed MDGCN is evaluated on three datasets, i.e., the Indian Pines, the University of Pavia, and the Kennedy Space Center, which will be introduced below.

#### IV-A1 Indian Pines

The Indian Pines dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor in 1992 and records a scene in north-western Indiana. It consists of $145\times 145$ pixels with a spatial resolution of 20 m $\times$ 20 m and has 220 spectral channels covering the range from 0.4 $\mu$m to 2.5 $\mu$m. As a usual step, 20 water absorption and noisy bands are removed, and 200 bands are retained. The original ground truth includes 16 land-cover classes, such as 'Alfalfa', 'Corn-notill', and 'Corn-mintill'. Fig. 3 exhibits the false color image and ground-truth map of the Indian Pines dataset. The numbers of labeled and unlabeled pixels of the various classes are listed in Table I.

ID | Class | #Labeled | #Unlabeled |
---|---|---|---|
1 | Alfalfa | 30 | 31 |
2 | Corn-notill | 30 | 1398 |
3 | Corn-mintill | 30 | 800 |
4 | Corn | 30 | 207 |
5 | Grass-pasture | 30 | 453 |
6 | Grass-trees | 30 | 700 |
7 | Grass-pasture-mowed | 15 | 13 |
8 | Hay-windrowed | 30 | 448 |
9 | Oats | 15 | 5 |
10 | Soybean-notill | 30 | 942 |
11 | Soybean-mintill | 30 | 2425 |
12 | Soybean-clean | 30 | 563 |
13 | Wheat | 30 | 175 |
14 | Woods | 30 | 1235 |
15 | Buildings-grass-trees-drives | 30 | 356 |
16 | Stone-steel-towers | 30 | 63 |

#### IV-A2 University of Pavia

The University of Pavia dataset was captured over Pavia University in Italy with the ROSIS sensor in 2001. It consists of $610\times 340$ pixels with a spatial resolution of 1.3 m $\times$ 1.3 m and has 103 spectral channels in the wavelength range from 0.43 $\mu$m to 0.86 $\mu$m after removing noisy bands. This dataset includes 9 land-cover classes, such as 'Asphalt', 'Meadows', and 'Gravel', which are shown in Fig. 4. Table II lists the numbers of labeled and unlabeled pixels of each class.

ID | Class | #Labeled | #Unlabeled |
---|---|---|---|
1 | Asphalt | 30 | 6601 |
2 | Meadows | 30 | 18619 |
3 | Gravel | 30 | 2069 |
4 | Trees | 30 | 3034 |
5 | Painted metal sheets | 30 | 1315 |
6 | Bare soil | 30 | 4999 |
7 | Bitumen | 30 | 1300 |
8 | Self-blocking bricks | 30 | 3652 |
9 | Shadows | 30 | 917 |

#### IV-A3 Kennedy Space Center

The Kennedy Space Center dataset was taken by the AVIRIS sensor over Florida with a spectral coverage ranging from 0.4 $\mu$m to 2.5 $\mu$m. This dataset contains 224 bands and $614\times 512$ pixels with a spatial resolution of 18 m. After removing water absorption and noisy bands, the remaining 176 bands of the image are preserved. The Kennedy Space Center dataset includes 13 land-cover classes, such as 'Scrub', 'Willow swamp', and 'CP hammock'. Fig. 5 exhibits the false color image and ground-truth map of the Kennedy Space Center dataset. The numbers of labeled and unlabeled pixels of different classes are listed in Table III.

ID | Class | #Labeled | #Unlabeled |
---|---|---|---|
1 | Scrub | 30 | 728 |
2 | Willow swamp | 30 | 220 |
3 | CP hammock | 30 | 232 |
4 | Slash pine | 30 | 228 |
5 | Oak/Broadleaf | 30 | 146 |
6 | Hardwood | 30 | 207 |
7 | Swamp | 30 | 96 |
8 | Graminoid | 30 | 393 |
9 | Spartina marsh | 30 | 469 |
10 | Cattail marsh | 30 | 365 |
11 | Salt marsh | 30 | 378 |
12 | Mud flats | 30 | 454 |
13 | Water | 30 | 836 |

### Iv-B Experimental Settings

In our experiments, the proposed algorithm is implemented in TensorFlow with the Adam optimizer. For all three datasets introduced in Section IV-A, 30 labeled pixels (i.e., examples) are randomly selected in each class for training, and only 15 labeled examples are chosen if the corresponding class has fewer than 30 examples. During training, 90% of the labeled examples are used to learn the network parameters and 10% are used as a validation set to tune the hyperparameters. Meanwhile, all the unlabeled examples are used as the test set to evaluate the classification performance. The network architecture of our proposed MDGCN is kept identical for all the datasets. Specifically, three neighborhood scales, namely $s=1$, $s=2$, and $s=3$, are respectively employed for graph construction to incorporate multi-scale spectral-spatial information into our model. For each scale, we employ two graph convolutional layers with 20 hidden units, as GCN-based methods usually do not require a deep structure to achieve satisfactory performance [8474300, Gao:2018:LLG:3219819.3219947]. Besides, the learning rate and the number of training epochs are set to 0.0005 and 5000, respectively.
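The sampling protocol described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `split_labeled_pixels`, its parameters, and the choice to draw the 10% validation set uniformly from the pooled selected examples (rather than per class) are all hypothetical.

```python
import numpy as np

def split_labeled_pixels(labels, per_class=30, fallback=15, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx, test_idx) as indices into `labels`.

    labels: 1-D integer array with one class ID per annotated pixel
            (background pixels are assumed to be removed beforehand).
    """
    rng = np.random.default_rng(seed)
    selected, test = [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        # A class with fewer than `per_class` pixels contributes only `fallback`.
        n = per_class if idx.size >= per_class else min(fallback, idx.size)
        selected.append(idx[:n])
        test.append(idx[n:])          # every remaining pixel is used for testing
    selected = rng.permutation(np.concatenate(selected))
    # Assumption: the validation set is a uniform 10% draw from the pooled selection.
    n_val = int(round(val_fraction * selected.size))
    return selected[n_val:], selected[:n_val], np.concatenate(test)
```

Changing `seed` between repetitions yields an independent random selection for each run.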

To evaluate the classification ability of our proposed method, other recent state-of-the-art hyperspectral image classification methods are also used for comparison. Specifically, we employ two CNN-based methods, i.e., Diverse Region-based deep CNN (DR-CNN) [Zhang2018Diverse] and Recurrent 2D-CNN (R-2D-CNN) [Yang2018Hyperspectral], together with two GCN-based methods, i.e., Graph Convolutional Network (GCN) [Kipf2016Semi] and Spectral-Spatial Graph Convolutional Network (SGCN) [8474300]. Meanwhile, we compare the proposed MDGCN with three traditional hyperspectral image classification methods, namely Matrix-based Discriminant Analysis (MDA) [Hang2016Matrix], Hierarchical guidance Filtering-based ensemble classification (HiFi) [7906599], and Joint collaborative representation and SVM with Decision Fusion (JSDF) [7360896]. All these methods are run ten times on each dataset, and the mean accuracies and standard deviations over these ten independent runs are reported.
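For reference, the three summary measures reported in the result tables, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, can be computed from a confusion matrix as in the sketch below. The function names are hypothetical, the formulas are the standard definitions, and the use of the population standard deviation when aggregating runs is an assumption, since the text does not state which estimator is used.

```python
import numpy as np

def classification_scores(y_true, y_pred, n_classes):
    """Return (OA, AA, Kappa) in percent for integer labels in 0..n_classes-1."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(cm, (y_true, y_pred), 1)            # rows: true class, columns: prediction
    total = cm.sum()
    oa = np.trace(cm) / total                     # fraction of correctly labeled pixels
    per_class = np.diag(cm) / cm.sum(axis=1)      # assumes every class occurs in y_true
    aa = per_class.mean()                         # unweighted mean of per-class accuracies
    # Chance agreement estimated from the row and column marginals.
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total**2
    kappa = (oa - pe) / (1.0 - pe)
    return 100 * oa, 100 * aa, 100 * kappa

def mean_std(runs):
    """Aggregate per-run scores (one row per run) into mean and standard deviation."""
    runs = np.asarray(runs, dtype=float)
    return runs.mean(axis=0), runs.std(axis=0)    # assumption: population std (ddof=0)
```

Calling `classification_scores` once per run and passing the stacked results to `mean_std` gives entries of the form mean ± standard deviation.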

ID | GCN [Kipf2016Semi] | SGCN [8474300] | R-2D-CNN [Yang2018Hyperspectral] | DR-CNN [Zhang2018Diverse] | MDA [Hang2016Matrix] | HiFi [7906599] | JSDF [7360896] | MDGCN |
---|---|---|---|---|---|---|---|---|
1 | 95.00±2.80 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 99.38±1.92 | 100.00±0.00 | 100.00±0.00 |
2 | 56.71±4.42 | 84.43±2.50 | 54.94±2.23 | 80.38±1.50 | 75.08±7.35 | 87.42±4.29 | 90.75±3.19 | 80.18±0.84 |
3 | 51.50±2.56 | 82.87±5.53 | 73.31±4.33 | 82.21±3.53 | 81.44±5.03 | 93.39±2.81 | 77.84±3.81 | 98.26±0.00 |
4 | 84.64±3.16 | 93.08±1.95 | 84.06±12.98 | 99.19±0.74 | 95.29±3.02 | 97.68±2.76 | 99.86±0.33 | 98.57±0.00 |
5 | 83.71±3.20 | 97.13±1.34 | 87.64±0.31 | 96.47±1.10 | 91.72±3.56 | 94.33±2.71 | 87.20±2.73 | 95.14±0.33 |
6 | 94.03±2.11 | 97.29±1.27 | 91.21±4.34 | 98.62±1.90 | 95.46±1.76 | 98.71±1.86 | 98.54±0.28 | 97.16±0.57 |
7 | 92.31±0.00 | 92.31±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 95.00±3.36 | 100.00±0.00 | 100.00±0.00 |
8 | 96.61±1.86 | 99.03±0.93 | 99.11±0.95 | 99.78±0.22 | 99.80±0.28 | 99.59±0.48 | 99.80±0.31 | 98.89±0.00 |
9 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
10 | 77.47±1.24 | 93.77±3.72 | 70.81±5.11 | 90.41±1.95 | 81.95±4.68 | 91.37±3.49 | 89.99±4.24 | 90.02±1.02 |
11 | 56.56±1.53 | 84.98±2.82 | 56.35±1.08 | 74.46±0.37 | 69.13±7.07 | 84.33±3.55 | 76.75±5.12 | 93.35±1.47 |
12 | 58.29±6.58 | 80.05±5.17 | 63.06±12.81 | 91.00±3.14 | 76.58±5.12 | 95.02±3.10 | 87.10±2.82 | 93.05±2.30 |
13 | 100.00±0.00 | 99.43±0.00 | 98.86±1.62 | 100.00±0.00 | 99.43±0.75 | 99.29±0.25 | 99.89±0.36 | 100.00±0.00 |
14 | 80.03±3.93 | 96.73±0.92 | 88.74±2.58 | 91.85±3.40 | 90.92±3.21 | 98.32±0.76 | 97.21±2.78 | 99.72±0.05 |
15 | 69.55±6.66 | 86.80±3.42 | 87.08±2.78 | 99.44±0.28 | 91.57±5.27 | 96.71±2.06 | 99.58±0.68 | 99.72±0.00 |
16 | 98.41±0.00 | 100.00±0.00 | 97.62±1.12 | 100.00±0.00 | 96.63±4.91 | 99.13±0.81 | 100.00±0.00 | 95.71±0.00 |
OA | 69.24±1.56 | 89.49±1.08 | 72.11±1.28 | 86.65±0.59 | 81.91±1.33 | 91.90±1.36 | 88.34±1.39 | 93.47±0.38 |
AA | 80.93±1.71 | 92.99±1.04 | 84.55±1.79 | 93.99±0.25 | 90.31±0.71 | 95.60±0.58 | 94.03±0.55 | 96.24±0.21 |
Kappa | 65.27±1.80 | 88.00±1.23 | 68.66±1.46 | 84.88±0.67 | 79.54±1.46 | 90.77±1.53 | 86.80±1.55 | 92.55±0.43 |

### IV-C Classification Results

To show the effectiveness of our proposed MDGCN, here we quantitatively and qualitatively evaluate the classification performance by comparing MDGCN with the aforementioned baseline methods.

ID | GCN [Kipf2016Semi] | SGCN [8474300] | R-2D-CNN [Yang2018Hyperspectral] | DR-CNN [Zhang2018Diverse] | MDA [Hang2016Matrix] | HiFi [7906599] | JSDF [7360896] | MDGCN |
---|---|---|---|---|---|---|---|---|
1 | - | - | 84.96±0.56 | 92.10±3.34 | 78.84±1.99 | 77.77±4.52 | 82.40±4.07 | 93.55±0.37 |
2 | - | - | 79.99±2.29 | 96.39±3.20 | 79.96±3.41 | 95.49±2.07 | 90.76±3.74 | 99.25±0.23 |
3 | - | - | 89.49±0.17 | 84.23±0.71 | 85.58±3.46 | 94.12±3.26 | 86.71±4.14 | 92.03±0.24 |
4 | - | - | 98.12±0.65 | 95.26±0.67 | 90.90±2.29 | 82.68±4.25 | 92.88±2.16 | 83.78±1.55 |
5 | - | - | 99.85±0.11 | 97.77±0.00 | 99.93±0.08 | 97.25±1.67 | 100.00±0.00 | 99.47±0.09 |
6 | - | - | 76.79±7.40 | 90.44±2.27 | 75.52±7.11 | 99.63±0.78 | 94.30±4.55 | 95.26±0.50 |
7 | - | - | 88.69±4.57 | 89.05±1.76 | 84.28±5.11 | 97.77±2.43 | 96.62±1.37 | 98.92±1.04 |
8 | - | - | 67.54±5.67 | 78.49±1.53 | 81.82±6.92 | 95.12±2.34 | 94.69±3.74 | 94.99±1.33 |
9 | - | - | 99.84±0.08 | 96.34±0.22 | 97.50±1.48 | 83.86±3.40 | 99.56±0.36 | 81.03±0.49 |
OA | - | - | 82.38±0.88 | 92.62±1.15 | 81.60±2.07 | 92.08±1.28 | 90.82±1.30 | 95.68±0.22 |
AA | - | - | 87.25±0.68 | 91.12±0.12 | 86.04±1.67 | 91.52±0.99 | 93.10±0.65 | 93.15±0.28 |
Kappa | - | - | 77.31±0.97 | 90.27±1.44 | 76.24±2.63 | 89.60±1.65 | 88.02±1.62 | 94.25±0.29 |

#### IV-C1 Results on the Indian Pines Dataset

The quantitative results obtained by different methods on the Indian Pines dataset are summarized in Table IV, where the highest value in each row is highlighted in bold. We observe that the CNN-based methods, including R-2D-CNN and DR-CNN, achieve relatively low accuracy because they can only perform convolution on a regular image grid, so the specific local spatial information cannot be captured. By contrast, GCN-based methods such as SGCN and MDGCN are capable of adaptively aggregating the features on irregular non-Euclidean regions, so they can yield better performance than R-2D-CNN and DR-CNN. The HiFi algorithm, which combines spectral and spatial information at diverse scales, ranks second. This implies that multi-scale spectral-spatial features are quite useful for enhancing the classification performance. Furthermore, we observe that the proposed MDGCN achieves the best performance among all the methods in terms of OA, AA, and Kappa coefficient, and its standard deviations are also very small, which indicates that the proposed MDGCN is more stable and effective than the compared methods.

Fig. 6 exhibits a visual comparison of the classification results generated by different methods on the Indian Pines dataset, and the ground-truth map is provided in Fig. 6. Compared with the ground-truth map, it can be seen that some pixels of ‘Soybean-mintill’ are misclassified into ‘Corn-notill’ in all the classification maps because these two land-cover types have similar spectral signatures. Meanwhile, due to the lack of spatial context, the classification map obtained by GCN suffers from pepper-noise-like mistakes within certain regions. Comparatively, the result of the proposed MDGCN method yields a smoother visual effect and shows fewer misclassifications than the other compared methods.

#### IV-C2 Results on the University of Pavia Dataset

Table V presents the quantitative results of different methods on the University of Pavia dataset. Note that GCN and SGCN are not used for comparison as they are not scalable to this large dataset. Similar to the results on the Indian Pines dataset, the results in Table V indicate that the proposed MDGCN ranks first and outperforms the compared methods by a substantial margin, which again validates the strength of our proposed multi-scale dynamic graph convolution. Besides, it is also notable that DR-CNN performs better than HiFi and JSDF, which is different from the results on the Indian Pines dataset. This is mainly because DR-CNN and our MDGCN exploit spectral-spatial information with diverse region-based inputs, which can adapt well to hyperspectral images containing many boundary regions. Since the objects belonging to the same class in the University of Pavia dataset are often distributed in widely scattered regions, DR-CNN and MDGCN are able to achieve better performance than HiFi, JSDF, and other baseline methods. Furthermore, from the classification results in Fig. 7, stronger spatial correlation and fewer misclassifications can be observed in the classification map of the proposed MDGCN when compared with DR-CNN and other competitors.

ID | GCN [Kipf2016Semi] | SGCN [8474300] | R-2D-CNN [Yang2018Hyperspectral] | DR-CNN [Zhang2018Diverse] | MDA [Hang2016Matrix] | HiFi [7906599] | JSDF [7360896] | MDGCN |
---|---|---|---|---|---|---|---|---|
1 | 86.91±3.46 | 95.12±0.32 | 94.71±0.16 | 98.72±0.21 | 96.88±2.22 | 97.28±1.72 | 100.00±0.00 | 100.00±0.00 |
2 | 83.29±3.08 | 95.15±5.15 | 79.03±0.98 | 97.97±1.36 | 97.84±2.31 | 99.66±0.89 | 92.07±1.59 | 100.00±0.00 |
3 | 87.57±4.31 | 96.17±0.51 | 80.24±4.11 | 97.49±2.00 | 88.45±6.24 | 100.00±0.00 | 95.13±4.01 | 100.00±0.00 |
4 | 24.86±12.31 | 71.17±8.58 | 42.19±5.88 | 62.46±3.94 | 78.29±6.98 | 99.03±1.12 | 59.01±8.13 | 100.00±0.00 |
5 | 63.36±5.47 | 97.71±2.64 | 79.39±3.33 | 94.66±2.75 | 86.76±5.06 | 100.00±0.00 | 85.34±7.82 | 100.00±0.00 |
6 | 61.01±4.43 | 89.95±3.48 | 77.05±7.85 | 97.65±1.05 | 94.27±3.95 | 99.21±1.00 | 86.48±3.63 | 94.91±0.25 |
7 | 91.20±5.63 | 98.22±3.08 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 98.93±1.51 | 100.00±0.00 |
8 | 78.20±6.45 | 89.11±0.58 | 98.17±0.76 | 97.42±1.77 | 94.08±3.60 | 100.00±0.00 | 94.76±3.56 | 100.00±0.00 |
9 | 85.39±3.96 | 99.59±0.35 | 96.67±0.62 | 99.93±0.12 | 98.79±2.94 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
10 | 84.28±4.93 | 98.04±1.08 | 98.30±1.30 | 98.84±0.56 | 98.02±3.41 | 97.78±2.60 | 100.00±0.00 | 100.00±0.00 |
11 | 94.68±1.95 | 99.23±0.45 | 89.03±1.16 | 100.00±0.00 | 98.62±1.53 | 99.78±0.23 | 100.00±0.00 | 100.00±0.00 |
12 | 82.14±2.42 | 95.63±0.24 | 94.64±0.80 | 98.94±0.73 | 94.98±1.96 | 99.97±0.08 | 95.52±2.02 | 100.00±0.00 |
13 | 98.99±0.67 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
OA | 83.60±0.81 | 95.44±0.92 | 91.11±0.60 | 97.21±0.27 | 95.92±0.81 | 99.30±0.39 | 95.69±0.34 | 99.79±0.01 |
AA | 78.60±1.01 | 94.24±1.84 | 86.88±1.03 | 95.70±0.33 | 94.38±0.96 | 99.44±0.31 | 92.87±0.63 | 99.61±0.02 |
Kappa | 81.70±0.90 | 94.91±1.03 | 90.06±0.66 | 0.97±0.00 | 95.44±0.91 | 99.22±0.44 | 95.17±0.37 | 1.00±0.00 |

#### IV-C3 Results on the Kennedy Space Center Dataset

Table VI presents the experimental results of different methods on the Kennedy Space Center dataset. It is apparent that the performance of all methods is better than that on the Indian Pines and the University of Pavia datasets. This could be because the Kennedy Space Center dataset has higher spatial resolution and contains less noise than the Indian Pines and the University of Pavia datasets, and thus is more suitable for classification. As can be noticed, the HiFi algorithm achieves the highest OA among all the baseline methods. However, a slight gap can still be observed between HiFi and our MDGCN in terms of OA. For the proposed MDGCN, it is also worth noting that misclassifications only occur in the 6th class (‘Hardwood’), which further demonstrates the advantage of our proposed MDGCN. Fig. 8 visualizes the classification results of the eight different methods, where some critical regions of each classification map are enlarged for better performance comparison. We can see that our MDGCN is able to produce precise classification results on these small and difficult regions.

### IV-D Impact of the Number of Labeled Examples

In this experiment, we investigate the classification accuracies of the proposed MDGCN and the other competitors under different numbers of labeled examples. To this end, we vary the number of labeled examples per class from 5 to 30, and report the OA obtained by all the methods on the Indian Pines dataset. The results are shown in Fig. 9. We see that the performance of all methods improves as the number of labeled examples increases. It is noteworthy that the CNN-based methods, i.e., R-2D-CNN and DR-CNN, perform poorly when labeled examples are very limited, since these methods require a large number of labeled examples for training. Besides, we can observe that the proposed MDGCN consistently yields higher OA than the other methods as the number of labeled examples increases. Moreover, the performance of MDGCN is more stable than that of the compared methods as the number of labeled examples changes. All these observations indicate the effectiveness and stability of our MDGCN method.

### IV-E Ablation Study

ID | $s=1$ | $s=2$ | $s=3$ | MGCN | MDGCN |
---|---|---|---|---|---|
1 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
2 | 81.15±1.06 | 76.89±4.17 | 71.22±1.35 | 78.74±2.59 | 80.18±0.84 |
3 | 92.81±2.21 | 96.70±1.62 | 96.02±2.81 | 94.12±1.00 | 98.26±0.00 |
4 | 100.00±0.00 | 98.76±0.55 | 98.47±2.37 | 98.55±0.00 | 98.57±0.00 |
5 | 89.62±0.31 | 93.76±2.22 | 89.51±2.54 | 90.18±0.22 | 95.14±0.33 |
6 | 93.36±0.91 | 96.78±0.56 | 95.21±2.27 | 94.57±1.49 | 97.16±0.57 |
7 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
8 | 98.66±0.32 | 98.57±1.40 | 99.81±0.46 | 100.00±0.00 | 98.89±0.00 |
9 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
10 | 84.50±0.30 | 88.03±2.01 | 76.26±4.30 | 85.67±2.01 | 90.02±1.02 |
11 | 79.59±0.93 | 91.51±1.27 | 91.70±1.46 | 90.37±0.64 | 93.35±1.47 |
12 | 91.21±0.38 | 91.73±3.41 | 86.35±4.17 | 90.90±3.15 | 93.05±2.30 |
13 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 | 100.00±0.00 |
14 | 99.55±0.06 | 99.78±0.08 | 99.87±0.07 | 99.76±0.07 | 99.72±0.05 |
15 | 98.31±1.99 | 99.64±0.21 | 99.63±0.15 | 98.67±1.38 | 99.72±0.00 |
16 | 98.41±0.00 | 96.15±1.55 | 96.83±1.74 | 98.41±0.00 | 95.71±0.00 |
OA | 88.53±0.54 | 92.03±0.39 | 89.53±0.55 | 91.24±0.70 | 93.47±0.38 |
AA | 94.20±0.28 | 95.52±0.32 | 93.81±0.30 | 95.00±0.58 | 96.24±0.21 |
Kappa | 86.97±0.60 | 90.90±0.44 | 88.06±0.63 | 90.00±0.80 | 92.55±0.43 |

As mentioned in the introduction, our proposed MDGCN contains two critical parts for boosting the classification performance, i.e., the multi-scale operation and dynamic graph convolution. Here we use the Indian Pines dataset to demonstrate the usefulness of these two operations, where the number of labeled pixels per class is kept identical to the above experiments in Section IV-C. To show the importance of the multi-scale technique, we exhibit in Table VII the classification results obtained by using dynamic graphs with three different neighborhood scales, i.e., $s=1$, $s=2$, and $s=3$. It can be observed that a higher neighborhood scale does not necessarily result in better performance, since the spectral-spatial information cannot be sufficiently exploited with only a single-scale graph. Comparatively, we find that MDGCN consistently performs better than the settings of $s=1$, $s=2$, and $s=3$ in terms of OA, AA, and Kappa coefficient, which indicates the usefulness of incorporating the multi-scale spectral-spatial information into the graphs.
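To make the notion of a neighborhood scale concrete, the sketch below builds neighborhoods of increasing size on a region adjacency graph. It rests on an assumption that this section does not restate: the neighborhood at scale `s` is taken to be the set of nodes reachable within `s` hops. The function name and the toy graph are hypothetical and serve only as illustration.

```python
import numpy as np

def neighborhood_at_scale(adj, s):
    """Boolean matrix whose (i, j) entry is True if node j is within s hops of node i.

    adj: square adjacency matrix of the 1-hop region graph (nonzero = adjacent).
    Assumption: scale s means s-hop reachability; self-loops are excluded.
    """
    a = np.asarray(adj) > 0            # new boolean array, the input is not modified
    np.fill_diagonal(a, False)
    reach = a.copy()                   # nodes reached so far (scale 1)
    frontier = a.copy()                # nodes reached by walks of the current length
    for _ in range(s - 1):
        frontier = (frontier.astype(np.int64) @ a.astype(np.int64)) > 0
        reach |= frontier
    np.fill_diagonal(reach, False)     # a node is not its own neighbor
    return reach

# Toy example: a chain of four regions 0-1-2-3.
chain = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]])
multi_scale = [neighborhood_at_scale(chain, s) for s in (1, 2, 3)]
```

In a single-scale ablation only one of these matrices would define the graph, whereas a multi-scale model would use all of them together.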

To show the effectiveness of dynamic graphs, Table VII also lists the results acquired by only using multi-scale graph convolution network (MGCN), where the graphs are fixed throughout the classification procedure. Compared with the results of MDGCN, there is a noticeable performance drop in the OA, AA, and Kappa coefficient of MGCN, which indicates that utilizing fixed graph convolution is not ideal for accurate classification. Therefore, the dynamically updated graph in our method is useful for rendering good classification results.

### IV-F Classification Performance in the Boundary Region

One of the defects of traditional CNN-based methods is that the weights of each convolution kernel are identical when convolving all image patches, which may produce misclassifications in boundary regions. Different from the coarse convolution of traditional CNN-based methods with fixed size and weights, the graph convolution of the proposed MDGCN can be flexibly applied to irregular image patches and thus will not significantly ‘erase’ the boundaries of objects during the convolution process. Therefore, the boundary information will be preserved and our MDGCN will perform better than the CNN-based methods in boundary regions. To reveal this advantage, in Fig. 10, we show the classification maps of a boundary region in the Indian Pines dataset obtained by different methods. The investigated boundary region is indicated by a black box in Fig. 10. Note that the results near the class boundaries are quite confused and inaccurate in the classification maps of GCN, SGCN, R-2D-CNN, DR-CNN, MDA, HiFi, and JSDF, since the spatial information is too limited to distinguish the pixels around class boundaries. In contrast, the classification map of the proposed MDGCN (see Fig. 10) is more compact and accurate than those of the other methods.

### IV-G Running Time

Table VIII reports the running time of the deep models, including GCN, SGCN, R-2D-CNN, DR-CNN, and the proposed MDGCN, on the three datasets adopted above. The codes for all methods are written in Python, and the running time is measured on a desktop computer with a 2.20-GHz Intel Core i5 CPU, 12 GB of RAM, and an RTX 2080 GPU. As can be observed, the CNN-based methods always consume massive computation time since they require deep layers and multiple types of convolution kernels, which results in a large number of parameters and thus significantly increases the computational complexity. In contrast, GCN and SGCN employ a fixed graph for convolution, and thus the number of parameters is greatly reduced. As a result, they need less computational time than the CNN-based methods. Due to the use of the superpixel technique, the size of the graphs used in MDGCN is far smaller than that in GCN and SGCN. Consequently, the time consumption of the proposed MDGCN is the lowest among the five methods, even though MDGCN employs multiple graphs at different neighborhood scales.

Dataset | GCN | SGCN | R-2D-CNN | DR-CNN | MDGCN |
---|---|---|---|---|---|
IP | 656 | 1176 | 2178 | 5258 | 74 |
PaviaU | - | - | 2313 | 8790 | 258 |
KSC | 147 | 287 | 1527 | 4528 | 20 |

## V Conclusion

In this paper, we propose a novel Multi-scale Dynamic Graph Convolutional Network (MDGCN) for hyperspectral image classification. Different from prior works that depend on a fixed input graph for convolution, the proposed MDGCN employs dynamic graphs which are gradually refined during the convolution process. Therefore, the graphs can faithfully encode the intrinsic similarity among image regions and help to find accurate region representations. Meanwhile, multiple graphs with different neighborhood scales are constructed to fully exploit the multi-scale information, which comprehensively discovers the hidden spatial context carried by different scales. The experimental results on three widely used hyperspectral image datasets demonstrate that the proposed MDGCN is able to yield better performance than the state-of-the-art methods.