A CNN With Multi-scale Convolution for Hyperspectral Image Classification using Target-Pixel-Orientation scheme

01/30/2020 ∙ by Jayasree Saha, et al.

Recently, CNNs have become a popular choice for handling hyperspectral image classification challenges. Although Hyper-Spectral Images (HSI) carry large amounts of spectral information, that very richness creates a curse of dimensionality, and the large spatial variability of spectral signatures adds further difficulty to the classification problem. Additionally, training a CNN end to end with scarce training examples is another challenging and interesting problem. In this paper, a novel target-patch-orientation method is proposed to train a CNN-based network. We also introduce a hybrid 3-D-CNN and 2-D-CNN network architecture to implement band reduction and feature extraction, respectively. Experimental results show that our method outperforms the accuracies reported by existing state-of-the-art methods.


I Introduction

Hyperspectral image (HSI) classification has received considerable attention in recent years for a variety of applications using neural network-based techniques. Hyperspectral imagery has several hundred contiguous narrow spectral bands spanning the visible to infrared portions of the electromagnetic spectrum. Such high spectral resolution is expected to provide finer classification, since each pixel instance has a distinct spectral signature. However, such a large number of spectral dimensions creates the curse of dimensionality. Along with this, the following issues bring challenges to the classification of HSIs: 1) limited training examples, and 2) large spatial variability of spectral signatures. In general, contiguous spectral bands may contain redundant information, which leads to the Hughes phenomenon [1]: classification accuracy drops when there is an imbalance between the high number of spectral channels and scarce training examples. Conventionally, dimension-reduction techniques are used to extract better spectral features. For instance, Independent Component Discriminant Analysis (ICDA) [2] tries to find statistically independent components using ICA, under the assumption that at most one component has a Gaussian distribution. ICA uses higher-order statistics to compute uncorrelated components, whereas PCA [3] uses the covariance matrix. Non-linear techniques such as quadratic discriminant analysis [4] and kernel-based methods [5] are also employed to handle non-linearity in HSIs. However, features extracted in the reduced dimensional space may not be optimal for classification. The HSI classification task is further complicated by the following facts: i) spectral signatures of objects belonging to the same class may differ, and ii) spectral signatures of objects belonging to different classes may be the same. Therefore, spectral components alone may not provide sufficient features for classification. Recent studies show that incorporating spatial context along with spectral components improves classification considerably. There are two ways of exploiting spatial and spectral information for classification. The first approach processes spatial and spectral information separately and combines them at the decision level [35, 19]. The second strategy uses joint spectral-spatial features [34, 36, 33, 37]. In this paper, we adopt the second strategy to classify hyperspectral images with higher accuracy than the state-of-the-art techniques.

In the literature, 1-D [33], 2-D [34], and 3-D [20] CNN-based architectures are well known for HSI classification, and hybrids of different CNN-type architectures have also been employed [21]. 1-D CNNs use a pixel-pair strategy [33] that combines pairs of training samples; this set reveals the neighborhood of the observed pixel, yet it cannot exploit the full power of spatial information in hyperspectral image classification because it ignores the neighborhood profile entirely. In general, 2-D and 3-D CNN-based approaches are more suitable in such a scenario. Many other architectures, e.g., Deep Belief Networks [22, 23, 24, 25, 26] and autoencoders [27, 28, 29, 30, 31, 32], also provide efficient solutions to the hyperspectral image classification problem. In the present context, we are more interested in scrutinizing various CNN architectures for the current problem. A few core components are available for building any CNN architecture, for example convolution, pooling, batch-normalization [44], and activation layers. In practice, there are various ways of using the convolution mechanism; a few popular ones are point convolution, group convolution, and depth-wise separable convolution [43]. Similarly, there are variations of the pooling mechanism, such as adaptive pooling [45]. Recently, many mid-level components have been developed, e.g., the inception module, which comprises multi-scale convolutions. Mid-level components are sequentially combined to make large networks such as VGG [39] and GoogleNet [40]. Additionally, the skip architecture [41] has proved to be a successful way of building very deep networks that cope with the vanishing-gradient problem. Hyperspectral image classification is still an interesting and challenging problem, and the effectiveness of the various core components of CNNs, and of their arrangement, in resolving the classification problem needs to be studied.

In this paper, we present a CNN architecture that performs three major tasks in a processing pipeline: 1) band reduction, 2) feature extraction, and 3) classification. The first processing block uses point-wise 3-D convolution layers. For feature extraction, we use a multi-scale convolution module. We propose two architectures for feature extraction, which lead to two different CNN architectures. The first uses an inception module with four parallel convolution structures; the second uses similar multi-scale convolutions in an inception-like structure but with a different arrangement, and it extracts finer contextual information than the first. We feed the extracted features to a fully connected layer to form the high-level representation. We train our networks end to end by minimizing the cross-entropy loss using the Adam optimizer. Our proposed architecture gives state-of-the-art performance without any data augmentation on three benchmark HSI classification datasets. Besides this new architecture, we propose a way to incorporate spatial information along with the spectral information. It not only covers the neighborhood profile for a given window, but also observes how the neighborhood changes as the window shifts. This proves more beneficial near boundary locations than a fixed window. The contributions of this paper can be summarized as follows:

  1. A novel technique to incorporate spatial information with the spectral information has been proposed. The design aims at improving classification accuracy at the boundary locations of each class.

  2. A novel end-to-end shallow and wide neural network has been proposed, which is a hybridization of a 3-D CNN with a 2-D CNN. This hybrid structure provides a solution for appropriately using spectral information and extracting finer features. We also show two different arrangements of similar multi-scale convolutional layers to extract distinctive features.

Section II gives a detailed description of the proposed classification framework, including the technique for including spatial information. Performance and comparisons between the proposed networks and current state-of-the-art methods are presented in Section III. The paper is concluded in Section IV.

II Proposed Classification Framework

The proposed classification framework shown in Fig 1 mainly consists of three tasks: i) organizing a target-pixel-orientation model using available training samples, ii) constructing a CNN architecture to extract uncorrelated spectral information, and iii) learning spatial correlation with neighboring pixels.

Fig. 1: Flowchart of the proposed classification framework
Fig. 2: Example of Target-Patch-Orientation Model

II-A Target-Pixel-Orientation Model for Training Samples

Consider a hyperspectral data set with B spectral bands. We have N labeled samples, denoted {x_1, ..., x_N}, in a B-dimensional feature space, with class labels y_i ∈ {1, ..., C}, where C is the number of classes. Let n_j be the number of available labeled samples in the j-th class, so that N = Σ_{j=1}^{C} n_j. We propose a Target-Pixel-Orientation (TPO) scheme. In this scheme, we consider a w × w window whose center pixel is the target pixel. We select eight neighbors of the target pixel by simply shifting the window in eight different directions in a clockwise manner. Fig 2 shows one example of how we prepare the eight neighbors of a target pixel with a w × w window: the target pixel is shown as a blue box, surrounded by a w × w window with a red border. The first sub-image in Fig 2 depicts the window when the target pixel occupies its center position; the other eight sub-images are the neighbors of the first sub-image, numbered 1 to 8. We consider each of the nine windows as one view of the target pixel. We have described TPO with one spectral channel to keep the illustration simple; in our proposed system, we consider all B spectral channels, so the input to the model is a 4-dimensional tensor. We perform the following operation to form the input for our models:

X_i = f_stack(P_i^(1), P_i^(2), ..., P_i^(9))        (1)

where f_stack is a function responsible for stacking channels and P_i^(v) is the w × w patch of the B spectral channels in the v-th view, with v = 1, ..., 9 indexing the nine views of the TPO scheme. We convert the N labeled samples to {X_1, ..., X_N} such that each X_i has dimension 9 × B × w × w.
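To make the view construction concrete, the following is a minimal sketch of TPO view extraction in Python/NumPy. The function name, the (H, W, B) array layout, and the exact clockwise ordering of the eight shifts are illustrative assumptions; the paper fixes only that the centered window comes first, followed by the eight one-pixel shifts.

```python
import numpy as np

def tpo_views(image, row, col, w=3):
    """Extract the nine TPO views of the target pixel at (row, col).

    Sketch only: `image` is assumed to be an (H, W, B) hyperspectral cube,
    already zero-padded so that all nine shifted windows fit inside it.
    """
    half = w // 2
    # Offsets of the window center relative to the target pixel:
    # the centered view first, then eight one-pixel shifts taken clockwise
    # (the starting direction is an assumption).
    offsets = [(0, 0), (-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]
    views = []
    for dr, dc in offsets:
        r, c = row + dr, col + dc
        views.append(image[r - half:r + half + 1, c - half:c + half + 1, :])
    # Stack into a (9, w, w, B) tensor: nine views x spatial x spectral.
    return np.stack(views)
```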

II-A1 Advantage of TPO for class boundaries

We observe that the patch of a pixel at the boundary region of a class appears very different from the patches of pixels in the non-boundary area. In general, non-boundary pixels are surrounded by pixels belonging to the same class. In such a scenario, TPO provides more than one view for a target pixel at the boundary region. We illustrate this with a two-class situation in Fig 3. The patch of a target pixel near the boundary contains only pixels of one class (blue), whereas the patch of a target pixel at the border includes pixels of two classes (blue and red). If we consider only one patch around the target pixel, we may fail to classify border pixels. In this scenario, TPO brings different views of patches for a single target pixel at the boundary. We show the TPO of a target pixel at the border and near the border in Fig 4 and Fig 5, respectively. In the given situation, for the border pixel there is at least one view in which every pixel belongs to the blue class, while other views are similar to the views of the pixel near the boundary.

Fig. 3: The position of a target pixel in a near-boundary and boundary position for a two-class scenario.
Fig. 4: Example of Target-Patch-Orientation of a target pixel lying at the boundary of a class.
Fig. 5: Example of Target-Patch-Orientation of a target pixel lying near the boundary of a class.

II-B Network Architecture

The framework of the HSI classification is shown in Fig 1. It consists mainly of three blocks, namely band reduction, feature extraction, and classification. TPO extracts samples from the given dataset as described in Section II-A. The label of each sample is that of the pixel located at the center of the first view among the nine views (discussed in Sec II-A).

II-B1 Band Reduction

Fig. 6: Diagram of the Band Reduction layer in the proposed network.

This block contains three consecutive “BasicConv3d” layers. Each “BasicConv3d” layer contains a 3-D batch-normalization layer and a rectified linear unit (ReLU) layer sequentially after a 3-D point-wise convolution layer. The parameters of the 3-D convolution layer are the input channel, the output channel, and the kernel size. We adjust the kernel sizes experimentally for the different datasets; hence we use the notation p, q, and r for the kernel sizes in Fig 6. Assume the University of Pavia setting, with spectral dimensionality 103 and spatial size w × w. The first 3-D convolutional layer (C1) filters the prepared data with nine kernels, producing a feature map with 96 spectral channels. As we use point-wise 3-D convolution, the spatial size of the sample does not change, but the number of spectral channels changes based on the value of p, which is 8 in this example. The number of spectral channels in the convolved sample can be computed using the following equation:

B_out = (B_in − k + 2P) / S + 1        (2)

where B_out is the resulting number of spectral channels (96 in this case), and k, P, and S are the kernel size, padding, and stride. For the above example, k = 8, P = 0, and S = 1 hold, giving (103 − 8 + 0)/1 + 1 = 96 channels in the convolved sample. The second layer (C2) combines the features obtained in the C1 layer with nine kernels (q = 16), resulting in a feature map with 81 spectral channels. The third layer (C3) combines the features obtained in the C2 layer with nine kernels (r = 32), resulting in a feature map with 50 spectral channels. The number of bands is thus reduced from 103 to 50 at this point. We then reshape the data into 3 dimensions by stacking the nine views for each of the 50 spectral channels, leading to a 450 × w × w sample, and feed the reshaped output of the band-reduction block to the feature extraction layer.
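A minimal PyTorch sketch of this band-reduction block, using the University of Pavia kernel sizes from Table VIII (p = 8, q = 16, r = 32), is given below. Mapping the nine views to the channel axis of Conv3d, the nine-kernel width of each layer, and the 5 × 5 patch size are assumptions made for illustration, not details fixed by the paper.

```python
import torch
import torch.nn as nn

class BasicConv3d(nn.Module):
    """Point-wise 3-D convolution -> 3-D batch norm -> ReLU, as described above."""
    def __init__(self, in_ch, out_ch, kernel):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel, bias=False)
        self.bn = nn.BatchNorm3d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

# Band-reduction sketch: each layer convolves only along the spectral axis.
band_reduction = nn.Sequential(
    BasicConv3d(9, 9, (8, 1, 1)),    # C1: (103 - 8)/1 + 1 = 96 channels
    BasicConv3d(9, 9, (16, 1, 1)),   # C2: 96 -> 81
    BasicConv3d(9, 9, (32, 1, 1)),   # C3: 81 -> 50
)

x = torch.randn(4, 9, 103, 5, 5)     # (batch, views, bands, h, w); 5x5 patch assumed
out = band_reduction(x)              # -> (4, 9, 50, 5, 5)
flat = out.reshape(out.size(0), 9 * 50, 5, 5)  # stack views x bands: 450 x w x w
```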

Fig. 7: Diagram of the Feature Extraction layer (TPO-CNN1) in the proposed network.
Fig. 8: Diagram of the Feature Extraction layer (TPO-CNN2) in the proposed network.

II-B2 Feature Extraction

We take a tiny patch as an input sample. Our assumption is that a shallow but wider network, i.e., a “multi-scale filter bank”, extracts more appropriate features from small patches. Hence, we use an inception-like module for feature extraction, in two different ways, forming two separate networks. Fig 7 and Fig 8 depict the feature extraction modules of TPO-CNN1 and TPO-CNN2. Each “BasicConv2d” layer in the figures contains a 2-D batch-normalization layer and rectified linear units (ReLU) sequentially after a 2-D convolution layer. The parameters of the 2-D convolution layer are the input channel, the output channel, and the kernel size; each rectangular “BasicConv2d” block in the diagram lists the kernel size of its convolution layer and its number of input channels. Similarly, each “AvgPool2d” block gives the kernel size and the stride of its average pooling layer. TPO-CNN1 uses a multi-scale filter bank that locally convolves the input sample with four parallel blocks built from convolution layers of different filter sizes (1 × 1, 3 × 3, and 5 × 5) together with average pooling; each parallel block consists of one or more “BasicConv2d” layers and a pooling layer. The 3 × 3 and 5 × 5 filters are used to exploit local spatial correlations of the input sample, while the 1 × 1 filters are used to address correlations among the nine views and their respective spectral information. The outputs of the TPO-CNN1 feature extraction layer are combined at a concatenation layer to form a joint view-spatio-spectral feature map used as input to the subsequent layers. However, since the sizes of the feature maps from the parallel convolutional branches differ from each other, we pad the input feature with zeros to match the size of the output feature maps of each parallel block: 0, 1, and 2 zeros for the 1 × 1, 3 × 3, and 5 × 5 filters, respectively. In TPO-CNN1, we use one adaptive average pooling [45] layer sequentially after the concatenation layer. In TPO-CNN2, we split the inception architecture of TPO-CNN1 into three small inception layers, each having two parallel convolutional branches; each concatenation layer is followed by an adaptive average pooling layer, and finally we concatenate all the pooled information.
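The following PyTorch sketch illustrates a multi-scale filter bank of the kind described above. The branch compositions, the channel widths, and the use of convolution padding in place of explicit input padding are assumptions for illustration; the exact TPO-CNN1 and TPO-CNN2 layouts are in Fig 7 and Fig 8.

```python
import torch
import torch.nn as nn

class BasicConv2d(nn.Module):
    """2-D convolution -> 2-D batch norm -> ReLU."""
    def __init__(self, in_ch, out_ch, kernel, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel, padding=padding, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        return torch.relu(self.bn(self.conv(x)))

class MultiScaleBank(nn.Module):
    """Four parallel branches over the 450-channel reshaped input.
    Branch structure and the 64-channel width are illustrative assumptions."""
    def __init__(self, in_ch=450, out_ch=64):
        super().__init__()
        self.b1 = BasicConv2d(in_ch, out_ch, 1)             # 1x1: view/spectral mixing
        self.b3 = BasicConv2d(in_ch, out_ch, 3, padding=1)  # 3x3: local spatial context
        self.b5 = BasicConv2d(in_ch, out_ch, 5, padding=2)  # 5x5: wider spatial context
        self.bp = nn.Sequential(nn.AvgPool2d(3, stride=1, padding=1),
                                BasicConv2d(in_ch, out_ch, 1))  # pooling branch
        self.pool = nn.AdaptiveAvgPool2d(1)                 # adaptive pooling [45]

    def forward(self, x):
        y = torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
        return self.pool(y).flatten(1)  # joint view-spatio-spectral feature vector

features = MultiScaleBank()(torch.randn(4, 450, 5, 5))  # -> (4, 256)
```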

II-B3 Classification

Outputs of the feature extraction block are flattened and fed to a fully connected layer whose output dimension is the number of classes. The fully connected layer is followed by a 1-D batch-normalization layer and a softmax activation function. In general, the classification layer can be defined as

ŷ = softmax(BN(W x + b))        (3)

where x is the input of the fully connected layer, W and b are the weights and bias of the fully connected layer, respectively, and BN(·) is the 1-D batch-normalization layer. ŷ is the C-dimensional vector whose j-th component represents the probability that a sample belongs to the j-th class.
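A sketch of this classification head, under the feature size assumed in the previous sketch; applying the softmax only at inference (because the training loss folds it in) is a standard PyTorch convention rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn

num_classes = 9  # e.g. University of Pavia
classifier = nn.Sequential(
    nn.Linear(256, num_classes),   # 256-d feature follows the sketch above
    nn.BatchNorm1d(num_classes),   # 1-D batch normalization, as in Eq. (3)
)

logits = classifier(torch.randn(4, 256))
probs = torch.softmax(logits, dim=1)  # explicit softmax at inference time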

II-C Learning the Proposed Network

We train the proposed networks by minimizing the cross-entropy loss function. Let y_i denote the ground truth for the i-th training sample in a batch B, and let p_ij denote the conditional probability under the model, i.e., the probability with which the model predicts that the i-th training sample belongs to the j-th class. The cross-entropy loss function is given by

L = − (1/|B|) Σ_{i ∈ B} Σ_{j=1}^{C} y_ij log p_ij        (4)

In our dataset, the ground truth is represented as a one-hot encoded vector, i.e., each y_i is a C-dimensional vector, where C is the number of classes; if the class label of the i-th sample is j, then y_ij = 1 and all other components are 0. To train the model, the Adam optimizer is used with a batch size of 512 samples and a weight decay of 0.0001. We initially set the base learning rate to 0.0001. All the layers are initialized from a uniform distribution.
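A minimal training-loop sketch with the stated hyper-parameters (Adam, batch size 512, weight decay 0.0001, base learning rate 0.0001). The stand-in model, dummy tensors, and epoch count are placeholders, not the paper's configuration; in practice `model` would be the full band-reduction, feature-extraction, and classification network.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Stand-ins so the sketch runs end to end.
model = nn.Sequential(nn.Flatten(), nn.Linear(9 * 103 * 5 * 5, 9))
train_x = torch.randn(2048, 9, 103, 5, 5)     # TPO samples
train_y = torch.randint(0, 9, (2048,))        # integer class labels

criterion = nn.CrossEntropyLoss()             # Eq. (4); softmax folded in
optimizer = optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)
loader = DataLoader(TensorDataset(train_x, train_y), batch_size=512, shuffle=True)

for epoch in range(100):                      # epoch count is an assumption
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```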

III Experimental Results

Fig. 9: University of Pavia dataset. A three-band color composite image is given on the left, the ground truth in the middle, and the color code used in the ground truth on the right.
Fig. 10: Indian Pines dataset. A three-band color composite image is given on the left, the ground truth in the middle, and the color code used in the ground truth on the right.
Fig. 11: Salinas dataset. A three-band color composite image is given on the left, the ground truth in the middle, and the color code used in the ground truth on the right.
                     U. Pavia            Indian Pines        Salinas
Sensor               ROSIS               AVIRIS              AVIRIS
Place                Pavia, Northern     Northwestern        Salinas Valley,
                     Italy               Indiana             California
Wavelength range     0.43-0.86 μm        0.4-2.5 μm          0.4-2.5 μm
Spatial resolution   1.3 m               20 m                3.7 m
No. of bands         103                 220                 224
No. of classes       9                   16                  16
Image size           -                   -                   -
TABLE I: A brief description of the HSI datasets

III-A Datasets

The performance of HSI classification is evaluated by experimenting with three popular datasets: the Pavia University scene (U.P) (Fig 9), the Indian Pines (I.P) (Fig 10), and the Salinas (S) dataset (Fig 11). Table I contains a brief description of the datasets. We discard the water absorption bands in Indian Pines, and we also reject the classes in the Indian Pines dataset that have fewer than 400 samples. We select 200 labeled pixels from each class to prepare a training set for each of the three HSI datasets; the rest of the labeled samples constitute the test set. As different spectral channels have different ranges, we normalize them using the function defined in Eq. 5, where x denotes the pixel value of a given spectral channel and μ and σ are the mean and standard deviation of the dataset:

x̂ = (x − μ) / σ        (5)
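A per-channel NumPy sketch of Eq. (5); computing the statistics per spectral channel over the whole image, and the U. Pavia-sized stand-in cube, are assumptions consistent with the description above.

```python
import numpy as np

cube = np.random.rand(610, 340, 103).astype(np.float32)  # (H, W, B) stand-in
mu = cube.reshape(-1, cube.shape[-1]).mean(axis=0)       # per-channel mean
sigma = cube.reshape(-1, cube.shape[-1]).std(axis=0)     # per-channel std
cube_norm = (cube - mu) / sigma                          # Eq. (5)
```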

III-B Quantitative Metrics

We evaluate the performance of the proposed architecture quantitatively in terms of the following metrics.

III-B1 Overall Accuracy (OA)

Overall Accuracy is computed over the test samples of a given HSI dataset as the fraction of samples classified correctly:

OA = (1/N) Σ_{i=1}^{N} 1[ŷ_i = y_i]        (6)

where N is the number of test samples, ŷ_i is the predicted label of the i-th sample, and 1[·] is the indicator function.

III-B2 Average Accuracy (AA)

Average Accuracy is the mean of the per-class accuracies over the C classes of a given HSI dataset:

AA = (1/C) Σ_{j=1}^{C} a_j        (7)

where a_j is the classification accuracy on the test samples of the j-th class.

III-B3 κ-score

The κ-score [18] is a statistical measure of the agreement between two classifiers, each of which classifies N samples into C mutually exclusive classes. The κ-score is given by the following equation:

κ = (p_o − p_e) / (1 − p_e)        (8)

where p_o is the relative observed agreement between the classifiers and p_e is the hypothetical probability of chance agreement. κ = 1 indicates complete agreement between the two classifiers, while κ = 0 indicates no agreement beyond chance.
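The three metrics can be computed as in the following NumPy sketch of Eqs. (6)-(8); this is an illustrative implementation, not code from the paper.

```python
import numpy as np

def overall_accuracy(y_true, y_pred):
    """OA: fraction of test samples classified correctly (Eq. 6)."""
    return np.mean(y_true == y_pred)

def average_accuracy(y_true, y_pred, num_classes):
    """AA: mean of the per-class accuracies (Eq. 7)."""
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in range(num_classes)])

def kappa_score(y_true, y_pred, num_classes):
    """Cohen's kappa (Eq. 8): (p_o - p_e) / (1 - p_e)."""
    n = len(y_true)
    confusion = np.zeros((num_classes, num_classes))
    for t, p in zip(y_true, y_pred):
        confusion[t, p] += 1
    p_o = np.trace(confusion) / n                                   # observed agreement
    p_e = np.sum(confusion.sum(axis=0) * confusion.sum(axis=1)) / n**2  # chance agreement
    return (p_o - p_e) / (1 - p_e)
```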

III-C Implementation Platform

The network is implemented in PyTorch (https://pytorch.org/), a popular deep learning library written in Python. We trained our models on a machine with a GeForce GTX 1080 Ti GPU.

III-D Comparison with Other Methods

The key features of our proposed methods are 1) the use of spatial features along with spectral ones, 2) band reduction using several consecutive 3-D CNNs, and 3) feature extraction with a multi-scale convolutional network. We compare against six state-of-the-art methods, namely: 1) CNN-PPF [33], 2) DR-CNN [34], 3) 2S-Fusion [35], 4) BASS [37], 5) DPP-ML [36], and 6) S-CNN+SVM [38]. Every method compared exploits spatial features along with spectral ones. CNN-PPF uses a pixel-pair strategy to increase the number of training samples and feeds them into a deep network with 1-D convolutional layers. DR-CNN exploits diverse region-based 2-D patches from the image to produce more discriminative features. In contrast, 2S-Fusion processes spatial and spectral information separately and fuses them using adaptive class-specific weights. BASSNET extracts band-specific spatial-spectral features. In DPP-ML, convolutional neural networks with multi-scale convolution are used to extract deep multi-scale features from the HSI. SVM-based methods are common in traditional hyperspectral image classification; in S-CNN+SVM, a Siamese convolutional neural network extracts spectral-spatial features of the HSI and feeds them to an SVM classifier. In general, the performance of deep learning-based algorithms surpasses traditional techniques (e.g., k-NN, SVM, ELM). We have compared the performance of the proposed techniques with the best results reported for each of these state-of-the-art techniques. For S-CNN+SVM and 2S-Fusion, performance on the Salinas dataset is not reported. To maintain consistency in the results, we ran our algorithm with the classes and the number of samples per class used in 2S-Fusion, DR-CNN, and DPP-ML for Indian Pines.

Class  Training samples  CNN-PPF  BASS  S-CNN+SVM  2S-Fusion  DR-CNN  DPP-ML  TPO-CNN1  TPO-CNN2
Asphalt 200 97.42 97.71 100 97.47 98.43 99.38 99.78 100
Meadows 200 95.76 97.93 98.12 99.92 99.45 99.59 99.88 99.99
Gravel 200 94.05 94.95 99.12 83.80 99.14 97.33 99.21 100
Trees 200 97.52 97.80 99.40 98.98 99.50 99.31 99.41 99.93
Painted metal sheets 200 100 100 99.18 100 100 100 100 100
Bare Soil 200 99.13 96.60 99.10 97.75 100 99.99 99.75 100
Bitumen 200 96.19 98.14 98.50 77.44 99.70 99.85 100 100
Self-Blocking Bricks 200 93.62 95.46 99.91 96.65 99.55 99.02 99.77 100
Shadows 200 99.60 100 100 99.65 100 100 100 100
OA 96.48 99.68 97.50 99.56 99.46 99.72 99.78 99.99
TABLE II: Class-specific accuracy (%) and OA of comparable techniques for the University of Pavia dataset
Class  Training samples  CNN-PPF  BASS  S-CNN+SVM  TPO-CNN1  TPO-CNN2
Asphalt 200 92.99 96.09 98.25 100 100
Meadows 200 96.66 98.25 99.64 99.92 99.75
Gravel 200 95.58 100 97.10 100 99.68
Trees 200 100 99.24 99.86 99.73 99.82
Sheets 200 100 100 100 100 100
Bare soil 200 96.24 94.82 98.87 100 100
Bitumen 200 87.80 94.41 98.57 99.74 100
Bricks 200 98.98 97.46 100 100 100
Shadows 200 99.81 99.90 100 100 99.72
OA 94.34 96.77 99.04 99.89 99.84
TABLE III: Class-specific accuracy (%) and OA of comparable techniques for the Indian Pines dataset
Class  DR-CNN  DPP-ML  TPO-CNN1  TPO-CNN2  |  2S-Fusion  TPO-CNN1  TPO-CNN2
Alfalfa - - - - 100 100 100
Corn-notill 98.20 99.03 100 100 95.35 100 100
Corn-mintill 99.79 99.74 99.51 99.67 98.75 100 100
Corn - - - - 100 100 100
Grass-pasture 100 100 100 100 100 100 100
Grass-trees - - - - 99.32 100 100
Grass-pasture-mowed - - - - 100 100 100
Hay-windrowed 100 100 98.84 98.85 100 100 100
Oats - - - - 100 100 100
Soybean-notill 99.78 99.61 100 100 100 100 100
Soybean-mintill 96.69 97.80 100 99.91 98.03 100 100
Soybean-clean 99.86 100 100 100 100 100 100
Wheat - - - - 97.87 100 100
Woods 99.99 100 100 100 99.62 100 100
Buildings-Grass-Trees-Drives - - - - 98.53 100 100
Stone-Steel-Towers - - - - 100 100 100
OA 98.54 99.08 99.54 99.55 98.65 100 100
TABLE IV: Class-specific accuracy (%) and OA of comparable techniques for the Indian Pines dataset (left: DR-CNN/DPP-ML protocol; right: 2S-Fusion protocol)
Class  CNN-PPF  BASS  DR-CNN  DPP-ML  TPO-CNN1  TPO-CNN2
Brocoli-green-weeds-1 100 100 100 100 100 100
Brocoli-green-weeds-2 99.88 99.97 100 100 100 100
Fallow 99.60 100 99.98 100 100 99.72
Fallow-rough-plow 99.49 99.66 99.89 99.25 100 100
Fallow-smooth 98.34 99.59 99.83 99.44 99.84 99.88
Stubble 99.97 100 100 100 100 100
Celery 100 99.91 99.96 99.87 100 100
Grapes-untrained 88.68 90.11 94.14 95.36 94.30 98.17
Soil-vinyard-develop 98.33 99.73 99.99 100 99.75 100
Corn-senesced-green-weeds 98.60 97.46 99.20 98.85 94.02 99.35
Lettuce-romaine-4wk 99.54 99.08 99.99 99.77 100 100
Lettuce-romaine-5wk 100 100 100 100 100 100
Lettuce-romaine-6wk 99.44 99.44 100 99.86 100 100
Lettuce-romaine-7wk 98.96 100 100 99.77 100 100
Vinyard-untrained 83.53 83.94 95.52 90.50 95.08 94.03
Vinyard-vertical-trellis 99.31 99.38 99.72 98.94 100 100
OA 94.80 95.36 98.33 97.51 97.98 98.72
TABLE V: Class-specific accuracy (%) and OA of comparable techniques for the Salinas dataset

III-E Results and Discussion

The performance of the proposed TPO-CNN1 and TPO-CNN2 on test samples is compared with the aforementioned deep learning-based classifiers in Tables II–V. A fixed spatial window is used for generating the outcomes of our algorithms. Our models surpass the other methods on every dataset. TPO-CNN2 produces better results than TPO-CNN1 on the University of Pavia and Salinas datasets, whereas their performances are comparable on Indian Pines. The results signify that the arrangement of multi-scale convolutions in TPO-CNN2 is able to extract more useful features for classification than that in TPO-CNN1.

We show thematic maps generated from the classification of the three HSI scenes using our networks in Fig 12. To check the consistency of our networks, we repeat each experiment 10 times with different training sets. Table VI shows the mean and standard deviation of OA, AA, and κ over these 10 experiments for each dataset.

Datasets          U.P             I.P             S
TPO-CNN           1       2       1       2       1       2
OA   Mean         99.76   99.90   99.67   99.65   98.10   98.65
     Std-dev      0.24    0.06    0.14    0.17    0.37    0.28
AA   Mean         99.80   99.94   99.82   99.78   97.87   98.49
     Std-dev      0.20    0.06    0.07    0.11    0.42    0.30
κ    Mean         99.44   99.86   99.61   99.58   99.29   99.44
     Std-dev      0.32    0.08    0.17    0.20    0.18    0.14
TABLE VI: Mean and standard deviation of OA, AA, and κ over 10 independent experiments
Fig. 12: Thematic maps resulting from classification by TPO-CNN1 and TPO-CNN2, respectively, for (a)-(b) the University of Pavia dataset, (c)-(d) the Salinas dataset, and (e)-(f) the Indian Pines dataset. Color codes match the corresponding ground truths.
Datasets     TPO-CNN1                TPO-CNN2
             U.P     I.P     S       U.P     I.P     S
             98.66   97.59   95.55   99.23   97.78   95.67
             99.67   99.27   97.20   99.68   99.49   97.78
             99.84   99.71   93.70   99.94   99.89   99.14
TABLE VII: Overall Accuracy (OA) for varying patch size (rows ordered from the smallest to the largest patch)

III-F Comparison of Different Hyper-parameter Settings

Two hyper-parameters have a direct effect on the accuracy of the classification task: 1) the spatial size of the input patch, and 2) the number of spectral channels obtained from the band-reduction block. Figures 13(a) and 13(b) depict test accuracies on the 3 HSI datasets for different choices of input patch size on the same randomly selected training samples. We observe that with increasing patch size, accuracies on Indian Pines and Salinas increase for both networks. However, for Salinas the accuracy drops at the largest patch size in TPO-CNN1. This behavior again supports the fact that the arrangement of multi-scale convolutions in TPO-CNN2 is superior to that of TPO-CNN1 with respect to feature extraction. Table VIII shows the experimentally adjusted (refer to Section II-B1) parameters used for the 3 HSI datasets. We vary the kernel-size parameters to obtain different numbers of bands and observe the impact on classification accuracy. We did not observe any monotonically increasing or decreasing behavior in overall classification accuracy when changing the number of bands. Figures 13(c) and 13(d) depict the change in overall accuracy (OA) for a varying number of spectral channels.

Fig. 13: Variation of test accuracy on the three HSI datasets (a)-(b) with varying patch size, (c)-(d) with varying number of spectral channels.
U.P I.P S
p 8 32 32
q 16 57 61
r 32 64 64
TABLE VIII: Parameters in Band Reduction Block

III-G Classification Performance with Decreasing Number of Training Samples

In this section, the influence of decreasing the number of training samples on the classification accuracy is studied on the University of Pavia, Indian Pines, and Salinas datasets. We present the experimental results with the setup described above and the same spatial window size. Here, a fixed number N of training samples per class is selected from the labeled pixels. To showcase the effect of a decreasing number of training samples on the classification accuracy, we have chosen several values of N, namely 150, 100, and 50. Our proposed architecture can still beat most of the comparable methods with 150 training samples per class. Table IX reassures that the feature-extraction architecture of TPO-CNN2 yields more useful features for U. Pavia and Indian Pines than TPO-CNN1, even with a small number of training samples. However, we observe a small deviation from this trend on the Salinas dataset with 100 training samples per class.

#samples       150                          100                          50
TPO-CNN        1             2              1             2              1             2
U.P   OA       99.58 (0.22)  99.93 (0.05)   99.16 (0.62)  99.75 (0.13)   98.25 (1.15)  98.98 (0.62)
      AA       99.76 (0.08)  99.92 (0.06)   98.88 (0.82)  99.67 (0.17)   97.68 (1.50)  98.65 (0.82)
      κ        99.44 (0.30)  99.91 (0.07)   99.45 (0.31)  99.82 (0.07)   98.54 (0.74)  99.17 (0.39)
I.P   OA       99.44 (0.12)  99.47 (0.18)   98.72 (0.46)  98.88 (0.35)   94.12 (1.32)  95.88 (1.02)
      AA       99.33 (0.15)  99.38 (0.22)   98.49 (0.55)  98.67 (0.42)   93.10 (1.54)  95.22 (1.25)
      κ        99.59 (0.20)  99.60 (0.24)   99.15 (0.30)  99.23 (0.19)   96.21 (0.77)  97.18 (0.50)
S     OA       97.39 (0.24)  97.51 (0.39)   96.23 (0.60)  96.08 (0.73)   95.81 (0.29)  96.28 (0.39)
      AA       99.06 (0.11)  99.21 (0.09)   98.57 (0.24)  98.57 (0.24)   98.26 (0.33)  98.48 (0.24)
      κ        97.08 (0.27)  97.22 (0.44)   95.80 (0.67)  95.63 (0.80)   95.33 (0.33)  95.86 (0.43)
TABLE IX: Mean (standard deviation in parentheses) of the performance measures on the three datasets with decreasing number of training samples

III-H Analysis of the TPO Strategy

To judge how the TPO strategy described in Section II-A affects the performance of the classifier, we compare the classification results with those obtained using a single view, in which the target pixel is at the center of the given window. Table X supports the fact that the TPO strategy has a direct effect on the classification accuracies. We observe that OA increases by 6.85%, 8.36%, and -1.13% for the TPO-CNN1 model and by 3.09%, 15.11%, and 2.75% for the TPO-CNN2 model on U. Pavia, Indian Pines, and Salinas, respectively. In brief, the TPO scheme improves results compared to the single view on U. Pavia and Indian Pines for both models. However, we observe different behavior on Salinas with TPO-CNN1. This suggests consistent behavior of TPO-CNN2 and a positive impact of the TPO strategy on that model for all three datasets.

Model      Views  Metric  U.P     I.P     S
TPO-CNN1   9      OA      98.71   96.91   91.13
                  AA      98.68   98.51   95.02
                  κ       98.26   96.38   90.06
           1      OA      91.86   88.55   92.26
                  AA      92.08   93.31   96.37
                  κ       89.03   86.39   91.34
TPO-CNN2   9      OA      99.01   97.34   94.88
                  AA      99.18   98.60   97.97
                  κ       98.66   96.81   94.27
           1      OA      95.92   82.83   92.13
                  AA      95.68   88.18   96.58
                  κ       94.56   79.48   91.20
TABLE X: Impact of the TPO strategy on classification for a fixed spatial window size

IV Conclusion

In this paper, a hybrid 3-D and 2-D CNN-based network architecture is proposed for HSI classification. We also propose a strategy, namely target-pixel-orientation (TPO), to incorporate the spatial and spectral information of HSI. In general, classification accuracy degrades due to misclassification at boundary regions; our approach addresses this limitation by using the orientation of the target-pixel view. Our architectural design exploits point-wise 3-D convolutions for band reduction, whereas we adopt a multi-scale 2-D inception-like architecture for feature extraction. We have tested a more granular arrangement of multi-scale convolutions in the inception-like architecture in TPO-CNN2 and find that it provides better results than TPO-CNN1. The experimental results on real hyperspectral images demonstrate the positive impact of including the TPO strategy. Moreover, the proposed work improves classification accuracy compared to the state-of-the-art methods even with a smaller number of training samples (for example, 150 samples per class). All the experimental results suggest that the arrangement of multi-scale convolutions in TPO-CNN2 provides more useful features than that in TPO-CNN1.

References

  • [1] G. Hughes, “On the mean accuracy of statistical pattern recognizers,” in IEEE Transactions on Information Theory, vol. 14, no. 1, pp. 55-63, 1968.
  • [2] A. Villa, J. A. Benediktsson, J. Chanussot and C. Jutten, “Hyperspectral Image Classification With Independent Component Discriminant Analysis,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 12, pp. 4865-4876, Dec. 2011.
  • [3] G. Licciardi, P. R. Marpu, J. Chanussot and J. A. Benediktsson, “Linear Versus Nonlinear PCA for the Classification of Hyperspectral Data Based on the Extended Morphological Profiles,” in IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 3, pp. 447-451, 2012.
  • [4] J. Li et al., “Multiple Feature Learning for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 3, pp. 1592-1606, 2015.
  • [5] G. Camps-Valls and L. Bruzzone, “Kernel-based methods for hyperspectral image classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 43, no. 6, pp. 1351-1362, 2005.
  • [6] W. Li, S. Prasad, J. E. Fowler and L. M. Bruce, “Locality-Preserving Dimensionality Reduction and Classification for Hyperspectral Image Analysis,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 50, no. 4, pp. 1185-1198, 2012.
  • [7] X. Wang, Y. Kong, Y. Gao and Y. Cheng, “Dimensionality Reduction for Hyperspectral Data Based on Pairwise Constraint Discriminative Analysis and Nonnegative Sparse Divergence,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 4, pp. 1552-1562, 2017.
  • [8] S. Chen and D. Zhang, “Semisupervised Dimensionality Reduction With Pairwise Constraints for Hyperspectral Image Classification,” in IEEE Geoscience and Remote Sensing Letters, vol. 8, no. 2, pp. 369-373, 2011.
  • [9] W. Zhao and S. Du, “Spectral–Spatial Feature Extraction for Hyperspectral Image Classification: A Dimension Reduction and Deep Learning Approach,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 8, pp. 4544-4554, 2016.
  • [10] F. A. Mianji and Y. Zhang, “Robust Hyperspectral Classification Using Relevance Vector Machine,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 49, no. 6, pp. 2100-2112, 2011.
  • [11] A. Samat, P. Du, S. Liu, J. Li and L. Cheng, “ : Ensemble Extreme Learning Machines for Hyperspectral Image Classification,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 7, no. 4, pp. 1060-1069, 2014.
  • [12] W. Li, C. Chen, H. Su and Q. Du, “Local Binary Patterns and Extreme Learning Machine for Hyperspectral Imagery Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 7, pp. 3681-3693, 2015.
  • [13] T. Lu, S. Li, L. Fang, L. Bruzzone and J. A. Benediktsson, “Set-to-Set Distance-Based Spectral–Spatial Classification of Hyperspectral Images,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7122-7134, 2016.
  • [14] Guang-Bin Huang, Qin-Yu Zhu, Chee-Kheong Siew, “Extreme learning machine: Theory and applications,” Neurocomputing, vol 70, Issues 1-3, pp. 489-501, 2006.
  • [15] J. Gui, Z. Sun, W. Jia, R. Hu, Y. Lei and S. Ji, “Discriminant sparse neighborhood preserving embedding for face recognition,” Pattern Recognition, vol. 45, no. 8, pp. 2884-2893, 2012.

  • [16] D. Lunga, S. Prasad, M. M. Crawford and O. Ersoy, “Manifold-Learning-Based Feature Extraction for Classification of Hyperspectral Data: A Review of Advances in Manifold Learning,” in IEEE Signal Processing Magazine, vol. 31, no. 1, pp. 55-66, 2014.
  • [17] Christian Szegedy, Sergey Ioffe and Vincent Vanhoucke, “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” in CoRR, vol. abs/1602.07261, arXiv, 2016.
  • [18] Cohen, J. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement, vol. 20, no. 1, pp. 37-46. 1960.
  • [19] S. Jia, X. Zhang and Q. Li, “Spectral–Spatial Hyperspectral Image Classification UsingRegularized Low-Rank Representation and Sparse Representation-Based Graph Cuts,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2473-2484, 2015.
  • [20] H. Zhang, Y. Li, Y. Jiang, P. Wang, Q. Shen and C. Shen, “Hyperspectral Classification Based on Lightweight 3-D-CNN With Transfer Learning,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 8, pp. 5813-5828, 2019.

  • [21] S. K. Roy, G. Krishna, S. R. Dubey and B. B. Chaudhuri, “HybridSN: Exploring 3-D-2-D CNN Feature Hierarchy for Hyperspectral Image Classification,” in IEEE Geoscience and Remote Sensing Letters, pp. 1-5, 2019.
  • [22] Y. Chen, X. Zhao and X. Jia, “Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2381-2392, June 2015.
  • [23] T. Li, J. Zhang and Y. Zhang, “Classification of hyperspectral image based on deep belief networks,” IEEE International Conference on Image Processing (ICIP), pp. 5132-5136, 2014.
  • [24] P. Zhong, Z. Gong, S. Li and C. Schönlieb, “Learning to Diversify Deep Belief Networks for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 6, pp. 3516-3530, 2017.
  • [25] P. Zhong, Zhiqiang Gong and C. Schönlieb, “A DBN-crf for spectral-spatial classification of hyperspectral data,” 23rd International Conference on Pattern Recognition (ICPR), pp. 1219-1224, 2016.
  • [26] Y. Chen, X. Zhao and X. Jia, “Spectral–Spatial Classification of Hyperspectral Data Based on Deep Belief Network,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 8, no. 6, pp. 2381-2392, 2015.
  • [27] J. Feng, L. Liu, X. Zhang, R. Wang and H. Liu, “Hyperspectral image classification based on stacked marginal discriminative autoencoder,” 2017 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Fort Worth, TX, 2017, pp. 3668-3671.
  • [28] Y. Sun, J. Li, W. Wang, A. Plaza and Z. Chen, “Active learning based autoencoder for hyperspectral imagery classification,” IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 469-472, 2016.
  • [29] J. E. Ball and P. Wei, “Deep Learning Hyperspectral Image Classification using Multiple Class-Based Denoising Autoencoders, Mixed Pixel Training Augmentation, and Morphological Operations,” IEEE International Geoscience and Remote Sensing Symposium (IGARSS), pp. 6903-6906, 2018.

  • [30] J. Feng, L. Liu, X. Cao, L. Jiao, T. Sun and X. Zhang, “Marginal Stacked Autoencoder With Adaptively-Spatial Regularization for Hyperspectral Image Classification,” in IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 9, pp. 3297-3311, 2018.
  • [31] S. Zhou, Z. Xue and P. Du, “Semisupervised Stacked Autoencoder With Cotraining for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3813-3826, June 2019.
  • [32] C. Tao, H. Pan, Y. Li and Z. Zou, “Unsupervised Spectral–Spatial Feature Learning With Stacked Sparse Autoencoder for Hyperspectral Imagery Classification,” in IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 12, pp. 2438-2442, 2015.
  • [33] W. Li, G. Wu, F. Zhang and Q. Du, “Hyperspectral Image Classification Using Deep Pixel-Pair Features,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 2, pp. 844-853, Feb. 2017.
  • [34] M. Zhang, W. Li and Q. Du, “Diverse Region-Based CNN for Hyperspectral Image Classification,” in IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2623-2634, June 2018.
  • [35] S. Hao, W. Wang, Y. Ye, T. Nie and L. Bruzzone, “Two-Stream Deep Architecture for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2349-2361, April 2018.
  • [36] Z. Gong, P. Zhong, Y. Yu, W. Hu and S. Li, “A CNN With Multiscale Convolution and Diversified Metric for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 6, pp. 3599-3618, June 2019.
  • [37] A. Santara et al., “BASS Net: Band-Adaptive Spectral-Spatial Feature Learning Neural Network for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 9, pp. 5293-5301, 2017.
  • [38] B. Liu, X. Yu, P. Zhang, A. Yu, Q. Fu and X. Wei, “Supervised Deep Feature Extraction for Hyperspectral Image Classification,” in IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 1909-1921, April 2018.

  • [39] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition”, in CoRR, vol. abs/1409.1556, arXiv, 2014.
  • [40] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D.Anguelov, D. Erhan, V. Vanhoucke and A. Rabinovich, “Going Deeper with Convolutions”, in CoRR, vol. abs/1409.4842, arXiv, 2014.
  • [41] K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition”,in CoRR, vol. abs/1512.03385, arXiv, 2015.
  • [43] N. Ma, X. Zhang, H. Zheng and J. Sun, “ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design,” in Computer Vision – ECCV, pp. 122-138, 2018.

  • [44] S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, in CoRR, vol. abs/1502.03167, arXiv, 2015.
  • [45] B. McFee, J. Salamon and J. P. Bello, “Adaptive Pooling Operators for Weakly Labeled Sound Event Detection,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2180-2193, 2018.