A Biologically Interpretable Two-stage Deep Neural Network (BIT-DNN) For Hyperspectral Imagery Classification

by   Yue Shi, et al.

Spectral-spatial based deep learning models have recently proven to be effective in hyperspectral image (HSI) classification for various earth monitoring applications such as land cover classification and agricultural monitoring. However, due to the nature of "black-box" model representation, how to explain and interpret the learning process and the model decision remains an open problem. This study proposes an interpretable deep learning model – a biologically interpretable two-stage deep neural network (BIT-DNN), by integrating biochemical and biophysical associated information into the proposed framework, capable of achieving both high accuracy and interpretability on HSI based classification tasks. The proposed model introduces a two-stage feature learning process. In the first stage, an enhanced interpretable feature block extracts low-level spectral features associated with the biophysical and biochemical attributes of the target entities; and in the second stage, an interpretable capsule block extracts and encapsulates the high-level joint spectral-spatial features into the featured tensors representing the hierarchical structure of the biophysical and biochemical attributes of the target ground entities, which provides the model an improved performance on classification and intrinsic interpretability. We have tested and evaluated the model using two real HSI datasets for crop type recognition and crop disease recognition tasks and compared it with six state-of-the-art machine learning models. The results demonstrate that the proposed model has competitive advantages in terms of both classification accuracy and model interpretability.




I Introduction

Recently, deep learning models have been widely used for hyperspectral image (HSI)-based earth monitoring applications, such as land cover classification and agricultural monitoring [RN4, RN5] and ecological management [RN8, RN9]. However, most existing deep learning-based approaches have difficulty explaining biophysical and biochemical characteristics because of the “black-box” representation of the features extracted from the intermediate layers and the complex design of the network architectures [RN11, RN10]. Therefore, the interpretability of deep learning for HSI classification has become one of the most active research topics in the remote sensing community, since it can improve the robustness and accuracy of models in earth monitoring applications from the biological perspective of the target entities [RN13, RN14, RN12].

Some efforts have been made on interpretable deep learning-based models in the remote sensing field [RN13, RN14, RN12]. Among the explainable deep learning models for HSI classification, visualizing the feature representations is the most direct way to improve interpretability [RN34]. Methods of this type add a layer to visualize the intermediate features or patterns, either by maximizing the score of a given unit in a pre-trained deep learning model or by inverting the feature maps of an intermediate layer back to the input image [RN30, RN53, RN28]. For example, [RN29] studied the spatial distribution and significance of the output of each layer, and proposed a multilayer visualization approach to simultaneously visualize the sample distribution, the details at the subpixel level, and the target units and labels hidden in the deep levels of airborne AVIRIS data and spaceborne Hyperion data. Another way to improve the interpretability of deep learning models is to construct a network architecture that gives the network an explicit semantic meaning [RN35, RN34]. For example, [RN36] proposed an unsupervised model, named multiple-layer feature-matching generative adversarial networks (MARTA GANs), to explore and extract the representation of unlabelled data during the learning process. In this model, a generative model was used to integrate local and global features, and a discriminative model was set to learn better spectral representations from HSI images.

Although existing research is encouraging, the complex interaction between the ground structure and the reflectance radiation of objects makes the biological interpretability of a deep learning model challenging. On the one hand, unavoidable spectral-spatial perturbations and redundancies in HSI data make it difficult to accurately represent the features of intermediate layers [RN23]. On the other hand, it is hard to capture a hierarchical biological relationship among the high-level features produced by deeper layers [RN26].

Generally speaking, a well-designed interpretable model for HSI classification needs to deal with two issues: 1) how to extract interpretable features that are associated with biological attributes, and 2) how to represent the hierarchical structure of the biophysical and biochemical attributes of the target entities. To address these issues, this study designs a biologically interpretable two-stage deep neural network (BIT-DNN) that achieves accurate classification of HSI data while fully considering the biophysical and biochemical basis of the target entities during the learning process. The model consists of two stages. The first stage is designed to extract interpretable spectral features and to generate low-level spectral features with enhanced biophysical and biochemical representations. The second stage is designed to characterize the biological relationship and the hierarchical structure of the high-level joint spectral-spatial features by integrating the spatial texture information with the extracted spectral features. As a result, the proposed BIT-DNN model is able to represent the biological attributes of the target entities captured by HSI data, enabling improved interpretability of decision making.

The rest of this paper is organised as follows: Section II provides an overview of related work on existing interpretable deep learning models for HSI image classification; Section III presents our proposed interpretable deep neural network (BIT-DNN) for HSI image classification; Section IV introduces the criteria for interpretability assessment; Section V describes the experimental evaluation; Section VI concludes the work.

II Related work on existing interpretable deep learning models for HSI image classification

To evaluate the interpretability of a deep learning model fairly, interpretability criteria are essential. Generally, for a deep learning task, interpretability should be considered throughout the life cycle of data science: data collection, pre-processing, data modelling, and post hoc analysis [RN71]. Since the intrinsic interpretability of HSI data lies in the reflection and radiation characteristics of ground entities, three interpretability criteria are used to assess whether a deep learning model is self-explanatory in HSI classification: 1) pre-model interpretability, 2) in-model interpretability, and 3) post-model interpretability (post hoc analysis).

Specifically, pre-model interpretability, which is addressed prior to the main model construction stage, mainly focuses on enhancing the biological attributes of the ground entities hidden in the HSI data. Two popular approaches to pre-model interpretability are data description standardization and explainable feature enhancement. For example, [RN79] proposed a dimensionality reduction method to explore the spectral and spatial characteristics of HSI data. This approach improved the representation of the hyperspectral patch alignment in the main model and performed well in small-sample learning. [RN73] transformed the original three colour channels into an interpretable dataset with a tensor representing the potential shape or texture attributes of the target objects.

In-model interpretability refers to imposing causality or physical constraints on the main model so that it extracts interpretable features with explicit semantic meaning. [RN37] developed a deep capsule CNN-based network for HSI classification tasks in order to better model the hierarchical relationships of features. By exploiting the correlation of spectral-spatial features, the approach added structures called “capsules” to the CNN network. The capsules allowed efficient handling of the high-level complexity of the entities, including the spatial position in the image, the associated spectral signatures, and the potential transformations. Although such approaches can improve interpretability, they tend to make the network deeper. As a result, a significant number of filters are added to the network architecture, leading to the vanishing gradient problem and limiting the performance of activations and gradients during training [RN39, RN38]. From this aspect, prior knowledge-based feature enhancement or encoding is an efficient way to uncover the discriminative spectral-spatial characteristics hidden in the raw hyperspectral images [RN40]. For instance, in order to formalize and exploit the knowledge of automatic urban object identification, [RN41] proposed a knowledge-based deep learning model for urban object detection. In comparison with traditional CNN-based approaches, this model provided a better performance on the interpretation of HSR images for automatic mapping of the territory. [RN42] proposed a LiDAR-based deep learning model to classify forested landslides, in which prior knowledge was manually integrated with the feature extraction layers. As a result, the output features from the intermediate layers provided interpretable information on the geological characteristics of the target landslide, which subsequently achieved a better forested landslide classification in steep and rugged terrain.

Post-model interpretability, which is generally decoupled from the main model, refers to explaining the representations of the intermediate outputs. Two popular post-model interpretability approaches are the visualization-based approach and interpretable activation optimization. The visualization-based approach is the most direct way to explore the high-level representations of the spectral information hidden in the deeper layers. [RN30] compared these two different ways of extracting and visualizing image features from different layers and encoding dense features at multiple scales into global features. Since [RN32] introduced a principal component analysis based activation function for optimizing the deep spectral-spatial features collected from the HSI classification framework, various convolutional neural network (CNN) based deep learning models have focused on exploring and interpreting the spectral-spatial pattern of the target entities using the post-model approach. For instance, [RN31] concatenated the pixel-wise spectral-spatial features and visualized the fully connected layers to extract and explain the contributions and representations of the intermediate outputs for the final classification. [RN53] explored the interpretation of the training process in an unsupervised way. For this purpose, they proposed an encoder-decoder paradigm in which the significant information of the input HSI patches was extracted in a lower dimensional space via a CNN encoder.

However, most of the existing interpretable deep learning approaches were designed based on the statistical properties of the sample space [RN45, RN44]. Thus, the learning process is modelled as a set of joint probability density functions, and a large amount of high-quality labelled training data is required. Neglecting the biophysical and biochemical attributes hidden in the redundant information of the HSI data makes the classification performance highly dependent on the scale and quality of the labelled samples. Moreover, the effect of mixed pixels, which may degrade the intra-class variability, exaggerate the inter-class similarity and produce feature interferences during the learning process, was often not fully considered [RN47, RN46]. Therefore, most existing deep learning approaches have poor interpretability for the high-level features of HSI data and suffer from “salt and pepper” noise in the final classifications [RN49].

III The proposed method: a biologically interpretable two-stage deep neural network (BIT-DNN)

In this study, we consider HSI data as a data cube X with a size of H × W × B, where H, W, and B are the height, width, and number of bands of the original data cube, and each pixel comprises an individual spectral signal with B bands. We propose a novel deep learning framework, a biologically interpretable two-stage deep neural network (BIT-DNN), to deal with HSI classification. The architecture of the BIT-DNN is shown in Fig. 1.

Fig. 1: The high-level system overview of the proposed interpretable two-stage spectral-spatial deep neural network

III-A The pre-processing block

Considering the explicit biochemical and biophysical properties of vegetation in different band ranges [RN50, RN51], we split the HSI images into 7 segmentations: blue (), green (), red (), red-edge1 (), red-edge2 (), red-edge3 (), and near infrared (). The segmentations are processed in parallel, which not only increases the computing efficiency but also helps interpret the photochemical meaning of the intermediate variables.

III-B Stage 1 – An enhanced interpretable feature block

The enhanced interpretable feature block, the first stage of feature learning, is introduced to extract and generate interpretable low-level spectral features. It consists of multiple layers including: two 1D CNN layers, two fully connected layers, and the spectral enhanced interpretable layer. Its architecture is shown in Fig.2.

Fig. 2: The architecture of the enhanced interpretable feature block in the two-stage learning process, Stage 1. This stage is to extract and generate interpretable low-level spectral features.

The output of the pre-processing block is a series of HSI segmentations, and each HSI segmentation is then separately fed into the two 1D CNN layers of the enhanced interpretable feature block. Each channel works as a spectral feature extractor that learns the spectral responses which are sensitive to certain biophysical or biochemical attributes of the target class. Subsequently, two fully connected layers are used to map the sensitive spectral features into the corresponding classes. Finally, in order to highlight the relevance among these spectral features, a feature enhanced interpretable layer is designed to generate a new spectral feature set with improved biochemical and biophysical meanings. The detailed information about each layer in this block is described in the following subsections.

III-B1 1D CNN layer

For a given HSI segmentation, the spectral sub-regions which are sensitive to the target classes should first be separated from the redundant band information. For this purpose, two 1D CNN layers (i.e. the conv1 and conv2 layers in Fig. 2) are introduced into our network to extract the pixel-wise spectral response features that are sensitive to the target classes, using a series of filters with various receptive fields. The convolution unit transforms the input vector into the target receptive fields as follows:


where rm is the length of the receptive field, and the remaining terms denote the layer output of a neuron in a filter, the weighting factor of that neuron in the filter for the spectral vector of the pixel, and the input spectral vector patch corresponding to the neuron.
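The operation described above can be sketched as a plain 1D convolution over one pixel's spectrum. This is only an illustration under stated assumptions: the stride of 1, the absence of padding, the ReLU activation, the filter weights and the reflectance values are all hypothetical and are not taken from the paper.

```python
def conv1d_valid(spectrum, kernel, bias=0.0):
    """Slide `kernel` over `spectrum` (stride 1, no padding), then apply ReLU.

    The ReLU activation is an assumption made for this sketch.
    """
    r = len(kernel)  # length of the receptive field
    outputs = []
    for j in range(len(spectrum) - r + 1):
        patch = spectrum[j:j + r]  # spectral patch seen by output neuron j
        activation = sum(w * x for w, x in zip(kernel, patch)) + bias
        outputs.append(max(0.0, activation))
    return outputs


# Made-up reflectance values for one pixel within one spectral segmentation.
pixel_spectrum = [0.10, 0.12, 0.30, 0.45, 0.44, 0.20]
# Hypothetical filter that responds to rising reflectance across three bands.
rising_edge_filter = [-1.0, 0.0, 1.0]

features = conv1d_valid(pixel_spectrum, rising_edge_filter)
```

Each output value summarises one window of adjacent bands, so a filter of length 3 applied to 6 bands yields 4 responses; windows where reflectance falls are clipped to zero by the assumed ReLU.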

III-B2 A fully connected layer

The output from the 1D CNN layers serves as the input for the fully connected layer. The output of this layer is a 7-channel matrix which represents the contribution of each spectral segmentation to the targeted classes. It is noteworthy that two fully connected layers are used in our model to nonlinearly map the sensitive spectral information into a class.

III-B3 Spectral enhanced layer

In order to enhance the potential biophysical and biochemical information hidden in the spectral information extracted from different spectral segmentations, a spectral enhanced layer is designed to consider the potential combinations of two bands and three bands in order to generate the enhanced spectral features. For the two-band-combined features, a binary model is employed as follows:


Each two-band-combined feature is computed from one pair of the 7 segmentation outputs, so a total of 21 features are generated in this feature set.

For the three-band-combined features, considering the area of a hypothetical triangle in feature space that connects the spectral information among various segmentations in a geometrical way, a triangular index model is used as follows:


Each three-band-combined feature is computed from one triple of the 7 segmentation outputs, so a total of 35 features are generated in this feature set.

Finally, the output of the enhanced interpretable feature block is a 63-channel matrix that contains both the extracted and the generated features, i.e. the 7 extracted features together with the 21 two-band-combined and 35 three-band-combined features.
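The feature counts quoted above follow from simple combinatorics: 7 channels give 21 unordered pairs and 35 unordered triples, and 7 + 21 + 35 = 63. The sketch below checks this. The two combination formulas in it are assumptions chosen for illustration (a normalized difference for pairs, and the area of a triangle whose vertices are (channel index, channel value) points for triples); the exact binary and triangular index models used by the authors are not recoverable from the text here.

```python
from itertools import combinations


def enhance(channels, eps=1e-12):
    """Return the input channels plus all two-band and three-band combinations."""
    # Assumed two-band model: normalized difference of each pair of channels.
    two_band = [(a - b) / (a + b + eps) for a, b in combinations(channels, 2)]

    # Assumed three-band model: area of the triangle through three
    # (channel index, channel value) points, via the shoelace formula.
    points = list(enumerate(channels))
    three_band = [
        0.5 * abs((x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1))
        for (x1, y1), (x2, y2), (x3, y3) in combinations(points, 3)
    ]
    return list(channels), two_band, three_band


# Made-up per-segmentation responses for one pixel (7 channels).
segment_outputs = [0.05, 0.12, 0.08, 0.30, 0.42, 0.47, 0.50]
base, pairs, triples = enhance(segment_outputs)
all_features = base + pairs + triples
```

Whatever formulas are used for the combinations, the size of the output depends only on the number of input channels, which is why the block always produces 63 channels from 7 segmentations.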

III-C Stage 2 – An interpretable capsule block

The interpretable capsule block, the second stage of feature learning, is introduced to better model the hierarchical structure of the biophysical and biochemical attributes of the target ground entities in order to achieve highly accurate classification and high interpretability. It consists of a 2D CNN layer, a capsule layer and a classification capsule layer (see Fig. 3). Specifically, the output from the enhanced interpretable feature block (Stage 1) is first fed into a 2D CNN layer, in which the spatial texture information provided by the spectral feature maps is integrated into the featured spectral information, and the joint spectral-spatial features are output. Subsequently, the spectral-spatial feature maps are fed into a capsule layer, where the spectral-spatial features are encapsulated into a series of featured tensors as high-level features. Finally, in order to use these high-level features to classify the HSI data into different classes, a class capsule layer is designed to output the membership of a given featured tensor in each label. The detailed information about the layers in the block is described in the following subsections.

Fig. 3: The architecture of the interpretable capsule block in the two-stage learning process, Stage 2. This stage is designed to better model the hierarchical structure of the biophysical and biochemical attributes of the target ground entities in order to achieve high accuracy and interpretability.

III-C1 2D-CNN layer

In order to integrate the spatial texture information from the neighbouring pixels into the spectral features of the central pixel from the enhanced interpretable feature block, a typical 2D CNN layer is first introduced. The goal of this layer is to generate joint spectral-spatial features from the enhanced spectral features produced in Stage 1. In other words, the convolutional operation can be regarded as a feature updating process, which integrates the spatial structural information into the pixel-wise spectral features and improves the representation of the input features for the target classes. In this layer, the input is the output spectral feature set from Stage 1, to which a convolution kernel is applied that considers the pixel-wise spectral features in different wavelengths. Then, batch normalization and a ReLU activation function are used to obtain the K-dimensional feature channels.

III-C2 Capsule layers

The traditional convolutional layers for deep neural networks might not be efficient in modelling the hierarchical structure of the extracted spectral-spatial features in a HSI classification task. This may result in a poor performance in characterizing and detecting the potential transformation and rotation of target classes. The design of the capsule layer in our approach aims to integrate the spectral-spatial scalar features into the vector features to represent the hierarchical structure of such extracted biophysical and biochemical-associated information, which provides the most comprehensive and accurate features that support the interpretability and reliability of the model.

Specifically, the capsule layer (the major capsule layer) comprises a set of capsules, each of which is composed of convolutional neurons. Within this layer, each capsule is able to model the hierarchical structure of the joint spectral-spatial features, which represent various attributes of the target entities, such as orientation, pose, and biochemical or biophysical components, as a series of featured tensors. These encapsulated tensors are versatile in representing the homogeneous attributes of the targeted entities from the diversity of the spectral signatures in the HSI data. In addition, these featured tensors preserve much more information on the biophysical and biochemical correlations between the extracted spectral-spatial features and the targeted entities.

Compared with the general capsule layer rendered in computer vision with a series of pre-defined instantiation parameters [RN37], the capsule layer in our network is considered as an “inverse rendering”, which focuses on the extraction and detection of the instantiation parameters of the spectral-spatial feature vectors. Specifically, this inverse rendering process first extracts the low-level spectral-spatial information (i.e. the output of the 2D-CNN layer), and then groups it into a four-dimensional tensor of feature capsules. Each feature integrated into the tensor acts as an activity element of a linear subspace, which preserves most of the biophysical or biochemical variance (or fluctuation) information of a given class. These groups of spectral-spatial features allow the capsule not only to detect the specific biophysical and biochemical features but also to learn the potential variants caused by the surface structure and texture, providing the network with rotational invariance properties. In this context, the orientation of an activity vector represents the instantiation parameters, and its length represents the probability that the spectral-spatial feature the capsule is looking for is present. To better represent these properties, a non-linear squash function is used in this study to scale down the length of the activity vector $\mathbf{s}$. The squash function is formulated as follows:

$$\mathbf{v} = \frac{\|\mathbf{s}\|^{2}}{1 + \|\mathbf{s}\|^{2}} \, \frac{\mathbf{s}}{\|\mathbf{s}\|}$$

where $\mathbf{v}$ is the scaled activity vector.
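As an illustration of the scaling behaviour, the sketch below implements the squash non-linearity in the form commonly used in the capsule network literature, where a vector keeps its direction and its length is mapped into the interval from 0 to 1. That the paper uses exactly this form is an assumption; the function names and the input vectors are made up for the example.

```python
import math


def squash(s, eps=1e-12):
    """Shrink vector `s` so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    norm = math.sqrt(norm_sq)
    # Scale factor: ||s||^2 / (1 + ||s||^2), divided by ||s|| to normalise direction.
    scale = norm_sq / (1.0 + norm_sq) / (norm + eps)
    return [scale * x for x in s]


def vector_length(v):
    return math.sqrt(sum(x * x for x in v))


weak = squash([0.1, 0.0])      # short input: length is pushed towards 0
strong = squash([30.0, 40.0])  # long input: length approaches 1
```

Short vectors are shrunk to almost zero and long vectors saturate just below one, which is what allows the length to be read as a presence probability.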

III-C3 Class-capsule layer

The class-capsule layer is designed to connect all the outputs of the capsules as the encoder units of the target classes. In this work, the length of the final encoder units is the number of classes, and their width is the number of capsules. For each input patch, the activity vectors are encoded as the probability of belonging to the corresponding classes. For this purpose, the dynamic routing algorithm proposed by [RN43] is employed to connect the current layer with the previous capsule layer in order to iteratively update the parameters between these two layers. The aim of this step is to provide a well-designed learning process that not only connects the spectral information between capsules but also highlights the part-whole spatial correlation by reinforcing the connection coefficients between different layers, thereby achieving accurate predictions. Mathematically, the encoder unit in layer $l+1$ is formulated as:

$$\hat{\mathbf{u}}_{n|m} = \mathbf{W}_{mn}\,\mathbf{u}_{m} + \mathbf{B}_{n}$$

where $\mathbf{u}_{m}$ is the output of the $m$th capsule in layer $l$, $\mathbf{B}_{n}$ is the bias of the $n$th capsule in layer $l+1$, and $\mathbf{W}_{mn}$ is a transformation matrix that connects the $m$th capsule output in layer $l$ with the $n$th capsule output in layer $l+1$. This formula allows the low-level capsules in layer $l$ to make predictions for the superior capsules in layer $l+1$, improving the representativeness of the extracted features in the biochemical-biophysical domain. Subsequently, a dynamic routing coefficient $c_{mn}$ is introduced to reinforce the prediction agreement when calculating the input $\mathbf{s}_{n}$ of capsule $n$ in layer $l+1$:

$$\mathbf{s}_{n} = \sum_{m} c_{mn}\,\hat{\mathbf{u}}_{n|m}$$

where $c_{mn}$ measures the contribution of the $m$th capsule in layer $l$ to activating the $n$th capsule in layer $l+1$. The routing coefficients of each capsule $m$ must sum to 1, and $c_{mn}$ is obtained by:

$$c_{mn} = \frac{\exp(b_{mn})}{\sum_{k} \exp(b_{mk})}$$

where $b_{mn}$ is the prior which indicates the correlation between the $m$th capsule in layer $l$ and the $n$th capsule in layer $l+1$. It is initialized as 0 and is iteratively refined as follows:

$$b_{mn} \leftarrow b_{mn} + \hat{\mathbf{u}}_{n|m} \cdot \mathbf{v}_{n}$$

where $\mathbf{v}_{n}$ is the activity vector of the $n$th capsule in layer $l+1$, which is calculated with the squash function as follows:

$$\mathbf{v}_{n} = \frac{\|\mathbf{s}_{n}\|^{2}}{1 + \|\mathbf{s}_{n}\|^{2}} \, \frac{\mathbf{s}_{n}}{\|\mathbf{s}_{n}\|}$$

Conceptually, through the dynamic routing algorithm, similar predictions from the capsule layer are grouped, which yields a robust prediction with clearer biochemical and biophysical meaning. Finally, the prediction performance is measured by the margin loss function ($L_{M}$) as follows:

$$L_{M} = \sum_{k} \left[ T_{k}\,\max\!\left(0,\; m^{+} - \|\mathbf{v}_{k}\|\right)^{2} + \lambda\,(1 - T_{k})\,\max\!\left(0,\; \|\mathbf{v}_{k}\| - m^{-}\right)^{2} \right]$$

where $T_{k}$ is 1 when class $k$ is present in the data and 0 otherwise. The margins $m^{+}$ and $m^{-}$ force the length of $\mathbf{v}_{k}$ into a small set of intervals in order to minimize the loss. Here $m^{+}$ is set to 0.9 and $m^{-}$ is set to 0.1, and $\lambda$ is a regularization parameter, set to 0.5, which down-weights the loss for absent classes and reduces the effect of the negative activity vectors.
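The routing and loss computations described in this subsection can be sketched in a few lines. The code below is a minimal illustration of routing-by-agreement and of the margin loss with the stated settings (0.9, 0.1 and 0.5), written in the form commonly used in the capsule network literature; that the paper follows this form exactly is an assumption. The number of routing iterations (3), the function names and the toy prediction vectors are hypothetical, and the prediction vectors are taken as given rather than computed from learned transformation matrices.

```python
import math


def squash(s, eps=1e-12):
    norm_sq = sum(x * x for x in s)
    scale = norm_sq / (1.0 + norm_sq) / (math.sqrt(norm_sq) + eps)
    return [scale * x for x in s]


def softmax(xs):
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[m][n] is the prediction of lower capsule m for upper capsule n."""
    num_lower, num_upper, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * num_upper for _ in range(num_lower)]  # priors start at 0
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # each lower capsule's coefficients sum to 1
        v = []
        for n in range(num_upper):
            s_n = [sum(c[m][n] * u_hat[m][n][d] for m in range(num_lower))
                   for d in range(dim)]
            v.append(squash(s_n))
        for m in range(num_lower):
            for n in range(num_upper):
                # Agreement between a prediction and the resulting activity vector.
                b[m][n] += sum(p * q for p, q in zip(u_hat[m][n], v[n]))
    return v, c


def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    loss = 0.0
    for k, norm in enumerate(lengths):
        t = 1.0 if k == target else 0.0
        loss += t * max(0.0, m_plus - norm) ** 2
        loss += lam * (1.0 - t) * max(0.0, norm - m_minus) ** 2
    return loss


# Toy predictions: three lower capsules agree strongly on upper capsule 0.
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[0.9, 0.1], [0.0, 0.1]],
    [[1.0, -0.1], [-0.1, 0.0]],
]
v, c = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in v_n)) for v_n in v]
```

In this toy case the three lower capsules make consistent predictions for upper capsule 0, so routing increases their coefficients towards it and its activity vector ends up much longer than that of capsule 1.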

III-D The activation block

The activation block involves two fully connected layers, which transform the output activity vectors of the spectral-spatial feature encoder to yield the classification map. The decoder employs the Adam optimizer with a learning rate as follows:


where the coefficient is set to 0.0005 to balance the weights between the two terms during the reconstruction of loss.

IV Interpretability assessment methods

We assess the interpretability of the proposed model from three aspects: pre-model, in-model, and post-model interpretability.

IV-A Pre-model interpretability

In order to evaluate the pre-model interpretability of the proposed pre-processing block, two standard metrics, Shannon entropy and the Dunn index, are used to measure and visualize the quality of the labelled clusters. Shannon entropy measures the uncertainty and disorder within the information represented by the intermediate features, and the entropy of a class $c$ is defined as:

$$H(c) = -\sum_{i} p_{i}^{c} \log p_{i}^{c}$$

where $p_{i}^{c}$ is the contribution (or probability) of the $i$th feature to the class $c$. A low entropy implies a high concentration of the feature set within the same class. The Dunn index is defined as the ratio of the minimum inter-class distance to the maximum intra-class distance:

$$DI = \frac{\min_{i \neq j} d_{\mathrm{inter}}(c_{i}, c_{j})}{\max_{k} \max_{x, y \in c_{k}} d_{\mathrm{intra}}(x, y)}$$

where $d_{\mathrm{inter}}(c_{i}, c_{j})$ is the inter-class distance defined by the L2-norm distance between the class centers (mean feature sequences) of classes $c_{i}$ and $c_{j}$, and $d_{\mathrm{intra}}(x, y)$ is the intra-class distance defined by the L2-norm distance between any two samples $x$ and $y$ with the same label. A larger Dunn index suggests a better clustering because it indicates a smaller intra-class distance or a larger inter-class distance.
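Both metrics are straightforward to compute. The sketch below follows the definitions given in the text (entropy over per-class feature contributions; Dunn index as the minimum distance between class centers divided by the maximum distance between two samples of the same class). The base-2 logarithm, the function names and the toy clusters are assumptions made for illustration.

```python
import math
from itertools import combinations


def shannon_entropy(probabilities):
    """Entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dunn_index(clusters):
    """`clusters` maps a label to its feature vectors (>= 2 labels, >= 2 samples each)."""
    centers = {
        label: [sum(column) / len(samples) for column in zip(*samples)]
        for label, samples in clusters.items()
    }
    min_inter = min(l2(centers[a], centers[b]) for a, b in combinations(centers, 2))
    max_intra = max(
        l2(x, y) for samples in clusters.values() for x, y in combinations(samples, 2)
    )
    return min_inter / max_intra


uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])       # spread-out contributions
concentrated = shannon_entropy([0.97, 0.01, 0.01, 0.01])  # one dominant feature

compact = {"a": [[0, 0], [0, 1]], "b": [[10, 0], [10, 1]]}
overlapping = {"a": [[0, 0], [0, 4]], "b": [[5, 0], [5, 4]]}
```

The concentrated distribution has the lower entropy, and the compact, well-separated clusters have the higher Dunn index, matching the interpretation given above.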

IV-B In-model interpretability

In our proposed model, the spectral enhanced layers described in Section III-B and the capsule layers described in Section III-C are the intrinsically interpretable blocks, which fully consider the physical mechanism of the spectral combinations and the biological hierarchical interactions among the extracted spectral-spatial features, respectively. In this study, the learning process of the proposed model is output step by step in order to evaluate the in-model interpretability.

IV-C Post-model (Post Hoc) interpretability

To evaluate the post-model interpretability, auxiliary data are used to explain the biophysical or biochemical meanings of the intermediate features generated in the hidden layers of the model. This analysis is decoupled from the main model, so it only evaluates the interpretability of the intermediate layers and does not affect the performance of the main model. The selection of the auxiliary data is therefore important for exploring and understanding the feature representations in the intermediate layers. In this study, considering the differences between the land cover classification and crop disease detection study cases, we selected two types of auxiliary data for the post hoc analysis.

1) Vegetation indices data

In order to quantify the biological attributes of the intermediate features extracted in the deep layers, the coefficients of determination ($R^{2}$) between such features and biochemical- and biophysical-associated indices are calculated based on univariate correlation analyses. For land cover classification, the ground retrievable parameters represent the biological attributes of the specific vegetation types. As substitutes for these parameters, we employ 10 popular vegetation indices that have proven to be sensitive to certain biological attributes (see Table I).

Vegetation index | Related to | Formula | Ref
Normalized Difference Vegetation Index (NDVI) | Vegetation coverage | | [RN70]
Photochemical Reflectance Index (PRI) | Photosynthetic efficiency | | [RN59]
Red-edge Chlorophyll Index (CIred-edge) | Chlorophyll content | | [RN62]
Normalized Difference Water Index (NDWI) | Water | | [RN64]
Triangular Vegetation Index (TVI) | Green LAI | | [RN60]
Structural Independent Pigment Index (SIPI) | Pigment content | | [RN67]
Plant Senescence Reflectance Index (PSRI) | Nutrient | | [RN66]
Normalized Pigment Chlorophyll Ratio Index (NPCI) | Chlorophyll density | | [RN61]
Optimized Soil Adjusted Vegetation Index (OSAVI) | Soil background | | [RN69]
TABLE I: The biophysical- and biochemical-associated vegetation indices used in this study
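As an illustration of how one such index is derived from reflectance, the sketch below computes NDVI using its conventional definition, (NIR - Red) / (NIR + Red). The reflectance values are made up, and the specific bands the authors used for each index are not stated here, so the band choice is an assumption.

```python
def ndvi(nir, red, eps=1e-12):
    """Conventional NDVI from near-infrared and red reflectance."""
    return (nir - red) / (nir + red + eps)


# Made-up reflectance values: dense green canopy versus bare soil.
canopy_ndvi = ndvi(nir=0.50, red=0.05)
soil_ndvi = ndvi(nir=0.30, red=0.25)
```

Healthy vegetation reflects strongly in the near infrared and absorbs red light, so its NDVI is much closer to 1 than that of bare soil.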

2) Biological parameters

For the crop disease monitoring, the measured biological parameters, include leaf area index (LAI), leaf chlorophyll content (CHL), leaf anthocyanin content (ANTH), nitrogen balance index (NBI), and percentile dry matters (PDM). They were synchronously measured at the same place where the HSI measurements were collected. In order to guarantee the sample scale and spatial resolution of HIS data are consistent, a total of 72 sampling sites with subplots were set. Wherein, the CHL, ANTH, and NBI were measured by the Dualex Scientific sensor (FORCE-A, Inc. Orsay, France), a hand-held leaf-clip sensor designed to non-destructively evaluate the content of pigments and epidermal flavonol. For the LAI acquisition, the LAI-2200 Plant canopy analyzer (Li-Cor Biosciences Inc., Lincoln, NE, USA) was used. For the PDM measurement, 10-12 leaves for each sampling subplot were weighed with an electronic balance (Haozhuang, Inc, Shanghai, China) and dried in an electric blowing drying oven (DGG-9240A, Senxin, Inc, Shanghai, China) over 10 hours. After drying, the percentile dry matter (PDM) of the leaves was calculated by the ratio of dry and fresh weight.

To find the linear correlation between the enhanced spectral features after training and these parameters, a correlation analysis is used. The coefficient of determination ($R^2$) is used to assess the interpretability of these features in the learning process.
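The univariate analysis described above can be sketched as follows: each feature is regressed against each parameter with an ordinary least-squares line, and $R^2$ is computed from the residuals. The function names, the array layout (samples in rows), and the toy numbers are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def r_squared(feature, parameter):
    """R^2 of a univariate least-squares line: parameter ~ slope * feature + intercept."""
    x = np.asarray(feature, dtype=float)
    y = np.asarray(parameter, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    ss_res = np.sum((y - (slope * x + intercept)) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def r_squared_matrix(features, parameters):
    """R^2 for every (feature, parameter) pair.

    features:   (n_samples, n_features), e.g. enhanced spectral features per site
    parameters: (n_samples, n_params),   e.g. LAI, CHL, ... per site
    """
    features = np.asarray(features, dtype=float)
    parameters = np.asarray(parameters, dtype=float)
    return np.array([
        [r_squared(features[:, i], parameters[:, j])
         for j in range(parameters.shape[1])]
        for i in range(features.shape[1])
    ])

# Toy data (made up): one feature and one parameter measured at five sites.
demo_feature = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
demo_parameter = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
demo_r2 = r_squared(demo_feature, demo_parameter)
print(round(demo_r2, 2))
```

For a single predictor, this $R^2$ equals the squared Pearson correlation coefficient, which offers a quick cross-check with `np.corrcoef`.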

V Experimental Evaluation

To evaluate the effectiveness of the proposed model, we have applied it to two real datasets (see Table II) and compared it with six state-of-the-art machine learning models on two HSI-based classification tasks: crop type classification and crop disease diagnosis. These six models include 1) two traditional machine learning approaches, support vector machine (SVM) and random forest (RF) [RN74]; 2) a two-dimensional convolutional neural network (2D-CNN) [RN75]; and 3) three spectral-spatial deep learning approaches, namely the three-dimensional convolutional neural network (3D-CNN) [RN76], the spectral-spatial residual network (SSRN) [RN77], and the capsule network (CapNet) [RN37]. The datasets, experimental configurations, and evaluation are described below.

V-A HSI data description

The two datasets used for the evaluation and validation of the proposed model in this study are a publicly available dataset, Indian Pines (IP), and an experimentally measured Wheat Yellow Rust (WYR) dataset. The IP dataset is used for the task of crop type classification, and the WYR dataset is used for the task of crop disease diagnosis. These two datasets are described in detail as follows:

1) IP dataset

The IP dataset was collected by the Airborne Visible and Infrared Imaging Spectrometer (AVIRIS) sensor in 1992. It covers different crop planting areas in north-western Indiana, USA, and contains a total of 16 ground truth classes. This dataset involves 224 hyperspectral bands in the range of with a size of . The detailed information about the IP dataset can be found in [RN56].

2) WYR dataset

The WYR dataset was collected in 2018 by the DJI S1000 UAV system (SZ DJI Technology Co Ltd., Guangdong, China) carrying a UHD-185 imaging spectrometer (Cubert GmbH, Ulm, Baden-Württemberg, Germany). This dataset involves 125 bands from the visible to the near-infrared between and with a size of . All the images were obtained at a flight height of 30 m, with a spatial resolution close to per pixel. The hyperspectral images were manually labelled based on a synchronous ground survey of the occurrence of yellow rust.

IP dataset | | WYR dataset |
Land cover type | Samples | Crop status | Samples
Alfalfa | 46 | Health | 10842
Corn-notill | 1428 | Yellow rust | 7682
Corn-min | 930 | Others | 3613
Corn | 237 | |
Grass/Pasture | 483 | |
Grass/Trees | 730 | |
Grass/Pasture-mowed | 28 | |
Hay-windrowed | 478 | |
Oats | 20 | |
Soybeans-notill | 972 | |
Soybeans-min | 2455 | |
Soybeans-clean | 593 | |
Wheat | 205 | |
Woods | 1265 | |
Bldg-Grass-Tree-Drives | 386 | |
Stone-steel towers | 93 | |
Background | 10776 | |
TABLE II: Number of available samples in the IP and WYR datasets.

V-B Evaluation metrics

To validate the effectiveness of the proposed model, we have used the following metrics: the overall accuracy and recalled accuracy [RN81], sensitivity and specificity [RN80], kappa coefficient [RN82], and execution time.
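For reference, these metrics can all be derived from a multi-class confusion matrix, as in the minimal sketch below. It uses the standard one-vs-rest definitions of sensitivity and specificity, takes the average accuracy (AA) as the mean per-class sensitivity, and uses Cohen's kappa; whether these match the exact definitions in the cited references is an assumption, and the function name and toy matrix are illustrative only.

```python
import numpy as np

def classification_metrics(conf):
    """Metrics from a confusion matrix where conf[i, j] counts samples of
    true class i that were predicted as class j."""
    conf = np.asarray(conf, dtype=float)
    total = conf.sum()
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp          # missed samples of each class
    fp = conf.sum(axis=0) - tp          # samples wrongly assigned to each class
    tn = total - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    oa = tp.sum() / total               # overall accuracy
    aa = sensitivity.mean()             # average (per-class) accuracy
    # Agreement expected by chance, from the row and column marginals.
    expected = (conf.sum(axis=0) * conf.sum(axis=1)).sum() / total ** 2
    kappa = (oa - expected) / (1.0 - expected)
    return {
        "OA": float(oa),
        "AA": float(aa),
        "kappa": float(kappa),
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Toy two-class confusion matrix (made-up counts).
demo_conf = np.array([[50, 10],
                      [5, 35]])
demo_metrics = classification_metrics(demo_conf)
print({k: np.round(v, 3) for k, v in demo_metrics.items()})
```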

V-C Model evaluation and interpretability analysis

V-C1 Case Study 1: the ground surface classification using Indian Pines (IP) dataset

The first experiment evaluates the performance of the proposed network on the classification of crop types using the IP dataset. Table III provides a quantitative accuracy assessment and classification comparison of the proposed approach together with the six competing approaches: SVM, RF, 2D-CNN, 3D-CNN, SSRN, and CapNet. In all of the tests and comparisons, two-thirds of the labelled pixels were randomly selected as the training dataset, and the remaining pixels were used as the testing dataset. Compared with the existing methods, the proposed model shows a clear improvement in classification performance. For example, the overall and average accuracies of the proposed approach reach 98.65% and 97.69%, respectively, with a Kappa value of 0.85; the average sensitivity and specificity are 98.55% and 97.69%, respectively. Among the competitors, CapNet achieves the second best classification performance (OA = 97.59%, AA = 96.11%, Kappa = 0.81, average sensitivity of 96.11%, and average specificity of 95.54%). Following CapNet, SSRN achieves an overall and average accuracy of 96.3% and 94.84%, a Kappa of 0.83, an average sensitivity of 94.84%, and an average specificity of 94.32%. The lowest classification accuracy occurs for the hay-windrowed class, where the machine learning-based SVM and RF classifiers achieve accuracies of only 51.4% and 13.9%, respectively. In comparison, the deep learning-based classifiers achieve an improved classification of this class, with the proposed approach obtaining the highest accuracy of 99.5%. Like our proposed method, CapNet and SSRN also exploit joint spectral-spatial information and provide better classification performance than the traditional machine learning models that use a single type of feature (either spectral or spatial information). Overall, the proposed approach outperforms CapNet and SSRN in terms of average sensitivity, average specificity, OA, AA, and Kappa.
Regarding the computing time, the two-stage feature learning architecture of the proposed model requires more computing resources and time: the average computing time reaches 305.2 s for the multi-class classification task, which is the slowest among all of the competitors.

Class | Proposed | SVM | RF | 2D-CNN | 3D-CNN | SSRN | CapNet
Alfalfa | 98.2 | 81.5 | 62.5 | 80.1 | 75.2 | 98.1 | 95.4
Corn-notill | 98.4 | 76.1 | 56.2 | 75.2 | 92.2 | 95.2 | 95.8
Corn-min | 98.7 | 71.4 | 41.2 | 82.1 | 88.1 | 96.7 | 96
Corn | 97.2 | 94.2 | 82.5 | 69.2 | 84.5 | 93.6 | 96.1
Grass/Pasture | 98.8 | 93.3 | 92.1 | 88.3 | 86.5 | 95 | 98.3
Grass/Trees | 96.6 | 84.2 | 31.5 | 67.2 | 72.5 | 92.9 | 94.2
Grass/Pasture-mowed | 99.1 | 91.3 | 88.4 | 92.8 | 89.2 | 94.4 | 98.4
Hay-windrowed | 99.5 | 51.4 | 13.9 | 67 | 56.5 | 96.2 | 99.4
Oats | 98.1 | 75 | 55.9 | 68.2 | 97.2 | 96.2 | 97.4
Soybeans-notill | 97.4 | 83.5 | 88.2 | 66.1 | 98.1 | 96.2 | 95.3
Soybeans-min | 97.7 | 84.6 | 55.4 | 81.2 | 93.5 | 95.5 | 96.8
Soybeans-clean | 96.4 | 94.2 | 91.2 | 72.1 | 99.6 | 93.8 | 93.6
Wheat | 99.1 | 96.1 | 97.4 | 91.2 | 92.4 | 94.5 | 97.4
Woods | 97.2 | 67.4 | 49.2 | 92.3 | 96.2 | 95.2 | 95.5
Bldg-Grass-Tree-Drives | 97.6 | 81.2 | 88.4 | 82.2 | 90.3 | 93.2 | 96.5
Stone-steel towers | 96.2 | 82.2 | 82.3 | 93.3 | 92.1 | 95.3 | 95.4
Background | 94.6 | 67.2 | 33.4 | 65.7 | 88.1 | 90.3 | 92.4
Sensitivity (%) | 98.55 | 81.98 | 66.23 | 79.42 | 88.71 | 94.84 | 96.11
Specificity (%) | 97.69 | 81.47 | 65.86 | 79.18 | 88.47 | 94.32 | 95.54
OA (%) | 98.65 | 81.7 | 67.57 | 78.66 | 89.5 | 96.3 | 97.59
AA (%) | 97.69 | 80.87 | 65.27 | 78.48 | 87.76 | 94.84 | 96.11
Kappa | 0.85 | 0.69 | 0.5 | 0.78 | 0.82 | 0.83 | 0.81
Time (s) | 305.2 | 44.8 | 65.2 | 110.5 | 117.4 | 245.3 | 217.5
TABLE III: Accuracy evaluation for the classification of the IP dataset.

Fig. 4 illustrates the detailed comparison of the sensitivity and specificity for each vegetation category in the IP dataset. It shows that the sensitivity and specificity of the deep learning-based classifiers are higher than those of the two traditional machine learning-based classifiers, SVM and RF. Moreover, among all 17 categories, the proposed approach achieves the highest sensitivity and specificity in 16 classes (all except soybeans-clean), which indicates that it produces fewer omission and misclassification errors than its competitors.

Fig. 4: Multi-class comparison between the different vegetation categories in IP dataset. (a) the sensitivity of the proposed approach and the competitors for each vegetation class. (b) the specificity of the proposed approach and the competitors for each vegetation class

Fig. 5 shows the classification maps corresponding to the accuracy evaluation in Table III. Because only the spectral signature of each pixel is considered in the traditional machine learning models, there is noticeable salt-and-pepper noise in the classification maps produced by SVM and RF (Fig. 5c and 5d). On the other hand, as a typical neural network model, the 2D-CNN generally introduces some misclassification at class boundaries (Fig. 5e). The main reason is a typical limitation of the 2D-CNN: only spatial information is involved in the convolution process of the model, which makes the classification more sensitive to the spatial scale of the pre-defined window size. The spectral-spatial classifiers (i.e. 3D-CNN, SSRN, CapNet, and the proposed approach) demonstrate higher accuracy and class consistency in ground surface classification than SVM, RF, and 2D-CNN, where only the spectral or the spatial signal is considered. Although the classification maps in Fig. 5f-i are similar, the map produced by our proposed approach shows fewer misclassified pixels and clearer class edges and delineation than those of 3D-CNN, SSRN, and CapNet. In addition, if we compare the labelled and unlabelled (not covered in Fig. 5b) areas, there are fewer potential outliers in the resultant map of the proposed model. This indicates that the proposed model provides more consistent results in the task of ground surface classification than the other methods.

Fig. 5: The comparison of the classification maps of Indian Pines dataset. (a) the false colour composition map of the raw data. (b) Ground-truth data used in the training and evaluation of the models. (c-i) the classification result of SVM, RF, 2D-CNN, 3D-CNN, SSRN, CapNet, and the proposed model, respectively.

V-C2 Case Study 2: the crop stress detection using the WYR dataset

The second experiment is used to evaluate the performance of the proposed method on crop disease detection using the WYR dataset.

Considering that SVM, RF, and 2D-CNN can only handle one type of information (spectral or spatial), only the spectral-spatial classifiers (i.e. 3D-CNN, SSRN, and CapNet) were compared against the proposed approach. In this study, labelled data are randomly selected as the training samples and the rest are used as the testing samples. Table IV shows the overall accuracy (OA) of 3DCNN, SSRN, CapNet, and the proposed approach using five different window sizes, , , , , and . The results demonstrate that the proposed approach outperforms the other three models in most parameter configurations (except for the window size of ). More specifically, the proposed approach provides an OA improvement up to for kernel, for kernel, for kernel, and for kernel, respectively. For the window size of , the OA of the proposed approach is slightly lower () than that of CapNet. One explanation for this exception may lie in the fact that the proposed model sequentially learns the heterogeneities in the spectral and spatial dimensions. An input patch with a window size would extend beyond the width of a wheat leaf at the spatial resolution of 0.02 m, and may therefore introduce mixed patterns into the neurons during the learning process.
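To make the role of the window size concrete, the sketch below extracts a square spectral-spatial patch centred on a labelled pixel of a hyperspectral cube. The function name, the (rows, columns, bands) layout, the reflect padding at the image border, and the toy cube are assumptions for illustration; the padding strategy and window sizes used in the study are not specified here.

```python
import numpy as np

def extract_patch(cube, row, col, window):
    """Return the (window, window, bands) patch centred on pixel (row, col).

    cube is assumed to have shape (rows, cols, bands). Border pixels are
    handled by reflect padding, which is an illustrative choice.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that the patch has a centre pixel")
    half = window // 2
    padded = np.pad(cube, ((half, half), (half, half), (0, 0)), mode="reflect")
    # After padding, original pixel (row, col) sits at (row + half, col + half),
    # so the slice below is centred on it.
    return padded[row:row + window, col:col + window, :]

# Toy cube (made up): 6 x 6 pixels with 4 bands.
demo_cube = np.arange(6 * 6 * 4, dtype=float).reshape(6, 6, 4)
demo_patch = extract_patch(demo_cube, row=2, col=3, window=3)
demo_corner = extract_patch(demo_cube, row=0, col=0, window=3)
print(demo_patch.shape, demo_corner.shape)
```

A larger `window` gives the network more spatial context per sample, but at a fine ground resolution it can also span several leaves or background pixels, which is the mixing effect discussed above.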

In addition, Table V provides a detailed comparison of the confusion matrix of 3DCNN, SSRN, CapNet, and the proposed approach. Firstly, when the input patch is in the range from to , a larger window size generally leads to a higher classification accuracy. One possible explanation is that the larger the input size is, the more spectral-spatial pattern information is learned by the models. Secondly, compared with the competitors, the proposed approach requires a smaller window size to achieve a similar classification performance, or achieves a higher accuracy with the same window size. For instance, the average accuracy of the proposed approach with a window size of reaches , higher than the second highest accuracy (i.e. ) produced by CapNet with the same window size. It is similar to the accuracy value of achieved by CapNet with the window size of . A possible reason is that the featured tensor composed by the capsule units reinforces the generalization capability of the spectral-spatial information in characterizing the target items, and represents the heterogeneity and rotation invariance of each labelled class. Such findings suggest that the proposed model is not only able to describe more interpretable features in the learning process, but also able to increase the efficiency and accuracy of the classification.

Spatial size | 3DCNN | SSRN | CapNet | Proposed
 | 76.72 | 83.66 | 87.72 | 90.13
 | 81.12 | 84.01 | 88.11 | 90.75
 | 80.41 | 85.01 | 88.08 | 90.82
 | 80.44 | 84.49 | 88.49 | 91.01
 | 81.01 | 84.56 | 90.5 | 90.49
TABLE IV: The overall accuracy (OA, %) of the 3DCNN, SSRN, CapNet, and proposed approach using different window sizes of the input patch for crop stress detection based on the WYR dataset.
Healthy wheat (%) | 79.52 | 81.41 | 80.64 | 80.66 | 81.53
Yellow rust (%) | 76.31 | 78.74 | 78.31 | 78.58 | 78.31
Soil (%) | 83.61 | 83.7 | 83.78 | 83.82 | 83.4
Avg. sensitivity (%) | 91.16 | 92.2 | 93 | 92.35 | 91.42
Avg. specificity (%) | 93 | 92.58 | 93.05 | 93.39 | 93.31
AA (%) | 79.81 | 81.28 | 80.91 | 81.02 | 81.08
Kappa | 0.78 | 0.8 | 0.8 | 0.81 | 0.8

Healthy wheat (%) | 83.63 | 84.65 | 85.93 | 84.04 | 84.8
Yellow rust (%) | 80.67 | 81.95 | 82.74 | 81.94 | 81.41
Soil (%) | 86.08 | 86.14 | 86.42 | 86.87 | 86.64
Avg. sensitivity (%) | 81.1 | 83.66 | 82.37 | 82.56 | 82.12
Avg. specificity (%) | 83.2 | 83.8 | 83.42 | 83.43 | 83.3
AA (%) | 83.46 | 84.25 | 85.03 | 84.28 | 84.28
Kappa | 0.82 | 0.83 | 0.83 | 0.84 | 0.84

Healthy wheat (%) | 88.06 | 88.93 | 88.97 | 91.08 | 91.91
Yellow rust (%) | 86.11 | 86.45 | 86.96 | 87.39 | 87.44
Soil (%) | 89.26 | 89.23 | 89.48 | 91.24 | 92.7
Avg. sensitivity (%) | 84.5 | 85.73 | 87.28 | 85.85 | 85.3
Avg. specificity (%) | 85.27 | 85.95 | 87.13 | 86.34 | 86.67
AA (%) | 87.81 | 88.2 | 88.47 | 89.9 | 90.68
Kappa | 0.85 | 0.85 | 0.86 | 0.86 | 0.87

Healthy wheat (%) | 90.11 | 90.96 | 90.89 | 91.42 | 91.16
Yellow rust (%) | 88.6 | 88.61 | 89.45 | 89.68 | 88.8
Soil (%) | 91.87 | 91.77 | 91.77 | 91.97 | 91.64
Avg. sensitivity (%) | 89 | 90.1 | 89.39 | 90.72 | 92.44
Avg. specificity (%) | 9.44 | 90.44 | 30.68 | 92.56 | 93.7
AA (%) | 90.19 | 90.45 | 90.7 | 91.02 | 90.53
Kappa | 0.88 | 0.89 | 0.91 | 0.94 | 0.92
TABLE V: The detailed performance comparison of the 3DCNN, SSRN, CapNet, and proposed approach using different window sizes of the input patch for crop stress detection based on the WYR dataset.

Fig. 6 illustrates the detailed comparison of the sensitivity and specificity between the proposed approach and its competitors with various window sizes for each class in the WYR dataset. Similar to the accuracy assessment results, the proposed approach achieves better sensitivity and specificity than the other models with window sizes from to , and the highest values occur at the window size of . This suggests that the proposed approach may achieve the best classification for the WYR dataset with the window size of .

Fig. 6: The comparison of the sensitivity and specificity of the proposed approach and the spectral-spatial classifiers with the window size from to ()

For the purpose of demonstration, Fig. 7 shows the classification maps of yellow rust from a winter wheat field plot produced by the 3DCNN, SSRN, CapNet, and the proposed approach with a window size of . The comparison of these four maps illustrates that the proposed approach outperforms the competitors in the class delineation and distribution of yellow rust. Specifically, the class boundaries of yellow rust pixels obtained by the proposed approach are clearer and more precise, and this boundary characterization is consistent with the typical yellow rust pathogen features observed at the canopy scale. In addition, the yellow rust class contains stripe features, which are better delineated in the maps obtained from the proposed approach. Moreover, the classification results over unlabelled areas show a noticeable consistency with better pathological distribution features, which also suggests that the proposed approach provides a better generalization performance on the detection of wheat yellow rust than its competitors.

Fig. 7: The comparison of the classification maps of wheat yellow rust dataset. (a) the false colour composition map of the raw data. (b) Ground-truth data used in the training and evaluation of the models. () the classification result of 3D-CNN, SSRN, CapNet, and the proposed model, respectively

Fig. 8 shows the convergence of the proposed network architecture and its competitors with a window size of and an epoch number of 700. The results demonstrate that the proposed approach provides a stable accuracy in both the training and testing processes (i.e. the average accuracies reached and , respectively). Meanwhile, the accuracy of the other algorithms (3DCNN, SSRN, and CapNet) tends to decline between the training and testing processes. For instance, the average training accuracy of the 3DCNN reached , but its testing accuracy only reaches . A possible reason for this decline is the "black-box" learning process in the intermediate layers, which can lead the traditional network architectures to a local optimum. This occurs when the scale of sampling is not big enough to cover all of the possible states of the target classes. In this case, benefiting from the physical mechanism and interpretability discussed in previous sections, the learning process of the proposed method is able to represent the biophysical and biochemical variations and the spatial structure characteristics that distinguish healthy wheat from wheat infected with yellow rust. This may explain why the proposed approach provides better classification accuracy and robustness than the other models.

Regarding the convergence, it is noteworthy that the training accuracy of the proposed approach (the blue line in Fig. 8a) follows an "S-shaped" curve, which can be separated into four parts. At the beginning, the rate of convergence is slow during the first epochs. Then, this rate increases dramatically from to epochs. After this point, the training accuracy fluctuates between and from the to epochs. Finally, the accuracy stabilizes at around . This tendency may be associated with the two-stage network architecture of the proposed model: the training and extraction of sensitive spectral features in the first stage produce a chain reaction in the learning process of the second stage and in the final accuracy. Similarly, although the convergence rate of the proposed approach in the testing process is faster than that in the training process, it is still slower than that of the other three methods.

Fig. 8: Evolution of the (a) training and (b) testing accuracy (in %) of 3DCNN, SSRN, CapNet, and the proposed approach with a window size of based on the WYR dataset

V-D Interpretability analysis

The interpretability of the model is one of the most important contributions of our work. We evaluate it from three perspectives: 1) pre-model interpretability, 2) post hoc analysis, and 3) in-model interpretability, using the methods introduced in Section 4.2.

V-D1 Pre-model interpretability

The Shannon entropy and the Dunn index are used to evaluate the effect of the spectral segmentation layer on the uncertainty and clustering of the features in the main model (see Fig. 9). The Shannon entropy of each class is calculated by averaging all feature vectors within it, and the entropy values presented in Fig. 9 are the average over all classes.

Fig. 9 indicates that, in comparison with its competitors, our proposed approach achieves a lower Shannon entropy (i.e. intra-class disorder) and a higher Dunn index (i.e. inter-class clustering) in both vegetation classification (Fig. 9a-b) and crop disease detection (Fig. 9c-d). The rationale is that the spectral segmentation layer provides a physical constraint on the raw HSI data, so that the learning process of the main model operates within band ranges that have explicit spectral-biological attributes. Therefore, the intermediate features produced by the proposed approach may provide better representations of the intrinsic inter-class differences and a lower statistically derived learning error than the other models.

Fig. 9: Shannon entropy and Dunn index for the IP dataset (a-b) and the WYR dataset (c-d). The entropy is calculated for each class and then averaged over all classes. The error bars represent the standard deviation across the classes
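The two measures can be sketched as follows. The entropy follows the description above (feature vectors are averaged within each class, then the per-class entropies are averaged), and the Dunn index uses its classic definition (smallest distance between points of different classes divided by the largest within-class diameter). Converting a mean feature vector into a probability distribution by normalizing its absolute values, the Euclidean distance, and all function names are assumptions for illustration rather than the authors' exact procedure.

```python
import numpy as np

def shannon_entropy(vector):
    """Shannon entropy (in bits) of a vector normalized to sum to one."""
    p = np.abs(np.asarray(vector, dtype=float))
    total = p.sum()
    if total == 0:
        return 0.0
    p = p / total
    p = p[p > 0]                      # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

def mean_class_entropy(features, labels):
    """Entropy of each class's mean feature vector, averaged over classes."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    return float(np.mean([
        shannon_entropy(features[labels == c].mean(axis=0))
        for c in np.unique(labels)
    ]))

def pairwise_distances(a, b):
    """Euclidean distances between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dunn_index(features, labels):
    """Minimum inter-class distance divided by maximum intra-class diameter."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    groups = [features[labels == c] for c in np.unique(labels)]
    max_diameter = max(pairwise_distances(g, g).max() for g in groups)
    min_separation = min(
        pairwise_distances(groups[i], groups[j]).min()
        for i in range(len(groups))
        for j in range(i + 1, len(groups))
    )
    return float(min_separation / max_diameter)

# Toy features (made up): two compact, well-separated classes.
demo_features = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
demo_labels = np.array([0, 0, 1, 1])
demo_dunn = dunn_index(demo_features, demo_labels)
demo_entropy = mean_class_entropy(demo_features, demo_labels)
print(demo_dunn, round(demo_entropy, 3))
```

The pairwise-distance computation is quadratic in the number of samples, so for full HSI scenes a random subsample per class would be needed in practice.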

V-D2 Post-model (post hoc) analysis

In this work, the post hoc analysis mainly focuses on the first learning stage of the proposed network. Its aim is to explore the biological correlation between the spectral features produced by the spectral enhancement layers and the auxiliary ground datasets. For Case Study 1, Fig. 10a shows the correlation between the spectral segmentation-based features produced by the spectral enhancement layers of Stage 1 and the pre-selected vegetation indices, which reveals the potential spectral-derived biological properties of the enhanced spectral features. For instance, the red-edge and near-infrared associated features reveal the highest coefficient of determination ($R^2$) with the canopy structure associated SVIs, such as NDVI, PRI, and CIred-edge. Such correlations not only indicate the statistical representations of the generated spectral features, but also represent the subtle reflectance differences between the different plant categories in the Indian Pines dataset. Similarly, Fig. 10b shows the correlation graph between the enhanced spectral features and the ground measured auxiliary parameters for the WYR dataset based disease detection task, which reveals the potential biophysical and biochemical properties hidden in the enhanced spectral features. For instance, the red and red-edge associated features reveal the highest coefficient of determination ($R^2$) with the ground measured LAI, while the green and red associated features exhibit a higher sensitivity to the ground measured CHL. These findings from the two case studies based on the IP and WYR datasets indicate that the intermediate features can characterise not only the statistical properties of the corresponding classes, but also their biophysical and biochemical attributes, which provides interpretability of the biological differences between the target classes.

Fig. 10: The visualization of the correlation between the ground measured parameters and the features extracted from the spectral enhancement layers for (a) the IP dataset and (b) the WYR dataset

V-D3 In-model interpretability

The main model splits the learning process into two stages: the spectral significance enhancement and the spectral-spatial hierarchical construction. These two stages sequentially explore the spectral significance by extracting the biologically associated spectral signatures from the HSI data, and represent the spectral-spatial hierarchical structure of the target class by encapsulating the extracted spectral-spatial information into capsule features. In addition, such a two-stage learning architecture makes it easier to observe and explain the evolution of the features in different layers. Because the biological interpretability of the outputs of Stage 1 (i.e. the enhanced spectral feature sets) has been shown in Section 5.4.2, here we only discuss whether biological interpretability of the high-level capsule features can be achieved by the intrinsic architecture of the model. Specifically, Fig. 11 visualizes the weights of the convolutional kernels in the conv3 layer, together with the output feature maps and feature capsules from the capsule layers, for both the IP and WYR datasets.

For Case Study 1, Fig. 11a visualises the weights of the conv3 layer, which provides a direct way to understand the evolution of the intermediate features. It is noteworthy that, after training, the weights of the convolutional kernels for the red, red-edge, and near-infrared associated features are higher than those for the other features, which means that the spectral features from the red, red-edge, and near-infrared segmentations are more sensitive to the HSI texture signatures (i.e. the canopy structure characteristics in the IP dataset). This finding is also in agreement with previous studies [RN55, RN61]. The final feature maps and the capsulized feature vectors from the capsule layers are shown in Fig. 11b, where the capsule layer manages the intermediate scalar features throughout the network and also calculates the corresponding instantiation parameters to represent the hierarchical structure and potential transformations of the target classes. This helps to better characterise the rotation invariance of the spectral and spatial features of each class. The length of each feature vector is used to estimate the probability that a specific spectral-spatial feature occurs in each class, and the final classification is determined by the maximum length.

For Case Study 2, the visualization of the weights of the conv3 layer (see Fig. 11c) also provides a direct way to understand the contribution of each intermediate feature in the spatial dimension. It is noteworthy that, in a convolution kernel, the weights of the neighbouring pixels for the features sensitive to the biophysical parameters (e.g. LAI, PDM) are generally higher than those for the features sensitive to biochemical parameters (e.g. CHL, ANTH). This indicates that, compared with the biochemical parameters, the biophysical parameters are better represented by the texture information and spatial patterns in the detection and classification of yellow rust. In other words, the proposed approach provides better capability in characterizing the visible symptoms (e.g. leaf rolling, withering) that appear when wheat is infected by yellow rust. The feature maps from the class-capsules layer are shown in Fig. 11d. The spectral-spatial features are integrated into three feature vectors, the length of each is used to estimate the probability that a specific biophysical and biochemical feature occurs in each class, and the final classification is determined by the maximum length.

Fig. 11: The visualization of the weights of the convolutional kernels in conv3 layer and the outputted feature maps and feature capsules of the capsule layers for Case Study 1: vegetation classification of IP datasets, (a-b) and Case Study 2: disease detection of WYR dataset, (c-d).
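The length-based decision rule described for the class capsules can be sketched as follows. The `squash` non-linearity is the one from the original capsule network formulation, which maps each capsule vector to a length between 0 and 1 while preserving its direction; whether the proposed model uses exactly this form, along with the function names and the toy capsule values, is an assumption for illustration.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Capsule squashing: keeps the direction of s, maps its length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    return (sq_norm / (1.0 + sq_norm)) * s / np.sqrt(sq_norm + eps)

def predict_from_capsules(class_capsules):
    """Return the capsule lengths and the index of the longest capsule.

    class_capsules has shape (n_classes, capsule_dim); the length of each
    vector is read as the probability that the class is present.
    """
    lengths = np.linalg.norm(class_capsules, axis=-1)
    return lengths, int(np.argmax(lengths))

# Toy raw capsule outputs (made up) for three classes, e.g. the three
# crop status classes of the WYR dataset.
demo_raw = np.array([
    [0.1, 0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])
demo_lengths, demo_pred = predict_from_capsules(squash(demo_raw))
print(np.round(demo_lengths, 3), demo_pred)
```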

VI Conclusion

In this study, a new deep learning architecture based on two-stage spectral-spatial feature learning is presented to achieve effective classification with biologically interpretable learning from HSI data. Specifically, the proposed network first splits the input HSI data into 7 spectral segmentations, from which the most valuable spectral features sensitive to the reflectance and radiation properties of the target classes are extracted by the 1D-CNN based layers. Subsequently, a set of enhanced features with explicit biophysical and biochemical properties is generated by a feature enhancement layer to characterize the biologically associated properties of each class. Finally, a series of spectral-spatial capsule units is employed to output the feature vectors that represent the enhanced feature set as a collection of canonical spectral-spatial patterns and the specific individual instantiation parameters at a higher level. Through this network, the intermediate features uncover more biological and structural patterns of the target ground objects, which subsequently leads to increased interpretability, a reduction of the computing complexity, and therefore a more accurate model convergence. The comparison with the state-of-the-art models for HSI data classification reveals that the proposed BIT-DNN exhibits a competitive performance in classification accuracy.

The most important contribution of the proposed approach lies in its interpretable learning process. It can uncover the potential biological and structural patterns of the target items from the inherent spectral-spatial complexity of the HSI data, and the potential transformation and rotation by means of a neural hierarchy structure, which disentangles such biological and structural associated features from the instantiation parameters. Therefore, the high-level feature capsules would be activated by the prediction agreement of the lower-level features, and it intrinsically builds the connections to better express the rotational invariance of the combination of the biological and structure features, and further achieves consistently high classification accuracy on both the tasks of plant categories classification and crop disease detection.


This research is supported by BBSRC (BB/R019983/1), BBSRC (BB/S020969/1). The work is also supported by Newton Fund Institutional Links grant, ID 332438911, under the Newton-Ungku Omar Fund partnership (the grant is funded by the UK Department of Business, Energy, and Industrial Strategy (BEIS) and the Malaysian Industry-Government Group for High Technology and delivered by the British Council. For further information, please visit www.newtonfund.ac.uk.)