QRNN3D
3D Quasi-Recurrent Neural Network for Hyperspectral Image Denoising (TNNLS 2020)
In this paper, we propose an alternating directional 3D quasi-recurrent neural network for hyperspectral image (HSI) denoising, which can effectively embed the domain knowledge: structural spatio-spectral correlation and global correlation along spectrum. Specifically, 3D convolution is utilized to extract structural spatio-spectral correlation in an HSI, while a quasi-recurrent pooling function is employed to capture the global correlation along spectrum. Moreover, an alternating directional structure is introduced to eliminate the causal dependency with no additional computation cost. The proposed model is capable of modeling spatio-spectral dependency while preserving the flexibility towards HSIs with an arbitrary number of bands. Extensive experiments on HSI denoising demonstrate significant improvement over state-of-the-art methods under various noise settings, in terms of both restoration accuracy and computation time. Our code is available at https://github.com/Vandermode/QRNN3D.
A hyperspectral image (HSI) is made up of a large number of discrete wavebands for each spatial position of a real scene and provides much richer information about scenes than RGB images, which has led to numerous applications in remote sensing [27, 34], classification [6, 38, 31, 2, 45], tracking [37, 36], and more. However, due to the limited light for each band, traditional HSIs are often degraded by various noises (e.g., Gaussian, stripe, deadline, and impulse noises) during the acquisition process. These degradations negatively influence the performance of all the subsequent HSI processing tasks mentioned above. Therefore, HSI denoising is an essential pre-processing step in the typical workflow of HSI analysis and processing.
Recently, more HSI denoising works have paid attention to the domain knowledge of the HSI: structural spatio-spectral correlation and global correlation along spectrum (GCS) [42]. Top-performing classical methods [9, 42, 8, 39, 41] typically utilize non-local low-rank tensors to model them. Although these methods achieve higher accuracy by effectively considering these underlying characteristics, their performance is inherently determined by how well the handcrafted prior (e.g. low-rank tensors) matches the intrinsic characteristics of an HSI. Besides, such approaches generally formulate HSI denoising as a complex optimization problem to be solved iteratively, making the denoising process time-consuming.
Alternative learning-based approaches rely on convolutional neural networks in lieu of the costly optimization and handcrafted priors [7, 46]. Promising results notwithstanding, these approaches model the HSI by learned multichannel or band-wise 2D convolutions, which sacrifice either the flexibility with respect to the spectral dimension [7] (hence requiring the network to be retrained to adapt to HSIs with a mismatched spectral dimension), or the model capability to extract GCS knowledge [46] (thus leading to relatively low performance as shown in Figure 1).
In principle, the trade-off between model capability and flexibility imposes a fundamental limit for real-world applications. In this paper, we find that combining domain knowledge with 3D deep learning (DL) can achieve both goals simultaneously. Unlike prior DL approaches [7, 46] that always utilize the 2D convolution as the basic building block of the network, we introduce a novel building block, namely the 3D quasi-recurrent unit (QRU3D), to model the HSI from a 3D perspective. This unit contains a 3D convolutional subcomponent and a quasi-recurrent pooling function [5], enabling structural spatio-spectral correlation and GCS modeling respectively. The 3D convolutional subcomponent can extract spatio-spectral features from multiple adjacent bands, while the quasi-recurrent pooling recurrently merges these features over the whole spectrum, controlled by a dynamic gating mechanism. This mechanism lets the pooling weights be dynamically calculated from the input features, thereby allowing for adaptive modeling of the GCS knowledge. To eliminate the unidirectional causal dependency (Figure 4) introduced by the vanilla recurrent structure, we furthermore propose an alternating directional structure with no additional computation cost.
Our network, called 3D quasi-recurrent neural network (QRNN3D), has been designed to make full use of the domain knowledge, especially the GCS. It makes significant improvements in model capability/accuracy while remaining agnostic to the spectral dimension of input HSIs, and thus can be applied to any HSI captured by unknown sensors (with different spectral resolutions). In extensive experiments, QRNN3D outperforms all leading-edge methods on several benchmark datasets under various noise settings, as shown in Figure 1.
Our main contributions are summarized as follows.
We present a novel building block, namely QRU3D, that can effectively exploit the domain knowledge, i.e. structural spatio-spectral correlation and global correlation along spectrum (GCS), simultaneously.
We introduce an alternating directional structure to eliminate the unreasonable causal dependency in HSI modeling, with no additional computation cost.
We demonstrate that our model pretrained on the ICVL dataset can be directly utilized to tackle remotely sensed imagery, which is infeasible for conventional 2D DL approaches to HSI modeling.
The remainder of this paper is organized as follows. In Section II, we review related HSI denoising methods and DL approaches that inspire our work. Section III introduces the QRNN3D approach for HSI denoising. Extensive experimental results on an HSI database of natural scenes and on remotely sensed images are presented in Section IV, followed by further discussions that facilitate the understanding of QRNN3D in Section V. Conclusions are drawn in Section VI.
Existing methods towards HSI denoising can be roughly classified into two categories depending on the noise model.
The most frequently used noise model is zero-mean, white, homogeneous additive Gaussian noise. Under this assumption, BM4D [28], an extension of the BM3D filter [13] to volumetric data, can be directly applied to HSI denoising. By considering the GCS and non-local self-similarity in HSIs simultaneously, Peng et al. proposed a tensor dictionary learning (TDL) model [30] which achieved very promising performance. Following this line, more sophisticated methods have been successively proposed [14, 16, 50, 9, 42, 8, 41, 19]. Among these methods, the low-rank tensor based models, e.g. ITSReg [42] and LLRT [9], and a new iterative projection and denoising algorithm, NG-meet [19], achieve state-of-the-art performance, owing to their elaborate efforts on modeling the intrinsic properties of the HSI.
Besides, several works [48, 20, 43, 11, 39] aim to resolve the realistic complex noise by modeling the noise with complicated non-i.i.d. statistical structures. They all frame the denoising problem into a low-rank based optimization scheme, and then utilize some constraints (e.g. total variation, and nuclear norm) to remove the complex noise (e.g. non-i.i.d. Gaussian, stripe, deadline, impulse).
Recently, leveraging the power of DL, Chang et al. [7] extended the 2D image denoising architecture DnCNN [49] to remove various noise in HSIs. They argued that the learned filters can extract the structural spatial information well. Yuan et al. [46] utilized a deep residual network to recover remotely sensed images under Gaussian noise, which processed the HSI with a sliding window strategy. Concurrently with our work, Dong et al. [15] proposed a 3D factorizable U-net architecture to exploit spatial-spectral correlations in HSIs from the 3D perspective. All these DL-based methods insufficiently exploit the GCS knowledge, and they cannot adjust the learned parameters to adaptively fit the input data, consequently lacking the freedom to discriminate the input-dependent spatio-spectral correlations.
In this paper, we leverage the power of DL to automatically learn the mapping purely from the data instead of relying on handcrafted priors and complex optimization, reaching orders-of-magnitude speedup in both Gaussian and complex noise contexts. Besides, our DL-based method can effectively exploit the underlying characteristics, i.e. structural spatio-spectral correlation and GCS, without sacrificing the flexibility towards HSIs with an arbitrary number of bands.
Research on gray/RGB image denoising has been dominated by discriminative learning based approaches, especially the deep convolutional neural network (CNN), in recent years [49, 29, 33, 10, 52, 51]. Zhang et al. [49] proposed a modern deep architecture, namely DnCNN, by embedding batch normalization [23] and residual learning [18]. Meanwhile, Mao et al. [29] presented a very deep fully convolutional encoding-decoding framework for image restoration tasks such as denoising and super-resolution. Both of them yielded better Gaussian denoising results and less computation time than the highly-engineered benchmark BM3D [13]. Along this line, more works have been proposed to explore deep architecture design for image denoising. For example, MemNet [33] introduces a memory block to exploit long-term information. The residual dense network [52] goes beyond that by building dense connections inside blocks. The residual non-local attention network [51] utilizes local and non-local attention blocks to extract features that capture the long-range dependencies between pixels and pay more attention to the challenging parts.
Although all these networks can be directly extended to the HSI case, none of them specifically considers the domain knowledge of the HSI.
Modeling image sequences of various lengths is a fundamental problem in a variety of research fields such as precipitation nowcasting, video processing, and so on. Bidirectional recurrent convolutional networks (BRCN) [22] and convolutional LSTM (ConvLSTM) [44] were proposed for resolving the multi-frame super-resolution and precipitation nowcasting problems respectively. The key insight of these models is to replace the commonly used recurrent full connections with weight-sharing convolutional connections, such that they can greatly reduce the number of network parameters and model the temporal dependency at a finer level (i.e. patch-based rather than frame-based). However, these patch-based operations cannot efficiently capture the spectral correlation, and recurrently applying convolution along the spectrum would drastically increase the computational complexity. In contrast, our QRNN3D employs an elementwise recurrent mechanism, enabling good scaling to HSIs with a large number of bands. Besides, this mechanism naturally imposes a prior constraint over the spectrum, making it well-suited for extracting GCS knowledge.
Layer | $C_{out}$ | Stride | Output size |
---|---|---|---|
Extractor | 16 | | |
Encoder | 16 | | |
 | 32 | | |
 | 32 | | |
 | 64 | | |
 | 64 | | |
Decoder | 64 | | |
 | 32 | | |
 | 32 | | |
 | 16 | | |
 | 16 | | |
Reconstructor | 1 | | |
An HSI degraded by additive noise can be linearly modeled as
$$\mathbf{Y} = \mathbf{X} + \boldsymbol{\epsilon}, \tag{1}$$
where $\{\mathbf{Y}, \mathbf{X}, \boldsymbol{\epsilon}\} \in \mathbb{R}^{H \times W \times B}$, $\mathbf{Y}$ is the observed noisy image, $\mathbf{X}$ is the original clean image, and $\boldsymbol{\epsilon}$ denotes the additive random noise. $H$, $W$, $B$ indicate the spatial height, spatial width, and number of spectral bands respectively.
Here, we consider miscellaneous noise removal in the denoising context, where $\boldsymbol{\epsilon}$ can represent different types of random noise including Gaussian noise, sparse noise (stripe, deadline and impulse) or a mixture of them. Given a noisy HSI, our goal is to obtain its noise-free counterpart.
In this section, we introduce the residual encoder-decoder QRNN3D for HSI denoising. As shown in Figure 2, our network consists of six pairs of symmetric QRU3D with convolution and deconvolution for the encoder and decoder respectively, leading to twelve layers in total. We use two layers with stride-2 convolution to downsample the input in the encoder part, and then two layers with stride 1/2 to upsample in the decoder part. The benefits of the downsampling and upsampling operations are that we can use a larger network under the same computational cost, and increase the receptive field size to make use of the context information in a larger image region. Table I illustrates our network configuration. Each layer contains a QRU3D whose kernel size is set empirically to maximize performance [35]. The stride and output channels ($C_{out}$) of each layer are listed, and other configuration (e.g. padding) can be inferred implicitly.
In the following, we first present the QRU3D, which is the core building block in our method. Then, alternating directional structure used to eliminate the unreasonable causal dependency is introduced, and learning details are provided.
QRU3D is the basic building block of QRNN3D. It consists of two subcomponents, i.e. a 3D convolutional subcomponent and quasi-recurrent pooling, as shown in Figure 3. Unlike the 2D convolution, neither subcomponent fixes the number of spectral bands, leaving the QRNN3D free to process HSIs with an arbitrary number of bands.
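The 3D convolutional subcomponent of QRU3D performs two sets of 3D convolutions [24, 35] with separate filter banks, producing two tensors that are passed through different activation functions,
$$\mathbf{Z} = \tanh(\mathbf{W}_z * \mathbf{I}), \qquad \mathbf{F} = \sigma(\mathbf{W}_f * \mathbf{I}), \tag{2}$$
where $\mathbf{I} \in \mathbb{R}^{C_{in} \times B \times H \times W}$ is the input feature maps coming from the last layer (in the first layer, the input is the noisy HSI with $C_{in} = 1$); $\mathbf{Z} \in \mathbb{R}^{C_{out} \times B \times H \times W}$ is a high dimensional candidate tensor. $\mathbf{F}$ has the same dimension as $\mathbf{Z}$, representing the neural forget gate that controls the behavior of dynamic memorization. Both $\mathbf{W}_z$ and $\mathbf{W}_f$ are 3D convolutional filter banks, $*$ denotes a 3D convolution, and $\sigma$ indicates a sigmoid non-linearity.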
The 3D convolution is achieved by convolving a 3D kernel with a whole HSI in both the spatial and spectral dimensions. The 3D convolution in the spatial domain can mimic numerous operations widely used in low-level vision (like image patch extraction and the 2D patch transform in BM3D [13, 26]), and the 3D convolution in the spectral domain can model the local spectrum continuity to alleviate spectral distortion. Consequently, the embedded 3D convolution (C3D) can effectively exploit the structural spatio-spectral correlation in HSIs.
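As an illustration of the 3D convolutional subcomponent, the following NumPy sketch computes a candidate tensor and a forget gate from a single-channel volume. It is a minimal sketch under stated assumptions, not the released implementation: the function names, the random 3×3×3 filters, the zero padding, and the single input/output channel are all hypothetical choices, and the tanh on the candidate follows the usual quasi-recurrent formulation.

```python
import numpy as np

def conv3d_same(volume, kernel):
    """Naive single-channel 3D cross-correlation with zero padding.

    volume: (B, H, W) array; kernel: odd-sized (kb, kh, kw) array.
    The output has the same shape as the input.
    """
    kb, kh, kw = kernel.shape
    padded = np.pad(volume, ((kb // 2,) * 2, (kh // 2,) * 2, (kw // 2,) * 2))
    bands, height, width = volume.shape
    out = np.zeros(volume.shape, dtype=float)
    # Accumulate one shifted copy of the padded input per kernel tap.
    for i in range(kb):
        for j in range(kh):
            for k in range(kw):
                out += kernel[i, j, k] * padded[i:i + bands,
                                                j:j + height,
                                                k:k + width]
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv_subcomponent(inputs, w_z, w_f):
    """Return the candidate tensor Z and forget gate F for a (B, H, W) input."""
    z = np.tanh(conv3d_same(inputs, w_z))
    f = sigmoid(conv3d_same(inputs, w_f))
    return z, f

rng = np.random.default_rng(0)
w_z = rng.normal(scale=0.1, size=(3, 3, 3))  # hypothetical filters, not learned
w_f = rng.normal(scale=0.1, size=(3, 3, 3))

# The same filters apply to volumes with different numbers of bands.
for bands in (31, 103):
    z, f = conv_subcomponent(rng.random((bands, 8, 8)), w_z, w_f)
    print(bands, z.shape, f.shape)
```

Because the kernel slides along the spectral axis instead of spanning it, nothing in the filters depends on the number of bands, which is the flexibility the text refers to.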
Although the 3D convolutional subcomponent has already exploited the inter-band relationship, it is computed in a local way and cannot explicitly exploit GCS. To effectively utilize the GCS, we present quasi-recurrent pooling, in which pooling operation and dynamic gating mechanism are introduced.
In our QRU3D, the quasi-recurrent pooling is applied after the candidate tensor $\mathbf{Z}$ and the neural forget gate $\mathbf{F}$ are obtained by the 3D convolutional subcomponent. We first split $\mathbf{Z}$ and $\mathbf{F}$ along the spectrum, generating sequences of $\mathbf{z}_b$ and $\mathbf{f}_b$ respectively, and then feed these states into a quasi-recurrent pooling function [5],
$$\mathbf{h}_b = \mathbf{f}_b \odot \mathbf{h}_{b-1} + (1 - \mathbf{f}_b) \odot \mathbf{z}_b, \quad \forall b \in [1, B], \tag{3}$$
where $\odot$ denotes an element-wise multiplication, $\mathbf{h}_{b-1}$ is the hidden state merged through all previous states, $\mathbf{h}_b$ also represents the $b$-th band in the output of this layer, and $\mathbf{h}_0 = \mathbf{0}$ has all entries equal to zero. The forget gate $\mathbf{f}_b$ balances the weight of the current candidate $\mathbf{z}_b$ and the previous memory, i.e. the hidden state $\mathbf{h}_{b-1}$. Its value depends on the current input instead of being fixed like a convolutional filter, so the pooling adapts to the input image itself and does not rely solely on the parameters learned in the training stage. By this construction, the inter-band information can be accurately merged. Meanwhile, since this dynamic pooling operates recurrently across the whole spectrum, the GCS can be effectively exploited. The output feature maps are produced by concatenating all hidden states along the spectrum.
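The pooling recurrence described above, in which each hidden state is a gate-weighted blend of the previous hidden state and the current candidate, starting from an all-zero state, can be sketched in a few lines of NumPy. The function name and the toy values are ours; this is an illustrative sketch rather than the authors' code.

```python
import numpy as np

def quasi_recurrent_pooling(z, f):
    """Merge candidates along the spectral axis (axis 0).

    z, f: arrays of shape (B, ...), with gate values f in [0, 1].
    Each step computes h_b = f_b * h_{b-1} + (1 - f_b) * z_b, with h_0 = 0.
    """
    h = np.zeros(z.shape, dtype=float)
    prev = np.zeros(z.shape[1:], dtype=float)  # h_0: all entries zero
    for b in range(z.shape[0]):
        # Element-wise blend of previous memory and current candidate.
        prev = f[b] * prev + (1.0 - f[b]) * z[b]
        h[b] = prev
    return h

# Toy example: three bands, one pixel, constant gate of 0.5.
z = np.array([1.0, 2.0, 3.0]).reshape(3, 1, 1)
f = np.full((3, 1, 1), 0.5)
print(quasi_recurrent_pooling(z, f).ravel())
```

With a gate of 0 the output simply copies the candidates, and with a gate of 1 the zero initial state is carried through unchanged; the learned, input-dependent gate lets each voxel choose between these extremes. The loop contains only element-wise operations, which is why it stays cheap even for hundreds of bands.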
In addition, owing to the independent neural gate and the element-wise recurrent operations (multiplication), the QRU3D is highly parallel, enabling good scaling to HSIs with a large number of bands. More specifically, the calculation of the neural forget gate depends only on multiple contiguous bands of the input instead of involving the previous hidden state as in typical RNNs (e.g. LSTM [21] and GRU [12]). Meanwhile, the elementwise multiplication is far more computationally economical than the convolution used by ConvLSTM [44], and thus can easily be applied recurrently hundreds of times.
A forward 3D quasi-recurrent unit, as in Equation (3), reads a candidate tensor in order, starting from the first band $\mathbf{z}_1$ to the last band $\mathbf{z}_B$, so that a hidden state $\mathbf{h}_b$ only depends on the previous candidates $\mathbf{z}_1, \dots, \mathbf{z}_b$ (and their corresponding bands). This introduces a causal dependency, since the computing stream of the hidden state propagates unidirectionally as shown in Figure 4(a), which is not reasonable for the HSI.
A typical solution is to use a bidirectional structure [32, 22, 4], in which a layer of the network contains two sublayers, i.e. a forward QRU3D and a backward QRU3D in our case, as shown in Figure 4(b). The forward QRU3D reads the candidate tensor sequence in order and calculates a sequence of forward hidden states. The backward QRU3D reads the sequence in reverse order, leading to a sequence of backward hidden states. The output of this layer is calculated by adding the forward and backward hidden states elementwise. However, this structure makes the computational burden unacceptable because it nearly doubles the memory consumption.
To ease this issue, we present an alternating directional structure for HSIs. Specifically, a QRNN3D with alternating directional structure changes the direction of the computing stream of the hidden state in each layer, as shown in Figure 4(c). This structure is built by alternately stacking forward and backward QRU3D, in which a forward (or backward) state is merged by a backward (or forward) state in the next layer, such that the global context information can be propagated through the whole spectrum.
Compared with the typical bidirectional solution, our proposed alternating directional structure adds almost no additional computation cost, while keeping the ability to model the dependency on the whole spectrum of an HSI regardless of the position of the output.
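To make the effect of alternating directions concrete, the toy sketch below stacks pooling passes whose reading direction flips at every layer and checks whether the first output band can "see" the last input band. It is deliberately simplified and is an assumption-laden illustration: the 3D convolutions are omitted, each layer pools the previous layer's output directly, and the gates are fixed at 0.5 instead of being computed from the input. The names `pool` and `alternating_stack` are ours.

```python
import numpy as np

def pool(z, f, reverse=False):
    """Quasi-recurrent pooling along axis 0, read forward or backward."""
    bands = z.shape[0]
    order = range(bands - 1, -1, -1) if reverse else range(bands)
    h = np.zeros(z.shape, dtype=float)
    prev = np.zeros(z.shape[1:], dtype=float)
    for b in order:
        prev = f[b] * prev + (1.0 - f[b]) * z[b]
        h[b] = prev
    return h

def alternating_stack(x, gates):
    """Apply one pooling pass per gate array, flipping direction each layer."""
    h = x
    for layer, f in enumerate(gates):
        h = pool(h, f, reverse=(layer % 2 == 1))
    return h

bands = 5
gate = np.full((bands, 1, 1), 0.5)   # fixed gate, for illustration only
x = np.zeros((bands, 1, 1))
x[-1] = 1.0                          # signal only in the last band

one_layer = alternating_stack(x, [gate])[0, 0, 0]
two_layers = alternating_stack(x, [gate, gate])[0, 0, 0]
print(one_layer, two_layers)
```

After a single forward layer the first band is unaffected by the last band (the causal dependency), whereas after a forward layer followed by a backward layer it is not. Each layer still performs one pass, so the cost per layer matches the unidirectional case.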
Stage | 1 | 2 | 3 | |||||
---|---|---|---|---|---|---|---|---|
Noise model | Gaussian noise with known $\sigma$ | Gaussian noise with unknown $\sigma$ | Unknown complex noise | |||||
Epoch | 0-20 | 20-30 | 30-35 | 35-45 | 45-50 | 50-85 | 85-95 | 95-100 |
Learning rate | ||||||||
Batch size | 16 | 64 |
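The staged schedule in the table above can be expressed as a small lookup from epoch to training stage. The sketch below is our reading of the table, not the authors' training script: the stage 2 and stage 3 end points (epochs 50 and 100) come from the text, while the stage 1/2 boundary at epoch 30 and the use of batch size 64 from stage 2 onward are assumptions, and learning rates are omitted because they are not recoverable here.

```python
from bisect import bisect_right

# (first epoch, stage, batch size). The boundary at epoch 30 and the
# batch size of stage 2 are assumptions made for this sketch.
STAGES = [(0, 1, 16), (30, 2, 64), (50, 3, 64)]
STAGE_NAMES = {
    1: "Gaussian noise, known noise level",
    2: "Gaussian noise, unknown noise level",
    3: "unknown complex noise",
}
TOTAL_EPOCHS = 100

def stage_for_epoch(epoch):
    """Return (stage, batch size) for a zero-based epoch index."""
    if not 0 <= epoch < TOTAL_EPOCHS:
        raise ValueError(f"epoch must be in [0, {TOTAL_EPOCHS})")
    starts = [start for start, _, _ in STAGES]
    _, stage, batch_size = STAGES[bisect_right(starts, epoch) - 1]
    return stage, batch_size

for epoch in (0, 30, 50, 99):
    stage, batch_size = stage_for_epoch(epoch)
    print(epoch, stage, STAGE_NAMES[stage], batch_size)
```

In an incremental policy of this kind, the parameters at the end of one stage would initialize the next stage rather than being reset.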
[Figure: visual comparison of denoising results for Cases 1-5. Columns: Noisy, LRMR [48], LRTV [20], NMoG [11], TDTV [39], D-CNN [46], MemNet [33], Ours.]
We conduct several experiments using data from the ICVL hyperspectral dataset [3], where 201 images were collected over 31 spectral bands. The simulated pseudo color image samples from this dataset are illustrated in Figure 5. We use 100 images for training and 5 images for validation, while the others are used for testing. To enlarge the training set, we crop multiple overlapped volumes from the training HSIs and then regard each volume as a training sample. During cropping, each volume keeps all 31 bands in order to preserve the complete spectrum of an HSI. Data augmentation schemes such as rotation and scaling are also employed, resulting in roughly 50k training samples in total. As for the testing set, we crop only the main region of each image given the computation cost (evaluating large images is unwieldy for some competing methods, though not for ours; see Figure 1 for more detail).
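The cropping of overlapped training volumes can be sketched as a sliding window over the spatial axes that keeps every band. The crop size and stride below are hypothetical placeholders chosen only to keep the example small, and `crop_volumes` is our own helper name.

```python
import numpy as np

def crop_volumes(hsi, size, stride):
    """Return overlapping crops of a (B, H, W) image as a (N, B, size, size) array.

    All bands are kept in every crop, so the complete spectrum is preserved.
    Crops overlap whenever stride < size.
    """
    _, height, width = hsi.shape
    crops = [hsi[:, top:top + size, left:left + size]
             for top in range(0, height - size + 1, stride)
             for left in range(0, width - size + 1, stride)]
    return np.stack(crops)

# Hypothetical sizes: a 31-band, 10x10 image cropped into 4x4 windows.
image = np.arange(31 * 10 * 10, dtype=float).reshape(31, 10, 10)
volumes = crop_volumes(image, size=4, stride=2)
print(volumes.shape)
```

Augmentations such as rotation would then be applied to each crop independently.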
Besides, we evaluate the robustness and flexibility of our model on remotely sensed hyperspectral datasets including Pavia Centre, Pavia University, Indian Pines and Urban. Pavia Centre and Pavia University were acquired by the ROSIS sensor; the number of spectral bands is 102 for Pavia Centre and 103 for Pavia University. Indian Pines and Urban were gathered by the 224-band AVIRIS sensor and the 210-band HYDICE hyperspectral system respectively. Both of them have been used for real HSI denoising experiments [20, 39, 9].
Real-world HSIs are usually contaminated by several different types of noise, including the most common Gaussian noise, impulse noise, dead pixels or lines, and stripes [48, 17, 11]. We define five types of complex noise as follows, referred to as Cases 1-5 respectively.
Case 1: Non-i.i.d. Gaussian noise. Entries in all bands are corrupted by zero-mean Gaussian noise with different intensities, randomly selected from 10 to 70.
Case 2: Gaussian + stripe noise. All bands are corrupted by non-i.i.d. Gaussian noise as in Case 1. One third of the bands (10 bands for the ICVL dataset) are randomly chosen to add stripe noise (affecting 5% to 15% of the columns).
Case 3: Gaussian + deadline noise. The noise generation process is nearly the same as in Case 2, except that the stripe noise is replaced by deadlines.
Case 4: Gaussian + impulse noise. Each band is contaminated by Gaussian noise as in Case 1. One third of the bands are randomly selected to add impulse noise with intensity ranging from 10% to 70%.
Case 5: Mixture noise. Each band is randomly corrupted by at least one kind of noise mentioned in Cases 1-4.
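The noise cases above can be simulated roughly as follows. This is a hedged sketch, not the authors' data pipeline: it assumes images scaled to [0, 1] with Gaussian intensities expressed on a 0-255 scale, one Gaussian intensity drawn uniformly per band, stripes modeled as a random per-column offset of at most 0.25, deadlines as zeroed columns, and impulse noise as salt-and-pepper pixels. All function names and these modeling details are our assumptions.

```python
import numpy as np

def pick_bands(num_bands, rng):
    """Randomly choose one third of the bands (10 of 31 for ICVL)."""
    return rng.choice(num_bands, size=num_bands // 3, replace=False)

def pick_columns(width, rng):
    """Randomly choose 5% to 15% of the columns."""
    count = int(rng.uniform(0.05, 0.15) * width)
    return rng.choice(width, size=count, replace=False)

def add_noniid_gaussian(img, rng):
    """Case 1: zero-mean Gaussian noise, one intensity in [10, 70] per band."""
    sigmas = rng.uniform(10, 70, size=img.shape[0]) / 255.0
    return img + rng.normal(size=img.shape) * sigmas[:, None, None]

def add_stripes(img, rng, amplitude=0.25):
    """Stripe noise: a constant random offset on each selected column."""
    out = img.copy()
    for b in pick_bands(img.shape[0], rng):
        cols = pick_columns(img.shape[2], rng)
        out[b][:, cols] += rng.uniform(-amplitude, amplitude, size=cols.size)
    return out

def add_deadlines(img, rng):
    """Deadline noise: selected columns are set to zero."""
    out = img.copy()
    for b in pick_bands(img.shape[0], rng):
        out[b][:, pick_columns(img.shape[2], rng)] = 0.0
    return out

def add_impulse(img, rng):
    """Impulse noise: 10% to 70% of pixels become 0 or 1 at random."""
    out = img.copy()
    for b in pick_bands(img.shape[0], rng):
        mask = rng.random(img.shape[1:]) < rng.uniform(0.10, 0.70)
        salt = rng.random(img.shape[1:]) < 0.5
        out[b][mask] = salt[mask].astype(float)
    return out

rng = np.random.default_rng(0)
clean = np.full((31, 16, 20), 0.5)                           # toy (B, H, W) image
case1 = add_noniid_gaussian(clean, rng)
case2 = add_stripes(add_noniid_gaussian(clean, rng), rng)    # Gaussian + stripe
case3 = add_deadlines(add_noniid_gaussian(clean, rng), rng)  # Gaussian + deadline
case4 = add_impulse(add_noniid_gaussian(clean, rng), rng)    # Gaussian + impulse
```

A mixture sample (Case 5) could be produced by chaining several of these functions on the same image.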
We compare our method against both traditional and DL methods in both Gaussian and complex noise cases. In general, traditional methods are best suited to a specific noise setting, depending on their noise assumption, while DL methods can be applied in various noise settings by training multiple models to tackle miscellaneous noises. For the sake of fairness, we adopt different traditional baselines in these two noise contexts, given their noise assumptions.
In Gaussian noise case, we compare with several representative traditional methods including filtering-based approaches (BM4D [28]), dictionary learning approach (TDL [30]), and tensor-based approaches (ITSReg [42], LLRT [9]). In complex noise case, the competing traditional baselines include low-rank matrix recovery approaches (LRMR [48], LRTV [20], NMoG [11]), and low-rank tensor approach (TDTV [39]).
For DL approaches, we compare our model with HSID-CNN [46]. Besides, any DL method for single image denoising can be extended to the HSI denoising case (by modifying the first layer to adapt to the HSI, i.e. changing the number of input channels from 3 to 31). For completeness, we also compare with such a state-of-the-art 2D DL approach, i.e. MemNet [33] with 31 input channels in the first layer, which entails a fixed number of spectral bands. Since the training setting differs between ours and the other DL approaches, we finetune/retrain their pretrained models with our training strategy to achieve better performance on our dataset.
We develop an incremental training policy to stabilize and accelerate the training, which also prevents the network from converging to a poor local minimum. The philosophy of our training policy is simple: learning to solve tasks in an easy-to-difficult way [1]. Networks are learned by minimizing the mean square error (MSE) between the predicted high-quality HSI and the ground truth. The network parameters are initialized as in [17], and optimized using the ADAM optimizer [25] with the deep learning framework PyTorch (https://pytorch.org/) on a machine with an NVIDIA GTX 1080Ti GPU, an Intel(R) Core(TM) i7-7700K CPU at 4.2GHz and 16 GB RAM. Rather than training separate networks for each type of noise, we simply train two models, one for the Gaussian noise case and one for the complex noise case. Our network learning goes through three stages, from the easy task of Gaussian denoising with a fixed noise level to the difficult one of complex noise removal. The models are trained incrementally, reusing the prior state (pretrained parameters) to maximize the training efficiency (see the discussion in Section V-A). We follow the previous image restoration work [29] to choose the hyper-parameters of the learning algorithm. These values were empirically set to make network learning fast yet stable. Specifically, the learning rate is decayed at the epochs where the validation performance no longer improves. A small batch size (i.e. 16) is used to accelerate training in the first stage, while a large batch size (i.e. 64) is adopted to stabilize training when tackling harder cases (e.g. the complex noise case). An overview of our training procedure is shown in Table II, with the detailed hyper-parameter setting.
To give an overall evaluation, three quantitative quality indices are employed, i.e. PSNR, SSIM [40], and SAM [47]. PSNR and SSIM are two conventional spatial-based indices, while SAM is spectral-based. Larger values of PSNR and SSIM imply better performance, while a smaller value of SAM suggests better performance.
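For reference, two of the three indices can be computed as in the sketch below, using their standard definitions: PSNR from the mean squared error, and SAM as the mean angle between the clean and restored spectra at each pixel. The sketch assumes images scaled to [0, 1] and stored as (B, H, W) arrays; whether PSNR is averaged per band and whether SAM is reported in radians or degrees may differ from the evaluation code used for the tables, and SSIM is omitted for brevity.

```python
import numpy as np

def psnr(clean, restored, peak=1.0):
    """Peak signal-to-noise ratio in dB over the whole volume."""
    mse = np.mean((clean - restored) ** 2)
    return float("inf") if mse == 0 else float(10.0 * np.log10(peak ** 2 / mse))

def sam(clean, restored, eps=1e-12):
    """Mean spectral angle in radians between per-pixel spectra.

    Inputs are (B, H, W) arrays; the spectrum of a pixel is the vector
    of its B band values.
    """
    dot = np.sum(clean * restored, axis=0)
    norms = np.linalg.norm(clean, axis=0) * np.linalg.norm(restored, axis=0)
    cosine = np.clip(dot / (norms + eps), -1.0, 1.0)
    return float(np.mean(np.arccos(cosine)))

clean = np.full((3, 2, 2), 0.5)
restored = clean + 0.1  # uniform error of 0.1 in every entry
print(psnr(clean, restored), sam(clean, restored))
```

In this toy example the restored spectra are scaled copies of the clean ones, so the spectral angle is essentially zero even though PSNR is finite. This is why a spectral index complements the spatial ones.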
Sigma | Index | Noisy | BM4D [28] | TDL [30] | ITSReg [42] | LLRT [9] | HSID-CNN [46] | MemNet [33] | Ours |
---|---|---|---|---|---|---|---|---|---|
30 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
50 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
70 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
Blind | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
Case | Index | Noisy | LRMR [48] | LRTV [20] | NMoG [11] | TDTV [39] | HSID-CNN [46] | MemNet [33] | Ours |
---|---|---|---|---|---|---|---|---|---|
1 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
2 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
3 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
4 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
5 | PSNR | | | | | | | | |
 | SSIM | | | | | | | | |
 | SAM | | | | | | | | |
Index | Noisy | LRMR [48] | LRTV [20] | NMoG [11] | TDTV [39] | HSID-CNN [46] | Ours-S | Ours-P | Ours-F |
---|---|---|---|---|---|---|---|---|---|
PSNR | | | | | | | | | |
SSIM | | | | | | | | | |
SAM | | | | | | | | | |
Zero-mean additive white Gaussian noise with different variances is added to generate the noisy observations. The model trained at the end of stage 2 (epoch 50) is used to tackle all the different levels of corruption. (We do not train multiple networks to tackle different noise intensities; instead, a single network is trained using training samples with various noise intensities.) Figure 6 shows the denoising results. It can be easily observed that our method properly removes the Gaussian noise while finely preserving the structure underlying the HSI. Traditional methods like BM4D and TDL introduce evident artifacts in some areas. Other methods suppress the noise better, but still lose some fine-grained details and produce relatively low-quality results compared with ours. The quantitative assessment results are listed in Table III. Compared with all competing methods, the QRNN3D achieves better performance in most qualitative/quantitative assessments, further confirming the high fidelity of our method.
Five types of complex noise are added to generate noisy samples. In brief, Cases 1-5 represent non-i.i.d. Gaussian noise, Gaussian + stripes, Gaussian + deadline, Gaussian + impulse, and a mixture of them respectively (see Section IV-A2 for more details). As in the Gaussian noise case, a single model trained at the end of stage 3 (epoch 100) is utilized to deal with Cases 1-5 simultaneously. It is worth noting that each sample in our training set is corrupted by one of the noise types (i.e. Cases 1-4), while in Case 5, each testing sample suffers from multiple types of noise, a combination not contained in the training set. We show the qualitative and quantitative results in Figure 7 and Table IV respectively, which show that our QRNN3D significantly outperforms the other methods. Furthermore, the results in the mixture noise case exhibit the strong generalization of our model, since the mixture noise is not seen by our model in the training stage.
In Figure 7, the observed images are corrupted by miscellaneous complex noises. Low-rank matrix recovery methods, i.e. LRMR and LRTV, which assume that the clean HSI lies in a low-rank subspace from the spectral perspective, successfully remove most of the noise, but at the cost of losing fine details. Our QRNN3D eliminates miscellaneous noises to a great extent, while preserving the fine-grained structure of the original image (e.g. the texture of the road in the second photo of Figure 7) more faithfully than the top-performing traditional low-rank tensor approach TDTV and the other DL methods. Figure 8 shows the PSNR value of each band in these HSIs. It can be seen that the PSNR values of all bands obtained by our QRNN3D are clearly higher than those of the compared methods.
Here, we conduct experiments on Pavia University in the mixture noise case. Given the similarity between Pavia Centre and Pavia University, the model is first trained from scratch only on Pavia Centre. It can be seen that our train-from-scratch model (Ours-S in Table V) performs poorly, even compared with the traditional method TDTV (29.64 vs. 30.06).
Nevertheless, our method utilizes QRU3D, which allows it to be naturally applied to input data with various numbers of bands. On the basis of this flexibility, we directly apply our model pretrained on the ICVL dataset (in the complex noise case) to Pavia University. Although Pavia University is recorded with a spectral curve totally distinct from the ICVL dataset, this model, called Ours-P, performs much better than all compared methods, which strongly verifies the robustness of our method. (The result of HSID-CNN is also obtained by its model pretrained on the ICVL dataset under the complex noise case. The learned MemNet cannot be applied to data with a different number of bands, so its results are not provided in Table V.)
Furthermore, we employ a small number of samples from Pavia Centre to fine-tune the model learned only from the ICVL dataset. This learned model (Ours-F in Table V) significantly boosts the performance. The visual comparison is provided in Figure 9. Interestingly, Gaussian-like residuals are still visible in the result of the Ours-S model, while the Ours-P model suffers from stripes. The Ours-F model combines the strengths of the two models, yielding a clear and clean result. This seems to indicate that the knowledge from the ICVL dataset is complementary to that from the Pavia Centre dataset, so that the transfer learning enabled by this flexibility brings great benefits in performance.
We also verify our model on the real-world noisy HSIs Indian Pines and Urban, for which no corresponding ground truth is available. It can be observed in Figure 10 and Figure 11 that severe atmospheric and water absorption obstructs the view of the real scene, severely degrading the quality of the images. The Gaussian denoising methods, e.g. BM4D and TDL, cannot accurately estimate the underlying clean image due to the non-Gaussian noise structure. Our QRNN3D successfully tackles this unknown noise and produces sharper and clearer results than the others, consistently demonstrating the robustness and flexibility of our model.
Model | PSNR (dB) | Time (s) | Params (#) |
---|---|---|---|
MemNet | 39.76 | 0.88 | 2.94M |
QRU2D | 38.63 | 0.60 | 0.29M |
WQRU2D | 39.82 | 1.16 | 0.88M |
C3D | 36.83 | 0.56 | 0.43M |
WC3D | 40.00 | 0.93 | 1.72M |
QRU3D | 40.23 | 0.74 | 0.86M |
Unidirectional (U) | 40.07 | 0.75 | 0.86M |
Bidirectional (B) | 40.26 | 1.26 | 1.72M |
Alternating (A) | 40.23 | 0.74 | 0.86M |
In this section, we provide a broad discussion and analysis of QRNN3D to facilitate understanding of where its performance comes from. We first demonstrate the efficacy of our incremental training policy, then analyze the functionality of each network component in QRNN3D (i.e. 3D convolution, quasi-recurrent pooling, alternating directional structure). The selection of network hyper-parameters follows. Finally, the visualization method (and results) of GCS knowledge in QRU3D are presented.
The key idea of our training policy lies in the fact that knowledge can be efficiently learned in an easy-to-difficult way [1]. Our training policy enables reusing previously learned knowledge (pretrained parameters), which significantly stabilizes and accelerates the whole training process. As an example, we show the optimization curves with and without reusing the pretrained parameters when training the model in the complex noise case. As shown in Figure 12, training from scratch renders the optimization slow and unstable and makes it converge to a poor local minimum, in contrast to training with a good initialization under our incremental learning policy.
To thoroughly verify the function of each component in our QRNN3D, comprehensive ablation experiments are conducted on the HSI Gaussian denoising task on the ICVL dataset. We focus on the components associated with HSI modeling and domain knowledge embedding, and study the best trade-off between performance and computational burden. The evaluation measures include PSNR, running time, and the total number of network parameters.
We choose our encoder-decoder QRNN3D as the baseline. For a fair comparison, the same network architecture is used except for the modification to the component under investigation. Ablation results are reported in Table VI and analyzed below.
Table VI investigates the effect of the subcomponents (i.e. 3D convolution and the quasi-recurrent pooling function) in QRU3D, the basic building block of our QRNN3D. In the experiments, four variants of this basic block are tested, i.e. QRU2D, WQRU2D, C3D and WC3D.
QRU2D is instantiated by replacing the 3D convolution with a 2D convolution (implemented by simply setting the kernel size to 1 along the spectral dimension). A drastic performance loss (-1.6 dB) can be observed in Table VI, meaning that ignoring the structural spectral correlation severely impairs the model capacity.
WQRU2D is formed by a wider QRU2D model whose number of parameters is comparable to that of QRU3D. Nevertheless, QRU3D still outperforms WQRU2D, even at a lower computation cost, which suggests that 3D convolution is more efficient than the 2D approach for HSI modeling.
C3D is constructed by removing the quasi-recurrent pooling (and the associated neural gates), which leaves a plain residual encoder-decoder 3D convolutional neural network. We find that the lack of a mechanism to model the GCS degrades the performance by a large margin (-3.4 dB).
WC3D is built as a wider C3D model with more parameters (four times as many as the C3D model). The PSNR of QRU3D is 40.23 dB, higher than the 40.00 dB of WC3D. This suggests that the improvement from quasi-recurrent pooling does not arise merely because it adds width to the C3D model. Besides, according to Table VI, QRU3D has only half the parameters (0.86M vs. 1.72M) and about 80% of the running time (0.74 s vs. 0.93 s) of the WC3D model, and is also narrower. This comparison shows that the improvement from quasi-recurrent pooling is complementary to going wider in standard ways.
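The gaps and ratios quoted in the comparisons above can be re-derived directly from Table VI. The following is only a small arithmetic check; the dictionary layout and the helper names `psnr_gap` and `ratio` are illustrative and not part of the released code.

```python
# Values copied from Table VI: (PSNR in dB, time in s, parameters in millions).
TABLE_VI = {
    "QRU2D": (38.63, 0.60, 0.29),
    "WQRU2D": (39.82, 1.16, 0.88),
    "C3D": (36.83, 0.56, 0.43),
    "WC3D": (40.00, 0.93, 1.72),
    "QRU3D": (40.23, 0.74, 0.86),
}
PSNR, TIME, PARAMS = 0, 1, 2  # column indices into the tuples above


def psnr_gap(variant, baseline="QRU3D"):
    """PSNR of `variant` minus PSNR of `baseline`, rounded to 2 decimals."""
    return round(TABLE_VI[variant][PSNR] - TABLE_VI[baseline][PSNR], 2)


def ratio(a, b, column):
    """Ratio of one table column between models `a` and `b`, rounded to 2 decimals."""
    return round(TABLE_VI[a][column] / TABLE_VI[b][column], 2)


if __name__ == "__main__":
    print("QRU2D vs QRU3D PSNR gap:", psnr_gap("QRU2D"))              # -1.6
    print("C3D vs QRU3D PSNR gap:", psnr_gap("C3D"))                  # -3.4
    print("WC3D / C3D parameters:", ratio("WC3D", "C3D", PARAMS))     # 4.0
    print("QRU3D / WC3D parameters:", ratio("QRU3D", "WC3D", PARAMS)) # 0.5
    print("QRU3D / WC3D time:", ratio("QRU3D", "WC3D", TIME))         # 0.8
```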
Table VI also shows the results of different directional structures, denoted by their initials (U for unidirectional, B for bidirectional, A for alternating directional). Without considering backward spectral dependency, the unidirectional architecture performs worst of the three (40.07 dB). After eliminating the causal dependency, both the alternating directional and the bidirectional architectures exceed the unidirectional one and achieve similar performance (40.23 dB vs. 40.26 dB). Nevertheless, the bidirectional version requires a much larger memory footprint than our alternating directional structure, indicating that the alternating directional structure can serve as a lightweight alternative to the typical bidirectional one.
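To illustrate why alternating the direction removes the causal dependency, here is a toy sketch rather than the authors' implementation. It assumes the standard quasi-recurrent pooling recurrence h_i = f_i * h_{i-1} + (1 - f_i) * z_i. Each band is reduced to a single scalar, and pointwise `tanh` and `sigmoid` stand in for the 3D convolutions that produce the candidate and the forget gate. The names `qr_layer` and `stack` are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def qr_layer(x, reverse=False):
    """Toy quasi-recurrent layer over a list of per-band scalars.

    Pointwise tanh/sigmoid replace the convolutions of the real model
    (an assumption made only to keep the sketch short).
    """
    z = [math.tanh(v) for v in x]  # candidate
    f = [sigmoid(v) for v in x]    # forget gate
    order = range(len(x) - 1, -1, -1) if reverse else range(len(x))
    h = [0.0] * len(x)
    prev = 0.0                     # initial hidden state
    for i in order:
        prev = f[i] * prev + (1.0 - f[i]) * z[i]
        h[i] = prev
    return h


def stack(x, directions):
    """Apply one layer per entry of `directions` (True means backward)."""
    for reverse in directions:
        x = qr_layer(x, reverse=reverse)
    return x


if __name__ == "__main__":
    x = [0.3, -0.5, 0.8, 0.1]
    y = [0.3, -0.5, 0.8, 2.0]  # only the last band differs
    # Two forward layers: the first output band never sees the last input band.
    print(stack(x, [False, False])[0] == stack(y, [False, False])[0])
    # Forward then backward: the first output band now depends on the last one.
    print(stack(x, [False, True])[0] != stack(y, [False, True])[0])
```

In this toy setting, two stacked layers with alternating directions already let every output band depend on the whole spectrum, while each layer keeps the cost of a single unidirectional pass.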
Depth | Width | PSNR (dB) | Time (s) | Params (#) |
---|---|---|---|---|
10 | 16 | 39.85 | 0.68 | 0.42M |
12 | 16 | 40.23 | 0.74 | 0.86M |
14 | 16 | 39.52 | 0.80 | 1.30M |
12 | 12 | 39.82 | 0.62 | 0.48M |
12 | 16 | 40.23 | 0.74 | 0.86M |
12 | 20 | 40.01 | 1.18 | 1.34M |
Our principle for network hyper-parameter selection is to keep the network compact yet effective. Table VII shows the results of hyper-parameter selection on the Gaussian denoising task through a small grid search, from which we select the depth and width of our QRNN3D that offer the best tradeoff between performance and computational overhead.
Nonetheless, we note that the major goal of this work is to introduce a novel building block specially tailored to modeling HSIs. Such a building block can be naturally inserted into any network topology and is not restricted to the encoder-decoder network used in this paper. We mainly show the effectiveness of the proposed building block and do not pursue higher performance via an exhaustive search over other configurations. We have demonstrated state-of-the-art performance of our QRNN3D without heavy engineering effort on network hyper-parameter selection. Our current hyper-parameter setting might not be optimal, and the performance could potentially be boosted by further tuning, though this is not a major focus of this paper.
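To visualize the captured GCS knowledge in QRNN3D, we first unfold Equation (3) and obtain

$$\mathbf{h}_i = \sum_{j=1}^{i} \mathbf{w}_j^i \odot \mathbf{z}_j, \tag{4}$$

where $\mathbf{w}_j^i = (1 - \mathbf{f}_j) \odot \prod_{k=j+1}^{i} \mathbf{f}_k$.

We define $\mathrm{GCS}_{i,j}$ by the degree of $\mathbf{z}_j$'s contribution to $\mathbf{h}_i$ under the Frobenius norm measure, i.e.

$$\mathrm{GCS}_{i,j} = \left\| \left( \mathbf{w}_j^i \odot \mathbf{z}_j \right) \oslash \mathbf{h}_i \right\|_F, \tag{5}$$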
where $\oslash$ denotes element-wise division. It also indicates the effect of input band $j$ on output band $i$. The captured GCS in each QRU3D layer can be calculated through a single inference pass by using Equation (5). In a forward (backward) QRU3D, the captured GCS is an upper (lower) triangular matrix, so to visualize the GCS completely we choose the first bidirectional QRU3D for this analysis. (The body of QRNN3D is equipped with the alternating directional structure, while in the head and tail the bidirectional structure is employed to avoid directional bias.) Figure 12(a) exhibits the captured GCS of a randomly selected HSI, showing that the output of each band is strongly affected by the whole spectrum. Figure 12(b) illustrates the number of related bands for the output of each band. It can be seen that the 15th to 17th bands are strongly correlated with almost all bands. Figure 12(c) summarizes this statistic over all testing images on ICVL. It shows that a randomly selected band is typically related to at least 15 bands (of 31 in total), meaning that the GCS is effectively utilized by our model and that our method can also automatically determine the most relevant bands across the global spectrum.
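The GCS computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code. It assumes a forward pooling recurrence of the standard quasi-recurrent form h_i = f_i * h_{i-1} + (1 - f_i) * z_i with zero initial state, so that the unfolded weight of band j in output i is (1 - f_j) multiplied by the product of f_k for k = j+1, ..., i. It also assumes the entries of h_i are nonzero. Each band is a flat list of feature values, and the function names are hypothetical. Rows index the output band and columns the input band here, so a forward pass fills the lower triangle; which triangle is populated depends only on this index convention.

```python
import math


def pool(f, z):
    """Element-wise recurrence h_i = f_i * h_{i-1} + (1 - f_i) * z_i, h_0 = 0."""
    h, prev = [], [0.0] * len(z[0])
    for fi, zi in zip(f, z):
        prev = [a * p + (1.0 - a) * b for a, p, b in zip(fi, prev, zi)]
        h.append(prev)
    return h


def unfolded_weight(f, i, j):
    """Weight of candidate band j in output band i (0-based, j <= i)."""
    w = [1.0 - a for a in f[j]]
    for k in range(j + 1, i + 1):
        w = [wv * fk for wv, fk in zip(w, f[k])]
    return w


def gcs_matrix(f, z):
    """Return (gcs, h); gcs[i][j] is the Frobenius norm of (w * z_j) / h_i."""
    h = pool(f, z)
    bands = len(z)
    gcs = [[0.0] * bands for _ in range(bands)]
    for i in range(bands):
        for j in range(i + 1):
            w = unfolded_weight(f, i, j)
            # Element-wise share of band j in output band i (h_i assumed nonzero).
            share = [wv * zv / hv for wv, zv, hv in zip(w, z[j], h[i])]
            gcs[i][j] = math.sqrt(sum(s * s for s in share))
    return gcs, h


if __name__ == "__main__":
    # Made-up gates and candidates: 3 bands, 2 feature values per band.
    f = [[0.2, 0.5], [0.7, 0.4], [0.6, 0.9]]
    z = [[1.0, 2.0], [0.5, 1.5], [2.0, 0.5]]
    gcs, _ = gcs_matrix(f, z)
    for row in gcs:
        print(["%.3f" % v for v in row])
```

Because the gates and candidates are already produced during a forward pass, the whole matrix can be obtained from a single inference, as the text states.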
In this paper, we have proposed an alternating directional 3D quasi-recurrent neural network for hyperspectral image denoising. Our main contribution is the novel use of a 3D convolution subcomponent, a quasi-recurrent pooling function, and an alternating directional scheme for efficient spatio-spectral dependency modeling. We have applied our model to HSI denoising beyond the Gaussian case, especially the very challenging real-world complex noise case, and achieved better performance and faster speed. We also show that our model pretrained on the ICVL dataset can be directly used to tackle remotely sensed images, which is infeasible for most existing DL approaches to HSI modeling.
In addition, the visualized results for the global correlation along spectrum (GCS) in our 3D quasi-recurrent unit (QRU3D) provide further experimental evidence that the GCS is effectively exploited by our model. It is also worth investigating the proposed QRU3D in other image sequence modeling tasks in the future.
European Conference on Computer Vision, pages 19–34. Springer, 2016.
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4260–4268, 2017.
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
International Conference on Machine Learning (ICML), pages 448–456, 2015.
Self-paced learning-based probability subspace projection for hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems, PP(99):1–6, 2018.