Hyperspectral image (HSI) data contain abundant spatial and spectral information, which gives them a wide range of applications. Nevertheless, because of sensor restrictions and atmospheric interference, HSIs often suffer from various types of noise, such as Gaussian noise, stripe noise and dead lines [1]. Thus, it is essential to reduce the noise in HSIs in order to facilitate subsequent high-level analysis tasks.
The goal of HSI denoising is to recover a clean image $\mathbf{X}$ from a noisy observation $\mathbf{Y}$. The degradation model can be formulated as $\mathbf{Y} = \mathbf{X} + \mathbf{N}$, where $\mathbf{N}$ is, in general, additive white Gaussian noise (AWGN) with standard deviation $\sigma$. To address this ill-posed inverse problem, prior knowledge about $\mathbf{X}$ needs to be adopted to constrain the solution space. Over the past decades, a variety of reasonable priors have been proposed for HSI denoising, such as total variation, non-local self-similarity, sparse representation and low-rank models. For example, Maggioni et al. [2] proposed an algorithm called BM4D, which exploits the local correlation within each cube and the non-local correlation between different cubes. Considering the high spectral correlation across bands and the high spatial similarity within each band, Renard et al. [3] proposed a low-rank tensor approximation method (LRTA), which performs both spatial low-rank approximation and spectral dimensionality reduction. Besides, Zhang et al. [4] proposed an efficient HSI restoration method based on low-rank matrix recovery (LRMR). Chang et al. [5] claimed that non-local self-similarity is the key ingredient for denoising, and proposed a unidirectional low-rank tensor recovery method (LLRT) to capture the intrinsic structural correlation in HSIs. To combine the spatial non-local similarity with the global spectral low-rank property, He et al. [6] proposed a unified spatial-spectral paradigm for HSI denoising called NG-Meet. The major drawback of the above-mentioned approaches is that they are time-consuming due to their complex optimization processes, which hinders their use in practice. In addition, each manually introduced prior reflects only one aspect of the data's characteristics, so the representation ability of these methods is limited.
Recently, deep learning based approaches have been proposed for hyperspectral image denoising. Yuan et al. [7] proposed a spatial-spectral deep residual convolutional neural network (HSID-CNN) for HSI denoising, and Zhang et al. [8] proposed a spatial-spectral gradient network for mixed noise removal in HSIs, in consideration of the spatial structure directionality and spectral differences. Although these methods achieve impressive denoising results, there is still considerable room to advance this domain.
A feasible strategy is to explore the most relevant part of the auxiliary spectral information to make full use of the spectral low-rank property, and to make the network adaptively learn significant features. In view of this, we introduce an attention-based deep residual convolutional neural network (ADRN) for HSI denoising. Both a single band and its adjacent bands are simultaneously fed into the network to take full advantage of the spatial-spectral information. Convolution layers with different receptive field sizes are adopted to extract multi-scale spatial and spectral features respectively. Then, shortcut connections are built to enable information flow from the fused feature representation to the final residual output, which alleviates the conventional degradation and feature vanishing problems. More importantly, to increase the ability of discriminative learning, we integrate the channel attention mechanism into the network to make it more aware of the most relevant information and the most crucial features. To the best of our knowledge, this is the first work in HSI denoising that considers the attention mechanism. Compared with state-of-the-art methods, our proposed ADRN scheme achieves superior performance in both quantitative and visual evaluations.
| Noise Level | Criterion | LRTA | BM4D | LRMR | HSID-CNN | LLRT | NG-Meet | Proposed |
In this section, we introduce in detail the proposed attention-based deep residual network for HSI denoising. The overall architecture of our network is illustrated in Fig. 1(a), where $y_{spatial}$ represents an input noisy band and $y_{spectral}$ denotes its $K$ adjacent bands. The multi-scale feature extraction module is in charge of acquiring the spatial contextual and spectral correlation information for further processing. Then the multi-level feature representation module is tailored to construct the residual noise. Finally, the clean signal is obtained by subtracting the residual from the spatial input. In the following, we elaborate on the blocks and the loss function of our network.
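As a minimal sketch of how the two inputs could be assembled from an HSI cube, the snippet below extracts one band together with its adjacent bands. The function name `split_inputs`, the `(H, W, B)` layout and the edge handling (shifting the window inward near the first and last bands so that the number of neighbours stays constant) are assumptions made for illustration; the text above does not specify them.

```python
import numpy as np


def split_inputs(cube, band, k):
    """Split an (H, W, B) cube into spatial and spectral inputs for one band.

    Assumption: the k adjacent bands come from a window of k + 1 consecutive
    bands around `band`; near the first/last bands the window is shifted
    inward so that exactly k neighbours are always returned.
    """
    n_bands = cube.shape[2]
    if not 0 <= band < n_bands:
        raise IndexError("band index out of range")
    if k >= n_bands:
        raise ValueError("k must be smaller than the number of bands")
    # First band of the (k + 1)-wide window, clamped to the valid range.
    start = min(max(band - k // 2, 0), n_bands - (k + 1))
    window = np.arange(start, start + k + 1)
    neighbours = window[window != band]
    spatial = cube[:, :, band:band + 1]   # (H, W, 1): the band to denoise
    spectral = cube[:, :, neighbours]     # (H, W, k): its adjacent bands
    return spatial, spectral


cube = np.random.default_rng(0).random((8, 8, 191))
spatial, spectral = split_inputs(cube, band=0, k=64)
print(spatial.shape, spectral.shape)  # (8, 8, 1) (8, 8, 64)
```

Keeping the neighbour count fixed means every band, including those at the ends of the spectrum, yields a spectral input of the same shape.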
2.1 Feature Extraction Block
The ground objects in HSIs naturally have various sizes in different regions. This fact implies that our denoising network should be able to capture contextual information at multiple scales. Inspired by Inception [9], four types of convolution layers with receptive field sizes of 1, 3, 5 and 7 are adopted in our network to extract both the spatial and spectral features, as described in Fig. 1(b). Furthermore, to avoid an expensive computational burden and to accelerate testing, a convolution layer is inserted to reduce the channel dimension when the filter size is larger than 1.
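The multi-scale idea can be sketched in plain NumPy as follows. This is an illustrative sketch, not the authors' implementation: the helper names, the channel counts, the random initialization and the use of a $1\times 1$ layer for the channel reduction (as in Inception) are all assumptions.

```python
import numpy as np


def conv2d_same(x, w):
    """Cross-correlate x (H, W, Cin) with w (k, k, Cin, Cout); odd k, zero padding."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            # (H, W, Cin) @ (Cin, Cout) accumulates one kernel tap.
            out += xp[i:i + h, j:j + wd] @ w[i, j]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def init_branch_params(c_in, c_mid, c_out, rng, sizes=(1, 3, 5, 7)):
    """Hypothetical parameter layout: one branch per receptive field size."""
    params = {}
    for k in sizes:
        branch, c = {}, c_in
        if k > 1:
            # Assumed 1x1 bottleneck that reduces channels before the k x k layer.
            branch["reduce"] = 0.1 * rng.standard_normal((1, 1, c_in, c_mid))
            c = c_mid
        branch["conv"] = 0.1 * rng.standard_normal((k, k, c, c_out))
        params[k] = branch
    return params


def multi_scale_features(x, params):
    """Run every branch on x and concatenate the results along channels."""
    branches = []
    for branch in params.values():
        feat = x
        if "reduce" in branch:
            feat = relu(conv2d_same(feat, branch["reduce"]))
        branches.append(relu(conv2d_same(feat, branch["conv"])))
    return np.concatenate(branches, axis=-1)


rng = np.random.default_rng(0)
params = init_branch_params(c_in=64, c_mid=8, c_out=16, rng=rng)
features = multi_scale_features(rng.random((12, 12, 64)), params)
print(features.shape)  # (12, 12, 64): four branches of 16 channels each
```

With the bottleneck, the $7\times 7$ branch in this example multiplies against 8 input channels instead of 64, which is where the computational saving comes from.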
2.2 Residual Block
As the network goes deeper, information extracted in the early stages of the network may vanish or "wash out" by the time it reaches the output layer [10]. In addition, deeper networks often suffer from the gradient vanishing problem, which makes the training process slow or even divergent. To address these problems, we adopt the shortcut connection from ResNet [11] to directly pass the early feature maps to the later layers, as illustrated in Fig. 1(c). This greatly increases the flow of information and thus contributes to the prediction of the residual noise and the back propagation of gradients, thereby accelerating the training process.
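The effect of a shortcut connection can be illustrated with a toy example on feature vectors rather than images. The weights below (20 layers, each a scaled identity) are hypothetical and chosen only to make the "wash out" visible; they are not taken from the network described here.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def plain_stack(x, weights):
    """Plain feed-forward stack: each layer sees only the previous output."""
    for w in weights:
        x = relu(x @ w)
    return x


def residual_stack(x, weights):
    """Same layers, but each one adds its output to an identity shortcut."""
    for w in weights:
        x = x + relu(x @ w)
    return x


x = np.ones((1, 8))
weights = [0.1 * np.eye(8) for _ in range(20)]  # hypothetical small weights

print(plain_stack(x, weights).max())     # about 1e-20: the input has washed out
print(residual_stack(x, weights).max())  # about 6.7: the input is still present
```

In the plain stack the input is scaled by 0.1 at every layer, whereas the identity path in the residual stack carries it through unchanged and the layers only add to it.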
2.3 Channel Attention Block
A traditional CNN treats each channel of a feature map equally, which lacks discriminative learning ability across channels and thus limits the representation power of deep networks. We observe that feature maps extracted from the spectral input contribute differently to the final denoising result, and some of them may not be particularly beneficial. Thus, the network should concentrate on the significant features. Moreover, in our residual learning strategy, convolution kernels that are responsible for high-frequency extraction should receive more attention to facilitate the prediction of noise. In view of these concerns, we introduce a channel attention block to adaptively modulate the feature representation.
The structure of our channel attention block (CAB) is illustrated in Fig. 1(d). For the $b$-th CAB, we have

$$F_b = F_{b-1} + s_b \cdot R_b,$$

where $F_{b-1}$ and $F_b$ are the input and output feature maps respectively, and $R_b$ is the residual component acquired by two stacked convolution layers:

$$R_b = W_b^{2} * \delta\left(W_b^{1} * F_{b-1}\right),$$

where $W_b^{1}$ and $W_b^{2}$ are weight sets and $\delta(\cdot)$ denotes the ReLU function. $s_b$ is the learned calibration weight, for which we first apply global average pooling to $R_b$. A convolution layer with ReLU follows to reduce the channel number by the ratio $r$. Then, the channel number is increased back to the original amount through a convolution layer with Sigmoid to guarantee that $s_b$ lies in [0,1]:

$$s_b = \mathrm{sigmoid}\left(W_U * \delta\left(W_D * \mathrm{GAP}(R_b)\right)\right),$$

where $W_D$ and $W_U$ are weight sets and $\mathrm{GAP}(\cdot)$ means the global average pooling operation.
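The channel attention block described above can be sketched in NumPy as follows. The function and argument names, the $3\times 3$ filter size, the channel count of 20 and the random weights are assumptions for illustration; on a globally pooled vector, the two channel-resizing convolutions reduce to matrix products, which is how they are written here.

```python
import numpy as np


def conv2d_same(x, w):
    """Cross-correlate x (H, W, Cin) with w (k, k, Cin, Cout); odd k, zero padding."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (pad, pad), (0, 0)))
    h, wd = x.shape[:2]
    out = np.zeros((h, wd, w.shape[3]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + wd] @ w[i, j]
    return out


def relu(x):
    return np.maximum(x, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention_block(x, w1, w2, w_down, w_up):
    """Return (output, scale) for one block; x has shape (H, W, C)."""
    # Residual component: two stacked convolutions with a ReLU in between.
    residual = conv2d_same(relu(conv2d_same(x, w1)), w2)
    # Global average pooling gives one descriptor per channel.
    pooled = residual.mean(axis=(0, 1))        # (C,)
    hidden = relu(pooled @ w_down)             # (C // r,): channel reduction
    scale = sigmoid(hidden @ w_up)             # (C,): calibration weights in [0, 1]
    # Rescale each residual channel, then add the shortcut.
    return x + residual * scale, scale


rng = np.random.default_rng(0)
c, r = 20, 10                                  # hypothetical channel count, ratio 10
x = rng.random((8, 8, c))
w1 = 0.1 * rng.standard_normal((3, 3, c, c))   # assumed 3x3 filters
w2 = 0.1 * rng.standard_normal((3, 3, c, c))
w_down = rng.standard_normal((c, c // r))
w_up = rng.standard_normal((c // r, c))
out, scale = channel_attention_block(x, w1, w2, w_down, w_up)
print(out.shape, scale.shape)  # (8, 8, 20) (20,)
```

Because the calibration weights come from a sigmoid, each residual channel can be suppressed or passed through but never amplified beyond its original magnitude.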
2.4 Residual Learning and Loss Function
In order to avoid the degradation phenomenon as the network goes deeper and to ensure the stability and effectiveness of the training process, our network does not directly predict the clean image, but outputs a residual noise $\hat{n}$:

$$\hat{n} = \mathcal{F}\left(y_{spatial}, y_{spectral}; \Theta\right),$$

where $\Theta$ denotes the model parameters learned by the back-propagation algorithm. Then the restored clean image $\hat{x}$ can be obtained by subtracting the residual noise from the spatial input:

$$\hat{x} = y_{spatial} - \hat{n}.$$

The loss function of our training process consists of two parts, a reconstruction loss $\mathcal{L}_{rec}$ and a regularization loss $\mathcal{L}_{reg}$, combined with a weight $\lambda$ that controls the trade-off between the two terms. $\mathcal{L}_{rec}$ aims to ensure that the restored result approximates the ground truth, while $\mathcal{L}_{reg}$ is used to enforce that the residual noise satisfies a zero-mean distribution. In these terms, $N$ denotes the training batch size, and $H$ and $W$ denote the height and width of the training images.
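One possible reading of this loss is sketched below. The exact forms are assumptions, not the formulas of this paper: the reconstruction term is taken to be a mean squared error between the restored image and the ground truth, the regularization term is taken to be the squared per-sample mean of the predicted residual, and the trade-off weight is assumed to multiply the regularization term. The name `residual_loss` is likewise hypothetical.

```python
import numpy as np


def residual_loss(pred_residual, noisy, clean, lam=10.0):
    """Return (total, reconstruction, regularization) for (N, H, W, 1) batches."""
    # Residual learning: the restored band is the input minus the predicted noise.
    restored = noisy - pred_residual
    # Assumed reconstruction term: mean squared error to the ground truth.
    l_rec = np.mean((restored - clean) ** 2)
    # Assumed regularization term: penalize a non-zero mean of each residual.
    per_sample_mean = pred_residual.mean(axis=(1, 2, 3))
    l_reg = np.mean(per_sample_mean ** 2)
    return l_rec + lam * l_reg, l_rec, l_reg


clean = np.zeros((1, 2, 2, 1))
noise = np.array([[1.0, -1.0], [-1.0, 1.0]]).reshape(1, 2, 2, 1)
noisy = clean + noise

print(residual_loss(noise, noisy, clean)[0])                 # 0.0: exact, zero-mean residual
print(residual_loss(np.ones_like(noise), noisy, clean)[0])   # 12.0: biased residual is penalized
```

In the second call the constant residual gives a reconstruction error of 2 and a squared mean of 1, so with a weight of 10 the total is 12.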
2.5 Implementation Details
The adjacent band number $K$ is set to 64, the downsample ratio $r$ is set to 10 as in [12], and the trade-off parameter $\lambda$ is equal to 10 throughout training. We use a truncated normal distribution to initialize the weights and train the network from scratch. For optimization, we use Adam [13] with a mini-batch size of 382 (twice the number of bands), while the Adam parameters $\beta_1$, $\beta_2$ and $\epsilon$ follow the default settings in TensorFlow [14]. The learning rate starts from 0.0001 and decays exponentially every fixed number of training steps (such as 5000). Training runs for roughly 300,000 iterations in total.
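The learning-rate schedule can be sketched as a step-wise exponential decay. The initial rate of 0.0001 and the interval of 5000 steps come from the text; the decay factor of 0.9 and the staircase behaviour are assumptions, since the text does not state them.

```python
def learning_rate(step, base_lr=1e-4, decay_rate=0.9, decay_steps=5000, staircase=True):
    """Exponentially decayed learning rate at a given training step.

    decay_rate=0.9 and staircase=True are illustrative assumptions.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_rate ** exponent


for step in (0, 4999, 5000, 300000):
    print(step, learning_rate(step))
```

With these assumed values the rate stays at 0.0001 for the first 5000 steps, drops to 0.00009 at step 5000, and has been multiplied by 0.9 sixty times by step 300,000.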
In this section, extensive experimental results are provided to validate the effectiveness of our method. Several state-of-the-art methods are used for comparison, including BM4D [2], low-rank tensor approximation (LRTA) [3], LRMR [4], HSID-CNN [7], LLRT [5] and NG-Meet [6]. MPSNR [15] and MSSIM [16] serve as the evaluation criteria. Better HSI denoising results lead to higher MPSNR and MSSIM.
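As an illustration of the first criterion, MPSNR is commonly computed as the PSNR of each band averaged over all bands. The sketch below follows that common definition; the function name, the `(H, W, B)` layout and the peak value of 1.0 (data normalized to [0,1]) are assumptions, and a band that is reproduced exactly would yield an infinite PSNR.

```python
import numpy as np


def mpsnr(clean, estimate, peak=1.0):
    """Mean PSNR in dB over the bands of two (H, W, B) cubes."""
    # Mean squared error of each band separately.
    mse = np.mean((clean - estimate) ** 2, axis=(0, 1))
    psnr_per_band = 10.0 * np.log10(peak ** 2 / mse)
    return float(np.mean(psnr_per_band))


clean = np.zeros((4, 4, 2))
estimate = clean.copy()
estimate[:, :, 0] += 0.1    # band 0: error 0.1, i.e. 20 dB
estimate[:, :, 1] += 0.01   # band 1: error 0.01, i.e. 40 dB
print(round(mpsnr(clean, estimate), 6))  # 30.0
```

Averaging per-band values, rather than computing a single PSNR over the whole cube, prevents a few badly corrupted bands from being hidden by many clean ones.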
We follow exactly the same setting for deep model training and testing as HSID-CNN [7]. We use the Washington DC Mall image, of which one part is selected for testing and the remaining part is used for training. First, we utilize the ENVI software to normalize the gray values of each HSI band to [0,1]. Then we crop patches from the training part with a stride of 5. The simulated noisy patches are generated by imposing additive white Gaussian noise (AWGN) with standard deviations of [5, 25, 50, 75, 100] to form the training data. For the simulated HSI denoising process, three types of noise are employed. First, different bands have the same noise intensity; for instance, $\sigma$ is set from 5 to 100, as listed in Table 1. Second, the noise intensity of different bands follows a random probability distribution, labeled as rand(25). Third, the noise intensity also differs across bands but varies like a Gaussian curve centered at the middle band, with the parameters of the curve fixed in our settings.
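The three noise cases can be simulated along the following lines. This is a sketch under several assumptions: noise levels are taken to be on a 0 to 255 scale and are divided by 255 because the data are normalized to [0,1]; the random case is assumed to draw each band's level uniformly from [0, 25]; and the Gaussian-shaped case uses a hypothetical peak `beta` and width `eta`, since the curve parameters are not given here.

```python
import numpy as np


def band_sigmas(case, n_bands, level=25.0, beta=None, eta=None, rng=None):
    """Per-band noise standard deviations (0-255 scale) for the three cases."""
    idx = np.arange(n_bands)
    if case == "uniform":   # case 1: the same intensity for every band
        return np.full(n_bands, float(level))
    if case == "rand":      # case 2: assumed uniform draw in [0, level] per band
        if rng is None:
            rng = np.random.default_rng()
        return rng.uniform(0.0, level, n_bands)
    if case == "gauss":     # case 3: assumed Gaussian-shaped curve over bands
        centre = (n_bands - 1) / 2.0
        return beta * np.exp(-((idx - centre) ** 2) / (2.0 * eta ** 2))
    raise ValueError(f"unknown case: {case}")


def add_awgn(cube, sigmas, rng):
    """Add zero-mean Gaussian noise to an (H, W, B) cube scaled to [0, 1]."""
    # sigmas has shape (B,) and broadcasts over the last (band) axis.
    noise = rng.standard_normal(cube.shape) * (sigmas / 255.0)
    return cube + noise


rng = np.random.default_rng(0)
cube = rng.random((16, 16, 191))
noisy = add_awgn(cube, band_sigmas("gauss", 191, beta=100.0, eta=40.0), rng)
print(noisy.shape)  # (16, 16, 191)
```

In the third case the middle band receives the strongest noise and the intensity falls off symmetrically towards both ends of the spectrum.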
The averages and standard deviations of MPSNR and MSSIM are obtained over 10 repeated runs of the compared methods. The best performance for each quality criterion is marked in bold and the second-best one is underlined. Compared with other algorithms, the proposed ADRN achieves the highest MPSNR and MSSIM values at almost all noise levels, the exception being the lowest one. Under such a small noise level, all methods achieve relatively good performance and the gap is small. In contrast, as the noise becomes stronger and more complicated, our approach clearly outperforms the other algorithms.
It is worth noting that NG-Meet achieves the best HSI denoising performance reported in the literature. However, it assumes that the noise follows an independent and identically distributed (i.i.d.) Gaussian distribution, and its performance drops dramatically when encountering non-i.i.d. noise.
To further demonstrate the effectiveness of our proposed method, Fig. 2 and Fig. 3 show the pseudo-color images of the test data (composed of bands 57, 27 and 17) after denoising under two different noise settings, respectively. The MPSNR and MSSIM values of each method are marked under the denoised images. Although LLRT and NG-Meet show good noise reduction ability under uniform noise intensities, they do not work well when the noise intensity differs across bands. Our proposed method achieves the best performance in both objective and subjective evaluations.
In this paper, we presented an attention-based deep residual network for HSI denoising. Both a single band and its adjacent bands are simultaneously fed to the model to fully exploit the spatial-spectral structural correlation. Then, by incorporating convolution layers with various receptive fields, shortcut connections and the channel attention mechanism, we formulate a multi-scale feature extraction module and a multi-level feature representation module, which respectively capture the multi-scale spatial-spectral features and fuse feature representations of different levels for the final restoration. Furthermore, we adopt the residual learning strategy to ensure the stability and efficiency of the training procedure. The simulated experiments indicate that our proposed ADRN outperforms mainstream methods in both quantitative and visual evaluations.
[1] Behnood Rasti, Paul Scheunders, Pedram Ghamisi, Giorgio Licciardi, and Jocelyn Chanussot, “Noise reduction in hyperspectral imagery: Overview and application,” Remote Sensing, vol. 10, no. 3, pp. 482, 2018.
[2] Matteo Maggioni, Vladimir Katkovnik, Karen Egiazarian, and Alessandro Foi, “Nonlocal transform-domain filter for volumetric data denoising and reconstruction,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 119–133, 2012.
[3] Nadine Renard, Salah Bourennane, and Jacques Blanc-Talon, “Denoising and dimensionality reduction using multilinear tools for hyperspectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 138–142, 2008.
[4] Hongyan Zhang, Wei He, Liangpei Zhang, Huanfeng Shen, and Qiangqiang Yuan, “Hyperspectral image restoration using low-rank matrix recovery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 4729–4743, 2013.
[5] Yi Chang, Luxin Yan, and Sheng Zhong, “Hyper-Laplacian regularized unidirectional low-rank tensor recovery for multispectral image denoising,” in , 2017, pp. 4260–4268.
[6] Wei He, Quanming Yao, Chao Li, Naoto Yokoya, and Qibin Zhao, “Non-local meets global: An integrated paradigm for hyperspectral denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6868–6877.
[7] Qiangqiang Yuan, Qiang Zhang, Jie Li, Huanfeng Shen, and Liangpei Zhang, “Hyperspectral image denoising employing a spatial–spectral deep residual convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 2, pp. 1205–1218, 2018.
[8] Qiang Zhang, Qiangqiang Yuan, Jie Li, Xinxin Liu, Huanfeng Shen, and Liangpei Zhang, “Hybrid noise removal in hyperspectral imagery with a spatial-spectral gradient network,” IEEE Transactions on Geoscience and Remote Sensing, 2019.
[9] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[10] Li Da, Li Lin, and Li Xiang, “Classification of remote sensing images based on densely connected convolutional networks,” Computer Era, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[13] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
[14] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, and Michael Isard, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–283.
[15] Q. Huynh-Thu and M. Ghanbari, “Scope of validity of PSNR in image/video quality assessment,” Electronics Letters, vol. 44, no. 13, pp. 800–801, 2008.
[16] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P. Simoncelli, “Image quality assessment: from error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.