1 Introduction
Hyperspectral image (HSI) data contain abundant spatial and spectral information, which gives them a wide range of applications. Nevertheless, because of sensor limitations and atmospheric interference, HSIs often suffer from various types of noise, such as Gaussian noise, stripe noise and dead lines [1]. It is therefore essential to reduce the noise in HSIs in order to facilitate subsequent high-level analysis tasks.
The goal of HSI denoising is to recover a clean image $x$ from a noisy observation $y$. The degradation model can be formulated as $y = x + n$, where $n$ is, in general, additive white Gaussian noise (AWGN) with standard deviation $\sigma$. To address this ill-posed inverse problem, prior knowledge about $x$ needs to be adopted to constrain the solution space. Over the past decades, a variety of reasonable priors have been proposed for HSI denoising, such as total variation, non-local self-similarity, sparse representation and low-rank models. For example, Maggioni et al. [2] proposed an algorithm called BM4D, which exploits the local correlation in each cube and the non-local correlation between different cubes. Considering the high spectral correlation across bands and the high spatial similarity within each band, Renard et al. [3] proposed a low-rank tensor approximation method (LRTA), which performs both spatial low-rank approximation and spectral dimensionality reduction. Besides, Zhang
et al. [4] proposed an efficient HSI restoration method based on low-rank matrix recovery (LRMR). Chang et al. [5] claimed that non-local self-similarity is the key ingredient for denoising, and proposed a unidirectional low-rank tensor recovery method to capture the intrinsic structural correlation in HSIs. To combine the spatial non-local similarity with the global spectral low-rank property, He et al. [6] proposed a unified spatial-spectral paradigm for HSI denoising called NGMeet. The major drawback of the above-mentioned approaches is that they are time-consuming due to their complex optimization processes, which hinders their use in practice. In addition, such manually introduced priors reflect only one aspect of the data, so the representation ability of these methods is limited.
Recently, deep learning based approaches have been proposed for hyperspectral image denoising. Yuan
et al. [7] utilized both the spatial and spectral information to recover the clean image through multi-scale feature extraction and multi-level feature representation by neural networks. Zhang et al. [8] proposed a spatial-spectral gradient network for mixed noise removal in HSIs, taking into account the spatial structure directionality and the spectral differences. Although these methods achieve impressive denoising results, there is still much room for improvement.
A feasible strategy is to explore the most relevant part of the auxiliary spectral information to make full use of the spectral low-rank property, and to make the network adaptively learn significant features. In view of this, we introduce an attention-based deep residual convolutional neural network (ADRN) for HSI denoising. Both a single band and its adjacent bands are simultaneously fed into the network to take full advantage of the spatial-spectral information. Convolution layers with different receptive field sizes are adopted to extract multi-scale spatial and spectral features, respectively. Then, shortcut connections are built to enable information flow from the fused feature representation to the final residual output, which alleviates the degradation and feature vanishing problems. More importantly, to increase the ability of discriminative learning, we integrate the channel attention mechanism into the network to make it more aware of the most relevant information and the most crucial features. To the best of our knowledge, this is the first work in HSI denoising that considers the attention mechanism. Compared with state-of-the-art methods, our proposed ADRN achieves superior performance in both quantitative and visual evaluations.
2 Methodology
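Table 1: MPSNR and MSSIM (mean ± standard deviation) of the compared methods under different noise levels.

| Noise Level | Criterion | LRTA [3] | BM4D [2] | LRMR [4] | HSIDCNN [7] | LLRT [5] | NGMeet [6] | Proposed |
|---|---|---|---|---|---|---|---|---|
| $\sigma = 5$ | MPSNR | 39.009 ± 0.0034 | 41.188 ± 0.0023 | 40.878 ± 0.0036 | 41.684 ± 0.0025 | 41.532 ± 0.0054 | 41.781 ± 0.0052 | 41.580 ± 0.0043 |
| | MSSIM | 0.9926 ± 0.0002 | 0.9962 ± 0.0001 | 0.9952 ± 0.0001 | 0.9966 ± 0.0001 | 0.9968 ± 0.0001 | 0.9966 ± 0.0001 | 0.9972 ± 0.0001 |
| $\sigma = 25$ | MPSNR | 30.672 ± 0.0033 | 31.136 ± 0.0025 | 33.029 ± 0.0023 | 33.050 ± 0.0028 | 34.701 ± 0.0097 | 35.366 ± 0.0094 | 35.527 ± 0.0104 |
| | MSSIM | 0.9629 ± 0.0002 | 0.9685 ± 0.0002 | 0.9809 ± 0.0001 | 0.9813 ± 0.0001 | 0.9862 ± 0.0001 | 0.9880 ± 0.0001 | 0.9902 ± 0.0001 |
| $\sigma = 50$ | MPSNR | 26.832 ± 0.0052 | 26.752 ± 0.0034 | 28.806 ± 0.0043 | 28.968 ± 0.0039 | 30.759 ± 0.0115 | 31.669 ± 0.0139 | 32.070 ± 0.0102 |
| | MSSIM | 0.9246 ± 0.0001 | 0.9208 ± 0.0002 | 0.9532 ± 0.0001 | 0.9536 ± 0.0001 | 0.9705 ± 0.0001 | 0.9752 ± 0.0001 | 0.9796 ± 0.0001 |
| $\sigma = 75$ | MPSNR | 24.682 ± 0.0054 | 24.261 ± 0.0035 | 26.306 ± 0.0046 | 26.753 ± 0.0039 | 28.385 ± 0.0134 | 29.116 ± 0.0147 | 29.862 ± 0.0175 |
| | MSSIM | 0.8866 ± 0.0001 | 0.8670 ± 0.0001 | 0.9192 ± 0.0001 | 0.9273 ± 0.0001 | 0.9525 ± 0.0002 | 0.9594 ± 0.0001 | 0.9673 ± 0.0001 |
| $\sigma = 100$ | MPSNR | 23.175 ± 0.0048 | 22.577 ± 0.0054 | 24.310 ± 0.0047 | 25.296 ± 0.0043 | 26.712 ± 0.0145 | 27.756 ± 0.0083 | 28.239 ± 0.0176 |
| | MSSIM | 0.8494 ± 0.0003 | 0.8119 ± 0.0002 | 0.8799 ± 0.0002 | 0.9014 ± 0.0001 | 0.9328 ± 0.0001 | 0.9454 ± 0.0001 | 0.9535 ± 0.0002 |
| rand(25) | MPSNR | 28.843 ± 0.0025 | 34.424 ± 0.0034 | 36.094 ± 0.0033 | 37.367 ± 0.0028 | 34.360 ± 2.6908 | 36.040 ± 0.3682 | 37.301 ± 0.1633 |
| | MSSIM | 0.9331 ± 0.0001 | 0.9833 ± 0.0002 | 0.9856 ± 0.0001 | 0.9916 ± 0.0001 | 0.9718 ± 0.0275 | 0.9904 ± 0.0001 | 0.9917 ± 0.0004 |
| Gaussian (Eq. 9) | MPSNR | 28.200 ± 0.0023 | 34.109 ± 0.0037 | 35.962 ± 0.0025 | 36.804 ± 0.0029 | 28.635 ± 0.0019 | 35.402 ± 0.0053 | 37.722 ± 0.0080 |
| | MSSIM | 0.9119 ± 0.0002 | 0.9794 ± 0.0001 | 0.9893 ± 0.0001 | 0.9895 ± 0.0001 | 0.9094 ± 0.000 | 0.9894 ± 0.0001 | 0.9929 ± 0.0001 |
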
In this section, we describe the proposed attention-based deep residual network for HSI denoising in detail. The overall architecture of our network is illustrated in Fig. 1(a), where $y_{\text{spatial}}$ represents an input noisy band and $y_{\text{spectral}}$ denotes its $K$ adjacent bands. The multi-scale feature extraction module acquires the spatial contextual and spectral correlation information for further processing. Then the multi-level feature representation module is tailored to construct the residual noise. Finally, the clean signal is obtained by subtracting the residual from the spatial input. In the following, we elaborate on the blocks and the loss function of our network.
2.1 Feature Extraction Block
The ground objects in HSIs naturally have various sizes in different regions. This implies that our denoising network should be able to capture contextual information at multiple scales. Inspired by Inception [9], four types of convolution layers with receptive field sizes of 1, 3, 5 and 7 are adopted in our network to extract both the spatial and spectral features, as shown in Fig. 1(b). Furthermore, to avoid an expensive computational burden and to speed up testing, a $1 \times 1$ convolution layer is inserted to reduce the channel dimension when the filter size is larger than 1.
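The saving obtained from the channel-reducing layer can be illustrated by counting weights. The following is a minimal sketch, assuming 2-D convolutions without bias; the channel numbers (64 input, 32 output, 16 reduced) and the helper names are hypothetical and are not taken from the network described here.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Number of weights in a 2-D convolution layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def branch_params(kernel_size, in_channels, out_channels, reduced_channels=None):
    """Weights of one branch, optionally preceded by a 1x1 channel-reduction layer."""
    if reduced_channels is None or kernel_size == 1:
        return conv_params(kernel_size, in_channels, out_channels)
    reduce = conv_params(1, in_channels, reduced_channels)
    main = conv_params(kernel_size, reduced_channels, out_channels)
    return reduce + main


# Hypothetical channel numbers, chosen only for illustration.
IN_CH, OUT_CH, REDUCED_CH = 64, 32, 16

for k in (1, 3, 5, 7):
    plain = branch_params(k, IN_CH, OUT_CH)
    reduced = branch_params(k, IN_CH, OUT_CH, REDUCED_CH)
    print(f"kernel {k}: plain={plain}, with reduction={reduced}")
```

Under these assumed channel numbers, the relative saving grows with the kernel size, which is why the reduction is only applied to the branches with filter size larger than 1.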
2.2 Residual Block
As the network goes deeper, information extracted in the early stages of the network may vanish or "wash out" by the time it reaches the output layer [10]. In addition, deeper networks often suffer from the vanishing gradient problem, which makes the training process slow or even divergent. To address these problems, we adopt the shortcut connection from ResNet [11] to pass the early feature maps directly to the later layers, as illustrated in Fig. 1(c). This greatly increases the flow of information and thus contributes to both the prediction of the residual noise and the back-propagation of gradients, thereby accelerating the training process.
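The effect of the shortcut on gradient flow can be seen in a scalar toy model. This is only a sketch under strong simplifying assumptions: each layer is reduced to a single local derivative, and the values (20 layers with slope 0.1) are hypothetical.

```python
def plain_chain_gradient(layer_derivatives):
    """Gradient through stacked layers y = f(x): product of the local derivatives."""
    grad = 1.0
    for d in layer_derivatives:
        grad *= d
    return grad


def residual_chain_gradient(layer_derivatives):
    """Gradient through stacked residual layers y = x + f(x): product of (1 + f'(x))."""
    grad = 1.0
    for d in layer_derivatives:
        grad *= 1.0 + d
    return grad


# Hypothetical local derivatives: 20 layers, each with a small slope of 0.1.
derivs = [0.1] * 20
print(plain_chain_gradient(derivs))     # shrinks towards zero
print(residual_chain_gradient(derivs))  # stays above 1 because of the identity path
```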
2.3 Channel Attention Block
A traditional CNN treats each channel of a feature map equally, which lacks discriminative learning ability across channels and thus limits the representation power of deep networks. We observe that the feature maps extracted from the spectral input contribute differently to the final denoising result, and some of them may not be beneficial at all. Thus, the network should concentrate on the significant features. Moreover, in our residual learning strategy, the convolution kernels responsible for high-frequency extraction should receive more attention to facilitate the prediction of noise. In view of these concerns, we introduce a channel attention block to adaptively modulate the feature representation.
The structure of our channel attention block (CAB) is illustrated in Fig. 1(d). For the $i$-th CAB, we have

$$F_i = F_{i-1} + \alpha_i \cdot R_i \qquad (1)$$

where $F_{i-1}$ and $F_i$ are the input and output feature maps, respectively, and $R_i$ is the residual component obtained by two stacked convolution layers:

$$R_i = W_{i,2} * \delta(W_{i,1} * F_{i-1}) \qquad (2)$$

where $W_{i,1}$ and $W_{i,2}$ are weight sets and $\delta$ denotes the ReLU function. $\alpha_i$ is the learned calibration weight, for which we first apply global average pooling to $R_i$. A convolution layer with ReLU follows to downsample the channel number by the ratio $r$. Then, the channel number is increased back to the original amount through a convolution layer with Sigmoid, which guarantees that $\alpha_i$ lies in [0,1]:

$$\alpha_i = \mathrm{Sigmoid}(W_U * \delta(W_D * \mathrm{GAP}(R_i))) \qquad (3)$$

where $W_D$ and $W_U$ are weight sets and $\mathrm{GAP}$ denotes the global average pooling operation.
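The calibration steps described above (global average pooling, channel reduction with ReLU, channel restoration with Sigmoid, rescaling) can be sketched in plain Python. This is an illustrative sketch rather than the network's implementation: it assumes that the rescaled residual component is added back to the block input, as in the residual channel attention block of [12], it takes the residual component as given instead of computing the two stacked convolutions, and all function names and weight values are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """Average each channel (a 2-D list) down to a single number."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def matvec(weights, vector):
    """Apply a dense weight matrix; on a 1x1 spatial map a convolution reduces to this."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def channel_attention_block(f_in, residual, w_down, w_up):
    """Return f_in + alpha * residual, with one weight alpha[c] in [0, 1] per channel."""
    pooled = global_average_pool(residual)
    hidden = [max(0.0, h) for h in matvec(w_down, pooled)]             # ReLU
    alpha = [1.0 / (1.0 + math.exp(-u)) for u in matvec(w_up, hidden)]  # Sigmoid
    f_out = [
        [[x + a * r for x, r in zip(row_x, row_r)] for row_x, row_r in zip(ch_x, ch_r)]
        for ch_x, ch_r, a in zip(f_in, residual, alpha)
    ]
    return f_out, alpha


# Toy example: 4 channels of size 2x2, reduced to 2 channels (hypothetical ratio 2).
f_in = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
residual = [[[float(c), float(c)], [float(c), float(c)]] for c in range(4)]
w_down = [[0.5, 0.0, 0.0, 0.5], [0.0, 0.5, 0.5, 0.0]]      # 2 x 4, made-up weights
w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # 4 x 2, made-up weights
f_out, alpha = channel_attention_block(f_in, residual, w_down, w_up)
print(alpha)
```

Because the Sigmoid bounds every weight to [0,1], a channel can at most pass its residual component unchanged and can be suppressed almost entirely, which is the adaptive modulation the block is meant to provide.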
2.4 Residual Learning and Loss Function
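In order to avoid the degradation phenomenon as the network goes deeper and to ensure the stability and effectiveness of the training process, our network does not directly predict the clean image, but outputs the residual noise $\hat{n}$:

$$\hat{n} = \mathcal{F}(y_{\text{spatial}}, y_{\text{spectral}}; \Theta) \qquad (4)$$

where $y_{\text{spatial}}$ and $y_{\text{spectral}}$ are the spatial and spectral inputs and $\Theta$ denotes the model parameters learned by the back-propagation algorithm. The restored clean image $\hat{x}$ is then obtained by subtracting the residual noise from the spatial input:

$$\hat{x} = y_{\text{spatial}} - \hat{n} \qquad (5)$$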
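The loss function of our training process consists of two parts, the reconstruction loss $\mathcal{L}_{rec}$ and the regularization loss $\mathcal{L}_{reg}$:

$$\mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{reg} \qquad (6)$$

where $\lambda$ controls the trade-off between the two terms. $\mathcal{L}_{rec}$ aims to ensure that the restored result approximates the ground truth:
(7)
while $\mathcal{L}_{reg}$ is used to enforce that the residual noise follows a zero-mean distribution:
(8)
where $N$ denotes the number of samples in a training batch, and $H$ and $W$ are the height and width of the training images.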
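One possible reading of the two loss terms is sketched below. The exact norms are assumptions: the reconstruction term is taken to be a mean squared error and the zero-mean term to be the squared spatial mean of each predicted noise map, averaged over the batch. The function names are hypothetical; the default trade-off value of 10 is the one stated in the implementation details.

```python
def restore(noisy_band, predicted_noise):
    """Restored band = spatial input minus predicted residual noise."""
    return [[y - n for y, n in zip(row_y, row_n)]
            for row_y, row_n in zip(noisy_band, predicted_noise)]


def reconstruction_loss(restored_batch, clean_batch):
    """Mean squared error between restored and clean patches (assumed form)."""
    total, count = 0.0, 0
    for restored, clean in zip(restored_batch, clean_batch):
        for row_r, row_c in zip(restored, clean):
            for r, c in zip(row_r, row_c):
                total += (r - c) ** 2
                count += 1
    return total / count


def zero_mean_loss(noise_batch):
    """Squared spatial mean of each predicted noise map, batch-averaged (assumed form)."""
    total = 0.0
    for noise in noise_batch:
        height, width = len(noise), len(noise[0])
        mean = sum(sum(row) for row in noise) / (height * width)
        total += mean ** 2
    return total / len(noise_batch)


def total_loss(noisy_batch, noise_batch, clean_batch, tradeoff=10.0):
    """Reconstruction loss plus the weighted zero-mean regularization loss."""
    restored_batch = [restore(y, n) for y, n in zip(noisy_batch, noise_batch)]
    return (reconstruction_loss(restored_batch, clean_batch)
            + tradeoff * zero_mean_loss(noise_batch))


# Toy batch with one 2x2 patch; the predicted noise is exact and zero-mean.
clean = [[[0.5, 0.5], [0.5, 0.5]]]
noise = [[[0.1, -0.1], [-0.1, 0.1]]]
noisy = [[[c + n for c, n in zip(rc, rn)] for rc, rn in zip(pc, pn)]
         for pc, pn in zip(clean, noise)]
print(total_loss(noisy, noise, clean))
```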
2.5 Implementation Details
The number of adjacent bands $K$ is set to 64, the downsampling ratio $r$ is set to 10 as in [12], and the trade-off parameter $\lambda$ is equal to 10 throughout training. We use a truncated normal distribution to initialize the weights and train the network from scratch. For optimization, we use Adam [13] with a mini-batch size of 382 (twice the number of bands), while the remaining parameters of Adam follow the default settings in TensorFlow [14]. The learning rate starts from 0.0001 and decays exponentially every fixed number of training steps (such as 5000). Training runs for about 300,000 iterations in total.
3 Experiments
In this section, extensive experimental results are provided to validate the effectiveness of our method. Several state-of-the-art methods are used for comparison, including BM4D [2], low-rank tensor approximation (LRTA) [3], LRMR [4], HSIDCNN [7], LLRT [5] and NGMeet [6]. MPSNR [15] and MSSIM [16] serve as the evaluation criteria. Better HSI denoising results lead to higher MPSNR and MSSIM values.
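For reference, MPSNR is commonly computed as the PSNR of each band averaged over all bands. The sketch below follows that common definition and assumes data normalized to [0,1], so the peak value is 1; the function names and the toy cube are hypothetical.

```python
import math


def psnr(clean_band, restored_band, peak=1.0):
    """Peak signal-to-noise ratio of one band, in dB."""
    squared_error, count = 0.0, 0
    for row_c, row_r in zip(clean_band, restored_band):
        for c, r in zip(row_c, row_r):
            squared_error += (c - r) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)


def mpsnr(clean_cube, restored_cube, peak=1.0):
    """Mean PSNR over all bands of a cube stored as [band][row][column]."""
    values = [psnr(c, r, peak) for c, r in zip(clean_cube, restored_cube)]
    return sum(values) / len(values)


# Toy cube with two 2x2 bands; every pixel is off by 0.1 in band 0 and by 0.01 in band 1.
clean = [[[0.5, 0.5], [0.5, 0.5]], [[0.2, 0.2], [0.2, 0.2]]]
restored = [[[0.6, 0.6], [0.6, 0.6]], [[0.21, 0.21], [0.21, 0.21]]]
print(round(mpsnr(clean, restored), 2))
```

In this toy case the per-band values are about 20 dB and 40 dB, so the mean is about 30 dB. MSSIM is averaged over bands in the same way, with the structural similarity index of [16] in place of the PSNR.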
We follow exactly the same settings for deep model training and testing as HSIDCNN [7]. We use the Washington DC Mall image, of which one part is selected for testing and the remaining part is used for training. First, we use the ENVI software to normalize the gray values of each HSI band to [0,1]. Then we crop patches from the training part with a stride of 5. The simulated noisy patches are generated by adding additive white Gaussian noise (AWGN) with standard deviations of [5, 25, 50, 75, 100] to form the training data. For the simulated HSI denoising process, three types of noise are employed. First, different bands have the same noise intensity; for instance, $\sigma$ is set from 5 to 100, as listed in Table 1. Second, the noise intensity of different bands follows a random probability distribution, labeled as rand(25). Third, the noise intensity also differs between bands, but varies like a Gaussian distribution centered at the middle band:
(9) 
where , and in our settings.
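The three noise settings can be simulated along the following lines. This is a sketch under several assumptions that are not stated in the text: rand(25) is read as a per-band noise level drawn uniformly from [0, 25], the Gaussian-shaped profile uses a hypothetical peak and width instead of the parameters of Eq. (9), and the noise levels are interpreted on a 0-255 scale while the data lie in [0,1].

```python
import math
import random


def constant_sigmas(num_bands, sigma):
    """Case 1: every band shares the same noise level."""
    return [float(sigma)] * num_bands


def random_sigmas(num_bands, max_sigma, rng):
    """Case 2: each band draws its noise level uniformly from [0, max_sigma] (assumed)."""
    return [rng.uniform(0.0, max_sigma) for _ in range(num_bands)]


def gaussian_sigmas(num_bands, peak, width):
    """Case 3: noise level follows a bell curve centred at the middle band (assumed shape)."""
    centre = (num_bands - 1) / 2.0
    return [peak * math.exp(-((b - centre) ** 2) / (2.0 * width ** 2))
            for b in range(num_bands)]


def add_awgn(cube, sigmas, rng, scale=255.0):
    """Add zero-mean Gaussian noise band by band; sigma is given on a 0-255 scale (assumed)."""
    return [[[pixel + rng.gauss(0.0, sigma / scale) for pixel in row] for row in band]
            for band, sigma in zip(cube, sigmas)]


rng = random.Random(0)
cube = [[[0.5] * 4 for _ in range(4)] for _ in range(5)]  # 5 bands of 4x4 pixels
noisy = add_awgn(cube, gaussian_sigmas(5, peak=50.0, width=1.0), rng)
print(gaussian_sigmas(5, peak=50.0, width=1.0))
```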
The averages and standard deviations of MPSNR and MSSIM are obtained from 10 repeated runs of the compared methods. The best performance for each quality criterion is marked in bold and the second-best one is underlined. Compared with the other algorithms, the proposed ADRN achieves the highest MPSNR and MSSIM values at almost all noise levels except the case $\sigma = 5$. Under such a small noise level, all methods achieve relatively good performance and the gap between them is small. In contrast, as the noise becomes stronger and more complicated, our approach clearly outperforms the other algorithms.
It is worth noting that NGMeet achieves the best HSI denoising performance in the literature. However, it assumes that the noise follows an independent and identically distributed (i.i.d.) Gaussian distribution, and its performance drops dramatically when it encounters non-i.i.d. noise.
To further demonstrate the effectiveness of our proposed method, Fig. 2 and Fig. 3 show the pseudo-color images of the test data (composed of bands 57, 27 and 17) after denoising under two different noise cases. The MPSNR and MSSIM values of each method are marked under the denoised images. Although LLRT and NGMeet show good noise reduction ability under uniform noise intensities, they do not work well when the noise intensity differs between bands. Our proposed method achieves the best performance in both objective and subjective evaluations, which demonstrates its effectiveness.
4 Conclusion
In this paper, we presented an attention-based deep residual network for HSI denoising. Both a single band and its adjacent bands are simultaneously fed into the model to fully exploit the spatial-spectral structural correlation. Then, by incorporating convolution layers with various receptive fields, shortcut connections and the channel attention mechanism, we build a multi-scale feature extraction module and a multi-level feature representation module, which respectively capture the multi-scale spatial-spectral features and fuse feature representations of different levels for the final restoration. Furthermore, we adopt the residual learning strategy to ensure the stability and efficiency of the training procedure. The simulated experiments indicate that our proposed ADRN outperforms mainstream methods in both quantitative and visual evaluations.
References
[1] Behnood Rasti, Paul Scheunders, Pedram Ghamisi, Giorgio Licciardi, and Jocelyn Chanussot, “Noise reduction in hyperspectral imagery: Overview and application,” Remote Sensing, vol. 10, no. 3, pp. 482, 2018.
[2] Matteo Maggioni, Vladimir Katkovnik, Karen Egiazarian, and Alessandro Foi, “Nonlocal transform-domain filter for volumetric data denoising and reconstruction,” IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 119–133, 2012.
[3] Nadine Renard, Salah Bourennane, and Jacques Blanc-Talon, “Denoising and dimensionality reduction using multilinear tools for hyperspectral images,” IEEE Geoscience and Remote Sensing Letters, vol. 5, no. 2, pp. 138–142, 2008.
[4] Hongyan Zhang, Wei He, Liangpei Zhang, Huanfeng Shen, and Qiangqiang Yuan, “Hyperspectral image restoration using low-rank matrix recovery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 8, pp. 4729–4743, 2013.
[5] Yi Chang, Luxin Yan, and Sheng Zhong, “Hyper-Laplacian regularized unidirectional low-rank tensor recovery for multispectral image denoising,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4260–4268.
[6] Wei He, Quanming Yao, Chao Li, Naoto Yokoya, and Qibin Zhao, “Non-local meets global: An integrated paradigm for hyperspectral denoising,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 6868–6877.
[7] Qiangqiang Yuan, Qiang Zhang, Jie Li, Huanfeng Shen, and Liangpei Zhang, “Hyperspectral image denoising employing a spatial–spectral deep residual convolutional neural network,” IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 2, pp. 1205–1218, 2018.
[8] Qiang Zhang, Qiangqiang Yuan, Jie Li, Xinxin Liu, Huanfeng Shen, and Liangpei Zhang, “Hybrid noise removal in hyperspectral imagery with a spatial-spectral gradient network,” IEEE Transactions on Geoscience and Remote Sensing, 2019.
[9] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
[10] Li Da, Li Lin, and Li Xiang, “Classification of remote sensing images based on densely connected convolutional networks,” Computer Era, 2018.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[12] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu, “Image super-resolution using very deep residual channel attention networks,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 286–301.
[13] Diederik Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” Computer Science, 2014.
[14] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, and Michael Isard, “TensorFlow: A system for large-scale machine learning,” in 12th USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–283.
[15] Q. Huynh-Thu and M. Ghanbari, “Scope of validity of PSNR in image/video quality assessment,” Electronics Letters, vol. 44, no. 13, pp. 800–801, 2008.
[16] Zhou Wang, Alan Conrad Bovik, Hamid Rahim Sheikh, and Eero P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.