1 Introduction
Image denoising is a classical yet still active topic in computer vision, and it has become an indispensable step in many image processing applications. In recent years, various algorithms have been proposed, including nonlocal self-similarity (NSS) models [9], sparse representation models [10], and deep learning approaches [11] [12] [13]. Among them, BM3D [14], WNNM [15] and CSF [16] are considered the state-of-the-art methods among non-deep-learning approaches. NSS models such as BM3D provide high image quality and efficiency, and they are very effective for Gaussian denoising with a known noise level. Recently, many state-of-the-art CNN algorithms such as IRCNN [7] have outperformed the nonlocal and collaborative filtering approaches. These deep CNNs are trained to learn the image prior, which shows that CNNs have a strong ability to fit the structures and patterns inside an image.
Deep learning methods have achieved massive success in classification [17] as well as in other computer vision problems [18] [8], and many CNN algorithms have consequently been proposed for image denoising. In image restoration tasks, it is important to use prior information properly. Many prior-based approaches such as WNNM involve a complex optimization problem in the inference stage, which makes it hard to achieve high performance without sacrificing computational efficiency. To overcome this limitation, several discriminative learning methods have been developed to learn image prior models in the inference procedure. Models of this kind avoid the iterative optimization procedure in the test phase. Following [11] [7], image denoising can be seen as a Maximum A Posteriori (MAP) problem from the Bayesian perspective, and a deep CNN can be used to learn the prior as a denoiser. The motivation of this work is to ask whether we can strengthen the prior from the view of convolution itself. The work of [5] shows that multiscale convolutions can help to extract more features from the previous layer. Reference [6] introduced dilated convolution, which enlarges the receptive field without increasing the amount of computation. Subsequently, the work of [8] shows that a dilated residual network can perform better than a residual network in image classification. In image denoising, because adjacent pixels differ only slightly, dilated convolution can bring more discrepancy information from earlier layers to later layers. In addition, it can increase the generalization ability of the model at no extra computation cost. The residual module accelerates the training process and helps prevent vanishing gradients.
This paper proposes an enhanced denoiser based on a residual dilated convolutional neural network. Inspired by the residual learning insight proposed by [11], we modify the dilated residual network based on residual learning. We treat image denoising as a plain discriminative learning problem, in which a CNN is used to separate the noise from the noisy image. The reasons for using a CNN can be summarized as follows. First, a CNN with a very deep architecture is effective in exploiting image characteristics and in increasing the capacity and flexibility of the model. Second, a lot of progress has been made in CNN training methods, including the parametric rectified linear unit (PReLU) [30], batch normalization and residual architectures. These methods can speed up the training process and improve the denoising performance. Third, there are many tools and libraries that support parallel computation for CNNs on GPU, which can significantly improve run-time performance. In contrast to existing residual networks, we use multiscale convolution to extract more information from the original image and a hybrid dilated convolution module to avoid the gridding effect. In the experiments, we compare against several state-of-the-art methods, such as BM3D, IRCNN, DnCNN [11] and FFDNet [31]. For Gaussian denoising, our results show that the proposed enhanced denoiser produces better images with only half the parameters of other CNN methods, and that the proposed model has a competitive run time. In summary, this paper has the following two main contributions:
First, we propose a lightweight and effective image denoiser based on a multiscale convolution group. The experiments show that the proposed model achieves better performance and speed than the current state-of-the-art methods.
Second, we show that the proposed network can handle both gray and color image denoising robustly without increasing the number of parameters.
The rest of this paper is organized as follows. Section 2 reviews recent image denoising approaches. Section 3 formally describes our research problem and method in detail. Section 4 presents the experimental results of the proposed method and the comparison with one baseline model and five other models. Finally, we conclude in Section 5.
2 Related Work
Here, we provide a brief review of image denoising methods. Harmeling et al. [21] were the first to apply a multilayer perceptron (MLP) to image denoising, using image patches and large image databases to achieve excellent results. In [13], a trainable nonlinear reaction diffusion (TNRD) model was proposed, in which all the parameters can be learned simultaneously from training data through a loss-based approach. It can be expressed as a feed-forward deep network by unfolding a fixed number of gradient inference steps. DeepAM [12] consists of two steps, proximal mapping and end continuation. It is a regularization-based approach for image restoration, which enables the CNN to operate as a prior or regularizer in the alternating minimization (AM) algorithm. IRCNN [7] uses the half quadratic splitting (HQS) framework to show that a CNN denoiser can bring a strong image prior into model-based optimization methods. All the above methods show that decoupling the fidelity term and the regularization term enables a wide variety of existing denoising models to solve image denoising tasks.
Residual learning has multiple realizations. The first approach uses a skip connection from one layer to another during forward and backward propagation. It was first introduced by [19] to solve the vanishing gradient problem when training very deep architectures for image classification. In low-level computer vision problems, a residual module has been implemented within three convolution blocks by a skip connection. The other residual implementation transforms the label data into the difference between the input data and the clean data. This form of residual learning [11] not only speeds up training, but also makes the weights of the network sparser.
Dilated convolution was originally developed for wavelet decomposition [IEEEexample:Holschneider1989A]. The main idea of dilated convolution is to increase the image resolution by inserting "holes" between pixels. Dilated convolution enables dense feature extraction in deep CNNs and enlarges the field of view of the convolutional kernel. Chen et al. [1] designed an atrous spatial pyramid pooling (ASPP) scheme to capture multiscale objects and context information by using multiple dilated convolutions. In image denoising, Wang et al. [32] proposed an approach to calculate the receptive field size when dilated convolution is included.
3 Method
3.1 Dilated filter
Dilated filters were introduced to enlarge the receptive field. In image denoising, context information facilitates the reconstruction of corrupted pixels. In a CNN, there are two basic ways to enlarge the receptive field and capture more context: increasing the filter size or increasing the depth. According to existing network designs [5] [19], using small filters with a large depth is a popular choice. In this paper, we use dilated filters and keep the merits of traditional convolution. In Fig. 1, the image filtered with different dilation rates shows the different receptive fields. A $3 \times 3$ dilated filter with dilation factor $s$ can be interpreted as a sparse filter of size $(2s+1) \times (2s+1)$ in which only 9 entries at fixed positions can be nonzero. However, the use of dilated convolutions may cause gridding artifacts [6], which occur when a feature map has higher-frequency content than the sampling rate of the dilated convolution. The hybrid dilated convolution (HDC) proposed by [8] addresses this issue theoretically. Suppose $n$ convolutional layers with kernel size $K \times K$ have dilation rates $[r_1, \ldots, r_i, \ldots, r_n]$. The goal of HDC is to let the final receptive field cover a square region without any holes or missing edges. The maximum distance between two nonzero values can be defined as
$$M_i = \max\left[ M_{i+1} - 2 r_i,\; M_{i+1} - 2 (M_{i+1} - r_i),\; r_i \right] \qquad (1)$$
where the design goal is to let $M_2 \le K$, with $M_n = r_n$. For example, for kernel size $K = 3$, the pattern $r = [1, 2, 5]$ works as $M_2 = 2$; however, the pattern $r = [1, 2, 9]$ does not work as $M_2 = 5$. A benefit of HDC is that it can be naturally integrated with the original layers of the network, without adding extra modules.
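As a quick sanity check on the HDC criterion, the following Python sketch evaluates the maximum-distance recurrence of [8], assumed here to be M_i = max[M_{i+1} - 2 r_i, M_{i+1} - 2 (M_{i+1} - r_i), r_i] with M_n = r_n, and tests the design goal M_2 <= K. The function names are hypothetical and are not taken from the proposed implementation.

```python
def hdc_max_gaps(rates):
    """Return [M_1, ..., M_n] for a list of dilation rates [r_1, ..., r_n].

    Assumed recurrence (HDC, ref. [8]):
        M_n = r_n
        M_i = max(M_{i+1} - 2*r_i, M_{i+1} - 2*(M_{i+1} - r_i), r_i)
    """
    m = rates[-1]                      # M_n = r_n
    gaps = [m]
    for r in reversed(rates[:-1]):     # walk from layer n-1 down to layer 1
        m = max(m - 2 * r, m - 2 * (m - r), r)
        gaps.append(m)
    gaps.reverse()                     # gaps[i] now holds M_{i+1}
    return gaps


def satisfies_hdc(rates, kernel_size=3):
    """Check the design goal M_2 <= K (needs at least two layers)."""
    gaps = hdc_max_gaps(rates)
    return gaps[1] <= kernel_size


print(hdc_max_gaps([1, 2, 5]), satisfies_hdc([1, 2, 5]))  # M_2 = 2, goal met
print(hdc_max_gaps([1, 2, 9]), satisfies_hdc([1, 2, 9]))  # M_2 = 5, goal violated
```

Under these assumptions, the rate pattern [1, 2, 5] passes the check for a 3 x 3 kernel, while [1, 2, 9] does not.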
HDC makes better use of the receptive field information. A convolution with dilation rate 1 and a convolution with dilation rate 5 extract feature information at different levels. In other words, dilated convolution can extract information at different scales.
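To make the link between dilation rates, receptive field and gridding concrete, here is a minimal Python sketch. It assumes stride-1 convolutions with 3-tap kernels along one axis, and the rate patterns used are illustrative choices rather than the rates of the proposed network. It computes the receptive field size and the set of input offsets that can actually reach one output position.

```python
def receptive_field(rates, kernel_size=3):
    """Side length of the receptive field of stacked stride-1 dilated convolutions."""
    size = 1
    for rate in rates:
        size += (kernel_size - 1) * rate
    return size


def covered_offsets(rates):
    """1-D input offsets that can influence one output position (3-tap kernels)."""
    offsets = {0}
    for rate in rates:
        offsets = {o + tap * rate for o in offsets for tap in (-1, 0, 1)}
    return offsets


for rates in ([1, 2, 5], [2, 2, 2]):
    rf = receptive_field(rates)
    used = len(covered_offsets(rates))
    print(rates, "receptive field:", rf, "positions actually used:", used)
```

With the hybrid pattern [1, 2, 5], all 17 positions inside the receptive field contribute. With the constant pattern [2, 2, 2], only 7 of 13 positions are used, which is the kind of hole that leads to gridding.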
3.2 Multiscale convolution group
Multiscale feature extraction is a common technique in computer vision, as it makes use of feature maps at different levels. In deep CNNs, kernels of different sizes extract features at different scales. Adding a multiscale structure increases the width of the network and, at the same time, improves its generalization. Inspired by the Inception module [IEEEexample:Szegedy2014Going], we propose the multiscale convolution group. The Inception module consists of pooling and convolution layers. For image denoising, because the output image must keep the size of the input, we remove the pooling layer. We apply filters at three scales, with 12, 20 and 32 filters respectively. This combination is chosen because the feature maps sum to 64 and it balances feature extraction against the number of parameters. Unlike the Inception module, we concatenate the feature maps directly, which significantly reduces the number of parameters. Considering the computation cost, we use the multiscale module only in the first layer.
3.3 Architecture
Our proposed network structure, illustrated in Fig. 2, is inspired by [7] and [8]. It consists of eleven convolution layers with different dilation rates. The first layer is the multiscale convolution group, and the residual connections start from the second layer. Skip connections are used for training deep networks because they alleviate the vanishing gradient problem [25] [26]. Another advantage is that the residual module makes the network weights sparse, which can reduce the inference time. Each dilated convolution block is followed by a batch normalization (BN) [24] layer and a parametric rectified linear unit (PReLU) [30]. Such design techniques have been widely used in recent CNN architectures. In particular, it has been pointed out that this combination not only enables fast and stable training but also tends to give better results in Gaussian denoising [7]. The PReLU and BN layers accelerate the convergence of the network. Image denoising aims to recover a clean image from a noisy observation. We assume that the noisy image can be expressed as $y = x + v$, where $x$ stands for the clean image and $v$ is the unknown Gaussian noise. Because a residual learning strategy is applied, the labels are obtained as the difference between the input image and the clean image. The output of our network is therefore the residual image, which is a prediction of the noise.
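The residual learning setup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the noisy image is modeled as clean image plus additive Gaussian noise, the noise level 25/255 and the tiny 2 x 2 image are arbitrary, and the network prediction is replaced by the ground-truth residual as a stand-in.

```python
import random


def add_gaussian_noise(clean, sigma, rng):
    """Noisy observation: clean image plus i.i.d. Gaussian noise."""
    return [[pixel + rng.gauss(0.0, sigma) for pixel in row] for row in clean]


def subtract(a, b):
    """Element-wise difference of two images stored as nested lists."""
    return [[p - q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def mse(a, b):
    """Mean squared error between two images of the same shape."""
    diffs = [(p - q) ** 2 for row_a, row_b in zip(a, b) for p, q in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
clean = [[0.2, 0.4], [0.6, 0.8]]                             # toy clean image in [0, 1]
noisy = add_gaussian_noise(clean, sigma=25 / 255, rng=rng)   # sigma is a hypothetical choice

residual_label = subtract(noisy, clean)    # training target: the noise itself
predicted_residual = residual_label        # stand-in for the network output
loss = mse(predicted_residual, residual_label)
restored = subtract(noisy, predicted_residual)   # clean estimate = noisy - predicted noise
```

With a perfect residual prediction the loss is zero and the restored image equals the clean image up to floating-point rounding; a trained network would only approximate the residual.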
4 Experiments
It is widely acknowledged that convolutional neural networks generally benefit from very large training datasets. However, due to limited computing resources, we used only 400 images for the grayscale denoising experiments. According to [7], using a larger dataset does not improve the PSNR results on the BSD68 dataset. For the color image denoising experiments, 400 images selected from the validation set of the ImageNet database [2] form the training dataset; we crop the images into small patches and select a subset of the patches for training.
4.1 Training details
Due to the characteristics of convolution, the output image may exhibit boundary artifacts, so we apply a zero padding strategy and use small patches to tackle this problem. The number of feature maps in each layer is 64, and the depth of the network is set to eleven, which makes it a rather lightweight framework. The input images are cropped into patches of a fixed size, and we use data augmentation techniques to increase the diversity of the training data. The Adam solver [4] is applied to optimize the network parameters. The learning rate starts from an initial value and is decreased by a factor of 10 after 60 epochs. The hyperparameters of Adam are left at their default settings. We use a mini-batch size of 64, which balances memory usage and performance well. The network parameters are initialized using the Xavier initialization [30]. Our denoiser models are trained with the MatConvNet package [3], under Matlab (2016b) on an Nvidia 1080Ti GPU. We train for 100 epochs, and it takes almost half a day to train a denoising model for a specific noise level. 400 images from the publicly available Berkeley segmentation dataset (BSD500) [13] and the Urban 100 dataset [IEEEexample:Huang2015Single] are used for training the Gaussian denoising model. In addition, we generated 3200 images by data augmentation via image rotation, cropping and flipping. For testing, Set12 and BSD68 are used. Note that these images are widely used for the evaluation of denoising models and are not included in the training datasets. For training and validation, Gaussian noise is added to verify the effect of our denoising model.
4.2 Denoising results
We compared the proposed denoiser with several state-of-the-art denoising methods, including two model-based optimization methods (BM3D and WNNM), one discriminative learning method (TNRD), and three deep learning methods (IRCNN, DnCNN and FFDNet). Fig. 4 shows the visual results of the different methods in detail. Both BM3D and WNNM tend to produce over-smoothed textures. TNRD preserves fine details and sharp edges, but it appears to generate artifacts in smooth regions. The three deep learning methods and the proposed method give pleasing results in smooth regions. The proposed method clearly preserves texture better than the other methods, for example in the region above the balcony fence. A further comparison of the different methods in Fig. 5 shows that our method reaches a better visual result.
Table I: Average PSNR (dB) of different methods (columns) on the BSD68 dataset, one row per noise level.
BM3D  WNNM  TNRD  IRCNN  DDRN  DnCNN  FFDNet  Proposed
31.075  31.371  31.422  31.629  31.682  31.718  31.631  31.751
28.568  28.834  28.923  29.145  29.181  29.228  29.189  29.258
25.616  25.874  25.971  26.185  29.213  26.231  26.289  26.323
24.212  24.401  -  24.591  24.617  24.641  24.788  24.793
To show the capacity of the proposed model, we carry out quantitative and qualitative evaluations on two widely used test datasets. The average PSNR results of the different methods on BSD68 and Set12 are shown in Table I and Table V. BSD68 consists of 68 diverse gray images. We make the following observations. First, the proposed method achieves the best average PSNR among the competing methods on BSD68. Compared to the benchmark method BM3D on BSD68, WNNM and TNRD have a notable gain of between 0.3 dB and 0.35 dB, and IRCNN has a PSNR gain of nearly 0.55 dB. In contrast, our proposed model outperforms BM3D by nearly 0.7 dB on all three noise levels. Second, the proposed method is better than DnCNN and FFDNet when the noise level is below 75. This result shows that the proposed method offers a better trade-off between receptive field size and modeling capacity.
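For reference, PSNR is commonly computed as 10 log10(peak^2 / MSE). The short Python sketch below assumes 8-bit images with a peak value of 255; the exact evaluation script used for the reported numbers is not specified, so this is an illustration of the metric rather than the evaluation code.

```python
import math


def psnr(clean, denoised, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images (nested lists)."""
    flat_clean = [p for row in clean for p in row]
    flat_denoised = [p for row in denoised for p in row]
    if len(flat_clean) != len(flat_denoised):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(flat_clean, flat_denoised)) / len(flat_clean)
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


clean = [[100, 120], [140, 160]]
off_by_one = [[101, 119], [141, 159]]   # every pixel differs by exactly 1
print(round(psnr(clean, off_by_one), 2))
```

Because PSNR is logarithmic in the mean squared error, a gain of 0.3 dB corresponds to a reduction of the MSE by a factor of about 10^0.03, or roughly 7%.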
Table V lists the PSNR results of the different methods on the 12 test images. The best two PSNR results for each image at each noise level are highlighted in red and blue. The proposed method achieves one of the top two PSNR values on most of the test images. In terms of average PSNR, the proposed method performs best among all the methods at the two lower noise levels, and it is slightly worse than FFDNet at the highest noise level. This is because FFDNet outperforms the other methods on the images "House" and "Barbara", which contain a rich amount of repetitive structures.
For color image denoising, we use the same network parameters. The only difference is the size of the input tensor. The visual comparisons are shown in Fig. 6 and Fig. 3. CBM3D clearly generates false color artifacts in some regions, while the proposed model recovers the image with more natural color and texture structure, such as sharper edges. In addition, Table II shows that the proposed model outperforms the benchmark method CBM3D at all three noise levels. Meanwhile, the proposed method is more effective than the three deep CNN methods on the color BSD68 dataset.
Table II: Average PSNR (dB) of different methods on the color BSD68 dataset, one column per noise level.
CBM3D  33.52  30.71  27.38
CDDRN  33.93  31.24  27.93
CDnCNN  33.89  31.23  27.92
CFFDNet  33.87  31.21  27.96
Proposed  34.10  31.43  28.09
We give a brief calculation of the number of parameters. Note that the values differ between gray and color image denoising because of the different network depths. For instance, DnCNN uses 17 convolution layers for gray image denoising and 20 for color image denoising, whereas FFDNet uses 15 for gray and 12 for color. In addition, FFDNet uses 64 channels for gray images and 96 channels for color images. In contrast, the proposed method outperforms the other methods without increasing the depth for color image denoising, which indicates that our model is more robust without sacrificing computing resources.
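Table III: Number of parameters of different methods for gray and color image denoising.
Methods  gray/param  color/param
DnCNN  5.6 × 10^5  6.7 × 10^5
FFDNet  5.5 × 10^5  8.3 × 10^5
Proposed  3.3 × 10^5  3.4 × 10^5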
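The order of magnitude of these parameter counts can be checked with a rough Python sketch. It assumes a plain stack of 3 x 3 convolutions with 64 feature maps per layer, and it ignores bias and batch normalization parameters; these simplifications are assumptions for illustration, so the totals only approximate the listed values. The helper names are hypothetical.

```python
def conv_weights(in_channels, out_channels, kernel_size=3):
    """Number of weights in one 2-D convolution layer (bias ignored)."""
    return in_channels * out_channels * kernel_size * kernel_size


def plain_cnn_weights(depth, width, image_channels, kernel_size=3):
    """Weights of a plain CNN: image -> width, (depth - 2) x width -> width, width -> image."""
    first = conv_weights(image_channels, width, kernel_size)
    middle = (depth - 2) * conv_weights(width, width, kernel_size)
    last = conv_weights(width, image_channels, kernel_size)
    return first + middle + last


# Depths taken from the text: 17 layers (gray) and 20 layers (color) for DnCNN.
dncnn_gray = plain_cnn_weights(depth=17, width=64, image_channels=1)
dncnn_color = plain_cnn_weights(depth=20, width=64, image_channels=3)
print(dncnn_gray, dncnn_color)   # 554112 667008
```

Under these assumptions the counts come to roughly 5.5 × 10^5 and 6.7 × 10^5 weights, which is in line with the DnCNN entries. Since each 64-to-64 layer with 3 x 3 kernels holds 36,864 weights, removing layers reduces the total almost linearly, which is consistent with the smaller count reported for the eleven-layer proposed network.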
We also compare the computation time to check the applicability of the proposed method. BM3D and TNRD are used for comparison because of their potential value in practical applications. We use the Nvidia cuDNN v6 deep learning library to accelerate the GPU computation, and we do not count the memory transfer time between CPU and GPU. Since both the proposed denoiser and TNRD support parallel computation on GPU, we also provide the GPU run time. Table IV lists the run time of the different methods for denoising images of size 256×256, 512×512 and 1024×1024. For each test, we run the method several times and report the average run time. The proposed method is very competitive in both CPU and GPU computation. This good performance relative to BM3D can probably be attributed to the following reasons. First, the convolution and the PReLU activation function are simple, effective and efficient. Second, batch normalization is adopted, which is beneficial to Gaussian denoising. Third, the residual architecture not only speeds up the inference of the deep network, but also gives a larger model capacity.
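Table IV: Run time of different methods for denoising images of different sizes.
Methods  Device  256×256  512×512  1024×1024
BM3D  CPU  0.69  0.52  0.371
BM3D  GPU  -  -  -
DnCNN  CPU  2.14  8.62  32.10
DnCNN  GPU  0.018  0.046  0.135
FFDNet  CPU  0.44  1.81  7.32
FFDNet  GPU  0.008  0.016  0.046
Proposed  CPU  0.41  1.62  4.68
Proposed  GPU  0.004  0.009  0.038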
Table V: PSNR (dB) results of different methods on the 12 test images of Set12.
Images  C.man  House  Peppers  Starfish  Monar.  Airpl.  Parrot  Lena  Barbara  Boat  Man  Couple  Average
Noise Level  
BM3D  31.915  34.944  32.701  31.146  31.859  31.076  31.376  34.271  33.114  32.140  31.929  32.108  32.372 
WNNM  32.168  35.129  32.987  31.825  32.712  31.387  31.621  34.273  33.598  32.269  32.115  32.172  32.688 
TNRD  32.154  34.541  33.069  31.702  32.611  31.484  31.709  34.289  32.098  32.143  32.178  32.065  32.502 
IRCNN  32.539  34.886  33.314  32.021  32.816  31.698  31.839  34.528  32.431  32.342  32.398  32.401  32.769 
DnCNN  32.611  34.972  33.297  32.197  33.087  31.696  31.831  34.621  32.638  32.416  32.451  32.471  32.857 
FFDNet  32.417  35.005  33.102  32.021  32.768  31.587  31.768  34.632  32.501  32.348  32.402  32.447  32.749 
Proposed  32.633  35.006  33.261  32.161  33.238  31.722  31.908  34.598  32.586  32.443  32.452  32.454  32.872 
Noise Level  
BM3D  29.452  32.864  30.161  28.561  29.254  28.427  28.934  32.077  30.717  29.908  29.615  29.719  29.969 
WNNM  29.642  33.221  30.421  29.031  29.842  28.693  29.152  32.243  31.239  30.031  29.752  29.821  30.257 
TNRD  29.634  32.504  30.392  28.952  29.978  28.928  29.245  32.016  29.327  29.867  29.824  29.703  30.031 
IRCNN  30.083  33.056  30.825  29.267  30.085  29.087  29.461  32.421  29.924  30.172  30.040  30.081  30.376 
DnCNN  30.176  33.059  30.871  29.405  30.282  29.131  29.432  32.438  29.996  30.211  30.096  30.118  30.432 
FFDNet  30.062  33.268  30.786  29.331  30.142  29.052  29.431  32.586  29.978  30.226  30.101  30.176  30.428 
Proposed  30.233  33.099  30.831  29.447  30.454  29.111  29.462  32.478  30.063  30.232  30.091  30.122  30.471 
Noise Level  
BM3D  26.131  29.693  26.683  25.044  25.818  25.102  25.898  29.051  27.225  26.782  26.808  26.463  26.722 
WNNM  26.447  30.326  26.941  25.442  26.312  25.413  26.132  29.244  27.782  26.968  26.938  26.635  27.025 
TNRD  26.582  29.554  27.235  25.301  26.447  25.499  26.185  28.978  25.729  26.891  26.993  26.522  26.812 
IRCNN  26.878  29.955  27.334  25.567  26.611  25.887  26.521  29.399  26.235  27.173  27.166  26.875  27.136 
DnCNN  27.031  30.002  27.321  25.701  26.781  25.865  26.483  29.385  26.217  27.201  27.242  26.901  27.176 
FFDNet  27.028  30.432  27.428  25.769  26.882  25.901  26.576  29.677  26.476  27.318  27.296  27.068  27.321 
Proposed  27.307  30.177  27.415  25.781  26.931  25.878  26.483  29.484  26.417  27.287  27.233  26.996  27.282 
5 Conclusion
In this paper, we have designed an effective CNN denoiser for image denoising. Specifically, with the aid of skip connections, we can easily train a deep and complex convolutional network. Several deep learning techniques are integrated to speed up the training process and boost the denoising performance. Model-based and prior-based approaches are popular ways to tackle the denoising problem. Following the guidance of prior-based models, we show the possibility of enriching the features by using a multiscale module and a residual HDC module. Extensive experimental results demonstrate that the proposed method not only produces favorable denoising performance quantitatively and qualitatively, but also has a promising run time on GPU. Some work remains for further study. First, it would be a promising direction to train a lightweight denoiser for practical applications. Second, the proposed method could be extended to other image restoration problems, such as image deblurring. Third, it would be interesting to investigate how to remove non-Gaussian noise by exploiting properties of Gaussian denoising models.
References
 [1] Chen L C , Papandreou G , Kokkinos I , et al. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(4):834-848.

 [2] Deng J , Dong W , Socher R , et al. ImageNet: A large-scale hierarchical image database[C]// 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
 [3] Vedaldi A , Lenc K . MatConvNet - Convolutional Neural Networks for MATLAB[J]. 2014.
 [4] Kingma D , Ba J . Adam: A Method for Stochastic Optimization[J]. Computer Science, 2014.
 [5] Szegedy C , Liu W , Jia Y , et al. Going deeper with convolutions[C]// 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2015.
 [6] Yu F , Koltun V , Funkhouser T . Dilated Residual Networks[J]. 2017.
 [7] Zhang K , Zuo W , Gu S , et al. Learning Deep CNN Denoiser Prior for Image Restoration[J]. 2017.
 [8] Wang P , Chen P , Yuan Y , et al. Understanding Convolution for Semantic Segmentation[J]. 2017.
 [9] Mairal J , Bach F , Ponce J , et al. Nonlocal sparse models for image restoration[C]// 2009 IEEE 12th International Conference on Computer Vision. IEEE, 2010.
 [10] Dong W , Zhang L , Shi G . Centralized sparse representation for image restoration[J]. IEEE Transactions on Image Processing, 2011.
 [11] Zhang K , Zuo W , Chen Y , et al. Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising[J]. IEEE TRANSACTIONS ON IMAGE PROCESSING.
 [12] Kim Y , Jung H , Min D , et al. Deeply Aggregated Alternating Minimization for Image Restoration[C]// Computer Vision and Pattern Recognition. IEEE, 2017.
 [13] Chen Y , Pock T . Trainable Nonlinear Reaction Diffusion: A Flexible Framework for Fast and Effective Image Restoration[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 39(6):1256-1272.
 [14] Dabov K , Foi A , Katkovnik V , et al. Image Denoising by Sparse 3-D Transform-Domain Collaborative Filtering[J]. IEEE Transactions on Image Processing, 2007, 16(8):2080-2095.
 [15] Gu S , Zhang L , Zuo W , et al. Weighted Nuclear Norm Minimization with Application to Image Denoising[C]// Computer Vision and Pattern Recognition. IEEE, 2014.
 [16] Schmidt U , Roth S . Shrinkage Fields for Effective Image Restoration[C]// Computer Vision and Pattern Recognition. IEEE, 2014.
 [17] Krizhevsky A , Sutskever I , Hinton G . Imagenet classification with deep convolutional neural networks[C]// NIPS. Curran Associates Inc. 2012.
 [18] Ronneberger O , Fischer P , Brox T . U-Net: Convolutional Networks for Biomedical Image Segmentation[J]. 2015.
 [19] He K , Zhang X , Ren S , et al. Deep Residual Learning for Image Recognition[J]. 2015.
 [20] Gatys L A , Ecker A S , Bethge M . Texture synthesis using convolutional neural networks[C]// International Conference on Neural Information Processing Systems. MIT Press, 2015.
 [21] Harmeling S . Image denoising: Can plain neural networks compete with BM3D?[C]// Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.

 [22] Zhao H , Gallo O , Frosio I , et al. Loss Functions for Image Restoration With Neural Networks[J]. IEEE Transactions on Computational Imaging, 2017, 3(1):47-57.
 [23] Sajjadi M S M , Schölkopf B , Hirsch M . EnhanceNet: Single Image Super-Resolution Through Automated Texture Synthesis[J]. 2016.
 [24] Ioffe S , Szegedy C . Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift[J]. 2015.
 [25] Mao X J , Shen C , Yang Y B . Image Denoising Using Very Deep Fully Convolutional Encoder-Decoder Networks with Symmetric Skip Connections[J]. 2016.
 [26] He K , Zhang X , Ren S , et al. Identity Mappings in Deep Residual Networks[J]. 2016.
 [27] Ulyanov D , Lebedev V , Vedaldi A , et al. Texture Networks: Feedforward Synthesis of Textures and Stylized Images[J]. 2016.
 [28] He K , Zhang X , Ren S , et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[J]. 2015.
 [29] Burger H C , Schuler C , Harmeling S . Learning how to combine internal and external denoising methods[M]// Pattern Recognition. 2013.
 [30] He K , Zhang X , Ren S , et al. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification[J]. 2015.
 [31] Zhang K , Zuo W , Zhang L . FFDNet: Toward a Fast and Flexible Solution for CNN based Image Denoising[J]. IEEE Transactions on Image Processing, 2018:11.
 [32] Wang T , Sun M , Hu K . Dilated Residual Network for Image Denoising[J]. 2017.