I Introduction
Image denoising is a well-studied yet not well-solved problem [1, 2, 3, 4, 5, 6, 7], where the goal is to recover the underlying signal from its contaminated observation. Contaminations can be categorized by their distributions and behaviors, e.g., additive (Gaussian) noise, shot (Poisson) noise, JPEG noise, etc. We focus on salt-and-pepper (s&p) noise, which is an impulse contamination of the image. In an image with s&p noise, pixels take the maximal or minimal value with a predefined probability called the noise level: the higher this value, the more pixels are contaminated. The s&p noise is a special case of the random-valued impulse noise defined in [4] and [5]. For a given noise level, an s&p-contaminated image is defined by Equation (1), where two random values are generated at each pixel: the former determines whether the pixel is contaminated, and the latter controls whether a contaminated pixel takes the maximal (salt) value or the minimal (pepper) value. From Equation 1, one observes that s&p noise is neither like additive (Gaussian) noise, which can be fully separated from the signal [1], nor like shot (Poisson) noise, which is signal dependent [5]. It appears as pure noise at the contaminated locations (called missing pixels in [5]) and therefore erases all signal there. This prevents us from using any optimization method in a continuous space to recover the signal, since the gradients estimated from missing pixels are unreliable and spread to unpolluted locations.
Traditional ways to recover images from s&p pollution all require nonlinear searches and mappings. The search step
[8, 9, 10] generally determines the locations of the contaminated pixels, and the mapping step estimates each contaminated pixel by a weighted average of similar pixel values around it. Filters of this kind are named switching filters. However, as the noise level increases, the search step becomes more and more unreliable. The signal estimation step is also degraded by high-level noise, since the similarity estimation becomes intractable.
[Figure 1: (a) Original; (b) 70% s&p, PSNR 6.72 dB; (c) Median5, 14.01 dB; (d) Median5 x2, 19.14 dB; (e) Median5 x5, 24.09 dB; (f) Median5 x10, 24.89 dB; (g) Median5 x25, 24.52 dB; (h) Our method, 33.07 dB.]
To alleviate this limitation, [4] uses switching templates to avoid noise disturbance when measuring similarity. Based on the similarities, the method extracts repairable information from nonlocal regions instead of local patches. This filter is named the nonlocal switching filter (NLSF). The method then uses a trained convolutional neural network to refine the signal recovered by NLSF, so NLSF acts as a preprocessing step for the neural network. The approach is thus a combination of traditional and learning-based methods.
Besides the usual learning-based approach of training models on pairs of clean images and their noisy versions, the noise-to-noise method [5] trains models only on noisy images. The authors find that training without clean images can match, and sometimes even exceed, the results obtained by training with ground truth. Following this paradigm, [5] shows the ability to remove random-valued impulse noise, which can be considered a superset of s&p noise. To deal with the gradient loss problem introduced by pure noisy pixels, they adopt an annealed version of the “$L_0$ loss” to replace the traditional $L_2$ loss. The loss function gradually changes from $L_2$ to $L_0$ as training progresses. However, the speed of this annealing procedure must be carefully chosen (usually by reducing the power of the norm in the loss function according to the number of iterations). When the prediction is far from the truth, the loss function must be closer to $L_2$; when the prediction gets close enough, $L_0$ becomes more favorable, since the $L_0$ loss emphasizes the number of differing pixels, which leads the learning process to a detail amendment stage.
We introduce local nonlinear search into the neural network without any preprocessing step, and we avoid changing the loss function from the $L_2$ loss to other losses that are not easy to optimize. We resort to the median filter [11], which is the first efficient method for removing salt-and-pepper noise. By incorporating median-filter-like operations into deep neural networks, our method outperforms state-of-the-art methods. Details of our methodology and model design can be found in Section II, evaluations are presented in Section III, and Section IV concludes our work. We release our source code, training dataset and pretrained models at https://github.com/llmpass/medianDenoise for reproducibility.
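As a concrete illustration of the s&p contamination model described above, the following NumPy sketch draws two uniform random values per element: one decides whether the element is contaminated, the other whether it becomes salt or pepper. The function name `add_salt_and_pepper`, the equal salt/pepper probability of 0.5, and the 8-bit value range are assumptions made for this example, not details taken from Equation 1.

```python
import numpy as np


def add_salt_and_pepper(image, noise_level, rng=None, low=0, high=255):
    """Return a contaminated copy of `image` and the contamination mask.

    `noise_level` is the probability (0..1) that an element is replaced.
    """
    rng = np.random.default_rng() if rng is None else rng
    image = np.asarray(image)
    # First random value: is this element contaminated?
    contaminated = rng.random(image.shape) < noise_level
    # Second random value: salt (max) or pepper (min)?
    # Assumption: both outcomes are equally likely.
    salt = rng.random(image.shape) < 0.5
    noisy = image.copy()
    noisy[contaminated & salt] = high
    noisy[contaminated & ~salt] = low
    return noisy, contaminated


# Usage: contaminate a flat gray image at a 70% noise level.
clean = np.full((64, 64), 128, dtype=np.uint8)
noisy, mask = add_salt_and_pepper(clean, 0.7, np.random.default_rng(0))
```

Contaminated elements carry no trace of the original value, which is why the text calls them missing pixels.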
II Methodology
The median filter is a traditional nonlinear filter that is especially efficient at removing impulse noise. It replaces the pixel at the center of a given window with the median of that window. As shown in Figure 1, applying a median filter to a highly contaminated image (Figure 1b) removes spikes and therefore greatly improves the signal-to-noise ratio. Applying a 5×5 median filter once (Figure 1c) and twice (Figure 1d) raises the PSNR from 6.72 dB to 14.01 dB and 19.14 dB, respectively. A natural idea is to repeatedly apply the median filter to the image until all spikes are replaced by the median in a fixed-size local window. This does remove the noise; however, it fails to recover the signal. The Peak Signal to Noise Ratio (PSNR) increases in the first several iterations but finally drops as the image becomes blocky and blurry, see Figure 1g. This phenomenon indicates that the median filter deviates the signal too much from its original shape, which is also the main reason why modern researchers abandon the median filter for removing s&p noise.
In addition, the best PSNR value appears at different iterations of repeated median filtering for different noise levels. Figure 2 shows that the denser the noise, the more median filter iterations are required.
Our basic idea is to keep the spike-removal ability of the traditional median filter while recovering the degradations it introduces. Figure 3 illustrates a simple 1D synthetic example. We contaminate an evenly sampled 1D sine function (dotted curves in Figure 3a) with s&p noise. After that, we try different ways to recover the clean signal:

Repeated Median filters, see Figure 3b-d;

Repeated Gaussian filters, see Figure 3e-f;

Alternating Median and Gaussian filters, see Figure 3h-j.
Here, all Median and Gaussian filters have the same window size of 5 pixels. One may observe that the third scheme yields the best approximations (green curves) to the original sine function, in terms of both the signal shape and the mean squared error (MSE) between the smoothed (solid) curves and the true (dotted) signal. Using only Median filters creates plateau-like artifacts; using only Gaussian filters over-smooths the noisy curves. By alternating Median and Gaussian filters, the apparent plateau-like artifacts are washed out while the resulting curve stays close to the true signal. The quickly dropping MSE values between the smoothed curves generated by the third scheme and the truth quantitatively support our observation.
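The alternating scheme can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the code behind Figure 3: the 30% noise level, the Gaussian `sigma`, the edge padding and all function names are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median1d(signal, size=5):
    """Moving-window median with edge padding (padding mode is an assumption)."""
    padded = np.pad(signal, size // 2, mode="edge")
    return np.median(sliding_window_view(padded, size), axis=-1)


def gaussian1d(signal, size=5, sigma=1.0):
    """Moving-window Gaussian smoothing with a normalized kernel."""
    offsets = np.arange(size) - size // 2
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(signal, size // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")


def alternate(signal, rounds, size=5):
    """Apply a Median filter followed by a Gaussian filter, `rounds` times."""
    out = signal
    for _ in range(rounds):
        out = gaussian1d(median1d(out, size), size)
    return out


# Usage: contaminate a sampled sine wave, then compare the mean squared errors.
rng = np.random.default_rng(0)
clean = np.sin(np.linspace(0.0, 2.0 * np.pi, 200))
noisy = clean.copy()
hit = rng.random(clean.size) < 0.3            # assumed noise level
noisy[hit] = np.where(rng.random(hit.sum()) < 0.5, 1.0, -1.0)
mse_noisy = np.mean((noisy - clean) ** 2)
mse_alternating = np.mean((alternate(noisy, rounds=3) - clean) ** 2)
```

The median step removes the spikes and the smoothing step then softens the plateaus that the median leaves behind, which is the role that learnable convolutions take over in the network.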
We leverage these observations to design our deep neural network model for 2D image denoising. We replace the Gaussian filter, which is a fixed-parameter smoothing filter, with a set of learnable convolution operations, and thus design an end-to-end fully convolutional network in which Median layers and convolutions alternate.
Instead of directly applying median filters to the images, we implement median filtering as a neural network operation and perform it on different feature channels. In this way, we essentially remove spikes in different feature spaces and then combine the despiked features to predict a better denoised image. On one hand, median filtering in the feature space acts just like the switching filters in the traditional methods [8, 4]; on the other hand, the despiking ability introduced by median operations allows the gradients to pass through the non-noisy pixels.
II-A Median layer definition
The median filter is applied to each element of a feature channel in a moving-window fashion. For example, an input image that consists of RGB channels corresponds to 3 feature channels, while a set of features generated by a convolution generally contains many channels. For each feature channel, we first extract patches of a given size centered at each pixel. Then, we find the median of the sequence formed by all elements in that patch. We show a simple TensorFlow/Python implementation of this median filter layer in Listing 1. Here, the first parameter is a channel of the input tensor and the second denotes an integer kernel size.
In practice, this median layer is applied to each feature channel, and the results are concatenated to form a new set of features; e.g., the median layer is applied 64 times given a set of 64 feature channels generated by convolutions.
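The per-channel operation described above can be illustrated with a small NumPy stand-in. This is a sketch of the idea only and is not the TensorFlow code of Listing 1: the function names, the zero padding at the borders, the default kernel size of 3 and the (height, width, channels) layout are all assumptions.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def median_channel(channel, kernel_size):
    """Median-filter one 2D feature channel with a square moving window."""
    pad = kernel_size // 2
    # Assumption: zero padding keeps the output the same size as the input.
    padded = np.pad(channel, pad, mode="constant")
    patches = sliding_window_view(padded, (kernel_size, kernel_size))
    # Flatten every patch into a sequence and take its median.
    return np.median(patches.reshape(*channel.shape, -1), axis=-1)


def median_layer(features, kernel_size=3):
    """Filter every channel of an (H, W, C) tensor and concatenate the results."""
    channels = [
        median_channel(features[..., c], kernel_size)
        for c in range(features.shape[-1])
    ]
    return np.stack(channels, axis=-1)


# Usage: isolated spikes in each channel are removed independently.
features = np.zeros((8, 8, 3))
features[4, 4, :] = [5.0, -5.0, 9.0]
despiked = median_layer(features, kernel_size=3)
```

Because each channel is filtered on its own, the layer has no trainable weights and leaves the number of channels unchanged.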
II-B Network architecture
As shown in Figure 4a, our network is fully convolutional, so no restrictions are posed on the size of the input. It starts with 2 consecutive median layers, followed by a sequence of residual blocks and median layers. The last part of the network consists only of residual blocks, with no median layers inserted between them. In practice, we insert median layers only into the first half of the sequence of residual blocks. The first half of the network is dedicated to removing noise from the image; the second half is designed for recovering the signal.
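The layer ordering described above can be written down as a small planning function. This is a hypothetical sketch: the function name `build_layer_plan`, the layer labels, and the choice of placing exactly one median layer after each residual block in the first half are assumptions, since the text only states that median layers are inserted into the first half of the residual sequence.

```python
def build_layer_plan(num_res_blocks, leading_medians=2):
    """Return an ordered list of layer labels for the described layout."""
    plan = ["median"] * leading_medians
    first_half = num_res_blocks // 2
    for index in range(num_res_blocks):
        plan.append("res_block")
        # Assumption: one median layer follows each block in the first half.
        if index < first_half:
            plan.append("median")
    return plan


# Usage: a small plan with 4 residual blocks.
plan = build_layer_plan(4)
```

With 4 residual blocks, the plan begins with the two leading median layers, interleaves median layers with the first two blocks, and ends with two plain residual blocks.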
We generate a fixed number of features per convolution layer, and our residual block is designed as a skip connection over 2 convolutions, followed by batch normalization layers and nonlinear activations (ReLU in practice), as shown in Figure 4b.
As mentioned beforehand, we stick to the simplest $L_2$ loss as our objective function. This loss is simply defined as the mean squared error between the estimated image and the ground truth image, since minimizing the MSE directly relates to increasing the denoising metric PSNR. Details can be found in Equation 3.
Table I: PSNR (dB) of models with and without median layers.
noise level | ConvRelu 16, without median | ConvRelu 16, with median | ResBlock 16, without median | ResBlock 16, with median | ResBlock 32, without median | ResBlock 32, with median
30% | 31.15 | 33.82 | 40.38 | 40.89 | 36.86 | 40.90
50% | 30.15 | 32.01 | 36.55 | 36.93 | 36.86 | 37.28
70% | 28.88 | 29.37 | 31.98 | 32.23 | 32.22 | 32.40
III Evaluation
We design several experiments to evaluate the properties of median layers (Section III-B) and the performance of the proposed network (Section III-C).
III-A Training and testing setup
For fair comparison, we train all models on the same data set described in [12], which is also employed in other works [4]. Since our network is fully convolutional, the input size can be arbitrary. We first resize these images to a common size and then generate patches from them as clean images. We degrade each patch with s&p noise over a range of noise levels at a fixed step, producing a sequence of noisy images. The models are trained to learn layer weights that map the input noisy image to the clean image.
To quantitatively compare the performance of different methods, we perform denoising on 3 sets of images. The first set consists of classical images from the image processing field (also used in [4]); the second set is BSD300 [13] (https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/segbench/BSDS300/html/dataset/images.html); the third set is the Kodak Image Dataset (http://r0k.us/graphics/kodak/), which has been widely used as an evaluation set [1, 2, 5].
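The metric considered in the comparison is the Peak Signal to Noise Ratio (PSNR). It is defined by
$$\mathrm{PSNR}(x, y) = 10 \log_{10} \frac{255^2}{\mathrm{MSE}(x, y)} \qquad (2)$$
where $\mathrm{MSE}(x, y)$ is the mean squared error between two 8-bit images $x$ and $y$, defined by
$$\mathrm{MSE}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2 \qquad (3)$$
with $N$ the number of pixel values in each image.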
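The PSNR and mean squared error described above can be computed as follows. The sketch assumes the standard definition for 8-bit images with a peak value of 255; the function names and the handling of identical images (returning infinity) are choices made for this example.

```python
import numpy as np


def mse(x, y):
    """Mean squared error over all pixel values of two equally sized images."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    return np.mean((x - y) ** 2)


def psnr(x, y, peak=255.0):
    """Peak Signal to Noise Ratio in dB; `peak` is 255 for 8-bit images."""
    error = mse(x, y)
    if error == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / error)


# Usage: compare an image against a copy that is off by one gray level.
reference = np.zeros((4, 4), dtype=np.uint8)
shifted = reference + 1
value_db = psnr(reference, shifted)
```

Converting to floating point before subtracting avoids the wrap-around that unsigned 8-bit arithmetic would otherwise introduce.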
III-B Effects of the median layer
The first experiment is designed to show the effectiveness of the median layer. We trained several pairs of fully convolutional networks consisting mainly of residual or convolution-batchNorm-ReLU blocks, one with median layers and one without.
[Figure: (a) Original; (b) 70% s&p (6.72 dB); (c) with median layers (32.16 dB); (d) without median layers (28.70 dB).]
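Table II: PSNR (dB) comparison with state-of-the-art methods.
Image | Noise level | DBA [14] | NASNLM [9] | PARIGI [10] | NLSF [4] | NLSF-MLP [15] | NLSF-CNN [4] | Noise2Noise [5] | Ours
(image 1) | 30% | 34.42 | 28.09 | 33.90 | 34.20 | 30.80 | 35.38 | 36.39 | 37.04
(image 1) | 50% | 30.11 | 26.15 | 29.91 | 30.12 | 29.28 | 32.55 | 34.68 | 35.00
(image 1) | 70% | 25.84 | 25.97 | 25.22 | 25.79 | 27.63 | 30.18 | 32.83 | 33.07
(image 2) | 30% | 28.07 | 23.68 | 25.19 | 28.21 | 25.19 | 28.71 | 30.89 | 40.46
(image 2) | 50% | 24.24 | 22.91 | 22.61 | 24.45 | 23.86 | 26.01 | 27.96 | 34.83
(image 2) | 70% | 21.12 | 22.63 | 20.06 | 21.02 | 22.61 | 24.11 | 25.09 | 29.96
(image 3) | 30% | 29.41 | 20.61 | 29.74 | 32.88 | 29.64 | 33.47 | 39.98 | 40.65
(image 3) | 50% | 27.47 | 16.69 | 27.25 | 29.66 | 28.28 | 30.92 | 36.13 | 38.84
(image 3) | 70% | 24.99 | 16.32 | 24.29 | 26.33 | 26.90 | 29.06 | 30.55 | 33.29
(image 4) | 30% | 26.85 | 22.38 | 28.88 | 32.27 | 30.01 | 32.99 | 30.70 | 30.83
(image 4) | 50% | 25.27 | 21.82 | 25.44 | 27.99 | 28.57 | 30.23 | 29.86 | 30.07
(image 4) | 70% | 22.11 | 21.58 | 21.46 | 23.04 | 27.04 | 27.70 | 28.79 | 29.05
BSD300 average | 30% | 29.92 | 25.74 | 12.04 | 30.01 | 29.77 | 30.87 | 39.83 | 40.90
BSD300 average | 50% | 26.32 | 24.50 | 6.01 | 26.25 | 26.19 | 27.84 | 35.92 | 37.28
BSD300 average | 70% | 22.81 | 24.65 | 5.42 | 22.85 | 26.19 | 25.35 | 31.42 | 32.40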
Table III: PSNR (dB) comparison on the Kodak image dataset.
Noise level | DeepBoosting [7] | Noise2Noise [5] | Ours
30% | 21.69 | 34.95 | 36.39
50% | 19.50 | 32.27 | 34.35
70% | 15.74 | 30.49 | 31.56
[Figure 7: three BSD300 images, one per row. Columns from left to right: 90% s&p input (a, e, i); noise2noise (b, f, j); ours (c, g, k); ground truth (d, h, l).]
We train two sets of deep fully convolutional networks. The first set consists of traditional networks that do not contain any median layers; the second set consists of their counterparts with median layers inserted following the strategy shown in Figure 4a, i.e., the first half of the network contains median layers and the second half does not. The networks in the first set consist of repeated blocks of convolution, batch normalization and activation, or of repeated residual blocks as shown in Figure 4b.
The losses in Figure 6 show how median layers boost the PSNR of the network. The training losses of the two networks with median layers converge to better minima than those of the networks without median layers. The “ConvRelu 16” network in Figure 6 is a DnCNN [1] style network, which consists of 16 stacked Convolution-BatchNormalization-Activation units. The “ResBlock 16” network in Figure 6 is formed by simply replacing the Convolution-BatchNormalization-Activation units with the residual blocks shown in Figure 4b. All convolution layers here generate 64 features.
PSNR comparisons of models with and without median layers (Table I) show the improvement. The PSNR values of models with median layers are consistently higher than those of the models without them, by 0.18 dB to 4.04 dB in Table I.
III-C Comparisons to the state of the art
Both quantitative and qualitative comparisons to the state of the art are performed in this section.
III-C1 Quantitative comparisons
We quantitatively compare our network in Figure 4a to several state-of-the-art methods. Baselines include 5 traditional methods, namely the Decision-Based Algorithm (DBA) [14], the Adaptive Switching Nonlocal Filter (NASNLM) [9], PARIGI [10], NLSF [4] (the preprocessing part of NLSF-CNN) and NLSF-MLP (NLSF with the multilayer perceptron proposed in [15]), and the 2 most recent neural network based methods, NLSF-CNN [4] and Noise2Noise [5], as shown in Table II. Many of these methods are designed for removing s&p noise at moderate levels; therefore, we evaluate the methods at noise levels of 30%, 50% and 70%.
In addition, we also compare our method to DeepBoosting [7] and Noise2Noise [5] on the Kodak image dataset, as shown in Table III.
Our method outperforms the state-of-the-art methods in most cases, except on the pepper image. Compared to the current best baseline, Noise2Noise [5], the PSNR values achieved by our model are about 1 to 2 dB higher on average, and the more severe the noise contamination, the comparatively better our method performs.
III-C2 Qualitative comparisons on extremely high-level noise
We further qualitatively compare our method to the noise2noise method [5] on denoising extremely high-level s&p noise (noise level equal to 90%). In Figure 7, we choose three images from the BSD300 dataset that pose different challenges:

both sharp features and a smooth background exist in the first image (Figure 7d);

a pure black-and-white alternating pattern in the second image (Figure 7h);

a noise-like nature scene background in the third image (Figure 7l).
The leftmost column in Figure 7 shows the contaminated images, which are the noisy versions of their counterparts in the rightmost column. One can hardly see the contours of the original salient objects there, since 90% of the pixels take either the maximal or the minimal value.
Our method performs consistently better than noise2noise on all of these challenges. In Figure 7b, noise2noise generates many small white blocky artifacts in the sky (red rectangle) and also blurs the sharp edges of the windows (blue rectangle). Both of these degradations are alleviated in our result, shown in Figure 7c.
Recovering an underlying signal with a pure black-and-white alternating pattern from high-level s&p noise is very difficult, because both signal and noise are almost binary in each channel, so a method may have a hard time distinguishing which pixels are contaminated. By comparing the results shown in Figure 7f (noise2noise) and Figure 7g (ours), one may observe that our method produces higher-quality images.
Noise-like patterns are common in many nature scene images, for example, the grass and the feathers of the owl in Figure 7l and the leaves in the bridge image in Table II. We observe that the image still looks noisy after being processed by the noise2noise method, with apparent small blue and red dots standing out on the grass (Figure 7j). Such dots are not visible in our result, as shown in Figure 7k.
IV Conclusion
In this paper, we show that incorporating median filtering into a deep neural network helps achieve compelling results in removing s&p noise, especially when the noise level is high. The denoising ability of the median layer is also verified experimentally by the increase in PSNR. Our work opens the door to adopting traditional low-level nonlinear signal processing techniques in deep neural networks. The methodology of inserting nonlinear spatial layers may boost the performance of some well-known deep networks.
The median is the optimum point of a set of values under the $L_1$ norm, i.e., it minimizes the sum of absolute deviations. This fact makes median layers act as a regularizer on the feature channels. Unlike the annealing procedure on the loss function adopted in [5], where the speed of evolving the loss from $L_2$ to $L_0$ must be carefully chosen to achieve the best result (with respect to the amount of noise), median layers are a more feasible way to control the quality of the extracted features. A single model can be trained to recover latent images from different levels of noise contamination using only the $L_2$ loss.
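The statement that the median minimizes the sum of absolute deviations can be checked numerically. The sketch below does a brute-force search over candidate values for a small made-up sample containing one spike; the sample values, the search grid and the function name are assumptions chosen for illustration.

```python
import numpy as np


def best_candidate(values, candidates, power):
    """Return the candidate that minimizes sum(|value - candidate| ** power)."""
    costs = (np.abs(values[None, :] - candidates[:, None]) ** power).sum(axis=1)
    return candidates[np.argmin(costs)]


# Four small values and one salt-like spike.
values = np.array([0.0, 0.1, 0.2, 0.3, 255.0])
candidates = np.linspace(0.0, 255.0, 25501)  # grid with step 0.01

l1_optimum = best_candidate(values, candidates, power=1)  # close to the median
l2_optimum = best_candidate(values, candidates, power=2)  # close to the mean
```

The absolute-deviation optimum stays with the bulk of the values, whereas the squared-error optimum is dragged toward the spike, which is why the median is robust to impulse noise.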
Spatial filtering techniques are well established and could be leveraged in convolutional neural networks to deal with images affected by nonlinear noise. Further study of where median layers are placed could lead to a better understanding of their impact on the denoising process.
References
[1] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
[2] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN based image denoising,” IEEE Transactions on Image Processing, 2018.
[3] D. Liu, B. Wen, X. Liu, Z. Wang, and T. Huang, “When image denoising meets high-level vision tasks: A deep learning approach,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 7 2018, pp. 842–848. [Online]. Available: https://doi.org/10.24963/ijcai.2018/117
[4] B. Fu, X. Zhao, Y. Li, X. Wang, and Y. Reng, “A convolutional neural networks denoising approach for salt and pepper noise,” Multimedia Tools and Applications, pp. 1–18, 2018.
[5] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2Noise: Learning image restoration without clean data,” in International Conference on Machine Learning (ICML) 2018, 2018, pp. 2971–2980.
[6] R. Furuta, N. Inoue, and T. Yamasaki, “Fully convolutional network with multi-step reinforcement learning for image processing,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.
[7] C. Chen, Z. Xiong, X. Tian, and F. Wu, “Deep boosting for image denoising,” in European Conference on Computer Vision 2018 (ECCV), 2018.
[8] W. Wang and P. Lu, “An efficient switching median filter based on local outlier factor,” IEEE Signal Processing Letters, vol. 18, pp. 551–554, 2011.
[9] J. Varghese, N. Tairan, and S. Subash, “Adaptive switching non-local filter for the restoration of salt and pepper impulse-corrupted digital images,” Arabian Journal for Science & Engineering, vol. 40, pp. 3233–3246, 2015.
[10] J. Delon, A. Desolneux, and T. Guillemot, “PARIGI: a patch-based approach to remove impulse-Gaussian noise from images,” Image Processing On Line, vol. 5, pp. 130–154, 2016.
[11] T. S. Huang, G. J. Yang, and G. Y. Tang, “A fast two-dimensional median filtering algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 13–18, 1979.
[12] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, pp. 295–307, 2015.
[13] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int’l Conf. Computer Vision, vol. 2, July 2001, pp. 416–423.
[14] K. S. Srinivasan and D. Ebenezer, “A new fast and efficient decision-based algorithm for removal of high-density impulse noises,” IEEE Signal Processing Letters, vol. 14, pp. 189–192, 2007.
[15] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with BM3D?” in Proc. 2012 IEEE Conference on Computer Vision and Pattern Recognition, vol. 157, 2012, pp. 2392–2399.