Convolutional Neural Network with Median Layers for Denoising Salt-and-Pepper Contaminations

08/18/2019 ∙ by Luming Liang, et al. ∙ Nanjing University of Aeronautics and Astronautics ∙ Microsoft

We propose a deep fully convolutional neural network with a new type of layer, named the median layer, to restore images contaminated by salt-and-pepper (s&p) noise. A median layer simply performs median filtering on all feature channels. By adding this kind of layer to some widely used fully convolutional deep neural networks, we develop an end-to-end network that removes extremely high-level s&p noise without performing any non-trivial preprocessing, which distinguishes it from the existing literature on s&p noise removal. Experiments show that inserting median layers into a simple fully convolutional network with the L2 loss significantly boosts the signal-to-noise ratio. Quantitative comparisons show that our network outperforms state-of-the-art methods with a limited amount of training data. The source code has been released for public evaluation and use.




I Introduction

Image denoising is a well-studied yet not well-solved problem [1, 2, 3, 4, 5, 6, 7], where the goal is to recover the underlying signal from its contaminated observation. Contaminations can be categorized into many types according to their distributions and behaviors, e.g., additive (Gaussian) noise, shot (Poisson) noise and JPEG noise. We focus on salt-and-pepper (s&p) noise, which is an impulse contamination of the image. In an image with s&p noise, pixels take the maximal or minimal value with a predefined probability, called the noise level: the higher this value, the more pixels are contaminated. The s&p noise is a special case of the random-valued impulse noise defined in [4] and [5]. For a given noise level, an s&p-contaminated image is defined pixel-wise (Equation 1) from two random values generated at each pixel: the former determines whether the pixel is contaminated, and the latter controls whether a contaminated pixel takes the maximal (salt) value or the minimal (pepper) value.

From Equation 1, one observes that s&p noise is neither like additive (Gaussian) noise, which can be fully separated from the signal [1], nor like shot (Poisson) noise, which is signal dependent [5]. It appears as pure noise at the contaminated locations (called missing pixels in [5]) and therefore erases all signal there. This prevents us from using any optimization method in a continuous space to recover the signal, since the gradients estimated from missing pixels are unreliable and spread to unpolluted locations. Traditional ways to recover images from s&p pollution all require nonlinear searches and mappings. The search step [8, 9, 10] generally determines the locations of the contaminated pixels, and the mapping step tries to give a feasible estimate at each contaminated pixel by a weighted average of similar pixel values around it. Filters of this kind are named switching filters. However, as the noise level increases, the search step becomes more and more unreliable. The signal estimation step is also degraded by high-level noise, since the similarity estimation becomes intractable.
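The contamination model described above can be sketched in a few lines. This is a minimal illustration rather than the authors' data pipeline: the function name `add_salt_and_pepper`, the uniform random draws, and the equal split between salt and pepper are assumptions made here.

```python
import random

def add_salt_and_pepper(image, level, lo=0, hi=255, seed=None):
    """Return a noisy copy of a 2D list `image`.

    Each pixel is contaminated with probability `level`; a contaminated
    pixel becomes `hi` (salt) or `lo` (pepper). The 50/50 split between
    salt and pepper is an assumption of this sketch.
    """
    rng = random.Random(seed)
    noisy = []
    for row in image:
        new_row = []
        for value in row:
            if rng.random() < level:        # first draw: contaminate this pixel?
                salt = rng.random() < 0.5   # second draw: salt or pepper?
                new_row.append(hi if salt else lo)
            else:
                new_row.append(value)       # untouched pixel keeps its signal
        noisy.append(new_row)
    return noisy
```

Note that a contaminated pixel carries no trace of its original value, which is why the noise cannot be subtracted out the way additive noise can.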

Fig. 1: Classic Lenna image with high-level s&p noise contamination and filtering results with a repeated median filter. Panels (PSNR): (a) original; (b) 70% s&p, 6.72 dB; (c) Median5, 14.01 dB; (d) Median5 x2, 19.14 dB; (e) Median5 x5, 24.09 dB; (f) Median5 x10, 24.89 dB; (g) Median5 x25, 24.52 dB; (h) our method, 33.07 dB.

To alleviate this limitation, [4] uses switching templates to avoid noise disturbance when measuring similarity. Based on the similarities, they extract repairable information from non-local regions instead of local patches. This filter is named the non-local switching filter (NLSF). The method then uses a trained convolutional neural network to refine the signal recovered by NLSF, so NLSF can be considered a preprocessing step for the neural network. The method is thus a combination of a traditional method and a learning-based method.

Besides the usual learning-based approach that trains models on pairs of clean images and their noisy versions, the noise-to-noise method [5] trains models only on noisy images. Its authors find that training without clean images can match, and sometimes even exceed, the result obtained by training with ground truth. Following this paradigm, [5] shows the ability to remove random-valued impulse noise, which can be considered a superset of s&p noise. To deal with the gradient loss problem introduced by purely noisy pixels, they adopt an annealed version of the “L0 loss” to replace the traditional L2 loss. The loss function gradually changes from L2 to L0 as training progresses. However, the speed of this annealing procedure must be carefully chosen (usually by reducing the power of the norm in the loss function according to the number of iterations). When the prediction is far from the truth, the loss function must be closer to L2; when the prediction gets close enough, L0 becomes more favorable, since the L0 loss emphasizes the number of differing pixels, which leads the learning process to a detail amendment stage.
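To make the annealing idea concrete, the following is a minimal sketch of a loss whose exponent shrinks from 2 toward 0 during training. It follows the general form (|prediction − target| + ε)^γ used for this purpose in [5]; the linear schedule, the value of `eps`, and the function names are assumptions of this sketch rather than a reproduction of that implementation.

```python
def annealed_exponent(step, total_steps, start=2.0, end=0.0):
    """Linearly interpolate the loss exponent from `start` to `end` (assumed schedule)."""
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * t

def annealed_loss(pred, target, gamma, eps=1e-8):
    """Mean of (|pred - target| + eps) ** gamma over two flat sequences."""
    total = sum((abs(p - t) + eps) ** gamma for p, t in zip(pred, target))
    return total / len(pred)

# Early in training the exponent is 2 (squared error); it then decays toward 0.
for step in (0, 50, 100):
    gamma = annealed_exponent(step, total_steps=100)
    print(step, gamma, annealed_loss([0.0, 0.0], [0.0, 3.0], gamma))
```

The need to pick `total_steps` (the annealing speed) is exactly the tuning burden discussed above.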

We introduce local nonlinear search into the neural network without performing any pre-processing step, and we also avoid replacing the L2 loss with other losses that are not easy to optimize. We resort to the median filter [11], which is the first efficient method for removing salt-and-pepper noise. By incorporating median-filter-like operations into deep neural networks, our method outperforms state-of-the-art methods. Details of our methodology and model design can be found in Section II, evaluations are presented in Section III, and Section IV concludes our work. We release our source code, training dataset and pretrained models for reproducibility.

II Methodology

Fig. 2: Peak Signal to Noise Ratio trends with respect to the number of iterations of repeated median filters.

The median filter is a traditional nonlinear filter that is especially efficient at removing impulse noise. It replaces the pixel at the center of a given window with the median of that window. As shown in Figure 1, applying a median filter to a highly contaminated image (b) removes spikes and therefore greatly improves the signal-to-noise ratio. Applying a 5×5 median filter once (Figure 1c) and twice (Figure 1d) raises the PSNR from 6.72 dB to 14.01 dB and 19.14 dB, respectively. A natural idea is to apply the median filter repeatedly until all spikes are replaced by the median in a fixed-size local window. This does remove the noise; however, it fails to recover the signal. The Peak Signal to Noise Ratio (PSNR) increases in the first several iterations but eventually drops as the image becomes blocky and blurry, see Figure 1g. This phenomenon indicates that the median filter pushes the signal too far from its original shape, which is also the main reason why modern researchers have abandoned the median filter for s&p denoising.

In addition, the best PSNR value appears at a different iteration of repeated median filtering for each noise level. Figure 2 shows that the denser the noise, the more iterations of median filtering are required.
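Both effects, spike removal and signal damage, can be reproduced on toy images. The sketch below is a plain-Python 2D median filter written for illustration; the function names and the border rule (clamping indices to the image edge) are assumptions, not details taken from the experiments above.

```python
def median_filter_2d(image, k=3):
    """Median-filter a 2D list with a k x k window (odd k); borders are clamped."""
    h, w = len(image), len(image[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                image[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = sorted(window)[len(window) // 2]
    return out

def repeat_median(image, k=3, iterations=1):
    """Apply the median filter `iterations` times in a row."""
    for _ in range(iterations):
        image = median_filter_2d(image, k)
    return image

# Isolated impulses on a flat background are removed after one pass ...
flat = [[100] * 7 for _ in range(7)]
flat[3][3], flat[1][5] = 255, 0
despiked = repeat_median(flat, k=3, iterations=1)

# ... but a genuine one-pixel-wide line is erased as well.
line = [[255 if x == 4 else 0 for x in range(9)] for _ in range(9)]
erased = repeat_median(line, k=3, iterations=1)
```

The second case is a small-scale version of the signal loss that makes the PSNR drop after many iterations.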

(a) Noisy signal
(b) Median filtered
(c) Median filter applied twice
(d) Median filter applied 3 times
(e) Gaussian filtered
(f) Gaussian filter applied twice
(g) Gaussian filter applied 3 times
(h) Median filtered followed by Gaussian filtered
(i) filters applied: Median, Gaussian and Median
(j) filters applied: Median, Gaussian, Median and Gaussian
Fig. 3: 1D signal denoising example using median filters and gaussian filters.

Our basic idea is to keep the spike-removal ability of the traditional median filter while recovering the degradations it introduces. Figure 3 illustrates a simple 1D synthetic example. We contaminate an evenly sampled 1D sine function (dotted curves in Figure 3a) with s&p noise and then try three ways to recover the clean signal:

  1. Repeated Median filters, see Figure 3b-d;

  2. Repeated Gaussian filters, see Figure 3e-g;

  3. Alternating Median and Gaussian filters, see Figure 3h-j.

Here, all Median and Gaussian filters have the same window size of 5 pixels. One may observe that the third scheme yields the best approximations (green curves) to the original sine function, both in terms of signal shape and in terms of the mean square error (MSE) between the smoothed curves (solid) and the true signal (dotted). Using only Median filters creates plateau-like artifacts; using only Gaussian filters over-smooths the noisy curves. By alternating Median and Gaussian filters, apparent plateau-like artifacts are washed out while the resulting curve still stays close to the true signal. The quickly dropping MSE values between the smoothed curves generated by the third scheme and the truth quantitatively support our observation.
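The three schemes can be tried on a synthetic sine wave with a short script. This is a sketch under stated assumptions: the 30% noise level, the Gaussian `sigma`, the clamped borders and the sampling of the sine are chosen here for illustration and are not the settings behind Figure 3.

```python
import math
import random

def median_1d(signal, k=5):
    """Sliding-window median; indices beyond the ends are clamped."""
    r, n = k // 2, len(signal)
    return [
        sorted(signal[min(max(i + d, 0), n - 1)] for d in range(-r, r + 1))[r]
        for i in range(n)
    ]

def gaussian_1d(signal, k=5, sigma=1.0):
    """Sliding-window Gaussian smoothing with normalised weights and clamped borders."""
    r, n = k // 2, len(signal)
    offsets = list(range(-r, r + 1))
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma)) for d in offsets]
    total = sum(weights)
    return [
        sum(w * signal[min(max(i + d, 0), n - 1)] for w, d in zip(weights, offsets)) / total
        for i in range(n)
    ]

def mse(a, b):
    """Mean square error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
clean = [math.sin(2.0 * math.pi * i / 100.0) for i in range(200)]
# Assumed 30% noise level; salt = +1 and pepper = -1, the extremes of the sine.
noisy = [(1.0 if rng.random() < 0.5 else -1.0) if rng.random() < 0.3 else v for v in clean]

median_only = median_1d(median_1d(noisy))          # scheme 1
gaussian_only = gaussian_1d(gaussian_1d(noisy))    # scheme 2
alternating = gaussian_1d(median_1d(noisy))        # scheme 3

for name, result in [("noisy", noisy), ("median x2", median_only),
                     ("gaussian x2", gaussian_only), ("median+gaussian", alternating)]:
    print(f"{name:16s} mse = {mse(result, clean):.4f}")
```

Printing the MSE of each variant makes it easy to compare the schemes for other noise levels or window sizes.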

We leverage these observations to design our deep neural network model for 2D image denoising. We replace the Gaussian filter, which is a fixed-parameter smoothing filter, with a set of learnable convolution operations and thus design an end-to-end fully convolutional network in which median operations and convolutions alternate.

Instead of directly applying median filters to the images, we implement median filtering as a neural network operation and perform it on different feature channels. In this way, we essentially remove spikes in different feature spaces and then combine the de-spiked features to predict a better denoised image. On the one hand, median filtering in the feature space acts just like the switching filters in the traditional methods [8, 4]; on the other hand, the de-spiking ability introduced by the median operations allows the gradients to pass through the non-noisy pixels.

II-A Median layer definition

The median filter is applied to each element of a feature channel in a moving-window fashion. For example, an input image with RGB channels corresponds to 3 feature channels, while a set of features generated by a convolution generally contains many channels. For each feature channel, we first extract patches of a given size (k × k) centered at each pixel. Then, we find the median of the sequence formed by all elements in each patch. We show a simple tensorflow/python implementation of this median filter layer in Listing 1. Here, parameter x is a channel of the input tensor and k denotes an integer kernel size.

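import tensorflow as tf  # TensorFlow 1.x API

def find_medians(x, k=3):
    # x: one feature channel, shape [batch, height, width, 1]; k: odd kernel size.
    # Gather the k*k neighbourhood of every pixel ('SAME' keeps the spatial size).
    patches = tf.extract_image_patches(
            x,
            ksizes=[1, k, k, 1],
            strides=[1, 1, 1, 1],
            rates=[1, 1, 1, 1],
            padding='SAME')
    # The median of k*k values is the (k*k//2 + 1)-th largest one.
    m_idx = k * k // 2 + 1
    top, _ = tf.nn.top_k(patches, m_idx, sorted=True)
    # Keep only the last of the top m_idx values, i.e. the median.
    median = tf.slice(top, [0, 0, 0, m_idx - 1], [-1, -1, -1, 1])
    return median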
Listing 1: Tensorflow implementation of Median layer

In practice, this median layer is applied to each feature channel separately and the results are concatenated to form a new set of features; e.g., the median layer is applied 64 times given a set of 64 feature channels generated by a convolution.
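The channel-wise behaviour can be checked without TensorFlow using a small reference implementation. The sketch below mirrors the top-k selection of Listing 1 on nested Python lists; the `features[c][y][x]` layout, the helper names, and the zero padding at the border (assumed here to match `'SAME'` padding) are illustrative choices.

```python
def median_of_patch(values):
    """Median via the top-k trick: sort descending, keep the (n//2 + 1)-th largest."""
    m_idx = len(values) // 2 + 1
    top = sorted(values, reverse=True)[:m_idx]
    return top[m_idx - 1]

def median_layer(features, k=3):
    """Median-filter every channel of features[c][y][x] independently.

    Pixels outside the image are treated as 0 (zero padding, an assumption
    of this sketch), so the output has the same shape as the input.
    """
    r = k // 2
    out = []
    for channel in features:
        h, w = len(channel), len(channel[0])
        filtered = [
            [
                median_of_patch([
                    channel[y + dy][x + dx]
                    if 0 <= y + dy < h and 0 <= x + dx < w else 0
                    for dy in range(-r, r + 1)
                    for dx in range(-r, r + 1)
                ])
                for x in range(w)
            ]
            for y in range(h)
        ]
        out.append(filtered)  # channels are stacked back in their original order
    return out

# Two toy channels, each flat except for one impulse in the middle.
ch0 = [[10] * 5 for _ in range(5)]
ch1 = [[20] * 5 for _ in range(5)]
ch0[2][2], ch1[2][2] = 255, 0
result = median_layer([ch0, ch1], k=3)
```

One consequence of zero padding is visible at the corners, where more than half of the window lies outside the image and the median therefore becomes 0.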

II-B Network architecture

As shown in Figure 4a, our network is fully convolutional, so no restrictions are imposed on the size of the input. It starts with 2 consecutive median layers, followed by a sequence of residual blocks and median layers. The last part of the network consists only of residual blocks, with no median layers between them. In practice, we insert median layers only into the first half of the sequence of residual blocks. The first part of the network is dedicated to removing noise from the image; the second half is designed for recovering the signal.
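The layer ordering described above can be written down as a simple plan. This is only one reading of the description: placing exactly one median layer after each residual block of the first half is an assumption, and `build_layer_plan` with its string labels is a hypothetical helper, not part of the released code.

```python
def build_layer_plan(num_res_blocks=16, leading_medians=2):
    """Return the layer order as a list of labels.

    Assumed layout: `leading_medians` median layers first, then residual
    blocks, with one median layer after each block in the first half only.
    """
    plan = ["median"] * leading_medians
    half = num_res_blocks // 2
    for i in range(num_res_blocks):
        plan.append("res_block")
        if i < half:
            plan.append("median")  # denoising half: interleave medians
    return plan

print(build_layer_plan(num_res_blocks=4))
```

Keeping the second half free of median layers leaves that part of the network to plain learnable convolutions, which matches its stated role of recovering the signal.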

(a) Fully convolutional network with median layers in between residual blocks.
(b) Our residual blocks.
Fig. 4: Our network structure.

We choose to generate 64 features per convolution layer, and our residual block is designed as a skip connection over 2 convolutions, followed by batch normalization layers and nonlinear activations (ReLU in practice), as shown in Figure 4b.

As mentioned above, we stick to the simplest L2 loss as our objective function. This loss is simply defined as the mean square error (MSE) between the estimated image and the ground truth image, since minimizing the MSE directly relates to increasing the denoising metric PSNR. Details can be found in Equation 3.

Noise level  ConvRelu 16                  ResBlock 16                  ResBlock 32
             w/o median   with median     w/o median   with median     w/o median   with median
30%          31.15        33.82           40.38        40.89           36.86        40.90
50%          30.15        32.01           36.55        36.93           36.86        37.28
70%          28.88        29.37           31.98        32.23           32.22        32.40
TABLE I: PSNR (dB) comparisons with and without median layers on BSD300.

III Evaluation

We design several experiments to evaluate the properties of median layers (Section III-B) and performances of the proposed network (Section III-C).

III-A Training and testing setup

For fair comparisons, we train all models on the same data set described in [12], which is also employed in other work [4]. Since our network is fully convolutional, the input size can be arbitrary. We first resize these images and then generate patches from them as clean images. We degrade each patch with s&p noise over a range of noise levels, sampled at a fixed step, to obtain a sequence of noisy images. The models are trained to learn a series of layer weights that map the input noisy image to the clean image.

To quantitatively compare the performance of different methods, we perform denoising on 3 sets of images. The first set consists of some classical images in the image processing field (also used in [4]); the second set is BSD300 [13]; the third set is the Kodak Image Dataset, which has been widely used as an evaluation set [1, 2, 5].

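The metric considered in the comparison is the Peak Signal to Noise Ratio (PSNR). It is defined by

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{255^2}{\mathrm{MSE}}\right), \qquad (2)$$

where $\mathrm{MSE}$ is the mean-squared error between two 8-bit images $I$ and $K$ of size $m \times n$, defined by

$$\mathrm{MSE} = \frac{1}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(I(i,j) - K(i,j)\right)^2. \qquad (3)$$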
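The metric is straightforward to compute. The following is a minimal sketch assuming the standard PSNR definition for 8-bit images (peak value 255) and images stored as equally sized 2D lists; the function names are chosen here for illustration.

```python
import math

def mse(a, b):
    """Mean-squared error between two equally sized 2D lists."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def psnr(a, b, peak=255.0):
    """Peak Signal to Noise Ratio in dB; identical images give infinity."""
    err = mse(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / err)
```

Because PSNR is a decreasing function of the MSE, lowering the mean-squared error during training can only raise the PSNR.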


III-B Effects of the median layer

The first experiment is designed to show the effectiveness of the median layer. We trained several pairs of fully convolutional networks consisting mainly of residual or convolution-batchNorm-ReLU blocks, each pair comprising one network with median layers and one without.

Fig. 5: Denoising results with and without median layers on an image of BSD300, measured by PSNR (dB). Panels: (a) original; (b) 70% s&p (6.72 dB); (c) with medians (32.16 dB); (d) without medians (28.70 dB).

Fig. 6: Training losses with and without median layers.

Image             Noise level  DBA [14]  NASNLM [9]  PARIGI [10]  NLSF [4]  NLSF-MLP [15]  NLSF-CNN [4]  Noise2Noise [5]  Ours
Classic image 1   30%          34.42     28.09       33.90        34.20     30.80          35.38         36.39            37.04*
                  50%          30.11     26.15       29.91        30.12     29.28          32.55         34.68            35.00*
                  70%          25.84     25.97       25.22        25.79     27.63          30.18         32.83            33.07*
Classic image 2   30%          28.07     23.68       25.19        28.21     25.19          28.71         30.89            40.46*
                  50%          24.24     22.91       22.61        24.45     23.86          26.01         27.96            34.83*
                  70%          21.12     22.63       20.06        21.02     22.61          24.11         25.09            29.96*
Classic image 3   30%          29.41     20.61       29.74        32.88     29.64          33.47         39.98            40.65*
                  50%          27.47     16.69       27.25        29.66     28.28          30.92         36.13            38.84*
                  70%          24.99     16.32       24.29        26.33     26.90          29.06         30.55            33.29*
Classic image 4   30%          26.85     22.38       28.88        32.27     30.01          32.99*        30.70            30.83
                  50%          25.27     21.82       25.44        27.99     28.57          30.23*        29.86            30.07
                  70%          22.11     21.58       21.46        23.04     27.04          27.70         28.79            29.05*
BSD300 average    30%          29.92     25.74       12.04        30.01     29.77          30.87         39.83            40.90*
                  50%          26.32     24.50       6.01         26.25     26.19          27.84         35.92            37.28*
                  70%          22.81     24.65       5.42         22.85     26.19          25.35         31.42            32.40*
TABLE II: PSNR (dB) comparisons with the state of the art on a set of classic images and the BSD300 image database. The best performance for each noise level of each image is marked with *.

Noise level  DeepBoosting [7]  Noise2Noise [5]  Ours
30%          21.69             34.95            36.39*
50%          19.50             32.27            34.35*
70%          15.74             30.49            31.56*
TABLE III: PSNR (dB) comparisons with the state of the art on the Kodak image database. The best performance for each noise level is marked with *.

Fig. 7: Detailed comparisons between noise2noise and our model. Columns, left to right: 90% s&p, noise2noise, ours, ground truth; rows: panels (a)-(d), (e)-(h) and (i)-(l). Our model outperforms noise2noise consistently on different challenges: 1) smoothly changing background (the first row); 2) white and black stripes (the second row); and 3) noise-like natural scene (the third row).

We train two sets of deep fully convolutional networks. The networks in the first set are traditional ones that do not contain any median layers and consist of repeated blocks of convolution, batch normalization and activation, or of repeated residual blocks as shown in Figure 4b. The networks in the second set are the counterparts of the first set with median layers inserted according to the strategy shown in Figure 4a, i.e., the first half of the network contains median layers and the second half does not.

The losses in Figure 6 show how median layers boost the PSNR value of the network. The training losses of the two networks with median layers converge to better minima than the losses of the networks without them. The “ConvRelu 16” network in Figure 6 is a DnCNN [1] style network, which consists of 16 stacked Convolution-BatchNormalization-Activation units. The “ResBlock 16” network in Figure 6 is formed by simply replacing the Convolution-BatchNormalization-Activation units with the residual blocks shown in Figure 4b. All convolution layers here generate 64 features.

PSNR comparisons of models with and without median layers (Table I) show the improvement. The models with median layers achieve a higher PSNR than their counterparts without them for every architecture and noise level listed.

III-C Comparisons to the state of the art

Both quantitative and qualitative comparisons to the state of the art are performed in this section.

III-C1 Quantitative comparisons

We quantitatively compare our network in Figure 4a to several state-of-the-art methods. Baselines include 5 traditional methods: Decision-Based Algorithm (DBA) [14], Adaptive Switching Non-local Filter (NASNLM) [9], PARIGI [10], NLSF [4] (the preprocessing part of NLSF-CNN) and NLSF-MLP (NLSF with the multi-layer perceptron proposed in [15]), as well as the 2 most recent neural-network-based methods, NLSF-CNN [4] and Noise2Noise [5], as shown in Table II. Many methods here are designed for denoising s&p noise at moderate levels; therefore, we choose to evaluate the methods at noise levels of 30%, 50% and 70%.

In addition, we also compare our method to DeepBoosting [7] and Noise2Noise [5] on Kodak image dataset, as shown in Table III.

Our method outperforms the state-of-the-art methods on all test images except the pepper image. Compared to the current best baseline, Noise2Noise [5], the PSNR values achieved by our model are about 1-2 dB higher on average, and the more severe the noise contamination, the comparatively better our method performs.

III-C2 Qualitative comparisons on extremely high-level noise

We further qualitatively compare our method to the noise2noise method [5] on denoising extremely high-level s&p noise (noise level equal to 90%). In Figure 7, we choose three images from the BSD300 dataset that pose different challenges:

  • both sharp features and a smooth background exist in the first image (Figure 7d);

  • a pattern of interleaved pure black and white in the second image (Figure 7h);

  • a noise-like natural scene background (Figure 7l).

The left-most column in Figure 7 shows the contaminated images, which are the noisy versions of their counterparts in the right-most column. One can hardly see the contours of the original salient objects there, since 90% of the pixels take either the maximal or the minimal value.

Our method performs consistently better than noise2noise on all of these challenges. In Figure 7b, noise2noise generates many small white blocky artifacts on the sky (red rectangle) and also blurs the sharp edges (blue rectangle) of the windows. Both of these 2 degradations are alleviated in our result shown in Figure 7c.

Recovering an underlying signal with an interleaved pure black and white pattern from high-level s&p noise contamination is a very difficult problem, because both signal and noise are almost binary in each channel. A method may have a hard time distinguishing which pixels are contaminated. By comparing the results shown in Figure 7f (noise2noise) and Figure 7g (ours), one may observe that our method produces a higher-quality image.

Noise-like patterns are common in many natural scene images, for example, the grass and the feathers of the owl in Figure 7l and the leaves in the bridge image in Table II. We observe that the image still looks noisy after being processed by the noise2noise method, where small blue and red dots visibly stand out on the grass (Figure 7j). However, such dots are not visible in our result, as shown in Figure 7k.

IV Conclusion

In this paper, we show that incorporating median filtering into a deep neural network helps achieve compelling results in removing s&p noise, especially when the noise level is high. The denoising ability of the median layer is also confirmed experimentally by the increase in PSNR. Our work opens the door to adopting traditional low-level nonlinear signal processing techniques in deep neural networks. Inserting nonlinear spatial layers may also boost the performance of other well-known deep networks.

The median is the optimum point of a set of values under the L1 norm, i.e., it minimizes the sum of absolute deviations. This fact makes median layers act as a regularizer on the feature channels. Unlike the annealing procedure on the loss function adopted in [5], where the speed of evolving the loss from L2 to L0 must be carefully chosen to achieve the best result (with respect to the amount of noise), median layers are a more feasible way to control the quality of the extracted features. A single model can be trained to recover latent images from different levels of noise contamination using only the L2 loss.
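The optimality property of the median is easy to verify numerically. The sketch below uses a hypothetical window of five pixel values containing one salt-like outlier and compares the median and the mean as candidate centers; the helper names are chosen for illustration.

```python
from statistics import mean, median

def sum_abs_dev(values, c):
    """Sum of absolute deviations of `values` from the center c (L1 cost)."""
    return sum(abs(v - c) for v in values)

def sum_sq_dev(values, c):
    """Sum of squared deviations of `values` from the center c (L2 cost)."""
    return sum((v - c) ** 2 for v in values)

values = [10, 12, 11, 255, 9]   # four plausible pixels and one salt impulse
med, avg = median(values), mean(values)

print("median:", med, "L1 cost:", sum_abs_dev(values, med))
print("mean:  ", avg, "L1 cost:", sum_abs_dev(values, avg))
```

The median stays among the uncontaminated values, whereas the mean, which minimizes the squared deviations instead, is dragged toward the impulse.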

Spatial filtering techniques are well established and could be leveraged in convolutional neural networks to deal with images affected by nonlinear noise. Further study of where median layers are placed could help in understanding their impact on the denoising process.


  • [1] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, “Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising,” IEEE Transactions on Image Processing, vol. 26, no. 7, pp. 3142–3155, 2017.
  • [2] K. Zhang, W. Zuo, and L. Zhang, “FFDNet: Toward a fast and flexible solution for CNN based image denoising,” IEEE Transactions on Image Processing, 2018.
  • [3] D. Liu, B. Wen, X. Liu, Z. Wang, and T. Huang, “When image denoising meets high-level vision tasks: A deep learning approach,” in Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18. International Joint Conferences on Artificial Intelligence Organization, 7 2018, pp. 842–848.
  • [4] B. Fu, X. Zhao, Y. Li, X. Wang, and Y. Reng, “A convolutional neural networks denoising approach for salt and pepper noise,” Multimedia Tools and Applications, pp. 1–18, 2018.
  • [5] J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila, “Noise2noise: Learning image restoration without clean data,” in International Conference on Machine Learning (ICML) 2018, 2018, pp. 2971–2980.
  • [6] R. Furuta, N. Inoue, and T. Yamasaki, “Fully convolutional network with multi-step reinforcement learning for image processing,” in AAAI Conference on Artificial Intelligence (AAAI), 2019.
  • [7] C. Chen, Z. Xiong, X. Tian, and F. Wu, “Deep boosting for image denoising,” in European Conference on Computer Vision 2018 (ECCV), 2018.
  • [8] W. Wang and P. Lu, “An efficient switching median filter based on local outlier factor,” IEEE Signal Processing Letters, vol. 18, pp. 551–554, 2011.
  • [9] J. Varghese, N. Tairan, and S. Subash, “Adaptive switching non-local filter for the restoration of salt and pepper impulse-corrupted digital images,” Arabian Journal for Science & Engineering, vol. 40, pp. 3233–3246, 2015.
  • [10] J. Delon, A. Desolneux, and T. Guillemot, “Parigi: a patch-based approach to remove impulse-gaussian noise from images,” Image Process On Line, vol. 5, pp. 130–154, 2016.
  • [11] T. S. Huang, G. J. Yang, and G. Y. Tang, “A fast two-dimensional median filtering algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 27, pp. 13–18, 1979.
  • [12] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 38, pp. 295–307, 2015.
  • [13] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Proc. 8th Int’l Conf. Computer Vision, vol. 2, July 2001, pp. 416–423.
  • [14] K. S. Srinivasan and D. Ebenezer, “A new fast and efficient decision-based algorithm for removal of high-density impulse noises,” IEEE Signal Processing Letters, vol. 14, pp. 189–192, 2007.
  • [15] H. C. Burger, C. J. Schuler, and S. Harmeling, “Image denoising: Can plain neural networks compete with bm3d?” in Proc. 2012 IEEE Conference on Computer Vision and Pattern Recognition, vol. 157, 2012, pp. 2392–2399.