Implementation of the VBM3D Video Denoising Method and Some Variants

01/06/2020
by   Thibaud Ehret, et al.

VBM3D is an extension to video of the well-known image denoising algorithm BM3D, which takes advantage of the sparse representation of stacks of similar patches in a transform domain. The extension is rather straightforward: the similar 2D patches are taken from a spatio-temporal neighborhood which includes neighboring frames. In spite of its simplicity, the algorithm offers a good trade-off between denoising performance and computational complexity. In this work we revisit this method, providing an open-source C++ implementation reproducing the results. A detailed description is given and the choice of parameters is thoroughly discussed. Furthermore, we discuss several extensions of the original algorithm: (1) a multi-scale implementation, (2) the use of 3D patches, (3) the use of optical flow to guide the patch search. These extensions make it possible to obtain results that are competitive with even the most recent state of the art.


I Introduction

VBM3D was proposed by [7] as an adaptation to video denoising of BM3D, the successful image denoising algorithm [9]. The algorithm is designed for additive white Gaussian noise with zero mean and standard deviation σ, i.e.

v(x) = u(x) + n(x),   with n(x) ~ N(0, σ²) independent across pixels,

where x is a position in the video domain (a pixel), v is the noisy video and u the unknown clean video. The denoising principle of VBM3D (and BM3D) is based on the redundancy of similar patches. Groups of similar 2D patches are assembled in a 3D stack of patches. A separable 3D transform is applied to this stack. The stack is denoised by applying a shrinkage operator to the coefficients in the transformed domain. The algorithm follows five basic steps:

  1. Search for similar patches in the sequence, grouping them into a 3D stack,

  2. Apply a 3D linear domain transform to the 3D block,

  3. Shrink the transformed coefficients,

  4. Apply the inverse transform,

  5. Aggregate the resulting patches in the video.

The underlying idea here is that, due to the high redundancy of a stack of similar patches, the energy will be concentrated in a few coefficients of the transform, while the noise is spread evenly among all coefficients. This makes it possible to jointly denoise the patches of each stack. The reconstruction of the estimated video is obtained by aggregating for each pixel the estimated patches that contain it. This principle is applied twice. The first time the patches are denoised using a hard threshold in the transformed domain. In the second iteration a Wiener filter is used, with Wiener coefficients computed using the output of the first step as oracle. There exist other adaptations of this framework to video and to 3D images: BM4D [15] and VBM4D [14]. These methods stack similar 3D patches in a 4D stack. The main difference between them is that VBM4D uses motion compensated spatio-temporal patches whereas BM4D aims at denoising volumetric images (such as those appearing in medical imaging), where the third dimension is just another spatial dimension. In [1] however, it was shown that even non-motion compensated 3D patches provide a very good denoising performance.

In this article, we revisit the VBM3D method, providing an open-source implementation, and we discuss some variants introduced in [1], such as 3D patches (as in BM4D [15]), an optical flow guided patch search, and a multiscale implementation. In the next section, the algorithm itself is reviewed. In the following sections we consider three possible extensions: multiscale processing, spatio-temporal patches, and a motion compensated patch search using an optical flow. Finally, the performance of the algorithm is compared against other state-of-the-art algorithms.
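As a minimal illustration of the degradation model above (and a convenient way to generate test data), the following C++ sketch adds i.i.d. Gaussian noise of standard deviation σ to a single-channel frame. The function name and data layout are ours and are not part of the released implementation.

// Minimal sketch (not from the released code): simulate the degradation model
// v(x) = u(x) + n(x), with n(x) ~ N(0, sigma^2), on one frame.
#include <random>
#include <vector>

// Add i.i.d. Gaussian noise of standard deviation sigma to a single-channel
// frame stored as a flat row-major array of floats.
std::vector<float> add_awgn(const std::vector<float>& clean, float sigma,
                            unsigned seed = 0)
{
    std::mt19937 gen(seed);
    std::normal_distribution<float> noise(0.f, sigma);
    std::vector<float> noisy(clean.size());
    for (size_t i = 0; i < clean.size(); ++i)
        noisy[i] = clean[i] + noise(gen);
    return noisy;
}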

Notation

The noisy input video, denoted v, consists of frames v_t for t between 1 and T. The clean unknown video is denoted by u. A patch is a small rectangular piece of the video, for example of size κ × κ × 3 (κ pixels in width and height, 3 pixels in the temporal dimension). Patches are represented by bold lowercase letters, e.g. p. If we need to emphasize the location of the patch we write p_x, where x represents the location in the video of the top-left-front pixel of the patch.

For the parameters of the method we will use the same notation as in [7]. The parameters of the patch search are the number of patches to return N, the number of temporal frames that are searched N_FR, the size of the search region in the reference frame N_S, the size of the search region in the other frames N_PR, the maximum number of patches kept by each local search N_B, a correcting factor d, and a maximum distance threshold τ. The threshold parameter in the first step is λ_3D. We also write ⊙ for the element-wise product. Similar parameters are used for the hard-thresholding first step and for the Wiener-filtering second step. Since they can take different values in the two steps, they are differentiated using the subscripts "ht" and "wien" respectively.

II VBM3D: the algorithm

VBM3D performs two steps, like its image counterpart BM3D. The first, hard-thresholding step computes a basic estimate, which is refined in the Wiener filtering step, yielding the final estimate. Figure 1 represents the global structure of the algorithm. The pseudocode of the core of the algorithm is presented in Algorithm 5.

Fig. 1: Scheme of the core of the VBM3D algorithm.

II-A The patch search

The groups of similar patches are built by selecting a reference patch and searching around it for its nearest neighbors. This patch search is the main difference between BM3D and VBM3D. The image version of the algorithm searches a square 2D window centered at the reference patch. While the same could be done in a space-time volume for videos, Dabov et al. [7] propose a predictive search heuristic to reduce the size of the search window. The idea behind the proposed search is to track patches in the video. Let p_x and p_y be two patches at positions x and y respectively. The distance between p_x and p_y used to find the nearest neighbors, for a given regularizing factor d, is

dist(p_x, p_y) = ‖p_x − p_y‖²₂ / κ² + d · δ_{x≠y},   (1)

where δ_{x≠y} equals 0 when x and y share the same spatial position and 1 otherwise. This distance regularizes the patch trajectories by favoring non-moving patches. Suppose that the reference patch p is located at position x in frame t. The goal of the proposed search is to find (at most) N patches in a spatio-temporal window spanning N_FR frames before and after frame t. The steps are the following:

  1. Find the N_B nearest neighbors to p in frame t, in a square search region of size N_S × N_S centered at the location of the reference patch. Let S_t be the set of found patches.

  2. Find nearest neighbors to p in frames t+1, …, t+N_FR. The search region at frame k is the union of square regions of size N_PR × N_PR centered at the positions of the candidates in S_{k−1}, where each local search keeps its N_B best candidates. This is depicted in Figure 2.

  3. Similarly, find nearest neighbors to p in frames t−1, …, t−N_FR. This time the search region at frame k is the union of squares centered at the positions of the patches in S_{k+1}.

  4. Remove the candidates with distance (1) larger than τ from the set S combining all the candidates computed in each frame.

  5. Keep the N best candidates from S.

  6. Because the next steps of the algorithm (in particular the domain transforms) require a number of patches that is a power of 2, only the largest power of 2 smaller than or equal to the number of remaining candidates is kept.

These steps are summarized in Algorithms 1 and 2.

Fig. 2: The patch search starts with a local spatial search of size N_S × N_S in the current frame. Only the N_B best candidates are then used in the following frame to predict where to search. This leads to N_B local spatial searches (each centered at one of the candidates from the previous frame), of size N_PR × N_PR. In practice N_PR is smaller than N_S and only a few candidates N_B are kept per frame, so the predictive search explores far fewer locations than a full spatio-temporal search.
input : Noisy video v, a reference patch p, the number of patches to return N, the number of temporal frames N_FR, the search region size for the reference frame N_S, the search region size for the other frames N_PR, the maximum number of patches kept by each local search N_B, the patch size κ, a correcting factor d, a maximum distance threshold τ
output : L, the list of the patches closest to p and their distances
t ← frame at which p is located
S_t ← local_search(x_p, p, N_S, κ, N_B, d, t)   // centered on x_p, using p as a reference
// Search in the following frames
for k = t+1 to t+N_FR do
      S_k ← ∅
      for each candidate q in S_{k−1} do
            S_k ← S_k ∪ local_search(x_q, p, N_PR, κ, N_B, d, k)   // centered on x_q, using p as a reference
// Search in the previous frames
for k = t−1 downto t−N_FR do
      S_k ← ∅
      for each candidate q in S_{k+1} do
            S_k ← S_k ∪ local_search(x_q, p, N_PR, κ, N_B, d, k)   // centered on x_q, using p as a reference
L ← elements from ∪_k S_k with distance smaller than τ
// the 3D transform requires a number of patches which is a power of 2
N′ ← closest power of 2 smaller than or equal to min(N, |L|)
if L has more than N′ elements then
      L ← the N′ best candidates from L
return L
Algorithm 1 compute_similar_patches: VBM3D patch search
input : A position c, center of the search region, a reference patch p, the search region size n, the patch size κ, the number of patches to return N_B, a correcting factor d, a frame index k
output : L, the list of the N_B patches closest to p in the region described by c and n, and their distances
// Compute the distances for all the patches in the local region
for each patch q of size κ × κ in the spatial region of size n × n of frame k centered on c do
      if q is at the same spatial position as p then
            add (q, ‖p − q‖²₂ / κ²) to L
      else
            add (q, ‖p − q‖²₂ / κ² + d) to L
Sort L according to the value of the distance
Keep the N_B best elements in L
return L
Algorithm 2 local_search: Local search
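As an illustration, the following C++ sketch computes the penalized distance of Eq. (1) between two patches, as used by the local search of Algorithm 2. The frame structure and function name are ours and do not match the released implementation; frames are assumed to be single-channel, row-major float arrays.

// Sketch of the penalized patch distance of Eq. (1).
#include <cstddef>
#include <vector>

struct Frame {
    std::vector<float> data;   // single channel, row-major
    int width, height;
    float at(int x, int y) const { return data[(size_t)y * width + x]; }
};

// Squared L2 distance between the kappa x kappa patches with top-left corners
// (xr, yr) in `ref` and (xc, yc) in `cand`, normalized by the number of
// pixels, plus the correcting factor d when the spatial positions differ.
float patch_distance(const Frame& ref, int xr, int yr,
                     const Frame& cand, int xc, int yc,
                     int kappa, float d)
{
    float dist = 0.f;
    for (int j = 0; j < kappa; ++j)
        for (int i = 0; i < kappa; ++i) {
            float diff = ref.at(xr + i, yr + j) - cand.at(xc + i, yc + j);
            dist += diff * diff;
        }
    dist /= (float)(kappa * kappa);        // normalized squared distance
    if (xr != xc || yr != yc) dist += d;   // penalize moving patches
    return dist;
}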

II-B Patch stack filtering: hard-threshold step

The first step processes the video by filtering stacks of similar patches and aggregating them into an output video. The reference patches for the stacks are all the patches on a sub-grid of step N_step,ht. For each group there is first an estimation, followed by an aggregation.

Estimation.

For a given reference patch p, we first find its similar patches as described in Section II-A. These patches are stacked in a 3D volume G, whose first spatial slice is the reference patch and whose remaining slices are the similar patches ordered by their distance to p. A 3D domain transform T_3D is first applied to the group of patches, followed by a thresholding of the resulting spectrum (excepting the DC component of every patch). The inverse 3D transform is then applied to the thresholded coefficients to obtain the estimate, i.e.

Ĝ = T_3D⁻¹( γ( T_3D(G) ) ),   (2)

where the hard-thresholding operator γ is applied coefficient-wise and defined by

γ(c) = c if |c| ≥ λ_3D σ, and γ(c) = 0 otherwise.   (3)

In practice the 3D transform is chosen to be separable, consisting of a 2D spatial transform applied directly to the patches of G, typically a bi-orthogonal wavelet transform, followed by a 1D transform along the third dimension of the stack, typically a Haar transform. The pseudocode of the estimation for the first step is presented in Algorithm 3.

Aggregation.

An output pixel is estimated several times, since it belongs to several patches and each patch can be estimated multiple times (once for each group it belongs to). To compute the output frame these estimates are aggregated. For a pixel x of the frame, this aggregation is performed as

û^basic(x) = [ Σ_G w_G Σ_{q̂ ∈ Ĝ} K_{q̂}(x) q̂(x) ] / [ Σ_G w_G Σ_{q̂ ∈ Ĝ} K_{q̂}(x) ],   (4)

where each estimated patch q̂ is extended to a support of the size of the image and is zero everywhere except on the actual location of the patch, and K_{q̂} denotes the 2D Kaiser window placed on the support of q̂ (also zero outside it). The weights used during the aggregation are computed in part during the estimation. The part computed during the estimation is w_G = 1/(σ² N_G), where N_G corresponds to the number of coefficients that have not been thresholded during the estimation of Ĝ. This weight is properly defined because N_G is always positive, as the DC component is never thresholded. The rest of the weight comes from a 2D Kaiser window of size κ × κ (see Eq. (6)), located on the position of the patch and applied to avoid boundary effects. The 1D Kaiser window of size κ and parameter β is

k[i] = I₀( β √(1 − (2i/(κ−1) − 1)²) ) / I₀(β),   i = 0, …, κ−1,   (5)

where I₀ is the zeroth-order modified Bessel function of the first kind, and the 2D window is its separable product

K = k kᵀ.   (6)

Finally the (patch) weight is the product w_G K. The pseudocode with the aggregation step can be found in Algorithm 5.
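As an illustration of Eqs. (4)-(6), the following C++17 sketch builds the separable 2D Kaiser window with std::cyl_bessel_i and accumulates one weighted patch estimate into running numerator and denominator buffers. Names and data layout are ours, not those of the released implementation.

// Sketch of the Kaiser window (Eqs. (5)-(6)) and of the aggregation of Eq. (4).
#include <cmath>
#include <cstddef>
#include <vector>

// kappa x kappa separable Kaiser window of parameter beta (assumes kappa >= 2),
// stored row-major.
std::vector<float> kaiser_window_2d(int kappa, double beta)
{
    std::vector<double> k1d(kappa);
    const double i0b = std::cyl_bessel_i(0.0, beta);
    for (int i = 0; i < kappa; ++i) {
        double t = 2.0 * i / (kappa - 1) - 1.0;               // in [-1, 1]
        k1d[i] = std::cyl_bessel_i(0.0, beta * std::sqrt(1.0 - t * t)) / i0b;
    }
    std::vector<float> k2d(kappa * kappa);
    for (int j = 0; j < kappa; ++j)
        for (int i = 0; i < kappa; ++i)
            k2d[j * kappa + i] = (float)(k1d[j] * k1d[i]);    // outer product
    return k2d;
}

// Accumulate one filtered patch (top-left corner (px, py)) into the running
// numerator and denominator images of Eq. (4), with group weight w.
void aggregate_patch(std::vector<float>& num, std::vector<float>& den,
                     int width, const std::vector<float>& patch,
                     const std::vector<float>& kaiser, int kappa,
                     int px, int py, float w)
{
    for (int j = 0; j < kappa; ++j)
        for (int i = 0; i < kappa; ++i) {
            size_t idx = (size_t)(py + j) * width + (px + i);
            float wk = w * kaiser[j * kappa + i];
            num[idx] += wk * patch[j * kappa + i];
            den[idx] += wk;
        }
}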

input : A group of similar patches G, a 3D transform T_3D, the noise variance σ²
output : A list of filtered patches Ĝ, the aggregation weight w
C ← T_3D(G)
N_ret ← 0
for each patch spectrum c in C do
      for each coefficient c(ξ) of c do
            if |c(ξ)| ≥ λ_3D σ or c(ξ) is the DC component then
                  keep c(ξ); N_ret ← N_ret + 1
            else
                  c(ξ) ← 0
Ĝ ← T_3D⁻¹(C)
w ← 1 / (σ² N_ret)
return Ĝ, w
Algorithm 3 ht_filtering: Hard thresholding
input : A group of similar noisy patches G, a first estimate of G called G^basic, a 3D transform T_3D, the noise variance σ²
output : A list of filtered patches Ĝ, the aggregation weight w
C ← T_3D(G); B ← T_3D(G^basic)
for each patch spectrum b in B, with corresponding spectrum c in C do
      for each coefficient index ξ of b do
            W(ξ) ← b(ξ)² / (b(ξ)² + σ²)
            c(ξ) ← W(ξ) c(ξ)
Ĝ ← T_3D⁻¹(C)
w ← 1 / (σ² ‖W‖²₂)
return Ĝ, w
Algorithm 4 wiener_filtering: Wiener filtering

II-C Patch stack filtering: Wiener filtering step

The second step of the algorithm uses the basic estimate computed during the first step both for the patch search and to compute the coefficients of a Wiener shrinkage of the transformed coefficients. The video is processed by building groups of patches around reference patches taken from a coarse sub-grid with step N_step,wien.

Estimation.

Given a reference patch p, a group of similar patches is selected as described in Section II-A, but using patches extracted from the basic estimate to compute the patch distances. Two sets of patches are extracted at the same locations: G from the noisy sequence and G^basic from the basic estimate. Once both groups have been computed, a Wiener filtering step is applied. The filter coefficients are computed using G^basic, but it is G which is filtered to produce the estimate. Just like in the first step, a 3D domain transform is first applied to the group of patches before the application of the Wiener filter. The computation follows Equation (7):

Ĝ = T_3D⁻¹( W ⊙ T_3D(G) ),   (7)

where the Wiener coefficient at a frequency ξ (with T_3D(G^basic)(ξ) corresponding to the same frequency in the basic estimate) is defined by

W(ξ) = T_3D(G^basic)(ξ)² / ( T_3D(G^basic)(ξ)² + σ² ).   (8)

In practice the 3D transform is chosen as a 2D spatial transform applied directly to the patches of G and G^basic, typically a DCT, followed by a 1D transform along the third dimension of the group, typically a Haar transform. The pseudocode of the estimation for the second step is presented in Algorithm 4.
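The following C++ sketch illustrates the empirical Wiener shrinkage of Eqs. (7)-(8) applied to the transform coefficients of the noisy group, using the coefficients of the basic-estimate group as oracle, and returns the aggregation weight described in the next paragraph. The transforms are not shown and the names are ours.

// Sketch of the Wiener shrinkage of Eqs. (7)-(8) on transform coefficients.
#include <cstddef>
#include <vector>

// `noisy` and `basic` hold the 3D-transform coefficients of the two groups
// (same size). Returns the aggregation weight w = 1 / (sigma^2 * ||W||_2^2).
float wiener_shrinkage(std::vector<float>& noisy,
                       const std::vector<float>& basic, float sigma)
{
    const float sigma2 = sigma * sigma;
    float wnorm2 = 0.f;
    for (size_t i = 0; i < noisy.size(); ++i) {
        float b2 = basic[i] * basic[i];
        float w = b2 / (b2 + sigma2);   // Wiener coefficient, Eq. (8)
        noisy[i] *= w;                  // filtered coefficient, Eq. (7)
        wnorm2 += w * w;
    }
    return 1.f / (sigma2 * wnorm2);
}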

Aggregation.

Just as in the first step, the estimation gives multiple estimates per pixel. Therefore the different estimates are aggregated using the same principle as in Section II-B, defined by Equation (4). The only difference lies in the weights. The part of the weight computed during the estimation is w_G = 1/(σ² ‖W‖²₂), which is linked to the squared ℓ² norm of the vector of Wiener coefficients used for the filtering. The rest of the weight comes from a 2D Kaiser window of size κ_wien × κ_wien applied to avoid boundary effects from the patches. Finally the pixel weight is the product w_G K. The pseudocode with the aggregation step can be found in Algorithm 5.

input : A video v, the noise variance σ², the number of similar patches to compute N, the number of temporal frames N_FR, the size of the search region in the reference frame N_S, the size of the search region in the other frames N_PR, the number of patches kept in each frame N_B, the patch size κ, the correcting factor d, the distance threshold τ, the thresholding parameter λ_3D, the domain transforms T_3D,ht and T_3D,wien, the coefficient β of the Kaiser window, the steps N_step,ht and N_step,wien of the grids on which the reference patches are taken
output : The final estimate û of the denoised video
K_ht ← Kaiser window of size κ_ht and coefficient β
K_wien ← Kaiser window of size κ_wien and coefficient β
// Step 1
for each reference patch p on the grid of step N_step,ht do
      // Search for similar patches in the noisy video
      G ← compute_similar_patches(v, p, N_ht, N_FR, N_S,ht, N_PR,ht, N_B, κ_ht, d_ht, τ_ht)
      // Filter the group of patches using a hard thresholding
      Ĝ, w ← ht_filtering(G, T_3D,ht, σ²)
      for each patch q̂ in Ĝ do
            accumulate w K_ht q̂ into the numerator ν and w K_ht into the denominator δ
for each pixel x do
      û^basic(x) ← ν(x) / δ(x)
// Step 2
for each reference patch p on the grid of step N_step,wien do
      // Search for similar patches in the basic estimate
      G^basic ← compute_similar_patches(û^basic, p, N_wien, N_FR, N_S,wien, N_PR,wien, N_B, κ_wien, d_wien, τ_wien)
      G ← patches from v at the same positions as those of G^basic
      // Filter the group of patches using a Wiener filtering
      Ĝ, w ← wiener_filtering(G, G^basic, T_3D,wien, σ²)
      for each patch q̂ in Ĝ do
            accumulate w K_wien q̂ into the numerator ν and w K_wien into the denominator δ
for each pixel x do
      û(x) ← ν(x) / δ(x)
return û
Algorithm 5 VBM3D algorithm

II-D Reproducing the original VBM3D

We now compare the results obtained with our implementation to those obtained from the binaries released by the authors of the original VBM3D [7] (available at http://www.cs.tut.fi/~foi/GCF-BM3D/). Throughout this work we will use a test set of seven grayscale test sequences obtained from the Derf's Test Media collection (https://media.xiph.org/video/derf/). The original sequences are of higher resolution and in RGB; they have been downscaled and converted to grayscale.

Table I compares the PSNR obtained by our implementation with that of the original binaries. Our results are slightly below those of the original implementation. The average gap starts at 0.27dB for σ = 10 and reduces to 0.18dB for σ = 40. Station and sunflower are the sequences where the gap is largest, reaching 0.68dB for the latter at σ = 10.

A visual comparison of both results is shown in Figure 3. Both results are visually very similar, with the same type of artifacts. A closer inspection reveals that the result of the original VBM3D is slightly smoother, resulting in less noticeable DCT artifacts and in some textures with decreased contrast (see for instance the grass in the rightmost image).

Method            Crowd  Park   Pedestrians  Station  Sunflower  Touchdown  Tractor  Average
σ = 10
VBM3D (original)  35.65  34.75  40.83        38.93    40.49      39.04      37.01    38.10
VBM3D (ours)      35.52  34.59  40.65        38.38    39.81      39.01      36.82    37.83
σ = 20
VBM3D (original)  32.25  31.25  36.94        35.45    36.46      36.08      33.07    34.50
VBM3D (ours)      32.06  31.12  36.81        35.10    35.95      36.05      32.97    34.30
σ = 40
VBM3D (original)  28.65  27.68  32.81        32.02    32.65      33.52      29.41    30.96
VBM3D (ours)      28.39  27.64  32.62        31.80    32.31      33.35      29.38    30.78
TABLE I: Comparison of the denoising quality (PSNR) between our implementation and the binary program provided by Dabov et al. [8]. Both results were computed using the normal profile 'np' parameters.
Fig. 3: Top: results of VBM3D [7], with the original authors’ implementation. Bottom: result obtained with our implementation. The noise level is . The contrast has been linearly scaled for better visualization.

III Extensions

III-A Using optical flow to guide the search

Many video denoising methods take advantage of the optical flow to estimate motion in the video. Typically, optical flow is used by video denoising methods that require aggregating information along motion trajectories [4, 13, 6, 11, 3]. These methods require a motion estimate as accurate as possible. Most patch-based methods, on the other hand, do not require such an accurate motion estimate, as they are based on finding similar patches in a 3D search region [5, 7, 14]. These methods either do not use any motion estimate at all, or use a very rough one (e.g. block matching) to guide the search region. However, it has been observed that using the optical flow to shape the search region can still be beneficial for patch-based methods [2, 1], as it helps find better matches.

For a pair of images A and B, the optical flow aims at finding a vector field v such that any point x of the image domain solves

A(x) = B(x + v(x)).   (9)

The displacement vector can be sub-pixel, in which case the image B needs to be interpolated. We decided to use the TV-L1 optical flow method [21], in particular the implementation provided in [20]. We compute the optical flow on a downscaled version of the video, and scale it back to the original resolution. This reduces the running time and the impact of the noise while still having a reasonable precision, as shown in [11].

We add the optical flow to VBM3D as a guide, in the same way as it is done for VNLB [2]. The spatio-temporal search region is defined as two sequences of square windows of size N_PR × N_PR (plus the window corresponding to the reference frame), whose centers follow the motion trajectory of the reference patch. The trajectory is estimated using the forward and backward optical flow: the center in each following (respectively previous) frame corresponds to the position of the reference patch propagated forward (respectively backward) using the forward (respectively backward) optical flow. The forward half of the trajectory passing through x_t = x at frame t is computed by integrating the forward optical flow v^f (indexed by time in addition to the spatial position) as follows:

x_{s+1} = [ x_s + v^f_s(⌊x_s⌋) ],   s = t, t+1, …,   (10)

where [·] and ⌊·⌋ denote the round and floor operators. The backward half of the trajectory is defined analogously using the backward optical flow v^b.

This also means that only one center needs to be tracked per frame, in contrast to the N_B centers required by the regular VBM3D search. Parameters are kept the same as in the original algorithm. The guided version of the search is summarized in Algorithm 6.
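As an illustration of the trajectory integration of Eq. (10), the following C++ sketch propagates the center of the search window through the following frames by sampling the forward optical flow at the rounded current position. The flow container and function name are ours; the backward half of the trajectory would be computed analogously with the backward flow.

// Sketch of the flow-guided search-region centers of Eq. (10).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Flow {
    std::vector<float> vx, vy;  // one value per pixel, row-major
    int width, height;
};

// Given the position (x, y) of the patch center in frame t, return the list of
// centers in frames t+1, ..., t+n_fr obtained by following the forward flows.
std::vector<std::pair<float, float>>
forward_centers(float x, float y, const std::vector<Flow>& fwd_flow,
                int t, int n_fr)
{
    std::vector<std::pair<float, float>> centers;
    for (int k = 0; k < n_fr; ++k) {
        const Flow& f = fwd_flow[t + k];
        int xi = (int)std::lround(x), yi = (int)std::lround(y);
        xi = std::max(0, std::min(f.width  - 1, xi));   // clamp to the domain
        yi = std::max(0, std::min(f.height - 1, yi));
        x += f.vx[(size_t)yi * f.width + xi];           // follow the flow
        y += f.vy[(size_t)yi * f.width + xi];
        centers.emplace_back(x, y);
    }
    return centers;
}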

We also tested setting the parameter d to zero for both steps, since one could assume that it would be redundant with the regularization of the patch search offered by the optical flow. PSNR results are shown in Table II. Using the optical flow as a guide greatly improves the quality of the denoising. The non-zero d yields better results both with and without the optical flow guide, thus we decided to keep it in our final version.

Figure 4 shows results obtained with the different extensions that will be described in this section. The first column corresponds to VBM3D with the default parameters, and the third column to the one using an optical flow guided search region (denoted "VBM3D OF"). Guiding the patch search helps recover more details. This is because the tracking of patches is more robust to motion than the block matching suggested for the original VBM3D and therefore provides better matches. Since the optical flow is used only as a guide for the center of the search region, computing the optical flow from the noisy data is not a problem, as it is not required to be very precise.

input : Noisy video v, a reference patch p, the number of patches to return N, the number of temporal frames N_FR, the search region size for the reference frame N_S, the search region size for the other frames N_PR, the maximum number of patches kept by each local search N_B, the patch size κ, a correcting factor d, a maximum distance threshold τ, the forward and backward optical flows v^f and v^b
output : L, the list of the patches closest to p and their distances
t ← frame at which p is located
S_t ← local_search(x_p, p, N_S, κ, N_B, d, t); c ← x_p
// Search in the following frames
for k = t+1 to t+N_FR do
      c ← follow the center c using the forward optical flow v^f (Eq. (10))
      S_k ← local_search(c, p, N_PR, κ, N_B, d, k)
// Search in the previous frames
c ← x_p
for k = t−1 downto t−N_FR do
      c ← follow the center c using the backward optical flow v^b
      S_k ← local_search(c, p, N_PR, κ, N_B, d, k)
L ← elements from ∪_k S_k with distance smaller than τ
// the 3D transform requires a number of patches which is a power of 2
N′ ← closest power of 2 smaller than or equal to min(N, |L|)
if L has more than N′ elements then
      L ← the N′ best candidates from L
return L
Algorithm 6 compute_similar_patches: VBM3D patch search guided by the optical flow
Method                Crowd  Park   Pedestrians  Station  Sunflower  Touchdown  Tractor  Average
σ = 10
VBM3D (without)       35.52  34.59  40.65        38.38    39.81      39.01      36.82    37.83
VBM3D (without, d=0)  35.72  35.00  40.13        39.34    40.18      39.01      36.94    38.05
VBM3D (with)          35.61  34.94  41.04        39.79    41.75      39.89      38.43    38.78
VBM3D (with, d=0)     35.78  35.19  40.61        40.27    41.85      39.75      38.51    38.85
σ = 20
VBM3D (without)       32.06  31.12  36.81        35.10    35.95      36.05      32.97    34.30
VBM3D (without, d=0)  32.00  31.24  36.13        35.38    36.09      35.83      33.03    34.24
VBM3D (with)          32.17  31.44  37.32        36.26    38.02      36.91      34.58    35.24
VBM3D (with, d=0)     32.10  31.49  36.72        36.32    37.98      36.61      34.58    35.11
σ = 40
VBM3D (without)       28.39  27.64  32.62        31.80    32.31      33.35      29.38    30.78
VBM3D (without, d=0)  28.29  27.65  32.24        31.79    32.33      33.11      29.40    30.69
VBM3D (with)          28.55  27.99  33.29        32.80    34.46      34.00      30.79    31.70
VBM3D (with, d=0)     28.45  27.98  32.94        32.75    34.40      33.79      30.76    31.58
TABLE II: Comparison of the denoising quality with and without guiding the search with an optical flow. Guiding the search leads to much better results. It is also better in general to keep the additional regularization given by d.

III-B Spatio-temporal patches

The patch similarity is determined based on the (squared) Euclidean distance between patches. Due to the noise, the squared Euclidean distance follows a non-central distribution, with variance

Var( ‖p̃ − q̃‖²₂ ) = 8σ²( ‖p − q‖²₂ + nσ² ),

where p̃, q̃ are the noisy versions of the patches p, q, and n denotes the number of pixels in the patch. The noise in the distance can be reduced by considering larger patches: for a fixed per-pixel difference, the variance of the normalized distance ‖p̃ − q̃‖²₂/n decreases like 1/n. However, increasing the size of the patch also increases the distances between patches and reduces the likelihood of finding similar ones. The additional temporal dimension in a spatio-temporal patch allows the number of pixels in the patch to be increased without increasing its spatial size. Due to the high redundancy of the video in the temporal dimension, increasing the temporal size of the patch causes a much lower increase in the patch distances.

When the motion is known or can be estimated, it is natural to consider motion compensated spatio-temporal patches (see for instance [14, 6]). Alternatively, rectangular spatio-temporal patches with no motion compensation have also been used [18, 2]. For more complex types of motion, using rectangular spatio-temporal patches will result in a larger variability in the set of nearest neighbors of a given patch, since both the spatial texture and the motion pattern may vary. At least in principle, better results should be obtained using motion compensation. However, in practice, for higher levels of noise the poor quality of the estimated motion can undermine the final result.

The principle of BM3D has been applied to 3D patches with and without motion compensation. VBM4D, introduced in [14], uses motion compensated spatio-temporal patches for video denoising (the "V" stands for video). The motion is estimated using block matching. BM4D uses cubic patches without motion compensation, aiming at filtering volumetric images [15]. In [15] the authors compare the performance of both VBM4D and BM4D on video denoising, concluding that VBM4D was the best video filtering strategy.

However, in [2, 1] it is shown that rectangular spatio-temporal patches with a temporal size of only two frames improve the denoising quality and still provide higher temporal consistency than 2D patches. Based on those results, in this work we evaluate rectangular spatio-temporal patches with a temporal size of two frames in both steps (i.e. we keep the spatial patch sizes).

In Table III we compare the quantitative results obtained by using spatial and spatio-temporal patches (denoted by “ST” in the table). We also consider the effect of the optical-flow-guided patch search, indicated as “OF”.

The first four columns of Figure 4 show results with and without the motion-compensated search and the spatio-temporal patches. From a qualitative point of view, using spatio-temporal patches provides better temporal consistency. In addition, the patch distance is more reliable since the number of pixels in the patches is doubled. This helps retrieve details and texture in regions with a simple motion (e.g. translational). For low noise levels, the effect of these 3D patches is mixed. While the results seem more consistent temporally, they are blurrier for sequences with complex motions. This explains the drop in PSNR observed for pedestrians, sunflower and touchdown between VBM3D and VBM3D ST for σ = 10. As the noise level increases this detail loss is outweighed by the increased robustness to noise of the patch matching.

When spatio-temporal patches are used in conjunction with the optical-flow-guided search, their positive impact is magnified. Although the patches are not themselves motion-compensated, having a motion-compensated search region helps find better matches, even in sequences with more complex motion patterns. The motion-compensated search region also improves the temporal consistency, although to a lesser extent than the spatio-temporal patches. The best result, both in terms of PSNR and temporal consistency, is obtained when both strategies are used together (VBM3D ST+OF).

Method          Crowd  Park   Pedestrians  Station  Sunflower  Touchdown  Tractor  Average
σ = 10
VBM3D [7]       35.65  34.75  40.83        38.93    40.49      39.04      37.01    38.10
VBM3D (ours)    35.52  34.59  40.65        38.38    39.81      39.01      36.82    37.83
VBM3D ST        35.65  34.66  40.41        38.55    39.65      38.91      36.90    37.82
VBM3D OF        35.61  34.94  41.04        39.79    41.75      39.89      38.43    38.78
VBM3D ST+OF     35.74  35.04  41.01        40.41    41.91      39.98      38.71    38.97
σ = 20
VBM3D [7]       32.25  31.25  36.94        35.45    36.46      36.08      33.07    34.50
VBM3D (ours)    32.06  31.12  36.81        35.10    35.95      36.05      32.97    34.30
VBM3D ST        32.39  31.36  36.97        35.57    36.14      36.16      33.23    34.55
VBM3D OF        32.17  31.44  37.32        36.26    38.02      36.91      34.58    35.24
VBM3D ST+OF     32.48  31.71  37.61        37.02    38.45      37.19      35.18    35.66
σ = 40
VBM3D [7]       28.65  27.68  32.81        32.02    32.65      33.52      29.41    30.96
VBM3D (ours)    28.39  27.64  32.62        31.80    32.31      33.35      29.38    30.78
VBM3D ST        29.18  28.13  33.35        32.50    32.70      33.65      29.66    31.31
VBM3D OF        28.55  27.99  33.29        32.80    34.46      34.00      30.79    31.70
VBM3D ST+OF     29.30  28.50  34.21        33.68    35.06      34.47      31.46    32.38
TABLE III: Quantitative denoising results (PSNR) for seven grayscale test sequences from the Derf's Test Media collection for several variants of VBM3D. We highlighted the best performance in black and the second best in brown.
Fig. 4: From left to right: result of VBM3D (our implementation with default parameters), VBM3D ST, VBM3D OF, VBM3D ST+OF, VBM3D ST+OF+MS (Lanczos multiscaler) and the original clean video. The noise level is . Contrast has been linearly scaled for better visualization.

III-C Multiscale video denoising

Multiscale approaches have been shown to both reduce the residual noise and improve the visual quality of the result. Indeed, most denoising algorithms work by processing local regions of the image/video (a patch, a group of patches, the receptive field of a neuron, etc). As a result, these methods fail to remove the lower frequencies of the noise. This results in smooth bumps mostly noticeable in large non-textured areas. Multiscale approaches are able to reduce this artifact by applying the denoising algorithm at different scales.

There are two main approaches for multi-scale denoising in the literature. The first one consists in modifying the denoising algorithm to consider several scales of the image/video. See for instance the multiscale version of the EPLL method [22] proposed by Papyan and Elad [16]. Another approach proposed in [12, 17] considers the denoising algorithm as a black box, applying it as is at multiple scales. The result is then obtained by merging the results of each scale. This has the benefit that it can be applied to any denoising method, without any adaptation needed.

The multiscaler of [12] first creates a pyramid from the noisy input image by applying a downscaling operator D:

v^(s) = D^s v,   s = 0, …, S−1.   (11)

At each scale, the denoising algorithm is applied and yields denoised images û^(s). These are then recomposed into a single multiscale output. The recomposition is recursive. The recursion is initialized at the coarsest level by defining ũ^(S−1) = û^(S−1), and then proceeds as follows:

ũ^(s) = û^(s) + L_θ( U ũ^(s+1) − û^(s) ),   (12)

where U is the upscaling operator, and L_θ is a low-pass filter with cutoff frequency parameterized by θ. This equation substitutes the low frequencies of the single-scale result û^(s) with those of the multiscale solution ũ^(s+1), computed using the coarser scales. The recomposition parameter θ determines which low frequencies are substituted. At one extreme, the low-pass filter lets all frequencies pass, in which case the whole coarse solution is used. At the other extreme, L_θ filters out all frequencies: the solution at the coarser level is discarded, and the output of the recomposition is the single-scale denoising û^(s). In [12] it is found that the optimum is to filter out some of the high frequencies of the coarser level. However, the exact amount depends on the denoiser.
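The following C++ sketch illustrates the recursive recomposition of Eq. (12), with the upscaling and low-pass operators passed as black boxes (for instance the DCT or Lanczos versions described below). The image type, operator signatures and function name are ours, not those of the multiscaler of [12].

// Sketch of the recursive recomposition of Eq. (12).
#include <cstddef>
#include <functional>
#include <vector>

using Image = std::vector<float>;                 // one frame, fixed size per scale
using Operator = std::function<Image(const Image&)>;

// `denoised[s]` is the single-scale result at scale s (s = 0 is the finest).
// `upscale[s]` maps an image at scale s+1 to scale s; `lowpass[s]` is L_theta
// at scale s.
Image recompose(const std::vector<Image>& denoised,
                const std::vector<Operator>& upscale,
                const std::vector<Operator>& lowpass)
{
    int S = (int)denoised.size();
    Image multi = denoised[S - 1];                // coarsest level initializes
    for (int s = S - 2; s >= 0; --s) {
        Image up = upscale[s](multi);             // U * tilde_u^(s+1)
        Image diff(up.size());
        for (size_t i = 0; i < up.size(); ++i)
            diff[i] = up[i] - denoised[s][i];     // U tilde_u^(s+1) - hat_u^(s)
        Image low = lowpass[s](diff);             // keep only the low frequencies
        multi = denoised[s];
        for (size_t i = 0; i < multi.size(); ++i)
            multi[i] += low[i];                   // Eq. (12)
    }
    return multi;
}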

To apply the multiscaler on a video, we apply it spatially, i.e. the temporal dimension is not downscaled. We first create a spatial pyramid of the entire video by creating a pyramid of each frame. We then denoise these videos, and recompose them by applying Eq. (12) to each frame. This is summarized in Algorithm 7.

input : A video v, a list of scales {0, …, S−1}, parameters for VBM3D, the recomposition parameter θ
output : The denoised video û
for each scale s in {0, …, S−1} do
      // Compute the video at the given scale
      for each frame v_t in v do
            v^(s)_t ← v_t at scale s
      // Denoise the video
      û^(s) ← VBM3D(v^(s), parameters)
for each frame index t do
      û_t ← combine the û^(s)_t for s in {0, …, S−1} using Eq. (12)
return û
Algorithm 7: Multi-scale processing

The parameters of the multiscaler are the number of scales, the downsampling ratio, and the recomposition parameter θ. We shall set the downscaling ratio to 2 (the default), and try different values for the number of scales and the recomposition factor. There are different possibilities for the down/upscaling operators and the low-pass filter.

DCT pyramid.

The DCT multiscaler uses a DCT pyramid, which guarantees white Gaussian noise at all scales. The downscaling is performed by computing the DCT transform of the image and keeping only the quadrant corresponding to the lowest frequencies. The upscaling is done by zero-padding in the DCT domain, and the low-pass filtering zeroes out the highest frequencies in the DCT domain. In this case θ represents the ratio of frequencies that are kept: θ = 1 corresponds to an all-pass filter, whereas θ = 0 filters out the whole image.

Laplacian pyramid.

The downscaling and upscaling operators sample the image using a Lanczos kernel:

k_a(x) = sinc(x) sinc(x/a) if |x| < a, and 0 otherwise, where sinc(x) = sin(πx)/(πx).   (13)

We set a = 3. For the downscaling, to reduce aliasing, the image is downsampled using a scaled version of the kernel (as described in [19]); this corresponds to Matlab's imresize scaling function using the lanczos3 interpolation kernel. The low-pass filter used is a Gaussian filter whose width is controlled by the recomposition parameter θ. When θ = 0 we obtain the single-scale output, and θ = 1 if no frequencies from the coarser solutions are discarded.
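For reference, a minimal C++ sketch of the Lanczos kernel of Eq. (13) with a = 3 (the lanczos3 kernel) is given below; the function names are ours.

// Sketch of the Lanczos-3 interpolation kernel of Eq. (13).
#include <cmath>

constexpr double kPi = 3.14159265358979323846;

double sinc(double x)        // normalized sinc: sin(pi x) / (pi x)
{
    if (x == 0.0) return 1.0;
    return std::sin(kPi * x) / (kPi * x);
}

double lanczos3(double x)    // k_3(x) = sinc(x) sinc(x/3) for |x| < 3, 0 outside
{
    const double a = 3.0;
    if (std::fabs(x) >= a) return 0.0;
    return sinc(x) * sinc(x / a);
}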

Figures 5 and 6 show plots of the average PSNR for our seven test sequences obtained with the two multiscalers when varying the number of scales and the recomposition cutoff parameter θ. In each figure, we show the results of the multiscaler applied to the standard VBM3D, and to the one with spatio-temporal patches and guided patch search (VBM3D ST+OF).

Both multiscalers have a positive impact when applied to the standard VBM3D. The gain can be up to 0.3dB for noise 10, 0.5dB for noise 20 and 0.8dB for noise 40. However, when applied to the VBM3D ST+OF version of the algorithm, the multiscaler does not improve the result. In fact, for noise 10 and 20, the best PSNR is attained by the single scale algorithm and the PSNR deteriorates as the cutoff frequency of the low-pass filter is increased (i.e. as more frequency components from the coarse solution are used).

The multiscaler achieves a better denoising of large objects with smooth textures by removing low-frequency noise left by the denoiser. This low-frequency noise is much stronger for the VBM3D denoiser than for the VBM3D ST+OF version. Hence, the improvement provided by the multiscaler is smaller for the latter. This also depends on the characteristics of the sequence. Frames with smooth objects occupying larger areas will benefit from the multiscaler. Yet, the multiscaler introduces artifacts penalizing the PSNR (particularly additional ringing for the DCT multiscaler). These artifacts are not temporally coherent and can therefore be quite noticeable.

Based on these plots we chose the Lanczos3 multiscaler. We use 2 scales when applied to VBM3D ST+OF (i.e. VBM3D ST+OF+MS) and 3 scales when applied to the standard VBM3D (VBM3D MS), in each case with the recomposition factor selected from the plots. Table IV shows the obtained PSNRs. The visual results of VBM3D ST+OF+MS are shown in Figure 4. The impact of the multiscaler can be noticed in the top row (results for pedestrians) as a reduction of the low-frequency noise in the smooth areas of the image. For the other sequences, since they are highly textured, the effect of the multiscaler is subtle. A careful examination reveals some texture loss.

Fig. 5: Effect of the DCT multiscaler for VBM3D without extensions (square markers) and VBM3D ST+OF (round markers). Each plot shows the average PSNR over our seven test sequences obtained when varying the recomposition factor θ for different numbers of scales. From left to right, σ = 10, 20 and 40. The multiscaler has a positive effect on the original VBM3D, but not on the improved VBM3D ST+OF.
Fig. 6: Effect of the Lanczos multiscaler for VBM3D without extensions (square markers) and VBM3D ST+OF (round markers). Each plot shows the average PSNR over our seven test sequences obtained when varying the recomposition factor θ for different numbers of scales. From left to right, σ = 10, 20 and 40. The multiscaler has a positive effect on the original VBM3D, and on the improved VBM3D ST+OF for σ = 40.
Method                       Crowd  Park   Pedestrians  Station  Sunflower  Touchdown  Tractor  Average
σ = 10
VBM3D (ours)                 35.52  34.59  40.65        38.38    39.81      39.01      36.82    37.83
VBM3D MS (DCT)               35.28  34.42  40.73        38.62    40.51      39.23      36.93    37.96
VBM3D MS (Lanczos)           35.48  34.57  40.79        38.59    40.35      39.19      36.93    37.99
VBM3D ST+OF                  35.74  35.04  41.01        40.41    41.91      39.98      38.71    38.97
VBM3D ST+OF+MS (DCT)         35.48  34.74  40.82        39.56    41.16      39.62      38.06    38.49
VBM3D ST+OF+MS (Lanczos)     35.72  35.01  41.05        40.34    41.88      39.97      38.66    38.95
σ = 20
VBM3D (ours)                 32.06  31.12  36.81        35.10    35.95      36.05      32.97    34.30
VBM3D MS (DCT)               31.75  30.92  37.31        35.49    37.22      36.33      33.49    34.64
VBM3D MS (Lanczos)           32.04  31.13  37.29        35.45    36.95      36.31      33.35    34.65
VBM3D ST+OF                  32.48  31.71  37.61        37.02    38.45      37.19      35.18    35.66
VBM3D ST+OF+MS (DCT)         32.08  31.32  37.47        36.27    37.71      36.71      34.51    35.15
VBM3D ST+OF+MS (Lanczos)     32.46  31.68  37.74        36.99    38.45      37.16      35.18    35.67
σ = 40
VBM3D (ours)                 28.39  27.64  32.62        31.80    32.31      33.35      29.38    30.78
VBM3D MS (DCT)               28.31  27.68  33.75        32.42    33.90      33.60      30.27    31.42
VBM3D MS (Lanczos)           28.51  27.78  33.54        32.31    33.55      33.64      30.01    31.33
VBM3D ST+OF                  29.30  28.50  34.21        33.68    35.06      34.47      31.46    32.38
VBM3D ST+OF+MS (DCT)         28.90  28.18  34.25        33.19    34.41      34.04      31.00    32.03
VBM3D ST+OF+MS (Lanczos)     29.30  28.51  34.46        33.70    35.06      34.50      31.56    32.44
TABLE IV: Quantitative denoising results (PSNR) for seven grayscale test sequences from the Derf's Test Media collection for several variants of VBM3D. We highlighted the best performance in black and the second best in brown.

IV Comparison with the state of the art

In Table V, we compare the PSNR of different recent denoising methods with VBM3D, as well as with versions of VBM3D modified using different combinations of the improvements suggested in Sections III-A, III-B and III-C. VBM3D combined with spatio-temporal patches and a patch search guided by an optical flow gives very competitive results even compared to the latest state of the art, especially for higher noise levels. When combined with the other improvements, it seems that multiscaling does not increase the quality of the denoising and can even degrade it in terms of PSNR. Visual examples are shown in Figure 4, where we compare results obtained with the original VBM3D and with our implementation using 3D patches and an optical flow to guide the search region; Figure 7 shows a visual comparison with several state-of-the-art methods.

Method            Crowd  Park   Pedestrians  Station  Sunflower  Touchdown  Tractor  Average
σ = 10
VBM3D [7]         35.65  34.75  40.83        38.93    40.49      39.04      37.01    38.10
BM4D [15]         35.84  34.45  41.15        40.23    40.97      39.78      37.33    38.54
VBM4D [14]        36.05  35.31  40.61        40.85    41.88      39.79      37.73    38.88
SPTWO [6]         36.57  35.87  41.02        41.24    42.84      40.45      38.92    39.56
VNLnet [10]       37.00  36.39  41.96        42.44    43.76      41.05      38.89    40.21
VNLB [2]          37.24  36.48  42.23        42.14    43.70      41.23      40.20    40.57
VBM3D ST+OF+MS    35.72  35.01  41.05        40.34    41.88      39.97      38.66    38.95
σ = 20
VBM3D [7]         32.25  31.25  36.94        35.45    36.46      36.08      33.07    34.50
BM4D [15]         32.37  30.96  37.43        36.71    37.13      36.54      33.53    34.95
VBM4D [14]        32.40  31.60  36.72        36.84    37.78      36.44      33.95    35.10
SPTWO [6]         32.94  32.35  37.01        38.09    38.83      37.55      35.15    35.99
VNLnet [10]       33.40  32.94  38.32        38.49    39.88      37.11      35.23    36.47
VNLB [2]          33.49  32.80  38.61        38.78    39.82      37.47      36.67    36.81
VBM3D ST+OF+MS    32.46  31.68  37.74        36.99    38.45      37.16      35.18    35.67
σ = 40
VBM3D [7]         28.65  27.68  32.81        32.02    32.65      33.52      29.41    30.96
BM4D [15]         29.10  27.82  33.44        32.98    33.06      33.68      29.84    31.42
VBM4D [14]        28.72  27.99  32.62        32.93    33.66      33.68      30.20    31.40
SPTWO [6]         29.02  28.79  31.32        32.37    32.61      31.80      30.61    30.93
VNLnet [10]       29.69  28.29  34.21        33.96    35.12      33.88      31.41    32.51
VNLB [2]          29.88  29.28  34.68        34.65    35.44      34.18      32.58    32.95
VBM3D ST+OF+MS    29.30  28.51  34.46        33.70    35.06      34.50      31.56    32.44
TABLE V: Quantitative denoising results (PSNR) for seven grayscale test sequences from the Derf's Test Media collection for several state-of-the-art video denoising algorithms. We highlighted the best performance in black and the second best in brown.
Fig. 7: Comparison with state-of-the-art video denoising methods. From left to right: result of VBM3D [7], VBM4D [14], VNLB [2], VNLnet [10], VBM3D ST+OF+MS (Lanczos) and the ground truth. Contrast has been linearly scaled for better visualization.

References

  • [1] P. Arias, G. Facciolo, and J.-M. Morel. A comparison of patch-based models in video denoising. In 2018 IEEE 13th Image, Video, and Multidimensional Signal Processing Workshop (IVMSP), pages 1–5, June 2018.
  • [2] P. Arias and J.-M. Morel. Video denoising via empirical bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision, 60(1):70–93, Jan 2018.
  • [3] P. Arias and J.-M. Morel. Kalman filtering of patches for frame recursive video denoising. In 2019 IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops (NTIRE), 2019.
  • [4] J. C. Brailean and A. K. Katsaggelos. Simultaneous recursive displacement estimation and restoration of noisy-blurred image sequences. IEEE Transactions on Image Processing, 4(9):1236–1251, 1995.
  • [5] A. Buades, B. Coll, and J. M. Morel. Denoising image sequences does not require motion estimation. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, pages 70–74, 2005.
  • [6] A. Buades, J.-L. Lisani, and M. Miladinović. Patch-based video denoising with optical flow estimation. IEEE Transactions on Image Processing, 25(6):2573–2586, June 2016.
  • [7] K. Dabov, A. Foi, and K. Egiazarian. Video denoising by sparse 3D transform-domain collaborative filtering. In EUSIPCO, 2007.
  • [8] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Color image denoising via sparse 3d collaborative filtering with grouping constraint in luminance-chrominance space. In Proc. IEEE Int. Conf. Image Process., 2007.
  • [9] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on image processing, 2007.
  • [10] A. Davy, T. Ehret, G. Facciolo, J.-M. Morel, and P. Arias. Non-local video denoising by cnn. arXiv preprint arXiv:1811.12758, 2018.
  • [11] T. Ehret, J.-M. Morel, and P. Arias. Non-local kalman: A recursive video denoising algorithm. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 3204–3208. IEEE, 2018.
  • [12] G. Facciolo, N. Pierazzo, and J. Morel. Conservative scale recomposition for multiscale denoising (the devil is in the high frequency detail). SIAM Journal on Imaging Sciences, 10(3):1603–1626, 2017.
  • [13] C. Liu and W. T. Freeman. A high-quality video denoising algorithm based on reliable motion estimation. In ECCV, pages 706–719, 2010.
  • [14] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian. Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing, 2012.
  • [15] M. Maggioni, V. Katkovnik, K. Egiazarian, and A. Foi. A nonlocal transform-domain filter for volumetric data denoising and reconstruction. IEEE Transactions on Image Processing, 22(1):119–133, 2013.
  • [16] V. Papyan and M. Elad. Multi-scale patch-based image restoration. IEEE Transactions on image processing, 25(1):249–261, 2016.
  • [17] N. Pierazzo, J.-M. Morel, and G. Facciolo. Multi-Scale DCT Denoising. Image Processing On Line, 7:288–308, 2017. https://doi.org/10.5201/ipol.2017.201.
  • [18] M. Protter and M. Elad. Image sequence denoising via sparse and redundant representations. IEEE Transactions on Image Processing, 18(1):27–35, 2009.
  • [19] D. Schumacher. General filtered image rescaling. In D. Kirk, editor, Graphics Gems III (IBM Version), pages 8–16. Morgan Kaufmann, San Francisco, 1992.
  • [20] J. Sánchez Pérez, E. Meinhardt-Llopis, and G. Facciolo. TV-L1 Optical Flow Estimation. Image Processing On Line, 3:137–150, 2013. https://doi.org/10.5201/ipol.2013.26.
  • [21] C. Zach, T. Pock, and H. Bischof. A duality based approach for realtime TV-L1 optical flow. In Joint Pattern Recognition Symposium. Springer, 2007.
  • [22] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479–486, Nov 2011.