VBM3D is an extension to video of the well known image denoising algorithm BM3D, which takes advantage of the sparse representation of stacks of similar patches in a transform domain. The extension is rather straightforward: the similar 2D patches are taken from a spatio-temporal neighborhood which includes neighboring frames. In spite of its simplicity, the algorithm offers a good trade-off between denoising performance and computational complexity. In this work we revisit this method, providing an open-source C++ implementation reproducing the results. A detailed description is given and the choice of parameters is thoroughly discussed. Furthermore, we discuss several extensions of the original algorithm: (1) a multi-scale implementation, (2) the use of 3D patches, (3) the use of optical flow to guide the patch search. These extensions allow to obtain results which are competitive with even the most recent state of the art.READ FULL TEXT VIEW PDF
. The algorithme is designed for additive white Gaussian noise with zero mean and standard deviation, i.e.
where is a position in the video domain (a pixel), is the noisy video and the unknown clean video. The denoising principle of VBM3D (and BM3D) is based on the redundancy of similar patches. Groups of similar 2D patches are assembled in a 3D stack of patches. A separable 3D transform is applied to this stack. The stack is denoised by applying a shrinkage operator to the coefficients in the transformed domain. The algorithm follows four basic steps:
Search for similar patches in the sequence, grouping them in a 3D stacks,
Apply a 3D linear domain transform to the 3D block,
Shrink the transformed coefficients,
Apply the inverse transform,
Aggregate the resulting patches in the video.
The underlying idea here is that, due to the high redundancy of a stack of similar patches, the energy will be concentrated in a few coefficients of the transform, while the noise is spread evenly among all coefficients. This allows to jointly denoise the patches of each stack. The reconstruction of the estimated video is obtained by aggregating for each pixel the estimated patches that contain it. This principle is applied twice. The first time the patches are denoised using a hard threshold in the transformed domain. In the second iteration a Wiener filter is used, with Wiener coefficients computed using the output of the first step as oracle. There exist other adaptations to video and 3D images of this framework: BM4D and VBM4D . These methods stack similar 3D spatio-temporal patches in a 4D stack. The main difference between them is that VBM4D uses motion compensated patches whereas BM4D aims at denoising volumetric images (such as those appearing in medical imaging), where the 3rd dimension is just another spatial dimension. In  however, it was shown that even non-motion compensated 3D patches provide a very good denoising performance.
In this article, we revisit the VBM3D method, providing an open source implementation and we discuss some variants introduced in , such as 3D patches (as in BM4D ), optical flow guided patch search, and a multiscale implementation. In the next section, the algorithm itself is reviewed. In the following sections we consider three possible extensions: multiscale, spatio-temporal patches and motion compensated patch search using an optical flow. Finally the performance of the algorithm is compared against other state of the art algorithms.
The noisy input video denoted consists of frames written for between and . The clean unknown video is denoted by . A patch is a small rectangular piece of the video, for example of size ( pixels width and height, 3 pixels in the temporal dimension). Patches are represented by bold lowercase letters, e.g. . If we need to emphasize the location of the patch we write , where represents the location on the video of the top-left-front pixel of the patch.
For the parameters of the method we will use the same notation as in . The different parameters for the patch search are the number of patches to return, the number of temporal frames that are being searched, the search region for the reference frame , the search region for the other frames , the maximum number of patches to compute for each local search, a correcting factor and a maximum distance threshold . The threshold parameter in the first step is . We also write for the element-wise product. Similar parameters will be used for the hard thresholding first step and for the Wiener filtering second step. Because they can have different values for the different steps, they’ll be differentiated using the subscript and respectively.
VBM3D performs two steps, as its image counterpart BM3D. The first hard-thresholding step computes a a basic estimate which is refined in the Wiener filtering step, yielding the final estimate. Figure 1 represents the global structure of the algorithm. The pseudocode of the core of the algorithm is presented in Algorithm 5.
The groups of similar patches are built by selecting a reference patch and searching around it for its nearest neighbors. This patch search is the main difference between BM3D and VBM3D. The image version of the algorithm searches a square 2D window centered at the reference patch. While the same could be done in a space-time volume for videos, Dabov et al.  propose a predictive searchheuristic to reduce the size of the search window. The idea behind the proposed search is to track patches in the video. Let and be two patches at positions and respectively. The distance between and used to find the nearest neighbors for a given regularizing factor is
This distance regularizes the patch trajectories by favoring non moving patches. Suppose that the reference patch is located at . The goal of the proposed search is to find (at most) patches in a window of frames around frame . The steps are the following:
Find the nearest neighbors to in frame in a square search region of size centered at the location of the reference patch . Let be the set of found patches.
Find the nearest neighbors to in frames . The search region at is the union of square regions centered at positions of the candidates in , where . This is depicted in Figure 2.
Similarly, find nearest neighbors to in frames . This time the search region is the union of squares centered at the positions of the patches in .
Remove the candidates with distance (1) larger than from
the set combining all the candidates computed in each frame.
Keep the best candidates from .
Because the next steps of the algorithm (in particular the domain transforms) require a number of patches that is a power of , only the largest power of smaller or equal to is kept.
The first step processes the image by filtering stacks of similar patches and aggregating them in an output image. The reference patches for the stacks are all patches on a sub-grid of step . For each group there is first an estimation, followed by an aggregation.
For a given reference patch , we first find its similar patches as described in Section II-A. These patches are stacked in a 3D volume , of size . The first spatial slice of this stack is the reference patch, and the remaining ones are the similar patches in ordered by their distance to . A 3D domain transform is first applied to the group of patches followed by a thresholding of the resulting spectrum (excepting the DC component of every patch). The inverse 3D domain transform is then applied to the thresholded coefficients to obtain the estimation, i.e.
where is defined by
In practice the 3D transforms are chosen to be separable, consisting of a 2D spatial transform applied directly on the patches of , typically a bi-orthogonal wavelet transform, followed by an 1D transform along the third dimension of the stack, typically a Haar transform. The pseudocode of the estimation for the first step is presented in Algorithm 3.
An output pixel is estimated several times, since it belongs to several patches and each patch can be estimated multiple times (once for each group it belongs to). To compute the output frame these estimates are aggregated. For a pixel of the frame, this aggregation is performed as
where a patch has a support of the size of the image and is zero everywhere except on the actual location of the patch. The weights used during the aggregation are computed in part during the estimation. The part computed during the estimation is where corresponds to the number of coefficients that have not been thresholded during the estimation of . Each coefficient is properly defined because is always positive as the DC component is never thresholded. The rest of the weight comes from a 2D Kaiser window of size (see Eq. (6)) located on the position of applied to avoid boundary effects from the patch. The Kaiser window of size of parameter is
Finally the (patch) weight is defined by . The pseudocode with the aggregation step can be found in Algorithm 5.
The second step of the algorithm uses the basic estimate computed during the first step for the patch search, and to compute the coefficients of a Wiener shrinkage of the transformed coefficients. The video is processed by building groups of patches around reference patches from a coarse subgrid with step .
Given a reference patch , a group of similar patches selected as described in Section II-A, but using patches extracted from the basic estimate for computing the patch distance. Two sets of patches are extracted: one from the noisy sequence and the other one from the basic estimate at the same locations. Once the sets of candidates, from the noisy sequence and for the basic sequence, have been computed, a Wiener filtering step is applied. The coefficients for the filtering are computed using but it’s which is used for the estimation. Just like the first step, a 3D domain transform is first applied to the group of patches before the application of the Wiener filter. The computation is done following Equation (7).
where on a frequency (with corresponding to the same frequency in the basic estimation) is defined by
In practice the 3D transforms are chosen as a 2D transform applied directly on the patches of and , typically a DCT, followed by an 1D transform along the third dimension of the group, typically a Haar transform. The pseudocode of the estimation for the first step is presented in Algorithm 4.
Just as with the first step, the estimation gives multiple estimates per pixel. Therefore the different estimates are aggregated using the same principle as in Section II-B, defined by Equation (4). The only difference lies in the weights. The weights used during the aggregation are computed in part during the estimation. The part computed during the estimation is which is actually linked to the squared
norm of the vector of coefficients used for the filtering. The rest of the weight come from a 2D Kaiser windowof size applied to avoid boundary effects from the patches. Finally the pixel weight is defined by . The pseudocode with the aggregation step can be found in Algorithm 5.
We now compare the results obtained with our implementation to those obtained from the binaries released by the authors of the original VBM3D 111Available at http://www.cs.tut.fi/~foi/GCF-BM3D/.. Throughout this work we will use a test set of seven grayscale test sequences obtained from the Derf’s Test Media collection222https://media.xiph.org/video/derf/.. The original sequences are of higher resolution and in RGB. They have been downscaled and converted to grayscale.
Table I compares the PSNR obtained by our implementation with the original binaries. Our results below are those of the original implementation. The average gap starts at 0.27dB for and reduces to 0.18dB for . Station and sunflower are the sequences where the gap is larger, reaching 0.68dB for the later at .
A visual comparison of both results is shown in Figure 3, for noise . Both results are visually very similar, with the same type of artifacts. A closer inspection reveals that the result of the original VBM3D is slightly smoother, resulting in less noticeable DCT artifacts and on some textures with decreased contrast (see for instance the grass in the rightmost figure).
Many video denoising methods take advantage of the optical flow to estimate motion in the video. Typically, optical flow is used by video denoising methods that require aggregating information along motion trajectories [4, 13, 6, 11, 3]. These methods require a motion estimate as accurate as possible. Most patch-based methods, on the other hand, do not require such an accurate motion estimate, as they are based on finding similar patches in a 3D search region [5, 7, 14]. These methods either do not use any motion estimate at all, or use a very rough one (e.g. block matching) to guide the search region. However, it has been observed that using optical flow to shape the search region can still be beneficial for patch based methods [2, 1], as it allows to find better matches.
For a pair of two images A and B, the optical flow aims at finding a vector field such that any point of the image domain solves
The displacement vector can be sub-pixel, in which case the image
needs to be interpolated. We decided to use the TVL1 optical flow method, in particular the implementation provided in . We compute the optical flow in a downscaled version of the video by a factor of , and scale it back to the original resolution. This reduces the running time and the impact of the noise while still having a reasonable precision, as shown in .
We add the optical flow to VBM3D as a guide, the same way it is done for VNLB . The spatio-temporal search region is defined as two sequences of square windows of size (plus the window corresponding to the reference frame), whose centers follow the motion trajectory of the reference patch. The trajectory is estimated using the forward and backward optical flow. We use the center corresponding to the position of the reference patch propagated in previous, respectively following, frames using the backward, respectively forward, optical flow. The forward half of the trajectory passing through is computed by integrating the forward optical flow (also indexed by time on top of the spatial position) as follows:
where and denote the round and floor operators. The backward half of the trajectory is defined analogously using the backward optical flow
This also means that only one center needs to be tracked, in contrast to the required by a regular VBM3D search. Parameters are kept the same as in the original algorithm. The guided version of the search is summarized in Algorithm 6.
We also tested setting the parameter to zero for both steps since one can assume that it would be redundant with the regularization of the patch search offered by the optical flow. PSNR results are shown in Table II. Using the optical flow as a guide greatly improves the quality of the denoising. The non-zero yields better results both with and without the optical flow guide, thus we decided to keep it in our final version.
Figure 4 shows results obtained with the different extensions that will be described in this section. The first column corresponds to VBM3D with the default parameters, and the 3rd column to the one using an optical flow guided search region (denoted “VBM3D OF”). Guiding the patch search allows to recover more details. This is because the tracking of patches is more robust to motion than the block-matching suggested for the original VBM3D and therefore provides better matches. Since the optical flow is used only as guide for the center of the search region, computing the optical flow from the noisy data is not a problem as it is not required to be very precise.
The patch similarity is determined based on the (squared) Euclidean distance between patches. Due to the noise, the Euclidean distance follows a non-central distribution, with variance
where are the noisy versions of the patches , and denotes the number of pixels in the patch. The noise in the distance can be reduced by considering larger patches. However, increasing the size of the patch also increases the distances between patches and reduces the likelihood of finding similar ones. The additional temporal dimension in a spatio-temporal patch allows to increase the number of pixels in the patch, without increasing its spatial size. Due to the high redundancy of the video in the temporal dimension, increasing the temporal size of the patch causes a much lower increase in the patch distances.
When the motion is known or can be estimated, then it is natural to consider motion compensated spatio-temporal patches (see for instance [14, 6]). Alternatively, rectangular spatio-temporal patches with no motion compensation have been also used [18, 2]. For more complex types of motion, using rectangular spatio-temporal patches will result in a larger variability in the set of nearest neighbors of a given patch, due to the fact that both the spatial texture and the motion pattern may vary. At least in principle, better results should be obtained using motion compensation. However, in practice, for higher levels of noise the bad quality of the estimated motion can undermine the final result.
The principle of BM3D has been applied to 3D patches with and without motion compensation. VBM4D, introduced in , uses motion compensated spatio-temporal patches for video denoising (the “V” stands for video). The motion is estimated using block matching. BM4D uses cubic patches without motion compensation (of size or ), aiming at filtering volumetric images . In  the authors compare the performance of both VBM4D and BM4D on video denoising concluding that VBM4D was the best video filtering strategy.
However, in [2, 1] it is shown that rectangular spatio-temporal patches with a temporal size of only two frames improve the denoising quality and still provide higher temporal consistency than a 2D patch. Based on those results, in this work we evaluate rectangular spatio-temporal patches of size in the first step and in the second (i.e we keep the spatial patch size).
In Table III we compare the quantitative results obtained by using spatial and spatio-temporal patches (denoted by “ST” in the table). We also consider the effect of the optical-flow-guided patch search, indicated as “OF”.
The first four columns of Figure 4 show results with/out motion-compensated search and spatio-temporal patches. From a qualitative point of view, using spatio-temporal patches provides better temporal consistency. In addition, the patch distance is more reliable since the number of pixels in the patches is doubled. This help retrieve details and texture for regions with a simple motion (e.g. translational). For low noise levels, the effect of these 3d patches is mixed. While the results seem more consistent temporally, they are blurrier for sequences with complex motions. This explains the drop in PSNR observed for pedestrians, sunflower and touchdown between VBM3D and VBM3D ST for . As the noise level increases this detail loss is out-weighted by the increased robustness to noise of the patch matching.
When spatio-temporal patches are used in conjunction with the optical-flow-guided search, their positive impact is magnified. Although the patches are not themselves motion-compensated, having a motion-compensated search region helps find better matches, even in sequences with more complex motion patterns. The motion-compensated search region also improves the temporal consistency, although to a lesser extent than the spatio-temporal patches. The best result, both in terms of PSNR and temporal consistency, is obtained when both strategies are used together (VBM3D ST+OF).
Multiscale approaches have shown to both reduce the residual noise but also improve the visual quality of the result. Indeed, most denoising algorithms work by processing local regions of the image/video (a patch, a group of patches, the receptive field of a neuron, etc). As a result, these methods fail to remove the lower frequencies of the noise. This results in smooth bumps mostly noticeable in large non-textured areas. Multiscale approaches are able to reduce this artifact by applying the denoising algorithm at different scales.
There are two main approaches for multi-scale denoising in the literature. The first one consists in modifying the denoising algorithm to consider several scales of the image/video. See for instance the multiscale version of the EPLL method  proposed by Papyan and Elad . Another approach proposed in [12, 17] considers the denoising algorithm as a black box, applying it as is at multiple scales. The result is then obtained by merging the results of each scale. This has the benefit that it can be applied to any denoising method, without any adaptation needed.
The multiscaler of  first creates a pyramid from the noisy input image by applying a downscaling operator
At each scale, the denoising algorithm is applied and yields denoised images . These are then recomposed into a single multiscale output . The recomposition is recursive. The recursion is initialized at the coarsest level by defining , and then proceeds as follows:
where is the upscaling operator, and is a low-pass filter with cutoff frequency parameterized by This equation substitutes the low frequencies of the single-scale result with the multiscale solution , computed using the coarser scales . The recomposition parameter determines which low frequencies are substituted. In one extreme, the low-pass filter lets all frequencies pass, in which case the whole coarse solution is used. At the other end, filters out all frequencies: the solution at the coarser level is discarded, and the output of the recomposition is the single scale denoising . In  it is found that the optimum is to filter out some of the high frequencies of the coarser level . However, the exact amount depends on the denoiser.
To apply the multiscaler on a video, we apply it spatially, i.e. the temporal dimension is not downscaled. We first create a spatial pyramid of the entire video by creating a pyramid of each frame. We then denoise these videos, and recompose them by applying Eq. (12) to each frame. This is summarized in Algorithm 7.
The parameters of the multiscaler are the number of scales, the downsampling ratio and the recomposition parameter . We shall set the downscaling ratio to 2 (the default), and try different values for the number of scales and the recomposition factor. There are different possibilities for the down/upscaling operators and the low-pass filter.
The DCT multiscaler uses a DCT pyramid which guarantees white Gaussian noise at all scales. The downscaling is performed by computing the DCT transform of the image and keeping only the quadrant of the image corresponding to the lowest frequencies. The upscaling is done by zero-padding in the DCT domain, and the low-pass filtering zeroes out the highest frequencies in the DCT domain. In this caserepresents the ratio of frequencies that are left: corresponds to an all-pass filter, where as filters out all the image.
The downscaling and upscaling operators samples the image using a Lanczos kernel:
We set . For the downscaling, to reduce aliasing, the image is downsampled using a scaled version of the kernel (as described in ).333This corresponds to Matlab’s imresize scaling function using the lanczos3 interpolation kernel. The low pass filter used is the Gaussian filter of width . When we obtain the single scale output, and if no frequencies from the coarser solutions are discarded.
Figures 5 and 6 show plots of the average PSNR for our seven test sequences obtained with the two multiscalers varying the number of scales and the recomposition cutoff parameter . In each figure, we show the results of the multiscaler applied to the standard VBM3D, and to the one with statio-temporal patches and guided patch search (VBM3D ST+OF).
Both multiscalers have a positive impact when applied to the standard VBM3D. The gain can be up to 0.3dB for noise 10, 0.5dB for noise 20 and 0.8dB for noise 40. However, when applied to the VBM3D ST+OF version of the algorithm, the multiscaler does not improve the result. In fact, for noise 10 and 20, the best PSNR is attained by the single scale algorithm and the PSNR deteriorates as the cutoff frequency of the low-pass filter is increased (i.e. as more frequency components from the coarse solution are used).
The multiscaler achieves a better denoising of large objects with smooth textures by removing low-frequency noise left by the denoiser. This low-frequency noise is much stronger for the VBM3D denoiser than for the VBM3D ST+OF version. Hence, the improvement provided by the multiscaler is smaller for the latter. This also depends on the characteristics of the sequence. Frames with smooth objects occupying larger areas will benefit from the multiscaler. Yet, the multiscaler introduces artifacts penalizing the PSNR (particularly additional ringing for the DCT multiscaler). These artifacts are not temporally coherent and can therefore be quite noticeable.
Based on these plots we chose to select the Lanczos3 multiscaler. We use 2 scales and a recomposition factor when applied to VBM3D ST+OF (i.e. VBM3D ST+OF+MS) and 3 scales with recomposition factor when applied to the standard VBM3D (VBM3D MS). Table IV shows the obtained PSNRs. The visual results of VBM3D ST+OF+MS are shown in Figure 4. The impact of the multiscaler can be noticed in the top row (results for pedestrian) as a reduction of the low-frequency noise in the smooth areas in the image. For the other sequences, since they are highly textured, the effect of the multiscaler is subtle. A careful examination reveals some texture loss.
|VBM3D MS (DCT)||35.28||34.42||40.73||38.62||40.51||39.23||36.93||37.96|
|VBM3D MS (Lanczos)||35.48||34.57||40.79||38.59||40.35||39.19||36.93||37.99|
|VBM3D ST+OF+MS (DCT)||35.48||34.74||40.82||39.56||41.16||39.62||38.06||38.49|
|VBM3D ST+OF+MS (Lanczos)||35.72||35.01||41.05||40.34||41.88||39.97||38.66||38.95|
|VBM3D MS (DCT)||31.75||30.92||37.31||35.49||37.22||36.33||33.49||34.64|
|VBM3D MS (Lanczos)||32.04||31.13||37.29||35.45||36.95||36.31||33.35||34.65|
|VBM3D ST+OF+MS (DCT)||32.08||31.32||37.47||36.27||37.71||36.71||34.51||35.15|
|VBM3D ST+OF+MS (Lanczos)||32.46||31.68||37.74||36.99||38.45||37.16||35.18||35.67|
|VBM3D MS (DCT)||28.31||27.68||33.75||32.42||33.90||33.60||30.27||31.42|
|VBM3D MS (Lanczos)||28.51||27.78||33.54||32.31||33.55||33.64||30.01||31.33|
|VBM3D ST+OF+MS (DCT)||28.90||28.18||34.25||33.19||34.41||34.04||31.00||32.03|
|VBM3D ST+OF+MS (Lanczos)||29.30||28.51||34.46||33.70||35.06||34.50||31.56||32.44|
In Table V, we compare the PSNR of different recent denoising methods with VBM3D as well as with versions of VBM3D modified using different combinations of the improvements suggested in Sections III-C, III-B, III-A. VBM3D combined with spatio-temporal patches and a patch-search guided with an optical flow gives very competitive results even compared to the latest state of the art, and especially for higher noises. When combined with the other improvements, it seems that multiscaling does not increase the quality of the denoising and can even degrade it in terms of PSNR. Visual examples are shown in Figures 4, where we compare results obtained with the original VBM3D with our implementation, using 3D patches and an optical flow to guide the search region.
2019 IEEE Conf. on Comp. Vision and Pattern Recognition (CVPR) Workshops (NTIRE), 2019.