Exploiting Temporal Attention Features for Effective Denoising in Videos

08/05/2020 · Aryansh Omray, et al. · Indian Institute of Technology

Video denoising has significant applications in diverse domains of computer vision, such as video-based object localization, text detection, and several others. An image denoising approach applied to video denoising results in flickering due to ignoring the temporal aspects of video frames. The proposed method makes use of the temporal as well as the spatial characteristics of video frames to form a two-stage denoising pipeline. Each stage uses a channel-wise attention mechanism to forward the encoder signal to the decoder side. The Attention Block used here is based on soft attention to rank the filters for effective learning. A key advantage of our approach is that it does not require prior information related to the amount of noise present in the video. Hence, it is quite suitable for application in real-life scenarios. We train the model on a large set of noisy videos along with their ground-truth. Experimental analysis shows that our approach performs denoising effectively and also surpasses existing methods in terms of efficiency and PSNR/SSIM metrics. In addition to this, we construct a new dataset for training video denoising models and also share the trained model online for further comparative studies.


1 Introduction

Though advances in sensor hardware technology have dramatically increased the quality of images and videos over the past few years, the inherent noise in images and videos remains a problem. These effects generally arise from low photon counts and the physical limitations of the imaging sensors found in DSLRs, smartphones, medical imaging devices, digital telescopes, etc. Deep learning based denoising [28, 27] assumes that the noise is additive in nature and involves building models that are robust to random noise perturbations. The convolutional neural network described later in this paper makes use of residual connections [11] to exploit the additive nature of noise. Traditional denoising methods that rely on standard image processing operations are threshold-based and hence not suitable for real-life scenarios. In contrast, algorithms employing deep learning are more robust to varying noise patterns, work efficiently, and can help achieve real-time video denoising. Unlike image denoising, which deals only with the spatial properties of an image, video denoising methods also need to consider the temporal aspects of the video frames. The major contributions of the paper are:

  • We propose a deep neural network-based video denoising approach that can perform robustly against varying levels of noise.

  • We preserve correspondences between adjacent frames by incorporating an attention mechanism in our network. This helps in effective denoising of the video frames by eliminating the flickering effect that is common in most frame-level denoising approaches.

  • Our approach improves upon the existing denoising techniques in terms of the PSNR and SSIM values, and also has a lower response time, which makes it more suitable for practical use.

  • We construct a new dataset for training video denoising models, and make the data publicly available for comparative studies.

2 Related Work

We discuss recent advances in the fields of image and video denoising in the following two subsections.

2.1 Image Denoising

Initial methods in the domain of image denoising deployed spatial filtering techniques such as min and max filters, which target the lowest- and highest-intensity regions of a digital image. Min filters serve well in the presence of salt noise [2], whereas max filters are suitable for pepper noise [2]. Similarly, median filters [23] were developed, which work well for both salt and pepper noise [2]. Gaussian filters [9] make use of the property that the Fourier transform of a Gaussian is also a Gaussian; using this property, any signal can be transformed with the Fast Fourier Transform to perform faster denoising. Among other filter-based methods, bilateral filters [29] and the non-local means filter [3] are also quite popular. A common problem with filter-based denoising methods is that they are suitable for only a few specific types of noise. Further developments address image denoising under real noise; one interesting approach, a variant of the non-local means filter, is non-local Bayesian image denoising [3], which gives good results. Nevertheless, generalization remains a significant issue across all these methods. With the advent of deep learning, it was realized that denoising results on metrics like PSNR [13] could be further improved by applying deep learning-based denoising methods.
Different types of deep convolutional networks such as DnCNN [36], FFDNet [37], and MWCNN [21] have been developed to perform image denoising. DnCNN [36] employs a feed-forward CNN using residual learning [11] in addition to batch normalization [15]. FFDNet [37] downsamples input images into sub-images before passing them to the model for non-linear mapping, and upsamples the output into the denoised image. However, most of the above approaches still rest on unrealistic assumptions, e.g., that the noise distribution is Gaussian and that the noise is additive. As an improvement, a few blind image denoising approaches were developed, such as [6], which uses a generative adversarial network for blind denoising: it extracts a noise block from the image and adds it to the original image, which is then passed through a convolutional neural network to estimate the noise. This method provides encouraging results, but it still considers the noise to be additive. More recent approaches to image denoising employ U-Net [26] as their core architecture and have shown promising results. Very recently, a few approaches have been developed for challenging situations such as denoising in extreme low light [5, 30]. These networks also employ U-Net as their main architecture for spatial denoising [38], but introduce novelty in the training loss function. For example, in [30] the loss function is computed from a weighted average of PSNR [13], SSIM [19], edge loss [35], and mean squared loss [16]. The successful application of deep learning to image denoising motivates further research in the video denoising domain.

2.2 Video Denoising

The task of video denoising differs from image denoising in that both spatial and temporal aspects need to be considered. Applying image denoising techniques to videos in a frame-by-frame manner produces severe inconsistencies across resulting frames, known as flickering. One of the first approaches to tackle noise in video signals was VBM4D [22], which uses non-local grouping and collaborative filtering by stacking multiple video frames along a fourth dimension, whereas VBM3D [20] uses a 3D structure for video denoising. This 4D structure is useful in modeling both temporal and spectral correlation to eliminate the effects of flickering. Another similar extension of a non-local Bayesian image denoising algorithm to video denoising is the work in [1]. However, these approaches can handle only specific types of noise and fail to generalize across noise types. Moreover, they suffer from high processing times, making them unsuitable for use in video cameras. Given these limitations, deep learning has shown far better results in temporal denoising with low processing times, making it suitable for real-time denoising.
Approaches such as DVDnet [27] and FastDVDnet [28] focus on eliminating additive white Gaussian noise (AWGN) by employing residual learning, which leads to faster convergence of the networks. Here, multiple U-Nets [26] are used for spatial and temporal denoising. The DVDnet [27] algorithm is about 25 times slower than [28] due to its use of optical-flow estimation [33] for establishing temporal correlations between frames. Because these approaches assume the noise to be additive white Gaussian, they do not work well in real-noise scenarios. Some work involving a recurrent denoising autoencoder [4] has also been done: Gaussian noise is added to a clean image, and the denoising autoencoder [4] estimates the clean denoised image using score matching. To maintain temporal stability in denoising autoencoders [31], LSTM [12] networks have been used; the architecture employs an encoder-decoder structure in which the successive outputs of the encoder are processed to maintain temporal correlations.
Some research on blind denoising addresses the noise-generalization problem, e.g., ViDeNN [7] for deep blind video denoising and [8], which has been seen to work well for different types of noise. However, we observe that, on average, the work in [8] shows a lower PSNR value compared to approaches tuned purely for additive Gaussian noise. More recently, kernel-predicting convolutional neural networks have been used for denoising [24, 32]. These networks predict different weights for each pixel, from which the denoised frame is generated. Such approaches are conceptually based on conventional filter-based image denoising but use deep learning to learn the weight vectors, while using separate networks for temporal and spatial denoising. The use of an asymmetric loss function [32] mitigates the problems associated with this approach.
Weighing the advantages and disadvantages of these approaches, we now present our video denoising approach, which performs satisfactorily for different types and levels of noise. Our approach employs a two-stage architecture, as shown in Fig. 1, which follows FastDVDnet [28], with a residual connection [11] between the mid frame and the final output. Each Attention U-Net in the architecture, as shown in Fig. 2, is based on the U-Net [26] architecture with the concatenation of filters replaced by a channel-wise attention mechanism.

3 Dataset

The training dataset comprises about 60 videos taken from various sources, maintaining low compression rates and high picture quality across all videos. All videos were digitally compressed in MP4 format when downloaded from their sources and satisfy a minimum resolution requirement. The videos were captured at about 30 frames per second using high-quality digital cameras; their lengths vary, with most being of comparable duration, and converting them into individual frames yields a large pool of training frames. The videos in the training dataset span different lighting conditions to help the model generalize and give better results. While some of the videos have static or slow-moving objects, some contain fast-moving objects, and others are regular videos.

To prepare the dataset for training, Gaussian noise was added to the individual frames of each video. The noise was sampled randomly from a Gaussian distribution and added to each frame during preprocessing. Let x denote the original frame, y the noisy frame, and n Gaussian noise of mean μ and standard deviation σ; then,

    y = x + n        (1)
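As a concrete illustration, the noise-injection step of Eqn. (1) can be sketched as below. This is a minimal sketch, assuming frames are stored as float arrays in [0, 255]; the sampled sigma range is our assumption (mirroring the test-time sigmas of Section 5), not a value stated in the paper.

```python
# Minimal sketch of the dataset noise-injection step (Eqn. 1).
import numpy as np

def add_gaussian_noise(frame: np.ndarray, sigma: float, mu: float = 0.0) -> np.ndarray:
    """Return y = x + n, with n ~ N(mu, sigma^2) sampled per pixel."""
    noise = np.random.normal(loc=mu, scale=sigma, size=frame.shape)
    return np.clip(frame + noise, 0.0, 255.0).astype(np.float32)

# Example: corrupt a frame with a randomly chosen noise level.
frame = np.random.rand(480, 854, 3).astype(np.float32) * 255.0
sigma = np.random.uniform(10.0, 40.0)   # hypothetical range, matching the test sigmas
noisy = add_gaussian_noise(frame, sigma)
```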

4 Architecture

The architecture is designed to account for both the spatial and temporal aspects of the videos and is trained to minimize flickering. This section discusses the architecture in detail at each of its levels.

4.1 The Two-stage Pipeline

The convolutional neural network for denoising videos is designed as a two-stage pipeline, as shown in Fig. 1, which takes care of the temporal aspects of the videos to reduce flickering. The two-stage design follows the FastDVDnet [28] architecture. However, the network presented in this paper uses a residual connection [11] between the initial mid-input frame and the output frame, instead of one in each denoising block as in FastDVDnet [28]. Using the residual connection [11] only between the mid input and the output gives the rest of the network the ability to model the noise directly, leading to faster convergence; it also ensures that the network learns to distinguish noise from video content rather than overfitting on video frames. For each forward pass, a total of five frames are taken as input, and each set of three consecutive frames is passed into an Attention U-Net at the first stage, so that the individual Attention U-Nets exploit both the temporal and the spatial aspects of the frames. The first-stage Attention U-Nets share their parameters, which reduces the parameter count and thereby saves memory during training. Moreover, since each first-stage Attention U-Net must process its input frames in the same way, i.e., extract the temporal and spatial features of the frames and model the noise distribution, sharing parameters makes more sense than maintaining three separate networks. The second stage takes the three outputs of the first stage and passes them into another Attention U-Net. The output is then added to the middle frame via the residual connection [11], as discussed earlier, to obtain the denoised frame. A minimal sketch of this wiring is given after Fig. 1.

Figure 1: A high-level overview of the entire architecture.
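To make the two-stage wiring concrete, below is a minimal PyTorch sketch under our reading of Fig. 1. TinyBlock is a hypothetical stand-in for the Attention U-Net of Sec. 4.2 (kept tiny so the example is self-contained), and the frame size in the usage example is arbitrary.

```python
# Minimal sketch of the two-stage pipeline of Fig. 1.
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    """Stand-in for an Attention U-Net: maps 3 stacked RGB frames -> 1 frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(9, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )
    def forward(self, f0, f1, f2):
        return self.net(torch.cat([f0, f1, f2], dim=1))

class TwoStagePipeline(nn.Module):
    def __init__(self):
        super().__init__()
        self.stage1 = TinyBlock()   # shared weights across the three triplets
        self.stage2 = TinyBlock()
    def forward(self, frames):
        # frames: list of 5 consecutive noisy frames, each (N, 3, H, W)
        f0, f1, f2, f3, f4 = frames
        d0 = self.stage1(f0, f1, f2)   # overlapping triplets through the
        d1 = self.stage1(f1, f2, f3)   # shared first-stage network
        d2 = self.stage1(f2, f3, f4)
        out = self.stage2(d0, d1, d2)
        return f2 + out                # residual connection to the mid frame

# Usage: denoise the middle of five 96x96 frames.
frames = [torch.randn(1, 3, 96, 96) for _ in range(5)]
denoised_mid = TwoStagePipeline()(frames)
```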

4.2 Attention U-Net

The design of the Attention U-Net, shown in Fig. 2, is based on the U-Net [26] architecture, except that attention-based blocks connect corresponding layers of the encoder and decoder instead of the usual encoder-to-decoder concatenation. Two such Attention Blocks are added to each Attention U-Net, at the corresponding levels between the encoder and decoder parts. Each Attention U-Net takes three frames as input, which are stacked together along the channel dimension and then passed through several convolutional layers, as shown in Fig. 2. The network uses max-pooling layers to downsample the input features and deconvolutional layers as the upsampling mechanism. After each downsampling and before each upsampling operation, on both the encoder and decoder sides, the layers are connected through an Attention Block that acts as a channel-wise attention mechanism. Each convolutional layer is followed by a Batch Normalization [15] layer and then a ReLU [34] activation layer for non-linearity, except for the final layer and the deconvolutional layers. The maximum number of filters at any stage is kept below the 1024 used in the U-Net [26] architecture, to reduce both the number of parameters and the training time. We observe that raising the number of filters to 1024 has little impact on performance but increases training time and memory requirements significantly. A compact sketch of this design follows Fig. 2.

Figure 2: Detailed view of a single network in the architecture.
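A compact, hedged sketch of this design follows. It shows one encoder/decoder level instead of the full depth, with illustrative layer widths (the paper does not state them). Since the paper rules out both concatenation and plain addition of encoder features, the ChannelGate here applies encoder-derived channel weights to the decoder stream; this is one plausible reading, not the authors' confirmed design, and the full attention block is sketched in Sec. 4.3.

```python
# Compact sketch of the Attention U-Net of Fig. 2 (one level shown).
import torch
import torch.nn as nn

def conv_bn_relu(cin, cout):
    # Conv -> BatchNorm -> ReLU, as described for the encoder/decoder layers.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                         nn.BatchNorm2d(cout), nn.ReLU())

class ChannelGate(nn.Module):
    """Simplified channel attention on the skip path: weights computed from
    the encoder features rescale the decoder features (one plausible reading)."""
    def __init__(self, ch):
        super().__init__()
        self.fc = nn.Linear(ch, ch)
    def forward(self, skip, x):
        w = torch.sigmoid(self.fc(skip.mean(dim=(2, 3))))   # (N, C) channel weights
        return x * w[:, :, None, None]                      # rescale decoder channels

class AttentionUNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_bn_relu(9, 32)    # 3 stacked RGB frames -> 9 channels
        self.enc2 = conv_bn_relu(32, 64)
        self.pool = nn.MaxPool2d(2)        # downsampling
        self.mid  = conv_bn_relu(64, 64)
        self.up   = nn.ConvTranspose2d(64, 64, 2, stride=2)  # upsampling, no BN/ReLU
        self.att  = ChannelGate(64)        # attention block on the skip path
        self.dec  = conv_bn_relu(64, 32)
        self.out  = nn.Conv2d(32, 3, 3, padding=1)           # final layer, no BN/ReLU
    def forward(self, f0, f1, f2):
        x = torch.cat([f0, f1, f2], dim=1)  # stack along the channel dimension
        e = self.enc2(self.enc1(x))
        m = self.mid(self.pool(e))
        d = self.att(e, self.up(m))         # encoder signal forwarded via attention
        return self.out(self.dec(d))
```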

4.3 Attention Block

The Attention Block used in the architecture acts as a soft-attention mechanism that biases the gradients toward the most informative channels among the feature channels, and is based on Squeeze-and-Excitation blocks [14]. The use of Squeeze-and-Excitation blocks helps the network provide feedback to the layers in the decoder part of the Attention U-Net about the most important features while discarding the less important ones. Since the Attention U-Net neither concatenates encoder features to the decoder side, as in U-Net [26], nor adds them, as in FastDVDnet [28], the model needs a mechanism to forward the encoder signals to the decoder side for passage of the encoder information and effective feature extraction. The use of attention blocks also reduces the number of parameters of the network compared to U-Net [26]. Let x and y be the input and output of the attention block respectively, W1 and W2 the weight matrices, b1 and b2 the bias terms of the fully-connected layers, and F_avg the average pooling function shown in Eqn. (3), where δ denotes the ReLU activation and σ the sigmoid function. It may be noted that the same function is used for all filters instead of one for each filter:

    y = x · σ(W2 · δ(W1 · F_avg(x) + b1) + b2)               (2)
    F_avg(x) = (1 / (H × W)) Σᵢ Σⱼ x(i, j)                   (3)

A minimal sketch of this block is given after Fig. 3.
Figure 3: Overview of the Attention Mechanism used in the architecture.
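Under the Squeeze-and-Excitation reading of Eqns. (2)-(3), a minimal PyTorch sketch of the Attention Block might look as follows. The reduction ratio r is our assumption, borrowed from the usual SE design [14]; it is not a value given in the paper.

```python
# Sketch of the Attention Block: global average pooling, two fully-connected
# layers with a ReLU in between, and a sigmoid yielding one weight per channel.
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    def __init__(self, channels: int, r: int = 4):   # r: assumed reduction ratio
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // r)   # W1, b1
        self.fc2 = nn.Linear(channels // r, channels)   # W2, b2
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = x.mean(dim=(2, 3))                                 # Eqn. (3): F_avg over H x W
        s = torch.sigmoid(self.fc2(torch.relu(self.fc1(z))))   # Eqn. (2): sigma(W2 delta(W1 z + b1) + b2)
        return x * s[:, :, None, None]                         # channel-wise rescaling of x
```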

5 Results and Experiments

We implemented our algorithm in the PyTorch [25] framework using the Python programming language. The model is trained for a total of 100 epochs with a batch size of one. The network is trained by minimizing a reconstruction loss with the Adam [18] optimizer; the learning rate is kept fixed for the first fifty epochs and changed for the remaining fifty. The weights of the networks are initialized using the Xavier [10] method. The images extracted from the videos are randomly cropped to introduce data augmentation while avoiding overfitting. We trained our model on an Nvidia V100 GPU with 16 GB of memory. A hedged sketch of this training setup is given below.
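This sketch reuses the TwoStagePipeline from the Sec. 4.1 sketch and a one-batch synthetic loader so it runs standalone. The loss choice, learning rates, and crop size below are placeholders (hypothetical), since this version of the paper does not state them; only the 50 + 50 epoch split, Adam, Xavier initialization, and batch size 1 come from the text.

```python
# Hedged sketch of the training setup.
import torch
import torch.nn as nn

model = TwoStagePipeline()  # defined in the Sec. 4.1 sketch

def init_xavier(m):
    # Xavier initialization [10] for conv/deconv/linear weights.
    if isinstance(m, (nn.Conv2d, nn.ConvTranspose2d, nn.Linear)):
        nn.init.xavier_uniform_(m.weight)
model.apply(init_xavier)

criterion = nn.MSELoss()                                    # placeholder loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # placeholder LR
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

# One synthetic batch: five random-cropped noisy frames plus the clean mid frame.
loader = [([torch.randn(1, 3, 96, 96) for _ in range(5)],
           torch.randn(1, 3, 96, 96))]

for epoch in range(100):            # fifty epochs at each learning rate
    for noisy_frames, clean_mid in loader:
        optimizer.zero_grad()
        loss = criterion(model(noisy_frames), clean_mid)
        loss.backward()
        optimizer.step()
    scheduler.step()
```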

To evaluate the proposed approach and compare its performance with existing denoising techniques, we use the DAVIS [17] dataset, which consists of 30 high-quality videos. First, we add noise to these videos following a method similar to the one discussed in Section 3 for training-data generation. Testing is done on a system with an Nvidia GTX 1050 GPU with 4 GB of memory. For this, the video frames are resized into fixed-size patches, and the first 50 frames of each of the 30 videos are used for evaluation. In Fig. 4, we present five consecutive frames of a noisy video (with σ = 20) in the first row, and the corresponding denoised frames generated by the proposed approach in the second row. It can be visually observed from the figure that each input frame has indeed been properly denoised.

Figure 4: Frame-by-frame denoising results at σ = 20. Noisy input frames are shown in the first row; denoised output frames are shown in the second row.
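A hedged sketch of the metric computation used in this evaluation follows, assuming scikit-image (version ≥ 0.19 for the channel_axis argument); the paper does not specify which PSNR/SSIM implementation it uses.

```python
# Sketch of the evaluation protocol: average PSNR/SSIM over denoised frames.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(clean_frames, denoised_frames):
    """clean_frames, denoised_frames: lists of HxWx3 uint8 arrays."""
    psnrs, ssims = [], []
    for gt, dn in zip(clean_frames, denoised_frames):
        psnrs.append(peak_signal_noise_ratio(gt, dn, data_range=255))
        ssims.append(structural_similarity(gt, dn, channel_axis=2, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Usage with synthetic stand-in data (50 frames, sigma = 20 noise).
clean = [np.random.randint(0, 256, (96, 96, 3), dtype=np.uint8) for _ in range(50)]
noisy = [np.clip(f + np.random.normal(0, 20, f.shape), 0, 255).astype(np.uint8)
         for f in clean]
print(evaluate(clean, noisy))
```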

Table 1 presents a comparative study of our work with three other approaches, namely DVDnet [27], FastDVDnet [28], and ViDeNN [7], in terms of the average PSNR [13] and SSIM [19] metrics computed over the 50 test frames of each video. Results are shown for various degrees of noise added to the test videos, i.e., for σ ∈ {10, 20, 30, 40}.

Method            σ = 10       σ = 20       σ = 30       σ = 40
DVDnet [27]       32.66/0.81   29.18/0.51   28.65/0.43   27.94/0.38
FastDVDnet [28]   33.47/0.81   30.01/0.48   29.54/0.41   28.78/0.31
ViDeNN [7]        34.24/0.78   30.19/0.49   29.07/0.35   28.59/0.27
Ours              32.59/0.88   31.76/0.70   29.97/0.63   28.79/0.56

Table 1: Average PSNR/SSIM values obtained from the different methods used in the comparative study. The best result for each noise level is shown in bold.

It can be seen that, on average, our approach outperforms the other denoising techniques in terms of PSNR and SSIM values across the different noise levels.

We also report the average execution time of each method, i.e., the time required for a forward pass through the network, excluding each method's data-processing time. The execution times are computed on fixed-size inputs at test time. We observe that our method is significantly more time-efficient: DVDnet [27] takes around 5.1 seconds (including its flow-estimation step), FastDVDnet [28] around 0.197 seconds, and ViDeNN [7] around 0.107 seconds, whereas our approach has a response time of only 0.031 seconds.

The qualitative results of the various denoising techniques in the comparative study are shown in Fig. 5 on four sample images for the different noise levels, i.e., for σ ∈ {10, 20, 30, 40}. The first row shows the ground truth, and the following rows show the noisy input images and the results of DVDnet [27], FastDVDnet [28], ViDeNN [7], and our approach, respectively.

Figure 5: Comparative study of the results of the different denoising methods for varying levels of noise.

From the results shown in Fig. 5 and Table 1, it can be observed that our approach performs better at all noise levels except σ = 10 on the PSNR [13] metric, which can be attributed to the relative scarcity of training videos with noise at that level. However, it should also be noted that ViDeNN [7] produces unwanted artifacts on denoising, which degrade video quality. Moreover, the SSIM values of our approach degrade far less with increasing noise levels than those of the other approaches.

6 Conclusions and Future Work

Our approach uses a channel-wise attention mechanism to address the problem of video denoising, employing a U-Net based autoencoder network in a temporal setting. The proposed network minimizes the effects of flickering by incorporating a temporal attention mechanism during forward propagation. Our approach has shown promising results in denoising videos corrupted by moderate to high degrees of noise. In the future, it may be studied whether adversarial training can improve performance further. Video denoising under real-world noise may also be considered as future work.

References

  • [1] P. Arias and J. Morel (2018) Video denoising via empirical Bayesian estimation of space-time patches. Journal of Mathematical Imaging and Vision 60(1), pp. 70–93.
  • [2] A. C. Bovik (2005) Handbook of Image and Video Processing. Academic Press.
  • [3] A. Buades, B. Coll, and J. Morel (2005) A non-local algorithm for image denoising. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), Vol. 2, pp. 60–65.
  • [4] C. R. A. Chaitanya, A. S. Kaplanyan, C. Schied, M. Salvi, A. Lefohn, D. Nowrouzezahrai, and T. Aila (2017) Interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder. ACM Transactions on Graphics 36(4).
  • [5] C. Chen, Q. Chen, J. Xu, and V. Koltun (2018) Learning to see in the dark. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3291–3300.
  • [6] J. Chen, J. Chen, H. Chao, and M. Yang (2018) Image blind denoising with generative adversarial network based noise modeling. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3155–3164.
  • [7] M. Claus and J. van Gemert (2019) ViDeNN: deep blind video denoising. CoRR abs/1904.10898.
  • [8] T. Ehret, A. Davy, J. Morel, G. Facciolo, and P. Arias (2019) Model-blind video denoising via frame-to-frame training. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11361–11370.
  • [9] E. S. Gedraite and M. Hadad (2011) Investigation on the effect of a Gaussian blur in image filtering and segmentation. In Proceedings ELMAR-2011, pp. 393–396.
  • [10] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 9, pp. 249–256.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. CoRR abs/1512.03385.
  • [12] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural Computation 9(8), pp. 1735–1780.
  • [13] A. Horé and D. Ziou (2010) Image quality metrics: PSNR vs. SSIM. In 2010 20th International Conference on Pattern Recognition (ICPR), pp. 2366–2369.
  • [14] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141.
  • [15] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. CoRR abs/1502.03167.
  • [16] K. Janocha and W. Czarnecki (2017) On loss functions for deep neural networks in classification. Schedae Informaticae 25.
  • [17] A. Khoreva, A. Rohrbach, and B. Schiele (2018) Video object segmentation with language referring expressions. In Asian Conference on Computer Vision (ACCV).
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [19] C. Li and A. C. Bovik (2010) Content-partitioned structural similarity index for image quality assessment. Signal Processing: Image Communication 25(7), pp. 517–526.
  • [20] Z. Li, Z. Dong, A. Yu, Z. He, and T. Yi (2019) An enhanced V-BM3D algorithm for VideoSAR denoising combined with temporal information. In 2019 IEEE 4th International Conference on Signal and Image Processing (ICSIP), pp. 994–998.
  • [21] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo (2018) Multi-level wavelet-CNN for image restoration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  • [22] M. Maggioni, G. Boracchi, A. Foi, and K. Egiazarian (2012) Video denoising, deblocking, and enhancement through separable 4-D nonlocal spatiotemporal transforms. IEEE Transactions on Image Processing 21(9), pp. 3952–3966.
  • [23] M. Mafi and M. Adjouadi (2018) A robust edge detection approach in the presence of high impulse intensity through switching adaptive median and fixed weighted mean filtering. IEEE Transactions on Image Processing.
  • [24] B. Mildenhall, J. T. Barron, J. Chen, D. Sharlet, R. Ng, and R. Carroll (2018) Burst denoising with kernel prediction networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • [26] O. Ronneberger, P. Fischer, and T. Brox (2015) U-Net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), LNCS Vol. 9351, pp. 234–241.
  • [27] M. Tassano, J. Delon, and T. Veit (2019) DVDnet: a fast network for deep video denoising. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 1805–1809.
  • [28] M. Tassano, J. Delon, and T. Veit (2020) FastDVDnet: towards real-time deep video denoising without flow estimation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [29] C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In Sixth International Conference on Computer Vision (ICCV), pp. 839–846.
  • [30] A. Utsav and Pratik (2019) Image enhancement and denoising in extreme low-light conditions. International Journal of Innovative Technology and Exploring Engineering 9.
  • [31] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), pp. 1096–1103.
  • [32] T. Vogels, F. Rousselle, B. McWilliams, G. Röthlin, A. Harvill, D. Adler, M. Meyer, and J. Novák (2018) Denoising with kernel prediction and asymmetric loss functions. ACM Transactions on Graphics 37(4).
  • [33] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid (2013) DeepFlow: large displacement optical flow with deep matching. In 2013 IEEE International Conference on Computer Vision (ICCV), pp. 1385–1392.
  • [34] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
  • [35] Z. Xu, X. Baojie, and W. Guoxin (2017) Canny edge detection based on OpenCV. In 2017 13th IEEE International Conference on Electronic Measurement & Instruments (ICEMI), pp. 53–56.
  • [36] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang (2017) Beyond a Gaussian denoiser: residual learning of deep CNN for image denoising. IEEE Transactions on Image Processing 26(7), pp. 3142–3155.
  • [37] K. Zhang, W. Zuo, and L. Zhang (2018) FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Transactions on Image Processing 27(9), pp. 4608–4622.
  • [38] L. Zhang and P. Bao (2003) Denoising by spatial correlation thresholding. IEEE Transactions on Circuits and Systems for Video Technology 13(6), pp. 535–538.