Real-time Deep Video Deinterlacing

08/01/2017 · Haichao Zhu et al. · SenseTime Corporation and The Chinese University of Hong Kong

Interlacing is a widely used technique in television broadcast and video recording that doubles the perceived frame rate without increasing the bandwidth. However, it introduces annoying visual artifacts, such as flickering and silhouette "serration," during playback. Existing state-of-the-art deinterlacing methods either ignore the temporal information to achieve real-time performance at the cost of lower visual quality, or estimate motion for better deinterlacing at the cost of higher computation. In this paper, we present the first deep convolutional neural network (DCNN) based method to deinterlace with both high visual quality and real-time performance. Unlike existing models for super-resolution, which rely on the translation-invariant assumption, our proposed DCNN model utilizes the temporal information from both the odd and even half frames to reconstruct only the missing scanlines, and retains the given odd and even scanlines when producing the full deinterlaced frames. By further introducing a layer-sharable architecture, our system achieves real-time performance on a single GPU. Experiments show that our method outperforms all existing methods in terms of both reconstruction accuracy and computational performance.



1. Introduction

The interlacing technique has been widely used in the past few decades for television broadcast and video recording, in both analog and digital forms. Instead of capturing all scanlines for each frame, only the odd-numbered scanlines are captured for the current frame (Fig. 2(a), upper), and the even-numbered scanlines are captured for the following frame (Fig. 2(a), lower). It essentially trades frame resolution for frame rate, doubling the perceived frame rate without increasing the bandwidth. Unfortunately, since the two half frames are captured at different time instances, significant visual artifacts such as line flickering and “serration” on the silhouettes of moving objects (Fig. 2(b)) appear when the odd and even fields are displayed interlaced. The degree of “serration” depends on the motion of objects and hence is spatially varying. This makes deinterlacing (removal of interlacing artifacts) an ill-posed problem.
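To make the field structure concrete, here is a minimal NumPy sketch (an illustration, not code from the paper) that weaves an interlaced frame from two consecutive progressive frames:

```python
# Illustrative sketch (not from the paper): weave an interlaced frame from two
# consecutive progressive frames. Scanlines are numbered from 1, so the "odd"
# lines are rows 0, 2, 4, ... in 0-based array indexing.
import numpy as np

def weave_interlaced(frame_t, frame_t1):
    """Odd-numbered lines come from frame_t, even-numbered lines from frame_t1."""
    assert frame_t.shape == frame_t1.shape
    interlaced = np.empty_like(frame_t)
    interlaced[0::2] = frame_t[0::2]    # odd scanlines (1st, 3rd, ...) from time t
    interlaced[1::2] = frame_t1[1::2]   # even scanlines (2nd, 4th, ...) from time t+1
    return interlaced
```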

Many deinterlacing methods have been proposed to suppress these visual artifacts. A typical approach is to reconstruct two full frames from the odd and even half frames independently (Fig. 2(c)). However, the result is usually unsatisfactory due to the large information loss (50%) (Doyle, 1990; Wang et al., 2012, 2013). Higher-quality reconstruction can be obtained by first estimating object motion (Jeon et al., 2009; Mohammadi et al., 2012; Lee and Lee, 2013). However, motion estimation from half frames is unreliable and computationally expensive. Hence, such methods are seldom used in practice, let alone in real-time applications.

In this paper, we propose the first deep convolutional neural network (DCNN) method tailor-made for the video deinterlacing problem. To the best of our knowledge, no DCNN-based deinterlacing method exists. One may argue that existing DCNN-based methods for interpolation or super-resolution (Mallat, 2016; Dong et al., 2016) can be applied to reconstruct the full frames from the half frames in order to solve the deinterlacing problem. However, such a naive approach fails to utilize the temporal information between the odd and even half frames, just like the existing intra-field deinterlacing methods (Doyle, 1990; Wang et al., 2012). Moreover, this naive approach follows the conventional translation-invariant assumption. That is, all pixels in the output full frames are processed with the same set of convolutional filters, even though half of the scanlines (odd/even numbered) actually exist in the input half frames. Fig. 3(b) shows a full frame reconstructed by the state-of-the-art DCNN-based super-resolution method, SRCNN (Dong et al., 2016), exhibiting obvious halo artifacts. Replacing the potentially error-contaminated pixels from the convolutional filtering with the groundtruth pixels in the input half frames also leads to visual artifacts (Fig. 3(c)). We argue instead that we should only reconstruct the missing scanlines and leave the pixels in the original odd/even scanlines intact. All these observations motivate us to design a novel DCNN model tailored for solving the deinterlacing problem.


Figure 2. (a) Two half fields are captured at two distinct time instances. (b) The interlaced display exhibits obvious artifacts on the silhouette of a moving car. (c) Two full frames reconstructed from the two half frames independently with the intra-field deinterlacing method ELA (Doyle, 1990).

In particular, our newly proposed DCNN architecture circumvents the translation-invariant assumption and takes the temporal information into consideration. First, we only estimate the missing scanlines, so that the groundtruth pixel values in the input odd/even scanlines are never modified. That is, the outputs of the neural network are two half frames containing only the missing scanlines. Unlike most existing methods, which ignore the temporal information between the odd and even fields, we reconstruct each output half frame from both fields. In other words, our neural network takes the two original half frames as input and outputs the two missing half frames (their complements).

Since we have two outputs, two neural networks would normally need to be trained. We accelerate this by sharing the lower-level layers of the two networks (Bengio, 2012), as their input is identical and hence the lower-level convolutional filters are sharable. With this improved network structure, we achieve real-time performance.


Figure 3. (a) An input interlaced frame. (b) Directly applying SRCNN to deinterlacing introduces blur and halo artifacts. (c) The visual artifacts are worsened if we retain the pixels from the input odd/even scanlines. (d) Our result.

To validate our method, we evaluate it on a rich variety of challenging interlaced videos, including live broadcasts, legacy movies, and legacy cartoons. Convincing and visually pleasant results are obtained in all experiments (Fig. 1 & 3(d)). We also compare our method to existing deinterlacing methods and DCNN-based models, both visually and quantitatively. All experiments confirm that our method outperforms existing methods in terms of both accuracy and speed.

2. Related Work

Before introducing our method, we first review existing work related to deinterlacing. It can be roughly classified into tailor-made deinterlacing methods, traditional image resizing methods, and DCNN-based image restoration approaches.

Image/Video Deinterlacing


Figure 4. The architecture of the proposed convolutional neural network.

Image/video deinterlacing is a classic vision problem. Existing methods can be classified into two categories: intra-field deinterlacing (Doyle, 1990; Wang et al., 2012, 2013) and inter-field deinterlacing (Jeon et al., 2009; Mohammadi et al., 2012; Lee and Lee, 2013). Intra-field deinterlacing methods reconstruct two full frames from the odd and even fields independently. Since half of the data is missing during frame reconstruction, the visual quality is usually less satisfactory. To improve visual quality, inter-field deinterlacing methods incorporate the temporal information between multiple fields from neighboring frames during frame reconstruction. Accurate motion compensation or motion estimation (Horn and Schunck, 1981) is needed to achieve satisfactory quality. However, accurate motion estimation is hard in general and incurs a high computational cost, so inter-field deinterlacing methods are seldom used in practice, especially for applications requiring real-time processing.

Traditional Image Resizing     Traditional image resizing methods can also be used for deinterlacing by scaling up the height of each field. To scale up an image, cubic (Mitchell and Netravali, 1988) and Lanczos interpolation (Duchon, 1979) are frequently used. While they work well for low-frequency components, high-frequency components (e.g., edges) may be over-blurred. More advanced image resizing methods, such as kernel regression (Takeda et al., 2007) and the bilateral filter (Hung and Siu, 2012), can improve the visual quality by preserving more high-frequency components. However, these methods may still introduce noise or artifacts if the vertical sampling rate is below the Nyquist rate. More critically, they only utilize a single field and ignore the temporal information, and hence suffer from the same problem as intra-field deinterlacing methods.
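For concreteness, the sketch below illustrates this resizing-based intra-field approach by doubling the height of a single field with bicubic interpolation. It is a generic illustration using OpenCV, not an implementation of any specific cited method:

```python
# Minimal sketch of resizing-based intra-field deinterlacing: keep one field and
# interpolate its height back to full resolution with bicubic interpolation.
import cv2
import numpy as np

def intra_field_deinterlace(interlaced, keep_odd=True):
    """interlaced: 2-D array; returns a full-height frame from a single field."""
    field = interlaced[0::2] if keep_odd else interlaced[1::2]   # half-height field
    full_h, full_w = interlaced.shape
    # cv2.resize takes (width, height); only the vertical dimension is scaled up.
    return cv2.resize(field, (full_w, full_h), interpolation=cv2.INTER_CUBIC)
```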

DCNNs for Image Restoration     In recent years, deep convolutional neural network (DCNN) based methods have been proposed to solve many image restoration problems. Xie et al. (2012) proposed a DCNN model for image denoising and inpainting. This model recovers the values of corrupted (or missing) pixels by learning the mapping between corrupted and uncorrupted patches. Dong et al. (2016) proposed to adopt a DCNN for image super-resolution, which greatly outperforms the previous state-of-the-art image super-resolution methods. Gharbi et al. (2016) further proposed a DCNN model for joint demosaicking and denoising. It infers the values of the three color channels of each pixel from a single noisy measurement.

It may seem that we can simply re-train these state-of-the-art neural network based methods for deinterlacing. However, our experiments show that visual artifacts remain unavoidable, as these DCNNs generally follow the conventional translation-invariant assumption and modify the values of all pixels, even those in the known odd/even scanlines. Using a larger training dataset or a deeper network structure may alleviate this problem, but the computational cost increases drastically and there is still no guarantee that the values of the known pixels remain intact. Even if we fix the values of the known pixels (Fig. 3(c)), the quality does not improve. In contrast, we propose a novel DCNN tailored for deinterlacing. Our model only estimates the missing pixels instead of the whole frame, and also takes the temporal information into account to improve visual quality.

3. Overview

Given an input interlaced frame (Fig. 4(a)), our goal of deinterlacing is to reconstruct the two full-size original frames from it (Fig. 4(d)). We denote the odd field of the interlaced frame by the blue pixels in Fig. 4(a), and the even field by the red pixels in Fig. 4(a). The superscripts, odd and even, denote the odd- or even-numbered half frames. The subscripts denote that the two fields are captured at two different time instances. Our goal is to reconstruct the two missing half frames (light blue and pink pixels in Fig. 4(c)). Note that we retain the known fields (blue and red pixels) in our two output full frames (Fig. 4(d)).

To estimate the unknown pixels from the interlaced frame, we propose a novel DCNN model (Fig. 4(b) & (c)). The input interlaced frame can be of any resolution, and the two half output images are obtained with five convolutional layers. The weights of the convolutional operators are learned from a prepared training dataset. During the training phase, we synthesize a set of interlaced videos from progressive videos of different types to form the training pairs. We need to synthesize interlaced videos for training because no groundtruth exists for interlaced videos captured by interlaced-scan devices. The details of preparing the training dataset and the design of the proposed DCNN are described in Section 4.
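As a rough illustration of this pipeline, the sketch below reassembles the two full frames from the known fields and the two half-height network outputs. The `model` callable and its output ordering are assumptions for illustration only; the point is that the known scanlines are copied from the input untouched.

```python
# Sketch of frame reassembly, assuming `model` maps an interlaced L-channel image
# of shape (H, W) to two half-height images of shape (H/2, W): the missing even
# lines of the first frame and the missing odd lines of the second frame.
import numpy as np

def deinterlace(model, interlaced):
    missing_even_t, missing_odd_t1 = model(interlaced)   # each (H/2, W)

    frame_t = np.empty_like(interlaced)
    frame_t[0::2] = interlaced[0::2]      # keep the captured odd field
    frame_t[1::2] = missing_even_t        # fill in the estimated even lines

    frame_t1 = np.empty_like(interlaced)
    frame_t1[1::2] = interlaced[1::2]     # keep the captured even field
    frame_t1[0::2] = missing_odd_t1       # fill in the estimated odd lines
    return frame_t, frame_t1
```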

4. DCNN-based Video Deinterlacing

4.1. Training Data Preparation

While there exists a large collection of interlaced videos on the Internet, unfortunately, the groundtruth of these videos is lacking. Therefore, to prepare a training dataset, we have to synthesize interlaced videos from existing progressive videos. To enrich data variety, we collect videos from the Internet and capture videos using progressive-scan devices ourselves. The videos are of different genres, ranging from scenic, sports, and computer-rendered footage to classic movies and cartoons. We then randomly sample pairs of consecutive frames from each collected video to obtain our set of frame pairs. For each pair of consecutive frames, we rescale both frames and label them as the pair of original frames (groundtruth full frames) (Fig. 5(a)). We then synthesize an interlaced frame from these two original frames, i.e., the odd lines are copied from the first frame and the even lines are copied from the second frame (Fig. 5(b) & 6). We further divide each triplet into patch triplets with a fixed sampling stride. Note that during patch generation, the parity of the divided patches remains the same as in the original images. Finally, for each patch triplet, we use the interlaced patch as a training input (Fig. 5(b)) and the corresponding two half-frame patches as training outputs (Fig. 5(c)). In particular, we convert patches into the Lab color space and only use the L channel for training. Altogether, we collect 9,792 patch triplets from the prepared videos, part of which are used for training and the rest for validation during the training process. Note that, although our model is trained on patches, the trained convolutional operators can be applied to images of any resolution.
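The following sketch outlines this triplet generation under assumed values for the patch size and sampling stride (the exact numbers are not given here). It synthesizes the interlaced patch as the input and extracts the two missing half-frame patches as the targets, while preserving scanline parity:

```python
# Sketch of training-triplet generation. PATCH and STRIDE are assumed placeholders;
# the paper's exact values are not reproduced here.
import numpy as np

PATCH, STRIDE = 64, 32   # assumed values; an even stride preserves scanline parity

def make_triplets(frame_t, frame_t1):
    """frame_t, frame_t1: consecutive progressive frames (L channel, 2-D arrays)."""
    interlaced = np.empty_like(frame_t)
    interlaced[0::2] = frame_t[0::2]      # odd lines from the first frame
    interlaced[1::2] = frame_t1[1::2]     # even lines from the second frame

    triplets = []
    h, w = interlaced.shape
    for y in range(0, h - PATCH + 1, STRIDE):
        for x in range(0, w - PATCH + 1, STRIDE):
            inp = interlaced[y:y+PATCH, x:x+PATCH]
            tgt_even = frame_t[y:y+PATCH, x:x+PATCH][1::2]    # missing even lines of the first frame
            tgt_odd  = frame_t1[y:y+PATCH, x:x+PATCH][0::2]   # missing odd lines of the second frame
            triplets.append((inp, tgt_even, tgt_odd))
    return triplets
```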


Figure 5. Training data preparation. (a) Two consecutive frames from an input video. (b) An interlaced frame is synthesized by taking the odd lines from the first frame and the even lines from the second frame, and is regarded as the training input. (c) The even lines of the first frame and the odd lines of the second frame are regarded as the training output.

Figure 6. A real example of synthesizing an interlaced frame from two consecutive progressive frames.

4.2. Neural Network Architecture

With the prepared training dataset, we now present how we design our network structure for deinterlacing. An illustration of our network structure is shown in Fig. 4. It contains five convolutional layers. Our goal is to reconstruct the two original frames from an input interlaced frame. In the following, we first explain our design rationales and then describe the architecture in detail.


Figure 7. Reconstructing two frames from two fields independently leads to inevitable visual artifacts due to the large information loss.

The Input/Output Layers     One may suggest utilizing an existing neural network (e.g., SRCNN (Dong et al., 2016)) to learn each full frame from its corresponding field independently. This effectively turns the problem into a super-resolution or image upscaling problem. However, there are two drawbacks.

First, since the two frame reconstruction processes are independent of each other, the neural network can only estimate each full frame from the known half frame without any temporal information. This inevitably leads to less satisfactory results due to the large (50%) information loss. In fact, the two fields in the interlaced frame are temporally correlated. Consider an extreme case where the scene in the two consecutive frames is static. In this scenario, the two consecutive frames are exactly the same, and the interlaced frame is artifact-free and exactly equal to the groundtruth we are looking for. However, with this naive super-resolution approach, we still have to feed a single half frame to reconstruct a full frame. It completely ignores the other half frame (which now contains the exact pixel values) and introduces artifacts (due to the 50% information loss). Fig. 7 shows the poor result in one such scenario. In contrast, our proposed neural network takes the whole interlaced frame as input (Fig. 4(a)). Note that the temporal information is implicitly taken into consideration in our network, since the two fields captured at different time instances are both used for reconstructing each single frame. The network can exploit the temporal correlation between the fields in the higher-level convolutional layers to improve the visual quality.

Second, a standard neural network generally follows the conventional translation-invariant assumption: all pixels in the input image are processed with the same set of convolutional filters. However, in our deinterlacing application, half of the pixels in the two output frames actually exist in the interlaced input and should be directly copied from it. Applying convolutional filters to these known pixels inevitably changes their original colors and leads to clear artifacts (Fig. 3(b) & (c)). In contrast, our neural network only estimates the unknown pixels (Fig. 4(c)) and copies the known pixels from the input to the two output frames directly (Fig. 4(d)).

Pathway Design     Since we estimate two half frames from the interlaced frame, we would normally have to train two networks/pathways independently. Training two separate networks is computationally costly. Alternatively, one may suggest training a single network that estimates the two half frames simultaneously by doubling the depth of each convolutional layer. However, this also greatly increases the computational cost, since the number of trained weights is doubled. As reported by Bengio (2012), a deep neural network seeks a good representation of the input data, and such representations can be transferred to other tasks if the input data is similar. For example, the trained features of AlexNet (Krizhevsky et al., 2012) (originally designed for object recognition) can also be used for texture recognition and segmentation (Cimpoi et al., 2015). In fact, the lower-level layers of convolutional networks are typically low-level feature detectors that respond to edges and other primitives. These lower-level layers of a trained model can be reused for new tasks by training new higher-level layers on top of them. Therefore, in our deinterlacing scenario, it is natural to share the lower-level convolutional layers to reduce computation, since the input to the two networks/pathways is exactly the same. On top of these weight-sharing lower-level layers, the higher-level layers are trained separately for estimating the two missing half frames, making them more adaptable to their respective objectives. Our method can be regarded as training one neural network for the first output, then fixing the first three convolutional layers and re-training a second network for the second output.

Detailed Architecture     As illustrated in Fig. 4(b) & (c), our network contains five convolutional layers with weights. The first, second, and third layers are sequentially connected and shared by both pathways. The fourth and fifth layers branch into two pathways without any connection between them, and the fifth layer of each pathway outputs a single-channel, half-height image. The activations of the first two layers are ReLU functions, while those of the remaining layers are identity functions. The convolution stride of the first four layers is one pixel. For the last layer, the horizontal stride remains one pixel, while the vertical stride is two pixels to obtain half-height outputs.
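A minimal tf.keras sketch of this shared-trunk, two-pathway design is given below. The kernel counts and sizes (64/32/1, 3×3) are illustrative assumptions, since the exact values are not listed here; the layer sharing, the activations, and the vertical stride of two in the last layer follow the description above.

```python
# Sketch of the two-pathway architecture (assumed kernel counts/sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_deinterlace_net():
    inp = layers.Input(shape=(None, None, 1))              # L channel of the interlaced frame

    # Shared lower-level layers (weights reused by both pathways).
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(inp)
    x = layers.Conv2D(64, 3, padding='same', activation='relu')(x)
    x = layers.Conv2D(64, 3, padding='same', activation=None)(x)    # identity activation

    def pathway(features):
        # Higher-level layers trained separately for each output field.
        y = layers.Conv2D(32, 3, padding='same', activation=None)(features)
        # Vertical stride 2 in the last layer yields a half-height, single-channel
        # image containing only the missing scanlines.
        return layers.Conv2D(1, 3, strides=(2, 1), padding='same', activation=None)(y)

    out_even = pathway(x)   # missing even lines of the first frame
    out_odd = pathway(x)    # missing odd lines of the second frame
    return Model(inp, [out_even, out_odd])
```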

4.3. Learning and Optimization

Given the training dataset containing a set of triplets, the optimal weights $W$ of our neural network are trained via the following objective function:

$$\min_{W} \; \frac{1}{N}\sum_{i=1}^{N}\Big(\big\|\hat{F}^{even}_i - F^{even}_i\big\|_2^2 + \big\|\hat{F}^{odd}_i - F^{odd}_i\big\|_2^2\Big) + \lambda\,\ell_{TV} \qquad (1)$$

where $N$ is the number of training samples, $\hat{F}^{even}_i$ and $\hat{F}^{odd}_i$ are the estimated outputs of the neural network, $\ell_{TV}$ is the total variation regularizer (Aly and Dubois, 2005; Johnson et al., 2016), and $\lambda$ is the regularization scalar.

We trained our neural network using TensorFlow on a workstation equipped with a single nVidia TITAN X Maxwell GPU. The standard ADAM optimization method (Kingma and Ba, 2014) is used to solve Eq. 1. The learning rate is and is set to in our experiments. The number of epochs is and the batch size for each epoch is . It takes about hours to train the neural network.
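Below is a hedged sketch of this training setup with the ADAM optimizer and an L2-plus-total-variation loss in the spirit of Eq. 1. The learning rate, regularization weight, epoch count, and batch size are placeholders, not the paper's values.

```python
# Training-setup sketch: Adam optimizer, pixel-wise L2 loss plus a TV regularizer.
import tensorflow as tf

LEARNING_RATE = 1e-3   # assumed placeholder
LAMBDA_TV = 1e-4       # assumed placeholder

def deinterlace_loss(y_true, y_pred):
    # Per-output loss: mean squared error plus total-variation regularization
    # on the estimated half frame.
    l2 = tf.reduce_mean(tf.square(y_true - y_pred))
    tv = tf.reduce_mean(tf.image.total_variation(y_pred))
    return l2 + LAMBDA_TV * tv

model = build_deinterlace_net()   # from the architecture sketch above
model.compile(optimizer=tf.keras.optimizers.Adam(LEARNING_RATE),
              loss=[deinterlace_loss, deinterlace_loss])
# model.fit(interlaced_patches, [even_targets, odd_targets],
#           epochs=..., batch_size=...)   # values elided in the text
```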

5. Result and Discussion


Figure 8. Comparisons between bicubic interpolation, SRCNN (Dong et al., 2016) and our method.

Figure 9. Comparisons between the state-of-the-art tailored deinterlacing methods, including ELA (Doyle, 1990), WLSD (Wang et al., 2014), and FBA (Vedadi and Shirani, 2013), and our method.

We evaluate our method on a large collection of interlaced videos downloaded from the Internet or captured by ourselves with interlaced-scan cameras. These videos include live sports videos (“Soccer” in Fig. 1 and “Tennis” in Fig. 8), scenic videos (“Leaves” in Fig. 1 and “Bus” in Fig. 8), computer-rendered gameplay videos (“Hunter” in Fig. 8), legacy movies (“Haystack” in Fig. 8), and legacy cartoons (“Rangers” in Fig. 8). Note that we have no access to the original progressive frames (groundtruth) of these videos. Without groundtruth, we can only compare our method to existing methods visually, but not quantitatively.

To evaluate quantitatively (with comparison to the groundtruth), we synthesize a set of test interlaced videos from progressive scan videos of different genres. None of these synthetic interlaced videos exist in our training data. Fig. 9 presents a set of synthetic interlaced videos, including sports (“Basketball”), scenic (“Taxi”), computer-rendered (“Roof”), movies (“Jumping”), and cartoons (“Tide” and “Girl”). Due to the page limit, we only present one representative interlaced frame for each video sequence. While two full size frames can be recovered from each single interlaced frame, we only show the first frame in all our results. Please refer to the supplementary materials for more complete results.

Visual Comparison     We first compare our method with classic bicubic interpolation and the existing DCNN tailored for super-resolution, i.e., SRCNN (Dong et al., 2016). Since SRCNN is not designed for deinterlacing, we re-train its model on our prepared dataset for the deinterlacing task. The results are presented in Figs. 1 and 8. “Soccer”, “Bus”, and “Tennis” exhibit severe interlacing artifacts. Besides, the frames also contain motion blur and video compression artifacts. Since both bicubic interpolation and SRCNN reconstruct each frame from a single field alone, their results are unsatisfactory and exhibit obvious artifacts due to the large information loss. SRCNN performs even worse than bicubic interpolation, since it follows the conventional translation-invariant assumption, which does not hold in the deinterlacing scenario. In comparison, our method obtains much clearer and sharper results than our competitors. The “Hunter” example shows a moving character from a gameplay video where the computer-rendered object contours/boundaries are sharp. Both bicubic interpolation and SRCNN produce blurry and zig-zag artifacts near these sharp edges. In contrast, our method obtains the best reconstruction, achieving sharp and smooth boundaries. The “Haystack” and “Rangers” examples are both taken from legacy DVDs in interlaced NTSC format. In the “Haystack” example, only the character is moving, while the background remains static. Without considering the temporal information, both bicubic interpolation and SRCNN fail to recover the fine texture of the haystacks and produce blurry results. In sharp contrast, our method successfully recovers the fine texture by taking both fields into consideration.

PSNR/SSIM Taxi Roof Basketball Jumping Tide Girl
bicubic 31.56/0.9453 33.11/0.9808 34.67/0.9783 37.81/0.9801 31.87/0.9809 29.14/0.9585
ELA 32.47/0.9444 34.41/0.9839 32.08/0.9605 38.82/0.9844 33.89/0.9811 31.62/0.9724
WLSD 35.99/0.9746 35.70/0.9883 35.05/0.9794 38.19/0.9819 34.17/0.9820 32.00/0.9761
FBA 34.94/0.9389 35.26/0.9815 33.93/0.9749 38.27/0.9822 35.15/0.9822 31.78/0.9756
SRCNN 30.12/0.9214 32.01/0.9749 29.18/0.9353 36.81/0.97094 33.02/0.9758 27.79/0.9477
Ours 38.15/0.9834 35.44/0.9866 36.55/0.9838 39.75/0.9889 35.37/0.9807 35.44/0.9866
Table 1. PSNR and SSIM between the deinterlaced frames and groundtruth of all methods.
Average time (s)   ELA   WLSD   FBA   Bicubic   SRCNN   Ours (with sharable layers)   Ours (without sharable layers)
0.6854 2.9843 4.1486 0.7068 0.3010 0.0835 0.2520
0.0676 1.0643 1.6347 0.2812 0.0998 0.0301 0.0833
0.0317 0.4934 0.6956 0.1176 0.0423 0.0204 0.0556
0.0241 0.4956 0.7096 0.1110 0.0419 0.0137 0.0403
Table 2. Timing statistics for all methods.

We further compare our method to state-of-the-art deinterlacing methods, including ELA (Doyle, 1990), WLSD (Wang et al., 2014), and FBA (Vedadi and Shirani, 2013). ELA is the most widely used deinterlacing method due to its high performance. It is an intra-field method and uses edge-directional correlation to reconstruct the missing scanlines. WLSD is the state-of-the-art intra-field deinterlacing method based on optimization. It generally produces better results than ELA, but at a higher computational expense. FBA is the state-of-the-art inter-field method. Fig. 9 shows the results of all methods on a set of synthetic interlaced videos for which we have the groundtruth for quantitative evaluation. Besides the reconstructed frames, we also show blown-up difference images for better visualization. The difference image is simply the pixel-wise absolute difference between the output and the groundtruth. As we can observe, all our competitors generate artifacts around object boundaries; the sharper the boundary, the more obvious the artifact. In general, ELA produces the most artifacts since it adopts a simple interpolator and uses information from a single field alone. WLSD produces fewer artifacts as it adopts a more complex optimization-based strategy to fill in the missing pixels, but it still uses only a single field and suffers a large information loss during reconstruction. Although FBA utilizes the temporal information, it still cannot achieve good visual quality because it relies on simple interpolators. In contrast, our method produces significantly fewer artifacts than all competitors.

Quantitative Evaluation     We train our neural network by minimizing the loss of Eq. 1 on the training data. The training and validation losses throughout the training epochs are shown in Fig. 10. Both losses drop rapidly within the first few epochs and converge in around 50 epochs.

We also compare the accuracy of our method to that of our competitors in terms of peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM). Note that we only compute PSNR and SSIM for the test videos with groundtruth. We take the average value over all frames of each video sequence for both measurements. Table 1 presents the statistics. Our method outperforms the competitors in terms of both PSNR and SSIM in most cases.
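For reference, the per-video PSNR and SSIM averages can be computed with scikit-image as in the generic sketch below (not the authors' evaluation code; 8-bit frames are assumed):

```python
# Generic PSNR/SSIM evaluation between deinterlaced frames and groundtruth,
# averaged over all frames of a video sequence.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(deinterlaced_frames, groundtruth_frames):
    psnrs, ssims = [], []
    for out, gt in zip(deinterlaced_frames, groundtruth_frames):
        psnrs.append(peak_signal_noise_ratio(gt, out, data_range=255))  # assumes uint8 range
        ssims.append(structural_similarity(gt, out, data_range=255))
    return float(np.mean(psnrs)), float(np.mean(ssims))
```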

Timing Statistics     Lastly, we compare the running time of our method to that of our competitors on a workstation with an Intel Core i7-5930 CPU and 65GB of RAM, equipped with an nVidia TITAN X Maxwell GPU. The statistics are presented in Table 2. Our method achieves the highest performance among all methods at all resolutions. It even runs faster than ELA while delivering clearly better visual quality. ELA and SRCNN have similar performance and are slightly slower than our method. Bicubic interpolation, WLSD, and FBA have much higher computational complexity and are far from real-time processing. Note that ELA is a CPU-only method without GPU acceleration. In particular, with a single GPU, our method already achieves real-time performance (33 fps) up to the second-highest tested resolution. With one more GPU, our method can also achieve real-time performance for the highest-resolution videos. We also test our model without sharing the lower-level layers, i.e., using two separate networks to reconstruct the two frames. The statistics are shown in the last column of Table 2. This strategy roughly triples the computational time, while the quality is similar to that with shared lower-level layers.


Figure 10. Training loss and validation loss of our neural network.

Limitations     Since our method does not explicitly separate the two fields when reconstructing the two full frames, the two fields may interfere with each other badly when the motion between them is extremely large. The first row of Fig. 11 presents an example where the interlaced frame contains very large motion; obvious artifacts can be observed. Our method may also fail when the interlaced frame contains very thin horizontal structures. The second row of Fig. 11 shows an example where a thin horizontal reflection stripe appears on a car. Only one line of the reflection stripe is scanned in the interlaced frame. Our neural network fails to identify it as a result of interlacing, regards it as an original structure, and incorrectly preserves it in the reconstructed frame. This is because such patches are rare and get diluted by the large number of common cases. We may alleviate this problem by training the neural network with more such patches.


Figure 11. Failure cases. The top row shows a case where our result contains obvious artifacts because the motion in the interlaced frame is too large. The bottom row shows a case where our method fails to identify a thin horizontal structure as an interlacing artifact and incorrectly preserves it in the reconstructed frame.

6. Conclusion

In this paper, we present the first DCNN for video deinterlacing. Unlike conventional DCNNs, which suffer from the translation-invariant issue, our novel DCNN architecture takes the whole interlaced frame as input and produces two half frames as output. We also propose sharing the lower-level convolutional layers when reconstructing the two output frames to boost efficiency. With this strategy, our method achieves real-time deinterlacing on a single GPU. Experiments show that our method outperforms existing methods, including traditional deinterlacing methods and DCNN-based models re-trained for deinterlacing, in terms of both reconstruction accuracy and computational performance.

Since our method takes the whole interlaced frame as input, frame reconstruction is always influenced by both fields. While this produces better results in most cases, it occasionally leads to visually poorer results when the motion between the two fields is extremely large. In this scenario, reconstructing each frame from a single field without considering temporal information may produce better results. A possible solution is to first detect such large-motion frames, and then decide whether temporal information should be utilized for deinterlacing.

References

  • Aly and Dubois [2005] Hussein A. Aly and Eric Dubois. 2005. Image up-sampling using total-variation regularization with a new observation model. IEEE Transactions on Image Processing 14, 10 (2005), 1647–1659.
  • Bengio [2012] Yoshua Bengio. 2012. Deep learning of representations for unsupervised and transfer learning. Proceedings of ICML Workshop on Unsupervised and Transfer Learning 27, 17–36.
  • Cimpoi et al. [2015] Mircea Cimpoi, Subhransu Maji, and Andrea Vedaldi. 2015. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3828–3836.
  • Dong et al. [2016] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. 2016. Image super-resolution using deep convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 2 (2016), 295–307.
  • Doyle [1990] T. Doyle. 1990. Interlaced to sequential conversion for EDTV applications. In Proceedings of International Workshop on Signal Processing of HDTV. 412–430.
  • Duchon [1979] Claude E. Duchon. 1979. Lanczos filtering in one and two dimensions. Journal of Applied Meteorology 18, 8 (1979), 1016–1022.
  • Gharbi et al. [2016] Michaël Gharbi, Gaurav Chaurasia, Sylvain Paris, and Frédo Durand. 2016. Deep joint demosaicking and denoising. ACM Transactions on Graphics 35, 6 (2016), 191.
  • Horn and Schunck [1981] Berthold K.P. Horn and Brian G. Schunck. 1981. Determining optical flow. Artificial Intelligence 17, 1-3 (1981), 185–203.
  • Hung and Siu [2012] K.W. Hung and W.C. Siu. 2012. Fast image interpolation using the bilateral filter. IET Image Processing 6, 7 (2012), 877–890.
  • Jeon et al. [2009] Gwanggil Jeon, Jongmin You, and Jechang Jeong. 2009. Weighted fuzzy reasoning scheme for interlaced to progressive conversion. IEEE Transactions on Circuits and Systems for Video Technology 19, 6 (2009), 842–855.
  • Johnson et al. [2016] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. 2016. Perceptual losses for real-time style transfer and super-resolution. In Proceedings of European Conference on Computer Vision. Springer, 694–711.
  • Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
  • Lee and Lee [2013] Kwon Lee and Chulhee Lee. 2013. High quality spatially registered vertical temporal filtering for deinterlacing. IEEE Transactions on Consumer Electronics 59, 1 (2013), 182–190.
  • Mallat [2016] Stéphane Mallat. 2016. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A 374, 2065 (2016), 20150203.
  • Mitchell and Netravali [1988] Don P. Mitchell and Arun N. Netravali. 1988. Reconstruction filters in computer-graphics. In Computer Graphics. 221–228.
  • Mohammadi et al. [2012] H. Mahvash Mohammadi, Y. Savaria, and J.M.P. Langlois. 2012. Enhanced motion compensated deinterlacing algorithm. IET Image Processing 6, 8 (2012), 1041–1048.
  • Takeda et al. [2007] Hiroyuki Takeda, Sina Farsiu, and Peyman Milanfar. 2007. Kernel regression for image processing and reconstruction. IEEE Transactions on Image Processing 16, 2 (2007), 349–366.
  • Vedadi and Shirani [2013] Farhang Vedadi and Shahram Shirani. 2013. De-interlacing using nonlocal costs and Markov-chain-based estimation of interpolation methods. IEEE Transactions on Image Processing 22, 4 (2013), 1559–1572.
  • Wang et al. [2012] Jin Wang, Gwanggil Jeon, and Jechang Jeong. 2012. Efficient adaptive deinterlacing algorithm with awareness of closeness and similarity. Optical Engineering 51, 1 (2012), 017003–1.
  • Wang et al. [2013] Jin Wang, Gwanggil Jeon, and Jechang Jeong. 2013. Moving Least-Squares Method for Interlaced to Progressive Scanning Format Conversion. IEEE Transactions on Circuits and Systems for Video Technology 23, 11 (2013), 1865–1872.
  • Wang et al. [2014] Jin Wang, Gwanggil Jeon, and Jechang Jeong. 2014. De-Interlacing algorithm using weighted least squares. IEEE Transactions on Circuits and Systems for Video Technology 24, 1 (2014), 39–48.
  • Xie et al. [2012] Junyuan Xie, Linli Xu, and Enhong Chen. 2012. Image denoising and inpainting with deep neural networks. In Advances in Neural Information Processing Systems. 341–349.