PyTorch implementation for "Deep Flow-Guided Video Inpainting" (CVPR'19)
Video inpainting, which aims at filling in missing regions of a video, remains challenging due to the difficulty of preserving the precise spatial and temporal coherence of video contents. In this work we propose a novel flow-guided video inpainting approach. Rather than filling in the RGB pixels of each frame directly, we consider video inpainting as a pixel propagation problem. We first synthesize a spatially and temporally coherent optical flow field across video frames using a newly designed Deep Flow Completion network. Then the synthesized flow field is used to guide the propagation of pixels to fill up the missing regions in the video. Specifically, the Deep Flow Completion network follows a coarse-to-fine refinement to complete the flow fields, while their quality is further improved by hard flow example mining. Following the guide of the completed flow, the missing video regions can be filled up precisely. Our method is evaluated on the DAVIS and YouTube-VOS datasets qualitatively and quantitatively, achieving state-of-the-art performance in terms of inpainting quality and speed.
The goal of video inpainting is to fill in missing regions of a given video sequence with contents that are both spatially and temporally coherent [4, 12, 22, 24]. Video inpainting, also known as video completion, has many real-world applications such as undesired object removal and video restoration.
Inpainting real-world high-definition video sequences remains challenging due to camera motion and the complex movement of objects. Most existing video inpainting algorithms [12, 21, 22, 27, 30] follow the traditional image inpainting pipeline by formulating the problem as a patch-based optimization task, which fills missing regions by sampling spatial or spatial-temporal patches from the known regions and then solving a minimization problem. Despite some good results, these approaches suffer from two drawbacks. First, they typically assume a smooth and homogeneous motion field in the missing region and therefore cannot handle videos with complex motions. A failure case is shown in Fig. 1(b). Second, the computational complexity of optimization-based methods is high, which makes them infeasible for real-world applications. For instance, the method by Huang et al. requires approximately 3 hours to inpaint an 854×480 video with 90 frames containing 18% missing regions.
Although significant progress has been made in image inpainting through the use of Convolutional Neural Networks (CNNs), video inpainting using deep learning remains much less explored. There are several challenges in extending deep learning-based image inpainting approaches to the video domain. As shown in Fig. 1(c), a direct application of an image inpainting algorithm on each frame individually will lead to temporal artifacts and jitters. On the other hand, due to the large number of RGB frames, it is difficult to ensure temporal coherence even when the entire video sequence is fed at once to a 3D CNN. Meanwhile, an extremely large model capacity is needed to directly inpaint the entire video sequence, which is not computationally practical given its large memory consumption.
Rather than filling the RGB pixels, we propose an alternative flow-guided approach for video inpainting. The motivation behind our approach is that completing a missing flow is much easier than filling in pixels of a missing region directly, while using the flow to propagate pixels temporally preserves the temporal coherence naturally. As shown in Fig. 1(d), compared with RGB pixels, the optical flow is far less complex and easier to complete since the background and most objects in a scene typically have trackable motion. This observation inspires us to alleviate the difficulty of video inpainting by first synthesizing a coherent flow field across frames. Most pixels in the missing regions can then be propagated and warped from the visible regions. Finally, we fill up the small number of regions that are not seen in the entire video using pixel hallucination.
In order to fill up the optical flows in videos, we design a novel Deep Flow Completion Network (DFC-Net) with the following technical novelties:
(1) Coarse-to-fine refinement: The proposed DFC-Net is designed to recover an accurate flow field from missing regions. This is made possible by stacking three similar subnetworks (DFC-S) to perform coarse-to-fine flow completion. Specifically, the first subnetwork accepts a batch of consecutive frames as the input and estimates the missing flow of the middle frame on a relatively coarse scale. The batch of coarsely estimated flow fields is subsequently fed to the second subnetwork followed by the third subnetwork for further spatial resolution and accuracy refinement.
(2) Temporal coherence maintenance: Our DFC-Net is designed to naturally encourage global temporal consistency even though its subnetworks only predict a single frame each time. This is achieved through feeding a batch of consecutive frames as inputs, which provide richer temporal information. In addition, the highly similar inputs between adjacent frames tend to produce continuous results.
(3) Hard flow example mining: We introduce a hard flow example mining strategy to improve the inpainting quality on flow boundaries and dynamic regions.
In summary, the main contribution of this work is a novel flow-guided video inpainting approach. We demonstrate that compelling video completion in complex scenes can be achieved via high-quality flow completion and pixel propagation. A Deep Flow Completion network is designed to cope with arbitrarily shaped missing regions and complex motions while maintaining temporal consistency. In comparison to previous methods, our approach is significantly faster in runtime speed, while it does not require any assumptions about the missing regions and the motions of the video contents. We show the effectiveness of our approach on both the DAVIS and YouTube-VOS datasets with state-of-the-art performance.
Non-learning-based Inpainting. Prior to the prevalence of deep learning, most image inpainting approaches fall into two categories, i.e., diffusion-based or patch-based methods, which both aim to fill the target holes by borrowing appearance information from known regions. A diffusion-based method [1, 5, 19] propagates appearance information around the target hole for image completion. This approach is incapable of handling appearance variations and filling large holes. A patch-based method [6, 8, 10, 29] completes missing regions by sampling and pasting patches from known regions or other source images. This kind of approach has been extended to the temporal domain for video inpainting [21, 22, 27]. Strobel et al. and Huang et al. further estimate the motion field in the missing regions to address the temporal consistency problem. In comparison to diffusion-based methods, patch-based methods can better handle non-stationary visual data. However, the dense computation of patch similarity is a very time-consuming operation. Even when PatchMatch [2, 3] is used to accelerate the patch matching process, the method of Huang et al. is still approximately 20 times slower than our approach. Importantly, unlike our deep learning-based approach, none of the aforementioned methods can capture high-level semantic information. They thus fall short in recovering content in regions that encompass complex and dynamic motion from multiple objects.
Learning-based Inpainting. The emergence of deep learning inspires recent works to investigate various deep architectures for image inpainting. Earlier works [17, 26] attempted to directly train a deep neural network for inpainting. With the advent of Generative Adversarial Networks (GAN), some studies [15, 23, 35] formulate inpainting as a conditional image generation problem. By using a GAN, Pathak et al. train an inpainting network that can handle large-sized holes. Iizuka et al. improved on this by introducing both global and local discriminators for deriving the adversarial losses. More recently, Yu et al. presented a contextual attention mechanism in a generative inpainting framework, which further improves the inpainting quality. These methods achieve excellent results in image inpainting. Extending them directly to the video domain is, however, challenging because they do not model temporal constraints. In this paper we formulate an effective framework that is specially designed to exploit redundant information across video frames. The notion of pixel propagation through deeply estimated flow fields is new in the literature. The proposed techniques, e.g., coarse-to-fine flow completion, maintaining temporal coherence, and hard flow example mining, are shown to be effective in the experiments, outperforming existing optimization-based and deep learning-based methods.
Figure 2 depicts the pipeline of our flow-guided video inpainting approach. It contains two steps: the first step completes the missing flow, while the second step propagates pixels under the guidance of the completed flow fields.
In the first step, a Deep Flow Completion Network (DFC-Net) is proposed for coarse-to-fine flow completion. DFC-Net consists of three similar subnetworks named DFC-S. The first subnetwork estimates the flow at a relatively coarse scale and feeds it into the second and third subnetworks for further refinement. In the second step, after the flow is obtained, most of the missing regions can be filled up by pixels in known regions through a flow-guided propagation from different frames. A conventional image inpainting network is finally employed to complete the remaining regions that are not seen in the entire video. Thanks to the high-quality flow estimated in the first step, we can easily propagate these image inpainting results to the entire video sequence.
Section 3.1 will introduce our basic flow completion subnetwork DFC-S in detail. The stacked flow completion network, DFC-Net, is specified in Sec. 3.2. Finally, the RGB pixel propagation procedure will be clarified in Sec. 3.3.
Two types of inputs are provided to the first DFC-S in our network: (i) a concatenation of flow maps from consecutive frames, and (ii) the associated sequence of binary masks, each of which indicates the missing regions of the corresponding flow map. The output of this DFC-S is the completed flow field of the middle frame. In comparison to using a single flow map input, using a sequence of flow maps and the corresponding masks improves the accuracy of flow completion considerably.
More specifically, suppose $f^0_{i\to(i+1)}$ represents the initial flow between the $i$-th and $(i+1)$-th frames and $M_{i\to(i+1)}$ denotes the corresponding indicating mask. We first extract the flow field using FlowNet 2.0 and initialize all holes in $f^0_{i\to(i+1)}$ by smoothly interpolating the known values at the boundary inward. To complete $f^0_{i\to(i+1)}$, the inputs $\{f^0_{(i-k)\to(i-k+1)}, \dots, f^0_{i\to(i+1)}, \dots, f^0_{(i+k)\to(i+k+1)}\}$ and $\{M_{(i-k)\to(i-k+1)}, \dots, M_{i\to(i+1)}, \dots, M_{(i+k)\to(i+k+1)}\}$ are concatenated along the channel dimension and then fed into the first subnetwork, where $k$ denotes the number of consecutive frames taken on each side of the middle one. Generally, $k = 5$ is sufficient for the model to acquire related information, and feeding more frames does not produce apparent improvement. With this setting, the number of input channels is 33 for the first DFC-S (11 flow maps each for the x- and y-direction flows, and 11 binary masks). For the second and third DFC-S, inputs and outputs are different. Their settings will be discussed in Sec. 3.2.
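The channel arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function name `dfc_s_input_channels` and the convention that `k` counts the neighbouring frames on each side of the middle one are assumptions chosen so that the window contains the 11 flow maps mentioned in the text.

```python
def dfc_s_input_channels(k: int, flow_components: int = 2, mask_channels: int = 1) -> int:
    """Input channels for a window of 2k+1 flow maps plus their binary masks.

    Assumption: k counts neighbouring frames on each side of the middle frame.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    window = 2 * k + 1  # middle flow map plus k neighbours on each side
    # Each flow map contributes an x and a y channel; each mask contributes one.
    return window * (flow_components + mask_channels)


print(dfc_s_input_channels(5))  # 11 * (2 + 1) = 33
```

Under these assumptions, a wider temporal window only changes the first convolution of the backbone, since every later layer sees the same feature shapes.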
As shown in Fig. 2(a), considering the tradeoff between model capacity and speed, DFC-S uses ResNet-50 as the backbone. ResNet-50 consists of five blocks named ‘conv1’ and ‘conv2_x’ to ‘conv5_x’. We modify the input channel of the first convolution in ‘conv1’ to fit the shape of our inputs (e.g., 33 in the first DFC-S). To increase the resolution of features, we decrease the convolutional strides and replace convolutions by dilated convolutions from ‘conv4_x’ to ‘conv5_x’, as done in prior work. An upsampling module composed of three alternating convolution, ReLU and upsampling layers is then appended to enlarge the prediction. To project the prediction to the flow field, we remove the last activation function in the upsampling module.
Figure 2(a) depicts the architecture of DFC-Net, which is constructed by stacking three DFC-S. Typically, the smaller the hole, the more easily the missing flow can be completed, so we first shrink the input frames of the first subnetwork to obtain good initial results. The frames are then gradually enlarged in the second and third subnetworks to capture more details, following a coarse-to-fine refinement paradigm. The inputs of the three subnetworks are thus resized to successively larger fractions of the original size.
After obtaining the coarse flow from the first subnetwork, the second subnetwork focuses on further flow refinement. To better align the flow field, the forward and backward flows are refined jointly in the second subnetwork. Suppose $f^1$ is the coarse flow field generated by the first subnetwork. For each pair of consecutive frames, the $i$-th frame and the $(i+1)$-th frame, the second subnetwork takes a sequence of estimated bidirectional flows, $f^1_{i\to(i+1)}$ and $f^1_{(i+1)\to i}$ together with those of the neighboring frames, as input and produces the refined flows $f^2_{i\to(i+1)}$ and $f^2_{(i+1)\to i}$. Similar to the first subnetwork, the binary masks of both flow directions are also fed into the second subnetwork to indicate the masked regions of the flow field. The second subnetwork shares the same architecture as the first subnetwork; however, the number of input and output channels is different.
Finally, predictions from the second subnetwork are enlarged and fed into the third subnetwork, which strictly follows the same procedure as the second subnetwork to obtain the final results. A step-by-step visualization is provided in Fig. 3: the quality of the flow field is gradually improved through the coarse-to-fine refinement.
Training. During training, for each video sequence, we randomly generate the missing regions. The optimization goal is to minimize the $\ell_1$ distance between predictions and ground-truth flows. The three subnetworks are first pre-trained separately and then jointly fine-tuned in an end-to-end manner. Specifically, the loss of the $j$-th subnetwork is the $\ell_1$ distance between its prediction $f_j$ and the ground-truth flow $\hat{f}_j$, restricted by element-wise multiplication ($\odot$) with the binary mask $M_j$. For the joint fine-tuning, the overall loss is a linear combination of the subnetwork losses.
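A masked loss consistent with this description can be written as below. This is a sketch under stated assumptions rather than the exact published formula: the symbols $f_j$ (prediction of the $j$-th subnetwork), $\hat{f}_j$ (ground-truth flow), $M_j$ (binary mask), the $\ell_1$ norm, the normalization by the mask area, and the combination weights $w_j$ are all chosen here for illustration.

```latex
L_j = \frac{\lVert M_j \odot (f_j - \hat{f}_j) \rVert_1}{\lVert M_j \rVert_1},
\qquad
L = \sum_{j=1}^{3} w_j \, L_j
```

Normalizing by $\lVert M_j \rVert_1$ keeps the loss magnitude comparable across randomly generated masks of different sizes, which is one plausible reason to include it.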
Hard Flow Example Mining (HFEM). Because the majority of the flow area is smooth in video sequences, there exists a huge bias in the number of training samples between the smooth region and the boundary region. In our experiments, we observe that directly using the $\ell_1$ loss generally leads to an imbalance problem, in which the training process is dominated by smooth areas and the boundary region in the prediction is blurred. What is worse, incorrect flow edges can lead to serious artifacts in the subsequent propagation step.
To overcome this issue, inspired by prior work on hard example mining, we leverage a hard flow example mining mechanism to automatically focus more on the difficult areas and thus encourage the model to produce sharp boundaries. Specifically, we sort all pixels in descending order of the loss. The top $p$ percent of pixels are labeled as hard samples. Their losses are then enhanced by a weight $\lambda$ to force the model to pay more attention to those regions. The loss with hard flow example mining therefore up-weights the loss terms of the pixels selected by a binary mask $M^h$ indicating the hard regions. As shown in Fig. 4, the hard examples are mainly distributed around the high-frequency regions such as the boundaries. Thanks to the hard flow example mining, the model learns to focus on producing sharper boundaries.
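The mining step described above (sort pixels by loss, mark the top fraction as hard, weight their loss) can be sketched in plain Python. This is an illustrative sketch, not the authors' implementation: the additive form of the extra term, the names `p` and `lam`, their default values, and the use of flat lists instead of tensors are all assumptions.

```python
def hfem_loss(pred, target, mask, p=0.5, lam=1.0):
    """Masked L1 loss plus a weighted L1 term over the hardest pixels.

    pred, target: flat lists of flow values.
    mask: flat list of 0/1 flags marking the pixels that enter the loss.
    p: fraction of masked pixels treated as hard (hypothetical default).
    lam: weight of the hard-pixel term (hypothetical default).
    """
    errors = [abs(a - b) for a, b, m in zip(pred, target, mask) if m]
    if not errors:
        return 0.0
    base = sum(errors) / len(errors)
    # Sort in descending order of loss and keep the top p fraction as hard samples.
    n_hard = max(1, int(round(p * len(errors))))
    hard = sorted(errors, reverse=True)[:n_hard]
    # Assumed additive form: base loss + lam * mean loss over the hard pixels.
    return base + lam * sum(hard) / len(hard)
```

With `lam=0` the function reduces to the plain masked L1 loss, which makes it easy to compare training with and without mining.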
The optical flow generated by DFC-Net establishes a connection between pixels across frames, which can be used as guidance to inpaint missing regions by propagation. Figure 2(b) illustrates the detailed process of flow-guided frame inpainting.
Flow Guided Pixel Propagation. As the estimated flow may be inaccurate in some locations, we first need to check the validity of the flow. For a forward flow $f_{i\to(i+1)}$ and a location $x_i$, we verify a simple condition based on photometric consistency: $\lVert x_{i+1} + f_{(i+1)\to i}(x_{i+1}) - x_i \rVert^2 < \epsilon$, where $x_{i+1} = x_i + f_{i\to(i+1)}(x_i)$ and $\epsilon$ is a relatively small threshold (i.e., 5). This condition means that after the forward and backward propagation, the pixel should go back to the original location. If it is not satisfied, we regard $f_{i\to(i+1)}(x_i)$ as unreliable and ignore it in the propagation. The backward flow can be verified with the same approach.
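The validity check described above can be sketched as a forward-backward round trip. The following is a minimal illustration under stated assumptions: flows are stored as nested lists of `(dx, dy)` displacements, the backward flow is looked up at the nearest pixel instead of being interpolated, the squared distance is compared with the threshold, and the function name `is_flow_consistent` is hypothetical.

```python
def is_flow_consistent(fwd, bwd, x, y, eps=5.0):
    """Forward-backward check at pixel (x, y) of frame i.

    fwd[y][x] and bwd[y][x] hold (dx, dy) displacements for the flows
    frame i -> i+1 and frame i+1 -> i. Nearest-neighbour lookup of the
    backward flow is a simplification assumed for this sketch.
    """
    h, w = len(fwd), len(fwd[0])
    dx, dy = fwd[y][x]
    x1, y1 = x + dx, y + dy                   # position in frame i+1
    xi, yi = int(round(x1)), int(round(y1))   # nearest pixel in frame i+1
    if not (0 <= xi < w and 0 <= yi < h):
        return False                          # left the frame: treat as unreliable
    bx, by = bwd[yi][xi]
    x0, y0 = x1 + bx, y1 + by                 # position after returning to frame i
    # The pixel should come back (almost) to where it started.
    return (x0 - x) ** 2 + (y0 - y) ** 2 < eps
```

Pixels that fail the check are simply skipped during propagation, so an unreliable flow vector never writes a wrong colour into the hole.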
After the consistency check, as shown in Fig. 2(b)(1), all known pixels are propagated bidirectionally to fill the missing regions based on the valid estimated flow. In particular, if an unknown pixel is connected with both forward and backward known pixels, it will be filled by a linear combination of their pixel values whose weights are inversely proportional to the distance between the unknown pixel and known pixels.
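The blending rule for a pixel reached from both directions can be illustrated as follows. This is a sketch only: the function name `blend_candidates` is hypothetical, and interpreting "distance" as the temporal distance in frames between the unknown pixel and each known source pixel is an assumption.

```python
def blend_candidates(fwd_value, fwd_dist, bwd_value, bwd_dist):
    """Blend two candidate pixel values with weights inversely proportional
    to their distances (assumed: in frames) from the pixel being filled."""
    if fwd_dist <= 0 or bwd_dist <= 0:
        raise ValueError("distances must be positive")
    w_fwd, w_bwd = 1.0 / fwd_dist, 1.0 / bwd_dist
    total = w_fwd + w_bwd
    # Normalised linear combination, applied per colour channel.
    return tuple((w_fwd * a + w_bwd * b) / total
                 for a, b in zip(fwd_value, bwd_value))
```

For example, a candidate one frame away and another three frames away receive weights 0.75 and 0.25, so the closer source dominates the result.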
Inpaint Unseen Regions in Video. In some cases, the missing region cannot be filled by the known pixels tracked by optical flow (e.g., white regions in Fig. 2(b)(2)), which means that the model fails to connect certain masked regions to any pixels in other frames. An image inpainting technique is employed to complete such unseen regions. Figure 2(b)(2) illustrates the process of filling unseen regions. In practice, we pick a frame with unfilled regions in the video sequence and apply the image inpainting network to complete it. The inpainting result is then propagated to the entire video sequence based on the estimated optical flow. A single propagation may not fill all missing regions, so the image inpainting and propagation steps are applied iteratively until no more unfilled regions can be found. On average, for a video with 12% missing regions, there are usually 1% of unseen pixels and they can be filled after 1.1 iterations.
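The alternation between single-frame inpainting and propagation can be outlined as a short loop. This is a hedged sketch of the control flow only: `inpaint_frame` and `propagate` are hypothetical callables standing in for the image inpainting network and the flow-guided propagation, and picking the frame with the most unfilled pixels is an assumed heuristic, since the text only says that a frame with unfilled regions is picked.

```python
def fill_unseen_regions(frames_unfilled, inpaint_frame, propagate, max_iters=10):
    """Alternate single-frame inpainting and flow-guided propagation.

    frames_unfilled: dict mapping frame index -> set of unfilled pixel coords.
    inpaint_frame(idx, pixels): hypothetical callable that fills `pixels` in frame idx.
    propagate(idx, pixels, frames_unfilled): hypothetical callable returning
        {frame index: set of pixels filled by propagating from frame idx}.
    Returns the number of iterations used.
    """
    iters = 0
    while any(frames_unfilled.values()) and iters < max_iters:
        # Assumed heuristic: start from the frame with the most unfilled pixels.
        idx = max(frames_unfilled, key=lambda i: len(frames_unfilled[i]))
        pixels = set(frames_unfilled[idx])
        inpaint_frame(idx, pixels)
        frames_unfilled[idx] = set()
        # Spread the newly inpainted content to the other frames.
        for j, filled in propagate(idx, pixels, frames_unfilled).items():
            frames_unfilled[j] -= filled
        iters += 1
    return iters
```

The `max_iters` guard is an addition for safety in this sketch; it prevents an endless loop if propagation cannot reach some region.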
Inpainting Settings. Two common inpainting settings are considered in this paper. The first setting aims to remove the undesired foreground object, which has been explored in previous work [12, 22]. In this setting, a mask is given to outline the region of the foreground object. In the second setting, we want to fill up an arbitrary region in the video, which might contain either foreground or background. This setting corresponds to some real-world applications such as watermark removal and video restoration. To simulate this situation, following [15, 35], a square region in the center of the video frames is marked as the missing region to fill up. Unless otherwise indicated, the size of the square missing region is fixed in proportion to the size of the video frame. The non-foreground mask typically leads to inaccurate flow field estimation, which makes this setting more challenging.
Datasets. To demonstrate the effectiveness and generalization ability of the flow-guided video inpainting approach, we evaluate our method on the DAVIS and YouTube-VOS datasets. The DAVIS dataset contains 150 high-quality video sequences. A subset of 90 videos has all frames annotated with pixel-wise foreground object masks and is reserved for testing. The remaining 60 unlabeled videos are adopted for training. Although DAVIS was not originally proposed for the evaluation of video inpainting algorithms, it is adopted here because of its precise object mask annotations. YouTube-VOS consists of 4,453 videos, which are split into 3,471 for training, 474 for validation and 508 for testing. Since YouTube-VOS does not provide dense object mask annotations, we only use it to evaluate the performance of the models in the second inpainting setting.
(1) Setting 1: foreground object removal. To prepare the training set, we synthesize and overlay a mask of random shape onto each frame of a video. Random motion is introduced to simulate the actual object mask. Masked and unmasked frames form the training pairs. For testing, since the ground truth of the removed regions is not available, evaluations are conducted through a user study.
(2) Setting 2: fixed region inpainting. Each training frame is covered by a fixed square region at the center of the frame. Again, masked and unmasked frames form the training pairs. For testing, besides the user study, we also report the PSNR and SSIM following [20, 33] in this setting. PSNR measures the distortion of an image, while SSIM measures the similarity in structure between two images.
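For reference, PSNR follows the standard definition $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The snippet below is a minimal sketch of that definition on flat lists of pixel intensities; the function name `psnr` and the 8-bit default `max_value=255` are assumptions, and the evaluation code used for the reported numbers may differ in details such as colour handling.

```python
import math


def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio (dB) between two equally sized images
    given as flat sequences of pixel intensities."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher values indicate less distortion. SSIM is more involved and is usually taken from an existing image-processing library rather than reimplemented.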
We quantitatively and qualitatively compare our approach with other existing methods on the DAVIS and YouTube-VOS datasets. For YouTube-VOS, our model is trained on its training set. The data in the DAVIS dataset is insufficient for training a model from scratch. We thus use the model pretrained on YouTube-VOS and fine-tune it using the DAVIS training set. Performance is reported on the respective test sets.
Quantitative Results. We first make comparison with existing methods quantitatively on the second inpainting task that aims to fill up a fixed missing region. The results are summarized in Table 1.
|Method|YouTube-VOS PSNR|YouTube-VOS SSIM|DAVIS PSNR|DAVIS SSIM|Time (min.)|
|Newson et al.|23.92|0.37|24.72|0.43|270|
|Huang et al.|26.48|0.39|27.39|0.44|180|
Note on the time column: the running time is reported on the “CAMEL” video in the DAVIS dataset. Since Newson et al. have not reported the execution time in their paper, we test their execution time in a similar environment.
Our approach achieves the best performance on both datasets. As shown in Table 1, directly applying the image inpainting algorithm on each frame leads to inferior results. Compared with conventional video inpainting approaches [12, 22], our approach can better handle videos with complex motions. Meanwhile, our approach is significantly faster in runtime speed and is thus better suited for real-world applications.
User study. Evaluation metrics in terms of reconstruction errors are not perfect, as there are many reasonable solutions for the original video frames. Therefore, we perform a user study to quantify the inpainting quality of our approach and existing works [12, 35]. We use the models trained on the DAVIS dataset for this experiment. Specifically, we randomly choose 15 videos from the DAVIS testing set for each participant. The videos are then inpainted by three approaches (ours, Deepfill, and Huang et al.) under two different settings. To better display the details, the video is played at a low frame rate. For each video sample, participants are requested to rank the three inpainting results after the video is played.
We invited 30 participants for the user study. The result is summarized in Fig. 5, which is consistent with the quantitative result. Our approach significantly outperforms the other two baselines, while the image inpainting method performs the worst since it is not designed to maintain temporal consistency in its output. Figure 6 shows some examples of our inpainting results. We highly recommend watching the video demo at https://youtu.be/zqZjhFxxxus.
Qualitative Comparison. In Fig. 7, we compare our method with Huang et al.’s method in two different settings. From the first case, it is evident that our DFC-Net can better complete the flow. Thanks to the completed flow, the model can easily fill up the region with correct pixel value. In the more challenging case shown in the second example, our method is much more robust on inpainting the complex masked region such as the part of a woman, compared to the notable artifacts in Huang et al.’s result.
In this section, we conduct a series of ablation studies to analyze the effectiveness of each component in our flow-guided video inpainting approach. Unless otherwise indicated we employ the training set of YouTube-VOS for training. For better quantitative comparison, all performances are reported on the validation set of YouTube-VOS under the second inpainting setting, since we have the ground-truth of the removed regions under this setting.
Comparison with Image Inpainting Approach. Our flow-guided video inpainting approach significantly eases the task of video inpainting by using the synthesized flow fields as guidance, which transforms the video completion problem into a pixel propagation task. To demonstrate the effectiveness of this paradigm, we compare it with a direct image inpainting network applied to each individual frame. For a fair comparison, we adopt the Deepfill architecture but with multiple color frames as input, which is named ‘Deepfill+Multi-Frame’. The ‘Deepfill+Multi-Pass’ architecture then stacks three ‘Deepfill+Multi-Frame’ networks, like DFC-Net. Table 2 presents the inpainting results on both DAVIS and YouTube-VOS. Although the multi-frame input and the stacked architecture bring marginal improvements over Deepfill, the significant gap between ‘Deepfill+Multi-Frame’ and our method demonstrates that using the high-quality completed flow field as guidance can ease the task of video inpainting.
Effectiveness of Hard Flow Example Mining. As introduced in Sec. 3.2, most of the area of optical flow is smooth, which may result in degenerate models. Therefore, a hard flow example mining mechanism is proposed to mitigate the influence of the label bias in the problem of flow inpainting. In this experiment, we adopt the first DFC-S to examine the effectiveness of hard flow example mining.
|Flow completion EPE (smooth region)|Flow completion EPE (hard region)|Flow completion EPE (overall)|Video inpainting PSNR|Video inpainting SSIM|
Table 3 lists the flow completion accuracy under different mining settings, as well as the corresponding inpainting performance. The parameter $p$ represents the percentage of samples that are labeled as hard. We use the standard end-point-error (EPE) metric to evaluate our inpainted flow. For clear demonstration, all flow samples are divided into smooth and non-smooth sets according to their variance. Overall, the hard flow example mining mechanism improves the performance under all settings. When $p$ is smaller, only the hardest samples are selected, which increases the difficulty of training. However, if $p$ is larger, the model does not improve much over the baseline. The best results are obtained with intermediate values of $p$, and $p$ is fixed to such a value in our experiments.
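The end-point-error used here is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below follows that standard definition; the function name `end_point_error` and the optional `mask` argument (used to restrict the average to, say, the smooth or the hard region) are assumptions made for illustration.

```python
import math


def end_point_error(pred_flow, gt_flow, mask=None):
    """Mean Euclidean distance between predicted and ground-truth flow vectors.

    pred_flow, gt_flow: equally long sequences of (u, v) vectors.
    mask: optional sequence of 0/1 flags restricting the average to a region
          (this argument is an assumption of the sketch).
    """
    if mask is None:
        mask = [1] * len(pred_flow)
    dists = [math.hypot(pu - gu, pv - gv)
             for (pu, pv), (gu, gv), m in zip(pred_flow, gt_flow, mask) if m]
    if not dists:
        raise ValueError("no pixels selected")
    return sum(dists) / len(dists)
```

Evaluating the same prediction with different masks gives the per-region numbers, while passing no mask gives the overall value.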
Effectiveness of Stacked Architecture. Table 4 depicts the step-by-step refinement results of DFC-Net, including flows and the corresponding inpainting frames. To further demonstrate the effectiveness of stacked DFC-Net, Table 4 also includes two other baselines that are constructed as follows:
DFC-Single: DFC-Single is a single stage flow completion network that is similar to DFC-S. To ensure a fair comparison, DFC-Single adopts a deeper backbone, i.e. ResNet-101.
DFC-Net (w/o MS): The architecture of DFC-Net (w/o MS) is the same as DFC-Net. However, in each stage of this baseline model, the input’s scale does not change and the data is full resolution from the start to the end.
|Method|Flow completion EPE|Video inpainting PSNR|Video inpainting SSIM|
|DFC-Net (w/o MS)|0.95|27.02|0.40|
A closer inspection of Table 4 shows that the end-point-error is gradually reduced by the coarse-to-fine refinement. The result of DFC-Single is somewhat inferior to that of the second stage, which suggests the effectiveness of using the stacked architecture in this task. To further indicate the effectiveness of using multi-scale input in each stage, we compare our DFC-Net with DFC-Net (w/o MS). The performance gap verifies that the strategy of using multi-scale input in each stage improves the result of our model, since using large-scale input in the early stage typically makes training unstable.
Effectiveness of Flow-Guided Pixel Propagation. After obtaining the completed flow, all known pixels are first propagated bidirectionally to fill the missing regions based on the valid estimated flow. This step produces high-quality results and also reduces the size of missing regions that have to be handled in the subsequent step.
|Method|PSNR|SSIM|
|w/o pixel propagation|19.43|0.24|
|w/ pixel propagation|27.50|0.41|
As shown in Table 5, compared with a baseline approach that directly uses image inpainting and flow warping to inpaint unseen regions, this intermediate step greatly eases the task and improves the overall performance.
|Method|Flow completion EPE|PSNR|SSIM|
|Huang et al. w/o FlowNet2|–|27.39|0.44|
|Huang et al. w/ FlowNet2|1.02|27.73|0.45|
Ablation Study on Initial Flow. The flow estimation algorithm is important but not vital since it only affects the flow quality outside the missing regions. By contrast, the quality of the completed flow inside the missing regions is more crucial. We substitute the initial flow of Huang et al. with the flow estimated by FlowNet2 to ensure a fair comparison. Table 6 and Fig. 8 demonstrate the effectiveness of our method.
Failure Case. A failure case is shown in Fig. 9. Our method fails in this case mainly because the completed flow is inaccurate on the edge of the car, and the propagation process cannot amend that. In the future, we will use a learning-based propagation method to mitigate the influence of inaccuracies in the estimated flow. Other more contemporary flow estimation methods [13, 14, 31] will be investigated too.
We propose a novel deep flow-guided video inpainting approach, showing that high-quality flow completion can largely facilitate inpainting videos in complex scenes. A Deep Flow Completion network is designed to cope with arbitrary missing regions and complex motions while maintaining temporal consistency. In comparison to previous methods, our approach is significantly faster in runtime speed, while it does not require any assumption about the missing regions and the movements of the video contents. We show the effectiveness of our approach on both the DAVIS and YouTube-VOS datasets with state-of-the-art performance.
Acknowledgements. This work is supported by SenseTime Group Limited, the General Research Fund sponsored by the Research Grants Council of the Hong Kong SAR (CUHK 14241716, 14224316, 14209217), and Singapore MOE AcRF Tier 1 (M4012082.020).
Filling-in by joint interpolation of vector fields and gray levels. IEEE Transactions on Image Processing, 10(8):1200–1211, 2001.
European Conference on Computer Vision, pages 29–43. Springer, 2010.
IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages I–I. IEEE, 2001.