Demo for "Fast User-Guided Video Object Segmentation by Interaction-and-Propagation Networks", CVPR 2019.
We present a deep learning method for the interactive video object segmentation. Our method is built upon two core operations, interaction and propagation, and each operation is conducted by Convolutional Neural Networks. The two networks are connected both internally and externally so that the networks are trained jointly and interact with each other to solve the complex video object segmentation problem. We propose a new multi-round training scheme for the interactive video object segmentation so that the networks can learn how to understand the user's intention and update incorrect estimations during the training. At the testing time, our method produces high-quality results and also runs fast enough to work with users interactively. We evaluated the proposed method quantitatively on the interactive track benchmark at the DAVIS Challenge 2018. We outperformed other competing methods by a significant margin in both the speed and the accuracy. We also demonstrated that our method works well with real user interactions.
Video object segmentation is the task of separating a foreground object from a video sequence. It is an essential task in video editing, with applications ranging from consumer-level video editing to professional TV and movie post-production. This problem is often solved by either a fully automatic approach (i.e. unsupervised foreground object segmentation) or a semi-supervised approach (i.e. ground-truth object masks are given on a few frames [5, 28]). However, both solutions are limited in their ability to reflect a user’s intention or to refine incorrect estimations.
Interactive video segmentation can potentially resolve this issue by allowing user intervention given in a user-friendly form such as scribbles [37, 31, 2]. However, existing interactive methods require a lot of user interactions to obtain results with acceptable quality for video editing applications. In this paper, we aim to develop an interactive video object segmentation technique that can estimate accurate object masks in a video sequence with minimal user interactions.
Interactive video cutout methods usually follow the procedure of the rotoscoping [4, 20], where a user sequentially processes a video frame-by-frame. In this scenario, the user verifies and updates the object mask with multiple interactions at every frame. This rotoscoping-style interaction requires a lot of effort and is more suitable for professional uses that require high-quality results.
Recently, Caelles et al. introduced another workflow for video object cutout that can minimize the user’s effort. In this scenario, which we call round-based interaction, the user provides annotations on a selected frame and an algorithm computes the segmentation maps for all video frames in a batch process. To refine the results, user annotation and segmentation map computation are repeated until the user is satisfied with the results. This round-based interaction is useful for consumer-level applications and for rapid prototyping in professional usage, where efficiency is the main concern. One can control the quality of the segmentation according to the time budget, as more rounds of interaction provide more accurate results.
In this paper, we present a deep learning based method for interactive video object segmentation tailored to the round-based interaction scenario (Fig. 1). While several deep learning approaches for video object segmentation have been proposed [5, 28], they are usually too slow for the interactive scenario as they rely heavily on online learning. Even with a fast video segmentation algorithm, designing a deep neural network (DNN) and its training mechanism for the interactive segmentation scenario remains a challenge.
To solve this challenging problem, we propose the Interaction-and-Propagation Networks and an effective training method. Our framework consists of two deep CNNs, each of which is dedicated to the core operations interaction and propagation respectively. The interaction network takes the user annotation (e.g. scribbles) to segment the foreground object. The propagation network transfers the object mask computed in the source frame to other neighboring frames. These two networks are internally connected using our feature aggregation module and are also externally connected so that each of them takes the other’s output as its input.
The two networks are trained jointly to adapt to each other, which reduces unstable behaviors between the two operations. We also propose the concept of multi-round training, which is specifically designed to simulate a real testing scenario of the interactive video segmentation. In this training strategy, a number of user feedback cycles and the response of networks form a single training iteration (see Fig. 3). This new training scheme greatly improves the performance of our model.
Our framework is quantitatively evaluated on the interactive track benchmark at the DAVIS Challenge 2018 and achieves state-of-the-art performance by a large margin over competing methods. We also demonstrate the usefulness of our method with real interactive cutout use-cases. We will release the source code that contains our trained model and the graphical user interface.
We categorize the video object segmentation into three categories based on different types of user interactions.
In the unsupervised setting, there is no user interaction. The unsupervised approaches run automatically but they can only segment visually salient objects based on the appearance or the motion. For example, Jain et al. combine an appearance model with an optical flow model to segment generic objects in videos. Similarly, Tokmakov et al. use a motion estimation network with a recurrent neural network to segment moving foregrounds. The fundamental limitation of the unsupervised methods is that users have no means to select the object of interest.
In the semi-supervised setting, the ground-truth mask of an object in the first frame is provided. The goal is to propagate the object mask throughout the entire video sequence. Many recent approaches [5, 36, 24] employ online learning by fine-tuning deep network models at testing time in order to remember the appearance of the target object on the given object mask. Then the object segmentation is performed for each frame. Instead of employing online learning, Jampani et al. propagate the object mask by bilateral filtering. Oh et al. use Siamese two-stream networks and leverage synthetic training data. Although the semi-supervised methods do not have the limitation of the unsupervised methods, they require a fully annotated object mask in the initial frame, which can be expensive to acquire. Additionally, semi-supervised methods rely on extra information such as fully annotated masks or external tools to further improve the output quality.
In the interactive setting, users can provide various types of inputs (e.g. bounding box, scribbles, or masks) to select an object of interest in the beginning. Users can also provide more interactions to refine the segmentation results. The goal of this interactive approach is to achieve satisfactory segmentation results with a minimum number of user interactions. Many interactive methods [37, 31, 9, 2, 4, 20] have been proposed. Some [37, 31, 33] solve spatio-temporal graphs with hand-crafted energy terms. Some methods find the corresponding patches between a target frame and a reference frame, then utilize local classifiers [2, 44] or an existing patch-match algorithm. Others [1, 20] solve the segmentation task by tracking. Recently, [3, 6] proposed deep-learning based methods by adapting semi-supervised methods to the interactive scenario. Benard and Gygli use a deep interactive image segmentation method to select an object given initial strokes or clicks, and use a semi-supervised video object segmentation method to propagate the object mask. In contrast to such a simple combination of two separate methods, we carefully design the two networks to interact with each other and train them jointly using our new multi-round training scheme.
Recently, several methods have been introduced for integrating user interaction with deep neural networks for various interactive tasks. Xu et al. proposed to transform clicks or bounding boxes into Euclidean distance maps for the interactive image segmentation. Zhang et al. incorporated a user’s color selection for the image colorization. Sangkloy et al. and Isola et al. used sketches to help generate realistic natural images.
Different from the above interactive approaches that only consider an interaction given once onto an image, our model considers multiple user inputs possibly drawn onto different video frames. The sequence of multiple user interactions is aggregated by a specially designed recurrent block called the feature aggregation module. In addition, we use the segmentation results from previous rounds as an additional channel, in order to consider the unique characteristics of the interactive video segmentation.
Given user annotations on a video frame (e.g. scribbles drawn on the foreground and background pixels of an image), we aim for cutting out the target object in all frames of the given video. From the initial user input, we generate object masks of all frames solely based on the user annotation. If the user provides additional feedback annotations after reviewing the generated masks, our method refines the object masks based on both additional user annotations and the previous mask estimation results.
To this end, we define two basic operations for the task: interaction and propagation. Two deep CNNs dedicated for each operation are proposed as shown in Fig. 2 (a),(b). The interaction network generates the object mask (or refines the previous results) for the annotated frame according to the user inputs. The propagation network generates the object masks (or refines the previous results) by temporally propagating the object mask information both forward and backward starting from the frame with user annotation.
To prevent the error accumulation due to drifts and occlusions during the propagation, the propagation network refers to a reliable visual memory similar to [26, 41, 42]. While [26, 42] employ a Siamese network to access the reference frame directly, we modified the framework to make it more suitable for the interactive video object segmentation. Specifically, as the most reliable information is contained in the user annotated frames in the interactive scenario, we allow the propagation network to access the features of the interaction network. In addition, we propose a feature aggregation module that accumulates all the previous reference information encoded by the interaction network. This reference-guided propagation is effective, especially for the long-term propagation.
We refer to the series of operations consisting of both the user interactions on one frame and a number of consecutive propagation towards both ends as a round (see Fig. 3). Users are able to repeat several rounds of interactions to refine the segmentation results until they are satisfied with the results as shown in Fig. 1. Both networks operate on the results obtained from the previous round. We use the same networks for every round.
We have two networks, interaction and propagation, and both are constructed as an encoder-decoder structure that can effectively produce a sharp mask output. We apply ROI align before the encoder so that our networks attend to the region of interest (the area around the target object). We take ResNet50 (without the last global pooling and fully-connected layers) as the encoder network, and modify it to take additional input channels (e.g. scribbles and the previous masks) by implanting additional filters at the first convolution layer [28, 39]. The network weights are initialized from the ImageNet pre-trained model, except for the newly added filters, which are initialized randomly.
The decoder takes the output of the encoder and produces an object mask. To reconstruct a sharp mask by fully exploiting the information at different scales, the decoder additionally takes intermediate feature maps inside the encoder through skip connections. We modify the feature pyramid networks [21, 29] by adding residual blocks and use the result as the building block of our decoder, as shown in Fig. 2 (d),(e). The decoder estimates the object mask at a quarter of the scale of the input image. For the multi-object scenario where scribbles for each object are given, we first estimate a mask for each object and then merge the masks into the multi-object mask using a previously proposed soft aggregation.
The input to the interaction network consists of a frame, the object mask from the previous round (if available), and two binary user annotation maps for the positive and the negative regions respectively. The inputs are concatenated along the channel dimension to form a single input tensor. The object mask is represented as a probability map filled with values between 0 and 1. If no previous mask is available (e.g. at the first round), we feed a neutral mask filled with 0.5 for all pixels. The output of this network is a map of the probabilities of the target object at every pixel.
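The input assembly described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `build_interaction_input`, the channel order, and the assumption of a three-channel color frame in `(C, H, W)` layout are all hypothetical.

```python
import numpy as np

def build_interaction_input(frame, pos_scribble, neg_scribble, prev_mask=None):
    """Stack the frame, previous-round mask, and scribble maps into (C, H, W).

    frame        : (3, H, W) color frame (three channels are an assumption)
    pos_scribble : (H, W) binary map of positive scribbles
    neg_scribble : (H, W) binary map of negative scribbles
    prev_mask    : (H, W) probability map from the previous round, or None
    """
    _, h, w = frame.shape
    if prev_mask is None:
        # No earlier estimate (e.g. first round): neutral mask of 0.5 everywhere.
        prev_mask = np.full((h, w), 0.5, dtype=np.float32)
    channels = [
        frame.astype(np.float32),
        prev_mask[None].astype(np.float32),
        pos_scribble[None].astype(np.float32),
        neg_scribble[None].astype(np.float32),
    ]
    # Concatenate along the channel dimension (axis 0 in this layout).
    return np.concatenate(channels, axis=0)

frame = np.zeros((3, 4, 5), dtype=np.float32)
pos = np.zeros((4, 5), dtype=np.uint8)
neg = np.zeros((4, 5), dtype=np.uint8)
x = build_interaction_input(frame, pos, neg)
```

With a three-channel frame this yields six channels in total; the fourth channel holds the neutral mask when no previous estimate is passed.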
The input to the propagation network consists of a frame, the object mask obtained at the previous frame, and the object mask obtained at the previous round. As in the interaction network, the inputs are concatenated along the channel dimension to form a single input tensor. The two object masks are represented as probabilities, and the neutral mask is used if a mask is not available. Unlike the interaction network, the decoder of the propagation network additionally takes the reference feature map computed by our feature aggregation module. The reference feature map and the encoder output of the propagation network are concatenated along the channel dimension and fed into the decoder.
In the interactive video object segmentation, the system often takes multiple user annotations in different frames through multiple rounds. It is important to exploit all previous user inputs for good performance. To achieve this, we propose a feature aggregation module which is specially designed for accumulating information of the target object from all user interactions. We use the encoder output of the interaction network to generate reference feature maps. We update the feature maps recurrently when a new user interaction triggers the interaction network. We design this module to be able to select memorable features by self-attention. As shown in Fig. 2 (c), the module first performs a global average pooling on the spatial dimension of the feature maps to obtain compact feature vectors. The vectors are concatenated and fed into two fully-connected layers with a bottleneck. The outputs of the layers are two channel-wise weight vectors ($w_1$ and $w_2$) after reshaping and a softmax. We place the softmax layer to make sure that $w_1 + w_2 = 1$. The two feature maps are channel-wise weighted by $w_1$ and $w_2$, then merged by the summation: $A_r = w_1 \odot A_{r-1} + w_2 \odot R_r$. $A_r$ and $A_{r-1}$ are the aggregated reference feature maps at rounds $r$ and $r-1$ respectively, $R_r$ is the encoder output of the interaction network at round $r$, and $\odot$ is an element-wise multiplication on the channel dimension.
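The aggregation step can be illustrated with a small NumPy sketch. It follows the description above (global average pooling, two fully-connected layers with a bottleneck, a softmax across the two sources, channel-wise weighted sum), but the function name, the parameter names, the ReLU between the two layers, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def aggregate_features(prev_agg, new_feat, w_fc1, b_fc1, w_fc2, b_fc2):
    """Merge two (C, H, W) feature maps with channel-wise softmax weights.

    prev_agg : aggregated reference features from the previous round
    new_feat : encoder output of the interaction network at this round
    """
    c = prev_agg.shape[0]
    # Global average pooling over the spatial dimensions, then concatenation.
    pooled = np.concatenate([prev_agg.mean(axis=(1, 2)),
                             new_feat.mean(axis=(1, 2))])      # (2C,)
    # Two fully-connected layers with a bottleneck (ReLU is an assumption).
    hidden = np.maximum(w_fc1 @ pooled + b_fc1, 0.0)           # (B,)
    logits = (w_fc2 @ hidden + b_fc2).reshape(2, c)            # (2, C)
    # Softmax across the two sources so the weights sum to 1 per channel.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
    w_prev, w_new = weights
    merged = w_prev[:, None, None] * prev_agg + w_new[:, None, None] * new_feat
    return merged, weights

rng = np.random.default_rng(0)
C, B = 8, 4  # channel count and bottleneck width, chosen arbitrarily
prev_agg = rng.normal(size=(C, 3, 3))
new_feat = rng.normal(size=(C, 3, 3))
merged, weights = aggregate_features(
    prev_agg, new_feat,
    rng.normal(size=(B, 2 * C)), np.zeros(B),
    rng.normal(size=(2 * C, B)), np.zeros(2 * C),
)
```

Because the two weights for each channel are non-negative and sum to one, every merged value lies between the corresponding values of the two inputs.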
While fully convolutional networks for image segmentation can handle image inputs in any resolution, the performance heavily relies on the absolute scale of objects. For example, small objects are easily missed and objects larger than the receptive field need to be estimated by observing only a part of the objects. This issue can be addressed when the network knows where to look. In our case, we can reason about the region of interest (ROI) from the guidance (e.g. scribbles and masks).
To take advantage of the guidance, we first compute a tight box that contains all available guiding information (which includes user scribbles, the mask from the previous frame, and the mask from the previous round) and set the ROI to a box computed by doubling each side of the tight box. Then, the ROI area of all the inputs is bilinearly warped to a fixed size before we feed them into the encoders [17, 13]. Finally, the prediction made within the ROI is inversely warped and pasted back to the original location. The training losses become scale-invariant as they are computed in the ROI-aligned space, which lets us avoid complex balanced loss functions. Note that we set the ROI to the whole image at the first round and start to compute the ROI from the guidance at the second round.
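The ROI computation can be sketched as follows. This is a hedged illustration that covers only the box computation, not the bilinear warping. The function name `compute_roi`, the 0.5 threshold used to binarize probability masks, the reading of "doubling each side" as doubling height and width about the box center, and the clipping to the image bounds are all assumptions.

```python
import numpy as np

def compute_roi(guidance_maps, threshold=0.5):
    """Return (top, left, bottom, right) of the doubled tight box.

    guidance_maps : list of (H, W) arrays (scribble maps, probability masks)
    Falls back to the whole image when no guidance is active.
    """
    h, w = guidance_maps[0].shape
    active = np.zeros((h, w), dtype=bool)
    for g in guidance_maps:
        active |= np.asarray(g) > threshold
    if not active.any():
        return 0, 0, h, w
    ys, xs = np.nonzero(active)
    top, bottom = ys.min(), ys.max() + 1
    left, right = xs.min(), xs.max() + 1
    cy, cx = (top + bottom) / 2.0, (left + right) / 2.0
    bh, bw = bottom - top, right - left
    # Double height and width about the center, then clip to the image.
    return (
        int(max(0, np.floor(cy - bh))),
        int(max(0, np.floor(cx - bw))),
        int(min(h, np.ceil(cy + bh))),
        int(min(w, np.ceil(cx + bw))),
    )

mask = np.zeros((10, 10), dtype=np.float32)
mask[4:6, 4:6] = 1.0
roi = compute_roi([mask])
empty_roi = compute_roi([np.full((10, 10), 0.5)])
```

A neutral mask filled with 0.5 does not exceed the threshold, so it contributes nothing to the box and the whole image is used.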
For the best testing performance, we make our training loop close to the real testing scenario: a user interacts with our model multiple times while providing feedback in the forms of scribbles on multiple frames. We propose a new multi-round training scheme where a single training sample consists of multiple rounds of user interactions. At every round, our model is trained to refine the previous round’s results by understanding the user’s intention (interaction network) and temporally propagating the object mask (propagation network). Two networks are trained jointly by making an estimation using the previous estimation that can be inferred from the other network. Losses are computed at every intermediate prediction and the back-propagation is performed at every loss computation to update the parameters of the networks. At each round, user inputs are synthesized by simulating user behaviors. Fig. 3 shows an example of a single training iteration in our multi-round training scheme.
One challenge in training an interactive model is collecting user input data. For our scenario where a user provides scribbles as feedback, it is not feasible to collect large training data. Instead, we train our model with synthetically generated user interactions. In the first round, positive scribbles are sampled from the foreground region. In the following rounds, scribbles are synthesized within false negative and false positive areas where the areas are computed using the ground-truth mask. We sample positive scribbles from the false negative area and negative scribbles from the false positive area.
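The candidate areas for the synthetic scribbles can be computed directly from the current estimate and the ground truth. The sketch below is illustrative only; the function name and the 0.5 binarization threshold are assumptions.

```python
import numpy as np

def scribble_candidate_areas(pred_mask, gt_mask, threshold=0.5):
    """Return (false_negative, false_positive) boolean maps.

    pred_mask : (H, W) probability map estimated in the previous round
    gt_mask   : (H, W) binary ground-truth mask
    """
    pred = np.asarray(pred_mask) > threshold
    gt = np.asarray(gt_mask).astype(bool)
    false_negative = gt & ~pred   # missed object pixels: positive scribbles
    false_positive = ~gt & pred   # spurious object pixels: negative scribbles
    return false_negative, false_positive

gt = np.array([[1, 1, 0, 0]])
pred = np.array([[0.9, 0.2, 0.8, 0.1]])
fn_area, fp_area = scribble_candidate_areas(pred, gt)
```

Positive scribbles would then be drawn inside `fn_area` and negative scribbles inside `fp_area`.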
We use morphological skeletonization to automatically generate realistic scribbles. Given a candidate area to sample scribbles, we first remove small false estimations isolated from the main body by repeating a binary morphological opening operation. Then, we perform the skeletonization of the mask to get either positive or negative scribbles within the target area. We use a fast implementation of the thinning algorithm for the skeletonization.
A concern can be raised about the gap between the simulated and the real scribbles. We empirically validate that our model trained with simulated user scribbles works well with real user interactions as shown in our demo video.
It is widely known that training deep networks requires a large amount of data. However, video data that comes with object masks is limited due to the laborious human annotation process. We bypass this issue by employing two-stage training, where our networks are first pre-trained on synthetic image data and then fine-tuned on real video data. The idea of training a video segmentation network on image data has been proposed previously, and we follow an existing data simulation method. The method produces a set of reference and target frame pairs by applying random affine transforms and object composition. This pre-training is similar to training on videos, but temporal propagation is limited to a single step as there are no consecutive frames.
For the pre-training, we combine multiple image datasets that come with object masks (salient object detection [34, 7] and semantic segmentation [8, 12, 22]). After the pre-training, we use the video data from the training subsets of DAVIS, GyGo, and Youtube-VOS to train our networks.
To sample training data, we first resize video frames to be 480 pixels on the shorter edge while keeping the aspect ratio. Then, $N$ consecutive patches of size $400 \times 400$ are sampled from a random location of the video, where $N$ is the length of a training video clip. We randomly skip frames to simulate fast motion, and $N$ is gradually increased from 4 to 8 during training. We also augment all the training samples using random affine transforms. The number of rounds also grows from 1 to 3 during training. The loss is computed by the cross-entropy function and we use the Adam optimizer with a fixed learning rate of 1e-5. The training with video data takes about 5 days using a single NVIDIA GeForce 1080 Ti GPU.
One potential issue observed during our testing is that the propagated mask may be worse than the mask from the previous round. This happens especially when the destination is far from the user-selected frame. We conjecture that the long-term propagation may be unstable as our model is trained on short video clips. To address this issue, we modified our testing scheme in two ways: continuous updating and restricted propagation. In continuous updating, we update the previous round’s masks with the newly estimated masks by a weighted average. The weighting factor is inversely proportional to the propagated distance, and different weighting functions such as a linear and a Gaussian function were tested. We empirically found that the different weighting functions give similar performance, so we used a simple linear function in our experiments. For the restricted propagation, we propagate the object mask until we reach a frame in which user annotations were given in any previous round. The restricted propagation improves not only the accuracy, by preventing drift, but also the runtime speed, since it requires fewer propagation steps. This testing scheme is depicted in Fig. 4.
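Both testing-time modifications can be sketched in a few lines. The exact linear weighting function and the stopping convention are not given in the text, so the sketch below makes two labeled assumptions: the weight of the new mask decays linearly to zero at a hypothetical `max_distance`, and propagation stops just before a previously annotated frame rather than on it. All function names are illustrative.

```python
import numpy as np

def linear_weight(distance, max_distance):
    """Weight of the newly propagated mask; decays linearly with distance (assumed form)."""
    return max(0.0, 1.0 - distance / max_distance)

def continuous_update(prev_mask, new_mask, distance, max_distance):
    """Blend the previous round's mask with the newly propagated one."""
    w = linear_weight(distance, max_distance)
    return w * new_mask + (1.0 - w) * prev_mask

def restricted_range(current, previous, num_frames):
    """Inclusive frame range to propagate over from the frame `current`.

    previous : frame indices annotated in earlier rounds
    Propagation stops just before the nearest such frame on each side
    (whether that frame itself is updated is an assumption).
    """
    lower = max((p for p in previous if p < current), default=-1)
    upper = min((p for p in previous if p > current), default=num_frames)
    return lower + 1, upper - 1

blended = continuous_update(np.zeros((2, 2)), np.ones((2, 2)), distance=2, max_distance=4)
span = restricted_range(5, [2, 8], num_frames=10)
full_span = restricted_range(5, [], num_frames=10)
```

At distance zero the new mask fully replaces the old one, and with no earlier annotations the propagation covers the whole video.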
It is difficult to evaluate interactive video object segmentation methods quantitatively because the user input is directly related to the segmentation results, and vice versa. To tackle this evaluation problem, Caelles et al. introduced a robot agent service that simulates human interaction according to the intermediate results of an algorithm. We used their method to quantitatively evaluate our method.
To fairly compare our method against the state-of-the-art methods, we evaluated our model on the interactive track benchmark in the DAVIS Challenge 2018. In the challenge, each method can interact with a robot agent up to 8 times and is expected to compute masks within 30 seconds per object for each interaction. The performance of each method is evaluated using two metrics: area under the curve (AUC) and Jaccard at 60 seconds (J@60s). AUC is designed to measure the overall accuracy of the evaluation. J@60s measures the accuracy with a limited time budget (60 seconds). We summarize the evaluation results in Table 1. In both metrics, our method outperforms competing methods by a large margin.
| Method | AUC | J@60s |
|---|---|---|
| Najafi et al. | 0.549 | 0.395 |
| Lin et al. | 0.450 | 0.240 |
| Huang et al. | 0.328 | 0.335 |
| Rakelly et al. | 0.269 | 0.273 |
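For readers unfamiliar with the two metrics, the following sketch shows how a Jaccard-versus-time curve can be summarized. It is not the official benchmark code: the Jaccard index itself is standard, but the interpolation at the time budget, the normalization of the area by the time span, and the example numbers are assumptions made for illustration.

```python
import numpy as np

def jaccard(pred, gt):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as a perfect match (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

def jaccard_at(times, scores, budget):
    """Jaccard at a time budget, linearly interpolated along the curve."""
    return float(np.interp(budget, times, scores))

def normalized_auc(times, scores):
    """Trapezoidal area under the curve, divided by the covered time span."""
    t = np.asarray(times, dtype=float)
    s = np.asarray(scores, dtype=float)
    area = np.sum((t[1:] - t[:-1]) * (s[1:] + s[:-1]) / 2.0)
    return float(area / (t[-1] - t[0]))

# Made-up curve: accumulated time in seconds and mean Jaccard after each interaction.
times = [0, 30, 60, 90]
scores = [0.5, 0.6, 0.7, 0.8]
j60 = jaccard_at(times, scores, 60)
auc = normalized_auc(times, scores)
iou = jaccard([[1, 1, 0, 0]], [[1, 0, 1, 0]])
```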
Fig. 5 shows examples of our results obtained after 5 interactions with the automatic evaluation robot in the DAVIS Challenge 2018. Our method generates accurate segmentation results for various object types with complex motions even if there are multiple object instances. In the supplementary video, we present the recording of our real-time demo with real user interactions.
We conduct ablation studies using the DAVIS-2017 validation set to validate the effectiveness of our feature aggregation module and training scheme. Specifically, we compare our complete model with three variant models. No Reference is a model without the feature aggregation module. In the No Aggregation model, the feature aggregation module is replaced with a simple identity connection without feature aggregation. No Multi-Round is a model trained with the number of rounds set to one (i.e. at each training iteration, there is only one interaction from the user).
The Jaccard score of the ablation models with a growing number of interactions is shown in Fig. 6. The proposed multi-round training is crucial for achieving high accuracy, and our feature aggregation module further improves the performance by allowing the networks to exploit the reference information from all previous user inputs.
Another ablation study was conducted on the use of the training data. Our complete model is first pre-trained on static image data and then fine-tuned using video data. To validate the effect of the pre-training, we compare variant models that are trained only on the video data without the pre-training. Also, to further inspect the effect of the amount of video training data, we evaluate variants that are fine-tuned with only the 60 training videos of DAVIS-2017. Table 2 summarizes the results obtained by our variant models trained using different combinations of training datasets. Without pre-training, our performance drops significantly. The use of additional training video data further raises our performance.
While our method demonstrates satisfactory results on both the quantitative and the qualitative evaluations, we found a few failure cases, as shown in Fig. 7. We observed that rapid and complex object motions may cause our propagation network to drift as errors accumulate, as shown in Fig. 7 (top). We believe that a good future direction is to augment the algorithm with a reliable temporal propagation of object masks.
Another limitation we found is that our method may be less stable on very challenging scenes in the current round-based scenario. Our method mostly improves results with additional user interactions, but this is not guaranteed, as shown in Fig. 7 (bottom). Since we take only partial annotations from users at each round, the propagated masks from a newer round are sometimes less accurate, and there is no guarantee that we always keep the better results from different rounds. This is because there is no safeguard in the testing scenario. It can be resolved by asking the user to confirm that a mask is good, which prevents that mask from being updated.
While object segmentation in a video is one of the most basic tasks for video editing, it requires a lot of user effort and time with existing tools. To make it more accessible, we have presented a novel technique that generates object segmentation masks in video frames with minimum user inputs. Our method consists of interaction and propagation networks that share information with the feature aggregation module. We proposed the multi-round training scheme designed for interactive tasks and it plays a key role in achieving high accuracy. While our model is trained using synthetic user interactions, our method not only shows the best performance on the quantitative evaluation but also demonstrates good performance with real user interactions.
There are directions to further improve our system. The drifting during propagation is still a major challenge, although we greatly improved the performance with the aggregated reference features and the multi-round training. We believe that a better semantic understanding of the scene will help to resolve this problem by robustly linking the instances with appearance changes across video frames. Another important future work is supporting high-resolution videos. This is one of the common issues in many deep learning-based segmentation algorithms, and we hope that this can be addressed with a better network architecture or by combining our work with additional post-processing modules.
This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (2018-0-01858).
Jumpcut: non-successive mask transfer and interpolation for video cutout. ACM Transactions on Graphics (TOG), 34(6):195, 2015.