Unsupervised Generation of Optical Flow Datasets from Videos in the Wild

12/05/2018 · Hoang-An Le, et al. · University of Amsterdam

Dense optical flow ground truth of non-rigid motion for real-world images is not available, because such annotations cannot be produced intuitively by hand. Aiming at training optical flow deep networks, we present an unsupervised algorithm to generate optical flow ground truth from real-world videos. The algorithm extracts and matches objects of interest from pairs of images in videos to obtain initial constraints, and applies an as-rigid-as-possible deformation over the objects of interest to obtain dense flow fields. The correctness of the ground truth is enforced by warping the objects in the first frames using the flow fields. We apply the algorithm to the DAVIS dataset to obtain optical flow ground truth for non-rigid movement of real-world objects, using either ground-truth or predicted segmentation. We discuss several methods to increase the optical flow variation in the dataset. Extensive experimental results show that training on non-rigid real motion is beneficial compared to training on rigid synthetic data. Moreover, we show that our pipeline generates training data suitable to successfully train the FlowNet-S, PWC-Net, and LiteFlowNet deep networks.


1 Introduction

Optical flow is an important modality in computer vision. Representing object motion in image space, optical flow provides temporal consistency and dense correspondences, which serve as basic building blocks for several methods in object tracking [2, 44] and action recognition [15, 35, 37].

Despite being an ill-posed problem, optical flow estimation has improved significantly with the emergence of deep neural networks [11, 20, 21, 36]. One hindrance to supervised deep learning approaches is the demand for large-scale datasets with ground-truth flow annotations. Unlike detection or segmentation tasks, where ground truth is obtained from human annotations, optical flow ground truth is not intuitive for manual labelling, nor can it be directly obtained from sensors as depth images can. As an example, the KITTI datasets [16, 29] are constructed by registering point clouds from 10 consecutive frames, manually removing ambiguous points, and projecting them back into the image space of the next frame. While these are the largest optical flow datasets available with real-world images, they provide only 200 pairs of frames with ground-truth flow, and the annotations are not dense.

The data demands of deep networks can be lessened with data augmentation techniques or by using synthetic data. A well-known synthetic dataset for optical flow benchmarking is MPI-Sintel [9]. The images and annotations are rendered from a computer-generated imagery (CGI) movie named Sintel. It aims to simulate most of the challenges encountered in real-world scenarios, such as large motions, specular highlights, motion blur, defocus blur, and atmospheric effects. Although the dataset serves as a good basis for evaluating or fine-tuning deep learning models, the small number of frames (around 2,000) limits its usability for reliably training deep learning models from scratch.

On the other hand, large-scale synthetic datasets such as FlyingChairs [11] and FlyingThings3D [27] have been proposed for training purposes. They are based on computer-aided design (CAD) models with random similarity transformations (zooming, rotation, and translation) and moving background images to simulate the effect of moving objects and cameras. These datasets have proved useful for training optical flow models [26]. However, the same work shows that deep networks trained on highly repetitive or monotonous textures fail to generalize, and that it is important for the training data to consist of natural textures and to contain diversity. It is also shown that training with non-rigid movements is important, as motion patterns of real-life objects are not always rigid, nor do all objects deform in the same manner. Such movements are not included in the currently available datasets [11, 27] and have so far received little attention in the literature.

In this paper, we present and analyze a new approach to create optical flow training data from real videos. To the best of our knowledge, this is one of the first attempts to study the effect of optical flow datasets on deep networks (the previous one being the work of [26]), and the first to consider non-rigid deformation. The proposed method creates large optical flow training data consisting of natural textures and non-rigid movements. Creating data for this task is not as straightforward as for other vision tasks (e.g. recognition, detection, etc.). Therefore, we present a novel pipeline to generate such data. The pipeline is based on a rigid square matching algorithm [39] that deforms a 2D image using an as-rigid-as-possible principle [1]. The algorithm allows natural large deformations that conform to physical plausibility, and is employed in industrial cartoon image deformation [38]. We collect movement statistics from real-world videos by finding correspondences between objects of interest in consecutive frames. These movement statistics and their variations are used to deform the objects and create optical flow ground truth (see Figure 1).

The paper has the following contributions. First, this is the first approach to create optical flow training data from real videos at no cost (fully unsupervised). The pipeline requires no prior model and can thus be applied to any type of video to generate more optical flow ground truth for specific problem domains. Second, we provide the only large-scale optical flow dataset consisting of natural textures and non-rigid movements. The algorithms and datasets will be made publicly available upon acceptance. Third, we provide an extensive analysis of optical flow prediction performance with movement and texture variations in the training datasets.

Figure 1: Overview of the proposed pipeline to generate ground-truth optical flow from two frames: (1) objects of interest are segmented; (2) point-wise image matching is computed; (3) correspondences are used as constraints to deform the objects as-rigid-as-possible; (4) the resulting flow field is used to warp the object; and (5) the object is composited with a random background. The resulting pair of frames is used to train a deep neural network with the dense flow field as ground truth.

2 Related Work

In this section, we give an overview of the most related work. We first discuss optical flow estimation itself, then data augmentation techniques, and finally datasets available for training with optical flow ground truth.

Optical Flow  Optical flow is the apparent motion field resulting from projecting objects’ velocities onto the image plane. Since optical flow estimation is an ill-posed problem [5], several priors have been proposed to constrain it, starting with the pioneering methods of [18, 25]. These priors include the assumptions of brightness constancy, local smoothness, and Lambertian surface reflectance [5, 40]. To deal with spatial discontinuities and brightness inconstancy, Black et al. [6] propose discontinuity preservation in a robust statistics framework. Strategies based on coarse-to-fine warping [7, 8, 43] are employed to reduce the correspondence search space. EpicFlow [32] proposes an effective way to interpolate sparse matches to dense optical flow, which has been used as post-processing in [3, 4, 19, 42].

Recently, with the success of deep neural networks, optical flow estimation has shifted from energy-optimization-based to data-driven approaches. Dosovitskiy et al. [11] propose FlowNet, a convolutional neural network (CNN) trained in an end-to-end paradigm. The architecture is iteratively stacked by Ilg et al. [21] to reach state-of-the-art performance. Further improvements apply domain knowledge and classical principles such as spatial pyramids, warping, and cost volumes to push the state-of-the-art results while reducing model complexity: LiteFlowNet [20] has 30 times fewer parameters than FlowNet2, and PWC-Net [36] 17 times fewer. In this paper, we use our generated optical flow ground truth to train such architectures and show that it improves their results.

Data Augmentation  Data augmentation is a generic strategy to obtain more training data. It is useful when training deep neural networks to obtain models which generalize well at test time. Data augmentation is used for many computer vision tasks, including image classification [22], image segmentation [24], and depth estimation [13].

A generic and widely used technique for augmenting image data is to perform geometric and color augmentations. Examples are reflecting, cropping, scaling, and translating the image, and changing its color palette. Data augmentation for optical flow networks was first proposed by [11] and studied in detail by [26]. The results show that color and geometric augmentations are complementary and important for improving performance. Inspired by these data augmentation techniques, we propose methods to increase the diversity of the obtained optical flow ground truth, and we use data augmentation when training our neural networks.

Dataset | S/N | Scene types | #Frames
KITTI 2012 [16] | N | Rigid | 194
KITTI 2015 [29] | N | Rigid | 200
Sintel [9] | S | Non-rigid | 1,064
Monkaa [27] | S | Non-rigid | 8,591
Body flow [31] | S | Non-rigid | 100K
GTAV [33] | S | Non-rigid | 250K
Driving [27] | S | Rigid | 4,392
Virtual KITTI [14] | S | Rigid | 21K
FlyingChairs [11] | S | Rigid | 22K
FlyingThings3D [27] | S | Rigid | 23K
UvA-Nature [23] | S | Rigid | 300K
SceneNet RGBD [28] | S | Rigid | 5M
Table 1: Datasets with Optical Flow. Only the KITTI datasets provide natural scenes (N). Other datasets are generated synthetically (S), and most datasets have rigid motion only. Our method generates optical flow ground truth from real-world videos.

Datasets  Since the emergence of deep neural networks, optical flow estimation has shifted toward data-driven approaches. As a consequence, data plays a crucial role in the success of these methods. The challenge in obtaining ground truth is that determining optical flow is a non-intuitive task that cannot be performed manually by human annotators. Table 1 provides an overview of current datasets for optical flow. For a more extensive overview, including depth and disparity, we refer to Mayer et al. [26].

When comparing natural scenes (N) and synthetically generated scenes (S), we observe that most datasets provide optical flow for synthetically generated scenes. Only the KITTI datasets [16, 29] provide ground-truth optical flow for natural scenes, yet they are limited to around 200 frames of car-driving scenes, which consist mostly of rigid motion. When comparing rigid versus non-rigid optical flow, most datasets focus on rigid motion. No dataset exists with natural images and non-rigid motion optical flow ground truth. In this paper, the aim is to obtain optical flow for natural scenes with non-rigid motion.

3 Generating Optical Flow Pairs

Figure 1 presents the overview of our pipeline. We generate optical flow ground truth from a pair of frames $(I_1, I_2)$ taken from a video sequence. The object of interest is extracted from each frame using image segmentation, giving $(O_1, O_2)$. The correspondences generated by image matching are used as constraints for the deformation process, which results in the dense flow field $F$. This flow field is not a perfect ground truth for the original pair of objects $(O_1, O_2)$, mainly due to matching errors. Therefore, we use the obtained flow field to warp the object in the first frame, giving $\hat{O}_2$. This guarantees the correctness of the ground truth, as the generated flow field is now the underlying mechanism behind the image pair $(O_1, \hat{O}_2)$. The final images are obtained by combining the objects with a random background image. This yields a data point, consisting of a pair of images plus a densely annotated optical flow ground truth, which can be used to train a supervised optical flow deep network. This modular approach also allows us to study in detail the impact of different dataset factors on training optical flow deep networks. The following subsections describe each step in more detail.
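The listing below is a minimal Python sketch of this pipeline. The helper names (segment_object, match_points, deform_arap, forward_warp, composite_on_background) are ours and only mirror the five steps of Figure 1; they do not correspond to a released implementation.

```python
import numpy as np

def generate_flow_pair(frame1, frame2, background):
    """Produce one training sample: an image pair plus dense ground-truth flow."""
    # (1) segment the object of interest in both frames
    mask1, mask2 = segment_object(frame1), segment_object(frame2)
    obj1 = frame1 * mask1[..., None]
    obj2 = frame2 * mask2[..., None]

    # (2) quasi-dense point correspondences between the two objects
    matches = match_points(obj1, obj2)        # e.g. rows of (x1, y1, x2, y2)

    # (3) as-rigid-as-possible deformation constrained by the matches
    flow = deform_arap(mask1, matches)        # dense H x W x 2 flow field F

    # (4) warp the first object with F: F is now the exact ground truth
    #     for the pair (obj1, warped)
    warped, warped_mask = forward_warp(obj1, mask1, flow)

    # (5) paste both objects onto a (random) background image
    img1 = composite_on_background(obj1, mask1, background)
    img2 = composite_on_background(warped, warped_mask, background)
    return img1, img2, flow
```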

3.1 Image segmentation

To extract the object of interest from the video frames $I_1$ and $I_2$, image segmentation is used. We extract the main object from each frame, resulting in $O_1$ and $O_2$. As a proof of concept of our ground-truth flow algorithm, we use the image segmentation provided by the DAVIS dataset [30]. Experimentally, we will show that our pipeline also works when applying an off-the-shelf Mask R-CNN [17] segmentation algorithm to the video frames (Section 5.5); a minimal sketch of such a step is given below.
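The following sketch shows one possible realization of the segment_object helper above with the torchvision Mask R-CNN; the score and mask thresholds are illustrative choices, not values from the paper.

```python
import torch
import torchvision

# Off-the-shelf Mask R-CNN pre-trained on MS-COCO (torchvision implementation)
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()

def segment_object(frame_rgb):
    """frame_rgb: HxWx3 uint8 numpy array -> HxW boolean object mask."""
    x = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([x])[0]
    keep = pred["scores"] > 0.7              # keep confident detections only
    if keep.sum() == 0:                      # no object found in this frame
        return torch.zeros(frame_rgb.shape[:2], dtype=torch.bool).numpy()
    masks = pred["masks"][keep, 0] > 0.5     # N x H x W boolean instance masks
    return masks.any(dim=0).numpy()          # union of instances (cf. Section 5.1)
```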

3.2 Image matching

We use image matching to find point correspondences between a pair of objects to obtain real movement statistics. We use Deep Matching [41] because it is able to provide quasi-dense correspondences which are robust to non-rigid deformation and repetitive textures. Moreover, this method provides correspondences in weakly textured areas.

The resulting point correspondences are (still) sparse and not sufficient to train a deep neural network. Instead, we use the image matching to capture the statistics of real-world object movements, and use these statistics to compute a dense flow field.

Figure 2: Examples of image segments, the obtained point matches, the computed flow and the resulting warped images. Note the significant differences between the second frame and the warped image: the errors in the point matches yield a different dense flow field, which is better represented by the warped object. (Best viewed in color.)

3.3 Image deformation

We use image deformation to obtain a dense flow field $F$. The deformation process is guided by the point correspondences computed by the image matching and regularized by minimizing the scaling and shearing of local image regions. We follow the as-rigid-as-possible method of [12] (implemented using the Opt toolbox [10], see http://optlang.org), which supports large shape deformations while conforming to physical plausibility [38, 39].

The image deformation task is formulated as an energy optimization problem over a square lattice on the original object. Let $c_i$ be the centroid of lattice cell $i$ in the original object, and $p_i$ be the desired position (constraint) of the resulting deformation. The pairs $(c_i, p_i)$ are obtained from the image matching process. We now fit the deformed lattice, with cell centroids $\hat{c}_i$, to adhere to the constraints:

$$E_{con} = \sum_{i \in M} \left\lVert \hat{c}_i - p_i \right\rVert^2 \qquad (1)$$

where $M$ denotes the set of point-wise matches.

To regularize the deformation, the objective is to rigidly transform each lattice square while imposing the constraints from Eq. (1). Therefore, the goal is to find an optimal 2D rotation matrix $R_i$ and a translation vector $t_i$, defined over the vertices of lattice cell $i$:

$$E_i(R_i, t_i) = \sum_{j=1}^{4} w_j \left\lVert \left( R_i\, v^i_j + t_i \right) - \hat{v}^i_j \right\rVert^2 \qquad (2)$$

where $v^i_j$ are the vertices of cell $i$ with centroid $c_i$. Similarly, $\hat{v}^i_j$ is the $j$-th vertex of the deformed cell with centroid $\hat{c}_i$, and $w_j$ is a weight per vertex, which is set to $1$ for simplicity for all vertices in the lattice [12]. The centroids are defined as:

$$c_i = \frac{\sum_j w_j\, v^i_j}{\sum_j w_j}, \qquad \hat{c}_i = \frac{\sum_j w_j\, \hat{v}^i_j}{\sum_j w_j} \qquad (3)$$

Wang et al. [39] solve for the optimal translation vector by setting the partial derivatives with respect to $t_i$ in Equation (2) to zero, yielding

$$t_i = \hat{c}_i - R_i\, c_i \qquad (4)$$

Substituting this back into Eq. (2) reduces the regularization term to finding the rotation matrices $R_i$, yielding:

$$E_{reg} = \sum_i \sum_{j=1}^{4} w_j \left\lVert \left( \hat{v}^i_j - \hat{c}_i \right) - R_i \left( v^i_j - c_i \right) \right\rVert^2 \qquad (5)$$

The total energy is the weighted sum of the two terms:

$$E = \lambda_{con} E_{con} + \lambda_{reg} E_{reg} \qquad (6)$$

where we use the default trade-off between the data fit ($\lambda_{con}$) and the regularization ($\lambda_{reg}$), following [12].
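As an illustration of Eqs. (2)-(4), the sketch below performs the closed-form part of this fit for a single lattice cell: the optimal rotation $R_i$ is a 2D Procrustes (Kabsch) fit between the original and deformed vertices, and the translation then follows from Eq. (4). This is only a numerical illustration; the full energy of Eq. (6) is minimized jointly over all cells (the authors use the Opt toolbox [10]).

```python
import numpy as np

def fit_cell_rigid_transform(v, v_hat, w=None):
    """v, v_hat: 4x2 arrays of original / deformed vertices of one lattice cell."""
    w = np.ones(len(v)) if w is None else np.asarray(w, dtype=float)
    c = (w[:, None] * v).sum(0) / w.sum()            # centroid c_i, Eq. (3)
    c_hat = (w[:, None] * v_hat).sum(0) / w.sum()    # centroid of the deformed cell
    # weighted covariance between the centred vertex sets
    H = (w[:, None, None] * (v - c)[:, :, None] * (v_hat - c_hat)[:, None, :]).sum(0)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T                                   # best-fitting 2D rotation
    if np.linalg.det(R) < 0:                         # rule out reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = c_hat - R @ c                                # translation, Eq. (4)
    return R, t
```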

From the deformed lattice, we compute the dense flow field $F$. Due to errors in the image matching and the regularization in the deformation process, the obtained flow field may not exactly correspond to the motion between the segmented objects, see Figure 2. Hence, it is not suitable as ground truth between the two frames.

3.4 Image warping

The goal of the obtained flow field is to deform the first object to be close to the second object under physical constraints. To use the obtained flow field as ground truth, image warping is applied: the flow field $F$ is used to warp the object of the first frame $O_1$, resulting in $\hat{O}_2$, see Figure 2. This yields a true correspondence for the pair $(O_1, \hat{O}_2)$, with $F$ as its exact flow field.
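A minimal nearest-neighbour sketch of this warping step is shown below: every object pixel of the first frame is splatted to its flow target. This is an illustrative simplification; a practical implementation would also handle sub-pixel targets, hole filling, and collisions.

```python
import numpy as np

def forward_warp(obj1, mask1, flow):
    """obj1: HxWx3, mask1: HxW bool, flow: HxWx2 with (dx, dy) in pixels."""
    h, w = mask1.shape
    warped = np.zeros_like(obj1)
    warped_mask = np.zeros_like(mask1)
    ys, xs = np.nonzero(mask1)
    xt = np.rint(xs + flow[ys, xs, 0]).astype(int)    # target x of each pixel
    yt = np.rint(ys + flow[ys, xs, 1]).astype(int)    # target y of each pixel
    ok = (xt >= 0) & (xt < w) & (yt >= 0) & (yt < h)  # drop out-of-frame targets
    warped[yt[ok], xt[ok]] = obj1[ys[ok], xs[ok]]
    warped_mask[yt[ok], xt[ok]] = True
    return warped, warped_mask
```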

3.5 Background generation

In the final step of the pipeline, the objects $O_1$ and $\hat{O}_2$ are projected onto a background to obtain full frames for training. To increase the variance in the training data, we not only project the objects back onto the original background, but also use a set of 16K random images obtained from Flickr as backgrounds.
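A sketch of the compositing step, assuming the background image has already been resized to the frame resolution:

```python
import numpy as np

def composite_on_background(obj, mask, background):
    """obj: HxWx3 object image, mask: HxW bool, background: HxWx3."""
    out = background.copy()
    out[mask] = obj[mask]        # paste the object pixels over the background
    return out
```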

4 Optical Flow Variation

Ideally, training data for optical flow neural networks consists of a wide variety of examples, including different types of textures, motions, and displacements. The evaluation by Mayer et al. [26] shows that for synthetic optical flow, deep networks learn a better model when (a) the texture is varied from easy to difficult over the course of training, and (b) the displacement statistics of the train set and test set match. Therefore, we explore different strategies to increase the variation in texture and displacement.

4.1 Displacement Variation

One way to control the variation in displacement is to augment the matching results with an (arbitrary) scaling or rotation operation. However, this would introduce artifacts in the image appearance and the warping step. Therefore, we focus on three different variations.

Frame distance  To increase the variation in the flow displacement, we increase the distance between the pair of frames used for generating the optical flow. Instead of using a pair of subsequent frames $I_t$ and $I_{t+1}$, we use a pair of frames $I_t$ and $I_{t+k}$ with $k > 1$. Larger frame distances usually imply a larger object displacement, which increases the variation. However, larger frame distances come at the cost of a lower matching quality between the objects, see Figure 3.

Figure 3: From left to right: matching points in the first frames, and those at 1, 2, 3, 4, 5, 8, 12 frame distance. The matching errors increase when the frame distance increases, mostly due to extreme changes in object pose and camera viewpoint, in this case the confusion between the left and right legs of the woman.

Gaussian movement noise  Another way to increase the variation in flow displacements is by adding small Gaussian noise to the constraints obtained by the image matching process. In particular, we sample at random one-third of the matching pairs and add a small Gaussian noise with 3-pixel mean and 3-pixel variance to their constraints, i.e. $p_i \leftarrow p_i + \epsilon$ with $\epsilon \sim \mathcal{N}(3, 3)$, before the image deformation process.
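A small sketch of this perturbation, assuming the constraint targets $p_i$ are stored as an N x 2 array; interpreting the "3-pixel variance" as a standard deviation of sqrt(3) pixels is our reading.

```python
import numpy as np

def jitter_matches(targets, rng=None):
    """targets: Nx2 array of constraint positions p_i (in pixels)."""
    rng = rng or np.random.default_rng()
    targets = targets.astype(float).copy()
    idx = rng.choice(len(targets), size=len(targets) // 3, replace=False)
    noise = rng.normal(loc=3.0, scale=np.sqrt(3.0), size=(len(idx), 2))
    targets[idx] += noise        # perturb a random third of the constraints
    return targets
```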

Background flow  Random background images are used to increase the appearance and flow variation. We follow [11] and apply a (random) affine transformation to the full background.
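A sketch of such a background transform with OpenCV; the rotation, scale, and translation ranges are illustrative choices, not the values used in the paper.

```python
import cv2
import numpy as np

def random_affine_background(background, rng=None):
    """Apply a small random rotation/scale/translation to the background image."""
    rng = rng or np.random.default_rng()
    h, w = background.shape[:2]
    angle = rng.uniform(-5, 5)                   # degrees (illustrative range)
    scale = rng.uniform(0.95, 1.05)
    tx, ty = rng.uniform(-10, 10, size=2)        # pixels
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += (tx, ty)
    warped = cv2.warpAffine(background, M, (w, h), borderMode=cv2.BORDER_REFLECT)
    return warped, M                             # M also defines the background flow
```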

4.2 Appearance

To increase the variation in the appearance of the objects, we re-texture the object after the deformation phase. The flow field is thus based on the original texture, while the object pair $O_1$ and $\hat{O}_2$ is re-textured.

Figure 4: Variation in appearance by re-texturing of objects. From left to right: original texture (d1), repetitive-pattern synthetic images (RSYN), natural image texture (t1), repetitive-pattern natural images (RNAT), Sintel-val texture, FlyingChairs val texture. All natural images are downloaded from Flickr.

This allows us to control the appearance variation of the training data. To do so, we use a group of synthetic images with repetitive patterns and a group of natural scenery images, see Figure 4.

5 Experiments

5.1 Experimental Setup

Datasets  We generate optical flow ground truth with our proposed method from the DAVIS dataset [30], which provides video segmentation annotations. The DAVIS dataset contains 10K frames from real videos with 384 segmented objects. We use a single object per pair of frames by taking the union of the segmented objects. We will make the source code for our data generation pipeline, and the obtained optical flow ground truth for the DAVIS dataset, available upon acceptance.

We compare performance against methods trained on the synthetic FlyingChairs dataset [11], which contains 22K image pairs with corresponding flow fields of chairs rendered in front of random backgrounds. FlyingChairs has been used extensively to train deep networks.

For evaluation, we use the Sintel [9] dataset. Sintel contains large displacement non-rigid optical flow obtained from the open source 3D film Sintel. We use a subset of 410 image pairs from the train set for evaluating our methods. We also evaluate on the HumanFlow [31] dataset containing non-rigid motion of human bodies. For evaluation, we use the provided validation split, consisting of 530 image pairs from the train set. The performance of the different methods is evaluated using the average end-point-error (EPE).
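For reference, the average EPE is the per-pixel Euclidean distance between the predicted and ground-truth flow vectors, averaged over the (valid) pixels:

```python
import numpy as np

def average_epe(flow_pred, flow_gt, valid=None):
    """flow_pred, flow_gt: HxWx2 arrays; valid: optional HxW boolean mask."""
    err = np.linalg.norm(flow_pred - flow_gt, axis=-1)  # per-pixel end-point error
    return err[valid].mean() if valid is not None else err.mean()
```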

Network Architectures  We train different deep network architectures with our optical flow ground truth data. For most of the experiments, FlowNet-S [11] is used. FlowNet-S is relatively fast to train and hence suited to explore the influence of different choices in generating the optical flow dataset. FlowNet-S is trained using the long learning schedule from the authors [21, 26]: the Adam optimizer with learning rate 0.0001, halved after 200K iterations. We validate the number of epochs on the Sintel validation set. We also train two more recent optical flow deep networks: PWC-Net [36] and LiteFlowNet [20]. For these networks, we use the standard settings provided by the authors.
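A PyTorch-style sketch of this FlowNet-S schedule is given below; interpreting "halved after 200K iterations" as a step decay every 200K iterations is our assumption, and this is not the authors' training code.

```python
import torch

def make_optimizer(model):
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    # halve the learning rate every 200K iterations; call sched.step() once per iteration
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200_000, gamma=0.5)
    return opt, sched
```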

5.2 Importance of warping objects

In the first experiment, we compare the optical flow performance based on several stages of our pipeline. We train a FlowNet-S network using the obtained dense flow field $F$ with the original objects $O_1$ and $O_2$, and compare it to a FlowNet-S trained on the ground-truth data using the warped object $\hat{O}_2$.

In Table 2 the results are shown. They show that learning is beneficial for estimating optical flow, and that using our full pipeline, including the warped objects, is better than training with only the dense flow field and the original objects.

Method | Training data | Sintel-val | HumanFlow
Non-learning | Zero flow | 14.65 | 0.73
FlowNet-S | FlyingChairs | 5.76 | 0.63
FlowNet-S | Original object | 6.06 | 0.57
FlowNet-S | Warped object | 5.50 | 0.44
Table 2: Performance comparison of FNS when trained on the original objects versus warped objects in combination with the dense flow field. We compare against zero-flow and training FNS on FlyingChairs. It can be derived that learning non-rigid flow, using warped images, works best.

5.3 Flow variations

In this section, we study the effect of appearance and displacement variations in training optical flow.

Frame distance  In this experiment, we analyze the influence of augmenting the dataset with image pairs from larger frame distances, i.e. using pairs $(I_t, I_{t+k})$ with $k = 1, \dots, d$ for increasing $d$. The results are shown in Table 3. They show that increasing the frame distance is (in general) beneficial for training a better model. Another observation is the effect of diminishing returns: adding frames at the largest distances increases performance only marginally on Sintel-val and even slightly decreases the performance on HumanFlow.

Frame distances | Sintel-val | HumanFlow
d1 | 5.50 | 0.44
d1-2 | 5.37 | 0.49
d1-3 | 5.17 | 0.47
d1-4 | 5.10 | 0.45
d1-5 | 5.06 | 0.46
d1-8 | 5.03 | 0.49
d1-12 | 5.06 | 0.44
Table 3: Performance when the training data is increased by larger frame distances (d). Adding larger frame distances is beneficial up to a point; including even larger distances only slightly increases or even slightly decreases the performance.

Texture frames  In this section, we extend the appearance variation of the training data by texture replacement, denoted by t. After obtaining the flow field, the original texture is replaced by a random texture downloaded from Flickr. For t1, the data generated by d1 is used, so although $O_1$ and $\hat{O}_2$ are re-textured, d1 and t1 have exactly the same optical flow displacements. Using re-textured objects prevents the network from relying on semantic (class-specific) information, which could be beneficial for learning a generic optical flow network.

In the first experiment, we evaluate the performance over different evaluation sets, each having different texture properties. We train FlowNet-S using the training data of d1, d1-4, t1-4, and d1-4+t1-4, and compare the performance to training on FlyingChairs (FC). We evaluate on re-textured versions of d1 and d5 with different textures, see Figure 4. While the d1 validation set shares the same displacements as the training data, d5 contains larger displacements not seen during training.

The results are shown in Table 4. From this evaluation, it can be derived that increasing the frame distances alone does not make the networks invariant to texture variations. Therefore, we conclude that training with texture variations is important to learn class-agnostic optical flow.

Training set | FCv (d1) | SINv (d1) | RSYN (d1) | RNAT (d1) | FCv (d5) | SINv (d5) | RSYN (d5) | RNAT (d5)
FC | 3.28 | 3.13 | 3.47 | 3.38 | 6.67 | 6.28 | 7.29 | 6.92
d1 | 2.37 | 2.24 | 2.52 | 2.45 | 4.49 | 4.20 | 4.82 | 4.62
d1-4 | 2.35 | 2.22 | 2.55 | 2.42 | 4.36 | 4.02 | 4.75 | 4.42
t1-4 | 2.21 | 2.10 | 2.43 | 2.26 | 4.30 | 3.99 | 4.60 | 4.32
d1-4+t1-4 | 2.17 | 2.05 | 2.42 | 2.23 | 4.20 | 3.86 | 4.61 | 4.24
Table 4: Performance for different texture types. The improvement of t1-4 and d1-4+t1-4 over d1 is higher than that of d1-4, even though t1-4 shares the same displacement distribution as d1-4, indicating the benefit of training on diversified texture sets.

In the second experiment, we compare different combinations of frame distances and texture variations and evaluate on Sintel-val and HumanFlow. In Table 5 the results are shown. It can be concluded that training data with texture variation is important for learning a FlowNet-S model which generalizes well to different evaluation data.

Training set | Sintel-val | HumanFlow
d1-4 | 5.10 | 0.45
t1-4 | 4.96 | 0.38
d1-4 + t1 | 5.04 | 0.44
d1-4 + t1-4 | 4.98 | 0.48
Table 5: Comparison of different combinations of frame distances (d) and texture variations (t). Texture variation increases performance, where t1-4 performs best especially for HumanFlow.

Gaussian Movement Noise  In this experiment, we evaluate the influence of adding Gaussian movement noise. We denote by n1 the data obtained using d1 as a starting point, but with noise added to the constraints of the image deformation process.

Initial results show that adding noise (n1) to d1-4 and d1-4 + t1 does not improve the performance on Sintel-val (5.10 vs. 5.10 and 5.04 vs. 5.03, respectively). Hence, adding Gaussian movement noise is not helpful for training optical flow networks.

Background  In this experiment, we increase the optical flow variation by adding background flow. We denote by b1-5 the data generated by d1-5, but with a randomly moving background (obtained from an affine transformation).

Initial experiments on the Sintel-val dataset show that when adding background flow, the performance of d1-4 improves from 5.10 (d1-4) to 5.06 (d1-4+b1-4), while the performance of d1-8 decreases from 5.03 to 5.10 (d1-8+b1-8). These results indicate that FlowNet-S does not consistently improve when this form of camera motion is added to the training data. This could be due to the simplicity of the motion model, since the scene depth is ignored for the background flow.

Displacement Statistics  In the last experiment, we analyze the flow distributions of the different training sets and compare them to the FlyingChairs train set and the Sintel-val set. The results are shown in Figure 5 (left). It can be observed that by adding frame distance, the data distribution of our optical flow dataset becomes more similar to that of the Sintel-val dataset. This is an indication that learning could help. It can also be derived that the FlyingChairs dataset has significantly more pixels at all levels of displacement. This indicates that we should extend our dataset even further.

In Figure 5 (right), we illustrate the (average) end-point error as a function of the ground-truth pixel displacement. It is clear that training on our dataset outperforms training on the FlyingChairs dataset for displacements up to 125 pixels, which also corresponds to the drop in the displacement distribution shown on the left.

Figure 5: Flow displacement distributions of the training set (top, note the y-axis is in log scale) and the end-point-error as function of ground truth pixel displacement (bottom). FlyingChairs has significantly more pixels in both small and large displacement, yet our EPE is lower for displacements up to 125 pixels.
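The displacement statistics of Figure 5 can be reproduced for any of the training sets with a histogram of ground-truth flow magnitudes on a logarithmic count axis, as sketched below (bin edges are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_displacement_histogram(flows, label):
    """flows: list of HxWx2 ground-truth flow fields for one dataset."""
    mags = np.concatenate([np.linalg.norm(f, axis=-1).ravel() for f in flows])
    plt.hist(mags, bins=np.arange(0, 200, 5), histtype="step", label=label)
    plt.yscale("log")
    plt.xlabel("displacement (px)")
    plt.ylabel("number of pixels")
    plt.legend()
```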

5.4 Comparison to state-of-the-art

We compare state-of-the-art optical flow networks, FlowNet-S, PWC-Net [36], and LiteFlowNet [20], trained on the datasets obtained by applying our proposed method to DAVIS, against the same networks trained on FlyingChairs.

The results are shown in Table 6. For HumanFlow, our proposed method reaches or improves over the networks trained on FlyingChairs. On Sintel-val, our proposed method does not reach the same performance, yet including dataset variations improves performance to get close to the FlyingChairs baseline.

Since the Sintel movie consists mostly of static scenes with moving characters, we also compare the results on the non-rigid (NR) and rigid (R) movements in Sintel-val separately. For the non-rigid motion, we manually mask out the images for evaluation using the ground-truth segmentation provided by [9]. The results show that our proposed method for obtaining non-rigid flow improves over training on FlyingChairs for non-rigid motion.

Network | Training set | Sintel-val (all) | Sintel-val (NR) | Sintel-val (R) | HumanFlow
FNS | FC | 5.76 | 16.47 | 3.23 | 0.63
FNS | d1 | 5.50 | 16.11 | 2.97 | 0.44
FNS | t1-4 | 4.96 | 15.04 | 2.57 | 0.38
PWC | FC | 4.01 | 13.52 | 2.00 | 0.30
PWC | d1 | 5.22 | 17.20 | 2.53 | 0.34
PWC | d1-5 | 4.16 | 13.31 | 2.08 | 0.30
LFN | FC | 4.38 | 14.90 | 2.15 | 0.30
LFN | d1 | 5.05 | 16.50 | 2.49 | 0.28
LFN | d1-8 | 4.58 | 14.21 | 2.44 | 0.28
Table 6: Comparison of FlowNet-S (FNS), PWCNet (PWC), and LiteFlowNet (LFN) trained on FlyingChairs and on our generated non-rigid optical flow datasets. Training deep networks on ground truth generated by our pipeline results in competitive models.
Figure 6: Example of DAVIS ground-truth segmentation (top) and Mask R-CNN segmentation (bottom). Despite containing uncertainties and errors, the segments from Mask R-CNN cover larger image regions and vary in size and shape. (Best viewed in color.)

5.5 Segmentation using Mask R-CNN

In all experiments so far, the provided segmentation annotations of the DAVIS dataset are used. In this experiment, segments are determined automatically using the off-the-shelf Mask R-CNN method.

Mask R-CNN [17] is trained on the class labels from MS-COCO and predicts multiple objects in each DAVIS video frame. Due to uncertainties in the inference process, the Mask R-CNN segments may not cover whole objects or contain only single objects, as the manual annotations in DAVIS do, but can appear in any shape and cover various regions of the image. This has little effect on the performance of deep matching, given that the pair of frames is within a small distance. In this experiment, we use the results from consecutive pairs of images, leaving larger frame distances for future study.

The results are shown in Table 7. In general, Mask R-CNN segments are larger and have more variation in shape and size than those of the original DAVIS annotations (a mean segment size of 73K pixels compared to 55K pixels), giving Mask R-CNN more information. Therefore it performs on a par with training on the large frame distance set d1-8, for both FlowNet-S and LiteFlowNet, and even outperforms it on HumanFlow. We conclude that using Mask R-CNN is an alternative to using oracle segmentation, which enables the use of any video for training optical flow deep networks.

Network | Training set | O/M | Sintel-val (all) | Sintel-val (NR) | Sintel-val (R) | HumanFlow
FNS | FC | – | 5.76 | 16.47 | 3.23 | 0.63
FNS | d1 | O | 5.50 | 16.11 | 2.97 | 0.44
FNS | d1-5 | O | 5.06 | 14.28 | 2.98 | 0.46
FNS | d1 | M | 5.02 | 15.07 | 2.72 | 0.39
LFN | FC | – | 4.38 | 14.90 | 2.15 | 0.30
LFN | d1 | O | 5.05 | 16.50 | 2.49 | 0.28
LFN | d1-8 | O | 4.58 | 14.21 | 2.44 | 0.28
LFN | d1 | M | 4.60 | 14.33 | 2.48 | 0.26
Table 7: Performance of FNS and LFN when trained on data from our pipeline using Mask R-CNN (M) for object segmentation instead of oracle segmentation (O). Generating optical flow using Mask R-CNN is an alternative to using segment annotations, which allows the use of a more diverse set of videos in the wild.
Figure 7: Qualitative results on QUVA repetition dataset [34] of LiteFlowNet that is trained on FlyingChairs (middle row) and on the d1-8 set obtained from DAVIS using our unsupervised optical flow generation pipeline (bottom row). LFN trained using our pipeline can capture the non-rigid movements of objects in the scenes with better details and delineation. (Best viewed in color.)

5.6 Performance on real world images

In this final experiment, we qualitatively show optical flow predictions from LiteFlowNet on real-world images. In Figure 7, we show the predicted optical flow of LiteFlowNet trained on FlyingChairs and on our d1-8 set, for images taken from the QUVA repetition dataset [34]. The dataset contains 100 video sequences of repetitive activities in real life with little camera motion, and hence contains mostly non-rigid movements. The model trained with our non-rigid flow set captures better delineation and object details, especially for the non-rigid movements of human body parts. This emphasizes the need for real textures and non-rigid optical flow in the training data for optical flow deep networks.

6 Conclusion

In this paper, we introduce an unsupervised pipeline to generate a densely annotated optical flow dataset from videos to train supervised deep networks for optical flow.

Extensive experimental results show that optical flow ground truth generated from the DAVIS videos, with non-rigid real movements, results in adequate optical flow prediction networks. The pipeline can work either with provided video object segmentation or by running an off-the-shelf object segmentation algorithm such as Mask R-CNN. The latter allows us to study, in future work, the effect of using more training data with more diverse non-rigid movements, by applying our pipeline to a larger set of source videos.

Acknowledgements: This project was funded by the EU Horizon 2020 program No. 688007 (TrimBot2020). We would like to thank Thành V. Lê for his support in designing the figures.

References

  • [1] M. Alexa, D. Cohen-Or, and D. Levin. As-rigid-as-possible Shape Interpolation. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’00, pages 157–164, New York, NY, USA, 2000. ACM Press/Addison-Wesley Publishing Co.
  • [2] S. Aslani and H. Mahdavi-Nasab. Optical Flow Based Moving Object Detection and Tracking for Traffic Surveillance. International Journal of Electrical, Computer, Energetic, Electronic and Communication Engineering, pages 1252–1256, 2013.
  • [3] M. Bai, W. Luo, K. Kundu, and R. Urtasun. Exploiting semantic information and deep matching for optical flow. In Lecture Notes in Computer Science, volume 9910 LNCS, pages 154–170. Springer, 2016.
  • [4] C. Bailer, K. Varanasi, and D. Stricker. CNN-based patch matching for optical flow with thresholded hinge embedding loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, page 7, 2017.
  • [5] S. S. Beauchemin and J. L. Barron. The Computation of Optical Flow. ACM Comput. Surv., 27(3):433–466, sep 1995.
  • [6] M. Black and P. Anandan. The robust estimation of multiple motions: Parametric and piecewise-smooth flow fields. Computer Vision and Image Understanding, 63(1):75–104, 1996.
  • [7] T. Brox and J. Malik. Large Displacement Optical Flow: Descriptor Matching in Variational Motion Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(3):500–513, mar 2011.
  • [8] A. Bruhn, J. Weickert, and C. Schnörr. Lucas/Kanade meets Horn/Schunck: Combining local and global optic flow methods. International journal of computer vision, 61(3):211–231, 2005.
  • [9] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, oct 2012.
  • [10] Z. DeVito, M. Mara, M. Zollöfer, G. Bernstein, C. Theobalt, P. Hanrahan, M. Fisher, and M. Nießner. Opt: A Domain Specific Language for Non-linear Least Squares Optimization in Graphics and Imaging. ACM Transactions on Graphics 2017 (TOG), 2017.
  • [11] A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. v.d. Smagt, D. Cremers, and T. Brox. FlowNet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pages 2758–2766, 2015.
  • [12] M. Dvorožňák. Interactive As-Rigid-As-Possible Image Deformation and Registration. In The 18th Central European Seminar on Computer Graphics, 2014.
  • [13] D. Eigen, C. Puhrsch, and R. Fergus. Depth Map Prediction from a Single Image Using a Multi-scale Deep Network. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 2366–2374, Cambridge, MA, USA, 2014. MIT Press.
  • [14] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual Worlds as Proxy for Multi-Object Tracking Analysis. In CVPR, 2016.
  • [15] R. Gao, B. Xiong, and K. Grauman. Im2Flow: Motion Hallucination From Static Images for Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2018.
  • [16] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012.
  • [17] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), oct 2017.
  • [18] B. K. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1-3):185–203, aug 1981.
  • [19] Y. Hu, R. Song, and Y. Li. Efficient coarse-to-fine patchmatch for large displacement optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5704–5712, 2016.
  • [20] T.-W. Hui, X. Tang, and C. C. Loy. LiteFlowNet: A Lightweight Convolutional Neural Network for Optical Flow Estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [22] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS’12, pages 1097–1105, USA, 2012. Curran Associates Inc.
  • [23] H.-A. Le, A. S. Baslamisli, T. Mensink, and T. Gevers. Three for one and one for three: Flow, Segmentation, and Surface Normals. In British Machine Vision Conference (BMVC), jul 2018.
  • [24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 3431–3440, 2015.
  • [25] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’81, pages 674–679, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers Inc.
  • [26] N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, and T. Brox. What Makes Good Synthetic Training Data for Learning Disparity and Optical Flow Estimation? International Journal of Computer Vision, 126(9):942–960, sep 2018.
  • [27] N. Mayer, E. Ilg, P. Häusser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4040–4048, 2016.
  • [28] J. McCormac, A. Handa, S. Leutenegger, and A. J.Davison. SceneNet RGB-D: Can 5M Synthetic Images Beat Generic ImageNet Pre-training on Indoor Segmentation? ICCV, 2017.
  • [29] M. Menze and A. Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 07-12-June, pages 3061–3070, 2015.
  • [30] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool. The 2017 DAVIS Challenge on Video Object Segmentation. arXiv:1704.00675, 2017.
  • [31] A. Ranjan, J. Romero, and M. J. Black. Learning Human Optical Flow. In British Machine Vision Conference (BMVC), 2018.
  • [32] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-preserving interpolation of correspondences for optical flow. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1164–1172, jun 2015.
  • [33] S. R. Richter, Z. Hayder, and V. Koltun. Playing for Benchmarks. In International Conference on Computer Vision (ICCV), 2017.
  • [34] T. F. H. Runia, C. G. M. Snoek, and A. W. M. Smeulders. Real-world repetition estimation by div, grad and curl. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [35] K. Simonyan and A. Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems, 2014.
  • [36] D. Sun, X. Yang, M.-Y. Liu, J. Kautz, and J. K. Nvidia. PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [37] S. Sun, Z. Kuang, L. Sheng, W. Ouyang, and W. Zhang. Optical Flow Guided Feature: A Fast and Robust Motion Representation for Video Action Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2018.
  • [38] D. Sýkora, J. Dingliana, and S. Collins. As-rigid-as-possible image registration for hand-drawn cartoon animations. In Proceedings of the 7th International Symposium on Non-Photorealistic Animation and Rendering - NPAR ’09, page 25, 2009.
  • [39] Y. Wang, K. Xu, Y. Xiong, and Z.-Q. Cheng. 2D Shape Deformation Based on Rigid Square Matching. Comput. Animat. Virtual Worlds, 19(3-4):411–420, sep 2008.
  • [40] A. Wedel and D. Cremers. Optical Flow Estimation. In Stereo Scene Flow for 3D Motion Analysis, pages 5–34. Springer London, London, 2011.
  • [41] P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large Displacement Optical Flow with Deep Matching. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV), ICCV ’13, pages 1385–1392, Washington, DC, USA, 2013. IEEE Computer Society.
  • [42] J. Xu, R. Ranftl, and V. Koltun. Accurate optical flow via direct cost volume processing. arXiv preprint arXiv:1704.07325, 2017.
  • [43] L. Xu, J. Jia, and Y. Matsushita. Motion Detail Preserving Optical Flow Estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1744–1757, 2012.
  • [44] Z. Zhu, W. Wu, W. Zou, and J. Yan. End-to-End Flow Correlation Tracking With Spatial-Temporal Attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2018.

Appendix A More examples on optical flow generation

Figure 8 provides additional visualizations of the optical flow generated by our unsupervised pipeline. Each group shows the segmented objects $O_1$ and $O_2$ with their corresponding matches in the left column, and the generated optical flow $F$ and the warped image $\hat{O}_2$ in the right column. Note that the generated optical flow is not meant to be the ground truth for the pair $(O_1, O_2)$, due to errors in the matching process, but it is the ground truth for the pair $(O_1, \hat{O}_2)$. Note also the natural appearance of the warped images: although they do not exactly match the real $O_2$, they capture the non-rigid movement of the object.

Figure 8: In each group, left column: examples of image segments and the obtained point matches for the two frames $O_1$ and $O_2$; right column: the computed optical flow $F$ and the resulting warped image $\hat{O}_2$. Note the significant differences between the second frame (bottom left of each group) and the warped image (bottom right of each group): the errors in the point matches yield a different dense flow field, which is better represented by the warped object. (Best viewed in color.)

Appendix B More qualitative results on real images

Figure 9 shows additional qualitative results of LiteFlowNet on the QUVA repetition dataset [34]. For each group, from top to bottom, we show the RGB images, the optical flow prediction from LiteFlowNet trained on the FlyingChairs dataset [11], and the optical flow prediction from the same network trained on our d1-8 set generated with the unsupervised optical flow pipeline. As our dataset focuses on non-rigid movements, the network learns to better capture human motion, resulting in more detail and sharper boundaries.

Figure 9: Qualitative results on QUVA repetition dataset [34] of LiteFlowNet that is trained on FlyingChairs (middle row) and on the d1-8 set obtained from DAVIS using our unsupervised optical flow generation pipeline (bottom row). LiteFlowNet trained using our pipeline can capture the non-rigid movements of objects in the scenes with better details and delineation. (Best viewed in color.)