FusionNet and AugmentedFlowNet: Selective Proxy Ground Truth for Training on Unlabeled Images

08/20/2018, by Osama Makansi et al., University of Freiburg

Recent work has shown that convolutional neural networks (CNNs) can estimate optical flow with high quality and fast runtime, which makes them preferable for real-world applications. However, such networks require very large training datasets, and engineering the training data is difficult and laborious. This paper shows how to augment a network trained on an existing synthetic dataset with large amounts of additional unlabeled data. In particular, we introduce a selection mechanism that assembles a joint optical flow field from multiple estimates and outperforms all of its input methods. This fused flow can be used as proxy ground truth to train a network on real-world data and to adapt it to specific domains of interest. Our experimental results show that the performance of networks improves considerably, both in cross-domain and in domain-specific scenarios. As a consequence, we obtain state-of-the-art results on the KITTI benchmarks.


1 Introduction

Equal contribution.

Like all deep learning applications that follow the supervised learning paradigm, the success of learning-based optical flow estimation stands or falls with the availability and quality of training data. In the case of optical flow, creating ground-truth annotation for real images is extremely tedious and virtually impossible at large scale. For this reason, state-of-the-art networks for optical flow estimation, such as FlowNet 2.0 [1] and PWC-Net [2], have been trained on synthetically rendered images. These networks tend to generalize comparatively well to real images – in contrast to networks for semantic tasks, such as object detection or semantic segmentation. This is because correspondence estimation differs from recognition and does not depend as much on the content of the images. In fact, optical flow estimation is possible without any learning and thus without any training data: there is a long history of optical flow methods that implement the concept of correspondence without supervision. These classical methods perform on par with state-of-the-art optical flow networks, yet with significantly higher runtimes.

The advantage of learning comes in when correspondences cannot be established easily and priors are needed to make decisions. Typical examples are image areas with homogeneous color (aperture problem) or areas that are occluded in the other image. Works from the pre-learning era used handcrafted regularizers [3, 4] and corresponding optimization heuristics to hallucinate optical flow in these areas. Learning such priors is more elegant and also more successful: networks tend to outperform the classical techniques especially in occluded areas. However, such learned priors are no longer independent of the image content: while basic hallucination strategies for occluded regions can be estimated from synthetic data, the hallucinated content should ideally depend on the objects in the scene. Thus, there is a domain gap between synthesized training images and real images, just as in semantic tasks, and real images are required for training. Multiple strategies have been proposed to integrate real images into the training procedure. These range from using the same unsupervised training loss for the network as in variational methods [5, 6], through multi-task learning with an auxiliary task that allows learning from unlabeled images [7], to training on pseudo ground truth obtained by running an (unsupervised) variational method [8].

This paper comprises two contributions. First, we present an assessment network that learns to predict the error of each flow field in a set generated with various optical flow estimation techniques. A fused optical flow field can then be trivially obtained by selecting, for each pixel, the flow vector with the smallest predicted error. We show that this assessment network, which we call FusionNet, combines the advantages of a potentially large set of techniques and avoids their limitations. As a consequence, FusionNet yields results that exceed the performance of all methods that produced its input. Independently of how the state of the art improves in the future, FusionNet can always benefit from these improvements. However, this comes at the cost of very large runtimes, since a whole set of partially slow methods must be run on the test images before the assessment network can assemble the final flow field. This is a show-stopper for most optical flow applications.

Thus, as a second contribution, we augment a FlowNet by training it on the flow obtained with the assessment network, which now serves as proxy ground truth. This shifts the large runtimes to the training phase, while the final network is as fast as a regular FlowNet at test time. Training data can be generated from all sorts of unlabeled videos, which allows the augmented FlowNet to learn priors from real images. This yields the currently best accuracy-runtime trade-off and enables specialization to target domains directly on real images, without tedious modeling of synthetic scenes in such domains. We showcase this with state-of-the-art results on the KITTI benchmarks.

2 Related Work

Traditional optical flow estimation. Optical flow estimation goes back to the works of Lucas & Kanade [9] and Horn & Schunck [3]. Both rely on a brightness constancy term combined with a local or global smoothness assumption. Especially the variational approach of Horn & Schunck was extended by many successive works [4, 10, 11]. While variational methods are very precise for small displacements, they have deficits in the case of large displacements. This was taken into account by Brox et al. [12], who combined the variational method with a simple nearest-neighbor matching of local descriptors. DeepMatching [13] elaborated on the matching, and EpicFlow [14] improved the variational refinement. FlowFields [15] builds upon EpicFlow and elaborates on the matching using a random search strategy. The present state of the art is defined by DCFlow [16] and MR-Flow [17]. The accuracy of these techniques is very high, on par with or even higher than that of learning-based techniques. However, the combinatorial search in state-of-the-art methods leads to rather large runtimes that do not allow for interactive frame rates.

Optical flow with supervised learning. End-to-end learning of optical flow was pioneered by the work of Dosovitskiy et al. [18], which presented the two network architectures FlowNetS and FlowNetC. The former is purely convolutional, while the latter includes an explicit correlation. The networks were trained on a simplistic dataset made from Flickr and chair images to which affine transformations were applied (FlyingChairs). Mayer et al. [19] introduced a more sophisticated 3D dataset (FlyingThings3D). Ilg et al. [1] presented a stack of networks termed FlowNet 2.0 with high accuracy and fast runtime. Ranjan et al. [20] presented a network architecture that contains a spatial pyramid and runs even faster than FlowNet 2.0, but at the cost of accuracy. Sun et al. [2] extended this idea by introducing correlations at the different pyramid levels. Their network termed PWC-Net currently achieves state-of-the-art results.

Other methods combine feature learning with traditional methods: FlowFieldsCNN [21] uses an improved hinge embedding loss to train a Siamese architecture for feature extraction, which is then used in combination with FlowFields. PatchBatch [22] shows that CNN features can even be improved to a level at which plain nearest-neighbor matching performs well. DeepDiscreteFlow [23] combines a local network with a context network and discrete optimization.

Optical flow with unsupervised learning components. Ahmadi et al. [5] proposed an unsupervised learning approach that uses the brightness constancy loss from variational approaches to train a CNN. In principle, their approach replaces the Gauss-Newton step in variational optimization with back-propagation on a network representation. Although fully unsupervised, the resulting flow fields are inferior to those of unsupervised variational techniques. Meister et al. [6] proposed an additional unsupervised loss based on forward-backward consistency to train a network, termed UnFlow, in a completely unsupervised manner.

Several other methods introduce unsupervised losses in addition to supervised training on synthetic data. Yu et al. [24] and Ren et al. [25] use the loss from variational approaches to refine the decoder stages of a pre-trained FlowNet. Lai et al. [26] use a GAN approach to distinguish the optical flow estimated by the generator network from ground-truth optical flow. Sedaghat et al. [7] proposed the self-supervised auxiliary task of next-frame prediction as an additional loss. Like the above-mentioned works, this allows them to improve FlowNet on real-world data. The guided optical flow proposed by Zhu et al. [8] uses the flow computed by a traditional, unsupervised method as proxy ground truth to train a network in the usual supervised manner. The final network is limited by the performance of the traditional method that provided the proxy ground truth. In contrast, we avoid this drawback by training on flow fields produced by multiple different methods and locally selecting the best one. This way, the final network can yield better results than any single method that produced the training data.

Optical flow fusion. The principle of locally selecting the best flow vector from a set of flow fields has also been implemented outside the scope of deep learning. Lempitsky et al. [27] proposed a combinatorial optimization approach to combine flow fields from multiple methods based on a smoothness loss. Their approach was also used in MDPFlow [28], which locally combined multiple hypotheses from coarser pyramid levels and nearest-neighbor matching. In contrast to these approaches, our selection among flow vectors is based on a deep network that learns to directly predict the optical flow error rather than selecting based on smoothness priors.

3 FusionNet

Figure 1: Overview of the FusionNet principle. Given the input images, the optical flow is estimated with various existing methods. Each method's optical flow estimate is used to warp the second image. The two input images, the warped image, and the flow are fed into the proposed assessment network, which is trained to predict the error of each flow field. Finally, the flow fields are merged by locally choosing the flow vector with the minimum predicted error.

We assume that the various optical flow estimation methods have different strengths and weaknesses. This does not exclude that these methods also share many difficulties. However, as long as there are differences, we want to exploit them and choose, for each case, the method that works best.

To this end, we propose an assessment network that predicts the errors of the optical flow estimated by a set of existing methods, as shown in Figure 1, and is trained on synthetic data with available ground-truth optical flow. At first glance, this training on synthetic images looks like we are back at square one. However, the task of assessment is different from the task of flow estimation itself. First, we benefit from the information contained in the various input flow fields. Second, the assessment task may generalize more easily to other domains than optical flow estimation, since it only has to predict errors rather than the flow field itself.

The assessment network uses a typical encoder-decoder architecture with skip connections; the architecture details are as in FlowNetS [1]. It takes the two input images into account together with the flow estimate and the second image warped by that flow. The error map predicted by the assessment network is used to optimally combine the estimated flow fields. We refer to the complete setup, as shown in Figure 1, as FusionNet. We investigate two different loss functions: an L1 loss and a hinge loss.
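A minimal sketch of this selection step in PyTorch, under assumptions not stated in the paper: flows are given in pixels with layout (B, 2, H, W), the assessment network returns one error map per candidate flow, and the helper names warp_backward and fuse_flows are hypothetical.

```python
import torch
import torch.nn.functional as F

def warp_backward(img2, flow):
    """Backward-warp img2 (B,3,H,W) with flow (B,2,H,W) given in pixels."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2,H,W)
    coords = grid.unsqueeze(0) + flow                              # sample coordinates
    # normalize coordinates to [-1, 1] as expected by grid_sample
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)          # (B,H,W,2)
    return F.grid_sample(img2, grid_norm, align_corners=True)

def fuse_flows(img1, img2, candidate_flows, assessment_net):
    """candidate_flows: list of K flow tensors (B,2,H,W) from different methods."""
    errors, flows = [], []
    for flow in candidate_flows:
        warped = warp_backward(img2, flow)
        # the assessment network sees both images, the flow, and the warped image
        inp = torch.cat([img1, img2, flow, warped], dim=1)
        errors.append(assessment_net(inp))                         # (B,1,H,W)
        flows.append(flow)
    errors = torch.stack(errors, dim=0)                            # (K,B,1,H,W)
    flows = torch.stack(flows, dim=0)                              # (K,B,2,H,W)
    best = errors.argmin(dim=0, keepdim=True)                      # per-pixel best method
    fused = torch.gather(flows, 0, best.expand(1, -1, 2, -1, -1)).squeeze(0)
    return fused                                                    # (B,2,H,W)
```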

3.1 L1 Loss

For training the assessment network with an L1 loss, we let the network directly estimate the pixel-wise endpoint error. Let the estimated optical flow at pixel $(i,j)$ be denoted by its x- and y-components $u(i,j)$ and $v(i,j)$, and the ground-truth flow by $u_{gt}(i,j)$ and $v_{gt}(i,j)$. The ground-truth endpoint error at pixel location $(i,j)$ is:

$$e_{gt}(i,j) = \sqrt{\big(u(i,j) - u_{gt}(i,j)\big)^2 + \big(v(i,j) - v_{gt}(i,j)\big)^2} \quad (1)$$

Let $\hat{e}(i,j)$ be the error predicted by the assessment network. To improve the predicted error, we apply back-propagation with the L1 loss:

$$\mathcal{L}_{L1} = \sum_{i,j} \big|\hat{e}(i,j) - e_{gt}(i,j)\big| \quad (2)$$
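A compact sketch of this objective in the same PyTorch notation as above (tensor layout is an assumption; the function names are hypothetical):

```python
import torch

def endpoint_error(flow, flow_gt):
    """Per-pixel EPE (Eq. 1); flow, flow_gt: (B,2,H,W) -> (B,1,H,W)."""
    return torch.sqrt(torch.sum((flow - flow_gt) ** 2, dim=1, keepdim=True))

def l1_assessment_loss(predicted_error, flow, flow_gt):
    """L1 distance between predicted and ground-truth error maps (Eq. 2)."""
    e_gt = endpoint_error(flow, flow_gt)
    return torch.mean(torch.abs(predicted_error - e_gt))
```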

In principle, one could train a separate, specialized network for each input method to be assessed. However, since we want to improve the generalization of the assessment network, we use the same network for assessing all input flows and sample the mini-batches during training from the different methods. More training details are provided in Section 5.1.

3.2 Hinge Loss

Directly applying an L1 loss on the error makes the network estimate the absolute error of each method. However, for the fusion we only need to know which input method has the lowest error. That means the L1 loss potentially solves a harder problem than necessary to reach the actual goal. (That said, the predicted error could be valuable in its own right for other purposes not discussed in this paper, for instance, uncertainty estimation.) A related problem to picking the input with the smallest error is designing a distance metric to match patches. Such a metric only needs to reflect the ranking, e.g. "A is closer to B than A is to C" [29, 30]. Many feature learning algorithms use this as a triplet loss [16, 31, 32, 33, 34]. With the same motivation, we use the well-known multi-class hinge loss [35, 36]:

$$\mathcal{L}_{hinge} = \sum_{i,j} \sum_{k \neq k^*} \max\big(0,\, \epsilon + \hat{e}_{k^*}(i,j) - \hat{e}_k(i,j)\big) \quad (3)$$

where $k^*$ is the index of the method with the lowest error according to the ground truth, $\hat{e}_k$ is the error predicted for method $k$, and $\epsilon$ is the minimum margin between the best estimate and the other estimates. If the predicted error of the true best method $k^*$ is the smallest and all other predicted errors are at least $\epsilon$ larger, this loss is zero. Otherwise, each error that violates the margin contributes to the loss. Since the network is allowed to rescale the errors, we can set $\epsilon = 1$ without loss of generality. Note that the errors predicted by the network with this loss no longer correspond to the L1 error but may be rescaled; the rescaling factor may even differ for each pixel. Obviously, the hinge loss implies joint training of the assessment network with all methods given as input.
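In the same notation as before, a sketch of the hinge objective; pred_errors stacks the K predicted error maps and gt_errors the corresponding ground-truth endpoint errors, and the default margin of 1 follows the choice above.

```python
import torch

def hinge_assessment_loss(pred_errors, gt_errors, eps=1.0):
    """Multi-class hinge loss (Eq. 3); pred_errors, gt_errors: (K,B,1,H,W)."""
    k_star = gt_errors.argmin(dim=0, keepdim=True)                # truly best method per pixel
    best_pred = torch.gather(pred_errors, 0, k_star)              # predicted error of k*
    margins = torch.clamp(eps + best_pred - pred_errors, min=0)   # violations of the margin
    # the k = k* term contributes a constant eps; subtract it so a
    # perfect ranking yields zero loss
    return torch.mean(margins.sum(dim=0) - eps)
```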

Figure 2: Using our FusionNet to augment a FlowNet: FlowNet and FusionNet are trained on labeled data. Subsequently, FusionNet is used to augment FlowNet with large amounts of unlabeled data.

4 Augmented FlowNet

Given the FusionNet from the previous section, we can apply it to any unlabeled data to estimate high-quality optical flow. However, running FusionNet is very costly, since it requires running the various, partially very slow optical flow estimation methods. To obtain fast optical flow estimation at test time, we use the optical flow fields estimated with FusionNet as proxy ground truth to finetune a FlowNet, for instance, to optimize it for a specific domain or to make it perform better on general real-world videos. The principle is quite straightforward and illustrated in Figure 2.
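A schematic sketch of this augmentation step, assuming a data loader that yields image pairs together with the precomputed FusionNet proxy flow; the optimizer choice and learning rate are illustrative assumptions, not the paper's settings.

```python
import torch

def finetune_on_proxy(flownet, fusion_proxy_loader, num_steps, lr=1e-5):
    """Fine-tune a fast FlowNet on proxy flow produced offline by FusionNet."""
    optimizer = torch.optim.Adam(flownet.parameters(), lr=lr)
    for step, (img1, img2, proxy_flow) in enumerate(fusion_proxy_loader):
        if step >= num_steps:
            break
        pred_flow = flownet(img1, img2)
        # ordinary supervised endpoint-error loss, but against the proxy flow
        loss = torch.sqrt(((pred_flow - proxy_flow) ** 2).sum(dim=1)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return flownet
```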

5 Experiments

We evaluated the concept on the common optical flow benchmarks, where we can quantify the improvements by the fusion and by the augmentation of FlowNet directly. In addition, we demonstrate the effect of the augmentation in a motion segmentation context.

5.1 Training Details

For training the assessment network, we followed the same training schedule as proposed by Ilg et al. [1] for training FlowNet, i.e., we first train on FlyingChairs for 1.2M iterations and subsequently on FlyingThings3D for 500k iterations. The augmented FlowNet is initialized with a FlowNet trained on the same schedule. We also applied the same data augmentation mechanism, i.e., a set of spatial and color transformations. The networks were implemented using the Caffe framework. The code will be made publicly available upon publication.
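For reference, the pre-training schedule can be summarized as the following sketch; the datasets and iteration counts are taken from the text, while the learning rates are placeholder assumptions rather than reported values.

```python
# Two-stage pre-training schedule for the assessment network
# (datasets and iteration counts from the text; learning rates assumed).
pretraining_schedule = [
    {"dataset": "FlyingChairs",   "iterations": 1_200_000, "base_lr": 1e-4},  # assumed lr
    {"dataset": "FlyingThings3D", "iterations":   500_000, "base_lr": 1e-5},  # assumed lr
]
```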

5.2 Datasets

We used the two publicly available synthetic datasets FlyingChairs [18] and FlyingThings3D [19] to train the assessment network and the initial FlowNet before augmentation. These are the two datasets for which labeled training data is available. For the unsupervised fine-tuning, we use various unlabeled datasets that we grouped into two domains: animation movies and driving.

Animation movies. We collected several animation movies from the Blender project [37] and used them for unsupervised training. For such animation movies, ground-truth optical flow could in principle be derived, as shown by Butler et al. [38] and Mayer et al. [39], but we did not use this option here and used just the unlabeled videos for training. For the evaluation in this domain, we used the official Sintel benchmark dataset [38].

Driving. Driving scenes are a popular application domain for optical flow; thus, we selected them as a second evaluation domain. For unsupervised training, we took approximately 100k frames from the Frankfurt part of the publicly available Cityscapes dataset [40]. For the evaluation in this domain, we used the two publicly available KITTI 2012 [41] and KITTI 2015 [42] benchmark datasets.

Motion Segmentation. For indirect evaluation of the optical flow on a motion segmentation task, we used approximately 32k frames from the UdG-MS19 and UdG-MS20 datasets [43] for unsupervised training. We evaluated the motion segmentation on the FBMS benchmark dataset [44].

5.3 FusionNet

We evaluated FusionNet with the following optical flow estimation techniques as input: LDOF [12], DeepFlow [13], EpicFlow [14], FlowFields [15], and FlowNet2 [1]. There are some very recent methods with even better performance, such as DCFlow [16], PWC-Net [2], and MR-Flow [17], but their code was not operational in time to include them in the experiments. A nice property of FusionNet is that new methods can be integrated trivially at any time to improve results further.

[Table 1 layout: columns report AEE on Sintel clean and Sintel final (train/test), AEE and F1-noc on KITTI 2012 (train/test), and AEE and Fl-all on KITTI 2015 (train/test); row groups are the input methods (LDOF [12], DeepFlow [13], EpicFlow [14], FlowNet2 [1], FlowFields [15]), the best published methods (DCFlow [16], PWC-Net [2], MR-Flow [17]), our FusionNet-L1 and FusionNet-Hinge, and the Oracle fusion.]

Table 1: Comparison of FusionNet to the state-of-the-art. The upper section of the table corresponds to the input methods used for FusionNet. FusionNet performs better than any of the input methods. The Oracle Fusion refers to a fusion based on the ground-truth error.

Table 1 compares FusionNet to the state-of-the-art on the common benchmark datasets. FusionNet consistently outperforms each of the techniques that have been provided as input, which demonstrates that the assessment network is able to locally select the best optical flow vectors. As a consequence, this brings it close to the most recent state of the art and would most likely outperform it if these methods were also included for selection. Table 1 also reports the results when selecting the flow vectors based on the ground-truth error. This oracle fusion is the lower bound that FusionNet can achieve with the respective optical flow fields given as input.

[Figure 3 panels: the top row shows the ground-truth flow, the ground-truth occlusions, and the two input images; the following rows show, for FlowFields [15], FlowNet2 [1], EpicFlow [14], DeepFlow [13], and LDOF [12], the estimated flow, its ground-truth error, and the errors predicted with the L1 and hinge losses; the bottom row shows the Oracle fusion, FusionNet (L1), and FusionNet (Hinge).]
Figure 3: Qualitative example showing the error prediction of the assessment network and the corresponding fused optical flow field. First column: Different flows provided as input. Second column: Their ground truth error. Third and fourth column: Error prediction with the L1 loss and the hinge loss, respectively. The error predicted with the L1 loss is very close to the ground-truth error. The error predicted with the hinge loss is not interpretable, since it is a relative error and therefore is locally scaled. The bottom row shows the merged flow using the ground truth (Oracle) or the error predictions. The marked regions indicate how FusionNet picks the best motion vectors from the different input methods.

The error predicted when training with the L1 loss is shown in Figure 3. One can observe that it matches the ground-truth error quite well. Thus, in hard cases, if one of the input methods is able to estimate the motion successfully, FusionNet is able to select the best estimate. The error predicted with the margin loss is not directly interpretable due to the local scaling. However, the resulting final optical flow field is as good as the one obtained with the L1 loss. A quantitative comparison of the L1 loss and the hinge loss also shows no significant difference.

5.4 Augmented FlowNet

While FusionNet yields excellent optical flow that combines the best from all available methods, it requires 84 seconds per frame. In contrast, FlowNet 2.0 runs at 8 frames per second (FlowNet2 runtime measured on an Nvidia GTX 1080 GPU, while the classical methods run on the CPU). In this section, we test how far we can transfer the good results from FusionNet to a FlowNet, thus also inheriting the runtime of the latter.

Table 2 first shows the influence of the choice of the proxy ground-truth when fine-tuning a basic FlowNetC. Augmenting the FlowNet with an optical flow field that is superior to the baseline improves results, whereas inferior flow fields can decrease the performance. When using just a single proxy method, there is the dilemma of which method to choose. Feeding a random mixture of samples from various methods during fine-tuning (Rand. Mix) does not yield the best of all involved methods, but approximately the average of those. In contrast, the use of FusionNet resolves the dilemma.

[Table 2 layout: columns report the AEE on Sintel clean, Sintel final, KITTI 2012, and KITTI 2015 (training sets); rows list the Baseline, the domain-specific variants AugmentedFlowNetD fine-tuned on proxies from FlowNet2, FlowFields, EpicFlow, DeepFlow, LDOF, a random mixture (Rand. Mix), FusionNet (L1), and FusionNet (Hinge), and the generic variants AugmentedFlowNetG-FusionNet (L1) and (Hinge).]


Table 2: Influence of the proxy ground truth on the augmented FlowNetC. Average endpoint errors on the training set of Sintel and KITTI are reported. Augmentation with a single proxy can improve results, but it is not obvious which method to choose. Using multiple methods to generate the proxy ground-truth (FusionNet) yields consistent improvements across all benchmarks. The benefits come largely from combining FlowNet2 and FlowFields. The upper part of the table shows experiments which are trained on domain-specific data (denoted AugmentedFlowNetD), while the bottom part shows experiments where the training data came from multiple domains to yield a generic network (denoted AugmentedFlowNetG).

We also distinguish between augmentation for a specific domain and generic augmentation. In the first case, we augment the FlowNet only on data from the respective domain, i.e., animation movies in case of Sintel and driving videos in case of KITTI; in the second case, data from both domains is used for finetuning. Specialization to a certain domain is one of the big advantages of learning-based optical flow methods, and a particular advantage for those methods that do not require supervision in that domain, as in our case.

Table 2 shows that domain-specific augmentation improves results considerably on KITTI, which is a very special scenario: the error is almost cut in half. However, the generic augmentation is not much worse, as it also benefits from the training data of the special domain, even though that data is now mixed with data from another domain. Apparently, the network can automatically figure out at test time which domain the input comes from and applies the appropriate priors of that domain.

Table 3 extends the augmentation to a stacked FlowNet and compares it to UnFlow [6]. UnFlow uses an unsupervised loss and can therefore be specialized conveniently to any domain. The table shows results for UnFlow trained on Cityscapes or on the unlabeled data from KITTI; both outperform the supervised baseline, which was trained on synthetic data outside this domain. For better comparison to our strategy, we also report results for a semi-supervised version of UnFlow, i.e., one that is initialized with a FlowNet trained on synthetic data before the unsupervised training starts.

[Table 3 layout: columns report the AEE on Sintel clean, Sintel final, KITTI 2012, and KITTI 2015; rows list the Baseline, the domain-specific UnFlow variants (UnFlow-C-CityScapes [6], UnFlow-C-ours, UnFlow-C-KITTIraw [6], UnFlow-CS [6], UnFlow-CSS [6]), the domain-specific AugmentedFlowNetD-C/CS/CSS, and the generic AugmentedFlowNetG-C/CS/CSS.]


Table 3: Comparison of augmented FlowNet stacks to UnFlow [6]. AugmentedFlowNetD-C and UnFlow-C-ours are trained on the same domain-specific data and are initialized with the same model (Baseline). The results show that the purpose of domain adaptation is better achieved with the augmentation based on FusionNet than with the unsupervised loss of UnFlow. The results for the FlowNet augmented with data from both domains even show that it is not necessary to train separate networks for each domain, but that a generic network augmented on both domains is equally good.

The results show that domain adaptation with the augmented FlowNet is clearly superior to that of UnFlow (UnFlow does not require any supervision, which makes it biologically more plausible; from an engineering perspective, however, this is irrelevant). As already observed in Table 2, there is no significant difference between domain-specific training and training on a joint set of domains. This also holds for the stacked network.

[Table 4 layout: columns report AEE on Sintel clean and Sintel final (train/test), AEE and F1-noc on KITTI 2012 (train/test), and AEE and Fl-all on KITTI 2015 (train/test); row groups are unsupervised/semi-supervised methods (DSTFlow [25], GAN-OpticalFlow [26], Hybrid-OpticalFlow-NextFrame [7], UnFlow-CSS [6]), supervised methods (FlowNet2 [1], PWC-Net [2], DCFlow [16], MR-Flow [17]), and our AugmentedFlowNetD-CSS, AugmentedFlowNetG-CSS, and FusionNet-Hinge.]

Table 4: Comparison to the state of the art. Marked numbers have been obtained after fine-tuning on the training set of the respective benchmark. On the KITTI benchmarks, we clearly extend the state of the art. Thanks to the additional fine-tuning with ground-truth data, the augmented network even performs better than the FusionNet proxy; the generic version, which has only been finetuned on the FusionNet proxy, also gets close to FusionNet and sometimes even outperforms it.

Table 4 compares the stacked augmented FlowNet to the state of the art. On the KITTI benchmarks, the augmented FlowNet sets the new state of the art after also being finetuned with the ground truth from the KITTI training set. But the generic version, which has not been finetuned with ground truth data, also yields very good results. The direct comparison to FlowNet2 quantifies the improvement of stacked networks due to the augmentation.

Interestingly, the stacked augmented FlowNet often even outperforms the FusionNet proxy. For the domain-specific network, this is due to the finetuning with ground truth. The generic network is also sometimes better than FusionNet, but not consistently.

Figure 4 shows some qualitative examples of the augmentation by the proxy, here with the smaller, non-stacked network.

[Figure 4 panels: each of the four examples shows the two input images, the ground-truth flow, the baseline flow, the FusionNet proxy, and the augmented flow.]
Figure 4: Examples from Sintel Clean and KITTI2015. Using the proxy ground truth obtained with FusionNet for finetuning improves the network’s performance. Note that the ground truth on KITTI is sparse and lacks some moving objects.

Finally, we also evaluated the augmented FlowNet in an application scenario. We augmented the FlowNet on data from UdG-MS19 and UdG-MS20 [43] and fed its optical flow into the motion segmentation approach from Keuper et al. [45]. The motion segmentation performance was evaluated on the FBMS benchmark [44]. Table 5 shows that the adaptation to real images clearly helps a small FlowNetC to improve motion segmentation results. For the larger, stacked network, we do not see a significant improvement due to the augmentation. We attribute this to some saturation effect: the optical flow of FlowNet2 is already very good, such that other parts of the motion segmentation pipeline dominate the final result.



[Table 5 layout: columns report the F1 measure and the number of extracted objects; rows list FlowNetC (baseline), AugmentedFlowNetD-C, FlowNet2 [1] (stacked baseline), and AugmentedFlowNetD-CS (28/69 extracted objects).]



Table 5: Motion segmentation results on the FBMS test set [44]. We fed the optical flow from the listed methods to the motion segmentation approach from Keuper et al. [45]. The augmentation on real data clearly improved the FlowNetC baseline. For the stacked network, we do not see a significant improvement. This is mainly because the optical flow is already very good in this case and other effects in the motion segmentation pipeline dominate the final result.

6 Conclusion

In this paper, we have presented two contributions: (1) We have presented a way to assemble a high-quality flow field from a set of input flow fields computed on unlabeled data with existing optical flow estimation methods. This is achieved by training an assessment network that learns to predict the errors of the input techniques. (2) We have shown that finetuning a FlowNet on such high-quality flow fields allows for unsupervised adaptation of the network to a specific domain. With this strategy, we obtained state-of-the-art results on the KITTI benchmarks. Moreover, we have shown that this strategy is more successful for domain adaptation than a fully unsupervised approach that does not make use of any synthetic data.

Acknowledgements

We acknowledge funding by the German Research Foundation (grant BR 3815/10-1) and the EU Horizon 2020 project Trimbot2020.

References

  • [1] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., Brox, T.: FlowNet 2.0: Evolution of optical flow estimation with deep networks. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [2] Sun, D., Yang, X., Liu, M., Kautz, J.: PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2018)
  • [3] Horn, B.K.P., Schunck, B.G.: Determining optical flow. In: Artificial Intelligence (AI). (1981)
  • [4] Mémin, E., Pérez, P.: A multigrid approach for hierarchical motion estimation. In: IEEE Int. Conference on Computer Vision (ICCV). (1998)
  • [5] Ahmadi, A., Patras, I.: Unsupervised convolutional neural networks for motion estimation. In: International Conference on Image Processing (ICIP). (2016)
  • [6] Meister, S., Hur, J., Roth, S.: UnFlow: Unsupervised learning of optical flow with a bidirectional census loss. In: Conference on Artificial Intelligence (AAAI). (2018)
  • [7] Sedaghat, N., Zolfaghari, M., Brox, T.: Hybrid learning of optical flow and next frame prediction to boost optical flow in the wild. Technical report (2017)
  • [8] Zhu, Y., Lan, Z., Newsam, S., Hauptmann, A.G.: Guided Optical Flow Learning. arXiv preprint arXiv:1702.02295 (2017)
  • [9] Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Imaging Understanding Workshop. (1981)
  • [10] Brox, T., Bruhn, A., Papenberg, N., Weickert, J.: High accuracy optical flow estimation based on a theory for warping. In: European Conference on Computer Vision (ECCV). (2004)
  • [11] Zach, C., Pock, T., Bischof, H.: A duality based approach for realtime TV-L 1 optical flow. In: Joint Pattern Recognition Symposium. (2007)
  • [12] Brox, T., Malik, J.: Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2011)
  • [13] Weinzaepfel, P., Revaud, J., Harchaoui, Z., Schmid, C.: DeepFlow: Large displacement optical flow with deep matching. In: IEEE Intenational Conference on Computer Vision (ICCV). (2013)
  • [14] Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  • [15] Bailer, C., Taetz, B., Stricker, D.: Flow fields: Dense correspondence fields for highly accurate large displacement optical flow estimation. In: IEEE Int. Conference on Computer Vision (ICCV). (2015)
  • [16] Xu, J., Ranftl, R., Koltun, V.: Accurate Optical Flow via Direct Cost Volume Processing. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [17] Wulff, J., Sevilla-Lara, L., Black, M.J.: Optical flow in mostly rigid scenes. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [18] Dosovitskiy, A., Fischer, P., Ilg, E., Häusser, P., Hazırbaş, C., Golkov, V., v.d. Smagt, P., Cremers, D., Brox, T.: Flownet: Learning optical flow with convolutional networks. In: IEEE Int. Conference on Computer Vision (ICCV). (2015)
  • [19] Mayer, N., Ilg, E., Häusser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [20] Ranjan, A., Black, M.: Optical flow estimation using a spatial pyramid network. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [21] Bailer, C., Varanasi, K., Stricker, D.: CNN-based Patch Matching for Optical Flow with Thresholded Hinge Embedding Loss. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017)
  • [22] Gadot, D., Wolf, L.: PatchBatch: a Batch Augmented Loss for Optical Flow. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [23] Güney, F., Geiger, A.: Deep Discrete Flow. In: Asian Conference on Computer Vision (ACCV). (2016)
  • [24] Yu, J.J., Harley, A.W., Derpanis, K.G.: Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In: European Conference on Computer Vision (ECCV). (2016)
  • [25] Ren, Z., Yan, J., Ni, B., Liu, B., Yang, X., Zha, H.: Unsupervised deep learning for optical flow estimation. In: Conference on Artificial Intelligence (AAAI). (2017)
  • [26] Lai, W.S., Huang, J.B., Yang, M.H.: Semi-supervised learning for optical flow with generative adversarial networks. In: Advances in Neural Information Processing Systems 30. (2017)
  • [27] Lempitsky, V., Roth, S., Rother, C.: FusionFlow: Discrete-Continuous Optimization for Optical Flow Estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2008)
  • [28] Xu, L., Jia, J., Matsushita, Y.: Motion detail preserving optical flow estimation. In: IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). (2012)
  • [29] Schultz, M., Joachims, T.: Learning a distance metric from relative comparisons. In: Advances in Neural Information Processing Systems 16. (2004)
  • [30] Weinberger, K.Q., Saul, L.K.: Distance metric learning for large margin nearest neighbor classification. JMLR (2009)
  • [31] Wang, J., Song, Y., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y.: Learning fine-grained image similarity with deep ranking. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014)
  • [32] Hoffer, E., Ailon, N.: Deep metric learning using triplet network. In: ICLR (Workshop). (2015)
  • [33] Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  • [34] Wohlhart, P., Lepetit, V.: Learning descriptors for object recognition and 3d pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  • [35] Moore, R.C., DeNero, J.: L1 and L2 regularization for multiclass hinge loss models. In: Symposium on Machine Learning in Speech and Natural Language Processing. (2011)
  • [36] Doğan, Ü., Glasmachers, T., Igel, C.: A unified view on multi-class support vector classification. Journal of Machine Learning Research (2016)
  • [37] Project, B.: Open movies (agent, caminandes, cosmos, sintel and big buck bunny). https://www.blender.org/about/projects (2017)
  • [38] Butler, D.J., Wulff, J., Stanley, G.B., Black, M.J.: A naturalistic open source movie for optical flow evaluation. In: European Conference on Computer Vision (ECCV). (2012)
  • [39] N.Mayer, E.Ilg, P.Häusser, P.Fischer, D.Cremers, A.Dosovitskiy, T.Brox: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [40] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016)
  • [41] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2012)
  • [42] Menze, M., Geiger, A.: Object scene flow for autonomous vehicles. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015)
  • [43] Mahmood, M.H., Díez, Y., Salvi, J., Lladó, X.: A collection of challenging motion segmentation benchmark datasets. (2017)
  • [44] Ochs, P., Malik, J., Brox, T.: Segmentation of moving objects by long term video analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) (2014)
  • [45] Keuper, M., Andres, B., Brox, T.: Motion trajectory segmentation via minimum cost multicuts. In: IEEE Int. Conference on Computer Vision (ICCV). (2015)