Unsupervised Deep Epipolar Flow for Stationary or Dynamic Scenes

04/08/2019 · by Yiran Zhong, et al. · Australian National University

Unsupervised deep learning for optical flow computation has achieved promising results. Most existing deep-net based methods rely on image brightness consistency and local smoothness constraint to train the networks. Their performance degrades at regions where repetitive textures or occlusions occur. In this paper, we propose Deep Epipolar Flow, an unsupervised optical flow method which incorporates global geometric constraints into network learning. In particular, we investigate multiple ways of enforcing the epipolar constraint in flow estimation. To alleviate a "chicken-and-egg" type of problem encountered in dynamic scenes where multiple motions may be present, we propose a low-rank constraint as well as a union-of-subspaces constraint for training. Experimental results on various benchmarking datasets show that our method achieves competitive performance compared with supervised methods and outperforms state-of-the-art unsupervised deep-learning methods.


1 Introduction

Optical flow estimation is a fundamental problem in computer vision with many applications. Since Horn and Schunck’s seminal work [14], various methods have been developed using variational optimization [2, 43, 5], energy minimization [19, 24, 32, 40], or deep learning [7, 8, 25, 33]. In this paper, we particularly tackle the problem of unsupervised optical flow learning using deep convolutional neural networks (CNNs). Compared to its supervised counterpart, unsupervised flow learning does not require ground-truth flow, which is often hard to obtain, as supervision and can thus be applied in broader domains.

Recent research has focused on transferring traditional domain knowledge of optical flow into deep learning, in terms of either training loss formulation or network architecture design. For example, in view of brightness consistency between two consecutive images, a constraint that has been commonly used in conventional optical flow methods, researchers have formulated a photometric loss [42, 31], with the help of fully differentiable image warping [15], to train deep neural networks. Other common techniques, including image pyramids [4] (to handle large flow displacements), total variation regularization [30, 37] and occlusion handling [1], have also led to either new network structures (e.g., pyramid networks [25, 33]) or losses (e.g., smoothness loss and occlusion masks [35, 16]). In the unsupervised setting, existing methods mainly rely on the photometric loss and flow smoothness loss to train deep CNNs. This, however, makes it difficult for the networks to learn optical flow accurately in regions with repetitive textures and occlusions. Although some methods [35, 16] jointly learn occlusion masks, these masks are not meant to provide additional constraints but only to remove outliers from the losses. In light of the difficulty of learning accurate flow in these regions, we propose to incorporate global epipolar constraints into flow network training in this paper.

Leveraging epipolar geometry in flow learning, however, is not a trivial task. An inaccurate or wrong estimate of the fundamental matrix [13] would globally mislead the flow network training and thus significantly degrade the model's prediction accuracy. This is especially true when a scene contains multiple independently moving objects, as one fundamental matrix can only describe the epipolar geometry of one rigid motion. Instead of imposing a hard epipolar constraint, in this paper we propose to use soft epipolar constraints, derived from low-rankness when the scene is stationary and from a union-of-subspaces structure when the scene's motion is unknown. We then formulate corresponding losses to train our flow networks in an unsupervised manner.

Our work makes an attempt towards incorporating epipolar geometry into deep unsupervised optical flow computation. Through extensive evaluations on standard datasets, we show that our method achieves competitive performance compared with supervised methods, and outperforms existing unsupervised methods by a clear margin. Specifically, as of the date of paper submission, on KITTI and MPI Sintel benchmarks, our method achieves the best performance among published deep unsupervised optical flow methods.

2 Related work

Optical flow estimation has been extensively studied for decades. A significant number of papers have been published in this area. Below we only discuss a few geometry-aware methods and recent deep-learning based methods that we consider closely related to our method.

Supervised deep optical flow. Recently, end-to-end learning based deep optical flow approaches have shown their superiority in learning optical flow. Given a large number of training samples, optical flow estimation is formulated as learning a regression from an image pair to the corresponding optical flow. These approaches achieve performance comparable to state-of-the-art conventional methods on several benchmarks while being significantly faster. FlowNet [7] pioneered this direction, relying on a large synthetic dataset to supervise network learning. FlowNet2 [8] greatly extends FlowNet by stacking multiple encoder-decoder networks one after the other, achieving results comparable to conventional methods on various benchmarks. Recently, PWC-Net [33] combined sophisticated conventional strategies such as pyramids, warping and cost volumes into the network design and set the state-of-the-art performance on KITTI [12, 23] and MPI Sintel [6]. These supervised deep optical flow methods are hampered by the need for large-scale training data with ground-truth optical flow, which also limits their generalization ability.

Unsupervised deep optical flow. Instead of using ground-truth flow as supervision, Yu et al. [42] and Ren et al. [28] suggested that, similarly to conventional methods, an image warping loss can serve as the supervision signal for learning optical flow. However, there is a significant performance gap between their work and conventional methods. Simon et al. [31] analyzed the problem and introduced a bidirectional census loss to robustly handle illumination variation between frames. Concurrently, Yang et al. [35] proposed an occlusion-aware warping loss to exclude occluded points from the error computation. Very recently, Janai et al. [16] extended two-view optical flow to the multi-view case with improved occlusion handling. Introducing sophisticated occlusion estimation and warping losses reduces the performance gap between conventional and unsupervised methods; nevertheless, the gap remains large. To address this issue, we propose a global epipolar constraint in flow estimation that largely narrows the gap.

Geometry-aware optical flow. Regarding the use of geometric constraints, Valgaerts et al. [34] introduced a variational model to simultaneously estimate the fundamental matrix and the optical flow. Wedel et al. [36] utilized a fundamental matrix prior as a weak constraint in a variational framework. Yamaguchi et al. [39] converted optical flow estimation into a 1D search problem by using precomputed fundamental matrices and a small-motion assumption. These methods, however, assume that the scene is mostly rigid (and thus a single fundamental matrix is sufficient to constrain the two-view geometry), and treat the dynamic parts as outliers [36]. Garg et al. [11] used a subspace constraint on multi-frame optical flow estimation as a regularization term. However, this approach assumes an affine camera model and works over entire sequences. Wulff et al. [38] used semantic information to split the scene into dynamic objects and static background, and only applied strong geometric constraints on the static background. Recently, inspired by multi-task learning, researchers have started to jointly estimate depth, camera poses and optical flow in a unified framework [26, 41, 44]. These works mainly leverage the consistency between the flow estimated by a flow network and the flow computed from poses and depth. This constraint only works for stationary scenes, and their performance is only comparable with unsupervised deep optical flow methods.

By contrast, our proposed method is able to handle both stationary and dynamic scenarios without explicitly computing fundamental matrices. This is achieved by introducing soft epipolar constraints derived from epipolar geometry, using low-rankness and union-of-subspaces properties. Converting these constraints to proper losses, we can exert global geometric constraints in optical flow learning and obtain much better performance.

3 Epipolar Constraints in Optical Flow

Optical flow aims at finding dense correspondences between two consecutive frames. Formally, let $I_t$ denote the image at time $t$, and $I_{t+1}$ the next image. For pixels $\mathbf{x}_i$ in $I_t$, we would like to find their correspondences $\mathbf{x}'_i$ in $I_{t+1}$. The displacement vectors $\mathbf{u}_i = \mathbf{x}'_i - \mathbf{x}_i$, $i = 1, \dots, N$ (with $N$ the total number of pixels in $I_t$), are the optical flow we would like to estimate.

Recall that in two-view epipolar geometry [13], using homogeneous coordinates, a pair of point correspondences $\tilde{\mathbf{x}}_i \leftrightarrow \tilde{\mathbf{x}}'_i$ in the two frames $I_t$ and $I_{t+1}$ is related by a fundamental matrix $F \in \mathbb{R}^{3 \times 3}$,

$\tilde{\mathbf{x}}'^{\top}_i F\, \tilde{\mathbf{x}}_i = 0. \qquad (1)$

In the following sections, we show how to enforce the epipolar constraint as a global regularizer in flow learning.

3.1 Two-view Geometric Constraint

Given estimated optical flow $\{\mathbf{u}_i\}$, we can convert it to a set of correspondences $\tilde{\mathbf{x}}_i$ and $\tilde{\mathbf{x}}'_i$ in $I_t$ and $I_{t+1}$ respectively. These corresponding points can then be used to compute a fundamental matrix $F$ with the normalized eight-point method [13]. Once $F$ is estimated, we can compute its fitting error. Directly optimizing Eq. (1) is not effective as it is only an algebraic error that does not reflect the real geometric distances. We could use the Gold Standard method [13] to compute the geometric distances, but it requires reconstructing the 3D point for every correspondence beforehand. Instead, we can use its first-order approximation, the Sampson distance, to represent the geometric error,

$d(\tilde{\mathbf{x}}_i, \tilde{\mathbf{x}}'_i; F) = \dfrac{(\tilde{\mathbf{x}}'^{\top}_i F \tilde{\mathbf{x}}_i)^2}{(F\tilde{\mathbf{x}}_i)_1^2 + (F\tilde{\mathbf{x}}_i)_2^2 + (F^{\top}\tilde{\mathbf{x}}'_i)_1^2 + (F^{\top}\tilde{\mathbf{x}}'_i)_2^2}, \qquad (2)$

where $(\cdot)_k$ denotes the $k$-th entry of a vector.

The difficulty of optimizing this distance comes from its chicken-and-egg character: it consists of two mutually interlocked sub-problems, i.e., estimating a fundamental matrix $F$ from the current flow estimate and updating the flow to comply with $F$. Such an alternating scheme therefore relies heavily on proper initialization.
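To make Eq. (2) concrete, the sketch below evaluates the Sampson distance for a batch of correspondences in PyTorch. The function name and the small constant added to the denominator are our own illustrative choices, not details from the paper.

```python
import torch

def sampson_distance(F, x1, x2, eps=1e-8):
    """Per-correspondence Sampson distance of Eq. (2).

    F:  (3, 3) fundamental matrix.
    x1: (N, 2) pixel coordinates in frame t.
    x2: (N, 2) corresponding coordinates in frame t+1 (x1 + flow).
    """
    ones = torch.ones_like(x1[:, :1])
    p1 = torch.cat([x1, ones], dim=1)        # homogeneous points, (N, 3)
    p2 = torch.cat([x2, ones], dim=1)

    Fp1 = p1 @ F.t()                          # rows are F @ p1_i
    Ftp2 = p2 @ F                             # rows are F^T @ p2_i
    algebraic = (p2 * Fp1).sum(dim=1)         # p2_i^T F p1_i

    denom = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return algebraic**2 / (denom + eps)
```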

Up to now, we have only considered the static-scene scenario, where only ego-motion exists. In a multi-motion scene, this method requires estimating an $F$ for each motion, which in turn needs a motion segmentation step. It is still feasible to address this problem by iteratively solving three sub-tasks: (i) update the flow estimate; (ii) estimate an $F$ for each rigid motion given the current motion segmentation; (iii) update the motion segmentation by assigning each point to its nearest $F$.

However, this method has several inherent limitations. First, the number of motions needs to be known a priori, which is almost impossible in general optical flow estimation. Second, the method is sensitive to the quality of the initial optical flow and motion labels: incorrectly estimated flow may produce a wrong $F$, which in turn drives the flow towards a wrong solution and makes the estimate even worse. Third, the motion segmentation step is non-differentiable, so end-to-end learning becomes impossible with it.

To overcome these drawbacks, we formulate two soft epipolar constraints using low-rankness and union-of-subspaces properties, and show that these constraints can easily be included as extra losses to regularize the network learning.

3.2 Low-rank Constraint

In this section, we show that it is possible to enforce a soft epipolar constraint without explicitly computing the fundamental matrix in a static scene.

Note that we can rewrite the epipolar constraint in Eq. (1) as

$\mathbf{w}_i^{\top}\mathbf{f} = 0, \qquad (3)$

where $\mathbf{f} \in \mathbb{R}^{9}$ is the row-wise vectorized fundamental matrix $F$ and

$\mathbf{w}_i = [\,x'_i x_i,\; x'_i y_i,\; x'_i,\; y'_i x_i,\; y'_i y_i,\; y'_i,\; x_i,\; y_i,\; 1\,]^{\top}, \qquad (4)$

with $\tilde{\mathbf{x}}_i = (x_i, y_i, 1)^{\top}$ and $\tilde{\mathbf{x}}'_i = (x'_i, y'_i, 1)^{\top}$.

Observe that $\mathbf{w}_i$ lies in a subspace (of dimension up to eight), called the epipolar subspace [17]. Let us define $W = [\mathbf{w}_1, \dots, \mathbf{w}_N]^{\top} \in \mathbb{R}^{N \times 9}$. Then the data matrix $W$ should be low-rank. This provides a possible way of regularizing optical flow estimation via rank minimization instead of explicitly computing $F$. Specifically, we can formulate a loss as

$\mathcal{L}_{\mathrm{rank}} = \operatorname{rank}(W), \qquad (5)$

which is unfortunately non-differentiable and thus cannot serve as a loss for flow network training. Fortunately, we can still use its convex surrogate, the nuclear norm, to form a loss as

$\mathcal{L}_{\mathrm{low\text{-}rank}} = \|W\|_{*}, \qquad (6)$

where the nuclear norm $\|W\|_{*}$, the sum of the singular values of $W$, can be computed by performing a singular value decomposition (SVD) of $W$. Note that the SVD operation is differentiable and has been implemented in modern deep learning toolboxes such as TensorFlow and PyTorch, so this nuclear norm loss can easily be incorporated into network training. We also note that although this low-rank constraint is derived from the epipolar geometry described by a fundamental matrix, it still applies in degenerate cases where a fundamental matrix does not exist. For example, when the motion is all zero or purely rotational, or the scene is fully planar, $W$ will have rank six; under certain special motions, e.g., an object moving parallel to the image plane, $W$ will have rank seven.

Compared to the original epipolar constraint, one may be concerned that this low-rank constraint is too loose to be effective, especially since the ambient space dimension is only nine. Although a thorough theoretical analysis is beyond the scope of this paper (interested readers may refer to, e.g., [27]), we will show in our experiments that this loss improves model performance by a significant margin when trained on data with mostly static scenes. However, the loss becomes ineffective when a scene contains more than one motion, as the matrix $W$ will then be full-rank.
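As an illustration of Eqs. (4)–(6), the following PyTorch sketch builds the matrix $W$ from sampled correspondences and evaluates the nuclear-norm loss through a differentiable SVD. The function names are ours, and the coordinate normalization mentioned in the comment is standard practice we assume rather than a detail taken from the paper.

```python
import torch

def epipolar_matrix(x1, x2):
    """Stack the 9-D epipolar vectors w_i of Eq. (4) into W of shape (N, 9)."""
    x, y = x1[:, 0], x1[:, 1]        # points in frame t
    xp, yp = x2[:, 0], x2[:, 1]      # matched points in frame t+1 (x1 + flow)
    one = torch.ones_like(x)
    return torch.stack(
        [xp * x, xp * y, xp, yp * x, yp * y, yp, x, y, one], dim=1)

def low_rank_loss(x1, flow_samples):
    """Nuclear-norm surrogate of rank(W), Eq. (6); differentiable through the SVD."""
    W = epipolar_matrix(x1, x1 + flow_samples)
    # Normalizing the pixel coordinates (as in the eight-point method) keeps the
    # singular values well conditioned; we omit it here for brevity.
    return torch.linalg.svdvals(W).sum()      # ||W||_* = sum of singular values
```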

3.3 Union-of-Subspaces Constraint

Figure 1: Motion segmentation and visualization of the affinity matrix (constructed from $C$). The scene contains three motions, annotated by three different colors: the ego-motion and the movements of the two cars. On the right, we show the affinity matrix constructed from $C$, which contains three diagonal blocks corresponding to these three motions. On the bottom left, we illustrate our estimated optical flow, and the top-left image shows that all three motions are correctly segmented based on $C$. The sparse dots on the image are the 2,000 sampled points used to compute $C$. This demonstrates that our union-of-subspaces constraint works under multi-body scenarios.

In this section, we introduce another soft epipolar constraint, namely union-of-subspaces constraint, which can be applied in broader cases.

From Eq. (4), it is not hard to observe that the vectors $\mathbf{w}_i$ arising from one rigid motion lie in a common epipolar subspace, because they all share the same fundamental matrix. When there are multiple motions in a scene, the $\mathbf{w}_i$ will lie in a union of subspaces. Note that this union-of-subspaces structure has been shown to be useful for motion segmentation from two perspective images [20]. Here, we re-formulate it for optical flow learning and derive an effective loss with a closed-form solution.

In particular, the union-of-subspaces structure can be characterized by the self-expressiveness property [10], i.e., a data point in one subspace can be represented by a linear combination of other points from the same subspace. This has been translated into a mathematical optimization problem [22, 18] as

$\min_{C} \|C\|_F^2 \quad \text{s.t.} \quad W = CW, \qquad (7)$

where $C \in \mathbb{R}^{N \times N}$ is the matrix of subspace self-expression coefficients and $W$ is a matrix function of the estimated flows (its rows are the vectors $\mathbf{w}_i$ of Eq. (4)). Note that in the subspace clustering literature, other norms on $C$ have also been used, e.g., the nuclear norm in [21] and the $\ell_1$ norm in [10]. We are particularly interested in the Frobenius norm regularization due to its simplicity and its equivalence to nuclear norm optimization [18], which is crucial for formulating an effective loss for CNN training.

However, in real-world scenarios, the flow estimation inevitably contains noise. Therefore, we relax the constraint in Eq. (7) and instead optimize the function below:

$\min_{C} \;\tfrac{1}{2}\|W - CW\|_F^2 + \tfrac{\lambda}{2}\|C\|_F^2. \qquad (8)$

Instead of using an iterative solver, given $W$ we can derive a closed-form solution for $C$, i.e.,

$C^{*} = WW^{\top}\big(WW^{\top} + \lambda I\big)^{-1}. \qquad (9)$

Plugging the solution $C^{*}$ back into Eq. (8), we arrive at our final union-of-subspaces loss term, which only depends on the estimated flow:

$\mathcal{L}_{\mathrm{sub}} = \tfrac{1}{2}\|W - C^{*}W\|_F^2 + \tfrac{\lambda}{2}\|C^{*}\|_F^2. \qquad (10)$

Directly applying this loss to the whole image would lead to GPU memory overflow due to the computation of the $N \times N$ matrix $C$ (with $N$ the number of pixels in an image). To avoid this, we employ a random sampling strategy: we sample 2,000 flow points from a flow map and compute the loss on these samples. This strategy is valid because random sampling does not change the intrinsic subspace structure of the data.
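A minimal sketch of the subspace loss is given below, assuming the `epipolar_matrix` helper from the low-rank sketch above; the value of the regularization weight `lam` is illustrative, as the paper does not state it here.

```python
import torch

def subspace_loss(x1, flow, lam=1e-3, num_samples=2000):
    """Union-of-subspaces loss of Eq. (10) with the closed-form C* of Eq. (9).

    x1: (N, 2) pixel coordinates; flow: (N, 2) estimated displacements.
    """
    idx = torch.randperm(x1.shape[0], device=x1.device)[:num_samples]
    W = epipolar_matrix(x1[idx], x1[idx] + flow[idx])   # (n, 9), rows are w_i
    G = W @ W.t()                                       # (n, n) Gram matrix
    eye = torch.eye(G.shape[0], device=G.device, dtype=G.dtype)
    C = G @ torch.linalg.inv(G + lam * eye)             # Eq. (9)
    residual = W - C @ W
    return 0.5 * residual.pow(2).sum() + 0.5 * lam * C.pow(2).sum()  # Eq. (8) at C*
```

Since the loss at $C^{*}$ equals $\tfrac{\lambda}{2}\sum_i \sigma_i^2/(\sigma_i^2+\lambda)$, with $\sigma_i$ the singular values of $W$, one could equivalently compute it from `torch.linalg.svdvals(W)` and avoid forming the $n \times n$ matrix $C$ explicitly.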

We remark that this subspace loss requires no prior knowledge of the number of motions in a scene, so it can be used to train a flow network on a motion-agnostic dataset. In a single-motion case, it works similarly to the low-rank loss, since the optimal loss value is closely related to the rank of $W$ [18]. In a multi-motion case, as long as the epipolar subspaces are disjoint and the principal angles between them are below certain thresholds [9], this loss can still serve as a global regularizer. Even when the scene is highly non-rigid or dynamic, unlike the hard epipolar constraint, this loss will not be detrimental to the network training, because it takes the same value for ground-truth flows as for wrong flows. In Fig. 1, we show the results on a typical image pair from KITTI using this constraint, demonstrating the effectiveness of our method.

4 Unsupervised Learning of Optical Flow

We formulate our unsupervised optical flow estimation approach as the optimization of image-based losses and epipolar constraint losses. In unsupervised optical flow estimation, only the photometric loss provides a data term. Additionally, we use a smoothness term and our epipolar constraint term as regularization. Our overall loss is a linear combination of these three losses,

$\mathcal{L} = \mathcal{L}_{\mathrm{photo}} + \lambda_{s}\,\mathcal{L}_{\mathrm{smooth}} + \lambda_{e}\,\mathcal{L}_{\mathrm{epi}}, \qquad (11)$

where $\lambda_{s}$ and $\lambda_{e}$ are the weights of the two regularization terms, set empirically, and $\mathcal{L}_{\mathrm{epi}}$ is one of the epipolar losses introduced in Section 3.

4.1 Image Warping Loss

Similarly to conventional methods, we leverage the popular brightness constancy assumption, i.e., corresponding pixels in the two frames should have similar intensities, colors and gradients. Our photometric error is then defined by the difference between the reference frame and the target frame warped according to the estimated flow.

In [31], the authors target the case in which the illumination may change from frame to frame and propose a bidirectional census transform to handle it robustly. We adopt this idea in our photometric error. Our photometric loss is therefore a weighted sum of a pixel intensity (or color) loss $\mathcal{L}_{\mathrm{int}}$, an image gradient loss $\mathcal{L}_{\mathrm{grad}}$ and a bidirectional census loss $\mathcal{L}_{\mathrm{census}}$,

$\mathcal{L}_{\mathrm{photo}} = \lambda_{1}\,\mathcal{L}_{\mathrm{int}} + \lambda_{2}\,\mathcal{L}_{\mathrm{grad}} + \lambda_{3}\,\mathcal{L}_{\mathrm{census}}, \qquad (12)$

where $\lambda_{1}$, $\lambda_{2}$ and $\lambda_{3}$ are the weights of the three terms.

Inspired by [35], we only compute our photometric loss on non-occluded areas and normalize it by the number of non-occluded pixels. We determine whether a pixel is occluded by a forward-backward consistency check: if the sum of its forward and backward flows exceeds a threshold, we mark the pixel as occluded. The same threshold is used in all experiments.
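A possible implementation of the forward-backward consistency check is sketched below, assuming flows are stored as (B, 2, H, W) tensors with the horizontal component in channel 0; the warping helper and the threshold value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Bilinearly sample img (B, C, H, W) at locations displaced by flow (B, 2, H, W)."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=img.device),
                            torch.arange(W, device=img.device), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float()          # (2, H, W), x then y
    target = base.unsqueeze(0) + flow                    # absolute sampling coordinates
    gx = 2.0 * target[:, 0] / (W - 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * target[:, 1] / (H - 1) - 1.0
    grid = torch.stack([gx, gy], dim=3)                  # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def occlusion_mask(flow_fw, flow_bw, threshold=1.0):
    """Forward-backward consistency check; returns 1 for non-occluded pixels."""
    flow_bw_warped = warp(flow_bw, flow_fw)              # backward flow at forward targets
    fb_sum = flow_fw + flow_bw_warped                    # ~0 where matches are consistent
    occluded = fb_sum.norm(dim=1, keepdim=True) > threshold
    return (~occluded).float()
```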

Our photometric loss terms are thus defined as follows:

$\mathcal{L}_{\mathrm{int}} = \frac{1}{|\mathcal{V}|}\sum_{\mathbf{x}\in\mathcal{V}} \rho\big(I_t(\mathbf{x}) - \hat{I}_t(\mathbf{x})\big), \qquad (13)$
$\mathcal{L}_{\mathrm{grad}} = \frac{1}{|\mathcal{V}|}\sum_{\mathbf{x}\in\mathcal{V}} \rho\big(\nabla I_t(\mathbf{x}) - \nabla \hat{I}_t(\mathbf{x})\big), \qquad (14)$
$\mathcal{L}_{\mathrm{census}} = \frac{1}{|\mathcal{V}|}\sum_{\mathbf{x}\in\mathcal{V}} \rho\big(\mathcal{C}(I_t)(\mathbf{x}) - \mathcal{C}(\hat{I}_t)(\mathbf{x})\big), \qquad (15)$

where $\mathcal{V}$ is the set of non-occluded pixels, $\mathcal{C}(\cdot)$ denotes the (soft) census transform, $\hat{I}_t$ is computed by warping $I_{t+1}$ to the reference frame with the estimated flow, and, following [35], $\rho(\cdot)$ is a robust Charbonnier penalty used to evaluate the differences.
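Building on the previous sketch, the intensity term in the spirit of Eq. (13) could look as follows; the Charbonnier parameters are common choices from the flow literature rather than values reported in the paper.

```python
def charbonnier(x, alpha=0.45, eps=1e-3):
    """Robust penalty rho(x) = (x^2 + eps^2)^alpha; parameter values are illustrative."""
    return (x * x + eps * eps) ** alpha

def masked_photometric_loss(ref, warped, mask):
    """Robust intensity difference, averaged over non-occluded pixels only."""
    diff = charbonnier(ref - warped) * mask
    return diff.sum() / mask.sum().clamp(min=1.0)
```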

4.2 Smoothness Loss

Commonly, there are two kinds of smoothness priors in conventional optical flow estimation: one is piece-wise planar, and the other is piece-wise linear. The first can be implemented by penalizing the first-order derivative of the recovered optical flow, and the latter by the second-order derivative. For most rigid scenes, the piece-wise planar model provides a better interpolation, but for deformable cases, the piece-wise linear model is better suited. Therefore, we use a combination of these two models as our smoothness regularization term. We further assume that edges in the optical flow coincide with edges in the reference color image. Formally, our image-guided smoothness term can be defined as:

$\mathcal{L}_{\mathrm{smooth}} = \sum_{\mathbf{x}} e^{-\alpha\,|\nabla I_t(\mathbf{x})|}\Big(\big|\nabla U(\mathbf{x})\big| + \big|\nabla^{2} U(\mathbf{x})\big|\Big), \qquad (16)$

where $\nabla$ and $\nabla^{2}$ denote the first- and second-order derivatives, $\alpha$ controls the edge sensitivity, and $U$ is the matrix form of the flow $\mathbf{u}$.
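One way to realize an edge-aware combination of first- and second-order smoothness in the spirit of Eq. (16) is sketched below; the exact weighting scheme and the value of `alpha` are assumptions on our part.

```python
import torch

def smoothness_loss(flow, image, alpha=10.0):
    """Edge-aware first- and second-order smoothness on a (B, 2, H, W) flow field."""
    def dx(t):
        return t[:, :, :, 1:] - t[:, :, :, :-1]

    def dy(t):
        return t[:, :, 1:, :] - t[:, :, :-1, :]

    # relax the smoothness penalty across strong color edges of the reference image
    wx = torch.exp(-alpha * dx(image).abs().mean(dim=1, keepdim=True))
    wy = torch.exp(-alpha * dy(image).abs().mean(dim=1, keepdim=True))

    first = (wx * dx(flow).abs()).mean() + (wy * dy(flow).abs()).mean()
    second = (wx[:, :, :, 1:] * dx(dx(flow)).abs()).mean() + \
             (wy[:, :, 1:, :] * dy(dy(flow)).abs()).mean()
    return first + second
```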

5 Experiments

We evaluate our methods on standard optical flow benchmarks, including KITTI [12, 23], MPI-Sintel [6], Flying Chairs [7], and Middlebury [3]. We compare our results with existing optical flow estimation methods based on standard metrics, i.e., endpoint error (EPE) and percentage of optical flow outliers (Fl). We denote our method as EPIFlow.

5.1 Implementation details.

Architecture and Parameters.

We implemented our EPIFlow network in an end-to-end manner, adopting the architecture of PWC-Net [33] as our base network due to its state-of-the-art performance. The original PWC-Net uses a pyramid structure and learns at five different scales. However, a warping error is ineffective at low resolutions. Therefore, we pick the highest-resolution output, upsample it to the input resolution by bilinear interpolation, and compute our self-supervised learning losses only at that scale. The learning rates for initial training (from scratch) and for fine-tuning are set separately. Depending on the resolution of the input images, the batch size is 4 or 8. We use the same data augmentation scheme as proposed in FlowNet2 [8]. During training, our network typically takes 0.07 to 0.25 seconds per frame, depending on the input image size and the losses used, and around 0.04 seconds per frame at evaluation. The experiments were run on a regular computer equipped with a Titan XP GPU. EPIFlow is significantly faster than conventional methods.
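The bilinear upsampling step can be written compactly as below. Note that when a coarse flow field is resized to the input resolution, the displacement values also need to be rescaled; this is a standard detail we assume here, as the paper does not spell it out.

```python
import torch
import torch.nn.functional as F

def upsample_flow(flow, target_size):
    """Bilinearly upsample a coarse flow prediction (B, 2, h, w) to (B, 2, H, W)."""
    h, w = flow.shape[-2:]
    H, W = target_size
    up = F.interpolate(flow, size=(H, W), mode="bilinear", align_corners=True)
    # rescale displacements so they stay expressed in full-resolution pixels
    scale = torch.tensor([W / w, H / h], device=flow.device).view(1, 2, 1, 1)
    return up * scale
```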

KITTI 2012 KITTI 2015 Sintel Clean Sintel Final
Method EPE(all) EPE(noc) EPE(all) EPE(noc) Flall EPE(all) EPE(all)
train test train test train train test train test train test

Non-deep

EpicFlow [29] 3.47 3.8 1.5 9.27 26.29% 2.27 4.11 3.56 6.29
MRFlow [38] 12.19% (1.83) 2.53 (3.59) 5.38

Supervised

SpyNet-ft [25] (4.13) 4.1 2.0 35.07% (3.17) 6.64 (4.32) 8.36
FlowNet2-ft [8] (1.28) 1.8 1.0 2.30 10.41% (1.45) 4.16 (2.01) 5.74
PWC-Net [33] 4.14 10.35 2.55 3.93
PWC-Net-ft [33] (1.45) 1.7 0.9 (2.16) 9.60% (1.70) 3.86 (2.21) 5.17

Unsupervised

UnsupFlownet [42] (11.30) 9.9 (4.30) 4.6
DSTFlow-ft [28] (10.43) 12.4 (3.29) 4.0 (16.79) (6.96) 39.00% (6.16) 10.41 (7.38) 11.28
DF-Net-ft [44] (3.54) 4.4 (8.98) 25.70%
GeoNet [41] 10.81 8.05
UnFlow [31] (3.29) (1.26) (8.10) 9.38 7.91 10.21
OAFlow-ft [35] (3.55) 4.2 (8.88) 31.20% (4.03) 7.95 (5.95) 9.15
CCFlow [26] (5.66) 25.27%
Back2Future-ft [16] (6.59) (3.22) 22.94% (3.89) 7.23 (5.52) 8.81
Our-baseline 3.23 1.04 7.93 4.21 6.72 7.31
Our-gtF 2.61 1.04 6.03 2.89 6.15 6.71
Our-F 2.56 0.97 6.42 3.09 6.21 6.73
Our-low-rank 2.63 1.07 5.91 3.03 6.39 6.96
Our-sub 2.62 1.03 6.02 2.98 6.15 6.83
Our-sub-test-ft 2.61 (3.2) 1.03 (1.1) 5.56 2.56 (16.24%) 3.94 (6.84) 5.08 (8.33)
Our-sub-train-ft (2.51) 3.4 (0.99) 1.3 (5.55) (2.46) 16.95% (3.54) 7.00 (4.99) 8.51
Table 1: Performance comparison on the KITTI and Sintel optical flow benchmarks. EPE(noc) is the average endpoint error over non-occluded regions, while EPE(all) is over all pixels. The KITTI 2015 test set is evaluated by the percentage of flow outliers (Fl). The baseline, gtF, F, low-rank, and sub models were trained on the KITTI VO dataset. Parentheses indicate models evaluated on the same data they were trained on, and missing entries indicate that results were not reported. Note that the current state-of-the-art unsupervised method, Back2Future Flow [16], uses three frames as input. Best results are marked in bold.

Pre-training.

We pre-trained our network on the Flying Chairs dataset using a weighted combination of the warping loss and the smoothness loss. Flying Chairs is a synthetic dataset consisting of rendered chairs superimposed on real-world Flickr images. Training on such a large-scale synthetic dataset allows the network to learn the general concepts of optical flow before handling complicated real-world conditions, e.g., changing illumination or motion. To avoid trivial solutions, we disabled the occlusion-aware term at the beginning of training (i.e., the first two epochs); otherwise, the network would generate all-zero occlusion masks, which invalidate the losses. The pre-training took roughly forty hours, and the resulting model was used as the initial model for the other datasets.

5.2 Datasets

KITTI Visual Odometry (VO) Dataset.

The KITTI VO dataset contains 22 calibrated sequences with 87,060 consecutive pairs of real-world images. Ground-truth poses are available for the first 11 sequences. We fine-tuned our initial model on the KITTI VO dataset using various loss combinations. We chose this dataset for two reasons: (1) it provides ground-truth camera poses for every frame, which simplifies the analysis of network performance, and (2) most scenes in the KITTI VO dataset are stationary and can thus be described by a single ego-motion. The relative pose between a pair of images and the camera calibration can be used to compute fundamental matrices. To compare our various models fairly, we use the first 11 sequences as our training set.
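For the models trained with ground-truth fundamental matrices (the "-gtF" variant described later), $F$ can be obtained from the provided relative pose and calibration via the standard relation $F = K^{-\top}[\mathbf{t}]_{\times} R K^{-1}$. A sketch is given below, assuming the same intrinsics $K$ for both frames and a relative pose $(R, \mathbf{t})$ mapping frame-$t$ coordinates into frame-$(t{+}1)$ coordinates.

```python
import torch

def skew(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v = t x v."""
    t0, t1, t2 = float(t[0]), float(t[1]), float(t[2])
    return torch.tensor([[0.0, -t2,  t1],
                         [ t2, 0.0, -t0],
                         [-t1,  t0, 0.0]])

def fundamental_from_pose(K, R, t):
    """F = K^{-T} [t]_x R K^{-1}, with shared intrinsics K for both frames."""
    K_inv = torch.linalg.inv(K)
    E = skew(t) @ R                      # essential matrix
    F = K_inv.t() @ E @ K_inv
    return F / F.norm()                  # F is only defined up to scale
```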

KITTI Optical Flow Dataset.

The KITTI optical flow dataset contains two subsets: KITTI 2012 and KITTI 2015, where the first one mostly contains stationary scenes and the latter one includes more dynamic scenes. KITTI 2012 provides 194 annotated image pairs for training and 195 pairs for testing while KITTI 2015 provides 200 pairs for training and 200 pairs for testing. Our training did not use the KITTI datasets’ multiple-view extension images.

MPI Sintel Dataset.

The MPI Sintel dataset provides naturalistic frames captured from an open-source movie. It contains 1,041 training image pairs with ground-truth optical flow and pixel-wise occlusion masks, and provides 552 image pairs for benchmark testing. The scenes of the MPI Sintel dataset are rendered at two levels of complexity (Clean and Final). Unlike the KITTI datasets, most scenes in the Sintel dataset are highly dynamic.

Input Ours Back2Future [16] Our Error Back2Future Error [16]
Figure 2: Qualitative results on the KITTI 2015 test dataset. We compare our method with Back2Future Flow [16]. The second column contains the flows estimated by our Our-sub-ft model, while the third column contains the results of Back2Future Flow. The flow error visualization is also provided, where correct estimates are depicted in blue and wrong ones in red. Consistent with the quantitative analysis, our results are visually better on structural boundaries.
Input Ours Back2Future [16] Our Error Back2Future Error [16]
Figure 3: Qualitative results on the MPI Sintel dataset. This figure shares the same layout with Fig. 2 except the top two rows are from the Final set and the two bottom rows are from the Clean set. The errors are visualized in gray on the Sintel benchmark.

5.3 Quantitative and Qualitative Results

We use the suffix “-baseline” to indicate our baseline model, trained using only the photometric and smoothness losses. “-F” denotes the model trained with the hard fundamental matrix constraint using an estimated F, while “-gtF” means the ground-truth fundamental matrix was used. “-low-rank” refers to the model applying the low-rank constraint, and “-sub” to the model using our subspace constraint. “-ft” denotes a model fine-tuned on the corresponding dataset.

Input Our-baseline Our-F Our-low-rank Our-sub
Figure 4: Endpoint error performance of our various models on the KITTI 2015 training dataset. We compared Our-baseline, Our-F, Our-low-rank, and Our-sub models on the KITTI 2015 dataset to analyze their performance when handling dynamic objects. The results of the Our-sub model are much better.

KITTI VO training results.

We report the results of models trained on the KITTI VO dataset in Table 1, where they are compared with various state-of-the-art methods. Our methods outperform all previous learning-based unsupervised optical flow methods by a notable margin. Note that most scenes in the KITTI VO dataset are stationary, and therefore the differences between Our-gtF, Our-F, Our-low-rank and Our-sub are small across these benchmarks.

Benchmark Fine-tuning Results.

We fine-tuned our models on each benchmark and report the results with the suffix “-ft” in Tables 1 and 2. For example, keeping the same hyper-parameters as before, we fine-tuned our models on the KITTI 2015 testing data. After fine-tuning, the Our-sub model shows a large performance improvement, achieving EPEs of 2.61 and 5.56 on the KITTI 2012 and KITTI 2015 training datasets respectively, which outperforms all deep unsupervised methods and many supervised methods. Similarly, on the MPI Sintel training dataset, the Our-sub-ft model performs best among the unsupervised methods, with an EPE of 3.94 on the Clean images and 5.08 on the Final images. Furthermore, on both the KITTI and Sintel testing benchmarks, our method outperforms the current state-of-the-art unsupervised method, Back2Future Flow, by a clear margin: we improve the best unsupervised Fl on KITTI 2015 from 22.94% to 16.24%, and the Our-sub-ft model achieves EPEs of 6.84 on the Sintel Clean test set and 8.33 on the Final set, results that unsupervised methods had not reached before. Additionally, it should be noted that Back2Future Flow is based on a multi-frame formulation, while our method only requires two frames. Our model is also competitive with some fine-tuned supervised networks, such as SpyNet.

Qualitatively, as shown in Fig. 2 and Fig. 3, compared with the results of Back2Future Flow, the shapes in our estimated flows are more structured and have more explicit boundaries, which represent motion discontinuities. This trend is also apparent in the flow error images. For example, on the KITTI 2015 dataset (Fig. 2), the results of Back2Future Flow usually exhibit larger error regions (shown in crimson) around the objects.

It should be noted that fine-tuning on the target datasets (e.g., KITTI 2015) does not bring a significant improvement, because we have already trained the models on the real-world KITTI VO dataset. The models have learned the general concepts of realistic optical flow, and fine-tuning merely helps them adapt to each dataset's characteristics. On the KITTI 2012 training set, the fine-tuned model achieves results very close to the Our-sub model (2.61 vs. 2.62 EPE). Fine-tuning on the Sintel Clean dataset improves the result from 6.15 to 3.94 EPE, because the Sintel Clean set renders the synthetic scenes at low complexity and its images are quite different from the real world.

5.4 Ablation study

Figure 5: Endpoint error over training epochs on the Sintel Final dataset. We illustrate the endpoint errors over the training epochs when using various combinations of constraints. For all three methods, training started from the same pre-trained model, Our-baseline. Combining the image warping and subspace constraints outperforms the other two methods, which is consistent with the final fine-tuned results reported in Table 2.
Method KITTI 2015 Sintel Final
EPE(all) EPE(noc) EPE(all)
Our-baseline-ft 6.16 2.85 5.87
Our-F-ft 6.19 2.85 NaN
Our-low-rank-ft 5.72 2.62 5.59
Our-sub-ft 5.56 2.56 5.08
Table 2: Comparison of fine-tuning results on the KITTI 2015 and Sintel Final training sets. We fine-tuned our models on the training sets of the KITTI 2015 and Sintel Final datasets. NaN indicates that the model did not converge.

The Our-F, Our-low-rank and Our-sub models all work well in stationary scenes and have similar quantitative performance. To further analyze their capability to handle general dynamic scenarios, we fine-tuned each model on the KITTI 2015 and Sintel Final datasets. Both involve multiple motions per image, while the Sintel scenes are more dynamic. As shown in Table 2, Our-sub handles dynamic scenarios best and achieves the lowest EPE on both benchmarks. The hard fundamental matrix constraint performs similarly to our baseline model but fails to converge on the Sintel dataset, where its EPE is reported as NaN; this is because a highly dynamic scene does not admit a single global fundamental matrix $F$. The low-rank constraint is not degraded by dynamic objects, but it also cannot gain information by modeling multiple motions. In Fig. 5, we provide the validation error curves over the early stages of training on the Sintel Final dataset. The subspace loss helps the model converge more quickly and reach a lower error than the other methods.

6 Conclusion

In this paper, we have proposed effective methods to enforce global epipolar geometry constraints for unsupervised optical flow learning. For a stationary scene, we applied the low-rank constraint to regularize a globally rigid structure. For general dynamic scenes (multi-body or deformable), we proposed to use the union-of-subspaces constraint. Experiments on various benchmark datasets have demonstrated the efficacy and superiority of our methods compared with state-of-the-art (unsupervised) deep flow methods. In the future, we plan to study the multi-frame extension, i.e., enforcing geometric constraints across multiple frames.

Acknowledgement This research was supported in part by the Australian Centre for Robotic Vision, Data61 CSIRO, the Natural Science Foundation of China grants (61871325, 61420106007), and the Australian Research Council (ARC) grants (LE190100080, CE140100016, DP190102261). The authors are grateful for the GPUs donated by NVIDIA.

References

  • [1] Luis Alvarez, Rachid Deriche, Théo Papadopoulo, and Javier Sánchez. Symmetrical dense optical flow estimation with occlusions detection. Int. J. Comp. Vis., 75(3):371–385, 2007.
  • [2] Gilles Aubert, Rachid Deriche, and Pierre Kornprobst. Computing optical flow via variational techniques. SIAM Journal on Applied Mathematics, 60(1):156–182, 1999.
  • [3] Simon Baker, Daniel Scharstein, J. P. Lewis, Stefan Roth, Michael J. Black, and Richard Szeliski. A database and evaluation methodology for optical flow. Int. J. Comp. Vis., 92(1):1–31, Mar 2011.
  • [4] Jean-Yves Bouguet. Pyramidal implementation of the affine lucas kanade feature tracker description of the algorithm. Intel Corporation, 5(1-10):4, 2001.
  • [5] Thomas Brox, Andrés Bruhn, Nils Papenberg, and Joachim Weickert. High accuracy optical flow estimation based on a theory for warping. In Proc. Eur. Conf. Comp. Vis., pages 25–36. Springer, 2004.
  • [6] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comp. Vis., pages 611–625, Oct. 2012.
  • [7] Alexey Dosovitskiy, Philipp Fischery, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pages 2758–2766, 2015.
  • [8] Ilg Eddy, Mayer Nikolaus, Saikia Tonmoy, Keuper Margret, Dosovitskiy Alexey, and Brox Thomas. Flownet 2.0: Evolution of optical flow estimation with deep networks. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., Jul 2017.
  • [9] Ehsan Elhamifar and René Vidal. Clustering disjoint subspaces via sparse representation. In IEEE International Conference on Acoustics Speech and Signal Processing, pages 1926–1929. IEEE, 2010.
  • [10] Ehsan Elhamifar and Rene Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765–2781, 2013.
  • [11] Ravi Garg, Luis Pizarro, Daniel Rueckert, and Lourdes Agapito. Dense multi-frame optic flow for non-rigid objects using subspace constraints. In Proc. Asian Conf. Comp. Vis., pages 460–473, 2011.
  • [12] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2012.
  • [13] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003.
  • [14] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • [15] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Proc. Adv. Neural Inf. Process. Syst., pages 2017–2025, 2015.
  • [16] Joel Janai, Fatma Güney, Anurag Ranjan, Michael J. Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In European Conference on Computer Vision (ECCV), volume Lecture Notes in Computer Science, vol 11220, pages 713–731. Springer, Cham, Sept. 2018.
  • [17] Pan Ji, Hongdong Li, Mathieu Salzmann, and Yiran Zhong. Robust multi-body feature tracker: a segmentation-free approach. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3843–3851, 2016.
  • [18] Pan Ji, Mathieu Salzmann, and Hongdong Li. Efficient dense subspace clustering. In IEEE Winter Conference on Applications of Computer Vision, pages 461–468. IEEE, 2014.
  • [19] Vladimir Kolmogorov and Ramin Zabih. Computing visual correspondence with occlusions via graph cuts. Technical report, Ithaca, NY, USA, 2001.
  • [20] Zhuwen Li, Jiaming Guo, Loong-Fah Cheong, and Steven Zhiying Zhou. Perspective motion segmentation via collaborative clustering. In Proc. IEEE Int. Conf. Comp. Vis., pages 1369–1376, 2013.
  • [21] Guangcan Liu, Zhouchen Lin, Shuicheng Yan, Ju Sun, Yong Yu, and Yi Ma. Robust recovery of subspace structures by low-rank representation. IEEE Trans. Pattern Anal. Mach. Intell., 35(1):171–184, 2013.
  • [22] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In Proc. Eur. Conf. Comp. Vis., pages 347–360. Springer, 2012.
  • [23] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2015.
  • [24] Moritz Menze, Christian Heipke, and Andreas Geiger. Discrete optimization for optical flow. In German Conference on Pattern Recognition, pages 16–28. Springer, 2015.
  • [25] Anurag Ranjan and Michael J. Black. Optical flow estimation using a spatial pyramid network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., July 2017.
  • [26] Anurag Ranjan, Varun Jampani, Kihwan Kim, Deqing Sun, Jonas Wulff, and Michael J Black. Adversarial collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. arXiv preprint arXiv:1805.09806, 2018.
  • [27] Benjamin Recht, Weiyu Xu, and Babak Hassibi. Necessary and sufficient conditions for success of the nuclear norm heuristic for rank minimization. In IEEE Conference on Decision and Control, pages 3065–3070. IEEE, 2008.
  • [28] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In Proc. AAAI Conf. Artificial Intelligence, volume 3, page 7, 2017.
  • [29] Jerome Revaud, Philippe Weinzaepfel, Zaid Harchaoui, and Cordelia Schmid. Epicflow: Edge-preserving interpolation of correspondences for optical flow. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2015.
  • [30] Leonid I Rudin, Stanley Osher, and Emad Fatemi. Nonlinear total variation based noise removal algorithms. Physica D: nonlinear phenomena, 60(1-4):259–268, 1992.
  • [31] Meister Simon, Hur Junhwa, and Roth Stefan. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proc. AAAI Conf. Artificial Intelligence, AAAI’18, 2018.
  • [32] Deqing Sun, Stefan Roth, and Michael J Black. Secrets of optical flow estimation and their principles. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2432–2439. IEEE, 2010.
  • [33] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., 2018.
  • [34] Levi Valgaerts, Andrés Bruhn, and Joachim Weickert. A variational model for the joint recovery of the fundamental matrix and the optical flow. In Proceedings of DAGM Symposium on Pattern Recognition, pages 314–324, 2008.
  • [35] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., June 2018.
  • [36] A. Wedel, D. Cremers, T. Pock, and H. Bischof. Structure- and motion-adaptive regularization for high accuracy optic flow. In Proc. IEEE Int. Conf. Comp. Vis., pages 1663–1668, Sept 2009.
  • [37] Andreas Wedel, Thomas Pock, Christopher Zach, Horst Bischof, and Daniel Cremers. An improved algorithm for tv-l 1 optical flow. In Statistical and geometrical approaches to visual motion analysis, pages 23–45. Springer, 2009.
  • [38] Jonas Wulff, Laura Sevilla-Lara, and Michael J. Black. Optical flow in mostly rigid scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., July 2017.
  • [39] K. Yamaguchi, D. McAllester, and R. Urtasun. Robust monocular epipolar flow estimation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1862–1869, June 2013.
  • [40] Jiaolong Yang and Hongdong Li. Dense, accurate optical flow estimation with piecewise parametric model. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 1019–1027, 2015.
  • [41] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., volume 2, 2018.
  • [42] Jason J. Yu, Adam W. Harley, and Konstantinos G. Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision (ECCV) Workshops, pages 3–10, 2016.
  • [43] Christopher Zach, Thomas Pock, and Horst Bischof. A duality based approach for realtime tv-l 1 optical flow. In Joint Pattern Recognition Symposium, pages 214–223. Springer, 2007.
  • [44] Yuliang Zou, Zelun Luo, and Jia-Bin Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proc. Eur. Conf. Comp. Vis., pages 38–55, 2018.