Caffe implementation for "Guided Optical Flow Learning"
We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data. Supervised CNNs, due to their immense learning capacity, have shown superior performance on a range of computer vision problems including optical flow prediction. They however require the ground truth flow which is usually not accessible except on limited synthetic data. Without the guidance of ground truth optical flow, unsupervised CNNs often perform worse as they are naturally ill-conditioned. We therefore propose a novel framework in which proxy ground truth data generated from classical approaches is used to guide the CNN learning. The models are further refined in an unsupervised fashion using an image reconstruction loss. Our guided learning approach is competitive with or superior to state-of-the-art approaches on three standard benchmark datasets yet is completely unsupervised and can run in real time.
TensorFlow implementation for "Guided Optical Flow Learning"
Optical flow contains valuable information for general image sequence analysis due to its capability to represent motion. It is widely used in vision tasks such as human action recognition [18, 22, 21], semantic segmentation, video frame prediction, and video object tracking.
Classical variational approaches remain top performers on a number of evaluation benchmarks; however, most of them are too slow to be used in real time applications. Following the great success of convolutional neural networks (CNNs), several works [7, 16] have proposed using CNNs to estimate the motion between image pairs and have achieved promising results. Although they are much more efficient than classical approaches, these methods require supervision and cannot be applied to real world data where the ground truth is not easily accessible. Thus, some recent works [1, 20, 23] have investigated unsupervised learning through novel loss functions, but these often perform worse than supervised methods.
To improve the accuracy of unsupervised CNNs for optical flow estimation, we propose to use the results of classical methods as guidance for the unsupervised learning process. We refer to this novel approach as guided optical flow learning, as shown in Fig. 1. Specifically, there are two stages: (i) we generate proxy ground truth flow using a classical approach and train a supervised CNN with it; (ii) we fine tune the learned model by minimizing an image reconstruction loss. By training the CNN on proxy ground truth, we hope to provide a good initialization point for subsequent network learning. By fine tuning the model on the target dataset, we hope to reduce the risk that the CNN merely reproduces the failure cases of the classical approach. The entire learning framework is thus unsupervised.
Our contributions are two-fold. First, we demonstrate that supervised CNNs can learn to estimate optical flow well even when only guided using noisy proxy ground truth data generated from classical methods. Second, we show that fine tuning the learned models for target datasets by minimizing a reconstruction loss further improves performance. Our proposed guided learning is completely unsupervised and achieves competitive or superior performance to state-of-the-art, real time approaches on standard benchmarks.
Given an adjacent frame pair I1 and I2, our goal is to learn a model that can estimate the per-pixel motion field (U, V) between the two images accurately and efficiently, where U and V are the horizontal and vertical displacements, respectively. We describe our proxy ground truth guided framework in Section 2.1 and the unsupervised fine tuning strategy in Section 2.2.
Current approaches to the supervised training of CNNs for optical flow estimation use synthetic ground truth datasets. The synthetic motions and scenes are quite different from real ones, which limits the generalizability of the learned models, and even constructing a synthetic dataset requires considerable manual effort. The largest current synthetic datasets with dense ground truth optical flow, Flying Chairs and FlyingThings3D, each consist of only a few tens of thousands of image pairs, which is not ideal for deep learning, especially for a problem as ill-conditioned as motion estimation. For CNN-based optical flow estimation to reach its full potential, a learning framework is needed that can scale the size of the training data. Unsupervised learning is one way to achieve this scaling because it does not require ground truth flow.
Classical approaches to optical flow estimation are unsupervised in that there is no learning process involved [11, 4, 5, 2, 12]. They only require the image pairs as input, with some extra assumptions (like image brightness constancy, gradient constancy, smoothness) and information (like motion boundaries, dense image matching). These non-CNN based classical methods currently achieve the best performance on standard benchmarks and are thus considered the state-of-the-art. Inspired by their good performance, we conjecture that these approaches can be used to generate proxy ground truth data for training CNN-based optical flow estimators.
In this work, we choose FlowFields  as our classical optical flow estimator. To our knowledge, it is one of the most accurate flow estimators among the published work. We hope that by using FlowFields to generate proxy ground truth, we can learn to estimate motion between image pairs as effectively as using the true ground truth.
For fair comparison, we use the "FlowNet Simple" network as described in  as our supervised CNN architecture. This allows us to compare our guided learning approach to training with the true ground truth, particularly with respect to how well the learned models generalize to other datasets. We use the endpoint error (EPE) as our guided loss, since it is the standard error measure for optical flow evaluation:
L_epe = (1/N) * sum_(i,j) sqrt( (U_(i,j) - U'_(i,j))^2 + (V_(i,j) - V'_(i,j))^2 ),

where N denotes the total number of pixels, (U, V) is the proxy ground truth flow field, and (U', V') is the flow estimate from the CNN.
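As a concrete illustration, the average EPE loss can be sketched in NumPy; the function and argument names are illustrative and not taken from the paper's Caffe code:

```python
import numpy as np

def epe_loss(u_est, v_est, u_proxy, v_proxy):
    """Average endpoint error between the CNN's flow estimate
    (u_est, v_est) and the proxy ground truth flow (u_proxy, v_proxy),
    all given as H x W arrays."""
    return np.sqrt((u_est - u_proxy) ** 2 + (v_est - v_proxy) ** 2).mean()
```

A flow estimate that is off by exactly one pixel horizontally at every location, for example, yields an average EPE of 1.0.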
As stated in Section 1, a potential drawback to using classical approaches to create training data is that the quality of this data is necessarily limited by the accuracy of the estimator. If a classical approach fails to detect certain motion patterns, a network trained on the proxy ground truth is also likely to miss those patterns. This leads us to ask whether other unsupervised guidance can improve the network training.
The unsupervised approach of  treats optical flow estimation as an image reconstruction problem, based on the intuition that if the estimated flow and the next frame can be used to reconstruct the current frame, then the network has learned a useful representation of the underlying motion. During training, the loss is computed as the photometric error between the true current frame and the inverse-warped next frame:

L_recon = (1/N) * sum_(i,j) rho( I1(i, j) - I2(i + U_(i,j), j + V_(i,j)) ),

where the inverse warp is performed using a spatial transformer module  inside the CNN. We use a robust convex error function, the generalized Charbonnier penalty rho(x) = (x^2 + eps^2)^alpha, to reduce the influence of outliers. This reconstruction loss is similar to the brightness constancy objective in classical variational formulations but is quite different from the EPE loss used in the proxy ground truth guided learning. We thus propose fine tuning our model using this reconstruction loss as an additional unsupervised guide.
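A minimal NumPy sketch of this reconstruction loss, with a hand-rolled bilinear warp standing in for the spatial transformer module; the function names and the Charbonnier constants (alpha, eps) are illustrative assumptions, not values from the paper:

```python
import numpy as np

def charbonnier(x, alpha=0.45, eps=1e-3):
    # Generalized Charbonnier penalty; alpha and eps are illustrative defaults.
    return (x ** 2 + eps ** 2) ** alpha

def inverse_warp(img2, u, v):
    """Bilinearly sample img2 at positions shifted by the flow (u, v),
    approximating the spatial-transformer warp inside the CNN."""
    h, w = img2.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x = np.clip(xs + u, 0, w - 1.001)  # clamp so x0 + 1 stays in bounds
    y = np.clip(ys + v, 0, h - 1.001)
    x0, y0 = np.floor(x).astype(int), np.floor(y).astype(int)
    wx, wy = x - x0, y - y0
    return ((1 - wy) * (1 - wx) * img2[y0, x0]
            + (1 - wy) * wx * img2[y0, x0 + 1]
            + wy * (1 - wx) * img2[y0 + 1, x0]
            + wy * wx * img2[y0 + 1, x0 + 1])

def reconstruction_loss(img1, img2, u, v):
    # Photometric error between frame 1 and the inverse-warped frame 2.
    return charbonnier(img1 - inverse_warp(img2, u, v)).mean()
```

With a perfect flow estimate the warped next frame matches the current frame, so the penalty reduces to roughly charbonnier(0) at every pixel.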
During fine tuning, the total energy we aim to minimize is a simple weighted sum of the EPE loss and the image reconstruction loss:

E = L_epe + lambda * L_recon,

where lambda controls the level of reconstruction guidance. Note that additional unsupervised guides, such as a gradient constancy assumption or an edge-aware weighted smoothness loss , could be added to further fine tune our models.
An overview of our guided learning framework with both the proxy ground truth guidance and the unsupervised fine tuning is illustrated in Fig. 1.
Flying Chairs  is a synthetic dataset designed specifically for training CNNs to estimate optical flow. It is created by applying affine transformations to real images and synthetically rendered chairs. The dataset contains 22,872 image pairs: 22,232 training and 640 test samples according to the standard evaluation split.
MPI Sintel  is also a synthetic dataset derived from a short open source animated 3D movie. There are 1,628 frames, 1,064 for training and 564 for testing. It is the most widely adopted benchmark to compare optical flow estimators. In this work, we only report performance on its final pass because it contains sufficiently realistic scenes including natural image degradations.
KITTI Optical Flow 2012  is a real world dataset collected from a driving platform. It consists of 194 training image pairs and 195 test pairs with sparse ground truth flow. We report the average EPE in total for the test set.
We consider guided learning with and without fine tuning. In the no fine tuning regime, the model is trained using the proxy ground truth produced using a classical estimator. In the fine tuning regime, the model is first trained using the proxy ground truth and then fine tuned using both the proxy ground truth and the reconstruction guide. The Sintel and KITTI datasets are too small to produce enough proxy ground truth to train our model from scratch so the models evaluated on these datasets are first pretrained on the Chairs dataset. These models are then either applied to the Sintel and KITTI datasets without fine tuning or are fine tuned using the target dataset (proxy ground truth).
As shown in Fig. 1, our architecture consists of contractive and expanding parts. In the no fine tuning regime, we calculate the per-pixel EPE loss at each expansion, giving one loss per expansion, and use the same loss weights as in . The models are trained using Adam optimization with its default parameter values (beta1 = 0.9 and beta2 = 0.999). The initial learning rate is divided in half at fixed intervals after an initial number of iterations, and we end training after a fixed iteration budget.
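The per-expansion losses combine as a weighted sum over scales; a sketch, where the per-scale weights follow a FlowNet-style schedule whose actual values are not reproduced here:

```python
import numpy as np

def multiscale_epe(estimates, proxies, weights):
    """Weighted sum of EPE losses computed at each expansion of the
    network, one (u, v) flow pair per resolution. `weights` holds the
    per-scale loss weights (illustrative, not the paper's values)."""
    total = 0.0
    for (u, v), (u_p, v_p), w in zip(estimates, proxies, weights):
        total += w * np.sqrt((u - u_p) ** 2 + (v - v_p) ** 2).mean()
    return total
```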
In the fine tuning regime, we calculate both the EPE and the reconstruction loss at each expansion, doubling the number of losses. The generalized Charbonnier parameter alpha and the weight lambda on the reconstruction loss are held fixed during fine tuning. We use the default Adam optimization with a fixed learning rate, and training is stopped after a fixed number of iterations.
We apply the same intensive data augmentation as in  to prevent over-fitting in both learning regimes. The proxy ground truth is computed using the FlowFields binary kindly provided by its authors.
[Table 1 reports results for FlowNetS trained with the true ground truth and with the FlowFields proxy ground truth, with and without unsupervised fine tuning (FlowNetS (FlowFields) + Unsup).]
We have three observations given the results in Table 1.
Observation 1: We can use proxy ground truth generated by state-of-the-art classical flow estimators to train CNNs for optical flow prediction. A model trained using the FlowFields proxy ground truth achieves an average EPE of 3.34 on Chairs, which is comparable to that achieved by the model trained using the true ground truth. Note that the proxy ground truth is itself quite noisy, deviating from the true ground truth by a nontrivial average EPE.
The model trained using the FlowFields proxy ground truth (EPE 3.34) performs worse than the FlowFields estimator itself (EPE 2.45), which is expected: FlowFields adopts a hierarchical approach that is non-local in the image space and uses dense correspondences to capture image details, so it can output crisp motion boundaries and accurate flow. However, unlike the CNN model, it cannot run in real time.
Observation 2: Sometimes, training using proxy ground truth can generalize better than training using the true ground truth. The model trained using the Chairs proxy ground truth (computed with FlowFields) performs better on Sintel (EPE 8.05) than the model trained using the Chairs true ground truth (EPE 8.43). We make similar observations for KITTI (note that FlowNetS's reported performance on KITTI, EPE 9.1, is after fine tuning). This improved generalization might result from over-fitting when training with the true ground truth, since the three datasets are quite different with respect to object and motion types. The noisier proxy could serve as a form of data augmentation for unseen motion types.
In addition, we experimented with directly training a Sintel model from scratch, without the pretrained Chairs model, using the same implementation details. The performance is about one and a half pixels worse in terms of EPE than using the pretrained model. Pretraining CNNs on a large dataset (with either true or proxy ground truth data) is therefore important for optical flow estimation.
Observation 3: Our proposed fine tuning regime improves performance on all three datasets, decreasing the average EPE on Chairs, Sintel, and KITTI alike. The fine-tuned result on Chairs is very close to the performance of the supervised FlowNetS model. This demonstrates that the image reconstruction loss is effective as an additional unsupervised guide for motion learning: it acts like fine tuning without requiring ground truth flow for the target dataset.
We also investigated training a network from scratch using a joint training regime, i.e., using both the EPE loss and the reconstruction loss from the beginning rather than introducing the reconstruction loss only in the fine tuning stage. The performance was worse on all three benchmarks. The reason might be that pretraining with just the proxy ground truth prevents the model from becoming trapped in poor local minima and thus provides a good initialization for further learning, whereas a joint regime using both losses may hurt the network's convergence early on.
However, we expect unsupervised objectives to offer more complementarity than we have exploited so far. The image reconstruction loss may not be the most appropriate guide for learning optical flow prediction; we will explore how to best incorporate additional unsupervised objectives in future work.
We compare our proposed method to recent state-of-the-art approaches. We only consider approaches that are fast, because optical flow is often used in time sensitive applications. We evaluated all CNN-based approaches on a workstation with an Intel Core i7 at 4.00 GHz and an NVIDIA Titan X GPU; for classical approaches, we use their reported runtimes. As shown in Table 2, our method performs the best on Sintel even though it does not require the true ground truth for training. On Chairs, we achieve performance on par with the best competing method, while on KITTI we perform worse. This is likely because the flow in KITTI is caused purely by the motion of the car, so methods that segment the scene into layers are better able to capture motion boundaries. Our approach outperforms the state-of-the-art unsupervised approaches of [1, 20] by a large margin, demonstrating the effectiveness of our proposed guided learning using proxy ground truth and image reconstruction. Visual comparisons for Sintel and KITTI are shown in Fig. 2. UnsupFlowNet  is able to produce reasonable but quite noisy flow field estimates, and it performs poorly in highly saturated and very dark regions. Our results are more detailed and smoother due to the proxy guidance and unsupervised fine tuning.
We propose a guided optical flow learning framework that is unsupervised and yields an estimator that runs in real time. We show that proxy ground truth data produced by state-of-the-art classical estimators can be used to train CNNs, which allows the training sets to scale, an important property for deep learning. We also show that training using proxy ground truth can generalize better than training using the true ground truth, and, finally, that an unsupervised image reconstruction loss provides further learning guidance.
More broadly, we introduce a paradigm that can be integrated into future state-of-the-art motion estimation networks to improve performance. In future work, we plan to experiment with large-scale video corpora to learn non-rigid real world motion patterns, rather than just the limited motions found in synthetic datasets.
Acknowledgements This work was funded in part by a National Science Foundation CAREER grant, IIS-1150115. We gratefully acknowledge NVIDIA Corporation through the donation of the Titan X GPU used in this work.