Supervised learning of optical flow estimation with a deep network yields a good trade-off between run time and accuracy of the estimated optical flow. However, such supervised learning requires a large number of training pairs, which so far have been provided via synthetic images. Such imagery lacks realism and diversity, and it keeps the network from exploiting the full potential of the learning concept. Particularly on real-world data, FlowNet does not reach the accuracy of state-of-the-art conventional optical flow estimation techniques.
In this paper, we approach this problem by providing real-world data to the network during training. Since there is only a very limited amount of real-world image pairs with ground truth optical flow, we use a semi-supervised hybrid multi-tasking scheme that exploits real-world videos without ground truth and synthetic imagery with ground truth. For the network to learn useful concepts from the unlabeled data, we build on the self-supervised task of next-frame prediction as an auxiliary task. The general concept of this hybrid learning task is illustrated in Figure 1.
The hybrid multi-tasking combines the best of supervised learning on synthetic data and self-supervised learning on real data. On the KITTI optical flow benchmark, we obtained a clear improvement over the FlowNet, which was trained without the self-supervised next frame prediction task. The improvement over the baseline is even larger when testing on an application task for optical flow, such as action recognition.
In addition to the hybrid multi-task learning of optical flow estimation and next frame prediction, we also propose multi-task learning on next frame prediction and next flow prediction. The latter two sub-tasks are more compatible and improve results when feeding the optical flow into an action recognition network.
While we mainly focus on improving optical flow with the auxiliary task of next frame prediction, we also show benefits on next frame prediction.
2 Related Work
The FlowNet by Dosovitskiy et al. was the first deep network trained end-to-end on optical flow estimation. It was followed by the approaches of Teney and Hebert and of Tran et al. These supervised learning methods require training data with optical flow annotations. Dosovitskiy et al. and Mayer et al. introduced synthetic datasets to provide such data. Tran et al. applied an existing variational method to create pseudo-ground-truth data.
Other works formulated the task as an unsupervised learning problem. To this end, they used a cost function based on the classical color constancy assumption, as it is used in variational techniques.
Video prediction has been very popular recently [16, 31, 7, 14, 10, 15, 23, 19]. Although some of these works focus on prediction as the main objective [16, 31], most of them use it as an auxiliary task. Finn et al. proposed an action-conditioned video prediction model to facilitate unsupervised learning for physical interaction. Patraucean et al. learn optical flow by warping the current frame to the next one. Lotter et al. use prediction to learn representations for object recognition.
The works by Pintea et al., Walker et al. [28, 29], Jayaraman and Grauman, and Vondrick et al. focus on motion prediction. Their predicted motion is conditioned on a single input frame. In contrast, we model future motion based on current motion and the scene content by making explicit use of two consecutive frames as input.
3 Hybrid Architecture and Training Schedule
3.1 Optical Flow Estimation
The flow estimation network in the proposed hybrid architecture largely builds on the FlowNet architecture introduced by Dosovitskiy et al. As illustrated in Figure 8, we add two more up-convolutional layers (Upconv1, Upconv0) to the decoder. This yields a flow field at the resolution of the input images, which is advantageous when combining the network with next-frame prediction. In contrast, the network of Dosovitskiy et al. yields a lower-resolution flow field, which is up-sampled with bilinear interpolation.
In Figure 8, the first and second rows form the encoder and decoder components of the flow estimator, respectively. While our network follows the same multi-resolution scheme as Dosovitskiy et al., Figure 8 omits the details of the so-called refinement steps (Figure 3 of the original paper), which produce the lower-resolution outputs. We use the endpoint error (EPE) loss for training this branch of the network. A more detailed illustration of the architecture is provided in the supplementary material.
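For reference, the endpoint error can be sketched in a few lines (a minimal NumPy sketch under assumed array shapes, not the actual training code):

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average endpoint error (EPE) between predicted and ground-truth flow.

    Both arrays are assumed to have shape (H, W, 2), holding the per-pixel
    (u, v) displacement. The EPE is the Euclidean distance between the two
    flow vectors at each pixel, averaged over all pixels.
    """
    diff = flow_pred - flow_gt
    per_pixel = np.sqrt((diff ** 2).sum(axis=-1))  # (H, W) distance map
    return float(per_pixel.mean())
```

During training, the same quantity is computed on the network output and back-propagated; here it only serves as a compact definition of the metric.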
3.2 Next-Frame Prediction
The network for the auxiliary task of next-frame prediction shares the encoder with the flow estimation network and adds a second decoder stream with independent weights but the same architecture. Rows 1 and 3 in Figure 8 form the next-frame prediction component of the network. As suggested in previous work, we use an L1 loss to avoid blur in the generated images.
As reported in Section 4, we experimented with different numbers of frames as input for next-frame prediction. However, in the multi-tasking scheme, we only use a two-frame set-up to be compatible with the paired task of flow estimation. The two three-channel RGB images are provided as a stacked six-channel input to the overall network.
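The input stacking is straightforward; as a sketch (NumPy, with assumed (H, W, 3) image arrays and a hypothetical helper name):

```python
import numpy as np

def stack_frame_pair(frame_t, frame_t1):
    """Concatenate two RGB frames (H, W, 3) along the channel axis,
    producing the six-channel input (H, W, 6) fed to the shared encoder."""
    assert frame_t.shape == frame_t1.shape and frame_t.shape[-1] == 3
    return np.concatenate([frame_t, frame_t1], axis=-1)
```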
3.3 Joint Training
For joint training of the hybrid network, there are two challenges. First, the data comes from two different sources, and there are multiple ways to mix them during training. Second, unlike the synthetic data, the real data does not come with optical flow ground truth, i.e., for real data as input, there is no loss for the flow-related stream of the network.
We mix the data at the minibatch level: the data in a single batch is taken entirely either from the synthetic dataset or from the real-world dataset. The minibatch $b_i$ at iteration $i$ alternates between minibatches $b_i^s$ and $b_i^r$ from the two data sources,

$$b_i = s(i)\,b_i^s + \big(1 - s(i)\big)\,b_i^r,$$

using the switch function

$$s(i) = \begin{cases} 1 & \text{if } \big(i \bmod (c_s + c_r)\big) < c_s \\ 0 & \text{otherwise,} \end{cases}$$

which always yields 0 or 1 and allows for different numbers of cycles $c_s$ and $c_r$ dedicated to each data source, respectively.

The total loss at iteration $i$ is computed according to

$$L_i = s(i)\,w_f\,L_{\text{flow}} + w_p\,L_{\text{pred}},$$

in which $L_{\text{flow}}$ and $L_{\text{pred}}$ are the flow and frame estimation losses, respectively, with their assigned weights $w_f$ and $w_p$.
In the case of real data without ground truth optical flow, we deactivate the flow-related loss and, thus, the flow-related decoder stream of the network: both the loss and the loss gradient are set to zero. Keeping the flow loss in sync with the switch function ensures the desired behavior: the network learns on both tasks when synthetic data is provided, but skips updating the optical flow decoder when there is no ground truth.
We set the loss weights such that the ratio $w_f / w_p$ is equal to $\sigma_p^2 / \sigma_f^2$, where the $\sigma^2$'s are estimates of the variances of the input data (frame vs. flow), computed over a subset of 500 random samples from the training sets. In the following experiments, we report results with this fixed weight ratio.
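The alternation and loss masking described above can be sketched as follows (function names and the concrete cycle counts are illustrative, not the paper's implementation):

```python
def switch(i, c_s, c_r):
    """Return 1 when iteration i falls in a synthetic-data cycle, else 0.

    The schedule repeats with period c_s + c_r: the first c_s iterations of
    each period draw the minibatch from the synthetic dataset, the remaining
    c_r iterations draw it from the real-world dataset.
    """
    return 1 if (i % (c_s + c_r)) < c_s else 0

def total_loss(i, loss_flow, loss_pred, w_f, w_p, c_s, c_r):
    """Combined loss: the flow term is masked out on real-data iterations,
    so no gradient reaches the flow decoder when no ground truth exists."""
    return switch(i, c_s, c_r) * w_f * loss_flow + w_p * loss_pred

# With a 1:5 real-to-synthetic ratio (c_s = 5, c_r = 1), each period of six
# iterations contains five synthetic minibatches followed by one real one:
schedule = [switch(i, c_s=5, c_r=1) for i in range(6)]  # [1, 1, 1, 1, 1, 0]
```

Masking the flow loss with the same switch that selects the data source is what keeps the two in sync: whenever a real-world minibatch is drawn, the flow term (and its gradient) is exactly zero.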
In our main experiments, we fix the ratio of cycles dedicated to the synthetic and real data sources. We also provide an analysis of the effect of different cycle ratios on the quality of the output flow field.
3.4 Next-Flow Prediction
In a multi-tasking scheme, the two tasks need to be "related" for the combination to yield significant improvements. In the context of the current work, we hypothesize that predicting the "next flow" (i.e., the future flow to come) has more in common with the task of next-frame prediction. Figure 3 compares the two combinations and gives an intuition of how the two "prediction" tasks match.
More formally, the two tasks share the encoder component of the network, which maps the input frames to an internal representation $h$ to be learned by the network. This mapping is affected by both tasks during the backward pass. For multi-tasking to make sense, we expect some features learned by the encoder to be beneficial for both target tasks. We hypothesize that the pair (next-frame prediction, next-flow prediction) has more to share than (next-frame prediction, current-flow estimation), not only because both are "prediction" tasks, but also because in obtaining the future flow, the network needs to learn, at least implicitly, about the future frame. This is not the case for the current flow (Figure 3). This hypothesis is supported by our experimental results.
3.5 Training Details
We train the network for 1 million iterations with a batch size of 8 for both data sources. The initial learning rate is 0.0001 and drops by a factor of 0.5 every 100K iterations, starting from iteration 300K. We use ADAM for optimization with its standard momentum parameters. On an NVIDIA Titan X, training takes roughly 10 days.
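The stepwise schedule can be written as a small function (a sketch; we assume the first halving happens exactly at iteration 300K):

```python
def learning_rate(i, base=1e-4, start=300_000, step=100_000, factor=0.5):
    """Stepwise decay: keep `base` until iteration `start`, then multiply
    by `factor` once at `start` and again every `step` iterations."""
    if i < start:
        return base
    n_drops = (i - start) // step + 1
    return base * factor ** n_drops
```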
Table 1 (excerpt). Average endpoint error (EPE); lower is better.

| Method | Real data | KITTI 2012 | KITTI 2012 (test) | KITTI 2015 | FlyingThings3D | Sintel |
|---|---|---|---|---|---|---|
| FlowNet Baseline + NextFrame | Sports + KITTI:frames | 8.39 | - | 15.08 | 4.29 | 1.86 |
| FlowNet (fine-tuned) | KITTI:flow | - | 7.52 | 9.1 | - | - |
4.1 Datasets

For hybrid training of the network, we need two datasets per experiment. We used the "FlyingThings3D" dataset of Mayer et al. as the data source with ground truth optical flow. It consists of more than 20000 training images and allows training a network from scratch. Moreover, it provides an independent test set that we used for testing. The much smaller Sintel dataset has 1064 samples and was used only for testing.
The only available real-world dataset with ground truth optical flow is the KITTI dataset. There are two independent datasets, KITTI 2012  and KITTI 2015 . We used both datasets for the quantitative evaluation of the optical flow. Since both datasets are independent, we always used one for training and the other for testing. Except for one experiment, we did not use the optical flow ground truth for training but only the images. We took the frames from the “multi-view” extension of the datasets, consisting of 4074 and 4200 images in the 2012 and 2015 versions, respectively.
There are many large real-world datasets without optical flow ground truth. We mainly used a subset of the Sports1M dataset for the self-supervised training task. The subset includes all videos with a file size of up to 5 MBytes, amounting to more than 220K videos and 220M frames. We will make the selection list available online. In another experiment, we simply used 50000 frames from videos of the Cityscapes dataset.
4.2 Direct Evaluation of the Optical Flow
In Figure 4 we visualize some of the flow fields estimated with our method, FlowNet, and an accurate but slow variational method (EpicFlow). On real-world scenes, our flow fields capture the shape of moving objects much better than the baseline FlowNet. We believe this sharpness results from demanding pixel-level accurate results in the auxiliary task of frame prediction, which counteracts the blurring that the flow branch tends to introduce.
We quantitatively evaluated the method on KITTI 2012 and 2015. Table 1 shows these results along with results on two synthetic datasets. All experiments used the same synthetic data source and differ only in the source of real data. 'FlowNet Baseline' is our full-resolution extension of the architecture of Dosovitskiy et al. trained on FlyingThings3D. 'FlowNet+NextFrame' indicates our hybrid multi-tasking scheme.
Results from various configurations are displayed in Table 1. Although the Sports dataset has little similarity with the scenes in KITTI, using this data for the auxiliary task yields significant improvements on KITTI. Using frames from the Cityscapes dataset improves the results even more, as its videos are recorded in a context similar to that of KITTI. There is no significant change on the synthetic datasets. This is no surprise, since the FlyingThings3D dataset covers other synthetic datasets, such as Sintel, well; there is no significant domain shift from the training set to the test set in this case.
| Method | UCF101 | HMDB51 |
|---|---|---|
| TV-L1 (as reported in prior work) | 87.2 | - |
| FlowNet pre-trained with NextFrame | 63.4 | 38.4 |
| FlowNet+NextFrame multi-tasking (1:5) | 74.1 | 48.4 |
| NextFlow+NextFrame multi-tasking (1:5) | 75.5 | 48.9 |

Table 2. Action classification accuracy (%). Each row contains the results of training and testing the action classifier on optical flow generated by a specific method. 1:5 indicates the real-to-synthetic iterations ratio.
Using video frames from the KITTI dataset (labeled 'KITTI:frames') rather than the Sports or Cityscapes datasets for the auxiliary task improves results on KITTI, as expected. We also experimented with combining the two real datasets, both in a parallel fashion and in a pre-training/fine-tuning scheme ('Sports + KITTI:frames'). The latter led to another large improvement. We submitted this version to the official KITTI evaluation site to obtain results on the KITTI test set. The result is essentially as good as the FlowNet fine-tuned on KITTI. Figure 5 depicts a qualitative comparison on this benchmark.
We also report results on the fine-tuned FlowNet combined with the hybrid learning on the auxiliary task at the very bottom of Table 1. This experiment shows that even when fine-tuning the FlowNet baseline on KITTI, hybrid training still yields significant improvements.
| Method | Whole Image (PSNR / Sharpness) | Moving Regions (PSNR / Sharpness) |
|---|---|---|
| GDL + L1 | 23.9 / 18.7 | 29.9 / 25.0 |
| Adv + GDL + L1 | 29.6 / 20.3 | 32.0 / 25.4 |
4.3 Indirect Evaluation: Action Classification
As real-world videos rarely come with optical flow ground truth (KITTI being an exception), the possibilities for a direct evaluation of the optical flow are limited. Thus, we use flow-based action classification as an indirect quantitative measure on two larger real-world datasets. We use the action classifier network of Wang et al. and train/test it with the optical flow from different methods as input.
Table 2 shows the results of this evaluation. We used the Sports dataset to provide the unsupervised data. The hybrid learning was done with a ratio of 1:5 for real-to-synthetic cycles. The optical flow from our hybrid learning scheme improved action recognition by a large margin (12.1% on UCF and 9.8% on HMDB) compared to the baseline FlowNet. We achieved even larger improvements by replacing the current flow with 'NextFlow'.
We also tried a pre-training/fine-tuning scenario in which the network is initially trained for the frame prediction task (on real data), and then fine-tuned with the main task (“FlowNet pre-trained with NextFrame”). Results confirm that this sequential learning is not sufficient. The multi-tasking scheme is necessary to make good use of the auxiliary task on the real data.
4.4 Impact of Task Combination Cycles
We evaluated which ratio of training cycles on synthetic and real data yields the best performance, and how robust the method is to deviations from the optimal ratio. We used the Sports dataset as the real data source in this experiment. Figure 6 shows the results for various cycle ratios. Results are robust over a large range of ratios. Lower ratios approach the results of FlowNet, as the effect of the auxiliary task starts to vanish. Putting too much emphasis on the auxiliary task introduces artifacts in the optical flow field, since the network starts to care mostly about next-frame prediction. In general, the ratio should be biased towards the supervised optical flow task; a ratio of 1:5 seems a good general choice.
Figure 7. Qualitative next-frame prediction results: ours vs. Mathieu et al. vs. ground truth.
4.5 Next-Frame Prediction as a Single Task
We also evaluated the output of our next-frame prediction network on UCF. To this end, we trained it as an independent single-task network. Table 4 shows a comparison with Mathieu et al., which to the best of our knowledge is the current state of the art in next-frame prediction on UCF. Without any auxiliary cost functions, as introduced by Mathieu et al. for the sake of sharp results, and with just a single L1 loss, we obtain results on par with Mathieu et al. on the moving regions of the image and significantly better results on the whole image. This means that the network is more successful at applying the motion only to the dynamic areas while keeping the static areas intact. We show qualitative examples of the predicted frames in Figure 7 and in a video in the supplementary material.
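The two evaluation regimes (whole image vs. moving regions only) can be sketched as follows; the simple difference-based motion mask is our illustrative assumption, not necessarily the exact protocol of Mathieu et al.:

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=255.0):
    """Peak signal-to-noise ratio between two images, optionally restricted
    to a boolean mask (e.g. the moving regions of the frame)."""
    err = (pred.astype(np.float64) - target.astype(np.float64)) ** 2
    if mask is not None:
        err = err[mask]
    mse = err.mean()
    if mse == 0.0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def motion_mask(frame_t, frame_t1, thresh=10.0):
    """Mark pixels whose intensity changed noticeably between the two input
    frames (an illustrative notion of 'moving regions')."""
    diff = np.abs(frame_t.astype(np.float64) - frame_t1.astype(np.float64))
    return diff > thresh
```

Evaluating both on the full frame and only inside the mask separates a method that merely copies static background from one that actually predicts motion correctly.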
Since frame prediction is only an auxiliary task in the network, the input settings (particularly the cycle ratio) have been chosen to focus on improving the optical flow output. Consequently, increasing the number of flow cycles degrades next-frame prediction accuracy.
5 Conclusions

We have presented a way to improve a deep network for optical flow estimation on real data by training it with an additional self-supervised auxiliary task. Our experiments showed a consistent improvement of the optical flow quality on real-world data. Thus, we believe this approach largely improves the transfer of deep networks trained on synthetic datasets to real-world domains. While we focused on optical flow, the concept may also transfer to similar problems, such as disparity estimation, and to alternative self-supervised auxiliary tasks.
We acknowledge funding by the ERC Starting Grant VideoLearn.
-  A. Ahmadi and I. Patras. Unsupervised convolutional neural networks for motion estimation. arXiv:1601.06087 [cs], Jan. 2016.
-  T. Brox, A. Bruhn, N. Papenberg, and J. Weickert. High Accuracy Optical Flow Estimation Based on a Theory for Warping. In Computer Vision - ECCV 2004, pages 25–36. Springer, Berlin, Heidelberg, 2004.
-  D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al. (Eds.), editor, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611–625. Springer-Verlag, Oct. 2012.
-  R. Caruana. Multitask Learning. In S. Thrun and L. Pratt, editors, Learning to Learn, pages 95–133. Springer US, 1998.
-  M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes Dataset for Semantic Urban Scene Understanding. In CVPR, pages 3213–3223, 2016.
-  A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers, and T. Brox. FlowNet: Learning Optical Flow With Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
-  C. Finn, I. Goodfellow, and S. Levine. Unsupervised learning for physical interaction through video prediction. In Advances In Neural Information Processing Systems, pages 64–72, 2016.
-  A. Geiger, P. Lenz, and R. Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  B. K. P. Horn and B. G. Schunck. Determining optical flow. Artificial Intelligence, 17(1):185–203, Aug. 1981.
-  D. Jayaraman and K. Grauman. Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion. In Computer Vision – ECCV 2016, pages 489–505. Springer, Cham, Oct. 2016.
-  H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In Proc. of IEEE International Conference on Computer Vision, 2011.
-  A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale Video Classification with Convolutional Neural Networks. In CVPR, 2014.
-  D. P. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], Dec. 2014.
-  W. Lotter, G. Kreiman, and D. Cox. Deep Predictive Coding Networks for Video Prediction and Unsupervised Learning. arXiv:1605.08104 [cs, q-bio], May 2016.
-  R. Mahjourian, M. Wicke, and A. Angelova. Geometry-Based Next Frame Prediction from Monocular Video. arXiv:1609.06377 [cs], Sept. 2016.
-  M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440 [cs, stat], Nov. 2015.
-  N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4040–4048, 2016.
-  M. Menze and A. Geiger. Object Scene Flow for Autonomous Vehicles. In Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
-  V. Patraucean, A. Handa, and R. Cipolla. Spatio-temporal video autoencoder with differentiable memory. arXiv:1511.06309 [cs], Nov. 2015.
-  J. S. Pérez, E. Meinhardt-Llopis, and G. Facciolo. TV-L1 optical flow estimation. Image Processing On Line, 2013:137–150, 2013.
-  S. L. Pintea, J. C. van Gemert, and A. W. M. Smeulders. Déjà Vu: Motion Prediction in Static Images. In D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, Computer Vision – ECCV 2014, number 8691 in Lecture Notes in Computer Science, pages 172–187. Springer International Publishing, Sept. 2014.
-  J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid. EpicFlow: Edge-Preserving Interpolation of Correspondences for Optical Flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1164–1172, 2015.
-  M. Saito and E. Matsumoto. Temporal Generative Adversarial Nets. arXiv:1611.06624 [cs], Nov. 2016.
-  K. Soomro, A. R. Zamir, and M. Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. arXiv:1212.0402 [cs], Dec. 2012.
-  D. Teney and M. Hebert. Learning to Extract Motion from Videos in Convolutional Neural Networks. arXiv:1601.07532 [cs], Jan. 2016.
-  D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Deep End2End Voxel2Voxel Prediction. pages 17–24, 2016.
-  C. Vondrick, H. Pirsiavash, and A. Torralba. Anticipating Visual Representations From Unlabeled Video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 98–106, 2016.
-  J. Walker, C. Doersch, A. Gupta, and M. Hebert. An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders. In Computer Vision – ECCV 2016, pages 835–851. Springer, Cham, Oct. 2016.
-  J. Walker, A. Gupta, and M. Hebert. Dense Optical Flow Prediction from a Static Image. arXiv:1505.00295 [cs], May 2015.
-  L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
-  T. Xue, J. Wu, K. Bouman, and B. Freeman. Visual Dynamics: Probabilistic Future Frame Synthesis via Cross Convolutional Networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 91–99. Curran Associates, Inc., 2016.
-  J. J. Yu, A. W. Harley, and K. G. Derpanis. Back to Basics: Unsupervised Learning of Optical Flow via Brightness Constancy and Motion Smoothness. arXiv:1608.05842 [cs], Aug. 2016.
-  C. Zach, T. Pock, and H. Bischof. A Duality Based Approach for Realtime TV-L1 Optical Flow. In Pattern Recognition, pages 214–223. Springer, Berlin, Heidelberg, 2007.