Hybrid Learning of Optical Flow and Next Frame Prediction to Boost Optical Flow in the Wild

12/12/2016 ∙ by Nima Sedaghat, et al. ∙ University of Freiburg

CNN-based optical flow estimation has attracted attention recently, mainly due to its impressively high frame rates. These networks perform well on synthetic datasets, but they are still far behind the classical methods in real-world videos. This is because there is no ground truth optical flow for training these networks on real data. In this paper, we boost CNN-based optical flow estimation in real scenes with the help of the freely available self-supervised task of next-frame prediction. To this end, we train the network in a hybrid way, providing it with a mixture of synthetic and real videos. With the help of a sample-variant multi-tasking architecture, the network is trained on different tasks depending on the availability of ground-truth. We also experiment with the prediction of "next-flow" instead of estimation of the current flow, which is intuitively closer to the task of next-frame prediction and yields favorable results. We demonstrate the improvement in optical flow estimation on the real-world KITTI benchmark. Additionally, we test the optical flow indirectly in an action classification scenario. As a side product of this work, we report significant improvements over state-of-the-art in the task of next-frame prediction.







1 Introduction

Supervised learning of optical flow estimation with a deep network yields a good trade-off between run time and accuracy of the estimated optical flow [6]. However, such supervised learning requires a large number of training pairs, which have been provided via synthetic images. Such imagery lacks realism and diversity, and it keeps the network from using the full potential of the learning concept. Particularly on real-world data, FlowNet [6] does not yield the same accuracy as state-of-the-art conventional optical flow estimation techniques.

In this paper, we approach this problem by providing real-world data to the network during training. Since there is only a very limited amount of real-world image pairs with ground truth optical flow, we use a semi-supervised hybrid multi-tasking scheme that exploits real-world videos without ground truth and synthetic imagery with ground truth. For the network to learn useful concepts from the unlabeled data, we build on the self-supervised task of next-frame prediction as an auxiliary task. The general concept of this hybrid learning task is illustrated in Figure 1.

The hybrid multi-tasking combines the best of supervised learning on synthetic data and self-supervised learning on real data. On the KITTI optical flow benchmark, we obtained a clear improvement over the FlowNet, which was trained without the self-supervised next frame prediction task. The improvement over the baseline is even larger when testing on an application task for optical flow, such as action recognition.

In addition to the hybrid multi-task learning of optical flow estimation and next frame prediction, we also propose multi-task learning on next frame prediction and next flow prediction. The latter two sub-tasks are more compatible and improve results when feeding the optical flow into an action recognition network.

While we mainly focus on improving optical flow with the auxiliary task of next frame prediction, we also show benefits on next frame prediction.

2 Related Work

Since the work by Horn & Schunck [9], optical flow estimation has been dominated by variational methods [2, 20, 22].

The FlowNet by Dosovitskiy et al. [6] was the first deep network trained end-to-end on optical flow estimation. It was followed by Teney et al. [25] and Tran et al. [26]. These supervised learning methods require training data with optical flow annotations. In Dosovitskiy et al. [6] and Mayer et al. [17], synthetic datasets were introduced to provide such data. Tran et al. [26] applied an existing variational method to create pseudo-ground-truth data.

Instead, Ahmadi and Patras [1] and Yu et al. [32] formulated the task as an unsupervised learning problem. To this end, they used a cost function based on the classical color constancy assumption, as it is used in variational techniques.

Video prediction has been very popular recently [16, 31, 7, 14, 10, 15, 23, 19]. Although some of these works focus on prediction as the main objective [16, 31], most of them use it as an auxiliary task. Finn et al. [7] proposed an action-conditioned video prediction model to facilitate unsupervised learning for physical interaction. Patraucean et al. [19] learn optical flow by warping the current frame to the next one. Lotter et al. [14] use prediction to learn representations for object recognition.

The works by Pintea et al. [21], Walker et al. [28, 29], Jayaraman et al. [10], and Vondrick et al. [27] focus on motion prediction. Their predicted motion is conditioned on a single input frame. In contrast, we model future motion based on current motion and the scene content by making explicit use of two consecutive frames as input.

3 Hybrid Architecture and Training Schedule

3.1 Optical Flow Estimation

The flow estimation network in the proposed hybrid architecture largely builds on the FlowNet architecture introduced by Dosovitskiy et al. [6]. As illustrated in Figure 8, we add two more up-convolutional layers (Upconv1, Upconv0) to the decoder. This yields a flow field at the resolution of the input images, which is advantageous when combining the network with next-frame prediction. In contrast, the network in Dosovitskiy et al. [6] yields a lower-resolution flow field, which is up-sampled with bilinear interpolation.

In Figure 8, the first and second rows compose the encoder and decoder components of the flow estimator, respectively. While our network follows the same multi-resolution scheme as Dosovitskiy et al. [6], Figure 8 omits the details of the so-called refinement steps (Figure 3 of [6]), which produce the lower-resolution outputs. We use the endpoint error (EPE) loss for training this branch of the network. A more detailed illustration of the architecture is provided in the supplementary material.
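For reference, the endpoint error simply averages the Euclidean distance between predicted and ground-truth flow vectors over all pixels; a minimal NumPy sketch (an illustration, not the authors' code):

```python
import numpy as np

def endpoint_error(flow_pred, flow_gt):
    """Average endpoint error (EPE) between two H x W x 2 flow
    fields: the per-pixel Euclidean norm of the flow difference,
    averaged over all pixels."""
    diff = flow_pred - flow_gt
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())
```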

3.2 Next-Frame Prediction

The network for the auxiliary task of next-frame prediction shares the encoder with the flow estimation network, and adds a second decoder stream with independent weights but using the same architecture. Rows 1 & 3 in Figure 8 form the next-frame prediction component of the network. As suggested in previous work [16], we use an L1 loss to avoid blur in the generated images.

As reported in Section 4, we experimented with different numbers of frames as input for next-frame prediction. However, in the multi-tasking scheme, we only use a 2-frame set-up to be compatible with the paired task of flow estimation. The two three-channel RGB images are provided as a stacked six-channel input to the overall network.
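The stacking of the two input frames is straightforward; a small NumPy sketch of the six-channel input construction (an illustration, not the authors' code):

```python
import numpy as np

def stack_frame_pair(frame1, frame2):
    """Concatenate two H x W x 3 RGB frames along the channel
    axis, yielding the H x W x 6 network input described above."""
    assert frame1.shape == frame2.shape and frame1.shape[-1] == 3
    return np.concatenate([frame1, frame2], axis=-1)
```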

3.3 Joint training

For joint training of the hybrid network, there are two challenges. First, the data comes from two different sources, and there are multiple ways to mix them during training. Second, unlike synthetic data, the real data does not come with optical flow ground truth; i.e., for real data as input, there is no loss for the flow-related stream of the network.

Hybrid data

We mix the data at the minibatch level: each batch is taken entirely from either the synthetic dataset or the real-world dataset. The minibatch B_t at iteration t alternates between minibatches B_t^s and B_t^r from the two data sources,

B_t = σ(t) · B_t^s + (1 − σ(t)) · B_t^r,

using a switch function σ(t), which always yields 0 or 1 and allows for different numbers of cycles, c_s and c_r, dedicated to each data source, respectively (e.g., σ(t) = 1 if t mod (c_s + c_r) < c_s, and 0 otherwise).
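A switch of this kind can be implemented in a few lines. The cycle counts c_s (synthetic) and c_r (real) and the exact formula below are our reconstruction for illustration, not necessarily the authors' implementation:

```python
def switch(t, c_s, c_r):
    """Binary switch for iteration t: yields 1 for c_s consecutive
    iterations (synthetic minibatch), then 0 for c_r iterations
    (real minibatch), repeating with period c_s + c_r."""
    return 1 if t % (c_s + c_r) < c_s else 0
```

For a 5:1 synthetic-to-real schedule (c_s = 5, c_r = 1), iterations 0-4 draw synthetic batches and iteration 5 draws a real batch.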

Batch-variant loss

The total loss L_t at iteration t is computed according to

L_t = σ(t) · w_f · L_f + w_p · L_p,

in which L_f and L_p are the flow and frame estimation losses, respectively, with their assigned weights w_f and w_p.

In the case of real data without ground-truth optical flow, we deactivate the flow-related loss and, thus, the flow-related decoder stream of the network. Both the loss and the loss gradient are set to zero. We keep the loss in sync with the switch function to ensure the desired functionality: the network learns on both tasks when synthetic data is provided, but skips updating the optical flow decoder when there is no ground truth.

We set the loss weights such that w_f · v_f is equal to w_p · v_p, where v_f and v_p are estimates of the variances of the input data (frame vs. flow), computed over a subset of 500 random samples from the training sets. In the following experiments, we report results with the fixed weight ratio w_f/w_p obtained this way.
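The batch-variant gating can be sketched as follows; this is a minimal illustration of the scheme, not the authors' code, and the variable names (sigma, w_f, w_p) are ours:

```python
def total_loss(sigma, flow_loss, frame_loss, w_f, w_p):
    """Batch-variant loss: the flow term is gated by the binary
    switch sigma (1 for synthetic batches with ground-truth flow,
    0 for real batches), so real batches train only the
    frame-prediction decoder.  Zeroing the term also zeroes its
    gradient, matching the deactivation described in the text."""
    return sigma * w_f * flow_loss + w_p * frame_loss
```

With sigma = 0, the flow decoder receives no gradient at all, which realizes the deactivation of the flow stream on real data.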

In our main experiments we fix the ratio of cycles dedicated to synthetic and real data sources. But we also provide an analysis on the effect of different cycle ratios on the quality of the output flow field.

Figure 3: Illustration of the two multi-tasking schemes introduced in this paper. (a) Combination of flow estimation with next-frame prediction; (b) next-flow prediction replaces the task of current-flow estimation. Each I_t denotes a single frame in a video sequence of length 3, and each F denotes a flow field. In both scenarios only I_1 and I_2 are the inputs to the network. Therefore, the terms "current flow" and "next-flow" refer to the flow fields F_1→2 and F_2→3, respectively.
Figure 4: Sample estimated optical flow fields on real scenes from HMDB51 [11] (columns: overlayed inputs, FlowNet, FlowNet+NextFrame, NextFlow+NextFrame, EpicFlow). Both of our suggested methods, shown in the middle columns, yield clearly improved flow fields that better preserve the object shapes.
(a) First input frame
(b) FlowNetS+ft [6]
(c) B2B Unsupervised FlowNet [32]
(d) Ours
Figure 5: Sample result on the KITTI benchmark. The unsupervised method of Yu et al. [32] has problems near the image boundaries and produces blurred motion boundaries. Our method shows a similar quality as FlowNetS+ft, although it has been fine-tuned only on unlabeled data.

3.4 Next-Flow Prediction

In a multi-tasking scheme, for the combination to yield significant improvements in the results, the two tasks need to be "related" [4]. In the context of the current work, we hypothesize that prediction of the "next-flow" (i.e., the future flow to come) has more in common with the task of next-frame prediction. Figure 3 compares the two combinations and gives an intuition on how the two "prediction" tasks match.

More formally: the two tasks share the encoder component of the network, which maps the input pair (I_1, I_2) to an internal representation z to be learned by the network. This mapping is affected by both tasks during the backward pass. For multi-tasking to make sense, we expect that some features learned by the encoder are beneficial for both target tasks. We hypothesize that the pair (next-frame, next-flow) has more to share than the pair (next-frame, current-flow), not only because both are "prediction" tasks, but also because in obtaining the future flow, the network needs to learn, at least implicitly, about the future frame. This is not the case for the current flow (Figure 3). This hypothesis is supported by our experimental results.

3.5 Training Details

We train the network for 1 million iterations, with a batch size of 8 for both of the data sources. The initial learning rate is 0.0001, and it drops by a factor of 0.5 every 100K iterations starting from 300K. We use ADAM [13] for optimization. On an NVIDIA Titan X, training takes roughly 10 days.
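The learning-rate schedule can be read as a simple step schedule; the sketch below reflects our reading of the description (halving every 100K iterations, with the first drop at 300K), and the parameter names are ours:

```python
def learning_rate(t, base=1e-4, start=300_000, step=100_000, gamma=0.5):
    """Step learning-rate schedule: constant at `base` until
    iteration `start`, then multiplied by `gamma` once at `start`
    and again every `step` iterations thereafter."""
    if t < start:
        return base
    n_drops = (t - start) // step + 1
    return base * gamma ** n_drops
```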

Method | Dataset for frame prediction | KITTI'12 train | KITTI'12 test | KITTI'15 train | Sintel train | FlyingThings3D test
B2B Unsupervised FlowNet [32] | - | - | 11.3 | - | - | -
FlowNet [6] | - | 8.26 | - | - | 4.50 | -
FlowNet Baseline | - | 8.79 | - | 15.59 | 4.33 | 1.84
FlowNet Baseline+NextFrame | Sports | 8.55 | - | 15.16 | 4.38 | 1.86
FlowNet Baseline+NextFrame | Cityscapes | 8.49 | - | 14.68 | 4.24 | 1.80
FlowNet Baseline+NextFrame | KITTI:frames | 8.37 | - | 14.15 | 4.30 | 1.84
FlowNet Baseline+NextFrame | Sports + KITTI:frames | 8.39 | - | 15.08 | 4.29 | 1.86
FlowNet Baseline+NextFrame | Sports → KITTI:frames | 7.78 | 9.2 | 13.95 | 4.36 | 1.85
FlowNet [6] → KITTI:flow | - | 7.52 | 9.1 | - | - | -
FlowNet+NextFrame → KITTI:flow | KITTI:frames | 5.31 | - | 10.19 | 5.35 | 2.82

Table 1: Quantitative evaluation of optical flow estimation performance based on End Point Error (EPE). "KITTI:frames" indicates video frames (without flow annotations) from the KITTI dataset. Moreover, wherever the evaluation is performed on a KITTI (2012/2015) training subset, the data used for training the network is taken from its counterpart (2015/2012). The → sign indicates a pre-training/fine-tuning process.

4 Experiments

4.1 Datasets

For hybrid training of the network we need two datasets per experiment. We used the "FlyingThings3D" dataset of Mayer et al. [17] as the data source with ground-truth optical flow. It consists of more than 20,000 training images and allows training a network from scratch. Moreover, it provides an independent test set that we used for testing. The much smaller Sintel dataset [3] has 1,064 samples and was used only for testing.

The only available real-world dataset with ground truth optical flow is the KITTI dataset. There are two independent datasets, KITTI 2012 [8] and KITTI 2015 [18]. We used both datasets for the quantitative evaluation of the optical flow. Since both datasets are independent, we always used one for training and the other for testing. Except for one experiment, we did not use the optical flow ground truth for training but only the images. We took the frames from the “multi-view” extension of the datasets, consisting of 4074 and 4200 images in the 2012 and 2015 versions, respectively.

There are many large real-world datasets without optical flow ground truth. We used mainly a subset of the Sports1M dataset [12] for the self-supervised training task. The subset includes all videos with a file size up to 5 MBytes, amounting to more than 220K videos and 220M frames. We will make the selection list available online. Also in another experiment, we simply used 50000 frames from videos of the Cityscapes dataset [5].

Moreover, we used the UCF101 [24] and HMDB51 [11] datasets for testing the optical flow indirectly in an action recognition scenario. The datasets contain more than 2M & 600K frames, respectively.

To compare the performance of our next-frame predictor to published work, we used the same subset of UCF101 [24] as Mathieu et al. [16]. It consists of 387 videos.

4.2 Direct Evaluation of the Optical Flow

In Figure 4 we visualize some of the flow fields estimated with our method, FlowNet [6], and an accurate but slow variational method (EpicFlow [22]). On real-world scenes, our flow fields capture the shape of moving objects much better than the baseline FlowNet. We believe this sharpness results from demanding pixel-level accuracy in the auxiliary task of frame prediction, which counteracts the blurring the flow branch tends to introduce.

We quantitatively evaluated the method on KITTI 2012 & 2015. Table 1 shows these results along with results on two synthetic datasets. All the experiments used the same synthetic data source and differ only in the source of real data. 'FlowNet Baseline' is our full-resolution extension of the architecture of [6] trained on FlyingThings3D. 'FlowNet+NextFrame' indicates our hybrid multi-tasking scheme.

Results from various configurations are displayed in Table 1. Although the Sports dataset has little similarity with the scenes in KITTI, using this data for the auxiliary task yields significant improvements on KITTI. Using frames from the Cityscapes dataset improves the results even more, as those videos are recorded in a context similar to KITTI's. There is no significant change on the synthetic datasets. This does not come as a surprise, since the FlyingThings3D dataset can cover other synthetic datasets like Sintel well. There is no significant domain shift from the training set to the test set in this case.

Method | UCF101 [24] | HMDB51 [11]
EpicFlow [22] | 82.8 | 56.1
TV-L1 [20] (as reported in [30]) | 87.2 | -
CNN-based: FlowNet | 62.0 | 38.6
CNN-based: FlowNet pre-trained with NextFrame | 63.4 | 38.4
CNN-based: FlowNet+NextFrame multi-tasking (1:5) | 74.1 | 48.4
CNN-based: NextFlow+NextFrame multi-tasking (1:5) | 75.5 | 48.9

Table 2: Action classification accuracy (%). Each row contains results of training and testing the action classifier on optical flow generated by a specific method. 1:5 indicates the real-to-synthetic iterations ratio.

Ratio (real:synthetic) | EPE | HMDB51 accuracy (%) | UCF101 accuracy (%)
FlowNet (no real data) | 8.88 | 38.6 | 62.0
1:9 | 8.76 | 48.0 | 75.3
1:5 | 8.55 | 48.4 | 74.1
1:3 | 8.78 | 47.3 | 74.7
1:1 | 8.94 | 48.3 | 76.6
4:1 | 10.35 | 48.0 | -

Table 3: Analysis of the effect of different cycle ratios on optical flow quality. For EPE, lower values are better; for action classification accuracy, higher values are better.

Using video frames from the KITTI dataset (labeled as 'KITTI:frames') rather than the Sports or Cityscapes datasets for the auxiliary task improves results on KITTI, as expected. We also experimented with combining the two real datasets, both in a parallel fashion and in a pre-training/fine-tuning scheme ('Sports → KITTI:frames'). The latter led to another large improvement. We submitted this version to the official KITTI evaluation site to obtain results on the KITTI test set. The result is essentially as good as that of the FlowNet fine-tuned on KITTI. Figure 5 depicts a qualitative comparison on this benchmark.

We also report results on the fine-tuned FlowNet combined with the hybrid learning on the auxiliary task at the very bottom of Table 1. This experiment shows that even when fine-tuning the FlowNet baseline on KITTI, hybrid training still yields significant improvements.

Figure 6: Qualitative comparison of different data-source combination cycles (columns: FlowNet, 1:9, 1:5, 1:3, 1:1, 4:1, EpicFlow; the ratio is displayed on top of each column). When no real data is involved (FlowNet), the network fails to estimate an acceptable flow field in real scenes. On the other hand, if training spends too many cycles only on frame prediction, as in the 4:1 column, the network no longer focuses enough on the optical flow task. The best results are obtained with a ratio of 1:5 or 1:3.

Method | Whole image PSNR (dB) | Whole image sharpness (dB) | Moving regions PSNR (dB) | Moving regions sharpness (dB)
Mathieu et al. [16], L1 | 22.3 | 18.5 | 28.7 | 24.8
Mathieu et al. [16], GDL + L1 | 23.9 | 18.7 | 29.9 | 25.0
Mathieu et al. [16], Adv + GDL + L1 | 29.6 | 20.3 | 32.0 | 25.4
Ours (2-frame), L1 | 29.9 | 20.6 | 31.9 | 25.4
Ours (4-frame), L1 | 30.8 | 20.8 | 31.9 | 25.4

Table 4: Next-frame prediction on UCF101 [24]. With just a simple L1 loss we already obtain clear improvements over the state of the art.
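The similarity numbers in Table 4 are PSNR values; for reference, PSNR between two 8-bit images can be computed as follows (the standard definition, not tied to the authors' evaluation code):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images.
    Higher values mean more similar; undefined (division by
    zero) for identical images."""
    mse = float(np.mean((np.asarray(pred, dtype=np.float64)
                         - np.asarray(target, dtype=np.float64)) ** 2))
    return 10.0 * float(np.log10(max_val ** 2 / mse))
```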

4.3 Indirect Evaluation: Action Classification

As real-world videos rarely come with optical flow ground truth (KITTI being an exception), the possibilities for a direct evaluation of the optical flow are limited. Thus, we use the evaluation on flow-based action classification as an indirect quantitative measure on two larger real-world datasets. We use the action classifier network of Wang et al. [30] and train/test it with optical flow from different optical flow methods as input.

Table 2 shows the results of this evaluation. We used the Sports dataset to provide unsupervised data. The hybrid learning was done with a ratio of 1:5 for real to synthetic cycles. The optical flow with our hybrid learning scheme improved results on action recognition by a large margin (12.1% on UCF and 9.8% on HMDB) when compared to the baseline FlowNet. We achieved even larger improvements by replacing current flow with ‘NextFlow’.

We also tried a pre-training/fine-tuning scenario in which the network is initially trained for the frame prediction task (on real data), and then fine-tuned with the main task (“FlowNet pre-trained with NextFrame”). Results confirm that this sequential learning is not sufficient. The multi-tasking scheme is necessary to make good use of the auxiliary task on the real data.

We also report numbers for two variational methods, TV-L1 [33] and EpicFlow [22]. They provide a higher accuracy, but are also much slower than the network-based approaches.

4.4 Impact of Task Combination Cycles

We evaluated which ratio of training cycles on synthetic and real data yields the best performance, and how robust the method is to deviations from the optimal ratio. We used the Sports dataset as the real data source in this experiment. Figure 6 shows the results for various cycle ratios. Results are robust for a large range of ratios. Lower ratios approach the results of FlowNet, as the effect of the auxiliary task starts to vanish. Putting too much emphasis on the auxiliary task introduces artifacts in the optical flow field, since the network starts to care mostly about next-frame prediction. In general, the ratio should be biased towards the supervised optical flow task. A ratio of 1:5 seems a good choice in general.

Figure 7: Next-frame prediction samples (columns: ours, Mathieu et al. [16], ground truth). Results of Mathieu et al. [16] are often a bit sharper due to the adversarial loss, yet the method also introduces distortions and artifacts; see the last two samples. Our next-frame predictions are blurrier due to relying only on the L1 loss, but yield robust predictions without distortion. This explains the on-par quantitative results in Table 4.

4.5 Next-Frame Prediction as a Single Task

We also evaluated the output of our next-frame prediction network on UCF101. To this end, we trained it as an independent single-task network. Table 4 shows a comparison with Mathieu et al. [16], which to the best of our knowledge is the current state of the art in next-frame prediction on UCF101. Without the use of any auxiliary cost functions, as introduced in Mathieu et al. [16] for the sake of sharp results, and just with a single L1 loss, we obtain results on par with Mathieu et al. on the moving regions of the image, and significantly better results on the whole image. This means that the network is more successful at applying the motion only to the dynamic areas while keeping the static areas intact. We show qualitative examples of the predicted frames in Figure 7 and in a video in the supplementary material.

Since frame prediction has only been an auxiliary task in the network, the input settings (particularly the cycle ratio) were set to focus on improving the optical flow output. Increasing the number of flow cycles therefore degrades the next-frame prediction accuracy.

5 Conclusions

We have presented a way to improve a deep network for optical flow estimation on real data by training it with an additional self-supervised auxiliary task. Our experiments showed a consistent improvement of the optical flow quality on real-world data. Thus, we believe that this approach largely improves the transfer of deep networks trained on synthetic datasets to real-world domains. While we focused here on optical flow, the concept may transfer also to similar problems, such as disparity estimation, and to alternative self-supervised auxiliary tasks.


We acknowledge funding by the ERC Starting Grant VideoLearn.