CoT-AMFlow: Adaptive Modulation Network with Co-Teaching Strategy for Unsupervised Optical Flow Estimation

11/04/2020 · Hengli Wang, et al. · The Hong Kong University of Science and Technology

The interpretation of ego motion and scene change is a fundamental task for mobile robots. Optical flow information can be employed to estimate motion in the surroundings. Recently, unsupervised optical flow estimation has become a research hotspot. However, unsupervised approaches are often unreliable in partially occluded or texture-less regions. To address this problem, we propose CoT-AMFlow, an unsupervised optical flow estimation approach. In terms of the network architecture, we develop an adaptive modulation network that employs two novel module types, flow modulation modules (FMMs) and cost volume modulation modules (CMMs), to remove outliers in challenging regions. As for the training paradigm, we adopt a co-teaching strategy, where two networks simultaneously teach each other about challenging regions to further improve accuracy. Experimental results on the MPI Sintel, KITTI Flow and Middlebury Flow benchmarks demonstrate that our CoT-AMFlow outperforms all other state-of-the-art unsupervised approaches, while still running in real time. Our project page is available at https://sites.google.com/view/cot-amflow.


1 Introduction

Mobile robots typically operate in complex environments that are inherently dynamic [27]. Therefore, it is important for such autonomous systems to be conscious of dynamic objects in their surroundings. Optical flow describes the pixel-level correspondence between two ordered images and can be regarded as a useful representation for dynamic object detection. Consequently, many approaches for mobile robot tasks, such as SLAM [32], dynamic object detection [28] and robot navigation [14], incorporate optical flow information to improve their performance.

With the development of deep learning technology, deep neural networks have presented highly compelling results for optical flow estimation [6, 24, 12]. These networks typically excel at learning optical flow estimation from large amounts of data along with hand-labeled ground truth. However, this data labeling process can be extremely time-consuming and labor-intensive. Recent unsupervised optical flow estimation approaches have attracted much attention, because their advantage in not requiring ground truth enables them to be easily deployed in real-world applications [21, 18, 29, 16, 17]. However, their performance in challenging regions, such as partially occluded or texture-less regions, is often unsatisfactory [29, 15]. The underlying cause of this performance degradation is threefold: 1) The popular coarse-to-fine framework [17, 15] is often sensitive to noise in the flow initialization from the preceding pyramid level, and challenging regions can introduce errors into the flow estimations, which in turn propagate to subsequent levels. 2) The commonly used cost volume [29, 16] for establishing feature correspondence can contain many outliers due to ambiguous correspondence in challenging regions. However, most existing networks directly feed the noisy cost volume to the following flow estimation layers without explicitly alleviating the impact of outliers. 3) Many training strategies have been proposed to improve accuracy in challenging regions for unsupervised optical flow estimation, such as occlusion reasoning [18, 29] and self-supervision [16, 17, 15]. These strategies generally train a single network to provide prior information. However, this prior information is not accurate enough, because a single network can be easily disturbed by outliers when the ground truth is inaccessible, and the inaccurate prior information can further lead to significant performance degradation.

To overcome these limitations, we propose CoT-AMFlow, which comprises adaptive modulation networks, named AMFlows, that learn optical flow estimation in an unsupervised way with a co-teaching strategy. An overview of our proposed CoT-AMFlow is illustrated in Fig. 1. We leverage three novel techniques to improve flow accuracy, as follows:

  • We apply flow modulation modules (FMMs) in our AMFlow to refine the flow initialization from the preceding pyramid level using local flow consistency, which can address the issue of accumulated errors.

  • We present cost volume modulation modules (CMMs) in our AMFlow to explicitly reduce outliers in the cost volume using a flexible and efficient sparse point-based scheme.

  • We adopt a co-teaching strategy, where two AMFlows with different initializations simultaneously teach each other about challenging regions to improve robustness against outliers.

We conduct extensive experiments on the MPI Sintel [4], KITTI Flow 2012 [7], KITTI Flow 2015 [20] and Middlebury Flow [2] benchmarks. Experimental results show that our CoT-AMFlow outperforms all other unsupervised approaches, while still running in real time.

Figure 1: An overview of our CoT-AMFlow. We integrate self-supervision into a co-teaching framework, where two AMFlows with different initializations teach each other about challenging regions to improve stability against outliers and further enhance the accuracy of flow estimation.

2 Related Work

2.1 Optical Flow Estimation

Traditional approaches typically estimate optical flow by minimizing a global energy that measures both brightness consistency and spatial smoothness [9, 19, 3]. With the recent development of deep learning technology, supervised approaches using convolutional neural networks (CNNs) have been extensively applied to optical flow estimation, and the achieved results are very promising. FlowNet [6] was the first end-to-end deep neural network for optical flow estimation. It employs a correlation layer to compute feature correspondence. Later on, PWC-Net [24] and LiteFlowNet [12] presented a pyramid architecture, which consists of feature warping layers, cost volumes and flow estimation layers. Such an architecture can achieve remarkable flow accuracy and high efficiency simultaneously. Their subsequent versions [22, 11] also made incremental improvements. Unsupervised approaches generally adopt network architectures similar to those of supervised approaches, and focus more on training strategies. However, existing network architectures do not explicitly address the issues of noisy flow initializations and outliers in the cost volume, as previously mentioned. Therefore, we develop the FMMs and CMMs in our AMFlow to overcome these limitations.

Among the training strategies for unsupervised approaches, DSTFlow [21] first presented a photometric loss and a smoothness loss for unsupervised training. Additionally, some approaches train a single network to perform occlusion reasoning for accuracy improvement [18, 29]. Self-supervision [16, 17] is also an important strategy for unsupervised training. It first trains a single network to generate flow labels, and then conducts data augmentation to make flow estimations more challenging. The augmented samples are further employed as supervision to train another network. One variant of self-supervision is to train only one network with a two-forward process [15]. However, training a single network to provide flow labels is likely to be unreliable due to the disturbance of outliers and the lack of ground-truth supervision. To address this issue, we integrate self-supervision into a co-teaching framework, where two networks simultaneously teach each other about challenging regions to improve stability against outliers.

2.2 Co-Teaching Strategy

The co-teaching strategy was first proposed for the image classification task with extremely noisy labels [8]. Since then, many researchers have resorted to this strategy for various robust training tasks, such as person re-identification [30] and object detection [5]. The main difference between previous studies and our approach is that they focus on supervised learning with noisy labels, while we focus on unsupervised learning. Moreover, the noise in their tasks exists at the image level (noisy image classification labels), while the outliers in our task exist at the pixel level (inaccurate flow estimations in challenging regions).

3 Methodology

Figure 2: An illustration of our AMFlow, which uses FMMs and CMMs to refine flow initializations and remove outliers in cost volumes, respectively.

3.1 AMFlow

In this subsection, we first introduce the overall architecture of our AMFlow, and then present our FMM and CMM. Since we use many notations, we suggest readers refer to the glossary provided in the appendix for better understanding. Fig. 2 illustrates an overview of our proposed AMFlow, which follows the pipeline of PWC-Net [24]. Feature maps at different pyramid levels are first extracted hierarchically from the two input images using a siamese feature pyramid network, and are then sent to the coarse-to-fine flow decoder. For simplicity, we take one pyramid level as an example to introduce our flow decoder. First, the flow estimation upsampled from the preceding level is processed by our FMM for refinement, and the resulting modulated flow is employed to align the feature map of the second image with that of the first. A correlation operation is then employed to compute the cost volume, which is processed by our CMM to remove outliers. The modulated cost volume is fed into the same flow estimation layer as in PWC-Net [24] to estimate a flow residual, which is subsequently added to the modulated flow to obtain the flow estimation at the current level. This process iterates, and flow estimations at different scales are generated.
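As a concrete illustration of this data flow, the following PyTorch-style sketch chains the steps of one decoder level. The function and argument names (decoder_level, fmm, cmm, correlation, flow_estimator, warp) are placeholders rather than the authors' released implementation, and warp stands for a grid_sample-based backward-warping routine (one possible implementation is given in the FMM sketch below).

import torch
import torch.nn.functional as F

def decoder_level(feat1, feat2, img1_down, img2_down, flow_prev,
                  fmm, cmm, correlation, flow_estimator, warp):
    """One coarse-to-fine decoder level of AMFlow (hypothetical sketch).

    feat1, feat2:         feature maps of the two images at this pyramid level
    img1_down, img2_down: the input images downsampled to this level
    flow_prev:            flow estimated at the preceding (coarser) level
    warp:                 grid_sample-based backward-warping function
    """
    # 1. Upsample the coarser flow to the current resolution (flow values scaled by 2).
    flow_up = 2.0 * F.interpolate(flow_prev, scale_factor=2,
                                  mode='bilinear', align_corners=False)

    # 2. Flow Modulation Module: refine the noisy flow initialization.
    flow_mod = fmm(flow_up, feat1, img1_down, img2_down)

    # 3. Warp the second feature map towards the first one with the modulated flow.
    feat2_warped = warp(feat2, flow_mod)

    # 4. A correlation layer builds the cost volume over a local search range.
    cost_volume = correlation(feat1, feat2_warped)

    # 5. Cost Volume Modulation Module: suppress outliers in the cost volume.
    cost_mod = cmm(cost_volume)

    # 6. Estimate a flow residual and add it to the modulated initialization.
    flow_res = flow_estimator(torch.cat([cost_mod, feat1, flow_mod], dim=1))
    return flow_mod + flow_res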

Flow Modulation Module (FMM). In the coarse-to-fine framework, a flow estimation from the preceding level is adopted as a flow initialization at the current level. Therefore, the inaccurate flow estimations in challenging regions can propagate to subsequent levels and cause significant performance degradation. Our FMM is developed to address this problem based on the concept of local flow consistency [35].

Our FMM is based on the assumption that neighboring pixels with similar features should have similar optical flows. Therefore, for a pixel with an inaccurate flow estimation, we look for a nearby pixel whose feature is similar and whose flow estimation is accurate, and then replace the inaccurate flow with that accurate one.

To this end, we first compute a confidence map based on the upsampled flow estimation and the downsampled input images, as illustrated in Fig. 2. The confidence computing operation is defined as follows:

(1)

where the photometric difference is measured with the function adopted in [15], and the warping operation warps one downsampled image towards the other based on the upsampled flow. Then, we use a self-correlation layer to compute a self-cost volume, which measures the similarity between each pixel in the feature map and its neighboring pixels. The adopted self-correlation layer is identical to the correlation layer used in the above-mentioned flow decoder, except that it only takes one feature map as input. We further concatenate the confidence map with the self-cost volume, and send the concatenation to several convolution layers to obtain a displacement map. Finally, we warp the upsampled flow based on the displacement map to get the modulated flow estimation.
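The following is a minimal PyTorch sketch of such a module, assuming a simple absolute-difference photometric measure for the confidence map and an unfold-based self-correlation; the class name, layer sizes and search radius are illustrative choices, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(x, flow):
    # Backward warping with grid_sample; flow channels are (dx, dy) in pixels.
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing='ij')
    grid = torch.stack((xs, ys), 0).float().unsqueeze(0) + flow
    gx = 2.0 * grid[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * grid[:, 1] / max(h - 1, 1) - 1.0
    return F.grid_sample(x, torch.stack((gx, gy), dim=-1),
                         mode='bilinear', align_corners=True)

class FlowModulationModule(nn.Module):
    """Hypothetical FMM sketch: refine a noisy flow initialization via local flow consistency."""

    def __init__(self, radius=4, hidden=64):
        super().__init__()
        self.radius = radius
        in_ch = 1 + (2 * radius + 1) ** 2          # confidence map + self-cost volume
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(hidden, 2, 3, padding=1))     # 2-channel displacement map

    def self_cost_volume(self, feat):
        # Correlate every pixel with its (2r+1)^2 neighbours in the same feature map.
        n, c, h, w = feat.shape
        k = 2 * self.radius + 1
        neigh = F.unfold(feat, k, padding=self.radius).view(n, c, k * k, h, w)
        return (feat.unsqueeze(2) * neigh).mean(dim=1)        # (N, k*k, H, W)

    def forward(self, flow_up, feat1, img1_down, img2_down):
        # Photometric error between img1 and img2 warped by the upsampled flow,
        # used here as a stand-in for the confidence measure of the paper.
        conf = (img1_down - backward_warp(img2_down, flow_up)).abs().mean(1, keepdim=True)
        scv = self.self_cost_volume(feat1)
        disp = self.conv(torch.cat([conf, scv], dim=1))       # displacement map
        # Modulated flow: resample the flow field at the predicted displacements.
        return backward_warp(flow_up, disp)

Warping the flow field by the predicted displacement map lets each pixel copy the flow of a nearby, more reliable pixel, which is exactly the local-flow-consistency idea described above.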

Cost Volume Modulation Module (CMM). Ambiguous correspondence in challenging regions can introduce noise into the cost volume, which further influences the subsequent flow estimation layers. Our CMM is designed to reduce this noise in the cost volume.

Several traditional approaches have formulated the task of denoising the cost volume as a weighted least squares problem, whose solution at a given pyramid level takes the following form [31, 10]:

$\tilde{C}(\mathbf{p}, d) = \sum_{\mathbf{q} \in \mathcal{N}(\mathbf{p})} w(\mathbf{p}, \mathbf{q}) \, C(\mathbf{q}, d),$   (2)

where $\tilde{C}(\mathbf{p}, d)$ denotes the modulated cost at pixel $\mathbf{p}$ for flow residual candidate $d$; pixel $\mathbf{q}$ belongs to the neighbors $\mathcal{N}(\mathbf{p})$ of $\mathbf{p}$; $w(\mathbf{p}, \mathbf{q})$ denotes the modulation weight; and $C(\mathbf{q}, d)$ denotes the original cost at pixel $\mathbf{q}$ for flow residual candidate $d$. Note that the one-dimensional index $d$ is transformed from the original two-dimensional flow residual candidate for simplicity, which is the same as the scheme adopted in PWC-Net [24].
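For reference, a minimal sketch of this kind of weighted cost-volume filtering is given below. It uses softmax feature-similarity weights as one example choice of $w(\mathbf{p}, \mathbf{q})$, whereas [31, 10] use bilateral- and guided-filter-style weights; all names and parameters are illustrative.

import torch
import torch.nn.functional as F

def filter_cost_volume(cost, feat, radius=2, sigma=0.1):
    """Classical weighted filtering of a cost volume, in the spirit of (2).

    cost: (N, D, H, W) matching costs for D flow-residual candidates
    feat: (N, C, H, W) guidance features used to compute the weights
    """
    n, d, h, w = cost.shape
    c = feat.shape[1]
    k = 2 * radius + 1
    # Gather the (2r+1)^2 neighbours of every pixel for both tensors.
    cost_n = F.unfold(cost, k, padding=radius).view(n, d, k * k, h, w)
    feat_n = F.unfold(feat, k, padding=radius).view(n, c, k * k, h, w)
    # Bilateral-style weights: neighbours with similar features get larger weights.
    diff = ((feat_n - feat.unsqueeze(2)) ** 2).mean(dim=1)     # (N, k*k, H, W)
    weight = torch.softmax(-diff / sigma, dim=1)
    # Weighted sum over the neighbourhood for every candidate d.
    return (cost_n * weight.unsqueeze(1)).sum(dim=2)           # (N, D, H, W)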

The intuition behind our CMM is to implement (2) within a deep neural network, which is realized by a flexible and efficient sparse point-based scheme built on deformable convolution [34]:

$\tilde{C}(\mathbf{p}, d) = \sum_{k=1}^{K} w_k \cdot \Delta m_k(\mathbf{p}) \cdot C\big(\mathbf{p} + \mathbf{p}_k + \Delta \mathbf{p}_k(\mathbf{p}), d\big),$   (3)

where $K$ denotes the number of sampling points; $w_k$ denotes the modulation weight for the $k$-th point; and $\mathbf{p}_k$ is the fixed offset of the original convolution layer with respect to $\mathbf{p}$. To make the modulation scheme more flexible, we also employ a separate convolutional layer on the cost volume to learn the additional offset $\Delta \mathbf{p}_k(\mathbf{p})$ and the spatial-variant weight $\Delta m_k(\mathbf{p})$. These two terms can effectively and efficiently help remove outliers in challenging regions.
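A possible realization of this scheme with the deformable convolution operator available in torchvision is sketched below; keeping the number of output channels equal to the number of flow residual candidates and squashing the spatial-variant weights with a sigmoid are assumptions of this sketch, not details taken from the paper.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class CostVolumeModulationModule(nn.Module):
    """Hypothetical CMM sketch: modulate the cost volume with a deformable convolution
    whose extra offsets and spatial-variant weights are predicted from the cost volume itself."""

    def __init__(self, cv_channels, kernel_size=3):
        super().__init__()
        self.k2 = kernel_size * kernel_size
        # Deformable convolution acting on the cost volume (one output per candidate).
        self.deform = DeformConv2d(cv_channels, cv_channels, kernel_size,
                                   padding=kernel_size // 2)
        # A separate convolution predicts the additional offsets (2 values per sampling
        # point) and the spatial-variant modulation weights (1 value per sampling point).
        self.offset_mask = nn.Conv2d(cv_channels, 3 * self.k2, 3, padding=1)

    def forward(self, cost_volume):
        om = self.offset_mask(cost_volume)
        offset, mask = om[:, :2 * self.k2], om[:, 2 * self.k2:]
        mask = torch.sigmoid(mask)              # spatial-variant weights in (0, 1)
        return self.deform(cost_volume, offset, mask=mask)

Because only a small number of sampling points is visited per pixel, the modulation remains sparse and adds little overhead compared with filtering over a dense neighborhood as in (2).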

3.2 Loss Function

Input: two AMFlows M1 and M2 with parameters Θ1 and Θ2, learning rate η, threshold schedule constants, maximum epoch number E_max and maximum iteration number N_max.
Output: Θ1 and Θ2.
1  for e = 1, …, E_max do
2        Shuffle training set
3        for n = 1, …, N_max do
4              Forward M1 and M2 individually to obtain their flow estimations and occlusion maps O1 and O2
5              Set O1 and O2 to 1 wherever the estimated occlusion probability exceeds the current threshold δ      // Filter out pixels with high occlusion probability
6              Compute the loss L1 of M1 using the occlusion map O2
7              Compute the loss L2 of M2 using the occlusion map O1
8              Update Θ1 with the gradient of L1 and Θ2 with the gradient of L2, using learning rate η
9        end for
10      Update the threshold δ according to its schedule
11 end for
Algorithm 1 Co-Teaching Strategy

We employ three common loss functions, 1) a photometric loss, 2) a smoothness loss and 3) a self-supervision loss, to train our CoT-AMFlow, as illustrated in Fig. 1. For each network, the forward flow and the backward flow can be obtained given the two input images. Then, we can compute an occlusion map with values between 0 and 1 [29], where a higher value indicates that the corresponding pixel is more likely to be occluded, and vice versa. Based on these notations, we first introduce our adopted photometric loss [29] as follows:

(4)

where the photometric difference is measured with the generalized Charbonnier penalty function [23], a stop-gradient is applied to the occlusion maps, and the occlusion masking is performed element-wise. (4) shows that occluded regions have little impact on the photometric loss, since no correspondence exists in these regions. Moreover, we stop the gradient at the occlusion maps to avoid a trivial solution. Then, the following formulation shows our utilized second-order edge-aware smoothness loss [26]:

(5)

where the smoothness term is computed over each color channel and normalized by the total number of pixels. We also adopt a self-supervision scheme [15]. Specifically, we first apply spatial, occlusion and appearance transformations [15] to the input images, the flow estimation and the occlusion map, respectively, to construct augmented samples. We then obtain a flow prediction from the augmented image pair. Our self-supervision loss is given as follows [16]:

(6)

where the difference between the flow prediction on the augmented samples and the transformed flow label is measured with the L2 norm. Note that, different from the occlusion map used in (4), the mask used here measures the occlusion relationship between the original and the augmented samples: a higher value indicates that the corresponding pixel is less likely to be occluded in the original samples but more likely to be occluded in the augmented ones [16]. Therefore, (6) shows that the self-supervision loss helps improve the accuracy of flow estimations in challenging regions.

The whole loss function for training our CoT-AMFlow is a weighted sum of the above three losses, as shown on Lines 6 and 7 in Algorithm 1. The details will be introduced in Section 3.3.
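To make the three terms concrete, the sketch below shows hedged PyTorch versions of an occlusion-masked Charbonnier photometric loss, a second-order edge-aware smoothness loss and a masked L2 self-supervision loss, written in the spirit of (4)-(6); the normalization, the Charbonnier parameters, the edge weight and the function interfaces are assumptions, and only the forward direction is shown.

import torch

def charbonnier(x, eps=1e-3, alpha=0.45):
    # Generalized Charbonnier penalty [23]; alpha and eps are typical choices,
    # not necessarily those used in the paper.
    return (x * x + eps * eps) ** alpha

def photometric_loss(img1, img2_warped, occ):
    # Occlusion-masked photometric loss in the spirit of (4); occ in [0, 1], 1 = occluded.
    # The gradient is stopped at the occlusion mask to avoid the trivial all-occluded solution.
    mask = (1.0 - occ).detach()
    diff = charbonnier(img1 - img2_warped).mean(dim=1, keepdim=True)
    return (diff * mask).sum() / (mask.sum() + 1e-6)

def smoothness_loss(flow, img, edge_weight=10.0):
    # Second-order edge-aware smoothness in the spirit of (5): penalize the second
    # differences of the flow, down-weighted at image edges.
    def second_diff(x, dim):
        a = x.narrow(dim, 0, x.size(dim) - 2)
        b = x.narrow(dim, 1, x.size(dim) - 2)
        c = x.narrow(dim, 2, x.size(dim) - 2)
        return a - 2.0 * b + c
    loss = 0.0
    for dim in (2, 3):                                  # y and x directions
        img_grad = (img.narrow(dim, 1, img.size(dim) - 1)
                    - img.narrow(dim, 0, img.size(dim) - 1)).abs().mean(1, keepdim=True)
        w = torch.exp(-edge_weight * img_grad).narrow(dim, 0, img.size(dim) - 2)
        loss = loss + (w * second_diff(flow, dim).abs().mean(1, keepdim=True)).mean()
    return loss

def self_supervision_loss(flow_aug_pred, flow_label, valid_mask):
    # Self-supervision loss in the spirit of (6): L2 difference between the prediction on
    # the augmented pair and the transformed flow label, restricted to pixels that are
    # visible in the original pair but occluded in the augmented one.
    diff = torch.norm(flow_aug_pred - flow_label.detach(), dim=1, keepdim=True)
    return (diff * valid_mask).sum() / (valid_mask.sum() + 1e-6)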

3.3 Co-Teaching Strategy

Our co-teaching strategy is illustrated in Fig. 1, and the corresponding steps are shown in Algorithm 1. Specifically, we simultaneously train two networks, M1 (with parameters Θ1) and M2 (with parameters Θ2). In each mini-batch, we first let the two networks forward individually to obtain several outputs (Line 4). Then, we filter out the pixels with a high occlusion probability by setting their values in the occlusion map to 1, so that they are treated as completely occluded and thus have no impact on the photometric loss (Line 5). The filtering threshold δ equals 1 at the beginning and then decreases gradually as the epoch number increases. The key point of our co-teaching strategy is that each network uses the occlusion maps estimated by the other network to compute its own loss function (Lines 6 and 7). Finally, we update the parameters of the two networks separately and also update δ (Lines 8 and 10). Next, we answer two important questions about our co-teaching strategy: 1) why do we need a dynamic threshold, and 2) why can swapping the occlusion maps estimated by the two networks help improve the accuracy of unsupervised optical flow estimation?

To answer the first question, recall that it is meaningless to compute the photometric loss on occluded regions, and thus we adopt an occlusion-masked photometric loss. According to [1], networks first learn easy and clear patterns, i.e., unchallenging regions. However, as the number of epochs increases, the networks gradually become affected by inaccurately estimated occlusion maps and thus overfit on the occluded regions, which in turn leads to more inaccurate occlusion estimations and further causes significant performance degradation. To address this, we keep more pixels in the initial epochs, i.e., δ is large. Then, we gradually filter out pixels with high occlusion probability, i.e., δ gradually decreases, to ensure the networks do not memorize these possible outliers.

The dynamic threshold can, however, only alleviate but not entirely avoid the adverse impact of the occluded regions. Therefore, we further adopt a scheme with two networks, which connects to the answer to our second question. The intuition is that different networks have different abilities to learn flow estimation, and correspondingly, they can generate different occlusion estimations. Therefore, swapping the occlusion maps estimated by the two networks can help them adaptively correct inaccurate occlusion estimations. Compared with most existing approaches, which directly transfer errors back to themselves, our co-teaching strategy can effectively avoid the accumulation of errors and thus improve stability against outliers for unsupervised optical flow estimation. Note that since deep neural networks are highly non-convex and different initializations can lead to different local optima, we employ two AMFlows with different initializations in our CoT-AMFlow, following [8], as illustrated in Fig. 1.
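A compact sketch of one co-teaching mini-batch (Lines 4-8 of Algorithm 1) and of an assumed threshold schedule is given below; the model and loss_fn interfaces, the linear schedule form (borrowed from the style of [8]) and all names and default values are illustrative rather than the authors' implementation.

import torch

def occlusion_threshold(epoch, num_epochs, filter_range=0.2, filter_speed=0.5):
    # Assumed schedule: the threshold starts at 1 (keep all pixels) and decreases
    # linearly to 1 - filter_range over the first filter_speed fraction of training;
    # the paper's exact schedule and constants may differ.
    progress = min(epoch / (filter_speed * num_epochs), 1.0)
    return 1.0 - filter_range * progress

def coteaching_step(model1, model2, opt1, opt2, batch, threshold, loss_fn):
    """One co-teaching mini-batch (sketch of Algorithm 1, Lines 4-8).

    loss_fn(outputs, occlusion_map, batch) is assumed to return the weighted sum of the
    photometric, smoothness and self-supervision losses with the given occlusion map.
    """
    out1, occ1 = model1(batch)            # Line 4: forward both networks individually
    out2, occ2 = model2(batch)

    # Line 5: pixels whose occlusion probability exceeds the threshold are treated as
    # completely occluded, so they do not contribute to the photometric loss.
    occ1 = torch.where(occ1 > threshold, torch.ones_like(occ1), occ1)
    occ2 = torch.where(occ2 > threshold, torch.ones_like(occ2), occ2)

    # Lines 6-7: each network is trained with the occlusion map of its peer.
    loss1 = loss_fn(out1, occ2.detach(), batch)
    loss2 = loss_fn(out2, occ1.detach(), batch)

    # Line 8: update the two networks separately.
    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()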

4 Experimental Results

4.1 Dataset and Implementation Details

In our experiments, we combine the three losses with fixed weights. One weight is held at its initial value for the first 40% of the epochs and then increased linearly to 0.15 over the next 20% of the epochs, after which it stays constant. The learning rate follows an exponential decay scheme, and the Adam optimizer is used. Moreover, the hyper-parameters of the threshold schedule in Algorithm 1 are fixed for evaluation on the public benchmarks.
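A hypothetical optimizer setup matching this description could look as follows; the initial learning rate and decay factor are assumed values, since the exact numbers are not reproduced here.

import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, base_lr: float = 1e-4, gamma: float = 0.95):
    # Adam with an exponentially decaying learning rate (assumed base_lr and gamma).
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)
    return optimizer, scheduler   # call scheduler.step() once per epoch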

We first evaluate our CoT-AMFlow on three popular optical flow benchmarks, MPI Sintel [4], KITTI Flow 2012 [7] and KITTI Flow 2015 [20]. The experimental results are shown in Section 4.2. Then, we perform a generalization evaluation on the Middlebury Flow benchmark [2], as presented in Section 4.3. We also conduct extensive ablation studies to demonstrate the superiority of 1) our selection of the threshold-schedule hyper-parameters; 2) our FMM and CMM; 3) our AMFlow over other network architectures; and 4) our co-teaching strategy over other strategies for unsupervised training. The experimental results are presented in the appendix.

Furthermore, we follow a training scheme similar to those of previous unsupervised approaches [16, 17, 15] for a fair comparison. For the MPI Sintel benchmark, we first train our model on raw movie frames and then fine-tune it on the training set. For the two KITTI Flow benchmarks, we first employ the KITTI raw dataset to pre-train our model and then fine-tune it using the multi-view extension data. Additionally, we adopt two standard evaluation metrics, the average end-point error (AEPE) and the percentage of erroneous pixels (F1) [4, 7, 20, 2].
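For clarity, the two metrics can be computed as follows; the F1 sketch follows the common KITTI convention (a pixel counts as erroneous if its end-point error exceeds both 3 px and 5% of the ground-truth flow magnitude).

import torch

def aepe(flow_pred, flow_gt):
    # Average end-point error: mean Euclidean distance between predicted and
    # ground-truth flow vectors; flow tensors have shape (N, 2, H, W).
    return torch.norm(flow_pred - flow_gt, dim=1).mean()

def f1_outlier_rate(flow_pred, flow_gt, tau_px=3.0, tau_rel=0.05):
    # Percentage of erroneous pixels (F1) following the KITTI convention.
    epe = torch.norm(flow_pred - flow_gt, dim=1)
    mag = torch.norm(flow_gt, dim=1)
    outliers = (epe > tau_px) & (epe > tau_rel * mag)
    return 100.0 * outliers.float().mean()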

Approach | S | Sintel Clean | Sintel Final | KITTI 2012 Noc | KITTI 2012 All | KITTI 2015 Noc | KITTI 2015 All | Time (s)
PWC-Net [24] | ✓ | 4.39 | 5.04 | 4.22 | 8.10 | 6.12 | 9.60 | 0.03
LiteFlowNet [12] | ✓ | 4.54 | 5.38 | 3.27 | 7.27 | 5.49 | 9.38 | 0.09
LiteFlowNet2 [11] | ✓ | 3.48 | 4.69 | 2.63 | 6.16 | 4.42 | 7.62 | 0.05
MaskFlownet [33] | ✓ | 2.52 | 4.17 | 2.07 | 4.82 | 3.92 | 6.11 | 0.06
RAFT [25] | ✓ | 1.61 | 2.86 | – | – | 3.07 | 5.10 | 0.20
UnFlow [18] | | 9.38 | 10.22 | 4.28 | 8.42 | 7.46 | 11.11 | 0.12
DDFlow [16] | | 6.18 | 7.40 | 4.57 | 8.86 | 9.55 | 14.29 | 0.06
SelFlow [17]† | | 6.56 | 6.57 | 4.31 | 7.68 | 9.65 | 14.19 | 0.09
ARFlow [15] | | 4.78 | 5.89 | 4.71 | 8.49 | 8.91 | 11.80 | 0.01
ARFlow-mv [15]† | | 4.49 | 5.67 | 4.56 | 7.53 | 8.97 | 11.79 | 0.02
UFlow [13] | | 5.21 | 6.50 | 4.26 | 7.91 | 8.41 | 11.13 | 0.04
CoT-AMFlow (Ours) | | 3.96 | 5.14 | 3.50 | 8.26 | 6.28 | 10.34 | 0.06
Table 1: Evaluation results on the MPI Sintel, KITTI Flow 2012 and KITTI Flow 2015 benchmarks. Here, we show the primary evaluation metrics used on each benchmark. For the Sintel Clean and Final benchmarks, the AEPE (px) for all pixels is presented. For KITTI Flow 2012 and 2015, "Noc" and "All" represent the F1 (%) for non-occluded pixels and all pixels, respectively. "S" denotes supervised approaches, and † indicates a network using more than two frames. Best results for supervised and unsupervised approaches are both shown in bold font.

4.2 Performance on Public Benchmarks

According to the online leaderboards of the MPI Sintel (http://sintel.is.tue.mpg.de/results), KITTI Flow 2012 (http://www.cvlibs.net/datasets/kitti/eval_stereo_flow.php?benchmark=flow) and KITTI Flow 2015 (http://cvlibs.net/datasets/kitti/eval_scene_flow.php?benchmark=flow) benchmarks, as shown in Table 1, our CoT-AMFlow outperforms all other unsupervised optical flow estimation approaches. Our approach is significantly ahead of the other unsupervised approaches, especially on the MPI Sintel benchmark, where an AEPE improvement of 0.53 px–5.42 px is achieved on Sintel Clean. We also use the KITTI Flow 2015 benchmark to record the average inference time of our CoT-AMFlow. The results in Table 1 show that our approach still runs in real time while achieving state-of-the-art performance. One exciting fact is that our unsupervised CoT-AMFlow achieves competitive performance when compared with supervised approaches. Specifically, on the MPI Sintel Clean benchmark, our CoT-AMFlow outperforms classic networks such as PWC-Net [24] and LiteFlowNet [12], while achieving only slightly inferior performance compared with LiteFlowNet2 [11], which demonstrates the effectiveness of our adaptive modulation network and co-teaching strategy. Fig. 3 illustrates examples from the three public benchmarks, where we can clearly see that our CoT-AMFlow yields more robust and accurate results.

4.3 Generalization Analysis across Datasets

We employ the CoT-AMFlow trained on the MPI Sintel benchmark directly on the Middlebury Flow benchmark to test the generalization ability of our approach. Table 2 shows the online leaderboard of the Middlebury Flow benchmark (https://vision.middlebury.edu/flow/eval/results/results-e1.php). Note that our CoT-AMFlow has not been fine-tuned on this benchmark. We can observe that our CoT-AMFlow significantly outperforms the unsupervised UnFlow [18] and even presents superior performance over supervised approaches such as PWC-Net [24] and LiteFlowNet [12]. The results strongly verify that our CoT-AMFlow has an excellent generalization ability.

Metric | PWC-Net [24] | LiteFlowNet [12] | UnFlow [18] | CoT-AMFlow (Ours)
S | ✓ | ✓ | |
AEPE (px) | 0.33 | 0.40 | 0.76 | 0.26
Table 2: Evaluation results on the Middlebury Flow benchmark. “S” denotes supervised approaches. Note that our CoT-AMFlow has not been fine-tuned on the benchmark. Best results for supervised and unsupervised approaches are both shown in bold font.
Figure 3: Examples of the MPI Sintel Clean, KITTI Flow 2012 and KITTI Flow 2015 benchmarks, where rows (a) and (b) on columns (1)–(3) show the flow estimations and the corresponding error maps of (1) ARFlow-mv [15], (2) SelFlow [17] and (3) our CoT-AMFlow, respectively. Significantly improved regions are highlighted with green dashed boxes.

5 Conclusion

In this paper, we proposed CoT-AMFlow, an adaptive modulation network with a co-teaching strategy for unsupervised optical flow estimation. Our CoT-AMFlow presents three major contributions: 1) a flow modulation module (FMM), which can refine the flow initialization from the preceding pyramid level to address the issue of accumulated errors; 2) a cost volume modulation module (CMM), which can explicitly reduce outliers in the cost volume to improve the accuracy of optical flow estimation; and 3) a co-teaching strategy for unsupervised training, which employs two networks to teach each other about challenging regions to improve robustness against outliers for unsupervised optical flow estimation. Extensive experiments have demonstrated that our CoT-AMFlow achieves the state-of-the-art performance for unsupervised optical flow estimation with an impressive generalization ability, while still running in real time. We believe that our CoT-AMFlow can be directly used in many mobile robot tasks, such as SLAM and robot navigation, to improve their performance. It is also promising to employ the co-teaching strategy in other unsupervised tasks, such as unsupervised disparity or scene flow estimation.

We thank the anonymous reviewers for their useful comments. This work was supported by the National Natural Science Foundation of China, under grant No. U1713211, Collaborative Research Fund by Research Grants Council Hong Kong, under Project No. C4063-18G, and HKUST-SJTU Joint Research Collaboration Fund, under project SJTU20EG03, awarded to Prof. Ming Liu.

References

  • [1] D. Arpit, S. K. Jastrzebski, N. Ballas, D. Krueger, E. Bengio, M. S. Kanwal, T. Maharaj, A. Fischer, A. C. Courville, Y. Bengio, et al. (2017) A closer look at memorization in deep networks. In ICML.
  • [2] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. J. Black, and R. Szeliski (2011) A database and evaluation methodology for optical flow. Inter. J. Comput. Vision 92 (1), pp. 1–31.
  • [3] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert (2004) High accuracy optical flow estimation based on a theory for warping. In Eur. Conf. Comput. Vision (ECCV), pp. 25–36.
  • [4] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A naturalistic open source movie for optical flow evaluation. In Proc. Eur. Conf. Comput. Vision (ECCV), Part IV, LNCS 7577, pp. 611–625.
  • [5] S. Chadwick and P. Newman (2019) Training object detectors with noisy data. In 2019 IEEE Intell. Veh. Symp. (IV), pp. 1319–1325.
  • [6] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 2758–2766.
  • [7] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR).
  • [8] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and M. Sugiyama (2018) Co-teaching: robust training of deep neural networks with extremely noisy labels. In Adv. Neural Inf. Process. Syst. (NIPS), pp. 8527–8537.
  • [9] B. K. Horn and B. G. Schunck (1981) Determining optical flow. In Techn. Appl. Image Understanding, Vol. 281, pp. 319–331.
  • [10] A. Hosni, C. Rhemann, M. Bleyer, C. Rother, and M. Gelautz (2012) Fast cost-volume filtering for visual correspondence and beyond. IEEE Trans. Pattern Anal. Mach. Intell. 35 (2), pp. 504–511.
  • [11] T. Hui, X. Tang, and C. C. Loy (2020) A lightweight optical flow CNN - revisiting data fidelity and regularization. IEEE Trans. Pattern Anal. Mach. Intell., pp. 1–1.
  • [12] T. Hui, X. Tang, and C. C. Loy (2018) LiteFlowNet: a lightweight convolutional neural network for optical flow estimation. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 8981–8989.
  • [13] R. Jonschkowski, A. Stone, J. T. Barron, A. Gordon, K. Konolige, and A. Angelova (2020) What matters in unsupervised optical flow. In Eur. Conf. Comput. Vision (ECCV).
  • [14] K. Lee, J. Gibson, and E. A. Theodorou (2020) Aggressive perception-aware navigation using deep optical flow dynamics and PixelMPC. IEEE Robot. Automat. Lett. 5 (2), pp. 1207–1214.
  • [15] L. Liu, J. Zhang, R. He, Y. Liu, Y. Wang, Y. Tai, D. Luo, C. Wang, J. Li, and F. Huang (2020) Learning by analogy: reliable supervision from transformations for unsupervised optical flow estimation. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 6489–6498.
  • [16] P. Liu, I. King, M. R. Lyu, and J. Xu (2019) DDFlow: learning optical flow with unlabeled data distillation. In Proc. AAAI Conf. Artif. Intell., Vol. 33, pp. 8770–8777.
  • [17] P. Liu, M. Lyu, I. King, and J. Xu (2019) SelFlow: self-supervised learning of optical flow. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 4571–4580.
  • [18] S. Meister, J. Hur, and S. Roth (2018) UnFlow: unsupervised learning of optical flow with a bidirectional census loss. In Thirty-Second AAAI Conf. Artif. Intell.
  • [19] E. Mémin and P. Pérez (1998) Dense estimation and object-based segmentation of the optical flow with robust techniques. IEEE Trans. Image Process. 7 (5), pp. 703–719.
  • [20] M. Menze and A. Geiger (2015) Object scene flow for autonomous vehicles. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR).
  • [21] Z. Ren, J. Yan, B. Ni, B. Liu, X. Yang, and H. Zha (2017) Unsupervised deep learning for optical flow estimation. In Thirty-First AAAI Conf. Artif. Intell.
  • [22] Z. Ren, O. Gallo, D. Sun, M. Yang, E. Sudderth, and J. Kautz (2019) A fusion approach for multi-frame optical flow estimation. In 2019 IEEE Winter Conf. Appl. Comput. Vision (WACV), pp. 2077–2086.
  • [23] D. Sun, S. Roth, and M. J. Black (2010) Secrets of optical flow estimation and their principles. In 2010 IEEE Comput. Soc. Conf. Comput. Vision Pattern Recognit., pp. 2432–2439.
  • [24] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 8934–8943.
  • [25] Z. Teed and J. Deng (2020) RAFT: recurrent all-pairs field transforms for optical flow. In Eur. Conf. Comput. Vision (ECCV).
  • [26] C. Tomasi and R. Manduchi (1998) Bilateral filtering for gray and color images. In Sixth Inter. Conf. Comput. Vision (IEEE Cat. No. 98CH36271), pp. 839–846.
  • [27] A. K. Ushani and R. M. Eustice (2018) Feature learning for scene flow estimation from lidar. In Conf. Robot Learn. (CoRL), pp. 283–292.
  • [28] H. Wang, Y. Liu, H. Huang, Y. Pan, W. Yu, J. Jiang, D. Lyu, M. J. Bocus, M. Liu, I. Pitas, and R. Fan (2020) ATG-PVD: ticketing parking violations on a drone. In Eur. Conf. Comput. Vision Workshops (ECCVW).
  • [29] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu (2018) Occlusion aware unsupervised learning of optical flow. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 4884–4893.
  • [30] F. Yang, K. Li, Z. Zhong, Z. Luo, X. Sun, H. Cheng, X. Guo, F. Huang, R. Ji, and S. Li (2020) Asymmetric co-teaching for unsupervised cross-domain person re-identification. In AAAI, pp. 12597–12604.
  • [31] K. Yoon and I. S. Kweon (2006) Adaptive support-weight approach for correspondence search. IEEE Trans. Pattern Anal. Mach. Intell. 28 (4), pp. 650–656.
  • [32] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang (2019) FlowFusion: dynamic dense RGB-D SLAM based on optical flow. In 2019 Int. Conf. Robot. Automat. (ICRA).
  • [33] S. Zhao, Y. Sheng, Y. Dong, E. I. Chang, Y. Xu, et al. (2020) MaskFlownet: asymmetric feature matching with learnable occlusion mask. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 6278–6287.
  • [34] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable ConvNets v2: more deformable, better results. In Proc. IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), pp. 9308–9316.
  • [35] H. Zimmer, A. Bruhn, and J. Weickert (2011) Optic flow in harmony. Inter. J. Comput. Vision 93 (3), pp. 368–388.

Appendix A Glossary of Notations

The glossary of notations used in the paper is presented in Table 3.

Appendix B Impact of Different Threshold-Schedule Hyper-Parameters

In our co-teaching strategy, two hyper-parameters control the filtering speed and the filtering range of the pixels with high occlusion probability, respectively. We consider three values for the filtering-speed hyper-parameter and five values for the filtering-range hyper-parameter, and we also test training schemes that adopt a constant threshold. The results of our CoT-AMFlow are shown in Table 4. We can observe that the dynamic threshold scheme effectively improves the performance and that our CoT-AMFlow is robust to different choices of the filtering speed. Moreover, the filtering range has a significant impact on the performance. Specifically, a larger filtering range indicates that more pixels will be filtered out. We can see that the performance improves as the filtering range increases; however, when too many pixels are filtered out, the performance deteriorates because the networks cannot obtain sufficient training data. Note that we use the best-performing setting in the rest of our ablation studies.

Appendix C Effectiveness of Our FMM and CMM

Table 5 shows the evaluation results of variants of our CoT-AMFlow with some of the proposed modules disabled. We can observe that our FMM and CMM can effectively improve the optical flow accuracy, especially for the pixels with large movements. This is because our FMM can refine the flow initialization from the preceding pyramid level to address the issue of accumulated errors by using local flow consistency, while our CMM can explicitly reduce outliers in the cost volume to improve the accuracy of optical flow estimation by using a flexible and efficient sparse point-based scheme. In addition, the best performance is achieved by integrating our FMM and CMM, which demonstrates the effectiveness of our proposed modules.

Notation | Meaning
Section 3.1
– | The two input images
– | The downsampled input images at a given pyramid level
– | The feature maps of the input images at a given pyramid level
– | The forward flow estimation at a given pyramid level
– | The upsampled forward flow estimation at a given pyramid level
– | The modulated forward flow generated via our FMM at a given pyramid level
– | The confidence map used in our FMM at a given pyramid level
– | The self-cost volume used in our FMM at a given pyramid level
– | The displacement map used in our FMM at a given pyramid level
– | The cost volume at a given pyramid level
– | The modulated cost volume generated via our CMM at a given pyramid level
Section 3.2 and 3.3
– | The input images
– | The forward flow estimation
– | The backward flow estimation
– | The occlusion map
– | The transformations employed on the input images, the flow estimation and the occlusion map, respectively [15]
– | The samples augmented via the above-mentioned transformations
– | The forward flow prediction based on the augmented input images
Table 3: A glossary of notations used in the paper.
Constant threshold | 4.31 | 4.10 | 3.95 | 4.34 | 4.89
Filtering speed 1 | 4.22 | 3.98 | 3.83 | 4.16 | 4.65
Filtering speed 2 | 4.27 | 4.05 | 3.79 | 4.02 | 4.64
Filtering speed 3 | 4.29 | 3.92 | 3.85 | 4.13 | 4.51
Table 4: AEPE (px) results of our CoT-AMFlow with different filtering speeds (rows) and filtering ranges (columns) in the proposed co-teaching strategy; the first row adopts a constant threshold. The best result is 3.79 px.
FMM | CMM | All | < 10 px | 10–40 px | > 40 px
 | | 4.73 | 0.82 | 2.46 | 29.75
✓ | | 4.12 | 0.79 | 2.32 | 25.12
 | ✓ | 4.23 | 0.73 | 2.23 | 26.49
✓ | ✓ | 3.79 | 0.76 | 2.07 | 23.10
Table 5: AEPE (px) results of variants of our CoT-AMFlow with some of the proposed modules disabled, where "All" denotes the AEPE over all pixels and the last three columns denote the AEPE over pixels that move less than 10 pixels, between 10 and 40 pixels, and more than 40 pixels, respectively. Best results are shown in bold font.

Appendix D Superiority of Our AMFlow over Other Network Architectures

To further demonstrate the superiority of our AMFlow over other network architectures, we compare the performance of different combinations of unsupervised network architectures and unsupervised training strategies. The results are shown in Table 6. From rows a)–d), we can observe that for each existing unsupervised approach, the performance can be significantly improved when the network architecture is changed from the original one to our AMFlow, which strongly demonstrates the effectiveness of our architecture. The reason why our AMFlow performs better is that it can address the issues of accumulated errors and reduce outliers in the cost volume to improve the optical flow accuracy by using our FMMs and CMMs. Moreover, from row e), we can see that, compared with other network architectures, our AMFlow achieves the best performance when equipped with the same training strategy, i.e., our co-teaching strategy, which further demonstrates the superiority of our AMFlow over other network architectures.

Appendix E Superiority of Our Co-Teaching Strategy over Other Strategies for Unsupervised Training

From columns 1)–4) in Table 6, we can observe that for each existing unsupervised approach, the performance can be significantly improved when the training strategy is changed from the original one to our co-teaching strategy, which strongly demonstrates the effectiveness of our strategy. The reason why our co-teaching strategy performs better is that it can improve robustness against outliers for unsupervised optical flow estimation by employing two networks to teach each other about challenging regions simultaneously. Moreover, from column 5), we can see that, compared with other training strategies, our co-teaching strategy achieves the best performance when employed in the same network architecture, i.e., our AMFlow, which further demonstrates the superiority of our co-teaching strategy over other strategies for unsupervised training.

 

Strategy \ Network | 1) UnFlowNet [18] | 2) DDFlowNet [16] | 3) SelFlowNet [17]† | 4) ARFlowNet [15] | 5) AMFlow (Ours)
a) UnFlowStrat [18] | 8.87 | – | – | – | 6.61
b) DDFlowStrat [16] | – | 5.95 | – | – | 5.59
c) SelFlowStrat [17] | – | – | 5.22 | – | 4.98
d) ARFlowStrat [15] | – | – | – | 4.67 | 4.36
e) Co-Teaching (Ours) | 5.65 | 4.73 | 3.94 | 4.29 | 3.79
Table 6: AEPE (px) results of different combinations of unsupervised network architectures and unsupervised training strategies. Note that XXXNet and XXXStrat denote the corresponding network architecture and unsupervised training strategy used in XXX, respectively. † indicates a network using more than two frames. The best result is shown in bold font.