GyroFlow: Gyroscope-Guided Unsupervised Optical Flow Learning

03/25/2021 ∙ by Haipeng Li, et al.

Existing optical flow methods are error-prone in challenging scenes, such as fog, rain, and night, because the basic optical flow assumptions such as brightness and gradient constancy are broken. To address this problem, we present an unsupervised learning approach that fuses gyroscope data into optical flow learning. Specifically, we first convert gyroscope readings into a motion field, named the gyro field. Then, we design a self-guided fusion module to fuse the background motion extracted from the gyro field with the optical flow and guide the network to focus on motion details. To the best of our knowledge, this is the first deep learning-based framework that fuses gyroscope data and image content for optical flow learning. To validate our method, we propose a new dataset that covers regular and challenging scenes. Experiments show that our method outperforms state-of-the-art methods in both regular and challenging scenes.


1 Introduction

Optical flow estimation is a fundamental yet essential computer vision task that has been widely applied in various applications such as object tracking [2], visual odometry [5], and image alignment [24]. The original formulation of optical flow was proposed by Horn and Schunck [11], after which the accuracy of optical flow estimation algorithms has improved steadily. Early traditional methods minimize pre-defined energy functions with various assumptions and constraints [35]. Deep learning-based methods directly learn per-pixel regression through convolutional neural networks and can be divided into supervised [7, 40, 43] and unsupervised methods [41, 36]. The former are primarily trained on synthetic data [7, 4] due to the lack of ground-truth labels. In contrast, the latter can be trained on abundant and diverse unlabeled data by minimizing the photometric loss between two images. Although existing methods achieve good results, they rely heavily on image content, requiring images to contain rich textures and similar illumination conditions.

Figure 1: (a) Input low-light frame. (b) Optical flow result from existing baseline method ARFlow [32]. (c) Ground-Truth. (d) Result from our GyroFlow.

On the other hand, gyroscopes do not rely on image content. They provide angular velocities in terms of roll, pitch, and yaw that can be converted into 3D rotational motion, and they are widely used for system control [27] and human-computer interaction on mobile devices [9]. Among the potential uses [3, 29, 15], one is to fuse gyroscope data into motion estimation. Hwangbo et al. proposed fusing gyroscope readings to improve the robustness of KLT feature tracking [15]. Bloesch et al. fused gyroscope data for ego-motion estimation [3]. These attempts demonstrate that if the gyroscope is integrated correctly, the performance and robustness of a method can be vastly improved.

Given the camera intrinsic parameters, gyro readings can be converted into motion fields that describe background motion but not dynamic object motion, because they capture only camera motion. Notably, gyroscopes do not look at image content, yet they produce reliable background camera motion under poor textures or in dynamic scenes. Therefore, gyroscopes can be used to improve optical flow estimation in challenging scenes, such as poorly textured scenes or scenes with inconsistent illumination.

In this paper, we propose GyroFlow, a gyroscope-guided unsupervised optical flow estimation method. We combine the advantages of image-based optical flow, which recovers motion details from the image content, with those of a gyroscope, which provides reliable background camera motion independent of image content. Specifically, we first convert gyroscope readings into gyro fields that describe background motion given the image coordinates and the camera intrinsics. Second, we estimate optical flow with an unsupervised learning framework and insert the proposed Self-Guided Fusion (SGF) module, which fuses the gyro field into the image-based flow computation. Fig. 1 shows an example, where Fig. 1 (a) is the input of a night scene with poor image textures and Fig. 1 (c) is the ground-truth optical flow between two frames. Image-based methods such as ARFlow [32] (Fig. 1 (b)) can produce the dynamic object motions but fail to compute the background motion in the sky, where no textures are available. Fig. 1 (d) shows our GyroFlow fusion result: both the global motion and the motion details are retained. From our experiments, we observe that motion details can be recovered better if the global motion is provided.

To validate our method, we propose GOF (Gyroscope Optical Flow), a dataset containing scenes of 4 different categories with synchronized gyro readings: one regular scene category (RE) and three challenging ones, namely low-light scenes (Dark), foggy scenes (Fog), and rainy scenes (Rain). For quantitative evaluation, we further provide a test set with accurate optical flow labels produced, with extensive manual effort, using the method of [31]. Note that existing flow datasets, such as Sintel [4] and KITTI [8, 38], cannot be used for this evaluation due to the absence of gyroscope readings. To sum up, our main contributions are:

  • We propose the first DNN-based framework that fuses gyroscope data into optical flow learning.

  • We propose a self-guided fusion module to effectively realize the fusion of gyroscope and optical flow.

  • We propose a dataset GOF for the evaluation. Experiments show that our method outperforms all existing methods.

Figure 2: The overview of our algorithm. It consists of a pyramid encoder and a pyramid decoder. For each pair of frames $I_a$ and $I_b$, the encoder extracts features at different scales. The decoder includes two modules: at each layer $l$, the SGF module fuses a gyro field and an optical flow to produce a fused flow, which is fed to the decoder that estimates the optical flow passed to the next layer.

2 Related Work

2.1 Gyro-based Vision Applications

Gyroscopes reflect the camera rotation and have been widely used in applications including, but not limited to, video stabilization [22], image deblurring [39], optical image stabilization (OIS) [26], simultaneous localization and mapping (SLAM) [12], ego-motion estimation [3], gesture-based user authentication on mobile devices [10], image alignment with OIS calibration [28], and human gait recognition [49]. The most commonly used gyroscopes are those in mobile phones, where the synchronization between the gyro readings and the video frames is important. Jia et al. [19] proposed a gyroscope calibration method to improve the synchronization. Bloesch et al. [3] fused optical flow and inertial measurements to deal with the drifting issue. In this work, we acquire gyroscope data from the bottom layer of the Android architecture, i.e., the Hardware Abstraction Layer (HAL), to achieve accurate synchronization.

2.2 Optical Flow

Our method is related to optical flow estimation. Traditional methods minimize the energy function between image pairs to compute an optical flow [35]. Recent deep approaches can be divided into supervised [7, 40, 43] and unsupervised methods [41, 36].

Supervised methods need to be trained with labeled ground truth. FlowNet [7] first proposed training a fully convolutional network on the synthetic FlyingChairs dataset. To deal with large displacements, SpyNet [40] introduced a coarse-to-fine pyramid network. PWC-Net [42], LiteFlowNet [13], and IRR-PWC [14] designed lightweight and efficient networks by warping features, computing cost volumes, and adopting iterative residual refinement with shared weights. Recently, RAFT [43] achieved state-of-the-art performance by constructing a pixel-level correlation volume and using a recurrent network to estimate the optical flow.

Unsupervised methods do not require ground-truth annotations. DSTFlow [41] and Back2Basic [18] are pioneers of unsupervised optical flow estimation. Several works [37, 33, 44, 32] focus on the occlusion problem via forward-backward occlusion checking, range-map occlusion checking, data distillation, and an augmentation regularization loss. Other methods improve image alignment for optical flow learning, including the census loss [37], multi-frame formulations [17], epipolar constraints [51], depth constraints [47], and feature similarity constraints [16]. UFlow [20] proposed a unified framework to systematically analyze and integrate different unsupervised components. Recently, UPFlow [36] proposed a neural upsampling module and a pyramid distillation loss to improve the upsampling and learning of the pyramid network, yielding state-of-the-art performance.

However, all of the above methods may fail in challenging scenes, such as dark, rainy, and foggy environments. Zheng et al. [50] proposed a data-driven method to learn optical flow from low-light images. Li et al. [30] proposed RainFlow to handle heavy rain. Yan et al. [46] proposed a semi-supervised network for dense foggy scenes. In this paper, we build our GyroFlow upon unsupervised components with the fusion of gyroscope data to cover both regular and challenging scenes.

2.3 Gyro-based Motion Estimation

Hwangbo et al. [15] proposed an inertial-aided KLT feature tracking method that deals with camera rolling and illumination change. Bloesch et al. [3] presented a method for fusing optical flow and inertial measurements for robust ego-motion estimation. Li et al. [29] proposed a gyro-aided optical flow estimation method to improve the performance under fast rotations. However, none of them took challenging scenes into account or used a neural network to fuse the gyroscope data for optical flow improvement. In this work, we propose a DNN-based solution that fuses gyroscope data into image-based optical flow to improve optical flow estimation.

3 Algorithm

Our method is built upon convolutional neural networks that take a gyro field $\mathbf{G}$ and two frames $I_a$, $I_b$ as input and estimate a forward optical flow $\mathbf{F}_{a\rightarrow b}$ that describes the motion of every pixel from $I_a$ towards $I_b$:

$$\mathbf{F}_{a\rightarrow b} = \mathcal{N}_{\theta}(I_a, I_b, \mathbf{G}), \tag{1}$$

where $\mathcal{N}$ is our network with parameters $\theta$.

Fig. 2 illustrates our pipeline. First, the gyro field is produced from the gyroscope readings between two consecutive frames $I_a$ and $I_b$ (Sec. 3.1); it is then concatenated with the two frames and fed into the network to produce an optical flow from $I_a$ to $I_b$. Our network consists of two stages. In the first stage, we extract feature pairs at different scales. In the second stage, we use the decoder and the self-guided fusion module (Sec. 3.2) to produce the optical flow in a coarse-to-fine manner.

Our decoder consists of feature warping [42], cost volume construction [42], cost volume normalization [20], self-guided upsampling [36], and parameter sharing [14]. In summary, the second pyramid decoding stage can be formulated as:

$$\mathbf{F}^{l} = \mathcal{D}\big(c_a^{l},\, c_b^{l},\, \mathrm{SGF}(\mathbf{G}^{l}, \mathbf{F}^{l+1})\big), \tag{2}$$

where $l$ denotes the pyramid level and $c_a^{l}$, $c_b^{l}$ are the features extracted from $I_a$ and $I_b$ at the $l$-th level. The output flow of the last layer is directly upsampled to the input resolution. In the following, we first describe how to convert the gyro readings into a gyro field in Sec. 3.1 and then introduce our SGF module in Sec. 3.2.
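To make the coarse-to-fine decoding in Eq. (2) concrete, the following PyTorch-style sketch shows one possible realization of the loop. The `encoder`, `sgf`, and `decoder` callables, the level count, and the `resize_flow` helper are placeholders rather than the authors' implementation, and the real decoder additionally performs the feature warping, cost-volume construction and normalization, self-guided upsampling, and parameter sharing cited above.

```python
import torch
import torch.nn.functional as F

def resize_flow(flow, size):
    """Bilinearly resize a flow field (N, 2, H, W) and rescale its magnitude."""
    h, w = flow.shape[-2:]
    new_h, new_w = size
    flow = F.interpolate(flow, size=size, mode='bilinear', align_corners=False)
    scale = flow.new_tensor([new_w / w, new_h / h]).view(1, 2, 1, 1)
    return flow * scale

def estimate_flow(encoder, sgf, decoder, img_a, img_b, gyro_field, num_levels=5):
    """Coarse-to-fine decoding in the spirit of Eq. (2): at every pyramid level,
    SGF fuses the (downscaled) gyro field with the current flow, and the decoder
    refines the fused flow."""
    feats_a, feats_b = encoder(img_a), encoder(img_b)     # feature lists, coarsest first
    flow = None
    for l in range(num_levels):
        c_a, c_b = feats_a[l], feats_b[l]
        gyro_l = resize_flow(gyro_field, c_a.shape[-2:])
        flow = torch.zeros_like(gyro_l) if flow is None \
            else resize_flow(flow, c_a.shape[-2:])        # upsample previous estimate
        fused = sgf(c_a, c_b, gyro_l, flow)               # Self-Guided Fusion (Sec. 3.2)
        flow = decoder(c_a, c_b, fused)
    return resize_flow(flow, img_a.shape[-2:])            # upsample to input resolution
```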

3.1 Gyro Field

Figure 3: The pipeline of generating the gyro field. For each frame pair, we record the timestamps of the first frame $I_a$ and the second frame $I_b$. The gyroscope readings between them are then used to compute an array of rotation matrices for the rolling-shutter camera. We convert the rotation array into a homography array that projects the pixels of the first image into the second, yielding the gyro field by subtracting the original pixel coordinates from the projected ones.

We obtain gyroscope readings from mobile phones, which are widely available and easy to access. For mobile phones, gyroscopes reflect camera rotations. We compute rotations by compounding gyroscope readings, which consist of 3-axis angular velocities and timestamps. In particular, compared with previous work [22, 25, 39] that reads gyro data through the API, we read it directly from the HAL of the Android architecture to avoid the non-trivial synchronization problem that is critical for gyro accuracy. Between frames $I_a$ and $I_b$, the rotation vector is computed following [22], and the rotation matrix is then produced using the Rodrigues formula [6].
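As an illustration of this step, the sketch below integrates raw gyroscope samples into a rotation vector over the inter-frame interval and converts it to a rotation matrix with the Rodrigues formula. The sample format (timestamp plus three angular velocities) and the piecewise-constant integration are assumptions for illustration; composing per-sample rotations is more accurate for large motions.

```python
import numpy as np

def integrate_gyro(samples, t_a, t_b):
    """Accumulate angular-velocity samples (t, wx, wy, wz) that fall between the
    timestamps of two frames into one rotation vector (axis * angle, radians)."""
    rotvec = np.zeros(3)
    for (t0, wx, wy, wz), (t1, _, _, _) in zip(samples[:-1], samples[1:]):
        lo, hi = max(t0, t_a), min(t1, t_b)
        if hi > lo:                                   # sample interval overlaps [t_a, t_b]
            rotvec += np.array([wx, wy, wz]) * (hi - lo)
    return rotvec

def rodrigues(rotvec):
    """Rodrigues formula: rotation vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(rotvec)
    if theta < 1e-12:
        return np.eye(3)
    k = rotvec / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)
```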

In the case of a global-shutter camera, e.g., a pinhole camera, a rotation-only homography can be computed as:

$$\mathbf{H}_{ab} = \mathbf{K}\,\mathbf{R}_{ab}\,\mathbf{K}^{-1}, \tag{3}$$

where $\mathbf{K}$ is the camera intrinsic matrix, and $\mathbf{R}_{ab}$ denotes the camera rotation from frame $I_a$ to frame $I_b$.
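A minimal numpy version of Eq. (3) is given below. The intrinsic values are purely illustrative, and depending on the chosen rotation convention the transpose of the rotation may be required.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def rotation_homography(K, R_ab):
    """Eq. (3): homography induced by a pure camera rotation (global-shutter case)."""
    return K @ R_ab @ np.linalg.inv(K)

# Toy usage with an illustrative pinhole intrinsic matrix (values not from the paper).
f, cx, cy = 1400.0, 960.0, 540.0
K = np.array([[f, 0.0, cx],
              [0.0, f, cy],
              [0.0, 0.0, 1.0]])
R_ab = Rotation.from_rotvec([0.0, 0.01, 0.0]).as_matrix()   # small rotation between frames
H_ab = rotation_homography(K, R_ab)
```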

For a rolling-shutter camera, which most mobile phones adopt, each scanline of the image is exposed at a slightly different time, as illustrated in Fig. 3. Therefore, Eq. (3) is no longer applicable, since every row of the image has a different orientation. In practice, it is unnecessary to assign a rotation matrix to each individual row: we group several consecutive rows into a row patch and assign each patch a rotation matrix. The number of row patches depends on the number of gyroscope readings per frame.

Here, the homography between the $i$-th row patch of frame $I_a$ and that of frame $I_b$ can be modeled as:

$$\mathbf{H}_{ab}^{(i)} = \mathbf{K}\,\mathbf{R}_{ab}^{(i)}\,\mathbf{K}^{-1}, \tag{4}$$

where $\mathbf{R}_{ab}^{(i)}$ is computed by accumulating the rotation matrices between the exposure times of the $i$-th row patch in $I_a$ and in $I_b$.
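The sketch below builds the per-row-patch homographies of Eq. (4). It assumes one rotation matrix has already been computed per gyro sample (e.g., with the `rodrigues` helper above); the accumulation simply composes the samples that fall between the exposure times of the corresponding row patches in the two frames.

```python
import numpy as np

def accumulate_rotations(rotations):
    """Compose per-sample rotation matrices into one rotation (applied in order)."""
    R = np.eye(3)
    for R_i in rotations:
        R = R_i @ R
    return R

def patch_homographies(K, per_patch_rotations):
    """Eq. (4): one rotation-only homography per row patch of a rolling-shutter frame.
    per_patch_rotations[i] lists the sample rotations between the exposure time of
    row patch i in frame a and the exposure time of the same patch in frame b."""
    K_inv = np.linalg.inv(K)
    return [K @ accumulate_rotations(rots) @ K_inv for rots in per_patch_rotations]
```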

In our implementation, we regroup the image rows into patches, which yields a homography array of horizontal homographies between two consecutive frames. To avoid discontinuities across row patches, we convert the array of rotation-only homographies into an array of quaternions [48] and apply spherical linear interpolation (SLERP) to interpolate the camera orientation smoothly, yielding a smooth homography array. As shown in Fig. 3, we use the homography array to transform every pixel $\mathbf{p}$ of $I_a$ to $\mathbf{p}'$ and subtract the original coordinates:

$$\mathbf{G}(\mathbf{p}) = \mathbf{p}' - \mathbf{p}, \qquad \mathbf{p}' \sim \mathbf{H}_{ab}^{(i)}\,\tilde{\mathbf{p}}, \tag{5}$$

where $\tilde{\mathbf{p}}$ is $\mathbf{p}$ in homogeneous coordinates and $i$ indexes the row patch containing $\mathbf{p}$. Computing this offset for every pixel produces the gyro field $\mathbf{G}$.
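Given the smoothed homography array, Eq. (5) can be evaluated as in the sketch below: each pixel of the first frame is mapped by the homography of its row patch, and the displacement is stored in the gyro field. The uniform split of rows into patches is an assumption for illustration.

```python
import numpy as np

def gyro_field_from_homographies(hs, height, width):
    """Eq. (5): map every pixel of frame a with the homography of its row patch,
    then subtract the original coordinates, giving a dense (dx, dy) gyro field."""
    field = np.zeros((height, width, 2), dtype=np.float32)
    rows_per_patch = int(np.ceil(height / len(hs)))
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).astype(np.float64)  # (H, W, 3)
    for i, H in enumerate(hs):
        r0 = i * rows_per_patch
        r1 = min(r0 + rows_per_patch, height)
        p = pix[r0:r1].reshape(-1, 3) @ H.T             # project the patch's pixels
        p = p[:, :2] / p[:, 2:3]                        # back to inhomogeneous coordinates
        offs = p - pix[r0:r1, :, :2].reshape(-1, 2)     # Eq. (5): p' - p
        field[r0:r1] = offs.reshape(r1 - r0, width, 2)
    return field
```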

3.2 Self-guided Fusion Module

As Fig. 1 illustrates, (a) is the input frame and (c) is the ground truth. Image-based unsupervised methods such as ARFlow [32] (Fig. 1 (b)) and UFlow [20] only roughly recover the motions of moving objects: because they count on image content for registration, they are prone to errors in challenging scenes, such as textureless scenarios, dense foggy environments [46], dark scenes [50], and rainy scenes [30]. To combine the advantages of the gyro field and the image-based optical flow, we propose a self-guided fusion module (SGF). As seen in Fig. 1 (d), with the help of the gyro field, our result is much better than that of the image-based baseline.

Figure 4: Illustration of our self-guided fusion module (SGF). For a specific layer $l$, we use two separate blocks to independently produce the fusion map $\mathbf{M}^{l}$ and the fusion flow $\mathbf{F}_{f}^{l}$, and then generate the output by Eq. (6).

The architecture of our SGF is shown in Fig. 4. The input features of images $I_a$ and $I_b$ at the $l$-th layer are denoted $c_a^{l}$ and $c_b^{l}$. $c_b^{l}$ is warped by the gyro field $\mathbf{G}^{l}$, which acts as the forward flow from $c_a^{l}$ to $c_b^{l}$. The warped feature is then concatenated with $c_a^{l}$ and fed to the map block, yielding a fusion map $\mathbf{M}^{l}$ that ranges from $0$ to $1$. Note that, in $\mathbf{M}^{l}$, background regions that can be aligned by the gyro field are close to zero, while the remaining areas receive different weights. Next, we feed the gyro field $\mathbf{G}^{l}$ and the optical flow $\mathbf{F}^{l}$ to the fusion block, which computes a fusion flow $\mathbf{F}_{f}^{l}$. Finally, we fuse $\mathbf{G}^{l}$ and $\mathbf{F}_{f}^{l}$ with $\mathbf{M}^{l}$ to guide the network to focus on the moving foreground regions. The process can be described as:

$$\mathbf{F}_{o}^{l} = \mathbf{G}^{l} \odot (1 - \mathbf{M}^{l}) + \mathbf{F}_{f}^{l} \odot \mathbf{M}^{l}, \tag{6}$$

where $\mathbf{F}_{o}^{l}$ is the output of our SGF module and $\odot$ denotes element-wise multiplication.

Specifically, to produce the fusion map $\mathbf{M}^{l}$ and the fusion flow $\mathbf{F}_{f}^{l}$, we use two small dense blocks of convolutional layers, i.e., the map block and the fusion block. The map block outputs a single-channel tensor that is passed through a sigmoid layer to obtain $\mathbf{M}^{l}$, while the fusion block outputs a two-channel tensor representing $\mathbf{F}_{f}^{l}$.
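A minimal PyTorch sketch of SGF is shown below. The plain convolutional stacks and their channel widths are placeholders for the paper's dense blocks, and the final fusion follows Eq. (6) as reconstructed above, i.e., the map selects the gyro field in background regions and the fusion flow elsewhere.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def backward_warp(feat, flow):
    """Sample `feat` at positions displaced by `flow` (N, 2, H, W)."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    base = torch.stack([xs, ys], dim=0).float().to(feat.device)       # (2, H, W)
    coords = base.unsqueeze(0) + flow                                  # (N, 2, H, W)
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

def conv_stack(channels):
    """Plain 3x3 conv stack; stands in for the paper's small dense blocks."""
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, 3, padding=1), nn.LeakyReLU(0.1)]
    return nn.Sequential(*layers[:-1])        # no activation after the last conv

class SGF(nn.Module):
    def __init__(self, feat_ch):
        super().__init__()
        self.map_block = conv_stack([2 * feat_ch, 64, 32, 16, 1])   # -> 1-channel map
        self.fuse_block = conv_stack([4, 32, 32, 16, 2])            # -> 2-channel flow

    def forward(self, c_a, c_b, gyro, flow):
        warped_b = backward_warp(c_b, gyro)       # align feature b to a with the gyro field
        m = torch.sigmoid(self.map_block(torch.cat([c_a, warped_b], dim=1)))
        f_fuse = self.fuse_block(torch.cat([gyro, flow], dim=1))
        # Eq. (6): background (m ~ 0) keeps the gyro field, foreground keeps the fusion flow
        return gyro * (1.0 - m) + f_fuse * m
```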

4 Experimental Results

4.1 Dataset

The representative datasets for optical flow estimation and evaluation include FlyingChairs [7], MPI-Sintel [4], KITTI 2012 [8], and KITTI 2015 [38]. On the gyro side, a dedicated dataset with gyroscope readings, named GF4 [28], has been proposed for homography estimation. However, none of them combines accurate gyroscope readings with image content for optical flow evaluation. Therefore, we propose a new dataset and benchmark: GOF.

Training Set Similar to GF4 [28], a set of videos with gyroscope readings is recorded using a cellphone. Unlike GF4, which uses a phone with an OIS camera, we deliberately choose a non-OIS phone to eliminate the effect of the OIS module. We collect videos in different environments, including regular scenes (RE), low-light scenes (Dark), foggy scenes (Fog), and rainy scenes (Rain). For each scenario, we record a set of videos of equal length, yielding the same number of frames per environment; together these frames form the training set.

Figure 5: A glance at our evaluation set. It is divided into 4 categories: regular scenes (RE), low-light scenes (Dark), foggy scenes (Fog), and rainy scenes (Rain). Each category contains an equal number of image pairs, and every pair comes with synchronized gyroscope readings.
Figure 6: One labeling example on KITTI 2015. Compared to RAFT [43] (second row), our labeled flow (first row) produces a considerably lower EPE. From the error maps, our labeled optical flow is much more accurate than RAFT.

Evaluation Set For the evaluation, as for the training set, we capture videos in the same four scene categories to compare with image-based registration methods. Each category contributes an equal number of image pairs, which together form the evaluation set. Fig. 5 shows some examples.

Method RE Dark Fog Rain Avg
1) No alignment 4.962(+457.53%) 3.278(+228.13%) 7.358(+643.23%) 5.567(+425.68%) 5.665(+481.03%)
2) Gyro Field 2.583(+190.22%) 0.999(+0.00%) 1.279(+29.19%) 1.703(+60.81%) 1.922(+97.13%)
3) DIS [24] 2.374(+166.74%) 2.442(+144.44%) 4.677(+372.42%) 3.004(+183.66%) 3.399(+248.62%)
4) DeepFlow [45] - Sintel[4] 3.521(+295.62%) 3.425(+242.84%) 3.029(+205.96%) 11.812(+1015.39%) 4.858(+398.26%)
5) FlowNet2 [7] - Sintel[4] 11.140(+1151.69%) 44.641(+4368.57%) 2.633(+165.96%) 5.767(+444.57%) 6.701(+587.28%)
6) IRRPWC [14] - FlyingChairs [7] 12.487(+1303.03%) 69.864(+6893.39%) 1.916(+93.54%) 9.799(+825.31%) 8.234(+744.51%)
7) SelFlow[34] - Sintel [4] 4.186(+370.34%) 2.747(+174.97%) 7.307(+638.08%) 4.787(+352.03%) 5.626(+477.03%)
8) RAFT [43] - FlyingChairs[7] 1.246(+40.00%) 1.297(+29.83%) 1.136(+14.75%) 1.187(+12.09%) 1.349(+38.36%)
9) DDFlow [33] - GOF 2.273(+155.39%) 2.843(+184.58%) 3.070(+210.10%) 2.422(+128.71%) 2.527(+159.18%)
10) UnFlow [37] - GOF 1.120(+25.84%) 1.671(+67.17%) 0.990(+0%) 1.343(+26.53%) 1.221(+25.13%)
11) ARFlow [32] - GOF 0.972(+9.21%) 1.205(+20.62%) 1.186(+19.80%) 1.093(+3.21%) 1.035(+6.15%)
12) UFlow [20] - GOF 0.890(+0.00%) 1.641(+64.26%) 0.994(+0.40%) 1.059(+0.00%) 0.975(+0.00%)
13) DarkFlow [50] - FlyingChairs[7] 4.127(+363.71%) 4.346(+335.04%) 7.316(+638.99%) 4.891(+361.85%) 5.758(+490.56%)
14) RainFlow [30] - - - - -
15) FogFlow [46] - - - - -
16) Ours 0.742(-16.63%) 0.902(-9.71%) 0.658(-33.54%) 0.730(-31.07%) 0.717(-26.46%)
Table 1: Quantitative comparisons on the evaluation dataset. We mark the best performance in red and the second-best in blue. The percentage in brackets indicates the improvement over the second-best result. We use '-' to indicate which dataset the model is trained on.

For quantitative evaluation, a ground-truth optical flow is required for each pair. However, labeling ground-truth flow is non-trivial; as far as we know, no dedicated tool is available for this task. Following [30, 46], we adopt the closely related approach of [31] to label the ground-truth flow, which requires considerable manual effort per image, especially for challenging scenes. We first label a set of examples containing rigid objects, then keep those with good visual alignment quality and discard the others. Finally, we refine the selected samples with detailed corrections around the motion boundaries.

To verify the quality of our labeling procedure, we label several samples from KITTI 2012 [8]. Given the ground truth, we compare our labeled optical flow with the results of the state-of-the-art supervised method RAFT [43] pre-trained on FlyingChairs. Our labeled flow yields a much lower endpoint error (EPE); RAFT's EPE is several times larger than ours. Fig. 6 shows one example: from the error map, our labeled flow is clearly more accurate than the current SOTA method. We therefore use this approach to generate the ground truth for our evaluations.

4.2 Implementation Details

We conduct experiments on the GOF dataset. Our method is built upon PWC-Net [42]. In the first stage, we train our model without the occlusion mask and the census loss [37]. In the second stage, we enable the bidirectional occlusion mask [37], the census loss [37], and the spatial transform [32] to fine-tune the model.
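For illustration, the sketch below mirrors this two-stage objective: stage one uses only a photometric term, and stage two masks occluded pixels with a forward-backward consistency check. The helper signatures and the threshold values are assumptions in the spirit of UnFlow [37], not the exact losses used in the paper, and the census and spatial-transform terms are omitted.

```python
import torch

def photometric_loss(img_a, img_b_warped, mask=None):
    """Mean absolute photometric difference, optionally restricted to visible pixels."""
    diff = (img_a - img_b_warped).abs().mean(dim=1, keepdim=True)
    if mask is None:
        return diff.mean()
    return (diff * mask).sum() / (mask.sum() + 1e-6)

def fb_occlusion_mask(flow_fw, flow_bw_warped, alpha1=0.01, alpha2=0.5):
    """Forward-backward check: the forward flow and the back-warped backward flow
    should roughly cancel at non-occluded pixels; returns 1 for visible pixels."""
    sq_diff = ((flow_fw + flow_bw_warped) ** 2).sum(dim=1, keepdim=True)
    sq_mag = (flow_fw ** 2).sum(dim=1, keepdim=True) + \
             (flow_bw_warped ** 2).sum(dim=1, keepdim=True)
    return (sq_diff < alpha1 * sq_mag + alpha2).float()

def stage_loss(stage, img_a, img_b_warped, flow_fw=None, flow_bw_warped=None):
    """Stage 1: plain photometric loss. Stage 2: occlusion-masked photometric loss
    (a census loss and a spatial-transform term would be added on top)."""
    if stage == 1:
        return photometric_loss(img_a, img_b_warped)
    mask = fb_occlusion_mask(flow_fw, flow_bw_warped)
    return photometric_loss(img_a, img_b_warped, mask)
```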

We collect videos with gyroscope readings using a Qualcomm QRD device equipped with a Snapdragon 7150. During training we apply random cropping, random horizontal flips, and random weather modification (adding fog and rain [21]). We report the endpoint error (EPE) on the evaluation set. The implementation is in PyTorch, and one NVIDIA RTX 2080 Ti is used to train the network with the Adam optimizer [23]. The entire training process takes several days.
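The augmentation described above could be assembled with imgaug [21] roughly as follows; the crop size and probabilities are illustrative assumptions, and this sketch assumes an imgaug version that provides the Fog and Rain weather augmenters.

```python
import imgaug.augmenters as iaa

# Illustrative pipeline: random crop, random horizontal flip, random weather.
augmenter = iaa.Sequential([
    iaa.CropToFixedSize(width=640, height=448),
    iaa.Fliplr(0.5),
    iaa.Sometimes(0.3, iaa.OneOf([iaa.Fog(), iaa.Rain()])),
])

def augment_pair(img_a, img_b):
    """Apply the same randomly-sampled augmentation to both frames of a pair.
    Note: geometric augmentations must be applied consistently to the gyro field."""
    det = augmenter.to_deterministic()
    return det(image=img_a), det(image=img_b)
```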

Figure 7: Visual comparison of our method with the gyro field, ARFlow [32], and the state-of-the-art model UFlow [20] on the GOF evaluation set. For the challenging cases, our method achieves convincing results by fusing the background motions from the gyro field with the motion details from the optical flow. For the last example, a regular scenario, fusing the gyro field also helps the learning of optical flow: the network produces accurate and sharp flow around object boundaries.

4.3 Comparisons with Image-based Methods

In this section, we compare our method with traditional, supervised, and unsupervised methods on the GOF evaluation set with quantitative (Sec. 4.3.1) and qualitative comparisons (Sec. 4.3.2). To validate the effectiveness of key components, we conduct an ablation study in Sec. 4.4.

4.3.1 Quantitative Comparisons

In Table 1, the best results are marked in red and the second-best results in blue. The percentage in brackets indicates the improvement over the second-best result; therefore, the percentages of the best results are negative, those of the second-best results are zero, and the rest are positive. 'No alignment' refers to evaluating without any alignment, and 'Gyro Field' refers to alignment with pure gyro data.
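For reference, the metric and the bracketed percentages in Table 1 can be reproduced with a few lines of numpy; this helper is an illustration, not the authors' evaluation code.

```python
import numpy as np

def epe(flow_pred, flow_gt):
    """Average endpoint error: mean Euclidean distance between flow vectors (H, W, 2)."""
    return float(np.linalg.norm(flow_pred - flow_gt, axis=-1).mean())

def relative_to_second_best(epes):
    """Percentages as in Table 1: each EPE relative to the second-best EPE of the
    column, so the best entry is negative and the second-best is exactly 0.00%."""
    second_best = sorted(epes)[1]
    return [100.0 * (e - second_best) / second_best for e in epes]

# e.g. for the RE column: relative_to_second_best([0.742, 0.890, 1.246])
#      -> [-16.63..., 0.0, 40.0], matching rows 16, 12, and 8 of Table 1.
```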

For traditional methods, we compare our GyroFlow with DIS [24] and DeepFlow [45] pre-trained on Sintel [4] (Table 1, rows 3 and 4). Their average EPEs are several times larger than ours. In particular, DIS fails in foggy scenarios, and DeepFlow breaks down in rainy scenes. We also tried to implement the method of [29], but the results are unreasonable, so we do not report them.

Next, we compare with deep supervised optical flow methods, including FlowNet2 [7], IRRPWC [14], SelFlow [34], and the recent state-of-the-art method RAFT [43] (Table 1, rows 5 to 8). Due to the lack of ground-truth labels, we cannot fine-tune these methods on our training set; for each method, we therefore test several pre-trained models on the evaluation set and report only the best results. RAFT pre-trained on FlyingChairs [7] performs best among them, but it is still not as good as ours.

We also compare our method with deep unsupervised optical flow methods, including DDFlow [33], UnFlow [37], ARFlow [32], and UFlow [20] (Table 1, rows 9 to 12). Here, we fine-tune the models on our training set. UFlow achieves the second-best result in three columns, but it is still not comparable with ours due to its unstable performance in challenging scenes.

As discussed in Sec. 2, RainFlow [30] is designed to estimate optical flow in rainy scenes, FogFlow [46] targets dense foggy environments, and DarkFlow [50] computes flows in low-light scenarios. We compare with these methods in rows 13 to 15 of Table 1. Note that none of these methods is open source. For DarkFlow [50], the authors do not provide source code but offer a model pre-trained on FlyingChairs. For the other two methods [30, 46], we received no replies from the authors; we tried to implement them, but the results were not satisfactory due to the uncertainty of some implementation details, so we leave these entries empty in Table 1.

We find that our GyroFlow model is robust in all scenes and achieves an average EPE of 0.717, a 26.46% improvement over the second-best method. Notably, for the 'Dark' scenes that consist of poor image textures, the 'Gyro Field' alone achieves the second-best performance, indicating the importance of incorporating gyro motions, especially when the image content is not reliable.

4.3.2 Qualitative Comparisons

In Fig. 7, we illustrate qualitative results on the evaluation set. We choose one example for each of the four scene types: the low-light scene (Dark), the foggy scene (Fog), the rainy scene (Rain), and the regular scene (RE). As comparison methods, we choose the gyro field and recent unsupervised methods, ARFlow [32] and UFlow [20], both fine-tuned on our training set. In Fig. 7, we show the optical flow along with the corresponding error maps and report the EPE for each example. As shown, for the challenging cases our method fuses the background motions from the gyro field with the motions of dynamic objects from the image-based optical flow, delivering both better visual quality and lower EPE.

The unsupervised optical flow methods [33, 32, 20] are expected to work well in RE scenes given sufficient textures. However, even in the RE category our method outperforms the others, especially at motion boundaries. With the help of the gyro field, which accounts for the global motion, the network can focus on the challenging regions. As a result, our method still achieves better visual quality and lower EPE in RE scenarios.

4.4 Ablation Studies

To evaluate the effectiveness of each module design, we conduct ablation experiments on the evaluation set. EPE is reported for the four categories, including Dark, Fog, Rain, and RE, along with the average error.

4.4.1 The Design of SGF

Method RE Dark Fog Rain Avg
DWI 3.77 3.15 5.59 4.24 4.38
DPGF 0.95 1.67 1.32 0.89 0.98
SGF-Fuse 0.72 0.99 0.99 0.94 0.80
SGF-Map 1.07 1.02 1.19 0.70 0.90
SGF-Dense 0.77 1.69 0.87 1.00 0.89
GyroFlow without SGF 0.79 1.71 1.35 1.06 0.95
Our SGF 0.74 0.90 0.66 0.73 0.72
Table 2: Comparison with alternative designs of the SGF module.
Figure 8: Visual example of our self-guided fusion module (SGF). Results of UFlow and UFlow with SGF are shown. The fusion map is used to guide the network to focus on motion details.

For SGF, we test several designs and report the results in Table 2. First, two straightforward alternatives are used to build the module. DWI means that we directly warp one input frame with the gyro field and then feed the warped image together with the other frame into the network to produce a residual optical flow. DPGF means that, at each pyramid layer, we directly add the gyro field onto the optical flow. As shown in Table 2, DWI performs poorly: besides the lack of gyroscope guidance during training, another likely reason is that the warping operation breaks the image structure by introducing blur and noise. DPGF obtains a better result but is still not comparable to our SGF design, because the gyro field registers background motions that should not simply be added to dynamic object motions. Furthermore, we compare our SGF with three variants: (1) SGF-Fuse, where we remove the map block and the final fusion step; although it achieves a low average EPE, it is unstable in challenging scenes. (2) SGF-Map, where the fusion block is removed; it results in worse performance because the fusion map tends to be inaccurate, except for the rainy scene. (3) SGF-Dense, where we integrate the two blocks into one unified dense block that produces a 3-channel tensor, of which the first two channels represent the fusion flow and the last channel denotes the fusion map. On average, our SGF is clearly better than all of these alternatives.

Model RE Dark Fog Rain Avg
UnFlow [37] 1.12 1.67 0.99 1.34 1.22
UnFlow [37] + SGF 0.83 1.33 0.94 0.94 0.90
ARFlow [32] 0.97 1.21 1.19 1.09 1.04
ARFlow [32] + SGF 0.77 1.54 0.85 0.94 0.86
UFlow [20] 0.89 1.64 0.99 1.06 0.98
UFlow [20] + SGF 0.89 0.95 0.71 0.78 0.80
Our baseline 0.79 1.71 1.35 1.06 0.95
Ours 0.74 0.90 0.66 0.73 0.72
Table 3: Comparison with unsupervised methods when equipped with our SGF module.

4.4.2 Unsupervised Methods with SGF.

We insert the SGF module into the unsupervised methods [37, 32, 20]; the baseline denotes our GyroFlow without SGF. In particular, similar to Fig. 2, we add SGF before the decoder at each pyramid layer. The unsupervised methods are trained on our dataset, and we report EPE in Table 3. After inserting our SGF module into these models, noticeable improvements can be observed both quantitatively (Table 3) and qualitatively (Fig. 8), which demonstrates the effectiveness of the proposed SGF module. Fig. 8 shows an example: both background motions and boundary motions are improved after integrating our SGF.

4.4.3 Gyro Field Fusion Layer

Intuitively, it is also possible to fuse the gyro field only once, so we add our SGF module to a single pyramid layer. As illustrated in Table 4, the finer the layer at which SGF is added, the lower the resulting EPE. The best results are obtained only when the gyro field is added at all layers.

Pyramid Layer RE Dark Fog Rain Avg
Baseline 0.79 1.71 1.35 1.06 0.95
Coarsest resolution only 1.03 1.04 0.85 1.03 0.95
Second-coarsest resolution only 0.94 0.98 0.95 0.93 0.92
Second-finest resolution only 0.89 1.17 1.19 0.87 0.89
Finest resolution only 0.81 1.13 0.94 0.91 0.87
All resolutions 0.74 0.90 0.66 0.73 0.72
Table 4: Adding the gyro field to different pyramid layers. Baseline indicates GyroFlow without SGF.

5 Conclusion

We have presented GyroFlow, a novel framework for unsupervised optical flow learning that fuses gyroscope data. We have proposed a self-guided fusion module to fuse the gyro field and the optical flow. For evaluation, we have built the GOF dataset and labeled ground-truth optical flow for quantitative comparison. The results show that our method achieves state-of-the-art performance in both regular and challenging categories compared to existing methods. Our source code and dataset will be publicly released to facilitate and inspire further research.

References

  • [2] Aseem Behl, Omid Hosseini Jafari, Siva Karthik Mustikovela, Hassan Abu Alhaija, Carsten Rother, and Andreas Geiger. Bounding boxes, segmentations and object coordinates: How important is recognition for 3d scene flow estimation in autonomous driving scenarios? In Proceedings of the IEEE International Conference on Computer Vision, pages 2574–2583, 2017.
  • [3] Michael Bloesch, Sammy Omari, Péter Fankhauser, Hannes Sommer, Christian Gehring, Jemin Hwangbo, Mark A Hoepflinger, Marco Hutter, and Roland Siegwart. Fusion of optical flow and inertial measurements for robust egomotion estimation. In 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 3102–3107. IEEE, 2014.
  • [4] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, pages 611–625. Springer, 2012.
  • [5] Jason Campbell, Rahul Sukthankar, and Illah Nourbakhsh. Techniques for evaluating optical flow for visual odometry in extreme terrain. In 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)(IEEE Cat. No. 04CH37566), volume 4, pages 3704–3711. IEEE, 2004.
  • [6] Jian S Dai. Euler–rodrigues formula variations, quaternion conjugation and intrinsic connections. Mechanism and Machine Theory, 92:144–152, 2015.
  • [7] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 2758–2766, 2015.
  • [8] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361. IEEE, 2012.
  • [9] Hari Prabhat Gupta, Haresh S Chudgar, Siddhartha Mukherjee, Tanima Dutta, and Kulwant Sharma. A continuous hand gestures recognition technique for human-machine interaction using accelerometer and gyroscope sensors. IEEE Sensors Journal, 16(16):6425–6432, 2016.
  • [10] Dennis Guse and Benjamin Müller. Gesture-based user authentication on mobile devices using accelerometer and gyroscope. In Informatiktage, pages 243–246, 2012.
  • [11] Berthold KP Horn and Brian G Schunck. Determining optical flow. Artificial intelligence, 17(1-3):185–203, 1981.
  • [12] Weibo Huang and Hong Liu. Online initialization and automatic camera-imu extrinsic calibration for monocular visual-inertial slam. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5182–5189. IEEE, 2018.
  • [13] Tak-Wai Hui, Xiaoou Tang, and Chen Change Loy. Liteflownet: A lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8981–8989, 2018.
  • [14] Junhwa Hur and Stefan Roth. Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5754–5763, 2019.
  • [15] Myung Hwangbo, Jun-Sik Kim, and Takeo Kanade. Inertial-aided klt feature tracking for a moving camera. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1909–1916. IEEE, 2009.
  • [16] Woobin Im, Tae-Kyun Kim, and Sung-Eui Yoon. Unsupervised learning of optical flow with deep feature similarity. In European Conference on Computer Vision, pages 172–188. Springer, 2020.
  • [17] Joel Janai, Fatma Guney, Anurag Ranjan, Michael Black, and Andreas Geiger. Unsupervised learning of multi-frame optical flow with occlusions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 690–706, 2018.
  • [18] J Yu Jason, Adam W Harley, and Konstantinos G Derpanis. Back to basics: Unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pages 3–10. Springer, 2016.
  • [19] Chao Jia and Brian L Evans. Online calibration and synchronization of cellphone camera and gyroscope. In 2013 IEEE Global Conference on Signal and Information Processing, pages 731–734. IEEE, 2013.
  • [20] Rico Jonschkowski, Austin Stone, Jonathan T Barron, Ariel Gordon, Kurt Konolige, and Anelia Angelova. What matters in unsupervised optical flow. arXiv preprint arXiv:2006.04902, 1(2):3, 2020.
  • [21] Alexander B. Jung, Kentaro Wada, Jon Crall, Satoshi Tanaka, Jake Graving, Christoph Reinders, Sarthak Yadav, Joy Banerjee, Gábor Vecsei, Adam Kraft, Zheng Rui, Jirka Borovec, Christian Vallentin, Semen Zhydenko, Kilian Pfeiffer, Ben Cook, Ismael Fernández, François-Michel De Rainville, Chi-Hung Weng, Abner Ayala-Acevedo, Raphael Meudec, Matias Laporte, et al. imgaug. https://github.com/aleju/imgaug, 2020. Online; accessed 01-Feb-2020.
  • [22] Alexandre Karpenko, David Jacobs, Jongmin Baek, and Marc Levoy. Digital video stabilization and rolling shutter correction using gyroscopes. CSTR, 1(2):13, 2011.
  • [23] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [24] Till Kroeger, Radu Timofte, Dengxin Dai, and Luc Van Gool. Fast optical flow using dense inverse search. In European Conference on Computer Vision, pages 471–488. Springer, 2016.
  • [25] László Kundra and Péter Ekler. Bias compensation of gyroscopes in mobiles with optical flow. Aasri Procedia, 9:152–157, 2014.
  • [26] Fabrizio La Rosa, Maria Celvisia Virzì, Filippo Bonaccorso, and Marco Branciforte. Optical image stabilization (ois). STMicroelectronics. Available online: http://www. st. com/resource/en/white_paper/ois_white_paper. pdf (accessed on 12 October 2017), 2015.
  • [27] Robert Patton Leland. Adaptive control of a mems gyroscope using lyapunov methods. IEEE Transactions on Control Systems Technology, 14(2):278–283, 2006.
  • [28] Haipeng Li, Shuaicheng Liu, and Jue Wang. Deepois: Gyroscope-guided deep optical image stabilizer compensation. arXiv preprint arXiv:2101.11183, 2021.
  • [29] Ping Li and Hongliang Ren. An efficient gyro-aided optical flow estimation in fast rotations with auto-calibration. IEEE Sensors Journal, 18(8):3391–3399, 2018.
  • [30] Ruoteng Li, Robby T Tan, Loong-Fah Cheong, Angelica I Aviles-Rivero, Qingnan Fan, and Carola-Bibiane Schonlieb. Rainflow: Optical flow under rain streaks and rain veiling effect. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7304–7313, 2019.
  • [31] Ce Liu, William T Freeman, Edward H Adelson, and Yair Weiss. Human-assisted motion annotation. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [32] Liang Liu, Jiangning Zhang, Ruifei He, Yong Liu, Yabiao Wang, Ying Tai, Donghao Luo, Chengjie Wang, Jilin Li, and Feiyue Huang. Learning by analogy: Reliable supervision from transformations for unsupervised optical flow estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6489–6498, 2020.
  • [33] Pengpeng Liu, Irwin King, Michael R Lyu, and Jia Xu. Ddflow: Learning optical flow with unlabeled data distillation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8770–8777, 2019.
  • [34] Pengpeng Liu, Michael Lyu, Irwin King, and Jia Xu. Selflow: Self-supervised learning of optical flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4571–4580, 2019.
  • [35] Bruce D Lucas, Takeo Kanade, et al. An iterative image registration technique with an application to stereo vision. Vancouver, British Columbia, 1981.
  • [36] Kunming Luo, Chuan Wang, Shuaicheng Liu, Haoqiang Fan, Jue Wang, and Jian Sun. Upflow: Upsampling pyramid for unsupervised optical flow learning. arXiv preprint arXiv:2012.00212, 2020.
  • [37] Simon Meister, Junhwa Hur, and Stefan Roth. Unflow: Unsupervised learning of optical flow with a bidirectional census loss. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [38] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3061–3070, 2015.
  • [39] Janne Mustaniemi, Juho Kannala, Simo Särkkä, Jiri Matas, and Janne Heikkila. Gyroscope-aided motion deblurring with deep networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1914–1922. IEEE, 2019.
  • [40] Anurag Ranjan and Michael J Black. Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4161–4170, 2017.
  • [41] Zhe Ren, Junchi Yan, Bingbing Ni, Bin Liu, Xiaokang Yang, and Hongyuan Zha. Unsupervised deep learning for optical flow estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • [42] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.
  • [43] Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, pages 402–419. Springer, 2020.
  • [44] Yang Wang, Yi Yang, Zhenheng Yang, Liang Zhao, Peng Wang, and Wei Xu. Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4884–4893, 2018.
  • [45] Philippe Weinzaepfel, Jerome Revaud, Zaid Harchaoui, and Cordelia Schmid. Deepflow: Large displacement optical flow with deep matching. In Proceedings of the IEEE international conference on computer vision, pages 1385–1392, 2013.
  • [46] Wending Yan, Aashish Sharma, and Robby T Tan. Optical flow in dense foggy scenes using semi-supervised learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13259–13268, 2020.
  • [47] Zhichao Yin and Jianping Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1983–1992, 2018.
  • [48] Fuzhen Zhang. Quaternions and matrices of quaternions. Linear algebra and its applications, 251:21–57, 1997.
  • [49] Rong Zhang, Christian Vogler, and Dimitris Metaxas. Human gait recognition. In 2004 Conference on Computer Vision and Pattern Recognition Workshop, pages 18–18. IEEE, 2004.
  • [50] Yinqiang Zheng, Mingfang Zhang, and Feng Lu. Optical flow in the dark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6749–6757, 2020.
  • [51] Yiran Zhong, Pan Ji, Jianyuan Wang, Yuchao Dai, and Hongdong Li. Unsupervised deep epipolar flow for stationary or dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12095–12104, 2019.