VOLDOR: Visual Odometry from Log-logistic Dense Optical flow Residuals

04/14/2021, by Zhixiang Min, et al., Stevens Institute of Technology

We propose a dense indirect visual odometry method taking as input externally estimated optical flow fields instead of hand-crafted feature correspondences. We define our problem as a probabilistic model and develop a generalized-EM formulation for the joint inference of camera motion, pixel depth, and motion-track confidence. Contrary to traditional methods assuming Gaussian-distributed observation errors, we supervise our inference framework under an (empirically validated) adaptive log-logistic distribution model. Moreover, the log-logistic residual model generalizes well to different state-of-the-art optical flow methods, making our approach modular and agnostic to the choice of optical flow estimators. Our method achieved top-ranking results on both TUM RGB-D and KITTI odometry benchmarks. Our open-sourced implementation is inherently GPU-friendly with only linear computational and storage growth.

1 Introduction

Visual odometry (VO) [59, 21, 22] addresses the recovery of camera poses from an input video sequence, which supports applications such as augmented reality, robotics and autonomous driving. Traditional indirect VO methods [47, 64, 57] rely on the geometric analysis of sparse keypoint correspondences to determine multi-view relationships among the input video frames. By virtue of relying on local feature detection and correspondence pre-processing modules, indirect methods pose the VO problem as a reprojection-error minimization task. Conversely, direct methods [58, 16, 45] strive to jointly determine a (semi-)dense registration (warping) across images, as well as the parameters of a camera motion model. By virtue of evaluating a dense correspondence field, direct methods strive to minimize the photometric error among registered images. While both of these contrasting approaches have been successful in practice, important limitations remain to be addressed. An open problem within indirect methods is how to characterize feature localization error within the context of VO [41, 39, 40, 92], where motion blur, depth occlusions and viewpoint variations may corrupt such estimates. Nevertheless, least squares methods are commonly used under the assumption of zero-mean Gaussian-distributed observation errors. On the other hand, the efficacy of direct methods relies on strict adherence to the small-motion and appearance-constancy assumptions (or on the development of registration models robust to such variations), which speaks to the difficulty of adequately modeling data variability in this context and, in turn, reduces the scope of their applicability.

Figure 1: VOLDOR probabilistic graphical model. Optical flow field sequences are modeled as observed variables subject to Fisk-distributed measurement errors. Camera poses, depth map and rigidness maps are modeled as hidden variables.

Recent developments in optical flow estimation using supervised learning [76, 35] have yielded state-of-the-art performance. However, such performance benefits have not yet permeated to pose-estimation tasks, where standard multi-view geometry methods still provide the "gold standard". This work develops a dense indirect framework for monocular VO that takes as input externally computed optical flow from supervised-learning estimators. We have empirically observed that optical flow residuals tend to conform to a log-logistic (i.e. Fisk) distribution model parametrized by the optical flow magnitude. We leverage this insight to propose a probabilistic framework that fuses a dense optical flow sequence and jointly estimates camera motion, pixel depth, and motion-track confidence through a generalized-EM formulation. Our approach is dense in the sense that each pixel corresponds to an instance of our estimated random variables; it is indirect in the sense that we treat individual pixels as viewing rays within minimal feature-based multi-view geometry models (i.e. P3P for camera pose, 3D triangulation for pixel depth) and implicitly optimize for reprojection error. Starting from a deterministic bootstrap of a camera pose and pixel depths attained from the optical flow inputs, we iteratively alternate the inference of depth, pose and track confidence over a batch of consecutive images.

The advantages of our framework include: 1) We present a modular framework that is agnostic to the optical flow estimation engine, which allows us to take full advantage of recent deep-learning optical flow methods. Moreover, by replacing sparse hand-crafted feature inputs with learned dense optical flow, we gain surface information for poorly textured (i.e. feature-less) regions. 2) By leveraging our empirically validated log-logistic residual model, we attain highly accurate probabilistic estimates of scene depth and camera motion that do not rely on Gaussian error assumptions. Experiments on the KITTI [25] and TUM RGB-D [75] benchmarks yield top-ranking performance on both the visual odometry and depth estimation tasks. Our highly parallelizable approach also allows real-time application on commodity GPU-based architectures.

2 Related Work

Indirect methods. Indirect methods [91, 32, 60, 20, 13, 29] rely on geometric analysis of sparse keypoint correspondences among input video frames, and pose the VO problem as a reprojection-error minimization task. VISO [47] employs a Kalman filter with RANSAC-based outlier rejection to robustly estimate frame-to-frame motion. PTAM [64] splits tracking and mapping into separate threads, and applies expensive bundle adjustment (BA) at the back-end to achieve better accuracy while retaining real-time operation. ORB-SLAM [56, 57] further introduces a versatile SLAM system with a more powerful back-end featuring global relocalization and loop closing, enabling operation in large environments.

Direct methods. Direct methods [61, 71, 97, 46, 45, 44, 70, 85] maintain a (semi-)dense model and estimate the camera motion by finding a warping that minimizes the photometric error w.r.t. the video frames. DTAM [58] introduces a GPU-based real-time dense modelling and tracking approach for small workspaces. LSD-SLAM [16] switches to a semi-dense model that allows large-scale real-time operation on a CPU. DSO [15] builds a sparse model and combines a probabilistic formulation that jointly optimizes all parameters with a full photometric calibration, achieving state-of-the-art accuracy.

Deep learning VO. Recently, deep learning has shown rapid progress on the visual odometry problem. Boosting VO through geometric priors from learning-based depth predictions has been presented in [88, 52, 78]. Integration of deep representations into components such as feature points, depth maps and optimizers has been presented in [77, 5, 11, 12]. Deep-learning frameworks that jointly estimate depth, optical flow and camera motion have been presented in [101, 90, 93, 79, 80, 54, 102, 36, 83, 66, 72, 6, 9, 96]. Further adding recurrent neural networks to learn temporal information is presented in [81, 82, 87, 49]. However, deep learning methods are usually less explainable and have difficulty transferring to unseen datasets or cameras with different calibration. Moreover, the precision of such methods still underperforms the state of the art.

Deep learning optical flow. Contrary to learning-based monocular depth approaches [23, 28, 27, 95, 31], which develop and impose strong semantic priors, learning for optical flow estimation may be informed by photometric error and achieves better generalization. Recent deep learning works on optical flow estimation [84, 33, 89, 65, 38, 51, 34, 100, 35, 14, 76] have shown compelling accuracy, robustness and generalization, outperforming traditional methods, especially under challenging conditions such as texture-less regions, motion blur and large occlusions. FlowNet [14] introduced an encoder-decoder convolutional neural network for optical flow. FlowNet2 [35] improved its performance by stacking multiple basic FlowNets. Most recently, PWC-Net [76] integrates spatial pyramids, warping and cost volumes into deep optical flow estimation, improving performance and generalization to the current state of the art.

Figure 2: Iterative estimation workflow. We input externally computed optical flow estimates of a video sequence. Scene depth, camera poses and rigidness maps are alternately estimated by enforcing congruence between the predicted rigid flow and the input flow observations. Estimation is posed as a probabilistic inference task governed by a Fisk-distributed residual model.

3 Problem Formulation

Rationale. Optical flow can be seen as the combination of a rigid flow, induced by camera motion and scene structure, and an unconstrained flow describing general object motion [86]. Our VO method inputs a batch of externally computed optical flow fields and infers the underlying temporally consistent scene structure (a depth map), the camera motions, as well as pixel-wise probabilities for the "rigidness" of each optical flow estimate. Further, we supervise our inference framework with an empirically validated adaptive log-logistic residual model over the end-point error (EPE) between the estimated rigid flow and the input (observed) flow.

Geometric Notation. We input a sequence of externally computed (observed) dense optical flow fields $X = \{X_t \mid t = 1, \dots, N\}$, where $X_t$ is the optical flow map from image $I_{t-1}$ to $I_t$, while $x_t^j$ denotes the optical flow vector of pixel $j$ at time $t$. We aim to infer the camera poses $T = \{T_t \mid t = 1, \dots, N\}$, where $T_t \in SE(3)$ represents the relative motion from time $t-1$ to $t$.
To define a likelihood model relating our observations $X$ to $T$, we introduce two additional (latent) variable types: 1) a depth field $\theta$ defined over $I_0$, where we denote $\theta^j$ as the depth value at pixel $j$, and 2) a rigidness probability map $W_t$ associated to $X_t$ at time $t$; $W = \{W_t \mid t = 1, \dots, N\}$ denotes the set of rigidness maps, and $W_t^j$ denotes the rigidness probability of pixel $j$ at time $t$.
Having the depth map $\theta$ and rigidness maps $W$, we can obtain a rigid flow $\xi_t$ by applying the rigid transformations $T_{1 \cdots t}$ to the point cloud associated with $\theta$, conditioned on $W$. Assuming $W_t^j = 1$, we let $p_t^j$ denote the pixel coordinate obtained by projecting the 3D point associated with $\theta^j$ into the camera image plane at time $t$ using the given camera poses $T_{1 \cdots t}$, by

  p_t^j = \pi\big( K \, T_t T_{t-1} \cdots T_1 \, \theta^j K^{-1} \tilde{p}_0^j \big)   (1)

where $K$ is the camera intrinsic matrix, $\tilde{p}_0^j$ are the homogeneous image coordinates of pixel $j$ in $I_0$, and $\pi(\cdot)$ denotes perspective projection. Hence, the rigid flow can be defined as $\xi_t^j = p_t^j - p_{t-1}^j$.
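
For concreteness, the following NumPy sketch illustrates the projection chain of Eq. (1) and the resulting rigid flow. It is an illustrative re-statement under the notation above, not the released implementation; function and variable names are assumptions.

import numpy as np

def rigid_flow(depth0, K, poses, t):
    # depth0: (H, W) depth of the first frame; K: (3, 3) intrinsics;
    # poses: list of (4, 4) relative motions [T_1, ..., T_N], T_k mapping frame k-1 to k.
    H, W = depth0.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], -1).reshape(-1, 3).T      # homogeneous pixels, 3 x HW
    pts0 = np.linalg.inv(K) @ pix * depth0.reshape(1, -1)             # back-projected 3D points of frame 0

    def project(chain):
        # Compose the relative motions in `chain` and project the frame-0 points.
        T = np.eye(4)
        for Tk in chain:
            T = Tk @ T
        p = K @ (T[:3, :3] @ pts0 + T[:3, 3:4])
        return (p[:2] / p[2:]).T.reshape(H, W, 2)

    p_prev = project(poses[:t - 1])        # pixel coordinates at time t-1
    p_curr = project(poses[:t])            # pixel coordinates at time t
    return p_curr - p_prev                 # rigid flow field xi_t, shape (H, W, 2)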

Mixture Likelihood Model. We model the residuals between the observed flow and the rigid flow with respect to the continuous rigidness probability $W_t^j$,

  P(x_t^j \mid \theta^j, T_{1 \cdots t}, W_t^j) = W_t^j \, P(x_t^j \mid \xi_t^j) + (1 - W_t^j) \, P_v(x_t^j)   (2)

where the probability density function $P(x_t^j \mid \xi_t^j)$ represents the probability of the observed flow $x_t^j$ given the rigid flow $\xi_t^j$, and $P_v(x_t^j)$ is a uniform distribution whose density varies with $x_t^j$. We define these two functions in §4. Henceforth, when modeling the probability of $x_t^j$, we only write down $T_t$ in the conditional probability, although the projection also depends on the preceding camera poses $T_{1 \cdots t-1}$, which we assume fixed and about which $X_t$ inherently does not contain any information. Moreover, jointly modeling all previous camera poses along with $T_t$ would bias them as well as increase the computational complexity. In what follows, we denote Eq. (2) simply as $P(x_t^j \mid \theta^j, T_t, W_t^j)$. At this point, our visual odometry problem can be modeled as a maximum likelihood estimation problem

  \hat{T}, \hat{\theta}, \hat{W} = \arg\max_{T, \theta, W} \prod_t \prod_j P(x_t^j \mid \theta^j, T_t, W_t^j)   (3)

Furthermore, we promote spatial consistency among both of our dense hidden variables $\theta$ and $W$ through different mechanisms described in §5.1.

4 Fisk Residual Model

Figure 3: Empirical residual distribution. Optical flow EPE residual over flow magnitude for PWC-Net outputs on the entire groundtruth data for the KITTI [25] and Sintel [7] datasets.

The choice of an observation residual model plays a critical role in accurate statistical inference from a reduced number of observations [30, 37]. In our case, a residual is defined as the end-point error in pixels between two optical flow vectors.

In practice, hierarchical optical flow methods (i.e. those relying on recursive scale-space analysis) [76, 50] tend to amplify estimation errors in proportion to the magnitude of the pixel flow vector. In light of this, we explore an adaptive residual model that determines the residual distribution w.r.t. the magnitude of the observed optical flow. In Figure 3, we empirically analyze the residual distributions of multiple leading optical flow methods [76, 35, 50, 67] w.r.t. the groundtruth. We fit the empirical distribution to five different analytic models and find the Fisk distribution to yield the most congruent shape over all flow magnitudes (see the Figure 3 overlay). To quantify goodness of fit, Figure 4 (a) reports K-S test results [3], which measure the supremum distance (D value) between the CDFs of our empirical distribution and a reference analytic distribution.
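
Such a validation can be reproduced along the following lines with SciPy, where scipy.stats.fisk is the log-logistic distribution. The particular set of competing distributions, the binning scheme and the helper name are assumptions for illustration, not the procedure used to produce Figure 4.

import numpy as np
from scipy import stats

def ks_per_bin(flow_mag, epe, n_bins=20):
    # Bin residuals by observed flow magnitude, fit each candidate distribution
    # per bin, and report the K-S D statistic (lower D indicates a better fit).
    candidates = {"fisk": stats.fisk, "gamma": stats.gamma, "lognorm": stats.lognorm,
                  "rayleigh": stats.rayleigh, "expon": stats.expon}
    edges = np.quantile(flow_mag, np.linspace(0.0, 1.0, n_bins + 1))
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        r = epe[(flow_mag >= lo) & (flow_mag < hi)]
        if r.size < 100:                              # skip poorly populated bins
            continue
        row = {"mag_range": (lo, hi)}
        for name, dist in candidates.items():
            params = dist.fit(r, floc=0)              # fix the location parameter at zero
            row[name] = stats.kstest(r, dist.cdf, args=params).statistic
        rows.append(row)
    return rows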

Hence, given an observed flow vector $x_t^j$, we model the probability of the rigid flow $\xi_t^j$ matching the underlying groundtruth as

  P(x_t^j \mid \xi_t^j) = \mathrm{Fisk}\big( \lVert x_t^j - \xi_t^j \rVert_2 ;\ \alpha(\lVert x_t^j \rVert_2),\ \beta(\lVert x_t^j \rVert_2) \big)   (4)

where the functional form for the PDF of the Fisk distribution is given by

  \mathrm{Fisk}(r; \alpha, \beta) = \frac{(\beta / \alpha)\,(r / \alpha)^{\beta - 1}}{\big(1 + (r / \alpha)^{\beta}\big)^{2}}   (5)

From Figure 4 (b), we further determine the parameters of the Fisk distribution. Since a clear linear correspondence is shown in the figure, instead of using a look-up table, we apply fitted regression functions of the observed flow magnitude $m = \lVert x_t^j \rVert_2$ to obtain the parameters, as

  \log \alpha(m) = a_1 \log m + a_0   (6)
  \beta(m) = b_1 m + b_0   (7)

where $a_0, a_1, b_0, b_1$ are learned parameters depending on the optical flow estimation method.

Next, we model the outlier likelihood function $P_v(x_t^j)$. The general approach [53, 98] is to assign outliers a uniform distribution to improve robustness. In our work, to exploit the prior given by the observed flow, we further let the density of the uniform distribution be a function of the observed flow vector,

  P_v(x_t^j) = \mathrm{Fisk}\big( \lambda \lVert x_t^j \rVert_2 ;\ \alpha(\lVert x_t^j \rVert_2),\ \beta(\lVert x_t^j \rVert_2) \big)   (8)

where $\lambda$ is a hyper-parameter adjusting the density, which also sets the strictness for choosing inliers. The numerical interpretation of $\lambda$ is the optical flow percentage EPE at which an outlier becomes indistinguishable from an inlier ($P(x_t^j \mid \xi_t^j) = P_v(x_t^j)$). Hence, flows of different magnitudes can be compared under a fair metric when being selected as inliers.
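
The adaptive residual model of Eqs. (4)-(8) and the mixture of Eq. (2) can be evaluated per pixel as sketched below. This is a minimal illustration under the reconstructed notation above; the regression coefficients a0, a1, b0, b1 and the strictness lam stand in for the learned/chosen values, and the helper names are illustrative.

import numpy as np
from scipy import stats

def fisk_params(mag, a0, a1, b0, b1):
    # Adaptive Fisk parameters as functions of observed flow magnitude,
    # cf. Eqs. (6)-(7): log-linear regression for alpha, linear for beta.
    alpha = np.exp(a1 * np.log(np.maximum(mag, 1e-6)) + a0)
    beta = b1 * mag + b0
    return alpha, beta

def flow_log_likelihood(obs_flow, rigid_flow, rigidness, a0, a1, b0, b1, lam=0.15):
    # Per-pixel mixture log-likelihood of Eq. (2): rigidness-weighted Fisk
    # inlier density plus a magnitude-dependent outlier density (Eq. (8)).
    epe = np.linalg.norm(obs_flow - rigid_flow, axis=-1)       # end-point error
    mag = np.linalg.norm(obs_flow, axis=-1)
    alpha, beta = fisk_params(mag, a0, a1, b0, b1)
    p_in = stats.fisk.pdf(epe, c=beta, scale=alpha)            # Eqs. (4)-(5)
    p_out = stats.fisk.pdf(lam * mag, c=beta, scale=alpha)     # density at the residual where
                                                               # inliers become indistinguishable
    return np.log(rigidness * p_in + (1.0 - rigidness) * p_out + 1e-12)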

Figure 4: Fitness quantification and model parameterization. (a) K-S test: results of the KS-test [3] over four optical flow methods and five distributions (lower D values indicate a better fit). (b) Parameterization of the Fisk distribution: the parameters $\alpha, \beta$ are estimated from KITTI empirical data using, respectively, log-linear and linear regression, as described in Eqs. (6), (7).

5 Inference

In this section, we introduce our iterative inference framework, which alternately optimizes the depth map, camera poses and rigidness maps.

5.1 Depth and Rigidness Update

Generalized Expectation-Maximization (GEM).

We infer the depth $\theta$ and its rigidness $W$ over time while assuming fixed, known camera poses $T$. We approximate the true posterior through a GEM framework [98]. In this section, we denote Eq. (2) as $P(x_t^j \mid \theta^j, W_t^j)$, where the fixed $T_t$ is omitted. We approximate the intractable real posterior $P(\theta, W \mid X)$ with a restricted family of distributions $q(\theta, W) = q(\theta) \prod_{t,j} q(W_t^j)$. For tractability, $q(\theta)$ is further constrained to the family of Kronecker delta functions $q(\theta^j) = \delta(\theta^j = \hat{\theta}^j)$, where $\hat{\theta}^j$ is a parameter to be estimated. Moreover, $q(W_t^j)$ inherits the smoothness defined on the rigidness map in Eq. (11), which is shown in [98] to minimize the KL divergence between the variational distribution and the true posterior. In the M step, we seek to estimate an optimal value for $\hat{\theta}^j$ given the estimated PDF on $W_t^j$. Next, we describe our choice of estimators used for this task.

Maximum Likelihood Estimator (MLE). The standard definition of the MLE for our problem is given by

  \hat{\theta}^j = \arg\max_{\theta^j} \sum_t q(W_t^j = 1) \, \log P(x_t^j \mid \theta^j, W_t^j = 1)   (9)

where $q(W_t^j = 1)$ is the estimated distribution density given by the E step. However, we empirically found the MLE criterion to be overly sensitive to inaccurate initialization. More specifically, we bootstrap our depth map using only the first optical flow and use its depth values to sequentially bootstrap subsequent camera poses (more details in §5.2 and §5.3). Hence, under noisy/inaccurate initialization, using MLE for estimate refinement imposes high selective pressure on the rigidness probabilities $W_t^j$, favoring a reduced number of higher-accuracy initializations. Given the sequential nature of our image-batch analysis, this tends to effectively reduce the set of useful down-stream observations used to estimate subsequent cameras.

Maximum Inlier Estimation (MIE). To reduce the bias caused by initialization and sequential updating, we relax the MLE criterion to the following MIE criterion,

  \hat{\theta}^j = \arg\max_{\theta^j} \sum_t q_{\theta^j}(W_t^j = 1)   (10)

which finds a depth maximizing the rigidness (inlier selection) map, where $q_{\theta^j}(W_t^j = 1)$ denotes the rigidness posterior re-evaluated at the candidate depth $\theta^j$. We provide experimental details regarding the MIE criterion in §6.3.

Figure 5: Model for depth inference. The 2D image field is broken into alternately directed 1D chains, while depth values are propagated through each chain. Hidden Markov chain smoothing is imposed on the rigidness maps.

We optimize $\hat{\theta}^j$ through a sampling-propagation scheme, as shown in Fig. 5. A randomly sampled depth is compared with the previous depth value $\hat{\theta}^j$ together with a value propagated from the preceding neighbor pixel in the chain. Then, $\hat{\theta}^j$ is updated to the best of these three candidates under the chosen criterion. The updated $\hat{\theta}^j$ is further propagated to the next neighbor pixel in the chain.
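
A minimal sketch of this per-chain update follows. It is a simplification for illustration only; score(j, d) stands for whichever criterion, Eq. (9) or Eq. (10), is being maximized for pixel j at candidate depth d, and the depth range is a placeholder.

import numpy as np

def update_depth_chain(depth, score, depth_range=(0.5, 80.0), rng=None):
    # Sweep one 1D chain: at each pixel keep the best of (current depth,
    # randomly sampled depth, depth propagated from the previous pixel),
    # as measured by the per-pixel criterion score(j, d).
    rng = np.random.default_rng() if rng is None else rng
    for j in range(len(depth)):
        candidates = [depth[j], rng.uniform(*depth_range)]
        if j > 0:
            candidates.append(depth[j - 1])          # propagation from the previous neighbor
        depth[j] = max(candidates, key=lambda d: score(j, d))
    return depth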

Updating the Rigidness Maps. We adopt a scheme where the image is split into rows and columns, reducing the 2D image to several 1D hidden Markov chains, and a pairwise smoothness term is imposed on the rigidness map,

  P(W_t^j, W_t^{j+1}) = \begin{cases} \gamma & \text{if } W_t^j = W_t^{j+1} \\ 1 - \gamma & \text{otherwise} \end{cases}   (11)

where $\gamma$ is a transition probability encouraging similar neighboring rigidness. In the E step, we update the rigidness maps according to $q(W_t^j) \approx P(W_t^j \mid X, \hat{\theta})$. Given the smoothness defined in Eq. (11), the forward-backward algorithm is used for inferring $q(W_t^j)$ in the hidden Markov chain,

  q(W_t^j) = \frac{1}{Z} \, \overrightarrow{m}(W_t^j) \, \overleftarrow{m}(W_t^j)   (12)

where $Z$ is a normalization factor, while $\overrightarrow{m}(W_t^j)$ and $\overleftarrow{m}(W_t^j)$ are the forward and backward messages of $W_t^j$, computed recursively as

  \overrightarrow{m}(W_t^j) = P(x_t^j \mid \hat{\theta}^j, W_t^j) \sum_{W_t^{j-1}} P(W_t^{j-1}, W_t^j) \, \overrightarrow{m}(W_t^{j-1})   (13)

  \overleftarrow{m}(W_t^j) = \sum_{W_t^{j+1}} P(x_t^{j+1} \mid \hat{\theta}^{j+1}, W_t^{j+1}) \, P(W_t^j, W_t^{j+1}) \, \overleftarrow{m}(W_t^{j+1})   (14)

where the emission probability $P(x_t^j \mid \hat{\theta}^j, W_t^j)$ refers to Eq. (2).
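
The E step of Eqs. (12)-(14) is the standard forward-backward recursion on a two-state chain. A compact sketch under the notation above follows; emission[j, k] stands for P(x_j | theta_j, W_j = k), and the uniform prior on the first pixel is an assumption.

import numpy as np

def rigidness_forward_backward(emission, gamma=0.9):
    # Forward-backward smoothing of a binary rigidness chain, Eqs. (12)-(14).
    # emission: (n, 2) likelihoods for W_j = 0 and W_j = 1 at each pixel.
    # Returns the posterior q(W_j = 1), normalized per pixel.
    n = emission.shape[0]
    A = np.array([[gamma, 1 - gamma],
                  [1 - gamma, gamma]])       # pairwise smoothness, Eq. (11)
    fwd = np.zeros((n, 2))
    bwd = np.ones((n, 2))
    fwd[0] = emission[0] * 0.5               # uniform prior on the first pixel
    for j in range(1, n):
        fwd[j] = emission[j] * (A.T @ fwd[j - 1])
        fwd[j] /= fwd[j].sum()               # rescale for numerical stability
    for j in range(n - 2, -1, -1):
        bwd[j] = A @ (emission[j + 1] * bwd[j + 1])
        bwd[j] /= bwd[j].sum()
    post = fwd * bwd
    return post[:, 1] / post.sum(axis=1)     # q(W_j = 1), Eq. (12)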

5.2 Pose Update

We update the camera poses $T$ while assuming fixed, known depth map $\theta$ and rigidness maps $W$. We use the optical flow chains in $X$ to determine the 2D projection at time $t$ of any given 3D point extracted from our depth map. Since we aim to estimate relative camera motion, we express scene depth relative to the camera pose at time $t-1$ and use the attained 3D-2D correspondences to define a dense PnP instance [99]. We solve this instance by estimating the mode of an approximate posterior distribution obtained by Monte-Carlo sampling of the pose space through (minimal) P3P instances. The robustness to outlier correspondences and the independence from initialization provided by this approach are crucial for bootstrapping our visual odometry system (§5.3), where the camera pose needs to be estimated from scratch with an uninformative rigidness map (all ones).

The problem can be written down as maximum a posteriori (MAP) estimation,

  \hat{T}_t = \arg\max_{T_t} P(T_t \mid X, \theta, W)   (15)

Finding the optimal camera pose amounts to computing the maximum of the posterior distribution $P(T_t \mid X, \theta, W)$, which is not tractable since it requires integrating over the depth and rigidness maps. Thus, we use a Monte-Carlo based approximation, where for each depth map position we randomly sample two additional distinct positions to form a 3-tuple $g$ of pixels, with associated rigidness values taken from $W_t$, representing one group. Then the posterior can be approximated as

  P(T_t \mid X, \theta, W) \approx \frac{1}{G} \sum_{g} P(T_t \mid X^g, \theta^g, W^g)   (16)

where $G$ is the total number of groups. Although the group posterior is still not tractable, using 3 pairs of 3D-2D correspondences, PnP reaches its minimal form of P3P, which can be solved efficiently using a P3P algorithm [24, 48, 55, 42]. Hence we have

  T_t^g = \mathrm{P3P}\big( \{ P_{t-1}^j \}_{j \in g},\ \{ p_t^j \}_{j \in g} \big)   (17)

where $\mathrm{P3P}(\cdot)$ denotes the P3P solver, for which we use AP3P [42]. The first input argument indicates the 3D coordinates $P_{t-1}^j$ of the selected depth map pixels at time $t-1$, obtained by combining the previous camera poses, while the second input argument is their 2D correspondences $p_t^j$ at time $t$, obtained using the optical flow displacement. Hence, we use a tractable variational distribution to approximate the true group posterior,

  P(T_t \mid X^g, \theta^g, W^g) \approx \mathcal{N}(T_t ;\ T_t^g, \Sigma)   (18)

where $\mathcal{N}(T_t; T_t^g, \Sigma)$ is a normal distribution with mean $T_t^g$ and a predefined fixed covariance matrix $\Sigma$ for simplicity. Furthermore, we weight each variational distribution with a weight $w^g$ derived from the rigidness values of its group, such that potential outliers indicated by the rigidness maps can be excluded or down-weighted. Then, the full posterior can be approximated as

  P(T_t \mid X, \theta, W) \approx \frac{1}{\sum_g w^g} \sum_g w^g \, \mathcal{N}(T_t ;\ T_t^g, \Sigma)   (19)
Figure 6: Pose MAP approximation via meanshift-based mode search. Each 3D-2D correspondence is part of a unique minimal P3P instance, constituting a pose sample that is weighted by the rigidness map. We map samples to $\mathfrak{se}(3)$ and run meanshift to find the mode.

We have approximated the posterior with a weighted combination of normal distributions $\mathcal{N}(T_t; T_t^g, \Sigma)$. Solving for the optimum of the posterior equates to finding its mode. Since we assume all components share the same covariance structure $\Sigma$, mode finding on this distribution equates to applying meanshift [10] with a Gaussian kernel of covariance $\Sigma$. Note that since $T_t$ lies in $SE(3)$ [4], while meanshift is applied in a vector space, an obtained mode cannot be guaranteed to lie in $SE(3)$. Thus, poses are first converted to 6-vectors in the Lie algebra $\mathfrak{se}(3)$ through the logarithm map, and meanshift is applied in the 6-vector space.
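
The mode search itself reduces to weighted meanshift in the 6-dimensional Lie-algebra coordinates. A minimal sketch follows, assuming the P3P pose samples have already been mapped to se(3) 6-vectors and that Sigma is the fixed kernel covariance of Eq. (18); the initialization and stopping rule are assumptions.

import numpy as np

def meanshift_pose_mode(samples, weights, sigma_inv, iters=50, tol=1e-6):
    # samples:   (n, 6) pose samples from the P3P solver (Eq. (17)), in se(3) coordinates.
    # weights:   (n,) rigidness-derived weights of Eq. (19).
    # sigma_inv: (6, 6) inverse of the fixed Gaussian kernel covariance.
    x = samples[np.argmax(weights)].copy()            # start from the highest-weighted sample
    for _ in range(iters):
        d = samples - x
        k = weights * np.exp(-0.5 * np.einsum('ni,ij,nj->n', d, sigma_inv, d))
        x_new = (k[:, None] * samples).sum(axis=0) / k.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x    # mode estimate; map back to SE(3) with the exponential map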

5.3 Algorithm Integration

We now describe the integrated workflow of our visual odometry algorithm, which we denote VOLDOR. Per Table 1, our input is a sequence of dense optical flows $X$, and our output is the camera pose of each frame as well as the depth map of the first frame. Typically, 4-8 optical flows per batch are used. First, VOLDOR initializes all rigidness maps $W_t$ to one, and $T_1$ is initialized from the epipolar geometry estimated from $X_1$ using a least-median-of-squares estimator [68] or, alternatively, from previous estimates if available (i.e. overlapping consecutive frame batches). Then, $\theta$ is obtained from two-view triangulation using $X_1$ and $T_1$. Next, the optimization loop over camera poses, depth map and rigidness maps runs until convergence, usually within 3 to 5 iterations. Note that we do not smooth the rigidness maps before updating the camera poses, to avoid losing the fine details that indicate potential high-frequency noise in the observations.

Input: Optical flow sequence X = {X_1, ..., X_N}
Output: Camera poses T = {T_1, ..., T_N}
      Depth map θ of the first frame
Initialize all W_t to one
Initialize T_1 using epipolar geometry from X_1
Triangulate θ from X_1 and T_1
Repeat until convergence
  For t = 1, ..., N
    Update T_t according to Eq. (19)
  Update W and smooth according to Eq. (12)
  Update θ according to Eq. (10)
  Update W according to Eq. (12) w/o smoothing
Table 1: The VOLDOR algorithm.

6 Experiments

Figure 7: Results on KITTI sequences 00, 01, 05, 07. Monocular VOLDOR does not deploy any mapping, bundle adjustment or loop closure. Scale is estimated assuming fixed and known camera height from the ground.

6.1 KITTI Benchmark

We tested on the KITTI odometry benchmark [25], featuring a car driving in urban and highway environments. We use PWC-Net [76] as our external dense optical flow input. The sliding window size is set to 6 frames. We set $\lambda$ in Eq. (8) to 0.15 and $\gamma$ in Eq. (11) to 0.9. The Gaussian kernel covariance matrix $\Sigma$ in Eq. (18) is set to be diagonal, scaled to 0.1 and 0.004 in the translation and rotation dimensions, respectively. The hyper-parameters of the Fisk residual model are obtained from the fits in Figure 3. Finally, the absolute scale is estimated from the ground plane by taking the mode over pixels with near-perpendicular surface normal vectors. More details of the ground plane estimation are provided in the appendix.
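
As a rough illustration of this scale recovery (assuming KITTI-style camera axes with y pointing downward; the thresholds, axis convention and helper name below are assumptions, and the full procedure is in the appendix):

import numpy as np

def absolute_scale(points_cam, normals_cam, camera_height,
                   up=np.array([0.0, -1.0, 0.0]), angle_thresh_deg=10.0, n_bins=200):
    # points_cam:  (N, 3) up-to-scale 3D points in camera coordinates.
    # normals_cam: (N, 3) unit surface normals; camera_height: metric height above the road.
    cos_th = np.cos(np.deg2rad(angle_thresh_deg))
    ground = np.abs(normals_cam @ up) > cos_th       # keep near-vertical-normal (ground) pixels
    heights = points_cam[ground] @ (-up)             # signed distance below the camera center
    hist, edges = np.histogram(heights, bins=n_bins)
    k = np.argmax(hist)                              # mode of the height distribution
    ground_height = 0.5 * (edges[k] + edges[k + 1])
    return camera_height / ground_height             # factor scaling translations to metric units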

Sequence   VISO2-M                MLM-SFM                VOLDOR
           Trans.(%)  Rot.(deg/m)  Trans.(%)  Rot.(deg/m)  Trans.(%)  Rot.(deg/m)
00         12.53      0.0260       2.04       0.0048       1.09       0.0039
01         28.09      0.0641       -          -            2.31       0.0037
02         3.98       0.0123       1.50       0.0035       1.19       0.0042
03         4.09       0.0206       3.37       0.0021       1.46       0.0034
04         2.58       0.0162       1.43       0.0023       1.13       0.0049
05         14.68      0.0379       2.19       0.0038       1.15       0.0041
06         6.73       0.0195       2.09       0.0081       1.13       0.0045
07         14.95      0.0558       -          -            1.63       0.0054
08         11.63      0.0215       2.37       0.0044       1.50       0.0044
09         4.94       0.0140       1.76       0.0047       1.61       0.0039
10         23.36      0.0348       2.12       0.0085       1.44       0.0043
Avg.       10.85      0.0249       2.03       0.0045       1.32       0.0042
Table 2: Results on KITTI training sequences 0-10. The translation and rotation errors are averaged over all sub-sequences of length from 100 meters to 800 meters with 100 meter steps.
Method Trans. (%) Rot. (deg/m)
VISO2-M [26] 11.94 0.0234
MLM-SFM [74] [73] 2.54 0.0057
PbT-M2 [19] [17] [18] 2.05 0.0051
BVO [62] 1.76 0.0036
VOLDOR (Ours) 1.65 0.0050
VO3pt* [2] [1] 2.69 0.0068
VISO2-S* [26] 2.44 0.0114
eVO* [69] 1.76 0.0036
S-PTAM* [63] [64] 1.19 0.0025
ORB-SLAM2* [57] 1.15 0.0027
Table 3: Results on KITTI odometry testing sequences 11-21.   * indicates the method is based on stereo input.

Table 2 and Figure 7 present our results on the KITTI odometry training sequences 00-10. We picked VISO2 [26] and MLM-SFM [74, 73] as our baselines, whose scales are also estimated from the ground height. Table 3 compares our results with recent popular methods on the KITTI test sequences 11-21, downloaded from the official KITTI odometry ranking board. As the results show, VOLDOR achieves top-ranking accuracy among monocular methods on the KITTI dataset.

Table 4 shows our depth map quality on the KITTI stereo benchmark. We masked out foreground moving objects and aligned our depth with the groundtruth to resolve scale ambiguity. We separately evaluated the depth quality at different levels of rigidness probability. The EPE of PSMNet [8], GC-Net [43] and GA-Net [94] is measured on the stereo 2012 test set and the background outlier percentage on the stereo 2015 test set, while our method is measured on the stereo 2015 training set.

Methods Density EPE / px bg-outlier
GC-Net [43] 100.00% 0.7 2.21%
PSMNet [8] 100.00% 0.6 1.86%
GA-Net-15 [94] 100.00% 0.5 1.55%
Ours 27.07% 0.5616 1.47%
Ours 37.87% 0.6711 2.05%
Ours 49.55% 0.7342 2.56%
Ours 62.50% 0.8135 3.17%
Ours 78.18% 0.9274 4.17%
Ours 100.00% 1.2304 5.82%
Table 4: Results on the KITTI stereo benchmark. A pixel is considered an outlier if its disparity EPE is ≥ 3 px and ≥ 5% of the groundtruth disparity. Density denotes the sum of pixel rigidness.
Sequence    ORB-SLAM2 (RGB-D)   DVO-SLAM (RGB-D)   DSO (Mono)   Ours (Mono)
fr1/desk 0.0163 0.0185 0.0168 0.0133
fr1/desk2 0.0162 0.0238 0.0188 0.0150
fr1/room 0.0102 0.0117 0.0108 0.0090
fr2/desk 0.0045 0.0068 0.0048 0.0053
fr2/xyz 0.0034 0.0055 0.0025 0.0034
fr3/office 0.0046 0.0102 0.0050 0.0045
fr3/nst 0.0079 0.0073 0.0087 0.0071
Table 5: Results on TUM RGB-D dataset. The values are translation RMSE in meters.
Figure 8: Fisk model qualitative study. Panels: (a) depth likelihood with a Gaussian (MLE/MIE) residual model; (b) epipole distribution with a Gaussian residual model; (c) depth likelihood with a Fisk (MLE/MIE) residual model; (d) epipole distribution with a Fisk residual model. (a) and (c) visualize the depth likelihood function under the Gaussian and Fisk residual models with the MLE and MIE criteria: dashed lines indicate the likelihood given by a single optical flow, solid lines the joint likelihood obtained by fusing all dashed lines, with MLE and MIE shown in different colors. (b) and (d) visualize the epipole distribution for 40K camera pose samples. For better visualization, the density color bars of (b) and (d) are scaled differently.
Figure 9: Ablation study and runtime. Panels: (a) ablation study over camera pose, showing the pose error of VOLDOR under different residual models and dense optical flow inputs (*due to noisy ground estimations given by C2F-Flow, its scale is corrected using groundtruth); (b) ablation study over depth, showing our depth map accuracy under different residual models; (c) runtime over frame numbers; (d) runtime over pose samples. (c) and (d) are measured on a GTX 1080Ti GPU.

6.2 TUM RGB-D Benchmark

Accuracy experiments on TUM RGB-D [75] compare VOLDOR against full SLAM systems. In all instances, we rigidly align trajectories to the groundtruth over segments of 6 frames and report the mean translation RMSE over all segments. Parameters remain the same as in the KITTI experiments. Our comparative baselines are an indirect sparse method (ORB-SLAM2 [57]), a direct sparse method (DSO [15]) and a dense direct method (DVO-SLAM [45]). Per Table 5, VOLDOR performs well under indoor capture exhibiting smaller camera motions and diverse motion patterns.
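
The evaluation protocol can be sketched as a Kabsch-based rigid alignment per 6-frame segment; the following is an illustrative re-statement, not the official evaluation script, and the segment enumeration is an assumption.

import numpy as np

def segment_rmse(est, gt, seg_len=6):
    # est, gt: (N, 3) estimated / groundtruth camera positions.
    # Rigidly align each segment to the groundtruth and average the translational RMSE.
    errors = []
    for s in range(0, len(est) - seg_len + 1):
        e, g = est[s:s + seg_len], gt[s:s + seg_len]
        e0, g0 = e - e.mean(0), g - g.mean(0)
        U, _, Vt = np.linalg.svd(e0.T @ g0)                    # Kabsch alignment
        D = np.diag([1, 1, np.sign(np.linalg.det(Vt.T @ U.T))])
        R = Vt.T @ D @ U.T                                     # rotation mapping est onto gt
        aligned = e0 @ R.T + g.mean(0)
        errors.append(np.sqrt(np.mean(np.sum((aligned - g) ** 2, axis=1))))
    return float(np.mean(errors))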

6.3 Ablation and Performance Study

Figure 8 visualizes the depth likelihood function and the camera pose sampling distribution. With our Fisk residual model, the depth likelihood from each single frame has a well-localized extremum (Fig. 8-c), compared to a Gaussian residual model (Fig. 8-a). This leads to a joint likelihood with a more distinguishable optimum, and results in more concentrated camera pose samples (Fig. 8-b,d). Also, with the MIE criterion, the depth likelihood of the Fisk residual model is relaxed to a smoother shape, whose effectiveness is further analyzed in Fig. 9, while a Gaussian residual model is agnostic to the choice between MIE and MLE (Fig. 8-a). Per the quantitative study in Fig. 9 (b), compared to other analytic distributions, the Fisk residual model gives significantly better depth estimation when only a small number of reliable observations (low rigidness) is available. Performance across different residual models tends to converge as the number of reliable samples increases (high rigidness), while the Fisk residual model still provides the lowest EPE. Figure 9 (a) shows the ablation study on camera pose estimation for three optical flow methods, four residual models and our proposed MIE criterion. The accuracy obtained by combining PWC-Net optical flow, the Fisk residual model and the MIE criterion strictly dominates (in the Pareto sense) all other combinations. Figure 9 (b) shows that the MIE criterion yields depth estimates that are more consistent across the entire image sequence, leading to improved overall accuracy; however, in the extreme case of predominantly unreliable observations (very low rigidness), MLE provides the most accurate depth. Figure 9 (c) shows the overall runtime of each component under different frame numbers. Figure 9 (d) shows the runtime of the pose update under different sample rates.

7 Conclusion

Conceptually, we pose the VO problem as an instance of geometric parameter inference under the supervision of an adaptive model of the empirical distribution of dense optical flow residuals. Pragmatically, we develop a monocular VO pipeline which obviates the need for a) feature extraction, b) RANSAC-based estimation, and c) local bundle adjustment, yet still achieves top-ranked performance on the KITTI and TUM RGB-D benchmarks. We posit the use of dense indirect representations and adaptive data-driven supervision as a general and extensible framework for multi-view geometric analysis tasks.

References

  • [1] P. F. Alcantarilla (2011) Vision based localization: from humanoid robots to visually impaired people. Electronics (University of Alcala, 2011). Cited by: Table 3.
  • [2] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa (2012) On combining visual slam and dense scene flow to increase the robustness of localization and mapping in dynamic environments. In 2012 IEEE International Conference on Robotics and Automation, pp. 1290–1297. Cited by: Table 3.
  • [3] G. J. Babu and C. R. Rao (2004) Goodness-of-fit tests when parameters are estimated. Sankhya 66 (1), pp. 63–74. Cited by: Figure 4, §4.
  • [4] J. Blanco (2010) A tutorial on se (3) transformation parameterizations and on-manifold optimization. University of Malaga, Tech. Rep 3. Cited by: §5.2.
  • [5] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison (2018) CodeSLAM—learning a compact, optimisable representation for dense visual slam. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2560–2568. Cited by: §2.
  • [6] B. Bozorgtabar, M. S. Rad, D. Mahapatra, and J. Thiran (2019) SynDeMo: synergistic deep feature alignment for joint learning of depth and ego-motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4210–4219. Cited by: §2.
  • [7] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black (2012) A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), A. Fitzgibbon et al. (Eds.), Part IV, LNCS 7577, pp. 611–625. Cited by: Figure 3.
  • [8] J. Chang and Y. Chen (2018) Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5410–5418. Cited by: §6.1, Table 4.
  • [9] Y. Chen, C. Schmid, and C. Sminchisescu (2019) Self-supervised learning with geometric constraints in monocular video: connecting flow, depth, and camera. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7063–7072. Cited by: §2.
  • [10] Y. Cheng (1995) Mean shift, mode seeking, and clustering. IEEE transactions on pattern analysis and machine intelligence 17 (8), pp. 790–799. Cited by: §5.2.
  • [11] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison (2018) LS-net: learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966. Cited by: §2.
  • [12] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia (2015) Exploring representation learning with cnns for frame-to-frame ego-motion estimation. IEEE robotics and automation letters 1 (1), pp. 18–25. Cited by: §2.
  • [13] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse (2007) MonoSLAM: real-time single camera slam. IEEE Transactions on Pattern Analysis & Machine Intelligence (6), pp. 1052–1067. Cited by: §2.
  • [14] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. Van Der Smagt, D. Cremers, and T. Brox (2015) Flownet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §2.
  • [15] J. Engel, V. Koltun, and D. Cremers (2017) Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence 40 (3), pp. 611–625. Cited by: §2, §6.2.
  • [16] J. Engel, T. Schöps, and D. Cremers (2014) LSD-slam: large-scale direct monocular slam. In European conference on computer vision, pp. 834–849. Cited by: §1, §2.
  • [17] N. Fanani, M. Ochs, H. Bradler, and R. Mester (2016) Keypoint trajectory estimation using propagation based tracking. In 2016 IEEE Intelligent Vehicles Symposium (IV), pp. 933–939. Cited by: Table 3.
  • [18] N. Fanani, A. Stürck, M. Barnada, and R. Mester (2017) Multimodal scale estimation for monocular visual odometry. In 2017 IEEE Intelligent Vehicles Symposium (IV), pp. 1714–1721. Cited by: Table 3.
  • [19] N. Fanani, A. Stürck, M. Ochs, H. Bradler, and R. Mester (2017) Predictive monocular odometry (pmo): what is possible without ransac and multiframe bundle adjustment?. Image and Vision Computing 68, pp. 3–13. Cited by: Table 3.
  • [20] C. Forster, M. Pizzoli, and D. Scaramuzza (2014) SVO: fast semi-direct monocular visual odometry. In 2014 IEEE international conference on robotics and automation (ICRA), pp. 15–22. Cited by: §2.
  • [21] F. Fraundorfer and D. Scaramuzza (2011) Visual odometry: part i: the first 30 years and fundamentals. IEEE Robotics and Automation Magazine 18 (4), pp. 80–92. Cited by: §1.
  • [22] F. Fraundorfer and D. Scaramuzza (2012) Visual odometry: part ii: matching, robustness, optimization, and applications. IEEE Robotics & Automation Magazine 19 (2), pp. 78–90. Cited by: §1.
  • [23] H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao (2018) Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2002–2011. Cited by: §2.
  • [24] X. Gao, X. Hou, J. Tang, and H. Cheng (2003) Complete solution classification for the perspective-three-point problem. IEEE transactions on pattern analysis and machine intelligence 25 (8), pp. 930–943. Cited by: §5.2.
  • [25] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. Cited by: §1, Figure 3, §6.1.
  • [26] A. Geiger, J. Ziegler, and C. Stiller (2011) Stereoscan: dense 3d reconstruction in real-time. In 2011 IEEE Intelligent Vehicles Symposium (IV), pp. 963–968. Cited by: §6.1, Table 3.
  • [27] C. Godard, O. M. Aodha, M. Firman, and G. J. Brostow (2019) Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3828–3838. Cited by: §2.
  • [28] C. Godard, O. Mac Aodha, and G. J. Brostow (2017) Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 270–279. Cited by: §2.
  • [29] R. Gomez-Ojeda and J. Gonzalez-Jimenez (2016) Robust stereo visual odometry through a probabilistic combination of points and line segments. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 2521–2526. Cited by: §2.
  • [30] R. Gomez-Ojeda, F. Moreno, and J. Gonzalez-Jimenez (2017) Accurate stereo visual odometry with gamma distributions. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1423–1428. Cited by: §4.
  • [31] X. Guo, H. Li, S. Yi, J. Ren, and X. Wang (2018) Learning monocular depth by distilling cross-domain stereo networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 484–500. Cited by: §2.
  • [32] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu (2019) ClusterSLAM: a slam backend for simultaneous rigid body clustering and motion estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5875–5884. Cited by: §2.
  • [33] T. Hui, X. Tang, and C. Change Loy (2018) Liteflownet: a lightweight convolutional neural network for optical flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8981–8989. Cited by: §2.
  • [34] J. Hur and S. Roth (2019) Iterative residual refinement for joint optical flow and occlusion estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5754–5763. Cited by: §2.
  • [35] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: §1, §2, §4.
  • [36] E. Ilg, T. Saikia, M. Keuper, and T. Brox (2018) Occlusions, motion and depth boundaries with a generic network for disparity, optical flow or scene flow estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 614–630. Cited by: §2.
  • [37] A. Jaegle, S. Phillips, and K. Daniilidis (2016) Fast, robust, continuous monocular egomotion computation. In 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 773–780. Cited by: §4.
  • [38] J. Janai, F. Guney, A. Ranjan, M. Black, and A. Geiger (2018) Unsupervised learning of multi-frame optical flow with occlusions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 690–706. Cited by: §2.
  • [39] K. Kanatani (2004) Uncertainty modeling and model selection for geometric inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (10), pp. 1307–1319. Cited by: §1.
  • [40] K. Kanatani (2008) Statistical optimization for geometric fitting: theoretical accuracy bound and high order error analysis. International Journal of Computer Vision 80 (2), pp. 167–188. Cited by: §1.
  • [41] Y. Kanazawa and K. Kanatani (2003) Do we really have to consider covariance matrices for image feature points?. Electronics and Communications in Japan (Part III: Fundamental Electronic Science) 86 (1), pp. 1–10. Cited by: §1.
  • [42] T. Ke and S. I. Roumeliotis (2017) An efficient algebraic solution to the perspective-three-point problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7225–7233. Cited by: §5.2.
  • [43] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry (2017) End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 66–75. Cited by: §6.1, Table 4.
  • [44] C. Kerl, J. Stuckler, and D. Cremers (2015) Dense continuous-time tracking and mapping with rolling shutter rgb-d cameras. In Proceedings of the IEEE international conference on computer vision, pp. 2264–2272. Cited by: §2.
  • [45] C. Kerl, J. Sturm, and D. Cremers (2013) Dense visual slam for rgb-d cameras. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2100–2106. Cited by: §1, §2, §6.2.
  • [46] P. Kim, B. Coltin, and H. Jin Kim (2018) Linear rgb-d slam for planar environments. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 333–348. Cited by: §2.
  • [47] B. Kitt, A. Geiger, and H. Lategahn (2010) Visual odometry based on stereo image sequences with ransac-based outlier rejection scheme. In 2010 ieee intelligent vehicles symposium, pp. 486–492. Cited by: §1, §2.
  • [48] L. Kneip, D. Scaramuzza, and R. Siegwart (2011) A novel parametrization of the perspective-three-point problem for a direct computation of absolute camera position and orientation. In CVPR 2011, pp. 2969–2976. Cited by: §5.2.
  • [49] S. Li, F. Xue, X. Wang, Z. Yan, and H. Zha (2019) Sequential adversarial learning for self-supervised deep visual odometry. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2851–2860. Cited by: §2.
  • [50] C. Liu et al. (2009) Beyond pixels: exploring new representations and applications for motion analysis. Ph.D. Thesis, Massachusetts Institute of Technology. Cited by: §4.
  • [51] P. Liu, M. Lyu, I. King, and J. Xu (2019) SelFlow: self-supervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4571–4580. Cited by: §2.
  • [52] S. Y. Loo, A. J. Amiri, S. Mashohor, S. H. Tang, and H. Zhang (2019) CNN-svo: improving the mapping in semi-direct visual odometry using single-image depth prediction. In 2019 International Conference on Robotics and Automation (ICRA), pp. 5218–5223. Cited by: §2.
  • [53] J. Ma, J. Zhao, J. Tian, A. L. Yuille, and Z. Tu (2014) Robust point matching via vector field consensus. IEEE Transactions on Image Processing 23 (4), pp. 1706–1721. Cited by: §4.
  • [54] R. Mahjourian, M. Wicke, and A. Angelova (2018) Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5667–5675. Cited by: §2.
  • [55] A. Masselli and A. Zell (2014) A new geometric approach for faster solving the perspective-three-point problem. In 2014 22nd International Conference on Pattern Recognition, pp. 2119–2124. Cited by: §5.2.
  • [56] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos (2015) ORB-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics 31 (5), pp. 1147–1163. Cited by: §2.
  • [57] R. Mur-Artal and J. D. Tardós (2017) Orb-slam2: an open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics 33 (5), pp. 1255–1262. Cited by: §1, §2, §6.2, Table 3.
  • [58] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison (2011) DTAM: dense tracking and mapping in real-time. In 2011 International Conference on Computer Vision, pp. 2320–2327. Cited by: §1, §2.
  • [59] D. Nistér, O. Naroditsky, and J. Bergen (2006) Visual odometry for ground vehicle applications. Journal of Field Robotics 23 (1), pp. 3–20. Cited by: §1.
  • [60] D. Nistér (2004) An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence 26 (6), pp. 756–777. Cited by: §2.
  • [61] G. Pascoe, W. Maddern, M. Tanner, P. Piniés, and P. Newman (2017) Nid-slam: robust monocular slam using normalised information distance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1435–1444. Cited by: §2.
  • [62] F. Pereira, J. Luft, G. Ilha, A. Sofiatti, and A. Susin (2017) Backward motion for estimation enhancement in sparse visual odometry. In 2017 Workshop of Computer Vision (WVC), pp. 61–66. Cited by: Table 3.
  • [63] T. Pire, T. Fischer, G. Castro, P. De Cristóforis, J. Civera, and J. J. Berlles (2017) S-ptam: stereo parallel tracking and mapping. Robotics and Autonomous Systems 93, pp. 27–42. Cited by: Table 3.
  • [64] T. Pire, T. Fischer, J. Civera, P. De Cristóforis, and J. J. Berlles (2015) Stereo parallel tracking and mapping for robot localization. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1373–1378. Cited by: §1, §2, Table 3.
  • [65] A. Ranjan and M. J. Black (2017) Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §2.
  • [66] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12240–12249. Cited by: §2.
  • [67] J. Revaud, P. Weinzaepfel, Z. Harchaoui, and C. Schmid (2015) Epicflow: edge-preserving interpolation of correspondences for optical flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1164–1172. Cited by: §4.
  • [68] P. J. Rousseeuw (1984) Least median of squares regression. Journal of the American statistical association 79 (388), pp. 871–880. Cited by: §5.3.
  • [69] M. Sanfourche, V. Vittori, and G. Le Besnerais (2013) Evo: a realtime embedded stereo odometry for mav applications. In 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2107–2114. Cited by: Table 3.
  • [70] T. Schops, T. Sattler, and M. Pollefeys (2019) BAD slam: bundle adjusted direct rgb-d slam. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 134–144. Cited by: §2.
  • [71] D. Schubert, N. Demmel, V. Usenko, J. Stuckler, and D. Cremers (2018) Direct sparse odometry with rolling shutter. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 682–697. Cited by: §2.
  • [72] L. Sheng, D. Xu, W. Ouyang, and X. Wang (2019) Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4302–4311. Cited by: §2.
  • [73] S. Song, M. Chandraker, and C. C. Guest (2013) Parallel, real-time monocular visual odometry. In 2013 ieee international conference on robotics and automation, pp. 4698–4705. Cited by: §6.1, Table 3.
  • [74] S. Song and M. Chandraker (2014) Robust scale estimation in real-time monocular sfm for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1566–1573. Cited by: §6.1, Table 3.
  • [75] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers (2012) A benchmark for the evaluation of rgb-d slam systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 573–580. Cited by: §1, §6.2.
  • [76] D. Sun, X. Yang, M. Liu, and J. Kautz (2018) PWC-net: cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8934–8943. Cited by: §1, §2, §4, §6.1.
  • [77] C. Tang and P. Tan (2018) Ba-net: dense bundle adjustment network. arXiv preprint arXiv:1806.04807. Cited by: §2.
  • [78] K. Tateno, F. Tombari, I. Laina, and N. Navab (2017) Cnn-slam: real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6243–6252. Cited by: §2.
  • [79] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox (2017) Demon: depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5038–5047. Cited by: §2.
  • [80] C. Wang, J. Miguel Buenaposada, R. Zhu, and S. Lucey (2018) Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2022–2030. Cited by: §2.
  • [81] R. Wang, S. M. Pizer, and J. Frahm (2019) Recurrent neural network for (un-) supervised learning of monocular video visual odometry and depth. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5555–5564. Cited by: §2.
  • [82] S. Wang, R. Clark, H. Wen, and N. Trigoni (2017) Deepvo: towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050. Cited by: §2.
  • [83] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu (2019) UnOS: unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8071–8081. Cited by: §2.
  • [84] Y. Wang, Y. Yang, Z. Yang, L. Zhao, P. Wang, and W. Xu (2018) Occlusion aware unsupervised learning of optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4884–4893. Cited by: §2.
  • [85] T. Whelan, S. Leutenegger, R. Salas-Moreno, B. Glocker, and A. Davison (2015) ElasticFusion: dense slam without a pose graph. Cited by: §2.
  • [86] J. Wulff, L. Sevilla-Lara, and M. J. Black (2017) Optical flow in mostly rigid scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4671–4680. Cited by: §3.
  • [87] F. Xue, X. Wang, S. Li, Q. Wang, J. Wang, and H. Zha (2019) Beyond tracking: selecting memory and refining poses for deep visual odometry. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8575–8583. Cited by: §2.
  • [88] N. Yang, R. Wang, J. Stuckler, and D. Cremers (2018) Deep virtual stereo odometry: leveraging deep depth prediction for monocular direct sparse odometry. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 817–833. Cited by: §2.
  • [89] Y. Yang and S. Soatto (2018) Conditional prior networks for optical flow. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 271–287. Cited by: §2.
  • [90] Z. Yin and J. Shi (2018) Geonet: unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992. Cited by: §2.
  • [91] M. Yokozuka, S. Oishi, S. Thompson, and A. Banno (2019) VITAMIN-e: visual tracking and mapping with extremely dense feature points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9641–9650. Cited by: §2.
  • [92] B. Zeisl, P. F. Georgel, F. Schweiger, E. G. Steinbach, N. Navab, and G. Munich (2009) Estimation of location uncertainty for scale invariant features points.. In BMVC, pp. 1–12. Cited by: §1.
  • [93] H. Zhan, R. Garg, C. Saroj Weerasekera, K. Li, H. Agarwal, and I. Reid (2018) Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 340–349. Cited by: §2.
  • [94] F. Zhang, V. Prisacariu, R. Yang, and P. H.S. Torr (2019-06) GA-net: guided aggregation net for end-to-end stereo matching. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1, Table 4.
  • [95] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang (2019) Pattern-affinitive propagation across depth, surface normal and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4106–4115. Cited by: §2.
  • [96] C. Zhao, L. Sun, P. Purkait, T. Duckett, and R. Stolkin (2018) Learning monocular visual odometry with dense 3d mapping from dense 3d flow. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 6864–6871. Cited by: §2.
  • [97] Y. Zhao and P. A. Vela (2018) Good line cutting: towards accurate pose tracking of line-assisted vo/vslam. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 516–531. Cited by: §2.
  • [98] E. Zheng, E. Dunn, V. Jojic, and J. Frahm (2014) Patchmatch based joint view selection and depthmap estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1517. Cited by: §4, §5.1.
  • [99] Y. Zheng, Y. Kuang, S. Sugimoto, K. Astrom, and M. Okutomi (2013) Revisiting the pnp problem: a fast, general and optimal solution. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2344–2351. Cited by: §5.2.
  • [100] Y. Zhong, P. Ji, J. Wang, Y. Dai, and H. Li (2019) Unsupervised deep epipolar flow for stationary or dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12095–12104. Cited by: §2.
  • [101] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe (2017) Unsupervised learning of depth and ego-motion from video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1851–1858. Cited by: §2.
  • [102] Y. Zou, Z. Luo, and J. Huang (2018) Df-net: unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 36–53. Cited by: §2.