Visual odometry using depth cameras is the problem of finding a rigid-body transformation between two colored point clouds. This problem arises frequently in robotics and computer vision and is an integral part of many autonomous systems [1, 2, 3, 4, 5]. Direct visual odometry methods minimize the photometric error using image intensity values measured by the camera [6, 7, 8]. An explicit representation relating color information to 2D/3D geometry (image or Euclidean space coordinates) is not directly available; hence, current direct methods use numerical differentiation to compute the gradient and are limited to a fixed image size and resolution, given the camera model and measurements for re-projection of the 3D points. In this setup, a coarse-to-fine image pyramid [9, Section 3.5] is constructed to solve the same problem several times, with initialization provided by the solution of the previous, coarser step.
Alternatively, Continuous Visual Odometry (CVO) [10] is a continuous and direct formulation of, and solution for, the RGB-D visual odometry problem. Due to its continuous representation, CVO requires neither data association between the two measurement sets nor the same number of measurements in each set. In addition, there is no need to construct a coarse-to-fine image pyramid in the continuous sensor registration framework developed in [10]. In this framework, the joint appearance and geometric embedding is modeled by representing the processes (RGB-D images) in a Reproducing Kernel Hilbert Space (RKHS) [11, 12].
Robust visual tracking has become a core aspect of state-of-the-art robotic perception and navigation in both structured and unstructured indoor and outdoor environments [15, 16, 13, 17, 18, 19, 20]. Hence, this work contributes to the foundations of robotic perception and autonomous systems via a continuous sensor registration framework enhanced by an adaptive hyperparameter learning strategy. In particular, this work makes the following contributions:
We extend the continuous visual odometry framework for RGB-D cameras to an adaptive framework via online hyperparameter learning. We also perform a sensitivity analysis of the problem and propose a systematic way to choose the sparsification threshold discussed in [10].
We generalize the appearance (color) information inner product in [10] to a kernelized form that improves performance. With this improvement alone, the experimental evaluations show that the original continuous visual odometry is intrinsically robust, and its performance is similar to that of the state-of-the-art robust dense (and direct) RGB-D visual odometry method [13].
The remainder of this paper is organized as follows. The problem setup is given in §II. The adaptive continuous visual odometry framework is discussed in §III. The sensitivity analysis of the problem is provided in §IV. The experimental results are presented in §V. Finally, §VI concludes the paper and discusses future research directions.
II Problem Setup
Consider two (finite) collections of points, X = {x_i} and Z = {z_j} ⊂ ℝ³. We want to determine which element h ∈ SE(3), where h(z) := Rz + T with R ∈ SO(3) and T ∈ ℝ³, aligns the two point clouds X and h(Z) the “best.” To assist with this, we will assume that each point contains information described by a point in an inner product space, (V, ⟨·,·⟩_V). To this end, we will introduce two labeling functions, ℓ_X : X → V and ℓ_Z : Z → V.
In order to measure their alignment, we will be turning the clouds, X and Z, into functions f_X and f_Z that live in some reproducing kernel Hilbert space, H. The action of SE(3) on ℝ³ induces an action on functions by h · f(x) := f(h⁻¹(x)). Inspired by this observation, we will set h · f_Z := f_{h(Z)}.
The problem of aligning the point clouds can now be rephrased as maximizing the scalar product of f_X and h · f_Z, i.e., we want to solve

Problem 1: maximize F(h) := ⟨f_X, h · f_Z⟩_H over h ∈ SE(3).
II-A Constructing the functions
We follow the same steps as in [10], with an additional step in which we use the kernel trick to kernelize the information inner product. For the kernel of our RKHS, H, we first choose the squared exponential kernel k:

k(x, z) = σ² exp(−‖x − z‖² / (2ℓ²)),   (2)

for some fixed real parameters (hyperparameters) σ (the signal variance) and ℓ (the length-scale), where ‖·‖ is the standard Euclidean norm on ℝ³. This allows us to turn the point clouds into functions via

f_X(·) = Σ_i ℓ_X(x_i) k(·, x_i),   f_Z(·) = Σ_j ℓ_Z(z_j) k(·, z_j).   (3)
We can now define the inner product of f_X and f_Z by

⟨f_X, f_Z⟩_H = Σ_{i,j} ⟨ℓ_X(x_i), ℓ_Z(z_j)⟩_V · k(x_i, z_j).   (4)
We use the well-known kernel trick in machine learning [21, 22, 23] to substitute the inner products in (4) with the appearance (color) kernel. The kernel trick can be applied to carry out computations implicitly in the high-dimensional space, which leads to computational savings when the dimensionality of the feature space is large compared to the number of data points. After applying the kernel trick to (4), we get

⟨f_X, f_Z⟩_H = Σ_{i,j} k_c(ℓ_X(x_i), ℓ_Z(z_j)) · k(x_i, z_j),   (5)
where we choose k_c to also be the squared exponential kernel, with fixed real hyperparameters σ_c and ℓ_c that are set independently of σ and ℓ.
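As a concrete illustration, the doubly kernelized inner product above can be computed by summing over all point pairs. The sketch below assumes squared exponential kernels for both the spatial and color terms; the struct and parameter names (`Point`, `Label`, `sigma`, `ell`, `sigma_c`, `ell_c`) are illustrative, not the released implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Point { double x, y, z; };   // Euclidean coordinates
struct Label { double h, s, v; };   // appearance (color) label, e.g., in HSV

// Squared exponential kernel evaluated on a squared distance:
// k = sigma^2 * exp(-d2 / (2 * ell^2))
double se_kernel(double d2, double sigma, double ell) {
  return sigma * sigma * std::exp(-d2 / (2.0 * ell * ell));
}

double sq_dist(const Point& a, const Point& b) {
  return (a.x-b.x)*(a.x-b.x) + (a.y-b.y)*(a.y-b.y) + (a.z-b.z)*(a.z-b.z);
}
double sq_dist(const Label& a, const Label& b) {
  return (a.h-b.h)*(a.h-b.h) + (a.s-b.s)*(a.s-b.s) + (a.v-b.v)*(a.v-b.v);
}

// <f_X, f_Z> = sum over i,j of k_c(l_X(x_i), l_Z(z_j)) * k(x_i, z_j)
double inner_product(const std::vector<Point>& X, const std::vector<Label>& lX,
                     const std::vector<Point>& Z, const std::vector<Label>& lZ,
                     double sigma, double ell,        // spatial hyperparameters
                     double sigma_c, double ell_c) {  // color hyperparameters
  double total = 0.0;
  for (std::size_t i = 0; i < X.size(); ++i)
    for (std::size_t j = 0; j < Z.size(); ++j)
      total += se_kernel(sq_dist(lX[i], lZ[j]), sigma_c, ell_c) *
               se_kernel(sq_dist(X[i], Z[j]), sigma, ell);
  return total;
}
```

In practice, the kernel sparsification threshold of Table I drops pairs whose spatial kernel value is negligible, so the double sum is sparse rather than quadratic in the number of points.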
III Adaptive Continuous Visual Odometry via Online Hyperparameter Learning
The length-scale of the kernel, ℓ, is an important hyperparameter that significantly affects the performance and convergence of the algorithm. In the original framework [10], ℓ was set using a fixed set of conditions within the solver, reducing the length-scale as the algorithm approached a local minimum. Intuitively, large values of ℓ encourage higher correlations between points that are far apart from each other, whereas small values of ℓ encourage the algorithm to focus only on points that are very close to each other with respect to the distance metric of the kernel (here, the Euclidean distance). The latter case results in faster convergence and can be thought of as refinement steps where the target and source clouds are already almost aligned.
Now the question to answer is: how can we tune ℓ automatically and online at each iteration so that the overall registration performance is maximized? In this section, we provide a solution based on a greedy gradient descent search. As we will see, this approach is highly appealing due to its simplicity and the gain in performance. We first revisit Problem 1. The maximization of the inner product is a reduced form of minimizing the original cost ‖f_X − h · f_Z‖²_H, using the fact that the action of h is an isometry. That is,

‖f_X − h · f_Z‖²_H = ⟨f_X, f_X⟩_H + ⟨f_Z, f_Z⟩_H − 2⟨f_X, h · f_Z⟩_H,
where the corresponding coefficients for ⟨f_X, f_X⟩_H and ⟨f_Z, f_Z⟩_H are defined, for each function’s inner product with itself, similarly to (II-A).
Computing the gradient of (III) with respect to ℓ is straightforward. Since every term in (III) is a weighted squared exponential kernel, we have ∂k(x, z)/∂ℓ = k(x, z) ‖x − z‖²/ℓ³, and the gradient ∂F/∂ℓ is the corresponding weighted sum of kernel terms over all point pairs. Then, using the following update (integration) rule, we find the length-scale for the next iteration:

ℓ_{k+1} = ℓ_k + β (∂F/∂ℓ)(ℓ_k),

where β > 0 is the step size (learning rate).
Based on our observations, this strategy alone can lead to failure or extremely poor performance. The reason is that CVO uses semi-dense data, and in the absence of structure or texture in the environment the gradient can be weak or not well-behaved. To address this problem, we simply define a search interval for the length-scale as [ℓ_min, ℓ_max]. This additional step not only keeps ℓ in a feasible region but also allows the algorithm to detect when tracking is difficult and issue a warning message. To improve convergence, when ℓ reaches ℓ_min, we reduce both ℓ_min and ℓ_max by a reduction factor, γ, and continue as before.
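A minimal sketch of the resulting adaptive loop follows. It assumes the objective is a weighted sum of squared exponential terms, so that ∂k/∂ℓ = k · d²/ℓ³; the names (`Term`, `ell_min`, `ell_max`, `beta`, `gamma`) mirror the quantities described above but are illustrative, and the sign of each weight `c` depends on whether the term comes from the cross term or a self-similarity term.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// One pairwise term of the objective: weight c (e.g., the color kernel value,
// possibly negative for self-similarity terms) and squared spatial distance d2.
struct Term { double c, d2; };

// For k = exp(-d2 / (2 ell^2)), dk/dell = k * d2 / ell^3, so the gradient of
// the objective is a weighted sum of kernel terms.
double objective_gradient(const std::vector<Term>& terms, double ell) {
  double grad = 0.0;
  for (const Term& t : terms) {
    double k = std::exp(-t.d2 / (2.0 * ell * ell));
    grad += t.c * k * t.d2 / (ell * ell * ell);
  }
  return grad;
}

struct LengthScale {
  double ell, ell_min, ell_max;

  // One gradient step on ell, clamped to [ell_min, ell_max]; when ell reaches
  // ell_min, shrink the whole interval by gamma (the refinement behavior
  // described above).
  void update(double grad, double beta, double gamma) {
    ell += beta * grad;
    if (ell > ell_max) ell = ell_max;
    if (ell <= ell_min) {
      ell = ell_min;
      ell_min *= gamma;
      ell_max *= gamma;
    }
  }
};
```

The interval clamp is what makes the scheme robust: a weak or ill-behaved gradient can at worst pin ℓ to a boundary, which is also the signal that tracking is difficult.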
IV Sensitivity Analysis
Understanding how the kernel k in equation (2) depends on ℓ is a delicate problem which, surprisingly, offers a systematic way to choose the kernel sparsification threshold (see Table I). Consider the following normalization of k:

k̄ := k/σ² = exp(−‖x − z‖²/(2ℓ²)),

so 0 < k̄ ≤ 1. Suppose we want to find an approximation of k̄ as ℓ gets small; with the substitution t := 2ℓ²/‖x − z‖², this is equivalent to understanding h(t) = exp(−1/t) as t → 0⁺. Performing a Taylor expansion of h about t = 0 results in the zero function. A simple enough calculation shows that

h⁽ⁿ⁾(t) = Rₙ(t) exp(−1/t),

where Rₙ is some rational function. Because the exponential decay dominates any rational growth, we have

lim_{t→0⁺} R(t) exp(−1/t) = 0

for any rational function R. This shows that the Taylor series of h about t = 0 is trivially zero. (The underlying reason the Taylor series is zero while the function is not is that h is not analytic at t = 0. In fact, if we view h as a complex function, there is an essential singularity at t = 0; see §5.6 in [24].)
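Spelled out, the argument above is the classic example of a smooth function that is not analytic at the origin; assuming the function in question is h(t) = e^{-1/t} (extended by h(0) = 0), the derivatives can be written explicitly:

```latex
h(t) = e^{-1/t}, \qquad
h'(t) = \frac{1}{t^{2}}\,e^{-1/t}, \qquad
h^{(n)}(t) = R_n(t)\,e^{-1/t}
\]
for rational functions $R_n$ generated by the product rule, e.g.
\[
h''(t) = \left(\frac{1}{t^{4}} - \frac{2}{t^{3}}\right) e^{-1/t}.
\]
Since $e^{-1/t}$ decays faster than any power of $t$ grows as $t \to 0^{+}$,
\[
\lim_{t \to 0^{+}} R(t)\,e^{-1/t} = 0 \quad \text{for every rational } R,
\]
so $h^{(n)}(0) = 0$ for all $n$, and the Maclaurin series of $h$ is identically
zero even though $h \not\equiv 0$.
```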
Rather than expanding about t = 0 (where points are far apart relative to ℓ), we can expand about t = ∞ (where points are close together). This results in the following expansion:

exp(−1/t) = Σ_{n=0}^{N} (−1)ⁿ/(n! tⁿ) + O(t^{−(N+1)}).   (12)

While this approximation is accurate when t is large, it falls apart as t approaches zero: the exact function, exp(−1/t), approaches zero as t → 0⁺, but the truncated approximation has a pole at zero regardless of the order N. This motivates a minimum cutoff for t such that (12) has a well-controlled error. By applying this cutoff to the original function k̄, we obtain a kernel sparsification threshold that guarantees error bounds in the approximation (12). A plot of these values is shown in Fig. 4.
V Experimental Results
We now present experimental evaluations of the proposed method, Adaptive CVO (A-CVO). We compare A-CVO with the original CVO [10] and the state-of-the-art direct (and dense) RGB-D visual odometry (DVO) [13]. Since the original DVO source code [25] requires an outdated ROS dependency, we reproduced the DVO results using the version provided by Matthieu Pizenberg [26], which only removes the ROS dependency while keeping the DVO core source code unchanged. We also include the DVO results of Kerl et al. [13] for reference. We refer to the reproduced DVO results as DVO and to the results taken directly from [13] as Kerl et al.
Table I: Parameters used in all experiments:
- Transformation convergence threshold
- Gradient norm convergence threshold
- Minimum step length
- Kernel sparsification threshold
- Spatial kernel initial length-scale
- Spatial kernel signal variance
- Spatial kernel minimum length-scale (A-CVO)
- Spatial kernel maximum length-scale (A-CVO)
- Color kernel length-scale
- Color kernel signal variance
- Integration step size (A-CVO)
- Reduction factor (A-CVO)
Table II: Translational and rotational RPE RMSE for the fr1 sequences; columns: CVO [10], A-CVO, Kerl et al. [13], and DVO.
V-A Experimental Setup
To improve computational efficiency, we adopted an approach similar to Direct Sparse Odometry (DSO) by Engel et al. [18] to create a semi-dense point cloud (around 3000 points) for each scan. To prevent too few points from being selected in environments that lack rich visual information, we also used the Canny edge detector [28] from OpenCV [27]. When the DSO point selector returns fewer than one-third of the desired number of points, more points are selected by downsampling the pixels highlighted by the Canny detector. While generating the point cloud, RGB values are first transformed into the HSV color space and normalized. The normalized HSV values are then combined with the normalized intensity gradients and used as the labels of the selected points in the color space. For all experiments, we used the same set of parameters, which are listed in Table I.
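The fallback rule described above can be sketched as follows; the function names and the uniform-stride downsampling are illustrative assumptions, not the released implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Trigger the Canny fallback when the DSO-style selector returns fewer than
// one-third of the desired number of points.
bool need_canny_fallback(std::size_t selected, std::size_t desired) {
  return 3 * selected < desired;
}

// Uniformly downsample edge-pixel indices so that roughly `missing` extra
// points are added to the semi-dense cloud.
std::vector<std::size_t> downsample_edges(std::size_t n_edge_pixels,
                                          std::size_t missing) {
  std::vector<std::size_t> picked;
  if (missing == 0 || n_edge_pixels == 0) return picked;
  std::size_t stride = n_edge_pixels / missing;
  if (stride == 0) stride = 1;
  for (std::size_t i = 0; i < n_edge_pixels && picked.size() < missing; i += stride)
    picked.push_back(i);
  return picked;
}
```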
All experiments were performed on a Dell XPS15 9750 laptop with an Intel i7-8750H CPU (6 cores at 2.20 GHz) and 32 GB RAM. The source code is implemented in C++ and compiled with the Intel Compiler. The kernel computations are parallelized using Intel Threading Building Blocks (TBB) [29]. Using compiler auto-vectorization and the parallelization, the average time for frame-to-frame registration is 0.5 s. The frame-to-frame registration time for the original CVO is 0.2 s (5 Hz).
V-B TUM RGB-D Benchmark
We performed experiments on two parts of the RGB-D SLAM dataset and benchmark by the Technical University of Munich [14]. This dataset was collected indoors with a Microsoft Kinect, using a motion capture system as a proxy for ground-truth trajectory. For all tracking experiments, the entire image sequences were used without skipping frames, i.e., at the full frame rate. We evaluated A-CVO, CVO, and DVO on the training and validation sets of all the fr1 sequences and the structure versus texture sequences. The RGB-D benchmark tools [14] were then used to evaluate the Relative Pose Error (RPE) of all three methods, and evo [30] was used to visualize the trajectories.
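For reference, the translational part of the RPE RMSE reported in the tables below reduces, for per-frame relative translations, to the following computation; this is a simplified sketch of what the benchmark tools do on full SE(3) pose pairs, shown here only to fix the metric's definition.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Root-mean-squared norm of the difference between the estimated and
// ground-truth relative translations, one entry per frame pair.
double rpe_trans_rmse(const std::vector<Vec3>& gt, const std::vector<Vec3>& est) {
  double sum_sq = 0.0;
  for (std::size_t i = 0; i < gt.size(); ++i) {
    double dx = est[i][0] - gt[i][0];
    double dy = est[i][1] - gt[i][1];
    double dz = est[i][2] - gt[i][2];
    sum_sq += dx * dx + dy * dy + dz * dz;
  }
  return std::sqrt(sum_sq / static_cast<double>(gt.size()));
}
```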
Table II shows the Root-Mean-Squared Error (RMSE) of the RPE for the fr1 sequences. The Trans. columns show the RMSE of the translational drift in m/s, and the Rot. columns show the RMSE of the rotational drift in deg/s. The Average* row shows the average computed by excluding the fr1/floor sequence, since Kerl et al. reported failure on that sequence [13]; the rotational errors were not reported in the original paper [13]. There are no corresponding validation sequences for fr1/teddy and fr1/floor. A-CVO improves the performance over CVO and outperforms DVO on both translational and rotational metrics. On the training sequences, A-CVO reduces the average translational error of CVO by 7.2%, and on the validation sequences the improvement reaches 17.2%. A-CVO has a 6.9% lower translational error than Kerl et al. on the training set (excluding the failure case). On the validation set, A-CVO has 17.1% better performance compared with Kerl et al., which shows that A-CVO generalizes better across different scenarios. It is worth noting that CVO is intrinsically robust, and its performance is similar to that of the state-of-the-art robust dense (and direct) RGB-D visual odometry method [13]. The next experiment further reveals that CVO has the advantage of performing well in extreme environments that lack rich structure or texture.
Table III: Translational and rotational RPE RMSE per sequence; columns: Sequence, CVO [10], A-CVO, Kerl et al. [13], and DVO.
V-C Experiments using Structure vs. Texture Sequences
Table III shows the RMSE of the RPE for the structure vs. texture sequences. This dataset contains image sequences in structure/nostructure and texture/notexture environments. As elaborated in [10], by treating point clouds as points in the function space (RKHS), CVO and A-CVO are inherently robust to the lack of features in the environment. A-CVO and CVO show the best performance in the cases where either structure or texture is lacking in the environment. This reinforces the claim in [10] that CVO is robust to such scenes.
However, through online hyperparameter learning, A-CVO allows the parameters to adapt to the environment without manual tuning, which improves the performance over the original CVO. We also note that DVO has the best performance in the case where the environment contains rich texture and structure information. This can be attributed to two reasons: 1) CVO and A-CVO adopt the semi-dense point cloud construction from DSO [18], while DVO uses the entire dense image without subsampling. Although the semi-dense tracking approach of Engel et al. [17, 18] is computationally attractive and we advocate it, the semi-dense point cloud construction used in this work is a heuristic and might not capture the relevant information in each frame optimally; 2) DVO uses a motion prior as a regularizer, whereas CVO and A-CVO depend solely on the camera information with no regularizer. We conjecture the latter is the reason DVO does not perform as well on the validation sequences as on the training set. The motion prior is a useful assumption when it is true! It can help tune the method better on the training sets, but if the assumption is violated, it can lead to poor performance. The addition of an IMU sensor, of course, can improve the performance of all the compared methods and is an interesting future research direction.
V-D Discussions and Limitations
We have shown that A-CVO and CVO perform well across different indoor scenarios and different structure and texture conditions. The TUM RGB-D benchmark used in this paper was collected using a Microsoft Kinect for Xbox 360, which has a rolling-shutter camera and is not designed for robotic applications. We observed that images blurred by camera motion are the most challenging frames for registration. The performance can degrade considerably, as the extraction of the semi-dense structure cannot accurately capture the structure and texture of the scene. For example, a table edge, usually a reliable part of an image, can when blurred result in hallucinating multiple lines. Although more recent cameras used in robotics often have global shutters, the problem is still relevant and should be addressed. Exploring point selection strategies to improve the performance on challenging frames is an interesting topic for future work.
The current implementation of CVO/A-CVO exploits vectorization and multi-threading, which means the provided software gains additional performance automatically as vector registers continue to become wider. However, robotic applications require real-time software, and more work is needed to achieve real-time performance on CPUs. An interesting research avenue for obtaining real-time performance is a GPU implementation of CVO/A-CVO.
VI Conclusion and Future Work
We have developed an adaptive continuous visual odometry method for RGB-D cameras via online hyperparameter learning. The experimental results indicate that the original continuous visual odometry is intrinsically robust and its performance is similar to that of the state-of-the-art robust dense (and direct) RGB-D visual odometry method. Moreover, online learning of the kernel length-scale brings significant performance improvement and enables the method to perform better across different domains even in the absence of structure and texture in the environment.
In the future, we can use the invariant IMU model in [31] to predict the next camera pose and use the predicted pose as the initial guess in the A-CVO algorithm. This alone can increase performance, as the model performs an exact integration over the small time interval between two images. The integration of A-CVO into multisensor fusion systems [32, 33, 34, 35, 36] and keyframe-based odometry and SLAM systems [16, 37] are also interesting future research directions.
This article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity. The authors would like to thank Andreas Girgensohn for the helpful discussion on choosing the HSV colormap.
-  T. Whelan, H. Johannsson, M. Kaess, J. J. Leonard, and J. McDonald, “Robust real-time visual odometry for dense RGB-D mapping,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 5724–5731.
-  S. A. Scherer and A. Zell, “Efficient onboard RGBD-SLAM for autonomous MAVs,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 1062–1068.
-  R. G. Valenti, I. Dryanovski, C. Jaramillo, D. P. Ström, and J. Xiao, “Autonomous quadrotor flight using onboard RGB-D visual odometry,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2014, pp. 5233–5238.
-  G. Loianno, J. Thomas, and V. Kumar, “Cooperative localization and mapping of MAVs using RGB-D sensors,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2015, pp. 4021–4028.
-  A. S. Huang, A. Bachrach, P. Henry, M. Krainin, D. Maturana, D. Fox, and N. Roy, “Visual odometry and mapping for autonomous flight using an RGB-D camera,” in Robotics Research. Springer, 2017, pp. 235–252.
-  C. Audras, A. Comport, M. Meilland, and P. Rives, “Real-time dense appearance-based SLAM for RGB-D sensors,” in Australasian Conference on Robotics and Automation, 2011.
-  R. A. Newcombe, S. J. Lovegrove, and A. J. Davison, “DTAM: Dense tracking and mapping in real-time,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2011, pp. 2320–2327.
-  F. Steinbrücker, J. Sturm, and D. Cremers, “Real-time visual odometry from dense RGB-D images,” in Proceedings of the IEEE International Conference on Computer Vision. IEEE, 2011, pp. 719–722.
-  R. Szeliski, Computer Vision: Algorithms and Applications. Springer Science & Business Media, 2010.
-  M. Ghaffari, W. Clark, A. Bloch, R. M. Eustice, and J. W. Grizzle, “Continuous direct sparse visual odometry from RGB-D images,” in Proceedings of the Robotics: Science and Systems Conference, Freiburg, Germany, June 2019.
-  B. Schölkopf, R. Herbrich, and A. Smola, “A generalized representer theorem,” in Computational learning theory. Springer, 2001, pp. 416–426.
-  A. Berlinet and C. Thomas-Agnan, Reproducing kernel Hilbert spaces in probability and statistics. Kluwer Academic, 2004.
-  C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for RGB-D cameras,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2013, pp. 3748–3754.
-  J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Oct. 2012.
-  S. Klose, P. Heise, and A. Knoll, “Efficient compositional approaches for real-time robust direct visual odometry from RGB-D data,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 1100–1106.
-  C. Kerl, J. Sturm, and D. Cremers, “Dense visual SLAM for RGB-D cameras,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2013, pp. 2100–2106.
-  J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in Proceedings of the European Conference on Computer Vision. Springer, 2014, pp. 834–849.
-  J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2018.
-  A. R. Vidal, H. Rebecq, T. Horstschaefer, and D. Scaramuzza, “Ultimate SLAM? combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 994–1001, 2018.
-  S. Bryner, G. Gallego, H. Rebecq, and D. Scaramuzza, “Event-based, direct camera tracking from a photometric 3D map using nonlinear optimization,” in Proceedings of the IEEE International Conference on Robotics and Automation, vol. 2, 2019.
-  C. M. Bishop, Pattern recognition and machine learning. Springer, 2006.
-  C. Rasmussen and C. Williams, Gaussian processes for machine learning. MIT press, 2006, vol. 1.
-  K. P. Murphy, Machine learning: a probabilistic perspective. The MIT Press, 2012.
-  E. Saff and A. Snider, Fundamentals of Complex Analysis with Applications to Engineering and Science. Prentice Hall, 2003.
-  C. Kerl, “Dense Visual Odometry (dvo),” https://github.com/tum-vision/dvo, 2013.
-  M. Pizenberg, “DVO core (without ROS dependency),” https://github.com/mpizenberg/dvo/tree/76f65f0c9b438675997f595471d39863901556a9, 2019.
-  G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools, 2000.
-  J. Canny, “A computational approach to edge detection,” in Readings in Computer Vision. Elsevier, 1987, pp. 184–203.
-  Intel Corporation, “Official Threading Building Blocks (TBB) GitHub repository,” https://github.com/intel/tbb, 2019.
-  M. Grupp, “evo: Python package for the evaluation of odometry and SLAM.” https://github.com/MichaelGrupp/evo, 2017.
-  R. Hartley, M. Ghaffari, R. M. Eustice, and J. W. Grizzle, “Contact-aided invariant extended Kalman filtering for robot state estimation,” arXiv preprint arXiv:1904.09251, 2019.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual–inertial odometry using nonlinear optimization,” International Journal of Robotics Research, vol. 34, no. 3, pp. 314–334, 2015.
-  V. Usenko, J. Engel, J. Stückler, and D. Cremers, “Direct visual-inertial odometry with stereo cameras,” in Proceedings of the IEEE International Conference on Robotics and Automation. IEEE, 2016, pp. 1885–1892.
-  C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Transactions on Robotics, vol. 33, no. 1, pp. 1–21, 2017.
-  R. Hartley, M. G. Jadidi, L. Gan, J.-K. Huang, J. W. Grizzle, and R. M. Eustice, “Hybrid contact preintegration for visual-inertial-contact state estimation within factor graphs,” in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Madrid, Spain, October 2018, pp. 3783–3790.
-  K. Eckenhoff, P. Geneva, and G. Huang, “Closed-form preintegration methods for graph-based visual–inertial navigation,” International Journal of Robotics Research, vol. 38, no. 5, pp. 563–586, 2019.
-  R. Wang, M. Schwörer, and D. Cremers, “Stereo DSO: Large-scale direct sparse visual odometry with stereo cameras,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3903–3911.