1 Introduction
Visual odometry is an important area of information fusion in which the central aim is to estimate the pose of a robot using data collected by visual sensors [1]. Because nearly all robotic tasks require knowledge of the pose of the robot, visual odometry plays a critical role in robot control, simultaneous localization and mapping (SLAM) and robot navigation, especially when external reference information about the environment (such as GPS data) is unavailable. Visual odometry can be viewed as a particular instance of the general pose tracking problem, which is the most fundamental perception problem in robotics [2].
To date, a variety of visual odometry methods based on different sensor information have been studied and widely implemented. One of the most well-known methods is the iterative closest point (ICP) algorithm [3], which estimates the robot’s pose by minimizing the distance between corresponding points in two laser scanning snapshots. However, this method can easily become trapped in local optima if a good initial guess is not provided. In addition to the ICP algorithm and its variants, odometry methods using camera images have also been studied [4] [5]. Such methods usually extract point features from the camera images and match them through a series of steps, including descriptor matching, RANSAC and bundle adjustment. Due to their heavy computational burden, these approaches are usually too slow for real-time application. One way of improving computational efficiency is to use sparse point features, but this approach does not fully exploit the available image data, ignoring much relevant information.
Recently, with RGB-D cameras becoming smaller and cheaper, the opportunity has arisen to develop RGB-D odometry methods that exploit both intensity and depth information. One such method was proposed by the Computer Vision Group at the Technical University of Munich (TUM). In this method, a single-objective optimization problem is formulated to penalize the intensity difference between corresponding pixels in consecutive images [6] [7]. This method can be implemented in real time even on a single-core CPU. However, the image depth information is only used to determine the relationship between corresponding pixels in consecutive images for intensity residual comparison; depth residuals are not considered. Thus, a new bi-objective optimization problem was subsequently proposed in [8] to minimize both depth and intensity residuals, with the aim of improving estimation performance.
In this paper, we consider the same bi-objective optimization formulation as in [8]. Our aims are twofold: (i) to propose new computational approaches for solving this bi-objective optimization formulation; and (ii) to explore and quantify the advantages of the bi-objective optimization formulation for improving estimation robustness. The first computational approach we investigate, the so-called weighted sum method, involves integrating the two objective functions into a single objective using a weighting factor. We derive a new formula for adaptive calculation of this weighting factor, which is crucial to estimation accuracy. Our formula is based on a novel image complexity metric and differs from the corresponding formula in [8], which uses the ratio of median intensity and median depth values to calculate the weighting factor. The second computational approach we investigate, the so-called bounded objective method, involves optimizing one of the objective functions while the other objective function is bounded via a constraint. Again, our new image complexity metric is used, this time to determine an appropriate objective bound. To evaluate performance, the open-source TUM RGB-D dataset [9] was used. The computational results demonstrate that our new methods generally give results of superior accuracy compared with the methods in [6] [7] [8].
2 Single-Objective Optimization for Visual Odometry
The camera motion in 3D space has six degrees of freedom and can be denoted as
where , , are the translation components of the motion and , , are the rotation components of the motion. To estimate , we consider a world point and assume that its brightness is the same in two consecutive images. This is the so-called photo-consistency assumption [7], which can be expressed mathematically by
where represents the mapping coordinate of the world point in the first image and represents the corresponding coordinate of in the second image when given the true value of the camera motion . Moreover, and are the brightness (or intensity) values of the specified coordinates in the first and second images, respectively.
Based on the photo-consistency assumption, we can define the intensity difference corresponding to the motion estimate as
According to the results in [7], the more accurate the camera motion estimate, the smaller the residual . Thus, estimation quality in visual odometry can be assessed by considering the following least-squares objective function, which is the sum of squared residuals for world points:
Then the problem of determining the camera motion can be formulated as a least-squares optimization problem, i.e.,
(1) 
To improve robustness, weighted residuals can be used to reduce the effect of noise and outliers in the image data. This motivates the following weighted objective function in quadratic form:
(2) 
where is a diagonal weight matrix and
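As a minimal sketch of the weighted objective in (2), the snippet below evaluates the residual vector and the quadratic form for a diagonal weight matrix. The function name `weighted_objective`, the sign convention (second image minus first image) and the sample values are assumptions made for illustration; the intensities of the second image are assumed to have already been sampled at the warped pixel positions.

```python
def weighted_objective(i1, i2_warped, weights):
    """Return (r, F) with r_i = I2(warped x_i) - I1(x_i) and F = sum_i w_i * r_i**2."""
    if not (len(i1) == len(i2_warped) == len(weights)):
        raise ValueError("all inputs must have the same length")
    residuals = [b - a for a, b in zip(i1, i2_warped)]
    # For a diagonal weight matrix W, r^T W r reduces to a weighted sum of squares.
    value = sum(w * r * r for w, r in zip(weights, residuals))
    return residuals, value


# Hypothetical intensities for three pixels; the third pixel behaves like an outlier.
i1 = [0.50, 0.20, 0.90]
i2_warped = [0.50, 0.30, 0.40]
uniform = weighted_objective(i1, i2_warped, [1.0, 1.0, 1.0])[1]
robust = weighted_objective(i1, i2_warped, [1.0, 1.0, 0.1])[1]
```

Down-weighting the third residual reduces its contribution to the objective, which is the effect the weight matrix is intended to have on noise and outliers.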
3 Bi-Objective Optimization for RGB-D Odometry
Traditional cameras only provide image intensity information. RGB-D cameras, on the other hand, provide image intensity and image depth information, both of which can be used for visual odometry. For example, in the odometry methods introduced by the TUM Computer Vision Group [6] [7], the relationship between corresponding pixels in consecutive images is expressed in terms of the depth information in the first image, and the intensity information of both images is used to define the motion estimation residuals as in Section 2. More precisely, the relationship between corresponding pixels in consecutive images is defined by a warping function as follows:
where is the depth value of the pixel in the first image and is the warping function for calculating the mapping coordinate in the second image. For the specific form of the warping function , we refer the reader to [7].
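The warping step can be sketched as follows, assuming a pinhole camera model: a pixel is back-projected using its depth, moved by a rigid-body transformation, and projected into the second image. This is an illustrative stand-in, not the specific warping function of [7]; the function name `warp`, the representation of the motion as a rotation matrix plus translation vector, and the intrinsic parameters `fx`, `fy`, `cx`, `cy` are assumptions.

```python
def warp(pixel, depth, rotation, translation, fx, fy, cx, cy):
    """Map a pixel of image 1 (with its depth) to image 2 under a rigid motion.

    Returns ((u2, v2), z2): the warped pixel and the point's depth in camera 2.
    """
    u, v = pixel
    # Back-project to a 3D point in the first camera frame (pinhole model).
    p = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid-body motion: q = R p + t.
    q = [sum(rotation[i][j] * p[j] for j in range(3)) + translation[i]
         for i in range(3)]
    if q[2] <= 0.0:
        raise ValueError("point is behind the second camera")
    # Project the transformed point into the second image.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy), q[2]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = dict(fx=525.0, fy=525.0, cx=319.5, cy=239.5)  # illustrative intrinsics

same_pixel, same_depth = warp((100.0, 80.0), 2.0, IDENTITY, [0.0, 0.0, 0.0], **K)
shifted_pixel, shifted_depth = warp((319.5, 239.5), 2.0, IDENTITY, [0.1, 0.0, 0.0], **K)
```

With zero motion the pixel maps to itself, and a small sideways translation shifts the principal point horizontally. The second return value is the depth of the transformed point in the second camera frame, which is the quantity a depth residual would compare against the measured depth.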
Although single-objective optimization-based odometry methods are computationally fast and effective, they can produce poor results in some situations. For example, when textural features in the image sequence are poor, trajectory estimation accuracy decreases dramatically. This is because the objective function depends only on image intensity information, and thus it can become nonconvex when image textural features are lacking. In this case, the “optimal” motion estimates obtained by applying an iterative optimization procedure may only be locally optimal. To investigate this hypothesis, we applied the single-objective optimization approach (implemented using the Gauss-Newton method) to image sequences in the TUM RGB-D dataset [9]. Our results are shown in Fig. 1. From the results, we see that the translation error of the motion estimates increases significantly when textural features are lacking. This motivates the new bi-objective optimization formulation proposed in [8], in which both image intensity and image depth residuals are minimized to improve robustness.
The extension of RGB-D odometry using bi-objective optimization is inspired by the ICP algorithm and its variants, which estimate the sensor motion by minimizing residual coordinate differences, instead of image intensity values. Since RGB-D cameras can provide both intensity and depth information simultaneously, we want to take full advantage of this feature by comparing depth differences, just as the ICP algorithm compares coordinate differences. Thus, we now consider two residuals instead of one:
(3) 
where and are the depth values of the specified coordinates in the first and second images, and projects the 3D coordinate of world point from the first camera coordinate system to the second camera coordinate system based on the homogeneous transformation matrix for . Operator “” selects the coordinate value along the direction. See the diagram in Fig. 2 for an explanation of the notation.
Based on defined in (3), we consider the following objective function:
(4) 
where is a diagonal weight matrix and
Combining objectives (2) and (4), we consider the following bi-objective optimization problem:
(5) 
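For orientation, the two residuals and the bi-objective problem can be written compactly as in the sketch below. The symbols are assumptions chosen for illustration and patterned on the conventions of [7]; they are not necessarily the notation of (2)-(5). Here $\xi$ is the motion, $\tau$ the warping function, $x_i$ a pixel with world point $p_i$, $T(\xi)$ the homogeneous transformation, $[\cdot]_Z$ the selection of the depth coordinate, and $W_I$, $W_Z$ the diagonal weight matrices.

```latex
\begin{align*}
r_{I,i}(\xi) &= I_2\bigl(\tau(\xi, x_i)\bigr) - I_1(x_i), \\
r_{Z,i}(\xi) &= Z_2\bigl(\tau(\xi, x_i)\bigr) - \bigl[T(\xi)\,p_i\bigr]_Z, \\
F_I(\xi) &= r_I(\xi)^\top W_I\, r_I(\xi), \qquad
F_Z(\xi) = r_Z(\xi)^\top W_Z\, r_Z(\xi), \\
&\min_{\xi \in \mathbb{R}^6} \; \bigl( F_I(\xi),\; F_Z(\xi) \bigr).
\end{align*}
```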
3.1 Weighted Sum Method
The weighted sum method is the most common approach to solving multiobjective optimization problems. In this method, the individual objective functions are assigned different weights and then added together to form a single objective function. More specifically, for individual objective functions and decision vector , the combined objective function is
(6)
where are the weights. If all of the weights are positive, then the minimum of (6) is Pareto optimal for the original multiobjective problem [10].
In essence, the objective weights provide additional degrees of freedom in the optimization problem. For our odometry problem (5), the new single-objective optimization problem is defined as
(7) 
Notice that by dividing by , we can obtain an equivalent optimization problem as follows:
(8) 
where . Thus, we only need to consider a single weighting factor .
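The reduction from two weights to a single weighting factor can be checked on a toy problem. The sketch below uses two scalar quadratic objectives, which are hypothetical stand-ins for the intensity and depth objectives, and compares the minimizer of the two-weight form with the minimizer obtained after dividing by the first weight.

```python
def argmin_weighted(a, b, w1, w2):
    """Minimizer of w1*(x - a)**2 + w2*(x - b)**2 for w1, w2 > 0 (closed form)."""
    return (w1 * a + w2 * b) / (w1 + w2)


def argmin_scaled(a, b, lam):
    """Minimizer of (x - a)**2 + lam*(x - b)**2 for lam > 0 (closed form)."""
    return (a + lam * b) / (1.0 + lam)


a, b = 1.0, 3.0      # individual minimizers of the two toy objectives
w1, w2 = 2.0, 6.0    # hypothetical positive weights
x_weighted = argmin_weighted(a, b, w1, w2)
x_scaled = argmin_scaled(a, b, w2 / w1)
```

Both forms give the same minimizer because scaling an objective by a positive constant does not move its minimum, so only the ratio of the weights matters.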
Problem (8) can be solved using the Gauss-Newton method. To do this, we linearize the residuals and using the Taylor expansion proposed in [11]:
where “” denotes the addition operator in Lie group SE(3) (for more details, see [12]); and and are the Jacobians defined by
Then the objective function in (8) can be approximated by a quadratic function of :
(9) 
where , and ().
Suppose that at iteration , we have the motion estimate . Then the increment should be chosen to minimize . According to the Gauss-Newton method, by differentiating (9) with respect to , the optimal value of satisfies the linear system
(10) 
where denotes with and denotes with . To solve this linear system, methods such as Cholesky decomposition can be used. After solving (10), the updated motion estimate is given by . This iterative process continues until convergence is achieved.
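One Gauss-Newton increment for the weighted sum problem can be sketched as below. The code assembles the normal equations from the two linearized residuals and solves them by Cholesky decomposition. It assumes the system has the usual form $(J_I^\top W_I J_I + \lambda J_Z^\top W_Z J_Z)\,\Delta = -(J_I^\top W_I r_I + \lambda J_Z^\top W_Z r_Z)$; the function names and the small test problem are hypothetical, and the update of the motion estimate on SE(3) is not shown.

```python
import math


def normal_equations(jac, weights, res):
    """Return (J^T W J, J^T W r) for a diagonal W given as a list of weights."""
    n = len(jac[0])
    h = [[sum(w * row[a] * row[b] for row, w in zip(jac, weights))
          for b in range(n)] for a in range(n)]
    g = [sum(w * row[a] * r for row, w, r in zip(jac, weights, res))
         for a in range(n)]
    return h, g


def cholesky_solve(h, rhs):
    """Solve h x = rhs for a symmetric positive definite h via h = L L^T."""
    n = len(h)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = h[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = rhs
        y[i] = (rhs[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


def gauss_newton_step(j_i, w_i, r_i, j_z, w_z, r_z, lam):
    """Increment solving (H_I + lam * H_Z) d = -(g_I + lam * g_Z)."""
    h_i, g_i = normal_equations(j_i, w_i, r_i)
    h_z, g_z = normal_equations(j_z, w_z, r_z)
    n = len(g_i)
    h = [[h_i[a][b] + lam * h_z[a][b] for b in range(n)] for a in range(n)]
    rhs = [-(g_i[a] + lam * g_z[a]) for a in range(n)]
    return cholesky_solve(h, rhs)


# Toy problem with two parameters and residuals that are exactly linear,
# so a single step reaches the minimizer d = (1, 2).
step = gauss_newton_step(
    j_i=[[1.0, 0.0], [0.0, 1.0]], w_i=[1.0, 1.0], r_i=[-1.0, -2.0],
    j_z=[[1.0, 1.0]], w_z=[2.0], r_z=[-3.0], lam=0.5,
)
```

The Cholesky routine here does not guard against a singular or indefinite matrix; in that case `math.sqrt` or the divisions fail, which a practical implementation would handle by damping or by falling back to another solver.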
The effectiveness of the weighted sum method depends crucially on the weighting factor , which must be selected a priori and reflects the preference of the decision maker. A good choice for can result in more accurate trajectory estimates when compared to single-objective odometry methods, but a poor choice for may lead to unacceptable results. Systematic approaches to selecting the weights in multiobjective optimization problems have been developed (see, for example, [13]), but few of them have been investigated in the context of visual odometry. Tykkala et al. [8] proposed a method that determines based on the ratio of median intensity and median depth values:
where denotes the list of intensity values and denotes the list of depth values.
To explore the importance of the weight , we conducted two computational experiments with the TUM RGB-D dataset. For our first experiment, we formed two image sequences: one containing images with poor textural features and one containing images with rich textural features. The structural features in both image sequences were rich. We observed that for the first sequence with poor textural features, the error decreases as is increased, but for the second sequence with rich textural features, the opposite occurs (see Fig. 3(a)). We believe that this is because the intensity objective function tends to be nonconvex when images lack textural features. In this case, large values of magnify the relative importance of the depth objective function , thus potentially preventing the overall objective function in (8) from becoming nonconvex.
For our second experiment, we again formed two image sequences: this time the first image sequence contained images with poor structural features and poor textural features, and the second image sequence contained images with rich structural features and poor textural features. As expected, the error decreases as increases for the image sequence with rich structural features (see Fig. 3(b)). This is because is likely to be convex when images contain rich structural information, and a large will increase ’s relative influence in the overall objective function.
Based on the experimental results in Fig. 3, we believe that the key to finding an optimal is to design a metric to measure textural and structural information. To do this, we consider the concept of image complexity, which is a measure of the inherent difficulty of finding a true target in a given image [14]. Peters and Strickland [14] have summarized many image complexity metrics for automatic target recognizers. Unfortunately, image complexity is a task-dependent notion and there is no universal metric applicable to all situations. After testing several of the metrics in [14], we designed our own metric for intensity complexity defined as follows:
(11) 
where and are the number of pixel rows and pixel columns, respectively, and denotes the intensity value at the specified pixel. For depth complexity, we use the analogue of (11) for the depth values:
(12) 
where denotes the depth value at the specified pixel. To standardize the intensity data and the depth data , we define the following scaling factor as the ratio of the variance between them:
(13) 
Combining (11)-(13), we calculate the value of weight as follows:
(14) 
where is as defined in (13) and is an adjustable constant. Notice that large values of indicate rich textural features, and large values of indicate rich structural features. Thus, we have deliberately chosen the value of in (14) to be inversely proportional to , and proportional to . The idea is to use large values of when the image sequence is rich in structure and/or poor in texture, and small values of when the image sequence is poor in structure and/or rich in texture.
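The proportionality described above can be illustrated with a small sketch. The complexity metric used here (mean absolute difference between neighbouring pixels) is a hypothetical stand-in and is not the metric of (11) and (12); likewise the function names, the way the scaling factor is passed in, and the sample images are assumptions. Only the structure of the rule is taken from the text: the weight grows with depth complexity and shrinks with intensity complexity.

```python
def mean_abs_gradient(image):
    """Stand-in complexity metric: mean absolute difference of neighbouring pixels."""
    rows, cols = len(image), len(image[0])
    if rows * cols < 2:
        raise ValueError("image must contain at least two pixels")
    total, count = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            if j + 1 < cols:  # horizontal neighbour
                total += abs(image[i][j + 1] - image[i][j])
                count += 1
            if i + 1 < rows:  # vertical neighbour
                total += abs(image[i + 1][j] - image[i][j])
                count += 1
    return total / count


def adaptive_weight(intensity, depth, scale, k, eps=1e-12):
    """Weight proportional to depth complexity, inversely to intensity complexity."""
    c_intensity = mean_abs_gradient(intensity)
    c_depth = mean_abs_gradient(depth)
    return k * scale * c_depth / (c_intensity + eps)


# Hypothetical 2x2 images: same depth structure, different amounts of texture.
depth = [[1.0, 1.5], [2.0, 1.2]]
poor_texture = [[0.50, 0.51], [0.50, 0.51]]
rich_texture = [[0.10, 0.90], [0.80, 0.20]]
lam_poor = adaptive_weight(poor_texture, depth, scale=1.0, k=1.0)
lam_rich = adaptive_weight(rich_texture, depth, scale=1.0, k=1.0)
```

For the same depth image, the poorly textured intensity image produces the larger weight, so the depth objective dominates exactly when the intensity objective is least reliable.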
3.2 Bounded Objective Method
The bounded objective method is another method for solving multiobjective optimization problems [13]. In this method, we minimize one of the objective functions (considered as the most important, or primary, objective), while the other objective functions are bounded using additional constraints.
For our odometry problem, we select as the primary objective function. The biobjective optimization problem in (5) then becomes
(15) 
where is an upper bound for the least-squares sum of depth residuals. To solve the optimization problem in (15), we can again use the first-order Taylor expansions of and . The optimal increment at point is then given by the solution of the following problem:
(16) 
where , , , , and are as defined in (9).
Problem (16) is a quadratically constrained quadratic program (QCQP). The general form for a QCQP is
QCQPs are of both theoretical and practical significance [15]. Because the matrices and are positive semidefinite, problem (16) is a convex QCQP. To solve this convex QCQP, we first transform it into a second-order cone programming (SOCP) problem and then apply SOCP techniques [16]. The general form for an SOCP problem is
The norm appearing in the constraints is the standard Euclidean norm, i.e., . We first rewrite (16) as follows:
(17) 
By adding a new optimization variable , we can transform (17) into the following SOCP form:
(18) 
Problem (18), which is equivalent to (16) and (17) (see [16]), is clearly in the general SOCP form shown above.
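For reference, the two general forms and the link between them can be stated as follows, using the conventions of [16] rather than the notation of (16)-(18). The symbols $P_i$, $q_i$, $r_i$, $f$, $A_i$, $b_i$, $c_i$, $d_i$ and the epigraph variable $t$ are generic placeholders. The matrices $P_i$ are assumed here to be symmetric positive definite so that $P_i^{-1/2}$ exists; the merely semidefinite case requires a different factorization.

```latex
% General convex QCQP (generic symbols, not those of (16))
\begin{align*}
\min_{x}\quad & x^\top P_0 x + 2 q_0^\top x + r_0 \\
\text{s.t.}\quad & x^\top P_i x + 2 q_i^\top x + r_i \le 0, \qquad i = 1, \dots, m.
\end{align*}

% General SOCP
\begin{align*}
\min_{x}\quad & f^\top x \\
\text{s.t.}\quad & \lVert A_i x + b_i \rVert \le c_i^\top x + d_i, \qquad i = 1, \dots, N.
\end{align*}

% Completing the square in each quadratic function
\[
x^\top P_i x + 2 q_i^\top x + r_i
  = \bigl\lVert P_i^{1/2} x + P_i^{-1/2} q_i \bigr\rVert^2 + r_i - q_i^\top P_i^{-1} q_i ,
\]

% gives an SOCP in (x, t) with the same minimizer x
\begin{align*}
\min_{x,\,t}\quad & t \\
\text{s.t.}\quad & \bigl\lVert P_0^{1/2} x + P_0^{-1/2} q_0 \bigr\rVert \le t, \\
& \bigl\lVert P_i^{1/2} x + P_i^{-1/2} q_i \bigr\rVert
  \le \bigl( q_i^\top P_i^{-1} q_i - r_i \bigr)^{1/2}, \qquad i = 1, \dots, m.
\end{align*}
```

The optimal values differ by a constant shift and a square, since the SOCP minimizes the norm rather than the quadratic itself, but the optimal $x$ is the same.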
To solve the SOCP problem in (18), we can use ECOS, an SOCP solver developed by Domahidi et al. [17]. ECOS implements an interior point method to solve SOCPs in the following standard form [18]:
where is a vector of optimization variables, is a vector of slack variables and is the cone
To reformulate (18) into the standard form required by ECOS, we set
and set
where denotes the zero column vector in .
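The mapping of a single second-order cone constraint onto the ECOS standard form can be sketched without calling the solver. ECOS expects problems written as: minimize $c^\top x$ subject to $Ax = b$, $Gx + s = h$, with the slack $s$ in a product of a nonnegative orthant and second-order cones, where the first entry of each cone block bounds the norm of the rest. The helper names below are hypothetical, and the example encodes only the generic constraint $\lVert Ax + b \rVert \le c^\top x + d$, not the specific matrices of (18).

```python
import math


def soc_to_ecos_rows(a, b, c, d):
    """Rows (G, h) encoding ||a x + b||_2 <= c^T x + d as h - G x in a cone."""
    g = [[-cj for cj in c]] + [[-aij for aij in row] for row in a]
    h = [d] + list(b)
    return g, h


def slack(g, h, x):
    """Slack vector s = h - G x, which must lie in the second-order cone."""
    return [hi - sum(gij * xj for gij, xj in zip(row, x))
            for row, hi in zip(g, h)]


def in_second_order_cone(s, tol=1e-12):
    """True if s = (s0, s1) satisfies ||s1||_2 <= s0."""
    return math.sqrt(sum(v * v for v in s[1:])) <= s[0] + tol


# Epigraph-style constraint ||(x1, x2)|| <= t with variables (x1, x2, t).
G, h = soc_to_ecos_rows(
    a=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], b=[0.0, 0.0],
    c=[0.0, 0.0, 1.0], d=0.0,
)
feasible = in_second_order_cone(slack(G, h, [3.0, 4.0, 5.0]))
infeasible = in_second_order_cone(slack(G, h, [3.0, 4.0, 4.9]))
```

Stacking one such block per cone constraint, in the order declared to the solver, yields the full `G` and `h`; the negative signs arise because the slack is defined as `h - G x`.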
The upper bound of the depth objective is a parameter that needs to be selected before starting the optimization procedure. This parameter plays the same role as in (8), i.e., balancing the relative importance of the depth and intensity objectives. However, compared to , the upper bound has a more explicit mathematical meaning and is easier to select a priori. In fact, since the value of can be measured directly when the true value of the camera motion is plugged into , it can be used to estimate the range of and find a good for optimization. In our algorithm, we choose the value of according to the complexity of the depth image as follows:
where , is an adjustable threshold and is the depth metric in (12).
4 Performance Evaluation
Tab. 1: RMSE of translational drift [m/s]
Method | poor structure, rich texture | rich structure, poor texture | poor structure, poor texture | rich structure, rich texture
Single objective | 0.041667 | 0.125235 | 0.249357 | 0.015956
Tykkala’s method | 0.035970 | 0.106649 | 0.165702 | 0.016078
Weighted sum | 0.034464 | 0.088853 | 0.178571 | 0.015101
Bounded objective | 0.032715 | 0.095749 | 0.178994 | 0.015330
Tab. 2: RMSE of translational drift [m/s]
Method | poor structure, rich texture | rich structure, poor texture | poor structure, poor texture | rich structure, rich texture
Single objective | 0.110646 | 0.074372 | 0.170460 | 0.015597
Tykkala’s method | 0.094845 | 0.077504 | 0.129923 | 0.014728
Weighted sum | 0.078033 | 0.076853 | 0.123848 | 0.014284
Bounded objective | 0.098715 | 0.066008 | 0.152104 | 0.015269
For performance evaluation, we conducted a series of numerical experiments in which a set of image sequences was used to compute simulated camera trajectories. The image sequences used in our experiments are from the “Structure vs. Texture” category of the TUM RGB-D dataset. Images in this category fall into four different types, as shown in Fig. 4. The image sequences in this dataset were created using colorful plastic foils to add textural features and white plastic foils to reduce them. Similarly, zigzag structures built from wooden panels were used to increase the structural features in the images, while planar surfaces were used to reduce them.
We compared the estimated trajectories produced by the optimization procedures with the true trajectories and calculated the root mean square error (RMSE) of the drift in meters per second. Other RGB-D odometry methods, such as the single-objective method in [7] and a reimplementation of the bi-objective odometry in [8], were also applied in our experiments as references for our methods. In addition, we measured the runtime of the different approaches on a ThinkPad E431 laptop with a dual-core Intel i5-3210M CPU (2.50 GHz) and 4 GB RAM to evaluate their real-time performance.
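A simplified version of the drift metric can be sketched as follows. The function compares frame-to-frame displacements of the estimated and true trajectories, normalizes by the elapsed time, and takes the root mean square. It is an assumption-laden illustration: rotation is ignored, positions are taken to be expressed in a common frame, and the function name and sample trajectory are hypothetical, so it does not reproduce the benchmark's full relative pose error.

```python
import math


def translational_drift_rmse(times, estimated, ground_truth):
    """RMSE of per-frame translational drift in m/s (rotation is ignored here)."""
    squared = []
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        if dt <= 0.0:
            raise ValueError("timestamps must be strictly increasing")
        # Displacement between consecutive frames, for each trajectory.
        d_est = [b - a for a, b in zip(estimated[k], estimated[k + 1])]
        d_gt = [b - a for a, b in zip(ground_truth[k], ground_truth[k + 1])]
        err = math.sqrt(sum((e - g) ** 2 for e, g in zip(d_est, d_gt)))
        squared.append((err / dt) ** 2)
    if not squared:
        raise ValueError("need at least two poses")
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical three-frame trajectory sampled every 0.5 s.
times = [0.0, 0.5, 1.0]
truth = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.2, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.0), (0.1, 0.03, 0.04), (0.2, 0.03, 0.04)]
rmse = translational_drift_rmse(times, estimate, truth)
```

In this example the first frame pair contributes a drift of 0.1 m/s and the second contributes none, so the estimate's constant offset after the first step is not penalized again.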
Specifically, to ensure identical experimental conditions, we built the distribution model described in [7] to eliminate outliers in the data and constructed the weighting matrix in the objective function for all of the methods we evaluated. The results of our experiments are given in Tab. 1 and Tab. 2 (the per-frame translational errors are also shown in Fig. 5). It can be seen that the RMSEs of the single-objective optimization-based method increase considerably when the textural features of the sequences are poor. Compared to the method based on single-objective optimization, our methods, the weighted sum method and the bounded objective method, give better performance, especially in cases with poor textural features. Tykkala’s method, which also uses bi-objective optimization, has similar performance to ours in most cases. Our conclusion is that the new bi-objective optimization formulation for RGB-D odometry can reduce the tendency of the optimization problem to become nonconvex and improve the accuracy of the estimates.
We also measured the average runtime for one match between two images with the different methods in our experiments. From Tab. 3, we can see that our weighted sum method needs more time to complete one match than the method based on single-objective optimization. However, as its time cost for one match is much less than one second, our weighted sum method can still be implemented as a real-time approach. The bounded objective method, however, cannot currently work in a real-time application due to its high time cost. The main cause of this is that the algorithms used to solve the SOCP are numerical approximation algorithms. They need more computations and iterations to obtain the solution than the analytic algorithms, such as the Gauss-Newton algorithm, used in the weighted sum method. Considering its convenience in parameter setting, the bounded objective method is still a promising method, and it offers an alternative to other common methods in bi-objective optimization.
Tab. 3: Average runtime for one match between two images
Method | Runtime [ms]
Single objective | 15.42
Tykkala’s method | 21.06
Weighted sum | 22.99
Bounded objective | 7093
5 Conclusion
In this paper, we studied two methods for solving a new bi-objective optimization formulation for robust RGB-D odometry. Both methods involve converting the bi-objective optimization problem into a single-objective problem. The weighted sum method involves minimizing the weighted linear sum of intensity and depth residuals. The bounded objective method involves minimizing the intensity residual subject to a bound on the depth residual. The experimental results show that both methods yield precise motion estimates and perform stably even when the textural information in the image sequence is poor. The bounded objective method is considerably slower than the weighted sum method. Thus, our current focus is on developing a parallel algorithm for enhancing real-time performance. We also hope to expand these ideas to other problems in robotics such as motion control, SLAM and navigation. One of the main contributions of our work is a discussion of how to use depth and intensity metrics to choose the parameters in both methods.
References

[1] D. Nistér, O. Naroditsky, and J. Bergen, Visual Odometry, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, 652-659, 2004.
[2] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, Chap. 7, 2005.
[3] P. J. Besl and N. D. McKay, Method for Registration of 3-D Shapes, in Robotics-DL Tentative, International Society for Optics and Photonics, 586-606, 1992.
[4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments, in Experimental Robotics, Springer, 477-491, 2014.
[5] H. Strasdat, J. Montiel, and A. J. Davison, Scale Drift-Aware Large Scale Monocular SLAM, in Robotics: Science and Systems, Vol. 2, No. 3, 512, 2010.
[6] H. Steinbrucker, J. Sturm, and D. Cremers, Real-Time Visual Odometry from Dense RGB-D Images, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 719-722, 2011.
[7] C. Kerl, J. Sturm, and D. Cremers, Robust Odometry Estimation for RGB-D Cameras, in Proceedings of the IEEE International Conference on Robotics and Automation, 3748-3754, 2013.
[8] T. Tykkala, C. Audras, and A. I. Comport, Direct Iterative Closest Point for Real-Time Visual Odometry, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2050-2056, 2011.
[9] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, A Benchmark for the Evaluation of RGB-D SLAM Systems, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 573-580, 2012.
[10] L. Zadeh, Optimality and Non-Scalar-Valued Performance Criteria, IEEE Transactions on Automatic Control, Vol. 8, No. 1, 59-60, 1963.
[11] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, g2o: A General Framework for Graph Optimization, in Proceedings of the IEEE International Conference on Robotics and Automation, 3607-3613, 2011.
[12] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models, Springer, 2003.
[13] R. T. Marler and J. S. Arora, Survey of Multi-Objective Optimization Methods for Engineering, Structural and Multidisciplinary Optimization, Vol. 26, No. 6, 369-395, 2004.
[14] R. A. Peters and R. N. Strickland, Image Complexity Metrics for Automatic Target Recognizers, in Proceedings of the Automatic Target Recognizer System and Technology Conference, 117, 1990.
[15] C. Lu, S. Fang, Q. Jin, Z. Wang, and W. Xing, KKT Solution and Conic Relaxation for Solving Quadratically Constrained Quadratic Programming Problems, SIAM Journal on Optimization, Vol. 21, No. 4, 1475-1490, 2011.
[16] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, Applications of Second-Order Cone Programming, Linear Algebra and its Applications, Vol. 284, No. 1, 193-228, 1998.
[17] A. Domahidi, E. Chu, and S. Boyd, ECOS: An SOCP Solver for Embedded Systems, in Proceedings of the European Control Conference, 3071-3076, 2013.
[18] A. Domahidi, E. Chu, and S. Boyd, CVXOPT: A Python Package for Convex Optimization, version 1.1.6, Available at cvxopt.org, 2013.