Bi-objective Optimization for Robust RGB-D Visual Odometry

11/27/2014 ∙ by Tao Han, et al. ∙ Zhejiang University ∙ Curtin University

This paper considers a new bi-objective optimization formulation for robust RGB-D visual odometry. We investigate two methods for solving the proposed bi-objective optimization problem: the weighted sum method (in which the objective functions are combined into a single objective function) and the bounded objective method (in which one of the objective functions is optimized and the value of the other objective function is bounded via a constraint). Our experimental results for the open source TUM RGB-D dataset show that the new bi-objective optimization formulation is superior to several existing RGB-D odometry methods. In particular, the new formulation yields more accurate motion estimates and is more robust when textural or structural features in the image sequence are lacking.




1 Introduction

Visual odometry is an important area of information fusion in which the central aim is to estimate the pose of a robot using data collected by visual sensors [1]. Because nearly all robotic tasks require knowledge of the pose of the robot, visual odometry plays a critical role in robot control, simultaneous localization and mapping (SLAM) and robot navigation, especially when external reference information about the environment (such as GPS data) is unavailable. Visual odometry can be viewed as a particular instance of the general pose tracking problem, which is the most fundamental perception problem in robotics [2].

To date, a variety of different visual odometry methods based on different sensor information have been studied and widely implemented. One of the most well-known methods is the iterative closest point (ICP) algorithm [3], which estimates the robot’s pose by minimizing the distance between corresponding points in two laser scanning snapshots. However, this method can easily become trapped in local optima if a good initial guess is not provided. In addition to the ICP algorithm and its variants, odometry methods using camera images have also been studied [4] [5]. Such methods usually extract point features from the camera images and match them through a series of steps, including descriptor matching, RANSAC and bundle adjustment. Due to their heavy computational burden, these approaches are usually too slow for real-time application. One way of improving computational efficiency is to use sparse point features, but this approach does not fully exploit the available image data, ignoring much relevant information.

Recently, with RGB-D cameras becoming smaller and cheaper, the opportunity has arisen to develop RGB-D odometry methods that exploit both intensity and depth information. One such method was proposed by the Computer Vision Group at the Technical University of Munich (TUM). In this method, a single-objective optimization problem is formulated to penalize the intensity difference between corresponding pixels in consecutive images [6] [7]. This method can be implemented in real-time even on a single-core CPU. However, the image depth information is only used to determine the relationship between corresponding pixels in consecutive images for intensity residual comparison; depth residuals are not considered. Thus, a new bi-objective optimization problem was subsequently proposed in [8] to minimize both depth and intensity residuals, with the aim of improving estimation performance.

In this paper, we consider the same bi-objective optimization formulation as in [8]. Our aims are twofold: (i) to propose new computational approaches for solving this bi-objective optimization formulation; and (ii) to explore and quantify the advantages of the bi-objective optimization formulation for improving estimation robustness. The first computational approach we investigate, the so-called weighted sum method, involves integrating the two objective functions into a single objective using a weighting factor. We derive a new formula for adaptive calculation of this weighting factor, which is crucial to estimation accuracy. Our formula is based on a novel image complexity metric and differs from the corresponding formula in [8], which uses the ratio of median intensity and median depth values to calculate the weighting factor. The second computational approach we investigate, the so-called bounded objective method, involves optimizing one of the objective functions while the other objective function is bounded via a constraint. Again, our new image complexity metric is used, this time to determine an appropriate objective bound. To evaluate performance, the open source TUM RGB-D dataset [9] was used. The computational results demonstrate that our new methods generally give results of superior accuracy compared with the methods in [6] [7] [8].

2 Single-Objective Optimization for Visual Odometry

The camera motion in 3-D space has six degrees of freedom and can be denoted as

ξ = (t_x, t_y, t_z, α, β, γ)^T,

where t_x, t_y, t_z are the translation components of the motion and α, β, γ are the rotation components of the motion. To estimate ξ, we consider a world point p and assume that its brightness is the same in two consecutive images. This is the so-called photo-consistency assumption [7], which can be expressed mathematically by

I_1(x) = I_2(τ(x, ξ*)),

where x represents the mapping coordinate of the world point p in the first image and τ(x, ξ*) represents the corresponding coordinate of p in the second image when given the true value ξ* of the camera motion. Moreover, I_1(x) and I_2(τ(x, ξ*)) are the brightness (or intensity) values of the specified coordinates in the first and second images, respectively.

Based on the photo-consistency assumption, we can define the intensity difference corresponding to the motion estimate ξ as

r_I(ξ) = I_2(τ(x, ξ)) − I_1(x).

According to the results in [7], the more accurate the camera motion estimate, the smaller the residual r_I(ξ). Thus, estimation quality in visual odometry can be assessed by considering the following least-squares objective function, which is the sum of residual squares for n world points:

E_I(ξ) = Σ_{i=1}^{n} r_{I,i}(ξ)^2.

Then the problem of determining the camera motion can be formulated as a least-squares optimization problem, i.e.,

min_ξ E_I(ξ).  (1)

To improve robustness, weighted residuals can be used to reduce the effect of noise and outliers in the image data. This motivates the following weighted objective function in quadratic form:

E_I(ξ) = r_I^T W_I r_I,  (2)

where W_I is a diagonal weight matrix and r_I = (r_{I,1}(ξ), …, r_{I,n}(ξ))^T.
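As a concrete illustration, a weighted least-squares objective of this form can be evaluated as follows. This is a minimal numpy sketch: the Huber-style weights merely stand in for the t-distribution weighting used in [7], and all names are illustrative, not the paper's code.

```python
import numpy as np

def huber_weights(r, k=1.345):
    """Per-residual weights for a Huber-style M-estimator: residuals larger
    than k in magnitude are down-weighted, limiting the influence of outliers."""
    a = np.abs(r)
    w = np.ones_like(r)
    mask = a > k
    w[mask] = k / a[mask]
    return w

def weighted_objective(r):
    """E(xi) = r^T W r with W = diag(w) built from the residuals themselves."""
    w = huber_weights(r)
    return float(r @ (w * r))

# Toy residual vector: the single outlier (5.0) gets down-weighted,
# so the weighted objective is smaller than the plain sum of squares.
r = np.array([0.1, -0.2, 0.05, 5.0])
print(weighted_objective(r) < float(r @ r))  # True
```

In practice the weights are recomputed from the current residuals at every iteration (iteratively re-weighted least squares), which is why W appears as data rather than a fixed constant here.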

3 Bi-Objective Optimization for RGB-D Odometry

Figure 1: Motion estimation accuracy of the single-objective Gauss-Newton method for the TUM RGB-D dataset.

Traditional cameras only provide image intensity information. RGB-D cameras, on the other hand, provide image intensity and image depth information, both of which can be used for visual odometry. For example, in the odometry methods introduced by the TUM Computer Vision Group [6] [7], the relationship between corresponding pixels in consecutive images is expressed in terms of the depth information in the first image, and the intensity information of both images is used to define the motion estimation residuals as in Section 2. More precisely, the relationship between corresponding pixels in consecutive images is defined by a warping function as follows:

x' = τ(x, Z_1(x), ξ),

where Z_1(x) is the depth value of the pixel x in the first image and τ is the warping function for calculating the mapping coordinate x' in the second image. For the specific form of the warping function τ, we refer the reader to [7].

Although single-objective optimization-based odometry methods are computationally fast and effective, they can produce poor results in some situations. For example, when textural features in the image sequence are poor, trajectory estimation accuracy will decrease dramatically. This is because the objective function only depends on image intensity information, and thus it can become non-convex when image textural features are lacking. In this case, the “optimal” motion estimates obtained by applying an iterative optimization procedure may only be locally optimal. To investigate this hypothesis, we applied the single-objective optimization approach (implemented using the Gauss-Newton method) to image sequences in the TUM RGB-D dataset [9]. Our results are shown in Fig. 1. From the results, we see that the translation error of the motion estimates increases significantly when textural features are lacking. This motivates the new bi-objective optimization formulation proposed in [8], in which both image intensity and image depth residuals are minimized to improve robustness.

The extension of RGB-D odometry using bi-objective optimization is inspired by the ICP algorithm and its variants, which estimate the sensor motion by minimizing residual coordinate differences instead of image intensity values. Since RGB-D cameras provide both intensity and depth information simultaneously, we want to take full advantage of this feature by comparing depth differences, just as the ICP algorithm compares coordinate differences. Thus, we now consider two residuals instead of one:

r_I(ξ) = I_2(τ(x, ξ)) − I_1(x),
r_Z(ξ) = Z_2(τ(x, ξ)) − [T(ξ) p_1]_Z,  (3)

where Z_1 and Z_2 are the depth values of the specified coordinates in the first and second images, and T(ξ) projects the 3-D coordinate p_1 of world point p from the first camera coordinate system to the second camera coordinate system based on the homogeneous transformation matrix for ξ. The operator “[·]_Z” selects the coordinate value along the Z-direction. See the diagram in Fig. 2 for an explanation of the notation.
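To make the depth residual concrete, the following sketch computes it for a single pixel under a hypothetical pinhole model. The intrinsic matrix K, the transform T and all names are illustrative assumptions, not the paper's code; the warp itself is omitted and the second image's depth at the warped coordinate is passed in directly.

```python
import numpy as np

def depth_residual(x, Z1_x, Z2_at_warp, T, K):
    """Depth residual for one pixel: the second image's depth at the warped
    coordinate minus the Z-component of the transformed 3-D point."""
    p1 = Z1_x * (np.linalg.inv(K) @ np.array([x[0], x[1], 1.0]))  # back-project
    p2 = (T @ np.append(p1, 1.0))[:3]                             # move to frame 2
    return Z2_at_warp - p2[2]

K = np.array([[525.0,   0.0, 319.5],
              [  0.0, 525.0, 239.5],
              [  0.0,   0.0,   1.0]])   # TUM-style intrinsics (assumed values)
T = np.eye(4)
T[2, 3] = 0.05                          # 5 cm translation along the optical axis
r = depth_residual((320.0, 240.0), 1.0, Z2_at_warp=1.05, T=T, K=K)
print(abs(r) < 1e-9)  # True: consistent depths give a zero residual
```

When the observed depth in the second image disagrees with the transformed point's Z-coordinate, the residual is exactly that disagreement, which is what the depth objective penalizes.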

Figure 2: Motion estimation via RGB-D odometry: p is the world point under consideration, x_1 and x_2 are the pixels corresponding to p, and Z_1(x_1) and Z_2(x_2) are the depth values corresponding to p.

Based on r_Z defined in (3), we consider the following objective function:

E_Z(ξ) = r_Z^T W_Z r_Z,  (4)

where W_Z is a diagonal weight matrix and r_Z = (r_{Z,1}(ξ), …, r_{Z,n}(ξ))^T.

Combining objectives (2) and (4), we consider the following bi-objective optimization problem:

min_ξ { E_I(ξ), E_Z(ξ) }.  (5)

3.1 Weighted Sum Method

The weighted sum method is the most common approach to solving multi-objective optimization problems. In this method, the individual objective functions are assigned different weights and then added together to form a single objective function. More specifically, for individual objective functions F_1(x), …, F_m(x) and decision vector x, the combined objective function is

F(x) = Σ_{i=1}^{m} w_i F_i(x),  (6)

where w_1, …, w_m are the weights. If all of the weights are positive, then the minimum of (6) is Pareto optimal for the original multi-objective problem [10].

In essence, the objective weights provide additional degrees of freedom in the optimization problem. For our odometry problem (5), the new single-objective optimization problem is defined as

min_ξ w_I E_I(ξ) + w_Z E_Z(ξ).  (7)

Notice that by dividing by w_I, we can obtain an equivalent optimization problem as follows:

min_ξ E_I(ξ) + λ E_Z(ξ),  (8)

where λ = w_Z / w_I. Thus, we only need to consider a single weighting factor λ.

Problem (8) can be solved using the Gauss-Newton method. To do this, we linearize the residuals r_I and r_Z using the Taylor expansion proposed in [11]:

r_I(ξ ⊞ Δξ) ≈ r_I(ξ) + J_I Δξ,   r_Z(ξ ⊞ Δξ) ≈ r_Z(ξ) + J_Z Δξ,

where “⊞” denotes the addition operator in the Lie group SE(3) (for more details, see [12]); and J_I and J_Z are the Jacobians defined by

J_I = ∂r_I(ξ)/∂ξ,   J_Z = ∂r_Z(ξ)/∂ξ.

Then the objective function in (8) can be approximated by a quadratic function of Δξ:

q(Δξ) = Δξ^T (A_I + λ A_Z) Δξ + 2 (b_I + λ b_Z)^T Δξ + c_I + λ c_Z,  (9)

where A_k = J_k^T W_k J_k, b_k = J_k^T W_k r_k and c_k = r_k^T W_k r_k (k ∈ {I, Z}).

Suppose that at iteration k, we have the motion estimate ξ_k. Then the increment Δξ should be chosen to minimize q(Δξ). According to the Gauss-Newton method, by differentiating (9) with respect to Δξ, the optimal value of Δξ satisfies the linear system

(A_I + λ A_Z) Δξ = −(b_I + λ b_Z),  (10)

where the matrices A_I, A_Z and vectors b_I, b_Z are evaluated at ξ = ξ_k. To solve this linear system, methods such as Cholesky decomposition can be used. After solving (10), the updated motion estimate is given by ξ_{k+1} = ξ_k ⊞ Δξ. This iterative process continues until convergence is achieved.
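The Gauss-Newton step for the combined objective can be sketched as follows. This is a minimal numpy illustration of solving the normal equations by Cholesky factorisation; the function and variable names are illustrative, not the paper's implementation, and the Jacobians here are synthetic rather than image-derived.

```python
import numpy as np

def gauss_newton_step(J_I, r_I, J_Z, r_Z, w_I, w_Z, lam):
    """One Gauss-Newton step for the weighted sum of intensity and depth terms:
    solves (J_I^T W_I J_I + lam J_Z^T W_Z J_Z) d = -(J_I^T W_I r_I + lam J_Z^T W_Z r_Z)
    via Cholesky factorisation of the (SPD) combined normal matrix."""
    H = J_I.T @ (w_I[:, None] * J_I) + lam * (J_Z.T @ (w_Z[:, None] * J_Z))
    g = J_I.T @ (w_I * r_I) + lam * (J_Z.T @ (w_Z * r_Z))
    L = np.linalg.cholesky(H)        # H = L L^T
    y = np.linalg.solve(L, -g)       # forward substitution
    return np.linalg.solve(L.T, y)   # back substitution

# Toy problem with purely linear residuals that vanish at a known d_true,
# so a single step should recover d_true exactly.
rng = np.random.default_rng(0)
J_I = rng.normal(size=(50, 6))
J_Z = rng.normal(size=(50, 6))
d_true = rng.normal(size=6)
r_I = -J_I @ d_true
r_Z = -J_Z @ d_true
w = np.ones(50)
d = gauss_newton_step(J_I, r_I, J_Z, r_Z, w, w, lam=0.5)
print(np.allclose(d, d_true))  # True
```

For real image residuals the linearisation only holds locally, so the step is applied repeatedly (with the ⊞ update on SE(3)) until convergence, as described above.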

The effectiveness of the weighted sum method depends crucially on the weighting factor λ, which must be selected a priori and reflects the preference of the decision maker. A good choice for λ can result in more accurate trajectory estimates when compared to single-objective odometry methods, but a poor choice for λ may lead to unacceptable results. Systematic approaches to selecting the weights in multi-objective optimization problems have been developed (see, for example, [13]), but few of them have been investigated in the context of visual odometry. Tykkala et al. [8] proposed a method that determines λ based on the ratio of median intensity and median depth values:

λ = median(I) / median(Z),

where I denotes the list of intensity values and Z denotes the list of depth values.

(a) Experiment 1 (rich structural features)
(b) Experiment 2 (poor textural features)
Figure 3: Ratio of root mean square error (RMSE) and maximum error for two computational experiments using the TUM RGB-D dataset.

To explore the importance of the weight λ, we conducted two computational experiments with the TUM RGB-D dataset. For our first experiment, we formed two image sequences: one containing images with poor textural features and one containing images with rich textural features. The structural features in both image sequences were rich. We observed that for the first sequence with poor textural features, the error decreases as λ is increased, but for the second sequence with rich textural features, the opposite occurs (see Fig. 3(a)). We believe that this is because the intensity objective function E_I tends to be non-convex when images lack textural features. In this case, large values of λ magnify the relative importance of the depth objective function E_Z, thus potentially preventing the overall objective function in (8) from becoming non-convex.

For our second experiment, we again formed two image sequences: this time the first image sequence contained images with poor structural features and poor textural features, and the second image sequence contained images with rich structural features and poor textural features. As expected, the error decreases as λ increases for the image sequence with rich structural features (see Fig. 3(b)). This is because E_Z is likely to be convex when images contain rich structural information, and a large λ will increase the relative influence of E_Z in the overall objective function.

Based on the experimental results in Fig. 3, we believe that the key to finding an optimal λ is to design a metric that measures textural and structural information. To do this, we consider the concept of image complexity, which is a measure of the inherent difficulty of finding a true target in a given image [14]. Peters et al. [14] have summarized many image complexity metrics for automatic target recognizers. Unfortunately, image complexity is a task-dependent notion and there is no universal metric applicable to all situations. After testing several of the metrics in [14], we designed our own metric C_I for intensity complexity, defined in (11), where M and N are the number of pixel rows and pixel columns, respectively, and I(u, v) denotes the intensity value at pixel (u, v). For depth complexity, we use the analogue of (11) for the depth values, giving a metric C_Z defined in (12), where Z(u, v) denotes the depth value at pixel (u, v).

To standardize the intensity data and the depth data, we define the following scaling factor as the ratio of the variances between them:

s = Var(I) / Var(Z).  (13)

Combining (11)-(13), we calculate the value of the weight λ as follows:

λ = k · s · C_Z / C_I,  (14)

where s is as defined in (13) and k is an adjustable constant. Notice that large values of C_I indicate rich textural features, and large values of C_Z indicate rich structural features. Thus, we have deliberately chosen the value of λ in (14) to be inversely proportional to C_I and proportional to C_Z. The idea is to use large values of λ when the image sequence is rich in structure and/or poor in texture, and small values of λ when the image sequence is poor in structure and/or rich in texture.
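The exact complexity metrics (11)-(12) from the original source are not reproduced above, so the sketch below substitutes a simple mean-gradient-magnitude complexity, which matches the qualitative description (rich texture or structure yields a large value). The function names, the stand-in metric and the constant k are all assumptions, not the paper's definitions.

```python
import numpy as np

def complexity(img):
    """Hypothetical stand-in for the complexity metrics (11)/(12):
    mean gradient magnitude over the image."""
    gy, gx = np.gradient(img.astype(float))
    return float(np.mean(np.hypot(gx, gy)))

def adaptive_weight(intensity, depth, k=1.0):
    """lambda = k * s * C_Z / C_I: proportional to depth complexity,
    inversely proportional to intensity complexity, with the variance
    ratio s playing the role of the scaling factor (13)."""
    s = np.var(intensity) / max(np.var(depth), 1e-12)
    return k * s * complexity(depth) / max(complexity(intensity), 1e-12)

rng = np.random.default_rng(1)
flat = np.full((32, 32), 0.5)                    # texture-poor image
textured = rng.uniform(0.0, 1.0, size=(32, 32))  # texture-rich image
print(complexity(textured) > complexity(flat))   # True: texture raises the metric
```

Any metric with this monotone behaviour (small for flat regions, large for detailed ones) would slot into (14) the same way; the particular choice trades sensitivity against noise robustness.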

3.2 Bounded Objective Method

The bounded objective method is another method for solving multi-objective optimization problems [13]. In this method, we minimize one of the objective functions (considered as the most important, or primary, objective), while the other objective functions are bounded using additional constraints.

For our odometry problem, we select E_I as the primary objective function. The bi-objective optimization problem in (5) then becomes

min_ξ E_I(ξ)  subject to  E_Z(ξ) ≤ ε,  (15)

where ε is an upper bound for the least-squares sum of depth residuals. To solve the optimization problem in (15), we can again use the first-order Taylor expansions of r_I and r_Z. The optimal increment Δξ at point ξ_k is then given by the solution of the following problem:

min_Δξ  Δξ^T A_I Δξ + 2 b_I^T Δξ + c_I
s.t.    Δξ^T A_Z Δξ + 2 b_Z^T Δξ + c_Z ≤ ε,  (16)

where A_I, A_Z, b_I, b_Z, c_I and c_Z are as defined in (9).
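Before turning to the SOCP route used below, it is worth noting that a convex problem with one quadratic objective and one quadratic constraint like (16) can also be solved directly by searching over the Lagrange multiplier. The following is a minimal numpy sketch of that alternative, with illustrative names and synthetic data; it is not the authors' method.

```python
import numpy as np

def qcqp_step(A, b, Az, bz, cz, eps, iters=60):
    """Minimise d^T A d + 2 b^T d subject to d^T Az d + 2 bz^T d + cz <= eps,
    with A, Az positive definite, by bisection on the Lagrange multiplier mu.
    The stationary point for a given mu is d = -(A + mu Az)^{-1} (b + mu bz)."""
    def solve(mu):
        d = np.linalg.solve(A + mu * Az, -(b + mu * bz))
        return d, d @ Az @ d + 2.0 * bz @ d + cz - eps
    d, v = solve(0.0)
    if v <= 0.0:                # unconstrained minimiser already feasible
        return d
    hi = 1.0
    while solve(hi)[1] > 0.0:   # grow mu until the constraint is satisfied
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):      # bisect on the bracket [lo, hi]
        mid = 0.5 * (lo + hi)
        if solve(mid)[1] > 0.0:
            lo = mid
        else:
            hi = mid
    return solve(hi)[0]

# Toy instance: the returned step must satisfy the depth bound.
rng = np.random.default_rng(2)
M = rng.normal(size=(6, 6)); A = M @ M.T + np.eye(6)
N = rng.normal(size=(6, 6)); Az = N @ N.T + np.eye(6)
b = rng.normal(size=6); bz = rng.normal(size=6)
d = qcqp_step(A, b, Az, bz, cz=0.0, eps=0.1)
print(d @ Az @ d + 2.0 * bz @ d <= 0.1 + 1e-8)  # True
```

A general-purpose conic solver handles more constraint types and gives certified accuracy, which is why the SOCP formulation below is attractive despite this simpler special-case option.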

(a) Poor structure poor texture
(b) Poor structure rich texture
(c) Rich structure poor texture
(d) Rich structure rich texture
Figure 4: The four types of images in the “Structure vs. Texture” category in the TUM RGB-D dataset.

Problem (16) is a quadratically constrained quadratic program (QCQP). The general form for a QCQP is

min_x  x^T P_0 x + 2 q_0^T x + r_0
s.t.   x^T P_i x + 2 q_i^T x + r_i ≤ 0,  i = 1, …, m.

QCQPs are of both theoretical and practical significance [15]. Because the matrices A_I and A_Z are positive semidefinite, problem (16) is a convex QCQP. To solve this convex QCQP, we first transform it into a second-order cone programming (SOCP) problem and then apply SOCP techniques [16]. The general form for a SOCP problem is

min_x  f^T x
s.t.   ‖A_i x + b_i‖ ≤ c_i^T x + d_i,  i = 1, …, m.

The norm appearing in the constraints is the standard Euclidean norm, i.e., ‖u‖ = (u^T u)^{1/2}. We first rewrite (16) as follows:


By adding a new optimization variable t, we can transform (17) into the following SOCP form:


Problem (18), which is equivalent to (16) and (17) (see [16]), is clearly in the general SOCP form shown above.
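The key step in this transformation is rewriting a quadratic bound as a second-order cone constraint via the standard identity ‖u‖² ≤ t ⟺ ‖(2u, 1 − t)‖ ≤ 1 + t. The sketch below checks this identity numerically on a generic vector; it illustrates the trick, not the paper's exact matrices.

```python
import numpy as np

def quad_bound(u, t):
    """Direct check of the quadratic bound ||u||^2 <= t."""
    return float(u @ u) <= t

def cone_form(u, t):
    """The same condition written as a second-order cone constraint:
    ||(2u, 1 - t)||_2 <= 1 + t, since (1+t)^2 - (1-t)^2 = 4t."""
    return float(np.linalg.norm(np.concatenate([2.0 * u, [1.0 - t]]))) <= 1.0 + t

u = np.array([0.3, -0.4])                      # ||u||^2 = 0.25
print(quad_bound(u, 0.3), cone_form(u, 0.3))   # True True
print(quad_bound(u, 0.2), cone_form(u, 0.2))   # False False
```

Applying this identity to the quadratic depth constraint (after completing the square with a Cholesky factor of A_Z) is what turns (17) into the cone-constrained problem (18).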

To solve the SOCP problem in (18), we can use ECOS, an SOCP solver developed by Domahidi et al. [17]. ECOS implements an interior point method to solve SOCPs in the following standard form [18]:

min_x  c^T x   subject to   A x = b,   G x + s = h,   s ∈ K,

where x is the vector of optimization variables, s is a vector of slack variables and K is the cone

K = K_1 × ⋯ × K_m,

where each K_i is either the nonnegative orthant or a second-order cone.
To reformulate (18) into the standard form required by ECOS, we set

and set

where 0_n denotes the zero column vector in R^n.

The upper bound ε of the depth objective is a parameter that needs to be selected before starting the optimization procedure. This parameter plays the same role as λ in (8), i.e., balancing the relative importance of the depth and intensity objectives. However, compared to λ, the upper bound ε has a more explicit mathematical meaning and is easier to select a priori. In fact, since the value of E_Z can be measured directly when the true value of the camera motion is plugged in, it can be used to estimate the range of ε and find a good ε for optimization. In our algorithm, we choose the value of ε according to the complexity of the depth image as follows:

where ε_0 is an adjustable threshold and C_Z is the depth complexity metric in (12).

4 Performance Evaluation

                    poor structure  rich structure  poor structure  rich structure
Method              rich texture    poor texture    poor texture    rich texture
                    [m/s]           [m/s]           [m/s]           [m/s]
Single objective    0.041667        0.125235        0.249357        0.015956
Tykkala's method    0.035970        0.106649        0.165702        0.016078
Weighted sum        0.034464        0.088853        0.178571        0.015101
Bounded objective   0.032715        0.095749        0.178994        0.015330
Table 1: RMSE results for the 1st-4th sequences in the “Structure vs. Texture” category. In these sequences, the RGB-D camera is close to the panels and wooden surfaces.
                    poor structure  rich structure  poor structure  rich structure
Method              rich texture    poor texture    poor texture    rich texture
                    [m/s]           [m/s]           [m/s]           [m/s]
Single objective    0.110646        0.074372        0.170460        0.015597
Tykkala's method    0.094845        0.077504        0.129923        0.014728
Weighted sum        0.078033        0.076853        0.123848        0.014284
Bounded objective   0.098715        0.066008        0.152104        0.015269
Table 2: RMSE results for the 5th-8th sequences in the “Structure vs. Texture” category. In these sequences, the RGB-D camera is far from the panels and wooden surfaces.

For performance evaluation, we conducted a series of numerical experiments in which a set of image sequences was used to estimate the camera trajectory. The image sequences used in our experiments are from the “Structure vs. Texture” category in the TUM RGB-D dataset. Images in this category fall into four different types, as shown in Fig. 4. The image sequences in this dataset were created using colorful plastic foils to create textural features and white plastic foils to suppress textural features. Similarly, zig-zag structures built from wooden panels were used to enrich the structural features in the images, while planar surfaces were used to impoverish them.

We compared the estimated trajectories produced by the optimization procedures with the true trajectories and calculated the root mean square error (RMSE) of the drift in meters per second. Other RGB-D odometry methods, namely the single-objective method in [7] and a re-implementation of the bi-objective odometry in [8], were also applied in our experiments as baselines for our methods. In addition, we measured the runtime of the different approaches on a ThinkPad E431 laptop with a dual-core Intel i5-3210M CPU (2.50 GHz) and 4 GB RAM to evaluate their real-time performance.
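For reference, a per-second translational drift RMSE can be computed along the following lines. This is a simplified sketch of the TUM benchmark's relative pose error restricted to translation; the function names and the synthetic trajectories are illustrative.

```python
import numpy as np

def drift_rmse(est, truth, dt=1.0):
    """RMSE of per-interval translational drift between an estimated and a
    ground-truth trajectory (each an N x 3 array of positions), in m/s when
    dt is the frame interval in seconds."""
    d_est = np.diff(est, axis=0)      # per-frame translation of the estimate
    d_gt = np.diff(truth, axis=0)     # per-frame translation of ground truth
    err = np.linalg.norm(d_est - d_gt, axis=1) / dt
    return float(np.sqrt(np.mean(err ** 2)))

truth = np.cumsum(np.full((5, 3), 0.1), axis=0)   # straight-line trajectory
est = truth + 0.01 * np.arange(5)[:, None]        # linearly growing drift
print(drift_rmse(est, truth) > 0.0)  # True: nonzero drift detected
```

The full benchmark evaluation additionally accounts for rotational error and timestamp association, which this sketch omits.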

Specifically, to ensure identical experimental conditions, we built the t-distribution model mentioned in [7] to eliminate outliers in the data and constructed the weighting matrices in the objective functions for all methods we evaluated. The results of our experiments are given in Tab. 1 and Tab. 2 (the per-frame translational errors are also shown in Fig. 5). It can be seen that the RMSE of the single-objective optimization based method increases considerably when the textural features of the sequences are poor. Compared to the method based on single-objective optimization, our methods, the weighted sum method and the bounded objective method, give better performance, especially in the poor textural feature cases. Tykkala’s method, which also uses bi-objective optimization, has similar performance to ours in most cases. Our conclusion is that the new bi-objective optimization formulation for RGB-D odometry can reduce the tendency of the optimization problem to become non-convex and thereby improve the accuracy of the estimates.

(a) Translational errors of the methods on the first sequence with rich textural features
(b) Translational errors of the methods on the second sequence with rich textural features
(c) Translational errors of the methods on the first sequence with poor textural features
(d) Translational errors of the methods on the second sequence with poor textural features
Figure 5: A comparison of per-frame translational errors between our two methods and the single objective optimization based method.

We also measured the average runtime for one match between two images for the different methods in our experiments. From Tab. 3 we can see that our weighted sum method needs more time to complete one match than the method based on single-objective optimization. But since its time cost per match is much less than one second, the weighted sum method can still be implemented as a real-time approach. The bounded objective method, however, cannot currently be used in real-time applications due to its high time cost. The main reason is that the algorithms used to solve the SOCP are iterative numerical approximation algorithms; they require more computation to reach the solution than the analytic update of the Gauss-Newton algorithm used in the weighted sum method. Considering the convenience of setting its parameter, the bounded objective method is still promising and offers an alternative to other common methods in bi-objective optimization.

Method              runtime [ms]
Single objective    15.42
Tykkala's method    21.06
Weighted sum        22.99
Bounded objective   7093
Table 3: Runtime results for the different RGB-D odometry methods in our experiments.

5 Conclusion

In this paper, we studied two methods for solving a new bi-objective optimization formulation for robust RGB-D odometry. Both methods involve converting the bi-objective optimization problem into a single-objective problem. The weighted sum method involves minimizing the weighted linear sum of intensity and depth residuals. The bounded objective method involves minimizing the intensity residual subject to a bound on the depth residual. The experimental results show that both methods yield precise motion estimates and perform stably even when the textural information in the image sequence is poor. The bounded objective method is considerably slower than the weighted sum method. Thus, our current focus is on developing a parallel algorithm for enhancing real-time performance. We also hope to expand these ideas to other problems in robotics such as motion control, SLAM and navigation. One of the main contributions of our work is a discussion of how to use depth and intensity metrics to choose the parameters in both methods.


  • [1]

    D. Nistér, O. Naroditsky, and J. Bergen, Visual Odometry, in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol.1, 652-659, 2004.

  • [2] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics, MIT Press, Chap. 7, 2005.
  • [3] P. J. Besl, and N. D. McKay, Method for Registration of 3-D Shapes, in Robotics-DL Tentative, International Society for Optics and Photonics, 586-606, 1992.
  • [4] P. Henry, M. Krainin, E. Herbst, X. Ren, and D. Fox, RGB-D Mapping: Using Depth Cameras for Dense 3D Modeling of Indoor Environments, in Experimental Robotics, Springer, 477-491, 2014.
  • [5] H. Strasdat, J. Montiel, and A. J. Davison, Scale Drift-Aware Large Scale Monocular SLAM, in Robotics: Science and Systems, Vol.2, No. 3, 5-12, 2010.
  • [6] H. Steinbrucker, J. Sturm, and D. Cremers, Real-Time Visual Odometry from Dense RGB-D Images, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 719-722, 2011.
  • [7] C. Kerl, J. Sturm, and D. Cremers, Robust Odometry Estimation for RGB-D Cameras, in Proceedings of the IEEE International Conference on Robotics and Automation, 3748-3754, 2013.
  • [8] T. Tykkala, C. Audras, and A. I. Comport, Direct Iterative Closest Point for Real-Time Visual Odometry, in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2050-2056, 2011.
  • [9] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, A Benchmark for the Evaluation of RGB-D SLAM Systems, in Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 573-580, 2012.
  • [10] L. Zadeh, Optimality and Non-scalar-valued Performance Criteria, IEEE Transactions on Automatic Control, Vol. 8, No. 1, 59-60, 1963.
  • [11] R. Kummerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, g2o: A General Framework for Graph Optimization, in Proceedings of the IEEE International Conference on Robotics and Automation, 3607-3613, 2011.
  • [12] Y. Ma, S. Soatto, J. Kosecka, and S. Sastry, An Invitation to 3-D Vision: From Images to Geometric Models, Springer, 2003.
  • [13] R. T. Marler, and J. S. Arora, Survey of Multi-objective Optimization Methods for Engineering, Structural and Multidisciplinary Optimization, Vol. 26, No. 6, 369-395, 2004.
  • [14] R. A. Peters, and R. N. Strickland, Image Complexity Metrics for Automatic Target Recognizers, in Proceedings of the Automatic Target Recognizer System and Technology Conference, 1-17, 1990.
  • [15] C. Lu, S. Fang, Q. Jin, Z. Wang, and W. Xing, KKT Solution and Conic Relaxation for Solving Quadratically Constrained Quadratic Programming Problems, SIAM Journal on Optimization, Vol. 21, No. 4, 1475-1490, 2011.
  • [16] M. S. Lobo, L. Vandenberghe, S. Boyd, and H. Lebret, Applications of Second-order Cone Programming, Linear Algebra and its Applications, Vol. 284, No. 1, 193-228, 1998.
  • [17] A. Domahidi, E. Chu, and S. Boyd, ECOS: An SOCP Solver for Embedded Systems, in Proceedings of the European Control Conference, 3071-3076, 2013.
  • [18] A. Domahidi, E. Chu, and S. Boyd, CVXOPT: A Python Package for Convex Optimization, version 1.1.6, 2013.