The main contribution of this work is the introduction of a joint loss function which is based on the photometric error of all feature correspondences. The correspondences are parameterized by one underlying epipolar geometry. This guarantees that all correspondences are epipolar-conform by construction, and allows the pose to be optimized directly on image intensities. The starting point is the well-known Lucas-Kanade tracking method [19], which employs a quadratic photometric loss function (SSD) on a single image patch to optimize feature correspondences. Given the epipolar geometry, the search space can be reduced drastically from a 2D to a 1D search by including an epipolar constraint [23, 24].
We show how to optimize all correspondences and the epipolar geometry simultaneously, given coarse initial values of these entities (the typical situation in many applications). This is achieved by varying the epipolar geometry, and with it the adjusted correspondences of all features, in a way that minimizes the joint photometric loss function. We denote the resulting procedure as Joint Epipolar Tracking (JET). As this joint optimization is performed directly on the intensity values, JET is a ‘direct’ method in today’s terminology. This is in contrast to the widely used minimization of the reprojection error, which distills photometric information into geometric information and subsequently disregards the image intensities themselves.
We show that JET outperforms the standard minimization of the reprojection error (RPE) when optimizing the relative pose (ego-motion of the camera). The comparison was performed on synthetic and real data sets (all publicly available): synthetic data sets in order to have perfect ground truth, and real data sets to demonstrate the feasibility under realistic conditions. The synthetic data sets are COnGRATS [5] (driving sequences in road scenes) and RGB-D data from ICL-NUIM [16] (indoor footage of a hand-held camera). As a representative of real data, we use the well-known KITTI dataset [14, 15], which consists of real driving scenes in urban and highway scenarios. As dense depth information or optical flow ground truth is not available for this data set, we focus on comparing the quality of JET against RPE with regard to the relative pose.
Since we regard monocular image data, the scale of the pose remains undetermined and we only analyze the relative rotation and the relative unscaled translation of the motion. For a calibrated camera, these entities will entirely define the epipolar geometry as it is scale independent as well.
2 Related Work
Approaches to relative pose or 3D motion estimation can be divided into two basic categories: feature-based methods and direct methods, with some hybrid approaches existing as well. Feature-based methods are characterized by the extraction and the matching of salient and reproducible features that are tracked over frames. Prominent examples of feature point based optimization methods are , , and . Usually, these approaches minimize the reprojection error of tracked feature points.
So-called ‘direct’, appearance- or intensity-based methods, on the other hand, operate directly by matching pixel intensities. They propagate the original image information into the optimization scheme, usually using a differential optimization approach, and can therefore often provide more accurate estimates of pose and structure. DTAM [21] was among the first real-time dense systems. Semi-Dense Visual Odometry for a Monocular Camera [11] and its successor LSD-SLAM [10] as well as SVO [13] are more recent examples. We share the opinion of the authors of  who state that the separation into feature detection and tracking versus state estimation creates an artificial gap between the data and the state sought.
PTAM [17] could also be considered a hybrid method: a weak motion prior is used to initialize the search for a small number of features using a modified KLT at the highest level of a scale pyramid. The resulting tracks provide a better egomotion prediction, which is then used to search for a larger number of KLT tracks at lower pyramid levels, and so on until the bottom level is reached.
Stabilizing the estimation of correspondences by integrating a given (or assumed) epipolar relation into the matching process has been used in numerous approaches. For instance, the authors of [2, 23, 25, 27, 28] use epipolar constraints for stabilizing discrete matching, whereas Valgaerts et al. [26] proposed a variational approach to estimate dense optical flow and the epipolar geometry (represented by the fundamental matrix) simultaneously. Other direct methods that also explicitly take into consideration the depth structure of the scene are [4, 18, 20].
An important property which typically distinguishes appearance-based, direct, dense or semi-dense approaches from feature-based approaches is that direct methods often use parametric models of the flow field and hence can utilize edges as well as corners. If no explicit motion priors or dynamic models are used, these direct methods generally depend on a high frame rate that ensures moderate displacements, whereas feature-based matching can work even with very large displacements. However, even in this case a photometric ‘direct’ post-optimization can be performed. JET is well suited to do just this.
We outline our approach and introduce our notation, starting from plain Lucas-Kanade tracking [19] in section 3.1 and subsequently revisiting epipolar constrained Lucas-Kanade tracking in section 3.2. This leads to the presentation of joint epipolar tracking in section 3.3.
3.1 General Lucas-Kanade Tracking
The aim of differential direct tracking, often denoted as Lucas-Kanade tracking [19], is to successively determine, for a given feature point in one image, the position of the corresponding feature point in another image. We use the weighted sum of squared differences (WSSD) as loss function for patch comparisons, thus implicitly modeling the image noise as signal-independent, i.i.d. and Gaussian.
The non-negative pixel weights and the size of the patches are defined by a normalized kernel. All points with a non-zero weight are taken into account for the patch difference. In a typical scenario, a feature point and an initial estimate for the corresponding feature are given, and the task is to optimize this correspondence by minimizing the WSSD with respect to the image displacement. Since this problem cannot be solved directly in closed form, a local first-order Taylor approximation of the image difference is usually applied. This yields the approximated weighted sum of squared differences of equation (3), which is quadratic in the displacement.
Since the image difference has been linearized, this is an approximation to the ‘true’ optimization problem, well known in nonlinear optimization theory as the Gauss-Newton method. The approximation in equation (3) yields a convex parabolic function, which can be solved for the optimal displacement. Due to the linearization of the image, this approximation should only be used to improve the feature correspondence, which then serves as a new initialization for another step of the incremental optimization process.
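As an illustration of the linearized update described above, the following Python sketch performs a single Gauss-Newton step for one patch. It is a minimal sketch under assumptions: the function name `lk_step`, the array layout, and the use of `np.gradient` for the image derivatives are chosen here for illustration and are not taken from the original implementation. With the patch gradient $\mathbf{g}$ and the patch difference $\Delta$, it solves $A\mathbf{d} = -\mathbf{b}$ for $A = \sum w\,\mathbf{g}\mathbf{g}^\top$ and $\mathbf{b} = \sum w\,\mathbf{g}\,\Delta$.

```python
import numpy as np

def lk_step(patch_ref, patch_cur, weights):
    """One Gauss-Newton update of the displacement for a single patch.

    patch_ref: patch around the feature in the first image (H x W)
    patch_cur: patch around the current match estimate in the second image (H x W)
    weights:   non-negative kernel of the same shape (assumed layout)
    Returns the displacement increment (dx, dy) that minimizes the
    linearized weighted sum of squared differences.
    """
    patch_ref = np.asarray(patch_ref, dtype=float)
    patch_cur = np.asarray(patch_cur, dtype=float)
    # np.gradient returns derivatives along axis 0 (rows, y) and axis 1 (cols, x)
    gy, gx = np.gradient(patch_cur)
    diff = (patch_cur - patch_ref).ravel()           # residual at zero displacement
    g = np.stack([gx.ravel(), gy.ravel()], axis=1)   # N x 2 Jacobian
    w = np.asarray(weights, dtype=float).ravel()
    A = g.T @ (w[:, None] * g)                       # 2 x 2 weighted structure tensor
    b = g.T @ (w * diff)                             # 2-vector
    # Minimizer of the quadratic approximation: A d = -b
    return -np.linalg.solve(A, b)
```

Because the image is only linearized, the returned increment is an approximation; in practice the patch would be re-sampled at the updated position and the step repeated. If the patch contains only a single gradient direction, `A` becomes singular (the aperture problem), which is one motivation for the epipolar constraint of the next section.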
3.2 Epipolar Constrained Tracking
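In constrained epipolar tracking, we consider the relative pose, given by a rotation matrix $R \in SO(3)$ and a translation vector $\mathbf{t} \in \mathbb{R}^3$, to be known, and adjust the feature correspondences to comply with the given epipolar geometry. This yields the epipolar constraint

$$\tilde{\mathbf{x}}_2^\top F \, \tilde{\mathbf{x}}_1 = 0 \tag{5}$$

for the homogeneous image coordinates $\tilde{\mathbf{x}}_1$, $\tilde{\mathbf{x}}_2$ of a corresponding point pair. In equation (5), $F$ is the fundamental matrix,

$$F = K^{-\top} [\mathbf{t}]_\times R \, K^{-1}, \tag{6}$$

which is defined up to a scale factor. $K$ is the camera matrix holding the intrinsic camera parameters and

$$[\mathbf{t}]_\times = \begin{pmatrix} 0 & -t_3 & t_2 \\ t_3 & 0 & -t_1 \\ -t_2 & t_1 & 0 \end{pmatrix}$$

is the skew symmetric matrix of the translation vector. We write the fundamental matrix as a function of the motion parameters to denote that, for a given camera matrix, it is fully determined by the (unscaled) motion parameters. We use the polar parametrization of the rigid transformation as proposed in :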
These parameters are a minimal representation of the relative unscaled pose. The pitch angle, the yaw angle and the roll angle are the rotational degrees of freedom about the x-, y- and z-axis. The azimuth and the polar angle represent the unscaled translation in polar coordinates.
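The mapping from the five motion parameters to the fundamental matrix can be sketched as follows. This is only an illustration under stated assumptions: the composition order of the three rotations, the convention that the polar angle is measured from the optical (z) axis, the pose convention $X_2 = R X_1 + \mathbf{t}$, and all function names are hypothetical choices made here, since the exact conventions of the parametrization are not recoverable from the text.

```python
import numpy as np

def rotation_from_angles(pitch, yaw, roll):
    """Rotation about x (pitch), y (yaw) and z (roll); the order Rz*Ry*Rx is an assumption."""
    cx, sx = np.cos(pitch), np.sin(pitch)
    cy, sy = np.cos(yaw), np.sin(yaw)
    cz, sz = np.cos(roll), np.sin(roll)
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]])
    Ry = np.array([[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]])
    Rz = np.array([[cz, -sz, 0.0], [sz, cz, 0.0], [0.0, 0.0, 1.0]])
    return Rz @ Ry @ Rx

def translation_direction(azimuth, polar):
    """Unit translation vector; polar angle measured from the z-axis (assumption)."""
    return np.array([np.sin(polar) * np.cos(azimuth),
                     np.sin(polar) * np.sin(azimuth),
                     np.cos(polar)])

def skew(t):
    """Skew symmetric matrix such that skew(t) @ v equals the cross product t x v."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_matrix(theta, K):
    """theta = (pitch, yaw, roll, azimuth, polar); K is the 3x3 camera matrix."""
    pitch, yaw, roll, azimuth, polar = theta
    R = rotation_from_angles(pitch, yaw, roll)
    t = translation_direction(azimuth, polar)
    K_inv = np.linalg.inv(K)
    return K_inv.T @ skew(t) @ R @ K_inv
```

Since the translation enters only as a direction, scaling it changes neither the epipolar lines nor the constraint, which mirrors the scale independence of the epipolar geometry noted above.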
The optimal displacement can be computed by minimizing the approximated loss of equation (3). In combination with the epipolar constraint of equation (5), this yields the linear equation system of equation (10). The matrix on the left-hand side of this equation system is symmetric and is a function of the motion parameters, i.e. there is a closed-form solution for the optimal displacement which depends on the motion parameters (equation (11)).
For a given epipolar geometry (which is equivalent to a given unscaled relative pose and a calibrated camera), the linear equation system (10) is the extension of the standard Lucas-Kanade equation (see equation (3)) with an epipolar constraint. It can be used to optimize image correspondences if the epipolar geometry is known beforehand.
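One common way to obtain such a symmetric linear system is to add the linear epipolar constraint to the quadratic loss with a Lagrange multiplier. The sketch below follows that route; it is an assumed formulation for illustration, not necessarily the exact system (10) of the paper. The names `A` and `b` denote the 2x2 matrix and 2-vector of the linearized patch loss, and `constrained_lk_step` is a hypothetical function name.

```python
import numpy as np

def constrained_lk_step(A, b, F, x1, x2):
    """Displacement d of x2 minimizing 0.5*d^T A d + b^T d
    subject to (x2 + d) lying on the epipolar line of x1.

    A, b: quadratic and linear coefficients of the linearized patch loss
    F:    3x3 fundamental matrix
    x1:   feature position in the first image, shape (2,)
    x2:   current estimate of the match in the second image, shape (2,)
    """
    line = F @ np.append(x1, 1.0)         # epipolar line (a, b, c) in the second image
    n = line[:2]                          # line normal (not normalized)
    offset = line @ np.append(x2, 1.0)    # constraint violation at d = 0
    # Symmetric KKT system: [[A, n], [n^T, 0]] [d; mu] = [-b; -offset]
    kkt = np.block([[A, n[:, None]],
                    [n[None, :], np.zeros((1, 1))]])
    rhs = np.concatenate([-b, [-offset]])
    solution = np.linalg.solve(kkt, rhs)
    return solution[:2]                   # drop the Lagrange multiplier
```

The constraint removes one degree of freedom, so the system remains solvable even for patches whose structure tensor alone would be rank deficient, as long as the image gradient is not orthogonal to the epipolar line direction.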
3.3 Joint Epipolar Tracking
The present work extends the epipolar constrained tracking in the following sense: We do not only optimize each feature correspondence individually with respect to a given epipolar geometry, but build a joint loss function which can be optimized with respect to the underlying motion that characterizes the displacements of all image points (given that all points obey the same epipolar relation).
Using this approach, we can additionally optimize the motion parameters themselves. We call this method Joint Epipolar Tracking (JET). To this end, we perform a re-parametrization of the loss function by substituting the functional dependency into it (compare equations (8) and (11) respectively):
By using this definition of the displacement, the optimization of the loss function is no longer performed with respect to an image displacement but with respect to an epipolar geometry, which is induced by the relative pose between the camera and the environment. This relative pose induces the optical flow in the image domain.
Joining together the loss functions from equation (12) for several feature correspondences and adding a prior term for the motion (expressing a statistical model of ‘typical’ motion) yields the joint loss function of equation (13).
The minimization of this function determines the motion parameters, and hence the unscaled relative pose, that best describe the optical flow. In equation (13), the part of the joint loss function that depends on the image information is extended by a second part that incorporates statistical prior knowledge, coupled via a coupling constant. The prior information is characterized by a prediction of the expected motion and a covariance matrix of the prediction residuals.
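A joint loss of the kind described here could be written as below. The notation is assumed for illustration only: $\theta$ for the motion parameters, $L_i$ for the linearized patch loss of feature $i$, $\mathbf{d}_i(\theta)$ for its epipolar-constrained displacement, $\hat{\theta}$ for the predicted motion, $C$ for the covariance of the prediction residuals, and $\lambda$ for the coupling constant. The Mahalanobis form of the prior term is likewise an assumption that is consistent with the description, not a reproduction of equation (13).

```latex
% Sketch of a joint photometric loss with a Gaussian motion prior (assumed notation)
J(\theta) \;=\; \sum_{i=1}^{N} L_i\bigl(\mathbf{d}_i(\theta)\bigr)
\;+\; \lambda \,\bigl(\theta - \hat{\theta}\bigr)^{\top} C^{-1} \bigl(\theta - \hat{\theta}\bigr)
```

With $\lambda = 0$ the prior is switched off and only the image information drives the optimization.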
These motion prior terms are determined by a linear regression approach on a dataset of motion parameters that are representative for the type of motion to be expected (e.g. restricted car motion, unrestricted motion of a handheld camera). We use a very similar approach as in to determine the parameters of a linear predictor. The difference is that we employ a third order predictor, i.e. the preceding three motion parameter sets are taken into consideration when evaluating the statistics and performing the dynamic prediction.
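A third-order linear predictor of this kind can be fitted by ordinary least squares. The following sketch is an assumption-laden illustration: the function names, the inclusion of a bias term, and the use of `np.linalg.lstsq` are choices made here and are not taken from the cited approach.

```python
import numpy as np

def fit_linear_predictor(params, order=3):
    """Fit a linear predictor of the next motion parameter set.

    params: array of shape (T, P), one motion parameter set per frame pair
    Returns the weight matrix W of shape (order * P + 1, P) and the
    covariance matrix (P x P) of the prediction residuals.
    """
    params = np.asarray(params, dtype=float)
    T = params.shape[0]
    # Column blocks: most recent predecessor first, then older ones, then a bias column
    lagged = [params[order - 1 - k : T - 1 - k] for k in range(order)]
    X = np.hstack(lagged + [np.ones((T - order, 1))])
    Y = params[order:]
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)
    residuals = Y - X @ W
    cov = np.atleast_2d(np.cov(residuals, rowvar=False))
    return W, cov

def predict(W, history):
    """Predict the next parameter set from the last `order` sets (most recent last)."""
    x = np.append(np.asarray(history, dtype=float)[::-1].ravel(), 1.0)
    return x @ W
```

The prediction would then serve as the expected motion and the residual covariance as the weighting of the prior term in the joint loss.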
Equation (13) can be expressed in vertex form, so that its optimization is represented as a least squares problem.
With an initial estimate of the motion parameters (the prediction based on the previous motion parameters), we can now solve this optimization problem using a nonlinear solver such as the Ceres solver [1]. The result of this motion optimization is then used to improve the feature correspondences by shifting each corresponding image point onto its epipolar line (see equations (10) and (11)). Since the original image difference was replaced by a linear approximation in the Gauss-Newton approach at the beginning, these improved correspondences and the improved motion parameters serve as the initialization for the next iteration of this optimization procedure. We continue with this procedure as long as the target loss function decreases. Note that the target loss function incorporates the exact image difference as introduced in equation (2).
The optimization of the relative pose using JET does not merely minimize the reprojection error (which is in fact zero, since the feature correspondences are optimized with respect to the relative pose), but instead minimizes the photometric error of the feature correspondences by including the full image information encoded in the quantities of the linearized loss (see section 3.1). Compared to other leading direct methods, such as [9, 10], JET is the most compact formulation of the direct two-view point-based pose optimization problem based on minimizing the photometric error.
Table 1: First two moments (mean and standard deviation) of the rotational and the translational error from the evaluation on the COnGRATS and ICL-NUIM RGB-D datasets. Datasets ending with * indicate the use of prior knowledge. All values are in degrees.
We evaluated the JET procedure presented here on synthetic data [5, 16], applying noise to the different input parameters to investigate the robustness of our components against noise. As we used synthetic data, we had perfect ground truth to compare our results against, a situation that is usually very hard to obtain for real-life driving scenarios [14, 15].
The aim of our experiment is to optimize the motion parameters and to correct the feature correspondences so that they obey the epipolar geometry induced by the optimized motion parameters.
We compare the results achieved with JET against the results achieved with a method that minimizes the reprojection error (RPE).
4.1 Competing method: Optimization of the reprojection error
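The competing RPE optimization minimizes the reprojection error and performs the following steps:
1. (Optional) Optimize the feature correspondences using standard Lucas-Kanade tracking.
2. Minimize the reprojection error with respect to the motion parameters.
3. Perform a minimum correction of the correspondences, so that they are in agreement with the optimized motion parameters.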
The first step is optional and optimizes the feature correspondences using standard Lucas-Kanade tracking as implemented in OpenCV [7]. We run experiments with step one both enabled and disabled. In the mandatory second step, RPE optimizes the motion parameters by minimizing the reprojection error over all feature correspondences. The error of a single correspondence is the distance of the image point to its epipolar line, which is specified by the fundamental matrix (see equation (6)) and the corresponding point in the other image. We delegate the optimization of the RPE loss function to the Ceres solver [1].
After having computed the optimized motion parameters, we determine the optimized corresponding points by projecting each corresponding point onto the closest point of its respective epipolar line.
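The two geometric operations used by the RPE baseline, the point-to-epipolar-line distance and the orthogonal projection onto the line, can be sketched as follows. The function names are hypothetical, and the convention that `F @ x1` gives the epipolar line in the second image is an assumption of this sketch.

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance (in pixels) of x2 to the epipolar line of x1 in the second image."""
    line = F @ np.append(x1, 1.0)                  # line coefficients (a, b, c)
    return abs(line @ np.append(x2, 1.0)) / np.hypot(line[0], line[1])

def project_to_epipolar_line(F, x1, x2):
    """Closest point to x2 on the epipolar line of x1 (orthogonal projection)."""
    line = F @ np.append(x1, 1.0)
    n = line[:2]                                   # line normal (not normalized)
    signed = (line @ np.append(x2, 1.0)) / (n @ n) # signed offset along the normal
    return x2 - signed * n
```

After the projection, the distance of the corrected point to its epipolar line is zero up to rounding, so the corrected correspondences satisfy the epipolar constraint.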
Both methods were initialized with exactly the same estimated image correspondences and the same estimate of the motion parameters. When using synthetic data, it is straightforward to obtain ground truth reference values for the correspondences as well as for the motion parameters. The COnGRATS [5] scenes we used in the evaluation re-use pose sequences from the KITTI benchmark. To make a coarse estimate of the variation range of the motion parameters, we checked the statistics of the motion parameters on the KITTI dataset, which covers a wide range of driving scenarios and can be considered representative of realistic car motion.
If we assume a normal distribution of
and use the KITTI motion statistic to find upper bounds for the variances of the translational and rotational degrees of freedom (and ), we can estimate the interval to be and . More than % of the motion parameters do not deviate by more than from their temporal predecessor.
We use these insights to justify a realistic variation range of and for the rotation and translation parameters respectively. These ranges correspond to more than standard deviations
. We apply uniformly distributed noise with the just derived intervals to the motion parameters.
A similar consideration for a hand held camera, as it is used in the second synthetic dataset , leads to a variation range of and for the rotation and translation parameters, respectively.
For the corresponding image points , we apply uniform noise to the - and -component of the ground truth value, each with a level of pixels.
4.3 Evaluation measures
Table 2: Columns are Rotation [deg], Translation [deg], and SSD.
Each experiment gets initialized with an approximation of the pose and with initial image correspondences. To quantify the quality of the input and the output of the methods, the deviation from ground truth is expressed by the following four evaluation measures:
Rodrigues angle (rotational error):
The rotation parameters (pitch, yaw and roll) define a rotation matrix, which is compared against the ground truth via the relative rotation between the two. According to Rodrigues’ formula, this relative rotation can be interpreted as a rotation by an angle about some axis. The absolute value of this Rodrigues angle serves as a measure for the deviation from the ground truth rotation.
Angle of intersection (translational error):
The translation parameters (azimuth and polar angle) represent the direction of the translation vector. The translation direction is compared to the ground truth via the absolute value of the angle of intersection between the two directions.
RMS distance of corresponding points (positional error):
The quality of the point correspondences is characterized by their deviation from ground truth, measured as the RMS distance between the estimated and the ground truth positions.
Joint weighted sum of squared differences SSD (photometric error):
The only measure that is absolute and not relative to the ground truth is the SSD. It is the average squared gray value difference over all patches of the image correspondences.
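The four measures above can be sketched in a few lines of NumPy. The function names and array layouts are assumptions made for this illustration; in particular, the SSD here is an unweighted mean over all pixels of all patches, which may differ from the weighting used in the actual evaluation.

```python
import numpy as np

def rodrigues_angle_deg(R_est, R_gt):
    """Angle (degrees) of the relative rotation between estimate and ground truth."""
    R_rel = R_est @ R_gt.T
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return np.degrees(np.arccos(cos_angle))

def translation_angle_deg(t_est, t_gt):
    """Angle of intersection (degrees) between two translation directions."""
    cos_angle = np.dot(t_est, t_gt) / (np.linalg.norm(t_est) * np.linalg.norm(t_gt))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

def rms_distance(points_est, points_gt):
    """Root mean square Euclidean distance between (N, 2) point arrays."""
    squared = np.sum((np.asarray(points_est) - np.asarray(points_gt)) ** 2, axis=1)
    return np.sqrt(np.mean(squared))

def mean_ssd(patches_ref, patches_cur):
    """Average squared gray value difference over all patches, arrays of shape (N, H, W)."""
    diff = np.asarray(patches_cur, dtype=float) - np.asarray(patches_ref, dtype=float)
    return np.mean(diff ** 2)
```

The clipping before `arccos` guards against arguments that leave the interval [-1, 1] by rounding error when the two rotations or directions are nearly identical.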
4.4 COnGRATS & ICL-NUIM RGB-D dataset
The COnGRATS dataset contains two road scenes of a construction site on a highway (‘ConstructionSite’), showing maneuvers at low velocities, and another highway scene (‘Highway’) with the car travelling mainly straight ahead at a much higher speed. Both scenes use a camera setup similar to KITTI and were generated using the pose information from the KITTI odometry dataset. This enables us to use the extensive motion data in KITTI to generate a statistical model of ego-dynamics to be used as statistical prior. The results are shown in the first and second column of figure 2, and the mean and standard deviation are listed in table 1.
The results show that, without prior knowledge, JET reduces the rotational error to approximately half the value of RPE by using image information. While using prior knowledge does not seem to have a large impact on the rotation optimization of RPE, it does for JET: using the prior, JET nearly halves the rotational error once more compared to not using a prior. The observations for JET also hold for the translational error: the use of a prior more than halves the error. In contrast to the optimization of the rotation, the translation optimization of RPE also benefits greatly from the prior, which reduces the error by more than half. This behavior becomes very clear when comparing the histograms of the rotational and the translational error for the cases with and without prior information (first and second column of figure 2, respectively). JET is the clear winner for the rotation optimization and also dominates the optimization of the translation without the prior. Enabling the prior leads to a head-to-head situation for the translational error.
Regarding the SSD, the influence of the optional Lucas-Kanade tracking for RPE is easy to see: the value is strongly decreased. However, JET also dominates in this respect. It achieves SSD values that are clearly below the ground truth value, indicating a very good quality of the optimization of the feature correspondences. Nevertheless, the feature correspondences of JET deviate on average by about 1 pixel from the ground truth position. This behavior (a similar or even better SSD value while still deviating from the ground truth position) can be explained by the use of patch matching and the existence of a locally non-constant optical flow field (caused by rotation and by translation in the direction of the optical axis, which leads to different scalings). Apart from that, the RMS value of JET is clearly superior to the results of RPE.
The ICL-NUIM RGB-D dataset we evaluated on contains synthetic data of a hand-held camera which is carried through a living room (‘LivingRoom02’) and an office room (‘OfficeRoom02’). The motion is dominated by strong rotations and involves only slight translation. As the motion is less constrained than vehicle motion, the positive influence of integrating prior knowledge is less pronounced. Therefore, we only present results without using the prior. They are visualized in the third column of figure 2 and listed in table 1.
In summary, the results on the RGB-D dataset are similar to the ones achieved on COnGRATS. JET is superior to RPE in optimizing the rotation and translation (see histograms in the third column of figure 2). It dominates the SSD results by achieving SSD values below the ground truth value, and it is also clearly superior in optimizing the image point correspondences (RMS). Due to the harder conditions of data from a hand-held camera, all results are slightly worse than they were for the COnGRATS dataset. The optimization of the translation direction is especially difficult (see the third column of figure 2 and table 1) when only small translation magnitudes can be observed. Even a minor shaking of the hand, as it is simulated in the scenes, can lead to constant and pronounced changes in the direction of the translation. Even though this behavior has only a small influence on the relative pose and on the optical flow in the image domain, it has a strong influence on the evaluation of the translation direction. This is a limitation of our parametrization: the direction of the translation is almost undetermined due to its vanishing magnitude, and no scale is available due to the use of a mono camera setup.
Apart from this, the optimization of the unscaled relative pose and the feature correspondences was very successful and largely improved by including the photometric matching information when using JET.
4.5 KITTI Dataset
We also performed experiments on the KITTI dataset. Since KITTI does not provide ground truth for image point correspondences (via a dense depth or optical flow map), we cannot take ground truth correspondences and apply noise to them to serve as an initialization. Therefore, we initialize the correspondences by employing propagation based tracking as presented in [12]. Similar to the experiments on the synthetic data, we compare the results of both methods. We use the KITTI ground truth of the pose and compare JET and RPE with respect to the rotational error and the translational error. In order to compare the quality of the feature correspondences of the two methods, we regard the photometric error (SSD).
The results of the experiment on the KITTI dataset are shown in table 2. The table presents the mean values of the Rodrigues angle, the angle of intersection of the translation, and the SSD for each KITTI sequence that has ground truth available. The results confirm that JET performs clearly better than RPE in terms of rotation optimization: the mean rotational error of JET is two to three times lower than that of RPE. In terms of translation, the results show a head-to-head situation between RPE and JET, with a slight lead for JET. Thus, in summary, JET yields a significantly better pose than RPE.
Besides improving the pose, JET also refines the feature correspondences. As correspondence ground truth is not available in KITTI, the residual error in the feature correspondences after performing JET cannot be determined. However, the feature correspondences from JET possess a much smaller photometric error (SSD) than after RPE optimization, as can be seen in table 2.
5 Summary and Conclusion
This paper proposed a novel algorithm in the area of feature tracking and frame-to-frame pose estimation, denoted as Joint Epipolar Tracking (JET). The proposed algorithm employs a direct method to simultaneously optimize the epipolar geometry and the feature correspondences. It iteratively solves the minimization problem of the newly introduced joint loss function, in which additional statistical information about the motion can be included to serve as prior knowledge. Experiments on several datasets, synthetic and real (COnGRATS, ICL-NUIM, and KITTI), have shown that the proposed method performs better than the competing method of RPE optimization. It attains real-time performance when using roughly 400 features on a single thread of an Intel Core i7-6700 CPU. On average, the rotational errors are three times smaller than those of RPE. The translation direction can be improved as well if the translation is sufficiently encoded in the optical flow of the image. Furthermore, the photometric error (SSD) of the feature patches is massively reduced in all cases, which suggests a better quality also of the 3D information that can be computed from the point correspondences.
References
- [1] S. Agarwal, K. Mierle, and Others. Ceres solver. http://ceres-solver.org, 2012.
- [2] H. Alismail, B. Browning, and S. Lucey. Photometric Bundle Adjustment for Vision-Based SLAM. In Asian Conference on Computer Vision (ACCV), 2016.
- [3] H. Badino, A. Yamamoto, and T. Kanade. Visual Odometry by Multi-frame Feature Integration. In International Conference on Computer Vision (ICCV-W) Workshops, pages 222–229, 2013.
- [4] J. Berger, A. Neufeld, F. Becker, F. Lenzen, and C. Schnoerr. Second Order Minimum Energy Filtering on SE(3) with Nonlinear Measurement Equations. In International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), pages 397–409, 2015.
- [5] D. Biedermann, M. Ochs, and R. Mester. COnGRATS: Realistic Simulation of Traffic Sequences for Autonomous Driving. In Image and Vision Computing New Zealand (IVCNZ), 2015.
- [6] H. Bradler, B. A. Wiegand, and R. Mester. The Statistics of Driving Sequences - And What We Can Learn from Them. In International Conference on Computer Vision (ICCV-W) Workshops, pages 106–114, 2015.
- [7] G. Bradski. OpenCV. http://opencv.org, 2000.
- [8] I. Cvišić and I. Petrović. Stereo odometry based on careful feature selection and tracking. In European Conference on Mobile Robots (ECMR), pages 1–6, 2015.
- [9] J. Engel, V. Koltun, and D. Cremers. Direct Sparse Odometry. In arXiv:1607.02565 [cs.CV], 2016.
- [10] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-Scale Direct Monocular SLAM. In European Conference on Computer Vision (ECCV), pages 834–849, 2014.
- [11] J. Engel, J. Sturm, and D. Cremers. Semi-dense Visual Odometry for a Monocular Camera. In International Conference on Computer Vision (ICCV), pages 1449–1456, 2013.
- [12] N. Fanani, M. Ochs, H. Bradler, and R. Mester. Keypoint trajectory estimation using propagation based tracking. In Intelligent Vehicles Symposium (IV), 2016.
- [13] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast Semi-Direct Monocular Visual Odometry. In International Conference on Robotics and Automation (ICRA), pages 15–22, 2014.
- [14] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 32(11):1231–1237, 2013.
- [15] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361, 2012.
- [16] A. Handa, T. Whelan, J. McDonald, and A. Davison. A Benchmark for RGB-D Visual Odometry, 3D Reconstruction and SLAM. In International Conference on Robotics and Automation (ICRA), pages 1524–1531, 2014.
- [17] G. Klein and D. Murray. Parallel Tracking and Mapping for Small AR Workspaces. In International Symposium on Mixed and Augmented Reality (ISMAR), pages 225–234, 2007.
- [18] F. Lenzen and J. Berger. Solution-Driven Adaptive Total Variation Regularization. In International Conference on Scale Space and Variational Methods in Computer Vision (SSVM), pages 203–215, 2015.
- [19] B. D. Lucas and T. Kanade. An Iterative Image Registration Technique with an Application to Stereo Vision. In International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.
- [20] A. Neufeld, J. Berger, F. Lenzen, and C. Schnoerr. Estimating Vehicle Ego-Motion and Piecewise Planar Scene Structure from Optical Flow in a Continuous Framework. In German Conference on Pattern Recognition (GCPR), pages 41–52, 2015.
- [21] R. A. Newcombe, S. J. Lovegrove, and A. J. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), pages 2320–2327, 2011.
- [22] M. Persson, T. Piccini, M. Felsberg, and R. Mester. Robust stereo visual odometry from monocular techniques. In Intelligent Vehicles Symposium (IV), pages 686–691, 2015.
- [23] T. Piccini, M. Persson, K. Nordberg, M. Felsberg, and R. Mester. Good Edgels to Track: Beating the Aperture Problem with Epipolar Geometry. In European Conference on Computer Vision (ECCV-W) Workshops, pages 652–664, 2014.
- [24] M. Trummer, J. Denzler, and C. Munkelt. KLT Tracking Using Intrinsic and Extrinsic Camera Parameters in Consideration of Uncertainty. In International Conference on Computer Vision Theory and Applications (VISAPP), pages 346–351, 2008.
- [25] M. Trummer, C. Munkelt, and J. Denzler. Extending GKLT Tracking - Feature Tracking for Controlled Environments with Integrated Uncertainty Estimation. In Scandinavian Conference on Image Analysis (SCIA), pages 460–469, 2009.
- [26] L. Valgaerts, A. Bruhn, and J. Weickert. A Variational Model for the Joint Recovery of the Fundamental Matrix and the Optical Flow. In German Conference on Pattern Recognition (GCPR), pages 314–324, 2008.
- [27] C. Vogel, K. Schindler, and S. Roth. 3D Scene Flow Estimation with a Rigid Motion Prior. In International Conference on Computer Vision (ICCV), pages 1291–1298, 2011.
- [28] K. Yamaguchi, D. McAllester, and R. Urtasun. Robust Monocular Epipolar Flow Estimation. In Conference on Computer Vision and Pattern Recognition (CVPR), pages 1862–1869, 2013.