Ego-motion estimation is an essential functionality for achieving autonomous navigation and maneuvering of robots. To estimate the robot's pose in the environment, the surrounding structures and objects need to be modeled accurately. Recently many approaches using various sensors, including LIDARs, radars, and/or cameras, have been developed. Among them, cameras are attractive for their low cost, passive sensing, abundant information about the environment, mechanical robustness, and more.
Visual odometry (VO), ego-motion estimation using one or more cameras, has been researched for decades with the goals of real-time processing, accurate pose estimation, and robustness to external disturbances. It has been widely applied in augmented/virtual reality (AR/VR), advanced driver assistance systems (ADAS), and autonomous driving. A variety of VO systems exist: for example, ORB-SLAM2 [mur2017orb] works with monocular or stereo input using feature points with binary descriptors, while Stereo-DSO [wang2017stereo] uses a stereo camera with edge features. Monocular systems are attractive for their simple hardware configuration, but they cannot estimate the true scale of motion, so they are mostly used for AR/VR where metric poses are not needed. Multi-view (stereo) VO systems produce more robust, truly metric ego-motion estimates suitable for robot applications, but require more computation. All VO algorithms are vulnerable to excessive dynamic objects: when a significant portion of the field-of-view (FOV) is covered by moving objects, the pose estimation becomes unstable and incorrect.
In this paper we propose a novel omnidirectional visual odometry system using a multi-camera rig with wide-FOV fisheye lenses and a wide baseline, to maximize the stability and accuracy of the pose estimation. With four cameras with 220° FOV lenses, it is possible to observe the full 360° view around the robot, and most regions in the environment are visible from at least two views, which makes stereo triangulation possible.
Our wide-baseline wide-FOV setup maximizes the pose accuracy, as the stereo resolution is proportional to the baseline length, and it also minimizes the image area occluded by the robot body. However, it also poses new challenges. First, tracking and matching features becomes harder because the viewing directions to the points can be substantially different, and there is large distortion in the periphery of the fisheye images, where most of the overlap between views occurs. Second, as the cameras are mounted far apart, the rigid-rig assumption is no longer valid: due to shocks, vibrations, and heat, the rig can deform or the cameras can move unexpectedly.
We add three novel components to the VO pipeline to resolve these issues. To improve feature matching, the lens distortion needs to be removed, but since the fisheye lens covers more than 180°, the image cannot be warped onto a single plane. We propose a hybrid projection model that uses two planes in the overlapping regions and a cylinder that smoothly connects them. This projection model enables continuous tracking of feature points within each view and consistent feature descriptors across views.
To estimate the current pose from noisy feature matches, we use a RANSAC algorithm with a minimal pose estimator. For the omnidirectional case, one can use minimal solvers with a generalized camera model [Kneip2014UPnPAO, Schweighofer2008GloballyOO], but we take a simpler approach: computing the pose using P3P [kneip2011novel] from one view and checking the inliers over all feature matches. This multi-view P3P RANSAC effectively and robustly estimates the rig poses in highly dynamic scenes.
Lastly, we implement online extrinsic calibration to deal with unexpected changes of the rig-to-camera poses during system operation. Besides rig deformation from external causes, the initial calibration may not be very accurate due to the size and position of the rig. The online extrinsic calibration, built into the local bundle adjustment, constantly updates the extrinsic parameters from the tracked features in the current map, and it greatly improves the robustness and accuracy of the system.
For experimental evaluation we render synthetic datasets with ground-truth poses and collect challenging real datasets using the omnidirectional rig shown in Figure 1. Extensive experiments show that the proposed VO system achieves good performance on all synthetic and real datasets.
Our main contributions can be summarized as follows:
An effective novel image projection model which allows finding and tracking feature correspondences between fisheye cameras in a wide-baseline setup.
A visual odometry system using multiple fisheye cameras with overlapping views that operates robustly in highly dynamic environments, owing to the multi-view P3P RANSAC algorithm and the online extrinsic calibration integrated with the back-end local bundle adjustment.
Extensive evaluation of the proposed algorithms on synthetic and real datasets verifies their superior performance. All datasets as well as the system implementation will be made public upon paper publication.
II Related Works
In the VO and visual SLAM literature, many different camera configurations have been researched. There are various monocular systems [mur2015orb, engel2018direct, forster2014svo] that are point-feature-based, directly optimize poses with image contents, or are hybrid. They show outstanding performance, but due to the fundamental limitation of the monocular setup, metric poses cannot be estimated. For robotic applications, stereo-based systems [mur2017orb, wang2017stereo] have been proposed. Another limitation of conventional systems is the small FOV, which can make a VO system unstable due to a lack of features or the presence of dynamic objects. For this practical reason, fisheye-camera-based methods have been researched recently. Caruso et al. [caruso2015large] propose a fisheye visual SLAM with direct methods. Liu et al. [liu2017direct] use a fisheye stereo camera and recover a metric-scale trajectory. Most recently, Matsuki et al. [matsuki2018omnidirectional] proposed an omnidirectional visual odometry with the direct sparse method.
For improved environmental awareness and perception capabilities, multi-camera methods have also been studied. Lee et al. [hee2013motion] present a visual odometry algorithm for a multi-camera system which can observe the full surrounding view. They successfully estimate ego-motion with the 2-point algorithm, showing the importance of inter-camera correspondences for recovering metric scale. Heng et al. [heng2015self] implement a visual SLAM and self-calibration system with at least one calibrated stereo camera and an arbitrary number of monocular cameras that have overlapping views with the stereo camera. Recently, a robust multi-camera system using direct methods with plane-sweeping stereo was proposed by Liu et al. [liutowards]. Finding correspondences between fisheye images is a challenging and important problem, and many researchers have devoted effort to it. Special descriptors [guan2017brisks, zhao2015sphorb] have been designed to account for the distortion, and Hane et al. [hane2014real] and Gao et al. [gao2017dual] proposed dense matching algorithms for fisheye images.
III Notations
In this section, we introduce the notations used in this paper. A rigid transformation $T = (\mathbf{r}, \mathbf{t})$ is parameterized by an axis-angle rotation vector $\mathbf{r} \in \mathbb{R}^3$ and a translation vector $\mathbf{t} \in \mathbb{R}^3$. It transforms a 3D point $X$ to $R(\mathbf{r})\,X + \mathbf{t}$, where $R(\mathbf{r})$ is the rotation matrix for $\mathbf{r}$. $T_1 \circ T_2$ denotes the composition of transformations, and $T^{-1}$ is the inverse transformation. As in Figure 2, we use three coordinate systems, world ($w$), body ($b$), and camera ($c$); when needed, the coordinate system is marked on the left of the symbols, like $T_{wb}$ meaning the rigid transform from the body to the world coordinate system, or ${}^{w}X$ a point in the world coordinate system. When time is involved, it is denoted as a subscript. For example, the camera coordinates of a world point ${}^{w}X$ at time $\tau$ can be written as ${}^{c}X_\tau = T_{bc}^{-1} \circ T_{wb_\tau}^{-1} \cdot {}^{w}X$.
The camera intrinsic calibration determines the mapping between a 3D point and a pixel coordinate in the image. We denote the projection function as $\pi(X; \theta)$ for camera intrinsic parameters $\theta$. We use $\pi_s$ for projection onto the unit sphere, $\pi_s(X) = X / \|X\|$, where the result is a unit-length ray pointing at $X$; a feature point in the image can likewise be represented by its unit-length ray.
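As a concrete illustration, the notation above can be sketched in a few lines of numpy; the function names are ours, not part of the system:

```python
import numpy as np

def rotation_matrix(r):
    """Rodrigues' formula: axis-angle vector r -> 3x3 rotation matrix R(r)."""
    theta = np.linalg.norm(r)
    if theta < 1e-12:
        return np.eye(3)
    k = r / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def transform(r, t, X):
    """Apply the rigid transform T = (r, t) to a 3D point: X -> R(r) X + t."""
    return rotation_matrix(r) @ X + t

def pi_sphere(X):
    """Projection onto the unit sphere: the unit-length ray pointing at X."""
    return X / np.linalg.norm(X)
```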
IV Robust Omnidirectional Visual Odometry
We propose the robust omnidirectional visual odometry (ROVO) system with a hybrid projection model, multi-view P3P RANSAC, and online extrinsic calibration. An overview of the system architecture is shown in Figure 3. We assume that the camera intrinsic parameters and the initial extrinsic parameters are known. When input fisheye images arrive, the hybrid projection model warps them into perspective-cylindrical images, and the feature detection and tracking modules run on the warped image of each view. After the intra-view feature tracking, we perform inter-view feature matching by comparing the feature descriptors in the overlapping regions. Then the multi-view P3P RANSAC algorithm computes the metric-scale camera pose, followed by pose optimization which updates the estimated rig pose with the 3D points fixed. Finally, the back-end local bundle adjustment module updates the recent rig trajectory and 3D point locations, as well as the rig extrinsic parameters, in a unified framework. For the optimization tasks we use the Ceres solver [ceres-solver].
IV-A Hybrid Projection Model
Due to the wide FOV, the original fisheye images contain a large amount of information about the environment, but at the same time the periphery of the images is distorted excessively. In our wide-baseline wide-FOV setup there exist large overlapping areas between images, but direct feature matching with descriptors computed from such distorted areas does not yield good correspondences in quality or quantity. Local feature descriptors work best when the images are purely perspective; thus we develop a projection model which ensures that the overlapping areas between views are as perspective as possible. At the same time, for feature tracking to be successful, the warped image must be continuous and smooth.
To satisfy these conditions, our hybrid projection model has two projection planes at the left and right sides, and a cylinder at the center connects the two planes smoothly. Figure 3(a) shows the hybrid projection model for our rig; for each camera, the planes parallel to the baselines of the cameras are connected with cylinders perpendicular to the projection planes. Note that when the camera centers are located close to the rig plane, the proposed method is similar to stereo rectification, and the projected $y$-coordinates of a scene point in the other images should be roughly the same.
To build the warped image, we need to find the pixel position in the original fisheye image corresponding to each pixel in the warped image.
For each warped pixel $u_w$, the corresponding 3D point $X$ on the projection plane/cylinder can be computed using the plane/cylinder equation, and then its fisheye image coordinate is given as $u_f = \pi(X; \theta)$.
Figure 3(b) shows the fisheye coordinates for the warped pixels (top) and the warped images (bottom).
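The back-projection from a warped pixel to a viewing ray can be sketched as follows. This is a simplified, illustrative parameterization of the plane/cylinder geometry (a unit cylinder joined continuously to tangent planes at horizontal angles ±phi0), not the paper's exact construction; feeding the resulting ray to the fisheye projection $\pi(\cdot;\theta)$ then yields the fisheye coordinate.

```python
import numpy as np

def warped_pixel_to_ray(x, y, f, phi0):
    """Back-project a warped-image pixel (x, y), measured from the image
    center, to a unit ray under a simplified hybrid model: a unit-radius
    cylinder for horizontal angles |phi| <= phi0, joined continuously to
    the planes tangent to the cylinder at +/- phi0. f is the focal
    length (and cylinder scale)."""
    phi, v = x / f, y / f
    if abs(phi) <= phi0:
        # Cylindrical part: column maps linearly to the horizontal angle.
        p = np.array([np.sin(phi), v, np.cos(phi)])
    else:
        # Planar part: arc length continues onto the tangent plane.
        s, u = np.sign(phi), abs(phi) - phi0
        p = np.array([s * (np.sin(phi0) + u * np.cos(phi0)),
                      v,
                      np.cos(phi0) - u * np.sin(phi0)])
    return p / np.linalg.norm(p)
```

At the transition column the cylinder point and the tangent-plane point coincide, so the warped image is continuous there, which is what the KLT tracker needs.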
IV-B Intra- and Inter-view Feature Processing
For each warped image, we perform the intra-view feature processing, which is standard feature detection and tracking. We use the ORB feature detector [rublee2011orb] with a minimum-distance constraint to neighboring features to ensure that features are extracted uniformly. The existing feature points are tracked by the KLT tracking algorithm [lucas1981iterative].
To improve the pose estimation and mapping quality, we need to track the features across views. The inter-view feature processing finds matching features in the overlapping regions between views and transfers their information when a feature goes out of the FOV in one image. ORB descriptors are attached to the tracked points, and we use the $k$-nearest-neighbor feature matching algorithm to find the feature correspondences. Incorrect matches are filtered according to policies similar to stereo matching:
the $y$-distance between the two matched points is small.
the feature descriptor and orientation differences are small.
it fulfills the epipolar consistency, left-right consistency, and positive disparity checks.
the zero-mean normalized cross-correlation (ZNCC) cost is small.
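The epipolar-consistency check in the list above can be sketched for unit-length rays; here (R, t) is the assumed relative pose of the first camera expressed in the second camera's frame, and the threshold value is illustrative:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix [v]x such that [v]x @ w == cross(v, w)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def passes_epipolar(r0, r1, R, t, eps=1e-3):
    """Epipolar consistency for matched unit rays r0 (camera 0) and r1
    (camera 1): with E = [t]x R, a correct match satisfies r1^T E r0 = 0,
    i.e. both rays lie in the same epipolar plane."""
    E = skew(t) @ R
    return abs(r1 @ E @ r0) < eps
```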
Finally, we triangulate the matched points to compute their 3D coordinates, which are used in pose estimation. After the feature processing, we obtain the feature-landmark correspondences for all cameras $k$. Note that the feature locations are converted to unit-length rays.
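The triangulation step can be sketched with the midpoint method; this is one common choice, as the paper does not specify its solver:

```python
import numpy as np

def triangulate_midpoint(c0, r0, c1, r1):
    """Midpoint triangulation: given camera centers c0, c1 and unit
    viewing rays r0, r1, return the 3D point closest to both rays."""
    # Solve for depths (s, t) minimizing ||(c0 + s r0) - (c1 + t r1)||.
    A = np.stack([r0, -r1], axis=1)          # 3x2 system
    b = c1 - c0
    st, *_ = np.linalg.lstsq(A, b, rcond=None)
    p0 = c0 + st[0] * r0                     # closest point on ray 0
    p1 = c1 + st[1] * r1                     # closest point on ray 1
    return 0.5 * (p0 + p1)
```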
IV-C Multi-view P3P RANSAC
After the feature processing and triangulation, the current pose of the rig is estimated from the established 2D-3D feature correspondences. Our multi-view P3P RANSAC algorithm extends the monocular P3P RANSAC algorithm [lepetit2009epnp]. In each RANSAC iteration, one view is selected randomly with Probability Proportional to the Size of the match sets (PPS sampling), then the minimal sample set is randomly chosen among the correspondences in that view. From the camera pose candidates estimated by the monocular P3P algorithm, the rig poses are computed, and all correspondences in all views are tested in the inlier check. The detailed process is shown in Algorithm 1.
PPS sampling chooses the cameras with more feature matches more frequently to increase the chance of finding good poses, while the all-view inlier check enforces that the estimated pose is consistent with all observations. To determine the best pose in the RANSAC loop, we use the reprojection errors of the inlier matches only.
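Algorithm 1 is not reproduced here, but the loop structure described above can be sketched as follows. The interfaces, in particular the solve_pose stand-in for the monocular P3P solver lifted to rig poses, are illustrative assumptions:

```python
import numpy as np

def count_inliers(matches, cam_T_body, R_wb, t_wb, thresh):
    """All-view inlier check: a match is an inlier if the angle between
    the observed unit ray and the reprojected landmark is below thresh."""
    n = 0
    for (rays, pts), (R_cb, t_cb) in zip(matches, cam_T_body):
        Xb = (pts - t_wb) @ R_wb                 # world -> body (rowwise R_wb^T)
        Xc = Xb @ R_cb.T + t_cb                  # body -> camera
        Xc = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
        n += int(np.sum(np.einsum('ij,ij->i', Xc, rays) > np.cos(thresh)))
    return n

def multiview_p3p_ransac(matches, cam_T_body, solve_pose, thresh=0.01,
                         iters=50, seed=0):
    """Schematic multi-view P3P RANSAC. matches[k] = (rays, pts): Nx3
    observed unit rays in camera k and the corresponding Nx3 world
    points. cam_T_body[k] = (R, t) maps body to camera coordinates.
    solve_pose(rays, pts, k) returns candidate rig poses (R_wb, t_wb)."""
    rng = np.random.default_rng(seed)
    sizes = np.array([len(m[0]) for m in matches], dtype=float)
    best_pose, best_n = None, -1
    for _ in range(iters):
        k = rng.choice(len(matches), p=sizes / sizes.sum())  # PPS sampling
        rays, pts = matches[k]
        idx = rng.choice(len(rays), size=3, replace=False)   # minimal sample
        for R_wb, t_wb in solve_pose(rays[idx], pts[idx], k):
            n = count_inliers(matches, cam_T_body, R_wb, t_wb, thresh)
            if n > best_n:
                best_pose, best_n = (R_wb, t_wb), n
    return best_pose, best_n
```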
After RANSAC finishes, the estimated rig pose is optimized by minimizing the reprojection error of all inliers while the 3D points are fixed,
$$\hat{T}_{wb} = \arg\min_{T_{wb}} \sum_{k} \sum_{i \in \mathcal{I}_k} \left\| \pi_s\!\left( T_{bc_k}^{-1} \circ T_{wb}^{-1} \cdot {}^{w}X_i \right) - \hat{u}_{k,i} \right\|^2,$$
where $T_{bc_k}^{-1}$ is the transformation from the body to the camera $c_k$'s coordinate system, $\mathcal{I}_k$'s are the inlier index sets, and $\hat{u}_{k,i}$ is the observed unit ray of landmark $i$ in camera $k$.
IV-D Online Extrinsic Calibration
To deal with the deformation and motion of the cameras during operation, the camera extrinsic parameters are jointly updated in the local bundle adjustment module. For online extrinsic calibration, we add the camera extrinsic parameters into the optimization
$$\min_{\{T_{wb_\tau}\},\,\{{}^{w}X_i\},\,\{T_{bc_k}\}} \sum_{\tau} \sum_{k} \sum_{i} w_i \left\| \pi_s\!\left( T_{bc_k}^{-1} \circ T_{wb_\tau}^{-1} \cdot {}^{w}X_i \right) - \hat{u}_{\tau,k,i} \right\|^2,$$
where the rig poses $T_{wb_\tau}$ in the active time window, the landmark positions ${}^{w}X_i$, as well as the camera extrinsics $T_{bc_k}$ are optimized to minimize the cost. We give a higher weight $w_i$ to the points observed in multiple cameras.
Since the cameras in our system are fixed on a rig, we need to add an extra constraint that the distances between the cameras are constant; otherwise metric-scale reconstruction is not possible, as the rig could grow or shrink over time. The constraints can be written as $\|t_{c_i c_j}\| = \mathrm{const}$ for neighboring camera pairs $(i, j)$, where $t_{c_i c_j}$ represents the translation from camera $c_i$ to camera $c_j$.
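The rigidity constraints can be sketched as extra residual terms appended to the bundle-adjustment residual vector; the pair list, baseline values, and weight below are illustrative, not the system's actual parameters:

```python
import numpy as np

def rigidity_residuals(cam_pos, pairs, baselines, weight=100.0):
    """Rig-rigidity terms for the bundle adjustment: for each neighboring
    camera pair (i, j), penalize deviation of the inter-camera distance
    ||t_j - t_i|| from its calibrated baseline d_ij. cam_pos holds the
    camera positions in the body frame."""
    return np.array([weight * (np.linalg.norm(cam_pos[j] - cam_pos[i]) - d)
                     for (i, j), d in zip(pairs, baselines)])
```

In a Ceres-style solver these terms would simply be added as additional residual blocks alongside the reprojection errors.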
[Table I: per-sequence translation RMSE (m), average inlier ratio (%), average reprojection error (°), number of frames, and trajectory length (m).]
V Experimental Results
In order to evaluate the proposed system, we conduct extensive experiments on synthetic datasets as well as real-world datasets. Using the synthetic datasets with ground truth, we quantitatively measure the accuracy by computing the root mean squared error (RMSE) between the estimated poses and the ground truth. In addition, we compare the average inlier ratio and average reprojection error to observe the tendencies in the experimental results. On the real datasets, we qualitatively evaluate the performance by overlaying the estimated trajectories onto satellite images. Additionally, we show the effectiveness of online extrinsic calibration on both synthetic and real datasets with comparative experiments.
V-A Evaluation of Hybrid Projection Model
To test our hybrid projection model, we conduct two simple experiments. First, we compare the ORB descriptor similarity during intra-view tracking: the mean Hamming distances of 2000 features over 100 frames are compared between the original fisheye images and our hybrid-warped images. Second, we compare the inlier ratios of the inter-view feature matching. The inlier ratio is computed only from the feature matches that satisfy the epipolar constraints.
[Table II: mean descriptor distance and inlier ratio for the original fisheye images vs. the hybrid projection (ours).]
As shown in Table II, our projection model reduces the average descriptor distance and substantially boosts the inlier ratio, mainly by removing the lens distortion and aligning the projections. The improved matching performance significantly contributes to the robustness of the VO system.
V-B Synthetic and Real Datasets
We render four urban sequences with different structures and lighting conditions using Blender. As a baseline we use 4view-MultiFOV, which is modified from the urban canyon dataset by Zhang et al. [zhang2016benefit]. Static Urban is a 2 km-long sequence with significant illumination changes due to building shadows. Dynamic Urban is a 1.3 km-long sequence with moving vehicles. Cloudy Urban and Sunset Urban are 350 m-long sequences of the same scene under different weather conditions. All images are rendered for four simulated fisheye cameras (220° FOV), with the same FOV and resolution as the real cameras. The ground-truth poses, camera intrinsic parameters, and extrinsic parameters are included in the datasets.
For the real datasets we use the four-global-shutter-camera rig on the vehicle shown in Figure 1. The cameras output four software-synchronized images at 10 Hz. We use a standard camera rig calibration with a large checkerboard. The datasets are collected by driving in the vicinity of Hanyang University, and they contain many challenges: harsh illumination changes, highly dynamic roads with many moving vehicles, and narrow streets.
V-C Experiments with Synthetic Datasets
To evaluate the robustness and accuracy, we conduct an experiment by providing randomly perturbed camera extrinsics to the system (zero-mean Gaussian noise). NoisyExt is the result with the noisy extrinsics, and OnlineExt is the result with online extrinsic calibration. For comparison, GT refers to the ground-truth rig trajectory, and GTExt is the VO result with the ground-truth extrinsics.
Quantitative and qualitative comparisons are shown in Table I, Figure 5, and Figure 6. While the VO with noisy extrinsics fails to estimate correct trajectories, with online calibration the overall error decreases drastically and the trajectory is estimated accurately. We also observe that the average inlier ratio and average reprojection error improve significantly, close to those of GTExt. Figure 8 gives an in-depth view of the online extrinsic calibration: starting from noisy extrinsics, the extrinsic parameters are updated quickly and converge to the ground truth within 100 frames.
V-D Experiments with Real Datasets
Among the collected datasets, we present the results on ParkingLot, Wangsimni, and Seongsu. ParkingLot has a loop trajectory, and Figure 7(a) shows the accuracy of our system (note that our system does not use loop closing). Figures 7(b) and (c) show the Wangsimni and Seongsu results. The Wangsimni sequence is taken in heavy traffic with many moving objects; our system is able to reject those outliers and robustly estimate an accurate trajectory. In the Seongsu sequence the vehicle is driven through narrow passages, and the distance to the nearby buildings is very small compared to the camera baseline. The hybrid warping successfully finds matching features in these challenging situations.
The online calibration makes a big contribution on the real sequences. Figure 9 shows the reprojected epipolar lines of a feature in a neighboring view. With the initial calibration, the epipolar lines are quite far off from the true matches, but with online calibration they fall on the correct positions.
VI Conclusion
In this paper we propose a novel omnidirectional visual odometry system for a wide-baseline camera rig with wide-FOV lenses. To deal with the challenges of fisheye distortion and appearance changes due to the wide baseline, we add a hybrid projection model, a multi-view P3P RANSAC algorithm, and online extrinsic calibration in the local bundle adjustment. Extensive experimental evaluation on both synthetic datasets with ground truth and real sequences verifies that the proposed components are effective in solving these problems.
This research was supported by Samsung Research Funding & Incubation Center for Future Technology under Project Number SRFC-TC1603-05, the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7069369), and the NRF grant funded by the Korea government (MSIP) (NRF-2017R1A2B4011928).