Real-time Monocular Visual Odometry for Turbid and Dynamic Underwater Environments

06/15/2018 ∙ by Maxime Ferrera, et al. ∙ 0

In the context of robotic underwater operations, the visual degradations induced by the properties of the medium make it difficult to rely exclusively on cameras for localization. Hence, most localization methods are based on expensive navigational sensors associated with acoustic positioning. On the other hand, visual odometry and visual SLAM have been exhaustively studied for aerial or terrestrial applications, but state-of-the-art algorithms fail underwater. In this paper we tackle the problem of using a simple low-cost camera for underwater localization and propose a new monocular visual odometry method dedicated to the underwater environment. We evaluate different tracking methods and show that optical flow based tracking is better suited to underwater images than classical approaches based on descriptors. We also propose a keyframe-based visual odometry approach relying heavily on nonlinear optimization. The proposed algorithm has been assessed on both simulated and real underwater datasets and outperforms state-of-the-art visual SLAM methods under many of the most challenging conditions. The main application of this work is the localization of Remotely Operated Vehicles (ROVs) used for underwater archaeological missions, but the developed system can be used in any other application as long as visual information is available.




1 Introduction

Accurate localization is critical for most robotic underwater operations, especially when navigating in areas with obstacles such as rocks, shipwrecks or Oil & Gas structures. As GPS is not available underwater, most of the existing approaches rely on the use of IMUs (Inertial Measurement Units), pressure sensors and DVLs (Doppler Velocity Logs) (Paull et al. (2014)). The measurements of these sensors are classically integrated into an Extended Kalman Filter to estimate the vehicle motions. However, the measurement noise and the accumulation of linearization errors lead to unavoidable drift over time. To limit this drift, acoustic positioning systems like USBL (Ultra Short Baseline) or LBL (Long Baseline) can be used, but they are expensive or may require calibration (LBL), complicating the planning of archaeological operations. Cameras have also been used as complementary sensors that limit the drift by matching temporally spaced images (Eustice et al. (2008); Johnson-Roberson et al. (2010); Mahon et al. (2008); Beall et al. (2010); Warren et al. (2012); Carrasco et al. (2015)). Although the aforementioned approaches have shown good results on very large trajectories, they require the use of high-end sensors, as the cameras or the acoustic positioning systems are only used to constrain the drift.

In contrast, in this work, we are interested in the development of an accurate localization method from a minimal set of low-cost sensors for lightweight Remotely Operated Vehicles (ROVs) used for archaeology operations. As ROVs always embed a camera for remote control purpose, we decided to develop a visual odometry framework based only on a monocular camera to estimate the ego-motion of the robot.

The literature on purely visual localization methods (i.e. methods relying solely on cameras) is quite sparse in the underwater field in comparison to in-air or on-land applications. Indeed, visual odometry (VO) and visual SLAM (Simultaneous Localization And Mapping) have been exhaustively studied for terrestrial and aerial operations (Cadena et al. (2016)) and are becoming quite mature. Note that visual SLAM differs from VO by maintaining a reusable global map, allowing for instance the detection of loop closures when already mapped scenes are seen again. In the underwater environment, localization from vision is still an open problem and the state-of-the-art algorithms in VO or SLAM do not give satisfying results (Weidner et al. (2017); Carrasco et al. (2016)). This is mainly due to the visual degradations caused by the specific properties of the medium. Indeed, the strong light absorption of the medium shortens the visual perception to a few meters and makes the presence of an artificial lighting system mandatory when operating in deep waters. Besides, light is scattered by floating particles, causing turbidity effects on the captured images. In the darkness of deep waters, the fauna is also a cause of visual degradation, as animals are attracted by the artificial light and tend to get in the field of view of the camera, leading to dynamism and occlusions in the images. Faced with these difficulties, many works tackle the localization problem using sonar systems (Paull et al. (2014); Ribas et al. (2007)), as they do not suffer from these visual degradations. Nevertheless, the information delivered by a sonar is not as rich as optical images (Bonin-Font et al. (2015)) and remains very challenging to analyze. Furthermore, at close range, acoustic systems do not provide accurate enough localization information whereas visual sensing can be highly effective (Palomeras et al. (2018)).

Figure 1: Remotely Operated underwater Vehicle used with an underwater image of the seabed and the estimated trajectory in red with the reconstructed sparse 3D map.

In this paper we propose a new monocular VO algorithm dedicated to the underwater environment that overcomes the aforementioned visual degradations. The first contribution is a thorough evaluation of feature tracking methods on underwater images. This evaluation is presented in section 3 and we show that optical flow tracking performs better than the usual descriptor matching methods. Then, the developed monocular odometry algorithm is detailed in section 4. Inspired by visual SLAM methods, we developed a keyframe-based approach relying on nonlinear optimization to ensure minimal drift in the estimated trajectories. We base this monocular VO on optical flow tracking. To increase the robustness of our algorithm to short occlusions, we propose a retracking mechanism that recovers lost features. We also use a simple initialization method robust to planar and non-planar scenes. Finally, in section 5, we present results obtained on both simulated and real datasets. We show that our method performs better than state-of-the-art visual SLAM algorithms under many of the most challenging conditions in underwater video sequences. Note that we do not investigate here the retrieval of the scale factor inherent to monocular systems but focus on developing an accurate VO method, as it is the cornerstone of any vision-based localization method. The real underwater video sequences used for evaluation are made publicly available for the community.

2 Related Work

Eustice et al. (2008) were among the first to present a successful use of visual information as a complementary sensor for the localization of underwater robots. They used an Extended Information Filter (EIF) to process dead-reckoning sensors and insert visual constraints based on the 2D-2D matching of points coming from overlapping monocular images. Here, only the relative pose between cameras is computed, so the visual motion is estimated up to scale and does not provide the full 6 degrees of freedom of the robot motions. Following their work, many stereo-vision based systems were proposed. The advantage of using stereo cameras lies in the availability of the metric scale, as opposed to the scale ambiguity inherent to pure monocular systems. The scale factor can indeed be resolved from the known baseline between both sensors (assuming the stereo system is calibrated). In Johnson-Roberson et al. (2010) and Mahon et al. (2008), stereo extensions of the method introduced by Eustice et al. (2008) are presented. Later, nonlinear optimization steps were integrated (Beall et al. (2010); Warren et al. (2012); Carrasco et al. (2015)) to further process the visual data through bundle adjustment (Triggs et al. (2000)). However, all these methods rely on expensive navigational sensors to estimate the ego-motion of the vehicles. The visual information is only used to bound the drift using low-overlap imagery systems (1-2 Hz).

Closer to our work, some stereo VO approaches using higher frame rate videos (10-20 Hz) to estimate the ego-motion of underwater vehicles have been recently presented. In Drap et al. (2015), features are matched within stereo pairs to compute 3D point clouds and the camera poses are estimated by aligning these successive point clouds, making it a pure stereo vision method. In parallel, the work of Bellavia et al. (2017) uses a keyframe-based approach, but their feature tracking is done by matching descriptors both spatially (between the images of a stereo pair) and temporally. Moreover, they do not perform bundle adjustment to optimize the estimated trajectory.

Despite the advantage of stereo-vision systems over monocular cameras, embedding a single camera is materially more practical. Furthermore, developing a monocular VO algorithm makes it portable to any kind of underwater vehicles, as long as it is equipped with a camera. Even if monocular systems do not provide the scale factor, it is possible to retrieve it from other sensors like an IMU or a pressure sensor.

The early works of Garcia et al. (2001) and Gracias et al. (2003) studied the use of a monocular camera as a means of motion estimation for underwater vehicles navigating near the seabed. In Garcia et al. (2001), low-overlap monocular images are used to estimate the robot motions but the processing is performed offline. Gracias et al. (2003) proposed a real-time mosaic-based visual localization method, estimating the robot motions through the computation of homographies with the limiting assumptions of purely planar scenes and 4 degrees of freedom motions. Underwater monocular-based methods using cameras at high frame rate (10-20 Hz) were studied by Shkurti et al. (2011) and Burguera et al. (2015). In their approaches, they fuse visual motion estimation in an Extended Kalman Filter (EKF) along with an IMU and a pressure sensor. Palomeras et al. (2013) make use of a camera to detect known patterns in a structured underwater environment and use them to improve the localization estimated by navigational sensors integrated into an EKF. However, such a method is limited to known and controlled environments. More recently, Creuze (2017) presented a monocular underwater localization method that does not rely on an EKF framework but iteratively estimates ego-motion by integration of optical flow measurements corrected by an IMU and a pressure sensor. The latter is used to compute the scale factor of the observed scene.

While most underwater odometry or SLAM systems rely on an EKF, or its alternative EIF version, in aerial and terrestrial SLAM filtering methods have been put aside in favor of more accurate keyframe-based approaches using bundle adjustment (Strasdat et al. (2012)). PTAM (Klein and Murray (2007)) was, along with Mouragnon et al. (2006), one of the first approaches able to use bundle adjustment in real-time. The work of Strasdat et al. (2011) and ORB-SLAM from Mur-Artal et al. (2015) build on PTAM and improve it by adding a loop-closure feature that greatly reduces the localization drift by detecting loops in the trajectories. Whereas all these methods match extracted features between successive frames, SVO (Forster et al. (2017)) and LSD-SLAM (Engel et al. (2014)) are two direct approaches that track photometric image patches to estimate the camera motions. Following these pure visual systems, tightly coupled visual-inertial systems have been recently presented (Leutenegger et al. (2015); Bloesch et al. (2015); Lin et al. (2018)) with higher accuracy and robustness than standard visual systems. These visual-inertial systems are all built on very accurate pure visual SLAM or VO methods, which underlines the need for a good visual localization method first.

Adding other low-cost sensors will not improve the localization of ROVs without an accurate VO method first. Hence, contrary to most of the approaches in underwater localization, we propose here a keyframe-based VO method based solely on visual data coming from a high frame rate monocular camera. Inspired by aerial and terrestrial SLAM, we choose to rely on bundle adjustment to optimize the estimated trajectories, thus avoiding the accumulation of linearization errors of filtering approaches. We show that our method outperforms state-of-the-art visual SLAM algorithms on underwater datasets.

3 Features tracking methods evaluation

Figure 2: Images used to evaluate the tracking of features. (a) Images from the TURBID dataset (Codevilla et al. (2015)). (b) Images acquired on a deep ancient shipwreck (depth: 500 meters, Corsica, France) - Credit: DRASSM (French Department of Underwater Archaeological Research).
Figure 3: Evaluation of feature tracking methods on the TURBID dataset (Codevilla et al. (2015)) (presented in Fig. 2 (a)). Graphs (a) and (b) show the number of features respectively detected and tracked with different detectors, while (c) and (d) show the number of features respectively detected with the Harris corner detector and tracked as before (the SURF and SIFT curves coincide with the Harris-KLT one in (c)).
Figure 4: Evaluation of feature tracking methods on a real underwater sequence (presented in Fig. 2 (b)). Graphs (a) and (b) show the number of features respectively detected and tracked with different detectors, while (c) and (d) show the number of features respectively detected with the Harris corner detector and tracked as before (the SIFT curve coincides with the SURF one in (c)).

As discussed in the introduction, in the underwater environment the captured images are mainly degraded by turbidity. Moreover, underwater scenes do not provide many discriminative features and often show repetitive similar patterns like coral branches, holes made by animals in the sand, or simply algae or sand ripples in shallow waters. In order to develop a VO system robust to underwater visual degradations, we have conducted a comparison of different combinations of detectors and descriptors along with the optical flow based pyramidal implementation of the Kanade-Lucas (KLT) method (Bouguet (2000)) on two different underwater datasets. All the images used are converted to 640x480 grayscale images to fit the input format of classical VO methods, and we tried to extract 500 features homogeneously spread over each image.

The first evaluation is performed on the TURBID dataset created by Codevilla et al. (2015). This dataset consists of series of static photographs of printed seabeds taken in a pool (Fig. 2 (a)). The turbidity level was increased between successive pictures by adding specific quantities of milk to the water. In their paper, Codevilla et al. evaluated the repeatability of many detectors to estimate their robustness to turbidity. In VO, beyond repeatability, it is the efficiency of feature tracking which matters most. The purpose of this evaluation is hence to count the number of features tracked between two consecutive frames. To perform a fair comparison and avoid initializing the KLT on the right locations, a virtual translation of 10 pixels is applied to the second picture.

The detector-descriptor combinations that we first evaluate along with the KLT are: ORB (Rublee et al. (2011)), BRISK (Leutenegger et al. (2011)), the FAST (Rosten et al. (2010)) detector with the BRIEF (Calonder et al. (2012)) and FREAK (Alahi et al. (2012)) descriptors, SIFT (Lowe (2004)) and SURF (Bay et al. (2006)). The SIFT and SURF detectors are the ones recommended by Codevilla et al. for VO applications. Descriptors are matched by looking for correspondences in the second image in a 40x40 pixels area around each feature extracted in the first image. For the KLT, the Harris corner detector version of Shi and Tomasi (Shi and Tomasi (1994)) is used to detect features in the first image. The optical flow is then computed for these features with respect to the second image. Fig. 3 displays (a) the number of features detected in each image for every method and (b) the number of tracked features between consecutive pictures. The resulting graphs clearly show that the KLT method is able to track the highest number of features. Indeed, more than 80% of the detected features are successfully tracked in the first fifteen images, whereas for the other methods this number is well below 50%. However, we can see that the Harris detector is the only one able to extract almost 500 features in each image (Fig. 3 (a)). We have therefore run another evaluation using only this detector. Note that the requirements of some descriptors lead to unsuitable detections being discarded, which explains the difference in the number of detected features in Fig. 3 (c). The results in Fig. 3 (d) show that the Harris detector increases the performance of all the descriptors evaluated, but none of them matches the performance of the KLT method.
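The windowed descriptor matching used in this evaluation can be illustrated with a minimal sketch. It assumes binary descriptors packed into Python integers and compared with the Hamming distance; the `max_distance` rejection threshold and the function names are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Tuple

Feature = Tuple[float, float, int]  # (x, y, binary descriptor packed in an int)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two packed binary descriptors."""
    return bin(a ^ b).count("1")


def match_in_window(first: List[Feature], second: List[Feature],
                    half_window: float = 20.0,
                    max_distance: int = 64) -> List[Optional[int]]:
    """For each feature of `first`, return the index of its best match in
    `second`, or None when no candidate in the window is close enough."""
    matches: List[Optional[int]] = []
    for x1, y1, d1 in first:
        best_idx, best_dist = None, max_distance + 1
        for idx, (x2, y2, d2) in enumerate(second):
            # Only candidates inside the 40x40 px area centred on (x1, y1).
            if abs(x2 - x1) > half_window or abs(y2 - y1) > half_window:
                continue
            dist = hamming(d1, d2)
            if dist < best_dist:
                best_idx, best_dist = idx, dist
        matches.append(best_idx)
    return matches


first_image = [(100.0, 100.0, 0b10110010), (300.0, 200.0, 0b01001101)]
second_image = [(110.0, 100.0, 0b10110011), (400.0, 400.0, 0b01001101)]
print(match_in_window(first_image, second_image))
```

In this toy example the second feature has an identical descriptor in the second image, but it lies outside the search window and is therefore left unmatched.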

To confirm the results obtained on this first dataset, a second evaluation is run on another dataset consisting of eleven successive images taken from a real underwater video sequence (Fig. 2 (b)). To simulate a real VO scenario, the features detected in the first image are tracked in the ten following images. In order to remove outliers robustly, the epipolar consistency of matched features is checked by computing the essential matrix in a RANSAC loop (Fischler and Bolles (1981)). Fig. 4 (a) and (b) show that the KLT still tracks the highest number of features across this sequence. Around 60% of the features detected in the first image are still successfully tracked in the last image with the KLT. For the other methods, the ratio of features correctly tracked between the first and last image barely reaches 20%. Once again, using the Harris detector improves the results for most of the descriptors, increasing the tracking ratio up to about 35%, but the KLT remains the most efficient tracking method (Fig. 4 (c,d)).

In light of these results, we choose to build our VO algorithm on Harris corners detected with the Shi-Tomasi detector and tracked through optical flow with the iterative pyramidal Kanade-Lucas algorithm. Another advantage of the KLT over descriptors resides in its low computational cost. However, the computation of descriptors allows the matching of features between non-successive images. This property is very useful when parts of the field of view get temporarily occluded. Unfortunately, this is not possible with direct approaches like the KLT method. As short occlusions are quite frequent in our context, as explained in section 1, we have enhanced the optical flow tracking to tackle this issue. This enhancement is introduced in the next section, which describes the pipeline of the developed monocular VO system.

4 The Visual Odometry framework

Figure 5: Pipeline of the proposed visual odometry algorithm.

The pipeline of our keyframe-based VO approach is summarized in Fig. 5. The system is based on the tracking of 2D features over successive frames in order to estimate their 3D positions in the world reference frame. The 2D observations of these 3D landmarks are then used to estimate the motion of the camera. Frames used for the triangulation of the 3D map points are considered as keyframes, and we store the most recent ones in order to optimize the estimated trajectory along with the structure of the 3D map through bundle adjustment. In this, the method follows the approach of Klein and Murray (2007), Strasdat et al. (2011) and Mur-Artal et al. (2015). However, in contrast to these methods, we do not build the tracking on the matching of descriptors. Instead we use the KLT method, which is better adapted to the underwater environment as demonstrated in section 3. The loop closure mechanisms of SLAM approaches require the computation and storage of descriptors for every 3D point in the map, the pose of every keyframe and their respective 2D observations of map points. These extensive computations and high storage requirements are avoided in the proposed system.

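At each frame $i$ we want to estimate the state of the system through the pose of the camera, defined as:

$$X_i = \begin{bmatrix} p_i & q_i \end{bmatrix}^T$$

where $p_i \in \mathbb{R}^3$ is the position of the camera in the 3D world and $q_i$ is its orientation. Furthermore, for each added keyframe we want to first estimate new landmarks $l_j \in \mathbb{R}^3$ and then optimize a set of keyframe poses with the respective observed landmarks. This set is denoted $\chi$:

$$\chi = \{X_1, \ldots, X_n,\; l_1, \ldots, l_m\}$$

In the following, we assume that the monocular camera is calibrated and that distortion effects are compensated. The geometrical camera model considered in this work is the pinhole model, and its mathematical expression for the projection of world points is:

$$\pi(X_i, l_j) = \begin{bmatrix} u \\ v \end{bmatrix}, \qquad \lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\, P_i\, l'_j, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$

with $u$ and $v$ the pixel coordinates, $K$ the intrinsic calibration parameters, $P_i = [R_i \mid t_i] \in \mathbb{R}^{3 \times 4}$ the projective matrix computed from the state $X_i$ (with $R_i \in SO(3)$ and $t_i \in \mathbb{R}^3$), $l'_j \in \mathbb{R}^4$ the homogeneous representation of the landmark $l_j$, and $\lambda$ the homogeneous scale factor, i.e. the depth of the landmark in the camera frame. The coefficients $f_x$, $f_y$ and $c_x$, $c_y$ represent respectively the focal lengths along the $x$ and $y$ axes and the position of the optical center, in pixels.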
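As an illustration of the pinhole model, the following minimal sketch projects a 3D world point into pixel coordinates. It assumes that the rotation `R` and translation `t` map world coordinates to camera coordinates and that the intrinsics are passed as a tuple `(fx, fy, cx, cy)`; these conventions and names are chosen for the example only.

```python
from typing import Sequence, Tuple


def project(intrinsics: Tuple[float, float, float, float],
            R: Sequence[Sequence[float]],
            t: Sequence[float],
            landmark: Sequence[float]) -> Tuple[float, float]:
    """Project a 3D world point to pixel coordinates with a pinhole model."""
    fx, fy, cx, cy = intrinsics
    # World point expressed in the camera frame: Xc = R * l + t.
    xc = [sum(R[r][c] * landmark[c] for c in range(3)) + t[r] for r in range(3)]
    if xc[2] <= 0.0:
        raise ValueError("point is behind the camera")
    # Perspective division by the depth, then application of the intrinsics.
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera = (400.0, 400.0, 320.0, 240.0)  # hypothetical intrinsics for a 640x480 image
print(project(camera, identity, [0.0, 0.0, 0.0], [1.0, 0.0, 2.0]))
```

A point on the optical axis projects onto the optical center, and a point one meter to the right at two meters depth lands 200 pixels to the right of it with these intrinsics.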

4.1 Frame to frame features tracking

Features are extracted on every new keyframe using the Shi-Tomasi method to compute Harris corners. The motion of the 2D features is then estimated using the KLT. After each flow estimation, we thoroughly remove outliers from the tracked features: first, we compute the disparity between the forward and backward optical flow estimations and remove features whose disparity is higher than a certain threshold, then, from the intrinsic calibration parameters, we compute the essential matrix using the 5-points method of Nister (2004) between the previous keyframe and the current frame. This essential matrix is computed within a RANSAC process in order to remove the features not consistent with the estimated epipolar geometry.
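The forward-backward check described above can be sketched as follows. The sketch assumes that each feature has already been tracked forward to the current frame and then backward to the previous one by some optical flow routine; only the filtering step is shown, and the 1-pixel threshold is a hypothetical value since the paper does not state the one it uses.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def forward_backward_filter(points: Sequence[Point],
                            back_tracked: Sequence[Point],
                            max_error: float = 1.0) -> List[int]:
    """Return the indices of the features whose backward-tracked position
    falls within `max_error` pixels of their original position."""
    keep = []
    for idx, ((x, y), (xb, yb)) in enumerate(zip(points, back_tracked)):
        if math.hypot(x - xb, y - yb) <= max_error:
            keep.append(idx)
    return keep


original = [(10.0, 10.0), (50.0, 40.0), (80.0, 20.0)]
after_round_trip = [(10.2, 9.9), (57.0, 44.0), (80.0, 20.5)]
print(forward_backward_filter(original, after_round_trip))
```

Features that survive this test would then go through the epipolar RANSAC check, which is not reproduced here.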

Once enough motion of the camera is detected, the tracked 2D features are triangulated from their observations in the current frame and in the previous keyframe. The current frame used here is converted into a keyframe and new 2D corners are extracted in order to reach 250 features - the 2D observations of the seen 3D points included. All these 2D features are then tracked in the same way as described above.

4.2 Features retracking

The main drawback of optical flow based tracking is that lost features are usually permanently lost, whereas descriptors allow the matching of features under strong viewpoint changes. In the underwater context, the powerful lights embedded on the ROV sometimes attract small fish into the camera field of view. The occlusions due to fish passing over tracked features lead to strong photometric disturbances and, as a consequence, to a quick loss of features. However, the fish move very fast and their positions change very quickly between successive frames. We take advantage of this property to increase the robustness of our tracking method to short occlusions. We keep a small window of the most recent frames (five frames is enough in practice) together with the features lost through optical flow in each of them. At each tracking iteration, we try to retrack the lost features contained in the retracking window. Retracked features are added to the set of currently tracked features, while the others are moved back in the window until deletion.
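One possible way to organize such a retracking window is sketched below. The class name, the `try_track` callback (standing in for an optical flow attempt on a single feature) and the data layout are assumptions made for illustration; the paper does not describe its implementation at this level of detail.

```python
from collections import deque
from typing import Callable, Deque, Dict, Optional, Tuple

Point = Tuple[float, float]


class RetrackingWindow:
    """Keeps the features lost in the last `size` frames for retracking."""

    def __init__(self, size: int = 5) -> None:
        # One slot per recent frame; each slot maps a feature id to its
        # last known position. The oldest slot is dropped automatically.
        self.slots: Deque[Dict[int, Point]] = deque(maxlen=size)

    def push_lost(self, lost: Dict[int, Point]) -> None:
        """Store the features lost at the current frame."""
        self.slots.append(dict(lost))

    def retrack(self, try_track: Callable[[int, Point], Optional[Point]]
                ) -> Dict[int, Point]:
        """Try to recover every stored feature; return the recovered ones
        and remove them from the window."""
        recovered: Dict[int, Point] = {}
        for slot in self.slots:
            for fid in list(slot):
                new_pos = try_track(fid, slot[fid])
                if new_pos is not None:
                    recovered[fid] = new_pos
                    del slot[fid]
        return recovered


window = RetrackingWindow(size=2)
window.push_lost({1: (0.0, 0.0)})                  # feature 1 lost at frame k
first_try = window.retrack(lambda fid, p: None)    # still occluded at frame k+1
window.push_lost({2: (5.0, 5.0)})                  # feature 2 lost at frame k+1
window.push_lost({})                               # frame k+2: feature 1 expires
second_try = window.retrack(lambda fid, p: (p[0] + 1.0, p[1]))
print(first_try, second_try)
```

Using a bounded `deque` makes the deletion of features that stayed lost for too long implicit: they disappear when their slot leaves the window.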

This features tracking implementation is used to track both pure 2D features, for future triangulation, and 2D observations of already mapped points, for pose estimations.

4.3 Pose Estimation

The estimation of the 6 degrees of freedom of the pose of every frame uses their respective 2D-3D correspondences. The pose is computed with the Perspective-from-3-Points (P3P) formulation using the method of Kneip et al. (2011). This operation is done within a RANSAC loop to remove correspondences that are not accurate enough. The retained pose is the one, computed from a combination of three points, that gives the most likely estimate for the set of features. The pose is then further refined by minimizing the global reprojection error over the set of inliers:

$$X_i^{*} = \arg\min_{X_i} \sum_{j \in \mathcal{L}_i} \left\| x_{ij} - \pi(X_i, l_j) \right\|^2$$

with $X_i$ the pose of the frame $i$, $\mathcal{L}_i$ the set of inlier world points observed in this frame, $x_{ij}$ the 2D observation of the world point $l_j$ and $\pi(X_i, l_j)$ the reprojection of $l_j$ in the frame $i$ with its related projection matrix $P_i$.

This minimization is done through a nonlinear least-squares optimization using the Levenberg-Marquardt algorithm. The computed poses are then used to estimate the 3D positions of the tracked features.

4.4 Keyframe Selection and Mapping

The mapping process is triggered by the need of a new keyframe. Several criteria have been set as requirements for the creation of a keyframe. The first criterion is the parallax. If an important parallax from the last keyframe has been measured (30 pixels in practice), a new keyframe is inserted as it will allow the computation of accurate 3D points. The parallax is estimated by computing the median disparity of every tracked pure 2D features from the previous keyframe. To ensure that we do not try to estimate 3D points from false parallax due to rotational motions, we unrotate the currently tracked features before computing the median disparity. The second criterion is based on the number of 2D-3D correspondences. We verify that we are tracking enough 2D observations of map points and trigger the creation of a keyframe if this number drops below a threshold (less than 50% of the number of observations in the last keyframe).
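The two keyframe criteria can be combined as in the sketch below. It assumes that the current feature positions have already been compensated for rotation by an earlier step, which is not shown; the function names are hypothetical, while the 30-pixel parallax and the 50% ratio are the values quoted above.

```python
import math
from statistics import median
from typing import Sequence, Tuple

Point = Tuple[float, float]


def median_disparity(prev_pts: Sequence[Point],
                     curr_pts_unrotated: Sequence[Point]) -> float:
    """Median pixel displacement of the features since the last keyframe."""
    return median(math.hypot(xc - xp, yc - yp)
                  for (xp, yp), (xc, yc) in zip(prev_pts, curr_pts_unrotated))


def needs_keyframe(prev_pts: Sequence[Point],
                   curr_pts_unrotated: Sequence[Point],
                   n_tracked_3d: int,
                   n_3d_last_keyframe: int,
                   parallax_px: float = 30.0,
                   min_ratio: float = 0.5) -> bool:
    """Apply the parallax criterion, then the 2D-3D correspondence criterion."""
    if prev_pts and median_disparity(prev_pts, curr_pts_unrotated) >= parallax_px:
        return True
    return n_tracked_3d < min_ratio * n_3d_last_keyframe


kf_pts = [(0.0, 0.0), (10.0, 10.0), (20.0, 20.0)]
far_pts = [(40.0, 0.0), (10.0, 45.0), (20.0, 20.0)]   # disparities 40, 35, 0
near_pts = [(1.0, 0.0), (10.0, 12.0), (20.0, 20.0)]   # disparities 1, 2, 0
print(needs_keyframe(kf_pts, far_pts, 90, 100))
print(needs_keyframe(kf_pts, near_pts, 40, 100))
print(needs_keyframe(kf_pts, near_pts, 60, 100))
```

Using the median rather than the mean keeps the parallax estimate insensitive to a few badly tracked features.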

For further optimization, a window of the most recent keyframes along with their 2D-3D correspondences is stored. The optimization, known as bundle adjustment, is performed after the creation of every keyframe. Once the optimization is done, new Harris corners are detected and the tracking loop is run again.

4.5 Windowed Local bundle adjustment

As stated above, a window of the most recent keyframes is stored and optimized with bundle adjustment at the creation of any new keyframe. To ensure a reasonable computational cost, only the three most recent keyframes are optimized along with their tracked map points. The remaining keyframes are fixed in order to constrain this nonlinear least-squares optimization problem. The size of the window is set adaptively by including every keyframe sharing a map point observation with one of the three optimized keyframes. This adaptive configuration sets high constraints on the problem and helps in reducing the unavoidable scale drift inherent to monocular odometry systems. The Levenberg-Marquardt algorithm is used to perform this optimization. The problem is solved by minimizing the map points reprojection errors. As least-squares estimators do not make any difference between high and low error terms, the result would be highly influenced by the presence of outliers with high residuals. To prevent this, we use the robust M-Estimator Huber cost function (Hartley and Zisserman (2004)) in order to reduce the impact of the highest error terms on the found solution.
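The adaptive construction of the optimization window can be sketched as follows, assuming each stored keyframe is represented simply by the set of identifiers of the map points it observes. This representation and the function name are illustrative choices, not the data structures of the actual system.

```python
from typing import List, Set, Tuple


def select_window(keyframes: List[Set[int]],
                  n_optimized: int = 3) -> Tuple[List[int], List[int]]:
    """Split the stored keyframes (ordered from oldest to newest) into the
    indices of the optimized ones and of the fixed ones.

    The `n_optimized` most recent keyframes are optimized. An older keyframe
    enters the window as a fixed keyframe when it shares at least one map
    point observation with an optimized keyframe."""
    first = max(0, len(keyframes) - n_optimized)
    optimized = list(range(first, len(keyframes)))
    active_landmarks: Set[int] = set()
    for idx in optimized:
        active_landmarks |= keyframes[idx]
    fixed = [idx for idx in range(first) if keyframes[idx] & active_landmarks]
    return optimized, fixed


stored = [{1, 2}, {2, 3}, {9}, {3, 4}, {4, 5}, {5, 6}]
print(select_window(stored))
```

In this example only keyframe 1 shares a map point (number 3) with the three optimized keyframes, so it is the only fixed keyframe added to the window.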

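We define the reprojection error $e_{ij}$ for every map point $l_j$ observed in a keyframe $i$ as:

$$e_{ij} = x_{ij} - \pi(X_i, l_j)$$

where $x_{ij}$ is the 2D observation of $l_j$ in the keyframe $i$ and $\pi(X_i, l_j)$ its reprojection given the keyframe pose $X_i$. We then define the set of parameters to optimize as:

$$\chi = \{X_{k-2},\, X_{k-1},\, X_k,\; l_1, \ldots, l_M\}$$

with $k$ the index of the most recent keyframe and $M$ the number of landmarks observed by the three most recent keyframes. We then minimize the following cost over the optimization window $\mathcal{K}$ of $N$ keyframes:

$$\chi^{*} = \arg\min_{\chi} \sum_{i \in \mathcal{K}} \sum_{j \in \mathcal{L}_i} \rho\left(e_{ij}^T\, \Sigma_{ij}^{-1}\, e_{ij}\right)$$

with $\mathcal{L}_i$ the set of landmarks observed by the keyframe $i$, $\rho$ the Huber robust cost function and $\Sigma_{ij}$ the covariance matrix associated with the measurements $x_{ij}$.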

After convergence of the Levenberg-Marquardt algorithm, we remove the map points with resulting reprojection errors higher than a threshold. This optimization step ensures that after the insertion of every keyframe both the trajectory and the 3D structure of the map are statistically optimal.
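The effect of the Huber cost on large residuals can be illustrated with a scalar sketch. It uses the common textbook definition of the Huber function applied to a single residual; the threshold `delta` is a hypothetical value, and the way the robust kernel is applied to the weighted reprojection errors inside the optimizer may differ.

```python
def huber(residual: float, delta: float = 1.0) -> float:
    """Huber cost: quadratic near zero, linear beyond `delta`."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def squared(residual: float) -> float:
    """Plain least-squares cost, for comparison."""
    return 0.5 * residual * residual


for r in (0.5, 3.0, 10.0):
    print(r, squared(r), huber(r))
```

For a residual of 10, the least-squares cost is 50 while the Huber cost is 9.5, so a single outlier weighs far less on the solution.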

4.6 Initialization

Monocular systems are subject to a "chicken and egg" problem at start-up. Indeed, the motion of the camera is estimated through the observations of known 3D world points, but the depth of the imaged world points is not observable from a single image. The depth of these world points can be estimated using two images with a sufficient baseline. However, this baseline needs to be known to compute the depth and vice versa. This is why monocular VO, in contrast to stereo systems, requires an initialization step to bootstrap the algorithm. Initialization is done by computing the relative pose between two frames and arbitrarily setting the norm of the translation.

In ORB-SLAM, the initialization is done very carefully. Features detected in a first keyframe are tracked in the successive frames. From this tracking, a homography and a fundamental matrix are computed in parallel. If the scene seems to be planar the homography is selected otherwise the fundamental matrix is the one selected. If there is an ambiguity on the geometrical model to use, the system is reset by creating a new keyframe and the initialization step is run again. This complex initialization is done because of a degeneracy in the estimation of a fundamental matrix in the case of a planar scene.

However, in VO or SLAM applications, the cameras used are always calibrated. The knowledge of the calibration parameters of a camera allows computing the essential matrix instead of the fundamental matrix. In his paper, Nister (2004) states that his 5-points method for the estimation of an essential matrix is robust to planar scenes, so we perform initialization using this 5-points algorithm. In the underwater context, this initialization step is often conducted on planar scenes, as navigation is made near the seabed. We verified that this simpler method is able to accurately initialize the VO framework in any configuration (planar or not). A further advantage of this initialization over that of ORB-SLAM is that it does not delay the start of the system, whereas ORB-SLAM waits for the right conditions to be fulfilled before starting. Such delays are very frequent when the tracking of features is difficult, as in our context.

5 Experimental Results

Figure 6: The four different turbidity levels of the simulated dataset.
Figure 7: Drift in % of ORB-SLAM (green), V.O. ORB-SLAM (blue) and UW-VO (red) on the simulated underwater dataset for each level of noise.

Implementation: The proposed system has been developed in C++ and uses the ROS middleware (Quigley et al. (2009)). The tracking of features is done with the OpenCV implementation of the Kanade-Lucas algorithm (Bouguet (2000)). Epipolar geometry and P3P pose estimations are computed using the OpenGV library (Kneip and Furgale (2014)). Nonlinear optimization is performed using the graph optimization framework g2o (Kümmerle et al. (2011)). Our system runs in real-time with an average run time of 25 ms per frame with the tracking limited to 250 features per frame. The experiments have been carried out with an Intel Core i5-5200 CPU - 2.20GHz - 8 GB RAM.

The algorithm has been tested on different datasets along with the state-of-the-art algorithms ORB-SLAM, LSD-SLAM and SVO for comparison. To the best of our knowledge, there is no open-source underwater method able to estimate localization from monocular images only. All algorithms are evaluated on real underwater datasets. Our method and ORB-SLAM are also evaluated on a simulated dataset whose frame rate (10 Hz) is too low for SVO and LSD-SLAM to work. Note that ORB-SLAM and SVO have been fine-tuned in order to work properly. For ORB-SLAM, the feature detection threshold was set to the lowest possible value and the number of points was set to 2000. For SVO, the feature detection threshold was also set to the lowest possible value and the number of tracked features required for initialization was lowered to 50. For each method, all the results presented are averaged over five runs.

5.1 Results on a simulated underwater dataset

Figure 8: Trajectories estimated with (a) our method on the sequence with the highest level of noise and with (b) V.O. ORB-SLAM on the sequence with the noise level of 3.
Drift (in %)
Seq.  Noise    ORB-SLAM  V.O. ORB-SLAM  UW-VO
1     None     0.18      0.97           0.78
2     Low      0.18      0.93           0.81
3     Medium   0.17      1.21           0.85
4     High     X         X              0.89
Table 1: Translation drift (in %) on the simulated underwater video sequences with different levels of noise simulating turbidity effects. Results are averaged over five runs for each algorithm. X: the algorithm failed on the sequence. V.O. ORB-SLAM: ORB-SLAM without the loop-closing feature enabled. ORB-SLAM results are given for information.

A simulated dataset created from real underwater photographs has been made available to the community by Duarte et al. (2016). Four monocular videos of a triangle-shaped trajectory are provided, each synthetically degraded with a different level of turbidity-like noise (Fig. 6). The image resolution of these videos is 320x240. In each sequence the triangle trajectory is performed twice, and it starts and ends at the same place. We have used these four sequences to evaluate the robustness to turbidity of the proposed method with respect to ORB-SLAM. For a fair comparison, ORB-SLAM has been run with and without its loop-closing feature. We will refer to the pure visual odometry version of ORB-SLAM as V.O. ORB-SLAM in the following. Table 1 presents the final drift at the end of the trajectory for each method. On the first three sequences, ORB-SLAM is able to close the loops and therefore has the lowest drift values, as detecting the loop closures makes it possible to reduce the drift accumulated in between. On the same sequences, V.O. ORB-SLAM has the highest drift. Note that ORB-SLAM and its V.O. variant fail half the time on the sequence with the third level of noise and had to be run many times to obtain five good trajectories. On the last sequence, neither version of ORB-SLAM is able to compute the performed trajectory because of tracking failures. These results highlight the deficiency of the ORB-SLAM tracking method on turbid images. In comparison, our method is able to run on all the sequences, including the ones with the highest levels of noise (Fig. 7). The computed trajectories are more accurate than those of V.O. ORB-SLAM, and we can note that our method is barely affected by the noise level. These results confirm the efficiency of this VO method as a robust odometry system in turbid environments.
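
As an illustration of the drift figures in Table 1, the sketch below computes an end-point drift for a closed-loop trajectory. The exact drift definition is not spelled out in this section, so the code assumes a common convention: because each sequence starts and ends at the same place, drift is taken as the gap between the first and last estimated positions, divided by the estimated path length. The function name and the sample trajectory are hypothetical.

```python
import math


def closed_loop_drift_percent(positions):
    """End-point drift of a closed-loop trajectory, in % of the path length.

    Assumption (not taken from the paper): the true trajectory returns
    exactly to its starting point, so any gap between the first and last
    estimated positions is accumulated drift.
    """
    if len(positions) < 2:
        raise ValueError("need at least two positions")
    # Sum of the distances between consecutive estimated positions.
    path_length = sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))
    if path_length == 0.0:
        raise ValueError("trajectory has zero length")
    gap = math.dist(positions[0], positions[-1])
    return 100.0 * gap / path_length


# Hypothetical estimated positions along a triangle that almost closes.
trajectory = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.06, 0.08, 0.0)]
drift = closed_loop_drift_percent(trajectory)
print(f"drift: {drift:.2f} %")
```

Since both the gap and the path length come from the same estimate, the ratio does not depend on the unknown monocular scale.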

5.2 Results on a real underwater video sequence

Figure 9: Trajectories of ORB-SLAM, SVO and UW-VO over the five underwater sequences. (a) Sequence 1, (b) Sequence 2, (c) Sequence 3, (d) Sequence 4, (e) Sequence 5. Ground-truths (GT) are extracted from Colmap trajectories.

We now present experiments conducted on five real underwater video sequences. These sequences were gathered 500 meters deep in the Mediterranean Sea (Corsica) in 2016, during an archaeological mission conducted by the French Department of Underwater Archaeological Research (DRASSM). The videos were recorded by a camera mounted on an ROV, and gray-scale 640x480 images were captured at 16 Hz. The camera was calibrated with the Kalibr library (Furgale et al. (2013)). Calibration was done in situ in order to estimate the intrinsic parameters and the distortion coefficients of the whole optical system. Had the calibration been performed in air, the effects of the water and of the camera housing on the images would not have been estimated, leading to a poor estimate of the camera parameters. The camera recording the videos was placed inside an underwater housing equipped with a spherical dome, and we obtained good results using the pinhole-radtan model.
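
To make the pinhole-radtan model concrete, here is a minimal sketch of the projection it describes: a pinhole projection combined with two radial (k1, k2) and two tangential (p1, p2) distortion coefficients, in the standard form used by OpenCV and Kalibr. The function name and all numeric values are hypothetical placeholders, not the calibration obtained for the sequences.

```python
def project_pinhole_radtan(point, fx, fy, cx, cy, k1, k2, p1, p2):
    """Project a 3D point in the camera frame to pixel coordinates.

    Uses the pinhole model with radial-tangential ("radtan") distortion.
    """
    X, Y, Z = point
    if Z <= 0.0:
        raise ValueError("point must be in front of the camera")
    # Normalized image coordinates.
    x, y = X / Z, Y / Z
    r2 = x * x + y * y
    # Radial distortion factor.
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    # Radial plus tangential distortion.
    x_d = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    y_d = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Apply focal lengths and principal point.
    return fx * x_d + cx, fy * y_d + cy


# Hypothetical intrinsics for a 640x480 image; not the paper's calibration.
intrinsics = dict(fx=400.0, fy=400.0, cx=320.0, cy=240.0)
distortion = dict(k1=-0.1, k2=0.0, p1=0.0, p2=0.0)
u, v = project_pinhole_radtan((1.0, 0.5, 2.0), **intrinsics, **distortion)
print(u, v)
```

Calibrating in situ amounts to estimating these eight parameters from images taken through the water and the dome, so that the coefficients absorb the refraction effects of the whole optical system.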

These five sequences can be described as follows:

  • Sequence 1: low level of turbidity and almost no fish.

  • Sequence 2: medium level of turbidity and some fish.

  • Sequence 3: high level of turbidity and many fish.

  • Sequence 4: low level of turbidity and many fish.

  • Sequence 5: medium level of turbidity and many fish.

For each of these sequences, a ground truth was computed using the state-of-the-art Structure-from-Motion software Colmap (Schönberger and Frahm (2016)). We compare ORB-SLAM, LSD-SLAM and SVO to the method developed here. We evaluate the results of each algorithm against the trajectories computed offline by Colmap by first aligning the estimated trajectories with a similarity transformation using the method of Umeyama (1991) and then computing the absolute trajectory error as in Sturm et al. (2012) (Fig. 9). The results are displayed in Table 2. To observe the effect of the retracking mechanism (described in section 4.2), we have run the VO algorithm with and without this feature enabled, referring to the two variants as UW-VO and UW-VO* respectively. Videos of the results for each method on the five sequences are available online.
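
The evaluation procedure described above can be sketched as follows, assuming NumPy is available and that estimated and ground-truth positions are already associated one-to-one. The function names are hypothetical. Table 2 reports errors in %, but the normalization is not stated in this section, so the sketch stops at the raw RMSE expressed in ground-truth units.

```python
import numpy as np


def umeyama_alignment(est, gt):
    """Least-squares similarity transform (scale, R, t) mapping est onto gt.

    Follows the closed-form solution of Umeyama (1991) for 3D point sets.
    """
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e_c, g_c = est - mu_e, gt - mu_g
    var_e = (e_c ** 2).sum() / len(est)   # variance of the estimated points
    cov = g_c.T @ e_c / len(est)          # cross-covariance matrix (3x3)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # avoid returning a reflection
    R = U @ S @ Vt
    scale = np.trace(np.diag(D) @ S) / var_e
    t = mu_g - scale * R @ mu_e
    return scale, R, t


def ate_rmse(est, gt):
    """RMSE of the position errors after similarity alignment of est to gt."""
    est = np.asarray(est, dtype=float)
    gt = np.asarray(gt, dtype=float)
    scale, R, t = umeyama_alignment(est, gt)
    aligned = scale * est @ R.T + t
    errors = np.linalg.norm(aligned - gt, axis=1)
    return float(np.sqrt(np.mean(errors ** 2)))


# Hypothetical check: a helix observed in another frame and at another scale.
angles = np.linspace(0.0, 4.0 * np.pi, 50)
gt = np.stack([np.cos(angles), np.sin(angles), 0.1 * angles], axis=1)
c, s = np.cos(0.7), np.sin(0.7)
R0 = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
est = 2.5 * gt @ R0.T + np.array([1.0, -2.0, 0.5])
print(ate_rmse(est, gt))  # close to zero: only a similarity separates them
```

Estimating the scale together with the rotation and translation matters here because a monocular method recovers the trajectory only up to an unknown scale factor.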

As we can see, LSD-SLAM did not give any valid result on any sequence. This is most likely due to its semi-dense approach, which relies on tracking edges with strong gradients; such edges are infrequent in sea-floor images. SVO is able to compute quite accurate trajectories as long as the video sequences are not too affected by the motion of fish. The tracking of SVO, which is similar to optical flow, seems to work well even on turbid images, but its direct pose estimation method is not robust to badly tracked photometric patches such as those created by moving fish (seq. 3, 4, 5). ORB-SLAM, on the other hand, performs well on highly dynamic sequences but loses accuracy when turbidity is present (seq. 2, 3, 5). Its pose estimation method, based on the observations of independent features, is hence robust to short occlusions and dynamic objects, but its tracking method fails on images degraded by turbidity. Furthermore, we can note that despite loop closures in the trajectories (see Fig. 9), ORB-SLAM is not able to detect them. This failure indicates that the classical offline dictionary approaches (Galvez-Lopez and Tardos (2012)) might not be suited to the underwater environment, which does not provide many discriminative features.

Only our method is able to run on all the sequences. While the estimated trajectory is slightly less accurate on the easiest sequence (seq. 1), we perform better than ORB-SLAM and SVO on the hardest sequences (seq. 2,3,5). We can see the benefit of the developed retracking mechanism on most of the sequences. Nonetheless, this optical flow retracking step is not as efficient as the use of descriptors for robustness against occlusions (seq. 4). Studying the effect of combining optical flow tracking with the use of descriptors could result in an interesting hybrid method for future work.

Sequences                             ATE (RMSE in %)
Seq. #  Duration  Turbidity  Fish   LSD-SLAM  ORB-SLAM  SVO   UW-VO*  UW-VO
1       4’        Low        Few    X         1.67      1.63  1.78    1.76
2       2’30”     Medium     Some   X         1.91      2.45  1.78    1.73
3       22”       High       Many   X         X         1.57  1.10    1.04
4       4’30”     Low        Many   X         1.13      X     1.61    1.58
5       3’15”     Medium     Many   X         1.94      X     2.08    1.88
Table 2: Absolute translation errors (RMSE in %) for five underwater sequences with different visual degradations. Results are given averaging over five runs for each algorithm. UW-VO*: our method without the retracking step. UW-VO: our method with the retracking step.

6 Conclusion

This work presents a new vision-based underwater localization method. While most of the existing approaches rely on expensive proprioceptive sensors to estimate the motions of the underwater vehicle, we have chosen to investigate the use of a simple low-cost camera as a means of localization. We propose a new monocular visual odometry method robust to the underwater environment. Different feature tracking methods have been evaluated in this context, and we have shown that optical flow tracking performs better than the classical descriptor matching approach. We further enhanced this optical flow tracking by adding a retracking mechanism, making it more robust to underwater visual degradations such as short occlusions due to moving animals. We proposed a visual odometry framework that makes use of keyframes to limit the localization drift and optimizes the estimated trajectory through bundle adjustment. A simple yet very effective initialization method was also presented. We have shown that the proposed method performs better than the state-of-the-art visual SLAM algorithms ORB-SLAM, LSD-SLAM and SVO on underwater datasets. We publicly released the underwater video sequences used in this paper along with the camera calibration parameters and the trajectories computed with Colmap.

This work on the development of a monocular visual odometer is a first step towards a robust underwater localization method based on low-cost sensors. The next step is to enhance it by adding a loop-closure mechanism, turning it into a visual SLAM method. Classical loop-closing approaches using Bag of Words (Galvez-Lopez and Tardos (2012)) did not give the expected results in our tests, and alternative methods along the lines of Carrasco et al. (2016) need to be investigated. Finally, following the same idea as visual-inertial SLAM algorithms, we will next study the tight fusion of a low-cost IMU and a pressure sensor with this visual method to improve the localization accuracy and retrieve the scale factor.


The authors acknowledge support of the CNRS (Mission pour l’interdisciplinarité - Instrumentation aux limites 2018 - Aqualoc project) and support of Région Occitanie (ARPE Pilotplus project). The authors are grateful to the DRASSM for its logistical support and for providing the underwater video sequences.