In this paper, we propose a novel approach that combines all or a subset of several different sensors: acoustic (sonar range), visual (stereo camera), inertial (linear accelerations and angular velocities), and depth data to estimate the trajectory of the combined sensors and to create a reconstruction of an underwater environment. Visualinertial state estimation is in general a challenging problem that many researchers addressed providing several different solutions [1, 2, 3, 4, 5]. The underwater environment – e.g., see Fig. 1 – presents unique challenges to vision based state estimation, as shown in a previous study . The introduction of additional sensors, such as IMU  and sonar , has demonstrated promising results. In our earlier work SVIn  acoustic, visual, and inertial data was fused together to map different underwater structures by augmenting the visual/inertial state estimation package OKVIS . In this paper, we extend this work by improving a novel image enhancement targeted to the underwater domain, introducing depth measurements, loopclosure capabilities, and a more robust initialization.
To validate our proposed approach, first, we assess the performance of the proposed loop-closing method, by comparing it to other state-of-the-art systems on the EuRoC microaerial vehicle public dataset  disabling the fusion of sonar and depth measurements in our system. Second, we test the full system on several different datasets in a diverse set of conditions. More specifically, underwater data – consisting of visual, inertial, depth, and acoustic measurements – has been collected using a custom made sensor suite from different locales; furthermore, data collected by an AQUA2 underwater vehicle , include visual, inertial, and depth measurements. The results on the underwater datasets illustrate the loss of tracking and failure to maintain consistent scale for other state-of-the-art systems while our proposed methodology maintains correct scale without diverging.
The next section discusses related work. Section III presents the mathematical formulation of the proposed system and describes the approach developed for image preprocessing, pose initialization, loopclosure, and relocalization. Section IV presents results from an publicly available aerial dataset and a diverse set of challenging underwater environments. We conclude this paper with a discussion on lessons learned and directions of future work.
Ii Related Work
The literature presents many visionbased state estimation techniques, which use either monocular or stereo cameras and that are direct or indirect methods, including, for example, MonoSLAM , PTAM , LSD-SLAM [12, 13], SVO  and ORB-SLAM . In the following, we highlight some of the state estimation systems which use visual inertial measurements.
To improve the pose estimate, visionbased state estimation techniques have been augmented with Inertial Measurement Unit (IMU) sensors, whose data is fused together with visual information. A class of approaches is based on the Kalman Filter
, e.g., Multi-State Constraint Kalman Filter (MSCKF) and its stereo extension ; ROVIO ; REBiVO  . The other spectrum of methods optimizes the sensor states, possibly within a window, formulating the problem as a graph optimization problem. The optimization function typically includes the IMU error term and the reprojection error, as in OKVIS . Delmerico and Scaramuzza  did a comprehensive comparison specifically monitoring resource usage by the different methods.
Loop closure – the capability of recognizing a place that was seen before – is an important component to mitigate the drift of the state estimate. ORB-SLAM  and its extension with IMU  use bagofwords for loop closure and relocalization. A similar approach is used by VINS-Mono , which is based on a tightly-coupled optimization framework that uses monocular and inertial sensor data.
Note that all visualinertial state estimation systems require a proper initialization. VINS-Mono  uses a loosely-coupled sensor fusion method to align monocular vision with inertial measurement for estimator initialization. ORB-SLAM with IMU  performs initialization by first running a monocular SLAM to observe the pose, and at the same time, IMU biases are also estimated.
Given the modularity of OKVIS for adding new sensors – we fused sonar data in previous work in  – we extend OKVIS to include depth, loop closure capabilities, and a more robust initialization to specifically target underwater environments.
Iii Proposed Method
This section describes the proposed system, depicted in Fig. 2. The full proposed state estimation system can operate with a robot that has stereo camera, IMU, sonar, and depth sensor – the last two can be also disabled to operate as a visual-inertial system.
Due to the lack of natural light illumination and dynamic obstacles, it is hard to find good features to track. In addition to the underwater vision constraints, e.g., light and color attenuation, vision-based systems also suffer from low visibility and poor contrast. Hence, we augment the pipeline by adding an image preprocessing step, where contrast adjustment along with histogram equalization is applied to improve feature detection underwater. In particular, we use a contrast limited adaptive histogram equalization filter in the image pre-processing step.
In the following, after defining the state, we describe the proposed initialization, sensor fusion optimization, loop closure and relocalization steps.
Iii-a Notations and States
The full sensor suite is composed of the following coordinate frames: camera, IMU, acoustic (sonar), depth, and world which are denoted as , , , , and respectively. The transformation between two arbitrary coordinate frames and is represented by a homogeneous transformation matrix where the rotation matrix is represented as and the corresponding quaternion is .
Given the introductory notations, we define the state of the robot that the system is estimating as:
which contains the position , the attitude represented by the quaternion , the linear velocity , all expressed as the IMU reference frame with respect to the world coordinate
; moreover, the state vector contains the gyroscopes and accelerometers biasand .
The associated error-state vector is defined in minimal coordinates, while the perturbation takes place in the tangent space:
which represents the error for each component of the state vector with a transformation between tangent space and minimal coordinates .
Iii-B Initialization: Scale Refinement, Velocity and Gravity Approximation
A robust and accurate initialization is required for the success of tightly-coupled non-linear systems, as described in  and . For underwater deployments, this becomes even more important as vision is often occluded, e.g., by fishes, as well as is negatively affected by the lack of features for tracking. Indeed, from our comparative study of visual-inertial based state estimation systems , in underwater datasets, most of the state-of-the-art systems either fail to initialize or make wrong initialization resulting into divergence. Hence, we propose a robust initialization method for underwater state estimation, making sure: 1) the system only initializes when a minimum number of visual features are present to track (in our experiments worked well); and 2) the initial scale from the stereo vision has been refined by aligning IMU pre-integration values with the pose estimate from stereo matching, using relative camera motion method with RANSAC and the depth measurements from the pressure sensor.
The transformation of combined position estimation stereo camera and depth sensor from their respective coordinate frames to (IMU) frame with the refined scale factor s can be found as follows:
The coordinate system of the depth sensor is aligned with the (world); as such, depth values correspond to values along the axis of . For initialization, no acoustic measurements have been used because of their extrinsics – as described in  – which would not allow to have any match between acoustic and visual features, if the robot is not moving. In addition to refining the scale, we also approximate initial velocity and gravity vector similarly to the method described in .
The pose prediction from IMU integration with IMU measurements in OKVIS  with conditional covariance can be written as:
where , , and are IMU pre-integration terms defining the motion between two consecutive keyframes i and i+1 in time interval and can be obtained only from the IMU measurements. In particular:
refers to the combined position estimate from stereo camera and depth sensor. The reason behind using the depth measurement for scale refinement, is to introduce an additional constraint for position correction along with stereo.
Iii-C Tightly-coupled Non-Linear Optimization with Sonar, Visual, Inertial, and Depth sensors
For the tightly-coupled non-linear optimization, we use the following cost function , which includes the reprojection error and the IMU error with the addition of the sonar error (see ) and the depth error :
where i denotes the camera index – i.e., left or right camera in a stereo camera system with landmark index j observed in the kth camera frame. , , , and represent the information matrix of visual landmark, IMU, sonar range, and depth measurement for the kth frame respectively.
For completeness we briefly discuss each error term. The reprojection error describes the difference between a keypoint measurement in camera coordinate frame and the corresponding landmark projection according to the stereo projection model. The IMU error term combines all accelerometer and gyroscope measurements by IMU pre-integration  between successive camera measurements and represents both the pose and speed and bias error between the prediction based on previous and current states. Both reprojection error and IMU error term follow the formulation by Leutenegger et al. .
The concept behind calculating the sonar range error, introduced in our previous work , is that, if the sonar detects any obstacle at some distance, it is more likely that the visual features would be located on the surface of that obstacle, and thus will be at approximately the same distance. The step involves computing a visual patch detected in close proximity of each sonar point to introduce an extra constraint, using the distance of the sonar point to the patch. Here, we assume that the visual-feature based patch is small enough and approximately coplanar with the sonar point. As such, the error term is the difference between those two distances and is used to correct the position of the robot state
with a normal conditional probability density functionf and the conditional covariance , updated iteratively as new sensor measurements are integrated:
The information matrix is:
is the measurement in sonar reference frame . The Jacobian can be derived by differentiating the expected range measurement with respect to the robot pose:
where represents the sonar landmark in homogeneous coordinate and can be calculated by a simple geometric transformation in world coordinates given range and head-position from the sonar measurements:
The depth error term, introduced in this paper, can be calculated as the difference between the robot position along the the direction (ENU convention) and the water depth measurement provided by a pressure sensor. This can correct the position of the robot along the axis. The error term with a normal conditional probability density function f and the conditional covariance can be defined as:
and denotes the depth measurement in -th frame and first frame, respectively, in depth sensor coordinate system . The information matrix calculation follows a similar approach as the sonar and the Jacobian is straight-forward to derive:
All the error terms are added in the Ceres nonlinear optimization framework  to estimate the robot state.
Iii-D Loop-closing and Relocalization
In a sliding window and marginalization based optimization method, drift accumulates over time on the pose estimate. A global optimization and relocalization scheme in necessary to eliminate this drift and to achieve global consistency. We adapt DBoW2 , a bag of binary words place recognition module, and augment OKVIS for loop detection and relocalization.
A pose-graph is maintained to represent the connection between keyframes. In particular, a node represents a keyframe and an edge between two keyframes exists if the matched keypoints ratio between them is more than
, resulting into a very sparse graph. With each new keyframe in the pose-graph, the loop-closing module searches for candidates in the bag of words database. While querying the place recognition database for loop detection, the database only returns those candidates having higher score than the keyframes connected to that node in the pose-graph. If loop is detected, the candidate with the highest score is retained and feature correspondences between the current keyframe in the local window and the loop candidate keyframe are obtained to establish connection between them. The pose-graph is consequently updated with loop information. A 2D-2D descriptor matching and a 3D-2D matching with known landmark position with outlier rejection by PnP RANSAC is performed to obtain the geometric validation. Though BRISK features are used for local tracking in OKVIS, we use BRIEF descriptors in DBoW2 and to compute feature correspondences by descriptor matching. The reason is, in our combined sonar-visual-inertial system, gravity renders two rotational DoF observable, and we do not need to rely on rotation-invariant features.
When a loop is detected, the global relocalization module aligns the current keyframe pose in the local window with the pose of the loop keyframe in the pose-graph by sending back the drift in pose to the windowed sonar-visual-inertial-depth optimization thread. Also an additional optimization step similar to Eq. (7) is taken only with the matched landmarks for calculating the sonar error term and reprojection error. After loop detection, a 4-DoF (position, and yaw, ) pose-graph optimization takes place to correct the loop by minimizing the cost function:
where is the information matrix and set to identity, as in . The error term of an edge between keyframes and is defined as:
are obtained from local sonar-visual-inertial optimization.
Iv Experimental Results
The proposed state estimation system is quantitatively validated first on a standard dataset, to ensure that loop closure and the initialization work also above water. Moreover, it is compared to other state-of-the-art methods, i.e., VINS-Mono , the basic OKVIS , and the MSCKF  implementation from the GRASP lab . Second, we qualitatively test it on several different datasets collected utilizing a custom made sensor suite  and an Aqua2 AUV .
Iv-a Validation on Standard dataset
Here, we present results on the EuRoC dataset , one of the benchmark datasets used by many visualinertial state estimation systems, including OKVIS (Stereo), VINS-Mono, and MSCKF. To compare the performances, we disabled depth and sonar integration in our method and only look for the validation of loop-closure scheme.
An alignment is performed between ground truth and estimated trajectory, by minimizing the least mean squared errors between estimate/groundtruth locations, which are temporally close, varying rotation, translation, and scaling, according to the method from . The resulting metric is the Root Mean Square Error (RMSE) for the translation, shown in Table I for several Machine Hall sequences in the EuRoC dataset. Our method shows reduced RMSE in every sequence from OKVIS, validating the improvement of pose-estimation after loop-closing. SVIn2 has also less RMSE than MSCKF and slightly higher, but comparable when compared against VINS-Mono. Fig. 3 shows the trajectories for each method together with the ground truth for the Machine Hall 04 Difficult sequence.
Iv-B Underwater datasets
Our proposed state estimation system – SVIn2 – is targeted for underwater environments, where sonar and depth can be fused together with the visualinertial data. Here, we show results from four different datasets in three different underwater environments. First, a sunken bus in Fantasy Lake (NC), where data was collected by a diver with a custommade underwater sensor suite. The diver started from outside the bus, entered in it from the front door, exited from the back door, performed a loop around the bus. The images are affected by haze and low visibility. Second and third, data from an underwater cavern in Ginnie Springs (FL) is collected again by a diver with the same sensor suite as for the sunken bus. The diver performed several loops, around one spot in the second dataset – Cavern1 – and two spots in the third dataset – Cavern2 – inside the cavern. The environment is affected by complete absence of natural light. Fourth, an AUV – AQUA2 robot – collected data over an underwater cemetery in Lake Jocassee (SC) and performed several loops around the tombstones in a square pattern. The visibility, as well as brightness and contrast, was very low. In the underwater datasets, it is a challenge to get any ground truth, because it is a GPS-denied unstructured environment. As such, the evaluation is qualitative, with a rough estimate on the size of the environment measured beforehand by the divers collecting the data.
Fig. 5 shows the results for the submerged bus dataset. In particular, VINS-Mono lost track when the exposure increased for quite some time. It tried to reinitialize, but it was not able to track successfully. Even using histogram equalization or a contrast adjusted histogram equalization filter, VINS-Mono was not able to track. Even if the scale drifted, OKVIS was able to track using a contrast adjusted histogram equalization filter in the image pre-processing step. Without the filter, it lost track at the high exposure location. Our method was able to track, detect, and correct the loop, successfully.
In Cavern1 – see Fig. 6 – VINS-Mono tracked successfully the whole time. However, as can be noticed in Fig. 6, the scale was incorrect based on empirical observations during data collection. OKVIS instead produced a good trajectory, and SVIn2 was also able to detect and close the loops.
In Cavern2 (Fig. 7), VINS-Mono lost track at the beginning, reinitialized, and was able to track for some time and detected a loop, before losing track again. VINS-Mono had similar behavior even if the images were pre-processed with different filters. OKVIS tracked well, but as drifts were accumulated over time, it was not able to join the current pose with a previous pose where a loop was expected. SVIn2 was able to track and reduce the drift in the trajectory with successful loop closure.
In the cemetery dataset – Fig. 8 – both VINS-Mono and OKVIS were able to track, but VINS-Mono was not able to reduce the drift in trajectory, while SVIn2 was able to fuse and correct the loops.
In this paper, we presented SVIn2, a state estimation system with robust initialization, sensor fusion of depth, sonar, visual, and inertial data, and loop closure capabilities. While the proposed system can work also out of the water, by disabling the sensors that are not applicable, our system is specifically targeted for underwater environments. Experimental results in a standard benchmark dataset and different underwater datasets demonstrate excellent performance.
Utilizing the insights gained from implementing the proposed approach, an online adaptation of the discussed framework for the limited computational resources of the AQUA2 AUV  is currently under consideration; see Fig. 4. It is worth noting that maintaining the proper attitude of the traversed trajectory and providing an estimate of the distance traveled will greatly enhance the autonomous capabilities of the vehicle . Furthermore, accurately modeling the surrounding structures would enable AQUA2, as well as other vision based underwater vehicles to operate near, and through, a variety of underwater structures, such as caves, shipwrecks, and canyons.
The authors would like to thank the National Science Foundation for its support (NSF 1513203, 1637876).
-  R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular slam with map reuse,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 796–803, 2017.
-  S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” Int. J. Robot. Res., vol. 34, no. 3, pp. 314–334, 2015. [Online]. Available: http://ijr.sagepub.com/content/34/3/314.abstract
-  T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018.
-  A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in Proc. ICRA. IEEE, 2007, pp. 3565–3572.
-  K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 965–972, 2018.
-  A. Quattrini Li, A. Coskun, S. M. Doherty, S. Ghasemlou, A. S. Jagtap, M. Modasshir, S. Rahman, A. Singh, M. Xanthidis, J. M. O’Kane, and I. Rekleitis, “Experimental comparison of open source vision based state estimation algorithms,” in Proc. ISER, 2016.
-  S. Rahman, A. Quattrini Li, and I. Rekleitis, “Sonar Visual Inertial SLAM of Underwater Structures,” in Proc. ICRA, 2018.
-  M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” Int. J. Robot. Res., vol. 35, no. 10, pp. 1157–1163, 2016.
-  G. Dudek, M. Jenkin, C. Prahacs, A. Hogue, J. Sattar, P. Giguere, A. German, H. Liu, S. Saunderson, A. Ripsman, S. Simhon, L. A. Torres-Mendez, E. Milios, P. Zhang, and I. Rekleitis, “A visually guided swimming robot,” in Proc. IROS, 2005, pp. 1749–1754.
-  J. Civera, O. G. Grasa, A. J. Davison, and J. M. M. Montiel, “1Point RANSAC for Extended Kalman Filtering: Application to Real-time Structure from Motion and Visual Odometry,” J. Field Robot., vol. 27, no. 5, pp. 609–631, 2010.
-  G. Klein and D. Murray, “Parallel tracking and mapping for small ar workspaces,” in IEEE and ACM Int. Symp. on Mixed and Augmented Reality, 2007, pp. 225–234.
-  J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-Scale Direct Monocular SLAM,” in Proc. ECCV, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer Int. Publishing, 2014, vol. 8690, pp. 834–849.
-  J. Engel, J. Stückler, and D. Cremers, “Large-scale direct SLAM with stereo cameras,” in Proc. IROS, 2015, pp. 1935–1942.
-  C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “SVO: Semidirect Visual Odometry for Monocular and Multicamera Systems,” IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, 2017.
-  R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A Versatile and Accurate Monocular SLAM System,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, 2015.
-  M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” Int. J. Robot. Res., vol. 36, pp. 1053–1072, 2017.
-  J. J. Tarrio and S. Pedre, “Realtime edge based visual inertial odometry for mav teleoperation in indoor environments,” J. Intell. Robot. Syst., pp. 235–252, 2017.
-  J. Delmerico and D. Scaramuzza, “A benchmark comparison of monocular visual-inertial odometry algorithms for flying robots,” in Proc. ICRA, 2018.
-  C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual–inertial odometry,” IEEE Trans. Robot., vol. 33, no. 1, pp. 1–21, 2017.
-  B. Joshi, B. Cain, J. Johnson, M. Kalitazkis, S. Rahman, M. Xanthidis, A. Hernandez, A. Quattrini Li, N. Vitzilaios, and I. Rekleitis, “Experimental comparison of open source vision-inertial-based state estimation algorithms,” IEEE Robot. Autom. Lett., 2018, (under review).
C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “Imu preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation.” Georgia Institute of Technology, 2015.
-  S. Agarwal, K. Mierle, and Others, “Ceres Solver,” http://ceres-solver.org, 2015.
-  D. Gálvez-López and J. D. Tardos, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, 2012.
-  H. Strasdat, “Local accuracy and global consistency for efficient visual slam,” Ph.D. dissertation, Citeseer, 2012.
-  Research group of Prof. Kostas Daniilidis, “Monocular MSCKF ROS node,” https://github.com/daniilidis-group/msckf˙mono, 2018.
-  S. Rahman, A. Quattrini Li, and I. Rekleitis, “A modular sensor suite for underwater reconstruction,” in MTS/IEEE Oceans Charleston, 2018, p. (accepted).
-  S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 4, pp. 376–380, 1991.
-  J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguere, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, and J.-P. Lobos, “Enabling autonomous capabilities in underwater robotics,” in Proc. IROS, 2008, pp. 3628–3634.