I Introduction
In this paper, we propose a novel approach that combines all or a subset of several different sensors – acoustic (sonar range), visual (stereo camera), inertial (linear accelerations and angular velocities), and depth data – to estimate the trajectory of the combined sensors and to create a reconstruction of an underwater environment. Visual-inertial state estimation is in general a challenging problem that many researchers have addressed, providing several different solutions [1, 2, 3, 4, 5]. The underwater environment – e.g., see Fig. 1 – presents unique challenges to vision-based state estimation, as shown in a previous study [6]. The introduction of additional sensors, such as IMU [2] and sonar [7], has demonstrated promising results. In our earlier work SVIn [7], acoustic, visual, and inertial data were fused together to map different underwater structures by augmenting the visual-inertial state estimation package OKVIS [2]. In this paper, we extend this work with an image enhancement technique targeted to the underwater domain, the fusion of depth measurements, loop-closure capabilities, and a more robust initialization.
To validate our proposed approach, first, we assess the performance of the proposed loop-closing method by comparing it to other state-of-the-art systems on the EuRoC micro aerial vehicle public dataset [8], disabling the fusion of sonar and depth measurements in our system. Second, we test the full system on several different datasets in a diverse set of conditions. More specifically, underwater data – consisting of visual, inertial, depth, and acoustic measurements – has been collected using a custom-made sensor suite in different locales; furthermore, data collected by an AQUA2 underwater vehicle [9] includes visual, inertial, and depth measurements. The results on the underwater datasets illustrate the loss of tracking and failure to maintain consistent scale of other state-of-the-art systems, while our proposed methodology maintains correct scale without diverging.
The next section discusses related work. Section III presents the mathematical formulation of the proposed system and describes the approach developed for image preprocessing, pose initialization, loop closure, and relocalization. Section IV presents results from a publicly available aerial dataset and a diverse set of challenging underwater environments. We conclude this paper with a discussion on lessons learned and directions for future work.
II Related Work
The literature presents many vision-based state estimation techniques, which use either monocular or stereo cameras and are direct or indirect methods, including, for example, MonoSLAM [10], PTAM [11], LSD-SLAM [12, 13], SVO [14], and ORB-SLAM [15]. In the following, we highlight some of the state estimation systems that use visual-inertial measurements.
To improve the pose estimate, vision-based state estimation techniques have been augmented with Inertial Measurement Unit (IMU) sensors, whose data are fused together with visual information. One class of approaches is based on the Kalman Filter, e.g., the Multi-State Constraint Kalman Filter (MSCKF) [4] and its stereo extension [5], ROVIO [16], and REBiVO [17]. The other class of methods optimizes the sensor states, possibly within a window, formulating the problem as a graph optimization problem. The optimization function typically includes an IMU error term and a reprojection error, as in OKVIS [2]. Delmerico and Scaramuzza [18] performed a comprehensive comparison, specifically monitoring resource usage by the different methods.
Loop closure – the capability of recognizing a place that was seen before – is an important component to mitigate the drift of the state estimate. ORB-SLAM [15] and its extension with IMU [1] use bag-of-words for loop closure and relocalization. A similar approach is used by VINS-Mono [3], which is based on a tightly-coupled optimization framework that fuses monocular and inertial sensor data.
Note that all visual-inertial state estimation systems require a proper initialization. VINS-Mono [3] uses a loosely-coupled sensor fusion method to align monocular vision with inertial measurements for estimator initialization. ORB-SLAM with IMU [1] performs initialization by first running monocular SLAM to estimate the pose while, at the same time, the IMU biases are also estimated.
Given the modularity of OKVIS for adding new sensors – we fused sonar data in previous work [7] – we extend OKVIS to include depth, loop-closure capabilities, and a more robust initialization to specifically target underwater environments.
III Proposed Method
This section describes the proposed system, depicted in Fig. 2. The full proposed state estimation system can operate with a robot that has a stereo camera, an IMU, sonar, and a depth sensor – the last two can also be disabled to operate as a visual-inertial system.
Due to the lack of natural light illumination and the presence of dynamic obstacles, it is hard to find good features to track. In addition to the underwater vision constraints, e.g., light and color attenuation, vision-based systems also suffer from low visibility and poor contrast. Hence, we augment the pipeline by adding an image preprocessing step, where contrast adjustment along with histogram equalization is applied to improve feature detection underwater. In particular, we use a contrast limited adaptive histogram equalization (CLAHE) filter in the image preprocessing step.
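To make the preprocessing step concrete, the following sketch applies a simplified, global contrast-limited histogram equalization to 8-bit grayscale pixel values. The actual pipeline uses the adaptive, tile-based variant (CLAHE); names and parameter values here are illustrative.

```python
def clip_limited_equalize(pixels, clip_limit=40):
    """Global contrast-limited histogram equalization for 8-bit grayscale
    pixels -- a simplified stand-in for CLAHE, which additionally operates
    on local tiles with bilinear interpolation between tile mappings."""
    # Histogram of intensities.
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    # Clip each bin and pool the excess (this is the contrast limiting).
    excess = 0
    for i in range(256):
        if hist[i] > clip_limit:
            excess += hist[i] - clip_limit
            hist[i] = clip_limit
    # Redistribute the clipped mass uniformly across all bins.
    bonus = excess // 256
    hist = [h + bonus for h in hist]
    # Cumulative distribution -> monotone intensity mapping.
    total = sum(hist)
    cdf, acc = [0] * 256, 0
    for i in range(256):
        acc += hist[i]
        cdf[i] = round(255 * acc / total)
    return [cdf[p] for p in pixels]

# Low-contrast image: all values squeezed into [100, 120].
flat = [100 + (i % 21) for i in range(1000)]
stretched = clip_limited_equalize(flat)
```

After equalization the intensity range is stretched, which is what makes more features detectable in hazy, low-contrast underwater imagery.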
In the following, after defining the state, we describe the proposed initialization, sensor fusion optimization, loop closure and relocalization steps.
III-A Notations and States
The full sensor suite is composed of the following coordinate frames: camera, IMU, acoustic (sonar), depth, and world, denoted as $C$, $I$, $S$, $D$, and $W$, respectively. The transformation between two arbitrary coordinate frames $A$ and $B$ is represented by a homogeneous transformation matrix $T_{AB}$, where the rotation matrix is $R_{AB}$ and the corresponding quaternion is $q_{AB}$.
Given these notations, we define the robot state that the system is estimating as:
$x_R = [\,{}_{W}p^{T} \;\; q_{WI}^{T} \;\; {}_{W}v^{T} \;\; b_{g}^{T} \;\; b_{a}^{T}\,]^{T}$ (1)
which contains the position ${}_{W}p$, the attitude represented by the quaternion $q_{WI}$, and the linear velocity ${}_{W}v$, all expressed for the IMU reference frame with respect to the world coordinate frame $W$; moreover, the state vector contains the gyroscope and accelerometer biases $b_g$ and $b_a$. The associated error-state vector is defined in minimal coordinates, while the perturbation takes place in the tangent space:
$\delta\chi_R = [\,\delta p^{T} \;\; \delta\alpha^{T} \;\; \delta v^{T} \;\; \delta b_{g}^{T} \;\; \delta b_{a}^{T}\,]^{T}$ (2)
which represents the error for each component of the state vector, with a transformation between tangent space and minimal coordinates [19].
III-B Initialization: Scale Refinement, Velocity and Gravity Approximation
A robust and accurate initialization is required for the success of tightly-coupled non-linear systems, as described in [1] and [3]. For underwater deployments, this becomes even more important, as vision is often occluded, e.g., by fish, and is negatively affected by the lack of features to track. Indeed, in our comparative study of visual-inertial state estimation systems [20] on underwater datasets, most of the state-of-the-art systems either fail to initialize or initialize incorrectly, resulting in divergence. Hence, we propose a robust initialization method for underwater state estimation, ensuring that: 1) the system only initializes when a minimum number of visual features are available to track (a threshold that worked well in our experiments); and 2) the initial scale from the stereo vision has been refined by aligning IMU preintegration values with the pose estimate from stereo matching, using a relative camera motion method with RANSAC and the depth measurements from the pressure sensor.
The combined position estimate from the stereo camera and the depth sensor, transformed from their respective coordinate frames to the IMU frame $I$ with the refined scale factor $s$, can be found as follows:
(3) 
The coordinate system of the depth sensor is aligned with the world frame $W$; as such, depth values correspond to values along the $z$ axis of $W$. For initialization, no acoustic measurements are used because of the sonar extrinsics – as described in [7] – which would not allow any match between acoustic and visual features if the robot is not moving. In addition to refining the scale, we also approximate the initial velocity and gravity vector, similarly to the method described in [3].
The pose prediction from the integration of the IMU measurements, following OKVIS [2], with conditional covariance can be written as:
(4) 
where the IMU preintegration terms define the motion between two consecutive keyframes $i$ and $i{+}1$ over the corresponding time interval and can be obtained purely from the IMU measurements. In particular:
(5) 
Substituting Eq. (3) into Eq. (5), we can estimate the initial values by solving the linear least-squares problem in the following form:
(6) 
where the combined position estimate from the stereo camera and depth sensor enters the measurement vector. The reason for using the depth measurement in the scale refinement is to introduce an additional constraint for position correction, along with stereo.
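The scale-refinement idea above can be illustrated with a stripped-down sketch: given corresponding keyframe-to-keyframe translations from the up-to-scale stereo/depth estimate and from IMU preintegration, the scale follows from a scalar least-squares fit. This is a simplification of Eq. (6), which jointly solves for scale, initial velocities, and gravity; function and variable names are hypothetical.

```python
def refine_scale(cam_deltas, imu_deltas):
    """Least-squares scale s minimizing sum ||s * dp_cam - dp_imu||^2
    over corresponding keyframe-to-keyframe translations.
    Simplification: the full initialization of Eq. (6) also estimates
    initial velocities and the gravity vector in the same system."""
    num = sum(c * m for dc, dm in zip(cam_deltas, imu_deltas)
              for c, m in zip(dc, dm))
    den = sum(c * c for dc in cam_deltas for c in dc)
    return num / den

# Camera translations are consistently half of the metric IMU motion,
# so the refined scale should come out close to 2.
cam = [(0.5, 0.0, 0.1), (0.4, 0.2, 0.0)]
imu = [(1.0, 0.0, 0.2), (0.8, 0.4, 0.0)]
s = refine_scale(cam, imu)
```

The closed form comes from setting the derivative of the quadratic cost in s to zero; in the full system the same residuals additionally constrain velocity and gravity.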
III-C Tightly-coupled Non-Linear Optimization with Sonar, Visual, Inertial, and Depth Sensors
For the tightly-coupled non-linear optimization, we use the following cost function $J(x)$, which includes the reprojection error and the IMU error, with the addition of the sonar error (see [7]) and the depth error:
(7)  
where $i$ denotes the camera index – i.e., left or right camera in a stereo camera system – with landmark index $j$ observed in the $k^{th}$ camera frame; the associated information matrices weight the visual landmark, IMU, sonar range, and depth measurements for the $k^{th}$ frame, respectively.
For completeness, we briefly discuss each error term. The reprojection error describes the difference between a keypoint measurement in the camera coordinate frame and the corresponding landmark projection according to the stereo projection model. The IMU error term combines all accelerometer and gyroscope measurements by IMU preintegration [21] between successive camera measurements and represents the pose, speed, and bias error between the prediction based on the previous state and the current state. Both the reprojection error and the IMU error term follow the formulation by Leutenegger et al. [2].
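As a minimal illustration of the visual term, the sketch below computes a reprojection residual under a pinhole model for a landmark already expressed in the camera frame. The actual system uses the stereo projection model and weights the residual by its information matrix; all names here are illustrative.

```python
def reprojection_error(landmark_cam, keypoint, fx, fy, cx, cy):
    """Residual between an observed keypoint and the pinhole projection
    of a landmark expressed in the camera frame (focal lengths fx, fy,
    principal point cx, cy)."""
    X, Y, Z = landmark_cam
    u = fx * X / Z + cx   # perspective projection onto the image plane
    v = fy * Y / Z + cy
    return (keypoint[0] - u, keypoint[1] - v)

# A landmark 2 m straight ahead should project to the principal point,
# so observing it there yields a zero residual.
err = reprojection_error((0.0, 0.0, 2.0), (320.0, 240.0),
                         fx=400.0, fy=400.0, cx=320.0, cy=240.0)
```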
The concept behind the sonar range error, introduced in our previous work [7], is that, if the sonar detects an obstacle at some distance, it is likely that visual features lie on the surface of that obstacle, and thus at approximately the same distance. The step involves computing a visual patch detected in close proximity to each sonar point, to introduce an extra constraint using the distance of the sonar point to the patch. Here, we assume that the visual-feature-based patch is small enough to be approximately coplanar with the sonar point. As such, the error term – the difference between those two distances – is used to correct the position of the robot state, with a normal conditional probability density function $f$ and a conditional covariance updated iteratively as new sensor measurements are integrated:
(8)
The information matrix is:
(9) 
where the range measurement is expressed in the sonar reference frame $S$. The Jacobian can be derived by differentiating the expected range measurement with respect to the robot pose:
(10) 
where the sonar landmark is represented in homogeneous coordinates and can be calculated by a simple geometric transformation in world coordinates, given the range and head position from the sonar measurements:
(11) 
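The sonar term described above can be sketched as follows: the residual is the difference between the measured range and the distance from the robot to the nearby visual-feature patch, which is assumed approximately coplanar with the sonar return. For brevity the patch is reduced to its centroid; names are hypothetical.

```python
import math

def sonar_range_error(robot_pos, patch_points, measured_range):
    """Difference between the sonar range and the distance from the
    robot to the centroid of the visual-feature patch detected near
    the sonar return (assumed to lie on the same surface)."""
    n = len(patch_points)
    centroid = tuple(sum(p[i] for p in patch_points) / n
                     for i in range(3))
    return measured_range - math.dist(robot_pos, centroid)

# Features clustered on a wall about 2 m away; sonar reports 2.1 m,
# so the residual nudges the robot position estimate by ~0.1 m.
e = sonar_range_error((0.0, 0.0, 0.0),
                      [(2.0, 0.1, 0.0), (2.0, -0.1, 0.0), (2.0, 0.0, 0.0)],
                      2.1)
```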
The depth error term, introduced in this paper, is calculated as the difference between the robot position along the $z$ direction (ENU convention) and the water depth measurement provided by a pressure sensor; it corrects the position of the robot along the $z$ axis. The error term, with a normal conditional probability density function $f$ and the conditional covariance, can be defined as:
(12) 
where $d_k$ and $d_0$ denote the depth measurements in the $k^{th}$ frame and the first frame, respectively, in the depth sensor coordinate system $D$. The information matrix calculation follows a similar approach as for the sonar, and the Jacobian is straightforward to derive:
(13) 
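A minimal sketch of the depth residual described above: the sign convention (pressure depth increasing downward while the world $z$ axis points up, per ENU) is an assumption here, and the names are hypothetical.

```python
def depth_error(robot_z_world, depth_k, depth_0):
    """Depth residual: compare the estimated z position in the world
    frame against the pressure-sensor depth change since the first
    frame. Assumes depth grows downward while world z points up."""
    return robot_z_world + (depth_k - depth_0)

# Robot believed to be 1.0 m below its start, but the pressure sensor
# indicates it descended 1.2 m: the residual is 0.2 m, pulling the
# estimated z toward -1.2.
e = depth_error(-1.0, 11.2, 10.0)
```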
All the error terms are added in the Ceres nonlinear optimization framework [22] to estimate the robot state.
III-D Loop-closing and Relocalization
In a sliding-window and marginalization based optimization method, drift accumulates over time in the pose estimate. A global optimization and relocalization scheme is necessary to eliminate this drift and to achieve global consistency. We adapt DBoW2 [23], a bag-of-binary-words place recognition module, and augment OKVIS with loop detection and relocalization capabilities.
A pose-graph is maintained to represent the connections between keyframes: a node represents a keyframe, and an edge between two keyframes exists if the ratio of matched keypoints between them is above a threshold, resulting in a very sparse graph. With each new keyframe in the pose-graph, the loop-closing module searches for candidates in the bag-of-words database. While querying the place recognition database for loop detection, the database only returns candidates having a higher score than the keyframes connected to the query node in the pose-graph. If a loop is detected, the candidate with the highest score is retained, and feature correspondences between the current keyframe in the local window and the loop-candidate keyframe are obtained to establish a connection between them; the pose-graph is then updated with the loop information. For geometric validation, 2D-2D descriptor matching and 3D-2D matching with known landmark positions are performed, with outlier rejection by PnP RANSAC. Though BRISK features are used for local tracking in OKVIS, we use BRIEF descriptors in DBoW2 and to compute feature correspondences by descriptor matching; the reason is that, in our combined sonar-visual-inertial system, gravity renders two rotational DoF observable, so we do not need to rely on rotation-invariant features.
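The candidate-selection rule above (only keyframes scoring higher than every keyframe already connected to the query are considered, and the best one is retained) can be sketched as follows; this is a simplified stand-in for the DBoW2 query, with hypothetical names.

```python
def best_loop_candidate(bow_scores, connected):
    """Pick the loop-closure candidate with the highest place-recognition
    score among keyframes that (a) are not already connected to the query
    in the pose-graph and (b) score higher than every connected keyframe."""
    # Scores of the query's pose-graph neighbors set the acceptance floor.
    floor = max((bow_scores.get(k, 0.0) for k in connected), default=0.0)
    eligible = {k: s for k, s in bow_scores.items()
                if k not in connected and s > floor}
    return max(eligible, key=eligible.get) if eligible else None

scores = {"kf_3": 0.21, "kf_40": 0.62, "kf_41": 0.55, "kf_2": 0.18}
loop = best_loop_candidate(scores, connected={"kf_2", "kf_3"})
```

Here the spatially distant keyframe "kf_40" wins because it scores above both neighbors of the query, which is the signal that the place has genuinely been revisited.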
When a loop is detected, the global relocalization module aligns the pose of the current keyframe in the local window with the pose of the loop keyframe in the pose-graph by sending the pose drift back to the windowed sonar-visual-inertial-depth optimization thread. In addition, an optimization step similar to Eq. (7) is performed using only the matched landmarks to compute the sonar error and reprojection error terms. After loop detection, a 4-DoF (x, y, z position and yaw) pose-graph optimization takes place to correct the loop by minimizing the cost function:
(14) 
where the information matrix is set to the identity, as in [24]. The error term of an edge between keyframes $i$ and $j$ compares the relative position and yaw of the two keyframe poses, which are obtained from the local sonar-visual-inertial optimization.
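A sketch of the 4-DoF edge residual used in such a pose-graph optimization, assuming poses parameterized as (x, y, z, yaw) with roll and pitch held fixed (they are rendered observable by gravity); this is a simplified version with hypothetical names.

```python
import math

def edge_error(pose_i, pose_j, rel_measured):
    """4-DoF pose-graph edge residual: predict keyframe j from keyframe i
    and the measured relative (dx, dy, dz, dyaw), then compare against the
    current estimate of j. Poses are (x, y, z, yaw)."""
    xi, yi, zi, psi_i = pose_i
    dx, dy, dz, dpsi = rel_measured
    # Rotate the measured relative translation into the world frame
    # using only the yaw of keyframe i.
    c, s = math.cos(psi_i), math.sin(psi_i)
    pred = (xi + c * dx - s * dy,
            yi + s * dx + c * dy,
            zi + dz,
            psi_i + dpsi)
    xj, yj, zj, psi_j = pose_j
    # Wrap the yaw residual into (-pi, pi].
    dyaw = (psi_j - pred[3] + math.pi) % (2 * math.pi) - math.pi
    return (xj - pred[0], yj - pred[1], zj - pred[2], dyaw)

# A consistent edge produces a (near-)zero residual.
perfect = edge_error((0.0, 0.0, 0.0, 0.0), (1.0, 0.0, 0.5, 0.0),
                     (1.0, 0.0, 0.5, 0.0))
```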
IV Experimental Results
The proposed state estimation system is first quantitatively validated on a standard dataset, to ensure that loop closure and the initialization also work above water. Moreover, it is compared to other state-of-the-art methods, i.e., VINS-Mono [3], the basic OKVIS [2], and the MSCKF [4] implementation from the GRASP lab [25]. Second, we qualitatively test it on several different datasets collected using a custom-made sensor suite [26] and an Aqua2 AUV [9].
IV-A Validation on Standard Dataset
Here, we present results on the EuRoC dataset [8], one of the benchmark datasets used by many visual-inertial state estimation systems, including OKVIS (stereo), VINS-Mono, and MSCKF. To compare performance, we disabled depth and sonar integration in our method and only validate the loop-closure scheme.
An alignment is performed between the ground truth and the estimated trajectory, by minimizing the least mean squared error between temporally close estimated and ground-truth locations, varying rotation, translation, and scaling, according to the method from [27]. The resulting metric is the Root Mean Square Error (RMSE) of the translation, shown in Table I for several Machine Hall sequences in the EuRoC dataset. Our method shows reduced RMSE in every sequence compared to OKVIS, validating the improvement of pose estimation after loop closing. SVIn2 also has lower RMSE than MSCKF, and slightly higher, but comparable, RMSE than VINS-Mono. Fig. 3 shows the trajectories for each method together with the ground truth for the Machine Hall 04 Difficult sequence.
Table I: Translation RMSE (m) on the EuRoC Machine Hall sequences.

Sequence   SVIn2   OKVIS   VINS-Mono   MSCKF
MH 01      0.08    0.15    0.03        0.22
MH 02      0.09    0.12    0.05        0.24
MH 03      0.09    0.13    0.06        0.24
MH 04      0.13    0.18    0.11        0.40
MH 05      0.20    0.24    0.07        0.41
IV-B Underwater Datasets
Our proposed state estimation system – SVIn2 – is targeted at underwater environments, where sonar and depth can be fused together with the visual-inertial data. Here, we show results from four different datasets in three different underwater environments. First, a sunken bus in Fantasy Lake (NC), where data was collected by a diver with a custom-made underwater sensor suite. The diver started from outside the bus, entered it through the front door, exited from the back door, and performed a loop around the bus. The images are affected by haze and low visibility. Second and third, data from an underwater cavern in Ginnie Springs (FL) was collected, again by a diver with the same sensor suite as for the sunken bus. The diver performed several loops inside the cavern – around one spot in the second dataset, Cavern1, and two spots in the third dataset, Cavern2. The environment is affected by the complete absence of natural light. Fourth, an AUV – the AQUA2 robot – collected data over an underwater cemetery in Lake Jocassee (SC), performing several loops around the tombstones in a square pattern. The visibility, as well as brightness and contrast, was very low. For the underwater datasets, it is a challenge to obtain any ground truth, as they are GPS-denied, unstructured environments. As such, the evaluation is qualitative, with a rough estimate of the size of the environment measured beforehand by the divers collecting the data.
Figs. 5–8 show the trajectories from SVIn2, OKVIS, and VINS-Mono on the datasets just described. All trajectories are plotted keeping the original scale produced by each package.
Fig. 5 shows the results for the submerged bus dataset. VINS-Mono lost track when the image exposure increased for quite some time; it tried to re-initialize, but was not able to track successfully. Even with histogram equalization or a contrast adjusted histogram equalization filter, VINS-Mono was not able to track. OKVIS was able to track – albeit with scale drift – using a contrast adjusted histogram equalization filter in the image preprocessing step; without the filter, it lost track at the high-exposure location. Our method was able to track, detect, and correct the loop successfully.
In Cavern1 – see Fig. 6 – VINS-Mono tracked successfully for the whole duration. However, as can be noticed in Fig. 6, the scale was incorrect, based on empirical observations during data collection. OKVIS, instead, produced a good trajectory, and SVIn2 was additionally able to detect and close the loops.
In Cavern2 (Fig. 7), VINS-Mono lost track at the beginning, re-initialized, was able to track for some time, and detected a loop before losing track again. VINS-Mono behaved similarly even when the images were preprocessed with different filters. OKVIS tracked well, but as drift accumulated over time, it was not able to join the current pose with a previous pose where a loop was expected. SVIn2 was able to track and to reduce the drift in the trajectory with successful loop closures.
In the cemetery dataset – Fig. 8 – both VINS-Mono and OKVIS were able to track, but VINS-Mono was not able to reduce the drift in the trajectory, while SVIn2 was able to detect and correct the loops.
V Conclusions
In this paper, we presented SVIn2, a state estimation system with robust initialization, sensor fusion of depth, sonar, visual, and inertial data, and loop-closure capabilities. While the proposed system can also work out of the water, by disabling the sensors that are not applicable, it is specifically targeted at underwater environments. Experimental results on a standard benchmark dataset and on several underwater datasets demonstrate excellent performance.
Utilizing the insights gained from implementing the proposed approach, an online adaptation of the discussed framework for the limited computational resources of the AQUA2 AUV [9] is currently under consideration; see Fig. 4. It is worth noting that maintaining the proper attitude of the traversed trajectory and providing an estimate of the distance traveled will greatly enhance the autonomous capabilities of the vehicle [28]. Furthermore, accurately modeling the surrounding structures would enable AQUA2, as well as other vision-based underwater vehicles, to operate near, and through, a variety of underwater structures, such as caves, shipwrecks, and canyons.
Acknowledgment
The authors would like to thank the National Science Foundation for its support (NSF 1513203, 1637876).
References
 [1] R. Mur-Artal and J. D. Tardós, “Visual-inertial monocular SLAM with map reuse,” IEEE Robot. Autom. Lett., vol. 2, no. 2, pp. 796–803, 2017.
 [2] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” Int. J. Robot. Res., vol. 34, no. 3, pp. 314–334, 2015. [Online]. Available: http://ijr.sagepub.com/content/34/3/314.abstract
 [3] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, 2018.
 [4] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in Proc. ICRA, 2007, pp. 3565–3572.
 [5] K. Sun, K. Mohta, B. Pfrommer, M. Watterson, S. Liu, Y. Mulgaonkar, C. J. Taylor, and V. Kumar, “Robust stereo visual inertial odometry for fast autonomous flight,” IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 965–972, 2018.
 [6] A. Quattrini Li, A. Coskun, S. M. Doherty, S. Ghasemlou, A. S. Jagtap, M. Modasshir, S. Rahman, A. Singh, M. Xanthidis, J. M. O’Kane, and I. Rekleitis, “Experimental comparison of open source vision based state estimation algorithms,” in Proc. ISER, 2016.
 [7] S. Rahman, A. Quattrini Li, and I. Rekleitis, “Sonar visual inertial SLAM of underwater structures,” in Proc. ICRA, 2018.
 [8] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart, “The EuRoC micro aerial vehicle datasets,” Int. J. Robot. Res., vol. 35, no. 10, pp. 1157–1163, 2016.
 [9] G. Dudek, M. Jenkin, C. Prahacs, A. Hogue, J. Sattar, P. Giguere, A. German, H. Liu, S. Saunderson, A. Ripsman, S. Simhon, L. A. Torres-Mendez, E. Milios, P. Zhang, and I. Rekleitis, “A visually guided swimming robot,” in Proc. IROS, 2005, pp. 1749–1754.
 [10] J. Civera, O. G. Grasa, A. J. Davison, and J. M. M. Montiel, “1-Point RANSAC for extended Kalman filtering: Application to real-time structure from motion and visual odometry,” J. Field Robot., vol. 27, no. 5, pp. 609–631, 2010.
 [11] G. Klein and D. Murray, “Parallel tracking and mapping for small AR workspaces,” in IEEE and ACM Int. Symp. on Mixed and Augmented Reality, 2007, pp. 225–234.
 [12] J. Engel, T. Schöps, and D. Cremers, “LSD-SLAM: Large-scale direct monocular SLAM,” in Proc. ECCV, ser. Lecture Notes in Computer Science, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, Eds. Springer Int. Publishing, 2014, vol. 8690, pp. 834–849.
 [13] J. Engel, J. Stückler, and D. Cremers, “Large-scale direct SLAM with stereo cameras,” in Proc. IROS, 2015, pp. 1935–1942.
 [14] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, “SVO: Semidirect visual odometry for monocular and multicamera systems,” IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, 2017.
 [15] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, 2015.
 [16] M. Bloesch, M. Burri, S. Omari, M. Hutter, and R. Siegwart, “Iterated extended Kalman filter based visual-inertial odometry using direct photometric feedback,” Int. J. Robot. Res., vol. 36, pp. 1053–1072, 2017.
 [17] J. J. Tarrio and S. Pedre, “Real-time edge based visual inertial odometry for MAV teleoperation in indoor environments,” J. Intell. Robot. Syst., pp. 235–252, 2017.
 [18] J. Delmerico and D. Scaramuzza, “A benchmark comparison of monocular visual-inertial odometry algorithms for flying robots,” in Proc. ICRA, 2018.
 [19] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “On-manifold preintegration for real-time visual-inertial odometry,” IEEE Trans. Robot., vol. 33, no. 1, pp. 1–21, 2017.
 [20] B. Joshi, B. Cain, J. Johnson, M. Kalitazkis, S. Rahman, M. Xanthidis, A. Hernandez, A. Quattrini Li, N. Vitzilaios, and I. Rekleitis, “Experimental comparison of open source vision-inertial-based state estimation algorithms,” IEEE Robot. Autom. Lett., 2018 (under review).
 [21] C. Forster, L. Carlone, F. Dellaert, and D. Scaramuzza, “IMU preintegration on manifold for efficient visual-inertial maximum-a-posteriori estimation,” Georgia Institute of Technology, 2015.
 [22] S. Agarwal, K. Mierle, and Others, “Ceres Solver,” http://ceres-solver.org, 2015.
 [23] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, 2012.
 [24] H. Strasdat, “Local accuracy and global consistency for efficient visual SLAM,” Ph.D. dissertation, Citeseer, 2012.
 [25] Research group of Prof. Kostas Daniilidis, “Monocular MSCKF ROS node,” https://github.com/daniilidis-group/msckf_mono, 2018.
 [26] S. Rahman, A. Quattrini Li, and I. Rekleitis, “A modular sensor suite for underwater reconstruction,” in MTS/IEEE Oceans Charleston, 2018 (accepted).
 [27] S. Umeyama, “Least-squares estimation of transformation parameters between two point patterns,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 13, no. 4, pp. 376–380, 1991.
 [28] J. Sattar, G. Dudek, O. Chiu, I. Rekleitis, P. Giguere, A. Mills, N. Plamondon, C. Prahacs, Y. Girdhar, M. Nahon, and J.-P. Lobos, “Enabling autonomous capabilities in underwater robotics,” in Proc. IROS, 2008, pp. 3628–3634.