DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization

In this paper, we propose a real-time deep-learning approach for determining the 6D relative pose of Autonomous Underwater Vehicles (AUV) from a single image. A team of autonomous robots localizing themselves, in a communication-constrained underwater environment, is essential for many applications such as underwater exploration, mapping, multi-robot convoying, and other multi-robot tasks. Due to the profound difficulty of collecting ground truth images with accurate 6D poses underwater, this work utilizes rendered images from the Unreal Game Engine simulation for training. An image translation network is employed to bridge the gap between the rendered and the real images producing synthetic images for training. The proposed method predicts the 6D pose of an AUV from a single image as 2D image keypoints representing 8 corners of the 3D model of the AUV, and then the 6D pose in the camera coordinates is determined using RANSAC-based PnP. Experimental results in underwater environments (swimming pool and ocean) with different cameras demonstrate the robustness of the proposed technique, where the trained system decreased translation error by 75.5


Code Repositories


Source Code for "DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization", submitted to IROS 2020


I Introduction

The ability to localize is crucial to many robotic applications. There are several environments where keeping track of the vehicle's position is a challenging task, particularly GPS-denied environments with limited features. A common approach to address the localization problem is to use inter-robot measurements for improved positional accuracy, an approach termed Cooperative Localization (CL) [37]. Central to CL is the ability to estimate the relative pose between the two robots; this estimate can then be utilized to improve the absolute localization based on the global pose estimates for one of the two robots.

In this paper, we propose and evaluate a deep pose estimation framework for underwater relative localization, called DeepURL. The primary application motivating this work is underwater exploration and mapping by a team of autonomous underwater vehicles (AUVs), shown in Fig. 1, with a focus on shipwreck and underwater cave mapping; these environments are challenging for most existing localization methodologies (e.g., visual and visual/inertial-based systems [30, 13]). Other applications include convoying [42], environmental assessments [24], and inspections.

Fig. 1: Two Aqua2 vehicles collecting images over a reef require relative localization to efficiently cover the area.

The proposed methodology draws from the rich object detection research and is adapted to the unique conditions of the underwater domain. Traditionally, 6D pose estimation (3D position and 3D orientation) is performed by matching feature points between 3D models and images [21, 39, 4, 47]. While these methods are robust when objects are well textured, they perform poorly when objects are featureless or textureless. In the underwater domain, particulates in the water generate undesired texture smoothing. Recent approaches [14, 44, 49, 31, 45, 11] that estimate 6D poses using deep neural networks perform well on standard benchmark pose estimation datasets such as LINEMOD [9], Occluded-LINEMOD [16], and YCB-Video [49], but they require either intensive manual annotation or a motion capture system. To the authors' knowledge, a readily applicable method for collecting underwater training data with the corresponding accurate 6D poses is not available. In this work, we focus on estimating the 6D pose of an Aqua2 vehicle [6]. The observer is either another Aqua2 robot or an underwater handheld camera. The proposed method utilizes the Unreal Engine 4 (UE4) [23] with a 3D model of the Aqua2 robot swimming, projected over underwater images, to generate training images with known poses for the pose estimation network. In addition, dissimilarity in the images hampers the performance of classical deep learning-based 6D pose estimation methods underwater; it arises from intrinsic factors, such as distortion differences between cameras, or from external factors, such as color loss, poor visibility, or even the surroundings. CycleGAN [51] was employed to transform UE4 rendered images, generating image sets that vary in appearance and resemble real-world underwater images, to be used in training.

Using a modified version of YOLOv3 [36] to detect an object bounding box, the proposed network produces robust 6D pose estimates by combining multiple local predictions of 2D keypoints that are projections of 3D corner points of the object. Only grid cells inside the detected bounding box contribute to the selection of 2D keypoints along with a confidence score. Using the predictions with confidence, the most dependable 2D keypoint candidates for each 3D keypoint are selected to yield a set of 2D-to-3D correspondences. These selected 2D keypoints are used in the RANSAC-based PnP [17] algorithm to obtain a robust 6D pose estimate.

The proposed framework has been tested in different environments – pool, ocean – and different platforms, including Aqua2 robot and GoPro cameras, demonstrating its robustness. The main contributions are as follows:

  • A 6D pose prediction network that predicts object bounding boxes and eight keypoints in image coordinates. These 2D keypoints are then used in 2D-to-3D correspondences to estimate the 6D pose.

  • We demonstrate the effective use of rendered image augmentation in 6D pose prediction, eliminating the need for ground truth labeling in real images. Utilizing image augmentation from the rendered to the underwater environment, the pose prediction network becomes invariant to color-loss, texture-smoothing, and other domain-specific challenges. (Traditionally, Generative Adversarial Networks (GAN) [51] use the term image translation for this operation; however, the term translation can be confusing for a robotics application.)

  • We publish a dataset of the Aqua2 robot captured in the ocean and swimming pool to further research in the underwater domain.

The next section reviews related work. Section III introduces the proposed method, including the rendered data generated for training and the pose estimation method. Section IV first presents the ground truth data acquisition, used exclusively for testing, and then discusses quantitative results on different datasets together with a comparison with other methods. Finally, we conclude the paper with future work in Section V.

Fig. 2: In training (red dash-dotted), the rendered images are translated to the synthetic images resembling Aqua2 swimming in a pool or a marine environment. The synthetic images are then fed to a common encoder, which is connected to two decoder streams: Detection Decoder (object detection) and Pose Regression Decoder (6D pose regression). Only in inference (violet dash-double-dotted), the predicted 2D keypoint projections of 8 corners of the 3D Aqua2 model are processed and utilized to obtain a 6D pose using the RANSAC-based PnP algorithm.

II Related Work

In this paper, we focus on 6D pose estimation using RGB images without access to a depth map; RGB-D based methods [1, 3, 43] are not applicable underwater given that infrared light attenuates quickly at a very short distance. The classical approach for 6D object pose estimation involves extracting local features from the input image and matching them with features from a 3D model to establish 2D-to-3D correspondences, from which a 6D pose can be obtained through the PnP algorithm [39, 28, 21, 47]. Previous work has shown a strong focus on making these local feature descriptors invariant to changes in scale, rotation, illumination, and viewpoints [22, 46]. Even though these feature-based techniques can handle occlusions and scene clutter, they require sufficient texture to compute local features. To deal with poorly-textured objects, some efforts were focused on learned feature descriptors using machine learning techniques [48, 5].

In recent years, pose estimation research has been dominated by frameworks utilizing deep neural networks. These methods can be broadly classified into two categories: either regressing directly to 6D pose estimates [14, 49], or predicting 2D projections of 3D keypoints on an image and then obtaining the pose via the PnP algorithm [31, 45]. Xiang et al. [49] estimate the object center in the image, with the distance of the center used for estimating the translation and the predicted quaternions for object rotations. Peng et al. [29] used a pixel-wise voting network to regress pixel-wise unit vectors pointing to the keypoints and used these vectors to vote for keypoint locations using RANSAC. Recent works [50, 8, 18] are focused on post-processing to refine the initial pose estimates from the first step. Li et al. [19] disentangled the pose to predict rotation and translation separately from two different branches to increase accuracy.

Recent approaches [10, 26] for pose estimation focus on local patches belonging to the object rather than producing a single global prediction. The work of [45] is closest to our approach in its use of local image patches; it also learns a semantic segmentation mask to select multiple keypoint locations from local patches belonging to an object and provides those inputs to a PnP algorithm. Regarding pose estimation using synthetic datasets, Rozantsev et al. [40] used a two-stream network trained on a synthetic and a real dataset, introducing loss functions that prevent corresponding weights of the two streams from being too different from each other. Rad et al. [32] learn a feature mapping from the real to the synthetic dataset and, during inference, transfer the features of real images to synthetic ones and infer the pose using synthetic features. Some work has been done using a deep learning framework for Aqua2 vehicle detection, which was utilized to enable visual servoing [41]. Koreitem et al. [15] used rendered images for pose estimation based visual tracking of Aqua2; our approach outperforms theirs in terms of 6D pose estimation accuracy.

In our work, we employ CycleGAN [51], a type of Generative Adversarial Network (GAN), to generate a synthetic dataset for training. GANs, introduced by Goodfellow et al. [7], are used to generate images through adversarial training, where a generator attempts to produce realistic images to fool a discriminator, which tries to distinguish if the image is real or generated. Some variants of GAN [12] have been used for paired image-to-image translation. In the absence of paired images, CycleGAN [51] can perform unpaired image-to-image translation. The main idea of CycleGAN is that if an image is translated from one domain to another and translated back, the resulting image should resemble the original image.

III The Proposed System

Figure 2 shows an overview of the proposed system. In the training process, UE4 renders a 3D model of Aqua2 with known 6D poses projected on top of underwater marine images. The feature space between real underwater and rendered images is aligned by transferring the rendered images to target domains (swimming pool and ocean) using CycleGAN [51], an image-to-image translation network.

The next stage consists of a Convolutional Neural Network (CNN) that predicts the 2D projections of the 8 corners of the object's (Aqua2) 3D model, similar to [31, 45], and an object detection bounding box. Even though [31, 45] divide an image into grid cells, they use single global estimates of 2D keypoints for the object with the highest confidence value. In our approach, each grid cell inside the bounding box predicts the 2D projections of keypoints along with their confidences, focusing on local regions belonging to the object. Then the predictions of all cells are combined based on the confidence scores using RANSAC-based PnP during 6D pose estimation.

III-A Domain Adaptation

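In the proposed approach, CycleGAN [51] is employed for unpaired image-to-image translation by learning functions to map the UE4 domain $X$ to the target domain $Y$. We use generators $G$ and $F$ to transfer domains: $G: X \rightarrow Y$ and $F: Y \rightarrow X$. Discriminator $D_X$ is designed to distinguish between rendered images in $X$ and augmented fake images $F(y)$. Discriminator $D_Y$ aims to separate target images in $Y$ from augmented fake images $G(x)$. To improve image-to-image translation in CycleGAN, cycle consistency is maintained, in addition to the adversarial loss, by ensuring that the reconstructed images satisfy $F(G(x)) \approx x$ and $G(F(y)) \approx y$. To calculate the adversarial loss, $G$ tries to generate $G(x)$, which is so similar to $y$ that it can fool the discriminator $D_Y$. Thus, the loss for $G$ and $D_Y$ is:

$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim P_Y}\left[\lVert \log D_Y(y) \rVert\right] + \mathbb{E}_{x \sim P_X}\left[\lVert \log\left(1 - D_Y(G(x))\right) \rVert\right] \tag{1}$$

where $P_X$ and $P_Y$ denote the data distributions in $X$ and $Y$ respectively, and $\lVert \cdot \rVert$ is the loss function, which is the L1-norm in our approach. Similarly, we derive $\mathcal{L}_{GAN}(F, D_X, Y, X)$ following Eq. (1). The cycle consistency loss is defined as

$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim P_X}\left[\lVert F(G(x)) - x \rVert\right] + \mathbb{E}_{y \sim P_Y}\left[\lVert G(F(y)) - y \rVert\right] \tag{2}$$

In our proposed method, there are two target domains: swimming pool, $Y_{pool}$, and open-water marine environment, $Y_{marine}$. Therefore, we train two instances of CycleGAN, thereby training two generators, $G_{pool}$ and $G_{marine}$. Fig. 3 shows the CycleGAN training overview along with synthetic data generation.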
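As an illustration of the cycle consistency idea, the following minimal sketch evaluates an L1 cycle loss on toy data. It assumes flat lists of pixel values in place of images and simple arithmetic functions in place of the trained generators; the names `cycle_consistency_loss`, `G`, `F_exact`, and `F_biased` are hypothetical and not taken from the released code.

```python
def l1_mean(a, b):
    """Mean absolute difference between two equally sized pixel lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def cycle_consistency_loss(G, F, xs, ys):
    """L1 cycle loss: E_x[|F(G(x)) - x|] + E_y[|G(F(y)) - y|]."""
    forward = sum(l1_mean(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1_mean(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


# Toy stand-ins for the generators (assumption: real generators are CNNs).
G = lambda img: [p + 0.2 for p in img]          # rendered -> target
F_exact = lambda img: [p - 0.2 for p in img]    # exact inverse of G
F_biased = lambda img: [p - 0.1 for p in img]   # imperfect inverse of G

rendered = [[0.1, 0.5], [0.3, 0.7]]   # samples from domain X
target = [[0.4, 0.9], [0.6, 0.8]]     # samples from domain Y

loss_exact = cycle_consistency_loss(G, F_exact, rendered, target)
loss_biased = cycle_consistency_loss(G, F_biased, rendered, target)
print(loss_exact, loss_biased)
```

When the two mappings invert each other, the loss vanishes; any systematic drift after a round trip is penalized in both directions.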

Fig. 3: (a) The CycleGAN learning process is shown. CycleGAN learns two mapping functions, $G: X \rightarrow Y$ and $F: Y \rightarrow X$, with two discriminators, $D_X$ and $D_Y$. (b) Using only generator $G$, we perform image-to-image translation of rendered images $x$ to target images $G(x)$.

III-B 6D Pose Estimation

The proposed network is based on an encoder-decoder architecture [36]. The network consists of an encoder, Darknet-53 [36], and two decoders: a Detection Decoder and a Pose Regression Decoder. The detection decoder detects objects with bounding boxes, and the pose regression decoder regresses to 2D corner keypoints of the 3D object model. The decoders predict the output as a 3D tensor with a spatial resolution of $S \times S$ and a dimension of $D_{det}$ and $D_{reg}$, respectively. The spatial resolution controls the size of an image patch that can effectively vote for object detection and for the 2D keypoint locations. The feature vectors are predicted at 3 different spatial resolutions. The detection decoder stream detects features at multiple scales via upsampling and concatenation, with a final layer depth of $D_{det}$. The pose regression stream has a similar architecture, but the depth of its final layer is maintained to be $D_{reg}$. Predicting at multiple spatial resolutions with upsampling helps to obtain semantic information at multiple scales using fine-grained features from early on in the network.

Object Detection Stream: The object detection stream is similar to the detection stream of YOLOv3 [36], which predicts object bounding boxes. For each grid cell at offset $(c_x, c_y)$ from the top left corner of the image, the network predicts 4 coordinates for each bounding box, $(t_x, t_y, t_w, t_h)$. Following [35], we use 9 anchor boxes, obtained by k-means clustering on the COCO dataset [20] and divided among three scales. The width and height are predicted as fractions of the anchor box priors $(p_w, p_h)$, and the actual bounding box values are obtained as

$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h} \tag{3}$$

where $\sigma(\cdot)$ represents the sigmoid function. The sum of squared errors

$$\mathcal{L}_{coord} = \sum_{k \in \{x, y, w, h\}} \left(\hat{t}_k - t_k\right)^2 \tag{4}$$

between the ground truth $\hat{t}_k$ and the coordinate prediction $t_k$ is used as the loss function. The ground truth values can be obtained by inverting Eq. (3). The object detection stream also predicts the objectness score of each bounding box by calculating its intersection over union with anchor boxes, and class prediction scores using independent logistic classifiers as in [36]. The total object detection loss is the sum of the coordinate prediction loss, the objectness score loss, and the class prediction loss. The total object detection loss was introduced by Redmon et al. [34], to which we refer for a complete description.
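The decoding of raw network outputs into a bounding box, and the inversion used to obtain regression targets from a ground truth box, can be sketched as follows. This is a minimal illustration assuming the standard YOLOv3 parameterization (sigmoid offsets for the center, exponentially scaled anchor priors for the size) with all quantities in grid-cell units; `decode_box` and `encode_box` are hypothetical helper names.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, prior):
    """Map raw outputs (tx, ty, tw, th) to a box (bx, by, bw, bh)."""
    tx, ty, tw, th = t
    cx, cy = cell      # top-left corner of the responsible grid cell
    pw, ph = prior     # anchor box width and height
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


def encode_box(b, cell, prior):
    """Invert decode_box to obtain regression targets for a ground truth box."""
    bx, by, bw, bh = b
    cx, cy = cell
    pw, ph = prior
    logit = lambda p: math.log(p / (1.0 - p))  # inverse of the sigmoid
    # The box center must lie strictly inside the cell for the logit to exist.
    return (logit(bx - cx), logit(by - cy),
            math.log(bw / pw), math.log(bh / ph))


cell, prior = (3, 4), (2.0, 1.5)
box = decode_box((0.0, 0.0, 0.0, 0.0), cell, prior)
targets = encode_box((3.25, 4.75, 3.0, 1.0), cell, prior)
round_trip = decode_box(targets, cell, prior)
print(box, round_trip)
```

Zero outputs place the box center in the middle of the cell with exactly the anchor size, which is why the anchors act as priors.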

Fig. 4: (a) The object detection stream predicts the bounding box and assigns each cell inside the box to the Aqua2 object. (b) The regression stream predicts the location of 8 bounding box corners as 2D keypoints from each grid cell.

Pose Regression Stream: The pose regression stream predicts the locations of the 2D projections of predefined 3D keypoints associated with the 3D object model of Aqua2. We use the 8 corner points of the model bounding box as keypoints. The pose regression stream predicts a 3D tensor of size $S \times S \times D_{reg}$. As we predict the spatial locations for the 8 keypoint projections along with their confidence values, $D_{reg} = 3 \times 8$.

We do not predict the 2D coordinates of the keypoints directly. Rather, we predict the offset of each keypoint from the corresponding grid cell, as in Fig. 4(b), in the following way: Let $c$ be the position of a grid cell from the top left image corner. For the $i^{th}$ keypoint, we predict the offset $f_i(c)$ from the grid cell, so that the actual location in image coordinates becomes $c + f_i(c)$, which should be close to the ground truth 2D location $g_i$. The residual is calculated as

$$\Delta_i(c) = c + f_i(c) - g_i \tag{5}$$

and we define the offset loss function, $\mathcal{L}_{off}$, for the spatial residual as

$$\mathcal{L}_{off} = \sum_{c \in B} \sum_{i=1}^{8} \lVert \Delta_i(c) \rVert_1 \tag{6}$$

where $B$ consists of the grid cells that fall inside the object bounding box and $\lVert \cdot \rVert_1$ represents the L1-norm loss function, which is less susceptible to outliers than the L2 loss. Using only grid cells that fall inside the object bounding box for 2D keypoint predictions focuses on image regions that truly belong to the object.
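The offset parameterization and the restriction of the loss to cells inside the bounding box can be illustrated with a small sketch. It assumes grid-cell units, a box given as `(x_min, y_min, x_max, y_max)`, and two keypoints instead of eight for brevity; the function names are hypothetical.

```python
def keypoint_locations(cell, offsets):
    """Absolute keypoint positions: cell position plus predicted offsets."""
    cx, cy = cell
    return [(cx + dx, cy + dy) for dx, dy in offsets]


def inside(cell, box):
    x_min, y_min, x_max, y_max = box
    return x_min <= cell[0] <= x_max and y_min <= cell[1] <= y_max


def offset_loss(predictions, ground_truth, box):
    """Sum of L1 residuals over keypoints, using only cells inside the box."""
    total = 0.0
    for cell, offsets in predictions.items():
        if not inside(cell, box):
            continue  # cells outside the detection box do not contribute
        for (px, py), (gx, gy) in zip(keypoint_locations(cell, offsets), ground_truth):
            total += abs(px - gx) + abs(py - gy)
    return total


ground_truth = [(2.0, 3.0), (4.0, 5.0)]            # true 2D keypoints
predictions = {
    (2, 3): [(0.0, 0.0), (2.0, 2.5)],              # inside the box
    (9, 9): [(0.0, 0.0), (0.0, 0.0)],              # outside, ignored
}
loss = offset_loss(predictions, ground_truth, box=(1, 2, 5, 6))
print(loss)
```

The cell at (2, 3) predicts the first keypoint exactly and misses the second by 0.5 along one axis, so only that residual enters the loss.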

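Apart from the 2D keypoint locations, the pose regression stream also calculates the confidence value $s_i(c)$ for each predicted point, which is obtained through the sigmoid function on the network output. The confidence value should be representative of the distance between the predicted keypoint and the ground truth values. A sharp exponential function of the 2D Euclidean distance between prediction and ground truth is used as confidence. Thus, the confidence loss is calculated as

$$\mathcal{L}_{conf} = \sum_{c} \sum_{i=1}^{8} \left\lVert s_i(c) - \exp\left(-\alpha \lVert \Delta_i(c) \rVert_2\right) \right\rVert_1 \tag{7}$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean distance or L2 loss and the parameter $\alpha$ defines the sharpness of the exponential function. The pose regression loss takes the form

$$\mathcal{L}_{pose} = \lambda_{off} \mathcal{L}_{off} + \lambda_{conf} \mathcal{L}_{conf} \tag{8}$$

For numerical stability, we downweight the confidence loss for cells that do not contain objects by setting $\lambda_{conf}$ to 0.1, as suggested in [34]. For the cells that include the object, $\lambda_{conf}$ is set to 5.0 and $\lambda_{off}$ to 1. Therefore, the total loss of the network is the sum of the object detection loss and the pose regression loss, $\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{pose}$.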

Fig. 5: Inference strategy for combining pose candidates. (a) Grid cells inside the detection box belonging to the Aqua2 object overlaid on the image. (b) Each grid cell predicts 2D locations for the corresponding 3D keypoints, shown as red dots. (c) For each keypoint, the 12 best candidates are selected based on the confidence scores. (d) Using 2D-to-3D correspondence pairs and running the RANSAC-based PnP algorithm yields an accurate pose estimate, as shown by the overlaid bounding box.

III-C Pose Refinement

After inference, the object detection stream of our network predicts the coordinate locations of the bounding boxes with their confidences and the class probabilities for each grid cell. Then, the class-specific confidence score is estimated for the object by multiplying the class probability and the confidence score. To select the best bounding box, we use non-max suppression [38] with an IOU threshold of 0.4 and a class-specific confidence score threshold of 0.3.
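A minimal sketch of this selection step is given below, assuming boxes in `(x1, y1, x2, y2)` corner format and a plain greedy non-max suppression; the thresholds are the ones stated above, while the function names and toy detections are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_thr=0.4, score_thr=0.3):
    """Return indices of kept boxes, highest score first."""
    order = sorted((i for i, s in enumerate(scores) if s >= score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 10)]
objectness = [0.95, 0.9, 0.8, 0.5]
class_prob = [0.95, 0.9, 0.9, 0.4]
# Class-specific confidence = class probability x objectness confidence.
scores = [o * p for o, p in zip(objectness, class_prob)]
kept = non_max_suppression(boxes, scores)
print(kept)
```

Here the second box is suppressed because it overlaps the best box with an IoU of about 0.68, and the last box is dropped by the score threshold before suppression starts.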

At the same time, the pose regression stream produces the projected 2D locations of the object's 3D bounding box, along with their confidence scores, for each grid cell. The 2D keypoint predictions for grid cells that fall outside the bounding box from the object detection stream are filtered out. In an ideal case, these 2D keypoints should cluster around the object center. Therefore, 2D keypoints that do not belong to a cluster are removed using a pixel distance threshold of 0.3 times the image width. The keypoints with confidence scores less than 0.5 are also filtered out. We employ RANSAC-based PnP [17] on the remaining predictions, which provide 2D-to-3D correspondences between the image keypoints and the object's 3D model. However, using all remaining predictions makes inference slow due to the optimization process. Empirically, we found that using the 12 most confident 2D predictions for each 3D keypoint produces an acceptable pose estimate with a good tradeoff between computation time and accuracy, as shown in Fig. 5.
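The filtering that precedes PnP can be sketched as follows. This is a simplified illustration, not the released implementation: it assumes each cell yields one `(x, y, confidence)` triple per keypoint, it omits the cluster-distance filter, and it stops before the PnP solve (for which a routine such as OpenCV's `cv2.solvePnPRansac` could be used). The name `select_candidates` and the toy values are hypothetical.

```python
def select_candidates(cell_predictions, box, num_keypoints=8,
                      min_conf=0.5, top_k=12):
    """Keep, per keypoint, the top_k most confident 2D predictions that come
    from cells inside the detection box and pass the confidence threshold."""
    x_min, y_min, x_max, y_max = box
    per_keypoint = {k: [] for k in range(num_keypoints)}
    for (cx, cy), keypoints in cell_predictions:
        if not (x_min <= cx <= x_max and y_min <= cy <= y_max):
            continue  # cell lies outside the detected bounding box
        for k, (x, y, conf) in enumerate(keypoints):
            if conf >= min_conf:
                per_keypoint[k].append((conf, (x, y)))
    # Sort by confidence (descending) and keep the best top_k per keypoint.
    return {k: [xy for _, xy in sorted(v, key=lambda c: c[0], reverse=True)[:top_k]]
            for k, v in per_keypoint.items()}


# Toy example with 2 keypoints and top_k=2 to keep the output short.
cells = [
    ((1, 1), [(10, 10, 0.90), (20, 20, 0.40)]),
    ((2, 1), [(11, 10, 0.80), (21, 20, 0.70)]),
    ((1, 2), [(12, 10, 0.60), (22, 20, 0.95)]),
    ((8, 8), [(50, 50, 0.99), (60, 60, 0.99)]),   # outside the box
]
candidates = select_candidates(cells, box=(0, 0, 3, 3), num_keypoints=2, top_k=2)
print(candidates)
```

Each retained 2D point would then be paired with the 3D corner of the same index, so a keypoint with several candidates contributes several correspondences and RANSAC can discard the inconsistent ones.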

III-D Implementation Details

To create the synthetic dataset, we train the CycleGAN following the training procedure of [51]. We let the training continue until it generated acceptable reconstructions. Once CycleGAN can reasonably reconstruct images for the target domain, we use the model weights of that epoch to translate all rendered images to synthetic images. Then, for training, the synthetic images are scaled to a fixed resolution, keeping the aspect ratio the same by padding with zeros. During inference, no image translation is required, and the real images are directly fed to the network.
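Aspect-preserving scaling with zero padding is often called letterboxing. The sketch below only computes the scale factor, the resized size, and the padding; the square target of 416 pixels and the centered placement are assumptions made for illustration, not values stated in the text, and `letterbox_params` is a hypothetical helper.

```python
def letterbox_params(width, height, target):
    """Scale so the longer side fits `target`, then pad the rest with zeros.

    Returns (scale, (new_w, new_h), (left, top, right, bottom)).
    """
    scale = min(target / width, target / height)
    new_w, new_h = round(width * scale), round(height * scale)
    pad_x, pad_y = target - new_w, target - new_h
    left, top = pad_x // 2, pad_y // 2          # assumption: centered image
    return scale, (new_w, new_h), (left, top, pad_x - left, pad_y - top)


# Assumed example: a 1920x1080 frame into a 416x416 network input.
scale, new_size, padding = letterbox_params(1920, 1080, target=416)
print(scale, new_size, padding)
```

Keypoints predicted on the padded image have to be mapped back by subtracting the left and top padding and dividing by the scale before they are used with the original camera intrinsics.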

The CNN is trained for 125 epochs on the synthetic dataset, and the first 3 epochs are part of the warmup phase, where the learning rate gradually increases from 0 to 1e-4. We utilized the SGD optimizer with a momentum of 0.9 and a piecewise decay to decrease the learning rate to 3e-5 and 1e-5 at 60 and 100 epochs, respectively. To avoid overfitting, minibatches of size 8 were produced by applying data augmentation techniques, i.e., randomly changing the hue, saturation, and exposure of the image by up to a factor of 1.5. In addition, images were also randomly scaled and affine transformed by up to 25% of the original image size.
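The learning rate schedule described above can be written as a small function. The sketch assumes a linear warm-up (the text only says the rate increases gradually) and accepts fractional epochs so it can be queried per iteration; `learning_rate` is a hypothetical name.

```python
def learning_rate(epoch, warmup_epochs=3, base_lr=1e-4):
    """Warm-up followed by piecewise-constant decay.

    epoch may be fractional (e.g. iteration / iterations_per_epoch).
    """
    if epoch < warmup_epochs:
        # Assumption: linear ramp from 0 to base_lr during warm-up.
        return base_lr * epoch / warmup_epochs
    if epoch < 60:
        return base_lr
    if epoch < 100:
        return 3e-5
    return 1e-5


schedule = [learning_rate(e) for e in (0, 1.5, 3, 59, 60, 99, 100, 124)]
print(schedule)
```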

IV Experiments

This section first describes the datasets used, and then the results of the inference with the real Aqua2 robot swimming in both pool and open-water marine deployments.

IV-A Datasets Description

Training - Rendered/Synthetic Dataset: contains images obtained by rendering an Aqua2 robot model swimming with flipper motion using UE4 and overlaying the resulting 3D model over random underwater images. Rather than just overlaying the 3D model of Aqua2, we simulate the flipper motion to generate images with the flippers in various realistic positions. This flipper motion makes the neural network independent of the flipper position. The synthetic dataset is obtained by the image translation network based on CycleGAN described in Section III-A to create photo-realistic images. The rendered dataset contains 37K images in which the depth and the roll, pitch, and yaw angles are each sampled at random from a bounded range.

Testing - Pool Dataset: We deployed two robots, one observing the other, with a vision-based 2D fiducial marker (AR tag) mounted on the top of the observed robot and used to estimate ground truth, during two pool trials in an indoor and an outdoor pool, as shown in Fig. 6(a-d). Approximately 11K images were collected with estimated localization, provided by the pose detection of the AR tag and the relative transformation of the mounted tag to the real robot. The dataset contains images with the distance between the two Aqua2 robots ranging from 0.5m to 3.5m.

Testing - Barbados 2017 Dataset: The Barbados 2017 Dataset consists of 188 real images collected during underwater field trials off the west coast of Barbados and used in [15]; see Fig. 6(i-l). The images are captured from an Aqua2 robot's onboard camera. The 6D pose of the robot in each of these images is obtained using a custom-built annotator, which allows the user to mark keypoints on the robot assigned from the CAD model. The annotator then iteratively fits a wireframe to the robot using its known dimensions.

Testing - Barbados GoPro Dataset: We collected images underwater in Barbados of an Aqua2 robot swimming over coral reefs using a GoPro camera. These images differ significantly from the images collected using another Aqua2 in terms of hue, image size, and aspect ratio; see Fig. 6(e-h). Given that no ground truth is available for these images, this dataset was used to evaluate the proposed method qualitatively.

Method               Translation Error   Orientation Error   REP-10px Accuracy   ADD-0.1d Accuracy   FPS
Tekin et al. [45]    0.278m              18.87°              9.33%               23.39%              54
DeepURL              0.068m              6.77°               25.22%              57.16%              40
TABLE I: Translation and orientation errors (the lower the better) along with REP-10px and ADD-0.1d accuracy (the higher the better) and runtime comparison for the pool dataset

IV-B Evaluation Metrics

To evaluate the pose estimation capability of the proposed system, we calculated the mean translation error as the Euclidean distance between the predicted and ground truth translation. Let $R$ and $t$ be the ground truth rotation matrix and translation, and $\hat{R}$ and $\hat{t}$ be the predicted ones. For individual angle errors in terms of yaw, roll, and pitch, we decomposed the ground truth $R$ and the predicted $\hat{R}$ into Euler angles and calculated their absolute difference. The total orientation error is represented as Eq. (9), where $tr$ represents the trace of the matrix and the orientation error is in the range $[0, \pi]$, as suggested in [27].

$$e_R = \arccos\left(\frac{tr\left(\hat{R} R^{T}\right) - 1}{2}\right) \tag{9}$$

To evaluate the pose accuracy, we use standard metrics, namely the 2D reprojection error [2] and the average 3D distance of the model vertices, referred to as the ADD metric [49, 31]. In the case of reprojection error, we consider the pose estimate as correct if the average distance between the 2D projections of 3D model points obtained using the predicted and ground-truth poses is below a 10 pixel threshold, referred to as REP-10px. Generally, a 5-pixel threshold is employed, but we consider a threshold of 10 pixels to account for uncertainties in ground truth due to the AR tag-based pose estimation. The ADD metric takes a pose estimate as correct if the mean distance between the coordinates of the 3D model vertices transformed by the estimated and ground truth poses falls below 10% of the model diameter, referred to as ADD-0.1d. We also report the inference time of the algorithm in terms of frames per second (FPS) on an RTX 2080 GPU.
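The translation error, the trace-based orientation error, and the ADD-0.1d test can be sketched with the standard library alone. The sketch assumes the common geodesic form arccos((tr(R_pred R_gt^T) - 1) / 2) for the orientation error and uses toy poses and vertices; the model diameter values are hypothetical, and the reprojection metric is omitted because it additionally needs the camera intrinsics.

```python
import math


def translation_error(t_gt, t_pred):
    return math.dist(t_gt, t_pred)


def orientation_error(R_gt, R_pred):
    """Geodesic angle in [0, pi] between two 3x3 rotation matrices."""
    # trace(R_pred @ R_gt^T) equals the sum of element-wise products.
    trace = sum(R_pred[i][j] * R_gt[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # guard rounding
    return math.acos(cos_angle)


def transform(R, t, p):
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))


def add_correct(R_gt, t_gt, R_pred, t_pred, vertices, diameter, frac=0.1):
    """ADD test: mean vertex distance below frac * model diameter."""
    mean_dist = sum(math.dist(transform(R_gt, t_gt, v), transform(R_pred, t_pred, v))
                    for v in vertices) / len(vertices)
    return mean_dist < frac * diameter


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_z_90 = [[0, -1, 0], [1, 0, 0], [0, 0, 1]]   # 90 degree yaw
vertices = [(0, 0, 0), (0.3, 0, 0), (0, 0.2, 0), (0, 0, 0.1)]

t_err = translation_error((0, 0, 1), (0, 0.03, 1.04))
r_err = orientation_error(identity, rot_z_90)
ok_large = add_correct(identity, (0, 0, 1), identity, (0, 0.03, 1.04), vertices, diameter=0.6)
ok_small = add_correct(identity, (0, 0, 1), identity, (0, 0.03, 1.04), vertices, diameter=0.4)
print(t_err, math.degrees(r_err), ok_large, ok_small)
```

The same 5 cm offset passes the test for the larger assumed diameter and fails for the smaller one, which shows how ADD-0.1d scales the tolerance with object size.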

IV-C Experimental Results

Fig. 6: Sample detections from the different datasets. The green square is the 2D detection box, while the red wireframe is the projection of the 3D bounding box of the robot. Top row: Pool dataset; the observed Aqua2 vehicle carries an AR tag to generate ground truth estimates, and the observing robot is another Aqua2. Middle row: GoPro footage during deployments in Barbados in January 2020; the observed robot has no additional components, and the observing camera is a GoPro 7 camera. Bottom row: Barbados 2017 dataset [15]; the observed robot is equipped with a USBL modem, and the observing robot is an Aqua2 vehicle.

Evaluation on the Pool Dataset: We compare our method with the state-of-the-art method of Tekin et al. [45], trained on a synthetic dataset and tested on the real pool dataset. Translation and rotation errors along with REP-10px and ADD-0.1d accuracy for the pool dataset are presented in Table I, as well as the runtime comparisons on an Nvidia RTX 2080. DeepURL achieves a translation error of 0.068m against 0.278m for Tekin et al. [45]. DeepURL also estimates the orientation within 6.77° error as opposed to 18.87° for Tekin et al. [45]. DeepURL also outperforms Tekin et al. [45] significantly on REP-10px and ADD-0.1d accuracy. The improved performance comes from two factors: the use of a better detection pipeline and the bounding box based keypoint sampling introduced in this paper. First, the method is highly dependent on the accurate and precise detection of the Aqua2. As known in the literature, YOLOv3 performs significantly better than YOLOv1, which is used in Tekin et al. [45]. In our experiments, we also saw that Tekin et al. [45] cannot detect the Aqua2 when it is further away from the camera. Second, the bounding box based keypoint sampling allowed the selection of more appropriate keypoints using RANSAC-based PnP, whereas Tekin et al. [45] only used the keypoints with the highest confidence.

Figure 7 shows the error in translation and orientation of an Aqua2 robot in the Pool dataset. It is evident that the proposed method performs well across all distances from camera (0.5m-3.5m). Interestingly, at very close distance, the method experiences higher orientation error due to the 2D keypoints of Aqua2 not being precisely selected by the pose regression decoder.

Fig. 7: Boxplot summarizing the error statistic of (a) translation and (b) orientation in Pool Dataset with respect to variable distance from camera.

Evaluation on the Barbados 2017 Dataset: We report the performance of our system on the real Barbados 2017 dataset in terms of translation and rotation errors, as shown in Table II. DeepURL performs significantly better on translation error and orientation error compared to the method of Koreitem et al. [15].

Method                 Translation Error   Orientation Error   Roll Error   Pitch Error   Yaw Error
Koreitem et al. [15]   0.72m               17.59°              11.87°       4.59°         12.11°
DeepURL                0.31m               11.98°              9.64°        3.30°         5.43°
TABLE II: Translation and rotation errors for the Barbados 2017 dataset [15]

Failing Scenarios: The approach might fail to predict a reasonable estimate of the pose transformation matrix using the PnP algorithm. These scenarios generally occur either when the detection stream fails to predict the object detection box, so that there are not enough points for the RANSAC-based PnP algorithm (at least six 2D-to-3D correspondences are required), or when PnP does not converge. These detection failure scenarios are inherent in the YOLOv3 architecture. The system may also fail for positions or orientations not introduced in the training scenarios, such as translation beyond 3.5m or orientation beyond the rendered range described in Section IV-A.

V Conclusion

In this work, we presented a system for 6D pose estimation of an Autonomous Underwater Vehicle for relative localization underwater. The system learns to predict the 6D pose without the need for any real ground truth, which enables pose estimation in an environment where ground truth is difficult to acquire. We also present a detection bounding box based keypoint sampling strategy that is more robust to estimate the pose of the observed robot.

Currently, the proposed network is being ported to an Intel Neural Compute Stick 2 (Intel NCS2) and an NVidia Jetson TX2 Module in order to deploy on an Aqua2 or a BlueROV2 vehicle. The above two platforms were selected based on their performance [25] and compatibility with the proposed vehicles. Furthermore, the DeepURL framework will be integrated with the proprioceptive sensors of each robot (IMU and depth) and either the USBL positioning of the observer or the visual-inertial estimator [33] to recover the pose of both robots in a global frame of reference.

The proposed method will enable robust operations underwater for the Aqua2 robots and is expected to be applicable to other underwater vehicles.


  • [1] E. Brachmann, A. Krull, F. Michel, S. Gumhold, J. Shotton, and C. Rother (2014) Learning 6d object pose estimation using 3d object coordinates. Vol. 8690, Springer. Cited by: §II.
  • [2] E. Brachmann, F. Michel, A. Krull, M. Ying Yang, S. Gumhold, and c. Rother (2016) Uncertainty-driven 6D pose estimation of objects and scenes from a single RGB image. In cvpr, Cited by: §IV-B.
  • [3] C. Choi and H. I. Christensen (2016) RGB-D object pose estimation in unstructured environments. Robotics and Aut. Systems 75, pp. 595 – 613. External Links: Document Cited by: §II.
  • [4] A. Collet, M. Martinez, and S. S. Srinivasa (2011) The MOPED framework: object recognition and pose estimation for manipulation. IJRR 30 (10), pp. 1284–1306. Cited by: §I.
  • [5] A. Doumanoglou, V. Balntas, R. Kouskouridas, and T. Kim (2016) Siamese regression networks with efficient mid-level feature extraction for 3D object pose estimation. External Links: 1607.02257 Cited by: §II.
  • [6] G. Dudek, M. Jenkin, C. Prahacs, A. Hogue, J. Sattar, P. Giguere, A. German, H. Liu, S. Saunderson, A. Ripsman, S. Simhon, L. A. Torres-Mendez, E. Milios, P. Zhang, and I. Rekleitis (2005) A visually guided swimming robot. In IROS, Cited by: §I.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, pp. 2672–2680. Cited by: §II.
  • [8] K. Gupta, L. Petersson, and R. Hartley (2019) CullNet: Calibrated and Pose Aware Confidence Scores for Object Pose Estimation. In IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §II.
  • [9] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2013) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision – ACCV 2012, K. M. Lee, Y. Matsushita, J. M. Rehg, and Z. Hu (Eds.), pp. 548–562. Cited by: §I.
  • [10] O. Hosseini Jafari, S. K. Mustikovela, K. Pertsch, E. Brachmann, and C. Rother (2018) IPose: instance-aware 6D pose estimation of partly occluded objects. In Computer Vision – ACCV, pp. 477–492. Cited by: §II.
  • [11] Y. Hu, J. Hugonot, P. Fua, and M. Salzmann (2019) Segmentation-driven 6d object pose estimation. In CVPR, Cited by: §I.
  • [12] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §II.
  • [13] B. Joshi, S. Rahman, M. Kalaitzakis, B. Cain, J. Johnson, M. Xanthidis, N. Karapetyan, A. Hernandez, A. Quattrini Li, N. Vitzilaios, and I. Rekleitis (2019) Experimental Comparison of Open Source Visual-Inertial-Based State Estimation Algorithms in the Underwater Domain. In IROS, pp. 7221–7227. Cited by: §I.
  • [14] W. Kehl, F. Manhardt, F. Tombari, S. Ilic, and N. Navab (2017) SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In IEEE International Conference on Computer Vision (ICCV), Cited by: §I, §II, §IV-C.
  • [15] K. Koreitem, J. Li, I. Karp, T. Manderson, F. Shkurti, and G. Dudek (2018-Oct.) Synthetically trained 3d visual tracker of underwater vehicles. In MTS/IEEE OCEANS, Charleston, SC, USA. Cited by: §II, Fig. 6, §IV-A, §IV-C, TABLE II.
  • [16] A. Krull, E. Brachmann, F. Michel, M. Ying Yang, S. Gumhold, and C. Rother (2015-12) Learning analysis-by-synthesis for 6d pose estimation in rgb-d images. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §I.
  • [17] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) EPnP: An accurate O(n) solution to the PnP problem. Int. Journal of Comp. Vision. Cited by: §I, §III-C.
  • [18] Y. Li, G. Wang, X. Ji, Y. Xiang, and D. Fox (2018) DeepIM: Deep Iterative Matching for 6D pose estimation. In ECCV, Cited by: §II.
  • [19] Z. Li, G. Wang, and X. Ji (2019) CDPN: Coordinates-Based Disentangled Pose Network for Real-Time RGB-Based 6-DoF Object Pose Estimation. In IEEE International Conference on Computer Vision (ICCV), Cited by: §II.
  • [20] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: Common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §III-B.
  • [21] D. G. Lowe (1999) Object recognition from local scale-invariant features. In ICCV, Vol. 2, pp. 1150–1157. Cited by: §I, §II.
  • [22] D. Lowe (2004-11) Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision. Cited by: §II.
  • [23] T. Manderson, I. Karp, and G. Dudek (2018) Aqua underwater simulator. In IROS, Cited by: §I.
  • [24] T. Manderson, J. Li, N. Dudek, D. Meger, and G. Dudek (2017) Robotic Coral Reef Health Assessment Using Automated Image Analysis. Journal of Field Robotics 34 (1), pp. 170–187. External Links: Document, ISSN 1556-4967 Cited by: §I.
  • [25] M. Modasshir, A. Quattrini Li, and I. Rekleitis (2018) Deep neural networks: a comparison on different computing platforms. In CRV, pp. 383–389. Cited by: §V.
  • [26] M. Oberweger, M. Rad, and V. Lepetit (2018) Making deep heatmaps robust to partial occlusions for 3D object pose estimation. CoRR abs/1804.03959. Cited by: §II.
  • [27] F. C. Park and B. Ravani (1997) Smooth invariant interpolation of rotations. ACM Trans. Graph. 16 (3), pp. 277–295. Cited by: §IV-B.
  • [28] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis (2017) 6-DoF object pose from semantic keypoints. In IEEE International Conference on Robotics and Automation, pp. 2011–2018. Cited by: §II.
  • [29] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao (2019) PVNet: Pixel-wise Voting Network for 6DoF Pose Estimation. In CVPR, Cited by: §II.
  • [30] A. Quattrini Li, A. Coskun, S. M. Doherty, S. Ghasemlou, A. S. Jagtap, M. Modasshir, S. Rahman, A. Singh, M. Xanthidis, J. M. O’Kane, and I. Rekleitis (2016) Experimental comparison of open source vision based state estimation algorithms. In ISER, pp. 775–786. Cited by: §I.
  • [31] M. Rad and V. Lepetit (2017) BB8: a scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In ICCV, Cited by: §I, §II, §III, §IV-B.
  • [32] M. Rad, M. Oberweger, and V. Lepetit (2018) Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §II.
  • [33] S. Rahman, A. Quattrini Li, and I. Rekleitis (2019) SVIn2: An Underwater SLAM System using Sonar, Visual, Inertial, and Depth Sensor. In IROS, pp. 1861–1868. Cited by: §V.
  • [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016-06) You only look once: unified, real-time object detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §III-B, §III-B.
  • [35] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In CVPR, Cited by: §III-B.
  • [36] J. Redmon and A. Farhadi (2018) YOLOv3: an incremental improvement. External Links: 1804.02767 Cited by: §I, §III-B, §III-B.
  • [37] I. M. Rekleitis, G. Dudek, and E. E. Milios (1998) On multiagent exploration. In Proc. Vision Interface, pp. 455–461. Cited by: §I.
  • [38] R. Rothe, M. Guillaumin, and L. V. Gool (2014) Non-maximum suppression for object detection by passing messages between windows. In ACCV, Cited by: §III-C.
  • [39] F. Rothganger, S. Lazebnik, C. Schmid, and J. Ponce (2006) 3D object modeling and recognition using local affine-invariant image descriptors and multi-view spatial constraints. IJCV. Cited by: §I, §II.
  • [40] A. Rozantsev, M. Salzmann, and P. Fua (2019) Beyond sharing weights for deep domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, pp. 801–814. Cited by: §II.
  • [41] F. Shkurti, W. D. Chang, P. Henderson, M. J. Islam, J. C. G. Higuera, J. Li, T. Manderson, A. Xu, G. Dudek, and J. Sattar (2017) Underwater multi-robot convoying using visual tracking by detection. In IROS, pp. 4189–4196. Cited by: §II.
  • [42] F. Shkurti, W. D. Chang, P. Henderson, Md. J. Islam, J. C. Gamboa Higuera, J. Li, T. Manderson, A. Xu, G. Dudek, and J. Sattar (2017) Underwater multi-robot convoying using visual tracking by detection. In IROS, Cited by: §I.
  • [43] J. Sock, S. H. Kasaei, L. S. Lopes, and T. Kim (2017) Multi-view 6d object pose estimation and camera motion planning using rgbd images. IEEE Int. Conf. on Computer Vision Workshops (ICCVW). Cited by: §II.
  • [44] M. Sundermeyer, Z. Marton, M. Durner, M. Brucker, and R. Triebel (2018) Implicit 3d orientation learning for 6d object detection from rgb images. In European Conference on Computer Vision (ECCV), Cited by: §I.
  • [45] B. Tekin, S. N. Sinha, and P. Fua (2018) Real-time seamless single shot 6d object pose prediction. In CVPR, Cited by: §I, §II, §II, §III, §IV-C, TABLE I.
  • [46] T. Trzcinski, M. Christoudias, V. Lepetit, and P. Fua (2012) Learning image descriptors with the boosting-trick. In NIPS, Cited by: §II.
  • [47] D. Wagner, G. Reitmayr, A. Mulloni, T. Drummond, and D. Schmalstieg (2008) Pose tracking from natural features on mobile phones. In IEEE/ACM Int. Symp. on Mix. and Aug. Reality, pp. 125–134. External Links: Document Cited by: §I, §II.
  • [48] P. Wohlhart and V. Lepetit (2015) Learning Descriptors for Object Recognition and 3D Pose Estimation. In CVPR, Cited by: §II.
  • [49] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox (2018) PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. RSS. Cited by: §I, §II, §IV-B.
  • [50] S. Zakharov, I. Shugurov, and S. Ilic (2019) DPOD: 6D Pose Object Detector and Refiner. In IEEE Int. Conf. on Computer Vision (ICCV), Cited by: §II.
  • [51] J. Zhu, T. Park, P. Isola, and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, Cited by: §I, §II, §III-A, §III-D, §III, footnote 1.