Source Code for "DeepURL: Deep Pose Estimation Framework for Underwater Relative Localization", submitted to IROS 2020
In this paper, we propose a real-time deep-learning approach for determining the 6D relative pose of Autonomous Underwater Vehicles (AUVs) from a single image. A team of autonomous robots localizing themselves in a communication-constrained underwater environment is essential for many applications such as underwater exploration, mapping, multi-robot convoying, and other multi-robot tasks. Due to the profound difficulty of collecting ground truth images with accurate 6D poses underwater, this work utilizes rendered images from the Unreal Game Engine simulation for training. An image translation network is employed to bridge the gap between the rendered and the real images, producing synthetic images for training. The proposed method predicts the 6D pose of an AUV from a single image as 2D image keypoints representing the 8 corners of the 3D model of the AUV, and then the 6D pose in camera coordinates is determined using RANSAC-based PnP. Experimental results in underwater environments (swimming pool and ocean) with different cameras demonstrate the robustness of the proposed technique, where the trained system decreased translation error by 75.5%.
The ability to localize is crucial to many robotic applications. There are several environments where keeping track of the vehicle's position is a challenging task, particularly GPS-denied environments with limited features. A common approach to address the localization problem is to use inter-robot measurements for improved positional accuracy, an approach termed Cooperative Localization (CL). Central to CL is the ability to estimate the relative pose between two robots; this estimate can then be utilized to improve the absolute localization based on the global pose estimate for one of the two robots.
In this paper, we propose and evaluate a deep pose estimation framework for underwater relative localization, called DeepURL. The primary application motivating this work is underwater exploration and mapping by a team of autonomous underwater vehicles (AUVs), shown in Fig. 1, with a focus on shipwreck and underwater cave mapping, environments that are challenging to most existing localization methodologies (e.g., visual and visual/inertial-based systems [30, 13]). Other applications include convoying, environmental assessments, and inspections.
The proposed methodology draws from the rich object detection research and is adapted to the unique conditions of the underwater domain. Traditionally, 6D pose estimation (3D position and 3D orientation) is performed by matching feature points between 3D models and images [21, 39, 4, 47]. While these methods are robust when objects are well textured, they perform poorly when objects are featureless or textureless. In the underwater domain, particulates in the water generate undesired texture smoothing. Recent approaches [14, 44, 49, 31, 45, 11] that estimate 6D poses using deep neural networks perform well on standard benchmark pose estimation datasets such as LINEMOD, OccludedLINEMOD, and YCBVideo, but they require either intensive manual annotation or a motion capture system. To the authors' knowledge, a readily applicable method for collecting underwater training data with the corresponding accurate 6D poses is not available.
In this work, we focus on estimating the 6D pose of an Aqua2 vehicle. The observer is either another Aqua2 robot or an underwater handheld camera. The proposed method utilizes Unreal Engine 4 (UE4) with a 3D model of the Aqua2 robot swimming, projected over underwater images, to generate training images with known poses for the pose estimation network. In addition, dissimilarity between images arising from intrinsic factors, such as distortion differences between cameras, or external factors, such as color loss, poor visibility, or even the surroundings, hampers the performance of classical deep learning-based 6D pose estimation methods underwater. CycleGAN was employed to transform the UE4-rendered images, generating image sets that vary in appearance and resemble real-world underwater images, to be used in training.
Using a modified version of YOLOv3 to detect an object bounding box, the proposed network produces robust 6D pose estimates by combining multiple local predictions of 2D keypoints that are projections of the 3D corner points of the object. Only grid cells inside the detected bounding box contribute to the selection of 2D keypoints, each along with a confidence score. Using the predictions with their confidences, the most dependable 2D keypoint candidates for each 3D keypoint are selected to yield a set of 2D-to-3D correspondences. These selected 2D keypoints are used in the RANSAC-based PnP algorithm to obtain a robust 6D pose estimate.
The proposed framework has been tested in different environments – pool, ocean – and different platforms, including Aqua2 robot and GoPro cameras, demonstrating its robustness. The main contributions are as follows:
A 6D pose prediction network that predicts object bounding boxes and eight keypoints in image coordinates. These 2D keypoints are then used in 2D-to-3D correspondences to estimate the 6D pose.
We demonstrate the effective use of rendered image augmentation in 6D pose prediction, eliminating the need for ground truth labeling in real images. Utilizing image augmentation from the rendered to the underwater environment, the pose prediction network becomes invariant to color loss, texture smoothing, and other domain-specific challenges. (Footnote: Traditionally, Generative Adversarial Networks (GANs) use the term image translation for this operation; however, the term translation can be confusing for a robotics application.)
We publish a dataset of the Aqua2 robot captured in the ocean and in a swimming pool to further research in the underwater domain (https://afrl.cse.sc.edu/afrl/resources/datasets/).
The next section reviews related work. Section III introduces the proposed method, including the rendered data generated for training and the pose estimation method. Section IV first presents the ground truth data acquisition, used exclusively for testing, and then discusses quantitative results from different datasets together with a comparison with other methods. Finally, we conclude the paper with future work in Section V.
In this paper, we focus on 6D pose estimation using RGB images without access to a depth map; RGB-D based methods [1, 3, 43] are not applicable underwater given that infrared light attenuates within a very short distance. The classical approach for 6D object pose estimation involves extracting local features from the input image and matching them with features from a 3D model to establish 2D-to-3D correspondences, from which a 6D pose can be obtained through the PnP algorithm [39, 28, 21, 47]. Previous work has shown a strong focus on making these local feature descriptors invariant to changes in scale, rotation, illumination, and viewpoint [22, 46]. Even though these feature-based techniques can handle occlusions and scene clutter, they require sufficient texture to compute local features. To deal with poorly-textured objects, some efforts focused on learned feature descriptors using machine learning techniques [48, 5].
In recent years, pose estimation research has been dominated by frameworks utilizing deep neural networks. These methods can be broadly classified into two categories: either regressing directly to 6D pose estimates [14, 49], or predicting 2D projections of 3D keypoints on an image and then obtaining the pose via the PnP algorithm [31, 45]. Xiang et al. estimate the object center in the image, with the distance of the center used for estimating the translation and predicted quaternions used for the object rotation. Peng et al. used a pixel-wise voting network to regress pixel-wise unit vectors pointing to the keypoints and used these vectors to vote for keypoint locations using RANSAC. Recent works [50, 8, 18] focus on post-processing to refine the initial pose estimates from the first step. Li et al. disentangled the pose to predict rotation and translation separately from two different branches to increase accuracy.
Recent approaches [10, 26] for pose estimation focus on local patches belonging to the object rather than producing a single global prediction. The work closest to our approach in terms of using local image patches also learns a semantic segmentation mask to select multiple keypoint locations from local patches belonging to an object and provides those inputs to a PnP algorithm. Regarding pose estimation using synthetic datasets, Rozantsev et al. used a two-stream network trained on a synthetic and a real dataset, introducing loss functions that prevent corresponding weights of the two streams from being too different from each other. Rad et al. learn a feature mapping from real to synthetic data; during inference, they transfer the features of real images to the synthetic domain and infer the pose using synthetic features. Some work has used a deep learning framework for Aqua2 vehicle detection to enable visual servoing. Koreitem et al. used rendered images for pose-estimation-based visual tracking of Aqua2; our approach outperforms theirs in terms of 6D pose estimation accuracy.
In our work, we employ CycleGAN, a type of Generative Adversarial Network (GAN), to generate a synthetic dataset for training. GANs, introduced by Goodfellow et al., are used to generate images through adversarial training, where a generator attempts to produce realistic images to fool a discriminator, which tries to distinguish whether an image is real or generated. Some variants of GANs have been used for paired image-to-image translation. In the absence of paired images, CycleGAN can perform unpaired image-to-image translation. The main idea of CycleGAN is that if an image is translated from one domain to another and translated back, the resulting image should resemble the original image.
Figure 2 shows an overview of the proposed system. In the training process, UE4 renders a 3D model of Aqua2 with known 6D poses projected on top of underwater marine images. The feature space between real underwater and rendered images is aligned by transferring the rendered images to the target domains (swimming pool and ocean) using CycleGAN, an image-to-image translation network.
The next stage consists of a Convolutional Neural Network (CNN) that predicts the 2D projections of the 8 corners of the object's (Aqua2) 3D model, similar to [31, 45], and an object detection bounding box. Even though [31, 45] divide an image into grid cells, they use a single global estimate of the 2D keypoints for the object, taken at the highest confidence value. In our approach, each grid cell inside the bounding box predicts the 2D projections of the keypoints along with their confidences, focusing on local regions belonging to the object. The predictions of all cells are then combined, based on the confidence scores, using RANSAC-based PnP during 6D pose estimation.
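In the proposed approach, CycleGAN is employed for unpaired image-to-image translation by learning functions that map the UE4 domain $X$ to the target domain $Y$. We use generators $G$ and $F$ to transfer between domains: $G: X \rightarrow Y$ and $F: Y \rightarrow X$. Discriminator $D_X$ is designed to distinguish between rendered images in $X$ and augmented fake images $F(Y)$. Discriminator $D_Y$ aims to separate target images in $Y$ from augmented fake images $G(X)$. To improve image-to-image translation in CycleGAN, cycle consistency is maintained, in addition to the adversarial loss, by ensuring that the reconstructed images satisfy $F(G(X)) \approx X$ and $G(F(Y)) \approx Y$. To calculate the adversarial loss, $G$ tries to generate $G(X)$ that is so similar to $Y$ that it can fool the discriminator $D_Y$. Thus, the loss for $G$ and $D_Y$ is:
$$\mathcal{L}_{GAN}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{data}(y)}\left[\ell(D_Y(y))\right] + \mathbb{E}_{x \sim p_{data}(x)}\left[\ell(1 - D_Y(G(x)))\right] \tag{1}$$
where $x \sim p_{data}(x)$ and $y \sim p_{data}(y)$ denote the data distributions in $X$ and $Y$, respectively, and $\ell$ is the loss function, which is the L1 norm in our approach. Similarly, we derive $\mathcal{L}_{GAN}(F, D_X, Y, X)$ following Eq. (1). The cycle consistency loss is defined as
$$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right]$$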
In our proposed method, there are two target domains: the swimming pool and the open-water marine environment. Therefore, we train two instances of CycleGAN, yielding two generators, one per target domain. Fig. 3 shows an overview of CycleGAN training along with synthetic data generation.
The network consists of an encoder and two decoders: a Detection Decoder and a Pose Regression Decoder. The detection decoder detects objects with bounding boxes, and the pose regression decoder regresses to the 2D corner keypoints of the 3D object model. Each decoder predicts its output as a 3D tensor with a spatial resolution of $S \times S$ and a depth of $D_{det}$ and $D_{reg}$, respectively. The spatial resolution controls the size of the image patch that can effectively vote for object detection and for the 2D keypoint locations. The feature vectors are predicted at 3 different spatial resolutions. The detection stream detects features at multiple scales via upsampling and concatenation, with a final-layer depth of $D_{det}$. The pose regression stream has a similar architecture, but its final-layer depth is $D_{reg}$. Predicting at multiple spatial resolutions with upsampling helps to obtain semantic information at multiple scales using fine-grained features from early on in the network.
Object Detection Stream: The object detection stream is similar to the detection stream of YOLOv3, which predicts object bounding boxes. For each grid cell at offset $(c_x, c_y)$ from the top left corner of the image, the network predicts 4 coordinates for each bounding box: $t_x, t_y, t_w, t_h$. Following the same approach, we use 9 anchor boxes, obtained by k-means clustering on the COCO dataset, divided among three scales. The width and height are predicted as fractions of the anchor box priors $p_w, p_h$, and the actual bounding box values are obtained as
$$b_x = \sigma(t_x) + c_x, \quad b_y = \sigma(t_y) + c_y, \quad b_w = p_w e^{t_w}, \quad b_h = p_h e^{t_h}$$
where $\sigma(\cdot)$ represents the sigmoid function. The sum of squared errors between the ground truth and the coordinate predictions is used as the loss function. The ground truth values can be obtained by inverting the equations above. The object detection stream also predicts the objectness score of each bounding box, by calculating its intersection over union with the anchor boxes, and the class prediction scores, using independent logistic classifiers. The total object detection loss is the sum of the coordinate prediction loss, the objectness score loss, and the class prediction loss. This loss was introduced by Redmon et al., to whom we refer the reader for a complete description.
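The box parameterization described above can be sketched in a few lines of Python. This is a minimal illustration of the standard YOLOv3-style decoding and its inverse, not the authors' code; the function names `decode_box` and `encode_box` and the example prior are hypothetical, and all quantities are expressed in grid-cell units.

```python
import math


def sigmoid(x):
    """Logistic function used to keep the box center inside its grid cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs (t_x, t_y, t_w, t_h) to a box (b_x, b_y, b_w, b_h).

    cell  = (c_x, c_y): offset of the grid cell from the top left corner.
    prior = (p_w, p_h): width and height of the anchor box prior.
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = prior
    return (sigmoid(tx) + cx, sigmoid(ty) + cy,
            pw * math.exp(tw), ph * math.exp(th))


def encode_box(b, cell, prior):
    """Invert decode_box to obtain regression targets from a ground truth box.

    The box center must lie strictly inside the cell so the logit is defined.
    """
    bx, by, bw, bh = b
    cx, cy = cell
    pw, ph = prior
    ox, oy = bx - cx, by - cy
    return (math.log(ox / (1.0 - ox)), math.log(oy / (1.0 - oy)),
            math.log(bw / pw), math.log(bh / ph))
```

With all raw outputs at zero, the decoded box sits at the center of its cell and has exactly the size of the prior, which is a convenient sanity check when wiring up the targets.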
Pose Regression Stream: The pose regression stream predicts the locations of the 2D projections of predefined 3D keypoints associated with the 3D object model of Aqua2. We use the 8 corner points of the model bounding box as keypoints. The pose regression stream predicts a 3D tensor of size $S \times S \times D_{reg}$. As we predict the $(x, y)$ spatial locations for the 8 keypoint projections along with their confidence values, $D_{reg} = 3 \times 8$.
We do not predict the 2D coordinates of the keypoints directly. Rather, we predict the offset of each keypoint from the corresponding grid cell, as in Fig. 4(b), in the following way. Let $c$ be the position of a grid cell from the top left image corner. For the $i$-th keypoint, we predict the offset $f_i(c)$ from the grid cell, so that the actual location in image coordinates becomes $c + f_i(c)$, which should be close to the ground truth 2D location $g_i$. The residual is calculated as
$$\Delta_i(c) = c + f_i(c) - g_i$$
and we define the offset loss function, $\mathcal{L}_{off}$, for the spatial residual as
$$\mathcal{L}_{off} = \sum_{c \in B} \sum_{i=1}^{N} \lVert \Delta_i(c) \rVert_1$$
where $N = 8$ is the number of keypoints, $B$ consists of the grid cells that fall inside the object bounding box, and $\lVert \cdot \rVert_1$ represents the L1-norm loss function, which is less susceptible to outliers than the L2 loss. Using only grid cells that fall inside the object bounding box for 2D keypoint predictions focuses on image regions that truly belong to the object.
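Apart from the 2D keypoint locations, the pose regression stream also calculates a confidence value for each predicted point, which is obtained by applying the sigmoid function to the network output. The confidence value should be representative of the distance between the predicted keypoint and the ground truth value. A sharp exponential function of the 2D Euclidean distance between prediction and ground truth is used as the target confidence. Thus, the confidence loss is calculated as
$$\mathcal{L}_{conf} = \sum_{c} \sum_{i=1}^{N} \left\lVert conf_i(c) - \exp\left(-\alpha \lVert \Delta_i(c) \rVert_2\right) \right\rVert_1$$
where $conf_i(c)$ is the predicted confidence of keypoint $i$ for grid cell $c$, $\Delta_i(c)$ is the residual between the predicted and ground truth 2D locations of that keypoint, $\lVert \cdot \rVert_2$ denotes the Euclidean distance or L2 loss, and the parameter $\alpha$ defines the sharpness of the exponential function. The pose regression loss of Eq. (8) takes the form
$$\mathcal{L}_{reg} = \lambda_{off} \mathcal{L}_{off} + \lambda_{conf} \mathcal{L}_{conf} \tag{8}$$
For numerical stability, we downweight the confidence loss for cells that do not contain objects by setting $\lambda_{conf}$ to 0.1, as suggested in the literature. For the cells that include the object, $\lambda_{conf}$ is set to 5.0 and $\lambda_{off}$ to 1. Therefore, the total loss of the network is
$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{reg}$$
where $\mathcal{L}_{det}$ is the total object detection loss.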
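To make the offset and confidence terms concrete, the following Python sketch evaluates them for a handful of grid cells. It is an illustration under stated assumptions rather than the training code: the function names, the data layout, and the sharpness value `alpha=2.0` are hypothetical, while the weights 1 and 5.0 for cells containing the object are the values quoted above.

```python
import math


def keypoint_losses(cells, offsets, confs, gt, alpha=2.0):
    """Return (offset_loss, confidence_loss) summed over cells and keypoints.

    cells:   list of (c_x, c_y) grid cell positions.
    offsets: offsets[c][i] = predicted (dx, dy) of keypoint i from cell c.
    confs:   confs[c][i] = predicted confidence in [0, 1].
    gt:      gt[i] = ground truth (g_x, g_y) of keypoint i.
    alpha:   sharpness of the exponential target (assumed value).
    """
    l_off = 0.0
    l_conf = 0.0
    for c, (cx, cy) in enumerate(cells):
        for i, (gx, gy) in enumerate(gt):
            dx, dy = offsets[c][i]
            # Residual between predicted location (cell + offset) and truth.
            rx, ry = cx + dx - gx, cy + dy - gy
            l_off += abs(rx) + abs(ry)                # L1 norm of residual
            target = math.exp(-alpha * math.hypot(rx, ry))
            l_conf += abs(confs[c][i] - target)       # L1 confidence error
    return l_off, l_conf


def regression_loss(l_off, l_conf, lam_off=1.0, lam_conf=5.0):
    """Weighted sum of the two terms for cells that contain the object."""
    return lam_off * l_off + lam_conf * l_conf
```

A perfect prediction (zero residual, confidence 1) gives zero for both terms, and the target confidence decays quickly as the residual grows, which is what lets the confidence score rank candidate keypoints at inference time.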
After inference, the object detection stream of our network predicts the coordinate locations of the bounding boxes with their confidences and the class probabilities for each grid cell. Then, the class-specific confidence score is estimated for the object by multiplying class probability and confidence score. To select the best bounding box, we use non-max suppression with IOU threshold of 0.4 and a class-specific confidence score threshold of 0.3.
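The box selection step can be sketched as a standard greedy non-max suppression. The thresholds 0.3 and 0.4 are the ones stated above; the function names and the `(box, objectness, class_prob)` tuple layout are assumptions made for this illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(dets, score_thr=0.3, iou_thr=0.4):
    """Greedy non-max suppression.

    dets: list of (box, objectness, class_prob) tuples.
    Returns the kept (box, score) pairs, highest score first, where
    score = objectness * class_prob is the class-specific confidence.
    """
    scored = [(box, obj * cls) for box, obj, cls in dets
              if obj * cls >= score_thr]
    scored.sort(key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in scored:
        # Keep a box only if it does not overlap a better box too much.
        if all(iou(box, k[0]) <= iou_thr for k in kept):
            kept.append((box, score))
    return kept
```

For a single-object setting such as this one, the first element of the returned list would be the natural choice for the detection box.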
At the same time, the pose regression stream produces the projected 2D locations of the corners of the object's 3D bounding box, along with their confidence scores, for each grid cell. The 2D keypoint predictions for grid cells that fall outside the bounding box from the object detection stream are filtered out. In an ideal case, these 2D keypoints should cluster around the object center. Therefore, 2D keypoints that do not belong to a cluster are removed using a pixel distance threshold of 0.3 times the image width. The keypoints with confidence scores less than 0.5 are also filtered out. We employ RANSAC-based PnP on the remaining predictions to obtain the 2D-to-3D correspondences between the image keypoints and the object's 3D model. However, using all remaining predictions makes inference slow due to the optimization process. Empirically, we found that using the 12 most confident 2D predictions for each 3D keypoint produces an acceptable pose estimate with a good tradeoff between computation time and accuracy, as shown in Fig. 5.
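The filtering steps above can be sketched as follows. This is a rough illustration, not the authors' implementation: the function name and data layout are hypothetical, and the cluster test is interpreted here as a distance to the coordinate-wise median of the candidates for each keypoint, which is an assumption. The thresholds (0.5 confidence, 0.3 times the image width, 12 candidates) are the ones given in the text.

```python
import math
from statistics import median


def select_candidates(preds, box, img_w, conf_thr=0.5, dist_frac=0.3,
                      top_k=12, n_kp=8):
    """Pick the most confident 2D candidates for each of the n_kp keypoints.

    preds: list of ((cell_x, cell_y), kps), where kps is a list of n_kp
           (x, y, conf) predictions in pixels made by that grid cell.
    box:   detected bounding box (x1, y1, x2, y2) in pixels.
    Returns a list with one candidate list per keypoint.
    """
    x1, y1, x2, y2 = box
    per_kp = [[] for _ in range(n_kp)]
    for (cx, cy), kps in preds:
        # Only grid cells inside the detected bounding box may contribute.
        if not (x1 <= cx <= x2 and y1 <= cy <= y2):
            continue
        for i, (x, y, conf) in enumerate(kps):
            if conf >= conf_thr:
                per_kp[i].append((x, y, conf))

    selected = []
    for cands in per_kp:
        if cands:
            # Assumed cluster test: drop points far from the median location.
            mx = median(c[0] for c in cands)
            my = median(c[1] for c in cands)
            cands = [c for c in cands
                     if math.hypot(c[0] - mx, c[1] - my) <= dist_frac * img_w]
        cands.sort(key=lambda c: c[2], reverse=True)
        selected.append(cands[:top_k])
    return selected
```

Each surviving 2D candidate would then be paired with the 3D corner it predicts and passed to a RANSAC-based PnP solver, for example OpenCV's `cv2.solvePnPRansac`, together with the camera intrinsics.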
To create the synthetic dataset, we train the CycleGAN following the published training procedure. We let the training continue until it generated acceptable reconstructions. Once CycleGAN can reasonably reconstruct images for the target domain, we use the model weights of that epoch to translate all rendered images to synthetic images. Then, for training, the synthetic images are scaled to the network input resolution, keeping the aspect ratio the same by padding with zeros. During inference, no image translation is required, and the real images are directly fed to the network.
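Aspect-preserving scaling with zero padding also changes where the keypoint labels land, so the same transform has to be applied to them. The sketch below computes the resize and padding parameters; the function names are hypothetical, the square `target` size is left as a parameter, and centering the image inside the padding is an assumption (the text only states that zeros are padded).

```python
def letterbox_params(w, h, target):
    """Resize a w x h image to fit a target x target canvas, keeping aspect ratio.

    Returns (new_w, new_h, pad_left, pad_top); the rest of the canvas is zeros.
    """
    scale = min(target / w, target / h)
    new_w, new_h = round(w * scale), round(h * scale)
    pad_left = (target - new_w) // 2
    pad_top = (target - new_h) // 2
    return new_w, new_h, pad_left, pad_top


def map_point(x, y, w, h, target):
    """Map a pixel (e.g. a 2D keypoint label) into the padded image."""
    new_w, new_h, pad_left, pad_top = letterbox_params(w, h, target)
    return x * new_w / w + pad_left, y * new_h / h + pad_top
```

For example, with an assumed 416-pixel canvas, a 640x480 image is resized to 416x312 and padded by 52 rows at the top and bottom, so the image center stays at the canvas center.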
The CNN is trained for 125 epochs on the synthetic dataset; the first 3 epochs form the warmup phase, during which the learning rate gradually increases from 0 to 1e-4. We utilized the SGD optimizer with a momentum of 0.9 and a piecewise decay that decreases the learning rate to 3e-5 and 1e-5 at 60 and 100 epochs, respectively. To avoid overfitting, minibatches of size 8 were produced by applying data augmentation techniques, i.e., randomly changing the hue, saturation, and exposure of the image by up to a factor of 1.5. In addition, images were randomly scaled and affine transformed by up to 25% of the original image size.
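The schedule described above can be written as a small function of the (possibly fractional) epoch index. The epoch boundaries and rates are the ones stated in the text; the linear shape of the warmup is an assumption, since the text only says the rate increases gradually, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, warmup_epochs=3):
    """Learning rate at a given epoch (fractional epochs allowed)."""
    if epoch < warmup_epochs:
        # Assumed linear warmup from 0 to base_lr.
        return base_lr * epoch / warmup_epochs
    if epoch < 60:
        return base_lr
    if epoch < 100:
        return 3e-5
    return 1e-5
```

In a training loop this value would be written into the optimizer before each step or epoch, depending on how finely the warmup should be resolved.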
This section first describes the datasets used, and then the results of inference with the real Aqua2 robot swimming in both pool and open-water marine deployments.
Training - Rendered/Synthetic Dataset: The rendered dataset contains images obtained by rendering an Aqua2 robot model swimming with flipper motion using UE4 and overlaying the resulting 3D model over random underwater images. Rather than just overlaying a static 3D model of Aqua2, we simulate the flipper motion to generate images with the flippers in various realistic positions. This flipper motion makes the neural network independent of the flipper position. The synthetic dataset is obtained by passing the rendered images through the CycleGAN-based image translation network described in Section III-A to create photo-realistic images. The rendered dataset contains 37K images, with the robot placed at random depths and at orientations sampled from bounded ranges of roll, pitch, and yaw.
Testing - Pool Dataset: We deployed two robots, one observing the other, with a vision-based 2D fiducial marker (AR tag, http://wiki.ros.org/ar_track_alvar) mounted on top of the observed robot and used to estimate ground truth, during two pool trials in an indoor and an outdoor pool, as shown in Fig. 6(a-d). Approximately 11K images were collected with estimated localization, obtained from the pose detection of the AR tag and the relative transformation from the mounted tag to the robot. The dataset contains images with the distance between the two Aqua2 robots ranging from 0.5 m to 3.5 m.
Testing - Barbados 2017 Dataset: The Barbados 2017 Dataset consists of 188 real images collected during underwater field trials off the west coast of Barbados and used in prior work; see Fig. 6(i-l). The images are captured from an Aqua2 robot's onboard camera. The 6D pose of the robot in each of these images is obtained using a custom-built annotator, which allows the user to mark keypoints on the robot assigned from the CAD model. The annotator then iteratively fits a wireframe to the robot using its known dimensions.
Testing - Barbados GoPro Dataset: We collected underwater images in Barbados of an Aqua2 robot swimming over coral reefs using a GoPro camera; these differ significantly from the images collected using another Aqua2 in terms of hue, image size, and aspect ratio, see Fig. 6(e-h). Given that no ground truth is available for these images, this dataset was used to evaluate the proposed method qualitatively.
Table I: Pose estimation results on the Pool dataset.
| Method | Translation Error | Orientation Error | REP-10px Accuracy | ADD-0.1d Accuracy | FPS |
| Tekin et al. | 0.278m | 18.87° | 9.33% | 23.39% | 54 |
To evaluate the pose estimation capability of the proposed system, we calculated the mean translation error as the Euclidean distance between the predicted and ground truth translations. Let $R$ and $t$ be the ground truth rotation matrix and translation, and $\hat{R}$ and $\hat{t}$ be the predicted ones. For individual angle errors in terms of yaw, roll, and pitch, we decomposed $R$ and $\hat{R}$ into Euler angles and calculated their absolute differences. The total orientation error is represented as Eq. (9),
$$e_R = \arccos\left(\frac{\mathrm{tr}(\hat{R} R^{T}) - 1}{2}\right) \tag{9}$$
where $\mathrm{tr}$ represents the trace of the matrix and the orientation error is in the range $[0, \pi]$, as suggested in the literature.
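The two error measures can be sketched in plain Python. The translation error is the Euclidean distance stated above; for the orientation error, this sketch assumes the common trace-based geodesic angle between the two rotation matrices, which lies in the range from 0 to pi. The function names and the `rot_z` helper are hypothetical.

```python
import math


def matmul_t(a, b):
    """Return a @ b^T for two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]


def orientation_error(r_gt, r_pred):
    """Geodesic angle (radians) between ground truth and predicted rotation."""
    m = matmul_t(r_pred, r_gt)
    trace = m[0][0] + m[1][1] + m[2][2]
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.acos(cos_angle)


def translation_error(t_gt, t_pred):
    """Euclidean distance between ground truth and predicted translation."""
    return math.dist(t_gt, t_pred)


def rot_z(theta):
    """Rotation by theta radians about the z axis (helper for examples)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
```

As a quick check, comparing the identity with `rot_z(0.3)` yields an error of 0.3 rad, and the measure is symmetric in its two arguments.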
To evaluate the pose accuracy, we use two standard metrics, namely the 2D reprojection error and the average 3D distance of the model vertices, referred to as the ADD metric [49, 31]. In the case of the reprojection error, we consider the pose estimate correct if the average distance between the 2D projections of the 3D model points obtained using the predicted and ground truth poses is below a 10-pixel threshold, referred to as REP-10px. Generally, a 5-pixel threshold is employed, but we consider a threshold of 10 pixels to account for uncertainties in the ground truth due to the AR tag-based pose estimation. The ADD metric takes the pose estimate as correct if the mean distance between the coordinates of the 3D model vertices transformed by the estimated and ground truth poses falls below 10% of the model diameter, referred to as ADD-0.1d. We also report the inference time of the algorithm in terms of frames per second (FPS) on an RTX 2080 GPU.
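Both accuracy metrics reduce to a mean point distance under the two poses, measured either in 3D (ADD) or after projection (reprojection error). The sketch below assumes a simple pinhole camera with intrinsics `(fx, fy, cx, cy)` and no lens distortion; all function names and the example numbers are hypothetical.

```python
import math
from itertools import combinations


def transform(R, t, p):
    """Apply the rigid transform (R, t) to a 3D point p."""
    return tuple(sum(R[i][k] * p[k] for k in range(3)) + t[i]
                 for i in range(3))


def project(K, p):
    """Pinhole projection of a camera-frame point; K = (fx, fy, cx, cy)."""
    fx, fy, cx, cy = K
    return (fx * p[0] / p[2] + cx, fy * p[1] / p[2] + cy)


def add_error(points, R_gt, t_gt, R_pr, t_pr):
    """Mean 3D distance between model points under the two poses."""
    d = [math.dist(transform(R_gt, t_gt, p), transform(R_pr, t_pr, p))
         for p in points]
    return sum(d) / len(d)


def reprojection_error(points, K, R_gt, t_gt, R_pr, t_pr):
    """Mean 2D distance (pixels) between projections under the two poses."""
    d = [math.dist(project(K, transform(R_gt, t_gt, p)),
                   project(K, transform(R_pr, t_pr, p)))
         for p in points]
    return sum(d) / len(d)


def diameter(points):
    """Largest distance between any two model points."""
    return max(math.dist(a, b) for a, b in combinations(points, 2))


def is_correct(points, K, R_gt, t_gt, R_pr, t_pr, px_thr=10.0, add_frac=0.1):
    """Return (REP-10px correct, ADD-0.1d correct) for one pose estimate."""
    rep_ok = reprojection_error(points, K, R_gt, t_gt, R_pr, t_pr) < px_thr
    add_ok = add_error(points, R_gt, t_gt, R_pr, t_pr) < add_frac * diameter(points)
    return rep_ok, add_ok
```

The two criteria can disagree: a small lateral offset on an object close to the camera may pass ADD-0.1d while exceeding the 10-pixel reprojection threshold, which is one reason both are reported.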
Evaluation on the Pool Dataset: We compare our method with the state-of-the-art method of Tekin et al., trained on the synthetic dataset and tested on the real pool dataset. Translation and rotation errors, along with REP-10px and ADD-0.1d accuracy, for the pool dataset are presented in Table I, together with runtime comparisons on an Nvidia RTX 2080. DeepURL achieves a translation error of 0.068m against 0.278m for Tekin et al. DeepURL also yields a lower orientation error than the 18.87° of Tekin et al., and it outperforms Tekin et al. significantly on REP-10px and ADD-0.1d accuracy. The improved performance comes from two factors: the use of a better detection pipeline, and the bounding box based keypoint sampling introduced in this paper. First, the method is highly dependent on the accurate and precise detection of the Aqua2. As known in the literature, YOLOv3 performs significantly better than YOLOv1, which is used in Tekin et al. In our experiments, we also saw that Tekin et al. cannot detect the Aqua2 when it is further away from the camera. Second, the bounding box based keypoint sampling allowed the selection of more appropriate keypoints using RANSAC-based PnP, whereas Tekin et al. only used the keypoints with the highest confidence.
Figure 7 shows the error in translation and orientation of an Aqua2 robot in the Pool dataset. It is evident that the proposed method performs well across all distances from camera (0.5m-3.5m). Interestingly, at very close distance, the method experiences higher orientation error due to the 2D keypoints of Aqua2 not being precisely selected by the pose regression decoder.
Evaluation on the Barbados 2017 Dataset: We report the performance of our system on Real Barbados 2017 dataset in terms of translation and rotation errors as shown in Table II. DeepURL performs significantly better on translation error and orientation error compared to the method of Koreitem et al. .
Table II: Translation and orientation errors on the Barbados 2017 dataset.
| Method | Translation Error | Orientation Error | Roll Error | Pitch Error | Yaw Error |
| Koreitem et al. | 0.72m | 17.59° | 11.87° | 4.59° | 12.11° |
Failing Scenarios: The approach might fail to predict a reasonable estimate of the pose transformation matrix using the PnP algorithm. These scenarios generally occur either when the detection stream fails to predict the object detection box, so that there are not enough points for the RANSAC-based PnP algorithm (at least six 2D-to-3D correspondences are required), or when PnP does not converge. These detection failure scenarios are inherent in the YOLOv3 architecture. The system may also fail for positions or orientations not covered in the training scenarios, such as translations beyond 3.5m or orientations beyond the rendered range described in Section IV-A.
In this work, we presented a system for 6D pose estimation of an Autonomous Underwater Vehicle for relative localization underwater. The system learns to predict the 6D pose without the need for any real ground truth, which enables pose estimation in environments where ground truth is difficult to acquire. We also presented a keypoint sampling strategy based on the detection bounding box that estimates the pose of the observed robot more robustly.
Currently, the proposed network is being ported to an Intel Neural Compute Stick 2 (Intel NCS2, https://software.intel.com/en-us/neural-compute-stick) and an NVIDIA Jetson TX2 Module (https://developer.nvidia.com/embedded/jetson-tx2) in order to deploy on an Aqua2 or a BlueROV2 vehicle. These two platforms were selected based on their performance and compatibility with the proposed vehicles. Furthermore, the DeepURL framework will be integrated with the proprioceptive sensors of each robot (IMU and depth) and either the USBL positioning of the observer or the Visual-Inertial estimator to recover the pose of both robots in a global frame of reference.
The proposed method will enable robust operations underwater for the Aqua2 robots and is expected to be applicable to other underwater vehicles.
Siamese regression networks with efficient mid-level feature extraction for 3D object pose estimation.
IEEE International Conference on Computer Vision (ICCV) Workshops.
Image-to-image translation with conditional adversarial networks. CVPR.
Experimental Comparison of Open Source Visual-Inertial-Based State Estimation Algorithms in the Underwater Domain. In IROS, pp. 7221–7227.
Smooth invariant interpolation of rotations. ACM Trans. Graph. 16 (3), pp. 277–295.
IEEE Conference on Computer Vision and Pattern Recognition.