Log In Sign Up

Learning Pseudo Front Depth for 2D Forward-Looking Sonar-based Multi-view Stereo

by   Yusheng Wang, et al.
The University of Tokyo

Retrieving the missing dimension information in acoustic images from 2D forward-looking sonar is a well-known problem in the field of underwater robotics. There are works attempting to retrieve 3D information from a single image which allows the robot to generate 3D maps with fly-through motion. However, owing to the unique image formulation principle, estimating 3D information from a single image faces severe ambiguity problems. Classical methods of multi-view stereo can avoid the ambiguity problems, but may require a large number of viewpoints to generate an accurate model. In this work, we propose a novel learning-based multi-view stereo method to estimate 3D information. To better utilize the information from multiple frames, an elevation plane sweeping method is proposed to generate the depth-azimuth-elevation cost volume. The volume after regularization can be considered as a probabilistic volumetric representation of the target. Instead of performing regression on the elevation angles, we use pseudo front depth from the cost volume to represent the 3D information which can avoid the 2D-3D problem in acoustic imaging. High-accuracy results can be generated with only two or three images. Synthetic datasets were generated to simulate various underwater targets. We also built the first real dataset with accurate ground truth in a large scale water tank. Experimental results demonstrate the superiority of our method, compared to other state-of-the-art methods.


page 1

page 2

page 5

page 6

page 7

page 8


Multi-View Guided Multi-View Stereo

This paper introduces a novel deep framework for dense 3D reconstruction...

Rethinking Disparity: A Depth Range Free Multi-View Stereo Based on Disparity

Existing learning-based multi-view stereo (MVS) methods rely on the dept...

Learning Inverse Depth Regression for Multi-View Stereo with Correlation Cost Volume

Deep learning has shown to be effective for depth inference in multi-vie...

Point-Based Multi-View Stereo Network

We introduce Point-MVSNet, a novel point-based deep framework for multi-...

DELS-MVS: Deep Epipolar Line Search for Multi-View Stereo

We propose a novel approach for deep learning-based Multi-View Stereo (M...

DS-MVSNet: Unsupervised Multi-view Stereo via Depth Synthesis

In recent years, supervised or unsupervised learning-based MVS methods a...

I Introduction

When water visibility is poor, sonar is usually the only viable modality for underwater sensing. High-frequency 2D forward-looking sonar, or acoustic cameras, can generate high-resolution 2D acoustic images in any body of water. They are compact in size and can be easily mounted on remotely operated vehicles (ROVs) or autonomous underwater vehicles (AUVs). In the last few years, the sensors have achieved remarkable results in tasks such as mosaicking, 3D mapping, and robot navigation [Hurtos2015, Wangjoe, Wangicra2021, Negahdaripour2013]. Because sonar is a 2D imaging sensor, dimension missing problem also exists similar to optical cameras. Unlike the optical camera where the depth information is missing, the information in the elevation angle direction is missing during the image formation for acoustic cameras. Retrieving 3D information from 2D acoustic images is a nontrivial problem. The unique imaging model, severe noise, and sonar artifacts make this problem difficult.

Early works on 2D forward-looking sonar utilized sparse features, such as corners or edges, for 3D reconstruction [Mai20171, Mai20172, Huang2015]. Points or lines were used to represent the sparse 3D model, which is not intuitive for human comprehension. More recently, dense 3D reconstruction methods have been proposed, based on multi-view stereo or photometric stereo. Multi-view stereo methods basically utilize an efficient space carving scheme, considering the known sensing range and limited field of view (FoV) of 2D forward-looking sonars [Aykin2017, guerneve2018, Wangjoe]. However, the deterministic multi-view stereo methods require a large number of viewpoints; the 3D reconstruction result is extremely poor when only a few viewpoints (e.g., 23) are available. Photometric stereo methods can generate a 3D model from a single acoustic image by modeling ultrasound propagation or using shadow information [Aykin2016, Westman2019iros]

. However, such methods require ideal conditions, and usually have strong assumptions about the scene. Recently, learning-based methods have demonstrated successful results in computer vision problems. A convolutional neural network (CNN)-based method was proposed to estimate the elevation angle for each pixel from a single acoustic image

[DeBortoli2019]. However, for acoustic images, one pixel may correspond to multiple elevation angles, which is a non-bijective 2D-3D correspondence problem, denoted as 2D-3D problem in the following paper. To solve this problem, our research group proposed a CNN-based method (A2FNet) to learn the front view depth from a single acoustic image [Wangicra2021].

In fact, estimating 3D information from a single acoustic image is a highly ill-posed problem and cannot be solved for many cases, considering the ambiguity existing in acoustic imaging (see Section III.C). In addition, the accuracy of single view 3D reconstruction is limited. Recently, studies have focused on sonar stereo [Negahdaripour2020, McConnell2020]. Previous studies on sonar stereo still require handcrafted features that may not perform stably. In this work, we also emphasize the benefits of sonar stereo. Deterministic multi-view stereo methods can solve the ambiguity problem caused by single-view 3D reconstruction, but may require a large number of viewpoints. This paper aims to solve the 3D reconstruction problem by multi-view stereo but use only 23 viewpoints.

Fig. 1: Workflow of the proposed method. The reference frame and the source frames of sonar stereo are input to the elevation plane sweeping stereo network (EPSSN) with the corresponding relative poses. The EPSSN estimates the pseudo front depth of the reference frame. The acoustic image is an azimuth-range image in which the elevation information is missing. The 3D information can be retrieved by generating azimuth-elevation depth map.

Retrieving 3D information using only a few viewpoints is still difficult, in this work, we propose a deep learning based method for the multi-view sonar stereo problem as shown in Fig. 

1. We propose a novel Elevation Plane Sweeping Stereo Network (EPSSN). The major task is to retrieve the 3D information of the reference frame. The source frames are the ones from viewpoints with small motion change to help the 3D reconstruction of the reference frame. The relative pose between the reference frame and the source frames are assumed given in this work, which can be acquired from control input of a motor, inertial measurement unit (IMU) or a known baseline of sonar stereo system. By providing reference and source frames with relative pose information, it is possible to generate a highly accurate pseudo front view depth for the reference frame. By warping the features of the source frame to virtual elevation planes in the reference frame, it is possible to generate a depth-azimuth-elevation cost volume. Finally, the front view depth can be directly generated from the cost volume according to the geometry relationship. It is noticed that not only the ambiguity can be solved by using 23 viewpoints instead of a single viewpoint, the accuracy of 3D reconstruction results may increase drastically.

Our contributions are listed as follows.

  • We propose a novel CNN-based method to generate a dense 3D reconstruction from few (e.g., two) acoustic images.

  • We propose an elevation plane sweeping scheme to generate the depth-azimuth-elevation cost volume. The front view depth can be calculated from the cost volume. By using the front view depth, the 2D-3D problem can be avoided.

  • We generate synthetic and real datasets for evaluation. The synthetic datasets include natural and artificial scenes, and have been released under an open-source license

    111 Our proposed method outperforms other state-of-the-art methods.

Ii Related Works

Ii-a 3D Reconstruction of Acoustic Images

Early methods used sparse points or lines for the 3D reconstruction of acoustic images. Mai et al. and Huang et al. used manually selected corner points combined with simultaneous localization and mapping (SLAM) systems for 3D information acquisition [Mai20171, Huang2015]. For automatic feature detection, AKAZE features have been proven to be effective in acoustic images that have been used for sonar-based localization [Li2018]. Wang et al. tracked AKAZE features based on optical flow and assumed the terrain as a Gaussian process random field on a tree structure [wangj2019]. The terrain is assumed to be smooth and the computation cost increases drastically when the feature points increase.

For better 3D representation, dense 3D reconstruction has been studied. For multi-view stereo methods, Aykin et al. applied a space carving method for small objects [Aykin2017]. The acoustic images were binary-segmented into object and non-object regions. The space was then voxelized and projected onto the binary image. The voxels projected to the object region were kept and the others were carved. Guerneve et al. utilized a linearized sonar projection model, with min-filtering for space carving [guerneve2018]. Our group proposed a probabilistic method using 3D occupancy mapping with an inverse sensor model [Wangjoe]. It can be applied to more general scenes such as 3D mapping. However, the aforementioned methods require a large number of viewpoints and do not work well with only one or two viewpoints. Photometric stereo methods utilize shadows or other image cues to retrieve 3D information. Aykin et al. physically modeled ultrasound propagation and scenes, and proved that the physical model can be used for 3D reconstruction with object contours [Aykin2016]. Westman et al. utilized a similar scheme and generated 3D model of continuous surface [Westman2019iros]. Non-line-of-sight (NLOS) methods have also been applied to acoustic images and proved to be feasible [Westman2020iros]. However, these methods usually have strong assumptions, e.g., that shadows or Fermat paths [Westman2020iros] can be accurately detected, which is difficult, owing to noise and artifacts.

Recent deep learning-based methods have also been utilized in this field. Debortoli et al. used a UNet-like structure for pixel-wise elevation angle estimation of a single acoustic image [DeBortoli2019]. For the synthetic dataset, the network was trained in a supervised manner. For the real dataset, the network was first trained using synthetic images and fine-tuned using a self-supervised scheme with small motions. The study assumes that each pixel only corresponds to one elevation angle which is not always true. To overcome this limitation, our group proposed a network to learn the front view depth, which is a better representation with no 2D-3D ambiguity problem [Wangicra2021]. The network implicitly learns to transfer an azimuth-range image to an azimuth-elevation depth map. Retrieving 3D information from a single acoustic image is highly ill-posed, and cannot be solved, owing to the ambiguity in many cases that are difficult to apply in real underwater scenarios.

Negahdaripour first proposed sonar stereo, established sonar epipolar geometry theory and proved the optimal configuration of two forward-looking sonars [Negahdaripour2020]. McConnell et al. mounted two orthogonally configured sonars to an ROV to generate dense point clouds [McConnell2020]. Such methods still require handcrafted features in the acoustic images. In this work, we are the first to propose a learning-based method for multi-view sonar stereo. It can be used for structured sonar stereo and multiple images from a single sonar. It can solve the problems caused by sonar imaging and generate accurate dense 3D reconstruction results even for complex underwater scenarios.

Ii-B Learning Multi-view Stereo

For optical cameras, retrieving 3D information from stereo cameras is a long-standing problem. Recently, learning-based end-to-end depth from stereo matching methods has shown outstanding abilities [zbontar2016, Mayer2016, Kendall2017]

. By building 3D cost volumes to fuse the information from stereo images, and using soft argmax to estimate the depth of each pixel, the neural networks show high performance for depth estimation problems

[Kendall2017]. However, they usually assume that the two optical cameras are well calibrated and the image pairs are rectified.

Recently, studies have been conducted on unstructured multi-view stereo [mvsnet, deepmvs]. To build the cost volume, plane sweeping scheme [Collins1996] is applied, by projecting images or features to virtual depth planes in the reference camera coordinates. The projection process was calculated using homography. After acquiring the cost volume, multi-view stereo usually follows the same structure as stereo matching methods. Acoustic cameras have a different projection theory compared to optical cameras, so theories such as homography cannot be applied. In addition, the depth estimation of optical images directly estimates the depth of each pixel in the optical image. However, instead of estimating the elevation angles for each pixel, we estimate the pseudo front depth from the acoustic images.

Iii Preliminaries

Iii-a Sonar Projection Theory

Fig. 2: Geometry model of 2D forward looking sonar. It can be equivalently seen as a projection of a 3D point to the plane, where the elevation information is missing.

A 3D point in the sonar coordinate system is usually represented as in the polar coordinate system, which can be transformed to the Euclidean coordinate as follows.


During the imaging process, the angle information is missing, and the corresponding 2D point in the image coordinates can be written as as shown in Fig. 2. Different from optical imaging, each pixel in the acoustic images may correspond to multiple 3D points. The backscattered intensity of the 3D points with the same accumulates. For each in the acoustic image , discretizing as and the -th corresponding backscattered intensities as gives

Iii-B Pseudo Front Depth

Considering the ultrasound as rays, each ray strikes the object surface and reflected to sensor. Before integration in the elevation angle direction, it can be seen as a common time-of-flight (ToF) sensor, from which a depth map can be acquired. Theoretically, a front depth map contains the entire 3D information of the sensible region as shown in Fig. 1. Instead of estimating the elevation angles directly for each pixel, the front depth map representation can avoid the 2D-3D problem [Wangicra2021]. However, it is less constrained than the direct elevation angle estimation problem.

Iii-C Ambiguity Problem

Fig. 3: An example of the ambiguity problem. The spheres from to may generate the same acoustic image. This may cause ambiguity for inverse problem.

Sonar imaging faces severe ambiguity problems. For example, when sensing a single spherical object, theoretically, all spheres from to will generate the same acoustic image as shown in Fig. 3. Estimating the sphere position from a single image is highly ill-posed. In addition, objects or scenes symmetric to the imaging plane will generate the same image. For CNN-based regression methods, if one input corresponds to multiple ground truth labels, the network cannot learn the problem well. Therefore, it is necessary to study sonar stereo to overcome the aforementioned problems.

Iv Learning Sonar Stereo

Iv-a Overview

Figure 4

shows the network structure overview of EPSSN. The inputs are the reference frame and the source frames with corresponding poses. The output is the pseudo front depth map of the reference frame. We propose two major novel blocks: the elevation plane warping and front depth generation blocks. A 2D CNN is first used to extract deep features

from the acoustic images. The features in the source frames are warped to the reference frame based on elevation plane sweeping to generate cost volumes as explained in Section IV.B. In this study, we use a depth-azimuth-elevation volume. After fusing the cost volumes from the reference frame and the source frames, a following 3D CNN is used to refine and upsample the initial cost volume. Finally, the front depth generation block is used to calculate the front depth map from the cost volume, as explained in Section IV.C.

Fig. 4:

Overview of EPSSN. The reference and source frames are passed through a 2D CNN for feature extraction. We propose a novel feature-level elevation plane warping method to generate the depth-azimuth-elevation cost volume. A 3D CNN is used to refine and upsample the cost volume. Then, a front depth generation block is proposed to generate the front depth map.

Iv-B Elevation Plane Warping

Fig. 5: Elevation plane warping. We generate virtual elevation planes along elevation angle direction. Each plane warps the deep feature and all the warped features are concatenated in a new dimension which refers to the elevation direction.

Plane sweeping stereo has been commonly used in multi-view stereo for optical cameras [Collins1996]. Multi-view images are projected onto virtual depth planes in the reference image coordinates to generate a cost volume. The depth maps are then estimated using this cost volume. This type of theory has never been used for acoustic images because of the difference in projection theory. In this work, we propose an elevation plane sweeping scheme for acoustic images to generate a depth-azimuth-elevation volume and use the volume to further generate front depth maps. We first generate virtual elevation planes along the elevation angle direction as shown in Fig. 5. The elevation angle of the -th plane can be described as follows.


where is an integer from to . For each virtual elevation plane, denoting a 3D point in plane as , the point is first transformed into the Euclidean coordinate as . The point in the reference frame is then transformed into the source coordinate as . Here, denotes the relative motion between the reference and source frames. Then, we transform the point to the polar coordinate as and project the point onto the imaging plane in the source frame. We use bilinear sampling to acquire the corresponding feature in source frame. Then, we can warp feature to elevation plane and the result is denoted as . The warped source frame cost volume can be generated as follows.



(.) refers to the concatenation operation along a new dimension corresponding to the elevation direction. The initial cost volume is a 4D tensor, for which the shape is

, where refers to the length of feature channels. The cost volume for the reference frame itself is denoted as . The volumes are aggregated and the initial cost volume is calculated as follows.


where is the element-wise arithmetic mean of and . This can be easily extended if there are multiple source frames.

Iv-C Front Depth Generation

Fig. 6:

Refined cost volume. The refined cost volume can be considered as a probabilistic volumetric representation of the target. In our problem, the probability distribution along the elevation direction may be multi-modal. On the other hand, we assume that the distribution along the range direction is always uni-modal.

The initial cost volume is noisy and not regularized. A 3D UNet-like structure [unet] is used to refine the initial cost volume, followed by an extra 3D convolution transpose layer to further upsample the cost volume. The last convolution layer makes the size of feature channels to one.

The refined cost volume can be considered as a probabilistic volumetric representation of the target as shown in Fig. 6

. Each voxel has a probability value of occupancy and corresponds to the 3D space in the reference coordinate. In particular, in our cases, the boundaries of the cost volume refer to the scope of the sonar. In this work, we assume that the front depth representation is ideal so that the probability distribution for the range of a

pair is always unimodal. On the other hand, the probability distribution for the elevation angle of a pair is more likely to be multimodal due to the non-bijective projection as explained in Section III.A. Theoretically, the front depth map can be retrieved by winner-take-all (i.e., argmax) [Kendall2017]. However, it is difficult to use the same scheme for the multi-modal distribution of the elevation map case. Because argmax cannot offer subpixel results and is not differentiable for network training, it is usually replaced by soft argmax in the neural network. The depth for is calculated as follows using softmax and the depth value for each cell.


Iv-D Loss Functions

In this study, we estimate the depth information instead of reverse depth. This is mainly because the sensing range - is not a large value (e.g. 2 m) in this work. This may lose precision, owing to float-point numbers when using reverse depth. The shortcoming of using direct depth is that some of the values in the ground truth depth value may exceed , especially for submerging objects, and some depth values may be nearly infinite. During the training time, we provide a binary mask from depth labels as follows.


The loss function is written as follows.


where refers to and

is a hyperparameter that balances the two terms in the equation. Noting that for ground truth depth maps, during training, we set the range larger than

to and the range smaller than to . We assume that the object or scene targets are away from the boundaries. It is then easy to filter the non-informative estimations during the test times.

V Experiment

To evaluate the proposed methods, we generated both simulation and real datasets. Usually, roll rotation is considered to be efficient for retrieving 3D information. Existing sonar stereo systems use concurrent orthogonal sonars for sensing which can be approximately considered as a roll rotation of 90 [Negahdaripour2020, McConnell2020]. However, to ensure a sufficient overlap of the images, we used known small roll rotations (510) to generate image pairs in this work. The influence of motion will be investigated in future studies.

V-a Simulation Datasets

The simulation datasets were generated in Blender environment which can be found on our GitHub222 We prepared three datasets with different scenes, including artificial environment, terrain, and sphere. The artificial environment was built using a 3D CAD model which is also the ground truth of our real dataset. Image pairs were generated at random positions throughout the scene. We generated 4,000 image pairs for training and 2,000 pairs for test. For the terrain dataset, we generated terrains with the A.N.T. Landscape add-on in Blender by using hetero noise, which simulated undulating seabed. We trained the network on one terrain and tested it on another terrain. We generated 1,000 image pairs for training and 1,000 pairs for test. The sphere dataset was used to verify the ambiguity problem explained in Section III.C. To generate the dataset, the sphere was located at a random position within the sonar scope.

V-B Real Datasets

Fig. 7: Water tank experiment. (a) Moving device for collecting data from different positions. (b) Target in the large scale water tank.

We collected a real dataset in a large scale water tank with Sound Metrics ARIS EXPLORER 3000 imaging sonar. The configuration of the objects was designed and then set up by a diver in the water tank. The ground truth of the objects was acquired using Agisoft Metashape in land. As shown in Fig. 7, a moving device was built to move the sonar to different positions. The sonar was mounted on a Sound Metrics AR2 rotator for roll rotations. To acquire the ground truth pose of the sonar in water tank coordinates, we calibrated the initial pose of the sonar by matching the edges of the real and synthetic images [wang2022]. The rest of the poses were calculated from the control input of the moving device and the roll rotations. The ground truth depth labels were generated by inputting the sonar pose into the simulator.

The sonar with the rotator was moved by the device to 51 positions. For each position, the sonar performed a roll rotation from -35 to 35 and an acoustic video was recorded. Denoting the roll angle for the reference frame as , the source frame is taken at the roll angle of approximately . In total, we generated 1,493 pairs of reference and source images. The dataset was then separated randomly at an 8:2 ratio for training and test. Thus, there were 1,194 pairs for training and 299 pairs for test. The resolution in the range direction for the real images was downsampled by twice. All the heights and widths for the synthetic images and real images were 512128.

V-C Training

(b) Ground truth
(c) A2FNet [Wangicra2021]
(d) ElevateNet [DeBortoli2019]
(e) Ours
Fig. 8: Visualization examples of simulation datasets. (a) Reference frames. Each row refers to a simulation dataset, which includes artificial (simulation), terrain, and sphere. The results are shown in depth maps. The color refers to depth values. Red and blue refer to the minimum and maximum, respectively.
(b) Ground truth
(c) A2FNet [Wangicra2021]
(d) ElevateNet [DeBortoli2019]
(e) Ours
Fig. 9: Visualization examples of real datasets. (a) Reference frames. We first transfer the estimation results to point clouds and then generate mesh models for comparison.

The proposed method was implemented in PyTorch. The networks were trained and tested on NVIDIA GeForce RTX 3090 graphics card. For each dataset, we trained 50 epochs. The learning rate began at 0.001 and was reduced by half after 25 epochs. The batch size was set to 16 for training and test. Adam optimizer was used for optimization.

was set to 3.

V-D Metrics

To evaluate the results, two metrics were used: the mean absolute error (MAE) for the front view depth map, and chamfer distance (CD) for the point cloud.


where and are scale parameters, which refer to 1,000 and 500, respectively. and refer to the two point sets.

V-E Results

To evaluate our method, we also compared our results with state-of-the-art methods, ElevateNet [DeBortoli2019] and A2FNet [Wangicra2021]. To implement ElevateNet, we chose UNet [unet] as the backbone for training, with a sigmoid layer before the final output. Elevation maps were also generated using our simulator. The network was trained in a supervised manner using L1 loss. We also prepare a binary mask to segment the signal/non-signal region and the loss for the non-signal region was set to zero. During test time, the mask was multiplied to the network output. For A2FNet, the inverse pixel shuffle block was replaced by a four-layer CNN for feature extraction. We used the same hyper-parameters to train the baseline methods as our EPSSN.

The results of the simulation datasets are listed in Table I

. We show the average error and the standard deviation. All the metrics are the smaller the better. For artificial (simulation) and terrain datasets, our proposed method can generate more accurate and precise results compared to the baselines. For sphere datasets, the baseline methods using a single image can hardly deal with the problem; however, our proposed method can successfully estimate the 3D information. Figure 

8 shows an example of the visualization of the results. For ElevateNet, although elevation maps were estimated, they were transferred to front view depth maps. The color refers to distance, where blue and red refer to the minimum and maximum distances, respectively. It can be seen from the results that our proposed method is closest to the ground truth. ElevateNet suffers from the incompleteness caused by non-bijective correspondence. For sphere datasets, A2FNet tends to estimate all the possible positions of the sphere; on the other hand, ElevateNet tends to estimate one of the possible results. By using two or more images, such ambiguity can be resolved and our proposed method can estimate the true results. Later, we will discuss how viewpoint numbers will influence our proposed method and the baseline methods.

The results for the real dataset are listed in Table II. Our proposed method is still far more better than the baseline methods. Examples of visualizing the results are shown in Fig. 9. We show the results of the mesh model for comparison. The mesh models were generated in MeshLab by first estimating normals and using ball pivoting. Clearly, the proposed method generated the 3D model with the best quality.

Artificial (simulation)
A2FNet ElevateNet Ours
CD [m] 0.27070.1053 0.32630.1301 0.13680.0473
MAE [m] 10.41 3.09 4.901.89
A2FNet ElevateNet Ours
CD [m] 0.9962 0.6155 0.5102 0.2394 0.22000.0837
MAE [m] 42.1916.59 14.074.67
A2FNet ElevateNet Ours
CD [m] 24.7811 27.1205 9.6571 9.4248 1.46032.4521
MAE [m] 71.5620.46 9.605.87
TABLE I: Simulation results
Artificial (real water tank)
A2FNet ElevateNet Ours
CD [m] 0.86770.2583 0.76050.2234 0.28220.0879
MAE [m] 30.54 7.07 13.622.94
TABLE II: Real Experiment results

V-F Discussions

V-F1 Viewpoint Numbers

The viewpoint numbers may influence the results. For a fair comparison, we also added viewpoints for ElevateNet and A2FNet by concatenating the reference and source images as inputs and modifying the channels of the networks. We chose the artificial and terrain datasets for training and test. The performance of ElevateNet seems unstable which significantly improved on terrain dataset but became worse on artificial dataset. This may because the 2D-3D problem is more obvious on artificial dataset. Our proposed method still outperforms the baseline. We also tied to increase the viewpoint number for EPSSN to three on terrain dataset, the CD was 0.13330.0573 m, which shows significant improvements.

Artificial (simulation)
A2FNet (two) ElevateNet (two) Ours
CD [m] 0.27070.1053 0.37450.1597 0.13680.0473
A2FNet (two) ElevateNet (two) Ours
CD [m] 0.90970.3439 0.35410.1040 0.22000.0837
TABLE III: Viewpoint Numbers

V-F2 Regression on Elevation Map

Fig. 10: Ablation test results using the CD. Doing regression on the depth outperforms the regression on elevation. Increasing viewpoints may improve the performance of the proposed method.

Front depth generation from the cost volume and regression on the front depth map is one of the main contributions of this work. We tried to replace this part by generating elevation map and performing regression on it. Figure 10 shows the performance of elevation map regression and depth map regression on CD. For the same number of viewpoints, depth map regression significantly outperforms elevation map regression.

V-F3 Ablation on Elevation Plane Warping

An ablation test was carried out on the elevation plane warping block when there were more viewpoints (three). By removing the warping process and simply adding the cost volumes from the reference frame and the source frames, the final result may be even worse than the case using two viewpoints with warping as shown in Table IV. In other words, the elevation plane warping successfully fuse the information from multiple frames with known motion information.

w / o warping (three) w warping (three)
CD [m] 0.27340.0819 0.13330.0573
MAE [m] 16.26 4.59 9.132.70
TABLE IV: Ablation test on elevation plane warping

V-F4 Comparison with Occupancy Mapping

We also compared the proposed method with a previous method using occupancy mapping [Wangjoe]. We used two frames, one reference frame and one source frame, with known relative poses to generate the local map. With only two frames with a small roll angle difference, the results of occupancy mapping did not converge. On terrain dataset, the CD error for the test set was .

Vi Conclusions

In this study, we proposed a novel method for retrieving the missing 3D information in the acoustic images. By utilizing elevation plane warping to fuse the information from multiple frames and learning the pseudo front depth, our proposed method achieved state-of-the-art performance on both simulation and real datasets. It can also deal with severe ambiguous cases such as submerging sphere in water bodies. Our simulation datasets were made open source for future comparisons. The methods can be applied to underwater vehicles with two forward-looking sonars or a single forward-looking sonar with multiple views.

Future work may include self-supervised network training and attempting various motions. Collecting datasets in real underwater environments is extremely difficult. In this study, the datasets were collected in a structured water tank where depth labels could be generated from our simulator. However, this is difficult to transfer to oceanic environments. Training the network without depth labels is worth investigating. Other methods, such as generative adversarial networks (GANs) may be valuable for mitigating the sim-to-real gap so that we can train the network based on synthetic data

[liu2021]. In this work, the motions between the reference frame and the source frames were pure roll rotation; other motions may also be tested in the future.