I. Introduction
When water visibility is poor, sonar is often the only viable modality for underwater sensing. High-frequency 2D forward-looking sonars, also known as acoustic cameras, can generate high-resolution 2D acoustic images in almost any body of water. They are compact and can be easily mounted on remotely operated vehicles (ROVs) or autonomous underwater vehicles (AUVs). In the last few years, these sensors have achieved remarkable results in tasks such as mosaicking, 3D mapping, and robot navigation [Hurtos2015, Wangjoe, Wangicra2021, Negahdaripour2013]. Because sonar is a 2D imaging sensor, it suffers from a missing-dimension problem similar to that of optical cameras. However, whereas optical cameras lose depth information, acoustic cameras lose the information along the elevation angle direction during image formation. Retrieving 3D information from 2D acoustic images is therefore a nontrivial problem: the unique imaging model, severe noise, and sonar artifacts all make it difficult.
Early works on 2D forward-looking sonar utilized sparse features, such as corners or edges, for 3D reconstruction [Mai20171, Mai20172, Huang2015]. Points or lines were used to represent the sparse 3D model, which is not intuitive for human comprehension. More recently, dense 3D reconstruction methods based on multi-view stereo or photometric stereo have been proposed. Multi-view stereo methods typically rely on an efficient space carving scheme that exploits the known sensing range and limited field of view (FoV) of 2D forward-looking sonars [Aykin2017, guerneve2018, Wangjoe]. However, these deterministic multi-view stereo methods require a large number of viewpoints; the 3D reconstruction result is extremely poor when only a few viewpoints (e.g., 2-3) are available. Photometric stereo methods can generate a 3D model from a single acoustic image by modeling ultrasound propagation or using shadow information [Aykin2016, Westman2019iros]. However, such methods require ideal conditions and usually make strong assumptions about the scene. Recently, learning-based methods have demonstrated successful results in computer vision problems. A convolutional neural network (CNN)-based method was proposed to estimate the elevation angle of each pixel from a single acoustic image
[DeBortoli2019]. However, in acoustic images one pixel may correspond to multiple elevation angles; this non-bijective 2D-3D correspondence problem is denoted as the 2D-3D problem in the remainder of this paper. To address it, our research group proposed a CNN-based method (A2FNet) to learn the front view depth from a single acoustic image [Wangicra2021]. In fact, estimating 3D information from a single acoustic image is a highly ill-posed problem that cannot be solved in many cases, owing to the ambiguity inherent in acoustic imaging (see Section III-C). In addition, the accuracy of single-view 3D reconstruction is limited. Recently, studies have focused on sonar stereo [Negahdaripour2020, McConnell2020], but they still require handcrafted features that may not perform stably. In this work, we also emphasize the benefits of sonar stereo. Deterministic multi-view stereo methods can resolve the ambiguity of single-view 3D reconstruction, but may require a large number of viewpoints. This paper aims to solve the 3D reconstruction problem by multi-view stereo using only 2-3 viewpoints.
Since retrieving 3D information from only a few viewpoints is still difficult, we propose a deep-learning-based method for the multi-view sonar stereo problem, as shown in Fig. 1. We propose a novel Elevation Plane Sweeping Stereo Network (EPSSN). The main task is to retrieve the 3D information of the reference frame; the source frames, taken from viewpoints with small motion changes, assist the 3D reconstruction of the reference frame. The relative poses between the reference frame and the source frames are assumed to be given in this work; they can be acquired from the control input of a motor, an inertial measurement unit (IMU), or the known baseline of a sonar stereo system. Given reference and source frames with relative pose information, it is possible to generate a highly accurate pseudo front view depth for the reference frame. By warping the features of the source frames to virtual elevation planes in the reference frame, we generate a depth-azimuth-elevation cost volume; the front view depth can then be directly computed from the cost volume according to the imaging geometry. Note that using 2-3 viewpoints instead of a single viewpoint not only resolves the ambiguity but can also drastically increase the accuracy of the 3D reconstruction results. Our contributions are listed as follows.

We propose a novel CNN-based method to generate a dense 3D reconstruction from a few (e.g., two) acoustic images.

We propose an elevation plane sweeping scheme to generate a depth-azimuth-elevation cost volume. The front view depth can be calculated from the cost volume; by using the front view depth, the 2D-3D problem can be avoided.

We generate synthetic and real datasets for evaluation. The synthetic datasets include natural and artificial scenes and have been released under an open-source license¹. Our proposed method outperforms other state-of-the-art methods.

¹https://github.com/sollynoay/EPSSN
II. Related Works
II-A 3D Reconstruction of Acoustic Images
Early methods used sparse points or lines for the 3D reconstruction of acoustic images. Mai et al. and Huang et al. used manually selected corner points combined with simultaneous localization and mapping (SLAM) systems for 3D information acquisition [Mai20171, Huang2015]. For automatic feature detection, AKAZE features have proven effective in acoustic images and have been used for sonar-based localization [Li2018]. Wang et al. tracked AKAZE features based on optical flow and modeled the terrain as a Gaussian process random field on a tree structure [wangj2019]. However, the terrain is assumed to be smooth, and the computational cost increases drastically as the number of feature points grows.
For better 3D representation, dense 3D reconstruction has been studied. Among multi-view stereo methods, Aykin et al. applied a space carving method to small objects [Aykin2017]. The acoustic images were binary-segmented into object and non-object regions; the space was then voxelized and projected onto the binary image, and voxels projected onto the object region were kept while the others were carved away. Guerneve et al. utilized a linearized sonar projection model with min-filtering for space carving [guerneve2018]. Our group proposed a probabilistic method using 3D occupancy mapping with an inverse sensor model [Wangjoe], which can be applied to more general scenes such as 3D mapping. However, the aforementioned methods require a large number of viewpoints and do not work well with only one or two. Photometric stereo methods utilize shadows or other image cues to retrieve 3D information. Aykin et al. physically modeled ultrasound propagation and the scene, and proved that the physical model can be used for 3D reconstruction given object contours [Aykin2016]. Westman et al. utilized a similar scheme and generated 3D models of continuous surfaces [Westman2019iros]. Non-line-of-sight (NLOS) methods have also been applied to acoustic images and proved feasible [Westman2020iros]. However, these methods usually rely on strong assumptions, e.g., that shadows or Fermat paths [Westman2020iros] can be accurately detected, which is difficult owing to noise and artifacts.
Recent deep learning-based methods have also been utilized in this field. DeBortoli et al. used a U-Net-like structure for pixel-wise elevation angle estimation from a single acoustic image [DeBortoli2019]. For the synthetic dataset, the network was trained in a supervised manner; for the real dataset, it was first trained on synthetic images and then fine-tuned with a self-supervised scheme using small motions. The study assumes that each pixel corresponds to only one elevation angle, which is not always true. To overcome this limitation, our group proposed a network to learn the front view depth, a better representation that is free of the 2D-3D ambiguity problem [Wangicra2021]. The network implicitly learns to transfer an azimuth-range image to an azimuth-elevation depth map. Nevertheless, retrieving 3D information from a single acoustic image is highly ill-posed and, owing to the ambiguity, cannot be solved in many cases, which makes it difficult to apply in real underwater scenarios.
Negahdaripour first proposed sonar stereo, established the theory of sonar epipolar geometry, and proved the optimal configuration of two forward-looking sonars [Negahdaripour2020]. McConnell et al. mounted two orthogonally configured sonars on an ROV to generate dense point clouds [McConnell2020]. Such methods still require handcrafted features in the acoustic images. In this work, we propose the first learning-based method for multi-view sonar stereo. It can be used both for a structured sonar stereo rig and for multiple images from a single sonar. It can solve the problems caused by sonar imaging and generate accurate, dense 3D reconstruction results even for complex underwater scenarios.
II-B Learning Multi-view Stereo
For optical cameras, retrieving 3D information from stereo cameras is a long-standing problem. Recently, learning-based end-to-end depth-from-stereo-matching methods have shown outstanding capabilities [zbontar2016, Mayer2016, Kendall2017]. By building 3D cost volumes to fuse the information from stereo images and using soft argmax to estimate the depth of each pixel, such neural networks achieve high performance on depth estimation problems [Kendall2017]. However, they usually assume that the two optical cameras are well calibrated and that the image pairs are rectified. Recently, studies have also addressed unstructured multi-view stereo [mvsnet, deepmvs]. To build the cost volume, a plane sweeping scheme [Collins1996] is applied, projecting images or features onto virtual depth planes in the reference camera coordinates; the projection is calculated using a homography. After acquiring the cost volume, multi-view stereo usually follows the same structure as stereo matching methods. Acoustic cameras have a different projection model from optical cameras, so tools such as homographies cannot be applied. In addition, depth estimation for optical images directly estimates the depth of each pixel in the image; in contrast, instead of estimating the elevation angle of each pixel, we estimate the pseudo front depth from the acoustic images.
III. Preliminaries
III-A Sonar Projection Theory
A 3D point in the sonar coordinate system is usually represented in the polar coordinate system as $(r, \theta, \phi)$, with range $r$, azimuth angle $\theta$, and elevation angle $\phi$, which can be transformed to Euclidean coordinates as follows:

\[
\begin{bmatrix} x \\ y \\ z \end{bmatrix} = r \begin{bmatrix} \cos\phi\sin\theta \\ \cos\phi\cos\theta \\ \sin\phi \end{bmatrix}. \tag{1}
\]

During the imaging process, the elevation angle $\phi$ is lost, and the corresponding 2D point in the image coordinates is written as $(r, \theta)$, as shown in Fig. 2. Unlike optical imaging, each pixel in an acoustic image may correspond to multiple 3D points, and the backscattered intensities of the 3D points sharing the same $(r, \theta)$ accumulate. For each $(r, \theta)$ in the acoustic image $I$, discretizing $\phi$ as $\phi_i$ and denoting the $i$-th corresponding backscattered intensity as $I_i$ gives

\[
I(r, \theta) = \sum_{i} I_i(r, \theta, \phi_i).
\]
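The projection model of Eq. (1) and the intensity accumulation over the elevation direction can be sketched as follows; this is a minimal illustration, and the function names and discretization are ours, not the paper's implementation.

```python
import math

def polar_to_euclidean(r, theta, phi):
    """Eq. (1): convert a sonar polar point (range r, azimuth theta,
    elevation phi) to Euclidean coordinates."""
    return (r * math.cos(phi) * math.sin(theta),
            r * math.cos(phi) * math.cos(theta),
            r * math.sin(phi))

def accumulate_intensities(points):
    """Sum backscattered intensities over the elevation direction: all 3D
    points sharing the same (r, theta) bin land on the same image pixel."""
    image = {}
    for r, theta, phi, intensity in points:
        key = (round(r, 3), round(theta, 3))  # discretized pixel bin
        image[key] = image.get(key, 0.0) + intensity
    return image
```

Two points that differ only in elevation angle contribute to the same pixel, which is exactly the source of the non-bijective 2D-3D correspondence.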
III-B Pseudo Front Depth
Considering the ultrasound as rays, each ray strikes the object surface and is reflected back to the sensor. Before the integration in the elevation angle direction, the sonar can be regarded as a common time-of-flight (ToF) sensor, from which a depth map can be acquired. Theoretically, a front depth map contains the entire 3D information of the sensible region, as shown in Fig. 1. Instead of directly estimating the elevation angle of each pixel, the front depth map representation avoids the 2D-3D problem [Wangicra2021]. However, it is less constrained than the direct elevation angle estimation problem.
III-C Ambiguity Problem
Sonar imaging faces severe ambiguity problems. For example, when sensing a single spherical object, all spheres lying along the same elevation arc (i.e., sharing the same $(r, \theta)$ but different $\phi$) theoretically generate the same acoustic image, as shown in Fig. 3. Estimating the sphere position from a single image is thus highly ill-posed. In addition, objects or scenes symmetric with respect to the imaging plane generate the same image. For CNN-based regression methods, if one input corresponds to multiple ground-truth labels, the network cannot learn the problem well. Therefore, it is necessary to study sonar stereo to overcome the aforementioned problems.
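The mirror ambiguity can be checked numerically with the minimal sketch below, using the projection of Section III-A; the function names and sample scene are illustrative. Reflecting a scene across the zero-elevation plane leaves every projected pixel unchanged.

```python
def project(point):
    """Acoustic projection: the elevation angle is discarded, so a 3D
    point (r, theta, phi) maps to the image coordinate (r, theta)."""
    r, theta, phi = point
    return (r, theta)

def mirror(point):
    """Reflect a point across the zero-elevation (imaging) plane."""
    r, theta, phi = point
    return (r, theta, -phi)

# Sample points on a small surface lying above the imaging plane.
scene = [(1.0 + 0.01 * k, 0.05 * k, 0.3 - 0.02 * k) for k in range(10)]
mirrored = [mirror(p) for p in scene]

# Both scenes produce identical acoustic images: a single view cannot
# distinguish them, which is why a CNN regressor receives one input
# with two valid ground-truth labels.
same = [project(p) for p in scene] == [project(p) for p in mirrored]
```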
IV. Learning Sonar Stereo
IV-A Overview
Figure 4 shows an overview of the network structure of EPSSN. The inputs are the reference frame and the source frames with corresponding poses; the output is the pseudo front depth map of the reference frame. We propose two major novel blocks: the elevation plane warping block and the front depth generation block. A 2D CNN is first used to extract deep features from the acoustic images. The features of the source frames are warped to the reference frame by elevation plane sweeping to generate cost volumes, as explained in Section IV-B. In this study, we use a depth-azimuth-elevation volume. After fusing the cost volumes from the reference frame and the source frames, a 3D CNN refines and upsamples the initial cost volume. Finally, the front depth generation block calculates the front depth map from the cost volume, as explained in Section IV-C.

IV-B Elevation Plane Warping
Plane sweeping stereo has been commonly used in multi-view stereo for optical cameras [Collins1996]: multi-view images are projected onto virtual depth planes in the reference image coordinates to generate a cost volume, and depth maps are then estimated from this cost volume. This idea has never been used for acoustic images because of the difference in projection models. In this work, we propose an elevation plane sweeping scheme for acoustic images to generate a depth-azimuth-elevation volume and use this volume to generate front depth maps. We first generate $N$ virtual elevation planes along the elevation angle direction, as shown in Fig. 5. The elevation angle of the $i$-th plane can be described as follows:

\[
\phi_i = \phi_{min} + i\,\frac{\phi_{max} - \phi_{min}}{N - 1}, \tag{2}
\]

where $i$ is an integer from $0$ to $N-1$. For each virtual elevation plane, denoting a 3D point in plane $i$ as $(r, \theta, \phi_i)$, the point is first transformed into Euclidean coordinates as $P_{ref}$ using Eq. (1). The point in the reference frame is then transformed into the source coordinates as $P_{src} = T P_{ref}$, where $T$ denotes the relative motion between the reference and source frames. We then transform the point back to polar coordinates as $(r', \theta', \phi')$ and project it onto the imaging plane of the source frame at $(r', \theta')$. We use bilinear sampling to acquire the corresponding feature in the source frame; the feature warped to elevation plane $i$ is denoted as $\tilde{F}_i$. The warped source-frame cost volume can then be generated as follows:
\[
V_{src} = \mathcal{C}\big(\tilde{F}_0, \tilde{F}_1, \ldots, \tilde{F}_{N-1}\big), \tag{3}
\]

where $\mathcal{C}(\cdot)$ refers to the concatenation operation along a new dimension corresponding to the elevation direction. The initial cost volume is a 4D tensor of shape $C \times N \times H \times W$, where $C$ refers to the number of feature channels and $H \times W$ to the size of the feature maps. The cost volume for the reference frame itself is denoted as $V_{ref}$. The volumes are aggregated and the initial cost volume is calculated as follows:

\[
V_{init} = \tfrac{1}{2}\big(V_{ref} + V_{src}\big), \tag{4}
\]

i.e., $V_{init}$ is the element-wise arithmetic mean of $V_{ref}$ and $V_{src}$. This can easily be extended when there are multiple source frames.
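The warping and aggregation steps above can be sketched as follows. This is a simplified NumPy version: nearest-neighbour sampling stands in for the bilinear sampling of the paper, and all shapes, names, and loop structure are illustrative rather than the actual (vectorized, GPU) implementation.

```python
import numpy as np

def warp_to_elevation_planes(feat_src, T, r_vals, th_vals, phi_vals):
    """Warp a source feature map feat_src of shape (C, R, A) onto the N
    virtual elevation planes of the reference frame, yielding a volume
    of shape (C, N, R, A). T is the 4x4 reference-to-source pose."""
    C, R, A = feat_src.shape
    vol = np.zeros((C, len(phi_vals), R, A), dtype=feat_src.dtype)
    eps = 1e-9
    for i, phi in enumerate(phi_vals):
        for ri, r in enumerate(r_vals):
            for ai, th in enumerate(th_vals):
                # Reference polar -> Euclidean (Eq. (1)), homogeneous form.
                p = np.array([r * np.cos(phi) * np.sin(th),
                              r * np.cos(phi) * np.cos(th),
                              r * np.sin(phi), 1.0])
                q = T @ p                      # into the source frame
                r_s = np.linalg.norm(q[:3])    # source range
                th_s = np.arctan2(q[0], q[1])  # source azimuth
                # Project by dropping elevation; skip out-of-scope points.
                if not (r_vals[0] - eps <= r_s <= r_vals[-1] + eps and
                        th_vals[0] - eps <= th_s <= th_vals[-1] + eps):
                    continue
                rj = int(round(np.interp(r_s, r_vals, np.arange(R))))
                aj = int(round(np.interp(th_s, th_vals, np.arange(A))))
                vol[:, i, ri, ai] = feat_src[:, rj, aj]
    return vol

def fuse(v_ref, v_src):
    """Eq. (4): element-wise mean of reference and source volumes."""
    return 0.5 * (v_ref + v_src)
```

With the identity pose, every elevation plane simply reproduces the source features, which is a convenient sanity check of the geometry.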
IV-C Front Depth Generation
The initial cost volume is noisy and unregularized. A 3D U-Net-like structure [unet] is used to refine the initial cost volume, followed by an extra 3D transposed convolution layer to further upsample it. The last convolution layer reduces the number of feature channels to one.
The refined cost volume can be considered a probabilistic volumetric representation of the target, as shown in Fig. 6. Each voxel holds an occupancy probability and corresponds to a region of 3D space in the reference coordinates; in particular, in our case, the boundaries of the cost volume correspond to the scope of the sonar. In this work, we assume that the front depth representation is ideal, so that the probability distribution over range for each (azimuth, elevation) pair is always unimodal. On the other hand, the probability distribution over the elevation angle for each (azimuth, range) pair is more likely to be multimodal, owing to the non-bijective projection explained in Section III-A. Theoretically, the front depth map could be retrieved by winner-take-all (i.e., argmax) [Kendall2017]; however, the same scheme is difficult to use for the multimodal distribution of the elevation map case. Moreover, because argmax cannot offer sub-pixel results and is not differentiable for network training, it is usually replaced by soft argmax in the neural network. The depth for each (azimuth, elevation) pair $(\theta, \phi)$ is calculated using softmax over the range dimension and the depth value of each cell as follows:

\[
\hat{d}(\theta, \phi) = \sum_{i} r_i \cdot \mathrm{softmax}\big(V(\cdot, \theta, \phi)\big)_i. \tag{5}
\]
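The soft argmax of Eq. (5) can be illustrated with the sketch below, where each (azimuth, elevation) column of the refined volume is turned into a distribution over range bins; the tensor layout is an assumption for illustration.

```python
import numpy as np

def soft_argmax_depth(volume, r_vals):
    """Soft argmax over the range dimension of a refined cost volume.
    volume: (R, A, E) scores over R range bins for each of the A x E
    (azimuth, elevation) cells; returns an (A, E) front depth map."""
    v = volume - volume.max(axis=0, keepdims=True)  # numerical stability
    p = np.exp(v)
    p /= p.sum(axis=0, keepdims=True)               # softmax over range
    # Expected range under the softmax distribution: differentiable and
    # sub-bin accurate, unlike hard argmax.
    return np.tensordot(r_vals, p, axes=(0, 0))
```

When the distribution over range is sharply unimodal, the expected value coincides with the winner-take-all result; this is exactly the assumption made for the front depth representation.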
IV-D Loss Functions
In this study, we estimate the depth directly instead of the inverse depth. This is mainly because the sensing range $r_{max}$ is not large (e.g., 2 m) in this work, and inverse depth may lose precision owing to floating-point representation. The shortcoming of using direct depth is that some values in the ground-truth depth map may exceed $r_{max}$, especially for submerged objects, and some depth values may be nearly infinite. During training, we therefore build a binary mask from the depth labels as follows:

\[
M(\theta, \phi) =
\begin{cases}
1, & r_{min} \le d(\theta, \phi) \le r_{max} \\
0, & \text{otherwise.}
\end{cases} \tag{6}
\]
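The mask construction and its use in a masked depth penalty can be sketched as below. The masked L1 form is an assumed illustration of the depth term; the paper's full loss balances a second term with a hyperparameter.

```python
import numpy as np

def valid_mask(depth_gt, r_min, r_max):
    """Eq. (6): binary mask that is 1 where the ground-truth depth lies
    within the sonar's sensing range and 0 elsewhere (e.g., background
    cells with nearly infinite depth)."""
    return ((depth_gt >= r_min) & (depth_gt <= r_max)).astype(np.float64)

def masked_l1(pred, gt, mask):
    """L1 penalty restricted to valid cells. np.where avoids the
    inf * 0 = NaN pitfall when invalid ground-truth cells are infinite."""
    err = np.where(mask > 0, np.abs(pred - gt), 0.0)
    return float(err.sum() / max(mask.sum(), 1.0))
```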
The loss function is written as follows:

\[
\mathcal{L} = \big\| M \odot (\hat{D} - D) \big\|_1 + \lambda \big\| (1 - M) \odot (\hat{D} - D) \big\|_1, \tag{7}
\]

where $\odot$ refers to the element-wise product and $\lambda$ is a hyperparameter that balances the two terms. Note that for the ground-truth depth maps used during training, we set ranges larger than $r_{max}$ to $r_{max}$ and ranges smaller than $r_{min}$ to $r_{min}$. We assume that the object or scene targets are away from the boundaries, so it is easy to filter out the non-informative estimations at test time.

V. Experiment
To evaluate the proposed method, we generated both simulation and real datasets. Roll rotation is usually considered efficient for retrieving 3D information. Existing sonar stereo systems use concurrent orthogonal sonars, which can be approximately considered a roll rotation of 90° [Negahdaripour2020, McConnell2020]. However, to ensure sufficient overlap between the images, we used known small roll rotations (5-10°) to generate image pairs in this work. The influence of the motion will be investigated in future studies.
V-A Simulation Datasets
The simulation datasets were generated in a Blender environment, which can be found on our GitHub². We prepared three datasets with different scenes: artificial environment, terrain, and sphere. The artificial environment was built from a 3D CAD model, which is also the ground truth of our real dataset. Image pairs were generated at random positions throughout the scene: 4,000 image pairs for training and 2,000 pairs for testing. For the terrain dataset, we generated terrains with the A.N.T. Landscape add-on in Blender using hetero noise, which simulates an undulating seabed. We trained the network on one terrain and tested it on another, with 1,000 image pairs for training and 1,000 pairs for testing. The sphere dataset was used to verify the ambiguity problem explained in Section III-C; to generate it, the sphere was placed at a random position within the sonar scope.

²https://github.com/sollynoay/Sonar-simulator-blender
V-B Real Datasets
We collected a real dataset in a large-scale water tank with a Sound Metrics ARIS EXPLORER 3000 imaging sonar. The configuration of the objects was designed and then set up by a diver in the water tank. The ground truth of the objects was acquired on land using Agisoft Metashape. As shown in Fig. 7, a moving device was built to move the sonar to different positions, and the sonar was mounted on a Sound Metrics AR2 rotator for roll rotations. To acquire the ground-truth pose of the sonar in water tank coordinates, we calibrated the initial pose of the sonar by matching the edges of the real and synthetic images [wang2022]; the remaining poses were calculated from the control input of the moving device and the roll rotations. The ground-truth depth labels were generated by inputting the sonar pose into the simulator.
The sonar with the rotator was moved by the device to 51 positions. For each position, the sonar performed a roll rotation from -35° to 35°, and an acoustic video was recorded. Denoting the roll angle of the reference frame as $\theta_{roll}$, the source frame is taken at a roll angle of approximately $\theta_{roll} + \Delta$ for a small known offset $\Delta$. In total, we generated 1,493 pairs of reference and source images. The dataset was then randomly split at an 8:2 ratio: 1,194 pairs for training and 299 pairs for testing. The resolution in the range direction of the real images was downsampled by a factor of two. The height and width of all synthetic and real images were 512×128.
V-C Training
The proposed method was implemented in PyTorch. The networks were trained and tested on an NVIDIA GeForce RTX 3090 graphics card. For each dataset, we trained for 50 epochs. The learning rate began at 0.001 and was halved after 25 epochs. The batch size was set to 16 for both training and testing. The Adam optimizer was used for optimization, and $\lambda$ was set to 3.

V-D Metrics
To evaluate the results, two metrics were used: the mean absolute error (MAE) of the front view depth map and the chamfer distance (CD) of the point cloud:

\[
\mathrm{MAE} = \frac{\alpha}{|\Omega|} \sum_{(\theta,\phi) \in \Omega} \big| \hat{d}(\theta,\phi) - d(\theta,\phi) \big|, \tag{8}
\]

\[
\mathrm{CD} = \frac{\beta}{|P_1|} \sum_{p \in P_1} \min_{q \in P_2} \|p - q\|_2 + \frac{\beta}{|P_2|} \sum_{q \in P_2} \min_{p \in P_1} \|q - p\|_2, \tag{9}
\]

where $\Omega$ is the set of valid depth cells, $\alpha$ and $\beta$ are scale parameters, which refer to 1,000 and 500, respectively, and $P_1$ and $P_2$ refer to the two point sets.
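A sketch of the two metrics follows (Eqs. (8)-(9)); the exact scaling and averaging conventions are assumptions for illustration.

```python
import numpy as np

def mae(pred, gt, mask, alpha=1000.0):
    """Eq. (8): scaled mean absolute error over valid front-depth cells.
    np.where keeps infinite background labels out of the sum."""
    err = np.where(mask > 0, np.abs(pred - gt), 0.0)
    return float(alpha * err.sum() / max(mask.sum(), 1.0))

def chamfer(p1, p2, beta=500.0):
    """Eq. (9): symmetric chamfer distance between point sets p1 (n, 3)
    and p2 (m, 3); each term is averaged over its set and scaled."""
    d = np.linalg.norm(p1[:, None, :] - p2[None, :, :], axis=2)
    return float(beta * (d.min(axis=1).mean() + d.min(axis=0).mean()))
```

The pairwise-distance matrix makes this O(nm) in memory, which is fine for evaluation-sized point clouds; a KD-tree would be preferable for large sets.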
V-E Results
To evaluate our method, we compared our results with the state-of-the-art methods ElevateNet [DeBortoli2019] and A2FNet [Wangicra2021]. To implement ElevateNet, we chose U-Net [unet] as the backbone, with a sigmoid layer before the final output. Elevation maps were also generated using our simulator, and the network was trained in a supervised manner using the L1 loss. We also prepared a binary mask to segment the signal and non-signal regions, and the loss for the non-signal region was set to zero; during testing, the mask was multiplied with the network output. For A2FNet, the inverse pixel shuffle block was replaced by a four-layer CNN for feature extraction. We used the same hyperparameters to train the baseline methods as for our EPSSN.
The results on the simulation datasets are listed in Table I, reported as mean and standard deviation; smaller is better for all metrics. For the artificial (simulation) and terrain datasets, our proposed method generates more accurate and precise results than the baselines. For the sphere dataset, the baseline methods using a single image can hardly deal with the problem, whereas our proposed method successfully estimates the 3D information. Figure 8 shows a visualization of example results. For ElevateNet, although elevation maps were estimated, they were converted to front view depth maps. The color encodes distance: blue and red refer to the minimum and maximum distances, respectively. The results show that our proposed method is closest to the ground truth. ElevateNet suffers from incompleteness caused by the non-bijective correspondence. On the sphere dataset, A2FNet tends to estimate all possible positions of the sphere, while ElevateNet tends to estimate one of the possible results. By using two or more images, this ambiguity is resolved and our proposed method estimates the true result. We discuss below how the number of viewpoints influences our method and the baselines. The results on the real dataset are listed in Table II; our proposed method remains far better than the baseline methods. Example visualizations are shown in Fig. 9, including mesh models for comparison, generated in MeshLab by first estimating normals and then applying ball pivoting. Clearly, the proposed method generated the 3D model with the best quality.
TABLE I: Results on the simulation datasets (mean ± standard deviation).

Artificial (simulation)
          | A2FNet            | ElevateNet       | Ours
CD [m]    | 0.2707 ± 0.1053   | 0.3263 ± 0.1301  | 0.1368 ± 0.0473
MAE [m]   | 10.41 ± 3.09      | –                | 4.90 ± 1.89

Terrain
          | A2FNet            | ElevateNet       | Ours
CD [m]    | 0.9962 ± 0.6155   | 0.5102 ± 0.2394  | 0.2200 ± 0.0837
MAE [m]   | 42.19 ± 16.59     | –                | 14.07 ± 4.67

Sphere
          | A2FNet            | ElevateNet       | Ours
CD [m]    | 24.7811 ± 27.1205 | 9.6571 ± 9.4248  | 1.4603 ± 2.4521
MAE [m]   | 71.56 ± 20.46     | –                | 9.60 ± 5.87
TABLE II: Results on the real dataset (mean ± standard deviation).

Artificial (real water tank)
          | A2FNet            | ElevateNet       | Ours
CD [m]    | 0.8677 ± 0.2583   | 0.7605 ± 0.2234  | 0.2822 ± 0.0879
MAE [m]   | 30.54 ± 7.07      | –                | 13.62 ± 2.94
V-F Discussions
V-F1 Viewpoint Numbers
The number of viewpoints may influence the results. For a fair comparison, we also added viewpoints to ElevateNet and A2FNet by concatenating the reference and source images as inputs and modifying the input channels of the networks. We chose the artificial and terrain datasets for training and testing. As shown in Table III, the performance of ElevateNet is unstable: it improved significantly on the terrain dataset but became worse on the artificial dataset. This may be because the 2D-3D problem is more pronounced on the artificial dataset. Our proposed method still outperforms the baselines. We also tried increasing the number of viewpoints of EPSSN to three on the terrain dataset; the CD was 0.1333 ± 0.0573 m, which shows a significant improvement.
TABLE III: Results with two viewpoints for all methods (mean ± standard deviation).

Artificial (simulation)
          | A2FNet (two)      | ElevateNet (two) | Ours
CD [m]    | 0.2707 ± 0.1053   | 0.3745 ± 0.1597  | 0.1368 ± 0.0473

Terrain
          | A2FNet (two)      | ElevateNet (two) | Ours
CD [m]    | 0.9097 ± 0.3439   | 0.3541 ± 0.1040  | 0.2200 ± 0.0837
V-F2 Regression on the Elevation Map
Generating the front depth from the cost volume and performing regression on the front depth map is one of the main contributions of this work. We tried replacing this part by generating an elevation map and performing regression on it. Figure 10 compares the CD of elevation map regression and depth map regression: for the same number of viewpoints, depth map regression significantly outperforms elevation map regression.
V-F3 Ablation on Elevation Plane Warping
An ablation test was carried out on the elevation plane warping block with three viewpoints. When the warping process is removed and the cost volumes from the reference and source frames are simply added, the final result can be even worse than using two viewpoints with warping, as shown in Table IV. In other words, elevation plane warping successfully fuses the information from multiple frames with known motion information.
TABLE IV: Ablation on elevation plane warping, terrain dataset (mean ± standard deviation).

          | w/o warping (three) | w/ warping (three)
CD [m]    | 0.2734 ± 0.0819     | 0.1333 ± 0.0573
MAE [m]   | 16.26 ± 4.59        | 9.13 ± 2.70
V-F4 Comparison with Occupancy Mapping
We also compared the proposed method with a previous method using occupancy mapping [Wangjoe]. We used two frames, one reference frame and one source frame, with known relative poses to generate the local map. With only two frames with a small roll angle difference, the results of occupancy mapping did not converge. On the terrain dataset, the CD error for the test set was .
VI. Conclusions
In this study, we proposed a novel method for retrieving the missing 3D information of acoustic images. By utilizing elevation plane warping to fuse information from multiple frames and learning the pseudo front depth, our proposed method achieved state-of-the-art performance on both simulation and real datasets. It can also handle severely ambiguous cases, such as a submerged sphere in a water body. Our simulation datasets have been made open source for future comparisons. The method can be applied to underwater vehicles carrying two forward-looking sonars, or a single forward-looking sonar with multiple views.
Future work may include self-supervised network training and attempting various motions. Collecting datasets in real underwater environments is extremely difficult; in this study, the datasets were collected in a structured water tank where depth labels could be generated from our simulator, which is difficult to transfer to oceanic environments. Training the network without depth labels is therefore worth investigating. Other methods, such as generative adversarial networks (GANs), may be valuable for mitigating the sim-to-real gap so that the network can be trained on synthetic data [liu2021]. In this work, the motion between the reference frame and the source frames was pure roll rotation; other motions may also be tested in the future.