Forward-Looking Sonar Patch Matching: Modern CNNs, Ensembling, and Uncertainty

by Arka Mallick, et al.

Applications of underwater robots are on the rise. Most of them depend on sonar for underwater vision, but the lack of strong perception capabilities limits them in this task. An important issue in sonar perception is matching image patches, which can enable other techniques like localization, change detection, and mapping. There is a rich literature for this problem in color images, but for acoustic images it is lacking, due to the physics that produce these images. In this paper we improve on our previous results for this problem (Valdenegro-Toro et al, 2017): instead of modeling features manually, a Convolutional Neural Network (CNN) learns a similarity function and predicts if two input sonar images are similar or not. With the objective of improving sonar image matching further, three state-of-the-art CNN configurations are evaluated on the Marine Debris dataset, built on DenseNet and VGG, with a siamese or two-channel architecture, and contrastive loss. To ensure a fair evaluation of each network, thorough hyper-parameter optimization is executed. We find that the best performing models are the DenseNet Two-Channel network with 0.955 AUC, VGG-Siamese with contrastive loss at 0.949 AUC, and DenseNet Siamese with 0.921 AUC. By ensembling the top performing DenseNet two-channel and DenseNet-Siamese models, the overall highest prediction accuracy obtained is 0.978 AUC, showing a large improvement over the 0.91 AUC in the state of the art.



I Introduction

More than two-thirds of our planet's surface is covered by oceans and other water bodies. For a human, it is often impossible to explore them extensively. The need for venturing into potentially dangerous underwater scenarios appears regularly, for example when finding new energy sources, monitoring tsunamis or global warming, searching for wreckage, or just learning about deep sea ecosystems. This motivates the design and deployment of robots in underwater scenarios, and much research goes in this direction. Some exploration or monitoring tasks require the robot to "see" underwater in order to make intelligent decisions. But the underwater environment is very difficult for optical cameras, as light is attenuated and absorbed by the water particles, and many real-life monitoring and mapping tasks take place in cluttered and turbid underwater scenarios. The limited visibility range of an optical sensor is a big challenge. Hence, sonar is a more practical choice for underwater sensing, as acoustic waves can travel long distances with comparatively little attenuation.

An underwater robot equipped with sonar image sensors regularly needs to perform basic tasks such as object detection and recognition, navigation, manipulation, etc. In underwater scenarios, sonar patch matching is very useful in several applications such as data association in simultaneous localization and mapping (SLAM), object tracking, sonar image mosaicing [7], etc. Patch matching, in general, is heavily used in computer vision and image processing applications for low-level tasks like image stitching [1] and deriving structure from motion [12], and also in high-level tasks such as object instance recognition [9], object classification [20], multi-view reconstruction [15], image retrieval, etc.

Typical challenges in patch matching tasks are different viewing points, variations in scene insonification, occlusion, and different sensor settings. For sonar patch matching, the common challenges of acoustic vision add to the overall complexity, for example low signal-to-noise ratio, lower resolution, unwanted reflections, and less visibility. Because of these challenges, the underlying object features might not be as prominent as in a normal optical image. It has also been found that it is very challenging to manually design features for sonar images, and popular hand-designed features such as SIFT [10] are not always very effective in sonar images [18]. For these reasons, patch matching for sonar images remains a topic of research interest.

Fig. 1: Use of a convolutional network for learning a general similarity function for image patches. The patches in the image are samples taken from the data used in this work. Inspired by Zagoruyko et al. [21].

II State of the Art

Sonar image patch matching is more difficult than the normal optical matching problem. This is because sonar images have additional challenges such as non-uniform insonification, low signal-to-noise ratio, poor contrast [2], low resolution, and low feature repeatability [6]. But sonar image matching has important applications such as sonar registration, mosaicing [8], [7], and mapping of the seabed surface [13]. While Kim et al. [8] used Harris corner detection and matched key-points to register sonar images, Hurtos et al. [7] incorporated Fourier-based features for registration of FLS images. Negahdaripour et al. [13] estimated mathematical models from the dynamics of object movements and their shadows. Vandrish et al. [19] used SIFT [10] for sidescan sonar image registration. Even though these approaches achieve considerable success in their respective goals, they were found to be most effective when the rotation/translation between the frames of sonar images is comparatively small. Block-matching was performed on segmented sonar images by Pham et al. [14], using a Self-Organizing Map for the registration and mosaicing task.

Recently CNNs have been applied to this problem: Zbontar et al. [22] for stereo matching in color images, and Valdenegro-Toro et al. [18] for sonar images, which is based on Zagoruyko et al. [21] and is the state of the art for sonar image patch matching at 0.91 AUC on the Marine Debris dataset. CNNs are increasingly being used for sonar image processing [17]. The main reason behind this rise in CNN usage is that a CNN can learn sonar-specific information directly from the data. No complex manual feature design or rigorous data pre-processing steps are needed, which makes the task less complex while still achieving good prediction accuracy.

III Matching as Binary Classification

We formulate the matching problem as learning a classifier. A classification model is given two images, and it decides if the images match or not. This decision can be modeled as a score in $[0, 1]$, or a binary output decision in $\{0, 1\}$.

For this formulation, we use AUC, the area under the ROC (Receiver Operating Characteristic) curve, as the primary metric to assess performance, as we are interested in how separable the score distributions of matches and non-matches are.
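As an illustration of why AUC measures separability, the sketch below computes it as the fraction of (match, non-match) pairs in which the match receives the higher score, which is equivalent to the area under the ROC curve. The function name `roc_auc` and the toy labels and scores are assumptions made for this example and are not taken from the Marine Debris dataset.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random match outscores a random non-match."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one match and one non-match")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): label 1 = match, label 0 = non-match.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.6, 0.4, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1.0 means every match scores above every non-match, while 0.5 corresponds to scores that carry no information about the label.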

Name              Value     Name               Value
Layers            2-2-2     Pooling            avg
Growth rate (gr)  12        Number of filters  32
DenseNet dropout  0.2       Compression        0.5
Bottleneck        False     Batch size         128
Optimizer         Adadelta  Learning rate      0.03
TABLE I: Best hyper-parameter values for DenseNet Two-Channel (DTC).

IV Matching Architectures

In this section we describe the neural network architectures we selected as trunk for the meta-architectures like two-channel and siamese networks, which are used for matching.

IV-A Hyper-Parameter Tuning

For each architecture, we tuned its hyper-parameters using a validation set, in order to maximize accuracy. The range of each hyper-parameter was set individually for each architecture, considering width, filter values at each layer, drop probabilities, dense layer widths, etc. Overall, we performed 10 runs of different hyper-parameter combinations for each architecture. Details of the hyper-parameter tuning are available in [11].


IV-B DenseNet Two-Channel Network

In DenseNet [5] each layer connects to every other layer in a feed-forward fashion. With the basic idea of enhancing feature propagation, each layer of a DenseNet block takes the feature maps of all previous stages as input.
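To make the dense connectivity concrete, the following sketch tracks how the number of feature maps evolves through a DenseNet trunk. It assumes the usual DenseNet conventions from [5]: every layer in a block adds `growth_rate` feature maps that are concatenated with its input, and a transition layer between blocks multiplies the channel count by the compression factor. The helper `dense_trunk_channels` is hypothetical, and the numbers plugged in are the DTC values from Table I; the authors' implementation may differ in details such as whether a transition follows the last block.

```python
import math


def dense_trunk_channels(initial_filters, block_layers, growth_rate, compression):
    """Return the channel count after each dense block and each transition."""
    channels = initial_filters
    trace = [channels]
    for index, num_layers in enumerate(block_layers):
        # Each layer concatenates `growth_rate` new feature maps to its input.
        channels += num_layers * growth_rate
        trace.append(channels)
        # Assumption: transitions (with compression) sit only between blocks.
        if index < len(block_layers) - 1:
            channels = math.floor(channels * compression)
            trace.append(channels)
    return trace


# Values from Table I: 32 initial filters, blocks 2-2-2, growth rate 12, compression 0.5.
trace = dense_trunk_channels(32, [2, 2, 2], 12, 0.5)
print(trace)
```

Under these assumptions the trunk stays narrow throughout, which is consistent with the small parameter count reported for the two-channel model.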

Name               Value     Name              Value
Number of filters  16        Layers            2-2
Growth rate        30        DenseNet dropout  0.4
Compression        0.3       Bottleneck        False
FC output          512       FC dropout        0.7
Pooling            flatten   Batch size        64
Optimizer          Adadelta  Learning rate     0.07
TABLE II: Best hyper-parameter values for DenseNet Siamese (DS).

In the DenseNet two-channel network the two sonar patches are supplied as a single two-channel input. The network itself treats each patch as one channel, learns features jointly from both patches, and finally compares them through a fully connected (FC) layer with a single output and a sigmoid activation function.

Hyper-parameters for this architecture are shown in Table I.

IV-C DenseNet Siamese Network

In this architecture the branches of the Siamese network are DenseNets. Following the classic Siamese model, the two branches share weights and are trained simultaneously on the two input patches, each learning features from its input. Through the shared weights the Siamese network is able to learn the similarity function and discriminate between the two input patches. The role of the DenseNet branches is feature extraction; the decision making or prediction part is taken care of by the Siamese network.

Fig. 2: DenseNet Siamese architecture.

Figure 2 displays the basic architecture of the DenseNet-Siamese network. The two DenseNet branches are designed to share weights. The extracted features are concatenated and passed through an FC layer, followed by ReLU activation and, where applicable, Batch Normalization and Dropout layers. The output is then connected to another FC layer with a single output, which gives the binary prediction score of matching (1) or non-matching (0). A sigmoid activation function and binary cross-entropy loss are used for this final FC layer. The hyper-parameters of the architecture in Figure 2, such as the output size of the FC layer and the dropout probability, are shown in Table II.
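The data flow just described (shared branch, concatenation, FC layer with ReLU, single-output FC layer with sigmoid) can be sketched with a toy model. This is only an illustration: a single linear layer with ReLU stands in for each DenseNet branch, all layer sizes and the random weights are hypothetical, and Batch Normalization and Dropout are omitted.

```python
import math
import random


def dense(inputs, weights, biases):
    """Fully connected layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    return [max(0.0, v) for v in values]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def siamese_score(patch_a, patch_b, params):
    # The same branch weights process both patches (weight sharing).
    feat_a = relu(dense(patch_a, params["branch_w"], params["branch_b"]))
    feat_b = relu(dense(patch_b, params["branch_w"], params["branch_b"]))
    merged = feat_a + feat_b  # concatenation of the two feature vectors
    hidden = relu(dense(merged, params["fc_w"], params["fc_b"]))
    logit = dense(hidden, params["out_w"], params["out_b"])[0]
    return sigmoid(logit)  # score in (0, 1); 1 = match, 0 = non-match


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
params = {
    "branch_w": random_matrix(4, 6, rng), "branch_b": [0.0] * 4,  # toy branch
    "fc_w": random_matrix(3, 8, rng), "fc_b": [0.0] * 3,          # FC on 4 + 4 features
    "out_w": random_matrix(1, 3, rng), "out_b": [0.0],            # single output
}
patch_a = [rng.random() for _ in range(6)]  # flattened toy "patches"
patch_b = [rng.random() for _ in range(6)]
score = siamese_score(patch_a, patch_b, params)
print(f"matching score = {score:.3f}")
```

Because both patches pass through the same `branch_w`, any update to the branch during training affects the features of both inputs identically, which is the defining property of the Siamese meta-architecture.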

IV-D Contrastive Loss

Using contrastive loss [4], higher-dimensional input data (e.g. a pair of images) can be mapped onto a much lower-dimensional output manifold, where similar pairs are placed close to each other and dissimilar pairs are separated by larger distances depending on their dissimilarity. With this loss function, the distance between two input patches projected onto the output manifold can be predicted: if the distance is close to 0 the input pair is matching, otherwise (above a threshold) it is dissimilar. The formulas for this loss are shown in Equations 1 and 2.

$$\mathcal{L}(W) = \sum_{i=1}^{P} L\left(W, (Y, \vec{X}_1, \vec{X}_2)^i\right) \qquad (1)$$

$$L\left(W, (Y, \vec{X}_1, \vec{X}_2)^i\right) = (1 - Y)\,\frac{1}{2}\left(D_W^i\right)^2 + Y\,\frac{1}{2}\left\{\max\left(0,\, m - D_W^i\right)\right\}^2 \qquad (2)$$

Here $L$ is the loss term; the formula presented here is the most generalized form of the loss function, suitable for batch training over $P$ pairs. $\vec{X}_1$, $\vec{X}_2$ represent a pair of input image vectors. $Y$ are the labels, 0 for a similar pair and 1 for a dissimilar pair. $D_W$ is the parameterized distance function to be learned by the neural network. $m > 0$ is the margin that defines a radius around the embedding of each input. Dissimilar pairs only contribute to the loss function if their distance is within this radius. We use a fixed margin $m$ for our experiments. One of the ideas for evaluating this loss function is to use it with a Siamese network, as the loss function takes a pair of images as input together with a label indicating their similarity. Matching pairs then have smaller distances in the learned embedding than non-matching ones, and the distance between pairs can be used as a score with a threshold.
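A minimal sketch of this loss follows, assuming the standard form from [4]: a pair with label $Y$ and embedding distance $D$ contributes $(1 - Y)\,\tfrac{1}{2}D^2 + Y\,\tfrac{1}{2}\max(0, m - D)^2$, summed over the batch. The margin of 1.0 and the two-dimensional embeddings are placeholder values chosen for the example, not the settings used in the experiments.

```python
import math


def euclidean_distance(a, b):
    """Distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(y, distance, margin):
    """Loss for one pair; y = 0 for a similar pair, y = 1 for a dissimilar pair."""
    similar_term = (1 - y) * 0.5 * distance ** 2
    # Dissimilar pairs are penalized only while they are closer than the margin.
    dissimilar_term = y * 0.5 * max(0.0, margin - distance) ** 2
    return similar_term + dissimilar_term


def batch_contrastive_loss(labels, distances, margin):
    """Sum of the per-pair losses over a batch."""
    return sum(contrastive_loss(y, d, margin) for y, d in zip(labels, distances))


margin = 1.0  # assumed placeholder value
emb_a, emb_b = [0.0, 0.0], [0.3, 0.4]  # hypothetical embeddings, distance 0.5
d = euclidean_distance(emb_a, emb_b)

loss_similar = contrastive_loss(0, d, margin)     # pulls the pair together
loss_dissimilar = contrastive_loss(1, d, margin)  # pushes the pair apart
loss_far = contrastive_loss(1, 2.0, margin)       # already beyond the margin
batch_loss = batch_contrastive_loss([0, 1, 1], [d, d, 2.0], margin)
print(loss_similar, loss_dissimilar, loss_far, batch_loss)
```

The last single-pair case shows the role of the margin: a dissimilar pair that is already farther apart than `margin` produces zero loss and therefore no gradient.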

IV-E VGG Siamese Network

The VGG network [16] is a CNN that was conceived by K. Simonyan and A. Zisserman from the Visual Geometry Group at the University of Oxford. This network performed very well in the ImageNet challenge 2014. The architecture uses very small 3x3 convolution filters, with depth varying from 16 to 19 weight layers, and generalizes very well to different kinds of data. The VGG network has been chosen as the branches of the Siamese network (Figure 3). Its role is to extract features; similar to the DenseNet-Siamese, the final decision making and prediction is done by the Siamese network. The network is trained with contrastive loss. The output of this network is the Euclidean distance between the two input sonar patches, projected into a lower dimension using contrastive loss. The hyper-parameters of this network are shown in Table III.

Fig. 3: VGG Siamese network with contrastive loss.

Since contrastive loss returns a projected distance, values close to zero mean similarity and higher values mean dissimilarity. In our original data and matching formulation, however, labels close to one represent similarity between patches. Hence the labels for the train, validation and test data are all flipped here:

$$Y_{\text{new}} = 1 - Y \qquad (3)$$

Equation 3 is applied to all ground truth labels, meaning that for this evaluation an input label of zero means similarity (match) between patches.

Name                 Value          Name            Value
Conv filters         16             Kernel size     3
FC Layers            1              FC output       2048
Batch normalization  False          Dropout         0.6
Batch size           256            Optimizer       Nadam
Conv Initializer     random normal  FC Initializer  glorot normal
Learning rate        0.0002
TABLE III: Best hyper-parameter values for VGG Siamese network with Contrastive loss (CL).

V Experimental Evaluation

V-A Dataset

We use the matching task of the Marine Debris dataset to evaluate our models. This dataset contains 47K labeled sonar image patch pairs, captured using an ARIS Explorer 3000 Forward-Looking sonar and generated from the original 2627 labeled object instances. We exclusively use the D dataset, in which the training and testing sets were generated using different sets of objects, with the purpose of testing a truly generic image matching algorithm that is not object specific. The training set contains 39840 patch pairs, while the test set contains 7440 patch pairs.

V-B Comparative Analysis of AUC

Our main results are presented in Table IV and Figure 4, where we present the AUC and the ROC curves on the test set, respectively.

DenseNet two-channel has the highest mean AUC (10 trials) of 0.955, with a maximum AUC of 0.966 and only 51,430 parameters in total. DenseNet-Siamese has a mean AUC (10 trials) of 0.921 and a maximum AUC of 0.95, with 16,725,485 parameters in total. The VGG-Siamese network with contrastive loss has a mean AUC (10 trials) of 0.949 and a highest single-run AUC of 0.956, with 3,281,840 parameters in total. These AUC values are considerably better than Valdenegro-Toro [18], with the best mean AUC improving from 0.910 to 0.955 (almost 5 AUC points).

It is notable that our best performing model is a two-channel network, indicating that this meta-architecture is better suited for the matching problem than a siamese one, and that there is a considerable reduction in the number of parameters, from 1.8M to 51K, which hints at increased generalization.

A comparison of predictions between all our three architectures is provided in Figure 5.

Network                Mean AUC  Best AUC  # of Params
Two-Channel DenseNet   0.955     0.966     51K
Siamese DenseNet       0.921     0.95      16.7M
Siamese VGG            0.949     0.956     3.3M
Two-Channel CNN [18]   0.910     0.910     1.8M
Siamese CNN [18]       0.855     0.855     1.8M
TABLE IV: Comparative analysis on the AUC and total number of parameters in the best performing networks.
Fig. 4: Comparison of ROC curves for best hyper-parameter architecture configurations and top AUC.
Fig. 5: Comparison of predictions across multiple models, DenseNet Siamese (DS), DenseNet Two-Channel (DTC), and VGG Siamese Contrastive Loss (CL). Note that Siamese VGG produces distances which are not in the range $[0, 1]$, while the other architectures give scores in that range.

V-C Monte Carlo Dropout Analysis

Normally Dropout is only applied in the training phase, where it provides regularization to avoid overfitting. At test time all the connections/nodes remain present and dropout is not applied, though the weights are adjusted according to the dropout ratio used during training. Predictions on test data are therefore deterministic. For Monte Carlo Dropout, dropout is also applied at inference/test time, which introduces randomness, as the connections are dropped randomly according to the dropout probability. This prediction process is stochastic, i.e. the model can produce different predictions for the same test data. The main goal of Monte Carlo Dropout [3] is to generate samples of the predictive posterior distribution of an equivalent Bayesian Neural Network, which quantifies epistemic uncertainty.

We would like to evaluate uncertainty for our best performing model, the DenseNet two-channel network (AUC 0.966). This model is trained with Dropout with $p = 0.2$. For this evaluation, MC-Dropout is explicitly enabled at inference time. 20 forward passes are made for each of the test image pairs, and the mean score and standard deviation are computed. The standard deviation is a measure of uncertainty, with increasing values indicating more uncertainty.
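The procedure can be sketched as follows. A toy logistic model with dropout on its inputs stands in for the DenseNet two-channel network, so `stochastic_forward`, the feature vector and the weights are all hypothetical. The dropout probability of 0.2 follows Table I and the 20 passes follow the text; the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import math
import random
import statistics


def stochastic_forward(features, weights, drop_prob, rng):
    """One forward pass of a toy logistic model with dropout left switched on."""
    keep = 1.0 - drop_prob
    logit = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:
            logit += w * x / keep  # inverted-dropout rescaling of kept units
    return 1.0 / (1.0 + math.exp(-logit))


def mc_dropout_predict(features, weights, drop_prob, passes, rng):
    """Mean score and standard deviation over repeated stochastic passes."""
    scores = [stochastic_forward(features, weights, drop_prob, rng)
              for _ in range(passes)]
    return statistics.mean(scores), statistics.pstdev(scores)


rng = random.Random(0)
features = [0.8, -0.2, 0.5, 1.1]  # hypothetical features of one patch pair
weights = [1.5, -0.7, 0.9, -0.4]  # hypothetical trained weights
mean_score, std_score = mc_dropout_predict(
    features, weights, drop_prob=0.2, passes=20, rng=rng)
print(f"mean score = {mean_score:.3f}, std = {std_score:.3f}")
```

With `drop_prob=0.0` every pass is identical and the standard deviation collapses to zero, which corresponds to the ordinary deterministic prediction.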

Figure 6 presents the most uncertain patch pairs, and Figure 7 the most certain (least uncertain) ones. These results give insights into what the model considers its most difficult samples (high uncertainty). In particular, the most uncertain examples (highest standard deviation) are the ones close to being out of distribution, where the patches are positioned near the border of the FLS polar field of view, which probably confuses the model.

The lowest uncertainty results in Figure 7 indicate the easiest patch pairs to discriminate, either the same object in relatively similar poses, or radically different objects or background in each pair. In both cases the model is quite confident of these predictions.

Figure 8 shows a large selection of patch pairs and their uncertainty estimates, showing that the model is not always confident, particularly for predictions with scores in between zero and one, even for pairs that a human would consider to be easy to match or reject.

Fig. 6: MC-Dropout predictions of DTC with highest standard deviation over 20 forward passes. Ground truth label 1 indicates matching. It is clear that the low signal-to-noise ratio of sonar is affecting the predictions, and unwanted reflections and occlusions are also challenging.
Fig. 7: MC-Dropout predictions of DTC with lowest standard deviation over 20 forward passes. These results show that the network learned some of the similarity functions with great confidence. For object-object non-matching pairs, typical std values are much higher than for other categories.
Fig. 8: MC-Dropout predictions of DTC over 20 forward passes, presenting 20 sonar image patch pairs from the test dataset with index 1000 to 1019, and the corresponding mean prediction and standard deviation.

V-D Ensemble

The performance of the DenseNet-Siamese (DS) is good for non-matching pair predictions. DenseNet two-channel (DTC) is overall very good, but most uncertain in object-object non-matching pairs.

This observation led to the hypothesis that making an ensemble of these two classifiers might improve overall predictive capability. For this experiment a few of the previously trained DTC and DS models are loaded, and their predictions on the test data are averaged, i.e. DS and DTC are given equal weights. These evaluation results are displayed in Table V. The ROC AUC calculated on the average prediction is found to be higher than the individual scores each time.

DS model AUC  DTC model AUC  Ensemble AUC
0.95          0.959          0.97
0.952         0.959          0.97
0.952         0.963          0.973
0.952         0.966          0.971
0.952         0.972          0.978
TABLE V: AUC of individual DS and DTC models (first two columns) and of the ensemble formed by averaging their predictions (third column). The ensemble AUC is higher than that of either model in every row.

Ensemble accuracies (AUC) are consistently better than those of each model individually. If the underlying models that form the ensemble have low AUC, the ensemble AUC is found to be much improved. For example, in the first result presented in Table V the ensemble accuracy (0.97 AUC) is much higher than the underlying model predictions (0.95 and 0.959 AUC). By forming an ensemble of the DenseNet-Siamese model with 0.952 AUC and the DenseNet two-channel model with 0.972 AUC, the resulting ensemble AUC is found to be 0.978, which is the highest AUC on test data obtained in any experiment within the scope of this work. This indicates that the DS and DTC models are complementary and could be used together if higher AUC is required in an application.
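The equal-weight averaging used in this section can be sketched as below. The scores are invented toy values chosen so that the two models make different mistakes; they are not predictions of the actual DS and DTC networks, and the helper names `average_ensemble` and `roc_auc` are assumptions of this example.

```python
def average_ensemble(scores_a, scores_b):
    """Equal-weight average of two models' matching scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both models must score the same test pairs")
    return [(a + b) / 2.0 for a, b in zip(scores_a, scores_b)]


def roc_auc(labels, scores):
    """AUC as the fraction of (match, non-match) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): each model misranks a different pair.
labels = [1, 1, 0, 0]
ds_scores = [0.9, 0.4, 0.5, 0.1]   # stand-in for DenseNet-Siamese outputs
dtc_scores = [0.3, 0.8, 0.2, 0.6]  # stand-in for DenseNet two-channel outputs

ensemble_scores = average_ensemble(ds_scores, dtc_scores)
auc_ds = roc_auc(labels, ds_scores)
auc_dtc = roc_auc(labels, dtc_scores)
auc_ensemble = roc_auc(labels, ensemble_scores)
print(auc_ds, auc_dtc, auc_ensemble)
```

In this toy case the errors of the two models do not overlap, so averaging repairs both rankings. Averaging helps less when the models tend to fail on the same pairs.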

VI Conclusions and Future Work

In this work we present new neural network architectures for matching of sonar image patches, including extensive hyper-parameter tuning, and explore their performance in terms of area under the ROC curve, uncertainty as modeled by MC-Dropout, and performance when multiple models are ensembled. The results in this work improve on the state of the art on the same dataset. Using the DenseNet two-channel network, the average prediction accuracy obtained is 0.955 area under the ROC curve (AUC). VGG-Siamese (with contrastive loss function) and DenseNet-Siamese perform the prediction with an average AUC of 0.949 and 0.921 respectively. All these results are an improvement over the result of 0.910 AUC from Valdenegro-Toro [18]. Furthermore, by forming an ensemble of the DenseNet two-channel and DenseNet-Siamese models with the respective highest AUC scores, the prediction accuracy obtained for the ensemble is 0.978 AUC, which is the overall highest accuracy obtained on the Marine Debris dataset for the matching task.

We expect that our results motivate other researchers to build applications on top of our matching networks.


  • [1] M. Brown and D. G. Lowe (2007) Automatic panoramic image stitching using invariant features. International journal of computer vision 74 (1), pp. 59–73. Cited by: §I.
  • [2] S. Emberton, L. Chittka, and A. Cavallaro (2018) Underwater image and video dehazing with pure haze region segmentation. Computer Vision and Image Understanding 168, pp. 145–156. Cited by: §II.
  • [3] Y. Gal and Z. Ghahramani (2015) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. arXiv:1506.02142. Cited by: §V-C.
  • [4] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §IV-D.
  • [5] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks.. In CVPR, Vol. 1, pp. 3. Cited by: §IV-B.
  • [6] N. Hurtós, N. Palomeras, S. Nagappa, and J. Salvi (2013) Automatic detection of underwater chain links using a forward-looking sonar. In OCEANS-Bergen, 2013 MTS/IEEE, pp. 1–7. Cited by: §II.
  • [7] N. Hurtós, Y. Petillot, J. Salvi, et al. (2012) Fourier-based registrations for two-dimensional forward-looking sonar image mosaicing. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pp. 5298–5305. Cited by: §I, §II.
  • [8] K. Kim, N. Neretti, and N. Intrator (2005) Mosaicing of acoustic camera images. IEE Proceedings-Radar, Sonar and Navigation 152 (4), pp. 263–270. Cited by: §II.
  • [9] D. G. Lowe (1999) Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, Vol. 2, pp. 1150–1157. Cited by: §I.
  • [10] D. G. Lowe (2004) Distinctive image features from scale-invariant keypoints. International journal of computer vision 60 (2), pp. 91–110. Cited by: §I, §II.
  • [11] A. Mallick (2019) Sonar patch matching via deep learning. Master’s Thesis, Bonn-Rhein-Sieg University of Applied Sciences. Cited by: §IV-A.
  • [12] N. Molton, A. J. Davison, and I. D. Reid (2004) Locally planar patch features for real-time structure from motion.. In Bmvc, pp. 1–10. Cited by: §I.
  • [13] S. Negahdaripour, M. Aykin, and S. Sinnarajah (2011) Dynamic scene analysis and mosaicing of benthic habitats by fs sonar imaging-issues and complexities. In Proc. OCEANS, Vol. 2011, pp. 1–7. Cited by: §II.
  • [14] M. T. Pham and D. Guériot (2013) Guided block-matching for sonar image registration using unsupervised kohonen neural networks. In Oceans-San Diego, 2013, pp. 1–5. Cited by: §II.
  • [15] S. M. Seitz, B. Curless, J. Diebel, D. Scharstein, and R. Szeliski (2006) A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), pp. 519–528. Cited by: §I.
  • [16] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §IV-E.
  • [17] M. Valdenegro-Toro (2016) Objectness scoring and detection proposals in forward-looking sonar images with convolutional neural networks. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, pp. 209–219. Cited by: §II.
  • [18] M. Valdenegro-Toro (2017) Improving sonar image patch matching via deep learning. In Mobile Robots (ECMR), 2017 European Conference on, pp. 1–6. Cited by: §I, §II, §V-B, TABLE IV, §VI.
  • [19] P. Vandrish, A. Vardy, D. Walker, and O. Dobre (2011) Side-scan sonar image registration for auv navigation. In Underwater Technology (UT), 2011 IEEE Symposium on and 2011 Workshop on Scientific Use of Submarine Cables and Related Technologies (SSC), pp. 1–7. Cited by: §II.
  • [20] B. Yao, G. Bradski, and L. Fei-Fei (2012) A codebook-free and annotation-free approach for fine-grained image categorization. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3466–3473. Cited by: §I.
  • [21] S. Zagoruyko and N. Komodakis (2015) Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4353–4361. Cited by: Fig. 1, §II.
  • [22] J. Zbontar and Y. LeCun (2016) Stereo matching by training a convolutional neural network to compare image patches. Journal of Machine Learning Research 17 (1-32), pp. 2. Cited by: §II.