Underwater facilities of oil and gas fields must be inspected periodically to assess the condition of submerged structures. The primary goal of this task is to verify the need for repair and maintenance, which are performed by remotely operated vehicles (ROVs) or divers. These inspections are complex, expensive and carried out manually, demanding full support infrastructure, usually comprising a crane, an umbilical cable and the ROV crew.
Because of the high cost of oil and gas facility maintenance, many researchers have been developing autonomous underwater vehicles (AUVs) for the aforementioned tasks. AUVs aim at conducting survey missions using internal and external sensing devices at lower operational costs. The vehicle returns to a pre-programmed location when a mission is completed, and the gathered data can then be downloaded and analyzed. One of the underlying tasks of an AUV is to detect submerged objects, providing reference locations to support vehicle navigation.
The use of visible-spectrum cameras in inspection-driven AUVs is limited in turbid or deep waters, while imaging sonars have been exploited to cover large areas even under low-to-zero visibility conditions. Sonars emit sound waves in a given direction until these waves hit an object, which reflects part of the wave energy back. By measuring the time of flight of the sound waves, the distance between the sonar and the target is established. By measuring the backscattered energy, it is possible to define the target shape; each record of sliced time is called a bin, and all bins acquired in the same direction angle compose a beam. While sonars are less dependent on water turbidity, their data are usually more difficult to interpret because of the acquisition and environment characteristics.
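The time-of-flight relation mentioned above can be illustrated with a minimal sketch. The function name and the nominal sound speed of 1500 m/s are assumptions made for illustration; the actual speed depends on temperature, salinity and depth.

```python
def range_from_time_of_flight(tof_seconds, sound_speed=1500.0):
    """Return the sonar-to-target distance in meters.

    sound_speed is an assumed nominal value for seawater (m/s).
    """
    # The pulse travels to the target and back, so the one-way
    # distance is half of the round-trip path.
    return sound_speed * tof_seconds / 2.0


# An echo received 40 ms after emission corresponds to about 30 m.
distance = range_from_time_of_flight(0.04)
```

Under this assumption, each bin of a beam corresponds to a fixed slice of round-trip time and therefore to a fixed slice of range.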
Some works on sonar imaging to support AUV underwater navigation have been developed. Usually these works attempt to eliminate the seabed and track highlighted areas, without the ability to recognize the target object. Cuschieri and Negahdaripour adapted the optical flow method to track features in forward-scan sonar images; they used the sonar image intensities to estimate the motion parameters, with the goal of helping the AUV navigate underwater. Ruiz et al. apply image processing techniques to detect submerged objects, tracking them with a Kalman filter; they segment objects in multi-beam sonar images using a region growing algorithm, and the position, orientation and area of the segmented regions are used as features to track the target objects. Petillot et al. segment underwater objects and extract features for AUV obstacle avoidance and path planning; images are segmented by an adaptive threshold technique, features such as area, perimeter and moments are then extracted from the segmented regions, and a Kalman filter is used to track the obstacles. Folkesson et al. segment the sonar image, and the centroid positions of the segmented blobs are used to track the objects with a probability hypothesis density (PHD) filter. Johannsson et al. extract dense features from a forward-looking sonar, applying pair-wise registration between consecutive sonar frames; sonar image registrations are combined with sensor information to improve vehicle navigation, and features are extracted with the image gradient followed by adaptive thresholding. Weng et al. modified the Otsu threshold to separate the background and the foreground in multi-beam sonar images; they build color and area models in the first sonar frame, which are used to track the objects with a particle filter based on multi-feature adaptive fusion. Yan et al. use a different approach that detects objects while avoiding energy emission: by means of a gradiometer, the gravity gradient differential ratio is measured and used to estimate the object's body-center location and mass. Since these methods do not distinguish one object from another, they do not allow accurate inspection of a specific submerged object; moreover, with these methods the AUV cannot exploit the location of the inspected object as a reference for the navigation system.
Unlike the other works, the goal of this letter is to introduce a novel trainable method to detect and recognize (identify) specific underwater target objects with a forward-looking sonar mounted on an AUV. Our contributions are: (i) a background reduction method that emphasizes the object shape in the sonar images, (ii) a multi-scale orientation detector, and (iii) rotation-invariant recognition. The first contribution is achieved by applying image processing techniques to sonar images in order to reduce the sensor noise and seabed background; the second and third contributions are achieved by applying an image pyramid and sliding windows combined with a support vector machine (SVM) over several sonar image orientations in order to search for the target object in a rotation-invariant way.
The proposed method was developed for the FlatFish robot, an AUV designed to perform inspection tasks in underwater facilities of the oil and gas industry. Our method will support the navigation system in guiding the vehicle by using recognized objects as references.
II Rotation-Invariant Shipwreck Detection
II-A Preparing the sonar image
The longer the water column between the FlatFish and the seabed, the larger the black region at the bottom of the sonar image (see Fig. 2). That black region carries no information and adversely affects the acoustic image processing, disturbing the detection results, so it can be removed. Thus, the first preprocessing step is to estimate the sonar range at which the seabed begins, eliminating the black region created by the water column. To do so, we first calculate the average image intensity over each line of the sonar image (see Fig. (a)); next, the cumulative sum of these averages is calculated to help determine the end of the water column (beginning of the seabed), which corresponds to the point where the curve starts rising abruptly (close to 200 in Fig. (b)). After finding the end of the water column, the actual sonar region of interest is created without the black region, as depicted in Fig. (a).
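The water column removal can be sketched as follows. This is only an illustration: the text does not state how the abrupt rise is located, so the rule used here (the first line at which the cumulative sum exceeds a small fraction of its final value) and the parameter `rise_fraction` are assumptions, as is the convention that image lines are ordered from the water column toward the seabed.

```python
def find_seabed_start(image, rise_fraction=0.05):
    """Return the index of the first image line past the water column.

    image is a list of rows of intensities, ordered so that the dark
    water-column rows come first (an assumed convention).
    rise_fraction is a hypothetical tuning parameter.
    """
    # Average intensity of every image line.
    row_means = [sum(row) / len(row) for row in image]
    total = sum(row_means)
    if total == 0:
        return len(image)  # the whole image is dark
    # Walk along the cumulative sum until it starts to rise.
    cumulative = 0.0
    for index, mean in enumerate(row_means):
        cumulative += mean
        if cumulative >= rise_fraction * total:
            return index
    return len(image)


# Five nearly black lines (water column) followed by three seabed lines.
image = [[0, 0, 0]] * 4 + [[0, 1, 0]] + [[90, 100, 110]] * 3
start = find_seabed_start(image)
region_of_interest = image[start:]
```

A fixed fraction is the simplest possible rule; a detector based on the slope of the cumulative curve would serve the same purpose.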
Due to the sonar configuration and the curvature of the underwater terrain, sonar images can present non-uniform illumination patterns. One common pattern is the decrease of intensity values for bins far from the origin of the sonar pulse. This phenomenon occurs because the acoustic waves of the farthest bins travel greater distances than those of the closest bins; hence the energy loss caused by the transmission medium is greater for the bins farthest from the origin of the pulse. To compensate for this effect, we applied a modified version of the contrast limited adaptive histogram equalization (CLAHE) filter. We first calculate the image entropy in order to determine whether the histogram is equalized, according to the following steps:
where and represent the source image and the result of the image enhancement, respectively.
Toward finding the best clip limit, CLAHE is applied with a range of clip limits from to , incremented by . The current clip limit is represented by and the grid size is . The entropy is calculated for each CLAHE result, and the best clip limit is the first one which satisfies any of the following conditions:
, where is the current entropy and is the maximum entropy value;
, where is the last entropy;
, where is the minimum entropy difference.
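As an illustration of the entropy measure used to score each CLAHE result, the sketch below computes the Shannon entropy of the gray-level histogram. Using the standard Shannon entropy is an assumption, and the function name and the 256-level default are chosen here for illustration only.

```python
import math


def image_entropy(image, levels=256):
    """Shannon entropy (in bits) of the gray-level histogram of an image.

    image is a list of rows of integer intensities in [0, levels).
    """
    histogram = [0] * levels
    count = 0
    for row in image:
        for value in row:
            histogram[value] += 1
            count += 1
    entropy = 0.0
    for occurrences in histogram:
        if occurrences:
            probability = occurrences / count
            entropy -= probability * math.log2(probability)
    return entropy


flat = image_entropy([[7, 7], [7, 7]])    # a single gray level
spread = image_entropy([[0, 1], [2, 3]])  # four equally likely levels
```

A flatter histogram yields a higher entropy, which is why the entropy can indicate how close the enhanced image is to being equalized.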
The result of the image enhancement is shown in Fig. (b). After the image enhancement, the intensities far from the origin of the sonar are highlighted, which benefits the feature extraction step.
As with other acoustic devices, the image acquired from the sonar suffers from a low signal-to-noise ratio (SNR). This occurs primarily due to speckle noise, which appears as a granular pattern in the acoustic images. Speckle noise is caused by the acoustic nature of the imaging sonar, adding high-frequency components to the acoustic image and decreasing the intensity of important information (such as shapes and object edges). Therefore, the acoustic image is smoothed using a mean filter that aims at reducing the intensity of high-frequency components. The neighborhood size used in our work is pixels. An example of the resulting image in this step is shown in Fig. (c).
Objects found in the acoustic image are typically characterized by high-intensity pixels followed by shadows. This usually occurs due to the occlusion caused by the object itself (no echo is returned from occluded areas). Thus, object shapes in the sonar image appear as sharp transitions of intensity. In order to emphasize these sharp transitions, we apply an edge detection method, which uses vertical and horizontal Sobel derivatives to calculate the image gradient. The object shape is then emphasized and the image background is reduced, as depicted in Fig. (d).
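A minimal sketch of a Sobel-based gradient is given below. Combining the two derivatives into a gradient magnitude and leaving the one-pixel border at zero are assumptions made for this illustration.

```python
import math

# Standard 3x3 Sobel kernels for the horizontal and vertical derivatives.
SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))


def sobel_magnitude(image):
    """Gradient magnitude of a 2-D list of intensities."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    # Border pixels are left at zero because the kernel does not fit.
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += SOBEL_X[dy][dx] * pixel
                    gy += SOBEL_Y[dy][dx] * pixel
            out[y][x] = math.hypot(gx, gy)
    return out


# A vertical intensity step, similar to a highlight-to-shadow transition.
step_image = [[0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(step_image)
```

Flat regions produce a zero response, so only the transitions between highlights and shadows survive this step.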
Although the edge detection significantly decreases the image background, some edge components from the seabed terrain and acoustic reverberation remain in the image. To tackle that problem, we applied the mean filter with two differently sized windows, and the result of the larger window is subtracted from the result of the smaller window. To speed up the computation of the means, the integral image is calculated. After the subtraction, most background pixels have negative values. These negative values are then set to zero, removing most of the pixels of the image background. We use a smaller window size of pixels, and a larger window size of pixels in our particular application. The result of this step is shown in Fig. (e).
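The subtraction of two mean filters with an integral image can be sketched as below. The window radii (`small_radius`, `large_radius`) are placeholder values chosen for the example, not the sizes used in our application, and clipping the windows at the image border is an assumption.

```python
def integral_image(image):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table


def box_mean(table, y, x, radius, height, width):
    """Mean of the window centered at (y, x), clipped to the image."""
    top, bottom = max(0, y - radius), min(height, y + radius + 1)
    left, right = max(0, x - radius), min(width, x + radius + 1)
    total = (table[bottom][right] - table[top][right]
             - table[bottom][left] + table[top][left])
    return total / ((bottom - top) * (right - left))


def difference_of_means(image, small_radius=1, large_radius=3):
    """Small-window mean minus large-window mean, negatives set to zero."""
    height, width = len(image), len(image[0])
    table = integral_image(image)
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            diff = (box_mean(table, y, x, small_radius, height, width)
                    - box_mean(table, y, x, large_radius, height, width))
            row.append(max(0.0, diff))
        result.append(row)
    return result


# A single bright pixel on a dark 7x7 background.
spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 90
filtered = difference_of_means(spot)
```

With the integral image, each window mean costs four table lookups regardless of the window size, which is what makes the two-window subtraction cheap.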
To extract the regions containing objects, we developed a saliency map based on Achanta et al. Our method divides the image into equally sized blocks. Next, the average of each block is calculated using the integral image. The saliency map result for block is given by:
where is the average of the block and is the total number of blocks. is the difference between block and the rest of the image. As the majority of the image contains background, the blocks with the highest difference values are considered to be an object or part of one. The result is shown in Fig. (f). The block size used was pixels.
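One plausible reading of the block saliency is sketched below: the saliency of a block is taken as the sum of the absolute differences between its average and the averages of all other blocks. This formula is an assumption, as are the function name and the block size of 2; block averages are computed directly here for brevity rather than with the integral image.

```python
def block_saliency(image, block=2):
    """Return one saliency value per block, in row-major block order.

    Assumed measure: sum over all blocks of |mean_i - mean_j|.
    """
    height, width = len(image), len(image[0])
    means = []
    for by in range(0, height - block + 1, block):
        for bx in range(0, width - block + 1, block):
            total = sum(image[y][x]
                        for y in range(by, by + block)
                        for x in range(bx, bx + block))
            means.append(total / (block * block))
    # A block that differs from most others gets a high value.
    return [sum(abs(m - other) for other in means) for m in means]


# One bright 2x2 block in the top-left corner of a dark 4x4 image.
image = [[100, 100, 0, 0],
         [100, 100, 0, 0],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
saliency = block_saliency(image)
```

Because background blocks resemble one another, their pairwise differences are small, while an object block differs from almost every other block.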
The saliency map result is segmented by the traditional Otsu method. The convex hull contours are found using the methods proposed by Sklansky and by Suzuki and Abe, creating a final mask that encloses the objects (see Fig. (g)). This mask represents the regions of the acoustic image where objects are expected to be. After the preprocessing, the shapes of objects are highlighted, the noise is decreased and the image background is reduced, as seen in Fig. (h).
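For reference, the Otsu method selects the threshold that maximizes the between-class variance of the histogram. The sketch below is a plain implementation over a flat list of integer values; the function name and the convention that values strictly above the threshold are foreground are choices made for this example.

```python
def otsu_threshold(values, levels=256):
    """Threshold that maximizes the between-class variance."""
    histogram = [0] * levels
    for value in values:
        histogram[value] += 1
    total = len(values)
    sum_all = sum(level * count for level, count in enumerate(histogram))
    best_threshold, best_variance = 0, -1.0
    weight_bg = 0
    sum_bg = 0
    for level in range(levels):
        weight_bg += histogram[level]
        sum_bg += level * histogram[level]
        if weight_bg == 0:
            continue  # no background pixels yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # no foreground pixels left
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


# Mostly background (10) with a few salient values (200).
values = [10] * 50 + [200] * 10
threshold = otsu_threshold(values)
mask = [value > threshold for value in values]
```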
II-B Representing the object to be recognized
To recognize the target object pattern, we applied histograms of oriented gradients (HOG) as feature descriptors. The HOG descriptor extracted from the preprocessed image is visualized in Fig. (a), where lines are drawn to represent the strength of each edge orientation in the histogram of each cell.
II-C Learning the target object
Since the assets inspected by the AUV are all previously known, the target detection algorithm is trained via supervised learning. The HOG descriptor is not rotation invariant, so the target orientation is provided manually by selecting the heading of the target during the annotation process.
The training data was divided into positive and negative sets, where the former contains the target object. Before the training stage, the target orientation is normalized according to the target head marked during the annotation step. To achieve that, every image in the training set is rotated so that the target head points at 180 degrees with respect to the image vertical axis.
Two sets of vectors are created, one with the features extracted from the positive images and another from the negative images. These sets feed a linear SVM in the training stage. A positive and a negative example are shown in Fig. (b), where the positive example is in the green rectangle and the negative example is in the red rectangle. To extract the negative examples, we scan the image using a window of the same size as the annotated bounding box. The windows that do not overlap the annotated area are selected as negative examples. If the image does not have a target object annotated, a window with size of is used.
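The selection of negative windows can be sketched as follows. The box layout `(x, y, width, height)`, the scanning step and the function names are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if boxes (x, y, width, height) share any area.

    Boxes that only touch along an edge are not considered overlapping.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def negative_windows(image_size, annotation, step):
    """Windows of the annotation's size that do not overlap it.

    image_size is (width, height); step is a hypothetical scan stride.
    """
    width, height = image_size
    _, _, win_w, win_h = annotation
    windows = []
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            window = (x, y, win_w, win_h)
            if not overlaps(window, annotation):
                windows.append(window)
    return windows


# An 8x4 image whose left half is the annotated target.
negatives = negative_windows((8, 4), (0, 0, 4, 4), step=4)
```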
II-D Searching and recognizing the target object
The forward-looking sonar used has an adjustable range, which causes the target object to appear at different sizes in the sonar image. As the HOG descriptor is not scale invariant, the image pyramid method is used to find objects at different scales. A multi-scale orientation detector searches for the target by means of a sliding window combined with an image pyramid-based search, over different image orientations. This method moves a fixed-size rectangle over the sonar image from top-left to bottom-right. For each window, a linear SVM is applied to determine whether the window contains the target object.
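The combination of an image pyramid with a sliding window can be sketched by enumerating the candidate windows only, without resampling any pixels. The function names, the stopping rule (the level must still contain the window) and the mapping of windows back to original image coordinates are assumptions for this example.

```python
def pyramid_levels(width, height, window, scale):
    """Yield (factor, level_width, level_height) while the window fits."""
    win_w, win_h = window
    factor = 1.0
    while width / factor >= win_w and height / factor >= win_h:
        yield factor, int(width / factor), int(height / factor)
        factor *= scale


def sliding_positions(width, height, window, step):
    """Yield top-left corners from top-left to bottom-right."""
    win_w, win_h = window
    for y in range(0, height - win_h + 1, step):
        for x in range(0, width - win_w + 1, step):
            yield x, y


def candidate_windows(width, height, window, step, scale=1.25):
    """Yield boxes (x, y, w, h) expressed in original image coordinates."""
    win_w, win_h = window
    for factor, level_w, level_h in pyramid_levels(width, height, window, scale):
        for x, y in sliding_positions(level_w, level_h, window, step):
            yield (int(x * factor), int(y * factor),
                   int(win_w * factor), int(win_h * factor))


# A 64x32 image, a 32x16 window, a step of 16 pixels and a scale of 2.
boxes = list(candidate_windows(64, 32, (32, 16), step=16, scale=2.0))
```

In this example the first level yields six windows and the second level a single window covering the whole image; in a full detector each window would be described with HOG and scored by the linear SVM.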
To guarantee rotation invariance, the sonar image is rotated into different orientations. For each orientation, the multi-scale detector is executed and the results are saved into a list. Each item in the list contains the SVM weight, the window that may contain the target object and the image orientation. The item with the largest SVM weight is selected. The corresponding image orientation is used to rotate the window into the default orientation, resulting in a standardized view of the target object.
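The orientation search described above reduces to keeping the best-scoring item over all rotations. In the sketch below, `rotate` and `detect` are hypothetical callables standing in for the image rotation and the multi-scale detector; the stubs in the usage example exist only to make the sketch runnable.

```python
def best_detection(image, angles, rotate, detect):
    """Return (score, window, angle) with the largest score, or None.

    rotate(image, angle) returns the rotated image;
    detect(image) returns an iterable of (score, window) pairs.
    Both are placeholders supplied by the caller.
    """
    candidates = []
    for angle in angles:
        for score, window in detect(rotate(image, angle)):
            candidates.append((score, window, angle))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])


# Stubs: the "rotated image" is just the angle, and the score peaks
# when the angle equals 40 degrees.
def fake_rotate(image, angle):
    return angle


def fake_detect(rotated):
    return [(-abs(rotated - 40), (0, 0, 8, 8))]


best = best_detection(None, range(0, 181, 10), fake_rotate, fake_detect)
```

The angle stored with the winning item is what allows the detected window to be rotated back to the default orientation.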
III Experimental Results
III-A Data acquisition
Data acquisition was carried out using the FlatFish robot. FlatFish is a subsea-resident AUV built to perform on-demand close visual structural inspection at oil and gas sites. The robot is equipped with a Tritech Gemini 720i sonar (under the robot) as an acoustic global navigation sensor, which operates at 720 kHz, with horizontal and beamwidths, and a downward of . Across the horizontal axis, the system comprises 256 beams with an effective azimuth-angular beam resolution of . The coverage range varies between , with a frame rate up to . During the acquisition process, the sonar range was set to 30 and 35 meters, and the AUV was moved around the shipwreck to simulate the inspection task. The FlatFish was manually controlled in the sway and surge degrees of freedom and was autonomous in yaw and heave.
The subsea experiments were conducted at Todos os Santos Bay, Salvador, Bahia. The main target was the Vapor da Jequitaia, a vessel shipwrecked in 1905 that lies approximately 7 meters deep. The Vapor da Jequitaia is 27 meters long and has a distinctive shape; even with a great number of holes in its hull, it can be inspected from the top, which makes it a very suitable test target (see Fig. 1).
III-B Data preparation: training and test sets
The training data was prepared using the steps described in Section II-C. In total, 431 acoustic images were gathered and annotated to form the training data set. Due to the difficulty of finding a good target in the sea environment, we used only the Vapor da Jequitaia as the target object in our experiments. These images were resized to fit the annotated bounding box into the detection window, whose size is pixels, keeping the aspect ratio of the Vapor da Jequitaia. We extracted 431 positive examples and 469 negative examples from the training dataset. For our experiments, we used a linear SVM, with .
A dataset with 1222 acoustic images containing the Vapor da Jequitaia was annotated to assess the performance of our method. For each result given by the linear SVM classifier, we compare the detected area to the ground truth. The ROC curves in Figs. (a) and (b) show the detector performance, which is quantified by the true positive rate (defined as , where and denote true positive and false positive areas) and the false positive rate (defined as , where denotes the true negative area).
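An area-based evaluation of one detection against its ground truth can be sketched as below. The sketch uses the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN), which are an assumption about the formulas intended here; boxes are `(x, y, width, height)` and are assumed to lie inside the image.

```python
def box_area(box):
    """Area of a box given as (x, y, width, height)."""
    return max(0, box[2]) * max(0, box[3])


def intersection_area(a, b):
    """Area shared by two boxes (zero if they do not overlap)."""
    width = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    height = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(0, width) * max(0, height)


def area_rates(detected, truth, image_size):
    """Return (TPR, FPR) measured in pixels of area."""
    tp = intersection_area(detected, truth)  # detected and annotated
    fp = box_area(detected) - tp             # detected, not annotated
    fn = box_area(truth) - tp                # annotated, not detected
    tn = image_size[0] * image_size[1] - tp - fp - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return tpr, fpr


# The detection covers half of a 4x4 target in a 10x10 image.
tpr, fpr = area_rates((2, 0, 4, 4), (0, 0, 4, 4), (10, 10))
```

Here half of the annotated area is recovered (TPR of 0.5), while the falsely detected area is small relative to the background, which keeps the FPR low.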
One of the parameters of our multi-scale orientation detector is the sliding window step size, which defines how many pixels are skipped between consecutive window positions. As shown in Figs. (a) and (b), we tested sliding window step sizes of 8x8 and 16x16 pixels. The scale indicates by how much the image is resized at each level of the image pyramid. Five different scales were evaluated (see Figs. (a) and (b)). Testing images were rotated from 0 to 180 degrees in steps of 10 degrees to search for the target in different orientations. The window with the highest SVM weight was chosen.
|Scale||Step 8x8||Step 16x16|
Table I summarizes the accuracy evaluation in our experiments. TPR average and FPR average denote the averages of the true positive and false positive rates calculated in the test stage, respectively. As shown in Table I, scale presents the best result for both window step sizes, with the best TPR average of %, using the window step of pixels. For all scales, low false positive rates were computed, with the highest obtained value equal to . As the scale value increases, the TPR average decreases, and thus the detector performance is reduced.
IV Conclusion and outlook
A novel trainable method to detect and recognize specific submerged objects with a forward-looking sonar was presented here. By taking advantage of specific image processing techniques, we reduced the sensor noise and the seabed background of the sonar image, emphasizing the shape and the borders of the target inspection object. Using a multi-scale orientation detector and a linear SVM, the target object is detected and recognized in a rotation-invariant way. The capacity of our method to detect known objects was measured on real data collected by a Tritech Gemini 720i sonar. The main goal was to support the FlatFish in underwater inspection. Ongoing work aims to fully integrate the proposed method with the FlatFish navigation system, so that the vehicle can be controlled with respect to the target inspection object, making the inspection task more accurate. Future work includes increasing the number of target objects and integrating the recognition system with a tracking method.
-  J. Cuschieri and S. Negahdaripour, “Use of forward scan sonar images for positioning and navigation by an auv,” MTS/IEEE OCEANS Conference, vol. 2, pp. 752–756, 1998.
-  I. T. Ruiz, Y. Petillot, D. Lane, and J. Bell, “Tracking objects in underwater multibeam sonar images,” in IEEE Colloquium on Motion Analysis and Tracking, 1999, pp. 111–117.
-  Y. Petillot, I. T. Ruiz, and D. M. Lane, “Underwater vehicle obstacle avoidance and path planning using a multi-beam forward looking sonar,” IEEE Journal of Oceanic Engineering, vol. 26, no. 2, pp. 240–251, 2001.
-  J. Folkesson, J. Leonard, J. Leederkerken, and R. Williams, “Feature tracking for underwater navigation using sonar,” in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2007, pp. 3678–3684.
-  H. Johannsson, M. Kaess, B. Englot, F. Hover, and J. Leonard, “Imaging sonar-aided navigation for autonomous underwater harbor surveillance,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4396–4403, 2010.
-  L.-Y. Weng, M. Li, Z. Gong, and S. Ma, “Underwater object detection and localization based on multi-beam sonar image processing,” IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 514–519, 2012.
-  Z. Yan, J. Ma, J. Tian, H. Liu, J. Yu, and Y. Zhang, “A gravity gradient differential ratio method for underwater object detection,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 4, pp. 833–837, 2014.
-  J. Albiez, S. Joyeux, C. Gaudig, J. Hilljegerdes, S. Kroffke, C. Schoo, S. Arnold, G. Mimoso, R. Saback, J. Neto, D. Cesar, G. Neves, T. Watanabe, P. Merz Paranhos, M. Reis, and F. Kirchner, “Flatfish: A compact auv for subsea resident inspection tasks,” in MTS/IEEE OCEANS Conference, 2015, pp. 1–8.
-  H. Cho and S. C. Yu, “Real-time sonar image enhancement for auv-based acoustic vision,” Ocean Engineering, vol. 104, pp. 568–579, 2015.
-  J. Jaybhay and Rajveer Shastri, “A study of speckle noise reduction filters,” Signal & Image Processing: An International Journal, vol. 6, no. 3, pp. 2777–2780, 2015.
-  R. Achanta, F. Estrada, P. Wils, and S. Süsstrunk, “Salient region detection and segmentation,” in International Conference on Computer Vision Systems (ICVS), 2008, pp. 66–75.
-  N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
-  J. Sklansky, “Finding the convex hull of a simple polygon,” Pattern Recognition Letters, vol. 1, no. 2, pp. 79–83, 1982.
-  S. Suzuki and K. Abe, “Topological structural analysis of digitized binary images by border following,” Computer Vision, Graphics and Image Processing, vol. 30, pp. 32–46, 1985.
-  N. Dalal and W. Triggs, “Histograms of oriented gradients for human detection,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 886–893, 2004.