A vision based system for underwater docking

12/12/2017 · by Shuang Liu, et al. · Tohoku University

Autonomous underwater vehicles (AUVs) have been deployed for underwater exploration. However, their potential is confined by limited on-board battery energy and data storage capacity. This problem has been addressed by docking systems, which enable underwater recharging and data transfer for AUVs. In this work, we propose a vision based framework for underwater docking with such systems. The proposed framework comprises two modules: (i) a detection module which provides location information on underwater docking stations in 2D images captured by an on-board camera, and (ii) a pose estimation module which recovers the relative 3D position and orientation between docking stations and AUVs from the 2D images. For robust and credible detection of docking stations, we propose a convolutional neural network called Docking Neural Network (DoNN). For accurate pose estimation, a perspective-n-point algorithm is integrated into our framework. In order to examine our framework in underwater docking tasks, we collected a dataset of 2D images, named Underwater Docking Images Dataset (UDID), in an experimental water pool. To the best of our knowledge, UDID is the first publicly available underwater docking dataset. In the experiments, we first evaluate the performance of the proposed detection module on UDID and its deformed variations. Next, we assess the accuracy of the pose estimation module by ground experiments, since it is not feasible to obtain the true relative position and orientation between docking stations and AUVs under water. Then, we examine the pose estimation module by underwater experiments in our experimental water pool. Experimental results show that the proposed framework can be used to detect docking stations and estimate their relative pose efficiently and successfully, compared to state-of-the-art baseline systems.


1 Introduction

Autonomous underwater vehicles (AUVs) are deployed for tasks such as sea surveys (Smith et al., 2010; Yoerger et al., 2007) and in-water ship hull inspection (Englot and Hover, 2013); in underwater docking tasks, they aim to detect, localize and physically attach to docking stations (Bellingham, 2016). AUVs belong to the category of unmanned underwater vehicles (UUVs). UUVs are categorized into two main groups: i) remotely operated vehicles (ROVs), which require input from a user (human operator) through a cable, and ii) autonomous underwater vehicles (AUVs), which provide stand-alone platforms without human supervision and support themselves by their on-board resources. Cables of ROVs supply adequate power and enable communication, but confine the scope of their activities. AUVs offer numerous advantages over ROVs, such as a wider scope of activity, more compact size and higher efficiency, free from the limitation of a physical connection to an operator. However, their potential is limited by their finite on-board battery energy, processing power and storage capacity.

Underwater docking has been widely used because it enables autonomous battery recharging and data transfer, making long-term underwater residence possible. Underwater docking systems guide AUVs into predesignated docking stations by using compatible sensors. Three types of sensors are used for an underwater docking task: i) electromagnetic (Feezor et al., 2001), ii) acoustic (Hong et al., 2003), and iii) optical sensors (Park et al., 2009). Optical sensors outperform the others in terms of directional accuracy, low vulnerability to external detection and capacity for multiple tasks, but suffer from poor propagation in underwater environments owing to the rapid attenuation of light in water (Deltheil et al., 2000). Therefore, optical sensors are usually responsible for precise docking in the final short-distance stage, and they are combined with other sensors which are superior in propagation but inferior in accuracy (Maki et al., 2013).

Widely used vision based underwater docking (VBUD) systems consist of docking stations, cameras mounted on AUVs and docking algorithms running on AUVs. AUVs rely on dedicated landmarks on docking stations to identify them. Landmarks may either be passive or active. Passive landmarks, such as patterns drawn on a board, do not need an energy supply, but they are only visible within a close range. Active landmarks, such as light beacons, achieve high visibility by emitting energy. Active landmarks are usually preferred for their high visibility at far distances compared to passive ones (Bianco et al., 2013). In underwater docking systems, either monocular or binocular cameras are used. Binocular cameras require large camera baselines and more computational resources for detection of distant objects (Negre et al., 2008).

In this work, we consider VBUD systems equipped with active landmarks and a monocular camera. VBUD algorithms compute positions of docking stations using images captured by the on-board camera (Park et al., 2009; Li et al., 2015). VBUD algorithms are employed in two phases: i) detection of docking stations, and ii) estimation of pose between AUVs and docking stations (Negre et al., 2008). In the detection phase, docking stations are located in 2D images captured by cameras. Pose estimation recovers 3D relative position and orientation between AUVs and docking stations from the detected image patch.

Successful underwater docking algorithms must satisfy several conditions. First, detection of docking stations should be credible. Observing the docking station a large number of times (e.g. 100 times) but detecting it successfully only once is inefficient and a critical fault for an AUV that is about to run out of battery. Second, detection methods implemented by VBUD algorithms should be robust to blurring, color shift, contrast shift, mirror images, non-uniform illumination and noisy luminaries observed in non-stationary underwater environments. Finally, algorithms used for pose estimation are required to be fast, accurate and robust to noise.

More precisely, underwater images are prone to blurring, color shift, reduced contrast and non-uniform illumination in different underwater conditions, due primarily to the optical properties of the water medium, in contrast to images in air (Kaeli and Singh, 2017). Water is a strong attenuator of electromagnetic radiation. Some part of the incident light energy is absorbed by water molecules, and some part is scattered out during its propagation through the water. The spectral absorption is wavelength and distance dependent. The light energy exponentially decays with respect to the propagation distance. Red light is more strongly absorbed due to its longer wavelength. In addition, the absorption coefficient of natural water varies depending on water quality, which is a sum of contributions by various constituents, such as dissolved salts, organic compounds and phytoplankton (Mobley, 1994). The factors above dominantly cause the varying degrees of color shift in underwater images. Scattering is divided into forward scattering and backward scattering depending on the scattering angle. The former results in blurring and low contrast while the latter results in a visible bright haze in images (Bryson et al., 2016). Moreover, underwater noisy luminaries may come from ambient noise light, water-surface bubbles mixed with oil, or other underwater light sources, as shown in Park et al. (2009). Noisy luminaries severely disturb binarization based object detection methods. Removing all possible noisy luminaries requires image pre-processing methods developed by domain experts. In addition, deformed objects are observed due to scale and rotation variance. Images of docking stations captured at different locations and viewpoints (orientation degrees) give rise to geometric deformations, posing a challenging problem for underwater docking. Under certain conditions, mirror images of docking stations are also observed due to total internal reflection. When light propagates from water to the air and crosses the water-air boundary, some part of the light is refracted while the rest is reflected back into the water. According to Snell's law, there exists a critical angle beyond which light is totally reflected back with zero refraction. Total internal reflection occurs at all incidence angles greater than the critical angle, and thereby the water surface serves as a mirror. Mirror images are almost identical copies of original images of docking stations. Their similar appearances confuse detection and recognition algorithms implemented in AUVs.
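As a small numerical aside (not from the paper; the refractive index of water is assumed to be about 1.33), the critical angle at a water-air boundary can be computed directly from Snell's law:

```python
import math

# Critical angle for total internal reflection at a water-air boundary,
# from Snell's law: sin(theta_c) = n_air / n_water.
n_water = 1.33   # assumed refractive index of water
n_air = 1.00     # refractive index of air

theta_c = math.degrees(math.asin(n_air / n_water))
print(f"critical angle = {theta_c:.1f} degrees")   # about 48.8 degrees
# Light hitting the surface from below at larger incidence angles is reflected
# back into the water, which is why the surface can act as a mirror.
```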

In order to address the above problems, we propose an underwater docking framework that integrates a convolutional neural network (CNN) architecture, proposed for robust detection of underwater docking stations, and a perspective-n-point algorithm, employed for fast, accurate and robust pose estimation. Our contributions can be summarized as follows:

  • We provide our underwater docking dataset which was collected in our experimental water pool. To the best of our knowledge, it is the first publicly available dataset used for analysis of computer vision algorithms employed for underwater docking. We labeled bounding boxes of docking stations for each image in the dataset. The dataset can help researchers to develop underwater docking algorithms and validate them in the absence of underwater docking infrastructures. In addition, a series of deformation methods is proposed in this study to generate deformed realistic underwater images as close as possible to real-world undersea images.

  • We propose a convolutional neural network named DoNN for the detection of underwater docking stations. It has two main advantages compared to state-of-the-art networks. First, it is credible: it achieves a high AUC for detection of underwater docking stations (experimental analyses are given in Section 5.1, in particular Section 5.1.4). Second, the proposed detection approach is robust to various deformations, such as blurring, color shift, contrast shift, mirror images, non-uniform illumination and noisy luminaries, in various complex and dynamic underwater environments. It outperforms baseline models in terms of AUC on underwater images with blurring, color shift, contrast shift and mirror images. It achieves slightly inferior but acceptable AUC performance on underwater images with non-uniform illumination and noisy luminaries compared to baseline models.

  • We integrate a perspective-n-point algorithm termed RPnP, which is fast, accurate and robust to noise, into our framework for estimation of the relative position and orientation between docking stations and AUVs. Ground experiments show that the average position error (in millimeters) and orientation error (in degrees) of the pose estimation module are small, with a short running time per frame, and that the average position and orientation errors remain acceptable in the presence of strong noise.

2 Related work

In recent years, several VBUD solutions have been proposed for underwater docking. Cowen et al. (1997) first verified that optical terminal guidance acquisition ranges of 10-15 meters are possible even in very turbid water. In their study, docking algorithms guide AUVs by using the distribution of landmarks in all four quadrants of images. Their method is simple but effective; however, false detections occur in shallow water due to sunlight. In Cowen et al. (1997), one light was used as an active landmark. Hong et al. (2003) designed a cone-shaped docking station with six color lights. Five of them form a pentagon while one lies outside to prevent mutual interference. Park et al. (2009) take advantage of a docking station of similar shape; they used four white lights on the rim of the upper semicircle and one light on the lower semicircle. Li et al. (2015) placed four 540 nm green lights around the rim of a cone-shaped docking station to achieve good underwater propagation of green light. All landmarks mentioned above are arranged on a plane. Maki et al. (2013) leveraged 3D landmarks with three green lights and one red light for underwater docking.

Configuration of landmarks is tightly coupled with the mechanism of VBUD algorithms. VBUD algorithms perform docking station detection and pose estimation tasks. Detection of docking stations provides the location of docking stations in 2D images captured by an on-board camera. Underwater docking detection algorithms fall into two general categories: binarization based methods and feature based methods. Binarization based methods binarize images using a threshold, followed by a series of image preprocessing steps, due to limited prior knowledge (Park et al., 2009; Ghosh et al., 2016). Park et al. (2009) first binarize images by a pre-defined fixed threshold, and then remove salt-and-pepper noise using a mask. Ghosh et al. (2016) improve the threshold selection using an invariant histogram-based adaptive thresholding method. They did not use image pre-processing methods due to their relatively ideal test environment. The performance of binarization based methods is sensitive to user-defined thresholds, and requires well-rounded expert knowledge of different underwater environments in order to cope with non-uniform illumination, reduced contrast, blurring, color shift, mirror images and noisy luminaries. As an example of feature based detection methods, Li et al. (2015) first adopted the Mean-Shift algorithm to extract light source areas, and then took advantage of the Snake algorithm to recognize contour features. Finally, SVMs were used to detect features of landmarks. Although various detection approaches have been proposed in previous underwater docking works, none of them reported detection performance or analyzed the credibility and robustness of the detection methods in detail.

Pose estimation in underwater docking refers to recovering the 3D relative position and orientation between docking stations and AUVs from 2D images. Both monocular and binocular cameras can be used for this problem. Binocular methods estimate position information through disparity maps. They require longer processing time and large baselines for accurate estimation (Negre et al., 2008). Myint et al. (2016) estimate pose by using a binocular camera. Li et al. (2015) proposed a pose estimation algorithm which combines binocular and monocular methods. Park et al. (2009) utilized a monocular camera for pose estimation. They established an algorithm that maps the number of pixels to the distance between AUVs and docking stations. In their case, exact locations of docking stations cannot be obtained owing to factors like scattering. In order to obtain an exact pose by using monocular cameras, perspective-n-point (PnP) algorithms are used. A PnP algorithm recovers pose information given point correspondences between 2D images and 3D points. A unique solution can be obtained using more than three correspondence points. Li et al. (2015) took advantage of P3P to estimate pose in their monocular module. Ghosh et al. (2016) estimated poses by fitting ellipses. They mounted twelve lights around a cone-shaped docking station. An ellipse was fit, and the parameters of the ellipse were used to estimate the pose. Their experiments were only carried out in a range of less than 140 cm, which is not convincing for real underwater docking tasks (10-15 meters).

Figure 6: Overview of our system used in the experiments. (a) A picture of the SIA-9 in the lake. (b) Illustration of different modules in the SIA-9 after refitting for underwater docking. (c) The embedded computer used for underwater docking. The appearance of our docking station (d) on water surface, and (e) in water.

3 System overview

This section introduces an overview of our proposed underwater docking system.

In order to perform our experiments, we developed an AUV research platform (SIA-9), a small torpedo-shaped vehicle, as shown in Figure 6. Its specifications are given in Table 1. The SIA-9 is mainly equipped with a Doppler Velocity Log (DVL), a compass, an inertial measurement unit, a radio, GPS, a control computer, battery units and motors. The software architecture is based on MOOS-IvP (Benjamin et al., 2010), which is used to separate the overall capability into distinct modules. Each module acts as a MOOS process which can publish its own information to MOOSDB and subscribe to information from other processes. MOOSDB is a module used for exchanging information between different MOOS processes and is responsible for maintaining the consistency of information. Implementation details of the SIA-9 were given by Jia et al. (2016).

Item Value
Diameter 250 mm
Length 1576 mm
Weight in air 75 kg
Maximum operating depth 300 m
Maximum cruising speed 3 knots
Table 1: Specifications of the SIA-9 used in the experiments.

In order to use the SIA-9 for our underwater docking task, three modules were installed on the SIA-9. First, a forward-looking RGB monocular camera with a frame rate of 20 fps was rigidly mounted inside the head of the SIA-9. Second, we installed an embedded computer, shown in Figure 6(c), inside the head of the SIA-9 to process underwater images. The embedded computer was equipped with an Intel Core i7 3.4 GHz processor, 8 GB RAM and a 64-bit operating system. It communicates with the control computer and the camera through LAN and PCIe, respectively. Third, the head of the SIA-9 was redesigned to fit the camera and the embedded computer, as shown in Figure 6(b).

Our docking station is cone-shaped with an outer diameter of 1200 mm and an inner diameter of 300 mm. Eight active landmarks were mounted uniformly around the rim of the docking station, as shown in Figure 6(d) and 6(e). Blue LED lights with a 460 nm wavelength were used as landmarks due to the good propagation of blue light in water.

4 Underwater docking algorithm

In this section, we provide our underwater docking algorithm consisting of two modules, which are used for i) detection of underwater docking stations, and ii) estimation of the pose between docking stations and AUVs. The detection module takes underwater images as input, and outputs the location of underwater docking stations in the 2D images. In other words, it is used to determine whether the docking station is within the field of view of the AUV, and where it is located in the captured image. The pose estimation module computes the relative position and orientation between the AUV and the docking station once the predesignated docking station is detected. Following the detection and pose estimation phases, the AUV conducts a line tracking task, taking its current position as the start point and the position given by the pose estimation module as the end point. Line tracking is a common procedure used to operate AUVs, and its analysis is out of the scope of this work.

4.1 Detection of underwater docking station using deep neural networks

A robust and credible detection algorithm is highly desirable in practical underwater docking, as mentioned in Section 1. Underwater docking detection suffers from blurring, color shift, reduced contrast, mirror images, non-uniform illumination and noisy luminaries more than overwater detection tasks do. Robust detection can guarantee detection performance in various non-stationary underwater environments, while credible detection can improve docking efficiency. Docking efficiency is crucial for AUVs that are in a low battery state and will recharge their battery by underwater docking. However, neither robustness nor credibility received enough attention in previous works, and no detection performance was reported with a detailed analysis. In this work, we propose a convolutional neural network (CNN), called Docking Neural Network (DoNN), inspired by YOLO (Redmon et al., 2016) and aiming at robust and credible detection of docking stations. In this section, we first provide a brief background on CNNs, and then introduce our proposed DoNN for detection of underwater docking stations.

4.1.1 Background of convolutional neural networks

A CNN is a deep neural network (DNN) which employs convolution layers as building blocks to model spatial patterns. CNNs draw several key ideas from Hubel and Wiesel's discovery on the cat's visual cortex (Hubel and Wiesel, 1959), and they have been among the most successful methods used to perform robot vision tasks in recent years (Oliveira et al., 2017; Schwarz et al., 2016; Levine et al., 2016). A CNN is used to estimate a function defined by $\hat{y} = f(x; \theta)$, where $x$ and $\hat{y}$ are the input and output of the CNN. The estimated function is parameterized by the network parameters (weights) $\theta$ of the CNN. In the training phase, the network parameters are estimated by minimizing a loss function $L$ as

$\theta^{*} = \arg\min_{\theta} L(f(x; \theta), y)$.   (1)

A forward propagation follows the training step for prediction of the output $\hat{y} = f(x; \theta^{*})$. A typical CNN layer consists of three basic operations: convolution, a nonlinear activation function and pooling. In CNNs, a convolution operation is defined for an input 2D image $I$ and a kernel $K$ by

$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n)$,   (2)

where $*$ denotes the convolution operation, $K$ denotes the filters or kernels, and $S(i, j)$ is the value of the feature map computed at location $(i, j)$. After computation of a convolution step at a location, the filter shifts to the next location to perform the next step of the convolution, and the amount of shift is controlled by the stride.

In CNNs, convolution is usually followed by a nonlinear activation function. A commonly used nonlinear activation function is the rectified linear unit (ReLU) (Nair and Hinton, 2010). Pooling is a form of non-linear down-sampling. It substitutes the input at a location with a summary of a neighbourhood of that location, computed by a pooling function with summary statistics, such as max pooling (Zhou and Chellappa, 1988). The neighbourhood can also be viewed as a sliding window centered at the location; the window slides across every possible location of the feature map step by step. If max pooling is used, the largest element within the window is taken at each step. The stride of pooling controls the number of units shifted at each pooling step. The number of parameters and the computational burden are significantly reduced through pooling. Meanwhile, pooling provides invariance to small translations of the input within the receptive field of the corresponding units (neurons) of the CNN.

CNNs are efficient in terms of memory requirements and statistical efficiency. The efficiency is obtained by two main features: local connectivity and parameter sharing. Local connectivity means that each neuron is connected only to a local region of its input. Parameter sharing is based on the assumption that different patches of local regions share some collective features, e.g. edges. These two features significantly reduce the amount of parameters of CNNs.
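A small back-of-the-envelope comparison (with assumed sizes, not taken from the paper) illustrates how much local connectivity and parameter sharing reduce the parameter count relative to a dense mapping:

```python
# Parameter count of a dense mapping vs. a convolution layer, with assumed sizes.
H, W, C_in, C_out = 448, 448, 3, 16   # hypothetical image size and channel counts
k = 3                                  # hypothetical 3x3 kernels

fully_connected = (H * W * C_in) * (H * W * C_out)   # every input connected to every output
convolutional = (k * k * C_in) * C_out + C_out       # shared kernels plus biases

print(f"fully connected parameters: {fully_connected:,}")
print(f"convolutional parameters:   {convolutional:,}")
```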

4.1.2 Docking Neural Network

In this subsection, we introduce our docking neural network (DoNN) proposed for detection of underwater docking stations. Object detection is one of the most important tasks studied in robot vision. Generally speaking, CNN based detection approaches fall into two categories: region proposal based detection and proposal free detection (Kong et al., 2016). In region proposal based detection, such as Fast-RCNN (Girshick, 2015) and Faster-RCNN (Ren et al., 2017), candidate regions are first proposed by another component, such as the region proposal network (RPN) in Faster-RCNN or selective search (Uijlings et al., 2013) in Fast-RCNN. Then, objects are detected within the proposals. Proposal free detection, used by YOLO (Redmon et al., 2016), poses detection as a regression problem: bounding-boxes and their confidences are predicted simultaneously in one pass. DoNN is inspired by YOLO (Redmon et al., 2016). In our proposed method, we redesigned the loss function used in YOLO to be compatible with our dataset, which contains a single object class.

DoNN consists of nine convolution layers and seven pooling layers. Its architecture is illustrated in Figure 7. The detection problem is treated as a regression problem in DoNN. In underwater docking tasks, DoNN maps underwater images to the locations of docking stations in the images. DoNN learns feature representations of underwater docking stations from the whole image through minimization of the loss function given in (3) in the training phase, and predicts the location of docking stations in unseen underwater images by employing the learned feature representations in the inference phase.

DoNN takes the whole image as input. Input images are first divided into a grid of cells, as illustrated in Figure 7. DoNN predicts the positions and sizes of multiple candidate bounding boxes for each grid cell, along with their confidence scores. We fix the number of candidates and denote it by B. We further explain bounding boxes and the associated confidence scores as follows.

  1. bounding boxes: Each grid cell predicts B candidate bounding-boxes. The center of a predicted bounding-box lies in its grid cell, and the coordinate of the center is represented by offsets from the grid cell bounds. The width and height of the bounding-box are divided by the image width and height, respectively, so that they are normalized to the range between 0 and 1.

  2. bounding-box confidence score: Each predicted bounding-box residing in a grid cell is assigned a confidence score.
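To make this grid-based encoding concrete, the following hedged sketch encodes a ground-truth bounding box into a YOLO-style target tensor. The grid size S = 7 and the box values are assumptions made for illustration; the exact values used by DoNN are not preserved in this copy of the paper.

```python
import numpy as np

def encode_target(box, image_w, image_h, S=7):
    """box = (x_center, y_center, w, h) in pixels; returns an (S, S, 5) target tensor."""
    target = np.zeros((S, S, 5))
    cx, cy, w, h = box

    col = int(cx / image_w * S)                   # grid cell containing the box center
    row = int(cy / image_h * S)

    x_offset = cx / image_w * S - col             # center as offsets of the grid cell bounds
    y_offset = cy / image_h * S - row
    w_norm = w / image_w                          # width/height normalized to [0, 1]
    h_norm = h / image_h

    target[row, col] = [x_offset, y_offset, w_norm, h_norm, 1.0]   # last entry: confidence
    return target

t = encode_target((220, 150, 80, 60), image_w=448, image_h=448)    # hypothetical box
print(t[2, 3])    # the cell responsible for this docking-station box
```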

Figure 7: An illustration of the architecture of the proposed DoNN. DoNN consists of nine convolution layers (Conv), seven max pooling layers, and three fully connected layers (FC). F(width,height,depth) indicates a convolution filter with the size of width, height and depth. Sliding windows of all max pooling layers have size . The stride is set to in all convolution layers, and it is set to in all max pooling layers. The number of neurons in the first fully connected layer (FC1), the second fully connected layer (FC2), and the third fully connected layer (FC3) is , and , respectively. DoNN takes three channel images as input, and predicts a tensor from which the final detection is obtained.

DoNN is trained to minimize the discrepancy between predictions and manually labeled ground truth for each grid. This discrepancy is expressed by the loss function

(3)

We compute each sub-loss function as follows:

(4)
(5)

and

(6)

where

  • The first sub-loss penalizes the difference between the predicted bounding-boxes and their ground truth for each grid cell. Each grid cell has two mutually exclusive states: i) containing a docking station, and ii) not containing a docking station; a cell contains a docking station if the center of the docking station falls into that cell. When a grid cell contains a docking station, a predicted bounding-box also has two states: it is either responsible for the prediction or not. A bounding-box is responsible for the prediction when it has the largest IoU (Everingham et al., 2010) with the ground truth bounding-box among the bounding-boxes predicted in that cell. The corresponding indicator is equal to 1 if a) the grid cell contains a docking station, and b) the bounding-box is responsible for the prediction; otherwise, it is 0.

  • The second sub-loss penalizes the confidence score loss for the grid cells that contain docking stations, using the ground truth and predicted confidence scores. Its indicator function is equal to 1 if the grid cell contains a docking station, and 0 otherwise.

  • The third sub-loss penalizes the confidence score loss for the grid cells that do not contain docking stations. Its indicator function is equal to 1 if the grid cell does not contain a docking station, and 0 otherwise.

The weighting parameters are used to control the contributions of the different parts of the loss function (3). We experimentally analyze how the configuration of these parameters affects detection performance in detail in Section 5.1.3.

DoNN receives an unseen underwater image as input and, for inference, outputs a tensor predicting B bounding-boxes (each parameterized by four elements) and their confidence scores (one element each) for each grid cell. The final prediction is computed by

(7)

where the selected confidence score is the predicted confidence score used in (5), and the selected bounding-box is the final predicted bounding-box of the docking station.
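A hedged sketch of this inference step is given below: the predicted bounding box with the highest confidence score over all grid cells is selected and converted back to absolute image coordinates, in the spirit of Eq. (7). The values S = 7 and B = 2 are assumptions for illustration; the paper's exact values are not preserved in this copy.

```python
import numpy as np

def decode_prediction(pred, image_w, image_h, S=7, B=2):
    """pred has shape (S, S, 5 * B): B boxes of (x, y, w, h, confidence) per grid cell."""
    boxes = pred.reshape(S, S, B, 5)
    row, col, b = np.unravel_index(np.argmax(boxes[..., 4]), boxes[..., 4].shape)
    x_off, y_off, w, h, conf = boxes[row, col, b]

    cx = (col + x_off) / S * image_w               # undo the grid-offset encoding
    cy = (row + y_off) / S * image_h
    return (cx, cy, w * image_w, h * image_h), conf

pred = np.random.rand(7, 7, 10)                    # stand-in for a DoNN output tensor
box, confidence = decode_prediction(pred, 448, 448)
print(box, confidence)
```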

The major difference between DoNN and YOLO is the loss function. In addition to the loss functions (3), (5) and (6), another loss function called the class loss is used in YOLO. The process of learning in YOLO can be viewed as learning an objectness probability and a conditional class probability for each grid cell, and their product is used to predict the final class score for each grid cell. However, in our case, this product introduces instability, as illustrated in Figure 8, because only one target is observed in our task and our dataset is relatively small. To remedy this problem, we redesign the loss function used in YOLO by estimating only the confidence score instead of an objectness probability and a conditional probability.

Figure 8: A visualization of the objectness probability map and the conditional class probability map computed using YOLO for all grid cells. The left two figures show these two maps, respectively; the third figure shows their pixel-wise multiplication. The red bounding-box depicted in the last figure indicates the final prediction provided by YOLO. The objectness map predicts the location of the docking station correctly; however, the final prediction is corrupted after multiplication with the conditional class probability map.

4.2 Pose estimation

In Section 4.1, we explained how to compute the 2D locations of docking stations in 2D images using the proposed DoNN. In this section, we provide a method which is used to recover the relative 3D position and orientation between docking stations and AUVs from the 2D image patch determined by the estimated 2D location. The 3D position is represented by a translation, and the orientation is represented by Euler angles, as illustrated in Figure 9. The relative 3D position and orientation between docking stations and AUVs are collectively called the pose, and the process of pose recovery is called pose estimation. Pose estimation requires recovering the pose from the 2D image patch defined by the bounding-box described in (7).

Pose estimation is performed only if the docking station is fully observed. We categorize observations of docking stations into full observations and partial observations. A full observation is a case where all eight landmarks can be observed; otherwise, it is called a partial observation. In order to recognize full observations, we segment the image patch determined by the detected bounding-box using the adaptive segmentation method proposed by Bradley and Roth (2007). Since other interference, such as noisy luminaries and mirror images, has already been addressed during detection, the sensitivity of the segmentation is not a concern. After the segmentation, several connected components are available. If the number of connected components is less than eight, the observation is a partial observation of the docking station. The number of connected components may also be greater than eight, which is observed when one light is segmented into more than one connected component. In this case, we use the k-means algorithm (Arthur and Vassilvitskii, 2007) to cluster adjacent components and obtain the centroids of the eight landmarks. At the final step, the centroids are used for pose estimation.
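The following sketch outlines this full-observation check. OpenCV's mean adaptive threshold stands in for the Bradley-Roth method cited above, and scikit-learn's KMeans stands in for the k-means step; the file name and threshold parameters are illustrative, not the authors' settings.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

def landmark_centroids(patch_gray, n_landmarks=8):
    """Return the eight landmark centroids of a detected patch, or None if partial."""
    # Mean adaptive threshold (stand-in for Bradley-Roth); keeps pixels brighter
    # than their local neighbourhood, i.e. the LED landmarks.
    binary = cv2.adaptiveThreshold(patch_gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 31, -10)
    _, _, _, centroids = cv2.connectedComponentsWithStats(binary)
    components = centroids[1:]                     # drop the background component

    if len(components) < n_landmarks:
        return None                                # partial observation: skip pose estimation
    if len(components) > n_landmarks:
        # One light split into several components: cluster adjacent components.
        components = KMeans(n_clusters=n_landmarks, n_init=10).fit(components).cluster_centers_
    return components                              # 8 (x, y) centroids used for PnP

patch = cv2.imread("docking_patch.png", cv2.IMREAD_GRAYSCALE)   # illustrative file name
if patch is not None:
    print(landmark_centroids(patch))
```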

If two coordinate frames are attached to the docking station and the AUV separately, then we model the geometric relationship (the translation and rotation) between the two coordinate frames for pose estimation. In pose estimation, three coordinate frames are used, as illustrated in Figure 9: i) the image coordinate frame, ii) the camera coordinate frame and iii) the reference coordinate frame. The image coordinate frame is a 2D coordinate system in pixels which is established on the image plane. The camera and reference coordinate frames are 3D coordinate systems in millimeters. The origin of the camera coordinate frame resides at the optical center of the camera. We attach the reference coordinate frame to the center of the circle formed by the eight lights. Calculating the pose between the AUV and the docking station is equivalent to determining the transformation between the camera coordinate frame and the reference coordinate frame, since the camera is rigidly fixed on the head of the AUV, and only rigid-body motion is considered.

Figure 9: Coordinate frames used in underwater docking. Left: A reference frame. The origin of the reference frame is located at the centroid of the circle formed by landmarks. Middle: An image plane, and an image coordinate frame. Right: A camera frame, and illustrated Euler angles.

Next, we explain the determination of a transformation between the camera and reference coordinate frames. Considering a pinhole camera, the transformation between the image coordinate frame and the camera coordinate frame is computed by

$s\,[u, v, 1]^{\top} = \mathbf{K}\,[X_c, Y_c, Z_c]^{\top}, \quad \mathbf{K} = \begin{bmatrix} f_x & \gamma & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}$,   (8)

where $(u_0, v_0)$ is the principal point measured in pixels, $\gamma$ is the skew coefficient, $(u, v)$ is the coordinate in the image frame, $(X_c, Y_c, Z_c)$ is the coordinate in the camera frame, $f_x$ and $f_y$ denote the scaling factors converting space metrics to pixel units, $s$ is a scale factor, and $\mathbf{K}$ is the intrinsic matrix of inherent parameters of the camera. It describes the transformation between the image frame and the camera frame, and can be obtained by camera calibration. The relationship between the camera frame and the reference frame can be described by the extrinsic matrix as follows:

$[X_c, Y_c, Z_c]^{\top} = \mathbf{R}\,[X_r, Y_r, Z_r]^{\top} + \mathbf{t}$,   (9)

where $(X_r, Y_r, Z_r)$ is the coordinate in the reference coordinate frame, and $\mathbf{R}$ is a rotation matrix satisfying $\mathbf{R}^{\top}\mathbf{R} = \mathbf{I}$ and $\det(\mathbf{R}) = 1$. The translation vector $\mathbf{t}$ is the coordinate of the origin of the reference frame expressed with respect to the camera frame, and it is the target point used for line tracking mentioned in Section 4.
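A small numerical sketch of Eqs. (8) and (9) follows: a landmark given in the reference frame is transformed into the camera frame and then projected onto the image plane. The intrinsic parameters and pose below are illustrative, not the calibration of the camera used in the paper.

```python
import numpy as np

K = np.array([[800.0,   0.0, 320.0],     # fx, skew, u0 (illustrative calibration)
              [  0.0, 800.0, 240.0],     # fy, v0
              [  0.0,   0.0,   1.0]])    # intrinsic matrix

R = np.eye(3)                            # rotation from reference frame to camera frame
t = np.array([0.0, 0.0, 3000.0])         # translation in millimeters (station 3 m ahead)

P_ref = np.array([600.0, 0.0, 0.0])      # a landmark on the 1200 mm rim (reference frame)
P_cam = R @ P_ref + t                    # Eq. (9): reference frame -> camera frame
uvw = K @ P_cam                          # Eq. (8): camera frame -> homogeneous pixels
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
print(u, v)                              # pixel coordinates of the projected landmark
```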

The computation of $\mathbf{R}$ and $\mathbf{t}$ from corresponding points between 2D image points and 3D coordinates is known as the perspective-n-point (PnP) problem. The term was first coined by Fischler and Bolles (1981) for the computation of $\mathbf{R}$ and $\mathbf{t}$ given a calibrated camera and a set of correspondences between 3D reference points and their 2D image points. At least four correspondences are required to assure the computation of a unique solution for $\mathbf{R}$ and $\mathbf{t}$. Usually, two types of methods are used to solve PnP problems: analytical and iterative methods. A popularly used analytical method is the Direct Linear Transformation (DLT) (Abdel-Aziz et al., 2015). DLT calculates $\mathbf{R}$ and $\mathbf{t}$ by solving for 11 entries in linear equations derived from (8) and (9) using at least six corresponding points. DLT is computationally efficient but suffers from instability in the presence of noise. Iterative methods address PnP problems by minimizing an error criterion, such as the re-projection error proposed by Zhang (2000) and the collinearity error proposed by Lu et al. (2000). Iterative methods are less sensitive to noise and more accurate, but computationally expensive.

In our framework, RPnP (Li et al., 2012) is employed to estimate the pose between AUVs and docking stations. Figure 10 illustrates a 3D point and its corresponding image point. Given the correspondences, RPnP first splits the set of reference points into subsets, each of which contains 3 points; one subset is illustrated in Figure 10. According to the law of cosines, the following constraints are satisfied for each subset:

(10)
Figure 10: Illustration of the RPnP algorithm, showing three 3D points, their corresponding 2D image points located on the image plane, and the optical center of the camera.

Then, (10) is converted into a fourth order polynomial for each subset:

(11)

Since several subsets are available, we obtain a set of polynomial equations of the form (11), which form a system of nonlinear equations. Rather than solving this nonlinear equation system, RPnP analyzes the local minima of a cost function defined as the sum of squares of the polynomials in (11); the local minima are located from the roots of a seventh order polynomial, and there are at most 4 of them. The final solution with the least re-projection residual is selected from these minima as the pose estimation result.

The RPnP used in our work has the following properties:

  • RPnP is accurate and highly efficient. It is a non-iterative solution and achieves solutions as accurate as those provided by iterative methods, with less computational cost. The accuracy of RPnP is a median rotation error of 1.5 degrees and a correspondingly small median translation error for co-planar points with Gaussian noise, as reported by Li et al. (2012) in simulations. RPnP consumes less than 1 ms per estimate for moderate numbers of correspondences.

  • RPnP is stable in the co-planar case. Many PnP solutions (Oberkampf et al., 1996; Schweighofer and Pinz, 2006) suffer from pose ambiguity, which results in highly unstable results. Pose ambiguity refers to the fact that the orientation cannot be determined uniquely (Tanaka et al., 2014). Allowing a co-planar arrangement of landmarks reduces the complexity of designing docking stations. Li et al. (2012) showed that the mean errors of rotation and translation converge when the number of correspondences is larger than eight. Therefore, eight landmarks are used in our docking station.

  • The computational complexity of RPnP is $O(n)$: its computational time grows linearly with the number of correspondences $n$. This offers flexibility for increasing or decreasing the number of landmarks according to practical requirements. Therefore, we can re-configure the number of landmarks without a substantial increase in computational cost.
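As a hedged illustration of the pose estimation step, the sketch below recovers a simulated pose from the eight rim landmarks. OpenCV does not provide RPnP, so cv2.solvePnP is used as a stand-in; the landmark layout, intrinsic parameters and simulated pose are illustrative, not the paper's calibration.

```python
import cv2
import numpy as np

# Eight co-planar landmarks on the 1200 mm rim, in the reference frame (millimeters).
angles = np.arange(8) * (2 * np.pi / 8)
object_points = np.stack([600 * np.cos(angles),
                          600 * np.sin(angles),
                          np.zeros(8)], axis=1)

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                            # illustrative intrinsic matrix

# Simulate image observations from a known pose, then recover that pose.
R_true, _ = cv2.Rodrigues(np.array([0.05, -0.02, 0.10]))
t_true = np.array([[100.0], [50.0], [4000.0]])             # 4 m in front of the camera
cam_pts = (R_true @ object_points.T + t_true).T            # Eq. (9)
proj = (K @ cam_pts.T).T                                   # Eq. (8)
image_points = proj[:, :2] / proj[:, 2:]

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, K, np.zeros(5),
                              flags=cv2.SOLVEPNP_ITERATIVE)
R_est, _ = cv2.Rodrigues(rvec)                             # rotation matrix
print(ok, tvec.ravel())                                    # approximately (100, 50, 4000)
```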

5 Experimental analyses

In the experimental analyses, we first analyze robustness and credibility of the proposed DoNN using our datasets. Then, we analyze the accuracy and efficiency of the pose estimation method by ground and underwater docking experiments. Finally, we give experimental analyses of our integrated underwater docking algorithm incorporating the aforementioned detection and pose estimation methods. Implementation details of the algorithms are given in the supplemental material.

5.1 Analysis of detection performance

In this section, we first introduce our proposed dataset UDID, and performance measures used to evaluate detection performance. Then, we analyze convergence properties of DoNN for the proposed loss functions. Finally, we compare and analyze detection performance of DoNN, and state-of-the-art YOLO and FasterRCNN methods using the UDID and its deformed variations.

5.1.1 Our proposed UDID dataset

In order to evaluate the performance of DoNN, we first set up an underwater docking images dataset (UDID), which was collected in our experimental water pool. The experimental water pool is 15 m long, 10 m wide and 9 m deep, with the docking station fixed at a depth of 2 m. The dataset comprises a training set and a test set, as illustrated in Table 2. We call images containing docking stations foreground images, and images not containing docking stations background images.

As mentioned in Section 1, real-world underwater images suffer from (1) blurring, (2) color shift, (3) reduced contrast, (4) mirror images, (5) non-uniform illumination and (6) noisy luminaries (Bryson et al., 2016; Park et al., 2009). Therefore, we also deform the test set using various deformation methods to assess the performance of DoNN in simulated dynamic underwater environments. All images are resized to a fixed resolution for training and testing of DoNN, and the images, along with the detected bounding-boxes, are resized back to the original size for pose estimation.

Subset      Foreground Images   Background Images   Total Images
Training    8252                0                   8252
Test        1128                1114                2242
Table 2: Our proposed dataset UDID. The training subset contains 8252 foreground images and no background images. The test subset consists of 1128 foreground images and 1114 background images.

As mentioned in Section 4.1.2, there are two types of CNN based detection methods: region proposal based detection and proposal free detection. In the experimental analyses, we compare our proposed DoNN with Faster-RCNN and YOLO, which are state-of-the-art region proposal based and proposal free detection methods, respectively. For a fair comparison, YOLO and DoNN share the same architecture before the fully connected layers. A Faster-RCNN which employs a ZF network (Zeiler and Fergus, 2014) is used for comparison. The fully connected layers of the Faster-RCNN are adapted to our two-class classification task for end-to-end training.

5.1.2 Performance measures used for evaluation of detection algorithms

Detection performance is evaluated using the receiver operating characteristic (ROC) curve and its area under the curve (AUC) (Bradley, 1997). The ROC is a plot of the true positive rate (TPR) against the false positive rate (FPR) computed at various threshold settings. TPR and FPR are defined by

$\mathrm{TPR} = \dfrac{TP}{TP + FN}, \qquad \mathrm{FPR} = \dfrac{FP}{FP + TN}$,   (12)

where $TP$, $FP$, $TN$ and $FN$ are the numbers of true positive, false positive, true negative and false negative samples, respectively. $TP$, $FP$, $TN$ and $FN$ are illustrated by the confusion matrix in Table 3. If both the prediction and the true value of a sample are docking station, then the prediction is evaluated as a true positive (TP). If both the prediction and the true value are non-docking station, then the prediction is evaluated as a true negative (TN). If the predicted value is docking station while the true value is non-docking station, then the prediction is evaluated as a false positive (FP); a false positive means that a docking station is detected when no docking station is actually there. If the predicted value is non-docking station while the true value is docking station, then the prediction is a false negative (FN). We label the true value of a bounding-box as docking station if its IoU with the ground truth bounding-box exceeds the threshold suggested by Everingham et al. (2010); otherwise, it is labeled as non-docking station.

                                          True Values
                                          Docking Stations   Non-docking Stations
Predicted Values   Docking Stations       TP                 FP
                   Non-docking Stations   FN                 TN
Table 3: Confusion matrix of predicted and true values.

The area under the ROC curve is called AUC for short, which can give an insight into the general performance of detection algorithms. AUC is equal to 1 if the maximum performance is achieved. In the following sections, we analyze the performance of DoNN, YOLO and Faster-RCNN in aforementioned test set and its various deformed versions.
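The following sketch (with random placeholder labels and scores, not the paper's results) shows how the ROC curve and AUC of Eq. (12) can be computed with scikit-learn once each detection has been labeled by the IoU criterion.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                     # 1: docking station, 0: non-docking station
scores = np.clip(0.4 * y_true + rng.random(200) * 0.6, 0, 1)   # synthetic confidence scores

fpr, tpr, thresholds = roc_curve(y_true, scores)          # TPR vs. FPR at varying thresholds, Eq. (12)
print("AUC =", auc(fpr, tpr))                             # area under the ROC curve
```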

AUC 0.1 0.5 1 3 5
0.1 0.99494 0.99856 0.99862 0.99697 0.99564
0.5 0.96462 0.99689 0.99435 0.99410 0.99297
1 0.92583 0.99509 0.99348 0.99134 0.99067
3 0.87398 0.99708 0.99166 0.98235 0.99042
5 0.87772 0.99492 0.99311 0.99137 0.99231
(a)
AUC 0.1 0.5 1 3 5
0.1 0.99863 0.99918 0.99878 0.99964 0.99902
0.5 0.99376 0.99847 0.99923 0.99800 0.99847
1 0.98665 0.99705 0.99822 0.99827 0.99579
3 0.95976 0.99383 0.99730 0.99806 0.99674
5 0.95500 0.99367 0.98950 0.99605 0.99648
(b)
AUC 0.1 0.5 1 3 5
0.1 0.99900 0.99928 0.99930 0.99958 0.99958
0.5 0.99557 0.99907 0.99917 0.99931 0.99916
1 0.98988 0.99893 0.99892 0.99861 0.99787
3 0.96917 0.99721 0.99777 0.99789 0.99835
5 0.95736 0.99383 0.99751 0.99150 0.99697
(c)
AUC 0.1 0.5 1 3 5
0.1 0.99370 0.99719 0.99907 0.99951 0.99953
0.5 0.98607 0.99888 0.99904 0.99945 0.99895
1 0.98597 0.99752 0.99905 0.99935 0.99952
3 0.96551 0.99153 0.99388 0.99526 0.99807
5 0.95920 0.99428 0.99507 0.99577 0.99434
(d)
AUC 0.1 0.5 1 3 5
0.1 0.98886 0.99951 0.99920 0.99895 0.99931
0.5 0.98908 0.99577 0.99906 0.99924 0.99961
1 0.98509 0.99340 0.99916 0.99897 0.99928
3 0.97131 0.99635 0.99313 0.99463 0.99908
5 0.96330 0.99158 0.99239 0.98866 0.99797
(e)
Table 4: AUC performance of DoNN obtained using different , and on . and are set to and in these experiments, respectively. DoNN achieves the best AUC performance for , , .

5.1.3 Analysis of hyperparameters of loss functions of DoNN

Our proposed DoNN was trained using the loss function given in (3). The loss function contains three parts, given in (4), (5) and (6), and five hyperparameters are used during the training phase. Three of the hyperparameters are employed to balance the contributions of the three parts of the loss function (3), since these parts take values on different scales, as shown in Figure 11; they can also be viewed as three weights acting on the three parts of the loss function. We show the AUC performance for different settings of these three hyperparameters in Table 4. On average, DoNN achieves acceptable AUC performance for most of the hyperparameter settings. DoNN provides better performance when more weight is given to the terms for grid cells that contain docking stations than to the term for cells that do not. DoNN also performs better when less weight is given to the term for cells without docking stations, with the other weights fixed. This is because the number of grid cells not containing docking stations is larger than the number of grid cells containing docking stations, so this term dominates the value of the loss function if equal weights are given to the three parts. DoNN achieves the best performance for the setting reported in Table 4; therefore, we use this setting in the following experiments.

The hyperparameter B controls the number of scored bounding-boxes predicted by DoNN for each grid cell. The larger B is, the more parameters and computation are required. We show the AUC performance of DoNN for different settings of B in Table 5. The AUC performance of DoNN does not benefit much from a large B. Since DoNN achieves the best performance for the setting reported in Table 5, we use that value of B in the following experiments.

The other hyperparameter controls the number of grid cells into which an image is divided. Fine-grained grids increase the number of parameters and the computational cost. We perform experiments at several grid sizes, since the grid size must be a factor of the width and height of the input image, which are equal in our experiments. The experimental results are shown in Table 6. The best AUC performance is achieved at the setting reported in Table 6, which we use in the experiments.

Figure 11: Change of the values of the loss terms (4), (5) and (6) during the training phase.
AUC 0.99944 0.99964 0.99962 0.99953 0.99915
Table 5: AUC performance of DoNN obtained using different settings of on . We use , , , for experiments given in this table. Best performance is obtained using .
AUC 0.99900 0.99819 0.99964 0.99890
Table 6: AUC performance of DoNN obtained using different settings of on . We use , , , in experiments of this table. Best performance is obtained using .

5.1.4 Comparison of performance of detection algorithms using the original test dataset

We first analyze the detection performance of YOLO, FasterRCNN and DoNN using the original test dataset. Figure 12 shows the ROC curves and the associated AUCs of the three models. The results show that DoNN performs slightly better than Faster-RCNN, and they both outperform YOLO. The AUCs of DoNN and Faster-RCNN are both high, achieving good performance on the original test set. We will show in the following analyses that Faster-RCNN is not as robust as DoNN to various deformations of the test set.

As a concrete example, we depict the feature maps computed at the convolution layers, and the detection results of typical samples, for the three models in Figures 25, 33 and 46. A feature map is computed by

(13)

where the feature map is computed using (2) at the corresponding convolutional layer. After each pooling operation, the feature map is down-sampled by a factor which is determined by the architecture of the network. Each feature map is coupled with a color-bar which indicates its corresponding color scale. For DoNN and YOLO, an additional confidence map is depicted in the second-to-last figure. Each pixel of the confidence map indicates the predicted confidence score for its corresponding grid cell.

Comparing the feature maps computed using these three models, we conjecture that all three models can be used to learn feature representations of the spatial structural patterns of docking stations which are, to different extents, invariant to changes of light. In Figure 33(b), we depict a feature map computed using FasterRCNN. We observe that features are activated only in the region containing the docking station, although the upper ambient light is stronger than the lower. As for DoNN, the effect of ambient light is gradually eliminated, as shown in Figure 46. We depict a feature map computed using DoNN in Figure 46(f). The results show that features are activated only in the region that contains the docking station. Therefore, DoNN is robust to light variance. As a result, the grid cell corresponding to the docking station obtains the highest confidence score while the others obtain almost zero. Figure 25 shows a false prediction provided by YOLO. The IoU of the prediction is low since the estimated conditional class distributions corrupt the final confidence map, as illustrated in Figure 8.

Since AUC performance of YOLO is worse compared to FasterRCNN and DoNN, we further analyzed the performance of FasterRCNN and DoNN in the following sections.

Figure 12: ROC curves and the associated AUCs of DoNN, YOLO, and Faster-RCNN computed using .
Figure 25: Feature maps computed for a false detection predicted by YOLO and its detection result on . Figures given from left to right, and top to bottom correspond to its input image, feature maps computed at the layer , the confidence map and the final detection result.
Figure 33: Feature maps computed using Faster-RCNN, and its detection result on . Figures given from left to right, and top to bottom correspond to its input image, feature maps computed at the layer , and the final detection result.
Figure 46: Feature maps computed using DoNN, and its detection result on . Figures given from left to right, and top to bottom correspond to its input image, feature maps computed at the layer , the confidence map, and the final detection result.

5.1.5 Comparison of performance of detection algorithms using blurred images

Blurred underwater images are observed due to the scattering of light (Bryson et al., 2016). We analyze the change of performance of the detection methods using blurred underwater images in a controlled setting for different blurring patterns. Therefore, we generate blurred images from the UDID by employing Gaussian filters (Shapiro and Stockman, 2001)

$G(d) = \dfrac{1}{2\pi\sigma^{2}} \exp\!\left(-\dfrac{d^{2}}{2\sigma^{2}}\right)$,   (14)

with varying standard deviation $\sigma$ to simulate blurring in underwater images, where $d$ is the distance between a pixel and the filter center pixel. The filter size is fixed, and we sample the deviation $\sigma$ uniformly. The larger the value $\sigma$ takes, the more blurred the obtained images are. We refer to the resulting dataset as the blurred test set.
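A hedged sketch of this deformation is given below; cv2.GaussianBlur stands in for the Gaussian filtering of Eq. (14), and the kernel size, sigma range and file name are illustrative rather than the paper's exact settings.

```python
import cv2

image = cv2.imread("udid_test_image.png")            # illustrative file name
if image is not None:
    for sigma in range(1, 11):                        # increasing blur strength
        # Gaussian smoothing with a fixed kernel size and varying standard deviation.
        blurred = cv2.GaussianBlur(image, (21, 21), sigmaX=sigma, sigmaY=sigma)
        cv2.imwrite(f"blurred_sigma_{sigma}.png", blurred)
```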

Table 7 shows sample blurred images and the corresponding detection results predicted using FasterRCNN and DoNN, respectively. The predicted probability values decrease as $\sigma$ increases, indicating that both FasterRCNN and DoNN become less confident in their predictions as blurring increases. FasterRCNN suffers from blurring more than DoNN, as observed in Figure 47 and Table 10. The AUC of FasterRCNN degrades to 0.95785 at the strongest blurring level, while that of DoNN remains at 0.99948. DoNN outperforms FasterRCNN in nine out of ten levels of blurring. Fine-grained details of docking stations are lost as blurring increases, but the spatial structural patterns of docking stations are preserved. We conjecture that DoNN outperforms FasterRCNN on blurred underwater images since DoNN learns better feature representations of docking stations than FasterRCNN. DoNN still keeps high activation in its feature maps even when blurring is very strong, as shown in Table 9. However, the activation of FasterRCNN in its feature maps becomes almost as weak as the background at the strongest blurring levels, resulting in incorrect detections, as shown in Table 8.

FrRCNN DoNN
Table 7: Sample blurred images, and detection results provided by FasterRCNN and DoNN on the set of blurred images.
Item
Det. Result
Table 8: Feature maps computed at the layer , and a detection result of FasterRCNN for and .
Item
Det. Result
Table 9: Feature maps computed at the layer , confidence map, and a detection result of DoNN for and .
Figure 47: ROC curves of DoNN, YOLO, Faster-RCNN computed for various degrees of blurring. The corresponding AUCs are given in Table 10.
AUCModel DoNN YOLO FrRCNN
0.99964 0.90862 0.99958
1 0.99964 0.91411 0.99959
2 0.99966 0.91025 0.99943
3 0.99964 0.91446 0.99971
4 0.99965 0.91465 0.99942
5 0.99967 0.91147 0.99899
6 0.99967 0.91453 0.99854
7 0.99953 0.91484 0.99493
8 0.99957 0.91549 0.99210
9 0.99958 0.91333 0.97791
10 0.99948 0.90734 0.95785
Average 0.99961 0.91305 0.99185
Table 10: AUC of DoNN, YOLO, Faster-RCNN computed for various degrees of blurring, and the corresponding ROC curves are given in Figure 47.

5.1.6 Comparison of performance of detection algorithms under color shift

Color shift is determined by various factors in underwater environments, such as attenuation. In order to compare the performance of the three models on underwater images with color shift, underwater images with different rates of hue, saturation and value shift are created by

(15)

where the transformed quantity denotes one of the HSV components of the images in the test set. The datasets after hue, saturation and value shift are denoted accordingly.

In order to set a reasonable range for the shift rate, we first compute a distribution over image pairs by

(16)

which involves an image pair from the test set, the image coordinate, and the width and height of the image. We compute this quantity for all possible image pairs in the test set. The distributions are shown in Figure 52(a), 52(b) and 52(c). Due to their symmetry, we sample the shift rate from one side of the distributions.
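The following sketch illustrates one plausible way to generate such a color-shifted variant; the multiplicative form of the shift, the file name and the rates are assumptions made for illustration, since the exact transformation in (15) is not preserved in this copy.

```python
import cv2
import numpy as np

image = cv2.imread("udid_test_image.png")                      # illustrative file name
if image is not None:
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    for r in (0.9, 0.8, 0.7, 0.6, 0.5):                        # assumed shift rates
        shifted = hsv.copy()
        shifted[..., 1] *= r                                    # scale the saturation channel
        shifted = np.clip(shifted, 0, 255).astype(np.uint8)
        bgr = cv2.cvtColor(shifted, cv2.COLOR_HSV2BGR)
        cv2.imwrite(f"saturation_shift_{r}.png", bgr)
```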

Sample images after hue shift deformation and the corresponding detection results of DoNN and FasterRCNN are shown in Table 11. In this sample, increasing the hue shift results in less confident detections for DoNN, while it leads to totally incorrect detections for FasterRCNN. Figure 53 and Table 13 show the ROC curves and the associated AUCs of the three models under hue shift, respectively. DoNN outperforms FasterRCNN in all cases, and still achieves acceptable performance in the extreme case, while FasterRCNN performs poorly. A notable performance difference between the two models occurs at the strongest hue shifts. Table 16 shows feature maps of the first, second and fifth convolution layers of FasterRCNN under strong hue shift: the activation is very weak in the region of the docking station, giving rise to the final incorrect detection. Table 17 shows feature maps of the first and fifth convolution layers, the confidence map and the detection results of DoNN under the same deformation. DoNN retains relatively high activation in the docking station region, as shown in Table 17, resulting in less confident but correct detection of the docking station. The confidence map of DoNN shows that DoNN is more uncertain than in cases without hue shift: three neighbouring grid cells have similar confidence, but the correct one dominates.

Saturation indicates the amount of grey in a color. A color is grey when its saturation value is 0, and is a pure color when its saturation value is 1. Table 12 shows sample images after saturation shift and the associated detection results of FasterRCNN and DoNN. The color becomes closer to grey as the amount of saturation shift increases. We show the ROC curves and the associated AUCs of the three models in Figure 54 and Table 14. As depicted, DoNN outperforms FasterRCNN at all levels of saturation shift. Over the tested range, the AUC of FasterRCNN declines from 0.99958 to 0.99713, while that of DoNN declines from 0.99964 to 0.99953.

The value component describes the brightness of colors. Images become less bright as the amount of value shift increases; a low value corresponds to a dim underwater environment. As shown by the ROC curves and associated AUCs in Figure 55 and Table 15, DoNN and FasterRCNN are both robust to value shift, and no significant degradation arises at the different levels of value shift. DoNN leads at the strongest value shift and lags slightly behind FasterRCNN in the other cases, but it outperforms FasterRCNN in terms of average performance: the average AUC of DoNN is 0.99965, in contrast to 0.99774 for FasterRCNN.

To sum up, DoNN is more robust and credible than FasterRCNN in underwater images with color shift.

Figure 52: Distribution of , , and .
FrRCNN DoNN
Table 11: Sample images after hue shift deformation and detection results predicted by FasterRCNN and DoNN.
FrRCNN DoNN
Table 12: Sample images after saturation shift deformation and detection results predicted by FasterRCNN and DoNN.
Figure 53: ROC curves of DoNN, YOLO, Faster-RCNN computed under hue shift. Corresponding AUCs are given in Table 13.
AUCModel DoNN YOLO FrRCNN
1 0.99964 0.90862 0.99958
0.9 0.99956 0.93559 0.99888
0.8 0.99951 0.93065 0.99564
0.7 0.99824 0.89211 0.97628
0.6 0.98667 0.87664 0.87605
0.5 0.97405 0.86540 0.80774
Average 0.99160 0.9008 0.93092
Table 13: AUCs of DoNN, YOLO and Faster-RCNN computed under hue shift, and the corresponding ROC curves are shown in Figure 53.
Figure 54: ROC curves of DoNN, YOLO, and Faster-RCNN computed under saturation shift. The corresponding AUCs are given in Table 14.
Saturation shift level   DoNN      YOLO      FrRCNN
1                        0.99964   0.90862   0.99958
0.9                      0.99961   0.91990   0.99941
0.8                      0.99961   0.92510   0.99922
0.7                      0.99960   0.92634   0.99891
0.6                      0.99959   0.92680   0.99851
0.5                      0.99953   0.91429   0.99713
Average                  0.99959   0.92249   0.99864
Table 14: AUCs of DoNN, YOLO and Faster-RCNN computed under various levels of saturation shift; the corresponding ROC curves are shown in Figure 54.
Figure 55: ROC curves of DoNN, YOLO and Faster-RCNN computed under value shift. The corresponding AUCs are given in Table 15.
Value shift level   DoNN      YOLO      FrRCNN
1                   0.99964   0.90862   0.99958
0.9                 0.99964   0.90621   0.99981
0.8                 0.99965   0.89723   0.99988
0.7                 0.99963   0.87105   0.99976
0.6                 0.99968   0.88644   0.99979
0.5                 0.99965   0.90317   0.98946
Average             0.99965   0.89282   0.99774
Table 15: AUCs of DoNN, YOLO and Faster-RCNN computed for various levels of value shift; the corresponding ROC curves are shown in Figure 55.
Table 16: Feature maps of the first, second and fifth convolution layers and the detection results provided by FasterRCNN at the two strongest hue shift levels.
Table 17: Feature maps of the first and fifth convolution layers, the confidence map and the detection results provided by DoNN at the two strongest hue shift levels.
Table 18: Feature maps of selected convolution layers and the detection results provided by FasterRCNN.
Table 19: Feature maps of selected convolution layers, the confidence map and the detection results provided by DoNN.

5.1.7 Comparison of performance of detection algorithms under contrast shift

In order to compare the performance of the detection methods in an underwater environment under changes of contrast, we generate contrast-adjusted datasets using the gamma transformation

    I_out = I_in^γ,    (17)

where I_in, I_out and γ are the input image (with intensities normalized to [0, 1]), the output image and the parameter of the gamma transformation, respectively. The contrast of the images increases for γ > 1 and decreases for γ < 1. The deformed images form the contrast-shift dataset used below.
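A minimal sketch of this deformation, assuming 8-bit images normalized to [0, 1] before applying Eq. (17), is given below; the helper name gamma_transform and the use of OpenCV/NumPy are illustrative assumptions.

    # Hedged sketch of the gamma transformation in Eq. (17).
    import cv2
    import numpy as np

    def gamma_transform(image_bgr, gamma):
        """Apply I_out = I_in ** gamma to an 8-bit image normalized to [0, 1]."""
        normalized = image_bgr.astype(np.float32) / 255.0
        return (np.power(normalized, gamma) * 255.0).astype(np.uint8)

    # Example: gamma = 0.2 washes the scene out, while gamma = 3.5 darkens the
    # background so that the docking-station lights stand out.
    # low  = gamma_transform(cv2.imread("frame.png"), 0.2)
    # high = gamma_transform(cv2.imread("frame.png"), 3.5)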

We computed the distribution of the corresponding pairwise statistic, which is shown in Figure 52 (d), by

(18)

where the normalization terms are the width, height and number of channels of the two images, each pair of images is drawn from the dataset, and the remaining term denotes the pixel value of an image at a given location and channel. According to the distribution shown in Figure 52 (d), we sample the values of γ used in the experiments.

Table 21 shows sample images obtained by contrast adjustment at different levels. Intuitively, the docking station is observed more clearly as γ increases and less clearly as γ decreases. We give the ROC curves and the associated AUCs of the three detection algorithms in Figure 56 and Table 20, respectively. On average, DoNN performs better than FasterRCNN: the average AUC of DoNN over all levels of contrast adjustment is 0.97857, while that of FasterRCNN is 0.84282. When γ is high, Faster-RCNN outperforms DoNN by a tiny margin, but it lags behind significantly when γ is low. The AUC of Faster-RCNN drops to 0.90320 at γ = 0.4 and falls to 0 at γ = 0.2, which means that none of its detections surpasses the IoU criterion. This is primarily owing to the weak activation in the feature maps of FasterRCNN. To examine this performance difference, we compare the feature maps of FasterRCNN (Table 22) and DoNN (Table 23). At low γ, the activations of FasterRCNN become quite weak in the docking station region, resulting in the final incorrect detection, as shown in Table 22; the effect is even more acute at the lowest γ, where the activations in the docking station region are almost as weak as those in the background, so the predictions are incorrect. In contrast, the feature maps of DoNN retain high activation values and salient spatial structural patterns, as shown in Table 23. As a result, relatively high confidence is obtained for only one grid in the confidence map of DoNN. At the lowest γ, DoNN still keeps a distinguishable spatial pattern of the docking station in its feature maps, although the activations become weaker; its confidence map then contains more grids with relatively high confidence, but only the correct one dominates.

Figure 56: ROC curves of DoNN, YOLO and Faster-RCNN computed under various contrast conditions. The corresponding AUCs are shown in Table 20.
Gamma (γ)   DoNN      YOLO      FrRCNN
1           0.99964   0.90862   0.99958
0.2         0.87402   0.84775   0
0.4         0.99738   0.88640   0.90320
1.5         0.99884   0.88991   0.99997
2.0         0.99439   0.87390   0.99990
2.5         0.99392   0.87943   0.99962
3           0.99505   0.90531   0.99873
3.5         0.99637   0.91929   0.99833
Average     0.97857   0.88600   0.84282
Table 20: AUCs of DoNN, YOLO, and Faster-RCNN computed under various contrast conditions, and the corresponding ROC curves are shown in Figure 56.
Table 21: Sample images obtained by contrast adjustment, and the corresponding detection results provided by FasterRCNN and DoNN.
Table 22: Feature maps computed at selected convolution layers, and the detection results provided by FasterRCNN at low values of γ.
Table 23: Feature maps computed at selected convolution layers, the confidence map, and the detection results provided by DoNN at low values of γ.

5.1.8 Comparison of performance of detection algorithms for mirror images

As mentioned in Section 1, mirror images result from total internal reflection and are observed when the camera is within the critical angle. They pose a difficult problem for the detection of docking stations because their appearance is very similar to that of real docking stations. In order to compare the performance of the three methods under total internal reflection, we establish a dataset by attaching a mirror image to every foreground image in the dataset. To this end, Poisson image editing (Pérez et al., 2003) is employed to merge a mirror image patch above the docking station of the original images as realistically as possible. The merging process is illustrated in Figure 57.
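As a rough illustration of this merging step (a hedged sketch, not the authors' exact pipeline), a vertically flipped copy of the docking-station patch can be blended above the original patch with OpenCV's Poisson-editing-based seamlessClone; the bounding box, vertical offset and file names below are placeholders.

    # Hedged sketch: blend a synthetic mirror image above the real docking
    # station using Poisson image editing (cv2.seamlessClone). The bounding
    # box, vertical gap and file names are illustrative placeholders.
    import cv2
    import numpy as np

    def add_mirror_image(frame, box, gap=10):
        """Flip the patch in `box` = (x, y, w, h) vertically and blend it
        `gap` pixels above the original patch."""
        x, y, w, h = box
        patch = cv2.flip(frame[y:y + h, x:x + w], 0)            # vertical flip
        mask = 255 * np.ones(patch.shape[:2], dtype=np.uint8)   # blend whole patch
        center = (x + w // 2, y - gap - h // 2)                  # above the station
        return cv2.seamlessClone(patch, frame, mask, center, cv2.NORMAL_CLONE)

    # Example (placeholder coordinates):
    # merged = add_mirror_image(cv2.imread("frame.png"), box=(200, 150, 120, 60))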

Figure 68 shows that all three CNN-based models are able to distinguish real docking stations from their mirror images with no performance degradation, which we attribute to the learning ability of CNNs. It also shows that DoNN slightly outperforms FasterRCNN. Figures 62 and 67 show feature maps of a natural mirror image generated by FasterRCNN and DoNN, respectively. Figure 62 shows that the activations of FasterRCNN computed for the real docking station are stronger than those for the mirror docking station, which enables FasterRCNN to discriminate real docking stations from mirror ones. Figure 67 shows feature maps and the confidence map of DoNN. In the feature maps of DoNN, the activation of the mirror docking station is almost as high as that of the real docking station. Even so, the confidence map of DoNN contains only one highly confident grid, which is the correct prediction. We analyze this phenomenon next. We conjecture that DoNN has learned the distribution of relative spatial locations between mirror and real docking stations; that is, it has learned that mirror images of docking stations are more likely to appear on the upper side of the real docking stations.

In the experimental analyses, we need to determine whether it is the relative spatial location or the appearance that enables DoNN to distinguish real docking stations from mirror ones. To this end, we carry out two experiments. First, we replace the patch located in the second quadrant of the input image by the patch located in its third quadrant, so that the docking stations observed in the two quadrants share the same appearance, and feed the simulated image to DoNN. The image obtained by this replacement, its feature maps, the confidence map and the detection results are shown in Figure 73. Because the two patches are identical, the feature maps computed in the second and third quadrants are exactly the same. DoNN assigns high confidence to both docking stations, but more to the lower one; in other words, DoNN tends to believe that the lower one is the real docking station. In addition, we compare the confidence of the prediction before and after the replacement: the confidence after the replacement is lower, which indicates that appearance also contributes to the prediction of DoNN. Second, we replace the patch located in the third quadrant of the input image by the patch located in its second quadrant, so that all docking stations observed in the image are mirror ones, and feed the obtained image to DoNN. The result is shown in Figure 78. DoNN still assigns a higher prediction score to the lower docking station than to the upper one, although both are identical mirror images, yet this score is lower than the corresponding score in the first experiment. Therefore, we conclude that DoNN has learned not only feature representations of appearance, but also the distribution of relative spatial locations between real and mirror docking stations.
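A minimal sketch of the quadrant replacement used in these two experiments is given below; the assumption that the image is split into four equal quadrants, and the helper name replace_quadrant, are made only for illustration.

    # Hedged sketch of the quadrant-replacement experiment: copy one quadrant
    # of the image over another. The equal-quadrant split is an assumption;
    # the actual patches may have been chosen differently.
    import numpy as np

    def replace_quadrant(image, dst_quadrant, src_quadrant):
        """Replace `dst_quadrant` with `src_quadrant` (1=upper-right,
        2=upper-left, 3=lower-left, 4=lower-right); returns a modified copy."""
        h, w = image.shape[:2]
        hh, hw = h // 2, w // 2
        quads = {
            1: (slice(0, hh), slice(hw, 2 * hw)),
            2: (slice(0, hh), slice(0, hw)),
            3: (slice(hh, 2 * hh), slice(0, hw)),
            4: (slice(hh, 2 * hh), slice(hw, 2 * hw)),
        }
        out = image.copy()
        out[quads[dst_quadrant]] = image[quads[src_quadrant]]
        return out

    # First experiment: make the upper-left quadrant identical to the lower-left.
    # both_real = replace_quadrant(frame, dst_quadrant=2, src_quadrant=3)
    # Second experiment: make the lower-left quadrant identical to the upper-left.
    # both_mirror = replace_quadrant(frame, dst_quadrant=3, src_quadrant=2)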

Figure 57: The merging process for synthetic mirror images. A mirror image is merged into the original image above the real docking station.
Figure 62: Feature maps computed at selected convolution layers and the corresponding detection result of FasterRCNN for mirror images (panels (a)-(d)).
Figure 67: Feature maps computed at selected convolution layers, the confidence map and the detection result of DoNN for mirror images (panels (a)-(d)).
Figure 68: ROC curves of DoNN, YOLO and Faster-RCNN computed on the dataset containing mirror images.
Figure 73: Feature maps computed at selected convolution layers, the confidence map and the detection result of DoNN after replacing the second quadrant of the input image by its third quadrant (panels (a)-(d)).
Figure 78: Feature maps computed at selected convolution layers, the confidence map and the detection result of DoNN after replacing the third quadrant of the input image by its second quadrant (panels (a)-(d)).

5.1.9 Comparison of performance of detection algorithms under non-uniform illumination

Non-uniform illumination is commonly observed in real underwater images. In order to evaluate detection performance in conditions as close as possible to a real undersea environment with non-uniform illumination, we apply the non-uniform illumination drawn from the luminosity changes subset of the Fish4Knowledge dataset (Kavasidis et al., 2014) to our dataset, which was collected in an indoor water pool. The Fish4Knowledge dataset was constructed from images captured in a real outdoor undersea environment and is used for detecting targets in noisy underwater conditions; its luminosity changes subset is dedicated to underwater luminosity changes. The new dataset obtained by applying this undersea non-uniform illumination to our images is used in the following comparison.

Polynomials have been utilized for non-uniform illumination correction in transmission electron microscopy (TEM) images (Tasdizen et al., 2008). In our work, they are used in the opposite way, in order to generate non-uniform illumination. Details of the non-uniform illumination generation procedure are shown in Figure 79 and explained as follows:

  1. Estimation of undersea non-uniform illumination: In order to estimate undersea non-uniform illumination, we fit a low-order bivariate polynomial to the Value component (in HSV color space) of every frame in the Fish4Knowledge dataset. The polynomial is represented by

    p(x, y) = Σ_{i+j ≤ n} a_ij · x^i · y^j,    (19)