I Introduction
Underwater robot picking means using robots to automatically grab mariculture organisms such as seacucumber, seaurchin, or scallop in an open-sea farm, where underwater object detection is the first and key step. In recent years, thanks to the superior feature representation ability of deep CNNs [1, 2] and the availability of large datasets (e.g., MS COCO [3]), general object detection has achieved remarkable success. However, less progress has been made in underwater object detection because of several tough challenges, which mainly come from three aspects:
I-A Insufficient training data
Massive training data is the cornerstone of deep learning. But because underwater collection and annotation are much more expensive than for general datasets, there is still no large real open-sea farm dataset for underwater robot picking, which directly limits detectors' generalization and robustness.
I-B Variable water color and massive small objects
Due to water refraction and the scattering of natural light, the underwater environment can vary greatly throughout a day. Meanwhile, compared with the vast scale of the ocean, all objects tend to be small. Both factors aggravate the difficulty of underwater object detection.
I-C Efficiency requirements
In underwater robot picking, the detector is the key part of the system; it must be as fast as possible while maintaining good accuracy to meet the requirements of embedded deployment on the robot.

Considering the above challenges, we make three contributions in this paper to promote the development of underwater object detection, the key part of underwater robot picking.
First, we collect and propose an underwater open-sea farm object detection dataset (UDD) containing 3 categories (seacucumber, seaurchin, and scallop) with 2227 images, manually annotated with bounding boxes. To simulate the real underwater grabbing environment, underwater robots and divers were employed to capture images from a real open-sea farm in the middle of winter, at great risk, because seacucumbers can only be captured at this time.
Second, due to the inherent problems of UDD such as class imbalance and massive small objects, we propose Poisson GAN to expand UDD into 2 datasets: AUDD, including 18K images, and the Pre-trained dataset, including 590K images. AUDD is an extension of UDD, and the Pre-trained dataset helps the detector adapt better to the underwater picking environment. Compared with existing GAN-based data augmentation methods [4, 5], we embed Poisson blending [6] into the generator to change the number, position, and even size of objects in an image. Furthermore, we design a Dual Restriction loss (DR loss), combining a content loss and a region loss, to generate more realistic images. Our proposed Poisson GAN can therefore address the class-imbalance and massive-small-objects problems in underwater object detection more effectively, while other methods can only change image styles. When training detectors, we train on AUDD starting from parameters pre-trained on the Pre-trained dataset. This strategy improves detectors significantly compared with training on UDD with ImageNet pre-trained or randomly initialized models.
Third, because most popular detectors [7, 8, 9] are evaluated on MS COCO, they may not perform as well on a dataset containing massive small objects like our UDD. Therefore, to detect massive small objects in cloudy underwater images and to meet the efficiency requirements of robots, we design UnderwaterNet, which contains only 1.3M parameters and achieves the best speed/accuracy trade-off on the UDD test set. Within it, two lightweight components are designed: a pooling layer called the Multi-scale Blursampling (MBP) module and a basic block called the Multi-scale Contextual Features Fusion (MFF) block, based on the depthwise separable inverted bottleneck block [10]. Both of them can extract and fuse multi-scale features at the same time, allowing UnderwaterNet to achieve state-of-the-art detection efficiency while maintaining high accuracy. Code and datasets will be released on https://github.com/chongweiliu soon.
In summary, the key contributions of this paper can be summarized as follows.
- An underwater open-sea farm object detection dataset (UDD), a large-scale augmented underwater sea farm object detection dataset (AUDD), and a Pre-trained dataset are proposed for underwater robot picking.
- By combining conventional and deep learning methods, we propose Poisson GAN, which can change the number, position, and even size of objects in an image for data augmentation of UDD, which existing GANs can barely do.
- We design the lightweight UnderwaterNet with 2 efficient components to meet underwater robot picking requirements without any loss of accuracy.
II Related Work
This section briefly reviews three relevant topics: underwater datasets, data augmentation with GANs, and underwater object detection.
II-A Underwater Dataset
Due to the complexity of underwater scenes, collecting underwater images is difficult, so both the variety and the number of underwater datasets are far smaller than those of general datasets. Here we introduce one dataset for underwater robot picking and two datasets for fish research.
II-A1 URPC (Underwater Robot Picking Contest, http://en.cnurpc.org)
It is an underwater object detection dataset provided by the Natural Science Foundation of China (NSFC) for underwater robot picking. It contains 18982 labeled images from 6 videos covering 3 categories (seacucumber, seaurchin, and scallop). However, all the videos were filmed in an artificial offshore environment in summer, and pictures from the same video look almost identical. Consequently, detectors trained on it fail to generalize to real open-sea farms.
II-A2 Fish4Knowledge [11]
It is a project funded by the European Union Seventh Framework Program to assist the study of marine ecosystems. The project publishes a dataset called F4K containing 27370 fish images whose sizes range from 20×20 to 200×200 pixels. However, the class-imbalance problem in this dataset is very serious: the most frequently seen fish has 12112 instances while the least has only 16. Some works [12] therefore manually filter the data and create a new dataset to ensure the performance of their models.
II-A3 LifeCLEF 2014 Fish [13]
It consists of images and videos collected from Fish4Knowledge. This dataset can be used for several tasks, such as video-based fish identification, image-based fish identification, and image-based fish species recognition.
Images in our proposed UDD are collected in a real open-sea farm by robots and divers, making them the closest training data to the real picking environment for the underwater robot picking task.

II-B Data Augmentation With GAN
Data augmentation adds more variation to the training data in order to improve the generalization ability of a model. Simple augmentation strategies such as flipping and re-scaling are widely used in training CNNs. Recently, GANs [14] have shown excellent performance on many image-to-image tasks [15, 16], and some works [17, 18] use GANs for data augmentation. AugGAN [17] incorporates structure awareness via semantic segmentation and soft weight-sharing, so the generated images are convincing enough for training; however, its ground truth must contain masks, which are inconvenient to label. CycleGAN [18] only needs an unpaired dataset, which reduces the difficulty of preparing a training set, and a large number of works use CycleGAN for data augmentation [4, 5]. However, we still cannot ignore the defects of CycleGAN: it easily over-fits when generating images, which affects the accuracy of the models trained on them.
Existing GAN-based methods have done a great job in image classification by transforming image styles, but they do not work well for object detection because they cannot change the number, size, or position of objects in an image, which is vitally important in detection. Therefore, we propose Poisson GAN to solve this problem.
II-C Underwater Object Detection
The development of underwater object detection is similar to that of general object detection. Before CNNs, detectors were mainly based on the sliding-window paradigm with hand-crafted features such as SIFT [19] and HOG [20]. Ravanbakhsh et al. [21] adopt Haar [22] and shape features in automated fish detection.
With the development of CNNs, CNN-based detectors have achieved great improvements in detection. Modern CNN-based object detection approaches can be roughly divided into two categories: two-stage methods and one-stage methods. Two-stage methods (e.g., Faster R-CNN [23], R-FCN [24]) first generate proposals (e.g., EdgeBoxes [25], RPN [26]) and then determine the position and class of objects. They achieve state-of-the-art performance but need a large amount of computation, which cannot meet real-time requirements. One-stage methods unify the proposal and prediction processes, making detectors faster than two-stage methods. Redmon et al. propose YOLO [27], an end-to-end CNN that directly predicts each object's class and location, but there is still a large accuracy gap between YOLO and two-stage methods. After that, SSD [7] adopts the concept of anchor boxes from [26] and tiles anchor boxes of different scales on certain layers to improve detection performance. Recently, many anchor-free one-stage methods [28, 29] have been proposed, such as CornerNet [28] and CenterNet [29]. Inspired by the above approaches, Li et al. [30] adopt the Fast R-CNN [26] framework for underwater object detection. But a detector specially designed for underwater sea farm object detection is still needed to meet the robot-embedding requirements and the tough detection task; therefore, we design UnderwaterNet.

III Proposed Underwater Detection Dataset
To address the lack of a real open-sea farm dataset, we collect and annotate an underwater dataset called UDD from a real open-sea farm. It contains 2227 underwater images covering 3 types of marine organisms (seacucumber, seaurchin, and scallop) with corresponding manually annotated bounding boxes. To the best of our knowledge, this is the first dataset collected from a real open-sea farm and also the closest one to the ideal picking environment for underwater robot picking to date.

III-A Dataset Construction
Because the detector is a key part of the underwater picking system deployed on a robot, the best way to improve its performance is to train it with images from the real picking environment. Therefore, to make the samples more diverse, we considered different points of view (e.g., looking down, head-up, and oblique), different resolutions (e.g., 720×405, 1920×1080, and 3840×2160), and different terrains (e.g., flat, slope, and stone) when capturing images for UDD. An underwater high-definition camera operated by divers or robots was used to capture the pictures; the diver-captured images are mainly used to provide more diverse samples. Some images in UDD are shown in Fig. 1. This dataset includes 2227 pictures, of which 1827 are for training and 400 for testing.
III-B Dataset Analysis
III-B1 The number of instances
The total number of objects is 15022; the numbers of seacucumber, seaurchin, and scallop instances are 1148, 13592, and 282, respectively. Obviously, the class-imbalance problem is very serious, although this is the real distribution of the open-sea farm. To address this problem, we use Poisson GAN to expand UDD, which will be introduced in the next section.
III-B2 Instance size distribution
We analyze the average size of objects in UDD. In general, smaller objects are harder to detect. In PASCAL VOC [3] or MS COCO [3], roughly 50% of all objects occupy no more than 10% of the image itself, and the others are evenly distributed between 10% and 100%. But UDD contains almost exclusively small instances, as shown in Fig. 2. More than 90% of the instances are smaller than 1.654% of the image size. The average object size is only 44×28 pixels when images are resized to 512×512 pixels, making up only 0.47% of the resized image area.
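For concreteness, the 0.47% figure follows directly from the stated average size and the resized resolution:

\frac{44 \times 28}{512 \times 512} = \frac{1232}{262144} \approx 0.0047 = 0.47\%.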
Almost all popular detectors are evaluated on PASCAL VOC or MS COCO, especially the latter. On MS COCO, detectors generally achieve their weakest mAP on small objects. Therefore, for a dataset containing massive small objects like our UDD, most detectors may not perform as well as they do on MS COCO or PASCAL VOC. It is necessary to design a detector that can deal with massive small objects while remaining highly efficient for underwater robot picking. On the other hand, we also employ Poisson GAN to increase the number of big and medium objects, making detectors more robust.
IV Poisson GAN
To solve the tough problems of insufficient data and class imbalance in UDD, Poisson GAN with a specially designed loss (DR loss) is proposed. Notably, Poisson GAN is designed for underwater object detection, while existing GAN-based augmentation methods [4, 5] target classification.
In this section, we first introduce Poisson GAN. After that, the AUDD and Pre-trained datasets are described.
IV-A Network Architecture
IV-A1 Generator
The generator consists of a Poisson blending phase and a Learning phase. The Poisson blending phase changes the number, size, or position of objects in the input image and eliminates the boundary of the fused regions to a certain extent. The Learning phase further improves the image quality through a learned mapping to obtain a more realistic image.

Poisson blending phase. Compared with the diversity of land scenes, underwater sea farm scenes are relatively simple, so Poisson blending can be employed to create relatively realistic pictures. We embed Poisson blending into the generator to change the number, position, or size of objects when generating AUDD. The process of Poisson blending is illustrated in Fig. 4 (left). We first crop X seacucumbers, Y seaurchins, and Z scallops from UDD to build an object set P:

P = \{A_1, \dots, A_X,\; B_1, \dots, B_Y,\; C_1, \dots, C_Z\},    (1)

where A, B, and C indicate seacucumber, seaurchin, and scallop, respectively. For each generation, we randomly select objects from P and apply Poisson blending with a background image B to get a clone image C. Notably, we try to put the embedded objects in the vicinity of objects of the same category to ensure the rationality of the clone images. Because Poisson blending works perfectly only when the contours of the embedded objects are well labeled, we also need the Learning phase to obtain more realistic images.
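As a concrete illustration, the following is a minimal sketch of this phase using OpenCV's implementation of Poisson image editing [6] (seamlessClone); the object set, placement bounds, and the full-rectangle mask are simplifying assumptions rather than the exact procedure used to build AUDD.

```python
# Minimal sketch of the Poisson blending phase; assumes object crops are much
# smaller than the background and uses a rectangular mask for simplicity.
import random
import cv2
import numpy as np


def paste_object(background, obj_crop, center):
    """Blend a cropped object into a background image with Poisson blending."""
    # A full-rectangle mask; a tighter contour mask would give cleaner boundaries.
    mask = 255 * np.ones(obj_crop.shape[:2], dtype=np.uint8)
    return cv2.seamlessClone(obj_crop, background, mask, center, cv2.NORMAL_CLONE)


def generate_clone_image(background, object_set, num_objects=3):
    """Randomly select objects from the set P and fuse them into a background."""
    clone = background.copy()
    h, w = clone.shape[:2]
    for obj in random.sample(object_set, k=min(num_objects, len(object_set))):
        oh, ow = obj.shape[:2]
        # Keep the paste center far enough from the border for the crop to fit.
        cx = random.randint(ow // 2 + 1, w - ow // 2 - 2)
        cy = random.randint(oh // 2 + 1, h - oh // 2 - 2)
        clone = paste_object(clone, obj, (cx, cy))
    return clone
```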
Learning phase. The Learning phase is inspired by [32], an anime line art colorization GAN. We use U-Net [33] as the backbone structure, as shown in Fig. 4 (right), and a stack of 3×3 convolution layers is used to construct the encoder. The decoder is constructed from 4 stacks of ResNeXt blocks [34], denoted as block_n, n ∈ {1,…,4}. In the experiments, we set the block numbers to [20, 10, 10, 5].

IV-A2 Discriminator

The discriminator is also built with stacks of ResNeXt blocks and convolution layers, as shown in Fig. 4 (right). The architecture is similar to the setup of SRGAN [35] with some modifications. We employ the same basic block used in the generator and additionally stack more layers so that it can process 512×512 inputs.
IV-A3 Loss Function
In general, the loss function of a GAN contains a generator loss and a discriminator loss. By leveraging the annotation information, we design the DR loss as the generator loss. With a combination of content loss and region loss, the DR loss keeps the original information at pixel level by penalizing the difference in the fused regions rather than over the whole image area, so Poisson GAN can generate more realistic images. We adopt the original discriminator loss in [32], and the DR loss is defined as:

\mathcal{L}_{DR} = \mathcal{L}_{cont} + \lambda\,\mathcal{L}_{adv} + \mathcal{L}_{region},    (2)

where \lambda equals 1e-4 in all experiments, and the content loss \mathcal{L}_{cont} and the adversarial loss \mathcal{L}_{adv} are exactly the same as the loss functions in [32]. The region loss can be defined as:
\mathcal{L}_{region} = \frac{1}{c\,h\,w}\,\big\| M \odot \big( G(x_f) - x_r \big) \big\|_1,    (3)

where c, h, and w denote the number of channels, height, and width of the feature maps, M denotes the mask created by the bounding boxes of the embedded objects, G(x_f) is the generated image, and x_r is the corresponding real image. We take the value of 100 for the pixels inside the boxes, while the value of other areas is 0.1.
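To make Eq. (3) concrete, a minimal PyTorch sketch is given below, under the assumption of the mask-weighted L1 form reconstructed above; the mask values (100 inside the boxes, 0.1 elsewhere) follow the text, while the function names are illustrative.

```python
# Minimal sketch of the region loss in Eq. (3), assuming an L1 penalty between
# the generated and real images weighted by a mask built from the bounding
# boxes of the embedded objects.
import torch


def build_region_mask(shape, boxes, in_box=100.0, outside=0.1):
    """shape: (c, h, w); boxes: list of (x1, y1, x2, y2) for embedded objects."""
    c, h, w = shape
    mask = torch.full((c, h, w), outside)
    for x1, y1, x2, y2 in boxes:
        mask[:, y1:y2, x1:x2] = in_box
    return mask


def region_loss(generated, real, mask):
    """Eq. (3): mask-weighted absolute difference averaged over c*h*w."""
    c, h, w = generated.shape[-3:]
    return (mask * (generated - real).abs()).sum() / (c * h * w)
```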
IV-B Image Pairs Used to Train Poisson GAN
The two images in an image pair should differ only in the edge information of the fused regions, so that the generator can learn the mapping from fused images to normal images. As discussed for the Poisson blending phase, underwater sea farm scenes are relatively simple, so we can create image pairs by directly covering objects in UDD images with same-category objects cropped from clone images C. Given the appearance similarity within the same species, the two images (the original image and the processed image) are nearly identical except for the edge information, so we can directly treat the original image as the real image and the processed image as the fake image.
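For illustration, a minimal sketch of this pair-construction step is given below, assuming simple box-wise pasting with OpenCV; the data structures (box lists, per-class crop lists) are placeholders rather than the exact pipeline.

```python
# Minimal sketch: cover each labeled object with a same-category crop taken
# from a clone image, so the pair differs only in edge information.
import random
import cv2


def make_training_pair(real_image, boxes, crops_by_class):
    """boxes: list of (x1, y1, x2, y2, cls); crops_by_class: dict of crop lists.
    Returns (fake_image, real_image) for training the Learning phase."""
    fake = real_image.copy()
    for (x1, y1, x2, y2, cls) in boxes:
        crop = random.choice(crops_by_class[cls])
        # Resize the chosen crop to the covered object's bounding box.
        fake[y1:y2, x1:x2] = cv2.resize(crop, (x2 - x1, y2 - y1))
    return fake, real_image
```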
IV-C Expanded Dataset
To build the object set P, we crop 1000 seaurchins, 150 seacucumbers, and 35 scallops from UDD, and then fuse them into background images through Poisson GAN.
IV-C1 AUDD dataset
We use images from UDD as background images to generate synthesized images with Poisson GAN. As a result, these images are close to real and can be seen as a supplement to UDD, as shown in Fig. 3. This dataset contains 18661 images with 18350 seacucumbers, 101422 seaurchins, and 9624 scallops, alleviating the insufficient-data and class-imbalance problems to a certain extent.
IV-C2 Pre-trained dataset
We feed cropped instances and background images from URPC or other underwater pictures we captured into the Poisson blending phase to build the Pre-trained dataset. The main purpose of this dataset is to provide a more robust pre-trained model for detectors used in underwater robot picking. There are up to 589080 images covering different background colors, points of view, and terrains, ensuring that detectors achieve good generalization and robustness.
V UnderwaterNet
As discussed at the end of Section III, we need a detector that can detect massive small objects in cloudy underwater images while meeting high-efficiency requirements. Therefore, UnderwaterNet and its two simple but efficient modules are designed.
In this section, we first introduce the network architecture. MBP module and MFF block are then described in detail.
V-A Network Architecture
The entire architecture is shown in Fig. 5 (a). For its anchor-free and NMS-free characteristics, we adapt CenterNet [29] as the basis of our detector to improve detection efficiency. Different from backbones hand-crafted for or searched on ImageNet, only two 3×3 convolution layers and 8 MFF blocks with MBP are used to build our backbone, because the features of small objects may disappear or become distorted if a backbone has too many layers. The kernel sequences (see MFF) of the earlier stages are all set to [3,5,7], while that of the last stage is [3,5,7,9]. To enhance the inner information flow, we fuse the concatenation of the outputs of the last 2 layers of the last stage when upsampling. Finally, we build the same head as CenterNet. The head outputs 3 maps: a HeatMap predicting the center of an object, a WHMap predicting the width and height of an object, and an OffsetMap predicting the offset between the predicted center and the real center of an object. For each map, the features of the backbone are passed through a 3×3 convolution, a ReLU, and another 1×1 convolution layer, and the output resolution is a quarter of the input resolution.
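For illustration, a minimal PyTorch sketch of such a CenterNet-style head is given below; the channel widths, layer names, and the sigmoid on the heatmap are assumptions rather than the exact configuration of UnderwaterNet.

```python
# Minimal sketch of a CenterNet-style head with three branches, each built
# from a 3x3 convolution, a ReLU, and a 1x1 convolution, as described above.
import torch
import torch.nn as nn


class DetectionHead(nn.Module):
    def __init__(self, in_channels, num_classes, mid_channels=64):
        super().__init__()

        def branch(out_channels):
            return nn.Sequential(
                nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_channels, out_channels, kernel_size=1),
            )

        self.heatmap = branch(num_classes)  # one center heatmap per class
        self.wh = branch(2)                 # width and height of each object
        self.offset = branch(2)             # sub-pixel offset of each center

    def forward(self, features):
        # `features` has 1/4 of the input resolution, as stated above.
        return {
            "heatmap": torch.sigmoid(self.heatmap(features)),
            "wh": self.wh(features),
            "offset": self.offset(features),
        }
```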
The loss function is the same as that in [29].

V-B MBP
Detecting small objects from cloudy and dirty underwater images requires a highly robust detector. However, as pointed out by BlurPool [31], common downsampling methods (MaxPooling, AveragePooling, and convolution with stride 2) lack anti-aliasing capability, which can reduce the robustness of CNNs. To address this problem, we design the MBP module, inspired by MixNet [36] and BlurPool. It generates diverse downsampled feature maps while adding only a little computation compared with MaxPooling.
Fig. 5 (b) shows the structure of the MBP module, which is composed of one MaxPooling with stride 1 and several BlurPool operations with stride 2. An input feature map is first processed by the MaxPooling and then split into N groups (N is the number of BlurPool operations) along the channel axis. Each group is then processed by an independent BlurPool. Finally, we concatenate all the groups to get the final output. In all the experiments, we set N to 3.
BlurPool is an operation that downsamples an input feature map using a normalized Gaussian filter with stride 2, and it allows a choice of blur kernel. In our module, we choose the following 3 filters of different sizes with increasing smoothing: Triangle-3 [1, 2, 1], Binomial-5 [1, 4, 6, 4, 1], and Binomial-7 [1, 6, 15, 20, 15, 6, 1]. Each 2-D filter is the outer product of the corresponding vector with itself, with the weights normalized.
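Below is a minimal PyTorch sketch of the MBP module as described; the fixed blur kernels follow the three filters above, while the padding choices and class/function names are illustrative assumptions rather than the authors' exact implementation.

```python
# Minimal sketch of MBP: MaxPool with stride 1, a channel split into N groups,
# and an anti-aliasing blur downsample (stride 2) with a different kernel size
# per group, followed by concatenation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def blur_kernel(vec):
    """Normalized 2-D blur kernel: outer product of a 1-D vector with itself."""
    v = torch.tensor(vec, dtype=torch.float32)
    k = v.unsqueeze(1) * v.unsqueeze(0)
    return k / k.sum()


class MBP(nn.Module):
    def __init__(self, kernels=((1, 2, 1), (1, 4, 6, 4, 1), (1, 6, 15, 20, 15, 6, 1))):
        super().__init__()
        self.maxpool = nn.MaxPool2d(kernel_size=2, stride=1)
        # Register the fixed (non-learned) blur kernels as buffers.
        for i, vec in enumerate(kernels):
            self.register_buffer(f"blur{i}", blur_kernel(vec))
        self.num_groups = len(kernels)

    def forward(self, x):
        x = self.maxpool(x)
        groups = torch.chunk(x, self.num_groups, dim=1)
        outs = []
        for i, g in enumerate(groups):
            k = getattr(self, f"blur{i}")
            c = g.shape[1]
            weight = k.repeat(c, 1, 1, 1)  # depth-wise blur filter per channel
            pad = k.shape[-1] // 2
            outs.append(F.conv2d(g, weight, stride=2, padding=pad, groups=c))
        return torch.cat(outs, dim=1)


if __name__ == "__main__":
    x = torch.randn(1, 48, 64, 64)
    print(MBP()(x).shape)  # -> torch.Size([1, 48, 32, 32])
```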
V-C MFF
To get better performance in detecting small objects, a backbone needs to extract multi-scale features as much as possible. Many approaches [37, 38] fuse features after the backbone has extracted them, but this requires additional layers. Others [39, 36] use kernels of different sizes or complex connections within one block. Inspired by the latter, and considering the efficiency requirements of underwater picking, we combine depth-wise convolution with the connection style of Res2Net [39] to create the MFF block, which can extract and fuse multi-scale features at the same time.
Method | Backbone | Params | FPS | mAP | mAP (seacucumber) | mAP (seaurchin) | mAP (scallop)
---|---|---|---|---|---|---|---
SSD [7] | MobileNetV2 | 3.05M | 11 | 22.7% | 14.6% | 48.1% | 5.3% |
YOLOv3 [8] | DarkNet-53 | 61.9M | 32 | 46.8% | 30.7% | 77.5% | 32.3% |
RetinaNet [9] | ResNet-18 | 19.81M | 14 | 24.6% | 3.1% | 61.3% | 9.4% |
RetinaNet [9] | ResNet-50 | 36.15M | 10 | 34.2% | 15.2% | 65.1% | 22.2% |
FCOS [40] | ResNet-50 | 31.84M | 27 | 44.9% | 35.5% | 73.9% | 25.3% |
Foveabox [41] | ResNet-50 | 36.02M | 28 | 30.0% | 16.1% | 61.4% | 12.6% |
FreeAnchor [42] | ResNet-50 | 36.15M | 25 | 32.7% | 17.3% | 71.0% | 9.8% |
RPDet [43] | ResNet-50 | 36.6M | 22 | 45.1% | 26.9% | 76.1% | 32.4% |
GA-RetinaNet [44] | ResNet-50 | 37.15M | 12 | 36.1% | 21.7% | 70.5% | 16.1% |
CenterNet [29] | DLA-34 [45] | 18.12M | 33 | 36.6% | 12.5% | 78.0% | 19.4% |
UnderwaterNet | MobileNetV3-small (1x) [46] | 1.92M | 45 | 39.1% | 15.4% | 77.6% | 24.5% |
UnderwaterNet | ShuffleNetV2 (1x) [47] | 1.88M | 32 | 37.7% | 11.4% | 74.9% | 27.0% |
UnderwaterNet | EfficientNet-B0 [48] | 4.59M | 44 | 34.7% | 9.9% | 77.5% | 16.8% |
UnderwaterNet | MixNet-S (1x) [36] | 3.27M | 23 | 36.2% | 9.3% | 76.6% | 22.6% |
UnderwaterNet | MnasNet-A1 (1x) [49] | 3.18M | 29 | 37.0% | 11.5% | 77.8% | 21.8% |
UnderwaterNet (ours) | - | 1.30M | 48 | 47.4% | 23.6% | 79.3% | 39.4% |
Fig. 5 (c) shows the structure of the MFF block. For an input, the number of channels is first expanded N times (N is the length of the kernel sequence; e.g., in Fig. 5 (c) the sequence is [3,5,7]) by the first 1×1 convolution. Similar to the Res2Net module, the expanded feature map is split into N groups along the channel axis, denoted by G_i, i ∈ {1,…,N}. Each group is then processed by a depth-wise filter K_i with the corresponding kernel size in the kernel sequence. The output of K_i is added to the following subset G_{i+1} and then processed by K_{i+1}. The outputs of these parallel branches are concatenated and then fused by the final 1×1 convolution to reduce to the output channels. Besides, we use 2 residual connections in our block: one between the input and output feature maps, and the other between the expanded feature maps. The final output of the MFF block is a feature map aggregating multi-scale features, which is vitally important for detecting small objects in cloudy underwater images.
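Below is a minimal PyTorch sketch of the MFF block as described; the normalization and activation layers, and the exact placement of the two residual connections, are assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of MFF: 1x1 expansion, Res2Net-style depth-wise branches with
# different kernel sizes, concatenation, and a fusing 1x1 convolution, with
# residual connections over the expanded features and over the whole block.
import torch
import torch.nn as nn


class MFF(nn.Module):
    def __init__(self, channels, kernels=(3, 5, 7)):
        super().__init__()
        n = len(kernels)
        self.expand = nn.Sequential(
            nn.Conv2d(channels, channels * n, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels * n),
            nn.ReLU(inplace=True),
        )
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2,
                      groups=channels, bias=False)  # depth-wise convolution
            for k in kernels
        )
        self.fuse = nn.Sequential(
            nn.Conv2d(channels * n, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        expanded = self.expand(x)
        groups = torch.chunk(expanded, len(self.branches), dim=1)
        outs, prev = [], None
        for g, conv in zip(groups, self.branches):
            prev = conv(g if prev is None else g + prev)  # pass features on
            outs.append(prev)
        multi_scale = torch.cat(outs, dim=1) + expanded   # residual over expansion
        return self.fuse(multi_scale) + x                 # residual over the block


if __name__ == "__main__":
    print(MFF(32)(torch.randn(1, 32, 64, 64)).shape)  # -> torch.Size([1, 32, 64, 64])
```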
VI Experiments
VI-A UnderwaterNet
VI-A1 Experimental Settings
All the experiments are conducted with CUDA 10.0 and cuDNN 7.3.1 backends on NVIDIA TITAN XP GPUs and an Intel Xeon E5-2680 v4 CPU. Our UnderwaterNet is implemented in PyTorch. The input size is 512×512 in both training and inference. The Lookahead optimizer [50] with Adam [51] is employed and the initial learning rate is set to 2.3e-5. The batch size is 32. We employ zero-mean normalization, random flipping, random scaling (between 0.6 and 1.3), and cropping for data augmentation.

Pool | mAP
---|---|
MaxPool | 41.9% |
MaxBlurPool (k=3) [31] | 44.9% |
MaxBlurPool (k=5) [31] | 46.0% |
MaxBlurPool (k=7) [31] | 46.4% |
MBP (ours) | 47.4% |
Different kernel | Skip connection | mAP
---|---|---
 |  | 42.8%
✓ |  | 43.1%
✓ | ✓ | 47.4%
VI-A2 Ablation Studies
We evaluate UnderwaterNet on the UDD test set to investigate the effect of each of its components. All models were trained from scratch for 1600 epochs.
Ablation on MBP. MaxPool and MaxBlurPool [31] with different kernel sizes are compared with MBP in Table II. Our method is 5.5% higher than MaxPool thanks to anti-aliasing and multi-scale blurring. It also achieves the highest accuracy among the MaxBlurPool variants with a single kernel size, indicating that the multi-level blur strategy is better suited to small object detection in cloudy underwater images.
Ablation on MFF. Compared with the original block in MobileNetV2 [10], we use different kernel sizes and add skip connections between branches in MFF. Table III shows results with different settings. The first row corresponds to the block in MobileNetV2 [10], the second to the block in MixNet [36], and the third to a standard MFF block. As we can see, the different-kernel and skip-connection operations together improve accuracy by 4.6% compared with the first row. The results show that our MFF block is more suitable than the blocks in MobileNetV2 or MixNet for extracting and representing features from underwater open-sea scenes under the same layer configurations.
VI-A3 Comparison With Other Methods on UDD
We compare UnderwaterNet with several public real-time methods and lightweight backbones. For a fair comparison, all models are trained to convergence from scratch with the same hyper-parameters, without any test augmentation (e.g., flip test or multi-scale test) but with the same data augmentation mentioned in the Experimental Settings. Both training and inference were conducted on the same server. Results are shown in Table I, and Fig. 6 shows UnderwaterNet detection results.
Among all the methods, UnderwaterNet outperforms the other models with the fewest parameters (only 1.3M) and gives the best speed/accuracy trade-off. It runs at 48 FPS with 47.4% mAP and has the highest accuracy on both seaurchin and scallop. The ablation studies and comparisons indicate that UnderwaterNet with MBP and MFF is well designed for the characteristics of UDD and achieves excellent performance considering its number of parameters.

However, all the methods fail to achieve good performance on seacucumber and scallop (basically under 30%) due to the insufficient training instances and the class-imbalance problem of UDD. Our work addresses these problems by generating and utilizing AUDD.
VI-B Poisson GAN
VI-B1 Experimental Settings
Our Poisson GAN is implemented in PyTorch. The input size is 512×512 in both training and inference. Adam is employed, and the learning rate is initialized to 1e-4 for both the generator and the discriminator, then decreased to 1e-5 after 125K iterations. Our experiments are conducted on a single NVIDIA TITAN XP GPU with a batch size of 4.
Initialization | mAP | mAP (seacucumber) | mAP (seaurchin) | mAP (scallop)
---|---|---|---|---
Scratch | 68.8% | 55.8% | 81.4% | 69.2% |
ImageNet | 71.5% | 72.9% | 71.2% | 70.5% |
Ours | 80.5% | 76.7% | 84.2% | 80.7% |
VI-B2 Pre-trained Dataset
We conduct experiments to evaluate the performance of different initializations (Scratch, the ImageNet pre-trained model, and the pre-trained model from the Pre-trained dataset) in underwater open-sea farm object detection. YOLOv3 is trained for 70 epochs on the Pre-trained dataset to obtain our pre-trained model. Then YOLOv3 is trained on AUDD with the 3 different initializations and tested on the UDD test set, as shown in Table IV.
Obviously, compared with the original result on UDD (see Table I), AUDD is powerful in solving the insufficient-training-data and class-imbalance problems of UDD. With our pre-trained model, YOLOv3 improves significantly from 46.8% to 80.5% mAP, which is 11.7% higher than Scratch and 9.0% higher than ImageNet. This shows that the Pre-trained dataset can provide an excellent initialization for detectors in underwater open-sea farm object detection.
VI-B3 Comparison With Image-to-Image Translation GANs
We compare our Poisson GAN with several image-to-image GANs [52, 53, 54, 55]. GAN-based augmentation methods usually use image-to-image GANs to generate stylized, unreal images, as shown in Fig. 7, to improve classifiers in image classification. But for the small objects in UDD, images generated by these GANs may change the texture of objects or even make the objects disappear, and such style transformations cannot solve the class-imbalance problem. Therefore, in practice they cannot be used to generate an expanded dataset like AUDD to improve the detector.
In contrast, Poisson GAN can change the position of objects while retaining both the object textures and the underwater environment textures. The images it generates are realistic enough to serve as a supplement to UDD.
Method | mAP | mAP (seacucumber) | mAP (seaurchin) | mAP (scallop)
---|---|---|---|---
Baseline | 55.3 / 58.1% | 38.9 / 44.1% | 80.1 / 81.5% | 47.0 / 48.6% |
GridMask [56] | 53.6 / 55.8% | 37.5 / 39.3% | 79.2 / 81.1% | 44.0 / 46.9% |
RErase [57] | 54.9 / 56.4% | 40.1 / 44.7% | 82.1 / 82.9% | 42.5 / 41.5% |
Cutout [58] | 54.3 / 56.3% | 39.7 / 44.2% | 81.8 / 82.9% | 41.5 / 41.6% |
HaS [59] | 30.7 / 30.9% | 9.1 / 9.1% | 65.7 / 67.9% | 17.2 / 15.7% |
Mixup [60] | 55.1 / 60.6% | 44.2 / 51.5% | 80.7 / 81.7% | 40.5 / 48.6% |
Poisson GAN | 72.1 / 80.8% | 56.1 / 64.7% | 91.2 / 94.7% | 68.9 / 82.8% |
VI-B4 Comparison With Other Methods
We compare Poisson GAN with several available augmentation methods [56, 57, 58, 59, 60] that can be used in general object detection. Examples are shown in Fig. 8. GridMask, RErase, Cutout, and HaS drop image information in different ways, and Mixup mixes two images for augmentation. In Table V, UnderwaterNet with our pre-trained model is trained on AUDD for Poisson GAN and on UDD for the others, and all models are tested on the UDD test set; each cell reports mAP without / with flip testing. Baseline means using random flip, random scaling (between 0.6 and 1.3), and cropping to train UnderwaterNet. For GridMask, RErase, Cutout, and HaS, the corresponding method is applied before the baseline augmentation mentioned above; for Mixup, the baseline augmentation is applied first. The baseline augmentation is also employed with Poisson GAN. Finally, the image is normalized to around zero. All models are trained to convergence.
Among all the methods, the information-dropping methods (i.e., GridMask, RErase, Cutout, HaS) perform worse than Baseline because dropping information can make the class-imbalance problem more serious. Mixup improves by 2.5% over Baseline with flip testing but still cannot compete with Poisson GAN, which achieves a significant improvement (up to 22.7% over Baseline on mAP) and solves the class-imbalance problem to a great extent. Moreover, Poisson GAN could also be combined with Mixup to further improve performance.

VII Conclusions
In this paper, we constructed an underwater open-sea farm object detection dataset (UDD) to promote the development of underwater robot picking. We then proposed Poisson GAN to address the inherent class-imbalance and insufficient-training-data problems of UDD by generating the AUDD and Pre-trained datasets. Besides, UnderwaterNet, with the MFF block and the MBP module, was designed to handle the massive small objects in UDD while remaining highly efficient for underwater robot picking. Finally, we conducted experiments to verify the effectiveness of UnderwaterNet, Poisson GAN, UDD, AUDD, and the Pre-trained dataset.
VIII Acknowledgments
This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61976038, 61932020, and 61772108.
References
- [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, 2012.
- [3] M. Everingham, A. Zisserman, C. K. I. Williams, L. V. Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, and G. Dorkó, “The 2005 pascal visual object classes challenge,” Lecture Notes in Computer Science, vol. 111, no. 1, pp. 98–136, 2007.
- [4] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” CoRR, vol. abs/1711.07027, 2017. [Online]. Available: http://arxiv.org/abs/1711.07027
- [5] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person Transfer GAN to Bridge Domain Gap for Person Re-Identification,” arXiv e-prints, p. arXiv:1711.08565, Nov 2017.
- [6] P. Pérez, M. Gangnet, and A. Blake, “Poisson image editing,” ACM Transactions on Graphics, vol. 22, no. 3, pp. 313–318, 2003.
- [7] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision, 2016.
- [8] J. Redmon and A. Farhadi, “YOLOv3: An Incremental Improvement,” arXiv e-prints, p. arXiv:1804.02767, Apr 2018.
- [9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. PP, no. 99, pp. 2999–3007, 2017.
- [10] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “MobileNetV2: Inverted Residuals and Linear Bottlenecks,” arXiv e-prints, p. arXiv:1801.04381, Jan 2018.
- [11] J. He, J. V. Ossenbruggen, and A. P. D. Vries, “Fish4label: accomplishing an expert task without expert knowledge,” in Conference on Open Research Areas in Information Retrieval, 2013.
- [12] H. Qin, L. Xiu, L. Jian, Y. Peng, and C. Zhang, “Deepfish: Accurate underwater live fish recognition with a deep architecture,” Neurocomputing, vol. 187, pp. 49–58, 2016.
- [13] C. Spampinato, S. Palazzo, B. Boom, and R. B. Fisher, “Overview of the lifeclef 2014 fish task,” in Working Notes for CLEF 2014 Conference, Sheffield, UK, September 15-18, 2014, 2014, pp. 616–624. [Online]. Available: http://ceur-ws.org/Vol-1180/CLEF2014wn-Life-SpampinatoEt2014.pdf
- [14] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, X. Bing, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in International Conference on Neural Information Processing Systems, 2014.
- [15] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “DeblurGAN: Blind Motion Deblurring Using Conditional Adversarial Networks,” arXiv e-prints, p. arXiv:1711.07064, Nov 2017.
- [16] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [17] S.-W. Huang, C.-T. Lin, S.-P. Chen, Y.-Y. Wu, P.-H. Hsu, and S.-H. Lai, “Auggan: Cross domain adaptation with gan-based data augmentation,” in Computer Vision – ECCV 2018, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss, Eds. Cham: Springer International Publishing, 2018, pp. 731–744.
- [18] J. Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE International Conference on Computer Vision, 2017.
- [19] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
- [20] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), vol. 1, June 2005, pp. 886–893 vol. 1.
- [21] M. Ravanbakhsh, M. R. Shortis, F. Shafait, A. Mian, E. S. Harvey, and J. W. Seager, “Automated fish detection in underwater images using shape-based level sets,” Photogrammetric Record, vol. 30, no. 149, pp. 46–62, 2015.
- [22] P. Viola and M. J. Jones, “Robust real-time face detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004.
- [23] S. Ren, K. He, R. Girshick, and S. Jian, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in International Conference on Neural Information Processing Systems, 2015.
- [24] J. Dai, Y. Li, K. He, and J. Sun, “R-FCN: Object Detection via Region-based Fully Convolutional Networks,” arXiv e-prints, p. arXiv:1605.06409, May 2016.
- [25] C. L. Zitnick and P. Dollar, “Edge boxes : Locating object proposals from edges,” in ECCV, 2014.
- [26] R. Girshick, “Fast r-cnn,” Computer Science, 2015.
- [27] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You Only Look Once: Unified, Real-Time Object Detection,” arXiv e-prints, p. arXiv:1506.02640, Jun 2015.
- [28] H. Law and D. Jia, “Cornernet: Detecting objects as paired keypoints,” International Journal of Computer Vision, pp. 1–15, 2018.
- [29] X. Zhou, D. Wang, and P. Krähenbühl, “Objects as Points,” arXiv e-prints, p. arXiv:1904.07850, Apr 2019.
- [30] L. Xiu, S. Min, H. Qin, and L. Chen, “Fast accurate fish detection and recognition of underwater images with fast r-cnn,” in Oceans, 2016.
- [31] R. Zhang, “Making Convolutional Networks Shift-Invariant Again,” arXiv e-prints, p. arXiv:1904.11486, Apr 2019.
- [32] Y. Ci, X. Ma, Z. Wang, H. Li, and Z. Luo, “User-guided deep anime line art colorization with conditional adversarial networks,” in 2018 ACM Multimedia Conference on Multimedia Conference - MM ’18, 2018. [Online]. Available: http://dx.doi.org/10.1145/3240508.3240661
- [33] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
- [34] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [35] C. Ledig, L. Theis, F. Huszar, J. Caballero, and W. Shi, “Photo-realistic single image super-resolution using a generative adversarial network,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [36] M. Tan and Q. V. Le, “MixConv: Mixed Depthwise Convolutional Kernels,” arXiv e-prints, p. arXiv:1907.09595, Jul 2019.
- [37] L. C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2018.
- [38] H. Li, P. Xiong, J. An, and L. Wang, “Pyramid Attention Network for Semantic Segmentation,” arXiv e-prints, p. arXiv:1805.10180, May 2018.
- [39] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr, “Res2net: A new multi-scale backbone architecture,” IEEE Transactions on Pattern Analysis and Machine Intelligence, p. 1–1, 2019. [Online]. Available: http://dx.doi.org/10.1109/TPAMI.2019.2938758
- [40] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: fully convolutional one-stage object detection,” CoRR, vol. abs/1904.01355, 2019. [Online]. Available: http://arxiv.org/abs/1904.01355
- [41] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi, “Foveabox: Beyond anchor-based object detector,” CoRR, vol. abs/1904.03797, 2019. [Online]. Available: http://arxiv.org/abs/1904.03797
- [42] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye, “FreeAnchor: Learning to match anchors for visual object detection,” in Neural Information Processing Systems, 2019.
- [43] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin, “Reppoints: Point set representation for object detection,” in The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
- [44] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin, “Region proposal by guided anchoring,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019.
- [45] F. Yu, D. Wang, and T. Darrell, “Deep layer aggregation,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2403–2412, 2017.
- [46] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan, Q. V. Le, and H. Adam, “Searching for mobilenetv3,” ICCV, 2019.
- [47] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” ECCV, 2018.
- [48] M. Tan and Q. V. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in ICML, 2019.
- [49] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le, “Mnasnet: Platform-aware neural architecture search for mobile,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2815–2823, 2018.
- [50] M. R. Zhang, J. Lucas, G. Hinton, and J. Ba, “Lookahead Optimizer: k steps forward, 1 step back,” arXiv e-prints, p. arXiv:1907.08610, Jul 2019.
- [51] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014.
- [52] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks,” arXiv e-prints, p. arXiv:1703.10593, Mar 2017.
- [53] Y. Choi, M. Choi, M. Kim, J.-W. Ha, S. Kim, and J. Choo, “StarGAN: Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation,” arXiv e-prints, p. arXiv:1711.09020, Nov 2017.
- [54] J. Kim, M. Kim, H. Kang, and K. Lee, “U-GAT-IT: unsupervised generative attentional networks with adaptive layer-instance normalization for image-to-image translation,” CoRR, vol. abs/1907.10830, 2019. [Online]. Available: http://arxiv.org/abs/1907.10830
- [55] X. Huang, M.-Y. Liu, S. Belongie, and J. Kautz, “Multimodal unsupervised image-to-image translation,” in ECCV, 2018.
- [56] P. Chen, S. Liu, H. Zhao, and J. Jia, “Gridmask data augmentation,” ArXiv, vol. abs/2001.04086, 2020.
- [57] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” ArXiv, vol. abs/1708.04896, 2017.
- [58] T. Devries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” ArXiv, vol. abs/1708.04552, 2017.
- [59] K. K. Singh and Y. J. Lee, “Hide-and-seek: Forcing a network to be meticulous for weakly-supervised object and action localization,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 3544–3553, 2017.
- [60] H. Zhang, M. Cissé, Y. Dauphin, and D. Lopez-Paz, “mixup: Beyond empirical risk minimization,” ICLR, 2017.