More and more research institutes and scientists are attaching cameras to Autonomous Underwater Vehicles (AUVs) and Remotely Operated Vehicles (ROVs) to perform underwater tasks such as marine organism capture, ecological surveillance and biodiversity monitoring. Underwater object detection is an indispensable technology for AUVs to fulfill these tasks.
In application, once an underwater object detector for certain categories has been trained, we hope it can be applied in any underwater circumstance. It is therefore necessary to build a General Underwater Object Detector (GUOD). A GUOD faces three kinds of challenges:
(1) Underwater images are much harder to obtain, and annotation usually requires experts, which is costly. Labeled datasets for underwater object detection are therefore extremely limited, inevitably leading to overfitting of deep models. Data augmentation aims to alleviate this lack of data. There are three types of augmentation. First, geometric transformations (e.g., horizontal flipping, rotation, patch cropping, perspective simulation) have proved effective in various fields. Second, cut-and-paste-based methods (e.g., random cut-and-paste, Mixup, CutMix, PSIS) help the model learn contextual invariance. Third, domain-transfer-based methods (e.g., SIN) force the model to focus more on semantic information.
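As a toy illustration of the cut-and-paste family above, Mixup forms convex combinations of image pairs and their labels. A minimal sketch on flat pixel lists; the function and argument names are our own illustration, not from any library:

```python
def mixup(img_a, img_b, label_a, label_b, lam=0.5):
    """Blend two images (flat pixel lists in [0, 1]) and their one-hot labels.
    In practice lam is sampled from a Beta distribution; fixed here for clarity."""
    mixed_img = [lam * a + (1 - lam) * b for a, b in zip(img_a, img_b)]
    mixed_label = [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_img, mixed_label
```

Because the blended label keeps both classes in proportion, the model is discouraged from relying on any single object's exact context.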
(2) The trade-off between speed and performance becomes even more critical. A GUOD should work in real time, a common requirement in robotics, yet it is impractical to equip small AUVs with high-performance hardware. Some works improve the speed of deep models while keeping the performance drop under control, such as MobileNet, SSD and YOLOv3.
(3) Deep models suffer severely from domain shift, but a GUOD should be invariant to water quality, working well not only in oceans but also in lakes and rivers. This can be seen as a domain generalization problem: a model trains on source domains but is evaluated on an unseen domain. Several domain adaptation (DA) methods (e.g., style consistency, DA-Faster RCNN) and domain generalization (DG) methods (e.g., JiGEN, MMD-AAE, Epi-FCR) have been proposed. Nevertheless, most DG works focus on object recognition, and DA methods cannot be transplanted directly to the DG setting, so their effectiveness on DG object detection has not been established.
This work aims to train a GUOD with a small dataset covering limited domains. To handle challenge (1), a new augmentation method, Water Quality Transfer (WQT), is proposed to enlarge the dataset and increase domain diversity. To handle challenges (2) and (3), DG-YOLO is proposed to further boost the domain invariance of object detection, building on the real-time detector YOLOv3. Our method is evaluated on the Underwater Robot Picking Contest 2019 (URPC2019) dataset and achieves a clear performance improvement.
In summary, our contributions are as follows: (1) we propose WQT, a new augmentation designed specifically for underwater conditions, analyze its effectiveness and reveal its limitations; (2) building on WQT, we propose DG-YOLO to further mine the domain-invariant (semantic) information in underwater images, realizing domain generalization; (3) extensive experiments and visualizations demonstrate the effectiveness of our method.
2.1 Water Quality Transfer (WQT)
As Figure 1 shows, we select 8 images with different types of water quality and use a photorealistic style-transfer method to transfer the URPC dataset to these water qualities. The content images come from URPC's training and validation sets. In the following, the seven synthesized training sets are denoted type1 to type7 and the corresponding validation sets val_type1 to val_type7. For type8, only the validation set is transferred, yielding val_type8 without a corresponding training set. Since the model never trains on the type8 domain, val_type8 tests its domain generalization capacity.
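The style-transfer model itself is beyond a short snippet, but the dataset-synthesis loop can be sketched. Below, `color_cast` is a deliberately crude stand-in for the photorealistic transfer, and all names and tint values are our own illustration:

```python
def color_cast(pixels, cast=(0.4, 0.9, 0.6)):
    """Crude stand-in for photorealistic style transfer: scale the RGB
    channels toward a water-like tint. pixels: list of (r, g, b) in [0, 1]."""
    return [(r * cast[0], g * cast[1], b * cast[2]) for r, g, b in pixels]

def synthesize_domains(dataset, casts):
    """Apply one 'water quality' per cast, yielding type1..typeN datasets."""
    return {f"type{k}": [color_cast(img, c) for img in dataset]
            for k, c in enumerate(casts, start=1)}
```

With seven casts (or, in the real pipeline, seven style images), this produces the type1 to type7 training sets from the original URPC images.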
2.2 Domain Generalization YOLO (DG-YOLO)
A review of YOLOv3. Because AUVs with small processing units have limited computational capacity, the real-time detector YOLOv3 is a promising choice. YOLOv3 is a one-stage object detector using Darknet-53 as its backbone. Unlike Faster R-CNN, YOLOv3 has no region proposal network: it directly regresses bounding-box coordinates and class information with a fully convolutional network. YOLOv3 divides an image into cells, and each cell is responsible for the objects lying in it. The training loss of YOLOv3 consists of the classification loss $\mathcal{L}_{cls}$, the coordinate loss $\mathcal{L}_{coord}$, the objectness loss $\mathcal{L}_{obj}$ and the no-object loss $\mathcal{L}_{noobj}$:

$$\mathcal{L}_{yolo} = \mathcal{L}_{cls} + \lambda_{coord}\,\mathcal{L}_{coord} + \mathcal{L}_{obj} + \lambda_{noobj}\,\mathcal{L}_{noobj} \tag{1}$$

where $\lambda_{coord}$ and $\lambda_{noobj}$ are trade-off parameters.
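The weighted sum of the four loss terms can be sketched directly; the default weights below follow common YOLO settings and are an assumption, not the paper's reported values:

```python
def yolo_loss(l_cls, l_coord, l_obj, l_noobj, lam_coord=5.0, lam_noobj=0.5):
    """Total YOLOv3 loss: classification + weighted coordinate +
    objectness + weighted no-object terms."""
    return l_cls + lam_coord * l_coord + l_obj + lam_noobj * l_noobj
```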
Domain Invariant Module (DIM). Since DA and DG share some similarities, we adapt the domain classifier proposed for adversarial domain adaptation to our DG task. Given a batch of input images $X = \{x_i\}_{i=1}^{N}$ from different source domains with corresponding domain labels $\{d_i\}_{i=1}^{N}$, where $N$ is the batch size and $d_i \in \{1, \dots, K\}$, and denoting the feature extractor by $F$ and the domain classifier by $D$, the domain loss is defined as:

$$\mathcal{L}_{domain} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{CE}\big(D(F(x_i)),\, d_i\big) \tag{2}$$

where $\mathrm{CE}$ denotes the categorical cross entropy. In our application, the domain labels come from WQT, and $K = 7$, corresponding to the seven types of water quality that WQT synthesizes. The domain loss is not computed for data from the original dataset.
IRM Penalty. Inspired by a recent study, Invariant Risk Minimization (IRM) helps learn an invariant predictor across multiple domains. Given a set of training environments (used here synonymously with domains) $\mathcal{E}_{tr}$, our final goal is good performance across a large set of unseen but related environments $\mathcal{E}_{all} \supset \mathcal{E}_{tr}$. Directly using Empirical Risk Minimization (ERM), however, leads to overfitting on the training environments and to learning spurious correlations. To generalize well on unseen environments, IRM is a better choice for obtaining invariance:

$$\min_{\Phi} \sum_{e \in \mathcal{E}_{tr}} R^{e}(\Phi) + \lambda \left\lVert \nabla_{w \mid w=1.0}\, R^{e}(w \cdot \Phi) \right\rVert^{2} \tag{3}$$

where $\Phi$ is the invariant predictor, $R^{e}(\Phi)$ is the ERM term on environment $e$, $w = 1.0$ is a fixed dummy scalar, the gradient-norm term is the invariance penalty, and $\lambda$ is a trade-off parameter balancing the ERM term and the invariance penalty. To apply IRM in YOLOv3, an IRM penalty specific to YOLOv3 is designed as follows:
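For intuition, here is a minimal pure-Python IRM sketch for a linear predictor with squared error, where the gradient with respect to the dummy scalar w is written analytically rather than via autograd (all names are our own illustration):

```python
def risk(w, phi_out, targets):
    """ERM term R^e(w * Phi): mean squared error on one environment."""
    return sum((w * p - y) ** 2 for p, y in zip(phi_out, targets)) / len(targets)

def irm_penalty(phi_out, targets):
    """Squared gradient of the risk w.r.t. the dummy scalar w, at w = 1.0:
    d/dw mean((w*p - y)^2) = mean(2*(w*p - y)*p)."""
    grad = sum(2 * (p - y) * p for p, y in zip(phi_out, targets)) / len(targets)
    return grad ** 2

def irm_objective(envs, lam=1.0):
    """Sum over environments of the ERM term plus the weighted penalty.
    envs: list of (predictor outputs, targets) pairs, one per environment."""
    return sum(risk(1.0, p, y) + lam * irm_penalty(p, y) for p, y in envs)
```

A predictor that is simultaneously optimal in every environment has zero gradient at w = 1.0 everywhere, so the penalty vanishes; otherwise it pushes the representation toward features whose optimal classifier is shared across environments.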
$$\mathcal{L}_{penalty} = \Big\lVert \nabla_{w \mid w=1.0} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \Big[ \mathrm{CE}\big(\sigma(w\,\hat{p}_{ij}),\, p_{ij}\big) + \mathrm{CE}\big(\sigma(w\,\hat{o}_{ij}),\, o_{ij}\big) + \mathrm{MSE}\big(w\,\hat{b}_{ij},\, b_{ij}\big) \Big] \Big\rVert^{2} \tag{4}$$

where $\mathbb{1}_{i}^{obj}$ denotes whether an object appears in cell $i$, $\mathbb{1}_{ij}^{obj}$ denotes that the $j$th bounding-box predictor in cell $i$ is responsible for that prediction, $\sigma$ is the sigmoid operation, $\hat{p}$ is the class score before the sigmoid, $p$ is the class label, $\hat{o}$ is the objectness score before the sigmoid, and $o$ is the object label. $\hat{b}$ is the bounding box output by YOLOv3, with corresponding ground truth $b$. The penalty term is designed from the corresponding YOLOv3 losses: the dummy scalar $w$ is inserted at different places in the losses, and the squared gradient of each loss with respect to $w$ gives the corresponding penalty term.
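The construction for a single sigmoid-BCE term can be made concrete; here the gradient at w = 1.0 is written analytically (a sketch of the idea for one term, not the full multi-term penalty, and the names are ours):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_penalty(score, label):
    """Squared gradient, at w = 1.0, of BCE(sigmoid(w * score), label)
    w.r.t. w. Analytically: d/dw BCE = (sigmoid(w * score) - label) * score."""
    grad = (sigmoid(score) - label) * score
    return grad ** 2
```

The penalty is zero when the prediction already matches the label (or the raw score is zero), and grows when a confident score disagrees with the label, mirroring the invariance pressure of Eq. (3) per loss term.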
Network overview. An overview of our network, which we denote DG-YOLO, is shown in Figure 2. Compared with YOLOv3, the DIM and the IRM penalty are added. In detail, the YOLOv3 backbone Darknet-53 can be regarded as the feature extractor. The feature maps extracted from Darknet are first fed into a Gradient Reversal Layer (GRL), which reverses the gradient during backpropagation for adversarial learning. The domain classifier then distinguishes the feature maps between domains. With the help of the GRL and the domain classifier, the backbone is forced to discard water-quality information in order to fool the domain classifier, so DG-YOLO makes predictions relying more on semantic information. Moreover, the IRM penalty is computed simultaneously with the YOLO loss. Combining (1), (2) and (4), the total loss of DG-YOLO is:

$$\mathcal{L}_{total} = \mathcal{L}_{yolo} + \lambda_{d}\,\mathcal{L}_{domain} + \lambda_{p}\,\mathcal{L}_{penalty} \tag{5}$$
where the trade-off weights on the domain loss and the IRM penalty are both set to 1 in our experiments. At inference time, the DIM and the IRM penalty can be discarded, so DG-YOLO does not affect the speed of YOLOv3. It should be emphasized that, because the domain labels come from WQT, DG-YOLO cannot be used without WQT.
3 Experiments and Discussions
We evaluate WQT and DG-YOLO on a publicly available dataset, URPC2019 (www.cnurpc.org), which consists of 3765 training samples and 942 validation samples over five categories: echinus, starfish, holothurian, scallop and waterweeds. Applying WQT to the training and validation sets of URPC2019, we synthesize type1-7 for training and val_type1-8 for validation. The performance on val_type8 represents the domain generalization capacity of a model.
3.2 Training details
YOLOv3 and DG-YOLO are trained for 300 epochs and evaluated on the original and all synthetic validation sets, with images resized to 416×416. Models are trained on an Nvidia GTX 1080Ti GPU with a PyTorch implementation and a batch size of 8. Adam is adopted for optimization with a learning rate of 0.001. The IoU, confidence and non-maximum suppression thresholds are all set to 0.5. Gradient accumulation is used: gradients are summed and one gradient-descent step is taken every two iterations. We do not use any other data augmentation on YOLOv3 or DG-YOLO unless explicitly mentioned.
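The gradient-accumulation scheme can be sketched on a toy objective f(x) = x², with plain Python standing in for the optimizer (function and parameter names are our own):

```python
def train_quadratic(steps=10, lr=0.1, accum=2, x0=4.0):
    """Minimize f(x) = x^2: accumulate gradients over `accum` iterations,
    then take a single descent step on the summed gradient."""
    x, grad_sum = x0, 0.0
    for it in range(1, steps + 1):
        grad_sum += 2 * x            # accumulate df/dx at the current x
        if it % accum == 0:          # one step every `accum` iterations
            x -= lr * grad_sum
            grad_sum = 0.0
    return x
```

Summing over two iterations before stepping is equivalent to doubling the effective batch size, which is useful when GPU memory caps the per-iteration batch at 8.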
3.3 Experiments of WQT
In this subsection, we analyze why WQT works. In Table 1, Ori denotes the original URPC dataset, baseline means YOLOv3 trained only on the original dataset, and ori+type1 means YOLOv3 trained on the original dataset plus the type1 dataset. Full_WQT means YOLOv3 trained across type1 to type7. From Table 1, we observe three interesting points:
(1) Compared with the baseline, every augmentation group improves performance on the original validation set. WQT can also be combined with other data augmentation methods for higher performance (last two rows of Table 1), which further proves its effectiveness. Moreover, WQT helps the model generalize better on other types of water quality in most cases. For example, ori+type7 evaluated on val_type3 achieves 39.23% mAP, 12.66% higher than the baseline.
(2) We believe there is a correlation between performance and the similarity between water qualities. First, we use the style loss of neural style transfer to represent style distance and calculate it between the different types of water quality: we feed style images type1 to type7 into the style-transfer network, extract feature maps at certain layers of both the encoder and decoder, and compute the style loss between every pair of types, obtaining a style-distance matrix. The result is shown in Table 2. Second, we take the data from columns 3 to 9 (val_type1 to val_type7) and rows 2 to 8 (models ori+type1 to ori+type7) of Table 1 and subtract from each row of this 7×7 matrix the baseline performance on the corresponding validation type, obtaining a performance-gain matrix. Computing the Pearson correlation coefficient and negating it, we find a correlation of 0.4634 between style similarity and performance gain. From this analysis, it can be inferred that the increase in generalization capacity gained from WQT comes from the similarity between different types of water quality.
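The correlation step can be reproduced with a plain Pearson implementation: flatten the 7×7 style-distance and performance-gain matrices into two vectors and feed them in (negating the result relates style similarity, rather than distance, to gain):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```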
(3) To further probe finding (2), the models are evaluated on val_type8, whose style is very different from type1 to type7. As expected, WQT-trained models perform better not only on the original dataset but also across the type1 to type7 datasets. However, they still fail on val_type8 (see Table 3), which is far from the requirement of a GUOD. WQT alone is not enough for domain generalization.
3.4 Experiments of DG-YOLO
The effectiveness of DG-YOLO. WQT helps YOLOv3 learn domain-invariant information, but the model still suffers severely from domain shift. As Table 3 shows, DG-YOLO further mines domain-invariant information from the data, obtaining a 3.21% mAP improvement on val_type8 over WQT-only. Moreover, compared with other object detectors on val_type8, DG-YOLO shows much better domain generalization capacity.
Ablation study. The results of the ablation study are shown in Table 3. Compared with WQT-only on val_type8, WQT+DIM suffers a 4.52% performance decrease and WQT+IRM penalty brings little improvement. However, WQT+DG-YOLO achieves a 3.21% improvement, which suggests that only the combination of the DIM and the IRM penalty leads to better performance.
Visualization of DG-YOLO. One thing that cannot be ignored is the performance decrease of WQT+DG-YOLO on the original validation set. This is because WQT-only is "cheating": it learns spurious correlations to make predictions. For example, waterweeds are green in greenish water but may appear black in another type of water; the color of an object is therefore not domain-invariant information, even though exploiting this spurious correlation conveniently achieves good results within a single domain. The performance decrease of DG-YOLO can thus be interpreted as the model abandoning domain-related information and trying to learn domain-invariant information from the dataset. We use the SmoothGrad visualization technique to test this hypothesis, finding the areas that make the model believe there is an echinus with probability higher than 95%. As shown in Figure 3, the baseline focuses on the shadow in the top left of the image, where there is no echinus. The pixels WQT-only focuses on are too dispersed, indicating that it learns spurious correlations. The pixels DG-YOLO focuses on are concentrated and lie exactly where the echinus is. The visualization shows that DG-YOLO learns more semantic information than the baseline and WQT-only.
This paper proposes a data augmentation method, WQT, and a novel model, DG-YOLO, to overcome the three challenges a GUOD faces: limited data, real-time processing and domain shift. Leveraging photorealistic style transfer, WQT increases the domain diversity of the original dataset. With the DIM and the IRM penalty, DG-YOLO further mines semantic information from the dataset. Experiments on the original and synthetic URPC2019 datasets demonstrate the remarkable domain generalization capacity of our method. However, since the performance of DG-YOLO on an unseen domain still cannot reach the level achieved on seen domains, there is still much to explore in this field.
- (2019) Invariant risk minimization. arXiv preprint arXiv:1907.02893. Cited by: §2.2.
- (2019) Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2229–2238. Cited by: §1.
- (2018) Domain adaptive Faster R-CNN for object detection in the wild. In Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
- (2017) Cut, paste and learn: surprisingly easy synthesis for instance detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1301–1310. Cited by: §1.
- (2015) Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning, pp. 1180–1189. Cited by: §2.2, §2.2.
- (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2414–2423. Cited by: §3.3.
- (2018) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations. Cited by: §1.
- (2019) Data augmentation for object detection via progressive and selective instance-switching. arXiv preprint arXiv:1906.00358v2. Cited by: §1.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861. Cited by: §1.
- (2019) Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing 337, pp. 372–384. Cited by: §1.
- (2019) Episodic training for domain generalization. In International Conference on Computer Vision (ICCV). Cited by: §1.
- (2018) Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5400–5409. Cited by: §1.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §1, §1.
- (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §2.2.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99. Cited by: §2.2.
- (2019) Domain adaptation for object detection via style consistency. In British Machine Vision Conference. Cited by: §1.
- (2017) SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825. Cited by: §3.4.
- (1992) Principles of risk minimization for learning theory. In Advances in Neural Information Processing Systems, pp. 831–838. Cited by: §2.2.
- (2019) Photorealistic style transfer via wavelet transforms. In International Conference on Computer Vision (ICCV). Cited by: §2.1.
- (2019) CutMix: regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6023–6032. Cited by: §1.
- (2018) Mixup: beyond empirical risk minimization. In International Conference on Learning Representations. Cited by: §1.