RoIMix: Proposal-Fusion among Multiple Images for Underwater Object Detection

11/08/2019, by Wei-Hong Lin et al.

Generic object detection algorithms have proven their excellent performance in recent years. However, object detection on underwater datasets is still less explored. In contrast to generic datasets, underwater images usually have color shift and low contrast; sediment would cause blurring in underwater images. In addition, underwater creatures often appear closely to each other on images due to their living habits. To address these issues, our work investigates augmentation policies to simulate overlapping, occluded and blurred objects, and we construct a model capable of achieving better generalization. We propose an augmentation method called RoIMix, which characterizes interactions among images. Proposals extracted from different images are mixed together. Previous data augmentation methods operate on a single image while we apply RoIMix to multiple images to create enhanced samples as training data. Experiments show that our proposed method improves the performance of region-based object detectors on both Pascal VOC and URPC datasets.




1 Introduction

Many object detectors [21, 2, 20] achieve promising performance on generic datasets such as Pascal VOC [9] and MSCOCO [18]. However, due to the complicated underwater environment and illumination conditions, underwater images often suffer from low contrast, texture distortion and uneven illumination. Figure 1(a) displays densely distributed creatures: they cover each other, and some are blurred by sediment. These object detectors perform well on generic datasets, but their capability of detecting underwater objects has been less studied. The Underwater Robot Picking Contest (URPC) offers a challenging object detection dataset, which contains a wide range of overlapping, occluded and blurred underwater creatures.

The issue of overlapping, occluded, and blurred objects has not been well addressed by existing data augmentation methods [16, 24]. If the model simply fits the training data, it will lack generalization ability and cannot cope with complicated underwater environments. Therefore, we directly simulate objects’ overlap, occlusion and blur by mixing proposals among multiple images.

(a) (b)
Figure 1: (a) Examples of overlap, occlusion and blur. In this paper, “overlap” means the objects of the same class cover part of each other whereas “occlusion” represents the similar case for different classes. “blur” is caused by sediment. (b) Misaligned proposals.

Theoretically, following the Empirical Risk Minimization (ERM) Principle [26], deep models are dedicated to minimizing their average error over the training data. However, they usually suffer from over-fitting. Specifically, ERM guides deep models to memorize the training data rather than generalize from them. Meanwhile, these models are easily attacked by adversarial samples. Data augmentation is utilized to resolve over-fitting. According to the Vicinal Risk Minimization (VRM) Principle [3], the models are optimized on samples similar to the training data, produced via augmentation strategies. In the image classification domain, translating and flipping are commonly used strategies to increase the amount of training data, and some works such as Mixup [30] and CutMix [29] are devoted to creating better training data. We investigate the effect of deploying data augmentation in training object detectors.

In this work, we aim to design a new data augmentation method for underwater object detection. Though data augmentation methods [8, 25, 12] for image classification can bring performance gains, they are not specifically designed for underwater object detectors. We propose a data augmentation method called RoIMix to improve the capability of detecting overlapping, occluded and blurred objects. Our proposed method is designed for region-based detectors such as Faster R-CNN [21] and its variants [7, 14, 17]. In contrast to previous data augmentation methods that operate on a single image, our proposed method pays more attention to interactions among images. Applying image-level fusion like Mixup [12] directly in object detection would cause proposals from different images to be misaligned, as shown in Figure 1(b). In order to accurately simulate the situations of overlap, occlusion and blur, we perform proposal-level fusion. In this way, we achieve remarkable performance for object detection on the Pascal VOC and URPC datasets.

In summary, the main contributions of this paper are as follows: (1) to the best of our knowledge, this is the first work to utilize a proposal-level mix strategy to improve the performance of object detectors, especially for overlapping, occluded and blurred underwater creatures detection; (2) unlike previous data augmentation methods that process on a single image, RoIMix focuses on interactions between images and mixes proposals among multiple images; (3) our proposed method achieves remarkable performance for object detection on both URPC and Pascal VOC. Notably, we won the first prize with our proposed method RoIMix for offline target recognition in URPC 2019.

2 Related Work

2.1 Data Augmentation

Data augmentation is a critical strategy for training deep learning models. In the image classification domain, commonly used data augmentation strategies include rotation, translation [6, 27, 22] and flipping. Besides, there are works on creating better data augmentation strategies [1, 23, 16]. Zhang et al. [30] propose to mix two random training images to produce vicinal training data as a regularization approach. Regional dropout methods such as Cutout [8] erase random regions of the input; this helps the model attend to the most discriminative parts of objects, but it can result in loss of information. Moreover, an advanced version, CutMix [29], cuts and pastes patches among the training data, which greatly improves model robustness against input corruption. For object detection, detectors adopt multiple augmentation strategies, such as photometric distortion [19], image mirroring [10] and multi-scale training [5]. Apart from this, a model pretrained with CutMix can achieve performance gains on Pascal VOC, but CutMix is not specifically designed for object detectors. We fully consider the characteristics of region-based detectors and propose a new data augmentation method.

2.2 Faster R-CNN and its variants

Faster R-CNN [21] is a milestone in the development of two-stage object detectors. It is composed of three modules: a Head network responsible for extracting features, such as AlexNet [16], VGG [24] or ResNet [15]; an RPN [21], a fully convolutional network that generates a set of region proposals on the feature map; and a RoI classifier [10], which makes predictions for these region proposals. However, the computation is not shared in the region classification step. Dai et al. [7] propose Region-based Fully Convolutional Networks (R-FCN), which extract spatial-aware region features and share the computation in the classification step by removing the fully connected layer, without a decline in performance. Another issue with Faster R-CNN is that it uses the output features from the last layer to make predictions, which limits its capability of detecting small objects. Therefore, Lin et al. [17] propose Feature Pyramid Networks (FPN), which combine hierarchical features to make better predictions. In recent years, there have been other variants of Faster R-CNN [2, 4, 13, 11, 28]. Our method is potentially versatile and can be applied to two-stage detectors.

| Method | mAP | Single | Multiple | GT | RoI | Max | holothurian | echinus | scallop | starfish |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 73.74 | - | - | - | - | - | 72.16 | 86.95 | 52.87 | 83.00 |
| Proposed (RoIMix) | 74.92 | | ✓ | | ✓ | ✓ | 73.27 | 86.80 | 55.97 | 83.65 |
| GTMix | 74.17 | | ✓ | ✓ | | ✓ | 72.30 | 86.76 | 54.68 | 82.95 |
| Single_GTMix | 74.23 | ✓ | | ✓ | | ✓ | 71.51 | 86.66 | 54.67 | 84.09 |
| Single_RoIMix | 74.51 | ✓ | | | ✓ | ✓ | 73.13 | 86.59 | 54.60 | 83.71 |
| RoIMix (w/o Max) | 72.86 | | ✓ | | ✓ | | 71.79 | 86.11 | 50.07 | 83.46 |
| Single_RoIMix (w/o Max) | 73.12 | ✓ | | | ✓ | | 71.22 | 86.43 | 51.26 | 83.56 |

Table 1: Detection results on URPC 2018. “Single” or “Multiple” means applying the mixing operation within a single image or across multiple images; “GT” or “RoI” denotes GT-wise or RoI-wise fusion; “Max” means applying the max function when choosing the mixing ratio $\lambda$.

3 Methodology

Figure 2: Overview of our approach. The architecture contains three modules: Head network, Region Proposal Network (RPN) and Classifier. RoIMix sits between the RPN and the Classifier: it combines random proposals generated by the RPN into Mixed Regions of Interest (Mixed RoIs), whose features are then used for localization and classification.

As shown in Figure 2, our proposed method is applied between the RPN and the RoI Classifier. We take RoIs produced by the RPN and mix them with a random weight ratio generated from a Beta distribution. Then we use the mixed samples to train the model. In the following sections, we describe the RoIMix algorithm in detail and discuss the principles behind it.

3.1 Algorithm

Let $x$ and $y$ denote a proposal and its label. RoIMix aims to generate virtual proposals by combining two random RoIs $x_i$ and $x_j$ extracted from multiple images. The sizes of RoIs are often inconsistent, so we first resize $x_j$ to the same size as $x_i$. The generated training sample $\tilde{x}$ is used to train the model. The combining operation is defined as:

$$\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j \qquad (1)$$

where $\lambda$ is the mixing ratio of the two proposals. Instead of choosing a mixing ratio directly from a Beta distribution with parameter $\alpha$ like Mixup:

$$\lambda' \sim \mathrm{Beta}(\alpha, \alpha) \qquad (2)$$

we pick the larger of $\lambda'$ and $1 - \lambda'$ as the weight of the first RoI $x_i$:

$$\lambda = \max(\lambda', 1 - \lambda') \qquad (3)$$

where $\max$ is a function returning the larger value. The reason behind this is that we use $y_i$ as the label of the mixed proposal $\tilde{x}$, so $x_i$ should dominate the mix. Our proposed method mixes proposals without mixing labels, which is similar to traditional data augmentation methods. It only affects training and keeps the model unchanged during evaluation.
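The ratio-sampling step of Eqs. (2) and (3) can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' code; the function name `roimix_ratio` and the default `alpha` are assumptions.

```python
import numpy as np

def roimix_ratio(alpha=0.5, rng=None):
    # Eq. (2): draw lambda' from Beta(alpha, alpha).
    # Eq. (3): keep the larger of lambda' and 1 - lambda', so the first
    # RoI (whose label y_i is kept for the mixed proposal) dominates.
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    return max(lam, 1.0 - lam)
```

Because of the max step, the returned ratio always lies in [0.5, 1], guaranteeing the labeled RoI contributes at least half of the mixed proposal.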

Using this method, we obtain new virtual RoIs simulating overlapping, occluded and blurred objects. Figure 3 visualizes the process of our proposed method. We replace the original proposals with these new virtual RoIs to generate new training samples, and train the network by minimizing the original loss function on them. Code-level details are presented in Algorithm 1.


Figure 3: Visualization of the RoIMix method. $x_i$ and $x_j$ are two RoIs containing a scallop and a sea urchin, respectively. $x_k$ is an occluded sample (a sea urchin lying on a scallop) cropped from a training image. Via RoIMix, $x_i$ and $x_j$ are mixed into a new virtual proposal similar to $x_k$, simulating the situation of occlusion and blur.

3.2 Discussion

We simulate objects’ overlap and occlusion with RoIMix to help the model implicitly learn a better capability of detecting dense objects. From the perspective of statistical learning theory, RoIMix is a form of linear interpolation between two proposals, which may smooth the decision boundary and remove sharp transitions.

To be specific, RoIMix follows the VRM Principle instead of the ERM Principle, enabling deep learning models to be more robust. A model trained following the ERM Principle minimizes the empirical risk so as to fit the training data well. We define the empirical risk as

$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

where $f$ represents the nonlinear mapping between $x$ and $y$, $n$ is the number of samples, and $\ell$ is a loss function measuring the distance between $f(x_i)$ and $y_i$. However, this training strategy makes the decision boundary fit the training data too closely and leads to over-fitting. Therefore, we suggest not using the empirical risk to approximate the expected risk. RoIMix follows the VRM rule and generates a vicinal distribution of the training data. We can then replace the training data with vicinal samples $(\tilde{x}_i, \tilde{y}_i)$ and approximate the expected risk by the vicinal risk $R_v$:

$$R_v(f) = \frac{1}{m} \sum_{i=1}^{m} \ell(f(\tilde{x}_i), \tilde{y}_i)$$
Therefore, the training process is transformed into minimizing the vicinal risk, which approximates the expected risk. In each epoch, RoIMix generates different vicinal training data. In this manner, the model tends to become more robust. Section 4.3 illustrates the robustness of the model trained with RoIMix in detail.
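The contrast between the two risks can be made concrete on a toy 1-D regression task. The sketch below is purely illustrative (the data, the squared loss, and the vicinal sampler are assumptions, not the paper's setup); it shows how a RoIMix-style vicinal risk averages the loss over mixed inputs whose label comes from the dominant sample.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy 1-D regression data: y is approximately 2x
xs = rng.normal(size=50)
ys = 2.0 * xs + rng.normal(scale=0.1, size=50)

def sq_loss(pred, target):
    return (pred - target) ** 2

def empirical_risk(w):
    # R_emp: average loss over the raw training points (ERM objective)
    return np.mean(sq_loss(w * xs, ys))

def vicinal_risk(w, alpha=0.5, m=2000):
    # R_v: average loss over samples drawn from a RoIMix-style vicinal
    # distribution: inputs are mixed, labels come from the dominant sample
    i = rng.integers(len(xs), size=m)
    j = rng.integers(len(xs), size=m)
    lam = rng.beta(alpha, alpha, size=m)
    lam = np.maximum(lam, 1.0 - lam)      # max step of Eq. (3)
    vx = lam * xs[i] + (1.0 - lam) * xs[j]
    vy = ys[i]                            # label of the dominant sample
    return np.mean(sq_loss(w * vx, vy))
```

Because the vicinal samples change with every draw, a model minimizing `vicinal_risk` sees fresh interpolated data each epoch rather than memorizing the fixed training set.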

1: input images: $I$; input RoIs: $R$, $R'$; RoI positions: $P$
2: initialize output images: $\tilde{I} \leftarrow I$
3: for each $i$ in range($n$) do
4:     choose two RoIs, $r_i$ from $R$ and $r'_i$ from $R'$
5:     generate mixing ratio $\lambda_i$ using (2)(3)
6:     create mixed RoI $\tilde{r}_i$ using (1)
7:     calculate the image index $k$ of $r_i$
8:     paste the generated RoI into the image: $\tilde{I}_k[P_i] \leftarrow \tilde{r}_i$
9: end for
10: new training sample $\tilde{I}$
Algorithm 1 RoIMix. The number of images and RoIs in a mini-batch is $N$ and $n$, respectively. RPN generates the same number of RoIs for each image. $R$ represents the RoIs generated by RPN; $R'$ corresponds to $R$ after a random permutation. Each position $P_i = ((x_1, y_1), (x_2, y_2))$ gives the upper-left and lower-right corners of RoI $r_i$.
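The loop of Algorithm 1 can be sketched at the image level with numpy. This is a simplified sketch, not the authors' implementation: it assumes images are float H×W×C arrays, RoIs are given as `(img_idx, x1, y1, x2, y2)` pixel boxes, and the partner RoI is resized by nearest-neighbour indexing to keep the example dependency-free.

```python
import numpy as np

def roimix_minibatch(images, boxes, alpha=0.5, rng=None):
    # images: list of float HxWxC arrays (the mini-batch I)
    # boxes:  list of (img_idx, x1, y1, x2, y2) RoIs in pixel coordinates
    if rng is None:
        rng = np.random.default_rng()
    out = [img.copy() for img in images]
    perm = rng.permutation(len(boxes))        # R' = randomly permuted R
    for i, (k, x1, y1, x2, y2) in enumerate(boxes):
        pk, px1, py1, px2, py2 = boxes[perm[i]]
        roi = images[k][y1:y2, x1:x2]
        partner = images[pk][py1:py2, px1:px2]
        # nearest-neighbour resize of the partner to the target RoI size
        h, w = roi.shape[:2]
        ri = np.arange(h) * partner.shape[0] // h
        ci = np.arange(w) * partner.shape[1] // w
        partner = partner[ri][:, ci]
        lam = rng.beta(alpha, alpha)
        lam = max(lam, 1.0 - lam)             # Eq. (3)
        out[k][y1:y2, x1:x2] = lam * roi + (1.0 - lam) * partner  # Eq. (1)
    return out
```

Pasting the mixed RoI back at its original position keeps the ground-truth boxes and labels valid, which is why RoIMix needs no label mixing.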
| Method | mAP | aero | bike | bird | boat | bottle | bus | car | cat | chair | cow | table | dog | horse | mbike | person | plant | sheep | sofa | train | tv |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 80.0 | 85.4 | 87.0 | 79.5 | 73.0 | 69.0 | 84.8 | 88.4 | 88.4 | 65.2 | 85.5 | 74.3 | 87.3 | 86.4 | 81.7 | 83.4 | 50.1 | 83.8 | 81.3 | 85.1 | 80.6 |
| Proposed | 80.8 | 85.3 | 87.0 | 79.1 | 73.9 | 70.2 | 86.9 | 88.3 | 88.8 | 66.0 | 86.1 | 75.1 | 88.2 | 88.0 | 85.6 | 83.1 | 54.8 | 83.8 | 81.1 | 86.3 | 79.0 |
| GTMix | 80.6 | 82.2 | 85.8 | 79.4 | 72.6 | 71.5 | 87.5 | 88.8 | 88.3 | 65.4 | 86.3 | 76.3 | 88.3 | 88.3 | 86.1 | 84.3 | 51.2 | 83.7 | 80.8 | 86.2 | 79.8 |
| Single_GTMix | 80.5 | 85.4 | 86.2 | 78.7 | 72.4 | 69.7 | 88.2 | 88.4 | 89.0 | 65.4 | 85.2 | 73.3 | 87.3 | 87.8 | 86.2 | 83.0 | 53.0 | 81.2 | 81.3 | 85.6 | 82.8 |
| Single_RoIMix | 80.3 | 80.8 | 87.0 | 79.5 | 72.3 | 69.2 | 87.3 | 88.5 | 87.9 | 64.1 | 86.0 | 74.2 | 88.7 | 87.3 | 84.9 | 83.0 | 54.7 | 84.5 | 79.0 | 86.4 | 80.1 |

Table 2: Detection results on the VOC 2007 test set, trained on VOC 2007 trainval + VOC 2012 trainval.

4 Experiment

4.1 Experiments on URPC 2018

We comprehensively evaluate our method on URPC 2018. This dataset consists of 2901 trainval images and 800 test images over 4 object categories: holothurian, echinus, scallop and starfish. We choose ResNet-101 pretrained on ImageNet as the backbone network, and 128 RoI features are extracted from each input image. We use the default hyper-parameters of Faster R-CNN. Mean Average Precision (mAP) is adopted as the evaluation metric. In our experiments on URPC 2018, we set the Beta-distribution hyper-parameter $\alpha$ empirically.
The ablation studies are shown in Table 1. First, we generate the mixing ratio directly from Eq. (2) without applying Eq. (3). The last two rows in Table 1 show that the max operation brings 2.06% and 1.8% mAP gains, which illustrates the importance of Eq. (3). Second, we compare the effects of mixing Ground Truths (GTs) and mixing RoIs. The second to fifth rows in Table 1 show that mixing RoIs contributes more to the performance improvement than mixing GTs. Furthermore, we evaluate the importance of interactions among images: “Single_RoIMix” chooses and mixes proposals within a single image, while our proposed method combines proposals from multiple images in the mini-batch. The second and fifth rows in Table 1 show that mixing RoIs among multiple images achieves 0.41% higher mAP than mixing within a single image. Overall, the results show that all of these variations lead to performance degradation compared to our proposed method.

Figure 4 visualizes the detection results of the baseline and our proposed method. There are three red boxes marked in Figure 4(b): two contain vague, overlapping holothurians, and the other contains an incomplete scallop. The baseline model fails to detect the objects in the three red boxes, while our method succeeds. This illustrates that our method has a better capability of detecting blurred and overlapping objects.

(a) baseline (b) RoIMix
Figure 4: Comparison of detection results between baseline and our proposed method.

4.2 Experiments on Pascal VOC

We also evaluate the effectiveness of our proposed method RoIMix on the generic object detection dataset Pascal VOC (07+12): the model is trained on the union of VOC 2007 trainval and VOC 2012 trainval and tested on the VOC 2007 test set. We use the same settings as in Section 4.1. In the experiments on Pascal VOC, we set the hyper-parameter $\alpha$ empirically.

To our knowledge, this is the first experimental report of mixed-sample data augmentation for object detection. We first compare our method with the Faster R-CNN baseline, and then evaluate the performance of RoIMix variants. Table 2 shows that our proposed method achieves 0.8% higher mAP than the baseline and outperforms its variants. We observe that RoIMix’s performance gain on Pascal VOC is smaller than on URPC. One possible reason is that URPC contains many more overlapping, occluded and blurred objects, which is exactly the situation our method addresses; thus the performance gain is larger on the URPC dataset.

4.3 Stability and Robustness

We analyze the effect of RoIMix on stabilizing the training of object detectors. We compare mean Average Precision (mAP) during the training with RoIMix against the baseline. We visualize the results on both Pascal VOC and URPC datasets in Figure 5.

First, we observe that RoIMix achieves much higher mAP than the baseline at the end of training on both datasets. After mAP reaches its peak, the baseline begins to over-fit as training epochs increase. In contrast, RoIMix degrades only gradually on Pascal VOC and keeps its mAP curve above the baseline by a large margin. On the URPC dataset, RoIMix remains stable as epochs increase after reaching its peak mAP. Furthermore, the maximum margin between our proposed method and the baseline reaches 2.04%. This shows that the diverse vicinal training data generated by RoIMix can alleviate over-fitting and improve the stability of the training process.

(a) Detection Result of VOC (b) Detection Result of URPC
Figure 5: Analysis for stability of baseline and RoIMix.

Furthermore, we evaluate the robustness of the trained model by applying 5 types of artificial noise to the test samples: Gaussian noise, Poisson noise, salt noise, pepper noise, and salt-and-pepper noise. Figure 6(a) displays a sample with pepper noise. We use ImageNet-pretrained ResNet-101 models with the same settings as in Section 4.1. We evaluate the baseline, GTMix, and RoIMix on each type of noisy sample and visualize the results in Figure 6(b). The maximum performance gap between our proposed method and the baseline across these 5 types of noise is 9.05% mAP. The histogram shows that our proposed method is more robust against noise perturbations.

(a) Samples with Pepper Noise (b) Analysis for different noises
Figure 6: Robustness experiments on the URPC 2018.

Apart from artificial noise, we additionally explore blurred inputs by applying Gaussian blur to the test images. As shown in Table 3, performance improves by 0.7% mAP after adopting the RoIMix method. These experiments further illustrate that RoIMix yields better robustness.
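A blurred test set of this kind can be produced with a separable Gaussian filter. The sketch below is a minimal dependency-free version (the function names and the default `sigma` are assumptions, not the paper's evaluation code); in practice a library routine such as OpenCV's `GaussianBlur` would be used.

```python
import numpy as np

def _blur_1d(x, sigma):
    # 1-D Gaussian kernel truncated at 3 sigma, normalized to sum to 1
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-t**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    return np.convolve(x, kernel, mode="same")

def gaussian_blur(img, sigma=1.5):
    # separable Gaussian blur: filter columns, then rows
    out = np.apply_along_axis(_blur_1d, 0, img.astype(float), sigma)
    return np.apply_along_axis(_blur_1d, 1, out, sigma)
```

Separability keeps the cost linear in the kernel radius per axis instead of quadratic, which matters when blurring a whole test set.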

| Method | mAP | Delta | holothurian | echinus | scallop | starfish |
|---|---|---|---|---|---|---|
| Baseline | 70.47 | 0 | 64.36 | 86.54 | 48.83 | 82.16 |
| GTMix | 70.70 | +0.23 | 65.92 | 86.13 | 48.09 | 82.66 |
| Proposed | 71.17 | +0.70 | 66.44 | 86.17 | 48.72 | 83.35 |

Table 3: Detection results on artificial Gaussian-blur samples. We apply the baseline, GTMix and RoIMix methods to these blurred samples. Delta represents the performance gain with respect to the baseline.

5 Conclusion

In this paper, we propose RoIMix for underwater object detection. To the best of our knowledge, this is the first work to conduct proposal-level fusion among multiple images to generate diverse training samples. RoIMix simulates overlapping, occluded and blurred objects, helping the model implicitly learn the capability of detecting underwater creatures. Experiments show that our proposed method improves performance on URPC by 1.18% mAP and on Pascal VOC by 0.8% mAP. Besides, RoIMix exhibits better stability and robustness. RoIMix was used in our first-prize solution for URPC 2019 offline target recognition.


  • [1] H. S. Baird (1992) Document image defect models. In Structured Document Image Analysis, pp. 546–556. Cited by: §2.1.
  • [2] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6154–6162. Cited by: §1, §2.2.
  • [3] O. Chapelle, J. Weston, L. Bottou, and V. Vapnik (2001) Vicinal risk minimization. In Advances in neural information processing systems, pp. 416–422. Cited by: §1.
  • [4] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §2.2.
  • [5] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §2.1.
  • [6] D. Cireşan, U. Meier, and J. Schmidhuber (2012) Multi-column deep neural networks for image classification. arXiv preprint arXiv:1202.2745. Cited by: §2.1.
  • [7] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1, §2.2.
  • [8] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with Cutout. arXiv preprint arXiv:1708.04552. Cited by: §1, §2.1.
  • [9] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1.
  • [10] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Cited by: §2.1, §2.2.
  • [11] W. Guan, Y. Zou, and X. Zhou (2018) Multi-scale object detection with feature fusion and region objectness network. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2596–2600. Cited by: §2.2.
  • [12] H. Guo, Y. Mao, and R. Zhang (2019) Mixup as locally linear out-of-manifold regularization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3714–3722. Cited by: §1.
  • [13] C. He, S. Lai, and K. Lam (2019) Improving object detection with relation graph inference. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2537–2541. Cited by: §2.2.
  • [14] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1.
  • [15] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2.
  • [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1, §2.1, §2.2.
  • [17] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2.2.
  • [18] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1.
  • [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.1.
  • [20] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830. Cited by: §1.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §1, §2.2.
  • [22] I. Sato, H. Nishimura, and K. Yokoi (2015) Apac: augmented pattern classification with neural networks. arXiv preprint arXiv:1505.03229. Cited by: §2.1.
  • [23] P. Y. Simard, D. Steinkraus, J. C. Platt, et al. (2003) Best practices for convolutional neural networks applied to visual document analysis.. In Icdar, Vol. 3. Cited by: §2.1.
  • [24] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §2.2.
  • [25] C. Summers and M. J. Dinneen (2019) Improved mixed-example data augmentation. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1262–1270. Cited by: §1.
  • [26] V. Vapnik (1992) Principles of risk minimization for learning theory. In Advances in neural information processing systems, pp. 831–838. Cited by: §1.
  • [27] L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using DropConnect. In International Conference on Machine Learning, pp. 1058–1066. Cited by: §2.1.
  • [28] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2965–2974. Cited by: §2.2.
  • [29] S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo (2019) Cutmix: regularization strategy to train strong classifiers with localizable features. arXiv preprint arXiv:1905.04899. Cited by: §1, §2.1.
  • [30] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz (2017) Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412. Cited by: §1, §2.1.