GLSD: The Global Large-Scale Ship Database and Baseline Evaluations

06/05/2021 ∙ by Zhenfeng Shao, et al. ∙ Central South University, University of Arkansas, Wuhan University

In this paper, we introduce a challenging global large-scale ship database (called GLSD), designed specifically for ship detection tasks. The designed GLSD database includes a total of 140,616 annotated instances from 100,729 images. Based on the collected images, we propose 13 categories that widely exist in international routes. These categories include sailing boat, fishing boat, passenger ship, war ship, general cargo ship, container ship, bulk cargo carrier, barge, ore carrier, speed boat, canoe, oil carrier, and tug. The motivations for developing GLSD include the following: 1) providing a refined ship detection database; 2) providing worldwide ship detection researchers with exhaustive label information (bounding boxes and ship class labels) in one uniform global database; and 3) providing a large-scale ship database with geographic information (port and country information) that benefits multi-modal analysis. In addition, we discuss evaluation protocols suited to the image characteristics in GLSD and analyze the performance of selected state-of-the-art object detection algorithms on GLSD, providing baselines for future studies. More information regarding the designed GLSD can be found at https://github.com/jiaming-wang/GLSD.


I Introduction

Object detection has been an important computer vision task for over 20 years [53]. In recent years, with the growing demand for public security, the detection of ships has become an important task in both military and civilian fields [37], including sea control, combating illegal smuggling, and autonomous navigation. The rapid development of artificial intelligence has also pushed autonomous ship detection into the spotlight. Ship detection is of great importance: sea routes are the lifeblood of the global economy, as the international shipping industry is responsible for the carriage of around 90% of world trade (https://www.ics-shipping.org/shipping-facts/shipping-and-world-trade). Nevertheless, manual inspection for identifying abnormal behaviors is a time-consuming and laborious process. In addition, although automatic object detection methods have achieved great performance, they are still far from mature, as challenges remain when these algorithms are applied in real-world ship detection scenarios.

Inspired by the immense success of machine learning approaches in computer vision tasks [24, 26, 25, 45], deep learning-based methods have become the mainstream in addressing object detection problems [53]. However, the performance of deep learning algorithms, given their big-data-driven nature [48], largely depends on the number of high-quality training samples. The first large-scale dataset, ImageNet [8], has been widely adopted in object detection studies. Following ImageNet, Lin et al. [22] presented the Microsoft Common Objects in Context (MS COCO) dataset with instance-level segmentation masks. In real-world scenarios, ships of different categories play considerably different roles in sea transportation. In these publicly available datasets, however, ships are commonly generalized as “ship” or “boat” (for example, in VOC2007 [10], CIFAR-10 [19], Caltech-256 [15], and COCO [22]). Although ImageNet [8] includes six types of ships, i.e., “fireboat”, “lifeboat”, “speedboat”, “submarine”, “pirate”, and “container ship”, most of them are unusual seagoing vessels. Thus, we argue that object models learned from the aforementioned coarse-grained datasets are not suitable for ship identification and the corresponding applications in real-world scenarios, and that a new large-scale ship database needs to be developed.

Dataset | Main categories | Instances | Images | Image size | Boxes/img
VOC2007 [10] | 1 | 791 | 363 | random | 2.18
CIFAR-10 [19] | 1 | 6,000 | 6,000 | 32×32 | 1.00
Caltech-256 [15] | 4 | 418 | 418 | random | 1.00
ImageNet [8] | 1 | 613 | 525 | random | 1.17
COCO [22] | 1 | 10,759 | 3,025 | random | 3.56
SeaShips [37] | 6 | 40,077 | 31,455 | 1,920×1,080 | 1.37
GLSD | 13 | 140,616 | 100,729 | random | 1.40
TABLE I: Comparison of ship annotations between GLSD and existing object detection datasets.

Recently, efforts have been made to construct ship datasets. For instance, Shao et al. [37] developed a ship dataset, SeaShips, that consists of 31,455 high-quality images (1,920×1,080 pixels) and covers six common ship types (“passenger ship”, “fishing boat”, “container ship”, “general cargo ship”, “bulk cargo carrier”, and “ore carrier”). Although the SeaShips dataset considers various scales, viewpoints, backgrounds, illumination, and diverse occlusion conditions, its images were derived from cameras that only cover the Zhuhai Hengqin New Area, China, resulting in considerably simple scenes and an unbalanced category distribution (a large number of fishing boats but limited numbers of other types). Recently, Zheng et al. [52] presented a new multi-category ship dataset, namely McShips, aimed at the fine-grained categorization of both civilian ships and warships. In McShips, warships are divided into six sub-categories. However, the McShips dataset is relatively small in size, and the ratio of civilian ships to warships is roughly 1:1, which does not reflect the real-world situation in which civilian ships far outnumber warships. As McShips is not yet publicly available, we do not discuss it further in this study.

In this study, we present a novel ship dataset, called the Global Large-Scale Ship Database (GLSD), that consists of 100,729 images and covers 13 ship types. Considering that the routes of ships are well established, we collect Internet images and monitoring data according to the routes, with port and country information. The developed GLSD covers more than 3,000 ports around the world (more details in Section III-A). Compared with SeaShips, we add the following categories: “sailing boat”, “war ship”, “barge”, “speed boat”, “canoe”, “oil carrier”, and “tug”. Labels and bounding boxes of GLSD are manually constructed in an accurate manner using an image annotation tool [35]. We name the route-based version of GLSD “GLSD_port” (GLSD with geographic information; more details about GLSD_port at https://github.com/jiaming-wang/GLSD/blob/master/Ports_list.md). We believe that GLSD_port provides training models with multi-modal information, greatly benefiting ship detection tasks.

Detailed comparisons of GLSD with existing databases are shown in Table I. Early databases, like CIFAR-10 [19], have very low image resolutions (32×32 pixels), which are not suitable for object detection tasks. The maximum number of boxes per image in PASCAL VOC2007 [10], Caltech-256 [15], and ImageNet [8] is rather small. Although COCO [22] includes a great quantity of “boat” images, its “boat” category has not been sub-categorized. Compared with SeaShips, the proposed GLSD contains more categories (13 vs 6). In terms of image quantity, GLSD contains more than three times as many images as SeaShips. In addition, GLSD has a higher number of boxes per image than SeaShips (1.40 vs 1.37). Table I indicates that GLSD is a more challenging ship database that potentially benefits the training of robust ship detection models.

The main contributions of this work are summarized as follows:

  1. To our best knowledge, the developed GLSD is a very challenging global ship dataset with the largest number of annotations, covering over one hundred thousand ship images, potentially facilitating the development and evaluation of existing object detection algorithms that target ship detection.

  2. GLSD is built on global routes, providing multi-modal information that better serves ship detection tasks. We plan to maintain GLSD in a regular manner when new images are available.

  3. We evaluate state-of-the-art object detection algorithms on the proposed GLSD, setting up a baseline for future algorithm development.

This paper is organized as follows. Section II reviews related work on object detection datasets and algorithms. Sections III and IV illustrate the collection and design of the GLSD database. Section V details the experiments and analysis. Section VI concludes this study.

II Related Work

In this section, we outline the development of object detection datasets and methods, providing references for future studies.

II-A Object Detection Datasets

In the early days, several small-scale, well-labeled datasets (e.g., Caltech-101/256 [11, 15], MSRC [38], PASCAL [9], and CIFAR-10 [19]) were widely used as benchmarks in computer vision tasks. These databases, however, offer a limited number of categories with lower-resolution images [8].

It is widely acknowledged that the development of deep learning is inseparable from the support of big data: in general, the more high-quality data, the better the performance of deep learning-based algorithms. For the first time, Deng et al. [8] built a dataset with worldwide targets organized through a tree structure, simultaneously pushing the object classification and detection fields toward more complex problems. The dataset proposed by Deng et al. [8] now contains roughly 22,000 categories with 14 million images. Later on, pre-training backbone networks [39, 17] on ImageNet gradually became standard practice in computer vision tasks. From earlier datasets, like COCO [22], to recent benchmarks, like Objects365 [36], large-scale datasets have always been preferred choices and play important roles in improving classification and detection performance.

Besides the above general object detection datasets, many datasets exist for specific scenarios, e.g., masked face recognition during the novel coronavirus disease 2019 (COVID-19) pandemic (RMFD [46]), music information retrieval (GZTAN [41] and MSD [2]), automated detection and classification of fish (Labeled Fishes in the Wild [6]), autonomous driving (JAAD [30] and LISA [40]), and ship action recognition [45]. These datasets have greatly facilitated the development of the corresponding tasks. In a recent effort, Shao et al. [37] constructed the first large-scale dataset for ship detection, SeaShips. Due to the fixed viewpoints of the deployed video monitoring system in the Zhuhai Hengqin New Area, however, the background information in SeaShips lacks diversity. Another effort by Zheng et al. [52] presented a new multi-category ship dataset (McShips), which, however, has an unrealistic ratio among ship categories.

II-B Object Detection Algorithms

An object detection model generally consists of two main components: a backbone pre-trained on a large image dataset (e.g., ImageNet [8]) for feature extraction, and a head used to predict labels. Common backbone networks include VGG [39], ResNet [17], DenseNet [18], and ResNeXt [49]. Existing head components can be divided into two categories: traditional methods and deep learning methods [53]. In the early stages, most object detection methods adopted hand-crafted image features to achieve real-time object detection [43, 42]. The histogram of oriented gradients (HOG) detector plays a very important role in this line of work. Felzenszwalb et al. [12] proposed the deformable part-based model (DPM), which can be viewed as an extension of the HOG detector; DPM [12] gradually became a mainstay of pedestrian detection pipelines [51].

According to the network structure, deep learning-based object detection methods can be grouped into two genres, two-stage and one-stage detection [53], where two-stage detectors have been the dominant paradigm in object detection tasks. Girshick et al. [13] proposed regions with convolutional neural network (CNN) features for object detection, establishing a brand new venue for the development of two-stage detection algorithms. To reduce computational complexity, SPP-Net [16] largely reduced the computing cost through a spatial pyramid pooling layer. Inspired by SPP-Net, Girshick [14] further utilized a more efficient region of interest (RoI) pooling to reduce unnecessary computing overhead. In 2015, Ren et al. [34] first proposed a framework that introduces a region proposal network to obtain bounding boxes with low complexity.

Different from the above two-stage algorithms, YOLO [31] transforms detection and classification into an end-to-end regression model, sacrificing localization accuracy for a boost in detection speed. In the following development of the YOLO series [32, 33], its subsequent versions inherit its advantages while gradually improving detection accuracy. Liu et al. [23] proposed a multi-reference, multi-resolution framework (SSD) that significantly enhances accuracy. Further, Lin et al. [21] introduced the focal loss to mitigate the accuracy drop caused by the imbalance between foreground and background classes in one-stage detection methods.
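To make the role of the focal loss concrete, below is a minimal PyTorch sketch of the binary (sigmoid) focal loss from [21], FL(p_t) = -α_t (1 - p_t)^γ log(p_t); this is an illustrative re-implementation with our own function name, not code taken from RetinaNet or GFL.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: down-weights easy (well-classified) examples so
    training focuses on the rare foreground class.

    logits: raw scores, shape (N,); targets: float labels in {0, 1}, shape (N,).
    """
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
```

With gamma = 0 and alpha = 0.5, this reduces (up to a constant factor) to standard cross entropy; a larger gamma suppresses the loss from easy negatives, which is what makes one-stage detectors viable despite the foreground-background imbalance.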

Category | Definition
Ore carrier | Ships with small stowage factors
Bulk cargo carrier | Ships with large stowage factors
General cargo ship | Ships transporting machinery, equipment, building materials, daily necessities, etc.
Container ship | Ships that specialize in the transport of containerized goods
Fishing boat | Ships for catching and harvesting aquatic animals and plants
Passenger ship | Vessels used to transport passengers or pedestrians
Sailing boat | Boats propelled partly or entirely by sails
Barge | Ships for cargo transportation between inland ports
War ship | Naval ships built and primarily intended for naval warfare
Oil carrier | Ships designed for the bulk transport of oil or its products
Tug | Ships that maneuver other vessels by pushing or pulling them, either by direct contact or by means of a tow line
Canoe | Lightweight narrow vessels, typically pointed at both ends and open on top
Speed boat | Small boats with a powerful engine that travel very fast
TABLE II: The definitions of the main categories in GLSD.

III Image Collection

In this section, we present the details on the collection, main categories, and characteristics of the GLSD.

III-A Port-Based Image Collection

Referring to the United Nations Code for Trade and Transport Locations (UN/LOCODE, http://www.unece.org/cefact/locode/welcome.html), global ports are divided into 33 routes, i.e., “east of South America”, “Pacific island”, “West Mediterranean”, “Middle East”, “Caribbean”, “West Africa”, “Australia”, “India-Pakistan”, “European basic port”, “European inland port”, “East Mediterranean”, “Black Sea”, “Southeast Asia”, “Canada”, “west of South America”, “China”, “Taiwan-China”, “East Africa”, “North Africa”, “Red Sea”, “partial port of Japan”, “Adriatic Sea”, “Kansai”, “Kanto”, “Korea”, “Mexico”, “South Africa”, “New Zealand”, “west of America”, “Russia Far East”, “American inland port”, “east of America”, and “Central Asia”.

To ensure diverse image sources, we collect images from these ports (the ports involved in the dataset can be found at https://github.com/jiaming-wang/GLSD/blob/master/Ports_list.md). Part of the images in GLSD are captured from a deployed video monitoring system in the Zhuhai Hengqin New Area, China, and the rest are collected via search engines at multiple resolutions. As images from certain routes are unavailable, GLSD mainly covers ship images captured in China and Europe.

Fig. 1: The distribution of the ports covered by GLSD.

III-B Iconic and Non-Iconic Image Collection

Images in GLSD can be roughly divided into two categories: iconic images [1] and non-iconic images [22]. Iconic images, which clearly depict a single object category, provide high-quality object instances (see examples in Fig. 2(a)). They are widely used in object classification and retrieval tasks and can be directly retrieved via image search engines. Most images in SeaShips are iconic. Non-iconic images, which provide contextual information and non-canonical viewpoints, also play an important role in object detection tasks (see examples in Fig. 2(b)). In the proposed GLSD, we keep both iconic and non-iconic images, providing the image diversity that benefits object detection model training. Compared with SeaShips, which contains mostly iconic images, GLSD is considerably more challenging.

Fig. 2: Selected examples of (a) iconic object images and (b) non-iconic object images.

III-C Main Categories

From the collected images, we construct 13 categories that widely exist in international routes: “sailing boat”, “fishing boat”, “passenger ship”, “war ship”, “general cargo ship”, “container ship”, “bulk cargo carrier”, “barge”, “ore carrier”, “speed boat”, “canoe”, “oil carrier”, and “tug”. We refer to Wikipedia (https://www.wikipedia.org) and the Cambridge International Dictionary of English to define the main categories involved in GLSD, as shown in Table II. In addition, Table III lists the number of instances corresponding to each ship category in GLSD. GLSD consists of a large number of ships with oceangoing ability, such as “general cargo ship”, “container ship”, “fishing boat”, and “passenger ship”. However, “tug”, “canoe”, “sailing boat”, “speed boat”, and “barge” are usually not capable of long-distance travel, leading to their limited sample sizes in our dataset. Therefore, despite the coverage of more ship categories with large sample sizes, a class imbalance problem still exists in the proposed GLSD.

Categories Instances
Ore carrier 2,066
Bulk cargo carrier 10,751
General cargo ship 19,868
Container ship 21,884
Fishing boat 19,754
Passenger ship 41,451
Sailing boat 6,714
Barge 3,940
War ship 10,613
Oil carrier 1,491
Tug 711
Canoe 222
Speed boat 1,151
Total 140,616
TABLE III: The number of ships corresponding to each category in GLSD.
Fig. 3: Samples of annotated images in the GLSD dataset, covering the 13 proposed categories: “sailing boat”, “fishing boat”, “passenger ship”, “war ship”, “general cargo ship”, “container ship”, “bulk cargo carrier”, “barge”, “ore carrier”, “speed boat”, “canoe”, “oil carrier”, and “tug”.

III-D Statistics

Fig. 4 shows the distribution of image resolutions in GLSD. Different from SeaShips, which is mainly based on images retrieved from a monitoring system, GLSD also includes high-resolution unmanned aerial vehicle and remote sensing images. Considering that the performance of existing deep learning-based algorithms is usually limited in detecting small targets, we include a large number of images with small targets (smaller than 32×32 pixels) and medium targets (between 32×32 and 96×96 pixels); the definitions of small and medium targets follow [22]. Image resolutions in the dataset span a wide range (see Fig. 4). Notably, GLSD contains more diverse images, with more varied resolutions and target sizes, than SeaShips.

Fig. 4: The distribution of image sizes in GLSD.
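For readers reproducing the size statistics above, the following small Python sketch assigns a COCO-style size bucket [22] to a bounding box; the function name is ours, and box dimensions are assumed to be in pixels.

```python
def size_bucket(box_w, box_h):
    """Return the COCO-style size bucket of a box given its pixel width/height:
    small (area < 32*32), medium (32*32 to 96*96), or large (> 96*96)."""
    area = box_w * box_h
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```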

III-E Annotation

In this section, we describe how we annotate images in GLSD. Different from regular objects, ships can contain, besides the hull, other features such as masts, elevating equipment, and oars. During the annotation process, all object instances are labeled with class names and bounding boxes that cover the entire ship, including these features.

All images are normalized to the “.jpg” format using MATLAB. We annotate images in GLSD with labelme [44], following the PASCAL VOC2007 format. Images in the dataset are named from “000001.jpg” to “100729.jpg”, with corresponding label files named from “000001.xml” to “100729.xml” (we also provide an MS COCO [22] version of GLSD). All annotation is done manually. Objects that are too dense or too small for the human eye to recognize are ignored. Fig. 3 presents selected samples in GLSD. GLSD contains 50,365 training, 25,182 validation, and 25,182 testing images (an approximately 50%/25%/25% train/val/test split, as in [22]). More details are described in the following sections.
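Because the labels follow the PASCAL VOC2007 format, a GLSD label file can be read with a few lines of standard Python. The sketch below assumes the usual VOC tag names (object, name, bndbox, xmin/ymin/xmax/ymax); the example path is hypothetical.

```python
import xml.etree.ElementTree as ET

def load_voc_annotation(xml_path):
    """Parse one PASCAL VOC-style label file into a list of
    (class_name, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        bb = obj.find("bndbox")
        xyxy = [int(float(bb.find(k).text)) for k in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append((name, *xyxy))
    return boxes

# Hypothetical usage: boxes = load_voc_annotation("Annotations/000001.xml")
```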

IV Design of the GLSD Dataset

Different from SeaShips, which is derived from a site monitoring system, images in GLSD, collected from the Internet via search engines, are generally more complex. Eight variations, i.e., viewpoint, state, noise, background, scale, mosaic, style, and weather, are considered in constructing GLSD. Selected examples corresponding to these variations are shown in Fig. 5.

Fig. 5: Selected images corresponding to different variations considered in GLSD, i.e., viewpoint, state, noise, background, scale, mosaic, style, and weather variations.

IV-A Viewpoint Variation

Images from different viewpoints have varying characteristics. Multi-viewpoint images have been shown to help deep learning models cope with the complex changes in real-world scenarios. Compared to SeaShips, which is based on surveillance cameras with limited viewpoints, the designed GLSD contains considerably more viewpoints, as shown in Fig. 5(a), potentially leading to increased model robustness.

IV-B State Variation

SeaShips only focuses on underway ships, ignoring states under abnormal events, such as shipping disasters (e.g., being on fire), being towed by a tug, and interactions between barges and large vessels. A dataset with images under different states is beneficial for monitoring abnormal events during shipping. Fig. 5(b) shows ship images in the designed GLSD under a unique on-fire state: two fishing boats with only skeletons left after burning.

IV-C Noise Variation

Numerous studies have shown that the accuracy of target detection is often higher on cleaner images. However, noise is unavoidable in many cases, so including noisy images helps further improve the robustness of algorithms. Certain images collected from the Internet in GLSD contain watermarks, as shown in Fig. 5(a), (b), and (c).

IV-D Background Variation

Theoretically, the background in a non-iconic image can provide rich contextual information. Therefore, a dataset with a diversity of backgrounds is preferred in many object recognition tasks. In real-world applications, the backgrounds of captured images tend to vary; thus, it is necessary to cover richer scenarios by collecting images from ports worldwide. Fig. 5(d) presents images with two distinctively different backgrounds: one with a tropical character featuring canoes, the other a modern port featuring large-scale ships.

IV-E Scale Variation

Deadweight tonnage (DWT) is an indication of the size of a ship as well as its transporting capacity, and it varies greatly across ships. For example, the maximum gross payload of an oil carrier can reach 500,000 DWT, while some old cargo ships carry only about 5,000 DWT. Even for ships of the same class, the scale can vary considerably. An oil carrier can occupy ten times as many pixels as a tug, as illustrated in Fig. 5(e). Our designed GLSD contains ships with rich variations in scale, increasing algorithms’ capability in detecting both large and small objects.

IV-F Mosaic Variation

YOLOv4 [3] introduces a data augmentation method named mosaicking, which mixes four different images into one. Fig. 5(f) shows examples of mosaicked images, each composed of three iconic images combined under different placement strategies. Our GLSD contains a variety of mosaicked images, greatly enriching the background information of the ships to be detected.
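To illustrate the augmentation, here is a minimal NumPy sketch of mosaicking in the spirit of YOLOv4 [3]; the output size, the naive crop-without-resize strategy, and all names are our own simplifying assumptions, not the reference implementation.

```python
import random
import numpy as np

def mosaic(images, boxes_per_image, out_size=640):
    """Paste four images around a random center point and shift/clip their boxes.

    images: four HxWx3 uint8 arrays; boxes_per_image: four (N_i, 4) float arrays
    of [xmin, ymin, xmax, ymax] in each image's own coordinate system.
    """
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    cx = random.randint(out_size // 4, 3 * out_size // 4)   # mosaic center x
    cy = random.randint(out_size // 4, 3 * out_size // 4)   # mosaic center y
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),                 # TL, TR
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]   # BL, BR
    merged = []
    for img, boxes, (x1, y1, x2, y2) in zip(images, boxes_per_image, regions):
        crop = img[: y2 - y1, : x2 - x1]   # naive crop; real versions resize first
        canvas[y1 : y1 + crop.shape[0], x1 : x1 + crop.shape[1]] = crop
        if len(boxes):
            b = boxes + np.array([x1, y1, x1, y1], dtype=boxes.dtype)
            # Clip shifted boxes to the pasted region, drop degenerate ones.
            b[:, [0, 2]] = b[:, [0, 2]].clip(x1, x1 + crop.shape[1])
            b[:, [1, 3]] = b[:, [1, 3]].clip(y1, y1 + crop.shape[0])
            merged.append(b[(b[:, 2] > b[:, 0]) & (b[:, 3] > b[:, 1])])
    return canvas, (np.concatenate(merged) if merged else np.zeros((0, 4)))
```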

Method | Backbone | Schedule | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L | AR_1 | AR_10 | AR_100 | AR_S | AR_M | AR_L
Faster R-CNN [34] | ResNet-50 | – | 19.8 | 33.9 | 20.5 | 3.1 | 9.9 | 22.1 | 35.1 | 42.8 | 42.9 | 14.3 | 30.2 | 46.2
MPT [27] | ResNet-50 | – | 19.6 | 33.6 | 20.2 | 3.3 | 9.7 | 22.0 | 35.1 | 42.7 | 42.8 | 15.0 | 29.8 | 46.2
RetinaNet [21] | ResNet-50 | – | 20.7 | 34.3 | 21.7 | 3.8 | 9.8 | 23.5 | 44.1 | 56.7 | 57.2 | 28.1 | 46.2 | 61.6
DCN [7] | ResNet-50 | – | 20.7 | 34.2 | 21.9 | 4.1 | 10.8 | 23.2 | 36.3 | 44.2 | 44.3 | 15.9 | 31.8 | 47.6
Cascade R-CNN [4] | ResNet-50 | – | 22.1 | 35.8 | 23.6 | 3.4 | 11.0 | 24.8 | 36.8 | 44.6 | 44.7 | 15.6 | 30.2 | 48.5
Libra R-CNN [28] | ResNet-50 | – | 21.4 | 35.8 | 22.6 | 3.4 | 10.5 | 24.1 | 36.6 | 44.3 | 44.3 | 14.9 | 30.5 | 48.3
Libra RetinaNet [28] | ResNet-50 | – | 21.2 | 35.1 | 22.3 | 4.3 | 10.4 | 23.8 | 43.7 | 57.1 | 57.6 | 28.3 | 47.5 | 62.0
Double Heads [47] | ResNet-50 | – | 21.5 | 36.5 | 22.6 | 3.9 | 10.9 | 24.2 | 37.4 | 45.4 | 45.6 | 15.5 | 31.0 | 49.7
GFL [20] | ResNet-50 | – | 21.4 | 34.4 | 22.7 | 4.0 | 10.6 | 24.1 | 44.6 | 59.6 | 60.5 | 31.2 | 47.1 | 65.3
RetinaNet [21] | ResNet-50 | – | 21.1 | 35.0 | 22.2 | 8.6 | 9.6 | 24.2 | 42.3 | 53.9 | 54.3 | 25.6 | 40.3 | 59.8
Cascade R-CNN [4] | ResNet-50 | – | 22.1 | 35.6 | 23.6 | 3.3 | 10.6 | 24.8 | 36.7 | 44.5 | 44.6 | 14.4 | 29.6 | 48.6
GFL [20] | ResNet-50 | – | 21.9 | 35.3 | 23.3 | 3.9 | 10.4 | 24.7 | 43.6 | 57.6 | 58.3 | 26.4 | 44.7 | 63.6
YOLOv3 [33] | Darknet-53 | 273 epochs | 12.4 | 26.9 | 9.2 | 1.9 | 5.6 | 14.2 | 28.8 | 36.8 | 37.0 | 17.5 | 28.0 | 39.5
TABLE IV: Object detection results (bounding box AP and AR) on the GLSD.

IV-G Style Variation

In addition to multi-viewpoint images, our designed GLSD includes images of various types: aerial images, remote sensing images, and portraits. Numerous efforts have been made toward style transfer as a data augmentation approach (e.g., domain adaptation for GTA5 and image style transfer on the COCO database). Our GLSD contains abundant image styles, from images captured via cameras to realistic paintings, as shown in Fig. 5(g).

IV-H Weather Variation

It is widely acknowledged that port operations are susceptible to extreme weather conditions, such as high winds, fog, heavy haze, snowstorms, thunderstorms, and typhoons, which greatly affect the arrival and departure of ships and the unloading of cargo in the port. At sea, the weather can change significantly within a relatively short time. Our GLSD includes a variety of weather conditions, benefiting models in ship recognition under different weather scenarios, as illustrated in Fig. 5(h).

In summary, the aforementioned variations make GLSD a rather challenging dataset for ship detection and recognition. The rich variations in GLSD, which effectively widen the within-class gap, are expected to help models reach higher robustness.

V Evaluation Results

V-A Baseline Algorithms

In this section, we conduct a comprehensive comparison of the following state-of-the-art object detection algorithms on GLSD:

  • one-stage: YOLOv3 [33], RetinaNet [21], Libra RetinaNet [28], and GFL [20].

  • two-stage: Faster R-CNN [34], MPT [27], Cascade R-CNN [4], DCN [7], Double Heads [47], and Libra R-CNN [28].

V-B Implementation Details

These experiments run on a desktop based on MMDetection [5] (a popular open-source object detection toolbox developed by OpenMMLab, https://github.com/open-mmlab/mmdetection) with three NVIDIA GTX TITAN GPUs and a 3.60 GHz Intel Core i7-7820X CPU with 32 GB memory. We implement these methods using the PyTorch 1.5.0 [29] library with Python 3.7.9 under Ubuntu 18.04, CUDA 10.2, and cuDNN 7.6.
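For reference, running one of these detectors with the MMDetection (v2.x) Python API looks like the sketch below; the config, checkpoint, and image paths are placeholders rather than files distributed with GLSD.

```python
from mmdet.apis import inference_detector, init_detector

# Placeholder paths: any MMDetection config/checkpoint pair trained on GLSD.
config = "configs/retinanet/retinanet_r50_fpn_1x_coco.py"
checkpoint = "checkpoints/retinanet_r50_fpn_glsd.pth"

model = init_detector(config, checkpoint, device="cuda:0")
result = inference_detector(model, "demo/ship.jpg")  # hypothetical test image
# `result` is a per-class list of (N, 5) arrays: [xmin, ymin, xmax, ymax, score].
```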

For evaluation, we employ average precision (AP, AP_50, AP_75, AP_S, AP_M, and AP_L) and average recall (AR_1, AR_10, AR_100, AR_S, AR_M, and AR_L), as in [50]. Among them, AP, AP_S, AP_M, AP_L, and all average recall metrics are calculated over intersection-over-union (IoU) thresholds of [0.50:0.05:0.95]; for AP_50 and AP_75, the corresponding IoU thresholds are 0.5 and 0.75, respectively. Moreover, AP_S, AP_M, and AP_L represent the average precision at different scales (small: targets smaller than 32×32 pixels; medium: targets between 32×32 and 96×96 pixels; large: targets larger than 96×96 pixels [22]), and AR_1, AR_10, and AR_100 denote the average recall under different maximum numbers of detections per image.
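Since we also provide an MS COCO version of the labels, these metrics can be reproduced with the standard pycocotools pipeline; the ground-truth and result file names below are hypothetical, and the detections are assumed to be in the standard COCO results JSON format.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

gt = COCO("glsd_test_coco.json")                # hypothetical GT annotation file
dt = gt.loadRes("detector_results_glsd.json")   # hypothetical detection results
ev = COCOeval(gt, dt, iouType="bbox")
ev.evaluate()
ev.accumulate()
ev.summarize()   # prints AP, AP_50, AP_75, AP_S/M/L and AR_1/10/100, AR_S/M/L
```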

V-C Results and Analysis

In Table IV, we report the performance of all selected models on GLSD. In scenes that contain small targets, the APs of the selected algorithms without the focal loss are smaller than 5%. Even in scenes with medium targets, the APs of all selected algorithms are only about 11%, indicating that the recognition of small targets is still one of the major challenges in our designed GLSD. Therefore, improving the detection performance of objects with varying scales, especially small targets, remains a challenge.

With the introduction of the focal loss, an effective approach to mitigating the issues of long-tailed distributions, RetinaNet [21] and GFL [20] show significant improvements compared to the two-stage algorithms. We notice that the ARs are larger than the APs for these algorithms, indicating the existence of false detections given the small inter-class gap. Moreover, with an increasing number of iterations, RetinaNet [21] achieves a notable performance gain (up to 4.8%) in small object detection. Therefore, dealing with the long-tailed distribution is key for models to reach satisfactory performance on GLSD.

VI Conclusion

In this paper, we introduce a global large-scale ship database, GLSD, designed for global ship detection tasks. To our best knowledge, the designed GLSD is considerably larger and more challenging than any existing ship detection database. The main characteristics of GLSD lie in three aspects: 1) GLSD contains a total of 140,616 annotated instances from 100,729 images across 13 categories with a widened inter-class gap, i.e., “sailing boat”, “fishing boat”, “passenger ship”, “war ship”, “general cargo ship”, “container ship”, “bulk cargo carrier”, “barge”, “ore carrier”, “speed boat”, “canoe”, “oil carrier”, and “tug”; 2) GLSD covers a diversity of variations, including viewpoint, state, noise, background, scale, mosaic, style, and weather, which is expected to help models reach higher robustness; 3) the route-based version of GLSD, GLSD_port, contains geographic information, providing rich multi-modal information that benefits various ship detection and recognition tasks. We also propose evaluation protocols and provide evaluation results on GLSD using numerous state-of-the-art object detection algorithms.

As ship images of certain categories are difficult to collect, the current version of GLSD has a notable long-tail issue. We will continue to extend GLSD with more ship images (the next update is expected to add 50,000 more ship images), especially for the categories “tug”, “canoe”, and “speed boat”.

Acknowledgement

We thank Lan Ye, Sihang Zhang, Linze Bai, Gui Cheng, and all the others who were involved in the annotations of GLSD. In addition, we thank the support of the Post-Doctoral Research Center of Zhuhai Da Hengqin Science and Technology Development Co., Ltd, Guangdong Hengqin New Area.

References

  • [1] T. L. Berg and A. C. Berg (2009) Finding iconic images. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pp. 1–8. Cited by: §III-B.
  • [2] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere (2011) The million song dataset. Cited by: §II-A.
  • [3] A. Bochkovskiy, C. Wang, and H. M. Liao (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934. Cited by: §IV-F.
  • [4] Z. Cai and N. Vasconcelos (2019) Cascade r-cnn: high quality object detection and instance segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. External Links: ISSN 1939-3539 Cited by: TABLE IV, 2nd item.
  • [5] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §V-B.
  • [6] G. Cutter, K. Stierhoff, and J. Zeng (2015) Automated detection of rockfish in unconstrained underwater videos using haar cascades and a new image dataset: labeled fishes in the wild. In IEEE Winter Applications and Computer Vision Workshops, pp. 57–62. Cited by: §II-A.
  • [7] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, Cited by: TABLE IV, 2nd item.
  • [8] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: TABLE I, §I, §I, §II-A, §II-A, §II-B.
  • [9] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International journal of computer vision 111 (1), pp. 98–136. Cited by: §II-A.
  • [10] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: TABLE I, §I, §I.
  • [11] L. Fei-Fei, R. Fergus, and P. Perona (2006) One-shot learning of object categories. IEEE transactions on pattern analysis and machine intelligence 28 (4), pp. 594–611. Cited by: §II-A.
  • [12] P. Felzenszwalb, D. McAllester, and D. Ramanan (2008) A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §II-B.
  • [13] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2015) Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence 38 (1), pp. 142–158. Cited by: §II-B, 2nd item.
  • [14] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §II-B.
  • [15] G. Griffin, A. Holub, and P. Perona (2007) Caltech-256 object category dataset. Cited by: TABLE I, §I, §I, §II-A.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of the European conference on computer vision, pp. 346–361. Cited by: §II-B.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §II-A, §II-B.
  • [18] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §II-B.
  • [19] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Cited by: TABLE I, §I, §I, §II-A.
  • [20] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. arXiv preprint arXiv:2006.04388. Cited by: TABLE IV, 1st item, §V-C.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §II-B, TABLE IV, 1st item, §V-C.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: TABLE I, §I, §I, §II-A, §III-B, §III-D, §III-E, §V-B.
  • [23] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the European conference on computer vision, pp. 21–37. Cited by: §II-B.
  • [24] T. Lu, J. Wang, J. Jiang, and Y. Zhang (2020) Global-local fusion network for face super-resolution. Neurocomputing. Cited by: §I.
  • [25] J. Ma, W. Yu, P. Liang, C. Li, and J. Jiang (2019) FusionGAN: a generative adversarial network for infrared and visible image fusion. Information Fusion 48, pp. 11–26. Cited by: §I.
  • [26] J. Ma, H. Zhang, P. Yi, and Z. Wang (2020) SCSCN: a separated channel-spatial convolution net with attention for single-view reconstruction. IEEE Transactions on Industrial Electronics. Cited by: §I.
  • [27] P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, et al. (2017) Mixed precision training. arXiv preprint arXiv:1710.03740. Cited by: TABLE IV, 2nd item.
  • [28] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 821–830. Cited by: TABLE IV, 1st item, 2nd item.
  • [29] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. In Advances in neural information processing systems, pp. 8026–8037. Cited by: §V-B.
  • [30] A. Rasouli, I. Kotseruba, and J. K. Tsotsos (2017) Are they going to cross? a benchmark dataset and baseline for pedestrian crosswalk behavior. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 206–213. Cited by: §II-A.
  • [31] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §II-B.
  • [32] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §II-B.
  • [33] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §II-B, TABLE IV, 1st item.
  • [34] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Proceedings of the neural information processing systems, pp. 91–99. Cited by: §II-B, TABLE IV.
  • [35] B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman (2008) LabelMe: a database and web-based tool for image annotation. International journal of computer vision 77 (1-3), pp. 157–173. Cited by: §I.
  • [36] S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun (2019) Objects365: a large-scale, high-quality dataset for object detection. In Proceedings of the IEEE international conference on computer vision, pp. 8430–8439. Cited by: §II-A.
  • [37] Z. Shao, W. Wu, Z. Wang, W. Du, and C. Li (2018) Seaships: a large-scale precisely annotated dataset for ship detection. IEEE Transactions on Multimedia 20 (10), pp. 2593–2604. Cited by: TABLE I, §I, §I, §II-A.
  • [38] J. Shotton, J. Winn, C. Rother, and A. Criminisi (2006) Textonboost: joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European conference on computer vision, pp. 1–15. Cited by: §II-A.
  • [39] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §II-A, §II-B.
  • [40] S. Sivaraman and M. M. Trivedi (2010) A general active-learning framework for on-road vehicle recognition and tracking. IEEE Transactions on Intelligent Transportation Systems 11 (2), pp. 267–276. Cited by: §II-A.
  • [41] G. Tzanetakis and P. Cook (2002) Musical genre classification of audio signals. IEEE Transactions on speech and audio processing 10 (5), pp. 293–302. Cited by: §II-A.
  • [42] P. Viola and M. J. Jones (2004) Robust real-time face detection. International Journal of Computer Vision 57 (2), pp. 137–154. Cited by: §II-B.
  • [43] P. Viola and M. Jones (2001) Rapid object detection using a boosted cascade of simple features. In Proceedings of the Computer Vision and Pattern Recognition, pp. 511–518. Cited by: §II-B.
  • [44] K. Wada (2016) labelme: Image Polygonal Annotation with Python. Note: https://github.com/wkentaro/labelme Cited by: §III-E.
  • [45] J. Wang, Z. Shao, X. Huang, T. Lu, R. Zhang, and X. Lv (2021) Spatial-temporal pooling for action recognition in videos. Neurocomputing. Cited by: §I, §II-A.
  • [46] Z. Wang, G. Wang, B. Huang, Z. Xiong, Q. Hong, H. Wu, P. Yi, K. Jiang, N. Wang, Y. Pei, et al. (2020) Masked face recognition dataset and application. arXiv preprint arXiv:2003.09093. Cited by: §II-A.
  • [47] Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li, and Y. Fu (2019) Rethinking classification and localization for object detection. External Links: 1904.06493 Cited by: TABLE IV, 2nd item.
  • [48] G. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang (2018) DOTA: a large-scale dataset for object detection in aerial images. In Proceedings of the Computer Vision and Pattern Recognition, pp. 3974–3983. Cited by: §I.
  • [49] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §II-B.
  • [50] R. Zhang, Z. Shao, X. Huang, J. Wang, and D. Li (2020) Object detection in uav images via global density fused convolutional network. Remote Sensing 12 (19), pp. 3140. Cited by: §V-B.
  • [51] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian (2015) Scalable person re-identification: a benchmark. In Proceedings of the IEEE international conference on computer vision, pp. 1116–1124. Cited by: §II-B.
  • [52] Y. Zheng and S. Zhang (2020) Mcships: a large-scale ship dataset for detection and fine-grained categorization in the wild. In International Conference on Multimedia and Expo, pp. 1–6. Cited by: §I, §II-A.
  • [53] Z. Zou, Z. Shi, Y. Guo, and J. Ye (2019) Object detection in 20 years: a survey. arXiv preprint arXiv:1905.05055. Cited by: §I, §I, §II-B, §II-B.