Object detection is a challenging task with widespread applications. With the advances of deep learning, object detection has made remarkable progress. Depending on whether proposals are generated by an independent learning stage or by directly and densely sampling possible locations, detectors can be classified as two-stage or one-stage models. Compared to two-stage detectors [cai2018cascade, fasterrcnn], one-stage methods [retina, SSD] are less complex and therefore run faster, at the cost of some precision. While most existing successful methods are based on the anchor mechanism, recent state-of-the-art methods mostly focus on anchor-free detection, e.g., CornerNet [CornerNet], FCOS [tian2019fcos], and FSAF [fsaf]. These CNN-based detection methods are powerful because they build low-level abstractions of the image, such as lines and circles, and then iteratively combine them into objects; however, this is also the reason that they struggle with detecting small objects.
Generally, the object detection algorithms mentioned above can achieve good performance as long as the features extracted from the backbone network are strong enough. Usually, a large, deep CNN backbone extracts multi-level features, which are then refined with a feature pyramid network (FPN). Most of the time, these detection models benefit from a deeper backbone, but a deeper backbone also introduces more computation cost and memory usage.
Detection performance is typically very sensitive to the input resolution. High-resolution images are more suitable for small object detection because they preserve more details and position information. But high resolution also introduces new problems: (i) it can easily damage the detection of large objects, as shown in Table 1; (ii) detection generally needs a deeper network for more powerful semantics, resulting in an unaffordable computing cost. It is essential to use high-resolution images for small object detection, and a deeper backbone for small-scale images. We therefore have to deal with the trade-offs between large and small object detection, as well as between high performance and low computational complexity.
To solve these problems, we propose a new architecture, the High-Resolution Detection Network (HRDNet). As shown in Figure 1, it includes two parts: the Multi-Depth Image Pyramid Network (MD-IPN) and the Multi-Scale Feature Pyramid Network (MS-FPN). The main idea of HRDNet is to use a deep backbone to process low-resolution images while using a shallow backbone to process high-resolution images. The advantage of extracting features from high-resolution images with a shallow, tiny network has been demonstrated in [doubleRtinyobj]. With HRDNet, we not only obtain more details of small objects at high resolution, but also guarantee efficiency and effectiveness by integrating multi-depth and multi-scale deep networks.
MD-IPN can be regarded as a variant of the image pyramid network with multiple streams, as shown in Figure 1. MD-IPN addresses the trade-offs between large and small object detection, as well as between high performance and low computational complexity. We extract features from the high-resolution image using a shallow backbone network. Because the shallow backbone has weak semantic representation power, we also need deep backbones, fed with low-resolution images, to obtain semantically strong features. Thus, the inputs of the MD-IPN form an image pyramid with a fixed decreasing ratio of . The output of MD-IPN is a series of multi-scale feature groups, and each group contains multi-level feature maps.
The multi-scale feature groups extend the standard feature pyramid by adding multi-scale streams. Therefore, the traditional FPN cannot be directly applied here. To fuse these multi-scale feature groups properly, we propose the Multi-Scale Feature Pyramid Network (MS-FPN). As shown in Figure 2, image information propagates not only from high-level features to low-level features inside the multi-level feature pyramid but also between streams of different scales in MD-IPN.
Before going through the details, we summarize our contributions as follows:
We comprehensively analyze the factors that small object detection depends on and the trade-off between performance and efficiency, and propose a novel high-resolution detection network, HRDNet, which considers both the image pyramid and the feature pyramid.
In HRDNet, we design a multi-depth and multi-stream module, MD-IPN, to balance the performance between small, medium, and large objects. We propose another new module, MS-FPN, to combine the different semantic representations from these multi-scale feature groups.
Extensive ablation studies validate the effectiveness and efficiency of the proposed approach. Benchmark results on several datasets show that our approach achieves state-of-the-art performance on object detection, particularly on small object detection. We hope this practice of small object detection can shed light on other research.
2 Related Work
Object detection is a basic task for many downstream tasks in computer vision. State-of-the-art detection networks include one-stage models, e.g., RetinaNet [retina], YOLOv3 [redmon2018yolov3], CenterNet [duan2019centernet], FSAF [fsaf], and CornerNet [CornerNet], and two-stage models, e.g., Faster R-CNN [fasterrcnn] and Cascade R-CNN [cai2018cascade]. The proposed HRDNet, in contrast, is a more fundamental framework that can serve as the backbone network for most of the detection models mentioned above, such as RetinaNet and Cascade R-CNN.
Small object detection
In most datasets, overall detection performance is largely limited by small object detection. Therefore, many studies specialize in small object detection. For example, [Augmentation] proposed oversampling and copy-pasting small objects to address this problem. Perceptual GAN [Perceptual_GAN] generated super-resolved features and stacked them into the feature maps of small objects to enhance their representations. DetNet [detnet] maintained the spatial resolution while keeping a large receptive field to improve small object detection. SNIP [SNIP] resized images to different resolutions and only trained on samples whose scale is close to the ground truth. SNIPER [SNIPER] used regions around the bounding boxes to remove the influence of the background. Unlike these methods, we combine the image pyramid and the feature pyramid, which not only effectively improves the detection of small targets but also preserves the detection performance on other objects.
Some studies have already explored object detection on high-resolution images. [doubleRtinyobj] proposed a fast tiny detection network for high-resolution remote sensing images. [4k8kdetection] proposed an attention pipeline to achieve fast detection on 4K or 8K videos using YOLO v2 [yolov2]. However, these works did not fully explore the effect of high-resolution images on small object detection, which is what we concentrate on.
To capture the semantic information of objects at different scales, multi-level features are commonly used for object detection. However, they suffer from serious feature-level imbalance because they convey different semantic information. The Feature Pyramid Network (FPN) [FPN] introduced a top-down pathway to transmit semantic information, alleviating the feature imbalance problem to some degree. Based on FPN, PANet [PANet] added a bottom-up path to enrich the location information of deep layers. The authors of Libra R-CNN [Libra_RCNN] revealed and tried to deal with the sample-level, feature-level, and objective-level imbalance issues. Pang et al. [Pang_2019_CVPR] proposed a lightweight module that produces featurized image pyramid features to augment the output feature pyramid. However, these methods only focus on multi-level features. Here, we address the feature-level imbalance from a new angle: we propose a new module, the Multi-Scale FPN, which handles the imbalance not only across multi-level features but also across multi-scale feature groups.
3 High-Resolution Detection Network
Obviously, high-resolution images are important for small object detection. Unfortunately, high-resolution images introduce unaffordable computation costs for deep networks. At the same time, high-resolution images aggravate the variance of object scales, worsening the performance on large objects, as shown in Table 1. To balance the computation cost and the variance of object scales while keeping the performance on all classes, we propose the High-Resolution Detection Network (HRDNet). HRDNet is a general concept that is compatible with any alternative detection method.
More specifically, HRDNet is designed with two novel modules, the Multi-Depth Image Pyramid Network (MD-IPN) and the Multi-Scale Feature Pyramid Network (MS-FPN). In MD-IPN, an image pyramid is processed by backbones of different depths, i.e., deep CNNs for the low-resolution images and shallow CNNs for the high-resolution images, as shown in Figure 1. After that, to fuse the multi-scale groups of multi-level features from MD-IPN, MS-FPN is proposed as a more suitable feature pyramid architecture (Figure 2).
The MD-IPN is composed of independent backbones with various depth to process the image pyramid. We term each backbone as a stream. HRDNet can be generalized to more streams, but to better illustrate the main idea, we mainly discuss the two-stream HRDNet and three-stream HRDNet. Figure 1 presents an example of three-stream HRDNet. Given an image with resolution , the high-resolution image ( with ) is processed by a stream of shallow CNN (), the lower-resolution images ( and with and , and .) is processed by streams of deeper CNN ( and ). Generally, we can build an image pyramid network with independent parallel streams, .
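The pairing of input resolutions with backbone depths can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the decreasing ratio of 0.5, the 1024×1536 input size, and the helper name `build_stream_plan` are hypothetical, and the three-backbone combination is chosen for illustration from the ResNet variants mentioned in the experiments.

```python
from typing import List, Tuple


def build_stream_plan(
    height: int,
    width: int,
    backbones: List[str],
    ratio: float = 0.5,  # hypothetical decreasing ratio, not taken from the text
) -> List[Tuple[str, Tuple[int, int]]]:
    """Pair each backbone with the input size of its stream.

    `backbones` is ordered from shallow to deep, so stream 0 (the shallowest
    backbone) receives the original, highest-resolution image and every
    following, deeper stream receives an image shrunk by `ratio` once more.
    """
    plan = []
    for i, name in enumerate(backbones):
        scale = ratio ** i
        size = (max(1, round(height * scale)), max(1, round(width * scale)))
        plan.append((name, size))
    return plan


# Illustrative three-stream configuration (backbone choice is an assumption).
plan = build_stream_plan(1024, 1536, ["ResNet10", "ResNet18", "ResNet101"])
for name, (h, w) in plan:
    print(f"{name}: {h}x{w}")
```

With this layout, the most expensive backbone only ever sees the smallest image, which is the source of the efficiency argument made for MD-IPN.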
We use to represent the input images with different resolutions given the original image with the highest resolution. The outputs of the multi scale image pyramid are feature groups . Each group contains a set of multi-level features , where is the multi-scale index and is the multi-level index. For example, in Figure 1, the value of and are , respectively, and the relation can be formulated as
Feature pyramid network (FPN) is one of the key components for most object detection algorithms. It combines low-resolution, semantically strong features with high-resolution, semantically weak features via a top-down pathway and lateral connections.
In our proposed HRDNet, the MD-IPN generates multi-scale (different resolutions) and multi-level (different hierarchies of features) features. To deal with these multi-scale hierarchical features, we propose the Multi-Scale FPN (MS-FPN). Different from FPN, semantic information propagates not only from high-level features to low-level features but also from the deep stream (low resolution) to the shallow stream (high resolution). Therefore, there are two directions of computation in the multi-scale FPN. The basic operations in the multi-scale FPN are the same as in the traditional FPN, i.e., convolution, up-sampling, and summation.
In this way, the highest resolution feature, i.e., , not only maintain the high-resolution for small object detection but also combine semantically strong features from multi-scale streams. Our novel MS-FPN can be formulated as
The is the feature in level and stream in Figure 2. The operation is up-sampling. The is convolution. Finally, MS-FPN outputs the final feature group . is calculated by
where is the features in Group , i.e., the outputs of the highest resolution stream.
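The two propagation directions can be made concrete with a small NumPy sketch. It is one plausible reading of the description, not the exact formulation: it assumes a decreasing ratio of 0.5 between streams, a 2× stride between levels, 1×1 convolutions as the lateral operation, and connections between features with the same level index in neighbouring streams. The names `upsample2x`, `conv1x1`, and `ms_fpn` are hypothetical.

```python
import numpy as np


def upsample2x(x: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def conv1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """1x1 convolution: mix channels with a (C_out, C_in) weight matrix."""
    return np.einsum("oc,chw->ohw", weight, x)


def ms_fpn(groups, weights):
    """Fuse multi-scale feature groups (assumed wiring, see lead-in).

    groups[i][l] is the feature of stream i (0 = highest resolution,
    shallowest backbone) at level l (0 = finest level). Each fused feature
    sums three terms: a lateral 1x1 convolution, the up-sampled higher-level
    feature of the same stream, and the up-sampled same-level feature of the
    next deeper stream. Returns the fused features of stream 0.
    """
    n_streams, n_levels = len(groups), len(groups[0])
    fused = [[None] * n_levels for _ in range(n_streams)]
    for i in reversed(range(n_streams)):        # deep stream -> shallow stream
        for l in reversed(range(n_levels)):     # high level -> low level
            out = conv1x1(groups[i][l], weights[i][l])
            if l + 1 < n_levels:
                out = out + upsample2x(fused[i][l + 1])
            if i + 1 < n_streams:
                out = out + upsample2x(fused[i + 1][l])
            fused[i][l] = out
    return fused[0]


# Toy example: 2 streams, 3 levels, 4 input channels, 8 output channels.
base, c_in, c_out = 16, 4, 8
groups = [
    [np.ones((c_in, base // 2 ** (i + l), base // 2 ** (i + l))) for l in range(3)]
    for i in range(2)
]
weights = [[np.full((c_out, c_in), 0.25) for _ in range(3)] for _ in range(2)]
outputs = ms_fpn(groups, weights)
print([o.shape for o in outputs])
```

In this toy setup every lateral term equals one, so the value at each output position simply counts how many source features contribute to it, which makes the two-directional flow easy to inspect.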
4.1 Experiment details
We conduct experiments on a typical small object detection dataset, VisDrone2019 [visdronevid], as well as on the traditional datasets MS COCO2017 [coco] and Pascal VOC2007/2012 [pascalvoc].
The VisDrone2019 dataset is collected by the AISKYEYE team. It consists of 288 video clips formed by 261,908 frames and 10,209 static images, covering a wide range of locations, environments, objects, and densities. The resolution of VisDrone2019 is higher than that of COCO, as we mentioned in Section 1, ranging from 960 to 1360. MS COCO and Pascal VOC are the most common benchmarks for object detection. Following common practice, we trained our model on the COCO training set and tested it on the COCO validation set. For Pascal VOC, we trained our model on all the training and validation data from both Pascal VOC 2007 and 2012, and tested the model on the Pascal VOC 2007 test set.
In COCO or Pascal VOC dataset, most images’ resolution is - px, which is resized to or in the training stage, but - px in VisDrone2019 [visdronedet] dataset. As shown in Figure 4, compared to MS-COCO, there are more objects and nearly all of them are very small in VisDrone2019, which is more challenging.
We followed the common practice in mmdetection [chen2019mmdetection]. We trained the models on VisDrone2019 with four Nvidia GPUs and on COCO with eight Nvidia GPUs. We use the SGD optimizer with a mini-batch for each GPU. The learning rate starts from 0.02 and decreases by at epoch and . The weight decay is . The linear warm-up strategy is used with warm-up iterations of , and the warm-up ratio of . The image pyramid is obtained by linear interpolation. The resolution decreasing ratio is .
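A step schedule with linear warm-up of this kind can be sketched as follows. Only the base learning rate of 0.02 comes from the text; the warm-up length, warm-up ratio, decay epochs, and decay factor below are hypothetical placeholders, because the actual values are not recoverable here. The function `learning_rate` is likewise an illustrative name, and the warm-up formula is a commonly used linear form rather than a quotation of any particular library.

```python
def learning_rate(
    iteration: int,
    epoch: int,
    base_lr: float = 0.02,        # from the text
    warmup_iters: int = 500,      # hypothetical
    warmup_ratio: float = 1 / 3,  # hypothetical
    decay_epochs=(8, 11),         # hypothetical, 0-indexed epochs
    gamma: float = 0.1,           # hypothetical decay factor
) -> float:
    """Step-decay learning rate with linear warm-up.

    The rate is multiplied by `gamma` once for every decay epoch already
    reached. During the first `warmup_iters` iterations it is additionally
    scaled linearly from `warmup_ratio` up to 1.
    """
    lr = base_lr * gamma ** sum(epoch >= e for e in decay_epochs)
    if iteration < warmup_iters:
        k = (1 - iteration / warmup_iters) * (1 - warmup_ratio)
        lr = lr * (1 - k)
    return lr


for it, ep in [(0, 0), (250, 0), (500, 0), (20000, 8), (30000, 11)]:
    print(f"iter={it:>5} epoch={ep:>2} lr={learning_rate(it, ep):.6f}")
```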
In order to fit the high-resolution images from VisDrone2019 into GPU memory, we equally cropped each original image in VisDrone2019 training set into four patch images which are not overlapped. In this way, we obtained a new training set with such cropped images.
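The cropping step can be illustrated with a short Python sketch. It assumes a 2×2 grid of quadrants (the text only states four equal, non-overlapping patches) and a simple clipping rule for boxes that cross a patch border; both are assumptions, and the helper names `split_into_quadrants` and `clip_box_to_patch` are hypothetical.

```python
def split_into_quadrants(width: int, height: int):
    """Return four non-overlapping patches as (x1, y1, x2, y2) tuples."""
    mid_x, mid_y = width // 2, height // 2
    return [
        (0, 0, mid_x, mid_y),           # top-left
        (mid_x, 0, width, mid_y),       # top-right
        (0, mid_y, mid_x, height),      # bottom-left
        (mid_x, mid_y, width, height),  # bottom-right
    ]


def clip_box_to_patch(box, patch, min_size: int = 1):
    """Clip a ground-truth box to a patch and shift it to patch coordinates.

    Returns None when the clipped box is smaller than `min_size` in either
    dimension, i.e. the object is not (meaningfully) visible in the patch.
    """
    x1, y1, x2, y2 = box
    px1, py1, px2, py2 = patch
    cx1, cy1 = max(x1, px1), max(y1, py1)
    cx2, cy2 = min(x2, px2), min(y2, py2)
    if cx2 - cx1 < min_size or cy2 - cy1 < min_size:
        return None
    return (cx1 - px1, cy1 - py1, cx2 - px1, cy2 - py1)


patches = split_into_quadrants(1360, 960)
box = (670, 470, 700, 500)  # a box straddling the image centre
for patch in patches:
    print(patch, clip_box_to_patch(box, patch))
```

A box that straddles the centre, as in the example, is split across all four patches, so a real pipeline would also need a policy for discarding heavily truncated objects.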
The same resolution as in training is used for inference. The IoU threshold of NMS is , and the threshold of confidence score is . Unless otherwise stated, the multi-scale test in our experiments uses three scales.
4.2 Ablation Studies
|aligned by resolution||28.7||49.6||28.7|
|aligned by depth||28.9||49.9||28.7|
|aligned by resolution||31.8||54.0||32.3|
|aligned by depth||32.0||54.3||32.5|
4.2.1 The effect of image resolution for small object detection
Extensive ablation studies on the VisDrone2019 dataset are conducted to illustrate the effect of input image resolution for detection performance. Table 1 shows that detection performance has a significant improvement with the increase of image resolution. Higher resolution leads to better performance under the same experimental settings. The performance of small objects presents more significant improvement from HRDNet. What is more, HRDNet performs much better than the state-of-the-art Cascade R-CNN with the same resolution as the input.
Interestingly, when the input resolution increases, the single-backbone model, i.e., Cascade R-CNN, suffers a dramatic decrease (1.1-7.6%) for categories with relatively large size, i.e., truck, awning-tricycle, and bus. On the contrary, a significant performance increase (1-5.2%) can be observed for HRDNet. Simply increasing the image resolution without considering the severe variation in object scale is not an ideal solution for object detection, let alone small object detection.
4.2.2 Explore the optimal image resolution
We have argued, and shown experimentally, that image resolution is important for small object detection; however, does a higher resolution always lead to better performance? Is there an optimal resolution for detection? In this part, we present the effect of image resolution on object detection. Figure 3 shows the change of the Average Precision (AP) with different resolutions. The resolution starts from (long edge) with 400 as the stride. Finally, HRDNet achieves the best performance when the resolution is px.
4.2.3 How to design the multi-scale FPN
As mentioned above, MS-FPN is designed to fuse multi-scale feature groups. Here, we compare three different styles, namely the simple FPN, the multi-scale FPN aligned by depth, and the multi-scale FPN aligned by resolution, as shown in Figure 5, to demonstrate the advantage of MS-FPN. The simple FPN applies a standard FPN to each multi-scale group of HRDNet and finally fuses the results of each FPN. For the multi-scale FPN, there are new connections between the streams, as shown in Figure 2. We conducted two groups of experiments, one with a ResNet10+18 backbone and one with a ResNet18+101 backbone. The first experiment in Table 2 shows that the multi-scale FPN is better than the simple FPN. Both experiments demonstrate that MS-FPN aligned by depth performs better than MS-FPN aligned by resolution. Therefore, we adopt MS-FPN aligned by depth in our architecture.
|Faster R-CNN w FPN [FPN]||ResNet-101||36.2||59.1||39.0||18.2||39.0||48.2|
|Cascade RCNN [cai2018cascade]||ResNet-101||42.8||62.1||46.3||23.7||45.5||55.2|
|†Cascade R-CNN [cai2018cascade]||ResNet50||2666||24.10||42.90||23.60||0.40||2.30||21.00||35.20|
|†Faster R-CNN [fasterrcnn]||ResNet50||2666||23.50||43.70||22.20||0.34||2.20||18.30||35.70|
|Faster R-CNN [resnet]||ResNet-101||76.4|
|R-FCN w DCN [deformable]||ResNet-101||82.6|
|CenterNet [objaspoint]||DLA [objaspoint]||80.7|
4.2.4 Efficient and Effective HRDNet
HRDNet is a multi-stream network, so there may be some concerns about its model size and running speed. Here, we report the number of parameters and the running speed of HRDNet, compared with the state-of-the-art single-backbone baseline. The results in Table 3 demonstrate that HRDNet achieves much better performance with a similar number of parameters and an even faster running speed.
4.2.5 The comparison with single backbone model ensemble
To further demonstrate that the performance improvement of HRDNet does not come from having more parameters, we compared the two-stream HRDNet with an ensemble of two single-backbone models under the same experimental setting (Table 4). The ensemble fuses the predicted bounding boxes and scores of both models before NMS (Non-Maximum Suppression) and then performs NMS on them jointly. We found that single-backbone models with high-resolution input always perform better than those with low-resolution input, even when the latter are processed by a stronger backbone. HRDNet performs better than the ensemble model, thanks to the novel multi-scale and multi-level fusion method. These results further indicate that the designed MS-FPN is essential for HRDNet.
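The ensemble baseline described above, pooling the detections of two models and running a single NMS pass over them, can be sketched in plain Python. The sketch is class-agnostic for brevity and uses an IoU threshold of 0.5; both choices are assumptions, as are the function names and the toy detections.

```python
def iou(a, b) -> float:
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_thr: float):
    """Greedy NMS over a list of (box, score) pairs, highest score first."""
    keep = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, kept_box) <= iou_thr for kept_box, _ in keep):
            keep.append((box, score))
    return keep


def ensemble(dets_a, dets_b, iou_thr: float = 0.5):  # threshold is hypothetical
    """Pool the raw detections of two models, then run one joint NMS pass."""
    return nms(dets_a + dets_b, iou_thr)


# Toy detections from two hypothetical single-backbone models.
model_a = [((0, 0, 10, 10), 0.9), ((20, 20, 30, 30), 0.6)]
model_b = [((1, 1, 11, 11), 0.8), ((50, 50, 60, 60), 0.7)]
merged = ensemble(model_a, model_b)
for box, score in merged:
    print(box, score)
```

Such an ensemble only merges final predictions; it has no mechanism for exchanging features between the two backbones, which is the role MS-FPN plays in HRDNet.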
4.3 Comparison with the state-of-the-art methods
To demonstrate the advantage of our model and technical criterion, we also compare HRDNet with the most popular and state-of-the-art methods. Table 6 shows that the proposed HRDNet achieves the best performance on the VisDrone2019 DET validation set. Notably, our model with ResNeXt50+101 obtains more than 3.0% AP improvement over HFEA, which uses ResNet152 as its backbone.
Besides the experiments on VisDrone2019, we also conduct experiments on the COCO2017 test set to show that our method works well on a larger-scale, more complicated, standard detection dataset. Table 5 shows that HRDNet achieves state-of-the-art results, and improvement compared with most recent models.
There are not many small objects in Pascal VOC. We conducted experiments on this dataset to demonstrate that HRDNet not only improves small object detection but also maintains the performance on large objects.
Merely increasing the image resolution without further modifications damages the performance on large objects. Moreover, the severe variance of object scales further limits the gains from high-resolution images. Motivated by this, we propose a new detection network for small objects, HRDNet. To handle these issues, we further design MD-IPN and MS-FPN. HRDNet achieves state-of-the-art results on the small object detection dataset VisDrone2019 and, at the same time, performs strongly on other benchmarks, i.e., MS COCO and Pascal VOC.