Weakly Supervised Dataset Collection for Robust Person Detection

03/27/2020 ∙ by Munetaka Minoguchi, et al. ∙ 27

To construct an algorithm that can provide robust person detection, we present a dataset with over 8 million images that was produced in a weakly supervised manner. Through labor-intensive human annotation, the person detection research community has produced relatively small datasets containing on the order of 100,000 images, such as the EuroCity Persons dataset, which includes 240,000 bounding boxes. Therefore, we have collected 8.7 million images of persons based on a two-step collection process, namely person detection with an existing detector and data refinement for false positive suppression. According to the experimental results, the Weakly Supervised Person Dataset (WSPD) is simple yet effective for person detection pre-training. In the context of pre-trained person detection algorithms, our WSPD pre-trained model has 13.38% and 6.38% better accuracy than the same model trained on the fully supervised ImageNet and EuroCity Persons datasets, respectively, when verified with the Caltech Pedestrian dataset.




1 Introduction

In the context of navigation and service robots, an appropriate human-centered operating environment usually starts with person detection. Therefore, we require robust and highly accurate person detection for collision avoidance for the realization of self-driving cars, mobile robots, and unmanned aerial vehicles.

To construct a learning-based object detector, a large-scale and well-labeled dataset is required, as is a suitable model architecture. (In this paper, “object” detection is taken in a broad sense that includes person detection.) Towards this end, large-scale multiple-object datasets, such as MS COCO [18] and OpenImages [16], have been produced to conduct highly accurate object detection. However, the datasets for person detection are currently relatively small. For example, the Caltech Pedestrian [7], CityPersons [32], and EuroCity Persons [3] datasets contain 350,000, 35,000, and 240,000 bounding boxes (bboxes), respectively. Compared to million-image multiple-object datasets like OpenImages, pedestrian datasets could still be greatly expanded to improve the performance of models that are trained with them. Thus, there is strong motivation to create a large-scale person dataset with bboxes numbering on the order of millions.

Inspired by weakly supervised image labeling with Social Network Service (SNS) hashtags, as in the so-called “Instagram-3.5B” study [20], we conducted semi-automatic large-scale dataset collection in the context of person detection. Unlike the related work in Instagram-3.5B, our proposed dataset contains a large number of bboxes in addition to the captured images. We here consider a different method for constructing a pre-trained person dataset based on an existing detector and simple refinement.

Figure 1: Overview of the proposed Weakly Supervised Person Dataset (WSPD) and its contributions.

This paper proposes a million-image person dataset based on a weakly supervised method for robust person detection, namely the Weakly Supervised Person Dataset (WSPD). Our large-scale person dataset is constructed by semi-automatic image collection and data refinement with SNS images. This dataset collection method allows us to significantly improve the performance of person detection with only a few hours of human annotation. When used for pre-training, the WSPD yields a model that outperforms models pre-trained on fully supervised datasets, such as EuroCity Persons (+6.38%) and ImageNet (+13.36%), on the Caltech Pedestrian dataset with a Single Shot MultiBox Detector (SSD) [19].

This paper makes the following contributions to person detection (see also Figure 1). (i) We propose a weakly supervised dataset collection method for building training datasets for person detection. We then use this method to construct a dataset containing millions of images with bboxes through an existing detector (e.g., Faster R-CNN [24]) and false positive suppression. (ii) The WSPD pre-trained model is demonstrated to perform well at person detection. The fine-tuned detector achieved an accuracy 6.38 and 13.36% higher than that of the baseline models (EuroCity Persons and ImageNet, respectively) on the Caltech Pedestrian. We provide examples of the detection results and performance comparisons in Figure 2.

Figure 2: We have constructed a million-image person dataset for pre-training person detectors. (Left) Our WSPD method, which creates a large-scale pre-training person dataset, provides better person detection performance. We show examples of the ground truth and the detection results for the ImageNet and EuroCity Persons pre-trained models. (Right) Detection error trade-off (DET) curves for the baseline models (ImageNet, Pascal VOC, and EuroCity Persons) versus the proposed model. The miss rate (%) for the Caltech Pedestrian is shown, with lower values representing a higher accuracy; our proposed method is up to 13.36% better than the baseline. The baselines employ the ImageNet/Pascal VOC/EuroCity Persons pre-trained VGG-16 neural network and Caltech Pedestrian fine-tuned SSD. In our proposed method, we use the WSPD pre-trained model and Caltech Pedestrian fine-tuned SSD. Note that our proposed WSPD, Pascal VOC, and EuroCity Persons pre-trained models solved person detection tasks with both the pre-trained and fine-tuned datasets.

2 Related work

In this section we briefly review some of the key concepts related to this paper, such as object detection, annotation, and dataset collection, to help highlight how the proposed method differs from existing methods.

Object detection. Object detection algorithms have progressed from hand-crafted detection using local features (e.g., Haar-like features [28], HOG [4], and ICF [6]) and well-organized classifiers (e.g., Deformable Parts Model [11] and aggregated detectors [5]) to the current era of deep neural networks (DNNs). In the literature, a two-step approach combining a region identifier and a DNN-based classifier has been proposed [12]. This basic technique, called R-CNN, has been adapted for use with any-size feature maps [13] and extended to an end-to-end two-step method [24]. Much current research focuses on one-shot detectors, such as you only look once (YOLO) [23] and SSD [19]. Recent studies have also focused on highly accurate detectors, such as RetinaNet [17] and M2Det [33], and instance segmentation with Mask R-CNN [14]. Here, we use SSD, which is a balanced detector with a relatively short training time. Further, it can easily be optimized for use with baseline models to compare the dataset prepared with our collection method and the ImageNet and EuroCity Persons datasets in the context of a pre-trained detector. Moreover, we implemented M2Det to identify which person detector is more accurate.

Person detection. According to a comprehensive survey [2], the performance of person detection algorithms has increased over the last decade as person detectors have evolved to use more sophisticated architectures. Recent studies have proposed several configurations to improve recognition and localization with DNNs [15], semantic meaning [8], combined methods [30], and analysis of small images or crowds [29]. However, a large-scale person dataset must be prepared for training these models (e.g., SSD or M2Det) and fine-tuning their architecture.

Annotation for object bboxes. In recent machine learning research, annotation treatment has been shown to be important for successful network training. For example, Su et al. introduced a method for repeatedly checking bbox annotations in an image in three steps: drawing, quality checking, and coverage verification [26]. Papadopoulos et al. proposed a method that combines an existing detector and human annotation [22]. The combined method iterates between three annotation and quality control steps: model retraining, bbox relocalization, and human verification. Their annotation method results in an object detection dataset without human-drawn bboxes. Compared with these annotation methods, our proposed dataset collection method requires only a minimum of human-based annotation checks to improve the performance of person detection.

Person dataset collection. In addition to changes in the models used for person detection, person datasets have also evolved over the last decade. The first generation of person detection datasets consisted of small training and testing datasets (up to 10,000 images, including INRIA [4], Daimler [21], and ETHZ [9]), followed by a second generation of medium-size datasets (10,000–100,000 images, including Caltech Pedestrian [7], CityPersons [32], and EuroCity Persons [3]) that include occlusion and cluttered backgrounds. However, to the best of our knowledge, a large-scale person dataset (over 1 million images) is not currently freely available. In [31], it was claimed that more high-quality person annotations are required for improving the accuracy of person detection algorithms. Data collection with weak supervision is one area of ongoing work in image classification [20]. As discussed in related work [27], the increasing scale of datasets is also important for improving the accuracy of existing detection algorithms. To help produce large-scale datasets, we here present a weakly supervised pre-training dataset annotation method for person detection.

Figure 3: Weakly supervised dataset collection. (1) Cloud-based image download and collection. Although this paper used the YFCC100M dataset intended for city perception [34], any image dataset can be used. (2) Detection of people with Faster R-CNN to add bboxes to the selected images. (3) Data refinement to exclude unwanted bboxes with a binary classifier, which determines whether a person’s whole body is contained within the bbox.

3 Weakly supervised dataset collection

3.1 Overview

To obtain a better representation of persons for detection during pre-training, a large-scale dataset with bboxes should be used. In our setting, a detector is pre-trained on the WSPD and then fine-tuned on the Caltech Pedestrian dataset [7]. Image localization labeling normally depends on the efforts of human annotators; therefore, an automatic dataset creation method would be useful for the person detection research community.

Figure 3 illustrates the concept of weakly supervised dataset collection. After collecting a large number of SNS images, we apply a two-step algorithm for weakly supervised dataset construction: person detection with an existing object detector and erasing false positives using a binary classifier.

Here, we describe the problem setting for our weakly supervised dataset collection. The setting of weakly supervised learning is simple yet effective for pre-training a person detector. At the beginning, we assign an object detector $f$ to recognize bboxes and their labels from an input image $I$:

$$(\hat{c}, \hat{b}) = f(I; \theta)$$

where $\hat{c}$ and $\hat{b}$ denote predictions of object category and bboxes, respectively, and $\theta$ represents trained parameters in the detector. In the case of person detection, the category is limited to the “person” label. The equation is simplified as follows:

$$\hat{b}_{person} = f(I; \theta)$$

We used a support vector machine (SVM) to refine the detected bboxes with $\hat{b}_{person}$. We used weakly supervised dataset collection to classify the detected bboxes as person ground truth ($y = +1$) or background ($y = -1$). The following equation shows the binary classifier with SVM:

$$y = g(I_{\hat{b}}), \quad y \in \{+1, -1\}$$

where $I_{\hat{b}}$ represents a cropped image with detected bbox. The cropped image is divided depending on $y$:

$$I_{\hat{b}} \in \begin{cases} D_{person} & \text{if } y = +1 \\ D_{background} & \text{if } y = -1 \end{cases}$$

where $D_{person}$ is assigned to add a person label in the WSPD method.

We use the Faster R-CNN as a person detector [24] and a binary classifier for determining whether the target’s whole body is contained within the bbox. Our framework is simple yet effective for generating a large-scale dataset. We use the Yahoo! Creative Commons 100M Database (YFCC100M) [1], which contains 100 million Flickr images. Our person dataset (WSPD) contains images of people from around the world but is limited to specific major cities [34]. The dataset consists of 2,822,421 original images and 8,716,461 person images (in bboxes). To the best of our knowledge, this is the largest person dataset for bbox-based detection currently available (see Table 1).
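The two-step procedure described above (detection, then refinement of each cropped bbox) can be sketched as follows. This is a minimal illustration, not the authors' implementation: `Detection`, `collect_person_bboxes`, the `(x_min, y_min, x_max, y_max)` box convention, and the two toy callables are hypothetical names introduced here, and the stand-in "detector" and "whole-body classifier" only mimic the roles of Faster R-CNN and the SVM.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max)


@dataclass
class Detection:
    label: str
    score: float
    bbox: BBox


def collect_person_bboxes(
    image,
    detector: Callable[[object], List[Detection]],
    is_whole_body: Callable[[object, BBox], bool],
    score_threshold: float = 0.8,
) -> List[BBox]:
    """Step 1: keep confident 'person' detections.
    Step 2: keep only crops that the binary classifier accepts."""
    kept = []
    for det in detector(image):
        if det.label != "person" or det.score < score_threshold:
            continue  # wrong class or below the detection threshold
        if is_whole_body(image, det.bbox):
            kept.append(det.bbox)
    return kept


# Toy stand-ins (assumptions for illustration only).
def toy_detector(image) -> List[Detection]:
    return [
        Detection("person", 0.95, (10, 10, 50, 130)),
        Detection("person", 0.50, (0, 0, 20, 60)),    # low score
        Detection("dog", 0.99, (5, 5, 40, 40)),       # not a person
        Detection("person", 0.90, (0, 0, 100, 30)),   # wide box, likely partial
    ]


def toy_whole_body(image, bbox: BBox) -> bool:
    x0, y0, x1, y1 = bbox
    return (y1 - y0) > (x1 - x0)  # crude proxy: standing people are taller than wide


print(collect_person_bboxes(None, toy_detector, toy_whole_body))
```

In a real pipeline the two callables would wrap a trained detector and a trained classifier; keeping them as plain function arguments makes it easy to swap either stage.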

3.2 Data collection, refinement, and configuration

Collection. We downloaded images from 21 global cities based on [34]; however, we excluded cities having fewer than 100,000 collected images. Consequently, 16 of the 21 cities were selected for the WSPD: London, New York, Boston, Paris, Toronto, Barcelona, Tokyo, San Francisco, Hong Kong, Zurich, Seoul, Beijing, Bangkok, Singapore, Kuala Lumpur, and New Delhi (listed from most images to fewest images). These metropolitan areas do not overlap, as they are at least 200 km apart. To create the images with bboxes, we applied Faster R-CNN with a Pascal VOC pre-trained VGG16 model. We initially set the detection threshold to 0.8 and used only the person label. The geo-tag and time-stamp of each image were copied from the YFCC100M dataset. In the first step, we collected 76,532,519 person bboxes using automatic image collection and bbox annotation.
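The city selection rule can be written in a few lines. The sketch below assumes a plain mapping from city name to collected image count; the function name `select_cities` and the counts are made up for illustration, since the per-city counts are not given here.

```python
from typing import Dict, List


def select_cities(image_counts: Dict[str, int], min_images: int = 100_000) -> List[str]:
    """Drop cities with fewer than `min_images` images and
    return the rest ordered from most images to fewest."""
    kept = [(city, n) for city, n in image_counts.items() if n >= min_images]
    kept.sort(key=lambda item: item[1], reverse=True)
    return [city for city, _ in kept]


# Hypothetical counts, not taken from the dataset.
toy_counts = {"CityA": 950_000, "CityB": 99_999, "CityC": 100_000, "CityD": 420_000}
selected = select_cities(toy_counts)
print(selected)
```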

Refinement. We now consider how to exclude noisy images from our dataset. The refinement strategy is to scan all cropped images with a simple classifier based on a combination of StyleNet [25] and an SVM. To create a sophisticated fashion-oriented database, we treat the WSPD refinement as a binary classification problem to distinguish between street-fashion-snapshot whole-body images and other cropped images, such as partial bodies or backgrounds without a person. We trained the classifier with 1,443 carefully annotated target images and a large number of randomly cropped negative images, and then used it to refine the database.
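As a rough illustration of the refinement classifier, the sketch below trains a linear SVM on feature vectors and uses it to accept or reject crops. Several things here are assumptions: the 2-D toy features stand in for StyleNet features, the SVM is a hand-written subgradient-descent variant on the regularised hinge loss (the SVM implementation and kernel actually used are not stated), and the names `train_linear_svm` and `predict` are introduced only for this example.

```python
from typing import List, Sequence, Tuple


def train_linear_svm(
    samples: Sequence[Sequence[float]],
    labels: Sequence[int],
    epochs: int = 50,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit weights w and bias b by subgradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Wrong side or inside the margin: step towards the sample.
                w = [wi + lr * (y * xi - lam * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Correct with enough margin: apply only the regulariser.
                w = [wi - lr * lam * wi for wi in w]
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 (whole-body person crop) or -1 (partial body / background)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1


# Toy 2-D features (assumption): positives cluster near (2, 2), negatives near (-2, -2).
features = [(2.0, 1.5), (1.5, 2.5), (2.5, 2.0), (-2.0, -1.5), (-1.5, -2.5), (-2.5, -2.0)]
targets = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(features, targets)
kept = [x for x in [(3.0, 3.0), (-3.0, -3.0)] if predict(w, b, x) == 1]
print(kept)
```

Only crops predicted as +1 would be kept, which corresponds to the false positive suppression step.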

Configuration. Our WSPD has the three following features:

  • Images captured from the YFCC100M dataset. After data refinement, the number of images decreased from 8,504,037 to 2,822,421 original images.

  • Cropped person images with bboxes (treated as clothing images). As a result of data refinement, the number of person bboxes was reduced from 76,532,519 to 8,716,461.

  • Geo-location and time-stamp information. This relates to the 16 cities listed above.
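To make the effect of refinement concrete, the retention rates implied by the counts above can be computed directly. This is a small arithmetic sketch using only the four reported counts; the variable and function names are ours.

```python
# Counts reported for the WSPD before and after data refinement.
images_before = 8_504_037
images_after = 2_822_421
bboxes_before = 76_532_519
bboxes_after = 8_716_461


def retention(before: int, after: int) -> float:
    """Fraction of items that survive refinement."""
    return after / before


image_retention = retention(images_before, images_after)
bbox_retention = retention(bboxes_before, bboxes_after)
bboxes_per_image = bboxes_after / images_after

print(f"images kept:      {image_retention:.1%}")
print(f"bboxes kept:      {bbox_retention:.1%}")
print(f"bboxes per image: {bboxes_per_image:.2f}")
```

Roughly one image in three and one bbox in nine survive, which indicates how aggressively the binary classifier filters the raw detections.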

Database                  #image       #bbox        #class
Pascal VOC [10]           11,530       27,450       20
MS COCO [18]              123,287      896,782      80
OpenImages V5 [16]        1,743,042    14,610,229   600
Caltech Pedestrian [7]    250,000      350,000      2
CityPersons [32]          5,000        35,016       2
EuroCity Persons [3]      47,300       238,200      17
WSPD (proposed)           2,822,421    8,716,461    2
Table 1: Proposed WSPD and related datasets.

Details of datasets. We give details on the self-collected dataset in Table 1. We also compare the proposed dataset (WSPD) with existing datasets for object and person detection. It is clear that our dataset contains the largest number of images and bboxes among the currently available person datasets. Also, our dataset contains a diversity of person images from different locations worldwide. From the semi-automatic dataset collection, we obtained millions of person bboxes that can be useful for training and testing a pre-trained detector.

Dataset quality analysis. We manually analyzed the results of our method using 1,000 randomly selected person bboxes from the WSPD dataset. Figure 4 shows the four classifications for the randomly selected images, and Table 2 indicates the corresponding frequency of occurrence of each class. The four bbox classifications were (i) high-quality annotation, (ii) low-quality annotation (partial image of a person), (iii) multiple persons in a bbox, and (iv) misclassification (not a person). Based on our random sample of 1,000 images, we expect the collected dataset to consist of 93% person images (i, ii, and iii combined), regardless of whether the images are perfectly annotated. Non-person images account for only 70 out of 1,000 bboxes. According to the Instagram-3.5B paper [20], 10 and 25% noise reduced the performance rate by only 1.0 and 2.0%, respectively.
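The shares quoted above follow from the per-class counts in the 1,000-bbox sample. The sketch below recomputes them and, as an addition that is not part of the original analysis, attaches a rough 95% interval to the misclassification rate using the normal approximation to the binomial; the counts are derived from the percentages in Table 2.

```python
import math

SAMPLE_SIZE = 1000
# Counts derived from the percentages in Table 2 (x 1,000 samples).
counts = {
    "high_quality": 622,
    "low_quality": 211,
    "multiple_persons": 97,
    "misclassified": 70,
}

person_share = (counts["high_quality"] + counts["low_quality"]
                + counts["multiple_persons"]) / SAMPLE_SIZE
noise_share = counts["misclassified"] / SAMPLE_SIZE

# Normal-approximation 95% interval for the non-person rate (our assumption,
# used only to show how much a 1,000-bbox sample can pin the rate down).
std_err = math.sqrt(noise_share * (1 - noise_share) / SAMPLE_SIZE)
low, high = noise_share - 1.96 * std_err, noise_share + 1.96 * std_err

print(f"person bboxes: {person_share:.1%}")
print(f"non-person bboxes: {noise_share:.1%} (approx. 95% interval {low:.1%} to {high:.1%})")
```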

The effectiveness of the proposed weakly supervised dataset collection is shown in Section 5, where the WSPD is compared with several other pre-training datasets.

Figure 4: Data quality analysis.
Type of bbox annotation                  %
(i) High-quality annotation              62.2
(ii) Low-quality annotation              21.1
(iii) Multiple people in a bbox          9.7
(iv) Misclassification (not a person)    7.0
Table 2: Statistics of dataset quality analysis. We randomly selected 1,000 images from the WSPD dataset.
Method         Annotation   Pre-training (#classes, #images)      Fine-tuning
ImageNet       Human        Classification (1,000, 1.2 million)   Person detection
ECP            Human        Person detection (2, 240,000)         Person detection
Pascal VOC     Human        Object detection (20, 10,000)         Person detection
WSPD (ours)    Weak         Person detection (2, 8.7 million)     Person detection
Table 3: Annotation type, pre-training, and fine-tuning for each method.

4 Configuration for Detectors

In this section, we describe a suitable base model and training configuration.

4.1 Representative architecture in person detection

To assess the performance achieved when using the proposed WSPD collection method, we consider different types of representative detectors, namely the SSD [19] and M2Det [33]. In our explorative analysis, we utilized the WSPD dataset containing over 8.7 million bboxes to optimize the network parameters for bbox regression and person classification. The hyperparameter settings used here were the same as those in [19, 33].

Method             Pre-training   Pre-training supervision   Batch size, #epochs   Miss rate (%, lower is better)
SSD                ImageNet       Human                      64, 100               33.90
SSD                Pascal VOC     Human                      64, 100               29.28
SSD                ECP            Human                      64, 100               26.92
M2Det320           ImageNet       Human                      16, 100               57.31
M2Det320           Pascal VOC     Human                      16, 100               73.72
M2Det320           ECP            Human                      16, 100               97.68
M2Det512           ImageNet       Human                      8, 100                32.46
M2Det512           Pascal VOC     Human                      8, 100                23.05
M2Det512           ECP            Human                      8, 100                82.53
SSD (ours)         WSPD           Weak                       128, 25               24.06
SSD (ours)         WSPD           Weak                       128, 50               20.95
SSD (ours)         WSPD           Weak                       128, 100              20.55
SSD (ours)         WSPD           Weak                       256, 25               24.35
SSD (ours)         WSPD           Weak                       256, 50               22.92
SSD (ours)         WSPD           Weak                       256, 100              21.45
M2Det320 (ours)    WSPD           Weak                       16, 50                16.44
M2Det512 (ours)    WSPD           Weak                       8, 35                 18.85

Table 4: Detection performance comparisons for the Caltech Pedestrian. We list the method, backbone network, pre-trained dataset, supervision during pre-training, size of batch, number of pre-training epochs, and miss rate (%). Though our WSPD applies only weak supervision during pre-training, we achieve lower miss rates on fine-tuning tasks.

4.2 Training method for each dataset

We conducted pre-training with our WSPD and fine-tuning for each person dataset. Because the experiment evaluates the pre-trained models, fine-tuning on the pedestrian dataset was conducted for all of them. Our WSPD pre-trained model is compared with three different models: the ImageNet, Pascal VOC, and EuroCity Persons pre-trained detectors. The ImageNet pre-trained model is trained with a large number of images, but no bboxes are used during the pre-training. In contrast, the EuroCity Persons pre-trained detector uses 240,000 person bboxes in the pre-training step. Moreover, the Pascal VOC pre-trained detector is not limited to person bboxes, and it covers 20 object classes. We show the procedures used for pre-training and fine-tuning in Table 3. We used the Caltech Pedestrian in the fine-tuning step.

5 Experimental Results and Discussion

This section clarifies how the use of the weakly supervised dataset collection influences the accuracy of a pre-trained person detector. The weak but numerous annotations enable us to improve the performance in the fine-tuning task. The resulting accuracy is higher than that of the fully supervised pre-trained models (the baselines listed in Table 3). We also present the results of the exploratory analysis and compare the performance when using different detection architectures.

5.1 Exploratory study

The purpose of the exploratory study was to optimize the hyperparameters for each architecture using the self-assembled dataset. Although there are numerous hyperparameters that must be selected in the detection architecture and learning strategy, we examined the effects of batch size {128, 256} and #epoch {25, 50, 100} with the WSPD method, as they seem to be the most important for model training. Therefore, we train six different SSD models in the pre-training phase. Each pre-trained detector is then further fine-tuned on a target dataset. To simplify the parameter tuning step, we employed the SSD (detection architecture), WSPD (pre-trained dataset), and Caltech Pedestrian (fine-tuned dataset).

The exploration results for the Caltech Pedestrian are shown in Table 4. The table shows the change in the miss rates (lower is better) for different batch sizes and numbers of pre-training epochs. According to the results, we can confirm that a batch size of 128 and 100 pre-training epochs provide the best performance for the SSD. Additionally, we found that performance tends to improve as the number of pre-training epochs increases. However, a smaller number of pre-training epochs must be considered because the training time with 8.7 million bboxes is high. Especially during pre-training with the WSPD, the average training time is roughly 39 hours per epoch on four NVIDIA Tesla V100 GPUs. Therefore, pre-training with 100 epochs requires approximately 3,969 hours (165 days). Longer training is likely to give better pre-training results, but we must consider a more reasonable training time on larger detection architectures like M2Det.
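The training-time figures can be reproduced with simple arithmetic. The sketch below starts from the reported total of 3,969 hours for 100 epochs on four GPUs and assumes, as a simplification of ours, that the cost per epoch is constant; the function name `pretraining_cost` is introduced for this example.

```python
TOTAL_HOURS_100_EPOCHS = 3969  # reported wall-clock time for 100 epochs
NUM_GPUS = 4                   # four NVIDIA Tesla V100 GPUs


def pretraining_cost(epochs: int, reference_epochs: int = 100) -> dict:
    """Scale the reported 100-epoch cost linearly to another epoch count."""
    hours = TOTAL_HOURS_100_EPOCHS * epochs / reference_epochs
    return {
        "hours": hours,
        "days": hours / 24,
        "gpu_hours": hours * NUM_GPUS,
    }


for epochs in (25, 50, 100):
    cost = pretraining_cost(epochs)
    print(f"{epochs:>3} epochs: {cost['hours']:.0f} h, "
          f"{cost['days']:.1f} days, {cost['gpu_hours']:.0f} GPU-hours")
```

Under this linear assumption, the 25-epoch and 50-epoch settings in Table 4 would take about 41 and 83 days, respectively, which is why shorter schedules are attractive for larger architectures.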

Figure 5: Detection examples with WSPD pre-trained SSD.
Figure 6: DET curves for the M2Det512 model.

5.2 Comparison with baseline models

We consider the detection results in detail for each architecture, backbone network, pre-trained dataset, and miss rate in Table 4. We focus on the validation of the architectures (SSD/M2Det) and pre-trained datasets (ImageNet, Pascal VOC, EuroCity Persons, and WSPD). We discuss the results for both the SSD and M2Det architectures.

SSD. Figure 2 shows the results for the proposed method and three baselines with the SSD architecture. The difference between our proposed method and the baselines for the pre-training tasks is shown in Table 3. Pre-training with the WSPD differs substantially from the baselines because the dataset was collected in a weakly supervised manner. We confirmed that our WSPD pre-trained model achieved the lowest miss rate of 20.54%, which is 6.38 and 13.36 percentage points lower than the miss rates of the models pre-trained on EuroCity Persons and ImageNet, respectively. Note that the weakly supervised dataset collection for person detection was processed by a two-step algorithm using an existing detector and binary classification. Despite the presence of noise in the dataset, our method outperformed pre-training on fully supervised bboxes drawn by human annotators. This result suggests that we can automatically generate a ground truth dataset in a simple way. Our method also outperforms the Pascal VOC pre-trained model (by 8.74 percentage points in miss rate), which is pre-trained with multiple object detection labels.
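The reported gaps are absolute differences in miss rate. The sketch below recomputes them from the SSD baseline values in Table 4 and the 20.54% figure quoted in the text (Table 4 lists 20.55% for the same setting), and also shows the corresponding relative reductions, which are our own addition for comparison.

```python
WSPD_MISS_RATE = 20.54  # value quoted in the text; Table 4 lists 20.55

baseline_miss_rates = {
    "ImageNet": 33.90,
    "Pascal VOC": 29.28,
    "EuroCity Persons": 26.92,
}

for name, baseline in baseline_miss_rates.items():
    absolute_gap = baseline - WSPD_MISS_RATE          # percentage points
    relative_gap = absolute_gap / baseline * 100      # percent of the baseline
    print(f"{name:<17} {absolute_gap:5.2f} points lower "
          f"({relative_gap:4.1f}% relative reduction)")
```

Stating the gaps in percentage points avoids confusing a 13.36-point difference with a 13.36% relative change, which would be a much smaller improvement.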

M2Det. In addition to the SSD model, we considered the M2Det (320/512) model. The M2Det detector represents the current state of the art in terms of detector accuracy. As described above, we compared the self-collected WSPD with ImageNet, Pascal VOC, and EuroCity Persons in the context of the pre-trained dataset with M2Det512 (see Figure 6). According to Table 4, the best miss rate was 16.44% with M2Det320. The miss rate is 4.01% better than that for the WSPD pre-trained SSD. The ImageNet pre-trained M2Det512 had a 32.46% miss rate on the Caltech Pedestrian.

Moreover, we list the detection comparisons and results in Figure 2 and Figure 5, respectively.

Figure 7: (Top) Relationship between additional noise rate (0%, 20%, 40%, 60%, and 80%) and detection miss rate. (Bottom) How to create a “noisy” bbox from an image.
Noise (%)   Miss rate (%)   Difference from normal training (%)
0           23.86
10          24.06           -0.20
20          24.65           -0.79
30          24.81           -0.95
40          26.82           -2.96
50          25.49           -1.63
60          27.25           -3.39
70          29.68           -5.82
80          28.98           -5.12
90          33.03           -9.17
100         38.62           -14.76
Table 5: Detailed noise rate and miss rate correspondences. We also show the difference from normal training, which has a miss rate of 23.86%.

5.3 Noise label analysis

Additionally, we investigated the effect of label noise.

In addition to the manual data quality analysis (see Table 2), we analyzed the relationship between the amount of noise in the dataset and the detection accuracy. We deliberately added noise by translating bboxes horizontally and vertically in the (x, y) coordinates. The procedure for creating noise is shown at the bottom of Figure 7. We translated a bbox within the image and prevented it from overlapping a ground truth bbox. The relationship between noise rate and miss rate is shown at the top of Figure 7 and in Table 5. In the figure and table, 0% noise (miss rate of 23.86%) corresponds to normal training and 100% noise (miss rate of 38.62%) corresponds to translating all bboxes. Note that this experiment used 1 million bboxes randomly selected from the WSPD dataset; therefore, the miss rate is different from the 20.55% with 8.7M bboxes shown in Table 4.
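One way to generate such noisy labels is sketched below. It is an illustration under stated assumptions rather than the procedure actually used: boxes are `(x_min, y_min, x_max, y_max)` tuples, new positions are drawn by rejection sampling, a box that cannot be placed after `max_tries` attempts is left unchanged, and the function names are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple

BBox = Tuple[int, int, int, int]  # assumed convention: (x_min, y_min, x_max, y_max)


def overlaps(a: BBox, b: BBox) -> bool:
    """True if two axis-aligned boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def translate_bbox(bbox: BBox, gt_boxes: Sequence[BBox], image_size: Tuple[int, int],
                   rng: random.Random, max_tries: int = 100) -> Optional[BBox]:
    """Move `bbox` to a random position inside the image that does not
    overlap any ground truth box. Returns None if no position is found."""
    w, h = bbox[2] - bbox[0], bbox[3] - bbox[1]
    img_w, img_h = image_size
    for _ in range(max_tries):
        x0 = rng.randint(0, img_w - w)
        y0 = rng.randint(0, img_h - h)
        candidate = (x0, y0, x0 + w, y0 + h)
        if not any(overlaps(candidate, gt) for gt in gt_boxes):
            return candidate
    return None


def add_label_noise(gt_boxes: Sequence[BBox], image_size: Tuple[int, int],
                    noise_rate: float, seed: int = 0) -> List[BBox]:
    """Translate a `noise_rate` fraction of the boxes; keep the rest as-is."""
    rng = random.Random(seed)
    n_noisy = round(len(gt_boxes) * noise_rate)
    noisy_idx = set(rng.sample(range(len(gt_boxes)), n_noisy))
    out = []
    for i, box in enumerate(gt_boxes):
        moved = translate_bbox(box, gt_boxes, image_size, rng) if i in noisy_idx else None
        out.append(moved if moved is not None else box)  # fall back to the original box
    return out


gt = [(100, 100, 140, 220), (300, 50, 350, 200)]
print(add_label_noise(gt, image_size=(640, 480), noise_rate=1.0))
```

With `noise_rate=1.0` every box is moved away from all ground truth positions, which corresponds to the 100% noise setting; intermediate rates move only a random subset.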

According to the results, 30% noise produced only a small increase in the miss rate (from 23.86 to 24.81%, a difference of 0.95%). This confirmed that a small amount of noise does not greatly affect the performance of person detection. At noise rates greater than 30%, the miss rate generally continued to increase, from 26.82% at 40% noise (difference of 2.96%) to 28.98% at 80% noise (difference of 5.12%). The 90 and 100% noise rates produced the worst results, with differences of 9.17 and 14.76% from normal training. In particular, the miss rate with 100% noise (38.62%) is worse than that of the ImageNet pre-trained model on the Caltech Pedestrian (33.90%).

6 Conclusion

This paper proposes a weakly supervised dataset collection method for improving pre-trained person detection models. In a comparison with the baseline detectors (e.g., ImageNet pre-trained model and person dataset fine-tuned model), our proposed method achieved a 6.38 and 13.36% better miss rate than the EuroCity Persons and ImageNet pre-trained models, respectively. The semi-automatic image and bbox collection can be performed by downloading images from SNS (e.g., Flickr), applying an existing detector (e.g., Faster R-CNN), and then using binary classification to determine whether a bbox contains a person’s whole body. The weakly supervised dataset collection approach is simple yet highly effective for producing a pre-trained detector. We confirmed that using a large number of bboxes (8.7 million bboxes in the WSPD dataset) in the pre-training task results in much better performance than that of the baseline detectors.


  • [1] B. Thomee, D. A. Shamma, G. Friedland, B. Elizalde, K. Ni, D. Poland, D. Borth, and L.-J. Li (2016) YFCC100M: The New Data in Multimedia Research. Commun. ACM 59, pp. 64–73. Cited by: §3.1.
  • [2] R. Benenson, M. Omran, J. Hosang, and B. Schiele (2014) Ten Years of Pedestrian Detection, What Have We Learned?. In European Conference on Computer Vision (ECCV) Workshop, Cited by: §2.
  • [3] M. Braun, S. Krebs, F. B. Flohr, and D. M. Gavrila (2019) EuroCity Persons: A Novel Benchmark for Person Detection in Traffic Scenes. IEEE Transactions on Pattern Analysis and Machine Intelligence (), pp. 1–1. External Links: Document, ISSN 0162-8828 Cited by: §1, §2, Table 1.
  • [4] N. Dalal and B. Triggs (2005) Histograms of Oriented Gradients for Human Detection. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2.
  • [5] P. Dollar, R. Appel, S. Belongie, and P. Perona (2014) Fast Feature Pyramids for Object Detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 36 (8), pp. 1532–1545. Cited by: §2.
  • [6] P. Dollár et al. (2009) Integral Channel Features. In British Machine Vision Conference (BMVC), Cited by: §2.
  • [7] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian Detection: A Benchmark. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.1, Table 1.
  • [8] X. Du, M. El-Khamy, J. Lee, and L. Davis (2017) Fused DNN: A Deep Neural Network Fusion Approach to Fast and Robust Pedestrian Detection. In Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.
  • [9] A. Ess, B. Leibe, K. Schindler, and L. van Gool (2009) Robust Multi-Person Tracking from a Mobile Platform. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI). Cited by: §2.
  • [10] M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2015) The Pascal Visual Object Classes Challenge: A Retrospective. International Journal of Computer Vision (IJCV) 111 (1), pp. 98–136. Cited by: Table 1.
  • [11] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan (2010) Object Detection with Discriminatively Trained Part Based Models. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 32 (9). Cited by: §2.
  • [12] R. Girshick (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [13] R. Girshick (2015) Fast R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [14] K. He (2017) Mask R-CNN. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [15] J. Hosang, M. Omran, R. Benenson, and B. Schiele (2015) Taking a Deeper Look at Pedestrians. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [16] I. Krasin, T. Duerig, N. Alldrin, V. Ferrari, S. Abu-El-Haija, A. Kuznetsova, H. Rom, J. Uijlings, S. Popov, S. Kamali, M. Malloci, J. Pont-Tuset, A. Veit, S. Belongie, V. Gomes, A. Gupta, C. Sun, D. Cai, Z. Feng, D. Narayanan, and K. Murphy (2017) OpenImages: A public dataset for large-scale multi-label and multi-class image classification.. Dataset available from https://storage.googleapis.com/openimages/web/index.html. Cited by: §1, Table 1.
  • [17] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal Loss for Dense Object Detection. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [18] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §1, Table 1.
  • [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §1, §2, §4.1.
  • [20] D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten (2018) Exploring the Limits of Weakly Supervised Pretraining. In European Conference on Computer Vision (ECCV), Cited by: §1, §2, §3.2.
  • [21] S. Munder and D. M. Gavrila (2006) An Experimental Study on Pedestrian Classification. IEEE Trans. on Pattern Analysis and Machine Intelligence (TPAMI) 28 (11), pp. 1863–1868. Cited by: §2.
  • [22] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Ferrari (2012) We don’t need no bounding-boxes: Training object class detectors using only human verification. In Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), Cited by: §2.
  • [23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You Only Look Once: Unified, Real-Time Object Detection. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), Cited by: §1, §2, §3.1.
  • [25] E. Simo-Serra and H. Ishikawa (2016) Fashion Style in 128 Floats: Joint Ranking and Classification using Weak Data for Feature Extraction. In Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2.
  • [26] H. Su, J. Deng, and L. Fei-Fei (2012) Crowdsourcing Annotations for Visual Object Detection. In AAAI Human Computation Workshop, Cited by: §2.
  • [27] C. Sun, A. Shrivastava, S. Singh, and A. Gupta (2017) Revisiting Unreasonable Effectiveness of Data in Deep Learning Era. In International Conference on Computer Vision (ICCV), Cited by: §2.
  • [28] P. Viola and M. Jones (2001) Rapid Object Detection using a Boosted Cascade of Simple Features. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [29] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen (2018) Repulsion Loss: Detecting Pedestrians in a Crowd. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [30] L. Zhang, L. Lin, X. Liang, and K. He (2015) Is Faster R-CNN Doing Well for Pedestrian Detection?. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [31] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele (2016) How Far are We from Solving Pedestrian Detection?. In Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [32] S. Zhang, R. Benenson, and B. Schiele (2017) CityPersons: A Diverse Dataset for Pedestrian Detection. In Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, Table 1.
  • [33] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling (2019) M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network. In Association for the Advancement of Artificial Intelligence (AAAI), Cited by: §2, §4.1.
  • [34] B. Zhou, L. Liu, A. Oliva, and A. Torralba (2014) Recognizing City Identity via Attribute Analysis of Geo-tagged Images. European Conference on Computer Vision (ECCV), pp. 519–534. Cited by: Figure 3, §3.1, §3.2.