Recent Advances in Object Detection in the Age of Deep Convolutional Neural Networks

by Shivang Agarwal, et al.

Object detection, the computer vision task of detecting instances of objects of a certain class (e.g., 'car', 'plane', etc.) in images, has attracted a lot of attention from the community during the last 5 years. This strong interest can be explained not only by the importance this task has for many applications but also by the phenomenal advances in this area since the arrival of deep convolutional neural networks (DCNN). This article comprehensively reviews the recent literature on object detection with deep CNNs and provides an in-depth view of these recent advances. The survey covers not only the typical architectures (SSD, YOLO, Faster-RCNN) but also discusses the challenges currently met by the community and goes on to show how the problem of object detection can be extended. This survey also reviews the public datasets and associated state-of-the-art algorithms.




1 Introduction

The task of automatically recognizing and locating objects in images and videos is important in order to make computers able to understand or interact with their surroundings. For humans, it is one of the primary tasks, in the paradigm of visual intelligence, in order to survive, work and communicate. If one wants machines to work for us or with us, they will need to make sense of their environment as well as humans do, or in some cases even better. Solving the problem of object detection with all the challenges it presents has been identified as a major precursor to solving the problem of semantic understanding of the surrounding environment.

A large number of academics as well as industry researchers have already shown their interest in it by focusing on applications such as autonomous driving, surveillance, relief and rescue operations, deploying robots in factories, pedestrian and face detection, brand recognition, visual effects in images, digitalizing texts, understanding aerial images, etc., which have object detection as a major challenge at their core.

The Semantic Gap, defined by Smeulders et al. [2000] as the lack of coincidence between the information one can extract from some visual data and its interpretation by a user in a given situation, is one of the main challenges object detection must deal with. There is indeed a difference of nature between raw pixel intensities contained in images and semantic information depicting objects.

Object detection is a natural extension of the classification problem. The added challenge is to correctly detect the presence and accurately locate the object instance(s) in the image. It is (usually) a supervised learning problem in which, given a set of training images, one has to design an algorithm which can accurately locate (with a rectangular box) and correctly classify as many object instances as possible while avoiding false detections of background or multiple detections of the same instance. The images can contain object instances from the same class, from different classes, or no instances at all. The object categories in the training and testing sets are supposed to be statistically similar. An instance can occupy very few pixels (0.01% to 0.25% of the image) as well as the majority of the pixels (80% to 90%). Apart from size, instances can vary in lighting, rotation, appearance, background, etc. There may not be enough data to cover all these variations well enough. Small objects, in particular, are detected poorly because less information is available to detect them. Some object instances can also be occluded. Most applications require this problem to be solved in real time. With the current state of the architectures that is not the case: to improve the speed of a model, one has to give up some of its accuracy.

We present this review to connect the dots between various deep learning and data driven techniques proposed in recent years, as they have brought about huge improvements in performance, even though the recently introduced object detection datasets are much more challenging. We intend to study what makes them work and what their shortcomings are. We discuss the seminal works in the field and the incremental works which are more application oriented. We also examine how they try to overcome each of the challenges. The earlier methods which were based on hand-crafted features are outside the scope of this review. The problems that are related to object detection, such as semantic segmentation, are also outside the scope of this review, except when used to bring contextual information to detectors. Salient object detection, being related to semantic segmentation, will also not be treated in this survey.

Several surveys related to object detection have been written in the past, addressing specific tasks such as pedestrian detection [Enzweiler and Gavrila, 2008], moving objects in surveillance systems [Joshi and Thakore, 2012], object detection in remote sensing images [Cheng and Han, 2016], face detection [Hjelmås and Low, 2001, Zhang and Zhang, 2010], and facial landmark detection [Wu and Ji, 2018], to cite only some illustrative examples. In contrast with this article, the aforementioned surveys do not cover the latest advances obtained with deep neural networks. We share the same motivations as [Zhao et al., 2018c], which was released on arXiv while we were finishing this survey, but we believe we have covered the topic more comprehensively.

The following subsections give an overview of the problem, some of the seminal works in the field (hand-crafted as well as data driven) and describe the task and evaluation methodology. Section 2 goes into the detail of the design of the current state-of-the-art models. Section 3 presents recent methodological advances as well as the main challenge modern detectors have to face. Section 4 shows how to extend the presented detectors to different detection tasks (video, 3D) or perform under different constraints (energy efficiency, training data, etc.). Section 5 informs about the popular datasets in the field and also reports state-of-the-art performance on these datasets. Finally, Section 6 concludes the review.

1.1 From Hand-crafted to Data Driven Detectors

While the first object detectors relied on mechanisms to align a 2D/3D model of the object on the image using simple features, such as edges [Lin et al., 2007], key-points [Lowe, 1999] or templates [Pentland et al., 1994], the arrival of Machine Learning (ML) was the first revolution to shake up the area. Among the most popular ML algorithms used for object detection were boosting (e.g., [Schneiderman and Kanade, 2004]) and Support Vector Machines (e.g., [Dalal and Triggs, 2005]). This first wave of ML-based detectors was based on hand-crafted (engineered) visual features processed by classifiers or regressors. These hand-crafted features were as diverse as Haar Wavelets [Viola et al., 2005], edgelets [Wu and Nevatia, 2005], shapelets [Sabzmeydani and Mori, 2007], histograms of oriented gradient [Dalal and Triggs, 2005], bags-of-visual-words [Lampert et al., 2008], integral histograms [Porikli, 2005], color histograms [Walk et al., 2010], covariance descriptors [Tuzel et al., 2008], local binary patterns [Wang et al., 2009], or their combinations [Enzweiler and Gavrila, 2011]. One of the most popular detectors before the DCNN revolution was the Deformable Part Model of Felzenszwalb et al. [2010] and its variants, e.g., [Sadeghi and Forsyth, 2014].

This very rich literature on visual descriptors has been wiped out in less than five years by Deep Convolutional Neural Networks, which are a class of deep, feed-forward artificial neural networks. DCNNs are inspired by the connectivity patterns between neurons of the human visual cortex and use no pre-processing, as the network itself learns the filters previously hand-engineered in traditional algorithms, making them independent from prior knowledge and human effort. They are said to be end-to-end trainable and rely solely on the training data. This leads to their major disadvantage of requiring copious amounts of data. The first use of ConvNets for detection and localization goes back to the early 1990s for faces [Vaillant et al., 1994], hands [Nowlan and Platt, 1995] and multi-character strings [Matan et al., 1992]. Then, in the 2000s, they were used for text [Delakis and Garcia, 2008], face [Garcia and Delakis, 2002, Osadchy et al., 2007] and pedestrian [Sermanet et al., 2013b] detection.

However, the merits of DCNN for object detection were recognized by the community only after the seminal work of Krizhevsky et al. [2012] and Sermanet et al. [2013a] on the challenging ImageNet dataset.

Krizhevsky et al. [2012] were the first to demonstrate localization through DCNN in the ILSVRC 2012 localization and detection tasks. Just one year later, Sermanet et al. [2013a] described how a DCNN can be used to locate and detect object instances. They won the ILSVRC 2013 localization and detection competition and also showed that combining the classification, localization and detection tasks can simultaneously boost the performance of all tasks.

The first DCNN-based object detectors applied a fine-tuned classifier on each possible location of the image in a sliding window manner [Oquab et al., 2015], or on some specific regions of interest obtained through a region proposal mechanism [Girshick et al., 2014]. Girshick et al. [2014] treated each region proposal as a separate classification and localization task. Therefore, given an arbitrary region proposal, they deformed it to a warped region of fixed dimensions. A DCNN was used to extract a fixed-length feature vector from each proposal, and category-specific linear SVMs were then used to classify these vectors. Since it was a region-based CNN, they called it R-CNN. Another important contribution was to show the usability of transfer learning in DCNN. Since data is scarce, supervised pre-training on an auxiliary task can lead to a significant boost to the performance of domain-specific fine-tuning.

Sermanet et al. [2013a], Girshick et al. [2014] and Oquab et al. [2015] were among the first authors to show that DCNN can lead to dramatically higher object detection performance on the ImageNet detection challenge [Deng et al., 2009] and PASCAL VOC [Everingham et al., 2010] as compared to previous state-of-the-art systems based on HOG [Dalal and Triggs, 2005] or SIFT [Lowe, 2004].

Since the most prevalent DCNNs had to use a fixed-size input, because of the fully connected layers at the end of the network, they had to either warp or crop the image to make it fit that size. He et al. [2015] came up with the idea of aggregating the feature maps of the final convolutional layer. Thus, the fully connected layer at the end of the network gets a fixed-size input even if the input images in the dataset are of varying sizes and aspect ratios. This helped reduce overfitting, increased robustness and improved the generalizability of the existing models. Compared to R-CNN, which used one forward pass per proposal to generate the feature map, the methodology proposed by He et al. [2015] allowed computation to be shared among all the proposals: a single forward pass is done for the whole image, and the region corresponding to each proposal is then selected from the final feature map. This naturally increased the speed of the network by over one hundred times.
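To make the aggregation idea concrete, here is a minimal pure-Python sketch of pyramid-style max pooling over a single-channel feature map. It is an illustration under stated assumptions, not the authors' implementation: the function name `spatial_pyramid_pool`, the pyramid levels (1, 2, 4) and the floor/ceil bin boundaries are choices made for this example.

```python
import math


def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n, the map is split into an n x n grid of bins and
    the maximum of each bin is kept, so the output length is sum(n * n) no
    matter what the input height and width are.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for by in range(n):
            # Bin boundaries use floor/ceil so that no bin is ever empty.
            y0 = math.floor(by * height / n)
            y1 = math.ceil((by + 1) * height / n)
            for bx in range(n):
                x0 = math.floor(bx * width / n)
                x1 = math.ceil((bx + 1) * width / n)
                pooled.append(max(feature_map[y][x]
                                  for y in range(y0, y1)
                                  for x in range(x0, x1)))
    return pooled


# Two feature maps of different sizes yield vectors of the same length.
small = [[float(r * 7 + c) for c in range(7)] for r in range(5)]
large = [[float(r + c) for c in range(9)] for r in range(13)]
vec_small = spatial_pyramid_pool(small)
vec_large = spatial_pyramid_pool(large)
```

With the assumed levels, every input produces 1 + 4 + 16 = 21 values per channel, which is what lets a fully connected layer follow feature maps of arbitrary size.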

All the previous approaches train the network in multistage pipelines that are complex, slow and inelegant. They include extracting features through CNNs, classifying through SVMs and finally fitting bounding box regressors. Since each task is handled separately, the convolutional layers cannot take advantage of end-to-end learning and bounding box regression. Girshick [2015] helped alleviate this problem by streamlining all the tasks in a single model using a multitask loss. As we will explain later, this not only improved the accuracy but also made the network run faster at test time.
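As a rough illustration of what a multitask detection loss looks like, the sketch below sums a classification term (negative log-likelihood of the true class) and a localization term (smooth L1 on the box offsets) that is only active for non-background regions. The function names, the convention that class 0 is background, and the `loc_weight` balancing parameter are assumptions made for this example.

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear elsewhere (less sensitive to outliers)."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_offsets, target_offsets,
                   loc_weight=1.0):
    """Classification loss plus localization loss for one region.

    class_probs: predicted probabilities, index 0 is assumed to be background.
    pred_offsets / target_offsets: four box regression values each.
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background regions have no box to regress.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t)
                   for p, t in zip(pred_offsets, target_offsets))
    return cls_loss + loc_weight * loc_loss


background = multitask_loss([0.5, 0.5], 0, (0, 0, 0, 0), (0, 0, 0, 0))
foreground = multitask_loss([0.2, 0.8], 1, (0, 0, 0, 0), (0.5, 0, 0, 2))
```

Because both terms are computed from the same shared features, gradients from the box regression also update the convolutional layers, which is the benefit of the single-model formulation described above.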

1.2 Overview of Recent Detectors

Once the foundations of DCNN-based object detection had been laid, the field could mature and move further away from classical methods. The fully-convolutional paradigm glimpsed in [Girshick, 2015] and [He et al., 2015] gained more traction every day in the community.

When Ren et al. [2015] successfully replaced the only component of Fast R-CNN that still relied on non-learned heuristics by inventing the RPN (Region Proposal Network), it put the last nail in the coffin of traditional object detection and started the age of completely end-to-end architectures. Specifically, the anchor mechanism developed for the RPN was here to stay. This grid of fixed priors (or anchors), not necessarily corresponding to the receptive field of the feature map pixel they lie on, created a framework for fully-convolutional classification and regression and is used nowadays by most pipelines, like [Liu et al., 2016] or [Lin et al., 2017b], to cite a few.
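The anchor grid can be sketched in a few lines: every feature map cell receives the same set of reference boxes, centered on the cell and varying in scale and aspect ratio. The function name `generate_anchors`, the default scales and ratios, and the definition of ratio as height divided by width are assumptions made for this illustration.

```python
import math


def generate_anchors(feat_h, feat_w, stride,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors in image coordinates.

    stride is the number of image pixels covered by one feature map cell.
    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of the cell, expressed in image pixels.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3, stride=16)
```

The network then predicts, for each anchor, a class score and four offsets relative to that anchor, which is what makes the classification and regression fully convolutional.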

These conceptual changes make the detection pipelines far more elegant and efficient than their counterparts when dealing with big training sets. However, it comes at a cost. The resulting detectors become complete black boxes, and, because they are more prone to overfitting, they require more data than ever.

[Ren et al., 2015] and its double-stage variants are now the go-to methods for object detection and will be thoroughly explored in Sec. 2.1.3. Although this line of work is now prominent, other choices were explored, all based on fully-convolutional architectures.

Faster single-stage algorithms, which had been completely abandoned since Viola et al. [2005], have now become reasonable alternatives thanks to the discriminative power of CNN features. Redmon et al. [2016] first showed that the simplest architectural design could bring very high speed with acceptable performance. Liu et al. [2016] refined the pipeline by using anchors at different layers while making it faster and more accurate than Redmon et al. [2016]. These two seminal works gave birth to a considerable amount of literature on single-stage methods that we will cover in Sec. 2.1.2. Boosting and Deformable Part-based Models, which were once the norm, have yet to make their comeback into the mainstream. However, some recent popular works, like Dai et al. [2017], used closely related ideas, and these approaches will therefore also be discussed in Sections 2.1.4 and 2.1.5.

The fully-convolutional nature of these new dominant architectures allows all kinds of implementation tricks during training and at inference time that will be discussed at the end of the next section. However, it makes the subtle design choices of the different architectures something of a dark art to the newcomers.

The goal of the rest of the survey is to provide a complete view of this new landscape while giving the keys to understand the underlying principles that guide interesting new architectural ideas. Before diving into the subject, the survey starts by reminding the readers about the object detection task and the metrics associated with it.

1.3 Tasks & Performance Evaluation

Object detection is one of the various tasks related to the inference of high-level information from images. Even if there is no universally accepted definition of it in the literature, it is usually defined as the task of localizing all the instances of a given category (e.g., 'car' instances in the case of car detection) while avoiding raising alarms when/where no instances are present. The localization can be provided as the center of the object on the image, as a bounding box containing the object, or even as the list of the pixels belonging to the object. In some rare cases, only the presence/absence of at least one instance of the category is sought, without any localization.

Object detection is related to but different from object segmentation, which aims to group pixels from the same object into a single region, and from semantic segmentation, which is similar to object segmentation except that the classes may also refer to varied backgrounds or 'stuff' (e.g., 'sky', 'grass', 'water', etc.). It is also different from Object Recognition, which is usually defined as recognizing (i.e., giving the name of the category of) an object contained in an image or a bounding box, assuming there is only one object in the image. For some authors, Object Recognition involves detecting all the objects in an image. Instance object detection is more restricted than object detection, as the detector is focused on a single object (e.g., a particular car model) and not any object of a given category. In the case of videos, the object detection task is to detect the objects on each frame of the video.

How to evaluate the performance of a detector depends on how the task is defined. However, only a few criteria reflect most of the current research. One of the most common metrics is the Average Precision (AP) as defined for the Pascal VOC challenge [Everingham et al., 2010]. It assumes ground truths are defined by non-rotated rectangular bounding boxes containing object instances, associated with class labels. The diversity of the methods to be evaluated prevents the use of ROC (Receiver Operating Characteristic) or DET (Detection Error Trade-off) curves, commonly used for face detection, as these would assume all the methods use the same window extraction scheme (such as the sliding window mechanism), which is not always the case. In the Pascal VOC challenge, object detection is evaluated by one separate AP score per category. For a given category, the Precision/Recall curve is computed from the ranked outputs (bounding boxes) of the method to be evaluated. Recall is the proportion of all positive examples that are ranked above a given rank, while precision is the proportion of the boxes above that rank that are positive. The AP summarizes the Precision/Recall curve and is defined as the mean (interpolated) precision over a set of eleven equally spaced recall levels. Output bounding boxes are judged as true positives (correct detections) if the overlap ratio (intersection over union, or IOU) with a ground-truth box exceeds 0.50. Detection outputs are assigned to ground truths in order of decreasing confidence score. Duplicated detections of the same object are considered false detections. The performance over the whole dataset is computed by averaging the APs across all the categories.
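The evaluation procedure described above can be sketched as follows for a single category. This is a simplified illustration, not the official evaluation code: it assumes all detections and ground truths come from one image, boxes are `(x1, y1, x2, y2)` tuples, and the function names are chosen for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_precision_11pt(detections, ground_truths, iou_threshold=0.5):
    """11-point interpolated AP for (score, box) detections of one category."""
    if not ground_truths:
        return 0.0
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    matched = [False] * len(ground_truths)
    precisions, recalls, true_positives = [], [], 0
    for rank, (_, box) in enumerate(ranked, start=1):
        overlaps = [iou(box, gt) for gt in ground_truths]
        best = max(range(len(overlaps)), key=overlaps.__getitem__)
        # A detection is correct only if it overlaps enough with a ground
        # truth that no higher-ranked detection has already claimed.
        if overlaps[best] > iou_threshold and not matched[best]:
            matched[best] = True
            true_positives += 1
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truths))
    total = 0.0
    for i in range(11):
        level = i / 10
        reachable = [p for p, r in zip(precisions, recalls) if r >= level]
        total += max(reachable) if reachable else 0.0
    return total / 11


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (0, 0, 10, 10)), (0.7, (20, 20, 30, 30))]
ap = average_precision_11pt(dets, gts)
```

In the example, the second detection duplicates the first and is counted as a false detection, which lowers the precision reached at full recall.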

The recent and popular MSCOCO challenge [Lin et al., 2014] relies on the same principles. The main difference is that the overall performance (mAP) is obtained by averaging the AP obtained with 10 different IOU thresholds between 0.50 and 0.95. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) also has a detection task in which algorithms have to produce triplets of class labels, bounding boxes and confidence scores. Each image mostly has one dominant object in it. A missed object is penalized in the same way as a duplicate detection, and the winner of the detection challenge is the entry that achieves the best AP on the most object categories. The challenge also has an Object Localization task, with a slightly different definition. The motivation is to not penalize algorithms if one of the detected objects is actually present but not included in the ground-truth annotations, which is not rare due to the size of the dataset and the number of categories (1000). Algorithms are expected to produce 5 class labels (in decreasing order of confidence) and 5 bounding boxes (one for each class label). The error of an algorithm on an image is 0 if one of the 5 bounding boxes is a true positive (correct class label and correct localization according to IOU), and 1 otherwise. The error is averaged over all the images of the test set.
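The two protocols above can be sketched as below. The code is illustrative only: the function names are hypothetical, the 0.05 spacing of the ten IOU thresholds follows from ten equally spaced values between 0.50 and 0.95, and the 0.5 IOU threshold used for the localization error is an assumption of this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_ap_over_thresholds(ap_at_threshold):
    """Average a per-threshold AP function over IOU = 0.50, 0.55, ..., 0.95."""
    thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]
    return sum(ap_at_threshold(t) for t in thresholds) / len(thresholds)


def localization_error(predictions, ground_truths, iou_threshold=0.5):
    """0 if any of the (up to 5) guesses is a true positive, 1 otherwise.

    predictions and ground_truths are lists of (label, box) pairs.
    """
    for label, box in predictions[:5]:
        for gt_label, gt_box in ground_truths:
            if label == gt_label and iou(box, gt_box) > iou_threshold:
                return 0
    return 1


images = [
    ([("dog", (0, 0, 10, 10)), ("cat", (50, 50, 60, 60))],
     [("cat", (50, 50, 61, 61))]),
    ([("dog", (0, 0, 10, 10))],
     [("cat", (0, 0, 10, 10))]),
]
mean_error = sum(localization_error(p, g) for p, g in images) / len(images)
```

In the second image the box is perfectly localized but carries the wrong label, so it still counts as an error.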

Some recent datasets, like DOTA [Xia et al., 2017], proposed two tasks, named detection on horizontal bounding boxes and detection on oriented bounding boxes, corresponding to two different kinds of ground truths (with or without target orientations), no matter how the methods were trained. In some other datasets, the scale of the detection is not important and a detection is counted as a True Positive if its coordinates are close enough to the center of the object. This is the case for the VeDAI dataset [Razakarivony and Jurie, 2016]. In the particular case of object detection in 3D point clouds, such as in the KITTI object detection benchmark [Geiger et al., 2012], the criterion is similar to that of Pascal VOC, except that the boxes are in 3D and the overlap is measured in terms of volume intersection.

Regarding the detection of objects in videos, the most common practice is to evaluate the performance by considering each frame of the video as being an independent image and averaging the performance over all the frames, as done in the ImageNet VID challenge [Russakovsky et al., 2015].

This survey only covers the methodologies for performance evaluation found in the recent literature. But, beside these common evaluation measures, there are a lot of more specific ones, as object detection can be combined with other complex tasks, e.g., 3D orientation and layout inference in [Xiang and Savarese, 2012]. The reader can refer to the review by Mariano et al. [2002] to explore this topic. It is also worth mentioning the very recent work of Oksuz et al. [2018] which proposes a novel metric providing richer and more discriminative information than AP, especially with respect to the localization error.

Having introduced the topic and touched upon some general information, we now get to the heart of object detection in the next section, which presents the designs of recent deep learning based object detectors.

2 On the Design of Modern Deep Detectors

Here we analyze, investigate and dissect the current state-of-the-art models and the intuition behind their approaches. We can divide the whole detection pipeline into three major parts. The first part focuses on the arrangement of convolutional layers to get proposals (if required) and box predictions. The second part is about setting various training hyper-parameters, deciding upon the losses, weight initializations, etc., to make the model converge faster. The third part covers the various approaches used to refine the predictions of the converged model(s) at test time and thereby get better detection performance. The first part has received the attention of most researchers, the second and third parts much less so.

Most of the ideas from the following sub-sections have achieved top accuracies on the challenging MS COCO [Lin et al., 2014] object detection challenge and PASCAL VOC [Everingham et al., 2010] detection challenge or on some other very challenging datasets.

2.1 Architecture of the Networks

The architecture of the DCNN object detectors follows a Lego-like construction pattern based on chaining different building blocks. The first part of this Section will focus on what researchers call the backbone of the DCNN, meaning the feature extractor from which the detector draws its discriminative power. We will then tackle diverse arrangements of increasing complexity found in DCNN detectors: from single stage to multiple stages methods. Finally, we will talk about the Deformable Part Models and their place in the deep learning landscape.

2.1.1 Backbone Networks

A lot of deep neural networks originally designed for classification tasks have been adopted for the detection task as well. And a lot of modifications have been done on them to adapt for the additional difficulties encountered. The following discussion is about these networks and the modifications in question.


Backbone networks play a major role in object detection models. Huang et al. [2017c] partially confirmed the common observation that, as the classification performance of the backbone increases on ImageNet classification task [Russakovsky et al., 2015], so does the performance of object detectors based on those backbones. It is the case at least for popular double-stage detectors like Faster-RCNN [Ren et al., 2015] and R-FCN [Dai et al., 2016b] although for SSD [Liu et al., 2016] the object detection performance remains around the same (see the following Sections for details about these 3 architectures).

However, as the size of the network increases, the inference and the training become slower and require more data. The most popular architectures in increasing order of inference time are MobileNet [Howard et al., 2017], VGG [Simonyan and Zisserman, 2014], Inception [Szegedy et al., 2015, Ioffe, 2017, Szegedy et al., 2016], ResNet [He et al., 2016], Inception-ResNet [Szegedy et al., 2017], etc. All of the above architectures were first borrowed from the classification problem with little or no modification.

Some other backbones used in object detectors which were not included in the analysis of [Huang et al., 2017c] but have given state-of-the-art performances on ImageNet [Deng et al., 2009] or COCO [Lin et al., 2014] detection tasks are Xception [Chollet, 2017], DarkNet [Redmon and Farhadi, 2017], Hourglass [Newell et al., 2016], Wide-Residual Net [Zagoruyko and Komodakis, , Lee et al., 2017b], ResNeXt [Xie et al., 2017], DenseNet [Huang et al., 2017b], Dual Path Networks [Chen et al., 2017c] and Squeeze-and-Excitation Net [Hu et al., 2017]. The recent DetNet [Li et al., 2018c] is a backbone network designed specifically for high-performance detection. It avoids the large down-sampling factors present in classification networks. Dilated Residual Networks [Yu et al., 2017] also worked with similar motivations to extract features with fewer strides. SqueezeNet [Iandola et al., 2016] and ShuffleNet [Zhang et al., 2017c] chose instead to focus on speed. More information on networks focusing on speed can be found in Section 4.2.4.

Adapting the mentioned backbones to the inherent multi-scale nature of object detection is a challenge. In the following paragraphs, we give examples of commonly used strategies.

Multi-scale detections:

Papers [Cai et al., 2016, Li et al., 2017b, Yang et al., 2016a] made independent predictions on multiple feature maps to take into account objects of different scales. The lower layers with finer resolution have generally been found better for detecting small objects than the coarser top layers. Similarly, coarser layers are better for the bigger objects. Liu et al. [2016] were the first to use multiple feature maps for detecting objects. Their method has been widely adopted by the community. Since final feature maps of the networks may not be coarse enough to detect sizable objects in large images, additional layers are also usually added. These layers have a wider receptive field.

Fusion of layers:

In object detection, it is also helpful to make use of the context pixels of the object [Zeng et al., 2017, Zagoruyko et al., 2016, Gidaris and Komodakis, 2015]. One interesting argument in favor of fusing different layers is that it integrates information from feature maps with different receptive fields, and can thus use the surrounding local context to disambiguate some of the object instances.

Some papers [Chen et al., 2017d, Fu et al., 2017, Jeong et al., 2017, Lee et al., 2017a, Zheng et al., 2018] have experimented with fusing different feature layers of these backbones so that the finer layers can make use of the context learned in the coarser layers. Lin et al. [2017b, a], Shrivastava et al. [2016b], Woo et al. [2018] went one step further and proposed a whole additional top-down network, connected to the standard bottom-up network through lateral connections. The bottom-up network can be any of the backbones mentioned above. While Shrivastava et al. [2016b] used only the finest layer of the top-down architecture for detection, Feature Pyramid Network (FPN) [Lin et al., 2017b] and RetinaNet [Lin et al., 2017a] used all the layers of the top-down architecture for detection. FPN used the feature maps thus generated in a two-stage detector fashion while RetinaNet used them in a single-stage detector fashion (see Section 2.1.3 and Section 2.1.2 for more details). FPN [Lin et al., 2017b] has been a part of the top entries in the MS COCO 2017 challenge.
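A stripped-down sketch of a top-down pathway with lateral connections is given below. It only shows the data flow: each coarser map is upsampled and added to the next finer bottom-up map. The learned convolutions that real architectures apply on the lateral and merged maps are deliberately left out, and the function names, the single-channel list-of-rows representation and the exact factor-of-two size relation between levels are assumptions of this example.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def top_down_merge(bottom_up_maps):
    """Fuse bottom-up maps, given from finest to coarsest.

    Each map is assumed to be exactly half the size of the previous one.
    Returns the merged maps in the same finest-to-coarsest order.
    """
    merged = [bottom_up_maps[-1]]  # the coarsest map starts the pathway
    for lateral in reversed(bottom_up_maps[:-1]):
        upsampled = upsample_nearest_2x(merged[0])
        fused = [[a + b for a, b in zip(lateral_row, up_row)]
                 for lateral_row, up_row in zip(lateral, upsampled)]
        merged.insert(0, fused)
    return merged


fine = [[1.0] * 4 for _ in range(4)]
middle = [[10.0] * 2 for _ in range(2)]
coarse = [[100.0]]
pyramid = top_down_merge([fine, middle, coarse])
```

After merging, the finest map carries a contribution from every coarser level, which is how context learned in the coarse layers reaches the layers used for small objects.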

Now that we have seen how to best use the feature maps of the object detectors backbones we can explore the architectural details of the different major players in DCNN object detection, starting with the most immediate methods: single-stage detectors.

2.1.2 Single Stage Detectors

The two most popular approaches in single stage detection category are YOLO [Redmon et al., 2016] and SSD [Liu et al., 2016]. In this Section we will go through their basic functioning, some upsides and downsides of using these two approaches and further improvements proposed on them.


Redmon et al. [2016] presented for the first time a single-stage method for object detection in which raw image pixels are converted to bounding box coordinates and class probabilities, and which can be optimized end-to-end directly. This made it possible to predict boxes directly in a single feed-forward pass without reusing any component of the neural network or generating proposals of any kind, thus speeding up the detector.

They started by dividing the image into a grid and assuming B bounding boxes per grid cell. Each cell containing the center of an object instance is responsible for the detection of that object. Each bounding box predicts 4 coordinates, objectness and class probabilities. This reframed object detection as a regression problem. To have a receptive field that covers the whole image, they included a fully connected layer towards the end of the network.
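The grid assignment can be illustrated with a short sketch that finds the cell responsible for a ground-truth box and encodes the box relative to that cell. The function name `encode_target`, the default 7 x 7 grid and the exact encoding (center offsets within the cell, width and height normalized by the image size) are assumptions made for this example.

```python
def encode_target(box, image_w, image_h, grid_size=7):
    """Return (row, col, x_offset, y_offset, width, height) for one box.

    box is (x1, y1, x2, y2) in pixels. The cell containing the box center is
    the one responsible for predicting it.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    # Center position measured in grid cells.
    gx = cx / image_w * grid_size
    gy = cy / image_h * grid_size
    # Clamp so that a center lying on the right or bottom border stays inside.
    col = min(int(gx), grid_size - 1)
    row = min(int(gy), grid_size - 1)
    width = (box[2] - box[0]) / image_w
    height = (box[3] - box[1]) / image_h
    return row, col, gx - col, gy - row, width, height


top_left = encode_target((0, 0, 64, 64), image_w=448, image_h=448)
bottom_right = encode_target((384, 384, 448, 448), image_w=448, image_h=448)
```

All regression targets produced this way lie between 0 and 1, which keeps the regression problem well scaled regardless of the image resolution.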


Liu et al. [2016], inspired by the Faster-RCNN architecture, used reference boxes of various sizes and aspect ratios to predict object instances but they completely got rid of the region proposal stage (discussed in the following Section). They were able to do this by making the whole network work as a regressor as well as a classifier. During training, thousands of default boxes corresponding to different anchors on different feature maps learned to discriminate between objects and background. They also learned to directly localize and predict class probabilities for the object instances. This was achieved with the help of a multitask loss. Since, during inference time a lot of boxes try to localize the objects, generally a post-processing step like Greedy NMS is required to suppress duplicate detections.
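Greedy NMS, mentioned above as the usual post-processing step, can be sketched as follows: detections are visited in order of decreasing score, and any remaining box that overlaps a kept box too much is discarded. The function names and the 0.5 overlap threshold are assumptions of this illustration, and the sketch handles a single class.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(detections, iou_threshold=0.5):
    """Keep the highest-scoring (score, box) pairs, dropping near duplicates."""
    remaining = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Discard every lower-scoring box that overlaps the kept one too much.
        remaining = [d for d in remaining
                     if iou(best[1], d[1]) <= iou_threshold]
    return kept


detections = [
    (0.8, (1, 1, 11, 11)),
    (0.9, (0, 0, 10, 10)),
    (0.7, (20, 20, 30, 30)),
]
kept = greedy_nms(detections)
```

In a multi-class detector this routine would typically be run once per class, so that overlapping objects of different categories do not suppress each other.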

In order to accommodate objects of all the sizes they added additional convolutional layers to the backbone and used them, instead of a single feature map, to improve the performance. This method was later applied to approaches related to two-stage detectors too [Lin et al., 2017a].

Pros and Cons:

Single stage detectors generally do not give as good performance as the double-stage ones, but they are a lot faster [Huang et al., 2017c].

The main advantages of the YOLO strategy are that it is extremely fast, running at 45 to 150 frames per second, and that it sees the entire image, as opposed to region proposal based strategies, which helps it encode contextual information and learn generalizable representations of objects. But it also has some obvious disadvantages. Since each grid cell has only two bounding boxes, it can predict at most two objects per grid cell. This is a particularly inefficient strategy for small objects. It also struggles to localize some objects precisely as compared to two-stage detectors. Another drawback of YOLO is that it uses a coarse feature map at a single scale only.

To address these issues, SSD used a dense set of boxes and considered predictions from various feature maps instead of one. It improved upon the performance of YOLO. But since it has to sample from this dense set of detections at test time, it gives lower performance on the MS COCO dataset than two-stage detectors, which only have to perform predictions on a sparse set of proposals.

Further improvements:

Redmon and Farhadi [2017] and Redmon and Farhadi [2018] suggested many small changes in versions 2 and 3 of the YOLO method. Changes such as applying batch normalization, using higher resolution input images, removing the fully connected layer to make the network fully convolutional, clustering box dimensions, location prediction and multi-scale training helped to improve performance, while a custom network (DarkNet) helped to improve speed.

Many further developments have been proposed on the Single Shot MultiBox Detector. Deconvolutional Single Shot Detector (DSSD) [Fu et al., 2017], instead of the element-wise sum, used a deconvolutional module to increase the resolution of top layers and combined each layer with the previous one through element-wise products. Rainbow SSD [Jeong et al., 2017] proposed to concatenate features of shallow layers to top layers by max-pooling, as well as features of top layers to shallow layers through a deconvolution operation. The final fused information increased from a few hundred to 2,816 channels per feature map. RUN [Lee et al., 2017a] proposed a 3-way residual block to combine adjacent layers before the final prediction. Cao et al. [2017] used concatenation modules and element-sum modules to add contextual information in a slightly different manner. Zheng et al. [2018] slightly tweaked DSSD by fusing fewer layers and adding extra ConvNets to improve speed as well as performance.

They all improved upon the performance of conventional SSD, and their results lie within a small range of each other on the Pascal VOC 2012 test set [Everingham et al., 2010], but they added a considerable computational cost, making them a little slower. WeaveNet [Chen et al., 2017d] aimed at reducing computational costs by gradually sharing the information from adjacent scales in an iterative manner. They hypothesized that by weaving the information iteratively, sufficient multi-scale context information can be transferred and integrated into the current scale.

Recently three strong candidates have emerged for replacing the undying YOLO and SSD variants:

  • RetinaNet [Lin et al., 2017b] borrowed the FPN structure but in a single stage setting. It is similar in spirit to SSD but it deserves its own paragraph given its growing popularity based on its speed and performance. The main new advance of this pipeline is the focal loss, which we will discuss in Section 2.2.1.

  • RefineDet [Zhang et al., 2018d] tried to combine the advantages of double-staged and single-stage methods by incorporating two new modules in the classic single stage architecture. The first one, the ARM (Anchor Refinement Module), is used in the fashion of multi-stage detectors to reduce the search space and to iteratively refine the localization of the detections. The ODM (Object Detection Module) takes the output of the ARM to produce fine-grained classification and further improve the localization.

  • CornerNet [Law and Deng, 2018] offered a new approach for object detection by predicting bounding boxes as paired top-left and bottom right keypoints. They also demonstrated that one can get rid of the prominent anchors step while gaining accuracy and precision. They used fully convolutional networks to produce independent score heat maps for both corners for each class in addition to learning an embedding for each corner. The embedding similarities were then used to group them into multiple bounding boxes. It beat its two (less original) competing rivals on COCO.

However, most methods used in competitions until now are predominantly double-staged methods because their structure is better suited for fine-grained classification. This is the subject of the next Section.

2.1.3 Double Stage Detectors

The process of detecting objects can be split into two parts: proposing regions & classifying and regressing bounding boxes. The purpose of the proposal generator is to present the classifier with class-agnostic rectangular boxes which try to locate the ground-truth instances. The classifier, then, tries to assign a class to each of the proposals and further fine-tune the coordinates of the boxes.

Region proposal:

Hosang et al. [2014] presented an in-depth review of ten "non-data driven" object proposal methods including Objectness [Alexe et al., 2010, 2012], CPMC [Carreira and Sminchisescu, 2010, 2011], Endres and Hoiem [2010, 2014], Selective Search [Van de Sande et al., 2011, Uijlings et al., 2013], Rahtu et al. [2011], Randomized Prim [Manen et al., 2013], Bing [Cheng et al., 2014], MCG [Pont-Tuset et al., 2017], Rantalankila et al. [2014], Humayun et al. [2014] and EdgeBoxes [Zitnick and Dollar, 2014] and evaluated their effect on the detector’s performance. Also, Xiao et al. [2015] developed a novel distance metric for grouping two super-pixels in high-complexity scenarios. Out of all these approaches Selective Search and EdgeBoxes gave the best recall and speed. The former is an order of magnitude slower than Fast R-CNN while the latter, which is not as efficient, took as much time as a detector. The bottleneck lay in the region proposal part of the pipeline.

Deep learning based approaches [Erhan et al., 2014, Szegedy et al., 2014] had also been used to propose regions, but they were not end-to-end trainable for detection and required input images to be of fixed size. In order to address strong localization bias, Chen et al. [2015b] proposed a box-refinement method based on the super-pixel tightness distribution. DeepMask [Pinheiro et al., 2015] and SharpMask [Pinheiro et al., 2016] proposed segmentation based object proposals with very deep networks. Kang et al. [2015] estimated the objectness of image patches by comparing them with exemplar regions from prior data and finding the ones most similar to them.

The next question then became apparent: how can deep learning methods be streamlined into existing approaches to give an elegant, simple, end-to-end trainable and fully convolutional model? In what follows we discuss two widely adopted approaches in two-stage detectors, the pros and cons of using them and the further improvements made on them.


The seminal Faster-RCNN paper [Ren et al., 2015] showed that the same backbone architecture used in Fast R-CNN for classification can be used to generate proposals as well. They proposed an efficient, fully convolutional, data-driven approach for proposing regions called the Region Proposal Network (RPN). The RPN learned the "objectness" of all instances and accumulated the proposals to be used by the detector part of the backbone. The detector further classified and refined bounding boxes around those proposals. The RPN and the detector can be trained separately as well as jointly. When the RPN shares convolutional layers with the detector, region proposals come at very little extra cost. Since it has two parts, one for generating proposals and one for detection, it comes under the category of two-stage detectors.

Faster-RCNN used thousands of reference boxes, commonly known as anchors. Anchors formed a grid of boxes that act as starting points for regressing bounding boxes. These anchors were then trained end-to-end to regress to the ground truth and an objectness score was calculated per anchor. The density, size and aspect ratio of anchors are decided according to the general range of size of object instances expected in the dataset and the receptive field of the associated neuron in the feature map.
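As an illustration of how such a grid of anchors can be laid out, the sketch below places one anchor per scale and aspect ratio at the center of every feature map cell. The function name, the stride, and the default scales and ratios are assumptions chosen for the example; in practice they are tuned to the dataset, as explained above.

```python
import math


def generate_anchors(feature_w, feature_h, stride,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a list of anchors (cx, cy, w, h) in image pixels.

    One anchor is created per (scale, ratio) pair at the center of each
    feature map cell. `ratio` is height / width, and every anchor of a given
    scale covers the same area, scale ** 2.
    """
    anchors = []
    for row in range(feature_h):
        for col in range(feature_w):
            # Center of the cell, mapped back to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors
```

With these defaults, a feature map of 2x3 cells yields 2 * 3 * 9 = 54 anchors, which shows how quickly the count reaches the thousands on realistic feature map sizes.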

RoI Pooling, introduced in [Girshick, 2015], warped the proposals generated by the RPN to fixed size vectors for feeding to the detection sub-network as its inputs. The quantization and rounding operation defining the pooling cells introduced misalignments and actually hurt localization.


To avoid running the costly RoI-wise subnetwork in Faster-RCNN hundreds of times, i.e. once per proposal, Dai et al. [2016b] got rid of it and shared the convolutional network end to end. To achieve this they proposed the idea of position sensitive feature maps. In this approach each feature map was responsible for outputting a score for a specific part of the target class, such as top-left, center, bottom-right, etc. The parts were identified with RoI-Pooling cells which were distributed alongside each part-specific feature map. Final scores were obtained by average voting over every part of the RoI from the respective filter. This implementation trick introduced some more translational variance to structures that were essentially translation-invariant by construction. Translational variance in object detection can be beneficial for learning localization representations. Although this pipeline seems to be more precise, it is not always better performance-wise than its Faster R-CNN counterpart.

From an engineering point of view, this method of Position sensitive RoI-Pooling (PS Pooling) also prevented the loss of information at RoI Pooling stage in Faster-RCNN. It improved the overall inference time speed of two-stage detectors but performed slightly worse.

Pros and Cons:

RPNs are generally configured to generate nearly 300 proposals to get state-of-the-art performances. Since each proposal passes through a head of convolutional layers and fully connected layers to classify the objects and fine-tune the bounding boxes, the overall speed decreases. Although these detectors are slow and not suited to real-time applications, the ideas based on these approaches give some of the best performances in the challenging COCO detection task. Another drawback is that Ren et al. [2015] and Dai et al. [2016b] used coarse feature maps at a single scale only. This is not sufficient when objects of diverse sizes are present in the dataset.

Further improvements:

Many improvements have been suggested on the above methodologies concerning speed, performance and computational efficiency.

DeepBox [Kuo et al., 2015] proposed a light weight generic objectness system by capturing semantic properties. It helped in reducing the burden of localization on the detector as the number of classes increased. Light-head R-CNN [Li et al., 2017f] proposed a smaller detection head and thin feature maps to speed up two-stage detectors. Singh et al. [2017] brought R-FCN to 30 fps by sharing position sensitive feature maps across classes. Using slight architectural changes, they were also able to bring the number of classes predicted by R-FCN to 3000 without losing too much speed.

Several improvements have been made to RoI-Pooling. The spatial transformer of Jaderberg et al. [2015] used a differentiable re-sampling grid based on bilinear interpolation and can be used in any detection pipeline. Chen et al. [2016b] used this for face detection, where faces were warped to fit canonical poses. Dai et al. [2016a] proposed another type of pooling called RoI Warping, based on bilinear interpolation. Ma et al. [2018] were the first to introduce a rotated RoI-Pooling working with oriented regions (more on oriented RoI-Pooling can be found in Section 3.2.2). Mask R-CNN [He et al., 2017] proposed RoI Align to address the problem of misalignment in RoI Pooling; it used bilinear interpolation to calculate the value of four regularly sampled locations in each cell. It was also the first step towards a RoI-Pooling that is differentiable with respect to the coordinates of the regions. It brought consistent improvements to all Faster R-CNN baselines on COCO. Recently, Jiang et al. [2018] introduced a Precise RoI Pooling based on interpolating not just 4 spatial locations but a dense region, which allowed full differentiability with no misalignments.
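To make the role of bilinear interpolation concrete, here is a minimal sketch of how one pooling cell could be evaluated in the spirit of RoI Align: the cell is sampled at four regularly spaced, non-quantized locations and each sample is bilinearly interpolated from the feature map. The function names are hypothetical, the feature map is a plain list of rows, and averaging the four samples is an assumption of this example.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate `feature` (list of rows) at continuous (x, y).

    x runs along columns and y along rows; coordinates are clamped to the map.
    """
    height, width = len(feature), len(feature[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    dx, dy = x - x0, y - y0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align_cell(feature, x_min, y_min, x_max, y_max, samples=2):
    """Average samples x samples regularly spaced bilinear samples in one cell.

    The cell borders are kept as floats: no rounding to the feature grid.
    """
    total = 0.0
    for i in range(samples):
        for j in range(samples):
            sx = x_min + (j + 0.5) * (x_max - x_min) / samples
            sy = y_min + (i + 0.5) * (y_max - y_min) / samples
            total += bilinear_sample(feature, sx, sy)
    return total / (samples * samples)
```

Because the sampling locations are never rounded, a small shift of the region changes the pooled value smoothly, which is what removes the misalignment introduced by the quantization in RoI Pooling.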

Li et al. [2016a], Yu et al. [2016b] also used contextual information and aspect ratios while StuffNet [Brahmbhatt et al., 2017] trained for segmenting amorphous categories such as ground and water for the same purpose. Chen and Gupta [2017] made use of memory to take advantage of context in detecting objects. Li et al. [2018b] incorporated Global Context Module (GCM) to utilize contextual information and Row-Column Max Pooling (RCM Pooling) to better extract scores from the final feature map as compared to the R-FCN method.

Deformable R-FCN [Dai et al., 2017] brought flexibility to the fixed geometric transformations at the Position sensitive RoI-Pooling stage of R-FCN by learning additional offsets for each spatial sampling location using a different network branch, in addition to other tricks discussed in Section 2.1.5. Lin et al. [2017a] proposed to use a network with multiple final feature maps with different coarseness to adapt to objects of various sizes. Zagoruyko et al. [2016] used skip connections with the same motivation. Mask-RCNN [He et al., 2017], in addition to RoI-align, added a branch in parallel to the classification and bounding box regression for optimizing the segmentation loss. Additional training for segmentation led to an improvement in the performance of the object detection task as well.

Double-staged methods now clearly dominate among the best performing object detection DCNNs. However, for certain applications two-stage methods are not enough to get rid of all the false positives.

2.1.4 Cascades

Traditional one-class object detection pipelines resorted to boosting like approaches for improving the performance where uncorrelated weak classifiers (better than random chance but not too correlated with the true predictions) are combined to form a strong classifier. With modern CNNs, as the classifiers are quite strong, the attractiveness of those methods has plummeted. However, for some specific problems where there are still too many false positives, researchers still find it useful. Furthermore, if the weak CNNs used are very shallow it can also sometimes increase the overall speed of the method.

One of the first ideas that were developed was to cascade multiple CNNs. Li et al. [2015] and Yang and Nevatia [2016] both used a three-staged approach by chaining three CNNs for face detection. The former approach scanned the image using a patch CNN to reject 90% of the non-face regions in a coarse manner. The remaining detections were offset by a second CNN and given as input to a CNN that continued rejecting false positives and refining regressions. The final candidates were then passed on to a classification network which output the final score. The latter approach created separate score maps for different resolutions using the same FCN on different scales of the test image (image pyramid). These score maps were then up-sampled to the same resolution and added to create a final score map, which was then used to select proposals. Proposals were then passed to the second stage where two different verification CNNs, trained on hard examples, eradicated the remaining false positives: the first was a four-layer FCN trained from scratch and the second an AlexNet [Krizhevsky et al., 2012] pre-trained on ImageNet.

All the approaches mentioned in the last paragraph are ad hoc: the CNNs are independent of each other, there is no overall design, therefore, they could benefit from integrating the elegant zooming module that is the RoI-Pooling. The RoI-Pooling can act like a glue to pass the detections from one network to the other, while doing the down-sampling operation locally. Dai et al. [2016a] used a Mask R-CNN like structure that first proposed bounding boxes, then predicted a mask and used a third stage to perform fine grained discrimination on masked regions that are RoI-Pooled a second time.

Ouyang et al. [2017], Wang et al. [2017a] optimized in an end-to-end manner a Faster R-CNN with multiple stages of RoI-Pooling. Each stage accepted only the highest scored proposals from the previous stage and added more context and/or localized the detection better. Additional information about context was then used to do fine grained discrimination between hard negatives and true positives in [Ouyang et al., 2017], for example. On the contrary, Zhang et al. [2016a] showed that for pedestrian detection, RoI-Pooling too coarse a feature map actually hurts the result. Therefore, they used the RPN proposals of a Faster R-CNN in a boosting pipeline involving a forest (Tang et al. [2017c] acted similarly for small vehicle detection). This problem has since been alleviated by the use of feature pyramid networks with higher resolution feature maps.

Yang et al. [2016a], aware of the problem raised by Zhang et al. [2016a], used RoI-Pooling on multiple scaled feature maps of all the layers of the network. The classification function on each layer was learned using the weak classifiers of AdaBoost and then approximated using a fully connected neural network. While all the mentioned pipelines are hard cascades where the different classifiers are independent, it is sometimes possible to use a soft cascade where the final score is a linear weighted combination of the scores given by the different weak classifiers like in Angelova et al. [2015]. They used 200 stages (instead of 2000 stages in their baseline with AdaBoost [Benenson et al., 2012]) to keep recall high enough while improving precision. To save computations that would be otherwise unmanageable, they terminated the computations of the weighted sum whenever the score for a certain number of classifiers fell under a specified threshold (there are, therefore, as many thresholds to learn as there are classifiers). These thresholds are then really important because they control the trade-off between speed, recall and precision.
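The early-termination mechanism of a soft cascade can be sketched as follows. This is a schematic illustration under simplifying assumptions, not the procedure of Angelova et al. [2015]: the function name is hypothetical, each weak classifier is any callable returning a score, and the rejection test compares the running weighted sum with one threshold per stage.

```python
def soft_cascade(sample, classifiers, weights, thresholds):
    """Evaluate a soft cascade on `sample`.

    The final score is the weighted sum of the weak classifier scores. After
    each stage the partial sum is compared with that stage's rejection
    threshold; evaluation stops as soon as it falls below the threshold, so
    the remaining classifiers are never run.

    Returns (score, stages_evaluated); score is None when rejected early.
    """
    partial = 0.0
    stages = zip(classifiers, weights, thresholds)
    for evaluated, (classifier, weight, threshold) in enumerate(stages, start=1):
        partial += weight * classifier(sample)
        if partial < threshold:
            return None, evaluated
    return partial, len(classifiers)
```

Raising the thresholds rejects candidates earlier and saves computation at the price of recall, which is the trade-off between speed, recall and precision discussed above.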

All the previous works in this Section involved a small fixed number of localization refinement steps, which might cause proposals to be imperfectly aligned with the ground truth, which in turn might impact the accuracy. That is why many works proposed iterative bounding box regression (a while loop on localization refinement until a stopping condition is reached). Najibi et al. [2016], Rajaram et al. [2016] started with a regularly spaced grid of sparse pyramid boxes (only 200 non-overlapping boxes in Najibi et al. [2016], whereas Rajaram et al. [2016] used all Faster R-CNN anchors on the grid) that were iteratively pushed towards the ground truth according to the feature representation obtained from RoI-Pooling the current region. An interesting finding was that, even if the goal was to use as many refinement steps as necessary, regressing the boxes only twice can in fact be sufficient if the seed boxes or anchors span the space appropriately [Najibi et al., 2016]. Approaches proposed by Gidaris and Komodakis [2016a] and Li et al. [2017a] can also be viewed, internally, as iterative regression based methods proposing regions for detectors, such as Fast R-CNN.

The boosting and multistage methods we have seen previously exhibit very different possible combinations of DCNNs. We nevertheless thought it would be interesting to devote a Section to a special kind of method that was hinted at in the previous Sections, namely the part-based models, if not for their performance then at least for their historical importance.

2.1.5 Parts-Based Models

Before the reign of CNN methods, the algorithms based on the Deformable Parts-based Model (DPM) and HoG features used to win all the object detection competitions. In this algorithm latent (not supervised) object parts were discovered for each class and optimized by minimizing the deformations of the full objects (connections were modeled by spring forces). The whole thing was built on a HoG image pyramid.

When Region based DCNNs started to beat the former champion, researchers began to wonder if it was only a matter of using better features. If this was the case then the region based approach would not necessarily be a more powerful algorithm. The DPM was flexible enough to integrate the newer, more discriminative CNN features. Therefore, some research works pursued this direction.

In 2014, Savalle and Tsogkas [2014] tried to get the best of both worlds: they replaced the HoG feature pyramids used in the DPM with CNN layers. Surprisingly, the performance they obtained, even if far superior to the DPM+HoG baseline, was considerably worse than that of the R-CNN method. The authors suspected the reason was the fixed size aspect ratios used in the DPM together with the training strategy. Girshick et al. [2015] put more thought into how to mix CNN and DPM by coming up with the distance transform pooling, thus bringing the new DPM (DeepPyramidDPM) to the level of R-CNN (even slightly better). Ranjan et al. [2015] built on it and introduced a normalization layer that forced each scale-specific feature map to have the same activation intensities. They also implemented a new procedure of sampling optimal targets by using the closest root filter in the pyramid in terms of dimensions. This allowed them to further mimic the HOG-DPM strengths. Simultaneously, Wan et al. [2015] also improved the DeepPyramidDPM but fell short compared to the newest, fine-tuned version of R-CNN (R-CNN FT). Therefore, in 2015 it seemed that the DPM based approaches had hit a dead end and that the community should focus on R-CNN type methods.

However, the flexibility of the RoI-Pooling of Fast R-CNN was going to help bring the two approaches together. Ouyang et al. [2015] combined Fast R-CNN, to get rid of most backgrounds, with a DeepID-Net, which introduced a max-pooling penalized by the deformation of the parts called def-pooling. The combination improved over the state-of-the-art. As we mentioned in Section 2.1.3, Dai et al. [2017] built on R-FCN and added deformations in the Position Sensitive RoI-Pooling: an offset is learned from the classical Position Sensitive pooled tensor with a fully connected network for each cell of the RoI-Pooling, thus creating "parts"-like features. This trick of moving RoI cells around is also present in [Mordan et al., 2017], although slightly different because it is closer to the original DPM. Dai et al. [2017] even added offsets to convolutional filter cells on Conv-5, which became doable thanks to bilinear interpolation. It thus became a truly deformable fully convolutional network. However, Mordan et al. [2017] got better performances on VOC without it. Several works used deformable R-FCN, such as [Xu et al., 2017b] for aerial imagery, which used a different training strategy. However, even if it is still present in famous competitions like COCO, it is less used than its counterparts with fixed RoI-Pooling. It might come back, though, thanks to recent best performing models like [Singh and Davis, 2018], which used [Dai et al., 2017] as their baseline and selectively back-propagated gradients according to the object size.

2.2 Model Training

The next important aspect of the detection model’s design is the losses used to drive the huge number of weights to convergence and the hyper-parameters that must be conducive to this convergence. Optimizing a wrongly crafted loss may actually lead the model to diverge instead. Choosing incorrect hyper-parameters can, on the one hand, make the model stagnate or trap it in a local optimum or, on the other hand, over-fit the training data (causing poor generalization). Since DCNNs are mostly trained with mini-batch SGD (see for instance [LeCun et al., 2012]), we focus the following discussion on losses and on the optimization tricks necessary to attain convergence. We also review the contributions of pre-training on some other dataset and of data augmentation techniques, which bring about an excellent initialization point and good generalization respectively.

2.2.1 Losses

Multi-variate cross entropy loss, or log loss, is generally used throughout the literature to classify images or regions in the context of detectors. However, detecting objects in large images comes with its own set of specific challenges: the need to regress bounding boxes to get precise localization, which is a hard problem that is not present at all in classification, and an imbalance between target object regions and background regions.

A binary cross entropy loss is formulated as shown in Eq. 1. It is used for learning the combined objectness. All instances, $x \in O$, are marked as positive labels with a value of one. This equation constrains the network to output a predicted confidence score, $p$, of $1$ if it thinks there is an object and $0$ otherwise.

$$L(p \mid x) = -\left( \mathbb{1}_{x \in O} \log(p) + \left(1 - \mathbb{1}_{x \in O}\right) \log(1 - p) \right) \tag{1}$$


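A multi-variate version of the log loss is used for classification (Eq. 2). $p_c$ predicts the probability of observation $x$ being of class $c$, where $c \in \{0, \dots, C\}$. $\mathbb{1}_{x \in O_c}$ is $1$ if observation $x$ belongs to class $c$ and $0$ otherwise. $c = 0$ accounts for the special case of the background class.

$$L(p \mid x) = -\sum_{c=0}^{C} \mathbb{1}_{x \in O_c} \log(p_c) \tag{2}$$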


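Fast-RCNN [Girshick, 2015] used a multitask loss (Eq. 3), which is the de-facto equation used for classifying as well as regressing. The losses are summed over all the region proposals or default reference boxes, $i$. The ground-truth label, $p_i^*$, is $1$ if the proposal box is positive, otherwise $0$. The regression term is active only for positive proposal boxes.

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*) \tag{3}$$

where $t_i$ is a vector representing the 4 coordinates of the predicted bounding box and similarly $t_i^*$ represents the 4 coordinates of the ground truth. Eq. 4 presents the exact parameterization of the coordinates. $(x_a, y_a, w_a, h_a)$ are the center x and y coordinates, width and height of the default anchor box respectively. Similarly $(x^*, y^*, w^*, h^*)$ are the ground truths and $(x, y, w, h)$ are the coordinates to be predicted. The two terms are normalized by the mini-batch size, $N_{cls}$, and the number of proposals/default reference boxes, $N_{reg}$, and weighted by a balancing parameter $\lambda$.

$$\begin{aligned}
t_x &= (x - x_a)/w_a, & t_y &= (y - y_a)/h_a, & t_w &= \log(w/w_a), & t_h &= \log(h/h_a), \\
t_x^* &= (x^* - x_a)/w_a, & t_y^* &= (y^* - y_a)/h_a, & t_w^* &= \log(w^*/w_a), & t_h^* &= \log(h^*/h_a)
\end{aligned} \tag{4}$$

$L_{reg}$ is a smooth $L_1$ loss defined by Eq. 5. In its place some papers also use other losses.

$$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{5}$$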
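The anchor-relative box parameterization and the smooth regression loss described above can be sketched as follows. The code assumes the parameterization popularized by the R-CNN family (center offsets normalized by the anchor size, log-ratios for width and height) and a smooth L1 transition at 1; the function names are chosen for this example only.

```python
import math


def encode_box(box, anchor):
    """Return regression targets (tx, ty, tw, th) of `box` relative to `anchor`.

    Both boxes are given as (center_x, center_y, width, height).
    """
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))


def decode_box(t, anchor):
    """Invert `encode_box`: recover (center_x, center_y, width, height)."""
    tx, ty, tw, th = t
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))


def smooth_l1(d):
    """Quadratic near zero, linear for |d| >= 1 (less sensitive to outliers)."""
    a = abs(d)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def regression_loss(t_pred, t_true):
    """Sum of smooth L1 terms over the four box coordinates."""
    return sum(smooth_l1(p - g) for p, g in zip(t_pred, t_true))
```

In a multitask loss of the kind described above, `regression_loss` would only be accumulated for positive boxes, and `decode_box` would be applied at inference time to turn predicted offsets back into image coordinates.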

Losses for regressing bounding boxes:

Since accurate localization is a major issue, several papers have suggested more sophisticated localization losses. Gidaris and Komodakis [2016b] came up with a binary logistic type regression loss. After dividing the image patch into columns and rows, they computed the probability of each row and column being inside or outside the predicted observation box (in-out loss) (Eq. 6).


where are the left, right, top and bottom edges of the bounding box respectively. and are the binary positive or negative values for rows and columns respectively. is the probability associated with it respectively.

In addition, they also compute the confidence for each column and row being the exact boundary of the predicted observation or not (Eq. 7).


where . The notations can be inferred from Eq. 6. In a second paper on the same topic, Gidaris and Komodakis [2016a] applied the regression losses iteratively at the region proposal stage in a class agnostic manner. They used the final convolutional features and the predictions from the last iteration to further refine the proposals.

It was also found to be beneficial to optimize the loss directly over the Intersection over Union (IoU), which is the standard metric used to evaluate an algorithm. Yu et al. [2016a] presented Eq. 8 for the regression loss.

$$L_{IoU} = -\ln \frac{\mathrm{Intersection}(B_{pred}, B_{gt})}{\mathrm{Union}(B_{pred}, B_{gt})} \tag{8}$$

The terms are self-explanatory. Jiang et al. [2018] also learned to predict the IoU between the predicted box and the ground truth. They made a case for using localization confidence instead of classification confidence to suppress boxes at the NMS stage. It gave higher recall on the MS COCO dataset.
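A loss defined directly on the IoU can be sketched as below, assuming the negative-logarithm form proposed by Yu et al. [2016a]. The function names and the small `eps` floor, which keeps the loss finite for non-overlapping boxes, are assumptions of this example.

```python
import math


def box_iou(a, b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, target, eps=1e-9):
    """Negative log of the IoU: zero for a perfect match, large for poor overlap."""
    return -math.log(max(box_iou(pred, target), eps))
```

Unlike a per-coordinate loss, this quantity treats the four box coordinates jointly and does not depend on the absolute size of the box.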

Losses for class imbalance:

Since in recent detectors there are a lot of anchors which most of the time cover background, there is a class imbalance between positive and negative anchors. Fixing the ratio between positive and negative instances, generally at 1:3, can partly solve this imbalance. An alternative is Online Hard Example Mining (OHEM) [Shrivastava et al., 2016a], which selects only the worst performing examples (so-called hard examples) for calculating gradients. Lin et al. [2017b] proposed a tweak to the cross entropy loss, called focal loss, which took into account all the anchors but penalized easy examples less and hard examples more. Focal loss (Eq. 9) was found to increase the performance by 3.2 mAP points on MS COCO in comparison to OHEM, on a ResNet-50-FPN backbone and a 600 pixel image scale.

$$FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t) \tag{9}$$

where $p_t$ is the predicted probability of the ground-truth class, $\gamma \geq 0$ is the focusing parameter and $\alpha_t$ is a class-balancing weight.
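The down-weighting of easy examples by the focal loss can be illustrated with a scalar, binary version. This is a sketch rather than a reference implementation: the function name is hypothetical, and the defaults gamma = 2 and alpha = 0.25 are the values reported as working well by Lin et al. [2017b].

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one anchor.

    p is the predicted probability of the positive class and y the label
    (1 for object, 0 for background). With gamma = 0 the expression reduces
    to an alpha-weighted cross entropy.
    """
    p_t = p if y == 1 else 1.0 - p               # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha   # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)              # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma = 2, a well classified positive (p = 0.9) has its cross entropy scaled by (1 - 0.9)^2 = 0.01, whereas a badly classified one (p = 0.1) is scaled by 0.81, so the many easy background anchors contribute little to the gradient.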


One can also adopt simpler strategies like rebalancing the cross-entropy by putting more weights on the minority class [Ogier Du Terrail and Jurie, 2017].

Supplementary losses:

In addition to classification and regression losses, some papers also optimized extra losses in parallel. Dai et al. [2016a] proposed a three-stage cascade for differentiating instances, estimating masks and categorizing objects. Because of this they achieved competitive performance on the object detection task too. They also experimented with a five-stage cascade. UberNet [Kokkinos, 2017] trained on as many as six other tasks in parallel with object detection. He et al. [2017] have shown that using an additional segmentation loss, by adding an extra branch to the Faster RCNN detection sub-network, can also improve detection performance. Li et al. [2017d] introduced position-sensitive inside/outside score maps to train for detection and segmentation simultaneously. Wang et al. [2018b] proposed an additional repulsion loss between predicted bounding boxes in order to have one final prediction per ground truth. In general, it can be observed that additional tasks, instance segmentation in particular, aid the object detection task.

2.2.2 Hyper-Parameters

The detection problem is a highly non-convex problem in hundreds of thousands of dimensions. Even for classification, where the structure is much simpler, no general strategy has emerged yet on how to use mini-batch gradient descent correctly. Different popular versions of mini-batch Stochastic Gradient Descent (SGD) [Rumelhart et al., 1985] have been proposed based on a combination of momentum, to accelerate convergence, and the history of the past gradients, to dampen the oscillations when reaching a minimum: AdaDelta [Zeiler, 2012], RMSProp [Tieleman and Hinton, 2012] and the unavoidable ADAM [Kingma and Ba, 2014, Reddi et al., 2018] are only the most well-known. However, in the object detection literature, authors use either plain SGD or ADAM without putting too much thought into it. The most important hyper-parameters remain the learning rate and the batch size.

Learning rate:

There is no concrete way to decide the learning rate policy over the period of the training. It depends on a myriad of factors like the optimizer, number of training examples, model, batch size, etc. We cannot quantify the effect of each factor; therefore, the current way to determine the policy is by trial and error. What works for one setting may or may not work for other settings. If the policy is incorrect then the model might fail to converge at all. Nevertheless, some papers have studied it and have established general guidelines that have been found to work better than others. A large learning rate might never converge while a small learning rate gives sub-optimal results. Since the change in weights is dramatic in the initial stage of training, Goyal et al. [2017] proposed a Linear Gradual Warmup strategy in which the learning rate is increased every iteration during this period. Then starting from a point (for ) the policy was to decrease the learning rate over many epochs.

Krizhevsky [2014] and Goyal et al. [2017] also used a Linear Scaling Rule which linearly scaled the learning rate according to the mini-batch size.
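The warmup and linear scaling rules described above can be sketched as a small schedule function. This is only an illustrative sketch: the function name, the reference values (a base learning rate of 0.1 for a batch of 256), the warmup length and the step-decay milestones are assumptions chosen for the example, not settings prescribed by the cited papers.

```python
def learning_rate(iteration, batch_size, base_lr=0.1, base_batch_size=256,
                  warmup_iters=500, decay_steps=(30000, 40000),
                  decay_factor=0.1):
    """Learning rate at a given iteration (all default values are assumptions)."""
    # Linear scaling rule: the target rate grows proportionally to the batch size.
    target_lr = base_lr * batch_size / base_batch_size
    if iteration < warmup_iters:
        # Linear gradual warmup: ramp from base_lr up to target_lr.
        alpha = iteration / warmup_iters
        return base_lr + alpha * (target_lr - base_lr)
    # After warmup: multiply by decay_factor at every milestone already passed.
    n_decays = sum(1 for step in decay_steps if iteration >= step)
    return target_lr * decay_factor ** n_decays
```

With a batch of 8192 images, for instance, the rate would ramp from 0.1 to 3.2 during warmup and then be divided by ten at each milestone.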

Batch size:

The object detection literature doesn’t generally focus on the effects of using a bigger or smaller batch size during training. Training modern detectors requires working on full images and therefore on large tensors, which can be troublesome to store in GPU memory. This has forced the community to use small batches, of 1 to 16 images, for training (16 in RetinaNet [Lin et al., 2017b] and Mask R-CNN [He et al., 2017] with the latest GPUs).

One obvious advantage of increasing the batch size is that it reduces the training time, but since memory constraints restrict the number of images per GPU, more GPUs have to be employed. However, using extra-large batches has been shown to potentially lead to big improvements in performance or speed. For instance, batch normalization [Ioffe and Szegedy, 2015] needs many images to provide meaningful statistics. Batch size effects were originally studied by Goyal et al. [2017] on the ImageNet dataset. They showed that by increasing the batch size from 256 to 8192, training time can be reduced from 29 hours to just 1 hour while maintaining the same accuracy. Further, You et al. [2018] and Akiba et al. [2017] brought the training time down to below 15 minutes by increasing the batch size to 32k.

Very recently, MegDet [Peng et al., 2017a], inspired by [Goyal et al., 2017], showed that averaging gradients over many GPUs to get an equivalent batch size of 256, and adjusting the learning rate accordingly, could lead to some performance gains. It is hard to say now which strategy will eventually win in the long term, but they have shown that it is worth exploring.

2.2.3 Pre-Training

Transfer learning was first shown to be useful in a supervised learning approach by Girshick et al. [2014]. The idea is to fine-tune from a model already trained on a dataset that is similar to the target dataset. This is usually a better starting point for training than randomly initialized weights. An example is a model pre-trained on ImageNet being used for training on MS COCO. And since the classes of the COCO dataset are a superset of the PASCAL VOC classes, most of the state-of-the-art approaches pre-train on COCO before training on PASCAL VOC. If the dataset at hand is completely unrelated to the dataset used for pre-training, pre-training might not be useful, e.g. when a model pre-trained on ImageNet is used for detecting cars in aerial images.

Singh and Davis [2018] made a compelling case for minimizing the difference in scales of object instances between the classification dataset used for pre-training and the detection dataset, in order to minimize domain shift while fine-tuning. They asked whether we should pre-train CNNs on a low-resolution classification dataset or restrict the scale of object instances to a given range by training on an image pyramid. By minimizing scale variance they were able to get better results than the methods that employed a scale-invariant detector. The problem with the second approach is that some instances are so small that, in order to bring them into the scale range, the images have to be upscaled so much that they might not fit in memory, or they will not be used for training at all. Using a pyramid of images and running inference on each level is also slower than methods that use the input image exactly once.

Section 3.2.3 covers pre-training and related aspects, such as fine-tuning, in greater detail. To the best of our knowledge, only two articles have tried to match the performance of ImageNet pre-training by training detectors from scratch. The first one is [Shen et al., 2017a], which used deep supervision (dense access to gradients for all layers), and the second, very recent one is [Shen et al., 2017b], which adaptively recalibrated supervision intensities based on input object sizes.

2.2.4 Data Augmentation

The aim of augmenting the training set images is to create diversity, avoid overfitting, increase the amount of data, improve generalizability and overcome different kinds of variances. This can easily be achieved without any extra annotation effort by manually designing many augmentation strategies. The general practices that are followed include, but are not limited to, scaling, resizing, translation, rotation, vertical and horizontal flipping, elastic distortions, random cropping, and contrast, color, hue, brightness, saturation and sharpness adjustments. Two recent and promising but not yet widely adopted techniques are Cutout [Devries and Taylor, 2017b] and Sample Pairing [Inoue, 2018]. Taylor and Nitschke [2017] benchmarked various popular data augmentation schemes to find the most appropriate ones, and found that cropping was the most influential in their case.
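For detection, geometric augmentations must transform the box annotations together with the pixels. As a minimal sketch of this point, the horizontal flip below mirrors both an image and its boxes. The representation is an assumption made for the example: the image is a list of pixel rows, and boxes are (x1, y1, x2, y2) tuples in continuous pixel coordinates ranging from 0 to the image width.

```python
def horizontal_flip(image, boxes):
    """Mirror an image (list of rows) and its (x1, y1, x2, y2) boxes."""
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    # A point at abscissa x lands at width - x, so the left and right
    # edges of every box swap roles.
    flipped_boxes = [(width - x2, y1, width - x1, y2)
                     for (x1, y1, x2, y2) in boxes]
    return flipped_image, flipped_boxes
```

Applying the function twice returns the original image and boxes, which is a convenient sanity check when adding further transformations.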

Although there are many techniques available and each one of them is easy to implement, it is difficult to know in advance, without expert knowledge, which techniques will help performance on a target dataset. For example, vertical flipping is not helpful for a traffic sign dataset because one is not likely to encounter inverted signs in the test set. It is not trivial to select the approaches for each target dataset and test all of them before deploying a model. Therefore, Cubuk et al. [2018] proposed a search algorithm based on reinforcement learning to find the best augmentation policy. Their approach tried to find the most suitable augmentation operations along with their magnitude and probability of being applied. Smart Augmentation [Lemley et al., 2017] worked by creating a network that learned how to automatically generate augmented data during the training process of a target network in a way that reduced the loss. Tran et al. [2017] proposed a Bayesian approach, where new annotated training points are treated as missing variables and generated based on the distribution learned from the training set. Devries and Taylor [2017a] applied simple transformations such as adding noise, interpolating, or extrapolating between data points. They performed the transformation not in input space but in a learned feature space. All the above approaches have been implemented in the domain of classification only, but they might be beneficial for the detection task as well and it would be interesting to test them.

Generative adversarial networks (GANs) have also been used to generate the augmented data directly for classification without searching for the best policies explicitly [Perez and Wang, 2017, Mun et al., 2017, Antoniou et al., 2017, Sixt et al., 2018]. Ratner et al. [2017] used GANs to describe data augmentation strategies. GAN approaches may not be as effective for detection scenarios yet because generating an image with many object instances placed in a relevant background is much more difficult than generating an image with just one dominant object. This is also an interesting problem which might be addressed in the near future and is explored in Section 3.1.2.

2.3 Inference

The behavior of modern detectors is to pick up pixels of target objects, propose as many windows as possible surrounding those pixels and estimate a confidence score for each window. They do not aim to suggest exactly one box per object. Since all the reference boxes act independently at test time and similar input pixels are picked up by many neighboring anchors, each positive prediction in the prediction set highly overlaps with other boxes. If the best ones out of these are not selected, this will lead to many double detections and thus false positives. The ideal result would be exactly one predicted box per ground-truth object, with high overlap with it. To get close to this ideal state, some sort of post-processing needs to be done.

Greedy non-maximum suppression (NMS) [Dalal and Triggs, 2005] is the technique most frequently used in inference modules to suppress double detections through hard thresholding. In this approach, the prediction box with the highest confidence is chosen and all the boxes having an Intersection over Union (IoU) with it higher than a given threshold are suppressed, i.e. rescored to zero. This step is repeated until all the boxes have been processed. Because of its nature, there is no single threshold that works best for all datasets. Datasets with just one object per image can trivially apply NMS by choosing only the highest-ranking box. Generally, datasets with few, sparsely distributed objects per image (2 to 3 objects) require a lower threshold, while datasets with many, crowded objects per image (7 and above) give better results with a higher threshold. The problems with this naive and hand-crafted approach are that it may completely suppress nearby or occluded true positive detections, that it chooses the top-scoring box, which might not be the best localized one, and that it cannot suppress false positives with insufficient overlap.
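The greedy procedure described above can be sketched in a few lines of plain Python. This is a minimal, unoptimized sketch: the (x1, y1, x2, y2) box format, the function names and the default threshold of 0.5 are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return the indices of the kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # remaining box with the highest confidence
        keep.append(best)
        # Hard thresholding: drop every box overlapping the kept one too much.
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```

The single `iou_threshold` parameter is the value that has to be tuned per dataset, as discussed above.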

To improve upon it, many approaches have been proposed but most of them work for a very special case such as pedestrians in highly occluded scenarios. We discuss the various directions they take and the approaches that work better than the greedy NMS in the general scenario. Most of the following discussion is based on [Hosang et al., 2017] and [Bodla et al., 2017], who, in their papers, provided us with an in-depth view of the alternate strategies being used.

Many clustering approaches for predicted boxes have been proposed. A few of them are mean shift clustering [Dalal and Triggs, 2005, Wojek et al., 2008], agglomerative clustering [Bourdev et al., 2010], affinity propagation clustering [Mrowca et al., 2015], heuristic variants [Sermanet et al., 2013a], etc. Rothe et al. [2014] presented a learning-based method which “passes messages between windows”, i.e. clusters the final detections in order to select exemplars from each cluster. Mrowca et al. [2015] deployed a multi-class version of this method. Clustering formulations with globally optimal solutions have been proposed in [Tang et al., 2015]. All of them worked for special cases but are generally less consistent than greedy NMS.

Some papers learn NMS in a convolutional network. Henderson and Ferrari [2016] and Wan et al. [2015] tried to incorporate the NMS procedure at training time. Stewart et al. [2016] generated a sparse set of detections by training an LSTM. Hosang et al. [2016] trained the network to find different optimal cutoff thresholds locally. Hosang et al. [2017] went one step further and got rid of the NMS step completely by taking double detections into account in the loss and by jointly processing neighboring detections. The former inclined the network to predict one detection per object and the latter provided the information on whether an object was detected multiple times. Their approach worked better than greedy NMS and they obtained a performance gain of 0.8 mAP on the COCO dataset.

Most recently, greedy NMS was improved upon by Bodla et al. [2017]. Instead of setting the scores of neighboring detections to zero, they decayed the detection confidence by an amount that increases with the overlap. This improved the performance by 1.1 mAP on the COCO dataset. No extra training is required and, since the method is hand-crafted, it can be easily integrated into an object detection pipeline. It is used in current top entries for the MS COCO object detection challenge.
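A score-decay variant of NMS in the spirit of Bodla et al. [2017] can be sketched as follows. The Gaussian decay exp(-IoU^2 / sigma) is one of the re-scoring functions proposed in that paper; the function names, the box format and the default values of `sigma` and `score_threshold` are assumptions made for this illustration.

```python
import math


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Return (kept indices, their decayed scores), best first."""
    scores = list(scores)  # work on a copy
    remaining = list(range(len(boxes)))
    keep = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_threshold:
            break  # every remaining box scores even lower
        keep.append(best)
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            # Gaussian decay: the larger the overlap, the stronger the penalty.
            scores[i] *= math.exp(-(overlap ** 2) / sigma)
    return keep, [scores[i] for i in keep]
```

Unlike greedy NMS, a heavily overlapping box is not discarded outright: it stays in the output with a reduced score, so a nearby true positive can still be recovered at a lower confidence.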

Jiang et al. [2018] performed NMS based on separately predicted localization confidence instead of usually accepted classification confidence. Other papers rescored detections locally [Chen et al., 2013, Tu and Bai, 2010] or globally [Vezhnevets and Ferrari, 2015]. Some others detected objects in pairs in order to handle occlusions [Ouyang and Wang, 2013b, Sadeghi and Farhadi, 2011, Tang et al., 2014]. Rodriguez et al. [2011] made use of the crowd density. Quadratic unconstrained binary optimization (QUBO) [Rujikietgumjorn and Collins, 2013] used detection scores as a unary potential and overlap between detections as a pairwise potential to obtain the optimal subset of detection boxes. Niepert et al. [2016] saw overlapping windows as edges in a graph.

As a bonus, in the end, we also throw some light on the inference “tricks” that are generally known to the experts participating in the competitions. The tricks used to further improve the evaluation metrics are: doing multi-scale inference on an image pyramid (see Section 3.2.1 for training); doing inference on the original image and on its horizontal flip (or on different rotated versions of the image if the application domain does not have a fixed direction) and aggregating results with NMS; doing bounding box voting as in [Gidaris and Komodakis, 2015], using the score of each box as its weight; using heavy backbones, as observed in the backbone section; and finally, averaging the predictions of different models in ensembles. For the last trick, it is often better not to use the top-N best performing models but to prefer uncorrelated models, so that they can correct each other’s weaknesses. Ensembles of models often outperform single models by a large margin, and one can average as many as a dozen models to outrank one’s competitors. Furthermore, with DCNNs one generally does not need to put too much thought into normalizing the models, as each one gives bounded probabilities (because of the softmax operator in the last layer).
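Bounding box voting, one of the tricks listed above, can be sketched as a score-weighted average of the candidate boxes that overlap each box kept after NMS. This is a simplified illustration and not the exact procedure of the cited paper: the function names, the (x1, y1, x2, y2) box format and the neighborhood threshold of 0.5 are assumptions.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def box_voting(kept_boxes, all_boxes, all_scores, iou_threshold=0.5):
    """Refine each kept box with the score-weighted mean of its neighbors."""
    voted = []
    for kept in kept_boxes:
        total = 0.0
        acc = [0.0, 0.0, 0.0, 0.0]
        for box, score in zip(all_boxes, all_scores):
            if iou(kept, box) >= iou_threshold:
                total += score  # the score of each box acts as its weight
                for k in range(4):
                    acc[k] += score * box[k]
        # Fall back to the kept box if no candidate votes for it.
        voted.append(tuple(c / total for c in acc) if total > 0
                     else tuple(kept))
    return voted
```

The candidates passed as `all_boxes` would typically be the detections before suppression, so that the boxes removed by NMS still contribute to the localization of the ones that are kept.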

2.4 Concluding Remarks

This concludes a general overview of the landscape of mainstream object detection halfway through 2018. Although the methods presented are all different, most papers have in fact converged towards the same crucial design choices. All pipelines are now fully convolutional, which brings structure (regularization), simplicity, speed and elegance to the detectors. The anchor mechanism of Ren et al. [2015] has now also been widely adopted and has not really been challenged yet, although iteratively regressing a set of seed boxes shows some promise [Najibi et al., 2016, Gidaris and Komodakis, 2016a]. The need to use multi-scale information from different layers of the CNN is now apparent [Kong et al., 2016, Lin et al., 2017a, b]. The RoI-Pooling module and its cousins can also be cited as one of the main architectural advances of recent years but might not ultimately be used by future works.

With that said, most of the research being done now in mainstream object recognition consists of inventing new ways of passing information through the different layers or coming up with different kinds of losses or parametrizations [Yu et al., 2016a, Gidaris and Komodakis, 2016b]. There is a small paradox in the fact that even though man-made features are now absent from most modern detectors, more and more research is being done on how to better hand-craft the CNN architectures and modules.

3 Going Forward in Object Detection

While we demonstrated that object detection has already been turned upside-down by CNN architectures and that nowadays most methods revolve around the same architectural ideas, the field has not yet reached a status quo, far from it. Completely new ideas and paradigms are being developed and explored as we write this survey, shaping the future of object detection. This section exposes such ideas and lists the major challenges that remain mostly unsolved and the attempts to get around them.

3.1 Complementary New Ideas in Object Detection

In this subsection we review ideas which haven’t quite matured yet but we feel could bring major breakthroughs in the near future. If we want the field to advance, we should embrace new grand ideas like these, even if that means completely rethinking all the architectural ideas evoked in Section 2.

3.1.1 Graph Networks

The dramatic failings of state-of-the-art detectors on perturbed versions of the COCO validation set, spotted by Rosenfeld et al. [2018], raise questions about how well detectors understand compositionality, context and relationships.

Battaglia et al. [2018] recently wrote a position article arguing for the need to introduce more representational power into Deep Learning using graph networks. This means finding new ways to enforce the learning of graph structures of connected entities instead of outputting independent predictions. Convolutions are too local and translation equivariant to reflect the intricate structure of objects in their context.

One embodiment of this idea in the realm of detection can be found in the work of Wang et al. [2017b], where long-distance dependencies were introduced in deep-learning architectures. These combined local and non-local interactions are reminiscent of the CRF [Lafferty et al., 2001], which sparked a renewed interest in graphical models in 2001. Dot products between features determine their influence on each other: the closer they are in the feature space, the stronger their interaction will be (using a Gaussian kernel for instance). This seems to go against the very principles of DCNNs, which are, by nature, local. However, this kind of layer can be integrated seamlessly into any DCNN to its benefit, and it is very similar to self-attention [Cheng et al., 2016c]. It is not clear yet if these new networks will replace their local counterparts in the long term, but they are definitely suitable candidates.
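The interaction described above can be sketched as a weighted sum in which every position attends to every other position, with weights given by the exponential of the dot product between features (normalized over all positions). This is a deliberately stripped-down sketch: the learned embeddings and the residual connection used in practical layers are omitted, and the function name and the list-of-vectors representation are assumptions made for the example.

```python
import math


def non_local(features):
    """Replace each feature vector by a similarity-weighted mean of all vectors."""
    outputs = []
    for x_i in features:
        # Pairwise similarity: dot product between position i and every position j.
        sims = [sum(a * b for a, b in zip(x_i, x_j)) for x_j in features]
        # Exponential kernel, normalized over all positions (a softmax).
        shift = max(sims)  # subtracting the max keeps exp() numerically stable
        weights = [math.exp(s - shift) for s in sims]
        norm = sum(weights)
        weights = [w / norm for w in weights]
        # Aggregate the features of all positions, near or far in the image.
        outputs.append([sum(w * x_j[d] for w, x_j in zip(weights, features))
                        for d in range(len(x_i))])
    return outputs
```

Because the sum runs over all positions, the output at one location can depend on features that are arbitrarily far away in the image, which is what distinguishes this operation from a convolution.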

Graph structures also emerge when one needs to incorporate priors (or inductive biases) on the spatial relationships of the objects to detect (relational reasoning) [Hu et al., 2018]. The relation module uses attention to learn object dependencies, also using dot products of features. Similarly, Wang et al. [2017b] incorporated geometrical features to further disambiguate relationships between objects. One of the advantages of this pipeline is the last relation module, which is used to remove duplicates similarly to the usual NMS step, but adaptively. We mention this article in particular because, although relationships between detected objects have been used in the literature before, it was the first attempt to have them as a differentiable module inside a CNN architecture.

3.1.2 Adversarial Trainings

No one in the computer vision community was spared by the amazing successes of the Generative Adversarial Networks [Goodfellow et al., 2014]. By pitting a con-artist (a CNN) against a judge (another CNN) one can learn to generate images from a target distribution up to an impressive degree of realism. This new tool keeps the flexibility of the regular CNN architectures as it is implemented using the same bricks and therefore, it can be added in any detection pipeline.

Even if [Wang et al., 2017c] does not belong to the GAN family per se, the adversarial training it uses (dropping pixels in examples to make them harder to classify and hence render the network robust to occlusions) obviously drew its inspiration from GANs. Ouyang et al. [2018] went a step further and used the GAN formalism to learn to generate pedestrians from white noise in large images, and showed how those created examples were beneficial for the training of object detectors. There are numerous recent papers, e.g., [Peng and Saenko, 2018, Bousmalis et al., 2017], proposing approaches for converting synthetic data into more realistic images for classification. Inoue et al. [2018] used the latest CycleGAN [Zhu et al., 2017a] to convert real images to cartoons and by doing so gained free annotations to train detectors on weakly labeled images, becoming the first work to use GANs to create full images for detectors. As stated in the introduction, GANs can also be used not in a standalone manner but directly embedded inside a detector: Li et al. [2017c] operated at the feature level by adapting the features of small objects to match features obtained with well resolved objects. Bai et al. [2018] trained a generator directly for super-resolution of small object patches using the traditional GAN loss in addition to classification losses and a per-pixel MSE loss. Integrating the module into modern pipelines improved the original mAP on COCO. These two articles addressed the detection of small objects, which will be tackled in more detail in Section 3.2.6.

Shen et al. [2018] used the GAN framework to completely replace the Multiple Instance Learning paradigm (see Section 4.2.1), generating candidate boxes that follow the real distribution of the boxes in the training images, and built a state-of-the-art detector that is faster than all the others by two orders of magnitude.

Thus, this extraordinary breakthrough is starting to produce interesting results in object detection and its importance is growing. Considering the latest results in the generation of synthetic data using GANs, for instance the high-resolution examples of [Karras et al., 2018] or the infinite image generators BiCycleGAN from Zhu et al. [2017a] and MUNIT from Huang et al. [2018b], it seems the tsunami that started in 2014 will only get bigger in the years to come.

3.1.3 Use of Contextual Information

We will see in this section that the word context can mean a lot of different things but taking it into account gives rise to many new methods in object detection. Most of them (like spatial relationships or using stuff to find things) are often overlooked in competitions, arguably for bad reasons (too complex to implement in the time frame of the challenge).

Methods have evolved a lot since Heitz and Koller [2008] used clustering of stuff/backgrounds to help detect objects in aerial imagery. Now, thanks to CNN architectures, it is possible to do detection of things and segmentation of stuff in parallel, each task helping the other [Brahmbhatt et al., 2017].

Of course, this finding is not surprising. Certain objects are more likely to appear in certain stuff or environments (or context): thanks to our knowledge of the world, we find it weird to see a flying train. Katti et al. [2016] showed that adding this human knowledge helps existing pipelines. The environment of a visual object also comprises the other objects that are present with it, which advocates for learning spatial relationships between objects. Mrowca et al. [2015] and Gupta et al. [2015] independently used spatial relationships between proposals and classes (using the WordNet hierarchy) to post-process detections. This is also the case in [Zuo et al., 2016], where RNNs were used to model those relationships at different scales, and in [Chen and Gupta, 2017], where an external memory module kept track of the likelihood of objects being together. Hu et al. [2018], whom we mentioned in Section 3.1.1, went even further with a trainable relation module inside the structure of the network. In a different but not unrelated manner, Gonzalez-Garcia et al. [2017] improved the detection of parts of objects by associating parts with their root objects.

All multi-scale architectures use different sized context, as we saw in Section 2.1.1. Zeng et al. [2016] used features from different sized regions (different contexts) in different layers of the CNN with message-passing in between features related to different context. Kong et al. [2017] used skip connections and concatenation directly in the CNN architecture to extract multi-level and multi-scale information.

Sometimes, even the simplest local context surrounding a region of interest can help (see, for instance, the methods presented in Section 2.1.4, where the amount of context varies between the classifiers). Extracted proposals can include variable amounts of pixels (context here means the size of the proposal) to help the classifiers, such as in [Ouyang et al., 2017] or in [Chen et al., 2016a, Gidaris and Komodakis, 2016b, a]. Li et al. [2016a] included global image context in addition to regional context. Some approaches went as far as integrating all the image context: it was done for the first time in YOLO [Redmon et al., 2016] with the addition of a fully connected layer on the last feature map. Wang et al. [2017b] modified the convolutional operator to put weights on every part of the image, helping the network use context outside an object to infer its existence. This use of global context is also found in the Global Context Module of the recent detection pipeline from Megvii [Peng et al., 2017b]. Li et al. [2018b] proposed a fully connected layer on all the feature maps (similar to Redmon et al. [2016]) with dilated kernels.

Other kinds of context can also be put to work. Yu et al. [2016b] used latent variables to decide which context cues to use to predict the bounding boxes. It is not clear yet which method is the best for taking context into account. Another question is: do we want to? Even if the presence of an object in a given context is unlikely, do we actually want to blind our detectors to unlikely situations?

3.2 Major Challenges

There are some walls that the current models cannot overcome without heavy structural changes (occlusions, domain adaptation and rotation invariance, to cite a few). Often, when we hear that object recognition is solved, we argue that the existence of these walls is solid proof that it is not. Although the field has advanced, we cannot rely indefinitely on the current DCNNs. This section shows how the recent literature has addressed these topics.

3.2.1 Scale Variance

In the past three years a lot of approaches have been proposed to deal with the challenge of scale variance. On the one hand, object instances in the image may fill only 0.01% to 0.25% of the pixels, and, on the other hand, an instance may fill 80% to 90% of the whole image. It is tough to make a single feature map predict all the objects, with this huge variance, because of the limited receptive field that its neurons have. Small objects in particular (discussed in Section 3.2.6) are difficult to classify and localize. In this section we will discuss four main approaches that are used to tackle the challenge of scale variance.

First, image pyramids can be built [He et al., 2015, Girshick, 2015, Felzenszwalb et al., 2010, Sermanet et al., 2013a]. This helps enlarge small objects and shrink large ones. Although the variance is reduced to an extent, each image has to be passed forward multiple times, making this approach computationally more expensive and slower than the approaches discussed below. This approach is different from data augmentation techniques [Cubuk et al., 2018] where an image is randomly cropped, zoomed in or out, rotated etc. and used exactly once for inference. Ren et al. [2017] extracted feature maps from a frozen network at different image scales and merged them using maxout [Goodfellow et al., 2013]. Singh and Davis [2018] selectively back-propagated the gradients of object instances only if they fell in a predetermined size range. This way, small objects must be scaled up to be considered for training. They named their technique Scale Normalization for Image Pyramids (SNIP). Singh et al. [2018] optimized this approach by processing only context regions around ground-truth instances, referred to as chips.
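The idea of training only on instances that fall in a predetermined size range at each pyramid level can be illustrated with a small selection function. This is only a sketch of the selection step: the function name, the use of the square root of the box area as the size measure, and the range values used in the example below are assumptions, not the settings of the cited paper.

```python
import math


def valid_instances(boxes, scale, valid_range):
    """Flag the boxes whose size, after rescaling the image, lies in valid_range.

    boxes: (x1, y1, x2, y2) tuples in the original image.
    scale: resizing factor of the current pyramid level.
    valid_range: (min_size, max_size) in pixels, a hypothetical setting.
    """
    lo, hi = valid_range
    flags = []
    for x1, y1, x2, y2 in boxes:
        # Size of the instance once the image has been resized by `scale`.
        size = math.sqrt((x2 - x1) * (y2 - y1)) * scale
        flags.append(lo <= size <= hi)
    return flags
```

With a range of (16, 128) pixels, for example, a 10-pixel object would only contribute to training at the upscaled pyramid level, while a 200-pixel object would only contribute at the downscaled one.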

Second, a set of default reference boxes, with varied sizes and aspect ratios that cover the whole image uniformly, was used. Ren et al. [2015] proposed a set of reference boxes at each sliding window location which are trained to regress and classify. If an anchor box has a significant overlap with the ground truth, it is treated as positive; otherwise, it is ignored or treated as negative. Due to the huge density of anchors, most of them are negative. This leads to an imbalance between positive and negative examples. To overcome it, OHEM [Shrivastava et al., 2016a] or Focal Loss [Lin et al., 2017b] are generally applied at training time. One more downside of anchors is that their design has to be adapted to the object sizes in the dataset. If large anchors are used on a dataset with many small objects, or vice versa, the detector will not train as efficiently. Default reference boxes are an important design feature in two-stage [Dai et al., 2016b] as well as single-stage methods [Redmon and Farhadi, 2017, Liu et al., 2016]. Most of the top winning entries [He et al., 2017, Lin et al., 2017a, b, Dai et al., 2017] use them in their models. Bodla et al. [2017] helped by improving the suppression of double detections, generated from the dense set of reference boxes, at inference time.
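As an illustration of the anchor mechanism, the sketch below generates reference boxes of several scales and aspect ratios around one location and labels an anchor from its best overlap with the ground truth. The function names, the scales and ratios, and the IoU thresholds of 0.7 (positive) and 0.3 (negative) are assumptions made for the example; practical implementations typically add further rules, such as always keeping the best anchor for each ground-truth box.

```python
import math


def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def make_anchors(cx, cy, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchors centered on (cx, cy); each has area scale**2 and ratio = h / w."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / math.sqrt(ratio)
            h = scale * math.sqrt(ratio)
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


def label_anchor(anchor, gt_boxes, pos_iou=0.7, neg_iou=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored during training)."""
    best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
    if best >= pos_iou:
        return 1
    if best < neg_iou:
        return 0
    return -1
```

Repeating `make_anchors` over every location of a feature map yields the dense set of mostly negative anchors that motivates the balancing strategies mentioned above.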

Third, multiple convolutional layers were used for bounding box predictions. Since a single feature map was not enough to predict objects of varied sizes, SSD [Liu et al., 2016] added more feature maps to the original classification backbones. Cai et al. [2016] proposed regions as well as performed detections on multiple scales in a two-stage detector. Najibi et al. [2017] used this method to achieve state-of-the-art on a face dataset [Yang et al., 2016b] and Li et al. [2017b] on pedestrian dataset [Ess et al., 2007]. Yang et al. [2016a] used all the layers to reject easy negatives and then performed scale-dependent pooling on the remaining proposals. Shallower or finer layers are deemed to be better for detecting small objects while top or coarser layers are better at detecting bigger objects. In the original design, all the layers predict the boxes independently and no information from other layers is combined or merged. Many papers, then, tried to fuse different layers [Chen et al., 2017d, Lee et al., 2017a] or added additional top-down network [Shrivastava et al., 2016b, Woo et al., 2018]. They have already been discussed in Section 2.1.1.

Fourth, dilated convolutions (a.k.a. atrous convolutions) [Yu and Koltun, 2015] were deployed to increase the spacing between the filter taps (the filter’s internal stride). This helped increase the receptive field size and, thus, incorporate larger context without additional computation. It has been successfully applied in the context of object detection [Dai et al., 2016b] and semantic segmentation [Chen et al., 2018c]. Dai et al. [2017] presented a generalized version of it by additionally learning the deformation offsets.
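The effect of dilation on the receptive field can be seen on a one-dimensional example. The sketch below is an illustration in plain Python (function names are assumptions): a kernel with k taps and dilation d covers (k - 1) * d + 1 input samples while still performing only k multiplications per output.

```python
def effective_kernel_size(kernel_size, dilation):
    """Number of input samples spanned by a dilated kernel."""
    return (kernel_size - 1) * dilation + 1


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D convolution (cross-correlation form) with dilated taps."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        # Same number of multiplications per output, whatever the dilation.
        sum(kernel[k] * signal[i + k * dilation] for k in range(len(kernel)))
        for i in range(len(signal) - span + 1)
    ]
```

For a 3-tap kernel, a dilation of 2 makes each output depend on a window of 5 input samples instead of 3.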

3.2.2 Rotational Variance

In the real world, object instances are not necessarily upright but can be found at an angle or even inverted. While it is hard to define rotation for flexible objects like a cat (a pose definition would be more appropriate), it is much easier to define it for text or for objects in aerial images, which have an expected rigid shape. It is well known that CNNs, as they are now, do not have the ability to deal with the rotational variance of the data. More often than not, this problem is circumvented by using data augmentation: showing the network slightly rotated versions of each patch. When training on full images with multiple annotations, this becomes less practical. Furthermore, as for occlusions, this might work but it is disappointing, as one could imagine incorporating rotational invariance into the structure of the network.

Building rotational invariance can be done simply by using oriented bounding boxes in the region proposal step of modern detectors. Jiang et al. [2017] used Faster R-CNN features to predict oriented bounding boxes; their straightened versions were then passed on to the classifier. More elegantly, a few works such as [Ma et al., 2018, He et al., 2018, Busta et al., 2017] proposed to construct different kinds of RoI-pooling modules for oriented bounding boxes. Ma et al. [2018] transformed the RoI-Pooling layer of Faster R-CNN by rotating the region inside the detector to make it fit the usual horizontal grid, which brought an astonishing increase in performance from the 38.7% of regular Faster R-CNN to 71.8% with additional tricks on MSRA. Similarly, He et al. [2018] used a rotated version of the recently introduced RoI-Align to pool oriented proposals and get more discriminative features (better aligned with the text direction) to be used in the text recognition parts. Busta et al. [2017] also used rotated pooling by bilinear interpolation to extract oriented features to recognize text, after having adapted YOLO to predict rotated bounding boxes. Shi et al. [2017a] detected oriented bounding boxes (called segments) in the same way with a similar architecture, but differed from [Ma et al., 2018, He et al., 2018, Busta et al., 2017] because their method also learned to merge the oriented segments appropriately if they cover the same word or sentence, which allowed greater flexibility.
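An oriented bounding box is commonly parametrized by a center, a size and an angle. As a small illustration, the sketch below converts such a box into its four corner points, which is the form needed to crop or pool a rotated region. The (cx, cy, w, h, angle) parametrization, the angle convention and the function name are assumptions made for the example and differ between the cited works.

```python
import math


def oriented_box_corners(cx, cy, w, h, angle_deg):
    """Corners of a w x h box centered on (cx, cy), rotated by angle_deg.

    The rotation maps (dx, dy) to (dx*cos - dy*sin, dx*sin + dy*cos); with
    image coordinates (y pointing down) this appears clockwise on screen.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    # Corner offsets of the axis-aligned box, relative to its center.
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.append((cx + dx * cos_t - dy * sin_t,
                        cy + dx * sin_t + dy * cos_t))
    return corners
```

With an angle of zero the function returns the usual axis-aligned corners, so horizontal boxes are a special case of this representation.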

Liu and Jin [2017] needed slightly more complicated anchors, quadrangle anchors, and regressed compact text zones in a single-stage architecture similar to Faster R-CNN’s RPN. This system, being more flexible than the previous ones, necessitated more parameters. They used Monte-Carlo simulations to compute overlaps between quadrangles. Liao et al. [2018b] directly rotated convolution filters inside the SSD framework, which effectively rendered the network rotation-invariant for a finite set of rotations (an idea generalized in the recent [Weiler et al., 2018] for segmentation). However, in the case of text detection, even oriented bounding boxes can be insufficient to cover text whose layout has too much curvature, and one often sees the same failure cases in different articles (circle-shaped texts for instance).
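A Monte-Carlo estimate of the overlap between two quadrangles can be sketched as follows. This is an illustration of the general technique rather than the procedure of Liu and Jin [2017]: the assumption of convex quadrangles with vertices listed in order, the uniform sampling over the joint bounding rectangle, the sample count and the function names are all choices made here.

```python
import random


def inside_convex_quad(pt, quad):
    """True if pt lies inside a convex quadrangle given as 4 vertices in order."""
    signs = []
    for k in range(4):
        (x1, y1), (x2, y2) = quad[k], quad[(k + 1) % 4]
        # The sign of the cross product tells on which side of the edge pt lies.
        cross = (x2 - x1) * (pt[1] - y1) - (y2 - y1) * (pt[0] - x1)
        signs.append(cross >= 0)
    # Inside means the same side of every edge (either vertex orientation).
    return all(signs) or not any(signs)


def monte_carlo_iou(quad_a, quad_b, n_samples=20000, seed=0):
    """Estimate the IoU of two convex quadrangles by uniform random sampling."""
    rng = random.Random(seed)
    xs = [p[0] for p in list(quad_a) + list(quad_b)]
    ys = [p[1] for p in list(quad_a) + list(quad_b)]
    x_min, x_max, y_min, y_max = min(xs), max(xs), min(ys), max(ys)
    inter = union = 0
    for _ in range(n_samples):
        pt = (rng.uniform(x_min, x_max), rng.uniform(y_min, y_max))
        in_a = inside_convex_quad(pt, quad_a)
        in_b = inside_convex_quad(pt, quad_b)
        inter += in_a and in_b
        union += in_a or in_b
    return inter / union if union else 0.0
```

The estimate converges slowly (the error shrinks with the square root of the number of samples), which is the price paid for not having to clip one polygon against the other analytically.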

A different kind of approach to rotation invariance was taken by Cheng et al. [2016a] and Laptev et al. [2016], who made use of metric learning. The former proposed an original approach that uses metric learning to force the features of an image and of its rotated versions to be close to each other, and hence somewhat invariant to rotations. In a related approach, the latter found a canonical pose for the different rotated versions of an image and used a differentiable transformation to make every example canonical and to pool the same features.

The difficulty of predicting oriented bounding boxes is alleviated if one resorts to semantic segmentation, as in [Zhang et al., 2016b]. They learned to output a semantic segmentation, and oriented bounding boxes were then found from the output score map. However, this approach shares the same downsides as the other approaches [Ma et al., 2018, He et al., 2018, Busta et al., 2017, Liao et al., 2018b, Liu and Jin, 2017] for text detection, because in the end one still has to fit oriented rectangles to evaluate the performance.

Applications other than text detection also require rotation invariance. In the domain of aerial imagery, the recently released DOTA [Xia et al., 2017] is one of the first datasets of its kind expecting oriented bounding boxes for predictions. One can anticipate an avalanche of papers trying to use text detection techniques like [Tang et al., 2017b], where the SSD framework is used to regress bounding box angles, or the aforementioned metric learning technique of Cheng et al. [2016a] and Cheng et al. [2016b]. For face detection, papers like [Shi et al., 2018] relied on oriented proposals too. The diversity of the methods shows that no real standard has emerged yet. Even the most sophisticated detection pipelines are only rotation invariant to a certain extent.

The detectors presented in this section are not yet as popular as the vertical ones because the main datasets, such as COCO, do not contain rotated images. One could define a rotated-COCO or rotated-VOC to evaluate the benefit these pipelines could bring over their vertical versions, but this is obviously difficult and would not be accepted as is by the community without a strong, well-thought-out evaluation protocol.

3.2.3 Domain Adaptation

It is often necessary to repurpose a detector trained on domain A to function on domain B. In most cases, this is because the dataset in domain A has many training examples and generic categories, whereas the dataset in domain B has fewer training examples and objects that are very specific or distinct from those of A. Surprisingly few recent articles have tackled explicit domain adaptation in the context of object detection ([Tang et al., 2012, Sun and Saenko, 2014, Xu et al., 2014] did it for HOG-based features), even though the literature on domain adaptation for classification is dense, as shown by the recent survey of Csurka [2017]. For instance, when one trains a Faster R-CNN on COCO and wants to test it off-the-shelf on the car images of KITTI [Geiger et al., 2012] (’car’ is one of the 80 classes of COCO), one gets only 56.1% AP, compared with 83.7% when using more similar images, because of the differences between the domains (see [Tremblay et al., 2018a]).

Most works adapt the features learned in another domain (mostly classification) by simply fine-tuning the weights on the task at hand. Since [Girshick et al., 2014], virtually every state-of-the-art detector has been pre-trained on ImageNet or on an even bigger dataset. This is the case even for relatively large object detection datasets like COCO, although there is no fundamental reason for it to be a requirement. The objects of the target domain have to be similar to, and of the same scales as, the objects on which the network was pre-trained, as pointed out by Singh and Davis [2018], who detected small cars in aerial imagery by first pre-training on ImageNet. The seminal work of Hoffman et al. [2014], also discussed in the weakly supervised Section 4.2.1, showed how to transfer a good classifier trained on large-scale image datasets to a good detector trained on few images, by fine-tuning the first layers of a convnet trained on classification and adapting the final layer using nearest neighbor classes. Hinterstoisser et al. [2017] demonstrated another example of transfer learning, in which they froze the first layers of detectors trained on synthetic data and fine-tuned only the last layers on the target task.

We discuss below all the articles we found that go further than simple transfer learning for domain adaptation in object detection. Raj et al. [2015] aligned feature subspaces from different domains for each class using Principal Component Analysis (PCA).

Chen et al. [2018e] used H-divergence theory and adversarial training to bridge the distribution mismatches. All the articles mentioned so far worked on adapting the features. Thanks to GANs, some works now try to adapt the images directly, such as [Inoue et al., 2018], which used CycleGAN [Zhu et al., 2017a] to convert images from one domain to the other. The object detection community needs to evolve if we want to move beyond transfer learning.

One of the end goals of domain adaptation is to learn a model on synthetic data, which is available (almost) for free, and to have it perform well on real images. Pepik et al. [2015] were, to the best of our knowledge, the first to point out that, even though CNNs are texture sensitive, wire-framed and CAD models used in addition to real data can improve the performance of detectors. Peng et al. [2015] augmented PASCAL-VOC data with 3D CAD models of the objects found in PASCAL-VOC (planes, horses, potted plants, etc.), rendered them in backgrounds where they are likely to be found, and improved overall detection performance. Following this line, several authors introduced synthetic data for various tasks: i) persons: Varol et al. [2017]; ii) furniture: Massa et al. [2016] rendered CAD furniture on real backgrounds, using grayscale images to avoid color artifacts, and improved the detection performance on the IKEA dataset; iii) text: Gupta et al. [2016] created an oriented text detection benchmark by superimposing synthetic text on existing scenes while respecting geometric and uniformity constraints, and showed better results on ICDAR; iv) logos: Su et al. [2017b] did the same without any constraints by superimposing transparent logos on existing images.

Georgakis et al. [2017] synthesized new instances of 3D CAD models by copy-pasting rendered objects on surface normals, an approach very close to [Rajpura et al., 2017], which used Blender to put instances of objects inside a refrigerator. Later, Dwibedi et al. [2017] showed promise with the same approach but without respecting any global consistency: for them, only local consistency is important for modern object detectors. Similar to [Georgakis et al., 2017], they used different kinds of blending to make the detector robust to the pasting artifacts (more details can be found in [Dwibedi, 2017]). More recently, Dvornik et al. [2018] extended [Dwibedi et al., 2017] by first finding locations in images with a high likelihood of object presence before pasting objects. Another recent approach [Tremblay et al., 2018a] found that domain randomization when creating synthetic data is vital to train detectors: training on Virtual KITTI [Gaidon et al., 2016], a dataset that was built to be close to KITTI (in terms of aspects, textures, vehicles and bounding box statistics), is not sufficient to be state-of-the-art on KITTI. One can gain almost one point of AP by building one’s own version of Virtual KITTI and introducing more randomness than was present in the original, in the form of random textures and backgrounds, random camera angles and random flying distractor objects. Randomness was apparently absent from KITTI but is beneficial for the detector to gain generalization capabilities.
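The cut-and-paste idea can be illustrated with a few lines of code: an object patch is blended into a background image with a soft mask, and the bounding box annotation comes for free. The sketch below is a simplified stand-in, not the pipeline of the cited works; the grayscale images stored as lists of rows, the per-pixel alpha mask used in place of Gaussian or Poisson blending, and the name `paste_object` are assumptions made for illustration.

```python
def paste_object(background, patch, mask, top, left):
    """Alpha-blend `patch` into a copy of `background`.

    background, patch: grayscale images as lists of rows.
    mask: per-pixel alpha in [0, 1], same size as patch (soft edges hide seams).
    Returns the composited image and the box (x1, y1, x2, y2) of the pasted object.
    """
    out = [row[:] for row in background]  # leave the original background untouched
    patch_h, patch_w = len(patch), len(patch[0])
    for i in range(patch_h):
        for j in range(patch_w):
            alpha = mask[i][j]
            out[top + i][left + j] = (
                alpha * patch[i][j] + (1.0 - alpha) * out[top + i][left + j]
            )
    box = (left, top, left + patch_w, top + patch_h)
    return out, box
```

Generating several composites of the same object with different masks (hard, blurred, and so on) is one way to discourage the detector from latching onto the blending artifacts rather than onto the object itself.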

Several authors have shown interest in proposing tools for generating artificial images at a large scale. Qiu and Yuille [2016] created the open-source plug-in UnrealCV for a popular game engine Unreal Engine 4 and showed applications to deep network algorithms. Tian et al. [2017] used the graphical model CityEngine to generate a synthetic city according to the layout of existing cities and added cars, trucks and buses to it using a game engine (Unity3D). The detectors trained on KITTI and this dataset are again better than just with KITTI. Alhaija et al. [2018] pushed Blender to its limits to generate almost real-looking 3D CAD cars with environment maps and pasted them inside different 2D/3D environments including KITTI, VirtualKITTI (and even Flickr). It is worth noting that some datasets included real images to better simulate the scene viewed by a robot in active vision settings, as in [Ammirato et al., 2017].

Another strategy is to render simple artificial images and increase the realism of the images in a second iteration, using Generative Adversarial Networks [Shrivastava et al., 2017]. RenderGAN was used to directly generate realistic training images [Sixt et al., 2018]. We refer the reader to the section on GANs (Section 3.1.2) for more information on the use of GANs for style transfer.

We have seen that, for the time being, synthetic datasets can augment existing ones but cannot totally replace them for object detection: the domain shift between synthetic data and the target distribution is still too large to rely on synthetic data only.

3.2.4 Object Localization

Accurate localization remains one of the two biggest sources of error [Hoiem et al., 2012] in fully supervised object detection. The difficulty mainly originates from small objects and from the more stringent evaluation protocols applied in the latest datasets, in which predicted boxes are required to have an IoU of up to 0.95 with the ground-truth boxes. Generally, localization is dealt with by using smooth L1 or L2 losses along with a classification loss. Some papers proposed a more detailed methodology to overcome this issue. In addition, annotating bounding boxes for each and every object is expensive, so we will also look into some methods that localize objects using only weakly annotated images.
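For reference, the two regression losses mentioned above can be written in a few lines. This is a minimal sketch: the transition point `beta`, the sum over box coordinates and the function names are assumptions made here, and actual detectors apply these losses to encoded box offsets rather than to raw coordinates.

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


def l2_loss(pred, target):
    """Plain squared-error loss, for comparison."""
    return sum(0.5 * (p - t) ** 2 for p, t in zip(pred, target))
```

The linear regime of the smooth L1 loss bounds the gradient of badly regressed coordinates, whereas under the L2 loss a single outlier can dominate the total loss.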

Kong et al. [2016] overcame the poor localization because of coarseness of the feature maps by aggregating hierarchical feature maps and then compressing them into a uniform space. It provided an efficient combination framework for deep but semantic, intermediate but complementary, and shallow but high-resolution CNN features. Chen et al. [2015b] proposed multi-thresholding straddling expansion (MTSE) to reduce localization bias and refine boxes during proposal time which is based on super-pixel tightness as opposed to objectness based models. Zhang et al. [2015] addressed the localization problem by using a search algorithm based on Bayesian optimization that sequentially proposed candidate regions for an object bounding box. Hosang et al. [2017] tried to integrate NMS in the convolutional network which in the end improved localization.

Many papers [Gidaris and Komodakis, 2016a, Jiang et al., 2018] also try to adapt the loss function to address the localization problem.

Gidaris and Komodakis [2016b] proposed to assign conditional probabilities to each row and column of a sample region, using a convolutional neural network adapted for this task. These probabilities allow more accurate inference of the object bounding box under a simple probabilistic framework. Since Intersection over Union (IoU) is used in the evaluation strategies of many detection challenges, Yu et al. [2016a] and Jiang et al. [2018] optimized over IoU directly. The loss-based papers have been discussed in Section 2.2.1 in detail.
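A loss defined directly on the IoU can be sketched as the negative logarithm of the overlap. The code below is an illustrative form only, not the exact formulation of the cited papers: the corner encoding `(x1, y1, x2, y2)`, the small constant `eps` that avoids taking the logarithm of zero, and the function names are assumptions.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def iou_loss(pred, gt, eps=1e-9):
    """Negative log-IoU: zero for a perfect match, large for poor overlap."""
    return -math.log(box_iou(pred, gt) + eps)
```

Unlike coordinate-wise losses, this quantity treats the four box coordinates jointly and does not depend on the absolute size of the boxes: scaling both boxes by the same factor leaves the loss unchanged.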

Some papers also make an interesting case by asking whether we really need to optimize for localization. Oquab et al. [2015] used weakly annotated images to predict approximate locations of the object. Their approach performed comparably to its fully supervised counterparts. Zhou et al. [2016a] were able to get localizable deep representations that exposed the implicit attention of CNNs on an image with the help of global average pooling layers. In contrast to the earlier approach, their localization is not limited to a point lying inside an object but determines the full extent of the object. [Zhou et al., 2015, Zeiler and Fergus, 2014a, Bazzani et al., 2016] have also tried to predict localizations by masking different patches of the image at test time. More weakly supervised methods are discussed in Section 4.2.1.
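The role of global average pooling in localization can be illustrated with a small sketch of a class activation map: the classifier weights of one class are used to combine the last convolutional feature maps into a spatial score map. The data layout (lists of rows), the absence of a bias term, the 20% threshold and the use of the extent of all above-threshold cells (instead of the largest connected component) are simplifying assumptions made here, as are the function names.

```python
def class_activation_map(feature_maps, class_weights):
    """Weighted sum of K feature maps (each H x W) with one class's K weights."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for fmap, wk in zip(feature_maps, class_weights):
        for y in range(h):
            for x in range(w):
                cam[y][x] += wk * fmap[y][x]
    return cam


def class_score(feature_maps, class_weights):
    """Class score of a 'global average pooling + linear layer' head (no bias)."""
    score = 0.0
    for fmap, wk in zip(feature_maps, class_weights):
        gap = sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        score += wk * gap
    return score


def cam_to_box(cam, ratio=0.2):
    """Box (x_min, y_min, x_max, y_max) around cells above ratio * peak value."""
    peak = max(max(row) for row in cam)
    cells = [(x, y) for y, row in enumerate(cam)
             for x, v in enumerate(row) if v >= ratio * peak]
    if not cells:  # can only happen when the peak is not positive
        return None
    xs, ys = [c[0] for c in cells], [c[1] for c in cells]
    return (min(xs), min(ys), max(xs), max(ys))
```

Because averaging and the linear layer commute, the class score equals the spatial mean of the map: each cell of the map is the contribution of that location to the classification decision, which is why thresholding it yields a localization without any box supervision.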

3.2.5 Occlusions

Occlusions lead to partially missing information about object instances, which may be occluded by the background or by other object instances. Less information naturally leads to harder examples and inaccurate localizations. Occlusions happen all the time in real-life images. However, since deep learning is based on convolutional filters and occlusions by definition introduce parasitic patterns, most modern methods are not robust to them by construction.

Training with occluded objects certainly helps [Mitash et al., 2017], but it is often not feasible because of a lack of data and, furthermore, it cannot be bulletproof. Wu et al. [2016] managed to learn an And-Or model for cars by dynamic programming, where the And stood for the decomposition of the objects into parts and the Or for all the different configurations of parts (including occluded configurations). The learning was only possible thanks to the heavy use of synthetic data to model every possible type of occlusion. Another way to generate examples of occlusions is to directly learn to mask the proposals of Fast R-CNN [Wang et al., 2017c].

For dense pedestrian crowds, deformable models and parts can help improve detection accuracy (see Section 2.1.5): if some parts are masked, others will not be, so the average score is diminished but not reduced to zero, as in [Ouyang and Wang, 2013a, Savalle and Tsogkas, 2014, Girshick et al., 2015]. Parts are also useful for occlusion handling in face detection, where different CNNs can be trained on different facial parts [Yang et al., 2015]. The survey already tackled Deformable RoI-Pooling (RoI-Pooling with parts) [Mordan et al., 2017]. Another way of re-introducing parts in modern pipelines is the deformable kernels of [Dai et al., 2017], which alleviate the occlusion problems by giving more flexibility to the usually fixed geometric structures.

Wang et al. [2018b] built special kinds of regression losses for bounding boxes that acknowledge the proximity of each detection (which is reminiscent of the springs in the old part-based models). In addition to the attraction term of the traditional regression loss, which pushes predictions towards their assigned ground truth, they added a repulsion term that pushes predictions away from each other.

Traditional non-maximum suppression causes a lot of problems with occlusions because overlapping boxes are suppressed. Hence, if one object is in front of another, only one is detected. To address this, Hosang et al. [2017] proposed to learn non-maximum suppression, making it continuous (and differentiable), and Bodla et al. [2017] used a soft version that only degrades the score of the overlapping objects (more details about various other types of NMS can be found in Section 2.3).
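The soft variant can be sketched as follows: instead of discarding every box that overlaps the current best detection, its score is multiplied by a factor that decreases with the overlap. The Gaussian decay used below, the values of `sigma` and `score_thresh`, and the function names are assumptions made for this illustration.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Return (index, rescored confidence) pairs in order of selection."""
    scores = list(scores)  # work on a copy, scores are decayed in place
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] < score_thresh:
            break  # everything left is below the confidence floor
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = box_iou(boxes[best], boxes[i])
            # Gaussian decay: the larger the overlap, the stronger the penalty.
            scores[i] *= math.exp(-(overlap * overlap) / sigma)
    return kept
```

With hard suppression and a typical IoU threshold of 0.5, a second box that overlaps the first by 0.8 would simply disappear; here it survives with a reduced score, which leaves a chance to an object partially hidden behind another one.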

Other approaches used clues and context to help infer the presence of occluded objects. Zhang et al. [2018b] used super-pixel labeling to help occluded objects detection. They hypothesized that if some pixels are visible then the object is there. This is also the approach of the recent [He et al., 2017] but it needs pixel-level annotations. In videos, temporal coherence can be used [Yoshihashi et al., 2017], where heavily occluded objects are not occluded in every frame and can be tracked to help detection.

But for now, all the solutions seem far from the mental inpainting ability that humans use to infer missing parts. Using GANs for this purpose might be an interesting research direction.

3.2.6 Detecting Small Objects

Detecting small objects is harder than detecting medium and large objects because less information is associated with them, they are more easily confused with the background, localization requires higher precision, images are large, etc. In the COCO evaluation metrics, objects occupying an area less than or equal to 32 × 32 pixels fall into this category, and this size threshold is generally accepted within the community for datasets of common objects. Datasets of aerial images [Xia et al., 2017], traffic signs [Zhu et al., 2016], faces [Nada et al., 2018], pedestrians [Enzweiler and Gavrila, 2008] or logos [Su et al., 2017a] generally abound with small object instances.
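The COCO size convention can be written down directly: small objects have an area of at most 32 × 32 pixels, medium objects an area of at most 96 × 96 pixels, and large objects anything above. The helper below is a sketch that assumes the area is approximated by the bounding box area (COCO itself measures the area of the segmentation mask); the function name is hypothetical.

```python
def coco_size_category(width, height):
    """Bucket an object by area, following the COCO small/medium/large split."""
    area = width * height
    if area <= 32 * 32:
        return "small"
    if area <= 96 * 96:
        return "medium"
    return "large"
```

Reporting the average precision separately for each bucket is what reveals the gap between small and large objects discussed in this section.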

In the case of objects like logos or traffic signs, the objects to be detected have an expected shape, size and aspect ratio, and this information can be embedded to bias the deep learning model. This strategy is much harder and not feasible for common objects, as they are a lot more diverse. As an illustration, the winner of the COCO challenge 2017 [Peng et al., 2017a], which used many of the latest techniques and an ensemble of four detectors, reported a performance of 34.5% mAP on small objects and 64.9% mAP on large objects. The following entries reported an even greater dip for small objects relative to large ones. Pham et al. [2017] presented an evaluation, focusing on real-time small object detection, of three state-of-the-art models (YOLO, SSD and Faster R-CNN) with the related trade-offs between accuracy, execution time and resource constraints.

There are different ways to tackle this problem, such as: i) up-scaling the images ii) shallow networks, iii) contextual information, iv) super-resolution. These four directions are discussed in the following.

The first – and most trivial direction – consists in up-scaling the image before detection. But a naive upscaling is not efficient as the large images become too large to fit into a GPU for training. Gao et al. [2017], first, down-sampled the image and then used reinforcement learning to train attention-based models to dynamically search for the interesting regions in the image. The selected regions are then studied at higher resolution and can be used to predict smaller objects. This avoided the need of analyzing each pixel of the image with equal attention and saved some computational costs. Some papers [Dai et al., 2017, 2016b, Singh and Davis, 2018] used image pyramids during training time in the context of object detection while [Ren et al., 2017] used it during inference time.

The second direction is to use shallow networks. Small objects are easier to predict for detectors that have a smaller receptive field. Deeper networks, with their large receptive fields, tend to lose some information about small objects in their coarser layers. Sommer et al. [2017b] proposed a very shallow network with just four convolutional layers and three fully connected layers for detecting objects in aerial imagery. Such detectors are useful when only small instances are expected. But if the expected instances are of diverse sizes, it is more beneficial to use the finer feature maps of very deep networks for small objects and the coarser feature maps for larger objects. We have already discussed this approach in Section 3.2.1. Please refer to Section 4.2.4 for more low-power and shallow detectors.

The third direction is to make use of context surrounding the small object instances. Gidaris and Komodakis [2015], Zhu et al. [2015b] used context to improve the performance but Chen et al. [2016a] used context specifically for improving the performance for small objects. They augmented the R-CNN with the context patch in parallel to the proposal patch generated from region proposal network. Zagoruyko et al. [2016] combined their approach of making the information flow through multiple paths with DeepMask object proposals [Pinheiro et al., 2015, 2016] to gain a massive improvement in the performance for small objects. Context can also be used by fusing coarser layers of the network with finer layers [Lin et al., 2017a, Shrivastava et al., 2016b, Lin et al., 2017b]. Context related literature has been covered in Section 3.1.3 in detail.

Finally, the last direction is to use Generative Adversarial Networks to selectively increase the resolution of small objects, as proposed by Li et al. [2017c]. Its generator learned to enhance the poor representations of the small objects to super-resolved ones that are similar enough to real large objects to fool a competing discriminator.

3.3 Concluding Remarks

This section finished the tour of all the principal CNN-based approaches, past, present and future, that treat general object detection in the traditional settings. It has allowed us to peer through the armor of the CNN detectors and see them for what they are: impressive machines with amazing generalization capabilities, but still powerless in a variety of cases in which a trained human would have no problem (domain adaptation, occlusions, rotations, small objects). Potential ideas to go past these obstacles have also been mentioned; among them, the use of adversarial training and of context are the most prominent. The following section will go into more specific set-ups, less traditional problems or environments that will frame the detector abilities even further.

4 Extending Object Detection

Object detection may still feel like a narrow problem: one has a big training set of 2D images, huge resources (GPUs, TPUs, etc.) and wants to output 2D bounding boxes on a similar set of 2D images. However, these basic assumptions often do not hold in practical scenarios. Firstly, there exist many other modalities in which one can perform object detection, and these require conceptual changes in architectures to perform equally well. Secondly, one might sometimes be constrained to learn from exceedingly few fully annotated images, in which case training a regular detector is either irrelevant or not an optimal choice because of overfitting. Also, detectors are not built to be run in research labs alone but to be integrated into industrial products, which often come with an upper bound on energy consumption and with speed requirements to satisfy the customer. The aim of the following discussion is to present the research work done to extend deep learning based object detection to new modalities and to tough constraints. It ends with reflections on what other interesting functionalities a strong detector might possess in the future.

4.1 Detecting Objects in Other Modalities

There are several modalities other than 2D images that can be interesting: videos, 3D point clouds, medical imaging, hyper-spectral imagery, etc. In this survey, we discuss the first two. We did not treat, for instance, the volumetric images from the medical domain (MRI, etc.) or hyper-spectral imagery, which are outside the scope of this article and would deserve their own survey.

4.1.1 Object Detection in Videos

The upside of detecting objects in videos is that it provides additional temporal information but it also has unique challenges associated with it: motion blur, appearance changes, video defocus, pose variations, computational efficiency etc. It is a recent research domain due to the lack of large scale public datasets. One of the first video datasets is the ImageNet VID [Russakovsky et al., 2015], proposed in 2015. This dataset as well as the recent datasets for object detection in video are mentioned in Section 5.4.

One of the simplest ways to use temporal information for detecting objects is the detection-by-tracking paradigm. As an example, Ray et al. [2017] proposed a spatio-temporal detector of motion blobs, associated into tracks by a tracking algorithm. Each track is then interpreted as a moving object. Despite its simplicity, this type of algorithm is marginal in the literature, as it is only interesting when the appearances of the objects are not available.

The most widely used approaches in the literature are those relying on tubelets. Tubelets were introduced in the T-CNN approach of Kang et al. [2017, 2016]. T-CNN relied on 4 steps. First, still-image object detection (with Faster R-CNN like detectors) was performed. Second, multi-context suppression removed detection hypotheses having the lowest scores: highly ranked detection scores were treated as high-confidence classes and the rest were suppressed. Third, motion-guided propagation transferred detection results to adjacent frames to reduce false negatives. Fourth, temporal tubelet rescoring used a tracking algorithm to obtain sequences of bounding boxes, classified into positive and negative samples. Positive samples were mapped to a higher range, thus increasing the score margins. T-CNN has several follow-ups. The first was Seq-NMS [Han et al., 2016], which constructed sequences along nearby high-confidence bounding boxes from consecutive frames and rescored them to the average confidence. Other boxes close to this sequence were suppressed. Another one was MCMOT [Lee et al., 2016], in which a post-processing stage, in the form of a multi-object tracker, was introduced, relying on hand-crafted rules (e.g., detector confidences, color/motion clues, changing point detection and forward-backward validation) to determine whether bounding boxes belonged to the tracked objects, and to further refine the tracking results. Tripathi et al. [2016] exploited temporal information by training a recurrent neural network that took as input sequences with predicted bounding boxes and optimized an objective enforcing consistency across frames.
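The sequence construction and rescoring idea behind Seq-NMS can be sketched with a small dynamic program. The code below is a simplified illustration and not the published algorithm: it extracts a single best sequence, links boxes of consecutive frames when their IoU exceeds an assumed threshold of 0.5, assumes positive scores, and omits the suppression of neighbouring boxes; all function names are hypothetical.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def best_sequence(frames, link_iou=0.5):
    """frames[t] is a list of (box, score). Returns [(frame, box index), ...]."""
    if not any(frames):
        return []
    best_sum, back = [], []  # best total score of a sequence ending at each box
    for t, dets in enumerate(frames):
        sums, ptrs = [], []
        for box, score in dets:
            prev_best, prev_idx = 0.0, None
            if t > 0:
                for j, (prev_box, _) in enumerate(frames[t - 1]):
                    linked = box_iou(prev_box, box) > link_iou
                    if linked and best_sum[t - 1][j] > prev_best:
                        prev_best, prev_idx = best_sum[t - 1][j], j
            sums.append(score + prev_best)
            ptrs.append(prev_idx)
        best_sum.append(sums)
        back.append(ptrs)
    # Start from the box with the highest accumulated score and walk backwards.
    t, i = max(((t, i) for t in range(len(frames)) for i in range(len(frames[t]))),
               key=lambda ti: best_sum[ti[0]][ti[1]])
    path = []
    while i is not None:
        path.append((t, i))
        i = back[t][i]
        t -= 1
    return path[::-1]


def rescore_to_average(frames, path):
    """Give every box of the sequence the average confidence of the sequence."""
    if not path:
        return []
    avg = sum(frames[t][i][1] for t, i in path) / len(path)
    return [(t, i, avg) for t, i in path]
```

The benefit appears when a detection is weak in one frame (because of blur or partial occlusion, for instance) but strong in its neighbours: after rescoring, the weak box inherits the average confidence of its sequence.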

The most advanced pipeline for object detection in videos is certainly the approach of Feichtenhofer et al. [2017], borrowing ideas from tubelets as well as from feature aggregation. The approach relies on a multitask objective loss, for frame-based object detection and across-frame track regression, correlating features that represented object co-occurrences across time and linking the frame level detections based on across-frame tracklets to produce the detections.

The literature on object detection in video also addressed the question of computing time, since applying a detector on each frame can be time consuming. In general, it is non-trivial to transfer state-of-the-art object detection networks to videos, as per-frame evaluation is slow. Deep feature flow [Zhu et al., 2017c] ran the convolutional sub-network only on sparse key frames and propagated deep feature maps to other frames via a flow field. It led to a significant speedup, as flow computation is relatively fast. The impression network [Hetang et al., 2017] iteratively absorbed sparsely extracted frame features, the impression features being propagated all the way down the video, which helped enhance the features of low-quality frames. In the same way, the light flow of [Zhu et al., 2018c] is a very small network designed to aggregate features on key frames. For non-key frames, sparse feature propagation was performed, reaching a speed of 25.6 fps. Fast YOLO [Shafiee et al., 2017] came up with an optimized architecture that has 2.8X fewer parameters with just a 2% IOU drop, by applying a motion-adaptive inference method. Finally, [Chen et al., 2018b] proposed to reallocate computational resources over a scale-time space: expensive detection is done sparsely and propagated across both scales and time, while cheaper networks perform the temporal propagation over a scale-time lattice.

An interesting question is: “What can we expect from using temporal information?” The improvement in mAP due to the direct use of temporal information can vary from +2.9% [Zhu et al., 2017b] to +5.6% [Feichtenhofer et al., 2017].

4.1.2 Object Detection in 3D Point Clouds

This section addresses the literature about object detection in 3D data, whether it is true 3D point clouds or 2D images augmented with depth data (RGBD images). These problems raise novel challenges, especially in the case of 3D point clouds for which the nature of the data is totally different (both in terms of structure and contained information). We can distinguish 4 main types of approaches depending on i) the use of 2D images and geometry, ii) the detections made in raw 3D point clouds, iii) the detections made in a 3D voxel grid iv) the detections made in 2D after projecting the point cloud on a 2D plane. Most of the presented methods are evaluated on the KITTI benchmark [Geiger et al., 2012]. Section 5.3 introduces the datasets used for 3D object detection and quantitatively compares best methods on these datasets.

The methods belonging to the first category, monocular, start by the processing of RGB images and then add shape and geometric prior or occlusion patterns to infer 3D bounding boxes, as proposed by Chen et al. [2016c], Mousavian et al. [2017] and Xiang et al. [2015]. Deng and Latecki [2017] revisited the amodal 3D detection by directly relating 2.5D visual appearance to 3D objects and proposed a 3D object detection system that simultaneously predicted 3D locations and orientations of objects in indoor scenes. Li et al. [2016b] represented the data in a 2D point map and used a single 2D end-to-end fully convolutional network to detect objects and predicted full 3D bounding boxes even while using a 2D convolutional network. Deep MANTA [Chabot et al., 2017] is a robust convolutional network introduced for simultaneous vehicle detection, part localization, visibility characterization and 3D dimension estimation, from 2D images.

Among the methods using 3D point clouds directly, we can mention the series of papers relying on the PointNet [Qi et al., 2017a] and PointNet++ [Qi et al., 2017c] networks, which are capable of dealing with the irregular format of point clouds without having to transform them into 3D voxel grids. F-PointNet [Qi et al., 2017b] is a 3D detector operating on raw point clouds (RGB-D scans). It leveraged a mature 2D object detector to propose 2D object regions in RGB images and then collected all points within the frustum to form a frustum point cloud.
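Collecting the points that fall inside the frustum of a 2D detection amounts to projecting each 3D point into the image and testing whether it lands in the box. The sketch below assumes points already expressed in the camera frame (x to the right, y downwards, z forward) and a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`; it is an illustration of the geometric step only, not the implementation of the cited method.

```python
def frustum_points(points, box2d, fx, fy, cx, cy):
    """Keep the 3D points whose image projection falls inside a 2D box.

    points: iterable of (X, Y, Z) in camera coordinates.
    box2d: (x1, y1, x2, y2) in pixels, e.g. the output of a 2D detector.
    """
    x1, y1, x2, y2 = box2d
    kept = []
    for X, Y, Z in points:
        if Z <= 0:
            continue  # behind the camera, cannot be seen in the image
        u = fx * X / Z + cx  # pinhole projection
        v = fy * Y / Z + cy
        if x1 <= u <= x2 and y1 <= v <= y2:
            kept.append((X, Y, Z))
    return kept
```

The resulting subset still contains background and foreground clutter at other depths along the viewing rays, which is why a 3D network is subsequently needed to segment the object and regress its box.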

Voxel based methods such as VoxelNet [Zhou and Tuzel, 2017] represented the irregular format of point clouds by fixed size 3D Voxel grids on which standard 3D convolution can be applied. Li [2017] discretized the point cloud on square grids, and represented discretized data by a 4D array of fixed dimensions. Vote3Deep [Engelcke et al., 2017] examined the trade-off between accuracy and speed for different architectures applied on a voxelized representation of input data.
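Turning an irregular point cloud into a regular grid boils down to bucketing the points by their integer cell coordinates. The following sketch keeps only the non-empty voxels in a dictionary; the optional cap on the number of points per voxel, the fact that the first points are kept (rather than a random subset) and the function name are assumptions made for illustration.

```python
import math


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0), max_points=None):
    """Group 3D points by voxel index; returns {(ix, iy, iz): [points]}.

    voxel_size: (dx, dy, dz) edge lengths of a voxel.
    max_points: optional cap on the number of points stored per voxel.
    """
    grid = {}
    for p in points:
        idx = tuple(
            int(math.floor((p[k] - origin[k]) / voxel_size[k])) for k in range(3)
        )
        bucket = grid.setdefault(idx, [])
        if max_points is None or len(bucket) < max_points:
            bucket.append(p)
    return grid
```

Storing only the occupied cells matters in practice: a LiDAR scan fills a tiny fraction of the volume it spans, so a dense array would consist mostly of empty voxels.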

Regarding approaches based on bird’s eye view, MV3D [Chen et al., 2017b] projected the LiDAR point cloud to a bird’s eye view on which a 2D region proposal network is applied, allowing the generation of 3D bounding box proposals. In a similar way, LMNet [Minemura et al., 2018] addressed the question of real-time object detection using 3D LiDAR by projecting the point cloud onto 5 different frontal planes. More recently, BirdNet [Beltrán et al., 2018] proposed an original cell encoding mechanism for bird’s eye view, which is invariant to distance and to differences in LiDAR device resolution, as well as a detector taking this representation as input. One of the fastest methods (50 fps) is ComplexYOLO [Simon et al., 2018], which expanded YOLOv2 with a specific complex regression strategy to estimate multi-class 3D boxes in Cartesian space, after building a bird’s eye view of the data.
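A bird’s eye view encoding can be sketched as a 2D grid over the ground plane in which each cell summarizes the points that fall into it. The two channels used below (maximum height and point count), the convention that heights are measured from a ground plane at z = 0, and the function name are assumptions made for illustration; the cited methods use richer and differently normalized encodings.

```python
import math


def birds_eye_view(points, x_range, y_range, cell_size):
    """Project 3D points onto the ground plane.

    Returns two grids indexed as [row][col] (row along y, col along x):
    the maximum height per cell and the number of points per cell.
    """
    n_cols = int(round((x_range[1] - x_range[0]) / cell_size))
    n_rows = int(round((y_range[1] - y_range[0]) / cell_size))
    height = [[0.0] * n_cols for _ in range(n_rows)]
    density = [[0] * n_cols for _ in range(n_rows)]
    for x, y, z in points:
        col = int(math.floor((x - x_range[0]) / cell_size))
        row = int(math.floor((y - y_range[0]) / cell_size))
        if 0 <= row < n_rows and 0 <= col < n_cols:  # drop points outside the grid
            density[row][col] += 1
            height[row][col] = max(height[row][col], z)
    return height, density
```

Once the point cloud is in this image-like form, any 2D detector can be applied to it, which explains the speed of the methods of this family.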

Some recent methods, such as [Ku et al., 2017], combined different sources of information (e.g., bird’s eye view, RGB images, 3D voxels, etc.) and proposed an architecture performing multimodal feature fusion on high resolution feature maps. Ku et al. [2017] is one of the top performing methods on the KITTI benchmark [Geiger et al., 2012]. Finally, it is worth mentioning the super-pixel based method by Srivastava et al. [2018], which made it possible to discover novel objects in 3D point clouds.

4.2 Detecting Objects Under Constraints

In object detection, challenges arise not only because of the naturally expected problems (scale, rotation, localization, occlusions, etc.) but also because of those that are created artificially. The first motivation for the following discussion is to know and understand the research works that deal with the inadequacy of annotations in certain datasets. This inadequacy could be due to weak (image-level) labels, scarce bounding box annotations or no annotations at all for certain classes. The second motivation is to discuss the approaches dealing with the hardware and application constraints that real-world detectors might encounter.

4.2.1 Weakly Supervised Detection

Research teams want to include as many images as possible in their proposed datasets. Because of budget constraints or for other reasons, they sometimes choose not to annotate precise bounding boxes around objects and include only image-level annotations or captions. The object detection community has shown that, with enough weakly annotated data, it is still possible to train good object detectors.

The most obvious way to address Weakly Supervised Object Detection (WSOD) is to use the Multiple Instance Learning (MIL) framework [Maron and Lozano-Pérez, 1997]. The image is considered as being a bag of regions extracted by conventional object proposals: at least one of these candidate regions is positive if the image has the appropriate weak label, if not, no region is positive. The classical formulation of the problem at hand (before CNNs) then becomes a latent-SVM on the region’s features where the latent part is the assignment of each proposal (that is weakly constrained by the image label). This problem being highly non-convex is heavily dependent on the quality of the initialization.
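The bag-of-regions formulation can be sketched in a few lines. This is a minimal illustration, assuming region scores already produced by some scoring function; the hinge loss and the function names are choices made for the example, not the exact objective of a cited paper.

```python
import numpy as np

def mil_image_score(region_scores):
    """Bag-level score: an image is positive if its best region is positive."""
    return float(np.max(region_scores))

def mil_latent_region(region_scores):
    """Latent assignment: index of the region currently explaining the label."""
    return int(np.argmax(region_scores))

def mil_hinge_loss(region_scores, image_label):
    """Hinge loss on the bag score; image_label is +1 or -1."""
    return max(0.0, 1.0 - image_label * mil_image_score(region_scores))
```

Because the selected region depends on the current scores, the objective is non-convex, which is why initialization matters so much in this setting.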

Song et al. [2014a, b] thus focused on the initialization of the boxes by starting from selective-search proposals. They used, for each proposal, its K-nearest neighbors in other images to construct a bipartite graph. The boxes were then pruned by taking only the patches that occur in most positive images (covering) while not belonging to the set of neighbors of regions found in negative images. They also applied Nesterov smoothing on the SVM objective to make the optimization easier. Of course, if the proposals do not span enough of the image, some objects will not be detected and the performance will suffer, as there is no re-localization. The work of Sun et al. [2016] also belongs to this category. Bilen et al. [2015] added regularization to the smoothed optimization problem of Song et al. [2014a] using prior knowledge, but followed the same general directions. In another related research direction, Wang et al. [2014] learned to cluster the regions extracted with selective search into K categories using unsupervised learning (pLSA) and then learned category selection using bag of words to determine the most discriminative clusters per class.

However, it is not always a requirement to explicitly solve the latent-SVM problem. Thanks to the fully convolutional structure of most CNNs, it is sometimes possible to get a rough idea of where an object might be while training for classification. For example, the arg-max of the spatial heat maps produced before global max-pooling is often located inside a bounding box, as shown in [Oquab et al., 2015, 2014]. It is also possible to learn to detect objects without using any ground truth bounding boxes for training by masking regions of the image and seeing how the global classification score is impacted, as proposed by Bazzani et al. [2016].

This free localization information can be improved through the use of different pooling strategies, for instance by producing a spatial heat map and using global average pooling instead of global max pooling when training for classification. This strategy was used in [Zhou et al., 2016a], where the heat maps per class were thresholded to obtain bounding boxes. In this line of work, Pinheiro and Collobert [2015] went a step further by producing pixel-level label segmentation maps using Log-Sum-Exp pooling in conjunction with some image and smoothing priors. Other pooling strategies involved aggregating minimum and maximum evidence to get a more precise idea of where the object is and is not, e.g., as in the line developed in Durand et al. [2015, 2016, 2017]. Bilen and Vedaldi [2016] used the spatial pyramid pooling module to take MIL to the modern age by incorporating it into a Fast R-CNN like architecture with a two-stream proposal classification part: one stream produces classification scores and the other relative rankings of proposals, and the two are merged using Hadamard products, thus producing region-level label predictions as in classic detection settings. They then aggregated the region-level predictions per image by taking the sum. They trained the network end-to-end using image-level labels thanks to this aggregation module, while adding a spatial-regularization constraint on the features obtained by the SPP module.
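The two-stream aggregation described for Bilen and Vedaldi [2016] can be sketched as follows. This is a simplified reading of the description above, assuming the two streams output raw logits of shape (regions, classes); the function names are hypothetical and the spatial regularizer is left out.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def two_stream_scores(cls_logits, det_logits):
    """cls_logits, det_logits: (R, C) arrays for R regions and C classes."""
    cls = softmax(cls_logits, axis=1)          # per region: which class is it?
    det = softmax(det_logits, axis=0)          # per class: which region ranks highest?
    region_scores = cls * det                  # Hadamard (element-wise) product
    image_scores = region_scores.sum(axis=0)   # sum over regions gives image-level scores
    return region_scores, image_scores
```

Each image-level score lies between 0 and 1, so it can be compared directly with the image-level label during training, while the region scores provide the detections at test time.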

Another idea, which can be combined with MIL, is to draw the supervision from elsewhere. Tracked object proposals were used by Kumar Singh et al. [2016] to extract pseudo-ground truth to train detectors. This idea was further explored by Chen et al. [2017a], where keywords extracted from the subtitles of documentaries made it possible to further ground and cluster the generated annotations. In a similar way, Yuan et al. [2017] used action description supervision via LSTMs. Cheap supervision can also be gained by involving user feedback [Papadopoulos et al., 2016], where the users iteratively improved the pseudo-ground truth by saying whether the objects were missed or partly included in the detections. Click supervision by users, far less demanding than full annotations, also improved the performance of detectors [Papadopoulos et al., 2017]. Roy et al. [2016] used active learning to select the right images to annotate and thus get the same performance with far fewer images. One can also leverage strong annotations for other classes to improve the performance of weakly supervised classes. This was done in [Tang et al., 2016] by using the powerful LSDA framework [Hoffman et al., 2014]. This was also the case in [Huval et al., 2013, Hoffman et al., 2015, Rochan and Wang, 2015].

This year, a lot of interesting new works continued to develop the MIL+CNN framework using diverse approaches [Ge et al., 2018, Zhang et al., 2018g, Wan et al., 2018, Zhang et al., 2018e, f, Tang et al., 2017a]. These articles will not be treated in detail because the focus of this survey is object detection in general and not WSOD.

As of this writing, the state-of-the-art mAP on VOC2007 in WSOD is 47.6% [Zhang et al., 2018f]. The gap is being reduced at an exhilarating pace but we are still far from the 83.1% state-of-the-art with full annotations [Mordan et al., 2017] (without COCO pre-training).

4.2.2 Few-shot Detection

The cost of annotating thousands of boxes over hundreds of classes is too high. Although some large-scale datasets have been created, as described in Section 5, it is not practical to do so for every single target domain. Collecting and annotating training examples is even costlier for video than for still images, making few-shot detection more interesting. For this purpose, researchers have come up with ways to train detectors with as few as three to five bounding boxes per target class and get lower but competitive performance compared to the fully supervised approach on a large-scale dataset. Few-shot learning usually relies on semi-supervised learning mechanisms.

Dong et al. [2017] took an iterative approach to simultaneously train the model and generate new samples, which are used for training in the following iterations. They observed that as the model becomes more discriminative, it is able to sample harder instances as well as a larger number of them. Iterating between multiple kinds of detectors was found to outperform the single detector approach. One interesting aspect of the paper is that their approach, with only three to four annotations per class, gives results comparable to weakly annotated approaches with image-level annotations on the whole PASCAL VOC dataset. A similar approach was used by Keren et al. [2018], who proposed a model that can be trained with as few as one single exemplar of an unseen class and a larger target example that may or may not contain an instance of the same class as the exemplar (weakly supervised learning). This model was able to simultaneously identify and localize instances of classes unseen at training time.

Another way to deal with few-shot detection is to fine-tune a detector trained on a source domain to a target domain for which only a few samples are available. This is what Chen et al. [2018a] did, by introducing a novel regularization method that involves depressing the background and transferring knowledge from the source domain to the target domain to enhance the fine-tuned detector.

For videos, Misra et al. [2015] proposed a semi-supervised framework in which a few initial labeled boxes made it possible to iteratively learn and label hundreds of thousands of object instances automatically. Criteria for reliable object detection and tracking constrained the semi-supervised learning process and minimized semantic drift.

4.2.3 Zero-shot Detection

Zero-shot detection is useful for systems in which a large number of classes are to be detected. It is hard to annotate a large number of classes, as the cost of annotation gets higher with more classes. This is a unique type of problem in the object detection domain, as the aim is to classify and localize new categories at test time without any training examples, under the constraint that the new categories are semantically related to the objects in the training classes. In practice, this means that semantic attributes are available for the unseen classes. Two challenges come with this problem. First, zero-shot learning techniques are restricted to recognizing a single dominant object and not all the object instances present in the image. Second, the background class during fully supervised training may contain objects from unseen classes, so the detector will be trained to discriminatively treat these classes as background.

While there is a comparatively large literature on zero-shot classification, well covered in the survey [Fu et al., 2018], zero-shot detection has only a few papers to the best of our knowledge. Zhu et al. [2018b] proposed a method where semantic features are utilized during training but which is agnostic to semantic information at test time. This means they incorporated semantic attribute information in addition to seen classes during training and generated proposals only, with no identification label, for seen and unseen objects at test time. Rahman et al. [2018] proposed a multitask loss that combines max-margin, useful for separating individual classes, and semantic clustering, useful for reducing noise in semantic vectors by positioning similar classes together and dissimilar classes far apart. They used ILSVRC [Deng et al., 2009], which contains an average of only three objects per image. They also proposed another method for a more general case in which unseen classes are not predefined during training. Bansal et al. [2018] proposed two background-aware approaches for this task: statically assigning the background image regions to a single background class embedding, and a latent assignment based alternating algorithm which associates background regions with different classes belonging to a large open vocabulary. They used MSCOCO [Lin et al., 2014] and VisualGenome [Krishna et al., 2017], which contain an average of 7.7 and 35 objects per image respectively. They also set the number of unseen classes to be higher, making their task more complex than that of the previous two papers.
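To make the role of semantic vectors concrete, here is a generic sketch of how a region can be labeled with an unseen class. It is not the method of any cited paper: it assumes a region feature already projected into the semantic space and simply picks the class embedding with the highest cosine similarity. All names are hypothetical.

```python
import numpy as np

def zero_shot_label(region_embedding, class_embeddings, class_names):
    """Pick the class whose semantic vector is most similar (cosine similarity)."""
    r = region_embedding / np.linalg.norm(region_embedding)
    c = class_embeddings / np.linalg.norm(class_embeddings, axis=1, keepdims=True)
    sims = c @ r
    return class_names[int(np.argmax(sims))], sims
```

Since the class embeddings come from word vectors or attributes rather than from images, a class with no training boxes can still be predicted as long as its semantic vector is available.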

Since it is quite a new problem, there is no well-defined experimental protocol for this task. Existing protocols vary in the number and nature of unseen classes, the use of semantic attribute information of unseen classes during training, the complexity of the visual scene, etc.

4.2.4 Fast and Low Power Detection

There is generally a trade-off between performance and speed (we refer to the comprehensive study of [Huang et al., 2017c] for instance). When one needs real-time detectors, as for video object detection, one loses some precision. However, researchers have been constantly working on improving the precision of fast methods and making precise methods faster. Furthermore, not every setup can have powerful GPUs, so for most industrial applications the detectors have to run on CPUs or on low-power embedded devices such as the Raspberry Pi.

Most real-time methods are single stage because they need to perform inference in a quasi fully convolutional manner. The most iconic methods have already been discussed in detail in the rest of the paper [Redmon et al., 2016, Liu et al., 2016, Redmon and Farhadi, 2017, 2018, Lin et al., 2017b]. Zhou et al. [2018] designed a scale transfer module to replace the feature pyramid and thus obtained a detection network more accurate and faster than YOLOv2. Iandola et al. [2014] provided a framework to efficiently compute multi-scale features. Redmon and Angelova [2015] used a YOLO-like architecture to provide oriented bounding boxes symbolizing grasps in real time. Shafiee et al. [2017] built a faster version of YOLOv2 that runs on embedded devices other than GPUs. Li and Zhou [2017] managed to speed up the SSD detector, bringing it to almost 70 fps, using a more lightweight architecture.

In single stage methods most of the computations are found in the backbone network, so researchers started to design new backbones for detection with fewer operations, such as PVANet [Kim et al., 2016], a deep and thin network with fewer channels than its classification counterparts, or SqueezeDet [Wu et al., 2017], which is similar to YOLO but with more anchors and fewer parameters.

Iandola et al. [2016] built a backbone reaching AlexNet-level accuracy with 50 times fewer parameters. Howard et al. [2017] used depth-wise-separable convolutions and point-wise convolutions to build an efficient backbone called MobileNets for image classification and detection. Sandler et al. [2018] improved upon it by adding residual connections and removing non-linearities. Very recently, Tan et al. [2018] used architecture search to come up with an even more efficient network (1.5 times faster than Sandler et al. [2018] and with lower latency). ShuffleNet [Zhang et al., 2017c] attained impressive performance on ARM devices, which can sustain only a limited number of computations (40 MFLOPs). Their backbone is 13 times faster than AlexNet.
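The saving brought by depth-wise-separable convolutions can be checked with simple arithmetic. The sketch below counts weights only (biases and batch normalization are ignored) and uses layer sizes chosen purely for illustration.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights of a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution mixing the channels
    return depthwise + pointwise

# Illustrative layer: 3 x 3 kernel, 128 input channels, 256 output channels.
ratio = standard_conv_params(3, 128, 256) / separable_conv_params(3, 128, 256)
```

For a 3 x 3 kernel the reduction approaches a factor of 9 as the number of output channels grows, since the ratio equals 1 / (1 / c_out + 1 / k^2).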

Finally, Wang et al. [2018a] proposed PeleeNet, a light network that is 66% of the model size of MobileNet, achieving 76.4% mAP on PASCAL VOC2007 and 22.4% mAP on MS COCO at a speed of 17.1 fps on iPhone 6s and 23.6 fps on iPhone 8. [Li et al., 2018a] is also very efficient, achieving 72.1% mAP on PASCAL VOC2007 with 0.95M parameters and 1.06B FLOPs.

Fast double-staged methods exist, although the NMS part generally becomes the bottleneck. Among them one can mention again Singh et al. [2017], who brought R-FCN to 30 fps by using detection heads specific to super-classes (sets of similar classes), thus decoupling detection from classification. Using a mask obtained by a fast and coarse face detection method, the authors of [Chen et al., 2016b] greatly reduced the computational complexity of their double-stage detector at test time by only computing convolutions on non-masked regions. SNIPER [Singh et al., 2018] can train on 512x512 images using an adaptive sampling of the regions of interest. Its training can therefore use larger batch sizes and be much faster, but it needs 30% more pixels than the original images at inference time, making it slower.
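To see why NMS can become a bottleneck, it helps to look at the standard greedy procedure, sketched below with NumPy. The loop is sequential by nature: each kept box must be compared with all remaining candidates before the next one can be selected. The 0.5 threshold is a common default, not a value taken from the cited papers.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS. boxes: (N, 4) as (x1, y1, x2, y2); returns the kept indices."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = np.argsort(-scores)   # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the current best box with every remaining box.
        w = np.clip(np.minimum(x2[i], x2[rest]) - np.maximum(x1[i], x1[rest]), 0, None)
        h = np.clip(np.minimum(y2[i], y2[rest]) - np.maximum(y1[i], y1[rest]), 0, None)
        inter = w * h
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]   # drop boxes overlapping too much
    return keep
```

With many proposals per image the number of pairwise comparisons grows quickly, which explains the interest in reducing the number of candidate boxes before this step.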

A lot of work has also been done on pruning and/or quantizing the weights of CNNs for image classification [Han et al., 2015, Lin et al., 2015, Rastegari et al., 2016, Hubara et al., 2016, Zhou et al., 2016b, Hubara et al., 2017, Huang et al., 2017a, 2018a, Zhang et al., 2018a, Peng et al., 2018], but far less on detection so far. One can nevertheless find some detection articles that used compression or pruning. Girshick [2015] used SVD on the weights of the fully connected layers in Fast R-CNN. Masana et al. [2016] pruned near-zero weights in detection networks and extended the compression to be domain-adaptive in Masana et al. [2017].
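The SVD compression of fully connected layers can be sketched as follows. This is a minimal illustration of the general technique, with a hypothetical function name: a weight matrix of shape (out, in) is replaced by two thinner matrices, which reduces the parameter count from out x in to rank x (out + in).

```python
import numpy as np

def compress_fc(W, rank):
    """Factor an (out, in) weight matrix into two thinner layers via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (out, rank), singular values folded into A
    B = Vt[:rank]                # (rank, in)
    return A, B                  # W is approximated by A @ B
```

In a network, the single layer y = W x becomes two layers y = A (B x) without a non-linearity in between; the approximation is exact when W has rank at most `rank` and degrades gracefully otherwise.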

It is not only necessary to respect the available material constraints (data and machines); detectors have to be reliable too. They must be robust to perturbations, and when they make mistakes, those mistakes need to be interpretable, which is a challenge in itself given the millions of weights and the architectural complexity of modern pipelines. Outperforming all other methods on a benchmark is a good sign, but performing accurately in the wild is something else. That is why we dedicate the following sections to the exploration of such challenges.

4.3 Towards Versatile Object Detectors

So far in all this survey, detectors were tested on limited, well-defined benchmarks. It is mandatory to assess their performances. However, at the end we are really interested in their behaviors in the wild where no annotations are present. Detectors have to be robust to unusual situations and one would wish for detectors to be able to evolve themselves. This section will review the state of deep learning methods w.r.t. these expectations.

4.3.1 Interpretability and Robustness

With the recent craze about self-driving cars, it has become a top priority to build detectors that can be trusted with our lives. Hence, detectors should be robust to physical adversarial attacks [Lu et al., 2017, Chen et al., 2018d] and to weather conditions, which was the reason for building KITTI [Geiger et al., 2012] and DETRAC [Wen et al., 2015] back then and has now led to the creation of two very large car detection datasets: ApolloScape from Baidu [Yu et al., 2018] and BDD100K from Berkeley [Yu et al., 2018]. The driving conditions of the real world are very complex: changing environments, reflections, different traffic signs and rules for different countries. So far, this open problem is largely unsolved, even if some industry players seem to be confident enough to leave self-driving cars without safety nets in specific cities. It will surely involve at some point the heavy use of synthetic data, otherwise it would take a lifetime to gather the data necessary to be confident enough. To finish on a positive note, detectors in self-driving cars can benefit from multi-sensory inputs such as LiDAR point clouds [Himmelsbach et al., 2008], other lasers and multiple cameras, which can help disambiguate certain difficult situations (reflections on the cars in front, for instance).

But most of all the detectors should incorporate a certain level of interpretability so that if a dramatic failure happens it can be understood and fixed. It is also a need for legal matters. Very few works have done so because it requires delving into the feature maps of the backbone network. A few works proposed different approaches for classification only but no consensus has been reached yet. Among the popular methods one can cite the gradient map in the image space of Simonyan et al. [2013], the occlusion analysis of Zeiler and Fergus [2014b], the guided back propagation of Springenberg et al. [2014] and, recently, the perturbation approach of Fong and Vedaldi [2017]. No method exists yet for object detectors to the best of our knowledge. It would be an interesting research direction for future works.

4.3.2 Universal Detector, Lifelong Learning

Having object detectors able to iteratively, and without any supervision, learn to detect novel object classes and improve their performance would be one of the Holy Grails of computer vision. This can take the form of lifelong learning, where the goal is to sequentially retain learned knowledge and to selectively transfer that knowledge when learning a new task, as defined in [Silver et al., 2013], or of never-ending learning [Mitchell, 2018], where the system has sufficient self-reflection to avoid plateaus in performance and can decide how to progress by itself. However, one of the biggest issues with current detectors is that they suffer from catastrophic forgetting, as noted by Castro et al. [2018]: their performance on previously learned classes decreases when new classes are added incrementally. Some authors tried to face this challenge. For example, the knowledge distillation loss introduced by Li and Hoiem [2018] makes it possible to discard old data while using previous models to constrain updated ones during learning. In the domain of object detection, the only recent contribution we are aware of is the incremental learning approach of Shmelkov et al. [2017], relying on a distillation mechanism. Lifelong learning and never-ending learning are domains where a lot still has to be discovered or developed.
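A distillation loss of this kind can be sketched generically: the outputs of the previous model on the old classes serve as soft targets for the updated model. The example below is an illustration under assumptions (cross-entropy between temperature-softened outputs, temperature value chosen arbitrarily), not the exact loss of the cited papers.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(old_logits, new_logits, T=2.0):
    """Cross-entropy between softened outputs of the old and the updated model."""
    p_old = softmax(old_logits, T)             # soft targets from the frozen old model
    log_p_new = np.log(softmax(new_logits, T))
    return float(-(p_old * log_p_new).sum(axis=-1).mean())
```

The loss is smallest when the updated model reproduces the old model's outputs on the old classes, so adding it to the loss on the new classes penalizes drift without requiring the old training data.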

4.4 Concluding Remarks

It seems that deep learning in its current form is not yet fully ready to be applied to modalities other than 2D images. In videos, temporal consistency is hard to take into account with DCNNs because 3D convolutions are expensive; tubelets and tracklets are interesting ideas but lack the elegance of DCNNs on still images. For point clouds the picture is even worse. The voxelization of point clouds does not deal with their inherent sparsity and creates memory issues, and even the simple and original PointNet articles [Qi et al., 2017a, c], which leave the point clouds untouched, have not matured enough yet to be widely adopted by the community. Fortunately, dealing with other constraints like weak supervision or few training images is starting to produce worthy results without too much change to the original DCNN architectures [Durand et al., 2017, Ge et al., 2018, Zhang et al., 2018g, Wan et al., 2018, Zhang et al., 2018e, f, Tang et al., 2017a]. It seems to be more a matter of refining cost functions and coming up with more building blocks than of reinventing DCNNs entirely. However, the Achilles heel of deep learning methods is their interpretability and trustworthiness. The object detection community seems focused on improving performance on static benchmarks instead of finding ways to better understand the behavior of DCNNs. It is understandable, but it shows that deep learning has not yet reached full maturity. Eventually, one can hope that the performance of new detectors will plateau and that, when it does, researchers will be forced to come back to the basics and focus instead on interpretability and robustness before the next paradigm washes away deep learning entirely.

5 Datasets and Results

Most of the object detection’s influential ideas, concepts and literature having been now reviewed, the rest of the article dives into the datasets used to train and evaluate these detectors.

Public datasets play an essential role, as they not only make it possible to measure and compare the performance of object detectors but also provide resources for learning object models from examples. In the area of deep learning, these resources are essential, as it has been clearly shown that deep convolutional neural networks are designed to benefit and learn from massive amounts of data [Zhou et al., 2014]. This section discusses the main datasets used in the recent literature on object detection and presents state-of-the-art methods for each dataset.

5.1 Classical Datasets with Common Objects

We start by presenting the datasets containing everyday life objects captured with consumer cameras. This category contains the most important datasets for the domain, attracting the largest part of the community. We will discuss in a second section the datasets devoted to specific detection tasks (e.g., face detection, pedestrian detection, etc.).

5.1.1 Pascal-VOC

Pascal-VOC [Everingham et al., 2010] is the most iconic object detection dataset. It has changed over the years, but the format everyone is familiar with is the one that emerged in 2007 with 20 classes (Person: person; Animal: bird, cat, cow, dog, horse, sheep; Vehicle: aeroplane, bicycle, boat, bus, car, motorbike, train; Indoor: bottle, chair, dining table, potted plant, sofa, tv/monitor). It is now used as a test bed for most new algorithms. As it is quite small, there have been claims that we are starting to overfit on the test set, and therefore MS-COCO (see next section) is preferred nowadays to demonstrate the quality of a new algorithm. The 0.5 IoU based metric this dataset introduced has now become the de facto standard for every single detection problem. Overall, this dataset’s impact on the development of innovative methods in object detection cannot be overstated. It is quite hard to find all the relevant literature, but we have tried to be as thorough as possible in terms of best performing methods. The URL of the dataset is

Two versions of Pascal-VOC are commonly used in the literature, namely VOC2007 and VOC2012:

  • VOC07, with 9,963 images containing 24,640 annotated objects, is small. For this reason, papers using VOC07 often train on the union of VOC07 and VOC12 trainvals (VOC07+12). The Average Precision (AP) averaged across the 20 classes is saturating at around 80 points @0.5 IoU. Some methods got extra points, but it seems one cannot go over around 85 points (without pre-training on MS COCO). Using MS COCO data in addition, one can get up to 86.3 AP (see [Li et al., 2018b]). We chose to display only methods with mAP over 80 points in Table 1. We do not distinguish between methods that use multiple inference tricks and methods that report results as is. However, for each method we report the highest published result we could find.

  • VOC12 is a little bit harder than its 2007 counterpart, and the 80 point mark has only just been passed. Because it is harder, most of the literature uses the union of the whole VOC2007 data (trainval+test) and VOC2012 trainval, referred to as 07++12. Again, better results are obtained with pre-training on COCO data (83.8 points in [He et al., 2016]). Results above 75 points are presented in Table 2.

On both splits, all backbones used by the leaders of the board are heavy backbones with more than 100 layers, except for [Zhang et al., 2018h], which gets close to the state of the art using only VGG-16.

Method                   Backbone       mAP
[Mordan et al., 2017]    ResNeXt-101    83.1
[Xu et al., 2017a]       ResNet-101     83.1
[Zhai et al., 2018]      ResNet-101     82.9
[Dai et al., 2017]       ResNet-101     82.6
[Kong et al., 2018]      ResNet-101     82.4
[Li et al., 2018b]       ResNet-101     82.1
[Zhang et al., 2018h]    VGG-16         81.7
[Fu et al., 2017]        ResNet-101     81.5
[Zhou et al., 2018]      DenseNet-169   80.9
[Zhao et al., 2018b]     ResNet-101     80.7
[Dai et al., 2016b]      ResNet-101     80.5
Table 1: State-of-the-art methods on VOC07 test set (using VOC07+12)

Method                   Backbone       mAP
[Xu et al., 2017a]       ResNet-101     81.2
[Kong et al., 2018]      ResNet-101     81.1
[Mordan et al., 2017]    ResNeXt-101    80.9
[Li et al., 2018b]       ResNet-101     80.6
[Zhai et al., 2018]      ResNet-101     80.5
[Zhang et al., 2018h]    VGG-16         80.3
[Fu et al., 2017]        ResNet-101     80.0
[Liu et al., 2016]       ResNet-101     78.5
[Dai et al., 2016b]      ResNet-101     77.6
Table 2: State-of-the-art methods on VOC12 test set (using VOC07++12)

5.1.2 MS COCO

MS COCO [Lin et al., 2014] is the most challenging object detection dataset available today. It consists of 118,000 training images, 5,000 validation images and 41,000 testing images. Its authors have also released 120K unlabeled images that follow the same class distribution as the labeled images. They may be useful for semi-supervised learning on COCO. The MS COCO challenge has been ongoing since 2015. There are 80 object categories, four times as many as in Pascal-VOC. MS COCO is a fine replacement for Pascal-VOC, which has arguably started to age a little. Like ImageNet in its time, MS-COCO has become the de facto standard for the object detection community, and any method reaching the state of the art on it is assured to gain much traction and visibility. The AP is calculated similarly to Pascal-VOC but averaged over multiple IoU thresholds from 0.5 to 0.95.
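The difference between the two metrics can be made concrete with a small sketch. It assumes boxes given as (x1, y1, x2, y2) and the usual COCO convention of ten thresholds spaced by 0.05; the helper names are hypothetical and the full AP computation (ranking detections, precision-recall curve) is left out.

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

# Pascal-VOC style: one threshold. COCO style: ten thresholds from 0.5 to 0.95.
VOC_THRESHOLD = 0.5
COCO_THRESHOLDS = np.linspace(0.5, 0.95, 10)

def matched_thresholds(pred, gt):
    """Thresholds at which the prediction would count as a true positive."""
    return COCO_THRESHOLDS[COCO_THRESHOLDS <= iou(pred, gt)]

def coco_style_ap(ap_per_threshold):
    """Average the per-threshold APs into a single COCO-style number."""
    return float(np.mean(ap_per_threshold))
```

A loosely localized detection that passes the 0.5 threshold is a full success under the Pascal-VOC metric, but it only contributes at the lowest thresholds of the COCO metric, which therefore rewards precise localization.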

Most available alternatives stemmed from Faster R-CNN [Ren et al., 2015], which in its first iteration won the first challenge with 37.3 mAP with a ResNet101 backbone. In the second iteration of the challenge, the mAP went up to 41.5 with an ensemble of Faster R-CNN [Ren et al., 2015] that used a different implementation of RoI-Pooling. This may have inspired the RoI-Align of Mask R-CNN [He et al., 2017]. Tao Kong claimed that a single Faster R-CNN with HyperNet features [Kong et al., 2016] can reach 42.0 mAP. The best published single-model method [Peng et al., 2017a] is nowadays around 50.5 mAP (52.5 with an ensemble) and relies on different techniques already mentioned in this survey. Amongst them one can mention FPN [Lin et al., 2017a], large batch training [Peng et al., 2017a] and GCN [Peng et al., 2017b]. Ensembling Mask R-CNNs [He et al., 2017] gave around the same performance as [Peng et al., 2017a], at around 50.3 mAP. Deformable R-FCN [Dai et al., 2017] is not lagging too far behind, with 48.5 mAP single-model performance (50.4 mAP with an ensemble) using Soft NMS [Bodla et al., 2017] and the "mandatory" FPN [Lin et al., 2017a]. Other entries were based mostly on Mask R-CNN [He et al., 2017]. If one extrapolates this trend, one can expect results at ECCV 2018 to be at around 60 mAP (although the methods will be ranked according to the instance segmentation metric). A leaderboard summarizes the best results obtained so far. The URL of the dataset is

5.1.3 ImageNet Detection Task

ImageNet is a dataset organized according to the nouns of the WordNet hierarchy. Each node of the hierarchy is depicted by hundreds and thousands of images, with an average of over 5,000 images per node. Since 2010, the Large Scale Visual Recognition Challenge has been organized each year and contains a detection challenge using ImageNet images. The detection task, in which each object instance has to be detected, has 200 categories. There is also a classification and localization task with 1,000 categories, in which algorithms have to produce only 5 labels (and 5 bounding boxes), so as not to penalize the detection of objects that are present but not included in the ground truth. In the 2017 contest, the top detector was proposed by a team from Nanjing University of Information Science and Imperial College London. It ranked first on 85 categories with an overall AP of 73.13. As far as we know, there is no paper describing the approach precisely (but some slides are available at the workshop page). The 2nd ranked method was from Bae et al. [2017], who observed that modern convolutional detectors behave differently for each object class. The authors consequently built an ensemble detector by finding the best detector for each object class. They obtained an AP of 59.30 points and won 10 categories. ImageNet is available at

5.1.4 VisualGenome

VisualGenome [Krishna et al., 2017] is a very peculiar dataset focusing on object relationships. It contains over 100,000 images. Each image has bounding boxes but also complete scene graphs. Over 17,000 categories of objects are present. The most represented ones by far are man and woman, followed by trees and sky. On average there are 21 objects per image. It is unclear whether it qualifies as an object detection dataset, as the paper does not include clear object detection metrics or evaluation, its focus being on scene graphs and visual relationships. However, it is undoubtedly an enormous source of strongly supervised images to train object detectors. The Visual Genome dataset has a huge number of classes, and most of the objects are small and hard to detect. The mAP reported in the literature is therefore much smaller than on the previous datasets. One of the best performing approaches is that of Li et al. [2017e], which reached 7.43 mAP by linking object detection, scene graph generation and region captioning. Faster R-CNN [Girshick, 2015] has a mAP of 6.72 points on this dataset. The URL of the dataset is

5.1.5 OpenImages

The OpenImagesV4 challenge [Krasin et al., 2017], which will be organized for the first time at ECCV2018, offers the largest common object detection dataset to date, with up to 500 classes (including the familiar ones from Pascal-VOC) on 1,743,000 images and more than 12,000,000 bounding boxes, with an average of 7 objects per image for training, and 125,436 images for tests (41,620 for validation). The object detection metric is the AP@0.5IoU averaged across classes, taking into account the hierarchical structure of the classes, with some technical subtleties on how to deal with groups of objects closely packed together. This is the first detection dataset to have so many classes and images, and it will surely require some new breakthrough to get it right. At the time of writing there are no published or unpublished results on it, although an Inception ResNet Faster R-CNN baseline reported on their site reaches 37 mAP. The URL of the project is
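To make the AP@0.5IoU metric more concrete, the following is a minimal sketch of a single-class average precision computed on one small set of boxes. It is a simplification under stated assumptions: the AP is the plain (uninterpolated) area under the precision-recall curve, the class hierarchy and the handling of groups of objects are ignored, and the names `iou` and `average_precision` are illustrative and not taken from any official evaluation toolkit.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections: Sequence[Tuple[float, Box]],
                      ground_truth: Sequence[Box],
                      iou_threshold: float = 0.5) -> float:
    """Uninterpolated AP for one class; detections are (score, box) pairs."""
    if not ground_truth:
        return 0.0
    matched = [False] * len(ground_truth)
    tp_flags: List[bool] = []
    # Visit detections from the most to the least confident.
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        # A detection is a true positive only if its best-overlapping
        # ground-truth box is still unmatched and the IoU is high enough.
        if best_iou >= iou_threshold and not matched[best_idx]:
            matched[best_idx] = True
            tp_flags.append(True)
        else:
            tp_flags.append(False)
    # Accumulate precision at each recall increment.
    ap, true_positives, previous_recall = 0.0, 0, 0.0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            true_positives += 1
            recall = true_positives / len(ground_truth)
            ap += (true_positives / rank) * (recall - previous_recall)
            previous_recall = recall
    return ap
```

A mean AP over classes would then simply average this value across all classes; real benchmarks differ in how they interpolate the curve and pool detections over images.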

For industrial applications, more often than not, the objects to detect do not come from the categories present in VOC or MS-COCO. Furthermore, they do not share the same variances; rotation variance, for instance, is a property of several application domains but is not present in any classical common object dataset. That is why, pushed by industry needs, several other object detection domains have appeared, each with its own literature. The most famous of them are listed in the following sections.

5.2 Specialized datasets

To find interesting domains one has to find interesting products or applications that drive them. The industry has given birth to many sub-fields in object detection: they wanted to have self-driving cars so we built pedestrian detection and traffic signs detection datasets; they wanted to monitor traffic so we had to have aerial imagery datasets; they wanted to be able to read text for blind persons or automatic translations of foreign languages so we constructed text detection datasets; some people wanted to do personalized advertising (arguably not a good idea) so we engineered logo datasets. They all have their place in this specialized dataset section.

5.2.1 Aerial Imagery

The detection of small vehicles in aerial imagery is an old problem that has gained much traction in recent times. However, it is only in the last few years that large datasets have been made publicly available, making the topic even more popular. The following paragraphs take inventory of these datasets and of the best performing methods.

Google Earth [Heitz and Koller, 2008] comprises 30 images of the city of Brussels with 1,319 small cars and vertical bounding boxes. Its variability is not enormous but it is still widely used in the literature. There are 5 folds. The best CNN result is [Cheng et al., 2016a] with 94.6 AP. It was later augmented with angle annotations by Henriques and Vedaldi [2017]. The data can be found on Geremy Heitz's webpage (

OIRDS [Tanner et al., 2009] contains only 180 vehicles and is not widely used by the community.

DLR 3k Munich Dataset [Liu and Mattyus, 2015] is one of the most used datasets in the small vehicle detection literature, with 20 extra large images: 10 training images with up to 3,500 cars and 70 trucks, and 10 test images with 5,800 cars and 90 trucks. Other classes are also available like car or truck’s trails and dashed lines. The state-of-the-art seems to belong to [Tang et al., 2017c] at 83% of F1 on both cars and trucks and [Tang et al., 2017b] at 82%, which provide oriented boxes. Some relevant articles that compare on this dataset are [Sommer et al., 2017a, b, Deng et al., 2017]. The data can be downloaded by asking the provided contact on

VeDAI [Razakarivony and Jurie, 2016] is for vehicle detection in aerial images. The vehicles contained in the database, in addition to being small, exhibit different variability such as multiple orientations, lighting/shadowing changes, occlusions, etc. Furthermore, each image is available in several spectral bands and resolutions. They provide the same images in 2 resolutions, 512x512 and 1024x1024. There are a total of 9 classes and 1,200 images with an average of 5.5 instances per image. It is one of the few datasets to have 10 folds, and the metric is based on an ellipse-based distance between the center of the ground truth and the centers of the detections. The state-of-the-art is currently held by [Ogier Du Terrail and Jurie, 2017], although many recent articles used their own metrics, which makes them difficult to compare [Tang et al., 2017b, Sakla et al., 2017, Sommer et al., 2017b, 2018, Tang et al., 2017c]. VeDAI is available at

COWC [Mundhenk et al., 2016], introduced in ECCV2016, is a very large dataset with regions from all over the world and more than 32,000 cars. It also contains almost 60,000 hard negative patches hand-picked, which is a blessing when training detectors that do not include hard-example mining strategies. Unfortunately, no test data annotations are available so detection methods cannot yet be properly tested on it. COWC is available at

DOTA [Xia et al., 2017], released this year at CVPR, is the first mainstream dataset to change its metric to incorporate rotated bounding boxes, similar to the text detection datasets. The images are of very different resolutions and zoom factors. There are 2,800 images with almost 200,000 instances and 15 categories. This dataset will surely become one of the important ones in the near future. The leader board shows that Mask R-CNN structures are the best at this task for the moment, with the winner culminating at 76.2 oriented mAP, but there is no other published method apart from [Xia et al., 2017] yet. UCAS-AOD [Zhu et al., 2015a], NWPU VHR10 [Cheng et al., 2016b] and HRSC2016 [Liu et al., 2017] also all provide oriented annotations, but they are hard to find and very few articles actually use them. DOTA is available at
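As an illustration of what a rotated bounding box involves, here is a small sketch that converts an oriented box into its four corner points and checks the resulting polygon area. The (center, size, angle) parameterisation and the function names are assumptions chosen for illustration; they are not necessarily the annotation format of DOTA or of any other dataset mentioned here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def oriented_box_corners(cx: float, cy: float, width: float, height: float,
                         angle_rad: float) -> List[Point]:
    """Corners of a box centred at (cx, cy) and rotated by angle_rad."""
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_w, half_h = width / 2.0, height / 2.0
    corners = []
    # Rotate each corner offset around the centre, then translate.
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners


def polygon_area(points: List[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x1, y1) in enumerate(points):
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0
```

An oriented IoU additionally needs the intersection of two such polygons, which is usually delegated to a geometry library rather than written by hand.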

xView [Lam et al., 2018] is a very large scale dataset gathered by the Pentagon, containing 60 classes and 1 million instances. It is split into three parts: train, val and test. The first challenge will end in August 2018; no results are available yet. xView is available at

VisDrone [Zhu et al., 2018a] is the most recent dataset including aerial images. The images were captured by different drones flying over 14 different cities in China separated by thousands of kilometers, in different scenarios and under various weather and lighting conditions. The dataset consists of 263 video sequences formed by 179,264 frames and 10,209 static images, and contains different objects (such as pedestrians, vehicles, bicycles, etc.) and densities (sparse and crowded scenes). Frames are manually annotated with more than 2.5 million bounding boxes, and some attributes, e.g. scene visibility, object class and occlusion, are provided. VisDrone is very recent and no results are available yet. VisDrone is available at

5.2.2 Text Detection in Images

Text detection in images or videos is a common way to extract content from images and opens the door to image retrieval or automatic text translation applications. We inventory, in the following, the main datasets as well as the best practices to address this problem.

ICDAR 2003 [Lucas et al., 2003] was one of the first public datasets for text detection. The dataset contains 509 scene images and the scene text is mostly centered and iconic. Delakis and Garcia [2008] were among the first to use a CNN on this dataset.

Street View Text (SVT) [Wang and Belongie, 2010], taken from Google StreetView, is a dataset filled mostly with business names from outdoor streets. There are 350 images and 725 instances. One of the best performing methods on SVT is [Zhao et al., 2018a] with an F-measure of 83%. SVT can be downloaded from

MSRA-TD500 [Tu et al., 2012] contains 500 natural images, which are taken from indoor (office and mall) and outdoor (street) scenes. The resolutions of the images vary from to . There are Chinese, English and mixed texts. The training set contains 300 images randomly selected from the original dataset and the remaining 200 images constitute the test set. The best performing method on MSRA-TD500 is [Liao et al., 2018b] with an F-measure of 79%. Shi et al. [2017a], Yao et al. [2016], Ma et al. [2018] and Zhang et al. [2016b] also performed very well (F-measures of 77%, 76%, 75% and 75% respectively). The dataset is available at

IIIT 5k-word [Mishra et al., 2012] has 1,120 images and 5,000 words from both street scene texts and born-digital images. 380 images are used for training and the remaining ones for testing. Each text also has a category label, easy or hard. [Liao et al., 2018b] is state-of-the-art, as for MSRA-TD500. IIIT 5k-word is available at

Synth90K [Jaderberg et al., 2014] is a completely generated grayscale text dataset with multiple fonts and vocabulary well blended into scenes with 9 million images from a 90,000 vocabulary. It can be found on the VGG page at

ICDAR 2015 [Karatzas et al., 2015] is another popular iteration of the ICDAR challenge, following ICDAR 2013. Busta et al. [2017] got state-of-the-art 87% of F measure in comparison to the 83.8% of Liao et al. [2018b] and the 82.54% of Jiang et al. [2017]. TextBoxes++ [Liao et al., 2018a] reached 81.7% and Shi et al. [2017a] is at 75%.

COCO Text [Veit et al., 2016], based on MS COCO, is the biggest dataset for text detection. It has 63,000 images with 173,000 annotations. [Liao et al., 2018b] and [Zhou et al., 2017] are so far the only published results that differ from the baselines implemented in the dataset paper [Veit et al., 2016], so there must still be room for improvement. The very recent [Liao et al., 2018a] outperformed [Zhou et al., 2017]. COCO Text is available at

RCTW-17 (ICDAR 2017) [Shi et al., 2017b] is the latest ICDAR database. It is a large line-based dataset with mostly Chinese text. Liao et al. [2018b] achieved SOTA on this one too with 67.0% of F measure. The dataset is available at
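The F-measures quoted for the text detection datasets above combine precision and recall into a single number. As a reminder, a minimal sketch is given below, assuming the usual balanced F-measure (harmonic mean) computed from counts of true positives, false positives and false negatives; how a detection is matched to a ground-truth text region differs between benchmarks and is not modelled here.

```python
def f_measure(true_positives: int, false_positives: int,
              false_negatives: int) -> float:
    """Balanced F-measure (harmonic mean of precision and recall)."""
    detected = true_positives + false_positives
    relevant = true_positives + false_negatives
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For instance, a detector with precision 0.5 and recall 1.0 gets an F-measure of about 0.67, well below the arithmetic mean of 0.75, which is why the measure penalizes unbalanced detectors.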

5.2.3 Face Detection

Face detection is one of the most widely addressed detection tasks. Even if the detection of frontal faces in high resolution images is an almost solved problem, there is room for improvement when the conditions are harder (non-frontal images, small faces, etc.). These harder conditions are reflected by the following recent datasets. The main characteristics of the different face datasets are summarized in Table 3.

Face Detection Data Set and Benchmark (FDDB) [Jain and Learned-Miller, 2010] is built using Yahoo!, with 2845 images and a total of 5171 faces; it has a wide range of difficulties such as occlusions, strong pose changes, low resolution and out-of-focus faces, with both grayscale and color images. Zhang et al. [2017b] obtained an AUR of 98.3% on this dataset and is currently state-of-the-art for this dataset. Najibi et al. [2017] obtained 98.1%. The dataset can be downloaded at

Annotated Facial Landmarks in the Wild (AFLW) [Kostinger et al., 2011] is made from a collection of images gathered on Flickr, with a large variety in face appearance (pose, expression, ethnicity, age, gender) and environmental conditions. Its particularity is that it is not aimed at face detection only, but is more oriented towards landmark detection and face alignment. In total 25,993 faces in 21,997 real-world images are annotated. Annotations come with rich facial landmark information (21 landmarks per face). The dataset can be downloaded from

Annotated Face in-the-Wild (AFW) [Zhu and Ramanan, 2012] is a dataset containing faces in real conditions, with their associated annotations (bounding box, facial landmarks and pose angle labels). Each image contains multiple, non-frontal faces. The dataset contains 205 images with 468 faces. Zhang et al. [2017b] obtained an AP of 99.85% on this dataset and is currently state-of-the-art for this dataset.

PASCAL Faces [Yan et al., 2014] contains images selected from PASCAL VOC [Everingham et al., 2010] in which the faces have been annotated. [Zhang et al., 2017b] obtained an AP of 98.49% on this dataset, and is currently state-of-the-art for this dataset.

Multi-Attribute Labeled Faces (MALF) [Bin Yang et al., 2015] incorporates richer semantic annotations such as pose, gender and occlusion information as well as expression information. It contains 5,250 images collected from the Internet and approximately 12,000 labeled faces. The dataset and up-to-date results of the evaluation can be found at

Wider Face [Yang et al., 2016b] is one of the largest datasets for face detection. Each annotation includes information such as scale, occlusion, pose, overall difficulty and events, which makes in-depth analyses possible. This dataset is very challenging, especially for the ’hard set’. Najibi et al. [2017] obtained an AP of 93.1% (easy), 92.1% (medium) and 84.5% (hard) on this dataset and is currently state-of-the-art for this dataset. Zhang et al. [2017b] are also very good with AP of 92.8% (easy), 91.3% (medium) and 84.0% (hard). Datasets and results can be downloaded at

IARPA Janus Benchmark A (IJB-A) [Klare et al., 2015] contains images and videos of 500 subjects captured in ’in the wild’ environments, with annotations for both recognition and detection tasks. All labeled faces are localized with bounding boxes as well as with landmarks (center of the two eyes, base of the nose). IJB-B [Whitelam et al., 2017] extended this dataset with 1,845 subjects, for 21,798 still images and 55,026 frames from 7,011 videos. IJB-C [Maze et al., 2018], which is the new extended version of the IARPA Janus Benchmark A and B, adds 1,661 new subjects to the 1,870 subjects released in IJB-B. The NIST Face Challenges are at

Un-constrained Face Detection Dataset (UFDD) [Nada et al., 2018] was built after noting that in many challenges large variations in scale, pose, appearance are successfully addressed but there is a gap in the performance of state-of-the-art detectors and real-world requirements, not captured by existing methods or datasets. UFDD aimed at identifying the next set of challenges and collect a new dataset of face images that involve variations such as weather-based degradations, motion blur and focus blur. The authors also provide an in-depth analysis of the results and failure cases of these methods. This dataset is very recent and has not been used specifically yet. However, Nada et al. [2018] reported the performances (in terms of AP) of Faster-RCNN [Ren et al., 2015] (52.1%), SSH [Najibi et al., 2017] (69.5%), S3FD [Zhang et al., 2017b] (72.5%) and HR-ER [Hu and Ramanan, 2017] (74.2%). Dataset and results can be downloaded at

IIIT-Cartoon Faces in the Wild (IIIT-CFW) [Mishra et al., 2016] contains 8,927 annotated images of cartoon faces belonging to 100 famous personalities, harvested from Google image search, with annotations including attributes such as age group, view, expression, pose, etc. The benchmark includes 7 challenges: cartoon face recognition, cartoon face verification, cartoon gender identification, photo2cartoon and cartoon2photo, face detection, pose estimation and landmark detection, relative attributes in cartoon and attribute-based cartoon search. Jha et al. [2018] have published SOTA detection results using a Haar features-based detector, with an F-measure of 84%. The dataset can be downloaded from

Wildest Faces [Yucel et al., 2018] is a dataset where the emphasis is put on violent scenes in unconstrained scenarios. It contains images of diverse quality, resolution and motion blur. It includes 68K images (i.e. video frames) and 2,186 shots of 64 fighting celebrities. All of the video frames are manually annotated to foster research on both detection and recognition. The dataset had not been released at the time this survey was written.

| Dataset | #Images | #Faces | Source | Type |
|---|---|---|---|---|
| FDDB [Jain and Learned-Miller, 2010] | 2,845 | 5,171 | Yahoo! News | Images |
| AFLW [Kostinger et al., 2011] | 21,997 | 25,993 | Flickr | Images |
| AFW [Zhu and Ramanan, 2012] | 205 | 473 | Flickr | Images |
| PASCAL Faces [Yan et al., 2014] | 851 | 1,335 | Pascal-VOC | Images |
| MALF [Bin Yang et al., 2015] | 5,250 | 11,931 | Flickr, Baidu Inc. | Images |
| IJB-A [Klare et al., 2015] | 24,327 | 67,183 | Google, Bing, etc. | Images/Videos |
| IIIT-CFW [Mishra et al., 2016] | 8,927 | 8,928 | Google | Images |
| Wider Face [Yang et al., 2016b] | 32,203 | 393,703 | Google, Bing | Images |
| IJB-B [Whitelam et al., 2017] | 76,824 | 125,474 | Freebase | Images/Videos |
| IJB-C [Maze et al., 2018] | 148,876 | 540,630 | Freebase | Images/Videos |
| Wildest Faces [Yucel et al., 2018] | 67,889 | 109,771 | YouTube | Videos |
| UFDD [Nada et al., 2018] | 6,424 | 10,895 | Google, Bing, etc. | Images |

Table 3: Datasets for face detection

5.2.4 Pedestrian Detection

Pedestrian detection is one of the specific tasks abundantly studied in the literature, especially since research on autonomous vehicles has intensified.

MIT [Papageorgiou and Poggio, 2000] is one of the first pedestrian datasets. It is small in size (509 training and 200 testing images). The images were extracted from the LabelMe database. It can be found at

INRIA [Dalal and Triggs, 2005] is currently one of the most popular static pedestrian detection datasets; it was introduced in the seminal HOG paper. It is evaluated with the Caltech metric. Zhang et al. [2018c] achieved state-of-the-art results with 6.4% log average miss rate. The method in second position is [Zhang et al., 2016a] with 6.9%, using the RPN from Faster R-CNN and boosted forests on extracted features. The others are not CNN methods (the third one uses pooling with HOG, LBP and covariance matrices). Similarly, the PASCAL Persons dataset is a subset of the aforementioned Pascal-VOC dataset. INRIA can be found at

CVC-ADAS [Gerónimo et al., 2007] is a collection of datasets including videos acquired on board, virtual-world pedestrians and real pedestrians. It can be found at

USC [Wu and Nevatia, 2007] is an old small pedestrian dataset taken largely from surveillance videos. It is still downloadable at

ETH [Ess et al., 2007] was captured from a stroller. There are 490 training frames with 1578 annotations. There are three test sets. The first test set has 999 frames with 5193 annotations, the second one 450 and 2359 and the third one 354 and 1828 respectively. The stereo cues are available. It is a difficult dataset where the state-of-the-art from Zhang et al. [2018c] trained on CityPersons still remains at 24.5% log average miss rate. The boosted forest of Zhang et al. [2016a] gets 30.2% only. It is available at

Daimler DB [Enzweiler and Gavrila, 2008] is an old dataset captured in an urban setting that builds on the DaimlerChrysler datasets, with only grayscale images. It has recently been extended with cyclist annotations into the Tsinghua Daimler Cyclist (TDC) dataset [Li et al., 2016c], with color images. The dataset is available at

TUD-Brussels [Wojek et al., 2009] is from TU Darmstadt and contains image pairs recorded in a crowded urban setting with an on-board camera from a car. There are 1092 image pairs with 1776 annotations in the training set. The test set contains 508 image pairs with 1326 pedestrians. The evaluation is based on the recall at 90% precision, somewhat reminiscent of the KITTI dataset. TUD-Brussels is available at

Caltech USA [Dollar et al., 2012] contains images captured in the Greater Los Angeles area by an independent driver to simulate real-life conditions without any bias. 192,000 pedestrian instances are available for training and 155,000 for testing. The evaluation uses the Pascal-VOC criterion at 0.5 IoU. The performance measure is the log average miss rate since, application wise, one cannot have too many False Positives Per Image (FPPI). It is computed by averaging miss rates at 9 FPPIs from 10^-2 to 1, uniformly spaced in log scale. State-of-the-art algorithms are at around 4% log average miss rate. Wang et al. [2018b] got 4.0% by using a novel bounding box regression loss. Following it, we have Zhang et al. [2018c] at 4.1%, using a novel RoI-Pooling of parts helping with occlusions and pre-training on CityPersons. Mao et al. [2017] is lagging behind with 5.5%, using a Faster R-CNN with additional aggregated features. There also exists a CalTech Japan dataset. The benchmark is hosted at
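The log average miss rate can be sketched as follows. This is a minimal illustration, assuming the commonly used protocol of nine reference FPPI values evenly spaced in log scale between 10^-2 and 10^0, and a geometric mean of the miss rates sampled at those points; the function name and the convention of using a miss rate of 1.0 when the curve never reaches a reference point are assumptions of this sketch, not a transcription of the official evaluation code.

```python
import math
from typing import Sequence


def log_average_miss_rate(fppi: Sequence[float],
                          miss_rate: Sequence[float],
                          num_points: int = 9) -> float:
    """Geometric mean of miss rates sampled at log-spaced FPPI values.

    `fppi` must be sorted in increasing order and aligned with `miss_rate`.
    """
    # Reference points 10^-2, 10^-1.75, ..., 10^0 for num_points = 9.
    references = [10.0 ** (-2.0 + 2.0 * i / (num_points - 1))
                  for i in range(num_points)]
    sampled = []
    for ref in references:
        # Miss rate of the last operating point whose FPPI does not
        # exceed the reference; 1.0 if the curve starts above it.
        value = 1.0
        for f, m in zip(fppi, miss_rate):
            if f > ref:
                break
            value = m
        sampled.append(max(value, 1e-10))  # guard against log(0)
    return math.exp(sum(math.log(v) for v in sampled) / len(sampled))
```

Averaging in the log domain gives each decade of FPPI the same weight, so a detector cannot obtain a good score only by doing well at high false positive rates.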

KITTI [Geiger et al., 2012] is one of the most famous datasets in computer vision, recorded in the city of Karlsruhe in Germany. There are 100,000 instances of pedestrians, with around 6,000 identities and one person per image on average. The preferred metric is the AP (Average Precision) on the moderate set (persons who are less than 25 pixels tall are ignored for ranking). Li et al. [2017b] got 65.01 AP on moderate by using an adapted version of Fast R-CNN with different heads to deal with different scales. The state-of-the-art of Chen et al. [2015a] had to rely on stereo information to get good object proposals and 67.47 AP. All KITTI related datasets are found at

GM-ATCI [Silberstein et al., 2014] is a dataset captured from a fisheye-lens camera that uses CalTech evaluation system. We could not find any CNN detection results on it possibly because the state-of-the-art using multiple cues is already pretty good with 3.5% log average miss rate. The sequences can be downloaded here

CityPersons [Zhang et al., 2017a] is a relatively new dataset that builds upon CityScapes [Cordts et al., 2016], a semantic segmentation dataset recorded in 27 different cities in Germany. There are 19,744 persons in the training set and around 11,000 in the test set. There are far more identities than in CalTech (1,300 in CalTech versus 19,000 in CityPersons), even though there are fewer instances. Therefore, it is more diverse and thus more challenging. The metric is the same as for CalTech, with some subsets such as the Reasonable one: the pedestrians that are more than 50 pixels tall and less than 35% occluded. Again Zhang et al. [2018c] and Wang et al. [2018b] take the lead with 11.32% and 11.48% respectively on the reasonable set, compared with the adapted Faster R-CNN baseline that stands at 12.97% log average miss rate. The dataset is available at
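The Reasonable subset described above amounts to a simple filter on the annotations. The sketch below assumes a hypothetical `Pedestrian` record holding a height in pixels and an occluded fraction; the field names are illustrative and do not correspond to the actual annotation files.

```python
from typing import List, NamedTuple


class Pedestrian(NamedTuple):
    """Hypothetical annotation record used only for this illustration."""
    height_px: float   # height of the bounding box in pixels
    occlusion: float   # occluded fraction of the body, in [0, 1]


def reasonable_subset(annotations: List[Pedestrian],
                      min_height: float = 50.0,
                      max_occlusion: float = 0.35) -> List[Pedestrian]:
    """Keep pedestrians taller than min_height and occluded less than max_occlusion."""
    return [a for a in annotations
            if a.height_px > min_height and a.occlusion < max_occlusion]
```

Other subsets (small or heavily occluded pedestrians) can be expressed the same way by changing the two thresholds.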

EuroCity [Braun et al., 2018] is the largest pedestrian detection dataset ever released with 238,300 instances in 47,300 images. Images are taken over 31 cities in 12 different European countries. The metric is the same as CalTech. Three baselines were tested (Faster R-CNN, R-FCN and YOLOv3). Faster R-CNN dominated on the reasonable set with 8.1%, followed by YOLOv3 with 8.5% and R-FCN lagging behind with 12.1%. On other subsets with heavily occluded or small pedestrians the ranking is not the same. We refer the reader to the dataset paper of [Braun et al., 2018].

5.2.5 Logo Detection

Logo detection attracted a lot of attention in the past, due to the specificity of the task. At the time of writing this survey, there are fewer papers on this topic and most of the logo detection pipelines are direct applications of Faster RCNN [Ren et al., 2015].

BelgaLogos [Joly and Buisson, 2009] images come from the BELGA press agency. The dataset is composed of 10,000 images covering all aspects of life and current affairs: politics and economics, finance and social affairs, sports, culture and personalities. All images are in JPEG format and have been re-sized with a maximum value of height and width equal to 800 pixels, preserving aspect ratio. There are 26 different logos. Only a few images are annotated with bounding boxes. The dataset can be downloaded at
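The resizing rule mentioned for BelgaLogos (largest side equal to 800 pixels, aspect ratio preserved) can be written in a few lines. This is only a sketch of the arithmetic: the function name is made up for the example, and whether images smaller than 800 pixels were upscaled in the original dataset is not stated in the text, so the upscaling behaviour below is an assumption.

```python
from typing import Tuple


def resized_dimensions(width: int, height: int,
                       max_side: int = 800) -> Tuple[int, int]:
    """New (width, height) such that the longest side equals max_side."""
    scale = max_side / max(width, height)
    # Round to whole pixels and never return an empty dimension.
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Bounding box coordinates have to be multiplied by the same scale factor so that annotations stay aligned with the resized image.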

FlickrLogos [Romberg et al., 2011, Eggert et al., 2017] consists of real-world images collected from Flickr, depicting company logos in various situations. The dataset comes in two versions: The original FlickrLogos-32 dataset and the FlickrLogos-47 [Eggert et al., 2017] dataset. In FlickrLogos-32 the annotations for object detection were often incomplete, since only the most prominent logo instances were labeled. FlickrLogos-47 uses the same image corpus as FlickrLogos-32 but new classes were introduced (logo and text as separate classes) and missing object instances have been annotated. FlickrLogos-47 contains 833 training and 1402 testing images. The dataset can be downloaded at

Logo32plus [Bianco et al., 2017] is an extension of the train set of FlickrLogos-32 [Eggert et al., 2017]. It has the same classes of objects but many more training instances (12,312 instances). The dataset can be downloaded at

WebLogo-2M [Su et al., 2017a] is very large, but annotated at image level only and does not contain bounding boxes. It contains 194 logo classes and over 2 million logo images. Labels are noisy as the annotations are automatically generated. Therefore, this dataset is designed for large-scale logo detection model learning from noisy training data. For performance evaluation, the dataset includes 6,569 test images with manually labeled logo bounding boxes for all the 194 logo classes. The dataset can be downloaded at

SportsLogo [Liao et al., 2017], in the absence of public video logo dataset, was collected on a set of tennis videos containing 20 different tennis video clips with camera motions (blurring) and occlusion. The logos can appear on the background as well as on players’ and staff’s clothes. 20 logos are annotated, with about 100 images for each logo.

Logos in the Wild [Tüzkö et al., 2018] contains images collected from the web with logo annotations provided in Pascal-VOC style. It contains large varieties of brands in-the-wild. The latest version (v2.0) of the dataset consists of 11,054 images with 32,850 annotated logo bounding boxes of 871 brands. It contains from 4 to 608 images per searched brand, and 238 brands occur at least 10 times. It has up to 118 logos in one image. Only the links to the images are released, which is problematic as numerous images have already disappeared, making exact comparisons impossible. The dataset can be downloaded from

Open Logo Detection Challenge [Su et al., 2018]. This dataset assumes that only a small proportion of logo classes are annotated whilst the remaining classes have no labeled training data. It contrasts with previous logo datasets, which assumed that all the logo classes are annotated. The OpenLogo challenge contains 27,189 images from 309 logo classes, built by aggregating/refining 7 existing datasets and establishing an open logo detection evaluation protocol. The dataset can be downloaded at

| Dataset | #Classes | #Images |
|---|---|---|
| BelgaLogos [Joly and Buisson, 2009] | 26 | 10,000 |
| FlickrLogos-32 [Romberg et al., 2011] | 32 | 8,240 |
| FlickrLogos-47 [Eggert et al., 2017] | 47 | 8,240 |
| Logo32plus [Bianco et al., 2017] | 32 | 7,830 |
| WebLogo-2M [Su et al., 2017a] | 194 | 2,190,757 |
| SportsLogo [Liao et al., 2017] | 20 | 1,978 |
| Logos in the Wild [Tüzkö et al., 2018] | 871 | 11,054 |
| OpenLogos [Su et al., 2018] | 309 | 27,189 |

Table 4: Datasets for logo detection

5.2.6 Traffic Signs Detection

This section reviews the 4 main datasets and benchmarks for evaluating traffic sign detectors [Mogelmose et al., 2012, Houben et al., 2013, Timofte et al., 2014, Zhu et al., 2016], as well as the Bosch Small Traffic Lights dataset [Behrendt and Novak, 2017]. The most challenging one is the Tsinghua Tencent 100k (TTK100) [Zhu et al., 2016], on which Faster RCNN-like detectors such as [Pon et al., 2018] have an overall precision/recall of 44%/68%, which shows the difficulty of the dataset.

LISA Traffic Sign Dataset [Mogelmose et al., 2012] was among the first datasets for traffic sign detection. It contains 47 US signs and 7,855 annotations on 6,610 video frames. Sign sizes vary from 6x6 to 167x168 pixels. Each sign is annotated with sign type, position, size, occluded (yes/no), on side road (yes/no). The URL for this dataset is

The German Traffic Sign Detection Benchmark (GTSDB) [Houben et al., 2013] is one of the most popular traffic signs detection benchmarks. It introduced a dataset with evaluation metrics, baseline results, and a web interface for comparing approaches. The dataset provides a total of 900 images with 1,206 traffic signs. The traffic sign sizes vary between 16 and 128 pixels w.r.t. the longest edge. The image resolution is ; images capture different scenarios (urban, rural, highway) during the daytime and dusk featuring various weather conditions. It can be found at

Belgian TSD [Timofte et al., 2014] consists of 7,356 still images for training, with a total of 11,219 annotations, corresponding to 2,459 traffic signs visible at less than 50 meters in at least one view. The test set contains 4 sequences, captured by 8 roof-mounted cameras on the van, with a total of 121,632 frames and 269 different traffic signs for evaluating the detectors. For each sign, the type and 3D location is given. The dataset can be downloaded at

Tsinghua Tencent 100k (TTK100) [Zhu et al., 2016] provides images for traffic signs detection and classification, with various illumination and weather conditions. It is the largest dataset for traffic signs detection, with 100,000 images out of which 16,787 contain traffic sign instances, for a total of 30,000 traffic sign instances. There are a total of 128 classes. Each instance is annotated with class label, bounding box and pixel mask. It has small objects in abundance and huge scale variations. Some naturally rare signs, such as those warning the driver to be cautious on mountain roads, have quite a low number of instances. There are 45 classes with at least 100 instances present. The dataset can be obtained at

Bosch Small Traffic Lights [Behrendt and Novak, 2017] is made for benchmarking traffic light detectors. It contains 13,427 images of size pixels with around 24,000 annotated traffic lights, annotated with bounding boxes and states (active light). Best performing algorithm is [Pon et al., 2018] which obtained a mAP of 53 on this dataset. Bosch Small Traffic Lights can be downloaded at

5.2.7 Other Datasets

Some datasets do not fit in any of the previously mentioned categories but deserve to be mentioned because of the interest the community has for them.

iNaturalist Species Classification and Detection Dataset [Van Horn et al., 2018] contains 859,000 images from over 5,000 different species of plants and animals. The goal of this dataset is to encourage the development of algorithms for ’in the wild’ data featuring large numbers of imbalanced, fine-grained categories. The dataset can be downloaded at

Below we give all known datasets that can be used to tackle object detection with the different modalities that we presented in the Sec. 4.1.

5.3 3D Datasets

KITTI object detection benchmark [Geiger et al., 2012] is the most widely used dataset for evaluating detection in 3D point clouds. It contains 3 main categories (namely 2D, 3D and birds-eye-view objects), 3 object categories (cars, pedestrians and cyclists), and 3 difficulty levels (easy, moderate and hard, considering the object size, distance, occlusion and truncation). The dataset is public and contains 7,481 images for training and 7,518 for testing, comprising a total of 80,256 labeled objects. The 3D point clouds are acquired with a Velodyne laser scanner. 3D object detection performance is evaluated using the PASCAL criteria also used for 2D object detection. For cars a 3D bounding box overlap of 70% is required, while for pedestrians and cyclists a 3D bounding box overlap of 50% is required. For evaluation, precision-recall curves are computed and the methods are ranked according to average precision. The algorithms can use the following sources of information: i) Stereo: the method uses left and right (stereo) images; ii) Flow: the method uses optical flow (2 temporally adjacent images); iii) Multiview: the method uses more than 2 temporally adjacent images; iv) Laser Points: the method uses point clouds from the Velodyne laser scanner; v) Additional training data: use of additional data sources for training. One of the leading methods is [Simon et al., 2018], which is at an mAP of 67.72/64.00/63.01 (Easy/Mod./Hard) for the car category, at 50 fps. Slower (10 fps) but more accurate, [Ku et al., 2017] has a performance of 81.94/71.88/66.38 on cars. Chen et al. [2017b], Zhou and Tuzel [2017] and Qi et al. [2017b] also gave very good results. The datasets and performance of SOTA detectors can be downloaded at, and the leader board is at
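The class-dependent overlap requirement can be illustrated with a short sketch. It is deliberately simplified: boxes are taken to be axis-aligned in 3D, whereas the boxes of the benchmark are also rotated around the vertical axis, and the names `iou_3d`, `MIN_OVERLAP` and `is_true_positive` are illustrative rather than part of any official development kit.

```python
from typing import Dict, Tuple

# (x_min, y_min, z_min, x_max, y_max, z_max); axis-aligned for simplicity.
Box3D = Tuple[float, float, float, float, float, float]

# Minimum 3D overlap required for a detection to count, per category.
MIN_OVERLAP: Dict[str, float] = {"car": 0.7, "pedestrian": 0.5, "cyclist": 0.5}


def _volume(box: Box3D) -> float:
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        side = min(a[axis + 3], b[axis + 3]) - max(a[axis], b[axis])
        if side <= 0.0:
            return 0.0  # the boxes do not overlap along this axis
        intersection *= side
    union = _volume(a) + _volume(b) - intersection
    return intersection / union if union > 0.0 else 0.0


def is_true_positive(category: str, detection: Box3D,
                     ground_truth: Box3D) -> bool:
    """Apply the category-specific overlap threshold."""
    return iou_3d(detection, ground_truth) >= MIN_OVERLAP[category]
```

The same detection can therefore be accepted as a pedestrian and rejected as a car, which is one reason why car scores are not directly comparable with pedestrian and cyclist scores.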

Active Vision Dataset (AVD) [Ammirato et al., 2017] contains 30,000+ RGBD images, 30+ frequently occurring instances, 15 scenes, and 70,000+ 2D bounding boxes. This dataset focuses on simulating robotic vision tasks in everyday indoor environments using real imagery. The dataset can be downloaded at

SceneNet RGB-D [McCormac et al., 2017] is a synthetic dataset designed for scene understanding problems such as semantic segmentation, instance segmentation, and object detection. It provides camera poses and depth data and allows any scene configuration to be created. 5M rendered RGB-D images from 16K randomly generated 3D trajectories in synthetic layouts are also provided. The dataset can be downloaded at

Falling Things [Tremblay et al., 2018b] is a synthetic dataset for 3D object detection and pose estimation. It contains 60k annotated photos of 21 household objects taken from the YCB dataset. For each image, the 3D poses, per-pixel class segmentation, and 2D/3D bounding box coordinates for all objects are given. To facilitate testing different input modalities, mono and stereo RGB images are provided, along with registered dense depth images. The dataset can be downloaded at

5.4 Video Datasets

The two most popular datasets for video object detection are the YouTube-BoundingBoxes [Real et al., 2017] and the ImageNet VID challenge [Russakovsky et al., 2015]. Both are reviewed in this section.

YouTube-BoundingBoxes [Real et al., 2017] is a dataset of video URLs with single-object bounding box annotations. All video sequences are annotated with classifications and bounding boxes, at 1 frame per second. There is a total of about 380,000 video segments of 15-20 seconds, from 240,000 publicly available YouTube videos, featuring objects in natural settings, without editing or post-processing. Real et al. [2017] reported a mAP of 59 on this dataset. This dataset can be downloaded at

ImageNet VID challenge [Russakovsky et al., 2015] was a part of the ILSVRC 2015 challenge. It has a training set of 3,862 fully annotated video sequences, with lengths ranging from 6 frames to 5,492 frames per video. The validation set contains 555 fully annotated videos, ranging from 11 frames to 2,898 frames per video. Finally, the test set contains 937 video sequences, and its ground-truth annotations are not publicly available. One of the best performing methods on ImageNet VID is [Feichtenhofer et al., 2017] with a mAP of 79.8, obtained by combining detection and tracking. Zhu et al. [2017b] reached 76.3 points with a flow-based approach. This dataset can be downloaded at

VisDrone [Zhu et al., 2018a] contains video clips acquired by drones. This dataset is presented in Section 5.2.1.

5.5 Concluding Remarks

This section gave a broad overview of the datasets introduced by the community for developing and evaluating object detectors in images, videos or 3D point clouds. Each object detection dataset presents a very biased view of the world, as shown in [Torralba and Efros, 2011, Khosla et al., 2012, Tommasi et al., 2017], reflecting the needs of its creators when they built it. The bias lies not only in the images they chose (specific views of objects, object imbalance [Ouyang et al., 2016], object categories) but also in the metric they created and the evaluation protocol they devised. The community is trying its best to build more and more datasets with less and less bias, and as a result it has become quite hard to find one's way in this jungle of datasets, especially when one needs older datasets that have fallen out of fashion or exhaustive lists of the performance of state-of-the-art algorithms on modern ones. Through this survey we have partially addressed this need for a common source of information on datasets.

6 Conclusions

Object detection in images, a key topic attracting a substantial part of the computer vision community, has been revolutionized by the recent arrival of convolutional neural networks, which swept away all the methods previously dominating the field. This article provides a comprehensive survey of what has happened in the domain since 2012. It shows that, even if top-performing methods concentrate around two main alternatives – single-stage methods such as SSD or YOLO, or two-stage methods in the footsteps of Faster RCNN – the domain is still very active. Graph networks, GANs, context, small objects, domain adaptation, occlusions, etc. are directions that are actively studied in the context of object detection. The extension of object detection to other modalities, such as videos or 3D point clouds, and to other constraints, such as weak supervision, is also very active and has been addressed. The survey also lists the public datasets available to the community and highlights top performing methods on these datasets. We believe this article will be useful to better understand the recent progress and the bigger picture of this constantly moving field.


  • Akiba et al. [2017] Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: training resnet-50 on imagenet in 15 minutes. CoRR, abs/1711.04325, 2017. URL
  • Alexe et al. [2010] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. What is an object? In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 73–80, 2010.
  • Alexe et al. [2012] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari. Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2189–2202, 2012.
  • Alhaija et al. [2018] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars M. Mescheder, Andreas Geiger, and Carsten Rother. Augmented reality meets computer vision: Efficient data generation for urban driving scenes. International Journal of Computer Vision (IJCV), 126(9):961–972, 2018.
  • Ammirato et al. [2017] Phil Ammirato, Patrick Poirson, Eunbyung Park, Jana Kosecka, and Alexander C. Berg. A dataset for developing and benchmarking active vision. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • Angelova et al. [2015] Anelia Angelova, Alex Krizhevsky, Vincent Vanhoucke, Abhijit S Ogale, and Dave Ferguson. Real-time pedestrian detection with deep network cascades. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, volume 2, page 4, 2015.
  • Antoniou et al. [2017] Antreas Antoniou, Amos J. Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. CoRR, abs/1711.04340, 2017. URL
  • Bae et al. [2017] Seung-Hwan Bae, Youngwan Lee, Youngjoo Jo, Yuseok Bae, and Joong-won Hwang. Rank of experts: Detection network ensemble. CoRR, abs/1712.00185, 2017. URL
  • Bai et al. [2018] Yancheng Bai, Yongqiang Zhang, Mingli Ding, and Bernard Ghanem. SOD-MTGAN: Small Object Detection via Multi-Task Generative Adversarial Network. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, page 16, 2018.
  • Bansal et al. [2018] Ankan Bansal, Karan Sikka, Gaurav Sharma, Rama Chellappa, and Ajay Divakaran. Zero-shot object detection. CoRR, abs/1804.04340, 2018. URL
  • Battaglia et al. [2018] Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. Relational inductive biases, deep learning, and graph networks. CoRR, abs/1806.01261, 2018. URL
  • Bazzani et al. [2016] Loris Bazzani, Alessandro Bergamo, Dragomir Anguelov, and Lorenzo Torresani. Self-taught object localization with deep networks. In 2016 IEEE Winter Conference on Applications of Computer Vision, WACV 2016, Lake Placid, NY, USA, March 7-10, 2016, pages 1–9, 2016. URL
  • Behrendt and Novak [2017] Karsten Behrendt and Libor Novak. A Deep Learning Approach to Traffic Lights: Detection, Tracking, and Classification. In Robotics and Automation (ICRA), 2017 IEEE International Conference On, 2017.
  • Beltrán et al. [2018] Jorge Beltrán, Carlos Guindel, Francisco Miguel Moreno, Daniel Cruzado, Fernando García, and Arturo de la Escalera. Birdnet: a 3d object detection framework from lidar information. CoRR, abs/1805.01195, 2018. URL
  • Benenson et al. [2012] Rodrigo Benenson, Markus Mathias, Radu Timofte, and Luc Van Gool. Pedestrian detection at 100 frames per second. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 2903–2910, 2012.
  • Bianco et al. [2017] Simone Bianco, Marco Buzzelli, Davide Mazzini, and Raimondo Schettini. Deep Learning for Logo Recognition. Neurocomputing, 245:23–30, July 2017.
  • Bilen and Vedaldi [2016] Hakan Bilen and Andrea Vedaldi. Weakly Supervised Deep Detection Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Bilen et al. [2015] Hakan Bilen, Marco Pedersoli, and Tinne Tuytelaars. Weakly supervised object detection with convex clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, June 2015.
  • Bin Yang et al. [2015] Bin Yang, Junjie Yan, Zhen Lei, and Stan Z. Li. Fine-grained evaluation on face detection in the wild. In Automatic Face and Gesture Recognition (FG), pages 1–7, 2015.
  • Bodla et al. [2017] Navaneeth Bodla, Bharat Singh, Rama Chellappa, and Larry S Davis. Soft-nms—improving object detection with one line of code. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 5562–5570, 2017.
  • Bourdev et al. [2010] Lubomir Bourdev, Subhransu Maji, Thomas Brox, and Jitendra Malik. Detecting people using mutually consistent poselet activations. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 168–181, 2010.
  • Bousmalis et al. [2017] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised Pixel-Level Domain Adaptation with Generative Adversarial Networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 95–104, 2017.
  • Brahmbhatt et al. [2017] Samarth Brahmbhatt, Henrik I. Christensen, and James Hays. StuffNet - Using 'Stuff' to Improve Object Detection. In IEEE Winter Conf. on Applications of Computer Vision (WACV), 2017.
  • Braun et al. [2018] Markus Braun, Sebastian Krebs, Fabian Flohr, and Dariu M. Gavrila. The eurocity persons dataset: A novel benchmark for object detection. CoRR, abs/1805.07193, 2018. URL
  • Busta et al. [2017] Michal Busta, Lukas Neumann, and Jiri Matas. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2223–2231. IEEE Computer Society, 2017.
  • Cai et al. [2016] Zhaowei Cai, Quanfu Fan, Rogerio S Feris, and Nuno Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 354–370, 2016.
  • Cao et al. [2017] Guimei Cao, Xuemei Xie, Wenzhe Yang, Quan Liao, Guangming Shi, and Jinjian Wu. Feature-fused SSD: fast detection for small objects. CoRR, abs/1709.05054, 2017. URL
  • Carreira and Sminchisescu [2010] Joao Carreira and Cristian Sminchisescu. Constrained parametric min-cuts for automatic object segmentation. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 3241–3248, 2010.
  • Carreira and Sminchisescu [2011] Joao Carreira and Cristian Sminchisescu. Cpmc: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1312–1328, 2011.
  • Castro et al. [2018] Francisco M. Castro, Manuel J. Marín-Jiménez, Nicolás Guil, Cordelia Schmid, and Karteek Alahari. End-to-End Incremental Learning. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, 2018.
  • Chabot et al. [2017] Florian Chabot, Mohamed Chaouch, Jaonary Rabarisoa, Céline Teulière, and Thierry Chateau. Deep MANTA: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1827–1836, 2017.
  • Chen et al. [2016a] Chenyi Chen, Ming-Yu Liu, Oncel Tuzel, and Jianxiong Xiao. R-CNN for Small Object Detection. Computer Vision - ACCV 2016 - 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, 10115:214–230, 2016a.
  • Chen et al. [2016b] D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, 2016b.
  • Chen et al. [2013] Guang Chen, Yuanyuan Ding, Jing Xiao, and Tony X Han. Detection evolution with multi-order contextual co-occurrence. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 1798–1805, 2013.
  • Chen et al. [2018a] Hao Chen, Yali Wang, Guoyou Wang, and Yu Qiao. LSTD: A low-shot transfer detector for object detection. In Sheila A. McIlraith and Kilian Q. Weinberger, editors, Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, Louisiana, USA, February 2-7, 2018. AAAI Press, 2018a.
  • Chen et al. [2017a] Kai Chen, Hang Song, Chen Change Loy, and Dahua Lin. Discover and Learn New Objects from Documentaries. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1111–1120, July 2017a.
  • Chen et al. [2018b] Kai Chen, Jiaqi Wang, Shuo Yang, Xingcheng Zhang, Yuanjun Xiong, Chen Change Loy, and Dahua Lin. Optimizing video object detection via a scale-time lattice. CoRR, abs/1804.05472, 2018b. URL
  • Chen et al. [2018c] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018c. URL
  • Chen et al. [2018d] Shang-Tse Chen, Cory Cornelius, Jason Martin, and Duen Horng Chau. Robust physical adversarial attack on faster R-CNN object detector. CoRR, abs/1804.05810, 2018d. URL
  • Chen et al. [2016c] X. Chen, K. Kundu, Z. Zhang, H. Ma, and S. Fidler. Monocular 3d object detection for autonomous driving. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016c.
  • Chen et al. [2015a] Xiaozhi Chen, Kaustav Kundu, Yukun Zhu, Andrew G. Berneshawi, Huimin Ma, Sanja Fidler, and Raquel Urtasun. 3d object proposals for accurate object class detection. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 424–432, 2015a. URL
  • Chen et al. [2015b] Xiaozhi Chen, Huimin Ma, Xiang Wang, and Zhichen Zhao. Improving object proposals with multi-thresholding straddling expansion. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015b.
  • Chen et al. [2017b] Xiaozhi Chen, Huimin Ma, Ji Wan, Bo Li, and Tian Xia. Multi-view 3d object detection network for autonomous driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6526–6534. IEEE Computer Society, 2017b.
  • Chen and Gupta [2017] Xinlei Chen and Abhinav Gupta. Spatial Memory for Context Reasoning in Object Detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • Chen et al. [2018e] Yuhua Chen, Wen Li, Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Domain adaptive faster R-CNN for object detection in the wild. CoRR, abs/1803.03243, 2018e. URL
  • Chen et al. [2017c] Yunpeng Chen, Jianan Li, Huaxin Xiao, Xiaojie Jin, Shuicheng Yan, and Jiashi Feng. Dual path networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 4467–4475, 2017c.
  • Chen et al. [2017d] Yunpeng Chen, Jianshu Li, Bin Zhou, Jiashi Feng, and Shuicheng Yan. Weaving multi-scale context for single shot detector. CoRR, abs/1712.03149, 2017d. URL
  • Cheng et al. [2016a] G. Cheng, P. Zhou, and J. Han. RIFD-CNN: Rotation-Invariant and Fisher Discriminative Convolutional Neural Networks for Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016a.
  • Cheng and Han [2016] Gong Cheng and Junwei Han. A Survey on Object Detection in Optical Remote Sensing Images. ISPRS Journal of Photogrammetry and Remote Sensing, 117:11–28, 2016.
  • Cheng et al. [2016b] Gong Cheng, Peicheng Zhou, and Junwei Han. Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 54(12):7405–7415, 2016b.
  • Cheng et al. [2016c] Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, pages 551–561, 2016c.
  • Cheng et al. [2014] Ming-Ming Cheng, Ziming Zhang, Wen-Yan Lin, and Philip Torr. Bing: Binarized normed gradients for objectness estimation at 300fps. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 3286–3293, 2014.
  • Chollet [2017] François Chollet. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1800–1807, 2017.
  • Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • Csurka [2017] Gabriela Csurka. A comprehensive survey on domain adaptation for visual applications. In Gabriela Csurka, editor, Domain Adaptation in Computer Vision Applications., Advances in Computer Vision and Pattern Recognition, pages 1–35. Springer, 2017. URL
  • Cubuk et al. [2018] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mané, Vijay Vasudevan, and Quoc V. Le. Autoaugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018. URL
  • Dai et al. [2016a] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3150–3158, 2016a.
  • Dai et al. [2016b] Jifeng Dai, Yi Li, Kaiming He, and Jian Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 379–387, 2016b.
  • Dai et al. [2017] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 764–773. IEEE Computer Society, 2017.
  • Dalal and Triggs [2005] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, volume 1, pages 886–893, 2005.
  • Delakis and Garcia [2008] Manolis Delakis and Christophe Garcia. Text detection with convolutional neural networks. In International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAP), pages 290–294, 2008.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 248–255, 2009.
  • Deng et al. [2017] Zhipeng Deng, Hao Sun, Shilin Zhou, Juanping Zhao, and Huanxin Zou. Toward Fast and Accurate Vehicle Detection in Aerial Images Using Coupled Region-Based Convolutional Neural Networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10:3652–3664, 2017.
  • Deng and Latecki [2017] Zhuo Deng and Longin Jan Latecki. Amodal Detection of 3D Objects: Inferring 3D Bounding Boxes from 2D Ones in RGB-Depth Images. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 398–406, 2017.
  • Devries and Taylor [2017a] Terrance Devries and Graham W. Taylor. Dataset augmentation in feature space. CoRR, abs/1702.05538, 2017a. URL
  • Devries and Taylor [2017b] Terrance Devries and Graham W. Taylor. Improved regularization of convolutional neural networks with cutout. CoRR, abs/1708.04552, 2017b. URL
  • Dollar et al. [2012] Piotr Dollar, Christian Wojek, Bernt Schiele, and Pietro Perona. Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(4):743–761, 2012.
  • Dong et al. [2017] Xuanyi Dong, Liang Zheng, Fan Ma, Yi Yang, and Deyu Meng. Few-shot object detection. CoRR, abs/1706.08249, 2017. URL
  • Durand et al. [2015] Thibaut Durand, Nicolas Thome, and Matthieu Cord. MANTRA: Minimum Maximum Latent Structural SVM for Image Classification and Ranking. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
  • Durand et al. [2016] Thibaut Durand, Nicolas Thome, and Matthieu Cord. Weldon: Weakly supervised learning of deep convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Durand et al. [2017] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. WILDCAT: Weakly Supervised Learning of Deep ConvNets for Image Classification, Pointwise Localization and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017.
  • Dvornik et al. [2018] Nikita Dvornik, Julien Mairal, and Cordelia Schmid. Modeling Visual Context is Key to Augmenting Object Detection Datasets. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, page 18, 2018.
  • Dwibedi [2017] D. Dwibedi. Synthesizing scenes for instance detection. Master’s thesis, Carnegie Mellon University, 2017.
  • Dwibedi et al. [2017] Debidatta Dwibedi, Ishan Misra, and Martial Hebert. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 1310–1319. IEEE Computer Society, 2017.
  • Eggert et al. [2017] Christian Eggert, Dan Zecha, Stephan Brehm, and Rainer Lienhart. Improving small object proposals for company logo detection. In Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval, pages 167–174, 2017.
  • Endres and Hoiem [2010] Ian Endres and Derek Hoiem. Category independent object proposals. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 575–588, 2010.
  • Endres and Hoiem [2014] Ian Endres and Derek Hoiem. Category-independent object proposals with diverse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(2):222–234, 2014.
  • Engelcke et al. [2017] Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds Using Efficient Convolutional Neural Networks. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • Enzweiler and Gavrila [2008] Markus Enzweiler and Dariu M Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(12):2179–2195, 2008.
  • Enzweiler and Gavrila [2011] Markus Enzweiler and Dariu M. Gavrila. A multilevel mixture-of-experts framework for pedestrian classification. IEEE Transactions on Image Processing, 20(10):2967–2979, 2011.
  • Erhan et al. [2014] Dumitru Erhan, Christian Szegedy, Alexander Toshev, and Dragomir Anguelov. Scalable Object Detection Using Deep Neural Networks. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, 2014.
  • Ess et al. [2007] Andreas Ess, Bastian Leibe, and Luc Van Gool. Depth and appearance for mobile scene analysis. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007.
  • Everingham et al. [2010] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision (IJCV), 88(2):303–338, 2010.
  • Feichtenhofer et al. [2017] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Detect to track and track to detect. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3038–3046, 2017.
  • Felzenszwalb et al. [2010] Pedro F. Felzenszwalb, Ross B. Girshick, David A. McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • Fong and Vedaldi [2017] Ruth C. Fong and Andrea Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3449–3457. IEEE Computer Society, 2017.
  • Fu et al. [2017] Cheng-Yang Fu, Wei Liu, Ananth Ranga, Ambrish Tyagi, and Alexander C. Berg. DSSD : Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017. URL
  • Fu et al. [2018] Yanwei Fu, Tao Xiang, Yu-Gang Jiang, Xiangyang Xue, Leonid Sigal, and Shaogang Gong. Recent advances in zero-shot recognition: Toward data-efficient understanding of visual content. IEEE Signal Processing Magazine, 35(1):112–125, 2018.
  • Gaidon et al. [2016] A Gaidon, Q Wang, Y Cabon, and E Vig. Virtual worlds as proxy for multi-object tracking analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Gao et al. [2017] Mingfei Gao, Ruichi Yu, Ang Li, Vlad I. Morariu, and Larry S. Davis. Dynamic zoom-in network for fast object detection in large images. CoRR, abs/1711.05187, 2017. URL
  • Garcia and Delakis [2002] Christophe Garcia and Manolis Delakis. A neural architecture for fast and robust face detection. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 2, pages 44–47, 2002.
  • Ge et al. [2018] Weifeng Ge, Sibei Yang, and Yizhou Yu. Multi-evidence filtering and fusion for multi-label classification, object detection and semantic segmentation based on weakly supervised learning. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Geiger et al. [2012] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 3354–3361, 2012.
  • Georgakis et al. [2017] Georgios Georgakis, Arsalan Mousavian, Alexander C. Berg, and Jana Kosecka. Synthesizing training data for object detection in indoor scenes. In Nancy M. Amato, Siddhartha S. Srinivasa, Nora Ayanian, and Scott Kuindersma, editors, Robotics: Science and Systems XIII, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA, July 12-16, 2017, 2017. URL
  • Gerónimo et al. [2007] David Gerónimo, Angel Domingo Sappa, Antonio López, and Daniel Ponsa. Adaptive image sampling and windows classification for on-board pedestrian detection. In Proceedings of the 5th International Conference on Computer Vision Systems (ICVS 2007), 2007.
  • Gidaris and Komodakis [2016a] Spyridon Gidaris and Nikos Komodakis. Attend Refine Repeat - Active Box Proposal Generation via In-Out Localization. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016a.
  • Gidaris and Komodakis [2015] Spyros Gidaris and Nikos Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1134–1142, 2015.
  • Gidaris and Komodakis [2016b] Spyros Gidaris and Nikos Komodakis. LocNet: Improving Localization Accuracy for Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016b.
  • Girshick [2015] Ross Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 1440–1448, 2015.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 580–587, 2014.
  • Girshick et al. [2015] Ross B. Girshick, Forrest N. Iandola, Trevor Darrell, and Jitendra Malik. Deformable part models are convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Gonzalez-Garcia et al. [2017] Abel Gonzalez-Garcia, Davide Modolo, and Vittorio Ferrari. Objects as context for part detection. CoRR, abs/1703.09529, 2017. URL
  • Goodfellow et al. [2014] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 2672–2680, 2014.
  • Goodfellow et al. [2013] Ian J. Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron C. Courville, and Yoshua Bengio. Maxout networks. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 1319–1327, 2013. URL
  • Goyal et al. [2017] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL
  • Gupta et al. [2016] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. Synthetic Data for Text Localisation in Natural Images. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2315–2324, June 2016.
  • Gupta et al. [2015] Saurabh Gupta, Bharath Hariharan, and Jitendra Malik. Exploring person context and local scene context for object detection. CoRR, abs/1511.08177, 2015.
  • Han et al. [2015] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.
  • Han et al. [2016] Wei Han, Pooya Khorrami, Tom Le Paine, Prajit Ramachandran, Mohammad Babaeizadeh, Honghui Shi, Jianan Li, Shuicheng Yan, and Thomas S. Huang. Seq-nms for video object detection. CoRR, abs/1602.08465, 2016.
  • He et al. [2015] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 770–778, 2016.
  • He et al. [2017] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2980–2988, 2017.
  • He et al. [2018] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. An end-to-end textspotter with explicit alignment and attention. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.
  • Heitz and Koller [2008] Geremy Heitz and Daphne Koller. Learning Spatial Context - Using Stuff to Find Things. In Computer Vision - ECCV 2008, 10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Berlin, Heidelberg, 2008.
  • Henderson and Ferrari [2016] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In Computer Vision - ACCV 2016 - 13th Asian Conference on Computer Vision, Taipei, Taiwan, November 20-24, 2016, pages 198–213, 2016.
  • Henriques and Vedaldi [2017] João F. Henriques and Andrea Vedaldi. Warped Convolutions - Efficient Invariance to Spatial Transformations. International Conference on Machine Learning (ICML), 2017.
  • Hetang et al. [2017] Congrui Hetang, Hongwei Qin, Shaohui Liu, and Junjie Yan. Impression network for video object detection. CoRR, abs/1712.05896, 2017.
  • Himmelsbach et al. [2008] Michael Himmelsbach, Andre Mueller, Thorsten Lüttel, and Hans-Joachim Wünsche. Lidar-based 3d object perception. In Proceedings of 1st international workshop on cognition for technical systems, volume 1, 2008.
  • Hinterstoisser et al. [2017] Stefan Hinterstoisser, Vincent Lepetit, Paul Wohlhart, and Kurt Konolige. On pre-trained image features and synthetic images for deep learning. CoRR, abs/1710.10710, 2017.
  • Hjelmås and Low [2001] Erik Hjelmås and Boon Kee Low. Face Detection: A Survey. Computer Vision and Image Understanding (CVIU), 83(3):236–274, September 2001.
  • Hoffman et al. [2014] Judy Hoffman, Sergio Guadarrama, Eric S Tzeng, Ronghang Hu, Jeff Donahue, Ross Girshick, Trevor Darrell, and Kate Saenko. Lsda: Large scale detection through adaptation. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 3536–3544, 2014.
  • Hoffman et al. [2015] Judy Hoffman, Deepak Pathak, Trevor Darrell, and Kate Saenko. Detector discovery in the wild: Joint multiple instance and representation learning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 2883–2891, 2015.
  • Hoiem et al. [2012] Derek Hoiem, Yodsawalai Chodpathumwan, and Qieyun Dai. Diagnosing error in object detectors. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, pages 340–353, 2012.
  • Hosang et al. [2016] Jan Hosang, Rodrigo Benenson, and Bernt Schiele. A convnet for non-maximum suppression. In German Conference on Pattern Recognition, pages 192–204, 2016.
  • Hosang et al. [2014] Jan Hendrik Hosang, Rodrigo Benenson, and Bernt Schiele. How good are detection proposals, really? In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, 2014.
  • Hosang et al. [2017] Jan Hendrik Hosang, Rodrigo Benenson, and Bernt Schiele. Learning non-maximum suppression. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6469–6477, 2017.
  • Houben et al. [2013] Sebastian Houben, Johannes Stallkamp, Jan Salmen, Marc Schlipsing, and Christian Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In International Joint Conference on Neural Networks, number 1288, 2013.
  • Howard et al. [2017] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.
  • Hu et al. [2018] Han Hu, Jiayuan Gu, Zheng Zhang, Jifeng Dai, and Yichen Wei. Relation networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Hu et al. [2017] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CoRR, abs/1709.01507, 2017.
  • Hu and Ramanan [2017] Peiyun Hu and Deva Ramanan. Finding tiny faces. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1522–1530. IEEE Computer Society, 2017.
  • Huang et al. [2017a] Gao Huang, Shichen Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Condensenet: An efficient densenet using learned group convolutions. CoRR, abs/1711.09224, 2017a.
  • Huang et al. [2017b] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, volume 1, page 3, 2017b.
  • Huang et al. [2017c] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, et al. Speed/accuracy trade-offs for modern convolutional object detectors. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017c.
  • Huang et al. [2018a] Qiangui Huang, Shaohua Kevin Zhou, Suya You, and Ulrich Neumann. Learning to prune filters in convolutional neural networks. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 709–718. IEEE Computer Society, 2018a.
  • Huang et al. [2018b] Xun Huang, Ming-Yu Liu, Serge J. Belongie, and Jan Kautz. Multimodal unsupervised image-to-image translation. CoRR, abs/1804.04732, 2018b.
  • Hubara et al. [2016] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4107–4115, 2016.
  • Hubara et al. [2017] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1):6869–6898, 2017.
  • Humayun et al. [2014] Ahmad Humayun, Fuxin Li, and James M Rehg. Rigor: Reusing inference in graph cuts for generating object regions. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 336–343, 2014.
  • Huval et al. [2013] Brody Huval, Adam Coates, and Andrew Y. Ng. Deep learning for class-generic object detection. CoRR, abs/1312.6885, 2013.
  • Iandola et al. [2014] Forrest N. Iandola, Matthew W. Moskewicz, Sergey Karayev, Ross B. Girshick, Trevor Darrell, and Kurt Keutzer. Densenet: Implementing efficient convnet descriptor pyramids. CoRR, abs/1404.1869, 2014.
  • Iandola et al. [2016] Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5MB model size. CoRR, abs/1602.07360v3, 2016.
  • Inoue [2018] Hiroshi Inoue. Data augmentation by pairing samples for images classification. CoRR, abs/1801.02929, 2018.
  • Inoue et al. [2018] Naoto Inoue, Ryosuke Furuta, Toshihiko Yamasaki, and Kiyoharu Aizawa. Cross-domain weakly-supervised object detection through progressive domain adaptation. CoRR, abs/1803.11365, 2018.
  • Ioffe [2017] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 1942–1950, 2017.
  • Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 448–456, 2015.
  • Jaderberg et al. [2014] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. CoRR, abs/1406.2227, 2014.
  • Jaderberg et al. [2015] Max Jaderberg, Karen Simonyan, and Andrew Zisserman. Spatial transformer networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015.
  • Jain and Learned-Miller [2010] Vidit Jain and Erik Learned-Miller. FDDB: A Benchmark for Face Detection in Unconstrained Settings. UM-CS-2010-009, University of Massachusetts Amherst, 2010.
  • Jeong et al. [2017] Jisoo Jeong, Hyojin Park, and Nojun Kwak. Enhancement of SSD by concatenating feature maps for object detection. CoRR, abs/1705.09587, 2017.
  • Jha et al. [2018] Saurav Jha, Nikhil Agarwal, and Suneeta Agarwal. Towards improved cartoon face detection and recognition systems. CoRR, abs/1804.01753, 2018.
  • Jiang et al. [2018] Borui Jiang, Ruixuan Luo, Jiayuan Mao, Tete Xiao, and Yuning Jiang. Acquisition of localization confidence for accurate object detection. CoRR, abs/1807.11590, 2018.
  • Jiang et al. [2017] Yingying Jiang, Xiangyu Zhu, Xiaobing Wang, Shuli Yang, Wei Li, Hua Wang, Pei Fu, and Zhenbo Luo. R2CNN: rotational region CNN for orientation robust scene text detection. CoRR, abs/1706.09579, 2017.
  • Joly and Buisson [2009] Alexis Joly and Olivier Buisson. Logo retrieval with a contrario visual query expansion. In Wen Gao, Yong Rui, Alan Hanjalic, Changsheng Xu, Eckehard G. Steinbach, Abdulmotaleb El-Saddik, and Michelle X. Zhou, editors, Proceedings of the 17th International Conference on Multimedia 2009, Vancouver, British Columbia, Canada, October 19-24, 2009, pages 581–584. ACM, 2009.
  • Joshi and Thakore [2012] Kinjal A Joshi and Darshak G Thakore. A Survey on Moving Object Detection and Tracking in Video Surveillance System. International Journal of Soft Computing and Engineering (IJSCE), 2(3):5, 2012.
  • Kang et al. [2015] Hongwen Kang, Martial Hebert, Alexei A Efros, and Takeo Kanade. Data-driven objectness. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):189–195, 2015.
  • Kang et al. [2016] Kai Kang, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Object detection from video tubelets with convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 817–825, 2016.
  • Kang et al. [2017] Kai Kang, Hongsheng Li, Junjie Yan, Xingyu Zeng, Bin Yang, Tong Xiao, Cong Zhang, Zhe Wang, Ruohui Wang, Xiaogang Wang, and Wanli Ouyang. T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos. IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2017.
  • Karatzas et al. [2015] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny. Icdar 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1156–1160, Aug 2015.
  • Karras et al. [2018] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In International Conference on Learning Representations, 2018.
  • Katti et al. [2016] Harish Katti, Marius V. Peelen, and S. P. Arun. Object detection can be improved using human-derived contextual expectations. CoRR, abs/1611.07218, 2016.
  • Keren et al. [2018] Gil Keren, Maximilian Schmitt, Thomas Kehrenberg, and Björn W. Schuller. Weakly supervised one-shot detection with attention siamese networks. CoRR, abs/1801.03329, 2018.
  • Khosla et al. [2012] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba. Undoing the damage of dataset bias. In Computer Vision - ECCV 2012 - 12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, pages 158–171, 2012.
  • Kim et al. [2016] Kye-Hyeon Kim, Yeongjae Cheon, Sanghoon Hong, Byung-Seok Roh, and Minje Park. PVANET: deep but lightweight neural networks for real-time object detection. CoRR, abs/1608.08021, 2016.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  • Klare et al. [2015] Brendan F Klare, Ben Klein, Emma Taborsky, Austin Blanton, Jordan Cheney, Kristen Allen, Patrick Grother, Alan Mah, and Anil K Jain. Pushing the frontiers of unconstrained face detection and recognition: Iarpa janus benchmark a. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1931–1939, 2015.
  • Kokkinos [2017] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5454–5463. IEEE Computer Society, 2017.
  • Kong et al. [2016] Tao Kong, Anbang Yao, Yurong Chen, and Fuchun Sun. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, April 2016.
  • Kong et al. [2017] Tao Kong, Fuchun Sun, Anbang Yao, Huaping Liu, Ming Lu, and Yurong Chen. RON: reverse connection with objectness prior networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5244–5252. IEEE Computer Society, 2017.
  • Kong et al. [2018] Tao Kong, Fuchun Sun, Wen-bing Huang, and Huaping Liu. Deep feature pyramid reconfiguration for object detection. CoRR, abs/1808.07993, 2018.
  • Kostinger et al. [2011] Martin Kostinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated Facial Landmarks in the Wild: A large-scale, real-world database for facial landmark localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, pages 2144–2151, 2011.
  • Krasin et al. [2017] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Shahab Kamali, Matteo Malloci, Jordi Pont-Tuset, Andreas Veit, Serge Belongie, Victor Gomes, Abhinav Gupta, Chen Sun, Gal Chechik, David Cai, Zheyun Feng, Dhyanesh Narayanan, and Kevin Murphy. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from, 2017.
  • Krishna et al. [2017] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV), 123(1):32–73, 2017.
  • Krizhevsky [2014] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. CoRR, abs/1404.5997, 2014.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, pages 1106–1114, 2012.
  • Ku et al. [2017] Jason Ku, Melissa Mozifian, Jungwook Lee, Ali Harakeh, and Steven Lake Waslander. Joint 3d proposal generation and object detection from view aggregation. CoRR, abs/1712.02294, 2017.
  • Kumar Singh et al. [2016] Krishna Kumar Singh, Fanyi Xiao, and Yong Jae Lee. Track and transfer: Watching videos to simulate strong human supervision for weakly-supervised object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 3548–3556, 2016.
  • Kuo et al. [2015] Weicheng Kuo, Bharath Hariharan, and Jitendra Malik. Deepbox: Learning objectness with convolutional networks. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2479–2487, 2015.
  • Lafferty et al. [2001] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA, 2001.
  • Lam et al. [2018] Darius Lam, Richard Kuzma, Kevin McGee, Samuel Dooley, Michael Laielli, Matthew Klaric, Yaroslav Bulatov, and Brendan McCord. xview: Objects in context in overhead imagery. CoRR, abs/1802.07856, 2018.
  • Lampert et al. [2008] Christoph H. Lampert, Matthew B. Blaschko, and Thomas Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2008), 24-26 June 2008, Anchorage, Alaska, USA, 2008.
  • Laptev et al. [2016] Dmitry Laptev, Nikolay Savinov, Joachim M. Buhmann, and Marc Pollefeys. TI-POOLING: transformation-invariant pooling for feature learning in convolutional neural networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 289–297. IEEE Computer Society, 2016.
  • Law and Deng [2018] Hei Law and Jia Deng. Cornernet: Detecting objects as paired keypoints. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, 2018.
  • LeCun et al. [2012] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Grégoire Montavon, Genevieve B. Orr, and Klaus-Robert Müller, editors, Neural Networks: Tricks of the Trade - Second Edition, volume 7700 of Lecture Notes in Computer Science, pages 9–48. Springer, 2012.
  • Lee et al. [2016] Byungjae Lee, Enkhbayar Erdenee, SongGuo Jin, Mi Young Nam, Young Giu Jung, and Phill-Kyu Rhee. Multi-class multi-object tracking using changing point detection. In Gang Hua and Hervé Jégou, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, volume 9914 of Lecture Notes in Computer Science, pages 68–83, 2016.
  • Lee et al. [2017a] Kyoungmin Lee, Jaeseok Choi, Jisoo Jeong, and Nojun Kwak. Residual features and unified prediction network for single stage detection. CoRR, abs/1707.05031, 2017a.
  • Lee et al. [2017b] Youngwan Lee, Huieun Kim, Eunsoo Park, Xuenan Cui, and Hakil Kim. Wide-residual-inception networks for real-time object detection. In Intelligent Vehicles Symposium (IV), 2017 IEEE, pages 758–764, 2017b.
  • Lemley et al. [2017] Joseph Lemley, Shabab Bazrafkan, and Peter Corcoran. Smart augmentation learning an optimal data augmentation strategy. IEEE Access, 5:5858–5869, 2017.
  • Li [2017] Bo Li. 3D Fully Convolutional Network for Vehicle Detection in Point Cloud. In IROS, 2017.
  • Li et al. [2016a] Bo Li, Tianfu Wu, Shuai Shao, Lun Zhang, and Rufeng Chu. Object detection via end-to-end integration of aspect ratio and context aware part-based models and fully convolutional networks. CoRR, abs/1612.00534, 2016a.
  • Li et al. [2016b] Bo Li, Tianlei Zhang, and Tian Xia. Vehicle detection from 3d lidar using fully convolutional network. In David Hsu, Nancy M. Amato, Spring Berman, and Sam Ade Jacobs, editors, Robotics: Science and Systems XII, University of Michigan, Ann Arbor, Michigan, USA, June 18 - June 22, 2016, 2016b.
  • Li et al. [2015] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5325–5334, 2015.
  • Li et al. [2017a] Hongyang Li, Yu Liu, Wanli Ouyang, and Xiaogang Wang. Zoom out-and-in network with recursive training for object proposal. CoRR, abs/1702.05711, 2017a.
  • Li et al. [2017b] Jianan Li, Xiaodan Liang, ShengMei Shen, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Scale-aware fast r-cnn for pedestrian detection. IEEE Transactions on Multimedia, 2017b.
  • Li et al. [2017c] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Perceptual generative adversarial networks for small object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1951–1959. IEEE Computer Society, 2017c.
  • Li et al. [2016c] Xiaofei Li, Fabian Flohr, Yue Yang, Hui Xiong, Markus Braun, Shuyue Pan, Keqiang Li, and Dariu M Gavrila. A new benchmark for vision-based cyclist detection. In Intelligent Vehicles Symposium (IV), 2016 IEEE, pages 1028–1033, 2016c.
  • Li et al. [2017d] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4438–4446, 2017d. doi: 10.1109/CVPR.2017.472.
  • Li et al. [2017e] Yikang Li, Wanli Ouyang, Bolei Zhou, Kun Wang, and Xiaogang Wang. Scene graph generation from objects, phrases and caption regions. CoRR, abs/1707.09700, 2017e.
  • Li et al. [2018a] Yuxi Li, Jiuwei Li, Weiyao Lin, and Jianguo Li. Tiny-DSOD: Lightweight Object Detection for Resource-Restricted Usages. In Proceedings of the British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, July 2018a.
  • Li et al. [2017f] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Light-head R-CNN: in defense of two-stage object detector. CoRR, abs/1711.07264, 2017f.
  • Li et al. [2018b] Zeming Li, Yilun Chen, Gang Yu, and Yangdong Deng. R-FCN++: Towards Accurate Region-Based Fully Convolutional Networks for Object Detection. In AAAI, page 8, 2018b.
  • Li et al. [2018c] Zeming Li, Chao Peng, Gang Yu, Xiangyu Zhang, Yangdong Deng, and Jian Sun. Detnet: A backbone network for object detection. CoRR, abs/1804.06215, 2018c.
  • Li and Hoiem [2018] Zhizhong Li and Derek Hoiem. Learning without Forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, (to appear), 2018.
  • Li and Zhou [2017] Zuoxin Li and Fuqiang Zhou. FSSD: feature fusion single shot multibox detector. CoRR, abs/1712.00960, 2017.
  • Liao et al. [2018a] Minghui Liao, Baoguang Shi, and Xiang Bai. Textboxes++: A single-shot oriented scene text detector. CoRR, abs/1801.02765, 2018a.
  • Liao et al. [2018b] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-Song Xia, and Xiang Bai. Rotation-sensitive regression for oriented scene text detection. CoRR, abs/1803.05265, 2018b.
  • Liao et al. [2017] Yuan Liao, Xiaoqing Lu, Chengcui Zhang, Yongtao Wang, and Zhi Tang. Mutual Enhancement for Detection of Multiple Logos in Sports Videos. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4856–4865, October 2017.
  • Lin et al. [2014] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, pages 740–755, 2014.
  • Lin et al. [2017a] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, volume 1, page 4, 2017a.
  • Lin et al. [2017b] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2999–3007. IEEE Computer Society, 2017b.
  • Lin et al. [2007] Zhe Lin, Larry S. Davis, David S. Doermann, and Daniel DeMenthon. Hierarchical part-template matching for human detection and segmentation. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007.
  • Lin et al. [2015] Zhouhan Lin, Matthieu Courbariaux, Roland Memisevic, and Yoshua Bengio. Neural networks with few multiplications. CoRR, abs/1510.03009, 2015.
  • Liu and Mattyus [2015] Kang Liu and Gellert Mattyus. Fast multiclass vehicle detection on aerial images. IEEE Geoscience and Remote Sensing Letters, 12(9):1938–1942, 2015.
  • Liu et al. [2016] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 21–37, 2016.
  • Liu and Jin [2017] Yuliang Liu and Lianwen Jin. Deep matching prior network: Toward tighter multi-oriented text detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3454–3461. IEEE Computer Society, 2017.
  • Liu et al. [2017] Zikun Liu, Liu Yuan, Lubin Weng, and Yiping Yang. A high resolution optical satellite image dataset for ship recognition and some new baselines. In ICPRAM, pages 324–331, 2017.
  • Lowe [1999] David G Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157, 1999.
  • Lowe [2004] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision (IJCV), 60(2):91–110, 2004.
  • Lu et al. [2017] Jiajun Lu, Hussein Sibai, Evan Fabry, and David A. Forsyth. Standard detectors aren’t (currently) fooled by physical adversarial stop signs. CoRR, abs/1710.03337, 2017.
  • Lucas et al. [2003] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young. Icdar 2003 robust reading competitions. In Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings., pages 682–687, Aug 2003.
  • Ma et al. [2018] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. Arbitrary-Oriented Scene Text Detection via Rotation Proposals. IEEE Transactions on Multimedia, pages 1–1, 2018.
  • Manen et al. [2013] Santiago Manen, Matthieu Guillaumin, and Luc Van Gool. Prime object proposals with randomized prim’s algorithm. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, pages 2536–2543, 2013.
  • Mao et al. [2017] Jiayuan Mao, Tete Xiao, Yuning Jiang, and Zhimin Cao. What can help pedestrian detection? In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6034–6043, 2017.
  • Mariano et al. [2002] V.Y. Mariano, Junghye Min, Jin-Hyeong Park, R. Kasturi, D. Mihalcik, Huiping Li, D. Doermann, and T. Drayer. Performance evaluation of object detection algorithms. In International Conference on Pattern Recognition (ICPR), volume 3, pages 965–969, 2002.
  • Maron and Lozano-Pérez [1997] Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems 10, [NIPS Conference, Denver, Colorado, USA, 1997], pages 570–576. The MIT Press, 1997.
  • Masana et al. [2016] Marc Masana, Joost van de Weijer, and Andrew D. Bagdanov. On-the-fly network pruning for object detection. CoRR, abs/1605.03477, 2016.
  • Masana et al. [2017] Marc Masana, Joost van de Weijer, Luis Herranz, Andrew D. Bagdanov, and Jose M. Álvarez. Domain-adaptive deep network compression. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4299–4307. IEEE Computer Society, 2017.
  • Massa et al. [2016] Francisco Massa, Bryan C. Russell, and Mathieu Aubry. Deep Exemplar 2D-3D Detection by Adapting from Real to Rendered Views. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 6024–6033, 2016.
  • Matan et al. [1992] Ofer Matan, Henry S. Baird, Jane Bromley, Christopher J. C. Burges, John S. Denker, Lawrence D. Jackel, Yann Le Cun, Edwin P. D. Pednault, William D Satterfield, Charles E. Stenard, et al. Reading handwritten digits: A zip code recognition system. IEEE Computer, 25(7):59–63, 1992.
  • Maze et al. [2018] Brianna Maze, Jocelyn Adams, James A Duncan, Nathan Kalka, Tim Miller, Charles Otto, Anil K Jain, W Tyler Niggel, Janet Anderson, Jordan Cheney, and Patrick Grother. IARPA Janus Benchmark – C: Face Dataset and Protocol. In ICB, page 8, 2018.
  • McCormac et al. [2017] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. Scenenet RGB-D: can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2697–2706. IEEE Computer Society, 2017.
  • Minemura et al. [2018] Kazuki Minemura, Hengfui Liau, Abraham Monrroy, and Shinpei Kato. Lmnet: Real-time multiclass object detection on CPU using 3d lidar. CoRR, abs/1805.04902, 2018.
  • Mishra et al. [2016] A. Mishra, S. Nandan Rai, A. Mishra, and C. V. Jawahar. IIIT-CFW: A Benchmark Database of Cartoon Faces in the Wild. In VASE ECCVW, 2016.
  • Mishra et al. [2012] Anand Mishra, Karteek Alahari, and CV Jawahar. Scene text recognition using higher order language priors. In British Machine Vision Conference, BMVC 2012, Surrey, UK, September 3-7, 2012, 2012.
  • Misra et al. [2015] I. Misra, A. Shrivastava, and M. Hebert. Watch and learn: Semi-supervised learning of object detectors from videos. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 3593–3602, June 2015.
  • Mitash et al. [2017] Chaitanya Mitash, Kun Wang, Kostas E Bekris, and Abdeslam Boularias. Physics-aware Self-supervised Training of CNNs for Object Detection. In IEEE International Conference on Robotics and Automation (ICRA), 2017.
  • Mitchell [2018] T M Mitchell. Never-Ending Learning. Commun. ACM, 61(5):103–115, 2018.
  • Mogelmose et al. [2012] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-Based Traffic Sign Detection and Analysis for Intelligent Driver Assistance Systems: Perspectives and Survey. IEEE Transactions on Intelligent Transportation Systems, 13:1484–1497, November 2012.
  • Mordan et al. [2017] Taylor Mordan, Nicolas Thome, Matthieu Cord, and Gilles Henaff. Deformable Part-based Fully Convolutional Network for Object Detection. In Proceedings of the British Machine Vision Conference 2017, BMVC 2017, London, UK, September 4-7, 2017, 2017.
  • Mousavian et al. [2017] Arsalan Mousavian, Dragomir Anguelov, John Flynn, and Jana Kosecka. 3d bounding box estimation using deep learning and geometry. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5632–5640. IEEE Computer Society, 2017.
  • Mrowca et al. [2015] Damian Mrowca, Marcus Rohrbach, Judy Hoffman, Ronghang Hu, Kate Saenko, and Trevor Darrell. Spatial semantic regularisation for large scale object detection. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 2003–2011, 2015.
  • Mun et al. [2017] Seongkyu Mun, Sangwook Park, David K Han, and Hanseok Ko. Generative adversarial network based acoustic scene training set augmentation and selection using svm hyper-plane. Proc. DCASE, pages 93–97, 2017.
  • Mundhenk et al. [2016] T Nathan Mundhenk, Goran Konjevod, Wesam A Sakla, and Kofi Boakye. A large contextual dataset for classification, detection and counting of cars with deep learning. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 785–800, 2016.
  • Nada et al. [2018] Hajime Nada, Vishwanath A. Sindagi, He Zhang, and Vishal M. Patel. Pushing the limits of unconstrained face detection: a challenge dataset and baseline results. CoRR, abs/1804.10275, 2018.
  • Najibi et al. [2016] Mahyar Najibi, Mohammad Rastegari, and Larry S. Davis. G-CNN: An Iterative Grid Based Object Detector. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Najibi et al. [2017] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. SSH: Single Stage Headless Face Detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017.
  • Newell et al. [2016] Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 483–499, 2016.
  • Niepert et al. [2016] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov. Learning convolutional neural networks for graphs. In International conference on machine learning, pages 2014–2023, 2016.
  • Nowlan and Platt [1995] Steven J Nowlan and John C Platt. A convolutional neural network hand tracker. In Advances in Neural Information Processing Systems 8, NIPS, Denver, CO, USA, November 27-30, 1995, pages 901–908, 1995.
  • Ogier Du Terrail and Jurie [2017] Jean Ogier Du Terrail and Frédéric Jurie. On the use of deep neural networks for the detection of small vehicles in ortho-images. In IEEE International Conference on Image Processing, Beijing, China, September 2017.
  • Oksuz et al. [2018] Kemal Oksuz, Baris Can Cam, Emre Akbas, and Sinan Kalkan. Localization Recall Precision (LRP): A New Performance Metric for Object Detection. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8 - 14, 2018, July 2018.
  • Oquab et al. [2014] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Weakly supervised object recognition with convolutional neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014.
  • Oquab et al. [2015] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free? - weakly-supervised learning with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 685–694, 2015.
  • Osadchy et al. [2007] Margarita Osadchy, Yann Le Cun, and Matthew L Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8(May):1197–1215, 2007.
  • Ouyang et al. [2016] W. Ouyang, X. Wang, and C. Zhang. Factors in finetuning deep model for object detection with long-tail distribution. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Ouyang and Wang [2013a] Wanli Ouyang and Xiaogang Wang. Joint deep learning for pedestrian detection. In IEEE International Conference on Computer Vision, ICCV 2013, Sydney, Australia, December 1-8, 2013, 2013a.
  • Ouyang and Wang [2013b] Wanli Ouyang and Xiaogang Wang. Single-pedestrian detection aided by multi-pedestrian detection. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3198–3205, 2013b.
  • Ouyang et al. [2015] Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, and Xiaoou Tang. DeepID-Net: Deformable deep convolutional neural networks for object detection. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015.
  • Ouyang et al. [2017] Wanli Ouyang, Ku Wang, Xin Zhu, and Xiaogang Wang. Learning chained deep features and classifiers for cascade in object detection. CoRR, abs/1702.07054, 2017.
  • Ouyang et al. [2018] Xi Ouyang, Yu Cheng, Yifan Jiang, Chun-Liang Li, and Pan Zhou. Pedestrian-synthesis-gan: Generating pedestrian data in real scene and beyond. CoRR, abs/1804.02047, 2018.
  • Papadopoulos et al. [2016] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. We don’t need no bounding-boxes: Training object class detectors using only human verification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, February 2016.
  • Papadopoulos et al. [2017] Dim P. Papadopoulos, Jasper R. R. Uijlings, Frank Keller, and Vittorio Ferrari. Training object class detectors with click supervision. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 180–189. IEEE Computer Society, 2017.
  • Papageorgiou and Poggio [2000] Constantine Papageorgiou and Tomaso Poggio. A trainable system for object detection. International Journal of Computer Vision (IJCV), 38(1):15–33, 2000.
  • Peng et al. [2018] Bo Peng, Wenming Tan, Zheyang Li, Shun Zhang, Di Xie, and Shiliang Pu. Extreme network compression via filter group approximation. CoRR, abs/1807.11254, 2018.
  • Peng et al. [2017a] Chao Peng, Tete Xiao, Zeming Li, Yuning Jiang, Xiangyu Zhang, Kai Jia, Gang Yu, and Jian Sun. Megdet: A large mini-batch object detector. CoRR, abs/1711.07240, 2017a.
  • Peng et al. [2017b] Chao Peng, Xiangyu Zhang, Gang Yu, Guiming Luo, and Jian Sun. Large kernel matters - improve semantic segmentation by global convolutional network. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 1743–1751, 2017b.
  • Peng and Saenko [2018] Xingchao Peng and Kate Saenko. Synthetic to real adaptation with generative correlation alignment networks. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1982–1991. IEEE Computer Society, 2018.
  • Peng et al. [2015] Xingchao Peng, Baochen Sun, Karim Ali, and Kate Saenko. Learning Deep Object Detectors from 3D Models. In IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, 2015.
  • Pentland et al. [1994] Alex Pentland, Baback Moghaddam, and Thad Starner. View-based and modular eigenspaces for face recognition. In Conference on Computer Vision and Pattern Recognition, CVPR 1994, 21-23 June, 1994, Seattle, WA, USA, pages 84–91, 1994.
  • Pepik et al. [2015] Bojan Pepik, Rodrigo Benenson, Tobias Ritschel, and Bernt Schiele. What is holding back convnets for detection? In German Conference on Pattern Recognition, pages 517–528, 2015.
  • Perez and Wang [2017] Luis Perez and Jason Wang. The effectiveness of data augmentation in image classification using deep learning. CoRR, abs/1712.04621, 2017.
  • Pham et al. [2017] Phuoc Pham, Duy Nguyen, Tien Do, Thanh Duc Ngo, and Duy-Dinh Le. Evaluation of Deep Models for Real-Time Small Object Detection. ICONIP, 10636:516–526, 2017.
  • Pinheiro et al. [2015] Pedro H. O. Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Corinna Cortes, Neil D. Lawrence, Daniel D. Lee, Masashi Sugiyama, and Roman Garnett, editors, Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 1990–1998, 2015.
  • Pinheiro and Collobert [2015] Pedro O. Pinheiro and Ronan Collobert. From Image-level to Pixel-level Labeling with Convolutional Networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Pinheiro et al. [2016] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 75–91, 2016.
  • Pon et al. [2018] Alex D. Pon, Oles Andrienko, Ali Harakeh, and Steven L. Waslander. A Hierarchical Deep Architecture and Mini-Batch Selection Method For Joint Traffic Sign and Light Detection. In IEEE Conference on Computer and Robot Vision, June 2018.
  • Pont-Tuset et al. [2017] Jordi Pont-Tuset, Pablo Arbelaez, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping for image segmentation and object proposal generation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(1):128–140, 2017.
  • Porikli [2005] Fatih Murat Porikli. Integral histogram: A fast way to extract histograms in cartesian spaces. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), 20-26 June 2005, San Diego, CA, USA, pages 829–836, 2005.
  • Qi et al. [2017a] Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, July 2017a.
  • Qi et al. [2017b] Charles Ruizhongtai Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. CoRR, abs/1711.08488, 2017b.
  • Qi et al. [2017c] Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 5105–5114, 2017c.
  • Qiu and Yuille [2016] Weichao Qiu and Alan L. Yuille. Unrealcv: Connecting computer vision to unreal engine. In Gang Hua and Hervé Jégou, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, volume 9915 of Lecture Notes in Computer Science, pages 909–916, 2016.
  • Rahman et al. [2018] Shafin Rahman, Salman Hameed Khan, and Fatih Porikli. Zero-shot object detection: Learning to simultaneously recognize and localize novel concepts. CoRR, abs/1803.06049, 2018.
  • Rahtu et al. [2011] Esa Rahtu, Juho Kannala, and Matthew Blaschko. Learning a category independent object detection cascade. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 1052–1059, 2011.
  • Raj et al. [2015] Anant Raj, Vinay P. Namboodiri, and Tinne Tuytelaars. Subspace Alignment Based Domain Adaptation for RCNN Detector. In Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 166.1–166.11, Swansea, 2015.
  • Rajaram et al. [2016] Rakesh N. Rajaram, Eshed Ohn-Bar, and Mohan M. Trivedi. RefineNet: Iterative refinement for accurate object localization. In IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 1528–1533, November 2016.
  • Rajpura et al. [2017] Param S. Rajpura, Ravi S. Hegde, and Hristo Bojinov. Object detection using deep cnns trained on synthetic images. CoRR, abs/1706.06782, 2017.
  • Ranjan et al. [2015] Rajeev Ranjan, Vishal M. Patel, and Rama Chellappa. A deep pyramid deformable part model for face detection. In IEEE 7th International Conference on Biometrics Theory, Applications and Systems, BTAS 2015, Arlington, VA, USA, September 8-11, 2015, pages 1–8. IEEE, 2015.
  • Rantalankila et al. [2014] Pekka Rantalankila, Juho Kannala, and Esa Rahtu. Generating object segmentation proposals using global and local search. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 2417–2424, 2014.
  • Rastegari et al. [2016] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. Xnor-net: Imagenet classification using binary convolutional neural networks. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, pages 525–542, 2016.
  • Ratner et al. [2017] Alexander J Ratner, Henry Ehrenberg, Zeshan Hussain, Jared Dunnmon, and Christopher Ré. Learning to compose domain-specific transformations for data augmentation. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 3236–3246, 2017.
  • Ray et al. [2017] Kumar S. Ray, Vijayan K. Asari, and Soma Chakraborty. Object detection by spatio-temporal analysis and tracking of the detected objects in a video with variable background. CoRR, abs/1705.02949, 2017.
  • Razakarivony and Jurie [2016] Sébastien Razakarivony and Frédéric Jurie. Vehicle detection in aerial imagery: A small target detection benchmark. Journal of Visual Communication and Image Representation, 34:187–203, 2016.
  • Real et al. [2017] Esteban Real, Jonathon Shlens, Stefano Mazzocchi, Xin Pan, and Vincent Vanhoucke. Youtube-boundingboxes: A large high-precision human-annotated data set for object detection in video. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 7464–7473. IEEE Computer Society, 2017.
  • Reddi et al. [2018] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. In International Conference on Learning Representations (ICLR), 2018.
  • Redmon and Angelova [2015] Joseph Redmon and Anelia Angelova. Real-time grasp detection using convolutional neural networks. In IEEE International Conference on Robotics and Automation (ICRA), 2015.
  • Redmon and Farhadi [2017] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6517–6525. IEEE Computer Society, 2017.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. CoRR, abs/1804.02767, 2018.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 779–788, 2016.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 91–99, 2015.
  • Ren et al. [2017] Shaoqing Ren, Kaiming He, Ross B. Girshick, Xiangyu Zhang, and Jian Sun. Object detection networks on convolutional feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(7):1476–1481, 2017.
  • Rochan and Wang [2015] M. Rochan and Yang Wang. Weakly supervised localization of novel objects using appearance transfer. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Rodriguez et al. [2011] Mikel Rodriguez, Ivan Laptev, Josef Sivic, and Jean-Yves Audibert. Density-aware person detection and tracking in crowds. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2423–2430, 2011.
  • Romberg et al. [2011] Stefan Romberg, Lluis Garcia Pueyo, Rainer Lienhart, and Roelof Van Zwol. Scalable logo recognition in real-world images. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, page 25, 2011.
  • Rosenfeld et al. [2018] Amir Rosenfeld, Richard Zemel, and John K. Tsotsos. The elephant in the room. CoRR, abs/1808.03305, 2018.
  • Rothe et al. [2014] Rasmus Rothe, Matthieu Guillaumin, and Luc Van Gool. Non-maximum suppression for object detection by passing messages between windows. In Computer Vision - ACCV 2014 - 12th Asian Conference on Computer Vision, Singapore, Singapore, November 1-5, 2014, pages 290–306, 2014.
  • Roy et al. [2016] Soumya Roy, Vinay P. Namboodiri, and Arijit Biswas. Active learning with version spaces for object detection. CoRR, abs/1611.07285, 2016.
  • Rujikietgumjorn and Collins [2013] Sitapa Rujikietgumjorn and Robert T Collins. Optimized pedestrian detection for multiple and occluded people. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3690–3697, 2013.
  • Rumelhart et al. [1985] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. Learning internal representations by error propagation. Technical report, California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
  • Sabzmeydani and Mori [2007] Payam Sabzmeydani and Greg Mori. Detecting pedestrians by learning shapelet features. In 2007 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2007), 18-23 June 2007, Minneapolis, Minnesota, USA, 2007.
  • Sadeghi and Farhadi [2011] Mohammad Amin Sadeghi and Ali Farhadi. Recognition using visual phrases. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1745–1752, 2011.
  • Sadeghi and Forsyth [2014] Mohammad Amin Sadeghi and David A. Forsyth. 30hz object detection with DPM V5. In David J. Fleet, Tomás Pajdla, Bernt Schiele, and Tinne Tuytelaars, editors, Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, volume 8689 of Lecture Notes in Computer Science, pages 65–79. Springer, 2014.
  • Sakla et al. [2017] Wesam A. Sakla, Goran Konjevod, and T. Nathan Mundhenk. Deep multi-modal vehicle detection in aerial ISR imagery. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, pages 916–923. IEEE, 2017.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, pages 4510–4520, 2018.
  • Savalle and Tsogkas [2014] P. A. Savalle and S. Tsogkas. Deformable part models with cnn features. In SAICSIT Conf., 2014.
  • Schneiderman and Kanade [2004] Henry Schneiderman and Takeo Kanade. Object detection using the statistics of parts. International Journal of Computer Vision (IJCV), 56(3):151–177, 2004.
  • Sermanet et al. [2013a] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013a.
  • Sermanet et al. [2013b] Pierre Sermanet, Koray Kavukcuoglu, Soumith Chintala, and Yann LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013, pages 3626–3633, 2013b.
  • Shafiee et al. [2017] Mohammad Javad Shafiee, Brendan Chwyl, Francis Li, and Alexander Wong. Fast YOLO: A fast you only look once system for real-time embedded object detection in video. CoRR, abs/1709.05943, 2017.
  • Shen et al. [2018] Yunhan Shen, Rongrong Ji, Shengchuan Zhang, Wangmeng Zuo, and Yan Wang. Generative adversarial learning towards fast weakly supervised detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Shen et al. [2017a] Zhiqiang Shen, Zhuang Liu, Jianguo Li, Yu-Gang Jiang, Yurong Chen, and Xiangyang Xue. Dsod: Learning deeply supervised object detectors from scratch. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, volume 3, page 7, 2017a.
  • Shen et al. [2017b] Zhiqiang Shen, Honghui Shi, Rogério Schmidt Feris, Liangliang Cao, Shuicheng Yan, Ding Liu, Xinchao Wang, Xiangyang Xue, and Thomas S. Huang. Learning object detectors from scratch with gated recurrent feature pyramids. CoRR, abs/1712.00886, 2017b.
  • Shi et al. [2017a] Baoguang Shi, Xiang Bai, and Serge J. Belongie. Detecting oriented text in natural images by linking segments. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3482–3490. IEEE Computer Society, 2017a.
  • Shi et al. [2017b] Baoguang Shi, Cong Yao, Minghui Liao, Mingkun Yang, Pei Xu, Linyan Cui, Serge J. Belongie, Shijian Lu, and Xiang Bai. ICDAR2017 competition on reading chinese text in the wild (RCTW-17). CoRR, abs/1708.09585, 2017b.
  • Shi et al. [2018] Xuepeng Shi, Shiguang Shan, Meina Kan, Shuzhe Wu, and Xilin Chen. Real-time rotation-invariant face detection with progressive calibration networks. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Shmelkov et al. [2017] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. Incremental learning of object detectors without catastrophic forgetting. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3420–3429, 2017.
  • Shrivastava et al. [2016a] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 761–769, 2016a.
  • Shrivastava et al. [2016b] Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, and Abhinav Gupta. Beyond skip connections: Top-down modulation for object detection. CoRR, abs/1612.06851, 2016b.
  • Shrivastava et al. [2017] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Joshua Susskind, Wenda Wang, and Russell Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2242–2251, 2017.
  • Silberstein et al. [2014] Shai Silberstein, Dan Levi, Victoria Kogan, and Ran Gazit. Vision-based pedestrian detection for rear-view cameras. In Intelligent Vehicles Symposium Proceedings, 2014 IEEE, pages 853–860, 2014.
  • Silver et al. [2013] Daniel L Silver, Qiang Yang, and Lianghao Li. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In 2013 AAAI Spring Symposium, page 7, 2013.
  • Simon et al. [2018] Martin Simon, Stefan Milz, Karl Amende, and Horst-Michael Gross. Complex-yolo: Real-time 3d object detection on point clouds. CoRR, abs/1803.06199, 2018.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
  • Simonyan et al. [2013] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
  • Singh and Davis [2018] Bharat Singh and Larry S Davis. An analysis of scale invariance in object detection - SNIP. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, 2018.
  • Singh et al. [2017] Bharat Singh, Hengduo Li, Abhishek Sharma, and Larry S. Davis. R-FCN-3000 at 30fps: Decoupling detection and classification. CoRR, abs/1712.01802, 2017.
  • Singh et al. [2018] Bharat Singh, Mahyar Najibi, and Larry S. Davis. SNIPER: efficient multi-scale training. CoRR, abs/1805.09300, 2018.
  • Sixt et al. [2018] Leon Sixt, Benjamin Wild, and Tim Landgraf. Rendergan: Generating realistic labeled data. Front. Robotics and AI, 2018, 2018.
  • Smeulders et al. [2000] Arnold W M Smeulders, Amarnath Gupta, and Ramesh Jain. Content-Based Image Retrieval at the End of the Early Years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):32, 2000.
  • Sommer et al. [2017a] Lars W. Sommer, Tobias Schuchert, Jurgen Beyerer, Firooz A. Sadjadi, and Abhijit Mahalanobis. Deep learning based multi-category object detection in aerial images. In SPIE Defense+ Security, May 2017a.
  • Sommer et al. [2017b] Lars Wilko Sommer, Tobias Schuchert, and Jürgen Beyerer. Fast deep vehicle detection in aerial images. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, Santa Rosa, CA, USA, March 24-31, 2017, pages 311–319. IEEE, 2017b.
  • Sommer et al. [2018] Lars Wilko Sommer, Arne Schumann, Tobias Schuchert, and Jürgen Beyerer. Multi feature deconvolutional faster R-CNN for precise vehicle detection in aerial imagery. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 635–642. IEEE Computer Society, 2018.
  • Song et al. [2014a] Hyun Oh Song, Ross B. Girshick, Stefanie Jegelka, Julien Mairal, Zaïd Harchaoui, and Trevor Darrell. On learning to localize objects with minimal supervision. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1611–1619, 2014a.
  • Song et al. [2014b] Hyun Oh Song, Yong Jae Lee, Stefanie Jegelka, and Trevor Darrell. Weakly-supervised discovery of visual pattern configurations. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 1637–1645, 2014b.
  • Springenberg et al. [2014] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, abs/1412.6806, 2014.
  • Srivastava et al. [2018] Siddharth Srivastava, Gaurav Sharma, and Brejesh Lall. Large scale novel object discovery in 3d. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 179–188. IEEE Computer Society, 2018.
  • Stewart et al. [2016] Russell Stewart, Mykhaylo Andriluka, and Andrew Y Ng. End-to-end people detection in crowded scenes. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2325–2333, 2016.
  • Su et al. [2017a] Hang Su, Shaogang Gong, and Xiatian Zhu. WebLogo-2M: Scalable Logo Detection by Deep Learning from the Web. In ICCV Workshops, pages 270–279, October 2017a.
  • Su et al. [2017b] Hang Su, Xiatian Zhu, and Shaogang Gong. Deep Learning Logo Detection with Data Expansion by Synthesising Context. IEEE Winter Conf. on Applications of Computer Vision (WACV), pages 530–539, 2017b.
  • Su et al. [2018] Hang Su, Xiatian Zhu, and Shaogang Gong. Open Logo Detection Challenge. In Proceedings of the British Machine Vision Conference 2018, BMVC 2018, Newcastle, UK, September 3-6, 2018, 2018.
  • Sun and Saenko [2014] Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object detectors to real domains. In British Machine Vision Conference, BMVC 2014, Nottingham, UK, September 1-5, 2014, volume 1, page 3, 2014.
  • Sun et al. [2016] Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, and Lubomir Bourdev. ProNet: Learning to Propose Object-Specific Boxes for Cascaded Neural Networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Szegedy et al. [2014] Christian Szegedy, Scott E. Reed, Dumitru Erhan, and Dragomir Anguelov. Scalable, high-quality object detection. CoRR, abs/1412.1441, 2014.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, et al. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1–9, 2015.
  • Szegedy et al. [2016] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2818–2826. IEEE Computer Society, 2016.
  • Szegedy et al. [2017] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander A Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
  • Tan et al. [2018] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, and Quoc V. Le. Mnasnet: Platform-aware neural architecture search for mobile. CoRR, abs/1807.11626, 2018.
  • Tang et al. [2012] Kevin D. Tang, Vignesh Ramanathan, Fei-Fei Li, and Daphne Koller. Shifting Weights: Adapting Object Detectors from Image to Video. In Advances in Neural Information Processing Systems 25: 26th Annual Conference on Neural Information Processing Systems 2012. Proceedings of a meeting held December 3-6, 2012, Lake Tahoe, Nevada, United States, 2012.
  • Tang et al. [2017a] Peng Tang, Xinggang Wang, Xiang Bai, and Wenyu Liu. Multiple instance detection network with online instance classifier refinement. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017a.
  • Tang et al. [2014] Siyu Tang, Mykhaylo Andriluka, and Bernt Schiele. Detection and tracking of occluded people. International Journal of Computer Vision (IJCV), 110(1):58–69, 2014.
  • Tang et al. [2015] Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Subgraph decomposition for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 5033–5041, 2015.
  • Tang et al. [2017b] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Lin Lei, and Huanxin Zou. Arbitrary-Oriented Vehicle Detection in Aerial Imagery with Single Convolutional Neural Networks. Remote Sensing, 9:1170–17, November 2017b.
  • Tang et al. [2017c] Tianyu Tang, Shilin Zhou, Zhipeng Deng, Huanxin Zou, and Lin Lei. Vehicle Detection in Aerial Images Based on Region Convolutional Neural Networks and Hard Negative Example Mining. Sensors, 17:336–17, February 2017c.
  • Tang et al. [2016] Y. Tang, J. K. Wang, B. Gao, and E. Dellandréa. Large Scale Semi-supervised Object Detection using Visual and Semantic Knowledge Transfer. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016.
  • Tanner et al. [2009] Franklin Tanner, Brian Colder, Craig Pullen, David Heagy, Michael Eppolito, Veronica Carlan, Carsten Oertel, and Phil Sallee. Overhead imagery research data set - an annotated data library & tools to aid in the development of computer vision algorithms. In 2009 IEEE Applied Imagery Pattern Recognition Workshop (AIPR 2009), pages 1–8, 2009.
  • Taylor and Nitschke [2017] Luke Taylor and Geoff Nitschke. Improving deep learning using generic data augmentation. CoRR, abs/1708.06020, 2017.
  • Tian et al. [2017] Yonglin Tian, Xuan Li, Kunfeng Wang, and Fei-Yue Wang. Training and testing object detectors with virtual images. CoRR, abs/1712.08470, 2017.
  • Tieleman and Hinton [2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
  • Timofte et al. [2014] Radu Timofte, Karel Zimmermann, and Luc Van Gool. Multi-view traffic sign detection, recognition, and 3d localisation. Machine vision and applications, 25(3):633–647, 2014.
  • Tommasi et al. [2017] Tatiana Tommasi, Novi Patricia, Barbara Caputo, and Tinne Tuytelaars. A deeper look at dataset bias. In Gabriela Csurka, editor, Domain Adaptation in Computer Vision Applications, Advances in Computer Vision and Pattern Recognition, pages 37–55. Springer, 2017.
  • Torralba and Efros [2011] Antonio Torralba and Alexei A Efros. Unbiased look at dataset bias. In The 24th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2011, Colorado Springs, CO, USA, 20-25 June 2011, pages 1521–1528, 2011.
  • Tran et al. [2017] Toan Tran, Trung Pham, Gustavo Carneiro, Lyle Palmer, and Ian Reid. A bayesian data augmentation approach for learning deep models. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2797–2806, 2017.
  • Tremblay et al. [2018a] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018a.
  • Tremblay et al. [2018b] Jonathan Tremblay, Thang To, and Stan Birchfield. Falling things: A synthetic dataset for 3d object detection and pose estimation. CoRR, abs/1804.06534, 2018b.
  • Tripathi et al. [2016] Subarna Tripathi, Zachary C. Lipton, Serge J. Belongie, and Truong Q. Nguyen. Context matters: Refining object detection in video with recurrent neural networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016.
  • Tu and Bai [2010] Zhuowen Tu and Xiang Bai. Auto-context and its application to high-level vision tasks and 3d brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(10):1744–1757, 2010.
  • Tu et al. [2012] Zhuowen Tu, Yi Ma, Wenyu Liu, Xiang Bai, and Cong Yao. Detecting texts of arbitrary orientations in natural images. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 1083–1090, 2012.
  • Tuzel et al. [2008] Oncel Tuzel, Fatih Porikli, and Peter Meer. Pedestrian detection via classification on riemannian manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(10):1713–1727, 2008.
  • Tüzkö et al. [2018] Andras Tüzkö, Christian Herrmann, Daniel Manger, and Jürgen Beyerer. Open set logo detection and retrieval. In Francisco H. Imai, Alain Trémeau, and José Braz, editors, Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018) - Volume 5: VISAPP, Funchal, Madeira, Portugal, January 27-29, 2018, pages 284–292. SciTePress, 2018.
  • Uijlings et al. [2013] Jasper RR Uijlings, Koen EA Van De Sande, Theo Gevers, and Arnold WM Smeulders. Selective search for object recognition. International Journal of Computer Vision (IJCV), 104(2):154–171, 2013.
  • Vaillant et al. [1994] Régis Vaillant, Christophe Monrocq, and Yann Le Cun. Original approach for the localisation of objects in images. IEE Proceedings-Vision, Image and Signal Processing, 141(4):245–250, 1994.
  • Van de Sande et al. [2011] Koen EA Van de Sande, Jasper RR Uijlings, Theo Gevers, and Arnold WM Smeulders. Segmentation as selective search for object recognition. In IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 1879–1886, 2011.
  • Van Horn et al. [2018] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist Species Classification and Detection Dataset. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018.
  • Varol et al. [2017] Gül Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J. Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4627–4635. IEEE Computer Society, 2017.
  • Veit et al. [2016] Andreas Veit, Tomas Matera, Lukas Neumann, Jiri Matas, and Serge J. Belongie. Coco-text: Dataset and benchmark for text detection and recognition in natural images. CoRR, abs/1601.07140, 2016.
  • Vezhnevets and Ferrari [2015] Alexander Vezhnevets and Vittorio Ferrari. Object localization in imagenet by looking out of the window. In Xianghua Xie, Mark W. Jones, and Gary K. L. Tam, editors, Proceedings of the British Machine Vision Conference 2015, BMVC 2015, Swansea, UK, September 7-10, 2015, pages 27.1–27.12. BMVA Press, 2015.
  • Viola et al. [2005] Paul A. Viola, Michael J. Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision (IJCV), 63(2):153–161, 2005.
  • Walk et al. [2010] Stefan Walk, Nikodem Majer, Konrad Schindler, and Bernt Schiele. New features and insights for pedestrian detection. In The Twenty-Third IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2010, San Francisco, CA, USA, 13-18 June 2010, pages 1030–1037, 2010.
  • Wan et al. [2018] Fang Wan, Pengxu Wei, Jianbin Jiao, Zhenjun Han, and Qixiang Ye. Min-entropy latent model for weakly supervised object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Wan et al. [2015] Li Wan, David Eigen, and Rob Fergus. End-to-end integration of a convolutional network, deformable parts model and non-maximum suppression. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 851–859. IEEE Computer Society, 2015.
  • Wang et al. [2014] Chong Wang, Weiqiang Ren, Kaiqi Huang, and Tieniu Tan. Weakly Supervised Object Localization with Latent Category Learning. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 2014.
  • Wang and Belongie [2010] Kai Wang and Serge Belongie. Word spotting in the wild. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, pages 591–604, 2010.
  • Wang et al. [2017a] Li Wang, Yao Lu, Hong Wang, Yingbin Zheng, Hao Ye, and Xiangyang Xue. Evolving boxes for fast vehicle detection. ICME, pages 1135–1140, 2017a.
  • Wang et al. [2018a] Robert J. Wang, Xiang Li, Shuang Ao, and Charles X. Ling. Pelee: A Real-Time Object Detection System on Mobile Devices. In International Conference on Learning Representations (ICLR), 2018a.
  • Wang et al. [2017b] Xiaolong Wang, Ross B. Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. CoRR, abs/1711.07971, 2017b.
  • Wang et al. [2017c] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 3039–3048. IEEE Computer Society, 2017c.
  • Wang et al. [2009] Xiaoyu Wang, Tony X. Han, and Shuicheng Yan. An HOG-LBP human detector with partial occlusion handling. In IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 - October 4, 2009, pages 32–39, 2009.
  • Wang et al. [2018b] Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, and Chunhua Shen. Repulsion Loss: Detecting Pedestrians in a Crowd. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018b.
  • Weiler et al. [2018] Maurice Weiler, Fred A. Hamprecht, and Martin Storath. Learning steerable filters for rotation equivariant cnns. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Wen et al. [2015] Longyin Wen, Dawei Du, Zhaowei Cai, Zhen Lei, Ming-Ching Chang, Honggang Qi, Jongwoo Lim, Ming-Hsuan Yang, and Siwei Lyu. DETRAC: A new benchmark and protocol for multi-object tracking. CoRR, abs/1511.04136, 2015.
  • Whitelam et al. [2017] Cameron Whitelam, Emma Taborsky, Austin Blanton, Brianna Maze, Jocelyn Adams, Tim Miller, Nathan Kalka, Anil K Jain, James A Duncan, Kristen Allen, et al. Iarpa janus benchmark-b face dataset. In CVPR Workshop on Biometrics, 2017.
  • Wojek et al. [2008] Christian Wojek, Gyuri Dorkó, André Schulz, and Bernt Schiele. Sliding-windows for rapid object class localization: A parallel technique. In Joint Pattern Recognition Symposium, pages 71–81, 2008.
  • Wojek et al. [2009] Christian Wojek, Stefan Walk, and Bernt Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pages 794–801. IEEE Computer Society, 2009.
  • Woo et al. [2018] Sanghyun Woo, Soonmin Hwang, and In So Kweon. Stairnet: Top-down semantic aggregation for accurate one shot detection. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pages 1093–1102. IEEE Computer Society, 2018.
  • Wu et al. [2017] Bichen Wu, Forrest N. Iandola, Peter H. Jin, and Kurt Keutzer. Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving. In 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops, Honolulu, HI, USA, July 21-26, 2017, pages 446–454. IEEE Computer Society, 2017.
  • Wu and Nevatia [2007] Bo Wu and Ram Nevatia. Cluster boosted tree classifier for multi-view, multi-pose object detection. In IEEE 11th International Conference on Computer Vision, ICCV 2007, Rio de Janeiro, Brazil, October 14-20, 2007, pages 1–8, 2007.
  • Wu and Nevatia [2005] Bo Wu and Ramakant Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In 10th IEEE International Conference on Computer Vision (ICCV 2005), 17-20 October 2005, Beijing, China, pages 90–97, 2005.
  • Wu et al. [2016] Tianfu Wu, Bo Li, and Song-Chun Zhu. Learning and-or model to represent context and occlusion for car detection and viewpoint estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1829–1843, 2016.
  • Wu and Ji [2018] Yue Wu and Qiang Ji. Facial Landmark Detection: A Literature Survey. International Journal of Computer Vision (IJCV), To appear, May 2018.
  • Xia et al. [2017] Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge J. Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. DOTA: A large-scale dataset for object detection in aerial images. CoRR, abs/1711.10398, 2017.
  • Xiang and Savarese [2012] Yu Xiang and S. Savarese. Estimating the aspect layout of object categories. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, 2012.
  • Xiang et al. [2015] Yu Xiang, Wongun Choi, Yuanqing Lin, and Silvio Savarese. Data-driven 3d voxel patterns for object category recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, pages 1903–1911. IEEE Computer Society, 2015.
  • Xiao et al. [2015] Yao Xiao, Cewu Lu, E. Tsougenis, Yongyi Lu, and Chi-Keung Tang. Complexity-adaptive distance metric for object proposals generation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Xie et al. [2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 5987–5995, 2017.
  • Xu et al. [2017a] Hongyu Xu, Xutao Lv, Xiaoyu Wang, Zhou Ren, and Rama Chellappa. Deep regionlets for object detection. CoRR, abs/1712.02408, 2017a.
  • Xu et al. [2014] Jiaolong Xu, Sebastian Ramos, David Vázquez, and Antonio M López. Domain adaptation of deformable part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(12):2367–2380, 2014.
  • Xu et al. [2017b] Zhaozhuo Xu, Xin Xu, Lei Wang, Rui Yang, and Fangling Pu. Deformable ConvNet with Aspect Ratio Constrained NMS for Object Detection in Remote Sensing Imagery. Remote Sensing, 9:1312–19, December 2017b.
  • Yan et al. [2014] Junjie Yan, Xuzong Zhang, Zhen Lei, and Stan Z. Li. Face detection by structural models. Image and Vision Computing, 32(10):790–799, October 2014.
  • Yang et al. [2016a] Fan Yang, Wongun Choi, and Yuanqing Lin. Exploit all the layers: Fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2129–2137, 2016a.
  • Yang et al. [2015] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pages 3676–3684. IEEE Computer Society, 2015.
  • Yang et al. [2016b] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 5525–5533, 2016b.
  • Yang and Nevatia [2016] Zhenheng Yang and Ramakant Nevatia. A multi-scale cascade fully convolutional network face detector. In 23rd International Conference on Pattern Recognition, ICPR 2016, Cancún, Mexico, December 4-8, 2016, pages 633–638. IEEE, 2016.
  • Yao et al. [2016] Cong Yao, Xiang Bai, Nong Sang, Xinyu Zhou, Shuchang Zhou, and Zhimin Cao. Scene text detection via holistic, multi-channel prediction. CoRR, abs/1606.09002, 2016.
  • Yoshihashi et al. [2017] Ryota Yoshihashi, Tu Tuan Trinh, Rei Kawakami, Shaodi You, Makoto Iida, and Takeshi Naemura. Learning multi-frame visual representation for joint detection and tracking of small objects. CoRR, abs/1709.04666, 2017.
  • You et al. [2018] Yang You, Zhao Zhang, Cho-Jui Hsieh, James Demmel, and Kurt Keutzer. Imagenet training in minutes. In Proceedings of the 47th International Conference on Parallel Processing, ICPP 2018, Eugene, OR, USA, August 13-16, 2018, pages 1:1–1:10. ACM, 2018.
  • Yu and Koltun [2015] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.
  • Yu et al. [2017] Fisher Yu, Vladlen Koltun, and Thomas A. Funkhouser. Dilated residual networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 636–644. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.75.
  • Yu et al. [2018] Fisher Yu, Wenqi Xian, Yingying Chen, Fangchen Liu, Mike Liao, Vashisht Madhavan, and Trevor Darrell. BDD100K: A diverse driving video database with scalable annotation tooling. CoRR, abs/1805.04687, 2018.
  • Yu et al. [2016a] Jiahui Yu, Yuning Jiang, Zhangyang Wang, Zhimin Cao, and Thomas S. Huang. Unitbox: An advanced object detection network. In Proceedings of the 2016 ACM Conference on Multimedia Conference, MM 2016, Amsterdam, The Netherlands, October 15-19, 2016, pages 516–520, 2016a.
  • Yu et al. [2016b] Ruichi Yu, Xi Chen, Vlad I. Morariu, and Larry S. Davis. The Role of Context Selection in Object Detection. In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, September 2016b.
  • Yuan et al. [2017] Yuan Yuan, Xiaodan Liang, Xiaolong Wang, Dit-Yan Yeung, and Abhinav Gupta. Temporal dynamic graph lstm for action-driven video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, Oct 2017.
  • Yucel et al. [2018] Mehmet Kerim Yucel, Yunus Can Bilge, Oguzhan Oguz, Nazli Ikizler-Cinbis, Pinar Duygulu, and Ramazan Gokberk Cinbis. Wildest faces: Face detection and recognition in violent settings. CoRR, abs/1805.07566, 2018.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. BMVA Press, 2016.
  • Zagoruyko et al. [2016] Sergey Zagoruyko, Adam Lerer, Tsung-Yi Lin, Pedro Oliveira Pinheiro, Sam Gross, Soumith Chintala, and Piotr Dollár. A multipath network for object detection. In Richard C. Wilson, Edwin R. Hancock, and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.
  • Zeiler [2012] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.
  • Zeiler and Fergus [2014a] Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, pages 818–833, 2014a.
  • Zeiler and Fergus [2014b] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, pages 818–833, 2014b.
  • Zeng et al. [2016] Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, and Xiaogang Wang. Gated Bi-directional CNN for Object Detection. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, October 2016.
  • Zeng et al. [2017] Xingyu Zeng, Wanli Ouyang, Junjie Yan, Hongsheng Li, Tong Xiao, Kun Wang, Yu Liu, Yucong Zhou, Bin Yang, Zhe Wang, et al. Crafting gbd-net for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • Zhai et al. [2018] Yao Zhai, Jingjing Fu, Yan Lu, and Houqiang Li. Feature selective networks for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018.
  • Zhang and Zhang [2010] Cha Zhang and Zhengyou Zhang. A survey of recent advances in face detection. Technical report, Tech. rep., Microsoft Research, 2010.
  • Zhang et al. [2018a] Dongqing Zhang, Jiaolong Yang, Dongqiangzi Ye, and Gang Hua. Lq-nets: Learned quantization for highly accurate and compact deep neural networks. CoRR, abs/1807.10029, 2018a.
  • Zhang et al. [2016a] Liliang Zhang, Liang Lin, Xiaodan Liang, and Kaiming He. Is faster R-CNN doing well for pedestrian detection? In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors, Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, volume 9906 of Lecture Notes in Computer Science, pages 443–457. Springer, 2016a.
  • Zhang et al. [2017a] Shanshan Zhang, Rodrigo Benenson, and Bernt Schiele. Citypersons: A diverse dataset for pedestrian detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 4457–4465. IEEE Computer Society, 2017a.
  • Zhang et al. [2018b] Shanshan Zhang, Jian Yang, and Bernt Schiele. Occluded Pedestrian Detection Through Guided Attention in CNNs. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, page 9, 2018b.
  • Zhang et al. [2017b] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z. Li. S$^3$FD: Single Shot Scale-invariant Face Detector. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2017b.
  • Zhang et al. [2018c] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Occlusion-aware R-CNN: detecting pedestrians in a crowd. CoRR, abs/1807.08407, 2018c.
  • Zhang et al. [2018d] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z. Li. Single-shot refinement neural network for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, 2018d.
  • Zhang et al. [2017c] Xiangyu Zhang, Xinyu Zhou, Mengxiao Lin, and Jian Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. CoRR, abs/1707.01083, 2017c.
  • Zhang et al. [2018e] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S. Huang. Adversarial complementary learning for weakly supervised object localization. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018e.
  • Zhang et al. [2018f] Xiaopeng Zhang, Jiashi Feng, Hongkai Xiong, and Qi Tian. Zigzag learning for weakly supervised object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018f.
  • Zhang et al. [2018g] Yongqiang Zhang, Yancheng Bai, Mingli Ding, Yongqiang Li, and Bernard Ghanem. W2f: A weakly-supervised to fully-supervised framework for object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018g.
  • Zhang et al. [2015] Yuting Zhang, Kihyuk Sohn, R. Villegas, Gang Pan, and Honglak Lee. Improving object detection with deep convolutional networks via Bayesian optimization and structured prediction. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Zhang et al. [2016b] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. Multi-oriented text detection with fully convolutional networks. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 4159–4167. IEEE Computer Society, 2016b.
  • Zhang et al. [2018h] Zhishuai Zhang, Siyuan Qiao, Cihang Xie, Wei Shen, Bo Wang, and Alan L. Yuille. Single-shot object detection with enriched semantics. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018h.
  • Zhao et al. [2018a] Fan Zhao, Yao Yang, Hai-yan Zhang, Lin-lin Yang, and Lin Zhang. Sign text detection in street view images using an integrated feature. Multimedia Tools and Applications, April 2018a.
  • Zhao et al. [2018b] Xiangyun Zhao, Shuang Liang, and Yichen Wei. Pseudo mask augmented object detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, June 2018b.
  • Zhao et al. [2018c] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. CoRR, abs/1807.05511, 2018c.
  • Zheng et al. [2018] Liwen Zheng, Canmiao Fu, and Yong Zhao. Extend the shallow part of single shot multibox detector via convolutional neural network. CoRR, abs/1801.05918, 2018.
  • Zhou et al. [2015] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene cnns. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015.
  • Zhou et al. [2014] Bolei Zhou, Àgata Lapedriza, Jianxiong Xiao, Antonio Torralba, and Aude Oliva. Learning deep features for scene recognition using places database. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, 2014.
  • Zhou et al. [2016a] Bolei Zhou, Aditya Khosla, Àgata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2921–2929. IEEE Computer Society, 2016a.
  • Zhou et al. [2018] Peng Zhou, Bingbing Ni, Cong Geng, Jianguo Hu, and Yi Xu. Scale-Transferrable Object Detection. In Computer Vision and Pattern Recognition (CVPR), 2018 IEEE Conference on, page 10, 2018.
  • Zhou et al. [2016b] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR, abs/1606.06160, 2016b.
  • Zhou et al. [2017] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. East: An efficient and accurate scene text detector. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, July 2017.
  • Zhou and Tuzel [2017] Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. CoRR, abs/1711.06396, 2017.
  • Zhu et al. [2015a] Haigang Zhu, Xiaogang Chen, Weiqun Dai, Kun Fu, Qixiang Ye, and Jianbin Jiao. Orientation robust object detection in aerial images using deep convolutional neural network. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 3735–3739, 2015a.
  • Zhu et al. [2017a] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 2242–2251. IEEE Computer Society, 2017a.
  • Zhu et al. [2018a] Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, and Qinghua Hu. Vision meets drones: A challenge. CoRR, abs/1804.07437, 2018a.
  • Zhu et al. [2018b] Pengkai Zhu, Hanxiao Wang, Tolga Bolukbasi, and Venkatesh Saligrama. Zero-shot detection. CoRR, abs/1803.07113, 2018b.
  • Zhu and Ramanan [2012] Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, June 16-21, 2012, pages 2879–2886. IEEE Computer Society, 2012.
  • Zhu et al. [2017b] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-guided feature aggregation for video object detection. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 408–417. IEEE Computer Society, 2017b.
  • Zhu et al. [2017c] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei. Deep feature flow for video recognition. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, volume 2, page 7, 2017c.
  • Zhu et al. [2018c] Xizhou Zhu, Jifeng Dai, Xingchi Zhu, Yichen Wei, and Lu Yuan. Towards high performance video object detection for mobiles. CoRR, abs/1804.05830, 2018c.
  • Zhu et al. [2015b] Yukun Zhu, R. Urtasun, R. Salakhutdinov, and S. Fidler. segDeepM: Exploiting segmentation and context in deep neural networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 2015b.
  • Zhu et al. [2016] Zhe Zhu, Dun Liang, Songhai Zhang, Xiaolei Huang, Baoli Li, and Shimin Hu. Traffic-sign detection and classification in the wild. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pages 2110–2118, 2016.
  • Zitnick and Dollar [2014] C. L. Zitnick and P. Dollar. Edge boxes: Locating object proposals from edges. In Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, 2014.
  • Zuo et al. [2016] Zhen Zuo, Bing Shuai, Gang Wang, Xiao Liu, Xingxing Wang, Bing Wang, and Yushi Chen. Learning Contextual Dependence With Convolutional Hierarchical Recurrent Neural Networks. IEEE Transactions on Image Processing, 2016.