Deep Learning for Generic Object Detection: A Survey

09/06/2018 ∙ by Li Liu, et al. ∙ University of Waterloo ∙ The University of Sydney ∙ University of Oulu ∙ The Chinese University of Hong Kong

Generic object detection, aiming at locating object instances from a large number of predefined categories in natural images, is one of the most fundamental and challenging problems in computer vision. Deep learning techniques have emerged in recent years as powerful methods for learning feature representations directly from data, and have led to remarkable breakthroughs in the field of generic object detection. Given this time of rapid evolution, the goal of this paper is to provide a comprehensive survey of the recent achievements in this field brought by deep learning techniques. More than 250 key contributions are included in this survey, covering many aspects of generic object detection research: leading detection frameworks and fundamental subproblems including object feature representation, object proposal generation, context information modeling and training strategies; evaluation issues, specifically benchmark datasets, evaluation metrics, and state of the art performance. We finish by identifying promising directions for future research.




1 Introduction

As a longstanding, fundamental and challenging problem in computer vision, object detection has been an active area of research for several decades. The goal of object detection is to determine whether or not there are any instances of objects from the given categories (such as humans, cars, bicycles, dogs and cats) in some given image and, if present, to return the spatial location and extent of each object instance (e.g., via a bounding box Everingham et al. (2010); Russakovsky et al. (2015)). As the cornerstone of image understanding and computer vision, object detection forms the basis for solving more complex or high level vision tasks such as segmentation, scene understanding, object tracking, image captioning, event detection, and activity recognition. Object detection has a wide range of applications in many areas of artificial intelligence and information technologies, including robot vision, consumer electronics, security, autonomous driving, human computer interaction, content based image retrieval, intelligent video surveillance, and augmented reality.

Recently, deep learning techniques Hinton and Salakhutdinov (2006); LeCun et al. (2015) have emerged as powerful methods for learning feature representations automatically from data. In particular, these techniques have provided significant improvement for object detection, a problem which has attracted enormous attention in the last five years, even though it has been studied for decades by psychophysicists, neuroscientists, and engineers.

Figure 1: Recent evolution of object detection performance. We can observe significant performance (mean average precision) improvement since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis. (a) Results on the PASCAL VOC datasets: Detection results of winning entries in the VOC2007-2012 competitions (using only provided training data). (b) Top object detection competition results in ILSVRC2013-2017 (using only provided training data).
Figure 2: Milestones of object detection and recognition, including feature representations Csurka et al. (2004); Dalal and Triggs (2005); He et al. (2016); Krizhevsky et al. (2012a); Lazebnik et al. (2006); Lowe (1999, 2004); Perronnin et al. (2010); Simonyan and Zisserman (2015); Sivic and Zisserman (2003); Szegedy et al. (2015); Viola and Jones (2001); Wang et al. (2009), detection frameworks Felzenszwalb et al. (2010); Girshick et al. (2014); Sermanet et al. (2014); Uijlings et al. (2013); Viola and Jones (2001), and datasets Everingham et al. (2010); Lin et al. (2014); Russakovsky et al. (2015). The time period up to 2012 is dominated by handcrafted features. We see a turning point in 2012 with the development of DCNNs for image classification by Krizhevsky et al. Krizhevsky et al. (2012a). Most listed methods are highly cited and won one of the major ICCV or CVPR prizes. See Section 2.3 for details.

Object detection can be grouped into one of two types Grauman and Leibe (2011); Zhang et al. (2013): detection of specific instances and detection of specific categories. The first type aims at detecting instances of a particular object (such as Donald Trump’s face, the Pentagon building, or my dog Penny), whereas the goal of the second type is to detect different instances of predefined object categories (for example humans, cars, bicycles, and dogs). Historically, much of the effort in the field of object detection has focused on the detection of a single category (such as faces and pedestrians) or a few specific categories. In contrast, in the past several years the research community has started moving towards the challenging goal of building general purpose object detection systems whose breadth of object detection ability rivals that of humans.

In 2012, Krizhevsky et al. Krizhevsky et al. (2012a) proposed a Deep Convolutional Neural Network (DCNN) called AlexNet which achieved record breaking image classification accuracy in the Large Scale Visual Recognition Challenge (ILSVRC) Russakovsky et al. (2015). Since that time the research focus in many computer vision application areas has been on deep learning methods. A great many approaches based on deep learning have sprung up in generic object detection Girshick et al. (2014); He et al. (2014); Girshick (2015); Sermanet et al. (2014); Ren et al. (2017a) and tremendous progress has been achieved, yet we are unaware of comprehensive surveys of the subject during the past five years. Given this time of rapid evolution, the focus of this paper is specifically that of generic object detection by deep learning, in order to gain a clearer panorama of generic object detection.

The generic object detection problem itself is defined as follows: given an arbitrary image, determine whether there are any instances of semantic objects from predefined categories and, if present, return the spatial location and extent of each instance. Object refers to a material thing that can be seen and touched. Although largely synonymous with object class detection, generic object detection places a greater emphasis on approaches aimed at detecting a broad range of natural categories, as opposed to object instances or specialized categories (e.g., faces, pedestrians, or cars). Generic object detection has received significant attention, as demonstrated by recent progress on object detection competitions such as the PASCAL VOC detection challenge from 2006 to 2012 Everingham et al. (2010, 2015), the ILSVRC large scale detection challenge since 2013 Russakovsky et al. (2015), and the MS COCO large scale detection challenge since 2015 Lin et al. (2014). The striking improvement in recent years is illustrated in Fig. 1.


| No. | Survey Title | Ref. | Year | Published | Content |
|---|---|---|---|---|---|
| 1 | Monocular Pedestrian Detection: Survey and Experiments | Enzweiler and Gavrila (2009) | 2009 | PAMI | Evaluating three detectors with additional experiments integrating the detectors into full systems |
| 2 | Survey of Pedestrian Detection for Advanced Driver Assistance Systems | Geronimo et al. (2010) | 2010 | PAMI | A survey of pedestrian detection for advanced driver assistance systems |
| 3 | Pedestrian Detection: An Evaluation of the State of The Art | Dollar et al. (2012) | 2012 | PAMI | Focus on a more thorough and detailed evaluation of detectors in individual monocular images |
| 4 | Detecting Faces in Images: A Survey | Yang et al. (2002) | 2002 | PAMI | First survey of face detection from a single image |
| 5 | A Survey on Face Detection in the Wild: Past, Present and Future | Zafeiriou et al. (2015) | 2015 | CVIU | A survey of face detection in the wild since 2000 |
| 6 | On Road Vehicle Detection: A Review | Sun et al. (2006) | 2006 | PAMI | A review of vision based onroad vehicle detection systems where the camera is mounted on the vehicle |
| 7 | Text Detection and Recognition in Imagery: A Survey | Ye and Doermann (2015) | 2015 | PAMI | A survey of text detection and recognition in color imagery |
| 8 | Toward Category Level Object Recognition | Ponce et al. (2007) | 2007 | Book | Collects a series of representative papers on object categorization, detection, and segmentation |
| 9 | The Evolution of Object Categorization and the Challenge of Image Abstraction | Dickinson et al. (2009) | 2009 | Book | A trace of the evolution of object categorization in the last four decades |
| 10 | Context based Object Categorization: A Critical Survey | Galleguillos and Belongie (2010) | 2010 | CVIU | A review of different ways of using contextual information for object categorization |
| 11 | 50 Years of Object Recognition: Directions Forward | Andreopoulos and Tsotsos (2013) | 2013 | CVIU | A review of the evolution of object recognition systems in the last five decades |
| 12 | Visual Object Recognition | Grauman and Leibe (2011) | 2011 | Tutorial | Covers fundamental and time tested approaches for both instance and category object recognition techniques |
| 13 | Object Class Detection: A Survey | Zhang et al. (2013) | 2013 | ACM CS | First survey of generic object detection methods before 2011 |
| 14 | Feature Representation for Statistical Learning based Object Detection: A Review | Li et al. (2015b) | 2015 | PR | A survey on feature representation methods in statistical learning based object detection, including handcrafted and a few deep learning based features |
| 15 | Salient Object Detection: A Survey | Borji et al. (2014) | 2014 | arXiv | A survey of salient object detection |
| 16 | Representation Learning: A Review and New Perspectives | Bengio et al. (2013) | 2013 | PAMI | A review of unsupervised feature learning and deep learning, covering advances in probabilistic models, autoencoders, manifold learning, and deep networks |
| 17 | Deep Learning | LeCun et al. (2015) | 2015 | Nature | An introduction to deep learning and its typical applications |
| 18 | A Survey on Deep Learning in Medical Image Analysis | Litjens et al. (2017) | 2017 | MIA | A survey of deep learning for image classification, object detection, segmentation, registration, and others in medical image analysis |
| 19 | Recent Advances in Convolutional Neural Networks | Gu et al. (2017) | 2017 | PR | A broad survey of the recent advances in CNN and its applications in computer vision, speech and natural language processing |
| 20 | Tutorial: Tools for Efficient Object Detection | | 2015 | ICCV15 | A short course for object detection only covering recent milestones |
| 21 | Tutorial: Deep Learning for Objects and Scenes | | 2017 | CVPR17 | A high level summary of recent work on deep learning for visual recognition of objects and scenes |
| 22 | Tutorial: Instance Level Recognition | | 2017 | ICCV17 | A short course of recent advances on instance level recognition, including object detection, instance segmentation and human pose prediction |
| 23 | Tutorial: Visual Recognition and Beyond | | 2018 | CVPR18 | This tutorial covers methods and principles behind image classification, object detection, instance segmentation, and semantic segmentation |
| 24 | Deep Learning for Generic Object Detection | | 2018 | Ours | A comprehensive survey of deep learning for generic object detection |


Table 1: Summarization of a number of related surveys since 2000.

1.1 Comparison with Previous Reviews

A number of notable object detection surveys have been published, as summarized in Table 1. These include many excellent surveys on the problem of specific object detection, such as pedestrian detection Enzweiler and Gavrila (2009); Geronimo et al. (2010); Dollar et al. (2012), face detection Yang et al. (2002); Zafeiriou et al. (2015), vehicle detection Sun et al. (2006) and text detection Ye and Doermann (2015). Important contributions were also made by Ponce et al. Ponce et al. (2007), Dickinson Dickinson et al. (2009), Galleguillos and Belongie Galleguillos and Belongie (2010), Grauman and Leibe Grauman and Leibe (2011), and Andreopoulos and Tsotsos Andreopoulos and Tsotsos (2013).

There are few recent surveys focusing directly on the problem of generic object detection, except for the work by Zhang et al. Zhang et al. (2013) who conducted a survey on the topic of object class detection. However, the research reviewed in Grauman and Leibe (2011), Andreopoulos and Tsotsos (2013) and Zhang et al. (2013) is mostly that preceding 2012, and therefore before the more recent striking success of deep learning and related methods.

Deep learning allows computational models consisting of multiple hierarchical layers to learn fantastically complex, subtle, and abstract representations. In the past several years, deep learning has driven significant progress in a broad range of problems, such as visual recognition, object detection, speech recognition, natural language processing, medical image analysis, drug discovery and genomics. Among different types of deep neural networks, Deep Convolutional Neural Networks (DCNN) LeCun et al. (1998); Krizhevsky et al. (2012a); LeCun et al. (2015) have brought about breakthroughs in processing images, video, speech and audio. Given this time of rapid evolution, researchers have recently published surveys on different aspects of deep learning, including that of Bengio et al. Bengio et al. (2013), LeCun et al. LeCun et al. (2015), Litjens et al. Litjens et al. (2017), Gu et al. Gu et al. (2017), and more recently in tutorials at ICCV and CVPR.

Although many deep learning based methods have been proposed for object detection, we are unaware of comprehensive surveys of the subject during the past five years, the focus of this survey. A thorough review and summarization of existing work is essential for further progress in object detection, particularly for researchers wishing to enter the field. Extensive work on CNNs for specific object detection, such as face detection Li et al. (2015a); Zhang et al. (2016a); Hu and Ramanan (2017), pedestrian detection Zhang et al. (2016b); Hosang et al. (2015), vehicle detection Zhou et al. (2016b) and traffic sign detection Zhu et al. (2016b) will not be included in our discussion.

Figure 3: Recognition problems related to generic object detection. (a) Image level object classification, (b) bounding box level generic object detection, (c) pixel-wise semantic segmentation, (d) instance level semantic segmentation.

1.2 Categorization Methodology

The number of papers on generic object detection published since deep learning entered the field is breathtaking; so many, in fact, that compiling a comprehensive review of the state of the art exceeds what is possible in a paper like this one. It is necessary to establish some selection criteria, e.g. completeness of a paper and importance to the field. We have preferred to include top journal and conference papers. Due to limitations on space and our knowledge, we sincerely apologize to those authors whose works are not included in this paper. For surveys of efforts in related topics, readers are referred to the articles in Table 1. This survey mainly focuses on the major progress made in the last five years; but for completeness and better readability, some early related works are also included. We restrict ourselves to still pictures and leave video object detection as a separate topic.

The remainder of this paper is organized as follows. Related background, including the problem, key challenges and the progress made during the last two decades, is summarized in Section 2. We describe the milestone object detectors in Section 3. Fundamental subproblems and relevant issues involved in designing object detectors are presented in Section 4. A summarization of popular databases and state of the art performance is given in Section 5. We conclude the paper with a discussion of several promising directions in Section 6.

2 Background

2.1 The Problem

Generic object detection (i.e., generic object category detection), also called object class detection Zhang et al. (2013) or object category detection, is defined as follows. Given an image, the goal of generic object detection is to determine whether or not there are instances of objects from many predefined categories and, if present, to return the spatial location and extent of each instance. It places greater emphasis on detecting a broad range of natural categories, as opposed to specific object category detection where only a narrower predefined category of interest (e.g., faces, pedestrians, or cars) may be present. Although thousands of objects occupy the visual world in which we live, currently the research community is primarily interested in the localization of highly structured objects (e.g., cars, faces, bicycles and airplanes) and articulated objects (e.g., humans, cows and horses) rather than unstructured scenes (such as sky, grass and cloud).

Typically, the spatial location and extent of an object can be defined coarsely using a bounding box, i.e., an axis-aligned rectangle tightly bounding the object Everingham et al. (2010); Russakovsky et al. (2015), a precise pixel-wise segmentation mask, or a closed boundary Russell et al. (2008); Lin et al. (2014), as illustrated in Fig. 3. To the best of our knowledge, in the current literature, bounding boxes are more widely used for evaluating generic object detection algorithms Everingham et al. (2010); Russakovsky et al. (2015), and will be the approach we adopt in this survey as well. However, the community is moving towards deep scene understanding (from image level object classification to single object localization, to generic object detection, and to pixel-wise object segmentation), hence it is anticipated that future challenges will be at the pixel level Lin et al. (2014).
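To make the difference between a bounding box and a pixel-wise mask concrete, the following minimal Python sketch derives the tightest axis-aligned box from a binary mask. The function names `mask_to_box` and `box_area`, the inclusive coordinate convention, and the toy mask are illustrative assumptions, not part of any benchmark toolkit.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max) in inclusive pixel coordinates (assumed convention)
Box = Tuple[int, int, int, int]


def mask_to_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest axis-aligned box around the nonzero pixels of a binary mask."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None  # empty mask: no object present
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box: Box) -> int:
    """Number of pixels covered by an inclusive box."""
    x0, y0, x1, y1 = box
    return (x1 - x0 + 1) * (y1 - y0 + 1)


# Toy L-shaped object: the mask marks only object pixels,
# while the box also encloses background pixels.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0],
]
box = mask_to_box(mask)
mask_pixels = sum(map(sum, mask))
print(box, box_area(box), mask_pixels)
```

For this toy object the box covers more pixels than the mask, which is the sense in which a bounding box is only a coarse description of spatial extent.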

There are many problems closely related to that of generic object detection. (To the best of our knowledge, there is no universal agreement in the literature on the definitions of the various vision subtasks. Frequently encountered terms such as detection, localization, recognition, classification, categorization, verification, identification, annotation, labeling and understanding are often defined differently Andreopoulos and Tsotsos (2013).) The goal of object classification or object categorization (Fig. 3 (a)) is to assess the presence of objects from a given number of object classes in an image; i.e., assigning one or more object class labels to a given image, determining presence without the need of location. The additional requirement to locate the instances in an image makes detection a more challenging task than classification. The object recognition problem denotes the more general problem of finding and identifying objects of interest present in an image, subsuming the problems of object detection and object classification Everingham et al. (2010); Russakovsky et al. (2015); Opelt et al. (2006); Andreopoulos and Tsotsos (2013). Generic object detection is closely related to semantic image segmentation (Fig. 3 (c)), which aims to assign each pixel in an image to a semantic class label. Object instance segmentation (Fig. 3 (d)) aims at distinguishing different instances of the same object class, while semantic segmentation does not distinguish different instances. Generic object detection also distinguishes different instances of the same object. Unlike segmentation, object detection includes background regions in the bounding box, which might be useful for analysis.
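The relationship between these four tasks can be sketched by starting from the richest output, instance masks, and deriving the others from it. The Python below is only an illustration under assumed toy data structures (`Instance`, `to_semantic`, `to_detections` and `to_labels` are hypothetical names), not an evaluation protocol.

```python
from typing import List, Set, Tuple

Mask = List[List[int]]        # binary mask, 1 = pixel belongs to the instance
Instance = Tuple[str, Mask]   # (class label, non-empty instance mask)
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def to_semantic(instances: List[Instance], height: int, width: int) -> List[List[str]]:
    """Semantic segmentation: one class label per pixel, instances are merged."""
    out = [["background"] * width for _ in range(height)]
    for label, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    out[y][x] = label
    return out


def to_detections(instances: List[Instance]) -> List[Tuple[str, Box]]:
    """Detection: one labelled bounding box per instance (masks assumed non-empty)."""
    dets = []
    for label, mask in instances:
        pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
        xs, ys = zip(*pts)
        dets.append((label, (min(xs), min(ys), max(xs), max(ys))))
    return dets


def to_labels(instances: List[Instance]) -> Set[str]:
    """Classification: which classes are present, with no location information."""
    return {label for label, _ in instances}


# Two dogs in a 2 x 4 image.
dog_a = [[1, 1, 0, 0], [1, 1, 0, 0]]
dog_b = [[0, 0, 0, 1], [0, 0, 0, 1]]
instances = [("dog", dog_a), ("dog", dog_b)]

print(to_labels(instances))          # a single class label
print(to_detections(instances))      # two separate boxes
print(to_semantic(instances, 2, 4))  # both dogs share the label "dog"
```

In this toy case detection and instance segmentation keep the two dogs apart, whereas semantic segmentation and classification do not.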

Figure 4: Summary of challenges in generic object detection.
Figure 5: Changes in imaged appearance of the same class with variations in imaging conditions (a-g). There is an astonishing variation in what is meant to be a single object class (h). In contrast, the four images in (i) appear very similar, but in fact are from four different object classes. Images from ImageNet Russakovsky et al. (2015) and MS COCO Lin et al. (2014).

2.2 Main Challenges

Generic object detection aims at localizing and recognizing a broad range of natural object categories. The ideal goal of generic object detection is to develop general-purpose object detection algorithms achieving two competing goals: high quality/accuracy and high efficiency, as illustrated in Fig. 4. As illustrated in Fig. 5, high quality detection has to accurately localize and recognize objects in images or video frames, such that the large variety of object categories in the real world can be distinguished (i.e., high distinctiveness), and that object instances from the same category, subject to intraclass appearance variations, can be localized and recognized (i.e., high robustness). High efficiency requires the entire detection task to run at a sufficiently high frame rate with acceptable memory and storage usage. Despite several decades of research and significant progress, arguably the combined goals of accuracy and efficiency have not yet been met.

2.2.1 Accuracy related challenges

For accuracy, the challenge stems from 1) the vast range of intraclass variations and 2) the huge number of object categories.

We begin with intraclass variations, which can be divided into two types: intrinsic factors, and imaging conditions. For the former, each object category can have many different object instances, possibly varying in one or more of color, texture, material, shape, and size, such as the “chair” category shown in Fig. 5 (h). Even in a more narrowly defined class, such as human or horse, object instances can appear in different poses, with nonrigid deformations and different clothes.

For the latter, the variations are caused by changes in imaging conditions and unconstrained environments which may have dramatic impacts on object appearance. In particular, different instances, or even the same instance, can be captured subject to a wide range of differences: different times, locations, weather conditions, cameras, backgrounds, illuminations, viewpoints, and viewing distances. All of these conditions produce significant variations in object appearance, such as illumination, pose, scale, occlusion, background clutter, shading, blur and motion, with examples illustrated in Fig. 5 (a-g). Further challenges may be added by digitization artifacts, noise corruption, poor resolution, and filtering distortions.

In addition to intraclass variations, the large number of object categories, on the order of 10^4 to 10^5, demands great discrimination power of the detector to distinguish between subtly different inter-class variations, as illustrated in Fig. 5 (i). In practice, current detectors focus mainly on structured object categories, such as the 20, 200 and 91 object classes in PASCAL VOC Everingham et al. (2010), ILSVRC Russakovsky et al. (2015) and MS COCO Lin et al. (2014) respectively. Clearly, the number of object categories under consideration in existing benchmark datasets is much smaller than the number that can be recognized by humans.

2.2.2 Efficiency related challenges

The exponentially increasing number of images calls for efficient and scalable detectors. The prevalence of social media networks and mobile/wearable devices has led to increasing demands for analyzing visual data. However mobile/wearable devices have limited computational capabilities and storage space, in which case an efficient object detector is critical.

For efficiency, the challenges stem from the need to localize and recognize all object instances of a very large number of object categories, and the very large number of possible locations and scales within a single image, as shown by the example in Fig. 5 (c). A further challenge is that of scalability: a detector should be able to handle unseen objects, unknown situations, and rapidly increasing image data. For example, the scale of ILSVRC Russakovsky et al. (2015) is already imposing limits on the manual annotations that are feasible to obtain. As the number of images and the number of categories grow even larger, it may become impossible to annotate them manually, forcing algorithms to rely more on weakly supervised training data.

2.3 Progress in the Past Two Decades

Early research on object recognition was based on template matching techniques and simple part based models Fischler and Elschlager (1973), focusing on specific objects whose spatial layouts are roughly rigid, such as faces. Before 1990 the leading paradigm of object recognition was based on geometric representations Mundy (2006); Ponce et al. (2007), with the focus later moving away from geometry and prior models towards the use of statistical classifiers (such as Neural Networks Rowley et al. (1998), SVM Osuna et al. (1997) and Adaboost Viola and Jones (2001); Xiao et al. (2003)) based on appearance features Murase and Nayar (1995a); Schmid and Mohr (1997). This successful family of object detectors set the stage for most subsequent research in this field.

In the late 1990s and early 2000s object detection research made notable strides. The milestones of object detection in recent years are presented in Fig. 2, in which two main eras (SIFT vs. DCNN) are highlighted. The appearance features moved from global representations Murase and Nayar (1995b); Swain and Ballard (1991); Turk and Pentland (1991) to local representations that are invariant to changes in translation, scale, rotation, illumination, viewpoint and occlusion. Handcrafted local invariant features gained tremendous popularity, starting from the Scale Invariant Feature Transform (SIFT) feature Lowe (1999), and the progress on various visual recognition tasks was based substantially on the use of local descriptors Mikolajczyk and Schmid (2005) such as Haar like features Viola and Jones (2001), SIFT Lowe (2004), Shape Contexts Belongie et al. (2002), Histogram of Oriented Gradients (HOG) Dalal and Triggs (2005), Local Binary Patterns (LBP) Ojala et al. (2002), and region covariance Tuzel et al. (2006). These local features are usually aggregated by simple concatenation or feature pooling encoders such as the influential and efficient Bag of Visual Words approach introduced by Sivic and Zisserman Sivic and Zisserman (2003) and Csurka et al. Csurka et al. (2004), Spatial Pyramid Matching (SPM) of BoW models Lazebnik et al. (2006), and Fisher Vectors Perronnin et al. (2010).

For years, the multistage handtuned pipelines of handcrafted local descriptors and discriminative classifiers dominated a variety of domains in computer vision, including object detection, until the significant turning point in 2012 when Deep Convolutional Neural Networks (DCNN) Krizhevsky et al. (2012a) achieved their record breaking results in image classification. The successful application of DCNNs to image classification Krizhevsky et al. (2012a) transferred to object detection, resulting in the milestone Region based CNN (RCNN) detector of Girshick et al. Girshick et al. (2014). Since then, the field of object detection has dramatically evolved and many deep learning based approaches have been developed, thanks in part to available GPU computing resources and the availability of large scale datasets and challenges such as ImageNet Deng et al. (2009); Russakovsky et al. (2015) and MS COCO Lin et al. (2014). With these new datasets, researchers can target more realistic and complex problems when detecting objects of hundreds of categories from images with large intraclass variations and interclass similarities Lin et al. (2014); Russakovsky et al. (2015).

The research community has started moving towards the challenging goal of building general purpose object detection systems whose ability to detect many object categories matches that of humans. This is a major challenge: according to cognitive scientists, human beings can identify around 3,000 entry level categories and 30,000 visual categories overall, and the number of categories distinguishable with domain expertise may be on the order of 10^5 Biederman (1987). Despite the remarkable progress of the past years, designing an accurate, robust, efficient detection and recognition system that approaches human-level performance on 10^4 to 10^5 categories is undoubtedly an open problem.

3 Frameworks

There has been steady progress in object feature representations and classifiers for recognition, as evidenced by the dramatic change from handcrafted features Viola and Jones (2001); Dalal and Triggs (2005); Felzenszwalb et al. (2008); Harzallah et al. (2009); Vedaldi et al. (2009) to learned DCNN features Girshick et al. (2014); Ouyang et al. (2015); Girshick (2015); Ren et al. (2015); Dai et al. (2016c).

In contrast, the basic “sliding window” strategy Dalal and Triggs (2005); Felzenszwalb et al. (2010, 2008) for localization remains the mainstream approach, although with some alternative endeavors in Lampert et al. (2008); Uijlings et al. (2013). However, the number of windows is large and grows quadratically with the number of pixels, and the need to search over multiple scales and aspect ratios further increases the search space. This huge search space results in high computational complexity. Therefore, the design of efficient and effective detection frameworks plays a key role. Commonly adopted strategies include cascading, sharing feature computation, and reducing per-window computation.
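The size of this search space can be made tangible with a short counting sketch in Python. It assumes that a "window" is any axis-aligned box with integer pixel corners, and the base size, scales, aspect ratios and stride in `count_sliding_windows` are arbitrary illustrative values rather than settings taken from any detector.

```python
def count_all_boxes(width: int, height: int) -> int:
    """Number of distinct axis-aligned pixel boxes in a width x height image."""
    # Pick a (left, right) column pair and a (top, bottom) row pair;
    # the two members of a pair may coincide (a one-pixel-wide box).
    return (width * (width + 1) // 2) * (height * (height + 1) // 2)


def count_sliding_windows(width: int, height: int, base: int = 64,
                          scales=(1.0, 2.0, 4.0),
                          aspect_ratios=(0.5, 1.0, 2.0),
                          stride: int = 8) -> int:
    """Windows visited by a strided scan over several scales and aspect ratios."""
    total = 0
    for s in scales:
        for r in aspect_ratios:  # r = window width / window height
            w = round(base * s * r ** 0.5)
            h = round(base * s / r ** 0.5)
            if w > width or h > height:
                continue  # this window shape does not fit in the image
            total += ((width - w) // stride + 1) * ((height - h) // stride + 1)
    return total


for side in (100, 200, 400):
    pixels = side * side
    print(pixels, count_all_boxes(side, side), count_sliding_windows(side, side))
```

Doubling the side length multiplies the pixel count by 4 but the number of possible boxes by roughly 16, which is the quadratic growth referred to above. A strided scan over a few scales and aspect ratios visits far fewer windows, at the price of coarser localization.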

In this section, we review the milestone detection frameworks present in generic object detection since deep learning entered the field, as listed in Fig. 6 and summarized in Table 10. Nearly all detectors proposed over the last several years are based on one of these milestone detectors, attempting to improve on one or more aspects. Broadly these detectors can be organized into two main categories:

  1. Two stage detection framework, which includes a pre-processing step for region proposal, making the overall pipeline two stage.

  2. One stage detection framework, or region proposal free framework, which is a single unified method that does not separate out the detection proposal step, making the overall pipeline single stage.

Section 4 will build on the following by discussing fundamental subproblems involved in the detection framework in greater detail, including DCNN features, detection proposals, context modeling, bounding box regression and class imbalance handling.

Figure 6: Milestones in generic object detection based on the point in time of the first arXiv version.

3.1 Region Based (Two Stage Framework)

In a region based framework, category-independent region proposals are generated from an image, CNN Krizhevsky et al. (2012a) features are extracted from these regions, and then category-specific classifiers are used to determine the category labels of the proposals. As can be observed from Fig. 6, DetectorNet Szegedy et al. (2013), OverFeat Sermanet et al. (2014), MultiBox Erhan et al. (2014) and RCNN Girshick et al. (2014) independently and almost simultaneously proposed using CNNs for generic object detection.

Figure 7: Illustration of the milestone detecting framework RCNN Girshick et al. (2014, 2016) in great detail.
Figure 8: High level diagrams of the leading frameworks for generic object detection. The properties of these methods are summarized in Table 10.

RCNN: Inspired by the breakthrough image classification results obtained by CNN and the success of selective search in region proposal for hand-crafted features Uijlings et al. (2013), Girshick et al. were among the first to explore CNN for generic object detection and developed RCNN Girshick et al. (2014, 2016), which integrates AlexNet Krizhevsky et al. (2012a) with the region proposal method selective search Uijlings et al. (2013). As illustrated in Fig. 7, training in an RCNN framework consists of a multistage pipeline:

  1. Class-agnostic region proposals, which are candidate regions that might contain objects, are obtained via selective search Uijlings et al. (2013);

  2. Region proposals, which are cropped from the image and warped into the same size, are used as the input for finetuning a CNN model pre-trained using a large-scale dataset such as ImageNet;

  3. A set of class specific linear SVM classifiers are trained using fixed length features extracted with CNN, replacing the softmax classifier learned by finetuning.

  4. Bounding box regression is learned for each object class with CNN features.

In spite of achieving high object detection quality, RCNN has notable drawbacks Girshick (2015):

  1. Training is a multistage complex pipeline, which is inelegant, slow and hard to optimize because each individual stage must be trained separately.

  2. Numerous region proposals which provide only rough localization need to be externally detected.

  3. Training SVM classifiers and bounding box regression is expensive in both disk space and time, since CNN features are extracted independently from each region proposal in each image, posing great challenges for large-scale detection, especially with very deep CNNs such as AlexNet Krizhevsky et al. (2012a) and VGG Simonyan and Zisserman (2015).

  4. Testing is slow, since CNN features are extracted per object proposal in each testing image.

SPPNet: During testing, CNN feature extraction is the main bottleneck of the RCNN detection pipeline, which requires extracting CNN features from thousands of warped region proposals for each image. Noticing this disadvantage, He et al. He et al. (2014) introduced the traditional spatial pyramid pooling (SPP) Grauman and Darrell (2005); Lazebnik et al. (2006) into CNN architectures. Since convolutional layers accept inputs of arbitrary sizes, the requirement of fixed-size images in CNNs is due only to the Fully Connected (FC) layers. Building on this observation, He et al. added an SPP layer on top of the last convolutional (CONV) layer to obtain fixed-length features for the FC layers. With this SPPnet, RCNN obtains a significant speedup without sacrificing any detection quality, because it only needs to run the convolutional layers once on the entire test image to generate fixed-length features for region proposals of arbitrary size. While SPPnet accelerates RCNN evaluation by orders of magnitude, it does not result in a comparable speedup of detector training. Moreover, finetuning in SPPnet He et al. (2014) is unable to update the convolutional layers before the SPP layer, which limits the accuracy of very deep networks.
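The fixed-length property of the SPP layer can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a list of rows, max pooling in every bin, and pyramid levels of 1, 2 and 4 bins per side; the function name `spp_single_channel` and the bin-boundary rule are illustrative choices, not the implementation of He et al. (2014).

```python
def spp_single_channel(fmap, levels=(1, 2, 4)):
    """Max-pool a 2D feature map (list of rows) into a fixed-length vector.

    For each pyramid level n the map is split into an n x n grid of bins,
    so the output length is sum(n * n for n in levels), whatever the input size.
    """
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i: floor(i*h/n) .. ceil((i+1)*h/n), never empty.
            r0, r1 = (i * h) // n, ((i + 1) * h + n - 1) // n
            for j in range(n):
                c0, c1 = (j * w) // n, ((j + 1) * w + n - 1) // n
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


# Two feature maps of different spatial size give vectors of the same length.
small = [[r * 7 + c for c in range(7)] for r in range(5)]      # 5 x 7
large = [[r * 10 + c for c in range(10)] for r in range(13)]   # 13 x 10
print(len(spp_single_channel(small)), len(spp_single_channel(large)))
```

In a real network the same pooling would be applied to every channel of the last CONV layer and the results concatenated before the FC layers.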

Fast RCNN: Girshick Girshick (2015) proposed Fast RCNN, which addresses some of the disadvantages of RCNN and SPPnet while improving on their detection speed and quality. As illustrated in Fig. 8, Fast RCNN enables end-to-end detector training (when ignoring the process of region proposal generation) by developing a streamlined training process that simultaneously learns a softmax classifier and class-specific bounding box regression using a multitask loss, rather than training a softmax classifier, SVMs, and BBRs in three separate stages as in RCNN/SPPnet. Fast RCNN employs the idea of sharing the computation of convolution across region proposals, and adds a Region of Interest (RoI) pooling layer between the last CONV layer and the first FC layer to extract a fixed-length feature for each region proposal (i.e. RoI). Essentially, RoI pooling uses warping at the feature level to approximate warping at the image level. The features after the RoI pooling layer are fed into a sequence of FC layers that finally branch into two sibling output layers: softmax probabilities for object category prediction and class-specific bounding box regression offsets for proposal refinement. Compared to RCNN/SPPnet, Fast RCNN improves the efficiency considerably, typically 3 times faster in training and 10 times faster in testing. In summary, Fast RCNN has the attractive advantages of higher detection quality, a single-stage training process that updates all network layers, and no storage required for feature caching.
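The multitask loss mentioned above can be sketched for a single RoI as follows. The sketch assumes the form given in Girshick (2015): a log loss on the true class plus a smooth L1 loss on the four box offsets, with the localization term switched off for background RoIs (class index 0). The names `smooth_l1`, `multitask_loss` and the weight `lam` are illustrative.

```python
import math


def smooth_l1(x):
    """Smooth L1 penalty: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def multitask_loss(class_probs, true_class, pred_offsets, target_offsets, lam=1.0):
    """Classification log loss plus box regression loss for one RoI.

    class_probs    : softmax output over (background + object classes)
    true_class     : ground-truth class index, 0 meaning background (assumed)
    pred_offsets   : predicted box offsets for the true class (4 values)
    target_offsets : regression targets (4 values)
    """
    cls_loss = -math.log(class_probs[true_class])
    if true_class == 0:
        # Background RoIs have no ground-truth box, so only classify them.
        return cls_loss
    loc_loss = sum(smooth_l1(p - t) for p, t in zip(pred_offsets, target_offsets))
    return cls_loss + lam * loc_loss


print(multitask_loss([0.2, 0.8], 1, [0.1, 0.0, 0.3, -0.2], [0.0, 0.0, 0.0, 0.0]))
```

Because both terms are computed from the same forward pass, a single backward pass updates the shared layers for classification and localization at once, which is what removes the separate SVM and regressor training stages.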

Faster RCNN Ren et al. (2015, 2017a): Although Fast RCNN significantly sped up the detection process, it still relies on external region proposals, whose computation is exposed as the new bottleneck in Fast RCNN. Recent work has shown that CNNs have a remarkable ability to localize objects in CONV layers Zhou et al. (2015, 2016a); Cinbis et al. (2017); Oquab et al. (2015); Hariharan et al. (2016), an ability which is weakened in the FC layers. Therefore, selective search can be replaced by a CNN for producing region proposals. In the Faster RCNN framework, Ren et al. Ren et al. (2015, 2017a) proposed an efficient and accurate Region Proposal Network (RPN) for generating region proposals. They utilize a single network to accomplish the task of RPN for region proposal and Fast RCNN for region classification. In Faster RCNN, the RPN and Fast RCNN share a large number of convolutional layers. The features from the last shared convolutional layer are used for region proposal and region classification from separate branches. RPN first initializes reference boxes (i.e. the so called anchors) of different scales and aspect ratios at each CONV feature map location. Each anchor is mapped to a lower dimensional vector (such as 256 for ZF and 512 for VGG), which is fed into two sibling FC layers: an object category classification layer and a box regression layer. Different from Fast RCNN, the features used for regression in RPN have the same size. RPN shares CONV features with Fast RCNN, thus enabling highly efficient region proposal computation. RPN is, in fact, a kind of Fully Convolutional Network (FCN) Long et al. (2015); Shelhamer et al. (????); Faster RCNN is thus a purely CNN based framework without using handcrafted features. 
For the very deep VGG16 model Simonyan and Zisserman (2015), Faster RCNN can test at 5fps (including all steps) on a GPU, while achieving state of the art object detection accuracy on PASCAL VOC 2007 using 300 proposals per image. The initial Faster RCNN in Ren et al. (2015) contains several alternating training steps. This was then simplified by one step joint training in Ren et al. (2017a).
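The anchor mechanism of the RPN can be made concrete with a small sketch. It assumes a feature stride of 16 pixels, three scales and three aspect ratios (nine anchors per location), anchors centred on each feature map cell, and aspect ratio defined as height divided by width; these values and the function name `generate_anchors` are assumptions for illustration, not details taken from the text above.

```python
import math


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) boxes in image coordinates.

    One anchor is placed per (scale, ratio) pair at every feature map location.
    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Centre of the feature map cell, mapped back to image pixels.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    w, h = s / math.sqrt(r), s * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 locations, 9 anchors each
```

The RPN then scores each anchor for objectness and regresses offsets relative to it, so the anchors act as the fixed reference frame for both sibling output layers.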

Concurrent with the development of Faster RCNN, Lenc and Vedaldi Lenc and Vedaldi (2015) challenged the role of region proposal generation methods such as selective search, studied the role of region proposal generation in CNN based detectors, and found that CNNs contain sufficient geometric information for accurate object detection in the CONV rather than FC layers. They proved the possibility of building integrated, simpler, and faster object detectors that rely exclusively on CNNs, removing region proposal generation methods such as selective search.

RFCN (Region based Fully Convolutional Network): While Faster RCNN is an order of magnitude faster than Fast RCNN, the fact that the region-wise subnetwork still needs to be applied per RoI (several hundred RoIs per image) led Dai et al. Dai et al. (2016c) to propose the RFCN detector, which is fully convolutional (no hidden FC layers) with almost all computation shared over the entire image. As shown in Fig. 8, RFCN differs from Faster RCNN only in the RoI subnetwork. In Faster RCNN, the computation after the RoI pooling layer cannot be shared. A natural idea is to minimize the amount of computation that cannot be shared; hence Dai et al. Dai et al. (2016c) proposed using all CONV layers to construct a shared RoI subnetwork, with RoI crops taken from the last layer of CONV features prior to prediction. However, Dai et al. Dai et al. (2016c) found that this naive design has considerably inferior detection accuracy, conjecturing that deeper CONV layers are more sensitive to category semantics and less sensitive to translation, whereas object detection needs localization representations that respect translation variance. Based on this observation, Dai et al. Dai et al. (2016c) constructed a set of position sensitive score maps by using a bank of specialized CONV layers as the FCN output, on top of which a position sensitive RoI pooling layer, different from the more standard RoI pooling in Girshick (2015); Ren et al. (2015), is added. They showed that the RFCN with ResNet101 He et al. (2016) could achieve comparable accuracy to Faster RCNN, often at faster running times.

Mask RCNN: Following the spirit of conceptual simplicity, efficiency, and flexibility, He et al. He et al. (2017) proposed Mask RCNN to tackle pixel-wise object instance segmentation by extending Faster RCNN. Mask RCNN adopts the same two stage pipeline, with an identical first stage (RPN). In the second stage, in parallel to predicting the class and box offset, Mask RCNN adds a branch which outputs a binary mask for each RoI. The new branch is a Fully Convolutional Network (FCN) Long et al. (2015); Shelhamer et al. (????) on top of a CNN feature map. In order to avoid the misalignments caused by the original RoI pooling (RoIPool) layer, a RoIAlign layer was proposed to preserve the pixel level spatial correspondence. With a backbone network ResNeXt101-FPN Xie et al. (2017); Lin et al. (2017a), Mask RCNN achieved top results for the COCO object instance segmentation and bounding box object detection. It is simple to train, generalizes well, and adds only a small overhead to Faster RCNN, running at 5 FPS He et al. (2017).

Light Head RCNN: In order to further improve the detection speed of RFCN Dai et al. (2016c), Li et al. Li et al. (2018c) proposed Light Head RCNN, making the head of the detection network as light as possible to reduce the RoI regionwise computation. In particular, Li et al. Li et al. (2018c) applied a large kernel separable convolution to produce thin feature maps with a small channel number and a cheap RCNN subnetwork, leading to an excellent tradeoff of speed and accuracy.

3.2 Unified Pipeline (One Stage Pipeline)

The region-based pipeline strategies of Section 3.1 have prevailed on detection benchmarks since RCNN Girshick et al. (2014). The significant efforts introduced in Section 3.1 have led to faster and more accurate detectors, and the current leading results on popular benchmark datasets are all based on Faster RCNN Ren et al. (2015). In spite of that progress, region-based approaches could be computationally expensive for mobile/wearable devices, which have limited storage and computational capability. Therefore, instead of trying to optimize the individual components of a complex region-based pipeline, researchers have begun to develop unified detection strategies.

Unified pipelines refer broadly to architectures that directly predict class probabilities and bounding box offsets from full images with a single feed forward CNN network in a monolithic setting that does not involve region proposal generation or post classification. The approach is simple and elegant because it completely eliminates region proposal generation and subsequent pixel or feature resampling stages, encapsulating all computation in a single network. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

DetectorNet: Szegedy et al. Szegedy et al. (2013) were among the first to explore CNNs for object detection. DetectorNet formulated object detection as a regression problem to object bounding box masks. They use AlexNet Krizhevsky et al. (2012a) and replace the final softmax classifier layer by a regression layer. Given an image window, they use one network to predict foreground pixels over a coarse grid, as well as four additional networks to predict the object’s top, bottom, left and right halves. A grouping process then converts the predicted masks into detected bounding boxes. One needs to train a network per object type and mask type, so the approach does not scale up to multiple classes. DetectorNet must also take many crops of the image, and run multiple networks for each part on every crop.

OverFeat, proposed by Sermanet et al. Sermanet et al. (2014), was one of the first modern one-stage object detectors based on fully convolutional deep networks. It is one of the most successful object detection frameworks, winning the ILSVRC2013 localization competition. OverFeat performs object detection in a multiscale sliding window fashion via a single forward pass through the CNN network, which (with the exception of the final classification/regressor layer) consists only of convolutional layers. In this way, it naturally shares computation between overlapping regions. OverFeat produces a grid of feature vectors, each of which represents a slightly different context view location within the input image and can predict the presence of an object. Once an object is identified, the same features are then used to predict a single bounding box regressor. In addition, OverFeat leverages multiscale features to improve the overall performance by passing up to six enlarged scales of the original image through the network and iteratively aggregating them together, resulting in a significantly increased number of evaluated context views (final feature vectors). OverFeat has a significant speed advantage over RCNN Girshick et al. (2014), which was proposed during the same period, but is significantly less accurate because it was hard to train fully convolutional networks at that time. The speed advantage derives from sharing the computation of convolution between overlapping windows using a fully convolutional network.

YOLO (You Only Look Once): Redmon et al. Redmon et al. (2016) proposed YOLO, a unified detector casting object detection as a regression problem from image pixels to spatially separated bounding boxes and associated class probabilities. The design of YOLO is illustrated in Fig. 8. Since the region proposal generation stage is completely dropped, YOLO directly predicts detections using a small set of candidate regions. Unlike region-based approaches, e.g. Faster RCNN, that predict detections based on features from a local region, YOLO uses the features from the entire image globally. In particular, YOLO divides an image into an S × S grid. Each grid cell predicts C class probabilities, B bounding box locations and confidence scores for those boxes. These predictions are encoded as an S × S × (5B + C) tensor. By throwing out the region proposal generation step entirely, YOLO is fast by design, running in real time at 45 FPS, with a fast version, i.e. Fast YOLO Redmon et al. (2016), running at 155 FPS. Since YOLO sees the entire image when making predictions, it implicitly encodes contextual information about object classes and is less likely to predict false positives on background. However, YOLO makes more localization errors, resulting from the coarse division of bounding box location, scale and aspect ratio. As discussed in Redmon et al. (2016), YOLO may fail to localize some objects, especially small ones, possibly because the grid division is quite coarse, and because by construction each grid cell can only contain one object. It is unclear to what extent YOLO can translate to good performance on datasets with significantly more objects, such as the ILSVRC detection challenge.
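The size of the YOLO prediction tensor can be worked out with a short sketch. It assumes the notation of Redmon et al. (2016): an S × S grid, B boxes per cell each described by five numbers (x, y, w, h, confidence), and C class probabilities per cell. The memory layout inside a cell (boxes first, then class probabilities) and the function names are assumptions made only for this illustration.

```python
def yolo_output_shape(S=7, B=2, C=20):
    """Shape of the prediction tensor: S x S cells, 5 numbers per box plus C classes."""
    return (S, S, B * 5 + C)


def decode_cell(cell, B=2, C=20):
    """Pick the best (box, class_id, score) from one cell's prediction vector.

    Assumed layout: B groups of (x, y, w, h, confidence), then C class probabilities.
    The class-specific score is box confidence times conditional class probability.
    """
    class_probs = cell[B * 5:B * 5 + C]
    best = None
    for b in range(B):
        x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
        for class_id, p in enumerate(class_probs):
            score = conf * p
            if best is None or score > best[2]:
                best = ((x, y, w, h), class_id, score)
    return best


print(yolo_output_shape())  # PASCAL VOC setting with S=7, B=2, C=20
```

Because the class probabilities are shared by all boxes of a cell, a cell can report only one object class, which is the structural reason for the difficulty with small, crowded objects noted above.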

YOLOv2 and YOLO9000: Redmon and Farhadi Redmon and Farhadi (2017) proposed YOLOv2, an improved version of YOLO, in which the custom GoogLeNet Szegedy et al. (2015) network is replaced with a simpler DarkNet19, and which utilizes a number of strategies drawn from existing work, such as batch normalization He et al. (2015), removing the fully connected layers, using good anchor boxes learned with kmeans, and multiscale training. YOLOv2 achieved state of the art on standard detection tasks, like PASCAL VOC and MS COCO. In addition, Redmon and Farhadi Redmon and Farhadi (2017) introduced YOLO9000, which can detect over 9000 object categories in real time by proposing a joint optimization method to train simultaneously on ImageNet and COCO with WordTree to combine data from multiple sources.

SSD (Single Shot Detector): In order to preserve real-time speed without sacrificing too much detection accuracy, Liu et al. Liu et al. (2016) proposed SSD, which is faster than YOLO Redmon et al. (2016) and has accuracy competitive with state-of-the-art region-based detectors, including Faster RCNN Ren et al. (2015). SSD effectively combines ideas from RPN in Faster RCNN Ren et al. (2015), YOLO Redmon et al. (2016) and multiscale CONV features Hariharan et al. (2016) to achieve fast detection speed while still retaining high detection quality. Like YOLO, SSD predicts a fixed number of bounding boxes and scores for the presence of object class instances in these boxes, followed by an NMS step to produce the final detection. The CNN network in SSD is fully convolutional, and its early layers are based on a standard architecture, such as VGG Simonyan and Zisserman (2015) (truncated before any classification layers), which is referred to as the base network. Then several auxiliary CONV layers, progressively decreasing in size, are added to the end of the base network. The information in the last layer with low resolution may be too coarse spatially to allow precise localization, so SSD uses shallower layers with higher resolution for detecting small objects. For objects of different sizes, SSD performs detection over multiple scales by operating on multiple CONV feature maps, each of which predicts category scores and box offsets for bounding boxes of appropriate sizes. For a 300 × 300 input, SSD achieves 74.3% mAP on the VOC2007 test at 59 FPS on an Nvidia Titan X.
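The NMS step that follows the box predictions can be sketched as a greedy procedure. The sketch assumes boxes in (x1, y1, x2, y2) form, a single object class, and an IoU threshold of 0.5; the threshold value and the function names are illustrative assumptions rather than the settings used in SSD.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; returns indices of the kept boxes.

    Boxes are visited in order of decreasing score, and a box is kept only if
    it does not overlap any already kept box by more than iou_threshold.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # the second box overlaps the first and is suppressed
```

In a multi-class detector this procedure is typically run separately for each class after low-scoring boxes have been discarded.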

4 Fundamental SubProblems

In this section important subproblems are described, including feature representation, region proposal, context information mining, and training strategies. Each approach is reviewed with respect to its primary contribution.

4.1 DCNN based Object Representation

As one of the main components in any detector, good feature representations are of primary importance in object detection Dickinson et al. (2009); Girshick et al. (2014); Gidaris and Komodakis (2015); Zhu et al. (2016a). In the past, a great deal of effort was devoted to designing local descriptors (e.g., SIFT Lowe (1999) and HOG Dalal and Triggs (2005)) and to exploring approaches (e.g., Bag of Words Sivic and Zisserman (2003) and Fisher Vector Perronnin et al. (2010)) to group and abstract the descriptors into higher level representations in order to allow the discriminative object parts to begin to emerge. However, these feature representation methods required careful engineering and considerable domain expertise.

In contrast, deep learning methods (especially deep CNNs, or DCNNs), which are composed of multiple processing layers, can learn powerful feature representations with multiple levels of abstraction directly from raw images Bengio et al. (2013); LeCun et al. (2015). As the learning procedure reduces the dependency on specific domain knowledge and on the complex procedures needed in traditional feature engineering Bengio et al. (2013); LeCun et al. (2015), the burden for feature representation has been transferred to the design of better network architectures.

The leading frameworks reviewed in Section 3 (RCNN Girshick et al. (2014), Fast RCNN Girshick (2015), Faster RCNN Ren et al. (2015), YOLO Redmon et al. (2016), SSD Liu et al. (2016)) have steadily improved detection accuracy and speed. It is generally accepted that the CNN representation plays a crucial role and that the CNN architecture is the engine of a detector. As a result, most of the recent improvements in detection accuracy have been achieved via research into the development of novel networks. Therefore we begin by reviewing popular CNN architectures used in Generic Object Detection, followed by a review of the effort devoted to improving object feature representations, such as developing invariant features to accommodate geometric variations in object scale, pose, viewpoint and part deformation, and performing multiscale analysis to improve object detection over a wide range of scales.

Figure 9: Performance of winning entries in the ILSVRC competitions from 2011 to 2017 in the image classification task.


  No. DCNN Architecture #Paras (×10^6) #Layers (CONV+FC) Test Error (Top 5) First Used In Highlights 


  AlexNet Krizhevsky et al. (2012b) Girshick et al. (2014) The first DCNN; The historical turning point of feature representation from traditional to CNN; In the classification task of ILSVRC2012 competition, achieved a winning Top 5 test error rate of 15.3%, compared to 26.2% given by the second best entry. 
  OverFeat Sermanet et al. (2014) Sermanet et al. (2014) Similar to AlexNet, differences including a smaller stride for CONV1 and 2, different filter size for some layers, more filters for some layers. 
  ZFNet (fast) Zeiler and Fergus (2014) He et al. (2014) Highly similar to AlexNet, with a smaller filter size in CONV1 and a smaller stride for CONV1 and 2. 
  VGGNet16 Simonyan and Zisserman (2015) Girshick (2015) Increasing network depth significantly with small convolution filters; Significantly better performance. 
  GoogLeNet Szegedy et al. (2015) Szegedy et al. (2015) With the use of the Inception module, which concatenates feature maps produced by filters of different sizes, the network goes wider and has far fewer parameters than AlexNet. 
  Inception v2 Ioffe and Szegedy (2015) Howard et al. (2017) Faster training with the introduction of Batch Normalization. 
  Inception v3 Szegedy et al. (2016) Going deeper with Inception building blocks in efficient ways. 
  YOLONet Redmon et al. (2016) Redmon et al. (2016) A network inspired by GoogLeNet used in YOLO detector. 
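  ResNet50 He et al. (2016) He et al. (2016) With the use of residual connections, substantially deeper but with fewer parameters than previous DCNNs (except for GoogLeNet). 
  ResNet101 He et al. (2016) (ResNets) He et al. (2016) With the use of residual connections, substantially deeper but with fewer parameters than previous DCNNs (except for GoogLeNet). 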
  InceptionResNet v1 Szegedy et al. (2017) A residual version of Inception with similar computational cost of Inception v3, but with faster training process. 
  InceptionResNet v2 Szegedy et al. (2017) (Ensemble) Huang et al. (2017b) A costlier residual version of Inception, with significantly improved recognition performance. 
  Inception v4 Szegedy et al. (2017) An Inception variant without residual connections with roughly the same recognition performance as InceptionResNet v2, but significantly slower. 
  ResNeXt50 Xie et al. (2017) Xie et al. (2017) Repeating a building block that aggregates a set of transformations with the same topology. 
  DenseNet201 Huang et al. (2017a) Zhou et al. (2018) Design dense block, which connects each layer to every other layer in a feed forward fashion; Alleviate the vanishing gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. 
  DarkNet Redmon and Farhadi (2017) Redmon and Farhadi (2017) Similar to VGGNet, but with significantly fewer parameters due to the use of fewer filters at each layer. 
  MobileNet Howard et al. (2017) Howard et al. (2017) Light weight deep CNNs using depthwise separable convolutions for mobile applications. 
  SE ResNet50 Hu et al. (2018b) (SENets) Hu et al. (2018b) Proposing a novel block called Squeeze and Excitation to model feature channel relationship; Can be flexibly used in all existing CNNs to improve recognition performance at minimal additional computational cost. 


Table 2: DCNN architectures that are commonly used for generic object detection. Regarding the statistics for “#Paras” and “#Layers”, we didn’t consider the final FC prediction layer. “Test Error” column indicates the Top 5 classification test error on ImageNet1000. Explanations: OverFeat (accurate model), DenseNet201 (Growth Rate 32, DenseNet-BC), and ResNeXt50 (32*4d).

4.1.1 Popular CNN Architectures

CNN architectures serve as network backbones to be used in the detection frameworks described in Section 3. Representative architectures include AlexNet Krizhevsky et al. (2012b), ZFNet Zeiler and Fergus (2014), VGGNet Simonyan and Zisserman (2015), GoogLeNet Szegedy et al. (2015), the Inception series Ioffe and Szegedy (2015); Szegedy et al. (2016, 2017), ResNet He et al. (2016), DenseNet Huang et al. (2017a) and SENet Hu et al. (2018b), which are summarized in Table 2, and whose improvement in object recognition can be seen from Fig. 9. A further review of recent CNN advances can be found in Gu et al. (2017).

Briefly, a CNN has a hierarchical structure and is composed of a number of layers such as convolution, nonlinearity, pooling etc. From finer to coarser layers, the image repeatedly undergoes filtered convolution, and with each layer the receptive field (region of support) of these filters increases. For example, the pioneering AlexNet Krizhevsky et al. (2012b) has five convolutional layers and two Fully Connected (FC) layers, where the first layer contains 96 filters of size 11 × 11 × 3. In general, the first CNN layer extracts low level features (e.g. edges), intermediate layers extract features of increasing complexity, such as combinations of low level features, and later convolutional layers detect objects as combinations of earlier parts Zeiler and Fergus (2014); Bengio et al. (2013); LeCun et al. (2015); Oquab et al. (2014).
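The growth of the receptive field with depth can be checked with a short calculation. The sketch assumes the standard recurrence for a stack of convolution or pooling layers without dilation: each layer with kernel size k enlarges the receptive field by (k - 1) times the product of all earlier strides. The layer lists are hypothetical examples, not a full description of any network in Table 2.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output unit.

    layers: sequence of (kernel_size, stride) pairs, from input to output.
    """
    field, jump = 1, 1  # jump = spacing of adjacent units measured in input pixels
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Stacking 3x3 convolutions with stride 1 grows the field by 2 pixels per layer.
print(receptive_field([(3, 1)] * 3))
# An 11x11 convolution with stride 4 followed by 3x3 pooling with stride 2.
print(receptive_field([(11, 4), (3, 2)]))
```

Strided layers multiply the contribution of every later layer, which is why the receptive field of deep units grows much faster than the number of layers alone would suggest.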

As can be observed from Table 2, the trend in architecture evolution is that networks are getting deeper: AlexNet consisted of 8 layers, VGGNet 16 layers, and more recently ResNet and DenseNet both surpassed the 100 layer mark. It was VGGNet Simonyan and Zisserman (2015) and GoogLeNet Szegedy et al. (2015), in particular, which showed that increasing depth can improve the representational power of deep networks. Interestingly, as can be observed from Table 2, networks such as AlexNet, OverFeat, ZFNet and VGGNet have an enormous number of parameters, despite being only a few layers deep, since a large fraction of the parameters come from the FC layers. Therefore, newer networks like Inception, ResNet, and DenseNet, although very deep, have far fewer parameters because they avoid the use of FC layers.
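The dominance of FC layers in the parameter count can be verified with simple arithmetic. The sketch assumes one 3 × 3 CONV layer with 512 input and 512 output channels and one FC layer mapping a 7 × 7 × 512 feature map to 4096 units, which are shapes in the style of VGG16 chosen here purely for illustration; biases are included.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution layer."""
    return kernel * kernel * c_in * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


conv = conv_params(3, 512, 512)      # one deep 3x3 CONV layer
fc = fc_params(7 * 7 * 512, 4096)    # first FC layer on a 7x7x512 map
print(conv, fc, round(fc / conv, 1))
```

Under these assumptions a single FC layer holds roughly forty times as many parameters as a wide CONV layer, which is why replacing FC layers (for example with global pooling) shrinks a model so effectively.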

With the use of Inception modules in carefully designed topologies, the number of parameters of GoogLeNet is dramatically reduced. Similarly, ResNet demonstrated the effectiveness of skip connections for learning extremely deep networks with hundreds of layers, winning the ILSVRC 2015 classification task. Inspired by ResNet He et al. (2016), InceptionResNets Szegedy et al. (2017) combine the Inception networks with shortcut connections, claiming that shortcut connections can significantly accelerate the training of Inception networks. Extending ResNets, Huang et al. Huang et al. (2017a) proposed DenseNets, which are built from dense blocks that connect each layer to every other layer in a feed-forward fashion, leading to compelling advantages such as parameter efficiency, implicit deep supervision, and feature reuse. Recently, Hu et al. Hu et al. (2018b) proposed an architectural unit termed the Squeeze and Excitation (SE) block, which can be combined with existing deep architectures to boost their performance at minimal additional computational cost. The SE block adaptively recalibrates channelwise feature responses by explicitly modeling the interdependencies between convolutional feature channels, and led to winning the ILSVRC 2017 classification task. Research on CNN architectures remains active, and a number of backbone networks are still emerging, such as Dilated Residual Networks Yu et al. (2017), Xception Chollet (2017), DetNet Li et al. (2018b), and Dual Path Networks (DPN) Chen et al. (2017b).
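The channel recalibration performed by a Squeeze and Excitation block can be sketched in a few lines. The sketch assumes the usual three steps: a squeeze by global average pooling per channel, an excitation through two small FC layers (ReLU, then sigmoid) without biases, and a rescaling of each channel by its gate. Feature maps are nested lists of shape [channels][height][width], and the weight matrices `w1` and `w2` are hypothetical inputs that a real network would learn.

```python
import math


def se_block(feat, w1, w2):
    """Rescale each channel of feat by a gate computed from all channels.

    feat : list of channels, each a list of rows (C x H x W)
    w1   : reduction weights, shape (C/r) x C
    w2   : expansion weights, shape C x (C/r)
    """
    # Squeeze: one average value per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: FC + ReLU, then FC + sigmoid, giving one gate in (0, 1) per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(row, hidden))))
             for row in w2]
    # Scale: multiply every value in a channel by that channel's gate.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feat, gates)]


feat = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(se_block(feat, w1=[[1.0, 0.0]], w2=[[0.0], [1.0]]))
```

The extra cost is two small matrix products per block, independent of spatial resolution, which is consistent with the small computational overhead described above.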

The training of a CNN requires a large labelled dataset with sufficient label and intraclass diversity. Unlike image classification, detection requires localizing (possibly many) objects from an image. It has been shown Ouyang et al. (2017) that pretraining the deep model with a large scale dataset having object-level annotations (such as the ImageNet classification and localization dataset), instead of only image-level annotations, improves the detection performance. However, collecting bounding box labels is expensive, especially for hundreds of thousands of categories. A common scenario is for a CNN to be pretrained on a large dataset (usually with a large number of visual categories) with image-level labels; the pretrained CNN can then be applied to a small dataset, directly, as a generic feature extractor Razavian et al. (2014); Azizpour et al. (2016); Donahue et al. (2014); Yosinski et al. (2014), which can support a wider range of visual recognition tasks. For detection, the pretrained network is typically finetuned (finetuning is done by initializing a network with weights optimized for a large labeled dataset like ImageNet and then updating the network’s weights using the target-task training set) on a given detection dataset Donahue et al. (2014); Girshick et al. (2014, 2016). Several large scale image classification datasets are used for CNN pretraining, among them the ImageNet1000 dataset Deng et al. (2009); Russakovsky et al. (2015) with 1.2 million images of 1000 object categories, the Places dataset Zhou et al. (2017a), which is much larger than ImageNet1000 but has fewer classes, and a recent hybrid dataset Zhou et al. (2017a) combining the Places and ImageNet datasets.

Pretrained CNNs without finetuning were explored for object classification and detection in Donahue et al. (2014); Girshick et al. (2016); Agrawal et al. (2014), where it was shown that feature performance is a function of the extracted layer; for example, for AlexNet pretrained on ImageNet, FC6 / FC7 / Pool5 are in descending order of detection accuracy Donahue et al. (2014); Girshick et al. (2016). Finetuning a pretrained network can increase detection performance significantly Girshick et al. (2014, 2016), although in the case of AlexNet the finetuning performance boost was shown to be much larger for FC6 and FC7 than for Pool5, suggesting that the Pool5 features are more general. Furthermore, the relationship or similarity between the source and target datasets plays a critical role; for example, ImageNet based CNN features show better performance Zhou et al. (2015) on object related image datasets.


  Detector Region Backbone Pipelined mAP@IoU=0.5 mAP Published  
  Group Name Proposal DCNN Used VOC07 VOC12 COCO COCO In Full Name of Detector and Highlights 


  (1) Single detection with multilayer features ION Bell et al. (2016) SS+EB MCG+RPN VGG16 Fast RCNN (07+12) (07+12) CVPR16 (Inside Outside Network, ION); Use skip layer pooling to extract information at multiple scales; Features pooled from multilayers are normalized, concatenated, scaled, and dimension reduced; Won the Best Student Entry and 3rd overall in the 2015 MS COCO detection challenge. 
HyperNet Kong et al. (2016) RPN VGG16 Faster RCNN (07+12) (07++12) CVPR16 A good variant of Faster RCNN; Combine deep, intermediate and shallow layer features and compress them into a hyper feature; The hyper feature is used for both RPN and detection network. 
PVANet Kim et al. (2016) RPN PVANet Faster RCNN (07+12+CO) (07++12+CO) NIPSW16 A newly designed deep but lightweight network with the principle “less channels with more layers”; Combine ideas from concatenated ReLU Shang et al. (2016), Inception Szegedy et al. (2015), and HyperNet Kong et al. (2016). 


  (2) Detection at multiple layers SDP+CRC Yang et al. (2016b) EB VGG16 Fast RCNN (07) CVPR16 (Scale Dependent Pooling + Cascade Rejection Classifier, SDP+CRC); Utilize features in all CONV layers to reject easy negatives via CRC and then classify survived proposals using SDP which represents an object proposal with the convolutional features extracted from a layer corresponding to its scale. 
MSCNN Cai et al. (2016) RPN VGG Faster RCNN Only Tested on KITTI ECCV16 (MultiScale CNN, MSCNN); Both proposal generation and detection are performed at multiple output layers; Propose to use feature upsampling; End to end learning. 
MPN Zagoruyko et al. (2016) SharpMask Pinheiro et al. (2016) VGG16 Fast RCNN BMVC16 (MultiPath Network, MPN); Use skip connections, multiscale proposals and an integral loss function to improve Fast RCNN; Ranked in both the COCO15 detection and segmentation challenges; Need segmentation annotations for training. 
DSOD Shen et al. (2017) Free DenseNet SSD (07+12) (07++12) ICCV17 (Deeply Supervised Object Detection, DSOD); Combine ideas of DenseNet and SSD; Training from scratch on the target dataset without pretraining with other datasets like ImageNet. 
RFBNet Liu et al. (2018a) Free VGG16 SSD (07+12) (07++12) CVPR18 (Receptive Field Block, RFB); Proposed RFB to improve SSD; RFB is a multibranch convolutional block similar to the Inception block Szegedy et al. (2015), but with dilated CONV layers. 


  (3) Combination of (1) and (2) DSSD Fu et al. (2017) Free ResNet101 SSD (07+12) (07++12) 2017 (Deconvolutional Single Shot Detector, DSSD); Design a top-down network connected with lateral connections to supplement the bottom-up network, as shown in Fig. 11 (c1, c2). 
FPN Lin et al. (2017a) RPN ResNet101 Faster RCNN CVPR17 (Feature Pyramid Network, FPN); Exploit inherent pyramidal hierarchy of DCNN to construct feature pyramids with marginal extra cost, as shown in Fig. 11 (a1, a2); Widely used in detectors. 
TDM Shrivastava et al. (2017) RPN ResNet101 VGG16 Faster RCNN CVPR17 (Top Down Modulation, TDM); Integrate top-down features and bottom-up, feedforward features via the proposed block shown in Fig. 11 (b2); Result was produced by InceptionResNetv2. 
RON Kong et al. (2017) RPN VGG16 Faster RCNN (07+12+CO) (07++12+CO) CVPR17 (Reverse connection with Objectness prior Networks, RON); Effectively combine Faster RCNN and SSD; Design a block shown in Fig. 11 (d2) to perform multiscale object detection in DCNN. 
ZIP Li et al. (2018a) RPN Inceptionv2 Faster RCNN (07+12) IJCV18 (Zoom out and In network for object Proposals, ZIP); Generate proposals in a deep conv/deconv network with multilayers, as shown in Fig. 12; Proposed a map attention decision (MAD) unit to weight the feature channels input to RPN. 
STDN Zhou et al. (2018) Free DenseNet169 SSD (07+12) CVPR18 (Scale Transferrable Detection Network, STDN); Proposed an efficient scale transfer module embedded into DenseNet; The scale transfer layer rearranges elements by expanding the width and height of the feature map with channel elements. 
RefineDet Zhang et al. (2018a) RPN VGG16 ResNet101 Faster RCNN (07+12) (07++12) CVPR18 Proposed an anchor refinement module to obtain better and fewer anchors; Designed a transfer connection block as shown in Fig. 11 (e2) to improve features for classification. 
StairNet Woo et al. (2018) VGG16 SSD (07+12) (07++12) WACV18 Design a transfer connection block similar to those shown in Fig. 11 to improve feature combination. 


  (4) Model Geometric Transforms DeepIDNet Ouyang et al. (2015) SS+EB AlexNet ZFNet OverFeat GoogLeNet RCNN (07) CVPR15 Introduce a deformation constrained pooling layer to explore object part deformation; Also utilize context modeling, model averaging, and bounding box location refinement in the multistage detection pipeline; Highly engineered; Training not end to end. 
DCN Dai et al. (2017) RPN ResNet101 IRN RFCN (07+12) CVPR17 (Deformable Convolutional Networks, DCN); Design efficient deformable convolution and deformable RoI pooling modules that can replace their plain counterparts in existing DCNNs. 
DPFCN Mordan et al. (2018) AttractioNet Gidaris and Komodakis (2016) ResNet RFCN (07+12) (07++12) IJCV18 (Deformable Part based FCN, DPFCN); Design a deformable part based RoI pooling layer to explicitly select discriminative regions around object proposals by simultaneously optimizing latent displacements of all parts. 


Table 3: Summarization of properties of representative methods for improving DCNN feature representations for generic object detection. See Section 4.1.2 for a more detailed discussion. Abbreviations: Selective Search (SS), EdgeBoxes (EB), InceptionResNet (IRN). Detection results on VOC07, VOC12 and COCO were reported with mAP@IoU=0.5, and the results in the other COCO column were reported with a newer metric, mAP@IoU=[0.5:0.95], which averages mAP over different IoU thresholds from 0.5 to 0.95. Training data: “07”: VOC2007 trainval; “12”: VOC2012 trainval; “07+12”: union of VOC07 and VOC12 trainval; “07++12”: union of VOC07 trainval, VOC07 test, and VOC12 trainval; “07++12+CO”: union of VOC07 trainval, VOC07 test, VOC12 trainval and COCO trainval. The COCO detection results were reported on COCO2015 Test-Dev, except for MPN Zagoruyko et al. (2016), which reported on COCO2015 Test-Standard.
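To make the two reporting conventions in the caption concrete, the following is a minimal sketch of the IoU computation and of averaging a per-threshold score over IoU thresholds from 0.5 to 0.95. The function names, the (x1, y1, x2, y2) box convention and the step of 0.05 between thresholds are assumptions made for illustration; the sketch does not reproduce the full matching and precision-recall procedure of any benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Assumed thresholds 0.50, 0.55, ..., 0.95 (rounded to avoid float drift).
IOU_THRESHOLDS = np.round(0.5 + 0.05 * np.arange(10), 2)

def average_over_thresholds(metric_at_threshold):
    """Average any per-threshold score (e.g. mAP at one IoU) over the thresholds."""
    return float(np.mean([metric_at_threshold(t) for t in IOU_THRESHOLDS]))

def matched_thresholds(detection, ground_truth):
    """Number of thresholds at which a detection would count as a correct match."""
    return int(np.sum(iou(detection, ground_truth) >= IOU_THRESHOLDS))
```

A detection with IoU 0.82 against its ground truth is correct at mAP@IoU=0.5 but only at seven of the ten assumed thresholds, which is why the averaged metric rewards tighter localization.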

4.1.2 Methods For Improving Object Representation

Deep CNN based detectors such as RCNN Girshick et al. (2014), Fast RCNN Girshick (2015), Faster RCNN Ren et al. (2015) and YOLO Redmon et al. (2016) typically use the deep CNN architectures listed in Table 2 as the backbone network and use features from the top layer of the CNN as the object representation; however, detecting objects across a large range of scales is a fundamental challenge. A classical strategy to address this issue is to run the detector over a number of scaled input images (e.g., an image pyramid) Felzenszwalb et al. (2010); Girshick et al. (2014); He et al. (2014), which typically produces more accurate detection, but at an obvious cost in inference time and memory. In contrast, a CNN computes its feature hierarchy layer by layer, and the subsampling layers in the feature hierarchy lead to an inherent multiscale pyramid.

This inherent feature hierarchy produces feature maps of different spatial resolutions, but has inherent structural problems Hariharan et al. (2016); Long et al. (2015); Shrivastava et al. (2017): the later (or higher) layers have a large receptive field and strong semantics, and are the most robust to variations such as object pose, illumination and part deformation, but their resolution is low and geometric details are lost. In contrast, the earlier (or lower) layers have a small receptive field, high resolution and rich geometric details, but are much less sensitive to semantics. Intuitively, semantic concepts of objects can emerge in different layers, depending on the size of the objects. If a target object is small, it requires the fine detail information in earlier layers and may very well disappear at later layers, in principle making small object detection very challenging; tricks such as dilated convolutions Yu and Koltun (2016) or atrous convolution Dai et al. (2016c); Chen et al. (2018) have been proposed to address this. On the other hand, if the target object is large, then the semantic concept will emerge in much later layers. Clearly it is not optimal to predict objects of different scales with features from only one layer; therefore a number of methods Shrivastava et al. (2017); Zhang et al. (2018b); Lin et al. (2017a); Kong et al. (2017) have been proposed to improve detection accuracy by exploiting multiple CNN layers, broadly falling into three types of multiscale object detection:

  1. Detecting with combined features of multiple CNN layers Hariharan et al. (2016); Kong et al. (2016); Bell et al. (2016);

  2. Detecting at multiple CNN layers;

  3. Combinations of the above two methods Fu et al. (2017); Lin et al. (2017a); Shrivastava et al. (2017); Kong et al. (2017); Zhou et al. (2018); Zhang et al. (2018a).
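The trade-off between receptive field and resolution described above can be illustrated numerically. The following minimal sketch tracks the receptive field size and cumulative stride through a stack of convolution and pooling layers; the function name and the toy layer list (a hypothetical VGG-like prefix of two 3x3 convolutions followed by 2x2 pooling, repeated twice) are assumptions for illustration only.

```python
def receptive_field(layers):
    """Track receptive field and cumulative stride through a layer stack.

    layers: list of (kernel_size, stride) pairs, in forward order.
    Returns a list of (receptive_field, cumulative_stride) after each layer.
    """
    rf, jump = 1, 1  # one input pixel sees itself; neighbours are 1 pixel apart
    trace = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each extra tap reaches `jump` input pixels further
        jump *= stride             # subsampling spreads output positions apart
        trace.append((rf, jump))
    return trace

# Hypothetical VGG-like prefix: conv3, conv3, pool2, conv3, conv3, pool2.
toy_layers = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2)]
for depth, (rf, stride) in enumerate(receptive_field(toy_layers), start=1):
    print(f"layer {depth}: receptive field {rf}x{rf}, stride {stride}")
```

Each subsampling step multiplies the stride, so later layers see a larger image region per feature but place those features on a coarser grid, which is the reason a small object may occupy less than one cell in the top feature map.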

Figure 10: Comparison of HyperNet and ION. LRN: Local Response Normalization

(1) Detecting with combined features of multiple CNN layers seeks to combine features from multiple layers before making a prediction. Representative approaches include Hypercolumns Hariharan et al. (2016), HyperNet Kong et al. (2016), and ION Bell et al. (2016). Such feature combining is commonly accomplished via skip connections, a classic neural network idea that skips some layers in the network and feeds the output of an earlier layer as the input to a later layer, architectures which have recently become popular for semantic segmentation Long et al. (2015); Shelhamer et al. (????); Hariharan et al. (2016). As shown in Fig. 10 (a), ION Bell et al. (2016) uses skip pooling to extract RoI features from multiple layers, and then the object proposals generated by selective search and edgeboxes are classified by using the combined features. HyperNet Kong et al. (2016), as shown in Fig. 10 (b), follows a similar idea and integrates deep, intermediate and shallow features to generate object proposals and predict objects via an end to end joint training strategy. This method extracts only 100 candidate regions in each image. The combined feature is more descriptive and is more beneficial for localization and classification, but at increased computational complexity.
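A minimal sketch of this kind of feature combination is given here, assuming feature maps stored as NumPy arrays of shape (channels, height, width) whose resolutions differ by integer factors. It upsamples every map to the finest resolution, normalizes each map per location, and concatenates along the channel axis. The nearest-neighbour upsampling and the function names are simplifying assumptions; the cited methods use their own resampling, scaling and dimension reduction steps.

```python
import numpy as np

def upsample_nearest(fmap, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return fmap.repeat(factor, axis=1).repeat(factor, axis=2)

def l2_normalize(fmap, eps=1e-12):
    """Normalize the channel vector at every spatial location to unit length."""
    norm = np.sqrt((fmap ** 2).sum(axis=0, keepdims=True))
    return fmap / (norm + eps)

def combine_layers(fmaps):
    """Concatenate multilayer features at the resolution of the first (finest) map.

    fmaps: list of (C_i, H_i, W_i) arrays, finest first; H_0 and W_0 are assumed
    to be the same integer multiple of H_i and W_i.
    """
    target_h = fmaps[0].shape[1]
    resized = []
    for fmap in fmaps:
        factor = target_h // fmap.shape[1]
        resized.append(l2_normalize(upsample_nearest(fmap, factor)))
    return np.concatenate(resized, axis=0)
```

The per-layer normalization matters because activations from different depths can differ greatly in magnitude; without it the layer with the largest values would dominate the concatenated descriptor. The channel count of the result is the sum of the inputs, which reflects the increased computational cost noted above.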

Figure 11: Hourglass architectures: Conv1 to Conv5 are the main Conv blocks in backbone networks such as VGG or ResNet. Comparison of a number of Reverse Fusion Blocks (RFBs) commonly used in recent approaches.

(2) Detecting at multiple CNN layers Long et al. (2015); Shelhamer et al. (????) combines coarse to fine predictions from multiple layers by averaging segmentation probabilities. SSD Liu et al. (2016), MSCNN Cai et al. (2016), RFBNet Liu et al. (2018a), and DSOD Shen et al. (2017) combine predictions from multiple feature maps to handle objects of various sizes. SSD spreads out default boxes of different scales to multiple layers within a CNN and forces each layer to focus on predicting objects of a certain scale. Liu et al. Liu et al. (2018a) proposed RFBNet, which simply replaces the later convolution layers of SSD with a Receptive Field Block (RFB) to enhance the discriminability and robustness of features. The RFB is a multibranch convolutional block, similar to the Inception block Szegedy et al. (2015), but combining multiple branches with different kernels and dilated convolution layers Chen et al. (2018). MSCNN Cai et al. (2016) applies deconvolution on multiple layers of a CNN to increase feature map resolution before using the layers to learn region proposals and pool features.
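The idea of assigning default boxes of different scales to different layers can be sketched as follows. The linear scale rule and the default values s_min = 0.2 and s_max = 0.9 are taken from the SSD paper Liu et al. (2016) rather than from this survey, and the function names are hypothetical; scales are expressed as fractions of the input image size.

```python
import math

def ssd_default_scales(num_layers, s_min=0.2, s_max=0.9):
    """Linearly spaced default box scales, one per prediction layer.

    The first entry belongs to the earliest (highest resolution) layer.
    """
    if num_layers == 1:
        return [s_min]
    step = (s_max - s_min) / (num_layers - 1)
    return [s_min + step * k for k in range(num_layers)]

def default_box_shapes(scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """(width, height) of default boxes that share one scale (same area)."""
    return [(scale * math.sqrt(a), scale / math.sqrt(a)) for a in aspect_ratios]

for layer, s in enumerate(ssd_default_scales(6), start=1):
    print(f"prediction layer {layer}: scale {s:.2f}", default_box_shapes(s))
```

With six prediction layers the scales run from 0.2 on the earliest layer to 0.9 on the last, so small objects are matched to high resolution maps and large objects to low resolution maps.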

(3) Combination of the above two methods recognizes that, on the one hand, building a hyper feature representation by simply incorporating skip features into detection, as in UNet Olaf Ronneberger (2015), Hypercolumns Hariharan et al. (2016), HyperNet Kong et al. (2016) and ION Bell et al. (2016), does not yield significant improvements due to the high dimensionality. On the other hand, it is natural to detect large objects from later layers with large receptive fields and to use earlier layers with small receptive fields to detect small objects; however, simply detecting objects from earlier layers may result in low performance because earlier layers possess less semantic information. Therefore, in order to combine the best of both worlds, some recent works propose to detect objects at multiple layers, where the feature of each detection layer is obtained by combining features from different layers. Representative methods include SharpMask Pinheiro et al. (2016), Deconvolutional Single Shot Detector (DSSD) Fu et al. (2017), Feature Pyramid Network (FPN) Lin et al. (2017a), Top Down Modulation (TDM) Shrivastava et al. (2017), Reverse connection with Objectness prior Network (RON) Kong et al. (2017), ZIP Li et al. (2018a) (shown in Fig. 12), Scale Transfer Detection Network (STDN) Zhou et al. (2018), RefineDet Zhang et al. (2018a) and StairNet Woo et al. (2018), as shown in Table 3 and contrasted in Fig. 11.

Figure 12: ZIP is similar to the approaches in Fig. 11.

As can be observed from Fig. 11 (a1) to (e1), these methods have highly similar detection architectures which incorporate a top-down network with lateral connections to supplement the standard bottom-up, feedforward network. Specifically, after a bottom-up pass the final high level semantic features are transmitted back by the top-down network to combine with the bottom-up features from intermediate layers after lateral processing. The combined features are further processed, then used for detection and also transmitted down by the top-down network. As can be seen from Fig. 11 (a2) to (e2), one main difference is the design of the Reverse Fusion Block (RFB), which handles the selection of the lower layer filters and the combination of multilayer features. The top-down and lateral features are processed with small convolutions and combined with elementwise sum, elementwise product or concatenation. FPN shows significant improvement as a generic feature extractor in several applications including object detection Lin et al. (2017a, b) and instance segmentation He et al. (2017), e.g. using FPN in a basic Faster RCNN detector. These methods have to add additional layers to obtain multiscale features, introducing a cost that cannot be neglected. STDN Zhou et al. (2018) used DenseNet Huang et al. (2017a) to combine features of different layers and designed a scale transfer module to obtain feature maps with different resolutions. The scale transfer module can be directly embedded into DenseNet with little additional cost.
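The top-down pathway with lateral connections can be sketched in a few lines. The following is a simplified illustration in the spirit of FPN, assuming (channels, height, width) NumPy arrays whose resolution halves from one level to the next, lateral processing by a 1x1 convolution, nearest-neighbour upsampling by a factor of 2, and fusion by elementwise sum. The function names and weight layout are assumptions, and the further processing that the cited methods apply after fusion is omitted.

```python
import numpy as np

def conv1x1(fmap, weight):
    """1x1 convolution: fmap is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, fmap)

def top_down_fuse(bottom_up, lateral_weights):
    """Fuse bottom-up features with a top-down pathway.

    bottom_up: list of (C_i, H_i, W_i) maps ordered fine to coarse, each level
               having half the spatial size of the previous one.
    lateral_weights: one (C_out, C_i) matrix per level.
    Returns the fused maps in the same fine-to-coarse order.
    """
    # The coarsest level only receives lateral processing.
    fused = [conv1x1(bottom_up[-1], lateral_weights[-1])]
    finer_levels = zip(reversed(bottom_up[:-1]), reversed(lateral_weights[:-1]))
    for fmap, weight in finer_levels:
        # Upsample the coarser fused map and add the lateral feature.
        top_down = fused[0].repeat(2, axis=1).repeat(2, axis=2)
        fused.insert(0, conv1x1(fmap, weight) + top_down)
    return fused
```

Every output level has the same channel count and carries semantic information from the coarsest level, so a detector head can be applied at each level; replacing the sum with an elementwise product or a concatenation gives the other fusion variants mentioned above.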

(4) Model Geometric Transformations. DCNNs are inherently limited in their ability to model significant geometric transformations. An empirical study of the invariance and equivalence of DCNN representations to image transformations can be found in Lenc and Vedaldi (2018). Some approaches have been presented to enhance the robustness of CNN representations, aiming at learning invariant CNN representations with respect to different types of transformations such as scale Kim et al. (2014); Bruna and Mallat (2013a), rotation Bruna and Mallat (2013a); Cheng et al. (2016); Worrall et al. (2017); Zhou et al. (2017b), or both Jaderberg et al. (2015).

Modeling Object Deformations: Before deep learning, Deformable Part based Models (DPMs) Felzenszwalb et al. (2010) were very successful for generic object detection, representing objects by component parts arranged in a deformable configuration. This DPM modeling is less sensitive to transformations in object pose, viewpoint and nonrigid deformations because the parts are positioned accordingly and their local appearances are stable, motivating researchers Dai et al. (2017); Girshick et al. (2015); Mordan et al. (2018); Ouyang et al. (2015); Wan et al. (2015) to explicitly model object composition to improve CNN based detection. The first attempts Girshick et al. (2015); Wan et al. (2015) combined DPMs with CNNs by using deep features learned by AlexNet in DPM based detection, but without region proposals. To enable a CNN to enjoy the built-in capability of modeling the deformations of object parts, a number of approaches were proposed, including DeepIDNet Ouyang et al. (2015), DCN Dai et al. (2017) and DPFCN Mordan et al. (2018) (shown in Table 3). Although similar in spirit, deformations are computed in different ways: DeepIDNet Ouyang et al. (2017) designed a deformation constrained pooling layer to replace a regular max pooling layer to learn the shared visual patterns and their deformation properties across different object classes; Dai et al. Dai et al. (2017) designed a deformable convolution layer and a deformable RoI pooling layer, both of which are based on the idea of augmenting the regular grid sampling locations in the feature maps with additional position offsets and learning the offsets via convolutions, leading to Deformable Convolutional Networks (DCN); and in DPFCN Mordan et al. (2018), Mordan et al. proposed a deformable part based RoI pooling layer which selects discriminative parts of objects around object proposals by simultaneously optimizing latent displacements of all parts.
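The core idea of augmenting the regular sampling grid with position offsets can be sketched for a single output location of a 3x3 kernel on a single-channel feature map. In the sketch, the offsets are passed in as an argument, whereas in DCN they are predicted by an additional convolution; the function names, the (dy, dx) offset layout and the zero padding outside the map are assumptions for illustration.

```python
import numpy as np

def bilinear_sample(fmap, y, x):
    """Bilinearly interpolate a (H, W) map at a real-valued location (zero outside)."""
    height, width = fmap.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < height and 0 <= xx < width:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * fmap[yy, xx]
    return value

def deformable_conv_at(fmap, kernel, center, offsets):
    """Output of a 3x3 deformable convolution at one location.

    kernel:  (3, 3) weights.
    offsets: (3, 3, 2) array of (dy, dx) added to each regular grid position.
    """
    cy, cx = center
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i, j]
            out += kernel[i, j] * bilinear_sample(fmap, cy + (i - 1) + dy, cx + (j - 1) + dx)
    return out
```

With all offsets set to zero the function reduces to a plain 3x3 convolution (in the cross-correlation form used by CNNs); nonzero offsets move each sampling point off the grid, and bilinear interpolation keeps the output differentiable with respect to the offsets.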

4.2 Context Modeling


  Detector Region Backbone Pipelined mAP@IoU=0.5 mAP Published  
  Group Name Proposal DCNN Used VOC07 VOC12 COCO In Full Name of Detector and Highlights 


  Global Context SegDeepM Zhu et al. (2015) SS+CMPC VGG16 RCNN VOC10 VOC12 CVPR15 Use an additional feature extracted from an enlarged object proposal as context information; Frame the detection problem as inference in a Markov Random Field. 
ION Bell et al. (2016) SS+EB VGG16 Fast RCNN CVPR16 (Inside Outside Network, ION); Contextual information outside the region of interest is integrated using spatial recurrent neural networks. 

DeepIDNet Ouyang et al. (2015) SS+EB AlexNet ZFNet RCNN (07) CVPR15 Propose to use image classification scores as global contextual information to refine the detection scores of each object proposal. 
CPF Shrivastava and Gupta (2016) RPN VGG16 Faster RCNN (07+12) (07++12) ECCV16 (Contextual Priming and Feedback, CPF); Augment Faster RCNN with a semantic segmentation network; Use semantic segmentation to provide top down feedback. 


  Local Context MRCNN Gidaris and Komodakis (2015) SS VGG16 SPPNet (07+12) (07+12) ICCV15 (MultiRegion CNN, MRCNN); Extract features from multiple regions surrounding or inside the object proposals; Integrate the semantic segmentation-aware features. 
GBDNet Zeng et al. (2016) CRAFT Yang et al. (2016a) Inception v2 ResNet269 Fast RCNN (07+12) ECCV16 (Gated BiDirectional CNN, GBDNet); Propose a GBDNet module to model the relations of multiscale contextualized regions surrounding an object proposal; GBDNet passes messages among features from different context regions through convolution between neighboring support regions in two directions; Gated functions are used to control message transmission. 
ACCNN Li et al. (2017b) SS VGG16 Fast RCNN (07+12) (07++12) TMM17 (Attention to Context CNN, ACCNN); Propose to use multiple stacked LSTM layers to capture global context; Propose to encode features from multiscale contextualized regions surrounding an object proposal by feature concatenation. The global and local context features are concatenated for recognition. 
CoupleNet Zhu et al. (2017a) RPN ResNet101 RFCN (07+12) (07++12) ICCV17 Improve on RFCN; Besides the main branch in the network head, propose an additional branch by concatenating features from multiscale contextualized regions surrounding an object proposal; Features from two branches are combined with elementwise sum. 
SMN Chen and Gupta (2017) RPN VGG16 Faster RCNN (07) ICCV17 (Spatial Memory Network, SMN); Propose an SMN to model object-object relationships efficiently and effectively; A sequential reasoning architecture. 
ORN Hu et al. (2018a) RPN ResNet101 +DCN Faster RCNN CVPR18 (Object Relation Network, ORN); Propose an ORN to model the relations of a set of object proposals through interaction between their appearance feature and geometry; ORN does not require additional supervision and is easy to embed in existing networks. 
SIN Liu et al. (2018b) RPN VGG16 Faster RCNN (07+12) (07++12) CVPR18 (Structure Inference Network, SIN); Formulates object detection as a problem of graph structure inference, where given an image the objects are treated as nodes in a graph and relationships between the objects are modeled as edges in such graph. 


Table 4: Summarization of detectors that exploit context information, similar to Table 3.
Figure 13: Representative approaches that explore local surrounding contextual features: MRCNN Gidaris and Komodakis (2015), GBDNet Zeng et al. (2016, 2017), ACCNN Li et al. (2017b) and CoupleNet Zhu et al. (2017a), see also Table 4.

In the physical world visual objects occur in particular environments and usually coexist with other related objects, and there is strong psychological evidence Biederman (1972); Bar (2004) that context plays an essential role in human object recognition. It is recognized that proper modeling of context helps object detection and recognition Torralba (2003); Oliva and Torralba (2007); Chen et al. (2018, 2015a); Divvala et al. (2009); Galleguillos and Belongie (2010), especially when object appearance features are insufficient because of small object size, occlusion, or poor image quality. Many different types of context have been discussed, in particular see surveys Divvala et al. (2009); Galleguillos and Belongie (2010). Context can broadly be grouped into one of three categories Biederman (1972); Galleguillos and Belongie (2010):

  1. Semantic context: The likelihood of an object to be found in some scenes but not in others;

  2. Spatial context: The likelihood of finding an object in some position and not others with respect to other objects in the scene;

  3. Scale context: Objects have a limited set of sizes relative to other objects in the scene.

A great deal of work Chen et al. (2015b); Divvala et al. (2009); Galleguillos and Belongie (2010); Malisiewicz and Efros (2009); Murphy et al. (2003); Rabinovich et al. (2007); Parikh et al. (2012) preceded the prevalence of deep learning; however, much of this work has not been explored in DCNN based object detectors Chen and Gupta (2017); Hu et al. (2018a).

The current state of the art in object detection Ren et al. (2015); Liu et al. (2016); He et al. (2017) detects objects without explicitly exploiting any contextual information. It is broadly agreed that DCNNs make use of contextual information implicitly Zeiler and Fergus (2014); Zheng et al. (2015) since they learn hierarchical representations with multiple levels of abstraction. Nevertheless there is still value in exploring contextual information explicitly in DCNN based detectors Hu et al. (2018a); Chen and Gupta (2017); Zeng et al. (2017), and so the following reviews recent work in exploiting contextual cues in DCNN based object detectors, organized into categories of global and local contexts, motivated by earlier work in Zhang et al. (2013); Galleguillos and Belongie (2010). Representative approaches are summarized in Table 4.

Global context Zhang et al. (2013); Galleguillos and Belongie (2010) refers to image or scene level context, which can serve as cues for object detection (e.g., a bedroom will predict the presence of a bed). In DeepIDNet Ouyang et al. (2015), the image classification scores were used as contextual features, and concatenated with the object detection scores to improve detection results. In ION Bell et al. (2016), Bell et al. proposed to use spatial Recurrent Neural Networks (RNNs) to explore contextual information across the entire image. In SegDeepM Zhu et al. (2015), Zhu et al. proposed an MRF model that scores appearance as well as context for each detection, and allows each candidate box to select a segment and score the agreement between them. In Shrivastava and Gupta (2016), semantic segmentation was used as a form of contextual priming.

Local context Zhang et al. (2013); Galleguillos and Belongie (2010); Rabinovich et al. (2007) considers local surroundings in object relations, that is, the interactions between an object and its surrounding area. In general, modeling object relations is challenging, requiring reasoning about bounding boxes of different classes, locations, scales etc. In the deep learning era, research that explicitly models object relations is quite limited, with representative ones being Spatial Memory Network (SMN) Chen and Gupta (2017), Object Relation Network (ORN) Hu et al. (2018a), and Structure Inference Network (SIN) Liu et al. (2018b). In SMN, spatial memory essentially assembles object instances back into a pseudo image representation that can easily be fed into another CNN for object relations reasoning, leading to a new sequential reasoning architecture where image and memory are processed in parallel to obtain detections which further update memory. Inspired by the recent success of attention modules in the natural language processing field Vaswani et al. (2017), Hu et al. Hu et al. (2018a) proposed a lightweight ORN, which processes a set of objects simultaneously through interaction between their appearance feature and geometry. It does not require additional supervision and is easy to embed in existing networks. It has been shown to be effective in improving object recognition and duplicate removal steps in modern object detection pipelines, giving rise to the first fully end-to-end object detector. SIN Liu et al. (2018b) considered two kinds of context, namely scene contextual information and object relationships within a single image. It formulates object detection as a problem of graph structure inference, where given an image the objects are treated as nodes in a graph and relationships between objects are modeled as edges in such a graph.
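The interaction between appearance features and geometry in an attention-style relation module can be sketched as follows. This is a simplified, single-head illustration and not the exact ORN formulation: the geometry term is assumed to be given as a nonnegative weight matrix instead of being computed from box coordinates, and the projection matrices and function name are hypothetical.

```python
import numpy as np

def relation_features(appearance, geometry_weight, w_q, w_k, w_v):
    """Augment each object's feature with a weighted sum over all objects.

    appearance:      (N, D) appearance features of N object proposals.
    geometry_weight: (N, N) nonnegative weights; entry [n, m] says how much the
                     relative geometry lets object m influence object n.
    w_q, w_k:        (D, D_k) query and key projections; w_v: (D, D) value projection.
    Returns the augmented (N, D) features and the (N, N) relation weights.
    """
    query = appearance @ w_q
    key = appearance @ w_k
    value = appearance @ w_v
    # Appearance term: scaled dot product between query and key.
    logits = query @ key.T / np.sqrt(query.shape[1])
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability only
    # The geometry term gates the appearance term before normalization.
    weights = geometry_weight * np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True) + 1e-12
    return appearance + weights @ value, weights
```

Because the module maps a set of N features to a set of N features of the same size, it needs no supervision beyond the detection loss and can be inserted between existing layers, which matches the properties described above. A pair whose geometry weight is zero contributes nothing, regardless of appearance similarity.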

A wider range of methods has approached the problem more simply, normally by enlarging the detection window size to extract some form of local context. Representative approaches include MRCNN Gidaris and Komodakis (2015), Gated BiDirectional CNN (GBDNet) Zeng et al. (2016, 2017), Attention to Context CNN (ACCNN) Li et al. (2017b), CoupleNet Zhu et al. (2017a), and Sermanet et al. Sermanet et al. (2013).

In MRCNN Gidaris and Komodakis (2015) (Fig. 13 (a)), in addition to the features extracted from the original object proposal at the last CONV layer of the backbone, Gidaris and Komodakis proposed to extract features from a number of different regions of an object proposal (half regions, border regions, central regions, contextual region and semantically segmented regions), in order to obtain a richer and more robust object representation. All of these features are combined simply by concatenation.
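The construction of several regions from one object proposal can be sketched with simple box arithmetic. The region names, the scale factors for the central and contextual regions and the helper functions are assumptions for illustration and are not the exact region definitions of MRCNN; boxes are (x1, y1, x2, y2) in pixels.

```python
def scale_box(box, factor):
    """Scale a box about its center by the given factor."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * factor / 2, (y2 - y1) * factor / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)

def clip_box(box, width, height):
    """Clip a box to the image extent."""
    x1, y1, x2, y2 = box
    return (max(0, x1), max(0, y1), min(width, x2), min(height, y2))

def multi_regions(box, width, height, central_factor=0.5, context_factor=1.5):
    """Derive half, central and contextual regions from one proposal."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    regions = {
        "original": box,
        "left_half": (x1, y1, cx, y2),
        "right_half": (cx, y1, x2, y2),
        "upper_half": (x1, y1, x2, cy),
        "lower_half": (x1, cy, x2, y2),
        "central": scale_box(box, central_factor),   # inner part of the object
        "context": scale_box(box, context_factor),   # enlarged window with surroundings
    }
    return {name: clip_box(region, width, height) for name, region in regions.items()}
```

Features pooled from each region would then be concatenated into one descriptor. The enlarged contextual region is the simplest form of local context, while the half regions make the representation more robust to partial occlusion.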

Quite a number of methods, all closely related to MRCNN, have been proposed since. The method in Zagoruyko et al. (2016) used only four contextual regions, organized in a foveal structure, where the classifier is trained jointly end to end. Zeng et al. proposed GBDNet Zeng et al. (2016, 2017) (Fig. 13 (b)) to extract features from multiscale contextualized regions surrounding an object proposal to improve detection performance. Different from the naive way of learning CNN features for each region separately and then concatenating them, as in MRCNN, GBDNet can pass messages among features from different contextual regions, implemented through convolution. Noting that message passing is not always helpful but dependent on individual samples, Zeng et al. used gated functions to control message transmission, like in Long Short Term Memory (LSTM) networks Hochreiter and Schmidhuber (1997). Concurrent with GBDNet, Li et al. Li et al. (2017b) presented ACCNN (Fig. 13 (c)) to utilize both global and local contextual information to facilitate object detection. To capture global context, a Multiscale Local Contextualized (MLC) subnetwork was proposed, which recurrently generates an attention map for an input image to highlight useful global contextual locations, through multiple stacked LSTM layers. To encode local surroundings context, Li et al. Li et al. (2017b) adopted a method similar to that in MRCNN Gidaris and Komodakis (2015). As shown in Fig. 13 (d), CoupleNet Zhu et al. (2017a) is conceptually similar to ACCNN Li et al. (2017b), but built upon RFCN Dai et al. (2016c). In addition to the original branch in RFCN Dai et al. (2016c), which captures object information with position sensitive RoI pooling, CoupleNet Zhu et al. (2017a) added one branch to encode the global context information with RoI pooling.

4.3 Detection Proposal Methods

An object can be located at any position and scale in an image. During the heyday of handcrafted feature descriptors (e.g., SIFT Lowe (2004), HOG Dalal and Triggs (2005) and LBP Ojala et al. (2002)), the Bag of Words (BoW) Sivic and Zisserman (2003); Csurka et al. (2004) and the DPM Felzenszwalb et al. (2008) used sliding window techniques Viola and Jones (2001); Dalal and Triggs (2005); Felzenszwalb et al. (2008); Harzallah et al. (2009); Vedaldi et al. (2009). However the number of windows is large and grows with the number of pixels in an image, and the need to search at multiple scales and aspect ratios further significantly increases the search space. Therefore, it is computationally too expensive to apply more sophisticated classifiers.
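The growth of the sliding window search space can be made concrete with a small count. The following sketch assumes windows placed on a regular grid with a fixed stride, and a hypothetical set of window scales and aspect ratios; all default values are illustrative and not taken from any particular detector.

```python
def count_windows(img_w, img_h, win_w, win_h, stride):
    """Number of window placements of one size that fit inside the image."""
    if win_w > img_w or win_h > img_h:
        return 0
    return ((img_w - win_w) // stride + 1) * ((img_h - win_h) // stride + 1)

def total_windows(img_w, img_h, base=64, scales=(1, 2, 4),
                  aspect_ratios=(0.5, 1.0, 2.0), stride=8):
    """Total placements over all scale and aspect ratio combinations."""
    total = 0
    for scale in scales:
        for ratio in aspect_ratios:
            # Keep the window area near (base * scale)^2 with width / height = ratio.
            win_w = int(round(base * scale * ratio ** 0.5))
            win_h = int(round(base * scale / ratio ** 0.5))
            total += count_windows(img_w, img_h, win_w, win_h, stride)
    return total

print(count_windows(640, 480, 64, 64, 8))  # one window size only
print(total_windows(640, 480))             # nine scale and aspect ratio combinations
```

Even for a single window size, a 640 by 480 image with stride 8 yields several thousand placements, and every additional scale or aspect ratio adds a comparable number, which is why only cheap classifiers were affordable per window.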

Around 2011, researchers proposed to relieve the tension between computational tractability and high detection quality by using detection proposals Van de Sande et al. (2011); Uijlings et al. (2013) (we use the terms detection proposals, object proposals and region proposals interchangeably). Originating in the idea of objectness proposed by Alexe et al. (2010), object proposals are a set of candidate regions in an image that are likely to contain objects. Detection proposals are usually used as a preprocessing step, in order to reduce the computational complexity by limiting the number of regions that need be evaluated by the detector. Therefore, a good detection proposal should have the following characteristics:

  1. High recall, which can be achieved with only a few proposals;

  2. The proposals match the objects as accurately as possible;

  3. High efficiency.
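The first two characteristics are commonly measured as recall at a given IoU threshold for a fixed proposal budget. The following is a minimal sketch of that measurement, assuming boxes given as (x1, y1, x2, y2) and proposals already sorted by decreasing score; the function names are hypothetical.

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """IoU between every box in boxes_a (N, 4) and every box in boxes_b (M, 4)."""
    a = np.asarray(boxes_a, dtype=float)[:, None, :]
    b = np.asarray(boxes_b, dtype=float)[None, :, :]
    inter_w = np.clip(np.minimum(a[..., 2], b[..., 2]) - np.maximum(a[..., 0], b[..., 0]), 0, None)
    inter_h = np.clip(np.minimum(a[..., 3], b[..., 3]) - np.maximum(a[..., 1], b[..., 1]), 0, None)
    inter = inter_w * inter_h
    area_a = (a[..., 2] - a[..., 0]) * (a[..., 3] - a[..., 1])
    area_b = (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area_a + area_b - inter)

def recall_at(gt_boxes, proposals, iou_threshold=0.5, top_n=None):
    """Fraction of ground truth boxes covered by at least one of the top proposals."""
    if top_n is not None:
        proposals = proposals[:top_n]
    if len(proposals) == 0:
        return 0.0
    best_iou = pairwise_iou(gt_boxes, proposals).max(axis=1)
    return float((best_iou >= iou_threshold).mean())
```

Evaluating `recall_at` for a small `top_n` checks the first characteristic (high recall with few proposals), and raising `iou_threshold` checks the second (how accurately the proposals match the objects).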

The success of object detection based on detection proposals given by selective search Van de Sande et al. (2011); Uijlings et al. (2013) has attracted broad interest Carreira and Sminchisescu (2012); Arbeláez et al. (2014); Alexe et al. (2012); Cheng et al. (2014); Zitnick and Dollár (2014); Endres and Hoiem (2010); Krähenbühl and Koltun (2014); Manen et al. (2013).

A comprehensive review of object proposal algorithms is outside the scope of this paper, because object proposals have applications beyond object detection Arbeláez et al. (2012); Guillaumin et al. (2014); Zhu et al. (2017b). We refer interested readers to the recent surveys Hosang et al. (2016); Chavali et al. (2016), which provide an in-depth analysis of many classical object proposal algorithms and their impact on detection performance. Our interest here is to review object proposal methods that are based on DCNNs, output class agnostic proposals, and are related to generic object detection.

In 2014, the integration of object proposals Van de Sande et al. (2011); Uijlings et al. (2013) and DCNN features Krizhevsky et al. (2012a) led to the milestone RCNN Girshick et al. (2014) in generic object detection. Since then, detection proposal algorithms have quickly become a standard preprocessing step, evidenced by the fact that all winning entries in the PASCAL VOC Everingham et al. (2010), ILSVRC Russakovsky et al. (2015) and MS COCO Lin et al. (2014) object detection challenges since 2014 used detection proposals Girshick et al. (2014); Ouyang et al. (2015); Girshick (2015); Ren et al. (2015); Zeng et al. (2017); He et al. (2017).

Among object proposal approaches based on traditional low-level cues (e.g., color, texture, edge and gradients), Selective Search Uijlings et al. (2013), MCG Arbeláez et al. (2014) and EdgeBoxes Zitnick and Dollár (2014) are among the more popular. As the domain rapidly progressed, traditional object proposal approaches Hosang et al. (2016) (e.g. Selective Search Uijlings et al. (2013) and EdgeBoxes Zitnick and Dollár (2014)), which were adopted as external modules independent of the detectors, became the bottleneck of the detection pipeline Ren et al. (2015). An emerging class of object proposal algorithms Erhan et al. (2014); Ren et al. (2015); Kuo et al. (2015); Ghodrati et al. (2015); Pinheiro et al. (2015); Yang et al. (2016a) using DCNNs has attracted broad attention.


  Proposer Backbone Detector Recall@IoU (VOC07) Detection Results (mAP) Published  
  Name Network Tested VOC07 VOC12 COCO In Full Name of Detector and Highlights 
  Bounding Box Object Proposal Methods MultiBox1 Erhan et al. (2014) AlexNet RCNN (10) (12) CVPR14 Among the first to explore DCNN for object proposals; Learns a class agnostic regressor on a small set of 800 predefined anchor boxes; Does not share features with the detection network. 
DeepBox Kuo et al. (2015) VGG16 Fast RCNN (1000) (1000) (1000) (500) (IoU@0.5) ICCV15 Propose a light weight CNN to learn to rerank proposals generated by EdgeBox; Can run at 0.26s per image; Not sharing features extracted for detection. 
RPN Ren et al. (2015, 2017a) VGG16 Faster RCNN (300) 0.98 (1000) (300) 0.84 (1000) (300) 0.04 (1000) (300) (07+12) (300) (07++12) (300) NIPS15 (Region Proposal Network, RPN); First to generate object proposals by sharing full image convolutional features with the detection network; First to introduce the anchor boxes idea; Most widely used object proposal method; Greatly improved the detection speed and accuracy. 
DeepProposal Ghodrati et al. (2015) VGG16 Fast RCNN (100) 0.92 (1000) (100) 0.80 (1000) (100) 0.16 (1000) (100) (07) ICCV15 Generated proposals inside a DCNN in a multiscale manner; Selected the most promising object locations and refined their boxes in a coarse to fine cascading way; Used the off the shelf DCNN features for generating proposals; Sharing features with the detection network. 
CRAFT Yang et al. (2016a) VGG16 Faster RCNN (300) (300) (300) (07+12) 71.3 (12) CVPR16 (Cascade Region proposal network And FasT rcnn, CRAFT); Introduced a classification Network (i.e. two class Fast RCNN) cascade that comes after the RPN. Not sharing features extracted for detection. 
AZNet Lu et al. (2016) VGG16 Fast RCNN (300) (300) (300) (07) CVPR16 (Adjacency and Zoom Network, AZNet); Generates anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on subregions likely to contain objects. 
ZIP Li et al. (2018a) Inception v2 Faster RCNN (300) COCO (300) COCO (300) COCO (07+12) IJCV18 (Zoom out and In network for object Proposals, ZIP); Generate proposals in a deep conv/deconv network with multilayers; Proposed a map attention decision (MAD) unit to weight the feature channels input to RPN; 
DeNet Tychsen-Smith and Petersson (2017) ResNet101 Fast RCNN (300) (300) (300) (07+12) (07++12) ICCV17 Much faster than Faster RCNN; Introduces bounding box corner estimation to predict object proposals efficiently, replacing RPN; Does not require predefined anchors.



  Proposer Name Backbone Network Detector Tested Box Proposals (AR, COCO) Segment Proposals (AR, COCO) Published In Highlights 
  Segment Proposal Methods DeepMask Pinheiro et al. (2015) VGG16 Fast RCNN (100), (100), NIPS15 First to generate object mask proposals with DCNN; Slow inference time; Need segmentation annotations for training; Not sharing features with detection network; Achieved mAP of (500) with Fast RCNN. 
InstanceFCN Dai et al. (2016a) VGG16 (100), ECCV16 (Instance Fully Convolutional Networks, InstanceFCN); Combine ideas of FCN Long et al. (2015) and DeepMask Pinheiro et al. (2015); Introduce instance sensitive score maps; Need segmentation annotations to train their network. 
SharpMask Pinheiro et al. (2016) MPN Zagoruyko et al. (2016) Fast RCNN (100), (100), ECCV16 Leverages features at multiple convolutional layers by introducing a top down refinement module; Does not share features with detection network; Need segmentation annotations for training; Slow for real time use. 
FastMask Hu et al. (2017) ResNet39 (100), (100), CVPR17 Generates instance segment proposals efficiently in a one shot manner similar to SSD Liu et al. (2016), in order to make use of multiscale convolutional features in a deep network; Needs segmentation annotations for training. 
ScaleNet Qiao et al. (2017) ResNet (100), (100), ICCV17 Extends SharpMask by explicitly adding a scale prediction phase; Proposed ScaleNet to estimate the distribution of the object scales for an input image. Performs well on supermarket datasets. 


Table 5: Summary of object proposal methods using DCNN. The numbers in blue color denote the number of object proposals. The detection results on COCO are mAP@IoU[0.5, 0.95], unless stated otherwise.

Recent DCNN based object proposal methods generally fall into two categories: bounding box based and object segment based, with representative methods summarized in Table 5.

Figure 14: Illustration of the Region Proposal Network (RPN) proposed in Ren et al. (2015).

Bounding Box Proposal Methods are best exemplified by the RPN method Ren et al. (2015) of Ren et al., illustrated in Fig. 14. RPN predicts object proposals by sliding a small network over the feature map of the last shared CONV layer. At each sliding window location, it predicts multiple proposals simultaneously by using anchor boxes, where each anchor box (the terminology “an anchor box” or “an anchor” first appeared in Ren et al. (2015)) is centered at some location in the image, and is associated with a particular scale and aspect ratio. Ren et al. Ren et al. (2015) proposed to integrate RPN and Fast RCNN into a single network by sharing their convolutional layers. Such a design led to substantial speedup and the first end-to-end detection pipeline, Faster RCNN Ren et al. (2015). RPN has been broadly selected as the proposal method by many state of the art object detectors, as can be observed from Tables 3 and 4.
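The anchor idea described above can be illustrated with a short sketch. It is not the implementation of Ren et al.; it only shows how one anchor per (scale, aspect ratio) pair can be centered at every sliding window position. The function name `generate_anchors`, the stride of 16 pixels, the three scales and the three aspect ratios are assumptions chosen for illustration and are not specified in this text.

```python
from itertools import product


def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchor boxes (x1, y1, x2, y2) in image coordinates.

    One anchor per (scale, ratio) pair is centered at every position of a
    feat_h x feat_w feature map. All default values are assumptions.
    """
    anchors = []
    for row, col in product(range(feat_h), range(feat_w)):
        # Center of this feature map cell, mapped back to the image.
        cx = (col + 0.5) * stride
        cy = (row + 0.5) * stride
        for scale, ratio in product(scales, ratios):
            # ratio is height / width; the anchor area equals scale ** 2.
            w = scale / ratio ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = generate_anchors(feat_h=2, feat_w=3)
print(len(anchors))  # 2 * 3 positions, 9 anchors per position
```

In a real detector the proposal network would then predict, for each anchor, an objectness score and box offsets; that part is omitted here.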

Instead of fixing a set of anchors a priori as in MultiBox Erhan et al. (2014); Szegedy et al. (2014) and RPN Ren et al. (2015), Lu et al. Lu et al. (2016) proposed to generate anchor locations by using a recursive search strategy which can adaptively guide computational resources to focus on subregions likely to contain objects. Starting with the whole image, all regions visited during the search process serve as anchors. For any anchor region encountered during the search procedure, a scalar zoom indicator is used to decide whether to further partition the region, and a set of bounding boxes with objectness scores is computed with a deep network called Adjacency and Zoom Network (AZNet). AZNet extends RPN by adding a branch to compute the scalar zoom indicator in parallel with the existing branch.

There is further work attempting to generate object proposals by exploiting multilayer convolutional features Kong et al. (2016); Ghodrati et al. (2015); Yang et al. (2016a); Li et al. (2018a). Concurrent with RPN Ren et al. (2015), Ghodrati et al. Ghodrati et al. (2015) proposed DeepProposal which generates object proposals by using a cascade of multiple convolutional features, building an inverse cascade to select the most promising object locations and to refine their boxes in a coarse to fine manner. An improved variant of RPN, HyperNet Kong et al. (2016) designs Hyper Features which aggregate multilayer convolutional features and shares them both in generating proposals and detecting objects via an end to end joint training strategy. Yang et al. proposed CRAFT Yang et al. (2016a) which also used a cascade strategy, first training an RPN network to generate object proposals and then using them to train another binary Fast RCNN network to further distinguish objects from background. Li et al. Li et al. (2018a) proposed ZIP to improve RPN by leveraging a commonly used idea of predicting object proposals with multiple convolutional feature maps at different depths of a network to integrate both low level details and high level semantics. The backbone network used in ZIP is a “zoom out and in” network inspired by the conv and deconv structure Long et al. (2015).

Finally, recent work which deserves mention includes DeepBox Kuo et al. (2015), which proposed a light weight CNN to learn to rerank proposals generated by EdgeBox, and DeNet Tychsen-Smith and Petersson (2017), which introduces a bounding box corner estimation to predict object proposals efficiently, replacing RPN in a Faster RCNN style two stage detector.

Object Segment Proposal Methods Pinheiro et al. (2015, 2016) aim to generate segment proposals that are likely to correspond to objects. Segment proposals are more informative than bounding box proposals, and take a step further towards object instance segmentation Hariharan et al. (2014); Dai et al. (2016b); Li et al. (2017d). A pioneering work was DeepMask proposed by Pinheiro et al. Pinheiro et al. (2015), where segment proposals are learned directly from raw image data with a deep network. Sharing similarities with RPN, after a number of shared convolutional layers DeepMask splits the network into two branches to predict a class agnostic mask and an associated objectness score. Similar to the efficient sliding window prediction strategy in OverFeat Sermanet et al. (2014), the trained DeepMask network is applied in a sliding window manner to an image (and its rescaled versions) during inference. More recently, Pinheiro et al. Pinheiro et al. (2016) proposed SharpMask, which augments the feedforward DeepMask architecture with a top-down refinement module, similar to the architectures shown in Fig. 11 (b1) and (b2). SharpMask can efficiently integrate the spatially rich information from early features with the strong semantic information encoded in later layers to generate high fidelity object masks.

Motivated by Fully Convolutional Networks (FCN) for semantic segmentation Long et al. (2015) and DeepMask Pinheiro et al. (2015), Dai et al. proposed InstanceFCN Dai et al. (2016a) for generating instance segment proposals. Similar to DeepMask, the InstanceFCN network is split into two branches; however, both branches are fully convolutional. One branch generates a small set of instance sensitive score maps, followed by an assembling module that outputs instances, and the other branch predicts the objectness score. Hu et al. proposed FastMask Hu et al. (2017) to efficiently generate instance segment proposals in a one-shot manner similar to SSD Liu et al. (2016), in order to make use of multiscale convolutional features in a deep network. Sliding windows extracted densely from multiscale convolutional feature maps are input to a scale-tolerant attentional head module to predict segmentation masks and objectness scores. FastMask is claimed to run at 13 FPS on a resolution image with a slight trade off in average recall. Qiao et al. Qiao et al. (2017) proposed ScaleNet to extend previous object proposal methods like SharpMask Pinheiro et al. (2016) by explicitly adding a scale prediction phase. That is, ScaleNet estimates the distribution of object scales for an input image, upon which SharpMask searches the input image at the scales predicted by ScaleNet and outputs instance segment proposals. Qiao et al. Qiao et al. (2017) showed that their method outperformed the previous state of the art on supermarket datasets by a large margin.


  Detector Name Region Proposal Backbone DCNN Pipeline Used VOC07 Results VOC12 Results COCO Results Published In Full Name of Method and Highlights 


  MegDet Peng et al. (2018) RPN ResNet50 +FPN Faster RCNN CVPR18 Allow training with much larger minibatch size (like 256) than before by introducing cross GPU batch normalization; Can finish the COCO training in 4 hours on 128 GPUs and achieved improved accuracy; Won COCO2017 detection challenge. 
  SNIP Singh and Davis (2018) RPN DPN Chen et al. (2017b) +DCN Dai et al. (2017) RFCN CVPR18 (Scale Normalization for Image Pyramids, SNIP); A novel scale normalized training scheme; Empirically examined the effect of upsampling for small object detection; During training, selectively back propagates the gradients of object instances by ignoring gradients arising from objects of extreme scales. 
  SNIPER Singh et al. (2018) RPN ResNet101 +DCN Faster RCNN 2018 (Scale Normalization for Image Pyramids with Efficient Resampling, SNIPER); An efficient multiscale data augmentation strategy. 
  OHEM Shrivastava et al. (2016) SS VGG16 Fast RCNN (07+12) (07++12) CVPR16 (Online Hard Example Mining, OHEM); A simple and effective OHEM algorithm to improve training of region based detectors. 
  RetinaNet Lin et al. (2017b) free ResNet101 +FPN RetinaNet ICCV17 Proposed a simple dense detector called RetinaNet; Proposed a novel Focal Loss which focuses training on a sparse set of hard examples; High detection speed and high detection accuracy. 


Table 6: Representative methods for training strategies and class imbalance handling. Results on COCO are reported with Test-Dev.

4.4 Other Special Issues

Aiming at obtaining better and more robust DCNN feature representations, data augmentation tricks are commonly used Chatfield et al. (2014); Girshick (2015); Girshick et al. (2014). Augmentation can be used at training time, at test time, or both. It refers to perturbing an image by transformations that leave the underlying category unchanged, such as cropping, flipping, rotating, scaling and translating, in order to generate additional samples of the class. Data augmentation can affect the recognition performance of deep feature representations. Nevertheless, it has obvious limitations: both training and inference computational complexity increase significantly, limiting its usage in real applications.

Detecting objects under a wide range of scale variations, and especially detecting very small objects, stands out as one of the key challenges. It has been shown Huang et al. (2017b); Liu et al. (2016) that image resolution has a considerable impact on detection accuracy. Therefore, among those data augmentation tricks, scaling (especially a higher resolution input) is the most used, since high resolution inputs increase the chance that small objects are detected Huang et al. (2017b). Recently, Singh et al. proposed advanced and efficient data augmentation methods SNIP Singh and Davis (2018) and SNIPER Singh et al. (2018) to address the scale invariance problem, as summarized in Table 6. Motivated by the intuitive understanding that small and large objects are difficult to detect at smaller and larger scales respectively, Singh et al. presented a novel training scheme named SNIP, which can reduce scale variations during training without reducing training samples. SNIPER Singh et al. (2018) is an approach proposed for efficient multiscale training. It only processes context regions around ground truth objects at the appropriate scale instead of processing a whole image pyramid.

Shrivastava et al. Shrivastava et al. (2016) and Lin et al. Lin et al. (2017b) explored approaches to handle the extreme foreground-background class imbalance issue. Wang et al. Wang et al. (2017) proposed to train an adversarial network to generate examples with occlusions and deformations that are difficult for the object detector to recognize. There are some works focusing on developing better methods for nonmaximum suppression Bodla et al. (2017); Hosang et al. (2017); Tychsen-Smith and Petersson (2018).
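Returning to the augmentation transformations discussed in this subsection, the key practical point for detection is that the ground truth boxes must be transformed together with the image. The following is a minimal sketch of horizontal flipping, assuming boxes are stored as (x1, y1, x2, y2) in continuous pixel coordinates and the image as a list of pixel rows; the helper names `hflip_image` and `hflip_boxes` are hypothetical and not taken from any of the cited works.

```python
def hflip_image(image):
    """Flip an image, stored as a list of pixel rows, left to right."""
    return [row[::-1] for row in image]


def hflip_boxes(boxes, image_width):
    """Mirror (x1, y1, x2, y2) boxes to match a horizontally flipped image."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        # After mirroring, the old right edge becomes the new left edge.
        flipped.append((image_width - x2, y1, image_width - x1, y2))
    return flipped


image = [[0, 1, 2, 3],
         [4, 5, 6, 7]]
boxes = [(0, 0, 1, 2)]  # a box covering the first column

flipped_image = hflip_image(image)
flipped_boxes = hflip_boxes(boxes, image_width=4)
print(flipped_image)
print(flipped_boxes)  # the box now covers the last column
```

The category label is unchanged by the flip, which is what makes it a valid augmentation in the sense described above; applying the flip twice returns the original image and boxes.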

5 Datasets and Performance Evaluation

5.1 Datasets

Datasets have played a key role throughout the history of object recognition research. They have been one of the most important factors for the considerable progress in the field, not only as a common ground for measuring and comparing performance of competing algorithms, but also pushing the field towards increasingly complex and challenging problems. The present access to large numbers of images on the Internet makes it possible to build comprehensive datasets of increasing numbers of images and categories in order to capture an ever greater richness and diversity of objects. The rise of large scale datasets with millions of images has paved the way for significant breakthroughs and enabled unprecedented performance in object recognition. Recognizing space limitations, we refer interested readers to several papers Everingham et al. (2010, 2015); Lin et al. (2014); Russakovsky et al. (2015); Krishna et al. (2017) for detailed description of related datasets.

Beginning with Caltech101 Li et al. (2004), representative datasets include Caltech256 Griffin et al. (2007), Scenes15 Lazebnik et al. (2006), PASCAL VOC (2007) Everingham et al. (2015), Tiny Images Torralba A (2008), CIFAR10 Krizhevsky (2009), SUN Xiao et al. (2014), ImageNet Deng et al. (2009), Places Zhou et al. (2017a), MS COCO Lin et al. (2014), and Open Images Krasin et al. (2017). The features of these datasets are summarized in Table 7, and selected sample images are shown in Fig. 15.

Earlier datasets, such as Caltech101 or Caltech256, were criticized for their lack of intraclass variation. As a result, SUN Xiao et al. (2014) was collected by finding images depicting various scene categories, and many of its images have scene and object annotations which can support scene recognition and object detection. Tiny Images Torralba A (2008) created a dataset at an unprecedented scale, giving comprehensive coverage of all object categories and scenes. However, its annotations were not manually verified and contain numerous errors, so two benchmarks with reliable labels (CIFAR10 and CIFAR100 Krizhevsky (2009)) were derived from Tiny Images.

PASCAL VOC Everingham et al. (2010, 2015), a multiyear effort devoted to the creation and maintenance of a series of benchmark datasets for classification and object detection, set the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. Starting from only four categories in 2005, the dataset has grown to 20 categories that are common in everyday life, as shown in Fig. 15. ImageNet Deng et al. (2009) contains over 14 million images and over 20,000 categories, and is the backbone of the ILSVRC Deng et al. (2009); Russakovsky et al. (2015) challenge, which has pushed object recognition research to new heights.

ImageNet has been criticized because the objects in the dataset tend to be large and well centered, making the dataset atypical of real world scenarios. With the goal of addressing this problem and pushing research to richer image understanding, researchers created the MS COCO database Lin et al. (2014). Images in MS COCO are complex everyday scenes containing common objects in their natural context, closer to real life, and objects are labeled using fully-segmented instances to provide more accurate detector evaluation. The Places database Zhou et al. (2017a) contains 10 million scene images, labeled with scene semantic categories, offering the opportunity for data hungry deep learning algorithms to reach human level recognition of visual patterns. More recently, Open Images Krasin et al. (2017) is a dataset of about 9 million images that have been annotated with image level labels and object bounding boxes.


  Dataset Name Total Images Categories Images Per Category Objects Per Image Image Size Started Year Highlights  
  MNIST LeCun et al. (1998) Handwritten digits; Single object; Binary images; Blank backgrounds; Small image size. 
  Caltech101 Li et al. (2004) Relatively smaller number of training images; A single object centered on each image; No clutter in most images; Limited intraclass variations; Less applicable to real-world evaluation. 
  Caltech256 Griffin et al. (2007) Similar to the Caltech101, a larger number of classes than the Caltech101. 
  Scenes15 Lazebnik et al. (2006) Collected for scene recognition. 
  Tiny Images Torralba A (2008) 79 million+ Largest number of images, largest number of categories, low resolution images, not manually verified, less suitable for algorithm evaluation; Two subsets CIFAR10 and CIFAR100 as popular benchmarks for object classification. 
  PASCAL VOC (2012) Everingham et al. (2015) Covers only 20 categories that are common in everyday life; Large number of training images; Close to real-world applications; Significantly larger intraclass variations; Objects in scene context; Multiple objects in one image; Contains many difficult samples; Creates the precedent for standardized evaluation of recognition algorithms in the form of annual competitions. 
  SUN Xiao et al. (2014) A large number of scene categories; The number of instances per object category exhibits the long tail phenomenon; Two benchmarks SUN397 and SUN2012 for scene recognition and object detection respectively. 
  ImageNet Russakovsky et al. (2015) 14 million+ Considerably larger number of object categories; More instances and more categories of objects per image; More challenging than PASCAL VOC; Popular subset benchmarks ImageNet1000; The backbone of ILSVRC challenge; Images are object-centric. 
  MS COCO Lin et al. (2014) Even closer to real world scenarios; Each image contains more instances of objects and richer object annotation information; Contains object segmentation annotation data that is not available in the ImageNet dataset; The next major dataset for large scale object detection and instance segmentation. 
  Places Zhou et al. (2017a) 10 million+ The largest labeled dataset for scene recognition; Four subsets Places365 Standard, Places365 Challenge, Places 205 and Places88 as benchmarks. 
  Open Images Krasin et al. (2017) 9 million+ + varied A dataset of about 9 million images that have been annotated with image level labels and object bounding boxes. 


Table 7: Popular databases for object recognition. Some example images from MNIST, Caltech101, CIFAR10, PASCAL VOC and ImageNet are shown in Fig. 15.
Figure 15: Some example images from MNIST, Caltech101, CIFAR10, PASCAL VOC and ImageNet. See Table 7 for summary of these datasets.

There are three famous challenges for generic object detection: PASCAL VOC Everingham et al. (2010, 2015), ILSVRC Russakovsky et al. (2015) and MS COCO Lin et al. (2014). Each challenge consists of two components: (i) a publicly available dataset of images together with ground truth annotation and standardized evaluation software; and (ii) an annual competition and corresponding workshop. Statistics for the number of images and object instances in the training, validation and testing datasets for the detection challenges are given in Table 8 (the annotations on the test set are not publicly released, except for PASCAL VOC2007).

For the PASCAL VOC challenge, since 2009 the data consist of the previous years’ images augmented with new images, allowing the number of images to grow each year and, more importantly, meaning that test results can be compared on the previous years’ images.

ILSVRC Russakovsky et al. (2015) scales up PASCAL VOC’s goal of standardized training and evaluation of detection algorithms by more than an order of magnitude in the number of object classes and images. The ILSVRC object detection challenge has been run annually from 2013 to the present.

The COCO object detection challenge is designed to push the state of the art in generic object detection forward, and has been run annually from 2015 to the present. It features two object detection tasks: using either bounding box output or object instance segmentation output. It has fewer object categories than ILSVRC (80 in COCO versus 200 in ILSVRC object detection) but more instances per category (11000 on average compared to about 2600 in ILSVRC object detection). In addition, it contains object segmentation annotations which are not currently available in ILSVRC. COCO introduced several new challenges: (1) it contains objects at a wide range of scales, including a high percentage of small objects (e.g. smaller than of image area Singh and Davis (2018)); (2) objects are less iconic and amid clutter or heavy occlusion; and (3) the evaluation metric (see Table 9) encourages more accurate object localization.

COCO has become the most widely used dataset for generic object detection, with the dataset statistics for training, validation and testing summarized in Table 8. Starting in 2017, the test set has only the Dev and Challenge splits, where the Test-Dev split is the default test data, and results in papers are generally reported on Test-Dev to allow for fair comparison.

2018 saw the introduction of the Open Images Object Detection Challenge, following in the tradition of PASCAL VOC, ImageNet and COCO, but at an unprecedented scale. It offers a broader range of object classes than previous challenges, and has two tasks: bounding box object detection of 500 different classes and visual relationship detection which detects pairs of objects in particular relations.


  Challenge  Object Classes  Number of Images (Train, Val, Test)  Number of Annotated Objects (Train, Val)
  PASCAL VOC Object Detection Challenge
  ILSVRC Object Detection Challenge
  MS COCO Object Detection Challenge (MS COCO15, MS COCO16, MS COCO17, MS COCO18)
  Open Images Object Detection Challenge

Table 8: Statistics of commonly used object detection datasets. Object statistics for VOC challenges list the nondifficult objects used in the evaluation (all annotated objects). For the COCO challenge, prior to 2017, the test set had four splits (Dev, Standard, Reserve, and Challenge), with each having about 20K images. Starting in 2017, the test set has only the Dev and Challenge splits, with the other two splits removed.

5.2 Evaluation Criteria

There are three criteria for evaluating the performance of detection algorithms: detection speed (Frames Per Second, FPS), precision, and recall. The most commonly used metric is Average Precision (AP), derived from precision and recall. AP is usually evaluated in a category specific manner, i.e., computed for each object category separately. In generic object detection, detectors are usually tested in terms of detecting a number of object categories. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is adopted as the final measure of performance. (In object detection challenges such as PASCAL VOC and ILSVRC, the winning entry of each object category is that with the highest AP score, and the winner of the challenge is the team that wins on the most object categories. The mAP is also used as the measure of a team’s performance, and is justified since the ranking of teams by mAP was always the same as the ranking by the number of object categories won Russakovsky et al. (2015).) More details on these metrics can be found in Everingham et al. (2010, 2015); Russakovsky et al. (2015); Hoiem et al. (2012).

Figure 16: The algorithm for determining TPs and FPs by greedily matching object detection results to ground truth boxes.

The standard outputs of a detector applied to a testing image $I$ are the predicted detections $\{(b_j, c_j, p_j)\}_j$, indexed by $j$. A given detection $(b, c, p)$ (omitting $j$ for notational simplicity) denotes the predicted location (i.e., the Bounding Box, BB) $b$ with its predicted category label $c$ and its confidence level $p$. A predicted detection $(b, c, p)$ is regarded as a True Positive (TP) if

  • The predicted class label $c$ is the same as the ground truth label $c_g$.

  • The overlap ratio IOU (Intersection Over Union) Everingham et al. (2010); Russakovsky et al. (2015)

    $$\mathrm{IOU}(b, b^g) = \frac{\mathrm{area}(b \cap b^g)}{\mathrm{area}(b \cup b^g)}$$

    between the predicted BB $b$ and the ground truth one $b^g$ is not smaller than a predefined threshold $\varepsilon$. Here $\mathrm{area}(b \cap b^g)$ denotes the intersection of the predicted and ground truth BBs, and $\mathrm{area}(b \cup b^g)$ their union. A typical value of $\varepsilon$ is 0.5.

Otherwise, it is considered as a False Positive (FP). The confidence level $p$ is usually compared with some threshold $\beta$ to determine whether the predicted class label $c$ is accepted.
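The two TP conditions above (matching class label and sufficient overlap) can be written down directly. The sketch below assumes axis-aligned boxes given as (x1, y1, x2, y2) tuples; the function names `iou` and `is_true_positive` and the default threshold of 0.5 are illustrative choices, not part of any benchmark toolkit.

```python
def box_area(box):
    """Area of an (x1, y1, x2, y2) box; zero if the box is empty."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_box = (max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                 min(box_a[2], box_b[2]), min(box_a[3], box_b[3]))
    inter = box_area(inter_box)
    union = box_area(box_a) + box_area(box_b) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_label, gt_label, pred_box, gt_box, iou_thresh=0.5):
    """Same class label and overlap not smaller than the threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= iou_thresh


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap)  # intersection 1, union 7
print(is_true_positive("dog", "dog", (0, 0, 2, 2), (0, 0, 2, 3)))
```

This check covers a single prediction and a single ground truth box; handling several detections that overlap the same object requires the matching step discussed next in the text.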

AP is computed separately for each of the object classes, based on Precision and Recall. For a given object class $c$ and a testing image $I_i$, let $\{(b_{ij}, p_{ij})\}_{j=1}^{M}$ denote the detections returned by a detector, ranked by the confidence $p_{ij}$ in decreasing order. Let $B = \{b^g_{ik}\}_{k=1}^{K}$ be the ground truth boxes on image $I_i$ for the given object class $c$. Each detection $(b_{ij}, p_{ij})$ is either a TP or a FP, which can be determined via the algorithm in Fig. 16. (It is worth noting that for a given threshold $\beta$, multiple detections of the same object in an image are not all considered correct detections; only the detection with the highest confidence level is considered a TP and the rest are FPs.) Based on the TP and FP detections, the precision $P(\beta)$ and recall $R(\beta)$ Everingham et al. (2010) can be computed as a function of the confidence threshold $\beta$, so by varying the confidence threshold different pairs $(P, R)$ can be obtained, in principle allowing precision to be regarded as a function of recall, i.e. $P(R)$, from which the Average Precision (AP) Everingham et al. (2010); Russakovsky et al. (2015) can be found.
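Since the contents of Fig. 16 are not reproduced in this text, the following is only a sketch of one common greedy matching scheme (in the style of the PASCAL VOC evaluation), followed by one of several conventions for turning the ranked TP/FP labels into AP (the area under the interpolated precision-recall curve). The function names, the data layout and the interpolation choice are assumptions for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, iou_thresh=0.5):
    """Greedily label (box, score) detections of one class in one image.

    Returns (score, is_tp) pairs in order of decreasing score. Each ground
    truth box can be claimed only once, by its most confident match.
    """
    matched = [False] * len(gt_boxes)
    labeled = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_k = 0.0, -1
        for k, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_k = overlap, k
        is_tp = best_k >= 0 and best_iou >= iou_thresh and not matched[best_k]
        if is_tp:
            matched[best_k] = True
        labeled.append((score, is_tp))
    return labeled


def average_precision(labeled, num_gt):
    """AP for one class from (score, is_tp) pairs pooled over all images."""
    if num_gt == 0:
        return 0.0
    labeled = sorted(labeled, key=lambda r: r[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, is_tp in labeled:
        tp += is_tp
        fp += not is_tp
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolation: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [((0, 0, 10, 10), 0.9),    # exact match of the first object
        ((1, 1, 10, 10), 0.8),    # duplicate of the first object
        ((20, 20, 30, 31), 0.7)]  # good match of the second object
labels = label_detections(dets, gts)
ap = average_precision(labels, num_gt=len(gts))
print(labels)
print(ap)
```

In this toy case the duplicate detection is counted as an FP even though its overlap exceeds the threshold, which is exactly the behaviour described in the parenthetical note above. Averaging such per-class AP values over all classes gives mAP.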

Table 9 summarizes the main metrics used in the PASCAL, ILSVRC and MS COCO object detection challenges.


  Metric  Meaning  Definition and Description
  TP  True Positive  A true positive detection, per Fig. 16.
  FP  False Positive  A false positive detection, per Fig. 16.
  $\beta$  Confidence Threshold  A confidence threshold for computing $P(\beta)$ and $R(\beta)$.
  $\varepsilon$  IOU Threshold
      VOC: Typically around $0.5$
      ILSVRC: $\min\left(0.5, \frac{wh}{(w+10)(h+10)}\right)$; $w \times h$ is the size of a GT box.
      MS COCO: Ten IOU thresholds $\varepsilon \in \{0.5 : 0.05 : 0.95\}$
  $P(\beta)$  Precision  The fraction of correct detections out of the total detections returned by the detector with confidence of at least $\beta$.
  $R(\beta)$  Recall  The fraction of all objects detected by the detector having a confidence of at least $\beta$.
  AP  Average Precision  Computed over the different levels of recall achieved by varying the confidence $\beta$.
  mAP  mean Average Precision
      VOC: AP at a single IOU and averaged over all classes.
      ILSVRC: AP at a modified IOU and averaged over all classes.
      MS COCO:
      $AP_{coco}$: mAP averaged over ten IOUs: $\{0.5 : 0.05 : 0.95\}$;
      $AP^{IOU=0.5}_{coco}$: mAP at IOU=0.50 (PASCAL VOC metric);
      $AP^{IOU=0.75}_{coco}$: mAP at IOU=0.75 (strict metric);
      $AP^{small}_{coco}$: mAP for small objects of area smaller than $32^2$;
      $AP^{medium}_{coco}$: mAP for objects of area between $32^2$ and $96^2$;
      $AP^{large}_{coco}$: mAP for large objects of area bigger than $96^2$;
  AR  Average Recall  The maximum recall given a fixed number of detections per image, averaged over all categories and IOU thresholds.
  AR  Average Recall
      MS COCO:
      $AR^{max=1}_{coco}$: AR given 1 detection per image;
      $AR^{max=10}_{coco}$: AR given 10 detections per image;
      $AR^{max=100}_{coco}$: AR given 100 detections per image;
      $AR^{small}_{coco}$: AR for small objects of area smaller than $32^2$;
      $AR^{medium}_{coco}$: AR for objects of area between $32^2$ and $96^2$;
      $AR^{large}_{coco}$: AR for large objects of area bigger than $96^2$;


Table 9: Summarization of commonly used metrics for evaluating object detectors.
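To make the MS COCO rows of Table 9 concrete, the sketch below builds the ten IOU thresholds, assigns an object to a size bucket, and averages a per-threshold score. It is an illustration, not the official evaluation code: the function names are hypothetical, the area limits of 32 x 32 and 96 x 96 pixels follow the usual COCO convention, and box area is used here for simplicity whereas the official annotations carry their own area field.

```python
def coco_iou_thresholds():
    """The ten IOU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def size_bucket(box, small_max=32 ** 2, medium_max=96 ** 2):
    """Classify an (x1, y1, x2, y2) box as small, medium or large by area."""
    x1, y1, x2, y2 = box
    area = (x2 - x1) * (y2 - y1)
    if area < small_max:
        return "small"
    if area < medium_max:
        return "medium"
    return "large"


def mean_over_thresholds(score_at_threshold):
    """Average a per-threshold score (e.g. mAP at one IOU) over all ten."""
    thresholds = coco_iou_thresholds()
    return sum(score_at_threshold(t) for t in thresholds) / len(thresholds)


thresholds = coco_iou_thresholds()
print(thresholds)
print(size_bucket((0, 0, 50, 50)))  # area 2500 lies between 32^2 and 96^2

# Toy detector: perfect up to IOU 0.75, useless at stricter thresholds.
toy_score = mean_over_thresholds(lambda t: 1.0 if t <= 0.75 else 0.0)
print(toy_score)
```

The toy example shows why this averaged metric rewards accurate localization: a detector whose boxes are only loosely aligned scores well at IOU 0.5 but loses most of its score once stricter thresholds are averaged in.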

5.3 Performance

A large variety of detectors has appeared in the last several years, and the introduction of standard benchmarks such as PASCAL VOC Everingham et al. (2010, 2015), ImageNet Russakovsky et al. (2015) and COCO Lin et al. (2014) has made it easier to compare detectors with respect to accuracy. As can be seen from our earlier discussion in Sections 3 and 4, it is difficult to objectively compare detectors in terms of accuracy, speed and memory alone, as they can differ in fundamental / contextual respects, including the following:

  • Meta detection frameworks, such as RCNN Girshick et al. (2014), Fast RCNN Girshick (2015), Faster RCNN Ren et al. (2015), RFCN Dai et al. (2016c), Mask RCNN He et al. (2017), YOLO Redmon et al. (2016) and SSD Liu et al. (2016);

  • Backbone networks such as VGG Simonyan and Zisserman (2015), Inception Szegedy et al. (2015); Ioffe and Szegedy (2015); Szegedy et al. (2016), ResNet He et al. (2016), ResNeXt Xie et al. (2017), Xception Chollet (2017) and DetNet Li et al. (2018b) etc. listed in Table 2;

  • Innovations such as multilayer feature combination Lin et al. (2017a); Shrivastava et al. (2017); Fu et al. (2017), deformable convolutional networks Dai et al. (2017), deformable RoI pooling Ouyang et al. (2015); Dai et al. (2017), heavier heads Ren et al. (2017b); Peng et al. (2018), and lighter heads Li et al. (2018c);

  • Pretraining with datasets such as ImageNet Russakovsky et al. (2015), COCO Lin et al. (2014), Places Zhou et al. (2017a), JFT Hinton et al. (2015) and Open Images Krasin et al. (2017);

  • Different detection proposal methods and different numbers of object proposals;

  • Train/test data augmentation “tricks” such as multicrop, horizontal flipping, multiscale images and novel multiscale training strategies Singh and Davis (2018); Singh et al. (2018) etc, mask tightening, and model ensembling.

Although it may be impractical to compare every recently proposed detector, it is nevertheless highly valuable to integrate representative and publicly available detectors into a common platform and to compare them in a unified manner. There has been very limited work in this regard, except for the study by Huang et al. Huang et al. (2017b) of the trade off between accuracy and speed of three main families of detectors (Faster RCNN Ren et al. (2015), RFCN Dai et al. (2016c) and SSD Liu et al. (2016)) by varying the backbone network, image resolution, and the number of box proposals etc.

Figure 17: Evolution of object detection performance on COCO (Test-Dev results). Results are quoted from Girshick (2015); He et al. (2017); Ren et al. (2017a) accordingly. The backbone network, the design of detection framework and the availability of good and large scale datasets are the three most important factors in detection.

Tables 3, 4, 5, 6 and 10 summarize the best reported performance of many methods on three widely used standard benchmarks. These results were all reported on the same test benchmarks, even though the methods differ in one or more of the aspects listed above.

Figs. 1 and 17 present a very brief overview of the state of the art, summarizing the best detection results of the PASCAL VOC, ILSVRC and MSCOCO challenges. More results can be found at detection challenge websites ILSVRC detection challenge results (2018); MS COCO detection leaderboard (2018); PASCAL VOC detection leaderboard (2018). In summary, the backbone network, the detection framework design and the availability of large scale datasets are the three most important factors in detection. Furthermore, ensembles of multiple models, the incorporation of context features, and data augmentation all help to achieve better accuracy.

In less than five years, since AlexNet Krizhevsky et al. (2012a) was proposed, the Top5 error on ImageNet classification Russakovsky et al. (2015) with 1000 classes has dropped from 16% to 2%, as shown in Fig. 9. However, the mAP of the best performing detector Peng et al. (2018) (which is only trained to detect 80 classes) on COCO Lin et al. (2014) has reached , even at 0.5 IoU, illustrating clearly how object detection is much harder than image classification. The accuracy level achieved by the state of the art detectors is far from satisfying the requirements of general purpose practical applications, so there remains significant room for future improvement.
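To make the "0.5 IoU" criterion concrete, the following minimal sketch computes the intersection over union of two axis-aligned boxes and applies the threshold. It assumes boxes in (x_min, y_min, x_max, y_max) form and an inclusive threshold; the function names are hypothetical, and a full mAP evaluation would additionally match class labels, rank detections by score, and penalize duplicate detections.

```python
from typing import Tuple

# Assumed box convention: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_localized(pred: Box, gt: Box, threshold: float = 0.5) -> bool:
    """True if the prediction overlaps the ground truth enough (inclusive threshold assumed)."""
    return iou(pred, gt) >= threshold


if __name__ == "__main__":
    gt = (0.0, 0.0, 10.0, 10.0)
    print(iou((2.0, 0.0, 12.0, 10.0), gt))           # shifted by 2 pixels
    print(is_localized((5.0, 0.0, 15.0, 10.0), gt))  # shifted by 5 pixels
```

In this toy case a box shifted by a fifth of its width still passes the 0.5 threshold (IoU = 80/120), whereas a shift of half its width does not (IoU = 50/150), which gives a feel for how permissive the criterion is.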

6 Conclusions

Generic object detection is an important and challenging problem in computer vision, and has received considerable attention. Thanks to the remarkable development of deep learning techniques, the field of object detection has evolved dramatically. As a comprehensive survey on deep learning for generic object detection, this paper has highlighted the recent achievements, provided a structural taxonomy for methods according to their roles in detection, summarized existing popular datasets and evaluation criteria, and discussed performance for the most representative methods.

Despite the tremendous successes achieved in the past several years (e.g. detection accuracy improving significantly from in ILSVRC2013 to in ILSVRC2017), there remains a huge gap between the state-of-the-art and human-level performance, especially in terms of open world learning. Much work remains to be done, which we see focused on the following eight domains:

(1) Open World Learning: The ultimate goal is to develop object detection systems that are capable of accurately and efficiently recognizing and localizing instances of all object categories (thousands or more object classes Dean et al. (2013)) in all open world scenes, competing with the human visual system. Recent object detection algorithms are learned with limited datasets Everingham et al. (2010, 2015); Lin et al. (2014); Russakovsky et al. (2015), recognizing and localizing the object categories included in the dataset, but blind, in principle, to other object categories outside the dataset, although ideally a powerful detection system should be able to recognize novel object categories Lake et al. (2015); Hariharan and Girshick (2017). Current detection datasets Everingham et al. (2010); Russakovsky et al. (2015); Lin et al. (2014) contain only dozens to hundreds of categories, significantly fewer than humans can recognize. To achieve this goal, new large-scale labeled datasets with significantly more categories for generic object detection will need to be developed, since state of the art CNNs require extensive data to train well. However, collecting such massive amounts of data, particularly bounding box labels for object detection, is very expensive, especially for hundreds of thousands of categories.

(2) Better and More Efficient Detection Frameworks: One of the factors for the tremendous successes in generic object detection has been the development of better detection frameworks, both region-based (RCNN Girshick et al. (2014), Fast RCNN Girshick (2015), Faster RCNN Ren et al. (2015), Mask RCNN He et al. (2017)) and one-stage detectors (YOLO Redmon et al. (2016), SSD Liu et al. (2016)). Region-based detectors have the highest accuracy, but are too computationally intensive for embedded or real-time systems. One-stage detectors have the potential to be faster and simpler, but have not yet reached the accuracy of region-based detectors. One possible limitation is that state of the art object detectors depend heavily on the underlying backbone networks, which were initially optimized for image classification; the differences between classification and detection cause a learning bias. One potential strategy is therefore to learn object detectors from scratch, as in the DSOD detector Shen et al. (2017).

(3) Compact and Efficient Deep CNN Features: Another significant factor in the considerable progress in generic object detection has been the development of powerful deep CNNs, which have increased remarkably in depth, from several layers (e.g., AlexNet Krizhevsky et al. (2012b)) to hundreds of layers (e.g., ResNet He et al. (2016), DenseNet Huang et al. (2017a)). These networks have millions to hundreds of millions of parameters, requiring massive data and power-hungry GPUs for training, which again limits their use in real-time and embedded applications. In response, there has been growing research interest in designing compact and lightweight networks Chen et al. (2017a); Alvarez and Salzmann (2016); Huang et al. (2018); Howard et al. (2017); Lin et al. (2017c); Yu et al. (2018), network compression and acceleration Cheng et al. (2018); Hubara et al. (2016); Song Han (2016); Li et al. (2017a, c), and network interpretation and understanding Bruna and Mallat (2013b); Mahendran and Vedaldi (2016); Montavon et al. (2018).
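To give a rough sense of where the parameter savings in lightweight designs can come from, the sketch below compares the parameter count of a standard convolution with that of a depthwise separable convolution, the building block used in lightweight networks such as those of Howard et al. (2017). The layer sizes are hypothetical and chosen only for illustration, both function names are assumptions, and bias terms are included by assumption (many practical networks omit them in favour of normalization layers).

```python
def conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a standard k x k convolution."""
    return k * k * c_in * c_out + c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights plus biases of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + c_in   # one k x k filter per input channel, plus biases
    pointwise = c_in * c_out + c_out  # 1 x 1 convolution that mixes channels, plus biases
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 256 input and 256 output channels.
    standard = conv_params(3, 256, 256)
    separable = depthwise_separable_params(3, 256, 256)
    print(standard, separable, round(standard / separable, 2))
```

For this hypothetical layer the separable form needs 68,352 parameters instead of 590,080, a reduction of roughly 8.6 times; the actual savings in a full network depend on its layer configuration.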

(4) Robust Object Representations: One important factor which makes the object recognition problem so challenging is the great variability in real-world images, including viewpoint and lighting changes, object scale, object pose, object part deformations, background clutter, occlusions, changes in appearance, image blur, image resolution, noise, and camera limitations and distortions. Despite the advances in deep networks, they are still limited by a lack of robustness to these many variations Liu et al. (2017); Chellappa (2016), which significantly constrains the usability for real-world applications.

(5) Context Reasoning: Real-world objects typically coexist with other objects and environments. It has been recognized that contextual information (object relations, global scene statistics) helps object detection and recognition Oliva and Torralba (2007), especially in situations of small or occluded objects or poor image quality. There was extensive work preceding deep learning Malisiewicz and Efros (2009); Murphy et al. (2003); Rabinovich et al. (2007); Divvala et al. (2009); Galleguillos and Belongie (2010); however, since the advent of deep learning there has been only very limited progress in exploiting contextual information Chen and Gupta (2017); Gidaris and Komodakis (2015); Hu et al. (2018a). How to efficiently and effectively incorporate contextual information remains to be explored, ideally guided by how humans are able to quickly direct their attention to objects of interest in natural scenes.

(6) Object Instance Segmentation: Continuing the trend of moving towards a richer and more detailed understanding of image content (e.g., from image classification to single object localization to object detection), a next challenge would be to tackle pixel-level object instance segmentation Lin et al. (2014); He et al. (2017); Hu et al. (2018c), as object instance segmentation can play an important role in many potential applications that require the precise boundaries of individual instances.

(7) Weakly Supervised or Unsupervised Learning: Current state of the art detectors employ fully supervised models learned from labelled data with object bounding boxes or segmentation masks Everingham et al. (2015); Lin et al. (2014); Russakovsky et al. (2015). However, such fully supervised learning has serious limitations: the assumption that bounding box annotations are available may become problematic, especially when the number of object categories is large. Fully supervised learning is not scalable in the absence of fully labelled training data; it is therefore valuable to study how the power of CNNs can be leveraged in weakly supervised or unsupervised detection Bilen and Vedaldi (2016); Diba et al. (2017); Shi et al. (2017).

(8) 3D Object Detection: The progress of depth cameras has enabled the acquisition of depth information in the form of RGB-D images or 3D point clouds. The depth modality can be employed to help object detection and recognition; however, there is only limited work in this direction Chen et al. (2015c); Pepik et al. (2015); Xiang et al. (2014), which might benefit from taking advantage of large collections of high quality CAD models Wu et al. (2015).

The research field of generic object detection is still far from complete; given the massive algorithmic breakthroughs over the past five years, we remain optimistic about the opportunities over the next five years.

Columns: Detector Name; RP; Backbone DCNN; Input ImgSize; VOC07 Results; VOC12 Results; Speed (FPS); Published In; Source Code; Highlights and Disadvantages

Region based (Section 3.1)

RCNN Girshick et al. (2014)
RP: SS; Backbone DCNN: AlexNet; Input ImgSize: Fixed; VOC07 Results: (07); VOC12 Results: (12); Published In: CVPR14; Source Code: Caffe, Matlab.
Highlights: First to integrate CNN with RP methods; Dramatic performance improvement over previous state of the art; ILSVRC2013 detection result mAP.
Disadvantages: Multistage pipeline of sequentially-trained (External RP computation, CNN finetuning, Each warped RP passing through CNN, SVM and BBR training); Training is expensive in space and time; Testing is slow.

SPPNet He et al. (2014)
RP: SS; Backbone DCNN: ZFNet; Input ImgSize: Arbitrary; VOC07 Results: (07); Published In: ECCV14; Source Code: Caffe, Matlab.
Highlights: First to introduce SPP into CNN architecture; Enable convolutional feature sharing; Accelerate RCNN evaluation by orders of magnitude without sacrificing performance; Faster than OverFeat; ILSVRC2013 detection result mAP.
Disadvantages: Inherit disadvantages of RCNN except the speedup; Does not result in much speedup of training; Finetuning not able to update the CONV layers before SPP layer.

Fast RCNN Girshick (2015)
RP: SS; Backbone DCNN: AlexNet, VGGM, VGG16; Input ImgSize: Arbitrary; VOC07 Results: (VGG) (07+12); VOC12 Results: (VGG) (07++12); Published In: ICCV15; Source Code: Caffe, Python.
Highlights: First to enable end to end detector training (when ignoring the process of RP generation); Design a RoI pooling layer (a special case of SPP layer); Much faster and more accurate than SPPNet; No disk storage required for feature caching.
Disadvantages: External RP computation is exposed as the new bottleneck; Still too slow for real time applications.

Faster RCNN Ren et al. (2015)
RP: RPN; Backbone DCNN: ZFNet, VGG; Input ImgSize: Arbitrary; VOC07 Results: (VGG) (07+12); VOC12 Results: (VGG) (07++12); Published In: NIPS15; Source Code: Caffe, Matlab, Python.
Highlights: Propose RPN for generating nearly cost free and high quality RPs instead of selective search; Introduce translation invariant and multiscale anchor boxes as references in RPN; Unify RPN and Fast RCNN into a single network by sharing CONV layers; An order of magnitude faster than Fast RCNN without performance loss; Can run testing at 5 FPS with VGG16.
Disadvantages: Training is complex, not a streamlined process; Still fall short of real time.

RCNNR Lenc and Vedaldi (2015)
RP: New; Backbone DCNN: ZFNet +SPP; Input ImgSize: Arbitrary; VOC07 Results: (07); Published In: BMVC15.
Highlights: Replace selective search with static RPs; Prove the possibility of building integrated, simpler and faster detectors that rely exclusively on CNN.
Disadvantages: Fall short of real time; Decreased accuracy from not having good RPs.

RFCN Dai et al. (2016c)
RP: RPN; Backbone DCNN: ResNet101; Input ImgSize: Arbitrary; VOC07 Results: (07+12), (07+12+CO); VOC12 Results: (07++12), (07++12+CO); Published In: NIPS16; Source Code: Caffe, Matlab.
Highlights: Fully convolutional detection network; Minimize the amount of regionwise computation; Design a set of position sensitive score maps using a bank of specialized CONV layers; Faster than Faster RCNN without sacrificing much accuracy.
Disadvantages: Training is not a streamlined process; Still fall short of real time.

Mask RCNN He et al. (2017)
RP: RPN; Backbone DCNN: ResNet101, ResNeXt101; Input ImgSize: Arbitrary; Results: (ResNeXt101) (COCO Result); Published In: ICCV17; Source Code: Caffe, Matlab, Python.
Highlights: A simple, flexible, and effective framework for object instance segmentation; Extends Faster RCNN by adding another branch for predicting an object mask in parallel with the existing branch for BB prediction; Feature Pyramid Network (FPN) is utilized; Achieved outstanding performance.
Disadvantages: Fall short of real time applications.

Unified (Section 3.2)

OverFeat Sermanet et al. (2014)
Backbone DCNN: AlexNet like; Input ImgSize: Arbitrary; Published In: ICLR14; Source Code: c++.
Highlights: Enable convolutional feature sharing; Multiscale image pyramid CNN feature extraction; Win the ILSVRC2013 localization competition; Significantly faster than RCNN; ILSVRC2013 detection result mAP.
Disadvantages: Multistage pipeline of sequentially-trained (classifier model training, class specific localizer model finetuning); Single bounding box regressor; Cannot handle multiple object instances of the same class in an image; Too slow for real time applications.

YOLO Redmon et al. (2016)
Backbone DCNN: GoogLeNet like; Input ImgSize: Fixed; VOC07 Results: (07+12); VOC12 Results: (07++12); Speed (FPS): (VGG); Published In: CVPR16; Source Code: DarkNet.
Highlights: First efficient unified detector; Drop RP process completely; Elegant and efficient detection framework; Significantly faster than previous detectors; YOLO runs at 45 FPS and Fast YOLO at 155 FPS.
Disadvantages: Accuracy falls far behind state of the art detectors; Struggle to localize small objects.

YOLOv2 Redmon and Farhadi (2017)
Backbone DCNN: DarkNet; Input ImgSize: Fixed; VOC07 Results: (07+12); VOC12 Results: (07++12); Published In: CVPR17; Source Code: DarkNet.
Highlights: Propose a faster DarkNet19; Use a number of existing strategies to improve both speed and accuracy; Achieve high accuracy and high speed; YOLO9000 can detect over 9000 object categories in real time.
Disadvantages: Not good at detecting small objects.

SSD Liu et al. (2016)
Backbone DCNN: VGG16; Input ImgSize: Fixed; VOC07 Results: (07+12), (07+12+CO); VOC12 Results: (07++12), (07++12+CO); Published In: ECCV16; Source Code: Caffe, Python.