
Ten Years of Pedestrian Detection, What Have We Learned?

by   Rodrigo Benenson, et al.

Paper-by-paper results make it easy to miss the forest for the trees. We analyse the remarkable progress of the last decade by discussing the main ideas explored in the 40+ detectors currently present in the Caltech pedestrian detection benchmark. We observe that there exist three families of approaches, all currently reaching similar detection quality. Based on our analysis, we study the complementarity of the most promising ideas by combining multiple published strategies. This new decision forest detector achieves the current best known performance on the challenging Caltech-USA dataset.



1 Introduction

Figure 1: The last decade has shown tremendous progress on pedestrian detection. What have we learned from the proposed methods?

Pedestrian detection is a canonical instance of object detection. Because of its direct applications in car safety, surveillance, and robotics, it has attracted much attention in recent years. Importantly, it is a well defined problem with established benchmarks and evaluation metrics. As such, it has served as a playground to explore different ideas for object detection. The main paradigms for object detection, namely “Viola&Jones variants”, HOG+SVM rigid templates, deformable part detectors (DPM), and convolutional neural networks (ConvNets), have all been explored for this task.

The aim of this paper is to review progress over the last decade of pedestrian detection (40+ methods), identify the main ideas explored, and try to quantify which ideas had the most impact on final detection quality. In the next sections we review existing datasets (section 2), discuss the different approaches (section 3), and present experiments that reproduce and quantify the progress of recent years (section 4, which reports experiments on newly trained detector models). Although we do not aim to introduce a novel technique, by putting together existing methods we report the best known detection results on the challenging Caltech-USA dataset.

(a) INRIA test set
(b) Caltech-USA test set
(c) KITTI test set
Figure 2: Example detections of a top performing method (SquaresChnFtrs).

2 Datasets

Multiple public pedestrian datasets have been collected over the years: INRIA [Dalal2005Cvpr], ETH [Ess2008Cvpr], TUD-Brussels [Wojek2009Cvpr], Daimler [Enzweiler2009PAMI] (Daimler stereo [Keller2009Dagm]), Caltech-USA [Dollar2009Cvpr], and KITTI [Geiger2012CVPR] are the most commonly used ones. They all have different characteristics, weaknesses, and strengths.

INRIA is amongst the oldest and as such has comparatively few images. It benefits however from high quality annotations of pedestrians in diverse settings (city, beach, mountains, etc.), which is why it is commonly selected for training (see also the discussion of generalisation across datasets). ETH and TUD-Brussels are mid-sized video datasets. Daimler is not considered by all methods because it lacks colour channels. Daimler stereo, ETH, and KITTI provide stereo information. All datasets but INRIA are obtained from video, and thus enable the use of optical flow as an additional cue.

Today, Caltech-USA and KITTI are the predominant benchmarks for pedestrian detection. Both are comparatively large and challenging. Caltech-USA stands out for the large number of methods that have been evaluated side-by-side. KITTI stands out because its test set is slightly more diverse, but is not yet used as frequently. For a more detailed discussion of the datasets please consult [Dollar2011Pami, Geiger2012CVPR]. INRIA, ETH (monocular), TUD-Brussels, Daimler (monocular), and Caltech-USA are available under a unified evaluation toolbox; KITTI uses its own separate one with unpublished test data. Both toolboxes maintain an online ranking where published methods can be compared side by side.

In this paper we use primarily Caltech-USA for comparing methods, INRIA and KITTI secondarily. See figure 2 for example images. Caltech-USA and INRIA results are measured in log-average miss-rate (MR, lower is better), while KITTI uses area under the precision-recall curve (AUC, higher is better).
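As an illustration of the log-average miss-rate, the sketch below follows the convention we understand the Caltech evaluation toolbox to use: the miss rate is sampled at nine FPPI (false positives per image) values evenly spaced in log-space between 10^-2 and 10^0, and the samples are averaged geometrically. The function name, the argument names, and the handling of curves that start after a reference point are assumptions made for this example, not the toolbox code.

```python
import math

def log_average_miss_rate(fppi, miss_rate, num_points=9, lo=1e-2, hi=1e0):
    """Geometric mean of the miss rate sampled at reference FPPI values.

    `fppi` and `miss_rate` describe one detector's curve, sorted by
    increasing FPPI (the miss rate is then non-increasing).
    Assumed convention: nine reference points in [1e-2, 1e0].
    """
    step = (math.log10(hi) - math.log10(lo)) / (num_points - 1)
    refs = [10 ** (math.log10(lo) + i * step) for i in range(num_points)]

    samples = []
    for ref in refs:
        # Miss rate at the largest FPPI not exceeding the reference point.
        reached = [mr for f, mr in zip(fppi, miss_rate) if f <= ref]
        # Assumption: if the curve starts later, count a full miss.
        samples.append(reached[-1] if reached else 1.0)

    # Average in log space; clamp to avoid log(0) for perfect detectors.
    return math.exp(sum(math.log(max(s, 1e-10)) for s in samples) / len(samples))
```

Averaging in log space keeps a single operating point from dominating the score, which is why a flat curve at miss rate 0.3 yields exactly 0.3.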

Method                                    Family  Feat. type      Training
VJ [Viola2004Ijvc]                        DF      Haar            I
Shapelet [Sabzmeydani2007Cvpr]            -       Gradients       I
PoseInv [Lin2008Eccv]                     -       HOG             I+
LatSvm-V1 [Felzenszwalb2008CVPR]          DPM     HOG             P
ConvNet [Sermanet2013Cvpr]                DN      Pixels          I
FtrMine [Dollar2007Cvpr]                  DF      HOG+Color       I
HikSvm [Maji2008Cvpr]                     -       HOG             I
HOG [Dalal2005Cvpr]                       -       HOG             I
MultiFtr [Wojek2008DagmMultiFtrs]         DF      HOG+Haar        I
HogLbp [Wang2009Iccv]                     -       HOG+LBP         I
AFS+Geo [Levi2013Cvpr]                    -       Custom          I
AFS [Levi2013Cvpr]                        -       Custom          I
LatSvm-V2 [Felzenszwalb2010Pami]          DPM     HOG             I
Pls [Schwartz2009Iccv]                    -       Custom          I
MLS [Nam2011IccvWorkshop]                 DF      HOG             I
MultiFtr+CSS [Walk2010Cvpr]               DF      Many            T
FeatSynth [BarHillel2010Eccv]             -       Custom          I
pAUCBoost [Paisitkriangkrai2013Iccv]      DF      HOG+COV         I
FPDW [Dollar2010Bmvc]                     DF      HOG+LUV         I
ChnFtrs [Dollar2009Bmvc]                  DF      HOG+LUV         I
CrossTalk [Dollar2012Eccv]                DF      HOG+LUV         I
DBN-Isol [Ouyang2012Cvpr]                 DN      HOG             I
ACF [Dollar2014Pami]                      DF      HOG+LUV         I
RandForest [Marin2013Iccv]                DF      HOG+LBP         I&C
MultiFtr+Motion [Walk2010Cvpr]            DF      Many+Flow       T
SquaresChnFtrs [Benenson2013Cvpr]         DF      HOG+LUV         I
Franken [Mathias2013Iccv]                 DF      HOG+LUV         I
MultiResC [Park2010Eccv]                  DPM     HOG             C
Roerei [Benenson2013Cvpr]                 DF      HOG+LUV         I
DBN-Mut [Ouyang2013CvprDbnMut]            DN      HOG             C
MF+Motion+2Ped [Ouyang2013Cvpr]           DF      Many+Flow       I+
MOCO [Chen2013Cvpr]                       -       HOG+LBP         C
MultiSDP [Zeng2013Iccv]                   DN      HOG+CSS         C
ACF-Caltech [Dollar2014Pami]              DF      HOG+LUV         C
MultiResC+2Ped [Ouyang2013Cvpr]           DPM     HOG             C+
WordChannels [Costea2014CVPR]             DF      Many            C
MT-DPM [Yan2013Cvpr]                      DPM     HOG             C
JointDeep [Ouyang2013Iccv]                DN      Color+Gradient  C
SDN [Luo2014Cvpr]                         DN      Pixels          C
MT-DPM+Context [Yan2013Cvpr]              DPM     HOG             C+
ACF+SDt [Park2013Cvpr]                    DF      ACF+Flow        C+
SquaresChnFtrs [Benenson2013Cvpr]         DF      HOG+LUV         C
InformedHaar [Zhang2014CvprInformedHaar]  DF      HOG+LUV         C
Katamari-v1                               DF      HOG+Flow        C+
Table 1: Listing of methods considered on Caltech-USA, sorted by log-average miss-rate (lower is better). Consult the subsections of section 3 for details of each column. See also matching figure 3. “HOG” indicates HOG-like [Dalal2005Cvpr]. Ticks indicate salient aspects of each method.

2.0.1 Value of benchmarks

Individual papers usually show only a narrow view of the state of the art on a dataset. Having an official benchmark that collects detections from all methods greatly eases the authors’ effort to put their curve into context, and gives reviewers easy access to the state-of-the-art results. The collection of results enables retrospective analyses such as the one presented in the next section.

3 Main approaches to improve pedestrian detection

Figure 3 and table 1 together provide a quantitative and qualitative overview of the methods whose results are published on the Caltech pedestrian detection benchmark (July 2014). Methods marked in italic are our newly trained models (described in section 4). We refer to all methods using their Caltech benchmark shorthand. Instead of discussing the methods’ individual particularities, we identify the key aspects that distinguish each method (ticks of table 1) and group them accordingly. We discuss these aspects in the next subsections.

3.0.1 Brief chronology

Figure 3: Caltech-USA detection results.

In 2003, Viola and Jones applied their VJ detector [Viola2003Cvpr] to the task of pedestrian detection. In 2005 Dalal and Triggs introduced the landmark HOG [Dalal2005Cvpr] detector, which later served in 2008 as a building block for the now classic deformable part model DPM (named LatSvm here) by Felzenszwalb et al. [Felzenszwalb2008CVPR]. In 2009 the Caltech pedestrian detection benchmark was introduced, comparing seven pedestrian detectors [Dollar2009Cvpr]. At this point in time, the evaluation metric changed from per-window (FPPW) to per-image (FPPI), once the flaws of the per-window evaluation were identified [Dollar2011Pami]. Under this new evaluation metric some of the early detectors turned out to under-perform.

6 Reviewing the effect of features

The idea behind the experiments in section 4.1 of the main paper is to demonstrate that, within a single framework, varying the features can replicate the jump in detection performance over a ten-year span, i.e. the jump in performance between VJ and the current state of the art.

(a) INRIA test set
(b) Caltech-USA reasonable test set
Figure 9: Effect of features on detection performance. (I)/(C) indicates using INRIA/Caltech-USA training set respectively.

See figure 9 for results on INRIA and Caltech-USA of the following methods (all based on SquaresChnFtrs, described in section 4 of the paper):

  • VJLike uses only the luminance colour channel, emulating the original VJ [Viola2003Cvpr]. We use weak classifiers to compensate for the weak input feature, only square pooling regions, and level-2 trees to emulate the Haar wavelet-like features used by VJ.

  • HOGLike uses pooling regions, oriented gradients, gradient magnitude, and level 1/2 decision trees (1/3 threshold comparisons respectively). A level-1 tree emulates the non-linearity in the original HOG+linear SVM features [Dalal2005Cvpr].

  • HOGLike+LUV is identical to HOGLike, but with additional LUV colour channels (10 feature channels in total).

  • SquaresChnFtrs is the baseline described in the beginning of the experiments section (§4). It is similar to HOGLike+LUV but the size of the square pooling regions is not restricted.

  • SquaresChnFtrs+DCT is inspired by [Nam2014arXiv]. We expand the ten HOG+LUV channels into 40 channels by convolving each of the 10 channels with three DCT (discrete cosine transform) filters, and storing the absolute value of the filter responses as additional feature channels. The three DCT basis functions we use as 2d-filters correspond to the lowest spatial frequencies. This variant serves as a reference point for the performance improvement that can be obtained by increasing the number of channels.
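The DCT channel expansion in the last item can be sketched as follows. This is a dependency-free illustration, not the code used for the experiments: the 7x7 filter size, the choice of (0,0), (0,1) and (1,0) as the three lowest-frequency basis functions, the 'valid' border handling, and all function names are assumptions made for this example. Ten input channels plus three filtered copies of each give 10 + 3 × 10 = 40 channels.

```python
import math

def dct_basis(u, v, size):
    """2-D DCT-II basis function of frequency (u, v) on a size x size grid."""
    return [[math.cos(math.pi * (2 * y + 1) * u / (2 * size)) *
             math.cos(math.pi * (2 * x + 1) * v / (2 * size))
             for x in range(size)] for y in range(size)]

def filter_abs(channel, kernel):
    """Absolute filter response over the 'valid' region of `channel`.

    Flipping a DCT basis function only changes its sign, so after abs()
    this correlation equals the corresponding convolution.
    """
    k = len(kernel)
    h, w = len(channel), len(channel[0])
    return [[abs(sum(kernel[i][j] * channel[y + i][x + j]
                     for i in range(k) for j in range(k)))
             for x in range(w - k + 1)] for y in range(h - k + 1)]

def crop(channel, k):
    """Centre-crop a channel to the size of the filtered channels."""
    m = (k - 1) // 2
    h, w = len(channel), len(channel[0])
    return [row[m:w - (k - 1 - m)] for row in channel[m:h - (k - 1 - m)]]

def expand_channels(channels, size=7, freqs=((0, 0), (0, 1), (1, 0))):
    """Return the original channels followed by their filtered versions.

    `size` and `freqs` are assumed values chosen for illustration.
    """
    kernels = [dct_basis(u, v, size) for u, v in freqs]
    expanded = [crop(c, size) for c in channels]
    for c in channels:
        expanded.extend(filter_abs(c, kern) for kern in kernels)
    return expanded
```

Calling `expand_channels` on ten 12x12 channels returns forty 6x6 channels: the ten cropped originals first, then three filter responses per input channel.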

7 Complementarity of approaches

Table 3 contains the detailed results of combining different approaches with a strong baseline, related to section 4.2 of the main paper. Katamari-v1 combines all three listed approaches with SquaresChnFtrs. We train and test on the Caltech-USA dataset. The obtained improvement is very close to the sum of the individual gains, showing that these approaches are largely complementary to each other.

Method Results Improvement Expected
SquaresChnFtrs 34.81% - -
+DCT 31.28% 3.53 -
+SDt [Park2013Cvpr] 30.34% 4.47 -
+2Ped [Ouyang2013Cvpr] 29.42% 5.39 -
+DCT+2Ped 27.40% 7.41 8.92
+SDt+2Ped 26.68% 8.13 9.86
+DCT+SDt 25.24% 9.57 8.00
Katamari-v1 22.49% 12.32 13.39
Table 3: Complementarity between different extensions of the SquaresChnFtrs strong baseline. Results in MR (lower is better). Improvement is given in MR percentage points. Expected improvement is the direct sum of the individual improvements.
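The "Improvement" and "Expected" columns of table 3 can be recomputed from the MR values alone. The short sketch below does so; the variable names are ours, and the numbers are copied from the table.

```python
# Log-average miss rates (percent) copied from table 3.
baseline = 34.81  # SquaresChnFtrs
single = {"DCT": 31.28, "SDt": 30.34, "2Ped": 29.42}
combined = {
    ("DCT", "2Ped"): 27.40,
    ("SDt", "2Ped"): 26.68,
    ("DCT", "SDt"): 25.24,
    ("DCT", "SDt", "2Ped"): 22.49,  # Katamari-v1
}

# Gain of each extension on its own, in MR percentage points.
gain = {name: round(baseline - mr, 2) for name, mr in single.items()}

rows = {}
for parts, mr in combined.items():
    observed = round(baseline - mr, 2)
    # Expected gain if the extensions were perfectly additive.
    expected = round(sum(gain[p] for p in parts), 2)
    rows[parts] = (observed, expected)

for parts, (observed, expected) in rows.items():
    print("+".join(parts), observed, expected)
```

For Katamari-v1 the observed gain is 12.32 points against an additive expectation of 13.39. The one combination that exceeds its expectation is +DCT+SDt, at 9.57 against 8.00.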