Real-Time Pedestrian Detection With Deep Network Cascades
Paper-by-paper results make it easy to miss the forest for the trees. We analyse the remarkable progress of the last decade by discussing the main ideas explored in the 40+ detectors currently present in the Caltech pedestrian detection benchmark. We observe that there exist three families of approaches, all currently reaching similar detection quality. Based on our analysis, we study the complementarity of the most promising ideas by combining multiple published strategies. The resulting decision forest detector achieves the current best known performance on the challenging Caltech-USA dataset.
Pedestrian detection is a canonical instance of object detection. Because of its direct applications in car safety, surveillance, and robotics, it has attracted much attention in recent years. Importantly, it is a well-defined problem with established benchmarks and evaluation metrics. As such, it has served as a playground to explore different ideas for object detection. The main paradigms for object detection (“Viola&Jones variants”, HOG+SVM rigid templates, deformable part detectors (DPM), and convolutional neural networks (ConvNets)) have all been explored for this task.
The aim of this paper is to review the progress of the last decade of pedestrian detection (40+ methods), identify the main ideas explored, and try to quantify which ideas had the most impact on final detection quality. In the next sections we review existing datasets (section 2), discuss the different approaches (section 3), and present experiments reproducing and quantifying the recent years’ progress (section LABEL:sec:Experiments, based on newly trained detector models). Although we do not aim to introduce a novel technique, by putting together existing methods we report the best known detection results on the challenging Caltech-USA dataset.
Multiple public pedestrian datasets have been collected over the years; INRIA [Dalal2005Cvpr], ETH [Ess2008Cvpr], TUD-Brussels [Wojek2009Cvpr], Daimler [Enzweiler2009PAMI] (Daimler stereo [Keller2009Dagm]), Caltech-USA [Dollar2009Cvpr], and KITTI [Geiger2012CVPR] are the most commonly used ones. They all have different characteristics, weaknesses, and strengths.
INRIA is amongst the oldest and as such has comparatively few images. It benefits however from high quality annotations of pedestrians in diverse settings (city, beach, mountains, etc.), which is why it is commonly selected for training (see also §LABEL:sub:Generalization-across-datasets). ETH and TUD-Brussels are mid-sized video datasets. Daimler is not considered by all methods because it lacks colour channels. Daimler stereo, ETH, and KITTI provide stereo information. All datasets but INRIA are obtained from video, and thus enable the use of optical flow as an additional cue.
Today, Caltech-USA and KITTI are the predominant benchmarks for pedestrian detection. Both are comparatively large and challenging. Caltech-USA stands out for the large number of methods that have been evaluated side-by-side. KITTI stands out because its test set is slightly more diverse, but is not yet used as frequently. For a more detailed discussion of the datasets please consult [Dollar2011Pami, Geiger2012CVPR]. INRIA, ETH (monocular), TUD-Brussels, Daimler (monocular), and Caltech-USA are available under a unified evaluation toolbox; KITTI uses its own separate one with unpublished test data. Both toolboxes maintain an online ranking where published methods can be compared side by side.
In this paper we primarily use Caltech-USA for comparing methods, and INRIA and KITTI secondarily. See figure 2 for example images. Caltech-USA and INRIA results are measured in log-average miss rate (MR, lower is better), while KITTI uses area under the precision-recall curve (AUC, higher is better).
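As a concrete illustration, the log-average miss rate can be sketched as follows. The nine reference FPPI points spaced evenly in log-space over [10^-2, 1] follow the Caltech protocol [Dollar2011Pami]; the fallback for curves that never reach a reference point is a simplifying assumption of this sketch, not the exact behaviour of the official toolbox:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Sketch of the Caltech log-average miss rate: sample the miss rate
    at n_points FPPI values evenly spaced in log-space over [1e-2, 1e0]
    and average them in the log domain (geometric mean)."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(-2.0, 0.0, n_points)
    samples = []
    for r in refs:
        mask = fppi <= r
        if mask.any():
            # miss rate at the largest FPPI not exceeding the reference point
            samples.append(miss_rate[mask][np.argmax(fppi[mask])])
        else:
            # curve never reaches this FPPI: fall back to its leftmost value
            samples.append(miss_rate[np.argmin(fppi)])
    # clamp zeros so the logarithm stays finite
    return float(np.exp(np.mean(np.log(np.maximum(samples, 1e-10)))))
```

A flat curve at miss rate 0.5 yields a log-average MR of exactly 0.5, which is a quick sanity check for any implementation.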
Individual papers usually show only a narrow view of the state of the art on a dataset. Having an official benchmark that collects detections from all methods greatly eases the authors’ effort to put their curves into context, and gives reviewers easy access to state-of-the-art results. This collection of results enables retrospective analyses such as the one presented in the next section.
Figure 3 and table 1 together provide a quantitative and qualitative overview over methods whose results are published on the Caltech pedestrian detection benchmark (July 2014). Methods marked in italic are our newly trained models (described in section LABEL:sec:Experiments). We refer to all methods using their Caltech benchmark shorthand. Instead of discussing the methods’ individual particularities, we identify the key aspects that distinguish each method (ticks of table 1) and group them accordingly. We discuss these aspects in the next subsections.
In 2003, Viola and Jones applied their VJ detector [Viola2003Cvpr] to the task of pedestrian detection. In 2005 Dalal and Triggs introduced the landmark HOG [Dalal2005Cvpr] detector, which later served in 2008 as a building block for the now classic deformable part model DPM (named LatSvm here) by Felzenszwalb et al. [Felzenszwalb2008CVPR]. In 2009 the Caltech pedestrian detection benchmark was introduced, comparing seven pedestrian detectors [Dollar2009Cvpr]. At this point in time, the evaluation metric changed from per-window (FPPW) to per-image (FPPI), once the flaws of per-window evaluation were identified [Dollar2011Pami]. Under the new metric some of the early detectors turned out to under-perform.
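The practical difference between the two metrics can be sketched as follows: per-image (FPPI) evaluation pools detections after non-maximum suppression and normalises false positives by the number of test images, rather than by the number of classified windows. The function below is an illustrative simplification (it assumes detections have already been matched to ground truth); the names are ours, not from any benchmark toolbox:

```python
def miss_rate_and_fppi(detections, n_ground_truth, n_images, threshold):
    """Per-image (FPPI) evaluation at a single score threshold.
    `detections` is a list of (score, is_true_positive) pairs pooled over
    all test images after non-maximum suppression; matching of detections
    to ground-truth boxes is assumed to have happened already."""
    kept = [is_tp for score, is_tp in detections if score >= threshold]
    true_pos = sum(kept)
    false_pos = len(kept) - true_pos
    miss_rate = 1.0 - true_pos / n_ground_truth
    fppi = false_pos / n_images
    return miss_rate, fppi
```

Sweeping the threshold traces out the miss-rate/FPPI curve from which the log-average miss rate is then computed.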
The idea behind the experiments in section 4.1 of the main paper is to demonstrate that, within a single framework, varying the features can replicate the jump in detection performance over a ten-year span, i.e. the jump in performance between VJ and the current state of the art.
See figure 9 for results on INRIA and Caltech-USA of the following methods (all based on SquaresChnFtrs, described in section 4 of the paper):
VJLike uses only the luminance colour channel, emulating the original VJ [Viola2003Cvpr]. We compensate for the weak input feature by using many weak classifiers, restrict ourselves to square pooling regions, and use level-2 decision trees to emulate the Haar wavelet-like features used by VJ.
HOGLike uses pooling regions, oriented gradient channels, a gradient magnitude channel, and level-1/2 decision trees (1 and 3 threshold comparisons, respectively). A level-1 tree emulates the non-linearity of the original HOG+linear SVM features [Dalal2005Cvpr].
HOGLike+LUV is identical to HOGLike, but adds the LUV colour channels (10 feature channels in total).
SquaresChnFtrs is the baseline described at the beginning of the experiments section (§4). It is similar to HOGLike+LUV, but the size of the square pooling regions is not restricted.
SquaresChnFtrs+DCT is inspired by [Nam2014arXiv]. We expand the ten HOG+LUV channels into 40 channels by convolving each of the 10 channels with three DCT (discrete cosine transform) filters, and storing the absolute value of the filter responses as additional feature channels. The three DCT basis functions we use as 2d-filters correspond to the lowest spatial frequencies. This variant serves as a reference point for the performance improvement that can be obtained by increasing the number of channels.
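A minimal sketch of this channel expansion, assuming 7x7 filters (the filter size is illustrative, not taken from the paper) and one plausible ordering of the lowest-frequency DCT basis functions:

```python
import numpy as np
from scipy.signal import convolve2d

def dct_basis_filters(k=7, n_filters=3):
    """Lowest-frequency (non-constant) 2D DCT-II basis functions as k x k
    filters. Filter size and frequency ordering are illustrative choices."""
    n = np.arange(k)
    # 1D DCT-II basis vectors; basis[0] is the constant (DC) vector
    basis = [np.cos(np.pi * (n + 0.5) * u / k) for u in range(k)]
    filters, freqs = [], []
    for u in range(k):
        for v in range(k):
            if u == 0 and v == 0:
                continue  # skip the DC term
            filters.append(np.outer(basis[u], basis[v]))
            freqs.append(u + v)
    # keep the n_filters bases with lowest total spatial frequency
    order = np.argsort(freqs, kind="stable")
    return [filters[i] for i in order[:n_filters]]

def expand_channels(channels):
    """Append the absolute response of every channel to every DCT filter:
    10 input channels -> 10 + 3*10 = 40 channels, as described in the text."""
    out = list(channels)
    for f in dct_basis_filters():
        for c in channels:
            out.append(np.abs(convolve2d(c, f, mode="same")))
    return out
```

Storing the absolute value of the responses makes the new channels non-negative, like the original HOG+LUV channels, so the same pooling machinery applies unchanged.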
Table 3 contains the detailed results of combining different approaches with a strong baseline, related to section 4.2 of the main paper. Katamari-v1 combines all three listed approaches with SquaresChnFtrs. We train and test on the Caltech-USA dataset. The obtained improvement is very close to the sum of the individual gains, showing that these approaches are largely complementary.
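The additivity claim can be made concrete with a small helper; the miss-rate numbers below are hypothetical placeholders for illustration only, not the values from table 3:

```python
def complementarity(baseline_mr, individual_mrs, combined_mr):
    """Compare the miss-rate improvement of a combined detector against
    the sum of improvements of each ingredient added alone. All inputs
    are log-average miss rates in percent (lower is better)."""
    individual_gains = [baseline_mr - mr for mr in individual_mrs]
    expected = sum(individual_gains)       # if gains were perfectly additive
    actual = baseline_mr - combined_mr     # gain actually observed
    return expected, actual

# Hypothetical numbers for illustration only:
expected, actual = complementarity(
    baseline_mr=34.8,
    individual_mrs=[31.0, 30.5, 32.0],  # baseline plus each single addition
    combined_mr=24.0)                   # all three additions combined
# here expected ~ 10.9 and actual ~ 10.8: nearly additive gains
```

When `actual` approaches `expected`, the ingredients address largely independent failure modes; a large shortfall would indicate that they exploit overlapping cues.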