Pedestrian detection is a key problem for surveillance, automotive safety and robotics applications. The wide variety of appearances of pedestrians due to body pose, occlusions, clothing, lighting and backgrounds makes this task challenging.
A common approach extracts hand-crafted low-level features, followed by a trainable classifier such as SVM [13, 28] or boosted classifiers [7]. While low-level features can be designed by hand with good success, mid-level features that combine low-level features are difficult to engineer without the help of some sort of learning procedure. Multi-stage recognizers that learn hierarchies of features tuned to the task at hand can be trained end-to-end with little prior knowledge. Convolutional Networks (ConvNets) 
are examples of such hierarchical systems with end-to-end feature learning that are trained in a supervised fashion. Recent works have demonstrated the usefulness of unsupervised pre-training for end-to-end training of deep multi-stage architectures using a variety of techniques such as stacked restricted Boltzmann machines, stacked auto-encoders  and stacked sparse auto-encoders 
, and using new types of non-linear transforms at each layer [17, 20].
Recently, a large ConvNet achieved a breakthrough on the 1000-class ImageNet classification task. The main contribution of this paper is to show that the ConvNet model, with a few important twists, consistently yields state-of-the-art or competitive results on all major pedestrian detection benchmarks. The system uses unsupervised convolutional sparse auto-encoders to pre-train features at all levels from the relatively small INRIA dataset, and end-to-end supervised training to train the classifier and fine-tune the features in an integrated fashion. Additionally, multi-stage features with layer-skipping connections enable the output stages to combine global shape detectors with local motif detectors.
Processing speed in pedestrian detection has recently seen great progress, enabling real-time operation without sacrificing quality. Recent work even manages to avoid image rescaling entirely during detection while observing quality improvements. While processing speed is not the focus of this paper, the feature and classifier approximations introduced in that line of work may be applicable to deep learning models for faster detection, in addition to GPU optimizations.
2 Learning Feature Hierarchies
Much of the work on pedestrian detection has focused on designing representative and powerful features [5, 9, 8, 38]. In this work, we show that generic feature learning algorithms can produce successful feature extractors that achieve state-of-the-art results.
However, for many input domains it is hard to find an adequate amount of labeled data. In that case, one can design useful features using domain knowledge, or instead resort to unsupervised learning algorithms. Recently, unsupervised learning algorithms have been demonstrated to produce good features for generic object recognition problems [24, 25, 18, 20].
Unsupervised learning has been shown to train deep hierarchical models whose final representations are useful for a variety of different tasks [32, 24, 4]. In this work we follow a similar approach and train a generic unsupervised model at each layer using the output representation of the layer before. This process is then followed by supervised updates to the whole hierarchical system using label information.
2.1 Hierarchical Model
A hierarchical feature extraction system consists of multiple levels of feature extractors that perform the same filtering and non-linear transformation functions in successive layers. Using a particular generic parametrized function, one can map the inputs into gradually higher-level (or more abstract) representations [23, 16, 4, 32, 24]. In this work we use sparse convolutional feature hierarchies. Each layer of the unsupervised model contains a convolutional sparse coding algorithm and a predictor function that can be used for fast inference. After the last layer, a classifier maps the feature representations into class labels. Neither the sparse coding dictionary nor the predictor function contains any hard-coded parameters; both are trained from the input data.
The training procedure for this model proceeds in two phases. Each layer is trained separately in an unsupervised manner using the representation from the previous layer (or the input image for the first layer). After the whole multi-stage system has been trained in this layer-wise fashion, the complete architecture, followed by a classifier, is fine-tuned using labeled data.
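The layer-wise procedure can be sketched as follows. This is a toy linear stand-in for the actual convolutional sparse coding training: the per-layer objective, the `tanh` feed-forward, and all sizes are illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def train_layer_unsup(X, n_features, n_iter=10, lr=0.1, seed=0):
    # Toy stand-in for per-layer unsupervised training: learn a linear
    # dictionary D by gradient steps on the reconstruction error of a
    # linear auto-encoder, with the code Z held fixed at each step.
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[1], n_features)) * 0.1
    for _ in range(n_iter):
        Z = X @ D                          # cheap "inference" (predictor stand-in)
        X_rec = Z @ D.T                    # linear reconstruction
        grad = -2 * (X - X_rec).T @ Z / len(X)   # decoder gradient, Z fixed
        D -= lr * grad
    return D

def greedy_pretrain(X, layer_sizes):
    # Train each layer on the previous layer's output representation,
    # then feed forward to produce the next layer's input.
    dicts = []
    for n in layer_sizes:
        D = train_layer_unsup(X, n)
        dicts.append(D)
        X = np.tanh(X @ D)
    return dicts
```

Supervised fine-tuning of the whole stack with a classifier on top would follow this greedy phase.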
2.2 Unsupervised Learning
Recently, sparse coding has attracted much interest in many fields due to its ability to extract useful feature representations from data. The general formulation of sparse coding is a linear reconstruction model using an overcomplete dictionary D ∈ R^{n×m}, where m > n, together with a regularization penalty s(z) on the mixing coefficients z ∈ R^m:

E(x, z; D) = ½ ‖x − Dz‖²₂ + λ s(z)   (1)
The aim is to minimize equation 1 with respect to z to obtain the optimal sparse representation z* corresponding to input x. The exact form of s(z) depends on the particular sparse coding algorithm used; here we use the ℓ1-norm penalty, the sum of the absolute values of all elements of z. It is immediately clear that solving this system requires an optimization process. Many efficient algorithms for solving this convex problem have been proposed in recent years [1, 6, 2, 26]. However, our aim is also to learn generic feature extractors; for that reason we minimize equation 1 with respect to D as well.
The resulting equation is non-convex in D and z simultaneously; however, keeping one fixed, the problem is still convex with respect to the other variable. All sparse modeling algorithms that learn the dictionary matrix D exploit this property and perform a coordinate-descent-like minimization in which each variable is updated in succession. Many authors have used sparse dictionary learning to represent images [27, 1, 19]. However, most sparse coding models use small image patches as input to learn the dictionary and then apply the resulting model to every overlapping patch location of the full image. This approach assumes that the sparse representations of two neighboring patches with a single-pixel shift are completely independent, and thus produces very redundant representations. Convolutional sparse modeling formulations for feature learning and object recognition were introduced in [20, 39]; we use the Convolutional Predictive Sparse Decomposition (CPSD) model since it is the only convolutional sparse coding model providing a fast predictor function suitable for building multi-stage feature representations. The particular predictor function we use is similar to a single-layer ConvNet of the following form:
z̃ = f(x; g, k, b),   z̃_m = g_m tanh(x ∗ k_m + b_m)

where ∗ is the convolution operator applied to a single input and a single filter. In this formulation, x is a grayscale input image of size w × h, k = {k_1, …, k_m} is a set of m 2D filters each of size s × s, and g = {g_1, …, g_m} and b = {b_1, …, b_m} are vectors with m elements. The predictor output z̃ = {z̃_1, …, z̃_m} is a set of m feature maps, each of size (w − s + 1) × (h − s + 1). Considering this general predictor function, the final form of the convolutional unsupervised energy for grayscale inputs is as follows:

E(x, z; D, g, k, b) = ½ ‖x − Σ_m D_m ∗ z_m‖²₂ + λ‖z‖₁ + ‖z − f(x; g, k, b)‖²₂   (6)
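A minimal sketch of this predictor as a one-layer convolution bank, assuming SciPy's `convolve2d` for the filtering; the filter count, sizes and parameter values are illustrative:

```python
import numpy as np
from scipy.signal import convolve2d

def predictor(x, k, g, b):
    # z_m = g_m * tanh(x conv k_m + b_m), one feature map per filter;
    # "valid" convolution yields (w - s + 1) x (h - s + 1) maps.
    return np.stack([g_m * np.tanh(convolve2d(x, k_m, mode="valid") + b_m)
                     for k_m, g_m, b_m in zip(k, g, b)])
```

Note that `convolve2d` flips the kernel (true convolution), matching the ∗ operator above.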
where D = {D_1, …, D_m} is a dictionary of filters of the same size as the k_m and λ is a sparsity hyper-parameter. The unsupervised learning procedure is a two-step coordinate descent process. At each iteration: (1) Inference: the parameters D, g, k, b are kept fixed and equation 6 is minimized to obtain the optimal sparse representation z*; (2) Update: keeping z* fixed, the parameters θ = {D, g, k, b} are updated using a stochastic gradient step θ ← θ − η ∂E/∂θ, where η is the learning rate. The inference step requires solving the sparse coding problem. For this we use the FISTA method, an extension of the original iterative shrinkage-and-thresholding algorithm (ISTA) using an improved step size calculation with a momentum-like term. We apply the FISTA algorithm in the image domain, adopting the convolutional formulation.
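A minimal FISTA sketch for the non-convolutional patch form of the problem, ½‖x − Dz‖²₂ + λ‖z‖₁; the convolutional case replaces matrix products with convolutions, and the Lipschitz-constant estimate and iteration count here are illustrative choices:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of the l1 norm
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fista(x, D, lam, n_iter=100):
    # minimize 0.5 * ||x - D z||^2 + lam * ||z||_1
    L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    y, t = z.copy(), 1.0
    for _ in range(n_iter):
        grad = D.T @ (D @ y - x)
        z_new = soft_threshold(y - grad / L, lam / L)
        t_new = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = z_new + ((t - 1) / t_new) * (z_new - z)   # momentum-like step
        z, t = z_new, t_new
    return z
```

With D = I the minimizer is exactly `soft_threshold(x, lam)`, a convenient sanity check.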
For color images or other multi-modal feature representations, the input is a set of feature maps x_i indexed by i, and the representation is a set of feature maps z_j indexed by j. We define a map of connections P from inputs to features: an output feature map z_j is connected to the set P_j of input feature maps. Thus, the predictor function in Algorithm 1 is defined as:

z̃_j = g_j tanh( Σ_{i ∈ P_j} x_i ∗ k_{ij} + b_j )

and the reconstruction is computed using the inverse map P̄:

x̃_i = Σ_{j ∈ P̄_i} D_{ij} ∗ z_j
For a fully connected layer, all input features are connected to all output features; however, it is also common to use sparse connection maps to reduce the number of parameters. The online training algorithm for unsupervised training of a single layer is:
2.3 Non-Linear Transformations
Once the unsupervised learning for a single stage is completed, the next stage is trained on the feature representation from the previous one. To obtain the feature representation for the next stage, we use the predictor function followed by non-linear transformations and pooling. Following the multi-stage framework, we apply absolute value rectification, local contrast normalization and average down-sampling operations.
Absolute Value Rectification is applied component-wise to the whole feature output in order to avoid cancellation problems in the subsequent contrast normalization and pooling steps.
Local Contrast Normalization is a non-linear process that enhances the strongest feature responses and suppresses the others. The exact form of the operation is as follows:

v_i = x_i − Σ_j w ∗ x_j,   σ = ( Σ_j w ∗ v_j² )^{1/2},   y_i = v_i / max(c, σ)

where i is the feature map index and w is a 2D Gaussian weighting function with normalized weights so that Σ_{pq} w_{pq} = 1. For each sample, the constant c is set to mean(σ) in the experiments.
Average Down-Sampling is performed using a fixed-size boxcar kernel with a certain step size. The size of the kernel and the stride are given for each experiment in the following sections.
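These three transformations can be sketched with SciPy primitives. The normalization below operates on a single feature map, a simplification (the formulation above also sums over neighboring maps), and the window sizes are illustrative:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, uniform_filter

def abs_rectify(z):
    # component-wise absolute value rectification
    return np.abs(z)

def local_contrast_normalize(z, sigma=2.0, eps=1e-8):
    # subtractive then divisive normalization with a Gaussian window,
    # per feature map (simplified: no sum across neighboring maps)
    v = z - gaussian_filter(z, sigma)
    local_std = np.sqrt(gaussian_filter(v * v, sigma))
    c = max(local_std.mean(), eps)        # constant c set to the mean sigma
    return v / np.maximum(local_std, c)

def avg_pool(z, size=2):
    # boxcar average followed by striding (a simplified pooling)
    return uniform_filter(z, size)[::size, ::size]
```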
Once a single layer of the network is trained, the features for training the successive layer are extracted using the predictor function followed by the non-linear transformations. The detailed procedure for training an N-layer hierarchical model is given in Algorithm 2.
The first-layer features can easily be displayed in the parameter space, since the parameter space and the input space are the same; visualizing second and higher-level features in the input space is only possible when exclusively invertible operations are used between layers. Since we use absolute value rectification and local contrast normalization, mapping the second-layer features onto the input space is not possible. In Figure 2 we show a subset of second-layer features in the parameter space.
2.4 Supervised Training
2.5 Multi-Stage Features
| Task | Single-Stage features | Multi-Stage features | Improvement % |
|---|---|---|---|
| Pedestrian detection (INRIA) (Fig. 8) | 23.39% | 17.29% | 26.1% |
| Traffic sign classification (GTSRB) | 1.80% | 0.83% | 54% |
| House numbers classification (SVHN) | 5.54% | 5.36% | 3.2% |
ConvNets are usually organized in a strictly feed-forward manner, where each layer takes only the output of the previous layer as input. Features extracted this way tend to be high-level after a few stages of convolutions and subsampling. By branching lower layers' outputs into the top classifier (Fig. 3), one produces features that capture both global shapes and structures and local details, such as a global silhouette and face components in the case of human detection. Contrary to prior work, the output of the first stage is branched after the non-linear transformations and pooling/subsampling operations rather than before.
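The branching can be sketched as follows, with `stage1`, `stage2` and `pool` as hypothetical stand-ins for the trained feature extractors and the pooling/subsampling step:

```python
import numpy as np

def multi_stage_features(x, stage1, stage2, pool):
    # Branch the stage-1 output (after pooling) into the classifier input,
    # alongside the higher-level stage-2 output computed from it.
    f1 = pool(stage1(x))                      # local, lower-level features
    f2 = stage2(f1)                           # global, higher-level features
    return np.concatenate([f1.ravel(), f2.ravel()])
```

The classifier thus sees both the local detail maps and the global shape features in one vector.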
We also use color information in the training data. For this purpose we convert all images into YUV space and subsample the UV channels by a factor of 3, since color information is at much lower resolution. At the first stage, we keep the feature extraction systems for the Y and UV channels separate. On the Y channel, features are extracted and followed by absolute value rectification, contrast normalization and subsampling. On the subsampled UV channels, features are extracted and followed by absolute value rectification and contrast normalization, skipping the usual subsampling step since it was performed beforehand. These features are then concatenated to produce the feature maps that are input to the second layer, whose feature extraction takes these maps and produces higher-level output features. A randomly selected 20% of the connections in the map from input features to output features is removed to limit the computational requirements and break the symmetry. The output of the second-layer features is then transformed using absolute value rectification and contrast normalization, followed by subsampling. This results in a feature vector for each sample, which is then fed into a linear classifier.
In Table 1, we show that multi-stage features improve accuracy on different tasks, to different degrees. The greatest improvements are obtained for pedestrian detection and traffic-sign classification, while only minimal gains are obtained for house-number classification, a less complex task.
2.6 Bootstrapping

Bootstrapping is typically used in detection settings: over multiple phases, the most offending negative answers are extracted and added to the existing dataset during training. For this purpose, we extract 3000 negative samples per bootstrapping pass and limit the number of most offending answers to 5 per image. We perform 3 bootstrapping passes in addition to the original training phase (i.e. 4 training passes in total).
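A sketch of this bootstrapping loop; `model.fit` and `model.score_windows` are hypothetical interfaces standing in for the actual training and multiscale window scoring:

```python
def bootstrap_train(model, train_set, neg_images, passes=3,
                    per_pass=3000, per_image=5):
    # After the initial training pass, repeatedly mine the highest-scoring
    # false positives ("most offending" negatives) on negative images and
    # retrain on the augmented set.
    model.fit(train_set)                              # initial pass
    for _ in range(passes):
        hard_negs = []
        for img in neg_images:
            wins = model.score_windows(img)           # [(score, window), ...]
            wins.sort(key=lambda sw: sw[0], reverse=True)
            hard_negs += [w for _, w in wins[:per_image]]
            if len(hard_negs) >= per_pass:
                break
        train_set += [(w, 0) for w in hard_negs[:per_pass]]  # label 0 = negative
        model.fit(train_set)
    return model
```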
2.7 Non-Maximum Suppression
Non-maximum suppression (NMS) is used to resolve conflicts when several bounding boxes overlap. For both the INRIA and Caltech experiments we use the widely accepted PASCAL overlap criterion, the ratio of the intersection area to the union area of the two boxes, to determine a matching score; if two boxes overlap by more than 60%, only the one with the highest score is kept. One published addendum modifies the matching criterion by replacing the union of the two boxes with the minimum of the two, so that if a box is fully contained in another, the smaller box is selected. The goal of this modification is to avoid false positives caused by pedestrian body parts. However, a drawback of this approach is that it always removes one of two overlapping pedestrians from the detections. Instead of changing the criterion, we actively modify our training set before each bootstrapping phase: we include body-part images that cause false positive detections in our bootstrapping image set. Our model can then learn to suppress such responses within a positive window and still detect pedestrians within bigger windows more reliably.
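The overlap criterion and greedy suppression can be sketched directly; boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the 0.6 threshold matches the 60% rule above:

```python
def iou(a, b):
    # PASCAL overlap: intersection area over union area
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, thresh=0.6):
    # keep the highest-scoring box among any pair overlapping by > thresh
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```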
3 Experiments

We evaluate our system on 5 standard pedestrian detection datasets, while, like most other systems, training only on the INRIA dataset. We also demonstrate the improvements brought by unsupervised training and multi-stage features. In the following we name our model ConvNet, with variants for unsupervised (ConvNet-U) and fully-supervised training (ConvNet-F) and multi-stage features (ConvNet-U-MS and ConvNet-F-MS).
3.1 Data Preparation
The ConvNet is trained on the INRIA pedestrian dataset. Pedestrians are extracted into windows of 126 pixels in height and 78 pixels in width. The context ratio is 1.4, i.e. pedestrians are 90 pixels high and the remaining 36 pixels correspond to background. Each pedestrian image is mirrored along the horizontal axis to expand the dataset. Similarly, we add 5 variations of each original sample using random deformations, namely translations ranging from -2 to 2 pixels and scale ratios from 0.95 to 1.05. These deformations enforce invariance to small distortions in the input; the range of each deformation determines the trade-off between recognition and localization accuracy during detection. An equal number of background samples is extracted at random from the negative images; taking approximately 10% of the extracted samples for validation yields a validation set of 2,000 samples and a training set of 21,845 samples. Note that the unsupervised training phase is performed on this initial data, before the bootstrapping phase.
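The mirroring and random deformations might be sketched as below; the crop/pad handling after rescaling is an illustrative choice, not the paper's exact procedure:

```python
import numpy as np
from scipy.ndimage import shift, zoom

def jitter(img, dx, dy, s):
    # translate by (dx, dy) pixels, rescale by s, restore the original size
    out = shift(img, (dy, dx), mode="nearest")
    out = zoom(out, s, order=1)
    h, w = img.shape
    zh, zw = out.shape
    if zh >= h and zw >= w:                 # center-crop when upscaled
        top, left = (zh - h) // 2, (zw - w) // 2
        return out[top:top + h, left:left + w]
    return np.pad(out, ((0, max(h - zh, 0)), (0, max(w - zw, 0))),
                  mode="edge")[:h, :w]      # pad (then trim) when downscaled

def augment(img, n=5, seed=0):
    # original + horizontal mirror + n random translate/scale variations
    rng = np.random.default_rng(seed)
    out = [img, img[:, ::-1]]
    for _ in range(n):
        dx, dy = rng.integers(-2, 3, size=2)   # translations in [-2, 2]
        s = rng.uniform(0.95, 1.05)            # scale ratio in [0.95, 1.05]
        out.append(jitter(img, dx, dy, s))
    return out
```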
3.2 Evaluation Protocol
During testing and bootstrapping phases on the INRIA dataset, the images are both up-sampled and sub-sampled. The up-sampling ratio is 1.3, while the sub-sampling ratio is limited by 0.75 times the network's minimum input. We use a scale stride of 1.10 between each scale, while other methods typically use either 1.05 or 1.20. A higher scale stride is desirable as it implies fewer computations.
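The resulting scale pyramid can be enumerated as follows; the interpretation of the lower bound (0.75 times the ratio of the network's minimum input to the image size) is an assumption:

```python
def detection_scales(img_h, img_w, min_h, min_w, up=1.3, stride=1.10):
    # Scales from the up-sampling ratio `up` down to the point where the
    # rescaled image would fall below 0.75x the network's minimum input,
    # with a multiplicative stride between consecutive scales.
    lo = 0.75 * max(min_h / img_h, min_w / img_w)
    scales, s = [], up
    while s >= lo:
        scales.append(round(s, 4))
        s /= stride
    return scales
```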
For evaluation we use the bounding-box files published on the Caltech Pedestrian website (http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians)
and the evaluation software provided by Piotr Dollár (version 3.0.1). In an effort to provide a more accurate evaluation, we improved on both the evaluation formula and the INRIA annotations as follows. The evaluation software was slightly modified to compute the continuous area under the curve (AUC) over the entire [0, 1] range rather than from 9 discrete points only (0.01, 0.0178, 0.0316, 0.0562, 0.1, 0.1778, 0.3162, 0.5623 and 1.0 in version 3.0.1): we sum the areas under the piece-wise linear interpolation of the curve between each pair of points. In addition, we report a 'fixed' version of the INRIA annotations, which are missing some positive labels. The added labels are only used to avoid counting false errors and wrongly penalizing algorithms. The modified code and extra INRIA labels are available at http://cs.nyu.edu/~sermanet/data.html#inria. Table 2 reports results for both the original and fixed INRIA datasets. Notice that the continuous AUC and the fixed INRIA annotations both yield a reordering of the results (see the supplementary material for evidence that the impact of these modifications is significant). To avoid ambiguity, all results with the original discrete AUC are reported in the supplementary paper.
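The continuous AUC described above reduces to the trapezoidal rule over the curve's points:

```python
def continuous_auc(fppi, miss_rate):
    # area under the piece-wise linear interpolation of the DET curve,
    # summed over consecutive point pairs (trapezoidal rule)
    area = 0.0
    pts = list(zip(fppi, miss_rate))
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area
```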
To ensure a fair comparison, we separated systems trained on INRIA (the majority) from systems trained on TUD-MotionPairs and the only system trained on Caltech in table 2. For clarity, only systems trained on INRIA were represented in Figure 5, however all results for all systems are still reported in table 2.
In Figure 8, we plot DET curves, i.e. miss rate versus false positives per image (FPPI), on the fixed INRIA dataset and rank algorithms along two measures: the error rate at 1 FPPI and the area under the curve (AUC) in the [0, 1] FPPI range. This graph shows the individual contributions of unsupervised learning (ConvNet-U) and multi-stage feature learning (ConvNet-F-MS), and their combination (ConvNet-U-MS), compared to the fully-supervised system without multi-stage features (ConvNet-F). With a 17.1% error rate, unsupervised learning yields the largest improvement over the baseline ConvNet-F (23.39%). Multi-stage features without unsupervised learning reach 17.29% error, while their combination yields the competitive error rate of 10.55%.
An extensive comparison of results on all major pedestrian datasets and published systems is provided in Table 2, reporting multiple types of measures. For clarity, we also plot in Figure 5 two of these measures, 'reasonable' and 'large', for INRIA-trained systems. The 'large' plot shows that the ConvNet achieves state-of-the-art performance with some margin on the ETH, Caltech and TudBrussels datasets and is closely behind LatSvm-V2 and VeryFast on the INRIA and Daimler datasets. In the 'reasonable' plot, the ConvNet yields competitive results for the INRIA, Daimler and ETH datasets but performs poorly on the Caltech dataset. We suspect that the ConvNet with multi-stage features trained at high resolution is more sensitive to resolution loss than other methods. In future work, a ConvNet trained at multiple resolutions will likely learn to use appropriate cues for each resolution regime.
Table 2 measures: All (AUC %); Reasonable (AUC %, >50 pixels, no/partial occlusion); Large (AUC %, >100 pixels); Near (AUC %, >80 pixels); Medium (AUC %, 30–80 pixels).
We have introduced a new feature learning model with an application to pedestrian detection. Contrary to popular models where the low-level features are hand-designed, our model learns all features at all levels of the hierarchy. We used the method of  as a baseline, and extended it by combining high- and low-resolution features in the model and by learning features on the color channels of the input. Using the INRIA dataset, we have shown that these improvements provide clear performance benefits. The resulting model provides state-of-the-art or competitive results on most measures of all publicly available datasets. Small-scale pedestrian measures can be improved in future work by training multiple-scale models that rely less on high-resolution details. While computational speed was not the focus and hence was not reported here, our model was successfully used at near real-time speed in a haptic belt system  using parallel hardware. In future work, models designed for speed, combined with highly optimized parallel computing on graphics cards, are expected to yield competitive computational performance.
-  M. Aharon, M. Elad, and A. M. Bruckstein. K-SVD and its non-negative variant for dictionary design. In M. Papadakis, A. F. Laine, and M. A. Unser, editors, Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Series, volume 5914, pages 327–339, Aug. 2005.
-  A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Img. Sci., 2(1):183–202, 2009.
-  R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2903–2910. IEEE, 2012.
-  Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems 19, pages 153–160. MIT Press, 2007.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In C. Schmid, S. Soatto, and C. Tomasi, editors, CVPR’05, volume 2, pages 886–893, June 2005.
-  I. Daubechies, M. Defrise, and C. De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11):1413–1457, 2004.
-  P. Dollár, R. Appel, and W. Kienzle. Crosstalk cascades for frame-rate pedestrian detection.
-  P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In BMVC 2010, Aberystwyth, UK.
-  P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC 2009, London, England.
-  P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: A benchmark. In CVPR’09. IEEE, June 2009.
-  P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. PAMI, 99, 2011.
-  J. Fan, W. Xu, Y. Wu, and Y. Gong. Human tracking using convolutional neural networks. Neural Networks, IEEE Transactions on, 21(10):1610–1623, 2010.
-  P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. In PAMI 2010.
-  A. Frome, G. Cheung, A. Abdulkader, M. Zennaro, B. Wu, A. Bissacco, H. Adam, H. Neven, and L. Vincent. Large-scale privacy protection in street-level imagery. In ICCV’09.
-  C. Garcia and M. Delakis. Convolutional face finder: A neural architecture for fast and robust face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004.
-  G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
-  K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun. What is the best multi-stage architecture for object recognition? In ICCV’09. IEEE, 2009.
-  K. Kavukcuoglu, M. Ranzato, R. Fergus, and Y. LeCun. Learning invariant features through topographic filter maps. In CVPR’09. IEEE, 2009.
-  K. Kavukcuoglu, M. Ranzato, and Y. LeCun. Fast inference in sparse coding algorithms with applications to object recognition. Technical report, CBLL, Courant Institute, NYU, 2008. CBLL-TR-2008-12-01.
-  K. Kavukcuoglu, P. Sermanet, Y. Boureau, K. Gregor, M. Mathieu, and Y. LeCun. Learning convolutional feature hierachies for visual recognition. In Advances in Neural Information Processing Systems (NIPS 2010), 2010.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS 2012: Neural Information Processing Systems.
-  Q. Le, M. Quigley, J. Feng, J. Chen, Y. Zou, W, M. Rasi, T. Low, and A. Ng. Haptic belt with pedestrian detection. In NIPS, 2011 (Demonstrations).
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
-  H. Lee, A. Battle, R. Raina, and A. Y. Ng. Efficient sparse coding algorithms. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 801–808. MIT Press, Cambridge, MA, 2007.
-  H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML'09, pages 609–616. ACM, 2009.
-  Y. Li and S. Osher. Coordinate descent optimization for l1 minimization with application to compressed sensing; a greedy algorithm. CAM Report 09-17.
-  J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman. Discriminative learned dictionaries for local image analysis. Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8, June 2008.
-  S. Maji, A. C. Berg, and J. Malik. Classification using intersection kernel support vector machines is efficient. In CVPR 2008, pages 1–8, Los Alamitos, CA, USA, 2008. IEEE Computer Society.
-  S. Nowlan and J. Platt. A convolutional neural network hand tracker. pages 901–908, San Mateo, CA, 1995. Morgan Kaufmann.
-  B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: a strategy employed by v1? Vision Research, 37(23):3311–3325, 1997.
-  M. Osadchy, Y. LeCun, and M. Miller. Synergistic face detection and pose estimation with energy-based models. Journal of Machine Learning Research, 8:1197–1215, May 2007.
-  M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse representations with an energy-based model. In NIPS’07. MIT Press, 2007.
-  W. R. Schwartz, A. Kembhavi, D. Harwood, and L. S. Davis. Human detection using partial least squares analysis. In Computer Vision, 2009 IEEE 12th International Conference on, pages 24–31, Sept. 29–Oct. 2, 2009.
-  P. Sermanet, S. Chintala, and Y. LeCun. Convolutional neural networks applied to house numbers digit classification. In Proceedings of International Conference on Pattern Recognition, 2012.
-  P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proceedings of International Joint Conference on Neural Networks, 2011.
-  G. Taylor, R. Fergus, G. Williams, I. Spiro, and C. Bregler. Pose-sensitive embedding by nonlinear nca regression. In Advances in Neural Information Processing Systems NIPS 23, 2010.
-  R. Vaillant, C. Monrocq, and Y. LeCun. Original approach for the localisation of objects in images. IEE Proc on Vision, Image, and Signal Processing, 141(4):245–250, August 1994.
-  S. Walk, N. Majer, K. Schindler, and B. Schiele. New features and insights for pedestrian detection. In CVPR 2010, San Francisco, California.
-  M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus. Deconvolutional Networks. In CVPR’10. IEEE, 2010.
5 Evidence for using the proposed continuous Area Under Curve measure
6 Evidence for using the proposed fixed INRIA dataset
7 All results with the continuous AUC measure
Measures reported (continuous AUC %): all; reasonable; scale=large; scale=near; scale=medium; scale=far; occ=partial; occ=heavy; ar=atypical.
8 All results with the original discrete AUC measure
Measures reported (discrete AUC %): all; reasonable; scale=large; scale=near; scale=medium; scale=far; occ=partial; occ=heavy; ar=atypical.