## 1 Introduction

In recent years object detectors have undergone an impressive transformation [Felzenszwalb2009PAMI, SermanetARXIV2013, Girshick2014CVPR]. Nevertheless, boosted detectors remain extraordinarily successful for fast detection of quasi-rigid objects. Such detectors were first proposed by Viola and Jones in their landmark work on efficient sliding window detection that made face detection practical and commercially viable [Viola2004IJCV]. This initial architecture remains largely intact today: boosting [Schapire1998TAS, Friedman2000AS] is used to train and combine decision trees and a cascade is employed to allow for fast rejection of negative samples. Details, however, have evolved considerably; in particular, significant progress has been made on the feature representation [Dalal2005CVPR, Dollar2009BMVC, Benenson2013seeking] and cascade architecture [BourdevCVPR05, Dollar2012ECCV]. Recent boosted detectors [Benenson2012CVPR, DollarPAMI14pyramids] achieve state-of-the-art accuracy on modern benchmarks [Dollar2011PAMI, Mathias2013traffic] while retaining computational efficiency.

While boosted detectors have evolved considerably over the past decade, decision trees with orthogonal (single feature) splits, also known as axis-aligned decision trees, remain popular and predominant. A possible explanation for the persistence of orthogonal splits is their efficiency: oblique (multiple feature) splits incur considerable computational cost during both training and detection. Nevertheless, oblique trees can hold considerable advantages. In particular, Menze et al. [Menze2011ML] recently demonstrated that oblique trees used in conjunction with random forests are quite effective given high dimensional data with heavily correlated features.

To achieve similar advantages while avoiding the computational expense of oblique trees, we instead take inspiration from recent work by Hariharan et al. [Hariharan2012ECCV] and propose to decorrelate features prior to applying orthogonal trees. To do so we introduce an efficient feature transform that removes correlations in local image neighborhoods (as opposed to decorrelating features globally as in [Hariharan2012ECCV]). The result is an overcomplete but locally decorrelated representation that is ideally suited for use with orthogonal trees. In fact, orthogonal trees with our locally decorrelated features require estimation of fewer parameters and actually outperform oblique trees trained over the original features.

We evaluate boosted decision tree learning with decorrelated features in the context of pedestrian detection. As our baseline we utilize the aggregated channel features (ACF) detector [DollarPAMI14pyramids], a popular, top-performing detector for which source code is available online. Coupled with use of deeper trees and a denser sampling of the data, the improvement obtained using our locally decorrelated channel features (LDCF) is substantial. While in the past year the use of deep learning [Ouyang2013ICCV], motion features [Park2013CVPR], and multi-resolution models [Yan2013CVPR] has brought down log-average miss rate (MR) to under on the Caltech Pedestrian Dataset [Dollar2011PAMI], LDCF reduces MR to under . This translates to a nearly tenfold reduction in false positives over the (very recent) state-of-the-art.

The paper is organized as follows. In §2 we review orthogonal and oblique trees and demonstrate that orthogonal trees trained on decorrelated data may be as effective as, or more effective than, oblique trees trained on the original data. We introduce the baseline in §3 and in §4 show that use of oblique trees improves results but at considerable computational expense. Next, in §5, we demonstrate that orthogonal trees trained with locally decorrelated features are efficient and effective. Experiments and results are presented in §6. We begin by briefly reviewing related work next.

### 1.1 Related Work

Pedestrian Detection: Recent work in pedestrian detection includes use of deformable part models and their extensions [Felzenszwalb2009PAMI, Yan2013CVPR, Park2010ECCV], convolutional nets and deep learning [Sermanet2013CVPR, Zeng2013ICCV, Ouyang2013ICCV], and approaches that focus on optimization and learning [Marin2013ICCV, Levi2013CVPR, Shen2013IJCV]. Boosted detectors are also widely used. In particular, the channel features detectors [Dollar2009BMVC, Benenson2012CVPR, Benenson2013seeking, DollarPAMI14pyramids] are a family of conceptually straightforward and efficient detectors based on boosted decision trees computed over multiple feature channels such as color, gradient magnitude, gradient orientation and others. Current top results on the INRIA [Dalal2005CVPR] and Caltech [Dollar2011PAMI] Pedestrian Datasets include instances of the channel features detector with additional mid-level edge features [LimCVPR13] and motion features [Park2013CVPR], respectively.

Oblique Decision Trees: Typically, decision trees are trained with orthogonal (single feature) splits; however, the extension to oblique (multiple feature) splits is fairly intuitive and well known, see e.g. [Murthy1994JAIR]. In fact, Breiman’s foundational work on random forests [Breiman01ML] experimented with oblique trees. Recently there has been renewed interest in random forests with oblique splits [Menze2011ML, Rodriguez2006PAMI] and Marin et al. [Marin2013ICCV] even applied such a technique to pedestrian detection. Likewise, while typically orthogonal trees are used with boosting [Friedman2000AS], oblique trees can easily be used instead. The contribution of this work is not the straightforward coupling of oblique trees with boosting, rather, we propose a local decorrelation transform that eliminates the necessity of oblique splits altogether.

Decorrelation: Decorrelation is a common pre-processing step for classification [Krizhevsky09Toronto, Hariharan2012ECCV]. In recent work, Hariharan et al. [Hariharan2012ECCV] proposed an efficient scheme for estimating covariances between HOG features [Dalal2005CVPR] with the goal of replacing linear SVMs with LDA and thus allowing for fast training. Hariharan et al. demonstrated that the global covariance matrix for a detection window can be estimated efficiently as the covariance between two features should depend only on their relative offset. Inspired by [Hariharan2012ECCV], we likewise exploit the stationarity of natural image statistics, but instead propose to estimate a local covariance matrix shared across all image patches. Next, rather than applying global decorrelation, which would be computationally prohibitive for sliding window detection with a nonlinear classifier (global decorrelation coupled with a linear classifier is efficient, as the two linear operations can be merged), we instead propose to apply an efficient local decorrelation transform. The result is an overcomplete representation well suited for use with orthogonal trees.

## 2 Boosted Decision Trees with Correlated Data

Boosting is a simple yet powerful tool for classification and can model complex non-linear functions [Schapire1998TAS, Friedman2000AS]. The general idea is to train and combine a number of weak learners into a more powerful strong classifier. Decision trees are frequently used as the weak learner in conjunction with boosting, and in particular orthogonal decision trees, that is trees in which every split is a threshold on a single feature, are especially popular due to their speed and simplicity [Viola2004IJCV, DollarPAMI14pyramids, Benenson2012CVPR].

The representational power obtained by boosting orthogonal trees is not limited by use of orthogonal splits; however, the number and depth of the trees necessary to fit the data may be large. This can lead to complex decision boundaries and poor generalization, especially given highly correlated features. Figure 1(a)-(c) shows the result of boosted orthogonal trees on correlated data. Observe that the orthogonal trees generalize poorly even as we vary the number and depth of the trees.

Decision trees with oblique splits can more effectively model data with correlated features as the topology of the resulting classifier can better match the natural topology of the data [Menze2011ML]. In oblique trees, every split is based on a linear projection of the data $z = w^\top x$ followed by thresholding. The projection $w$ can be sparse (and orthogonal splits are a special case with $\|w\|_0 = 1$). While in principle numerous approaches can be used to obtain $w$, in practice linear discriminant analysis (LDA) is a natural choice for obtaining discriminative splits efficiently [Hastie2009BOOK]. LDA aims to minimize within-class scatter while maximizing between-class scatter. The projection $w$ is computed from class-conditional mean vectors $\mu_+$ and $\mu_-$ and a class-independent covariance matrix $\Sigma$ as follows:

$$w = \Sigma^{-1}(\mu_+ - \mu_-) \tag{1}$$

The covariance may be degenerate if the amount or underlying dimension of the data is low; in this case LDA can be regularized by using $(1-\epsilon)\Sigma + \epsilon I$ in place of $\Sigma$. In Figure 1(d) we apply boosted oblique trees trained with LDA on the same data as before. Observe the resulting decision boundary better matches the underlying data distribution and shows improved generalization.
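As a minimal sketch of how one such oblique split could be obtained, the snippet below computes the standard two-class LDA direction $w = \Sigma^{-1}(\mu_+ - \mu_-)$ with NumPy. The function name `lda_projection`, the shrinkage weight `eps`, the shrinkage form $(1-\epsilon)\Sigma + \epsilon I$, and the synthetic correlated data are all assumptions made for illustration; this is not the detector's training code.

```python
import numpy as np

def lda_projection(X_pos, X_neg, eps=0.1):
    """Return an LDA split direction w for two classes (illustrative sketch)."""
    mu_pos = X_pos.mean(axis=0)
    mu_neg = X_neg.mean(axis=0)
    # Class-independent (pooled) covariance of the centred samples.
    centred = np.vstack([X_pos - mu_pos, X_neg - mu_neg])
    sigma = centred.T @ centred / (len(centred) - 2)
    # Assumed regularizer: shrink toward the identity so the solve is well conditioned.
    sigma_reg = (1.0 - eps) * sigma + eps * np.eye(sigma.shape[0])
    # Solve sigma_reg @ w = mu_pos - mu_neg instead of forming an explicit inverse.
    return np.linalg.solve(sigma_reg, mu_pos - mu_neg)

# Synthetic, strongly correlated 2-D data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X_pos = rng.multivariate_normal([1.0, 0.0], cov, size=500)
X_neg = rng.multivariate_normal([0.0, 1.0], cov, size=500)

w = lda_projection(X_pos, X_neg)
# An oblique split thresholds the 1-D projection X @ w.
threshold = 0.5 * (X_pos.mean(axis=0) + X_neg.mean(axis=0)) @ w
accuracy = 0.5 * ((X_pos @ w > threshold).mean() + (X_neg @ w <= threshold).mean())
```

On data like this, a single threshold on either raw feature separates the classes poorly, whereas one threshold on the projection `X @ w` separates them well, which is the motivation for oblique splits.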

The connection between whitening and LDA is well known [Hariharan2012ECCV]. Specifically, LDA simplifies to a trivial classification rule on whitened data (data whose covariance is the identity). Let $\Sigma = Q \Lambda Q^\top$ be the eigendecomposition of $\Sigma$ where $Q$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix of eigenvalues. $W = Q \Lambda^{-1/2} Q^\top = \Sigma^{-1/2}$ is known as a whitening matrix because the covariance of $x' = Wx$ is the identity matrix. Given whitened data and means, LDA can be interpreted as learning the trivial projection $w' = \mu'_+ - \mu'_-$ since $w'^\top x' = w^\top x$. Can whitening or a related transform likewise simplify learning of boosted decision trees?

Using standard terminology [Krizhevsky09Toronto], we define the following related transforms: decorrelation ($Q^\top$), PCA-whitening ($\Lambda^{-1/2} Q^\top$), and ZCA-whitening ($Q \Lambda^{-1/2} Q^\top$). Figure 2 shows the result of boosting orthogonal trees on the variously transformed features, using the same data as before. Observe that with decorrelated and PCA-whitened features orthogonal trees show improved generalization. In fact, as each split is invariant to scaling of individual features, orthogonal trees with PCA-whitened and decorrelated features give identical results. Decorrelating the features is critical, while scaling is not. The intuition is clear: each split operates on a single feature, which is most effective if the features are decorrelated. Interestingly, the standard ZCA-whitened transform used by LDA is ineffective: while the resulting features are not technically correlated, due to the additional rotation by $Q$ each resulting feature is a linear combination of features obtained by PCA-whitening.
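The three transforms can be compared numerically. The sketch below assumes the usual definitions (decorrelation $Q^\top$, PCA-whitening $\Lambda^{-1/2} Q^\top$, ZCA-whitening $Q \Lambda^{-1/2} Q^\top$, with $\Sigma = Q \Lambda Q^\top$) and applies them to synthetic correlated data; the data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X = rng.multivariate_normal([0.0, 0.0], cov, size=2000)
X = X - X.mean(axis=0)                      # centre the data

def covariance(Z):
    """Sample covariance of row-wise samples that are already centred."""
    return Z.T @ Z / (len(Z) - 1)

sigma = covariance(X)
eigvals, Q = np.linalg.eigh(sigma)          # sigma = Q @ diag(eigvals) @ Q.T
scale = np.diag(eigvals ** -0.5)

# Samples are rows, so applying a matrix A to each x means X @ A.T.
X_decorr = X @ Q                            # decorrelation:  Q^T x
X_pca = X @ Q @ scale                       # PCA-whitening:  Lambda^{-1/2} Q^T x
X_zca = X @ Q @ scale @ Q.T                 # ZCA-whitening:  Q Lambda^{-1/2} Q^T x

# PCA-whitened features are per-feature rescalings of the decorrelated ones,
# so a single-feature threshold induces the same partition of the data on both.
same_up_to_scale = np.allclose(X_pca, X_decorr * eigvals ** -0.5)
```

All three outputs have diagonal sample covariance, but only the first two keep each output feature aligned with a single eigenvector; the ZCA output rotates back by $Q$ and so mixes them again.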

## 3 Baseline Detector (ACF)

We next briefly review our baseline detector and evaluation benchmark. This will allow us to apply the ideas from §2 to object detection in subsequent sections. In this work we utilize the channel features detectors [Dollar2009BMVC, DollarPAMI14pyramids, Benenson2012CVPR, Benenson2013seeking], a family of conceptually straightforward and efficient detectors for which variants have been utilized for diverse tasks such as pedestrian detection [Dollar2011PAMI], sign recognition [Mathias2013traffic] and edge detection [LimCVPR13]. Specifically, for our experiments we focus on pedestrian detection and employ the aggregate channel features (ACF) variant [DollarPAMI14pyramids] for which code is available online (http://vision.ucsd.edu/~pdollar/toolbox/doc/).

Given an input image, ACF computes several feature channels, where each channel is a per-pixel feature map such that output pixels are computed from corresponding patches of input pixels (thus preserving image layout). We use the same channels as [DollarPAMI14pyramids]: normalized gradient magnitude (1 channel), histogram of oriented gradients (6 channels), and LUV color channels (3 channels), for a total of 10 channels. We downsample the channels by 2x and features are single pixel lookups in the aggregated channels. Thus, given a detection window, there are candidate features (channel pixel lookups). We use RealBoost [Friedman2000AS] with multiple rounds of bootstrapping to train and combine 2048 depth-3 decision trees over these features to distinguish object from background. Soft-cascades [BourdevCVPR05] and an efficient multiscale sliding-window approach are employed. Our baseline uses slightly altered parameters from [DollarPAMI14pyramids] (RealBoost, deeper trees, and less downsampling); this increases model capacity and benefits our final approach as we report in detail in §6.

Current practice is to use the INRIA Pedestrian Dataset [Dalal2005CVPR] for parameter tuning, with the test set serving as a validation set, see e.g. [Marin2013ICCV, Benenson2013seeking, Dollar2009BMVC]. We utilize this dataset in much the same way and report full results on the more challenging Caltech Pedestrian Dataset [Dollar2011PAMI]. Following the methodology of [Dollar2011PAMI], we summarize performance using the log-average miss rate (MR) between $10^{-2}$ and $10^{0}$ false positives per image. We repeat all experiments 10 times and report the mean MR and standard error for every result. Due to the use of a log-log scale, even small improvements in (log-average) MR correspond to large reductions in false-positives. On INRIA, our (slightly modified) baseline version of ACF scores at MR compared to MR for the model reported in [DollarPAMI14pyramids].

## 4 Detection with Oblique Splits (ACF-LDA)

In this section we modify the ACF detector to enable oblique splits and report the resulting gains. Recall that given input $x$, at each split of an oblique decision tree we need to compute $z = w^\top x$ for some projection $w$ and threshold the result. For our baseline pedestrian detector, each detection window is represented by a high-dimensional feature vector $x$ (see §3). Given the high dimensionality of the input coupled with the use of thousands of trees in a typical boosted classifier, for efficiency $w$ must be sparse.

Local $w$: We opt to use $w$'s that correspond to local blocks of pixels. In other words, we treat $x$ as a tensor and allow $w$ to operate over any patch in a single channel of $x$. Doing so holds multiple advantages. Most importantly, each pixel has strongest correlations to spatially nearby pixels [Hariharan2012ECCV]. Since oblique splits are expected to help most when features are strongly correlated, operating over local neighborhoods is a natural choice. In addition, using local $w$ allows for faster lookups due to the locality of adjacent pixels in memory.

Complexity: First, let us consider the complexity of training the oblique splits. Let be the window size of a single channel. The number of patches per channel in is about , thus naively training a single split means applying LDA times – once per patch – and keeping with lowest error. Instead of computing independent matrices per channel, for efficiency, we compute , a covariance matrix for the entire window, and reconstruct individual ’s by fetching appropriate entries from . A similar trick can be used for the ’s. Computing is given training examples (and could be made faster by omitting unnecessary elements). Inverting each , the bottleneck of computing Eq. (1), is but independent of and thus fairly small as . Finally computing over all training examples and projections is . Given the high complexity of each step, a naive brute-force approach for training is infeasible.

Speedup: While the weights over training examples change at every boosting iteration and after every tree split, in practice we find it is unnecessary to recompute the projections that frequently. Table 1, rows 2-4, shows the results of ACF with oblique splits, updated every boosting iterations (denoted by ACF-LDA-). While more frequent updates improve accuracy, ACF-LDA-16 has negligibly higher MR than ACF-LDA-4 but a nearly fourfold reduction in training time (timed using 12 cores). Training the brute force version of ACF-LDA, updated at every iteration and each tree split (7 interior nodes per depth-3 tree) would have taken about hours. For these results we used regularization of and patch size of (effect of varying is explored in §6).

| Method | Shared $\Sigma$ | Update interval (iterations) | Miss rate | Training time (min) |
|---|---|---|---|---|
| ACF | - | - | | |
| ACF-LDA-4 | No | 4 | | |
| ACF-LDA-16 | No | 16 | | |
| ACF-LDA- | No | | | |
| ACF-LDA-4 | Yes | 4 | | |
| ACF-LDA-16 | Yes | 16 | | |
| ACF-LDA- | Yes | | | |
| LDCF | Yes | - | | |

Shared $\Sigma$: The crux and computational bottleneck of ACF-LDA is the computation and application of a separate covariance $\Sigma$ at each local neighborhood. In recent work on training linear object detectors using LDA, Hariharan et al. [Hariharan2012ECCV] exploited the observation that the statistics of natural images are translationally invariant and therefore the covariance between two features should depend only on their relative offset. Furthermore, as positives are rare, [Hariharan2012ECCV] showed that the covariances can be precomputed using natural images. Inspired by these observations, *we propose to use a single, fixed covariance $\Sigma$ shared across all local image neighborhoods*. We precompute one $\Sigma$ per channel and do not allow it to vary spatially or with boosting iteration. Table 1, rows 5-7, shows the results of ACF with oblique splits using fixed $\Sigma$, denoted by ACF-LDA. As before, the $\mu$'s and resulting $w$ are updated periodically rather than at every boosting iteration. As expected, training time is reduced relative to ACF-LDA. Surprisingly, however, accuracy improves as well, presumably due to the implicit regularization effect of using a fixed $\Sigma$. This is a powerful result we will exploit further.
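The stationarity assumption can be made concrete with a short sketch: if the covariance between two pixels depends only on their relative offset, the full covariance of a patch can be assembled by lookup from a single autocorrelation table. The helper `patch_covariance`, the patch size `m = 5`, and the squared-exponential table are hypothetical choices for illustration, not values taken from the experiments.

```python
import numpy as np

def patch_covariance(autocorr, m):
    """Build the (m*m, m*m) covariance of an m x m patch by table lookup.

    autocorr has shape (2*m - 1, 2*m - 1); autocorr[m - 1 + dy, m - 1 + dx]
    holds the covariance between two pixels separated by offset (dy, dx).
    """
    ys, xs = np.divmod(np.arange(m * m), m)   # row and column of each patch pixel
    dy = ys[None, :] - ys[:, None]            # pairwise row offsets
    dx = xs[None, :] - xs[:, None]            # pairwise column offsets
    return autocorr[dy + m - 1, dx + m - 1]

m = 5
offsets = np.arange(-(m - 1), m)
# Hypothetical squared-exponential autocorrelation, for illustration only.
autocorr = np.exp(-(offsets[:, None] ** 2 + offsets[None, :] ** 2) / 8.0)
sigma = patch_covariance(autocorr, m)         # one matrix reused for every patch
```

Because the table does not depend on where the patch sits in the window, the same `sigma` serves every patch location in a channel.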

Summary: ACF with local oblique splits and a single shared $\Sigma$ (ACF-LDA-4) achieves MR compared to MR for ACF with orthogonal splits. The improvement in log-average MR corresponds to a nearly twofold reduction in false positives but comes at considerable computational cost. In the next section, we propose an alternative, more efficient approach for exploiting the use of a single shared $\Sigma$ capturing correlations in local neighborhoods.

## 5 Locally Decorrelated Channel Features (LDCF)

We now have all the necessary ingredients to introduce our approach. We have made the following observations: (1) oblique splits learned with LDA over local patches improve results over orthogonal splits, (2) a single covariance matrix can be shared across all patches per channel, and (3) orthogonal trees with decorrelated features can potentially be used in place of oblique trees. This suggests the following approach: for every patch $p$ in $x$, we can create a decorrelated representation by computing $Q^\top p$, where $Q \Lambda Q^\top$ is the eigendecomposition of $\Sigma$ as before, followed by use of orthogonal trees. However, such an approach is computationally expensive.

First, due to use of overlapping patches, computing $Q^\top p$ for every overlapping patch results in an overcomplete representation with a factor increase in feature dimensionality. To reduce dimensionality, we only utilize the top eigenvectors in $Q$, resulting in features per pixel. The intuition is that the top eigenvectors capture the salient neighborhood structure. Our experiments in §6 confirm this: using as few as eigenvectors per channel for patches of size is sufficient. As our second speedup, we observe that the projection $Q^\top p$ can be computed by a series of convolutions between a channel image and each filter reshaped from its corresponding eigenvector (column of $Q$). This is possible because the covariance matrix $\Sigma$ is shared across all patches per channel and hence the derived $Q$ is likewise spatially invariant. Decorrelating all channels in an entire feature pyramid for a image takes about seconds.

In summary, we modify ACF by taking the original channels and applying decorrelating (linear) filters per channel. The result is a set of locally decorrelated channel features (LDCF). To further increase efficiency, we downsample the decorrelated channels by a factor of 2x which has negligible impact on accuracy but reduces feature dimension to the original value. Given the new locally decorrelated channels, all other steps of ACF training and testing are identical. The extra implementation effort is likewise minimal: given the decorrelation filters, a few lines of code suffice to convert ACF into LDCF. To further improve clarity, all source code for LDCF will be released.
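The conversion from a channel to its locally decorrelated counterparts can be sketched in a few lines of NumPy. The code below is an illustrative approximation under stated assumptions, not the released implementation: the function names, the patch size `m = 5`, the filter count `k = 4`, the edge padding, and the squared-exponential stand-in for the estimated covariance are all hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def decorrelation_filters(sigma, m, k):
    """Return the k leading eigenvectors of sigma reshaped into m x m filters."""
    eigvals, Q = np.linalg.eigh(sigma)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return Q[:, top].T.reshape(k, m, m)

def decorrelate_channel(channel, filters):
    """Project the m x m patch around every pixel onto each filter."""
    k, m, _ = filters.shape
    padded = np.pad(channel, m // 2, mode="edge")    # assumed border handling
    patches = sliding_window_view(padded, (m, m))    # (H, W, m, m) for odd m
    return np.einsum("hwij,kij->khw", patches, filters)

m, k = 5, 4                                          # illustrative values
ys, xs = np.divmod(np.arange(m * m), m)
dist2 = (ys[:, None] - ys[None, :]) ** 2 + (xs[:, None] - xs[None, :]) ** 2
sigma = np.exp(-dist2 / 8.0)       # hypothetical stand-in for an estimated covariance

filters = decorrelation_filters(sigma, m, k)
channel = np.random.default_rng(0).random((32, 16))
features = decorrelate_channel(channel, filters)     # shape (k, 32, 16)
```

Each output channel is a dense cross-correlation of the input with one fixed filter, so the per-patch projection never has to be formed explicitly and any fast 2-D filtering routine could be substituted for the `einsum`.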

Results of the LDCF detector on the INRIA dataset are given in the last row of Table 1. The LDCF detector (which uses orthogonal splits) improves accuracy over ACF with oblique splits by an additional MR. Training time is significantly faster, and indeed, is only minute longer than for the original ACF detector. More detailed experiments and results are reported in §6. We conclude by (1) describing the estimation of $\Sigma$ for each channel, (2) showing various visualizations, and (3) discussing the filters themselves and connections to known filters.

Estimating $\Sigma$: We can estimate a spatially constant $\Sigma$ for each channel using any large collection of natural images. The covariance $\Sigma$ for each channel is represented by a spatial autocorrelation function. Given a collection of natural images, we first estimate a separate autocorrelation function for each image and then average the results. Naive computation of the final function is $O(np^2)$ but the Wiener-Khinchin theorem reduces the complexity to $O(np \log p)$ via the FFT [Box1994], where $n$ and $p$ denote the number of images and pixels per image, respectively.
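A minimal sketch of the FFT-based estimate is given below. By the Wiener-Khinchin theorem, the autocorrelation of an image is the inverse Fourier transform of its power spectrum. The function names and the random stand-in images are assumptions, and the sketch computes the circular (wrap-around) autocorrelation for simplicity; zero-padding each image before the transform would avoid the wrap-around.

```python
import numpy as np

def autocorrelation_fft(image):
    """Circular spatial autocorrelation of one mean-subtracted channel image."""
    centred = image - image.mean()
    power = np.abs(np.fft.fft2(centred)) ** 2        # power spectrum
    corr = np.fft.ifft2(power).real / centred.size   # Wiener-Khinchin
    return np.fft.fftshift(corr)                     # zero offset at the array centre

def average_autocorrelation(images):
    """Average the per-image autocorrelation functions."""
    return np.mean([autocorrelation_fft(im) for im in images], axis=0)

# Random arrays stand in for channel images (illustration only).
rng = np.random.default_rng(0)
images = [rng.random((16, 16)) for _ in range(3)]
C = average_autocorrelation(images)                  # C[8 + dy, 8 + dx] for offset (dy, dx)
```

Each image costs two FFTs, which is where the $O(p \log p)$ per-image cost comes from, compared with $O(p^2)$ for summing products over every pair of pixels.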

Visualization: Fig. 3, top-left, illustrates the estimated autocorrelations for each channel. Nearby features are highly correlated and oriented gradients are spatially correlated along their orientation due to curvilinear continuity [Hariharan2012ECCV]. Fig. 3, bottom-left, shows the decorrelation filters for each channel obtained by reshaping the largest eigenvectors of $\Sigma$. The largest eigenvectors are smoothing filters while the smaller ones resemble increasingly higher-frequency filters. The corresponding eigenvalues decay rapidly and in practice we use the top filters. Observe that the decorrelation filters for oriented gradients are aligned to their orientation. Finally, Fig. 3, right, shows original and decorrelated channels averaged over positive training examples.

Discussion: Our decorrelation filters are closely related to sinusoidal, DCT basis, and Gaussian derivative filters. Spatial interactions in natural images are often well-described by Markov models [Geman1984PAMI] and first-order stationary Markov processes are known to have sinusoidal KLT bases [Ray1970ITIT]. In particular, for the LUV color channels, our filters are similar to the discrete cosine transform (DCT) bases that are often used to approximate the KLT. For oriented gradients, however, the decorrelation filters are no longer well modeled by the DCT bases (note also that our filters are applied densely whereas the DCT typically uses block processing). Alternatively, we can interpret our filters as Gaussian derivative filters. Assume that the autocorrelation is modeled by a squared-exponential function, which is fairly reasonable given the estimation results in Fig. 3. In 1D, the $k^{\text{th}}$ largest eigenfunction of such an autocorrelation function is a $(k-1)^{\text{th}}$ order Gaussian derivative filter [Rasmussen2006]. It is straightforward to extend the result to an anisotropic multivariate case, in which case the eigenfunctions are Gaussian directional derivative filters similar to our filters.

## 6 Experiments

| Method | Description | # channels | Miss rate |
|---|---|---|---|
| 1. ACF | (modified) baseline | | |
| 2. LDCF small | decorrelation with smallest filters | | |
| 3. LDCF random | filtering with random filters | | |
| 4. LDCF LUV only | decorrelation of LUV channels only | | |
| 5. LDCF grad only | decorrelation of grad channels only | | |
| 6. LDCF constant | decorrelation with constant filters | | |
| 7. LDCF | proposed approach | | |

In this section, we demonstrate the effectiveness of locally decorrelated channel features (LDCF) in the context of pedestrian detection. We: (1) study the effect of parameter settings, (2) test variations of our approach, and finally (3) compare our results with the state-of-the-art.

Parameters: LDCF has two parameters: the count and size of the decorrelation filters. Fig. 4(a) and (b) show the results of LDCF on the INRIA dataset while varying the filter count () and size (), respectively. Use of decorrelation filters of size improves performance up to MR compared to ACF. Inclusion of additional higher-frequency filters or use of larger filters can cause performance degradation. For all remaining experiments we fix and .

Variations: We test variants of LDCF and report results on INRIA in Table 2. LDCF (row 7) outperforms all variants, including the baseline (1). Filtering the channels with the smallest eigenvectors (2) or random filters (3) performs worse. Local decorrelation of only the color channels (4) or only the gradient channels (5) is inferior to decorrelation of all channels. Finally, we test constant decorrelation filters obtained from the intensity channel L that resemble the first DCT basis filters. Use of unique filters per channel outperforms use of constant filters across all channels (6).

Model Capacity: Use of locally decorrelated features implicitly allows for richer, more effective splitting functions, increasing modeling capacity and generalization ability. Inspired by their success, we explore additional strategies for augmenting model capacity. For the following experiments, we rely solely on the training set of the Caltech Pedestrian Dataset [Dollar2011PAMI]. Of the 71 minute long training videos (k images), we use every fourth video as validation data and the rest for training. On the validation set, LDCF outperforms ACF by a considerable margin, reducing MR from to . We first augment model capacity by increasing the number of trees twofold (to 4096) and the sampled negatives fivefold (to 50k). Surprisingly, doing so reduces MR by an additional 4%. Next, we experiment with increasing maximum tree depth while simultaneously enlarging the amount of data available for training. Typically, every 30^{th} image in the Caltech dataset is used for training and testing. Instead, Figure 4(c) shows validation performance of LDCF with different tree depths while varying the training data sampling interval. The impact of maximum depth on performance is quite large. At a dense sampling interval of every 4^{th} frame, use of depth-5 trees (up from depth-2 for the original approach) improves performance by an additional to MR. Note that consistent with the generalization bounds of boosting [Schapire1998TAS], use of deeper trees requires more data.

INRIA Results: In Figure 5(a) we compare LDCF with state-of-the-art detectors on INRIA [Dalal2005CVPR] using benchmark code maintained by [Dollar2011PAMI]. Since the INRIA dataset is oft-used as a validation set, including in this work, we include these results for completeness only. LDCF is essentially tied for second place with Roerei [Benenson2013seeking] and Franken [Mathias2013ICCV] and outperformed by MR by SketchTokens [LimCVPR13]. These approaches all belong to the family of channel features detectors, and as the improvements proposed in this work are orthogonal, the methods could potentially be combined.

Caltech Results: We present our main result on the Caltech Pedestrian Dataset [Dollar2011PAMI], see Fig. 5(b), generated using the official evaluation code available online (http://www.vision.caltech.edu/Image_Datasets/CaltechPedestrians/). The Caltech dataset has become the standard for evaluating pedestrian detectors and the latest methods based on deep learning (JointDeep) [Ouyang2013ICCV], multi-resolution models (MT-DPM) [Yan2013CVPR] and motion features (ACF+SDt) [Park2013CVPR] achieve under log-average MR. For a complete comparison, we first present results for an augmented capacity ACF model which uses more (4096) and deeper (depth-5) trees trained with RealBoost using dense sampling of the training data (every 4^{th} image). See preceding note on model capacity for details and motivation. This augmented model (ACF-Caltech+) achieves MR, a considerable nearly MR gain over previous methods, including the baseline version of ACF (ACF-Caltech). With identical parameters, locally decorrelated channel features (LDCF) further reduce error to MR with substantial gains at higher recall. Overall, this is a massive improvement and represents a nearly 10x reduction in false positives over the previous state-of-the-art.

## 7 Conclusion

In this work we have presented a simple, principled approach for improving boosted object detectors. Our core observation was that effective but expensive oblique splits in decision trees can be replaced by orthogonal splits over locally decorrelated data. Moreover, due to the stationary statistics of image features, the local decorrelation can be performed efficiently via convolution with a fixed filter bank precomputed from natural images. Our approach is general, simple and fast.

Our method showed dramatic improvement over previous state-of-the-art. While some of the gain was from increasing model capacity, use of local decorrelation gave a clear and significant boost. Overall, we reduced false-positives tenfold on Caltech. Such large gains are fairly rare.

In the present work we did not decorrelate features across channels (decorrelation was applied independently per channel). This is a clear future direction. Testing local decorrelation in the context of other classifiers (e.g. convolutional nets or linear classifiers as in [Hariharan2012ECCV]) would also be interesting.

While the proposed locally decorrelated channel features (LDCF) require only modest modification to existing code, we will release all source code used in this work to ease reproducibility.