Fast detection of multiple objects in traffic scenes with a common detection framework

10/12/2015 · by Qichang Hu, et al.

Traffic scene perception (TSP) aims to extract accurate on-road environment information in real time, which involves three phases: detection of objects of interest, recognition of detected objects, and tracking of objects in motion. Since recognition and tracking often rely on the results from detection, the ability to detect objects of interest effectively plays a crucial role in TSP. In this paper, we focus on three important classes of objects: traffic signs, cars, and cyclists. We propose to detect all three important objects in a single learning-based detection framework. The proposed framework consists of a dense feature extractor and detectors for the three important classes. Once the dense features have been extracted, these features are shared with all detectors. The advantage of using one common framework is that the detection speed is much faster, since all dense features need only to be evaluated once in the testing phase. In contrast, most previous works have designed specific detectors using different features for each of these objects. To enhance the feature robustness to noise and image deformations, we introduce spatially pooled features as a part of the aggregated channel features. In order to further improve the generalization performance, we propose an object subcategorization method as a means of capturing the intra-class variation of objects. We experimentally demonstrate the effectiveness and efficiency of the proposed framework in three detection applications: traffic sign detection, car detection, and cyclist detection. The proposed framework achieves competitive performance with state-of-the-art approaches on several benchmark datasets.


I Introduction

Vision-based traffic scene perception (TSP) is one of many fast-emerging areas in intelligent transportation systems. This field of research has been actively studied over the past decade [56]. TSP involves three phases: detection, recognition, and tracking of various objects of interest. Since recognition and tracking often rely on the results from detection, the ability to detect objects of interest effectively plays a crucial role in TSP. In this paper, we focus on three important classes of objects: traffic signs, cars, and cyclists. Fig. 1 shows a typical on-road traffic scene with the detected objects of interest and illustrates some positive examples from the three mentioned classes.

Fig. 1: Top image: A typical on-road traffic scene with the detected objects of interest. Bottom images: Each block represents one class of objects of interest. From left to right, the first block contains traffic sign examples, the second contains car examples, and the third contains cyclist examples.

The aim of traffic sign detection is to alert the driver to changing traffic conditions. The task is to accurately localize and recognize road signs in various traffic environments. Prior approaches [9, 8, 31] use color and shape information. However, these approaches are not robust under severe weather and lighting conditions. Additionally, the appearance of traffic signs can physically change over time, due to weather and damage caused by accidents. Instead of using color and shape features, most recent approaches [41, 62] employ texture or gradient features, such as local binary patterns (LBP) [2] and histograms of oriented gradients (HOG) [7]. These features are partially invariant to image distortion and illumination change, but they are still unable to handle severe deformations.

Car detection is a more challenging problem compared to traffic sign detection due to its large intra-class variation caused by different viewpoints and occlusions. Fig. 2 shows a set of different cars with a large intra-class variation. Although sliding-window based detection methods have shown promising results in face and human detection [61, 7], they often fail to detect cars due to a large variation of viewpoints. Recently the deformable parts model (DPM) [16], which has gained a lot of attention in generic object detection, has been adapted successfully for car detection [20, 25, 48]. In addition to the DPM, visual subcategorization based approaches [10, 30, 44] have been applied to improve the generalization performance.

Fig. 2: A set of different vehicles with different viewpoints, occlusions, and truncations.

Cyclist detection is a newly emerging application in the domain of TSP. At present, only a few methods are designed specifically for cyclist detection. Many existing pedestrian detection approaches [7, 11, 20] can be adapted for cyclist detection because the appearance of pedestrians is very similar to that of cyclists along the road. Compared to pedestrian detection, this new problem is more difficult because varied appearances and viewpoints increase the diversity of cyclists. Therefore, existing pedestrian detection approaches hardly achieve acceptable detection performance.

Most previous methods have designed specific detectors using different features for each of these objects. The approach we propose here differs from these existing approaches in that we propose a single learning-based detection framework to detect all three important classes of objects. The proposed framework consists of a dense feature extractor and detectors for these three classes. Once the dense features have been extracted, these features are shared with all detectors. The advantage of using one common framework is that the detection speed is much faster, since all dense features need only to be evaluated once in the testing phase. The proposed framework introduces spatially pooled features [47] as a part of the aggregated channel features [13] to enhance the feature robustness to noise and image deformations. In order to further improve the generalization performance, we propose an object subcategorization method as a means of capturing the intra-class variation of objects.

The remainder of this paper is organized as follows: we briefly review related works in Section II. The structure of the proposed detection framework is discussed in Section III. Experimental settings and results for all three applications are given in Section IV. Section V summarizes this paper and points out directions for future work.

II Related works

II-A Generic object detection

Object detection is a challenging but important application in the computer vision community. It has achieved successful outcomes in many practical applications such as face detection and pedestrian detection [61, 2, 7, 65]. Complete surveys of object detection can be found in [61, 7, 16, 66, 22]. This section briefly reviews several generic object detection methods.

One classical object detector is the detection framework of Viola and Jones, which uses a sliding-window search with cascaded classifiers to achieve accurate localization and efficient classification [61]. Another commonly used framework combines a linear support vector machine (SVM) classifier with histograms of oriented gradients (HOG), and has been applied successfully in pedestrian detection [7]. These frameworks achieve excellent detection results on rigid object classes. However, for object classes with a large intra-class variation, their detection performance drops dramatically [47].

In order to deal with appearance variations in object detection, a deformable parts model (DPM) based method has been proposed in [16]. This method relies on a variant of HOG features and window template matching, but explicitly models deformations using a latent SVM classifier. It has been applied successfully in many object detection applications [20, 58, 68]. In addition to the DPM, visual subcategorization [10] is another common approach to improve the generalization performance of the detection model. It divides the entire object class into multiple subclasses such that objects with similar visual appearance are grouped together. A sub-detector is trained for each subclass and the detection results from all sub-detectors are merged to generate the final results. Recently, a new detection framework which uses aggregated channel features (ACF) and a cascaded AdaBoost classifier has been proposed in [11]. This framework uses an exhaustive sliding-window search to detect objects at multiple scales. It has been adopted successfully for many practical applications [44, 41, 47].

II-B Traffic sign detection

Many traffic sign detectors have been proposed over the last decade, together with newly created challenging benchmarks. Interested readers should see [42], which provides a detailed analysis of the recent progress in the field of traffic sign detection. Most existing traffic sign detectors are appearance-based detectors. These detectors generally fall into one of four categories, namely color-based approaches, shape-based approaches, texture-based approaches, and hybrid approaches.

Color-based approaches [9, 8, 31] usually employ a two-stage strategy. First, segmentation is done by a thresholding operation in one specific color space. Subsequently, shape detection is implemented and is applied only to the segmented regions. Since the RGB color space is very sensitive to illumination change, some approaches [15, 38, 31] convert RGB space to HSI space, which is less sensitive to lighting change. Other approaches [29, 9] implement segmentation in the normalized RGB space, which is shown to outperform the HSI space [23]. Both HSI and normalized RGB space can alleviate the negative effect of illumination change, but still fail in some severe situations.

Shape-based approaches [26, 37, 57] detect edges or corners from raw images using the Canny edge detector or its variants. Edges and corners are then connected into regular polygons or circles using a Hough-like voting scheme. These detectors are invariant to illumination change, but the memory and computational requirements are quite high for large images. In [8], a genetic algorithm is adopted to detect circles and is invariant to projective deformation, but its expensive computational requirements limit its application.

Texture-based approaches first extract hand-crafted features computed from the texture of images, and then use these extracted features to train a classifier. Popular hand-crafted features include HOG, LBP, and ACF [7, 2, 11]. Some approaches [34, 62, 50] use HOG features with an SVM, while others [41] use ACF features with an AdaBoost classifier. Besides the above approaches, a convolutional neural network (CNN) has been adopted for traffic sign detection and achieved excellent results in [55].

Hybrid approaches [18, 52] are a combination of the aforementioned approaches. Usually, the initial step is segmentation to narrow the search space, as in the color-based approaches. Instead of using only edge features or texture-based features, these methods use them together to improve the detection performance.

One standard benchmark for traffic sign detection is the German traffic sign detection benchmark (GTSDB) [27] which collects three important categories of road signs (prohibitory, danger, and mandatory) from various traffic scenes. All traffic signs have been fully annotated with the rectangular regions of interest (ROIs). Researchers can conveniently compare their work based on this benchmark.

Fig. 3: Overview of the proposed detection framework. The left diagram shows the training stage and the right diagram shows the testing stage.

II-C Car detection

Many existing car detectors are vision-based detectors. Interested readers should see [56], which discusses different approaches for vehicle detection using mono, stereo, and other vision sensors. In this paper, we focus on vision-based car detectors using monocular information. These detectors can be divided into three categories: DPM-based approaches, subcategorization-based approaches, and motion-based approaches.

DPM-based approaches are built on the deformable parts model (DPM) [16], which has been successfully adopted in car detection [58]. In [20], a variant of DPM discretizes the number of car orientations and each component of the mixture model corresponds to one orientation. The authors of [25] train a variant of DPM to detect cars under severe occlusions and clutter. In [48], occlusion patterns are used as training data to train a DPM which can reason about the relationships between cars and obstacles for detection.

Visual subcategorization, which learns subcategories within an object class, is a common approach to improve model generalization in car detection [10]. It usually consists of two phases: feature extraction and clustering. Samples with similar visual features are grouped together by applying a clustering algorithm in the extracted feature space. Subcategorization-based methods are commonly used with DPM to detect cars from multiple viewpoints. In [30], subcategories of cars corresponding to car orientation are learned by using a locally linear embedding method with HOG features. In [44], cars with similar viewpoints, occlusions, and truncation scenarios are grouped into the same subcategory by using a semi-supervised clustering method with ACF features.

Motion-based approaches often use appearance cues in monocular vision, since monocular images do not provide 3D or depth information. In [4], an adaptive background model is used to detect cars based on motion that differentiates them from the background. The authors of [64] propose an adaptive background model to model the area where overtaking cars tend to appear in the camera's field of view. Optical flow [39], which is a popular tool in machine vision, has also been used for monocular car detection. In [32], a combination of optical flow and symmetry tracking is used for car detection. Optical flow is also used in conjunction with appearance-based techniques in [6].

The KITTI vision benchmark (KITTI) [19] is a novel and challenging benchmark for the tasks of stereo, optical flow, visual odometry, and 3D object detection. The KITTI dataset provides a wide range of images from various traffic scenes with fully annotated objects. Objects in the KITTI dataset include pedestrians, cyclists, and vehicles.

II-D Cyclist detection

Many existing cyclist detection approaches [53, 63] use pedestrian detection techniques since pedestrians are very similar to cyclists along the road. In [53], corner feature extraction, motion matching, and object classification are combined to detect pedestrians and cyclists simultaneously. In [63], a stereo-vision based approach is proposed for pedestrian and cyclist detection. It uses shape features and the matching criterion of the partial Hausdorff distance to detect pedestrians and cyclists. However, these approaches cannot distinguish cyclists from pedestrians. Besides the above approaches, the authors of [54] proposed a cyclist detector that uses a fixed camera to detect the two wheels of a bicycle on the road, but this approach is limited to detecting crossing cyclists. Moreover, all of the above approaches are designed for traffic monitoring using fixed cameras and cannot be used for on-road detection, which aims at intelligent driving.

III Our approach

Although several important techniques have been proposed for object detection, the conventional sliding-window based method of Viola and Jones [61] is still the most successful and practical object detector. The VJ framework consists of two main components: a dense feature extractor and an AdaBoost classifier. In this paper, we build a common object detection framework for traffic scene perception based on the VJ framework, but our framework can employ a number of AdaBoost classifiers to detect target objects of different classes. Apart from the basic components of the VJ framework, we propose an object subcategorization method to improve the generalization performance and employ spatially pooled features [47] to enhance the robustness and effectiveness.

Fig. 3 shows an overview of our framework. In the training phase, we first check the intra-class variation of the input object class with respect to object properties, e.g. size, orientation, aspect ratio, and occlusion. If the variation is considerably large, we apply the object subcategorization method to categorize the training data into multiple subcategories and train one sub-detector for each subcategory. Otherwise, we train a single detector on the entire training data. In the testing phase, raw detection results from all sub-detectors need to be calibrated before merging them together. Non-maximum suppression is used to eliminate redundant bounding boxes. If the framework employs detectors of different classes, detection results need to be carefully merged together.

III-A Object Subcategorization

For object classes with a large intra-class variation, such as cars, the appearance and shape of objects change significantly as the viewpoint changes. In order to deal with these variations, which cannot be tackled by the conventional VJ framework, we present an object subcategorization method which aims to cluster the training data into visually homogeneous subcategories. The proposed subcategorization method applies an unsupervised clustering algorithm on one specific feature space of the training data to generate multiple subcategories. This method simplifies the original learning problem by dividing it into multiple sub-problems and improves the model generalization performance.

III-A1 Visual Features

A variety of hand-designed features can be used to perform the clustering, such as HOG and ACF [7, 11]. HOG is successful at capturing the shapes of objects but does not consider color information. ACF combines both color information and gradient information, which is shown to outperform HOG [13]. In our experiments, a total of 10 channels of features are used for clustering: LUV color channels (3 channels), histograms of oriented gradients at 6 bins (6 channels), and normalized gradient magnitude (1 channel). To extract features from the training data, all positive samples are resized to the median object size.

III-A2 Geometrical Features

Besides the visual features, geometrical information of objects can be extracted from traffic scenes using a variety of sensors and methods. In the KITTI dataset, objects in images from a Velodyne laser scanner were annotated with 3D bounding boxes and 3D orientations. Ohn-Bar et al. [45] proposed an analysis of different types of geometrical features, which showed that the geometrical features outperform the visual features for clustering, even the CNN features. We use the following set of geometrical features to represent the object instances in our experiments.

3D orientation The appearance and shape of objects change significantly as the viewpoint changes. We include the 3D orientation (relative orientation between the object and camera) in clustering, aiming at grouping objects with similar visual appearance together.

Aspect-ratio The aspect-ratio (width/height) of objects is strongly correlated with the geometry of the objects being detected. We use this feature because learning models at different aspect-ratios significantly improves the generalization performance.

Truncation level The truncation level refers to the percentage of the object outside of the image boundaries. This feature strongly affects the appearance of objects.

Occlusion index Instead of using the subtle occlusion patterns defined in [45], we use an occlusion index to indicate whether an object is not occluded, partially occluded, largely occluded, or in an unknown occlusion state. We simplify the occlusion patterns because some occlusion features cannot be defined for every object, such as occlusion level, relative orientation, and relative 3D point. These features can only be defined when the obstacle is an annotated object. However, in the KITTI dataset, many obstacles are unlabelled objects.
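As a concrete illustration, the following Python sketch assembles such a geometrical feature vector from KITTI-style annotation fields. The field names, the sine/cosine encoding of the orientation, and the direct use of the annotated occlusion flag as the occlusion index are illustrative assumptions rather than the authors' exact encoding.

import numpy as np

def geometrical_features(obj):
    # obj is assumed to carry KITTI-style annotation fields:
    # alpha (observation angle), bbox = (x1, y1, x2, y2),
    # truncated in [0, 1], occluded in {0, 1, 2, 3}.
    x1, y1, x2, y2 = obj["bbox"]
    aspect_ratio = (x2 - x1) / max(y2 - y1, 1e-6)        # width / height
    return np.array([
        np.cos(obj["alpha"]), np.sin(obj["alpha"]),      # 3D orientation, wrap-around safe
        aspect_ratio,
        obj["truncated"],                                # truncation level
        float(obj["occluded"]),                          # occlusion index
    ])

# Example: one annotated car
car = {"alpha": -1.57, "bbox": (100, 120, 260, 200), "truncated": 0.1, "occluded": 1}
print(geometrical_features(car))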

III-A3 Clustering

A clustering algorithm is used to generate a predefined number of clusters in a specific feature space. Traditional clustering schemes, such as k-means or single linkage, suffer from cluster degeneration, which means that a few clusters claim most data samples [28]. The cluster degeneration problem can be alleviated by using spectral clustering. Spectral clustering followed by k-means often outperforms the traditional schemes. We implement normalized spectral clustering using the algorithm proposed in [43]. The quality of the clustering results is very sensitive to the predefined number of clusters. Unfortunately, how to determine the appropriate number of centroids is still an open question. We experimentally determine the number of clusters for each application.
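A minimal sketch of this step is given below, using scikit-learn's SpectralClustering (normalized spectral clustering with a k-means step on the spectral embedding) as a stand-in for the algorithm of [43]; the feature matrix, affinity choice, and number of clusters are placeholders.

import numpy as np
from sklearn.cluster import SpectralClustering

def subcategorize(features, n_subcategories):
    # features: (n_samples, n_dims) visual or geometrical feature vectors.
    # Returns an integer subcategory label for every training sample.
    clusterer = SpectralClustering(
        n_clusters=n_subcategories,
        affinity="rbf",                 # Gaussian similarity graph
        assign_labels="kmeans",         # k-means on the spectral embedding
        random_state=0,
    )
    return clusterer.fit_predict(features)

# Toy example: two well-separated groups yield two subcategories.
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.1, (20, 4)), rng.normal(5, 0.1, (20, 4))])
print(subcategorize(toy, n_subcategories=2))

One sub-detector would then be trained on the samples carrying each returned label.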

III-B Feature extraction

The proposed framework introduces spatially pooled features [47] as a part of the aggregated channel features [13] and employs them as dense features in the training phase. All feature channels are aggregated in blocks in order to produce fast pixel lookup features.

III-B1 Aggregated channel features (ACF)

Given an input image I, a channel C = Ω(I) is a feature map of I, where the output pixels are computed from corresponding pixels of the input image. Aggregated channel features are extracted from multiple image channels using a pixel lookup method. Many image channels are available for extracting features. For example, a trivial channel of a grayscale image is the image itself. For a color image, each color channel can be used as a channel. Other channels can be computed using various transformations of I. In order to accelerate the speed of feature extraction, all transformations are required to be translationally invariant. This means that each transformation needs only to be evaluated once on the entire image rather than separately for each overlapping detection window.

ACF uses the same channel features as ChnFtrs [13]: LUV color channels (3 channels), histograms of oriented gradients (6 channels), and normalized gradient magnitude (1 channel). ACF combines the richness and diversity of statistics from these channels, which is shown to outperform HOG [11, 13]. Prior to computing these 10 channels, we smooth the input image to suppress fine-scale structures as well as noise.

LUV color channels The LUV color space contains 3 channels: the L channel describes the lightness of the object, while the U and V channels represent the chromaticity of the object. Compared to RGB space, LUV space is partially invariant to illumination change, so the proposed detector can work under different lighting conditions. Images can be converted to LUV space by using a specific transformation.

Gradient magnitude channel A normalized gradient magnitude is used to measure the edge strength. The gradient magnitude at location $(x, y)$ is computed by $M(x, y) = \sqrt{I_x(x, y)^2 + I_y(x, y)^2}$, where $I_x$ and $I_y$ are the first intensity derivatives along the $x$-axis and $y$-axis, respectively. Since the gradient magnitude is computed on the 3 LUV channels independently, only the maximum response is used as the gradient magnitude channel.

Gradient histogram channels A histogram of oriented gradients is a weighted histogram where the bin index is determined by the gradient orientation and the weight by the gradient magnitude [13]. The histogram of oriented gradients at location $(x, y)$ is computed by $H_{\theta}(x, y) = M(x, y) \cdot \mathbf{1}[\Theta(x, y) = \theta]$, where $\mathbf{1}[\cdot]$ is the indicator function, and $M(x, y)$ and $\Theta(x, y)$ are the gradient magnitude and the discrete gradient orientation, respectively. ACF quantizes the orientation space into 6 orientations and computes one gradient histogram channel for each orientation.
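The sketch below computes these 10 channels with numpy/OpenCV and sum-pools them into blocks, roughly following the description above; the smoothing kernel, Sobel derivatives, and 4-pixel aggregation used here are illustrative assumptions rather than the authors' exact settings.

import cv2
import numpy as np

def acf_channels(bgr, n_bins=6, shrink=4):
    # LUV (3) + normalized gradient magnitude (1) + oriented gradient histograms (6).
    img = cv2.GaussianBlur(bgr, (3, 3), 0)                      # suppress fine structures / noise
    luv = cv2.cvtColor(img, cv2.COLOR_BGR2Luv).astype(np.float32) / 255.0

    # Gradients on each LUV channel; keep the strongest response per pixel.
    gx = cv2.Sobel(luv, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(luv, cv2.CV_32F, 0, 1, ksize=3)
    mag_all = np.sqrt(gx ** 2 + gy ** 2)
    best = np.argmax(mag_all, axis=2)
    rows, cols = np.indices(best.shape)
    mag = mag_all[rows, cols, best]
    ori = np.arctan2(gy[rows, cols, best], gx[rows, cols, best]) % np.pi   # orientation in [0, pi)

    # Quantize orientations into n_bins channels, weighted by gradient magnitude.
    hist = np.zeros(mag.shape + (n_bins,), np.float32)
    bins = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)
    for b in range(n_bins):
        hist[..., b] = mag * (bins == b)

    channels = np.dstack([luv, mag[..., None], hist])           # H x W x 10

    # Aggregate (sum-pool) every channel over shrink x shrink blocks.
    h = (channels.shape[0] // shrink) * shrink
    w = (channels.shape[1] // shrink) * shrink
    c = channels[:h, :w].reshape(h // shrink, shrink, w // shrink, shrink, -1)
    return c.sum(axis=(1, 3))

# channels = acf_channels(cv2.imread("frame.png"))   # evaluated once, shared by all detectors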

III-B2 Spatially pooled features

Spatial pooling is used to combine multiple visual descriptors obtained at nearby locations into a lower-dimensional descriptor over the pooling region. We follow the work of [47], which shows that pooling can enhance the robustness of two hand-crafted low-level features: covariance features [59] and LBP [2].

Covariance matrix

A covariance matrix is a positive semidefinite matrix which provides a measure of the relationship between multiple sets of variates. The diagonal elements of a covariance matrix represent the variance of each feature and the non-diagonal elements represent the correlation between different features. In order to compute the covariance matrix, we use the following set of variates proposed in [47]:

$[x,\; y,\; |I_x|,\; |I_y|,\; |I_{xx}|,\; |I_{yy}|,\; M,\; O_1,\; O_2],$

where $x$ and $y$ indicate the pixel location. $I_x$ and $I_y$ are first intensity derivatives along the horizontal axis and the vertical axis, respectively. Similarly, $I_{xx}$ and $I_{yy}$ are the corresponding second intensity derivatives. $M$ is the gradient magnitude $\sqrt{I_x^2 + I_y^2}$. $O_1$ is the edge orientation $\arctan(|I_y| / |I_x|)$ and $O_2 = \mathrm{atan2}(I_y, I_x)$ is an additional edge orientation, in which the atan2 function is defined in terms of the arctan in the following:

$\mathrm{atan2}(y, x) = \begin{cases} \arctan(y/x) & x > 0 \\ \arctan(y/x) + \pi & x < 0,\; y \ge 0 \\ \arctan(y/x) - \pi & x < 0,\; y < 0 \\ \pi/2 & x = 0,\; y > 0 \\ -\pi/2 & x = 0,\; y < 0. \end{cases}$

The covariance descriptor of a region is a $9 \times 9$ covariance matrix which can be computed efficiently because the computational cost is independent of the size of the region. We also exclude the variance of the pixel locations (x and y coordinates) and the correlation coefficient between the pixel locations, since these features do not capture discriminative information. Due to symmetry, each covariance descriptor finally contains 42 different values.
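A rough numpy sketch of one region covariance descriptor built from the variate map reconstructed above; the derivative operators and the particular orientation definitions are assumptions, and [47] should be consulted for the exact variate set.

import numpy as np

def region_covariance(gray, top, left, size):
    # Per-pixel variates: [x, y, |Ix|, |Iy|, |Ixx|, |Iyy|, M, O1, O2];
    # entries involving only the pixel coordinates are dropped afterwards.
    patch = gray[top:top + size, left:left + size].astype(np.float64)
    iy, ix = np.gradient(patch)                         # first derivatives (rows, cols)
    iyy, _ = np.gradient(iy)
    _, ixx = np.gradient(ix)
    mag = np.sqrt(ix ** 2 + iy ** 2)
    o1 = np.arctan(np.abs(iy) / (np.abs(ix) + 1e-12))
    o2 = np.arctan2(iy, ix)
    ys, xs = np.mgrid[0:size, 0:size]
    feats = np.stack([xs, ys, np.abs(ix), np.abs(iy), np.abs(ixx),
                      np.abs(iyy), mag, o1, o2], axis=-1).reshape(-1, 9)
    cov = np.cov(feats, rowvar=False)                   # 9 x 9 covariance matrix
    iu, ju = np.triu_indices(9)
    keep = [(i, j) for i, j in zip(iu, ju) if not (i < 2 and j < 2)]   # drop x/y variance & correlation
    return np.array([cov[i, j] for i, j in keep])       # 42 unique values

print(region_covariance(np.random.rand(64, 64), top=10, left=10, size=16).shape)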

Spatially pooled covariance

The spatial invariance and robustness of the covariance descriptors can be improved by applying a pooling method. There are two common pooling methods in this context: average pooling and max pooling. Max pooling is used in our framework as it has been shown to outperform average pooling in image classification [5]. Max pooling uses the maximum value of a pooling region to represent the pooled features in that region. It aims to retain the most salient information and discard irrelevant details and noise over the pooling region. The image window is divided into multiple dense patches (refer to Fig. 4). Covariance features are computed over the pixels within each patch. Then, we perform max pooling over a fixed-size pooling region and use the pooled features to represent the covariance features in the pooling region. In effect, the multiple covariance matrices within each pooling region are summarized into a single matrix which has better invariance to image deformation and translation. The pooled features extracted from each pooling region are called the spatially pooled covariance (sp-Cov) features in [47].

Fig. 4: Architecture of the spatially pooled covariance features.

Implementation To expand the richness of our feature representation, we extract sp-Cov features using multi-scale patches with the following sizes: , and pixels. Each scale will generate an independent set of visual descriptors. In our experiments, the patch step-size is set to be 1 pixel, the pooling region is set to be pixels, and the pooling spacing stride is set to be pixels.
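The pooling step itself can be sketched as follows, assuming a dense grid with one descriptor per patch location; the pooling size and stride below are placeholders, since the paper's exact values are not reproduced in this copy.

import numpy as np

def max_pool_descriptors(desc_grid, pool=4, stride=4):
    # desc_grid: (rows, cols, dim) array, one descriptor per dense patch location.
    rows, cols, dim = desc_grid.shape
    pooled = []
    for r in range(0, rows - pool + 1, stride):
        for c in range(0, cols - pool + 1, stride):
            region = desc_grid[r:r + pool, c:c + pool]             # pool x pool neighbouring descriptors
            pooled.append(region.reshape(-1, dim).max(axis=0))     # keep the most salient response
    return np.array(pooled)

# Example: a 16x16 grid of 42-dim covariance descriptors -> 16 pooled 42-dim vectors
print(max_pool_descriptors(np.random.rand(16, 16, 42)).shape)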

Local Binary Pattern (LBP) LBP is a texture descriptor which uses a histogram to represent the binary code of each image patch [2]. The original LBP is generated by thresholding the 3×3-neighbourhood of each pixel with the value of the centre pixel. All binary results are concatenated to form an 8-bit binary sequence, giving 256 different labels. The histogram of these 256 different labels can represent a texture descriptor. Following the work of [47], we convert the input image from RGB space to LUV space, and extract the uniform LBP [65] from the luminance (L) channel. The uniform LBP, which is an extension of the original LBP, can better filter out noise.

Spatially pooled LBP Similar to the sp-Cov features, the image window is divided into multiple dense patches and an LBP histogram is computed over the pixels within each patch. In order to enhance the invariance to image deformation and translation, we perform max pooling over a fixed-size pooling region and use the pooled features to represent the LBP histogram in the pooling region. The pooled features extracted from each pooling region are called the spatially pooled LBP (sp-LBP) features in [47].

Implementation To extract LBP, we apply the LBP operator on the 3×3-neighbourhood of each pixel. The LBP histogram is extracted from a pixels patch. We extract the 58-dimensional LBP histogram using a C-MEX implementation of [60]. In our experiments, the patch step-size, the pooling region, and the pooling spacing stride are set to 1 pixel, pixels, and 4 pixels, respectively. Instead of extracting LBP histograms from multi-scale patches, the sp-LBP and LBP are combined as channel features.
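For reference, a small sketch of dense uniform-LBP histograms using scikit-image; note that the standard non-rotation-invariant uniform mapping for 8 neighbours yields 58 uniform codes plus one bin for non-uniform patterns (59 bins here), and the patch and step sizes are placeholders.

import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histograms(luminance, patch=16, step=8):
    # Uniform LBP codes over the L channel, then one normalized histogram per dense patch.
    codes = local_binary_pattern(luminance, P=8, R=1, method="nri_uniform")
    n_bins = 59                                  # 58 uniform codes + 1 non-uniform bin
    h, w = codes.shape
    hists = []
    for top in range(0, h - patch + 1, step):
        for left in range(0, w - patch + 1, step):
            block = codes[top:top + patch, left:left + patch]
            hist, _ = np.histogram(block, bins=n_bins, range=(0, n_bins))
            hists.append(hist / max(hist.sum(), 1))
    return np.array(hists)

print(lbp_histograms((np.random.rand(64, 64) * 255).astype(np.uint8)).shape)

These patch histograms would then be max-pooled exactly as in the sp-Cov sketch above.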

III-C Supervised learning

Once dense features have been extracted, we are in a position to train a classifier. Instead of training a standard AdaBoost, we use a shrinkage version of AdaBoost as the strong classifier and use decision trees as weak learners. To train the classifier, the procedure known as bootstrapping is applied, which collects the hard negative samples and re-trains the classifier. If the object subcategorization is applied to the training data, we train one classifier for each sub-detector. The pseudo code of the learning algorithm is presented in Algorithm 1.

Shrinkage The accuracy of AdaBoost can be further improved by applying a weighting coefficient known as shrinkage [24]. The shrinkage version of AdaBoost can be viewed as a form of regularization for boosting. At each iteration, the coefficient of the weak learner is updated by

$\alpha_t \leftarrow \nu \cdot \alpha_t.$   (1)

Here $h_t$ is the weak learner of AdaBoost at the $t$-th round and $\alpha_t$ is the coefficient of the weak learner. $\nu \in (0, 1]$ is a learning rate which controls the trade-off between overall accuracy and training time. The smaller the value of $\nu$, the higher the overall accuracy, as long as the number of weak learners is sufficiently large. Compared to the standard AdaBoost, shrinkage often produces better generalization performance [17].

Bootstrapping To improve the performance of the learned classifier, we perform three bootstrapping iterations in addition to the original training phase. The initial training phase randomly samples negative examples from training images with the positive regions cropped out, and further bootstrapping iterations add more hard negatives to the training set. The learning process consists of 4 training iterations with increasing numbers of weak learners and the final model consists of 2048 weak learners.
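A schematic of this hard negative mining loop on pre-extracted feature vectors is sketched below; scikit-learn's gradient boosted trees (which also expose a shrinkage-style learning rate) stand in for the boosted classifier, and the sampling counts and number of rounds are placeholders.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def bootstrap_train(pos, neg_pool, rounds=3, n_start=500, n_add=500):
    # pos:      (n_pos, d) positive feature vectors
    # neg_pool: (n_pool, d) candidate negatives (windows with positive regions cropped out)
    rng = np.random.default_rng(0)
    neg = neg_pool[rng.choice(len(neg_pool), n_start, replace=False)]
    clf = None
    for _ in range(rounds + 1):                            # initial phase + bootstrapping rounds
        X = np.vstack([pos, neg])
        y = np.hstack([np.ones(len(pos)), np.zeros(len(neg))])
        clf = GradientBoostingClassifier(
            n_estimators=256, learning_rate=0.1, max_depth=3).fit(X, y)
        scores = clf.decision_function(neg_pool)           # higher = more object-like
        hard = neg_pool[np.argsort(scores)[-n_add:]]       # most confusing negatives
        neg = np.vstack([neg, hard])
    return clf

# clf = bootstrap_train(pos_features, neg_features)   # feature arrays are assumed to exist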

Input: The training set {(x_i, y_i)}, i = 1, ..., N, with y_i ∈ {−1, +1}; the number of rounds T; the shrinkage parameter ν.
Initialize: The weighted distribution of the training set in the 1st round, D_1(i) = 1/N, i = 1, ..., N.
for t = 1, ..., T do
        Train the weak learner h_t using the weighted distribution D_t.
Compute the error rate of h_t on the training set, ε_t = Σ_i D_t(i) · 1[h_t(x_i) ≠ y_i].
Compute the coefficient of h_t, α_t = (1/2) ln((1 − ε_t)/ε_t), and update it by multiplying the shrinkage parameter ν: α_t ← ν · α_t.
Update the weighted distribution of the training set, D_{t+1}(i) = D_t(i) · exp(−α_t y_i h_t(x_i)) / Z_t,
where Z_t is a normalization factor.
end for
Output: Final classifier F(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
Algorithm 1 Shrinkage version of AdaBoost
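The listing below is a compact, runnable sketch of Algorithm 1, assuming shallow scikit-learn decision trees as the weak learners; the soft-cascade rejection thresholds used for fast testing are omitted.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_shrinkage(X, y, rounds=100, nu=0.1, depth=3):
    # X: (n, d) features;  y: numpy array of labels in {-1, +1};  nu: shrinkage / learning rate.
    n = len(y)
    D = np.full(n, 1.0 / n)                                      # weighted distribution D_1
    learners, alphas = [], []
    for _ in range(rounds):
        h = DecisionTreeClassifier(max_depth=depth).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = np.clip(np.sum(D * (pred != y)), 1e-10, 1 - 1e-10)    # weighted error rate
        alpha = nu * 0.5 * np.log((1 - eps) / eps)                   # shrunken coefficient
        D = D * np.exp(-alpha * y * pred)
        D /= D.sum()                                                 # normalization factor Z_t
        learners.append(h)
        alphas.append(alpha)
    return lambda Xq: sum(a * h.predict(Xq) for a, h in zip(alphas, learners))

# F = adaboost_shrinkage(X_train, y_train); predicted labels are np.sign(F(X_test))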

III-D Post-processing

Raw detection results are generated by applying the trained detectors to test images, but these results often contain noise and redundant information. To improve detection performance, several techniques are used to post-process the raw detection results.

III-D1 Calibration of confidence scores

If we have multiple sub-detectors and apply them to test data, the detection results of each sub-detector must be merged to generate the integrated results. However, since the classifier of each sub-detector is learned with different training data, the confidence scores of the raw detection results output by the individual classifiers need to be calibrated appropriately to suppress noise before merging them together. We address this problem by transforming the output of each classifier with a sigmoid regression to generate comparable score distributions [51, 35]. For a sample $x$ in subcategory $k$, its confidence score is the output of the ensemble classifier, which is defined as

$f_k(x) = \sum_{t} \alpha_t h_t(x),$   (2)

and its calibrated score is defined as

$p_k(x) = \frac{1}{1 + \exp(A_k f_k(x) + B_k)},$   (3)

where $A_k$, $B_k$ are the learned parameters for the $k$-th subcategory, obtained from the following regularized maximum likelihood problem:

$\min_{A_k, B_k} \; -\sum_{i=1}^{N_k} \big[ t_i \log p_k(x_i) + (1 - t_i) \log(1 - p_k(x_i)) \big],$   (4)

$t_i = \begin{cases} \dfrac{N_+ + 1}{N_+ + 2} & \text{if } y_i = +1 \\[4pt] \dfrac{1}{N_- + 2} & \text{if } y_i = -1. \end{cases}$   (5)

The $p_k(x_i)$ in Equation (4) can be cancelled by reformulation:

$\min_{A_k, B_k} \; \sum_{i=1}^{N_k} \big[ (t_i - 1)\,(A_k f_k(x_i) + B_k) + \log\!\big(1 + \exp(A_k f_k(x_i) + B_k)\big) \big].$   (6)

$N_k$ is the total number of training examples for the $k$-th subcategory-specific classifier, $N_+$ is the number of positive examples, and $N_-$ is the number of negative examples.
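A minimal sketch of fitting these per-subcategory sigmoid parameters, assuming the Platt-style targets reconstructed in Eq. (5) and the reformulated objective of Eq. (6); the optimizer choice and initialization are implementation assumptions.

import numpy as np
from scipy.optimize import minimize

def fit_sigmoid_calibration(scores, labels):
    # scores: raw classifier outputs f_k(x);  labels: {0, 1} ground truth for this subcategory.
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    # Regularized targets that avoid exactly 0/1 values (Eq. (5) style).
    t = np.where(labels == 1, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def objective(params):                       # reformulated negative log-likelihood (Eq. (6) style)
        A, B = params
        z = A * scores + B
        return np.sum((t - 1.0) * z + np.logaddexp(0.0, z))

    x0 = np.array([0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))])
    A, B = minimize(objective, x0, method="BFGS").x
    return A, B

def calibrated_score(f, A, B):
    return 1.0 / (1.0 + np.exp(A * f + B))       # comparable across sub-detectors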

III-D2 Non-maximum suppression (NMS)

NMS aims to suppress redundant overlaps among the raw detection results. When multiple bounding boxes overlap, NMS eliminates the lower-scored detections and retains the highest-scored detection. The Pascal overlap score [14] is used to determine the overlap ratio between two bounding boxes. The overlap ratio is defined as

$a_o(B_1, B_2) = \frac{\mathrm{area}(B_1 \cap B_2)}{\mathrm{area}(B_1 \cup B_2)},$   (7)

where $B_1$ and $B_2$ are two different bounding boxes. If the overlap ratio exceeds a predefined threshold, the bounding box with the lower confidence score is discarded.
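For concreteness, a standard greedy NMS sketch using the Pascal overlap of Eq. (7) is given below; the 0.5 threshold is a placeholder, not the framework's tuned value.

import numpy as np

def nms(boxes, scores, overlap_thresh=0.5):
    # boxes: (n, 4) array of [x1, y1, x2, y2];  scores: (n,) confidences.
    # Returns indices of the boxes that are kept.
    order = np.argsort(scores)[::-1]                    # highest-scored first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])    # intersection rectangle
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        overlap = inter / (area_i + areas - inter)      # Pascal overlap, Eq. (7)
        order = rest[overlap <= overlap_thresh]         # drop overlapping lower-scored boxes
    return keep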

III-D3 Fusion of detection results

The proposed framework can detect multiple objects using detectors or sub-detectors of different classes. When detection results are generated from different detectors, there are probably some redundant detections, since different detectors may generate overlapping bounding boxes. NMS is usually used to delete redundant bounding boxes. However, NMS is not suitable for all cases. Assume that a car is occluded by a cyclist and both the car and the cyclist are detected. If their overlap ratio exceeds the threshold, NMS will simply delete the lower-scored detection and retain the higher-scored detection. One true positive detection is removed in this case.

To solve the above problem, we merge all detection results in two steps. In the first step, we merge detection results which belong to the same class using the NMS. It means that we apply NMS to bounding boxes generated by either a single detector of one class or multiple sub-detectors of one class exclusively. Objects of the same class are easily detected redundantly by multiple sub-detectors or a single detector at different scales. NMS is used to remove these redundant detections. In the second step, we merge all remaining detection results of different classes without using NMS to generate the final bounding boxes.
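A short sketch of this two-step merging, reusing the nms function from the previous sketch; the per-class dictionaries are assumed containers for illustration.

def fuse_detections(per_class_boxes, per_class_scores, overlap_thresh=0.5):
    # Step 1: suppress redundant boxes within each class; step 2: merge classes without NMS.
    fused = []
    for cls in per_class_boxes:
        boxes = per_class_boxes[cls]
        scores = per_class_scores[cls]
        for i in nms(boxes, scores, overlap_thresh):
            fused.append((cls, boxes[i], scores[i]))
    return fused        # overlapping boxes of different classes are deliberately kept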

IV Experiments

IV-A Traffic sign detection on GTSDB dataset

In this section, we conduct an experiment on traffic sign detection and evaluate our detector on the German Traffic Sign Detection Benchmark (GTSDB) [27].

IV-A1 Dataset

The GTSDB dataset contains 600 images for training and 300 images for testing. Images are captured from various scenes (highway, urban, rural) and various time slots (morning, afternoon, dusk, etc.). The dataset contains more than 1000 traffic signs from different categories. Three main categories of traffic signs (prohibitory, danger, mandatory) are selected as the target classes in the IJCNN 2013 [27] competition and in our experiments. The resolutions of the traffic signs vary from 16×16 pixels to 128×128 pixels. Fig. 5 shows traffic signs from the three main categories of the GTSDB.

Fig. 5: Each row shows traffic signs in one of three categories (prohibitory, danger, mandatory).

IV-A2 Evaluation criteria

The Pascal overlap score [14] is used to find the best match between each predicted bounding box and each ground truth. The minimum overlap ratio is set to 60% on the GTSDB. Only the bounding box with the highest confidence score is counted as a true positive if multiple bounding boxes satisfy the overlap criterion; the others are ignored. To compare the performance of different detectors, we follow the evaluation metric of the GTSDB, which uses the area under the precision-recall curve (AUC) as the final score.

IV-A3 Parameter selection

To alleviate the effect of illumination change, we apply the automatic color equalization algorithm (ACE) [21] to globally normalize all images. The resolution of the traffic sign model is set to pixels and the dimension of the model padding is set to pixels. This border provides an additional amount of context that helps improve the detection performance [7, 12]. Additionally, we increase the number of positive samples by adding jittered versions of the original samples, which significantly improves the detection performance. For prohibitory and danger signs, flipped versions are added to the training set. For mandatory signs, samples are randomly perturbed in translation ( pixels), in scale ( ratio), in rotation ( degrees), and by flipping. We demonstrate the performance gain on the test set in Table I. Negative samples are collected from the GTSDB training images with the corresponding traffic sign regions cropped out.

Prohibitory Danger Mandatory Avg.
Original dataset 98.76% 93.65% 86.86% 93.09%
Jittered dataset 100.00% 98.00% 97.57% 98.52%
TABLE I: Performance (AUC) difference between training on the original training set and on the jittered training set.

IV-A4 Experimental design

We investigate the experimental design of the proposed detector on traffic sign detection. Since traffic signs are divided into three subcategories, we train one sub-detector for each subcategory. We train all detectors on the GTSDB training set and evaluate them on the GTSDB test set. All experiments are carried out using the combined features (ACF+sp-Cov+sp-LBP) as dense features, AdaBoost with a shrinkage value of 0.1 as the strong classifier, and depth-3 decision trees as weak learners (if not specified otherwise).

Shrinkage We evaluate the performance of AdaBoost with 4 different shrinkage values. We decrease the reject thresholds of the soft cascade by a factor of ν, as the coefficients of the weak learners have been diminished by the same factor. The areas under the precision-recall curve of the different detectors are shown in Table II. We observe that applying a small shrinkage value often improves the detection performance, and the best performance is achieved by setting ν = 0.1. However, without increasing the number of weak learners, setting the shrinkage value too small can degrade the performance, as boosting cannot converge with a limited number of boosting iterations.

Shrinkage Prohibitory Danger Mandatory Avg.
98.13% 95.28% 90.32% 94.58%
99.38% 96.80% 92.79% 96.32%
100.00% 98.00% 97.57% 98.52%
99.99% 97.81% 95.16% 97.63%
99.99% 98.00% 96.76% 98.25%
TABLE II: Performance (AUC) of detectors with different shrinkage values. The model consists of 4096 weak learners while others consist of 2048 weak learners.

Depth of decision trees We trained 4 different traffic sign detectors with decision trees of depth 1 to depth 4. Table III shows the detection performance of the different detectors. We observe that increasing the depth of the decision trees provides a performance gain, especially for the mandatory category. However, the depth-3 decision trees achieve better generalization performance and are faster to train than depth-4 decision trees.

Depth Prohibitory Danger Mandatory Avg.
depth-1 99.98% 97.41% 75.47% 90.95%
depth-2 99.99% 97.98% 95.49% 97.82%
depth-3 100.00% 98.00% 97.57% 98.52%
depth-4 99.99% 96.77% 98.10% 98.29%
TABLE III: Performance (AUC) of detectors with different depths of decision trees.

Combination of features To compare the discriminative power of different feature representations, we evaluate the performance of various feature combinations. The results are shown in Table IV. We observe that the combination of the sp-Cov features and LUV outperforms the ACF features and combining more features can further improve the detection performance. The best result is achieved by using the combination of all features (sp-Cov+sp-LBP+ACF).

Feature combination Prohibitory Danger Mandatory Avg.
ACF (LUV+O+M) 98.72% 94.58% 92.65% 95.32%
sp-LBP+ACF 99.99% 95.07% 96.12% 97.06%
sp-Cov+LUV 99.30% 96.67% 95.56% 97.18%
sp-Cov+ACF 98.73% 95.23% 95.61% 96.52%
sp-Cov+sp-LBP+ACF 100.00% 98.00% 97.57% 98.52%
TABLE IV: Performance (AUC) of detectors with various feature combinations.

IV-A5 Comparison with state-of-the-art detectors

Detection performance of various detectors on the GTSDB test set is shown in Table V. The proposed detector achieves results comparable to state-of-the-art detectors despite its simplicity. The detectors [62, 41] that offer better performance employ multi-scale models in detection. The authors of [62] trained multiple subcategory-specific classifiers for each type of mandatory sign to achieve the best performance.

Method Prohibitory Danger Mandatory Avg.
Ours 100.00% 98.00% 97.57% 98.52%
Wang et al. [62] 100.00% 99.91% 100.00% 99.97%
Mathias et al. [41] 100.00% 100.00% 96.98% 98.99%
BolognaCVLab [27] 99.98% 98.72% 95.76% 98.15%
Liang et al. [34] 100.00% 98.85% 92.00% 96.95%
Timofte et al. [57] 61.12% 79.43% 72.60% 71.05%
Viola-Jones [61] 90.81% 46.26% 44.87% 60.65%
TABLE V: Detection performance (AUC) of various detectors on the GTSDB test set with 60% overlap ratio.

IV-B Car detection on UIUC dataset

Next, we conduct an experiment on car detection and compare the detection performance of different detectors on the UIUC dataset [1]. The UIUC dataset contains images of side views of cars with a resolution of 100×40 pixels. The training set contains 550 positive samples and 500 negative samples. The test set is divided into two sets: 170 single-scale test images, containing 200 cars at roughly the same scale as in the training images, and 108 multi-scale test images, containing 139 cars at various scales.

We follow the evaluation protocol provided along with the UIUC dataset. A bounding box is counted as a true positive if it lies within 25% of the ground truth dimension in each direction. Only the bounding box with the highest confidence score is counted as a true positive if multiple bounding boxes satisfy the criterion; the others are counted as false positives. In the dataset, three criteria are adopted to evaluate the performance: the F-score, the detection rate, and the number of false positives.

The F-score is the weighted harmonic mean of precision and recall; with equal weighting, F = 2 · precision · recall / (precision + recall).

The dimension of the UIUC car model is set to pixels without marginal padding, as the car images are clipped to the same size. We expand the positive samples by flipping the car images along the vertical axis. Since viewpoints of cars in the UIUC dataset are limited to side views, we train a single detector without applying subcategorization or bootstrapping. Table VI shows the results of different detectors on the multi-scale test subset. We observe that our detector achieves the best detection rate with slightly more false positives on this dataset.

Method F-Measure Det. rate No. false pos.
Ours 98.6% 99.28% 3
Pruning [46] 98.6% 97.8% 1
AdaBoost [61] 98.6% 98.6% 2
AdaBoost+LDA [67] 98.6% 97.8% 1
CS-AdaBoost [40] 95.3% 95.5% 9
TABLE VI: Performance of various detectors on the UIUC multi-scale test set.
Dataset # cars (train) # images (train) # cars (test) # images (test)
UIUC Car 550 1050 139 108
MIT Car 516 516 - -
Street Parking - 881 - -
Pascal VOC 1250 713 1201 721
KITTI Car 27k 7481 - 7518
TABLE VII: Comparison of car datasets. The first four columns indicate the amount of training/testing data in each dataset; the original table additionally reports whether each dataset provides color images, annotations, multiple views, occlusion labels, and truncation labels. Note that the KITTI dataset is two orders of magnitude larger than other existing datasets.

IV-C Car detection on KITTI dataset

To further demonstrate the effectiveness and robustness of the proposed detector on car detection, we evaluate our detector on a more challenging object detection benchmark, the KITTI dataset [19].

IV-C1 Dataset

The KITTI dataset is a recently proposed challenging dataset which consists of 7481 training images and 7518 test images, comprising more than 80 thousand annotated objects in traffic scenes. Table VII provides a summary of existing car datasets. We observe that the KITTI dataset provides a large number of cars with different sizes, viewpoints, occlusion patterns, and truncation scenarios. Due to the diversity of these objects, the dataset has three subsets (Easy, Moderate, Hard) with respect to the difficulty of object size, occlusion, and truncation. Since the detection performance is ranked based on the moderately difficult results, we use the moderate subset as the training data in our experiments. The moderate subset contains 15710 cars, with car heights varying from 25 pixels to 270 pixels and aspect ratios varying between 0.9 and 4.0. Since annotations of the test data are not provided by the KITTI benchmark, we split the KITTI training images into a training set (first 4000 images) and a validation set (remaining 3481 images).

IV-C2 Evaluation criteria

We follow the provided protocol for evaluation. The Pascal overlap score is used to find the best match and the minimum overlap ratio is set to 70%. Only the bounding box with the highest confidence score is kept if multiple bounding boxes satisfy the overlap criterion; the others are counted as false positives. Instead of using AUC, the average precision (AP) [14] is used to evaluate the detection performance. The AP summarizes the shape of the precision-recall curve, and is defined as the mean precision at a set of evenly spaced recall levels.

IV-C3 Parameter selection

We apply the proposed subcategorization method to categorize the training data into multiple subcategories. To find the model dimensions of each subcategory, we set the base height of each model to 52 pixels. From the base height, the width of each model can be obtained by taking the median aspect ratio of the cars in the corresponding subcategory. Each model includes an additional 4 pixels of margin on all sides. Using a model with a suitable aspect ratio significantly improves the detection performance due to better localization. We expand the positive training samples by randomly perturbing the original car patches in translation ( pixels) and in rotation ( degrees). Negative samples are collected from the KITTI training images with the vehicle regions cropped out.

IV-C4 Experimental design

We investigate the experimental design of the proposed detector on car detection. We train car detectors on the training set and evaluate them on the validation set. All experiments are carried out using ACF as dense features, AdaBoost with a shrinkage value of 0.1 as the strong classifier, depth-4 decision trees as weak learners, and the proposed subcategorization method (if not specified otherwise).

Number of subcategories To investigate the effect of different numbers of clusters in our subcategorization method, we set the number from {1, 4, 8, 12, 16, 20, 25}. Fig. 6(a) and Fig. 6(b) show the effect of increasing the number of subcategories on the geometrical feature space and the visual feature space, respectively. We observe that the geometrical features outperform the ACF features in the spectral clustering. We also observe that the detection performance improves as we increase the number of subcategories, with 25 subcategories providing the best performance.

Fig. 6: (a) Average precision of spectral clustering + geometrical features with different numbers of subcategories on the KITTI car validation set. (b) Average precision of spectral clustering + visual features with different numbers of subcategories on the KITTI car validation set. (c) Average precision of spectral clustering + aspect-ratios with different numbers of subcategories on the KITTI cyclist validation set.

Depth of decision trees We trained 4 different car detectors with decision trees of depth 2 to depth 5. Table VIII shows the average precisions of different detectors. We observe that depth-4 decision trees provide the best generalization performance.

Depth Easy Moderate Hard
depth-2 96.38% 89.18% 70.87%
depth-3 97.17% 91.21% 74.44%
depth-4 97.41% 93.37% 75.60%
depth-5 96.67% 92.08% 72.77%
TABLE VIII: Performance (AP) of detectors with different depths of decision trees.

Combination of features We evaluate the performance of various feature combinations on car detection. The results are shown in Table IX. We observe that the detection performance improves as we add more features, and the best performance is achieved by using the combination of all features (sp-Cov+sp-LBP+ACF). The combination of sp-LBP features and ACF features achieves similar performance and is five times faster than the combination of all features. We use the combination of sp-LBP features and ACF features as dense features in testing, since it gives a better trade-off between detection performance and runtime.

Feature combination Easy Moderate Hard Runtime
ACF (LUV+O+M) 97.41% 93.37% 75.60% 0.5s
sp-LBP+ACF 97.74% 94.38% 76.50% 1.5s
sp-Cov+LUV 97.76% 93.68% 75.68% 6.8s
sp-Cov+ACF 97.98% 93.48% 75.61% 6.8s
sp-Cov+sp-LBP+ACF 98.42% 94.55% 76.66% 7.5s
TABLE IX: Performance (AP) of detectors with various feature combinations.

IV-C5 Comparison with state-of-the-art detectors

Table X shows the performance comparison of state-of-the-art detectors on the KITTI test set. Experimental results show that the proposed detector not only performs better than all DPM-based methods [48, 58, 16] but also has a lower runtime. More significantly, our detector outperforms SubCat [44], which employs a similar object subcategorization method, and Regionlets [66, 36], which employs a similar pooling strategy. We conjecture that the additional performance gain is provided by the spatially pooled features.

Method Easy Moderate Hard Runtime
Ours 87.19% 77.40% 60.60% 1.5s
Regionlets [66, 36] 84.75% 76.54% 59.70% 1s
SubCat [44] 81.94% 66.32% 51.10% 0.3s
AOG [33] 80.26% 67.03% 55.60% 3s
OC-DPM [48] 74.94% 65.95% 53.86% 10s
DPM-C8B1 [58] 74.33% 60.99% 47.16% 15s
MDPM-un-BB [16] 71.19% 62.16% 48.43% 60s
mBoW [3] 36.02% 23.76% 18.44% 10s
TABLE X: Detection performance (AP) of various detectors on the KITTI car test set with 70% overlap ratio.

IV-D Cyclist detection on KITTI dataset

In this section, we conduct an experiment on cyclist detection and evaluate our detector on the KITTI dataset.

IV-D1 Dataset

The KITTI dataset contains annotated cyclist objects which are captured from various traffic scenes. Similar to cars, cyclists are divided into three subsets (Easy, Moderate, Hard) and the moderate subset is used as the training data in our experiments. The moderate subset contains 1098 cyclists, with cyclist heights varying from 25 pixels to 275 pixels and aspect ratios varying between 0.3 and 1.5.

IV-D2 Evaluation criteria

The KITTI cyclist detection uses the same evaluation protocol as the car detection, except that the minimum overlap ratio is relaxed to 50%.

IV-D3 Parameter selection

The proposed subcategorization method is applied to cyclist detection. We define the dimensions of each cyclist model using a method similar to that used in car detection. We set the base height of each model to 56 pixels, and the width of each model is derived from the median aspect ratio of the cyclists in the corresponding subcategory. Each model includes an additional 4 pixels of margin on all sides. We expand the positive training samples by randomly perturbing the original cyclists in translation ( pixels) and in rotation ( degrees). Negative patches are collected from the KITTI training images with the cyclist regions cropped out.

IV-D4 Experimental design

We investigate the experimental design of our detector on cyclist detection. We train cyclist detectors on the training set and evaluate them on the validation set. All experiments are carried out using ACF as dense features, AdaBoost with a shrinkage value of 0.1 as the strong classifier, depth-4 decision trees as weak learners, and the proposed subcategorization method (if not specified otherwise).

Number of subcategories We set the number of clusters from {1, 2, 3, 4, 5, 6} in our subcategorization method. Since only a minority of cyclists are occluded or truncated, clustering on all geometrical features leads to a cluster degeneration problem. We carefully select the aspect-ratios of cyclists as the feature space to avoid this problem. Fig. 6(c) shows the effect of increasing the number of subcategories. We observe that the detection performance improves as we increase the number of subcategories up to 4. Since the number of cyclists is much smaller than the number of cars, the number of cyclists in each subcategory becomes very small when we have too many subcategories, which results in an imbalanced learning problem and degrades the detection performance.

Depth of decision trees We trained 4 cyclist detectors with decision trees of depth 2 to depth 5. Average precisions of the different detectors are shown in Table XI. We observe that depth-4 decision trees offer the best generalization performance, similar to the car detection results.

Depth Easy Moderate Hard
depth-2 80.92% 75.47% 69.46%
depth-3 89.83% 82.67% 76.65%
depth-4 92.15% 86.18% 79.28%
depth-5 90.98% 85.21% 78.26%
TABLE XI: Performance (AP) of detectors with different depths of decision trees.

Combination of features We evaluate the performance of various feature combinations on cyclist detection. The results are shown in Table XII. We observe that the best performance is achieved by using the combination of sp-LBP features and ACF features. The performance declines when we add the sp-Cov features as a part of the aggregated channel features. The reason may be the lack of sufficient cyclist training data. We use the combination of sp-LBP features and ACF features as the dense features in testing.

Feature combination Easy Moderate Hard Runtime
ACF (LUV+O+M) 92.15% 86.18% 79.28% 0.2s
sp-LBP+ACF 92.56% 87.40% 80.01% 0.6s
sp-Cov+LUV 85.48% 79.17% 72.20% 5.8s
sp-Cov+ACF 85.16% 80.58% 73.64% 5.8s
sp-Cov+sp-LBP+ACF 90.08% 83.80% 76.89% 6.1s
TABLE XII: Performance (AP) of detectors with various feature combinations.

IV-D5 Comparison with state-of-the-art detectors

Table XIII shows the performance comparison with state-of-the-art approaches. As shown in Table XIII, our detector outperforms all other methods on the test set. Specifically, our detector outperforms the best DPM-based method DPM-VOC+VP [49] on all the three subsets by 16.29%, 14.95%, and 12.35%, respectively. Our detector also performs slightly better than the Regionlets [66, 36].

Method Easy Moderate Hard Runtime
Our method 58.72% 46.03% 40.58% 0.6s
Regionlets [66, 36] 56.96% 44.65% 39.05% 1s
MV-RGBD-RF 52.97% 42.61% 37.42% 4s
DPM-VOC+VP [49] 42.43% 31.08% 28.23% 8s
LSVM-MDPM-us [16] 38.84% 29.88% 27.31% 10s
DPM-C8B1 [58] 43.49% 29.04% 26.20% 15s
mBoW [3] 28.00% 21.62% 20.93% 10s
TABLE XIII: Detection performance (AP) of various detectors on the KITTI cyclist test set with 50% overlap ratio.

IV-E An evaluation of the overall runtime

We conduct an experiment evaluating the overall runtime with various feature combinations on the KITTI dataset. All experiments are carried out on a computer with an octa-core Intel Xeon 2.50GHz processor. The average runtime of each component of our detection framework can be seen in Table XIV. For feature extraction, we observe that the ACF features can be extracted very quickly, within 0.1s. When we add the sp-LBP features, the runtime increases moderately, but these features give an obvious performance gain in all three applications. When the sp-Cov features are employed, the runtime of feature extraction increases rapidly and dominates the total runtime of the system. For object detection, we observe that the car detector costs the most time in this framework since it has 25 sub-detectors. The traffic sign detector uses the least time since it has only 3 sub-detectors. We also observe that the runtime of detection increases as we add more features to this framework. Based on the detection results of the three applications, we conjecture that using a combination of ACF features and sp-LBP features provides a better trade-off between detection performance and system runtime.

Feature combination Feature extraction Cars(25) detection Cyclists(4) detection Signs(3) detection Total runtime
ACF (LUV+O+M) 0.10s 0.40s 0.10s 0.05s 0.65s
sp-LBP+ACF 0.35s 1.20s 0.30s 0.10s 1.95s
sp-Cov+ACF 5.50s 1.30s 0.30s 0.10s 7.20s
sp-Cov+sp-LBP+ACF 5.75s 1.75s 0.35s 0.15s 8.00s
TABLE XIV: An evaluation of the overall runtime of the proposed framework with various feature combinations.

V Conclusion

In this paper, we propose a common framework for detecting three important classes of objects in traffic scenes. The proposed framework introduces spatially pooled features as a part of the aggregated channel features to enhance the robustness and employs detectors of three important classes to detect the target objects. The detection speed of the framework is fast since the dense features need only to be evaluated once rather than individually for each detector. To overcome the weakness of the VJ framework for object classes with large intra-class variations, we propose an object subcategorization method to capture the variations and improve the generalization performance. We demonstrated that our detector achieves competitive results with state-of-the-art detectors in traffic sign detection, car detection, and cyclist detection. Future work includes using contextual information to facilitate object detection in traffic scenes and using convolutional neural networks to generate more discriminative feature representations.

Acknowledgements This work was in part supported by Australia’s Information and Communications Technology (ICT) Research Centre of Excellence.

References

  • [1] S. Agarwal, A. Awan, and D. Roth. Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell., 26(11):1475–1490, 2004.
  • [2] T. Ahonen, A. Hadid, and M. Pietikäinen. Face recognition with local binary patterns. In Proc. Eur. Conf. Comp. Vis., pages 469–481. Springer, 2004.
  • [3] J. Behley, V. Steinhage, and A. B. Cremers. Laser-based segment classification using a mixture of bag-of-words. In Proc. IEEE Int. Conf. Intell. Robots Syst., pages 4195–4200. IEEE, 2013.
  • [4] A. Broggi, A. Cappalunga, S. Cattani, and P. Zani. Lateral vehicles detection using monocular high resolution cameras on terramax. In Proc. IEEE Intell. Vehicles Symp., pages 1143–1148. IEEE, 2008.
  • [5] A. Coates and A. Y. Ng. The importance of encoding versus training with sparse coding and vector quantization. In Proc. Int. Conf. Mach. Learn., pages 921–928, 2011.
  • [6] J. Cui, F. Liu, Z. Li, and Z. Jia. Vehicle localisation using a single camera. In Proc. IEEE Intell. Vehicles Symp., pages 871–876. IEEE, 2010.
  • [7] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 886–893. IEEE, 2005.
  • [8] A. de la Escalera, J. M. Armingol, and M. Mata. Traffic sign recognition and analysis for intelligent vehicles. Image Vis. Comput., 21(3):247–258, 2003.
  • [9] A. De La Escalera, L. E. Moreno, M. A. Salichs, and J. M. Armingol. Road traffic sign detection and classification. IEEE Trans. Industrial Electronics, 44(6):848–859, 1997.
  • [10] S. K. Divvala, A. A. Efros, and M. Hebert. How important are ”deformable parts” in the deformable parts model? In Proc. Eur. Conf. Comp. Vis. Workshop, pages 31–40. Springer, 2012.
  • [11] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell., 36(8):1532–1545, 2014.
  • [12] P. Dollár, S. Belongie, and P. Perona. The fastest pedestrian detector in the west. In Proc. Bri. Conf. Mach. Vis., pages 1–11, 2010.
  • [13] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In Proc. Bri. Conf. Mach. Vis., pages 1–11, 2009.
  • [14] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. Int. J. Comp. Vis., 88(2):303–338, 2010.
  • [15] C. Fang, S. Chen, and C. Fuh. Road-sign detection and tracking. IEEE Trans. Vehicular Technol., 52(5):1329–1341, 2003.
  • [16] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell., 32(9):1627–1645, 2010.
  • [17] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann. Stat., 28(2):337–407, 2000.
  • [18] X. W. Gao, L. Podladchikova, D. Shaposhnikov, K. Hong, and N. Shevtsova. Recognition of traffic signs based on their colour and shape features extracted using human vision models. J. Visual Comm. Image Repr., 17(4):675–685, 2006.
  • [19] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. Int. J. Robotic Res., 32(11):1231–1237, 2013.
  • [20] A. Geiger, C. Wojek, and R. Urtasun. Joint 3d estimation of objects and scene layout. In Proc. Adv. Neural Inf. Process. Syst., pages 1467–1475, 2011.
  • [21] P. Getreuer. Automatic color enhancement (ace) and its fast implementation. Image Process. Line, 2:266–277, 2012.
  • [22] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 580–587, 2014.
  • [23] H. Gómez-Moreno, S. Maldonado-Bascón, P. Gil-Jiménez, and S. Lafuente-Arroyo. Goal evaluation of segmentation algorithms for traffic sign recognition. IEEE Trans. Intell. Transportation Syst., 11(4):917–930, 2010.
  • [24] T. Hastie, R. Tibshirani, J. Friedman, and J. Franklin. The elements of statistical learning: data mining, inference and prediction. J. Math. Intelligencer, 27(2):83–85, 2005.
  • [25] M. Hejrati and D. Ramanan. Analyzing 3d objects in cluttered images. In Proc. Adv. Neural Inf. Process. Syst., pages 602–610, 2012.
  • [26] S. Houben. A single target voting scheme for traffic sign detection. In Proc. IEEE Intell. Vehicles Symp., pages 124–129. IEEE, 2011.
  • [27] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel. Detection of traffic signs in real-world images: The German Traffic Sign Detection Benchmark. In Proc. Int. Joint Conf. Neural Net., pages 1–8. IEEE, 2013.
  • [28] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM computing surveys (CSUR), 31(3):264–323, 1999.
  • [29] R. Janssen, W. Ritter, F. Stein, and S. Ott. Hybrid approach for traffic sign recognition. In Proc. IEEE Intell. Vehicles Symp., pages 390–395, 1993.
  • [30] C. Kuo and R. Nevatia. Robust multi-view car detection using unsupervised sub-categorization. In Proc. App. Comp. Vis. Workshop, pages 1–8. IEEE, 2009.
  • [31] W. Kuo and C. Lin. Two-stage road sign detection and recognition. In Proc. IEEE Int. Conf. Multimedia Expo., pages 1427–1430. IEEE, 2007.
  • [32] S. Kyo, T. Koga, K. Sakurai, and S. Okazaki. A robust vehicle detecting and tracking system for wet weather conditions using the imap-vision image processing board. In Proc. IEEE Int. Conf. Intell. Transportation Syst., pages 423–428. IEEE, 1999.
  • [33] B. Li, T. Wu, and S. Zhu. Integrating context and occlusion for car detection by hierarchical and-or model. In Proc. Eur. Conf. Comp. Vis., pages 652–667. Springer, 2014.
  • [34] M. Liang, M. Yuan, X. Hu, J. Li, and H. Liu. Traffic sign detection by ROI extraction and histogram features-based recognition. In Proc. Int. Joint Conf. Neural Net., pages 1–8. IEEE, 2013.
  • [35] H.-T. Lin, C.-J. Lin, and R. C. Weng. A note on platt’s probabilistic outputs for support vector machines. Mach. Learn., 68(3):267–276, 2007.
  • [36] C. Long, X. Wang, G. Hua, M. Yang, and Y. Lin. Accurate object detection with location relaxation and regionlets re-localization. In Proc. Asian Conf. Comp. Vis., pages 3000–3016. IEEE, 2014.
  • [37] G. B. Loy and N. M. Barnes. Fast shape-based road sign detection for a driver assistance system. In Proc. IEEE Int. Conf. Intell. Robots Syst., pages 70–75. IEEE, 2004.
  • [38] S. Maldonado-Bascón, S. Lafuente-Arroyo, P. Gil-Jiménez, H. Gómez-Moreno, and F. López-Ferreras. Road-sign detection and recognition based on support vector machines. IEEE Trans. Intell. Transportation Syst., 8(2):264–278, 2007.
  • [39] E. Martinez, M. Diaz, J. Melenchon, J. Montero, I. Iriondo, and J. Socoro. Driving assistance system based on the detection of head-on collisions. In Proc. IEEE Intell. Vehicles Symp., pages 913–918. IEEE, 2008.
  • [40] H. Masnadi-Shirazi and N. Vasconcelos. Cost-sensitive boosting. IEEE Trans. Pattern Anal. Mach. Intell., 33(2):294–309, 2011.
  • [41] M. Mathias, R. Timofte, R. Benenson, and L. J. V. Gool. Traffic sign recognition - how far are we from the solution? In Proc. Int. Joint Conf. Neural Net., pages 1–8. IEEE, 2013.
  • [42] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Trans. Intell. Transportation Syst., 13(4):1484–1497, 2012.
  • [43] A. Y. Ng, M. I. Jordan, Y. Weiss, et al. On spectral clustering: Analysis and an algorithm. Proc. Adv. Neural Inf. Process. Syst., 2:849–856, 2002.
  • [44] E. Ohn-Bar and M. M. Trivedi. Fast and robust object detection using visual subcategories. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Workshop, pages 179–184, 2014.
  • [45] E. Ohn-Bar and M. M. Trivedi. Learning to detect vehicles by clustering appearance patterns. IEEE Trans. Intell. Transportation Syst., 2015.
  • [46] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Asymmetric pruning for learning cascade detectors. IEEE Trans. Multimedia, 16(5):1254–1267, 2014.
  • [47] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Strengthening the effectiveness of pedestrian detection with spatially pooled features. In Proc. Eur. Conf. Comp. Vis., pages 546–561. Springer, 2014.
  • [48] B. Pepik, M. Stark, P. V. Gehler, and B. Schiele. Occlusion patterns for object class detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3286–3293. IEEE, 2013.
  • [49] B. Pepikj, M. Stark, P. Gehler, and B. Schiele. Multi-view and 3d deformable part models. IEEE Trans. Pattern Anal. Mach. Intell., 2015.
  • [50] N. Pettersson, L. Petersson, and L. Andersson. The histogram feature-a resource-efficient weak classifier. In Proc. IEEE Intell. Vehicles Symp., pages 678–683. IEEE, 2008.
  • [51] J. C. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Proc. Adv. Large Margin Classifiers, pages 61–74. Citeseer, 1999.
  • [52] V. A. Prisacariu, R. Timofte, K. Zimmermann, I. Reid, and L. J. V. Gool. Integrating object detection with 3d tracking towards a better driver assistance system. In Proc. Int. conf. Patt. Recogn., pages 3344–3347. IEEE, 2010.
  • [53] Z. Qui, D. Yao, Y. Zhang, D. Ma, and X. Liu. The study of the detection of pedestrian and bicycle using image processing. IEEE Trans. Intell. Transportation Syst., 1:340–345, 2003.
  • [54] S. Rogers and N. P. Papanikolopoulos. Counting bicycles using computer vision. In Proc. IEEE Conf. Intell. Transportation Syst., pages 33–38. IEEE, 2000.
  • [55] P. Sermanet and Y. LeCun. Traffic sign recognition with multi-scale convolutional networks. In Proc. Int. Joint Conf. Neural Net., pages 2809–2813. IEEE, 2011.
  • [56] S. Sivaraman and M. M. Trivedi. Looking at vehicles on the road: A survey of vision-based vehicle detection, tracking, and behavior analysis. IEEE Trans. Intell. Transportation Syst., 14(4):1773–1795, 2013.
  • [57] R. Timofte, K. Zimmermann, and L. J. V. Gool. Multi-view traffic sign detection, recognition, and 3d localisation. In Proc. App. Comp. Vis. Workshop, pages 1–8. IEEE, 2009.
  • [58] J. J. Y. Torres, L. M. Bergasa, R. Arroyo, and A. Lazaro. Supervised learning and evaluation of kitti’s cars detector with DPM. In Proc. IEEE Intell. Vehicles Symp., pages 768–773, 2014.
  • [59] O. Tuzel, F. Porikli, and P. Meer. Region covariance: A fast descriptor for detection and classification. In Proc. Eur. Conf. Comp. Vis., pages 589–600. Springer, 2006.
  • [60] A. Vedaldi and B. Fulkerson. Vlfeat: an open and portable library of computer vision algorithms. In Proc. IEEE Int. Conf. Multimedia, pages 1469–1472. ACM, 2010.
  • [61] P. Viola and M. J. Jones. Robust real-time face detection. Int. J. Comp. Vis., 57(2):137–154, 2004.
  • [62] G. Wang, G. Ren, Z. Wu, Y. Zhao, and L. Jiang. A robust, coarse-to-fine traffic sign detection method. In Proc. Int. Joint Conf. Neural Net., pages 1–5. IEEE, 2013.
  • [63] H. Wang, Q. Chen, and W. Cai. Shape-based pedestrian/bicyclist detection via onboard stereo vision. In Proc. Multiconf. Computational Eng. Syst. App., pages 1776–1780. IEEE, 2006.
  • [64] J. Wang, G. Bebis, and R. Miller. Overtaking vehicle detection using dynamic and quasi-static background modeling. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. workshop, pages 64–64. IEEE, 2005.
  • [65] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In Proc. IEEE Int. Conf. Comp. Vis., pages 32–39. IEEE, 2009.
  • [66] X. Wang, M. Yang, S. Zhu, and Y. Lin. Regionlets for generic object detection. In Proc. IEEE Int. Conf. Comp. Vis., pages 17–24. IEEE, 2013.
  • [67] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg. Fast asymmetric learning for cascade face detection. IEEE Trans. Pattern Anal. Mach. Intell., 30(3):369–382, 2008.
  • [68] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multi-resolution pedestrian detection in traffic scenes. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3033–3040. IEEE, 2013.