1 Introduction
Pedestrian detection has attracted broad attention [8, 33, 29, 9, 10, 11]. The problem is challenging because of the large variations of human bodies and their confusion with background scenes, as shown in Fig.1 (a), where the positive and hard negative patches exhibit large ambiguity.
Current methods for pedestrian detection can generally be grouped into two categories: models based on hand-crafted features [33, 8, 34, 10, 9, 37, 13] and deep models [22, 24, 29, 23, 18]. In the first category, conventional methods extracted Haar [33], HOG [8], or HOG-LBP [34] features from images to train SVM [8] or boosting classifiers [10]. The learned weights of the classifier (e.g. SVM) can be considered a global template of the entire human body. To account for more complex poses, the hierarchical deformable part models (DPM) [13, 39, 17] learned a mixture of local templates for each body part. Although they can accommodate certain pose changes, the feature representations and the classifiers cannot be jointly optimized to improve performance. In the second category, deep neural networks achieved promising results [22, 24, 29, 23, 18], owing to their capacity to learn middle-level representations. For example, Ouyang et al. [23] learned features by designing specific hidden layers for the Convolutional Neural Network (CNN), such that features, deformable parts, and pedestrian classification can be jointly optimized. However, previous deep models treated pedestrian detection as a single binary classification task, so they mainly learn middle-level features, which are not able to capture rich pedestrian variations, as shown in Fig.1 (a).
To learn high-level representations, this work jointly optimizes pedestrian detection with auxiliary semantic tasks, including pedestrian attributes (e.g. 'backpack', 'gender', and 'views') and scene attributes (e.g. 'vehicle', 'tree', and 'vertical'). To understand how this works, we provide an example in Fig.2. If only a single detector is used to classify all the positive and negative samples in Fig.2 (a), it is difficult to handle complex pedestrian variations. Therefore, mixture models over multiple views were developed, as in Fig.2 (b), i.e. pedestrian images in different views are handled by different detectors. If views are treated as one type of semantic task, learning pedestrian representations with multiple attributes and deep models extends this idea to the extreme. As shown in Fig.2 (c), more supervised information enriches the learned features to account for combinatorially more pedestrian variations. Samples with similar configurations of attributes can be grouped and separated in the high-level feature space.
Specifically, given a pedestrian dataset (denoted by $\mathbf{P}$), the positive image patches are manually labeled with several pedestrian attributes, which have been suggested to be valuable for surveillance analysis [21]. However, as the number of negatives is significantly larger than the number of positives, we transfer scene attribute information from existing background scene segmentation databases (each denoted by $\mathbf{B}_a$) to the pedestrian dataset, rather than annotating them manually. A novel task-assistant CNN (TA-CNN) is proposed to jointly learn multiple tasks using multiple data sources. As different $\mathbf{B}_a$'s may have different data distributions, to reduce these discrepancies we transfer two types of carefully chosen scene attributes: the shared attributes that appear across all the $\mathbf{B}_a$'s and the unshared attributes that appear in only one of them. The former facilitate the learning of a shared representation among the $\mathbf{B}_a$'s, whilst the latter increase attribute diversity. Furthermore, to reduce the gap between $\mathbf{P}$ and the $\mathbf{B}_a$'s, we first project each sample in the $\mathbf{B}_a$'s to a structural space of $\mathbf{P}$, and then the projected values are employed as input to train TA-CNN. Learning TA-CNN is formulated as minimizing a weighted multivariate cross-entropy loss, where both the importance coefficients of the tasks and the network parameters can be iteratively solved via stochastic gradient descent [16].
This work has the following main contributions. (1) To our knowledge, this is the first attempt to learn high-level representations for pedestrian detection by jointly optimizing it with semantic attributes, including pedestrian attributes and scene attributes. The scene attributes can be transferred from existing scene datasets without manual annotation. (2) These multiple tasks from multiple sources are trained using a single task-assistant CNN (TA-CNN), which is carefully designed to bridge the gaps between the different datasets. A weighted multivariate cross-entropy loss is proposed to learn TA-CNN by iterating between two steps: updating the network parameters with the tasks' weights fixed, and updating the weights with the network parameters fixed. (3) We systematically investigate the effectiveness of attributes in pedestrian detection. Extensive experiments on both the challenging Caltech [11] and ETH [12] datasets demonstrate that TA-CNN outperforms state-of-the-art methods, reducing the miss rates of existing deep models on these datasets.
1.1 Related Work
We review recent works in two aspects.
Models based on Hand-Crafted Features Hand-crafted features, such as HOG, LBP, and channel features, have achieved great success in pedestrian detection. For example, Wang et al. [34] utilized HOG+LBP features to deal with partial occlusion of pedestrians. Chen et al. [7] modeled context information in a multi-order manner. The deformable part models [13] learned a mixture of local templates to account for view and pose variations. Moreover, Dollár et al. proposed Integral Channel Features (ICF) [10] and Aggregated Channel Features (ACF) [9], both of which consist of gradient histogram, gradient magnitude, and LUV color channels, and can be extracted efficiently. Benenson et al. [2] combined channel features and depth information. However, the representation of hand-crafted features cannot be optimized for pedestrian detection. They are not able to capture large variations, as shown in Fig.3 (a) and (b).
Deep Models Deep learning methods can learn features from raw pixels to improve pedestrian detection performance. For example, ConvNet [29] employed convolutional sparse coding to pre-train a CNN for pedestrian detection in an unsupervised manner. Ouyang et al. [22] jointly learned features and the visibility of different body parts to handle occlusion. The JointDeep model [23] designed a deformation hidden layer for the CNN to model mixtures of pose information. Unlike previous deep models that formulated pedestrian detection as a single binary classification task, TA-CNN jointly optimizes pedestrian detection with related semantic tasks, and the learned features are more robust to large variations, as shown in Fig.3 (c) and (d).
2 Our Approach
Method Overview Fig.4 shows our pipeline of pedestrian detection, where pedestrian classification, pedestrian attributes, and scene attributes are jointly learned by a single TA-CNN. Given a pedestrian dataset $\mathbf{P}$, for example Caltech [11], we manually label the positive patches with nine pedestrian attributes, which are listed in Fig.5. Most of them are suggested by the UK Home Office and UK police to be valuable in surveillance analysis [21]. Since the number of negative patches in $\mathbf{P}$ is significantly larger than the number of positives, we transfer scene attribute information from three public scene segmentation datasets to $\mathbf{P}$, as shown in Fig.4 (a), including CamVid ($\mathbf{B}_1$) [5], Stanford Background ($\mathbf{B}_2$) [14], and LM+SUN ($\mathbf{B}_3$) [31], where hard negatives are chosen by applying a simple yet fast pedestrian detector [9] on these datasets. As the data in the different $\mathbf{B}_a$'s are sampled from different distributions, we carefully select two types of attributes: the shared attributes (outlined in orange) that are present in all the $\mathbf{B}_a$'s and the unshared attributes (outlined in red) that appear in only one of them. This is because the former enable the learning of a shared representation across the $\mathbf{B}_a$'s, while the latter enhance attribute diversity. All chosen attributes are summarized in Fig.5, which shows that data from different sources have different subsets of attribute labels. For example, pedestrian attributes are present only in $\mathbf{P}$, shared attributes are present in all the $\mathbf{B}_a$'s, and each unshared attribute is present in just one of them, e.g. 'traffic light'.
We construct a training set by combining patches cropped from both $\mathbf{P}$ and the $\mathbf{B}_a$'s. Let $\mathcal{T} = \{(\mathbf{x}_n, y_n, \mathbf{y}^p_n, \mathbf{y}^s_n, \mathbf{y}^u_n)\}_{n=1}^{N}$ be the set of image patches and their labels, where each $(y_n, \mathbf{y}^p_n, \mathbf{y}^s_n, \mathbf{y}^u_n)$ is a four-tuple¹.
¹In this paper, scalar variables are denoted by normal letters, while sets, vectors, and matrices are denoted by boldface letters.
Specifically, $y_n$ denotes a binary label indicating whether an image patch is pedestrian or not. $\mathbf{y}^p_n$, $\mathbf{y}^s_n$, and $\mathbf{y}^u_n$ are three sets of binary labels representing the pedestrian, shared scene, and unshared scene attributes, respectively. As shown in Fig.4 (b), TA-CNN takes the image patch $\mathbf{x}_n$ as input and predicts the labels by stacking four convolutional layers (conv1 to conv4), four max-pooling layers, and two fully-connected layers (fc5 and fc6). This structure is inspired by AlexNet [16] for large-scale general object categorization. However, as the difficulty of pedestrian detection differs from that of general object categorization, we remove one convolutional layer of AlexNet and reduce the number of parameters at all remaining layers. The structure of TA-CNN is specified in Fig.4 (b).
Formulation of TA-CNN Each hidden layer of TA-CNN from conv1 to conv4 is computed recursively by convolution and max-pooling, which are formulated as
$h^{(l)}_j = \mathrm{relu}\big(b^{(l)}_j + \sum_i k^{(l)}_{ij} \ast h^{(l-1)}_i\big) \quad (1)$
$h^{(l)}_{j,c} = \max_{(m,n) \in \Omega_c} h^{(l)}_{j,(m,n)} \quad (2)$
In Eqn.(1), $\mathrm{relu}(x) = \max(x, 0)$ is the rectified linear function [19] and $\ast$ denotes the convolution operator applied on every pixel of the feature map $h^{(l-1)}_i$, where $h^{(l-1)}_i$ and $h^{(l)}_j$ stand for the $i$-th input channel at the $(l-1)$-th layer and the $j$-th output channel at the $l$-th layer, respectively. $k^{(l)}_{ij}$ and $b^{(l)}_j$ denote the filters and bias. In Eqn.(2), the feature map $h^{(l)}_j$ is partitioned into a grid of overlapping cells, each of which is denoted as $\Omega_c$, where $c$ indicates the cell index. The max-pooling operator compares the values at every location of a cell and outputs the maximum value of each cell.
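As an illustrative sketch (not the paper's implementation), Eqns.(1) and (2) can be written out directly in NumPy. The 'valid' convolution with stride 1 and the non-overlapping pooling cells are simplifying assumptions; the paper partitions the feature map into overlapping cells:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_layer(h_prev, kernels, biases):
    """Eqn (1): h_j = relu(b_j + sum_i k_ij * h_i).

    h_prev: (C_in, H, W) feature maps; kernels: (C_out, C_in, kH, kW);
    biases: (C_out,). 'Valid' convolution with stride 1 is a simplifying
    choice for this sketch.
    """
    c_out, _, kh, kw = kernels.shape
    _, h, w = h_prev.shape
    out = np.zeros((c_out, h - kh + 1, w - kw + 1))
    for j in range(c_out):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                patch = h_prev[:, r:r + kh, c:c + kw]
                out[j, r, c] = biases[j] + np.sum(kernels[j] * patch)
    return relu(out)

def max_pool(h, cell=2):
    """Eqn (2): output the maximum value of each cell. Non-overlapping
    cells here for brevity; the paper uses overlapping cells."""
    c, height, width = h.shape
    out = np.zeros((c, height // cell, width // cell))
    for r in range(out.shape[1]):
        for q in range(out.shape[2]):
            out[:, r, q] = h[:, r * cell:(r + 1) * cell,
                             q * cell:(q + 1) * cell].max(axis=(1, 2))
    return out
```

For instance, convolving a 4x4 single-channel input with a 2x2 all-ones filter yields a 3x3 map, which pooling then reduces cell by cell.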
Each hidden layer in fc5 and fc6 is obtained by
$\mathbf{h}^{(l)} = \mathrm{relu}\big(\mathbf{W}^{(l)} \mathbf{h}^{(l-1)} + \mathbf{b}^{(l)}\big) \quad (3)$
where the higher-level representation $\mathbf{h}^{(l)}$ is transformed from the lower-level $\mathbf{h}^{(l-1)}$ with a non-linear mapping. $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight matrix and bias vector at the $l$-th layer.
TA-CNN can be formulated as minimizing the negative log posterior probability with respect to a set of network parameters $\mathbf{W}$:
$\mathbf{W}^{\ast} = \arg\min_{\mathbf{W}} \; -\sum_{n=1}^{N} \log p(y_n, \mathbf{y}^p_n, \mathbf{y}^s_n, \mathbf{y}^u_n \mid \mathbf{x}_n; \mathbf{W}) \quad (4)$
where $-\sum_{n=1}^{N} \log p(y_n, \mathbf{y}^p_n, \mathbf{y}^s_n, \mathbf{y}^u_n \mid \mathbf{x}_n)$ is the complete loss function over the entire training set. Here, we illustrate that the shared attributes $\mathbf{y}^s$ in Eqn.(4) are crucial for learning a shared representation across the multiple scene datasets $\mathbf{B}_a$'s.
For clarity, we keep only the unshared scene attributes in the loss function, which then becomes $-\sum_n \log p(\mathbf{y}^u_n \mid \mathbf{x}_n)$. Let $\mathbf{x}^a$ denote a sample of $\mathbf{B}_a$. A shared representation can be learned if and only if all the samples share at least one target (attribute). Since the samples are independent, the loss function can be expanded as $-\sum_a \sum_n \log p(y^{u_a}_n \mid \mathbf{x}^a_n)$, where $a$ indexes the scene datasets, implying that each dataset is only used to optimize its corresponding unshared attribute, although all the datasets and attributes are trained in a single TA-CNN. For instance, the classification model of an unshared attribute is learned using only its own dataset, without leveraging the existence of the others; in other words, the probability $p(y^{u_a} \mid \mathbf{x}^b)$, $a \neq b$, is undefined because of missing labels. The above formulation is not sufficient to learn shared features among datasets, especially when the data have large differences. To bridge the multiple scene datasets $\mathbf{B}_a$'s, we introduce the shared attributes $\mathbf{y}^s$, and the loss function develops into $-\sum_a \sum_n \log p(\mathbf{y}^s_n, y^{u_a}_n \mid \mathbf{x}^a_n)$, such that TA-CNN can learn a shared representation across the $\mathbf{B}_a$'s because the samples share the common targets $\mathbf{y}^s$.
Now, we reconsider Eqn.(4), whose loss function can be decomposed similarly into $-\sum_n \log p(y_n, \mathbf{y}^p_n \mid \mathbf{x}_n) - \sum_a \sum_n \log p(\mathbf{y}^s_n, y^{u_a}_n \mid \mathbf{x}^a_n)$, where the first term is over the pedestrian dataset $\mathbf{P}$. Even though the discrepancies among the $\mathbf{B}_a$'s can be reduced by $\mathbf{y}^s$, this decomposition shows that a gap remains between $\mathbf{P}$ and the $\mathbf{B}_a$'s. To resolve this issue, we compute a structure projection vector $\mathbf{z}_n$ for each sample $\mathbf{x}_n$, and Eqn.(4) turns into
$\mathbf{W}^{\ast} = \arg\min_{\mathbf{W}} \; -\sum_{n=1}^{N} \log p(y_n, \mathbf{y}^p_n, \mathbf{y}^s_n, \mathbf{y}^u_n \mid \mathbf{x}_n, \mathbf{z}_n; \mathbf{W}) \quad (5)$
For example, the first term of the above decomposition can be written as $-\sum_n \log p(y_n, \mathbf{y}^p_n \mid \mathbf{x}_n, \mathbf{z}_n)$, where $\mathbf{z}_n$ is attained by projecting the corresponding $\mathbf{x}_n$ onto the structural space of $\mathbf{P}$; this procedure is explained below. Here $\mathbf{z}$ is used to bridge multiple datasets, because samples from different datasets are projected to a common space of $\mathbf{P}$. TA-CNN therefore adopts a pair of data $(\mathbf{x}, \mathbf{z})$ as input (see Fig.4 (b)). All the remaining terms can be derived in a similar way.
Structure Projection Vector As shown in Fig.6, to close the gap between $\mathbf{P}$ and the $\mathbf{B}_a$'s, we calculate a structure projection vector (SPV) for each sample by organizing the positive (+) and negative (-) data of $\mathbf{P}$ into two tree structures, respectively. Each tree has a depth of three and partitions the data top-down, where each child node groups the data of its parent into clusters. The SPV of each sample is then obtained by concatenating its distances to the mean of every leaf node. Specifically, at each parent node, we extract HOG features for each sample and apply k-means to group the data. We partition the data into five clusters at the first level, and then each of them is further partitioned into ten clusters.
3 Learning Task-Assistant CNN
To learn the network parameters $\mathbf{W}$, a natural way is to reformulate Eqn.(5) as softmax loss functions, similar to previous methods. We have²
²We drop the sample index $n$ in the remaining derivation for clarity.
$E = -\log p(y \mid \mathbf{x}, \mathbf{z}) - \sum_i \alpha_i \log p(y^p_i \mid \mathbf{x}, \mathbf{z}) - \sum_j \beta_j \log p(y^s_j \mid \mathbf{x}, \mathbf{z}) - \sum_k \gamma_k \log p(y^u_k \mid \mathbf{x}, \mathbf{z}) \quad (6)$
where the main task is to predict the pedestrian label $y$, and the attribute estimations, i.e. $\mathbf{y}^p$, $\mathbf{y}^s$, and $\mathbf{y}^u$, are auxiliary semantic tasks. $\alpha_i$, $\beta_j$, and $\gamma_k$ denote the importance coefficients that associate the multiple tasks. Here, each probability in Eqn.(6) is modeled by a softmax function, for example, $p(y \mid \mathbf{x}, \mathbf{z}) = \mathrm{softmax}(\mathbf{W}^{m} \mathbf{h})$, where $\mathbf{h}$ and $\mathbf{W}^{m}$ indicate the top-layer feature vector and the parameter matrix of the main task, respectively, as shown in Fig.4 (b), and $\mathbf{h}$ is obtained through the transformations in Eqns.(1)-(3).
Eqn.(6) optimizes eighteen loss functions together. It has two main drawbacks. First, since different tasks have different convergence rates, training many tasks together is prone to over-fitting. Previous works prevented over-fitting by adjusting the importance coefficients. However, these are determined in a heuristic manner, such as by early stopping [38], rather than being estimated as part of the learning procedure. Second, if the dimension of the features is high, the number of parameters at the top layer grows quickly. For example, if the feature vector has $d$ dimensions, the weight matrix of each two-state variable (e.g. $y$ of the main task) has $2d$ parameters, whilst the weight matrix of the four-state variable 'viewpoint' has $4d$ parameters³. As we have seventeen two-state variables and one four-state variable, the total number of parameters at the top layer is $17 \times 2d + 4d = 38d$.
³All tasks are binary classifications (i.e. two states) except the pedestrian attribute 'viewpoint', which has four states: 'front', 'back', 'left', and 'right'.
To resolve the above issues, we cast learning the multiple tasks in Eqn.(6) as optimizing a single weighted multivariate cross-entropy loss, which can not only learn a compact weight matrix but also iteratively estimate the importance coefficients,
$E = -\mathbf{y}^{\top} \mathrm{diag}(\boldsymbol{\lambda}) \log \hat{\mathbf{y}} - (\mathbf{1} - \mathbf{y})^{\top} \mathrm{diag}(\boldsymbol{\lambda}) \log (\mathbf{1} - \hat{\mathbf{y}}) \quad (7)$
where $\boldsymbol{\lambda}$ denotes a vector of importance coefficients and $\mathrm{diag}(\boldsymbol{\lambda})$ represents the diagonal matrix constructed from it. Here, $\mathbf{y}$ is a vector of binary labels concatenating the pedestrian label and all attribute labels. Note that each two-state (four-state) variable can be described by one bit (two bits). Since we have seventeen two-state variables and one four-state variable, the weight matrix at the top layer, denoted $\mathbf{W}^{L}$ in this case, has $(17 + 2)d = 19d$ parameters, which halves the number of parameters compared to the $38d$ of Eqn.(6). Moreover, $\hat{\mathbf{y}}$ is modeled by the sigmoid function, i.e. $\hat{\mathbf{y}} = \sigma(\mathbf{W}^{L} \mathbf{h})$, where $\mathbf{h}$ is attained in the same way as in Eqn.(6).
The optimization of Eqn.(7) iterates between two steps: updating the network parameters with the importance coefficients fixed, and updating the coefficients with the network parameters fixed.
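A minimal NumPy sketch of the weighted multivariate cross-entropy in Eqn.(7); the small epsilon guarding the logarithms is our addition for numerical safety, not part of the paper:

```python
import numpy as np

def sigmoid(z):
    """Elementwise sigmoid; y_hat = sigmoid(W^L h) in the formulation above."""
    return 1.0 / (1.0 + np.exp(-z))

def weighted_multivariate_ce(y, y_hat, lam, eps=1e-12):
    """Eqn (7): E = -y^T diag(lam) log(y_hat) - (1-y)^T diag(lam) log(1-y_hat).

    y: binary label vector (pedestrian bit plus attribute bits),
    y_hat: sigmoid predictions, lam: per-task importance coefficients.
    eps guards the logs and is our addition, not part of the paper.
    """
    return float(-(y * lam) @ np.log(y_hat + eps)
                 - ((1.0 - y) * lam) @ np.log(1.0 - y_hat + eps))
```

Because the weighting enters as diag(lam), each task's contribution to the loss, and hence to the gradient, scales linearly with its coefficient.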
Learning Network Parameters The network parameters $\mathbf{W}$ are updated by minimizing Eqn.(7) using stochastic gradient descent [16] and back-propagation (BP) [28], where the error of the output layer is propagated top-down to update the filters or weights at each layer. For example, the weight matrix of the $l$-th layer at the $t$-th iteration, $\mathbf{W}^{(l)}_t$, is attained by
$\Delta_{t+1} = \mu \Delta_t - \epsilon \frac{\partial E}{\partial \mathbf{W}^{(l)}_t}, \qquad \mathbf{W}^{(l)}_{t+1} = \mathbf{W}^{(l)}_t + \Delta_{t+1} \quad (8)$
Here, $t$ is the index of the training iteration, $\Delta$ is the momentum variable, $\epsilon$ is the learning rate, and $\frac{\partial E}{\partial \mathbf{W}^{(l)}}$ is the derivative, calculated as the outer product of the back-propagation error $\mathbf{e}^{(l)}$ and the hidden features $\mathbf{h}^{(l-1)}$. The BP procedure is similar to [16]. The main difference lies in how to compute the error at the output layer. In the traditional BP algorithm, the error at the output layer is obtained from the gradient of the loss in Eqn.(7), i.e. $\mathbf{e} = \hat{\mathbf{y}} - \mathbf{y}$, where $\hat{\mathbf{y}}$ denotes the predicted labels. However, unlike conventional BP where all the labels are observed, each of our datasets only covers a subset of the attributes. Let $\mathbf{y}_m$ signify the unobserved labels. The posterior probability of Eqn.(7) becomes $p(\mathbf{y}_o \mid \mathbf{x}, \mathbf{z})$, where $\mathbf{y}_o$ specifies the labels excluding $\mathbf{y}_m$. Here we demonstrate that $\mathbf{y}_m$ can simply be marginalized out, since the labels are independent: $p(\mathbf{y}_o \mid \mathbf{x}, \mathbf{z}) = \sum_{\mathbf{y}_m} p(\mathbf{y}_o, \mathbf{y}_m \mid \mathbf{x}, \mathbf{z})$. Therefore, the error of Eqn.(7) can be computed as
$e_j = \begin{cases} \hat{y}_j - y_j, & y_j \in \mathbf{y}_o \\ 0, & y_j \in \mathbf{y}_m \end{cases} \quad (9)$
which shows that the errors of the missing labels are not propagated, no matter whether their predictions are correct or not.
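The masked top-layer error of Eqn.(9) can be sketched as below; the boolean mask encodes which labels a sample's source dataset observes, and the per-task weighting by the coefficients is an assumption consistent with Eqn.(7):

```python
import numpy as np

def output_error(y, y_hat, observed, lam):
    """Top-layer BP error with missing labels marginalized out (Eqn (9)):
    unobserved entries get zero error and thus propagate nothing.

    observed: boolean mask of which labels this sample's source dataset
    provides. The weighting by lam is an assumption consistent with Eqn (7).
    """
    e = lam * (y_hat - y)  # gradient of the weighted cross-entropy w.r.t. the pre-sigmoid activation
    return np.where(observed, e, 0.0)
```

A scene-dataset sample, for instance, would have its pedestrian-attribute entries masked out, so only the scene-attribute errors flow back through the network.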
Learning Importance Coefficients We update the importance coefficients with the network parameters fixed, by maximizing the posterior probability as introduced in [6]. Taking the negative logarithm of the posterior, the problem becomes
$\boldsymbol{\lambda}^{\ast} = \arg\min_{\boldsymbol{\lambda}} \; -\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{z}; \boldsymbol{\lambda}) - \sum_i \log p(\lambda_i) - \log p(\mathbf{x}, \mathbf{z}) \quad (10)$
where the first term, $-\log p(\mathbf{y} \mid \mathbf{x}, \mathbf{z}; \boldsymbol{\lambda})$, is a log-likelihood similar to Eqn.(7), measuring the evidence for selecting the importance coefficients $\boldsymbol{\lambda}$. The second term specifies a log prior of $\boldsymbol{\lambda}$. To avoid the trivial solution in which some coefficient equals zero, we let $p(\lambda_i) = \mathcal{N}(1, \sigma^2)$, so that each coefficient is regularized by a Gaussian prior with mean one and standard deviation $\sigma$. This implies that no $\lambda_i$ should deviate too much from one, because we assume all tasks contribute equally at the very beginning. Let $\lambda_1$ be the coefficient of the main task. We fix $\lambda_1 = 1$ throughout the learning procedure, as our goal is to optimize the main task with the help of the auxiliary tasks. The third term is a normalization term, which could simply be modeled as a constant scalar. In this work, we instead adopt a restricted Boltzmann machine (RBM) [15] to model it, because the RBM can model the data space well; in other words, the coefficient updates can be weighted by the importance of each sample. Note that the RBM can be learned offline and stored in a probability table for fast indexing.
Intuitively, coefficient learning proceeds as follows. At the very beginning, all tasks have equal importance. During training, for those tasks whose loss values are stable but large, we decrease their weights, because they may not relate to the main task or may begin to over-fit the data; however, we penalize any coefficient approaching zero, preventing the corresponding task from being suspended. For those tasks with small loss values, the weights can be increased, since such tasks are highly related to the main task, i.e. their error rates decrease synchronously with it. In practice, the coefficients of all auxiliary tasks in our experiments shrink as training converges, while the main task's weight stays fixed at one. The learning of TA-CNN is summarized in Algorithm 1. Typically, we run the first step for a sufficient number of iterations to reach a local minimum, and then perform the second step to update the coefficients. This strategy helps avoid getting stuck in poor local minima.
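The alternating scheme of Algorithm 1 can be sketched abstractly; `update_weights` and `update_coeffs` are hypothetical placeholders for the SGD/BP step on Eqn.(7) and the L-BFGS coefficient step on Eqn.(10):

```python
def train_ta_cnn(update_weights, update_coeffs, W, lam, rounds=3, sgd_iters=1000):
    """Alternating scheme of Algorithm 1 (sketch): run SGD on the network
    parameters with the coefficients fixed, then refresh the coefficients
    with the parameters fixed.

    update_weights(W, lam, iters) -> W and update_coeffs(W, lam) -> lam are
    hypothetical placeholders for the two optimization steps.
    """
    for _ in range(rounds):
        W = update_weights(W, lam, sgd_iters)  # step 1: minimize over W, lam fixed
        lam = update_coeffs(W, lam)            # step 2: minimize over lam, W fixed
    return W, lam
```

Running step 1 to near-convergence before each coefficient refresh mirrors the strategy described above for avoiding poor local minima.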
Here, we explain the third term in detail. With the RBM, we have
$-\log p(\mathbf{x}, \mathbf{z}) \propto F(\mathbf{x}, \mathbf{z}) = -\log \sum_{\mathbf{h}} \exp\big(-E(\mathbf{x}, \mathbf{z}, \mathbf{h})\big),$
which represents the free energy [15] of the RBM. Specifically, $E(\mathbf{x}, \mathbf{z}, \mathbf{h})$ is the energy function, which learns a latent binary representation $\mathbf{h}$ that models the shared hidden space of $\mathbf{x}$ and $\mathbf{z}$. $\mathbf{U}$ and $\mathbf{V}$ are the projection matrices capturing the relations between $\mathbf{h}$ and $\mathbf{x}$, and between $\mathbf{h}$ and $\mathbf{z}$, respectively, while $\mathbf{b}$, $\mathbf{c}$, and $\mathbf{d}$ are the biases. The RBM can be solved by contrastive divergence [15]. Since the latent variables $\mathbf{h}$ are independent given $\mathbf{x}$ and $\mathbf{z}$, $p(\mathbf{x}, \mathbf{z})$ can be rewritten by summing over $\mathbf{h}$, i.e. $p(\mathbf{x}, \mathbf{z}) = \sum_{\mathbf{h}} p(\mathbf{x}, \mathbf{z}, \mathbf{h})$. Combining all the above definitions, Eqn.(10) is an unconstrained optimization problem, where the importance coefficients can be efficiently updated by using the L-BFGS algorithm [1].
4 Experiments
The proposed TA-CNN is evaluated on the Caltech-Test [11] and ETH [12] datasets. We strictly follow the evaluation protocol proposed in [11], which measures the log-average miss rate over nine points ranging from $10^{-2}$ to $10^{0}$ false positives per image (FPPI). We compare TA-CNN with the best-performing methods, as suggested by the Caltech and ETH benchmarks, on the reasonable subsets, where pedestrians are at least 50 pixels tall and at least 65 percent visible.
4.1 Effectiveness of TA-CNN
We systematically study the effectiveness of TA-CNN in four aspects. In this section, TA-CNN is trained on Caltech-Train and tested on Caltech-Test.
Effectiveness of Hard Negative Mining To save computational cost, we employ ACF [9] to mine hard negatives at the training stage and to prune candidate windows at the testing stage. Two main adjustments are made to ACF. First, we compute the exact feature pyramid at each scale instead of using an approximate aggregation. Second, we increase the number of weak classifiers to enhance the recognition ability. With these changes, ACF achieves a higher recall rate on Caltech-Test. TA-CNN with only the main task (pedestrian classification) then achieves a 31.45 percent miss rate by cascading on ACF, a clear further improvement.
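The cascade described above can be sketched generically; `fast_score` and `deep_score` are hypothetical stand-ins for the modified ACF detector and TA-CNN scoring, and the thresholds are arbitrary:

```python
def cascade_detect(windows, fast_score, deep_score, t_fast=0.0, t_deep=0.5):
    """Two-stage cascade sketch: a fast detector prunes candidate windows,
    and only the survivors are rescored by the deep model.

    fast_score and deep_score are hypothetical stand-ins for the modified
    ACF detector and TA-CNN scoring; the thresholds are arbitrary.
    """
    candidates = [w for w in windows if fast_score(w) > t_fast]
    return [w for w in candidates if deep_score(w) > t_deep]
```

The design point is that the expensive deep model only ever sees the small set of windows the cheap detector could not reject.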
Table 1. Log-average miss rates (percent) on Caltech-Test when combining the main task with each pedestrian attribute. Columns: main task; backpack; dark-trousers; hat; bag; gender; occlusion; riding; white-cloth; viewpoint; All. Combining all attributes ("All") yields 25.64.
Table 2. Log-average miss rates (percent) on Caltech-Test when combining the main task with each scene attribute, under the "Neg." setting (extra patches labeled as negatives) and the "Attr." setting (patches keep their attribute labels). Columns: main task; sky; tree; building; road; vehicle; traffic-light; vertical; horizontal.
Effectiveness of Pedestrian Attributes We investigate how different pedestrian attributes help improve the main task. To this end, we train TA-CNN by combining the main task with each of the pedestrian attributes, and the miss rates are reported in Table 1, which shows that 'viewpoint' is the most effective attribute, because it captures global information about the pedestrian. The attributes capturing pose information also attain significant improvements, e.g. 'riding'. Interestingly, among the attributes modeling local information, 'hat' performs best. We observe that this result is consistent with previous works, SpatialPooling [26] and InformedHaar [37], which showed that the head is the most informative body part for pedestrian detection. When combining all the pedestrian attributes, TA-CNN achieves a 25.64 percent miss rate, improving on the main task alone.
Effectiveness of Scene Attributes Similarly, we study how different scene attributes improve pedestrian detection. We train TA-CNN by combining the main task with each scene attribute. For each attribute, we select hard negative samples from its corresponding dataset; for example, we crop five thousand patches for 'vertical' from the Stanford Background dataset. We test two settings, denoted "Neg." and "Attr.". In the first setting, we label the five thousand patches as negative samples. In the second setting, these patches keep their original attribute labels. The former uses more negative samples than TA-CNN (main task), whilst the latter additionally employs attribute information.
The results are reported in Table 2, which shows that 'traffic light' clearly improves the main task, revealing that 'traffic light' patches are easily confused with positives. This is consistent with what we observe when examining the hard negative samples of most pedestrian detectors. Besides, the 'vertical' background patches are more effective than the 'horizontal' ones, matching the fact that hard negative patches are more likely to appear vertically.
Attribute Prediction We also consider the accuracy of attribute prediction and find that the attributes are predicted with high average accuracy. We select the pedestrian attribute 'viewpoint' as an illustration. In Table 3, we report the confusion matrix of 'viewpoint' over the detected pedestrians of the 'front', 'back', 'left', and 'right' views. We observe that 'front' and 'back' information is relatively easy to capture, whereas 'left' and 'right' are more likely to be confused with each other.
Table 3. Confusion matrix of 'viewpoint' prediction. Rows: true state (Frontal, Back, Left, Right); columns: predicted state; the last row reports per-view accuracy.
4.2 Overall Performance on Caltech
We report the overall results in two parts. All results of TA-CNN are obtained by training on Caltech-Train and evaluating on Caltech-Test. In the first part, we analyze the performance of the different components of TA-CNN. As shown in Fig.6 (a), performance improves steadily as more components are gradually added. For example, TA-CNN (main task) cascades on ACF and reduces its miss rate by more than 5 percent. TA-CNN (PedAttr.+SharedScene) improves on TA-CNN (PedAttr.), because it can bridge the gaps among the multiple scene datasets. After modeling the unshared attributes, the miss rate decreases further, since more attribute information is incorporated. The final result is obtained by using the structure projection vector as input to TA-CNN; its effectiveness is demonstrated in Fig.6 (a).
In the second part, we compare TA-CNN with all existing best-performing methods, including VJ [32], HOG [8], ACF-Caltech [9], MT-DPM [35], MT-DPM+Context [35], JointDeep [23], SDN [18], ACF+SDT [27], InformedHaar [37], ACF-Caltech+ [20], SpatialPooling [26], LDCF [20], Katamari [4], and SpatialPooling+ [25]. These works used various features, classifiers, deep networks, and motion and context information, summarized below. Note that TA-CNN does not employ motion or context information.
Features: Haar (VJ), HOG (HOG, MT-DPM), channel features (ACF-Caltech, LDCF); Classifiers: latent SVM (MT-DPM), boosting (VJ, ACF-Caltech, SpatialPooling); Deep models: JointDeep, SDN; Motion and context: MT-DPM+Context, ACF+SDT, Katamari, SpatialPooling+.
Fig.6 (b) reports the results. TA-CNN achieves the smallest miss rate of all existing methods. Although it outperforms the second best method (SpatialPooling+ [25]) by only a modest margin, it learns compact high-level features with attributes, rather than combining LBP, covariance features, channel features, and video motion as in [25]. The Katamari [4] method likewise integrates multiple types of features and context information.
Hand-Crafted Features The learned high-level representation of TA-CNN outperforms the conventional hand-crafted features by a large margin, including Haar, HOG, HOG+LBP, and channel features, as shown in Fig.8 (a). For example, it reduces the miss rate by 16 and 9 percent compared to DPM+Context and SpatialPooling, respectively. DPM+Context combined HOG features with pose mixtures and context information, while SpatialPooling combined multiple features, such as LBP, covariance, and channel features.
Deep Models Fig.8 (b) shows that TA-CNN surpasses the other deep models. For example, TA-CNN outperforms two state-of-the-art deep models, JointDeep and SDN, by 18 and 17 percent, respectively. Both SDN and JointDeep treated pedestrian detection as a single task and thus cannot learn the high-level representation needed to deal with challenging hard negative samples.
Time Complexity Training TA-CNN on Caltech-Train takes several hours on a single GPU. At the testing stage, hard negative mining runs in Matlab on the CPU, while TA-CNN runs on the GPU; the entire system detects pedestrians from raw images at a few frames per second. The bottleneck is the hard negative mining step, which we plan to migrate to the GPU platform.
4.3 Overall Performance on ETH
We compare TA-CNN with the existing best-performing methods (see Sec.4.2) on ETH [12]. TA-CNN is trained on INRIA-Train [8]. This setting evaluates the generalization capacity of TA-CNN. As shown in Fig.9, TA-CNN achieves the lowest miss rate, outperforming both the second best method overall and the best deep model.
Effectiveness of Different Components The effectiveness of the different components of TA-CNN is analyzed in Fig.10, where the log-average miss rates show a clear decreasing pattern as more components are gradually accumulated:
TA-CNN (main task) cascades on ACF and reduces the miss rate by 5.4 percent.
With pedestrian attributes, TA-CNN (PedAttr.) reduces the result of TA-CNN (main task) by 5.5 percent.
When bridging the gaps among the multiple scene datasets with shared scene attributes, TA-CNN (PedAttr.+SharedScene) further lowers the miss rate by 1.8 percent.
After incorporating the unshared attributes, the miss rate decreases by another 1.2 percent.
TA-CNN finally achieves a 34.99 percent log-average miss rate with the structure projection vector.
Comparisons with Hand-Crafted Features Fig.11 shows that the learned representation of TA-CNN outperforms the conventional hand-crafted features by a large margin, including Haar, HOG, HOG+LBP, and channel features. For instance, it reduces the miss rate by 9.8 and 8.5 percent compared to FisherBoost [30] and Roerei [3], respectively. FisherBoost combined HOG and covariance features and trained the detector with a more complex model, while Roerei carefully designed the feature pooling, feature selection, and preprocessing based on channel features.
Comparisons with Deep Models Fig.12 shows that TA-CNN surpasses the other deep models on the ETH dataset. For example, TA-CNN outperforms the two other best-performing deep models, SDN [18] and DBN-Mut [24], by 5.5 and 6 percent, respectively. Besides, TA-CNN even reduces the miss rate by 12.7 percent compared to MultiSDP [36], which carefully designed multiple classification stages to recognize hard negatives.
4.4 Visualization of Detection Results
5 Conclusions
In this paper, we proposed a novel Task-Assistant CNN (TA-CNN) to learn features from multiple tasks (pedestrian and scene attributes) and multiple datasets, showing its superiority over hand-crafted features and features learned by other deep models. This is because a high-level representation can be learned by employing semantic tasks and multiple data sources. Extensive experiments demonstrate its effectiveness. The proposed model can be further improved by incorporating more attributes, and future work will explore more attribute configurations. The proposed approach also has potential for scene parsing, because it predicts background attributes.
References
 [1] G. Andrew and J. Gao. Scalable training of L1-regularized log-linear models. In ICML, pages 33–40, 2007.
 [2] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool. Pedestrian detection at 100 frames per second. In CVPR, pages 2903–2910, 2012.
 [3] R. Benenson, M. Mathias, T. Tuytelaars, and L. V. Gool. Seeking the strongest rigid detector. In CVPR, 2013.
 [4] R. Benenson, M. Omran, J. Hosang, and B. Schiele. Ten years of pedestrian detection, what have we learned? In ECCV Workshop, 2014.
 [5] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, pages 44–57, 2008.
 [6] R. Caruana. Multitask learning. 1998.
 [7] G. Chen, Y. Ding, J. Xiao, and T. X. Han. Detection evolution with multi-order contextual co-occurrence. In CVPR, pages 1798–1805, 2013.
 [8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
 [9] P. Dollár, R. Appel, S. Belongie, and P. Perona. Fast feature pyramids for object detection. TPAMI, 2014.
 [10] P. Dollár, Z. Tu, P. Perona, and S. Belongie. Integral channel features. In BMVC, 2009.
 [11] P. Dollár, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 34, 2012.
 [12] A. Ess, B. Leibe, and L. Van Gool. Depth and appearance for mobile scene analysis. In ICCV, pages 1–8, 2007.
 [13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 32(9):1627–1645, 2010.
 [14] S. Gould, R. Fulton, and D. Koller. Decomposing a scene into geometric and semantically consistent regions. In ICCV, pages 1–8, 2009.
 [15] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
 [16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
 [17] Z. Lin and L. S. Davis. Shape-based human detection and segmentation via hierarchical part-template matching. TPAMI, 32(4):604–618, 2010.
 [18] P. Luo, Y. Tian, X. Wang, and X. Tang. Switchable deep network for pedestrian detection. In CVPR, pages 899–906, 2014.
 [19] V. Nair and G. E. Hinton. Rectified linear units improve restricted boltzmann machines. In ICML, pages 807–814, 2010.
 [20] W. Nam, P. Dollár, and J. H. Han. Local decorrelation for improved pedestrian detection. In NIPS, 2014.
 [21] T. Nortcliffe. People analysis cctv investigator handbook. In Home Office Centre of Applied Science and Technology, 2011.
 [22] W. Ouyang and X. Wang. A discriminative deep model for pedestrian detection with occlusion handling. In CVPR, 2012.
 [23] W. Ouyang and X. Wang. Joint deep learning for pedestrian detection. In ICCV, 2013.
 [24] W. Ouyang, X. Zeng, and X. Wang. Modeling mutual visibility relationship in pedestrian detection. In CVPR, 2013.
 [25] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Pedestrian detection with spatially pooled features and structured ensemble learning. arXiv, 2014.
 [26] S. Paisitkriangkrai, C. Shen, and A. van den Hengel. Strengthening the effectiveness of pedestrian detection with spatially pooled features. In ECCV, pages 546–561, 2014.

 [27] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár. Exploring weak stabilization for motion feature extraction. In CVPR, pages 2882–2889, 2013.
 [28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.
 [29] P. Sermanet, K. Kavukcuoglu, S. Chintala, and Y. LeCun. Pedestrian detection with unsupervised multi-stage feature learning. In CVPR, 2013.
 [30] C. Shen, P. Wang, S. Paisitkriangkrai, and A. van den Hengel. Training effective node classifiers for cascade classification. IJCV, 103(3):326–347, 2013.
 [31] J. Tighe and S. Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. In ECCV, pages 352–365, 2010.

 [32] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 57(2):137–154, 2004.
 [33] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 63(2):153–161, 2005.
 [34] X. Wang, T. X. Han, and S. Yan. An HOG-LBP human detector with partial occlusion handling. In ICCV, 2009.
 [35] J. Yan, X. Zhang, Z. Lei, S. Liao, and S. Z. Li. Robust multiresolution pedestrian detection in traffic scenes. In CVPR, pages 3033–3040, 2013.
 [36] X. Zeng, W. Ouyang, and X. Wang. Multi-stage contextual deep learning for pedestrian detection. In ICCV, pages 121–128, 2013.
 [37] S. Zhang, C. Bauckhage, and A. Cremers. Informed Haar-like features improve pedestrian detection. In CVPR, pages 947–954, 2014.
 [38] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.
 [39] L. Zhu, Y. Chen, and A. Yuille. Learning a hierarchical deformable template for rapid deformable object parsing. TPAMI, 32(6):1029–1043, 2010.