Deep convolutional neural networks continue to advance the state of the art in many domains as they grow bigger and more complex. It has been observed that many of the parameters of a large network are redundant, allowing for the possibility of learning a smaller network that mimics the outputs of the large network through a process called Knowledge Distillation. We show, however, that standard Knowledge Distillation is not effective for learning small models for the task of pedestrian detection. To improve this process, we introduce a higher-dimensional hint layer to increase information flow. We also estimate the variance in the outputs of the large network and propose a loss function to incorporate this uncertainty. Finally, we attempt to boost the complexity of the small network without increasing its size by using as input hand-designed features that have been demonstrated to be effective for pedestrian detection. We succeed in training a model that contains 400× fewer parameters than the large network while outperforming AlexNet on the Caltech Pedestrian Dataset.
State-of-the-art deep convolutional neural networks are extremely large and require a vast amount of resources. For example, the classic VGG16 image classification network [28] contains 138 million parameters, and the more recent ResNet200 [15] still contains over 60 million parameters.
This holds true not just for image classification. For instance, at the time of writing, the top three approaches for pedestrian detection as measured on the Caltech Pedestrian Dataset [10] are MS-CNN [3] and RPN+BF [30], both built upon the Faster R-CNN [25] architecture containing over 100 million parameters, and SA-FastRCNN [21], which features a network with over 30 million parameters. The larger a network is, the more disk space, memory, and energy it consumes, and the slower it is to use.
These large networks contain many redundant parameters [20, 7], so in theory they could be much smaller. To demonstrate this, we adopt Knowledge Distillation (KD) [17] to train a small student network to mimic the large teacher network.
KD was developed for classification on the ImageNet dataset [27] with the idea that the 1000-dimensional prediction from the teacher is much more informative than the single ground truth label. But for pedestrian classification where there are only 2 outputs (pedestrian / no pedestrian), this difference is much less pronounced. To increase the dimensionality of the data that the student tries to learn, we propose introducing just before the final layer a reasonably sized fully-connected hint layer whose outputs we try to match.

The teacher network does not always predict the correct label. The policy behind KD is to train the student to mimic the teacher regardless of the mistakes. However, if the teacher has an estimate of how confident its prediction is, then the student could make more informed decisions. Intuitively, if the teacher reports that it is very confident about its prediction, then the student should trust the teacher more. To produce a measure of teacher confidence, we follow the insights in [11] and utilize dropout at test time to fit a Gaussian distribution to the teacher outputs. We then propose a loss function that incorporates this information when learning from the teacher.
Hand-designed features are very popular in traditional computer vision. However, these features have not seen much success when used with deep learning [18]. One theory is that deep networks are able to learn features internally that outperform the hand-designed features. If this is indeed the case, we hypothesize that small networks may not have the capacity to do so, and traditional feature extraction may improve small networks. We investigate this using Aggregate Channel Features (ACF) [8], which are popular and proven features for pedestrian detection.

In this paper we propose to use Knowledge Distillation to compress a large network for pedestrian classification. We explore variations on the training process by learning from the outputs of a hint layer inserted before the final fully-connected layer, introducing a loss function that takes into account output covariances, and using Aggregate Channel Features as input.
We show the effect of these modifications on student networks of various sizes, and show that they outperform both training from scratch and standard KD. We produce a model with only 157K parameters that outperforms AlexNet [19], which has over 57M parameters, about 360× as many.
Network Pruning. The earliest studies into network size reduction came in the form of weight pruning, motivated by the need for regularization. These methods use the magnitude of the weights [13] or the Hessian of the loss function [20, 14] to prune away less useful weights. Apart from pruning weights, Srinivas and Babu [29] devised a method for pruning neurons directly without the use of any training data. These pruning approaches remove a significant amount of the uninformative parts of the network and result in lower computation costs and storage requirements.
Parameter Sharing. Han et al. [12] introduced a multi-step pipeline with pruning, weight clustering and Huffman encoding. An orthogonal approach uses hashing or bucketing to quantize various parts of the model [4, 22]. Cheng et al. [5] enforce a circulant matrix model on the fully-connected layers to exploit faster computation and smaller model size via Fast Fourier Transforms. By quantizing and sharing parameters, the amount of space needed to store the network representation is reduced.
Matrix Decomposition. Neural network weights can be treated as matrices and compressed through matrix decomposition. Denil et al. [7] use a low-rank decomposition of the weight matrices together with a sparse dictionary learned from an autoencoder to reduce the number of parameters. Novikov et al. [23] apply the Tensor-Train decomposition [24] to compress the weight matrices in the fully-connected layers.

Transfer Learning. While the above methods compress an existing network directly, the underlying architecture remains bulky, with the same width and depth as before. An alternative is to consider transferring the knowledge to a new smaller network. This produces a much more compact model with dense weights instead of sparse weights. Moreover, it is possible to then apply the above methods on top of the new network.
Ba and Caruana [1] showed that it is possible to train a shallower but wider student network to mimic a teacher network, performing almost as well as the teacher. Hinton et al. [17] generalized this idea by training the student to learn from both the teacher and from the training data, naming this process Knowledge Distillation (KD). They demonstrated that students trained this way outperform those trained directly using only the training data. FitNets [26] use Knowledge Distillation with intermediate hint layers to train a thinner but deeper student network containing fewer parameters that outperforms even the teacher network. However, to the best of our knowledge, this approach has yet to be applied to training a network that is both thinner and shallower.
The process of Knowledge Distillation (KD) for classification networks is to train the student from the predictions of the teacher network in addition to the ground truth hard targets (Figure 0(a)). However, with a standard soft maximum (softmax) classification layer, the teacher predictions will often be very similar to the hard targets, with one class having probability close to 1 and the other classes having probabilities close to 0. Instead, a variant of the softmax function that includes a temperature parameter $T$ is used to produce soft targets:

$$p_i(z; T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{1}$$

When $T = 1$, this is the standard softmax function, while higher values of $T$ produce a smoother probability distribution over the classes. $z_i$ are the input logits to the softmax layer, and are also the outputs of the fully-connected layer before it.

The loss function used for training the student is a combination of the soft loss $\mathcal{L}_{\text{soft}}$, the cross-entropy loss between the soft outputs of the student and teacher, as well as the hard loss $\mathcal{L}_{\text{hard}}$, the standard classification cross-entropy loss between the student outputs and the ground truth labels. Writing $z^t$ and $z^s$ for the teacher and student logits, $y$ for the one-hot ground truth label, $H(p, q) = -\sum_i p_i \log q_i$ for the cross-entropy, and $\lambda$ for the weight of the hard loss:

$$\mathcal{L}_{\text{soft}} = H\big(p(z^t; T),\, p(z^s; T)\big) \tag{2}$$

$$\mathcal{L}_{\text{hard}} = H\big(y,\, p(z^s; 1)\big) \tag{3}$$

$$\mathcal{L} = \mathcal{L}_{\text{soft}} + \lambda \mathcal{L}_{\text{hard}} \tag{4}$$
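The temperature-scaled softmax and the combined soft and hard loss described above can be sketched in plain Python. This is a minimal illustration, not the training code used in this work: the function names are hypothetical, the default temperature `T=2.0` is an arbitrary placeholder, and combining the terms as `soft + lam * hard` is an assumption based on the statement that the hard loss is weighted by a lambda of 0.5.

```python
import math

def softmax_t(logits, T=1.0):
    """Softmax with temperature T; T=1 gives the standard softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy H(target, pred) between two discrete distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def kd_loss(student_logits, teacher_logits, label, T=2.0, lam=0.5):
    """Soft loss against the teacher plus a weighted hard loss against the label.

    The additive form and the default T are assumptions for illustration.
    """
    soft = cross_entropy(softmax_t(teacher_logits, T), softmax_t(student_logits, T))
    one_hot = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard = cross_entropy(one_hot, softmax_t(student_logits, 1.0))
    return soft + lam * hard
```

Raising `T` flattens the distribution: for the logits `[2.0, 0.0]`, `softmax_t(..., T=5.0)` assigns the first class a smaller probability than `softmax_t(..., T=1.0)`, while still favouring it.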
A graphical outline of our pipeline can be found in Figure 0(b). Here we explain the various parts of the pipeline and the motivations behind them.
KD was developed for ImageNet classification with the idea that the 1000-dimensional prediction from the teacher is much more informative than the single ground truth label. But for pedestrian classification where there are only 2 outputs (pedestrian / no pedestrian), this difference is much less pronounced. Since the output of a softmax function sums to 1 for every value of $T$, the soft targets are actually only 1-dimensional.
To increase the dimensionality of the data that the student learns from, we introduce a hint layer, a fully-connected (FC) layer with 64 outputs in front of the final FC layer, and train the student to match the outputs of the hint layer instead. If the student network can perfectly match the hint layer outputs, then just by copying over the teacher’s final FC layer, the student will be able to mimic the teacher’s outputs. Even if the student cannot perfectly match the hint layer outputs, the weights from the teacher’s final FC layer still serve as a good initialization for the student’s final FC layer, which will be fine-tuned through the hard loss coming from the ground truth labels. In this work, we assume that the hint layer is the same size for both the teacher and student so that interpolation is not required.
This idea of matching earlier hint layers has been explored in FitNets [26]. However, they choose to match a layer in the middle of the model to provide additional guidance in training a thinner but deeper student network. We instead propose this idea in order to increase the amount of information obtained from the teacher model in cases where there are only a small number of output classes.
The outputs of this hint layer cannot be interpreted as a probability distribution, so cross-entropy loss is not applicable. Instead, we use mean squared error loss as the soft loss.
Care must be taken when the activation for the hint layer is a rectified linear (ReLU) nonlinearity, in which case it is advised to match the values before passing them through the ReLU function. This is because the ReLU function discards information of negative values, and also because the gradient for where the student predicts a negative value is ignored, leading to instabilities in training.
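A small sketch, with made-up numbers, of why matching pre-activation values matters. The helper names and the three-element vectors are hypothetical; the real hint layer has 64 outputs.

```python
def hint_loss(student_pre, teacher_pre):
    """Mean squared error between two equally sized hint-layer output vectors."""
    if len(student_pre) != len(teacher_pre):
        raise ValueError("hint layers must have the same size")
    return sum((s - t) ** 2 for s, t in zip(student_pre, teacher_pre)) / len(student_pre)

def relu(values):
    return [max(0.0, v) for v in values]

# Hypothetical pre-activation hint outputs; the second unit is negative for both.
teacher = [1.5, -2.0, 0.3]
student = [1.5, -0.1, 0.3]

pre_loss = hint_loss(student, teacher)               # penalizes the mismatch at the second unit
post_loss = hint_loss(relu(student), relu(teacher))  # 0.0: ReLU maps both negatives to zero
```

Matching after the ReLU reports zero loss even though the student is far from the teacher on the negative unit, so no gradient reaches that unit.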
There will be cases where the teacher makes mistakes and predicts differently from the ground truth. The policy behind KD is to train the student to mimic the teacher regardless of the mistakes, relying on the hard losses to nudge the outputs towards the correct label. This results in a tension between the soft and hard losses, each producing a gradient for the opposite label.
This tension can be relaxed slightly if the teacher has an estimate of how confident its prediction is. Intuitively, if the teacher reports that it is very confident about its prediction, then the student should trust the teacher more, and if the teacher instead reports that it is not confident about its prediction, then the student should balance mimicking the teacher with predicting the correct label. The underlying assumption is that the teacher is more likely to be confident about examples that it predicts correctly. There will be cases where the teacher is very confident yet mistaken, but we believe that it is important for the student not to disregard the teacher in these cases.
In [11], the authors draw a theoretical link casting dropout as a Bayesian approximation of Gaussian Processes. Following their ideas, we enable dropout during test time and forward the same input through the model $N$ times. Each pass can be thought of as the output of a single model sampled from an ensemble. From this, the sample mean and covariance of the outputs of the ensemble can be estimated.
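By doing so, we are fitting a multivariate Gaussian distribution to the teacher outputs, from which it is possible to measure the likelihood of the student output as being drawn from the distribution. In particular, writing $\mu$ and $\Sigma$ for the sample mean and covariance of the teacher outputs and $k$ for their dimensionality, the likelihood of the student output $x^s$ is:

$$p(x^s) = \frac{1}{\sqrt{(2\pi)^k \lvert\Sigma\rvert}} \exp\left(-\frac{1}{2}(x^s - \mu)^\top \Sigma^{-1} (x^s - \mu)\right) \tag{5}$$

Maximizing the log-likelihood of Equation 5 is equivalent to minimizing the following loss function:

$$\mathcal{L}_{\text{conf}} = (x^s - \mu)^\top \Sigma^{-1} (x^s - \mu) \tag{6}$$

This function is the square of the Mahalanobis distance. Compared to the mean squared distance, it is smaller along dimensions of high variability, consistent with our idea of reporting smaller gradients for outputs that the teacher is not confident in.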
One limitation of our method lies in the dimension of the covariance matrix. Since the loss function requires the inversion of the covariance matrix, the number of samples must be larger than the dimensionality of the teacher’s output. However, the output from the time-consuming convolutional layers can be cached, and only the last few layers with dropout need multiple passes, so the additional overhead during training is low.
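The confidence-aware loss can be sketched with NumPy as follows. This is an illustration under stated assumptions: the function names are hypothetical, and the teacher's stochastic forward passes are stood in for by random draws, since running a real network with dropout is outside the scope of a short example. The count of 200 passes matches the number reported in the training details.

```python
import numpy as np

def teacher_statistics(samples):
    """Sample mean and covariance of teacher outputs.

    samples: (N, d) array, one row per stochastic forward pass with dropout enabled.
    """
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    if n <= d:
        raise ValueError("need more samples than output dimensions to invert the covariance")
    mu = samples.mean(axis=0)
    sigma = np.atleast_2d(np.cov(samples, rowvar=False))  # (d, d) sample covariance
    return mu, sigma

def confidence_loss(student_out, mu, sigma):
    """Squared Mahalanobis distance of the student output from the teacher distribution."""
    diff = np.asarray(student_out, dtype=float) - mu
    # Solve sigma @ x = diff instead of forming the explicit inverse.
    return float(diff @ np.linalg.solve(sigma, diff))

# Stand-in for 200 dropout passes: the teacher is confident in the first
# output (small spread) and unsure about the second (large spread).
rng = np.random.default_rng(0)
samples = rng.normal(loc=[1.0, -1.0], scale=[0.1, 2.0], size=(200, 2))
mu, sigma = teacher_statistics(samples)

loss_confident = confidence_loss(mu + np.array([0.5, 0.0]), mu, sigma)
loss_uncertain = confidence_loss(mu + np.array([0.0, 0.5]), mu, sigma)
```

The same deviation of 0.5 is penalized far more along the dimension where the teacher's outputs vary little than along the dimension where they vary a lot.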
Before deep learning became mainstream, computer vision was dominated by the use of specialized features for each task discovered through extensive experimentation. For example, the advent of HOG features [6] was groundbreaking in the development of pedestrian detection, and the introduction of Integral Channel Features [9] brought about another revolution, leading to the discovery of many derivative features such as Aggregate Channel Features [8] and Checkerboards features [31], the latter of which is competitive with the state of the art.
These hand-designed features are largely ignored in deep learning. In [18], the authors found that there was no improvement in neural networks trained using hand-designed features compared to those trained using raw RGB images as input. However, their model was large, so it is possible that it was able to learn features internally that outperform the hand-designed features. The same might not be true for a small model, in which case it may be reasonable to expect that by using these hand-designed features as input, the small model can be improved. The use of hand-designed features as input can also be thought of as attaching a fixed layer to the front of the network, pretrained through years of human research.
For this reason, we explore training our student networks using Aggregate Channel Features (ACF) as input. We choose ACF because it offers a good trade-off between detection accuracy and speed, taking less than 10ms per image to compute on a single CPU [8].
ACF consist of 10 channels: the LUV color channels, gradient magnitude, and six oriented gradient bins. The input image is first converted into these 10 channels, then, within each channel, pixels are divided into 4x4 blocks and summed.
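The channel layout above can be sketched with NumPy. This is a simplified illustration and not the reference ACF implementation: it assumes the input has already been converted to LUV (the conversion is not shown), computes gradients on the L channel only, assigns each pixel to a single orientation bin, and omits any smoothing. The function name and parameters are hypothetical.

```python
import numpy as np

def acf_channels(luv, n_bins=6, block=4):
    """Build 10 aggregated channels from an (H, W, 3) image already in LUV space."""
    luv = np.asarray(luv, dtype=float)
    gy, gx = np.gradient(luv[..., 0])          # gradients of the L channel (rows, cols)
    mag = np.hypot(gx, gy)                     # gradient magnitude
    ori = np.mod(np.arctan2(gy, gx), np.pi)    # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ori / np.pi * n_bins).astype(int), n_bins - 1)
    # Each orientation channel holds the magnitude of the pixels falling in its bin.
    hist = np.stack([np.where(bin_idx == b, mag, 0.0) for b in range(n_bins)], axis=-1)
    chans = np.concatenate([luv, mag[..., None], hist], axis=-1)   # (H, W, 10)
    h, w, c = chans.shape
    h, w = h - h % block, w - w % block        # crop to a multiple of the block size
    chans = chans[:h, :w]
    # Sum every block x block cell within each channel.
    return chans.reshape(h // block, block, w // block, block, c).sum(axis=(1, 3))
```

In this sketch a 16x12 input yields a 4x3x10 output, and the six orientation channels sum to the gradient magnitude channel in every cell.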
Note that when we train the student using ACF features as input, the input to the teacher remains the original RGB image. Whether the student is trained on RGB or ACF, they learn from the exact same teacher.
We perform all training and evaluation on the Caltech Pedestrian Dataset [10]. Following standard practice, we use the first 5 sequences as the training set, the 6th sequence as the validation set, and the last 5 sequences as the test set.
We follow the setup of Caltech10x in [18] and sample every 3rd frame for training. We use the Reasonable configuration when testing on the Caltech test set, which samples every 30th frame, includes only pedestrians without significant occlusion and with a minimum height of 50 pixels, and excludes the labels “people” and “person?”. Evaluation is performed using the official evaluation script, which computes a curve of the logarithm of the number of false positives per image versus the miss rate. A value for the log-average miss rate is also calculated, and a lower value indicates a better result.
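The text does not spell out how the log-average miss rate is computed. A common convention for the Caltech benchmark, assumed here, is the geometric mean of the miss rate sampled at nine false-positives-per-image (FPPI) reference points evenly spaced in log space between 0.01 and 1. The sketch below follows that convention; the function name and the handling of curves that start above a reference point are assumptions, and the official evaluation script remains the authority.

```python
import math

def log_average_miss_rate(fppi, miss_rate, n_points=9):
    """Geometric mean of the miss rate at log-spaced FPPI reference points.

    fppi: FPPI values in increasing order; miss_rate: the matching miss rates.
    """
    refs = [10 ** (-2 + 2 * i / (n_points - 1)) for i in range(n_points)]
    sampled = []
    for r in refs:
        # Take the operating point with the largest FPPI not exceeding the reference;
        # if the curve starts above the reference, count a miss rate of 1.
        candidates = [m for f, m in zip(fppi, miss_rate) if f <= r]
        sampled.append(candidates[-1] if candidates else 1.0)
    return math.exp(sum(math.log(max(m, 1e-10)) for m in sampled) / n_points)
```

A detector whose miss rate is a constant 0.2 across the whole FPPI range scores 0.2, and lowering the miss rate at any reference point lowers the score.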
Our training set uses ground truth patches as well as patches with Intersection-over-Union (IoU) greater than 0.5 as positive patches, and patches with IoU less than 0.5 as negative patches. There are 31,129 positive patches and 748,139 negative patches in the training set.
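The patch labeling rule can be illustrated with a short sketch. The box format `(x1, y1, x2, y2)` and the function names are assumptions, as is the treatment of a patch with IoU exactly equal to 0.5, which the text leaves unspecified and which is counted as negative here.

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union

def label_patch(patch, gt_boxes, thresh=0.5):
    """Return 1 (positive) if the patch overlaps any ground truth box by more than thresh."""
    best = max((iou(patch, g) for g in gt_boxes), default=0.0)
    return 1 if best > thresh else 0
```

For example, a 10x10 patch shifted by one pixel from a 10x10 ground truth box has an IoU of 90/110 and is labeled positive, while a patch in a frame with no ground truth boxes is labeled negative.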
Teacher Network. For our teacher network, we use pre-activation ResNet200 [16] pretrained on ImageNet, augmented with dropout and a 64-dimensional hint layer, then fine-tuned on our training set.
Student Network. We use pre-activation ResNet18 [16] pretrained on ImageNet, augmented with a 64-dimensional hint layer, as the basis for our student networks. We experiment with three versions: unmodified ResNet18, ResNet18Thin which cuts the number of channels for every layer in half, and ResNet18Small which fixes every layer at 32 channels. The compression rates of these models can be found in Table 2.
| Layer | ResNet200 | ResNet18 |
|---|---|---|
| conv1 | conv, stride 2 | conv, stride 2 |
| | pool, stride 2 | pool, stride 2 |
| conv2_x | ×3 | ×2 |
| conv3_x | ×24 | ×2 |
| conv4_x | ×36 | ×2 |
| conv5_x | ×3 | ×2 |
| classifier | avgpool | avgpool |
| | dropout | FC(512, 64, ReLU) |
| | FC(2048, 64, ReLU) | FC(64, 2, softmax) |
| | FC(64, 2, softmax) | |

Table 1: Network architectures for 224x224 input patches. Pool refers to a max-pooling layer, FC refers to a fully-connected layer, and avgpool refers to a global average pooling layer. All convolutional layers include batch normalization and a ReLU activation. The first convolution layer of each of conv{3,4,5} has a stride of 2.
| Model | #Parameters | Compression |
|---|---|---|
| ResNet200 | 63M | 1× |
| ResNet18 | 11M | 6× |
| ResNet18Thin | 2.8M | 22× |
| ResNet18Small | 157K | 400× |
| AlexNet | 57M | |
Training is performed through stochastic gradient descent with Nesterov momentum of 0.9 and weight decay of 0.0005. We use a batch size of 16, an epoch size of 1000 iterations, a learning rate of 0.01 dropping by a factor of 5 every 20 epochs, and a total of 70 epochs. Since there are many more negative patches than positive patches in our training set, we force a positive to negative ratio of 1:3 for each training batch.
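The step schedule and the batch composition can be made concrete with a short sketch. The function names are hypothetical, epochs are assumed to be numbered from 0, and sampling without replacement within a batch is an assumption.

```python
import random

def learning_rate(epoch, base_lr=0.01, drop=5.0, every=20):
    """Learning rate that drops by a factor of `drop` every `every` epochs."""
    return base_lr / drop ** (epoch // every)

def sample_batch(positives, negatives, batch_size=16, pos_fraction=0.25, rng=random):
    """Draw a batch with a 1:3 positive to negative ratio (4 positives, 12 negatives)."""
    n_pos = int(batch_size * pos_fraction)
    batch = rng.sample(positives, n_pos) + rng.sample(negatives, batch_size - n_pos)
    rng.shuffle(batch)
    return batch
```

Under these assumptions the learning rate is 0.01 for epochs 0 to 19, 0.002 for epochs 20 to 39, 0.0004 for epochs 40 to 59, and 0.00008 for the final ten epochs.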
The inputs to the teacher network are 224x224x3 RGB patches, and the inputs to the student networks are either 224x224x3 RGB patches or 224x224x10 ACF patches. Patches are scaled by warping them to fit the input size, and RGB inputs are normalized using ImageNet mean and standard deviation. During training, patches are randomly flipped horizontally. The extraction of ACF features occurs after the flip.
When combining soft and hard losses, hard losses are weighted with a lambda of 0.5. For dropout during testing, we use a probability of 0.5. When estimating the covariance of teacher output, we forward each input 200 times.
The models are trained with the Torch framework on an NVIDIA Titan X GPU with 12GB memory.
All of the evaluation results are reported on the Reasonable subset of the Caltech test set using the model at the epoch with the lowest log-average miss rate on the Reasonable subset of the Caltech validation set.
For all of the student models tested, the best performing configuration was to add a hint layer, use teacher confidence, and train with RGB inputs. A summary of our results can be found in Table 9. In the following sections, we break down the contribution from each of our innovations.
Model  Direct  KD 

ResNet200  17.5%  — 
ResNet18  19.1%  18.6% 
ResNet18Thin  22.0%  22.8% 
ResNet18Small  24.5%  24.8% 
AlexNet  23.3%  — 
Table 3 compares the various models trained directly from ground truth as well as the student models trained using standard Knowledge Distillation on the softmax logits. The temperature $T$ used for KD was picked after testing multiple values.
Standard KD does not seem to work for the smaller models. In Figure 3, we visualize the histogram of the distribution of the two output logits on the entire test set for the teacher (ResNet200), large (ResNet18) and small (ResNet18Small) students trained via standard KD, and the small student trained with our full pipeline. Only the small student trained via standard KD is visibly different.
This difference highlights an important property of standard KD. With or without a temperature, the softmax function normalizes its outputs to sum to 1, and two different inputs can result in the same output. This is not necessarily a problem, but we hypothesize that with very small student models, for problems with very few output labels, standard KD does not offer enough guidance to be superior to training from ground truth labels.
Model  Direct  Hint 

ResNet200  17.5%  — 
ResNet18  19.1%  18.1% 
ResNet18Thin  22.0%  20.4% 
ResNet18Small  24.5%  23.1% 
As reported in Table 4, adding a hint layer improves training for all student models. This reinforces the idea that increasing the amount of information used for training models is beneficial.
It would be interesting to explore the effect of varying the size of the hint layer, but that is unfortunately outside the scope of this work. Is there a point where the hint layer is too large and dominated by noise instead of useful data?
Model  Direct  Conf  Hint + Conf (Ours) 

ResNet200  17.5%  —  — 
ResNet18  19.1%  18.2%  18.0% 
ResNet18Thin  22.0%  20.7%  20.3% 
ResNet18Small  24.5%  23.7%  22.4% 
Table 5 shows that estimating output covariances and training with our proposed loss function improves the student models. There is slightly more improvement if the covariances are estimated from the 64-dimensional hint layer outputs instead of the 2-dimensional softmax logits.
This shows that there is indeed more information in the teacher network that can be extracted when treated as an ensemble using dropout.
Model  RGB  ACF  ACF + Hint + Conf 

ResNet200  17.5%  19.6%  — 
ResNet18  19.1%  21.4%  18.7% 
ResNet18Thin  22.0%  22.4%  20.4% 
ResNet18Small  24.5%  25.2%  23.4% 
Our results in Table 6 are consistent with [18] in that a network trained directly using features like ACF as input is worse than if it were trained with raw RGB inputs. This holds true even as the network size is significantly reduced, though the drop in performance for smaller models is not as severe.
As with our previous models taking RGB inputs, introducing a hint layer and factoring in teacher confidence offers a similar amount of improvement for the models taking ACF inputs. However, the models with ACF inputs still fall short of those that take RGB inputs, and we can only conclude that ACF is worse than RGB as input to pedestrian detection networks.
In this section, we compare ResNet200 (teacher) with ResNet18SmallRGBHintConf (student).
Model Similarity. We visualize the weights of the first convolutional layer in Figure 4. The first-layer weights are slightly different; however, we can see similar shapes in the patterns.





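| | Student correct | Student incorrect |
|---|---|---|
| Teacher correct | 73.74% | 21.12% |
| Teacher incorrect | 0.60% | 4.54% |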
Failure Cases. We tabulate the correct and incorrect predictions for each model in Table 7. The teacher and student networks predict the same label 78.28% of the time. We sample some example patches where the teacher predicts correctly but the student fails in Figure 4(a), 4(b) and the reverse in Figure 4(c). The student did not predict any positive patches correctly that the teacher had predicted incorrectly. The patches where the student outperformed the teacher are indeed harder to classify, and could be a result of the student model being much smaller and thus more regularized.



We report the resource usage during test time on an NVIDIA Titan X in Table 8.
Model  Time  Memory 

ResNet200  24ms  5377MB 
ResNet18  3ms  937MB 
ResNet18Thin  3ms  633MB 
ResNet18Small  3ms  565MB 
Single Identity Layer  0.02ms  325MB 
It appears that modern GPUs are not affected very much by the number of channels in convolutional layers, so while ResNet18Thin and ResNet18Small are much smaller in terms of the number of parameters, they are not significantly faster than ResNet18. However, the memory usage is significantly decreased. Ignoring the fixed amount of memory used by the inputs and the system, measured using a model with a single identity layer (325MB), ResNet18Small uses 21× less memory than ResNet200 (240MB versus 5052MB).
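| Log-average miss rate | ResNet200 (Teacher) RGB | ResNet200 (Teacher) ACF | ResNet18 RGB | ResNet18 ACF | ResNet18Thin RGB | ResNet18Thin ACF | ResNet18Small RGB | ResNet18Small ACF |
|---|---|---|---|---|---|---|---|---|
| Direct | 17.5% | 19.6% | 19.1% | 21.4% | 22.0% | 22.4% | 24.5% | 25.2% |
| KD | - | - | 18.6% | - | 22.8% | 23.1% | 24.8% | - |
| Conf | - | - | 18.2% | - | 20.7% | - | 23.7% | - |
| Hint | - | - | 18.1% | - | 20.4% | 21.1% | 23.1% | - |
| Hint+Conf | - | - | 18.0% | 18.7% | 20.3% | 20.4% | 22.4% | 23.4% |

| Statistic | ResNet200 (Teacher) | ResNet18 | ResNet18Thin | ResNet18Small |
|---|---|---|---|---|
| #Parameters | 63M (1×) | 11M (6×) | 2.8M (22×) | 157K (400×) |
| Speed | 24ms (1×) | 3ms (8×) | 3ms (8×) | 3ms (8×) |
| Memory | 5052MB (1×) | 612MB (8×) | 308MB (16×) | 240MB (21×) |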
We have shown that there is indeed a lot of redundancy in large deep neural networks. We have shown that it is possible to train a student network that contains 400 times fewer parameters while only observing an increase in log-average miss rate of 4.9% (from 17.5% to 22.4%). The main gains of our approach come from the higher dimensionality of our new hint layers. We also described a method of obtaining a measure of confidence from the teacher network, and demonstrated that taking this information into account during training can lead to considerable gains. Our student models run 8× faster than the teacher with 21× less memory usage.