Advanced driver assistance systems (ADAS) are important for preventing vehicle accidents. ADAS comprise several components, such as forward collision warning (FCW), lane departure warning (LDW), and pedestrian collision warning (PCW) systems. The PCW system is especially helpful for preventing serious damage and casualties, and as a result, many researchers and developers have tried to improve it. However, PCW systems usually rely on pedestrian detection and have an unfortunate tendency to produce false alarms in safe situations.
For example, as shown in the image on the right side of Fig. 1, if pedestrians are walking on the sidewalk, then an alarm is not necessary. However, traditional pedestrian detection-based PCW systems produce alerts in this situation. It is not trivial to develop such a system using only hand-crafted features, because doing so requires several complex steps, including pedestrian detection and scene recognition. In this study, we build a system that solves this problem using an end-to-end framework based on a convolutional neural network (CNN).
Our contributions in this paper are summarized as follows:
We propose a novel framework for a PCW system composed of an end-to-end CNN-based learning algorithm.
We show performance improvements of the proposed PCW system, which are achieved using semantic information from images.
There are two main advantages to our CNN-based PCW system. First, our system is effective in reducing false alarms compared to traditional pedestrian-detection-based PCW systems. Second, unlike traditional methods, our system can give warning alarms in response to cyclists as well as to pedestrians.
2 End-to-End PCW System
Fig. 2 shows the proposed architecture of the CNN for our PCW system. We use five convolutional layers (CONV1 to CONV5) and four fully connected layers (FC1 to FC4). Unlike traditional approaches, which have distinct pedestrian detection and warning decision stages, as shown in Fig. 3, our PCW system does not include a pedestrian detection stage. In other words, our system predicts whether the situation is dangerous directly from the raw input image. This makes our system more accurate than traditional systems, since the varying appearance of pedestrians causes the pedestrian detection stage to be imperfect.
Our network is a combination of two networks, one responsible for prediction and the other for semantic segmentation, as shown in Fig. 2. The prediction network determines whether the input image shows a warning situation, which is a binary classification problem. The semantic segmentation network segments the input image and extracts useful semantic information to feed into the prediction network. These two networks are trained simultaneously by minimizing the following loss function:

L = L_CE + λ·L_E,

where L, L_CE, and L_E are the total, cross-entropy, and Euclidean loss functions, respectively. λ is a tuning parameter, chosen to adjust the scale between the two loss values. The cross-entropy loss is defined as follows:
L_CE = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} log(p_{i,k}),

where N is the total number of data samples in a batch, y_{i,k} is the k-th value of the ground truth label for the i-th data sample of the current training batch, and p_{i,k} is the k-th softmax output value of the prediction network for the i-th data sample. K is the total number of classes; in this case, K = 2. This loss function helps the network predict the situation correctly. The Euclidean loss function for the semantic segmentation network is defined as follows:

L_E = (1/N) Σ_{i=1}^{N} ||s_i − t_i||²,

where s_i is the output vector of the last FC layer (FC4) of the semantic segmentation network for the i-th data sample of a batch, and t_i is a vectorized form of the ground truth segmentation image for the i-th data sample. Using this loss function, the semantic segmentation network learns to segment the input image semantically.
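The two loss terms and their weighted combination can be written down directly. The following is a minimal numpy sketch of the loss computation described above; the array shapes, values, and function names are illustrative, not the authors' implementation.

```python
import numpy as np

def cross_entropy_loss(p, y):
    """Mean cross-entropy over a batch.
    p: (N, K) softmax outputs; y: (N, K) one-hot ground-truth labels."""
    n = p.shape[0]
    return -np.sum(y * np.log(p)) / n

def euclidean_loss(s, t):
    """Mean squared Euclidean distance over a batch.
    s: (N, D) FC4 outputs; t: (N, D) vectorized ground-truth segmentations."""
    n = s.shape[0]
    return np.sum((s - t) ** 2) / n

def total_loss(p, y, s, t, lam):
    # L = L_CE + lambda * L_E; lam is the paper's tuning parameter.
    return cross_entropy_loss(p, y) + lam * euclidean_loss(s, t)
```

In practice, both terms would be minimized jointly by backpropagation through the shared CONV1 and CONV2 layers.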
These two networks share their two low-level layers, CONV1 and CONV2, as shown in Fig. 2. This design is efficient, since these lower layers produce common features such as edges or blobs, allowing us to reduce the total number of learnable parameters. The outputs of the two networks are integrated at the FC2 layer: the high-level features of FC1 and FC3 are concatenated, and the concatenated features are used as the input to the FC2 layer. Unlike the features extracted by FC1, the features extracted by FC3 represent semantic features of the input image. The semantic features can detect and classify objects implicitly, so we expect them to be helpful for inferring dangerous situations. We do not use the output of the FC4 layer because its dimensionality is too large: 2,048 for FC3 compared to 131,072 for FC4. Such a large number of dimensions would require many weight connections at the FC2 layer, which can cause over-fitting.
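The feature-fusion step can be sketched concretely. The feature dimensions below (256 for FC1, 2,048 for FC3) come from the text; the random weights, batch size, and plain linear-layer implementation are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

batch = 4
fc1_features = rng.standard_normal((batch, 256))   # prediction branch (FC1)
fc3_features = rng.standard_normal((batch, 2048))  # semantic branch (FC3)

# Concatenate along the feature axis; the result feeds the FC2 layer.
fc2_input = np.concatenate([fc1_features, fc3_features], axis=1)

# FC2 modeled as a plain linear layer with 256 output nodes plus ReLU
# (weights are random placeholders, not trained values).
w2 = rng.standard_normal((fc2_input.shape[1], 256)) * 0.01
fc2_output = np.maximum(fc2_input @ w2, 0.0)
```

Concatenating FC3 (2,048 dimensions) rather than FC4 (131,072 dimensions) keeps the FC2 weight matrix small, which is the over-fitting argument made above.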
3 Balancing Training Data
It is difficult to train deep neural networks if the training data are imbalanced; that is, if the amount of data varies significantly between classes. Imbalance results in unequal loss contributions between classes and causes training of the CNN to fail. In our case, the number of no-warning images is five times greater than the number of warning images. To resolve this problem, we duplicate the warning-case training images so that both classes contain the same number of images.
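The duplication-based balancing step above can be sketched as follows. This is a minimal illustration of oversampling the minority class; the function name and random selection of which images to copy are assumptions, not the authors' exact procedure.

```python
import random

def balance_by_duplication(images, labels, seed=0):
    """Oversample the minority class by copying its examples until both
    classes are the same size (labels: 0 = no alarm, 1 = warning)."""
    rng = random.Random(seed)
    pos = [(x, y) for x, y in zip(images, labels) if y == 1]
    neg = [(x, y) for x, y in zip(images, labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw random copies from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    combined = pos + neg + extra
    rng.shuffle(combined)
    xs, ys = zip(*combined)
    return list(xs), list(ys)
```

An alternative to duplication would be re-weighting the loss per class, but copying samples keeps the training pipeline unchanged.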
4 Experimental Results
4.1 CNN Parameter Details
The detailed parameter settings for our network, which is similar to AlexNet, are as follows. The input image size is , with three channels representing RGB values. For the prediction network, we use CONV(11, 96, 4) - ReLU - MaxPool(3, 2) - CONV(5, 256, 1) - ReLU - MaxPool(3, 2) - CONV(3, 384, 1) - ReLU - CONV(3, 256, 1) - ReLU - MaxPool(3, 2) - FC(256) - ReLU - FC(256) - ReLU - Softmax(2), where the values in parentheses denote CONV(kernel size, number of channels, stride), MaxPool(kernel size, stride), and FC(number of output nodes). For the semantic segmentation network, we use CONV(11, 96, 4) - ReLU - MaxPool(3, 2) - CONV(5, 256, 1) - ReLU - MaxPool(3, 2) - FC(2048) - FC(131072).
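The spatial sizes produced by this conv/pool stack can be checked with the standard valid-convolution formula. The sketch below traces the prediction network's layers; the example input size of 227 (the AlexNet input size) and the zero-padding assumption are illustrative, since they may differ from the actual model.

```python
def out_size(in_size, kernel, stride):
    """Output spatial size of a valid (no padding) conv or pooling layer."""
    return (in_size - kernel) // stride + 1

def trace_prediction_stack(in_size):
    """Trace the spatial size through the prediction network's layers:
    CONV(11,96,4) - Pool(3,2) - CONV(5,256,1) - Pool(3,2) -
    CONV(3,384,1) - CONV(3,256,1) - Pool(3,2).
    Zero padding is assumed, which may not match the actual model."""
    sizes = [in_size]
    for k, s in [(11, 4), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 2)]:
        sizes.append(out_size(sizes[-1], k, s))
    return sizes
```

For a hypothetical 227x227 input, this yields 55 after CONV1 and 27 after the first pooling layer, matching the familiar AlexNet progression.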
The mini-batch size is , the learning rate is , the weight decay is , and the total number of iterations is . The weights of all layers are initialized using the method described in .
4.2 Dataset
We use the Cityscapes dataset, captured in urban environments, to evaluate our method. The dataset includes various objects, such as vehicles, pedestrians, and cyclists. Furthermore, each image is densely or sparsely annotated for semantic segmentation tasks. In our experiments, we use the densely annotated data, consisting of training images and test images. Unfortunately, the dataset does not provide ground truth data for PCW, so we annotated each image manually (0: no alarm, 1: warning).
4.3 Comparison to an HoG-based PCW System
For comparison with our method, we built a baseline algorithm, shown in Fig. 3, that includes an HoG-based pedestrian detection method. The rule for determining whether a situation is dangerous was as follows: if pedestrians exist in a dangerous region, determined manually by setting a region of interest from (128, 0) to (383, 255) in the input image, then a warning alarm is provided to the driver. (The size of the input image is .) This assumption is reasonable, since the region of interest largely excludes sidewalk regions in these images.
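The baseline decision rule above amounts to an overlap test between detected boxes and a fixed region of interest. Below is a minimal sketch of that rule; the function names and box representation are illustrative, while the ROI coordinates come from the text.

```python
def intersects(box_a, box_b):
    """Axis-aligned overlap test; boxes are (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

# Dangerous region from the text: top-left (128, 0) to bottom-right (383, 255).
DANGER_ROI = (128, 0, 383, 255)

def should_warn(pedestrian_boxes, roi=DANGER_ROI):
    """Baseline rule: warn if any detected pedestrian box overlaps the ROI.
    pedestrian_boxes would come from the HoG-based detector."""
    return any(intersects(box, roi) for box in pedestrian_boxes)
```

A detection entirely outside the ROI (for example, a pedestrian on the left sidewalk) triggers no alarm, which is exactly the behavior the proposed end-to-end system learns without an explicit detector.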
Fig. 4 shows the receiver operating characteristic (ROC) curves for each method, showing the rates of both true and false positives. The blue line represents the HoG-based algorithm and the red and green lines denote our proposed method with and without a semantic segmentation network, respectively. The true positive rate of the HoG-based algorithm is better than that of the other methods when the false positive rate is under 0.05; however, our proposed methods are superior to the HoG-based algorithm in other cases. In particular, our proposed method with a semantic segmentation network shows the best performance among the three approaches.
Table 1 shows the accuracy (true positive rate) of each method at a 15% false positive rate. Our proposed method without the semantic segmentation network improves on the HoG-based algorithm by about . With the semantic segmentation network, performance improves further, by .
| HoG-based | without semantic segmentation | with semantic segmentation |
Fig. 5 shows qualitative results extracted from our proposed PCW system with a semantic segmentation network. Surprisingly, our system was aware of both pedestrians and cyclists. Additionally, our system did not raise an alarm when there was no risk of collision with pedestrians or cyclists. This is a significant benefit of the use of our PCW system.
We have proposed an end-to-end PCW system based on a CNN. The usefulness of traditional systems, based on pedestrian detection, is limited by their false alarm rate. Our system, however, is effective in reducing false alarms and improves system accuracy. Our model combines two networks that individually perform prediction and semantic segmentation tasks, and this combination was helpful in improving our proposed PCW system. One of the main contributions of this paper is a demonstration of the feasibility of an end-to-end PCW system. Our system could potentially be improved by replacing the deep neural network architecture with a more recent one, such as a deep residual network. Additionally, we believe that our proposed framework can be applied to other ADAS, such as LDW and FCW systems.
This work was supported by the DGIST R&D Program of the Ministry of Science of Korea (16-FA-07).
Dalal, N. and Triggs, B. 'Histograms of oriented gradients for human detection,' CVPR, Vol. 1, IEEE, 2005.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. 'ImageNet classification with deep convolutional neural networks,' NIPS, pp. 1097-1105, 2012.
Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., and Schiele, B. 'The Cityscapes dataset for semantic urban scene understanding,' arXiv preprint arXiv:1604.01685, 2016.
Masko, D. and Hensman, P. 'The impact of imbalanced training data for convolutional neural networks,' 2015.
He, K., Zhang, X., Ren, S., and Sun, J. 'Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,' ICCV, pp. 1026-1034, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. 'Deep residual learning for image recognition,' arXiv preprint arXiv:1512.03385, 2015.
Geronimo, D., Lopez, A. M., Sappa, A. D., and Graf, T. 'Survey of pedestrian detection for advanced driver assistance systems,' IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(7), pp. 1239-1258, 2010.