With the help of powerful, well-designed deep neural networks, great progress has been made in the field of object detection. For self-driving, it is essential to provide real-time and accurate information about object locations and categories. At present, there is a trade-off between detection accuracy and speed. For example, region-proposal-based methods (e.g., Faster RCNN) obtain high recall and better accuracy, but generating proposals is time-consuming. Meanwhile, regression-based methods (e.g., SSD) can run in real time, but often with lower accuracy.
There is a trend toward designing more time-efficient detection models with little loss of accuracy. One typical branch driven by this idea is neural network compression and acceleration. One excellent work is SqueezeNet, which achieved accuracy comparable to AlexNet (a widely used image classification model) with 50x fewer parameters. This model used several design strategies for ConvNets (e.g., replacing 3×3 filters with 1×1 filters, decreasing the number of input channels to the 3×3 filters, and downsampling late in the network so that convolution layers have large activation maps) together with a novel fire module to build a more powerful network. Using SqueezeNet as the front ConvNet, Wu et al. proposed a fully convolutional neural network for real-time object detection without losing accuracy, named SqueezeDet. This method achieved state-of-the-art performance on KITTI, a popular object detection dataset for autonomous driving.
Although promising results have been reached in real-time object detection, how to detect objects of diverse sizes, especially small objects, is still an open question. Current methods for handling this problem can be roughly categorized into two branches. One typical branch designs multi-scale neural networks that extract features at different levels to handle objects of diverse sizes. Another popular kind of method uses deconvolution to enlarge deep feature maps. Those methods alleviate the problem of object size variation, but still cannot detect small objects well. In this article, we propose a neural network inspired by cognitive mechanisms, which shows better performance on small object detection.
The human brain consists of multiple modular subsystems that interact with one another in a unique way. Attention is a vital function of cognition. Meanwhile, humans use different kinds of knowledge in a complicated manner to perform tricky tasks. Self-driving systems urgently need to detect small objects well in order to make better decisions. Inspired by these cognitive mechanisms, we propose a knowledge-based recurrent attentive neural network, which combines intuitive prior knowledge with domain knowledge and applies an attention mechanism.
Traffic sign detection (TSD) is one tricky task that has all the characteristics mentioned above. On the one hand, an intelligent driving system always needs to be prepared when approaching traffic signs, because traffic signs provide useful information that makes driving convenient and safe, and the system makes its decisions partly based on them. Moreover, the system needs time to make the right decision after receiving the detection result. Therefore, useful traffic signs usually appear very small in the on-board camera images, as the car is still far away from them. However, as mentioned above, small object detection is still an open question. Besides, real-world conditions are very complicated: poor weather such as rain, fog and snow greatly affects the detection accuracy of traffic signs. We use a Recurrent Attentive Neural Network (RANN) to improve the detection accuracy and attention fixation in an iterative manner. On the other hand, driving conditions have many similarities to human cognition. Drivers' gazes are biased toward the center, and this center is usually the drivable area. In this article, we assume that traffic signs are always located at an offset from the chosen drivable area. Experiments show that this assumption increases the detection accuracy by about 5 points.
Many researchers have focused on the detection and recognition of traffic signs in natural scenes. Approaches to this problem can be broadly classified into two groups: region-based methods and connected-component-based methods. The former use local features, including texture and color, to locate text regions. The latter segment text characters individually by applying information such as color contrast and intensity variation [10, 11]. Detecting a small road sign accurately in crowded scenes or bad lighting conditions is of vital importance. However, state-of-the-art methods [12, 10] on several popular datasets [13, 14] do not handle small road sign recognition well in crowded scenes with many other attractors such as pedestrians and other vehicles.
To solve this problem, we propose a novel knowledge-based recurrent attentive neural network (KB-RANN). Overall, our contributions can be summarized in four aspects:
a. To the best of our knowledge, this is the first work to apply the attention mechanism of cognitive science to the problem of traffic sign detection.
b. To detect small traffic signs and signs under varying weather, which are common cases in real-world object detection, we propose a novel recurrent attentive neural network by modifying the typical Long Short-Term Memory (LSTM). This method processes the features of traffic signs in an iterative manner.
c. We use a reverse Gaussian prior distribution to regularize the TSD problem. This method combines domain knowledge (drivers' attention is biased toward the center, which is often the drivable area, while traffic signs lie at an offset from it) and intuitive knowledge (human attention follows a Gaussian distribution, so the attention for traffic signs can be modeled as a reverse Gaussian distribution).
d. We successfully ported our algorithm to our self-designed embedded system. Further, we deployed our method on the Pioneer I self-driving car and tested it in real traffic scenes, which showed that KB-RANN detects traffic signs robustly and at considerable speed.
II Related Work
In this article, we use domain and intuitive knowledge together with an attention mechanism to help detect traffic signs. In this section, we introduce related work from three aspects: TSD, recurrent neural networks with attention mechanisms, and knowledge-based deep learning systems. The first part introduces several mainstream methods for traffic sign detection. The second part reviews attention-based neural networks. The third part analyzes the current state of applying domain and intuitive knowledge with neural networks to solve tricky problems.
II-A Traffic Signs Detection
Traffic sign detection is an important task for ADAS and self-driving systems, and great progress has been made in this field. The mainstream approaches extract the shape and color features of traffic signs, using different color-based and shape-based techniques to minimize the effect of the environment on the test images. A few decades ago, many approaches using shape and color information were proposed for region-of-interest detection (e.g., region growing, YCbCr color space transform and color indexing). As color information can be unreliable under illumination and weather changes, shape-based algorithms were introduced. The prevailing shape-based approaches are similarity detection, edges with Haar-like features and distance transform matching.
Deep learning has been widely used in many fields and has greatly improved performance in those areas in recent years. Compared to color- and shape-based methods, deep neural networks can automatically and robustly learn hierarchical features at every level from labeled data, tuned to the task at hand. Methods based on local features have achieved promising performance on several traffic sign detection and classification datasets, such as the German traffic sign detection and classification benchmark. However, those methods are not very robust in real-world conditions, as traffic conditions vary. To evaluate our proposed KB-RANN, we tested it on the Belgium Traffic Signs Detection (BTSD) dataset and in real traffic scenes, where it achieved impressive performance compared to several state-of-the-art detection methods.
II-B Recurrent Neural Network with Attention Mechanism
Recurrent Neural Networks (RNNs) have been widely used for sequence problems. Humans pay special attention to the words that really matter to them, or fixate on a particular location when looking at a picture. RNNs with an attention mechanism can achieve the same behavior, focusing on part of the given information.
There are several branches of ideas for applying attention with RNNs to problems like those mentioned above. One branch is content-based attention, in which the attention is learned automatically by extending traditional RNNs. The attentive RNN produces a query describing what it wants to focus on. The dot product of each item with the query gives a score describing how well the item matches the query. The scores are fed into a softmax to obtain the attention distribution. Bahdanau et al. proposed an attention-based RNN for machine translation in which one RNN processes the input and passes along information about every word it has seen, and the RNN producing the output focuses on words as they become relevant. Chan et al. used one RNN to process the audio and another RNN to skim over it, focusing on relevant parts as it produced a transcript for speech recognition.
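The query, score and softmax steps described above can be illustrated with a minimal sketch in plain Python. The function names and the toy vectors are assumptions chosen for illustration; they are not taken from any of the cited systems, which learn the query and item representations.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def content_attention(query, items):
    """Score each item against the query, normalise the scores with a
    softmax, and return (attention weights, weighted sum of the items)."""
    scores = [dot(query, item) for item in items]
    weights = softmax(scores)
    dim = len(items[0])
    context = [sum(w * item[d] for w, item in zip(weights, items))
               for d in range(dim)]
    return weights, context


# Toy example (hypothetical values): the first item matches the query best.
query = [1.0, 0.0]
items = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
weights, context = content_attention(query, items)
print(weights)
```

The item with the largest dot product receives the largest share of the attention distribution, and the weights always sum to one.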
Another branch of attentive RNNs builds better-designed RNN architectures with attention mechanisms. In the field of image captioning, Xu et al. first extracted multi-level features of the input image with a ConvNet, and then produced a description of the image with an RNN. As it generated each word in the description, the RNN focused on the ConvNet's interpretation of the relevant parts of the image. Kuen et al. utilized recurrent network units to attend iteratively to selected image sub-regions and refine saliency progressively. Cornia et al. proposed a well-designed Attentive LSTM architecture to refine the feature maps extracted from ConvNets. We modified the Attentive LSTM into a Recurrent Attentive Neural Network, which increases the mean Average Precision (mAP) by up to 3 points.
II-C Knowledge-based Deep Learning Systems
Humans can use different kinds of knowledge (e.g., internal knowledge from personal experience, environmental knowledge from interacting with surrounding objects, global knowledge extracted from the universe) to learn about and perceive the world. Inspired by this fact, knowledge-based deep learning systems have received great interest in recent years. Those methods can be briefly categorized into two branches. One uses intuitive prior knowledge. Prior work trained a convolutional neural network to detect and track objects without any labeled examples. Cornia et al. used the prior knowledge that human gaze is biased toward the center of the scene to improve the accuracy of eye fixation prediction. The other branch uses domain-specific knowledge. Stewart and Ermon proposed domain constraints to detect objects without using labeled data. In this article, we combine domain knowledge and intuitive knowledge. We assume that the gaze rests on the drivable area in a self-driving scene, and that traffic signs are always located at an offset from the drivable area. We propose a novel reverse Gaussian prior distribution to regularize this problem. Using this fused knowledge improved accuracy by 5 percentage points compared to the method without it.
III Model Architecture
In this section, we present the architecture of our complete model, called KB-RANN.
III-A Real-time Accurate SqueezeNet
Many pre-trained CNN models (e.g., VGGNet and ResNet) are prevalent in object detection and achieve state-of-the-art performance. Although those models improve detection accuracy, they do so at the expense of inference time. A faster, well-designed pre-trained model is essential for real-time object detection. SqueezeNet is a typical pre-trained model that reaches AlexNet-level accuracy with fewer parameters. We adopted fully convolutional layers, as they are strong enough to classify and localize objects at the same time. Besides, we added two extra fire modules to increase the accuracy of our network.
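To make the parameter saving of a fire module concrete, the following sketch counts the weights of one fire module (a 1×1 squeeze layer followed by parallel 1×1 and 3×3 expand layers, as in SqueezeNet) and compares them with a plain 3×3 convolution producing the same number of output channels. The channel sizes are illustrative assumptions, not the configuration used in this paper, and biases are ignored.

```python
def conv_params(c_in, c_out, kernel):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return c_in * c_out * kernel * kernel


def fire_params(c_in, squeeze, expand1x1, expand3x3):
    """Weights in a fire module: a 1x1 squeeze layer feeding two parallel
    expand layers (1x1 and 3x3) whose outputs are concatenated."""
    squeeze_w = conv_params(c_in, squeeze, 1)
    expand1_w = conv_params(squeeze, expand1x1, 1)
    expand3_w = conv_params(squeeze, expand3x3, 3)
    return squeeze_w + expand1_w + expand3_w


# Illustrative channel sizes (assumed for this example only).
c_in = 96
fire = fire_params(c_in, squeeze=16, expand1x1=64, expand3x3=64)
plain = conv_params(c_in, 64 + 64, 3)

print(fire, plain, round(plain / fire, 2))
```

With these sizes the fire module needs roughly nine times fewer weights than the plain 3×3 layer, because the 3×3 filters only see the 16 squeezed channels instead of all 96 inputs.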
III-B Recurrent Attentive Neural Network
The attention mechanism is a vital element of cognitive science. We introduce a well-designed recurrent attentive neural network to help detect traffic signs.
LSTM has been widely used for time-dependent tasks. Cornia et al. utilize the sequential nature of LSTM to process features in an iterative manner, rather than using the model to handle temporal dependencies among the inputs. Inspired by this idea, we propose a novel recurrent attentive neural network. This network is composed of several attentive neural networks (ANNs). The update rules obey the following equation (1):
, are two typical gates in normal LSTM. is the memory gate and , , are three internal gates in our ANN architecture.
Our proposed RANN architecture is computed through an attentive mechanism that selectively focuses on different regions of the image. The update rule of the attentive gate obeys the following equation (2):
The proposed RANN (a cascade of attentive neural networks applied in an iterative manner) can improve the detection accuracy in a fine-grained way.
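The idea of refining features iteratively with a spatial attention map can be sketched as follows. This is only a toy illustration under strong assumptions: it is not the paper's update rule, the learned convolutions of an attentive LSTM are replaced by a hypothetical stand-in that uses the current feature values directly as attention scores, and the feature map has a single channel.

```python
import math


def spatial_softmax(score_map):
    """Softmax over all spatial positions of a 2D score map (list of rows)."""
    peak = max(s for row in score_map for s in row)
    exps = [[math.exp(s - peak) for s in row] for row in score_map]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def refine(features, steps=3):
    """Re-weight a single-channel feature map for `steps` iterations.

    Stand-in scoring (assumption): the current refined features serve as
    attention scores. A real attentive LSTM would compute the scores with
    learned convolutions over the input and the previous hidden state.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    height, width = len(features), len(features[0])
    current = [row[:] for row in features]
    attention = None
    for _ in range(steps):
        attention = spatial_softmax(current)
        # Scale by height * width so a uniform attention map is the identity.
        current = [[features[i][j] * attention[i][j] * height * width
                    for j in range(width)] for i in range(height)]
    return current, attention


# Toy feature map (hypothetical): one strong response at the bottom right.
features = [[0.1, 0.2], [0.1, 2.0]]
_, attention_one_step = refine(features, steps=1)
_, attention_three_steps = refine(features, steps=3)
print(attention_one_step[1][1], attention_three_steps[1][1])
```

Each pass sharpens the attention map around the strongest response, which is the fine-grained, step-by-step behavior that an iterative attentive cascade aims for.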
III-C Fusion of Domain Knowledge and Intuitive Knowledge
Humans can combine different kinds of knowledge in a complicated manner to solve very difficult problems, and domain knowledge is an essential one. In the field of self-driving, people's gazes are biased toward the center, and this center is usually the drivable area. In this article, we assume that traffic signs are always located at an offset from the chosen drivable area. In Fig. 2, the left image is the original image, while the right image illustrates the reverse Gaussian prior and domain knowledge. In this figure, the orange circle at the central location is the major attention area (i.e., the gaze on the drivable area in self-driving), and the black circle outside the drivable area is the focus of our method, which is aimed at detecting small traffic signs. The area near the top right corner contains the cluster of traffic signs.
Prior knowledge has been used for traffic sign recognition, but almost all such work focuses on extracting the color and shape features of traffic signs. To the best of our knowledge, domain knowledge has not been used for traffic sign detection. We apply the proposed reverse Gaussian method to this task. Further, to reduce the number of parameters and facilitate learning, we constrain each prior to be a 2D Gaussian function whose mean and covariance matrix are freely learnable. This enables the network to learn its own priors purely from data, without relying on assumptions from biological studies.
Our proposed model can learn the parameters for each prior map through the following equation (3):
The reverse Gaussian distribution is then computed by the following equation (4):
We combine the N reverse Gaussian feature maps with the W feature maps extracted by the ConvNet. We set N to 16 and W to 512, so after concatenation we obtain mixed feature maps with 528 channels. The injection of domain and intuitive knowledge proves to be effective compared to several typical models, as the result of KB-CNN in Table II shows.
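As a rough illustration of the prior maps and the channel arithmetic, the sketch below builds one axis-aligned 2D Gaussian map on normalised image coordinates and inverts it. Both the peak normalisation and the inversion as one minus the Gaussian are assumptions made for this example; the exact parameterisation of the learned priors and of the reverse Gaussian is defined by the paper's own equations, and all function names here are hypothetical.

```python
import math


def gaussian_map(height, width, mu_x, mu_y, sigma_x, sigma_y):
    """Axis-aligned 2D Gaussian on normalised [0, 1] coordinates,
    scaled so that its peak value is 1 (assumption for this sketch)."""
    rows = []
    for i in range(height):
        y = i / (height - 1)
        row = []
        for j in range(width):
            x = j / (width - 1)
            exponent = ((x - mu_x) ** 2) / (2 * sigma_x ** 2) \
                + ((y - mu_y) ** 2) / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        rows.append(row)
    return rows


def reverse_map(prior):
    """Assumed 'reverse Gaussian': low at the centre of gaze, high elsewhere."""
    return [[1.0 - value for value in row] for row in prior]


# One prior centred on the image (a stand-in for the drivable area).
prior = gaussian_map(5, 5, mu_x=0.5, mu_y=0.5, sigma_x=0.2, sigma_y=0.2)
reverse = reverse_map(prior)

# Channel count after concatenating N prior maps with W ConvNet feature maps.
N_PRIORS, N_FEATURES = 16, 512
total_channels = N_PRIORS + N_FEATURES
print(reverse[2][2], reverse[0][0], total_channels)
```

The reversed map is close to zero at the assumed center of gaze and close to one near the image corners, which matches the intuition that traffic signs sit away from the drivable area.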
IV-A Training Protocol
Redmon et al. showed that a one-step training strategy, which trains the localization loss and the classification loss together, can speed up networks without losing too much accuracy. Following this idea, we define a multi-task loss function in the following form (5):
The loss function contains three parts, which are bounding box regression, confidence score regression and cross-entropy loss for classification respectively. means ( represents x, y, , w, and h respectively).
The first part is the loss of the bounding box regression mentioned above. () represents the relative coordinates of anchor-k at grid center (i,j). Meanwhile, , or () is the ground truth bounding box. It can be computed by the following equation (6):
The second part is the regression loss of confidence score. The output of the last feature map is that represent the predicted confidence score for anchor-k corresponding to position (i, j). is the IoU of the ground truth and predicted bounding box. Besides, we penalize the confidence scores with for those anchors are irrelevant to the task of detection. Meanwhile, , and are used to adapt the weights.
The last part of is the cross-entropy loss of classification. The ground truth label is , and it is a binary parameter. is the classification distribution that is predicted by the neural network. We used softmax regression to normalize the score so as to make sure that is ranged between [0,1].
All parts of this one-stage loss function are optimized jointly by backpropagation.
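The three-part structure described above (bounding box regression, confidence score regression against the IoU, and cross-entropy classification) can be sketched in plain Python. This is a simplified illustration modeled on SqueezeDet-style losses, not the paper's exact formula: the dictionary keys, the weight names `w_bbox`, `w_conf_pos`, `w_conf_neg` and their default values of 1.0 are all assumptions.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def multi_task_loss(anchors, w_bbox=1.0, w_conf_pos=1.0, w_conf_neg=1.0):
    """Sum of box regression, confidence regression and classification terms.

    Anchors responsible for a ground-truth box carry 'delta', 'delta_gt',
    'conf', 'iou_gt', 'probs' and 'label'; all other anchors only need 'conf'.
    """
    pos = [a for a in anchors if a["responsible"]]
    neg = [a for a in anchors if not a["responsible"]]
    n_pos = max(len(pos), 1)
    n_neg = max(len(neg), 1)

    bbox = sum((p - g) ** 2
               for a in pos for p, g in zip(a["delta"], a["delta_gt"]))
    conf_pos = sum((a["conf"] - a["iou_gt"]) ** 2 for a in pos)
    conf_neg = sum(a["conf"] ** 2 for a in neg)  # push irrelevant anchors to 0
    cls = -sum(math.log(max(a["probs"][a["label"]], 1e-12)) for a in pos)

    return (w_bbox * bbox / n_pos + w_conf_pos * conf_pos / n_pos
            + w_conf_neg * conf_neg / n_neg + cls / n_pos)


# Hypothetical example with one responsible and one irrelevant anchor.
gt_box = (10.0, 10.0, 30.0, 30.0)
pred_box = (12.0, 12.0, 30.0, 30.0)
anchors = [
    {"responsible": True, "delta": [0.1, 0.1, 0.0, 0.0],
     "delta_gt": [0.0, 0.0, 0.0, 0.0], "conf": 0.7,
     "iou_gt": iou(pred_box, gt_box), "probs": [0.8, 0.2], "label": 0},
    {"responsible": False, "conf": 0.1},
]
loss = multi_task_loss(anchors)
print(round(loss, 4))
```

Because every term is a differentiable function of the network outputs, a framework with automatic differentiation can minimize the sum in a single training stage.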
IV-B Datasets and Baselines
There are several mainstream datasets in the field of traffic sign detection and recognition. The concepts of recognition and detection were often conflated several years ago. Recognition tasks aim at classifying objects, while detection tasks need to both locate objects and classify them correctly. German Traffic Signs Recognition is a widely used traffic sign recognition dataset, and traditional classification methods already work well on it. Moreover, there are only 600 training images, which is insufficient to train a deep neural network well. We therefore focus on another popular traffic sign detection dataset, Belgium Traffic Signs Detection (BTSD). The details and testing results of those two datasets will be introduced next.
We compared our method with several popular open-source object detection methods (e.g., Faster RCNN and SqueezeDet). The source code comes directly from the original papers. Our only changes were to reset the mean RGB color to match the BTSD dataset and to reset the anchor sizes.
IV-C Experiments on BTSD
BTSD is one of the most widely used traffic sign detection datasets. All traffic signs are categorized into 13 classes. The dataset consists of 5905 training images and 3101 testing images. We compared several state-of-the-art object detection methods on this dataset in Table II, using recall and mAP at different IoU thresholds as evaluation metrics. As Table II shows, our proposed KB-RANN method achieved better results at each IoU threshold. Further, RANN and KB-CNN both outperformed the two widely used object detection methods.
| Method | mAP (IoU = 0.3) | mAP (IoU = 0.5) |
Table II compares the mAP of those methods. From this table, we can see that our proposed KB-RANN achieves the highest mAP among several popular object detection methods. Knowledge extracted from domain and intuition, together with the recurrent attentive neural network, improves the accuracy of traffic sign detection on the BTSD benchmark.
We also selected several images showing the results of our proposed KB-RANN in Fig. 3. From those images, we can see that our method performs better than Faster RCNN and SqueezeDet on small traffic sign detection.
IV-D Experiment Setup
Each method in Table II was trained for 100K iterations with a batch size of 32. The original image size in the BTSD dataset is 1626×1236. Feeding images of that size into the deep neural network exceeds the capacity of a single NVIDIA GTX 1080 Ti. To solve this problem, we resize the images to 542×412. This operation makes the traffic signs smaller, which increases the difficulty of traffic sign detection.
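A short calculation shows how much the resizing shrinks a sign. The image sizes come from the text; the 30×30 pixel sign is a hypothetical example, and the helper names are assumptions.

```python
ORIGINAL_SIZE = (1626, 1236)  # width, height of the original BTSD images
RESIZED_SIZE = (542, 412)     # width, height fed to the network


def scale_factors(original, resized):
    """Per-axis downscaling factors between two (width, height) sizes."""
    return original[0] / resized[0], original[1] / resized[1]


def resized_box(width, height, original=ORIGINAL_SIZE, resized=RESIZED_SIZE):
    """Size in pixels of a width x height box after the image is resized."""
    scale_x, scale_y = scale_factors(original, resized)
    return width / scale_x, height / scale_y


scale_x, scale_y = scale_factors(ORIGINAL_SIZE, RESIZED_SIZE)
sign_w, sign_h = resized_box(30, 30)  # hypothetical 30 x 30 pixel sign
print(scale_x, scale_y, sign_w, sign_h)
```

Both axes are reduced by a factor of three, so a sign keeps only one ninth of its original pixel area.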
V Implementation on Embedded System
For a promising and valuable autonomous driving method, two essentials truly matter: one is a working implementation on a low-power platform, and the other is a frame rate high enough to meet the requirements of real-time processing. Thus, we ran the algorithm on our self-designed embedded system, which is intended for evaluating and optimizing the algorithm for industrial applications such as automobiles. We chose the NVIDIA Jetson TX2 module as the core of our embedded system, as it balances power efficiency and computing power.
Further, we implemented the methods mentioned above (including Faster RCNN, SqueezeNet, and a series of our proposed methods), applying hardware acceleration and software optimization. The results are shown in Table II. Our method runs on our self-designed embedded system at 10 FPS, which is fast enough for traffic sign detection. However, we are pursuing real-time performance by accelerating our method with TensorRT, which is effective in shortening the inference time of deep neural networks, since real-time detection can also be used for road-obstacle detection such as pedestrian detection.
The proposed algorithm is implemented on the self-driving platform shown in Fig. 4. The TX2 module interacts with our self-driving platform. First, the system receives the raw video stream from the on-board camera. Then, after image pre-processing, our proposed KB-RANN algorithm produces the detection results. Next, the system fuses these results with the motion data obtained by the self-driving platform. Finally, effective detection results are generated and used in decision making.
| Method | Image Size | Frame Rate |
| Faster RCNN | 542×412 | 0.84 FPS |
VI Conclusions and Future Work
In this paper, we address small object detection, especially traffic sign detection. Inspired by the cognitive mechanisms of the human brain, we proposed a novel knowledge-based recurrent attentive neural network. This method achieved better performance than several popular object detection methods. Besides, we showed that knowledge extracted from domain and intuition is effective, and that the recurrent attention mechanism helps detect small objects in a fine-grained manner.
Future work may apply the attention mechanism to road sign detection in video, where rich context information is available. Moreover, how to combine traffic rules with intuitive knowledge to build a more dynamic small object detection method for real-world conditions is an important question.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size,” arXiv preprint arXiv:1602.07360, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
-  B. Wu, F. Iandola, P. H. Jin, and K. Keutzer, “Squeezedet: Unified, small, low power fully convolutional neural networks for real-time object detection for autonomous driving,” arXiv preprint arXiv:1612.01051, 2016.
-  Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in European Conference on Computer Vision. Springer, 2016, pp. 354–370.
-  Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Subcategory-aware convolutional neural networks for object proposals and detection,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 924–933.
-  J. L. Starck, A. Bijaoui, I. Valtchanov, and F. Murtagh, “A combined approach for object detection and deconvolution,” Astronomy and Astrophysics Supplement, vol. 147, no. 1, pp. 139–149, 2000.
-  C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
-  J. Greenhalgh and M. Mirmehdi, “Recognizing text-based traffic signs,” IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 3, pp. 1360–1369, 2015.
-  C. Bahlmann, Y. Zhu, V. Ramesh, M. Pellkofer, and T. Koehler, “A system for traffic sign detection, tracking, and recognition using color, shape, and motion information,” in Intelligent Vehicles Symposium, 2005. Proceedings. IEEE. IEEE, 2005, pp. 255–260.
-  A. K. K Sumi, “Detection and recognition of road traffic signs - a survey,” International Journal of Computer Applications, 2017.
-  J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “The german traffic sign recognition benchmark: a multi-class classification competition,” in Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011, pp. 1453–1460.
-  R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-view traffic sign detection, recognition, and 3d localisation,” Machine vision and applications, vol. 25, no. 3, pp. 633–647, 2014.
-  L. Priese and V. Rehrmann, “On hierarchical color segmentation and …,” in Computer Vision and Pattern Recognition, 1993. Proceedings CVPR ’93., 1993 IEEE Computer Society Conference on, 1993, pp. 633–634.
-  A. Hechri and A. Mtibaa, “Automatic detection and recognition of road sign for driver assistance system,” in Electrotechnical Conference, 2012, pp. 888–891.
-  M. J. Swain and D. H. Ballard, “Color indexing,” International journal of computer vision, vol. 7, no. 1, pp. 11–32, 1991.
-  S. Vitabile, G. Pollaccia, G. Pilato, and F. Sorbello, “Road signs recognition using a dynamic pixel aggregation technique in the hsv color space,” in International Conference on Image Analysis and Processing, 2001, p. 572.
-  B. Höferlin and K. Zimmermann, “Towards reliable traffic sign recognition,” Intelligent Vehicles Symposium IEEE, vol. 5, no. 3, pp. 324 – 329, 2009.
-  D. Gavrila, “Traffic sign recognition revisited,” in Mustererkennung 1999, 21. DAGM-Symposium, 1999, pp. 86–93.
-  J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “The german traffic sign recognition benchmark: a multi-class classification competition,” in Neural Networks (IJCNN), The 2011 International Joint Conference on. IEEE, 2011, pp. 1453–1460.
-  D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
-  W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. 4960–4964.
-  K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image caption generation with visual attention,” in International Conference on Machine Learning, 2015, pp. 2048–2057.
-  J. Kuen, Z. Wang, and G. Wang, “Recurrent attentional networks for saliency detection,” arXiv preprint arXiv:1604.03227, 2016.
-  M. Cornia, L. Baraldi, G. Serra, and R. Cucchiara, “Predicting human eye fixations via an lstm-based saliency attentive model,” arXiv preprint arXiv:1611.09571, 2016.
-  R. Stewart and S. Ermon, “Label-free supervision of neural networks with physics and domain knowledge.” in AAAI, 2017, pp. 2576–2582.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
-  J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell, “Long-term recurrent convolutional networks for visual recognition and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 2625–2634.
-  Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel, “Ask me anything: Free-form visual question answering based on knowledge from external sources,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4622–4630.
-  X. W. Gao, L. Podladchikova, D. Shaposhnikov, K. Hong, and N. Shevtsova, “Recognition of traffic signs based on their colour and shape features extracted using human vision models,” Journal of Visual Communication and Image Representation, vol. 17, no. 4, pp. 675–685, 2006.
-  M. Rincón, S. Lafuente-Arroyo, and S. Maldonado-Bascón, Knowledge Modeling for the Traffic Sign Recognition Task, 2005.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.