In urban traffic systems, there are various participants, such as vehicles, pedestrians and bicycles. Bicycles are one of the major causes of accidents in China. In the last few years, much research has addressed the protection of vulnerable road users (VRUs), such as pedestrians, bicycles and other small vehicles, which is a natural trend towards improving overall driving safety. It is therefore a critical task to detect bicycles in urban intelligent traffic surveillance systems in order to reduce such accidents.
Existing bicycle detection methods can be classified into two types. One type uses additional external sensors, such as laser sensors and infrared sensors. The other type detects bicycles through image processing, which is the main focus of the present paper. One approach takes a salient feature of bicycles, their two round wheels, and detects a bicycle by finding two ellipses (the two wheels) in the image after the Hough transform. Another approach detects the helmet of the bicycle rider instead of the wheels. The detection precision of both approaches, however, strongly depends on the video quality, and may be poor for blurry videos where it is quite difficult to detect a bicycle’s two wheels or a rider’s helmet. A further method detects and tracks bicycle riders based on Histograms of Oriented Gradients (HOG). Its applicability is limited because a HOG feature requires sufficiently large objects to be accurate and effective, while the videos in real traffic surveillance systems may not satisfy this requirement. Moreover, it is time-consuming to extract HOG features due to their high complexity.
Other methods, such as detection with the MSC-HOG feature or detecting the tires of bicycles in videos, can also achieve good results, but they are either time-consuming or require high-quality videos. Some newer methods, such as the one based on HOG features with ROI, rely on more advanced hardware such as GPUs to handle the large amount of computation.
In summary, there are three major defects in the available bicycle detection methods based on image processing. First, they require fine features for detection, which are hard to extract, particularly from low-resolution traffic videos. Second, their processing time is usually long and may not meet the requirement of real-time detection. Last, they make the bicycle detection decision from the information in a single frame, which may lead to misjudgment, especially under strong noise or lighting changes.
Traffic videos, limited by their capture devices and environmental conditions, have some prominent characteristics, such as low resolution, complex backgrounds, various weather conditions and a variety of lighting levels. Because of the low resolution, it is difficult to extract moving objects from the image precisely. Consequently, effective features for object classification are hard to obtain.
In order to resolve the above issues and to adapt to the application scenario of low-resolution traffic videos, we propose a bicycle detection method based on multi-feature and multi-frame fusion. First, we extract geometric features and velocity features and then fuse them using a support vector machine (SVM) or a cascade classifier. As the objects are usually small in real video surveillance systems, we extract sparse geometric features, rather than dense features, to detect bicycles. To enhance the precision of this feature descriptor, feature fusion is proposed at both the frame level and the image sequence level. The multiple geometric features are concatenated into a feature vector, and the SVM or the cascade classifier produces a preliminary detection decision for the current single frame. Moreover, we fuse the preliminary detection results from multiple frames by the majority rule, which provides not only a more reliable detection result, but also the confidence level of that result. Second, without the burden of computing dense features, this detection method works well with relatively low computational complexity, and can thus meet the real-time detection requirement. Third, as mentioned before, the multi-frame fusion avoids false classification caused by noise. Experimental results confirm the efficiency of our algorithm in different scenes.
In this paper, we assume that training data and future data have the same feature space. However, Shao et al. mentioned that this assumption may not be guaranteed due to the limited availability of human-labeled training data. In that case, methods based on transfer learning should be considered. As the traffic scenarios we use to test our method are different scenes captured by fixed cameras, the bicycle features we select do not change severely, so the effect of feature space shifting is not significant. For further research or other application scenarios, such as videos captured by pan/tilt cameras, feature space shifting should be taken into consideration.
The rest of this paper is organized as follows. In Section II, an overview of our method is given. Then the bicycle detection method based on multi-feature fusion is described in detail in Section III. The multi-frame fusion is explained in Section IV. The framework of this bicycle detection method is provided in Section V. Experimental results are presented in Section VI, where three different scenes are used to verify our algorithm. Concluding remarks are given in Section VII.
II Overview of our bicycle detection algorithm
As mentioned in Section I, current research has three drawbacks: improper features for bicycle detection in traffic videos, insufficient use of the extracted features and high computational complexity. To overcome these shortcomings, we adopt a method based on multi-feature fusion and multi-frame fusion.
First, a group of simple but effective features, namely geometric features and velocity features, is selected. These features are easy to compute and describe the global characteristics of bicycles. Unlike dense features, they are insensitive to the quality of videos, so they can be used in the scenario of low-resolution traffic videos. Second, to get a precise result, the method fuses features and judgments in two stages. In the first stage, it fuses the obtained features through a support vector machine (SVM) or a cascade classifier to get a preliminary detection decision at the single-frame level. In the second stage, it fuses the preliminary detection results from multiple frames by the majority rule to obtain a more reliable detection result at the image sequence level. Third, due to the low computational complexity of feature extraction and object classification, the method achieves high computational efficiency and can meet the real-time requirement.
III Multi-feature fusion
There are two key factors in multi-feature fusion: the selection of appropriate features and the fusion method. Selecting a group of reliable and salient features is the foundation of multi-feature fusion, while the fusion method determines the speed and accuracy of bicycle detection.
In this section, we present the multi-feature fusion method, covering both feature selection and the fusion method. First, to avoid the errors caused by dense features, to decrease the impact of image quality and to reduce the computational complexity, geometric features and velocity features are used. Second, considering both detection effectiveness and efficiency, two fusion methods are given: the SVM and the cascade classifier.
III-A Selection of features
A wide variety of features for object detection exists in the literature. A typical example is the Histograms of Oriented Gradients (HOG) feature of Dalal and Triggs, with which excellent pedestrian detection is obtained. The HOG feature is a dense feature: it needs detailed information about objects and requires objects to be large enough, which may not be satisfied by low-resolution traffic videos. Moreover, the processing of dense features is usually time-consuming and cannot be done in real time.
Geometric features mainly refer to the salient features of objects and are less sensitive to video quality. Although a single geometric feature may not supply enough information for object detection, we can fuse multiple geometric features to yield good detection results. Moreover, geometric features are fast to process and suit real-time applications. We choose the following geometric features. Note that we bound each object with a rectangular box, which is referred to as the object region.
The number of foreground pixels in an object region.
The width, length and aspect ratio of an object region.
The foreground duty cycle of an object region. The foreground duty cycle of the object region stands for the ratio of the number of foreground pixels over the total number of pixels in an object region.
We can further divide the object region into two equal halves and calculate the foreground duty cycle of the upper and lower sub-regions respectively, denoted as $D_u$ and $D_l$. For vehicles and pedestrians, $D_u$ and $D_l$ are close to each other, while for bicycles one of them is much smaller than the other.
The speed of an object. Suppose the object has been recognized in $N$ frames. The speed of the object is estimated as
$$v = \frac{1}{N-1}\sum_{i=1}^{N-1} d_i, \tag{1}$$
where $d_i$ is the displacement of the centers of the object’s bounding rectangles between two consecutive frames. This averaging effectively reduces the effect of inaccurate foreground segmentation on the speed estimation.
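The feature list above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the mask representation (a list of 0/1 rows cropped to the object region), the choice of width over height for the aspect ratio, and the use of pixels per frame as the speed unit are all assumptions made for the example.

```python
from math import hypot


def geometric_features(mask):
    """Compute sparse geometric features of one object region.

    mask: list of rows, each a list of 0/1 values, cropped to the
    object's bounding box (1 = foreground pixel). Hypothetical layout.
    """
    height = len(mask)
    width = len(mask[0])
    foreground = sum(sum(row) for row in mask)

    def duty(rows):
        # Ratio of foreground pixels to all pixels in the given rows.
        total = sum(len(r) for r in rows)
        return sum(sum(r) for r in rows) / total if total else 0.0

    half = height // 2  # for odd heights the lower half gets the extra row
    return {
        "foreground_pixels": foreground,
        "width": width,
        "height": height,
        "aspect_ratio": width / height,  # assumed convention: width / height
        "duty_cycle": foreground / (width * height),
        "duty_upper": duty(mask[:half]),
        "duty_lower": duty(mask[half:]),
    }


def average_speed(centers):
    """Mean displacement between consecutive bounding-box centers."""
    if len(centers) < 2:
        return 0.0
    steps = [hypot(x2 - x1, y2 - y1)
             for (x1, y1), (x2, y2) in zip(centers, centers[1:])]
    return sum(steps) / len(steps)
```

Averaging over all available displacements, rather than using only the latest one, is what damps the jitter that an inaccurate segmentation introduces into individual box centers.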
III-B The multi-feature fusion method based on linear SVM
As mentioned in the last subsection, we extract multiple geometric features. A single feature cannot produce satisfactory bicycle detection results, so we fuse these features to detect bicycles more reliably. One multi-feature fusion method is the linear support vector machine (SVM), which is widely used in image processing due to its excellent classification performance. An SVM constructs the maximum-margin hyperplane to separate data sets as widely as possible.
Now we explain the procedure of our multi-feature fusion based on linear SVM. We first extract a large number of bicycle images from the videos to construct the positive sample set. Then we extract non-bicycle images from the videos (approximately twice as many as the bicycle images) to construct the negative sample set. The feature vector of each sample in the positive and negative sample sets is denoted as
$$\mathbf{x} = [x_1, x_2, \ldots, x_m]^T, \tag{2}$$
where $x_1$, $x_2$, …, $x_m$ represent the value of each feature, and $\mathbf{x}$ is a column feature vector. The positive samples are expected to be separated from the negative samples by the following hyperplane,
$$\mathbf{w} \cdot \mathbf{x} + b = 0, \tag{3}$$
where $\mathbf{x}$ is the feature vector, $\mathbf{w}$ is the normal vector of the hyperplane and is a row vector, and $\cdot$ represents the product of two vectors (a row vector multiplies a column vector). We train the SVM with the positive and negative samples and get $\mathbf{w}$ and $b$. The features can be fused as follows,
$$y = \mathbf{w} \cdot \mathbf{x} + b, \tag{4}$$
where $y$ is the fusion result. With $y$, we make the following final decision,
$$\text{the object is } \begin{cases} \text{a bicycle}, & y > 0, \\ \text{not a bicycle}, & \text{otherwise}. \end{cases} \tag{5}$$
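Once the hyperplane has been trained, the single-frame fusion reduces to one dot product and a sign test. The sketch below illustrates this under two assumptions: the decision threshold is zero (the usual SVM convention), and the weights and bias shown in the usage lines are made-up numbers, not trained values from the paper.

```python
def fuse_linear(w, x, b):
    """Fusion result y = w . x + b for one feature vector."""
    if len(w) != len(x):
        raise ValueError("weight and feature vectors differ in length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def is_bicycle(w, x, b):
    """Preliminary single-frame decision: bicycle when y is positive."""
    return fuse_linear(w, x, b) > 0


# Hypothetical weights and bias, for illustration only.
w_demo = [0.5, -1.0, 2.0]
b_demo = -0.25
print(is_bicycle(w_demo, [1.0, 0.5, 0.25], b_demo))  # y = 0.25
print(is_bicycle(w_demo, [0.0, 1.0, 0.0], b_demo))   # y = -1.25
```

In practice the features have very different ranges (pixel counts versus ratios), so scaling them before training is a common precaution; the source does not state whether this was done.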
III-C The multi-feature fusion method based on the cascade classifier
Fusing features with a cascade classifier can also achieve high accuracy with low complexity. It can be regarded as a series of single-feature classifiers: coarse but easily computed classifiers are placed in the front stages, and precise but more expensive classifiers are placed in the back stages. This series structure of simple classifiers is shown in Fig. 1.
In Fig. 1, each circle represents a classifier, which uses a single feature to determine whether an object is possibly a bicycle. When a classifier believes an object is NOT a bicycle, it yields the FALSE decision and the detection of that object is terminated. As mentioned before, a classifier is built upon a single feature and could make a mistake. Therefore, the weak classifiers at all levels are combined in a series to obtain a strong overall classifier. It is beneficial to place simple classifiers at the beginning levels of the cascade classifier because these classifiers will exclude some objects and save the computational time of the subsequent (complicated) classifiers. Cascade classifier has very fast computational speed and is applicable for real-time bicycle detection.
In our algorithm, we place the simple classifiers based on the shape information of objects, such as the width, length and aspect ratio, at the beginning stages of the cascade classifier. These simple classifiers can determine that some objects are not bicycles and exclude them. Finer detection, e.g., distinguishing a bicycle from a pedestrian, is achieved by more complicated classifiers, such as the one based on the speed of an object, at the later stages of the cascade classifier.
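The early-exit behaviour described above can be sketched as a loop over single-feature range tests. The stage order follows the text (shape first, speed last), but the feature names and every threshold below are hypothetical values chosen for the example, not the ones used in the experiments.

```python
def cascade_classify(features, stages):
    """Run single-feature stages in order and stop at the first rejection.

    features: dict mapping feature name to value.
    stages: list of (feature_name, lower_bound, upper_bound) tuples,
            cheapest feature first.
    """
    for name, low, high in stages:
        if not (low <= features[name] <= high):
            return False  # FALSE decision: later stages are never evaluated
    return True


# Hypothetical stages and thresholds, for illustration only.
demo_stages = [
    ("aspect_ratio", 0.3, 1.5),
    ("duty_cycle", 0.2, 0.7),
    ("speed", 2.0, 15.0),
]

bicycle_like = {"aspect_ratio": 0.8, "duty_cycle": 0.45, "speed": 6.0}
slow_object = {"aspect_ratio": 0.8, "duty_cycle": 0.45, "speed": 1.0}
wide_object = {"aspect_ratio": 2.5}  # rejected before other features are needed

print(cascade_classify(bicycle_like, demo_stages))
print(cascade_classify(slow_object, demo_stages))
print(cascade_classify(wide_object, demo_stages))
```

The third call shows the computational saving: the object is rejected by the first stage, so its duty cycle and speed never have to be computed.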
IV Multi-frame fusion
In Sections III-B and III-C, we fuse the features extracted from a single frame to obtain a preliminary detection result for an object. A single frame can be disturbed by noise and/or lighting changes, so the detection decision from that frame could be wrong. If we combine the detection results from multiple frames, the detection errors in a few frames matter much less, which is exactly what inspires our multi-frame fusion. Generally speaking, an object $O$ is finally determined to be a bicycle if the bicycle detection decision is made in most of the frames where $O$ is detected. Moreover, the number of frames in which $O$ is detected provides a way to quantify the reliability of the final decision, which is referred to as the confidence level. As we show later in this section, the confidence level can effectively balance the detection accuracy and the false alarm rate.
IV-A Fusing rule
As mentioned in Section III, a bicycle is not too different from a pedestrian, especially under the disturbance of noise and/or lighting changes. Although the multi-feature fusion in a single frame can reduce the resulting error probability, that reduction is not enough and we need the multi-frame fusion.
Our multi-frame fusion follows the majority rule. Suppose an object $O$ has been detected in $N$ frames, which may not be consecutive. Among them, $N_b$ frames make the bicycle decision regarding $O$. Then the multi-frame fusion makes the following decision,
$$O \text{ is } \begin{cases} \text{a bicycle}, & N_b > N/2, \\ \text{not a bicycle}, & \text{otherwise}. \end{cases} \tag{6}$$
When $N$ is large enough, the multi-frame fusion in eq. 6 is quite robust against noise. When $N$ is small, the reliability of the multi-frame fusion is weak. The reliability of a decision is quantitatively represented by the confidence level, which is described in detail in Section IV-B.
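The majority rule can be written as a short function over the per-frame preliminary decisions of one object. This is a sketch under one assumption: a tie (exactly half of the frames voting bicycle) is resolved as non-bicycle, which the text does not specify.

```python
def majority_fusion(decisions):
    """Fuse per-frame decisions of one object by the majority rule.

    decisions: iterable of booleans, True when a frame's preliminary
    result says "bicycle". Returns the fused decision.
    """
    decisions = list(decisions)
    n_frames = len(decisions)
    n_bicycle = sum(1 for d in decisions if d)
    # Assumption: ties count as "not a bicycle".
    return n_bicycle > n_frames / 2


# Two noisy frames out of seven do not flip the fused result.
print(majority_fusion([True, True, False, True, True, False, True]))
```

With only one or two frames available, a single wrong preliminary decision can still dominate the vote, which is why the number of frames is also used as a reliability measure.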
IV-B Life cycle and confidence level
We first introduce the concept of life cycle. The life cycle is a threshold on the number of frames, and is denoted as $L_c$. Suppose an object $O$ in the stored object set cannot match with any segmented object. Then the life of $O$ is increased by $1$. When the life of $O$ is larger than the given life cycle $L_c$, $O$ is removed from the stored object set because it is believed to have already left the detection region.
The motivation of the life cycle is to improve the robustness against disturbance. Due to noise or lighting changes, an object $O$ may not be correctly segmented in a frame and cannot find any match. Suppose $O$ is removed from the stored object set immediately after failing to match. After several frames, $O$ may be segmented correctly, but detected as a new object, so that the number of reported objects is much larger than the number of real objects, i.e., the so-called “duplicated detection”. Due to duplicated detection, an object is reported as several ones, which significantly reduces the efficiency of object detection and increases the burden on the object database. In order to resolve this issue, the removal of an object is delayed by $L_c$ frames, i.e., an object is removed from the stored object list if it cannot find any match in $L_c$ consecutive frames. In that case, duplicated detection can be efficiently attenuated.
In our experiments, $L_c$ is chosen so that $L_c$ frames span only a short time at our frame rate, during which a bicycle will not move too much and the delayed matching still makes sense. When $L_c$ is too large, the movement of an object can be large during $L_c$ frames and two different objects may be matched by mistake.
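The life-cycle bookkeeping can be sketched as a per-frame update of a counter for each stored object. The data layout (a dict from object id to its current life) and the function name are hypothetical, and resetting the life to zero after a successful match is an assumption drawn from the requirement that the unmatched frames be consecutive.

```python
def update_lives(stored, matched_ids, life_cycle):
    """Advance the life counters of all stored objects by one frame.

    stored: dict mapping object id -> life (consecutive unmatched frames).
    matched_ids: set of ids that found a match in the current frame.
    life_cycle: threshold on the life; objects above it are dropped.
    Returns a new dict holding only the surviving objects.
    """
    survivors = {}
    for obj_id, life in stored.items():
        # Assumption: a successful match resets the counter.
        life = 0 if obj_id in matched_ids else life + 1
        if life <= life_cycle:
            survivors[obj_id] = life
    return survivors


# Object "b" has now been unmatched for 3 frames and exceeds life_cycle = 2.
print(update_lives({"a": 0, "b": 2}, matched_ids={"a"}, life_cycle=2))
```

Keeping an unmatched object for a few frames lets a later, correct segmentation re-attach to the same id instead of creating a duplicate.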
In the multi-frame fusion rule in eq. 6, an object has been detected in $N$ frames. The larger $N$ is, the more reliable the detection result is. According to $N$, we define the confidence level of a decision, $C$, which increases with $N$.
$C$ measures the reliability of object detection results. When $C$ is close to $1$, we are more confident about the detection decision. When $C$ is low, we are less sure about the obtained result.
We introduce a threshold on $C$, $C_{th}$, to decide whether a detected object is acceptable: a detection result is accepted if $C \ge C_{th}$ and discarded otherwise.
$C_{th}$ measures the reliability of acceptable object detection results and lies between $0$ and $1$. By setting different confidence level thresholds $C_{th}$, we can select object detection results according to the needs of users. When $C_{th}$ is set high (close to 1), the accepted results are required to be more accurate, while detection results with small $C$ are discarded. When $C_{th}$ is set low, more detection results are accepted, and the false alarm rate may also be high because the accepted results include objects detected in only a few frames. In this way, $C_{th}$ gives users a way to balance the accuracy, the false alarm rate and the completeness of detection results.
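The thresholding step can be illustrated as follows. The exact mapping from the number of frames to the confidence level is not available in this text, so the sketch substitutes an assumed one: a linear ramp that saturates at 1 once an object has been seen in `n_ref` frames. Both the mapping and the value of `n_ref` are placeholders for illustration, not the paper's definition.

```python
def confidence_level(n_frames, n_ref=20):
    """ASSUMED confidence mapping (not the paper's formula).

    Grows linearly with the number of frames in which the object was
    detected and saturates at 1.0 after n_ref frames.
    """
    if n_frames <= 0:
        return 0.0
    return min(n_frames / n_ref, 1.0)


def accept(results, threshold):
    """Keep the ids whose confidence level reaches the threshold.

    results: list of (object_id, n_frames) pairs.
    """
    return [obj_id for obj_id, n in results
            if confidence_level(n) >= threshold]


demo = [("a", 4), ("b", 18), ("c", 40)]
print(accept(demo, 0.5))  # objects seen in too few frames are dropped
print(accept(demo, 0.0))  # a zero threshold accepts everything
```

Raising the threshold discards short-lived detections, which are the ones most likely to be false alarms or duplicates, at the cost of also discarding some genuine bicycles that were visible only briefly.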
V The framework of the designed bicycle detection method
The procedure of our algorithm is shown in Fig. 2. Now we briefly explain its main steps.
In STEP 2, the background model is obtained by the common background updating method, such as GMM.
In STEP 3, the foreground is obtained through background subtraction. There may be holes and noisy points in the obtained foreground due to background noise, lighting changes, camera shaking, etc. So we apply post-processing, such as erosion, dilation and other basic morphological operations and target fusion, in order to get a better foreground.
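To make the effect of erosion and dilation concrete, the following sketch applies a 3×3 closing (to fill small holes) followed by a 3×3 opening (to remove isolated noisy points) to a binary foreground mask. It is written in plain Python so that it is self-contained; the structuring element size, the order of the two operations and the border handling (pixels outside the image are ignored) are assumptions, and a real system would normally rely on an image processing library.

```python
def _filter3x3(mask, op):
    """Apply op (min or max) over each pixel's 3x3 neighbourhood."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [mask[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = op(window)
    return out


def erode(mask):
    """A pixel stays foreground only if its whole neighbourhood is."""
    return _filter3x3(mask, min)


def dilate(mask):
    """A pixel becomes foreground if any neighbour is foreground."""
    return _filter3x3(mask, max)


def clean_foreground(mask):
    """Closing (fill holes) followed by opening (drop isolated points)."""
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```

For example, a 5×5 foreground block with a one-pixel hole comes out filled, while a single stray foreground pixel elsewhere in the mask is removed.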
In STEP 4, the processed foreground is partitioned into objects. We take an object $O$ from STEP 4 to explain the subsequent steps. Other objects follow the same procedure.
In STEP 5, we extract some geometric features of $O$ in the current frame, such as the aspect ratio, the duty cycle of foreground, the number of foreground pixels and the speed. In STEP 6, the support vector machine method or the cascade classifier is used to produce a preliminary detection result, denoted as $r_k$ (where $k$ is the frame index).
In STEP 8, we track object $O$. More specifically, we predict the regions where all stored objects could lie in frame $k$ [16]. Then we compare the actual region where $O$ appears with the predicted regions. If the overlap between $O$’s region and the predicted region of one stored object is large enough, we decide that $O$ and that object match, i.e., they are the same one; otherwise, $O$ is determined to be a newly emerging object. By computing the distance between the centers of the predicted and actual regions, we can also obtain the movement information of $O$. Note that if a saved object cannot match with any new object for more than $L_c$ consecutive frames, where $L_c$ is the life cycle of Section IV-B, we determine that the object has left the detection area and remove it from the object list.
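The matching test in STEP 8 can be sketched with a simple overlap measure between bounding boxes. The text only requires the overlap to be "large enough", so the use of intersection over union, the 0.5 threshold, and the box format `(x1, y1, x2, y2)` are assumptions made for this example; the predicted boxes are taken as given rather than computed by a filter.

```python
def overlap_ratio(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def match_object(actual_box, predicted, min_overlap=0.5):
    """Return the id of the best-overlapping stored object, or None.

    predicted: dict mapping stored object id -> predicted box.
    None means the segmented object is treated as newly emerging.
    """
    best_id, best_ratio = None, min_overlap
    for obj_id, box in predicted.items():
        ratio = overlap_ratio(actual_box, box)
        if ratio >= best_ratio:
            best_id, best_ratio = obj_id, ratio
    return best_id


predicted_demo = {"x": (1, 0, 11, 10), "y": (20, 20, 30, 30)}
print(match_object((0, 0, 10, 10), predicted_demo))   # overlaps "x"
print(match_object((50, 50, 60, 60), predicted_demo))  # no match
```

Choosing the best overlap rather than the first acceptable one avoids attaching a segmented object to the wrong track when two predicted regions are close together.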
In STEP 9, we fuse all preliminary detection results regarding $O$ from all previous frames by the majority rule, which provides not only a more reliable detection result, but also the confidence level of that detection result. The procedure from STEP 4 to STEP 9 is repeated for the other objects segmented in the current frame.
In STEP 10, we load the next frame and repeat the above procedure.
VI Experimental results
In order to validate our algorithm, we implemented and tested it on multiple traffic scenes, which contain pedestrians, bicycles and vehicles. The algorithm is tested with real traffic surveillance videos with an image size of 352×288 pixels, and is run on an ordinary PC (Intel Core i3-2120, 3.3 GHz).
As there is no public dataset for bicycle detection, we generate a dataset from urban traffic videos. We select three different scenarios, which are shown in Fig. 3.
In Fig. 3, Scene 1 is a rainy day with much water on the road, Scene 2 is a foggy day and Scene 3 is sunny and more favorable for detection.
VI-B Performance indices
In the experiments, the relevant performance indices include detection rate, false alarm rate, missing rate, duplication rate and processing time per frame, which are explained as follows.
The detection rate. It is the number of detected bicycles over the total number of bicycles in the video.
The false alarm rate. It is the number of objects wrongly detected as bicycles over the total number of bicycles.
The duplication rate. It measures how often a bicycle is repeatedly reported as a new object, relative to the total number of bicycles.
The processing time. The processing time per frame is used to measure the complexity of detection algorithms.
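The two indices that are fully defined in the text, the detection rate and the false alarm rate, can be computed as below. The counts in the usage lines are invented numbers for illustration and are not experimental results.

```python
def detection_rate(detected_bicycles, total_bicycles):
    """Detected bicycles over the total number of bicycles in the video."""
    if total_bicycles <= 0:
        raise ValueError("total_bicycles must be positive")
    return detected_bicycles / total_bicycles


def false_alarm_rate(false_bicycles, total_bicycles):
    """Objects wrongly reported as bicycles over the total number of bicycles."""
    if total_bicycles <= 0:
        raise ValueError("total_bicycles must be positive")
    return false_bicycles / total_bicycles


# Hypothetical counts: 50 bicycles in the video, 45 found, 5 false reports.
print(detection_rate(45, 50))
print(false_alarm_rate(5, 50))
```

Note that the false alarm rate is normalized by the number of bicycles, not by the number of non-bicycle objects, so it can in principle exceed 1 in a scene with few bicycles and many false reports.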
VI-C Detection results and analysis
The experimental results for the scenes are shown in Table I. We can see that
Table I: Detection results and processing time per frame of the fusion methods in Scene 1, Scene 2 and Scene 3.
In the rainy day (Scene 1), both the SVM and cascade classifier methods still work, but with relatively poor performance. In that scene, there exist strong shadows, which mislead the detection of bicycles. Thanks to the fusion of multiple features and multiple frames, we can still achieve reasonable results.
In the foggy day (Scene 2), although the moving objects are blurry, we can still extract their salient features and detect bicycles very well.
The missing rate of the SVM method is slightly lower than that of the cascade classifier method, while the false alarm rate of the SVM method is higher. The reason is that the SVM method is built upon the simple linear combination of multiple features in eq. 4, which accepts objects more easily and thus yields a lower missing rate and a higher false alarm rate. The overall performance of the cascade classifier method is better than that of the SVM method.
Both the SVM and cascade classifier methods achieve satisfactory detection rates, which meet the requirements of practical applications. The two methods have similar processing times per frame. The frame rate of ordinary videos is around 25-30 frames per second, and these two methods can meet the real-time detection requirements of real applications.
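The real-time argument amounts to comparing the processing time per frame with the time available between two frames. The sketch below makes that comparison explicit; the processing times in the usage lines are hypothetical, and only the 25-30 frames per second range comes from the text.

```python
def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at the given frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def meets_real_time(processing_ms, fps):
    """True when one frame is processed within one frame interval."""
    return processing_ms <= frame_budget_ms(fps)


# At 25 frames per second each frame must be handled within 40 ms.
print(frame_budget_ms(25))
print(meets_real_time(30.0, 25))  # hypothetical 30 ms per frame
print(meets_real_time(35.0, 30))  # hypothetical 35 ms per frame
```

At 30 frames per second the budget shrinks to about 33 ms, so a method that is comfortably real-time at 25 frames per second may no longer be so at the higher rate.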
VI-D The effects of the confidence level threshold
As mentioned in Section IV-B, we can effectively balance the detection accuracy and the false alarm rate by selecting an appropriate confidence level threshold. Now we try different confidence level thresholds to show their effects on missing rate, false alarm rate and duplication rate. Note that we take the SVM multi-feature fusion method here. The experimental results are shown in Fig. 4 and Table II. We can see that
Table II: Missing rate, false alarm rate and duplication rate under different confidence level thresholds.
Before considering the confidence level, i.e., with the confidence level threshold at its lowest setting, each bicycle is detected as a new object about twice on average. As the confidence level threshold increases, the duplication rate greatly decreases, dropping to only a fraction of its value without thresholding. So we can effectively attenuate the duplication rate based on the confidence level.
When the confidence level threshold is set low, more detection results are accepted and the false alarm rate could also be high due to the inclusion of detection results being obtained from a few frames. Accordingly, the missing rate decreases as the confidence level threshold decreases.
In practical applications, the confidence level threshold can be adjusted to reach a trade-off among missing rate, false alarm rate and duplication rate according to users’ preference. When the system requires high accuracy, the confidence level threshold is set high to effectively reduce false alarms. When the system requires low missing rate, the confidence level threshold is set low to accept more detection results, whose reliability may be low.
We present an approach based on multi-feature and multi-frame fusion to detect bicycles. Our approach extracts the sparse geometric features (simple salient features) of objects, and fuses these sparse geometric features by the SVM method and the cascade classifier method to improve detection performance. The detection results from multiple frames are fused together to further reduce detection errors. Moreover, we introduce the confidence level of detection results to achieve a desired balance among detection accuracy, false alarm rate, missing rate and duplication rate. As experimental results show, our approach can efficiently detect bicycles with low computational complexity, and is therefore applicable for real-time traffic surveillance systems.
-  N. Sheng, H. Wang and H. Liu, “Multi-traffic objects classification using support vector machine,” In Proceedings of Chinese Control and Decision Conference (CCDC), pp. 3215-3218, May 2010.
-  Y. Wang, Y. Wang, “Inducement of Energy Crisis of China and the Strategy in Response to the Crisis,” In Sino-Global Energy, pp. 15-18, vol. 12, 2007.
-  The Ministry of Public Security of China, “The National Road Traffic Accidents in the First Half of 2011,” http://www.mps.gov.cn/n16/n1282/n3553/2921474.html, 2011.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” In
-  A. Shashua, Y. Gdalyahu and G. Hayun, “Pedestrian Detection for Driving Assistance Systems: Single-frame Classification and System Level Performance,” In Proceedings of IEEE Intelligent Vehicles Symposium(IV2004), University of Parma, pp. 1-6, 2004.
-  W. Wan, H. Huo and Y. Zhao, “Target Detection and Recognition in Intelligent Video Surveillance,” Shanghai Jiaotong University Press, 2010.
-  Y. Wu, Q. Kong, Z. Liu and Y. Liu, “Pedestrian and Bicycle Detection and Tracking in Range Images,” In Proceedings of International Conference on Optoelectronics and Image Processing (ICOIP2010), pp. 109-112, 2010.
-  N. Thepvilojanapng, K. Sugo, Y. Namiki and Y. Tobe, “Recognizing Bicycling States with HMM based on Accelerometer and Magnetometer Data,” In Proceedings of SICE Annual Conference, Waseda University, pp. 831-832, 2011.
-  S. Rogers, P. Nikolaos, “A Robust Video-based Bicycle Counting System,” In Proceedings of ITS America Meeting (9th: New thinking in transportation), Washington DC, pp. 1-12, 1999.
-  C. Chiu, M. Ku and H. Chen, “Motorcycle Detection and Tracking System with Occlusion Segmentation,” In Proceedings of the Eighth International Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS’07), pp. 32-35, 2007.
-  H. Cho, P. Rybski and W. Zhang, “Vision-based Bicyclist Detection and Tracking for Intelligent Vehicles,” In Proceedings of IEEE Intelligent Vehicle Symposium, University of California, San Diego, pp. 454-461, Jun. 2010.
-  H. Jung, Y. Ehara, J. K. Tan, H. Kim, and S. Ishikawa, “Applying MSC-HOG Feature to the Detection of a Human on a Bicycle,” In Proceedings of the 12th International Conference on Control, Automation and Systems (ICCAS), IEEE, pp. 514-517, 2012.
-  Y. Fujimoto, J. Hayashi, “A method for bicycle detection using ellipse approximation,” In 19th Korea-Japan Joint Workshop on Frontiers of Computer Vision (FCV), IEEE, pp. 254-257, 2013.
-  A. Moro, E. Mumolo, M. Nolich, K. Umeda, “Real-time GPU implementation of an improved cars, pedestrians and bicycles detection and classification system,” In 14th International IEEE Conference on Intelligent Transportation Systems (ITSC), IEEE, pp. 1343-1348, 2011.
-  L. Shao, F. Zhu, and X. Li, “Transfer learning for visual categorization: A survey,” In IEEE Transactions on Neural Networks and Learning Systems, 2014.
-  R. Kalman, “A new approach to linear filtering and prediction problems,” In Transactions of the ASME Journal of Basic Engineering, pp. 35-45, 1960.
-  Z. Bian, X. Zhang, “Pattern Recognition”, pp. 284-299, Tsinghua University Press, 2000.
-  P. Viola, “Robust Real-time Face Detection,” International Journal of Computer Vision, vol. 57, no. 2, pp. 137-152, 2004.
-  Y. Zhang, J. Yan, Q. Ling, F. Li and J. Zhu. Moving cast shadow detection based on regional growth. In Control Conference (CCC), 2013 32nd Chinese, pp. 3791-3794, IEEE, 2013.
-  J. Yan, Q. Ling, Y. Zhang, F. Li and F. Zhao. An adaptive bicycle detection algorithm based on multi-Gaussian models. Journal of Computational Information Systems, 9(24), pp.10075-10083, 2013.
-  J. Yan, Q. Ling, Y. Zhang, F. Li and F. Zhao. A novel occlusion-adaptive multi-object tracking method for road surveillance applications. In Control Conference (CCC), 2013 32nd Chinese, pp. 3547-3551, IEEE, 2013.