1 Introduction
Surveillance cameras are widely installed and record massive amounts of data every day. Anomalous events, however, are rare, and it is impossible for humans to monitor all of these cameras. Car crashes are a crucial safety issue. Leveraging recent developments in computer vision algorithms, we are developing an automatic system for traffic surveillance on highways and streets.
We have built a model that can predict and recognize crashes from surveillance cameras. One benefit is that ambulances could be sent to the crash scene immediately, saving lives. As accidents are relatively few, our model also supports proactive safety checks based on normal traffic flow. Real-time speed and distance measurements will lead to insights about high-risk areas, such as where cars frequently get too close. This will help to improve traffic safety in the long term.
As accidents are rare in regular surveillance videos, it is arduous to collect and build a labeled dataset of car crashes covering all possible situations. Taking this reality into account, we propose a model that requires no labeled crash data for training. Physically, a collision between cars occurs when they gradually get closer and finally come into contact. We can predict their trajectories and check overlap positions indicating a collision. In severe crashes, vehicles are deformed and undetectable afterwards, but the crash is recognized ahead of time based on the predictions.
Our model consists of five steps. A camera calibration method is applied to transform a point in the image to the road plane. Object detection and tracking algorithms identify a vehicle and trace its history. A 3D bounding box is built to get the projection of a car on the road. Position and speed are estimated and predicted for the future. Finally, the model recognizes danger based on distances between vehicles and overlaps in the trajectories.
We run this model on BrnoCompDataset [19], which contains highway surveillance videos with ground truth speed and distance measurements. We evaluate its performance at each step. It performs an effective 3D reconstruction of the road plane with a mean distance measurement error of 1.80% along the road. With vehicles detected and tracked, it measures speeds with a mean error of 2.77 km/h. When predicting 0.12 seconds ahead, it achieves mean errors of 0.24 m for car positions and 2.53 km/h for speeds. This allows refined recognition of traffic danger from all the measurements and predictions. Importantly, all these results are achieved without any labeled training data.
The key contributions of this paper are:

A traffic danger recognition model for surveillance cameras based on a 3D reconstruction of the road and prediction of trajectories. It does not need any labeled training data of car crashes.

Results show that the model monitors the road accurately, with mean errors of 1.80% for distance measurement, 2.77 km/h for speed measurement, 0.24 m for car position prediction, and 2.53 km/h for speed prediction.
2 Related Work
Camera Calibration. Calibration methods are employed to derive the intrinsic (focal length, principal point) and extrinsic (rotation, translation) parameters of a camera. The accuracy of calibration is critical for the 3D reconstruction and further processing. Different methods may require various forms of user inputs, such as drawing parallel lines [12], camera position [22, 14], and average vehicle size [3] or speed [16]. Fully automatic calibration is also achievable according to [4, 18].
Object Detection. Object detection models are utilized to identify vehicles in video frames. Models such as Fast RCNN [6] and Faster RCNN [15] rely on region proposal algorithms and deep convolutional neural networks to get bounding boxes of objects. Mask RCNN [7] extends them further by predicting object masks simultaneously.
Multiple Object Tracking. Vehicle objects detected in adjacent frames need to be traced correctly. The SORT algorithm [2] supports fast online tracking with a Kalman filter [9] and the Hungarian algorithm [10]. DeepSORT [23] additionally integrates appearance information to improve the performance.
Anomaly Detection. Traffic danger recognition is one specific aspect of anomaly detection. Multiple instance learning [20] requires sufficient annotated training data. Motion pattern based learning for traffic anomaly [24] also uses labeled data. In contrast, our approach requires no labeled videos of car crashes.
3 Methodology
Our traffic danger recognition model consists of five steps. Camera calibration provides geometry parameters and a transformation from image coordinates to road plane coordinates. Object detection and tracking algorithms provide the types, positions, and masks of vehicles and trace their histories. 3D bounding boxes are built to localize vehicles in the world space and then project to the road plane. Positions and speeds are calculated with adjacent frames plus smoothing and predicted for the future. Finally, we can recognize danger from vehicle distances and potential overlaps in the predictions.
3.1 Camera Model and Calibration
We adopt a traffic camera model similar to the paper of Sochor [18] as shown in Figure 1. We follow the practice of Dubská [5] in setting up directions of three vanishing points . With a known plane, points in an image can be reprojected to points on the plane in the world space. The reprojection enables a 3D reconstruction of vehicles on the road.
Although some automatic calibration methods have been developed, they do not achieve sufficient performance in our model. We therefore continue to use manual calibration, which requires labeling two groups of parallel lines in each camera view. We then derive two vanishing points in the image space using a least-squares method as in [12]. With Algorithm 1, extracted from the supplementary material of the dataset [19], we can derive the road plane in the world space and project image points to world points on the plane.
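As a rough illustration of the least-squares step, the sketch below estimates a vanishing point as the image point that minimizes the summed squared perpendicular distances to a group of labeled lines. This is one common formulation and not necessarily the exact procedure of [12]; the function names and the segment format (pairs of pixel coordinates) are assumptions made for this example.

```python
import math

def line_normal_form(p, q):
    """Return (a, b, c) with a*x + b*y = c and a^2 + b^2 = 1 for the line through p and q."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2              # normal vector, perpendicular to q - p
    norm = math.hypot(a, b)
    if norm == 0:
        raise ValueError("segment endpoints must differ")
    a, b = a / norm, b / norm
    return a, b, a * x1 + b * y1

def vanishing_point(segments):
    """Least-squares intersection of the lines through the labeled segments."""
    saa = sab = sbb = sac = sbc = 0.0
    for p, q in segments:
        a, b, c = line_normal_form(p, q)
        saa += a * a
        sab += a * b
        sbb += b * b
        sac += a * c
        sbc += b * c
    # Solve the 2x2 normal equations [[saa, sab], [sab, sbb]] @ [x, y] = [sac, sbc].
    det = saa * sbb - sab * sab
    if abs(det) < 1e-12:
        raise ValueError("lines are (nearly) parallel in the image")
    x = (sac * sbb - sab * sbc) / det
    y = (saa * sbc - sab * sac) / det
    return x, y

# Hypothetical labeled segments whose lines all pass through (100, 50).
segments = [((0, 0), (50, 25)), ((0, 100), (50, 75)), ((100, 0), (100, 20))]
vp = vanishing_point(segments)
```

With noisy hand-labeled lines the system is overdetermined, and the same code returns the best-fit point instead of an exact intersection.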
We rotate the world coordinate system to make the  plane parallel to the road plane, so we can get plane coordinates of a point by omitting axis. Rotation parameters are acquired by solving Equation 1.
(1)  
3.2 Object Detection and Tracking
We select Mask RCNN by He [7] as our object detection model, which outputs detection scores, object types, bounding boxes, and object masks. We use Abdulla’s implementation [1] with weights pretrained on the Microsoft COCO dataset [13] and select three types of objects as targets: car, bus, and truck. We then apply a filter to the detected objects as shown in Figure 2. The filter follows three rules:

Vehicles should not be too small in size.

Vehicles should be in the road area.

Vehicles should be completely visible.
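The three rules above could be implemented along the following lines. This is only a sketch: the minimum area, the border margin, the road polygon, and the use of the bottom-center of the bounding box as the road-membership test are all assumptions made for illustration, not values or choices taken from the model.

```python
def point_in_polygon(pt, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices in order."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge crosses the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def keep_detection(box, image_size, road_polygon, min_area=900, margin=2):
    """Apply the three filter rules to a box given as (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    width, height = image_size
    # Rule 1: not too small (min_area is a hypothetical threshold).
    big_enough = (x2 - x1) * (y2 - y1) >= min_area
    # Rule 2: in the road area (tested at the bottom-center of the box).
    on_road = point_in_polygon(((x1 + x2) / 2, y2), road_polygon)
    # Rule 3: completely visible (the box does not touch the image border).
    fully_visible = (x1 >= margin and y1 >= margin
                     and x2 <= width - margin and y2 <= height - margin)
    return big_enough and on_road and fully_visible

road = [(0, 200), (1920, 200), (1920, 1080), (0, 1080)]   # hypothetical road region
kept = keep_detection((500, 400, 700, 550), (1920, 1080), road)
```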
We use DeepSORT by Wojke [23] to track vehicles across frames. The tracking model assigns each vehicle a unique ID and remains robust through brief losses of detection.
3.3 3D Bounding Box
We estimate the contour of a vehicle from its Mask RCNN mask, using the algorithm by Suzuki [21]. For each of the three vanishing points, we calculate the tilt angles of the lines passing through that vanishing point and each point in the contour. In this way we find the tangent lines of the contour that pass through each of the three vanishing points. We adapt the algorithm from Sochor [17] to build 3D bounding boxes of cars as described in Algorithm 2 and Figure 3.
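The tangent search described above can be sketched as follows: among all lines from a vanishing point to the contour points, the two with extreme angles are the tangents. The sketch assumes the vanishing point lies outside the contour, and the function name and the point format are illustrative only. Angles are measured relative to the direction of the contour centroid so that the comparison does not break at the ±π wrap-around.

```python
import math

def tangent_points(vp, contour):
    """Return the two contour points whose lines through vp are tangent to the contour."""
    cx = sum(x for x, _ in contour) / len(contour)
    cy = sum(y for _, y in contour) / len(contour)
    ref = math.atan2(cy - vp[1], cx - vp[0])          # direction from vp to the centroid

    def relative_angle(pt):
        ang = math.atan2(pt[1] - vp[1], pt[0] - vp[0]) - ref
        return math.atan2(math.sin(ang), math.cos(ang))   # wrap into (-pi, pi]

    return min(contour, key=relative_angle), max(contour, key=relative_angle)

# Hypothetical square contour seen from a vanishing point on its right-hand side.
contour = [(10, -1), (12, -1), (12, 1), (10, 1)]
first, second = tangent_points((20, 0), contour)
```

Repeating this for each of the three vanishing points gives six tangent lines, whose intersections provide candidate corners for the 3D bounding box.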
3.4 Trajectory Prediction
To get the current location of a vehicle, we can find the bottom of the 3D bounding boxes and project them to the road plane according to Section 3.1. The set of the bottom points relies on the direction of the vehicle as:
(2) 
The center position of a vehicle is calculated by
(3) 
and a recent speed is calculated from adjacent frames as
(4) 
where denotes frame number and is the frame rate of the video. Exponential smoothing is applied to get a smoothed speed as
(5) 
With an optional scale factor , we are able to know the real world value of the speed.
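A minimal sketch of the speed computation follows. It assumes that the recent speed is the displacement of the center between adjacent frames multiplied by the frame rate, and that the smoothing weights the newest sample by a parameter `alpha`; the names `alpha` and `scale` and the exact form of the update are assumptions for this example, not the paper's notation.

```python
import math

def instantaneous_speed(center_prev, center_curr, fps, scale=1.0):
    """Speed from the centers in two adjacent frames, in (scaled) plane units per second."""
    return math.dist(center_prev, center_curr) * fps * scale

def smooth_speeds(speeds, alpha):
    """Exponential smoothing: s_t = alpha * v_t + (1 - alpha) * s_{t-1}, with s_0 = v_0."""
    smoothed = []
    for v in speeds:
        if not smoothed:
            smoothed.append(v)
        else:
            smoothed.append(alpha * v + (1 - alpha) * smoothed[-1])
    return smoothed

# A center that moves 1.0 plane units per frame at 25 fps.
v = instantaneous_speed((0.0, 0.0), (0.8, 0.6), fps=25)
history = smooth_speeds([10.0, 20.0, 20.0], alpha=0.5)
```

If the plane units are meters after scaling, multiplying the result by 3.6 converts it to km/h.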
To predict the trajectories, we assume:

The future is divided into time slots with equal lengths.

The vehicle centers follow normal distributions.

The vehicle shapes do not change.
We predict speed, acceleration, center coordinates and variance for the beginning of each slot as a snapshot. Within a slot, we assume there are fixed acceleration and variance. Then the speed and center coordinates can be calculated according to kinematics rules. In this way, predictions are available for an arbitrary time in the future.
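The slot-based prediction can be sketched with constant-acceleration kinematics. In this simplified example every slot reuses the same acceleration and variance, and the class and function names are hypothetical; how the acceleration and variance are actually estimated for each slot is not covered here.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    t: float          # start time of the slot, in seconds from now
    center: tuple     # (x, y) on the road plane
    velocity: tuple   # (vx, vy)
    accel: tuple      # (ax, ay), fixed within the slot
    variance: float   # fixed within the slot

def advance(snap, dt):
    """Constant-acceleration kinematics: p + v*dt + a*dt^2/2 and v + a*dt."""
    center = tuple(p + v * dt + 0.5 * a * dt * dt
                   for p, v, a in zip(snap.center, snap.velocity, snap.accel))
    velocity = tuple(v + a * dt for v, a in zip(snap.velocity, snap.accel))
    return center, velocity

def build_snapshots(center, velocity, accel, variance, slot_length, n_slots):
    """One snapshot at the beginning of each equally long slot."""
    snaps = [Snapshot(0.0, center, velocity, accel, variance)]
    for k in range(1, n_slots):
        c, v = advance(snaps[-1], slot_length)
        snaps.append(Snapshot(k * slot_length, c, v, accel, variance))
    return snaps

def predict(snaps, t):
    """Center, velocity, and variance at an arbitrary time t >= 0."""
    current = max((s for s in snaps if s.t <= t), key=lambda s: s.t)
    center, velocity = advance(current, t - current.t)
    return center, velocity, current.variance

# A vehicle at 20 m/s with no acceleration, predicted 0.12 s ahead.
snaps = build_snapshots((0.0, 0.0), (20.0, 0.0), (0.0, 0.0), 0.1, slot_length=0.04, n_slots=6)
center, velocity, variance = predict(snaps, 0.12)
```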
3.5 Danger Recognition
We recognize dangerous situations in two ways. The first is the distance measurement between vehicles. It not only tells where cars are about to crash but also provides a proactive safety check for areas where cars often get too close. The second is the danger map, which detects overlaps of vehicles in the predictions that indicate crashes.
The distance between two vehicles is defined as the minimum distance between two points from two quadrangles respectively.
Lemma 1.
Let $(p_1, p_2)$ be the pair with the minimum distance among all pairs of points taken from two quadrangles respectively. Then at least one of $p_1, p_2$ must be a vertex.
Proof.
Suppose neither $p_1$ nor $p_2$ is a vertex, so each of them lies on an edge, namely $e_1$ and $e_2$. If $e_1 \parallel e_2$, there must be another pair of points, containing at least one vertex, that has an equal distance. If $e_1$ is not parallel to $e_2$, then the minimum distance between $e_1$ and $e_2$ cannot be attained in the interior of both edges, which contradicts the supposition. Therefore, at least one of $p_1, p_2$ is a vertex. ∎
With Lemma 1, we can calculate the minimum distance between two quadrangles as:
(6)  
(7) 
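where is the distance between a point and an edge. In this way, the minimum distance is calculated from only 32 candidates: each of the 4 vertices of one quadrangle against each of the 4 edges of the other, in both directions. Distances are calculated for all vehicle pairs, and an alert is raised when a distance falls below a threshold.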
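The vertex-to-edge search can be sketched as below. The code assumes that the two quadrangles do not overlap and that each is given as four (x, y) vertices in order; the function names and the alert threshold are illustrative.

```python
import math
from itertools import combinations

def point_segment_distance(p, a, b):
    """Distance from point p to the segment with endpoints a and b."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def edges(quad):
    return [(quad[i], quad[(i + 1) % 4]) for i in range(4)]

def candidate_distances(quad_a, quad_b):
    """The 32 vertex-to-edge distances: 4 x 4 in each direction."""
    cands = [point_segment_distance(p, a, b) for p in quad_a for a, b in edges(quad_b)]
    cands += [point_segment_distance(p, a, b) for p in quad_b for a, b in edges(quad_a)]
    return cands

def quad_distance(quad_a, quad_b):
    return min(candidate_distances(quad_a, quad_b))

def close_pairs(quads, threshold):
    """Pairs of vehicle IDs whose distance is below the threshold."""
    return [(i, j) for (i, qa), (j, qb) in combinations(quads.items(), 2)
            if quad_distance(qa, qb) < threshold]

car = [(0, 0), (4, 0), (4, 2), (0, 2)]
truck = [(7, 0), (11, 0), (11, 2), (7, 2)]
gap = quad_distance(car, truck)
```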
We accumulate the probability of a car box based on the distribution of its center to get the heat map of a vehicle. It represents the probability of its position at a specific time in the future. Then we aggregate the heat maps of all the vehicles in a scene into a danger map. A danger map represents the probability of coexistence of two or more vehicles in the same location. Figure 4 shows a sample result of danger recognition.
4 Experiments
4.1 Dataset and Setup
We use BrnoCompDataset [19] to evaluate the performance of our model. It consists of surveillance videos of 6 sessions from 3 directions on a highway in the Czech Republic. The dataset provides ground truth for the distance measurement lines and for vehicle speeds, the latter from Lidar sensors. It also includes calibration results from various systems [19, 18].
We run our model on each of the 18 videos for 10 minutes. The videos are processed at the original resolution of 1080p and downsampled from 50 fps to 25 fps. We do not use a lower frame rate because the DeepSORT model performs worse below 25 fps. We use the calibration results from [18], which provide vanishing points acquired by manual calibration from parallel lines, along with scale factors inferred from speeds. We let the smoothing parameter according to some preliminary experiments. The trajectory prediction is set for 0.12 and 0.24 seconds ahead, corresponding to 3 and 6 frames at 25 fps.
4.2 Calibration Error
We measure the calibration error to test both the provided calibration results and the correctness of our coordinate transformation algorithm, which maps a point from the image space to the road plane space.
We calculate the distance of the given measurement lines in our plane coordinate system. The lines are divided into two groups according to their directions: toward or . The average length of the given lines is different in each group. We collect absolute and relative errors of measured distances and report the mean and median values in Table 1.
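Table 1. Distance measurement error for the two groups of measurement lines.
Group                         Metric              Mean    Median
Group 1 (along the road)      Absolute Error (m)  0.2618  0.1684
Group 1 (along the road)      Relative Error      1.80%   1.42%
Group 2 (second direction)    Absolute Error (m)  0.1633  0.1646
Group 2 (second direction)    Relative Error      2.06%   2.07%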
The results show that our model can accurately measure distances in the real world based merely on surveillance camera views and calibration parameters. The error in each direction is much smaller than the size of conventional vehicles. Given the high speeds on the highway, these errors are even smaller than the movement of a vehicle between two adjacent frames. The model thus provides an effective 3D reconstruction of the road plane with little error.
4.3 Vehicle Detection and Tracking Error
We measure vehicle detection and tracking error to test the Mask RCNN and DeepSORT models. For each vehicle detected and tracked, we record the time and position of each of its appearances. Based on the appearance history, we calculate the estimated period during which the vehicle is in the measurement area of the Lidar sensors. The measurement area is considered to be the largest one if there are more than two Lidar sensors set up. Then we calculate the intersection over union (IoU) between the estimated period and the real period of existence in the ground truth to get a similarity matrix. The Hungarian algorithm [10] is employed to solve this matching problem. Additionally, matching results with an IoU below a threshold are dropped. We report the recall on each video for this evaluation in Table 4.
We find that Mask RCNN sometimes fails at certain viewing angles or for certain types of vehicles. When a detection is lost, DeepSORT can still track the vehicle as long as the gap is short enough. In other cases, however, tracking also fails, which lowers the recall. Despite this, the combination of Mask RCNN and DeepSORT achieves an overall recall of 94.0% (the mean in Table 4), which shows that it is effective for vehicle detection and tracking in this task.
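The matching procedure can be illustrated with the sketch below. It computes the temporal IoU between estimated and ground truth periods and then searches for the assignment with the highest total IoU. An exhaustive search stands in for the Hungarian algorithm here so that the example needs only the standard library; it is practical only for a handful of vehicles. The `min_iou` value and all names are assumptions for this example.

```python
from itertools import permutations

def interval_iou(a, b):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def match_tracks(estimated, truth, min_iou):
    """Return {truth_index: estimated_index} for the assignment with maximal total IoU."""
    iou = [[interval_iou(e, g) for e in estimated] for g in truth]
    # None marks a ground truth vehicle that is left unmatched.
    slots = list(range(len(estimated))) + [None] * len(truth)
    best_total, best = -1.0, ()
    for assignment in permutations(slots, len(truth)):
        total = sum(iou[g][e] for g, e in enumerate(assignment) if e is not None)
        if total > best_total:
            best_total, best = total, assignment
    # Drop matches whose IoU is below the threshold.
    return {g: e for g, e in enumerate(best) if e is not None and iou[g][e] >= min_iou}

def recall(matches, truth):
    return len(matches) / len(truth) if truth else 0.0

estimated = [(0.0, 2.0), (5.0, 7.0), (20.0, 21.0)]    # periods from tracking
truth = [(5.2, 7.1), (0.1, 2.1), (10.0, 12.0)]        # periods from the ground truth
matches = match_tracks(estimated, truth, min_iou=0.5)
video_recall = recall(matches, truth)
```

For full videos, an implementation of the Hungarian algorithm would replace the exhaustive loop while the IoU matrix and the thresholding stay the same.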
4.4 Speed Estimation Error
We use the matched vehicles from the previous section to evaluate the performance of our speed estimation. As the ground truth only has the average speed for each vehicle, we use the smoothed speed of a vehicle at its last appearance for comparison. We collect absolute and relative errors of the estimated speed of each vehicle and report them in Table 2.
Table 2. Speed estimation error.
Metric                 Mean    Median
Absolute Error (km/h)  2.7708  1.8625
Relative Error         3.68%   2.55%
According to the dataset, the average speed for each session is mostly between 60 km/h and 90 km/h. For highway traffic, a mean error of 2.77 km/h indicates that our model can precisely measure the speeds of vehicles. This accurate measurement is the foundation for further predictions and danger recognition.
4.5 Prediction Error
We evaluate the two levels of predictions separately. For each level, we collect the absolute error of location prediction, plus the absolute and relative errors of speed prediction for each vehicle. As the smoothed speed is not stable at the beginning, predictions from vehicles with a history of fewer than 5 frames (0.2 seconds at 25 fps) are excluded. The mean and median values of each metric are shown in Table 3.
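Table 3. Prediction error at the two prediction levels.
Level   Quantity  Metric                 Mean    Median
+0.12s  Location  Absolute Error (m)     0.2433  0.1736
+0.12s  Speed     Absolute Error (km/h)  2.5313  1.8373
+0.12s  Speed     Relative Error         4.55%   2.52%
+0.24s  Location  Absolute Error (m)     0.3563  0.3256
+0.24s  Speed     Absolute Error (km/h)  3.0134  2.4995
+0.24s  Speed     Relative Error         5.71%   3.92%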
Although the prediction mechanism currently deployed is rather simple, its results exceed our expectations. As a vehicle at 75 km/h would move 2.5 meters in 0.12 seconds, a mean error of 0.24 m for location prediction is acceptable. The difference between the mean and median values indicates that some outliers are harming the performance, but at +0.12s more than half of the speed predictions are still within an error of 2 km/h. For traffic on highways, crashes usually happen within 0.12 seconds, so this horizon is enough for the danger map to work. Moreover, the second prediction at +0.24s provides more information beforehand, and it is reasonable for it to have a slightly larger error than the prediction at +0.12s.
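The displacement figure quoted above can be checked with a short calculation, which also relates the mean location errors to the distance traveled within each prediction horizon. The ratios are derived here from the mean location errors of 0.2433 m and 0.3563 m and are not part of the reported evaluation.

```latex
\begin{align*}
d_{0.12} &= \frac{75}{3.6}\,\text{m/s} \times 0.12\,\text{s} = 2.5\,\text{m}, &
\frac{0.2433\,\text{m}}{2.5\,\text{m}} &\approx 9.7\%, \\
d_{0.24} &= \frac{75}{3.6}\,\text{m/s} \times 0.24\,\text{s} = 5.0\,\text{m}, &
\frac{0.3563\,\text{m}}{5.0\,\text{m}} &\approx 7.1\%.
\end{align*}
```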
Table 4. Vehicle matching recall for each video.
Video ID                 1C     1L     1R     2C     2L     2R     3C     3L     3R     4C
Vehicle Matching Recall  95.0%  92.2%  97.0%  82.2%  92.6%  92.5%  81.8%  100%   100%   92.4%
Video ID                 4L     4R     5C     5L     5R     6C     6L     6R     Mean
Vehicle Matching Recall  93.2%  98.2%  83.0%  98.9%  97.5%  98.8%  99.5%  96.4%  94.0%
5 Conclusions
We propose a traffic danger recognition model that works with arbitrary surveillance cameras. It does not require any labeled training data of crashes. The model consists of five steps: camera calibration, object detection and tracking, 3D bounding box, trajectory prediction, and danger recognition. We measure the performance step by step in experiments, showing that the model accurately estimates the speed and position of vehicles by projecting them to a 3D reconstructed road plane. It is suitable for crash detection and proactive safety checks.
A demo of our model working on a real crash scene can be found on YouTube (https://www.youtube.com/playlist?list=PLssAerj8zfUR5wBc7N6gmCFTm0azCHSIf). In the future, a complete test set of videos containing real crashes will be processed to report detection accuracy. The trajectory prediction model could be improved with conditional random fields or recurrent neural networks. We will also test automatic camera calibration methods with the aim of matching the performance of manual calibration, so that the system could function on arbitrary surveillance cameras with zero input.
References

[1] W. Abdulla. Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN, 2017.
[2] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft. Simple online and realtime tracking. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 3464–3468. IEEE, 2016.
[3] D. J. Dailey, F. W. Cathey, and S. Pumrin. An algorithm to estimate mean traffic speed using uncalibrated cameras. IEEE Transactions on Intelligent Transportation Systems, 1(2):98–107, 2000.
[4] M. Dubská, A. Herout, R. Juránek, and J. Sochor. Fully automatic roadside camera calibration for traffic surveillance. IEEE Transactions on Intelligent Transportation Systems, 16(3):1162–1171, 2015.
[5] M. Dubská, A. Herout, and J. Sochor. Automatic camera calibration for traffic understanding. In BMVC, volume 4, page 8, 2014.
[6] R. Girshick. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448, 2015.
[7] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
[8] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] R. E. Kalman. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45, 1960.
[10] H. W. Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[11] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. 2001.
[12] S. C. Lee and R. Nevatia. Robust camera calibration tool for video surveillance camera in urban environment. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 62–67. IEEE, 2011.
[13] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
[14] T.-W. Pai, W.-J. Juang, and L.-J. Wang. An adaptive windowing prediction algorithm for vehicle speed estimation. In Intelligent Transportation Systems, 2001. Proceedings. 2001 IEEE, pages 901–906. IEEE, 2001.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
[16] T. N. Schoepflin and D. J. Dailey. Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation. IEEE Transactions on Intelligent Transportation Systems, 4(2):90–98, 2003.
[17] J. Sochor. Traffic analysis from video. Diplomová práce, Brno University of Technology, Faculty of Information Technology, 2014.
[18] J. Sochor, R. Juránek, and A. Herout. Traffic surveillance camera calibration by 3D model bounding box alignment for accurate vehicle speed measurement. Computer Vision and Image Understanding, 161:87–98, 2017.
[19] J. Sochor, R. Juránek, J. Špaňhel, L. Maršík, A. Širokỳ, A. Herout, and P. Zemčík. Comprehensive data set for automatic single camera visual speed measurement. IEEE Transactions on Intelligent Transportation Systems, 2018.
[20] W. Sultani, C. Chen, and M. Shah. Real-world anomaly detection in surveillance videos. Center for Research in Computer Vision (CRCV), University of Central Florida (UCF), 2018.
[21] S. Suzuki et al. Topological structural analysis of digitized binary images by border following. Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.
[22] K. Wang, H. Huang, Y. Li, and F.-Y. Wang. Research on lane-marking line based camera calibration. In Vehicular Electronics and Safety, 2007. ICVES. IEEE International Conference on, pages 1–6. IEEE, 2007.
[23] N. Wojke, A. Bewley, and D. Paulus. Simple online and realtime tracking with a deep association metric. In Image Processing (ICIP), 2017 IEEE International Conference on, pages 3645–3649. IEEE, 2017.
[24] Y. Xu, X. Ouyang, Y. Cheng, S. Yu, L. Xiong, C.-C. Ng, S. Pranata, S. Shen, and J. Xing. Dual-mode vehicle motion pattern learning for high performance road traffic anomaly detection. In CVPR Workshop (CVPRW) on the AI City Challenge, 2018.