Detection of 3D Bounding Boxes of Vehicles Using Perspective Transformation for Accurate Speed Measurement

03/29/2020 ∙ by Viktor Kocur, et al. ∙ Comenius University in Bratislava

Detection and tracking of vehicles captured by traffic surveillance cameras is a key component of intelligent transportation systems. We present an improved version of our algorithm for detection of 3D bounding boxes of vehicles, their tracking and subsequent speed estimation. Our algorithm utilizes the known geometry of vanishing points in the surveilled scene to construct a perspective transformation. The transformation enables an intuitive simplification of the problem of detecting 3D bounding boxes to detection of 2D bounding boxes with one additional parameter using a standard 2D object detector. The main contribution of this paper is an improved construction of the perspective transformation, which is more robust and fully automatic, and an extended experimental evaluation of speed estimation. We test our algorithm on the speed estimation task of the BrnoCompSpeed dataset. We evaluate our approach with different configurations to gauge the relationship between accuracy and computational costs and the benefits of 3D bounding box detection over 2D detection. All of the tested configurations run in real-time and are fully automatic. Compared to other published state-of-the-art fully automatic results, our algorithm reduces the mean absolute speed measurement error by 32% (to 0.75 km/h) and the absolute median error by 40%.




1 Introduction

Recent development in commercially available cameras has increased the quality of recorded images and decreased the costs required to install cameras in traffic surveillance scenarios. Automatic traffic surveillance aims to provide information about the surveilled vehicles such as their speed, type and dimensions and as such is an important aspect of intelligent transportation system design.

An automatic traffic surveillance system requires an accurate way of detecting the vehicles in the image and an accurate calibration of the recording equipment.

Standard procedures of camera calibration require a calibration pattern or measurement of distances on the road plane. Dubská et al. dubska2014 proposed a fully automated camera calibration for the traffic surveillance scenario. We use an improved version sochor2017 of this method to obtain the camera calibration and focus on the accuracy of vehicle detection.

Object detection is one of the fundamental tasks of computer vision. Recent deep learning techniques have successfully been applied to this task. Deep convolutional neural networks are used to extract features from images and a supplementary structure utilizes these features to detect objects. We opt to use the object detector RetinaNet RetinaNet as a base framework for object detection as it offers a good tradeoff between accuracy and low inference times. RetinaNet uses a structure of so-called anchorboxes for object detection and our method could therefore utilize any other widely used object detection framework based on anchorboxes SSD ; redmon ; FasterRCNN . With minor modifications our method could also utilize emerging object detection frameworks based on keypoint detection CornerNet ; ObjectsAsPoints .

In this paper we extend our previous work CVWW2019 where we proposed a perspective image transformation which utilizes the geometry of vanishing points in a standard traffic surveillance scenario. The perspective transformation has two significant effects. First, it rectifies the image, which aids object detection accuracy. Second, it enables an intuitive parametrization of the 3D bounding box of a vehicle as a 2D bounding box with one additional parameter. Our method has surpassed the existing state-of-the-art approaches in terms of speed measurement accuracy while being computationally cheaper. The method was mostly automatic, but the construction of the perspective transformation was not robust enough, resulting in a need for manual adjustments for some camera angles. Now we propose a new approach which remedies this problem and enables two different transformations to be constructed for a single traffic scene. We also provide an extended study of the performance of our method with different configurations to gauge their effects on speed measurement accuracy and computational costs, offering various options for different computational constraints. We also show that the improved transformation brings improvements in speed measurement accuracy.

2 Related Work

Measuring speeds of vehicles captured by a monocular camera requires their detection and subsequent tracking followed by measurement of the distance they passed utilizing camera calibration. Connecting these subtasks into a single pipeline is usually trivial so we focus the last subsection on the available means of evaluating the accuracy of the whole pipeline.

2.1 Object Detection

The recent advent of convolutional neural networks has had a significant impact on the task of object detection. Two stage approaches such as Faster R-CNN FasterRCNN use a convolutional neural network to generate proposals for objects in the image. In the second stage the network determines which of these proposed regions contain objects and regresses the boundaries of their bounding boxes.

Single stage approaches RetinaNet ; SSD ; redmon work by using a structure of anchorboxes as the output of the network. Each anchorbox represents a possible bounding box. Each anchorbox has a classification output to determine which object, if any, is in the anchorbox and a regression output to align the bounding box to the object. In this approach one object can be covered by multiple anchorboxes so a technique such as non-maximum suppression must be used to leave only one bounding box per object.

Current state of the art approaches forego the use of anchorboxes completely and rely on detecting keypoints in the image via heatmaps on the output of the network. CornerNet CornerNet detects the two opposite corners of a bounding box and pairs them using an embedding. CenterNet CenterNet detects the center of the object and uses regression to determine the dimensions of the object.

2.2 Vehicle Detection

In our work we focus on detecting vehicles via their 3D bounding boxes as this approach has been shown to be also beneficial for subsequent tasks such as fine-grained vehicle classification boxcars ; perspectivenet and re-identification vehiclereid . In the evaluation section of this paper we also show that detecting 3D bounding boxes as opposed to 2D bounding boxes is beneficial to speed measurement accuracy.

Background subtraction is a common method of detecting vehicles as the traffic surveillance cameras are static. Corral-Soto and Elder slotcars fit a mixture model for the distribution of vehicle dimensions on labeled data. The model is used together with the known geometry of the scene to estimate the vehicle configuration for blobs of vehicles obtained via background subtraction. Similarly, Dubská et al. use background subtraction to obtain masks of vehicles. 3D bounding boxes aligned with the vanishing points of the scene are then constructed tangent to these masks. The order of construction of the edges of the bounding box is important and the process may not be stable. This approach has been slightly improved sochor2017 by using Faster R-CNN object detector FasterRCNN before the background subtraction to determine which blobs are cars. Approaches relying on background subtraction can be sensitive to changing light conditions or vibrations of traffic cameras and may thus not be suitable for some traffic surveillance scenarios.

Zeng et al. perspectivenet use a combination of two networks to determine the 3D bounding boxes of vehicles, which are subsequently used to aid in the task of fine-grained vehicle classification. The first network is based on RetinaNet object detector RetinaNet . The second network is given the position of 2D bounding boxes obtained by the first network to perform a ROIAlign operation MaskRCNN on a feature map from a separate ResNet network resnet . This second network then outputs the positions of the vertices of the 3D bounding boxes and is trained as standard regression task with a regularization term which ensures that the bounding box conforms to perspective geometry. The obtained geometry of the vehicle is then used to extract features for a fusion network which is trained on the task of fine-grained vehicle classification. The whole system is trained on the BoxCars116k boxcars dataset, which contains over 116 thousand images each with one vehicle with annotations containing its 2D and 3D bounding box as well as its make and model. Since the dataset contains 3D bounding box annotations we also utilize it for training.

Multiple approaches for detecting 3D bounding boxes have been published and evaluated on the KITTI dataset KITTI . This dataset contains videos from the ego-centric view from a vehicle driving in various urban environments. The videos are annotated with 3D bounding boxes of relevant objects such as cars, cyclists and pedestrians. Many published approaches rely on modified 2D object detectors. The authors of the CenterNet object detector CenterNet published an evaluation of a slightly modified version of their detector on the KITTI dataset. Mousavian et al. mousavian use a 2D bounding box and regress orientation and dimensions of vehicles separately and combine them with geometry constraints to obtain a final 3D bounding box. Simonelli et al. simonelli add a 3D detection head on top of a RetinaNet object detector RetinaNet . The detection head is trained to regress 10 parameters of the 3D bounding box in a special regime where in each step some parameters are fixed to the ground truth for loss computation. Kim and Kum kim propose to use a perspective transformation on the image to create a rectified bird's-eye view of the road plane and find the bounding boxes of vehicles in the transformed image.

The traffic surveillance scenario is significantly different from the autonomous vehicle scenario of the KITTI dataset. It is therefore not possible to compare approaches for these two tasks directly. Our method shares similarities to the presented works by using a 2D object detector, while exploiting additional constraints that can be assumed in a geometry of a standard traffic surveillance scenario.

2.3 Camera Calibration

In the context of traffic surveillance, camera calibration is necessary to enable measurement of real world distances in the surveilled scene. A review of available methods has been presented by Sochor et al. brnocompspeed . The review found that most published methods are not automatic and require human input such as drawing a calibration pattern on the road pattern , using positions of line markings on the road cathey ; lan ; maduro or some other measured distance related to the scene schoepflin ; you .

Ideally, camera calibration can be performed automatically and accurately. Filipiak et al. filipiak proposed an automatic approach based on an evolutionary algorithm, though the approach was validated only on footage zoomed in to obtain a clear image of license plates, which is unsuitable for traffic surveillance on multi-lane roads.

A fully automatic method has been proposed by Dubská et al. dubska2014 . The camera is calibrated by finding the three orthogonal vanishing points related to the road plane. The first vanishing point corresponds to the movement of the vehicles. Relevant keypoints are detected and tracked using the KLT tracker. The tracked lines of motion are then transformed into a so-called diamond space based on parallel coordinates in a fashion similar to the Hough transform. Edges of vehicles which are perpendicular to their movement are used in the same way to determine the position of the second vanishing point. The last vanishing point is calculated as perpendicular to the first two. To enable measurements of distances in the road plane a scale factor needs to be determined. The dimensions of the detected vehicles are recorded and their mean is calculated. The mean is compared to statistical data based on typical composition of traffic in the country to obtain the scale. This method has been further improved by Sochor et al. sochor2017 by fitting a 3D model of a known common vehicle to its detection in the footage. The detection of the second vanishing point is also improved by using edgelets instead of edges. We opt to use this improved fully-automatic calibration method in our pipeline.

2.4 Object Tracking

To allow for vehicle counting and speed measurement, the vehicles have to be tracked from frame to frame. Since object detection may sometimes fail, a robust tracker is necessary. The Kalman filter Kalman has been a reliable tool to tackle the task of object tracking in many domains. However, for our case we found that a simple object tracker which compares the positions of bounding boxes in subsequent frames is sufficient.

2.5 Speed Measurement Accuracy Evaluation

A review brnocompspeed of existing traffic camera calibration methods, vehicle speed measurement methods and evaluation datasets found that many of the published results are evaluated on small datasets with ground truth known for only a few of the surveilled vehicles. Additionally, most of the datasets used in published literature were not publicly available. The authors of the review offer their own dataset called BrnoCompSpeed containing 21 one hour long videos which collectively contain 20 thousand vehicles with known ground truth speeds obtained via laser gates. The authors also provide an evaluation script for this dataset. We chose to perform the evaluation of our method on this dataset.

3 Proposed Algorithm

The goal of our algorithm is to detect 3D bounding boxes of vehicles recorded with a monocular camera installed above the road plane, track the vehicles and evaluate their speed. The algorithm consists of several stages and it requires the camera to be calibrated as per sochor2017 . The algorithm is an improvement of our previous work CVWW2019 .

We will first provide an overview of the whole algorithm and then describe each part in greater detail. At first, a perspective transformation is constructed using the positions of the vanishing points of the traffic scene which are known thanks to the calibration. This transformation is applied to every frame of the recording. The transformed frames are used as an input to an object detector which detects vehicles and their 2D bounding boxes with one additional parameter. The output of the object detection network is used for tracking and 3D bounding box construction. Tracking is performed by comparison of the 2D bounding boxes in successive frames using a simple algorithm based on the IoU metric. The 3D bounding boxes are constructed in the transformed frames using the 2D bounding boxes with the additional parameter. Inverse perspective transformation is then used to transform the 3D bounding boxes onto the original scene. The center of the bottom frontal edge of every 3D bounding box is used to provide pixel position of a vehicle for each frame. The calibration is then used to project the pixels onto the road plane and thus enable measurement of the distance traveled between frames by one vehicle. These interframe distances are consequently used to measure the speed of the vehicle over the whole track.

3.1 Image Transformation

Figure 1: The process of the construction of the perspective transformation for the pair VP1-VP2. a) The original traffic surveillance scene. b) The mask of the desired road segment is applied. c) For both VP1 (dotted red) and VP2 (solid blue) lines which originate in the vanishing point and are tangent to the mask from both sides are constructed. d) The four intersections of these lines are found. e) The four points are paired with the four corners of the rectangle with the desired dimensions of the transformed image. f) When the transformation is applied the lines corresponding to each of the two vanishing points are parallel to the axes. If the total blank (white) area in the transformed image is more than 20% of the pixels in the transformed image then the mask is cropped by few pixels from the bottom and the process starts again from step b).

To construct the image transformation we require the camera to be calibrated as described in sochor2017 . This calibration method has very few limitations regarding the camera position. The camera has to be positioned above the road plane and the observed road segment has to be straight. The main parameters obtained by the calibration are the positions of the two relevant vanishing points in the image. Assuming that the principal point is in the center of the image, the position of the third vanishing point as well as focal length of the camera can be calculated. This enables us to project any point in the image onto the road plane. To enable measurements of distances on the road plane one additional parameter, denoted as scale, is determined during calibration.
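The calculation of the focal length and the third vanishing point mentioned above can be illustrated with a short sketch. It assumes square pixels and zero skew in addition to the principal point being in the image center, and it uses the standard relation that two vanishing points of orthogonal directions satisfy (VP1 - P) · (VP2 - P) = -f². The function name `focal_and_third_vp` and the example coordinates are hypothetical and are not taken from the calibration method of sochor2017.

```python
import math


def focal_and_third_vp(vp1, vp2, principal_point):
    """Return (focal length in pixels, VP3) for two orthogonal vanishing points.

    Assumes square pixels, zero skew and a known principal point.
    """
    px, py = principal_point
    a = (vp1[0] - px, vp1[1] - py)
    b = (vp2[0] - px, vp2[1] - py)
    dot = a[0] * b[0] + a[1] * b[1]
    if dot >= 0:
        raise ValueError("vanishing points are not consistent with orthogonal directions")
    f = math.sqrt(-dot)

    # Directions in camera coordinates corresponding to VP1 and VP2.
    d1 = (a[0], a[1], f)
    d2 = (b[0], b[1], f)
    # The third direction is perpendicular to both (cross product).
    d3 = (d1[1] * d2[2] - d1[2] * d2[1],
          d1[2] * d2[0] - d1[0] * d2[2],
          d1[0] * d2[1] - d1[1] * d2[0])
    if abs(d3[2]) < 1e-12:
        raise ValueError("third vanishing point lies at infinity")
    vp3 = (px + f * d3[0] / d3[2], py + f * d3[1] / d3[2])
    return f, vp3


# Hypothetical vanishing points for a 1920 x 1080 image.
f, vp3 = focal_and_third_vp((1160, -260), (-6040, 40), (960, 540))
print(f, vp3)
```

The sketch raises an error when the two input points cannot belong to orthogonal directions, which is a cheap sanity check on a calibration result.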

The first detected vanishing point (denoted further as VP1) corresponds to the lines on the road plane which are parallel to the direction of the moving vehicles. The second detected vanishing point (VP2) corresponds to the lines which lie on the road plane but are perpendicular to the direction of the moving vehicles. The third vanishing point (VP3) corresponds to the lines which are perpendicular to the road plane.

The goal of the transformation is to create a new image in which lines corresponding to one of the vanishing points are parallel to one of the image axes and lines corresponding to another vanishing point are parallel to the other image axis. In the transformed image the two vanishing points will thus be ideal points. We also require that the lines corresponding to the last vanishing point remain lines in the transformed image. In order to fulfill these conditions we use the perspective transformation. Since the orientation of the vehicles is closely related to the positions of the vanishing points the vehicles will be rectified in the transformed image.

In our previous work CVWW2019 we proposed an algorithm which was able to construct such transformation for the pair of vanishing points VP2 and VP3, however we observed that for some camera positions the results were much worse than for the rest. This was caused by an inadequate perspective transformation which resulted in a very small and distorted part of the transformed image to be relevant for detection. At that time we remedied this by significant manual adjustments, which were not automated and therefore undesirable. Furthermore the previous approach failed completely to construct a reasonable transformation for the pair of vanishing points VP1 and VP2.

Now we propose to remedy this problem in an automated fashion by setting a condition that the transformed image should contain as much relevant information as possible. To satisfy this we propose the following two adjustments. Firstly, we use a mask of the surveilled traffic lanes for the construction of the transformation instead of the whole image. For evaluation we use the BrnoCompSpeed dataset brnocompspeed in which the masks are already provided. In other cases the masks can be obtained automatically by utilizing optical flow opticalflow . Secondly, we heuristically set a limit that no more than 20% of the pixels in the transformed image should correspond to pixels which lie outside of the mask in the original image. In the evaluation section we show that this approach not only makes the algorithm fully automatic, but also leads to better accuracy for the speed estimation task.

The construction algorithm of the transformation has the following steps:

  1. Out of the three vanishing points choose either the pair VP1-VP2 or VP2-VP3.

  2. For each of the selected vanishing points construct two lines which originate in the vanishing point and are tangent to the mask, thus creating four lines.

  3. Find the four intersections of the lines, for the pairs of lines which originate in different vanishing points.

  4. Pair each of the four intersection points with a corner of a rectangle with the desired dimensions of the transformed image in the way that preserves the vertical direction of the vehicle movement (e.g. vehicles traveling from top-left to bottom-right will be traveling from top to bottom in the transformed image).

  5. Use the four pairs of points to obtain the perspective transformation.

  6. Apply the transformation to the image of the mask. If the area of the transformed mask is less than 80% of the transformed image then the original mask is cropped from the bottom by one pixel (if this is not possible, terminate with failure) and the process is repeated from step 2. Otherwise output the transformation.

The algorithm is visualised in Fig. 1. Note that this algorithm may terminate with failure. This is usually the case when the line connecting the two vanishing points intersects the mask. In that case, even if the algorithm did output a transformation, only a small part of the relevant road segment would be visible in the transformed image and it would be unusable for traffic surveillance. In the BrnoCompSpeed dataset brnocompspeed this does not occur. Since the calibration method requires that the three vanishing points are orthogonal it is safe to assume that the transformation would work well for at least one of the pairs.
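Steps 2 to 5 of the construction can be sketched in plain Python as follows. This is a simplified illustration and not the authors' implementation: the mask is approximated by a convex polygon, the iterative cropping check of step 6 is omitted, and the corners are paired with the rectangle by a simple top/bottom and left/right ordering instead of the direction-preserving rule of step 4. All function names and the example coordinates are hypothetical.

```python
import math


def cross3(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def tangent_points(vp, polygon):
    """Polygon vertices touched by the two tangents drawn from vp (vp outside)."""
    cx = sum(p[0] for p in polygon) / len(polygon)
    cy = sum(p[1] for p in polygon) / len(polygon)
    ref = (cx - vp[0], cy - vp[1])

    def signed_angle(p):
        v = (p[0] - vp[0], p[1] - vp[1])
        return math.atan2(ref[0] * v[1] - ref[1] * v[0],
                          ref[0] * v[0] + ref[1] * v[1])

    return min(polygon, key=signed_angle), max(polygon, key=signed_angle)


def line_through(p, q):
    """Homogeneous line through two image points."""
    return cross3((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2):
    x, y, w = cross3(l1, l2)
    return (x / w, y / w)


def solve_linear(a, b):
    """Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography(src, dst):
    """Perspective transformation mapping four src points to four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_h(h, p):
    x, y = p
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def build_transform(vp_a, vp_b, polygon, width, height):
    lines_a = [line_through(vp_a, t) for t in tangent_points(vp_a, polygon)]
    lines_b = [line_through(vp_b, t) for t in tangent_points(vp_b, polygon)]
    corners = [intersection(la, lb) for la in lines_a for lb in lines_b]
    corners.sort(key=lambda p: p[1])            # two upper corners first
    top, bottom = sorted(corners[:2]), sorted(corners[2:])
    src = [top[0], top[1], bottom[1], bottom[0]]
    dst = [(0, 0), (width, 0), (width, height), (0, height)]
    return homography(src, dst), src


# Hypothetical road mask polygon and vanishing points.
mask_polygon = [(300, 200), (700, 200), (900, 600), (100, 600)]
H, corners = build_transform((500, -400), (-3000, 300), mask_polygon, 300, 480)
print(apply_h(H, (500, 300)), apply_h(H, (500, 500)))
```

In the transformed image, points lying on a common line through the first vanishing point share the same horizontal coordinate, which is the rectifying property the transformation is built for.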

The method can theoretically be extended to include the transformation for the pair VP1-VP3, but on the BrnoCompSpeed dataset this results in failure of the construction algorithm for multiple videos so we dismiss this approach. Removing this pair also has the benefit of simplifying the parametrization of the 3D bounding box presented in the following subsection.

3.2 Parametrization of the 3D Bounding Boxes

Figure 2: The process of constructing the 2D bounding box with the parameter $c_c$ from a 3D bounding box using the transformation for the pair VP2-VP3. a) 3D bounding box (green) which is aligned with VP1 (yellow), VP2 (blue) and VP3 (red). b) 3D bounding box. c) 3D bounding box after the perspective transform for the pair VP2-VP3 is applied. d) The parametrization of the 3D bounding box as a 2D bounding box (green). The parameter $c_c$ is determined as the ratio of the distance from the top of the 2D bounding box to the top-front edge of the transformed 3D bounding box (blue) and the height of the 2D bounding box.

We aim to detect 3D bounding boxes aligned with the vanishing points. After performing the image transformation from the previous section, 8 of the 12 edges of the bounding box are aligned with the image axes. This enables us to describe the 3D bounding box as a 2D bounding box with one additional parameter in an intuitive way. The 2D bounding box is the rectangle which encloses the 3D bounding box.

The additional parameter, denoted as $c_c$, is determined by measuring the vertical distance from the top of the 2D bounding box to the top frontal edge of the 3D bounding box and dividing it by the height of the 2D bounding box. This parameter thus always falls into the interval $[0, 1]$. The construction of the 2D bounding box and the additional parameter can be seen in Fig. 2.
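As a small worked example of this parametrization, the sketch below computes the enclosing 2D bounding box and the additional parameter from the eight vertices of a 3D bounding box given in the transformed image. The function name `parametrize`, the argument layout and the example coordinates are assumptions made for illustration; the image y axis is assumed to point downwards.

```python
def parametrize(vertices, top_front_edge_y):
    """Return ((x_min, y_min, x_max, y_max), parameter) for a transformed 3D box.

    vertices: the eight (x, y) vertices of the 3D bounding box in the
    transformed image. top_front_edge_y: y coordinate of the top frontal edge,
    which is horizontal in the transformed image.
    """
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    box = (min(xs), min(ys), max(xs), max(ys))
    height = box[3] - box[1]
    if height <= 0:
        raise ValueError("degenerate bounding box")
    # Distance from the top of the 2D box to the top frontal edge,
    # normalised by the box height, so the result lies in [0, 1].
    return box, (top_front_edge_y - box[1]) / height


# Hypothetical box: frontal face below and to the right of the rear face.
front = [(120, 150), (220, 150), (120, 300), (220, 300)]
rear = [(100, 100), (200, 100), (100, 240), (200, 240)]
box, c = parametrize(front + rear, top_front_edge_y=150)
print(box, c)
```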

Figure 3: The process of reconstructing the 3D bounding box (solid green) from the known 2D bounding box (dashed black), the line given by the parameter (dotted blue) and the position of the VPU (red cross). The process begins in the top left of the figure. Line segments originating in the VPU (red dotted) are used when needed to determine the corners and edges of the 3D bounding box.

3.3 3D Bounding Box Reconstruction

The 2D bounding box with the parameter $c_c$ can be used to reconstruct the 3D bounding box. Here we apply a process similar to the one described in our previous work CVWW2019 , but we generalize it to accommodate new cases which emerge from the improved way the perspective transformation is obtained.

The process of reconstruction depends on the position of the vanishing point which was not used for the transformation. Note that the position of this vanishing point has to be known in the transformed image which is easily obtainable by applying the perspective transformation. For simplicity, we will denote this vanishing point in the transformed image as VPU.

Due to the geometry of the vanishing points there are only two possibilities of relative vertical positions of the 2D bounding box and VPU. The box is either above or below the VPU. Let us first consider that VPU is above the 2D bounding box.

In that case there are only three possibilities regarding the horizontal positions of VPU and the 2D bounding box. If VPU is to the right of the 2D bounding box then the left end of the line segment given by the parameter $c_c$ is a vertex of the 3D bounding box. Knowing this vertex one can construct the 3D bounding box in the transformed image. Similarly, if the VPU is to the left of the 2D bounding box then the right end of the line segment is used. If the VPU is neither to the left nor to the right of the 2D bounding box then either end can be used as a corner to start the construction of the 3D bounding box. The process is visualized in Fig. 3 for the case when VPU is to the left of the 2D bounding box.

In the case when the VPU is below the 2D bounding box the process is almost identical. The only difference is that when VPU is to the left of the 2D bounding box the left end of the line segment is used as a starting vertex and vice versa.

This process may fail to produce a valid 3D bounding box. This can be easily detected during the reconstruction process as in that case a part of at least one of the edges of the 3D bounding box would lie outside of the area enclosed by the 2D bounding box. This failure indicates that there is no valid 3D bounding box for the given parametrization and perspective geometry. Such a situation may occur as it is impossible to guarantee that a neural network outputs only valid outputs. Since this occurs only rarely a simple solution of regarding these outputs as false positives works well enough in practice.

After the 3D bounding box is constructed in the transformed image an inverse perspective transformation can be applied to the vertices of the 3D bounding box to obtain the 3D bounding box in the original image.

3.4 Bounding Box Detection

As shown in the previous subsection we only need to detect 2D bounding boxes with the parameter $c_c$. For this purpose we utilize the RetinaNet object detector RetinaNet . This detector outputs 2D bounding boxes for the detected objects. We modify the method to add $c_c$ to each of the output boxes.

RetinaNet RetinaNet , as well as other object detecting meta-architectures, uses anchorboxes as default positions of bounding boxes to determine where the objects are. The object detection task is separated into three parts: determining which anchorboxes contain which objects, resizing and moving the anchorboxes to better fit the objects and finally performing non-maximum suppression to avoid multiple detections of the same object. To train the network a two-part loss (1) is used.

$$L = \frac{1}{N}\left(L_{conf} + L_{loc}\right) \tag{1}$$

The loss is averaged over all $N$ anchorboxes. $L_{conf}$ is the Focal loss used to train a classifier to determine which objects, if any, are in the bounding box. $L_{loc}$ is the regression loss used to train the network to reshape and offset the anchorboxes. To include the parameter $c_c$ we simply add one additional regression loss $L_c$, which results in the total loss:

$$L = \frac{1}{N}\left(L_{conf} + L_{loc} + L_{c}\right) \tag{2}$$

The loss $L_c$ (3) is identical in its base structure to the loss used for the four regression parameters in RetinaNet, which is itself based on the regression loss of the SSD object detector SSD . The loss is calculated as a sum over all of the anchorboxes and ground truth bounding boxes. $x_{ij}$ determines whether the $i$-th anchorbox corresponds to the $j$-th ground truth label. We subtract the ground truth value of $c_c$, denoted as $\hat{c}_{c,j}$, from the predicted value $c_{c,i}$ and apply the smooth L1 function ($s_{L1}$).

$$L_{c} = \sum_{i} \sum_{j} x_{ij} \, s_{L1}\left(c_{c,i} - \hat{c}_{c,j}\right) \tag{3}$$
Note that this approach could be extended to some of the more recent object detection frameworks which rely on keypoint detection CornerNet ; ObjectsAsPoints . However, we opt to use the anchorbox-based approach as this makes our method universally transferable between different object detection frameworks with widespread use.
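The additional regression term described in this subsection can be illustrated with a minimal sketch in plain Python. It assumes the common smooth L1 definition used by SSD (quadratic below an absolute difference of 1, linear above), represents the anchorbox-to-label assignment as a list of index pairs, and treats the classification and box regression losses as precomputed numbers. All names are hypothetical and the sketch is not the training code used by the authors.

```python
def smooth_l1(x):
    """Smooth L1: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def parameter_loss(predicted, ground_truth, matches):
    """Regression loss for the additional parameter.

    predicted: predicted parameter value for every anchorbox.
    ground_truth: parameter value for every ground truth box.
    matches: (i, j) pairs for which anchorbox i is assigned to label j.
    """
    return sum(smooth_l1(predicted[i] - ground_truth[j]) for i, j in matches)


def total_loss(l_conf, l_loc, l_param, num_anchorboxes):
    """Sum of the three loss parts averaged over the anchorboxes."""
    return (l_conf + l_loc + l_param) / num_anchorboxes


# Three anchorboxes, two ground truth boxes; anchorbox 1 is unmatched.
l_param = parameter_loss([0.2, 0.5, 0.9], [0.3, 0.8], [(0, 0), (2, 1)])
print(l_param, total_loss(1.5, 0.6, l_param, num_anchorboxes=3))
```

Unmatched anchorboxes contribute nothing to the parameter term, mirroring how the box regression loss ignores anchorboxes without an assigned object.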

3.5 Object Detector Training

To obtain training data we use data from two distinct datasets. The first dataset is BoxCars116k boxcars . The original purpose of this dataset is fine-grained vehicle classification. The dataset contains over 116 thousand images, each containing one car along with make and model labels, information on positions of vanishing points and the 3D bounding box of the car. We transform these images with the proposed transformation and calculate the 2D bounding boxes and the parameter based on the provided 3D bounding boxes. Since each image is only of one car we augment the images by randomly rescaling them and placing them on a black background.

The other used dataset is BrnoCompSpeed brnocompspeed . We use the split C of this dataset providing 9 videos for testing and 12 for validation and training. Each video is approximately one hour long with 50 frames per second (with one exception). For training and validation we use only every 25-th frame of the videos. We use the first 30000 frames for validation and the rest are used for training. The main purpose of this dataset is to evaluate camera calibration and speed measurement algorithms. The cameras have been manually calibrated and thus the positions of the vanishing points are available, however the dataset does not contain 3D bounding box annotations.

Figure 4: The process of creating annotations from the provided mask of a vehicle. a) The original image. b) Mask (yellow) is obtained using Mask R-CNN MaskRCNN . c) Both the mask and the image are transformed using the transformation for the pair VP2-VP3 (see subsection 3.1). d) The 2D bounding box (green rectangle) and the two lines (dotted red) originating in VPU and tangent to the mask are drawn. e) Intersection of one of the tangents with a vertical edge of the 2D bounding box can be used to determine the first candidate for the parameter line (solid blue). f) Intersection of the other tangent and the bottom edge of the bounding box is used to draw a vertical line (dashed blue) through it. g) A line (dotted red) originating in VPU and going through the top-left corner of the bounding box is drawn. The intersection of this line and the vertical line from the previous step is used to determine the position of the parameter line (solid blue). h) Finally, from the two possible parameter lines depicted in e) and g) we choose the one constructed in g) as it creates a wider 3D bounding box. The result of this process is a 2D bounding box (green rectangle) with the parameter line in the transformed image.

To obtain the necessary 3D bounding box annotation we run these frames through the Mask R-CNN MaskRCNN image segmentation network trained on the COCO dataset COCO . We transform the masks of detected vehicles and the images using our transformation and create the 2D bounding boxes with $c_c$ as labels for training. Obtaining a 2D bounding box from a mask is straightforward. The computation of the parameter requires a few steps since the masks may not be perfect and the cars are not commonly box shaped. The process begins with drawing the two lines tangent to the mask from both sides originating in VPU (see subsection 3.3). Each of these lines intersects the edges of the 2D bounding box twice. The intersection closer to the VPU is discarded. Thus we have two points on the edges of the 2D bounding box, each corresponding to one tangent line. Calculating the parameter for the point which lies on one of the vertical edges is straightforward. In case of a point on one of the horizontal edges of the bounding box, a vertical line through this point is drawn. Next a line from VPU is drawn to the closest corner of the 2D bounding box. The vertical position of the intersection of these two lines is then used to determine the parameter. In the end we obtain two values for the parameter and use the one which creates a line closer to the VPU, thus choosing the wider of the two options. For visual reference see Fig. 4.

Based on the development of the validation loss during training we employ early stopping and train our models for 30 epochs, each with 10000 training steps. For each pair of vanishing points we train models of three different sizes, depending on the input size of the transformed image. For the pair VP2-VP3 the sizes of the input image in pixels (width x height) are 480 x 270, 640 x 360 and 960 x 540. For the pair VP1-VP2 we use the same dimensions, flipped so that the bigger dimension is the height of the image. We use a minibatch size of 16 for the models of the two smaller sizes and, due to memory constraints, a minibatch size of 8 for the largest model.

3.6 Tracking

From the object detector we obtain 2D bounding boxes with the parameter for the vehicles in each frame of the recording. The tracking algorithm begins in the first frame with no active tracks and continues iterating through the frames. For each 2D bounding box detected in a frame, its IoU against the last 2D bounding box in each active track is calculated. If the IoU of a detection is higher than 0.1 for at least one track, the bounding box is added to the track with the highest IoU score. If no track has an IoU of at least 0.1 against the detection, a new active track is created. If a track has not had any bounding boxes added to it in the last 10 frames, it is no longer considered active and is added to the results. For speed measurement we filter out bounding boxes which are less than 10 pixels away from the edges of the image. We also discard tracks which contain fewer than 5 detected bounding boxes or whose traveled distance is smaller than 100 pixels.
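The association step described above can be sketched in a few lines. This is a minimal sketch under stated assumptions rather than the authors' code: the names `iou`, `Track` and `track` are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, and the additional parameter of each box as well as the final filtering of tracks are left out for brevity. The text does not specify how two detections competing for the same track are resolved, so the sketch simply processes detections in order.

```python
from dataclasses import dataclass, field


def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Track:
    boxes: list = field(default_factory=list)
    last_frame: int = -1  # index of the frame in which the last box was added


def track(detections_per_frame, iou_threshold=0.1, max_gap=10):
    """Group per-frame detections into tracks by greedy IoU matching."""
    active, finished = [], []
    for frame_idx, detections in enumerate(detections_per_frame):
        for box in detections:
            # Active track whose last box overlaps the detection the most.
            best = max(active, key=lambda t: iou(t.boxes[-1], box), default=None)
            if best is not None and iou(best.boxes[-1], box) > iou_threshold:
                best.boxes.append(box)
                best.last_frame = frame_idx
            else:
                active.append(Track([box], frame_idx))

        # Retire tracks that received no box in the last max_gap frames.
        still_active = []
        for t in active:
            (finished if frame_idx - t.last_frame >= max_gap else still_active).append(t)
        active = still_active

    # Tracks that are still active at the end of the recording are results too.
    return finished + active
```

For example, `track([[(0, 0, 10, 10)], [(1, 0, 11, 10)], [(100, 100, 110, 110)]])` yields two tracks: the first two boxes overlap strongly and are joined, while the third has no overlap with any active track and starts a new one.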

3.7 Speed Measurement

In the previous step the 2D bounding boxes with the parameter were grouped and filtered into relevant tracks. 3D bounding boxes are reconstructed (see subsection 3.3) for all detections. Knowing the position of the 3D bounding box in the original image, the speed is determined using the point in the middle of the frontal bottom edge of the 3D bounding box (see Fig. 5). Since these points should under normal circumstances lie on the road plane, we can use the camera calibration to determine the distances between the various positions within a track. To measure the average speeds of the vehicles we employ the same method as brnocompspeed and calculate the median of the interframe speeds over the whole track.
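The median-of-interframe-speeds computation can be illustrated with a short sketch. It assumes that the reference points of a track have already been projected onto the road plane and are expressed in metres, which is the part that depends on the camera calibration and is not shown here. The function name `track_speed_kmh` and its arguments are hypothetical.

```python
import math
from statistics import median


def track_speed_kmh(points_m, frame_indices, fps):
    """Median of the interframe speeds of one track, in km/h.

    points_m: road-plane positions (x, y) in metres, one per detection (assumed input).
    frame_indices: index of the frame of each detection, in increasing order.
    fps: frame rate of the recording.
    """
    if len(points_m) < 2:
        raise ValueError("at least two detections are needed to measure speed")

    speeds = []
    for i in range(1, len(points_m)):
        dt = (frame_indices[i] - frame_indices[i - 1]) / fps  # seconds
        if dt <= 0:
            continue
        dist = math.dist(points_m[i - 1], points_m[i])  # metres
        speeds.append(dist / dt * 3.6)  # m/s -> km/h
    return median(speeds)


# A vehicle moving 0.5 m between consecutive frames of a 50 FPS recording.
positions = [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.0, 1.5)]
speed = track_speed_kmh(positions, [0, 1, 2, 3], fps=50)  # about 90 km/h
```

Using the median rather than the mean makes the estimate less sensitive to individual frames in which the bounding box is placed inaccurately.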

4 Evaluation

Figure 5: 3D bounding boxes detected on the videos from the test set of the BrnoCompSpeed brnocompspeed dataset. a-f) Results for the model using the pair VP2-VP3 and input size px are on the left and results for VP1-VP2 and px are on the right. In images a-c) we can observe that there are only minor differences between the models and these predictions can be considered accurate, whereas d-f) show less accurate results. Inaccuracies for the pair VP1-VP2 are greater. g-l) Results for the pair VP2-VP3. Similar results can also be obtained for the other pair. g-i) Various levels of inaccurate placement of the bounding boxes. j) An accurately detected occluded vehicle. k) Sometimes two vehicles get grouped into one bounding box. l) Occlusion can also sometimes result in a false positive between the two vehicles, however such false positives usually get filtered out during tracking.

The output 3D bounding boxes for two of our models are showcased in Fig. 5 along with some cases where our models fail to detect the vehicle accurately.

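| Method | VP pair | Input size (px) | Mean error (km/h) | Median error (km/h) | 95-th percentile (km/h) | Mean precision (%) | Mean recall (%) | FPS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DubskaAuto dubska2014 | - | - | 8.22 | 7.87 | 10.43 | 73.48 | 90.08 | - |
| SochorAuto sochor2017 | - | - | 1.10 | 0.97 | 2.22 | 90.72 | 83.34 | - |
| SochorManual sochor2017 | - | - | 1.04 | 0.83 | 2.35 | 90.72 | 83.34 | - |
| Previous2D CVWW2019 | VP2-VP3 | 640 x 360 | 0.83 | 0.60 | 2.17 | 83.53 | 82.06 | 62 |
| Previous3D CVWW2019 | VP2-VP3 | 640 x 360 | 0.86 | 0.65 | 2.17 | 87.67 | 89.32 | 62 |
| Transform3D | VP2-VP3 | 480 x 270 | 0.92 | 0.72 | 2.35 | 89.26 | 79.99 | 70 |
| Transform3D | VP2-VP3 | 640 x 360 | 0.79 | 0.60 | 1.96 | 87.08 | 83.32 | 62 |
| Transform3D | VP2-VP3 | 960 x 540 | 0.75 | 0.58 | 1.84 | 87.74 | 83.21 | 43 |
| Transform3D | VP1-VP2 | 270 x 480 | 1.12 | 0.84 | 2.84 | 87.68 | 84.06 | 70 |
| Transform3D | VP1-VP2 | 360 x 640 | 1.17 | 0.87 | 2.88 | 88.32 | 86.32 | 62 |
| Transform3D | VP1-VP2 | 540 x 960 | 1.09 | 0.84 | 2.65 | 88.06 | 85.30 | 43 |
| Transform2D | VP2-VP3 | 640 x 360 | 0.92 | 0.69 | 2.18 | 84.73 | 77.58 | 62 |
| Transform2D | VP1-VP2 | 360 x 640 | 1.11 | 0.91 | 2.70 | 86.96 | 79.42 | 62 |
| Orig2D | - | 640 x 360 | 0.93 | 0.73 | 2.53 | 88.47 | 85.20 | 62 |
| MaskRCNN3D | - | 1024 x 576 | 0.88 | 0.64 | 2.19 | 88.44 | 81.89 | 5 |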
Table 1: The results of the compared methods on the split C of the BrnoCompSpeed dataset brnocompspeed . Mean, median and 95-th percentile errors are calculated as means of the corresponding error statistics for each video. Recall and precision are averaged over the videos in the test set. FPS values for our methods are calculated on a machine with 8-core AMD Ryzen 7 2700 CPU, 48 GB RAM and Nvidia TITAN V GPU.
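The aggregation described in the caption, where each statistic is first computed per video and then averaged over the videos, can be sketched as follows. The function names are hypothetical, and the nearest-rank definition of the percentile is an assumption, since the text does not state which percentile definition the evaluation uses.

```python
import math
from statistics import mean, median


def percentile_nearest_rank(values, p):
    """p-th percentile using the nearest-rank definition (an assumed choice)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]


def aggregate_over_videos(errors_per_video):
    """Average the per-video error statistics over all videos.

    errors_per_video: one list of absolute speed errors (km/h) per video.
    """
    means = [mean(errors) for errors in errors_per_video]
    medians = [median(errors) for errors in errors_per_video]
    p95s = [percentile_nearest_rank(errors, 95) for errors in errors_per_video]
    return {"mean": mean(means), "median": mean(medians), "p95": mean(p95s)}
```

Averaging per-video statistics gives every video the same weight regardless of how many vehicles it contains, which differs from pooling all errors into a single list.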

4.1 Speed Measurement Accuracy

We evaluate our method on the speed measurement task on the split C of the BrnoCompSpeed dataset brnocompspeed . The evaluation metrics can be seen in Table 1, and we provide files for evaluation for all of our presented variants, including the ablation experiments, online. We compare our results to available published results on the same data. We include the original method by Dubská et al. dubska2014 denoted as DubskaAuto. We also include its improved version by Sochor et al. sochor2017 in two variants: SochorAuto, which to our knowledge is the most accurate fully automatic method for speed measurement evaluated on the dataset, and SochorManual, which is more accurate but includes a manual adjustment of the scale factor during calibration. We also include our previous work CVWW2019 in two variants: Previous3D, which detects 3D bounding boxes aligned with the vanishing points, and Previous2D, which detects 2D bounding boxes aligned with just two of the vanishing points. Both of these methods require a manual adjustment of the perspective image transformation for some camera angles to work properly, therefore they cannot be considered fully automatic. To our knowledge the results for the method denoted as Previous2D are the best published so far with respect to the speed measurement accuracy on the dataset.

We report results for our method denoted as Transform3D in its six variants described in subsection 3.5. For all of these we report the rate of frames per second that can be processed on a machine with 8-core AMD Ryzen 7 2700 CPU, 48 GB RAM and Nvidia TITAN V GPU. The results show that the variants with the two bigger input sizes using the pair of vanishing points VP2-VP3 outperform all other published methods. The variants using the other pair of vanishing points show worse performance, but are still comparable to the results of SochorAuto.

4.2 Ablation Studies

To properly gauge the impact of the perspective transformation we perform two ablation experiments. We train the standard RetinaNet 2D object detector on the same data as the other models, except that the images are not transformed. We refer to this model as Orig2D. We also train the standard RetinaNet 2D object detector on the transformed images. We use the same 2D bounding boxes as in Transform3D, but without the parameter. We refer to this method as Transform2D. We use the center of the bottom edge of the 2D bounding box to determine the speeds. We train these models with the same hyperparameters as our base model. We perform the ablation experiments only for the image sizes of 640 x 360 and 360 x 640 pixels. We also perform an experiment denoted as MaskRCNN3D where we obtain the 3D bounding boxes of vehicles via their masks obtained using the Mask R-CNN network MaskRCNN pre-trained on the MS COCO dataset COCO . We construct the 3D bounding boxes in the same manner as described in subsection 3.5. The results of the ablation experiments can be seen in Table 1.

When compared to the results of SochorAuto and SochorManual, which rely on Faster R-CNN in combination with background subtraction for detection, it is clear that the use of RetinaNet alone (Orig2D) for detection of vehicles brings significant improvements. Transforming the image (Transform2D) also brings a minor improvement for the pair VP2-VP3. Introducing the construction of the 3D bounding box for this pair improves speed measurement accuracy significantly and is thus beneficial for speed measurement tasks.

Surprisingly, the transformation for the pair VP1-VP2 increases the mean speed measurement error over the non-transformed variant. This may be caused by the rectification of the image resulting in the loss of some visual cues important for object localization.

Results for MaskRCNN3D are worse than the results of our main approach (Transform3D) for the pair VP2-VP3, but better than the results for other methods and ablation experiments. This ablation experiment provides further evidence that constructing a 3D bounding box is beneficial for the task of speed measurement, since it performed better than all of the three ablation setups which only used 2D bounding boxes. This ablation experiment required no task-specific training, however this came at a significant hit to the FPS performance and may thus not be a cost-effective option for real-world applications.

4.3 Computational Costs

All of our variants run at faster than real-time (25 FPS) speeds, though we have to note that the testing videos of the BrnoCompSpeed dataset were recorded with 50 FPS and the speed measurement accuracy can therefore be worse for footage with lower FPS rates. Results show that increasing the input image size tends to result in increased speed measurement accuracy, while also increasing the computational demands reflected by the FPS rates for models of different sizes. Our method is therefore easily configurable to work under different hardware constraints in real-world applications. We were not able to perform FPS measurement for the other published approaches, however the most significant methods SochorAuto and SochorManual both rely on the Faster R-CNN object detector which, in general, is significantly slower than RetinaNet used in our method RetinaNet .

5 Conclusion

We proposed several improvements and extensions to our previously published method CVWW2019 for detection of 3D bounding boxes of vehicles in a traffic surveillance scenario. Our improvements eliminate the need to manually adjust the construction of the perspective transformation for some camera angles. We also extended the transformation method to enable using a different pair of vanishing points.

We have also extended the experimental analysis of our method providing a range of configurations, which allows for a flexibility of choice with respect to the accuracy-speed tradeoff for real-world applications. All of the models can be run in real-time on commercially available GPUs. Configurations relying on smaller input sizes provide a possibility of processing multiple video streams concurrently.

Our improved fully automatic approach led to an improvement in speed measurement on the BrnoCompSpeed dataset brnocompspeed . Compared to our previously published non-automatic state-of-the-art method CVWW2019 we reduced the mean speed measurement error by 10% (0.83 km/h to 0.75 km/h), the median speed measurement error by 3% (0.60 km/h to 0.58 km/h) and the 95-th percentile error by 15% (2.17 km/h to 1.84 km/h). Compared to the state-of-the-art fully automatic method sochor2017 we reduced the mean absolute speed measurement error by 32% (1.10 km/h to 0.75 km/h), the absolute median error by 40% (0.97 km/h to 0.58 km/h) and the 95-th percentile error by 17% (2.22 km/h to 1.84 km/h).
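The quoted percentages follow directly from the values in Table 1. As a quick check, the relative reductions can be recomputed with a small helper. The function name `relative_reduction` is only illustrative.

```python
def relative_reduction(baseline, ours):
    """Relative reduction of an error, in percent of the baseline value."""
    return 100.0 * (baseline - ours) / baseline


# (baseline, ours) pairs in km/h taken from Table 1, with Transform3D at 960 x 540 as "ours".
versus_previous2d = [(0.83, 0.75), (0.60, 0.58), (2.17, 1.84)]
versus_sochor_auto = [(1.10, 0.75), (0.97, 0.58), (2.22, 1.84)]

for baseline, ours in versus_previous2d + versus_sochor_auto:
    print(f"{baseline:.2f} -> {ours:.2f} km/h: {relative_reduction(baseline, ours):.1f}% lower")
```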

The authors would like to thank Adam Herout for his valuable comments. The authors also gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs.


  • (1) Cathey, F., Dailey, D.: A novel technique to dynamically measure vehicle speed using uncalibrated roadway cameras. In: Proceedings of the IEEE Intelligent Vehicles Symposium, 2005., pp. 777–782. IEEE (2005)
  • (2) Corral-Soto, E.R., Elder, J.H.: Slot cars: 3d modelling for improved visual traffic analytics. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 16–24. IEEE (2017)

  • (3) Do, V.H., Nghiem, L.H., Thi, N.P., Ngoc, N.P.: A simple camera calibration method for vehicle velocity estimation. In: Proceedings of the 12th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, pp. 1–5. IEEE (2015)
  • (4) Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q., Tian, Q.: Centernet: Keypoint triplets for object detection. arXiv preprint arXiv:1904.08189 (2019)
  • (5) Dubská, M., Herout, A., Sochor, J.: Automatic camera calibration for traffic understanding. In: Proceedings of the British Machine Vision Conference, vol. 4, p. 8. BMVA Press (2014)
  • (6) Farnebäck, G.: Two-frame motion estimation based on polynomial expansion. In: Proceedings of the Scandinavian conference on Image analysis, pp. 363–370. Springer (2003)
  • (7) Filipiak, P., Golenko, B., Dolega, C.: Nsga-ii based auto-calibration of automatic number plate recognition camera for vehicle speed measurement. In: European Conference on the Applications of Evolutionary Computation, pp. 803–818. Springer (2016)
  • (8) Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3354–3361. IEEE (2012)
  • (9) He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969. IEEE (2017)
  • (10) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. IEEE (2016)
  • (11) Kalman, R.E.: A new approach to linear filtering and prediction problems. Journal of basic Engineering 82(1), 35–45 (1960)
  • (12) Kim, Y., Kum, D.: Deep learning based vehicle position and orientation estimation via inverse perspective mapping image. In: Proceedings of the 2019 IEEE Intelligent Vehicles Symposium, pp. 317–323. IEEE (2019)
  • (13) Kocur, V.: Perspective transformation for accurate detection of 3d bounding boxes of vehicles in traffic surveillance. In: Proceedings of the 24th Computer Vision Winter Workshop, pp. 33–41 (2019)
  • (14) Lan, J., Li, J., Hu, G., Ran, B., Wang, L.: Vehicle speed measurement based on gray constraint optical flow algorithm. Optik 125(1), 289–295 (2014)
  • (15) Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European Conference on Computer Vision, pp. 734–750 (2018)
  • (16) Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. IEEE (2017)
  • (17) Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European conference on computer vision, pp. 740–755. Springer (2014)
  • (18) Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: Ssd: Single shot multibox detector. In: Proceedings of the European conference on computer vision, pp. 21–37. Springer (2016)
  • (19) Maduro, C., Batista, K., Peixoto, P., Batista, J.: Estimation of vehicle velocity and traffic intensity using rectified images. In: Proceedings of the 15th IEEE International Conference on Image Processing, pp. 777–780. IEEE (2008)
  • (20) Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3d bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7074–7082. IEEE (2017)
  • (21) Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. IEEE (2016)
  • (22) Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems, pp. 91–99 (2015)
  • (23) Schoepflin, T.N., Dailey, D.J.: Dynamic camera calibration of roadside traffic management cameras for vehicle speed estimation. IEEE Transactions on Intelligent Transportation Systems 4(2), 90–98 (2003)
  • (24) Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3d object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1991–1999 (2019)
  • (25) Sochor, J., Juránek, R., Herout, A.: Traffic surveillance camera calibration by 3d model bounding box alignment for accurate vehicle speed measurement. Computer Vision and Image Understanding 161, 87–98 (2017)
  • (26) Sochor, J., Juránek, R., Špaňhel, J., Maršík, L., Široký, A., Herout, A., Zemčík, P.: Comprehensive data set for automatic single camera visual speed measurement. IEEE Transactions on Intelligent Transportation Systems 20(5), 1633–1643 (2018)
  • (27) Sochor, J., Špaňhel, J., Herout, A.: Boxcars: Improving fine-grained recognition of vehicles using 3-d bounding boxes in traffic surveillance. IEEE Transactions on Intelligent Transportation Systems 20(1), 97–108 (2018)
  • (28) You, X., Zheng, Y.: An accurate and practical calibration method for roadside camera using two vanishing points. Neurocomputing 204, 222–230 (2016)
  • (29) Zapletal, D., Herout, A.: Vehicle re-identification for automatic video traffic surveillance. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–31 (2016)
  • (30) Zeng, R., Ge, Z., Denman, S., Sridharan, S., Fookes, C.: Geometry-constrained car recognition using a 3d perspective network. arXiv preprint arXiv:1903.07916 (2019)
  • (31) Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)