Online Clustering-based Multi-Camera Vehicle Tracking in Scenarios with overlapping FOVs

by   Elena Luna, et al.

Multi-Target Multi-Camera (MTMC) vehicle tracking is an essential task of visual traffic monitoring, one of the main research fields of Intelligent Transportation Systems. Several offline approaches have been proposed to address this task; however, they are not compatible with real-world applications due to their high latency and post-processing requirements. In this paper, we present a new low-latency online approach for MTMC tracking in scenarios with partially overlapping fields of view (FOVs), such as road intersections. Firstly, the proposed approach detects vehicles at each camera. Then, the detections are merged between cameras by applying cross-camera clustering based on appearance and location. Lastly, the clusters containing different detections of the same vehicle are temporally associated to compute the tracks on a frame-by-frame basis. The experiments show promising low-latency results while addressing real-world challenges such as the a priori unknown and time-varying number of targets and the continuous state estimation of them without performing any post-processing of the trajectories.




I Introduction

Intelligent Transportation Systems (ITS) are considered a key part of smart cities. In line with the accelerated development of modern sensors, new computing capabilities and communications, ITS technology attracts the attention of both academia and industry. ITS aim to offer smarter transportation facilities and vehicles, along with safer transport services.

One of the main research fields in ITS is visual traffic monitoring using video analytics on data captured by visual sensors. This data can be used to provide information, such as traffic flow estimates, or to detect traffic patterns or anomalies. In recent years it has become an active field within the computer vision community [1, 2, 3]; however, it remains a challenging task [4], particularly when multiple cameras are considered.

In contrast to single-camera traffic monitoring, multi-camera setups require a more complex infrastructure, the ability to handle more simultaneous data, and higher processing capacity. Multi-Target Multi-Camera (MTMC) tracking algorithms are fundamental for many ITS technologies.

Unlike Multi-Target Single-Camera (MTSC) tracking [5, 6], MTMC tracking entails the analysis of visual signals captured by multiple cameras, covering setups with overlapping fields of view (FOVs) as well as wide-area monitoring scenarios, where cameras may be separated by large distances. Road intersections are well-known targets for monitoring due to the high number of reported accidents and collisions [7]. These intersections are intrinsically complex because of the variety of vehicle behaviours. Such scenarios are usually monitored with multiple partially overlapping cameras, which introduces new challenges but also powerful opportunities for video analysis (e.g. traffic flow optimization and pedestrian behaviour analysis).

For the multi-camera tracking problem, efficient data association across cameras, and also across frames, is the key problem to solve. A considerable number of existing MTMC vehicle tracking algorithms follow an offline batch processing scheme to carry out the association [8, 9, 10, 11, 12, 13, 14, 15]. They consider previous and future frames, and often the whole video sequences at once, to merge vehicle trajectories across cameras and time. They also rely on post-processing techniques to refine the resulting trajectories. This offline scheme is more robust than online designs, but it is not compatible with online applications, which limits its applicability in real-time traffic monitoring scenarios.

In this paper, we describe the first, to the best of our knowledge, low-latency online MTMC vehicle tracking approach for cameras with partially overlapping FOVs capturing intersection scenarios. The proposed system follows an online, frame-by-frame processing scheme. Furthermore, compared to other state-of-the-art systems (see Table I), our approach does not perform any post-processing track refinement, it is agnostic to potential motion patterns (i.e. it works without prior knowledge of vehicle paths within the cameras’ FOVs) and it does not require additional manual ad-hoc annotations (e.g. definition of regions and boundaries on the roads). These last two characteristics avoid the need to configure each real set-up where the system is deployed, improving flexibility and generalising its use.

The proposed MTMC tracking approach builds upon the detection of multiple vehicles in every single camera. Afterwards, a cross-camera agglomerative clustering, combining spatial locations (using GPS coordinates) and appearance features, is used to merge vehicles from different cameras. This clustering is evaluated using validation indexes and, finally, a temporal linkage of the obtained clusters is performed to obtain the trajectory of each moving vehicle in the scene over time.

Approach / Processing / Post-processing of tracks / Awareness of motion patterns / Manual annotations / Level of association
Baidu [8]
NCCU-UAlbany [9] - -
BUPT [11] - - -
ANU [12] - - -
UWIPL [13]
DiDi Global [14] - -
Shanghai Tech. U. [15] -
Ours Online - - - Detections
TABLE I: Comparison of available MTMC vehicle tracking approaches. The table shows differences regarding the type of processing, the use of post-processing of tracks, the awareness of the vehicles’ motion patterns, the use of manually annotated ad-hoc information and the level of cross-camera association. As can be seen, ours is the only online approach and the only one that uses detections to perform the cross-camera association. In addition, we do not employ any post-processing of the tracks, we are agnostic to the motion patterns of the vehicles and we do not use any additional manual annotations.

This paper is an extended version of our related conference publication [16], with additional contributions as follows. First, we include and evaluate the impact of additional object detectors. Second, we remove any offline dependency so that the approach is genuinely online. Third, we design and train a completely new appearance feature extraction model, and also investigate the impact of an additional dataset for training. Fourth, we improve the cross-camera clustering and temporal association reasoning. Fifth, we design and implement a new occlusion handling strategy. Lastly, we perform a wide ablation study to measure the impact of different parameters and strategies at different stages of the proposal, and we show results in a detailed comparison with the state-of-the-art.

The paper is organized as follows. Section II reviews the state-of-the-art in MTMC vehicle tracking. Section III describes the proposed approach. Section IV presents the evaluation framework, the implementation details, the ablation study and a comparison with the state-of-the-art. Finally, concluding remarks are given in Section V.

II Related work

In recent years, several approaches devoted to tracking pedestrians in multi-camera environments have been published [17, 18, 19, 20, 21, 22]. The release of public benchmarks such as MARS [23] and DukeMTMC [24] encouraged the research community to put effort into Multi-Target Multi-Camera tracking of people.

Due to the lack of appropriate publicly available datasets, MTMC tracking focused on vehicles was a nearly unexplored field. To encourage research and development in ITS problems, the AI City Challenge Workshop launched three distinct but closely related tasks: 1) City-Scale Multi-Camera Vehicle Tracking, 2) City-Scale Multi-Camera Vehicle Re-Identification and 3) Traffic Anomaly Detection. Focusing on MTMC tracking, the CityFlow benchmark was presented [25]. At the time of publication, it is the only dataset and benchmark for MTMC vehicle tracking. Figure 1 depicts four sample views from an intersection in the CityFlow benchmark.

Fig. 1: Sample views from an intersection in CityFlow benchmark.

The major challenge of tracking vehicles is the viewpoint variation problem. As can be seen in Figure 2, different vehicles may appear quite similar from the same viewpoint, whereas the same vehicle captured from different viewpoints may be difficult to recognise. It can be extremely hard, even for humans, to determine whether two vehicles seen from different points of view are the same car (e.g., as shown in Figure 2, pairs [(a), (d)], [(b), (e)] and [(c), (f)]).

Fig. 2: Illustration of the viewpoint variation problem. Under the same view different vehicles may appear very similar (a), (b) and (c), while the same car from different viewpoints may be extremely difficult to recognise [(a), (d)], [(b), (e)] and [(c), (f)].

According to the processing scheme, MTMC tracking methods can be categorized into two groups: 1) offline methods, and 2) online methods. Offline tracking methods perform a global optimization to find the optimal association using the entire video sequence. The vehicle detections are temporally grouped into tracklets (short trajectories of detections) using MTSC tracking techniques and, afterwards, tracklet-to-tracklet association is performed, mainly by using re-identification techniques: considering the whole video sequences at once [8, 11, 13, 14, 15], considering windows of frames [10], or even combining both approaches [12].

On the other hand, online approaches need to perform cross-camera association of target detections on a frame-by-frame basis, using the detectors’ outputs (usually, bounding boxes) as the smallest unit for matching, instead of tracklets.

As can be seen in Table I, to the best of our knowledge, all existing approaches work offline. In order to remove false positive trajectories or ID switches [24], offline approaches may apply post-processing filtering at the end of some intermediate stages [8, 10, 14], or at the end of the whole procedure [13]. Being aware of the motion patterns that vehicles can adopt in every camera view can also help to remove undesired trajectories and, therefore, increase the recall [8, 9, 10, 13, 15]. Working offline also allows additional temporal constraints to be applied to increase the performance [8, 13, 10]. Another strategy to improve overall performance consists in incorporating additional manually annotated, scenario-specific information; for example, additional vehicle attributes (colour, type, etc.) for building a better appearance model [8], or road boundaries [10].

It is common in the MTMC tracking literature to treat the tracklet-to-tracklet cross-camera association task as a clustering problem, grouping tracklets by appearance features [26, 8, 11], or by combining appearance with other constraints (e.g., time and location) [27, 28, 13, 10]. Clustering algorithms are often divided into two broad categories: 1) partitioning algorithms (center-based, e.g. K-means [29], or density-based, e.g. DBSCAN [30]), and 2) hierarchical clustering [31] (agglomerative or divisive). Hierarchical algorithms build clusters gradually (as a tree of clusters) and do not require the number of clusters to be specified in advance, whereas partitioning algorithms learn clusters at once and require either the number of clusters (K-means) or the minimum number of points defining a cluster (DBSCAN). Hierarchical clustering is therefore advantageous when there is no prior knowledge about the number of clusters. Its output, however, is a tree of clusters, commonly represented as a dendrogram, which does not provide the number of clusters but describes the relations between the data. For this reason, cluster validation techniques, such as the Davies-Bouldin index [32], the Dunn index [33] or the Silhouette coefficient [34], are used to determine the number of clusters, which may differ for each technique. In the proposed work, as there is no prior knowledge about the number of vehicles in the scene, we apply agglomerative hierarchical clustering combining location and appearance information.

Existing MTMC vehicle tracking approaches first compute tracklets by temporally merging detections on every single camera, and then perform cross-camera tracklet-to-tracklet association. In contrast, we first compute clusters by cross-camera association of vehicle detections and afterwards, on a frame-by-frame basis, temporally associate the clusters to compute the tracks.

III Proposed Approach

Fig. 3: Block diagram of the proposed approach. The inputs are the frames from the cameras, and the trajectories are computed for each frame. First, the vehicle detection block computes the set of vehicle detections, which feeds both the feature extraction and homography projection blocks. These produce, respectively, the set of appearance feature descriptors and the set of GPS world coordinates of every vehicle. The cross-camera clustering block uses both sets to aggregate different views of the same vehicle and to compute the set of clusters at each temporal instant. Lastly, the temporal association block links the clusters over time to compute the set of tracks.

In the proposed online Multi-Target Multi-Camera tracking approach, the videos from all cameras are processed simultaneously, frame by frame, without any post-processing of the trajectories. The approach is composed of five processing blocks, as shown in Figure 3. As input, we consider a network of calibrated and synchronized cameras with partially overlapping FOVs providing independent video sequences. Given such a network, the pipeline includes the following stages: (1) vehicle detection, (2) feature extraction, (3) homography projection, which projects the vehicles detected in each camera to the world coordinate system (GPS) to provide location information, (4) cross-camera clustering, which is fed by the outputs of blocks (2) and (3), and (5) temporal association of the clusters over time to compute the tracks. As a result, the system generates tracks consisting of the identity and location of every vehicle over time. The design of the processing blocks is detailed in the following subsections, whilst the implementation details are given in Section IV-B.

Table II summarizes the notation used in this section, together with the scope of each variable. Scenario refers to the set of cameras, Frames stands for all the simultaneous images coming from the cameras at each temporal instant, and Sequence comprises all the aggregated frames coming simultaneously from the cameras. The number of cameras and their homography matrices are intrinsic to the scenario, while the association radius is a design parameter. The number of detections and the sets of bounding boxes, GPS locations, feature descriptors and clusters are computed at each temporal instant from the simultaneous frames. Lastly, the set of tracks is updated frame by frame over the whole sequence.

Number of cameras Scenario

Homography matrix of camera Scenario
Association radius Scenario
Number of total detections Frames
Set of bounding boxes Frames
Set of GPS world locations Frames
Set of feature descriptors Frames
Set of clusters Frames
Set of tracks Frames, Sequence
TABLE II: Notation used throughout the paper.

III-A Vehicle Detection

Like most state-of-the-art MTMC tracking methods, we follow the tracking-by-detection paradigm. Therefore, the first stage of the pipeline is vehicle detection at each frame. Each detection is a bounding box described by the pixel coordinates of its upper-left corner together with its width and height. The set of bounding boxes gathers, at each frame, the detections from all the cameras; its size is the total number of detections.

Note that the proposal can incorporate any single-camera vehicle detection algorithm whose output is in a bounding box form.

III-B Feature Extraction

To describe the appearance of each bounding box detection, we compute a fixed-dimensional deep feature descriptor. The set of appearance feature descriptors gathers, at each frame, the descriptors of all the detected vehicles.

Due to the intrinsic geometry of vehicles, their appearance may vary strongly across different camera views. This variation can be so large that, even for a human being, it is very hard to determine whether two views show the same vehicle. Thus, in order to obtain highly discriminative features, we trained a model to improve the vehicle classification ability in the target scenario. More details on this vehicle-specific model are given in Section IV-B2.


Class imbalance is a form of the imbalance problem [35] that occurs when the number of examples belonging to each class in the data is highly unequal. When not addressed, it may have negative effects on the final performance. It is known that classes with a higher number of observations tend to dominate the learning process, hindering the learning and generalization of under-represented classes. In order to minimize the effects of the imbalance, instead of the classical Cross-Entropy (CE) loss [36], we employ the focal loss (FL) proposed in [37].
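To illustrate why the focal loss reduces the dominance of well-classified examples, the following minimal sketch compares it with the cross-entropy loss for a single sample. It assumes the basic form of the focal loss, i.e. the cross-entropy scaled by (1 - p)^gamma, where p is the predicted probability of the true class. The value gamma = 2 and the function names are assumptions made for illustration; the focusing parameter used in our training is not restated here.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy loss of one sample, given the probability of its true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0) -> float:
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma (gamma is assumed)."""
    return (1.0 - p_true) ** gamma * cross_entropy(p_true)


# An easy sample (confidently correct) and a hard sample (mostly wrong).
easy, hard = 0.9, 0.1
ratio_easy = focal_loss(easy) / cross_entropy(easy)  # weight applied to the easy sample
ratio_hard = focal_loss(hard) / cross_entropy(hard)  # weight applied to the hard sample
```

With gamma = 0 the focal loss reduces to the cross-entropy loss, while larger values down-weight easy samples more strongly.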

III-C Homography-based Projection

This processing block computes the location of each detected vehicle on the common ground plane using GPS coordinates. Each camera has a homography matrix that transforms coordinates from its image plane to the GPS coordinates of the common ground plane; the inverse matrix performs the inverse transformation. We leverage the GPS coordinates to achieve high-precision clustering based on location information. Given a bounding box, its associated GPS coordinates (latitude and longitude) are obtained by projecting the middle point of its base with the homography of the corresponding camera. The set of GPS coordinates is obtained by applying this transformation to the whole set of bounding boxes.
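The projection of a detection can be sketched as follows, assuming a bounding box given as (x, y, width, height) with (x, y) the upper-left corner, and a 3x3 homography stored as nested lists. The function names and the demonstration matrix are hypothetical; real homographies are provided by the benchmark for each camera.

```python
def project_point(H, x, y):
    """Apply a 3x3 homography H to the image point (x, y) in homogeneous coordinates."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def bbox_to_ground(H, bbox):
    """Project the middle point of the base of bbox = (x, y, width, height)."""
    x, y, width, height = bbox
    # Image rows grow downwards, so the base of the box lies at y + height.
    return project_point(H, x + width / 2.0, y + height)


# Hypothetical homography (a pure scaling by 0.5), for illustration only.
H_demo = [[0.5, 0.0, 0.0],
          [0.0, 0.5, 0.0],
          [0.0, 0.0, 1.0]]
ground_xy = bbox_to_ground(H_demo, (100, 50, 40, 30))
```

The middle point of the base is used because it is the point of the bounding box most likely to lie on the ground plane, where the homography is valid.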

Figure 4 illustrates an example of the projected detections coming from different cameras. Note that this block relies on the output of the object detection stage, and along with the feature extraction module, it feeds the cross-camera clustering.

Fig. 4: Vehicle detections from four partially overlapping cameras projected to GPS coordinates at a certain temporal instant. Detections within an 8-meter radius are more likely to be joined. (Best viewed in color)

III-D Cross-camera Clustering

Given the sets of bounding boxes, GPS locations and feature descriptors, the cross-camera clustering block associates different camera views of the same vehicle at each frame to compute the set of clusters at that frame. The content of a cluster ranges from a single detection, if the vehicle is visible from only one camera, to as many detections as there are cameras capturing the scene. To create the clusters, we compute a frame-by-frame linkage by performing an agglomerative hierarchical clustering that combines location and appearance features.

Hierarchical clustering [31] requires a square connectivity matrix of distances (dissimilarities) or similarities between the input data to be merged. We compute the connectivity matrix as a constrained pairwise feature distance between all the vehicles coming from every camera at each frame. At each frame, we compute the pairwise Euclidean distance between the appearance feature vectors of all the vehicles under consideration.

Also at each frame, we compute the pairwise Euclidean distance between the GPS coordinates of all the vehicles.

The spatial distance and the camera ID are used to apply two constraints. Since two detections widely separated in GPS coordinates are highly unlikely to come from the same vehicle, it is reasonable to assume a maximum association distance. This constraint narrows down the list of vehicles to be matched and improves the ability to distinguish different identities by comparing only nearby targets. Hence, the connectivity matrix retains the appearance distance only for pairs of detections whose spatial distance does not exceed the maximum association radius; pairs farther apart are excluded from association.

A second condition prevents detections from the same camera view from being merged together: pairs of detections yielded by the same camera are likewise excluded from association.
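A minimal sketch of the constrained connectivity matrix is given below. It assumes that excluded pairs are encoded with an infinite distance and that positions are expressed in meters on a local plane; both are assumptions made for illustration, as are the function and variable names.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def connectivity_matrix(features, positions, cameras, radius):
    """Appearance distance between detections, constrained by location and camera."""
    n = len(features)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            too_far = euclidean(positions[i], positions[j]) > radius
            same_camera = cameras[i] == cameras[j]
            if too_far or same_camera:
                distance = float("inf")  # assumed encoding of an excluded pair
            else:
                distance = euclidean(features[i], features[j])
            matrix[i][j] = matrix[j][i] = distance
    return matrix


# Four toy detections: 2-D descriptors, planar positions and camera IDs.
features = [[0.0, 0.0], [3.0, 4.0], [0.0, 1.0], [0.0, 0.0]]
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (50.0, 0.0)]
cameras = [0, 1, 1, 2]
conn = connectivity_matrix(features, positions, cameras, radius=8.0)
```

In this toy example, detections 1 and 2 cannot be merged because they come from the same camera, and detection 3 cannot be merged with any other because it lies outside the association radius.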

Fig. 5: Dendrogram illustrating the hierarchical relationships between all the detected vehicles at a certain temporal instant. After cluster validation, the dendrogram is cut at the level marked by the red line: detections that are joined together below the red line are part of the same cluster. In this example (18, 25), (6, 27), (2, 14, 28, 19) and (13, 29) are joined together, while each of the remaining detections forms a cluster by itself.

As stated above, hierarchical clustering methods start from a connectivity matrix to compute a tree of clusters; this structure does not provide the number of clusters, but describes the relations between the data. These relationships can be represented by a tree diagram called a dendrogram. An example is presented in Figure 5. In order to cut the dendrogram and identify the optimal number of clusters, we use the Dunn index [33] for cluster validation. The aim of this index is to find clusters that are compact, with a small variance between members of the cluster, and well separated, by comparing the minimal inter-cluster distance to the maximal cluster diameter. The cluster diameter is defined as the distance between the two farthest elements in the cluster. This process provides the number of vehicles in the scene at every frame, in the form of clusters, as well as their locations, in the form of the clusters’ centroids (computed as the mean, along each coordinate axis, of all the members of the cluster). In summary, at every frame, each cluster designates an existing vehicle.
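The Dunn index of a candidate partition can be sketched as follows, taking a precomputed distance matrix and one cluster label per detection. The sketch only scores a given cut; the agglomerative merging itself is not shown. Raising an error for degenerate partitions (a single cluster, or only singleton clusters) is a choice made here for illustration, not a detail taken from our implementation.

```python
def dunn_index(dist, labels):
    """Smallest inter-cluster distance divided by the largest cluster diameter."""
    n = len(labels)
    min_separation = float("inf")
    max_diameter = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                max_diameter = max(max_diameter, dist[i][j])
            else:
                min_separation = min(min_separation, dist[i][j])
    if max_diameter == 0.0 or min_separation == float("inf"):
        raise ValueError("Dunn index is undefined for this partition")
    return min_separation / max_diameter


# Four points on a line, forming two obvious groups: {0, 1} and {10, 11}.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
good_cut = dunn_index(dist, [0, 0, 1, 1])  # compact, well-separated clusters
bad_cut = dunn_index(dist, [0, 1, 1, 1])   # one stretched cluster
```

The cut with the highest index is preferred; here the natural grouping scores far higher than the stretched one.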

III-E Temporal Association

The last stage of the proposed approach links clusters over time to estimate the vehicle tracks. A track defines the trajectory of a moving vehicle as a succession of states, each described by the target location and the target velocity, both represented using GPS coordinates. The set of tracks covers the whole video sequence: in contrast to the previous sets, which are initialized at each frame, it is built incrementally, i.e. it is computed at the first frame and updated over time. In other words, tracks depict the location of clusters over time. As in the rest of the system, the temporal association is performed online, that is, frame by frame.

Vehicles’ motion is estimated using a constant-velocity Kalman filter [38]. The Kalman filter estimates the state of the target by combining, through a weighted average, the state predicted from the previous frame and the new measurement at the current frame. The resulting estimate lies between the prediction and the measurement. Thus, at each frame, on the one hand we employ the Kalman filter to predict the location of the tracks of the previous frame, and on the other hand we obtain the current vehicle measurements as the clusters resulting from the cross-camera association.
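A constant-velocity Kalman filter for a single coordinate can be sketched as below; a track would run one such filter per axis. The noise values, the initial covariance and the class name are assumptions chosen for illustration only.

```python
class ConstantVelocityKalman1D:
    """Kalman filter for one coordinate with state (position, velocity)."""

    def __init__(self, position, velocity=0.0, p0=1.0, q=0.01, r=1.0):
        self.x = [position, velocity]      # state estimate
        self.P = [[p0, 0.0], [0.0, p0]]    # state covariance
        self.q = q                         # process noise variance (assumed)
        self.r = r                         # measurement noise variance (assumed)

    def predict(self, dt=1.0):
        """Propagate the state with F = [[1, dt], [0, 1]] and P <- F P F^T + q I."""
        p, v = self.x
        (a, b), (c, d) = self.P
        self.x = [p + dt * v, v]
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.x[0]

    def update(self, z):
        """Correct the prediction with a position measurement z (H = [1, 0])."""
        p, v = self.x
        (a, b), (c, d) = self.P
        s = a + self.r           # innovation covariance
        k0, k1 = a / s, c / s    # Kalman gain
        y = z - p                # innovation
        self.x = [p + k0 * y, v + k1 * y]
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.x[0]


kf = ConstantVelocityKalman1D(position=0.0)
for frame in range(1, 21):
    kf.predict()
    kf.update(2.0 * frame)    # noise-free target moving 2 units per frame
next_position = kf.predict()  # one-step-ahead prediction for frame 21
```

Although the filter starts with zero velocity, it learns the motion of the target from the measurements, so the prediction for the next frame is close to the true position.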

In order to associate both, we apply the Hungarian algorithm [39] to solve the assignment problem, using an association matrix to enumerate all possible assignments. The association matrix is filled with the pairwise L2-norm, i.e. the Euclidean distance, between the estimated locations of the tracks and the locations of the clusters’ centroids (see Section III-D).
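The assignment step can be illustrated on a small example. The sketch below builds the association matrix from Euclidean distances and finds the minimum-cost assignment by exhaustive search, which stands in for the Hungarian algorithm: both return an assignment of the same minimum cost, but the exhaustive search is only practical for a handful of targets. All names and coordinates are hypothetical.

```python
import math
from itertools import permutations


def association_matrix(track_positions, centroids):
    """Pairwise Euclidean distance between predicted tracks and cluster centroids."""
    return [[math.hypot(tx - cx, ty - cy) for cx, cy in centroids]
            for tx, ty in track_positions]


def best_assignment(cost):
    """Minimum-cost assignment of every row to a distinct column (rows <= columns)."""
    n_rows, n_cols = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), []
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[row][col] for row, col in enumerate(cols))
        if total < best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


tracks = [(0.0, 0.0), (10.0, 0.0)]                 # predicted track locations
clusters = [(9.0, 1.0), (1.0, 0.0), (30.0, 30.0)]  # centroids at the current frame
total_cost, pairs = best_assignment(association_matrix(tracks, clusters))
```

Here track 0 is matched to cluster 1 and track 1 to cluster 0, while the distant third cluster is left unmatched.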

To provide robustness against occlusions we designed two strategies: a blind occlusion handling and a reprojection-based occlusion handling. The first keeps a track alive for a short time when the detections associated with it are lost. Continuing to predict the position of the track during that period allows it to be recovered if the detections reappear. This is helpful if the vehicle detector misses a detection, either due to poor detection performance or to a hard occlusion. The second strategy detects whether a track has lost one or more of its associated detections and looks for the same track in the previous frame to retrieve the size of its previously associated bounding boxes. The new location in the current frame is inferred by applying the corresponding inverse homography matrix (that of the camera for which the detection is missing) to the estimated track position. Therefore, when this strategy finds a track whose detection or detections are lost, mostly due to an occlusion the detector cannot deal with, we can generate an artificial detection with an accurate estimate of the position and with the previously detected size of the occluded vehicle.
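The bookkeeping behind the blind occlusion handling can be sketched as a counter of consecutive missed frames per track. The limit of five frames and the function name are hypothetical; the sketch only shows how a track survives a short gap and is dropped after a longer one.

```python
def update_missed_counts(missed, matched_ids, max_missed=5):
    """Update per-track missed-frame counters and drop tracks lost for too long."""
    updated = {}
    for track_id, count in missed.items():
        # Reset the counter on a match, otherwise count one more missed frame.
        count = 0 if track_id in matched_ids else count + 1
        if count <= max_missed:
            updated[track_id] = count
    return updated


# Track 1 has just lost its detections, track 2 has been lost for five frames
# already, and track 3 is matched again after two missed frames.
missed = {1: 0, 2: 5, 3: 2}
alive = update_missed_counts(missed, matched_ids={3})
```

While a track is unmatched but still alive, its position keeps being predicted by the motion model, so it can be associated again once detections reappear.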

IV Experiments

IV-A Evaluation framework

IV-A1 Datasets

We considered the CityFlow benchmark [25], since there is no other publicly available dataset devoted to MTMC vehicle tracking with partially overlapping FOVs. The dataset comprises videos of 40 cameras, 195 total minutes recorded for all cameras, and manually annotated ground-truth consisting of 229,690 bounding boxes for 666 vehicles. The dataset is divided into 5 scenarios (S01, S02, S03, S04 and S05) covering intersections and stretches of roadways. S01 and S02 have overlapping FOVs, while S03, S04 and S05 are wide-area scenarios. The CityFlow benchmark also provides the camera homography matrices between the 2D image plane and the ground plane defined by GPS coordinates based on the flat-earth approximation.

We have also used VeRi-776 dataset for improving the feature extraction model by using it as additional training data. VeRi-776 [40] is one of the largest and most common datasets for vehicle re-identification in multi-camera scenarios. It comprises about 50,000 bounding boxes of 776 vehicles captured by 20 cameras.

IV-A2 Evaluation Metrics

The MTMC tracking ground-truth provided by the CityFlow benchmark consists of the bounding boxes of multi-camera vehicles labeled with consistent IDs.

Following the CityFlow benchmark evaluation methodology, the Identification Precision (IDP), Identification Recall (IDR) and IDF1 score measures [24] are adopted:

$$IDP = \frac{IDTP}{IDTP + IDFP}, \qquad IDR = \frac{IDTP}{IDTP + IDFN}, \qquad IDF_1 = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}$$

where IDTP, IDFP and IDFN stand for True Positive ID, False Positive ID and False Negative ID, respectively. IDP (IDR) is the fraction of computed (ground-truth) tracks that are correctly identified. IDF1 is the ratio of correctly identified tracks over the average number of ground-truth and computed tracks.
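As a worked check of these measures, the sketch below computes them from the three identification counts, assuming the standard definitions of [24]. The counts are made up for illustration and the function name is hypothetical.

```python
def id_scores(idtp, idfp, idfn):
    """Identification precision, recall and F1 from the ID counts."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


# Made-up counts: 50 true positive IDs, 50 false positives, 25 false negatives.
idp, idr, idf1 = id_scores(idtp=50, idfp=50, idfn=25)
harmonic_mean = 2 * idp * idr / (idp + idr)
```

The last line makes explicit that the F1 score is the harmonic mean of the identification precision and recall.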

The tracks automatically obtained by the proposed method are compared pairwise with the ground-truth tracks. We declare a match, i.e. a True Positive ID, when two tracks temporally coexist and the area of the intersection of their bounding boxes exceeds a fixed fraction of the area of the union of the two boxes. This fraction is the Intersection over Union (IoU) threshold. A high IDF1 score is obtained when the correct multi-camera vehicles are detected, accurately tracked within each camera view, and labeled with a consistent ID across all the views in the dataset.
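The overlap test can be sketched as follows for boxes given as (x, y, width, height), with (x, y) the upper-left corner. The box layout and the function name are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


# Two 10x10 boxes shifted by half their width: 50 shared pixels out of 150.
overlap = iou((0, 0, 10, 10), (5, 0, 10, 10))
```

A pair of boxes counts as a match only if this value exceeds the chosen threshold.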

IV-A3 Hardware and software

The algorithm and the model training have been implemented using the PyTorch 1.0.1 deep learning framework, running on a computer with a 6-core CPU and an NVIDIA GeForce GTX 1080 12GB Graphics Processing Unit.

IV-B Implementation details

IV-B1 Vehicle detection

Regarding single-camera vehicle detection, we have experimented with public detections, i.e. vehicle detections provided by the CityFlow benchmark, and private detections, computed using a state-of-the-art algorithm. The public detections were obtained by using three popular detectors: Yolo v3 [41], SSD512 [42] and Mask R-CNN [43]. Yolo v3 is a one-stage object detector that solves detection as a regression problem. SSD512 is also a single-shot detector, which directly predicts category scores and box offsets for a fixed set of default bounding boxes of different scales at each location. Mask R-CNN, on the contrary, is a two-stage detector consisting of a region proposal network that feeds region proposals into a classifier and a regressor.

Moreover, we have complemented the provided detections with those obtained by the EfficientDet [44] algorithm, a top-performing state-of-the-art object detector. EfficientDet is also a one-stage detector that uses EfficientNet [45] as the backbone network and a bi-directional feature pyramid network (BiFPN).

All these approaches make use of models pre-trained on the COCO benchmark [46]. For our purpose, we considered only detections classified as instances of the car, truck and bus classes.
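The selection of detections can be sketched as a simple filter on class label and confidence score. The dictionary layout of a detection, the function name and the choice of keeping scores equal to the threshold are assumptions made for illustration.

```python
VEHICLE_CLASSES = {"car", "truck", "bus"}


def filter_detections(detections, score_threshold=0.2):
    """Keep detections of vehicle classes whose score reaches the threshold."""
    return [det for det in detections
            if det["label"] in VEHICLE_CLASSES and det["score"] >= score_threshold]


# Hypothetical detector output for one frame.
detections = [
    {"label": "car", "score": 0.90},
    {"label": "person", "score": 0.95},
    {"label": "truck", "score": 0.15},
    {"label": "bus", "score": 0.25},
]
kept = filter_detections(detections, score_threshold=0.2)
```

Lowering the threshold keeps low-confidence detections, such as the truck above, which are typically produced by remote or partially visible vehicles.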

IV-B2 Feature extraction

For the feature extraction network, we employ ResNet-50 [47] as backbone, but the original classification layer (the fc_1 layer), shaped for image classification on the ImageNet dataset, is replaced by a new classification layer whose size is tailored to the total number of identities in the training data. In order to leverage the weights pre-trained on ImageNet, we fine-tune the network but freeze it up to the conv_5 layer.

To fine-tune the network, we used the CityFlow benchmark training data (S01, S03 and S04) and we also included the VeRi-776 dataset, giving a total of 905 vehicle IDs for training (129 IDs from CityFlow, plus 776 IDs from VeRi-776). Since only the training identities are known, the network learns features to correctly classify the 905 different training vehicle identities. Validation is performed on pairs of unseen vehicles, checking whether the predictions for both images are the same or not, which measures the ability of the network to discern different views of the same target. To create these pairs, we randomly select half of the data from the S05 scenario to create a validation set of 169 IDs. We forced the validation batch to contain approximately 50% positive and 50% negative pairs. The pair selection is done randomly over the set of IDs, instead of the set of images, so that IDs containing few samples are not penalized. At inference, we adopt, as a 2048-dimensional descriptor, the output of the average pooling layer, just before the classifier.

Each input image containing a bounding box of a vehicle is adapted to the network by resizing it to the network input size, and pixel values are normalized by the mean and standard deviation of the ImageNet dataset. In order to reduce model overfitting and to improve generalization, we perform several random data augmentation techniques such as horizontal flip, dropout, Gaussian blur and contrast perturbation.

To minimize the loss function and optimize the network parameters, we adopt the Stochastic Gradient Descent (SGD) solver. Experimentally, the initial learning rate was set to 0.1 and we follow a step decay schedule, dropping it by a factor of 0.1 every 25 epochs. Momentum was set to 0.9, and weight decay was also applied.
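The step decay schedule can be written in closed form, assuming that dropping by 0.1 means multiplying the learning rate by 0.1 at every step. The function name is hypothetical.

```python
def step_decay_lr(epoch, initial_lr=0.1, factor=0.1, step=25):
    """Learning rate at a given epoch under a step decay schedule."""
    return initial_lr * factor ** (epoch // step)


# Learning rate at the start of training and after each of the first two drops.
schedule = [step_decay_lr(epoch) for epoch in (0, 24, 25, 50)]
```

Under this reading, the learning rate stays at 0.1 for the first 25 epochs, then falls to 0.01 and later to 0.001.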


IV-C Ablation Study

This section measures the impact of the strategies used along the different stages of the proposed approach. Firstly, the effect of using different vehicle detectors is evaluated. Secondly, the influence of the association radius parameter is analysed. Subsequently, we gauge the influence of the appearance model training method as well as the size of the feature embedding. Finally, some additional strategies (e.g. occlusion handling) are assessed. All the experiments are evaluated on the testing scenario of the CityFlow benchmark dataset with partially overlapping FOVs, i.e. the S02 scenario. It is composed of 4 cameras pointing to a road intersection (see Figure 1). In total, it aggregates 129 annotated vehicles whose trajectories are distributed along 8440 frames (2110 per camera) captured at 10 fps.

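| Detector | COCO mAP | Score threshold | IDP | IDR | IDF1 |
|---|---|---|---|---|---|
| SSD512 | 28.8 | 0.1 | 48.96 | 63.43 | 55.27 |
| SSD512 | 28.8 | 0.2 | 48.97 | 63.43 | 55.27 |
| SSD512 | 28.8 | 0.3 | 49.05 | 60.87 | 54.33 |
| Yolo v3 | 33.0 | 0.1 | 41.16 | 60.54 | 49.00 |
| Yolo v3 | 33.0 | 0.2 | 41.06 | 60.37 | 48.88 |
| Yolo v3 | 33.0 | 0.3 | 39.83 | 55.25 | 46.23 |
| Mask R-CNN | | 0.1 | 49.11 | 64.84 | 55.89 |
| Mask R-CNN | | 0.2 | 49.11 | 64.84 | 55.89 |
| Mask R-CNN | | 0.3 | 48.27 | 63.63 | 54.89 |
| EfficientDet | 55.1 | 0.1 | 39.93 | 60.85 | 48.22 |
| EfficientDet | 55.1 | 0.2 | 44.35 | 65.12 | 52.76 |
| EfficientDet | 55.1 | 0.3 | 47.10 | 68.21 | 55.72 |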

TABLE III: MTMC tracking performance of the proposed approach for different vehicle detectors. We differentiate between the public detectors (SSD512, Yolo v3 and Mask R-CNN) and the private one (EfficientDet). For each detector, we include the mean Average Precision (mAP) for the object detection task on the COCO dataset [46] as a measure of performance. Best of both categories in bold.

Influence of the vehicle detector algorithm: Table III summarizes the impact of different vehicle detectors on the overall performance of the proposed approach. As stated before, we consider three sets of provided object detections, coming from Yolo v3, SSD512 and Mask R-CNN, i.e. public detections. We also evaluate the performance of EfficientDet, a top-performing algorithm.

We experimented with three different score thresholds to filter the output detections (0.1, 0.2 and 0.3). Regarding the public detections, one can observe that the compared detectors achieve their peak performance when a low threshold is applied. The results suggest that raising the score threshold above 0.2 leads to a lower IDR in the MTMC tracking performance. This finding indicates that detections with low confidence (mostly generated by distant and partially visible vehicles) are still useful.

In contrast, EfficientDet, being a better-performing object detector, achieves higher scores when filtered at 0.3 instead of 0.2. It improves IDR by 3.37 compared with the best result of Mask R-CNN; however, IDP is degraded by 2.01. The reason for this decline is that EfficientDet produces more false positive trajectories, arising from the detection of partially occluded vehicles that Mask R-CNN is not able to detect.

In light of these results, we adopt the Mask R-CNN output detections filtered at a 0.2 score threshold as public detections, and the EfficientDet output filtered at a 0.3 score threshold as private detections, for the rest of the experiments.
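The three tracking scores reported in each row of the tables in this section are related: under the standard definition of the identity-based metrics, IDF1 is the harmonic mean of IDP and IDR. The sketch below, with a hypothetical function name, reproduces the third value of a row from the first two. Because published scores are computed from raw counts and then rounded, the last digit may differ slightly.

```python
def idf1_from_idp_idr(idp, idr):
    """Harmonic mean of identification precision and recall (percent)."""
    if idp + idr == 0:
        return 0.0
    return 2.0 * idp * idr / (idp + idr)


if __name__ == "__main__":
    # Mask R-CNN at threshold 0.2 in Table III: 49.11, 64.84, 55.89.
    print(round(idf1_from_idp_idr(49.11, 64.84), 2))
    # EfficientDet at threshold 0.3 in Table III: 47.10, 68.21, 55.72.
    print(round(idf1_from_idp_idr(47.10, 68.21), 2))
```

This relation also serves as a consistency check when reading the tables: a third value larger than the mean of the first two cannot be a valid IDF1.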

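| Detector | Association radius | IDP | IDR | IDF1 |
|---|---|---|---|---|
| Mask R-CNN | 5 m | 47.04 | 62.11 | 53.53 |
| Mask R-CNN | 6.5 m | 46.03 | 60.78 | 52.39 |
| Mask R-CNN | 8 m | 49.11 | 64.84 | 55.89 |
| Mask R-CNN | 9.5 m | 46.56 | 61.47 | 52.99 |
| EfficientDet | 5 m | 47.18 | 68.37 | 55.83 |
| EfficientDet | 6.5 m | 46.41 | 67.24 | 54.92 |
| EfficientDet | 8 m | 47.10 | 68.21 | 55.72 |
| EfficientDet | 9.5 m | 43.24 | 62.59 | 51.14 |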
TABLE IV: Impact of the association parameter over the MTMC tracking performance. Best in bold.

Influence of the association radius: Table IV shows how the association radius used in the cross-camera clustering (see Section III-D) affects the MTMC tracking performance of the proposed approach in the evaluated scenario. We sweep radius values of 5, 6.5, 8 and 9.5 meters. The results indicate that the choice of radius has a significant impact on performance and is highly dependent on the detection algorithm. The Mask R-CNN detector peaks at a radius of 8 m, whereas with the EfficientDet detector a smaller radius, 5 m, is the optimal choice. This difference may be related to bounding box accuracy (i.e. how tightly the output bounding boxes fit the vehicles): since the middle points of the bases of the bounding boxes from the different camera views are projected to the ground plane, the tighter the boxes are, the more accurate the projections.

Given typical vehicle dimensions, it may seem natural that a smaller radius should be enough to associate several detections of the same vehicle. However, due to noise in the video transmission while capturing the data, some frames are skipped in some videos, so some cameras suffer from a subtle temporal misalignment (i.e. they are unsynchronized with respect to the others), which calls for a larger radius. Therefore, the optimal values for the CityFlow benchmark using the proposed algorithm are 5 and 8 meters for EfficientDet and Mask R-CNN, respectively.
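To make the role of the association radius concrete, the sketch below projects the midpoint of the base of a bounding box to the ground plane and tests whether two projections fall within a given radius. It is a simplified illustration that covers only the location cue, not the appearance cue of the clustering in Section III-D. It assumes a 3x3 homography per camera that maps image pixels to metric ground-plane coordinates and boxes given as (x, y, width, height) with the origin at the top-left corner; all function names are hypothetical.

```python
import math


def bbox_base_midpoint(bbox):
    """Midpoint of the bottom edge of a box given as (x, y, width, height)."""
    x, y, width, height = bbox
    return (x + width / 2.0, y + height)


def project_to_ground(homography, point):
    """Apply a 3x3 homography (nested lists) to a 2D image point."""
    x, y = point
    h = homography
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


def within_association_radius(ground_a, ground_b, radius_m):
    """True if two ground-plane points are at most radius_m apart."""
    distance = math.hypot(ground_a[0] - ground_b[0], ground_a[1] - ground_b[1])
    return distance <= radius_m


if __name__ == "__main__":
    # Toy homographies (assumed): a pure scaling and a scaling plus shift.
    h_cam1 = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 1.0]]
    h_cam2 = [[0.5, 0.0, 3.0], [0.0, 0.5, 4.0], [0.0, 0.0, 1.0]]
    box = (10.0, 20.0, 4.0, 6.0)
    p1 = project_to_ground(h_cam1, bbox_base_midpoint(box))
    p2 = project_to_ground(h_cam2, bbox_base_midpoint(box))
    print(p1, p2, within_association_radius(p1, p2, 5.0))
```

A loose bounding box shifts the base midpoint in the image, and the homography amplifies that shift on the ground plane, which is one way to read the detector-dependent optimum discussed above.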

Influence of the appearance feature model: Table V summarizes the effect of the training schemes on the model used to describe the appearance features of vehicles in the proposed MTMC tracking approach. The table lists the data used for training the network (described in Section IV-B2) and how the weights of the network were obtained. As the baseline, we use the model pretrained on the ImageNet dataset. As training data, we considered the training set of the CityFlow benchmark (S01, S03 and S04 scenarios), both alone and jointly with the VeRi-776 dataset. We tried two classification loss functions: Cross-Entropy loss (CE loss) and Focal Loss (FL).
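The difference between the two classification losses can be illustrated for a single sample. The sketch below uses the standard multi-class form of the Focal Loss, FL = -(1 - p_t)^gamma * log(p_t), where p_t is the softmax probability of the true class. The focusing parameter gamma = 2 is an assumption taken from the original Focal Loss paper [37], not a value stated here, and the optional class weighting term is omitted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy_loss(logits, target):
    """Cross-Entropy loss for one sample with integer class label target."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal Loss for one sample; gamma=0 reduces to Cross-Entropy."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    logits = [5.0, 0.0, 0.0]
    # Class 0 is an easy (well-classified) target, class 1 a hard one.
    for target in (0, 1):
        print(target, cross_entropy_loss(logits, target), focal_loss(logits, target))
```

The modulating factor (1 - p_t)^gamma shrinks the loss of well-classified samples far more than that of misclassified ones, so identities that the network already classifies easily contribute less to the gradient.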

Table V indicates that the tracking performance behaves consistently across the Mask R-CNN and EfficientDet detectors. In both cases, fine-tuning the network on the CityFlow benchmark has a slight but positive influence. Including more training data, namely the VeRi-776 dataset, appears to improve the quality of the feature embeddings, resulting in an even better tracking performance.

Figure 6 depicts in red the distribution of the number of images per vehicle ID in the training set of the CityFlow dataset, illustrating that it is a quite unbalanced set with a widely scattered distribution. Its dispersion can be summarized by the mean and the standard deviation of the distribution. From Table V, we observe that training on the CityFlow benchmark with the Focal loss, instead of the Cross-Entropy loss, has a positive influence on our MTMC tracking approach, which is consistent with the Focal loss being designed for imbalanced data.

Figure 6 also depicts in blue the distribution of the number of images per vehicle ID in the VeRi-776 dataset; as one may observe, it is more balanced than the CityFlow set. Considering both datasets together, and comparing the mean and standard deviation of the joint distribution with those of CityFlow alone, one could say that the joint dataset is less dispersed than CityFlow alone, which may explain the slight increase in performance obtained when the combined dataset is used. According to these results, we opt for the combined dataset and the CE loss for the rest of the experiments.

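| Detector | Training data | Weights | IDP | IDR | IDF1 |
|---|---|---|---|---|---|
| Mask R-CNN | ImageNet | Pretrained | 49.11 | 64.84 | 55.89 |
| Mask R-CNN | CityFlow | F + CE | 49.16 | 64.91 | 55.95 |
| Mask R-CNN | CityFlow | F + FL | 49.26 | 65.04 | 56.06 |
| Mask R-CNN | CityFlow + VeRi-776 | F + CE | 50.56 | 66.75 | 57.54 |
| Mask R-CNN | CityFlow + VeRi-776 | F + FL | 49.53 | 65.39 | 56.37 |
| EfficientDet | ImageNet | Pretrained | 47.18 | 68.37 | 55.83 |
| EfficientDet | CityFlow | F + CE | 47.41 | 68.70 | 56.11 |
| EfficientDet | CityFlow | F + FL | 47.43 | 68.73 | 56.13 |
| EfficientDet | CityFlow + VeRi-776 | F + CE | 48.33 | 70.03 | 57.19 |
| EfficientDet | CityFlow + VeRi-776 | F + FL | 47.63 | 69.01 | 56.36 |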
TABLE V: Impact of appearance feature model over the MTMC tracking performance. F: Finetuned. CE: Cross-Entropy Loss. FL: Focal Loss. Best in bold.
Fig. 6: The distribution of the number of images per vehicle identity in the CityFlow training dataset, the VeRi-776 dataset and both datasets joined. Best viewed in color.

Influence of the size of the feature embedding: Table VI summarizes the experiments carried out to explore the effect of the size of the feature embedding. As stated in Section IV-B2, the output of the last average pooling layer of ResNet-50 provides a 2048-dimensional embedding, which we set as the baseline. To modify the length of the embedding, an additional fully connected layer of size 512, 1024 or 4096 is added at the end of the network. The additional fully connected layer is preceded by batch normalization and ReLU layers, and the training procedure is the same as described in Section IV-B2. The results suggest that adding an extra layer, and therefore more complexity to the model, whether to reduce or to increase the embedding size, may harm performance by leading the model to overfit.

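| Detector | Embedding size | IDP | IDR | IDF1 |
|---|---|---|---|---|
| Mask R-CNN | Baseline (2048) | 50.56 | 66.75 | 57.54 |
| Mask R-CNN | 512 | 49.24 | 65.01 | 56.04 |
| Mask R-CNN | 1024 | 49.67 | 65.58 | 56.53 |
| Mask R-CNN | 4096 | 49.70 | 65.62 | 56.56 |
| EfficientDet | Baseline (2048) | 48.33 | 70.03 | 57.19 |
| EfficientDet | 512 | 46.72 | 67.69 | 55.28 |
| EfficientDet | 1024 | 47.06 | 68.20 | 55.69 |
| EfficientDet | 4096 | 47.50 | 68.83 | 56.21 |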

TABLE VI: Impact of the feature embedding size. Best in bold.

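| Detector | Strategy | IDP | IDR | IDF1 |
|---|---|---|---|---|
| Mask R-CNN | Baseline | 50.56 | 66.75 | 57.54 |
| Mask R-CNN | + Size filtering | 53.03 | 66.70 | 59.08 |
| Mask R-CNN | + Blind occlusion handling | 53.46 | 70.59 | 60.84 |
| Mask R-CNN | + Reprojection-based handling | 52.99 | 70.96 | 60.67 |
| Mask R-CNN | + Blind occlusion handling + Size filtering | 54.99 | 69.17 | 61.27 |
| Mask R-CNN | + Reprojection-based handling + Size filtering | 54.06 | 69.02 | 60.63 |
| EfficientDet | Baseline | 48.33 | 70.03 | 57.19 |
| EfficientDet | + Size filtering | 50.20 | 70.07 | 58.49 |
| EfficientDet | + Blind occlusion handling | 53.18 | 77.05 | 63.50 |
| EfficientDet | + Reprojection-based handling | 51.53 | 75.67 | 61.31 |
| EfficientDet | + Blind occlusion handling + Size filtering | 54.73 | 76.38 | 63.77 |
| EfficientDet | + Reprojection-based handling + Size filtering | 53.34 | 75.50 | 62.52 |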

TABLE VII: Impact of additional strategies. Best in bold.

Influence of additional strategies: The additional strategies we have designed fall into two groups: removing small detected objects that are not considered in the ground-truth, and occlusion handling.

To compensate for the bias in the ground-truth, in which distant cars are not annotated, we apply a size filtering strategy that removes detections whose area is under 0.10% of the total frame area.
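A size filter of this kind can be sketched in a few lines. The function name, the (x, y, width, height) box format and the frame resolution in the example are assumptions, and the threshold is passed as a fraction of the frame area so that any value can be tried.

```python
def filter_small_detections(detections, frame_width, frame_height, min_area_fraction):
    """Keep detections whose box area is at least a fraction of the frame.

    detections: list of (x, y, width, height) boxes in pixels (assumed format).
    min_area_fraction: minimum box area divided by frame area, e.g. 0.001.
    """
    frame_area = float(frame_width * frame_height)
    kept = []
    for x, y, width, height in detections:
        if (width * height) / frame_area >= min_area_fraction:
            kept.append((x, y, width, height))
    return kept


if __name__ == "__main__":
    # Illustrative 1920x1080 frame: one distant (small) and one close box.
    boxes = [(100, 100, 40, 40), (500, 400, 100, 80)]
    print(filter_small_detections(boxes, 1920, 1080, 0.001))
```

Expressing the threshold relative to the frame area keeps the filter independent of camera resolution.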

The blind occlusion handling and the reprojection-based occlusion handling strategies are detailed in Section III-E.

Table VII shows the ablation results of these strategies. As expected, removing small detections increases IDP for both object detectors, by 2.47 for Mask R-CNN (1.87 for EfficientDet), while IDR remains almost the same. Since IDP reacts to false positives, this seems to indicate that size filtering removes small detections that we track but that are not annotated in the ground-truth.

Both occlusion handling strategies, blind and reprojection-based, improve the baseline IDR significantly, by 3.84 (7.02) and 4.21 (5.64) respectively, and IDP also improves, by 2.90 (4.85) and 2.43 (3.20). Contrary to expectations, the reprojection-based strategy does not outperform the blind one. Another bias in the ground-truth could explain this, since occluded vehicles are not annotated.

When combining either occlusion handling strategy with size filtering, we achieve a higher precision than when applying them separately, while recall is slightly reduced. As in the previous comparison, these results suggest that the reprojection-based strategy does not improve over the blind strategy because of the nature of the ground-truth. We consider the baseline approach together with blind occlusion handling and size filtering a good trade-off between precision and recall.

IV-D Comparison with the state-of-the-art

In this section, the proposed algorithm is compared with state-of-the-art approaches. The comparison is performed on the S02 scenario of the CityFlow benchmark, the only validation scenario with partially overlapping FOVs, which is the setting our method targets.

The MTMC vehicle tracking approaches proposed in the literature, listed in Table I, have already been compared in the 2019 AI City Challenge [49] jointly over the test scenarios S02 and S05. However, as S05 consists of non-overlapping cameras, to ensure a fair comparison we perform the evaluation only over S02. For this purpose, we ran the publicly available code of each approach and evaluated it following the CityFlow benchmark evaluation methodology, detailed in Section IV-A2.

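| Method | IDP | IDR | IDF1 | IoU threshold | Processing | Latency (s) | Cost (min) |
|---|---|---|---|---|---|---|---|
| UWIPL [13] | 70.21 | 92.61 | 79.87 | 0.2 | Offline | 3015* + 211 | 53.76* |
| UWIPL [13] | 70.02 | 92.36 | 79.65 | 0.5 | Offline | 3015* + 211 | 53.76* |
| ANU [12] | 67.53 | 81.99 | 74.06 | 0.2 | Offline | 1159* + 211 | 22.83* |
| ANU [12] | 66.42 | 80.64 | 72.85 | 0.5 | Offline | 1159* + 211 | 22.83* |
| BUPT [11] | 78.23 | 63.69 | 70.22 | 0.2 | Offline | 1389* + 211 | 26.66* |
| BUPT [11] | 78.16 | 63.63 | 70.15 | 0.5 | Offline | 1389* + 211 | 26.66* |
| NCCU [9] | 48.91 | 43.35 | 45.97 | 0.2 | Offline | 2316* + 211 | 42.11* |
| NCCU [9] | 24.36 | 21.59 | 22.89 | 0.5 | Offline | 2316* + 211 | 42.11* |
| Ours (EfficientDet) | 55.15 | 76.98 | 64.26 | 0.2 | Online | 2.55 | 13.65 |
| Ours (EfficientDet) | 54.73 | 76.38 | 63.77 | 0.5 | Online | 2.55 | 13.65 |
| Ours (Mask-RCNN) | 57.23 | 71.99 | 63.76 | 0.2 | Online | 2.29 | 12.71 |
| Ours (Mask-RCNN) | 54.99 | 69.17 | 61.27 | 0.5 | Online | 2.29 | 12.71 |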

TABLE VIII: Comparison with the state-of-the-art approaches. The values 0.2 and 0.5 are the Intersection over Union (IoU) evaluation thresholds. The star (*) denotes an estimate. The extra 211 seconds are the duration of the video sequence under evaluation.

Table VIII shows the evaluated performance in terms of IDP, IDR, IDF1, latency and total computational time. The listed approaches can be divided by processing mode into two groups: offline and online. As described in Section II, to the best of our knowledge there is no previous proposal dealing with online MTMC vehicle tracking; for this reason, all the state-of-the-art methods that we evaluated are offline approaches. It is important to remark that, in Table VIII, the star denotes a partial estimate that is a lower bound. The code for the complete systems is not publicly available, and only solutions based on precomputed intermediate results are accessible; hence, we can only measure the running time of the available modules. Therefore, the overall latency of the compared offline approaches is expected to be much higher than the values reported in Table VIII. Note that the duration of the sequence under evaluation is also included in the latency, since these offline approaches require access to the results for the whole video to compute tracklets at each camera and then compute multi-camera tracks globally. As our proposal yields tracking results incrementally from the beginning of the sequence, it achieves a much lower latency than the other methods.
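The latency figures of the offline methods can be reproduced with simple arithmetic. The sketch below assumes, as the numbers in Table VIII suggest, that the cost in minutes reported for an offline method is its latency (measured processing time plus the 211 s sequence duration) divided by 60; the function names are hypothetical.

```python
def offline_latency_s(processing_s, sequence_duration_s=211.0):
    """Latency of an offline method: it must wait for the whole sequence."""
    return processing_s + sequence_duration_s


def seconds_to_minutes(seconds):
    return seconds / 60.0


if __name__ == "__main__":
    # Measured (lower-bound) processing times from Table VIII, in seconds.
    for name, processing_s in (("UWIPL", 3015.0), ("ANU", 1159.0)):
        latency_s = offline_latency_s(processing_s)
        print(name, latency_s, seconds_to_minutes(latency_s))
    # Ratio between the UWIPL latency and the 2.55 s online latency.
    print(offline_latency_s(3015.0) / 2.55)
```

Since the starred processing times are lower bounds, the ratios obtained this way understate the real gap between the offline and online methods.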

Regarding the quantitative measures of tracking performance (IDP, IDR and IDF1), offline methods using constraining priors tailored to the target scenario clearly benefit from this extra information (see Table I). In contrast to the rest of the state-of-the-art approaches, we are agnostic to the motion patterns of the vehicles (knowledge of which allows filtering erroneous tracks), we do not perform any track post-processing (which allows refining and unifying tracks and thereby reducing ID switches) and, finally, we do not make use of manual annotations. On this basis, our online approach performs close to the offline state-of-the-art approaches, outperforming two of them in terms of Identification Recall.

Overall, our approach does not quite reach top performance in MTMC vehicle tracking, but its latency is about three orders of magnitude lower, and its total computational cost is between roughly 1.7 and 4 times lower than the lower-bound estimates for the offline methods in Table VIII. This enables online, low-latency operation, which is a common requirement for many video-related applications, and favors generalization by avoiding hand-crafted strategies.

V Conclusion

Not relying on manual ad-hoc annotations, having no prior knowledge about the number of targets, and providing the best result in the shortest possible time are crucial requirements for a convenient and versatile algorithm. This paper presents, to the best of our knowledge, the first online MTMC vehicle tracking solution. Unlike previous approaches, the proposed approach continuously computes and updates the targets’ state. We calculate clusters of detections of the same vehicle from different camera views applying a cross-camera clustering based on appearance and location. We train an appearance model to identify different views of the same vehicle leveraging homography matrices’ information. Using information from the previous frame and a temporal estimation, we developed an occlusion handling strategy able to extrapolate accurate detections even if the target is occluded. Since the state estimation is continually updated, this strategy is useful even if the target is long-term occluded.

This approach results in a low-latency MTMC vehicle tracking solution with quite promising results. Although its performance is below that of its offline counterparts, the proposed approach is a suitable solution for real-world ITS applications.


This work was partially supported by the Spanish Government (TEC2017-88169-R MobiNetVideo). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPU used for the research of our group.


  • [1] J. Guerrero-Ibáñez, S. Zeadally, and J. Contreras-Castillo, “Sensor technologies for intelligent transportation systems,” Sensors, vol. 18, no. 4, p. 1212, 2018.
  • [2] Z. Yang and L. S. Pun-Cheng, “Vehicle detection in intelligent transportation systems and its applications under varying environments: A review,” Image and Vision Computing, vol. 69, pp. 143–154, 2018.
  • [3] M. Veres and M. Moussa, “Deep learning for intelligent transportation systems: A survey of emerging trends,” IEEE Transactions on Intelligent transportation systems, 2019.
  • [4] H. Menouar, I. Guvenc, K. Akkaya, A. S. Uluagac, A. Kadri, and A. Tuncer, “Uav-enabled intelligent transportation systems for the smart city: Applications and challenges,” IEEE Communications Magazine, vol. 55, no. 3, pp. 22–28, 2017.
  • [5] L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth, “Tracking the trackers: an analysis of the state of the art in multiple object tracking,” arXiv preprint arXiv:1704.02781, 2017.
  • [6] G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera, “Deep learning in video multi-object tracking: A survey,” Neurocomputing, vol. 381, pp. 61–88, 2020.
  • [7] M. S. Shirazi and B. T. Morris, “Looking at intersections: a survey of intersection monitoring, behavior and safety analysis of recent studies,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 1, pp. 4–24, 2016.
  • [8] X. Tan, Z. Wang, M. Jiang, X. Yang, J. Wang, Y. Gao, X. Su, X. Ye, Y. Yuan, D. He, S. Wen, and E. Ding, “Multi-camera vehicle tracking and re-identification based on visual and spatial-temporal features,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [9] M.-C. Chang, J. Wei, Z.-A. Zhu, Y.-M. Chen, C.-S. Hu, M.-X. Jiang, and C.-K. Chiang, “Ai city challenge 2019 – city-scale video analytics for smart transportation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [10] Y. Chen, L. Jing, E. Vahdani, L. Zhang, M. He, and Y. Tian, “Multi-camera vehicle tracking and re-identification on ai city challenge 2019,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [11] Z. He, Y. Lei, S. Bai, and W. Wu, “Multi-camera vehicle tracking with powerful visual features and spatial-temporal cue,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [12] Y. Hou, H. Du, and L. Zheng, “A locality aware city-scale multi-camera vehicle tracking system,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [13] H.-M. Hsu, T.-W. Huang, G. Wang, J. Cai, Z. Lei, and J.-N. Hwang, “Multi-camera tracking of vehicles based on deep features re-id and trajectory-based camera link models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [14] P. Li, G. Li, Z. Yan, Y. Li, M. Lu, P. Xu, Y. Gu, B. Bai, and Y. Zhang, “Spatio-temporal consistency and hierarchical matching for multi-target multi-camera vehicle tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [15] M. Wu, G. Zhang, N. Bi, L. Xie, Y. Hu, and Z. Shi, “Multiview vehicle tracking by graph matching model,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [16] E. Luna, P. Moral, J. C. SanMiguel, A. Garcia-Martin, and J. M. Martinez, “Vpulab participation at ai city challenge 2019,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.
  • [17] L. Chen, H. Ai, R. Chen, Z. Zhuang, and S. Liu, “Cross-view tracking for multi-human 3d pose estimation at over 100 fps,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3279–3288.
  • [18] Y. He, X. Wei, X. Hong, W. Shi, and Y. Gong, “Multi-target multi-camera tracking by tracklet-to-target assignment,” IEEE Transactions on Image Processing, vol. 29, pp. 5191–5205, 2020.
  • [19] M. C. Liem and D. M. Gavrila, “Joint multi-person detection and tracking from overlapping cameras,” Computer Vision and Image Understanding, vol. 128, pp. 36–50, 2014.
  • [20] C. H. Sio, H.-H. Shuai, and W.-H. Cheng, “Multiple fisheye camera tracking via real-time feature clustering,” in Proceedings of the ACM Multimedia Asia, 2019, pp. 1–6.
  • [21] X. Zhang and E. Izquierdo, “Real-time multi-target multi-camera tracking with spatial-temporal information,” in 2019 IEEE Visual Communications and Image Processing (VCIP).   IEEE, 2019, pp. 1–4.
  • [22] Z. Zhipeng, “Collaborative tracking method in multi-camera system,” Journal of Shanghai Jiaotong University (Science), vol. 2, 2020.
  • [23] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for large-scale person re-identification,” in European Conference on Computer Vision.   Springer, 2016, pp. 868–884.
  • [24] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi, “Performance measures and a data set for multi-target, multi-camera tracking,” in European Conference on Computer Vision.   Springer, 2016, pp. 17–35.
  • [25] Z. Tang, M. Naphade, M.-Y. Liu, X. Yang, S. Birchfield, S. Wang, R. Kumar, D. Anastasiu, and J.-N. Hwang, “Cityflow: A city-scale benchmark for multi-target multi-camera vehicle tracking and re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8797–8806.
  • [26] E. Ristani and C. Tomasi, “Features for multi-target multi-camera tracking and re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6036–6046.
  • [27] Z. Tang, G. Wang, H. Xiao, A. Zheng, and J.-N. Hwang, “Single-camera and inter-camera vehicle tracking and 3d speed estimation based on fusion of visual and semantic features,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018, pp. 108–115.
  • [28] Z. Zhang, J. Wu, X. Zhang, and C. Zhang, “Multi-target, multi-camera tracking by hierarchical clustering: recent progress on dukemtmc project,” arXiv preprint arXiv:1712.09531, 2017.
  • [29] J. MacQueen et al., “Some methods for classification and analysis of multivariate observations,” in Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol. 1, no. 14.   Oakland, CA, USA, 1967, pp. 281–297.
  • [30] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise.” in Kdd, vol. 96, no. 34, 1996, pp. 226–231.
  • [31] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika, vol. 32, no. 3, pp. 241–254, 1967.
  • [32] D. L. Davies and D. W. Bouldin, “A cluster separation measure,” IEEE transactions on pattern analysis and machine intelligence, no. 2, pp. 224–227, 1979.
  • [33] J. C. Dunn, “A fuzzy relative of the isodata process and its use in detecting compact well-separated clusters,” Journal of Cybernetics, vol. 3, no. 3, pp. 32–57, 1973.
  • [34] L. Kaufman and P. J. Rousseeuw, Finding groups in data: an introduction to cluster analysis.   John Wiley & Sons, 2009, vol. 344.
  • [35] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas, “Imbalance problems in object detection: A review,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [36] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning.   MIT press Cambridge, 2016, vol. 1, no. 2.
  • [37] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
  • [38] R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of basic Engineering, vol. 82, no. 1, pp. 35–45, 1960.
  • [39] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [40] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1900–1909.
  • [41] J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
  • [42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
  • [43] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
  • [44] M. Tan, R. Pang, and Q. V. Le, “Efficientdet: Scalable and efficient object detection,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10 781–10 790.
  • [45] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
  • [46] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision.   Springer, 2014, pp. 740–755.
  • [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [48] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.   Ieee, 2009, pp. 248–255.
  • [49] M. Naphade, Z. Tang, M.-C. Chang, D. C. Anastasiu, A. Sharma, R. Chellappa, S. Wang, P. Chakraborty, T. Huang, J.-N. Hwang, and S. Lyu, “The 2019 ai city challenge,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2019.