Object tracking in road scenes is an important component of urban scene understanding. With the advent and subsequent surge in autonomous driving technologies, accurate multi-object trackers are desirable in several tasks such as navigation and planning, localization, and traffic behavior analysis.
In this paper, we focus on designing a simple and fast, yet accurate and robust solution to the Multi-Object Tracking (MOT) problem in an urban road scenario. The dominant approach to multi-object tracking is tracking-by-detection, where the entire process is divided into two phases. The first phase is object detection, where bounding boxes of objects of interest are obtained in each frame of the video sequence. The second phase is data association, which is often the hardest step in the tracking-by-detection paradigm. Spurious or missing detections, repeat detections, occlusions, and target interactions all confound this data association phase.
Although several approaches [1, 2, 3, 4, 5] exist for accurate online tracking of moving vehicles from a moving camera, most of them [6, 2] use handcrafted cost functions that are either based on primitive features such as bounding box position in the image and color histograms, or are highly sophisticated and non-intuitive in design (e.g., ALFD). In contrast, we propose costs that are intuitive, easy to compute and implement, and provide complementary cues about the target.
We exploit the fact that road scenes have a unique geometry and use this prior information to design costs. The proposed costs capture 3D cues arising from this scene geometry, as well as appearance based information. Further, we introduce a novel cost that captures similarity of 3D shapes and poses of target hypotheses. To this end we leverage recent work on shape-priors for object detection and localization from monocular sequences [7, 8]. To the best of our knowledge, such pairwise costs have not been incorporated in multi-object tracking frameworks.
The efficacy of the monocular 3D cues is best portrayed in the teaser figure. In this figure, the first two rows show the objects with their bounding boxes in two successive frames. We lift the objects in the earlier frame to 3D, balloon their locations to account for large uncertainties in ego motion, and project them into the image observed at the later frame. The resulting gated (overlapping) area, shown in the respective object colors in the last row of the figure, significantly reduces the search area for each object and thereby the number of pairwise costs to be evaluated. Backprojecting only those detections that lie within this gated area into 3D, and computing data association costs based on 3D volume overlaps, significantly improves tracking accuracy even with a straightforward Hungarian data association scheme.
The proposed costs do not depend strongly on the choice of data association framework. We demonstrate the superiority of the proposed costs on monocular video sequences of urban road scenes that capture a wide range of camera and target motions, and show consistent improvement over other costs regardless of the choice of object detector. We perform an extensive evaluation of various modes of the proposed costs on the KITTI Tracking benchmark and obtain state-of-the-art performance, beating previous approaches while using a simple two-frame Hungarian association scheme. The approach is also tested on the KITTI online evaluation server, where it significantly outperforms previously published approaches. Naturally, incorporating the proposed pairwise costs into more complex data association schemes, such as network flow based algorithms [10, 11, 12, 13], can be expected to yield further performance gains.
The paper contributes as follows.
It introduces novel data association cues based on single-view reconstruction of objects, which result in the best tracking performance reported thus far on the KITTI training datasets. It outperforms the nearest reported values on the training data [14, 1, 6] by at least 12%. The approach is also tested on the KITTI Tracking online evaluation server, where it outperforms the published approaches.
Finally, it identifies situations in which 3D pose and shape cues improve tracking performance.
Monocular 3D cues, especially those based on single-view geometry, can often be unreliable. However, when computed effectively, they can be used reliably and repeatably even on challenging sequences such as KITTI. This constitutes the central theme of this effort.
II Related Work
In this section, we review relevant work on multi-object tracking, and compare and contrast it with the proposed approach.
II-A Global Tracking
Many approaches to the association problem are global [10, 12, 17, 11, 18, 19], in the sense that they assume detections from all frames are available for processing. Most global methods operate by mapping the tracking problem to a min-cost network flow problem. The original formulation of this idea also provides a method for explicit occlusion reasoning. An efficient variant is based on generalized minimum clique graphs, where associations are solved for one object at a time while other objects are implicitly incorporated. Another class of global methods constructs small chunks of trajectories (called tracklets) and composes them hierarchically to form longer trajectories, rather than solving for a min-cost flow over a densely connected graph.
II-B Online MOT
In contrast, online trackers [4, 20, 3, 21] do not assume any knowledge of future frames and operate greedily, using only the data available up to the current instant. Such trackers often formulate the association problem as bipartite matching and solve it via the Hungarian algorithm. A recent variant proposes near-online trackers, in an attempt to provide the best of both worlds, i.e., to combine the capability of global methods to handle long-term occlusions with very low output latencies. Geiger et al. propose a variant of network flow with bounded memory and computation cost, using dynamic programming.
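As a rough illustration of the bipartite matching step used by online trackers, the sketch below solves a minimum-cost assignment with a textbook O(n^3) Hungarian (Kuhn-Munkres) routine in plain Python. The function names `hungarian` and `associate`, the `max_cost` gate, and the example matrix are assumptions made for illustration; they are not taken from any of the cited trackers.

```python
def hungarian(cost):
    """Minimum-cost assignment for a rectangular cost matrix (list of lists).

    Returns sorted (row, col) pairs. Rows index detections in one frame and
    columns index detections in the other frame (a hypothetical layout).
    """
    if not cost or not cost[0]:
        return []
    n, m = len(cost), len(cost[0])
    if n > m:
        # The routine below needs rows <= cols, so solve the transpose.
        transposed = [list(col) for col in zip(*cost)]
        return sorted((r, c) for c, r in hungarian(transposed))
    inf = float("inf")
    u = [0.0] * (n + 1)    # row potentials
    v = [0.0] * (m + 1)    # column potentials
    match = [0] * (m + 1)  # match[j] = 1-based row assigned to column j
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        match[0] = i
        j0 = 0
        minv = [inf] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = match[j0], inf, 0
            for j in range(1, m + 1):
                if used[j]:
                    continue
                cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                if cur < minv[j]:
                    minv[j], way[j] = cur, j0
                if minv[j] < delta:
                    delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[match[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if match[j0] == 0:
                break
        while j0:  # flip the augmenting path back to the virtual column 0
            j1 = way[j0]
            match[j0] = match[j1]
            j0 = j1
    return sorted((match[j] - 1, j - 1) for j in range(1, m + 1) if match[j])


def associate(cost, max_cost=float("inf")):
    """Match detections across two frames; drop pairs costlier than max_cost."""
    return [(r, c) for r, c in hungarian(cost) if cost[r][c] <= max_cost]


example = [[4, 1, 3],
           [2, 0, 5],
           [3, 2, 2]]
pairs = associate(example)
```

The `max_cost` gate is one simple way to leave a detection unmatched when even its best partner is implausible; a real tracker would typically start a new trajectory for such a detection.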
Both these paradigms rely on handcrafted pairwise costs being fed into the association framework. Most of these are sophisticated in design and do not end up capturing 3D information that is easily available in road scenes.
II-C Learning Costs for MOT
Significant attention has also been devoted to learning pairwise costs for target tracking problems. For example, a structured SVM has been used to learn pairwise costs for a bipartite matching data association framework. Other works have used graphical models and divide-and-conquer strategies, and also learn unary costs. A more recent work learns all costs using a deep neural network. In contrast, we show that our simple, clean, and efficient cost function designs significantly improve performance without the need for extensive hyperparameter search or cost learning.
III Problem Formulation
We adopt the tracking-by-detection paradigm, where we assume that we are provided with a monocular video sequence of frames and a set of object detections for each frame. Each detection set consists of the object detections found in the corresponding frame. Note that a detection set can also be empty, in the case where no objects are detected in a frame. Each detection is parametrized by the top-left corner of the detection box in the image, the bounding box width, the bounding box height, and the detector's confidence in the bounding box (a greater value indicates higher confidence). The multi-object tracking problem is to associate each bounding box to a target trajectory such that the following constraints are met.
Each target trajectory comprises a set of bounding boxes (all from different frames) belonging to a unique target in the scene.
There are exactly as many trajectories as there are targets to be tracked.
In all frames where a target is visible, it is detected and assigned to the corresponding unique trajectory for the object.
No spurious bounding box detection is assigned to any target trajectory.
The tracking problem formulated above is usually solved in a min-cost network flow framework (global tracking), a moving window dynamic programming framework (near-online tracking) or a bipartite matching framework (online tracking). Note that these are not the only available frameworks, but a representative set of most tracking approaches. All these frameworks (and the others not mentioned here) use pairwise costs to define affinity across pairs of detections. The association framework then computes a Maximum A Posteriori (MAP) estimate of the target trajectory, given the detection hypotheses and an affinity matrix that gives the likelihood of each detection in each frame corresponding to every detection in every other frame.
IV Geometry and Object Shape Costs
The core contribution of this paper is to design intuitive pairwise costs that are efficient to compute, and are accurate for tracking. We focus on urban driving scenarios and demonstrate how the geometry of urban road scenes can be exploited to infer 3D cues for tracking.
Typical costs in tracking algorithms include bounding box locations, trajectory priors, optical flow, bounding box overlap, and appearance information (color histograms or patch-based cross-correlation measures). These costs require careful handcrafting, fine-tuning, and hyperparameter estimation. We propose to use a set of simple complementary costs that are readily available from recent monocular 3D object localization systems [7, 8]. We also introduce a novel cost based on the 3D shape and pose of the target. We show that this cost, apart from improving data association performance, also assists in discarding false detections without incurring large computational overhead.
IV-A System Setup
We focus on autonomous driving scenarios, where the video sequence comes from a monocular camera mounted on a car moving on the road plane, and the targets to be tracked are also moving on the road. Feature-based odometry runs on a background thread (for rough frame-to-frame motion estimation). We also make use of a recent approach that goes beyond bounding boxes and estimates the 3D shape and pose of objects given just a single image. This is done by lifting discriminative parts in 2D (keypoints) to 3D. These keypoints are a set of points chosen so that they are common across all object instances (e.g., for a car, the centers of the wheels, headlights, taillights, etc.). The authors use a CNN architecture to localize these keypoints in 2D, given a detection.
The 3D shape of the object is parametrized as the sum of the mean shape (for the object category) and a linear combination of so-called basis shapes. Mathematically,
$$S = \bar{S} + \sum_{k=1}^{K} \lambda_k V_k, \tag{1}$$
where $S$ is the shape of a particular instance, $\bar{S}$ is the mean shape for the object category, and $V = \{V_1, \dots, V_K\}$ is the deformation basis (a set of eigenvectors) that characterizes deformation directions of the mean shape. We use the same model as that approach and denote the shape vector of an object by $\Lambda = (\lambda_1, \dots, \lambda_K)$, where $K$ is the number of vectors in the deformation basis.
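The linear shape model described above (mean shape plus a weighted sum of basis shapes) can be sketched in a few lines of Python. This is only an illustration: the representation of a shape as a list of 3D keypoints, the function name `instance_shape`, and the toy two-keypoint data are assumptions, not the data layout of the cited pipeline.

```python
def instance_shape(mean_shape, basis, coeffs):
    """Return mean_shape + sum_k coeffs[k] * basis[k], keypoint by keypoint.

    mean_shape: list of (x, y, z) keypoints of the category mean shape.
    basis:      list of K basis shapes, each laid out like mean_shape.
    coeffs:     list of K deformation coefficients (the shape vector).
    """
    if len(basis) != len(coeffs):
        raise ValueError("need exactly one coefficient per basis shape")
    shape = []
    for idx, point in enumerate(mean_shape):
        shape.append(tuple(
            point[axis] + sum(c * b[idx][axis] for c, b in zip(coeffs, basis))
            for axis in range(3)
        ))
    return shape


# Toy data (hypothetical): two keypoints and two deformation directions.
mean = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
basis = [
    [(0.0, 1.0, 0.0), (0.0, 1.0, 0.0)],   # lifts both keypoints
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],  # pulls the keypoints together
]
shape = instance_shape(mean, basis, [0.5, 0.25])
```

With all coefficients set to zero the function returns the mean shape, which matches the interpretation of the coefficients as deviations from the category mean.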
The pipeline also estimates the 3D pose of the object, which is parametrized as an axis-angle vector. Moreover, an estimate of the object dimensions (height, width, and length) is also returned.
IV-B 3D-2D Cost
Given the height of the camera above the ground, and assuming that the bottom line of each bounding box detection in a frame lies on the road plane, a depth estimate of the car in the current camera coordinates can be obtained by backprojecting the bottom center of the detected bounding box via the road plane, using the camera intrinsic matrix K. This backprojection (2) is only accurate when its inputs are known precisely, which is not usually the case. Hence, we estimate the uncertainty in the 3D location of the backprojected point by using a linearized version of (2) and assuming that the detector confidence is an isotropic 2D Gaussian. This region is expanded (anisotropically) by the estimates of the target dimensions returned by the system.
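To make the ground-plane backprojection concrete, here is a minimal sketch under strong simplifying assumptions that are not stated in the text: a zero-skew pinhole camera, camera axes with X to the right, Y pointing down and Z forward, and a level camera (no pitch or roll) so that the road is the plane Y = camera height. The function name and the numeric intrinsics are hypothetical, and the authors' Eq. (2) may use a more general form with an explicit ground-plane normal.

```python
def backproject_to_ground(u, v, fx, fy, cx, cy, cam_height):
    """Intersect the viewing ray through pixel (u, v) with the road plane.

    (u, v) is the bottom center of a bounding box in pixels. The ray
    direction is K^{-1} [u, v, 1]^T for a zero-skew intrinsic matrix K with
    focal lengths (fx, fy) and principal point (cx, cy). Returns the 3D point
    (X, Y, Z) in camera coordinates, with Y equal to cam_height.
    """
    ray_x = (u - cx) / fx
    ray_y = (v - cy) / fy
    if ray_y <= 0.0:
        # The pixel is on or above the horizon, so the ray never hits the road.
        raise ValueError("pixel does not project onto the road plane")
    scale = cam_height / ray_y  # depth along Z at which the ray meets Y = h
    return (scale * ray_x, cam_height, scale)


# Hypothetical intrinsics and camera height, chosen only for the example.
point = backproject_to_ground(u=600.0, v=250.0, fx=700.0, fy=700.0,
                              cx=600.0, cy=180.0, cam_height=1.65)
```

Because the depth is proportional to `1 / (v - cy)`, a one-pixel error in the bottom edge of a distant box changes the estimated depth far more than the same error on a nearby box, which is why the text propagates the detection uncertainty through a linearized version of the backprojection.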
Now, assume we have another detection in the next frame, with which we wish to compute the pairwise affinity of the first detection. We obtain a rough estimate of the camera motion between the two frames using a feature-based odometry thread running in the background. Using this estimate, we transport the backprojected 3D region to the camera coordinates of the next frame, while duly accounting for the uncertainty in the camera motion estimate and in the backprojection via the road plane. The resulting coordinates are then projected down to the image to obtain a 2D search area in which potential matches for the first detection are expected to be found, as shown in Fig. 1. The 3D-2D cost for a pair of detections is defined over this search area.
Intuitively, this cost measures a (weighted) overlap between the 2D region in which the target is expected in the next frame and the candidate detection. Its definition uses a projection operator that projects a 3D point to image pixel coordinates, a rigid-body motion applied to a 3D point, and a function that estimates the uncertainty of the 3D point according to a linearized form of (2) and the detector confidence.
Most importantly, this cost is evaluated only for detections that lie inside the expected target area. This significantly reduces the number of comparisons that need to be made among target pairs.
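The gating step can be illustrated with a small sketch. It assumes, purely for illustration, that the expected target area is an axis-aligned rectangle `(x, y, w, h)` in the image and that a detection "lies inside" it when the bottom center of its box does; the actual region in the paper comes from projected 3D uncertainty and need not be rectangular. All names here are hypothetical.

```python
def bottom_center(det):
    """Bottom-center pixel of a detection (x, y, w, h, confidence).

    (x, y) is the top-left corner and image y grows downwards.
    """
    x, y, w, h = det[:4]
    return (x + w / 2.0, y + h)


def gated_candidates(search_region, detections):
    """Indices of detections whose bottom center falls inside search_region."""
    rx, ry, rw, rh = search_region
    keep = []
    for idx, det in enumerate(detections):
        bx, by = bottom_center(det)
        if rx <= bx <= rx + rw and ry <= by <= ry + rh:
            keep.append(idx)
    return keep


region = (100.0, 100.0, 80.0, 80.0)
detections = [
    (120.0, 110.0, 30.0, 30.0, 0.9),  # bottom center (135, 140): inside
    (300.0, 100.0, 40.0, 40.0, 0.8),  # bottom center (320, 140): outside
    (150.0, 150.0, 20.0, 40.0, 0.7),  # bottom center (160, 190): outside
]
candidates = gated_candidates(region, detections)
```

Only the surviving indices need a pairwise cost; every other pair can be given an infinite cost in the association matrix.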
IV-C 3D-3D Cost
Although useful in reducing the number of candidate detections to be evaluated, the 3D-2D cost has frequent confounding cases. This is because we still measure overlap in the image space. To mitigate this drawback, we define a 3D-3D cost which, instead of measuring 2D overlap, measures overlap in 3D, as shown in Fig. 1 (right side). Here, we backproject each candidate detection via the road plane and measure its overlap with the transformed 3D volume from the previous frame. The 3D-3D cost for a pair of detections is defined as this 3D overlap.
In order to speed up evaluation of 3D overlap, we exploit the inherent geometry of road scenes. Since all objects of interest are on the road plane (the XZ plane in our case), it is sufficient to measure overlap in the XZ plane. This is because all objects are at nearly constant heights above the ground and hence have similar overlap in the Y direction.
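A minimal sketch of the XZ-plane shortcut follows. It assumes each object footprint is an axis-aligned rectangle on the road plane built from a center and the estimated width and length, and it uses intersection-over-union as the overlap measure. Both choices are assumptions made for the example (in particular, object yaw is ignored), and the function names are hypothetical.

```python
def footprint(center_x, center_z, width, length):
    """Axis-aligned footprint (x_min, x_max, z_min, z_max) on the road plane."""
    return (center_x - width / 2.0, center_x + width / 2.0,
            center_z - length / 2.0, center_z + length / 2.0)


def xz_overlap(a, b):
    """Intersection-over-union of two footprints in the XZ plane."""
    inter_x = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    inter_z = max(0.0, min(a[3], b[3]) - max(a[2], b[2]))
    inter = inter_x * inter_z
    area_a = (a[1] - a[0]) * (a[3] - a[2])
    area_b = (b[1] - b[0]) * (b[3] - b[2])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


# A car-sized footprint and the same footprint shifted by 1 m in X, 2 m in Z.
predicted = footprint(0.0, 10.0, width=2.0, length=4.0)
candidate = footprint(1.0, 12.0, width=2.0, length=4.0)
overlap = xz_overlap(predicted, candidate)
```

Dropping the Y axis turns a volume intersection into a rectangle intersection, which is the saving the paragraph above refers to.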
IV-D Appearance Cost
The authors of the keypoint localization pipeline train a stacked-hourglass CNN architecture to localize a discriminative set of keypoints on an image. This deep CNN architecture captures various discriminative features for each detection, along with the keypoint evidence. We use a weighted combination of activation maps from the output of the layers of the hourglass network as a feature descriptor for each detection, as shown in Fig. 2, and compute a similarity score between detections using a norm of the difference between the descriptors of the image patches inside the two bounding boxes. The appearance cost is defined from this distance between feature descriptors, scaled by a normalization constant.
IV-E Shape and Pose Cost
We use a novel shape and pose cost based on the single-image shape and pose returned by the same pipeline. Shape is parametrized as a vector of deformation coefficients, one per deformation basis vector. Each possible value of this vector denotes a unique class of object instances and hence carries useful information about the 3D shape of the target. For instance, varying certain coefficients may produce a shape that is more SUV-like than sedan-like, and so on. Pose is parametrized as an axis-angle vector. For a pair of detections, the shape and pose cost compares their shape vectors and their pose vectors, each term scaled by its own normalization constant.
The overall pairwise cost term is a weighted linear combination of all the aforementioned costs. The weights of the linear combination are determined by four-fold cross validation on the train set.
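The weighted linear combination can be sketched as follows. The cost names, the example values, and the equal weights are placeholders chosen for illustration; the weights actually used in the paper come from cross validation and are not reproduced here.

```python
def combined_cost(costs, weights):
    """Weighted linear combination of named pairwise costs.

    costs:   dict mapping a cue name to its cost for one detection pair.
    weights: dict mapping the same cue names to their (tuned) weights.
    """
    missing = set(weights) - set(costs)
    if missing:
        raise KeyError("no cost value for: " + ", ".join(sorted(missing)))
    return sum(weights[name] * costs[name] for name in weights)


# Hypothetical per-cue costs for a single pair of detections.
pair_costs = {"3d_2d": 0.2, "3d_3d": 0.4, "appearance": 0.1, "shape_pose": 0.3}
equal_weights = {name: 0.25 for name in pair_costs}
total = combined_cost(pair_costs, equal_weights)
```

Setting a weight to zero switches the corresponding cue off, which is a convenient way to run the kind of per-cue ablation reported later in the paper.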
In this section, we present an account of the experiments we performed and report and analyze the findings. In a nutshell, we evaluate our tracking framework on a variety of challenging urban driving sequences and demonstrate a substantial performance boost over the state of the art in multi-object tracking, using the simplest of tracking frameworks, viz. bipartite matching with the Hungarian algorithm.
We divide the training dataset, which contains 21 sequences, into four splits for cross validation. The splits are chosen so that each split contains a similar distribution of number of vehicles per sequence, occlusion and truncation levels, and relative motion patterns between the camera and the target. The cross validation helps us tune the weight of each of the proposed costs when computing the final cost matrix. The best performing combination of these weighted costs is used for reporting results on the KITTI Tracking benchmark. Multiple vehicles moving at varying speeds, variance in the ego camera motion, and target objects appearing in non-conforming locations in frames make the KITTI Tracking dataset a truly challenging one. We report results on the Car class.
|w/o Shape and Pose|
|with Shape and Pose||57.29||1||5|
V-B Evaluation Metrics
To evaluate the performance of our approach, we adopt the widely used CLEAR MOT metrics. The overall performance of the tracker is summarized by two intuitive metrics, viz. Multi-Object Tracking Accuracy (MOTA) and Multi-Object Tracking Precision (MOTP). While MOTA is concerned with tracking accuracy, MOTP deals with object localization precision.
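For reference, the two summary metrics can be computed from accumulated counts as in the sketch below. MOTA follows the standard CLEAR MOT definition (one minus the ratio of all errors to the number of ground-truth objects). For MOTP the sketch simply averages a per-match localization score; whether that score is a distance or a bounding box overlap depends on the benchmark, so treat that part as an assumption. The function names and the example counts are illustrative only.

```python
def mota(num_gt, false_negatives, false_positives, id_switches):
    """Multi-Object Tracking Accuracy over a whole sequence.

    num_gt is the total number of ground-truth objects summed over all frames.
    The result can be negative when the errors outnumber the ground truth.
    """
    if num_gt <= 0:
        raise ValueError("need at least one ground-truth object")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


def motp(match_scores):
    """Mean localization score over all matched (track, ground truth) pairs."""
    if not match_scores:
        raise ValueError("need at least one matched pair")
    return sum(match_scores) / len(match_scores)


accuracy = mota(num_gt=100, false_negatives=10, false_positives=5, id_switches=2)
precision = motp([0.8, 0.9, 0.7])
```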
V-C System Overview
The proposed approach is a tracking-by-detection approach and hence assumes per-frame bounding box detections as input. We choose two recent object detectors: Recurrent Rolling Convolution (RRC) and SubCNN. Each of these detectors provides multiple detections per frame. A threshold is applied to the detection scores, and detections whose confidence scores are lower than the threshold are pruned. In addition, we run a non-maximum suppression (NMS) scheme to suppress multiple detections around the same object. The remaining detections are used to compute pairwise costs as outlined in the previous section. These pairwise costs constitute a cost matrix that is passed to a bipartite matching algorithm, which associates detections across two frames. In practice, bipartite matching is performed using the Hungarian algorithm.
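The detection pre-processing described above (score thresholding followed by non-maximum suppression) might look like the following sketch. It uses the common greedy NMS variant with an intersection-over-union test; the text does not say which NMS variant or thresholds the authors use, so the function names and the threshold values here are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h, ...)."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0.0 else 0.0


def filter_detections(detections, score_thresh, iou_thresh):
    """Drop low-confidence boxes, then greedily suppress overlapping ones.

    Each detection is (x, y, w, h, confidence). Boxes are visited from most
    to least confident, and a box is kept only if it does not overlap an
    already kept box by more than iou_thresh.
    """
    confident = [d for d in detections if d[4] >= score_thresh]
    kept = []
    for det in sorted(confident, key=lambda d: d[4], reverse=True):
        if all(box_iou(det, other) <= iou_thresh for other in kept):
            kept.append(det)
    return kept


raw = [
    (10.0, 10.0, 20.0, 20.0, 0.9),
    (12.0, 12.0, 20.0, 20.0, 0.8),    # near-duplicate of the first box
    (100.0, 100.0, 20.0, 20.0, 0.7),
    (200.0, 200.0, 20.0, 20.0, 0.2),  # below the score threshold
]
clean = filter_detections(raw, score_thresh=0.5, iou_thresh=0.5)
```

The surviving detections from two frames are what the pairwise costs are computed on before the cost matrix is handed to the assignment solver.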
V-D Approaches Considered
V-E Performance Evaluation
We compare the performance of our approach against the current best competitors on the KITTI Tracking Benchmark. While [6, 21, 13] rely on complex handcrafted costs, another competitor learns all unary and pairwise costs that are input to a network flow based tracker. Moreover, the data association steps of [6, 21, 13] rely on complex optimization routines. The proposed approach is also evaluated on the KITTI Tracking evaluation server.
Table I compares our two-frame approach with the other competitors. Using the best performing object detector and a judicious combination of appearance, 3D, pose, and shape cues, we achieve the best results on the KITTI training sequences in terms of MOTA and MOTP. Although our method suffers from ID switches and fragmentations, this is typical of online trackers, and more so of two-frame greedy trackers. Using the proposed pairwise costs in a slightly more sophisticated tracker such as [6, 13] should also reduce ID switches and fragmentations.
Table II compares our two-frame approach with the other published approaches on the KITTI Tracking online server. We outperform the next best competitor on the test set, achieving state-of-the-art results in terms of MOTA, MOTP, MT, and ML.
V-F Ablation Study
We then perform a thorough ablation analysis of the various cues used for computing pairwise costs, across two distinct object detectors: RRC and SubCNN. Results are summarized in Table III. This analysis captures the importance of each of the proposed cues and demonstrates that the combination of all of them is crucial for overall performance. Notice how each cue improves the performance of our system in terms of MOTA, ID switches, and fragmentations. Even with an underperforming detector such as SubCNN, there is a tangible performance boost from using a combination of monocular 3D cues, as shown by the SubCNN ablation in Table III. The analysis also shows that the performance gain from these novel cues is repeatable across baseline detection methods.
There exist subsequences where the role played by shape and pose cues becomes relevant. While in a typical road scene involving lane driving the pose cues are not discriminative (as the vehicles are aligned with the lane direction), they become discriminative in areas such as intersections and roundabouts, where pose and viewpoint changes are heterogeneous. This is showcased in Table IV. Here, we select particular frames from the KITTI Tracking dataset whose images contain cars moving at intersections, and which therefore capture different viewpoints and shapes of cars. Using detections from a weak detector, a simple combination of 2D-2D cues with the shape and pose cue of the car performs better than the stand-alone 2D cue on sequences in which cars appear under various viewpoints.
V-G Qualitative Results
Finally, we present qualitative results from challenging sequences in Fig.3 and Fig.4. These results clearly indicate the ability of the proposed pairwise costs to disambiguate and track across viewpoint variations, clutter, and varying relative motion between the camera and the target.
For example, the first column of Fig. 3 shows cars occluded on either side of the road being tracked accurately almost up to the horizon. The second column shows efficient tracking of cars at varying depths and in varying poses at an intersection, while the third column shows precise tracking of occluding cars as well as of a car that is being overtaken from the right by the ego car. In fact, in the 4th frame only a very small portion of the car is visible, yet it is tracked accurately.
V-H Summary of Results
The cornerstone of this effort is that monocular 3D cues, obtained through formalisms based on single-view geometry, can be effectively exploited to track vehicles in challenging scenes. This is illustrated by the tables in this section.
Table I shows significant improvements in tracking accuracy over many of the current state-of-the-art methods. We also test our approach on the KITTI Tracking online server; Table II shows significant improvements over the published approaches.
The ablation studies in Table III showcase the repeatability of 3D cues in improving appearance-only baseline tracking across detectors. While not as significant as with RRC, the improvement over the baseline for the SubCNN object detector can also be seen in Table III. The 3D cues also improve ID switches and fragmentations over both detector baselines.
Table IV shows the relevance of pose and shape cues on a subsequence where the association costs due to such cues improve baseline performance.
Most state-of-the-art tracking formalisms have not explored the role of 3D cues, and when they have, those cues have come from readily available stereo depth. This paper showed, for the first time, that monocular 3D cues obtained from single-view geometry, along with pose and shape cues, result in the best tracking performance on popular object tracking training datasets. These cues result in a set of simple, intuitive pairwise costs for multi-object tracking in a tracking-by-detection setting. Despite being more difficult to compute than ready-made 3D depth data, monocular 3D cues have a role to play in diverse on-road applications, including object and vehicle tracking. Beyond the quantitative results, the qualitative results also demonstrate their advantage in challenging scenes that involve considerable occlusions, objects that are only minimally visible, and objects far enough away to appear near the horizon. Although we demonstrated results using a simple Hungarian-method-based tracker, incorporating these costs into more sophisticated trackers should yield even higher performance.
-  S. Schulter, P. Vernaza, W. Choi, and M. Chandraker, “Deep network flow for multi-object tracking,” in
-  B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. K. Rhee, “Multi-class multi-object tracking using changing point detection,” in European Conference on Computer Vision. Springer, 2016.
-  S. Wang and C. C. Fowlkes, “Learning optimal parameters for multi-target tracking with contextual interactions,” International Journal of Computer Vision, 2017.
-  H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011.
-  A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” ACM computing surveys (CSUR), 2006.
-  W. Choi, “Near-online multi-target tracking with aggregated local flow descriptor,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
-  J. K. Murthy, G. S. Krishna, F. Chhaya, and K. M. Krishna, “Reconstructing vehicles from a single image: Shape priors for road scene understanding,” in Proceedings of the IEEE Conference on Robotics and Automation, 2017.
-  J. K. Murthy, S. Sharma, and M. Krishna, “Shape priors for real-time monocular object localization in dynamic environments,” in Proceedings of the IEEE Conference on Intelligent Robots and Systems (In Press), 2017.
-  A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
-  L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008.
-  A. Andriyenko, K. Schindler, and S. Roth, “Discrete-continuous optimization for multi-target tracking,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012.
-  A. Dehghan, S. Modiri Assari, and M. Shah, “Gmmcp tracker: Globally optimal generalized maximum multi clique problem for multiple object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  P. Lenz, A. Geiger, and R. Urtasun, “Followme: Efficient online min-cost flow tracking with bounded memory and computation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
-  W. Choi, C. Pantofaru, and S. Savarese, “A general framework for tracking multiple people from a moving camera,” IEEE transactions on pattern analysis and machine intelligence, 2013.
-  J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y.-W. Tai, and L. Xu, “Accurate single stage detector using recurrent rolling convolution,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
-  Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Subcategory-aware convolutional neural networks for object proposals and detection,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017.
-  L. Leal-Taixé, G. Pons-Moll, and B. Rosenhahn, “Everybody needs somebody: Modeling social and grouping behavior on a linear programming multiple people tracker,” in Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 120–127.
-  J. Berclaz, F. Fleuret, E. Turetken, and P. Fua, “Multiple object tracking using k-shortest paths optimization,” IEEE transactions on pattern analysis and machine intelligence, 2011.
-  V. Chari, S. Lacoste-Julien, I. Laptev, and J. Sivic, “On pairwise costs for network flow multi-object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  A. Ess, B. Leibe, K. Schindler, and L. Van Gool, “Robust multiperson tracking from a mobile platform,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 10, pp. 1831–1846, 2009.
-  A. Osep, W. Mehner, M. Mathias, and B. Leibe, “Combined image-and world-space tracking in traffic scenes,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017.
-  S. Song and M. Chandraker, “Joint sfm and detection cues for monocular 3d localization in road scenes,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  J. H. Yoon, C.-R. Lee, M.-H. Yang, and K.-J. Yoon, “Online multi-object tracking via structural constraint event aggregation,” in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proceedings of the IEEE International Conference on Computer Vision, 2015.
-  K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” EURASIP Journal on Image and Video Processing, 2008.
-  H. W. Kuhn, “The hungarian method for the assignment problem,” Naval Research Logistics (NRL), 1955.