that directly learn from the whole inputs, instance-level similarity needs to consider object locations, scales, and contexts. The estimated similarity serves as an essential component for many computer vision applications such as object tracking. Contemporary multiple object tracking methods Bergmann et al. (2019); Bewley et al. (2016); Wojke et al. (2017) mainly follow the tracking-by-detection paradigm Ramanan and Forsyth (2003). That is, they detect objects in each frame and then associate them according to the estimated instance similarity. Recent works Bergmann et al. (2019); Bewley et al. (2016); Bochinski et al. (2017) show that if the detected objects are accurate, the spatial proximity between objects in consecutive frames, measured by Intersection over Union (IoU) or center distance, is a strong prior for associating the objects. However, this location heuristic only works well in simple scenarios. If the objects are occluded or the scenes are crowded, it can easily lead to mistakes. To remedy this problem, some methods introduce motion estimation Andriyenko and Schindler (2011); Choi and Savarese (2010) or a regression-based tracker Feichtenhofer et al. (2017) to ensure accurate distance estimation.
However, object appearance similarity usually takes a secondary role Lu et al. (2020); Ristani and Tomasi (2018); Wojke et al. (2017) to strengthen object association or re-identify vanished objects. The search region is constrained to local neighborhoods to avoid distractions because the appearance features cannot distinguish different objects effectively. We conjecture that this is because the image and object information is not fully utilized for learning object similarity. Previous methods regard instance similarity learning as a post-hoc stage after object detection and only use sparse ground truth bounding boxes as training samples Wojke et al. (2017). This process ignores the majority of the regions proposed on the images. Because objects in an image are rarely identical to each other, if the object representation is learned properly, a nearest neighbor search in the embedding space is able to associate and distinguish instances without motion priors.
We observe that besides the ground truth and detected bounding boxes, which are sparsely distributed over the images, many object proposals can provide valuable training supervision. They are either close to the ground truth bounding boxes, providing more positive training examples, or in the background, serving as negative examples. We propose a simple yet effective quasi-dense matching method, namely, densely matching hundreds of regions of interest between a pair of images, as shown in Figure 1. The quasi-dense samples can cover most of the informative regions of the images, providing both more box examples and more matching targets. In training, similar to contrastive learning of image representations Hadsell et al. (2006); Sohn (2016); Wu et al. (2018), each object sample is matched to all proposals on the other image and the model is trained to match the targets among all the candidates.
The instance representations learned from quasi-dense matching allow nearest neighbor search to distinguish different instances at inference time. We use a bi-directional softmax to obtain the similarity scores between detected boxes and tracklets, which imposes a soft one-to-one matching constraint. This bi-directional matching can handle the appearance of new objects and the termination of tracklets. Objects with no correspondence during matching lack one-to-one consistency and thus have low similarity scores to all other objects.
Quasi-dense matching can be easily coupled with most existing detectors since generating regions of interest is widely used in object detection algorithms. We apply our method to Faster R-CNN Ren et al. (2015) along with a lightweight embedding extractor and residual networks He et al. (2016). Without using location or motion heuristics, our model outperforms existing methods on the BDD Yu et al. (2020), Waymo Sun et al. (2019), and KITTI Geiger et al. (2012) object tracking benchmarks. The experiments show that our method boosts MOTA by almost 10 points and significantly decreases the number of ID switches on the BDD100K Yu et al. (2020) and Waymo Sun et al. (2019) datasets. Our method allows end-to-end training, thereby simplifying the training and testing procedures of multi-object tracking frameworks. The simplicity and effectiveness shall benefit further research in the related areas.
We also examine the application of the quasi-dense matching on one-shot object detection. In this setting, we can learn how to measure the similarity of objects within the same category, instead of the same identity. The experiments show that we can achieve competitive performance even without fine-tuning the model on novel classes. This further shows the effectiveness of quasi-dense matching for category-level metric learning.
Figure 1: (a) sparse matching vs. (b) quasi-dense matching.
2 Related work
Multiple object tracking
Recent developments in multiple object tracking Leal-Taixé et al. (2017) follow the tracking-by-detection paradigm Ramanan and Forsyth (2003). These approaches present different methods to estimate the instance similarity between detected objects and previous tracklets, then solve the association as a bipartite matching problem Munkres (1957). Spatial proximity has proven effective for associating objects in consecutive frames Bewley et al. (2016); Bochinski et al. (2017). Hence, some methods use motion priors, such as the Kalman filter Bewley et al. (2016); Yu et al. (2016), optical flow Xiao et al. (2018), and bounding box regression Held et al. (2016); Feichtenhofer et al. (2017), to ensure accurate distance estimation. Besides, recent works Kim et al. (2015); Leal-Taixé et al. (2016); Yang and Nevatia (2012); Son et al. (2017); Sadeghian et al. (2017); Milan et al. (2017); Wojke et al. (2017) also learn instance representations to exploit appearance similarity or re-identify vanished objects. These methods directly follow the training practice in image similarity learning, then measure the instance similarity by cosine distance or inner product with softmax. That is, they train the model either as an N-class classification problem Wojke et al. (2017), where N equals the number of identities in the whole training set, or with a triplet loss Hermans et al. (2017). However, the classification problem is hard to extend to large-scale datasets, while the triplet loss only compares each training sample with two other identities. These rudimentary training samples and objectives leave instance similarity learning underexplored in MOT.
Detect & Track Feichtenhofer et al. (2017) is the first work that jointly optimizes object detection and tracking, different from the aforementioned methods that treat them as two separate stages. It adopts a correlation-based regression tracker to propagate objects. MaskTrack R-CNN Yang et al. (2019) introduces a tracking branch to Mask R-CNN He et al. (2017) for simultaneous detection, segmentation, and tracking. Tracktor Bergmann et al. (2019) directly adopts a detector for tracking, but it relies heavily on the prior of small across-frame displacements. RetinaTrack Lu et al. (2020) adds an extra embedding head trained with a triplet loss Sohn (2016) to enable joint object detection and tracking. In contrast, we present quasi-dense matching, which associates objects only with feature embeddings and obtains superior performance.
One-shot object detection
The purpose of one-shot object detection is to detect novel objects with only one annotated example. Recent methods mainly follow the ideas of metric learning Karlinsky et al. (2019) or meta-learning Kang et al. (2019); Wang et al. (2019); Yan et al. (2019). The meta-learning methods implement a meta feature learner and a feature re-weighting mechanism on a single-stage detector Kang et al. (2019) or a two-stage detector Yan et al. (2019). Unlike these methods, we jointly train the embedding extractor with an objectness detector and recognize the category by locating the nearest neighbor among the embeddings of the exemplars.
3 Quasi-dense matching
We propose quasi-dense matching to learn a feature embedding space that can associate identical objects and distinguish different objects. We define dense matching as matching between the box candidates at all pixels, and quasi-dense means we consider the potential object candidates only at the informative regions. Accordingly, sparse matching means considering only ground truth labels as matching candidates when learning object association. In this section, we describe a training and testing framework based on quasi-dense matching, with the main application in joint detection and tracking of multiple objects, although there can be broader applications, such as one-shot object detection. Our method can be directly combined with existing object detection models. The main ingredients of quasi-dense matching are object detection, instance similarity learning, and object association.
Object detection

We adopt Faster R-CNN Ren et al. (2015) with FPN Lin et al. (2017) for object detection. Faster R-CNN is a two-stage detector that uses a Region Proposal Network (RPN) to generate Regions of Interest (RoIs), and then classifies and localizes the regions to obtain precise semantic labels and locations. Based on Faster R-CNN, FPN exploits lateral connections to build a top-down feature pyramid and tackles the scale-variance problem. The entire network is optimized with a multi-task loss function

$\mathcal{L}_{det} = \mathcal{L}_{rpn} + \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{reg}, \quad (1)$

where the RPN loss $\mathcal{L}_{rpn}$, classification loss $\mathcal{L}_{cls}$, and regression loss $\mathcal{L}_{reg}$ remain the same as in the original paper Ren et al. (2015). The loss weights $\lambda_1$ and $\lambda_2$ are set to 1.0 by default.
Instance similarity learning
We directly use the quasi-dense region proposals generated by RPN to learn instance similarity. As shown in Figure 2, given a key image for training, we randomly select a reference image from its temporal neighborhood. The temporal distance is constrained to an interval $k$, where $k \in [-3, 3]$ in our experiments. We use RPN to generate RoIs from the two images and RoI Align He et al. (2017) to obtain their feature maps from different levels of FPN according to their scales Lin et al. (2017). We add an extra lightweight embedding head, in parallel to the original bounding box head, to extract feature embeddings for each RoI. An RoI is defined as positive to an identity if their IoU is higher than $\alpha_1$, or negative if it is lower than $\alpha_2$. We set $\alpha_1$ and $\alpha_2$ to 0.7 and 0.3. The RoIs on different frames are positive to each other if they are associated with the same identity, and negative otherwise.
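To make the assignment rule concrete, here is a minimal pure-Python sketch. The box format `(x1, y1, x2, y2)` and the helper names `iou` and `label_roi` are ours, not from the paper; a real implementation would operate on batched tensors:

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_roi(roi, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return the matched identity index, -1 for a negative RoI,
    or None for an ambiguous RoI that is ignored in training."""
    best_iou, best_idx = 0.0, -1
    for idx, gt in enumerate(gt_boxes):
        o = iou(roi, gt)
        if o > best_iou:
            best_iou, best_idx = o, idx
    if best_iou >= pos_thr:
        return best_idx        # positive: takes the identity of the matched GT
    if best_iou < neg_thr:
        return -1              # negative: background
    return None                # in between the two thresholds: ignored
```

RoIs whose best overlap falls between the two thresholds are neither positive nor negative, mirroring the usual detector sampling convention.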
Assuming the samples on the key frame serve as training samples and the samples on the reference frame serve as contrastive targets, for each training sample we can use the non-parametric softmax Wu et al. (2018); Oord et al. (2018) with cross-entropy to optimize the feature embeddings

$\mathcal{L}_{embed} = -\log \frac{\exp(\mathbf{v} \cdot \mathbf{k}^{+})}{\exp(\mathbf{v} \cdot \mathbf{k}^{+}) + \sum_{\mathbf{k}^{-}} \exp(\mathbf{v} \cdot \mathbf{k}^{-})}, \quad (2)$

where $\mathbf{v}$, $\mathbf{k}^{+}$, and $\mathbf{k}^{-}$ are the feature embeddings of the training sample, its positive target, and its negative targets, respectively. The overall embedding loss is averaged over all training samples; we illustrate only one training sample for simplicity.
In contrast to previous methods that only use cropped sparse ground truth (GT) boxes to learn instance similarity, we apply dense matching between the RoIs on the pair of images, namely, each sample on the key frame is matched to all samples on the reference frame. Considering that each training sample in the key frame can have more than one positive target in the reference frame, we use a simplified loss function from Sun et al. (2020) for multi-positive contrastive learning:

$\mathcal{L}_{embed} = \log \left[ 1 + \sum_{\mathbf{k}^{+}} \sum_{\mathbf{k}^{-}} \exp(\mathbf{v} \cdot \mathbf{k}^{-} - \mathbf{v} \cdot \mathbf{k}^{+}) \right]. \quad (3)$
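The single-positive and multi-positive contrastive losses above can be sketched in a few lines of pure Python (embeddings as plain lists; the function names are ours). With exactly one positive target, the two formulations coincide:

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def single_positive_loss(v, k_pos, k_negs):
    """-log( exp(v.k+) / (exp(v.k+) + sum_k- exp(v.k-)) )"""
    pos = math.exp(dot(v, k_pos))
    neg = sum(math.exp(dot(v, k)) for k in k_negs)
    return -math.log(pos / (pos + neg))

def multi_positive_loss(v, k_poss, k_negs):
    """log( 1 + sum_k+ sum_k- exp(v.k- - v.k+) )"""
    s = sum(math.exp(dot(v, kn) - dot(v, kp))
            for kp in k_poss for kn in k_negs)
    return math.log(1.0 + s)
```

Pulling the positive logit into the sum shows the algebraic equivalence in the single-positive case: $-\log\frac{e^{p}}{e^{p}+\sum e^{n}} = \log(1 + \sum e^{n-p})$.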
We further adopt an L2 loss as an auxiliary loss to constrain the cosine similarity of each matched pair:

$\mathcal{L}_{aux} = \left( \frac{\mathbf{v} \cdot \mathbf{k}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{k} \rVert} - c \right)^{2}, \quad (4)$

where $c$ equals 1 if the two samples are positive to each other and 0 otherwise.
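A minimal sketch of this auxiliary loss, assuming the target $c$ is simply 1 for positive pairs and 0 for negative pairs as stated above (names ours):

```python
import math

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def aux_loss(v, k, is_positive):
    """Squared error between cosine similarity and its pair label."""
    c = 1.0 if is_positive else 0.0
    return (cosine(v, k) - c) ** 2
```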
The entire network is jointly optimized under

$\mathcal{L} = \mathcal{L}_{det} + \gamma_1 \mathcal{L}_{embed} + \gamma_2 \mathcal{L}_{aux}, \quad (5)$

where $\gamma_1$ and $\gamma_2$ are set to 0.25 and 1.0 by default in this paper. We sample all positive pairs and three times as many negative pairs to compute the auxiliary loss.
Object association

Tracking objects across frames purely based on object feature embeddings is non-trivial, as the similarity estimation might be confused by newly appeared objects, vanished tracklets, and instances with similar appearance. Taking advantage of the effective quasi-dense similarity learning, we can associate objects with a simple inference strategy.
Our main strategy is bi-directional matching in the embedding space. Figure 3 shows our testing pipeline. Assume there are $N$ detected objects in frame $t$ with feature embeddings $\mathbf{n}$, and $M$ matching candidates with feature embeddings $\mathbf{m}$ from the past frames. The similarity $\mathbf{f}$ between the objects and the matching candidates is obtained by bi-directional softmax (bi-softmax):

$\mathbf{f}(i, j) = \frac{1}{2} \left[ \frac{\exp(\mathbf{n}_i \cdot \mathbf{m}_j)}{\sum_{k=1}^{M} \exp(\mathbf{n}_i \cdot \mathbf{m}_k)} + \frac{\exp(\mathbf{n}_i \cdot \mathbf{m}_j)}{\sum_{k=1}^{N} \exp(\mathbf{n}_k \cdot \mathbf{m}_j)} \right]. \quad (6)$
A high score under the bi-softmax satisfies a bi-directional consistency: the two matched objects should be each other's nearest neighbor in the embedding space. Given the instance similarity $\mathbf{f}$, we can directly associate the objects with their correspondences by a simple nearest neighbor search.
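The bi-directional softmax can be sketched in pure Python as follows (function names ours; a production system would use batched tensor operations). Averaging the row-wise and column-wise softmax of the inner-product scores rewards mutual nearest neighbors:

```python
import math

def softmax_row(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bi_softmax(objs, cands):
    """Similarity f[i][j] between N object embeddings and M candidate
    embeddings: average of softmax over candidates and over objects."""
    # raw inner-product scores s[i][j] = n_i . m_j
    s = [[sum(a * b for a, b in zip(n, m)) for m in cands] for n in objs]
    row = [softmax_row(r) for r in s]                 # each object over all candidates
    col = [softmax_row([s[i][j] for i in range(len(objs))])
           for j in range(len(cands))]               # each candidate over all objects
    return [[0.5 * (row[i][j] + col[j][i]) for j in range(len(cands))]
            for i in range(len(objs))]
```

Association then reduces to taking, for each object, the candidate with the highest `f[i][j]` above a score threshold.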
Due to the normalization effect of softmax, it is important to ensure there is one and only one correspondence among the matching candidates for each object. Otherwise, the similarity estimation will be ambiguous. Two cases violate this rule.
No correspondence: Objects without correspondence in the feature space should not be matched to any candidate. Newly appeared objects, vanished tracklets, and some false positives fall into this category. The bi-softmax tackles this problem naturally: such objects can hardly obtain bi-directional consistency and thus have low matching scores to all objects. If a newly appeared object has high detection confidence, it can start a new tracklet. Moreover, previous methods often directly drop the objects that do not match any tracklets. We argue that although most of them are false positives, they are still useful regions that subsequent objects are likely to match. We name these unmatched objects backdrops and keep them in the feature space during matching. Experiments show that keeping backdrops helps reduce the number of false positives.
Multiple correspondences: Most state-of-the-art detectors only perform intra-class duplicate removal by Non-Maximum Suppression (NMS). Consequently, there might be objects of different categories at the same location. In most cases, only one of these objects is a true positive while the others are not. Keeping all of them increases the object recall to the maximum extent and thus contributes to a high mean Average Precision (mAP) Everingham et al. (2010); Lin et al. (2014). However, it creates duplicate feature embeddings during the matching process. To handle this issue, we perform inter-class duplicate removal by NMS. The IoU threshold for NMS is 0.7 for objects with high detection confidence (larger than 0.5) and 0.3 for objects with low detection confidence (lower than 0.5).
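One plausible reading of this inter-class duplicate removal, sketched in pure Python (the paper only gives the two IoU thresholds; the greedy sort order and the choice to apply the threshold of the candidate being suppressed are our assumptions):

```python
def iou(a, b):
    """IoU of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area - inter)

def interclass_nms(dets):
    """dets: list of (box, score) across ALL classes; returns kept indices.
    High-confidence candidates (score > 0.5) are suppressed at IoU > 0.7,
    low-confidence ones already at IoU > 0.3."""
    order = sorted(range(len(dets)), key=lambda i: dets[i][1], reverse=True)
    kept = []
    for i in order:
        box, score = dets[i]
        thr = 0.7 if score > 0.5 else 0.3
        if all(iou(box, dets[j][0]) <= thr for j in kept):
            kept.append(i)
    return sorted(kept)
```

Unlike standard per-class NMS, the candidate's class is deliberately ignored here, so co-located boxes of different categories collapse to one embedding.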
4 Experiments

We perform object tracking experiments on BDD100K Yu et al. (2020), Waymo Sun et al. (2019), and KITTI Geiger et al. (2012) to evaluate the effectiveness of quasi-dense instance similarity learning. We also extend the method to one-shot object detection and verify its effectiveness on the PASCAL VOC Everingham et al. (2010) dataset.
4.1 Implementation details
Our method is implemented on mmdetection Chen et al. (2019). We use ResNet-50 He et al. (2016) as the backbone and keep all hyper-parameters consistent with mmdetection unless otherwise specified. We select 128 RoIs from the key frame as training samples, and 256 RoIs from the reference frame with a positive-negative ratio of 1.0 as contrastive targets. We use IoU-balanced sampling Pang et al. (2019) to sample the RoIs. We use a 4conv-1fc head with group normalization Wu and He (2018) to extract feature embeddings. The channel number of the embedding features is set to 256 by default. We train our models with a total batch size of 32 and an initial learning rate of 0.04 for 12 epochs, decreasing the learning rate by a factor of 10 after 8 and 11 epochs.
Images are trained and tested at their original scales. When conducting online joint object detection and tracking, we initialize a new tracklet if its detection confidence is higher than 0.8. All other confidence thresholds are set to 0.5 in this paper unless otherwise mentioned. Objects can be associated only when they are classified as the same category. The feature embeddings of each identity are updated online with a momentum of 0.8. We keep each tracklet alive for 10 frames and only keep backdrops from the most recent frame.
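A minimal sketch of the tracklet bookkeeping implied above. The exponential-moving-average form of the momentum update and the class name are our assumptions, since the text only states the momentum value (0.8) and the 10-frame lifetime:

```python
class Tracklet:
    """Keeps one identity's embedding and its last matched frame."""

    def __init__(self, emb, frame_id):
        self.emb = emb
        self.last_seen = frame_id

    def update(self, det_emb, frame_id, momentum=0.8):
        # Assumed EMA form: e <- momentum * e + (1 - momentum) * e_detection
        self.emb = [momentum * a + (1 - momentum) * b
                    for a, b in zip(self.emb, det_emb)]
        self.last_seen = frame_id

    def alive(self, frame_id, patience=10):
        # A tracklet is kept for up to `patience` frames without a match.
        return frame_id - self.last_seen <= patience
```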
4.2 BDD100K joint object detection and tracking
We use the BDD100K Yu et al. (2020) detection training set and tracking training set for training, and the tracking validation/testing set for testing. All ablation studies are conducted on the tracking validation set. The detection set has 70,000 images for training. The tracking set has 1,400 videos (278,079 images) for training, 200 videos (39,973 images) for validation, and 400 videos (79,636 images) for testing. The images in the tracking set are annotated at 5 fps, while the videos run at 30 fps. During training, an image from the detection set uses itself as the reference image. We follow the official guidelines and evaluate the tracking performance using standard evaluation metrics Ristani et al. (2016). The terms mMOTA, mIDF1, and mMOTP are results averaged over the 8 categories annotated in this dataset.
|Yu Yu et al. (2020)||val||25.9||44.5||69.7||122406||52372||8315||8396||3795||28.1|
|Yu Yu et al. (2020)||test||26.3||44.7||69.4||213220||100230||14674||16299||6017||27.9|
|Key frame|Reference frame|MOTA|IDF1|ID Sw.|mMOTA|mIDF1|MOTA(P)|IDF1(P)|ID Sw.(P)|
|---|---|---|---|---|---|---|---|---|---|
|Positive|GT + Negative|61.5|66.8|7986|35.5|50.0|40.5|52.7|2015|
|Positive|Positive + Negative|62.5|67.8|7476|36.2|50.0|44.0|54.3|1905|
|Bi-Softmax||Matching candidates||MOTA||IDF1||ID Sw.||mMOTA||mIDF1||MOTA(P)||IDF1(P)||ID Sw.(P)|
The main results on BDD100K tracking validation set and testing set are in Table 1. The mMOTA and mIDF1, which represent object coverage and identity consistency respectively, are 36.6% and 50.8% on the validation set, and 35.2% and 51.8% on the testing set. On the two sets, our method outperforms the baseline method by 10.7 points and 8.9 points in terms of mMOTA, and 6.3 points and 7.1 points in terms of mIDF1, respectively. The significant advancements demonstrate that our method enables more stable object tracking.
We conduct ablation studies to examine the effectiveness of quasi-dense matching, as shown in the top sub-table of Table 2. The terms MOTA and IDF1 are calculated over all instances without considering categories, as overall evaluations. We use cosine distance to calculate the similarity scores during inference. Compared to learning with sparse ground truths, quasi-dense matching improves the overall IDF1 by 4.8 points (63.0% to 67.8%). The significant improvement on IDF1 indicates that quasi-dense matching greatly improves the feature embeddings and enables more accurate associations. We then analyze the improvements in detail. From the table, we can observe that including more positive samples as training samples brings only marginal improvements. However, when we match each training sample to more negative samples and train the feature space with Equation 2, the IDF1 is significantly improved by 3.4 points, which contributes 70% of the total 4.8-point IDF1 gain. This experiment shows that more contrastive targets, even though most of them are negative samples, can improve the feature learning process. The multi-positive contrastive learning following Equation 3 further improves the IDF1 by 1 point (66.8% to 67.8%).
We also investigate how different inference strategies influence the performance. As shown in the bottom part of Table 2, replacing cosine similarity with bi-softmax improves the overall IDF1 by 2.2 points and the pedestrian IDF1 by 4.5 points. This experiment shows that the one-to-one constraint further strengthens the estimated similarity. With duplicate removal and backdrops, the IDF1 is improved by another 1.5 points. Overall, our training and inference strategies significantly improve the IDF1 by 8.5 points (63.0% to 71.5%), and the total number of ID switches is decreased by 30%. In particular, the MOTA and IDF1 of pedestrians are improved by 9.1 points and 10.5 points respectively, which further demonstrates the power of quasi-dense matching.
Finally, we try to add the location and motion priors to understand whether they are still helpful when the feature embeddings are greatly improved. These experiments follow the procedures in Tracktor Bergmann et al. (2019) and use the same detector for fair comparisons. As shown in Table 3, without appearance features, the tracking performance is consistently improved with the introduction of additional information. However, these cues barely enhance the performance of our approach. Our method yields the best results when only using appearance embeddings. The results indicate that our instance feature embeddings are sufficient for multiple object tracking with the effective quasi-dense matching.
To understand the runtime efficiency, we profile our method on an NVIDIA Tesla V100. Because it only adds a lightweight embedding head to Faster R-CNN, our method brings only marginal inference cost. At the original input scale with a ResNet-50 backbone, the inference speed is 14.1 FPS.
|IoU baseline Lu et al. (2020)||Vehicle||38.25||-||-||-||-||-||-||45.78|
|Tracktor++ Bergmann et al. (2019); Lu et al. (2020)||Vehicle||42.62||-||-||-||-||-||-||42.41|
|RetinaTrack Lu et al. (2020)||Vehicle||44.92||-||-||-||-||-||-||45.70|
|Tracktor Kim et al. (2018); Sun et al. (2019)||Vehicle||34.80||10.61||14.88||39.71||28.29||8.63||12.10||50.98|
4.3 Waymo 2D object tracking
The Waymo open dataset contains images from 5 cameras facing 5 different directions: front, front left, front right, side left, and side right. There are 3,990 videos (790,405 images) for training, 1,010 videos (199,935 images) for validation, and 750 videos (148,235 images) for testing. We use COCO pre-trained models following Lu et al. (2020). The tracking performance is evaluated for three categories (vehicle, pedestrian, and cyclist) and two difficulty levels (L1 and L2) on the official benchmark.
Table 4 shows our main results on the Waymo open dataset. We report results on the validation set following the setup of RetinaTrack Lu et al. (2020), and obtain results on the test set via the official evaluation. Our method outperforms all baselines on both the validation set and the test set. We obtain a MOTA of 44.0% and an IDF1 of 56.8% on the validation set, and a MOTA/L1 of 49.40% and a MOTA/L2 of 43.88% on the test set. The vehicle performance on the validation set is 10.7, 13.0, and 17.4 points higher than RetinaTrack Lu et al. (2020), Tracktor++ Bergmann et al. (2019); Lu et al. (2020), and the IoU baseline Lu et al. (2020), respectively. Our approach also outperforms the baseline method on the official benchmark, improving MOTA/L1 from 34.80% to 49.40% and MOTA/L2 from 28.29% to 43.88%.
4.4 KITTI multiple object tracking
We use both detection set and tracking set of KITTI dataset for training. The object detection set consists of 7,481 training images. The tracking set contains 21 training sequences (8,008 images) and 29 testing sequences (11,095 images) with 10 fps frame rate Geiger et al. (2012). The object tracking benchmark only evaluates 2 classes (car and pedestrian) out of 8 labeled classes.
We present the results in Table 5 and compare our method with peer-reviewed methods. Quasi-dense matching outperforms state-of-the-art methods with a MOTA of 85.76% for car and a MOTA of 56.81% for pedestrian. We obtain a 4.46-point improvement on MOTA for pedestrian and significantly reduce the number of ID switches for car.
|IMMDP Xiang et al. (2015)||83.04||5269||391||172|
|MOTBeyondPixels Sharma et al. (2018)||82.24||4247||705||468|
|mono3DT Hu et al. (2019)||84.52||4242||705||377|
|mmMOT Zhang et al. (2019)||84.77||4243||711||284|
|MOTSFusion Luiten et al. (2020)||84.83||4260||681||275|
|MASS Karunasekera et al. (2019)||85.04||4101||742||301|
|JCSTD Tian et al. (2020)||44.20||711975||889||53|
|MCMOT-CPD Lee et al. (2016)||45.94||11112||1260||143|
|NOMT Choi (2015)||46.62||10427||1867||63|
|MDP Xiang et al. (2015)||47.22||9540||2592||87|
|Be-Track Dimitrievski et al. (2019)||51.29||9943||1215||118|
|CAT Nguyen et al. (2019)||52.35||9150||1676||206|
4.5 PASCAL VOC one-shot object detection
We extend our method to one-shot object detection with minor modifications. We follow the setup in Kang et al. (2019); Wang et al. (2019) to conduct experiments on PASCAL VOC Everingham et al. (2010). We use the VOC2007 and VOC2012 train/val sets for training and the VOC2007 test set for testing. The 20 classes are split into 15 base classes and 5 novel classes, and each novel class can access only one annotated object. We perform experiments on 3 different base/novel splits following the image lists in Kang et al. (2019).
Considering the limited GPU memory, we randomly sample 3 images from the entire dataset as reference frames, which differs from episodic training with a gallery of all classes. The detector is trained only to distinguish objectness. The category of each object is determined by locating the nearest neighbor among the 20 given objects in the embedding space. As shown in Table 6, our results are comparable with the state-of-the-art methods. These experiments demonstrate the potential of extending quasi-dense matching to category-level metric learning.
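A sketch of this nearest-neighbor category assignment (names ours; embeddings as plain lists, with cosine similarity as the assumed metric):

```python
import math

def cosine(a, b):
    """Cosine similarity of two embedding vectors."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def classify_by_nn(query, exemplars):
    """Index of the exemplar embedding nearest to the query.
    In the one-shot setting, each exemplar corresponds to one class."""
    return max(range(len(exemplars)), key=lambda i: cosine(query, exemplars[i]))
```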
|Method|Detector|Backbone|Novel Set 1|Novel Set 2|Novel Set 3|Average|
|---|---|---|---|---|---|---|
|FSRW Kang et al. (2019)|YOLO v2|DarkNet-19|14.8|15.7|21.3|17.3|
|MetaDet Wang et al. (2019)|Faster R-CNN|VGG16|18.9|21.8|20.6|20.4|
|FRCNN+ft-full Yan et al. (2019)|Faster R-CNN|ResNet-101|13.8|7.9|9.8|10.5|
|Meta R-CNN Yan et al. (2019)|Faster R-CNN|ResNet-101|19.9|10.4|14.3|14.9|
We present a quasi-dense matching method for instance similarity learning. In contrast to previous methods that use sparse ground-truth matching as similarity supervision, we learn instance similarity from hundreds of region proposals in a pair of images and train the feature embeddings with multi-positive contrastive learning. In the resulting feature space, a simple nearest neighbor search can distinguish instances without bells and whistles. Our method can be easily coupled with most existing detectors and applied to broad applications such as multiple object tracking and one-shot object detection. The simplicity and effectiveness shall benefit research in related areas.
Appendix A Appendix
In this appendix, we present category-wise results, investigate oracle performance, analyze failure cases, and show visualizations on BDD100K joint object detection and tracking dataset Yu et al. (2020). We also present more experimental results on Waymo 2D object tracking dataset Sun et al. (2019) and show visualizations on PASCAL VOC Everingham et al. (2010) one-shot object detection dataset.
a.1 BDD100K joint object detection and tracking
We present more analyses of quasi-dense matching on BDD100K tracking validation set. All experiments use the same model as the one in the main paper.
a.1.1 Category-wise results
The BDD100K tracking set consists of 8 categories that follow a long-tail distribution: almost 75% of the instances are "car" and 15% are "pedestrian". We present the category-wise results in Table 7. We can observe that "car" has the highest performance. The performance of "pedestrian", "bus", and "truck" is comparable across the three and almost 20 points lower than that of "car". The performance of the other classes is even lower, and all results of "train" are 0. The inferior results of these classes may be caused by the limited training samples. Hence, the long-tail problem remains challenging in this dataset.
a.1.2 Oracle analysis
We directly extract feature embeddings of the ground truth objects in each frame and associate them to investigate the oracle performance. The results are shown in Table 8.
We can observe that all MOTAs are higher than 94%, and some are even close to 100%. This is because we use the ground truth boxes directly, so the numbers of false negatives and false positives are close to 0. The high MOTAs show that the MOTA metric penalizes object coverage (detection) more heavily than identity consistency (tracking).
The metrics IDF1 and ID switches can measure the performance of identity consistency. As shown in Table 8, the average IDF1 over the 8 classes is 88.8%, which is 38 points higher than our result. The gaps between the oracle results and ours on the classes "car" and "pedestrian" are only 11.1 points and 19.3 points respectively, while the gaps on other classes exceed 30 points. The high oracle IDF1s demonstrate the effectiveness of quasi-dense matching: given highly accurate detection results, our method can obtain robust feature embeddings and associate objects effectively. However, the huge performance gaps also indicate the need to improve detection algorithms in the video domain. We also notice that the total number of ID switches in the oracle experiment is higher than ours. This is due to the high object recall in the oracle experiment, as more detected instances may introduce more ID switches accordingly.
a.1.3 Failure case analysis
Our method can distinguish different instances even when they are similar in appearance. However, there are still some failure cases. We show them below with figures, in which yellow denotes false negatives, red denotes false positives, and cyan denotes ID switches. The float number at the corner of each box indicates the detection score, while the integer indicates the object identity number. We use green dashed boxes to highlight the objects we want to emphasize.
Inaccurate classification confidence is the main distraction for the association procedure because false negatives and false positives destroy the one-to-one matching constraint. As shown in Figure 4, the false negatives are mainly small or occluded objects in crowded scenes. The false positives are objects that have appearances similar to annotated objects, such as persons in a mirror or on an advertising board.
Inaccurate object category is a less frequent distraction caused by classification. The class of an instance may switch between different categories, which mostly belong to the same super-category. Figure 5 shows an example: the category of the highlighted object changes from "rider" to "pedestrian" when the bicycle is occluded. Our method fails in this case because we require the associated objects to have the same category.
These failure cases caused by object classification suggest room for improvement in video object detection. Temporal or tracking information could be exploited to improve the detectors and thus obtain better tracking performance.
Object truncation/occlusion causes inaccurate object localization. As shown in Figure 6, the highlighted objects are truncated by other objects, and the detector outputs two boxes: one is a false positive that covers only a part of the object, while the other has a lower detection score but covers the entire object. This case may influence the association process if the two boxes have similar feature embeddings.
An instance may have totally different appearances before and after occlusion, resulting in low similarity scores. As shown in Figure 7, only the front of the car appears before the occlusion, while only the rear appears after it. Our method can associate two boxes if they cover the same discriminative regions of an object, not necessarily the exact same region. However, if two boxes cover totally different regions of the object, they will have a low matching score.
Another corner case is extreme truncation. As shown in Figure 8, highly truncated objects are only partially visible when they just enter or leave the camera view, so we cannot distinguish different instances effectively from such limited appearance information.
We show visualizations of different instance patches during the testing procedure in Figure 9. The detected objects in each frame are matched to prior objects via bi-directional softmax. The prior objects include tracklets from consecutive frames, vanished tracklets, and backdrops, annotated with different colors. Each detected object is enclosed in the same color as its matched object. We can observe that most false positives in the current frame are matched to backdrops, which demonstrates that keeping backdrops during the matching procedure helps reduce the number of false positives.
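A bi-directional softmax of this kind can be sketched as follows: the score of a detection-prior pair averages a softmax taken over all prior objects with a softmax taken over all detections, so a match must be mutually preferred. This is a minimal NumPy sketch under our own naming; the actual implementation may differ in details such as temperature or normalization:

```python
import numpy as np

def bidirectional_softmax(det_embeds, prior_embeds):
    """Score detections against prior objects (tracklets, vanished tracklets, backdrops).

    det_embeds: (N, D) embeddings of current detections.
    prior_embeds: (M, D) embeddings of prior objects.
    Returns an (N, M) score matrix.
    """
    logits = det_embeds @ prior_embeds.T                    # (N, M) inner products
    row = np.exp(logits - logits.max(axis=1, keepdims=True))
    row /= row.sum(axis=1, keepdims=True)                   # softmax over prior objects
    col = np.exp(logits - logits.max(axis=0, keepdims=True))
    col /= col.sum(axis=0, keepdims=True)                   # softmax over detections
    return 0.5 * (row + col)                                # average of both directions

rng = np.random.default_rng(0)
dets, priors = rng.normal(size=(3, 8)), rng.normal(size=(5, 8))
scores = bidirectional_softmax(dets, priors)
matches = scores.argmax(axis=1)  # each detection picks its best prior object
```

Because backdrops sit in the prior set, a false positive detection can take its highest score against a backdrop instead of stealing a tracklet, which is the effect visualized in Figure 9.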
| Backbone | Category | Split | MOTA↑ (L1) | FP↓ (L1) | Mismatch↓ (L1) | Miss↓ (L1) | MOTP (L1) | MOTA↑ (L2) | FP↓ (L2) | Mismatch↓ (L2) | Miss↓ (L2) | MOTP (L2) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| R101 + DCN | Vehicle | val | 52.46 | 7.19 | 1.59 | 38.77 | 64.68 | 42.64 | 6.63 | 1.34 | 49.40 | 54.55 |
| R101 + DCN | Pedestrian | val | 56.00 | 9.24 | 2.03 | 32.73 | 66.80 | 51.87 | 8.85 | 1.89 | 37.39 | 61.98 |
| R101 + DCN | Cyclist | val | 36.73 | 10.34 | 0.72 | 52.21 | 45.23 | 30.50 | 8.59 | 0.60 | 60.31 | 36.91 |
| R101 + DCN | All | val | 48.40 | 8.92 | 1.45 | 41.23 | 58.90 | 41.67 | 8.02 | 1.27 | 49.03 | 51.15 |
| R101 + DCN | Vehicle | test | 58.84 | 6.66 | 1.54 | 32.96 | 70.13 | 49.23 | 6.17 | 1.32 | 43.28 | 60.27 |
| R101 + DCN | Pedestrian | test | 55.77 | 9.02 | 2.08 | 33.12 | 67.77 | 52.87 | 9.25 | 1.99 | 35.89 | 65.83 |
| R101 + DCN | Cyclist | test | 38.93 | 7.25 | 0.72 | 53.10 | 46.73 | 33.17 | 6.18 | 0.61 | 60.05 | 40.83 |
| R101 + DCN | All | test | 51.18 | 7.64 | 1.45 | 39.73 | 61.55 | 45.09 | 7.20 | 1.31 | 46.41 | 55.64 |
A.2 Waymo 2D object tracking
We show more results on the Waymo 2D object tracking benchmark in Table 9. We test our method with ResNet-50 (R50) on the validation set following the official evaluation metrics. We also report results with ResNet-101 and deformable convolutions Dai et al. (2017) (R101 + DCN) on both the validation and test sets. Our method achieves state-of-the-art performance on the Waymo 2D object tracking benchmark without bells and whistles.
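As a sanity check on the reported numbers: under CLEAR-MOT-style metrics such as Waymo's, MOTA (%) equals 100 minus the summed false positive, mismatch, and miss percentages. Assuming the metric columns follow the order MOTA, FP, Mismatch, Miss, plugging in the LEVEL_1 vehicle numbers from the validation set reproduces the reported MOTA up to rounding:

```python
def mota(fp_pct, mismatch_pct, miss_pct):
    # MOTA (%) = 100 - (FP% + Mismatch% + Miss%)
    return 100.0 - (fp_pct + mismatch_pct + miss_pct)

# Vehicle, validation set, LEVEL_1: FP 7.19, Mismatch 1.59, Miss 38.77
value = mota(7.19, 1.59, 38.77)  # ≈ 52.45, vs. the reported MOTA of 52.46
```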
A.3 PASCAL VOC one-shot object detection
We show the testing procedure of one-shot object detection in Figure 10. We first obtain the objectness score for the object in the query image. We treat the given ground truths in support images as exemplars and extract their feature embeddings accordingly. We also extract the feature embedding of the object in the query image. Then we compute inner products between the query embedding and the support embeddings, followed by a softmax, to obtain category-level similarity scores. The objectness score is then distributed to each category according to the obtained similarity scores.
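The scoring step above can be sketched as follows. This is a minimal NumPy illustration; the function name and shapes are our assumptions, with one exemplar embedding per support category:

```python
import numpy as np

def distribute_objectness(objectness, query_embed, support_embeds):
    """Turn a class-agnostic objectness score into per-category scores.

    objectness: scalar objectness score of the query object.
    query_embed: (D,) embedding of the detected query object.
    support_embeds: (C, D) one exemplar embedding per support category.
    """
    logits = support_embeds @ query_embed          # (C,) inner products
    logits = logits - logits.max()                 # numerical stability
    sims = np.exp(logits) / np.exp(logits).sum()   # softmax over categories
    return objectness * sims                       # per-category detection scores

query = np.array([1.0, 0.0])
supports = np.array([[1.0, 0.0],   # category 0 exemplar, similar to the query
                     [0.0, 1.0]])  # category 1 exemplar
scores = distribute_objectness(0.9, query, supports)
# The per-category scores sum back to the original objectness score.
```

Because the softmax normalizes over categories, the objectness mass is split rather than duplicated, so a confident box cannot score highly for several categories at once.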
- Multi-target tracking by continuous energy minimization. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1272, 2011.
- Tracking without bells and whistles. arXiv preprint arXiv:1903.05625.
- Simple online and realtime tracking. In International Conference on Image Processing.
- High-speed tracking-by-detection without using image information. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6, 2017.
- MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
- Multiple target tracking in world coordinate with single, minimally calibrated camera. In European Conference on Computer Vision, pp. 553–567.
- Near-online multi-target tracking with aggregated local flow descriptor. In IEEE International Conference on Computer Vision.
- Deformable convolutional networks. In IEEE International Conference on Computer Vision.
- Behavioral pedestrian tracking using a camera and lidar sensors on a moving vehicle. Sensors.
- The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision 88.
- Detect to track and track to detect. In IEEE International Conference on Computer Vision.
- Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition.
- Dimensionality reduction by learning an invariant mapping. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), Vol. 2, pp. 1735–1742, 2006.
- Mask R-CNN. In IEEE International Conference on Computer Vision, pp. 2961–2969.
- Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
- Learning to track at 100 FPS with deep regression networks. In European Conference on Computer Vision.
- In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
- Joint monocular 3D vehicle detection and tracking. In IEEE International Conference on Computer Vision.
- Few-shot object detection via feature reweighting. In IEEE International Conference on Computer Vision.
- RepMet: representative-based metric learning for classification and few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition.
- Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access.
- Multiple hypothesis tracking revisited. In IEEE International Conference on Computer Vision.
- Multi-object tracking with neural gating using bilinear LSTM. In European Conference on Computer Vision.
- Learning by tracking: Siamese CNN for robust target association. In IEEE Conference on Computer Vision and Pattern Recognition Workshop.
- Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781.
- Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision Workshop.
- Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition.
- Microsoft COCO: common objects in context. In European Conference on Computer Vision.
- RetinaTrack: online single stage joint detection and tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- Track to reconstruct and reconstruct to track. IEEE Robotics and Automation Letters.
- Online multi-target tracking using recurrent neural networks. In AAAI Conference on Artificial Intelligence.
- Algorithms for the assignment and transportation problems. Society for Industrial and Applied Mathematics.
- Confidence-aware pedestrian tracking using a stereo camera. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences.
- Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
- Libra R-CNN: towards balanced learning for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830.
- Finding and tracking people from the bottom up. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. II–II, 2003.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems.
- Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision.
- Features for multi-target multi-camera tracking and re-identification. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6036–6046.
- Tracking the untrackable: learning to track multiple cues with long-term dependencies. In IEEE International Conference on Computer Vision.
- Beyond pixels: leveraging geometry and shape cues for online multi-object tracking. In IEEE International Conference on Robotics and Automation.
- Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865.
- Multi-object tracking with quadruplet convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition.
- Scalability in perception for autonomous driving: Waymo open dataset.
- Circle loss: a unified perspective of pair similarity optimization. arXiv preprint arXiv:2002.10857.
- Online multi-object tracking using joint domain information in traffic scenarios. IEEE Transactions on Intelligent Transportation Systems.
- Meta-learning to detect rare objects. In IEEE International Conference on Computer Vision.
- Simple online and realtime tracking with a deep association metric. In International Conference on Image Processing.
- Group normalization. In European Conference on Computer Vision, pp. 3–19.
- Unsupervised feature learning via non-parametric instance discrimination. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742.
- Learning to track: online multi-object tracking by decision making. In IEEE International Conference on Computer Vision.
- Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision, pp. 466–481.
- Meta R-CNN: towards general solver for instance-level low-shot learning. In IEEE International Conference on Computer Vision.
- An online learned CRF model for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition.
- Video instance segmentation. In IEEE International Conference on Computer Vision.
- POI: multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision Workshop.
- BDD100K: a diverse driving dataset for heterogeneous multitask learning. In IEEE Conference on Computer Vision and Pattern Recognition.
- Robust multi-modality multi-object tracking. In IEEE International Conference on Computer Vision.