Quasi-Dense Instance Similarity Learning

06/11/2020 · Jiangmiao Pang et al.

Similarity metrics for instances have drawn much attention, due to their importance for computer vision problems such as object tracking. However, existing methods regard object similarity learning as a post-hoc stage after object detection and only use sparse ground truth matching as the training objective. This process ignores the majority of the regions on the images. In this paper, we present a simple yet effective quasi-dense matching method to learn instance similarity from hundreds of region proposals in a pair of images. In the resulting feature space, a simple nearest neighbor search can distinguish different instances without bells and whistles. When applied to joint object detection and tracking, our method can outperform existing methods without using location or motion heuristics, yielding almost 10 points higher MOTA on BDD100K and Waymo tracking datasets. Our method is also competitive on one-shot object detection, which further shows the effectiveness of quasi-dense matching for category-level metric learning. The code will be available at https://github.com/sysmm/quasi-dense.


1 Introduction

Instance similarity learning is crucial for many computer vision problems. In contrast to image-level similarity metrics Hermans et al. (2017); Sohn (2016) that directly learn from the whole inputs, instance-level similarity needs to consider object locations, scales, and contexts. The estimated similarity serves as an essential component for many computer vision applications such as object tracking. Contemporary multiple object tracking methods Bergmann et al. (2019); Bewley et al. (2016); Wojke et al. (2017) mainly follow the tracking-by-detection paradigm Ramanan and Forsyth (2003). That is, they detect objects in each frame and then associate them according to the estimated instance similarity. Recent works Bergmann et al. (2019); Bewley et al. (2016); Bochinski et al. (2017) show that if the detected objects are accurate, the spatial proximity between objects in consecutive frames, measured by Intersection over Union (IoU) or center distance, is a strong prior for associating the objects. However, this location heuristic only works well in simple scenarios. If the objects are occluded or the scenes are crowded, it can easily lead to mistakes. To remedy this problem, some methods introduce motion estimation Andriyenko and Schindler (2011); Choi and Savarese (2010) or a regression-based tracker Feichtenhofer et al. (2017) to ensure an accurate distance estimation.

However, object appearance similarity usually plays a secondary role Lu et al. (2020); Ristani and Tomasi (2018); Wojke et al. (2017), strengthening object association or re-identifying vanished objects. The search region is constrained to local neighborhoods to avoid distractions, because the appearance features cannot distinguish different objects effectively. We conjecture that this is because the image and object information is not fully utilized for learning object similarity. Previous methods regard instance similarity learning as a post-hoc stage after object detection and only use sparse ground truth bounding boxes as training samples Wojke et al. (2017). This process ignores the majority of the regions proposed on the images. Because objects in an image are rarely identical to each other, if the object representation is learned properly, a nearest neighbor search in the embedding space is able to associate and distinguish instances without motion priors.

We observe that besides the ground truth and detected bounding boxes, which are sparsely distributed on the images, many object proposals can provide valuable training supervision. They are either close to the ground truth bounding boxes, providing more positive training examples, or in the background, serving as negative examples. We propose a simple yet effective quasi-dense matching method, namely, dense matching between hundreds of regions of interest from a pair of images, as shown in Figure 1. The quasi-dense samples cover most of the informative regions on the images, providing both more box examples and more matching targets. In training, similar to contrastive learning of image representations Hadsell et al. (2006); Sohn (2016); Wu et al. (2018), each object sample is matched to all proposals on the other image and the model is trained to identify its targets among all the candidates.

The instance representations learned from quasi-dense matching allow a nearest neighbor search to distinguish different instances at inference time. We use a bi-directional softmax to obtain the similarity scores between detected boxes and tracklets, which imposes a soft one-to-one matching constraint. This bi-directional matching can handle newly appearing objects and terminated tracklets: objects with no correspondence during matching lack the one-to-one consistency and thus have low similarity scores to all other objects.

Quasi-dense matching can be easily coupled with most existing detectors since generating regions of interest is widely used in object detection algorithms. We apply our method to Faster R-CNN Ren et al. (2015) along with a lightweight embedding extractor and residual networks He et al. (2016). Without using location or motion heuristics, our model outperforms existing methods on the BDD Yu et al. (2020), Waymo Sun et al. (2019), and KITTI Geiger et al. (2012) object tracking benchmarks. The experiments show that our method boosts MOTA by almost 10 points and significantly decreases the number of ID switches on the BDD100K Yu et al. (2020) and Waymo Sun et al. (2019) datasets. Our method allows end-to-end training, thereby simplifying the training and testing procedures of multi-object tracking frameworks. The simplicity and effectiveness shall benefit further research in the related areas.

We also examine the application of the quasi-dense matching on one-shot object detection. In this setting, we can learn how to measure the similarity of objects within the same category, instead of the same identity. The experiments show that we can achieve competitive performance even without fine-tuning the model on novel classes. This further shows the effectiveness of quasi-dense matching for category-level metric learning.

                        (a) Sparse                                             (b) Quasi-Dense
Figure 1: In contrast to previous methods that (a) use sparse ground truth matching as similarity supervision, we present (b) quasi-dense matching that learns similarity from hundreds of regions of interest.

2 Related work

Multiple object tracking

Recent developments in multiple object tracking Leal-Taixé et al. (2017) follow the tracking-by-detection paradigm Ramanan and Forsyth (2003). These approaches present different methods to estimate the instance similarity between detected objects and previous tracklets, then solve the matching process as a bipartite matching problem Munkres (1957). Spatial proximity has been proven effective for associating objects in consecutive frames Bewley et al. (2016); Bochinski et al. (2017). Hence, some methods use motion priors, such as the Kalman filter Bewley et al. (2016); Yu et al. (2016), optical flow Xiao et al. (2018), and bounding box regression Held et al. (2016); Feichtenhofer et al. (2017), to ensure accurate distance estimation. Besides, recent works Kim et al. (2015); Leal-Taixé et al. (2016); Yang and Nevatia (2012); Son et al. (2017); Sadeghian et al. (2017); Milan et al. (2017); Wojke et al. (2017) also learn instance representations to exploit appearance similarity or to re-identify vanished objects. These methods directly follow the training practice in image similarity learning, then measure the instance similarity by cosine distance or inner product with softmax, etc. That is, they train the model either as an N-class classification problem Wojke et al. (2017), where N equals the number of identities in the whole training set, or with a triplet loss Hermans et al. (2017). However, the classification formulation is hard to extend to large-scale datasets, while the triplet loss only compares each training sample with two other identities. These rudimentary training samples and objectives leave instance similarity learning not fully explored in MOT.

Detect & Track Feichtenhofer et al. (2017) is the first work that jointly optimizes object detection and tracking, in contrast to the aforementioned methods that treat detection and tracking as two separate stages. It adopts a correlation-based regression tracker to propagate objects. MaskTrack R-CNN Yang et al. (2019) introduces a tracking branch to Mask R-CNN He et al. (2017) for simultaneous detection, segmentation, and tracking. Tracktor Bergmann et al. (2019) directly adopts a detector for tracking, but it relies heavily on the prior of small across-frame displacements. RetinaTrack Lu et al. (2020) adds an extra embedding head trained with a triplet loss Sohn (2016) to enable joint object detection and tracking. In contrast, we present quasi-dense matching, which associates objects only with feature embeddings and obtains superior performance.

One-shot object detection

The purpose of one-shot object detection is to detect novel objects with only one annotated example. Recent methods mainly follow the ideas of metric learning Karlinsky et al. (2019) or meta-learning Kang et al. (2019); Wang et al. (2019); Yan et al. (2019). The meta-learning methods implement a meta feature learner and a feature re-weighting mechanism on a single-stage detector Kang et al. (2019) or a two-stage detector Yan et al. (2019). Unlike these methods, we jointly train the embedding extractor with an objectness detector and recognize the category by locating the nearest neighbor among the embeddings of the exemplars.

3 Quasi-dense matching

Figure 2: The training pipeline of our method with FPN Faster R-CNN. We apply dense matching between quasi-dense samples and optimize the network with contrastive learning.

We propose quasi-dense matching to learn a feature embedding space that can associate identical objects and distinguish different objects. We define dense matching as matching between box candidates at all pixels, whereas quasi-dense matching considers potential object candidates only at informative regions. Accordingly, sparse matching means only considering ground truth labels as matching candidates when learning object association. In this section, we describe a training and testing framework based on quasi-dense matching, with the main application in joint detection and tracking of multiple objects, although there can be broader applications such as one-shot object detection. Our method can be directly combined with existing object detection models. The main ingredients of quasi-dense matching are object detection, instance similarity learning, and object association.

Object detection

We adopt Faster R-CNN Ren et al. (2015) with Feature Pyramid Network (FPN) Lin et al. (2017) for object detection. Faster R-CNN is a two-stage detector that uses a Region Proposal Network (RPN) to generate Regions of Interest (RoIs), and then classifies and localizes the regions to obtain precise semantic labels and locations. On top of Faster R-CNN, FPN exploits lateral connections to build a top-down feature pyramid and tackles the scale-variance problem. The entire network is optimized with a multi-task loss function

\mathcal{L}_{\mathrm{detect}} = \mathcal{L}_{\mathrm{rpn}} + \lambda_{1}\mathcal{L}_{\mathrm{cls}} + \lambda_{2}\mathcal{L}_{\mathrm{reg}},    (1)

where the RPN loss \mathcal{L}_{\mathrm{rpn}}, classification loss \mathcal{L}_{\mathrm{cls}}, and regression loss \mathcal{L}_{\mathrm{reg}} remain the same as in the original paper Ren et al. (2015). The loss weights \lambda_{1} and \lambda_{2} are set to 1.0 by default.

Instance similarity learning

We directly use the quasi-dense region proposals generated by RPN to learn instance similarity. As shown in Figure 2, given a key image for training, we randomly select a reference image from its temporal neighborhood; the temporal distance between the two images is constrained to lie within a fixed interval in our experiments. We use RPN to generate RoIs from the two images and RoI Align He et al. (2017) to obtain their feature maps from different levels of FPN according to their scales Lin et al. (2017). We add an extra lightweight embedding head, in parallel to the original bounding box head, to extract a feature embedding for each RoI. An RoI is defined as positive to an identity if they have an IoU higher than \alpha_{1}, and negative if the IoU is lower than \alpha_{2}. We set \alpha_{1} and \alpha_{2} to 0.7 and 0.3, respectively. RoIs on different frames are positive to each other if they are associated with the same identity, and negative otherwise.
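
As an illustration of the assignment rule above, the following minimal PyTorch-style sketch labels RoIs against a set of ground-truth boxes using the 0.7/0.3 IoU thresholds. The function and variable names are our own and not taken from the released code.

```python
import torch
from torchvision.ops import box_iou  # boxes in (x1, y1, x2, y2) format

def assign_rois(rois, gt_boxes, gt_ids, pos_thr=0.7, neg_thr=0.3):
    """Assign each RoI an identity (>=0), background (-1), or ignore (-2).

    rois:     (N, 4) proposal boxes
    gt_boxes: (G, 4) ground-truth boxes
    gt_ids:   (G,)   instance identities of the ground truths
    """
    ious = box_iou(rois, gt_boxes)              # (N, G) pairwise IoU
    max_iou, argmax = ious.max(dim=1)           # best-matching ground truth per RoI
    labels = torch.full((rois.size(0),), -2, dtype=torch.long)
    labels[max_iou < neg_thr] = -1              # background / negative samples
    pos = max_iou >= pos_thr
    labels[pos] = gt_ids[argmax[pos]]           # positive: inherit the identity
    return labels
```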

Assume there are training samples on the key frame and contrastive targets on the reference frame. For each training sample, we can use the non-parametric softmax Wu et al. (2018); Oord et al. (2018) with cross-entropy to optimize the feature embeddings

\mathcal{L}_{\mathrm{embed}} = -\log \frac{\exp(\mathbf{v} \cdot \mathbf{k}^{+})}{\exp(\mathbf{v} \cdot \mathbf{k}^{+}) + \sum_{\mathbf{k}^{-}} \exp(\mathbf{v} \cdot \mathbf{k}^{-})},    (2)

where \mathbf{v}, \mathbf{k}^{+}, and \mathbf{k}^{-} are the feature embeddings of the training sample, its positive target, and its negative targets on the reference frame, respectively. The overall embedding loss is averaged over all training samples; we illustrate only one training sample for simplicity.
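
A minimal sketch of Equation 2 for a single training sample, assuming one positive and several negative reference embeddings; the function and tensor names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn.functional as F

def single_positive_embed_loss(v, k_pos, k_negs):
    """Non-parametric softmax with cross-entropy (Eq. 2) for one sample.

    v:      (D,)   embedding of the training sample
    k_pos:  (D,)   embedding of its positive target
    k_negs: (M, D) embeddings of its negative targets
    """
    pos_logit = (v * k_pos).sum().unsqueeze(0)      # (1,) similarity to the positive
    neg_logits = k_negs @ v                         # (M,) similarities to the negatives
    logits = torch.cat([pos_logit, neg_logits])     # positive placed at index 0
    target = torch.zeros(1, dtype=torch.long)       # index of the positive
    return F.cross_entropy(logits.unsqueeze(0), target)
```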

In contrast to previous methods that only use cropped sparse ground truths (GT) to learn instance similarity, we apply dense matching between the RoIs on the pair of images, namely, each sample on the key frame is matched to all samples on the reference frame. Considering that each training sample in the key frame can have more than one positive target in the reference frame, we use a simplified loss function from Sun et al. (2020) for multi-positive contrastive learning:

\mathcal{L}_{\mathrm{embed}} = \log \Big[ 1 + \sum_{\mathbf{k}^{+}} \sum_{\mathbf{k}^{-}} \exp(\mathbf{v} \cdot \mathbf{k}^{-} - \mathbf{v} \cdot \mathbf{k}^{+}) \Big].    (3)
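
A sketch of the multi-positive form in Equation 3 for one training sample; this is our own numerically stable implementation under the assumption that dot products are used as similarities.

```python
import torch
import torch.nn.functional as F

def multi_positive_embed_loss(v, k_pos, k_neg):
    """Multi-positive contrastive loss (Eq. 3) for one training sample.

    v:     (D,)   embedding of the training sample
    k_pos: (P, D) embeddings of its positive targets on the reference frame
    k_neg: (M, D) embeddings of its negative targets on the reference frame
    """
    pos = k_pos @ v                                   # (P,) similarities to positives
    neg = k_neg @ v                                   # (M,) similarities to negatives
    # log(1 + sum over (+,-) pairs of exp(neg - pos)), written with softplus/logsumexp
    pairwise = neg.unsqueeze(0) - pos.unsqueeze(1)    # (P, M) grid of (neg - pos)
    return F.softplus(torch.logsumexp(pairwise.reshape(-1), dim=0))
```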

We further adopt an L2 loss as an auxiliary loss to constrain the cosine similarity of pair-wise matches:

\mathcal{L}_{\mathrm{aux}} = \Big( \frac{\mathbf{v} \cdot \mathbf{k}}{\lVert \mathbf{v} \rVert \, \lVert \mathbf{k} \rVert} - c \Big)^{2},    (4)

where c equals 1 if the two samples are positive to each other and 0 otherwise.
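
A sketch of the auxiliary loss in Equation 4 over a batch of sampled pairs, assuming a binary target of 1 for positive pairs and 0 otherwise; the tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def auxiliary_cosine_loss(v, k, targets):
    """L2 constraint on pair-wise cosine similarity (Eq. 4).

    v, k:    (B, D) embeddings of the two sides of B sampled pairs
    targets: (B,)   1.0 for positive pairs, 0.0 for negative pairs
    """
    cos = F.cosine_similarity(v, k, dim=1)   # (B,)
    return ((cos - targets) ** 2).mean()
```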

The entire network is jointly optimized under

\mathcal{L} = \mathcal{L}_{\mathrm{detect}} + \gamma_{1}\mathcal{L}_{\mathrm{embed}} + \gamma_{2}\mathcal{L}_{\mathrm{aux}},    (5)

where \gamma_{1} and \gamma_{2} are set to 0.25 and 1.0 by default in this paper. To calculate the auxiliary loss, we sample all positive pairs and three times as many negative pairs.
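
For completeness, a hedged sketch of how the embedding terms of Equation 5 could be assembled from the helper sketches above. The detection loss is assumed to be computed by the Faster R-CNN heads elsewhere, and the auxiliary-pair sampling (all positive pairs plus three times as many negatives) is omitted for brevity; names are our own.

```python
import torch

def quasi_dense_embed_losses(key_embs, key_ids, ref_embs, ref_ids,
                             gamma1=0.25, gamma2=1.0):
    """Weighted embedding terms of the joint objective (Eqs. 3-5).

    key_embs: (N, D) embeddings of sampled key-frame RoIs
    key_ids:  (N,)   identity (>=0) or background (-1) label of each key RoI
    ref_embs: (M, D) embeddings of sampled reference-frame RoIs
    ref_ids:  (M,)   identity (>=0) or background (-1) label of each reference RoI
    """
    embed_terms, aux_terms = [], []
    for v, vid in zip(key_embs, key_ids):
        pos_mask = (ref_ids == vid) & (vid >= 0)
        if not pos_mask.any():
            continue                                  # anchors need a positive target
        embed_terms.append(
            multi_positive_embed_loss(v, ref_embs[pos_mask], ref_embs[~pos_mask]))
        aux_terms.append(
            auxiliary_cosine_loss(v.expand_as(ref_embs), ref_embs, pos_mask.float()))
    embed_loss = torch.stack(embed_terms).mean()
    aux_loss = torch.stack(aux_terms).mean()
    # Eq. 5: the Faster R-CNN detection loss is added to this sum outside.
    return gamma1 * embed_loss + gamma2 * aux_loss
```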

Figure 3: The testing pipeline of our method. We use bi-directional softmax to estimate the similarity scores. Objects with no correspondence during matching do not satisfy the bi-directional consistency and thus obtain low similarity scores to all matching candidates.
Object association

Tracking objects across frames purely based on object feature embeddings is not trivial, as the similarity estimation might be confused by newly appeared objects, vanished tracklets, and instances with similar appearances. Taking advantage of the effective quasi-dense similarity learning, we can perform a simple inference procedure to associate objects.

Our main strategy is bi-directional matching in the embedding space. Figure 3 shows our testing pipeline. Assume there are N detected objects in the current frame with feature embeddings \mathbf{n}, and M matching candidates with feature embeddings \mathbf{m} from the past frames. The similarity \mathbf{f} between the objects and the matching candidates is obtained by bi-directional softmax (bi-softmax):

\mathbf{f}(i, j) = \frac{1}{2} \Big[ \frac{\exp(\mathbf{n}_{i} \cdot \mathbf{m}_{j})}{\sum_{k=1}^{M} \exp(\mathbf{n}_{i} \cdot \mathbf{m}_{k})} + \frac{\exp(\mathbf{n}_{i} \cdot \mathbf{m}_{j})}{\sum_{k=1}^{N} \exp(\mathbf{n}_{k} \cdot \mathbf{m}_{j})} \Big],    (6)

where \mathbf{n}_{i} and \mathbf{m}_{j} denote the embeddings of the i-th object and the j-th matching candidate. A high score under bi-softmax requires bi-directional consistency: the two matched objects should be each other's nearest neighbor in the embedding space. Given the instance similarity \mathbf{f}, we can directly associate each object with its correspondence by a simple nearest neighbor search.
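
A minimal sketch of the bi-softmax in Equation 6 over batches of detection and candidate embeddings; tensor names are illustrative.

```python
import torch
import torch.nn.functional as F

def bi_softmax_similarity(det_embs, cand_embs):
    """Bi-directional softmax similarity (Eq. 6).

    det_embs:  (N, D) embeddings of objects detected in the current frame
    cand_embs: (M, D) embeddings of matching candidates from past frames
    returns:   (N, M) similarity scores averaged over the two matching directions
    """
    logits = det_embs @ cand_embs.t()          # (N, M) dot products
    det_to_cand = F.softmax(logits, dim=1)     # normalize over candidates
    cand_to_det = F.softmax(logits, dim=0)     # normalize over detections
    return 0.5 * (det_to_cand + cand_to_det)
```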

Due to the normalization effect of softmax, it is important to ensure there is one and only one correspondence among the matching candidates for each object. Otherwise, the similarity estimation will be ambiguous. Two cases violate this rule.

No correspondence: Objects without correspondence in the feature space should not be matched to any candidate. Newly appeared objects, vanished tracklets, and some false positives fall into this category. The bi-softmax tackles this problem naturally: it is hard for these objects to obtain bi-directional consistency, so they have low matching scores to all candidates. If a newly appeared object has high detection confidence, it starts a new tracklet. Moreover, previous methods often directly drop the objects that do not match any tracklet. We argue that although most of them are false positives, they are still useful regions that subsequent objects are likely to match. We name these unmatched objects backdrops and keep them in the feature space during matching. Experiments show that keeping backdrops helps us reduce the number of false positives.

Multiple correspondences: Most state-of-the-art detectors only perform intra-class duplicate removal with Non-Maximum Suppression (NMS). Consequently, there might be objects of different categories at the same location. In most cases, only one of these objects is a true positive while the others are not. This process increases object recall to the maximum extent and thus contributes to a high mean Average Precision (mAP) Everingham et al. (2010); Lin et al. (2014). However, it creates duplicate feature embeddings during the matching process. To handle this issue, we perform inter-class duplicate removal with NMS. The IoU threshold for NMS is 0.7 for objects with high detection confidence (larger than 0.5) and 0.3 for objects with low detection confidence (lower than 0.5).
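
One plausible reading of this inter-class duplicate removal is a greedy, class-agnostic NMS whose suppression threshold depends on the confidence of the box being removed; the sketch below is our interpretation and not the released implementation.

```python
import torch
from torchvision.ops import box_iou

def interclass_nms(boxes, scores, high_thr=0.7, low_thr=0.3, conf_split=0.5):
    """Class-agnostic duplicate removal with confidence-dependent IoU thresholds.

    boxes:  (N, 4) detections of any category in (x1, y1, x2, y2) format
    scores: (N,)   detection confidences
    returns a LongTensor of indices of the kept boxes
    """
    # A box is removed at IoU > 0.7 if it is confident, and at IoU > 0.3 otherwise.
    removal_thr = torch.where(scores > conf_split,
                              torch.full_like(scores, high_thr),
                              torch.full_like(scores, low_thr))
    order = scores.argsort(descending=True)
    keep, suppressed = [], torch.zeros(len(boxes), dtype=torch.bool)
    for idx in order.tolist():
        if suppressed[idx]:
            continue
        keep.append(idx)
        ious = box_iou(boxes[idx].unsqueeze(0), boxes).squeeze(0)  # (N,)
        newly = ious > removal_thr
        newly[idx] = False            # never remove the box we just kept
        suppressed |= newly
    return torch.tensor(keep, dtype=torch.long)
```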

4 Experiments

We perform object tracking experiments on BDD100K Yu et al. (2020), Waymo Sun et al. (2019), and KITTI Geiger et al. (2012) to evaluate the effectiveness of quasi-dense instance similarity learning. We also extend the method to one-shot object detection and verify the effectiveness on PASCAL VOC Everingham et al. (2010) dataset.

4.1 Implementation details

Our method is implemented on mmdetection Chen et al. (2019). We use ResNet-50 He et al. (2016) as the backbone and keep all hyper-parameters consistent with mmdetection unless otherwise specified. We select 128 RoIs from the key frame as training samples, and 256 RoIs from the reference frame with a positive-negative ratio of 1.0 as contrastive targets. We use IoU-balanced sampling Pang et al. (2019) to sample RoIs, and a 4conv-1fc head with group normalization Wu and He (2018) to extract feature embeddings. The channel number of the embedding features is set to 256 by default. We train our models with a total batch size of 32 and an initial learning rate of 0.04 for 12 epochs, and decrease the learning rate by a factor of 10 after 8 and 11 epochs.
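
For concreteness, the stated schedule corresponds to something like the following PyTorch setup; the optimizer type and its momentum value are not specified in the text and are shown only as placeholders.

```python
import torch

# Hypothetical model stand-in; only the schedule values below come from the text above.
model = torch.nn.Linear(256, 256)

# SGD hyper-parameters other than the learning rate are assumptions.
optimizer = torch.optim.SGD(model.parameters(), lr=0.04, momentum=0.9)

# Initial lr 0.04, multiplied by 0.1 after epoch 8 and epoch 11, 12 epochs total.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)

for epoch in range(12):
    # ... one training epoch over the detection and tracking data ...
    scheduler.step()
```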

The models are trained and tested on images at their original scales. When conducting online joint object detection and tracking, we initialize a new tracklet if its detection confidence is higher than 0.8. All other confidence thresholds in this paper are set to 0.5 unless otherwise mentioned. Objects can be associated only when they are classified as the same category. The feature embedding of each identity is updated online with a momentum of 0.8. We keep each tracklet alive for 10 frames and only keep backdrops from the most recent frame.
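
A small sketch of the online bookkeeping described above. The momentum convention (whether 0.8 weights the old or the new embedding) is not spelled out in the text, so the update below is one plausible choice; class and attribute names are our own.

```python
class Tracklet:
    """Minimal online tracklet state (illustrative, not the released code)."""

    def __init__(self, track_id, embedding, frame_idx, momentum=0.8, max_age=10):
        self.track_id = track_id
        self.embedding = embedding          # (D,) feature embedding of the identity
        self.last_frame = frame_idx
        self.momentum = momentum
        self.max_age = max_age

    def update(self, embedding, frame_idx):
        # Assumed convention: the momentum of 0.8 weights the previous embedding.
        self.embedding = (self.momentum * self.embedding
                          + (1.0 - self.momentum) * embedding)
        self.last_frame = frame_idx

    def is_alive(self, frame_idx):
        # A tracklet stays among the matching candidates for up to `max_age` frames.
        return frame_idx - self.last_frame <= self.max_age
```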

4.2 BDD100K joint object detection and tracking

We use the BDD100K Yu et al. (2020) detection training set and tracking training set for training, and the tracking validation/testing sets for testing. All ablation studies are conducted on the tracking validation set. The detection set has 70,000 images for training. The tracking set has 1,400 videos (278,079 images) for training, 200 videos (39,973 images) for validation, and 400 videos (79,636 images) for testing. The images in the tracking set are annotated at 5 fps, while the videos run at 30 fps. During training, an image from the detection set uses itself as its reference image. We follow the official guidelines and evaluate the tracking performance with the standard evaluation metrics Ristani et al. (2016). The terms mMOTA, mIDF1, and mMOTP are results averaged over the 8 categories annotated in this dataset.

Method Set mMOTA mIDF1 mMOTP FN FP ID Sw. MT ML mAP
Yu  Yu et al. (2020) val 25.9 44.5 69.7 122406 52372 8315 8396 3795 28.1
Ours val 36.6 50.8 70.2 108614 46621 6262 9481 3034 32.6
Yu  Yu et al. (2020) test 26.3 44.7 69.4 213220 100230 14674 16299 6017 27.9
Ours test 35.2 51.8 77.5 193291 85987 11019 17745 4865 31.8
Table 1: Main results on the BDD100K multiple object tracking validation set and test set. Higher is better for mMOTA, mIDF1, mMOTP, MT, and mAP; lower is better for FN, FP, ID Sw., and ML.
Key frame Reference frame MOTA IDF1 ID Sw. mMOTA mIDF1 MOTA(P) IDF1(P) ID Sw.(P)
GT GT 60.4 63.0 8916 34.0 47.9 37.6 49.7 2213
Positive GT 60.7 63.4 8728 34.3 48.2 38.3 50.1 2197
Positive GT + Negative 61.5 66.8 7986 35.5 50.0 40.5 52.7 2015
Positive Positive + Negative 62.5 67.8 7476 36.2 50.0 44.0 54.3 1905
Bi-Softmax D. R. Backdrops MOTA IDF1 ID Sw. mMOTA mIDF1 MOTA(P) IDF1(P) ID Sw.(P)
✓ - - 62.9 70.0 8059 35.4 48.5 45.5 58.8 1609
✓ ✓ - 63.2 70.1 6115 36.4 50.4 45.5 58.3 1487
✓ ✓ ✓ 63.5 71.5 6262 36.6 50.8 46.7 60.2 1501
Table 2: Ablation studies on quasi-dense matching and the inference strategy. We use cosine similarity to calculate the similarity scores for experiments in the top lines. All models are comparable on detection performance. D. R. means duplicate removal. (P) means results of the class “pedestrian”.
Appearance IoU Motion Regression mMOTA mIDF1 mMOTP FN FP ID Sw. MT ML
- - - 26.3 36.0 70.5 115356 40587 69564 8759 3240
- - 27.7 38.5 70.5 116056 39852 61024 8756 3272
- - 28.6 39.3 70.4 112166 43783 57064 9106 3091
- - - 36.6 50.8 70.2 108614 46621 6262 9481 3034
- - 36.3 49.8 70.3 109517 44680 7838 9369 3094
- 36.4 49.9 70.3 110109 43860 7807 9327 3136
- 36.4 50.1 70.3 109407 44973 7714 9389 3082
Table 3: Ablation studies on location and motion cues on the BDD100K tracking validation set. All experiments use the same detector for fair comparison.

The main results on BDD100K tracking validation set and testing set are in Table 1. The mMOTA and mIDF1, which represent object coverage and identity consistency respectively, are 36.6% and 50.8% on the validation set, and 35.2% and 51.8% on the testing set. On the two sets, our method outperforms the baseline method by 10.7 points and 8.9 points in terms of mMOTA, and 6.3 points and 7.1 points in terms of mIDF1, respectively. The significant advancements demonstrate that our method enables more stable object tracking.

We conduct ablation studies to examine the effectiveness of quasi-dense matching, as shown in the top sub-table of Table 2. The terms MOTA and IDF1 are calculated over all instances without considering categories, as overall evaluations. We use cosine distance to calculate the similarity scores during the inference procedure. Compared to learning with sparse ground truths, quasi-dense matching improves the overall IDF1 by 4.8 points (63.0% to 67.8%). The significant improvement in IDF1 indicates that quasi-dense matching greatly improves the feature embeddings and enables more accurate association. We then analyze the improvements in detail. From the table, we observe that including more positive samples as training samples brings only marginal improvements. However, when we match each training sample to more negative samples and train the feature space with Equation 2, the IDF1 is significantly improved by 3.4 points. This improvement accounts for 70% of the total 4.8-point IDF1 gain. The experiment shows that more contrastive targets, even though most of them are negative samples, improve the feature learning process. The multi-positive contrastive learning following Equation 3 further improves the IDF1 by 1 point (66.8% to 67.8%).

We also investigate how different inference strategies influence the performance. As shown in the bottom part of Table 2, replacing cosine similarity with bi-softmax improves the overall IDF1 by 2.2 points and the IDF1 of pedestrian by 4.5 points. This experiment shows that the one-to-one constraint further strengthens the estimated similarity. With duplicate removal and backdrops, the IDF1 is improved by another 1.5 points. Overall, our training and inference strategies significantly improve the IDF1 by 8.5 points (63.0% to 71.5%), and the total number of ID switches is decreased by 30%. In particular, the MOTA and IDF1 of pedestrian are improved by 9.1 points and 10.5 points respectively, which further demonstrates the power of quasi-dense matching.

Finally, we add location and motion priors to see whether they are still helpful once the feature embeddings are greatly improved. These experiments follow the procedures in Tracktor Bergmann et al. (2019) and use the same detector for fair comparison. As shown in Table 3, without appearance features, the tracking performance is consistently improved by introducing additional information. However, these cues barely enhance the performance of our approach, and our method yields the best results when using only appearance embeddings. The results indicate that, with effective quasi-dense matching, our instance feature embeddings are sufficient for multiple object tracking.

To understand the runtime efficiency, we profile our method on an NVIDIA Tesla V100. Because it only adds a lightweight embedding head to Faster R-CNN, our method brings only marginal inference cost. With a ResNet-50 backbone, the model runs at 14.1 FPS.

Method Category MOTA IDF1 FN FP ID Sw. MT ML mAP
IoU baseline Lu et al. (2020) Vehicle 38.25 - - - - - - 45.78
Tracktor++ Bergmann et al. (2019); Lu et al. (2020) Vehicle 42.62 - - - - - - 42.41
RetinaTrack Lu et al. (2020) Vehicle 44.92 - - - - - - 45.70
Ours Vehicle 55.6 66.2 514548 214998 24309 17595 5559 49.5
Ours Pedestrian 50.3 58.4 151957 48636 6347 3746 1866 40.7
Ours Cyclist 26.2 45.7 7559 1252 56 69 85 30.0
Ours All 44.0 56.8 674064 264886 30712 21410 7510 40.1
Method Category MOTA/L1 FP/L1 MisM/L1 Miss/L1 MOTA/L2 FP/L2 MisM/L2 Miss/L2
Tracktor Kim et al. (2018); Sun et al. (2019) Vehicle 34.80 10.61 14.88 39.71 28.29 8.63 12.10 50.98
Ours Vehicle 56.99 6.38 1.55 35.08 47.42 5.91 1.33 45.34
Ours Pedestrian 54.75 9.08 2.05 34.12 51.83 9.19 1.95 37.03
Ours Cyclist 36.45 6.78 0.77 56.01 30.90 6.20 0.66 62.24
Ours All 49.40 7.41 1.46 41.74 43.88 7.10 1.31 48.21
Table 4: Results on the Waymo open dataset validation set using the py-motmetrics library (https://github.com/cheind/py-motmetrics) (top) and on the test set using the official evaluation for two difficulty levels (bottom).

4.3 Waymo 2D object tracking

The Waymo open dataset contains images from 5 cameras associated with 5 different directions: front, front left, front right, side left, and side right. There are 3,990 videos (790,405 images) for training, 1,010 videos (199,935 images) for validation, and 750 videos (148,235 images) for testing. We use COCO pre-trained models following Lu et al. (2020). The tracking performance is evaluated for three categories (vehicle, pedestrian, and cyclist) and two difficulty levels (L1 and L2) on the official benchmark.

Table 4 shows our main results on the Waymo open dataset. We report results on the validation set following the setup of RetinaTrack Lu et al. (2020), and obtain results on the test set via the official evaluation. Our method outperforms all baselines on both the validation set and the test set. We obtain a MOTA of 44.0% and an IDF1 of 56.8% on the validation set, and a MOTA/L1 of 49.40% and a MOTA/L2 of 43.88% on the test set. The vehicle performance on the validation set is 10.7, 13.0, and 17.4 points higher than RetinaTrack Lu et al. (2020), Tracktor++ Bergmann et al. (2019); Lu et al. (2020), and the IoU baseline Lu et al. (2020), respectively. Our approach also outperforms the baseline method on the official benchmark, improving MOTA/L1 from 34.80% to 49.40% and MOTA/L2 from 28.29% to 43.88%.

4.4 KITTI multiple object tracking

We use both the detection set and the tracking set of the KITTI dataset for training. The object detection set consists of 7,481 training images. The tracking set contains 21 training sequences (8,008 images) and 29 testing sequences (11,095 images) at a 10 fps frame rate Geiger et al. (2012). The object tracking benchmark only evaluates 2 classes (car and pedestrian) out of the 8 labeled classes.

We present the results in Table 5 and compare our method with peer-reviewed methods. Quasi-dense matching outperforms state-of-the-art methods with a MOTA of 85.76% for car and a MOTA of 56.81% for pedestrian. We obtain a 4.46-point improvement in MOTA for pedestrian and significantly reduce the number of ID switches for car.

Method MOTA FN FP IDSw.
IMMDP Xiang et al. (2015) 83.04 5269 391 172
MOTBeyondPixels Sharma et al. (2018) 82.24 4247 705 468
mono3DT Hu et al. (2019) 84.52 4242 705 377
mmMOT Zhang et al. (2019) 84.77 4243 711 284
MOTSFusion Luiten et al. (2020) 84.83 4260 681 275
MASS Karunasekera et al. (2019) 85.04 4101 742 301
Ours 85.76 4288 517 93
Method MOTA FN FP IDSw.
JCSTD Tian et al. (2020) 44.20 711975 889 53
MCMOT-CPD Lee et al. (2016) 45.94 11112 1260 143
NOMT Choi (2015) 46.62 10427 1867 63
MDP Xiang et al. (2015) 47.22 9540 2592 87
Be-Track Dimitrievski et al. (2019) 51.29 9943 1215 118
CAT Nguyen et al. (2019) 52.35 9150 1676 206
Ours 56.81 8460 1284 254
Table 5: Results on KITTI object tracking benchmark test set for car (left) and pedestrian (right). Only published methods are reported.

4.5 PASCAL VOC one-shot object detection

We extend our method to one-shot object detection with minor modifications. We follow the setup in Kang et al. (2019); Wang et al. (2019) to conduct experiments on PASCAL VOC Everingham et al. (2010). We use the VOC2007 and VOC2012 train/val sets for training and the VOC2007 test set for testing. The 20 classes are split into 15 base classes and 5 novel classes, and each novel class has access to only one annotated object. We perform experiments on 3 different base/novel splits following the image lists in Kang et al. (2019).

Considering the limited GPU memory, we randomly sample 3 images from the entire dataset as reference frames, which differs from episodic training with a gallery of all classes. The detector is trained only to predict objectness. The category of each detected object is determined by locating its nearest neighbor among the 20 given objects in the embedding space. As shown in Table 6, our results are comparable with state-of-the-art methods. These experiments demonstrate the potential of extending quasi-dense matching to category-level metric learning.

Method Detector Backbone Novel Set 1 Novel Set 2 Novel Set 3 Average
FSRW Kang et al. (2019) YOLO v2 DarkNet-19 14.8 15.7 21.3 17.3
MetaDet Wang et al. (2019) Faster R-CNN VGG16 18.9 21.8 20.6 20.4
FRCNN+ft-full Yan et al. (2019) Faster R-CNN ResNet-101 13.8 7.9 9.8 10.5
Meta R-CNN Yan et al. (2019) Faster R-CNN ResNet-101 19.9 10.4 14.3 14.9
Ours Faster R-CNN ResNet-101 22.1 12.8 16.9 17.3
Table 6: One-shot object detection results (mAP) on PASCAL VOC with three novel/base splits.

5 Conclusion

We present a quasi-dense matching method for instance similarity learning. In contrast to previous methods that use sparse ground-truth matching as similarity supervision, we learn instance similarity from hundreds of region proposals in a pair of images, and train the feature embeddings with multi-positive contrastive learning. In the resulting feature space, a simple nearest neighbor search can distinguish instances without bells and whistles. Our method can be easily coupled with most existing detectors and applied to broad applications such as multiple object tracking and one-shot object detection. The simplicity and effectiveness shall benefit research in related areas.

Appendix A Appendix

In this appendix, we present category-wise results, investigate oracle performance, analyze failure cases, and show visualizations on BDD100K joint object detection and tracking dataset Yu et al. (2020). We also present more experimental results on Waymo 2D object tracking dataset Sun et al. (2019) and show visualizations on PASCAL VOC Everingham et al. (2010) one-shot object detection dataset.

a.1 BDD100K joint object detection and tracking

We present more analyses of quasi-dense matching on BDD100K tracking validation set. All experiments use the same model as the one in the main paper.

a.1.1 Category-wise results

The BDD100K tracking set consists of 8 categories with a long-tail distribution: almost 75% of the instances are "car" and 15% are "pedestrian". We present the category-wise results in Table 7. We can observe that "car" has the highest performance. The performance of "pedestrian", "bus", and "truck" is comparable across these classes and almost 20 points lower than that of "car". The performance of the other classes is even lower, and all results for "train" are 0. The inferior results on these classes may be caused by limited training samples. Hence, the long-tail problem remains challenging on this dataset.

a.1.2 Oracle analysis

We directly extract feature embeddings of the ground truth objects in each frame and associate them to investigate the oracle performance. The results are shown in Table 8.

We can observe that all MOTAs are higher than 94%, and some of them are even close to 100%. This is because we use the ground truth boxes directly, so the numbers of false negatives and false positives are close to 0. The high MOTAs show that the MOTA metric penalizes object coverage (detection) more heavily than identity consistency (tracking).

The IDF1 metric and the number of ID switches measure identity consistency. As shown in Table 8, the average IDF1 over the 8 classes is 88.8%, which is 38 points higher than our result. The gaps between the oracle results and ours are only 11.1 points and 19.3 points for "car" and "pedestrian", while the gaps on the other classes exceed 30 points. The high IDF1 of the oracle results demonstrates the effectiveness of quasi-dense matching: given highly accurate detection results, our method obtains robust feature embeddings and associates objects effectively. However, the large performance gaps also indicate the need for better detection algorithms in the video domain. We also notice that the total number of ID switches in the oracle experiment is higher than ours. This is due to the high object recall in the oracle experiment, as more detected instances may introduce more ID switches.

Category Set MOTA IDF1 MOTP FN FP ID Sw. MT ML mAP
Pedestrian val 46.7 60.2 77.6 17576 11322 1501 1595 626 40.8
Rider val 32.8 48.1 77.7 1619 78 13 8 74 25.5
Car val 69.6 75.0 84.1 68227 30875 4547 7655 1835 60.3
Bus val 42.0 61.8 86.4 3967 1265 39 48 61 46.1
Truck val 39.2 56.5 85.7 13983 2514 122 115 312 40.8
Bicycle val 28.9 47.7 75.7 2446 457 38 52 100 24.2
Motorcycle val 33.5 56.7 74.5 488 110 2 8 20 22.8
Train val 0.0 0.0 0.0 308 0 0 0 6 0
All (average) val 36.6 50.8 70.2 108614 46621 6262 9481 3034 32.6
Pedestrian test 47.2 59.8 77.2 34210 18839 2480 2339 1075 -
Rider test 37.0 53.7 76.4 3764 401 20 37 121 -
Car test 71.6 76.8 84.2 112992 57512 8183 14972 2655 -
Bus test 34.3 57.0 85.4 7128 2694 75 80 144 -
Truck test 32.5 53.6 85.0 27685 5411 209 210 550 -
Bicycle test 30.6 51.0 76.0 4666 900 44 87 204 -
Motorcycle test 33.0 51.1 76.9 2378 171 8 20 106 -
Train test -4.8 11.7 59.0 468 59 0 0 10 -
All (average) test 35.2 51.8 77.5 193291 85987 11019 17745 4865 -
Table 7: Category-wise results on the BDD100K multiple object tracking validation set and test set. Higher is better for MOTA, IDF1, MOTP, MT, and mAP; lower is better for FN, FP, ID Sw., and ML.
Category Set MOTA IDF1 MOTP FN FP ID Sw. MT ML
Pedestrian val 94.3 79.5 (+19.3) 99.8 1 1 3226 3506 0
Rider val 95.8 88.5 (+40.4) 99.9 0 0 107 134 0
Car val 97.7 86.1 (+11.1) 99.9 0 0 7716 13189 0
Bus val 99.2 93.0 (+31.2) 100.0 0 0 72 196 0
Truck val 98.8 90.3 (+33.8) 100.0 0 0 340 726 0
Bicycle val 88.2 79.5 (+31.8) 98.7 8 8 470 243 0
Motorcycle val 97.0 94.5 (+37.8) 99.8 0 0 27 44 0
Train val 99.4 98.7 (+98.7) 100.0 0 0 2 6 0
All (average) val 96.3 88.8 99.8 9 9 11960 18044 0
Table 8: Oracle analysis on BDD100K tracking validation set. The numbers in the round brackets mean the gaps between oracle results and our results.

a.1.3 Failure case analysis

Our method can distinguish different instances even when they are similar in appearance. However, there are still some failure cases. We show them in the figures below, where yellow boxes represent false negatives, red boxes represent false positives, and cyan boxes represent ID switches. The float number at the corner of each box indicates the detection score, while the integer indicates the object identity. We use green dashed boxes to highlight the objects we want to emphasize.

Object classification

Inaccurate classification confidence is the main distraction for the association procedure, because false negatives and false positives destroy the one-to-one matching constraint. As shown in Figure 4, the false negatives are mainly small or occluded objects in crowded scenes. The false positives are objects with appearances similar to annotated objects, such as persons in mirrors or on advertising boards.

Inaccurate object category is a less frequent distraction caused by classification. The class of an instance may switch between different categories, which mostly belong to the same super-category. Figure 5 shows an example: the category of the highlighted object changes from "rider" to "pedestrian" when the bicycle is occluded. Our method fails in this case because we require associated objects to have the same category.

These failure cases caused by object classification suggest room for improvement in video object detection algorithms. Temporal or tracking information could be exploited to improve the detectors and thus obtain better tracking performance.

Object truncation/occlusion

Object truncation/occlusion causes inaccurate object localization. As shown in Figure 6, the highlighted objects are truncated by other objects. The detector outputs two boxes: one is a false positive that covers only a part of the object, while the other has a lower detection score but covers the entire object. This case may influence the association process if the two boxes have similar feature embeddings.

An instance may have totally different appearances before and after occlusion, which results in low similarity scores. As shown in Figure 7, only the front of the car appears before the occlusion, while only the rear appears after it. Our method can associate two boxes if they cover the same discriminative regions of an object, not necessarily the exact same region. However, if two boxes cover totally different regions of the object, they will have a low matching score.

Another corner case is extreme truncation. As shown in Figure 8, only a small portion of a highly truncated object is visible when it enters or leaves the camera view. We cannot distinguish different instances effectively from such limited appearance information.

Figure 4: Failure cases caused by inaccurate classification confidences. The objects enclosed by yellow rectangles are false negatives, and the objects enclosed by red rectangles are false positives.
Figure 5: Failure case caused by inaccurate object category. The category of the highlighted object changes from “rider” to “pedestrian” due to the occlusion of the bicycle. They cannot be associated because they do not satisfy the category consistency.
Figure 6: Inaccurate object localization caused by truncation. The red false positive box only covers part of the object, while the yellow box covers the entire object. They may have similar feature embeddings thus influencing the association procedure.
Figure 7: Two detected objects in different frames cover totally different regions of the object thus having low appearance similarity.
Figure 8: Our method cannot distinguish different instances effectively according to the limited appearance information in highly truncated objects.
Figure 9: The visualizations of different instance patches during the testing procedure. The detected objects in the current frame are matched to tracklets in the consecutive frame, vanished tracklets, and backdrops via bi-directional softmax.

a.1.4 Visualizations

We show visualizations of different instance patches during the testing procedure in Figure 9. The detected objects in each frame are matched to prior objects via bi-directional softmax. The prior objects include tracklets in the consecutive frame, vanished tracklets, and backdrops, which we annotate with different colors. Each detected object is enclosed by the color of its matched object. We can observe that most false positives in the current frame are matched to backdrops, which demonstrates that keeping backdrops during the matching procedure helps reduce the number of false positives.

Backbone Category Set MOTA/L1 FP/L1 MisM/L1 Miss/L1 mAP/L1 MOTA/L2 FP/L2 MisM/L2 Miss/L2 mAP/L2
R50 Vehicle val 50.96 7.22 1.65 40.18 64.49 41.31 6.60 1.38 50.72 54.23
R50 Pedestrian val 54.43 8.78 2.06 34.73 66.54 50.33 8.86 1.92 38.89 61.69
R50 Cyclist val 32.08 9.81 0.71 57.40 41.48 26.43 8.77 0.58 64.22 35.31
R50 All val 45.83 8.60 1.47 44.10 57.51 39.35 8.07 1.29 51.28 50.41
R101 + DCN Vehicle val 52.46 7.19 1.59 38.77 64.68 42.64 6.63 1.34 49.40 54.55
R101 + DCN Pedestrian val 56.00 9.24 2.03 32.73 66.80 51.87 8.85 1.89 37.39 61.98
R101 + DCN Cyclist val 36.73 10.34 0.72 52.21 45.23 30.50 8.59 0.60 60.31 36.91
R101 + DCN All val 48.40 8.92 1.45 41.23 58.90 41.67 8.02 1.27 49.03 51.15
R101 + DCN Vehicle test 58.84 6.66 1.54 32.96 70.13 49.23 6.17 1.32 43.28 60.27
R101 + DCN Pedestrian test 55.77 9.02 2.08 33.12 67.77 52.87 9.25 1.99 35.89 65.83
R101 + DCN Cyclist test 38.93 7.25 0.72 53.10 46.73 33.17 6.18 0.61 60.05 40.83
R101 + DCN All test 51.18 7.64 1.45 39.73 61.55 45.09 7.20 1.31 46.41 55.64
Table 9: Results on Waymo open dataset 2D object tracking validation set and test set using official evaluation for two difficulty levels.

a.2 Waymo 2D object tracking

We show more results on the Waymo 2D object tracking benchmark in Table 9. We test our method with ResNet-50 (R50) on the validation set following the official evaluation metrics. We also report results with ResNet-101 and deformable convolutions Dai et al. (2017) (R101 + DCN) on both the validation set and the test set. Our method achieves state-of-the-art performance on the Waymo 2D object tracking benchmark without bells and whistles.

a.3 PASCAL VOC one-shot object detection

We show the testing procedure of one-shot object detection in Figure 10. We first obtain the objectness score of the object in the query image. We treat the given ground truths in the support images as exemplars and extract their feature embeddings, as well as the feature embedding of the object in the query image. We then apply an inner product with softmax between the query embedding and the support embeddings to obtain category-level similarity scores, and distribute the objectness score to each category according to these scores.
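
A minimal sketch of this scoring step, assuming one exemplar embedding per class; the function and tensor names are illustrative only.

```python
import torch
import torch.nn.functional as F

def one_shot_class_scores(query_emb, support_embs, objectness):
    """Distribute the objectness score over classes by embedding similarity.

    query_emb:    (D,)   embedding of a detected object in the query image
    support_embs: (C, D) one exemplar embedding per class (C classes)
    objectness:   float  class-agnostic detection score of the query object
    returns:      (C,)   per-class detection scores
    """
    sims = support_embs @ query_emb          # inner products with the exemplars
    class_probs = F.softmax(sims, dim=0)     # similarity scores over classes
    return objectness * class_probs
```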

Figure 10: Testing pipeline of our method on one-shot object detection. We first get the objectness score for the query object, then get the similarity scores between the query object and the support objects via softmax. The objectness score is then distributed to each category according to the similarity scores. We show detection scores of the query object in the boxes at the right side.

References

  • A. Andriyenko and K. Schindler (2011) Multi-target tracking by continuous energy minimization. In CVPR 2011, pp. 1265–1272. Cited by: §1.
  • P. Bergmann, T. Meinhardt, and L. Leal-Taixé (2019) Tracking without bells and whistles. arXiv preprint arXiv:1903.05625. Cited by: §1, §2, §4.2, §4.3, Table 4.
  • A. Bewley, Z. Ge, L. Ott, F. T. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In International Conference on Image Processing, Cited by: §1, §2.
  • E. Bochinski, V. Eiselein, and T. Sikora (2017) High-speed tracking-by-detection without using image information. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §1, §2.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
  • W. Choi and S. Savarese (2010) Multiple target tracking in world coordinate with single, minimally calibrated camera. In European Conference on Computer Vision, pp. 553–567. Cited by: §1.
  • W. Choi (2015) Near-online multi-target tracking with aggregated local flow descriptor. In IEEE International Conference on Computer Vision, Cited by: Table 5.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision, Cited by: §A.2.
  • M. Dimitrievski, P. Veelaert, and W. Philips (2019) Behavioral pedestrian tracking using a camera and lidar sensors on a moving vehicle. Sensors. Cited by: Table 5.
  • M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88. Cited by: Appendix A, §3, §4.5, §4.
  • C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In IEEE International Conference on Computer Vision, Cited by: §1, §2, §2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.4, §4.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1735–1742. Cited by: §1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §2, §3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §4.1.
  • D. Held, S. Thrun, and S. Savarese (2016) Learning to track at 100 FPS with deep regression networks. In European Conference on Computer Vision, Cited by: §2.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737. Cited by: §1, §2.
  • H. Hu, Q. Cai, D. Wang, J. Lin, M. Sun, P. Krähenbühl, T. Darrell, and F. Yu (2019) Joint monocular 3d vehicle detection and tracking. In IEEE International Conference on Computer Vision, Cited by: Table 5.
  • B. Kang, Z. Liu, X. Wang, F. Yu, J. Feng, and T. Darrell (2019) Few-shot object detection via feature reweighting. In IEEE International Conference on Computer Vision, Cited by: §2, §4.5, Table 6.
  • L. Karlinsky, J. Shtok, S. Harary, E. Schwartz, A. Aides, R. S. Feris, R. Giryes, and A. M. Bronstein (2019) RepMet: representative-based metric learning for classification and few-shot object detection. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • H. Karunasekera, H. Wang, and H. Zhang (2019) Multiple object tracking with attention to appearance, structure, motion and size. IEEE Access. Cited by: Table 5.
  • C. Kim, F. Li, A. Ciptadi, and J. M. Rehg (2015) Multiple hypothesis tracking revisited. In IEEE International Conference on Computer Vision, Cited by: §2.
  • C. Kim, F. Li, and J. M. Rehg (2018) Multi-object tracking with neural gating using bilinear LSTM. In European Conference on Computer Vision, Cited by: Table 4.
  • L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler (2016) Learning by tracking: siamese CNN for robust target association. In IEEE Conference on Computer Vision and Pattern Recognition Workshop, Cited by: §2.
  • L. Leal-Taixé, A. Milan, K. Schindler, D. Cremers, I. Reid, and S. Roth (2017) Tracking the trackers: an analysis of the state of the art in multiple object tracking. arXiv preprint arXiv:1704.02781. Cited by: §2.
  • B. Lee, E. Erdenee, S. Jin, M. Y. Nam, Y. G. Jung, and P. Rhee (2016) Multi-class multi-object tracking using changing point detection. In European Conference on Computer Vision Workshop, Cited by: Table 5.
  • T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection.. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §3, §3.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision, Cited by: §3.
  • Z. Lu, V. Rathod, R. Votel, and J. Huang (2020) RetinaTrack: online single stage joint detection and tracking. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, §4.3, §4.3, Table 4.
  • J. Luiten, T. Fischer, and B. Leibe (2020) Track to reconstruct and reconstruct to track. IEEE Robotics Autom. Lett.. Cited by: Table 5.
  • A. Milan, S. H. Rezatofighi, A. R. Dick, I. D. Reid, and K. Schindler (2017) Online multi-target tracking using recurrent neural networks. In The AAAI Conference on Artificial Intelligence, Cited by: §2.
  • J. Munkres (1957) Algorithms for the assignment and transportation problems. Society for Industrial and Applied Mathematics. Cited by: §2.
  • U. Nguyen, F. Rottensteiner, and C. Heipke (2019) Confidence-aware pedestrian tracking using a stereo camera. ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences. Cited by: Table 5.
  • A. v. d. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. Cited by: §3.
  • J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830. Cited by: §4.1.
  • D. Ramanan and D. A. Forsyth (2003) Finding and tracking people from the bottom up. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 2, pp. II–II. Cited by: §1, §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Cited by: §1, §3.
  • E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, Cited by: §4.2.
  • E. Ristani and C. Tomasi (2018) Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6036–6046. Cited by: §1.
  • A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In IEEE International Conference on Computer Vision, Cited by: §2.
  • S. Sharma, J. A. Ansari, J. K. Murthy, and K. M. Krishna (2018) Beyond pixels: leveraging geometry and shape cues for online multi-object tracking. In IEEE International Conference on Robotics and Automation, Cited by: Table 5.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pp. 1857–1865. Cited by: §1, §1, §2.
  • J. Son, M. Baek, M. Cho, and B. Han (2017) Multi-object tracking with quadruplet convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov (2019) Scalability in perception for autonomous driving: waymo open dataset. Cited by: Appendix A, §1, Table 4, §4.
  • Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei (2020) Circle loss: A unified perspective of pair similarity optimization. arXiv preprint arXiv:2002.10857. Cited by: §3.
  • W. Tian, M. Lauer, and L. Chen (2020) Online multi-object tracking using joint domain information in traffic scenarios. IEEE Trans. Intell. Transp. Syst.. Cited by: Table 5.
  • Y. Wang, D. Ramanan, and M. Hebert (2019) Meta-learning to detect rare objects. In IEEE International Conference on Computer Vision, Cited by: §2, §4.5, Table 6.
  • N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In International Conference on Image Processing, Cited by: §1, §1, §2.
  • Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §4.1.
  • Z. Wu, Y. Xiong, S. X. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742. Cited by: §1, §3.
  • Y. Xiang, A. Alahi, and S. Savarese (2015) Learning to track: online multi-object tracking by decision making. In IEEE International Conference on Computer Vision, Cited by: Table 5.
  • B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 466–481. Cited by: §2.
  • X. Yan, Z. Chen, A. Xu, X. Wang, X. Liang, and L. Lin (2019) Meta r-cnn: towards general solver for instance-level low-shot learning. In IEEE International Conference on Computer Vision, Cited by: §2, Table 6.
  • B. Yang and R. Nevatia (2012) An online learned CRF model for multi-target tracking. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • L. Yang, Y. Fan, and N. Xu (2019) Video instance segmentation. In IEEE International Conference on Computer Vision, Cited by: §2.
  • F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan (2016) POI: multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision Workshop, Cited by: §2.
  • F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) BDD100K: a diverse driving dataset for heterogeneous multitask learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix A, §1, §4.2, Table 1, §4.
  • W. Zhang, H. Zhou, S. Sun, Z. Wang, J. Shi, and C. C. Loy (2019) Robust multi-modality multi-object tracking. In IEEE International Conference on Computer Vision, Cited by: Table 5.