
TransFiner: A Full-Scale Refinement Approach for Multiple Object Tracking

by   Bin Sun, et al.

Multiple object tracking (MOT) is a task comprising detection and association. Many trackers achieve competitive performance. However, because little information is exchanged between these two subtasks, trackers are often biased toward one of them and underperform in complex scenarios, for example producing false negatives and mistaken trajectories when targets pass each other. In this paper, we propose TransFiner, a transformer-based post-refinement approach for MOT. It is a generic attachment framework that takes the images and the tracking results (locations and class predictions) of the original tracker as inputs. TransFiner relies on query pairs, which produce pairs of detection and motion through the fusion decoder and yield a comprehensive tracking improvement. We also provide targeted refinement by labeling query pairs according to different refinement levels. Experiments show that our design is effective: on the MOT17 benchmark, we elevate CenterTrack from 67.8% MOTA and 64.7% IDF1 to 71.5% MOTA and 66.8% IDF1.





1 Introduction

Figure 1: Pipelines of preparing queries for the decoder. (a) Track queries from the previous frame, used directly [22, 33] or enhanced by features from the current frame [7]. (b) Track queries produced by a history buffer [47, 4]. (c) Ours: object query pairs enhanced by encoded features. As a post-refinement framework, we enhance the object query pairs across frames with aligned encoded features obtained under the guidance of the original tracker.

Multiple object tracking (MOT) refers to linking detections of the same object across frames, and it primarily follows two mainstream paradigms: tracking-by-detection (TBD) and joint detection and tracking (JDT). TBD approaches [3, 37, 17, 10, 27] split MOT into two separate stages, detection and association. JDT, alternatively, solves MOT in a unified way, either by constructing a tracking-related structure [11, 46, 31, 40] within an existing detector or by adjusting the output objective of a particular branch [1] of it. As an additional, emerging paradigm, transformer-based MOT formulations [33, 22, 7, 47, 4, 41] also track satisfactorily. Nevertheless, these methods still struggle with intricate scenarios, such as several objects passing each other and patches of crowded objects, which lead to either high false positives or high false negatives and degrade the association at the same time. On the other hand, DETR [5] realizes end-to-end object detection through object queries and the Hungarian loss, so that each object is detected individually (e.g., without the need for NMS).

In light of these, we show how to build a generic and targeted framework for refining MOT, referred to as TransFiner, a transformer-based refinement approach. Among related work, DRT [38] refines MOT patch by patch, which indeed improves detection but hardly promotes association (and even degrades it according to the IDF1 reported in its experiments [38]). We instead take a full-scale approach by enriching query pairs guided by the original tracker (Figure 1(c)); refinement is then a fine-tuning process for query pairs without scope restriction.

As summarized in Figure 1, the existing transformer-based MOT formulations [22, 33, 7, 47, 4, 41] primarily accomplish tracking via a tracklet record (e.g., the track query). Instead, we use freshly initialized query pairs (i.e., separate queries for detection and association) at every step. With this design, competitive tracking refinement can be achieved while being less affected by earlier poor tracking predictions.

TransFiner takes the originally estimated object locations, class predictions, and two successive frames as inputs, and predicts detections in the current frame $t$ together with association clues containing the motions of centers and boxes, which map detections from frame $t$ to frame $t-1$. These are produced by TransFiner’s fusion decoder, which consists of the fusion attention module and a dual-decoder. Specifically, fusion attention handles the interaction between the detection and association queries, while the dual-decoder processes the two kinds of queries separately.

To better utilize the information from the original tracker, its predictions are categorized into qualified and poor ones according to their class scores. Together with learnable label embeddings, this lets TransFiner perform targeted refinement in parallel, with query embeddings focusing differently on the two kinds of estimations. During training, we additionally refer to ground-truth objects and pre-assign refinement targets to original estimations that lie close to them, avoiding the instability that layer-wise Hungarian matching introduces during refinement.

Experiments show that a tracker refined by TransFiner regains competitive performance. With TransFiner’s refinement, CenterTrack achieves 71.5% MOTA and 66.8% IDF1 on the MOT17 benchmark.

2 Related work

2.1 Association in tracking

Motion and appearance are two crucial references when linking detections between frames. Several works rely solely on motion, guiding objects to the next frame [3, 37, 1, 31, 11] or moving them backward [46, 41] to search for associated ones. Some [20, 27] take advantage of appearance features, matching objects across frames by computing similarity scores between feature embeddings. Combining both cues in the association [17, 39, 32, 10, 45, 40, 9] is also widely explored.

Another recent popular trend builds on the transformer [36], packaging the preceding information into high-level embeddings (e.g., track queries [22, 33, 7, 47, 4]). These embeddings are then processed together with the current information [22, 47, 7, 4], or they serve as the initialization of the latest detection [33], handling association via another detection pass. Our method extends this trend by injecting freshly aligned encoded features into query pairs that jointly predict detections and the corresponding motions for association, all in one run. Furthermore, we package information from both centers and boxes into the motions, facilitating precise association even among crowds.

2.2 DETR and its variants

DETR [5] handles object detection in an end-to-end manner, which primarily benefits from the transformer’s attention mechanism and the introduction of the object query. Unfortunately, these two factors also contribute to the slow convergence of DETR (e.g., it takes 500 training epochs to achieve competitive performance). To address this, several variants [48, 13, 23] improve the attention module by designing mechanisms that constrain the interaction fields (e.g., sampling points [48], additional spatial attention weights [13, 23]), easing the matching burden in comparison to the inefficient global search of DETR. Differently, object query alignment is studied in [43, 44], where retrieving the queries from encoded features proves effective in accelerating convergence.

We build upon Deformable DETR [48]. Specifically for tracking (refinement), we construct a fusion decoder composed of fusion attention and a dual-decoder. Two decoders are connected through the fusion attention module, an additionally masked self-attention mechanism, ensuring effective intercommunication of query pairs. It is noteworthy that query pairs are iteratively aligned based on the inherent variable reference locations. Repetitive refinement is then realized through consecutive updates of the pairs via decoder layers.

2.3 Refinement

By exploring the joint space of inputs and outputs, refinement can generally be divided into multi-step and single-step approaches. The former involves iterative correction [6, 35] and cascaded rectification [14]. In contrast, the latter simply attaches an independent module to the original model [8, 12, 34, 25, 38, 42], yielding the refined results in a single pass.

MOT refinement focuses on optimizing detections and associations, and existing methods [42, 38] fall into the second category outlined above. ReMoT [42] enhances the tracklets of objects through a split-then-merge strategy, reducing identity switches. However, identity switches are not the primary cause of performance degradation. Alternatively, DRT [38] refines the detection results from ambiguous patches, resulting in decent improvements. Nevertheless, due to its patch-based nature, the scope of post-processing is limited to a predefined area, which distinguishes it from full-scale refinement and prevents it from readily strengthening the association performance of the original tracker. These observations inspire the design of TransFiner, a full-scale and single-step approach to refining MOT on both detection and association.

3 Preliminaries

Original tracker. The original tracker is the tracker whose predictions are to be refined; it generally takes a subset of the video frames (up to the last frame of the video) as inputs. Its outputs are extracted from post-processing with laxer output settings (e.g., a lower objectness score threshold). For every predicted object in a frame, the outputs provide a classification score and a bounding box, together with association clues (e.g., motions [3, 11, 1, 37, 31] or feature embeddings [20, 45, 27]) that link objects across frames.

Refinement. TransFiner works on the encoded image features of two consecutive frames, $t-1$ and $t$. Refining the outputs of the original tracker yields one refined prediction per decoder query, which additionally includes the motion of the object between the two frames. TransFiner is built upon Deformable DETR [48], whose decoder relies on initial reference locations to make its final predictions.

Figure 2: Refining a tracker with TransFiner. The encoded features of the two frames, produced by the CNN backbone and encoder, the original tracking results, and plain query pairs serve as inputs for the fusion decoder. Following the FFN module, the query pairs from the decoder are transformed into detections and motions. For the original results, boxes ignored by post-processing are drawn dotted, and only some of them are shown for illustration. The dotted CNN and Encoder indicate that their weights are shared with the solid ones.

4 MOT refinement driven by transformer

4.1 Why transformer in refinement

Inspired by DETR [5] and its derivations [18, 44, 48, 43, 23], we conclude that the transformer’s advantages over traditional convolutional neural networks in post-refinement show in the following ways: (1) In DETR-like methods, the stacked decoder layers (e.g., iterative bounding box refinement in Deformable DETR [48]) serve as rectifiers of the earlier predictions, just as post-refinement does to some extent; (2) The object query is regarded as a composite information container for the corresponding target, and initializing it with encoded image features fetched under the guidance of the predictions to be refined makes better use of the original predictions; (3) Inspired by the joint denoising and matching training method in [18], refinement with a transformer can be cast as two parallel processes: denoising the qualified predictions and rematching the poor ones. In the following sections, we describe how TransFiner incorporates these characteristics.

4.2 Framework of TransFiner

A core concept in our approach is the query pair: one query of the pair detects a target, and the other produces the related motion for the same target. As illustrated in Figure 2, query pairs propagate through the fusion decoder. The framework of TransFiner can thus be divided into three parts: input preparation for the decoder, decoding, and target prediction from the decoder’s queries. For the first part, we package the encoded features of the two frames, the original results, and plain query pairs into inputs. Decoding then fuses the query pairs and separately processes the queries for detection and for association, yielding estimations of the targets in the current frame as well as association clues in the form of offsets of each target’s center and bounding box relative to those in the previous frame.

4.3 TransFiner’s fusion decoder layer

Figure 3: Roadmap of training fusion decoder layers. The fusion decoder layer starts by splitting queries into denoising and rematching groups (indicated by a dotted line within subsequent queries). Fusion between query pairs is performed afterward. In detail, the fusion mask completely blocks intra-frame exchange while selectively allowing communication between frames (white means no mask, gray means partial mask). We ROIAlign the encoded features guided by the current reference locations before dual-decoding. The aligned features are added to the query pairs as extra semantic information. Additionally, we train the categorized query pairs in a targeted manner.

According to Figure 3, the fusion decoder layer consists of a fusion attention module and a dual-decoder layer. The former lets detection and association clues flow between the queries, and the latter processes the detection queries and the association queries separately.

Dual-decoder layer. TransFiner provides association clues in the form of motions, i.e., offsets that ideally point from an object’s box in frame $t$ to its box in frame $t-1$. The iterative bounding box refinement mode [48] of the decoder works by iteratively correcting the box predictions of the former decoder layer through rectifications predicted from the detection query (Equation 1a). Inside query pairs, we extend this mode by simultaneously rectifying (or predicting) the attached bounding boxes in frame $t-1$ through motions predicted from the association query (Equation 1b), with both updates applied at every decoder layer. In general, the detection and association queries, connected by the motions across frames, propagate separately through the dual-decoder structure of the fusion decoder.
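As a rough illustration of the layer-wise box update described above, the following sketch follows the iterative bounding box refinement rule of Deformable DETR [48], where each layer adds a predicted offset to the previous box in inverse-sigmoid space. How the previous-frame box is obtained from the motion branch is an assumption made only for illustration (here the motion offset is applied to the refined current-frame box in the same space). All function and variable names are hypothetical and boxes are assumed to be normalized to (0, 1).

```python
import math


def inverse_sigmoid(x, eps=1e-5):
    """Map a normalized coordinate in (0, 1) back to logit space."""
    x = min(max(x, eps), 1.0 - eps)
    return math.log(x / (1.0 - x))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def refine_box(ref_box, delta):
    """One refinement step: add the offset in logit space, squash back to (0, 1)."""
    return [sigmoid(inverse_sigmoid(r) + d) for r, d in zip(ref_box, delta)]


def refine_pair(ref_cur, det_deltas, motion_deltas):
    """Run the update over several decoder layers.

    det_deltas[l] plays the role of the rectification from the detection
    query at layer l; motion_deltas[l] plays the role of the motion from the
    association query. Applying the motion to the current-frame box is an
    illustrative assumption, not the paper's stated formula.
    """
    box_cur = list(ref_cur)
    box_prev = list(ref_cur)  # both frames start from identical locations
    for det_delta, motion_delta in zip(det_deltas, motion_deltas):
        box_cur = refine_box(box_cur, det_delta)
        box_prev = refine_box(box_cur, motion_delta)
    return box_cur, box_prev
```

Working in logit space keeps every refined coordinate inside the image range regardless of the size of the predicted offsets.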

Fusion attention is a self-attention mechanism with an additional fusion mask. Depicted as the block with a color ramp from orange to gray in Figure 3, fusion begins by concatenating the embeddings of the detection and association queries of all pairs. Self-attention is then performed on the concatenated embeddings under the constraint of the fusion mask, so that it focuses on the exchange of cross-frame information. The mask (Equation 2) involves a hyperparameter that is introduced in the following.

In detail, the sub-masks of the fusion mask fall into two groups: the intra-frame sub-masks (the top-left and bottom-right blocks) and, similarly, the inter-frame sub-masks. Following Equations 1a and 1b, the boxes of the previous frame are derived from the detections through query pairs, which requires each query pair to pinpoint a specific object. For this reason, elements along the main diagonal of the inter-frame sub-masks are emphasized more than the others, giving each pair more room to determine its target. The hyperparameter in Equation 2 sets this attention difference; our default setting is -10 (see the discussion of Table 2). In a nutshell, the fusion mask is designed to improve the match between query pairs while reserving space for the retrieval of extra information.
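The mask structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the mask is taken to be additive on the attention logits before the softmax, intra-frame entries are set to negative infinity (fully blocked), inter-frame entries on the main diagonal are 0 (no mask), and the remaining inter-frame entries take the hyperparameter value, written `lam` here and defaulting to the -10 listed in Table 2. The names `build_fusion_mask`, `masked_softmax`, and `lam` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def build_fusion_mask(num_pairs, lam=-10.0):
    """Additive mask of size 2N x 2N for N query pairs.

    Indices 0..N-1 are detection queries, N..2N-1 are association queries;
    query i and query i + N form a pair.
    """
    size = 2 * num_pairs
    mask = [[NEG_INF] * size for _ in range(size)]
    for i in range(size):
        for j in range(size):
            same_group = (i < num_pairs) == (j < num_pairs)
            if same_group:
                continue  # intra-frame exchange stays fully blocked
            paired = (i % num_pairs) == (j % num_pairs)
            mask[i][j] = 0.0 if paired else lam
    return mask


def masked_softmax(logits, mask_row):
    """Softmax over one row of attention logits after adding the mask."""
    scores = [l + m for l, m in zip(logits, mask_row)]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]
```

With equal logits, a query places almost all of its attention on its own partner and only a small share on the other cross-frame queries; moving `lam` toward 0 spreads the attention more evenly.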

4.4 Decoder initialization

Query pairs of the fusion decoder greatly contribute to the object predictions. Hence, it is straightforward to consider integrating the original predictions into their initialization.

Reference locations. TransFiner fills the initial reference locations of the two most recent frames with the locations predicted by the original tracker, using the same locations for both frames.

Query pairs. Some works [43, 44] inject encoded features from the regions of interest into the query embeddings. As shown in Figure 3, we similarly ROIAlign [15] the encoded features within the reference locations at each layer, resulting in aligned feature maps. Afterward, extracting and combining the features at the sampling points of each feature map yields distinct feature embeddings, which are then added to the corresponding query pairs.

4.5 Query denoising & query rematching

A prediction is often categorized as good or bad based on its agreement with the supposed ground truth, and refining the former usually takes less effort than refining the latter. In other words, a query initialized from a good prediction usually already has a closely related target, yet it may suffer from the instability of Hungarian matching (i.e., its target may shift under the disturbance introduced by refinement, an issue similar to the one discussed in [18]). Hence, we introduce the denoising and rematching split (d&r split for short), which includes inference and training steps according to Figure 3.

Inference. We decide whether a query is for denoising or rematching by comparing the objectness score of the original prediction it is initialized from with a threshold (e.g., 0.4). Afterward, we label the queries by assigning the denoising embedding to those associated with decent predictions, whose scores pass the threshold, and the rematching embedding to those related to poor predictions. Note that the decoder performs this identification of denoising and rematching queries at the first layer.
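The labeling step can be sketched in a few lines. The sketch assumes that a score equal to the threshold counts as decent and that the label embedding is added element-wise to the query embedding; both choices, as well as the function names, are illustrative assumptions rather than details confirmed by the text.

```python
def split_queries(scores, threshold=0.4):
    """Label each query by the objectness score of its original prediction."""
    return ["denoise" if s >= threshold else "rematch" for s in scores]


def label_queries(queries, scores, denoise_emb, rematch_emb, threshold=0.4):
    """Add the matching label embedding to every query embedding."""
    labeled = []
    for query, score in zip(queries, scores):
        emb = denoise_emb if score >= threshold else rematch_emb
        labeled.append([q + e for q, e in zip(query, emb)])
    return labeled
```

For example, `split_queries([0.9, 0.4, 0.1])` marks the first two queries for denoising and the last one for rematching.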

Training. After conducting the inference step during training, we further pre-determine the matched target-prediction pairs among the denoising queries. The pairs are taken from the optimal assignment given by Hungarian matching between decent predictions and targets, and a threshold filters out denoising queries whose initialized locations deviate intolerably from their targets even though their objectness scores are high.

In the subsequent layer-by-layer refinement, Hungarian matching is performed outside the pre-matched denoising queries, leaving the unmatched denoising queries and all rematching queries to search for their best-associated targets in each layer.
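The pre-matching step can be illustrated with a small, self-contained sketch. To avoid external dependencies, an exhaustive search over permutations stands in for Hungarian matching (it returns the same minimum-cost assignment but is only practical for a handful of boxes). The L1 distance between box centers as the matching cost, the threshold value `max_dist`, and all names are assumptions made for illustration.

```python
from itertools import permutations


def center_distance(a, b):
    """L1 distance between two (cx, cy) centers."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def optimal_assignment(preds, targets):
    """Exhaustive minimum-cost one-to-one assignment (stand-in for Hungarian matching)."""
    if not preds or not targets:
        return []
    swap = len(preds) > len(targets)
    small, large = (targets, preds) if swap else (preds, targets)
    best_cost, best_perm = float("inf"), ()
    for perm in permutations(range(len(large)), len(small)):
        cost = sum(center_distance(small[i], large[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    pairs = list(enumerate(best_perm))
    # Always return (prediction index, target index) pairs.
    return [(j, i) for i, j in pairs] if swap else pairs


def prematch(denoise_preds, targets, max_dist=0.1):
    """Keep only assigned pairs whose centers are close enough."""
    return [
        (p, t)
        for p, t in optimal_assignment(denoise_preds, targets)
        if center_distance(denoise_preds[p], targets[t]) <= max_dist
    ]
```

Pairs returned by `prematch` would stay fixed across decoder layers, while every other query would go through per-layer matching as described above.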

5 Experiments

5.1 Datasets & evaluation metrics

MOT. MOT benchmarks are generally used to evaluate the performance of multiple object trackers. We conduct experiments on MOT16 and MOT17 [24], each including 7 training sequences and 7 test sequences. The final results reported in Section 5.3 are obtained by training on the entire train set (together with the validation set of CrowdHuman [30]) and evaluating on the official test set under the private detection protocol. For the ablation study, following CenterTrack [46], we split the official train set into two halves. The first half is used for training, while the second is for validation.

CrowdHuman [30] is a detection dataset of crowd images, containing 15000 training images and 4370 validation images; it is widely used as a pre-training dataset for MOT trackers.


We demonstrate our results using the popular MOT evaluation metrics set CLEAR [2], including Multiple-Object Tracking Accuracy (MOTA), Identity Switch (IDS), False Positive (FP), and False Negative (FN). Additionally, we report the IDentification F1 score (IDF1) [29] and the Higher Order Tracking Accuracy (HOTA) [21], which is the geometric mean of two sub-metrics comprising Association Accuracy score (AssA) and Detection Accuracy score (DetA).
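As a worked check of how the CLEAR counts combine, MOTA is defined as one minus the sum of false negatives, false positives, and identity switches divided by the number of ground-truth boxes. The sketch below plugs in the MOT17 counts listed for CenterTrack with and without TransFiner in Table 1. The total number of ground-truth boxes in the MOT17 test set (564228) is not stated in this paper and is used here as an assumption.

```python
def mota(fn, fp, idsw, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (fn + fp + idsw) / num_gt


# Assumption: total ground-truth boxes of the MOT17 test set (not given in the text).
MOT17_TEST_GT = 564228

baseline = mota(fn=160332, fp=18498, idsw=3039, num_gt=MOT17_TEST_GT)
refined = mota(fn=128665, fp=29283, idsw=3056, num_gt=MOT17_TEST_GT)
print(f"{100 * baseline:.1f} {100 * refined:.1f}")
```

Under this assumption the script prints `67.8 71.5`, which agrees with the MOTA values reported for the two trackers.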

5.2 Implementation details

Model. We pick CenterTrack [46] as the original tracker in our experiments. As for TransFiner, the backbone network is ResNet-50 [16], paired with a twin structure based on the six-layer encoder and decoder of Deformable DETR [48]. The number of query embeddings is set to . In MOT datasets, the bounding boxes fully cover the targets, which means that some objects have their box centers outside the image, making it suboptimal to directly predict the objects’ centers, widths, and heights. Hence, following the solution in [46], we represent each box by a center point followed by four values, which respectively give the non-negative distances from the center to the top, left, bottom, and right edges of the bounding box. This allows more precise estimations even when objects are heavily cropped.
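To make this box representation concrete, the sketch below converts a reference point plus the four edge distances into corner coordinates and optionally clips the result to the image. The argument order (point first, then top, left, bottom, right) and the function names are assumptions for illustration.

```python
def to_corners(cx, cy, top, left, bottom, right):
    """Convert a point and its distances to the four box edges into (x1, y1, x2, y2)."""
    for dist in (top, left, bottom, right):
        if dist < 0:
            raise ValueError("edge distances must be non-negative")
    return (cx - left, cy - top, cx + right, cy + bottom)


def clip_to_image(box, width, height):
    """Clip a corner-format box to the visible image area."""
    x1, y1, x2, y2 = box
    return (max(0.0, x1), max(0.0, y1), min(float(width), x2), min(float(height), y2))


# A pedestrian cut off by the left and bottom image borders (100 x 100 image):
full_box = to_corners(10.0, 50.0, top=40.0, left=30.0, bottom=60.0, right=5.0)
visible = clip_to_image(full_box, width=100, height=100)
```

Here the reference point lies inside the image while the full box extends past the borders, which is the case that a plain center, width, and height parameterization handles poorly.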

Decoder initialization. TransFiner’s decoder outputs one prediction per query, while it is initialized from the predictions of the original tracker. The mismatch between the number of queries and the number of original predictions raises the question of how to perform a one-to-one assignment when initializing the object queries. Following the categorization standard in Section 4.5, we address this by first separating the original predictions into a qualified set and a poor set. Next, we build the initialization sequence by linking one set with repeated copies of the other, and the sequence is then clipped to the number of queries. In addition, to train the model robustly, each prediction is disturbed with different random noise.

Training settings. Images are resized to as inputs. We follow the coefficients of the Hungarian loss in [5], which are 2, 5, and 2 for the classification, L1, and generalized IoU terms, respectively. We use the last two for the loss calculation on detection boxes in Equation 1a, while the estimation of association boxes (from Equation 1b) is trained under the same coefficients divided by 5. Due to GPU memory limitations, the batch size is set to 8, with gradients accumulated over every two iterations to simulate a batch size of 16. Overall, we use 2 NVIDIA RTX 3090 GPUs with batch size 8, the AdamW optimizer [19], and the initial learning rate . TransFiner is first pre-trained on the CrowdHuman train set [30] for 95 epochs, with learning rate dropping to after 50 epochs. We then train TransFiner on both MOT [24] and the CrowdHuman validation set [30] for another 130 epochs, with the learning rate decreased by a factor of 10 at the 100th epoch.
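The gradient accumulation mentioned above can be illustrated on a toy problem. The sketch uses a one-parameter least-squares model and plain gradient descent rather than the actual network or AdamW, and all names are hypothetical. It shows that averaging the gradients of two equally sized micro-batches of 8 samples and then updating once gives the same step as a single batch of 16.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, micro_batches, lr):
    """Average gradients over equally sized micro-batches, then update once."""
    grad = sum(grad_mse(w, mb) for mb in micro_batches) / len(micro_batches)
    return w - lr * grad


# 16 samples of y = 3x, processed as two micro-batches of 8.
data = [(i / 16.0, 3.0 * i / 16.0) for i in range(1, 17)]
w_accumulated = accumulated_step(0.0, [data[:8], data[8:]], lr=0.1)
w_full_batch = 0.0 - 0.1 * grad_mse(0.0, data)
```

The equivalence relies on the micro-batches having the same size; with unequal sizes the per-batch gradients would need to be weighted by their sample counts.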

Dataset  Method                 IDF1  MOTA  HOTA  DetA  AssA  IDs   FP     FN
MOT16    TubeTK [26]            62.2  66.9  50.8  55.0  47.3  1236  11544  47502
MOT16    Chain-Tracker [28]     57.2  67.6  48.8  55.0  43.7  1897  8934   48350
MOT16    TraDeS [40]            64.7  70.1  53.2  56.2  50.9  1144  8091   45210
MOT16    QuasiDense [27]        67.1  69.8  54.5  56.6  52.8  1097  9861   44050
MOT16    MeMOT [4]              69.7  72.6  57.4  -     55.7  845   14595  34595
MOT16    PatchTrack [7]         65.8  73.3  54.2  59.6  49.7  1179  10660  36824
MOT16    CenterTrack+TF (ours)  67.6  73.0  55.1  58.6  52.2  976   10463  37723
MOT17    TraDeS [40]            63.9  69.1  52.7  55.2  50.8  3555  20892  150060
MOT17    QuasiDense [27]        66.3  68.7  53.9  55.6  52.7  3378  26589  146643
MOT17    TransTrack [33]        63.9  74.5  53.9  60.5  48.3  3663  28323  112137
MOT17    TransCenter [41]       62.2  73.2  54.5  60.1  49.7  4614  23112  123738
MOT17    TubeTK [26]            58.6  63.0  48.0  51.4  45.1  4137  27060  177483
MOT17    Chain-Tracker [28]     57.4  66.6  49.0  53.6  45.2  5529  22284  160491
MOT17    TrackFormer [22]       63.9  65.0  -     -     -     3258  70443  123552
MOT17    MeMOT [4]              69.0  72.5  56.9  -     55.2  2724  37221  115248
MOT17    PatchTrack [7]         65.2  73.6  53.9  59.4  49.3  3795  23976  121230
MOT17    CenterTrack [46]       64.7  67.8  52.2  53.8  51.0  3039  18498  160332
MOT17    CenterTrack+TF (ours)  66.8  71.5  54.5  57.5  52.0  3056  29283  128665

Table 1: Evaluation results on MOT challenge datasets (private detection). The TF stands for TransFiner. The best result in each column is marked in red and in blue for the second-to-best.

Figure 4: Case study. Examples of CenterTrack (left column) refined by TransFiner (right column). Tracks are marked by color. The big black arrow depicts a tracklet of identical objects across frames. Under CenterTrack, the pedestrian in the orange box at Frame 555 now appears in a blue box since Frame 588. TransFiner, on the other hand, handles the identity switch originally introduced by this target via continuous tracks with green boxes. Moreover, additional annotations in red denote objects that CenterTrack ignores, while TransFiner fixes them.

5.3 Benchmark results

Since TransFiner is a post-refinement model, we first discuss the improvement made by applying it after the original tracker (CenterTrack [46] in our experiments). Then, we compare the refined tracker with recent MOT trackers on MOT16 and MOT17 [24].

Improvement under TransFiner. CenterTrack officially reports results on the MOT17 benchmark, so we take a detailed look there. As shown in Table 1, refinement by TransFiner brings a comprehensive improvement (+2.1% IDF1 and +3.7% MOTA). This benefits from the distinct focuses of query pairs over targets, which contribute to an apparent reduction in FN (decreasing by 31667, from 160332 to 128665), while IDsw stays virtually unchanged (increasing from 3039 to 3056). A case depiction can be seen in Figure 4.

MOT16 & MOT17. Table 1 reports results on the MOT16 and MOT17 test sets. After our enhancement, the tracker regains competitive performance in a holistic way. On MOT16, we chiefly compare the enhanced CenterTrack with two other transformer-based trackers, namely PatchTrack [7] and MeMOT [4], which respectively obtain state-of-the-art performance in detection and association. The improved CenterTrack achieves comparable detection performance (73.0% MOTA and 58.6% DetA), with 0.3% lower MOTA and 1.0% lower DetA than PatchTrack [7]. Meanwhile, we associate objects better than PatchTrack, relying on the informative motions from query pairs, but still underperform MeMOT on both IDF1 (67.6 vs. 69.7) and AssA (52.2 vs. 55.7), possibly due to our local linkage (performed on two consecutive frames). On MOT17, CenterTrack powered by TransFiner achieves the second-best tracking ability, surpassing most transformer-based approaches such as TransTrack [33], TransCenter [41], TrackFormer [22], and PatchTrack [7]. In addition, TransFiner with CenterTrack detects well (57.5% DetA) but is inferior to several SOTA transformer-based trackers. A probable reason is that query pairs restrict the prediction of objects in the current frame if they are out of scope in the previous frame.

5.4 Ablation study

We test our design choices with the same model combination as in Section 5.3 (CenterTrack and TransFiner) on the train-val split of the MOT17 train set.

Ablation           Choice              MOTA  IDF1  HOTA  AssA
Decoder structure  Single              62.3  59.0  48.6  44.8
                   *Fusion             70.1  74.0  60.6  63.0
Refinement tactic  w/ back refer       69.8  71.5  59.2  60.0
                   w/o d&r split       69.0  72.6  59.8  61.6
                   w/o d&r embeddings  68.9  72.8  59.3  60.5
                   *Vanilla            70.1  74.0  60.6  63.0
Hyperparameter     0                   69.5  71.8  58.8  59.5
                   -5                  70.5  73.0  60.0  61.1
                   *-10                70.1  74.0  60.6  63.0
                   -                   69.5  73.7  60.1  62.0
Motion             *Center+Box         70.1  74.0  60.6  63.0
                   Center              69.0  67.5  56.6  55.4
                                       67.9  65.7  55.5  53.7
Baseline                               66.2  69.4  -     -

Table 2: Ablation studies on the MOT17 validation set. * marks our default settings. Back refer is an experimental attempt proposed in Section 5.4. Baseline is the tracking performance of CenterTrack [46] under the same experiment settings. We explore the design options on decoder structure (refinement fails with the single-decoder structure), refinement tactic (the d&r split boosts refinement, while back refer drags it down), the fusion mask hyperparameter (which balances detection and association), and motion (box motion is critical in association). The best results in each color block are marked in bold.

Decoder structure. The fusion attention module and dual-decoder are stacked repeatedly to form the fusion decoder. We additionally obtain a single-decoder version by removing the fusion attention and the decoder that focuses on the association queries. Straightforwardly, refining with TransFiner built on the single decoder merely redetects the objects of the current frame with a specific decoder initialization. The results in the blue block of Table 2 suggest that information fusion and motion estimation play a crucial role in MOT refinement. The fusion decoder elevates association significantly (IDF1 rises from 59.0% to 74.0% and AssA from 44.8% to 63.0% compared with the single decoder), indicating that the motions from the query pairs of the fusion decoder are robust in linking objects across frames.

Refinement tactic. We begin by exploring the initialization with back referring. Next, we discuss the ablations on the d&r split of queries.

To further leverage the original association clues during decoder initialization, we attempt to extend the location assignment in Section 4.4 by back referring instead of assigning identical reference locations to both frames. Specifically, back referring derives the reference locations of the previous frame by treating the original tracker’s association clues as backward motions and applying them to the current-frame locations. The effect of back referring can be seen in the gray block of Table 2, which shows a degradation in overall performance. We see two reasons for this: (1) motions of objects whose objectness scores are far from certain usually have significant biases, which deteriorate refinement by acting as harmful noise; (2) query pairs and the fusion mask are capable of gradually adjusting the position pairs, which makes an elaborate location assignment at the start unnecessary.

For the ablation on the d&r split, we drop it from the vanilla model. The 2nd row of the gray block in Table 2 shows that this lowers model performance, probably because it pushes TransFiner to treat all original predictions equally, without special attention to tough ones. In addition, we try the d&r split without the embeddings that label denoising and rematching queries. This, however, further degrades TransFiner, partly because little information then tells the queries which refinement purpose they serve.

Hyperparameter. The green rows of Table 2 show performance under various choices of the fusion mask hyperparameter. Setting it to 0 leads to an obvious decline in association (IDF1 drops by 2.2% and AssA by 3.5% from the default setting). In contrast, detection and association suffer only slightly under the last setting in Table 2, dropping from the vanilla by 0.6% MOTA and 0.3% IDF1. Moreover, we observe a mild overall improvement when the hyperparameter takes a moderate value. Intuitively, a suitable value properly weighs the interactions of a query with queries outside and inside its own pair, so that queries are fitted dynamically and controllably.

Motion. TransFiner estimates motions in the form of offsets of object centers and boxes from the present frame to the last frame. According to the yellow chunk of Table 2, there is a considerable gap in association with and without box motions (74.0% IDF1 vs. 67.5% IDF1 and 63.0% AssA vs. 55.4% AssA), since box motions are more distinctive in crowded scenarios.

5.5 Limitations

TransFiner performs local tracking (within adjacent frames), which limits refinement when targets are under long-term occlusion. To address this, a prediction error buffer (e.g., one that contains TransFiner’s predictions crossing the border of the d&r split), along with a stronger query interaction mechanism, may help. In addition, although TransFiner leverages initial tracking results in a non-trivial way, how to better semantically join the input space (e.g., frames) and the output space (i.e., original predictions) remains an open question. We leave these as future work.

6 Conclusion

We present TransFiner, a generic post-refinement framework for MOT. Our approach adapts the transformer to the refinement task and simply takes the locations predicted by the original tracker together with two consecutive frames as inputs. TransFiner fully exploits the initial estimations: locations guide the extraction of image features to enrich query pairs and are used to divide the pairs into different levels for targeted rectification. Labeled query pairs, which closely represent the original predictions, deeply combine the input and output spaces for refinement by propagating through the fusion decoder. Our tracker-booster achieves impressive refinement outcomes on the MOT16 and MOT17 benchmarks.


  • [1] P. Bergmann, T. Meinhardt, and L. Leal-Taixe (2019) Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951. Cited by: §1, §2.1, §3.
  • [2] K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10. Cited by: §5.1.
  • [3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pp. 3464–3468. Cited by: §1, §2.1, §3.
  • [4] J. Cai, M. Xu, W. Li, Y. Xiong, W. Xia, Z. Tu, and S. Soatto (2022) MeMOT: multi-object tracking with memory. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    pp. 8090–8100. Cited by: Figure 1, §1, §1, §2.1, §5.3, Table 1.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European conference on computer vision, pp. 213–229. Cited by: §1, §2.2, §4.1, §5.2.
  • [6] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik (2016) Human pose estimation with iterative error feedback. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4733–4742. Cited by: §2.3.
  • [7] X. Chen, S. M. Iranmanesh, and K. Lien (2022) PatchTrack: multiple object tracking using frame patches. arXiv preprint arXiv:2201.00080. Cited by: Figure 1, §1, §1, §2.1, §5.3, Table 1.
  • [8] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7103–7112. Cited by: §2.3.
  • [9] Y. Du, Y. Song, B. Yang, and Y. Zhao (2022) StrongSORT: make deepsort great again. arXiv preprint arXiv:2202.13514. Cited by: §2.1.
  • [10] K. Fang, Y. Xiang, X. Li, and S. Savarese (2018) Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 466–475. Cited by: §1, §2.1.
  • [11] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3038–3046. Cited by: §1, §2.1, §3.
  • [12] M. Fieraru, A. Khoreva, L. Pishchulin, and B. Schiele (2018) Learning to refine human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 205–214. Cited by: §2.3.
  • [13] P. Gao, M. Zheng, X. Wang, J. Dai, and H. Li (2021) Fast convergence of detr with spatially modulated co-attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3621–3630. Cited by: §2.2.
  • [14] S. Gidaris and N. Komodakis (2017) Detect, replace, refine: deep structured prediction for pixel wise labeling. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5248–5257. Cited by: §2.3.
  • [15] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.4.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.2.
  • [17] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler (2016) Learning by tracking: siamese cnn for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40. Cited by: §1, §2.1.
  • [18] F. Li, H. Zhang, S. Liu, J. Guo, L. M. Ni, and L. Zhang (2022) Dn-detr: accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13619–13627. Cited by: §4.1, §4.5.
  • [19] I. Loshchilov and F. Hutter (2018) Decoupled weight decay regularization. In International Conference on Learning Representations, Cited by: §5.2.
  • [20] Z. Lu, V. Rathod, R. Votel, and J. Huang (2020) Retinatrack: online single stage joint detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 14668–14678. Cited by: §2.1, §3.
  • [21] J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe (2021) Hota: a higher order metric for evaluating multi-object tracking. International journal of computer vision 129 (2), pp. 548–578. Cited by: §5.1.
  • [22] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2021) Trackformer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702. Cited by: Figure 1, §1, §1, §2.1, §5.3, Table 1.
  • [23] D. Meng, X. Chen, Z. Fan, G. Zeng, H. Li, Y. Yuan, L. Sun, and J. Wang (2021) Conditional detr for fast training convergence. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3651–3660. Cited by: §2.2, §4.1.
  • [24] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §5.1, §5.2, §5.3.
  • [25] G. Moon, J. Y. Chang, and K. M. Lee (2019) Posefix: model-agnostic general human pose refinement network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7773–7781. Cited by: §2.3.
  • [26] B. Pang, Y. Li, Y. Zhang, M. Li, and C. Lu (2020) Tubetk: adopting tubes to track multi-object in a one-step training model. In CVPR, Cited by: Table 1.
  • [27] J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu (2021) Quasi-dense similarity learning for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 164–173. Cited by: §1, §2.1, §3, Table 1.
  • [28] J. Peng, C. Wang, F. Wan, Y. Wu, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu (2020) Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In ECCV, Cited by: Table 1.
  • [29] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Cited by: §5.1.
  • [30] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun (2018) Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123. Cited by: §5.1, §5.1, §5.2.
  • [31] B. Shuai, A. Berneshawi, X. Li, D. Modolo, and J. Tighe (2021) SiamMOT: siamese multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12372–12382. Cited by: §1, §2.1, §3.
  • [32] J. Son, M. Baek, M. Cho, and B. Han (2017)

    Multi-object tracking with quadruplet convolutional neural networks

    In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5620–5629. Cited by: §2.1.
  • [33] P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, and P. Luo (2020) Transtrack: multiple object tracking with transformer. arXiv preprint arXiv:2012.15460. Cited by: Figure 1, §1, §1, §2.1, §5.3, Table 1.
  • [34] C. Tang, H. Chen, X. Li, J. Li, Z. Zhang, and X. Hu (2021) Look closer to segment better: boundary patch refinement for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 13926–13935. Cited by: §2.3.
  • [35] H. Tang, X. Liu, S. Sun, X. Yan, and X. Xie (2021) Recurrent mask refinement for few-shot medical image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3918–3928. Cited by: §2.3.
  • [36] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Advances in neural information processing systems 30. Cited by: §2.1.
  • [37] C. Vatral, G. Biswas, and B. Goldberg (2021) Online multi-object motion tracking by fusion of head and body detections. Cited by: §1, §2.1, §3.
  • [38] B. Wang, C. Fruhwirth-Reisinger, H. Possegger, H. Bischof, G. Cao, and E. M. Learning (2021) DRT: detection refinement for multiple object tracking. In 32nd British Machine Vision Conference: BMVC 2021, Cited by: §1, §2.3, §2.3.
  • [39] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pp. 3645–3649. Cited by: §2.1.
  • [40] J. Wu, J. Cao, L. Song, Y. Wang, M. Yang, and J. Yuan (2021) Track to detect and segment: an online multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12352–12361. Cited by: §1, §2.1, Table 1.
  • [41] Y. Xu, Y. Ban, G. Delorme, C. Gan, D. Rus, and X. Alameda-Pineda (2021) Transcenter: transformers with dense queries for multiple-object tracking. arXiv preprint arXiv:2103.15145. Cited by: §1, §1, §2.1, §5.3, Table 1.
  • [42] F. Yang, X. Chang, S. Sakti, Y. Wu, and S. Nakamura (2021) ReMOT: a model-agnostic refinement for multiple object tracking. Image and Vision Computing 106, pp. 104091. Cited by: §2.3, §2.3.
  • [43] Z. Yao, J. Ai, B. Li, and C. Zhang (2021) Efficient detr: improving end-to-end object detector with dense prior. arXiv preprint arXiv:2104.01318. Cited by: §2.2, §4.1, §4.4.
  • [44] G. Zhang, Z. Luo, Y. Yu, K. Cui, and S. Lu (2022) Accelerating detr convergence via semantic-aligned matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 949–958. Cited by: §2.2, §4.1, §4.4.
  • [45] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu (2020) Fairmot: on the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888. Cited by: §2.1, §3.
  • [46] X. Zhou, V. Koltun, and P. Krähenbühl (2020) Tracking objects as points. In European Conference on Computer Vision, pp. 474–490. Cited by: §1, §2.1, §5.1, §5.2, §5.3, Table 1, Table 2.
  • [47] T. Zhu, M. Hiller, M. Ehsanpour, R. Ma, T. Drummond, and H. Rezatofighi (2021) Looking beyond two frames: end-to-end multi-object tracking using spatial and temporal transformers. arXiv preprint arXiv:2103.14829. Cited by: Figure 1, §1, §1, §2.1.
  • [48] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. In International Conference on Learning Representations, Cited by: §2.2, §2.2, §3, §4.1, §4.3, §5.2.