TransCenter: Transformers with Dense Queries for Multiple-Object Tracking

03/28/2021 ∙ by Yihong Xu, et al. ∙ MIT Inria 52

Transformer networks have proven extremely powerful for a wide variety of tasks since they were introduced. Computer vision is not an exception, as the use of transformers has become very popular in the vision community in recent years. Despite this wave, multiple-object tracking (MOT) has so far shown some incompatibility with transformers. We argue that the standard representation – bounding boxes – is not well suited to learning transformers for MOT. Inspired by recent research, we propose TransCenter, the first transformer-based architecture for tracking the centers of multiple targets. Methodologically, we propose the use of dense queries in a double-decoder network, to be able to robustly infer the heatmap of targets' centers and associate them through time. TransCenter outperforms the current state-of-the-art in multiple-object tracking, both in MOT17 and MOT20. Our ablation study demonstrates the advantage of the proposed architecture compared to more naive alternatives. The code will be made publicly available.




1 Introduction

(a) TransCenter (Ours)
(b) TransTrack [50]
(c) CenterTrack [66]
(d) FairMOT [63]
Figure 1: Results of state-of-the-art MOT methods: (a), (c) and (d) are the center heatmaps of TransCenter (Ours), CenterTrack [66] and FairMOT [63] respectively, while (b) shows the bounding-box centers from the queries in TransTrack [50]. Previous transformer-based tracking methods [50, 39] use sparse queries, which leads to missed detections (pink arrow), and these queries overlap heavily, possibly leading to false detections (green arrow). Previous MOT center trackers [63, 66] suffer from the same problems because the centers are estimated independently of each other. TransCenter is designed to mitigate these two adverse effects by using dense (pixel-level) multi-scale queries to enable heatmap-based inference and by exploiting the attention mechanisms to introduce co-dependency between center predictions.

The task of tracking multiple objects, usually understood as the simultaneous inference of the position and identity of various persons in a visual scene recorded by one or more cameras, has become a core problem in computer vision in recent years. Undoubtedly, the various multiple-object tracking (MOT) challenges and associated datasets [40, 10] have helped foster research on this topic and provided a standard way to evaluate and monitor the performance of the methods proposed by many research teams worldwide.

In line with recent progress in computer vision using transformers [54] for tasks such as pedestrian detection [7, 67, 34], person re-identification [24] or image super resolution [60], we are interested in investigating the use of transformer-based architectures for multiple-object tracking, as recent evidence [50, 39] demonstrated the interest of exploring such architectures for this task. However, we argue that the pedestrian representation used so far is not appropriate for learning transformer-based architectures for MOT. Indeed, TransTrack [50] and TrackFormer [39] use bounding boxes to represent pedestrians, which is very intuitive since the bounding box is a widespread representation for MOT, for instance in combination with probabilistic methods [45, 2] or deep convolutional architectures [3, 59, 44, 42, 55, 18, 62, 48]. One of the prominent drawbacks of using bounding boxes for tracking multiple objects manifests when dealing with very crowded scenes [10], where occlusions are very difficult to handle since ground-truth bounding boxes often overlap each other. This is problematic because these bounding boxes are used during training, not only to regress the position, width, and height of each person but also to discriminate the visual appearance associated with each track. In this context, overlapping bounding boxes mean training a visual appearance representation that combines the visual content of two or even more people [23, 22]. Certainly, jointly addressing the person tracking and segmentation tasks [39] can partially solve the occlusion problem. However, this requires extra annotations – segmentation masks – which are very tedious and costly to obtain. In addition, such annotations are not available in standard benchmark datasets [40, 10].

In this paper, we draw inspiration from very recent research in MOT [66, 63] and devise a transformer-based architecture that can be trained to track the center of each person, which we name TransCenter. The main difference with respect to TransTrack [50] and TrackFormer [39], developed directly from the object detection transformers [67] and [7] respectively, is that TransCenter is conceived to mitigate the occlusion problem inherent to bounding-box tracking without requiring extra ground-truth annotations such as segmentation masks. While this intuition is very straightforward, designing an efficient transformer-based architecture that implements it is far from evident.

Indeed, the first challenge is to be able to infer dense representations (i.e. center heatmaps). To do so, we propose the use of dense (pixel-level) multi-scale queries. In addition to allowing heatmap-based MOT, the use of dense queries overcomes the limitations [7, 67] associated with querying the decoder with a small number of queries. Inspired by [50], TransCenter has two different decoders: one for person detection and another one for person tracking. Both decoders are given queries that depend on the current image, but they are extracted with different learnable layers. However, while the memory (i.e. the output of the transformer encoder) of the current frame is given to the detection decoder, the memory of the previous frame is given to the tracking decoder.

Overall, this paper has the following contributions:

  • We propose the use of transformers for multiple-object center tracking and term this architecture TransCenter.

  • To infer position heatmaps, we propose the use of dense multi-scale queries that are computed from the encoding of the current image using learnable layers.

  • TransCenter sets a new state-of-the-art baseline among online MOT tracking methods in MOT17 [40] (+10.1% multiple-object tracking accuracy, MOTA) as well as MOT20 [10] (+5% MOTA), leading both MOT competitions. Moreover, to our knowledge, TransCenter sets the first transformer-based state-of-the-art baseline in MOT20, thanks to its ability to track in crowded scenes. (Footnote 1: TrackFormer [39] is tested on MOT20S, which are sequences from MOT17 containing far less crowded scenes than MOT20.)

2 Related Works

2.1 Multiple-Object Tracking

In the MOT literature, initial works [2, 45, 1] focus on how to find the optimal associations between detections and tracklets through probabilistic models, while [41] first formulates the problem as an end-to-end learning task with recurrent neural networks. Moreover, [47] models the dynamics of objects with a recurrent network and further combines the dynamics with an interaction and an appearance branch. [59] proposes a framework to directly use the standard evaluation measures MOTA and MOTP as loss functions to backpropagate the errors for an end-to-end tracking system.

[3] employs object detection methods for MOT by modeling the problem as a regression task. A person re-identification network [53, 3] can be added at the second stage to boost the performance. However, it is still not optimal to treat the person re-identification as a secondary task. [63] further proposes a framework that treats the person detection and re-identification task equally.

Moreover, traditional graphs are also used to model the positions of objects as nodes and the temporal connection of the objects as edges [25, 53, 51, 29, 52]. The performance of those methods is further boosted by the recent rise of Graph Neural Networks (GNNs): hand-designed graphs are replaced by learnable GNNs [56, 57, 55, 43, 6] to model the complex interaction of the objects.

In most of the methods above, bounding boxes are used as the object representation for the network. However, this is not a satisfying solution because it creates ambiguity when objects occlude each other or when noisy background information is included. CenterTrack [66] and FairMOT [63] represent objects as heatmaps, then reason about all the objects jointly and associate heatmaps across frames.

Figure 2: Overview of TransCenter. Images at t and t-1 are fed to a CNN backbone to produce multi-scale features, then processed by a deformable encoder to produce the memory of the current and of the previous frame respectively. The memory at t is used to compute dense multi-scale detection and tracking queries through two query learning networks (QLN). The detection and tracking queries are fed to the detection and tracking deformable decoders respectively, together with the memory at t and at t-1. The outputs are multi-scale detection and tracking features, which are used to estimate the center heatmap and the object sizes. Both multi-scale features, together with the center heatmap at t-1, are used to estimate the displacement vector for each center.

2.2 Transformers in Vision

The transformer was first proposed by [54] for machine translation, and has shown its ability to handle long-term complex dependencies between sequences by using a multi-head attention mechanism. Following its great success in natural language processing, works in computer vision have started to investigate transformers for various tasks, such as image recognition [14], person re-identification [24], realistic image generation [27], super resolution [60] and audio-visual learning [17, 16].

Object detection with Transformer (DETR) [7] can be seen as an exploration and correlation task. It is an encoder-decoder structure where the encoder extracts the image information and the decoder finds the best correlation between the object query and the encoded image features with an attention module. However, the attention calculation suffers from heavy computational and memory complexities w.r.t. the input size: the feature map extracted from a ResNet [21] backbone is used as the input to alleviate the problem. Deformable DETR [67] tackles the issue by proposing a deformable attention inspired by [9], drastically speeding up convergence (10×) and reducing the complexity. This allows capturing finer details by using multi-scale features, yielding better detection performance.

Following the success in detection using transformers, two concurrent works directly apply transformers to MOT based on the DETR framework. First, TrackFormer [39] builds directly on DETR [7] and is trained to propagate the queries through time. Second, TransTrack [50] extends [67] to MOT by adding a decoder that processes the features at the previous time step to refine previous detection positions. Importantly, both methods stay in the detection framework and use it for tracking, a strategy that has proven successful in previous works [59, 3]. However, recent literature [66, 63] also suggests that bounding boxes may not be the best representation for MOT, and this paper investigates the use of transformers for center tracking, thus introducing TransCenter.

3 TransCenter for Multiple Object Tracking

We are motivated to investigate the use of transformers for multiple-object tracking. As described in the introduction, previous works in this direction attempted to learn to infer bounding boxes. We question this choice, and explore the use of an alternative representation very popular in the recent past: center heatmaps. However, unlike bounding boxes, heatmaps are dense rather than sparse representations. Consequently, while [50, 39] used sparse object queries, we introduce the use of dense multi-scale queries for transformers in computer vision. Indeed, to our knowledge, we are the first to propose the use of a dense query feature map that scales with the input image size. To give an order of magnitude, in our experiments the decoders are queried with thousands of queries. One downside of using dense queries is the associated memory consumption. To mitigate this undesirable effect, we propose to use deformable decoders, inspired by deformable convolutions.

More precisely, we cast the multiple-object tracking problem into two separate subtasks: the detection of objects at time t, and the association with objects detected at t-1. Different from previous studies following the same rationale [3, 59], TransCenter addresses these two tasks in parallel, by using a fully deformable dual decoder architecture. The output of the detection decoder is used to estimate the object center and size, while it is combined with the output of the tracking decoder to estimate the displacement of the object w.r.t. the previous image. An important consequence of combining center heatmaps with the use of a dual decoder architecture is that the object association through time depends not only on geometry features (e.g. IOU) but also on the visual features from the decoder.

3.1 TransCenter in a Nutshell

The overall architecture of TransCenter can be seen in Figure 2. The RGB images at time t and t-1 are fed to a CNN backbone to produce multi-scale features and capture finer details in the image, as done in [67], and then to a deformable self-attention encoder, thus obtaining the multi-scale memory feature maps associated with the two images. Then, the memory at t is given to a query learning network (QLN), which consists of fully connected layers operating pixel-wise and outputs a feature map of dense multi-scale detection queries. These go through another QLN to produce a feature map of dense multi-scale tracking queries. A fully deformable dual decoder architecture is then used to process them: the deformable detection decoder compares the detection queries to the memory at t to output multi-scale detection features, and the deformable tracking decoder does the same with the tracking queries and the memory at t-1 to output multi-scale tracking features. The detection multi-scale features are used to estimate the bounding box size and the center heatmap. Together with the tracking features and the center heatmap at t-1, the detection features are also used to estimate the tracking displacement.
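The data flow described above can be sketched in plain Python as follows. This is only an illustrative wiring under stated assumptions: the function `transcenter_forward`, the `modules` dictionary and all of its key names are hypothetical, and each component is passed in as an opaque callable rather than implemented.

```python
def transcenter_forward(img_t, img_prev, prev_center_heatmap, modules):
    """Wire the components in the order described in the text.

    `modules` is a hypothetical dict of callables; every key name is an
    illustrative assumption, not an identifier from the authors' code.
    """
    # Shared backbone and deformable encoder give one memory per frame.
    memory_t = modules["encoder"](modules["backbone"](img_t))
    memory_prev = modules["encoder"](modules["backbone"](img_prev))

    # Dense queries: the first QLN reads the current memory, the second QLN
    # reads the detection queries.
    det_queries = modules["qln_det"](memory_t)
    trk_queries = modules["qln_trk"](det_queries)

    # Dual decoder: detection attends to the current memory, tracking to the
    # memory of the previous frame.
    det_features = modules["det_decoder"](det_queries, memory_t)
    trk_features = modules["trk_decoder"](trk_queries, memory_prev)

    # Output branches: center heatmap, object size, tracking displacement.
    center = modules["center_head"](det_features)
    size = modules["size_head"](det_features)
    disp = modules["track_head"](det_features, trk_features, prev_center_heatmap)
    return center, size, disp


# Stub components that only record which inputs they received.
stubs = {
    "backbone": lambda img: ("feat", img),
    "encoder": lambda feat: ("mem", feat[1]),
    "qln_det": lambda mem: ("dq", mem[1]),
    "qln_trk": lambda dq: ("tq", dq[1]),
    "det_decoder": lambda q, mem: ("df", q[0], mem[1]),
    "trk_decoder": lambda q, mem: ("tf", q[0], mem[1]),
    "center_head": lambda df: ("center", df[2]),
    "size_head": lambda df: ("size", df[2]),
    "track_head": lambda df, tf, prev_c: ("disp", df[2], tf[2], prev_c),
}
center, size, disp = transcenter_forward("t", "t-1", "C_prev", stubs)
```

With the stubs, the displacement output records that it depends on the current frame, the previous frame and the previous center heatmap, while the center and size outputs depend on the current frame only.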

In the following we first explain the design of the dense multi-scale queries, then the architecture of the fully deformable dual decoder, the three main branches – center heatmap, object size, and tracking – and finally the training losses.

Figure 3: Overview of the center heatmap branch. The multi-scale detection features are upscaled and merged via a series of deformable convolutions, into the output center heatmap. A similar strategy is followed for the object size and the tracking branches.

3.2 Dense Multi-scale Queries

Traditional transformer architectures output as many elements as queries fed to the decoder, and more importantly, these outputs correspond to the entities sought (e.g. pedestrian bounding boxes). When inferring center heatmaps, the probability of having a person’s center at a given pixel becomes one of these sought entities, thus requiring the transformer decoder to be fed with dense queries. Such queries are obtained from the multi-scale encoder’s memory, via a first query learning network (QLN), which is a feed-forward network operating pixel-wise, yielding the dense detection queries. We use two different sets of queries for the dual decoder: a second QLN processes the detection queries to obtain the tracking queries. Both sets are fed to the fully deformable dual decoder, see Sec. 3.3.
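A minimal sketch of such a pixel-wise query learning network is given below, in pure Python on nested lists. It assumes a QLN made of two fully connected layers with a ReLU between them, applied independently at every pixel of every scale; the function names, the placement of the ReLU and the toy weights are illustrative assumptions.

```python
def linear(vec, weight, bias):
    """Fully connected layer; `weight` holds one row per output channel."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def qln(feature_map, w1, b1, w2, b2):
    """Apply the same two-layer network at every pixel of an H x W x C map."""
    return [[linear(relu(linear(pixel, w1, b1)), w2, b2) for pixel in row]
            for row in feature_map]


def dense_queries(memory_scales, det_params, trk_params):
    """Detection queries come from the memory, tracking queries from them."""
    det = [qln(scale, *det_params) for scale in memory_scales]
    trk = [qln(scale, *trk_params) for scale in det]
    return det, trk


# Toy two-scale memory with two channels per pixel (hypothetical values).
memory = [
    [[[1.0, -2.0], [0.5, 0.5]], [[0.0, 3.0], [-1.0, -1.0]]],  # 2 x 2 map
    [[[2.0, 2.0]]],                                           # 1 x 1 map
]
identity = [[1.0, 0.0], [0.0, 1.0]]
mix = [[1.0, 1.0], [1.0, -1.0]]
zeros = [0.0, 0.0]
det, trk = dense_queries(memory,
                         (identity, zeros, mix, zeros),
                         (identity, zeros, identity, zeros))
```

Because the network acts per pixel, the number of queries at each scale always equals the number of pixels of the corresponding memory map, whatever the image size.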

The fact that the dense query feature map resolution is proportional to the resolution of the input image has two prominent advantages. First, the queries can be multi-scale and exploit the multi-resolution structure of the encoder, allowing very small targets to be captured by those queries. Second, dense queries also make the network more flexible since it is able to adapt to arbitrary image sizes. More generally, the use of QLN avoids the problem of manually sizing the queries and selecting beforehand the maximum number of detections, as was done in previous transformer architectures (for computer vision).

3.3 Fully Deformable Dual Decoder

To successfully find object trajectories, a MOT method should not only detect the objects but also associate them across frames. To do so, TransCenter uses a fully deformable dual decoder. More precisely, two fully deformable decoders deal in parallel with the two subtasks: detection and tracking. While the detection decoder correlates the detection queries and the memory at t with the attention modules to detect objects in the image at t, the tracking decoder correlates the tracking queries and the memory at t-1 to associate the detected objects to their position in the previous image at t-1. Specifically, the detection decoder searches for objects in the multi-scale memory at t with the attention correlated to the multi-scale detection queries and then outputs the multi-scale detection features, used to find the object centers and box sizes. Differently, the deformable tracking decoder finds the objects in the memory at t-1 and associates them with the objects at t. To do this, the multi-head deformable attention in the tracking decoder performs a temporal cross-correlation between the multi-scale tracking queries and the memory at t-1 and outputs the multi-scale tracking features, containing the temporal information that is used in the tracking branch to estimate the displacements from time t back to t-1.

Both the detection and tracking decoders take a dense query feature map as input so as to output dense information as well. However, using the multi-head attention modules of traditional transformers [54] in TransCenter would imply a memory and complexity growth that is quadratic in the input image size. This is undesirable and would limit the scalability and usability of the method, especially when processing multi-scale features. We therefore resort to deformable multi-head attention, thus leading to a fully deformable dual decoder architecture.
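The scaling argument can be illustrated with a small back-of-the-envelope count. The sketch below assumes one query per pixel of each feature map and a fixed number of sampled locations per query for deformable attention; the strides, the image sizes and the value of `sampled_keys` are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def ceil_div(a, b):
    return -(-a // b)


def num_dense_queries(height, width, strides):
    """One query per pixel of every feature map in the pyramid."""
    return sum(ceil_div(height, s) * ceil_div(width, s) for s in strides)


def attention_pairs(num_queries, num_keys, sampled_keys=None):
    """Query-key interactions for one head of one layer.

    Standard attention compares each query with every key, whereas
    deformable attention only looks at `sampled_keys` locations per query.
    """
    per_query = num_keys if sampled_keys is None else sampled_keys
    return num_queries * per_query


strides = (8, 16)                                  # hypothetical pyramid
q_small = num_dense_queries(64, 96, strides)       # 8*12 + 4*6 queries
q_large = num_dense_queries(128, 192, strides)     # image side doubled

dense_growth = attention_pairs(q_large, q_large) / attention_pairs(q_small, q_small)
deform_growth = (attention_pairs(q_large, q_large, sampled_keys=4)
                 / attention_pairs(q_small, q_small, sampled_keys=4))
```

Doubling each image side multiplies the number of dense queries by four; the standard attention count then grows by a factor of sixteen, while the deformable count grows only by a factor of four.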

3.4 The Center, the Size and the Tracking Branches

The output of the two fully deformable decoders are two sets of multi-scale features, referred to as the detection and tracking features. More precisely, each set of multi-scale features contains four feature maps at different fractions of the input image resolution. For the center heatmap and the object size branches, the feature maps at different resolutions are combined using deformable convolutions and bilinear interpolation, following the architecture shown in Figure 3, into a single feature map at a fixed fraction of the input resolution, and finally into the center heatmap and the object size map (the two channels of the size map encode the width and the height). Regarding the tracking branch, the two sets of multi-scale features follow the same up-scaling as in the two other branches (but with different parameters), yielding two feature maps at the same resolution. These two feature maps are concatenated to the previous center heatmap downscaled to the resolution of the feature maps. As in the other branches, a block of convolutional layers computes the final output, i.e. the displacement of the objects, where the two channels encode the horizontal and vertical displacements respectively.

3.5 Training TransCenter

Training TransCenter is achieved by jointly learning a classification task for the object center heatmap and a regression task for the object size and tracking displacements, covering the three branches of TransCenter. For the sake of clarity, in this section we drop the time index t.

Center Focal Loss  In order to train the center branch, we first need to build the ground-truth heatmap response. As done in [66], we construct it by considering the maximum response of a set of Gaussian kernels centered at each of the ground-truth object centers. More formally, for every pixel position, the ground-truth heatmap response is the maximum, over all ground-truth object centers, of a Gaussian kernel centered at that object center and evaluated at the pixel position. In our case, the spread of each kernel is proportional to the object’s size, as described in [30]. Given the ground-truth and the inferred center heatmaps, the center focal loss is formulated with two scaling factors, see [63].
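For reference, the formulation commonly used by center-based trackers such as [66, 63], which the description above follows, can be written as below. The notation is an assumption made here for illustration ($C$ for the ground-truth heatmap, $\hat{C}$ for the inferred heatmap, $(x_k, y_k)$ and $\sigma_k$ for the center and spread of the $k$-th of $K$ objects, $\alpha$ and $\beta$ for the two scaling factors), and the exact form used by TransCenter may differ in details such as the normalization.

```latex
% Ground-truth heatmap: maximum over K Gaussian kernels, one per object center.
C_{xy} = \max_{k=1,\dots,K}
  \exp\!\left(-\frac{(x - x_k)^2 + (y - y_k)^2}{2\sigma_k^2}\right)

% Penalty-reduced focal loss between the ground truth C and the prediction \hat{C}.
\mathcal{L}_C = -\frac{1}{K} \sum_{x,y}
\begin{cases}
  (1 - \hat{C}_{xy})^{\alpha} \log \hat{C}_{xy}
    & \text{if } C_{xy} = 1, \\
  (1 - C_{xy})^{\beta} \, \hat{C}_{xy}^{\alpha} \log (1 - \hat{C}_{xy})
    & \text{otherwise.}
\end{cases}
```

The factor $(1 - C_{xy})^{\beta}$ reduces the penalty for pixels close to a ground-truth center, so that near misses are punished less than confident false positives far from any object.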

Sparse Regression Loss  The values of the object size and tracking displacement outputs are supervised only at the locations where object centers are present, using a sparse regression loss on the object size. The formulation of the tracking loss is analogous to that of the size loss, but uses the tracking output and ground-truth instead of the object size. To complement the sparsity of these two losses, we add an extra regression loss with the bounding boxes computed from the estimated object sizes and the ground-truth centers. The impact of this additional loss is marginal, as shown in Section 4.4.

In summary, the overall loss is formulated as the weighted sum of all the losses, where the weights are chosen according to the numeric scale of each loss.
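The training objective described in this section can be sketched in pure Python as follows. This is a simplified illustration under stated assumptions rather than the training code: the regression losses are taken to be L1 losses evaluated at integer center pixels, the focal loss follows the usual penalty-reduced form of center-based detectors, and all function names, the scaling factors `alpha=2.0` and `beta=4.0`, and the loss weights are hypothetical.

```python
import math


def gaussian_heatmap(height, width, centers, sigmas):
    """Per-pixel maximum over one Gaussian kernel per ground-truth center."""
    heatmap = [[0.0] * width for _ in range(height)]
    for (cx, cy), sigma in zip(centers, sigmas):
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
                heatmap[y][x] = max(heatmap[y][x], g)
    return heatmap


def center_focal_loss(pred, target, alpha, beta, eps=1e-6):
    """Penalty-reduced focal loss, normalized by the number of centers."""
    loss, num_pos = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
            if t == 1.0:
                loss -= (1.0 - p) ** alpha * math.log(p)
                num_pos += 1
            else:
                loss -= (1.0 - t) ** beta * p ** alpha * math.log(1.0 - p)
    return loss / max(num_pos, 1)


def sparse_l1_loss(pred_map, targets):
    """Mean L1 error evaluated only at ground-truth center pixels.

    `targets` maps an integer (x, y) center to its ground-truth vector,
    e.g. (width, height) for sizes or (dx, dy) for displacements.
    """
    total = 0.0
    for (x, y), target in targets.items():
        total += sum(abs(p - t) for p, t in zip(pred_map[y][x], target))
    return total / max(len(targets), 1)


def total_loss(losses, weights):
    """Weighted sum of the individual losses."""
    return sum(weights[name] * value for name, value in losses.items())


target = gaussian_heatmap(3, 3, centers=[(1, 1)], sigmas=[1.0])
sharp = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
flat = [[0.5] * 3 for _ in range(3)]
loss_sharp = center_focal_loss(sharp, target, alpha=2.0, beta=4.0)
loss_flat = center_focal_loss(flat, target, alpha=2.0, beta=4.0)

size_pred = [[[1.0, 5.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
loss_size = sparse_l1_loss(size_pred, {(0, 0): (2.0, 3.0)})
loss_all = total_loss({"center": loss_flat, "size": loss_size},
                      {"center": 1.0, "size": 0.1})
```

A prediction that fires only at the true center yields a loss close to zero, whereas a uniformly uncertain prediction is penalized both at the center and at the background pixels.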


4 Experimental Evaluation

Method | Public Detections | | | | | | | | | Private Detections | | | | | | | |
Method | Data | MOTA | MOTP | IDF1 | MT | ML | FP | FN | IDS | Data | MOTA | MOTP | IDF1 | MT | ML | FP | FN | IDS
------ | ---- | ---- | ---- | ---- | -- | -- | -- | -- | --- | ---- | ---- | ---- | ---- | -- | -- | -- | -- | ---
TransCenter (Ours) | ch | 71.9 | 81.4 | 62.3 | 38.0 | 22.7 | 17,378 | 137,008 | 4,046 | ch | 73.2 | 81.1 | 62.2 | 40.8 | 18.5 | 23,112 | 123,738 | 4,614
*TrackFormer [39] | ch | 61.8 | - | 59.8 | 35.4 | 21.1 | 35,226 | 177,270 | 2,982 | - | - | - | - | - | - | - | - | -
*UnsupTrack [28] | pt | 61.7 | 78.3 | 58.1 | 27.2 | 32.4 | 16,872 | 197,632 | 1,864 | - | - | - | - | - | - | - | - | -
MOTDT17 [8] | re | 50.9 | 76.6 | 52.7 | 17.5 | 35.7 | 24,069 | 250,768 | 2,474 | - | - | - | - | - | - | - | - | -
*TransTrack [50] | - | - | - | - | - | - | - | - | - | ch | 65.8 | 78.8 | 56.9 | 32.2 | 21.8 | 24,000 | 163,683 | 5,355
CenterTrack [66] | no | 61.5 | 78.9 | 59.6 | 26.4 | 31.9 | 14,076 | 200,672 | 2,583 | ch | 67.8 | 78.4 | 64.7 | 34.6 | 24.6 | 18,489 | 160,332 | 3,039
FUFET [48] | no | 62.0 | - | 59.5 | 27.8 | 31.5 | 15,114 | 196,672 | 2,621 | (5d1) | 76.2 | 81.1 | 68.0 | 51.1 | 13.6 | 32,796 | 98,475 | 3,237
MLT [62] | - | - | - | - | - | - | - | - | - | (5d1) | 75.3 | 81.7 | 75.5 | 49.3 | 19.5 | 27,879 | 109,836 | 1,719
*CSTrack [33] | - | - | - | - | - | - | - | - | - | 5d1 | 74.9 | 80.9 | 72.6 | 41.5 | 17.5 | 23,847 | 114,303 | 3,567
*FairMOT [63] | - | - | - | - | - | - | - | - | - | 5d1 | 73.7 | 81.3 | 72.3 | 43.2 | 17.3 | 27,507 | 117,477 | 3,303
*GSDT [55] | - | - | - | - | - | - | - | - | - | 5d2 | 66.2 | 79.9 | 68.7 | 40.8 | 18.3 | 43,368 | 144,261 | 3,318
GSM_Tracktor [36] | no | 56.4 | 77.9 | 57.8 | 22.2 | 34.5 | 14,379 | 230,174 | 1,485 | - | - | - | - | - | - | - | - | -
Tracktor++ [3] | no | 56.3 | 78.8 | 55.1 | 21.1 | 35.3 | 8,866 | 235,449 | 1,987 | - | - | - | - | - | - | - | - | -
TrctrD17 [59] | no | 53.7 | 77.2 | 53.8 | 19.4 | 36.6 | 11,731 | 247,447 | 1,947 | - | - | - | - | - | - | - | - | -
Tracktor [3] | no | 53.5 | 78.0 | 52.3 | 19.5 | 36.6 | 12,201 | 248,047 | 2,072 | - | - | - | - | - | - | - | - | -
*MAT [20] | no | 67.1 | 80.8 | 69.2 | 38.9 | 26.4 | 22,756 | 161,547 | 1,279 | - | - | - | - | - | - | - | - | -
ChainedTracker [44] | - | - | - | - | - | - | - | - | - | no | 66.6 | 78.2 | 57.4 | 32.2 | 24.2 | 22,284 | 160,491 | 5,529
TubeTK [42] | - | - | - | - | - | - | - | - | - | no | 63.0 | 78.3 | 58.6 | 31.2 | 19.9 | 27,060 | 177,483 | 4,137
TransCenter (Ours) | no | 68.8 | 79.9 | 61.4 | 36.8 | 23.9 | 22,860 | 149,188 | 4,102 | no | 70.0 | 79.6 | 62.1 | 38.9 | 20.4 | 28,119 | 136,722 | 4,647
Table 1: Results on MOT17 [40]. The left and right halves of the table correspond to public and private detections respectively. The cell background color encodes the amount of extra-training data: green for none, orange for one extra dataset, red for five extra datasets. Methods with * are not associated to a publication. The best result within the same training conditions (background color) is underlined. The best result among published methods is in bold. Best seen in color.

4.1 Implementation Details

Inference with TransCenter  Once the method is trained, we detect objects by filtering the output center heatmap. Since the datasets are annotated with bounding boxes, we need to convert our estimates into this representation. In detail, we apply a threshold to the heatmap, thus producing a list of center positions, and we extract the object size associated with each of these positions. The set of detections produced by TransCenter is directly the set of these centers with their sizes. Once the detection step is performed, we can estimate the position of each object in the previous image by combining the estimated displacement from the tracking branch output with the center position. In this way, we construct a set of detections tracked back to the previous image. Finally, we use the Hungarian algorithm to match the detections at the previous time step with the tracked-back detections and thus associate the tracks through time. The birth and death processes are naturally integrated in TransCenter: detections not associated with previous detections give birth to new tracks, while unmatched previous detections are put to sleep for at most a fixed number of frames before being discarded. New tracks are compared to sleeping tracks by means of an external re-identification network from [3] trained only on MOT17 [40], whose impact is ablated in the experiments.
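The inference steps above can be sketched as follows in pure Python. Several choices are assumptions made for illustration: the threshold value, the sign convention of the displacement (added to the current center), the use of the center distance as matching cost, and the distance gate `max_dist`. The matching itself is done by brute force over permutations, which stands in for the Hungarian algorithm: it finds the same minimum-cost assignment but is only practical for a handful of objects, and a library routine such as SciPy's `linear_sum_assignment` would normally be used instead.

```python
import itertools
import math


def extract_detections(heatmap, size_map, threshold):
    """Return (center, size) pairs for pixels whose score exceeds the threshold."""
    detections = []
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > threshold:
                detections.append(((x, y), size_map[y][x]))
    return detections


def track_back(detections, displacement_map):
    """Move each current center to its estimated position in the previous frame."""
    tracked = []
    for (x, y), _size in detections:
        dx, dy = displacement_map[y][x]
        tracked.append((x + dx, y + dy))  # assumed convention: center + displacement
    return tracked


def associate(prev_centers, tracked_back, max_dist):
    """Minimum-cost one-to-one matching on center distance (brute force)."""
    n_prev, n_cur = len(prev_centers), len(tracked_back)
    small, large = min(n_prev, n_cur), max(n_prev, n_cur)
    best_pairs, best_cost = [], math.inf
    for perm in itertools.permutations(range(large), small):
        if n_cur <= n_prev:
            pairs = [(perm[j], j) for j in range(n_cur)]
        else:
            pairs = [(i, perm[i]) for i in range(n_prev)]
        cost = sum(math.dist(prev_centers[i], tracked_back[j]) for i, j in pairs)
        if cost < best_cost:
            best_pairs, best_cost = pairs, cost
    # Discard matched pairs that are too far apart.
    matches = [(i, j) for i, j in best_pairs
               if math.dist(prev_centers[i], tracked_back[j]) <= max_dist]
    matched_prev = {i for i, _ in matches}
    matched_cur = {j for _, j in matches}
    sleeping = [i for i in range(n_prev) if i not in matched_prev]
    births = [j for j in range(n_cur) if j not in matched_cur]
    return matches, sleeping, births


heatmap = [[0.1, 0.9, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.8],
           [0.0, 0.0, 0.0, 0.0]]
size_map = [[(2.0, 4.0)] * 4 for _ in range(3)]
displacement = [[(0.0, 0.0)] * 4 for _ in range(3)]
displacement[0][1] = (-1.0, 0.0)
displacement[1][3] = (0.0, 1.0)

detections = extract_detections(heatmap, size_map, threshold=0.5)
tracked = track_back(detections, displacement)
matches, sleeping, births = associate([(3.0, 2.0)], tracked, max_dist=2.0)
```

In this toy frame, the second detection is matched to the only previous track, and the first detection has no counterpart and therefore gives birth to a new track.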

Network and Training Parameters  The input images are resized to a fixed resolution. Both the encoder and the decoder have six layers with a fixed hidden dimension and eight attention heads. The query learning networks consist of two fully connected layers with ReLU activation. Our CNN backbone is ResNet-50 [21]. TransCenter is trained with fixed loss weights by the AdamW optimizer [37], with one learning rate for the CNN backbone and another for the rest of the network. The training lasts 50 epochs, applying learning rate decay at the 40th epoch. The entire network is pre-trained on the pedestrian class of COCO [35] and then fine-tuned on the respective MOT dataset [40, 10]. Overall, with 2 RTX Titan GPUs and batch size 2, it takes around 1h30 and 1h per epoch of MOT20 and MOT17 respectively. We also present the results of fine-tuning with extra data, namely the CrowdHuman dataset [49]. See the results and discussion for details.
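As a small illustration of the schedule described above, the sketch below computes a step-decayed learning rate over 50 epochs for two parameter groups. The base learning rates, the decay factor and the convention that the decay takes effect from the start of the 40th epoch are placeholders chosen for the example; they are assumptions, not values from the text.

```python
def learning_rate(epoch, base_lr, decay_factor, decay_epoch=40):
    """Step schedule with 1-indexed epochs: decay from `decay_epoch` onwards."""
    return base_lr * decay_factor if epoch >= decay_epoch else base_lr


# Hypothetical base learning rates for the two parameter groups.
base_lrs = {"backbone": 1e-5, "rest": 1e-4}
schedule = {
    name: [learning_rate(epoch, lr, decay_factor=0.1) for epoch in range(1, 51)]
    for name, lr in base_lrs.items()
}
```

Each list holds one value per epoch, so `schedule["rest"][39]` is the learning rate of the non-backbone parameters during the 40th epoch.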

4.2 Protocol

Datasets and Detections  We use the standard split of the MOT17 [40] and MOT20 [10] datasets and the evaluation is obtained by submitting the results to the MOTChallenge website. The MOT17 test set contains 2,355 trajectories distributed in 17,757 frames. MOT20 test set contains 1,501 trajectories within only 4,479 frames, which leads to a much more challenging setting. We evaluate TransCenter both under public and private detections. When using public detections, we limit the maximum number of birth candidates at each frame to be the number of public detections per frame, as in [66, 39]. The selected birth candidates are those closest to the public detections with IOU larger than 0. When using private detections, there are no constraints, and the detections depend only on the network capacity, the use of external detectors, and more importantly, the use of extra training data. For this reason, we regroup the results by the use of extra training datasets as detailed in the following.
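The public-detection constraint on track births can be sketched as follows. The greedy strategy, the interpretation of "closest" as highest IOU, the `(x1, y1, x2, y2)` box format and the function names are assumptions made for illustration; the text only specifies that at most one birth candidate per public detection is kept and that its IOU with the public detection must be larger than 0.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_births(candidates, public_dets):
    """Keep, for each public detection, the unused candidate with the highest IOU > 0."""
    selected, used = [], set()
    for pub in public_dets:
        best, best_iou = None, 0.0
        for idx, cand in enumerate(candidates):
            overlap = iou(cand, pub)
            if idx not in used and overlap > best_iou:
                best, best_iou = idx, overlap
        if best is not None:
            used.add(best)
            selected.append(best)
    return selected


candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
public_dets = [(0, 0, 10, 10), (100, 100, 110, 110)]
births = select_births(candidates, public_dets)
```

Only the first candidate is kept here: the second public detection overlaps no candidate, and the number of births can never exceed the number of public detections.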

Method | Public Detections | | | | | | | | | Private Detections | | | | | | | |
Method | Data | MOTA | MOTP | IDF1 | MT | ML | FP | FN | IDS | Data | MOTA | MOTP | IDF1 | MT | ML | FP | FN | IDS
------ | ---- | ---- | ---- | ---- | -- | -- | -- | -- | --- | ---- | ---- | ---- | ---- | -- | -- | -- | -- | ---
TransCenter (Ours) | ch | 58.6 | 79.8 | 46.7 | 35.5 | 18.7 | 33,691 | 175,841 | 4,850 | ch | 58.3 | 79.7 | 46.8 | 35.7 | 18.6 | 35,959 | 174,893 | 4,947
*UnsupTrack [28] | pt | 53.6 | 80.1 | 50.6 | 30.3 | 25.0 | 6,439 | 231,298 | 2,178 | - | - | - | - | - | - | - | - | -
*GSDT [55] | - | - | - | - | - | - | - | - | - | 5d2 | 67.1 | 79.1 | 67.5 | 53.1 | 13.2 | 31,913 | 135,409 | 3,131
*CSTrack [33] | - | - | - | - | - | - | - | - | - | 5d1 | 66.6 | 78.8 | 68.6 | 50.4 | 15.5 | 25,404 | 144,358 | 3,196
*FairMOT [63] | - | - | - | - | - | - | - | - | - | 5d1 | 61.8 | 78.6 | 67.3 | 68.8 | 7.6 | 103,440 | 88,901 | 5,243
*GNNMatch [43] | no | 54.5 | 79.4 | 49.0 | 32.8 | 25.5 | 9,522 | 223,611 | 2,038 | - | - | - | - | - | - | - | - | -
Tracktor++ [3] | no | 52.6 | 79.9 | 52.7 | 29.4 | 26.7 | 6,930 | 236,680 | 1,648 | - | - | - | - | - | - | - | - | -
SORT [5] | no | 42.7 | 78.5 | 45.1 | 16.7 | 26.2 | 27,521 | 264,694 | 4,470 | - | - | - | - | - | - | - | - | -
MLT [62] | - | - | - | - | - | - | - | - | - | no | 48.9 | 78.0 | 54.6 | 30.9 | 22.1 | 45,660 | 216,803 | 2,187
TransCenter (Ours) | no | 57.5 | 79.4 | 47.1 | 35.6 | 18.0 | 40,443 | 174,850 | 4,840 | no | 57.1 | 79.4 | 46.7 | 35.7 | 18.0 | 42,871 | 173,911 | 4,940
Table 2: Results on MOT20 [10]. The table is structured following the same principles as Table 1. Methods with * are not associated to a publication. The best result within the same training conditions (background color) is underlined. The best result among published methods is in bold. Best seen in color.

Extra Training Data  To fairly compare with the state-of-the-art methods, we clearly denote the extra data used to train each method (including several pre-prints listed in the MOTChallenge leaderboard, which are marked with * in our result tables): ch for CrowdHuman [49], pt for PathTrack [38], re for the combination of the Market1501 [64], CUHK01 and CUHK03 [31] person re-identification datasets, 5d1 for the use of 5 extra datasets (CrowdHuman [49], Caltech Pedestrian [13, 12], CityPersons [61], CUHK-SYSU [58], and PRW [65]), 5d2 is the same as 5d1 replacing CrowdHuman by ETH [15], (5d1) uses the tracking/detection results of FairMOT [63] (trained in the 5d1 setting), and no for using no extra dataset. (Footnote 2: COCO [35] and ImageNet [11] are not considered as extra data according to the MOTChallenge [40, 10].)

Metrics  Standard MOT metrics such as MOTA (Multiple Object Tracking Accuracy) and MOTP (Multiple Object Tracking Precision) [4] are used: MOTA is mostly used since it reflects the average tracking performance including the number of FPs (false positives, predicted bounding boxes not enclosing any object), FNs (false negatives, missing ground-truth objects) and IDS [32] (identity switches of predicted trajectories through time). MOTP evaluates the quality of bounding boxes from successfully tracked objects. Moreover, we also evaluate on IDF1 [46] (the ratio of correctly identified detections over the average number of ground-truth objects and predicted tracks), MT (the ratio of ground-truth trajectories that are covered by a track hypothesis for more than 80% of their life span), and ML (for less than 20% of their life span).
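The metrics above can be written down compactly. The sketch below uses the standard definition of MOTA from [4], one minus the sum of false positives, false negatives and identity switches divided by the number of ground-truth boxes, and follows the verbal definitions given above for IDF1, MT and ML. The function names and the toy numbers are hypothetical.

```python
def mota(num_fp, num_fn, num_ids, num_gt):
    """Multiple Object Tracking Accuracy as a fraction (multiply by 100 for %)."""
    return 1.0 - (num_fp + num_fn + num_ids) / num_gt


def idf1(num_id_tp, num_gt, num_pred):
    """Correctly identified detections over the mean of ground-truth and predicted counts."""
    return num_id_tp / ((num_gt + num_pred) / 2.0)


def mt_ml(coverages, mt_thresh=0.8, ml_thresh=0.2):
    """Fractions of trajectories that are mostly tracked and mostly lost.

    `coverages` holds, per ground-truth trajectory, the fraction of its
    life span covered by a track hypothesis.
    """
    n = len(coverages)
    mostly_tracked = sum(c > mt_thresh for c in coverages) / n
    mostly_lost = sum(c < ml_thresh for c in coverages) / n
    return mostly_tracked, mostly_lost


score = mota(num_fp=10, num_fn=20, num_ids=5, num_gt=100)
identity_score = idf1(num_id_tp=60, num_gt=100, num_pred=80)
mt, ml = mt_ml([0.9, 0.5, 0.1, 0.85])
```

Because FP, FN and IDS enter MOTA with equal weight, a method can trade a higher FP count for a much lower FN count and still improve its MOTA.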

4.3 Results and Discussion

MOT17  Table 1 presents the results obtained on the MOT17 [40] dataset. A first general remark is that, unlike us, most state-of-the-art methods do not evaluate under both public and private detections, nor under different extra-training data settings. Secondly, TransCenter systematically outperforms all other methods, in terms of MOTA, under similar training data conditions, both for public and private detections. Indeed, the increase of MOTA w.r.t. the best performing published method is of (10.1% taking unpublished methods into account) and for public detections under extra and no-extra training data, and of and for private detections. If we consider only published methods, the superiority of TransCenter is remarkable in most of the metrics. We can also observe that TransCenter trained with no extra training data outperforms not only the methods trained with no extra data but also the methods trained with one extra dataset (in terms of MOTA for both public and private detections). Along the same lines, TransCenter trained on ch performs better than two of the methods trained with five extra datasets. Overall, these results confirm our hypothesis that the heatmap representation combined with the proposed TransCenter architecture is a better option for MOT using transformers.

Figure 4: Visualization of the attention from the detection (a)-(d) decoder and the tracking (e)-(h) decoder in the current and previous image at (for better visualization), respectively. The brighter the higher the attention weights.

MOT20  Table 2 reports the results obtained in MOT20. In public detections, TransCenter leads the competition both in extra ( MOTA) and no-extra ( MOTA) training data. Another remarkable achievement of TransCenter is the significant decrease of FP when compared to the existing methods ( and beyond). Very importantly, to the best of our knowledge, our study is the first to report the results on MOT20 of a transformer-based architecture, demonstrating the tracking capacity of TransCenter even in a densely crowded scenario. For the sake of completeness, we provide the results on MOT20 for private detections and set a new baseline for future research for methods trained under ch and no extra data.

Attention Visualization  We show in Fig. 4 the attention from different attention heads of both the detection and tracking decoders. We can see that for the detection attention, different heads focus on different areas of : (a) the people; (b), (c) the background; (d) both the background and the people. For the tracking attention, interestingly we observe that the object information at does correlate to the previous image: in (f)-(h), the tracking decoder tries to look for objects at in the surroundings of the positions of the objects at . In addition, it also focuses on the objects in the previous image, as shown within the orange box in (e).

Figure 5: Tracking results visualization of TransCenter on the MOT20 test set, in the Private Detection setting.

Qualitative Results  We report in Fig. 5 qualitative results on the MOT20 test set, to assess the ability of TransCenter to detect and track targets in crowded scenes with highly overlapping bounding boxes. Fig. 5(a) and 5(b) are extracted from MOT20-07, and Fig. 5(c) and 5(d) from MOT20-08. We observe that TransCenter maintains high recall, even under drastic mutual occlusions, and reliably associates detections across time.

To summarize, TransCenter exhibits outstanding results on both the MOT17 and MOT20 datasets, for both public and private detections, and both with and without extra training data, which indicates that multiple-object center tracking using transformers is a promising research direction.

4.4 Ablation Study

(a) MOT17
(b) MOT20
Figure 6: MOTA, MOTP and IDF1 ablation studies on MOT17 and MOT20.

In this section, we experimentally compare different configurations of TransCenter. For the ablation, we further divide the training sets into a train-validation split: we take the first 50% of frames (2,664 and 4,468 frames for MOT17 and MOT20, respectively) as training data and test on the last 25% (1,332 and 2,234 frames for MOT17 and MOT20, respectively). The remaining 25% of frames, in the middle of the sequences, are discarded to prevent over-fitting.
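The split described above can be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_sequence`, the choice to split each sequence independently, the 0-based frame indices and the floor rounding are all hypothetical, so the sketch is not guaranteed to reproduce the exact frame counts quoted in the text.

```python
def split_sequence(num_frames: int) -> dict[str, range]:
    """Split one sequence into first 50% (train), middle 25% (discarded), last 25% (val).

    Assumptions (not from the paper): per-sequence split, 0-based indices,
    floor rounding for both the training and the validation part.
    """
    train_end = num_frames // 2
    val_start = num_frames - num_frames // 4
    return {
        "train": range(0, train_end),
        "discarded": range(train_end, val_start),  # buffer between train and val
        "val": range(val_start, num_frames),
    }


# Hypothetical 600-frame sequence.
parts = split_sequence(600)
print({name: (r.start, r.stop) for name, r in parts.items()})
```

Dropping the middle block leaves a temporal gap between the last training frame and the first validation frame, so near-duplicate consecutive frames do not appear on both sides of the split.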

Single Decoder Is Not Enough  We study the possibility of using one single decoder and one set of dense multi-scale queries to perform tracking. Using a single decoder leads to very poor results, as shown in Figure 6 (Single Decoder). This is because the network switches its attention between image and image during training and eventually fails to track objects correctly at and (low MOTA). More details can be found in the supplementary materials. A single decoder would certainly be more memory-efficient, but memory is not a critical issue in TransCenter thanks to the deformable modules [67]. The overall memory consumption is therefore affordable for a normal GPU setting (see details in Sec. 4.1).

Method MOT17-FP MOT17-FN MOT17-IDS MOT20-FP MOT20-FN MOT20-IDS
Single decoder 305 12,991 1,782 1,302 160,254 43,482
D.Detr+IOU 1,291 7,774 507 6,921 78,648 7,978
W/out Loss 1,279 7,090 184 5,210 79,103 1,589
W/out Reid 1,202 6,951 467 7,107 76,137 4,286
TransCenter (ours) 1,202 6,951 203 7,127 76,157 1,549
Table 3: Ablation studies on MOT17 [40] and MOT20 [10].

Lost Person Re-identification  We use an external Re-ID network to recover the identities that are temporarily suspended by the tracker. The Re-ID network is the one in [3], pre-trained on the MOT17 [40] training set. Similarly, a light-weight optical flow estimation network, LiteFlowNet [26], pre-trained on KITTI [19], is used to recover the lost identities. This process helps reduce IDS, but the overall tracking performance does not come from these external networks, since they do not improve FP and FN (see Tab. 3). On MOT20, we even observe slightly worse FP and FN, since the external networks were not fine-tuned on MOT20.

Beyond Detection  We also ablate against D.Detr [67]+IOU matching, i.e. tracking with a bounding-box object detector followed by a handcrafted geometric IOU matching step. From Figure 6, we observe that the bounding-box object detector better encloses correctly detected objects (i.e. higher MOTP). However, it lacks the prior information from the past, which leads to higher IDS and FNs.

Without   We evaluate the impact of the additional bounding box regression loss that completes the sparse object size loss, as discussed in Section 3.5. We observe a slight performance drop (-0.7% MOTA for MOT17 and -0.3% for MOT20), indicating that the two sparse regression losses and the dense center heatmap focal loss are sufficient to train TransCenter.
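To make the handcrafted IOU matching of the D.Detr+IOU baseline more concrete, here is a minimal sketch of frame-to-frame association by box overlap. The text does not specify the matching details, so everything here is an assumption: the greedy strategy (rather than, e.g., Hungarian assignment), the 0.5 threshold, the `(x1, y1, x2, y2)` box format and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_iou_match(
    tracks: Dict[int, Box], detections: List[Box], min_iou: float = 0.5
) -> Dict[int, int]:
    """Greedily pair detections with previous-frame tracks, best overlap first.

    Returns {detection index: track id}; unmatched detections are left out
    and would start new tracks in a full tracker.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    assigned: Dict[int, int] = {}
    used_tracks = set()
    for score, tid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in used_tracks or j in assigned:
            continue
        assigned[j] = tid
        used_tracks.add(tid)
    return assigned


# Hypothetical boxes: two existing tracks, three new detections.
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (100, 100, 110, 110)]
print(greedy_iou_match(tracks, detections))
```

Such a matcher only sees box geometry in two consecutive frames, which is consistent with the observation above that the baseline lacks information from the past.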

5 Conclusion

In this paper, we introduce TransCenter, a novel transformer-based architecture for multiple-object tracking. TransCenter proposes the use of dense multi-scale queries in combination with a fully deformable dual decoder, able to output dense representations for the objects' center, size and temporal displacement. The deformable decoder allows processing thousands of queries while keeping the overall memory usage within reasonable bounds. Under the same training conditions, TransCenter outperforms all its competitors on MOT17 and MOT20, and even exhibits comparable performance to some methods trained with much more data.


Appendix A Sparse vs. Dense Query Models

(a) MOT17
(b) MOT20
Figure 7: FP, FN, IDS ablation studies on MOT17, MOT20 validation of models trained on half MOT17.
Method MOT17-FP MOT17-FN MOT17-IDS MOT20-FP MOT20-FN MOT20-IDS
Sparse Queries [50] 1,086 7,526 190 13,989 190,689 2,496
Dense Queries (ours) 1,202 6,951 203 12,337 145,546 2,889
Table 4: FP, FN, IDS on MOT17 and crowded scenes MOT20 validation of models trained on half MOT17.

Both models are pre-trained on CrowdHuman [49] and fine-tuned on the first half of the sequences of the MOT17 [40] dataset. From Fig. 7, we see that TransCenter outperforms the method [50] using sparse queries (+2% MOTA, +0.9% IDF1) on MOT17 [40]. Without fine-tuning on MOT20 [10], we observe a large gap between the performance of the methods using dense and sparse queries (+15.2% MOTA and +6.2% IDF1).

The discrepancy is also reflected in Tab. 4: compared to [50] on MOT20 [10], TransCenter, without training on MOT20, detects many more objects (-45,143 FNs) while producing fewer FPs (-1,652). The rise in IDS is due to the fact that more objects are detected, which causes more severe occlusions.
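The differences quoted above can be re-derived directly from the MOT20 columns of Table 4. The snippet below is only a worked arithmetic check; the variable names are illustrative.

```python
# (FP, FN, IDS) on the MOT20 validation data, copied from Table 4.
sparse = {"FP": 13_989, "FN": 190_689, "IDS": 2_496}   # Sparse Queries [50]
dense = {"FP": 12_337, "FN": 145_546, "IDS": 2_889}    # Dense Queries (ours)

# Negative values mean fewer errors with dense queries.
delta = {name: dense[name] - sparse[name] for name in sparse}
print(delta)  # {'FP': -1652, 'FN': -45143, 'IDS': 393}
```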

The reason is the use of pixel-level queries that are correlated to the input image. Since the number of queries is independent of the number of objects in the image, we do not need to re-parameterize it according to the number of objects, as models using image-independent sparse queries must. TransCenter thus generalizes better to more crowded scenes.

Appendix B Qualitative Visualization

We visualize some qualitative results in Fig. 8 and Fig. 9 on the MOT20 testset showing the capability of TransCenter in tracking people in very crowded scenes.

Appendix C Detailed Results

We provide the detailed results on MOT17 [40] (see Tab. 5) and MOT20 [10] (see Tab. 6) testsets with TransCenter trained on CrowdHuman [49] and MOT17 trainset or MOT20 trainset, respectively.

Figure 8: Tracking results visualization of TransCenter on MOT20-04, MOT20-07 in the test set using the Private Detection setting.
Figure 9: Tracking results visualization of TransCenter on MOT20-06 test set, in the Private Detection setting.

Public detections

Sequence MOTA MOTP IDF1 MT ML FP FN IDS
MOT17-01-DPM 49.8 78.8 40.3 8 9 448 2,745 46
MOT17-01-FRCNN 49.8 78.8 40.1 8 9 480 2,709 47
MOT17-01-SDP 50.4 78.7 39.7 8 9 490 2,662 49
MOT17-03-DPM 89.6 82.0 73.8 126 5 2,514 8,167 226
MOT17-03-FRCNN 88.2 82.1 73.7 123 8 2,505 9,619 224
MOT17-03-SDP 88.9 82.0 73.0 122 9 2,731 8,697 240
MOT17-06-DPM 61.2 80.8 56.3 76 56 497 3,900 170
MOT17-06-FRCNN 63.5 80.5 56.8 82 42 543 3,552 201
MOT17-06-SDP 62.9 80.6 56.9 84 50 556 3,617 194
MOT17-07-DPM 58.7 79.4 48.1 15 8 683 6,100 190
MOT17-07-FRCNN 58.8 79.4 48.5 15 6 674 6,085 194
MOT17-07-SDP 59.8 79.3 47.7 16 6 702 5,885 197
MOT17-08-DPM 46.2 81.5 36.1 22 21 422 10,662 280
MOT17-08-FRCNN 45.7 81.6 36.5 21 21 395 10,815 269
MOT17-08-SDP 46.6 81.4 36.1 22 20 427 10,571 279
MOT17-12-DPM 59.5 84.0 62.3 30 28 334 3,121 51
MOT17-12-FRCNN 59.3 84.0 61.8 30 29 272 3,208 50
MOT17-12-SDP 59.7 83.8 61.7 30 27 361 3,077 53
MOT17-14-DPM 34.9 76.3 36.5 17 61 674 11,050 316
MOT17-14-FRCNN 36.9 75.8 37.1 19 55 843 10,445 383
MOT17-14-SDP 37.6 75.7 38.2 20 55 827 10,321 387
MOT17-all 71.9 81.4 62.3 894 (38.0%) 534 (22.7%) 17,378 137,008 4,046

Private detections

Sequence MOTA MOTP IDF1 MT ML FP FN IDS
MOT17-01 49.3 78.6 39.7 8 9 568 2,650 49
MOT17-03 90.6 81.8 73.5 136 0 3,410 6,116 266
MOT17-06 64.0 80.4 56.4 87 35 651 3,364 227
MOT17-07 60.0 79.2 47.9 16 6 807 5,755 200
MOT17-08 47.2 81.3 36.3 22 18 445 10,423 286
MOT17-12 57.3 83.5 60.6 30 24 666 2,963 69
MOT17-14 37.4 75.4 37.5 21 53 1,157 9,975 441
MOT17-all 73.2 81.1 62.2 960 (40.8%) 435 (18.5%) 23,112 123,738 4,614
Table 5: Per-sequence detailed results on the MOT17 [40] test set for TransCenter trained on CrowdHuman [49] and MOT17 [40]. In the private detection setting, the results for DPM, SDP and FRCNN are the same. We therefore do not specify their associated public detections.

Public detections

Sequence MOTA MOTP IDF1 MT ML FP FN IDS
MOT20-04 68.8 80.4 54.4 240 82 11,020 72,756 1,732
MOT20-06 46.5 78.8 34.4 88 75 12,109 57,202 1,761
MOT20-07 76.1 81.5 61.7 79 7 1,925 5,719 271
MOT20-08 35.6 77.2 30.6 34 68 8,637 40,164 1,086
MOT20-all 58.6 79.8 46.7 441 (35.5%) 232 (18.7%) 33,691 175,841 4,850

Private detections

Sequence MOTA MOTP IDF1 MT ML FP FN IDS
MOT20-04 68.7 80.4 54.1 240 82 11,289 72,674 1,730
MOT20-06 46.0 78.7 35.3 88 75 12,947 56,915 1,801
MOT20-07 75.4 81.5 61.5 80 7 2,173 5,687 271
MOT20-08 35.1 77.1 30.8 36 67 9,550 39,617 1,145
MOT20-all 58.3 79.7 46.8 444 (35.7%) 231 (18.6%) 35,959 174,893 4,947
Table 6: Per-sequence detailed results on the MOT20 [10] test set for TransCenter trained on CrowdHuman [49] and MOT20 [10].