Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers

03/27/2021 ∙ by Tianyu Zhu, et al. ∙ Monash University

Tracking a time-varying indefinite number of objects in a video sequence over time remains a challenge despite recent advances in the field. Ignoring long-term temporal information, most existing approaches are not able to properly handle multi-object tracking challenges such as occlusion. To address these shortcomings, we present MO3TR: a truly end-to-end Transformer-based online multi-object tracking (MOT) framework that learns to handle occlusions, track initiation and termination without the need for an explicit data association module or any heuristics/post-processing. MO3TR encodes object interactions into long-term temporal embeddings using a combination of spatial and temporal Transformers, and recursively uses the information jointly with the input data to estimate the states of all tracked objects over time. The spatial attention mechanism enables our framework to learn implicit representations between all the objects and the objects to the measurements, while the temporal attention mechanism focuses on specific parts of past information, allowing our approach to resolve occlusions over multiple frames. Our experiments demonstrate the potential of this new approach, reaching new state-of-the-art results on multiple MOT metrics for two popular multi-object tracking benchmarks. Our code will be made publicly available.




1 Introduction

Figure 1: Looking beyond two frames with MO3TR: Temporal and spatial Transformers jointly pay attention to the current image $I^{t}$ and the entire embedding history of the two tracked objects (red and green, left). Detection of a previously untracked object (blue) causes initiation of a new track (left middle), while an object exiting the scene (green) leads to track termination (middle right). Embeddings encoding spatial and temporal interactions are accumulated over time to form individual object-track histories.

Visually discriminating the identity of multiple objects in a scene and creating individual tracks of their movements over time, namely multi-object tracking, is one of the basic yet most crucial vision tasks, imperative to tackle many real-world problems in surveillance, robotics/autonomous driving, health and biology. While being a classical AI problem, it is still very challenging to design a reliable multi-object tracking (MOT) system capable of tracking an unknown and time-varying number of objects moving through unconstrained environments, directly from spurious and ambiguous measurements and in presence of many other complexities such as occlusion, detection failure and data (measurement-to-objects) association uncertainty.

Early frameworks approached the MOT problem by splitting it into multiple sub-problems such as object detection, data association, track management and filtering/state prediction, each with its own set of challenges and solutions [2, 1, 6, 7, 19, 42, 51, 52]. Recently, deep learning has considerably contributed to improving the performance of multi-object tracking approaches, but surprisingly not through learning the entire problem end-to-end. Instead, the developed methods adopted the traditional problem split and mainly focused on enhancing some of the aforementioned components, such as creating better detectors [17, 38, 39, 40, 64] or developing more reliable matching objectives for associating detections to existing object tracks [22, 29, 46, 58, 59]. While this tracking-by-detection paradigm has become the de facto standard approach for MOT, it has its own limitations. Recent approaches have shown advances by considering detection and tracking as a joint learning task rather than two separate sequential problems [4, 16, 54, 67]. However, these methods often formulate the MOT task as a problem over two consecutive frames and ignore long-term temporal information, which is imperative for tackling key challenges such as track initiation, termination and occlusion handling.

In addition to the aforementioned limitations, these methods can barely be considered end-to-end multi-object frameworks, as their final outputs, the tracks, are generated through a non-learning process. For example, track initiation and termination are commonly tackled by applying different heuristics, the track assignments are decided by additional optimization methods such as the Hungarian algorithm [26] or max-flow min-cut [18], and the generated tracks may be smoothed by a process such as interpolation or filtering.


With the recent rise in popularity of Transformers [56], this rather new deep learning tool has been adapted to solve computer vision problems like object detection [9] and, concurrent to our work, been deployed in two new MOT frameworks [33, 53]. Nonetheless, they still either rely on conventional heuristics such as IoU matching [53], or formulate the problem as a two-frame task [33, 53], which makes them ill-suited to handling long-term occlusions.

In this paper, we will show that the MOT problem can be learnt end-to-end, without the use of heuristics or post-processing, addressing the key tasks like track initiation and termination, as well as occlusion handling. Our proposed method, nicknamed MO3TR, is a truly end-to-end Transformer-based online multi-object tracking method, which learns to recursively predict the state of the objects directly from an image sequence stream. Moreover, our approach encodes long-term temporal information to estimate the state of all the objects over time and does not contain an explicit data association module (Fig. 1).

Precisely speaking, MO3TR incorporates long-term temporal information by casting temporal attention over all past embeddings of each individual object, and uses this information to predict an embedding suited for the current time step. This access to longer-term temporal information beyond two frames is crucial in enabling the network to learn the difference between occlusion and termination, which is further facilitated through a specific data augmentation strategy. To factor in the influence of other objects and the visual input measurement, we refine the predicted object embedding by casting spatial attention over all identified objects in the current frame (object-to-object attention) as well as over the objects and the encoded input image (object-to-image attention).

The idea of this joint approach relates to the natural way humans perceive such scenarios: We expect certain objects to become occluded given their past trajectory and their surroundings, and predict when and where they will reappear.

To summarize, our main contributions are as follows:

  1. We introduce an end-to-end tracking approach that learns to encode longer-term information beyond two frames through temporal and spatial Transformers, and recursively predicts the states of all tracked objects

  2. We realize joint learning of object initialization, termination and occlusion handling without explicit data association and eliminate the need for heuristic post-processing

  3. MO3TR reaches new state of the art results on two popular multi-object tracking benchmarks

2 Related work

Tracking-by-detection. Tracking-by-detection treats the multi-object tracking (MOT) task as a two-stage problem. First, all objects in each frame are identified using an object detector [17, 39, 40, 64]. Detected objects are then associated over frames, resulting in tracks [6, 11]. The incorporation of appearance features and motion information has proven to be of great importance for MOT. Appearance and ReID features have been extensively utilized to improve the robustness of multi-object tracking [25, 27, 29, 44, 63]. Further, motion has been incorporated by utilizing a Kalman filter [23] to approximate the displacement of boxes between frames in a linear fashion and under the constant velocity assumption [2, 10] to associate detections [6, 59]. Recently, more complex and data-driven models have been proposed to model motion [15, 31, 66, 67] in a deterministic [37, 46] and probabilistic [15, 47, 57] manner. Graph neural networks have also been used in recent detection-based MOT frameworks to extract a reliable global feature representation from visual and/or motion cues [8, 21, 50, 55].

Despite being highly interrelated, the detection and tracking tasks are treated independently in this line of work. Further, the performance of tracking-by-detection methods relies heavily on heuristics and post-processing steps to infer track initiation and termination, handle occlusions and assign tracks.

Joint detection and tracking. The recent trend in MOT has moved from associating detections over frames to regressing the previous track locations to new locations in the current frame. [4, 16, 67] perform temporal realignment by exploiting a regression head. Although detection and tracking are not disjoint components in these works, they still suffer from some shortcomings. These works formulate the problem as detection matching between two/few frames, thus solving the problem locally and ignoring long-term temporal information. We argue that MOT is a challenging task which requires long-term temporal encoding of object dynamics to handle object initiation, termination, occlusion and tracking. Furthermore, these approaches still rely on the conventional post processing steps and heuristics to generate the tracks.

Figure 2: Overview of our MO3TR framework. Starting from the left, the temporal Transformer uses the entire embedding-based track history $\mathbb{T}^{t-1}$ to predict representative object encodings $\hat{Z}^{t}$ for the current, yet unobserved, time step $t$. The spatial Transformer then jointly considers the predictions together with a set of learnt initiation embeddings $Z_Q$ and the input image $I^{t}$ to reason about all objects in a joint manner, determining the initiation of new and termination of existing tracks. Embeddings of identified objects in $Z^{t}$ are used to regress corresponding bounding boxes describing the tracked objects, and are appended to form the track history $\mathbb{T}^{t}$ for the next frame.

Transformers for vision. Recently, Transformers [56] have been widely applied to many computer vision problems [3, 9, 35, 36], including MOT by two concurrent works [33, 53]. [53] performs multi-object tracking using a query-key mechanism which relies on heuristic post processing to generate final tracks. Trackformer [33] has been proposed as a transformer-based model which achieves joint detection and tracking by converting the existing DETR [9] object detector to an end-to-end trainable MOT pipeline. However, it still considers local information (two consecutive frames) to learn and infer tracks and ignores long-term temporal object dynamics, which are essential for effective learning of all MOT components.
This paper. To overcome all the existing limitations in the previous works, we propose an end-to-end MOT model which learns to jointly track multiple existing objects, handle their occlusion or terminate their tracks and initiate new tracks considering long-term temporal object information.

3 MO3TR

Learning an object representation that encodes both the object’s own state over time and the interaction with its surroundings is vital to allow reasoning about three key challenges present in end-to-end multiple object tracking (MOT), namely track initiation, termination and occlusion handling. In this section, we demonstrate how such a representation can be acquired and continuously updated through our proposed framework: Multi-Object TRacking using spatial TRansformers and temporal TRansformers – short MO3TR (Fig. 2). We further introduce a training paradigm to learn resolving these three challenges in a joint and completely end-to-end trainable manner. We first present an overview of our framework and introduce the notation used throughout this paper, followed by a detailed introduction of the core components.

3.1 System overview and notation

The goal of tracking multiple objects in a video sequence of $T$ frames is to retrieve an overall set of tracks $\mathbb{T}^{T}$ representing the individual trajectories of all uniquely identified objects present in at least one frame. Given the first frame $I^{0}$ at time $t_0$, our model tentatively initializes a set of tracks $\mathbb{T}^{0}$ based on all objects identified in this frame. From the next time step onward, the model aims to compute a set of embeddings $Z^{t} = \{z_1^{t}, z_2^{t}, \dots, z_M^{t}\}$ representing all $M$ objects present in the scene at time $t$ (Fig. 2). Taking in the track history $\mathbb{T}^{t-1}$ from the previous time step, we predict a set of embeddings $\hat{Z}^{t}$ for the current time step based on the past representations of all objects using temporal attention (Section 3.2). Together with a learnt set of representation queries $Z_Q$ proposing the initiation of new object tracks, these predicted object representations are processed by our first spatial attention module to reason about the interaction occurring between different objects (Section 3.3). This refined set of intermediate object representations $\tilde{Z}^{t}$ is then passed to the second spatial attention module, which takes the interaction between the objects and the scene into account by casting attention over the object embeddings and the visual information of the current frame $I^{t}$ transformed into its feature map $x^{t}$ (Section 3.3). This two-step incorporation of spatial information into the embeddings is performed iteratively over several layers, returning the final set of refined object representations $Z^{t}$.

The incorporation of temporal and spatial information into a representative embedding $z_k^{t}$ of any object $k$ at time $t$ can be summarized as a learnt function $f(\cdot)$ of the track history $\mathbb{T}^{t-1}$, the learnt set of initiation queries $Z_Q$ and the encoded image feature map $x^{t}$:

$$z_k^{t} = f\big(\mathbb{T}^{t-1}, Z_Q, x^{t}\big). \tag{1}$$

This function representation demonstrates our main objective: to enable the framework to learn the best possible way to relate the visual input to the objects’ internal states, without enforcing overly restrictive constraints or explicit data association.

The use of the resulting embeddings $Z^{t}$ in our framework is twofold. Tracking results in the form of object-specific class scores and corresponding bounding boxes for the current frame are obtained through simple classification and bounding box regression networks (Fig. 2). Further, the subset of embeddings yielding a high probability of representing an object present in the current frame is added to the track history to form the basis for the prediction performed in the next time step. Throughout the entire video sequence, new tracks representing objects that enter the scene are initialized, while previous tracks may be terminated for objects no longer present. This leads to an overall set of tracks $\mathbb{T}^{T}$ for all $N$ uniquely identified objects present in at least one frame of the video sequence of length $T$, with the life span of each track delimited by its initiation (start) and termination (end) time.

3.2 Learning long-term temporal embeddings

Discerning whether an object is not visible in a given frame due to occlusion or because it is no longer present in the scene is challenging. The fact that visual features extracted during partial or full occlusion do not describe the actual object they aim to represent increases this difficulty even further. Humans naturally reach decisions in such scenarios by considering all available information jointly. Analyzing the motion behavior of objects up to that point, we ignore frames with unhelpful information and predict how and where the object is expected to re-appear in the current frame. Intuitively, MO3TR follows a similar approach.

Our framework learns the temporal behavior of objects jointly with the rest of the model through a Transformer-based component [56] that we nickname the temporal Transformer. For any tracked object $k$ at time $t$, the temporal Transformer casts attention over all embeddings contained in the object’s track history $\mathbb{T}_k^{t-1}$ and, based on them, predicts an expected object representation $\hat{z}_k^{t}$ for the current frame. We supplement each object’s track history $\mathbb{T}_k^{t-1}$ by adding positional encodings [56] to the embeddings in the track to represent their relative time in the sequence. We denote the time-encoded track history by $\tilde{\mathbb{T}}_k^{t-1}$ and the individual positional time-encoding for time $t$ as $\tau^{t}$. Passing the request for an embedding estimate of the current time step $t$ in the form of the positional time-encoding $\tau^{t}$ as a query to the Transformer and providing $\tilde{\mathbb{T}}_k^{t-1}$ as the basis for keys and values, we retrieve the predicted object embedding

$$\hat{z}_k^{t} = \operatorname{SM}\!\left(\frac{q_{\text{tp}}(\tau^{t})\, k_{\text{tp}}\big(\tilde{\mathbb{T}}_k^{t-1}\big)^{\top}}{\sqrt{d}}\right) v_{\text{tp}}\big(\tilde{\mathbb{T}}_k^{t-1}\big), \tag{2}$$

where $\operatorname{SM}$ represents the softmax operator, $q_{\text{tp}}$, $k_{\text{tp}}$ and $v_{\text{tp}}$ are the learnt query, key and value functions of the temporal Transformer, respectively, and $d$ denotes the dimension of the object embeddings. Note that this method allows predicting embeddings for any future time step, and could thus easily be extended to further applications such as trajectory forecasting.

In other words, the predicted representation $\hat{z}_k^{t}$ of object $k$ is computed through a dynamically weighted combination of all its previous embeddings. This allows the temporal Transformer to: (i) incorporate helpful and ignore irrelevant or faulty information from previous time steps, and (ii) predict upcoming occlusions and create appropriate embeddings that focus more on conveying important positional rather than visual information. While these tasks resemble those usually performed via heuristics and manual parameter tuning during track management, MO3TR learns these dependencies end-to-end without the need for heuristics.
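The weighted combination described above can be illustrated with a minimal single-head sketch in plain Python. It is a simplification under stated assumptions, not the authors' implementation: the learnt query, key and value functions are replaced by identity maps, the query is an arbitrary vector standing in for the time-encoding of the current step, and the names `temporal_attention`, `history` and `query` are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def temporal_attention(query, history):
    """Attend with one query over all past embeddings of a single track.

    Assumption: identity projections stand in for the learnt query/key/value
    functions, so keys and values are the history embeddings themselves.
    """
    d = len(query)
    scores = [sum(q * z for q, z in zip(query, emb)) / math.sqrt(d)
              for emb in history]
    weights = softmax(scores)
    # Predicted embedding = attention-weighted sum of the past embeddings.
    predicted = [sum(w * emb[i] for w, emb in zip(weights, history))
                 for i in range(d)]
    return predicted, weights

# Toy track history: the middle embedding plays the role of an occluded frame.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
query = [2.0, 0.0]
predicted, weights = temporal_attention(query, history)
```

In this toy example the middle history entry receives the smallest weight, mirroring how attention can down-weight unhelpful past frames.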

In practice, the prediction of object representations introduced for the example of one tracked object in (2) is performed in a batched-parallel manner for the entire set of existing tracks $\mathbb{T}^{t-1}$ over multiple layers, resulting in the output set $\hat{Z}^{t}$ of the temporal Transformer that is passed as input to the spatial Transformer (Fig. 2). Note that the size of the set is dynamic and depends on the number of tracked objects. Details on how the temporal Transformer is trained are provided in Section 3.4.

3.3 Learning spatial interactions

Multiple pedestrians that are present in the same environment not only significantly influence each other’s movements, but also each other’s visual appearance by occluding one another when perceived from a fixed viewpoint. In this section, we introduce how MO3TR learns to incorporate these dependencies into the object representations. Starting from how detection and track initiation are performed within the concept of Transformers, we then detail the refinement of object embeddings by including the interaction between objects, followed by the interaction between objects and the input image.

Initiation of new tracks. For a new and previously untracked object $k$ spawning at any time $t$, a corresponding track history $\mathbb{T}_k^{t-1}$ does not yet exist and hence, no predicted embedding is passed from the temporal to the spatial Transformer (Fig. 2). To allow initiation of new tracks for such detected objects, we build upon [9] and learn a fixed set of initiation queries $Z_Q$. Intuitively, these queries learn to propose embeddings that lead the spatial Transformer to check for objects with certain properties and at certain locations in the visual input data. Importantly, these queries are considered jointly with the ones propagated from the temporal Transformer to avoid duplicate tracks.

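Interaction between tracked objects. We use self-attention [56] to capture the influence tracked objects have on each other’s motion behavior and appearance. This interaction aspect is incorporated into the object embeddings by computing an updated version of the representation set

$$\tilde{Z}^{t} = \operatorname{SM}\!\left(\frac{q_{\text{self}}(\bar{Z}^{t})\, k_{\text{self}}(\bar{Z}^{t})^{\top}}{\sqrt{d}}\right) v_{\text{self}}(\bar{Z}^{t}), \tag{3}$$

where $q_{\text{self}}$, $k_{\text{self}}$ and $v_{\text{self}}$ are all learnt functions of the concatenated object embedding set $\bar{Z}^{t} = \{\hat{Z}^{t}, Z_Q\}$, $d$ is the dimension of the embeddings and $\operatorname{SM}$ the softmax operator. Relating this approach to the classical Transformer formulation, the functions conceptually represent the queries, keys and values introduced in [56].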

Interaction between objects and the input image. The relationship between the set of objects and the image is modeled through encoder-decoder attention (also known as cross-attention) to relate all object representations to the encoded visual information of the current image (measurement). Evaluating this interaction results in the computation of a second update to the set of object representations

$$Z^{t} = \operatorname{SM}\!\left(\frac{q_{\text{cross}}(\tilde{Z}^{t})\, k_{\text{cross}}(x^{t})^{\top}}{\sqrt{d}}\right) v_{\text{cross}}(x^{t}), \tag{4}$$

where $q_{\text{cross}}$ is a learnt function of the pre-refined object embeddings $\tilde{Z}^{t}$, and $k_{\text{cross}}$ and $v_{\text{cross}}$ are learnt functions of the image embedding $x^{t}$ produced by a CNN backbone and a Transformer encoder. $\operatorname{SM}$ represents the softmax operator.

Combining interactions for refined embeddings. In practice, the two previously described update steps are performed consecutively, with (4) taking as input the result of (3), and are iteratively repeated over several layers of the Transformer architecture. This sequential incorporation of updates into the representation is inspired by DETR [9], where self-attention and cross-attention modules are similarly deployed in a sequential manner. Using both introduced concepts of object-to-object and object-to-measurement attention allows the model to globally reason about all tracked objects via their pair-wise relationships, while using the current image as context information to retrieve the final set of updated object representations $Z^{t}$.
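The alternation of object-to-object and object-to-image attention over several layers can be sketched as follows. This is a deliberately stripped-down illustration: projections are identity maps, and the residual connections, normalization and feed-forward sub-layers of a real Transformer decoder are omitted. The function names and the toy values are hypothetical.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention; identity projections are an assumption."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def refine(predicted, init_queries, image_features, num_layers=2):
    """Refine object embeddings by alternating self- and cross-attention."""
    # Predicted track embeddings and initiation queries are handled jointly.
    objects = predicted + init_queries
    for _ in range(num_layers):
        # Object-to-object interaction (self-attention).
        objects = attention(objects, objects, objects)
        # Object-to-image interaction (cross-attention over image features).
        objects = attention(objects, image_features, image_features)
    return objects

predicted = [[1.0, 0.0], [0.0, 1.0]]      # embeddings of two existing tracks
init_queries = [[0.5, 0.5]]               # one initiation query
image_features = [[0.2, 0.8], [0.9, 0.1], [0.4, 0.4]]
refined = refine(predicted, init_queries, image_features)
```

The output keeps one embedding per input slot, so existing tracks and candidate initiations stay aligned with their inputs throughout the layers.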

Updating the track history. After each frame is processed by the entire framework, the final set of embeddings $Z^{t}$ of objects identified to be present in the frame is added to the track history $\mathbb{T}^{t-1}$, creating the basis for the next prediction of embeddings by the temporal Transformer (Fig. 2). We consistently append new embeddings from the right-hand side, followed by right-aligning the entire set of embeddings. Because tracks of different objects have different lengths, this procedure aligns embeddings representing identical time steps, which we found to help stabilize training and improve the inference of the temporal Transformer (Table 4).
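A small sketch of the append-then-right-align bookkeeping is given below. The dictionary layout, the `None` padding value and the accompanying validity mask are illustrative assumptions; the text only states that embeddings are appended on the right and the set is right-aligned.

```python
def append_frame(tracks, new_embeddings):
    """Append the current frame's embeddings to the right of each track."""
    for track_id, embedding in new_embeddings.items():
        tracks.setdefault(track_id, []).append(embedding)
    return tracks

def right_align(tracks, pad=None):
    """Left-pad shorter tracks so the last column is the most recent frame."""
    if not tracks:
        return {}, {}
    max_len = max(len(history) for history in tracks.values())
    aligned, mask = {}, {}
    for track_id, history in tracks.items():
        n_pad = max_len - len(history)
        aligned[track_id] = [pad] * n_pad + list(history)
        # True marks real embeddings, False marks padding (assumed convention).
        mask[track_id] = [False] * n_pad + [True] * len(history)
    return aligned, mask

# Track 1 exists for three frames; track 2 is initiated in the latest frame.
tracks = {1: ["z1_t0", "z1_t1"]}
tracks = append_frame(tracks, {1: "z1_t2", 2: "z2_t2"})
aligned, mask = right_align(tracks)
```

After alignment, the same column index refers to the same time step for every track, which is the property the paragraph above relies on.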

3.4 Training MO3TR

The training procedure of MO3TR (Fig. 2) is composed of two key tasks: (i) creating a set of suitable tracklets that can be used as input $\mathbb{T}^{t-1}$ to the temporal Transformer, and (ii) assigning the predicted set of $M$ output embeddings $Z^{t}$ to corresponding ground truth labels of the training set, and applying a corresponding loss to facilitate training. With the number of output embeddings being by design larger than the number of objects in the scene, each embedding is matched either with a trackable object or with the background class.

Constructing the input tracklet set. The input to the model at any given time $t$ is defined as the track history $\mathbb{T}^{t-1}$ and the current image $I^{t}$. To construct a corresponding $\mathbb{T}^{t-1}$ for any $I^{t}$ sampled from the dataset during training, we first extract the ordered set of $n$ directly preceding images $\{I^{t-n}, \dots, I^{t-1}\}$ from the training sequence. Passing these images without track history to MO3TR causes the framework to perform track initiation for all identified objects in each frame by using the trainable embeddings $Z_Q$, returning an ordered set of output embedding sets $\{Z^{t-n}, \dots, Z^{t-1}\}$. Each output embedding set $Z^{i}$ contains a variable number of $M_i$ embeddings representing objects in the respective frame $I^{i}$. We use multilayer perceptrons (MLPs) to extract corresponding bounding boxes $\hat{b}_j$ and class scores $\hat{p}_j$ from each of these object embeddings $z_j^{i}$, resulting in a set of $M_i$ object-specific pairs denoted as $\hat{y}^{i} = \{(\hat{b}_j, \hat{p}_j)\}_{j=1}^{M_i}$ for each frame $I^{i}$. The pairs are then matched with the ground truth $y^{i} = \{(b_j, c_j)\}$ of the respective frame by computing a bipartite matching [9] between these sets. The permutation $\hat{\sigma}$ of the $M_i$ predicted elements with the lowest pair-wise matching cost $\mathcal{C}_{\text{match}}$ is determined by solving the assignment problem

$$\hat{\sigma} = \arg\min_{\sigma} \sum_{j} \mathcal{C}_{\text{match}}\big(y_j, \hat{y}_{\sigma(j)}\big) \tag{5}$$

through the Hungarian algorithm [26], with the matching cost taking both the probability of correct class prediction $\hat{p}_{\sigma(j)}(c_j)$ and bounding box similarity into account:

$$\mathcal{C}_{\text{match}}\big(y_j, \hat{y}_{\sigma(j)}\big) = -\hat{p}_{\sigma(j)}(c_j) + \mathcal{C}_{\text{bbox}}\big(b_j, \hat{b}_{\sigma(j)}\big). \tag{6}$$

We follow [9] and use a linear combination of the L1 distance and the scale-invariant generalized intersection over union [41] cost $\mathcal{C}_{\text{iou}}$ to mitigate any possible scale issues arising from different box sizes. The resulting bounding box cost with weights $\alpha_1, \alpha_2$ is then defined as

$$\mathcal{C}_{\text{bbox}}\big(b_j, \hat{b}_{\sigma(j)}\big) = \alpha_1 \big\lVert b_j - \hat{b}_{\sigma(j)} \big\rVert_1 + \alpha_2\, \mathcal{C}_{\text{iou}}\big(b_j, \hat{b}_{\sigma(j)}\big). \tag{7}$$

The identified minimum cost matching between the output and ground truth sets is used to assign all embeddings classified as objects their respective identities annotated in the ground truth labels. The objects of all $n$ frames are accumulated, grouped by their assigned identities and sorted in time-ascending order to form the overall set of previous object tracks $\mathbb{T}^{t-1}$ serving as input to our model.
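The matching step described above can be sketched on a toy example. Two things are swapped in for brevity and should be read as assumptions: an exhaustive search over permutations replaces the Hungarian algorithm (it finds the same optimum, but only scales to a handful of boxes), and the generalized IoU enters as the cost `1 - GIoU`. The weights `w_l1` and `w_giou`, the corner box format `(x1, y1, x2, y2)` and all function names are illustrative.

```python
from itertools import permutations

def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull

def box_cost(pred_box, gt_box, w_l1=5.0, w_giou=2.0):
    """Weighted sum of L1 distance and a GIoU-based cost (weights assumed)."""
    l1 = sum(abs(p - g) for p, g in zip(pred_box, gt_box))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred_box, gt_box))

def match_cost(prediction, gt_box):
    """Negative object probability plus box cost for one candidate pair."""
    prob, pred_box = prediction
    return -prob + box_cost(pred_box, gt_box)

def best_assignment(predictions, gt_boxes):
    """Brute-force minimum-cost assignment; needs len(predictions) >= len(gt_boxes)."""
    best, best_total = None, float("inf")
    for perm in permutations(range(len(predictions)), len(gt_boxes)):
        total = sum(match_cost(predictions[j], gt_boxes[i])
                    for i, j in enumerate(perm))
        if total < best_total:
            best, best_total = perm, total
    return best, best_total

predictions = [(0.9, (0.5, 0.5, 0.9, 0.9)),
               (0.8, (0.1, 0.1, 0.4, 0.4)),
               (0.1, (0.0, 0.0, 1.0, 1.0))]
gt_boxes = [(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9)]
assignment, total_cost = best_assignment(predictions, gt_boxes)
```

Here `assignment[i]` is the index of the prediction matched to ground truth box `i`; any prediction left out of the assignment would be treated as background.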

Losses. Given the created input set of tracks $\mathbb{T}^{t-1}$ and the image $I^{t}$, MO3TR predicts an output set of object embeddings $Z^{t} = \{z_1^{t}, \dots, z_M^{t}\}$ at time $t$. As before, we extract bounding boxes and class scores for each embedding in the set. However, embeddings that possess a track history already have unique identities associated with them and are thus directly matched with the respective ground truth elements. Only newly initiated embeddings without track history are then matched with the remaining unassigned ground truth labels as previously described. Elements that could not be matched are assigned the background class. Finally, we re-use (6) and (7) and apply them as our loss to the matched elements of the output set.

Data augmentation. Most datasets are highly imbalanced regarding the occurrence of occlusion, initiation and termination scenarios. To facilitate learning of correct tracking behaviour, we propose to mitigate the imbalance problem by modelling similar effects through augmentation:

  1. We randomly drop a certain number of embeddings in the track history to simulate cases where the object could not be identified for some frames, aiming to increase robustness. If the most recent embedding is dropped, the model can learn to re-identify objects.

  2. Random false positive examples are inserted into the history to simulate false detection and faulty appearance information due to occlusion. This aims for the model to learn ignoring unsuited representations through its attention mechanism.

  3. We randomly select the sequence length used to create the track history during training to increase the model’s capability to deal with varying track lengths.

The importance of these augmentations is demonstrated in Section 4.3 and Table 4.
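The three augmentations listed above can be sketched as a single function operating on one track history. The drop and false-positive probabilities, the Gaussian noise used to fabricate a faulty embedding, and the function name are hypothetical; only the maximum history length of 30 frames is taken from the implementation details in Section 4.1.

```python
import random

def augment_history(history, rng, max_len=30, p_drop=0.2, p_false=0.1):
    """Apply random length, random drops and random false positives to a track."""
    if not history:
        return []
    dim = len(history[0])
    # (3) Randomly choose how many of the most recent frames to keep.
    length = rng.randint(1, min(max_len, len(history)))
    recent = history[-length:]
    augmented = []
    for embedding in recent:
        # (1) Drop the embedding to simulate a frame where the object was missed.
        if rng.random() < p_drop:
            continue
        # (2) Replace it with noise to simulate a false or occluded detection.
        if rng.random() < p_false:
            augmented.append([rng.gauss(0.0, 1.0) for _ in range(dim)])
        else:
            augmented.append(embedding)
    return augmented

track = [[float(t), 0.0] for t in range(40)]
noisy = augment_history(track, random.Random(0))
clean = augment_history(track, random.Random(1), p_drop=0.0, p_false=0.0)
```

With both probabilities set to zero the function only varies the history length, which makes it easy to check each augmentation in isolation. Note that the result can be empty if every selected embedding happens to be dropped.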

4 Experiments

| Method | MOTA | IDF1 | MT (%) | ML (%) | FP | FN | ID Sw. |
|---|---|---|---|---|---|---|---|
| EAMTT [48] | 38.8 | 42.4 | 7.9 | 49.1 | 8,114 | 102,452 | 965 |
| DMAN [68] | 46.1 | 54.8 | 17.4 | 42.7 | 7,909 | 89,874 | 532 |
| AMIR [46] | 47.2 | 46.3 | 14.0 | 41.6 | 2,681 | 92,856 | 774 |
| MOTDT17 [32] | 47.6 | 50.9 | 15.2 | 38.3 | 9,253 | 85,431 | 792 |
| STRN [61] | 48.5 | 53.9 | 17.0 | 34.9 | 9,038 | 84,178 | 747 |
| UMA [65] | 50.5 | 52.8 | 17.8 | 33.7 | 7,587 | 81,924 | 685 |
| Tracktor++ [4] | 54.4 | 52.5 | 19.0 | 36.9 | 3,280 | 79,149 | 682 |
| Tracktor++v2 [4] | 56.2 | 54.9 | 20.7 | 35.8 | 2,394 | 76,844 | 617 |
| DeepMOT-T [62] | 54.8 | 53.4 | 19.1 | 37.0 | 2,955 | 78,765 | 645 |
| MO3TR (Res50) | 64.2 | 60.6 | 31.6 | 18.3 | 7,620 | 56,761 | 929 |

Table 1: Results on the MOT16 benchmark [34] test set using public detections. Bold and underlined numbers indicate best and second best result, respectively. More detailed results of our approach are provided in the supplementary material.

In this section, we demonstrate the performance of MO3TR by comparing against other multi-object tracking methods on popular MOT benchmarks (https://motchallenge.net/) and evaluate different aspects of our contribution in detailed ablation studies. We further provide implementation and training details.

Datasets. We use the MOT16 and MOT17 [34] datasets from the MOTchallenge benchmarks to evaluate and compare MO3TR with other state of the art models. Both datasets contain seven training and test sequences each, capturing crowded indoor or outdoor areas via moving and static cameras from various viewpoints. Pedestrians are often heavily occluded by other pedestrians or background objects, making identity-preserving tracking challenging. Three sets of public detections are provided with MOT17 (DPM [17], FRCNN [40] and SDP [64]), and one with MOT16 (DPM). For ablation studies, we combine sequences of the new MOT20 benchmark [13] and 2DMOT15 [30] to form a diverse validation set covering both indoor and outdoor scenes at various pedestrian density levels.

Evaluation metrics. To evaluate our model and other MOT methods, we use standard metrics recognized by the tracking community [5, 43]. The two main metrics are the MOT Accuracy (MOTA) and Identity F1 Score (IDF1). MOTA focuses more on object coverage while the consistency of assigned identities is measured by IDF1. We further report False Positives (FP), False Negatives (FN), Mostly Tracked (MT) and Mostly Lost (ML). Further details of these metrics are provided in the supplementary material.
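For orientation, MOTA can be computed from the error counts reported in the result tables using the standard CLEAR MOT definition, one minus the ratio of all errors to the number of ground truth boxes. The sketch below makes two assumptions that are not stated in the text: the last numeric column of the MO3TR rows in Tables 1 and 2 is read as the number of identity switches, and the ground truth totals of the MOT16 and MOT17 test sets are taken to be 182,326 and 564,228 boxes.

```python
def mota(false_positives, false_negatives, id_switches, num_gt_boxes):
    """Multi-object tracking accuracy (CLEAR MOT definition)."""
    errors = false_positives + false_negatives + id_switches
    return 1.0 - errors / num_gt_boxes

# MO3TR rows of Tables 1 and 2; ground truth totals are assumed values.
mota_mot16 = 100.0 * mota(7_620, 56_761, 929, 182_326)
mota_mot17 = 100.0 * mota(21_966, 182_860, 2_841, 564_228)
```

Under these assumptions the two values round to 64.2 and 63.2, in agreement with the MOTA entries reported for MO3TR.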

4.1 Implementation details of MO3TR

We employ a multi-stage training concept to train MO3TR end-to-end. First, our ImageNet [45] pretrained ResNet50 [20] backbone is trained, together with the encoder and spatial Transformers, on a combination of the CrowdHuman [49], ETH [14] and CUHK-SYSU [60] datasets for 300 epochs on a pedestrian detection task. This training procedure is similar to DETR [9]. Afterwards, we engage our temporal Transformer and train the entire model end-to-end using the MOT17 dataset for another 300 epochs. The initial learning rate for both training tasks is 1e-4, and is dropped by a factor of 10 every 100 epochs. The relative weights of our loss are the same as in DETR [9], and the number of initiation queries is 100. The input sequence length representing object track histories varies randomly from 1 to 30 frames. To enhance the learning of temporal encoding, we predict 10 future frames instead of one and compute the total loss. We train our model using 4 GTX 1080 Ti GPUs with 11GB memory each. Note that these computational requirements are significantly lower than for other recently published approaches in this field. We expect the performance of our model to increase further with bigger backbones and longer sequence lengths, as well as an increased number of objects per frame.
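The step schedule stated above (initial rate 1e-4, divided by 10 every 100 epochs, 300 epochs per stage) can be written as a one-line function. The function name and the zero-based epoch indexing are assumptions made for illustration.

```python
def learning_rate(epoch, base_lr=1e-4, drop_every=100, factor=10.0):
    """Step decay: divide the base rate by `factor` every `drop_every` epochs."""
    return base_lr / (factor ** (epoch // drop_every))

# Learning rate for every epoch of one 300-epoch training stage.
schedule = [learning_rate(epoch) for epoch in range(300)]
```

With zero-based indexing, epochs 0 to 99 use 1e-4, epochs 100 to 199 use 1e-5, and epochs 200 to 299 use 1e-6.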

Public detection. We evaluate the tracking performance using the public detections provided by the MOTChallenge. As an embedding-based method, MO3TR cannot directly produce tracks from these detections. We therefore follow [33, 67] in filtering our initiations by the public detections using bounding box center distances, and only allow the initiation of tracks that are matched to a public detection.
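One possible reading of this filtering step is sketched below: a proposed track initiation is kept only if its box center lies within some distance of the center of a public detection. The exact matching rule and the threshold are not given in the text, so `max_dist`, the corner box format and the function names are hypothetical.

```python
import math

def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def filter_initiations(new_boxes, public_boxes, max_dist):
    """Keep proposed initiations whose center is close to a public detection."""
    public_centers = [center(box) for box in public_boxes]
    kept = []
    for box in new_boxes:
        cx, cy = center(box)
        if any(math.hypot(cx - px, cy - py) <= max_dist
               for px, py in public_centers):
            kept.append(box)
    return kept

new_boxes = [(0.0, 0.0, 10.0, 10.0), (100.0, 100.0, 120.0, 120.0)]
public_boxes = [(1.0, 1.0, 11.0, 11.0)]
kept = filter_initiations(new_boxes, public_boxes, max_dist=5.0)
```

In this toy case only the first proposal survives, because the second has no public detection near its center.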

4.2 Comparison with the state of the art

| Method | MOTA | IDF1 | MT (%) | ML (%) | FP | FN | ID Sw. |
|---|---|---|---|---|---|---|---|
| GMPHD [28] | 39.6 | 36.6 | 8.8 | 43.3 | 50,903 | 284,228 | 5,811 |
| EAMTT [48] | 42.6 | 41.8 | 12.7 | 42.7 | 30,711 | 288,474 | 4,488 |
| SORT17 [6] | 43.1 | 39.8 | 12.5 | 42.3 | 28,398 | 287,582 | 4,852 |
| DMAN [68] | 48.2 | 55.7 | 19.3 | 38.3 | 26,218 | 263,608 | 2,194 |
| MOTDT17 [32] | 50.9 | 52.7 | 17.5 | 35.7 | 24,069 | 250,768 | 2,474 |
| STRN [61] | 50.9 | 56.5 | 20.1 | 37.0 | 27,532 | 246,924 | 2,593 |
| jCC [24] | 51.2 | 54.5 | 20.9 | 37.0 | 25,937 | 247,822 | 1,802 |
| DeepMOT-T [62] | 53.7 | 53.8 | 19.4 | 36.6 | 11,731 | 247,447 | 1,947 |
| FAMNet [12] | 52.0 | 48.7 | 19.1 | 33.4 | 14,138 | 253,616 | 3,072 |
| UMA [65] | 53.1 | 54.4 | 21.5 | 31.8 | 22,893 | 239,534 | 2,251 |
| Tracktor++ [4] | 53.5 | 52.3 | 19.5 | 36.6 | 12,201 | 248,047 | 2,072 |
| Tracktor++v2 [4] | 56.5 | 55.1 | 21.1 | 35.3 | 8,866 | 248,047 | 3,763 |
| CenterTrack [67] | 61.5 | 59.6 | 26.4 | 31.9 | 14,076 | 200,672 | 2,583 |
| Trackformer [33] | 61.8 | 59.8 | 35.4 | 21.1 | 35,226 | 177,270 | 2,982 |
| MO3TR (Res50) | 63.2 | 60.2 | 31.9 | 19.2 | 21,966 | 182,860 | 2,841 |

Table 2: Results on the MOT17 benchmark [34] test set using public detections. Bold and underlined numbers indicate best and second best result, respectively. More detailed results of our approach are provided in the supplementary material.

We evaluate MO3TR on the challenging MOT16 [34] and MOT17 benchmark test datasets using the provided public detections and report our results in Tables 1 and 2, respectively. Despite not using any heuristic track management to filter or post-process, we outperform most competing methods and achieve new state of the art results on both datasets regarding MOTA, IDF1 and ML metrics, and set a new benchmark for MT and FN on MOT16.

As shown by its state-of-the-art IDF1 scores on both datasets, MO3TR is capable of identifying objects and maintaining their identities over long parts of the track, in many cases for more than 80% of the objects’ lifespans, as evidenced by the very high MT results. Access to the track history through the temporal Transformers, together with joint reasoning over existing tracks, initiations and the input data through the spatial Transformers, helps MO3TR learn to discern occlusion from termination. The framework is thus capable of avoiding false terminations, as evidenced by the very low FN and record low ML numbers achieved on both MOT datasets. These values further indicate that MO3TR learns to fill in gaps due to missed detections or occlusions, which additionally helps to reduce FN and ML while increasing IDF1 and MT. Its joint reasoning over the available information helps MO3TR to considerably reduce failed track initiations (FN) while keeping incorrect track initiations (FP) at reasonably low levels. The combination of superior IDF1, very low FN and reasonable FP allows MO3TR to reach new state-of-the-art MOTA results on both the MOT16 (Table 1) and MOT17 (Table 2) datasets.

4.3 Ablation studies

History length MOTA IDF1 MT ML FP FN
1 55.4 48.4 114 19 4,700 12,898
10 56.8 49.0 115 18 4,245 12,805
20 57.8 50.1 115 19 3,826 12,787
30 58.9 50.6 114 20 3,471 12,692
Table 3: Effect of varying lengths of track history considered in the temporal Transformers during evaluation.
Training Strategies MOTA IDF1 FP FN
Naive (Two Frames) 12.2 22.1 7,905 26,848
FN (Two Frames) 14.6 42.0 22,609 11,671
FN+RA (Two Frames) 28.4 42.5 16,749 11,940
FN+RA+FP (Two Frames) 55.4 48.4 3,927 17,912
FN 21.9 42.5 19,353 11,693
FN+RA 39.2 48.1 12,265 12,002
FN+RA+FP 58.9 50.6 3,471 12,692
Table 4: Effect of different training (two frames vs. 30) and augmentation strategies: False Negatives (FN), False Positives (FP), Right-Aligned insertion (RA).

Figure 3: Qualitative results of two challenging occlusion scenarios in the validation set. Objects of focus are highlighted with slightly thicker bounding boxes. Unlike Tracktor++v2 [4], our proposed MO3TR is capable of retaining the identity and keeping track even if the object is severely occluded.

In this section, we evaluate different components of MO3TR on our validation set using private detections and show the individual contributions of the key components and strategies to facilitate learning.

Effect of track history length. The length of the track history describes the maximum number of embeddings from all the previous time steps of a certain identified object that our temporal Transformer has access to. To avoid overfitting to any particular history length that might be dominant in the dataset but not actually represent the most useful source of information, we specifically train our model with input track histories of varying and randomly chosen lengths. It is important to note that if the maximum track history length is set to one, the method practically degenerates to a two-frame joint detection and tracking method such as Trackformer [33]. Our results reported in Table 3, however, show that incorporating longer-term information is crucial to improve end-to-end tracking. Both MOTA and IDF1 improve consistently while FP decreases when longer-term history, i.e., information from more previous frames, is taken into account. This trend is also clearly visible throughout the evaluation of our training strategies presented in Table 4, which is further discussed in the following.
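Training with randomly chosen history lengths can be sketched as below. The uniform sampling distribution, the function name `sample_history` and the list-based representation of embeddings are assumptions. The text only states that the length varies randomly between 1 and 30 frames.

```python
import random

MAX_HISTORY = 30  # maximum track history length reported in the text


def sample_history(track_embeddings, rng, max_history=MAX_HISTORY):
    """Return the k most recent embeddings of a track.

    k is drawn uniformly from 1..max_history (assumed distribution). If the
    track is shorter than k, all available embeddings are returned.
    """
    k = rng.randint(1, max_history)
    return track_embeddings[-k:]


if __name__ == "__main__":
    rng = random.Random(0)
    # Stand-in for 50 per-frame embeddings of one object.
    track = [f"emb_{t}" for t in range(50)]
    for _ in range(3):
        history = sample_history(track, rng)
        print(len(history), history[-1])
```

With `max_history=1` the sampled history always contains a single embedding, which corresponds to the two-frame setting mentioned above.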

Training strategies. MOT datasets are highly imbalanced when it comes to the occurrence of initialization and termination examples compared to normal propagation, making it nearly impossible for models to naturally learn the initiation of new tracks or the termination of no longer existing ones when trained in a naive way. As presented in Table 4, naive training without any augmentation shows almost double the number of false negatives (FN) compared to augmented approaches, basically failing to initiate tracks properly. Augmenting with FN as discussed in Section 3.4 shows significant improvements for both two-frame and longer-term methods. Additionally right-aligning the track history generally helps to stabilize training and greatly reduces false positives. Finally, augmenting with false positives is the most challenging to implement but crucial. As the results demonstrate, it significantly reduces false positives by helping the network to properly learn the termination of tracks.
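One plausible reading of right-aligned insertion (RA) is sketched below. The details of RA are given in Section 3.4 and are not reproduced here, so the padding value, the mask convention and the function name `right_align` are all assumptions: the most recent embedding always occupies the last slot of a fixed-length buffer, and unused slots on the left are marked as empty.

```python
def right_align(history, max_len, pad=None):
    """Place a track history at the right end of a fixed-length buffer.

    Returns (slots, mask). mask[i] is True where slot i is padding, i.e.
    where no embedding exists for that position (assumed convention).
    """
    recent = list(history[-max_len:])
    n_pad = max_len - len(recent)
    slots = [pad] * n_pad + recent
    mask = [True] * n_pad + [False] * len(recent)
    return slots, mask


if __name__ == "__main__":
    # A short track of two embeddings in a buffer of length four.
    print(right_align(["e1", "e2"], max_len=4))
    # A long track is cut to its most recent entries.
    print(right_align(["e1", "e2", "e3", "e4", "e5"], max_len=3))
```

Under this convention the newest embedding sits at the same buffer position regardless of track age, which is one way such an alignment could make the input layout more regular during training.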

Figure 4: Temporal attention maps averaged over 100 randomly selected objects from the MOT20 dataset [13]. The vertical axis represents the maximum track history length, the horizontal axis the different embedding positions in the history. The displayed attention relates the query at the current time step to all the previous embeddings. Every row sums up to 1.

Analysing temporal attention. To provide some insight into the complex and highly non-linear working principle of our temporal Transformers, we visualize the attention weights over the temporal track history for different track history lengths, averaged over 100 randomly picked objects in our validation set (Fig. 4). Results for the first layer clearly show most attention being paid to several of the most recent frames, decreasing with increasing frame distance. The second and third layers are harder to interpret due to the increasing non-linearity, and the model starts to increasingly cast attention over more distant frames. It is important to note that even if an embedding is not available at a given time step, the model can still choose to pay attention to that slot and use the non-existence for reasoning.
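The property that every row of an attention map sums to 1 follows from the softmax normalization. The sketch below uses standard single-head scaled dot-product attention [56] in plain Python. It does not reproduce the actual temporal Transformer (which may use multiple heads and learned projections), and the function names are hypothetical.

```python
import math


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over several keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_rows(rows):
    """Average attention rows of equal length position by position."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


if __name__ == "__main__":
    query = [1.0, 0.0]
    history = [[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]  # oldest to newest
    weights = attention_weights(query, history)
    print(weights, sum(weights))
    # Averaging rows from several objects keeps the row sum at 1.
    print(sum(average_rows([weights, weights[::-1]])))
```

Because each per-object row sums to 1, a map averaged over many objects retains this property.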

5 Conclusion

We presented MO3TR, a truly end-to-end multi-object tracking framework that uses temporal Transformers to encode the history of objects while employing spatial Transformers to encode the interaction between objects and the input data, allowing it to handle occlusions, track termination and initiation. Demonstrating the advantages of long-term temporal learning, we set new state-of-the-art results regarding multiple metrics on the popular MOT16 and MOT17 benchmarks.

Appendix A Experiments

In this section, we provide details on the evaluation metrics used throughout the main paper, as well as detailed results for all sequences on the MOT16 and MOT17 benchmarks [34].

A.1 Evaluation metrics

To evaluate MO3TR and compare its performance to other state-of-the-art tracking approaches, we use the standard set of metrics proposed in [5, 43]. Analyzing the detection performance, we provide detailed insights regarding the total number of false positives (FP) and false negatives (FN, missed targets). The mostly tracked targets (MT) measure describes the ratio of ground-truth trajectories that are covered for at least 80% of the track’s life span, while mostly lost targets (ML) represents the ones covered for at most 20%. The number of identity switches is denoted by IDs. The two most commonly used metrics to summarize the tracking performance are the multiple object tracking accuracy (MOTA), and the identity F1 score (IDF1). MOTA combines the measures for the three error sources of false positives, false negatives and identity switches into one compact measure, and a higher MOTA score implies better performance of the respective tracking approach. The IDF1 represents the ratio of correctly identified detections over the average number of ground-truth and overall computed detections.
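The metric definitions above can be written out as a short sketch. The function names and the label "partial" for trajectories that are neither MT nor ML are assumptions. The ground-truth box count of 564,228 used in the example is not stated in this text: it is assumed from the MOT17 test set statistics and is used only to illustrate how the "MOT17 All" row of Table A2 relates to its MOTA value.

```python
def mota(fp, fn, id_switches, num_gt):
    """MOTA: one minus the summed error counts over the number of GT boxes."""
    return 1.0 - (fp + fn + id_switches) / num_gt


def idf1(idtp, idfp, idfn):
    """IDF1: correctly identified detections over the average number of
    ground-truth (idtp + idfn) and computed (idtp + idfp) detections."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def track_status(covered_frames, total_frames):
    """Classify a ground-truth trajectory by the fraction of its life span
    that is covered by the tracker."""
    ratio = covered_frames / total_frames
    if ratio >= 0.8:
        return "MT"       # mostly tracked
    if ratio <= 0.2:
        return "ML"       # mostly lost
    return "partial"      # neither MT nor ML (label is an assumption)


if __name__ == "__main__":
    # FP, FN and IDs from the "MOT17 All" row of Table A2.
    # num_gt is an assumed value, not taken from this text.
    score = mota(fp=21_966, fn=182_860, id_switches=2_841, num_gt=564_228)
    print(f"MOTA = {100 * score:.2f}")
    print(track_status(covered_frames=85, total_frames=100))
```

All reported numbers in the paper come from the official evaluation code; this sketch only mirrors the definitions.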

All reported results are computed by the official evaluation code of the MOTChallenge benchmark (https://motchallenge.net).

Sequence Public detector MOTA IDF1 MT ML FP FN IDs
MOT16-01 DPM [17] 62.83 57.29 7 1 99 2,251 27
MOT16-03 DPM 73.75 67.66 77 15 2,850 24,435 166
MOT16-06 DPM 59.52 58.74 81 41 1,083 3,442 146
MOT16-07 DPM 57.90 51.02 14 8 468 6,286 118
MOT16-08 DPM 44.69 39.91 13 13 489 8,610 158
MOT16-12 DPM 48.92 59.86 26 17 1,382 2,806 49
MOT16-14 DPM 43.49 44.88 22 44 1,249 8,931 265
MOT16 All 64.18 60.59 240 139 7,620 56,761 929
Table A1: Detailed MO3TR results on each individual sequence of the MOT16 benchmark [34] test set using public detections. Following other works, we use the public detection filtering method using center distances as proposed by [4].

A.2 Evaluation results

The public results for the MOT16 [34] benchmark presented in the experiment section of the main paper show the overall result of MO3TR on the benchmark’s test dataset using the provided public detections (DPM [17]). Detailed results for all individual sequences are presented in Table A1. Similarly, the individual results for all sequences of the MOT17 benchmark [34], comprising three different sets of provided public detections (DPM [17], FRCNN [40] and SDP [64]), are detailed in Table A2. Further information regarding the metrics used is provided in Section A.1.

Sequence Public detector MOTA IDF1 MT ML FP FN IDs
MOT17-01 DPM [17] 62.31 57.00 6 2 98 2,306 27
MOT17-03 DPM 73.82 67.73 78 15 2,773 24,469 167
MOT17-06 DPM 60.51 58.81 81 39 950 3,555 148
MOT17-07 DPM 56.70 50.47 13 12 402 6,793 120
MOT17-08 DPM 38.00 35.42 13 26 206 12,723 168
MOT17-12 DPM 49.34 59.72 28 20 1,268 3,073 50
MOT17-14 DPM 43.49 44.88 22 44 1,249 8,931 265
MOT17-01 FRCNN [40] 60.37 53.92 7 4 109 2,419 28
MOT17-03 FRCNN 73.88 68.13 75 15 3,036 24,148 161
MOT17-06 FRCNN 61.98 61.07 95 25 1,170 3,148 162
MOT17-07 FRCNN 56.70 50.63 12 12 402 6,794 118
MOT17-08 FRCNN 36.48 35.17 12 31 149 13,121 149
MOT17-12 FRCNN 50.96 61.36 26 26 970 3,239 41
MOT17-14 FRCNN 43.67 44.08 24 42 1,349 8,790 272
MOT17-01 SDP [64] 66.76 57.60 7 3 94 2,022 28
MOT17-03 SDP 74.07 68.15 80 15 3,373 23,606 163
MOT17-06 SDP 61.91 61.21 93 24 1,163 3,176 150
MOT17-07 SDP 57.38 50.24 13 12 407 6,672 120
MOT17-08 SDP 38.62 36.32 13 27 235 12,553 177
MOT17-12 SDP 50.43 59.66 29 20 1,200 3,043 53
MOT17-14 SDP 46.35 45.94 24 38 1,363 8,279 274
MOT17 All 63.19 60.15 751 452 21,966 182,860 2,841
Table A2: Detailed MO3TR results on each individual sequence of the MOT17 benchmark [34] test set using public detections. Following other works, we use the public detection filtering method using center distances as proposed by [4].

Appendix B Data association as auxiliary task

In the introduction of the main paper, we introduce the idea that our proposed MO3TR performs tracking without any explicit data association module. To elaborate on what we mean by that and how multi-object tracking (MOT) without an explicitly formulated data association task is feasible, we would like to reconsider the actual definition of the MOT problem: finding a mapping from any given input data, such as an image sequence stream, to the output data, a set of object states over time. In any learning scheme, given a suitable learning model, this mapping function can theoretically be learned without the requirement for solving any additional auxiliary task, as long as the provided inputs and outputs are clearly defined. The firmly established task of data association, e.g., a minimum cost assignment (using the Hungarian algorithm) between detections and objects, is nothing more than such an auxiliary task, originally created to solve tracking based on tracking-by-detection paradigms. An end-to-end learning model, however, can learn to infer implicit correspondences and thus renders the explicit formulation of this task obsolete.
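For contrast, the explicit auxiliary task that tracking-by-detection pipelines solve, and that MO3TR does not formulate, can be sketched as a minimum cost assignment between existing objects and new detections. The sketch uses brute-force enumeration in place of the Hungarian algorithm [26] to stay dependency-free, so it is only suitable for very small problems. The cost values and the function name are hypothetical.

```python
import math
from itertools import permutations


def min_cost_assignment(cost):
    """Brute-force minimum cost one-to-one assignment of rows to columns.

    cost[r][c] is the cost of assigning object r to detection c. Requires
    len(cost) <= len(cost[0]). Returns (pairs, total_cost). The Hungarian
    algorithm finds the same optimum in polynomial time.
    """
    n_rows, n_cols = len(cost), len(cost[0])
    best_cols, best_cost = None, math.inf
    for cols in permutations(range(n_cols), n_rows):
        total = sum(cost[r][c] for r, c in enumerate(cols))
        if total < best_cost:
            best_cols, best_cost = cols, total
    return list(enumerate(best_cols)), best_cost


if __name__ == "__main__":
    # Rows: two tracked objects. Columns: three detections in the new frame.
    cost = [
        [0.9, 0.1, 0.8],
        [0.2, 0.7, 0.6],
    ]
    pairs, total = min_cost_assignment(cost)
    print(pairs, total)
```

In a tracking-by-detection pipeline the costs would come from a hand-designed or learned affinity measure, which is exactly the component an end-to-end model no longer needs to define explicitly.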

Precisely speaking, our end-to-end tracking model learns to relate the visual input information to the internal states of the objects via a self-supervised attention scheme. We realize this through using a combination of Transformers [56] to distill the available spatial and temporal information into representative object embeddings (the object states), making the explicit formulation of any auxiliary data association strategy unnecessary.


  • [1] A. Andriyenko, K. Schindler, and S. Roth (2012) Discrete-continuous optimization for multi-target tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1926–1933. Cited by: §1.
  • [2] A. Andriyenko and K. Schindler (2011) Multi-target tracking by continuous energy minimization. In CVPR, Cited by: §1, §2.
  • [3] I. Bello, B. Zoph, Q. Le, A. Vaswani, and J. Shlens (2019) Attention augmented convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3285–3294. Cited by: §2.
  • [4] P. Bergmann, T. Meinhardt, and L. Leal-Taixe (2019) Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951. Cited by: Table A1, Table A2, §1, §2, Figure 3, Figure 3, Table 1, Table 2.
  • [5] K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10. Cited by: §A.1, §4.
  • [6] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In 2016 IEEE international conference on image processing (ICIP), pp. 3464–3468. Cited by: §1, §2, Table 2.
  • [7] S. S. Blackman and R. Popoli (1999) Design and analysis of modern tracking systems. Artech House radar library, Artech House. Cited by: §1.
  • [8] G. Brasó and L. Leal-Taixé (2020) Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6247–6257. Cited by: §2.
  • [9] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229. Cited by: §1, §2, §3.3, §3.3, §3.4, §4.1.
  • [10] W. Choi and S. Savarese (2010) Multiple target tracking in world coordinate with single, minimally calibrated camera. In European Conference on Computer Vision, Cited by: §2.
  • [11] P. Chu, H. Fan, C. C. Tan, and H. Ling (2019) Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 161–170. Cited by: §2.
  • [12] P. Chu and H. Ling (2019) Famnet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6172–6181. Cited by: Table 2.
  • [13] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé (2020) Mot20: a benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003. Cited by: Figure 4, §4.
  • [14] A. Ess, B. Leibe, K. Schindler, and L. Van Gool (2008) A mobile vision system for robust multi-person tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §4.1.
  • [15] K. Fang, Y. Xiang, X. Li, and S. Savarese (2018) Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.
  • [16] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2.
  • [17] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §A.2, Table A1, Table A2, §1, §2, §4.
  • [18] L. R. Ford and D. R. Fulkerson (1956) Maximal flow through a network. Canadian Journal of Mathematics 8, pp. 399–404. Cited by: §1.
  • [19] T. Fortmann, Y. Bar-Shalom, and M. Scheffe (1983) Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering 8 (3), pp. 173–184. Cited by: §1.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [21] A. Hornakova, R. Henschel, B. Rosenhahn, and P. Swoboda (2020) Lifted disjoint paths with application in multiple object tracking. In International Conference on Machine Learning, pp. 4364–4375. Cited by: §2.
  • [22] H. Hu, Q. Cai, D. Wang, J. Lin, M. Sun, P. Krahenbuhl, T. Darrell, and F. Yu (2019) Joint monocular 3d vehicle detection and tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5390–5399. Cited by: §1.
  • [23] R. E. Kalman (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82, pp. 35–45. Cited by: §1, §2.
  • [24] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele (2018) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE transactions on pattern analysis and machine intelligence 42 (1), pp. 140–153. Cited by: Table 2.
  • [25] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg (2015) Multiple hypothesis tracking revisited. In Proceedings of the IEEE international conference on computer vision, pp. 4696–4704. Cited by: §2.
  • [26] H. W. Kuhn (1955) The hungarian method for the assignment problem. Naval research logistics quarterly 2 (1-2), pp. 83–97. Cited by: §1, §3.4.
  • [27] C. Kuo and R. Nevatia (2011) How does person identity recognition help multi-person tracking?. In CVPR 2011, pp. 1217–1224. Cited by: §2.
  • [28] T. Kutschbach, E. Bochinski, V. Eiselein, and T. Sikora (2017) Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–5. Cited by: Table 2.
  • [29] L. Leal-Taixé, C. Canton-Ferrer, and K. Schindler (2016) Learning by tracking: siamese cnn for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40. Cited by: §1, §2.
  • [30] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942. Cited by: §4.
  • [31] Y. Liang and Y. Zhou (2018) LSTM multiple object tracker combining multiple cues. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2351–2355. Cited by: §2.
  • [32] C. Long, A. Haizhou, Z. Zijie, and S. Chong (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, Cited by: Table 1, Table 2.
  • [33] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2021) TrackFormer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702. Cited by: §1, §2, §4.1, §4.3, Table 2.
  • [34] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §A.2, Table A1, Table A2, Appendix A, §4.2, Table 1, Table 2, §4.
  • [35] N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran (2018) Image transformer. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 4055–4064. Cited by: §2.
  • [36] P. Ramachandran, N. Parmar, A. Vaswani, I. Bello, A. Levskaya, and J. Shlens (2019) Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems, Cited by: §2.
  • [37] N. Ran, L. Kong, Y. Wang, and Q. Liu (2019) A robust multi-athlete tracking algorithm by exploiting discriminant features and long-term dependencies. In International Conference on Multimedia Modeling, Cited by: §2.
  • [38] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • [39] J. Ren, X. Chen, J. Liu, W. Sun, J. Pang, Q. Yan, Y. Tai, and L. Xu (2017) Accurate single stage detector using recurrent rolling convolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5420–5428. Cited by: §1, §2.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Vol. 28. Cited by: §A.2, Table A2, §1, §2, §4.
  • [41] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666. Cited by: §3.4.
  • [42] S. H. Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid (2015) Joint probabilistic data association revisited. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3047–3055. Cited by: §1.
  • [43] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Cited by: §A.1, §4.
  • [44] E. Ristani and C. Tomasi (2018) Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6036–6046. Cited by: §2.
  • [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252. External Links: Document Cited by: §4.1.
  • [46] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision, Cited by: §1, §2, Table 1.
  • [47] F. Saleh, S. Aliakbarian, H. Rezatofighi, M. Salzmann, and S. Gould (2020) Probabilistic tracklet scoring and inpainting for multiple object tracking. arXiv preprint arXiv:2012.02337. Cited by: §2.
  • [48] R. Sanchez-Matilla, F. Poiesi, and A. Cavallaro (2016) Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision, pp. 84–99. Cited by: Table 1, Table 2.
  • [49] S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun (2018) Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123. Cited by: §4.1.
  • [50] H. Sheng, Y. Zhang, J. Chen, Z. Xiong, and J. Zhang (2018) Heterogeneous association graph fusion for target association in multiple object tracking. IEEE Transactions on Circuits and Systems for Video Technology 29 (11), pp. 3269–3280. Cited by: §2.
  • [51] J. Smith, F. Particke, M. Hiller, and J. Thielecke (2019) Systematic analysis of the pmbm, phd, jpda and gnn multi-target tracking filters. In 2019 22th International Conference on Information Fusion, pp. 1–8. Cited by: §1.
  • [52] R. L. Streit and T. E. Luginbuhl (1994) Maximum likelihood method for probabilistic multihypothesis tracking. In Signal and Data Processing of Small Targets, Vol. 2235, pp. 394–405. Cited by: §1.
  • [53] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo (2020) TransTrack: multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460. Cited by: §1, §2.
  • [54] S. Sun, N. Akhtar, H. Song, A. Mian, and M. Shah (2019) Deep affinity network for multiple object tracking. IEEE transactions on pattern analysis and machine intelligence 43 (1), pp. 104–119. Cited by: §1.
  • [55] S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017) Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3539–3548. Cited by: §2.
  • [56] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, pp. 6000–6010. Cited by: Appendix B, §1, §2, §3.2, §3.3.
  • [57] X. Wan, J. Wang, and S. Zhou (2018) An online and flexible multi-object tracking framework using long short-term memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Cited by: §2.
  • [58] Z. Wang, L. Zheng, Y. Liu, and S. Wang (2020) Towards real-time multi-object tracking. In European Conference on Computer Vision, Cited by: §1.
  • [59] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE international conference on image processing (ICIP), pp. 3645–3649. Cited by: §1, §2.
  • [60] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424. Cited by: §4.1.
  • [61] J. Xu, Y. Cao, Z. Zhang, and H. Hu (2019) Spatial-temporal relation networks for multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3988–3998. Cited by: Table 1, Table 2.
  • [62] Y. Xu, A. Osep, Y. Ban, R. Horaud, L. Leal-Taixé, and X. Alameda-Pineda (2020) How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6787–6796. Cited by: Table 1, Table 2.
  • [63] B. Yang and R. Nevatia (2012) An online learned crf model for multi-target tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2034–2041. Cited by: §2.
  • [64] F. Yang, W. Choi, and Y. Lin (2016) Exploit all the layers: fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2129–2137. Cited by: §A.2, Table A2, §1, §2, §4.
  • [65] J. Yin, W. Wang, Q. Meng, R. Yang, and J. Shen (2020) A unified object motion and affinity model for online multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6768–6777. Cited by: Table 1, Table 2.
  • [66] Y. Zhang, H. Sheng, Y. Wu, S. Wang, W. Lyu, W. Ke, and Z. Xiong (2020) Long-term tracking with deep tracklet association. IEEE Transactions on Image Processing 29, pp. 6694–6706. Cited by: §2.
  • [67] X. Zhou, V. Koltun, and P. Krähenbühl (2020) Tracking objects as points. In European Conference on Computer Vision, Cited by: §1, §2, §2, §4.1, Table 2.
  • [68] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang (2018) Online multi-object tracking with dual matching attention networks. In European Computer Vision Conference, Cited by: Table 1, Table 2.