Visually discriminating the identities of multiple objects in a scene and creating individual tracks of their movements over time, namely multi-object tracking, is one of the basic yet most crucial vision tasks, imperative for tackling many real-world problems in surveillance, robotics/autonomous driving, health and biology. While a classical AI problem, it is still very challenging to design a reliable multi-object tracking (MOT) system capable of tracking an unknown and time-varying number of objects moving through unconstrained environments, directly from spurious and ambiguous measurements, and in the presence of many other complexities such as occlusion, detection failure and data (measurement-to-object) association uncertainty.
Early frameworks approached the MOT problem by splitting it into multiple sub-problems such as object detection, data association, track management and filtering/state prediction, each with its own set of challenges and solutions [2, 1, 6, 7, 19, 42, 51, 52]. Recently, deep learning has contributed considerably to improving the performance of multi-object tracking approaches, but surprisingly not through learning the entire problem end-to-end. Instead, the developed methods adopted the traditional problem split and mainly focused on enhancing some of the aforementioned components, such as creating better detectors [17, 38, 39, 40, 64] or developing more reliable matching objectives for associating detections with existing object tracks [22, 29, 46, 58, 59]. While this tracking-by-detection paradigm has become the de facto standard approach for MOT, it has its own limitations. Recent approaches have shown advances by considering detection and tracking as a joint learning task rather than two separate sequential problems [4, 16, 54, 67]. However, these methods often formulate the MOT task as a problem over two consecutive frames and ignore long-term temporal information, which is imperative for tackling key challenges such as track initiation, termination and occlusion handling.
In addition to the aforementioned limitations, all these methods can barely be considered end-to-end multi-object tracking frameworks, as their final outputs, the tracks, are generated through a non-learning process. For example, track initiation and termination are commonly tackled by applying different heuristics, track assignments are decided upon by applying additional optimization methods such as the Hungarian algorithm or max-flow min-cut, and the generated tracks may be smoothed by a process such as interpolation or filtering.
With the recent rise in popularity of Transformers, this rather new deep learning tool has been adapted to solve computer vision problems like object detection and, concurrent to our work, has been deployed in two new MOT frameworks [33, 53]. Nonetheless, these still either rely on conventional heuristics such as IoU matching, or formulate the problem as a two-frame task [33, 53], making them naive approaches for handling long-term occlusions.
In this paper, we show that the MOT problem can be learnt end-to-end, without the use of heuristics or post-processing, addressing key tasks like track initiation and termination as well as occlusion handling. Our proposed method, nicknamed MO3TR, is a truly end-to-end Transformer-based online multi-object tracking method which learns to recursively predict the states of objects directly from an image sequence stream. Moreover, our approach encodes long-term temporal information to estimate the state of all objects over time and does not contain an explicit data association module (Fig. 1).
More precisely, MO3TR incorporates long-term temporal information by casting temporal attention over all past embeddings of each individual object, and uses this information to predict an embedding suited for the current time step. This access to longer-term temporal information beyond two frames is crucial in enabling the network to learn the difference between occlusion and termination, and is further facilitated through a specific data augmentation strategy. To factor in the influence of other objects and the visual input measurement, we refine the predicted object embedding by casting spatial attention over all identified objects in the current frame (object-to-object attention) as well as over the objects and the encoded input image (object-to-image attention).
The idea of this joint approach relates to the natural way humans perceive such scenarios: We expect certain objects to become occluded given their past trajectory and their surroundings, and predict when and where they will reappear.
To summarize, our main contributions are as follows:
We introduce an end-to-end tracking approach that learns to encode longer-term information beyond two frames through temporal and spatial Transformers, and recursively predicts the states of all tracked objects.
We realize joint learning of object initialization, termination and occlusion handling without explicit data association, eliminating the need for heuristic post-processing.
MO3TR achieves new state-of-the-art results on two popular multi-object tracking benchmarks.
2 Related work
Tracking-by-detection. Tracking-by-detection treats the multi-object tracking (MOT) task as a two-stage problem. Firstly, all objects in each frame are identified using an object detector [17, 39, 40, 64]. Detected objects are then associated over frames, resulting in tracks [6, 11]. The incorporation of appearance features and motion information has proven to be of great importance for MOT. Appearance and ReID features have been extensively utilized to improve the robustness of multi-object tracking [25, 27, 29, 44, 63]. Further, motion has been incorporated by utilizing a Kalman filter to approximate the displacement of boxes between frames in a linear fashion under a constant velocity assumption [2, 10] in order to associate detections [6, 59]. Recently, more complex and data-driven models have been proposed to model motion [15, 31, 66, 67] in a deterministic [37, 46] or probabilistic [15, 47, 57] manner. Graph neural networks have also been used in recent detection-based MOT frameworks to extract reliable global feature representations from visual and/or motion cues [8, 21, 50, 55].
Despite being highly interrelated, the detection and tracking tasks are treated independently in this line of work. Further, the performance of tracking-by-detection methods relies heavily on heuristics and post-processing steps to infer track initiation and termination, handle occlusions and assign tracks.
Joint detection and tracking. The recent trend in MOT has moved from associating detections over frames to regressing previous track locations to new locations in the current frame. [4, 16, 67] perform temporal realignment by exploiting a regression head. Although detection and tracking are not disjoint components in these works, they still suffer from some shortcomings. These works formulate the problem as detection matching between two or few frames, thus solving the problem locally and ignoring long-term temporal information. We argue that MOT is a challenging task that requires long-term temporal encoding of object dynamics to handle object initiation, termination, occlusion and tracking. Furthermore, these approaches still rely on conventional post-processing steps and heuristics to generate the tracks.
Transformers for vision. Recently, Transformers have been widely applied to many computer vision problems [3, 9, 35, 36], including MOT by two concurrent works [33, 53]. One of these works performs multi-object tracking using a query-key mechanism that relies on heuristic post-processing to generate the final tracks. Trackformer has been proposed as a Transformer-based model which achieves joint detection and tracking by converting the existing DETR object detector into an end-to-end trainable MOT pipeline. However, it still considers only local information (two consecutive frames) to learn and infer tracks and ignores long-term temporal object dynamics, which are essential for effective learning of all MOT components.
This paper. To overcome the limitations of previous works, we propose an end-to-end MOT model which learns to jointly track multiple existing objects, handle their occlusion or terminate their tracks, and initiate new tracks, all while considering long-term temporal object information.
3 MO3TR

Learning an object representation that encodes both the object's own state over time and its interaction with its surroundings is vital to allow reasoning about three key challenges present in end-to-end multiple object tracking (MOT), namely track initiation, termination and occlusion handling. In this section, we demonstrate how such a representation can be acquired and continuously updated through our proposed framework: Multi-Object TRacking using spatial TRansformers and temporal TRansformers, MO3TR for short (Fig. 2). We further introduce a training paradigm to learn to resolve these three challenges in a joint and completely end-to-end trainable manner. We first present an overview of our framework and introduce the notation used throughout this paper, followed by a detailed introduction of the core components.
3.1 System overview and notation
The goal of tracking multiple objects in a video sequence of $T$ frames is to retrieve an overall set of tracks representing the individual trajectories of all uniquely identified objects present in at least one frame. Given the first frame at time $t=1$, our model tentatively initializes a set of tracks based on all objects identified in this frame. From the next time step onward, the model aims to compute a set of embeddings $E_t = \{e_t^i\}$ representing all objects $i$ present in the scene at time $t$ (Fig. 2). Taking in the track history $E_{1:t-1}$ from the previous time step, we predict a set of embeddings $\hat{E}_t$ for the current time step based on the past representations of all objects using temporal attention (Section 3.2). Together with a learnt set of representation queries $Q_{\mathrm{init}}$ proposing the initiation of new object tracks, these predicted object representations are processed by our first spatial attention module to reason about the interaction occurring between different objects (Section 3.3). This refined set of intermediate object representations is then passed to the second spatial attention module, which takes the interaction between the objects and the scene into account by casting attention over the object embeddings and the visual information of the current frame transformed into its feature map $f_t$ (Section 3.3). This two-step incorporation of spatial information into the embeddings is iteratively performed over several layers, returning the final set of refined object representations $E_t$.
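The recursive per-frame loop described above can be sketched as follows. This is a minimal, purely illustrative skeleton: `DummyMO3TR`, `track_sequence` and all other names are hypothetical stand-ins for the learnt components, not the authors' actual implementation.

```python
import numpy as np

class DummyMO3TR:
    """Stand-in for the learnt model: deterministic toy outputs, purely
    illustrative of the interface the tracking loop below relies on."""
    def __init__(self, dim=4, seed=0):
        self.dim = dim
        self.rng = np.random.default_rng(seed)

    def encode_image(self, image):
        # A CNN backbone + Transformer encoder would produce the feature map.
        return self.rng.standard_normal((8, self.dim))

    def step(self, feats, history):
        # Temporal prediction for each tracked object plus one initiation proposal.
        out = {obj_id: h[-1] + 0.1 for obj_id, h in history.items()}
        out[len(history)] = self.rng.standard_normal(self.dim)
        return out

def track_sequence(frames, model, max_objects=3):
    """Run the recursive loop: predict embeddings from the track history,
    refine them against the current frame, and append the embeddings of
    objects deemed present back into the history."""
    history = {}  # object id -> list of past embeddings
    for image in frames:
        feats = model.encode_image(image)
        embeddings = model.step(feats, history)
        for obj_id, emb in embeddings.items():
            if obj_id < max_objects:  # stands in for the classification head
                history.setdefault(obj_id, []).append(emb)
    return history
```

Running the loop on three frames grows one track per frame in this toy setting, mirroring how the track history accumulates over time.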
The incorporation of temporal and spatial information into a representative embedding $e_t^i$ of any object $i$ at time $t$ can be summarized as a learnt function

$$e_t^i = g\left(E_{1:t-1}^i,\; Q_{\mathrm{init}},\; f_t\right) \qquad (1)$$

of the track history $E_{1:t-1}^i$, the learnt set of initiation queries $Q_{\mathrm{init}}$ and the encoded image feature map $f_t$. This function representation demonstrates our main objective: to enable the framework to learn the best possible way to relate the visual input to the objects' internal states, without enforcing overly restrictive constraints or explicit data association.
The use of the resulting embeddings in our framework is twofold. Tracking results in the form of object-specific class scores and corresponding bounding boxes for the current frame are obtained through simple classification and bounding box regression networks (Fig. 2). Further, the subset of embeddings yielding a high probability of representing an object present in the current frame is added to the track history to form the basis for the prediction performed in the next time step. Throughout the entire video sequence, new tracks representing objects that enter the scene are initialized, while previous tracks may be terminated for objects no longer present. This leads to an overall set of tracks for all uniquely identified objects present in at least one frame of the video sequence of length $T$, with their life spans indicated by subscripts denoting the initiation (start) and termination (end) time, respectively.
3.2 Learning long-term temporal embeddings
Discerning whether an object is not visible in a given frame due to occlusion, or because it is no longer present in the scene, is challenging. The fact that visual features extracted during partial or full occlusion do not describe the actual object they aim to represent complicates this even further. Humans naturally reach decisions in such scenarios by considering all available information jointly: analyzing the motion behavior of objects up to that point, we ignore frames with non-helpful information and predict how and where the object is expected to re-appear in the current frame. Intuitively, MO3TR follows a similar approach.
Our framework learns the temporal behavior of objects jointly with the rest of the model through a Transformer-based component that we nickname the temporal Transformer. For any tracked object $i$ at time $t$, the temporal Transformer casts attention over all embeddings contained in the object's track history $E_{1:t-1}^i$, and predicts a thereon-based expected object representation for the current frame. We supplement each object's track history by adding positional encodings to the embeddings in the track to represent their relative time in the sequence. We denote the time-encoded track history by $\tilde{E}_{1:t-1}^i$ and the individual positional time-encoding for time $t$ as $p_t$. Passing the request for an embedding estimate of the current time step in the form of the positional time-encoding $p_t$ as a query to the Transformer (note that this method allows predicting embeddings for any future time step, and could thus easily be extended to further applications like trajectory forecasting), and providing $\tilde{E}_{1:t-1}^i$ as the basis for keys and values, we retrieve the predicted object embedding
$$\hat{e}_t^i = \sigma\!\left(\frac{q(p_t)\, k(\tilde{E}_{1:t-1}^i)^\top}{\sqrt{d}}\right) v(\tilde{E}_{1:t-1}^i), \qquad (2)$$

where $\sigma$ represents the softmax operator, $q$, $k$ and $v$ are learnt query, key and value functions of the temporal Transformer, respectively, and $d$ denotes the dimension of the object embeddings.
In other words, the predicted representation $\hat{e}_t^i$ of object $i$ is computed through a dynamically weighted combination of all its previous embeddings. This allows the temporal Transformer to: (i) incorporate helpful and ignore irrelevant or faulty information from previous time steps, and (ii) predict upcoming occlusions and create appropriate embeddings that focus more on conveying important positional rather than visual information. While these tasks resemble those usually performed via heuristics and manual parameter tuning during track management, MO3TR learns these dependencies end-to-end, without the need for heuristics.
In practice, the prediction of object representations introduced for one tracked object in (2) is performed in a batched-parallel manner for the entire set of existing tracks over multiple layers, resulting in the output set $\hat{E}_t$ of the temporal Transformer that is passed as input to the spatial Transformers (Fig. 2). Note that the size of this set is dynamic and depends on the number of tracked objects. Details on how the temporal Transformer is trained are provided in Section 3.4.
3.3 Learning spatial interactions
Multiple pedestrians that are present in the same environment not only significantly influence each other's movements, but also their respective visual appearance by occluding each other when perceived from a fixed viewpoint. In this section, we introduce how MO3TR learns to incorporate these dependencies into the object representations. Starting from how detection and track initiation are performed within the concept of Transformers, we then detail the refinement of object embeddings by including the interaction between objects, followed by the interaction between objects and the input image.
Initiation of new tracks. For a new and previously untracked object appearing at any time $t$, a corresponding track history does not yet exist and hence no predicted embedding is passed from the temporal to the spatial Transformer (Fig. 2). To allow initiation of new tracks for such detected objects, we build upon DETR and learn a fixed set of initiation queries $Q_{\mathrm{init}}$. Intuitively, these queries learn to propose embeddings that lead the spatial Transformer to check for objects with certain properties and at certain locations in the visual input data. Importantly, these queries are considered jointly with the ones propagated from the temporal Transformer to avoid duplicate tracks.
Interaction between tracked objects. We use self-attention to capture the influence tracked objects have on each other's motion behavior and appearance. This interaction aspect is incorporated into the object embeddings by computing an updated version of the representation set

$$\bar{E}_t = \sigma\!\left(\frac{q_s(E_t')\, k_s(E_t')^\top}{\sqrt{d}}\right) v_s(E_t'), \qquad (3)$$

where $q_s$, $k_s$ and $v_s$ are all learnt functions of the concatenated object embedding set $E_t' = [\hat{E}_t, Q_{\mathrm{init}}]$, $d$ is the dimension of the embeddings and $\sigma$ the softmax operator. Relating this approach to the classical Transformer formulation, these functions conceptually represent the queries, keys and values introduced therein.
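The object-to-object update can be sketched as standard self-attention over the concatenated embedding set. Again, the weight matrices are illustrative placeholders for the learnt functions, not the authors' parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_self_attention(E, Wq, Wk, Wv):
    """Object-to-object self-attention (sketch): every embedding in the
    concatenated set E of shape (N, d) - predicted tracks plus initiation
    queries - attends to every other one via pairwise weights."""
    d = Wq.shape[1]
    Q, K, V = E @ Wq, E @ Wk, E @ Wv
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, N) pairwise weights
    return A @ V                                # updated embedding set
```

Each output row is a convex combination of (transformed) input embeddings, which is what lets the model propagate information between tracked objects before looking at the image.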
Interaction between objects and the input image. The relationship between the set of objects and the image is modeled through encoder-decoder attention (also known as cross-attention) to relate all object representations to the encoded visual information of the current image (the measurement). Evaluating this interaction results in the computation of a second update to the set of object representations

$$E_t = \sigma\!\left(\frac{q_c(\bar{E}_t)\, k_c(f_t)^\top}{\sqrt{d}}\right) v_c(f_t), \qquad (4)$$

where $q_c$ is a learnt function of the pre-refined object embeddings $\bar{E}_t$, and $k_c$ and $v_c$ are learnt functions of the image embedding $f_t$ produced by a CNN backbone and a Transformer encoder. $\sigma$ represents the softmax operator.
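The object-to-image update is ordinary cross-attention: queries come from the object embeddings, while keys and values come from the encoded image tokens. As before, the weight matrices below are illustrative stand-ins for the learnt functions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_image_cross_attention(E, F, Wq, Wk, Wv):
    """Cross-attention sketch: object embeddings E (N, d) query the encoded
    image feature tokens F (M, d), so each object gathers visual evidence
    from the locations of the current frame it attends to."""
    d = Wq.shape[1]
    Q = E @ Wq          # queries from the pre-refined object embeddings
    K, V = F @ Wk, F @ Wv   # keys/values from the image tokens
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)  # (N, M) object-to-pixel weights
    return A @ V
```

The output has one updated embedding per object, regardless of how many image tokens are attended to.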
Combining interactions for refined embeddings. In practice, the two previously described update steps are performed consecutively, with (4) taking as input the result of (3), and are iteratively repeated over several layers of the Transformer architecture. This sequential incorporation of updates into the representation is inspired by DETR, where self-attention and cross-attention modules are similarly deployed in a sequential manner. Using both introduced concepts of object-to-object and object-to-measurement attention allows the model to globally reason about all tracked objects via their pair-wise relationships, while using the current image as context information to retrieve the final set of updated object representations $E_t$.
Updating the track history. After each frame is processed by the entire framework, the final set of embeddings of objects identified to be present in the frame is added to the track history, creating the basis for the next prediction of embeddings by the temporal Transformer (Fig. 2). We consistently append new embeddings from the right-hand side, followed by right-aligning the entire set of embeddings. Due to the different lengths of tracks for different objects, this procedure aligns embeddings representing identical time steps, a method that we found helps stabilize training and improve the inference of the temporal Transformer (Table 4).
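Right-alignment amounts to left-padding the variable-length histories so that the most recent embeddings share the last column. A minimal sketch, with a mask marking which entries are real:

```python
import numpy as np

def right_align(histories, dim):
    """Pad variable-length track histories on the LEFT so the most recent
    embeddings line up in the last slot (right-aligned), as done before
    feeding the batch to the temporal Transformer. Returns the padded
    batch and a boolean mask of real (non-padded) entries. Illustrative."""
    T = max(len(h) for h in histories)
    batch = np.zeros((len(histories), T, dim))
    mask = np.zeros((len(histories), T), dtype=bool)
    for i, h in enumerate(histories):
        if h:
            batch[i, T - len(h):] = np.stack(h)
            mask[i, T - len(h):] = True
    return batch, mask
```

Because embeddings of the same time step now occupy the same column for every object, the temporal attention sees temporally consistent positions across the batch.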
3.4 Training MO3TR
The training procedure of MO3TR (Fig. 2) is composed of two key tasks: (i) creating a set of suitable tracklets that can be used as input to the temporal Transformer, and (ii) assigning the predicted set of output embeddings to corresponding ground-truth labels of the training set, and applying a corresponding loss to facilitate training. With the number of output embeddings being larger by design than the number of objects in the scene, each embedding is matched either with a trackable object or with the background class.
Constructing the input tracklet set. The input to the model at any given time $t$ is defined as the track history $E_{1:t-1}$ and the current image $x_t$. To construct a corresponding track history for any $x_t$ sampled from the dataset during training, we first extract the ordered set of directly preceding images from the training sequence. Passing these images without track history to MO3TR causes the framework to perform track initiation for all identified objects in each frame by using the trainable embeddings $Q_{\mathrm{init}}$, returning an ordered set of output embedding sets. Each output embedding set contains a variable number of embeddings representing objects in the respective frame. We use multilayer perceptrons (MLPs) to extract corresponding bounding boxes and class scores from each of these object embeddings, resulting in a set of object-specific (box, class) pairs for each frame. The pairs are then matched with the ground truth of the respective frame by computing a bipartite matching between these sets. The permutation $\hat{\pi}$ of the predicted elements with the lowest pair-wise matching cost is determined by solving the assignment problem

$$\hat{\pi} = \underset{\pi}{\arg\min} \sum_{i} \mathcal{C}_{\mathrm{match}}\left(y_i, \hat{y}_{\pi(i)}\right) \qquad (5)$$

through the Hungarian algorithm, with the matching cost $\mathcal{C}_{\mathrm{match}}$ taking both the probability of correct class prediction and bounding box similarity into account

$$\mathcal{C}_{\mathrm{match}}\left(y_i, \hat{y}_{\pi(i)}\right) = -\,\hat{p}_{\pi(i)}(c_i) + \mathcal{C}_{\mathrm{box}}\left(b_i, \hat{b}_{\pi(i)}\right), \qquad (6)$$

where $y_i = (c_i, b_i)$ denotes the ground-truth class and box of object $i$, and $\hat{p}_{\pi(i)}(c_i)$ the predicted probability of class $c_i$.
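The assignment step can be illustrated with a tiny brute-force matcher. The paper uses the Hungarian algorithm (e.g. `scipy.optimize.linear_sum_assignment`), which scales polynomially; the permutation search below is only for clarity on very small cost matrices, and assumes the matrix has been padded to square (unmatched predictions map to background).

```python
from itertools import permutations

def min_cost_match(cost):
    """Minimum-cost one-to-one assignment between predictions (rows) and
    ground-truth objects (columns) of a square cost matrix. Brute force
    over permutations; illustrative stand-in for the Hungarian algorithm."""
    n = len(cost)
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best
```

For a 2x2 cost matrix the matcher simply picks whichever pairing (diagonal or anti-diagonal) is cheaper.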
We follow DETR and use a linear combination of the L1 distance and the scale-invariant generalized intersection over union (GIoU) cost $\mathcal{C}_{\mathrm{giou}}$ to mitigate possible scale issues arising from different box sizes. The resulting bounding box cost with weights $\lambda_{L1}$ and $\lambda_{\mathrm{giou}}$ is then defined as

$$\mathcal{C}_{\mathrm{box}}\left(b_i, \hat{b}_{\pi(i)}\right) = \lambda_{L1} \left\lVert b_i - \hat{b}_{\pi(i)} \right\rVert_1 + \lambda_{\mathrm{giou}}\, \mathcal{C}_{\mathrm{giou}}\left(b_i, \hat{b}_{\pi(i)}\right). \qquad (7)$$
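The box cost can be computed directly for corner-format boxes. A minimal sketch, with the loss weights set to DETR's published defaults (5 and 2) purely as illustrative placeholders:

```python
def giou(a, b):
    """Generalized IoU for [x1, y1, x2, y2] boxes: IoU minus the fraction
    of the enclosing hull not covered by the union (in [-1, 1])."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    hull = (cx2 - cx1) * (cy2 - cy1)  # smallest enclosing box
    return inter / union - (hull - union) / hull

def box_cost(b, b_hat, w_l1=5.0, w_giou=2.0):
    """Weighted L1 + GIoU matching cost between a ground-truth and a
    predicted box; weights are illustrative (DETR defaults)."""
    l1 = sum(abs(x - y) for x, y in zip(b, b_hat))
    return w_l1 * l1 + w_giou * (1.0 - giou(b, b_hat))
```

Identical boxes yield zero cost, while disjoint boxes are penalized beyond plain IoU through the negative GIoU term.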
The identified minimum-cost matching between the output and ground-truth sets is used to assign all embeddings classified as objects their respective identities annotated in the ground-truth labels. The objects of all frames are accumulated, grouped by their assigned identities and sorted in time-ascending order to form the overall set of previous object tracks serving as input to our model.
Losses. Given the created input set of tracks $E_{1:t-1}$ and the image $x_t$, MO3TR predicts an output set of object embeddings $E_t$ at time $t$. As before, we extract bounding boxes and class scores for each embedding in the set. However, embeddings that possess a track history already have unique identities associated with them and are thus directly matched with the respective ground-truth elements. Only newly initiated embeddings without track history are then matched with the remaining unassigned ground-truth labels as previously described. Elements that could not be matched are assigned the background class. Finally, we re-use (6) and (7) and apply them as our loss to the matched elements of the output set.
Data augmentation. Most datasets are highly imbalanced regarding the occurrence of occlusion, initiation and termination scenarios. To facilitate learning of correct tracking behavior, we propose to mitigate this imbalance problem by modeling similar effects through augmentation:
We randomly drop a certain number of embeddings in the track history to simulate cases where the object could not be identified for some frames, aiming to increase robustness. If the most recent embedding is dropped, the model can learn to re-identify objects.
Random false-positive examples are inserted into the history to simulate false detections and faulty appearance information due to occlusion. This encourages the model to learn to ignore unsuited representations through its attention mechanism.
We randomly select the sequence length used to create the track history during training to increase the model’s capability to deal with varying track lengths.
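The three augmentation strategies above can be combined into one history-transforming function. Everything here is a sketch: function name, probabilities and the fixed fallback seed are illustrative choices, not the paper's values.

```python
import random

def augment_history(history, max_len=30, p_drop=0.1, p_fp=0.05,
                    fp_pool=None, rng=None):
    """Track-history augmentation sketch: (1) random length truncation,
    (2) random embedding drops simulating missed detections, and
    (3) random false-positive substitutions simulating occlusion-corrupted
    appearance (drawn from fp_pool, if given)."""
    rng = rng or random.Random(0)
    out = list(history[-rng.randint(1, max_len):])      # (1) random length
    out = [e for e in out if rng.random() > p_drop]     # (2) random drops
    if fp_pool:
        out = [rng.choice(fp_pool) if rng.random() < p_fp else e
               for e in out]                            # (3) false positives
    return out
```

Applied on the fly during training, this exposes the temporal Transformer to short, gappy and partially corrupted histories instead of only clean ones.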
4 Experiments

In this section, we demonstrate the performance of MO3TR by comparing against other multi-object tracking methods on popular MOT benchmarks (https://motchallenge.net/) and evaluate different aspects of our contribution in detailed ablation studies. We further provide implementation and training details.
Datasets. We use the MOT16 and MOT17 datasets from the MOTChallenge benchmarks to evaluate and compare MO3TR with other state-of-the-art models. Both datasets contain seven training and seven test sequences, capturing crowded indoor or outdoor areas via moving and static cameras from various viewpoints. Pedestrians are often heavily occluded by other pedestrians or background objects, making identity-preserving tracking challenging. Three sets of public detections are provided with MOT17 (DPM, FRCNN and SDP), and one with MOT16 (DPM). For ablation studies, we combine sequences of the newer MOT20 benchmark and 2DMOT15 to form a diverse validation set covering both indoor and outdoor scenes at various pedestrian density levels.
Evaluation metrics. To evaluate our model and other MOT methods, we use the standard metrics recognized by the tracking community [5, 43]. The two main metrics are Multiple Object Tracking Accuracy (MOTA) and Identity F1 Score (IDF1). MOTA focuses more on object coverage, while the consistency of assigned identities is measured by IDF1. We further report False Positives (FP), False Negatives (FN), Mostly Tracked (MT) and Mostly Lost (ML). Further details of these metrics are provided in the supplementary material.
4.1 Implementation details of MO3TR
We employ a multi-stage training concept to train MO3TR end-to-end. Firstly, our ImageNet-pretrained ResNet50 backbone is trained, together with the encoder and the spatial Transformers, on a combination of the CrowdHuman, ETH and CUHK-SYSU datasets for 300 epochs on a pedestrian detection task. This training procedure is similar to DETR. Afterwards, we engage our temporal Transformer and train the entire model end-to-end using the MOT17 dataset for another 300 epochs. The initial learning rate for both training stages is 1e-4 and is dropped by a factor of 10 every 100 epochs. The relative weights of our loss are the same as in DETR, and the number of initiation queries is 100. The input sequence length representing object track histories varies randomly from 1 to 30 frames. To enhance the learning of the temporal encoding, we predict 10 future frames instead of one and compute the total loss over them. We train our model using 4 GTX 1080ti GPUs with 11GB memory each; these computational requirements are significantly lower than those of other recently published approaches in this field. We expect the performance of our model to increase further with bigger backbones, longer sequence lengths and an increased number of objects per frame.
Public detections. We evaluate tracking performance using the public detections provided by the MOTChallenge. Since our embedding-based method cannot directly produce tracks from these detections, we follow [33, 67] in filtering our initiations by the public detections using bounding box center distances, and only allow initiation of matched, and thus publicly detected, tracks.
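The center-distance filtering can be sketched as follows; the function name and the distance threshold are illustrative assumptions, not values from the paper.

```python
def center(box):
    # Center point of an [x1, y1, x2, y2] box.
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def filter_initiations(candidates, public_dets, max_dist=50.0):
    """Keep only candidate track initiations whose bounding-box center lies
    within max_dist pixels of some public detection; unmatched candidates
    are discarded so only publicly detected objects start tracks."""
    kept = []
    for box in candidates:
        cx, cy = center(box)
        for det in public_dets:
            dx, dy = center(det)
            if ((cx - dx) ** 2 + (cy - dy) ** 2) ** 0.5 <= max_dist:
                kept.append(box)
                break
    return kept
```

Only initiation is gated this way; already-tracked objects continue to be propagated by the model itself.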
4.2 Comparison with the state of the art
We evaluate MO3TR on the challenging MOT16 and MOT17 benchmark test datasets using the provided public detections and report our results in Tables 1 and 2, respectively. Despite not using any heuristic track management for filtering or post-processing, we outperform most competing methods and achieve new state-of-the-art results on both datasets regarding the MOTA, IDF1 and ML metrics, and set a new benchmark for MT and FN on MOT16.
As clearly shown by its state-of-the-art IDF1 scores on both datasets, MO3TR is capable of identifying objects and maintaining their identities over long parts of their tracks, in many cases for more than 80% of the objects' lifespans, as evidenced by the very high MT results. Access to the track history through the temporal Transformer and joint reasoning over existing tracks, initiations and the input data through the spatial Transformers help MO3TR learn to discern occlusion from termination. The framework is thus capable of avoiding false terminations, as evidenced by the very low FN and record-low ML numbers achieved on both MOT datasets. These values further indicate that MO3TR learns to fill in gaps due to missed detections or occlusions, which additionally reduces FN and ML while increasing IDF1 and MT. Joint reasoning over the available information helps MO3TR reduce failed track initiations (FN) considerably while keeping incorrect track initiations (FP) at reasonably low levels. The combination of superior IDF1, very low FN and reasonable FP allows MO3TR to reach new state-of-the-art MOTA results on both the MOT16 (Table 1) and MOT17 (Table 2) datasets.
4.3 Ablation studies
| Training strategy (two frames) | MOTA↑ | IDF1↑ | FP↓ | FN↓ |
|---|---|---|---|---|
| Naive | 12.2 | 22.1 | 7,905 | 26,848 |
| FN | 14.6 | 42.0 | 22,609 | 11,671 |
| FN+RA | 28.4 | 42.5 | 16,749 | 11,940 |
| FN+RA+FP | 55.4 | 48.4 | 3,927 | 17,912 |
In this section, we evaluate different components of MO3TR on our validation set using private detections and show the individual contributions of the key components and strategies to facilitate learning.
Effect of track history length. The track history length describes the maximum number of embeddings from all previous time steps of a certain identified object that our temporal Transformer has access to. To avoid overfitting to any particular history length that might be dominant in the dataset but not actually represent the most useful source of information, we specifically train our model with input track histories of varying, randomly chosen lengths. It is important to note that if the maximum track history length is set to one, the method practically degenerates to a two-frame joint detection and tracking method such as Trackformer. Our results reported in Table 3, however, show that incorporating longer-term information is crucial to improve end-to-end tracking. Both MOTA and IDF1 consistently improve, while FP is reduced, when longer-term history, i.e., information from previous frames, is taken into account. This trend is also clearly visible throughout the evaluation of our training strategies presented in Table 4, discussed in the following.
Training strategies. MOT datasets are highly imbalanced when it comes to the occurrence of initialization and termination examples compared to normal propagation, making it nearly impossible for models to naturally learn initiation of new, or termination of no-longer-existing, tracks when trained in a naive way. As presented in Table 4, naive training without any augmentation shows almost double the number of false negatives (FN) compared to augmented approaches, essentially failing to initiate tracks properly. Augmenting with false negatives (FN) as discussed in Section 3.4 shows significant improvements for both two-frame and longer-term methods. Additionally, right-aligning (RA) the track history generally helps stabilize training and greatly reduces false positives. Finally, augmenting with false positives (FP) is the most challenging strategy to implement, but crucial: as the results demonstrate, it significantly reduces false positives by helping the network properly learn the termination of tracks.
Analysing temporal attention. To provide some insight into the complex and highly non-linear working principle of our temporal Transformer, we visualize the attention weights over the temporal track history for different track history lengths, averaged over 100 randomly picked objects in our validation set (Fig. 4). Results for the first layer clearly depict most attention being paid to the more recent frames, decreasing with increasing frame distance. The second and third layers are harder to interpret due to increasing non-linearity, and the model starts to cast attention over more distant frames. It is important to note that even if an embedding is not available at a given time step, the model can still choose to pay attention to that slot and use its non-existence for reasoning.
5 Conclusion

We presented MO3TR, a truly end-to-end multi-object tracking framework that uses temporal Transformers to encode the history of objects while employing spatial Transformers to encode the interaction between objects and the input data, allowing it to handle occlusions, track termination and initiation. Demonstrating the advantages of long-term temporal learning, we set new state-of-the-art results on multiple metrics of the popular MOT16 and MOT17 benchmarks.
Appendix A Experiments
In this section, we provide details on the evaluation metrics used throughout the main paper, as well as detailed results for all sequences on the MOT16 and MOT17 benchmarks.
a.1 Evaluation metrics
To evaluate MO3TR and compare its performance to other state-of-the-art tracking approaches, we use the standard set of metrics proposed in [5, 43]. Analyzing the detection performance, we provide detailed insights regarding the total number of false positives (FP) and false negatives (FN, missed targets). The mostly tracked targets (MT) measure describes the ratio of ground-truth trajectories that are covered for at least 80% of the track’s life span, while mostly lost targets (ML) represents the ones covered for at most 20%. The number of identity switches is denoted by IDs. The two most commonly used metrics to summarize the tracking performance are the multiple object tracking accuracy (MOTA), and the identity F1 score (IDF1). MOTA combines the measures for the three error sources of false positives, false negatives and identity switches into one compact measure, and a higher MOTA score implies better performance of the respective tracking approach. The IDF1 represents the ratio of correctly identified detections over the average number of ground-truth and overall computed detections.
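The two summary metrics follow the standard definitions from the CLEAR MOT and identity-preserving evaluation protocols; a minimal sketch:

```python
def mota(fp, fn, id_switches, num_gt):
    """Multiple object tracking accuracy (CLEAR MOT): combines the three
    error sources into one score. `num_gt` is the total number of
    ground-truth object instances over all frames; higher is better."""
    return 1.0 - (fp + fn + id_switches) / num_gt

def idf1(idtp, idfp, idfn):
    """Identity F1 score: ratio of correctly identified detections (IDTP)
    over the average number of ground-truth and computed detections."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```

Note that MOTA can become negative when the combined error count exceeds the number of ground-truth instances.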
All reported results are computed by the official evaluation code of the MOTChallenge benchmark (https://motchallenge.net).
a.2 Evaluation results
The public results for the MOT16 benchmark presented in the experiment section of the main paper show the overall result of MO3TR on the benchmark's test dataset using the provided public detections (DPM). Detailed results for all individual sequences are presented in Table A1. Similarly, the individual results for all sequences of the MOT17 benchmark, comprising three different sets of provided public detections (DPM, FRCNN and SDP), are detailed in Table A2. Further information regarding the metrics used is provided in Section A.1.
Appendix B Data association as auxiliary task
In the introduction of the main paper, we introduce the idea that our proposed MO3TR performs tracking without any explicit data association module. To elaborate what we mean by that and how multi-object tracking (MOT) without an explicitly formulated data association task is feasible, we would like to reconsider the actual definition of the MOT problem: finding a mapping from given input data (an image sequence stream) to output data (a set of object states over time). In any learning scheme, given a suitable learning model, this mapping function can in principle be learned without solving any additional auxiliary task, as long as the provided inputs and outputs are clearly defined. The firmly established task of data association, a minimum-cost assignment between detections and objects (commonly solved with the Hungarian algorithm), is nothing more than such an auxiliary task, originally created to solve tracking within the tracking-by-detection paradigm. An end-to-end learning model, however, can learn to infer implicit correspondences, rendering the explicit formulation of this task obsolete.
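To make the auxiliary task concrete, the following toy sketch performs minimum-cost track-to-detection assignment with cost 1 - IoU; for brevity it uses a brute-force search over permutations in place of the Hungarian algorithm, which gives the same optimal assignment on small problems. All names are illustrative; this is the classical baseline step that MO3TR's learned attention replaces, not part of MO3TR itself.

```python
from itertools import permutations

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def associate(tracks, detections):
    """Minimum-cost assignment of detections to tracks (cost = 1 - IoU).

    Returns, for each track, the index of its assigned detection. Assumes
    len(detections) >= len(tracks); brute force stands in for the
    Hungarian algorithm here.
    """
    best, best_cost = None, float('inf')
    for perm in permutations(range(len(detections)), len(tracks)):
        cost = sum(1.0 - iou(t, detections[j]) for t, j in zip(tracks, perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return list(best)
```

An end-to-end model replaces this explicit matching step by learning the correspondence implicitly within its attention layers.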
More precisely, our end-to-end tracking model learns to relate the visual input information to the internal states of the objects via a self-supervised attention scheme. We realize this by using a combination of Transformers to distill the available spatial and temporal information into representative object embeddings (the object states), making the explicit formulation of any auxiliary data association strategy unnecessary.
References

- (2012) Discrete-continuous optimization for multi-target tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1926–1933.
- (2011) Multi-target tracking by continuous energy minimization. In CVPR.
- (2019) Attention augmented convolutional networks. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3285–3294.
- (2019) Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 941–951.
- (2008) Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10.
- (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468.
- (1999) Design and analysis of modern tracking systems. Artech House radar library, Artech House.
- (2020) Learning a neural solver for multiple object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6247–6257.
- (2020) End-to-end object detection with transformers. In European Conference on Computer Vision, pp. 213–229.
- (2010) Multiple target tracking in world coordinate with single, minimally calibrated camera. In European Conference on Computer Vision.
- (2019) Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 161–170.
- (2019) FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6172–6181.
- (2020) MOT20: a benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003.
- (2008) A mobile vision system for robust multi-person tracking. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8.
- (2018) Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV).
- (2017) Detect to track and track to detect. In Proceedings of the IEEE International Conference on Computer Vision.
- (2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
- (1956) Maximal flow through a network. Canadian Journal of Mathematics 8, pp. 399–404.
- (1983) Sonar tracking of multiple targets using joint probabilistic data association. IEEE Journal of Oceanic Engineering 8 (3), pp. 173–184.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2020) Lifted disjoint paths with application in multiple object tracking. In International Conference on Machine Learning, pp. 4364–4375.
- (2019) Joint monocular 3D vehicle detection and tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5390–5399.
- (1960) A new approach to linear filtering and prediction problems. Transactions of the ASME–Journal of Basic Engineering 82, pp. 35–45.
- (2018) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (1), pp. 140–153.
- (2015) Multiple hypothesis tracking revisited. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4696–4704.
- (1955) The Hungarian method for the assignment problem. Naval Research Logistics Quarterly 2 (1-2), pp. 83–97.
- (2011) How does person identity recognition help multi-person tracking? In CVPR 2011, pp. 1217–1224.
- (2017) Sequential sensor fusion combining probability hypothesis density and kernelized correlation filters for multi-object tracking in video data. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–5.
- (2016) Learning by tracking: siamese CNN for robust target association. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 33–40.
- (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942.
- (2018) LSTM multiple object tracker combining multiple cues. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2351–2355.
- (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME.
- (2021) TrackFormer: multi-object tracking with transformers. arXiv preprint arXiv:2101.02702.
- (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.
- (2018) Image transformer. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, pp. 4055–4064.
- (2019) Stand-alone self-attention in vision models. In Advances in Neural Information Processing Systems.
- (2019) A robust multi-athlete tracking algorithm by exploiting discriminant features and long-term dependencies. In International Conference on Multimedia Modeling.
- (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
- (2017) Accurate single stage detector using recurrent rolling convolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5420–5428.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, Vol. 28.
- (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 658–666.
- (2015) Joint probabilistic data association revisited. In 2015 IEEE International Conference on Computer Vision (ICCV), pp. 3047–3055.
- (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision, pp. 17–35.
- (2018) Features for multi-target multi-camera tracking and re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6036–6046.
- (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
- (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In Proceedings of the IEEE International Conference on Computer Vision.
- (2020) Probabilistic tracklet scoring and inpainting for multiple object tracking. arXiv preprint arXiv:2012.02337.
- (2016) Online multi-target tracking with strong and weak detections. In European Conference on Computer Vision, pp. 84–99.
- (2018) CrowdHuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123.
- (2018) Heterogeneous association graph fusion for target association in multiple object tracking. IEEE Transactions on Circuits and Systems for Video Technology 29 (11), pp. 3269–3280.
- (2019) Systematic analysis of the PMBM, PHD, JPDA and GNN multi-target tracking filters. In 2019 22nd International Conference on Information Fusion, pp. 1–8.
- (1994) Maximum likelihood method for probabilistic multihypothesis tracking. In Signal and Data Processing of Small Targets, Vol. 2235, pp. 394–405.
- (2020) TransTrack: multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460.
- (2019) Deep affinity network for multiple object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (1), pp. 104–119.
- (2017) Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3539–3548.
- (2017) Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pp. 6000–6010.
- An online and flexible multi-object tracking framework using long short-term memory. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- (2020) Towards real-time multi-object tracking. In European Conference on Computer Vision.
- (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649.
- (2017) Joint detection and identification feature learning for person search. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3415–3424.
- (2019) Spatial-temporal relation networks for multi-object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3988–3998.
- (2020) How to train your deep multi-object tracker. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6787–6796.
- (2012) An online learned CRF model for multi-target tracking. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2034–2041.
- (2016) Exploit all the layers: fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2129–2137.
- (2020) A unified object motion and affinity model for online multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6768–6777.
- (2020) Long-term tracking with deep tracklet association. IEEE Transactions on Image Processing 29, pp. 6694–6706.
- (2020) Tracking objects as points. In European Conference on Computer Vision.
- (2018) Online multi-object tracking with dual matching attention networks. In European Conference on Computer Vision.