DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion

11/29/2021
by Peize Sun, et al.

A typical pipeline for multi-object tracking (MOT) is to use a detector for object localization, followed by re-identification (re-ID) for object association. This pipeline is partially motivated by recent progress in both object detection and re-ID, and partially motivated by biases in existing tracking datasets, where most objects tend to have distinguishing appearance and re-ID models are sufficient for establishing associations. In response to such bias, we would like to re-emphasize that methods for multi-object tracking should also work when object appearance is not sufficiently discriminative. To this end, we propose a large-scale dataset for multi-human tracking, where humans have similar appearance, diverse motion and extreme articulation. As the dataset contains mostly group dancing videos, we name it "DanceTrack". We expect DanceTrack to provide a better platform to develop more MOT algorithms that rely less on visual discrimination and depend more on motion analysis. We benchmark several state-of-the-art trackers on our dataset and observe a significant performance drop on DanceTrack when compared against existing benchmarks. The dataset, project code and competition server are released at: <https://github.com/DanceTrack>.



1 Introduction

Object tracking has long been studied and benefits applications such as autonomous driving, video analysis, and robot planning [cao2021instance, yilmaz2006object, rangesh2019no]. Multi-object tracking aims to localize and associate objects of interest over time. Interestingly, we observe that recent developments in multi-object tracking rely heavily on a paradigm of detection followed by re-ID, where mostly appearance cues are used to associate objects. This trend in algorithmic development makes existing solutions fail catastrophically in situations where objects share very similar appearance, and it inspires us to propose a platform that encourages more comprehensive solutions, taking other cues into modeling, such as object motion patterns and temporal dynamics.

As with many other areas of computer vision, the development of multi-object tracking is influenced by benchmark datasets. Trained on specific datasets [MOT16, MOT20, KITTI, BDD], data-driven methods are sometimes argued to be biased toward certain data distributions. In this work, we recognize the limitations of existing multi-object tracking datasets and observe that many objects have distinct appearance and that the motion patterns of objects are very regular or even linear. Shaped by these dataset properties, most recently developed multi-object tracking algorithms [FairMOT, quasidense, TraDeS, bytetrack, SORT, DeepSORT, bergmann2019tracking] rely highly on appearance matching to associate detected objects while paying little attention to other cues. The dominant paradigm fails in situations outside the biased distribution. This is not what we expect if we aim to build more general and intelligent tracking algorithms.

We also observe that appearance matching is not reliable when objects have similar appearances or heavy occlusion. These properties cause catastrophic degradation of current state-of-the-art multi-object tracking algorithms. To provide a new platform for more comprehensive multi-object tracking studies, we propose a new dataset in this paper. Because it mostly contains group dancing videos, we name it "DanceTrack". The dataset contains over 100K image frames (almost 10x the size of the MOT17 dataset). As shown in Figure 1, the emphasized properties of this dataset are (1) uniform appearance: people in the videos wear very similar or even the same clothes, making their visual features hard to distinguish with a re-ID model, and (2) diverse motion: people usually exhibit large-range motion and complex body gesture variation, placing higher demands on motion modeling. The second property also brings occlusion and crossover as side-effects: human bodies overlap each other heavily and their relative positions exchange frequently.

With the proposed dataset, we build a new benchmark covering existing popular multi-object tracking methods. The results show that current state-of-the-art algorithms fail to achieve satisfactory performance when they simply use appearance matching or linear motion models to associate objects across frames. Considering that the cases this dataset focuses on happen frequently in real life, we believe it exposes the limitations of existing multi-object tracking algorithms in practical applications. To provide potential guidelines for further research, we analyze a range of choices for associating objects and reach several useful conclusions: (1) fine-grained representations of objects, e.g., segmentation and pose, exhibit better ability than coarse bounding boxes; (2) depth information has a positive influence on associating objects, even though we are solving a 2D tracking task; (3) motion modeling of temporal dynamics is important.

To conclude, the key contributions of our work to the object tracking community are as follows:

  1. We build a new large-scale multi-object tracking dataset, DanceTrack, covering the scenarios where tracking suffers from low distinguishability of object appearance and diverse non-linear motion patterns.

  2. We benchmark baseline methods on this newly built dataset with various evaluation metrics, showing the limitation of existing multi-object tracking algorithms.

  3. We provide a comprehensive analysis to discover more cues for developing multi-object trackers that are more robust in complicated real-life situations.

2 Related Works

Multi-object tracking datasets.

Many multi-object tracking datasets have been proposed, each focusing on different scenarios. Similar to our proposed dataset, many existing datasets focus on human tracking. The PETS [PETS2009] dataset is one of the earliest in this area; the more recent MOT15 [MOT15] dataset and the following MOT17 [MOT16] and MOT20 [MOT20] datasets are all popular in this community. These datasets are limited in the aspects we care about. For example, MOT contains only a handful of videos and scenarios. Although MOT20 increases the density of objects and emphasizes the occlusion among them, the movements of objects are very regular and the objects still have very distinguishable appearance. Association by pure appearance matching [quasidense] also succeeds on these datasets, and we will show that, given a perfect detector, the tracking problem on them can be solved by a very naive association strategy.

Besides, many other datasets have been proposed for diverse objectives, e.g., WILDTRACK [wildtrack] for multi-camera tracking and association, and Youtube-VIS [youtubevos] and MOTS [MOTS20] for pixel-wise tracking (video instance segmentation). With the increasing attraction of autonomous driving, some datasets are built specifically for it. KITTI [KITTI] is one of the earliest large-scale multi-object tracking datasets for driving scenarios, where the objects of interest are vehicles and pedestrians. More recently, BDD100K [BDD], Waymo [waymo] and KITTI360 [KITTI360] have been made available to the public, still focusing on the autonomous driving problem but providing much larger-scale data than KITTI. Constrained by lanes and traffic rules, the motion patterns of objects in these datasets are even more regular than in datasets of moving people. There are also datasets covering more diverse object categories than persons and vehicles: the ImageNet-Vid [ImageNet] benchmark provides trajectory annotations for 30 object categories in over 1000 videos, and TAO [TAO] annotates as many as 833 object categories to study object tracking under a long-tailed distribution.

Tracking by matching appearance.

Compared to tracking-by-detection, recent developments in multi-object tracking focus more on the joint-detection-and-tracking genre, where object localization and association are conducted at the same time, and appearance similarity serves as the dominant cue in many popular multi-object tracking methods. For example, QuasiDense (QDTrack) [quasidense] designs a pairwise training paradigm with dense localization for object detection and uses highly sensitive appearance comparison to match objects across frames. JDE [JDE] and FairMOT [FairMOT] learn object localization and appearance embedding with a shared backbone for better appearance representation. More recently, with the new focus on applying transformers [vaswani2017attention] to vision tasks, TransTrack [Transtrack], TrackFormer [Trackformer] and MOTR [MOTR] have made attempts to leverage the attention mechanism for tracking objects in videos. In these works, the features of previous tracklets are passed to the following frames as queries to associate the same objects across frames; the appearance information contained in the query is also critical for keeping tracklets consistent. Although the rise of deep learning brings much more powerful visual representations than ever before, making appearance matching more robust, we still witness the failure of appearance matching in many real-world situations, which we expect to improve by taking other cues into account.

Motion analysis in object tracking.

The displacement of objects of interest provides important cues for object tracking. Tracking objects by estimating their motion is thus a natural and intuitive idea and has inspired a line of research. These tracking algorithms mainly follow the tracking-by-detection paradigm. Sequential analysis tools such as the particle filter [particle1, particle2] and the Kalman filter [kalman1960new] have been found effective in such applications. SORT [SORT] is built on a Kalman motion model and marks a milestone in using motion models for object tracking. Furthermore, as deep networks bring the revolutionary ability to extract high-quality visual features, DeepSORT [DeepSORT] combines deep visual features with motion models and achieves great success. Since then, motion-based object trackers have shown weak competitiveness and attention has shifted toward appearance cues. Even though motion analysis has long been used in object tracking [JDE, FairMOT, bytetrack], all these methods can only handle simple linear motion patterns and provide limited help for multi-object tracking in the more complicated situations we focus on in this work. These factors have induced the dominance of appearance-based tracking in multi-object tracking. However, we argue that a more comprehensive and intelligent tracking algorithm should pay more attention to motion analysis, since appearance is not always reliable.

3 DanceTrack

Figure 2: Some sampled scenes from the proposed DanceTrack dataset. (a) outdoor scenes; (b) low-lighting and distant camera scenes; (c) large group of dancing people; (d) gymnastics scene where the motion is usually even more diverse and people have more aggressive deformation.

DanceTrack is a benchmark for multi-object tracking, which estimates the locations and identities of objects in videos. The objective of proposing this dataset is to provide scenes where objects have uniform appearance and diverse motion.

3.1 Dataset Construction

Dataset design. We focus on scenarios where objects have similar or even identical appearance and diverse motion patterns, including frequent crossover, occlusion, and body deformation. The first property invalidates tracking by purely comparing object appearance, because the extracted visual features are no longer distinguishable across objects. The second property further requires cues other than appearance in tracking, such as motion analysis and temporal dynamics.

We argue that focusing on "crowds" by simply increasing the density of objects of interest is not what we expect. For example, MOT20 [MOT20] contains videos where the groups of pedestrians are very crowded. But as the pedestrians' movement is very regular and the relative positions and occlusion areas stay consistent, such a "crowd" does not pose an obstacle for appearance matching. Therefore, we focus on situations where multiple objects are moving within a relatively large range. The dynamically changing occluded areas and even crossovers are what we are interested in. Such cases are common in the real world, but naive linear motion models can no longer handle them.

Dataset MOT17 [MOT16] MOT20 [MOT20] DanceTrack
Videos 14 8 100
Avg. tracks 96 432 9
Total tracks 1342 3456 990
Avg. len. (s) 35.4 66.8 52.9
Total len. (s) 463 535 5292
FPS 30 25 20
Total images 11,235 13,410 105,855


Table 1: Comparison of dataset meta-information between DanceTrack and its closest benchmarks for multi-human tracking, MOT17 and MOT20. DanceTrack contains many more videos and images than the MOT datasets.

Video collection. To achieve the design goals described above, we collected videos, mostly of group dancing, from the Internet. As shown in Figure 2, the dancers usually wear very similar or even the same clothes, they perform large-range motions, and their poses and relative positions change very frequently. All these properties fit our motivation for proposing a new multi-object tracking dataset. We collected the videos from different search engines with query keywords such as "street dance", "hip-hop dance", "cheerleading dance", "rhythmic gymnastics" and so on. We collect only publicly available videos, under fair use of video resources.

Annotation. We use a commercial tool to annotate the collected videos. The annotated labels include the bounding box and identifier of each object. For a partly-occluded object, a full-body box is annotated. For a fully-occluded object, we do not annotate it; when it re-appears in a future frame, its identifier is kept the same as in the previous frames where it was visible.

To facilitate the annotation process, our tool can automatically propagate the annotated boxes from the previous frame to the current frame, and the annotator only needs to refine the boxes in the current frame. To build a high-quality dataset, the annotations have been checked by another group of people and errors are reported back to the annotators for re-annotation.

3.2 Dataset Statistics

We provide some analytical information about the DanceTrack dataset and compare it with existing multi-object tracking datasets. The statistical information helps to show the uniqueness of the proposed dataset and how we built it into the platform described in the previous parts.

Dataset split. We collected 100 videos in total for the DanceTrack dataset, by default using 40 videos as the training set, 25 as the validation set, and 35 as the test set. During splitting, we keep the distributions of the subsets close in terms of average length, average bounding box number, included scenes and motion diversity. We make the annotations of the training and validation sets public while keeping the test set annotations private for competition use. Some basic information about DanceTrack is shown in Table 1. Compared with the MOT datasets, DanceTrack has a much larger volume (10x more images and 10x more videos). MOT20 focuses on very crowded scenes and therefore has more tracks, but since the appearance of its objects is very distinguishable and their motion is regular, association on MOT20 still requires little motion estimation once reliable detection results are given.

Scene diversity. DanceTrack contains very diverse scenes; some samples are provided in Figure 2. One common point across all videos is that the people in a video usually have very similar appearances. This is designed on purpose to remove the shortcut of tracking by pure appearance matching. DanceTrack contains multiple genres of dance, such as street dance, pop dance, classical dance (ballet, tango, etc.) and large-group dancing. It also contains some sports scenarios such as gymnastics, Chinese Kung Fu and cheerleading. Figure 2(a) shows outdoor scenes, though most included videos are indoor. Figure 2(b) shows some especially hard cases, such as low lighting and distant cameras. Figure 2(c) and (d) show a large group of dancing people, with up to 40 people in a video, and gymnastics, where people show extremely diverse body gestures, frequent pose variation and complicated motion patterns.

Figure 3: (a) Cosine distance of re-ID features. The cosine distance of re-ID features on DanceTrack is lower than on MOT17; in other words, the appearance similarity between different objects is higher. The dashed lines mark the average cosine distance for each dataset. (b) IoU on adjacent frames. DanceTrack scores similarly to MOT17 and MOT20, meaning that the frame rate and object motion speed in DanceTrack remain reasonable. (c) Frequency of relative position switch. This metric measures the frequency of crossover and is highly related to the occlusion between objects. DanceTrack has much more frequent relative position switches than other pedestrian tracking datasets, such as MOT17 and MOT20. Even compared to the driving dataset KITTI, where the moving camera naturally causes many relative position switches, DanceTrack still has a higher frequency.

Appearance similarity. We make a quantitative analysis of how appearance-only matching is no longer reliable on DanceTrack by measuring the appearance similarity among objects. To be precise, we use a pre-trained re-ID model [deepsort_pytorch] to extract the appearance feature $f_i^t$ of object $i$ on frame $t$, then we compute the average cosine distance of the re-ID features among objects in the video as

$$\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t(N_t-1)}\sum_{i=1}^{N_t}\sum_{j\neq i}\big(1-\cos\langle f_i^t, f_j^t\rangle\big), \tag{1}$$

where $T$ is the number of frames in the video sequence, $N_t$ is the number of objects on frame $t$ and $\langle\cdot,\cdot\rangle$ is the angle between two vectors.
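As a concrete reference, the statistic in Eq. (1) can be computed with a few lines of NumPy. This is a minimal sketch, assuming `feats` is a list holding one (N_t, D) array of re-ID features per frame; the function name and the normalization over ordered pairs are our own illustration, not the paper's released code.

```python
import numpy as np

def avg_cosine_distance(feats):
    """Average pairwise cosine distance of re-ID features over a video."""
    per_frame = []
    for f in feats:                      # f: (N_t, D) re-ID features of one frame
        n = len(f)
        if n < 2:
            continue                     # need at least two objects to compare
        f = f / np.linalg.norm(f, axis=1, keepdims=True)   # unit-normalize rows
        cos = f @ f.T                    # pairwise cosine similarities
        dist = (1.0 - cos)[~np.eye(n, dtype=bool)]         # off-diagonal pairs
        per_frame.append(dist.mean())    # normalize by N_t(N_t - 1) ordered pairs
    return float(np.mean(per_frame))     # average over the T frames
```

A lower value indicates that co-existing objects look more alike to the re-ID model, which is exactly the regime DanceTrack targets.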

We compare the object appearance similarity on DanceTrack with that on the MOT17 dataset, as shown in Figure 3(a), where each bin represents one video sequence. It is obvious that the cosine distance of re-ID features on DanceTrack is lower than on MOT17; in other words, the appearance similarity among co-existing objects is higher. This quantitative analysis shows the challenge DanceTrack poses to current popular trackers that use appearance matching for association.

Motion pattern. We introduce two metrics to analyze the motion patterns in the DanceTrack dataset and compare them with other multi-object tracking datasets.

IoU on adjacent frames: a natural measurement of an object's movement range is its bounding-box IoU (Intersection-over-Union) on two adjacent frames. A low IoU indicates fast-moving objects or a low video frame rate. Given a video with $N$ objects and $T$ frames, we denote the $i$-th object's box on the $t$-th frame as $b_i^t$; the averaged IoU on adjacent frames for this video is

$$\frac{1}{N(T-1)}\sum_{i=1}^{N}\sum_{t=1}^{T-1}\mathrm{IoU}\big(b_i^t, b_i^{t+1}\big). \tag{2}$$
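A minimal sketch of Eq. (2), assuming each track is stored as a dict from frame index to an (x1, y1, x2, y2) box; the helper names are illustrative:

```python
def box_iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def adjacent_frame_iou(tracks):
    """Average IoU of each object's boxes on adjacent frames (Eq. 2)."""
    ious = [box_iou(boxes[t], boxes[t + 1])
            for boxes in tracks.values()          # tracks: id -> {frame: box}
            for t in boxes if t + 1 in boxes]
    return sum(ious) / max(len(ious), 1)
```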

Frequency of relative position switch: a metric measuring the diversity of objects' motion from a global view is the frequency with which two objects switch their relative position, either between leftward and rightward or between upward and downward. On the contrary, movement with consistent velocity tends to cause a lower chance of relative position switches. Given a video, the average frequency of relative position switches is defined as

$$\frac{1}{T-1}\sum_{t=1}^{T-1}\sum_{i=1}^{N}\sum_{j\neq i}\mathbb{1}\big(b_i^t, b_j^t, b_i^{t+1}, b_j^{t+1}\big), \tag{3}$$

where $\mathbb{1}(\cdot)$ is an indicator function that equals 1 if the two objects swap their left-right or top-down relative position on the adjacent frames and 0 otherwise. To be precise, we measure relative position by comparing bounding box center locations, and considering that such crossover causes potential trouble only when the objects overlap, we only include pairs of objects whose bounding boxes overlap in the calculation.
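The indicator in Eq. (3) can be sketched as below, under our reading that a pair is counted only when the two boxes overlap and their centers swap horizontal or vertical order between adjacent frames; the box format and names are assumptions:

```python
def center(b):
    return ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)

def overlaps(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_switch(bi_t, bj_t, bi_t1, bj_t1):
    """1 if objects i and j swap left-right or top-down order, else 0."""
    if not (overlaps(bi_t, bj_t) or overlaps(bi_t1, bj_t1)):
        return 0                                   # only overlapping pairs count
    (xi, yi), (xj, yj) = center(bi_t), center(bj_t)
    (xi1, yi1), (xj1, yj1) = center(bi_t1), center(bj_t1)
    lr_swap = (xi - xj) * (xi1 - xj1) < 0          # left-right order flipped
    td_swap = (yi - yj) * (yi1 - yj1) < 0          # top-down order flipped
    return int(lr_swap or td_swap)
```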

From the results shown in Figure 3(b), we can see that DanceTrack and the MOT datasets have close average IoU on adjacent frames. This indicates that DanceTrack is harder than the MOT datasets not because of a lower frame rate or unreasonably fast object movement.

On the other hand, Figure 3(c) shows that DanceTrack has much more frequent relative position switches than other datasets such as KITTI, MOT17 and MOT20. These frequent switches are caused by highly non-linear motion patterns and result in frequent crossover and inter-object occlusion. This result shows that the challenge of DanceTrack comes from the diversity of motion.


 


Association cues      MOT17                              DanceTrack (Proposed Dataset)
                      HOTA  DetA  AssA  MOTA  IDF1       HOTA  DetA  AssA  MOTA  IDF1
IoU only              98.1  98.9  97.3  98.0  97.8       72.8  98.9  53.6  98.7  63.5
IoU + Motion          96.4  97.1  95.8  99.7  98.1       69.4  87.9  54.8  99.4  71.3
Appearance only       95.0  94.7  95.4  99.3  98.8       59.7  82.5  43.2  97.2  60.5
Appearance + IoU      93.3  99.0  87.9  98.9  90.9       68.0  97.7  47.4  97.9  58.7

 


Table 2: Oracle analysis of different association models on the MOT17 and DanceTrack validation sets, respectively. The detection boxes are ground-truth boxes. The comparison shows the evidently increased difficulty of performing multi-object tracking on DanceTrack compared to the MOT17 dataset.

3.3 Evaluation Metrics

For a long time, the multi-object tracking community used MOTA as the main evaluation metric. Recently, however, the community realized that MOTA focuses too much on detection quality rather than association quality. Higher Order Tracking Accuracy (HOTA) [HOTA] was thus proposed to correct this historical bias, and it has since been used as the main metric to evaluate tracking quality on multiple popular benchmarks such as BDD100K [BDD] and KITTI [KITTI]. We follow this setting for the evaluation metrics of DanceTrack.

In our protocol, the main metric is HOTA. We also use AssA and the IDF1 score to measure association performance, and DetA and MOTA for detection quality. For the detailed definitions of these metrics, we refer to [mota, idf1, HOTA]. To make fine-grained analysis convenient, the evaluation tools also provide previously widely-used statistics, such as False Positives (FP), False Negatives (FN) and ID switches (IDs).
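For orientation, HOTA at a localization threshold decomposes into detection and association accuracy; per the HOTA paper [HOTA], the score at threshold $\alpha$ is their geometric mean, and the final score averages over a set of thresholds $\mathcal{A}$ (e.g., 0.05 to 0.95):

$$\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\cdot \mathrm{AssA}_{\alpha}}, \qquad \mathrm{HOTA} = \frac{1}{|\mathcal{A}|}\sum_{\alpha\in\mathcal{A}}\mathrm{HOTA}_{\alpha}.$$

This decomposition is why we report DetA and AssA alongside HOTA: it separates how well objects are found from how well they are linked.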

3.4 Limitation

In this part, we discuss some recognized limitations of the proposed DanceTrack dataset. We emphasize again that we propose this dataset to provide a platform for more comprehensive multi-object tracking studies beyond the currently popular genre of combining a detector and re-ID. Still, the dataset has limitations. First, given the stated motivation, we do not provide an algorithm that substantially outperforms previous multi-object tracking algorithms; we leave this as an open question for future study. Besides, we believe that for the cases emphasized in this work, annotations of human pose or segmentation masks would be valuable for more fine-grained study. Limited by time and resources, we only provide bounding box annotations in this version.

4 Experiments

4.1 Experiment Setup

Dataset configurations. We compare DanceTrack with its closest dataset, MOT17. For MOT17, because the test server is not easily accessible, we follow the train-val split provided in CenterTrack [CenterTrack] and evaluate on the validation subset. For DanceTrack, we follow the default split described in the previous section: we train on the training subset and evaluate on the test subset.

Model configuration. Unless specified otherwise, we inherit the default training settings of the investigated algorithms provided in the original papers or the officially released codebases. For MOT17 and DanceTrack, algorithms use shared configurations and hyperparameter settings.

4.2 Oracle Analysis

To decompose the analysis over object localization and association, we perform an oracle analysis: we feed ground-truth bounding boxes to different association algorithms to obtain the expected upper-bound performance. This analysis helps us understand the true bottleneck of tracking on different datasets. To be precise, we try IoU matching, motion modeling and appearance similarity for the association, and run experiments on MOT17 and DanceTrack respectively. The results are shown in Table 2. We use a pre-trained re-ID model [deepsort_pytorch] for appearance matching and a Kalman filter [kalman1960new] for motion modeling under a linear motion assumption. IoU matching is simply performed by calculating the IoU of objects' bounding boxes in adjacent frames. From the results, the tracking output is close to perfect in terms of all metrics on MOT17. Interestingly, using only IoU matching achieves the best performance, which proves that MOT17 contains objects with simple and regular motion patterns, and that the bottleneck does not lie in association in most cases.
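A minimal sketch of such an IoU-matching oracle, reusing the `box_iou` helper sketched in Section 3.2: boxes of adjacent frames are matched by maximizing IoU with the Hungarian algorithm. The threshold value is an illustrative assumption, not the exact setting behind Table 2.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_associate(prev_boxes, curr_boxes, iou_thresh=0.3):
    """Return matched (prev_idx, curr_idx) pairs between adjacent frames."""
    cost = np.zeros((len(prev_boxes), len(curr_boxes)))
    for i, p in enumerate(prev_boxes):
        for j, c in enumerate(curr_boxes):
            cost[i, j] = -box_iou(p, c)      # negate: the solver minimizes cost
    rows, cols = linear_sum_assignment(cost) # Hungarian matching
    return [(i, j) for i, j in zip(rows, cols) if -cost[i, j] >= iou_thresh]
```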

Figure 4: Visualization of re-ID features from sampled videos in the MOT17 and DanceTrack datasets using t-SNE [tsne]. The same object is coded in the same color. For better visualization, we only select the first 200 frames of each video sequence. The results show that object appearance is much more distinguishable on MOT17 than on DanceTrack, which provides a shortcut for tracking on MOT17 by appearance matching alone.
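A Figure 4 style plot can be reproduced with off-the-shelf t-SNE; a minimal sketch, assuming `feats` is an (N, D) array of re-ID features from the first 200 frames and `ids` gives the identity of each feature (sklearn's defaults stand in for the exact setup used here):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_reid_tsne(feats, ids):
    xy = TSNE(n_components=2, init="pca", random_state=0).fit_transform(feats)
    plt.scatter(xy[:, 0], xy[:, 1], c=ids, cmap="tab20", s=5)  # color = identity
    plt.title("re-ID features (t-SNE)")
    plt.show()
```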

 


Methods                    MOT17                              DanceTrack (Proposed Dataset)
                           HOTA  DetA  AssA  MOTA  IDF1      HOTA  DetA  AssA  MOTA  IDF1
CenterTrack [CenterTrack] 52.2 53.8 51.0 67.8 64.7 41.8 78.1 22.6 86.8 35.7
FairMOT [FairMOT] 59.3 60.9 58.0 73.7 72.3 39.7 66.7 23.8 82.2 40.8
QDTrack [quasidense] 53.9 55.6 52.7 68.7 66.3 45.7 72.1 29.2 83.0 44.8
TransTrack [Transtrack] 54.1 61.6 47.9 75.2 63.5 45.5 75.9 27.5 88.4 45.2
TraDes [TraDeS] 52.7 55.2 50.8 69.1 63.9 43.3 74.5 25.4 86.2 41.2
MOTR [MOTR] 55.1 56.2 54.2 67.4 67.0 48.4 71.8 32.7 79.2 46.1
ByteTrack[bytetrack] 63.1 64.5 62.0 80.3 77.3 47.7 71.0 32.1 89.6 53.9

 


Table 3: Tracking performance of the investigated algorithms on the MOT17 and DanceTrack test sets, respectively. The comparison shows the evidently increased difficulty of performing multi-object tracking on DanceTrack compared to the MOT17 dataset. To be precise, detection on DanceTrack is easier (higher MOTA and DetA scores), yet tracking performance drops significantly compared to MOT17 (lower HOTA, AssA and IDF1 scores). This phenomenon reveals that the bottleneck of multi-object tracking on DanceTrack is the association part.

On the other hand, using only IoU matching on DanceTrack gives much lower performance than on MOT17. Given that the DetA and MOTA scores are already close to 100, the bottleneck is obviously in the association part: all association metric scores experience a dramatic drop compared with those on MOT17. Besides, the best performance comes from IoU matching alone; combining a linear motion model or additional appearance information does not help. When using appearance similarity, all metrics are worse than when not using any appearance cue. This is because the objects in DanceTrack videos usually have indistinguishable appearances, so appearance matching has negative effects in some cases. In Figure 4, we visualize the appearance features of objects extracted from DanceTrack and MOT17 videos respectively. We can observe that the appearance features of different objects are well separated in the feature space on MOT17 but highly entangled on DanceTrack. This qualitatively evidences the highly similar appearance of objects in the proposed DanceTrack dataset.

Given the results of the analysis with oracle object localization, we reach a clear conclusion: existing datasets carry a heavy bias toward detection quality, and the simple trajectory patterns they involve limit the study of association. On the contrary, DanceTrack poses a much higher requirement on developing multi-object trackers with improved association ability. Considering that the scenarios included in DanceTrack are what we experience in real life, we believe it is meaningful to provide such a platform.

4.3 Benchmark Results

We benchmark current state-of-the-art multi-object tracking algorithms on MOT17 and DanceTrack. The evaluation is performed in the "private setting", in which the algorithm must perform both detection and association. The benchmark results are reported in Table 3. In terms of the tracking quality measured by HOTA, IDF1 and AssA, all algorithms show a significant performance gap between MOT17 and DanceTrack, and their performance on DanceTrack is far from satisfactory. On the other hand, the detection quality metrics, MOTA and DetA, of all algorithms are in fact higher on DanceTrack than on MOT17. This suggests that detection is not the bottleneck for good tracking performance on DanceTrack and further highlights the drop in association quality. The challenge of the proposed dataset is to make associations robust to the uniform appearance and diverse motion of objects.

Association HOTA DetA AssA MOTA IDF1
IoU 44.7 79.6 25.3 87.3 36.8
SORT[SORT] 47.8 74.0 31.0 88.2 48.3
DeepSORT[DeepSORT] 45.8 70.9 29.7 87.1 46.8
MOTDT[MOTDT] 39.2 68.8 22.5 84.3 39.6
BYTE[bytetrack] 47.1 70.5 31.5 88.2 51.9
Table 4: Comparison of different association algorithms on DanceTrack validation set. The detection results are output by YOLOX [YOLOX], trained on the DanceTrack training set.

4.4 Association Strategy

In the previous section, most methods entangle the detection and tracking modules. To study association algorithms independently, we use the recently developed YOLOX [YOLOX] detector for object detection on DanceTrack and run different object association algorithms on its outputs. The results are shown in Table 4.

Figure 5: Visualization of adding more information beyond bounding boxes on DanceTrack. Tracks are coded by color. The 1st, 2nd and 3rd columns are frames 20, 120 and 200 of the DanceTrack0007 video sequence, respectively. The 1st row shows ground-truth boxes and identities.
Data                Ass.      HOTA         DetA         AssA         MOTA         IDF1
DanceTrack          box       36.9         63.6         21.6         78.8         39.2
+ COCOmask [COCO]   box       38.1 (+1.2)  64.5 (+0.9)  22.6 (+1.0)  80.6 (+1.8)  40.3 (+1.1)
+ COCOmask          + mask    39.2 (+1.1)  64.9 (+0.4)  23.9 (+1.3)  80.7 (+0.1)  41.6 (+1.3)

DanceTrack          box       36.9         63.6         21.6         78.8         39.2
+ COCOpose [COCO]   box       40.6 (+3.7)  65.5 (+1.9)  25.3 (+3.7)  82.9 (+4.1)  42.9 (+3.7)
+ COCOpose          + pose    41.0 (+0.4)  65.9 (+0.4)  25.6 (+0.3)  83.1 (+0.3)  43.9 (+1.0)

DanceTrack          box       36.9         63.6         21.6         78.8         39.2
+ KITTI [KITTI]     box       34.4 (-2.5)  57.8 (-5.8)  20.7 (-0.9)  72.9 (-5.9)  38.5 (-0.7)
+ KITTI             + depth   35.1 (+0.7)  57.3 (-0.5)  21.6 (+0.9)  72.8 (-0.1)  40.2 (+1.7)


Table 5: Ablation study on adding more information beyond bounding boxes, on the DanceTrack validation set. All experiments are based on the CenterNet [CenterNet] model and BYTE [bytetrack] association. (a) Segmentation masks improve tracking performance on DanceTrack. (b) Pose information boosts tracking performance by an even larger margin than segmentation masks. (c) Although adding depth information to the association shows a slightly positive influence, the results still suffer from the domain shift between KITTI and DanceTrack.

SORT [SORT] uses a Kalman filter to model object trajectories, and DeepSORT [DeepSORT] adds appearance matching. Compared to SORT, DeepSORT shows no performance boost but rather worse performance, suggesting a negative gain from appearance matching. MOTDT [MOTDT] uses the tracking result to help detect bounding boxes; but detection is already strong on the DanceTrack dataset and the true bottleneck is association, so MOTDT's design yields even worse performance on both detection quality and association quality. Lastly, BYTE [bytetrack] uses a high-tolerance strategy to select detection results for the association stage, aiming to decrease tracklet fragmentation. With this strategy, BYTE shows the best association performance in terms of the IDF1 and AssA metrics. This again reveals that DanceTrack is not a strict challenge for modern deep object detectors; the true challenge lies in the object association part.
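A minimal sketch of BYTE's two-stage idea [bytetrack], reusing the `iou_associate` helper sketched in Section 4.2; the score thresholds are illustrative assumptions:

```python
def byte_style_associate(track_boxes, det_boxes, det_scores, hi=0.6, lo=0.1):
    """Stage 1: match confident detections; stage 2: recover leftover tracks
    with low-score detections instead of discarding them."""
    hi_dets = [d for d, s in zip(det_boxes, det_scores) if s >= hi]
    lo_dets = [d for d, s in zip(det_boxes, det_scores) if lo <= s < hi]
    matches_hi = iou_associate(track_boxes, hi_dets)
    unmatched = [t for t in range(len(track_boxes))
                 if t not in {i for i, _ in matches_hi}]
    matches_lo = iou_associate([track_boxes[t] for t in unmatched], lo_dets)
    # indices in matches_lo refer to `unmatched` and `lo_dets`, respectively
    return matches_hi, [(unmatched[i], j) for i, j in matches_lo]
```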

4.5 Analysis of More Modalities

Considering the high MOTA and DetA scores on DanceTrack, the limited overall performance is a failure of trackers rather than detectors. To boost performance, a straightforward strategy is to add cues other than bounding boxes. Since DanceTrack contains only bounding box and identity annotations, we propose joint training with other datasets, e.g., COCO [COCO] and KITTI [KITTI], to enable the model to output more modalities, including segmentation masks, pose and depth. All models are based on CenterNet [CenterNet]. If an additional modality is used beyond the bounding box, we add a corresponding head after the backbone network.

Does fine-grained representation help? We investigate the influence of adding segmentation masks to the model. The training data is a combination of the DanceTrack training set and COCO masks [COCO]. If the input image is from DanceTrack, we set its mask loss to 0. During inference, the matching metric is a weighted sum of bounding box IoU and mask IoU. From the results in Table 5, we find a performance boost from using segmentation masks. We believe this can be explained by two reasons. First, the introduction of more fine-grained annotations makes training more robust, as observed in multi-task learning. Second, for crowded and occluded situations, the segmentation mask is a more reliable form of information than bounding boxes: from the mask, we can expect to extract more accurate object identification information for the association task.
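A minimal sketch of the fused matching metric described above, with the weight and the boolean-mask representation as illustrative assumptions (`box_iou` is the helper from Section 3.2):

```python
import numpy as np

def mask_iou(m1, m2):
    """IoU of two boolean masks of equal spatial size."""
    inter = np.logical_and(m1, m2).sum()
    union = np.logical_or(m1, m2).sum()
    return inter / (union + 1e-9)

def fused_similarity(box1, box2, m1, m2, w=0.5):
    """Weighted sum of bounding box IoU and mask IoU used for matching."""
    return w * box_iou(box1, box2) + (1 - w) * mask_iou(m1, m2)
```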

Besides masks, another modality is human pose information. The training data is a combination of the DanceTrack training set and COCO human pose [COCO]. If the input image is from DanceTrack, we set its pose loss to 0. During inference, the matching metric is a weighted sum of bounding box IoU and Object Keypoint Similarity (OKS) [COCO]. The results are shown in Table 5. Adding pose information in training boosts model performance on DanceTrack even more, and using the output pose in the association further helps achieve better tracking results. A potential reason is that when most of a human body is occluded, a segmentation model usually cannot provide reliable output, while a pose estimation model focusing on specific body keypoints usually shows higher robustness.
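For reference, OKS as defined for COCO [COCO] is a Gaussian over keypoint distances scaled by object area and per-keypoint constants; a minimal sketch, where the `kpt_sigmas` values are treated as inputs rather than the official COCO constants:

```python
import numpy as np

def oks(kp1, kp2, area, kpt_sigmas, visible):
    """kp1, kp2: (K, 2) keypoint arrays; visible: (K,) boolean mask."""
    d2 = ((kp1 - kp2) ** 2).sum(axis=1)              # squared distances
    e = d2 / (2.0 * area * kpt_sigmas ** 2 + 1e-9)   # per-keypoint exponent
    return float(np.exp(-e)[visible].mean())         # average over visible kpts
```

In the association stage, this score would take the place of `mask_iou` in the weighted sum sketched above.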

Does depth information help? We try to use additional depth information to help tracking on DanceTrack. The training data is a combination of the DanceTrack training set and KITTI [KITTI] 3D boxes. If the input image is from DanceTrack, we set all losses related to the 3D box to 0. During inference, we directly use the camera parameters of the KITTI dataset, and the matching metric is a weighted sum of bounding box IoU and depth similarity. The results are shown in Table 5. In contrast to the COCO segmentation mask and human pose, depth information learned from the KITTI dataset does not increase performance on DanceTrack. Our explanation is that the COCO segmentation and pose estimation datasets contain humans as the main category, while KITTI mainly contains vehicle instances. The object and scene priors thus differ between DanceTrack and KITTI, and this domain shift degrades the model. Nevertheless, depth information does help association performance if we take the model jointly trained on DanceTrack and KITTI as the baseline. Limited by the available depth-annotated data, however, this is the best we could try for now. We expect more study on the influence of depth information for associating objects with uniform appearance and diverse motion.

Motion HOTA DetA AssA MOTA IDF1
None (IoU) 34.9 68.2 18.0 77.0 31.7
Kalman filter[SORT] 37.2 62.4 22.3 77.4 39.9
LSTM[DEFT] 38.8 67.8 22.4 78.7 38.1
Table 6: Comparison of different motion models on DanceTrack validation set. The detection results are output by CenterNet [CenterNet], trained on the DanceTrack training set.

Does temporal dynamics help? As shown in Table 6, we use different motion models to introduce temporal dynamics into the tracking process to facilitate better association. Both the Kalman filter [SORT] and an LSTM [DEFT] outperform naive IoU association (without temporal dynamics) by a large margin, indicating the great potential of motion models in tracking objects, especially when appearance cues are not reliable. Given the relatively slow progress of motion modeling, we expect to see more advanced motion models in the field of multi-object tracking.
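A minimal sketch of the kind of linear motion model discussed here: a constant-velocity Kalman filter over a box center, in the spirit of SORT [SORT]. The state layout and noise scales are illustrative assumptions, and a full tracker would also filter the box scale and aspect ratio.

```python
import numpy as np

class CVKalman:
    """Constant-velocity Kalman filter over a box center (cx, cy)."""
    def __init__(self, cx, cy, q=1.0, r=1.0):
        self.x = np.array([cx, cy, 0.0, 0.0])     # state: [cx, cy, vx, vy]
        self.P = np.eye(4) * 10.0                 # state covariance
        self.F = np.eye(4)
        self.F[0, 2] = self.F[1, 3] = 1.0         # x += vx, y += vy per frame
        self.H = np.eye(2, 4)                     # we observe the center only
        self.Q, self.R = np.eye(4) * q, np.eye(2) * r

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]                         # predicted center

    def update(self, z):                          # z: observed (cx, cy)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ (np.asarray(z) - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P
```

The predicted center can feed the same IoU or center-distance matching as above; an LSTM-based model like [DEFT] replaces the linear transition with a learned one.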

From the study above, we know that more modalities can help boost tracking performance on DanceTrack, especially those from similar data distributions [jaimes2007multimodal, yu2017multi, fang2018pairwise]. Given the limitation discussed in Section 3.4, that DanceTrack only provides bounding box annotations for now, several directions would make interesting future work: (1) extending its annotation modalities, (2) using weakly-supervised learning [tian2021boxinst, wang2019distill, zhou2017towards] to estimate other modalities, and (3) using transfer learning and domain adaptation [cao2019cross, li2019bidirectional, atapour2018real] to transfer knowledge of other modalities from other data domains to our benchmark.

5 Conclusion

In this paper, we propose a new multi-object tracking dataset called DanceTrack. Objects in DanceTrack have uniform appearance and diverse motion patterns, preventing the task from being shortcut by re-ID alone. The motivation behind it is to reveal the bias of existing datasets, which tend to emphasize detection quality and appearance matching only, leaving other cues for associating objects underexplored. We believe the ability to analyze complex motion patterns is necessary for building a more comprehensive and intelligent tracker. DanceTrack provides a platform to encourage future work along this line.

6 Acknowledgement

We would like to thank the annotation teams and coordinators. We would also like to thank Xinshuo Weng and Yifu Zhang for valuable discussion and suggestions, and Vivek Roy, Pedro Morgado and Shuyang Sun for proofreading.

References