
AnimalTrack: A Large-scale Benchmark for Multi-Animal Tracking in the Wild

Multi-animal tracking (MAT), a multi-object tracking (MOT) problem, is crucial for animal motion and behavior analysis and has many important applications in biology, ecology, animal conservation and so forth. Despite its importance, MAT is largely under-explored compared to other MOT problems such as multi-human tracking, owing to the scarcity of large-scale benchmarks. To address this problem, we introduce AnimalTrack, a large-scale benchmark for multi-animal tracking in the wild. Specifically, AnimalTrack consists of 58 sequences from a diverse selection of 10 common animal categories. On average, each sequence comprises 33 target objects for tracking. To ensure high quality, every frame in AnimalTrack is manually labeled with careful inspection and refinement. To the best of our knowledge, AnimalTrack is the first benchmark dedicated to multi-animal tracking. In addition, to understand how existing MOT algorithms perform on AnimalTrack and to provide baselines for future comparison, we extensively evaluate 14 representative state-of-the-art trackers. The evaluation results demonstrate that, not surprisingly, most of these trackers degrade due to the differences between pedestrians and animals in various aspects (e.g., pose, motion, appearance, etc.), and that more effort is desired to improve multi-animal tracking. We hope that AnimalTrack, together with the evaluation and analysis, will foster further progress on multi-animal tracking. The dataset, evaluation, and analysis will be made available upon acceptance.


1 Introduction

In this paper, we are interested in multi-animal tracking (MAT), a typical kind of multi-object tracking (MOT) that remains heavily under-explored. MAT is critical for understanding and analyzing animal motion and behavior, and thus has a wide range of applications in zoology, biology, ecology, animal conservation and so forth. Despite its importance, MAT is less studied in the tracking community.

Currently, the MOT community mainly focuses on pedestrian and vehicle tracking, with numerous benchmarks introduced in recent years Milan et al. (2016); Geiger et al. (2012); Zhu et al. (2021); Dendorfer et al. (2020). Compared with MOT on pedestrians and vehicles, MAT is more challenging because of several unique properties of animals:

Figure 1: Comparison of MOT on vehicles, pedestrians and animals. Image (a) shows multi-vehicle tracking from KITTI Geiger et al. (2012), image (b) multi-pedestrian tracking from MOT17 Milan et al. (2016) and image (c) multi-animal tracking from the proposed AnimalTrack (please note that we only show part of the targets in each image for simplicity). We can observe that animals are more difficult to distinguish than vehicles and pedestrians due to their uniform appearance. Best viewed in color and by zooming in for all figures in this paper.
  • Uniform appearance. Different from pedestrians and vehicles in existing MOT benchmarks that usually have distinguishable appearances (e.g., color, texture, etc.), most animals have uniform appearances and look visually extremely similar (see Fig. 1 for example). As a consequence, it is difficult to distinguish different animals from their visual features alone using regular association (e.g., re-identification) models.

  • Diverse pose. Animals often exhibit more diverse poses than humans in a video sequence. For example, a goose may walk or run on the ground, swim in water, or fly in the air, leading to significantly different poses. By contrast, most pedestrians simply walk in a video. The diverse poses of animals may complicate detector design for tracking.

  • Complex motion. Compared to humans and vehicles that usually have regular motion patterns, animals have larger-range motions due to their diverse poses. For example, animals may frequently switch from flying to swimming, or vice versa. These complicated motion patterns impose higher requirements on motion modeling for animal tracking than for human or vehicle tracking.

The above properties of animals introduce technical difficulties for MAT, making it a less-explored problem. In addition, another, more important reason why MAT is under-explored is the scarcity of benchmarks. Benchmarks play a crucial role in advancing multi-object tracking. As a platform, they allow researchers to develop their algorithms and to fairly assess, compare and analyze different approaches for improvement. Currently, there exist many datasets Milan et al. (2016); Geiger et al. (2012); Zhu et al. (2021); Dendorfer et al. (2020); Bai et al. (2021); Du et al. (2018); Dave et al. (2020) for MOT on different subjects in various scenarios. Nevertheless, there is no available benchmark dedicated to multi-animal tracking. Although some of these datasets (e.g., Bai et al. (2021); Dave et al. (2020)) contain video sequences involving animal targets, they are limited in either video quantity and animal categories Bai et al. (2021) or the number of animal tracklets Dave et al. (2020), which makes them less than ideal platforms for studying MAT. In order to facilitate MOT on animals, a dedicated benchmark is urgently required for both designing and evaluating MAT algorithms.

Contribution. Thus motivated, in this paper we take the first step toward studying the MAT problem by introducing AnimalTrack, a large-scale benchmark dedicated to multi-animal tracking in the wild. Specifically, AnimalTrack consists of 58 video sequences selected from 10 common animal categories. On average, each video sequence contains 33 animals for tracking. There are more than 24.7K frames in total in AnimalTrack, and every frame is manually labeled with multiple axis-aligned bounding boxes. Careful inspection and refinement are performed to ensure high-quality annotations. To the best of our knowledge, AnimalTrack is the first benchmark dedicated to the task of MAT.

In addition, with the goal of understanding how existing MOT algorithms perform on the newly developed AnimalTrack and guiding future improvements, we extensively evaluate 14 popular state-of-the-art MOT algorithms and conduct in-depth analysis of their results. Not surprisingly, we observe that most of these trackers, designed for pedestrian or vehicle tracking, degrade considerably when directly applied to animal tracking on AnimalTrack because of the aforementioned properties of animals. We hope that this evaluation and analysis can offer baselines for future comparison on AnimalTrack and provide guidance for tracking algorithm design.

Besides the analysis of the overall performance of different tracking algorithms, we also independently study the association techniques that are indispensable for current multi-object tracking. In particular, we compare and analyze several popular association strategies. This analysis is expected to provide guidance for future research when choosing an appropriate association baseline for improvement.

In summary, we make the following contributions: (i) We introduce AnimalTrack, which is, to the best of our knowledge, the first large-scale benchmark dedicated to multi-animal tracking. (ii) We extensively evaluate 14 representative state-of-the-art MOT approaches to provide baselines for future comparison on AnimalTrack. (iii) We conduct in-depth analysis of the evaluations of existing approaches, offering guidance for future algorithm design.

By releasing AnimalTrack, we hope to boost future research and applications of multi-animal tracking. Our project with data and evaluation results will be made publicly available upon acceptance of this work.

The rest of this paper is organized as follows. Sec. 2 discusses trackers and benchmarks related to this work. Sec. 3 describes the proposed AnimalTrack in detail. Sec. 4 presents evaluations on AnimalTrack, followed by the conclusion in Sec. 5.

2 Related Work

MAT belongs to the family of MOT problems. In this section, we discuss MOT algorithms and existing benchmarks related to AnimalTrack. Besides, we also briefly review other animal-related vision benchmarks.

2.1 Multi-Object Tracking Algorithms

MOT is a fundamental problem in computer vision and has been actively studied for decades. In this subsection, we briefly review some representative works and refer readers to recent surveys Ciaparrone et al. (2020); Emami et al. (2020); Luo et al. (2021) for more tracking algorithms.

One popular paradigm is Tracking-by-Detection, which decomposes MOT into two subtasks: detecting objects Ren et al. (2015); Lin et al. (2017) in each frame and then associating detections of the same target across frames to generate trajectories using optimization techniques (e.g., the Hungarian algorithm Bewley et al. (2016) and the network flow algorithm Dehghan et al. (2015)). Within this framework, numerous approaches have been introduced Bewley et al. (2016); Wojke et al. (2017); Chu et al. (2019); Shuai et al. (2021); Tang et al. (2017); Zhu et al. (2018); Xu et al. (2019); Yin et al. (2020). In order to improve data association in MOT, some other works propose to directly incorporate the optimization solvers used in association into learning Chu and Ling (2019); Xu et al. (2020); Brasó and Leal-Taixé (2020); Schulter et al. (2017); Dai et al. (2021), which benefits tracking performance through end-to-end learning in deep networks. A minimal sketch of the association step is shown below.
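To make the association step concrete, the following sketch matches existing tracks to new detections using an IoU cost and the Hungarian algorithm, the core idea behind trackers such as SORT Bewley et al. (2016); the box format and function names are illustrative, not the exact implementation of any evaluated tracker.

```python
# Minimal sketch of one association step in tracking-by-detection.
# Boxes are axis-aligned (x, y, w, h); names are illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-Union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Match existing tracks to new detections with the Hungarian algorithm."""
    if len(track_boxes) == 0 or len(det_boxes) == 0:
        return [], list(range(len(track_boxes))), list(range(len(det_boxes)))
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, t in enumerate(track_boxes):
        for j, d in enumerate(det_boxes):
            cost[i, j] = 1.0 - iou(t, d)            # low cost = high overlap
    rows, cols = linear_sum_assignment(cost)         # minimize total cost
    matches = [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= 1.0 - iou_threshold]
    matched_t = {i for i, _ in matches}
    matched_d = {j for _, j in matches}
    unmatched_tracks = [i for i in range(len(track_boxes)) if i not in matched_t]
    unmatched_dets = [j for j in range(len(det_boxes)) if j not in matched_d]
    return matches, unmatched_tracks, unmatched_dets
```

Unmatched detections typically start new tracks, while unmatched tracks are kept alive for a few frames before being terminated.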

In addition to the Tracking-by-Detection framework, another MOT architecture, Joint-Detection-and-Tracking, has recently drawn increasing attention in the community due to its efficiency and simplicity. This framework learns to detect and associate target objects at the same time, largely simplifying the MOT pipeline. Many efficient approaches Bergmann et al. (2019); Lu et al. (2020); Zhou et al. (2020); Wang et al. (2020); Zhang et al. (2021b); Liang et al. (2022) have been proposed based on this architecture. More recently, motivated by the power of the Transformer Vaswani et al. (2017), the attention mechanism has been introduced for MOT Sun et al. (2020); Meinhardt et al. (2022) and demonstrates state-of-the-art performance.

2.2 Multi-Object Tracking Benchmarks

Benchmarks are important for the development of MOT. In recent years, many benchmarks have been proposed.

PETS2009. PETS2009 Ferryman and Shahrokni (2009) is one of the earliest benchmarks for MOT. It contains 3 video sequences for pedestrian tracking.

KITTI. KITTI Geiger et al. (2012) is introduced for autonomous driving. It comprises 50 video sequences and focuses on tracking pedestrians and vehicles in traffic scenarios. Besides 2D MOT, KITTI also supports 3D MOT.

UA-DETRAC. UA-DETRAC Wen et al. (2020) includes 100 challenging sequences captured from real-world traffic scenes. This dataset provides rich annotations for multi-object tracking such as illumination, occlusion, truncation ratio, vehicle type and bounding box.

MOTChallenge. MOTChallenge Dendorfer et al. (2021) contains a series of benchmarks. The first version, MOT15 Leal-Taixé et al. (2015), consists of 22 sequences for tracking. Due to the low difficulty of videos in MOT15, MOT16 Milan et al. (2016) compiles 14 new and more challenging sequences. MOT17 Milan et al. (2016) uses the same videos as MOT16 but improves the annotation and applies a different evaluation system. Later, MOT20 Dendorfer et al. (2020) is presented with 8 new sequences, aiming at MOT in crowded scenes.

MOTS. MOTS Voigtlaender et al. (2019) is a newly introduced dataset for multi-object tracking. In addition to 2D bounding boxes, MOTS also provides a pixel-level mask for each target, aiming at simultaneous tracking and segmentation.

BDD100K. BDD100K Yu et al. (2020) is recently proposed for video understanding in traffic scenes. It provides multiple tasks including multi-object tracking.

TAO. TAO Dave et al. (2020) is a large-scale dataset for tracking any object. It consists of 2,907 videos from 833 categories. TAO sparsely labels objects every 30 frames, and its average number of trajectories per video is 6.

GMOT-40. GMOT-40 Bai et al. (2021) is a recently proposed benchmark that aims at one-shot MOT. It consists of 40 sequences from 10 categories. Each sequence provides one instance for tracking multiple targets of the same class.

UAVDT-MOT. UAVDT-MOT Du et al. (2018) consists of 100 challenging videos captured with a drone. These videos mainly cover pedestrians and vehicles for tracking. The goal of UAVDT-MOT is to facilitate multi-object tracking in aerial views.

VisDrone-MOT. Similar to UAVDT-MOT, VisDrone-MOT Zhu et al. (2021) also focuses on MOT with drones. The difference is that VisDrone-MOT introduces more object categories, making it more challenging.

DanceTrack. DanceTrack Sun et al. (2022) is a large-scale benchmark with 100 videos. The aim of DanceTrack is to explore multi-human tracking in uniform appearance and diverse motion.

Different from the above datasets for MOT on pedestrians, vehicles or other subjects, AnimalTrack focuses on dense multi-animal tracking in the wild. Although some of the benchmarks (e.g., TAO Dave et al. (2020) and GMOT-40 Bai et al. (2021)) contain animal targets for tracking, they have limitations for MAT. For TAO Dave et al. (2020), the average number of trajectories per video is 6, and it is even lower (4) for animal videos. Nevertheless, in practice in the wild, it is very common to see objects moving in a dense group. The sparse trajectories in TAO may limit its usage for such dense tracking. In addition, TAO is sparsely annotated every 30 frames, making it difficult for trackers to learn temporal motion. Despite containing several animal videos, GMOT-40 Bai et al. (2021) is limited in animal categories (4 classes) and video quantity (12 in total). Besides, GMOT-40 has a different aim of one-shot MOT, and thus no training data is provided. Compared to TAO Dave et al. (2020) and GMOT-40 Bai et al. (2021), AnimalTrack is dense in trajectories and annotation (i.e., per-frame manual annotation) as well as diverse in animal classes.

We are also aware that there exist a few datasets Khan et al. (2004); Betke et al. (2007); Bozek et al. (2018) for animal tracking. However, these datasets are usually small (e.g., with 1 or 2 video sequences) and limited to specific animal categories (e.g., Khan et al. (2004) for ants, Betke et al. (2007) for bats, Bozek et al. (2018) for bees), and therefore may not be suitable for large-scale animal tracking in the deep learning era. Unlike these animal tracking datasets, our AnimalTrack has more classes and more videos.

2.3 Other Animal-Related Vision Benchmarks

Our AnimalTrack is also related to many other animal-related vision benchmarks outside MOT. The work of Cao et al. (2019) introduces a large-scale benchmark for animal pose estimation, which is later extended by Yu et al. (2021) by adding more images and further increasing the number of categories. In Mathis et al. (2021), the authors introduce a benchmark dedicated to horse pose estimation. The work of Bala et al. (2020) proposes a 3D animal pose estimation benchmark. The work of Parham et al. (2018) presents a new dataset for animal localization in the wild. A benchmark for tiger re-identification is proposed in Li et al. (2019). In Iwashita et al. (2014), the authors build a benchmark for animal activity recognition in videos. Different from these benchmarks, the proposed AnimalTrack focuses on multi-animal tracking.

3 AnimalTrack

3.1 Design Principle

AnimalTrack aims to provide the community with a new dedicated platform for studying MOT on animals. In particular, in the deep learning era, it targets both large-scale training and evaluation of deep trackers. To this end, we follow three principles in constructing AnimalTrack:

  • Large-scale. One motivation behind AnimalTrack is to provide a large-scale benchmark for animal tracking. In particular, current deep models usually require a large amount of data for training. Bearing this in mind, we aim to compile at least 50 video sequences with at least 20K frames in AnimalTrack.

  • High-quality dense annotations. The annotations of a benchmark are crucial for both algorithm development and evaluation. To this end, we provide per-frame manual annotations for every sequence of AnimalTrack to ensure high annotation quality, which is different from many MOT benchmarks that provide only sparse annotations.

  • Dense trajectories. In the real world, it is common to see animals moving in a dense group. AnimalTrack aims at such dense tracking of animals and expects an average of at least 25 trajectories per video.

                   Tracking on other subjects (e.g., humans, vehicles, etc.)               Tracking on animals
                   KITTI    MOT17   MOT20    UAVDT-MOT  TAO        GMOT-40   GMOT-40-Anim.  TAO-Anim.  AnimalTrack (Ours)
Videos             50       14      8        100        2,907      40        12             39         58
Categories         5        1       1        3          833        10        3              39         10
Min. len. (s)      n/a      17      17       2.8        n/a        3.0       3.0            1.0        6.5
Avg. len. (s)      10.0     33.0    66.8     266.7      36.8       8.9       7.1            22.0       14.2
Max. len. (s)      n/a      85.0    133.0    99.0       n/a        24.2      24.2           93.0       75.6
Total len. (s)     498.0    463.0   535.0    2,666.7    106,978.0  356.0     85.5           859.0      823.7
Avg. tracks        52       95      479      270        6          51        70             4          33
Max. tracks        n/a      222     1,211    n/a        10         128       128            10         133
Total tracks       2,600    1,331   3,833    2,700      17,287     2,026     837            250        1,927
Frame rate         30       25      25       30         30         30        30             30         30
Ann. FPS           10       30      30       6          1          30        30             1          30
Total boxes        80K      300K    2,102K   840K       333K       256K      63K            3.4K       429K
Total frames       15K      11K     13K      40K        2,674K     9K        2.6K           2.5K       24.7K
Table 1: Statistics of the proposed AnimalTrack and its comparison with several multi-object tracking benchmarks (KITTI Geiger et al. (2012), MOT17 Milan et al. (2016), MOT20 Dendorfer et al. (2020), UAVDT-MOT Du et al. (2018), TAO Dave et al. (2020), GMOT-40 Bai et al. (2021)) and with the animal videos in GMOT-40 and TAO. “n/a” means that the statistic cannot be obtained because some of the benchmarks do not provide the test set.

3.2 Data Collection

Our AnimalTrack focuses on dense multi-animal tracking. We start benchmark construction by selecting 10 common animal categories that are generally dense and crowded in the wild. These categories are Chicken, Deer, Dolphin, Duck, Goose, Horse, Penguin, Pig, Rabbit, and Zebra, living in very different environments. Although TAO consists of more classes than ours, many categories in TAO are not suitable for dense multi-object tracking, which differs from our aim in this work.

Figure 2: Number of video sequences for each animal class in AnimalTrack. Each category consists of at least 5 and at most 7 sequences.

After determining the animal classes, we search for raw video sequences of each class from YouTube (https://www.youtube.com/), the largest and most popular video platform in the world; each video sequence is collected under the Creative Commons license. Initially, we collected over 500 candidate sequences. After jointly considering video quality and our design principles, we choose 58 video clips from these raw sequences for our task. Each category contains at least 5 and at most 7 sequences, showing a degree of balance across categories. Fig. 2 shows the number of sequences for each category in AnimalTrack.

Position  Name           Description
1         Frame number   Frame in which the target appears; starting from 1.
2         Identifier     A unique ID for each trajectory.
3         Box left       x coordinate of the top-left corner of the annotated object.
4         Box top        y coordinate of the top-left corner of the annotated object.
5         Box width      Width of the annotated object.
6         Box height     Height of the annotated object.
7         Confidence     Flag indicating whether the box is considered (1) or ignored (-1) for evaluation; the confidence for all targets in AnimalTrack is set to 1.
8         Class          Type of the annotated object.
9         Visibility     Visibility ratio of the object; we ignore it by setting its value to -1 in AnimalTrack.
Table 2: Annotation format in AnimalTrack.

(Figure 3 panels: (a) horse-3, (b) deer-4, (c) dolphin-6, (d) duck-4, (e) goose-5, (f) chicken-2, (g) penguin-5, (h) pig-5, (i) rabbit-2, (j) zebra-4.)

Figure 3: Visualization of several annotated example images from each category in AnimalTrack. We can observe that animals from the same class usually have uniform appearances and complex pose and motion patterns, which brings new challenges for tracking animals.

Finally, we compile a large-scale benchmark for multi-animal tracking by collecting 58 video sequences with more than 24.7K frames and 429K boxes. The average video length is 426 frames. The longest sequence contains 2,269 frames, while the shortest one consists of 196 frames. The total number of tracks in AnimalTrack is 1,927, and the average number of tracks per video is 33. To the best of our knowledge, AnimalTrack is by far the largest benchmark dedicated to animal tracking. Tab. 1 summarizes detailed statistics of AnimalTrack and its comparison with several popular MOT benchmarks and the animal videos in GMOT-40 and TAO.

3.3 Annotation

We use the annotation tool DarkLabel (available at https://github.com/darkpgmr/DarkLabel) to annotate the videos in AnimalTrack. Following the popular MOTChallenge Dendorfer et al. (2021), we annotate each target in videos with an object identifier, an axis-aligned bounding box and other information. Tab. 2 shows the annotation format for each target in AnimalTrack. Note that, slightly different from MOTChallenge, we do not annotate the visibility ratio of each target because it is hard to accurately measure the visibility of a target in real-world scenarios. However, we still keep this field (set to -1) for padding to the MOTChallenge format.
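As a reference, the sketch below parses one comma-separated annotation line following the format of Tab. 2; the dataclass fields, type choices and the example file path are our own assumptions for illustration rather than a released loader.

```python
# Parse MOTChallenge-style annotation lines as described in Tab. 2.
# One comma-separated line per box; field types below are assumptions.
from dataclasses import dataclass

@dataclass
class Box:
    frame: int         # frame index, starting from 1
    track_id: int      # unique identifier of the trajectory
    left: float        # x coordinate of the top-left corner
    top: float         # y coordinate of the top-left corner
    width: float
    height: float
    confidence: int    # 1 = considered, -1 = ignored (always 1 in AnimalTrack)
    category: int      # type of the annotated object (encoding assumed here)
    visibility: float  # set to -1 in AnimalTrack (not annotated)

def load_annotations(path):
    boxes = []
    with open(path) as f:
        for line in f:
            v = line.strip().split(",")
            boxes.append(Box(int(v[0]), int(v[1]), float(v[2]), float(v[3]),
                             float(v[4]), float(v[5]), int(v[6]), int(v[7]),
                             float(v[8])))
    return boxes

# Example usage (hypothetical file path):
# annotations = load_annotations("goose-5/gt/gt.txt")
```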

To provide consistent annotations, we adopt the following labeling rules. For a target object that is fully visible or partially occluded, a full-body box is annotated. If the object is under full occlusion, we do not label it; when this object re-appears in the view later, we annotate it with the same identifier. Target objects that leave the view are assigned new identifiers when re-entering the view.

Figure 4: Statistics of object motion, of area and aspect ratio change relative to the initial object, and of IoU between object boxes in adjacent frames in AnimalTrack, compared with popular pedestrian tracking benchmarks including MOT17 Milan et al. (2016) and MOT20 Dendorfer et al. (2020). We can observe that the animals in our benchmark have more complex poses and motions.

To ensure high-quality annotation, we adopt a multi-step strategy. First, a group of volunteers who are familiar with the topic and an expert (e.g., a PhD student working on related areas) manually annotate each target object in the videos. Then, a group of experts carefully inspects the initial annotations. If the annotation results are not unanimously agreed upon by the experts, they are returned to the labeling team for adjustment or refinement. We repeat this process until all annotations are satisfactorily completed. Fig. 3 shows a few annotated samples from each category in AnimalTrack.

3.4 Statistics of Annotation

                     Videos  Categories  Min. len. (s)  Avg. len. (s)  Max. len. (s)  Total len. (s)  Avg. tracks  Total tracks  Total boxes  Total images
AnimalTrack (train)  32      10          6.9            12.0           50.3           384.8           26           823           186K         11.5K
AnimalTrack (test)   26      10          6.5            16.9           75.6           438.9           42           1,104         243K         13.2K
Table 3: Comparison between the training set and testing set of AnimalTrack.

In order to better understand the pose and motion patterns of animals, we show representative statistics of the annotated object boxes in AnimalTrack in Fig. 4. In particular, we plot the object motion, the area relative to the initial object box, the relative aspect ratio (aspect ratio is defined as the ratio of width to height) and the Intersection over Union (IoU) of object boxes in adjacent frames. From Fig. 4, we can clearly observe that the animal targets vary rapidly in terms of spatial pose and temporal motion. The sketch below illustrates how such statistics can be computed.
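The following sketch computes these per-trajectory quantities for a single track of (x, y, w, h) boxes; it is an illustration of the statistics in Fig. 4 under that assumed box format, not the exact script used to produce the figure.

```python
# Per-trajectory statistics of the kind shown in Fig. 4: frame-to-frame motion,
# area and aspect-ratio change relative to the first box, and IoU between boxes
# in adjacent frames. Input: an array of (x, y, w, h) boxes, one per frame.
import numpy as np

def trajectory_statistics(boxes):
    boxes = np.asarray(boxes, dtype=float)           # shape (T, 4)
    x, y, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    cx, cy = x + w / 2.0, y + h / 2.0

    motion = np.hypot(np.diff(cx), np.diff(cy))      # center displacement per frame
    rel_area = (w * h) / (w[0] * h[0])               # area relative to the first box
    rel_aspect = (w / h) / (w[0] / h[0])             # aspect ratio relative to the first box

    # IoU between the box in frame t and the box in frame t + 1.
    x1 = np.maximum(x[:-1], x[1:]);  y1 = np.maximum(y[:-1], y[1:])
    x2 = np.minimum(x[:-1] + w[:-1], x[1:] + w[1:])
    y2 = np.minimum(y[:-1] + h[:-1], y[1:] + h[1:])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = w[:-1] * h[:-1] + w[1:] * h[1:] - inter
    adjacent_iou = inter / np.maximum(union, 1e-9)

    return motion, rel_area, rel_aspect, adjacent_iou
```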

In addition, we compare AnimalTrack and popular pedestrian tracking benchmarks including MOT17 Milan et al. (2016) and MOT20 Dendorfer et al. (2020). From the comparison in Fig. 4, we can see that animals have faster motion than pedestrians. Moreover, the pose variations of animals are more complex, which consequently causes new challenges in tracking animals.

3.5 Dataset Split

AnimalTrack consists of 58 video sequences. We utilize 32 out of 58 for training and the remaining 26 for testing. Specifically, for a category with N videos, we select N/2 videos for training and the rest for testing if N is even; otherwise, we choose ⌈N/2⌉ videos for training and the rest for testing. During dataset splitting, we try our best to keep the distributions of the training and testing sets as close as possible. Tab. 3 compares the statistics of the training and testing sets in AnimalTrack. Note that the number of frames in the testing set is slightly larger than that in the training set, because the testing set contains longer video sequences for more challenging evaluation. The detailed split will be released on our project website. A minimal sketch of the split procedure is given below.
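The sketch below illustrates this per-category split, assuming the sequences of each category are given as a name-to-list mapping; the ⌈N/2⌉ rule for odd N follows the description above, and the sequence names are illustrative.

```python
# Per-category split sketch: for a category with N sequences, put ceil(N / 2)
# into the training set and the rest into the testing set.
import math

def split_by_category(sequences_per_category):
    """sequences_per_category: dict mapping category name -> list of sequence names."""
    train, test = [], []
    for category, seqs in sequences_per_category.items():
        n_train = math.ceil(len(seqs) / 2)
        train.extend(seqs[:n_train])
        test.extend(seqs[n_train:])
    return train, test

# Example usage (hypothetical sequence names):
# train, test = split_by_category({"goose": ["goose-1", "goose-2", "goose-3",
#                                            "goose-4", "goose-5"]})
```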

4 Evaluation

4.1 Evaluation Metric

For comprehensive evaluation of different tracking algorithms, we use multiple metrics. Specifically, we employ the recently proposed higher order tracking accuracy (HOTA) from Luiten et al. (2021); the commonly used CLEAR metrics from Bernardin and Stiefelhagen (2008), including multiple object tracking accuracy (MOTA), mostly tracked targets (MT), mostly lost targets (ML), false positives (FP), false negatives (FN), ID switches (IDs) and the number of times a trajectory is fragmented (FM); and the ID metrics from Ristani et al. (2016), such as identification precision (IDP), identification recall (IDR) and the related F1 score (IDF1), which is defined as the ratio of correctly identified detections to the average number of ground-truth and computed detections. Many previous works employ MOTA as the main metric (e.g., for ranking). Nevertheless, a recent study Luiten et al. (2021) shows that MOTA may be biased toward target detection quality instead of target association accuracy. Considering this, we follow Geiger et al. (2012); Sun et al. (2022) and adopt HOTA as the main metric in evaluation. For the definitions of these metrics, we refer readers to Bernardin and Stiefelhagen (2008); Ristani et al. (2016); Luiten et al. (2021).
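For reference, the sketch below evaluates MOTA and IDF1 from their standard definitions in Bernardin and Stiefelhagen (2008) and Ristani et al. (2016), assuming the aggregate matching counts have already been computed (e.g., by an evaluation toolkit); it is a sketch of the formulas, not the evaluation code used in this paper.

```python
# Standard CLEAR/ID metric formulas, given aggregate counts over a sequence:
#   MOTA = 1 - (FN + FP + IDSW) / num_gt
#   IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)
# The counts themselves (from matching predictions to ground truth) are assumed
# to be computed elsewhere.

def mota(num_gt, fp, fn, id_switches):
    """Multiple Object Tracking Accuracy."""
    return 1.0 - (fn + fp + id_switches) / float(num_gt)

def idf1(idtp, idfp, idfn):
    """Identification F1: correctly identified detections over the average
    number of ground-truth and predicted detections."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)
```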

Tracker                               HOTA    MOTA    IDF1    IDP     IDR     MT    PT    ML    FP      FN       IDs     FM
SORT Bewley et al. (2016)             42.8%   55.6%   49.2%   58.5%   42.4%   333   470   301   19,099  86,257   2,530   3,730
IOUTrack Bochinski et al. (2017)      41.6%   55.7%   45.7%   51.9%   40.7%   388   454   262   25,206  77,847   4,639   5,259
DeepSORT Wojke et al. (2017)          32.8%   41.4%   35.2%   49.7%   27.2%   213   452   439   14,131  124,747  3,503   4,527
JDE Wang et al. (2020)                26.8%   27.3%   31.0%   51.0%   22.0%   106   414   584   17,887  155,623  3,187   5,031
FairMOT Zhang et al. (2021b)          30.6%   29.0%   38.8%   62.8%   28.0%   143   462   499   17,653  152,624  2,335   5,447
CenterTrack Zhou et al. (2020)        9.9%    1.6%    7.0%    8.9%    5.8%    265   423   416   32,050  117,614  89,655  7,583
CTracker Peng et al. (2020)           13.8%   14.0%   14.7%   35.2%   9.3%    20    313   771   13,092  192,660  3,437   8,019
Tracktor++ Bergmann et al. (2019)     44.2%   55.2%   51.0%   58.5%   45.1%   364   472   268   25,477  81,538   1,976   4,149
ByteTrack Zhang et al. (2021a)        40.1%   38.5%   51.2%   64.9%   42.3%   310   465   329   31,591  116,587  1,309   3,513
QDTrack Pang et al. (2021)            47.0%   55.7%   56.3%   65.6%   49.3%   367   420   317   22,696  83,057   1,970   5,656
TADAM Guo et al. (2021)               32.5%   36.5%   37.2%   44.4%   32.0%   258   495   351   41,728  110,048  2,538   4,469
OMC Liang et al. (2022)               43.0%   53.4%   50.3%   61.8%   42.4%   324   478   302   15,910  92,570   4,938   7,162
Trackformer Meinhardt et al. (2022)   31.0%   20.4%   36.5%   40.9%   32.8%   230   491   383   70,404  118,724  4,355   3,725
TransTrack Sun et al. (2020)          45.4%   48.3%   53.4%   63.4%   46.1%   327   416   361   28,553  95,212   1,978   6,459
Table 4: Overall evaluation results and comparison of different tracking algorithms on AnimalTrack.

4.2 Evaluated Trackers

Understanding how existing MOT algorithms perform on AnimalTrack is crucial for future comparison and also beneficial for tracker design. To this end, we extensively evaluate 14 state-of-the-art multi-object tracking approaches.

These approaches include SORT Bewley et al. (2016) (ICIP’2016), DeepSORT Wojke et al. (2017) (ICIP’2017), IOUTrack Bochinski et al. (2017) (AVSS’2017), JDE Wang et al. (2020) (ECCV’2020), FairMOT Zhang et al. (2021b) (IJCV’2021), CenterTrack Zhou et al. (2020) (ECCV’2020), CTracker Peng et al. (2020) (ECCV’2020), QDTrack Pang et al. (2021) (CVPR’2021), ByteTrack Zhang et al. (2021a) (arXiv’2021), Tracktor++ Bergmann et al. (2019) (ICCV’2019), TADAM Guo et al. (2021) (CVPR’2021), Trackformer Meinhardt et al. (2022) (CVPR’2022), OMC Liang et al. (2022) (AAAI’2022) and TransTrack Sun et al. (2020) (arXiv’2020). Notably, among these approaches, TransTrack and Trackformer are two recently proposed trackers using Transformers. Despite excellent performance on pedestrian tracking, these trackers degrade quickly when tracking animals, as shown in the later experimental results.

It is worth noting that, in evaluation, all the above trackers are used as they are, without any modification, for two reasons. First, different approaches may need different training strategies, which makes it difficult to optimally train each tracker for its best performance; moreover, inappropriate training settings may decrease the performance of certain trackers. Second, the original configuration of each tracker has been verified by its authors, so it is reasonable to assume that each tracker is able to obtain decent results even without modification.

4.3 Evaluation Results

In this work, the evaluation of each tracking algorithm is conducted in the “private” setting, in which each tracker performs both object detection and target association.

4.3.1 Overall Performance

We extensively evaluate 14 state-of-the-art tracking algorithms. Tab. 4 shows the evaluation results and comparison.

(Figure 5 panels: (a) deer, (b) dolphin, (c) duck, (d) goose, (e) horse, (f) penguin, (g) pig, (h) rabbit, (i) zebra, (j) chicken.)

Figure 5: Visualization of the top five trackers based on HOTA scores, consisting of OMC Liang et al. (2022), SORT Bewley et al. (2016), TransTrack Sun et al. (2020), Tracktor++ Bergmann et al. (2019) and QDTrack Pang et al. (2021), on several sequences. Each color represents a tracking trajectory.

From Tab. 4, we observe that QDTrack shows the best overall result with a 47.0% HOTA score, and TransTrack the second best with 45.4%. QDTrack densely samples numerous regions from images for similarity learning and thus can alleviate, to some degree, the problem of complex animal poses in detection, as evidenced by its best result of 55.7% MOTA, a metric that focuses more on detection quality. This dense sampling strategy not only improves detection but also benefits the subsequent association, as shown by its best 56.3% IDF1 score. TransTrack utilizes the query-key mechanism of the Transformer for multi-object tracking, and on IDF1 it also exhibits the second best result with 53.4%. The competitive performance of TransTrack shows the potential of Transformers for MOT. We notice that the other Transformer-based tracker, Trackformer, performs worse than TransTrack; we argue that the reason is its relatively weaker detection module. Tracktor++ shows the second best MOTA result with 55.2% owing to its adoption of the strong Faster R-CNN Ren et al. (2015) for detection. Compared with pedestrians, animal detection is more challenging, and the usage of two-stage detectors may be more suitable.

Figure 6: Comparison of different tracking algorithms on each category using HOTA.

In addition, we see some interesting findings on AnimalTrack. For example, SORT and IOUTrack are two simple trackers that are outperformed by many recent approaches on pedestrian tracking benchmarks. However, we observe that, despite their simplicity, these two trackers work surprisingly well on AnimalTrack. SORT and IOUTrack achieve 42.8% and 41.6% HOTA scores, respectively, surpassing many recent state-of-the-art trackers such as JDE, FairMOT and CTracker. This observation shows that more effort and attention should be devoted to the problem of animal tracking.

Besides quantitatively evaluating and comparing different MOT approaches, we further show the qualitative results of different trackers. Due to limited space, we only demonstrate the qualitative results of top five trackers based on HOTA as in Fig. 5.

4.3.2 Category-based Performance

In addition to the overall performance analysis, we conduct a more specific comparison of different tracking algorithms on each category of AnimalTrack using HOTA. Fig. 6 shows the comparison results. From Fig. 6, we can see that QDTrack achieves the best results on 3 out of 10 categories, including Goose, Dolphin and Pig, and the second best results on 4 out of 10 classes, including Horse, Rabbit, Duck and Chicken, which is consistent with its best overall performance on AnimalTrack. OMC demonstrates the best results on 3 out of 10 classes, including Horse, Penguin and Deer, showing the advantage of its “re-check” mechanism for tracking. ByteTrack obtains the best results on 2 categories, Rabbit and Chicken, and competitive performance on the other categories, which reveals its promising capacity for animal tracking.

Figure 7: Difficulty comparison of different categories in AnimalTrack. The larger the area of the sector is, the larger the average HOTA score is, and the less difficult the category is.

4.3.3 Difficulty comparison of Categories

Figure 8: Comparison of different trackers on the pedestrian tracking benchmark MOT17 and the proposed AnimalTrack in terms of HOTA (image (a)), MOTA (image (b)) and IDF1 (image (c)). We note that, compared to MOT17, all trackers degrade on all metrics on AnimalTrack, which shows that multi-animal tracking is more challenging than pedestrian tracking and that there is a long way to go in improving animal tracking.

We analyze the difficulty of different animal categories in AnimalTrack. Specifically, we simply average the HOTA scores of all evaluated trackers on one category to obtain the HOTA score for this category. Fig. 7 shows the comparison. In Fig. 7, the larger the area of a sector is, the larger the average HOTA score is, and the less difficult the category is. From Fig. 7, we can see that the category of Horse is the easiest to track while the class of Goose is the most difficult. We argue that Goose is the hardest because geese may have the most complex motion patterns. We hope that this analysis can guide researchers to pay more attention to the hard categories.

4.3.4 Comparison of MOT17 and AnimalTrack

Currently, one of the main focuses in MOT community is to track pedestrians. Different from pedestrian tracking, animal tracking is more challenging because of uniform appearance, diverse pose and complex motion patterns. In order to verify this, we compare the performance of existing state-of-the-art tracking algorithms on the popular MOT17 and the proposed AnimalTrack. Notice that, we only compare the trackers whose HOTA, MOTA and IDF1 scores are available on both MOT17 and AnimalTrack. Fig. 8 displays the comparison results of these trackers.

From Fig. 8 (a), we can see that the best two performing trackers on MOT17 are ByteTrack and FairMOT, which achieve 63.1% and 59.3% HOTA scores. Despite this, these two trackers degrade significantly when tracking animals on AnimalTrack. Specifically, their HOTA scores decrease from 63.1% to 40.1% and from 59.3% to 30.6%, showing absolute performance drops of 23.0% and 28.7%, respectively. Tracktor++ performs slightly worse on AnimalTrack than on MOT17; this tracker utilizes a strong detector for tracking and shows competitive performance. Although QDTrack achieves the best HOTA result, its performance degrades on AnimalTrack compared to that on MOT17, which again evidences the challenge and difficulty we face in handling animal tracking. It is worth noting that CenterTrack has the largest performance drop on AnimalTrack. We have carefully inspected the official implementation to ensure its correctness for evaluation. After taking a close look at the implementation, we find that the features extracted in CenterTrack are not suitable for animal tracking, resulting in poor performance.

In addition to the overall comparison using HOTA, we compare the MOTA scores. From Fig. 8 (b), we can observe that the best two trackers on MOT17 are ByteTrack and TransTrack with 80.3% and 74.5% MOTA scores, respectively. Nevertheless, when tracking animals on AnimalTrack, their MOTA scores decrease to 37.9% (a 42.4% absolute performance drop) and 48.3% (a 26.2% absolute performance drop), respectively, which shows that animal detection is more challenging than human detection. Besides the best two trackers on MOT17, the other approaches also degrade on AnimalTrack, which further reveals the general difficulty of detection on AnimalTrack. We notice that Tracktor++ performs consistently on AnimalTrack and MOT17 (55.2% vs. 56.3%); we argue that this is attributed to its powerful regressor for detection.

Moreover, we also compare the IDF1 score of each tracker on MOT17 and AnimalTrack in Fig. 8 (c). As shown, the best two trackers on MOT17 are ByteTrack and FairMOT with 77.3% and 72.3% IDF1 scores. Compared to their performance on AnimalTrack, with 51.2% and 38.8% IDF1 scores, the absolute performance drops are 26.1% and 33.5%, respectively, which highlights the severe challenge of associating animals with uniform appearances. Furthermore, in addition to these two trackers, all other trackers, including QDTrack, the best performing tracker on AnimalTrack, are greatly degraded in IDF1 score, demonstrating that more effort is required for solving association in animal tracking.

Figure 9: Visualization and comparison of appearance features for re-identification between pedestrians and animals using t-SNE Van der Maaten and Hinton (2008). The same target object is represented as dots with the same color. We choose the first 30 target objects in the first 200 frames for visualization. We can clearly see that the appearance features of animals are more difficult to distinguish than pedestrian appearance features, resulting in new challenges for animal tracking.

To further compare pedestrian and animal tracking, we analyze the appearance similarities of different pedestrians and animals on MOT17 and AnimalTrack. In particular, we train two re-identification networks with identical architectures on MOT17 and AnimalTrack, respectively. Afterwards, we extract the features of pedestrians and animals and adopt t-SNE Van der Maaten and Hinton (2008) to visualize these features. Fig. 9 shows the visualization of the appearance features of pedestrians and animals. From Fig. 9, we can clearly observe that the features of animals are more complex and less distinguishable because of the highly similar appearances of animals compared to pedestrians.
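A minimal sketch of this kind of visualization is given below, assuming a trained re-identification model that maps image crops to embedding vectors; the `embed` function and the grouping of crops by identity are placeholders, not our released code.

```python
# Visualize re-identification embeddings with t-SNE, as in Fig. 9.
# `crops_by_identity` maps a ground-truth identity to its cropped images, and
# `embed` is a trained re-ID model returning a feature vector per crop.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_reid_embeddings(crops_by_identity, embed):
    features, labels = [], []
    for identity, crops in crops_by_identity.items():
        for crop in crops:
            features.append(embed(crop))     # e.g., a 128-D appearance vector
            labels.append(identity)
    features = np.stack(features)
    labels = np.asarray(labels)

    # Project the high-dimensional embeddings to 2-D for inspection.
    points = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)

    for identity in np.unique(labels):
        mask = labels == identity
        plt.scatter(points[mask, 0], points[mask, 1], s=5, label=str(identity))
    plt.axis("off")
    plt.show()
```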

From the extensive quantitative and qualitative analysis above, we can see that tracking animals is more challenging and difficult than tracking pedestrians. Despite rapid progress on pedestrian tracking, there is a long way to go in improving animal tracking.

4.3.5 Analysis on Association Strategy

Association HOTA MOTA IDF1
IOUTrack 41.6% 55.7% 45.7%
SORT 42.8% 55.6% 49.2%
DeepSORT 38.2% 52.0% 44.2%
ByteTrack 36.3% 37.1% 47.0%
Table 5: Analysis on different association strategies. The detection is provided by Faster R-CNN Ren et al. (2015).

Association is a core component in existing MOT algorithms. In order to analyze and compare different association strategies, we conduct an independent experiment. Specifically, we adopt the classic and powerful detector Faster R-CNN Ren et al. (2015) to provide detection results on AnimalTrack. Based on the detection results, we perform analysis on four different association strategies.

Tab. 5 shows the comparison results. SORT and IOUTrack simply use motion information instead of appearance to perform association, yet they achieve the best two results with 42.8% and 41.6% HOTA scores. This shows that taking motion cues in videos into consideration is beneficial for distinguishing targets with uniform appearances. Compared to SORT, DeepSORT adopts target appearance information for association, but its performance is degraded, which once again evidences that appearance models should be carefully designed when applied to associating animals. ByteTrack is a recently proposed approach that demonstrates state-of-the-art performance on multiple pedestrian and vehicle tracking benchmarks; its main success on these benchmarks comes from performing association on all detected boxes. However, animals have uniform appearances, and it is hard to leverage their appearance information, as is done for pedestrians or vehicles, to distinguish different targets, which limits ByteTrack on AnimalTrack. More effort is desired for designing appropriate association for animal targets. The sketch below illustrates a common way of combining motion and appearance cues in association.
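For concreteness, the following sketch shows one common way to blend motion (IoU) and appearance (cosine similarity) cues into a single association cost before Hungarian matching; the weighting and variable names are illustrative choices, not the exact scheme of any evaluated tracker.

```python
# Blend motion and appearance cues into one association cost matrix.
# `iou_matrix[i, j]` is the IoU between track i's predicted box and detection j;
# `track_feats` / `det_feats` are L2-normalized appearance embeddings.
# The weight below is an illustrative choice.
import numpy as np
from scipy.optimize import linear_sum_assignment

def blended_association(iou_matrix, track_feats, det_feats, appearance_weight=0.5):
    appearance_sim = track_feats @ det_feats.T        # cosine similarity
    cost = (1.0 - appearance_weight) * (1.0 - iou_matrix) \
           + appearance_weight * (1.0 - appearance_sim)
    rows, cols = linear_sum_assignment(cost)          # optimal one-to-one matching
    return list(zip(rows, cols)), cost
```

On animals with uniform appearance, down-weighting the appearance term, or dropping it entirely as SORT and IOUTrack effectively do, can be the safer choice.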

5 Conclusion

In this paper, we introduce AnimalTrack, a high-quality large-scale benchmark for multi-animal tracking. Specifically, AnimalTrack consists of 58 video sequences selected from 10 common animal categories. To the best of our knowledge, AnimalTrack is by far the first and also the largest dataset dedicated to MAT. By constructing AnimalTrack, we hope to provide a platform for facilitating research on MOT for animals. In addition, to provide baselines for future comparison on AnimalTrack, we extensively assess 14 popular MOT approaches with in-depth analysis. The evaluation results show that more effort is desired to improve MAT. Furthermore, we independently study the association component for multi-animal tracking and hope that this can provide guidance for choosing an appropriate baseline for target association. Overall, we expect our dataset, along with the evaluation results and analysis, to inspire more research on multi-animal tracking using computer vision techniques.

References

  • H. Bai, W. Cheng, P. Chu, J. Liu, K. Zhang, and H. Ling (2021) GMOT-40: a benchmark for generic multiple object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §1, §2.2, §2.2, Table 1.
  • P. C. Bala, B. R. Eisenreich, S. B. M. Yoo, B. Y. Hayden, H. S. Park, and J. Zimmermann (2020) Automated markerless pose estimation in freely moving macaques with openmonkeystudio. Nature communications 11 (1), pp. 1–12. Cited by: §2.3.
  • P. Bergmann, T. Meinhardt, and L. Leal-Taixe (2019) Tracking without bells and whistles. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1, Figure 5, §4.2, Table 4.
  • K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10. Cited by: §4.1.
  • M. Betke, D. E. Hirsh, A. Bagchi, N. I. Hristov, N. C. Makris, and T. H. Kunz (2007) Tracking large variable numbers of objects in clutter. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.2.
  • A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In IEEE International Conference in Image Processing (ICIP), Cited by: §2.1, Figure 5, §4.2, Table 4.
  • E. Bochinski, V. Eiselein, and T. Sikora (2017) High-speed tracking-by-detection without using image information. In IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Cited by: §4.2, Table 4.
  • K. Bozek, L. Hebert, A. S. Mikheyev, and G. J. Stephens (2018) Towards dense object tracking in a 2d honeybee hive. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.2.
  • G. Brasó and L. Leal-Taixé (2020) Learning a neural solver for multiple object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • J. Cao, H. Tang, H. Fang, X. Shen, C. Lu, and Y. Tai (2019) Cross-domain adaptation for animal pose estimation. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.3.
  • P. Chu, H. Fan, C. C. Tan, and H. Ling (2019) Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.1.
  • P. Chu and H. Ling (2019) Famnet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • G. Ciaparrone, F. L. Sánchez, S. Tabik, L. Troiano, R. Tagliaferri, and F. Herrera (2020) Deep learning in video multi-object tracking: a survey. Neurocomputing 381, pp. 61–88. Cited by: §2.1.
  • P. Dai, R. Weng, W. Choi, C. Zhang, Z. He, and W. Ding (2021) Learning a proposal classifier for multiple object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • A. Dave, T. Khurana, P. Tokmakov, C. Schmid, and D. Ramanan (2020) Tao: a large-scale benchmark for tracking any object. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.2, §2.2, Table 1.
  • A. Dehghan, Y. Tian, P. H. Torr, and M. Shah (2015) Target identity-aware network flow for online multiple target tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • P. Dendorfer, A. Osep, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, and L. Leal-Taixé (2021) Motchallenge: a benchmark for single-camera multiple target tracking. International Journal of Computer Vision 129 (4), pp. 845–881. Cited by: §2.2, §3.3.
  • P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé (2020) Mot20: a benchmark for multi object tracking in crowded scenes. arXiv:2003.09003. Cited by: §1, §1, §2.2, Figure 4, §3.4, Table 1.
  • D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.2, Table 1.
  • P. Emami, P. M. Pardalos, L. Elefteriadou, and S. Ranka (2020) Machine learning methods for data association in multi-object tracking. ACM Computing Surveys 53 (4), pp. 1–34. Cited by: §2.1.
  • J. Ferryman and A. Shahrokni (2009) Pets2009: dataset and challenge. In PETS Workshop, Cited by: §2.2.
  • A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the kitti vision benchmark suite. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: Figure 1, §1, §1, §2.2, Table 1, §4.1.
  • S. Guo, J. Wang, X. Wang, and D. Tao (2021) Online multiple object tracking with cross-task synergy. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §4.2, Table 4.
  • Y. Iwashita, A. Takamine, R. Kurazume, and M. S. Ryoo (2014) First-person animal activity recognition from egocentric videos. In International Conference on Pattern Recognition (ICPR), Cited by: §2.3.
  • Z. Khan, T. Balch, and F. Dellaert (2004) An mcmc-based particle filter for tracking multiple interacting targets. In European Conference on Computer Vision (ECCV), Cited by: §2.2.
  • L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942. Cited by: §2.2.
  • S. Li, J. Li, H. Tang, R. Qian, and W. Lin (2019) ATRW: a benchmark for amur tiger re-identification in the wild. In ACM Multimedia (MM), Cited by: §2.3.
  • C. Liang, Z. Zhang, X. Zhou, B. Li, Y. Lu, and W. Hu (2022) One more check: making “fake background” be tracked again. In Association for the Advancement of Artificial Intelligence (AAAI), Cited by: §2.1, Figure 5, §4.2, Table 4.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • Z. Lu, V. Rathod, R. Votel, and J. Huang (2020) Retinatrack: online single stage joint detection and tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • J. Luiten, A. Osep, P. Dendorfer, P. Torr, A. Geiger, L. Leal-Taixé, and B. Leibe (2021) Hota: a higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129 (2), pp. 548–578. Cited by: §4.1.
  • W. Luo, J. Xing, A. Milan, X. Zhang, W. Liu, and T. Kim (2021) Multiple object tracking: a literature review. Artificial Intelligence 293, pp. 103448. Cited by: §2.1.
  • A. Mathis, T. Biasi, S. Schneider, M. Yuksekgonul, B. Rogers, M. Bethge, and M. W. Mathis (2021) Pretraining boosts out-of-domain robustness for pose estimation. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.3.
  • T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2022) Trackformer: multi-object tracking with transformers. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1, §4.2, Table 4.
  • A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv:1603.00831. Cited by: Figure 1, §1, §1, §2.2, Figure 4, §3.4, Table 1.
  • J. Pang, L. Qiu, X. Li, H. Chen, Q. Li, T. Darrell, and F. Yu (2021) Quasi-dense similarity learning for multiple object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: Figure 5, §4.2, Table 4.
  • J. Parham, C. Stewart, J. Crall, D. Rubenstein, J. Holmberg, and T. Berger-Wolf (2018) An animal detection pipeline for identification. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.3.
  • J. Peng, C. Wang, F. Wan, Y. Wu, Y. Wang, Y. Tai, C. Wang, J. Li, F. Huang, and Y. Fu (2020) Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision (ECCV), Cited by: §4.2, Table 4.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Conference and Workshop on Neural Information Processing Systems (NIPS). Cited by: §2.1, §4.3.1, §4.3.5, Table 5.
  • E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision (ECCV) Workshop, Cited by: §4.1.
  • S. Schulter, P. Vernaza, W. Choi, and M. Chandraker (2017) Deep network flow for multi-object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • B. Shuai, A. Berneshawi, X. Li, D. Modolo, and J. Tighe (2021) Siammot: siamese multi-object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • P. Sun, J. Cao, Y. Jiang, Z. Yuan, S. Bai, K. Kitani, and P. Luo (2022) DanceTrack: multi-object tracking in uniform appearance and diverse motion. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.2, §4.1.
  • P. Sun, J. Cao, Y. Jiang, R. Zhang, E. Xie, Z. Yuan, C. Wang, and P. Luo (2020) Transtrack: multiple object tracking with transformer. arXiv:2012.15460. Cited by: §2.1, Figure 5, §4.2, Table 4.
  • S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017) Multiple people tracking by lifted multicut and person re-identification. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • L. Van der Maaten and G. Hinton (2008) Visualizing data using t-sne.. Journal of machine learning research 9 (11). Cited by: Figure 9, §4.3.4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. Conference on Neural Information Processing Systems (NIPS). Cited by: §2.1.
  • P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe (2019) Mots: multi-object tracking and segmentation. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.2.
  • Z. Wang, L. Zheng, Y. Liu, Y. Li, and S. Wang (2020) Towards real-time multi-object tracking. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §4.2, Table 4.
  • L. Wen, D. Du, Z. Cai, Z. Lei, M. Chang, H. Qi, J. Lim, M. Yang, and S. Lyu (2020) UA-detrac: a new benchmark and protocol for multi-object detection and tracking. Computer Vision and Image Understanding 193, pp. 102907. Cited by: §2.2.
  • N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In IEEE International Conference in Image Processing (ICIP), Cited by: §2.1, §4.2, Table 4.
  • J. Xu, Y. Cao, Z. Zhang, and H. Hu (2019) Spatial-temporal relation networks for multi-object tracking. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
  • Y. Xu, A. Osep, Y. Ban, R. Horaud, L. Leal-Taixé, and X. Alameda-Pineda (2020) How to train your deep multi-object tracker. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • J. Yin, W. Wang, Q. Meng, R. Yang, and J. Shen (2020) A unified object motion and affinity model for online multi-object tracking. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.1.
  • F. Yu, H. Chen, X. Wang, W. Xian, Y. Chen, F. Liu, V. Madhavan, and T. Darrell (2020) Bdd100k: a diverse driving dataset for heterogeneous multitask learning. In IEEE International Conference on Computer Vision and Pattern Recognition Conference (CVPR), Cited by: §2.2.
  • H. Yu, Y. Xu, J. Zhang, W. Zhao, Z. Guan, and D. Tao (2021) AP-10k: a benchmark for animal pose estimation in the wild. In Conference and Workshop on Neural Information Processing Systems (NeurIPS) - Track on Datasets and Benchmarks, Cited by: §2.3.
  • Y. Zhang, P. Sun, Y. Jiang, D. Yu, Z. Yuan, P. Luo, W. Liu, and X. Wang (2021a) ByteTrack: multi-object tracking by associating every detection box. arXiv:2110.06864. Cited by: §4.2, Table 4.
  • Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu (2021b) Fairmot: on the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129 (11), pp. 3069–3087. Cited by: §2.1, §4.2, Table 4.
  • X. Zhou, V. Koltun, and P. Krähenbühl (2020) Tracking objects as points. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §4.2, Table 4.
  • J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang (2018) Online multi-object tracking with dual matching attention networks. In European Conference on Computer Vision (ECCV), Cited by: §2.1.
  • P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling (2021) Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2.2.