In this paper, we are interested in multi-animal tracking (MAT), a typical yet heavily under-explored kind of multi-object tracking (MOT). MAT is critical for understanding and analyzing animal motion and behavior, and thus has a wide range of applications in zoology, biology, ecology, animal conservation, and so forth. Despite its importance, MAT has received little attention in the tracking community.
Currently, the MOT community mainly focuses on pedestrian and vehicle tracking, with numerous benchmarks introduced in recent years Milan et al. (2016); Geiger et al. (2012); Zhu et al. (2021); Dendorfer et al. (2020). Compared with MOT on pedestrians and vehicles, MAT is more challenging because of several unique properties of animals:
Uniform appearance. Different from pedestrians and vehicles in existing MOT benchmarks, which usually have distinguishable appearances (e.g., color, texture, etc.), most animals have uniform appearances and visually look extremely similar (see Fig. 1 for an example). As a consequence, it is difficult to distinguish different animals using their visual features alone with regular association (e.g., re-identification) models.
Diverse pose. Animals often exhibit more diverse poses than humans in a video sequence. For example, a goose may walk or run on the ground, swim in water, or fly in the air, leading to significantly different poses. By contrast, most pedestrians simply walk in a video. The diverse poses of animals may cause difficulties in detector design for tracking.
Complex motion. Compared to humans and vehicles, which usually have regular motion patterns, animals exhibit larger-range motions due to their diverse poses. For example, animals may frequently switch from flying to swimming, or vice versa. These complicated motion patterns place higher requirements on motion modeling for animal tracking than for human or vehicle tracking.
The above properties of animals bring technical difficulties to MAT, making it a less-touched problem. In addition, another, more important reason why MAT is under-explored is the scarcity of a benchmark. A benchmark plays a crucial role in advancing multi-object tracking: as a platform, it allows researchers to develop their algorithms and to fairly assess, compare, and analyze different approaches for improvement. Currently, there exist many datasets Milan et al. (2016); Geiger et al. (2012); Zhu et al. (2021); Dendorfer et al. (2020); Bai et al. (2021); Du et al. (2018); Dave et al. (2020) for MOT on different subjects in various scenarios. Nevertheless, there is no available benchmark dedicated to multi-animal tracking. Although some of these datasets (e.g., Bai et al. (2021); Dave et al. (2020)) contain video sequences with animal targets, they are limited in either video quantity and animal categories Bai et al. (2021) or number of animal tracklets Dave et al. (2020), which makes them less than ideal platforms for studying MAT. In order to facilitate MOT on animals, a dedicated benchmark is urgently required for both designing and evaluating MAT algorithms.
Contribution. Thus motivated, in this paper we take the first step toward studying the MAT problem by introducing AnimalTrack, a large-scale benchmark dedicated to multi-animal tracking in the wild. Specifically, AnimalTrack consists of 58 video sequences selected from 10 common animal categories. On average, each video sequence contains 33 animals for tracking. There are more than 24.7K frames in total in AnimalTrack, and every frame is manually labeled with multiple axis-aligned bounding boxes. Careful inspection and refinement are performed to ensure high-quality annotations. To the best of our knowledge, AnimalTrack is the first benchmark dedicated to the task of MAT.
In addition, with the goal of understanding how existing MOT algorithms perform on the newly developed AnimalTrack for future improvements, we extensively evaluate 14 popular state-of-the-art MOT algorithms and conduct in-depth analysis of the evaluation results. Not surprisingly, we observe that most of these trackers, designed for pedestrian or vehicle tracking, degrade greatly when directly applied to animal tracking on AnimalTrack because of the aforementioned properties of animals. We hope that these evaluations and analyses can offer baselines for future comparison on AnimalTrack and provide guidance for tracking algorithm design.
Besides the analysis of the overall performance of different tracking algorithms, we also independently study the association techniques that are indispensable for current multi-object tracking. In particular, we compare and analyze several popular association strategies. This analysis is expected to provide guidance for future research when choosing an appropriate association baseline for improvement.
In summary, we make the following contributions: (i) We introduce AnimalTrack, which is, to the best of our knowledge, the first large-scale benchmark dedicated to multi-animal tracking. (ii) We extensively evaluate 14 representative state-of-the-art MOT approaches to provide baselines for future comparison on AnimalTrack. (iii) We conduct in-depth analysis of the evaluations of existing approaches, offering guidance for future algorithm design.
By releasing AnimalTrack, we hope to boost the future research and applications of multiple animal tracking. Our project with data and evaluation results will be made publicly available upon the acceptance of this work.
2 Related Work
MAT belongs to the problem of MOT. In this section, we discuss MOT algorithms and existing benchmarks related to AnimalTrack, and briefly review other animal-related vision benchmarks.
2.1 Multi-Object Tracking Algorithms
MOT is a fundamental problem in computer vision and has been actively studied for decades. In this subsection, we briefly review some representative works and refer readers to recent surveys Ciaparrone et al. (2020); Emami et al. (2020); Luo et al. (2021) for more tracking algorithms.
One popular paradigm is called Tracking-by-Detection, which decomposes MOT into two subtasks: detecting objects Ren et al. (2015); Lin et al. (2017) in each frame and then associating the same target across frames to generate trajectories using optimization techniques (e.g., the Hungarian algorithm Bewley et al. (2016) and the network flow algorithm Dehghan et al. (2015)). Within this framework, numerous approaches have been introduced Bewley et al. (2016); Wojke et al. (2017); Chu et al. (2019); Shuai et al. (2021); Tang et al. (2017); Zhu et al. (2018); Xu et al. (2019); Yin et al. (2020). In order to improve the data association in MOT, some other works propose to directly incorporate the optimization solvers of association into learning Chu and Ling (2019); Xu et al. (2020); Brasó and Leal-Taixé (2020); Schulter et al. (2017); Dai et al. (2021), which is greatly beneficial for improving tracking performance through end-to-end learning in deep networks.
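As a concrete illustration of the association step in this paradigm, the sketch below matches per-frame detections to existing tracks with the Hungarian algorithm over an IoU cost. The box format, threshold, and function names are our own illustrative assumptions, not any particular tracker's implementation.

```python
# Sketch of Hungarian-algorithm association over an IoU cost, as used in
# tracking-by-detection pipelines in the spirit of SORT (assumed layout).
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Return (track_idx, det_idx) pairs minimizing total (1 - IoU) cost."""
    cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)
    # keep only matches with sufficient overlap
    return [(r, c) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_thresh]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 21, 31, 31), (1, 1, 11, 11), (50, 50, 60, 60)]
matches = associate(tracks, dets)  # the far-away third detection stays unmatched
```

Unmatched detections would typically spawn new tracks, and unmatched tracks would be aged and eventually terminated.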
In addition to the Tracking-by-Detection framework, another MOT architecture named Joint-Detection-and-Tracking has recently drawn increasing attention in the community due to its efficiency and simplicity. This framework learns to detect and associate target objects at the same time, largely simplifying the MOT pipeline. Many efficient approaches Bergmann et al. (2019); Lu et al. (2020); Zhou et al. (2020); Wang et al. (2020); Zhang et al. (2021b); Liang et al. (2022) have been proposed based on this architecture. More recently, motivated by the power of the Transformer Vaswani et al. (2017), the attention mechanism has been introduced for MOT Sun et al. (2020); Meinhardt et al. (2022) and demonstrates state-of-the-art performance.
2.2 Multi-Object Tracking Benchmarks
Benchmarks are important for the development of MOT. In recent years, many benchmarks have been proposed.
PETS2009. PETS2009 Ferryman and Shahrokni (2009) is one of the earliest benchmarks for MOT. It contains 3 video sequences for pedestrian tracking.
KITTI. KITTI Geiger et al. (2012) is introduced for autonomous driving. It comprises 50 video sequences and focuses on tracking pedestrians and vehicles in traffic scenarios. Besides 2D MOT, KITTI also supports 3D MOT.
UA-DETRAC. UA-DETRAC Wen et al. (2020) includes 100 challenging sequences captured from real-world traffic scenes. This dataset provides rich annotations for multi-object tracking such as illumination, occlusion, truncation ratio, vehicle type, and bounding box.
MOTChallenge. MOTChallenge Dendorfer et al. (2021) contains a series of benchmarks. The first version, MOT15 Leal-Taixé et al. (2015), consists of 22 sequences for tracking. Due to the low difficulty of the videos in MOT15, MOT16 Milan et al. (2016) compiles 14 new and more challenging sequences. MOT17 Milan et al. (2016) uses the same videos as MOT16 but improves the annotations and applies a different evaluation system. Later, MOT20 Dendorfer et al. (2020) is presented with 8 new sequences, aiming at MOT in crowded scenes.
MOTS. MOTS Voigtlaender et al. (2019) is a newly introduced dataset for multi-object tracking. In addition to 2D bounding box, MOTS also provides pixel mask for each target, aiming at simultaneous tracking and segmentation.
BDD100K. BDD100K Yu et al. (2020) is recently proposed for video understanding in traffic scenes. It provides multiple tasks including multi-object tracking.
TAO. TAO Dave et al. (2020) is a large-scale dataset for tracking any objects. It consists of 2,907 videos from 833 categories. TAO sparsely labels objects every 30 frames, and the average number of trajectories per video is 6.
GMOT-40. GMOT-40 Bai et al. (2021) is a recently proposed benchmark that aims at one-shot MOT. It consists of 40 sequences from 10 categories. Each sequence provides one instance for tracking multiple targets of the same class.
UAVDT-MOT. UAVDT-MOT Du et al. (2018) consists of 100 challenging videos captured with a drone. These videos mainly cover pedestrians and vehicles for tracking. The goal of UAVDT-MOT is to facilitate multi-object tracking in aerial views.
VisDrone-MOT. Similar to UAVDT-MOT, VisDrone-MOT Zhu et al. (2021) also focuses on MOT with drones. The difference is that VisDrone-MOT introduces more object categories, making it more challenging.
DanceTrack. DanceTrack Sun et al. (2022) is a large-scale benchmark with 100 videos. The aim of DanceTrack is to explore multi-human tracking in uniform appearance and diverse motion.
Different from the above datasets for MOT on pedestrians, vehicles, or other subjects, AnimalTrack focuses on dense multi-animal tracking in the wild. Although some of the benchmarks (e.g., TAO Dave et al. (2020) and GMOT-40 Bai et al. (2021)) contain animal targets for tracking, they have limitations for MAT. In TAO Dave et al. (2020), the average number of trajectories per video is 6, and it is even lower (4) for the animal videos. Nevertheless, in the wild it is very common to see animals moving in a dense group, so the sparse trajectories in TAO may limit its usage for dense tracking. In addition, TAO is sparsely annotated every 30 frames, which makes it difficult for trackers to learn temporal motion. Despite containing several animal videos, GMOT-40 Bai et al. (2021) is limited in animal categories (4 classes) and video quantity (12 in total). Besides, GMOT-40 has a different aim, one-shot MOT, and thus provides no training data. Compared to TAO Dave et al. (2020) and GMOT-40 Bai et al. (2021), AnimalTrack is dense in trajectories and annotation (i.e., per-frame manual annotation) as well as diverse in animal classes.
We are also aware that there exist a few datasets Khan et al. (2004); Betke et al. (2007); Bozek et al. (2018) for animal tracking. However, these datasets are usually small (e.g., with 1 or 2 video sequences) and limited to a specific animal category (e.g., Khan et al. (2004) for ants, Betke et al. (2007) for bats, Bozek et al. (2018) for bees), and therefore may not be suitable for large-scale animal tracking in the deep learning era. Unlike these animal tracking datasets, our AnimalTrack has more classes with more videos.
2.3 Other Animal-Related Vision Benchmarks
Our AnimalTrack is also related to many other animal-related vision benchmarks outside MOT. The work of Cao et al. (2019) introduces a large-scale benchmark for animal pose estimation, which is later extended by Yu et al. (2021) with more images and additional categories. In Mathis et al. (2021), the authors introduce a benchmark dedicated to horse pose estimation. The work of Bala et al. (2020) proposes a 3D animal pose estimation benchmark. The work of Parham et al. (2018) presents a new dataset for animal localization in the wild. A benchmark for tiger re-identification is proposed in Li et al. (2019). In Iwashita et al. (2014), the authors build a benchmark for animal activity recognition in videos. Different from these benchmarks, the proposed AnimalTrack focuses on multi-animal tracking.
3.1 Design Principle
AnimalTrack is expected to provide the community with a new dedicated platform for studying MOT on animals. In particular, in the deep learning era, it aims at both large-scale training and evaluation of deep trackers. To this end, we follow three principles in constructing AnimalTrack:
Large-scale. One motivation behind AnimalTrack is to provide a large-scale benchmark for animal tracking. In particular, current deep models usually require a large amount of data for training. Bearing this in mind, we aim to compile at least 50 video sequences with at least 20K frames in AnimalTrack.
High-quality dense annotations. The annotations of a benchmark are crucial for both algorithm development and evaluation. To this end, we provide per-frame manual annotations for every sequence of AnimalTrack to ensure high annotation quality, which is different from many MOT benchmarks that provide only sparse annotations.
Dense trajectories. In the real world, it is common to see animals moving in a dense group. AnimalTrack aims at such dense tracking of animals and expects each video to contain at least 25 trajectories on average.
| Benchmark | Tracking on other subjects (e.g., humans, vehicles, etc.) | Tracking on animals |
|---|---|---|
| Min. len. (s) | n/a / 17 / 17 / 2.8 / n/a | 3.0 / 3.0 / 1.0 / 6.5 |
| Avg. len. (s) | 10.0 / 33.0 / 66.8 / 266.7 / 36.8 | 8.9 / 7.1 / 22.0 / 14.2 |
| Max. len. (s) | n/a / 85.0 / 133.0 / 99.0 / n/a | 24.2 / 24.2 / 93.0 / 75.6 |
| Total len. (s) | 498.0 / 463.0 / 535.0 / 2,666.7 / 106,978.0 | 356.0 / 85.5 / 859.0 / 823.7 |
3.2 Data Collection
Our AnimalTrack focuses on dense multi-animal tracking. We start benchmark construction by selecting 10 common animal categories that generally appear in dense and crowded groups in the wild. These categories are Chicken, Deer, Dolphin, Duck, Goose, Horse, Penguin, Pig, Rabbit, and Zebra, living in very different environments. Although TAO consists of more classes than ours, many categories in TAO are not suitable for dense multi-object tracking, which is different from our aim in this work.
After determining the animal classes, we search raw video sequences of each class (each collected under the Creative Commons license) from YouTube (https://www.youtube.com/), the largest and most popular video platform in the world. Initially, we collected over 500 candidate sequences. After jointly considering video quality and our design principles, from these raw sequences we chose 58 video clips that are finally suitable for our task. Each category has at least 5 and at most 7 sequences, keeping the categories roughly balanced. Fig. 2 shows the number of sequences for each category in AnimalTrack.
| Position | Name | Description |
|---|---|---|
| 1 | Frame number | Frame in which the target appears, starting from 1. |
| 2 | Identifier | A unique ID for each trajectory. |
| 3 | Box left | x-coordinate of the top-left corner of the annotated box. |
| 4 | Box top | y-coordinate of the top-left corner of the annotated box. |
| 5 | Box width | Width of the annotated box. |
| 6 | Box height | Height of the annotated box. |
| 7 | Confidence | Flag indicating whether the box is considered (1) or ignored (-1) for evaluation; set to 1 for all targets in AnimalTrack. |
| 8 | Class | Category of the annotated object. |
| 9 | Visibility | Visibility ratio of the object; set to -1 (unannotated) in AnimalTrack. |
Finally, we compile a large-scale benchmark for multi-animal tracking by collecting 58 video sequences with more than 24.7K frames and 429K boxes. The average video length is 426 frames. The longest sequence contains 2,269 frames, while the shortest one consists of 196 frames. The total number of tracks in AnimalTrack is 1,927, and the average number of tracks per video is 33. To the best of our knowledge, AnimalTrack is by far the largest benchmark dedicated to animal tracking. Tab. 1 summarizes detailed statistics of AnimalTrack and compares it with several popular MOT benchmarks as well as the animal videos in GMOT-40 and TAO.
We use the annotation tool DarkLabel (available at https://github.com/darkpgmr/DarkLabel) to annotate the videos in AnimalTrack. Following the popular MOTChallenge Dendorfer et al. (2021), we annotate each target in the videos with an object identifier, an axis-aligned bounding box, and other information. Tab. 2 shows the annotation format for each target in AnimalTrack. Note that, slightly different from MOTChallenge, we do not annotate the visibility ratio of each target because it is hard to accurately measure visibility in real-world scenarios. However, we still keep the field (set to -1) for compatibility with the MOTChallenge format.
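For illustration, a minimal parser for one line of this MOTChallenge-style format might look as follows. The field layout follows Tab. 2; the helper name and the example line itself are hypothetical.

```python
# Sketch: parse one comma-separated annotation line in the format of Tab. 2
# (frame, id, left, top, width, height, confidence, class, visibility).
def parse_annotation(line):
    frame, obj_id, left, top, w, h, conf, cls, vis = line.strip().split(",")
    return {
        "frame": int(frame),           # frame index, starting from 1
        "id": int(obj_id),             # trajectory identifier
        "box": (float(left), float(top), float(w), float(h)),
        "considered": int(conf) == 1,  # 1 = evaluated, -1 = ignored
        "class": cls,                  # animal category
        "visibility": float(vis),      # always -1 in AnimalTrack (unannotated)
    }

# Made-up example line for demonstration only.
ann = parse_annotation("1,5,100.0,50.0,40.0,30.0,1,goose,-1")
```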
To provide consistent annotations, we follow these labeling rules. For a target object that is fully visible or partially occluded, a full-body box is annotated. If the object is fully occluded, we do not label it; when it re-appears in the view later, we annotate it with the same identifier. Target objects that leave the view are assigned new identifiers when re-entering.
To ensure high-quality annotations, we adopt a multi-step strategy. First, a group of volunteers who are familiar with the topic and an expert (e.g., a PhD student working in related areas) manually annotate each target object in the videos. Then, a group of experts carefully inspects the initial annotations. If the annotation results are not unanimously agreed upon by the experts, they are returned to the labeling team for adjustment or refinement. We repeat this process until all annotations are satisfactorily completed. Fig. 3 shows a few annotated samples from each category in AnimalTrack.
3.4 Statistics of Annotation
In order to better understand the pose and motion patterns of animals, we show representative statistics of the annotated object boxes in AnimalTrack in Fig. 4. In particular, we show the object motion, the area relative to the initial object box, the relative aspect ratio (aspect ratio is defined as the ratio of width to height), and the Intersection over Union (IoU) of object boxes in adjacent frames. From Fig. 4, we can clearly observe that the animal targets vary rapidly in both spatial pose and temporal motion.
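Two of the per-object statistics plotted in Fig. 4 can be sketched as follows; the (x, y, w, h) box format and the toy trajectory are illustrative assumptions.

```python
# Sketch: adjacent-frame IoU and aspect ratio relative to the initial box
# for a single trajectory of (x, y, w, h) boxes.
def iou_xywh(a, b):
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    ix = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    iy = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = ix * iy
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)

def trajectory_stats(boxes):
    """IoU between consecutive boxes, and aspect ratio relative to frame 1."""
    r0 = boxes[0][2] / boxes[0][3]
    ious = [iou_xywh(p, q) for p, q in zip(boxes, boxes[1:])]
    rel_ratio = [(b[2] / b[3]) / r0 for b in boxes]
    return ious, rel_ratio

# Toy trajectory: small shift, then a drastic pose (aspect-ratio) change.
ious, rel = trajectory_stats([(0, 0, 10, 20), (2, 0, 10, 20), (4, 0, 20, 10)])
```

Low adjacent-frame IoU and large swings in relative aspect ratio are exactly the signatures of fast motion and pose change discussed above.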
In addition, we compare AnimalTrack and popular pedestrian tracking benchmarks including MOT17 Milan et al. (2016) and MOT20 Dendorfer et al. (2020). From the comparison in Fig. 4, we can see that animals have faster motion than pedestrians. Moreover, the pose variations of animals are more complex, which consequently causes new challenges in tracking animals.
3.5 Dataset Split
AnimalTrack consists of 58 video sequences. We utilize 32 of the 58 for training and the remaining 26 for testing. Specifically, for a category with N videos, we select N/2 videos for training and the rest for testing if N is even; otherwise, we choose ⌈N/2⌉ videos for training and the rest for testing. During dataset splitting, we try our best to keep the distributions of the training and testing sets as close as possible. Tab. 3 compares the statistics of the training/testing sets in AnimalTrack. Note that the number of frames in the testing set is slightly larger than in the training set, because the testing set contains more long video sequences for challenging evaluation. The detailed split will be released on our project website.
4.1 Evaluation Metric
For comprehensive evaluation of different tracking algorithms, we use multiple metrics. Specifically, we employ the recently proposed higher order tracking accuracy (HOTA) from Luiten et al. (2021); the commonly used CLEAR metrics from Bernardin and Stiefelhagen (2008), including multiple object tracking accuracy (MOTA), mostly tracked targets (MT), mostly lost targets (ML), false positives (FP), false negatives (FN), ID switches (IDs), and the number of times a trajectory is fragmented (FM); and the ID metrics from Ristani et al. (2016), i.e., identification precision (IDP), identification recall (IDR), and the related F1 score (IDF1), which is defined as the ratio of correctly identified detections to the average number of ground-truth and computed detections. Many previous works employ MOTA as the main metric (e.g., for ranking). Nevertheless, a recent study Luiten et al. (2021) shows that MOTA may be biased toward detection quality instead of association accuracy. Considering this, we follow Geiger et al. (2012); Sun et al. (2022) in adopting HOTA as the main metric for evaluation. For definitions of these metrics, we refer readers to Bernardin and Stiefelhagen (2008); Ristani et al. (2016); Luiten et al. (2021).
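As a hedged illustration, two of these metrics can be computed from raw counts as below; the counts are toy numbers, and we refer to Bernardin and Stiefelhagen (2008) and Ristani et al. (2016) for the full matching procedure that produces them.

```python
# Sketch: MOTA and IDF1 from pre-computed counts.
# MOTA = 1 - (FN + FP + IDs) / #ground-truth boxes
# IDF1 = 2*IDTP / (2*IDTP + IDFP + IDFN)
def mota(fn, fp, ids, num_gt):
    return 1.0 - (fn + fp + ids) / num_gt

def idf1(idtp, idfp, idfn):
    return 2 * idtp / (2 * idtp + idfp + idfn)

m = mota(fn=20, fp=10, ids=5, num_gt=100)  # 1 - 35/100 = 0.65
f = idf1(idtp=70, idfp=30, idfn=30)        # 140/200 = 0.70
```

Note how MOTA is dominated by the detection error terms FN and FP, which is exactly the bias toward detection quality noted above.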
Best results are shown in bold and second-best results in italic; MT, PT, and ML denote mostly tracked, partially tracked, and mostly lost targets.

| Tracker | HOTA | MOTA | IDF1 | IDP | IDR | MT | PT | ML | FP | FN | IDs | FM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SORT Bewley et al. (2016) | 42.8% | 55.6% | 49.2% | 58.5% | 42.4% | 333 | 470 | 301 | 19,099 | 86,257 | 2,530 | 3,730 |
| IOUTrack Bochinski et al. (2017) | 41.6% | 55.7% | 45.7% | 51.9% | 40.7% | **388** | 454 | **262** | 25,206 | **77,847** | 4,639 | 5,259 |
| DeepSORT Wojke et al. (2017) | 32.8% | 41.4% | 35.2% | 49.7% | 27.2% | 213 | 452 | 439 | 14,131 | 124,747 | 3,503 | 4,527 |
| JDE Wang et al. (2020) | 26.8% | 27.3% | 31.0% | 51.0% | 22.0% | 106 | 414 | 584 | 17,887 | 155,623 | 3,187 | 5,031 |
| FairMOT Zhang et al. (2021b) | 30.6% | 29.0% | 38.8% | 62.8% | 28.0% | 143 | 462 | 499 | 17,653 | 152,624 | 2,335 | 5,447 |
| CenterTrack Zhou et al. (2020) | 9.9% | 1.6% | 7.0% | 8.9% | 5.8% | 265 | 423 | 416 | 32,050 | 117,614 | 89,655 | 7,583 |
| CTracker Peng et al. (2020) | 13.8% | 14.0% | 14.7% | 35.2% | 9.3% | 20 | 313 | 771 | **13,092** | 192,660 | 3,437 | 8,019 |
| Tracktor++ Bergmann et al. (2019) | 44.2% | _55.2%_ | 51.0% | 58.5% | 45.1% | 364 | 472 | _268_ | 25,477 | _81,538_ | _1,976_ | 4,149 |
| ByteTrack Zhang et al. (2021a) | 40.1% | 38.5% | 51.2% | _64.9%_ | 42.3% | 310 | 465 | 329 | 31,591 | 116,587 | **1,309** | **3,513** |
| QDTrack Pang et al. (2021) | **47.0%** | **55.7%** | **56.3%** | **65.6%** | **49.3%** | _367_ | 420 | 317 | 22,696 | 83,057 | 1,970 | 5,656 |
| TADAM Guo et al. (2021) | 32.5% | 36.5% | 37.2% | 44.4% | 32.0% | 258 | **495** | 351 | 41,728 | 110,048 | 2,538 | 4,469 |
| OMC Liang et al. (2022) | 43.0% | 53.4% | 50.3% | 61.8% | 42.4% | 324 | 478 | 302 | _15,910_ | 92,570 | 4,938 | 7,162 |
| Trackformer Meinhardt et al. (2022) | 31.0% | 20.4% | 36.5% | 40.9% | 32.8% | 230 | _491_ | 383 | 70,404 | 118,724 | 4,355 | _3,725_ |
| TransTrack Sun et al. (2020) | _45.4%_ | 48.3% | _53.4%_ | 63.4% | _46.1%_ | 327 | 416 | 361 | 28,553 | 95,212 | 1,978 | 6,459 |
4.2 Evaluated Trackers
Understanding how existing MOT algorithms perform on AnimalTrack is crucial for future comparison and also beneficial for tracker design. To this end, we extensively evaluate 14 state-of-the-art multi-object tracking approaches.
These approaches include SORT Bewley et al. (2016) (ICIP'2016), DeepSORT Wojke et al. (2017) (ICIP'2017), IOUTrack Bochinski et al. (2017) (AVSS'2017), JDE Wang et al. (2020) (ECCV'2020), FairMOT Zhang et al. (2021b) (IJCV'2021), CenterTrack Zhou et al. (2020) (ECCV'2020), CTracker Peng et al. (2020) (ECCV'2020), QDTrack Pang et al. (2021) (CVPR'2021), ByteTrack Zhang et al. (2021a) (arXiv'2021), Tracktor++ Bergmann et al. (2019) (ICCV'2019), TADAM Guo et al. (2021) (CVPR'2021), Trackformer Meinhardt et al. (2022) (CVPR'2022), OMC Liang et al. (2022) (AAAI'2022), and TransTrack Sun et al. (2020) (arXiv'2020). Notably, among these approaches, TransTrack and Trackformer are two recently proposed trackers based on the Transformer. Despite excellent performance on pedestrian tracking, these trackers degrade quickly when tracking animals, as shown in the later experimental results.
It is worth noting that, in our evaluation, all of the above trackers are used as they are, without any modification, for two reasons. First, different approaches may need different training strategies, which makes it difficult to optimally train each tracker for its best performance; moreover, inappropriate training settings may decrease the performance of certain trackers. Second, the original configuration of each tracker has been verified by its authors, so it is reasonable to assume that each tracker can obtain decent results even without modification.
4.3 Evaluation Results
In this work, each tracking algorithm is evaluated in the "private" setting, in which the tracker must perform both object detection and target association.
4.3.1 Overall Performance
We extensively evaluate 14 state-of-the-art tracking algorithms. Tab. 4 shows the evaluation results and comparison.
Trackers shown in Fig. 5: OMC Liang et al. (2022), SORT Bewley et al. (2016), TransTrack Sun et al. (2020), Tracktor++ Bergmann et al. (2019), and QDTrack Pang et al. (2021).
From Tab. 4, we observe that QDTrack achieves the best overall result with a 47.0% HOTA score, and TransTrack the second best with 45.4%. QDTrack densely samples numerous regions from images for similarity learning and thus can alleviate the problem of complex animal poses in detection to some degree, as evidenced by its best result of 55.7% on MOTA, which focuses more on detection quality. This dense sampling strategy not only improves detection but also benefits the subsequent association, as shown by its best IDF1 score of 56.3%. On IDF1, TransTrack also exhibits the second best result with 53.4%. TransTrack utilizes the query-key mechanism of the Transformer for multi-object tracking, and its competitive performance shows the potential of Transformers for MOT. We notice that the other Transformer-based tracker, Trackformer, performs worse than TransTrack; we attribute this to its relatively weaker detection module. Tracktor++ shows the second best MOTA result with 55.2% owing to its adoption of the strong Faster R-CNN Ren et al. (2015) for detection. Compared with pedestrians, animal detection is more challenging, and the use of two-stage detectors may be more suitable.
In addition, we see some interesting findings on AnimalTrack. For example, SORT and IOUTrack are two simple trackers that are outperformed by many recent approaches on pedestrian tracking benchmarks. However, despite their simplicity, these two trackers work surprisingly well on AnimalTrack: SORT and IOUTrack achieve 42.8% and 41.6% HOTA scores, respectively, surpassing many recent state-of-the-art trackers such as JDE, FairMOT, and CTracker. This observation shows that more effort and attention should be devoted to the problem of animal tracking.
Besides quantitatively evaluating and comparing different MOT approaches, we further show qualitative results of different trackers. Due to limited space, we only show the qualitative results of the top five trackers by HOTA in Fig. 5.
4.3.2 Category-based Performance
In addition to the overall performance analysis, we conduct a more specific comparison of different tracking algorithms on each category of AnimalTrack using HOTA. Fig. 6 shows the comparison results. From Fig. 6, we can see that QDTrack achieves the best results on 3 of the 10 categories (Goose, Dolphin, and Pig) and the second best results on 4 of the 10 (Horse, Rabbit, Duck, and Chicken), which is consistent with its best overall performance on AnimalTrack. OMC achieves the best results on 3 of the 10 classes (Horse, Penguin, and Deer), showing the advantage of its "re-check" mechanism for tracking. ByteTrack obtains the best results on 2 categories (Rabbit and Chicken) and competitive performance on the others, revealing its promising capacity for animal tracking.
4.3.3 Difficulty Comparison of Categories
We analyze the difficulty of different animal categories in AnimalTrack. Specifically, we average the HOTA scores of all evaluated trackers on one category to obtain the HOTA score for that category. Fig. 7 shows the comparison: the larger the area of a sector, the larger the average HOTA score and the less difficult the category. From Fig. 7, we can see that Horse is the easiest category to track, while Goose is the most difficult. We argue that Goose is the hardest because geese may have the most complex motion patterns. We hope that this analysis can guide researchers to pay more attention to the hard categories.
4.3.4 Comparison of MOT17 and AnimalTrack
Currently, one of the main focuses of the MOT community is tracking pedestrians. Different from pedestrian tracking, animal tracking is more challenging because of uniform appearance, diverse poses, and complex motion patterns. In order to verify this, we compare the performance of existing state-of-the-art tracking algorithms on the popular MOT17 and the proposed AnimalTrack. Note that we only compare trackers whose HOTA, MOTA, and IDF1 scores are available on both MOT17 and AnimalTrack. Fig. 8 displays the comparison results.
From Fig. 8 (a), we can see that the two best performing trackers on MOT17 are ByteTrack and FairMOT, which achieve 63.1% and 59.3% HOTA scores. Despite this, these two trackers degrade significantly when tracking animals on AnimalTrack: their HOTA scores decrease from 63.1% to 40.1% and from 59.3% to 30.6%, absolute performance drops of 23.0% and 28.7%, respectively. Tracktor++ performs only slightly worse on AnimalTrack than on MOT17; this tracker utilizes a strong detector for tracking and shows competitive performance. Although QDTrack achieves the best HOTA result on AnimalTrack, its performance still degrades compared to MOT17, which again evidences the challenge and difficulty of animal tracking. It is worth noting that CenterTrack has the largest performance drop on AnimalTrack. We have carefully inspected the official implementation to ensure the correctness of the evaluation; after taking a close look, we find that the features extracted by CenterTrack are not suitable for animal tracking, resulting in poor performance.
In addition to the overall comparison using HOTA, we compare the MOTA score. From Fig. 8 (b), we can observe that the two best trackers on MOT17 are ByteTrack and TransTrack with 80.3% and 74.5% MOTA scores, respectively. Nevertheless, when tracking animals on AnimalTrack, their MOTA scores decrease to 37.9% (a 42.4% absolute drop) and 48.3% (a 26.2% absolute drop), respectively, which shows that animal detection is more challenging than human detection. Besides these two trackers, the other approaches also degrade on AnimalTrack, which further reveals the general difficulty of detection on AnimalTrack. We notice that Tracktor++ performs consistently on both AnimalTrack and MOT17 (55.2% vs. 56.3%). We attribute this to its powerful regression-based detection.
Moreover, we compare the IDF1 score of each tracker on MOT17 and AnimalTrack in Fig. 8 (c). As shown, the two best trackers on MOT17 are ByteTrack and FairMOT with 77.3% and 72.3% IDF1 scores. Compared to their performance on AnimalTrack, with 51.0% and 38.3% IDF1 scores, the absolute performance drops are 22.3% and 33.5%, respectively, which highlights the severe challenge of associating animals with uniform appearances. Furthermore, all other trackers, including QDTrack, the best performing tracker on AnimalTrack, are also greatly degraded in IDF1 score, demonstrating that more effort is required to solve association in animal tracking.
To further compare pedestrian and animal tracking, we analyze the appearance similarities of different pedestrians and animals on MOT17 and AnimalTrack. In particular, we train two re-identification networks with identical architectures on MOT17 and AnimalTrack, respectively. Afterwards, we extract the features of pedestrians and animals and adopt t-SNE Van der Maaten and Hinton (2008) to visualize them. Fig. 9 shows the visualization of the appearance features of pedestrians and animals. From Fig. 9, we can clearly observe that the features of animals are more entangled and indistinguishable because of the highly similar appearances of animals compared to those of pedestrians.
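The visualization step described above can be sketched with scikit-learn's t-SNE implementation. This is only an illustration of the projection step, not the paper's exact pipeline: the random features and identity labels below are placeholders standing in for the embeddings extracted by the two re-identification networks.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder re-identification embeddings: in the paper's setup these would
# be extracted by re-ID networks trained on MOT17 and AnimalTrack.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 128)).astype(np.float32)  # 200 targets, 128-D embeddings
labels = rng.integers(0, 10, size=200)                      # hypothetical identity labels

# Project the high-dimensional embeddings to 2-D for visualization; points
# that stay well separated per identity indicate discriminative appearance.
embedded = TSNE(n_components=2, perplexity=30, init="pca",
                random_state=0).fit_transform(features)
print(embedded.shape)  # (200, 2)
```

The resulting 2-D points can then be scattered (e.g., with matplotlib), colored by identity, to reproduce a figure in the style of Fig. 9.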
From the extensive quantitative and qualitative analysis above, we can see that tracking animals is more challenging and difficult than tracking pedestrians. Despite the rapid progress on pedestrian tracking, there is still a long way to go in improving animal tracking.
4.3.5 Analysis on Association Strategy
Association is a core component of existing MOT algorithms. In order to analyze and compare different association strategies, we conduct an independent experiment. Specifically, we adopt the classic and powerful Faster R-CNN detector Ren et al. (2015) to provide detection results on AnimalTrack. Based on these detection results, we analyze four different association strategies.
Tab. 5 demonstrates the comparison results. SORT (see ❶) and IOUTrack (see ❷) simply use motion information instead of appearance to perform association, yet achieve the best two results with 42.8% and 41.6% HOTA scores. This shows that taking motion cues in videos into consideration is beneficial for distinguishing targets with uniform appearances. Compared to SORT, DeepSORT (see ❸) adopts target appearance information for association, but its performance is degraded, which once again evidences that appearance cues should be carefully designed when applied to associating animals. ByteTrack (see ❹) is a recently proposed approach that demonstrates state-of-the-art performance on multiple pedestrian and vehicle tracking benchmarks. Its main success on these benchmarks comes from performing association on all detected boxes. However, because animals have uniform appearances, it is hard to leverage their appearance information to distinguish different targets as in pedestrian or vehicle tracking. More efforts are desired for designing appropriate association strategies for animal targets.
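The motion-only association used by SORT and IOUTrack can be illustrated with a minimal sketch. This is not the exact implementation of either tracker: greedy matching below stands in for the Hungarian assignment used in practice, and box predictions (e.g., SORT's Kalman filter) are omitted, so tracks are matched directly against new detections by IoU overlap.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(tracks, detections, iou_thresh=0.3):
    """Greedily match existing track boxes to new detections by IoU overlap."""
    pairs = sorted(((iou(t, d), ti, di)
                    for ti, t in enumerate(tracks)
                    for di, d in enumerate(detections)), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh or ti in used_t or di in used_d:
            continue  # below threshold, or track/detection already matched
        matches.append((ti, di))
        used_t.add(ti); used_d.add(di)
    return matches

# Two tracks from the previous frame matched to two slightly shifted detections.
tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 19, 31, 29), (1, 1, 11, 11)]
print(associate(tracks, detections))  # [(1, 0), (0, 1)]
```

Unmatched detections would spawn new tracks and unmatched tracks would be kept alive or terminated, which is where the four strategies in Tab. 5 differ most.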
5 Conclusion
In this paper, we introduce AnimalTrack, a high-quality large-scale benchmark for multi-animal tracking. Specifically, AnimalTrack consists of 58 video sequences selected from 10 common animal categories. To the best of our knowledge, AnimalTrack is by far the first and also the largest dataset dedicated to MAT. By constructing AnimalTrack, we hope to provide a platform for facilitating research on MOT for animals. In addition, to enable future comparison on AnimalTrack, we extensively assess 14 popular MOT approaches with in-depth analysis. The evaluation results show that more efforts are desired for improving MAT. Furthermore, we independently study the association component for multi-animal tracking and hope that this can provide some guidance for choosing an appropriate baseline for target association. Overall, we expect our dataset, along with the evaluation results and our proposed baseline, to inspire more research on multi-animal tracking using computer vision techniques.
- GMOT-40: a benchmark for generic multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.2, §2.2, Table 1.
- Automated markerless pose estimation in freely moving macaques with OpenMonkeyStudio. Nature Communications 11 (1), pp. 1–12. Cited by: §2.3.
- Tracking without bells and whistles. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1, Figure 5, §4.2, Table 4.
- Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing 2008, pp. 1–10. Cited by: §4.1.
- Tracking large variable numbers of objects in clutter. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
- Simple online and realtime tracking. In IEEE International Conference on Image Processing (ICIP), Cited by: §2.1, Figure 5, §4.2, Table 4.
- High-speed tracking-by-detection without using image information. In IEEE International Conference on Advanced Video and Signal-based Surveillance (AVSS), Cited by: §4.2, Table 4.
- Towards dense object tracking in a 2D honeybee hive. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
- Learning a neural solver for multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- Cross-domain adaptation for animal pose estimation. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.3.
- Online multi-object tracking with instance-aware tracker and dynamic model refreshment. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.1.
- FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
- Deep learning in video multi-object tracking: a survey. Neurocomputing 381, pp. 61–88. Cited by: §2.1.
- Learning a proposal classifier for multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- TAO: a large-scale benchmark for tracking any object. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.2, §2.2, Table 1.
- Target identity-aware network flow for online multiple target tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- MOTChallenge: a benchmark for single-camera multiple target tracking. International Journal of Computer Vision 129 (4), pp. 845–881. Cited by: §2.2, §3.3.
- MOT20: a benchmark for multi object tracking in crowded scenes. arXiv:2003.09003. Cited by: §1, §1, §2.2, Figure 4, §3.4, Table 1.
- The unmanned aerial vehicle benchmark: object detection and tracking. In European Conference on Computer Vision (ECCV), Cited by: §1, §2.2, Table 1.
- Machine learning methods for data association in multi-object tracking. ACM Computing Surveys 53 (4), pp. 1–34. Cited by: §2.1.
- PETS2009: dataset and challenge. In PETS Workshop, Cited by: §2.2.
- Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 1, §1, §1, §2.2, Table 1, §4.1.
- Online multiple object tracking with cross-task synergy. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.2, Table 4.
- First-person animal activity recognition from egocentric videos. In International Conference on Pattern Recognition (ICPR), Cited by: §2.3.
- An MCMC-based particle filter for tracking multiple interacting targets. In European Conference on Computer Vision (ECCV), Cited by: §2.2.
- MOTChallenge 2015: towards a benchmark for multi-target tracking. arXiv:1504.01942. Cited by: §2.2.
- ATRW: a benchmark for Amur tiger re-identification in the wild. In ACM Multimedia (MM), Cited by: §2.3.
- One more check: making "fake background" be tracked again. In Association for the Advancement of Artificial Intelligence (AAAI), Cited by: §2.1, Figure 5, §4.2, Table 4.
- Focal loss for dense object detection. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
- RetinaTrack: online single stage joint detection and tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- HOTA: a higher order metric for evaluating multi-object tracking. International Journal of Computer Vision 129 (2), pp. 548–578. Cited by: §4.1.
- Multiple object tracking: a literature review. Artificial Intelligence 293, pp. 103448. Cited by: §2.1.
- Pretraining boosts out-of-domain robustness for pose estimation. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.3.
- TrackFormer: multi-object tracking with transformers. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1, §4.2, Table 4.
- MOT16: a benchmark for multi-object tracking. arXiv:1603.00831. Cited by: Figure 1, §1, §1, §2.2, Figure 4, §3.4, Table 1.
- Quasi-dense similarity learning for multiple object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Figure 5, §4.2, Table 4.
- An animal detection pipeline for identification. In IEEE Winter Conference on Applications of Computer Vision (WACV), Cited by: §2.3.
- Chained-tracker: chaining paired attentive regression results for end-to-end joint multiple-object detection and tracking. In European Conference on Computer Vision (ECCV), Cited by: §4.2, Table 4.
- Faster R-CNN: towards real-time object detection with region proposal networks. In Conference and Workshop on Neural Information Processing Systems (NIPS), Cited by: §2.1, §4.3.1, §4.3.5, Table 5.
- Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision (ECCV) Workshop, Cited by: §4.1.
- Deep network flow for multi-object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- SiamMOT: siamese multi-object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- DanceTrack: multi-object tracking in uniform appearance and diverse motion. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2, §4.1.
- TransTrack: multiple object tracking with transformer. arXiv:2012.15460. Cited by: §2.1, Figure 5, §4.2, Table 4.
- Multiple people tracking by lifted multicut and person re-identification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- Visualizing data using t-SNE. Journal of Machine Learning Research 9 (11). Cited by: Figure 9, §4.3.4.
- Attention is all you need. In Conference on Neural Information Processing Systems (NIPS), Cited by: §2.1.
- MOTS: multi-object tracking and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
- Towards real-time multi-object tracking. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §4.2, Table 4.
- UA-DETRAC: a new benchmark and protocol for multi-object detection and tracking. Computer Vision and Image Understanding 193, pp. 102907. Cited by: §2.2.
- Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing (ICIP), Cited by: §2.1, §4.2, Table 4.
- Spatial-temporal relation networks for multi-object tracking. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.1.
- How to train your deep multi-object tracker. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- A unified object motion and affinity model for online multi-object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.1.
- BDD100K: a diverse driving dataset for heterogeneous multitask learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.2.
- AP-10K: a benchmark for animal pose estimation in the wild. In Conference and Workshop on Neural Information Processing Systems (NeurIPS) - Track on Datasets and Benchmarks, Cited by: §2.3.
- ByteTrack: multi-object tracking by associating every detection box. arXiv:2110.06864. Cited by: §4.2, Table 4.
- FairMOT: on the fairness of detection and re-identification in multiple object tracking. International Journal of Computer Vision 129 (11), pp. 3069–3087. Cited by: §2.1, §4.2, Table 4.
- Tracking objects as points. In European Conference on Computer Vision (ECCV), Cited by: §2.1, §4.2, Table 4.
- Online multi-object tracking with dual matching attention networks. In European Conference on Computer Vision (ECCV), Cited by: §2.1.
- Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1, §1, §2.2.