Swarm behavior tracking based on a deep vision algorithm

by Meihong Wu et al.

The intelligent swarm behavior of social insects (such as ants) emerges in many different environments, promising to provide insights for the study of embodied intelligence. Researching swarm behavior requires that researchers accurately track each individual over time. Manually labeling individual insects in a video is obviously labor-intensive. Automatic tracking methods, however, also pose serious challenges: (1) individuals are small and similar in appearance; (2) frequent interactions with each other cause severe and long-term occlusion. With advances in artificial intelligence and computer vision, we aim to provide a tool that automatically monitors multiple insects and addresses the above challenges. In this paper, we propose a detection and tracking framework for multi-ant tracking in videos by: (1) adopting a two-stage object detection framework using ResNet-50 as the backbone and encoding the positions of regions of interest to locate ants accurately; (2) using a ResNet model to develop the appearance descriptors of ants; (3) constructing long-term appearance sequences and combining them with motion information to achieve online tracking. To validate our method, we construct an ant database including 10 videos of ants from different indoor and outdoor scenes. We achieve a state-of-the-art performance of 95.7% mMOTA and 81.1% mMOTP on indoor videos, and 81.8% mMOTA and 81.9% mMOTP on outdoor videos. Additionally, our method runs 6-10 times faster than existing methods for insect tracking. Experimental results demonstrate that our method provides a powerful tool for accelerating the unraveling of the mechanisms underlying the swarm behavior of social insects.





1 Keywords:

social insects, swarm behavior, automatic tracking method, object detection, online tracking

2 Introduction

Swarm behavior is one of the most important features of social insects and is of great significance for the study of embodied intelligence tiacharoen2012design. Specifically, social insects often tend to cluster into a colony vandermeer2008clusters, which forms a complex dynamical system together with the surrounding environment balch2001automatically. So far, we do not know enough about the mechanisms behind the swarm behaviors of social insects.

The main reason is that this research requires the ability to track the motions and interactions of each individual robustly and accurately. However, until the end of the last century, biologists still manually marked motion trajectories on video to guarantee quality. They have to track one individual at a time, which might mean watching the entire video 50 times or more in a crowded scene poff2012efficient. This becomes an inhibiting factor in obtaining the complete and accurate datasets required to analyze the evolution of complex dynamical systems. Therefore, in the past two decades, attempts have been made to automate the tracking process for social insects using computer vision (CV) techniques khan2005mcmc; khan2006mcmc; oh2006parameterized; veeraraghavan2008shape; fletcher2011multiple.

Traditional CV techniques free researchers from manual work through approaches such as foreground segmentation li2008estimating, the temporal difference method khan2005mcmc, and the Hungarian algorithm li2009learning. Such approaches, however, fail to cope with image noise zhao2015improved, so they are limited to laboratory environments with a clean background. Nevertheless, many scientifically valuable results are obtained in nature rather than in laboratory environments schmelzer2009special; kastberger2013social; tan2016honey; dong2018olfactory.

In recent years, with the popularity of computer vision, many advanced object detection and tracking methods have emerged.

2.1 Object detection

Existing methods in object detection are categorized as one-stage or two-stage, according to whether there is a separate region proposal stage. One-stage frameworks (e.g., YOLO redmon2016you) are fast, but their accuracy is typically slightly inferior to that of two-stage detectors. The popularity of two-stage detection frameworks was boosted by R-CNN girshick2014rich, which proposes candidate regions via the selective search (SS) algorithm uijlings2013selective so that the detector focuses on these RoIs. However, using the SS algorithm to generate region proposals is the main cause of slow inference. Fast R-CNN girshick2015fast reduces the computational complexity of region proposals by downsampling the original image, while Faster R-CNN ren2015faster proposes a region proposal network (RPN), which further improves the speed of training and inference.

Given the success of deep learning in general object detection tasks, researchers have also applied it to detect specific groups of animals, such as a single mouse geuther2019robust and fruit flies murali2019classification. These methods are limited to tracking either a single object or a fixed number of objects. General tools romero2019idtracker; sridhar2019tracktor also offer the functionality to detect and track unmarked animals in images. However, most existing methods focus on an ideal lab set-up, and no existing work has reported the detection of ants in outdoor environments, which contain diverse backgrounds and arbitrary terrains.

2.2 Multi-object tracking (MOT)

In the last two decades, vision-based detection and tracking models have been widely used to study social insects khan2005mcmc; veeraraghavan2008shape. Appearance (particularly color) and motion information are the main metrics used in this category of methods. Due to the high similarity of ants' appearance, researchers either use pigmenting to create more distinct appearance features fletcher2011multiple or limit the observation to a laboratory setup branson2009high; perez2014idTracker. State-of-the-art methods for insect tracking, such as Ctrax branson2009high and idTracker perez2014idTracker, are tested in a laboratory setup and use background subtraction for foreground segmentation. Notably, the operations of background modeling and foreground extraction are time-consuming.

The tracking-by-detection (TBD) paradigm matches trajectories and detections in two consecutive frames, a process that requires association metrics. The global nearest neighbor model measures the motion state to achieve Drosophila tracking chiron2013detecting. It assumes that the motion state obeys a linear observation model, commonly a constant-velocity model: the Kalman filter (KF). However, changes in ants' speed and direction are difficult to predict, so appearance information is integrated as an additional metric.

The DAT method is a mainstream method for ant colony tracking fasciano2014ant. It allows a combination of multiple metrics and uses the Hungarian algorithm li2009learning to assign detections to trajectories. The particle filter (PF) method is suitable for solving nonlinear problems fasciano2013tracking, but the growth in the number of particles leads to an exponential increase in computational cost, preventing effective multi-object tracking. Markov Chain Monte Carlo sampling can reduce this computational complexity khan2006mcmc. A GPU-accelerated semi-supervised framework can further improve tracking accuracy and performance poff2012efficient.

When the methods above are applied to tracking ant colonies, they are greatly disturbed by background noise and have difficulty overcoming severe occlusion in dense scenes. Long short-term memory (LSTM) kim2018multi and spatial-temporal attention mechanisms zhu2018online have been developed to tackle the problem of long-term occlusion. A bilinear LSTM structure couples a linear predictor with input detection features, thereby modeling long-term appearance features kim2018multi. The spatial-temporal attention mechanism is also suitable for the MOT task: the spatial attention module makes the network focus on the matching pattern, while the temporal attention module assigns different levels of attention to the sample sequence of a trajectory zhu2018online. The TBD paradigm depends on detection results, so severe occlusion is likely to cause tracking failures. To prevent this, a detector with automatic bounding box repairing and adjustment has been introduced via a cyclic structure classifier.


In this paper, we use a deep learning method to build a detection and tracking framework. Our method is based on the TBD paradigm and accomplishes online multi-ant tracking. To the best of the authors' knowledge, this is the first work to achieve robust detection and tracking of ant colonies in both indoor and outdoor environments (Figure 1). Our method is robust in tackling the challenge of visual similarity among colony individuals, handling diverse terrain backgrounds, and achieving long periods of tracking. Our main contributions are as follows:

  • We adopt a two-stage object detection framework, using ResNet-50 as the backbone and position sensitive score maps to encode regions of interest (RoIs). During the tracking stage, we use a ResNet network to obtain the appearance descriptors of ants and then combine them with motion information to achieve online association.

  • Our method proves to be robust in both indoor and outdoor scenes. Furthermore, only a small amount of training data is required in our pipeline: 50 images chosen per scene for the detection framework and 50 randomly chosen labels for the tracking framework.

  • We construct an ant database with labeled image sequences, including five indoor videos (laboratory setup) and five outdoor videos, with 4983 frames and 115,433 labels in total. The database is made publicly available; we hope it will contribute to a deeper exploration of the swarm behavior of ant colonies.

Figure 1: Tracking results by our method in both indoor and outdoor environments.

3 Materials and methods

3.1 Overview

Following the TBD paradigm, we propose a uniform framework for detection and tracking to efficiently and accurately track ant colonies in both indoor and outdoor scenes (Figure 2). In the detection phase, we adopt a two-stage object detection framework, using ResNet-50 as the backbone and encoding RoIs proposed by the regional proposal network via position-sensitive score maps. We then implement classification and regression through downsampling and voting mechanisms (see details in Section 3.2 Two-stage object detection). In the tracking stage, we first use ResNet to train the appearance descriptors of ants and measure the appearance similarity between two objects. Next, tracking is accomplished by combining appearance and motion information into an online association metric (Section 3.3 MOT framework).

Figure 2: Architecture for detection and tracking.

3.2 Two-stage object detection

3.2.1 Regional proposal network

Regional proposal network (RPN) is proposed in Faster R-CNN ren2015faster to generate RoIs. Compared to SS uijlings2013selective, RPN is based on a CNN structure and shares weights with the backbone, significantly improving detection speed. We use ResNet-50 as the backbone and replace the fully connected layer with a 1×1 convolution to reduce the dimensions of the feature maps. Considering that ResNet-50 downsamples 32 times, we obtain 256-d feature maps via a 3×3 atrous convolution to maintain translation variability. For each sliding position, we predict k region proposal boxes of different sizes and ratios; these boxes are called anchors. After the 256-d vector, we connect classification and regression branches through two parallel 1×1 convolution layers. The classification branch uses softmax to determine whether there is an object in each anchor, so this branch has 2k outputs. The regression branch regresses the four position parameters of each anchor (i.e., center coordinates, width, and height), so there are 4k outputs. RPN proposes k×w×h anchors from a w×h feature map; these are the RoIs. We use the non-maximum suppression (NMS) algorithm neubeck2006efficient to filter duplicate anchors, with an IOU threshold of 0.7.

3.2.2 Position sensitive region of interest

On the basis of RoIs, the two-stage detection framework classifies and fine-tunes the locations of bounding boxes. In Faster R-CNN, RoIs are scaled to the last feature maps and the network focuses on these areas through RoIPooling. Next, each RoI is classified and regressed through two fully connected layers, causing high computational complexity.

In order to reduce the number of parameters, we use R-FCN dai2016r to generate position-sensitive score maps via a convolutional layer connected to the backbone. Both the classification and regression tasks have independent position-sensitive score maps, forming three parallel branches with RPN.

For the classification task, since we only need to classify ants and background, we use k×k×2 convolution kernels to generate score maps. Here k×k indicates that each RoI is divided into k×k regions to encode position information; each region is encoded by a specific feature map with two dimensions. Similarly, we use k×k×4 convolution kernels for fine-tuning the positions of RoIs in the regression task.

To focus on RoIs, we perform average pooling on each region to get the feature maps, called position sensitive region of interest (PSRoI) pooling, as the following formula shows:

r_c(i, j | Θ) = (1/n) Σ_{(x,y) ∈ region(i,j)} z_{i,j,c}(x + x₀, y + y₀ | Θ)

Here, r_c(i, j | Θ) is the result of downsampling in region (i, j) for category c, z_{i,j,c} is one score map in the k×k×2 position-sensitive score maps, (x₀, y₀) represents the left-top corner of the RoI, Θ is the set of parameters of the network, and n is the number of pixels in the region.
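As an illustration of the pooling operation described above, the following NumPy sketch averages each of the k×k bins over its dedicated score map for a single class. The bin-boundary rounding is an assumption, since the text does not specify it; this is illustrative code, not the authors' implementation.

```python
import numpy as np

def psroi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI average pooling (sketch for one class).

    score_maps: (k*k, H, W) array -- one score map per spatial bin.
    roi: (x0, y0, w, h) in feature-map pixels (top-left corner plus size).
    Returns a (k, k) grid where entry (i, j) is the mean of score map
    i*k + j over the pixels of bin (i, j), plus the voted overall score.
    """
    x0, y0, w, h = roi
    pooled = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            # Pixel extent of bin (i, j) inside the RoI.
            ys = y0 + int(np.floor(i * h / k))
            ye = y0 + int(np.ceil((i + 1) * h / k))
            xs = x0 + int(np.floor(j * w / k))
            xe = x0 + int(np.ceil((j + 1) * w / k))
            region = score_maps[i * k + j, ys:ye, xs:xe]
            pooled[i, j] = region.mean()  # average over the n pixels in the bin
    # Voting: average the k*k bin scores into one score for the RoI.
    return pooled, pooled.mean()
```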

For the pooled feature maps, we vote over the k×k regions to get the overall score of the RoI on the classification or regression task, as the following formula shows:

r_c(Θ) = Σ_{i,j} r_c(i, j | Θ)

In the formula, r_c(Θ) represents the overall score of all regions for category c.

Next, we use softmax to implement binary classification, as the following formula shows:

s_c(Θ) = e^{r_c(Θ)} / Σ_{c'} e^{r_{c'}(Θ)}

Here, s_c(Θ) is the probability of category c. Finally, we use the non-maximum suppression algorithm to filter the bounding boxes.

Since object detection includes classification and regression, we require a multitask loss function. In this paper, we weight the loss functions of the two tasks. Because softmax is used for the binary classification task, it is natural to adopt the cross-entropy loss for classification. For the regression task, we calculate the matching degree between the four position parameters and the ground truth:

L(s, t) = L_cls(s_{c*}) + λ [c* = 1] L_reg(t, t*)

where c* is the ground-truth category label of the RoI, and c* = 1 represents ants. L_cls represents the cross-entropy loss:

L_cls(s_{c*}) = -log s_{c*}

L_reg represents the loss of the regression task, covering the 4 position dimensions:

L_reg(t, t*) = Σ_{i ∈ {x, y, w, h}} smooth_L1(t_i - t*_i)

In the formula, t is the predicted position, and t* is the ground truth after translation and scaling.
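A minimal sketch of the weighted multitask loss, assuming the standard smooth-L1 regression loss popularized by Fast R-CNN. The function names are illustrative, not the authors' implementation.

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)

def detection_loss(probs, label, t_pred, t_gt, lam=1.0):
    """Multitask loss: cross-entropy + weighted smooth-L1 box regression.

    probs: softmax class probabilities; label: ground-truth class index
    (0 = background, 1 = ant); t_pred/t_gt: 4-d box parameters.
    The regression term applies only to foreground (label > 0) RoIs.
    """
    cls_loss = -np.log(probs[label])
    reg_loss = smooth_l1(np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)).sum()
    return cls_loss + lam * (label > 0) * reg_loss
```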

3.3 MOT framework

3.3.1 Offline ResNet network architecture

We adopt a 15-layer ResNet network architecture to extract the appearance descriptors of objects, as Figure 2 shows. After downsampling eight times, the network will eventually obtain a 128-dimensional feature vector through a fully connected layer. The specific parameters are consistent with CAO2020107233.

3.3.2 Cosine similarity metric classifier

We modify the parameters of softmax to get a cosine similarity metric classifier, which can measure the similarity within the same category or between different categories. First, the output of the fully connected layer is normalized by batch normalization, ensuring that each feature vector r has unit length, ‖r‖ = 1. Second, we normalize the weights, that is, w̃_k = w_k / ‖w_k‖. The cosine similarity metric classifier is constructed as follows:

p(y = k | r) = exp(κ · w̃_k^T r) / Σ_n exp(κ · w̃_n^T r)

Here, κ is the free scaling parameter.

Because the cosine similarity classifier follows the structure of softmax, we use the cross-entropy loss for training:

L = Σ_i L_CE(p(y | r_i), y_i)

Here, L represents the sum of the cross-entropy loss over the images, p(y | r_i) is the prediction result of image i on its label, and y_i is the ground truth.
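The classifier described above can be sketched as follows, assuming the normalized-feature, normalized-weight softmax form; `kappa` stands in for the free scaling parameter. This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def cosine_softmax(features, weights, kappa=10.0):
    """Cosine similarity metric classifier (sketch).

    features: (d,) descriptor; weights: (C, d) class weight matrix.
    Both are L2-normalized, so the logit for class k is kappa times the
    cosine of the angle between the descriptor and the class template.
    """
    r = features / np.linalg.norm(features)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    logits = kappa * (w @ r)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()
```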

3.3.3 Motion matching

We use the KF model to predict the positions of trajectories in the current frame. Then, we calculate the squared Mahalanobis distance between the predicted position and the detected bounding box position to measure the degree of motion matching wojke2017simple as follows:

d^(1)(i, j) = (d_j − y_i)^T S_i^{-1} (d_j − y_i)

Here, d_j is the position of the detection box, y_i is the position of trajectory i predicted by the KF, and S_i is the covariance matrix between the trajectory and the detected bounding box.

We use a 0-1 variable to indicate whether trajectory i and detection j meet the association condition. If the Mahalanobis distance satisfies d^(1)(i, j) ≤ t^(1), the pair (i, j) will be added to the association set. The formula can be expressed as:

b^(1)_{i,j} = 1[d^(1)(i, j) ≤ t^(1)]

Here, b^(1)_{i,j} is the motion association signal.
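A sketch of the motion gate described above. The concrete threshold value is an assumption: the text does not state it, and a common choice in TBD trackers is a chi-square quantile for the dimensionality of the measurement (9.4877 is the hypothetical 95% quantile for 4 degrees of freedom).

```python
import numpy as np

def motion_gate(detection, prediction, covariance, threshold=9.4877):
    """Squared Mahalanobis distance gate between a KF prediction and a
    detection. Returns (d2, signal): the squared distance and the 0-1
    motion association signal.
    """
    diff = np.asarray(detection, dtype=float) - np.asarray(prediction, dtype=float)
    d2 = float(diff @ np.linalg.inv(covariance) @ diff)
    return d2, int(d2 <= threshold)
```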

3.3.4 Appearance matching

We use the appearance descriptors to measure the appearance similarity between ants. Furthermore, we create a gallery R_i for each trajectory, and each gallery stores the latest 100 appearance descriptors. Then, we calculate the cosine distances between the appearance descriptors in the gallery and those of the candidate bounding boxes. The smallest distance is used as the appearance matching degree as follows:

d^(2)(i, j) = min{ 1 − r_j^T r_k^(i) | r_k^(i) ∈ R_i }

where r_j is the appearance descriptor of detection box j, r_k^(i) is the k-th appearance descriptor of trajectory i, and d^(2)(i, j) represents the appearance matching degree between the trajectory and the bounding box.

Similarly, we introduce a 0-1 variable as an association signal. If the appearance matching degree of a pair of trajectory and detection boxes meets the threshold, we add it to the association set:

b^(2)_{i,j} = 1[d^(2)(i, j) ≤ t^(2)]

where b^(2)_{i,j} represents the appearance association signal. In this paper, t^(2) is set to 0.2.
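A sketch of the appearance matching degree, assuming L2-normalized descriptors so that cosine distance reduces to one minus a dot product. The threshold of 0.2 is the one stated in the text; the function name is illustrative.

```python
import numpy as np

def appearance_gate(det_descriptor, gallery, threshold=0.2):
    """Appearance matching degree between a detection and a trajectory.

    gallery: (m, d) array of the trajectory's stored descriptors (the
    paper keeps up to the latest 100). All descriptors are assumed to be
    L2-normalized. The matching degree is the smallest cosine distance
    to any gallery entry; the second return value is the 0-1 signal.
    """
    distances = 1.0 - gallery @ det_descriptor  # cosine distance to each entry
    degree = float(distances.min())
    return degree, int(degree <= threshold)
```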

3.3.5 Comprehensive matching

To combine motion and appearance information, we define a comprehensive association signal b_{i,j}. Only when both the motion and appearance matching degrees meet their thresholds is the pair considered for matching. The formula is denoted as follows:

b_{i,j} = b^(1)_{i,j} · b^(2)_{i,j}

However, the KF can scarcely track accurately over long periods, because the motion of ants is complicated. Therefore, we use the appearance matching degree (Section 3.3.4 Appearance matching) as the association cost.

3.3.6 Track update

First, we use matching cascade to give matching priority to the most recently associated trajectories, avoiding the trajectory drift caused by long-term occlusion wojke2017simple. During matching, we use the Hungarian algorithm to find the minimum-cost matches in the association cost matrix. For unmatched trajectories and detection boxes, we calculate the IOU; if it meets the threshold, they are associated. After that, trajectories need to be updated. They have three states: unconfirmed, confirmed, and deleted. We assign a new trajectory to each unmatched detection, and if the duration of a trajectory is less than three frames, it is set to the unconfirmed state. Unconfirmed trajectories need to be successfully associated for three consecutive frames before being converted to the confirmed state; otherwise, they are deleted. For unmatched confirmed trajectories, if they were successfully matched in the previous frame, the KF estimates and updates their motion state in the current frame; otherwise, we suspend tracking. Moreover, if the number of consecutively lost frames of a confirmed trajectory exceeds the threshold (Amax = 30), it is deleted.
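The trajectory life cycle above can be summarized by a small state machine. The class below is an illustrative sketch, not the authors' code, using the confirmation count of three consecutive frames and Amax = 30 from the text.

```python
class Track:
    """Minimal trajectory life cycle: unconfirmed -> confirmed -> deleted."""

    CONFIRM_HITS = 3   # consecutive associations needed to confirm
    A_MAX = 30         # max consecutively lost frames before deletion

    def __init__(self):
        self.state = "unconfirmed"
        self.hits = 1   # a track is born from one matched detection
        self.lost = 0   # consecutive frames without a match

    def mark_matched(self):
        """Called when the track is associated with a detection."""
        self.lost = 0
        self.hits += 1
        if self.state == "unconfirmed" and self.hits >= self.CONFIRM_HITS:
            self.state = "confirmed"

    def mark_missed(self):
        """Called when the track gets no detection this frame."""
        self.hits = 0
        if self.state == "unconfirmed":
            self.state = "deleted"  # unconfirmed tracks die immediately
        else:
            self.lost += 1
            if self.lost > self.A_MAX:
                self.state = "deleted"
```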

4 Results

4.1 Ant colony database

We establish a video database of ant colonies, which contains a total of 10 videos. Five videos are from an existing published work CAO2020107233 and were captured in an indoor (laboratory) environment. The remaining five outdoor videos were captured against different backgrounds and obtained from the online website DepositPhotos (http://www.depositphotos.com). Table 1 shows detailed information on the indoor and outdoor videos. The resolutions of the indoor and outdoor videos are 1920×1080 and 1280×720, respectively.

Sequence FPS Resolution Length Ants Annotations
25 1920×1080 351 (00:14) 10 3510
351 (00:14) 10 3510
351 (00:14) 10 3510
351 (00:14) 10 3510
1001 (00:40) 10 3510
30 1280×720 600 (00:20) 73 11178
677 (00:23) 162 25158
577 (00:19) 133 10280
526 (00:18) 193 27902
569 (00:19) 101 22044
Table 1: Statistics of ant videos with annotations in indoor and outdoor scenes.

The videos in our database have a total of 4983 frames. There are 10 ants per frame in the indoor videos, and 18-53 ants per frame in the outdoor videos. The number of objects in this scenario is significant, considering that the popular COCO benchmark dataset contains on average only 7.7 instances per image. Some video characteristics present challenges for detection and tracking algorithms, for example over-exposure in the indoor videos and diverse backgrounds in the outdoor ones.

There are caves and rugged terrain in the outdoor scenes, and ants may enter or leave the scene. Unlike multi-human tracking, ants are visually similar, which causes significant challenges for tracking. We manually mark the videos frame by frame. To facilitate training and reduce labeling cost, the aspect ratio of each bounding box is 1:1. Considering the posture and scale of the ants, we set the size of the bounding box to 96×96 for indoor videos and 64×64 for outdoor videos. The database and code will be made publicly available.

4.2 Evaluation index

In this paper, the evaluation indicators of detection and tracking performance are as follows:

  • Mean Average Precision (MAP): the weighted sum of the average precision of all videos. The weight value is the proportion of frames.

  • False Positive (FP): the total number of false alarms.

  • False Negative (FN): the total number of objects that do not match successfully.

  • Identity Switch (IDS): the total number of identity switches during the tracking process.

  • Fragments (FM): the total number of incidents where the tracking result interrupts the real trajectory.

  • mean Multi-object Tracking Accuracy (mMOTA): the weighted sum of the average tracking accuracy of all videos. The equation to compute mMOTA is: mMOTA = 1 - (FP + FN + IDS)/NUM_LABELED_SAMPLES, where NUM_LABELED_SAMPLES is the total number of labeled samples.

  • mean Multi-object Tracking Precision (mMOTP): the weighted sum of the average tracking precision of all videos. Tracking precision measures the intersection over union (IOU) between labeled and predicted bounding boxes.

  • Frame Rate (FR): the number of frames being tracked per second.
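The mMOTA definition above amounts to the following computation; the helper names are illustrative.

```python
def mmota(fp, fn, ids, num_labeled):
    """MOTA for one video, as defined above:
    1 - (FP + FN + IDS) / number of labeled samples."""
    return 1.0 - (fp + fn + ids) / num_labeled

def weighted_mean(values, frame_counts):
    """Weight per-video scores by each video's share of frames,
    as done for the 'mean' (m-) variants of the metrics."""
    total = sum(frame_counts)
    return sum(v * n for v, n in zip(values, frame_counts)) / total
```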

4.3 Results of multi-ant detection

In our ant database, we set up five groups of training sets (Table 2) and compare their performance with that of the remaining datasets. The naming conventions are:

  • + represents a union of the indoor video and the outdoor video.

  • represents a union of indoor videos with their IDs of [1,2,3,4].

  • represents the last 50 frames selected from the indoor video. This partition strategy ensures the frame continuity for the subsequent tracking task.

  • represents the union of 5 subsets: the last 50 frames selected from each of the outdoor videos with their IDs of [1,2,3,4,5].

In all scenarios, the detection accuracy of indoor videos is higher than that of outdoor videos, and MAP reaches over 90%. We also noticed that the test result for outdoor videos was only 49.7% on +. This is because we used only as the outdoor training set, which is insufficient to cover the wide range of diversity in terms of environmental backgrounds and ant appearances.

Training Data Testing Data Objects MAP FR
+ 10 90.4 12.5
28 49.7 16.0
+ 10 90.4 12.2
33 81.9 17.1
+ 10 90.4 12.3
33 82.4 16.6
+ 10 90.5 12.3
33 85.1 16.6
+ 10 90.5 11.8
33 85.8 16.2
Table 2: Detection results of different training sets.
Figure 3: Detection accuracy of different training sets.

In the subsequent experiments, we integrate the images of all outdoor scenes into the outdoor training set, which dramatically improves the accuracy of outdoor testing. Figure 3 clearly shows the effects of using different training sets. Further increasing the number of images from the outdoor videos improves the detection accuracy of outdoor scenes only slightly. For indoor environments, the detection accuracy is impervious to the choice of training set. Moreover, reducing the number of images to 50 ( has a total of 351 frames) does not reduce the detection accuracy. This shows that only a small number of training samples is needed to achieve satisfactory results when the training and testing scenarios are the same.

The frame rate is around 12 FR for indoor videos and 16 FR for outdoor ones; the difference in image resolution accounts for this performance gap. In practical applications, if accuracy is guaranteed, we prefer smaller training sets to reduce labeling costs. Therefore, we use the model trained on “+” for comparison with the other methods in the comparative experiments.

4.4 Results of multi-ant tracking

Based on the TBD paradigm, we use detection results as the input to the tracking framework. For offline training, we randomly select 50 labeled samples from as the training set. We visualized the tracking results in Figure 4.

Figure 4: Tracking trajectories in test videos. Horizontal axes indicate the pixel coordinates in an image. (a-d) indoor scenes. (e-i) outdoor scenes.

Table 3 shows the performance of online tracking. After integrating the images of each outdoor video in the detection training set, our method gets 95% mMOTA for indoor videos and over 80% for outdoor videos. Additionally, mMOTP is around 80% for both indoor and outdoor videos. Notably, since the tracking performance depends on the detection result, the tracking task in fails due to the low-quality detection (the second row in Table 3). Except for this failure case, the tracking performance is generally satisfactory considering that we only use 50 labeled samples from one indoor video.

FP FN IDS mMOTA mMOTP FR
236 621 21 89.9 79.9 36.5
detection failure
239 628 22 95.7 81 38.9
6078 7122 625 81.8 81.9 26.2
260 617 14 95.7 81 38.7
4867 6018 526 83.3 82.7 28.0
228 709 17 95.4 81 36.3
4289 3421 394 85.3 83 24.9
224 820 24 94.8 81.7 35.4
2644 3007 266 85.8 83.3 26.4
/GT 22 23 8 99.6 92.4 35.2
/GT 1697 458 1064 96.2 92.4 25.9
Table 3: Tracking performance evaluation. The last two rows indicate that we use the ground truth of detection for tracking, which leads to a boost in tracking performance.

The time cost of the tracking model is mainly incurred by generating a 128-d feature vector for each detection box. The average number of objects in the outdoor videos is more than three times that in the indoor videos. As for runtime, FR reaches over 35 in indoor videos and more than 24 in outdoor videos.

We add a set of comparative experiments in the last two rows of Table 3. We directly use manually-labeled detection boxes for tracking and compare the detection results on the and . Both mMOTA and mMOTP have been dramatically improved. This implies that an increase in detection accuracy could further boost the tracking performance of our framework.

4.5 Comparative experiments

There are two widely used insect tracking tools: idTracker perez2014idTracker and Ctrax branson2009high. idTracker needs to specify the number of objects before tracking, in order to create a reference image set for each object. Meanwhile, Ctrax assumes that objects will rarely enter and leave the arena. Thus, neither is capable of tracking in outdoor scenes, where the number of ants varies. Therefore, we compare these two methods only on videos of indoor scenes. To compare them with our method, we convert their representations into square boxes like our ground truth. Table 4 and Figure 5 show the tracking results. In addition to a significant improvement in tracking accuracy, our method is 6 and 10 times faster than idTracker and Ctrax, respectively (see the FR column).

Method FP FN IDS FM mMOTA mMOTP FR
idTracker 881 8479 83 432 54 77.4 1.3
Ctrax 2832 5646 110 349 58.2 79.7 0.8
Ours 239 628 22 189 95.7 81.1 8.7
Table 4: Comparison of tracking results on videos .
Figure 5: Comparison of tracking performance in spatial-temporal dimension (). Horizontal axes indicate the pixel coordinates in an image.

We further compare the tracking accuracy of idTracker and Ctrax across different indoor videos, as Figure 6 shows. The large variance of idTracker's performance is caused by the number of static ants, which lead to missed tracking. Ctrax proves to be robust but with a lower accuracy than our method.

Figure 6: Comparison of tracking results for indoor scenes.

5 Discussion

5.1 Comparison of methods

In the previous section, we compared two methods of insect tracking, idTracker and Ctrax. idTracker uses the intensity and contrast of the foreground-segmented area to extract appearance features and construct a reference image set for each individual. However, it cannot track motionless individuals. Figure 5(a) shows that only a minority of the ants are successfully tracked over the whole video. Furthermore, there are some trajectory fragments due to the limitations of the foreground segmentation model for multiple objects.

Compared to our results, the trajectories of Ctrax are incomplete, which indicates more FN, as Figure 5(b) shows. Ctrax requires a sharp contrast between object and background, so ants passing through the overexposed areas of the scene are ignored. Additionally, Ctrax assumes that the motion of the object follows a linear model. However, the ants' movement is nonlinear, and their speed and direction might change abruptly, causing IDS in Ctrax.

Our method classifies and regresses twice to locate ants accurately. During the tracking stage, we use the historical appearance sequence as a reference and update it frame by frame. Compared with idTracker, our method effectively handles the long-term and short-term dependence of motion states, thereby reducing FM. Although we also assume a linear distribution of motion states, it is used only to filter impossible associations and has nothing to do with the association cost. We take the appearance distance between trajectories and detection boxes as the association cost, so the model is robust even when the ants' movement is complicated.

5.2 Failure cases

5.2.1 Limitations of detection framework

The number of ants in outdoor scenes is on average 33 per frame. It is also typical for ants to make close body contact with each other for the purpose of information sharing. Naturally, their extremely close interactions are highly likely to cause mis-detection (Figure 7(a)). Additionally, entrances and exits of ants in outdoor scenes are more prone to mis-detection (Figure 7(b)). Moreover, the dramatic non-rigid deformation of ants is another factor causing detection failure (Figure 7(c)). These three scenarios are all challenging cases that deserve our future efforts.

Figure 7: Examples of failed detection in outdoor scenes.

5.2.2 Limitations of tracking framework

According to Figure 8, Ant No.41 entered the scene at Frame No.88. Coincidentally, Ant No.32 left the scene in an adjacent region, but its trajectory was not deleted. At Frame No.93, Ant No.41 drifted to Trajectory No.32. This defect is caused by the insufficient appearance descriptors stored in the gallery of Ant No.41, combined with its entering near the exit location of another ant. This kind of mis-association occurs at the image boundary and accounts for the majority of IDS and FM in our experiments. However, when ants move inside the scope of both indoor and outdoor scenes, our method can accurately track multiple ants simultaneously for a long time, as Figure 1 shows.

Figure 8: Drift at the scene boundary. A newly-entered Ant No.41 is mis-associated with an existing Trajectory No.32.

6 Conclusion

We proposed a complete detection and tracking framework based on deep learning for ant colony tracking. In the detection stage, we adopted a two-stage object detection framework. We also used a ResNet model to obtain ant appearance descriptors for online association. Next, we combined appearance and motion information for the tracking task. The experimental results demonstrate that our method outperforms two mainstream insect tracking models in terms of accuracy, precision, and speed. In particular, our work shows its advantage in robustly detecting and tracking ant colonies in outdoor scenes, which is rarely reported in the existing literature. We believe our method could serve as an effective tool for high-throughput swarm behavior analysis of ant colonies, contributing to the development of embodied intelligence.

In future research, we aim to achieve more robust detection. For example, by exploring additional information on ants' skeletal structure, we can potentially solve the aforementioned failure cases of close interaction and non-rigid deformation. We also plan to improve the generalization ability of our detection and tracking frameworks so that they are applicable to a wide range of outdoor environments.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Author Contributions

All authors listed have made substantial, direct, and intellectual contribution to the work; they have also approved it for publication. In particular, MW, SG and XC contributed to the design of this work; MW, SG and XC contributed to the writing of the manuscript; XC designed and implemented the multi-ant tracking framework; XC conducted the experiments, and analyzed the results.


Funding

This work was supported by the Natural Science Foundation of Fujian Province of China (No. 2019J01002).

Data Availability Statement

The training and testing datasets used for this study can be found in the ANTS (ant detection and tracking) repository.