Benefiting from the development of Deep Neural Networks, Multi-Object Tracking (MOT) has made rapid progress. Real-time Joint-Detection-Tracking (JDT) based MOT trackers are gaining increasing attention and have produced many excellent models. However, the robustness of JDT trackers is rarely studied, and attacking an MOT system is challenging because its mature association algorithms are designed to be robust against errors during tracking. In this work, we analyze the weakness of JDT trackers and propose a novel adversarial attack method, called Tracklet-Switch (TraSw), against the complete tracking pipeline of MOT. Specifically, a push-pull loss and a center leaping optimization are designed to generate adversarial examples for both re-ID feature and object detection. TraSw can fool the tracker into failing to track the targets in subsequent frames by attacking very few frames. We evaluate our method on advanced deep trackers (i.e., FairMOT, JDE, ByteTrack) using the MOT-Challenge datasets (i.e., 2DMOT15, MOT17, and MOT20). Experiments show that TraSw achieves a high success rate of over 95% for the single-target attack while attacking only a few frames on average, and a reasonably high success rate of over 80% for the multiple-target attack. The code is available at https://github.com/DerryHub/FairMOT-attack.
Multiple Object Tracking (MOT) has been dramatically boosted in recent years due to the rapid development of Deep Neural Networks (DNNs) [bewley2016simple, chen2018real, wojke2017simple], and has a wide range of applications, such as autonomous driving, intelligent monitoring, human-computer interaction, etc. The deep MOT trackers can be divided into two main categories: Detection-Based-Tracking (DBT) trackers [choi2015near, milan2016mot16, yu2016poi, luo2020multiple] and Joint-Detection-Tracking (JDT) trackers [wang2020towards, centertrack, zhang2021fairmot]. Considering JDT trackers generally have faster speed and higher accuracy than DBT trackers, they remain dominant in the field of real-time multiple object tracking.
Adversarial attacks were first studied in image classification [szegedy2013intriguing] and eventually spread to other tasks, including face recognition [dong2019efficient, ijcai2021-173], object detection [xie2017adversarial, wei2018transferable, liang2021parallel] and semantic segmentation [xie2017adversarial, hendrik2017universal]. To our knowledge, the vulnerability of MOT trackers is rarely studied. In this work, we address the security of MOT and focus on attacking JDT trackers.
Unlike image classification or object detection, MOT aims to track all targets of interest in continuous video sequences and build their moving trajectories. A typical JDT tracker addresses the problem in two steps [wang2020towards]. First, the tracker locates the objects of interest and extracts their features in each frame of the video. Then, according to a specific similarity metric, each detected object is associated with a trajectory. The continuous tracking process allows the tracker to keep the trajectories' motion and appearance information in tracklets for a long period (e.g., 30 frames) to make more precise decisions [wang2020towards]. This means that detection-based attack methods [xie2017adversarial, wei2018transferable, liang2021parallel], which blind the detector to the target object, need to succeed continuously for many more frames to fool the tracker, as shown by our experiments in Sec. 4.3. Besides, although the association between objects and existing trajectories is essentially a classification problem, it considers both motion and appearance information. It is therefore hard to define an optimization objective for generating adversarial examples, especially because the motion information is discrete.
In this work, we study adversarial attacks against JDT trackers and propose a novel attack method called Tracklet-Switch (TraSw), consisting of the PushPull and CenterLeaping techniques. TraSw can fool the tracker by attacking as few as one frame. In a nutshell, our method learns an effective perturbation generator to make the tracker confuse intersecting trajectories, which are very common in MOT, especially in pedestrian tracking. Specifically, we analyze the association algorithm, which combines appearance and motion distances into the similarity cost matrix. When the detected objects overlap each other, the tracker identifies which trajectory an object belongs to by comparing its appearance with the trajectories' appearances. To incorporate past appearance information and mitigate the impact of a single frame, the tracker represents a trajectory's appearance with a smoothed feature. As frames go by, the influence of past appearances diminishes. Based on this observation, the attacker deliberately reshapes the feature embeddings of the overlapping objects so that each becomes similar to the other trajectory's appearance [wang2020towards] and farther from its own. During this process, the appearances of the two trajectories are silently modified and switched. As a result, the tracker may track a completely different object without noticing the error. An example is shown in Fig. 1.
Our main contributions can be summarized as follows:
We are the first to study adversarial attacks against JDT trackers. A novel and efficient TraSw method is proposed to deceive advanced JDT trackers by attacking only a few frames in the videos.
Extensive experiments on the MOT-Challenge datasets demonstrate that our method can efficiently fool JDT trackers using PushPull on the re-ID branch and CenterLeaping on the detection branch.
Our method has excellent transferability to DBT trackers. Experimental results show that the state-of-the-art (SOTA) DBT tracker (e.g., ByteTrack [zhang2021bytetrack]) can also be deceived by TraSw, even though it is not specially designed for DBT trackers.
Multiple object tracking (MOT) aims to locate and identify the targets of interest in the video, such as pedestrians on the street, vehicles on the road, and animals on the ground, and then estimate their movements in the subsequent frames [luo2020multiple].
The mainstream MOT trackers are divided into Detection-Based-Tracking (DBT) trackers [choi2015near, milan2016mot16, yu2016poi, luo2020multiple] and Joint-Detection-Tracking (JDT) trackers [wang2020towards, centertrack, zhang2021fairmot]. The DBT trackers break tracking down into two steps: 1) the detection stage, in which targets are localized; 2) the association stage, in which the targets are linked to existing trajectories. However, the DBT trackers are inefficient and not optimized end-to-end because of the two-step processing. To address this problem, Wang et al. [wang2020towards] designed the first JDT tracker, JDE, which reaches real-time performance without losing accuracy. By incorporating the appearance embedding model into a single-shot detector so that detections and embeddings are output simultaneously, JDE achieves end-to-end training and proposes an effective association method.
The association problem of MOT is regarded as a bipartite matching problem [jia2020fooling, wang2020towards] based on the similarity between the detected objects and the trajectories. The tracker uses tracklets to maintain the appearance and motion states of the trajectories. In the first frame, the tracker recognizes the objects, numbers the trajectories in order, and saves their states as the initial tracklets. Then, for each coming frame, the tracker compares each detected object with the tracklets to determine whether it belongs to an existing trajectory or starts a new one. After that, the tracker updates the trajectory information with the current frame (i.e., updates the tracklets). In the meantime, the tracker also needs to estimate the current trajectories. When a tracklet is not updated for a number of consecutive frames (e.g., 30 frames), the tracklet is deleted and the tracking of this trajectory is terminated. Even if the corresponding object reappears after that many frames, it is regarded as a new trajectory.
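The tracklet life cycle described above can be illustrated with a minimal sketch. The class and parameter names (`TrackletPool`, `max_missed`) are hypothetical, the matching step is assumed to be done elsewhere, and deleting a tracklet once it has missed exactly `max_missed` consecutive frames is an assumption about the boundary case rather than a detail given in the text.

```python
from dataclasses import dataclass


@dataclass
class Tracklet:
    track_id: int
    missed: int = 0  # consecutive frames without a matched detection


class TrackletPool:
    def __init__(self, max_missed=30):
        self.max_missed = max_missed
        self.tracklets = {}
        self._next_id = 1  # trajectories are numbered in order of appearance

    def step(self, matched_ids, num_unmatched_detections):
        """Advance one frame: refresh matched tracklets, age the rest, start new ones."""
        for tid in list(self.tracklets):
            tracklet = self.tracklets[tid]
            if tid in matched_ids:
                tracklet.missed = 0
            else:
                tracklet.missed += 1
                if tracklet.missed >= self.max_missed:
                    del self.tracklets[tid]  # tracking of this trajectory ends
        new_ids = []
        for _ in range(num_unmatched_detections):
            tid = self._next_id
            self._next_id += 1
            self.tracklets[tid] = Tracklet(tid)
            new_ids.append(tid)
        return new_ids


pool = TrackletPool(max_missed=30)
first_ids = pool.step(set(), 2)      # two objects appear in the first frame
for _ in range(30):                  # object 2 is not matched for 30 frames
    pool.step({1}, 0)
reborn_ids = pool.step({1}, 1)       # the object reappears and gets a fresh ID
```

In this toy run the reappearing object is not re-attached to its old ID, which is the behavior the attack relies on once a switch has persisted long enough.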
So far, the JDT approach is used in most real-time scenarios and has inspired many outstanding models. Our primary target model is FairMOT [zhang2021fairmot], one of the best-known and most widely used of these models.
CNN models have been known to be vulnerable to adversarial examples since their first discovery in 2014 [szegedy2013intriguing]. Since then, numerous adversarial attack methods have been proposed [goodfellow2014explaining, madry2018towards, dong2018boosting, lin2020nesterov, Croce020autoattack]. Most adversarial attack research focuses on the basic computer vision task of image classification. To our knowledge, adversarial attack research on Visual Object Tracking systems is scarce, especially on Multiple Object Tracking systems.
Recently, there have been several explorations of adversarial attacks against Single Object Tracking (SOT) [chen2020one, yan2020cooling, jia2021iou], which aims at tracking a given object in the video sequence. Chen et al. [chen2020one] propose the first one-shot attack method, which adds perturbations to the target patch in the initial frame to make the tracker lose the target in the subsequent frames. Meanwhile, Yan et al. [yan2020cooling] propose a cooling-shrinking attack method to fool SiamRPN-based trackers [li2019siamrpn++, li2018high], which cools the hot regions on the heatmap and shrinks the predicted bounding box so as to make the target invisible to the trackers. Most recently, Jia et al. [jia2021iou] present a decision-based black-box attack that gradually decreases the IoU scores.
For the MOT system, tracker hijacking [jia2020fooling] is the first adversarial attack method to deceive DBT trackers in autonomous driving. It uses adversarial examples generated against object detection, which fabricate a bounding box shifted toward the attacker-specified direction and erase the original bounding box. This makes the tracker assign an erroneous velocity to the attacked tracklet, so that the detected object ends up too far from the tracklet's expected position to be associated.
In this section, we propose a novel method called the Tracklet-Switch (TraSw) attack against JDT trackers. Our method finds two intersecting trajectories and makes their tracklets switch. We choose FairMOT [zhang2021fairmot] as our main target model, but our method can also be adapted to attack other JDT trackers, and even DBT trackers.
FairMOT stands out among many JDT trackers by achieving a good trade-off between accuracy and efficiency. As shown in Fig. 1(a), the network architecture consists of two homogeneous branches for object detection and feature extraction. The online association also plays an essential role in FairMOT, as shown in Fig. 1(b).
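Detection Branch. The anchor-free detection branch of FairMOT is built on CenterNet [duan2019centernet], consisting of the heatmap head, center-offset head and box-size head. We denote the ground-truth (GT) bounding box of the $i$-th object in the $t$-th frame as $box_t^i = (x_1^i, y_1^i, x_2^i, y_2^i)$. Object $i$'s center $(c_x^i, c_y^i)$ is computed by $c_x^i = (x_1^i + x_2^i)/2$ and $c_y^i = (y_1^i + y_2^i)/2$. The response location of the bounding box's center on the heatmap is obtained by dividing by the stride (which is 4 in FairMOT): $(\tilde{c}_x^i, \tilde{c}_y^i) = (\lfloor c_x^i/4 \rfloor, \lfloor c_y^i/4 \rfloor)$. The heatmap value indicates the probability that an object is centered at the corresponding location. The GT box size and center offset are computed by $size_t^i = (x_2^i - x_1^i,\, y_2^i - y_1^i)$ and $\mathit{offset}_t^i = (c_x^i/4 - \tilde{c}_x^i,\, c_y^i/4 - \tilde{c}_y^i)$, respectively.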
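As an illustration of the detection-branch quantities, the following minimal sketch computes the heatmap cell, box size and center offset for one box. The function name `detection_targets` and the example box are hypothetical; the stride of 4 is the value stated for FairMOT, and keeping the size in image pixels is an assumption of this sketch.

```python
from math import floor

STRIDE = 4  # heatmap stride stated in the text for FairMOT


def detection_targets(box, stride=STRIDE):
    """Return (heatmap_cell, size, offset) for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0        # box center in image pixels
    gx, gy = floor(cx / stride), floor(cy / stride)  # integer cell on the heatmap
    size = (x2 - x1, y2 - y1)                        # box width and height
    offset = (cx / stride - gx, cy / stride - gy)    # remainder lost by flooring
    return (gx, gy), size, offset


cell, size, offset = detection_targets((100.0, 40.0, 150.0, 160.0))
```

Because the heatmap cell is an integer grid index, small changes of the box do not move it continuously, which is the discreteness that later motivates CenterLeaping.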
Re-ID Branch. The re-ID branch generates the re-ID features to distinguish the objects. Denote the feature map as $feat_t \in \mathbb{R}^{D \times H \times W}$. The re-ID feature $feat_t^i \in \mathbb{R}^{D}$ represents the feature vector of the $i$-th object, whose $L_2$ norm equals 1.
Association. FairMOT follows the standard online association algorithm in [wang2020towards]. The tracker maintains a tracklet pool $\mathcal{T}_{t-1}$ containing all the valid tracklets before the $t$-th frame. A tracklet $tr_t^k$ describes the appearance state $e_t^k$ and motion state $m_t^k$ of the $k$-th trajectory in the $t$-th frame. The appearance state $e_t^k$ is initialized with the appearance embedding $feat_0^k$ of the trajectory's first observation and updated by:

$$e_t^k = \alpha \cdot e_{t-1}^k + (1 - \alpha) \cdot feat_t^k \tag{1}$$

where $feat_t^k$ is the appearance embedding of the matched object in the $t$-th frame and $\alpha$ is the smoothing coefficient. The bounding box information of $m_t^k$ is updated by the predicted center $(c_x^k, c_y^k)$, aspect ratio $a^k$ and height $h^k$ in the $t$-th frame, and the velocity information $(\dot{c}_x^k, \dot{c}_y^k, \dot{a}^k, \dot{h}^k)$ is updated by the Kalman filter. For a coming frame, we can compute the similarity between the observed objects in the current $t$-th frame and all the tracklets in the $(t-1)$-th frame.
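Then the association problem is solved by the Hungarian algorithm using the final cost matrix:

$$C = \lambda \cdot d_{box}\big(K(\mathcal{T}_{t-1}),\, box_t\big) + (1 - \lambda) \cdot d_{feat}\big(e_{t-1},\, feat_t\big) \tag{2}$$

where $box_t$ and $feat_t$ denote the detected bounding boxes and the features in the $t$-th frame, $e_{t-1}$ denotes the appearance states of the tracklets, $K(\cdot)$ represents the Kalman filter, which uses the tracklets to predict the corresponding trajectories' expected positions in the $t$-th frame, $d_{box}(\cdot)$ stands for a certain measurement of the spatial distance, $d_{feat}(\cdot)$ represents the cosine similarity, and $\lambda$ weights the two terms.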
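The association step can be sketched as follows. This is a toy illustration under stated assumptions, not the trackers' implementation: the smoothing coefficient `alpha=0.9` and the weight `lam=0.5` are hypothetical values, the appearance cost is taken as one minus the cosine similarity, the spatial distances are supplied as a precomputed matrix instead of coming from a Kalman filter, and a brute-force search over permutations stands in for the Hungarian algorithm (it finds the same optimum on small square matrices).

```python
from itertools import permutations
from math import sqrt


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def smooth_update(state, feat, alpha=0.9):
    """Exponential moving average of a tracklet's appearance state."""
    return [alpha * s + (1.0 - alpha) * f for s, f in zip(state, feat)]


def cost_matrix(states, feats, spatial, lam=0.5):
    """cost[k][i] mixes the spatial distance with the appearance distance."""
    return [
        [
            lam * spatial[k][i]
            + (1.0 - lam) * (1.0 - cosine_similarity(states[k], feats[i]))
            for i in range(len(feats))
        ]
        for k in range(len(states))
    ]


def assign(cost):
    """Minimum-cost one-to-one matching by exhaustive search (square matrices)."""
    n = len(cost)
    return min(
        permutations(range(n)),
        key=lambda perm: sum(cost[k][perm[k]] for k in range(n)),
    )


# Two overlapping objects: the spatial term is tied, so appearance decides.
states = [[1.0, 0.0], [0.0, 1.0]]    # appearance states of tracklets 0 and 1
feats = [[0.1, 0.9], [0.9, 0.1]]     # detection 0 now resembles tracklet 1
spatial = [[0.5, 0.5], [0.5, 0.5]]
match = assign(cost_matrix(states, feats, spatial))  # match[k] = detection index
```

In this example the matching assigns tracklet 0 to detection 1 and tracklet 1 to detection 0, which is the kind of identity switch that feature manipulation on overlapping objects aims for.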
Let $V = \{I_1, \dots, I_N\}$ denote a video containing $N$ frames. In simple terms, there are two intersecting trajectories observed by the tracker. Denote the two trajectories as $T_i$ and $T_j$, respectively, which are adjacent in the $t$-th frame. Their bounding boxes and features in the $t$-th frame are $box_t^k$ and $feat_t^k$ for $k \in \{i, j\}$.
In the following, we define the adversarial video as $\hat{V} = \{I_1, \dots, I_{t-1}, \hat{I}_t, \dots, \hat{I}_{t+n-1}, I_{t+n}, \dots, I_N\}$, where $I$ and $\hat{I}$ indicate the original frame and the adversarial frame, respectively. We define the problems of single-target attack and multiple-target attack as follows:
Single-Target Attack. For an attack trajectory $T_i$, we call $T_j$, which overlaps with $T_i$ in the $t$-th frame, the screener trajectory. The adversarial video $\hat{V}$ misleads the tracker to estimate the trajectory $T_i$ as $\hat{T}_i$, which follows $T_i$ before the $t$-th frame and $T_j$ from the $t$-th frame on. The single-target attack aims to add adversarial perturbations from the $t$-th to at most the $(t+n-1)$-th frame such that the tracking of trajectory $T_i$ is wrongly changed to that of $T_j$ from the $t$-th frame on.
Multiple-Target Attack. Similarly, the multiple-target attack aims to craft the adversarial video $\hat{V}$ such that all pairs of trajectories that overlap with each other are tracked wrongly from their overlapping frames on.
The JDT tracker distinguishes the objects through a combination of motion and appearance similarity. So when objects are close to each other, the tracker relies heavily on the features to distinguish them. Inspired by the triplet loss [schroff2015facenet], we design the PushPull loss as follows:

$$L_{pp}\big(e_{t-1}^i, e_{t-1}^j, feat_t^i\big) = d\big(e_{t-1}^i, feat_t^i\big) - d\big(e_{t-1}^j, feat_t^i\big) \tag{3}$$

where $d(\cdot)$ denotes the cosine similarity, $e_{t-1}^i$ represents the appearance state of the attack tracklet ($i$ represents the attack trajectory's ID), $e_{t-1}^j$ represents the appearance state of the screener tracklet, whose trajectory $j$ overlaps most with trajectory $i$, and $feat_t^k$ represents the feature of trajectory $k$ in the $t$-th frame. Minimizing the loss makes $feat_t^i$ dissimilar to tracklet $i$ and similar to tracklet $j$.
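Specifically, in FairMOT, the object's feature is extracted from the feature map according to the predicted object center $(\tilde{c}_x, \tilde{c}_y)$. Considering that the surrounding locations of the center may be activated, we calculate the appearance cost within a nine-block box location for a more stable attack. So the PushPull loss for the single-target attack against FairMOT is as follows:

$$L_{pp} = \sum_{k \in \{i, j\}} \sum_{(\delta_x, \delta_y) \in B} \Big[ d\big(e_{t-1}^{k},\, feat_t^{(\tilde{c}_x^k + \delta_x,\, \tilde{c}_y^k + \delta_y)}\big) - d\big(e_{t-1}^{k'},\, feat_t^{(\tilde{c}_x^k + \delta_x,\, \tilde{c}_y^k + \delta_y)}\big) \Big] \tag{4}$$

where $B$ indicates the set of offsets in the nine-block box location, as illustrated in Fig. 2(b); $k'$ denotes the trajectory that overlaps most with trajectory $k$ (i.e., $i' = j$ and $j' = i$); $d(\cdot)$ is the cosine similarity; $e_{t-1}^{k}$ is the appearance state of tracklet $k$; and $feat_t^{(\tilde{c}_x^k + \delta_x,\, \tilde{c}_y^k + \delta_y)}$ represents the feature at the position around the center of trajectory $k$.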
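The PushPull loss for the multiple-target attack against FairMOT is as follows:

$$L_{pp} = \sum_{k \in A_t} \sum_{(\delta_x, \delta_y) \in B} \Big[ d\big(e_{t-1}^{k},\, feat_t^{(\tilde{c}_x^k + \delta_x,\, \tilde{c}_y^k + \delta_y)}\big) - d\big(e_{t-1}^{k'},\, feat_t^{(\tilde{c}_x^k + \delta_x,\, \tilde{c}_y^k + \delta_y)}\big) \Big] \tag{5}$$

where $A_t$ represents the collection of trajectory IDs that could be attacked in the $t$-th frame, $B$ is the set of nine-block offsets, $k'$ is the trajectory that overlaps most with trajectory $k$, $d(\cdot)$ is the cosine similarity, $e_{t-1}^{k}$ is the appearance state of tracklet $k$, and $(\tilde{c}_x^k, \tilde{c}_y^k)$ is the center of trajectory $k$ on the feature map.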
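The nine-block PushPull loss can be illustrated numerically with a minimal sketch. It only evaluates the loss on a toy feature map and does not perform the gradient-based optimization; the names `push_pull_loss` and `NINE_BLOCK`, the nested-list feature map layout `feature_map[y][x]`, and the assumption that all centers lie at least one cell away from the map border are choices made for this example.

```python
from math import sqrt

# Offsets of the 3x3 neighborhood around an object center.
NINE_BLOCK = [(dx, dy) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def push_pull_loss(feature_map, centers, states, pairs):
    """Sum over each (k, other) pair and each nine-block cell around k's center of
    sim(own state, feature) - sim(other state, feature). Lower means closer to a switch."""
    loss = 0.0
    for k, other in pairs:
        cx, cy = centers[k]
        for dx, dy in NINE_BLOCK:
            feat = feature_map[cy + dy][cx + dx]
            loss += cos_sim(states[k], feat) - cos_sim(states[other], feat)
    return loss


states = {"i": [1.0, 0.0], "j": [0.0, 1.0]}
centers = {"i": (1, 1), "j": (5, 1)}
pairs = [("i", "j"), ("j", "i")]  # attack object and screener, both directions

# Clean map: features near each center match that trajectory's own state.
clean = [[[1.0, 0.0] if x <= 3 else [0.0, 1.0] for x in range(7)] for _ in range(3)]
# Fully switched map: features near each center match the other trajectory.
switched = [[[0.0, 1.0] if x <= 3 else [1.0, 0.0] for x in range(7)] for _ in range(3)]

clean_loss = push_pull_loss(clean, centers, states, pairs)
switched_loss = push_pull_loss(switched, centers, states, pairs)
```

The loss is highest on the clean map and lowest on the fully switched one, so minimizing it moves the features toward the switched configuration. For the multiple-target case, `pairs` would simply list every attackable trajectory together with its most-overlapping partner.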
Attacking the features of intersecting trajectories can generally fool the tracker. However, it is insufficient when the two boxes are too far apart. As the bounding boxes are computed from discrete locations of heat points in the heatmap, we cannot directly optimize a bounding-box loss to bring the objects close to each other.
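In order to reduce the distance between the objects, we propose a novel and simple method called CenterLeaping. The optimization objective can be summarized as increasing the IoU between the screener tracklet's predicted box and the attack trajectory's detected box. The goal is achieved by reducing the distance between their centers, as well as the differences of their sizes and offsets. Hence, the optimization objective can be expressed as follows:

$$\min \; \big\| c\big(K(tr_{t-1}^{j})\big) - c\big(box_t^{i}\big) \big\| + \big\| s\big(K(tr_{t-1}^{j})\big) - s\big(box_t^{i}\big) \big\| + \big\| o\big(K(tr_{t-1}^{j})\big) - o\big(box_t^{i}\big) \big\| \tag{6}$$

where $K(tr_{t-1}^{j})$ is the screener tracklet's predicted box, $box_t^{i}$ is the attack trajectory's detected box, $c(\cdot)$ computes the center of a box, $s(\cdot)$ represents the box size, and $o(\cdot)$ denotes the box offset.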
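In order to make the detected center of trajectory $k$ close to that of trajectory $k'$, based on the focal loss of FairMOT, we design the CenterLeaping loss as follows:

$$L_{cl} = -\sum_{k \in \{i, j\}} \Big[ \big(1 - M^{c_{k \to k'}}\big)^{\gamma} \log\big(M^{c_{k \to k'}}\big) + \big(M^{c_k}\big)^{\gamma} \log\big(1 - M^{c_k}\big) \Big] \tag{7}$$

where $k$ and $k'$ denote the two most overlapping trajectories in the $t$-th frame, $M^{(x, y)}$ means the value of the heatmap at location $(x, y)$, $c_k$ represents the center of trajectory $k$, $c_{k \to k'}$ represents the point in the direction from $c_k$ to $c_{k'}$, as shown in Fig. 2(c), and $\gamma$ is the focusing exponent of the focal loss. During the optimization iterations, the point $c_{k \to k'}$ leaps to the next grid cell along this direction until the attack succeeds. As a result, the heatmap values at the original centers' positions are cooled down, while the points closer to $c_{k'}$ are heated up.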
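The leaping of the target point across the heatmap grid can be sketched as below. Moving one cell per leap by the sign of the remaining displacement is a simplifying assumption of this example (the text only says that the point leaps to the next grid along the direction), and `leap_toward` and `leap_path` are hypothetical helper names.

```python
def leap_toward(point, target):
    """Move `point` one heatmap cell toward `target` (8-connected step)."""
    px, py = point
    tx, ty = target
    step_x = (tx > px) - (tx < px)  # -1, 0 or +1
    step_y = (ty > py) - (ty < py)
    return (px + step_x, py + step_y)


def leap_path(start, target):
    """All successive target points from `start` until `target` is reached."""
    path = [start]
    while path[-1] != target:
        path.append(leap_toward(path[-1], target))
    return path


# Center of the attack object at (10, 5), center of the screener at (13, 4).
path = leap_path((10, 5), (13, 4))
```

Each entry of `path` would serve in turn as the location to heat up while the original center is cooled down, with the leap to the next entry triggered after a number of optimization iterations.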
In addition, the widths and heights of the bounding boxes should not be overlooked due to Eq. 6. We restrain the sizes and offsets of the objects by a Smooth $L_1$ regression loss:

$$L_{reg} = L_{size} + L_{off} \tag{8}$$

where $L_{size}$ and $L_{off}$ both use the Smooth $L_1$ loss.
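With the summation of all the loss functions, i.e., the PushPull loss $L_{pp}$, the CenterLeaping loss $L_{cl}$ and the regression loss $L_{reg}$, we get the final optimization objective function:

$$L = L_{pp} + L_{cl} + L_{reg} \tag{9}$$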
To conceal the perturbation, we restrain the distance between the adversarial image $\hat{I}_t$ and the original image $I_t$. The adversarial image is then obtained iteratively as follows:
where $m$ denotes the $m$-th iteration of optimization.
An overview of the algorithm for crafting the adversarial videos is shown in Algorithm 1. Taking the single-target attack as an example, we specify an attack trajectory in the original tracking video before the attack. Then, for each coming frame, we initialize the two tracklet pools (the original one and the adversarial one) with the original frame, and also initialize the adversarial frame as the original frame. Then we conduct a double check to determine whether to attack the current frame: (1) whether the object of the attack trajectory has appeared for more than a minimum number of frames; (2) whether the tracking of the attack object is correct. If both checks are passed, we find the object that overlaps most with the attack trajectory and get the corresponding screener trajectory. Then we check whether the IoU between the attack trajectory and the screener trajectory is greater than the IoU threshold. If so, the current frame is attacked.
We generate the noise by optimizing Eq. 9 iteratively until the tracker makes mistakes in the current frame or the number of iterations reaches the maximum (60 by default). Note that during the generation, the target point of CenterLeaping jumps to the next grid cell closer to the screener's center after a certain number of iterations, as presented in Sec. 3.4. The noise is then added to the original frame by Eq. 10. We add the noise to the current frame whether or not the attack succeeds, and the experiments in Sec. 4.3 show that doing so makes the attack on the following frames easier. The adversarial tracklet pool is then re-updated with the adversarial frame, and the IoU threshold is set to zero. In the end, the adversarial frame is added to the adversarial video.
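The per-frame decision of whether to attack can be sketched as follows. The function names are hypothetical, the defaults of 10 frames and an IoU threshold of 0.2 are the default values reported in the evaluation section, and treating the frame-count check as "at least" is an assumption, since the text uses both "more than" and "at least".

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def should_attack(frames_seen, tracking_correct, attack_box, screener_box,
                  min_frames=10, iou_threshold=0.2):
    """Double check on the attack trajectory, then the overlap check with the screener."""
    if frames_seen < min_frames or not tracking_correct:
        return False
    return iou(attack_box, screener_box) > iou_threshold


attack_box = (0.0, 0.0, 10.0, 10.0)
near_screener = (5.0, 0.0, 15.0, 10.0)  # IoU = 1/3 with the attack box
far_screener = (9.0, 0.0, 19.0, 10.0)   # IoU of about 0.05 with the attack box

attack_now = should_attack(12, True, attack_box, near_screener)
too_early = should_attack(5, True, attack_box, near_screener)
too_far = should_attack(12, True, attack_box, far_screener)
# Once an attack has started, the threshold is dropped to zero, so any overlap qualifies.
keep_attacking = should_attack(12, True, attack_box, far_screener, iou_threshold=0.0)
```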
Attacked Models and Datasets. We choose two representative JDT-based trackers as the target models: FairMOT [zhang2021fairmot] and JDE [wang2020towards]. FairMOT is also used as the model for ablation studies. We validate the proposed TraSw attack method on the MOT-Challenge test datasets: 2DMOT15 [2015motchallenge], MOT17 [milan2016mot16] and MOT20 [dendorfer2020mot20]. As there is no available attack method for the MOT attack in the same scenario, we compare our method with the baseline of adding random noise, denoted as RanAt, whose distances are limited to randomly.
Evaluation Metric. We define an attack to be successful if the detected objects of the attack trajectory are no longer associated with the original tracklet after the attack. As described in Sec. 3.5, our method attacks when an object overlaps with the attack object, so the calculation of the attack success rate depends on two factors. First, we need to obtain the number of trajectories meeting the attack conditions: (1) the trajectory's object should have appeared for at least a minimum number of frames (10 by default); (2) there is another object overlapping with this object, and their IoU should be greater than the IoU threshold (0.2 by default). Second, we need to get the number of successfully attacked trajectories, for which the detected bounding box is no longer associated with the original tracklet after the attack. In addition, the error state should last for a sufficient number of frames (20 by default) or until the tracking of the attack object ends.
We report the results of the single-target attack and the multiple-target attack. The effectiveness and efficiency of our method are demonstrated through the attack success rate, the average number of attacked frames, and the average perturbation distance of the successful attacks.
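The three reported quantities can be computed from per-trajectory attack records, as in the following minimal sketch. The record fields (`error_frames`, `error_until_end`, `attacked_frames`, `distance`) and the example values are hypothetical and serve only to show the bookkeeping; the 20-frame default follows the evaluation metric above, and treating it as "at least 20" is an assumption.

```python
def summarize_attacks(num_eligible, records, min_error_frames=20):
    """Return (success rate in %, mean attacked frames, mean distance).

    The two means are taken over successful attacks only."""
    successes = [
        r for r in records
        if r["error_frames"] >= min_error_frames or r["error_until_end"]
    ]
    rate = 100.0 * len(successes) / num_eligible if num_eligible else 0.0
    if not successes:
        return rate, None, None
    mean_frames = sum(r["attacked_frames"] for r in successes) / len(successes)
    mean_dist = sum(r["distance"] for r in successes) / len(successes)
    return rate, mean_frames, mean_dist


# Illustrative records, not experimental data.
records = [
    {"error_frames": 25, "error_until_end": False, "attacked_frames": 2, "distance": 1.0},
    {"error_frames": 5, "error_until_end": True, "attacked_frames": 4, "distance": 2.0},
    {"error_frames": 3, "error_until_end": False, "attacked_frames": 9, "distance": 5.0},
    {"error_frames": 40, "error_until_end": False, "attacked_frames": 6, "distance": 3.0},
]
rate, mean_frames, mean_dist = summarize_attacks(4, records)
```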
Single-Target Attack. In the single-target attack, we only attack a specific trajectory. Tab. 1 shows the attack results on FairMOT and JDE using the three datasets. For a fair comparison, TraSw and RanAt add noise under the same conditions. To make the experiments more efficient, the addition of random noise in a tracklet is limited to 30 frames in FairMOT (50 in JDE). Figs. 3(b) and 3(a) show the variation of the attack success rate constrained by the number of attacked frames or by the average perturbation distance. TraSw performs markedly better, achieving a much higher attack success rate while attacking fewer frames with smaller perturbations.
Multiple-Target Attack. The experimental setting of the multiple-target attack is the same as that of the single-target attack. In the multiple-target attack, all the trajectories satisfying the attack conditions are attacked (Sec. 3.2). The performance of TraSw on each dataset is shown in Tab. 2. Unlike in the single-target attack, the attack-frame metric here represents the average ratio of attacked frames in the video. The results show that we can make more than 80% of the tracklets fail to track with only about half of the video frames being attacked. TraSw again achieves a much higher attack success rate while attacking a smaller ratio of frames with smaller perturbations.
Ablation Study. We discuss the necessity of PushPull (Sec. 3.3), CenterLeaping (Sec. 3.4), and adding the noise of failed attacks (Sec. 3.5). We conduct a series of comparison experiments on FairMOT to analyze and evaluate the contribution of each component in TraSw. The results are shown in Tab. 3. We can observe that each module has a different degree of importance on the two datasets. Overall, the results indicate that all the above components are helpful.
|Dataset||TraSw||Attack success rate (%)|
Parameter Study on IoU Threshold. Our method attacks when it finds two trajectories that overlap with each other and the IoU between them is greater than the IoU threshold. If the threshold is set too high, we may not be able to find attack targets in the video. If it is set too low, our method might not find the best opportunity to attack and would need to attack more frames. Therefore, we analyze the impact of the threshold, as shown in Fig. 5. The proportion of trajectories that can be attacked decreases as the threshold increases.
Comparison with Object-Detection-Based Attack. Because JDT-based MOT trackers contain a detection branch, object-detection-based attack methods can also be used to fool them. Hence, we choose a typical attack method against object detection, which aims to make the attack object invisible to the object detection module. By comparison, we observe that our method is more effective than making the attack object invisible, as shown in Fig. 6.
Transferability to Detection-Based Tracker. As presented in Sec. 3, TraSw is designed for JDT-based MOT trackers. However, we find that TraSw can also deceive single-stage DBT-based MOT trackers, although only the CenterLeaping module can be used because DBT trackers have only a detection branch. Specifically, we choose the SOTA DBT-based MOT tracker ByteTrack [zhang2021bytetrack] as the target model. As with the attack on JDT-based MOT, no attack method is available for DBT-based MOT in the same scenario, so we compare TraSw with the baseline of adding random noise. As shown in Tab. 4, TraSw achieves an average success rate of over 91% on ByteTrack with only a tiny increment in the distance.
Noise Pattern. As shown in Fig. 7, the noise mainly concentrates on the attack object and the screener object, leaving other regions almost unperturbed.
To the best of our knowledge, this is the first work to study adversarial attacks against JDT-based MOT trackers considering the complete tracking pipeline. We design an effective and efficient adversarial attack method that can cause tracking failures in the subsequent frames by attacking as few as one frame. The experimental results on standard benchmarks demonstrate that our method can fool advanced trackers efficiently, and the tracking failures also reveal the weakness of the association algorithm. We hope this work will inspire more research on designing robust MOT trackers and draw more attention to adversarial attacks and defenses on MOT in the future.