1 Introduction

Multiple object tracking (MOT) concerns identifying objects of interest and tracking their trajectories in video sequences. Intuitively, successful MOT algorithms must handle subtle appearance differences between multiple tracked objects and resolve ambiguity via other cues, such as motion, when the targets are visually indistinguishable.
With the powerful appearance-encoding capability of CNNs, the tracking-by-detection paradigm has dominated MOT methods in the past decade [carion2020end, zhang2020fairmot, Wojke2017simple]. Highly accurate CNN-based object detection [redmon2016you, ren2015faster, cai2018cascade] is first performed in all frames independently, and the detected objects are then associated across frames to establish tracks with consistent object IDs. In the association step, locations of existing tracks in the following frame may be predicted under motion assumptions (constant velocity, constant acceleration, etc.) or by other motion models [zhang2020fairmot, shuai2021siammot, welch1995introduction, Wojke2017simple], and then associated with detections based on metrics like intersection-over-union (IoU).
Joint-detection-and-tracking methods [zhou2020tracking, sun2020transtrack, wu2021track] have recently demonstrated superior accuracy. The idea is to perform object detection and tracking simultaneously so that each task benefits from information shared by the other. This is particularly intriguing in Transformer-based architectures, where output feature embeddings from previous frames are used as ‘track queries’ alongside ‘object queries’ in the Transformer decoder, predicting the corresponding tracks as well as newly discovered objects in the current frame (Figure 0(a)). Albeit achieving state-of-the-art MOT results, we argue that these architectures overly rely on appearance: since the information encoded in track queries is strictly limited to previous frames, the Transformer model needs to infer both the object offset and the object appearance in the current frame.
To resolve the above problem, we take inspiration from UP-DETR [dai2021up], an object detection model that is pre-trained to detect image patches (Figure 0(b)) using patch features, and propose an MOT system that uses patches from the current frame of interest. We first use a motion model to predict the new locations of existing tracks in the current frame from the previous frame, and crop the current frame into patches based on these predictions. The patches, carrying implicit prior knowledge of object motion and explicit information about object appearance in the current frame, are sent to the decoder to predict the new locations of existing tracks in the current frame.
More specifically, we present PatchTrack (Figure 0(c)), a Transformer-based joint-object-detection-and-tracking system that predicts tracks in the current frame of interest from its patches. We use the Kalman filter [welch1995introduction]
to obtain track candidates in the current frame from existing tracks in the previous frame, and crop the current frame using the bounding boxes of these candidates to get patches. Both the current frame and these patches are sent into our convolutional neural network (CNN) [Goodfellow-et-al-2016] backbone, which outputs the frame feature and the patch queries respectively. Each track query, taken from the output embeddings produced while processing the previous frame, is added to the patch query with the same tracking ID to form the corresponding patch-track query. These patch-track queries are sent to the decoder along with object queries; the former are used to predict new locations of existing tracks, while the latter are used to detect new objects in the current frame.
We evaluate PatchTrack on MOT benchmarks and achieve competitive results on the MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%) test sets. To the best of our knowledge, our method is the first to use patches of the current frame of interest to infer object motion and appearance information simultaneously. We hope it provides a new perspective for designing MOT systems.
In summary, our contributions are:
A Transformer-based MOT system, namely PatchTrack, which jointly performs object detection and tracking.
A novel way of optimizing the usage of visual information by utilizing patches from the current frame of interest.
Introduction of patch-track queries that incorporate knowledge of both object motion and object appearance in the current frame of interest to facilitate tracking.
2 Related Work
2.1 Object detection and tracking
Object detection concerns locating and/or classifying objects of interest in a single image. As a preliminary to object tracking, the two tasks are closely connected. Many popular object detection methods generate detections from hypotheses of object locations, including region proposals [girshick2014rich, girshick2015fast, ren2015faster, cai2018cascade] and anchors/object centers [redmon2016you, liu2016ssd, zhou2019objects]. On the other hand, an increasing number of object tracking systems utilize the Transformer [vaswani2017attention], which has previously shown success in object detection [carion2020end, meng2021conditional, zhu2020deformable, dai2021up]. Transformer-based object detection methods encode CNN [Goodfellow-et-al-2016] features of images and decode learned object queries to obtain detections. Aside from architectural adjustments [meng2021conditional, zhu2020deformable] to the original DETR [carion2020end], we also see modifications to object queries [dai2021up] that use image patches to facilitate detection. Inspired by the usage of region proposals and image patches, our proposed method uses frame patches, which can be considered our initial guess of track locations and appearance.
One major paradigm in MOT is tracking-by-detection, where the MOT system [carion2020end, zhang2020fairmot, Wojke2017simple] first obtains detections for each frame and then associates them across frames to form tracks. Since object detection is a standalone step in the tracking process, one benefit of tracking-by-detection methods is the flexibility to pair different object detection models [ren2015faster, redmon2016you, carion2020end] with different association strategies, and thus to benefit directly from advances in object detection. On the other hand, the object detection step ignores information across frames, as each frame is processed separately by the detector.
For these methods, object motion and appearance may only be considered as part of the detection association strategy [zhang2020fairmot, shuai2021siammot]. For object motion, the Kalman filter [welch1995introduction] is one of the most popular algorithms used to propagate detections from the previous frame and predict their locations in the next frame. Combined with the Hungarian algorithm [kenesei2002hungarian] and the intersection-over-union (IoU) metric, it has proven to be an effective tracking mechanism [bewley2016simple]. Object appearance information such as Re-ID features [Wojke2017simple, pang2021quasi, zhang2020fairmot] is also commonly used as a similarity measure.
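The IoU-based association step can be sketched as follows. This is a minimal, illustrative Python sketch, not the implementation of any cited tracker; for simplicity it uses greedy matching by descending IoU as a stand-in for the Hungarian algorithm:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def associate(track_boxes, det_boxes, iou_thresh=0.5):
    """Greedily match propagated track boxes to detections by descending IoU.

    Returns a list of (track_idx, det_idx) pairs; unmatched detections would
    start new tracks, unmatched tracks would be marked lost.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(track_boxes)
         for di, d in enumerate(det_boxes)),
        reverse=True)
    matches, used_t, used_d = [], set(), set()
    for score, ti, di in pairs:
        if score < iou_thresh:
            break  # all remaining pairs overlap too little
        if ti not in used_t and di not in used_d:
            matches.append((ti, di))
            used_t.add(ti)
            used_d.add(di)
    return matches
```

A proper implementation would solve the assignment optimally (e.g. the Hungarian algorithm), but the greedy variant conveys the mechanism.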
The other popular paradigm in MOT is joint-detection-and-tracking, where object detection and object tracking are performed simultaneously [zhou2020tracking, sun2020transtrack]. One advantage of joint-detection-and-tracking methods is access to information across frames. For instance, features of multiple frames can be used at once [zhou2020tracking, sun2020transtrack, wu2021track] to facilitate detection and/or tracking. For Transformer-based joint-detection-and-tracking methods, both the encoder and the decoder may take additional information from previous frames to infer predictions for the current frame of interest [sun2020transtrack, meinhardt2021trackformer, zeng2021motr]. Specifically, recent works have introduced track queries [sun2020transtrack, meinhardt2021trackformer], which come from the output embeddings produced while processing previous frames. Depending on the design, the track queries may be decoded to bounding boxes separately from the object queries and matched with them to predict new tracks [sun2020transtrack], or processed together with them to form new tracks directly [meinhardt2021trackformer].
3 Method

In this section, we describe the architecture of PatchTrack (Section 3.1), how object tracking is initialized (Section 3.2), how existing tracks are propagated to form new track candidates (Section 3.3), and how frame patches are generated to facilitate object tracking (Section 3.4).
3.1 Architecture overview

PatchTrack is a Transformer-based joint-detection-and-tracking system. The Transformer encoder takes in CNN features of a consecutive frame pair. The Transformer decoder takes queries as input and outputs bounding boxes. PatchTrack deals with four types of queries: object queries, track queries, patch queries, and patch-track queries. Depending on the source of the queries, the predicted bounding boxes may correspond either to tracks associated with existing tracking IDs or to detections that need to be assigned new tracking IDs.
3.2 Object tracking initialization
Object tracking for the first frame is equivalent to object detection, where each predicted detection can be arbitrarily assigned a unique tracking ID. The first frame is sent to the CNN backbone, which outputs the corresponding frame feature. This feature is stacked with itself [sun2020transtrack] and sent to the Transformer encoder. Since there are no existing tracks to form non-object queries, the Transformer decoder only takes object queries as input and produces embeddings. The output embeddings that result in non-background bounding boxes are the predicted detections in the first frame, each of which is assigned a unique tracking ID to form a track. These embeddings are also used as the track queries for the next frame.
3.3 Track propagation
For every frame after the first, the previous frame comes with a set of existing tracks. We propagate these tracks using a motion model to infer tracks in the current frame (Algorithm 1).
Here we use the Kalman filter [welch1995introduction] as our motion model to predict a set of track candidates for the current frame. We call them track candidates because using them directly as tracks raises several problems. First, since the candidates map one-to-one to tracks in the previous frame, they only cover objects that have already appeared. Second, although the Kalman filter and other motion models have shown effectiveness in many cases [bewley2016simple, veeramani2018deepsort, zhang2020fairmot], their predicted bounding boxes are often not accurate enough to localize objects precisely. This is why motion models are typically used only to propagate existing tracks, with IoU matching against new detections to form the new tracks. Within the joint-object-detection-and-tracking paradigm, our architecture is instead designed to refine these track candidates into more accurate tracks.
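The prediction step behind such track candidates can be illustrated with a minimal, self-contained sketch. The class below is a hypothetical 1-D constant-velocity Kalman filter (one coordinate of a box center); a real tracker filters a full box state, and the noise values `q` and `r` here are arbitrary illustrative choices, not values from the paper:

```python
class ConstantVelocityKF1D:
    """Minimal 1-D constant-velocity Kalman filter (illustrative only).

    State: (position, velocity). predict() propagates one frame ahead;
    update() corrects the state with a measured position.
    """
    def __init__(self, x0, q=1e-2, r=1.0):
        self.x = [x0, 0.0]                    # position, velocity
        self.P = [[1.0, 0.0], [0.0, 1.0]]     # state covariance
        self.q, self.r = q, r                 # process / measurement noise

    def predict(self):
        # x <- F x with F = [[1, 1], [0, 1]];  P <- F P F^T + Q
        x, v = self.x
        self.x = [x + v, v]
        (p00, p01), (p10, p11) = self.P
        self.P = [[p00 + p01 + p10 + p11 + self.q, p01 + p11],
                  [p10 + p11, p11 + self.q]]
        return self.x[0]                      # predicted position

    def update(self, z):
        # Kalman gain for measurement model H = [1, 0]
        p00 = self.P[0][0]
        k0 = p00 / (p00 + self.r)
        k1 = self.P[1][0] / (p00 + self.r)
        resid = z - self.x[0]
        self.x = [self.x[0] + k0 * resid, self.x[1] + k1 * resid]
        (p00, p01), (p10, p11) = self.P
        self.P = [[(1 - k0) * p00, (1 - k0) * p01],
                  [p10 - k1 * p00, p11 - k1 * p01]]
```

Running one such filter per box coordinate and reading off `predict()` each frame yields the propagated candidate boxes described above.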
3.4 Patch generation and object tracking
To tackle the above problems, we take inspiration from UP-DETR [dai2021up], whose Transformer decoder is pre-trained to detect the locations of random image patches using their corresponding CNN features. Our proposed PatchTrack takes patches of the current frame as additional visual information besides the entire frame to perform object tracking. Specifically, for each track candidate, we crop the current frame using its bounding box and send the resulting patch to the CNN backbone to get the corresponding patch feature. We use a fully-connected (FC) layer followed by global average pooling (GAP) to process all patch features into patch queries that align with the track queries (Figure 2). Each patch query is added to the track query with the same tracking ID to form a patch-track query. The patch-track queries are sent to the Transformer decoder along with the initial object queries, and both are processed jointly. The output embedding decoded from each patch-track query corresponds either to the refined location of the corresponding track candidate, or to the background if the object has left the frame. Meanwhile, the embeddings decoded from object queries that result in non-background detections locate new objects entering the frame, and these are assigned new tracking IDs to form new tracks. All embeddings that contribute to tracks in the current frame form the track queries for the next frame (Figure 2).
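The data flow of this step can be sketched in a few lines. The helpers below are hypothetical simplifications: frames are nested lists standing in for image tensors, queries are plain vectors standing in for embeddings, and the CNN/FC/GAP stages are elided:

```python
def crop_patch(frame, box):
    """Crop a frame given as nested lists (rows of pixels) with an
    integer (x1, y1, x2, y2) box; stands in for an image-library crop."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in frame[y1:y2]]

def make_patch_track_queries(patch_queries, track_queries):
    """Add each patch query to the track query sharing its tracking ID,
    element-wise, to form the patch-track queries sent to the decoder."""
    return {tid: [p + t for p, t in zip(pq, track_queries[tid])]
            for tid, pq in patch_queries.items()}
```

In the real model the cropped patches first pass through the CNN backbone, an FC layer, and GAP before this element-wise addition; the sketch only shows the ID-aligned combination.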
3.5 Track re-birth
To obtain the track queries for the next frame from those of the current frame, embeddings corresponding to new detections are added and track queries decoded to the background class are removed (Figure 2). A problem with this mechanism is that it is not robust to long-range tracking: if an object is not successfully detected, it can only be assigned a new tracking ID when it is detected again, which causes fragmented trajectories. To tackle this problem, we adopt the track re-identification strategy from TrackFormer [meinhardt2021trackformer] and store the otherwise removed patch-track queries in an inactive query set. Queries in this set are included in the list of patch-track queries sent to the decoder for at most a fixed number of consecutive frames. If a query can be decoded to a non-background bounding box during this period, it is re-activated with its original tracking ID; otherwise it is removed.
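The bookkeeping for this inactive set can be sketched as follows. This is an illustrative sketch, not the paper's code; `MAX_INACTIVE_AGE` is set to 30 frames, the keep-alive window reported in the implementation details, and `step_inactive` is a hypothetical helper name:

```python
MAX_INACTIVE_AGE = 30  # frames a lost track is kept alive before removal

def step_inactive(inactive, decoded_foreground_ids):
    """Advance the inactive query set by one frame.

    `inactive` maps tracking ID -> frames since the track was last decoded
    to a foreground box. IDs decoded to a non-background box this frame are
    re-activated with their original ID; the rest age by one frame and are
    dropped once they exceed MAX_INACTIVE_AGE.
    """
    reactivated = [tid for tid in inactive if tid in decoded_foreground_ids]
    survivors = {}
    for tid, age in inactive.items():
        if tid in decoded_foreground_ids:
            continue  # re-activated, leaves the inactive set
        if age + 1 <= MAX_INACTIVE_AGE:
            survivors[tid] = age + 1
    return reactivated, survivors
```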
3.6 Set prediction loss
As shown in the model architecture (Figure 2), PatchTrack processes frame pairs iteratively, and there are two steps involved. The first step is performing object detection on the previous frame in order to initialize track queries for later processing. The second step is performing object tracking on the current frame using the previously generated track queries. Since the second step involves both detecting new objects, which is the same as the first step, and tracking existing objects whose tracking IDs are associated with track queries, we use two set prediction losses [carion2020end]: one for detecting new objects and the other for tracking objects that already exist.
Let us denote the sets of ground-truth tracks for the previous and current frames. In the case of detecting new objects, we are looking at any track that corresponds to an object present in the current frame but not the previous one. We adopt an object detection set prediction loss following the matching cost in TransTrack [sun2020transtrack] and DETR [carion2020end]:

L_det = λ_cls · L_cls + λ_L1 · L_L1 + λ_giou · L_giou
where L_cls is the focal loss [lin2017focal] between predicted class labels and the ground truth, L_L1 and L_giou are the L1 loss and the generalized IoU loss [rezatofighi2019generalized] between the normalized centers and side lengths of the predicted and ground-truth bounding boxes, and λ_cls, λ_L1, and λ_giou are their respective weights. Predictions generated from decoding object queries are compared with the ground-truth new tracks, so L_det handles new object detection.
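The generalized IoU term can be made concrete. The sketch below computes the gIoU of two boxes (the corresponding loss term is 1 − gIoU); it is an illustrative stand-alone function, not the paper's implementation:

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2).

    Equals plain IoU when the boxes touch or overlap tightly, and goes
    negative (down to -1) for distant, disjoint boxes, giving the loss a
    gradient even without overlap.
    """
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # smallest axis-aligned box enclosing both inputs
    cx1, cy1 = min(a[0], b[0]), min(a[1], b[1])
    cx2, cy2 = max(a[2], b[2]), max(a[3], b[3])
    c_area = (cx2 - cx1) * (cy2 - cy1)
    return inter / union - (c_area - union) / c_area
```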
Similarly, our object tracking set prediction loss takes the same form:

L_track = λ_cls · L_cls + λ_L1 · L_L1 + λ_giou · L_giou
where L_cls, L_L1, and L_giou are calculated between predictions generated from decoding patch-track queries and the ground-truth existing tracks, so L_track handles tracking objects that already exist and predicting their new locations in the current frame.
Our final loss function is simply the sum of the object detection and object tracking set prediction losses: L = L_det + L_track.
| Method | MOTA | IDF1 | MT (%) | ML (%) | FP | FN | IDsw |
| RelationTrack [yu2021relationtrack] | 73.1 | 73.7 | 39.9 | 20.0 | 25,935 | 122,700 | 3,021 |
| TransCenter [xu2021transcenter] | 70.0 | 62.1 | 38.9 | 20.4 | 28,119 | 136,722 | 4,647 |
| TransTrack [sun2020transtrack] | 75.2 | 63.5 | 55.3 | 10.2 | 50,157 | 86,442 | 3,603 |
4 Experiments

4.1 Datasets and metrics
MOT MOT benchmarks are among the most widely used multi-object tracking benchmarks. We perform experiments on two of them: MOT16 and MOT17 [milan2016mot16]. MOT16 consists of a training set of 7 videos (5,316 frames and 336,891 tracks) and a test set of 7 videos (5,919 frames and 564,228 tracks), with FPS ranging from 14 to 30. To evaluate the performance of the tracking mechanism independently of detection accuracy, the benchmark also provides public detections from Faster R-CNN [ren2015faster]. MOT17 consists of the same training set and test set as MOT16, but with additional public detections from DPM [felzenszwalb2009object] and SDP [yang2016exploit]. Both MOT16 and MOT17 are annotated with full-body bounding boxes.
CrowdHuman CrowdHuman [shao2018crowdhuman] is a pedestrian detection benchmark. It contains 15,000 training images and 4,370 validation images with a total of 470K objects. The annotations are also human full-body bounding boxes. This benchmark is often used for pre-training MOT systems.
Metrics MOT benchmarks [leal2015motchallenge, milan2016mot16, dendorfer2020mot20] use metrics from CLEAR [bernardin2008evaluating], which include Multiple-Object Tracking Accuracy (MOTA), Identity F1 score (IDF1), Identity Switches (IDsw), False Positive (FP) and False Negative (FN) detections, as well as Mostly Tracked (MT) and Mostly Lost (ML) trajectories.
4.2 Training data generation
Given the architecture of PatchTrack (Figure 2), we need two consecutive frames to train the model. Although we could simply take frame pairs and predict track candidates from tracks of the previous frame using the Kalman filter [welch1995introduction], as in the architecture, the Kalman filter cannot provide high-quality predictions in the early stage when it lacks prior information. This in turn degrades the performance of the decoder, since the patch queries no longer serve as good guesses of where existing tracks may be in the current frame.
To simulate the role of the Kalman filter [welch1995introduction] and generate track candidates for training, we propose the following augmentation strategy. Given a frame pair, we first randomly shift and reshape each track bounding box in the previous frame within a pre-defined range. We ensure that the IoU between each augmented bounding box and the track bounding box in the current frame with the same tracking ID, if one exists, is at least 0.5. This aligns with the IoU threshold commonly used in detection association [Wojke2017simple, bewley2016simple, zhang2020fairmot]. These augmented tracks serve as the track candidates for our system during training.
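A box-jitter routine of this kind can be sketched as follows. This is a simplified, hypothetical version: the jitter ranges are arbitrary choices, and the IoU floor is checked against the input box itself, whereas the paper checks against the same-ID box in the other frame:

```python
import random

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def jitter_box(box, max_shift=0.1, max_scale=0.1, min_iou=0.5, rng=random):
    """Randomly shift and rescale a box, resampling until the jittered box
    keeps at least `min_iou` overlap with the reference box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    while True:
        dx = rng.uniform(-max_shift, max_shift) * w
        dy = rng.uniform(-max_shift, max_shift) * h
        sw = 1.0 + rng.uniform(-max_scale, max_scale)
        sh = 1.0 + rng.uniform(-max_scale, max_scale)
        cand = (x1 + dx, y1 + dy, x1 + dx + w * sw, y1 + dy + h * sh)
        if iou(box, cand) >= min_iou:
            return cand  # accepted sample
```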
We also adapt the track augmentation strategy from TrackFormer [meinhardt2021trackformer], where we introduce false negatives by removing from the input some queries associated with tracks that exist in both frames. The objective of the system is then to detect the corresponding objects as new objects using object queries. Conversely, we sample output embeddings (generated from performing object detection on the previous frame) that map to background bounding boxes. They are included among the track queries as false positives when performing object tracking on the current frame. To obtain their corresponding patch queries, we take their respective bounding boxes and augment them in the same manner as in track candidate generation, ensuring that the IoU of each augmented bounding box with every ground-truth track in the current frame is below 0.5. Each patch-track query generated by this procedure should be decoded to the background.
Frame pairs are selected from two sources. The first is video data from MOT benchmarks [milan2016mot16], where we take two frames within a certain range of each other in the same video. This gives us more variety in terms of camera motion. The second is image data from CrowdHuman [shao2018crowdhuman], where we augment a single image through random scaling and translation to obtain a frame pair. For each selected frame pair, we perform the aforementioned steps to generate track candidates and modify the ground truth to account for the false positives and negatives we inserted manually. PatchTrack is optimized towards the modified ground truth during training.
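Deriving a pseudo frame pair from a still image amounts to applying one global warp to the image and its annotations. The sketch below shows the box side of that warp; the function name and default parameters are illustrative, not values from the paper:

```python
def simulate_second_frame(boxes, dx=5.0, dy=3.0, scale=1.05):
    """Derive 'next frame' boxes from a still image's annotations by a
    global scale and translation (the same warp applied to the image)."""
    return [(x1 * scale + dx, y1 * scale + dy,
             x2 * scale + dx, y2 * scale + dy)
            for x1, y1, x2, y2 in boxes]
```

Pairing the original boxes with the warped ones (and the warped image) yields training pairs with a known, consistent global "camera motion".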
4.3 Implementation details
A Kalman filter [welch1995introduction] with a constant-velocity model is used to predict track candidates. PatchTrack uses ResNet-50 [he2016deep] pre-trained on ImageNet [deng2009imagenet] as its CNN backbone and Deformable DETR [zhu2020deformable] for the Transformer encoder-decoder framework. The number of object queries is set to 500. Inactive track queries are kept for 30 frames for track re-birth.
We adopt the training procedure from TransTrack [sun2020transtrack] as follows. The optimizer is AdamW. We use 8 NVIDIA Tesla V100 GPUs with batch size 16. PatchTrack is first pre-trained on CrowdHuman [shao2018crowdhuman]
for 150 epochs, with the learning rate dropped after the first 100 epochs. Then, PatchTrack is trained on both CrowdHuman and MOT17 [milan2016mot16] for another 20 epochs. Lastly, it is evaluated on the MOT16 and MOT17 [milan2016mot16] test sets.
4.4 Benchmark results

MOT16 We compare PatchTrack with other MOT systems on the MOT16 [milan2016mot16] test set under the private protocol (Table 1), where PatchTrack achieves state-of-the-art results in MOTA, ML, and FN. Compared to LMP_p [tang2017multiple] and POI [yu2016poi], which collectively achieve the best results in the remaining metrics, PatchTrack has significantly lower ML, showing overall better tracking performance. Figure 3 shows an additional visual comparison with LMP_p and POI, where PatchTrack tracks partially occluded objects and distinguishes crowded objects better, without missing objects or tracking one object multiple times.
MOT17 Table 2 shows quantitative results of PatchTrack along with other recent MOT systems on the MOT17 [milan2016mot16] test set under the private protocol. Compared to non-Transformer-based methods, PatchTrack reports the best numbers in MT and ML, showing superior ability in trajectory prediction. It also performs comparably with other Transformer-based methods, achieving second-best results in most metrics. Compared to TransTrack [sun2020transtrack], which has state-of-the-art results in MOTA, MT, ML, and FN, our system produces fewer than half as many false positives. We provide additional visualizations of PatchTrack and TransTrack in Figure 4. While performing on par with TransTrack, our system avoids tracking one object multiple times and avoids ID switches when a previously fully occluded object re-appears.
4.5 Ablation study
The ablation study is performed on the MOT17 [milan2016mot16] validation set. The original MOT17 training set is split into a new training set and a validation set, consisting of the first and second halves of the training videos respectively. After pre-training PatchTrack on CrowdHuman [shao2018crowdhuman], the system is fine-tuned on both CrowdHuman and the new MOT17 training set and evaluated on the validation set.
Type of queries We evaluate the effect of various queries in Table 3. Removal of only patch queries or track queries means the other is sent to the Transformer decoder along with object queries. Removal of patch-track queries means that the decoder takes in object queries only and essentially behaves like an object detector. After getting individual detections for each frame, we use the Kalman filter [welch1995introduction] and the Hungarian algorithm [kenesei2002hungarian] to associate them. In this case, the modified system falls into the tracking-by-detection paradigm. We see that both patch queries and track queries play an important role in the joint-detection-and-tracking setting. On the other hand, the performance of the tracking-by-detection version of our system is overall comparable with PatchTrack, but produces more ID switches.
| Method | MOTA | MT | ML | IDsw |
| w/o patch queries | 71.4 | 165 | 42 | 214 |
| w/o track queries | 66.3 | 141 | 61 | 248 |
| w/o patch-track queries | 72.0 | 176 | 40 | 200 |
Source of frame patches We also evaluate patch queries generated from different sources. The 'previous bboxes' patches come directly from cropping the current frame of interest using bounding boxes of tracks in the previous frame, i.e., without motion-model propagation. Alternatively, the 'previous frame' patches are generated from the previous frame and the bounding boxes of tracks in the previous frame. From Table 4, we see similar results when using patches from the previous frame compared to using track queries alone, meaning that patches from the previous frame contain similar information to track queries. On the other hand, patches generated from the current frame with bounding boxes of tracks in the previous frame degrade performance. We attribute this to the misalignment between the frame and the bounding boxes, which leaves less useful information in the patches.
| Method | MOTA | MT | ML | IDsw |
| w/o patch queries | 71.4 | 165 | 42 | 214 |
5 Conclusion

We present PatchTrack, a Transformer-based joint-detection-and-tracking system using frame patches. By generating patch queries from the current frame of interest and track predictions from a motion model, we obtain information about object motion and appearance that is tied to the current frame. This novel use of visual information in the current frame complements the track queries derived from previous frames. By using both types of queries collectively, PatchTrack achieves competitive results on MOT benchmarks.