PatchTrack: Multiple Object Tracking Using Frame Patches

Object motion and object appearance are commonly used information in multiple object tracking (MOT) applications, either for associating detections across frames in tracking-by-detection methods or for direct track prediction in joint-detection-and-tracking methods. However, these two types of information are often considered separately, and neither helps optimize the usage of visual information from the current frame of interest directly. In this paper, we present PatchTrack, a Transformer-based joint-detection-and-tracking system that predicts tracks using patches of the current frame of interest. We use the Kalman filter to predict the locations of existing tracks in the current frame from the previous frame. Patches cropped from the predicted bounding boxes are sent to the Transformer decoder to infer new tracks. By utilizing both object motion and object appearance information encoded in patches, the proposed method pays more attention to where new tracks are more likely to occur. We show the effectiveness of PatchTrack on recent MOT benchmarks, including MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%).




1 Introduction

Multiple object tracking (MOT) concerns identifying objects of interest and tracking their moving trajectories in video sequences. Intuitively, successful MOT algorithms need to be able to handle subtle appearance differences between multiple tracked objects and resolve the ambiguity via other cues, such as motion, when the targets are visually indistinguishable.

With the powerful appearance encoding capability of CNNs, the tracking-by-detection paradigm has dominated MOT methods in the past decade [carion2020end, zhang2020fairmot, Wojke2017simple]. Highly accurate CNN-based object detection [redmon2016you, ren2015faster, cai2018cascade] is first performed on all frames independently, and the detected objects are then associated across frames to establish tracks with consistent object IDs. In the association step, the locations of existing tracks in the following frame may be predicted from simple assumptions (constant velocity, acceleration, etc.) or other motion models [zhang2020fairmot, shuai2021siammot, welch1995introduction, Wojke2017simple] and then associated with detections based on metrics such as intersection-over-union (IoU).

Joint-detection-and-tracking methods [zhou2020tracking, sun2020transtrack, wu2021track] have recently demonstrated superior accuracy. The idea is to perform object detection and tracking simultaneously so that each task benefits from information shared by the other. This is particularly appealing in Transformer-based architectures, where output feature embeddings of previous frames are used as ‘track queries’ alongside ‘object queries’ for the Transformer decoder, which predicts the corresponding tracks as well as newly discovered objects in the current frame (Figure 1(a)). Although these architectures achieve state-of-the-art MOT results, we argue that they rely too heavily on appearance. As the information encoded in track queries is strictly limited to previous frames, the Transformer model needs to infer both object offset and object appearance in the current frame.

To resolve the above problem, we take inspiration from UP-DETR [dai2021up], an object detection model that is pre-trained to detect image patches (Figure 1(b)) using patch features, and propose a MOT system that uses frame patches from the current frame of interest. We first use a motion model to predict new locations of existing tracks in the current frame from the previous frame, and crop the current frame into patches based on the prediction. These patches, with implicit prior knowledge of object motion and explicit information about object appearance in the current frame, are sent to the decoder to predict new locations of existing tracks in the current frame.

More specifically, we present PatchTrack (Figure 1(c)), a Transformer-based joint-object-detection-and-tracking system that predicts tracks in the current frame of interest from its patches. We use the Kalman filter [welch1995introduction] to obtain track candidates in the current frame from existing tracks in the previous frame, and crop the current frame using the bounding boxes of these candidates to get patches. Both the current frame and these patches are sent into our convolutional neural network (CNN) [Goodfellow-et-al-2016] backbone, which outputs the frame feature and the patch queries respectively. Each track query, taken from the output embeddings produced when processing the previous frame, is added to the patch query with the same tracking ID to form the corresponding patch-track query. These patch-track queries are sent to the decoder along with object queries, where the former are used to predict new locations of existing tracks, while the latter are used to detect new objects in the current frame.

We evaluate PatchTrack on MOT benchmarks and achieve competitive results on MOT16 (MOTA 73.71%, IDF1 65.77%) and MOT17 (MOTA 73.59%, IDF1 65.23%) test sets. To the best of our knowledge, our method is the first that uses patches of the current frame of interest to infer both object motion and appearance information simultaneously. We hope it could provide a new perspective for designing MOT systems.

In summary, our contributions are:

  • A Transformer-based MOT system, namely PatchTrack, which jointly performs object detection and tracking.

  • A novel way of optimizing the usage of visual information by utilizing patches from the current frame of interest.

  • Introduction of patch-track queries that incorporate both knowledge of the object motion and object appearance in the current frame of interest to facilitate tracking.

2 Related Work

2.1 Object detection and tracking

Object detection concerns locating and/or classifying objects of interest in a single image. As it is the preliminary step to object tracking, the two are closely connected. Many popular object detection methods generate detections from hypotheses of object locations, including region proposals [girshick2014rich, girshick2015fast, ren2015faster, cai2018cascade] and anchors/object centers [redmon2016you, liu2016ssd, zhou2019objects]. On the other hand, an increasing number of object tracking systems utilize the Transformer [vaswani2017attention], which has previously shown success in object detection [carion2020end, meng2021conditional, zhu2020deformable, dai2021up]. Transformer-based object detection methods encode the CNN [Goodfellow-et-al-2016] features of images and decode learned object queries to obtain detections. Aside from architecture adjustments [meng2021conditional, zhu2020deformable] to the original DETR [carion2020end], we also see modifications to object queries [dai2021up] that use image patches to facilitate detection. Inspired by the usage of region proposals and image patches, our proposed method uses frame patches, which can be considered our initial guess of track locations and appearance.

2.2 Tracking-by-detection

One major paradigm in MOT is tracking-by-detection, where the MOT systems [carion2020end, zhang2020fairmot, Wojke2017simple] first obtain detections for each frame and then associate them across frames to form tracks. Since object detection is a standalone step in the tracking process, one benefit of tracking-by-detection methods is the flexibility to pair different object detection models [ren2015faster, redmon2016you, carion2020end] with different association strategies, and thus to benefit directly from advances in object detection. On the other hand, the object detection step ignores information across frames, as each frame is processed separately by the detector.

Object motion and appearance may only be considered as part of the detection association strategy for these methods [zhang2020fairmot, shuai2021siammot]. For object motion, the Kalman filter [welch1995introduction] is one of the most popular algorithms used to propagate detections in the previous frame and predict their locations in the following frame. Combined with the Hungarian algorithm [kenesei2002hungarian] and the intersection-over-union (IoU) metric, it has proven to be an effective tracking mechanism [bewley2016simple]. Object appearance information such as Re-ID features [Wojke2017simple, pang2021quasi, zhang2020fairmot] is also commonly used as a similarity measure.
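The IoU-based association step described above can be illustrated with a minimal sketch. It is not the implementation of any cited tracker: the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold are assumptions made for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm (it returns the same optimal assignment but is only practical for a handful of boxes).

```python
from itertools import permutations


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(predicted, detections, iou_threshold=0.5):
    """Match motion-predicted track boxes to detections by maximizing total IoU.

    Brute-force stand-in for the Hungarian algorithm (assumption: small inputs).
    Returns a list of (track_index, detection_index) pairs whose IoU reaches
    the threshold.
    """
    # Enumerate assignments of the smaller set into the larger one.
    swap = len(predicted) > len(detections)
    rows, cols = (detections, predicted) if swap else (predicted, detections)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(len(cols)), len(rows)):
        score = sum(iou(rows[i], cols[j]) for i, j in enumerate(perm))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(perm))
    matches = [(j, i) if swap else (i, j) for i, j in best_pairs]
    return [(t, d) for t, d in matches
            if iou(predicted[t], detections[d]) >= iou_threshold]


predicted = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (100, 100, 110, 110)]
print(associate(predicted, detections))  # [(0, 1), (1, 0)]
```

Detections left unmatched (index 2 in the example) would start new tracks in a tracking-by-detection pipeline.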

2.3 Joint-detection-and-tracking

The other popular paradigm in MOT is joint-detection-and-tracking, where object detection and object tracking are performed simultaneously [zhou2020tracking, sun2020transtrack]. One advantage of joint-detection-and-tracking methods is access to information across frames. For instance, features of multiple frames can be used at once [zhou2020tracking, sun2020transtrack, wu2021track] to facilitate detection and/or tracking. For Transformer-based joint-detection-and-tracking methods, both the encoder and the decoder may take additional information from previous frames to infer predictions for the current frame of interest [sun2020transtrack, meinhardt2021trackformer, zeng2021motr]. Specifically, recent works have introduced track queries [sun2020transtrack, meinhardt2021trackformer], which come from the output embeddings produced when processing previous frames. Depending on the design, the track queries may be decoded to bounding boxes separately from the object queries [sun2020transtrack] and matched together to predict new tracks, or processed together to form new tracks directly [meinhardt2021trackformer].

Figure 2: PatchTrack. We first use the Kalman filter [welch1995introduction] to predict track candidates in frame $F_t$ from tracks in frame $F_{t-1}$. Both frames are sent to the CNN backbone, which produces frame features for the Transformer encoder. We crop $F_t$ into patches using the bounding boxes of the track candidates and send them to the CNN backbone, followed by a fully connected layer (FC) and global average pooling (GAP), to get patch queries that align with track queries. Patch queries are added to track queries to form patch-track queries, which are then sent to the Transformer decoder along with object queries. The patch-track queries are decoded to output embeddings that refine the locations of track candidates, and the object queries are decoded to output embeddings that detect new objects. Output embeddings that correspond to tracks in $F_t$ are the track queries for processing $F_{t+1}$.

3 Method

In this section, we describe the architecture of PatchTrack (Section 3.1), how object tracking is initialized (Section 3.2), how existing tracks are propagated to form new track candidates (Section 3.3), and how frame patches are generated to facilitate object tracking (Section 3.4).

3.1 Architecture

PatchTrack is a Transformer-based joint-detection-and-tracking system. The Transformer encoder takes in the CNN features of a consecutive frame pair. The Transformer decoder takes queries as input and outputs bounding boxes. PatchTrack deals with four types of queries: object queries, track queries, patch queries, and patch-track queries. Depending on the source of the queries, the predicted bounding boxes may correspond either to tracks associated with existing tracking IDs or to detections that need to be assigned new tracking IDs.

3.2 Object tracking initialization

Object tracking for the first frame is equivalent to object detection, where each predicted detection can be arbitrarily assigned a unique tracking ID to form a track. Frame $F_0$ is sent to the CNN backbone, which outputs the corresponding frame feature. This feature is stacked with itself [sun2020transtrack] and sent to the Transformer encoder. Since there are no existing tracks from which to form non-object queries, the Transformer decoder only takes object queries as input and produces embeddings. The output embeddings that result in non-background bounding boxes are the predicted detections in $F_0$, each of which is assigned a unique tracking ID to form a track. These embeddings are also used as the track queries for the next frame.

3.3 Track propagation

For frame $F_t$ ($t > 0$), there exists a previous frame $F_{t-1}$ with a set of tracks $T_{t-1}$. We can propagate these tracks using a motion model to infer tracks in $F_t$ (Algorithm 1).

Here we use the Kalman filter [welch1995introduction] as our motion model to predict a set of track candidates for $F_t$, namely $\hat{T}_t$. We call them track candidates because using them directly as tracks in $F_t$ raises several problems. First, since the candidates in $\hat{T}_t$ are mapped one-to-one to the tracks in $T_{t-1}$, they only include objects that have appeared in $F_{t-1}$. Second, although the Kalman filter and other motion models have shown effectiveness in many cases [bewley2016simple, veeramani2018deepsort, zhang2020fairmot], their predicted bounding boxes are not accurate enough for locating objects. This is why motion models are often used only to process existing tracks, with IoU introduced to match the processed tracks with new detections to form new tracks. In the paradigm of joint-object-detection-and-tracking, our architecture is designed to refine these track candidates into more accurate tracks.

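Input: Tracks $T_{t-1}$ in frame $F_{t-1}$; motion model $M$
Output: Track candidates $\hat{T}_t$ for frame $F_t$
Initialization: $\hat{T}_t \leftarrow \emptyset$;
for each track $\tau \in T_{t-1}$ do
      $\hat{T}_t \leftarrow \hat{T}_t \cup \{M(\tau)\}$;
end for
Algorithm 1 Pseudo-code for object propagation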
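Algorithm 1 can be sketched in a few lines of Python. The sketch below is a simplification under stated assumptions: the `Track` container, its field names, and the (cx, cy, w, h) box format are hypothetical, and the motion model only applies the mean prediction of a constant-velocity model. A full Kalman filter would also propagate the state covariance and correct the state with each new observation.

```python
from dataclasses import dataclass


@dataclass
class Track:
    """Hypothetical track container used only for this illustration."""
    track_id: int
    box: tuple       # (cx, cy, w, h) in the previous frame
    velocity: tuple  # per-frame change of (cx, cy, w, h)


def constant_velocity_predict(track):
    """Mean prediction of a constant-velocity model: state plus velocity.

    Simplification: no covariance propagation and no update step.
    """
    box = tuple(p + v for p, v in zip(track.box, track.velocity))
    return Track(track.track_id, box, track.velocity)


def propagate(tracks, motion_model=constant_velocity_predict):
    """Apply the motion model to every track of the previous frame.

    The result maps one-to-one to the input tracks and keeps their IDs,
    which is why objects entering the scene are never among the candidates.
    """
    candidates = []
    for track in tracks:
        candidates.append(motion_model(track))
    return candidates


previous = [Track(1, (10.0, 20.0, 5.0, 8.0), (1.0, -2.0, 0.0, 0.5))]
print(propagate(previous)[0].box)  # (11.0, 18.0, 5.0, 8.5)
```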

3.4 Patch generation and object tracking

To tackle the above problems, we take inspiration from UP-DETR [dai2021up], whose Transformer decoder is pre-trained to detect the locations of random image patches using their corresponding CNN features. Our proposed PatchTrack takes patches of frame $F_t$ as additional visual information besides the entire $F_t$ to perform object tracking. Specifically, for each track candidate in $\hat{T}_t$, we crop the frame using its bounding box and send the resulting patch to the CNN backbone to get the corresponding patch feature. We use a fully-connected (FC) layer followed by global average pooling (GAP) to turn all patch features into patch queries that align with track queries (Figure 2). Each patch query is added to the track query with the same tracking ID to form a patch-track query. The patch-track queries are sent to the Transformer decoder along with the initial object queries, and both are processed jointly. The output embedding decoded from each patch-track query may correspond either to the refined location of the corresponding track candidate, or to the background if the object has left $F_t$. On the other hand, the embeddings decoded from object queries that result in non-background detections locate new objects entering $F_t$, which are assigned new tracking IDs to form new tracks. All embeddings that contribute tracks in $F_t$ form the track queries for $F_{t+1}$ (Figure 2).
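The data flow from track candidates to patch-track queries can be sketched as follows. This is a toy illustration and not the PatchTrack implementation: the function names are hypothetical, the frame is treated directly as an (H, W, C) feature map, and the CNN backbone and FC layer are left out (assumed to be the identity), so only cropping, global average pooling, and the per-ID addition are shown.

```python
def crop(frame, box):
    """Crop a frame stored as nested lists (H x W x C) with an integer box
    (x1, y1, x2, y2), clamped to the frame borders."""
    height, width = len(frame), len(frame[0])
    x1, y1 = max(0, box[0]), max(0, box[1])
    x2, y2 = min(width, box[2]), min(height, box[3])
    return [row[x1:x2] for row in frame[y1:y2]]


def global_average_pool(feature_map):
    """Average an (H, W, C) feature map over its spatial positions.

    Assumes the map is non-empty, i.e. the box overlaps the frame.
    """
    cells = [cell for row in feature_map for cell in row]
    channels = len(cells[0])
    return [sum(cell[k] for cell in cells) / len(cells) for k in range(channels)]


def form_patch_track_queries(track_queries, patch_queries):
    """Add each patch query to the track query with the same tracking ID."""
    return {
        tid: [t + p for t, p in zip(track_queries[tid], patch_queries[tid])]
        for tid in track_queries
        if tid in patch_queries
    }


# Toy 2 x 2 frame with 2 channels; track 7 has a candidate box on the top row.
frame = [[[1.0, 2.0], [3.0, 4.0]],
         [[5.0, 6.0], [7.0, 8.0]]]
patch_queries = {7: global_average_pool(crop(frame, (0, 0, 2, 1)))}
track_queries = {7: [0.5, 0.5]}
print(form_patch_track_queries(track_queries, patch_queries))  # {7: [2.5, 3.5]}
```

Element-wise addition requires the patch query and the track query to have the same dimension, which is the alignment role the FC layer and GAP play in Figure 2.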

3.5 Track re-birth

To obtain track queries for frame $F_{t+1}$ from the track queries for $F_t$, embeddings corresponding to new detections are added and track queries corresponding to the background class are removed (Figure 2). A problem with this mechanism is that it is not robust for long-range tracking: if an object is missed in one frame, it can only be assigned a new tracking ID when it is detected again, which causes fragmented trajectories. To tackle this problem, we adopt the track re-identification strategy from TrackFormer [meinhardt2021trackformer] and store the patch-track queries that would otherwise be removed in an inactive query set. Queries in this set are included in the list of patch-track queries and sent to the decoder for at most a fixed number of consecutive frames (30 in our implementation; see Section 4.3). If a query is decoded to a non-background bounding box during this period, it is re-activated with its original tracking ID; otherwise it is removed.
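The bookkeeping behind track re-birth can be sketched as a small store of inactive queries. The class and method names below are hypothetical, and the sketch only tracks query ages; decoding the queries is assumed to happen elsewhere, with the IDs that came back as non-background passed in as `redetected_ids`.

```python
class InactiveQueryStore:
    """Keeps removed patch-track queries alive for a limited number of frames."""

    def __init__(self, max_age=30):
        self.max_age = max_age  # 30 frames, as in Section 4.3
        self._queries = {}      # track_id -> (query, frames already tried)

    def deactivate(self, track_id, query):
        """Store a query whose track was decoded to background."""
        self._queries[track_id] = (query, 0)

    def candidates(self):
        """Queries to append to the decoder input for the next frame."""
        return {tid: query for tid, (query, _) in self._queries.items()}

    def step(self, redetected_ids):
        """Call once per frame after decoding; returns re-activated IDs."""
        reborn = [tid for tid in self._queries if tid in redetected_ids]
        for tid in reborn:
            del self._queries[tid]
        # Age the remaining queries and drop those that used up their budget.
        for tid, (query, age) in list(self._queries.items()):
            if age + 1 >= self.max_age:
                del self._queries[tid]
            else:
                self._queries[tid] = (query, age + 1)
        return reborn


store = InactiveQueryStore(max_age=2)
store.deactivate(5, [0.1])
store.deactivate(6, [0.2])
print(store.step({5}))      # [5]  track 5 is re-activated with its original ID
print(store.step(set()))    # []   track 6 expires after two unsuccessful frames
print(store.candidates())   # {}
```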

3.6 Set prediction loss

As shown in the model architecture (Figure 2), PatchTrack processes a frame pair $F_{t-1}$ and $F_t$ iteratively in two steps. The first step performs object detection on $F_{t-1}$ to initialize the track queries used later. The second step performs object tracking on $F_t$ using the previously generated track queries. Because the second step involves both detecting new objects, as in the first step, and tracking existing objects whose tracking IDs are associated with track queries, we use two set prediction losses [carion2020end]: one for detecting new objects and the other for tracking objects that exist in $F_{t-1}$.

Let us denote $T_{t-1}$ and $T_t$ as the tracks for $F_{t-1}$ and $F_t$ respectively. In the case of detecting new objects, we are looking at any track in $T_t \setminus T_{t-1}$, which corresponds to a new object that appears in $F_t$ but not in $F_{t-1}$. We adopt the object detection set prediction loss following the matching cost in TransTrack [sun2020transtrack] and DETR [carion2020end]:

$$\mathcal{L}_{det} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{giou}\mathcal{L}_{giou}$$

where $\mathcal{L}_{cls}$ is the focal loss [lin2017focal] between the predicted class labels and the ground truth, $\mathcal{L}_{L1}$ and $\mathcal{L}_{giou}$ are the L1 loss and generalized IoU loss [rezatofighi2019generalized] between the normalized center and sides of the predicted bounding boxes and the ground truth, and $\lambda_{cls}$, $\lambda_{L1}$, and $\lambda_{giou}$ are their respective weights. Predictions generated from decoding object queries are compared with the ground truth $T_t \setminus T_{t-1}$, so $\mathcal{L}_{det}$ handles new object detection.

Similarly, our object tracking set prediction loss is as follows:

$$\mathcal{L}_{track} = \lambda_{cls}\mathcal{L}_{cls} + \lambda_{L1}\mathcal{L}_{L1} + \lambda_{giou}\mathcal{L}_{giou}$$

where $\mathcal{L}_{cls}$, $\mathcal{L}_{L1}$, and $\mathcal{L}_{giou}$ are calculated between predictions generated from decoding patch-track queries and the ground truth $T_t \cap T_{t-1}$, so $\mathcal{L}_{track}$ handles tracking objects in $F_{t-1}$ and predicting their new locations in $F_t$.
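The two box regression terms shared by both losses can be made concrete with a short sketch for a single matched prediction and ground-truth pair. It is an illustration under assumptions rather than the training code: the function names are hypothetical, boxes are normalized (cx, cy, w, h) tuples, the default weights of 1.0 are placeholders and not the values used in the paper, and the focal classification term and the set matching are omitted.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def giou(a, b):
    """Generalized IoU of two non-degenerate boxes in (x1, y1, x2, y2) format."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def box_regression_loss(pred, target, weight_l1=1.0, weight_giou=1.0):
    """Weighted sum of the L1 and generalized IoU terms for one matched pair.

    The weights are placeholders (assumption); the classification term is omitted.
    """
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    giou_loss = 1.0 - giou(to_corners(pred), to_corners(target))
    return weight_l1 * l1 + weight_giou * giou_loss


print(giou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0)))  # disjoint boxes: about -0.333
print(box_regression_loss((0.5, 0.5, 0.2, 0.2), (0.5, 0.5, 0.2, 0.2)))  # 0.0
```

Unlike plain IoU, the generalized IoU stays informative for non-overlapping boxes, since the enclosing-box penalty grows as the boxes move apart.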

Our final loss function is simply the sum of the object detection set prediction loss and the object tracking set prediction loss:

$$\mathcal{L} = \mathcal{L}_{det} + \mathcal{L}_{track}$$

Dataset: MOT16
Method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
DeepSORT [Wojke2017simple] 61.4 62.2 32.8 18.2 12,852 56,668 781
HTA [lin2021detection] 62.4 64.2 37.5 12.1 19,071 47,839 1,619
VMaxx [wan2018multi] 62.6 49.2 32.7 21.1 10,604 56,182 1,389
RAR16 [fang2018recurrent] 63.0 63.8 39.9 22.1 13,663 53,248 482
TAP [zhou2018online] 64.8 73.5 40.6 22.0 12,980 50,635 794
CNNMTT [mahmoudi2019multi] 65.2 62.2 32.4 21.3 6,578 55,896 946
POI [yu2016poi] 66.1 65.1 34.0 21.3 5,061 55,914 805
GSDT [wang2021joint] 66.7 69.2 38.6 19.0 14,754 45,057 959
TubeTK [pang2020tubetk] 66.9 62.2 39.0 18.1 11,544 47,502 1,236
LM_CNN [babaee2019dual] 67.4 61.2 38.2 19.2 10,109 48,435 931
Chain-Tracker [peng2020chained] 67.6 57.2 32.9 23.1 8,934 48,305 1,897
KDNT(POI) [yu2016poi] 68.2 60.0 41.0 19.0 11,479 45,605 933
FairMOT [zhang2020fairmot] 69.3 72.3 40.3 16.7 13,501 41,653 815
QuasiDense [pang2021quasi] 69.8 67.1 41.6 19.8 9,861 44,050 1,097
TraDeS [wu2021track] 70.1 64.7 37.3 20.0 8,091 45,210 1,144
LMP_p [tang2017multiple] 71.0 70.1 46.9 21.9 7,880 44,564 434
PatchTrack (Ours) 73.3 65.8 45.7 11.3 10,660 36,824 1,179
Table 1: Evaluation on the MOT16 test set. We evaluate recent MOT systems on the MOT16 test set under the private detection protocol. The method names are taken directly from the motchallenge leaderboard, where the names in parentheses are those used in the respective publications. For metrics marked with ↑, higher numbers are preferable; for metrics marked with ↓, lower numbers are preferable. Numbers are marked in bold if they are the best in their respective metric columns. Our proposed PatchTrack achieves the best results in MOTA, ML, and FN.
Panels: (a) LMP_p, MOT16-08, frame 210; (b) POI, MOT16-08, frame 210; (c) PatchTrack, MOT16-08, frame 210; (d) LMP_p, MOT16-08, frame 420; (e) POI, MOT16-08, frame 420; (f) PatchTrack, MOT16-08, frame 420.
Figure 3: Visualizations on the MOT16 test set. The visualizations are taken from motchallenge. We add annotations in red to show challenging cases where LMP_p [tang2017multiple] and POI [yu2016poi] fail to track. While both LMP_p (Figure 3(a)) and POI (Figure 3(b)) fail to track objects that are partially occluded, PatchTrack is still able to locate such objects (Figure 3(c)). Additionally, PatchTrack is better at distinguishing different objects in a cluster (Figure 3(f)), without missing objects (Figure 3(e)) or tracking one object twice (Figure 3(d)).
Dataset: MOT17
CNN-based methods
Method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
DAN [sun2019deep] 52.4 49.5 21.4 30.7 25,423 234,592 8,431
TubeTK [pang2020tubetk] 63.0 58.6 31.2 19.9 27,060 177,483 4,137
GSDT [wang2021joint] 66.2 63.4 36.9 21.7 25,800 164,120 2,711
Chained-Tracker [peng2020chained] 66.6 57.4 37.8 18.5 22,284 160,491 5,529
CenterTrack [zhou2020tracking] 67.8 64.7 34.6 24.6 18,498 160,332 3,039
QuasiDense [pang2021quasi] 68.7 66.3 40.6 21.9 26,589 146,643 3,378
TraDeS [wu2021track] 69.1 63.9 36.4 21.5 20,892 150,060 3,555
MAT [han2020mat] 69.5 63.1 43.8 18.9 30,660 138,741 2,844
SOTMOT [zheng2021improving] 71.0 71.9 42.7 15.3 39,537 118,983 5,184
RADTrack (RelationTrack) [yu2021relationtrack] 73.1 73.7 39.9 20.0 25,935 122,700 3,021
GSDT [wang2021joint] 73.2 66.5 41.7 17.5 26,397 120,666 3,891
Semi-TCL [li2021semi] 73.3 73.2 41.8 18.7 22,944 124,980 2,790
FairMOT [zhang2020fairmot] 73.7 72.3 43.2 17.3 27,507 117,477 3,303
RelationTrack [yu2021relationtrack] 73.8 74.7 41.7 23.2 27,999 118,623 1,374
PermaTrackPr [tokmakov2021learning] 73.8 68.9 43.8 17.2 28,998 115,104 3,699
CSTrack [liang2020rethinking] 74.9 72.6 41.5 17.5 23,847 114,303 3,567
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Transformer-based methods
Method MOTA↑ IDF1↑ MT↑ ML↓ FP↓ FN↓ IDsw↓
MOTR [zeng2021motr] 65.1 66.4 33.0 25.2 45,486 149,307 2,049
TrackFormer [meinhardt2021trackformer] 65.0 63.9 45.6 13.8 70,443 123,552 3,528
MOTPrivate (TransCenter) [xu2021transcenter] 70.0 62.1 38.9 20.4 28,119 136,722 4,647
TransCenter [xu2021transcenter] 73.2 62.2 40.8 18.5 23,112 123,738 4,614
TrTrack (TransTrack) [sun2020transtrack] 75.2 63.5 55.3 10.2 50,157 86,442 3,603
PatchTrack (ours) 73.6 65.2 44.6 12.5 23,976 121,230 3,795
Table 2: Evaluation on the MOT17 test set. We evaluate recent MOT systems on the MOT17 test set under the private detection protocol. Compared to CNN-based (non-Transformer-based) methods, PatchTrack achieves the best MT and ML. We also compare our proposed method with MOT systems that are also Transformer-based. Numbers are in bold if they are the best in their respective metric columns, and in blue if they are the second best.
Panels: (a) TransTrack, MOT17-07, frame 402; (b) TransTrack, MOT17-07, frame 420; (c) TransTrack, MOT17-07, frame 438; (d) PatchTrack, MOT17-07, frame 402; (e) PatchTrack, MOT17-07, frame 420; (f) PatchTrack, MOT17-07, frame 438.
Figure 4: Visualizations on the MOT17 test set. Compared to TransTrack [sun2020transtrack], PatchTrack shows comparable performance while generating fewer than 50% of its false positives. TransTrack suffers from detecting one object multiple times (Figure 4(a)) and from ID switches (Figure 4(c)) when trying to track fully occluded objects (Figure 4(b)).

4 Experiments

4.1 Datasets and metrics

MOT. The MOT benchmarks are among the most widely used multi-object tracking benchmarks. We perform experiments on two of them: MOT16 and MOT17 [milan2016mot16]. MOT16 consists of a training set of 7 videos (5,316 frames and 336,891 tracks) and a test set of 7 videos (5,919 frames and 564,228 tracks) with FPS ranging from 14 to 30. To evaluate the performance of the tracking mechanism independently of the detection accuracy, this benchmark also provides public detections from Faster R-CNN [ren2015faster]. MOT17 consists of the same training set and test set as MOT16, but with additional public detections from DPM [felzenszwalb2009object] and SDP [yang2016exploit]. Both MOT16 and MOT17 are annotated with full-body bounding boxes.

CrowdHuman. CrowdHuman [shao2018crowdhuman] is a pedestrian detection benchmark. It contains 15,000 training images and 4,370 validation images with a total of 470K objects. The annotations are also human full-body bounding boxes. This benchmark is often used for pre-training MOT systems.

Metrics. The MOT benchmarks [leal2015motchallenge, milan2016mot16, dendorfer2020mot20] use metrics from CLEAR [bernardin2008evaluating], which include Multiple-Object Tracking Accuracy (MOTA), Identity F1 score (IDF1), Identity Switches (IDsw), False Positive (FP) and False Negative (FN) detections, as well as Mostly Tracked (MT) and Mostly Lost (ML) trajectories.
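As a worked example of how these counts combine, MOTA in the CLEAR metrics is one minus the ratio of all errors (FN, FP, and ID switches) to the number of ground-truth boxes. The sketch below applies this to the PatchTrack row of Table 2. It assumes that the 564,228 annotation count quoted above for the test set is the denominator used by the benchmark; the function name is hypothetical.

```python
def mota(false_negatives, false_positives, id_switches, num_gt_boxes):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDsw) / number of ground-truth boxes."""
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt_boxes


# PatchTrack on MOT17 (Table 2): FN = 121,230, FP = 23,976, IDsw = 3,795.
value = mota(121_230, 23_976, 3_795, 564_228)
print(f"{100 * value:.2f}")  # 73.59
```

Under that assumption the result agrees with the MOTA of 73.59% reported for MOT17. The formula also shows why MOTA is dominated by detection errors: FN and FP are far larger than IDsw in every row of the tables.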

4.2 Training data generation

Given the architecture of PatchTrack (Figure 2), we need two consecutive frames to train the model. We could simply take frame pairs and predict track candidates from the tracks of the previous frame using the Kalman filter [welch1995introduction], as shown in the architecture. However, the Kalman filter cannot provide high-quality predictions in the early stage, when uncertainty is high due to the lack of prior information. This would in turn degrade the performance of the decoder, since the patch queries would not serve as good guesses of where existing tracks may be in the current frame.

To simulate the role of the Kalman filter [welch1995introduction] and generate track candidates for training, we propose the following augmentation strategy. Given a frame pair $F_{t-1}$ and $F_t$, we first randomly shift and reshape each track bounding box in frame $F_{t-1}$ within a pre-defined domain. We ensure that the IoU between each augmented bounding box and the track bounding box in frame $F_t$ with the same tracking ID, if it exists, is at least 0.5. This aligns with the IoU threshold commonly used in detection association [Wojke2017simple, bewley2016simple, zhang2020fairmot]. These augmented tracks are the track candidates given to our system during training.
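One way to realize this augmentation is rejection sampling: jitter the box, and keep the result only if it overlaps the target sufficiently. The sketch below is an assumption-laden illustration, not the training code: the function names, the jitter ranges (`max_shift`, `max_scale`), the retry budget, and the fallback to the target box are all hypothetical choices. Only the 0.5 IoU threshold comes from the text.

```python
import random


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def augment_box(box, target, rng, max_shift=0.2, max_scale=0.2,
                min_iou=0.5, max_tries=100):
    """Randomly shift and reshape `box` (previous frame) into a track candidate.

    `target` is the box with the same tracking ID in the current frame, or
    None if the track does not exist there. A candidate is accepted once its
    IoU with `target` reaches `min_iou`.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    for _ in range(max_tries):
        # Shift the center and rescale the sides by bounded random factors.
        cx = (x1 + x2) / 2 + rng.uniform(-max_shift, max_shift) * w
        cy = (y1 + y2) / 2 + rng.uniform(-max_shift, max_shift) * h
        new_w = w * (1.0 + rng.uniform(-max_scale, max_scale))
        new_h = h * (1.0 + rng.uniform(-max_scale, max_scale))
        candidate = (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)
        if target is None or iou(candidate, target) >= min_iou:
            return candidate
    # Only reached when a target exists: fall back to it (assumption).
    return target


rng = random.Random(0)
candidate = augment_box((10, 10, 60, 110), (12, 10, 62, 110), rng)
print(iou(candidate, (12, 10, 62, 110)) >= 0.5)  # True
```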

We also adapt the track augmentation strategy from TrackFormer [meinhardt2021trackformer]. We introduce false negatives by removing from the input some queries associated with tracks that exist in both $F_{t-1}$ and $F_t$. The objective of the system is then to detect the corresponding objects as new objects using object queries. Conversely, we sample output embeddings (generated by performing object detection on $F_{t-1}$) that map to background bounding boxes, and include them in the track queries as false positives when performing object tracking on $F_t$. To obtain their corresponding patch queries, we take their respective bounding boxes and augment them in the same manner as in track candidate generation. We ensure that the IoU of each augmented bounding box with the ground truth tracks in $F_t$ is below 0.5. Each patch-track query generated by this procedure should be decoded by our system to the background class.

Frame pairs are selected from two sources. The first one is video data from MOT benchmarks [milan2016mot16], where we take two video clips within a certain range from each other in the same video. This gives us more variety in terms of camera motion. The second one is image data from CrowdHuman [shao2018crowdhuman], where we augment a single image through random scaling and translating to obtain a frame pair. For each selected frame pair, we perform the aforementioned steps to generate track candidates and modify the ground truth corresponding to false positives/negatives we inserted manually. PatchTrack is optimized towards the modified ground truth during training.

4.3 Implementation details

The Kalman filter [welch1995introduction] following a constant velocity model is used to predict track candidates. PatchTrack uses ResNet-50 [he2016deep] pre-trained on ImageNet [deng2009imagenet] as its CNN backbone and Deformable DETR [zhu2020deformable] for the Transformer encoder-decoder framework. The number of object queries is set to 500. Inactive track queries are kept for 30 frames for track re-birth.

We adopt the training procedure from TransTrack [sun2020transtrack] as follows. The optimizer is AdamW. We use 8 NVIDIA Tesla V100 GPUs with batch size 16. PatchTrack is first pre-trained on CrowdHuman [shao2018crowdhuman] for 150 epochs, with the learning rate dropped after the first 100 epochs. Then, PatchTrack is trained on both CrowdHuman and MOT17 [milan2016mot16] for another 20 epochs. Lastly, it is evaluated on the MOT16 and MOT17 [milan2016mot16] test sets.

4.4 Results

MOT16. We compare PatchTrack with other MOT systems on the MOT16 [milan2016mot16] test set under the private detection protocol (Table 1), where PatchTrack achieves state-of-the-art results in MOTA, ML, and FN. Compared to LMP_p [tang2017multiple] and POI [yu2016poi], which collectively achieve the best results in the remaining metrics, PatchTrack has significantly lower ML, showing better overall tracking performance. Figure 3 shows an additional visual comparison with LMP_p and POI, where PatchTrack is able to track partially occluded objects and distinguish crowded objects better, without missing objects or tracking one object multiple times.

MOT17. Table 2 shows quantitative results of PatchTrack along with other recent MOT systems on the MOT17 [milan2016mot16] test set under the private detection protocol. Compared to non-Transformer-based methods, PatchTrack reports the best numbers in MT and ML, and shows superior ability in trajectory prediction. On the other hand, PatchTrack performs comparably with other Transformer-based methods, achieving second-best results in most metrics. Compared to TransTrack [sun2020transtrack], which has state-of-the-art results in MOTA, MT, ML, and FN, our system produces fewer than 50% of the false positives. We provide additional visualizations of PatchTrack and TransTrack in Figure 4. While PatchTrack performs on par with TransTrack, our system avoids tracking one object multiple times or causing ID switches when a previously fully occluded object re-appears.

4.5 Ablation study

The ablation study is performed on the MOT17 [milan2016mot16] validation set. The original MOT17 training set is split into a new training set and a validation set, consisting of the first half and the second half of the training videos, respectively. After pre-training PatchTrack on CrowdHuman [shao2018crowdhuman], the system is fine-tuned on both CrowdHuman and the new MOT17 training set and evaluated on the validation set.

Type of queries. We evaluate the effect of the various queries in Table 3. Removing only patch queries or only track queries means that the other is sent to the Transformer decoder along with object queries. Removing patch-track queries means that the decoder takes in object queries only and essentially behaves like an object detector. After getting individual detections for each frame, we use the Kalman filter [welch1995introduction] and the Hungarian algorithm [kenesei2002hungarian] to associate them. In this case, the modified system falls into the tracking-by-detection paradigm. We see that both patch queries and track queries play an important role in the joint-detection-and-tracking setting. On the other hand, the tracking-by-detection version of our system performs comparably with PatchTrack overall, but produces more ID switches.

Method MOTA MT ML IDsw
w/o patch queries 71.4 165 42 214
w/o track queries 66.3 141 61 248
w/o patch-track queries 72.0 176 40 200
PatchTrack 72.1 176 40 192
Table 3: Ablation study on type of query inputs. We send different types of query inputs to our system and evaluate their effects. The results suggest a positive effect of both patch queries and track queries. When the system does not use patch-track queries and behaves as an object detector, with the Kalman filter [welch1995introduction] and the Hungarian algorithm [kenesei2002hungarian] used to associate the predicted detections, it produces more ID switches.

Source of frame patches. We also evaluate patch queries generated from different sources. The ‘previous bboxes’ patches are cropped from the current frame of interest using the bounding boxes of tracks in the previous frame. Alternatively, the ‘previous frame’ patches are cropped from the previous frame using the bounding boxes of tracks in the previous frame. From Table 4, we see that using patches from the previous frame gives the same results as using track queries alone, meaning that patches from the previous frame contain similar information to track queries. On the other hand, patches generated from the current frame with the bounding boxes of tracks in the previous frame degrade the performance. We attribute this to the misalignment between the frame and the bounding boxes, which leads to less useful information in the patches.

Method MOTA MT ML IDsw
w/o patch query 71.4 165 42 214
previous bboxes 62.8 137 69 258
previous frame 71.4 165 42 214
PatchTrack 72.1 176 40 192
Table 4: Ablation study on source of frame patches. We test patch queries generated from different sources. When the patches come from cropping the current frame using the track bounding boxes from the previous frame (previous bboxes), the corresponding patch queries have a negative effect on the performance.

5 Conclusion

We present PatchTrack, a Transformer-based joint-detection-and-tracking system using frame patches. By generating patch queries from the current frame of interest and from track predictions made by a motion model, we obtain information about object motion and appearance that is associated with the current frame. This novel way of using visual information in the current frame complements the track queries, which are derived from previous frames. By using both queries collectively, PatchTrack is able to achieve competitive results on MOT benchmarks.