TubeTK: Adopting Tubes to Track Multi-Object in a One-Step Training Model

06/10/2020 ∙ by Bo Pang, et al. ∙ Shanghai Jiao Tong University

Multi-object tracking is a fundamental vision problem that has been studied for a long time. As deep learning brings excellent performance to object detection algorithms, Tracking by Detection (TBD) has become the mainstream tracking framework. Despite its success, this two-step method is too complicated to train end-to-end and introduces several challenges, such as insufficient exploration of video spatial-temporal information, vulnerability to object occlusion, and excessive reliance on detection results. To address these challenges, we propose a concise end-to-end model, TubeTK, which needs only one-step training by introducing the “bounding-tube” to indicate the temporal-spatial locations of objects in a short video clip. TubeTK provides a novel direction for multi-object tracking, and we demonstrate its potential to solve the above challenges without bells and whistles. We analyze the performance of TubeTK on several MOT benchmarks and provide empirical evidence that TubeTK can overcome occlusions to some extent without ancillary technologies like Re-ID. Compared with other methods that adopt private detection results, our one-stage end-to-end model achieves state-of-the-art performance even though it adopts no ready-made detection results. We hope that the proposed TubeTK model can serve as a simple but strong alternative for video-based MOT tasks. The code and models are available at




1 Introduction

Video multi-object tracking (MOT) is a fundamental yet challenging task that has been studied for a long time. It requires the algorithm to predict the temporal and spatial locations of objects and classify them into the correct categories. The current mainstream trackers such as [65, 3, 9, 1, 13] all adopt the tracking-by-detection (TBD) framework. As a two-step method, this framework simplifies the tracking problem into two parts: detecting the spatial location of objects and matching them in the temporal dimension. Although this framework is successful, the TBD method suffers from some drawbacks:

  1. As shown in [65, 18], the performances of models adopting TBD framework dramatically vary with detection models. This excessive reliance on image detection results limits performances of the MOT task. Although there are some existing works aiming at integrating the two steps more closely [67, 20, 3], the problems are still not solved fundamentally because of the relatively independent detection model.

  2. Due to image-based detection models employed by TBD, the tracking models are weak when facing severe object occlusions (see Fig. 1). It is extremely difficult to detect occluded objects only through spatial representations [3]. The low quality detection further makes tracking unstable, which leads to more complicated design of matching mechanism [53, 57].

  3. As a video level task, MOT requires models to process spatial-temporal information (STI) integrally and effectively. To some extent, the above problems are caused by the separate exploration of STI: detectors mainly model spatial features and trackers capture temporal ones [50, 9, 18, 53], which casts away the semantic consistency of video features and results in incomplete STI at each step.

Nowadays, many video tasks can be solved in a simple one-step end-to-end method such as the I3D model [6] for action recognition [36], TRN [68] for video relational reasoning, and MCNet [56] for video future prediction. As one of the fundamental vision tasks, MOT still does not work in a simple elegant method and the drawbacks of TBD mentioned above require assistance of some other techniques like Re-ID [3, 41]. It is natural to ask a question: Can we solve the multi-object tracking in a neat one-step framework? In this way, MOT can be solved as a stand-alone task, without restrictions from detection models. We answer it in the affirmative and for the first time, we demonstrate that the much simpler one-step tracker even achieves better performance than the TBD-based counterparts.

In this paper, we propose TubeTK, which conducts the MOT task by regressing bounding-tubes (Btubes) in a 3D manner. Unlike 3D point clouds [64], 3D here means two spatial dimensions and one temporal dimension. As shown in Fig. 1, a Btube is defined by 15 coordinate values in space-time, compared to the 4 coordinate values of a traditional 2D box. Besides the spatial location of targets, it also captures the temporal position. More importantly, the Btube encodes the targets’ motion trail as well, which is exactly what MOT needs. Thus, Btubes can handle spatial-temporal information integrally and largely bridge the gap between detection and tracking.

To predict the Btube that captures spatial-temporal information, we employ a 3D CNN framework. By treating a video as 3D data instead of a group of 2D image frames, it can extract spatial-temporal features simultaneously. This is a more powerful and fully automatic method to extract tracking features, where the handcrafted features such as optical flow [52], segmentation [57, 15, 62], human pose [17, 16, 58] or targets interactions [50, 37, 14, 46] are not needed. The network structure is inspired by recent advances of one-stage anchor-free detectors [55, 11] where the FPN [38] is adopted to better track targets of different scales and the regression head directly generates Btubes. After that, simple IoU-based post-processing is applied to link Btubes and form final tracks. The whole pipeline is made up of fully convolutional networks and we show the potential of this compact model to be a new tracking paradigm.

The proposed TubeTK enjoys the following advantages:

  1. With TubeTK, MOT now can be solved by a simple one-step training method as other video tasks. Without constraint from detection models, assisting technologies, and handcrafted features, TubeTK is considerably simpler when being applied and it also enjoys great potential in future research.

  2. TubeTK adequately extracts spatial-temporal features simultaneously and these features capture information of motion tendencies. Thus, TubeTK is more robust when faced with occlusions.

  3. Without bells and whistles, the end-to-end-trained TubeTK achieves better performances compared with TBD-based methods on MOT15, 16, and 17 dataset [34, 44]. And we show that the Btube-based tracks are smoother (fewer FN and IDS) than the ones based on pre-generated image-level bounding-boxes.

2 Related Work

Tracking-by-detection-based model

Research based on the TBD framework often adopts detection results given by external object detectors [47, 40, 42] and focuses on the tracking part to associate the detection boxes across frames. Many associating methods have been utilized on tracking models. In [2, 29, 66, 45, 35], every detected bounding-box is treated as a node of graph, the associating task is equivalent to determining the edges where maximum flow [2, 61], or equivalently, minimum cost [45, 29, 66] are usually adopted as the principles. Recently, with the development of deep learning, appearance-based matching algorithms have been proposed [32, 50, 18]. By matching targets with similar appearances such as clothes and body types, models can associate them over long temporal distances. Re-ID techniques [33, 3, 54] are usually employed as an auxiliary in this matching framework.

Figure 2: Definition and generation of the Btube. a: A Btube can be seen as the combination of three bounding-boxes $B_s$, $B_m$, and $B_e$ from different video frames. A Btube has 15 degrees of freedom, which are determined by the spatial locations of the three bounding-boxes ($4 \times 3$ degrees) and their temporal positions (3 degrees: $t_s$, $t_m$, and $t_e$). b: Btubes are generated from whole tracks. Left: For each bounding-box in a track, we treat it as the $B_m$ of one Btube, then look forward and backward to find its $B_e$ and $B_s$ in the track. Right: A longer Btube can capture more temporal features, but the IoU between it and the track is lower ($\eta$ is the IoU threshold), which leads to bad moving trails, as the second row shows. Overlaps between the Btubes are used for linking them.

Bridging the gap between detection and tracking

Performances of image-based object detectors are limited when facing dense crowds and serious occlusions. Thus, some works try to utilize extra information such as motion [50] or temporal features learned by the track step to aid detection. One simple direction is to add bounding-boxes generated by the tracking step into the detection step [41, 10], but this does not affect the original detection process. In [67], the tracking step can efficiently improve the performance of detection by controlling the NMS process. [20] proposes a unified CNN structure to jointly perform detection and tracking tasks. By sharing features and conducting multi-task learning, it can further reduce the isolation between the two steps. The authors of [59] propose a joint detection and embedding framework where the detection and associating steps share same features. Despite these works’ effort to bridge the gap between detection and tracking, they still treat them as two separate tasks and can not well utilize spatial-temporal information.

Tracking framework based on trajectories or tubes

Tubes can successfully capture motion trails of targets, which are important for tracking. There are previous works that adopt tubes to conduct MOT or video detection [51] tasks. In [31, 30], a tubelet proposal module combining detection results into tubes is adopted to solve the video detection task. And [70] employs a single-object tracking method to capture subjects’ trajectories. Although these works propose and utilize the concept of tubes, they still utilize external detection results and form tubes at the second step, instead of directly regressing them. Thus they are still TBD methods and the problems stated above are not solved.

Figure 3: The pipeline of our TubeTK. a: Given a video and the corresponding ground-truth tracks, we cut them into short clips in a sliding window manner to get inputs of the network. b: To model spatial-temporal information in video clips, we adopt 3D convolutional layers to build our network, which consists of a backbone, an FPN, and a few multi-scale heads. Following FCOS [55], the multi-scale heads are responsible for targets with different scales respectively. The 3D network directly predicts Btubes. c: We link the predicted Btubes that have the same spatial positions and moving directions in the overlap part into whole tracks. d: In the training phase, the GT tracks are split into Btubes and then transformed into the same form as the network’s output: target maps (see Fig. 4 for details). The target and predicted maps are fed into three loss functions to train the model: the focal loss for classifying the foreground and background, BCE for the center-ness, and GIoU loss for regressing Btubes.

3 The Proposed Tracking Model

We propose a new one-step end-to-end training MOT paradigm, the TubeTK. Compared with the TBD framework, this paradigm can better model spatial-temporal features and alleviate problems led by dense crowds and occlusions. In this section, we will introduce the entire pipeline in the following arrangement: 1) We first define the Btube which is a 3D extension of Bbox and introduce its generation method in Sec. 3.1. 2) In Sec. 3.2, we introduce the deep network adopted to predict Btubes from input videos. 3) Next, we interpret the training method tailored for Btubes in Sec. 3.3. 4) Finally, we propose the parameter-free post-processing method to link the predicted Btubes in Sec. 3.4.

3.1 From Bounding-Box to Bounding-Tube

The traditional image-based bounding-box (Bbox), which serves as the smallest enclosing box of a target, can only indicate its spatial position, while for MOT, the pattern of targets’ temporal positions and moving directions is of equal importance. Thus, we consider how to extend the bounding-box to simultaneously represent the temporal position and motion, with which models can overcome occlusions shorter than the receptive field.

Btube definition

Adopting a 3D Bbox to point out an object across frames is the simplest extension, but this 3D Bbox is obviously too sparse to precisely represent the target’s moving trajectory. Inspired by the tubelet in the video detection task [31, 30], we design a simplified version, called the bounding-tube (Btube), because the dimension of the original tubelets is too large to regress directly. A Btube can be uniquely identified in space and time by 15 coordinate values, and it is generated by a method similar to linear spline interpolation, which splits a whole track into several overlapping Btubes.

As shown in Fig. 2 a, a Btube is a decahedron composed of 3 Bboxes in different video frames, namely $B_s$, $B_m$, and $B_e$, which need 12 coordinate values to define. Three other values, $t_s$, $t_m$, and $t_e$, point out their temporal positions. This setting allows the target to change its moving direction once in a short time. Moreover, its length-width ratio can change linearly, which makes the Btube more robust to pose and scale changes caused by perspective. By interpolating between $B_s$ and $B_m$ and between $B_m$ and $B_e$, we can restore all the bounding-boxes that constitute the Btube. Note that $B_m$ does not have to be exactly at the midpoint of $B_s$ and $B_e$; it may be closer to one of them. Btubes are designed to encode spatial and temporal information simultaneously. They can even reflect targets’ moving trends, which are important in the MOT task. These properties make Btubes carry much more useful semantics than traditional Bboxes.
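To make the 15-value parameterization concrete, the following is a minimal sketch of a Btube container and the piecewise-linear interpolation that restores a box at any frame. The class name `Btube`, the field names, and the `(x1, y1, x2, y2)` box layout are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


@dataclass
class Btube:
    """Three boxes plus their frame indices: 12 + 3 = 15 values."""
    t_s: int
    box_s: Box
    t_m: int
    box_m: Box
    t_e: int
    box_e: Box


def lerp_box(a: Box, b: Box, w: float) -> Box:
    """Linearly blend two boxes; w=0 gives a, w=1 gives b."""
    return tuple(pa + w * (pb - pa) for pa, pb in zip(a, b))


def box_at(tube: Btube, t: int) -> Box:
    """Restore the box at frame t (requires t_s <= t_m <= t_e)."""
    if not tube.t_s <= t <= tube.t_e:
        raise ValueError("frame lies outside the Btube")
    if t <= tube.t_m:
        span = tube.t_m - tube.t_s
        w = 1.0 if span == 0 else (t - tube.t_s) / span
        return lerp_box(tube.box_s, tube.box_m, w)
    # Here t > t_m, so t_e > t_m and the division is safe.
    w = (t - tube.t_m) / (tube.t_e - tube.t_m)
    return lerp_box(tube.box_m, tube.box_e, w)


# The target moves right for two frames, then down for four frames.
tube = Btube(0, (0, 0, 10, 10), 2, (4, 0, 14, 10), 6, (4, 8, 14, 18))
print(box_at(tube, 1))  # halfway between the start and middle boxes
print(box_at(tube, 4))  # halfway between the middle and end boxes
```

Because the middle box need not sit at the temporal midpoint, the two segments are interpolated independently, which is what lets a single Btube represent one change of direction.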

Generating Btubes from tracks

Btubes can only capture simple linear trajectories, thus we need to disassemble complex target’s tracks into short clips, in which motions can approximately be seen as linear and captured by our Btubes.

The disassembly process is shown in Fig. 2 b. We split a whole track into multiple overlapping Btubes by extending every Bbox in it to a Btube. We treat each Bbox as the $B_m$ of one Btube, then look forward and backward in the track to find its corresponding $B_e$ and $B_s$. We can extend Bboxes to longer Btubes to capture more temporal information, but long Btubes generated by linear interpolation cannot represent a complex moving trail well (see Fig. 2). To balance this trade-off, we set each Btube to be the longest one for which the mean IoU between its interpolated bounding-boxes $B_t$ and the ground-truth bounding-boxes $B_t^{*}$ is no less than the threshold $\eta$:

$$\max\ (t_e - t_s) \quad \text{s.t.} \quad \operatorname{mean}_{t \in \{t_s, \dots, t_e\}} \operatorname{IoU}(B_t, B_t^{*}) \ge \eta. \tag{1}$$

This principle allows us to dynamically generate Btubes with different lengths. When the moving trajectory is monotonous, the Btubes will be longer to capture more temporal information. When the motion varies sharply, shorter Btubes are generated to better fit the trail.
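The length-selection rule above can be sketched as follows. The text does not specify how the search over start and end frames is carried out, so this sketch assumes a simple greedy expansion (alternately extending the end and the start by one frame while the mean-IoU constraint holds); it may stop short of the true longest Btube. The function names and the list-of-boxes track format are illustrative assumptions.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0.0) * max(ih, 0.0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def lerp(a: Box, b: Box, w: float) -> Box:
    return tuple(x + w * (y - x) for x, y in zip(a, b))


def mean_iou(track: List[Box], s: int, m: int, e: int) -> float:
    """Mean IoU between the interpolated Btube (s, m, e) and the GT boxes."""
    total = 0.0
    for t in range(s, e + 1):
        if t <= m:
            w = 1.0 if m == s else (t - s) / (m - s)
            box = lerp(track[s], track[m], w)
        else:
            box = lerp(track[m], track[e], (t - m) / (e - m))
        total += iou(box, track[t])
    return total / (e - s + 1)


def longest_btube(track: List[Box], m: int, eta: float = 0.8) -> Tuple[int, int, int]:
    """Greedily grow a Btube around frame m (an assumed search strategy)."""
    s = e = m
    grew = True
    while grew:
        grew = False
        if e + 1 < len(track) and mean_iou(track, s, m, e + 1) >= eta:
            e, grew = e + 1, True
        if s - 1 >= 0 and mean_iou(track, s - 1, m, e) >= eta:
            s, grew = s - 1, True
    return s, m, e


# A straight-line track is covered by one long Btube.
linear = [(2.0 * t, 0.0, 2.0 * t + 10.0, 10.0) for t in range(7)]
print(longest_btube(linear, 3))
```

On a track with a sudden jump, the expansion stops before the jump once the interpolated boxes between the middle and the candidate end frame no longer overlap the ground truth.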

Overcoming the occlusion

Btubes guide models to capture moving trends. Thus, when facing occlusions, these trends assist in predicting the position of briefly invisible targets. Moreover, this property can reduce ID switches at the crossover point of two tracks, because two crossing tracks tend to have different moving directions.

3.2 Model Structure

With Btubes that encode the spatial-temporal position, we can handle the MOT task in one step learning without the help of external object detectors or handcrafted matching features. To fit Btubes, we adopt the 3D convolutional structure [28] to capture spatial-temporal features, which is widely used in the video action recognition task [6, 24, 19]. The whole pipeline is shown in Fig. 3.

Network structure

The network consists of a backbone, an FPN [38], and a few multi-scale task heads.

Given a video $V \in \mathbb{R}^{T \times H \times W \times C}$ to track, where $T$, $H$, $W$, and $C$ are the frame number, height, width, and input channel respectively, we split it into short clips $I_n$ as inputs. $I_n$ starts from frame $n$ and its length is $l$. As Btubes are usually short, the split clips can provide enough temporal information while reducing the computational complexity. Moreover, by adopting a sliding window scheme, the model can work in an online manner. The 3D-ResNet [25, 26] is applied as the backbone to extract the basic spatial-temporal feature groups $\{G^i\}$ with multiple scales, where $i$ denotes the level of the features, which are generated by stage $i$ of 3D-ResNet. As in RetinaNet [39] and FCOS [55], a 3D version of FPN, in which the 2D-CNN layers are simply replaced by 3D-CNNs [28], then takes $\{G^i\}$ as input and outputs multi-scale feature map groups $\{F^i\}$. This multi-scale setting can better capture targets with different scales. For each $F^i$, there is a task head composed of several CNN layers to output regressed Btubes and confidence scores. This fully 3D network processes temporal-spatial information simultaneously, making it possible to extract more effective features.
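As a small illustration of the sliding-window split, the sketch below enumerates overlapping clip ranges. The clip length of 8 matches the value reported in the experiments; the stride of 4 and the function name are assumptions made only for this example.

```python
from typing import List, Tuple


def split_into_clips(num_frames: int, clip_len: int = 8, stride: int = 4) -> List[Tuple[int, int]]:
    """Return half-open (start, end) frame ranges of overlapping clips.

    Trailing frames that do not fit a full stride are not covered;
    a real pipeline would pad or add one final clip.
    """
    last_start = max(num_frames - clip_len, 0)
    return [(s, min(s + clip_len, num_frames)) for s in range(0, last_start + 1, stride)]


print(split_into_clips(16))
```

Overlap between consecutive clips matters here because the predicted Btubes are later linked through the frames they share.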

Figure 4: Regression method and the matchup between output maps and GT Btubes. a: The model is required to regress the relative temporal and spatial position to focus on moving patterns. b: Each Btube can be regressed by several points in the output map. The colored points on the black map are inside the Btube’s $B_m$, so they are responsible for this Btube. Even though some points on the grey maps are also inside the Btube, they do not predict it because they are not on its $B_m$.


Each task head generates three output maps: the confidence map, regression map, and center-ness map, following FCOS [55]. The center-ness map is utilized as a weight mask applied on the confidence map in order to reduce the confidence scores of off-center boxes. The sizes of these three maps are the same. Each point $p$ in the map can be mapped back to the original input image. If the corresponding point of $p$ in the original input image is inside the $B_m$ of one Btube, then $p$ will regress its position (see Fig. 4). Relative to $p$, the Btube position $r$ can be regressed by 14 values: four for $B_m$, four for $B_s$, four for $B_e$, and two for the tube length ($d_s$ and $d_e$). Their definitions are shown in Fig. 4. We utilize relative distances with respect to $B_m$, instead of absolute ones, to regress Btubes, aiming to make the model focus on moving trails. The center-ness $c$, which serves as the weighting coefficient of the confidence score, is defined as:

$$c = \sqrt{\frac{\min(l_m, r_m)}{\max(l_m, r_m)} \times \frac{\min(t_m, b_m)}{\max(t_m, b_m)} \times \frac{\min(d_s, d_e)}{\max(d_s, d_e)}}, \tag{2}$$

where $l_m$, $t_m$, $r_m$, and $b_m$ are the distances from $p$ to the left, top, right, and bottom sides of $B_m$.

Although $c$ can be calculated directly from the predicted $r$, we adopt a head to regress it, and $c^{*}$ calculated from the GT $r^{*}$ by Eq. 2 is utilized as the ground truth to train the head.
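A minimal sketch of a center-ness target, assuming it extends the FCOS 2D center-ness with a third, temporal ratio built from the two tube-length values. The argument names (`l`, `r`, `t`, `b` for the spatial distances to the sides of the middle box, `d_s`, `d_e` for the temporal distances) are illustrative assumptions.

```python
import math


def centerness(l: float, r: float, t: float, b: float, d_s: float, d_e: float) -> float:
    """FCOS-style center-ness with an additional temporal ratio (assumed form)."""
    def ratio(a: float, c: float) -> float:
        hi = max(a, c)
        return min(a, c) / hi if hi > 0 else 0.0

    return math.sqrt(ratio(l, r) * ratio(t, b) * ratio(d_s, d_e))


print(centerness(5, 5, 3, 3, 2, 2))  # a perfectly centered point
print(centerness(1, 4, 2, 2, 3, 3))  # a point close to the left edge
```

The value is 1 only for a point that is centered on every axis and decays toward 0 as the point approaches any side, which is why it is suitable as a down-weighting mask on the confidence map.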

Following FCOS [55], different task heads are responsible for detecting objects within different size ranges, which largely alleviates the ambiguity caused by one point falling into the $B_m$ of multiple Btubes.

3.3 Training Method

Tube GIoU

IoU is the most popular indicator to evaluate the quality of the predicted Bbox, and it is usually used as the loss function. GIoU [49] loss is an extension of IoU loss which solves the problem that there is no supervisory information when the predicted Bbox has no intersection with the ground truth. The GIoU of a Bbox is defined as:

$$\operatorname{GIoU}(B, B^{*}) = \operatorname{IoU}(B, B^{*}) - \frac{|D_{B,B^{*}} \setminus (B \cup B^{*})|}{|D_{B,B^{*}}|}, \tag{3}$$

where $D_{B,B^{*}}$ is the smallest enclosing convex object of $B$ and $B^{*}$. We extend the definition of GIoU to make it compatible with Btubes. According to our regression method, $B_m$ and $B_m^{*}$ must be on the same video frame, which makes the calculation of a Btube’s volume, intersection, and smallest enclosing tube object straightforward. As shown in Fig. 5, we can treat each Btube as two square frustums sharing the same underside. Because $B_m$ and $B_m^{*}$ are on the same video frame, $T \cap T^{*}$ and $D_{T,T^{*}}$ are also composed of two adjoining square frustums whose volumes are easy to calculate (the detailed algorithm is given in the supplementary files). Tube GIoU and Tube IoU are the volume-extended versions of the original area-based ones.
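For reference, the sketch below computes the standard 2D GIoU for axis-aligned boxes, plus the volume of one linearly interpolated tube segment, which is the building block needed when areas are replaced by volumes. It is not the authors' Tube GIoU routine (that is deferred to their supplementary files); the function names and the `(x1, y1, x2, y2)` layout are assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two axis-aligned boxes with positive area."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (enclose - union) / enclose


def segment_volume(w1: float, h1: float, w2: float, h2: float, dt: float) -> float:
    """Volume swept by a box whose width and height change linearly over dt frames.

    This is the integral of w(t) * h(t) for linear w and h.
    """
    return dt * (2 * w1 * h1 + 2 * w2 * h2 + w1 * h2 + w2 * h1) / 6


print(giou((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes give a negative value
print(segment_volume(2, 3, 2, 3, 4))     # a constant 2x3 box over 4 frames
```

The negative value for disjoint boxes is the property that keeps a gradient signal available when prediction and ground truth do not overlap.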

Figure 5: Visualization of the calculation process of Tube GIoU. The intersection $T \cap T^{*}$ and the enclosing object $D_{T,T^{*}}$ of the targets are also decahedrons, so their volumes can be calculated in the same way as those of Btubes.

Loss function

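For each point $p$ in map $M$, we denote its confidence score, regression result, and center-ness as $s_p$, $r_p$, and $c_p$. The training loss function can be formulated as:

$$L(\{s_p\}, \{r_p\}, \{c_p\}) = \frac{1}{N_{pos}} \sum_{p \in M} L_{cls}(s_p, s_p^{*}) + \frac{\lambda}{N_{pos}} \sum_{p \in M} L_{reg}(r_p, r_p^{*}) + \frac{\alpha}{N_{pos}} \sum_{p \in M} L_{cent}(c_p, c_p^{*}), \tag{4}$$

where $*$ denotes the corresponding ground truth and $N_{pos}$ denotes the number of positive foreground samples. $\lambda$ and $\alpha$ are the weight coefficients, which are set to 1 in the experiments. $L_{cls}$ is the focal loss proposed in [39], $L_{cent}$ is the binary cross-entropy loss, and $L_{reg}$ is the Tube GIoU loss, which can be formulated as:

$$L_{reg}(r_p, r_p^{*}) = \mathbb{1}_{\{s_p^{*} = 1\}} \left(1 - \operatorname{TGIoU}(r_p, r_p^{*})\right), \tag{5}$$

where $\mathbb{1}_{\{s_p^{*} = 1\}}$ is the indicator function, being 1 if $s_p^{*} = 1$ and 0 otherwise, and $\operatorname{TGIoU}$ is the Tube GIoU.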

3.4 Linking the Bounding-Tubes

After getting predicted Btubes, we only need an IoU-based method without any trainable parameters to link them into whole tracks.

Tube NMS

Before the linking principles, we first introduce the NMS method tailored for Btubes. As Btubes are in 3D space, conducting a pure 3D NMS over the huge number of them would lead to large computational overhead. Thus, we simplify the 3D NMS into a modified 2D version. The NMS operation is only conducted among the Btubes whose $B_m$ is on the same video frame. Traditional NMS eliminates targets that have large IoU. However, this method will break at least one track when two or more tracks intersect. Due to the temporal information encoded in Btubes, we can utilize $B_s$ and $B_e$ to perceive the moving direction of targets. The directions of intersecting tracks are often different, so the IoUs of their $B_s$, $B_m$, and $B_e$ will not all be large. The original NMS algorithm suppresses one of two Btubes whose IoU is larger than the threshold $\gamma$, while in Tube NMS we set two thresholds $\gamma_1$ and $\gamma_2$, and for two Btubes $T^{(1)}$ and $T^{(2)}$, suppression is conducted when $\operatorname{IoU}(B_m^{(1)}, B_m^{(2)}) > \gamma_1$, $\operatorname{IoU}(B_{s'}^{(1)}, B_{s'}^{(2)}) > \gamma_2$, and $\operatorname{IoU}(B_{e'}^{(1)}, B_{e'}^{(2)}) > \gamma_2$, where $s' = \max(t_s^{(1)}, t_s^{(2)})$, $e' = \min(t_e^{(1)}, t_e^{(2)})$, and $B_{s'}$ and $B_{e'}$ are generated by interpolation.

Linking principles

After the Tube NMS pre-processing, we need to link all the rest Btubes into whole tracks. The linking method is pretty simple which is only an IoU-based greedy algorithm without any learnable parameters or assisting techniques like appearance matching or Re-ID.

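Due to the overlap of Btubes in the temporal dimension, we can focus on it to calculate the frame-based IoU for linking. Given a track $K_{(s_1, e_1)}$ starting from frame $s_1$ and ending at frame $e_1$, and a Btube $T_{(s_2, e_2)}$, we first find the overlap part $O_{(s_3, e_3)}$, where $s_3 = \max(s_1, s_2)$ and $e_3 = \min(e_1, e_2)$. If $s_3 > e_3$, $K$ and $T$ have no overlap and do not need to be linked. When they overlap, we calculate the matching score $M$ as:

$$M(K, T) = \frac{1}{n} \sum_{f \in O_{(s_3, e_3)}} \operatorname{IoU}(K_f, T_f), \tag{6}$$

where $K_f$ and $T_f$ denote the (interpolated) bounding-boxes at frame $f$ in $K$ and $T$, and $n$ is the number of frames in $O$. If $M$ is larger than the linking threshold $\beta$, we link them by adding the interpolated bounding-boxes of $T$ onto $K$. It should be noted that in the overlap part, we average the bounding-boxes from $K$ and $T$ to reduce the deviation caused by the linear interpolation. The linking function can be formulated as:

$$K_{new} = \operatorname{Link}(K_{(s_1, e_1)}, T_{(s_2, e_2)}) = K_{(s_1, s_3)} + \operatorname{Avg}(K_{(s_3, e_3)}, T_{(s_3, e_3)}) + T_{(e_3, e_2)}, \tag{7}$$

where we assume that $e_1 < e_2$, and $+$ denotes jointing two Btubes (or tracks) without overlap.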

To avoid ID switches at the intersection of two tracks, we also take moving directions into account. The moving direction vector (MDV) of a Btube (or track) starts from the center of its $B_s$ and ends at the center of its $B_e$. We want a track and a Btube with similar directions to be more likely to link. Thus, we compute the angle $\theta$ between the MDVs of $K$ and $T$ and take $\cos\theta$ as a weighting coefficient applied to $M$ to adjust the matching score. The final matching score utilized to link is $M' = M \cdot (1 + \varphi \cos\theta)$, where $\varphi$ is a hyper-parameter. If the direction vectors of the track and Btube form an acute angle, $\cos\theta > 0$ and their matching score is enlarged; otherwise it is reduced.
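The direction-weighted matching score can be sketched as below: the mean per-frame IoU over the shared frames, scaled by one plus a weighted cosine of the angle between the two moving direction vectors. Tracks and Btubes are represented here as dictionaries from frame index to an already interpolated box; this representation, the function names, and the default weight `phi=0.2` are assumptions for illustration only.

```python
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Boxes = Dict[int, Box]                   # frame index -> box


def iou(a: Box, b: Box) -> float:
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def direction(boxes: Boxes) -> Tuple[float, float]:
    """Vector from the center of the first box to the center of the last box."""
    first, last = boxes[min(boxes)], boxes[max(boxes)]
    return ((last[0] + last[2] - first[0] - first[2]) / 2,
            (last[1] + last[3] - first[1] - first[3]) / 2)


def matching_score(track: Boxes, tube: Boxes, phi: float = 0.2) -> float:
    """Mean overlap IoU, enlarged or reduced by the direction agreement."""
    shared = sorted(set(track) & set(tube))
    if not shared:
        return 0.0
    mean_iou = sum(iou(track[f], tube[f]) for f in shared) / len(shared)
    (ax, ay), (bx, by) = direction(track), direction(tube)
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    cos_theta = (ax * bx + ay * by) / norm if norm > 0 else 0.0
    return mean_iou * (1 + phi * cos_theta)


track = {0: (0, 0, 10, 10), 1: (2, 0, 12, 10), 2: (4, 0, 14, 10)}
same_way = {1: (2, 0, 12, 10), 2: (4, 0, 14, 10), 3: (6, 0, 16, 10)}
turns_back = {1: (2, 0, 12, 10), 2: (4, 0, 14, 10), 3: (0, 0, 10, 10)}
print(matching_score(track, same_way), matching_score(track, turns_back))
```

Both candidate Btubes overlap the track perfectly on the shared frames, so only the direction term separates them, which is exactly the situation at a crossover.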

The overall linking method is an online greedy algorithm, which is shown in Alg. 1.

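Algorithm 1 Greedy Linking Algorithm
Input:  Predicted Btubes $\{T_{(s,e)}\}$
Output:  Final tracks $\{K_{(s,e)}\}$
1:  Grouping $\{T_{(s,e)}\}$ to $\{H_1, \dots, H_L\}$, where $L$ is the total length of the video and $H_t = \{T_{(s,e)} \mid s = t\}$.
2:  Utilizing $H_1$ to initialize $\{K_{(s,e)}\}$.
3:  for $t = 2$; $t \le L$; $t{+}{+}$ do
4:     Calculating $M'$ between $\{K_{(s,e)}\}$ and $H_t$ to form the matching score matrix $S$, where $S_{ij} = M'(K_i, T_j)$
5:     Linking the track-tube pairs starting from the largest $S_{ij}$ in $S$ by Eq. 7 until all the rest $S_{ij} < \beta$. Each linking operation will update $S$.
6:     The remaining Btubes after linking are added to $\{K_{(s,e)}\}$ as new tracks.
7:  end for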
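The greedy procedure can be sketched in a few lines. This version assumes Btubes are grouped by their first frame, represents tracks and Btubes as dictionaries from frame index to box, uses the plain mean overlap IoU as the score (without the direction term), and recomputes all scores after every link. These choices, the function names, and the default `beta=0.5` are assumptions for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)
Boxes = Dict[int, Box]                   # frame index -> box


def iou(a: Box, b: Box) -> float:
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0.0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0.0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def score(track: Boxes, tube: Boxes) -> float:
    """Mean IoU over the frames shared by the track and the Btube."""
    shared = set(track) & set(tube)
    return sum(iou(track[f], tube[f]) for f in shared) / len(shared) if shared else 0.0


def link(track: Boxes, tube: Boxes) -> Boxes:
    """Append the Btube to the track, averaging boxes on shared frames."""
    merged = dict(track)
    for f, box in tube.items():
        merged[f] = tuple((p + q) / 2 for p, q in zip(merged[f], box)) if f in merged else box
    return merged


def greedy_link(tubes: List[Boxes], beta: float = 0.5) -> List[Boxes]:
    groups: Dict[int, List[Boxes]] = {}
    for tube in tubes:
        groups.setdefault(min(tube), []).append(tube)  # group by first frame
    tracks: List[Boxes] = []
    for start in sorted(groups):
        pending = list(groups[start])
        while True:
            # Best remaining (score, track index, tube index) triple, if any.
            best = max(((score(k, t), i, j)
                        for i, k in enumerate(tracks)
                        for j, t in enumerate(pending)), default=None)
            if best is None or best[0] <= beta:
                break
            _, i, j = best
            tracks[i] = link(tracks[i], pending.pop(j))
        tracks.extend(pending)  # unmatched Btubes start new tracks
    return tracks


a1 = {0: (0, 0, 10, 10), 1: (2, 0, 12, 10)}
a2 = {1: (2, 0, 12, 10), 2: (4, 0, 14, 10)}
b1 = {0: (50, 50, 60, 60), 1: (50, 52, 60, 62)}
b2 = {1: (50, 52, 60, 62), 2: (50, 54, 60, 64)}
tracks = greedy_link([a2, b1, a1, b2])
print(len(tracks), [sorted(t) for t in tracks])
```

Because each step only looks at Btubes starting at the current frame and at tracks built so far, the procedure can run online as clips arrive.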

4 Experiments

Datasets and evaluation metrics

We evaluate our TubeTK model on three MOT Benchmarks [44, 34], namely 2D-MOT2015 (MOT15), MOT16, and MOT17. These benchmarks consist of videos with many occlusions, which makes them really challenging. They are widely used in the field of multi-object tracking and can objectively evaluate models’ performances. MOT15 contains 11 train and 11 test videos, while MOT16 and MOT17 contain the same videos, including 7 train and 7 test videos. These three benchmarks provide public detection results (detected by DPM [21], Faster R-CNN [48], and SDP [63]) for fair comparison among TBD frameworks. However, because our TubeTK conducts MOT in one-step, we do not adopt any external detection results. Without detection results generated by sophisticated detection models trained on large datasets, we need more videos to train the 3D network. Thus, we adopt a synthetic dataset JTA [12] which is directly generated from the video game Grand Theft Auto V developed by Rockstar North. There are 256 video sequences in JTA, enough to pre-train our 3D network. Following the MOT Challenge [44], we adopt the CLEAR MOT metrics [4], and other measures proposed in [60].

Figure 6: Analysis of the performance in occlusion situations. The examples in the top row (from the test set of MOT16) show that our TubeTK can effectively reduce the ID switches and false negatives caused by occlusion. The bottom analysis is conducted on the training set of MOT16. We first illustrate the tracked ratio with respect to visibility. The results reveal that our TubeTK performs much better on highly occluded targets than other models. We then illustrate the values of IDS/IDR, for which the conclusion still holds.


The hyper-parameters we adopt in the experiments are shown in the following table.

img size
0.8 8 896×1152 0.4 0.2 0.5 0.4

For each clip we randomly sample a spatial crop from it or its horizontal flip, with the per-pixel mean subtracted. HSL jitter is adopted as color augmentation. The details of the network structure follow FCOS [55] (see the supplementary file for details). We only replace the 2D CNN layers with their 3D versions and modify the last layer in the task head to output tracking results. We initialize the weights as in [55] and train them on JTA from scratch. We utilize SGD with a mini-batch size of 32. The learning rate starts from and is divided by 5 when the error plateaus. TubeTK is trained for 150K iterations on JTA and 25K on the benchmarks. The weight decay and momentum factors are and 0.9.

Ablation study

The ablation study is conducted on the MOT17 training set (without pre-training on JTA). Tab. 1 demonstrates the great potential of the proposed model. We find that shorter clips, which encode less temporal information, lead to worse performance, which reveals that extending the bounding-box to the Btube is effective. Moreover, if we fix the length of all the Btubes to 8 (the length of the input clips), the performance drops significantly. Fixing the length makes the Btubes deviate from the ground-truth trajectory, leading to many more FNs. This demonstrates that setting the length of Btubes dynamically can better capture the moving trails. The other comparisons show the importance of the Tube GIoU loss and Tube NMS. The original NMS kills many highly occluded Btubes, causing more FN and IDS, and the Tube GIoU loss guides the model to regress the Btube’s length more accurately than the Tube IoU loss (fewer FN and FP). TubeTK has many more IDS than Tracktor [3] because our FN is much lower, and more tracked results potentially lead to more IDS. From IDF1 we can tell that TubeTK tracks better. Note that we refrain from cross-validation, following [3], as our TubeTK is trained on local clips and never accesses the tracking ground-truth data.

Model | MOTA | IDF1 | MT | ML | FP | FN | IDS
D&T [20] | 50.1 | 24.9 | 23.1 | 27.1 | 3561 | 52481 | 2715
Tracktor++ [3] | 61.9 | 64.7 | 35.3 | 21.4 | 323 | 42454 | 326
POI [65] | 65.2 | - | 37.3 | 14.7 | 3497 | 34241 | 716
TubeTK shorter clips | 60.3 | 60.7 | 44.3 | 25.5 | 3446 | 40139 | 968
TubeTK fixed tube len | 74.3 | 68.5 | 62.5 | 8.6 | 7468 | 19452 | 1184
TubeTK IoU Loss | 70.5 | 63.7 | 67.8 | 6.4 | 13247 | 18148 | 1734
TubeTK original NMS | 75.3 | 70.1 | 84.6 | 6.2 | 11256 | 13421 | 2995
TubeTK | 76.9 | 70.0 | 84.7 | 3.1 | 11541 | 11801 | 2687

Table 1: Ablation study on the training set of MOT17. D&T and Tracktor adopt public detections generated by Faster R-CNN [23]. POI adopts private detection results and is tested on MOT16.

Benchmark evaluation

Model | Detr | MOTA | IDF1 | MT | ML | FP | FN | IDS

MOT17
Ours | w/o | 63.0 | 58.6 | 31.2 | 19.9 | 27060 | 177483 | 4137
SCNet | Priv | 60.0 | 54.4 | 34.4 | 16.2 | 72230 | 145851 | 7611
LSST17 [22] | Pub | 54.7 | 62.3 | 20.4 | 40.1 | 26091 | 228434 | 1243
Tracktor [3] | Pub | 53.5 | 52.3 | 19.5 | 36.3 | 12201 | 248047 | 2072
JBNOT [27] | Pub | 52.6 | 50.8 | 19.7 | 35.8 | 31572 | 232659 | 3050
FAMNet [9] | Pub | 52.0 | 48.7 | 19.1 | 33.4 | 14138 | 253616 | 3072

MOT16
Ours | POI | 66.9 | 62.2 | 39.0 | 16.1 | 11544 | 47502 | 1236
Ours | w/o | 64.0 | 59.4 | 33.5 | 19.4 | 10962 | 53626 | 1117
POI [65] | POI | 66.1 | 65.1 | 34.0 | 20.8 | 5061 | 55914 | 805
CNNMTT [43] | POI | 65.2 | 62.2 | 32.4 | 21.3 | 6578 | 55896 | 946
TAP [69] | Priv | 64.8 | 73.5 | 38.5 | 21.6 | 12980 | 50635 | 571
RAN [18] | POI | 63.0 | 63.8 | 39.9 | 22.1 | 13663 | 53248 | 482
SORT [5] | Priv | 59.8 | 53.8 | 25.4 | 22.7 | 8698 | 63245 | 1423
Tracktor [3] | Pub | 54.5 | 52.5 | 19.0 | 36.9 | 3280 | 79149 | 682

MOT15
Ours | w/o | 58.4 | 53.1 | 39.3 | 18.0 | 5756 | 18961 | 854
RAN [18] | POI | 56.5 | 61.3 | 45.1 | 14.6 | 9386 | 16921 | 428
NOMT [8] | Priv | 55.5 | 59.1 | 39.0 | 25.8 | 5594 | 21322 | 427
APRCNN [7] | Priv | 53.0 | 52.2 | 29.1 | 20.2 | 5159 | 22984 | 708
CDADDAL [1] | Priv | 51.3 | 54.1 | 36.3 | 22.2 | 7110 | 22271 | 544
Tracktor [3] | Pub | 44.1 | 46.7 | 18.0 | 26.2 | 6477 | 26577 | 1318

Table 2: Results of the online state-of-the-art models on the MOT15, MOT16, and MOT17 datasets. “Detr” denotes the source of the detection results. Our model does not adopt external detection results (w/o). RAN and CNNMTT utilize the ones provided by POI [65].

Tab. 2 presents the results of our TubeTK and other state-of-the-art (SOTA) models which adopt public or private external detection results (detailed results are shown in the supplementary files). We only compare with the officially published and peer-reviewed online models on the MOT Challenge leaderboard. As we show, although TubeTK does not adopt any external detection results, it achieves new SOTA results on MOT17 (3.0 MOTA improvement) and MOT15 (1.9 MOTA improvement). On MOT16, it achieves much better performance than other SOTA models that rely on publicly available detections (64.0 vs. 54.5). Moreover, TubeTK performs competitively with the SOTA models adopting POI [65] detection bounding-boxes and appearance features (POI-D-F) on MOT16. It should be noted that the authors of POI-D-F utilize 2 extra tracking datasets and a large amount of self-collected surveillance data (10× the frames of MOT16) to train the Faster R-CNN detector, and 4 extra Re-ID datasets to extract the appearance features. Thus, we cannot reach the same generalization ability as POI-D-F with synthetic JTA data. To demonstrate the potential of TubeTK, we also provide the results adopting the POI detections (without the appearance features; details are in the supplementary files), and in this setting our TubeTK achieves the new state of the art on MOT16 (66.9 vs. 66.1). On these three benchmarks, owing to its robustness to occlusions, our model has fewer FN while keeping the number of FP at an acceptable level. Although TubeTK can handle occlusions better, its IDS is relatively higher because we do not adopt feature matching mechanisms to maintain global consistency. The situation of IDS in occluded parts is further discussed in Sec. 5.

5 Discussion

Overcoming the occlusion

With Btubes, our model can learn and encode the moving trend of targets, leading to more robust performance when facing severe occlusions. We show the qualitative and quantitative analysis in Fig. 6. In the top part of Fig. 6, we show that TubeTK can keep tracking with far fewer FN or IDS when the target is totally shielded by other targets. In the bottom part, we provide the tracked ratio and the number of IDS (normalized by ID recall) with respect to targets’ visibility on the training set of MOT16. When the visibility is low, TubeTK performs much better than other TBD models.

Robustness of Btubes for linking

The final linking process has no learnable parameters, so the linking performance depends heavily on the accuracy of the regressed Btubes. To verify its robustness, we run the linking algorithm on GT Btubes with noise jitter applied to the Btubes’ center position and spatial-temporal scale. A jitter of 0.25 on center position or scale means that the position or scale shifts by up to 25% of the Btube’s size. The results on MOT17-02, a video with many crossovers, are shown in Tab. 3. Even with jitter as large as 25%, the linking results remain strong (MOTA 86, IDF1 79), which indicates that the linking algorithm is robust and does not need highly accurate Btubes to complete the tracking.

Noise levels tested: 0.00, 0.05, 0.10, 0.15, 0.20, 0.25 (table cell values not preserved).

Table 3: Experiments on linking robustness. We only test on the GT tracks of a single video, MOT17-02. “cn” and “sn” denote the center position noise and the bounding-box scale noise, respectively. In each cell, the values are “MOTA”, “IDF1”, “MT”, and “ML”, in order.
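The jitter protocol described above can be sketched as follows. This is a simplified illustration, not the authors' code: the Btube is reduced to a center and a spatial-temporal extent, the noise is drawn uniformly, and the names `Btube` and `jitter_btube` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Btube:
    # Simplified stand-in for a Btube: center and spatial-temporal extent.
    cx: float  # center x
    cy: float  # center y
    ct: float  # center time (frame index)
    w: float   # spatial width
    h: float   # spatial height
    l: float   # temporal length


def jitter_btube(tube: Btube, cn: float, sn: float, rng: random.Random) -> Btube:
    """Shift the center by up to cn * size and rescale by up to a factor of sn.

    cn, sn: center and scale noise levels, e.g. 0.25 for up to 25%.
    Uniform noise is an assumption; the text only states the maximum shift.
    """
    def shift(size: float, level: float) -> float:
        return rng.uniform(-level, level) * size

    return Btube(
        cx=tube.cx + shift(tube.w, cn),
        cy=tube.cy + shift(tube.h, cn),
        ct=tube.ct + shift(tube.l, cn),
        w=tube.w * (1.0 + rng.uniform(-sn, sn)),
        h=tube.h * (1.0 + rng.uniform(-sn, sn)),
        l=tube.l * (1.0 + rng.uniform(-sn, sn)),
    )


rng = random.Random(0)
gt = Btube(cx=100.0, cy=50.0, ct=10.0, w=40.0, h=80.0, l=8.0)
noisy = jitter_btube(gt, cn=0.25, sn=0.25, rng=rng)
print(noisy)
```

Feeding such perturbed GT Btubes to the linking step isolates the linker from the regression network, so any drop in MOTA or IDF1 can be attributed to Btube inaccuracy alone.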

6 Conclusion

In this paper, we proposed TubeTK, an end-to-end, one-step training model for the MOT task. It uses Btubes to encode a target’s temporal-spatial position and local moving trail. This makes the model independent of external detection results and gives it considerable potential to overcome occlusions. We conducted extensive experiments to evaluate the proposed model. On the mainstream benchmarks, our model achieves new state-of-the-art performance compared with other online models, even those that adopt private detection results. Comprehensive analyses were presented to further validate the robustness of TubeTK.

7 Acknowledgements

This work is supported in part by the National Key R&D Program of China, No. 2017YFA0700800, the National Natural Science Foundation of China under Grant 61772332, and Shanghai Qi Zhi Institute.

References

  • [1] S. Bae and K. Yoon (2017) Confidence-based data association and discriminative deep appearance learning for robust online multi-object tracking. TPAMI 40 (3), pp. 595–610. Cited by: §1, Table 2.
  • [2] J. Berclaz, F. Fleuret, E. Turetken, and P. Fua (2011) Multiple object tracking using k-shortest paths optimization. TPAMI 33 (9), pp. 1806–1819. Cited by: §2.
  • [3] P. Bergmann, T. Meinhardt, and L. Leal-Taixe (2019) Tracking without bells and whistles. arXiv preprint arXiv:1903.05625. Cited by: item 1, item 2, §1, §1, §2, §4, Table 1, Table 2.
  • [4] K. Bernardin and R. Stiefelhagen (2008) Evaluating multiple object tracking performance: the clear mot metrics. Journal on Image and Video Processing 2008, pp. 1. Cited by: §4.
  • [5] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In ICIP, pp. 3464–3468. Cited by: Table 2.
  • [6] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, pp. 6299–6308. Cited by: §1, §3.2.
  • [7] L. Chen, H. Ai, C. Shang, Z. Zhuang, and B. Bai (2017) Online multi-object tracking with convolutional neural networks. In ICIP, pp. 645–649. Cited by: Table 2.
  • [8] W. Choi (2015) Near-online multi-target tracking with aggregated local flow descriptor. In ICCV, pp. 3029–3037. Cited by: Table 2.
  • [9] P. Chu and H. Ling (2019) FAMNet: joint learning of feature, affinity and multi-dimensional assignment for online multiple object tracking. arXiv preprint arXiv:1904.04989. Cited by: item 3, §1, Table 2.
  • [10] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, and N. Yu (2017) Online multi-object tracking using cnn-based single object tracker with spatial-temporal attention mechanism. In ICCV, pp. 4836–4845. Cited by: §2.
  • [11] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: object detection with keypoint triplets. arXiv preprint arXiv:1904.08189. Cited by: §1.
  • [12] M. Fabbri, F. Lanzi, S. Calderara, A. Palazzi, R. Vezzani, and R. Cucchiara (2018) Learning to detect and track visible and occluded body joints in a virtual world. In ECCV, Cited by: §4.
  • [13] L. Fagot-Bouquet, R. Audigier, Y. Dhome, and F. Lerasle (2016) Improving multi-frame data association with sparse representations for robust near-online multi-object tracking. In ECCV, pp. 774–790. Cited by: §1.
  • [14] H. Fang, J. Cao, Y. Tai, and C. Lu (2018) Pairwise body-part attention for recognizing human-object interactions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 51–67. Cited by: §1.
  • [15] H. Fang, J. Sun, R. Wang, M. Gou, Y. Li, and C. Lu (2019) Instaboost: boosting instance segmentation via probability map guided copy-pasting. In ICCV, pp. 682–691. Cited by: §1.
  • [16] H. Fang, S. Xie, Y. Tai, and C. Lu (2017) Rmpe: regional multi-person pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2334–2343. Cited by: §1.
  • [17] H. Fang, Y. Xu, W. Wang, X. Liu, and S. Zhu (2018) Learning pose grammar to encode human body configuration for 3d pose estimation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1.
  • [18] K. Fang, Y. Xiang, X. Li, and S. Savarese (2018) Recurrent autoregressive networks for online multi-object tracking. In IEEE Winter Conference on Applications of Computer Vision, pp. 466–475. Cited by: item 1, item 3, §2, Table 2.
  • [19] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2018) Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982. Cited by: §3.2.
  • [20] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In ICCV, Cited by: item 1, §2, Table 1.
  • [21] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. TPAMI 32 (9), pp. 1627–1645. Cited by: §4.
  • [22] W. Feng, Z. Hu, W. Wu, J. Yan, and W. Ouyang (2019) Multi-object tracking with multiple cues and switcher-aware classification. arXiv preprint arXiv:1901.06129. Cited by: Table 2.
  • [23] R. Girshick (2015) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: Table 1.
  • [24] K. Hara, H. Kataoka, and Y. Satoh (2017) Learning spatio-temporal features with 3d residual networks for action recognition. In ICCV, pp. 3154–3160. Cited by: §3.2.
  • [25] K. Hara, H. Kataoka, and Y. Satoh (2018) Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In CVPR, pp. 6546–6555. Cited by: §3.2.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.2.
  • [27] R. Henschel, Y. Zou, and B. Rosenhahn (2019) Multiple people tracking using body and joint detections. In CVPRW, pp. 0–0. Cited by: Table 2.
  • [28] S. Ji, W. Xu, M. Yang, and K. Yu (2012) 3D convolutional neural networks for human action recognition. TPAMI 35 (1), pp. 221–231. Cited by: §3.2, §3.2.
  • [29] H. Jiang, S. Fels, and J. J. Little (2007) A linear programming approach for multiple object tracking. In CVPR, pp. 1–8. Cited by: §2.
  • [30] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, et al. (2017) T-cnn: tubelets with convolutional neural networks for object detection from videos. IEEE Transactions on Circuits and Systems for Video Technology 28 (10), pp. 2896–2907. Cited by: §2, §3.1.
  • [31] K. Kang, W. Ouyang, H. Li, and X. Wang (2016) Object detection from video tubelets with convolutional neural networks. In CVPR, pp. 817–825. Cited by: §2, §3.1.
  • [32] C. Kim, F. Li, A. Ciptadi, and J. M. Rehg (2015) Multiple hypothesis tracking revisited. In CVPR, pp. 4696–4704. Cited by: §2.
  • [33] C. Kuo and R. Nevatia (2011) How does person identity recognition help multi-person tracking? In CVPR, pp. 1217–1224. Cited by: §2.
  • [34] L. Leal-Taixé, A. Milan, I. Reid, S. Roth, and K. Schindler (2015) Motchallenge 2015: towards a benchmark for multi-target tracking. arXiv preprint arXiv:1504.01942. Cited by: item 3, §4.
  • [35] P. Lenz, A. Geiger, and R. Urtasun (2015) Followme: efficient online min-cost flow tracking with bounded memory and computation. In CVPR, pp. 4364–4372. Cited by: §2.
  • [36] Y. Li, L. Xu, X. Huang, X. Liu, Z. Ma, M. Chen, S. Wang, H. Fang, and C. Lu (2019) HAKE: human activity knowledge engine. arXiv preprint arXiv:1904.06539. Cited by: §1.
  • [37] Y. Li, S. Zhou, X. Huang, L. Xu, Z. Ma, H. Fang, Y. Wang, and C. Lu (2019) Transferable interactiveness prior for human-object interaction detection. CVPR. Cited by: §1.
  • [38] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 2117–2125. Cited by: §1, §3.2.
  • [39] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §3.2, §3.3.
  • [40] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §2.
  • [41] C. Long, A. Haizhou, Z. Zijie, and S. Chong (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, Vol. 5, pp. 8. Cited by: §1, §2.
  • [42] C. Lu, H. Su, Y. Li, Y. Lu, L. Yi, C. Tang, and L. J. Guibas (2018) Beyond holistic object recognition: enriching image understanding with part states. In CVPR, Cited by: §2.
  • [43] N. Mahmoudi, S. M. Ahadi, and M. Rahmati (2019) Multi-target tracking using cnn-based features: cnnmtt. Multimedia Tools and Applications 78 (6), pp. 7077–7096. Cited by: Table 2.
  • [44] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: item 3, §4.
  • [45] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, pp. 1201–1208. Cited by: §2.
  • [46] S. Qi, W. Wang, B. Jia, J. Shen, and S. Zhu (2018) Learning human-object interactions by graph parsing neural networks. In ECCV, Cited by: §1.
  • [47] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788. Cited by: §2.
  • [48] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §4.
  • [49] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, pp. 658–666. Cited by: §3.3.
  • [50] A. Sadeghian, A. Alahi, and S. Savarese (2017) Tracking the untrackable: learning to track multiple cues with long-term dependencies. In ICCV, pp. 300–311. Cited by: item 3, §1, §2, §2.
  • [51] D. Shao, Y. Xiong, Y. Zhao, Q. Huang, Y. Qiao, and D. Lin (2018) Find and focus: retrieve and localize video events with natural language queries. In ECCV, pp. 200–216. Cited by: §2.
  • [52] K. Simonyan and A. Zisserman (2014) Two-stream convolutional networks for action recognition in videos. In NeurIPS, pp. 568–576. Cited by: §1.
  • [53] S. Sun, N. Akhtar, H. Song, A. S. Mian, and M. Shah (2019) Deep affinity network for multiple object tracking. TPAMI. Cited by: item 2, item 3.
  • [54] S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017) Multiple people tracking by lifted multicut and person re-identification. In CVPR, pp. 3539–3548. Cited by: §2.
  • [55] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §1, Figure 3, §3.2, §3.2, §3.2, §4.
  • [56] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee (2017) Decomposing motion and content for natural video sequence prediction. ICLR. Cited by: §1.
  • [57] P. Voigtlaender, M. Krause, A. Osep, J. Luiten, B. B. G. Sekar, A. Geiger, and B. Leibe (2019) MOTS: multi-object tracking and segmentation. In CVPR, pp. 7942–7951. Cited by: item 2, §1.
  • [58] W. Wang, Z. Zhang, S. Qi, J. Shen, Y. Pang, and L. Shao (2019) Learning compositional neural information fusion for human parsing. In ICCV, Cited by: §1.
  • [59] Z. Wang, L. Zheng, Y. Liu, and S. Wang (2019) Towards real-time multi-object tracking. arXiv preprint arXiv:1909.12605. Cited by: §2.
  • [60] B. Wu and R. Nevatia (2006) Tracking of multiple, partially occluded humans based on static body part detection. In CVPR, Vol. 1, pp. 951–958. Cited by: §4.
  • [61] Y. Xiu, J. Li, H. Wang, Y. Fang, and C. Lu (2018) Pose Flow: efficient online pose tracking. In BMVC, Cited by: §2.
  • [62] W. Xu, Y. Li, and C. Lu (2018) Srda: generating instance segmentation annotation via scanning, reasoning and domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 120–136. Cited by: §1.
  • [63] F. Yang, W. Choi, and Y. Lin (2016) Exploit all the layers: fast and accurate cnn object detector with scale dependent pooling and cascaded rejection classifiers. In CVPR, pp. 2129–2137. Cited by: §4.
  • [64] Y. You, Y. Lou, Q. Liu, Y. Tai, W. Wang, L. Ma, and C. Lu (2018) Prin: pointwise rotation-invariant network. arXiv preprint arXiv:1811.09361. Cited by: §1.
  • [65] F. Yu, W. Li, Q. Li, Y. Liu, X. Shi, and J. Yan (2016) Poi: multiple object tracking with high performance detection and appearance feature. In ECCV, pp. 36–42. Cited by: item 1, §1, §4, Table 1, Table 2.
  • [66] L. Zhang, Y. Li, and R. Nevatia (2008) Global data association for multi-object tracking using network flows. In CVPR, pp. 1–8. Cited by: §2.
  • [67] Z. Zhang, D. Cheng, X. Zhu, S. Lin, and J. Dai (2018) Integrated object detection and tracking with tracklet-conditioned detection. arXiv preprint arXiv:1811.11167. Cited by: item 1, §2.
  • [68] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In ECCV, pp. 803–818. Cited by: §1.
  • [69] Z. Zhou, J. Xing, M. Zhang, and W. Hu (2018) Online multi-target tracking with tensor-based high-order graph matching. In ICPR, pp. 1809–1814. Cited by: Table 2.
  • [70] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang (2018) Online multi-object tracking with dual matching attention networks. In ECCV, pp. 366–382. Cited by: §2.