Continuity, Stability, and Integration: Novel Tracking-Based Perspectives for Temporal Object Detection

12/22/2019 ∙ by Xingyu Chen, et al. ∙ 7

Video object detection (VID) has been vigorously studied for years, but almost all literature adopts a static accuracy-based evaluation, i.e., mean average precision (mAP). From a temporal perspective, recall continuity and localization stability are as important as accuracy, but the mAP is insufficient to reflect detectors' performance across time. In this paper, non-reference assessments are proposed for continuity and stability based on tubelets from multi-object tracking (MOT). These temporal evaluations can serve as supplements to static mAP. Further, we develop tubelet refinement for improving detectors' performance on temporal continuity and stability through short tubelet suppression, fragment filling, and history-present fusion. In addition, we propose a small-overlap suppression to extend VID methods to the single object tracking (SOT) task. The VID-based SOT does not need an MOT or a traditional SOT model. A unified VID-MOT-SOT framework is then formed. Extensive experiments are conducted on the ImageNet VID dataset, where the proposed approaches are validated. Codes will be publicly available.




1 Introduction


Temporal object detection is one of the main areas of research in computer vision [1]. Over recent years, we have witnessed the development of VID. As the exclusive assessment, mean average precision (mAP) has grown from less than [4] to over [6].

Problem & Motivation. Different from images, video frames are interrelated on time series. Thus, almost all VID methods leverage across-time information to improve detection performance [9, 10, 5, 6, 7, 18, 19, 10, 11, 12, 13, 16, 14, 15], but their metric mAP only statically reflects image-based accuracy, failing in across-time evaluation. We deem that the mAP would neglect some important features in VID.

Figure 1: Defective detection and tracking cases. (a) Short tubelets (yellow tubelet length is 1 while the length of green one is 2); (b) tubelet fragments incur identity switch; (c,d) box center/size jitter (green dashed box denotes previous results); (e) siamese tracker needs an initial box from VID and can hardly be aware of tracking drift and failure (tracking score is shown in the left-top).

As shown in Figure 1 (a)–(d), besides the false positives/negatives captured by mAP, four cases of defective temporal detection can be grouped into two aspects:

  • Recall continuity: As shown in Figure 1 (a), transient object recall induces short tubelets, which last only a few frames. As delineated in Figure 1 (b), intermittent object missing forms tubelet fragments, incurring identity switches. We deem that these phenomena damage recall continuity in VID.

  • Localization stability: As shown in Figure 1 (c,d), box center/size jitter frequently appears in modern object detection, and slight pixel-level change could incur considerable location jitter. It is conceived that this phenomenon impairs localization stability in VID.

These two problems have drawn little attention from researchers since the static evaluation system can hardly reflect them. That is, the mAP can hardly describe detection continuity and stability. On one hand, although it reports object recall/missing from a spatial perspective, the mAP is insufficient for trajectory analysis. On the other hand, the mAP can hardly report localization jitter since it only considers intersection-over-union (IoU) between prediction and ground truth with a fixed threshold.

The problems analyzed above inspire us to develop a new temporal evaluation system for VID. Because both continuity and stability are totally unrelated to man-made labels, we suggest a non-reference manner for evaluation so that the approach can be applied to any scene (including but not limited to datasets). As a highly relevant task, multi-object tracking attempts to associate detection boxes across time [2], so that detection tubelets are produced. The tubelet from MOT exactly reflects VID performance across time, but there has been very limited published research on MOT-based VID analysis. This motivates us to re-investigate VID approaches and generally evaluate and enhance detectors’ continuity/stability based on MOT tubelets.

VID-MOT deals with multiple tubelets, but there is an equally imperative need of single object tracking (SOT) [3]. If VID-MOT is directly leveraged for SOT, a large amount of redundant computing would incur high time consumption. Additionally, researchers tend to exploit similarity-based SOT methods [25, 26, 27, 28]. However, as shown in Figure 1 (e), similarity-based SOT is independent from VID and has three-fold drawbacks: 1) It cannot work without initial box from VID [29]; 2) it can hardly capture tracking failure with its similarity score; 3) VID-SOT cascade would highly increase model complexity. Hence, these situations inspire us to extend VID methods for the SOT task.

Our Work. In this paper, we firstly study important yet less explored aspects in temporal detection, i.e., continuity and stability. We pose transient and intermittent object recall/missing as a problem of recall continuity, while box center/size jitter is formulated as an issue of localization stability. Further, we propose novel non-reference assessments based on MOT for these two problems. To this end, we modify MOT pipeline to capture object missing (or, recall failure) and design a Fourier approach for stability evaluation. In addition, tubelet refinement, including short tubelet suppression, fragment filling, and history-present fusion, is proposed to enhance VID continuity and stability. The tubelet refinement can be generally applied to any detector in VID task. Subsequently, we propose a small-overlap suppression (SOS) for extending VID approaches to the SOT task. Compared to VID task, the SOS is able to induce a faster inference speed for SOT task. Compared to similarity-based SOT, our VID-based SOT has advantages on free initialization and awareness of the tracking state. A unified framework for VID, MOT, and SOT tasks is finally formed. Our contributions are summarized as follows:

  • Two VID problems are novelly analyzed from the tubelet perspective, i.e., continuity and stability, then non-reference assessments are proposed for them. Temporal continuity and stability are complementary to static accuracy, and our assessments can make up the deficiency of traditional accuracy-based evaluation.

  • We develop tubelet refinement to generally improve VID continuity/stability. Besides, we discuss how to solve these problems from the detector itself to inspire further work.

  • We propose an SOS to extend VID approaches to the SOT task without the requirement of an MOT or similarity-based SOT module. Moreover, our SOS is able to speed up VID methods towards SOT. Then, a unified framework is formed that can simultaneously perform VID, MOT, and SOT tasks.

2 Related Work

Temporal Object Detection. There are manifold ideas of temporal detection, including 1) post-processing [9], 2) tracking-based location [10, 5, 6, 7], 3) feature aggregation [10, 11, 12, 13, 16, 14, 15], 4) batch-frame processing (i.e., tubelets proposal) [17], 5) temporally sustained proposal [18, 19]. All these ideas are attractive in that they can leverage temporal information in their characteristic manners.

All the above methods only pursued high accuracy and followed mAP evaluation. However, detection continuity and stability are equally important in temporal cases. The mAP only considers static accuracy based on detection recall rate and precision [8], and thus, it can hardly give a temporal evaluation for VID methods. Zhang and Wang [30] proposed evaluation metrics for VID stability and proved that the stability metric has a low correlation with mAP. In detail, they formulated the stability problem as fragment error, center position error, and scale/ratio error. Their work was impressive but had two limitations: 1) they ignored the problem of short tubelets, which is also an important situation in recall continuity; 2) their evaluation needed ground truth boxes, hampering their metrics from extensive applications. Conversely, we address these limitations and propose non-reference assessments without the need of man-made labels.

Multi-Object Tracking. MOT associates detected boxes across time. For example, Bewley et al. leveraged Kalman and Hungarian methods for fast MOT [20]. Xiang et al. formulated MOT as a Markov decision process instead of similarity matching [21]; Ning et al. and Lu et al. exploited RNN for sequential modeling and constructed MOT with LSTM [22, 23]. Almost all these methods are based on VID, but the effect of MOT for VID is usually ignored. Analytically, we report the MOT-based VID evaluation and propose the tubelet refinement to improve VID performance.

Referring to [24], MOT metrics include multi-object tracking accuracy (MOTA) and precision (MOTP). The MOTA synthesizes false positive, false negative and identity switch of detected objects while the MOTP only considers the static localization precision. Therefore, only trajectory fragment is captured by identity switch, and other tubelet characteristics (e.g., box jitter) are ignored. Further, we simultaneously propose the problems of recall continuity and localization stability for comprehensive evaluation.

Single Object Tracking. SOT aims at localizing an object based on similarity in a video. Henriques et al. exploited kernelized correlation filters with discrete Fourier transform for high-speed tracking [25]. Based on the siamese network, Bertinetto et al. designed a fully convolutional tracker that posed the problem of tracking as template matching [26]. Further, Li et al. proposed SiamRPN to ally siamese network with region proposal [27]. Wang et al. developed SiamMask by adding a mask branch into SiamRPN for rotated boxes [28]. However, these similarity-based methods suffer from problems as analyzed in Section 1.

The VOT benchmark uses expected average overlap (EAO) for SOT evaluation [3], including accuracy and robustness. Accuracy is determined by static IoU, and the robustness describes tracking failure. After each failure captured by the evaluation process, the tracker will be initialized with ground truth. EAO only describes tracking fragments, but it cannot give a comprehensive evaluation for multi-tubelets.

Detection-Tracking Cascade. Detection and tracking are the same task in nature, but their pipelines are distinct, so researchers tried to simultaneously leverage their advantages. Kang et al. used a tracking algorithm to re-score detection results with surrounding tubelets [10]. Kim et al. combined a detector, a forward tracker, and a backward tracker to perform tracking-detection refinement [5]. Feichtenhofer et al. simultaneously leveraged a two-stage detector and a correlation filter to boost VID recall rate [6]. Luo et al. formulated detection and tracking as a sequential decision problem, processing each frame by either a siamese tracker or a detector [7]. These methods complementally improved tracking and detection, but their SOT model and detector are independent so that high model complexity is usually incurred. Instead of model cascade, we design an SOS to extend VID methods towards the SOT task.

3 Approach

3.1 Non-Reference Assessments

Our assessments follow a reasonable assumption, i.e., object motion is smooth across time without high-frequency location jitter or change of existence. We leverage a detector and an MOT module to recall all object tubelets in a video. In detail, any detector can be used in this paper, and we modify a concise MOT tracker with IoU similarity [16] to associate detected boxes. Unlike label-based accuracy evaluation, we only focus on detected tubelets, because totally missed tubelets do not impact continuity and stability.

As delineated in Figure 2, a detector locates and classifies objects at each frame, generating a set of detections that describe the objects. Each object has a confidence score, a box center, and a box size. Tubelets are produced after the whole video is processed by VID and MOT, and each tubelet has its own duration within the video duration. “location” in Figure 2 represents .

Figure 2: Problem formation. Tubelets from VID could suffer from short duration (e.g., ), fragments (e.g., ), location jitter (e.g., ), all of which can hardly be captured by mAP.

Recall Continuity. Detection tubelets should have an appropriate duration without interruption, and transient/intermittent object recall/missing damages recall continuity. As for this problem, we consider the impact of short tubelets and fragments. Referring to Figure 1 (a) and in Figure 2, tubelets with short duration frequently appear in VID. To capture this situation, we design extremely short tubelet error (ESTE) and short tubelet error (STE) with various duration threshold as follows:


If is true, then returns ; Otherwise, it returns . In this paper, , which describe different degree of short tubelet duration.

Figure 1 (b) and in Figure 2 describe fragment problem. Some MOT algorithms end a tubelet after a recall failure. Conversely, we count the number of continuous recall failure with , and leave a -frame life duration for each tubelet. That is, if a match-failed tubelet is re-matched by a box in consequent frames, the tubelet can be kept. In this manner, the total number of object missing in the whole tubelet can be captured, forming fragment error (FE) and fragmental tubelet ratio (FTR) as follows:


FE describes the ratio of object missing while FTR gives the ratio of faulty tubelets. They are complementary for the fragment problem, and a better VID result should have lower FE and FTR at the same time. That is, there is a small number of object missing, and object missing is concentrated in a small number of tubelets. Note that these calculations are numerically small, so a log transformation is used to enhance contrast, i.e., , where represents ESTE, STE, FE, or FTR.
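The continuity ratios described above can be illustrated with a minimal sketch. It is not the paper's exact formulation: the per-frame boolean representation of a tubelet, the function name `continuity_metrics`, the threshold parameters `este_len` and `ste_len`, and the use of `<=` are all assumptions made for illustration. The log transformation is left out because its exact form is not given in this text.

```python
from typing import Dict, Sequence


def continuity_metrics(tubelets: Sequence[Sequence[bool]],
                       este_len: int, ste_len: int) -> Dict[str, float]:
    """Illustrative continuity ratios (assumed forms, not the paper's formulas).

    Each tubelet is a per-frame record over its lifetime:
    True = matched by a detection, False = recall failure (object missing).
    """
    if not tubelets or any(len(t) == 0 for t in tubelets):
        raise ValueError("need at least one non-empty tubelet")
    n = len(tubelets)
    durations = [len(t) for t in tubelets]
    misses = [sum(1 for hit in t if not hit) for t in tubelets]
    return {
        # Share of tubelets no longer than each (hypothetical) duration threshold.
        "ESTE": sum(d <= este_len for d in durations) / n,
        "STE": sum(d <= ste_len for d in durations) / n,
        # Share of frames, over all tubelets, where the object was missing.
        "FE": sum(misses) / sum(durations),
        # Share of tubelets that contain at least one missing frame.
        "FTR": sum(m > 0 for m in misses) / n,
    }


if __name__ == "__main__":
    demo = [
        [True] * 10,                       # long, unbroken tubelet
        [True, False, False, True, True],  # fragmented tubelet
        [True],                            # extremely short tubelet
        [True, True],                      # short tubelet
    ]
    print(continuity_metrics(demo, este_len=1, ste_len=2))
```

In this toy input, one fragmented tubelet raises both FE and FTR, while the two short tubelets only affect ESTE and STE, which mirrors the complementary roles the text assigns to the four quantities.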

Localization Stability. Detection tubelets should be smooth in localization, and box center/size jitter damages localization stability (see Figure 1 (c,d) and in Figure 2). We evaluate temporal stability in Fourier domain so that our approach can work without labels. Time domain data can be transformed to Fourier domain by , where represents . Thus, contains frequency information of , and we extract frequency-related amplitude with . Note that each tubelet produces different frequency component because of variable data length (i.e., tubelet duration). That is, , where is the frequency set; is tubelet duration; and denotes frequency-related amplitude. Based on Fourier analysis, center jitter error (CJE) and size jitter error (SJE) are designed as follows:


where starts from 1 since . Object size is usually described by scale (i.e., ) and aspect ratio (i.e., ). However, SJE directly analyzes and , which has an equal ability to describe object size. Moreover, our decoupled SJE is able to definitely describe stability in both directions.
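As a rough illustration of label-free stability scoring in the Fourier domain, the sketch below computes DFT amplitudes of one coordinate sequence of a tubelet and averages the non-DC bins. The averaging rule and the function names `dft_amplitudes` and `jitter_score` are assumptions; the normalization used by CJE and SJE in the paper is not reproduced here. Under this reading, the score would be applied to the center coordinates for a CJE-like value and to width and height for an SJE-like value.

```python
import cmath
import math
from typing import List, Sequence


def dft_amplitudes(signal: Sequence[float]) -> List[float]:
    """Amplitudes of DFT bins k = 0 .. n//2 (naive O(n^2) transform)."""
    n = len(signal)
    amps = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        amps.append(abs(coeff) / n)
    return amps


def jitter_score(signal: Sequence[float]) -> float:
    """Mean amplitude over non-DC bins (assumed scoring rule).

    Bin 0 is skipped because it only carries the mean position,
    which says nothing about frame-to-frame jitter.
    """
    amps = dft_amplitudes(signal)
    if len(amps) < 2:
        return 0.0
    return sum(amps[1:]) / (len(amps) - 1)


if __name__ == "__main__":
    steady = [100.0] * 16                                  # box center does not move
    shaky = [100.0 + 2.0 * (-1) ** t for t in range(16)]   # +/- 2 px alternation
    print(jitter_score(steady), jitter_score(shaky))
```

A static box yields a score near zero, while frame-to-frame alternation concentrates energy in the highest-frequency bin and raises the score, without any ground-truth box being involved.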

3.2 Tubelet Refinement

For enhancing recall continuity and localization stability, we use tubelets to refine VID results and report boxes in tubelets. We add a new attribute to each tubelet, namely, matching length . Therefore, a tubelet can be formulated as , where has been explained in Section 3.1. denotes the tubelet identity, and records the number of times that the tubelet is matched by boxes. is the object set in the tubelet (i.e., ), and the length of (i.e., ) cannot exceed in this paper. That is, if , only the latest boxes are preserved in the .

Short Tubelet Suppression. Aiming at solving the problem of short tubelets and enhancing ESTE/STE, we define a tubelet as reliable tubelet if , then boxes in reliable tubelets are reported. This manner is beneficial to continuity, and as for accuracy, it has a two-fold effect. First, false positives could be suppressed, because they are usually inconsecutive across time so that a reliable tubelet is hard to form. Second, false negatives could be produced since an object would not be reported until it is continuously detected over times.

Fragment Filling. In terms of the fragment issue and FE/FTR, we make up boxes in a tubelet based on a reasonable assumption, i.e., the object motion is uniform in an extremely short duration (e.g., ). When a tubelet suffers from a recall failure at the th frame, its previous boxes can be used to predict current location. In detail, we first estimate the velocity, i.e., , then the current location can be given as , where denotes .
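The uniform-motion idea behind fragment filling can be sketched as follows. This is an illustration under stated assumptions rather than the paper's formula: the box layout `(cx, cy, w, h)`, the function name `fill_missing_box`, and estimating velocity as the mean per-frame change across the stored history are all hypothetical choices.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (cx, cy, w, h)


def fill_missing_box(history: Sequence[Box]) -> Box:
    """Predict the box for a missed frame from the most recent boxes.

    Assumes uniform motion over the short history window: the velocity is
    the mean per-frame change between the oldest and newest stored box.
    """
    if not history:
        raise ValueError("history must contain at least one box")
    if len(history) == 1:
        return history[-1]  # no motion estimate available; repeat the box
    steps = len(history) - 1
    velocity = [(new - old) / steps for old, new in zip(history[0], history[-1])]
    last = history[-1]
    return (last[0] + velocity[0], last[1] + velocity[1],
            last[2] + velocity[2], last[3] + velocity[3])


if __name__ == "__main__":
    recent = [(10.0, 20.0, 30.0, 40.0),
              (12.0, 20.0, 30.0, 40.0),
              (14.0, 20.0, 30.0, 40.0)]
    print(fill_missing_box(recent))
```

Keeping the history window short matters here: the uniform-motion assumption only holds over a few frames, so a long window would blur genuine changes of direction.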

History-Present Fusion. When it comes to location stability and CJE/SJE, we first add the detected box into the tubelet, then produce a new location with a weighted average. Concretely, a geometric progression is constructed with , where is an arithmetic progression, whose length is equal to . The normalized is utilized to multiply with , and the current location can be formulated as .
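A minimal sketch of fusing the current detection with stored history through normalized geometric weights is given below. The parameter name `ratio`, its range, and the convention that the newest box receives the largest weight are assumptions made for illustration; the paper's own symbols and exact weighting are not recoverable from this text.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (cx, cy, w, h)


def fuse_with_history(boxes: Sequence[Box], ratio: float) -> Box:
    """Weighted average of stored boxes; boxes[-1] is the current detection.

    Weights follow a geometric progression ratio**k, where the exponent k is
    0 for the newest box and grows by one per frame into the past
    (an assumed convention). Weights are normalized to sum to one.
    """
    if not boxes:
        raise ValueError("boxes must not be empty")
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio is expected in (0, 1]")
    n = len(boxes)
    weights = [ratio ** (n - 1 - i) for i in range(n)]
    total = sum(weights)
    fused = [sum(w * b[j] for w, b in zip(weights, boxes)) / total
             for j in range(4)]
    return (fused[0], fused[1], fused[2], fused[3])


if __name__ == "__main__":
    stored = [(0.0, 0.0, 10.0, 10.0), (4.0, 0.0, 10.0, 10.0)]
    print(fuse_with_history(stored, ratio=0.5))  # leans toward the newest box
    print(fuse_with_history(stored, ratio=1.0))  # plain mean of the history
```

A small `ratio` trusts the current detection and smooths little; a value of 1 averages all stored boxes equally, which smooths strongly but lags behind real motion.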

3.3 Extending VID approaches to SOT task

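Input: boxes and their scores remaining after selection by detection score threshold; previous tracked box; SOS/NMS thresholds
Output: Tracked box, or a report of tracking failure
       // SOS based on IoU
       for each remaining box do
             if IoU(box, previous tracked box) is smaller than the SOS threshold then
                   remove the box and its score
       // Inspection of tracking failure
       if no box remains then
             report tracking failure and stop
       // NMS based on detection score
       while unprocessed boxes remain do
             move the unprocessed box with the highest score to the kept set
             for each other unprocessed box do
                   if IoU(box, kept box) is larger than the NMS threshold then
                         remove the box and its score
       // SOT based on IoU
       return the kept box with the maximum IoU against the previous tracked box
Algorithm 1 SOS-NMS for SOT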

Small-Overlap Suppression. We promote VID model to generate SOT result by propagating previous location information before non-maximum suppression (NMS). Taking inspiration from NMS, we leverage IoU-based suppression to this end. Referring to Algorithm 1, after selection by score threshold, IoU between predicted boxes and is calculated, then predicted boxes with small IoU (e.g., ) are suppressed. Next, tracking failure would be reported if all boxes are suppressed by SOS. Subsequently, NMS is conducted for remaining boxes. Finally, we select a box with IoU maximum as current SOT result . Compared to the manner of IoU-based re-scoring, the SOS does not affect detection scores. In our opinion, detection score and IoU are two different aspects for tracking, where detection score describes object attribute while IoU reports object motion, so it is more reasonable to use them independently. Thereby, the SOS-NMS framework adopts a tracking manner based on alternating detection score and IoU, i.e., 1) discarding obviously category-incorrect boxes with score threshold; 2) discarding obviously motion-incorrect boxes using IoU threshold, 3) discarding boxes without local maximum of detection score, and 4) generating single object location with maximal IoU. Note that the SOS-NMS should have speed advantage than NMS since a vast number of predicted boxes are suppressed by computationally efficient SOS.
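The four steps described above (SOS, failure inspection, NMS, and selection by maximal IoU) can be sketched in a few lines. This is an illustrative implementation under assumptions, not the authors' code: the corner box layout `(x1, y1, x2, y2)`, the names `iou` and `sos_nms`, the strictness of the threshold comparisons, and returning `None` to signal tracking failure are all choices made here. Score thresholding is assumed to have happened already.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def sos_nms(boxes: Sequence[Box], scores: Sequence[float], prev_box: Box,
            sos_thresh: float, nms_thresh: float) -> Optional[Box]:
    """Return the tracked box, or None when tracking failure is detected."""
    # 1) SOS: drop boxes that overlap too little with the previous tracked box.
    pending = [(s, b) for s, b in zip(scores, boxes)
               if iou(b, prev_box) >= sos_thresh]
    # 2) Tracking failure: every candidate was suppressed.
    if not pending:
        return None
    # 3) Greedy NMS on detection score; scores themselves are left untouched.
    pending.sort(key=lambda sb: sb[0], reverse=True)
    kept: List[Box] = []
    while pending:
        _, best = pending.pop(0)
        kept.append(best)
        pending = [sb for sb in pending if iou(sb[1], best) <= nms_thresh]
    # 4) SOT result: the surviving box with maximal IoU to the previous box.
    return max(kept, key=lambda b: iou(b, prev_box))


if __name__ == "__main__":
    prev = (0.0, 0.0, 10.0, 10.0)
    candidates = [(1.0, 1.0, 11.0, 11.0), (1.5, 1.5, 11.5, 11.5), (50.0, 50.0, 60.0, 60.0)]
    candidate_scores = [0.9, 0.8, 0.95]
    print(sos_nms(candidates, candidate_scores, prev, sos_thresh=0.3, nms_thresh=0.5))
```

Note that the highest-scoring candidate in the demo is far from the previous box and is removed by SOS before NMS runs, which is where the claimed speed advantage would come from: NMS only sees the few boxes that are plausible in terms of motion.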

The VID-based SOT has three-fold advantages: 1) There is no need for a man-made initial location; 2) detection score is more reliable than similarity score because of semantic classification, and VID-based SOT can be aware of the tracker state (e.g., tracking drift or failure); 3) complex model cascade is avoided.

Unified VID-MOT-SOT Framework. As shown in Figure 3, we design a unified VID-MOT-SOT framework, where only the VID network is adopted. We first define the conditions of the MOT-SOT switch: 1) the framework initially performs MOT; 2) the framework switches to SOT mode when a reliable tubelet (i.e., as claimed in Section 3.2) is detected, and tracks this reliable tubelet. If there are several reliable tubelets, only one is selected according to detection score; 3) the framework conducts MOT after it encounters SOT failure. Under these conditions, we design an MOT branch (including NMS, data association, and tubelet refinement) and an SOT branch (containing SOS, NMS, tubelet update, and tubelet refinement), where the tubelet refinement for the SOT branch only processes the tracked tubelet. In this framework, the SOT branch is faster than the MOT branch because 1) SOS is able to greatly reduce NMS’s computation time; 2) data association in the MOT branch is usually time-consuming. Furthermore, our proposed framework can be easily extended to 2-object tracking, 3-object tracking, and so on.
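The three switching conditions can be read as a two-state machine, sketched below. The names `Mode`, `next_mode`, and `pick_target`, the dictionary fields `matches` and `score`, and the reliability test based on a hypothetical `min_matches` threshold are assumptions for illustration only.

```python
from enum import Enum
from typing import Dict, Optional, Sequence


class Mode(Enum):
    MOT = "mot"
    SOT = "sot"


def next_mode(mode: Mode, reliable_tubelet_found: bool, sot_failed: bool) -> Mode:
    """Apply the MOT-SOT switching conditions for one frame."""
    if mode is Mode.MOT and reliable_tubelet_found:
        return Mode.SOT  # condition 2: start tracking a reliable tubelet
    if mode is Mode.SOT and sot_failed:
        return Mode.MOT  # condition 3: fall back to MOT for object search
    return mode          # otherwise stay in the current branch


def pick_target(tubelets: Sequence[Dict[str, float]],
                min_matches: int) -> Optional[Dict[str, float]]:
    """Among reliable tubelets, choose the one with the highest detection score.

    Reliability is assumed to mean 'matched more than min_matches times'.
    """
    reliable = [t for t in tubelets if t["matches"] > min_matches]
    return max(reliable, key=lambda t: t["score"]) if reliable else None


if __name__ == "__main__":
    mode = Mode.MOT  # condition 1: the framework starts in MOT mode
    tracks = [{"id": 1, "matches": 2, "score": 0.9},
              {"id": 2, "matches": 8, "score": 0.7},
              {"id": 3, "matches": 9, "score": 0.8}]
    target = pick_target(tracks, min_matches=5)
    mode = next_mode(mode, reliable_tubelet_found=target is not None, sot_failed=False)
    print(mode, target)
```

Extending this to 2-object or 3-object tracking, as the text suggests, would amount to keeping a small list of targets instead of a single one.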

Figure 3: The unified VID-MOT-SOT framework with our proposed SOS and tubelet refinement.

This framework has potential in applications. For example, when a mobile robot tries to grasp all objects in an area, the MOT branch should firstly work for object search, i.e., perception of the object group. After a reliable tubelet is detected, the SOT branch should work for elaborate perception of the object instance. At this time, other perception manners, e.g., mask, depth, etc., can be added for more object instance information to guide the grasping of this object. If the target encounters tracking failure, our proposed framework can immediately capture this situation and give a convenient way to switch to the MOT branch for object search.

Figure 4: Tubelet visualization for SSD. (a) The video snippet; (b) original tubelets; (c) refined tubelets; (d) original Fourier results; (e) Fourier results after tubelet refinement. Colors differentiate and “” denotes object missing at a time-stamp. The unit of the horizontal axis of (b,c) is frame while that of (d,e) is Hz. The axes of (d,e) is and (defined in Section 3.1).


Method           mAP     Recall continuity                 Localization stability
                         ESTE    STE     FE      FTR       CJE     SJE
w/o tubelet refinement
Static methods
SSD [31]         0.630   0.062   0.234   0.320   0.246     0.242   0.334
RetinaNet [32]   0.656   0.060   0.250   0.350   0.283     0.236   0.317
RefineDet [33]   0.669   0.126   0.350   0.391   0.306     0.257   0.362
DRN [34]         0.694   0.114   0.330   0.389   0.312     0.248   0.346
Temporal methods
TRN [19]         0.665   0.120   0.334   0.375   0.265     0.252   0.346
TDRN [19]        0.673   0.116   0.345   0.388   0.297     0.247   0.360
TSSD [16]        0.654   0.059   0.206   0.257   0.240     0.210   0.253
w/ tubelet refinement
SSD              -       0.003   0.026   0.0     0.0       0.169   0.208
RetinaNet        -       0.003   0.023   0.0     0.0       0.168   0.204
RefineDet        -       0.004   0.037   0.0     0.0       0.173   0.212
DRN              -       0.003   0.036   0.0     0.0       0.172   0.208
TRN              -       0.003   0.030   0.0     0.0       0.171   0.209
TDRN             -       0.004   0.031   0.0     0.0       0.170   0.218
TSSD             -       0.003   0.029   0.0     0.0       0.159   0.180


Table 1: Continuity and stability evaluation of several existing detectors based on the proposed non-reference metrics.

4 Experiments and Analysis

We analyze real-time online detectors for our evaluation, i.e., SSD [31], RetinaNet [32], RefineDet [33], DRN [34], TSSD [16], TRN, and TDRN [19]. The first four are static detectors while the last three are temporal methods. SSD directly detects objects from anchors in a single-stage manner [31]. RetinaNet adopts feature pyramid networks to enhance the shallow-layer receptive field [32]. RefineDet introduces a two-step regression to the single-stage pipeline [33], and DRN proposes anchor-offset detection (including anchor refinement and feature location refinement) to perform joint anchor-feature refinement [34]. Referring to Section 2, there are 5 types of VID approach, but post-processing and tracking-based methods actually adopt static detectors, and batch-frame approaches can hardly work in real-world scenes, so we only analyze the methods based on feature aggregation and temporally sustained proposal. For feature aggregation, TSSD uses attentional-LSTM for propagating visual features across time [16]. As temporally sustained proposal approaches, TRN and TDRN propagate refined anchors and feature offsets across time for an accuracy vs. speed trade-off [19]. All these detectors are trained and evaluated on the ImageNet VID dataset [1] (including videos in the validation set). Both the detection score threshold and the NMS threshold are fixed as 0.5.

4.1 Analysis on VID Continuity/Stability

Visualization of Detection Tubelets. As shown in Figure 4, we use a VID case (see Figure 4 (a)) with 9 tubelets to visualize SSD detection. Referring to Figure 4 (b), SSD suffers from serious continuity and stability problems. At the beginning of this video, a vast number of object missing (i.e., on curves) and short tubelet (i.e., short curves and scattered points) appear due to motion blur. Then, these continuity problems appear again at the end of the video owing to occlusion. For localization stability, Figure 4 (d) plots the amplitude of high-frequency component ( Hz) in Fourier domain. Numerically, for this video snippet.

Based on our tubelet refinement, we refine SSD results from the perspective of tubelets. As shown in Figure 4 (c), tubelet refinement eliminates all short tubelets and fragments in the refined results and smooths each tubelet curve. In addition, referring to Figure 4 (e), high-frequency components in the Fourier domain are suppressed to some degree. Finally, we can obtain .

Numerical Evaluation. Referring to Table 1, detectors are evaluated with our proposed non-reference assessments on the VID validation set. In terms of accuracy, the best method is DRN, whose mAP is 0.694. However, there is little correlation among accuracy, continuity, and stability. From static SSD and RetinaNet, we observe that the FPN structure improves localization stability because of spatial feature fusion. However, as for continuity, RetinaNet performs worse than SSD since more hard objects can be detected by RetinaNet. That is, detecting hard objects (e.g., small objects) can easily produce continuity problems, and SSD is likely to completely miss them because of low detection accuracy. Besides, the comparison between RefineDet and RetinaNet indicates that anchor refinement improves accuracy. In the meantime, RefineDet induces more serious problems on continuity and stability, because anchor-feature mis-alignment is exacerbated by anchor refinement. Finally, DRN conducts joint anchor-feature refinement, and thus, it performs better than RefineDet on almost all metrics.

Figure 5: Plot of vs. CJE/SJE. shows the results without history-present fusion.

For temporal approaches, we draw readers’ attention to three detector pairs, i.e., TRN vs. RefineDet, TDRN vs. DRN, and TSSD vs. SSD, where the temporal detector is extended from the static one. For TRN vs. RefineDet and TDRN vs. DRN, temporal detectors obtain on par, sometimes even worse, performance compared with static approaches. Therefore, this design of temporally sustained proposal has an ignorable effect on detection continuity and stability. Nevertheless, TSSD performs better than SSD by a large margin on all metrics, which validates the effectiveness of this temporal feature aggregation. That is, TSSD can smooth visual features across time, and thus, produce more temporally consistent results.

In addition, with tubelet refinement, all metrics can be effectively improved for all tested approaches. For the fragment problem, tubelet refinement can totally eliminate it by filling up recall failures, yielding zero FE/FTR (see Table 1). In terms of ESTE, STE, CJE, and SJE, the tubelet refinement also produces substantially better results. Note that mAP cannot be reported with tubelet refinement, because MOT needs a relatively high detection score threshold for data association (i.e., in this paper).

We use a geometric progression to fuse current prediction with location history, so controls the degree of history-present fusion. We investigate , which induces decay ratios of across time. For example, when . As plotted in Figure 5, optimal ranges from to , where fusion ratio between current location and location history is more suitable for stability.

4.2 Unified VID-MOT-SOT

Figure 6: Plot of time consumption of NMS and SOS-NMS.

Comparison of NMS and SOS-NMS on Speed. This test is conducted on a workstation with an Intel 2.20 GHz Xeon(R) E5-2630 CPU. As plotted in Figure 6, the SOS considerably reduces the time consumption of box suppression. NMS takes  ms per frame, and SOS can highly reduce the box amount for NMS with temporal information. As a result, SOS-NMS time can be reduced to ms, and the NMS part in SOS-NMS only takes  ms when SOS threshold . Thereby, when performing SOT based on SOS, the VID model can achieve faster speed.

Figure 7: Comparison between VID-MOT-SOT and siamese SOT. If two boxes are highly overlapping, we slightly change their coordinates for better visualization.

Siamese SOT vs. Unified VID-MOT-SOT. No dataset is available to quantitatively compare VID-based SOT with similarity-based SOT, so we qualitatively compare our method and a siamese tracker [37] on VID. Firstly, the VID-MOT-SOT framework has the ability to exploit objects, i.e., VID can provide an initial localization for SOT. Then, siamese SOT is particularly susceptible to unconscious tracking drift (as delineated in Figure 7 (a)). Finally, referring to Figure 7 (b), the VID-MOT-SOT is able to capture tracking failure and conveniently restart the MOT module for object search (see the th snapshot). On the contrary, the siamese tracker always reports a similar region, ignoring the tracking state.

4.3 Discussions

Improvement of Continuity and Stability from the Detector Itself. This paper evaluates and enhances detection continuity and stability from the perspective of MOT. Additionally, our evaluation indicates that detection continuity and stability can benefit from feature aggregation. On one hand, spatial/scale smoothing is effective (see RetinaNet vs. SSD), and on the other hand, temporal fusion is more efficient (see TSSD vs. SSD). Thus, we advocate investigating more fusion approaches for improving continuity and stability from the detector itself.

Similarity-based SOT vs. VID-based SOT. VID-based SOT has advantages, but it also has a drawback: it cannot deal with unseen categories. We advocate solving this problem with a combination of online learning [35] and few-shot learning [36]. From another perspective, similarity-based SOT and VID-based SOT are complementary. That is, similarity-based SOT focuses on appearance similarity while the VID-based SOT depends on category attribute and motion information. Therefore, they can be reasonably combined for more accurate and robust tracking.

5 Conclusion and Future Work

From novel tracking-based perspectives, this paper deals with recall continuity, localization stability, VID-MOT-SOT unification for temporal object detection. First of all, we analyze temporal continuity and stability that can hardly be reflected by mAP, and propose non-reference assessments to evaluate them without the need of labels. The recall continuity is based on tubelet analysis and the localization stability is captured in the Fourier domain. Secondly, we design tubelet refinement to enhance detection continuity and stability by short tubelet suppression, fragment filling, and history-present fusion. Finally, an SOS is proposed to extend VID towards SOT, and a unified VID-MOT-SOT framework is developed. With ImageNet VID dataset, our proposed methods are verified. As a result, our non-reference assessments and tubelet refinement can deal with the problems of detection continuity and stability, and the SOS can conveniently extend VID methods to the SOT task.

In the future, we plan to enhance detection continuity and stability from the detector itself. We will further improve VID-based SOT by attention mechanism.


  • [1] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. Li, “ImageNet large scale visual recognition challenge,” Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, 2015.
  • [2] L. Leal-Taixe, A. Milan, I. Reid, S. Roth, and K. Schindler, “Motchallenge 2015: Towards a benchmark for multi-target tracking,” arXiv:1504.01942, 2015.
  • [3] M. Kristan, et al., “The sixth visual object tracking VOT-2018 challenge results,” in Proc. Eur. Conf. Comput. Vis. Workshops, Munich, Germany, Sept. 2018.
  • [4] K. Kang, W. Ouyang, H. Li, and X. Wang, “Object detection from video tubelets with convolutional neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Las Vegas, USA, Jun. 2016, pp. 817–825.
  • [5] H. U. Kim and C. S. Kim, “CDT: Cooperative detection and tracking for tracing multiple objects in video sequences,” in Proc. Eur. Conf. Comput. Vis., Amsterdam, Netherlands, Oct. 2016, pp. 851–867.
  • [6] C. Feichtenhofer, A. Pinz, and A. Zisserman, “Detect to track and track to detect,” in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 3038–3046.
  • [7] H. Luo, W. Xie, X. Wang, and W. Zeng, “Detect or track: Towards cost-effective video object detection/tracking,” in Proc. AAAI Conf. Artificial Intell., Honolulu, USA, Jul. 2019, pp. 8803–8810.
  • [8] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The pascal visual object classes (voc) challenge,” Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, 2010.
  • [9] W. Han, P. Khorrami, T. L. Paine, P. Ramachandran, M. Babaeizadeh, H. Shi, J. Li, S. Yan, and T. S. Huang, “Seq-NMS for video object detection,” arXiv:1602.08465, 2016.
  • [10] K. Kang et al., “T-CNN: Tubelets with convolutional neural networks for object detection from videos,” IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, pp. 2896–2907, 2018.
  • [11] X. Zhu, Y. Wang, J. Dai, L. Yuan, and Y. Wei, “Flow-guided feature aggregation for video object detection,” in Proc. Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 408–417.
  • [12] X. Zhu, J. Dai, L. Yuan, and Y. Wei, “Towards high performance video object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Salt Lake City, USA, Jun. 2018, pp. 7210–7218.
  • [13] G. Bertasius, L. Torresani, and J. Shi, “Object detection in video with spatiotemporal sampling networks,” in Proc. Eur. Conf. Comput. Vis., Munich, Germany, Sept. 2018, pp. 342–357.
  • [14] F. Xiao and Y. J. Lee, “Video object detection with an aligned spatial-temporal memory,” in Proc. Eur. Conf. Comput. Vis., Munich, Germany, Sept. 2018, pp. 494–510.
  • [15] M. Liu and M. Zhu, “Mobile video object detection with temporally-aware feature maps,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Salt Lake City, USA, Jun. 2018, pp. 5686–5695.
  • [16] X. Chen, J. Yu, and Z. Wu, “Temporally identity-aware SSD with attentional LSTM,” IEEE Trans. Cybern., doi:10.1109/TCYB.2019.2894261.
  • [17] K. Kang et al., “Object detection in videos with tubelet proposal networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Hawaii, USA, Jul. 2017, pp. 727–735.
  • [18] L. Galteri, L. Seidenari, M. Bertini, and A. Del Bimbo, “Spatio-temporal closed-loop object detection,” IEEE Trans. Image Process., vol. 26, no. 3, pp. 1253–1263, 2017.
  • [19] X. Chen, J. Yu, S. Kong, Z. Wu, and L. Wen, “Towards real-time accurate object detection in both images and videos based on dual refinement,” arXiv:1807.08638, 2018.
  • [20] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Proc. IEEE Int. Conf. Image Process., Phoenix, USA, Sep. 2016, pp. 3464–3468.
  • [21] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proc. IEEE Int. Conf. Comput. Vis., Santiago, Chile, Dec. 2015, pp. 4705–4713.
  • [22] Y. Lu, C. Lu, and C. K. Tang, “Online video object detection using association LSTM,” in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 2344–2352.
  • [23] G. Ning, Z. Zhang, C. Huang, X. Ren, H. Wang, C. Cai, and Z. He, “Spatially supervised recurrent convolutional neural networks for visual object tracking,” in Proc. IEEE Int. Symp. Circuits Syst., Baltimore, USA, May 2017, pp. 1–4.
  • [24] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: The CLEAR MOT metrics,” EURASIP J. Image Video Process., vol. 1, pp. 1–10, 2008.
  • [25] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 583–596, 2014.
  • [26] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Proc. Eur. Conf. Comput. Vis., Amsterdam, Netherlands, Oct. 2016, pp. 850–865.
  • [27] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Salt Lake City, USA, Jun. 2018, pp. 8971–8980.
  • [28] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, P. H. Torr, “Fast online object tracking and segmentation: A unifying approach,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Long Beach, USA, Jun. 2019, pp. 1328–1338.
  • [29] L. Pang, Z. Cao, J. Yu, P. Guan, X. Rong, H. Chai, “A visual leader-following approach with a TDR framework for quadruped robots,” IEEE Trans. on Syst., Man, and Cybern. Syst., doi: 10.1109/TSMC.2019.2912715, 2019.
  • [30] H. Zhang, N. Wang, “On the stability of video detection and tracking,” arXiv:1611.06467, 2016.
  • [31] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in Proc. Eur. Conf. Comput. Vis., Amsterdam, Netherlands, Oct. 2016, pp. 21–37.
  • [32] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in Proc. IEEE Int. Conf. Comput. Vis., Venice, Italy, Oct. 2017, pp. 2980–2988.
  • [33] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-shot refinement neural network for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Salt Lake City, USA, Jun. 2018, pp. 4203–4212.
  • [34] X. Chen, X. Yang, S. Kong, Z. Wu, and J. Yu, “Dual refinement network for single-shot object detection,” in Proc. Int. Conf. Robot. Autom., Montreal, Canada, May 2019, pp. 8305–8310.
  • [35] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM: Accurate tracking by overlap maximization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Long Beach, USA, Jun. 2019, pp. 4660–4669.
  • [36] A. Li, T. Luo, Z. Lu, T. Xiang, and L. Wang, “Large-scale few-shot learning: Knowledge transfer with class hierarchy,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognition, Long Beach, USA, Jun. 2019, pp. 7212–7220.
  • [37] Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines,” arXiv:1911.06188, 2019.