Detect or Track: Towards Cost-Effective Video Object Detection/Tracking

State-of-the-art object detectors and trackers are developing fast. Trackers are in general more efficient than detectors but bear the risk of drifting. This raises a question: how can the accuracy of video object detection/tracking be improved by utilizing existing detectors and trackers within a given time budget? A baseline is frame skipping: detecting every N-th frame and tracking for the frames in between. This baseline, however, is suboptimal since the detection frequency should depend on the tracking quality. To this end, we propose a scheduler network, which determines whether to detect or track at a certain frame, as a generalization of Siamese trackers. Although light-weight and simple in structure, the scheduler network is more effective than the frame skipping baselines and flow-based approaches, as validated on the ImageNet VID dataset in video object detection/tracking.




Convolutional neural network (CNN)-based methods have achieved significant progress in computer vision tasks such as object detection [Ren et al.2015, Liu et al.2016, Dai et al.2016, Tang et al.2018b] and tracking [Held, Thrun, and Savarese2016, Bertinetto et al.2016, Nam and Han2016, Bhat et al.2018]. Following the tracking-by-detection paradigm, most state-of-the-art trackers can be viewed as a local detector of a specified object. Consequently, trackers are generally more efficient than detectors and can obtain precise bounding boxes in subsequent frames if the specified bounding box is accurate. However, as commonly evaluated on benchmark datasets such as OTB [Wu, Lim, and Yang2015] and VOT [Kristan et al.2017], trackers are encouraged to track as long as possible. It is non-trivial to stop a tracker once it is no longer confident, although heuristics, such as a threshold on the maximum response value, can be applied. Therefore, trackers bear the risk of drifting.

Besides object detection and tracking, there have been recently a series of studies on video object detection [Kang et al.2016, Kang et al.2017, Feichtenhofer, Pinz, and Zisserman2017, Zhu et al.2017b, Zhu et al.2017a, Zhu et al.2018, Chen et al.2018]. Beyond the baseline to detect each frame individually, state-of-the-art approaches consider the temporal consistency of the detection results via tubelet proposals [Kang et al.2016, Kang et al.2017], optical flow [Zhu et al.2017b, Zhu et al.2017a, Zhu et al.2018] and regression-based trackers [Feichtenhofer, Pinz, and Zisserman2017]. These approaches, however, are optimized for the detection accuracy of each individual frame. They either do not associate the presence of an object in different frames as a tracklet, or associate after performing object detection on each frame, which is time-consuming.

This paper is motivated by the constraints from practical video analytics scenarios such as autonomous driving and video surveillance. We argue that algorithms applied to these scenarios should be:

  • capable of associating an object appearing in different frames, such that the trajectory or velocity of the object can be further inferred.

  • in realtime (e.g., over 30 fps) and as fast as possible, such that the deployment cost can be further reduced.

  • with low latency, which means to produce results once a frame in a video stream has been processed.

Considering these constraints, we focus in this paper on the task of video object detection/tracking [Russakovsky et al.2017]. The task is to detect objects in each frame (similar to the goal of video object detection), with an additional goal of associating an object appearing in different frames.

In order to handle this task under the realtime and low latency constraint, we propose a detect or track (DorT) framework. In this framework, object detection/tracking of a video sequence is formulated as a sequential decision problem – a scheduler network makes a detection/tracking decision for every incoming frame, and then these frames are processed with the detector/tracker accordingly. The architecture is illustrated in Figure 1.

Figure 1: Detect or track (DorT) framework. The scheduler network compares the current frame $t+\tau$ with the keyframe $t$ by evaluating the tracking quality, and determines whether to detect or track frame $t+\tau$: either frame $t+\tau$ is detected by a single-frame detector, or bounding boxes are tracked to frame $t+\tau$ from the keyframe $t$. If detect is chosen, frame $t+\tau$ is assigned as the new keyframe, and the boxes in frame $t+\tau$ and frame $t+\tau-1$ are associated by the widely-used Hungarian algorithm (not shown in the figure for conciseness).

The scheduler network is the most distinctive part of our framework. It should be light-weight yet able to decide whether to detect or track. Rather than using heuristic rules (e.g., thresholds of tracking confidence values), we formulate the scheduler as a small CNN that assesses the tracking quality. It is shown to be a generalization of Siamese trackers and a special case of reinforcement learning (RL).

The contributions are summarized as follows:

  • We propose the DorT framework, in which the object detection/tracking of a video sequence is formulated as a sequential decision problem, while being in realtime and with low latency.

  • We propose a light-weight but effective scheduler network, which is shown to be a generalization of Siamese trackers and a special case of RL.

  • The proposed DorT framework is more effective than the frame skipping baselines and flow-based approaches, as validated on ImageNet VID dataset [Russakovsky et al.2015] in video object detection/tracking.

Related Work

To our knowledge, we are the first to formulate video object detection/tracking as a sequential decision problem and there is no existing similar work to directly compare with. However, it is related to existing work in multiple aspects.

Video Object Detection/Tracking

Video object detection/tracking is a task in ILSVRC 2017 [Russakovsky et al.2017], where the winning entries are optimized for accuracy rather than speed. [Deng et al.2017] adopts flow aggregation [Zhu et al.2017a] to improve the detection accuracy. [Wei et al.2017] combines flow-based [Ilg et al.2017] and object tracking-based [Nam and Han2016] tubelet generation [Kang et al.2017]. THU-CAS [Russakovsky et al.2017] considers flow-based tracking [Kang et al.2016], object tracking [Held, Thrun, and Savarese2016] and data association [Yu et al.2016].

Nevertheless, these methods combine multiple cues (e.g., flow aggregation in detection, and flow-based and object tracking-based tubelet generation) which are complementary but time-consuming. Moreover, they apply global post-processing such as seq-NMS [Han et al.2016] and tubelet NMS [Tang et al.2018a] which greatly improve the accuracy but are not suitable for a realtime and low latency scenario.

Video Object Detection

Approaches to video object detection have been developed rapidly since the introduction of the ImageNet VID dataset [Russakovsky et al.2015]. [Kang et al.2016, Kang et al.2017] propose a framework that consists of per-frame proposal generation, bounding box tracking and tubelet re-scoring. [Zhu et al.2017b] proposes to detect frames sparsely and propagates features with optical flow. [Zhu et al.2017a] proposes to aggregate features in nearby frames along the motion path to improve the feature quality. Furthermore, [Zhu et al.2018] proposes a high-performance approach by considering feature aggregation, partial feature updating and adaptive keyframe scheduling based on optical flow. Besides, [Feichtenhofer, Pinz, and Zisserman2017] proposes to learn detection and tracking using a single network with a multi-task objective. [Chen et al.2018] proposes to propagate the sparsely detected results through a space-time lattice. All the methods above focus on the accuracy of each individual frame. They either do not associate the presence of an object in different frames as a tracklet, or associate after performing object detection on each frame, which is time-consuming.

Multiple Object Tracking

Multiple object tracking (MOT) focuses on data association: finding the set of trajectories that best explains the given detections [Leal-Taixé et al.2014]. Existing approaches to MOT fall into two categories: batch and online mode. Batch mode approaches pose data association as a global optimization problem, which can be a min-cost max-flow problem [Zhang, Li, and Nevatia2008, Pirsiavash, Ramanan, and Fowlkes2011], a continuous energy minimization problem [Milan, Roth, and Schindler2014] or a graph cut problem [Tang et al.2016, Tang et al.2017]. In contrast, online mode approaches are only allowed to solve the data association problem with the present and past frames. [Xiang, Alahi, and Savarese2015] formulates data association as a Markov decision process. [Milan et al.2017, Sadeghian, Alahi, and Savarese2017] employ recurrent neural networks (RNNs) for feature representation and data association.

State-of-the-art MOT approaches aim to improve the data association performance given publicly-available detections since the introduction of the MOT challenge [Leal-Taixé et al.2015]. However, we focus on the sequential decision problem of detection or tracking. Although the widely-used Hungarian algorithm is adopted for simplicity and fairness in the experiments, we believe the incorporation of existing MOT approaches can further enhance the accuracy.

Keyframe Scheduler

Researchers have proposed approaches to adaptive keyframe scheduling beyond regular frame skipping in video analytics. [Zhu et al.2018] proposes to estimate the quality of optical flow, which relies on the time-consuming flow network. [Chen et al.2018] proposes an easiness measure that considers the size and motion of small objects; it is hand-crafted and, more importantly, follows a detect-then-schedule paradigm that cannot determine whether to detect or track. [Li, Shi, and Lin2018, Xu et al.2018] learn to predict the discrepancy between the segmentation map of the current frame and the keyframe, which is only applicable to segmentation tasks.

All the methods above, however, solve an auxiliary task (e.g., flow quality, or discrepancy of segmentation maps) but do not answer the question directly in a classification perspective – is the current frame a keyframe or not? In contrast, we pose video object detection/tracking as a sequential decision problem, and learn directly whether the current frame is a keyframe by assessing the tracking quality. Our formulation is further shown as a generalization of Siamese trackers and a special case of RL.

The DorT Framework

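Video object detection/tracking is formulated as follows. Given a sequence of video frames $F = \{f_1, f_2, \ldots, f_N\}$, the aim is to obtain bounding boxes $B = \{b_1, b_2, \ldots, b_M\}$, where $b_i = \{\mathrm{rect}_i, \mathrm{fid}_i, \mathrm{score}_i, \mathrm{id}_i\}$, $\mathrm{rect}_i$ denotes the 4-dim bounding box coordinates and $\mathrm{fid}_i$, $\mathrm{score}_i$ and $\mathrm{id}_i$ are scalars denoting respectively the frame ID, the confidence score and the object ID.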

Considering the realtime and low latency constraint, we formulate video object detection/tracking as a sequential decision problem, which consists of four modules: single-frame detector, multi-box tracker, scheduler network and data association. An algorithm summary follows the introduction of the four modules.

Single-Frame Detector

We adopt R-FCN [Dai et al.2016] as the detector following deep feature flow (DFF) [Zhu et al.2017b]. Our framework, however, is compatible with all single-frame detectors.

Efficient Multi-Box Tracker via RoI Convolution

The SiamFC tracker [Bertinetto et al.2016] is adopted in our framework. It learns a deep feature extractor during training such that an object is similar to its deformations but different from the background. During testing, the nearby patch with the highest confidence is selected as the tracking result. The tracker is reported to run at 86 fps in the original paper.

Despite the tracker's efficiency, R-FCN usually outputs 30 to 50 detected boxes per frame. A natural idea is to track only the high-confidence ones and ignore the rest. Such an approach, however, results in a drastic decrease in mAP since R-FCN detection is not perfect and many true positives with low confidence scores are discarded. We therefore need to track all the detected boxes.

It is time-consuming to track 50 boxes without optimization (about 3 fps). In order to speed up the tracking process, we share the feature extraction network among multiple boxes and propose an RoI convolution layer in place of the original cross-correlation layer in SiamFC. Figure 2 is an illustration. Through cropping and convolving on the feature maps, the proposed tracker is over 10x faster than the time-consuming baseline while obtaining comparable accuracy.

Figure 2: RoI convolution. Given targets in keyframe $t$ and search regions in frame $t+\tau$, the corresponding RoIs are cropped from the feature maps and convolved to obtain the response maps. Solid boxes denote detected objects in keyframe $t$ and dashed boxes denote the corresponding search regions in frame $t+\tau$. A star denotes the center of its corresponding bounding box. The center of a dashed box is copied from the tracking result in frame $t+\tau-1$.

Notably, there is no learnable parameter in the RoI convolution layer, and thus we can train the SiamFC tracker following the original settings in [Bertinetto et al.2016].
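The idea of RoI convolution can be illustrated with a minimal sketch. It assumes single-channel feature maps stored as nested lists and RoIs given as hypothetical `(top, left, height, width)` tuples in feature-map coordinates; the real layer operates on multi-channel CNN features, so this is only an illustration of the crop-then-correlate step, not the actual implementation. The function names are our own.

```python
def crop(feat, top, left, h, w):
    """Crop an h x w window from a 2-D feature map given as a list of rows."""
    return [row[left:left + w] for row in feat[top:top + h]]


def cross_correlate(target, search):
    """Valid cross-correlation of a small target patch over a larger search patch."""
    th, tw = len(target), len(target[0])
    sh, sw = len(search), len(search[0])
    out = []
    for i in range(sh - th + 1):
        row = []
        for j in range(sw - tw + 1):
            row.append(sum(target[a][b] * search[i + a][j + b]
                           for a in range(th) for b in range(tw)))
        out.append(row)
    return out


def roi_convolution(key_feat, cur_feat, target_rois, search_rois):
    """Feature maps are computed once per frame; every box only crops and correlates."""
    responses = []
    for t_roi, s_roi in zip(target_rois, search_rois):
        target = crop(key_feat, *t_roi)   # target RoI from the keyframe features
        search = crop(cur_feat, *s_roi)   # search RoI from the current-frame features
        responses.append(cross_correlate(target, search))
    return responses


def argmax2d(resp):
    """Location of the highest response, relative to the search-region origin."""
    best = max((v, i, j) for i, row in enumerate(resp) for j, v in enumerate(row))
    return best[1], best[2]


# Toy example: a 2x2 pattern at (1, 1) in the keyframe moves to (2, 3).
key_feat = [[0] * 6 for _ in range(6)]
cur_feat = [[0] * 6 for _ in range(6)]
for (a, b), v in {(0, 0): 1, (0, 1): 2, (1, 0): 3, (1, 1): 4}.items():
    key_feat[1 + a][1 + b] = v
    cur_feat[2 + a][3 + b] = v

responses = roi_convolution(key_feat, cur_feat, [(1, 1, 2, 2)], [(0, 0, 6, 6)])
peak = argmax2d(responses[0])
```

Because the loop body contains no learnable parameters, adding more boxes only adds crops and correlations, which is why sharing the feature extraction pays off when many boxes are tracked.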

Scheduler Network

The scheduler network is the core of DorT, as our task is formulated as a sequential decision problem. It takes as input the current frame $f_{t+\tau}$ and its keyframe $f_t$, and determines whether to detect or track, denoted as $\mathrm{Scheduler}(f_t, f_{t+\tau})$. We elaborate on this module in the next section.

Data Association

Once the scheduler network determines to detect the current frame, there is a need to associate the previous tracked boxes and the current detected boxes. Hence, a data association algorithm is required. For simplicity and fairness in the paper, the widely-used Hungarian algorithm is adopted. Although it is possible to improve the accuracy by incorporating more advanced data association techniques [Xiang, Alahi, and Savarese2015, Sadeghian, Alahi, and Savarese2017], it is not the focus in the paper. The overall architecture of the DorT framework is shown in Figure 1. More details are summarized in Algorithm 1.
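As an illustration of the association step, the following sketch matches previously tracked boxes to newly detected boxes with a compact implementation of the Hungarian algorithm. The cost `1 - IoU` and the `min_iou` gate of 0.5 are assumptions made for this example; the text above does not specify the cost function. Boxes are hypothetical `(x1, y1, x2, y2)` tuples.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns a list whose i-th entry is the column assigned to row i.
    """
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)   # row / column potentials
    p, way = [0] * (m + 1), [0] * (m + 1)     # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:                        # augment along the alternating path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j] != 0:
            assignment[p[j] - 1] = j - 1
    return assignment


def match_boxes(prev_rects, det_rects, min_iou=0.5):
    """For each detected box, return the index of its matched previous box or None."""
    matches = [None] * len(det_rects)
    if not prev_rects or not det_rects:
        return matches
    cost = [[1.0 - iou(p, d) for d in det_rects] for p in prev_rects]
    transposed = len(prev_rects) > len(det_rects)
    if transposed:                            # hungarian() needs rows <= columns
        cost = [list(col) for col in zip(*cost)]
    for r, c in enumerate(hungarian(cost)):
        p_idx, d_idx = (c, r) if transposed else (r, c)
        if iou(prev_rects[p_idx], det_rects[d_idx]) >= min_iou:
            matches[d_idx] = p_idx            # low-overlap pairs stay unmatched
    return matches


prev = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(21, 20, 31, 30), (1, 0, 11, 10), (50, 50, 60, 60)]
matches = match_boxes(prev, dets)
```

Detections left as `None` would receive a new object ID, while matched detections inherit the ID of their previous box.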

Algorithm 1 The Detect or Track (DorT) Framework
Input: A sequence of video frames $F = \{f_1, f_2, \ldots, f_N\}$.
Output: Bounding boxes $B = \{b_1, b_2, \ldots, b_M\}$ with ID, where $b_i = \{\mathrm{rect}_i, \mathrm{fid}_i, \mathrm{score}_i, \mathrm{id}_i\}$.
$B \leftarrow \{\}$
$t \leftarrow 1$ ($t$ is the index of the keyframe)
Detect $f_1$ with the single-frame detector.
Assign new ID to the detected boxes.
Add the detected boxes in $f_1$ to $B$.
for $i \leftarrow 2$ to $N$ do
     $a \leftarrow \mathrm{Scheduler}(f_t, f_i)$ (decision of scheduler)
     if $a = \text{detect}$ then
          Detect $f_i$ with single-frame detector.
          Match boxes in $f_i$ and $f_{i-1}$ using Hungarian.
          Assign new ID to unmatched boxes in $f_i$.
          Assign corresponding ID to matched boxes in $f_i$.
          $t \leftarrow i$ (update keyframe)
     else (the decision is to track)
          Track boxes from $f_t$ to $f_i$.
          Assign corresponding ID to tracked boxes in $f_i$.
          Assign corresponding detection score to tracked boxes in $f_i$.
     end if
     Add the bounding boxes in $f_i$ to $B$.
end for
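The control flow of Algorithm 1 can be sketched in a few lines. The detector, tracker, scheduler and association step are passed in as callables; their signatures, the dictionary keys and the toy one-dimensional "video" below are hypothetical and chosen only to make the loop executable, so this is a sketch of the scheduling logic rather than of the actual system.

```python
def dort(frames, detect, track, scheduler, associate):
    """Detect-or-track loop.

    detect(frame) -> list of (rect, score)
    track(key_frame, key_rects, frame) -> list of rects, one per key rect
    scheduler(key_frame, frame) -> "detect" or "track"
    associate(prev_rects, det_rects) -> per detection, matched prev index or None
    """
    boxes = []                      # the output set of boxes
    next_id = 0
    key = 0                         # index of the keyframe
    key_boxes = []
    for rect, score in detect(frames[0]):
        key_boxes.append({"rect": rect, "fid": 0, "score": score, "id": next_id})
        next_id += 1
    boxes.extend(key_boxes)
    prev = key_boxes
    for i in range(1, len(frames)):
        if scheduler(frames[key], frames[i]) == "detect":
            dets = detect(frames[i])
            matches = associate([b["rect"] for b in prev], [r for r, _ in dets])
            cur = []
            for (rect, score), m in zip(dets, matches):
                if m is None:       # unmatched detection starts a new tracklet
                    obj_id, next_id = next_id, next_id + 1
                else:               # matched detection inherits the previous ID
                    obj_id = prev[m]["id"]
                cur.append({"rect": rect, "fid": i, "score": score, "id": obj_id})
            key, key_boxes = i, cur  # the detected frame becomes the keyframe
        else:
            rects = track(frames[key], [b["rect"] for b in key_boxes], frames[i])
            # Tracked boxes keep the ID and detection score of their keyframe box.
            cur = [{"rect": r, "fid": i, "score": b["score"], "id": b["id"]}
                   for r, b in zip(rects, key_boxes)]
        boxes.extend(cur)
        prev = cur
    return boxes


# Toy stand-ins: a "frame" is just the x-position of a single object.
frames = [0, 1, 2, 8, 9]
detect = lambda x: [((x, 0, x + 10, 10), 0.9)]
track = lambda key, rects, cur: [(r[0] + cur - key, r[1], r[2] + cur - key, r[3])
                                 for r in rects]
scheduler = lambda key, cur: "detect" if abs(cur - key) > 3 else "track"
associate = lambda prev_rects, det_rects: [0 if prev_rects else None for _ in det_rects]

result = dort(frames, detect, track, scheduler, associate)
```

In this toy run the scheduler tracks for two frames, triggers a detection when the object jumps to position 8, and then tracks again from the new keyframe, while the object keeps a single ID throughout.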

The Scheduler Network in DorT

The scheduler network in DorT aims to determine whether to detect or track given a new frame by estimating the quality of the tracked boxes. It should itself be efficient. Rather than training a network from scratch, we propose to reuse part of the tracking network. Firstly, the $l$-th layer convolutional feature maps of frame $t$ and frame $t+\tau$, denoted respectively as $x_l^t$ and $x_l^{t+\tau}$, are fed into a correlation layer which performs point-wise feature comparison

$$x_{\mathrm{corr}}^{t,t+\tau}(i, j, p, q) = \left\langle x_l^t(i, j),\; x_l^{t+\tau}(i+p, j+q) \right\rangle, \tag{1}$$

where $-d \le p \le d$ and $-d \le q \le d$ are offsets to compare features in a neighbourhood around the locations $(i, j)$ in the feature map, defined by the maximum displacement $d$. Hence, the output of the correlation layer is a feature map of size $H_l \times W_l \times (2d+1)^2$, where $H_l$ and $W_l$ denote respectively the height and width of the $l$-th layer feature map. The correlation feature map $x_{\mathrm{corr}}^{t,t+\tau}$ is then passed through two convolutional layers and a fully-connected layer with a 2-way softmax. The final output of the network is a classification score indicating the probability to detect the current frame. Figure 3 is an illustration of the scheduler network.

Figure 3: Scheduler network. The output feature map of the correlation layer is followed by two convolutional layers and a fully-connected layer with a 2-way softmax. As discussed later, this structure is a generalization of the SiamFC tracker.
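To make the correlation layer concrete, here is a minimal pure-Python sketch. It assumes feature maps stored as nested lists of shape height x width x channels, and it assumes zero responses for offsets that fall outside the map; the boundary handling is our assumption and is not stated in the text. The function name and the variable `d` (maximum displacement) are chosen for this illustration.

```python
def correlation(x_key, x_cur, d):
    """Point-wise feature comparison within a (2d+1) x (2d+1) neighbourhood.

    x_key, x_cur: H x W x C nested lists (keyframe and current-frame features).
    Returns an H x W x (2d+1)^2 nested list of inner products.
    """
    height, width = len(x_key), len(x_key[0])
    n_offsets = (2 * d + 1) ** 2
    out = [[[0.0] * n_offsets for _ in range(width)] for _ in range(height)]
    for i in range(height):
        for j in range(width):
            k = 0
            for p in range(-d, d + 1):
                for q in range(-d, d + 1):
                    ii, jj = i + p, j + q
                    if 0 <= ii < height and 0 <= jj < width:
                        # Inner product of the two feature vectors.
                        out[i][j][k] = sum(a * b for a, b in
                                           zip(x_key[i][j], x_cur[ii][jj]))
                    k += 1
    return out


# Toy example: a feature at (1, 1) in the keyframe moves one step to the right.
x_key = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
x_cur = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
x_key[1][1] = [1.0, 0.0]
x_cur[1][2] = [1.0, 0.0]
x_corr = correlation(x_key, x_cur, d=1)
```

With `d=1` the nine offsets are enumerated row by row, so the response for offset `(p, q) = (0, 1)` is stored at index 5; this is the only non-zero entry in the toy example.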

Training Data Preparation

Existing groundtruth in the ImageNet VID dataset [Russakovsky et al.2015] does not contain an indicator of the tracking quality. In this paper, we simulate the tracking process between two sampled frames and label the pair as detect (0) or track (1) under a strict protocol.

Given frame $t$ and frame $t+\tau$ sampled from the same sequence, we track all the groundtruth bounding boxes using SiamFC from frame $t$ to frame $t+\tau$. If all the groundtruth boxes in frame $t+\tau$ are matched with the tracked boxes (e.g., IOU over a fixed threshold), the pair is labeled as track; otherwise, it is labeled as detect. Any emerging or disappearing object indicates detect. Several examples are shown in Figure 4.

(a) Positive examples
(b) Negative examples
Figure 4: Examples of labeled data for training the scheduler network. Red and green boxes denote groundtruth and tracked results, respectively. (a) Positive examples, where the IOU of each groundtruth box and its corresponding tracked box is over a threshold; (b) Negative examples, where at least one such IOU is below a threshold or there are emerging/disappearing objects.
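The labeling protocol can be sketched as a small function. Boxes are hypothetical `(x1, y1, x2, y2)` tuples keyed by object ID, and the IOU threshold is passed in explicitly; the value 0.7 used in the example is an arbitrary illustration, not the value used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_pair(gt_key, gt_cur, tracked, iou_thresh):
    """Label a (keyframe, current frame) pair as "track" or "detect".

    gt_key, gt_cur: dict object ID -> groundtruth box in the keyframe / current frame.
    tracked: dict object ID -> box obtained by tracking from the keyframe.
    """
    if set(gt_key) != set(gt_cur):
        return "detect"             # an object emerged or disappeared
    for obj_id, gt_box in gt_cur.items():
        if obj_id not in tracked or iou(gt_box, tracked[obj_id]) <= iou_thresh:
            return "detect"         # at least one box was not tracked well enough
    return "track"                  # every groundtruth box is matched


gt_key = {1: (0, 0, 10, 10)}
gt_cur = {1: (1, 0, 11, 10)}
good = label_pair(gt_key, gt_cur, {1: (1, 0, 11, 10)}, iou_thresh=0.7)
drifted = label_pair(gt_key, gt_cur, {1: (5, 0, 15, 10)}, iou_thresh=0.7)
emerged = label_pair(gt_key, {1: (1, 0, 11, 10), 2: (30, 30, 40, 40)},
                     {1: (1, 0, 11, 10)}, iou_thresh=0.7)
```

The protocol is strict in the sense that a single poorly tracked box, or a single appearing or disappearing object, is enough to label the whole pair as detect.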

We have also tried to learn a scheduler for each tracker, but found it difficult to handle high-confidence false detections and non-trivial to merge the decisions of all the trackers. In contrast, the proposed approach to learning a single scheduler is an elegant solution which directly learns the decision rather than an auxiliary target such as the fraction of pixels at which the semantic segmentation labels differ [Li, Shi, and Lin2018], or the fraction of low-quality flow estimation [Zhu et al.2018].

Relation to the SiamFC Tracker

The proposed scheduler network can be seen as a generalization of the original SiamFC [Bertinetto et al.2016]. In the correlation layer of SiamFC, the target feature ($6 \times 6$) is convolved with the search region feature ($22 \times 22$) and obtains the response map ($17 \times 17$, which can be equivalently written as $1 \times 1 \times 17^2$). Similarly, we can view the correlation layer of the proposed scheduler network (see Eq. 1) as convolutions between multiple target features in the keyframe and their respective nearby search regions in the current frame. The size of a target equals the receptive field of the input feature map of our scheduler. Figure 5 shows several examples of targets. In practice, however, the targets include all possible patches in a sliding window manner.

Figure 5: Examples of targets on keyframes. The size of a target equals the receptive field of the input feature map of the scheduler. As shown, a target patch might be an object, a part of an object, or totally background. The “tracking” results of these targets will be fused later. It should be noted that targets include all possible patches in a sliding window manner, but not just the three boxes shown above.

In this sense, the output feature map of the correlation layer can be regarded as a set of SiamFC tracking tasks, where the response map of each is $1 \times 1 \times (2d+1)^2$. The correlation feature map is then fed into a small CNN consisting of two convolutional layers and a fully-connected layer.

In summary, the generalization of the proposed scheduler network over SiamFC is two-fold:

  • SiamFC correlates a target feature with its nearby search region, while our scheduler extends the number of tasks from one to many.

  • SiamFC directly picks the highest value in the correlation feature map as the result, whereas the proposed scheduler fuses the multiple response maps with a CNN.

The validity of the proposed scheduler network is hence clear: it first convolves patches in frame $t$ (examples shown in Figure 5) with their respective nearby regions in frame $t+\tau$, and then fuses the response maps with a CNN, in order to measure the difference between the two frames and, more importantly, to assess the tracking quality. The scheduler is also resistant to small perturbations by inheriting SiamFC's robustness to object deformation.

Relation to Reinforcement Learning

The sequential decision problem can also be formulated in an RL framework, where the action, state, state transition function and reward need to be defined.

Action.

The action space contains two types of actions: {detect, track}. If the decision is detect, the object detector is applied to the current frame; otherwise, boxes tracked from the keyframe are taken as the results.

State.

The state $s$ is defined as a tuple $(x_l^t, x_l^{t+\tau})$, where $x_l^t$ and $x_l^{t+\tau}$ denote the $l$-th layer convolutional feature maps of frame $t$ and frame $t+\tau$, respectively. Frame $t$ is the keyframe on which the object detector is applied, and frame $t+\tau$ is the current frame on which the action is to be determined.

State transition function.

After the action $a$ is decided in state $s$, the next state is obtained according to the action:

  • detect. The next state is $(x_l^{t+\tau}, x_l^{t+\tau+1})$. Frame $t+\tau$ is fed to the object detector and is set as the new keyframe.

  • track. The next state is $(x_l^t, x_l^{t+\tau+1})$. Bounding boxes tracked from the keyframe are taken as the results in frame $t+\tau$. The keyframe remains unchanged.

As shown above, no matter whether the keyframe is $t$ or $t+\tau$, the task in the next state is to determine the action in frame $t+\tau+1$.


Reward.

The reward function is defined as $R(s, a)$ since it is determined by both the state $s$ and the action $a$. As illustrated in Figure 4, a labeling mechanism is proposed to obtain the groundtruth label of the tracking quality between two frames (i.e., a certain state $s$). We denote the groundtruth label as $\mathrm{GT}(s)$, which is either detect or track. Hence, the reward function can be defined as follows:

$$R(s, a) = \begin{cases} 1, & \mathrm{GT}(s) = a \\ 0, & \mathrm{GT}(s) \neq a \end{cases} \tag{2}$$

which is based on the consistency between the groundtruth label and the action taken.

After defining all the above, the RL problem can be solved via a deep Q network (DQN) [Mnih et al.2015] with a discount factor $\gamma$, which penalizes the reward from future time steps. However, training stability is always an issue in RL algorithms [Anschel, Baram, and Shimkin2017]. In this paper, we set $\gamma = 0$ such that the agent only cares about the reward from the next time step. Therefore, the DQN becomes a regression network that pushes the predicted action to be the same as the GT action, and the scheduler network is a special case of RL. We empirically observe that the training procedure becomes easier and more stable by setting $\gamma = 0$.
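The reduction from Q-learning to plain regression can be written out in one step. The sketch below uses the standard one-step Q-learning target with a squared loss, writing $\gamma$ for the discount factor, $R(s, a)$ for the reward and $s'$ for the next state; the squared loss is an assumption made for illustration, since the exact training loss is not spelled out here.

```latex
\begin{align*}
  % Standard one-step Q-learning target for action a in state s, next state s'
  y &= R(s, a) + \gamma \max_{a'} Q(s', a') \\
  % With \gamma = 0 the bootstrapped future term vanishes
  y &= R(s, a) \\
  % The network then regresses directly onto the per-step reward
  \mathcal{L} &= \bigl( Q(s, a) - R(s, a) \bigr)^2
\end{align*}
```

Since the target no longer depends on the network's own predictions for the next state, training reduces to supervised learning on the groundtruth labels, which is consistent with the improved stability reported above.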


Experiments

The DorT framework is evaluated on the ImageNet VID dataset [Russakovsky et al.2015] in the task of video object detection/tracking. For completeness, we also report results in video object detection.

Experimental Setup

Dataset description.

All experiments are conducted on the ImageNet VID dataset [Russakovsky et al.2015]. ImageNet VID is split into a training set of 3862 videos and a validation set of 555 videos. There are per-frame bounding box annotations for each video. Furthermore, occurrences of the same target across different frames in a video are assigned the same ID.

Evaluation metric.

The evaluation metric for video object detection is the extensively used mean average precision (mAP), which is based on a sorted list of bounding boxes in descending order of their scores. A predicted bounding box is considered correct if its IOU with a groundtruth box is over a threshold (e.g., $0.5$).

In contrast to the standard mAP which is based on bounding boxes, the mAP for video object detection/tracking is based on a sorted list of tracklets [Russakovsky et al.2017]. A tracklet is a set of bounding boxes with the same ID. Similarly, a tracklet is considered correct if its IOU with a groundtruth tracklet is over a threshold. Typical choices of IOU thresholds for tracklet matching and per-frame bounding box matching are both $0.5$. The score of a tracklet is the average score of all its bounding boxes.
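The tracklet scoring rule can be sketched as follows: boxes are grouped by object ID and their scores are averaged. The dictionary keys `"id"` and `"score"` are hypothetical names for the object ID and confidence score of a box.

```python
from collections import defaultdict


def tracklet_scores(boxes):
    """Average the confidence scores of all boxes that share an object ID."""
    scores_by_id = defaultdict(list)
    for box in boxes:
        scores_by_id[box["id"]].append(box["score"])
    return {obj_id: sum(s) / len(s) for obj_id, s in scores_by_id.items()}


boxes = [
    {"id": 0, "fid": 0, "score": 0.5},
    {"id": 0, "fid": 1, "score": 1.0},
    {"id": 1, "fid": 1, "score": 0.25},
]
scores = tracklet_scores(boxes)
```

Tracklets are then sorted by these averaged scores before precision and recall are computed, in the same way boxes are sorted for the standard per-frame mAP.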

Implementation details.

Following the settings in [Zhu et al.2017b], R-FCN [Dai et al.2016] is trained with a ResNet-101 backbone [He et al.2016] on the training set.

SiamFC is trained following the original paper [Bertinetto et al.2016]. Instead of training from scratch, however, we initialize the first four convolutional layers with the pretrained parameters from AlexNet [Krizhevsky, Sutskever, and Hinton2012] and modify Conv5, which is initialized with the Xavier initializer. Parameters of the first four convolutional layers are fixed during training [He et al.2018]. We only search over one scale and discard the upsampling step of the original SiamFC for efficiency. All images fed into SiamFC are resized to a fixed size. Moreover, the confidence score of a tracked box (for evaluation) is set equal to that of its corresponding detected box in the keyframe.

The scheduler network takes as input the Conv5 feature of our trained SiamFC. The SGD optimizer is adopted with a learning rate of 1e-2, momentum 0.9 and weight decay 5e-4. The batch size is set to 32. During testing, we raise the decision threshold of track to $\delta$ (i.e., the scheduler outputs track only if the predicted confidence of track is over $\delta$) to ensure conservativeness of the scheduler. Furthermore, since nearby frames look similar, the scheduler is applied every $\sigma$ frames (where $\sigma$ is a tunable parameter) to reduce unnecessary computation.

All experiments are conducted on a workstation with an Intel Core i7-4790K CPU and a Titan X GPU. We empirically observe that the detection network and the tracking/scheduler network run at 8.33 fps and 100 fps, respectively. This is because the ResNet-101 backbone is much heavier than AlexNet. Moreover, the Hungarian algorithm runs at as high as 667 fps.
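These per-module speeds allow a rough back-of-the-envelope estimate of the frame rate of a fixed frame-skipping schedule. The sketch below assumes that every `interval` frames cost one detection, one Hungarian association and `interval - 1` tracking steps, and it ignores all other overheads such as I/O; the function and parameter names are hypothetical, and the numbers it produces are estimates, not measured results.

```python
DETECT_MS = 1000.0 / 8.33      # about 120 ms per detected frame
TRACK_MS = 1000.0 / 100.0      # 10 ms per tracked frame
HUNGARIAN_MS = 1000.0 / 667.0  # about 1.5 ms per association


def fixed_scheduler_fps(interval):
    """Estimated frame rate when detecting once every `interval` frames."""
    total_ms = DETECT_MS + HUNGARIAN_MS + (interval - 1) * TRACK_MS
    return 1000.0 * interval / total_ms


fps_every_frame = fixed_scheduler_fps(1)    # roughly 8.2 fps
fps_every_tenth = fixed_scheduler_fps(10)   # roughly 47 fps
```

Under these assumptions the estimate approaches the 100 fps of the tracker as the interval grows, which illustrates why the detection frequency dominates the overall cost.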

Video Object Detection/Tracking

To our knowledge, the most closely related work to ours is [Lan et al.2016], which handles cost-effective face detection/tracking. Since faces are much easier to track and deform less, that work succeeds with non-deep learning-based detectors and trackers. We, however, aim at general object detection/tracking in video, which is much more challenging. We demonstrate the effectiveness of the proposed DorT framework against several strong baselines.

Effectiveness of scheduler.

The scheduler network is a core component of our DorT framework. Since SiamFC tracking is more efficient than R-FCN detection, the scheduler should predict track when it is safe for the trackers and be conservative enough to predict detect when there is sufficient change to avoid track drift.

We compare our DorT framework with a frame skipping baseline, namely a "fixed scheduler": R-FCN is performed every $\sigma$ frames and SiamFC is adopted to track for the frames in between. As mentioned above, our scheduler can also be applied every $\sigma$ frames to improve efficiency. Moreover, there could be an oracle scheduler, which predicts the groundtruth label (detect or track) as shown in Figure 4 during testing. The oracle scheduler is a 100% accurate scheduler in our setting. The results are shown in Figure 6.

Figure 6: Comparison between different methods in video object detection/tracking in terms of mAP. The detector (for deep feature flow and fixed scheduler) or the scheduler (for scheduler network and oracle scheduler) can be applied every $\sigma$ frames to obtain different results.

We can observe that the frame rate and mAP vary as $\sigma$ changes. Interestingly, the curves are not monotonic: as the frame rate decreases, the accuracy in mAP is not necessarily higher. In particular, detectors are applied frequently when $\sigma$ is smallest (the leftmost point of each curve). Associating boxes using the Hungarian algorithm is generally less reliable (given missed detections and false detections) than tracking boxes between two frames. This is also a benefit of the scheduler network: it applies tracking only when confident, and thus most boxes are reliably associated. Hence, the curve of the scheduler network is on the upper-right side of that of the fixed scheduler as shown in Figure 6.

However, it can also be observed that there is a certain distance between the curve of the scheduler network and that of the oracle scheduler. Given that the oracle scheduler is a 100% accurate classifier, we analyze the classification accuracy of the scheduler network in Figure 7.

Figure 7: Confusion matrix of the scheduler network. The horizontal axis is the groundtruth and the vertical axis is the predicted label. The scheduler is applied every $\sigma$ frames.

Let us take one setting of $\sigma$ as an example. Although the classification accuracy is only 32.3%, the false positive rate (i.e., misclassifying a detect case as track) is as low as 1.9%. Because we empirically find that the mAP drops drastically if the scheduler mistakenly predicts track, our scheduler network is made conservative: it tracks only when confident and detects if unsure. Figure 8 shows some qualitative results.

Figure 8: Qualitative results of the scheduler network. Red, blue and green boxes denote groundtruth, detected boxes and tracked boxes, respectively. The first row: R-FCN is applied in the keyframe. The second row: the scheduler determines to track since it is confident. The third row: the scheduler predicts to track in the first image although the red panda moves; however, the scheduler determines to detect in the second image as the cat moves significantly and is unable to be tracked.

Effectiveness of RoI convolution.

Trackers are optimized for the crop-and-resize case [Bertinetto et al.2016]: the target and search region are cropped and resized to a fixed size before matching. This is a sensible choice since the tracking algorithm is then not affected by the original size of the target. It is, however, slow in the multi-box case, and we propose RoI convolution as an efficient approximation. As shown in Figure 6, crop-and-resize SiamFC is even slower than detection: the overall running time is 3 fps. Notably, its mAP is 56.5%, which is roughly the same as that of our DorT framework empowered with RoI convolution. Our DorT framework, however, runs at 54 fps at one setting of $\sigma$. RoI convolution thus obtains over 10x speed boost while retaining mAP.

Comparison with existing methods.

Deep feature flow [Zhu et al.2017b] focuses on video object detection without tracking. We can, however, associate its predicted bounding boxes with per frame data association using the Hungarian algorithm. The results are shown in Figure 6. It can be observed that our framework performs significantly better than deep feature flow in video object detection/tracking.

Concurrent works that deal with video object detection/tracking are the submitted entries in ILSVRC 2017 [Deng et al.2017, Wei et al.2017, Russakovsky et al.2017]. As discussed in the Related Work section, these methods aim only to improve the mAP by adopting complicated methods and post-processing, leading to inefficient solutions without guaranteeing low latency. Their reported results on the test set range from 51% to 65% mAP. Our proposed DorT, notably, achieves 57% mAP on the validation set, which is comparable to the existing methods in magnitude, but is much more principled and efficient.

Video Object Detection

We also evaluate our DorT framework in video object detection for completeness, by removing the predicted object ID. Our DorT framework is compared against deep feature flow [Zhu et al.2017b], D&T [Feichtenhofer, Pinz, and Zisserman2017], high performance video object detection (VOD) [Zhu et al.2018] and ST-Lattice [Chen et al.2018]. The results are shown in Figure 9.

Figure 9: Comparison between different methods in video object detection in terms of mAP. Results of D&T, High performance VOD and ST-Lattice are copied from the original papers. The detector (for deep feature flow) or the scheduler (for scheduler network) can be applied every frames to obtain different results.

It can be observed that D&T and high performance VOD manage to achieve a speed-accuracy balance. They obtain higher mAP but cannot fit into realtime (over 30 fps) scenarios. ST-Lattice, although fast and accurate, uses detection results from future frames and is thus not suitable for a low-latency scenario. Compared with deep feature flow, our DorT framework runs significantly faster with comparable accuracy (no more than 1% mAP loss). Although video object detection is not our primary aim, the results in Figure 9 demonstrate the effectiveness of our approach.

Conclusion and Future Work

We propose a DorT framework for cost-effective video object detection/tracking, which runs in realtime with low latency. Object detection/tracking of a video sequence is formulated as a sequential decision problem in the framework. Notably, a light-weight but effective scheduler network is proposed, which is shown to be a generalization of Siamese trackers and a special case of RL. The DorT framework turns out to be effective and strikes a good balance between speed and accuracy.

The framework can still be improved in several aspects. The SiamFC tracker can search for multiple scales to improve performance as in the original paper. More advanced data association methods can be applied by resorting to the state-of-the-art MOT algorithms. Furthermore, there is room to improve the training of the scheduler network to approach the oracle scheduler. These are left as future work.


This work was partly supported by NSFC (No. 61876212 & 61733007). The authors would like to thank Chong Luo and Anfeng He for fruitful discussions.


  • [Anschel, Baram, and Shimkin2017] Anschel, O.; Baram, N.; and Shimkin, N. 2017. Averaged-dqn: Variance reduction and stabilization for deep reinforcement learning. In ICML.
  • [Bertinetto et al.2016] Bertinetto, L.; Valmadre, J.; Henriques, J. F.; Vedaldi, A.; and Torr, P. H. 2016. Fully-convolutional siamese networks for object tracking. In ECCVw.
  • [Bhat et al.2018] Bhat, G.; Johnander, J.; Danelljan, M.; Khan, F. S.; and Felsberg, M. 2018. Unveiling the power of deep tracking. In ECCV.
  • [Chen et al.2018] Chen, K.; Wang, J.; Yang, S.; Zhang, X.; Xiong, Y.; Loy, C. C.; and Lin, D. 2018. Optimizing video object detection via a scale-time lattice.
  • [Dai et al.2016] Dai, J.; Li, Y.; He, K.; and Sun, J. 2016. R-fcn: Object detection via region-based fully convolutional networks. In NIPS.
  • [Deng et al.2017] Deng, J.; Zhou, Y.; Yu, B.; Chen, Z.; Zafeiriou, S.; and Tao, D. 2017. Speed/accuracy trade-offs for object detection from video.
  • [Feichtenhofer, Pinz, and Zisserman2017] Feichtenhofer, C.; Pinz, A.; and Zisserman, A. 2017. Detect to track and track to detect. In ICCV.
  • [Han et al.2016] Han, W.; Khorrami, P.; Paine, T. L.; Ramachandran, P.; Babaei-zadeh, M.; Shi, H.; Li, J.; Yan, S.; and Huang, T. S. 2016. Seq-nms for video object detection. arXiv.
  • [He et al.2016] He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In CVPR.
  • [He et al.2018] He, A.; Luo, C.; Tian, X.; and Zeng, W. 2018. A twofold siamese network for real-time object tracking. In CVPR.
  • [Held, Thrun, and Savarese2016] Held, D.; Thrun, S.; and Savarese, S. 2016. Learning to track at 100 fps with deep regression networks. In ECCV.
  • [Ilg et al.2017] Ilg, E.; Mayer, N.; Saikia, T.; Keuper, M.; Dosovitskiy, A.; and Brox, T. 2017. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR.
  • [Kang et al.2016] Kang, K.; Ouyang, W.; Li, H.; and Wang, X. 2016. Object detection from video tubelets with convolutional neural networks. In CVPR.
  • [Kang et al.2017] Kang, K.; Li, H.; Xiao, T.; Ouyang, W.; Yan, J.; Liu, X.; and Wang, X. 2017. Object detection in videos with tubelet proposal networks. In CVPR.
  • [Kristan et al.2017] Kristan, M.; Leonardis, A.; Matas, J.; Felsberg, M.; Pflugfelder, R.; Cehovin Zajc, L.; Vojir, T.; Hager, G.; Lukezic, A.; Eldesokey, A.; and Fernandez, G. 2017. The visual object tracking vot2017 challenge results. In ICCVw.
  • [Krizhevsky, Sutskever, and Hinton2012] Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. Imagenet classification with deep convolutional neural networks. In NIPS.
  • [Lan et al.2016] Lan, X.; Xiong, Z.; Zhang, W.; Li, S.; Chang, H.; and Zeng, W. 2016. A super-fast online face tracking system for video surveillance. In ISCAS.
  • [Leal-Taixé et al.2014] Leal-Taixé, L.; Fenzi, M.; Kuznetsova, A.; Rosenhahn, B.; and Savarese, S. 2014. Learning an image-based motion context for multiple people tracking. In CVPR.
  • [Leal-Taixé et al.2015] Leal-Taixé, L.; Milan, A.; Reid, I.; Roth, S.; and Schindler, K. 2015. Motchallenge 2015: Towards a benchmark for multi-target tracking. arXiv.
  • [Li, Shi, and Lin2018] Li, Y.; Shi, J.; and Lin, D. 2018. Low-latency video semantic segmentation. In CVPR.
  • [Liu et al.2016] Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; and Berg, A. C. 2016. Ssd: Single shot multibox detector. In ECCV.
  • [Milan et al.2017] Milan, A.; Rezatofighi, S. H.; Dick, A. R.; Reid, I. D.; and Schindler, K. 2017. Online multi-target tracking using recurrent neural networks. In AAAI.
  • [Milan, Roth, and Schindler2014] Milan, A.; Roth, S.; and Schindler, K. 2014. Continuous energy minimization for multitarget tracking. TPAMI.
  • [Mnih et al.2015] Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature.
  • [Nam and Han2016] Nam, H., and Han, B. 2016. Learning multi-domain convolutional neural networks for visual tracking. In CVPR.
  • [Pirsiavash, Ramanan, and Fowlkes2011] Pirsiavash, H.; Ramanan, D.; and Fowlkes, C. C. 2011. Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR.
  • [Ren et al.2015] Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS.
  • [Russakovsky et al.2015] Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. Imagenet large scale visual recognition challenge. IJCV.
  • [Russakovsky et al.2017] Russakovsky, O.; Park, E.; Liu, W.; Deng, J.; Li, F.-F.; and Berg, A. 2017. Beyond imagenet large scale visual recognition challenge.
  • [Sadeghian, Alahi, and Savarese2017] Sadeghian, A.; Alahi, A.; and Savarese, S. 2017. Tracking the untrackable: Learning to track multiple cues with long-term dependencies. In ICCV.
  • [Tang et al.2016] Tang, S.; Andres, B.; Andriluka, M.; and Schiele, B. 2016. Multi-person tracking by multicut and deep matching. In ECCV.
  • [Tang et al.2017] Tang, S.; Andriluka, M.; Andres, B.; and Schiele, B. 2017. Multiple people tracking by lifted multicut and person reidentification. In CVPR.
  • [Tang et al.2018a] Tang, P.; Wang, C.; Wang, X.; Liu, W.; Zeng, W.; and Wang, J. 2018a. Object detection in videos by high quality object linking. arXiv.
  • [Tang et al.2018b] Tang, P.; Wang, X.; Bai, S.; Shen, W.; Bai, X.; Liu, W.; and Yuille, A. L. 2018b. Pcl: Proposal cluster learning for weakly supervised object detection. TPAMI.
  • [Wei et al.2017] Wei, Y.; Zhang, M.; Li, J.; Chen, Y.; Feng, J.; Dong, J.; Yan, S.; and Shi, H. 2017. Improving context modeling for video object detection and tracking.
  • [Wu, Lim, and Yang2015] Wu, Y.; Lim, J.; and Yang, M.-H. 2015. Object tracking benchmark. TPAMI.
  • [Xiang, Alahi, and Savarese2015] Xiang, Y.; Alahi, A.; and Savarese, S. 2015. Learning to track: Online multi-object tracking by decision making. In ICCV.
  • [Xu et al.2018] Xu, Y.-S.; Fu, T.-J.; Yang, H.-K.; and Lee, C.-Y. 2018. Dynamic video segmentation network. In CVPR.
  • [Yu et al.2016] Yu, F.; Li, W.; Li, Q.; Liu, Y.; Shi, X.; and Yan, J. 2016. Poi: Multiple object tracking with high performance detection and appearance feature. In ECCVw.
  • [Zhang, Li, and Nevatia2008] Zhang, L.; Li, Y.; and Nevatia, R. 2008. Global data association for multi-object tracking using network flows. In CVPR.
  • [Zhu et al.2017a] Zhu, X.; Wang, Y.; Dai, J.; Yuan, L.; and Wei, Y. 2017a. Flow-guided feature aggregation for video object detection. In ICCV.
  • [Zhu et al.2017b] Zhu, X.; Xiong, Y.; Dai, J.; Yuan, L.; and Wei, Y. 2017b. Deep feature flow for video recognition. In CVPR.
  • [Zhu et al.2018] Zhu, X.; Dai, J.; Yuan, L.; and Wei, Y. 2018. Towards high performance video object detection. In CVPR.