by Jianren Wang, et al.
Carnegie Mellon University

We introduce a prediction-driven method for visual tracking and segmentation in videos. Instead of relying solely on matching with appearance cues for tracking, we build a predictive model that guides the search toward more accurate tracking regions efficiently. With the proposed prediction mechanism, we improve the model's robustness against distractions and occlusions during tracking. We demonstrate significant improvements over state-of-the-art methods not only on visual tracking tasks (VOT 2016 and VOT 2018) but also on video segmentation datasets (DAVIS 2016 and DAVIS 2017).



1 Introduction

A human can track, segment or interact with fast-moving objects with surprising accuracy [54], even when the objects undergo deformation, occlusion and illumination changes [68]. What is the key component of human perception that makes this possible?

In fact, tracking and segmenting moving objects appears at a very early stage of human perception. Even 4-month-old infants can track moving objects with their eyes and reach for them [61]. A professional athlete can even interact with objects moving at breakneck speed (e.g., a baseball player can hit a 100-mph baseball). To do so, the human brain has to overcome its delays in neuronal transmission through prediction [45, 25, 6], and it saves processing time by using only local information [59, 5]. Besides prediction, researchers further point out that humans use multiple temporal scales to track objects moving at different speeds [26].

Inspired by these observations from human perception, in this paper we propose a two-stage tracking method driven by prediction. Given a tracking result at time t, we first predict the approximate object location in the next frame (at time t+1) without seeing it. Based on the prediction, we then refine the localization and the segmentation using the appearance input at time t+1. The refined tracking and segmentation results in turn help us update the prediction model, which is applied again in the successive frames.

Concretely, in the first stage of prediction, we use an extrapolation method to estimate the object position in the future frame (at time t+1), and we simulate the multiple-temporal-scale effect in human perception through an adaptive search region in the same frame. Because prediction drives both tracking and segmentation, we name our method Prediction-Tracking-Segmentation (PTS).

Our approach offers several unique advantages. First, by predicting the object position, we free tracking from relying on appearance information alone. As most tracking and segmentation methods can only discriminate the foreground from the non-semantic background [71], their performance suffers significantly when the target object is surrounded by similar objects (known as distractors [71]). Prediction also improves the robustness of our model against occlusions, which largely prevent most appearance-based methods from extracting useful information. We show in the experiments that our method improves tracking performance by a large margin in both cases. We visualize part of the results in Fig. 1.

Second, by using an adaptive search region around the predicted area, PTS significantly decreases the amount of information that must be processed and thus has large potential to increase inference speed. To achieve this, we focus on smaller local regions when objects move more slowly and are smaller, and on larger regions otherwise. This approach yields better segmentation performance, fewer missed targets and fewer identity switches.

We evaluate our framework on two major tracking benchmarks: VOT-2016 [36] and VOT-2018 [35]. We demonstrate that our framework achieves state-of-the-art performance, both qualitatively and quantitatively. We also show competitive results against semi-supervised VOS approaches on DAVIS-2016 [48] and DAVIS-2017 [51].

To summarize, our main contributions are three-fold. First, inspired by visual cognition theory, we propose PTS to unify prediction, tracking and segmentation in a single framework. Second, we propose an adaptive search region module to process information effectively. Third, we show that the proposed method achieves competitive performance on VOT and VOS datasets.

2 Related Works

In this section, we briefly review three research areas relevant to our proposed method.

Video Object Tracking

In the tracking community, significant attention has been paid to methods based on discriminative correlation filters (DCF) [3, 42, 39, 14]. These methods discriminate between the template of an arbitrary target and its 2D translations at very high speed. MOSSE [3] is the pioneering work, which proposes a fast correlation tracker by minimizing the squared error. The performance of DCF-based trackers has since been notably improved through the use of multi-channel features [24, 12, 32], robust scale estimation [8, 9], reduced boundary effects [10, 33] and the fusion of multi-resolution features in the continuous spatial domain [11].

Tracking with Siamese networks is another important approach [34, 56, 2, 58]. Instead of learning a discriminative classifier online, the idea is to train a deep Siamese similarity function offline on pairs of video frames. At test time, this function is used on a new video, once per frame, to search for the candidate most similar to the template given in the starting frame. The pioneering work is SINT [56]. Similarly, GOTURN [23] uses a deep regression network to predict the motion between successive frames. SiamFC [2] implements a fully convolutional network that outputs a correlation response map with high values at target locations, which set the basic form of the modern Siamese framework. Many follow-up works improve the accuracy while maintaining fast inference speed by adding a semantic branch [17], using region proposals [38], hard negative mining [71], ensembling [16], deeper backbones [37] and high-fidelity object representations [66].

Under the assumption that objects undergo only minor displacement and size change between consecutive frames, most modern trackers, including all the ones mentioned above, use a fixed search region that is centered on the last estimated position of the target and keeps the same ratio. Although straightforward, this oversimplified prior often fails under occlusion, motion change, size change and camera motion, as is evident in the examples of Figure 1. This motivated us to propose a tracker that sets the search region adaptively.

Video Forecasting

The ability to predict, and therefore to anticipate the future, is an important attribute of intelligence. Many methods have been developed to improve the temporal stability of semantic video segmentation. Luc et al. [41] develop an autoregressive convolutional neural network that learns to generate multiple future frames iteratively. Similarly, Walker et al. [62] use a VAE to model the possible future movements of humans in pose space. Instead of generating future states directly, many methods attempt to propagate segmentation from preceding input frames [30, 46, 27].

Unlike previous work, inspired by human perception, we extract a motion model for each object and set up a new search region for segmentation according to the motion model.

Video Object Segmentation

Video Object Segmentation (VOS) methods are commonly divided into three categories based on the level of supervision required: unsupervised, semi-supervised and supervised. We briefly review VOS in the semi-supervised setting, which is usually formulated as a temporal label propagation problem. To exploit the consistency between video frames, many methods propagate the first segmentation mask through temporally adjacent frames [1, 43, 57] or even through the entire video [28, 29]. Another approach is to process video frames independently [47]; such methods usually rely heavily on fine-tuning [4], data augmentation [31] and model adaptation [60].

3 Method

To unify prediction, tracking and segmentation with an adaptive search region, our model consists of: (i) a prediction module, which estimates the object position and velocity in an unseen frame (Section 3.1); (ii) a tracking module, which adaptively limits the search region for further processing (Section 3.2); and (iii) a segmentation module, a fully-convolutional Siamese framework that segments the foreground object from the given search region (Section 3.3). We show our framework in Figure 2.

Figure 2: An overview of our method. Our method is composed of a prediction part, a tracking part and a segmentation part.

3.1 Prediction Module

Figure 3: An example of decoupling background motion and mapping object motion to the reference frame (arrows illustrate the movement of the object center).

In video object tracking and segmentation, most methods do not consider the temporal consistency of object motion. In other words, most methods implicitly predict a zero-velocity object and thus set a local search region centered on the last estimated position of the target [2, 38, 71, 66].

However, these methods only consider appearance features of the current frame and hardly benefit from motion information. This makes it difficult to distinguish the target from instances that look like the template, known as distractors [71], and to handle occlusion, fast motion and camera motion. To solve this problem, our proposed tracker takes full advantage of motion information.

The apparent motion of an object in a given image is the superposition of camera motion and object motion. The former is random while the latter should satisfy Newton's First Law [44]. We first pick a reference frame every n frames and thus separate the long video into several short n-frame videos.

Second, we adopt the method proposed by ARIT [63] to decouple the camera motion and object motion within each short video. ARIT assumes that the pending detection frame and its reference frame are related by a homography. This assumption holds in most cases, as the global motion between neighboring frames is usually small. To estimate the homography, the first step is to find correspondences between the two frames. As in ARIT, we combine SURF features [67] and motion vectors from optical flow to generate sufficient and complementary candidate matches, which has been shown to be robust [15, 63]. Here we use PWCNet [55] for dense flow generation.

As a homography matrix contains 8 free variables, at least 4 pairs of background points are needed. We calculate the least-squares solution of eq. 1 and refine it with RANSAC [13] to obtain a robust solution, where the point pairs in eq. 1 are randomly selected background matches between the pending detection frame and the reference frame, obtained with the features mentioned above. Given the assumption that the background occupies more than half of the image, we partition the matching points between frames into 4 pieces and randomly choose one point inside each piece, which improves the efficiency of the RANSAC algorithm.
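The estimation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the 4 pieces are the four image quadrants, fits each hypothesis exactly from 4 pairs (fixing the last homography entry to 1) instead of running a least-squares refit, and uses hypothetical names (`ransac_homography`, `tol`, `iters`) that do not appear in the paper.

```python
import random

def solve_linear(a, b):
    """Solve a square system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None  # degenerate sample, e.g. collinear points
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def homography_from_pairs(pairs):
    """Fit a 3x3 homography (last entry fixed to 1) from four ((x, y), (u, v)) pairs."""
    a, b = [], []
    for (x, y), (u, v) in pairs:
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    if h is None:
        return None
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def apply_homography(h, point):
    """Map a 2D point through h using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    if abs(w) < 1e-12:
        return (float("inf"), float("inf"))
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def ransac_homography(matches, width, height, iters=200, tol=2.0, seed=0):
    """RANSAC over background matches, sampling one match per image quadrant."""
    rng = random.Random(seed)
    quadrants = [[], [], [], []]
    for match in matches:
        (x, y), _ = match
        quadrants[(x >= width / 2) + 2 * (y >= height / 2)].append(match)
    if any(not q for q in quadrants):
        raise ValueError("need at least one match in every quadrant")
    best_h, best_count = None, -1
    for _ in range(iters):
        h = homography_from_pairs([rng.choice(q) for q in quadrants])
        if h is None:
            continue
        count = 0
        for src, dst in matches:
            u, v = apply_homography(h, src)
            if (u - dst[0]) ** 2 + (v - dst[1]) ** 2 <= tol ** 2:
                count += 1
        if count > best_count:
            best_h, best_count = h, count
    return best_h, best_count
```

Drawing one match from each quadrant spreads every 4-point sample across the image, which makes degenerate (nearly collinear) samples less likely than uniform sampling would. A fuller version would refit the homography on all inliers of the best hypothesis in the least-squares sense.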


For simplicity, all following calculations are carried out in the coordinates of the reference frame and projected back to the incoming frame without further mention.

Fig. 3 illustrates the working principle of the decoupling step. The original video for Fig. 3 is a handheld video with a trembling background. The motion of the pedestrians in the original video is highly unpredictable because of the large background uncertainty. However, after mapping the target frame to the reference frame, the movement of the pedestrians becomes more predictable and continuous.

To find the most representative point of the object position, we calculate the "center of mass" of the object segmentation using eq. 2.
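As a minimal sketch of this step, assuming the segmentation is a binary mask stored as a list of rows, the center of mass is the mean pixel coordinate of the foreground. The function name is illustrative and not taken from the paper.

```python
def mask_center_of_mass(mask):
    """Return the (x, y) centroid of the nonzero pixels in a binary mask, or None if empty."""
    count = 0
    sum_x = 0.0
    sum_y = 0.0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                count += 1
                sum_x += x
                sum_y += y
    if count == 0:
        return None  # no foreground pixel, so no position measurement
    return (sum_x / count, sum_y / count)
```

Returning `None` for an empty mask lets the caller skip the measurement for that frame instead of feeding an arbitrary position to the filter.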


Random noise from background motion estimation and mask segmentation may be introduced into the object position, which could reduce the accuracy of the prediction. To obtain a better estimate of the object state, we use a Kalman filter to provide accurate position information based on the measurements from the current and former frames. As a classical tracking algorithm, the Kalman filter estimates the position of the object in two steps: prediction and correction. In the prediction step, it predicts the target state based on the dynamic model (eq. 3) and generates the search region in which the Siamese network performs object segmentation. The measurement of the object position in the next frame can then be computed from the resulting mask with eq. 2. In the correction step, the position estimate is updated with higher certainty given the position measurement from the Siamese network, which benefits the accuracy of predictions for future frames.

The dynamic model for object position update could be formulated as:


In eq. 3, the a priori state estimate given the observations up to the previous time step is a 4-dimensional vector that holds the object position and velocity. It is worth mentioning that the velocity terms are obtained by extrapolating between two earlier time steps. The dynamic model also includes the random noise present in the system and the transition matrix from one time step to the next.

After predicting the state, the Kalman filter uses measurements to correct its prediction in the correction step using eq. 4, in which the residual is the difference between the prediction and the measurement. The optimal Kalman gain is computed from the predicted error covariance (P), the measurement matrix and the measurement noise covariance, as shown in eq. 5. It is worth mentioning that, as the Kalman filter is a recursive algorithm, the predicted error covariance (P) must also be updated based on the estimation results.
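The prediction and correction steps can be sketched with a textbook constant-velocity Kalman filter. The sketch below is an assumption-laden illustration rather than the paper's exact filter: it tracks one image axis with a two-element state (position, velocity), so a 4-dimensional state would use one instance per axis, and the class name and the noise settings `p0`, `q` and `r` are hypothetical.

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one image axis (state: position, velocity)."""

    def __init__(self, position, velocity=0.0, p0=10.0, q=0.01, r=1.0):
        self.pos = float(position)
        self.vel = float(velocity)
        self.p = [[p0, 0.0], [0.0, p0]]  # error covariance P
        self.q = q                       # process noise added to the diagonal of P
        self.r = r                       # measurement noise variance R

    def predict(self, dt=1.0):
        """Prediction step with transition matrix F = [[1, dt], [0, 1]]."""
        (a, b), (c, d) = self.p
        self.pos += self.vel * dt
        # P <- F P F^T + Q
        self.p = [[a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
                  [c + dt * d, d + self.q]]
        return self.pos

    def correct(self, measured_pos):
        """Correction step with measurement matrix H = [1, 0]."""
        (a, b), (c, d) = self.p
        residual = measured_pos - self.pos
        s = a + self.r                 # innovation covariance H P H^T + R
        k0, k1 = a / s, c / s          # Kalman gain K = P H^T / S
        self.pos += k0 * residual
        self.vel += k1 * residual
        # P <- (I - K H) P
        self.p = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]
        return self.pos, self.vel
```

In a tracking loop, `predict()` would supply the center of the next search region and `correct()` would consume the center of mass measured from the resulting mask. Two independent axis filters are equivalent to a single 4-dimensional filter only when the transition, noise and initial covariance matrices are block-diagonal per axis, which is assumed here.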


Motion consistency across sliced videos with different reference frames can be an issue, because the initialization of the velocity for each reference frame is critical to the accuracy of the position update. To maintain motion consistency, we choose the n-th frame, which is the last frame of the sliced video, as the next reference frame and carry over the refined position and velocity estimates that the Kalman filter produced with respect to the former reference frame. The velocity of the object with respect to the new reference frame can therefore be initialized by mapping the refined velocity to the new reference.

3.2 Tracking Module

Inspired by human perception, we dynamically set up a new search region in the coming frame centered at the predicted object position. We project the estimated object center position back to the pending detection frame using eq. 6.
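As an illustration of the back-projection, the sketch below assumes the homography maps pending-frame coordinates to reference-frame coordinates, so the inverse matrix maps the predicted center back. The direction of the mapping and the function names are assumptions made for this example.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix via its adjugate; raises ValueError if it is singular."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is not invertible")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def project(h, point):
    """Map a 2D point through a homography using homogeneous coordinates."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def back_project(h_frame_to_ref, center_in_ref):
    """Map a center predicted in reference coordinates back to the pending frame."""
    return project(invert_3x3(h_frame_to_ref), center_in_ref)
```

If the homography were instead stored in the reference-to-frame direction, `project` could be applied directly without the inversion.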


Given the estimated position, we set up the search region using a method similar to that in [24]:


To make the search region adaptive, its size is modified according to the predicted velocity using eq. 8, which compares the velocity predicted by the Kalman filter against a velocity threshold. The search region is cropped from the pending detection frame, centered at the projected object position, and then resized to a fixed size.

To make the one-shot segmentation framework suitable for the tracking task, we adopt the optimization strategy used for automatic bounding box generation in VOT-2016 [36], as it offers the highest IoU and mAP as reported in [66].
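The kind of rule described here can be sketched as follows. The exact forms of eq. 7 and eq. 8 are not reproduced in this text, so everything in the sketch is an assumption: the base size uses the context padding popularized by SiamFC [2], and the enlargement factor `enlarge` and the hard threshold are hypothetical choices that only illustrate growing the region for fast objects.

```python
import math

def base_search_size(width, height, context=0.5):
    """Side of a square crop around a (width x height) box with context padding (assumed form)."""
    pad = context * (width + height)
    return math.sqrt((width + pad) * (height + pad))

def adaptive_search_size(width, height, velocity, v_threshold, enlarge=1.5):
    """Use a larger crop when the predicted speed exceeds the threshold (hypothetical rule)."""
    size = base_search_size(width, height)
    speed = math.hypot(velocity[0], velocity[1])
    return size * enlarge if speed > v_threshold else size

def crop_box(center, size):
    """Axis-aligned square (x0, y0, x1, y1) of the given side centered at the predicted position."""
    cx, cy = center
    half = size / 2.0
    return (cx - half, cy - half, cx + half, cy + half)
```

A smooth dependence of the size on speed would be an equally plausible design; the threshold form is used here only because the text mentions a velocity threshold.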

3.3 Segmentation Module

We adopt the SiamMask framework [66], which achieves a good balance between accuracy and speed. SiamMask uses an offline-trained fully-convolutional network to simultaneously produce a binary segmentation mask, a detection bounding box and an objectness score. First, the Siamese network compares a template image against a (larger) search image to obtain a dense response map. The two inputs are processed by the same CNN, yielding two feature maps that are cross-correlated:


Each spatial element of the response map represents the similarity between the template image and a candidate window in the search image. Second, a three-branch head computes the binary segmentation mask, the detection bounding box and the objectness score, respectively. The mask branch predicts a binary mask from each spatial element. The box branch regresses k bounding boxes from each spatial element, where k is the number of anchors. The score branch estimates the corresponding objectness scores.
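To make the response map concrete, the sketch below slides a small template over a larger search map and records one similarity score per candidate window. It is a simplified single-channel illustration on raw numbers; SiamMask itself applies depth-wise cross-correlation to multi-channel CNN features, and the function names here are illustrative.

```python
def cross_correlate(search, template):
    """Dense response map: one dot-product score per template-sized window of the search map."""
    sh, sw = len(search), len(search[0])
    th, tw = len(template), len(template[0])
    response = []
    for top in range(sh - th + 1):
        row = []
        for left in range(sw - tw + 1):
            score = sum(search[top + i][left + j] * template[i][j]
                        for i in range(th) for j in range(tw))
            row.append(score)
        response.append(row)
    return response

def argmax_2d(response):
    """Return (row, column) of the highest response, i.e. the best-matching window."""
    best = None
    best_pos = None
    for r, row in enumerate(response):
        for c, value in enumerate(row):
            if best is None or value > best:
                best, best_pos = value, (r, c)
    return best_pos
```

Each entry of the returned map corresponds to one candidate window, which is the role the spatial elements of the response map play for the three branches.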


A multi-task loss is used to optimise the whole framework.


We refer readers to [49, 50] for details of the mask branch and to [53, 38] for details of the region proposal branch.

4 Experiments

In this section, we evaluate our approach on three tasks: motion prediction, visual object tracking (on VOT 2016 and VOT 2018) and semi-supervised video object segmentation (on DAVIS 2016 and DAVIS 2017). It is worth noting that our method does not depend on a particular choice of tracking and segmentation method. To better evaluate the effectiveness of our method, we adopt SiamMask [66] with the provided pretrained model as our online tracking and segmentation method.

4.1 Evaluation for motion prediction accuracy

Datasets and settings

We adopt two widely used benchmark datasets to evaluate the performance of the motion prediction: VOT 2016 [36] and VOT 2018 [35]. Both are annotated with rotated bounding boxes, and both contain 60 public sequences with different challenging factors: camera motion, object motion change, object size change, occlusion and illumination change, which makes object motion prediction extremely challenging [35]. We use eq. 2 to predict the object position and eq. 3 to extrapolate the object velocity. For our baseline method, the predicted position in the next frame (t+1) is always the same as in the current frame (t), and the object velocity is always predicted as 0. The ground-truth position is the center of the annotated rotated bounding box, and the ground-truth velocity is the difference between two consecutive positions. We evaluate the position error with the Euclidean distance to the ground truth, and the velocity error with the Euclidean distance, the cosine distance and the magnitude distance. The cosine distance is the cosine of the angle between the predicted velocity and the ground-truth velocity (the higher, the better). The magnitude distance is the absolute difference between the magnitudes of the predicted velocity and the ground-truth velocity. We adopt the reinitialization mechanism of the official VOT toolkit: when the segmentation has no overlap with the ground truth, we reinitialize the tracking method with the ground truth after five frames.
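The error measures defined above can be written down directly. The sketch below assumes positions and velocities are 2D tuples in pixels; the function names are illustrative, and returning `None` for a zero-length velocity is an assumption that mirrors the missing cosine entry of the zero-velocity baseline.

```python
import math

def euclidean_error(pred, truth):
    """Euclidean distance between a predicted and a ground-truth 2D vector."""
    return math.hypot(pred[0] - truth[0], pred[1] - truth[1])

def cosine_value(pred_v, true_v):
    """Cosine of the angle between two velocities; None if either has zero length."""
    norm = math.hypot(pred_v[0], pred_v[1]) * math.hypot(true_v[0], true_v[1])
    if norm == 0:
        return None
    return (pred_v[0] * true_v[0] + pred_v[1] * true_v[1]) / norm

def magnitude_error(pred_v, true_v):
    """Absolute difference between the speeds (vector magnitudes)."""
    return abs(math.hypot(pred_v[0], pred_v[1]) - math.hypot(true_v[0], true_v[1]))
```

For a baseline that always predicts zero velocity, `euclidean_error` and `magnitude_error` both reduce to the ground-truth speed, which is why the two velocity columns coincide for SiamMask in Table 2.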

Results on VOT 2016 and VOT 2018

Figure 4: Object center predictions generated by SiamMask and PTS (red for PTS, yellow for SiamMask and blue cross for ground truth; best viewed in color).

Dataset    Tracker      Pos. Err.
VOT 2016   SiamMask     16.281
VOT 2016   PTS (ours)    8.198
VOT 2018   SiamMask     14.593
VOT 2018   PTS (ours)    8.744
Table 1: Position prediction error on VOT 2016 and VOT 2018

Dataset    Tracker      MSE Err.   Cosine   Mag.
VOT 2016   SiamMask     8.274      -        8.274
VOT 2016   PTS (ours)   4.596      0.667    3.190
VOT 2018   SiamMask     7.006      -        7.006
VOT 2018   PTS (ours)   4.298      0.793    2.929
Table 2: Velocity prediction error on VOT 2016 and VOT 2018

Figure 5: Velocity predictions generated by PTS (white for ground truth, black for prediction; both are drawn 5 times longer for better visualization; best viewed in color).

Table 1 presents a comparison of the position prediction results of SiamMask and PTS on the VOT 2016 and VOT 2018 datasets.

As shown in the table, PTS dramatically reduces the prediction error of the object position on both datasets. On VOT 2016, the mean position error is roughly halved, from about 16 pixels to about 8 pixels. Meanwhile, Fig. 4 shows that when the object velocity is high, PTS provides a more accurate prediction than SiamMask, which does not consider the influence of object motion. The results indicate that the decoupling strategy reduces the background uncertainty and that the Kalman filter provides a relatively reliable prediction of the object position in the next frame. Higher accuracy in object position prediction benefits the generation of search regions for object tracking and eventually improves the performance of object segmentation.

For velocity, as can be seen in Table 2, our method significantly reduces the estimation error. On VOT 2018, PTS achieves a cosine value of 0.793 (Table 2), which corresponds to a divergence of about 37 degrees from the ground-truth velocity direction. The main cause of the error is that objects are not always rigid, so the "center of mass" can only approximate the overall motion of the object (Fig. 5). Size changes of the object further increase the prediction error. However, with the correction procedure of the Kalman filter, this error (noise) can be stabilized. One possible way to decrease the velocity prediction error is to track each part of a non-rigid object and group all parts together to obtain the final prediction [52].
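The relation between a mean cosine value and the angular divergence can be checked with a few lines of arithmetic. The helper below is illustrative; applied to the PTS cosine values in Table 2, it gives roughly 37.5 degrees for VOT 2018 (0.793) and roughly 48 degrees for VOT 2016 (0.667).

```python
import math

def divergence_degrees(cosine):
    """Angle in degrees between two directions whose cosine is given."""
    return math.degrees(math.acos(cosine))

vot2018_angle = divergence_degrees(0.793)  # PTS cosine on VOT 2018 from Table 2
vot2016_angle = divergence_degrees(0.667)  # PTS cosine on VOT 2016 from Table 2
```

Note that converting an averaged cosine into an angle is only an approximation of the average angular error, since the arccosine is not linear.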

Figure 6: Qualitative results of our method: the green box is the ground truth, the red box is the bounding box from SiamMask, and the blue box is our bounding box for the mask.

4.2 Evaluation for visual object tracking

Datasets and settings

Similarly, we adopt two widely used benchmarks for the evaluation of the object tracking task, VOT 2016 and VOT 2018, and compare against the state of the art using the official metric, Expected Average Overlap (EAO), which considers both the accuracy and the robustness of a tracker [35]. We further use VOT 2018 to analyze the performance under different conditions.

Results on VOT 2016

VOT 2016
Trackers A R EAO
SiamMask 0.639 0.214 0.433
PTS (Ours) 0.642 0.144 0.471
Table 3: Comparison with SiamMask on VOT 2016

Table 3 presents a comparison of the tracking performance of PTS and SiamMask on the VOT 2016 dataset. Our model improves robustness by about 33% (from 0.214 to 0.144, lower is better) and raises EAO by 8.8%, to 0.471.
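The relative gains can be recomputed from the entries of Table 3. The helper name below is illustrative, and robustness is treated as a lower-is-better score.

```python
def relative_change(baseline, new):
    """Signed relative change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline

eao_gain = relative_change(0.433, 0.471)         # EAO: SiamMask vs. PTS (Table 3)
robustness_drop = relative_change(0.214, 0.144)  # robustness score: SiamMask vs. PTS
```

With the values of Table 3, `eao_gain` is close to +0.088 and `robustness_drop` is close to -0.327, i.e. an EAO gain of about 8.8% and a reduction of the robustness score by about a third.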

Results on VOT 2018

EAO ↑ 0.326 0.337 0.339 0.345 0.356 0.376 0.378 0.380 0.383 0.385 0.389 0.397
Accuracy 0.569 0.566 0.506 0.523 0.519 0.507 0.536 0.609 0.586 0.505 0.503 0.612
Robustness 0.337 0.258 0.239 0.215 0.201 0.155 0.184 0.276 0.276 0.140 0.159 0.220
Table 4: Comparison with the state-of-the-art under EAO, Accuracy, and Robustness on the VOT 2018 benchmark.

In Table 4 we compare our PTS against eleven recently published state-of-the-art trackers on the VOT 2018 benchmark. We establish a new state-of-the-art tracker with 0.397 EAO and 0.612 accuracy. In particular, our accuracy outperforms that of all existing correlation filter-based trackers. This is easy to understand, since our baseline SiamMask relies on deep feature extraction, which is much richer than the features used by existing correlation filter-based methods. More interestingly, PTS also outperforms the baseline SiamMask. Previous research shows that Siamese trackers have a strong center bias regardless of the appearance of the test targets [37]. Thus, by estimating the center of the search region more accurately, Siamese trackers can also achieve better regression results (e.g., bounding box detection or object segmentation). Besides, PTS achieves the best robustness among all Siamese trackers. This is especially encouraging, because low robustness is one of the key vulnerabilities of Siamese trackers. The main reason is that most Siamese networks can only discriminate the foreground from the non-semantic background [71]. In other words, Siamese trackers are not sensitive enough to appearance and often struggle to distinguish the target from surrounding objects. Our proposed PTS adopts a straightforward strategy and improves robustness substantially, from 0.276 to 0.220, which suggests another way to achieve better robustness: setting a more accurate and more targeted search region.

To further analyze where the improvements come from, we show qualitative results of PTS and our baseline in Figure 6. As mentioned above, the robustness gain comes from fewer target switches and fewer missed targets. For example, in the car scenario in Figure 6, when the camera shakes, the center of SiamMask's search region shifts to the left of the tracked car and the tracker finally locks onto the truck. In contrast, since our model accounts for camera motion, the center of our search region stays on the tracked car. This stability comes from the decoupling of camera motion. Another example is Bolt, in the third row of Figure 6. When Bolt accelerates, SiamMask is easily distracted by other runners, whereas PTS does not fail because it accounts for Bolt's speed. This stability comes from object velocity estimation. These features greatly help the performance of PTS under large camera motion, fast object motion and occlusion. Simply put, by predicting the object position accurately, we can focus on a more targeted search region and thus achieve better detection and segmentation performance.

Dataset      Method       J       F
DAVIS 2016   SiamMask     0.713   0.674
DAVIS 2016   PTS (ours)   0.732   0.692
DAVIS 2017   SiamMask     0.543   0.585
DAVIS 2017   PTS (ours)   0.554   0.604
Table 5: J and F Results on DAVIS 2016 and DAVIS 2017

Table 5 presents a comparison of the VOS results of SiamMask and PTS on the DAVIS 2016 and DAVIS 2017 datasets.

Figure 7: Qualitative result of SiamMask and PTS on DAVIS: First row and third row are the results from SiamMask. Second row and fourth row are the results from same videos using PTS. (better view with color)

4.3 Evaluation for video object segmentation

Datasets and settings

We report the performance of PTS on the standard VOS datasets DAVIS 2016 [48] and DAVIS 2017 [51]. For both datasets, we use the official performance measures: the Jaccard index (J) to express region similarity and the F-measure (F) to express contour accuracy. Since we use SiamMask as our baseline, we adopt the semi-supervised setup. We fit bounding boxes to the object masks in the first frame and use these bounding boxes to initialize our proposed PTS.
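Two of the operations mentioned above are easy to sketch on binary masks stored as lists of rows: the Jaccard index (intersection over union) and the fitting of an axis-aligned bounding box to a mask. The function names are illustrative, treating two empty masks as a perfect match is an assumed convention, and the contour-based F-measure is not covered.

```python
def jaccard(pred, truth):
    """Region similarity J: intersection over union of two equally sized binary masks."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return 1.0 if union == 0 else inter / union

def mask_to_box(mask):
    """Tightest axis-aligned box (x_min, y_min, x_max, y_max) around the mask, or None if empty."""
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))
```

The box returned by `mask_to_box` on the first-frame annotation would serve as the initialization of the tracker in the semi-supervised setup.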

Results on DAVIS 2016 and DAVIS 2017

The effect of our approach is limited on the DAVIS 2016 and DAVIS 2017 datasets. The main reason is that the DAVIS datasets contain less camera motion and fast object motion, which are the conditions under which our method gains the most. However, segmentation can still benefit from a more accurately cropped search region. For example, the dog in the third frame of the "Dogs-Jump" video is segmented more completely by PTS, whereas SiamMask misses the tail of the same dog. Another example is the person in the fourth frame of the "Soap-Box" video. PTS separates this person from the soapbox, whereas SiamMask mixes the segmentation with the surrounding pixels. Further, SiamMask fails to distinguish the person mask from the drum of the soapbox because the drum occupies the previous position of the person, a case that cannot be handled without a motion assumption. Through our pre-tracking procedure, PTS can separate a specific instance from its neighboring instances and thus obtain a more accurate segmentation. This suggests that PTS has large potential for segmentation in crowded scenes. For more qualitative results, please refer to Fig. 7.

4.4 Ablation studies

Table 6 compares the influence of each module in our model. On the VOT 2018 dataset, we evaluate the influence of the tracking and prediction modules and compare their performance with the baseline approach (SiamMask) and the full PTS. It can be observed from Table 6 that the prediction module, which uses a Kalman filter to update the object position, plays an important role in PTS: most of the EAO improvement is introduced by the prediction module (from 0.380 to 0.394). The tracking module, i.e., the adaptive search region update, has a more limited influence, with an EAO increase of only 0.002. However, as Table 6 shows, both modules have the potential to improve accuracy.

Method                  EAO     Accuracy   Robustness
SiamMask                0.380   0.609      0.276
SiamMask + Tracking     0.382   0.610      0.268
SiamMask + Prediction   0.394   0.611      0.234
PTS                     0.397   0.612      0.220
Table 6: Ablation studies for Tracking and Prediction modules on VOT 2018 dataset.

5 Conclusion

In conclusion, we introduce a prediction-driven method for visual tracking and segmentation in videos. Instead of relying solely on matching with appearance cues for tracking, we build a predictive model that provides guidance for finding more accurate tracking regions efficiently. We show that this simple idea significantly improves robustness on the VOT and VOS challenges and achieves state-of-the-art performance in both tasks. We hope our work can inspire more studies on the relationship among three main challenges in video understanding: prediction, tracking and segmentation.


  • [1] Linchao Bao, Baoyuan Wu, and Wei Liu. Cnn in mrf: Video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5977–5986, 2018.
  • [2] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
  • [3] David S Bolme, J Ross Beveridge, Bruce A Draper, and Yui Man Lui. Visual object tracking using adaptive correlation filters. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2544–2550. IEEE, 2010.
  • [4] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 221–230, 2017.
  • [5] Carlos R Cassanello, Abhay T Nihalani, and Vincent P Ferrera. Neuronal responses to moving targets in monkey frontal eye fields. Journal of neurophysiology, 2008.
  • [6] Patrick Cavanagh and Stuart Anstis. The flash grab effect. Vision Research, 91:8–20, 2013.
  • [7] Shuyang Chen, Jianren Wang, and Peter Kazanzides. Integration of a low-cost three-axis sensor for robot force control. In 2018 Second IEEE International Conference on Robotic Computing (IRC), pages 246–249. IEEE, 2018.
  • [8] Martin Danelljan, Gustav Häger, Fahad Khan, and Michael Felsberg. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
  • [9] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking. IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2017.
  • [10] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 4310–4318, 2015.
  • [11] Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, and Michael Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In European Conference on Computer Vision, pages 472–488. Springer, 2016.
  • [12] Martin Danelljan, Fahad Shahbaz Khan, Michael Felsberg, and Joost Van de Weijer. Adaptive color attributes for real-time visual tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1090–1097, 2014.
  • [13] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
  • [14] Hamed Kiani Galoogahi, Ashton Fagg, and Simon Lucey. Learning background-aware correlation filters for visual tracking. In ICCV, pages 1144–1152, 2017.
  • [15] Steffen Gauglitz, Tobias Höllerer, and Matthew Turk. Evaluation of interest point detectors and feature descriptors for visual tracking. International journal of computer vision, 94(3):335, 2011.
  • [16] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. Towards a better match in siamese network based visual object tracker. In European Conference on Computer Vision, pages 132–147. Springer, 2018.
  • [17] Anfeng He, Chong Luo, Xinmei Tian, and Wenjun Zeng. A twofold siamese network for real-time object tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
  • [18] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. AMC: AutoML for model compression and acceleration on mobile devices. In Proceedings of the European Conference on Computer Vision (ECCV), pages 784–800, 2018.
  • [19] Yihui He, Xianggen Liu, Huasong Zhong, and Yuchun Ma. Addressnet: Shift-based primitives for efficient convolutional neural networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1213–1222. IEEE, 2019.
  • [20] Yihui He, Xiaobo Ma, Xiapu Luo, Jianfeng Li, Mengchen Zhao, Bo An, and Xiaohong Guan. Vehicle traffic driven camera placement for better metropolis security surveillance. IEEE Intelligent Systems, 33(4):49–61, 2018.
  • [21] Yihui He, Xiangyu Zhang, Marios Savvides, and Kris Kitani. Softer-NMS: Rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545, 2018.
  • [22] Yihui He, Xiangyu Zhang, and Jian Sun. Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1389–1397, 2017.
  • [23] David Held, Sebastian Thrun, and Silvio Savarese. Learning to track at 100 fps with deep regression networks. In European Conference on Computer Vision, pages 749–765. Springer, 2016.
  • [24] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, 2015.
  • [25] Hinze Hogendoorn and Anthony N Burkitt. Predictive coding of visual object position ahead of moving objects revealed by time-resolved EEG decoding. Neuroimage, 171:55–61, 2018.
  • [26] Alex O Holcombe. Seeing slow and seeing fast: two limits on perception. Trends in cognitive sciences, 13(5):216–221, 2009.
  • [27] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In Advances in neural information processing systems, pages 2017–2025, 2015.
  • [28] Varun Jampani, Raghudeep Gadde, and Peter V Gehler. Video propagation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 451–461, 2017.
  • [29] Won-Dong Jang and Chang-Su Kim. Online video object segmentation via convolutional trident network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5849–5858, 2017.
  • [30] Xiaojie Jin, Xin Li, Huaxin Xiao, Xiaohui Shen, Zhe Lin, Jimei Yang, Yunpeng Chen, Jian Dong, Luoqi Liu, Zequn Jie, et al. Video scene parsing with predictive feature learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 5580–5588, 2017.
  • [31] Anna Khoreva, Rodrigo Benenson, Eddy Ilg, Thomas Brox, and Bernt Schiele. Lucid data dreaming for object tracking. In The DAVIS Challenge on Video Object Segmentation, 2017.
  • [32] Hamed Kiani Galoogahi, Terence Sim, and Simon Lucey. Multi-channel correlation filters. In Proceedings of the IEEE international conference on computer vision, pages 3072–3079, 2013.
  • [33] Hamed Kiani Galoogahi, Terence Sim, and Simon Lucey. Correlation filters with limited boundaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4630–4638, 2015.
  • [34] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • [35] Matej Kristan, Ales Leonardis, Jiri Matas, Michael Felsberg, Roman Pflugfelder, Luka Cehovin Zajc, Tomas Vojir, Goutam Bhat, Alan Lukezic, Abdelrahman Eldesokey, Gustavo Fernandez, et al. The sixth visual object tracking VOT2018 challenge results, 2018.
  • [36] Matej Kristan, Jiří Matas, Aleš Leonardis, Michael Felsberg, Gustavo Fernández, et al. The visual object tracking VOT2016 challenge results. In Proceedings of the European Conference on Computer Vision Workshop, pages 777–823. Springer International Publishing, 2016.
  • [37] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. arXiv preprint arXiv:1812.11703, 2018.
  • [38] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8971–8980, 2018.
  • [39] Yang Li and Jianke Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In European conference on computer vision, pages 254–265. Springer, 2014.
  • [40] Yudong Liang, Ze Yang, Kai Zhang, Yihui He, Jinjun Wang, and Nanning Zheng. Single image super-resolution via a lightweight residual convolutional neural network. arXiv preprint arXiv:1703.08173, 2017.
  • [41] Pauline Luc, Natalia Neverova, Camille Couprie, Jakob Verbeek, and Yann LeCun. Predicting deeper into the future of semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 648–657, 2017.
  • [42] Chao Ma, Xiaokang Yang, Chongyang Zhang, and Ming-Hsuan Yang. Long-term correlation tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5388–5396, 2015.
  • [43] Nicolas Märki, Federico Perazzi, Oliver Wang, and Alexander Sorkine-Hornung. Bilateral space video segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 743–751, 2016.
  • [44] Isaac Newton. The Principia: mathematical principles of natural philosophy. Univ of California Press, 1999.
  • [45] Romi Nijhawan. Motion extrapolation in catching. Nature, 1994.
  • [46] David Nilsson and Cristian Sminchisescu. Semantic video segmentation by gated recurrent flow propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6819–6828, 2018.
  • [47] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2663–2672, 2017.
  • [48] Federico Perazzi, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732, 2016.
  • [49] Pedro O Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, pages 1990–1998, 2015.
  • [50] Pedro O Pinheiro, Tsung-Yi Lin, Ronan Collobert, and Piotr Dollár. Learning to refine object segments. In European Conference on Computer Vision, pages 75–91. Springer, 2016.
  • [51] Jordi Pont-Tuset, Federico Perazzi, Sergi Caelles, Pablo Arbeláez, Alex Sorkine-Hornung, and Luc Van Gool. The 2017 DAVIS challenge on video object segmentation. arXiv preprint arXiv:1704.00675, 2017.
  • [52] Deva Ramanan, David A Forsyth, and Andrew Zisserman. Tracking people by learning their appearance. IEEE transactions on pattern analysis and machine intelligence, 29(1):65–81, 2007.
  • [53] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [54] Jeroen BJ Smeets, Eli Brenner, and Marc HE de Lussanet. Visuomotor delays when hitting running spiders. In EWEP 5-Advances in perception-action coupling, pages 36–40. Éditions EDK, Paris, 1998.
  • [55] Deqing Sun, Xiaodong Yang, Ming-Yu Liu, and Jan Kautz. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943, 2018.
  • [56] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. Siamese instance search for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1420–1429, 2016.
  • [57] Yi-Hsuan Tsai, Ming-Hsuan Yang, and Michael J Black. Video segmentation via object flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3899–3908, 2016.
  • [58] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5000–5008. IEEE, 2017.
  • [59] Elle van Heusden, Martin Rolfs, Patrick Cavanagh, and Hinze Hogendoorn. Motion extrapolation for eye movements predicts perceived motion-induced position shifts. Journal of Neuroscience, 38(38):8243–8250, 2018.
  • [60] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364, 2017.
  • [61] Claes Von Hofsten. Eye–hand coordination in the newborn. Developmental psychology, 18(3):450, 1982.
  • [62] Jacob Walker, Kenneth Marino, Abhinav Gupta, and Martial Hebert. The pose knows: Video forecasting by generating pose futures. In Proceedings of the IEEE International Conference on Computer Vision, pages 3332–3341, 2017.
  • [63] Heng Wang and Cordelia Schmid. Action recognition with improved trajectories. In 2013 IEEE International Conference on Computer Vision, pages 3551–3558, 2013.
  • [64] Jianren Wang, Long Qian, Ehsan Azimi, and Peter Kazanzides. Prioritization and static error compensation for multi-camera collaborative tracking in augmented reality. In 2017 IEEE Virtual Reality (VR), pages 335–336. IEEE, 2017.
  • [65] Jianren Wang, Junkai Xu, and Peter B Shull. Vertical jump height estimation algorithm based on takeoff and landing identification via foot-worn inertial sensing. Journal of biomechanical engineering, 140(3):034502, 2018.
  • [66] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. arXiv preprint arXiv:1812.05050, 2018.
  • [67] Geert Willems, Tinne Tuytelaars, and Luc Van Gool. An efficient dense and scale-invariant spatio-temporal interest point detector. In European conference on computer vision, pages 650–663. Springer, 2008.
  • [68] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, 2015.
  • [69] Haisheng Xia, Junkai Xu, Jianren Wang, Michael A Hunt, and Peter B Shull. Validation of a smart shoe for estimating foot progression angle during walking gait. Journal of biomechanics, 61:193–198, 2017.
  • [70] Chenchen Zhu, Yihui He, and Marios Savvides. Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621, 2019.
  • [71] Zheng Zhu, Qiang Wang, Bo Li, Wei Wu, Junjie Yan, and Weiming Hu. Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision, pages 103–119. Springer, 2018.