Hide and Seek tracker: Real-time recovery from target loss

06/20/2018 ∙ by Alessandro Bay, et al. ∙ 0

In this paper, we examine the real-time recovery of a video tracker from a target loss, using information that is already available from the original tracker and without a significant computational overhead. More specifically, before using the tracker output to update the target position we estimate the detection confidence. In the case of a low confidence, the position update is rejected and the tracker passes to a single-frame failure mode, during which the patch low-level visual content is used to swiftly update the object position, before recovering from the target loss in the next frame. Orthogonally to this improvement, we further enhance the running average method used for creating the query model in tracking-through-similarity. The experimental evidence provided by evaluation on standard tracking datasets (OTB-50, OTB-100 and OTB-2013) validates that target recovery can be successfully achieved without compromising the real-time update of the target position.




1 Introduction

Recent advances in deep learning architectures significantly improved the state of the art in a number of computer vision applications. In general, it is rather common in computer vision applications to rely on a convolutional neural network (CNN) pre-trained on large datasets such as ImageNet [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton], which is subsequently fine-tuned to be transferred to a different task. Examples include the use of CNNs for object detection [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik], semantic segmentation [Long et al.(2015)Long, Shelhamer, and Darrell], caption generation [Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhudinov, Zemel, and Bengio], action recognition [Simonyan and Zisserman(2014)], etc.

On the other hand, the task of visual tracking consists in tracking arbitrary objects in videos, given the starting position in the initial frame [Yilmaz et al.(2006)Yilmaz, Javed, and Shah]. Because of the inherent ad-hoc nature of this setup, deep-learning based architectures struggled for some years to bring disruptive changes to tracking [Nam and Han(2016)]. Recently, however, major advances in deep-learning architectures applicable to tracking have significantly improved the state of the art, both for real-time [Bertinetto et al.(2016)Bertinetto, Valmadre, Henriques, Vedaldi, and Torr, Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr] and non-real-time tracking [Yun et al.(2017)Yun, Choi, Yoo, Yun, and Choi, Danelljan et al.(2016)Danelljan, Robinson, Khan, and Felsberg].

A significant adjustment towards this path was to gradually replace the transfer learning from large classification datasets with techniques that either perform on-line training (e.g. ECO [Danelljan et al.(2017)Danelljan, Bhat, Khan, and Felsberg]) or are trained off-line in a setup that is only loosely connected to semantic classification (e.g. CFNet [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]). The on-line methods currently achieve state-of-the-art performance but not real-time tracking, while the off-line methods achieve real-time tracking but not state-of-the-art performance. While both on-line and off-line techniques progressively converge to the optimisation of the tracker accuracy and its computational complexity, several significant challenges remain unresolved, including the sensitivity to distractors [Wang et al.(2015)Wang, Ouyang, Wang, and Lu], object disappearance and re-appearance events [Valmadre et al.(2018)Valmadre, Bertinetto, Henriques, Tao, Vedaldi, Smeulders, Torr, and Gavves] and the recovery from target loss [Tao et al.(2017)Tao, Gavves, and Smeulders].

This work deals with the last of these challenges, i.e. the real-time recovery of a tracker from a temporary failure. The difficulty of tracker recovery originates from its main design principle, i.e. its ability to accumulate correct object positions for a substantial amount of time. As a matter of fact, tracking could be viewed as an accumulative retrieval problem, in which the performance is evaluated based on the algorithm's potential to retrieve the correct object position in each and every frame that constitutes a video signal. This strength turns into a significant flaw when the tracker temporarily loses the position of the object due to an abrupt camera movement, an unexpected and abrupt object movement, a technical problem, etc. Once the sampling drift [Tao et al.(2017)Tao, Gavves, and Smeulders] causes the tracker bounding box to no longer intersect with the object, the same accumulative behaviour “secures” that the tracker will remain in the background with little possibility of recovery.

A possible method to tackle this issue is to exploit the tracking-through-similarity paradigm that is adopted by CFNet [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]. The assumption that a similarity search among candidates is characterised as ambiguous/unambiguous depending on the global maximum magnitude in relation to the local maxima is rather common in computer vision (e.g. in image matching [Lowe(2004)]). Following a similar rationale, it is suggested to declare a tracking-through-similarity tracker output ambiguous in the cases that the similarity peak is not prominent. Subsequently, the tracker could pass to a “failure mode”, during which it would recover from the possible target loss before returning to “normal mode”. In this work we introduce the “Hide and Seek” (HnS) architecture and test it experimentally on state-of-the-art benchmarks.

Additionally, the original CFNet algorithm uses a sub-optimal approach to estimate the running average of the feature map used for the query. More specifically, the CFNet running average is updated very slowly, thus causing a dependency on the object visibility in the first video frame. This flaw is not striking in the OPE evaluation because for most of the benchmark videos the object to be tracked is clearly visible in the first frame. However, this introduces an unnecessary sensitivity to the initialisation frame (apparent in the TRE evaluation), which could be avoided if the running average estimation becomes smoother.

In summary, apart from the original architecture, the main contributions of the paper are the following:

  1. The ambiguity measure that estimates the confidence on the tracker output.

  2. The enlargement of the search area when the tracker is on “failure mode”.

  3. The use of a simple low-level representation as a short-term, backup tracker.

  4. An improved running average method for the query model of the tracker.

The paper is structured as follows. The relevant literature is reviewed in Section 2, before passing to the presentation of the HnS tracker in Section 3. The experimental evaluation is conducted in Section 4, while Section 5 concludes this work.

2 Related works

2.1 Real-time deep learning trackers

Due to its major significance, the tracking of moving objects in videos has a long history of research and development. A qualitative comparison of a review on early-days trackers [Yilmaz et al.(2006)Yilmaz, Javed, and Shah] and a corresponding publication almost a decade later [Smeulders et al.(2015)Smeulders, Chu, Cucchiara, Calderara, Dehghan, and Shah] (the latter just before deep-learning trackers) reveals the substantial progress achieved in the meantime, which has led to algorithms that could perform tracking with high precision [Hare et al.(2011)Hare, Saffari, and Torr], in real-time [Bolme et al.(2010)Bolme, Beveridge, Draper, and Lui] and with the additional capability to recover from severe occlusions or the loss of target [Kalal et al.(2012)Kalal, Mikolajczyk, and Matas].

The maturity of the domain was one of the reasons that deep-learning trackers were initially struggling to outperform “classical” trackers [Nam and Han(2016)]. But perhaps a more important reason was the inherent characteristics of the tracking setup that undermined a straightforward transfer of deep-learning techniques to this task [Zhang et al.(2017)Zhang, Maei, Wang, and Wang]. More specifically, (1) maximising heatmaps corresponding to semantic classes is not necessarily the optimal strategy to locate a specific object [Ma et al.(2015)Ma, Huang, Yang, and Yang] (especially when distractors are present), (2) off-line training is hampered by the lack of large datasets for this task, and (3) on-line training is prohibitively slow for applications requiring real-time tracking, especially if the model update is conducted in each and every frame [Danelljan et al.(2017)Danelljan, Bhat, Khan, and Felsberg], [Tao et al.(2017)Tao, Gavves, and Smeulders].

On the other hand, the visual similarity of the tracked object between two consecutive frames is implied by the small temporal window between them ($1/f$ seconds for a video at $f$ fps). Based on this rationale, Bertinetto et al. introduced a tracking-by-similarity tracker [Bertinetto et al.(2016)Bertinetto, Valmadre, Henriques, Vedaldi, and Torr] that despite its simple architecture achieved state-of-the-art real-time performance. In this setup, the similarity is learned through a Siamese deep network that is trained offline, while the localisation is conducted through a correlation filter [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr].

One of the main issues with such a tracker is the sampling drift [Tao et al.(2017)Tao, Gavves, and Smeulders] (from which the tracker often fails to recover) that occurs in the case of abrupt object and/or camera motion. Two main solutions have been recently proposed: (1) in [Valmadre et al.(2018)Valmadre, Bertinetto, Henriques, Tao, Vedaldi, Smeulders, Torr, and Gavves] the authors assume that the heatmap that is generated in the final stage of the algorithm incorporates the localisation ambiguity by not exhibiting a clear peak, and they temporarily discard peaks below a threshold, and (2) in [Tao et al.(2017)Tao, Gavves, and Smeulders] it is suggested to interleave a global similarity search (in the whole frame) every $N$ frames ($N$ being a hard-coded constant parameter) to avoid the model drift.

In this work, we carefully amalgamate the above solutions in order to optimise both the accuracy and the computational time. Firstly, due to the parameter sensitivity of a comparison to a hard-coded value, the localisation ambiguity is detected using the Nearest Neighbour Distance Ratio (NNDR). If a localisation is declared ambiguous the tracker passes to “failure mode”, during which the window is expanded (but without covering the whole frame, as in [Tao et al.(2017)Tao, Gavves, and Smeulders], so as to reduce the computational cost). Finally, while the tracker is in failure mode, tracking is conducted using correlation of low-level patch representations. The latter builds upon the recent improvements in tracker performance that were achieved by including the first network layers, which model the low-level image content [Danelljan et al.(2016)Danelljan, Robinson, Khan, and Felsberg], [Danelljan et al.(2017)Danelljan, Bhat, Khan, and Felsberg]. In the current work, the low-level image content is modelled through the Census transform [Hirschmuller and Scharstein(2009)], in order not to compromise the computational cost, and because the Census transform has exhibited exceptional robustness in low-level matching [Hirschmuller and Scharstein(2009)].

2.2 Nearest Neighbour Distance Ratio

The use of the Nearest Neighbour Distance Ratio (NNDR), i.e. the ratio of the distance to the nearest neighbour over the distance to the second nearest neighbour, to validate candidate matches is an idea originating from computer vision [Lowe(2004)] and has been extensively used in applications such as image matching (e.g. [Paul and Pati(2016)], [Sedaghat and Ebadi(2015)], [Sidiropoulos and Muller(2015)]) to discard erroneous results.

The basic assumption of NNDR is that the value corresponding to the correct estimation and the values corresponding to all false estimations derive from two distinct and non-overlapping distributions. Following this assumption, the two nearest neighbour values should be samples from separate distributions, thus generating a high NNDR value. Conversely, a low NNDR value implies that the two nearest neighbours are sampled from the same distribution, therefore, the detection result may be declared ambiguous and discarded.

The most important property of NNDR, which is largely the root cause of its extensive use in the literature, is its performance robustness. NNDR is expected to exhibit almost identical performance for a large range of threshold values [Sidiropoulos and Muller(2015)], thus reducing the parameter sensitivity of the algorithm employed. For similar reasons, thresholding the nearest neighbour value is a sub-optimal approach in applications in which the nearest neighbour values exhibit significant variance and their magnitude is difficult to predict systematically.
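The contrast between thresholding a ratio and thresholding the raw nearest neighbour value can be illustrated with a minimal Python sketch. The function name `nndr`, the candidate distances and the cut-off of 3.0 are illustrative assumptions and do not come from the original method; the ratio follows the definition given above (nearest over second nearest distance).

```python
import numpy as np

def nndr(distances):
    """Ratio of the smallest distance to the second smallest distance."""
    d = np.sort(np.asarray(distances, dtype=float))
    if d.size < 2:
        raise ValueError("NNDR needs at least two candidates")
    return d[0] / d[1]

# Two hypothetical sets of candidate distances that differ only by a global
# scale factor (e.g. descriptors normalised in a different way).
candidates = np.array([2.0, 4.0, 8.0])
rescaled = 10.0 * candidates

# The ratio is unaffected by the rescaling.
ratio_a = nndr(candidates)
ratio_b = nndr(rescaled)

# A fixed cut-off on the nearest distance (3.0 is an arbitrary value chosen
# for this example) gives a different decision for the two sets.
absolute_a = candidates.min() < 3.0
absolute_b = rescaled.min() < 3.0
```

In this toy case the ratio test reaches the same decision for both sets, while the absolute cut-off accepts one and rejects the other, which is the parameter sensitivity that the ratio is meant to avoid.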

Due to the NNDR robustness and high performance, in this work we propose using it to discriminate between ambiguous and unambiguous peaks in the tracker heatmap. While the application setup is different, from a data-science point of view the similarity is apparent. The tracker heatmap is a matrix of distances, one of which corresponds to the actual object position while the rest correspond to erroneous entries that can be in general assumed to be selected from a distinct distribution. The use of NNDR could allow the identification of ambiguous peaks, thus allowing the pipeline to pass to “failure mode”, before recovering the object position in subsequent frames and continuing in its “normal mode”.

3 Hide and seek (HnS) tracker

In this Section we describe Hide and Seek (HnS, Figure 1), our novel tracker that builds upon CFNet in order to achieve real-time target recovery.

Figure 1: The architecture of the HnS tracker

3.1 Heatmap confidence evaluation

The main concept of our architecture lies in the assumption that the CFNet heatmap output can be used to evaluate the confidence in the bounding box update. The confidence is modelled through the ratio of the two most dominant peaks, which are estimated as follows: first the correlation filter map is projected onto two planes, before being differentiated twice in order to identify the local maxima, which determine the two most dominant peaks. Since CFNet estimates patch similarity, these peaks correspond to the nearest neighbour and the second nearest neighbour of the current bounding box. Their ratio (i.e. NNDR) is used to evaluate the confidence in the tracker output.

More specifically, if the ratio is above the confidence threshold the tracker output is considered safe; therefore, the top peak is followed to update the object position and the tracking continues following the CFNet algorithm in the next frame. On the other hand, if the ratio is below the confidence threshold the tracker output is considered ambiguous and the tracker passes to failure mode.
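The confidence test described above can be sketched in Python as follows. This is a sketch under stated assumptions rather than the authors' implementation: the projections are taken as the maximum of the heatmap along each image axis, local maxima are found from the sign change of the first difference (a discrete stand-in for the derivative test), the smaller of the two per-projection ratios is used as the confidence, the heatmap is assumed to be positive, and the threshold value is hypothetical. All function names are illustrative.

```python
import numpy as np

def local_maxima(profile):
    """Indices of strict local maxima of a 1-D profile (borders included)."""
    padded = np.concatenate(([-np.inf], profile, [-np.inf]))
    d = np.diff(padded)
    # A peak is where the first difference changes sign from + to -.
    return np.where((d[:-1] > 0) & (d[1:] < 0))[0]

def peak_ratio(profile):
    """Ratio of the highest peak to the second highest peak of a profile."""
    peaks = np.sort(profile[local_maxima(profile)])[::-1]
    if peaks.size < 2:
        return np.inf  # fewer than two peaks: nothing competes with the top one
    return peaks[0] / peaks[1]

def heatmap_confidence(heatmap):
    """Smallest top-to-second peak ratio over the two axis projections."""
    heatmap = np.asarray(heatmap, dtype=float)
    profile_x = heatmap.max(axis=0)  # one value per column
    profile_y = heatmap.max(axis=1)  # one value per row
    return min(peak_ratio(profile_x), peak_ratio(profile_y))

def gaussian_blob(shape, centre, sigma=2.0):
    """Synthetic peak used only to build toy heatmaps for this example."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    sq_dist = (ys - centre[0]) ** 2 + (xs - centre[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

THRESHOLD = 1.5  # hypothetical confidence threshold

clear = gaussian_blob((40, 40), (10, 10))
ambiguous = clear + 0.9 * gaussian_blob((40, 40), (30, 25))

clear_is_safe = heatmap_confidence(clear) >= THRESHOLD
ambiguous_is_safe = heatmap_confidence(ambiguous) >= THRESHOLD
```

With a single peak the toy heatmap is accepted, while a second peak of comparable height pushes the ratio close to 1 and the output is flagged as ambiguous.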

The following measures are taken while the tracker is in failure mode: (1) the CFNet output is not used to update the object position (the top peak position is ignored), (2) the object position is updated following the estimation of the backup tracker, and (3) in the following frame the object is searched for in an area wider than the original one.

The backup tracker is based on correlating Census-transformed [Zabih and Woodfill(1994)] image patches. The Census transform is a simple and powerful low-level representation of the image content that presents a set of positive characteristics: (1) it has linear computational complexity, (2) it preserves the object edges, (3) it is robust to radiometric differences (which may occur during tracking due to abrupt camera motion) [Hirschmuller and Scharstein(2009)], and (4) it generates robust optic flow estimations [Hafner et al.(2013)Hafner, Demetz, and Weickert]. The 8-bit binary strings that the Census transform generates for each pixel are converted into decimal numbers by iteratively applying a circular shift to the bit string before conversion. The result is correlated and the position of the maximum value is followed to update the object bounding box. It should be noted that from the deep-learning point of view the Census transform could be considered as a hand-crafted filter of a single layer of a neural network.
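To make the low-level representation concrete, here is a minimal Python sketch of a 3x3 Census transform, which yields one 8-bit code per pixel. The neighbour ordering, the edge padding and the function name are assumptions made for this example; the circular-shift step and the correlation stage of the backup tracker are not reproduced.

```python
import numpy as np

def census_transform_3x3(image):
    """Return an 8-bit Census code per pixel of a 2-D greyscale image.

    Each bit records whether one of the 8 neighbours is darker than the
    centre pixel. Border pixels use edge replication (an assumption).
    """
    image = np.asarray(image, dtype=np.float64)
    height, width = image.shape
    padded = np.pad(image, 1, mode="edge")
    codes = np.zeros((height, width), dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1),
               (0, -1),           (0, 1),
               (1, -1),  (1, 0),  (1, 1)]
    for dy, dx in offsets:
        neighbour = padded[1 + dy:1 + dy + height, 1 + dx:1 + dx + width]
        # Shift the bits collected so far and append the new comparison bit.
        codes = (codes << 1) | (neighbour < image).astype(np.uint8)
    return codes

# Toy check of the radiometric robustness: a gain and offset change keeps
# the ordering of pixel values, so the codes do not change.
rng = np.random.default_rng(0)
patch = rng.integers(0, 64, size=(16, 16))
codes_original = census_transform_3x3(patch)
codes_brighter = census_transform_3x3(3 * patch + 7)

bright_centre = np.array([[5, 5, 5], [5, 9, 5], [5, 5, 5]])
codes_centre = census_transform_3x3(bright_centre)
```

Because only the ordering between a pixel and its neighbours is stored, any strictly increasing intensity change leaves the codes untouched, which matches property (3) above.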

3.2 Smooth running average

The strategy used in CFNet [Bertinetto et al.(2016)Bertinetto, Valmadre, Henriques, Vedaldi, and Torr] to create the query model from the previously seen feature maps is a simple running average:

$$q_t = (1 - \alpha)\, q_{t-1} + \alpha\, f_t \qquad (1)$$

where $q_t$ is the $t$-th query model, $f_t$ is the $t$-th feature map, and the update factor $\alpha$ was empirically set to a small value. As a result, during the first video frames (in which case $t$ is small compared to $1/\alpha$), $q_t$ is dominated by the first feature map $f_1$.

This approach is a reasonable measure against sampling drift in OPE benchmarking, since the first frame of the video usually captures the object from an angle that allows a clear identification (e.g. a person would usually be captured from an angle that makes the face fully visible and not from the back). However, such a strong dependence on the first feature map is expected to be suboptimal in the general case, especially in practical applications in which the tracked object would be initialised with an arbitrary view and with sub-optimal quality.

The dependence from the first frame is reduced by updating Eq. 1 as follows:


Eqs. 1 and 2 converge asymptotically. Their main difference is that Eq. 2 generates a “smooth average” over the first frames, by creating a “bootstrap” model which uses a significant number of initial frames, instead of a single frame. As will be demonstrated in the next Section, this significantly improves the performance in the more challenging TRE evaluation, a benchmark that measures the robustness of the tracker to the initial object view.
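The difference between the two update rules can be illustrated with a short Python sketch. The "smooth" variant below is only one rule that is consistent with the description (a bootstrap over the initial frames that converges to the plain running average): it replaces the fixed update factor with max(alpha, 1/t), which is an assumption and not necessarily the exact form of Eq. 2. The function names, the scalar "feature maps" and the value alpha = 0.01 are likewise hypothetical.

```python
def running_average(features, alpha):
    """Plain running average with a fixed update factor, started at f_1."""
    query = features[0]
    for feature in features[1:]:
        query = (1.0 - alpha) * query + alpha * feature
    return query

def smooth_running_average(features, alpha):
    """Assumed smooth variant: update factor max(alpha, 1/t) at frame t.

    While 1/t >= alpha this is the arithmetic mean of the first t frames;
    afterwards it behaves exactly like the fixed-factor rule.
    """
    query = features[0]
    for t, feature in enumerate(features[1:], start=2):
        factor = max(alpha, 1.0 / t)
        query = (1.0 - factor) * query + factor * feature
    return query

# Toy sequence: a poor first observation (0.0) followed by nine good ones (1.0).
toy_features = [0.0] + [1.0] * 9
ALPHA = 0.01  # hypothetical small update factor

plain = running_average(toy_features, ALPHA)
smooth = smooth_running_average(toy_features, ALPHA)
```

In this toy sequence the plain rule is still dominated by the first observation after ten frames, whereas the assumed smooth rule has already moved to the mean of the ten frames.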

The introduced algorithm is analytically presented in Algorithm 1, where we highlight in bold the differences from the classic CFNet algorithm. The parameter values are the ones used in our implementation.

Algorithm 1: Hide and seek algorithm (HnS). Procedure HnS tracker, which returns the track of the target starting from its initial bounding box; the listing iterates over the video frames and contains nested conditional branches.
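The control flow described in Section 3.1 can be sketched in Python as below. This is a simplified illustration and not the authors' listing: the callables `primary` and `backup`, their signatures, the dictionary-based toy frames, the threshold and the `wide_scale` factor are all hypothetical stand-ins for the CFNet step, the Census-based backup tracker and the widened search area.

```python
def hns_track(frames, init_box, primary, backup, threshold, wide_scale=2.0):
    """Track a target, switching to a backup tracker on low confidence.

    primary(frame, box, search_scale) -> (candidate_box, confidence)
    backup(frame, box) -> box
    Returns a list of (box, mode) pairs, one per processed frame.
    """
    box = init_box
    search_scale = 1.0
    track = []
    for frame in frames:
        candidate, confidence = primary(frame, box, search_scale)
        if confidence >= threshold:
            # Normal mode: follow the top peak, use the default search area.
            box, search_scale, mode = candidate, 1.0, "normal"
        else:
            # Failure mode: ignore the primary output, trust the backup
            # tracker, and search a wider area in the next frame.
            box = backup(frame, box)
            search_scale, mode = wide_scale, "failure"
        track.append((box, mode))
    return track

# Toy usage with stub trackers that read pre-computed values from each frame.
scales_seen = []

def stub_primary(frame, box, search_scale):
    scales_seen.append(search_scale)
    return frame["primary_box"], frame["confidence"]

def stub_backup(frame, box):
    return frame["backup_box"]

toy_frames = [
    {"primary_box": (1, 1), "confidence": 3.0, "backup_box": (9, 9)},
    {"primary_box": (50, 50), "confidence": 1.1, "backup_box": (2, 2)},
    {"primary_box": (3, 3), "confidence": 2.5, "backup_box": (9, 9)},
]
toy_track = hns_track(toy_frames, (0, 0), stub_primary, stub_backup,
                      threshold=1.5)
```

In the toy run the second frame has a low confidence, so the backup position is used for that frame and the third frame is searched with the wider area before the tracker returns to normal mode.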

4 Experimental Results

4.1 Implementation Details

The CFNet implementation that we used is the code that was provided by the authors [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]. In order to demonstrate the validity of each of the measures taken when the tracker passes to failure mode, the following variations of the HnS tracker have been evaluated:

  • HnS0, in which the position of the bounding box is not updated until the NNDR is above the confidence threshold, but the search area for the subsequent frames is double the original.

  • HnS1, in which the new position is extrapolated from the position of the object in the past two frames using bilinear interpolation.

  • HnS, in which the algorithm presented in Section 3.1 is followed (but not the smooth average).

  • HnSSA, in which the HnS algorithm is combined with the smooth average approach.

All of the calculations were performed with MATLAB R2017a, MatConvNet 1.0-beta25, gcc/g++-4.9, Cuda-8.0, Cudnn-5.1, on an i7-6800K CPU @ 3.40GHz × 12 workstation with 32 GB RAM and a single NVIDIA GeForce GTX 1080Ti graphics card.

4.2 Benchmark

For evaluating our algorithm, we use the object tracking benchmark (OTB, [Wu et al.(2013)Wu, Lim, and Yang]). As in [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr], we considered three benchmarks: the entire dataset with 100 videos (OTB-100), its subset comprising 50 videos (OTB-50), and the official OTB-2013 challenge. For all of them, we compare One-Pass Evaluation (OPE) and Temporal Robustness Evaluation (TRE). The TRE metric consists in choosing 20 equispaced points per video sequence and running the tracker from each of them until the end, while OPE metric processes the video from the beginning until the end, which is equivalent to the first trial of TRE. Both metrics are evaluated in terms of overlap and precision. In particular, the precision is computed as the percentage of frames whose estimated location is within a given threshold distance of the ground truth. As a representative precision score, we use the score for the threshold of 20 pixels [Babenko et al.(2011)Babenko, Yang, and Belongie]. On the other hand, overlap is computed as the success rate of frames whose intersection over union (IoU) of the predicted bounding box and the ground truth is larger than a given threshold. In this case too, we followed the standard literature approach (e.g. [Wang et al.(2015)Wang, Ouyang, Wang, and Lu]) to report the area under curve (AUC) of each success plot to rank the tracking algorithms, instead of the success rate at a specific threshold. Finally, the tracker speed (in fps) is also reported for each method.
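The two scores described above can be sketched in Python as follows. Boxes are assumed to be given as (x, y, width, height), the success curve is sampled at 21 evenly spaced IoU thresholds between 0 and 1, and the AUC is taken as the mean of the sampled success rates; these conventions and all function names are assumptions of this example, not details stated in the text.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def centre(box):
    x, y, w, h = box
    return (x + w / 2.0, y + h / 2.0)

def precision(predicted, ground_truth, threshold=20.0):
    """Fraction of frames whose centre error is within `threshold` pixels."""
    errors = [np.hypot(centre(p)[0] - centre(g)[0],
                       centre(p)[1] - centre(g)[1])
              for p, g in zip(predicted, ground_truth)]
    return float(np.mean(np.array(errors) <= threshold))

def success_auc(predicted, ground_truth, thresholds=None):
    """Mean success rate over a set of IoU thresholds (area under the plot)."""
    if thresholds is None:
        thresholds = np.linspace(0.0, 1.0, 21)
    ious = np.array([iou(p, g) for p, g in zip(predicted, ground_truth)])
    rates = [(ious > t).mean() for t in thresholds]
    return float(np.mean(rates))

# Toy sequence of three frames: a perfect hit, a 5-pixel shift and a miss.
toy_gt = [(0, 0, 10, 10)] * 3
toy_pred = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]

toy_precision = precision(toy_pred, toy_gt)
toy_auc = success_auc(toy_pred, toy_gt)
```

Reporting the area under the success plot, rather than the success rate at one IoU threshold, avoids tying the ranking of the trackers to a single arbitrary cut-off.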

4.3 Results

The performance achieved by the HnS variations described in Section 4.1 is presented in Table 1 and compared to the CFNet2 baseline [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]. It should be noted that due to compatibility issues between MATLAB R2017a (used in this work) and R2015a (used in [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]), the reported results slightly diverge from the ones reported in [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr].

                       OTB-2013                   OTB-50                     OTB-100
                       OPE          TRE           OPE          TRE           OPE          TRE
Method          fps    IoU   prec   IoU   prec    IoU   prec   IoU   prec    IoU   prec   IoU   prec
CFNet2          71.1   57.0  74.1   59.9  76.1    49.4  64.3   52.9  69.0    55.7  71.6   58.2  73.7
CFNet2+HnS0     64.2   57.4  74.4   60.2  76.5    49.1  64.1   53.0  69.2    54.6  70.3   58.0  73.4
CFNet2+HnS1     64.0   57.9  75.2   60.2  76.5    48.9  63.8   53.0  69.0    54.8  69.9   58.1  73.6
CFNet2+HnS      61.6   60.0  78.2   60.6  76.9    51.3  68.1   53.9  70.7    57.3  74.1   58.7  74.4
CFNet2+HnSSA    62.3   58.7  76.8   61.9  79.8    51.5  68.5   55.0  73.2    57.1  73.9   59.9  77.0

Table 1: The results of our baseline (CFNet2 [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr]) along with the tested variations of HnS: HnS0, HnS1, HnS and HnS+smooth average (HnSSA). The best performance is highlighted in bold.

A first comment is that HnS improves both precision and accuracy relative to the baseline, without significant computational overhead. The fact that both HnS and HnSSA outperformed CFNet2 in all evaluation scenarios demonstrates the potential of the introduced method to improve tracking through real-time recovery from a target loss. In comparison, the ambiguous results achieved by HnS0 and HnS1 imply that approaches which rely on simple bounding box updates, instead of a “failure mode” (backup) tracker, would generally fail to exploit the identification of low-confidence heatmaps.

The comparison between HnS and HnSSA validates the analysis conducted in Section 3.2. HnSSA clearly outperforms HnS in all TRE evaluations, while it achieves similar performance in OPE in two out of three evaluations (OTB-2013 and OTB-100) and worse in OTB-50. This is aligned with the dependence of the TRE on the object viewing angle (in the original frame), thus signifying that in most practical applications (in which the object viewing angle in the first frame is not generally known) HnSSA should be preferred.

Analysing various attributes describing the videos, we find that our HnS approach significantly increases CFNet performance in those videos characterised by occlusion, motion blur, out of view, and low resolution, as shown in Figure 2 for the OTB-100 dataset, IoU curve scores and OPE metric. Note that in the low resolution and motion blur examples, i.e. in videos of low quality, HnS clearly outperforms HnSSA. This implies that HnS is to be preferred in applications in which the video quality is expected to be low and the initial frame is of good quality.

Figure 2: OPE of IoU scores for relevant attributes (occlusion, motion blur, out of view, and low resolution, respectively) on the OTB-100 dataset.

Finally, as an illustrative example, we report the very challenging case of tracking a motorbike in Figure 3. In frame 76 (Figure 3(a)) the baseline tracker is about to lose the target due to a false detection in the background. However, our HnS method finds two peaks in the feature map (Figure 3(d)) through its projection onto two planes (Figures 3(b) and 3(c), respectively). Therefore, HnS avoids the sampling drift by passing to failure mode, before recovering the bike position in the next frame.

Figure 3: Example of tracking a motorbike: due to the presence of multiple peaks (b-c) in the feature map (d), our HnS method rejects the peak, passes to failure mode and retrieves the motorbike in the next frame (a), instead of failing.

5 Conclusions and future developments

In this work we examined the hypothesis that in tracking-through-similarity algorithms the output heatmap that determines the object position in the next frame could be used to alert about possible sampling drift. The experimental results confirm this hypothesis, while additionally validating the use of a fast and simple classical tracker as a “failure mode” tracker, which estimates the target position in the current frame, allowing the tracker to recover in the next frames. Developing a reliable recovery strategy in case of object loss is crucial for real-world applications of video tracking technology, where the behaviour of targets is often more complicated than what happens in benchmark videos. Finally, the benefits from a slightly more elaborate running average method suggest that the use of a deep-learning approach (such as an RNN) has a big potential for further improvement. This will be our main focus in the near future.


  • [Babenko et al.(2011)Babenko, Yang, and Belongie] Boris Babenko, Ming-Hsuan Yang, and Serge Belongie. Robust object tracking with online multiple instance learning. IEEE transactions on pattern analysis and machine intelligence, 33(8):1619–1632, 2011.
  • [Bertinetto et al.(2016)Bertinetto, Valmadre, Henriques, Vedaldi, and Torr] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
  • [Bolme et al.(2010)Bolme, Beveridge, Draper, and Lui] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2010.
  • [Danelljan et al.(2016)Danelljan, Robinson, Khan, and Felsberg] M. Danelljan, A. Robinson, F. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In IEEE European conference on computer vision, 2016.
  • [Danelljan et al.(2017)Danelljan, Bhat, Khan, and Felsberg] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6931–6939, 2017.
  • [Girshick et al.(2014)Girshick, Donahue, Darrell, and Malik] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [Hafner et al.(2013)Hafner, Demetz, and Weickert] D. Hafner, O. Demetz, and J. Weickert. Why is the census transform good for robust optic flow computation. In Scale Space and Variational Methods in Computer Vision, 2013.
  • [Hare et al.(2011)Hare, Saffari, and Torr] S. Hare, A. Saffari, and H. S. Torr. Struck: Structured output tracking with kernels. In IEEE International conference on computer vision, 2011.
  • [Hirschmuller and Scharstein(2009)] H. Hirschmuller and D. Scharstein. Evaluation of stereo matching costs on images with radiometric differences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(9):1582–1599, 2009.
  • [Kalal et al.(2012)Kalal, Mikolajczyk, and Matas] Z. Kalal, K. Mikolajczyk, and J. Matas. Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7):1409–1422, 2012.
  • [Krizhevsky et al.(2012)Krizhevsky, Sutskever, and Hinton] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [Long et al.(2015)Long, Shelhamer, and Darrell] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [Lowe(2004)] D. Lowe. Distinctive image features from scale-invariant keypoints. Int. Journal of Computer Vision, 60(2):91–110, 2004.
  • [Ma et al.(2015)Ma, Huang, Yang, and Yang] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In IEEE International conference on computer vision, 2015.
  • [Nam and Han(2016)] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016.
  • [Paul and Pati(2016)] S. Paul and U. C. Pati. Remote sensing optical image registration using Modified Uniform Robust SIFT. IEEE Transactions on Geoscience and Remote Sensing Letters, 13(9):1300–1304, 2016.
  • [Sedaghat and Ebadi(2015)] A. Sedaghat and H. Ebadi. Remote sensing image matching based on Adaptive Binning SIFT descriptor. IEEE Transactions on Geoscience and Remote Sensing, 53(10):5283–5293, 2015.
  • [Sidiropoulos and Muller(2015)] P. Sidiropoulos and J.-P. Muller. Matching of large images through coupled decomposition. IEEE Transactions on Image Processing, 24(7):2124–2139, 2015.
  • [Simonyan and Zisserman(2014)] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in neural information processing systems, pages 568–576, 2014.
  • [Smeulders et al.(2015)Smeulders, Chu, Cucchiara, Calderara, Dehghan, and Shah] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, 2015.
  • [Tao et al.(2017)Tao, Gavves, and Smeulders] R. Tao, E. Gavves, and A. W.M. Smeulders. Tracking for half an hour. In arXiv preprint arXiv:1711.10217, 2017.
  • [Valmadre et al.(2018)Valmadre, Bertinetto, Henriques, Tao, Vedaldi, Smeulders, Torr, and Gavves] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. Smeulders, P. Torr, and E. Gavves. Long-term tracking in the wild: A benchmark. In arXiv preprint arXiv:1803.09502, 2018.
  • [Valmadre et al.(2017)Valmadre, Bertinetto, Henriques, Vedaldi, and Torr] Jack Valmadre, Luca Bertinetto, João Henriques, Andrea Vedaldi, and Philip HS Torr. End-to-end representation learning for correlation filter based tracking. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 5000–5008. IEEE, 2017.
  • [Wang et al.(2015)Wang, Ouyang, Wang, and Lu] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In IEEE International conference on computer vision, 2015.
  • [Wu et al.(2013)Wu, Lim, and Yang] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
  • [Xu et al.(2015)Xu, Ba, Kiros, Cho, Courville, Salakhudinov, Zemel, and Bengio] Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhudinov, Rich Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning, pages 2048–2057, 2015.
  • [Yilmaz et al.(2006)Yilmaz, Javed, and Shah] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):1–45, 2006.
  • [Yun et al.(2017)Yun, Choi, Yoo, Yun, and Choi] S. Yun, J. Choi, Y. Yoo, K. Yun, and J. Y. Choi. Action-decision networks for visual tracking with deep reinforcement learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2017.
  • [Zabih and Woodfill(1994)] R. Zabih and J. Woodfill. Non-parametric local transforms for computing visual correspondence. In IEEE European conference on computer vision, pages 151–158, 1994.
  • [Zhang et al.(2017)Zhang, Maei, Wang, and Wang] D. Zhang, H. Maei, X. Wang, and Y.-F. Wang. Deep reinforcement learning for visual object tracking in videos. In arXiv preprint arXiv:1701.08936, 2017.