Tracking with a Pan-Tilt-Zoom (PTZ) camera has been a research topic in computer vision for many years. Compared to tracking with a still camera, the images captured with a PTZ camera are highly changing in nature because the camera can perform large motions, resulting in quickly changing capture conditions. Furthermore, tracking with a PTZ camera involves controlling a camera, and it is an online process. Therefore, standard benchmarks do not allow evaluating the performance of a tracker for the PTZ scenario because they account neither for camera control nor for the dropped frames caused by spending too much time processing a frame to track a target. A specific benchmark is required for evaluation in this scenario. In this work, we used the virtual PTZ framework developed by Chen et al. to evaluate recent trackers for the PTZ tracking scenario. The goal of this evaluation is to assess the performance of trackers in an online tracking scenario where a combination of speed and robustness is required. PTZ scenarios include many changes in scale and illumination as well as large motions. We evaluated 19 tracking algorithms and compared and analyzed their performances. Many of these trackers were tested in the VOT 2016 benchmark. We also extended the framework to add target position prediction for the next frame, accounting for camera motion and processing delays. By doing this, we can assess whether prediction can help make long-term tracking more robust and help slower algorithms keep the target in the field of view of the camera.
II Evaluation framework and metrics
Chen et al. proposed a C++ framework to evaluate trackers in PTZ scenarios in a reproducible way. Because of the online nature of this scenario, the authors proposed the use of a PTZ camera simulator that pans, tilts and zooms based on a spherical video captured offline. The simulator also includes the relevant delays that result in dropped frames when tracking takes too long or when the target has a large motion in the image plane. These delays are categorized into execution delay, motion delay and communication delay.
II-A Simulator Configuration
We use the evaluation framework of Chen et al. However, since some trackers are implemented in C++ and others in Matlab, we adjusted how the delays are calculated to ensure fairness in the evaluation of execution times. We used the C++11 chrono functions to measure the time elapsed while the trackers process frames, instead of the real-time clock. The time spent reading images from the disk drive was excluded from the execution time. The motion delay is the time it takes for the simulated camera to tilt and/or pan. We assumed no communication delay, that is, the camera is not networked.
Since some trackers are originally written in Matlab, we used the Matlab engine to call their tracker interfaces from C++. Minor changes were made to the Matlab source code, such as eliminating the display and drawing functions, since those slow down processing while being unnecessary for tracking, and would thus count as irrelevant delays. Furthermore, practical experiments showed that the overhead of calling the Matlab engine from C++ is on the order of milliseconds; it can be neglected since the time trackers take to process a frame is much longer than that delay.
II-B Performance Evaluation
Chen et al. defined four performance metrics to evaluate the trackers. These metrics are calculated in the image plane for the current camera viewpoint (i.e. the viewed subregion of the image sphere projected on the camera image plane). Let $c_{gt}$ and $c_p$ be the ground-truth and predicted target centers, and $B_{gt}$ and $B_p$ be the ground-truth and predicted target bounding boxes, respectively. Let $c_{FOV}$ be the center of the camera image plane, or in other words, of the field of view (FOV). $E_{TP}$ (Target Point Error) and $R_{BO}$ (Box Overlapping Ratio) evaluate the quality of target localization and are defined as

$$E_{TP} = \left\| c_{gt} - c_p \right\|_2, \qquad R_{BO} = \frac{|B_{gt} \cap B_p|}{|B_{gt} \cup B_p|}.$$

$O_{TP}$ (Target Point Offset) and $F_T$ (Track Fragmentation) evaluate the quality of the camera control and are defined as

$$O_{TP} = \left\| c_{gt} - c_{FOV} \right\|_2, \qquad F_T = \begin{cases} 0 & \text{if the target is inside the FOV,} \\ 1 & \text{otherwise.} \end{cases}$$

$F_T$ indicates whether the target is inside the camera FOV. $E_{TP}$, $R_{BO}$ and $O_{TP}$ are invalid and assigned $-1$ if the target is outside the FOV. The overall metrics $\bar{E}_{TP}$, $\bar{R}_{BO}$ and $\bar{O}_{TP}$ are the averages of these metrics over the valid tracked frames, while $\bar{F}_T$ is the sum of $F_T$ divided by the number of processed frames. In the experiments, we report only $\bar{R}_{BO}$ and $\bar{F}_T$ as they are the most significant metrics: $E_{TP}$ and $R_{BO}$ have a similar purpose, and so do $O_{TP}$ and $F_T$.
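As an illustrative sketch (not the framework's actual code), the per-frame metrics can be computed as follows, assuming 2D center points and axis-aligned boxes in (x, y, w, h) form:

```python
import math

def target_point_error(c_gt, c_p):
    """Euclidean distance between ground-truth and predicted centers."""
    return math.dist(c_gt, c_p)

def box_overlap_ratio(b_gt, b_p):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    x1 = max(b_gt[0], b_p[0])
    y1 = max(b_gt[1], b_p[1])
    x2 = min(b_gt[0] + b_gt[2], b_p[0] + b_p[2])
    y2 = min(b_gt[1] + b_gt[3], b_p[1] + b_p[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = b_gt[2] * b_gt[3] + b_p[2] * b_p[3] - inter
    return inter / union if union > 0 else 0.0

def target_point_offset(c_gt, c_fov):
    """Distance between the ground-truth center and the FOV center."""
    return math.dist(c_gt, c_fov)

def track_fragmentation(in_fov):
    """0 if the target is inside the FOV, 1 otherwise."""
    return 0 if in_fov else 1
```

The overall metrics then follow by averaging the first three over valid frames (those with the target in the FOV) and summing the last one over all processed frames.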
III Target Position Prediction
In their paper, the authors of the framework remarked that for a tracker to be successful, it should either be very fast or use some kind of target position prediction to keep the target close to the FOV of the camera. To assess the practicality of predicting a target position based on previous track information, we implemented three target motion models. Since there will be a delay between frames, it is necessary to predict the object position on the image sphere and move the camera accordingly so that the target appears in its FOV. Therefore, the prediction must account for the motion of the target and the motion of the camera. For the calculation, let a target appear in the first frame at point $p_1$ (on the spherical image). In the next frame, the target moves to $p_2$ (again on the spherical image). We can calculate its velocity as $v_1 = (p_2 - p_1)/\Delta t_1$, where $\Delta t_1$ is the time elapsed between the two frames. Then, in the third frame, if the target moved from $p_2$ to $p_3$, its velocity would be $v_2 = (p_3 - p_2)/\Delta t_2$. Thus, a basic classical mechanics model can be used to estimate the next position $p_4$ in the fourth frame.
The position prediction should locate the target as close to the image center as possible. By knowing the motion of the target, it is possible to predict where it will be after a delay $\Delta t$. This delay should account for the processing time of the current frame in addition to the time it takes for the camera to move. We experimented with three motion models to obtain the target displacement ($\Delta p$) between two instants:
Model 1: The object moves at a constant speed, using the velocity of the last instant:

$$\Delta p = v_2 \, \Delta t.$$

Model 2: The object moves at a constant speed, using the mean velocity of the last two instants:

$$\Delta p = \frac{v_1 + v_2}{2} \, \Delta t.$$

Model 3: The object can accelerate:

$$\Delta p = v_2 \, \Delta t + \frac{1}{2} a \, (\Delta t)^2,$$

where $a = (v_2 - v_1)/\Delta t_2$ is the acceleration.
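A minimal sketch of the three models, with velocities as 2D tuples; the `predict_displacement` helper and its signature are illustrative, not part of the framework:

```python
def predict_displacement(v1, v2, dt, dt2, model):
    """Predict the target displacement over the delay dt.

    v1, v2: (x, y) velocities at the two previous instants.
    dt2:    time between those instants (used to estimate acceleration).
    """
    if model == 1:  # constant speed, last velocity
        return (v2[0] * dt, v2[1] * dt)
    if model == 2:  # constant speed, mean of the last two velocities
        return ((v1[0] + v2[0]) / 2 * dt, (v1[1] + v2[1]) / 2 * dt)
    if model == 3:  # constant acceleration
        ax = (v2[0] - v1[0]) / dt2
        ay = (v2[1] - v1[1]) / dt2
        return (v2[0] * dt + 0.5 * ax * dt ** 2,
                v2[1] * dt + 0.5 * ay * dt ** 2)
    raise ValueError("unknown model")
```

For instance, with $v_1 = (1, 0)$, $v_2 = (2, 0)$ and $\Delta t = \Delta t_2 = 1$, the three models predict displacements of $(2, 0)$, $(1.5, 0)$ and $(2.5, 0)$, respectively.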
IV Tested Trackers
Among the 19 trackers we tested, six trackers are variations of correlation filters: KCF, SRDCF, SWCF, DSST, DFST and sKCF. Two trackers combine correlation filter outputs with color: STAPLE and STAPLE+. One is based on structured SVM: STRUCK. Two trackers are based purely on color: DAT and ASMS. One tracker is based on normalized cross-correlation: NCC. Two trackers are based on boosting: MIL and BOOSTING. One tracker is based on optical flow (MEDIANFLOW) and one tracker includes a detector (TLD). Two trackers can be categorized as part-based: DPCF and CTSE. Another one combines many trackers in an ensemble: KF-EBT. Below, we briefly describe the trackers. More details can be found in the original papers describing each of them.
Kernelized Correlation Filter tracker (KCF) 
KCF operates on HOG features. It localizes the target with the equivalent of a kernel ridge regression trained with sample patches around the object at different translations. This version of the KCF tracker includes multi-scale support, sub-cell peak estimation and a model update by linear interpolation.
Spatially Regularized Discriminative Correlation Filter Tracker (SRDCF)  This tracker is derived from KCF. It introduces a spatial regularization function that penalizes filter coefficients residing outside the target area, thus solving the problems arising from assumptions of periodicity in learning the correlation filters. The size of the training and detection samples can be increased without affecting the effective filter size. By selecting the spatial regularization function to have a sparse discrete Fourier spectrum, the filter is optimized directly in the Fourier domain. SRDCF also employs Color Names and greyscale features.
Spatial Windowing for Correlation Filter-Based Visual Tracking (SWCF)  This tracker is derived from KCF. It predicts a spatial window for the observation of the object so that the correlation output of the correlation filter as well as the windowed observation are improved. Moreover, the estimated spatial window of the object patch indicates the object regions that are useful for correlation.
Discriminative Scale Space Tracker (DSST)  DSST extends the Minimum Output Sum of Squared Errors (MOSSE) tracker with robust scale estimation. DSST also learns a one-dimensional discriminative scale filter which is used to predict the target size. The intensity features used in the MOSSE tracker are combined with a pixel-dense representation of HOG features.
Dynamic Feature Selection Tracker (DFST)  DFST is a visual tracking algorithm based on the real-time selection of locally and temporally discriminative features. DFST provides a significant gain in accuracy and precision with respect to KCF through the use of a dynamic set of features. A further improvement is obtained by making micro-shifts at the predicted position according to the best template match.
Scalable Kernel Correlation Filter with Sparse Feature Integration (sKCF)  This tracker is derived from KCF. It introduces an adjustable Gaussian window function and a keypoint-based model for scale estimation. It deals with the fixed window size limitation in KCF.
Sum of Template And Pixel-wise LEarners (STAPLE)  STAPLE combines two image patch representations that are sensitive to complementary factors to learn a model that is robust to both color changes and deformations. It combines the scores of two models in a dense window translation search. The scores of the two models are indicative of their reliability.
An improved STAPLE tracker with multiple feature integration (STAPLE+)
STAPLE+ is based on the STAPLE tracker and improves it by integrating multiple features. It extracts HOG features from the color probability map to better exploit color information. The final response map is thus a fusion of scores obtained with different features.
STRUCtured output tracking with Kernels (STRUCK) 
This is a framework for adaptive visual object tracking. It applies a support vector machine which is learned online. It introduces a budgeting mechanism that prevents the unbounded growth in the number of support vectors that would otherwise occur during tracking.
Distractor Aware Tracker (DAT)  This is a tracking-by-detection approach based on appearance. To distinguish the object from the surrounding areas, a discriminative model using color histograms is applied. It adapts the object representation beforehand so that distractors are suppressed and the risk of drifting is reduced.
Scale Adaptive Mean Shift (ASMS)  This is a mean-shift tracker  optimizing the Hellinger distance between a template histogram and the target in the image. The optimization is done by a gradient descent. ASMS addresses the problem of scale adaptation and scale estimation. It also introduces two improvements over the original mean-shift  to make the scale estimation more robust in the presence of background clutter: 1) a histogram color weighting and 2) a forward-backward consistency check.
Normalized Cross-Correlation (NCC)  NCC follows the basic idea of tracking by searching for the best match between a static grayscale template and the image using normalized cross-correlation.
MEDIANFLOW  This tracker uses optical flow to match points between frames. The tracking is performed forward and backward in time and the discrepancies between these two trajectories are measured. The proposed error enables reliable detection of tracking failures and selection of reliable trajectories in video sequences.
Tracking-Learning-Detection (TLD)  TLD combines both a tracker and a detector. The tracker follows the object from frame to frame using MEDIANFLOW. The detector localizes the target using all appearances that have been observed so far and corrects the tracker if necessary. The learning component estimates the detector's errors and updates it to avoid these errors in the future.
Deformable Part-based Tracking by Coupled Global and Local Correlation Filters (DPCF)  This tracker, derived from KCF, relies on joint interactions between a global filter and local part filters. The local filters provide an initial estimate, which is used by the global filter as a reference to determine the final result. The global filter provides feedback to the part filters. In this way, it handles both partial occlusion (with the part filters) and scale changes (with the global filter).
Contextual Object Tracker with Structural Encoding (CTSE)  This tracker incorporates contextual and structural information (specific to a target object) into the appearance model. This is achieved first by including features from a complementary region having correlated motion with the target object. Second, a local structure that represents the spatial constraints between features within the target object is included. SIFT keypoints are used as features to encode both types of information.
Kalman filter ensemble-based tracker (KF-EBT)  This tracker combines the results of two other trackers: ASMS, using a color histogram, and KCF. Using a Kalman filter, the tracker works in cycles of prediction and correction. First, a motion model predicts the target's next position. Second, the tracker results are fused with the predicted position and the model is updated.
V Results and analysis
In this section, we report our results on the 36 video sequences of the framework's dataset. The video sequences, which consist of tracking persons, faces and objects, include difficulties such as motion blur, scale changes, out-of-plane rotations, fast motion, cluttered backgrounds, illumination variations, low resolution, occlusions, presence of distractors and articulated objects. Results are reported for the whole dataset.
V-A Ranking method
During testing, we discovered that since different trackers have different tracking speeds, the four metrics of section II-B are not enough. For example, some trackers obtain good $\bar{E}_{TP}$, $\bar{R}_{BO}$ and $\bar{O}_{TP}$ metrics because they track slowly in the real-time simulation, which means that they track only a few frames correctly and all the other frames are invalid and ignored in the calculation of the metrics. In this circumstance, the tracker succeeds in tracking every processed frame (which are only the first frames). Only $\bar{F}_T$ can capture to some extent this lack of performance, as it verifies whether the target is in the FOV or not. Thus, we consider another essential metric, the processed frame ratio ($F_{proc}$):

$$F_{proc} = \frac{N_{proc}}{N_{total}},$$

where $N_{proc}$ is the number of processed frames and $N_{total}$ is the total number of frames.
$\bar{F}_T$ contains part of this information since it indicates whether the object is inside the FOV in the processed frames. If $F_{proc}$ is low, $\bar{F}_T$ will be high, since the tracker will not be able to track the object correctly in the processed frames because of the long intervals between them. However, a high $\bar{F}_T$ can also be caused by poor robustness.
After considering the nature of PTZ cameras and the tracker test results, we formulated a ranking formula. It is the Euclidean distance between the point defined by the pair ($\bar{R}_{BO}$, $F_{proc}$) obtained by a tracker and the ideal tracker (the top-right point in Figure 1). The score is thus

$$S = \sqrt{(1 - \bar{R}_{BO})^2 + (1 - F_{proc})^2}.$$

We selected $\bar{R}_{BO}$ and $F_{proc}$ because we consider that $\bar{R}_{BO}$ conveys similar information as $\bar{E}_{TP}$, and $F_{proc}$ conveys similar information as $\bar{F}_T$.
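The ranking computation can be sketched as follows (illustrative Python under the definitions above; the function names are ours, not the framework's — lower scores rank higher):

```python
import math

def processed_frame_ratio(n_processed, n_total):
    """Fraction of the sequence's frames that the tracker processed."""
    return n_processed / n_total

def ranking_score(mean_overlap, pfr):
    """Euclidean distance from (mean overlap, processed frame ratio)
    to the ideal tracker at (1, 1)."""
    return math.hypot(1.0 - mean_overlap, 1.0 - pfr)
```

A perfect tracker (overlap 1.0 on every frame, all frames processed) scores 0, while a tracker that overlaps perfectly but processes only half the frames scores 0.5.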
V-B Results without processing delays
We first set the execution ratio to zero in order to compare the performance of the different trackers on in-plane rotations, out-of-plane rotations and the drastic scene changes caused by camera motion. We are looking for the most robust trackers while neglecting their processing times (they are all set to perform at the same speed). Note that in this experiment, the camera motion delays are still considered, as they reflect the robustness of the trackers: if a tracker performs poorly, it will cause unnecessary camera motion that results in dropped frames.
Table I and Figure 1(a) give the results for the 19 trackers based on their ranking. In the PTZ camera scenario, the difficulties with in-plane and out-of-plane rotations are amplified because of the dynamic nature of the application. Surprisingly, trackers that adopt a scale adaptation function, such as SRDCF, do not necessarily perform better than other trackers, due in great part to their slow execution speed. ASMS and DPCF are the best performers in this experiment. In the VOT 2016 benchmark, there are no dropped frames caused by delays and fewer viewpoint changes caused by camera motion. It is thus reasonable that the ranking of our framework differs from the VOT ranking. Still, our ranking is quite similar to that of the VOT 2016 benchmark. Trackers like STAPLE, STAPLE+, KF-EBT and DAT perform well both in the VOT benchmark and in ours. However, in our benchmark, the performance of ASMS is surprisingly the best, while it ranked only in the middle in VOT. Some trackers like SRDCF do not behave well in our PTZ framework, probably because they do not output bounding boxes when tracking fails; as a result, the PTZ camera is not controlled correctly. The VOT system checks every five frames to verify whether the tracker has failed; if it has, it is reinitialized. In our framework, we do not intervene in the tracking process at all. Early failures are thus penalized more.
V-C Results with processing delays
We then set the execution ratio to 1. In the real world, objects keep moving while the tracker is processing a frame. As a result, the tracking task is harder in this case, since the intervals between processed frames are caused both by the time needed to process a frame and by the camera motion delay. $F_{proc}$ should decline and the trackers should lose targets more easily. Table II and Figure 1(b) give the results for the 19 trackers.
Compared to the previous case, the ranking of the trackers changes. ASMS still ranks first, but DPCF degrades because it is very slow: its $F_{proc}$ declines to 0.08, where it was previously 0.8. The relative rankings of the other trackers do not change much, but the average score of the trackers is higher, which means that their performances are worse because of the execution delay in the tracking process. Still, compared to the VOT 2016 benchmark, the performance of ASMS remains surprisingly the best. This means that this tracker is particularly good at handling viewpoint changes. Finally, performances in VOT are better than for our task because we are testing online tracking and camera control. The PTZ scenario requires tracking people under varying illumination, at different scales and from rapidly changing viewpoints. All of these reasons make our results unique compared to other benchmarks.
V-D Target Position Prediction
Finally, we tested target position prediction to investigate whether it can help trackers perform better. Table III gives results for two trackers; results are similar for all the other trackers. The proposed models (see section III) for predicting the next position of the target do not improve the results. This is because the speed of the target is difficult to estimate: the target moves in 3D, but we estimate its motion in 2D. Therefore, the predicted speed is not very accurate. After calculating the speed, the framework uses it to predict the target position in the next frame, which may cause unnecessarily large camera motion. For example, the predicted target motion may be too large, so the camera, by rotating to this wrongly predicted position, adds delays to the tracking process. This increases the possibility that the tracker loses the target. If the target cannot be tracked, its speed is not updated, leading to an even worse situation where the camera rotates more or less randomly. High-speed trackers are affected the most by wrong predictions: their processed frame ratio declines from above 0.8 to below 0.2.
Therefore, we can conclude from this experiment that, although appealing in theory, compensating for slow tracking with position prediction is not easy to apply in practice. It may work for objects that are far away and that mostly move in a plane, but it cannot work for targets that are closer and that are moving toward or away from the camera. In such cases, the motion of the target cannot be predicted in 2D. For best results, it is thus preferable to design a fast tracker.
This paper presented a benchmark of recent trackers for the PTZ tracking scenario. Surprisingly, high-speed trackers, such as MEDIANFLOW and NCC, do not necessarily behave better than others. However, since predicting the target position was shown to be difficult, slow trackers should be avoided for the PTZ tracking task. The results of our tests indicate that the top-performing tracker for the PTZ scenario is the ASMS tracker, which performed very well in accuracy as well as in robustness. It is impossible to conclusively determine whether the advantage of ASMS over the other trackers comes from its image features or its approach. Nevertheless, the results of the top trackers show that features play a significant role in the final performance.
-  M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder, “The visual object tracking vot2016 challenge results,” in Proceedings of the IEEE international conference on computer vision workshops, 2016, pp. 1–23.
-  G. Chen, P.-L. St-Charles, W. Bouachir, G.-A. Bilodeau, and R. Bergevin, “Reproducible evaluation of pan-tilt-zoom tracking,” in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2055–2059.
-  J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
-  M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg, “Learning spatially regularized correlation filters for visual tracking,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4310–4318.
-  E. Gundogdu and A. A. Alatan, “Spatial windowing for correlation filter based visual tracking,” in Image Processing (ICIP), 2016 IEEE International Conference on. IEEE, 2016, pp. 1684–1688.
-  M. Danelljan, G. Häger, F. Khan, and M. Felsberg, “Accurate scale estimation for robust visual tracking,” in British Machine Vision Conference, Nottingham, September 1-5, 2014. BMVA Press, 2014.
-  D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui, “Visual object tracking using adaptive correlation filters,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 2544–2550.
-  G. Roffo and S. Melzi, “Online feature selection for visual tracking.” in BMVC, 2016.
-  A. S. Montero, J. Lang, and R. Laganiere, “Scalable kernel correlation filter with sparse feature integration,” in Computer Vision Workshop (ICCVW), 2015 IEEE International Conference on. IEEE, 2015, pp. 587–594.
-  L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr, “Staple: Complementary learners for real-time tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1401–1409.
-  S. Hare, S. Golodetz, A. Saffari, V. Vineet, M.-M. Cheng, S. L. Hicks, and P. H. Torr, “Struck: Structured output tracking with kernels,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 10, pp. 2096–2109, 2016.
-  H. Possegger, T. Mauthner, and H. Bischof, “In defense of color-based model-free tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2113–2120.
-  T. Vojir, J. Noskova, and J. Matas, “Robust scale-adaptive mean-shift for tracking,” Pattern Recognition Letters, vol. 49, pp. 250–258, 2014.
-  D. Comaniciu, V. Ramesh, and P. Meer, “Real-time tracking of non-rigid objects using mean shift,” in Computer Vision and Pattern Recognition, 2000. Proceedings. IEEE Conference on, vol. 2. IEEE, 2000, pp. 142–149.
-  J. P. Lewis, “Fast normalized cross-correlation,” in Vision interface, vol. 10, no. 1, 1995, pp. 120–123.
-  B. Babenko, M.-H. Yang, and S. Belongie, “Robust object tracking with online multiple instance learning,” IEEE transactions on pattern analysis and machine intelligence, vol. 33, no. 8, pp. 1619–1632, 2011.
-  H. Grabner, M. Grabner, and H. Bischof, “Real-time tracking via on-line boosting.”
-  Z. Kalal, K. Mikolajczyk, and J. Matas, “Forward-backward error: Automatic detection of tracking failures,” in Pattern recognition (ICPR), 2010 20th international conference on. IEEE, 2010, pp. 2756–2759.
-  ——, “Tracking-learning-detection,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 7, pp. 1409–1422, 2012.
-  O. Akin, E. Erdem, A. Erdem, and K. Mikolajczyk, “Deformable part-based tracking by coupled global and local correlation filters,” Journal of Visual Communication and Image Representation, vol. 38, pp. 763–774, 2016.
-  T. Chakravorty, G.-A. Bilodeau, and E. Granger, “Contextual object tracker with structure encoding,” in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4937–4941.