Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a novel learning algorithm for their efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows the search region to be enlarged and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard features, HoGs and Colornames, the novel CSR-DCF method -- DCF with Channel and Spatial Reliability -- achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs in real-time on a CPU.
Short-term visual object tracking is the problem of continuously localizing a target in a video sequence given a single example of its appearance. It has received significant attention from the computer vision community, which is reflected in the number of papers published on the topic and the existence of multiple performance evaluation benchmarks [43, 29, 30, 26, 27, 33, 38, 36]. Diverse factors – occlusion, illumination change, fast object or camera motion, appearance changes due to rigid or non-rigid deformations and similarity to the background – make short-term tracking challenging.
Recent short-term tracking evaluations [43, 29, 30, 26] consistently confirm the advantages of semi-supervised discriminative tracking approaches [17, 1, 18, 4]. In particular, trackers based on the discriminative correlation filter method (DCF) [4, 8, 20, 31, 10] have shown state-of-the-art performance in all standard benchmarks. Discriminative correlation methods learn a filter with a pre-defined response on the training image.
The standard formulation of DCF uses circular correlation, which allows learning to be implemented efficiently by the Fast Fourier transform (FFT). However, the FFT requires the filter and the patch size to be equal, which limits the detection range. Due to the circularity, the filter is trained on many examples that contain unrealistic, wrapped-around circularly-shifted versions of the target. These windowing problems were recently addressed by the state-of-the-art approaches of Galoogahi et al. [24], who propose zero-padding the filter during learning, and by Danelljan et al. [10], who introduce spatial regularization to penalize filter values outside the object boundaries. Both approaches train from image patches larger than the object and thus increase the detection range.
Another limitation of the published DCF methods is the assumption that the target shape is well approximated by an axis-aligned rectangle. For irregularly shaped objects or those with a hollow center, the filter eventually learns the background, which may lead to drift and failure. The same problem appears for approximately rectangular objects in the case of occlusion. The Galoogahi et al. [24] and Danelljan et al. [10] methods both suffer from this problem.
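As a rough illustration of the FFT-based learning with circular correlation described above, the following NumPy sketch learns a single-channel filter in closed form and then localizes a circularly shifted copy of the training patch. The function names, the regularization value `lam`, the Gaussian width and the random test patch are assumptions made for this example; they are not taken from any of the cited trackers.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired output g: a Gaussian peak at `center`, wrapped circularly."""
    h, w = shape
    ys = np.arange(h)[:, None] - center[0]
    xs = np.arange(w)[None, :] - center[1]
    # Wrap the offsets so the peak is consistent with circular correlation.
    ys = (ys + h // 2) % h - h // 2
    xs = (xs + w // 2) % w - w // 2
    return np.exp(-(ys ** 2 + xs ** 2) / (2.0 * sigma ** 2))

def learn_filter(patch, g, lam=1e-2):
    """Closed-form ridge-regression filter, computed per frequency."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(g)
    return F * np.conj(G) / (np.abs(F) ** 2 + lam)

def detect(H, patch):
    """Correlate a patch with the learned filter and return the peak position."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * np.conj(H)))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)

rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))          # stand-in for a feature channel
center = (16, 16)
H = learn_filter(patch, gaussian_response(patch.shape, center))

# A circular shift of the patch moves the response peak by the same amount.
shifted = np.roll(patch, shift=(3, -2), axis=(0, 1))
peak_train = detect(H, patch)
peak_shifted = detect(H, shifted)
```

Because the filter and the patch have the same size, the detectable displacement is bounded by the patch, and every training shift is a wrapped-around version of the patch, which is the limitation discussed above.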
In this paper we introduce the CSR-DCF, the Discriminative Correlation Filter with Channel and Spatial Reliability. The spatial reliability map adapts the filter support to the part of the object suitable for tracking which overcomes both the problems of circular shift enabling an arbitrary search range and the limitations related to the rectangular shape assumption. The spatial reliability map is estimated using the output of a graph labeling problem solved efficiently in each frame. An efficient novel optimization procedure is derived for learning a correlation filter with the support constrained by the spatial reliability map since the standard closed-form solution cannot be generalized to this case. Experiments show that the novel filter optimization procedure outperforms related approaches for constrained learning in DCFs.
Channel reliability is the second novelty the CSR-DCF tracker introduces. The reliability is estimated from the properties of the constrained least-squares solution to filter design. The channel reliability scores are used for weighting the per-channel filter responses in localization (Figure 1). The CSR-DCF shows state-of-the-art performance on standard benchmarks – OTB100 [44], VOT2015 [26] and VOT2016 [25] – while running in real-time on a single CPU. The spatial and channel reliability formulation is general and can be used in most modern correlation filters, e.g. those using deep features.
The discriminative correlation filters for object detection date back to the 80’s with seminal work of Hester and Casasent [21]. They have been popularized only recently in the tracking community, starting with the Bolme et al. [4] MOSSE tracker published in 2010. Using a gray-scale template, MOSSE achieved a state-of-the-art performance on a tracking benchmark [43] at a remarkable processing speed. Significant improvements have been made since and in 2014 the top-performing trackers on a recent benchmark [30] were all from this class of trackers. Improvements of DCFs fall into two categories, application of improved features and conceptual improvements in filter learning.
In the first group, Henriques et al. [20] replaced the grayscale templates by HoG [7], Danelljan et al. [12] proposed multi-dimensional color attributes and Li and Zhu [32] applied feature combination. Recently, the convolutional network features learned for object detection have been applied [35, 11, 13], leading to a performance boost, but at a cost of significant speed reduction.
Conceptually, the first successful theoretical extension of the standard DCF was the kernelized formulation by Henriques et al. [20]. Later, a correlation-filter-based scale adaptation was proposed by Danelljan et al. [8]. Zhang et al. [46] introduced spatio-temporal context learning in the DCFs. Recently, Galoogahi et al. [16] addressed the problems resulting from learning with circular correlation from small patches. They proposed a learning framework that artificially increases the filter size by implicit zero-padding to the right and down. The non-symmetric padding only partially reduces the boundary artefacts in filter learning. Danelljan et al. [10] reformulate the learning cost function to penalize non-zero filter values outside the object bounding box. Performance better than [16] is reported, but the learned filter is still a tradeoff between the correlation response and regularization, and it does not guarantee that filter values are zero outside the object bounding box.
Given a set of channel features and corresponding target templates (filters) , where , , the object position
is estimated by maximizing the probability
(1) |
The density is a convolution of a feature map with a learned template evaluated at and is a prior reflecting the channel reliability.
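The localization rule described above, a sum of per-channel filter responses weighted by channel reliability followed by a maximization over position, can be sketched as follows. This is a minimal illustration only: the function name `localize`, the array layout and the toy responses are assumptions, and the responses are taken as already computed.

```python
import numpy as np

def localize(responses, channel_weights):
    """Fuse per-channel response maps and return the peak position.

    responses:       array of shape (n_channels, height, width)
    channel_weights: array of shape (n_channels,), the reliability priors
    """
    responses = np.asarray(responses, dtype=float)
    weights = np.asarray(channel_weights, dtype=float)
    fused = np.tensordot(weights, responses, axes=1)   # weighted sum over channels
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return (int(row), int(col)), fused

# Two toy channels that disagree about the target position.
responses = np.zeros((2, 5, 5))
responses[0, 1, 1] = 1.0   # channel 0 votes for (1, 1)
responses[1, 3, 3] = 0.8   # channel 1 votes for (3, 3)

pos_a, _ = localize(responses, [0.2, 0.8])   # channel 1 is trusted more
pos_b, _ = localize(responses, [0.8, 0.2])   # channel 0 is trusted more
```

The example shows why the weights matter: the same response maps yield different position estimates depending on which channel is considered reliable.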
Most correlation filters, e.g., [31, 16, 46], assume independent feature channels. Optimal filters are obtained at learning stage by minimizing the sum of squared differences between the channel-wise correlation outputs and the desired output ,
(2) |
The equivalence in (3) follows from Parseval’s theorem, the operator is a Fourier transform of reshaped into a column vector, i.e., , with , forms a diagonal matrix from and is a Hermitian transpose. Minimization of (3) has a closed-form solution obtained by equating the complex gradient of (3) w.r.t. each channel to zero [4]. Despite its simplicity, this solution suffers from boundary defects due to the input circularity assumption and from assuming all pixels are equally reliable for filter learning. In the following we address this issue by proposing an efficient spatial reliability map construction for correlation filters and a new spatially constrained correlation filter learning framework.
Spatial reliability map , with elements , indicates the learning reliability of each pixel. Probability of pixel being reliable conditioned on appearance is specified as
(3) |
The appearance likelihood is computed by Bayes rule from the object foreground/background color models, which are maintained during tracking as color histograms . The prior is defined by the ratio between the region sizes for foreground/background histogram extraction.
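The Bayes-rule step described above can be sketched with two normalized color histograms and a scalar foreground prior. The function name, the pre-quantized `bin_index` image and the toy histogram values are assumptions for illustration; how the prior is derived from the region sizes and how colors are quantized are left outside this sketch.

```python
import numpy as np

def foreground_posterior(bin_index, hist_fg, hist_bg, prior_fg):
    """Per-pixel posterior of 'foreground' given the pixel's histogram bin.

    bin_index: integer array with the histogram bin of every pixel
    hist_fg, hist_bg: normalized foreground / background histograms
    prior_fg: scalar prior probability of foreground
    """
    hist_fg = np.asarray(hist_fg, dtype=float)
    hist_bg = np.asarray(hist_bg, dtype=float)
    lik_fg = hist_fg[bin_index]                 # p(color | foreground)
    lik_bg = hist_bg[bin_index]                 # p(color | background)
    num = lik_fg * prior_fg
    den = num + lik_bg * (1.0 - prior_fg)
    # Colors never observed in either histogram fall back to the prior.
    post = np.full(den.shape, prior_fg, dtype=float)
    np.divide(num, den, out=post, where=den > 0)
    return post

hist_fg = [0.8, 0.2, 0.0]
hist_bg = [0.1, 0.9, 0.0]
bins = np.array([[0, 1],
                 [2, 0]])
post = foreground_posterior(bins, hist_fg, hist_bg, prior_fg=0.5)
```

Pixels whose color is frequent in the foreground histogram and rare in the background histogram receive a posterior close to one, which is what makes them candidates for the reliable region.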
Central pixels in axis-aligned approximations of elongated, rotating, articulated or deformable objects will likely contain the object regardless of the deformation. On the other hand, pixels away from the center are equally likely to contain object or background. This deformation invariance of the reliability of central elements is enforced in our approach by defining a weak spatial prior
(4) |
where is a modified Epanechnikov kernel, , with size parameter equal to the minor bounding box axis and clipped to interval such that the object prior probability at center is 0.9 and changes to a uniform prior away from the center (Figure 2).
Spatial consistency of labeling is enforced by using (3) as unary terms in a Markov random field. An efficient solver [28] is applied to compute maximum a posteriori solution of . To avoid potentially poor classifications at significant appearance changes, the interior of the object bounding box is set to a uniform distribution if less than a very small percentage of pixels are classified as foreground. The reliability map is morphologically dilated by a small kernel to prevent unwanted masking of object boundaries. Figure 2 shows the likelihood and spatial prior in the unary terms and the final binary reliability map.
In the interest of notation clarity, we assume only a single channel in the following derivation, i.e., , and drop the channel index , since the filter learning is independent across the channels.
The spatial reliability map identifies pixels in the filter that should be ignored in learning, i.e., introduces a constraint , which prohibits a closed-form solution of (3). In the following we summarize our solution of this constrained optimization and report the full derivation in the supplementary material.
We introduce a dual variable and the constraint , which leads to the following augmented Lagrangian [5]
(5) | |||
where is a complex Lagrange multiplier, , and we use the following definition for compact notation. The augmented Lagrangian (5) can be iteratively minimized by the alternating direction method of multipliers [5], which sequentially solves the following sub-problems at each iteration:
(6) | |||
(7) |
and the Lagrange multiplier is updated as
(8) |
The minimizations in (6) have a closed-form solution:
(9) |
(10) |
A standard scheme for updating the constraint penalty values [5] is applied, i.e., .
Computations of (9,8) are fully carried out in frequency domain, the solution for (10) requires a single inverse FFT and another FFT to compute the . A single optimization iteration thus requires only two calls of the Fourier transform, resulting in a very fast optimization. The computational complexity is that of the Fourier transform, i.e., . Filter learning is implemented in less than five lines of Matlab code and is summarized in the Algorithm 1.
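To make the structure of such an iteration concrete, here is a minimal sketch of mask-constrained filter learning with a textbook scaled-form ADMM. It is not a transcription of (9), (10) or Algorithm 1, whose exact expressions are not reproduced in this text: the objective scaling, the fixed penalty `rho` (no penalty update), the parameter values and all function names are assumptions chosen for this example. Each iteration uses one forward and one inverse FFT, matching the cost structure described above.

```python
import numpy as np

def correlate(f, h):
    """Circular correlation response of filter h on feature channel f."""
    return np.real(np.fft.ifft2(np.fft.fft2(f) * np.conj(np.fft.fft2(h))))

def objective(f, g, h, lam):
    """0.5 * ||f (*) h - g||^2 + 0.5 * lam * ||h||^2 in the spatial domain."""
    return 0.5 * np.sum((correlate(f, h) - g) ** 2) + 0.5 * lam * np.sum(h ** 2)

def learn_masked_filter(f, g, mask, lam=10.0, rho=50.0, n_iter=200):
    """Minimize the objective subject to h being zero outside `mask`."""
    F, G = np.fft.fft2(f), np.fft.fft2(g)
    FG = F * np.conj(G)
    denom = np.abs(F) ** 2 + rho
    h = np.zeros(f.shape)          # constrained (masked) filter
    u = np.zeros(f.shape)          # scaled dual variable
    for _ in range(n_iter):
        # Unconstrained filter: element-wise closed form in the frequency domain.
        Hc = (FG + rho * np.fft.fft2(h - u)) / denom
        hc = np.real(np.fft.ifft2(Hc))
        # Constrained filter: shrink, then zero everything outside the mask.
        h = mask * (rho * (hc + u) / (lam + rho))
        # Dual update accumulates the disagreement between the two copies.
        u = u + hc - h
    return h

rng = np.random.default_rng(1)
f = rng.standard_normal((16, 16))                       # toy feature channel
yy, xx = np.mgrid[0:16, 0:16]
g = np.exp(-((yy - 8) ** 2 + (xx - 8) ** 2) / (2 * 2.0 ** 2))
mask = np.zeros((16, 16))
mask[4:12, 4:12] = 1.0                                  # toy reliability map

h_admm = learn_masked_filter(f, g, mask)

# Baseline: solve the unconstrained problem in closed form, then mask it.
F = np.fft.fft2(f)
h_ridge = np.real(np.fft.ifft2(F * np.conj(np.fft.fft2(g)) / (np.abs(F) ** 2 + 10.0)))
h_naive = mask * h_ridge
```

With an all-ones mask the iteration reduces to the ordinary closed-form solution, which is a convenient sanity check. With a proper mask, the filter is exactly zero outside the reliable region, and on this toy problem it reaches a lower cost than masking the unconstrained solution after the fact.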
The channel reliability at target localization stage is computed as the product of a learning channel reliability measure and a detection reliability measure. These are described next.
Minimization of (5) solves a least squares problem averaged over all displacements of the filter on a feature channel. A discriminative feature channel produces a filter whose output nearly exactly fits the ideal response . On the other hand, the response is highly noisy on the channel with low discriminative power and a global error reduction in the least squares significantly reduces the maximal response. Thus a straight-forward measure of channel learning reliability in (1) is the maximum response of a learned channel filter, i.e., , where the normalization scalar ensures that .
At detection stage, the per-channel detection reliability is reflected in the expressiveness of the major mode in the response of each channel. Note that Bolme et al. [4] proposed a similar approach to detect target loss. Our measure is based on the ratio between the second and first major mode in the response map, i.e., . Note that this ratio penalizes cases when multiple similar objects appear in the target vicinity since these result in multiple equally expressed modes, even though the major mode accurately depicts the target position. To prevent such penalizations, the ratio is clamped by . Therefore, the per-channel detection reliability is estimated as .
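The two reliability measures can be sketched as follows. The exact expressions are not reproduced in this text, so several details are assumptions: the clamp value `clamp=0.5` is a hypothetical default, the detection reliability is taken as one minus the clamped ratio, modes are found as strict local maxima over an 8-neighborhood with circular wrap-around, and all function names are made up for this example.

```python
import numpy as np

def learning_reliability(train_responses):
    """Maximum response of each learned channel filter, normalized to sum to 1."""
    peaks = np.array([np.max(r) for r in train_responses], dtype=float)
    return peaks / np.sum(peaks)

def local_maxima(resp):
    """Values of strict 8-neighborhood local maxima, sorted in descending order."""
    resp = np.asarray(resp, dtype=float)
    is_peak = np.ones(resp.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy or dx:
                is_peak &= resp > np.roll(resp, shift=(dy, dx), axis=(0, 1))
    return np.sort(resp[is_peak])[::-1]

def detection_reliability(resp, clamp=0.5):
    """One minus the clamped ratio of the second to the first response mode."""
    modes = local_maxima(resp)
    if len(modes) == 0 or modes[0] <= 0:
        return 1.0 - clamp            # no usable mode: least reliable case
    if len(modes) == 1:
        return 1.0                    # a single mode: no ambiguity
    return 1.0 - min(modes[1] / modes[0], clamp)

def two_peaks(second):
    resp = np.zeros((9, 9))
    resp[2, 2] = 1.0
    resp[6, 6] = second
    return resp

w_clear = detection_reliability(two_peaks(0.4))    # weak distractor
w_equal = detection_reliability(two_peaks(1.0))    # equally strong distractor
w_single = detection_reliability(two_peaks(0.0))   # no distractor
w_learn = learning_reliability([3.0 * two_peaks(0.0), two_peaks(0.0)])
```

The clamp bounds the penalty: an equally strong second mode lowers the channel weight only down to `1 - clamp` rather than to zero. The final channel score would then be the product of the learning and detection measures, as stated at the start of this section.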
The localization and update steps of the proposed channel and spatial reliability correlation filter tracker (CSR-DCF) proceed as follows.
Localize. Per-channel responses and the corresponding detection reliability values are computed (Section 3.3) and multiplied with the learning reliability measures from the previous time-step into channel reliability scores. The object is localized by summing the responses of the learned correlation filters weighted by the estimated channel reliability scores. Scale is estimated by a single scale-space correlation filter from Danelljan et al. [8].
Update. Foreground/background histograms are extracted at the estimated location and updated by an auto-regressive scheme with learning rate . The foreground histogram is extracted by a standard Epanechnikov kernel within the estimated object bounding box and the background is extracted from the neighborhood twice the object size. The spatial reliability map (Section 3.1) is constructed, the optimal filters are computed by optimizing (5) and the per-channel learning reliability is estimated from their responses (Section 3.3). For temporal robustness, the filters and channel learning reliability weights are updated by an autoregressive model with learning rate . A single tracking iteration is summarized in Algorithm 2.
This section overviews a comprehensive experimental evaluation of the proposed CSR-DCF tracker. Implementation details are discussed in Section 4.1, Section 4.2 reports a comparison of the proposed constrained learning to the related state-of-the-art, an ablation study is provided in Section 4.3, and performance on three recent benchmarks is reported in Section 4.4, Section 4.5 and Section 4.6. Section 4.7 evaluates the tracking speed.
Standard HOG [15] and Colornames [39] features are used in the correlation filter and HSV foreground/background color histograms with bins per color channel are used in reliability map estimation with parameter . All the parameters are set to values commonly used in literature [10, 16]. Histogram adaptation rate is set to , correlation filter adaptation rate is set to , and the regularization parameter is set to . The augmented Lagrangian optimization parameters are set to and . All parameters have straight-forward interpretation, do not require fine-tuning, and were kept constant throughout all experiments. Our Matlab implementation (the CSR-DCF Matlab source code will be made publicly available) runs at 13 frames per second (Table 3) on an Intel Core i7 3.4GHz standard desktop.
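The adaptation rates mentioned above control autoregressive updates of the histograms, filters and reliability weights. A minimal sketch of such an update, assuming the common exponential-forgetting form and using placeholder rates (the values below are illustrative and are not the tracker's settings):

```python
import numpy as np

def autoregressive_update(previous, current, rate):
    """Blend the running model with the newest estimate."""
    previous = np.asarray(previous, dtype=float)
    current = np.asarray(current, dtype=float)
    return (1.0 - rate) * previous + rate * current

# One update of a toy two-bin histogram with a placeholder rate of 0.25.
hist = autoregressive_update([1.0, 0.0], [0.0, 1.0], rate=0.25)

# The influence of the initial model decays geometrically: after k updates
# with unrelated estimates its weight is (1 - rate) ** k.
model = np.array([1.0])
for _ in range(3):
    model = autoregressive_update(model, [0.0], rate=0.5)
```

A small rate keeps the model stable under brief occlusions, while a large rate adapts quickly to appearance change at the risk of absorbing background.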
This section compares our proposed boundary constraints formulation (Section 3) with recent state-of-the-art approaches [10, 16]. In the first experiment, three variants of the standard single-scale HOG-based correlation filter were implemented to emphasize the difference in boundary constraints: the first one uses our channel reliability boundary constraint formulation from Section 3 () the second one applies the spatial regularization constraint [10] () and the third one applies the limited boundaries constraint [16] ().
The three variants were compared on the challenging VOT2015 dataset [26] by applying a standard no-reset one-pass evaluation (OTB [43]) and computing the AUC on the success plot. The tracker with our constraint formulation achieved 0.32 AUC, while the alternatives achieved 0.28 () and 0.16 (). The only difference between these trackers is in the constraint formulation, which indicates superiority of the proposed channel-reliability-based constraints formulation over the recent alternatives [16, 10].
The proposed CSR-DCF tracker from Section 3 was compared to the original recent state-of-the-art trackers SRDCF [10] and LBCF [24] that apply alternative boundary constraints. The source code was obtained from the authors and only HoG features were used in all three trackers for a fair comparison. An experiment was designed to evaluate initialization and tracking of non-axis-aligned targets, which is the case for most realistic deforming and non-circular objects. Trackers were initialized on frames with non-axis-aligned targets and left to track until the sequence end, resulting in a large number of tracking trajectories, which were summarized by various performance measures.
The VOT2015 dataset [26] contains non-axis-aligned annotations, which allows automatic identification of tracker initialization frames, i.e., frames in which the ground truth bounding box significantly deviates from an axis-aligned approximation. Frames with overlap between ground truth and axis-aligned approximation lower than were identified and filtered to obtain a set of initialization frames at least hundred frames apart – this constraint fits half the typical short-term sequence length [26] and reduces the potential correlation across the initializations (see Figure 3 for examples).
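One way to implement the frame selection described above is a greedy pass over the per-frame overlaps between the ground truth and its axis-aligned approximation. The greedy strategy, the function name and the threshold used in the example are assumptions; the text specifies only the overlap criterion and the minimum spacing of one hundred frames.

```python
def select_init_frames(overlaps, max_overlap, min_gap=100):
    """Pick frames with low ground-truth/axis-aligned overlap, spaced apart.

    overlaps:    per-frame overlap between the annotated box and its
                 axis-aligned approximation
    max_overlap: frames strictly below this overlap are candidates
    min_gap:     minimum distance in frames between selected frames
    """
    selected = []
    for idx, overlap in enumerate(overlaps):
        far_enough = not selected or idx - selected[-1] >= min_gap
        if overlap < max_overlap and far_enough:
            selected.append(idx)
    return selected

# Toy sequence of 300 frames with four strongly rotated targets.
overlaps = [0.9] * 300
for frame in (10, 50, 120, 260):
    overlaps[frame] = 0.4

init_frames = select_init_frames(overlaps, max_overlap=0.5)
```

Frame 50 is rejected because it lies only 40 frames after the previously selected frame, which mirrors the goal of reducing correlation across initializations.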
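Tracker | Tracking length (proportion) | Tracking length (frames) | Avg. overlap (original ground truth) | Avg. overlap (axis-aligned ground truth) |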
---|---|---|---|---|
CSR-DCF | 0.58 | 221 | 0.31 | 0.24 |
SRDCF (ICCV2015) | 0.31 | 95 | 0.16 | 0.12 |
LBCF (CVPR2015) | 0.12 | 37 | 0.06 | 0.04 |
Initialization robustness is estimated by counting the number of trajectories in which the tracker was still tracking (overlap with ground truth greater than 0) frames after initialization. Figure 3 shows these values with increasing the threshold . The CSR-DCF graph is consistently above the SRDCF and LBCF for all thresholds. The performance is summarized by the average tracking length (number of frames before the overlap drops to zero) weighted by trajectory lengths. The weighted average tracking lengths in frames, , and proportions of full trajectory lengths, , are shown in Table 1. The CSR-DCF by far outperforms SRDCF and LBCF in all measures indicating a significant robustness at initialization of challenging targets that deviate from axis-aligned templates. This improvement is further confirmed by Figure 3 which shows the OTB success plots [43] calculated on these trajectories and summarized by the AUC values, which are equal to the average overlaps [6]. Table 1 shows the average overlaps computed on the original ground truth on VOT2015 () and on ground truth approximated by the axis-aligned bounding box (). Again, the CSR-DCF by far outperforms the competing alternatives SRDCF and LBCF. Tracking examples for the three trackers are shown in Figure 4.
An ablation study on VOT2016 (see Section 4.6 for details of the evaluation protocol) was conducted to evaluate the contribution of spatial and channel reliability measures in our CSR-DCF. Results of the VOT primary measure EAO and two supplementary measures (A, R) are summarized in Table 2. Setting the adaptive channel reliability weights to uniform values (CuSR-DCF) results in an EAO drop of about 12% compared to CSR-DCF (0.338 to 0.297). Replacing the adaptive spatial reliability map in CSR-DCF by a constant map with uniform values within the bounding box and zeros elsewhere (CSuR-DCF) results in a drop of about 22% in EAO. Making both replacements in CSR-DCF (CuSuR-DCF) results in a drop of about 24%. Removing the channel and spatial reliability map reduces our tracker to a standard DCF with a large receptive field, and the EAO drops by over 50%.
Tracker | EAO | Robustness R (failures) | Accuracy A (overlap) |
---|---|---|---|
CSR-DCF | 0.338 | 0.85 | 0.51 |
CuSR-DCF | 0.297 | 1.08 | 0.51 |
CSuR-DCF | 0.264 | 1.18 | 0.49 |
CuSuR-DCF | 0.256 | 1.33 | 0.51 |
DCF | 0.152 | 2.85 | 0.47 |
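The relative EAO drops of the ablated variants follow directly from the EAO column of Table 2. The short computation below makes the arithmetic explicit; only the table values are taken from the text, and the helper name is arbitrary.

```python
# EAO values copied from Table 2.
eao = {
    "CSR-DCF": 0.338,
    "CuSR-DCF": 0.297,
    "CSuR-DCF": 0.264,
    "CuSuR-DCF": 0.256,
    "DCF": 0.152,
}

def relative_drop(baseline, value):
    """Fraction of the baseline score lost by a variant."""
    return (baseline - value) / baseline

drops = {
    name: round(100.0 * relative_drop(eao["CSR-DCF"], value), 1)
    for name, value in eao.items()
    if name != "CSR-DCF"
}

for name, percent in drops.items():
    print(f"{name}: {percent}% lower EAO than CSR-DCF")
# CuSR-DCF: 12.1%, CSuR-DCF: 21.9%, CuSuR-DCF: 24.3%, DCF: 55.0%
```

Read this way, removing the spatial reliability map costs more than removing the channel weights, and removing both falls well short of the full drop to a plain DCF, which also loses the enlarged search region.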
The OTB100 [44] benchmark contains results of trackers evaluated on sequences by a no-reset evaluation protocol. Tracking quality is measured by precision and success plots. The success plot shows the portion of frames with the overlap between predicted and ground truth bounding box greater than a threshold with respect to all threshold values. The precision plot shows similar statistics on the center error. The results are summarized by areas under these plots. To reduce clutter in the graphs, we show here only the results for top-performing recent baselines, i.e., Struck [18], TLD [23], CXT [14], ASLA [45], SCM [42], LSK [34], CSK [19] and results for recent top-performing state-of-the-art trackers SRDCF [10] and MUSTER [22].
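The success plot and its area can be sketched as follows, assuming axis-aligned boxes given as `(x, y, width, height)` and 101 evenly spaced thresholds; both conventions and the function names are choices made for this example rather than details of the benchmark toolkit.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax1, ay1, ax2, ay2 = box_a[0], box_a[1], box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx1, by1, bx2, by2 = box_b[0], box_b[1], box_b[0] + box_b[2], box_b[1] + box_b[3]
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, thresholds):
    """Portion of frames whose overlap exceeds each threshold."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.array([np.mean(overlaps > t) for t in thresholds])

def area_under_curve(values, thresholds):
    """Trapezoidal area under a curve sampled at `thresholds`."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(thresholds)))

thresholds = np.linspace(0.0, 1.0, 101)
overlaps = [0.2, 0.5, 0.8]                       # toy per-frame overlaps
auc = area_under_curve(success_curve(overlaps, thresholds), thresholds)

half_shift = iou((0, 0, 10, 10), (5, 0, 10, 10))  # boxes sharing half their width
```

For overlaps in the unit interval, the area under the success curve approximates the mean overlap (0.5 for the toy values), which is why the AUC is a convenient single-number summary of a no-reset run.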
The CSR-DCF is ranked top on the benchmark (Figure 5). It significantly outperforms the best performers reported in [44] and outperforms the current state-of-the-art (SRDCF [10] and MUSTER [22]). The average CSR-DCF performance on the success plot is slightly lower than that of SRDCF [9] due to poorer scale estimation, but the CSR-DCF yields better average precision (center error). Both the precision and the success plots show that the CSR-DCF tracks on average longer than competing methods.
The VOT2015 [26] benchmark contains results of 63 state-of-the-art trackers evaluated on 60 challenging sequences. In contrast to related benchmarks, the VOT2015 dataset was constructed from 300 sequences by an advanced sequence selection methodology that favors objects difficult to track and maximizes a visual attribute diversity cost function [26]. This makes it arguably the most challenging sequence set available. The VOT methodology [27] resets a tracker upon failure to fully use the dataset. The basic VOT measures are the number of failures during tracking (robustness) and average overlap during the periods of successful tracking (accuracy), while the primary VOT2015 measure is the expected average overlap (EAO) on short-term sequences. The latter can be thought of as the expected no-reset average overlap (AUC in OTB methodology), but with reduced bias and variance as explained in [26].
Figure 7 shows the VOT EAO plots with the CSR-DCF and the VOT2015 state-of-the-art approaches considering the VOT2016 rules that do not consider trackers learned on video sequences related to VOT to prevent over-fitting. The CSR-DCF outperforms all trackers and achieves a top rank. The CSR-DCF significantly outperforms the related correlation filter trackers like SRDCF [9] as well as trackers that apply computationally intensive state-of-the-art deep features, e.g., deepSRDCF [11] and SO-DLT [41].
Finally, we compare our tracker on the most recent visual tracking benchmark, VOT2016 [25]. The dataset contains 60 sequences from VOT2015 [26] with improved annotations. The benchmark evaluated a set of 70 trackers which includes recently published and yet unpublished state-of-the-art trackers. The set is indeed diverse; the top-performing trackers come from various classes, e.g., correlation filter methods (CCOT [13], Staple [2], DDC [25]), deep ConvNets (TCNN [25], SSAT [25, 37], MLDF [25, 40], FastSiamnet [3]) and different detection-based approaches (EBT [47], SRBT [25]).
Figure 7 shows the EAO performance on the VOT2016. Our proposed CSR-DCF outperforms all 70 trackers at the EAO score 0.338. The CSR-DCF significantly outperforms correlation filter approaches that do not apply deep ConvNets. Even though the CSR-DCF applies only simple features, it outperforms all trackers that apply computationally intensive deep features.
The VOT2016 [25] dataset is per-frame annotated with visual attributes and allows detailed analysis of per-attribute tracking performance. Figure 6 shows per-attribute plot for ten top-performing trackers on VOT2016 in EAO. The proposed CSR-DCF is consistently ranked among top three trackers on five out of six attributes. In four attributes (size change, occlusion, camera motion, unassigned) the tracker is ranked number one.
Tracking speed is an important factor of many real-world tracking problems. Table 3 thus compares several related and well-known trackers (including the best-performing tracker on the VOT2016 challenge) in terms of speed and VOT performance measures. Speed measurements on a single CPU are computed on Intel Core i7 3.4GHz standard desktop.
The proposed CSR-DCF performs on par with the VOT2016 best-performing CCOT [13], which applies deep ConvNets, with respect to VOT measures, while being 20 times faster than the CCOT. The CCOT was modified by replacing the computationally intensive deep features with the same simple features used in CSR-DCF. The resulting tracker, indicated by CCOT*, is still ten times slower than CSR-DCF, while the performance drops by over 15%. The proposed CSR-DCF runs nearly twice as fast as the related SRDCF [9] (13.0 versus 7.3 fps in Table 3), while achieving a higher EAO (0.338 versus 0.247). The speed of baseline real-time trackers like DSST [8] and Struck [18] is comparable to CSR-DCF, but their tracking performance is significantly poorer. The fastest compared tracker, KCF [20], runs much faster than real-time, but delivers significantly poorer performance than CSR-DCF.
The experiments show that the CSR-DCF performs comparably to the state-of-the-art trackers which apply computationally demanding high-dimensional features, but runs considerably faster and delivers top tracking performance among the real-time trackers.
Tracker | Published at | EAO | Accuracy A (overlap) | Robustness R (failures) | fps |
---|---|---|---|---|---|
CSR-DCF | This work. | 0.338 | 0.51 | 0.85 | 13.0 |
CCOT | ECCV2016 | 0.331 | 0.52 | 0.85 | 0.55 |
CCOT* | ECCV2016 | 0.274 | 0.52 | 1.18 | 1.0 |
SRDCF | ICCV2015 | 0.247 | 0.52 | 1.50 | 7.3 |
KCF | PAMI2015 | 0.192 | 0.48 | 2.03 | 115.7 |
DSST | PAMI2016 | 0.181 | 0.48 | 2.52 | 18.6 |
Struck | ICCV2011 | 0.142 | 0.42 | 3.37 | 8.5 |
The Discriminative Correlation Filter with Channel and Spatial Reliability (CSR-DCF) was introduced. The spatial reliability map adapts the filter support to the part of the object suitable for tracking which overcomes both the problems of circular shift enabling an arbitrary search range and the limitations related to the rectangular shape assumption. A novel efficient spatial map estimation method was proposed and an efficient optimization procedure derived for learning a correlation filter with the support constrained by the spatial reliability map. The second novelty of CSR-DCF is the channel reliability. The reliability is estimated from the properties of the constrained least-squares solution. The channel reliability scores were used for weighting the per-channel filter responses in localization.
Experimental comparison with recent related state-of-the-art boundary-constraints formulations showed significant benefits of using our formulation. The CSR-DCF has state-of-the-art performance on standard benchmarks – OTB100 [44], VOT2015 [26] and VOT2016 [25] – while running in real-time on a single CPU. Despite using simple features like HoG and Colornames, the CSR-DCF performs on par with trackers that apply computationally complex deep ConvNets, but is significantly faster.
To the best of our knowledge, the proposed approach is the first of its kind to introduce constrained filter learning with an arbitrary spatial reliability map and the use of channel reliabilities. The spatial and channel reliability formulation is general and can be used in most modern correlation filters, e.g. those using deep features.
Foundations and Trends in Machine Learning, 3(1):1–122, 2011.
2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pages 1090–1097, 2014.
Learning multi-domain convolutional neural networks for visual tracking. In Comp. Vis. Patt. Recognition, pages 4293–4302, June 2016.