Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling

11/19/2018 ∙ by Peng Zhang, et al. ∙ The University of Sydney

It remains a huge challenge to design effective and efficient trackers under complex scenarios, including occlusions, illumination changes and pose variations. A promising way to cope with this problem is to integrate the temporal consistency across consecutive frames and multiple feature cues in a unified model. Motivated by this idea, we propose a novel correlation filter-based tracker in this work, in which the temporal relatedness is reconciled under a multi-task learning framework and the multiple feature cues are modeled using a multi-view learning approach. We demonstrate that the resulting regression model can be efficiently learned by exploiting the structure of a blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm is developed thereafter for efficient online tracking. Meanwhile, we incorporate an adaptive scale estimation mechanism to strengthen the stability of tracking under scale variations. We implement our tracker using two types of features and test it on two benchmark datasets. Experimental results demonstrate the superiority of our proposed approach when compared with other state-of-the-art trackers. Project homepage: http://bmal.hust.edu.cn/project/KMF2JMTtracking.html


I Introduction

Visual tracking is one of the most important components in computer vision systems. It has been widely used in visual surveillance, human-computer interaction, and robotics [1, 2]. Given an annotation of the object (a bounding box) in the first frame, the task of visual tracking is to estimate the target locations and scales in subsequent video frames. Though much progress has been made in recent years, robust visual tracking, which can reconcile different varying circumstances, still remains a challenging problem [1]. On the one hand, the designed trackers are expected to compensate for large appearance changes caused by illumination, occlusion, etc. On the other hand, the real-time requirement of real applications impedes the usage of overcomplicated models.

Briefly speaking, there are two major categories of modern trackers: generative trackers and discriminative trackers [3, 1]. Generative trackers typically assume a generative process of the target appearance and search for the regions most similar to the target model, while discriminative trackers usually train a classifier to distinguish the target from the background. Among discriminative trackers, correlation filter-based trackers (CFTs) have drawn increasing attention since the development of the Kernel Correlation Filter (KCF) tracker [4]. As has been demonstrated, KCF can achieve impressive accuracy, robustness and speed on both the Online Tracking Benchmark (OTB) [5] and the Visual Object Tracking (VOT) challenges [6].

Despite the overwhelming evidence of success achieved by CFTs, two observations prompt us to come up with our tracking approach. First, almost all CFTs ignore the temporal consistency or invariance among consecutive frames, which has been demonstrated to be effective in augmenting tracking performance [7, 8]. Second, there is still a lack of theoretically sound yet computationally efficient models to integrate multiple feature cues. Admittedly, integrating different channel features is not new under the CFT umbrella. However, previous work either straightforwardly concatenates various feature vectors [9, 10] (i.e., assumes mutual independence of feature channels) or incurs a high computational burden which severely compromises the tracking speed [11, 12].

In this paper, to circumvent these two drawbacks simultaneously, we embark on the basic KCF tracker and present a multi-frame multi-feature joint modeling tracker (MF2JMT) to strike a good trade-off between robustness and speed. In MF2JMT, the interdependencies between different feature cues are modeled using a multi-view learning (MVL) approach to enhance the discriminative power of the target appearance representation under various changes. Specifically, we use the view consistency principle to encourage the objective functions learned from different views to agree on the labels of most training samples [13]. On the other hand, the temporal consistency is exploited under a multi-task learning (MTL) framework [14], i.e., we model several consecutive frames simultaneously by constraining the objective learned from each frame to stay close to their mean. We extend MF2JMT to its kernelized version (i.e., KMF2JMT) and develop a fast blockwise diagonal matrix inversion algorithm to accelerate model training. Finally, we adopt a two-stage filtering pipeline [9, 15, 16] to cope with the problem of scale variations.

To summarize, the main contributions of our work are twofold. First, a novel tracker, which integrates multiple feature cues and temporal consistency in a unified model, is developed for robust visual tracking. Specifically, instead of simply concatenating multiple features into a single vector, we demonstrate how to reorganize these features by taking their intercorrelations into consideration. Moreover, we also present an advanced way to reconcile temporal relatedness amongst multiple consecutive frames, rather than naively using a forgetting factor [10, 17]. Second, a fast blockwise diagonal matrix inversion algorithm is developed to speed up training and detection. Experiments against state-of-the-art trackers reveal the underlying importance of “intelligent” feature integration and temporal modeling, and also illustrate future directions for the design of modern discriminative trackers.

The rest of this paper is organized as follows. In Section II, we introduce the background knowledge. Then, in Section III, we present the detailed methodology of MF2JMT and extend it to the kernel setting; a fast algorithm for blockwise diagonal matrix inversion is also developed to speed up training and detection. Following this, an adaptive scale estimation mechanism is incorporated into our tracker in Section IV. The performance of our proposed approach against other state-of-the-art trackers is evaluated in Section V. The paper is concluded in Section VI.

Notation: Scalars are denoted by lowercase letters (e.g., $a$), vectors appear as lowercase boldface letters (e.g., $\mathbf{a}$), and matrices are indicated by uppercase letters (e.g., $A$). The $i$-th element of $\mathbf{a}$ is represented by $a_i$, while $(\cdot)^{T}$ denotes the transpose and $(\cdot)^{H}$ the Hermitian transpose. If $A$ is a square matrix, then $A^{-1}$ denotes its inverse. $I$ stands for the identity matrix with compatible dimensions, and $\operatorname{diag}(\mathbf{a})$ denotes a square diagonal matrix with the elements of vector $\mathbf{a}$ on the main diagonal. The $i$-th row of a matrix $A$ is denoted by the row vector $A_{i,:}$, while the $j$-th column is indicated by the column vector $A_{:,j}$. If $|\cdot|$ denotes the absolute value operator, then, for a vector $\mathbf{a}$, the $\ell_1$-norm and $\ell_2$-norm are defined as $\|\mathbf{a}\|_1=\sum_i |a_i|$ and $\|\mathbf{a}\|_2=\sqrt{\sum_i a_i^2}$, respectively.

II Related work

In this section, we briefly review previous work of most relevance to our approach, including popular conventional trackers, the basic KCF tracker and its extensions.

II-A Popular conventional trackers

Success in visual tracking relies heavily on how discriminative and robust the representation of target appearance is against varying circumstances [18, 19]. This is especially important for discriminative trackers, in which a binary classifier is required to distinguish the target from the background. Numerous types of visual features have been successfully applied in discriminative trackers over the last decades, including color histograms [20, 21], texture features [22], Haar-like features [23], etc. Unfortunately, none of them can handle all kinds of varying circumstances individually, and the discriminative capability of a single type of feature is not stable across a video sequence [20]. As a result, it has become prevalent to take advantage of multiple features (i.e., multi-view representations) to enable more robust tracking. For example, [24, 25] used AdaBoost to combine an ensemble of weak classifiers into a powerful classifier, where each weak classifier is trained online on a different training set using pixel colors and a local orientation histogram. Note that several generative trackers have also improved performance by incorporating multi-view representations. [26] employed a group sparsity technique to integrate color histograms, intensity, histograms of oriented gradients (HOGs) [27] and local binary patterns (LBPs) [28] by requiring these features to share the same subset of templates, whereas [29] proposed a probabilistic approach to integrate HOGs, intensity and Haar-like features for robust tracking. These trackers perform well under ordinary conditions, but they are far from satisfactory when tested on challenging videos.

II-B The basic KCF tracker and its extensions

Much like other discriminative trackers, the KCF tracker needs a set of training examples to learn a classifier. The key idea of the KCF tracker is to augment the negative samples to enhance the discriminative ability of the tracking-by-detection scheme, while exploiting the structure of the circulant matrix to speed up training and detection.
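As a concrete illustration of this circulant structure, the following minimal NumPy sketch (ours, not the authors' code; the 1-D signal, sizes and regularization value are illustrative) solves the ridge regression over all cyclic shifts of a base sample entirely with FFTs:

```python
import numpy as np
from scipy.linalg import circulant

n, lam = 64, 1e-2
rng = np.random.default_rng(0)
x = rng.standard_normal(n)                                # base sample (1-D for clarity)
y = np.exp(-0.5 * ((np.arange(n) - n // 2) ** 2) / 4.0)   # Gaussian-shaped regression targets

# Dense sampling: the rows of X are all cyclic shifts of x, i.e., X is circulant.
X = circulant(x).T

# Naive ridge regression, O(n^3): w = (X^T X + lam I)^{-1} X^T y
w_naive = np.linalg.solve(X.T @ X + lam * np.eye(n), X.T @ y)

# Circulant trick: X is diagonalized by the DFT, so the same filter costs O(n log n).
# (Whether a complex conjugate appears here depends on the shift/DFT convention.)
x_hat, y_hat = np.fft.fft(x), np.fft.fft(y)
w_fft = np.real(np.fft.ifft(x_hat * y_hat / (np.abs(x_hat) ** 2 + lam)))

print(np.allclose(w_naive, w_fft))   # True: both routes give the same filter
```

KMF2JMT inherits exactly this property, but over a blockwise structure arising from multiple frames and multiple feature views (Section III-C).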

Following the basic KCF tracker, numerous extensions have been developed to boost its performance, which generally fall into two categories: application of improved features and conceptual improvements in filter learning [30]. The first category lies in designing more discriminative features to deal with challenging environments [31, 32] or straightforwardly concatenating multiple feature cues into a single feature vector to boost representation power [9, 33]. A recent trend is to integrate features extracted from convolutional neural networks (CNNs) trained on large image datasets to replace the traditional hand-crafted features [34, 35]. The design or integration of more powerful features often incurs a high computational burden, whereas blindly concatenating multiple features assumes the mutual independence of these features and neglects their interrelationships. Therefore, it is desirable to have a more “intelligent” manner of reorganizing these features such that their interrelationships are fully exploited.

Conceptually, theoretical extensions of the filter learning have also drawn much attention. Early work focused on accurate and efficient scale estimation, as the basic KCF assumes a fixed target size [15, 11]. The most recent work concentrates on developing more advanced convolutional operators. For example, [36] employed an interpolation model to train convolution filters in the continuous spatial domain. The so-called Continuous Convolution Operator Tracker (C-COT) is further improved in [37], in which the authors introduce factorized convolution operators to drastically reduce the number of parameters in C-COT as well as the number of filters.

Apart from these two extensions, efforts have been made to embed conventional tracking strategies (like part-based tracking [38]) into the KCF framework. For example, [39] developed a probabilistic tracking reliability metric to measure how reliably a patch can be tracked. On the other hand, [40] employed an online random fern classifier as a re-detection component for long-term tracking, whereas [16] presented a biology-inspired framework in which short-term processing and long-term processing cooperate with each other under a correlation filter framework. Finally, it is worth noting that, with the rapid development of correlation filters for visual tracking, several correlation filter-based thermal infrared trackers (e.g., [41, 42, 43]) have been developed in recent years. This work only focuses on visual tracking in color video sequences; we leave the extension to thermal infrared video sequences as future work.

III The Multi-Frame Multi-Feature Joint Modeling Tracker (MF2JMT)

The MF2JMT makes two strategic extensions to the basic KCF tracker. Our idea is to integrate the temporal information and multiple feature cues in a unified model, thus providing a theoretically sound and computationally efficient solution for robust visual tracking. To this end, instead of simply concatenating different features, the MF2JMT integrates multiple feature cues using an MVL approach to better exploit their interrelationships, thus forming a more informative representation of the target appearance. Moreover, the temporal consistency is taken into account under an MTL framework. Specifically, different from prevalent CFTs that learn filter taps using a ridge regression function which only makes use of the template (i.e., the circulant matrix) from the current frame, we show that it is possible to use examples from several consecutive frames to learn the filter taps very efficiently by exploiting the structure of a blockwise diagonal matrix.

Before our work, there were two lines of work attempting to incorporate temporal consistency into the KCF framework. The Spatio-Temporal Context (STC) tracker [10] and its extensions (e.g., [17]) formulate the spatial relationships between the target and its surrounding dense contexts in a Bayesian framework and then use a temporal filtering procedure together with a forgetting factor to update the spatio-temporal model. Despite its simplicity, the tracking accuracy of the STC tracker is poor compared with other state-of-the-art KCF-style trackers (see Section V-F). The other trend is to learn a temporally invariant feature representation trained on a natural video repository for visual tracking [31]. Although the learned features can accommodate partial occlusion or slight illumination variation, they cannot handle scenarios with large appearance changes. Moreover, the effectiveness of the trained features depends largely on the selected repository, which suffers from limited generalization capability. Different from these two kinds of methods, we explicitly model multiple consecutive frames in a joint cost function to circumvent abrupt drift in a single frame and to reconcile temporal relatedness amongst frames over a short period of time. Experiments demonstrate the superiority of our method.

III-A Formulation of MF2JMT

We start from the formulation of the basic MF2JMT. Given $T$ training frames, we assume the number of candidate patches in the $t$-th ($1 \le t \le T$) frame is $n_t$. Suppose the dimensions of the first and second types of features are $d_1$ and $d_2$, respectively (this paper only considers two types of features, but the developed objective function (1) and the associated solution can be straightforwardly extended to three or more feature cues); we denote by $X_t^{(1)}$ (resp. $X_t^{(2)}$) the matrix consisting of the first (resp. second) type of feature in the $t$-th frame (each row represents one sample). Also, let $\mathbf{y}_t$ represent the sample labels. Then, the objective of MF2JMT can be formulated as:

(1)

where the $\lambda$'s are non-negative regularization parameters controlling model complexity, $\mathbf{w}_0^{(1)}$ is the vector of regression coefficients shared by all $T$ frames for the first type of feature, and $\mathbf{v}_t^{(1)}$ denotes the deviation from $\mathbf{w}_0^{(1)}$ in the $t$-th frame. The same definitions apply to $\mathbf{w}_0^{(2)}$ and $\mathbf{v}_t^{(2)}$ for the second type of feature.

The problem (1) contains three different terms with distinct objectives, namely the MTL term, the MVL term and the regularization term. The MTL term is analogous to the formulation of regularized multi-task learning (RMTL) [14], as it encourages the per-frame coefficients $\mathbf{w}_0^{(v)} + \mathbf{v}_t^{(v)}$ to stay close to the shared mean $\mathbf{w}_0^{(v)}$ with a small deviation $\mathbf{v}_t^{(v)}$. The MVL term employs the view consistency principle that widely exists in MVL approaches (e.g., [44]) to constrain the objectives learned from the two views to agree on most training samples. Finally, the regularization term on $\mathbf{w}_0^{(1)}$ and $\mathbf{w}_0^{(2)}$ serves to prevent ill-posed solutions and to enhance the robustness of the selected features to noise and outliers.
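To make the structure of (1) concrete, the objective takes roughly the following form in the notation introduced above (a sketch reconstructed from this description; the exact grouping and weighting of the regularization terms in the original formulation may differ):

\[
\min_{\{\mathbf{w}_0^{(v)},\,\mathbf{v}_t^{(v)}\}}\;
\underbrace{\sum_{v=1}^{2}\sum_{t=1}^{T}\big\|\mathbf{y}_t - X_t^{(v)}\big(\mathbf{w}_0^{(v)}+\mathbf{v}_t^{(v)}\big)\big\|_2^2}_{\text{data fitting}}
\;+\;\underbrace{\lambda_1\sum_{v=1}^{2}\sum_{t=1}^{T}\big\|\mathbf{v}_t^{(v)}\big\|_2^2}_{\text{MTL term}}
\;+\;\underbrace{\lambda_2\sum_{t=1}^{T}\big\|X_t^{(1)}\big(\mathbf{w}_0^{(1)}+\mathbf{v}_t^{(1)}\big)-X_t^{(2)}\big(\mathbf{w}_0^{(2)}+\mathbf{v}_t^{(2)}\big)\big\|_2^2}_{\text{MVL term}}
\;+\;\underbrace{\lambda_3\sum_{v=1}^{2}\big\|\mathbf{w}_0^{(v)}\big\|_2^2}_{\text{regularization}}
\]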

III-B Solution to MF2JMT

Minimizing (1) admits a closed-form solution obtained by equating the gradients of (1) w.r.t. the regression coefficients and the deviation terms to zero. Albeit simple, reorganizing the resulting array of linear equations into standard form is intractable and computationally expensive here. As an alternative, we present an equivalent yet simpler form of (1), which can be solved with a matrix inversion in one step.
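For orientation, this is the same mechanism as in ordinary ridge regression (our notation; the actual system derived below is block-structured rather than a single design matrix):

\[
\nabla_{\mathbf{w}}\Big(\|\mathbf{y}-X\mathbf{w}\|_2^2+\lambda\|\mathbf{w}\|_2^2\Big)=0
\quad\Longrightarrow\quad
\mathbf{w}=\big(X^{T}X+\lambda I\big)^{-1}X^{T}\mathbf{y}.
\]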

Denote

(2)
(3)
(4)
(5)
(6)

where each zero block denotes a zero matrix of the same size as the corresponding feature block, we have:

(7)
(8)
(9)
(10)

Substituting (7)-(10) into (1) yields:

Denote

(12)

then the objective obtained by this substitution becomes:

(13)

Equating the gradients of (13) with respect to the stacked coefficient vectors to zero and performing a straightforward derivation, we have:

(14)

Denoting the resulting coefficient matrix accordingly, the solution of (14) is given by [45]:

(15)

where

(16)

Having computed the regression coefficients of both views, the responses of the candidate samples in the next frame under the trained model can be computed as:

where

(18)

in which the two matrices are feature matrices constructed from the features in the new frame. Specifically, for the first type of feature, we construct the corresponding block from the new frame as defined in (4) and set the remaining blocks to zero matrices so that the sizes coincide with those in (12); in this sense, only the blocks built from the new frame contribute to the response. The same goes for the second type of feature.

III-C Kernel extension and fast implementation

Although (15) gives a tractable solution to (1), it involves a large matrix inversion whose computational complexity is cubic in the problem size when using the well-acknowledged Gaussian elimination algorithm [46]. In this section, to obtain a more powerful regression function and a fast implementation, we demonstrate how to incorporate the dense sampling strategy into the basic MF2JMT under the kernel setting to speed up tracking and detection. A computationally efficient optimization method is also presented by exploiting the structure of the blockwise diagonal matrix.

Dense sampling was previously considered a drawback for discriminative trackers because of the large number of redundant samples it requires [3]. However, when these samples are collected and organized properly, they form a circulant matrix that can be diagonalized efficiently using the DFT matrix, thereby allowing the dual ridge regression problem to be solved entirely in the frequency domain [4]. Due to this attractive property, we first show that it is easy to embed the “kernel trick” into MF2JMT; we then show that it is possible to obtain non-linear filters that are as fast as linear correlation filters using dense sampling, both to train and to evaluate. We term this improvement the kernel multi-frame multi-feature joint modeling tracker (KMF2JMT).
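To illustrate why dense sampling keeps kernel evaluation cheap, the sketch below (ours, not the authors' implementation; it assumes a Gaussian kernel on 1-D signals with an illustrative bandwidth) evaluates the kernel between one sample and all cyclic shifts of another with a single FFT-based cross-correlation:

```python
import numpy as np

def gaussian_kernel_correlation(x1, x2, sigma):
    """k[i] = exp(-||roll(x1, i) - x2||^2 / sigma^2) for every cyclic shift i,
    computed with one FFT-based cross-correlation: O(n log n) instead of O(n^2)."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x1)) * np.fft.fft(x2)))
    dist2 = np.dot(x1, x1) + np.dot(x2, x2) - 2.0 * cross
    return np.exp(-np.maximum(dist2, 0.0) / sigma ** 2)

rng = np.random.default_rng(1)
n, sigma = 32, 10.0
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)

# Brute-force check: shift x1 explicitly and evaluate the kernel n times.
brute = np.array([np.exp(-np.sum((np.roll(x1, i) - x2) ** 2) / sigma ** 2)
                  for i in range(n)])
print(np.allclose(gaussian_kernel_correlation(x1, x2, sigma), brute))   # True
```

In KCF, and hence in KMF2JMT, such kernel-correlation vectors are what generate the circulant blocks discussed below, and the dual coefficients are then obtained by elementwise operations in the Fourier domain [4].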

Denote

(19)

we have

(20)

and

where .

According to (12), the kernel matrix consists of block matrices, and its $(i, j)$-th block can be represented as:

(22)

where $\delta(\cdot)$ denotes the indicator function.

If we project the two types of feature matrices onto a Reproducing Kernel Hilbert Space (RKHS), i.e., apply a non-linear transform $\phi(\cdot)$ to both of them, we can obtain the kernelized version of (1), i.e., KMF2JMT. According to [4], (22) can be represented as:

(23)

where $C(\cdot)$ denotes a circulant matrix generated by the corresponding kernel vector (see Appendix A for more details). Similarly,

(24)
(25)
(26)

Note that the above matrices are circulant and thus can be diagonalized as expressed below [47]:

(27)

where $F$ is the DFT matrix. Combining (20)-(27), according to Appendix A, the response of candidate samples (defined in Section III-B) under the kernel setting can be computed as:

(28)

where the auxiliary terms are defined in the following and the overall kernel matrix is a block diagonal matrix with the above blocks on its main diagonal.

Denote

(30)

then

(31)

According to the corresponding matrix inversion identity in [45], we have:

where

(33)

It is obvious that the two matrices involved are blockwise diagonal matrices that can be partitioned into diagonal blocks of size $n \times n$. According to Theorem 1, inverting such a matrix costs only $O(m^3 n)$, where $m$ is the number of block rows, rather than $O((mn)^3)$ for a general matrix of the same size. Besides, it only takes elementwise product operations to combine the remaining terms, so the overall cost of the solution is dominated by these small inversions. On the other hand, the DFT bounds the cost of the transforms at nearly $O(n \log n)$ by exploiting the circulant structure [4]. Therefore, the overall cost of computing the response in (28) remains low, given that only a few inverse DFTs are required.

Theorem 1

Given an invertible blockwise diagonal matrix $A$ that can be partitioned into an $m \times m$ grid of diagonal blocks, each of size $n \times n$, the computational complexity of computing $A^{-1}$ is $O(m^3 n)$.

Proof:

Denote by $B_{ij}$ ($1 \le i, j \le m$) the $(i, j)$-th block of $A$, where each $B_{ij}$ is a diagonal matrix of size $n \times n$. After a series of elementary matrix operations, e.g., pre-multiplying or post-multiplying the matrix by different elementary matrices, we can interchange the rows and columns of $A$ arbitrarily. Therefore, there exists an invertible permutation matrix $P$ with $P^{-1} = P^{T}$ such that the matrix $\tilde{A} = P A P^{T}$ satisfies: the elements of its $k$-th diagonal block $\tilde{A}_k$ ($1 \le k \le n$) come from the $(k, k)$-th entries of the blocks $B_{ij}$, i.e., $(\tilde{A}_k)_{ij} = (B_{ij})_{kk}$. Obviously, the entries of $\tilde{A}$ outside these diagonal blocks are zero. Thus, $\tilde{A}$ can be represented as:

$\tilde{A} = \operatorname{diag}(\tilde{A}_1, \tilde{A}_2, \ldots, \tilde{A}_n)$. (34)

Given that $A^{-1} = P^{T} \tilde{A}^{-1} P$, $A^{-1}$ can be obtained by allocating the elements of $\tilde{A}^{-1}$ back to the corresponding row and column locations. The main computational cost of $A^{-1}$ comes from the calculation of $\tilde{A}^{-1}$. Each block $\tilde{A}_k$ is of size $m \times m$, so the computational complexity of inverting $\tilde{A}$ is $O(m^3 n)$. As a result, the computational complexity of computing $A^{-1}$ is $O(m^3 n)$.
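The permutation argument above is easy to check numerically. The following NumPy sketch (ours, with illustrative sizes; not the authors' MATLAB implementation) inverts such a matrix by solving the $n$ small $m \times m$ systems directly:

```python
import numpy as np

def invert_blockwise_diagonal(A, m, n):
    """Invert an (m*n) x (m*n) matrix whose m x m grid of blocks are all n x n
    diagonal matrices. Grouping the k-th diagonal position of every block yields
    n independent m x m systems, so the cost is O(m^3 n) instead of O((m n)^3)."""
    A_inv = np.zeros_like(A)
    for k in range(n):                 # k-th diagonal position inside every block
        S_k = A[k::n, k::n]            # m x m matrix of the (k, k)-entries of all blocks
        A_inv[k::n, k::n] = np.linalg.inv(S_k)
    return A_inv

# Build a random test matrix with the required structure.
m, n = 4, 8
rng = np.random.default_rng(0)
A = np.zeros((m * n, m * n))
for i in range(m):
    for j in range(m):
        A[i * n:(i + 1) * n, j * n:(j + 1) * n] = np.diag(rng.standard_normal(n))
A += 10.0 * np.eye(m * n)              # keep the example well conditioned

print(np.allclose(invert_blockwise_diagonal(A, m, n), np.linalg.inv(A)))   # True
```

With the sizes arising here (a small number of block rows determined by the frames and feature views, and a block size equal to the number of samples), the cost of this inversion stays negligible next to the FFTs, consistent with the complexity discussion above.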

IV Scale Adaptive KMF2JMT

To further improve the overall performance of KMF2JMT, we follow the Integrated Correlation Filters (ICFs) framework in [16] to cope with scale variations. The ICF is a cascaded filtering process that performs translation estimation and scale estimation in turn (the same pipeline as adopted in [15] and [9]). Unless otherwise specified, the KMF2JMT mentioned in the following experimental parts refers to the scale adaptive one.

Specifically, in the scale adaptive KMF2JMT, the training of the basic KMF2JMT is accompanied by the training of another one-dimensional Discriminative Scale Space Correlation Filter (DSSCF) [15], and this newly trained filter is used for scale estimation. To evaluate the trained DSSCF, image patches centered at the location found by the KMF2JMT are cropped from the image at a series of sizes obtained by scaling the current target size with powers of a fixed scale factor. All image patches are then resized to the template size for feature extraction. Finally, the output of the scale estimation is given by the image patch with the highest filtering response. Similar to KMF2JMT, the model parameters are updated in an interpolating manner with a fixed learning rate. We refer readers to [15] for more details and the implementation of DSSCF.
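As a rough illustration of this scale search, the sketch below generates the scale pyramid and resizes every candidate to a common template (the number of scales and the scale step are the commonly reported DSST defaults, not necessarily the exact values used in this paper; `cv2` is used only for resizing, and padding at image borders is omitted for brevity):

```python
import numpy as np
import cv2  # used only to resize the cropped patches

def scale_candidates(frame, center, target_size, n_scales=33, a=1.02, template=(32, 32)):
    """Crop patches of size (a**s * P, a**s * R) around `center` for a symmetric
    set of scale exponents s, and resize each to a common template size."""
    cy, cx = center
    P, R = target_size
    H, W = frame.shape[:2]
    exponents = np.arange(n_scales) - (n_scales - 1) / 2.0
    patches = []
    for s in a ** exponents:                        # scale factors a**s
        h, w = max(2, int(round(P * s))), max(2, int(round(R * s)))
        y0 = int(np.clip(round(cy - h / 2), 0, H - 2))
        x0 = int(np.clip(round(cx - w / 2), 0, W - 2))
        patch = frame[y0:min(y0 + h, H), x0:min(x0 + w, W)]
        patches.append(cv2.resize(patch, template))  # common size for feature extraction
    return np.stack(patches)                         # shape: (n_scales, tH, tW[, C])
```

The scale whose patch yields the highest DSSCF response is then kept, as described above.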

V Experiments

We conduct four groups of experiments to demonstrate the effectiveness and superiority of our proposed KMF2JMT. First, we implement KMF2JMT and several of its baseline variants, including a multi-feature-only tracker (MFT), a multi-frame-only tracker (MFT-2), a scale-adaptive-only tracker (SAT), a scale-adaptive multi-feature tracker (SAMFT) and a scale-adaptive multi-frame tracker (SAMFT-2), to analyze and evaluate the component-wise contributions to the performance gain. We then evaluate our tracker against state-of-the-art trackers on the Object Tracking Benchmark (OTB) [5]. Following this, we present results on the Visual Object Tracking (VOT) challenge [48]. Finally, we compare KMF2JMT with several other representative CFTs, including MOSSE [49], SAMF [9], MUSTer [16] and the recently published C-COT [36], to reveal the properties of our proposed tracker within the CFT family and also to illustrate future research directions for modern discriminative tracker design.

V-A Experimental setup

To make a comprehensive evaluation, we use all the color video sequences in the OTB 2015 dataset. These sequences are captured under various conditions, including occlusion, deformation, illumination variation, scale variation, out-of-plane rotation, fast motion, background clutter, etc. We use the success plot to evaluate all trackers on the OTB dataset. The success rate counts the percentage of successfully tracked frames by measuring the overlap score for trackers on each frame. The average overlap measure is the most appropriate for tracker comparison, as it accounts for both position and size. Let $B_t$ denote the tracking bounding box and $B_g$ denote the ground-truth bounding box; the overlap score is defined as $OS = |B_t \cap B_g| / |B_t \cup B_g|$, where $\cap$ and $\cup$ represent the intersection and union of two regions, respectively, and $|\cdot|$ denotes the number of pixels in the region. In the success plot, the overlap threshold is varied from 0 to 1, and the ranking of trackers is based on the Area Under Curve (AUC) score. We also report the speed of trackers in terms of average frames per second (FPS) over all testing sequences.
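For concreteness, the overlap score and the success curve/AUC can be computed as follows (a straightforward sketch in our own code; the threshold grid is illustrative):

```python
import numpy as np

def overlap_score(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    ix = max(0.0, min(xa + wa, xb + wb) - max(xa, xb))
    iy = max(0.0, min(ya + ha, yb + hb) - max(ya, yb))
    inter = ix * iy
    union = wa * ha + wb * hb - inter
    return inter / union if union > 0 else 0.0

def success_curve(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Fraction of frames whose overlap exceeds each threshold; the AUC ranks trackers."""
    scores = np.array([overlap_score(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    curve = np.array([(scores > t).mean() for t in thresholds])
    return curve, curve.mean()   # (success rate per threshold, AUC)
```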

We also test tracker performance on the VOT 2015 dataset containing 60 video sequences. The VOT challenge is a popular competition for single object tracking. Different from the OTB 2015 dataset, a tracker is restarted in the case of a failure (i.e., when there is no overlap between the detected bounding box and the ground truth) in the VOT 2015 dataset. For the VOT 2015 dataset, tracking performance is evaluated in terms of accuracy (overlap with the ground truth), robustness (failure rate) and the expected average overlap (EAO). EAO is obtained by combining the raw values of accuracy and failures; its score represents the average overlap a tracker would obtain on a typical short-term sequence without reset. For a full treatment of these metrics, interested readers can refer to [48].

V-B Implementation details of the proposed tracker

The six regularization parameters are tuned with a coarse-to-fine procedure. In the coarse module, we roughly determine a satisfactory range for each parameter. Here, the word “satisfactory” means that values in the range achieve a higher mean success rate at the chosen overlap threshold on the OTB 2015 dataset. Then, in the fine-tuning module, we divide these parameters into three groups based on their correlations. When we tune one group of parameters, the other groups are set to default values, i.e., the mean value of the optimal range given by the coarse module. Within each group, the parameters are tuned with a grid search, and we finally pinpoint the specific value that achieves the highest mean success rate. Besides, we set the learning rate following previous work [4, 9, 15] and model three consecutive frames in the MTL module. We select HOGs [27] and color names [50] for image representation in the MVL module. The HOGs and color names are complementary to each other, as the former emphasizes image gradients, which are robust to illumination changes and deformation, while the latter focuses on color information, which is robust to motion blur. For HOGs, we employ the variant in [51], with the implementation provided by [52]; the cell size and the number of orientations are fixed across all experiments. For color names, we map the RGB values to a probabilistic 11-dimensional color representation which sums up to one. All the experiments in this work are conducted using MATLAB on an Intel quad-core PC. MATLAB code is available from our project homepage http://bmal.hust.edu.cn/project/KMF2JMTtracking.html.
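The tuning procedure can be summarized by the following sketch (parameter names, ranges and the stand-in objective are placeholders of ours, not the values used in the paper):

```python
import itertools
import numpy as np

# Placeholder groups and ranges; the paper tunes six regularization parameters
# split into three correlated groups, with the other groups held at defaults.
groups = {
    "group1": {"lam1": np.linspace(0.1, 1.0, 10), "lam2": np.linspace(0.1, 1.0, 10)},
    "group2": {"lam3": np.linspace(0.01, 0.1, 10)},
    "group3": {"lam4": np.linspace(1e-4, 1e-2, 10), "lam5": np.linspace(1e-4, 1e-2, 10),
               "lam6": np.linspace(1e-4, 1e-2, 10)},
}
current = {k: float(np.mean(v)) for g in groups.values() for k, v in g.items()}

def mean_success_rate(params):
    """Stand-in objective; in practice this runs the tracker over OTB 2015."""
    return -sum((v - 0.5) ** 2 for v in params.values())

for grid in groups.values():                    # fine-tune one group at a time
    best, best_score = None, -np.inf
    for combo in itertools.product(*grid.values()):
        trial = {**current, **dict(zip(grid.keys(), combo))}
        score = mean_success_rate(trial)
        if score > best_score:
            best, best_score = trial, score
    current = best                              # keep the best setting of this group
print(current)
```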

V-C Evaluation on component-wise contributions

Before systematically evaluating the performance of our proposed tracker, we first compare KMF2JMT with its baseline variants to demonstrate the component-wise contributions to the performance gain. To this end, we implement seven trackers with various degraded settings, including a multi-feature-only tracker (MFT) which just uses multiple feature cues, a multi-frame-only tracker (MFT-2) which just uses the temporal relatedness across consecutive frames, a scale-adaptive-only tracker (SAT) which just handles scale variations, a scale-adaptive multi-feature tracker (SAMFT) and a scale-adaptive multi-frame tracker (SAMFT-2). Besides, to validate the benefit of modeling multiple feature cues under an MVL framework rather than simply concatenating them, we also implement a multiple-feature tracker with feature concatenation (MFT-C). Note that KCF [4], which only uses the HOG feature without temporal modeling and scale searching, serves as the baseline in this section.

Table I summarizes the differences between these trackers; the second-to-last row denotes the basic kernel multi-frame multi-feature joint modeling tracker without scale estimation. In this table, we report the mean success rate at the threshold of 0.5, which corresponds to the PASCAL evaluation criterion [53]. Although these trackers share one or more common components, their tracking performance differs significantly. This indicates that the visual features, the temporal relatedness and the scale searching strategy are all essential to the visual tracking task. KCF ranks the lowest among the compared trackers, as expected. MFT (or MFT-C), MFT-2 and SAT extend KCF by augmenting the feature space with color information, taking advantage of temporal information through frame relatedness, and introducing a scale adaptive searching strategy, respectively, thus each achieving a moderate improvement. MFT outperforms MFT-C thanks to a more “intelligent” feature fusion strategy. Besides, it is obvious that the scale adaptive searching can effectively handle scale variations, thus obtaining a large improvement in success rate (see SAT, SAMFT and SAMFT-2 as against KCF, MFT and MFT-2 in Table I). Moreover, by comparing MFT-2 with MFT and SAMFT-2 with SAMFT, one can see that the integration of multiple frames plays a more positive role in improving tracking performance than the integration of multiple features. Finally, comparing the individual success rate gains of MFT, MFT-2 and SAT over KCF with the improvement obtained by KMF2JMT indicates that KMF2JMT is not just a simple combination of MFT, MFT-2 and SAT.

MF TM SA mean success rate
KCF no no no
MFT yes no no
MFT-C yes no no
MFT-2 no yes no
SAT no no yes
SAMFT yes no yes
SAMFT-2 no yes yes
KMF2JMT (no scale) yes yes no
KMF2JMT yes yes yes
TABLE I: A comparison of our KMF2JMT with different baseline variants. The mean overlap precision (OP) score at the threshold of 0.5 over all the color videos in the OTB dataset is presented. The best two results are marked in red and blue, respectively. MF, TM and SA are abbreviations of Multiple Features, Temporal Modeling and Scale Adaptive, respectively.
(a) One pass evaluation (OPE)
(b) Temporal robustness evaluation (TRE)
(c) Spatial robustness evaluation (SRE)
Fig. 1: Success plot showing the performance of our compared to state-of-the-art methods on the OTB dataset. The AUC score for each tracker is reported in the legend. Only the top ten trackers are displayed in the legend for clarity in (a) OPE, (b) TRE and (c) SRE. Our approach provides the best performance in OPE and the second best performance in TRE and SRE.
Fig. 2: Success plots for each attribute on the OTB dataset. The value in each title represents the number of videos associated with that attribute. The success score of each tracker is shown in the legend. Our approach provides the best performance on 7 out of 11 attributes, namely motion blur, deformation, illumination variation, low resolution, occlusion, out-of-plane rotation, and scale variation.

V-D Comparison with state-of-the-art trackers

In this section, we provide a comprehensive comparison of our proposed KMF2JMT with state-of-the-art trackers on the OTB 2015 dataset: 1) state-of-the-art trackers that do not follow the CFT framework yet achieve remarkable performance on OTB 2015, namely MEEM [54], TGPR [55], LSST [56] and MTMVTLAD [26]; 2) the popular trackers provided in [5], such as Struck [57], ASLA [58], SCM [59], TLD [60], MTT [61], VTD [62], VTS [63] and MIL [64]. Note that, as a generative tracker, MTMVTLAD has similar motivations to our approach, as it also attempts to integrate multi-task multi-view learning to improve tracking performance. However, different from KMF2JMT, MTMVTLAD casts tracking as a sparse representation problem in a particle filter framework. Moreover, MTMVTLAD does not explicitly reconcile the temporal coherence of consecutive frames.

The success plot of all the competing trackers using One Pass Evaluation (OPE) is shown in Fig. 1(a). For clarity, we only show the top ten trackers in this comparison. As can be seen, our proposed KMF2JMT achieves overall the best performance and persistently outperforms the overall second and third best trackers, i.e., MEEM [54] and TGPR [55]. Looking deeper (see the detailed summary in Table II), the MEEM tracker, which uses an entropy-minimization-based ensemble learning strategy to avert bad model updates, and the transfer-learning-based TGPR tracker obtain the next best mean success rates. By contrast, our approach follows a CFT framework, while using an explicit temporal consistency regularization to enhance robustness, and it also outperforms MTMVTLAD in mean success rate. Finally, it is worth noting that our approach achieves this superior performance while operating in real time, whereas MEEM, TGPR and MTMVTLAD run at lower frame rates (see Table II).

(a) Occlusion on Clifbar.
(b) Illumination variation on Skating1.
(c) Scale variation on Human5.
(d) Out-of-plane rotation on Tiger2.
(e) Tracker legend
Fig. 3: A qualitative comparison of our method with five state-of-the-art trackers. Tracking results are shown on four example videos from the OTB 2015 dataset. The videos show challenging situations, such as (a) occlusion, (b) illumination variation, (c) scale variations, and (d) out-of-plane rotation. Our approach offers superior performance compared to the existing trackers in these challenging situations. (e) shows tracker legend.

We also report the tracking results using Temporal Robustness Evaluation (TRE) and Spatial Robustness Evaluation (SRE). TRE runs trackers on sub-sequences segmented from the original sequence with different lengths, while SRE evaluates trackers by initializing them with slightly shifted or scaled ground-truth bounding boxes. With TRE and SRE, the robustness of each evaluated tracker can be comprehensively interpreted. The SRE and TRE evaluations are shown in Fig. 1(b) and Fig. 1(c), respectively. In both evaluations, our approach provides a consistent performance gain over the majority of existing methods. MEEM achieves better robustness than our approach; a possible reason is that MEEM combines the estimates of an ensemble of experts to mitigate inaccurate predictions or sudden drifts, so that the weaknesses of the individual trackers are reciprocally compensated. We also perform an attribute-based analysis of our tracker. In OTB, each sequence is annotated with attributes from a set of 11, namely: fast motion, background clutter, motion blur, deformation, illumination variation, in-plane rotation, low resolution, occlusion, out-of-plane rotation, out-of-view and scale variation. It is interesting to find that KMF2JMT ranks first on 7 out of the 11 attributes (see Fig. 2), especially on illumination variation, occlusion, out-of-plane rotation and scale variation. This verifies the superiority of KMF2JMT in target appearance representation and its capability of discovering reliable coherence from short-term memory. However, the overwhelming advantage no longer exists for fast motion and out-of-view, because the temporal consistency among consecutive frames becomes weaker in these two scenarios. The qualitative comparison shown in Fig. 3 corroborates the quantitative evaluation results. It is worth noting that we strictly follow the protocol provided in [5] and use the same parameters for all sequences.

mean success rate FPS
MEEM
TGPR
MTMVTLAD
TABLE II: A detailed comparison of our KMF2JMT with MEEM [54], TGPR [55] and MTMVTLAD [26]. The mean overlap precision (OP) score at the threshold of 0.5 over all the color videos in the OTB dataset is presented, together with the average speed in FPS.

V-E VOT 2015 Challenge

In this section, we present results on the VOT 2015 challenge. We compare our proposed KMF2JMT with the trackers participating in this challenge. For a fair comparison, DSST [15] is substituted with its fast version (i.e., fDSST [65]) proposed by the same authors, as fDSST demonstrates superior performance to DSST, as shown in [65].

(a) AR rank (sequence pooling)
(b) AR rank (attribute normalization)
(c) Zoomed-in AR rank (sequence pooling)
(d) Zoomed-in AR rank (attribute normalization)
(e) Tracker legend
Fig. 4: The accuracy and robustness (AR) rank plots generated by (a) sequence pooling and by (b) attribute normalization on the VOT 2015 dataset. The accuracy and robustness ranks are plotted along the vertical and horizontal axes, respectively. (c) and (d) show zoomed-in versions of (a) and (b), respectively, in which only the top-10 accuracy and robustness ranks are plotted. (e) shows the tracker legend. Our proposed KMF2JMT (denoted by the red circle) achieves top-6 performance in terms of both accuracy and robustness among the competitors in both experiments.

In the VOT benchmark, each video is annotated with five different attributes: camera motion, illumination change, occlusion, size change and motion change. Different from the OTB 2015 dataset, the attributes in VOT 2015 are annotated per frame. Fig. 4 shows the accuracy and robustness (AR) rank plots generated by sequence pooling and by attribute normalization. The pooled AR plots are generated by concatenating the experimental results on all sequences to directly obtain a rank list, while the attribute-normalized AR rank plots are obtained based on the average of the ranks achieved on the individual attributes. Fig. 5 shows the EAO ranks. Only results for the top-performing trackers in the VOT 2015 challenge are reported for clarity.

It is easy to summarize some key observations from these figures:
1) The CFTs (including our tracker KMF2JMT, DeepSRDCF [35], SRDCF [66], RAJSSC [67], NSAMF [48] and SAMF [9]) account for the majority of the top-performing trackers. By fully exploiting the representation power of CNNs or by rotating candidate samples to augment the candidate set, MDNet [68] and sPST [69] also demonstrate superior (or even the best) performance. This result indicates that a well-designed discriminative tracker with a conventional tracking-by-detection scheme and random sampling in the detection phase can achieve almost the same tracking accuracy as state-of-the-art CFTs coupled with dense sampling, at the cost of a high computational burden (as officially reported in [48]).
2) Among the top-performing CFTs, KMF2JMT is tied for first place with DeepSRDCF, SRDCF, RAJSSC and NSAMF in terms of tracking accuracy. However, the robustness of KMF2JMT is inferior to that of DeepSRDCF and SRDCF. Both DeepSRDCF and SRDCF introduce a spatial regularization penalty term to circumvent the boundary effects caused by conventional circular correlation, thus significantly mitigating inaccurate training samples and restricted searching regions. However, one should note that these two trackers can hardly run in real time. A thorough comparison between KMF2JMT and DeepSRDCF is presented in Section V-F.
3) Our KMF2JMT outperforms RAJSSC, NSAMF and SAMF in terms of robustness and EAO values. All these trackers use HOGs and color information to describe the target appearance under a scale-adaptive framework. The performance difference indicates that our KMF2JMT benefits from an advanced feature fusion scheme and that the integration of multiple frames contributes to the performance gain.

Apart from these three observations, there are other interesting points. For example, NSAMF achieves a large performance gain compared with SAMF; the main difference is that NSAMF substitutes the color names with color probabilities. On the other hand, the EBT [70] reaches fairly high robustness and EAO values, but its tracking accuracy is rather poor. One possible reason is that the adopted contour information does not adapt well to changes in target scale and aspect ratio.

Fig. 5: The expected average overlap (EAO) graphs with trackers ranked from right to left. The right-most tracker is the top-performing one in terms of EAO values. The grey solid line denotes the average performance of trackers published at ICCV, ECCV, CVPR, ICML or BMVC in 2014/2015 (nine papers from 2015 and six from 2014), beyond which a tracker can be considered state of the art [48]. The green dashed line denotes the performance of the VOT winner (i.e., fDSST [65]). See Fig. 4 for the tracker legend.

V-F Comparison among correlation filter-based trackers (CFTs)

In this last set of experiments, we investigate the performance of representative CFTs. Our goal is to demonstrate the effectiveness of KMF2JMT and reveal its properties when compared with other CFT counterparts. We also attempt to illustrate future research directions for CFTs through a comprehensive evaluation. Table III summarizes the basic information of the selected CFTs as well as their corresponding FPS values (some descriptions are adapted from [71]).

(a) One pass evaluation (OPE)
(b) Temporal robustness evaluation (TRE)
(c) Spatial robustness evaluation (SRE)
Fig. 6: Speed and accuracy plots of state-of-the-art CFTs on the OTB 2015 dataset. We use the AUC score of the success plots to measure tracker accuracy or robustness. The dashed vertical line (at the real-time FPS threshold adopted in [72]) separates the trackers into real-time trackers and those that cannot run in real time. Meanwhile, the solid horizontal line (the mean of the AUC values of all competing CFTs) separates trackers into well-performing trackers and those that perform poorly in terms of tracking accuracy. The proposed KMF2JMT achieves the best accuracy among all real-time trackers in terms of (a) OPE, (b) TRE and (c) SRE.
Published year Major contribution FPS
MOSSE [49] Pioneering work of introducing correlation filters for visual tracking.
CSK [73] Introduced Ridge Regression problem with circulant matrix to apply kernel methods.
STC [10] Introduced spatio-temporal context information.
CN [74] Introduced color attributes as effective features.
DSST [15] Relieved the scaling issue using feature pyramid and 3-dimensional correlation filter.
SAMF [9] Integrated both color feature and HOG feature; Applied a scaling pool to handle scale variations.
KCF [4] Extended the work of CSK and introduced the multi-channel HOG feature.
LCT [40] Introduced online random fern classifier as re-detection component for long-term tracking.
MUSTer [16] Proposed a biology-inspired framework to integrate short-term processing and long-term processing.
RPT [39] Introduced reliable local patches to facilitate tracking.
CF2 [34] Introduced features extracted from convolutional neural networks (CNN) for visual tracking.
DeepSRDCF [35] Introduced CNN features for visual tracking together with a spatial regularization term to handle the boundary effect.
MvCFT [12] Introduced Kullback-Leibler (KL) divergence to fuse multiple feature cues.
C-COT [36] Employed an implicit interpolation model to train convolutional filters in continuous spatial domain.
KMF2JMT (ours) Integrated multiple feature cues and temporal consistency in a unified model.
TABLE III: Selected competing CFTs (including summarized major contributions) and their corresponding FPS values (adapted from [71]).

Fig. 6 shows the AUC scores of the success plots vs. FPS for all competing CFTs in OPE, TRE and SRE, respectively. The dashed vertical line (at the real-time FPS threshold adopted in [72]) separates the trackers into real-time trackers and those that cannot run in real time, while the solid horizontal line (the mean of the AUC values of all competing CFTs) separates trackers into well-performing trackers and those that perform poorly in terms of tracking accuracy. As can be seen, our method performs the best in terms of accuracy among all real-time trackers, thus achieving a good trade-off between speed and accuracy. This suggests that our modifications to the basic KCF tracker are effective and efficient. By contrast, although introducing a long-term tracking strategy (e.g., MUSTer, LCT) or imposing a spatial regularization penalty term (e.g., DeepSRDCF) can also augment tracking performance, most of these modifications cannot be directly applied in real applications where the real-time condition is a prerequisite. This unfortunate fact also applies to C-COT, in which the hand-crafted features are substituted with powerful CNN features. Therefore, a promising research direction is to investigate computationally efficient long-term tracking strategies or spatial regularization with little or no sacrifice in speed.

Finally, it is worth mentioning that our approach provides a consistent gain in performance compared to MvCFT, although both methods employ the same features under an MVL framework. MvCFT only fuses tracking results from different views to provide a more precise prediction, whereas our approach enjoys a more reliable and computationally efficient training model. On the other hand, it is somewhat surprising that SAMF achieves desirable performance in both the TRE and SRE experiments. One possible reason is that the scale estimation method in SAMF, i.e., exhaustively searching a scaling pool, is more robust to scale variations (although time-consuming).

VI Conclusions and future work

In this paper, we proposed the kernel multi-frame multi-feature joint modeling tracker (KMF2JMT