Code for the paper 'ROI Pooled Correlation Filters for Visual Tracking' (CVPR 2019)
The ROI (region-of-interest) based pooling method performs pooling operations on the cropped ROI regions for various samples and has shown great success in the object detection methods. It compresses the model size while preserving the localization accuracy, thus it is useful in the visual tracking field. Though being effective, the ROI-based pooling operation is not yet considered in the correlation filter formula. In this paper, we propose a novel ROI pooled correlation filter (RPCF) algorithm for robust visual tracking. Through mathematical derivations, we show that the ROI-based pooling can be equivalently achieved by enforcing additional constraints on the learned filter weights, which makes the ROI-based pooling feasible on the virtual circular samples. Besides, we develop an efficient joint training formula for the proposed correlation filter algorithm, and derive the Fourier solvers for efficient model training. Finally, we evaluate our RPCF tracker on OTB-2013, OTB-2015 and VOT-2017 benchmark datasets. Experimental results show that our tracker performs favourably against other state-of-the-art trackers.
Visual tracking aims to localize a manually specified target object in successive frames, and it has been extensively studied in the past decades for its broad applications in autonomous driving, human-machine interaction, behavior recognition, etc. Visual tracking remains a very challenging task due to the limited training data and the many real-world challenges, such as occlusion, deformation and illumination variations.
In recent years, the correlation filter (CF) has become one of the most widely used formulas in visual tracking for its computational efficiency. The success of the correlation filter mainly comes from two aspects: first, by exploiting the properties of circulant matrices, the CF-based algorithms do not need to construct the training and testing samples explicitly, and can be efficiently optimized in the Fourier domain, enabling them to handle more features; second, optimizing a correlation filter can be equivalently converted to solving a system of linear equations, thus the filter weights can either be obtained with the analytic solution (e.g., [danelljan2015learning, danelljan2014accurate]) or be solved via optimization algorithms with quadratic convergence [danelljan2015learning, danelljan2017eco]. As is well recognized, the primal correlation filter algorithms have limited tracking performance due to the boundary effects and the over-fitting problem. The boundary effects are caused by the periodic assumption on the training samples, while the over-fitting problem is caused by the imbalance between the number of model parameters and the number of training samples. Though the boundary effects have been well addressed in several recent papers (e.g., SRDCF [danelljan2015learning], DRT [sun2018correlation], BACF [galoogahi2017learning] and ASRCF [dai2019]), the over-fitting problem has received little attention and remains a challenging research topic.
The average/max-pooling operation has been widely used in the deep learning methods via the pooling layer, which is shown to be effective in handling the over-fitting problem and deformations. Currently, two kinds of pooling operations are widely used in deep learning methods. The first one performs average/max-pooling on the entire input feature map and obtains a feature map with reduced spatial resolution. In the CF formula, the pooling operation on the input feature map leads to fewer available synthetic training samples, which limits the discriminative ability of the learned filter. Also, the smaller size of the feature map significantly reduces the localization accuracy. The ROI (Region of Interest)-based pooling operation is an alternative, which has been successfully embedded into several object detection networks (e.g., [girshick2015fast, ren2015faster]). Instead of directly performing the average/max-pooling on the entire feature map, the ROI-based pooling method first crops a large number of ROI regions, each of which corresponds to a target candidate, and then performs average/max-pooling for each candidate ROI region independently. The ROI-based pooling operation has the merits of a pooling operation as mentioned above, and at the same time retains the number of training samples and the spatial information for localization, thus it is meaningful to introduce the ROI-based pooling into the CF formula. Since the CF algorithm has no access to real-world samples, how to exploit the ROI-based pooling in a correlation filter formula remains to be investigated.
In this paper, we study the influence of the pooling operation in visual tracking, and propose a novel ROI pooled correlation filters algorithm. Even though the ROI-based pooling algorithm has been successfully applied in many deep learning-based applications, it is seldom considered in the visual tracking field, especially in the correlation filter-based methods. Since the correlation filter formula does not really extract positive and negative samples, it is infeasible to perform the ROI-based pooling like Fast R-CNN [girshick2015fast]. Through mathematical derivation, we provide an alternative solution to implement the ROI-based pooling. We propose a correlation filter algorithm with equality constraints, through which the ROI-based pooling can be equivalently achieved. We propose an Alternating Direction Method of Multipliers (ADMM) algorithm to solve the optimization problem, and provide an efficient solver in the Fourier domain. Extensive experiments on the OTB-2013 [wu2013online], OTB-2015 [wu2015object] and VOT-2017 [VOT2017] datasets validate the effectiveness of the proposed method (see Figure 1 and Section 5). The contributions of this paper are three-fold:
This paper is the first attempt to introduce the idea of ROI-based pooling in the correlation filter formula. It proposes a correlation filter algorithm with equality constraints, through which the ROI-based pooling operation can be equivalently achieved without the need for real-world ROI sample extraction. The learned filter weights are insusceptible to the over-fitting problem and are more robust to deformations.
This paper proposes a robust ADMM method to optimize the proposed correlation filter formula in the Fourier domain. With the computed Lagrangian multipliers, the paper uses the conjugate gradient method for filter learning, and develops an efficient optimization strategy for each step.
This paper conducts extensive experiments on three publicly available datasets. The experimental results validate the effectiveness of the proposed method. Project page: https://github.com/rumsyx/RPCF.
The recent papers on visual tracking are mainly based on the correlation filters and deep networks [li2018deep], many of which have impressive performance. In this section, we primarily focus on the algorithms based on the correlation filters and briefly introduce related issues of the pooling operations.
Discriminative Correlation Filters. Trackers based on correlation filters have been the focus of researchers in recent years and have achieved top performance on various datasets. The correlation filter algorithm in visual tracking dates back to the MOSSE tracker [Bolme2010Visual], which takes the single-channel gray-scale image as input. Even though the tracking speed is impressive, the accuracy is not satisfactory. Based on the MOSSE tracker, Henriques et al. advance the state-of-the-art by introducing the kernel functions [henriques2012exploiting] and higher dimensional features [henriques2015high]. Ma et al. [ma2015hierarchical] exploit the rich representation information of deep features in the correlation filter formula, and fuse the responses of various convolutional features via a coarse-to-fine searching strategy. Qi et al. [qi2016hedged] extend the work of [ma2015hierarchical] by exploiting the Hedge method to learn the importance of each kind of feature adaptively. Apart from the MOSSE tracker, the aforementioned algorithms learn the filter weights in the dual space, which has been shown to be less effective than the primal space-based algorithms [danelljan2014accurate, danelljan2015learning, henriques2015high]. However, correlation filters learned in the primal space are severely influenced by the boundary effects and the over-fitting problem. Because of this, Danelljan et al. [danelljan2015learning] introduce a weighted regularization constraint on the learned filter weights, encouraging the algorithm to learn more weights on the central region of the target object. The SRDCF tracker [danelljan2015learning] has become a baseline algorithm for many later trackers, e.g., CCOT [danelljan2016beyond] and SRDCFDecon [danelljan2016adaptive]. The BACF tracker [galoogahi2017learning] provides another feasible way to address the boundary effects, which generates real-world training samples and greatly improves the discriminative power of the learned filter. Though the above methods have well addressed the boundary effects, the over-fitting problem is rarely considered. The ECO tracker [danelljan2017eco] jointly learns a projection matrix and the filter weights, through which the model size is greatly compressed. Different from the ECO tracker, our method introduces the ROI-based pooling operation into a correlation filter formula, which not only addresses the over-fitting problem but also makes the learned filter weights more robust to deformations.
The idea of the pooling operation has been used in various fields in computer vision (e.g., [dalal2005histograms, lowe2004distinctive], [simonyan2014very, he2016deep]), to name a few. Most of the pooling operations are performed on the entire feature map to either obtain more stable feature representations or rapidly compress the model size. In [dalal2005histograms], Dalal et al. divide the image window into dozens of cells, and compute the histogram of gradient directions in each divided cell. The computed feature representations are more robust than the ones based on individual pixels. In most deep learning-based algorithms (e.g., [dalal2005histograms, lowe2004distinctive]), the pooling operations are performed via a pooling layer, which accumulates the multiple response activations over a small neighbourhood region. The localization accuracy of the network usually decreases after the pooling operation. Instead of the primal max/average-pooling layer, the Faster R-CNN method [girshick2015fast] exploits the ROI pooling layer to ensure the localization accuracy and at the same time compress the model size. The method first extracts the ROI region for each candidate target object via a region proposal network (RPN), and then performs the max-pooling operation on the ROI region to obtain more robust feature representations. Our method is inspired by the ROI pooling proposed in [girshick2015fast], and is the first attempt to introduce the ROI-based pooling operation into the correlation filter formula.
In this section, we briefly revisit the two key technologies closely related to our approach (i.e., the correlation filter and pooling operation).
To help better understand our method, we first introduce the primal correlation filter algorithm. Given an input feature map, a correlation filter algorithm aims at learning a set of filter weights to regress the Gaussian-shaped response. We use $y \in \mathbb{R}^{N}$ to denote the desired Gaussian-shaped response, and $x$ to denote the input feature map with $D$ feature channels $x_1, x_2, \dots, x_D$. For each feature channel $x_d \in \mathbb{R}^{N}$, a correlation filter algorithm computes the response by convolving $x_d$ with the filter weight $w_d \in \mathbb{R}^{N}$. Based on the above-mentioned definitions and descriptions, the optimal filter weights can be obtained by optimizing the following objective function:
$$E(w) = \frac{1}{2}\Big\| y - \sum_{d=1}^{D} w_d * x_d \Big\|_2^2 + \frac{\lambda}{2}\| w \|_2^2, \qquad (1)$$
where $*$ denotes the circular convolution operator, $w = [w_1^\top, w_2^\top, \dots, w_D^\top]^\top$ is the concatenated filter vector, and $\lambda$ is a trade-off parameter to balance the importance between the regression and the regularization losses. According to Parseval's theorem, Eq. 1 can be equivalently written in the Fourier domain as
$$E(\hat{w}) = \frac{1}{2}\Big\| \hat{y} - \sum_{d=1}^{D} \hat{w}_d \odot \hat{x}_d \Big\|_2^2 + \frac{\lambda}{2}\| \hat{w} \|_2^2, \qquad (2)$$
where $\odot$ is the Hadamard product. We use $\hat{y}$, $\hat{w}_d$ and $\hat{x}_d$ to denote the Fourier domain of vectors $y$, $w_d$ and $x_d$.
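As a minimal sketch of the primal formulation, the single-channel case ($D = 1$) of the objective admits an element-wise closed-form solution in the Fourier domain. The function names `train_cf_single_channel` and `circular_conv` and the toy signal are illustrative assumptions, not part of the released code.

```python
import numpy as np

def train_cf_single_channel(x, y, lam):
    """Closed-form ridge solution for the response r = x (*) w (circular convolution)."""
    x_hat = np.fft.fft(x)
    y_hat = np.fft.fft(y)
    # Element-wise normal equations: (|x_hat|^2 + lam) * w_hat = conj(x_hat) * y_hat
    w_hat = np.conj(x_hat) * y_hat / (np.abs(x_hat) ** 2 + lam)
    return np.real(np.fft.ifft(w_hat))

def circular_conv(x, w):
    """Circular convolution computed through the convolution theorem."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(w)))

rng = np.random.default_rng(0)
N = 32
x = rng.standard_normal(N)                      # toy one-channel feature
pos = np.arange(N)
y = np.exp(-0.5 * ((pos - N // 2) / 2.0) ** 2)  # Gaussian-shaped label
w = train_cf_single_channel(x, y, lam=1e-2)
response = circular_conv(x, w)
```

With several channels the normal equations couple the channels at each frequency, so the element-wise division above no longer applies directly.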
As is described by many deep learning methods [simonyan2014very, gatys2016image], the pooling layer plays a crucial role in addressing the over-fitting problem. Generally speaking, a pooling operation fuses the neighbourhood response activations into one, through which the model parameters can be effectively compressed. In addition to addressing the over-fitting problem, the pooled feature map becomes more robust to deformations (Figure 2). Currently, two kinds of pooling operations are widely used, i.e., the pooling operation based on the entire feature map (e.g., [simonyan2014very, he2016deep]) and the pooling operation based on the candidate ROI region (e.g., [ren2015faster]). The former has been widely used in the CF trackers with deep features; in contrast, the ROI-based pooling operation is seldom considered. As is described in Section 1, directly performing average/max-pooling on the input feature map will result in fewer training/testing samples and worse localization accuracy. We use an example to show how different pooling methods influence the sample extraction process in Figure 3, wherein the extracted samples are visualized on the right-hand side. For simplicity, this example is based on the dense sampling process. The conclusion is also applicable to the correlation filter method, which is essentially trained via densely sampled circular candidates. In the feature map based pooling operation, the feature map is first reduced in size, thus leading to fewer samples. However, the ROI-based pooling first crops samples from the feature map and then performs pooling operations upon them, and thus does not change the number of training samples. Fewer training samples will lead to inferior discrimination ability of the learned filter, while fewer testing samples will result in inaccurate target localization. Thus, it is meaningful to introduce the ROI-based pooling operation into the correlation filter algorithms.
Since the max-pooling operation introduces a non-linearity that makes the model intractable to optimize, the ROI-based average-pooling operation is preferred in this paper.
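The difference in the number of extracted samples can be illustrated with a small one-dimensional sketch under dense sampling. The feature length, window length and kernel size below are arbitrary toy values, and the helper names are hypothetical.

```python
import numpy as np

def avg_pool_1d(v, e):
    """Non-overlapping average pooling; assumes len(v) is a multiple of e."""
    return v.reshape(-1, e).mean(axis=1)

def dense_windows(v, length):
    """All contiguous windows of the given length (dense sampling, no wrap-around)."""
    return np.array([v[i:i + length] for i in range(len(v) - length + 1)])

feature = np.arange(12, dtype=float)  # toy 1-D feature map
L, e = 4, 2                           # target length and pooling kernel size

# Feature-map pooling: pool the whole map first, then sample windows of length L / e.
fm_samples = dense_windows(avg_pool_1d(feature, e), L // e)

# ROI pooling: sample windows of length L first, then pool each window independently.
roi_samples = np.array([avg_pool_1d(win, e) for win in dense_windows(feature, L)])

print(fm_samples.shape, roi_samples.shape)  # (5, 2) (9, 2)
```

Both variants yield samples of the same pooled dimension, but the ROI-based variant keeps one sample per original position.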
In this section, we propose a novel correlation tracking method with ROI-based pooling operation. Like the previous methods [henriques2012exploiting, danelljan2016beyond], we introduce our CF-based tracking algorithm in the one-dimensional domain, and the conclusions can be easily generalized to higher dimensions. Since the correlation filter does not explicitly extract the training samples, it is impossible to perform the ROI-based pooling operation following the pipeline in Figure 3. In this paper, we derive that the ROI-based pooling operation can be implemented by adding additional constraints on the learned filter weights.
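Given a candidate feature vector $x$ corresponding to the target region with $L$ elements, we perform the average-pooling operation on it with the pooling kernel size $e$. For simplicity, we set $L = eM$, where $M$ is a positive integer (the padding operation can be used if $L$ cannot be divided by $e$ evenly). The pooled feature vector $x' \in \mathbb{R}^{M}$ can be computed as $x' = Ux$, where the matrix $U \in \mathbb{R}^{M \times Me}$ is constructed as:
$$U = \frac{1}{e}\begin{bmatrix} \mathbf{1}_e & \mathbf{0}_e & \cdots & \mathbf{0}_e \\ \mathbf{0}_e & \mathbf{1}_e & \cdots & \mathbf{0}_e \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0}_e & \mathbf{0}_e & \cdots & \mathbf{1}_e \end{bmatrix}, \qquad (3)$$
where $\mathbf{1}_e \in \mathbb{R}^{1 \times e}$ denotes a vector with all the entries set as 1, and $\mathbf{0}_e \in \mathbb{R}^{1 \times e}$ is a zero vector. Based on the pooled vector, we compute the response as:
$$r = w'^{\top} x' = w'^{\top} U x = (U^{\top} w')^{\top} x, \qquad (4)$$
wherein $w' \in \mathbb{R}^{M}$ is the weight corresponding to the pooled feature vector, and $U^{\top} w' = \frac{1}{e}[w'(1), \dots, w'(1), \dots, w'(M), \dots, w'(M)]^{\top}$ repeats each entry of $w'$ $e$ times. It is easy to conclude that the average-pooling operation can be equivalently achieved by constraining the filter weights in each pooling kernel to have the same value. Based on the discussions above, we define our ROI pooled correlation filter as follows:
$$\begin{aligned} \min_{w}\; & E(w) = \frac{1}{2}\Big\| y - \sum_{d=1}^{D} (p_d \odot w_d) * x_d \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\| g_d \odot w_d \|_2^2 \\ \text{s.t.}\; & w_d(i_\eta) = w_d(j_\eta), \quad (i_\eta, j_\eta) \in \mathcal{P},\ \eta = 1, \dots, K, \end{aligned} \qquad (5)$$
where we consider $K$ equality constraints to ensure that filter weights in each pooling kernel have the same value, $\mathcal{P}$ denotes the set of index pairs whose two filter elements belong to the same pooling kernel, and $i_\eta$ and $j_\eta$ denote the indexes of elements in the weight vector $w_d$. In Eq. 5, $p_d \in \mathbb{R}^{N}$ is a binary mask which crops the filter weights corresponding to the target region. By introducing $p_d$, we make sure that the filter only has the response for the target region of each circularly constructed sample [galoogahi2017learning]. The vector $g_d \in \mathbb{R}^{N}$ is a regularization weight that encourages the filter to learn more weights in the central part of the target object. The idea to introduce $p_d$ and $g_d$ has been previously proposed in [danelljan2015learning, galoogahi2017learning], while our tracker is the first attempt to integrate them. In the equality constraints, we consider the relationships between two arbitrary weight elements in a pooling kernel, thus each pooling kernel contributes $\binom{e}{2}$ constraints and, for each channel $d$, $K$ equals $\binom{e}{2}$ times the number of pooling kernels covering the $L$ nonzero values in $p_d$. Note that the constraints are only imposed on the filter coefficients corresponding to the target region of each sample, and the computed $K$ is based on the one-dimensional case.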
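The equivalence between pooling the feature and tying the filter weights can be checked numerically. The following is a small sketch with toy sizes; `pooling_matrix` is a hypothetical helper that builds the averaging matrix described above.

```python
import numpy as np

def pooling_matrix(M, e):
    """U in R^{M x Me}: row m averages the e consecutive entries of pooling kernel m."""
    return np.kron(np.eye(M), np.ones((1, e))) / e

rng = np.random.default_rng(0)
M, e = 3, 2
x = rng.standard_normal(M * e)     # un-pooled feature of the target region
w_pooled = rng.standard_normal(M)  # weights acting on the pooled feature

U = pooling_matrix(M, e)
r_pooled = w_pooled @ (U @ x)      # response computed on the pooled feature
w_tied = U.T @ w_pooled            # equivalent weights on the un-pooled feature
r_tied = w_tied @ x                # same response without pooling the feature
```

Each group of `e` consecutive entries in `w_tied` shares one value, which is exactly the structure that the equality constraints enforce.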
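According to Parseval's formula, the optimization in Eq. 5 can be equivalently written as:
$$\begin{aligned} \min_{\hat{w}}\; & E(\hat{w}) = \frac{1}{2}\Big\| \hat{y} - \sum_{d=1}^{D} \hat{P}_d \hat{w}_d \odot \hat{x}_d \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\| \hat{G}_d \hat{w}_d \|_2^2 \\ \text{s.t.}\; & V_d^1 F^{-1} \hat{w}_d = V_d^2 F^{-1} \hat{w}_d, \quad d = 1, \dots, D. \end{aligned} \qquad (6)$$
In Eq. 6, $F$ denotes the Fourier transform matrix, and $F^{-1}$ denotes the inverse transform matrix. The vectors $\hat{y}$, $\hat{x}_d$ and $\hat{w}_d$ denote the Fourier coefficients of the corresponding signal vectors $y$, $x_d$ and $w_d$. Matrices $\hat{P}_d$ and $\hat{G}_d$ are the Toeplitz matrices, whose $(i,j)$-th elements are $\hat{p}_d((N+i-j)\,\%\,N)$ and $\hat{g}_d((N+i-j)\,\%\,N)$, where $\%$ denotes the modulo operation. They are constructed based on the convolution theorem to ensure that $\hat{P}_d \hat{w}_d = F(p_d \odot w_d)$ and $\hat{G}_d \hat{w}_d = F(g_d \odot w_d)$, up to the normalization constant of the discrete Fourier transform. Since the discrete Fourier coefficients of a real-valued signal are Hermitian symmetric, i.e., $\hat{p}_d((N+i-j)\,\%\,N) = \hat{p}_d((N+j-i)\,\%\,N)^{*}$ in our case, we can easily conclude that $\hat{P}_d = \hat{P}_d^{H}$ and $\hat{G}_d = \hat{G}_d^{H}$, where $^{H}$ denotes the conjugate-transpose of a complex matrix. In the constraint term, $V_d^1 \in \mathbb{R}^{K \times N}$ and $V_d^2 \in \mathbb{R}^{K \times N}$ are index matrices with either $1$ or $0$ as the entries, and $V_d^1 F^{-1} \hat{w}_d = [w_d(i_1), \dots, w_d(i_K)]^{\top}$, $V_d^2 F^{-1} \hat{w}_d = [w_d(j_1), \dots, w_d(j_K)]^{\top}$.
Eq. 6 can be rewritten in a compact formula as:
$$\begin{aligned} \min_{\hat{w}}\; & E(\hat{w}) = \frac{1}{2}\Big\| \hat{y} - \sum_{d=1}^{D} E_d \hat{w}_d \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\| \hat{G}_d \hat{w}_d \|_2^2 \\ \text{s.t.}\; & V_d F^{-1} \hat{w}_d = 0, \quad d = 1, \dots, D, \end{aligned} \qquad (7)$$
where $E_d = X_d \hat{P}_d$, $X_d = \mathrm{diag}(\hat{x}_d(1), \dots, \hat{x}_d(N))$ is a diagonal matrix, and $V_d = V_d^1 - V_d^2$.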
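The Toeplitz construction can be verified numerically for a toy mask. This is a sketch rather than the released implementation; the explicit factor $1/N$ is an assumption that follows from NumPy's unnormalized forward FFT and would be absorbed by a different normalization convention.

```python
import numpy as np

N = 8
rng = np.random.default_rng(0)
p = np.zeros(N)
p[2:6] = 1.0                       # toy binary mask of the target region
w = rng.standard_normal(N)

p_hat = np.fft.fft(p)
w_hat = np.fft.fft(w)

# (i, j)-th element is p_hat[(N + i - j) % N]
idx = (np.arange(N)[:, None] - np.arange(N)[None, :]) % N
P_hat = p_hat[idx] / N             # 1/N comes from NumPy's FFT convention

lhs = P_hat @ w_hat                # masking applied in the Fourier domain
rhs = np.fft.fft(p * w)            # Fourier transform of the masked filter
is_hermitian = np.allclose(P_hat, P_hat.conj().T)
```

Here `lhs` and `rhs` agree, and `is_hermitian` is true because the mask is real-valued.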
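Since Eq. 7 is a quadratic programming problem with linear constraints, we use the Augmented Lagrangian Method for efficient model learning. The Lagrangian function corresponding to Eq. 7 is defined as:
$$\mathcal{L}(\hat{w}, \xi) = \frac{1}{2}\Big\| \hat{y} - \sum_{d=1}^{D} E_d \hat{w}_d \Big\|_2^2 + \frac{\lambda}{2}\sum_{d=1}^{D}\| \hat{G}_d \hat{w}_d \|_2^2 + \frac{\gamma}{2}\sum_{d=1}^{D}\Big\| V_d F^{-1} \hat{w}_d + \frac{1}{\gamma}\xi_d \Big\|_2^2, \qquad (8)$$
where $\xi_d \in \mathbb{R}^{K}$ denotes the Lagrangian multipliers for the $d$-th channel, $\gamma$ is the penalty parameter, and $\xi = [\xi_1^{\top}, \dots, \xi_D^{\top}]^{\top}$. The ADMM method is used to alternately optimize $\hat{w}$ and $\xi$. Though the optimization objective function is non-convex, it becomes a convex function when either $\hat{w}$ or $\xi$ is fixed.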
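When $\xi$ is fixed, $\hat{w}$ can be computed via the conjugate gradient method [bunse1999conjugate]. We compute the gradient of the objective function with respect to $\hat{w}$ in Eq. 8 and obtain a system of linear equations by setting the gradient to be a zero vector:
$$\big( \hat{E}^{H}\hat{E} + \lambda\, \hat{G}^{H}\hat{G} + \gamma\, (V\mathcal{F}^{-1})^{H} V\mathcal{F}^{-1} \big)\, \hat{w} = \hat{E}^{H}\hat{y} - (V\mathcal{F}^{-1})^{H}\xi, \qquad (9)$$
where $\mathcal{F}^{-1}$, $\hat{G}$ and $V$ are block diagonal matrices with the $d$-th matrix block set as $F^{-1}$, $\hat{G}_d$ and $V_d$, $\hat{E} = [E_1, E_2, \dots, E_D]$, and $\hat{w} = [\hat{w}_1^{\top}, \dots, \hat{w}_D^{\top}]^{\top}$. In the conjugate gradient method, the computation load lies in the three terms $\hat{E}^{H}\hat{E}u$, $(V\mathcal{F}^{-1})^{H} V\mathcal{F}^{-1} u$ and $\hat{G}^{H}\hat{G}u$ given the search direction $u = [u_1^{\top}, \dots, u_D^{\top}]^{\top}$. In the following, we present more details on how we compute these three terms efficiently. Each of the three terms can be regarded as a vector constructed with $D$ sub-vectors. The $d$-th sub-vector of $\hat{E}^{H}\hat{E}u$ is computed as $\hat{P}_d^{H} X_d^{H} \big( \sum_{j=1}^{D} X_j \hat{P}_j u_j \big)$, wherein $\hat{P}_d^{H} = \hat{P}_d$ as described above. Since the Fourier coefficients of $p_d$ (a vector with binary values) are densely distributed, it is time consuming to directly compute $\hat{P}_d v$ given an arbitrary complex vector $v$. In this work, the convolution theorem is used to efficiently compute $\hat{P}_d v = F(p_d \odot F^{-1} v)$. The $d$-th sub-vector of the second term is $(F^{-1})^{H} V_d^{\top} V_d F^{-1} u_d$. As the matrices $V_d^1$ and $V_d^2$ only consist of $1$ and $0$, the computation of $V_d^{\top} V_d F^{-1} u_d$ can be efficiently conducted via table lookups. The third term corresponds to the convolution operation, whose convolution kernel size is usually smaller than 5, thus it can also be efficiently computed.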
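A matrix-free conjugate gradient solver for a simplified instance of this linear system can be sketched as follows. The sketch assumes a single channel, an identity regularizer and no equality constraints, so it only illustrates how the mask operator is applied through the convolution theorem instead of a dense matrix; all function names are hypothetical.

```python
import numpy as np

def apply_mask_fourier(p, v):
    """Compute P_hat @ v as F(p * F^{-1} v) without forming the dense Toeplitz matrix."""
    return np.fft.fft(p * np.fft.ifft(v))

def conjugate_gradient(apply_A, b, x0=None, iters=200, tol=1e-8):
    """Solve A x = b for a Hermitian positive-definite operator given as a function."""
    x = np.zeros_like(b) if x0 is None else x0.copy()  # x0 allows a warm start
    r = b - apply_A(x)
    d = r.copy()
    rs = np.vdot(r, r).real
    for _ in range(iters):
        if np.sqrt(rs) < tol:
            break
        Ad = apply_A(d)
        alpha = rs / np.vdot(d, Ad).real
        x = x + alpha * d
        r = r - alpha * Ad
        rs_new = np.vdot(r, r).real
        d = r + (rs_new / rs) * d
        rs = rs_new
    return x

rng = np.random.default_rng(0)
N, lam = 16, 0.1
p = np.zeros(N)
p[4:12] = 1.0                                   # toy binary mask
x_hat = np.fft.fft(rng.standard_normal(N))      # toy feature in the Fourier domain
y_hat = np.fft.fft(np.exp(-0.5 * ((np.arange(N) - N // 2) / 2.0) ** 2))

def apply_A(u):
    # (P^H X^H X P + lam * I) u, using P^H = P for a real-valued mask
    masked = apply_mask_fourier(p, u)
    return apply_mask_fourier(p, np.abs(x_hat) ** 2 * masked) + lam * u

b = apply_mask_fourier(p, np.conj(x_hat) * y_hat)  # P^H X^H y_hat
w_hat = conjugate_gradient(apply_A, b)
```

Each application of the operator costs a few FFTs of length `N`, compared with a dense matrix-vector product of quadratic cost.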
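When $\hat{w}$ is computed, $\xi_d$ can be updated via:
$$\xi_d^{(i+1)} = \xi_d^{(i)} + \gamma^{(i)}\, V_d F^{-1} \hat{w}_d^{(i+1)}, \qquad (10)$$
where we use $\xi_d^{(i+1)}$ to denote the value of $\xi_d$ in the $(i+1)$-th iteration. According to [boyd2011distributed], the value of $\gamma$ can be updated as:
$$\gamma^{(i+1)} = \min\big(\gamma_{max},\ \alpha\,\gamma^{(i)}\big), \qquad (11)$$
where $\gamma_{max}$ is the maximum value of $\gamma$ and $\alpha$ is the scale factor; again we use $i$ to denote the iteration index.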
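The alternating updates can be sketched on a small real-valued problem in the spatial domain. This is an illustrative assumption rather than the tracker's solver: the filter step uses a direct linear solve instead of conjugate gradient, the mask and spatial regularizer are omitted, and `tie_constraints` and `solve_tied_ridge` are hypothetical names.

```python
import numpy as np

def tie_constraints(n, e):
    """Rows of V encode w[i] - w[j] = 0 for every pair (i, j) inside one pooling kernel."""
    rows = []
    for start in range(0, n, e):
        for i in range(start, start + e):
            for j in range(i + 1, start + e):
                row = np.zeros(n)
                row[i], row[j] = 1.0, -1.0
                rows.append(row)
    return np.array(rows)

def solve_tied_ridge(A, y, V, lam, gamma=1.0, gamma_max=1000.0, alpha=10.0, iters=30):
    """Augmented Lagrangian iterations for min 0.5||y - Aw||^2 + 0.5*lam||w||^2 s.t. Vw = 0."""
    n = A.shape[1]
    xi = np.zeros(V.shape[0])
    w = np.zeros(n)
    for _ in range(iters):
        # Filter step: set the gradient of the augmented Lagrangian to zero.
        lhs = A.T @ A + lam * np.eye(n) + gamma * V.T @ V
        rhs = A.T @ y - V.T @ xi
        w = np.linalg.solve(lhs, rhs)
        # Multiplier step, then penalty growth capped at gamma_max.
        xi = xi + gamma * (V @ w)
        gamma = min(gamma_max, alpha * gamma)
    return w, xi

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 8))
y = rng.standard_normal(20)
V = tie_constraints(8, e=2)
w, xi = solve_tied_ridge(A, y, V, lam=0.1)
```

After the iterations, consecutive pairs of entries in `w` coincide, so the learned weights behave like weights on an average-pooled feature.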
To learn more robust filter weights, we update the proposed RPCF tracker based on several training samples ($T$ samples in total) like [danelljan2016beyond, danelljan2017eco]. We extend the notations $\hat{E}$ and $\hat{y}$ in Eq. 9 with superscript $t$, and reformulate Eq. 9 as follows:
$$\Big( \sum_{t=1}^{T} \mu^{t} (\hat{E}^{t})^{H}\hat{E}^{t} + \lambda\, \hat{G}^{H}\hat{G} + \gamma\, (V\mathcal{F}^{-1})^{H} V\mathcal{F}^{-1} \Big)\, \hat{w} = b, \qquad (12)$$
where $b = \sum_{t=1}^{T} \mu^{t} (\hat{E}^{t})^{H}\hat{y}^{t} - (V\mathcal{F}^{-1})^{H}\xi$, and $\mu^{t}$ denotes the importance weight for each training sample $t$. Most previous correlation filter trackers update the model iteratively via a weighted combination of the filter weights in various frames. Different from them, we exploit the sparse update mechanism, and update the model at a fixed frame interval [danelljan2017eco]. In each updating frame, the conjugate gradient method is used, and the search direction of the previous update process is input as a warm start. Our training samples are generated following [danelljan2017eco], and the weight (i.e., learning rate) for the newly added sample is set as $\delta$, while the weights of previous samples are decayed by multiplying $1 - \delta$. In Figure 4, we visualize the learned filter weights of different trackers with and without ROI-based pooling. Our tracker learns more compact filter weights and focuses on the reliable regions of the target object.
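The bookkeeping of the importance weights can be sketched as below. The decay factor of one minus the learning rate is an assumption borrowed from the ECO-style update that this section follows, and the sketch omits the cap on the number of stored samples; `add_sample_weight` is a hypothetical helper.

```python
import numpy as np

def add_sample_weight(weights, delta):
    """Decay the existing weights by (1 - delta) and append delta for the new sample."""
    return np.append((1.0 - delta) * np.asarray(weights, dtype=float), delta)

weights = np.array([1.0])          # the first frame carries all the weight
for _ in range(3):                 # three further training samples arrive
    weights = add_sample_weight(weights, delta=0.02)
```

Under this rule the weights always sum to one, so older samples fade geometrically while the newest sample receives a fixed share.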
In the target localization process, we first crop the candidate samples with different scales, i.e., $x_d^{s}$, $s \in \{1, \dots, S\}$. Then, we compute the response $\hat{r}^{s}$ for the feature in each scale in the Fourier domain:
$$\hat{r}^{s} = \sum_{d=1}^{D} \hat{x}_d^{s} \odot (\hat{P}_d \hat{w}_d). \qquad (13)$$
The computed responses are then interpolated with trigonometric polynomials following [danelljan2015learning] to achieve sub-pixel target localization.
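One simple way to realize trigonometric interpolation is to zero-pad the spectrum of each response and search the refined grid. The sketch below makes that assumption and is not necessarily the refinement scheme of [danelljan2015learning]; it is one-dimensional, and `trig_interpolate` and `localize` are hypothetical names.

```python
import numpy as np

def trig_interpolate(r, factor):
    """Upsample a real periodic signal of odd length by zero-padding its spectrum."""
    n = len(r)
    assert n % 2 == 1, "odd length avoids splitting the Nyquist bin"
    h = (n - 1) // 2
    r_hat = np.fft.fft(r)
    padded = np.zeros(n * factor, dtype=complex)
    padded[:h + 1] = r_hat[:h + 1]   # zero and positive frequencies
    padded[-h:] = r_hat[-h:]         # negative frequencies
    return np.real(np.fft.ifft(padded)) * factor

def localize(responses, factor=8):
    """Return (scale index, sub-sample position) of the highest interpolated peak."""
    fine = np.array([trig_interpolate(r, factor) for r in responses])
    s, k = np.unravel_index(np.argmax(fine), fine.shape)
    return int(s), k / factor

n = 21
t = np.arange(n)
# Toy responses for three scales; the true peak lies between samples at 10.3.
responses = [a * np.cos(2 * np.pi * (t - 10.3) / n) for a in (0.5, 1.0, 0.8)]
best_scale, best_pos = localize(responses, factor=10)
```

The interpolated signal passes through the original samples, so the refinement only adds resolution between them.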
In this section, we evaluate the proposed RPCF tracker on the OTB-2013 [wu2013online], OTB-2015 [wu2015object] and VOT-2017 [VOT2017] datasets. We first evaluate the effectiveness of the method, and then compare our tracker with recent state-of-the-art trackers.
Implementation Details. The proposed RPCF method is mainly implemented in MATLAB on a PC with an i7-4790K CPU and a GeForce 1080 GPU. Similar to the ECO method [danelljan2017eco], we use a combination of CNN features from two convolution layers, HOG and color names for target representation. For efficiency, the PCA method is used to compress the features. We set the learning rate $\delta$, the maximum number of training samples $T$, $\gamma_{max}$ and $\alpha$ as 0.02, 50, 1000 and 10 respectively, and we update the model in every frame. For the remaining per-channel hyper-parameter, we set a relatively small value for the high-level feature (i.e., the second convolution layer), and a larger value for the other feature channels. The pooling kernel size $e$ is fixed in the implementation. We use the conjugate gradient method for model initialization and update: 200 iterations are used in the first frame, and each following update frame uses 6 iterations. Our tracker runs at about 5 fps without optimization.
We follow the one-pass evaluation (OPE) rule on the OTB-2013 and OTB-2015 datasets, and report the precision plots as well as the success plots for the performance measure. The success plots demonstrate the overlaps between tracked bounding boxes and ground truth with varying thresholds, while the precision plots measure the accuracy of the estimated target center positions. In the precision plots, we report the distance precision (DP) rate at 20 pixels, while in the success plots we report the area-under-curve (AUC) score. On the VOT-2017 dataset, we evaluate our tracker in terms of the Expected Average Overlap (EAO), accuracy raw value (A) and robustness raw value (R), which measure the overlap, accuracy and robustness respectively.
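The two OTB measures can be sketched as follows. The box format `(x, y, w, h)`, the 21 evenly spaced overlap thresholds and the strict inequality in the success rate are assumptions of this sketch and may differ from the official toolkit.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are rows of (x, y, w, h)."""
    pc = pred[:, :2] + pred[:, 2:] / 2.0
    gc = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Intersection-over-union of axis-aligned boxes, one value per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def distance_precision(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, thresholds=np.linspace(0.0, 1.0, 21)):
    """Mean success rate over the overlap thresholds (area under the success plot)."""
    overlaps = iou(pred, gt)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))

gt = np.array([[0, 0, 10, 10], [0, 0, 10, 10]], dtype=float)
pred = np.array([[0, 0, 10, 10], [100, 100, 10, 10]], dtype=float)  # one hit, one miss
dp = distance_precision(pred, gt)
auc = success_auc(pred, gt)
```

In this toy case half of the frames are tracked perfectly, so the DP rate is 0.5 and the AUC is slightly below 0.5 because no overlap can exceed the threshold of 1.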
In this subsection, we conduct experiments to validate the contributions of the proposed RPCF method. We set the tracker that does not consider the pooling operation as the baseline method, and use Baseline to denote it. It essentially corresponds to Eq. 5 without equality constraints. To validate the superiority of our ROI-based pooling method over feature map based average-pooling and max-pooling, we also implement the trackers that directly perform average-pooling and max-pooling on the input feature map, which are named Baseline+AP and Baseline+MP.
We first compare the Baseline method with Baseline+AP and Baseline+MP, which shows that the tracking performance decreases when feature map based pooling operations are performed. Directly performing pooling operations on the input feature map will not only influence the extraction of the training samples but also lead to worse target localization accuracy. In addition, the over-fitting problem is not well addressed in such methods since the ratio between the numbers of model parameters and available training samples does not change compared with the Baseline method. We validate the effectiveness of the proposed method by comparing our RPCF tracker with the Baseline method. Our tracker improves the Baseline method by 4.4% and 2.0% in precision and success plots respectively. By exploiting the ROI-based pooling operations, our learned filter weights are insusceptible to the over-fitting problem and are more robust to deformations.
OTB-2013 Dataset. The OTB-2013 dataset contains 50 videos annotated with 11 various attributes including illumination variation, scale variation, occlusion, deformation and so on. We evaluate our tracker on this dataset and compare it with 8 state-of-the-art methods that are respectively ECO [danelljan2017eco], CCOT [danelljan2016beyond], LSART [sun2018learning], ECO-HC [danelljan2017eco], CF2 [ma2015hierarchical], Staple [bertinetto2016staple], MEEM [zhang2014meem] and KCF [henriques2015high]. We demonstrate the precision and success plots for different trackers in Figure 6. Our RPCF method has a 94.3% DP rate at the threshold of 20 pixels and a 70.9% AUC score. Compared with other correlation filter based trackers, the proposed RPCF method has the best performance in terms of both precision and success plots. Our method improves the second best tracker ECO by 1.9% in terms of DP rates, and has comparable performance according to the success plots. When the features are not compressed via PCA, the tracker (denoted as RPCF-NC) has a 95.4% DP rate at the threshold of 20 pixels and a 71.3% AUC score in success plots, and it runs at 2fps without optimization.
OTB-2015 Dataset. The OTB-2015 dataset is an extension of the OTB-2013 dataset and contains 50 more video sequences. On this dataset, we also compare our tracker with the above mentioned 8 state-of-the-art trackers, and present the results in Figure 7(a)(b). Our RPCF tracker has a 92.9% DP rate and a 69.0% AUC score. It improves the second best tracker ECO by 1.9% in terms of the precision plots. With the non-compressed features, our RPCF-NC tracker achieves a 93.2% DP rate and a 69.6% AUC score, which again is the best performance among all the compared trackers.
The OTB-2015 dataset divides the image sequences into 11 attributes, each of which corresponds to a challenging factor. We compare our RPCF tracker against other 8 state-of-the-art trackers and present the precision plots for different trackers in Figure 9. As is illustrated in the figure, our RPCF tracker has good tracking performance in all the listed attributes. Especially, the RPCF tracker improves the ECO method by 3.6%, 2.5%, 2.8%, 2.2% and 4.3% in the attributes of scale variation, in-plane rotation, out-of-plane rotation, fast motion and deformation. The ROI pooled features become more consistent across different frames than the original ones, which contributes to robust target representation when the target appearance dramatically changes (see Figure 2 for example). In addition, by exploiting the ROI-based pooling operations, the model parameters are greatly compressed, which makes the proposed tracker insusceptible to the over-fitting problem. In Figure 9, we also present the results of our RPCF-NC tracker for reference.
VOT-2017 Dataset. We test the proposed tracker on the VOT-2017 dataset for more thorough performance evaluations. The VOT-2017 dataset consists of 60 sequences with 5 challenging attributes, i.e., occlusion, illumination change, motion change, size change and camera motion. Different from the OTB-2013 and OTB-2015 datasets, it focuses on evaluating the short-term tracking performance and introduces a reset based experiment setting. We compare our RPCF tracker with 9 state-of-the-art trackers including CFWCR [he2017correlation], ECO [danelljan2017eco], CCOT [danelljan2016beyond], MCCT [wang2018multi], CFCF [gundogdu2018good], CSR [lukezic2017discriminative], MCPF [zhang2017multi], Gnet [VOT2017] and Staple [bertinetto2016staple]. The tracking performance of different trackers in terms of EAO, A and R is provided in Table 1 and Figure 8. Among all the compared trackers, our RPCF method has a 31.6% EAO score, which improves the ECO method by 3.5%. Also, our tracker has the best performance in terms of the robustness measure among all the compared trackers.
In this paper, we propose the ROI pooled correlation filters for visual tracking. Since the correlation filter algorithm does not extract real-world training samples, it is infeasible to perform the pooling operation for each candidate ROI region like the previous methods. Based on the mathematical derivations, we provide an alternative solution for the ROI-based pooling with the circularly constructed virtual samples. Then, we propose a correlation filter formula with equality constraints, and develop an efficient ADMM solver in the Fourier domain. Finally, we evaluate the proposed RPCF tracker on OTB-2013, OTB-2015 and VOT-2017 benchmark datasets. Extensive experiments demonstrate that our method performs favourably against the state-of-the-art algorithms on all the three datasets.
Acknowledgement. This paper is supported in part by National Natural Science Foundation of China #61725202, #61829102, #61872056 and #61751212, and in part by the Fundamental Research Funds for the Central Universities under Grant #DUT18JC30. This work is also sponsored by CCF-Tencent Open Research Fund.