DSNet: Deep and Shallow Feature Learning for Efficient Visual Tracking

11/06/2018 · Qiangqiang Wu et al. · Xiamen University

In recent years, Discriminative Correlation Filter (DCF) based tracking methods have achieved great success in visual tracking. However, multi-resolution convolutional feature maps trained on other tasks such as image classification cannot be naturally used in the conventional DCF formulation. Furthermore, these high-dimensional feature maps significantly increase the tracking complexity and thus limit the tracking speed. In this paper, we present a deep and shallow feature learning network, namely DSNet, to learn multi-level same-resolution compressed (MSC) features for efficient online tracking in an end-to-end offline manner. Specifically, the proposed DSNet compresses multi-level convolutional features into feature maps of uniform spatial resolution. The learned MSC features effectively encode both the appearance and the semantic information of objects in same-resolution feature maps, thus enabling an elegant combination of the MSC features with any DCF-based method. Additionally, a channel reliability measurement (CRM) method is presented to further refine the learned MSC features. We demonstrate the effectiveness of the MSC features learned by the proposed DSNet on two DCF tracking frameworks: the basic DCF framework and the continuous convolution operator framework. Extensive experiments show that the learned MSC features have the appealing advantage of allowing the equipped DCF-based tracking methods to perform favorably against state-of-the-art methods while running at high frame rates.


1 Introduction

Given the initial state of a target in the first frame, generic visual object tracking aims to accurately and efficiently estimate the trajectory of the target in subsequent frames. In recent years, Discriminative Correlation Filter (DCF) based tracking methods have shown excellent performance on canonical object tracking benchmarks [31, 32]. The keys to their success are the mechanism of enlarging the training data by including all shifted versions of a given sample, and the efficiency of the DCF obtained by solving the ridge regression problem in the frequency domain.

Features play an important role in designing a high-performance tracking method. In recent years, significant progress has been made in exploiting discriminative features for DCFs. For example, hand-crafted features like HOG [5] and Color Names [30], or combinations of these features, are commonly employed by DCFs for online object tracking. Despite the fast tracking speed achieved by these methods, they usually cannot obtain high tracking accuracy due to the limited discriminative power of the features they use. Recently, deep convolutional neural networks (CNNs) have achieved outstanding success in a variety of computer vision tasks [15, 19, 23]. Inspired by this success, the visual tracking community has exploited the advantages of CNNs and shown that deep convolutional features trained on other tasks such as image classification are also applicable to visual tracking [24]. On one hand, the features extracted from shallow convolutional layers provide high spatial resolution and are helpful for accurately localizing the object. On the other hand, the features extracted from deeper layers encode semantic information and are more robust to target appearance variations (e.g., deformation, rotation and motion blur). The combination of these two types of features shows excellent tracking performance in both locating the target accurately and modeling target appearance variations online. However, the conventional DCF formulation is limited to single-resolution feature maps, so deep and shallow features (i.e., multi-resolution feature maps) cannot be naturally used in the conventional DCF framework. Thus, how to effectively fuse deep and shallow features in the DCF framework remains an open and challenging problem.

Figure 1: Comparison between (a) the DCF-based tracking methods [9, 20] with deep convolutional features trained on the image classification task and (b) the DCF-based tracking method with our MSC features.

Recently, several works have been developed to fuse multi-resolution feature maps in the DCF framework [6, 9]. A straightforward strategy is to explicitly resample both deep and shallow features from their different spatial resolutions to a common resolution. However, such a strategy introduces artifacts, which severely limit the tracking performance. To overcome this issue, an online learning formulation is presented in C-COT [9] to integrate multi-resolution feature maps. Despite the promising performance achieved by C-COT, it still has several limitations: (1) the online learning formulation is time-consuming due to the high-dimensional deep features; (2) in order to fuse multi-resolution feature maps, multiple DCFs need to be trained; and (3) the method employs deep features trained on other tasks such as image classification, which are not specifically designed for visual tracking and may limit the tracking performance.

In this paper, instead of designing an online learning formulation to integrate deep and shallow features (i.e., multi-resolution feature maps) in the DCF framework, we propose to learn multi-level same-resolution compressed (MSC) features in an end-to-end offline manner for efficient online tracking. To achieve this, a deep and shallow feature learning network architecture (called DSNet) is developed. The proposed DSNet aggregates multi-level convolutional features and integrates them into same-resolution feature maps. In the training stage, a correlation filter layer is added to DSNet, which enables learning discriminative MSC features for visual tracking. In the testing stage, DSNet acts as a feature extractor without relying on a time-consuming online fine-tuning step. In general, the MSC features learned by the proposed DSNet have the following characteristics:

(1) MSC features effectively incorporate both the deep and shallow features of objects at the same spatial resolution. This enables MSC features to be naturally fused into any DCF-based tracking method without any online combination strategy.

(2) Due to the low dimensionality and uniform spatial resolution of MSC features, the tracking model complexity can be significantly decreased (see Fig. 1). In particular, our MSC features can be handled by a single DCF instead of multiple DCFs, leading to highly efficient online object tracking.

To demonstrate the effectiveness of MSC features learned by the proposed DSNet, we incorporate the MSC features into two state-of-the-art tracking frameworks: the basic DCF framework [15] and the continuous convolution operator framework [9], namely MSC-DCF and MSC-CCO, respectively. Experiments demonstrate that our MSC features have the important advantage of allowing the equipped MSC-DCF and MSC-CCO methods to perform favorably against the state-of-the-art methods at high frame rates.

In summary, this paper has the following contributions:

(1) We propose a deep and shallow feature learning network architecture, namely DSNet, which learns multi-level same-resolution compressed (MSC) features for efficient online object tracking in an end-to-end offline manner.

(2) Based on the observation that several feature channels have low channel reliability scores, an online channel reliability measurement (CRM) method is presented to further refine the learned MSC features.

(3) We show that our MSC features are applicable to any CF-based tracking method. Based on the MSC features, two MSC-trackers (MSC-DCF and MSC-CCO) are presented. Experiments demonstrate that the presented trackers achieve favorable performance while running at high frame rates.

2 Related Work

In this section, we briefly review the methods closely related to our work.

Correlation filter tracking. Correlation filter (CF) based tracking methods [8, 9, 15] have attracted considerable attention due to their computational efficiency and favorable tracking performance. For example, MOSSE [3] is the seminal CF-based tracking method, which uses grayscale images to train the regression model. KCF [15] extends MOSSE by using multi-channel features and mapping the input features to a kernel space. Staple [1] combines both HOG and Color Name features in a CF framework. Despite the fast tracking speed of these methods, the hand-crafted features they employ are still not discriminative enough to handle different challenges. To overcome this problem, several deep feature based tracking methods have been proposed. For example, CF2 [20] and DeepSRDCF [7] employ convolutional features extracted from VGGNet [4]. HDT [24] merges multiple CFs trained on different layers of VGGNet. In [9], an online learning formulation of convolutional features is developed in the spatial domain. ECO [6] further alleviates the over-fitting problem of [9] and decreases its computational complexity. Despite the significant improvements made by these methods, they still suffer from low tracking speed and insufficiently discriminative deep features. To alleviate these problems, we propose to learn multi-level same-resolution compressed (MSC) features in an end-to-end offline manner. The learned MSC features are specifically designed for visual tracking, and they can be easily incorporated into any CF-based tracking method without time-consuming online fine-tuning steps.

Feature representation learning. Feature representation is at the core of many computer vision tasks, including semantic segmentation [23], object detection [25] and object tracking [1]. For object detection, many feature learning works have been proposed. For example, R-CNN [10] learns better feature representations in an end-to-end manner with a region proposal based CNN; owing to the learned discriminative features, R-CNN significantly outperforms earlier detection methods. In addition, SSD [19] and HyperNet [17] combine multi-layer convolutional features to further improve detection performance. For visual tracking, CFNet [28] was the first to add a correlation filter layer to a CNN architecture, enabling more discriminative features to be learned for CF-based methods. A similar feature learning method is also introduced in DCFNet [29]. Despite their success, these methods only learn single-layer convolutional features, which may limit their performance. To encode both the low-level and high-level information of objects, we propose to learn multi-level convolutional features in an end-to-end manner for visual tracking.

3 Proposed DSNet for Feature Learning

In this section, we first introduce the proposed DSNet framework. We then present how the MSC features are learned end-to-end and detail the feature extraction step. Finally, we introduce the channel reliability measurement (CRM) method used to further refine the learned MSC features.

3.1 DSNet Framework

Figure 2: Overall architecture of the proposed deep and shallow feature learning network (DSNet).

The proposed DSNet framework is illustrated in Fig. 2. As can be seen, DSNet mainly consists of three parts: a backbone network (the blue part), a shallow feature extraction branch (the red part) and a deep feature extraction branch (the orange part). The backbone network is a pre-trained image classification network, which can be any classification CNN, such as AlexNet [18] or ResNet [13]. In this work, we select the imagenet-vgg-2048 network [4] as our backbone network. In the shallow and deep feature extraction branches, in order to combine multi-layer multi-resolution feature maps at the same spatial resolution, a max pooling layer (a 7×7 kernel with a stride of 2) and a deconvolution layer [23] (a 4×4 kernel with a stride of 4) are added to perform downsampling and upsampling, respectively. Then, the Conv6 and Conv7 layers (i.e., 1×1 convolutional layers) are employed to compress the shallow and deep features, respectively. Moreover, a local response normalization (LRN) layer [18] is employed to normalize these features. Finally, we concatenate the normalized features into a single output cube to obtain our multi-level same-resolution compressed (MSC) features. In order to train the MSC features effectively, a correlation filter is interpreted as a differentiable layer, which is added as the last layer of DSNet in the training stage.
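To make the aggregation concrete, the following PyTorch sketch mirrors the two branches described above. It is a minimal illustration, not the released implementation: the shallow/deep channel split of the 96-D MSC feature (32 + 64 here), the LRN window size and the input sizes are our assumptions.

```python
import torch
import torch.nn as nn

class MSCAggregation(nn.Module):
    """Sketch of DSNet's shallow/deep aggregation branches (Fig. 2).

    The 32/64 shallow/deep channel split of the 96-D MSC feature is an
    assumption for illustration; the paper only fixes the total (96).
    """

    def __init__(self, shallow_in=96, deep_in=512, shallow_out=32, deep_out=64):
        super().__init__()
        # Shallow branch: downsample the high-resolution Conv1 features.
        self.pool = nn.MaxPool2d(kernel_size=7, stride=2, padding=3)
        self.conv6 = nn.Conv2d(shallow_in, shallow_out, kernel_size=1)  # 1x1 compression
        # Deep branch: upsample the low-resolution Conv5 features.
        self.deconv = nn.ConvTranspose2d(deep_in, deep_in, kernel_size=4, stride=4)
        self.conv7 = nn.Conv2d(deep_in, deep_out, kernel_size=1)        # 1x1 compression
        self.lrn = nn.LocalResponseNorm(size=5)  # window size assumed

    def forward(self, conv1_feat, conv5_feat):
        shallow = self.lrn(self.conv6(self.pool(conv1_feat)))
        deep = self.lrn(self.conv7(self.deconv(conv5_feat)))
        # Both branches now share one spatial resolution; concatenate
        # along the channel axis to form the MSC feature cube.
        return torch.cat([shallow, deep], dim=1)

# Shape check for a 224x224 input: Conv1 maps at stride 2, Conv5 at stride 16.
net = MSCAggregation()
msc = net(torch.randn(1, 96, 112, 112), torch.randn(1, 512, 14, 14))
print(msc.shape)  # torch.Size([1, 96, 56, 56])
```

Pooling the stride-2 Conv1 maps by a further factor of 2 and upsampling the stride-16 Conv5 maps by a factor of 4 brings both branches to a common stride of 4, which is what allows the channel-wise concatenation.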

3.2 End-to-end Feature Learning

In order to effectively train the proposed DSNet and make the learned MSC features suitable for correlation filter tracking, we add a correlation filter layer (see Fig. 2) in the proposed DSNet to perform the end-to-end MSC feature representation learning in an offline manner.

In the training stage, a set of triplet training samples is generated. Let $(x, z, y)$ be a triplet, where $x$ is the target image patch containing the centered target, $z$ is the test image patch containing the non-centered target, and $y$ is the desired Gaussian-shaped response map centered at the target position in $z$. Given a batch of $N$ triplet training samples $\{(x_i, z_i, y_i)\}_{i=1}^{N}$, the cost function is formulated as:

$L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \Big\| \mathcal{F}^{-1}\Big( \sum_{l=1}^{D} \hat{w}_i^{l*} \odot \hat{\varphi}^{l}(z_i; \theta) \Big) - y_i \Big\|_2^2,$   (1)

where

$\hat{w}_i^{l} = \dfrac{\hat{y}_i \odot \hat{\varphi}^{l*}(x_i; \theta)}{\sum_{k=1}^{D} \hat{\varphi}^{k}(x_i; \theta) \odot \hat{\varphi}^{k*}(x_i; \theta) + \lambda},$   (2)

and $w^{l}$ is the desired correlation filter for the $l$-th channel feature map, $\theta$ refers to the parameters of our DSNet, $\varphi^{l}(x; \theta)$ is the extracted feature map of the $l$-th channel with the parameters $\theta$ corresponding to the input $x$, and $\lambda$ is a regularization parameter. Furthermore, $\mathcal{F}^{-1}$ denotes the inverse Fourier transform, $D$ represents the number of feature channels, and $\odot$, $\hat{\cdot}$ and $(\cdot)^{*}$ denote the Hadamard product, the discrete Fourier transform and complex conjugation, respectively; the frequency-domain product $\hat{w}_i^{l*} \odot \hat{\varphi}^{l}(z_i; \theta)$ corresponds to the circular correlation $w_i^{l} \star \varphi^{l}(z_i; \theta)$ in the spatial domain. By applying the multivariable chain rule, the derivative of the loss function in Eq. (1) can be rewritten as:

$\dfrac{\partial L}{\partial \theta} = \sum_{l=1}^{D} \Big( \dfrac{\partial L}{\partial \varphi^{l}(x;\theta)} \dfrac{\partial \varphi^{l}(x;\theta)}{\partial \theta} + \dfrac{\partial L}{\partial \varphi^{l}(z;\theta)} \dfrac{\partial \varphi^{l}(z;\theta)}{\partial \theta} \Big).$   (3)

Specifically, the network Jacobians $\frac{\partial \varphi^{l}(x;\theta)}{\partial \theta}$ and $\frac{\partial \varphi^{l}(z;\theta)}{\partial \theta}$ in the above can be efficiently calculated by recent deep learning toolkits. According to [28, 29], the prior two terms ($\frac{\partial L}{\partial \varphi^{l}(x;\theta)}$ and $\frac{\partial L}{\partial \varphi^{l}(z;\theta)}$) in (3) can be formulated as:

$\dfrac{\partial L}{\partial \varphi^{l}(x;\theta)} = \mathcal{F}^{-1}\left( \dfrac{\hat{p} \odot \hat{y} \odot \hat{\varphi}^{l*}(z;\theta) - 2\, \hat{\varphi}^{l}(x;\theta) \odot \mathrm{Re}\big( \hat{p} \odot \hat{g}^{*} \big)}{\sum_{k=1}^{D} \hat{\varphi}^{k}(x;\theta) \odot \hat{\varphi}^{k*}(x;\theta) + \lambda} \right),$   (4)

$\dfrac{\partial L}{\partial \varphi^{l}(z;\theta)} = \mathcal{F}^{-1}\big( \hat{p} \odot \hat{w}^{l} \big),$   (5)

where, for a single triplet, $g = \mathcal{F}^{-1}\big( \sum_{l=1}^{D} \hat{w}^{l*} \odot \hat{\varphi}^{l}(z;\theta) \big)$ is the predicted response and $\hat{p} = \mathcal{F}\big( \frac{\partial L}{\partial g} \big)$.
3.3 Feature Extraction

In the online tracking stage, DSNet acts as a feature extractor: it first extracts multi-layer convolutional feature maps as shown in Fig. 2. Next, the shallow and deep feature extraction branches of DSNet aggregate these multi-level feature maps and compress them into features of uniform spatial resolution. Finally, the MSC features are obtained by normalizing the compressed convolutional features. The resulting MSC features consist of two parts: shallow convolutional features and deep convolutional features. To better understand the learned MSC features, several of their channel feature maps are visualized in Fig. 3. As can be seen, the shallow channel feature maps usually capture the detailed appearance information of objects, while the deep channel feature maps generally encode semantic information. These two types of features complement each other, and their combination is beneficial for online tracking.

Figure 3: Visualization of the shallow and deep channel feature maps in the learned MSC features.

3.4 Channel Reliability Measurement

Several channel feature maps of MSC features may have small target activations, which indicates that these feature channels are more sensitive to the background regions than to the target regions. To measure the reliability of these channels, the channel-wise ratio $r_d$ of the $d$-th channel is formulated as:

$r_d = \dfrac{\| T_d \|_1}{\| F_d \|_1 + \beta},$   (6)

Here, $F_d$ refers to the entire $d$-th channel feature map, $T_d$ is the target region of $F_d$, $\beta$ is a penalty parameter and $\| \cdot \|_1$ is the $\ell_1$ norm.

The channel-wise ratio reflects the ratio of the target responses to the overall responses; however, it cannot fully reflect the channel reliability in some cases. For example, when the background responses are equal to zero, a quite large channel-wise ratio will be obtained even if $\| T_d \|_1$ is a small value. To overcome this problem, $A_d$ is defined to measure the target activations of the $d$-th channel feature map:

$A_d = \dfrac{1}{WH} \sum_{m=1}^{W} \sum_{n=1}^{H} \epsilon(m, n),$   (7)

where

$\epsilon(m, n) = \mathrm{sgn}\big( \max\big( 0,\, T_d(m, n) - \eta \big) \big),$   (8)

and $\mathrm{sgn}(\cdot)$ is the sign function, $T_d(m, n)$ returns the activation value at the position $(m, n)$, and $W$ and $H$ are the width and height of $T_d$, respectively. $\eta$ is a penalty parameter that controls the measurement of the target region responses. Finally, the reliability score $s_d$ of the $d$-th feature channel is calculated by:

$s_d = r_d \cdot A_d.$   (9)

Generally, channels with a high $s_d$ contain more activations from the target regions than from the background regions. After obtaining the reliability scores of the feature channels, we sort the channels in descending order of $s_d$; the top-ranked feature channels are selected to perform online tracking.
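A minimal sketch of CRM under the notation above; note that the concrete forms of Eqs. (6)-(9) as reconstructed here, the function name `crm_select`, the mask-based target region and the default values of $\beta$ and $\eta$ are illustrative assumptions.

```python
import numpy as np

def crm_select(feat, target_mask, k, beta=3.0, eta=0.1):
    """Channel reliability measurement (Eqs. 6-9), a sketch.

    feat:        (D, H, W) MSC feature maps F_d
    target_mask: (H, W) boolean mask of the target region
    k:           number of feature channels to keep
    """
    scores = []
    for F in feat:
        T = F * target_mask                              # target region T_d
        r = np.abs(T).sum() / (np.abs(F).sum() + beta)   # Eq. (6)
        # Eqs. (7)-(8): fraction of target positions activated above eta.
        A = (T[target_mask] > eta).mean()
        scores.append(r * A)                             # Eq. (9)
    order = np.argsort(scores)[::-1]                     # descending reliability
    return order[:k]                                     # indices of kept channels

# Example: keep the 58 most reliable of 96 channels (as for MSC-CCO).
feat = np.random.rand(96, 28, 28)
mask = np.zeros((28, 28), dtype=bool)
mask[10:18, 10:18] = True
kept = crm_select(feat, mask, k=58)
```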

4 MSC-Trackers

In this section, we show how the learned MSC features can be incorporated into different DCF-based tracking frameworks. We select two state-of-the-art tracking frameworks, i.e., the basic DCF framework [15] and the continuous convolution operator framework [9].

4.1 MSC Features for the Basic DCF framework

A typical DCF learns a correlation filter by solving a ridge regression problem:

$\min_{w} \Big\| \sum_{l=1}^{D} w^{l} \star \varphi^{l}(x; \theta) - y \Big\|_2^2 + \lambda \sum_{l=1}^{D} \| w^{l} \|_2^2,$   (10)

where $\varphi^{l}(x; \theta)$ is the extracted MSC feature map of the $l$-th channel with the DSNet parameters $\theta$ corresponding to the training image patch $x$, $y$ is the desired Gaussian-shaped response, and $\lambda$ is a regularization parameter that alleviates the overfitting problem. The learned correlation filter can be obtained in the Fourier domain as [15]:

$\hat{w}^{l} = \dfrac{\hat{y} \odot \hat{\varphi}^{l*}(x; \theta)}{\sum_{k=1}^{D} \hat{\varphi}^{k}(x; \theta) \odot \hat{\varphi}^{k*}(x; \theta) + \lambda}.$   (11)

Given the test image patch $z$ and the extracted features $\varphi(z; \theta)$, the online detection process is formulated as:

$R = \mathcal{F}^{-1}\Big( \sum_{l=1}^{D} \hat{w}^{l*} \odot \hat{\varphi}^{l}(z; \theta) \Big),$   (12)

where $R$ is the response map. The target center position can be estimated by searching for the maximum value of $R$. During the tracking process, at the $t$-th frame, the numerator $\hat{N}_t^{l}$ and denominator $\hat{D}_t$ of Eq. (11) are respectively updated by a moving average strategy with a learning rate $\alpha$, i.e., $\hat{N}_t^{l} = (1-\alpha) \hat{N}_{t-1}^{l} + \alpha\, \hat{y} \odot \hat{\varphi}^{l*}(x_t; \theta)$ and $\hat{D}_t = (1-\alpha) \hat{D}_{t-1} + \alpha \sum_{k=1}^{D} \hat{\varphi}^{k}(x_t; \theta) \odot \hat{\varphi}^{k*}(x_t; \theta)$. The correlation filter model at the $t$-th frame is then given by $\hat{w}_t^{l} = \hat{N}_t^{l} / (\hat{D}_t + \lambda)$. We use a scale estimation scheme similar to [29].

Note that the conventional DCF framework is restricted to single-resolution feature maps. In comparison, the proposed MSC features can be naturally fused into the DCF framework (see Fig. 1). For brevity, this MSC-features-based DCF tracker is named MSC-DCF.
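For concreteness, the core of such a single-DCF tracker (Eqs. (10)-(12) plus the moving-average update) might look as follows. This is a sketch under our assumptions: the `features` callable stands in for DSNet's MSC extractor, and the hyperparameter values are illustrative.

```python
import numpy as np
from numpy.fft import fft2, ifft2

class MSCDCF:
    """Minimal single-DCF tracker core (Eqs. 10-12), a sketch.

    `features(patch)` is assumed to return a (D, H, W) array of MSC features.
    """

    def __init__(self, features, y, alpha=0.012, lam=1e-4):
        self.features, self.yf = features, fft2(y)   # y: centered Gaussian label
        self.alpha, self.lam = alpha, lam
        self.num = self.den = None

    def update(self, patch):
        xf = fft2(self.features(patch))              # per-channel \hat{phi}(x)
        num = self.yf * np.conj(xf)                  # numerator of Eq. (11)
        den = (xf * np.conj(xf)).real.sum(axis=0)    # denominator of Eq. (11)
        if self.num is None:                         # first frame: initialize
            self.num, self.den = num, den
        else:                                        # moving-average model update
            self.num = (1 - self.alpha) * self.num + self.alpha * num
            self.den = (1 - self.alpha) * self.den + self.alpha * den

    def detect(self, patch):
        zf = fft2(self.features(patch))
        wf = self.num / (self.den + self.lam)        # Eq. (11)
        resp = ifft2((np.conj(wf) * zf).sum(axis=0)).real  # Eq. (12)
        return np.unravel_index(resp.argmax(), resp.shape)  # target center
```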

4.2 MSC Features for the Continuous Convolution Operator Framework

The continuous convolution operator is proposed in C-COT [9]. Let $x \in \mathcal{X}$ denote a training sample, which contains $D$ feature channels $x^{1}, \dots, x^{D}$, and let $N_d$ be the number of spatial samples in the channel $x^{d}$. The feature channel $x^{d}$ can be viewed as a function $x^{d}[n]$ of the discrete spatial variable $n \in \{0, \dots, N_d - 1\}$. Assume that the spatial support of the feature map is the continuous interval $[0, T)$. The interpolation operator $J_d$ is formulated as:

$J_d\{x^{d}\}(t) = \sum_{n=0}^{N_d - 1} x^{d}[n]\, b_d\Big( t - \frac{T}{N_d} n \Big),$   (13)

where $b_d$ is the interpolation function and $t \in [0, T)$ denotes the location in the image. In the continuous formulation, the tracking model is estimated as a set of convolution filters $f = (f^{1}, \dots, f^{D})$. The convolution operator is defined as:

$S_f\{x\} = \sum_{d=1}^{D} f^{d} * J_d\{x^{d}\}.$   (14)

Here, $f^{d}$ is the continuous filter for the $d$-th channel (see [9] for more details), and $*$ is the circular convolution operation, $(f * g)(t) = \int_{0}^{T} f(t - s)\, g(s)\, \mathrm{d}s$. As can be seen, each interpolated sample $J_d\{x^{d}\}$ is convolved with the corresponding filter $f^{d}$, and the final confidence map is obtained by summing the convolution responses from all the filters.
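The following 1-D NumPy sketch illustrates Eqs. (13) and (14): two channels with different numbers of samples are interpolated onto one shared continuous support $[0, T)$ and then convolved with (sampled) filters. The periodic Gaussian kernel and all constants are illustrative stand-ins for C-COT's actual interpolation function.

```python
import numpy as np

def interpolate(x_d, t, T=1.0, sigma=0.02):
    """Eq. (13): evaluate J_d{x^d}(t) at continuous locations t in [0, T)."""
    N = len(x_d)
    grid = (T / N) * np.arange(N)            # sample positions T*n/N_d
    d = t[:, None] - grid[None, :]
    d = (d + T / 2) % T - T / 2              # periodic (circular) distance
    b = np.exp(-d ** 2 / (2 * sigma ** 2))   # kernel b_d(t - T*n/N_d)
    return b @ x_d                           # sum_n x^d[n] * b_d(...)

# Two channels of different resolutions (e.g., shallow vs. deep maps)
# are mapped onto one shared continuous support [0, T).
t = np.linspace(0, 1, 200, endpoint=False)
shallow = interpolate(np.random.rand(64), t)   # N_1 = 64 samples
deep = interpolate(np.random.rand(16), t)      # N_2 = 16 samples

# Eq. (14) on the dense grid: sum of per-channel circular convolutions.
f = [np.random.rand(200), np.random.rand(200)]  # sampled continuous filters
conf = sum(np.fft.ifft(np.fft.fft(fd) * np.fft.fft(jd)).real
           for fd, jd in zip(f, [shallow, deep]))
```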

In Eqs. (13) and (14), an interpolation operator and a convolution filter are learned for each feature channel. Thus, high-dimensional convolutional features (e.g., the 608-dimensional features used in C-COT) significantly limit the online tracking speed. In comparison, the learned MSC features have far fewer channels (i.e., 96), and they can be regarded as single-layer convolutional features and fused into the continuous convolution operator framework without any modification. We call this MSC-features-based tracker MSC-CCO.

5 Experiments

Implementation Details: To avoid overfitting, we select the large-scale video detection dataset ILSVRC-2015 [26] to train the proposed DSNet. This dataset contains 4417 videos of 30 different object classes. We use 3862 videos for training and the remaining videos for validation. The triplet training samples are generated as described in [12]. The proposed DSNet is trained for 200 epochs using the SGD solver with a batch size of 16. In the tracking stage, for MSC-DCF, the learning rate and the padding area are respectively set to 0.012 and 1.65, and the number of search scales is set to 3. For MSC-CCO, the padding area is set to 3.62. Similar to [6], in addition to the MSC features, MSC-CCO also employs HOG features, and the MSC features are further compressed to 38-D by PCA. The remaining parameters of MSC-DCF and MSC-CCO are set to be the same as in [29] and [6], respectively. The CRM method is applied to refine the deep feature channels of the MSC features, where the number of selected channels in MSC-DCF and MSC-CCO is set to 50 and 58, respectively, and the penalty parameter $\beta$ is set to 3. We implement our method on a computer equipped with an Intel 6700K 4.0 GHz CPU and an NVIDIA GTX 1080 GPU.

Evaluation Methodology: Both the distance precision (DP) and overlap success (OS) plots are adopted to evaluate trackers on OTB-2013 [31], OTB-2015 [32] and OTB-50. We report both the DP rates at the conventional threshold of 20 pixels (DPR) and the OS rates at the threshold of 0.5 (OSR). The Area Under the Curve (AUC) is also used to evaluate the trackers.
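For reference, the two rates can be computed from per-frame results as in the short sketch below; this is our illustration of the protocol, not official benchmark code.

```python
import numpy as np

def dpr_osr(center_err, iou, dp_thresh=20.0, os_thresh=0.5):
    """Distance precision rate and overlap success rate, a sketch.

    center_err: per-frame center location errors in pixels
    iou:        per-frame overlap (intersection-over-union) scores
    """
    dpr = (np.asarray(center_err) <= dp_thresh).mean()  # DPR at 20 px
    osr = (np.asarray(iou) > os_thresh).mean()          # OSR at IoU 0.5
    return dpr, osr
```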

Comparison Scenarios: We evaluate the proposed MSC-trackers (MSC-DCF and MSC-CCO) in four experiments. The first experiment demonstrates the effectiveness of the learned MSC features by comparing them with both hand-crafted features and deep features. In the second experiment, we compare the proposed highly efficient MSC-trackers with state-of-the-art real-time trackers, which shows the superiority of our MSC-trackers. The third experiment compares our MSC-trackers with top-performing CF-based trackers using deep features. The last experiment presents an ablation study of the MSC-trackers to show the effectiveness of the proposed CRM method.

5.1 Feature Comparison

            MSC    Conv1  Conv5  HOG    RGB    HOG+RGB  Conv1+Conv5
DPR (%)     83.7   76.2   69.4   74.4   47.4   77.7     78.5
OSR (%)     78.3   72.6   53.7   73.5   38.7   75.2     75.7
Feat. dim.  96     96     512    32     3      35       608
Avg. FPS    56.0 / 328.1 / 49.6

Table 1: Comparison of our MSC features with both hand-crafted features and deep convolutional features within the DCF framework on OTB-2013.

To demonstrate the effectiveness of the learned MSC features, we compare the MSC features with several commonly used features. For a fair comparison, all the compared features are incorporated into the same tracking framework, i.e., the DCF framework described in Section 4.1. We compare the MSC features with raw RGB pixels (RGB), HOG [5], the first (Conv1) and fifth (Conv5) layer convolutional features of imagenet-vgg-2048, the combination of HOG and RGB (HOG+RGB) features, and the combination of Conv1 and Conv5 (Conv1+Conv5) features. To combine the Conv1 and Conv5 features, bilinear interpolation is employed to upsample the Conv5 features to the same size as the Conv1 features.

Table 1 shows a comparison of our MSC features with different types of features on OTB-2013. As can be seen, our MSC features achieve the best DPR (83.7%) and OSR (78.3%) on OTB-2013, outperforming both the hand-crafted features and the deep convolutional features by large margins. In particular, the Conv1 and Conv5 features are extracted from the imagenet-vgg-2048 network, which is also the backbone network of DSNet. Despite the similar network architecture, our MSC features improve the DPRs obtained by the Conv1 and Conv5 features on OTB-2013 by 7.5% and 14.4%, respectively. Compared with the Conv1 and Conv5 features, the combined Conv1+Conv5 features have more feature channels (608) and achieve better performance, with a DPR of 78.5%. Although many more feature channels are included in the Conv1+Conv5 features, our MSC features still provide improved performance, with a DPR of 83.7%, while achieving a fast tracking speed of 68.5 FPS, about 26 times faster than the Conv1+Conv5 features. This empirically shows that the MSC features are compact and lead to highly efficient online tracking.

5.2 Comparison with Real-Time Trackers

Figure 4: Precision (left) and success (right) plots obtained by our MSC-DCF and MSC-CCO compared with 12 state-of-the-art real-time trackers on OTB-2015. DPRs and AUCs are reported in the left and right brackets, respectively.

We compare the proposed highly efficient MSC-trackers (MSC-DCF and MSC-CCO) with 12 state-of-the-art trackers that achieve real-time tracking speed (FPS ≥ 20) for a fair comparison, including SiamFC [2], CFNet [28], Staple [1], DCFNet [29], CACF [22], LCT [21], DSST [8], KCF [15], GOTURN [14], Re3 [11], DCF [15], and TLD [16].

Fig. 4 compares the proposed MSC-trackers with the state-of-the-art real-time trackers, showing that our MSC-trackers achieve the best performance on OTB-2015. More specifically, MSC-CCO achieves the best DPR (89.2%), followed by MSC-DCF (79.8%) and CFNet (77.7%). Note that CFNet is the winner of the VOT-17 real-time challenge. Similar to MSC-DCF, CFNet also employs the traditional DCF framework. However, different from CFNet, our MSC-DCF learns complementary multi-level features instead of single-layer features, and thus achieves better performance than CFNet in terms of both DPR and AUC.

            MSC-CCO  MSC-DCF  CFNet  SiamFC  DCFNet  Staple  DCF   LCT
OTB-2013    90.5     83.7     82.2   80.3    79.5    79.3    78.4  84.8
OTB-2015    89.2     79.8     77.7   77.1    75.1    78.4    74.4  76.2
OTB-50      86.6     75.2     72.3   69.4    68.3    68.1    71.2  69.1
Avg. DPR    88.8     79.6     77.4   75.6    74.3    75.3    74.7  76.7
Avg. FPS    48.3 / 179.2 / 21.0

Table 2: DPRs (%) and average speeds obtained by our MSC-DCF and MSC-CCO trackers as well as the state-of-the-art real-time trackers on the OTB datasets. The results of the top 8 performing trackers are given.

As can be seen from Table 2, the proposed MSC-trackers achieve the best accuracy on all three datasets. Specifically, MSC-CCO achieves the best accuracy (86.6%) on OTB-50, followed by MSC-DCF (75.2%). For the average DPR, the best two results belong to our MSC-trackers, followed by CFNet (77.4%) and SiamFC (75.6%). This comparison highlights the high accuracy achieved by our trackers among the state-of-the-art real-time trackers. The average tracking speeds of the different trackers are also reported in Table 2. Compared with the other trackers, MSC-DCF achieves a fast tracking speed of 66.8 FPS while delivering competitive tracking performance. In addition, the higher accuracy of MSC-CCO is obtained at the cost of a lower speed compared to MSC-DCF, but MSC-CCO still maintains a quasi-real-time tracking speed of 20.6 FPS.

5.3 Comparison with Deep Feature-based Trackers

We compare the proposed MSC-trackers with 6 state-of-the-art deep feature-based CF trackers: C-COT [9], MCPF [33], CREST [27], DeepSRDCF [7], CF2 [20] and HDT [24].

Figure 5: Success plots obtained by the proposed MSC-trackers (MSC-CCO and MSC-DCF) and the top-performing deep feature-based CF trackers on (a) OTB-50, (b) OTB-2013 and (c) OTB-2015. AUCs are shown in brackets.
            MSC-CCO  MSC-DCF  C-COT  MCPF  DeepSRDCF  HDT   CF2   CREST
OTB-2013    83.7     78.3     83.2   85.8  79.4       73.7  74.0  86.0
OTB-2015    82.1     73.3     82.0   78.0  77.2       65.7  65.5  77.6
OTB-50      77.6     67.1     74.9   71.0  67.6       58.4  58.2  70.5
Avg. OSR    81.1     72.9     80.0   78.3  74.7       65.9  65.9  78.0
Avg. FPS    20.6     66.8     –      –     –          –     –     –

Table 3: OSRs (%) and speeds obtained by MSC-CCO and MSC-DCF as well as the state-of-the-art deep feature-based CF trackers on OTB-2013, OTB-2015 and OTB-50.

Fig. 5 and Table 3 compare our MSC-trackers with the deep feature-based CF trackers on the OTB datasets. In particular, the AUC and OSR obtained by MSC-DCF are higher than those obtained by CF2 and HDT. MSC-CCO achieves the best OSR (77.6%) on OTB-50, outperforming the second-best tracker C-COT by a large margin of 2.7%. Furthermore, the average OSR obtained by MSC-CCO is 81.1%, which ranks first, followed by C-COT (80.0%) and MCPF (78.3%). In terms of the tracking speeds reported in Table 3, MSC-DCF runs at 66.8 FPS, which is significantly faster than the compared deep feature-based trackers. In addition, MSC-CCO achieves a quasi-real-time tracking speed of 20.6 FPS, which is almost 94 times faster than C-COT and 37 times faster than MCPF. Meanwhile, MSC-CCO achieves the best overall performance over all three datasets.

            MSC-CCO  MSC-CCO-w/o-CRM  MSC-DCF  MSC-DCF-w/o-CRM
OTB-2013    90.5     86.5             83.7     83.1
OTB-2015    89.2     87.1             79.8     79.2

Table 4: DPRs (%) obtained by MSC-DCF, MSC-CCO and their ablated versions on OTB-2013 and OTB-2015.

5.4 Ablation Study

To demonstrate the effectiveness of the proposed CRM method, we evaluate MSC-DCF and MSC-CCO against their ablated versions without the CRM method, namely MSC-DCF-w/o-CRM and MSC-CCO-w/o-CRM, respectively. As can be seen from Table 4, the CRM method effectively improves tracking performance. Specifically, MSC-CCO improves the DPR of MSC-CCO-w/o-CRM on OTB-2013 by 4.0% (from 86.5% to 90.5%). For MSC-DCF, the performance is also improved by applying the CRM method. These results demonstrate the effectiveness of the CRM method, which is mainly due to the fact that CRM filters out the feature channels that are more sensitive to the background regions while retaining the high-quality channels that are more beneficial for robust visual tracking.

6 Conclusions

In this work, we propose a deep and shallow feature learning network (called DSNet) to learn multi-level same-resolution compressed (MSC) features, which effectively incorporate both deep and shallow features at the same spatial resolution for efficient online tracking. The proposed DSNet compresses multi-level convolutional features and can be effectively trained in an end-to-end offline manner. The MSC features are generic and can be easily applied to any CF-based tracker. In addition, we propose an effective channel reliability measurement method to further refine the learned MSC features. To demonstrate the effectiveness of our MSC features, two MSC-features-based trackers are presented, namely MSC-DCF and MSC-CCO. Experiments on several large-scale benchmarks show that the proposed methods perform favorably against state-of-the-art tracking methods.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (Grant Nos. U1605252, 61872307, 61472334 and 61571379) and the National Key Research and Development Program of China under Grant No. 2017YFB1302400.

References

  • [1] Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S.: Staple: Complementary learners for real-time tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 1401–1409 (2016)

  • [2] Bertinetto, L., Valmadre, J., Henriques, J., Vedaldi, A., Torr, P.H.S.: Fully-convolutional siamese networks for object tracking. In: ECCV Workshops. pp. 850–865 (2016)
  • [3] Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using adaptive correlation filters. In: Computer Vision and Pattern Recognition (CVPR). pp. 2544–2550 (2010)
  • [4] Chatfield, K., Simonyan, K., Vedaldi, A., Zisserman, A.: Return of the devil in the details: Delving deep into convolutional nets (2014), arXiv preprint arxiv:1405.3531
  • [5] Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Computer Vision and Pattern Recognition (CVPR). pp. 886–893 (2005)
  • [6] Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: Eco: Efficient convolution operators for tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 21–26 (2017)
  • [7] Danelljan, M., Hager, G., Khan, F.S., Felsberg, M.: Convolutional features for correlation filter based visual tracking. In: International Conference on Computer Vision Workshops. pp. 58–66 (2015)
  • [8] Danelljan, M., Häger, G., Khan, F.S., Felsberg, M.: Discriminative scale space tracking. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1561–1575 (2017)
  • [9] Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M.: Beyond correlation filters: learning continuous convolution operators for visual tracking. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9909, pp. 472-488. Springer, Heidelberg. doi:10.1007/978-3-319-46454-1_29 (2016)
  • [10] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Computer Vision and Pattern Recognition (CVPR). pp. 580–587 (2014)
  • [11] Gordon, D., Farhadi, A., Fox, D.: Re3: Real-time recurrent regression networks for object tracking (2017), arXiv preprint arxiv:1705.06368
  • [12] Gundogdu, E., Alatan, A.A.: Good features to correlate for visual tracking (2017), arXiv preprint arxiv:1704.06326
  • [13] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (2016)
  • [14] Held, D., Thrun, S., Savarese, S.: Learning to track at 100 fps with deep regression networks. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 749-765. Springer, Heidelberg. doi:10.1007/978-3-319-46448-0_45 (2016)
  • [15] Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. IEEE Trans. Pattern Anal. Mach. Intell. 37(3), 583–596 (2015)
  • [16] Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE Trans. Pattern Anal. Mach. Intell. 34(7), 1409–1422 (2012)
  • [17] Kong, T., Yao, A., Chen, Y., Sun, F.: Hypernet: Towards accurate region proposal generation and joint object detection. In: Computer Vision and Pattern Recognition (CVPR). pp. 845–853 (2016)
  • [18] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (NIPS). pp. 1097–1105 (2012)
  • [19] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y., Berg, A.C.: SSD: Single shot multibox detector. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9905, pp. 21-37. Springer, Heidelberg. doi:10.1007/978-3-319-46448-0_2 (2016)
  • [20] Ma, C., Huang, J.B., Yang, X., Yang, M.H.: Hierarchical convolutional features for visual tracking. In: International Conference on Computer Vision (ICCV). pp. 3074–3082 (2015)
  • [21] Ma, C., Yang, X., Zhang, C., Yang, M.H.: Long-term correlation tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 5388–5396 (2015)
  • [22] Mueller, M., Smith, N., Ghanem, B.: Context-aware correlation filter tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 1387–1395 (2017)
  • [23] Noh, H., Hong, S., Han, B.: Learning deconvolution network for semantic segmentation. In: International Conference on Computer Vision (ICCV). pp. 1520–1528 (2015)
  • [24] Qi, Y., Zhang, S., Qin, L., Yao, H., Huang, Q., Lim, J., Yang, M.H.: Hedged deep tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 4303–4311 (2016)
  • [25] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Computer Vision and Pattern Recognition (CVPR). pp. 779–788 (2016)
  • [26] Russakovsky, O., Deng, J., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 115(3), 211–252 (2015)
  • [27] Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W., Yang, M.H.: Crest: Convolutional residual learning for visual tracking. In: International Conference on Computer Vision (ICCV). pp. 2574–2583 (2017)
  • [28] Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H.S.: End-to-end representation learning for correlation filter based tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 5000–5008 (2017)
  • [29] Wang, Q., Gao, J., Xing, J., Zhang, M., Hu, W.: Dcfnet: Discriminant correlation filters network for visual tracking (2017), arXiv preprint arxiv:1704.04057
  • [30] van de Weijer, J., Schmid, C., Verbeek, J., Larlus, D.: Learning color names for real-world applications. IEEE Trans. Image Process. 18(7), 1512–1523 (2009)
  • [31] Wu, Y., Lim, J., Yang, M.H.: Online object tracking: a benchmark. In: Computer Vision and Pattern Recognition (CVPR). pp. 2411–2418 (2013)
  • [32] Wu, Y., Lim, J., Yang, M.H.: Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. 37(9), 1834–1848 (2015)
  • [33] Zhang, T., Xu, C., Yang, M.H.: Multi-task correlation particle filter for robust object tracking. In: Computer Vision and Pattern Recognition (CVPR). pp. 4819–4827 (2017)