Effective Fusion of Deep Multitasking Representations for Robust Visual Tracking

04/03/2020 ∙ by Seyed Mojtaba Marvasti-Zadeh, et al. ∙ Sharif University of Technology ∙ University of Alberta ∙ Yazd University ∙ Aalborg University

Visual object tracking remains an active research field in computer vision due to persisting challenges with various problem-specific factors in real-world scenes. Many existing tracking methods based on discriminative correlation filters (DCFs) employ feature extraction networks (FENs) to model the target appearance during the learning process. However, using deep feature maps extracted from FENs based on different residual neural networks (ResNets) has not previously been investigated. This paper aims to evaluate the performance of twelve state-of-the-art ResNet-based FENs in a DCF-based framework to determine the best for visual tracking purposes. First, it ranks their best feature maps and explores the generalized adoption of the best ResNet-based FEN into another DCF-based method. Then, the proposed method extracts deep semantic information from a fully convolutional FEN and fuses it with the best ResNet-based feature maps to strengthen the target representation in the learning process of continuous convolution filters. Finally, it introduces a new and efficient semantic weighting method (using semantic segmentation feature maps on each video frame) to reduce the drift problem. Extensive experimental results on the well-known OTB-2013, OTB-2015, TC-128 and VOT-2018 visual tracking datasets demonstrate that the proposed method effectively outperforms state-of-the-art methods in terms of precision and robustness of visual tracking.


1 Introduction

Generic visual tracking aims to estimate the trajectory of an arbitrary visual target (given its initial state in the first frame) over time despite many challenging factors, including fast motion, background clutter, deformation, and occlusion. This constitutes a fundamental problem in many computer vision applications (e.g., self-driving cars, autonomous robots, and human-computer interaction). Extracting a robust target representation is the critical component of state-of-the-art visual tracking methods for overcoming these challenges. Hence, to robustly model target appearance, these methods utilize a wide range of handcrafted features (e.g., HC-HCOF; HC-SA; HC-HCF; HC-MS; HC-FM; APIN1, which exploit the histogram of oriented gradients (HOG) HOG, the histogram of local intensities (HOI), and Color Names (CN) CN), deep features from deep neural networks (e.g., HCFT; HCFTs; DCFNet; SiamFC; APIN2; APIN3), or both (e.g., DeepSRDCF; ECO; LCTdeep).

Deep features are generally extracted from either FENs (i.e., pre-trained deep convolutional neural networks (CNNs)) or end-to-end networks (EENs), which directly evaluate target candidates SurveyDeepTracking. However, most EEN-based visual trackers train or fine-tune FENs on visual tracking datasets. Numerous recent visual tracking methods exploit powerful generic target representations from FENs trained on large-scale object recognition datasets, such as ImageNet ImageNet. By combining deep discriminative features from FENs with efficient online learning formulations, visual trackers based on discriminative correlation filters (DCFs) have achieved significant performance in terms of accuracy, robustness, and speed. Despite the existence of modern CNN models, which are mainly based on residual architectures ResNet, most DCF-based visual trackers still use the VGG-M VGGM, VGG-16 VGGNet, and VGG-19 VGGNet models, which have a simple stacked multi-layer topology with moderate representation capabilities. These models provide limited representational power compared to state-of-the-art ResNet-based models such as ResNet ResNet, ResNeXt ResNeXt, squeeze-and-excitation networks (SENets) SE-Res-CVPR; SE-Res-PAMI, and DenseNet DenseNet. Moreover, most visual tracking methods exploit well-known CNN models trained for the visual recognition task. However, transferring deep representations from other related tasks (e.g., semantic segmentation or object detection) may help DCF-based visual trackers improve their robustness. Motivated by these two observations, this paper surveys the performance of state-of-the-art ResNet-based FENs and exploits deep multitasking representations for visual tracking.

The main contributions of the paper are as follows.
1) The effectiveness of twelve state-of-the-art ResNet-based FENs is evaluated: ResNet-50 ResNet, ResNet-101 ResNet, ResNet-152 ResNet, ResNeXt-50 ResNeXt, ResNeXt-101 ResNeXt, SE-ResNet-50 SE-Res-CVPR; SE-Res-PAMI, SE-ResNet-101 SE-Res-CVPR; SE-Res-PAMI, SE-ResNeXt-50 SE-Res-CVPR; SE-Res-PAMI, SE-ResNeXt-101 SE-Res-CVPR; SE-Res-PAMI, DenseNet-121 DenseNet, DenseNet-169 DenseNet, and DenseNet-201 DenseNet. These have been trained on the large-scale ImageNet dataset. To the best of our knowledge, this is the first paper to comprehensively survey the performance of ResNet-based FENs for visual tracking purposes.
2) The generalization of the best ResNet-based FEN is investigated by a different DCF-based visual tracking method.
3) A visual tracking method is proposed that fuses deep representations extracted from the best ResNet-based network and the FCN-8s network FCN-8s, the latter trained on the PASCAL VOC PascalVOC and MSCOCO MSCOCO datasets for semantic image segmentation.
4) The proposed method uses the extracted deep features from the FCN-8s network to semantically weight the target representation in each video frame.
5) The performance of the proposed method is extensively compared with state-of-the-art visual tracking methods on four well-known visual tracking datasets.

The rest of the paper is organized as follows. In Section 2, an overview of related work, including DCF-based visual trackers exploiting deep features from the FENs, is outlined. In Section 3 and Section 4, the survey of ResNet-based FENs and the proposed visual tracking method are presented, respectively. In Section 5, extensive experimental results on the large visual tracking datasets are given. Finally, the conclusion is summarized in Section 6.

2 Related Work

In this section, the exploitation of different FENs by state-of-the-art visual trackers is categorized and described. Although some visual tracking methods use recent FENs (e.g., R-CNN RCNN in CNN-SVM), the popular FENs in most visual trackers are the VGG-M, VGG-16, and VGG-19 models (see Table 1). These networks provide moderate accuracy and complexity for visual target modeling.

Visual Tracking Method Model Pre-training Dataset Name of Exploited Layer(s)
DeepSRDCF DeepSRDCF VGG-M ImageNet Conv1
C-COT CCOT VGG-M ImageNet Conv1, Conv5
ECO ECO VGG-M ImageNet Conv1, Conv5
DeepSTRCF STRCF VGG-M ImageNet Conv3
WAEF WAEF VGG-M ImageNet Conv1, Conv5
WECO WECO VGG-M ImageNet Conv1, Conv5
VDSR-SRT VDSR-SRT VGG-M ImageNet Conv1, Conv5
RPCF RPCF VGG-M ImageNet Conv1, Conv5
DeepTACF DeepTACF VGG-M ImageNet Conv1
ASRCF ASRCF VGG-M, VGG-16 ImageNet Norm1, Conv4-3
DRT DRT VGG-M, VGG-16 ImageNet Conv1, Conv4-3
ETDL ETDL VGG-16 ImageNet Conv1-2
FCNT FCNT VGG-16 ImageNet Conv4-3, Conv5-3
Tracker DeepMotionFeatures VGG-16 ImageNet Conv2-2, Conv5-3
DNT DNT VGG-16 ImageNet Conv4-3, Conv5-3
CREST CREST VGG-16 ImageNet Conv4-3
CPT CPT VGG-16 ImageNet Conv5-1, Conv5-3
DeepFWDCF DeepFWDCF VGG-16 ImageNet Conv4-3
DTO DTO VGG-16, SSD ImageNet Conv3-3, Conv4-3, Conv5-3
DeepHPFT DeepHPFT VGG-16, VGG-19, and GoogLeNet ImageNet Conv5-3, Conv5-4, and icp6-out
HCFT HCFT VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
HCFTs HCFTs VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
LCTdeep LCTdeep VGG-19 ImageNet Conv5-4
Tracker CF-CNN VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
HDT HDT VGG-19 ImageNet Conv4-2, Conv4-3, Conv4-4, Conv5-2, Conv5-3, Conv5-4
IBCCF IBCCF VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
DCPF DCPF VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
MCPF MCPF VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
DeepLMCF DeepLMCF VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
STSGS STSGS VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
MCCT MCCT VGG-19 ImageNet Conv4-4, Conv5-4
DCPF2 DCPF2 VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
ORHF ORHF VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
IMM-DFT IMM-DFT VGG-19 ImageNet Conv3-4, Conv4-4, Conv5-4
MMLT MMLT VGGNet, Fully-convolutional Siamese network ImageNet, ILSVRC-VID Conv5
CNN-SVM CNN-SVM R-CNN ImageNet First fully-connected layer
TADT TADT Siamese matching network ImageNet Conv4-1, Conv4-3
Table 1: Exploited FENs in some visual tracking methods.

VGG-M Model: With the aid of deep features and spatial regularization weights, the spatially regularized discriminative correlation filters (DeepSRDCF) DeepSRDCF learn more discriminative target models on larger image regions. The DeepSRDCF formulation provides a larger set of negative training samples by penalizing the unwanted boundary effects resulting from the periodic assumption of standard DCF-based methods. Moreover, the method based on deep spatial-temporal regularized correlation filters (DeepSTRCF) STRCF incorporates both spatial and temporal regularization parameters to provide a more robust appearance model; it then iteratively optimizes three closed-form solutions via the alternating direction method of multipliers (ADMM) algorithm ADMM. To explore the efficiency of Tikhonov regularization in the temporal domain, the weighted aggregation with enhancement filter (WAEF) WAEF organizes deep features according to their average information content and suppresses uncorrelated frames for visual tracking. The continuous convolution operator tracker (C-COT) CCOT fuses multi-resolution deep feature maps to learn discriminative continuous-domain convolution operators; it also exploits an implicit interpolation model to enable accurate sub-pixel localization of the visual target. To reduce the computational complexity and the number of training samples of the C-COT and to improve its update strategy, the ECO tracker ECO uses a factorized convolution operator, a compact generative model of the training sample distribution, and a conservative model update strategy. The weighted ECO (WECO) WECO utilizes a weighted sum operation and normalization of deep features to take advantage of multi-resolution deep features. Within the ECO framework, the VDSR-SRT method VDSR-SRT uses a super-resolution reconstruction algorithm to robustly track targets in low-resolution images. To compress the model size and improve robustness against deformations, the region-of-interest (ROI) pooled correlation filters (RPCF) RPCF accurately localize a target while using deep feature maps of smaller size. The target-aware correlation filters (TACF) method DeepTACF guides the learned filters by focusing on the target and suppressing background information.
VGG-16 Model: By employing the DeepSRDCF tracker, Gladh et al. DeepMotionFeatures investigated the fusion of handcrafted appearance features (e.g., HOG and CN) with deep RGB and motion features in the DCF-based visual tracking framework. This method demonstrates the effectiveness of deep feature maps for visual tracking purposes. Besides, the CREST method CREST reformulates correlation filters to extract more beneficial deep features by integrating feature extraction with the learning process of DCFs. The dual network-based tracker (DNT) DNT designs a dual structure for embedding boundary and shape information into the feature maps to better utilize deep hierarchical features. The deep fully convolutional networks tracker (FCNT) FCNT uses two complementary feature maps and a feature-map selection method to design an effective FEN-based visual tracker. It exploits two different convolutional layers for category detection and distraction separation. Moreover, the feature-map selection method helps the FCNT reduce computational redundancy and discard irrelevant feature maps. The method based on enhanced tracking and detection learning (ETDL) ETDL employs different color bases for each color frame, an adaptive multi-scale DCF with deep features, and a detection module to re-detect the target in failure cases. To suppress unexpected background information and distractors, the adaptive feature weighted DCF (FWDCF) DeepFWDCF calculates the target likelihood to provide spatial-temporal weights for deep features. The channel pruning method (CPT) CPT utilizes an average feature energy ratio to adaptively exploit low-dimensional deep features.
VGG-19 Model: Similar to the FCNT, the hierarchical correlation feature-based tracker (HCFT) HCFT exploits multiple levels of deep feature maps to handle considerable appearance variation while simultaneously providing precise localization. Furthermore, the modified HCFT (namely HCFTs or HCFT*) HCFTs not only learns linear correlation filters on multi-level deep feature maps; it also employs two types of region proposals and a discriminative classifier to provide a long-term memory of target appearance. The IMM-DFT method IMM-DFT exploits adaptive hierarchical features to interactively model a target, since a linear combination of deep features is insufficient. To incorporate motion information with spatial deep features, the STSGS method STSGS solves a compositional energy optimization to effectively localize the target of interest. Also, the MCCT method MCCT learns different target models with multiple DCFs that adopt different features and then makes its decision based on the most reliable result. The hedged deep tracker (HDT) HDT ensembles weak CNN-based visual trackers via an online decision-theoretic Hedge algorithm, aiming to exploit the full advantage of hierarchical deep feature maps. The ORHF method ORHF selects useful deep features according to estimated confidence scores to reduce the computational complexity. To enhance the discrimination of the target from its background, the DCPF DCPF and DeepLMCF DeepLMCF methods integrate deep features into particle filter and structured SVM-based tracking, respectively. To handle significant appearance change and scale variation, the deep long-term correlation tracker (LCTdeep) LCTdeep uses multiple adaptive DCFs, deep feature pyramid reconfiguration, and aggressive and conservative learning rates as short- and long-term memories, combined with an incrementally learned detector to recover from tracking failures. Furthermore, Ma et al. CF-CNN exploited different deep feature maps and a conservative update scheme to learn the mapping as a spatial correlation and maintain a long-term memory of target appearance. Finally, the DCPF2 DCPF2 and IBCCF IBCCF methods aim to handle the aspect-ratio changes of a target in deep DCF-based trackers.
Custom FENs: In addition to the aforementioned popular FENs, some methods exploit either a combination of FENs (e.g., ASRCF ASRCF, DRT DRT, DTO DTO, MMLT MMLT, and DeepHPFT DeepHPFT) or other types of FENs (e.g., CNN-SVM CNN-SVM and TADT TADT) for visual tracking. The ASRCF ASRCF and DRT DRT methods modify the DCF formulation to achieve more reliable results. To alleviate inaccurate estimations and scale drift, the DeepHPFT DeepHPFT ensembles various handcrafted and deep features in the particle filter framework. Although the DTO method DTO employs a trained object detection model (i.e., the single shot detector (SSD) SSD) to evaluate the impact of object category estimation on visual tracking, its ability is limited to particular categories. The MMLT method MMLT employs the Siamese and VGG networks not only to robustly track a target but also to provide a long-term memory of appearance variation. Finally, the CNN-SVM CNN-SVM and TADT TADT methods use different models to improve the effectiveness of FENs for visual tracking.
According to Table 1, recent visual tracking methods not only use state-of-the-art FENs as their feature extractors but also exploit different feature maps within the DCF-based framework. In the next section, popular ResNet-based FENs are surveyed. Then, the proposed visual tracking method, with the aid of the fused deep representation, is described.

3 Survey of ResNet-based FENs

This section has two main aims. First, twelve state-of-the-art ResNet-based FENs in the DCF-based framework are comprehensively evaluated for visual tracking purposes, and the best ResNet-based FEN and its feature maps are selected to be exploited in the DCF-based visual tracking framework. Second, the generalization of the best FEN is investigated on another DCF-based visual tracking method.

3.1 Performance Evaluation

The ECO framework ECO is employed as the baseline visual tracker. Aiming for accurate and fair evaluations, this tracker was modified to extract deep features from FENs with a directed acyclic graph topology and to exploit all deep feature maps without any down-sampling or dimension reduction. Table 2 shows the exploited ResNet-based FENs and their feature maps, which are used in the proposed method. All these FENs are evaluated with the well-known precision and success metrics OTB2013; OTB2015 on the OTB-2013 dataset OTB2013, which includes more than 29,000 video frames. The characteristics of the visual tracking datasets used in this work are shown in Table 3. The OTB-2013, OTB-2015, and TC-128 visual tracking datasets OTB2013; OTB2015; TC128 share common challenging attributes, including illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background clutter (BC), and low resolution (LR), which may occur for different classes of visual targets in real-world scenarios. The VOT-2018 dataset annotates the main challenging attributes of camera motion, illumination change, motion change, occlusion, and size change per video frame. To measure visual tracking performance, the OTB-2013, OTB-2015, and TC-128 toolkits use precision and success plots to rank the methods according to the area under curve (AUC) metric, while the VOT-2018 toolkit utilizes the accuracy-robustness (AR) plot to rank the visual trackers based on the TraX protocol TraX. The precision metric is defined as the percentage of frames in which the Euclidean distance between the estimated and ground-truth locations is smaller than a given threshold (20 pixels in this work). The overlap success metric is the percentage of frames in which the overlap score between the estimated and ground-truth bounding boxes exceeds a particular threshold (50% overlap in this work). For the VOT-2018, the accuracy measures the overlap of the estimated bounding boxes with the ground-truth ones, while the number of failures is considered as the robustness metric. Although the OTB-2013, OTB-2015, and TC-128 toolkits assess the visual trackers in a one-pass evaluation, the VOT-2018 toolkit detects the tracking failures of each method and restarts the evaluation five frames after each failure on every video sequence, following the TraX protocol.
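For concreteness, the sketch below shows how these two OTB-style metrics can be computed from per-frame bounding boxes; it is a simplified stand-in for the official benchmark toolkits, and the (x, y, w, h) box format, function names, and dummy data are assumptions of this sketch.

```python
import numpy as np

def center_errors(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Euclidean distance between box centres; boxes are rows of (x, y, w, h)."""
    cp = pred[:, :2] + pred[:, 2:] / 2.0
    cg = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(cp - cg, axis=1)

def overlaps(pred: np.ndarray, gt: np.ndarray) -> np.ndarray:
    """Intersection-over-union of (x, y, w, h) boxes, one row per frame."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def precision_rate(pred, gt, thresh=20.0):
    """Fraction of frames whose centre error is below `thresh` pixels."""
    return float(np.mean(center_errors(pred, gt) <= thresh))

def success_rate(pred, gt, thresh=0.5):
    """Fraction of frames whose overlap score exceeds `thresh`."""
    return float(np.mean(overlaps(pred, gt) > thresh))

# Usage with dummy trajectories of 100 frames.
pred = np.abs(np.random.randn(100, 4)) * 50 + 10
gt = pred + np.random.randn(100, 4) * 5
print(precision_rate(pred, gt), success_rate(pred, gt))
```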

Name of FENs Name of Output Layers (Number of Feature Maps)
L1 L2 L3 L4
ResNet-50 Conv1-relu (64) Res2c-relu (256) Res3d-relu (512) Res4f-relu (1024)
ResNet-101 Conv1-relu (64) Res2c-relu (256) Res3b3-relu (512) Res4b22-relu (1024)
ResNet-152 Conv1-relu (64) Res2c-relu (256) Res3b7-relu (512) Res4b35-relu (1024)
ResNeXt-50 Features-2 (64) Features-4-2-id-relu (256) Features-5-3-id-relu (512) Features-6-5-id-relu (1024)
ResNeXt-101 Features-2 (64) Features-4-2-id-relu (256) Features-5-3-id-relu (512) Features-6-22-id-relu (1024)
SE-ResNet-50 Conv1-relu-7x7-s2 (64) Conv2-3-relu (256) Conv3-4-relu (512) Conv4-6-relu (1024)
SE-ResNet-101 Conv1-relu-7x7-s2 (64) Conv2-3-relu (256) Conv3-4-relu (512) Conv4-23-relu (1024)
SE-ResNeXt-50 Conv1-relu-7x7-s2 (64) Conv2-3-relu (256) Conv3-4-relu (512) Conv4-6-relu (1024)
SE-ResNeXt-101 Conv1-relu-7x7-s2 (64) Conv2-3-relu (256) Conv3-4-relu (512) Conv4-23-relu (1024)
DenseNet-121 Features-0-relu0 (64) Features-0-transition1-relu (256) Features-0-transition2-relu (512) Features-0-transition3-relu (1024)
DenseNet-169 Features-0-relu0 (64) Features-0-transition1-relu (256) Features-0-transition2-relu (512) Features-0-transition3-relu (1024)
DenseNet-201 Features-0-relu0 (64) Features-0-transition1-relu (256) Features-0-transition2-relu (512) Features-0-transition3-relu (1792)
Table 2: Exploited state-of-the-art ResNet-based FENs and their feature maps [Level of feature map resolution is denoted as L].
Visual Tracking Dataset Number of Videos Number of Frames Number of Videos Per Attribute
IV OPR SV OCC DEF MB FM IPR OV BC LR
OTB-2013 OTB2013 51 29491 25 39 28 29 19 12 17 31 6 21 4
OTB-2015 OTB2015 100 59040 38 63 65 49 44 31 40 53 14 31 10
TC-128 TC128 129 55346 37 73 66 64 38 35 53 59 16 46 21
VOT-2018 VOT-2018 70 25504 Frame-based attributes
Table 3: Exploited visual tracking datasets in this work.

To survey the performance of twelve ResNet-based FENs for visual tracking, the results of the comprehensive precision and success evaluations on the OTB-2013 dataset are shown in Fig. 1. On this basis, the L3 feature maps of these FENs have provided the best representation of targets, which is favorable for visual tracking. The deep features in the third level of ResNet-based FENs provide an appropriate balance between semantic information and spatial resolution and yet are invariant to significant appearance variations. The performance comparison of different ResNet-based FENs in terms of precision and success metrics is shown in Table 4. According to the results, the DenseNet-201 model and the feature maps extracted from the L3 layers have achieved the best performance for visual tracking purposes.

Figure 1: Precision and success evaluation results of state-of-the-art ResNet-based FENs on the OTB-2013 dataset.
Name of FENs Precision Success Average
ResNet-50 0.881 0.817 0.849
ResNet-101 0.901 0.829 0.865
ResNet-152 0.844 0.777 0.810
ResNeXt-50 0.885 0.819 0.852
ResNeXt-101 0.880 0.800 0.840
SE-ResNet-50 0.871 0.802 0.836
SE-ResNet-101 0.892 0.826 0.859
SE-ResNeXt-50 0.883 0.810 0.846
SE-ResNeXt-101 0.886 0.806 0.846
DenseNet-121 0.897 0.821 0.859
DenseNet-169 0.905 0.827 0.866
DenseNet-201 0.910 0.829 0.869
Table 4: Performance comparison of the L3 representations of ResNet-based models [first, second, and third FENs are shown in color].

Furthermore, the comprehensive results show that the DenseNet-201 network provides superior deep representations of target appearance in challenging scenarios compared to the other ResNet-based FENs. This network benefits from feature reuse: it concatenates the feature maps learned by different layers, which strengthens feature propagation and improves efficiency. This network is therefore selected as the FEN of the proposed method (see Section 4), and the exploitation of its best feature maps is investigated in the next subsection.
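As an illustration of how such intermediate maps can be tapped, the following sketch registers a forward hook on a recent torchvision densenet201; mapping the paper's L3 layer of Table 2 to torchvision's features.transition2.relu module is an assumption, since Table 2 uses the layer naming of a different framework.

```python
import torch
import torchvision

# Pre-trained DenseNet-201 (ImageNet weights), used purely as an FEN.
model = torchvision.models.densenet201(weights="IMAGENET1K_V1").eval()

feats = {}
def grab(_module, _inputs, output):
    feats["L3"] = output.detach()        # 512-channel map, cf. Table 2

# Assumed correspondence: the paper's L3 output ~ the ReLU inside transition2.
model.features.transition2.relu.register_forward_hook(grab)

with torch.no_grad():
    patch = torch.randn(1, 3, 224, 224)  # a normalized search-region patch
    model(patch)

print(feats["L3"].shape)                 # torch.Size([1, 512, 28, 28])
```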

3.2 Selection and Generalization Evaluation of Best Feature Maps

Two aims are investigated in this subsection: exploiting the best feature maps of the DenseNet-201 model and evaluating the generalization of this network to other DCF-based visual trackers.

In the first step, all single and fused deep feature maps extracted from the DenseNet-201 model were extensively evaluated on the OTB-2013 dataset (see Fig. 2(a)). Based on these results, the feature maps extracted from the L3 layer achieved the best visual tracking performance in terms of average precision and success metrics. Although the original ECO tracker exploits fused deep features extracted from the first and third convolutional layers of the VGG-M network, fusing different levels of feature maps from the DenseNet network did not lead to better visual tracking performance. These results may help visual trackers avoid redundant feature maps and considerably reduce computational complexity.

In the second step, the generalization of exploiting the third convolutional layer of the DenseNet-201 model was evaluated on the DeepSTRCF tracker STRCF. To ensure a fair comparison, the visual tracking performance of both the original and the proposed versions of DeepSTRCF was evaluated with only deep features (without fusing with handcrafted features) extracted from the VGG-M and DenseNet-201 models on the OTB-2013 dataset. As shown in Fig. 2(b), exploiting the DenseNet-201 network not only significantly improves the accuracy and robustness of visual tracking in challenging scenarios but also generalizes well to other DCF-based visual trackers.

Figure 2: Evaluation of the fused feature maps of the pre-trained DenseNet-201 network and generalization of the best feature maps to the DeepSTRCF tracker.

4 Proposed Visual Tracking Method

The proposed tracking method exploits fused deep features from multitasking FENs and introduces semantic weighting to provide a more robust target representation for the learning process of continuous convolution filters, with the aim of effectively tracking generic objects in challenging scenarios. Algorithm 1 gives a brief outline of the proposed method; a short code sketch follows the outline.

Algorithm 1. Brief Outline of Proposed Method

  • Prerequisites:

    • Survey of twelve ResNet-based models which have been pre-trained on ImageNet dataset

    • Selection of the best feature maps of the DenseNet-201 model and their generalization

    • Selection of the DenseNet-201 model as the best FEN

    • Selection of the FCN-8s model which has been pre-trained on PASCAL VOC and MSCOCO datasets

    • Selection of the pixel-wise prediction maps as the semantic features

  • Evaluation of proposed visual tracking method on the OTB-2013, OTB-2015, and TC-128 datasets (without any pre-training of translation and scale filters):

Initialization of the proposed method with a given bounding box in the first frame:
  1. Extract deep features from DenseNet-201 and FCN-8s models

  2. Fuse deep multi-tasking features with semantic windowing using Eq. (2) to Eq. (7)

  3. Learn the continuous translation filters using Eq. (9) to robustly model the target

  4. Extract HOG features

  5. Learn continuous scale filters using a multi-scale search strategy

  6. Update the translation and scale filters using Eq. (11) (for frame t > 1)

  7. Go to the next frame

  8. Select the search region according to the previous target location

  9. Extract deep features from DenseNet-201 and FCN-8s models

  10. Fuse deep multi-tasking features with semantic windowing using Eq. (2) to Eq. (7)

  11. Calculate the confidence map using Eq. (10)

  12. Estimate the target location by finding the maximum of Eq. (10)

  13. Extract multi-scale HOG features centered at the estimated target location

  14. Estimate the target scale using the multi-scale search strategy

  15. Select the target region conforming with the estimated location and scale

  16. Return to Step 1
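Steps 4 and 13 of the outline extract HOG features for the scale filters. A minimal sketch of such a HOG sample is given below; it relies on scikit-image's hog (a recent version supporting channel_axis), whereas DSST-style trackers typically use fHOG features, so this is an approximation rather than the paper's implementation, and the cell size and box format are assumptions.

```python
import numpy as np
from skimage.feature import hog

def hog_scale_sample(image: np.ndarray, box, cell: int = 4) -> np.ndarray:
    """HOG descriptor of the image patch inside box = (x, y, w, h)."""
    x, y, w, h = (int(round(v)) for v in box)
    patch = image[max(y, 0):y + h, max(x, 0):x + w]
    return hog(patch, orientations=9, pixels_per_cell=(cell, cell),
               cells_per_block=(2, 2), channel_axis=-1, feature_vector=True)

# Usage on a dummy RGB frame.
frame = np.random.rand(240, 320, 3)
descriptor = hog_scale_sample(frame, (100, 60, 48, 64))
print(descriptor.shape)
```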

Using the ECO framework, the proposed method minimizes the following objective function (which is then solved in the Fourier domain) ECO

E(f) = \sum_{j=1}^{M} \alpha_{j} \Big\| \sum_{d=1}^{D} f^{d} \ast J_{d}\{x_{j}^{d}\} - y_{j} \Big\|^{2} + \sum_{d=1}^{D} \big\| w \, f^{d} \big\|^{2}   (1)

in which f = (f^{1}, \ldots, f^{D}), x_{j}, y_{j}, and J_{d} denote the multi-channel convolution filters, the weighted fused deep feature maps from the FENs, the desirable response map, and the interpolation function (which poses the learning problem in the continuous domain), respectively. The penalty function (to learn filters on the target region), the number of training samples, the number of convolution filters, and the weights of the training samples are indicated by w, M, D, and \alpha_{j}, respectively. Also, \ast and \|\cdot\| refer to the circular convolution operation and the L2 norm, respectively. Capital letters represent the discrete Fourier transform (DFT) of the corresponding variables.

In addition to utilizing the feature maps of the DenseNet-201 network, the proposed method extracts deep feature maps from the FCN-8s network (i.e., 21 feature maps from the “score_final” layer) to learn a more discriminative target representation by fusing the feature maps as

x = \big[ x_{\mathrm{D}}^{1}, \ldots, x_{\mathrm{D}}^{N_{1}}, \; x_{\mathrm{S}}^{1}, \ldots, x_{\mathrm{S}}^{N_{2}} \big]   (2)

where x, \{x_{\mathrm{D}}^{d}, x_{\mathrm{S}}^{d}\}, N_{1}, and N_{2} are the fused deep representation, the deep feature maps, the number of feature maps extracted from the DenseNet-201 network, and the number of feature maps extracted from the FCN-8s network, respectively.
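A minimal sketch of this fusion step in PyTorch is shown below, assuming the two networks' maps are simply resampled to a common resolution and concatenated along the channel dimension; the resampling choice and tensor shapes are assumptions, and the FCN-8s score maps are stood in by a dummy tensor.

```python
import torch
import torch.nn.functional as F

def fuse_features(densenet_maps: torch.Tensor, fcn_scores: torch.Tensor) -> torch.Tensor:
    """Channel-wise concatenation of multitasking deep features (cf. Eq. (2)).

    densenet_maps: (1, N1, H1, W1) maps from the DenseNet-201 L3 layer.
    fcn_scores:    (1, N2, H2, W2) score maps from the FCN-8s 'score_final' layer.
    """
    # Resample the FCN maps to the DenseNet resolution before stacking
    # (an assumption made for this sketch).
    fcn_resized = F.interpolate(fcn_scores, size=densenet_maps.shape[-2:],
                                mode="bilinear", align_corners=False)
    return torch.cat([densenet_maps, fcn_resized], dim=1)   # (1, N1 + N2, H1, W1)

# Example with dummy tensors: 512 DenseNet-201 maps and 21 FCN-8s score maps.
fused = fuse_features(torch.randn(1, 512, 28, 28), torch.randn(1, 21, 64, 64))
print(fused.shape)   # torch.Size([1, 533, 28, 28])
```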

Although DCF-based visual trackers use a cosine window to weight the target region and its surroundings, this weighting scheme may contaminate the target model and lead to drift problems. Some recent visual trackers, e.g., WindowedCF, utilize different windowing methods, such as computing target likelihood maps from raw pixels. However, the present work is the first method to use deep feature maps to semantically weight target appearances. The proposed method exploits the feature maps of the FCN-8s model to semantically weight the target. Given the location of the target at (m_{0}, n_{0}) in the first (or previous) frame, the proposed method defines a semantic mask as

L(m, n) = \arg\max_{c \in C} \, x_{\mathrm{S}}^{c}(m, n)   (3)

\Omega(m, n) = \begin{cases} 1, & L(m, n) = L(m_{0}, n_{0}) \\ 0, & \text{otherwise} \end{cases}   (4)

in which L, C, and (m, n) are the label map, the set of PASCAL classes, and the spatial location, respectively. Finally, the fused feature maps are weighted by

\tilde{x} = \big( w_{c} \cdot \Omega \big) \odot x   (5)

where w_{c} and \odot are the conventional cosine (or sine) window and the channel-wise product, respectively. The traditional cosine window weights a deep representation with M \times N spatial dimensions as

w_{c}(m, n) = \sin^{2}\!\Big( \frac{\pi m}{M - 1} \Big) \, \sin^{2}\!\Big( \frac{\pi n}{N - 1} \Big)   (6)
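A small sketch of this semantic windowing is given below; the argmax label map, the equality test against the class at the target location, and the multiplicative combination with the Hann window mirror Eqs. (3) to (6) but remain assumptions about the exact implementation, as are the tensor sizes.

```python
import torch

def semantic_mask(fcn_scores: torch.Tensor, target_rc: tuple) -> torch.Tensor:
    """Binary mask of pixels sharing the predicted class of the target location.

    fcn_scores: (21, H, W) FCN-8s score maps; target_rc: (row, col) of the
    target location in the previous frame.
    """
    label_map = fcn_scores.argmax(dim=0)                      # cf. Eq. (3)
    target_label = label_map[target_rc[0], target_rc[1]]
    return (label_map == target_label).float()                # cf. Eq. (4)

def cosine_window(h: int, w: int) -> torch.Tensor:
    """Separable Hann ('cosine') window, the standard DCF weighting (cf. Eq. (6))."""
    return torch.outer(torch.hann_window(h, periodic=False),
                       torch.hann_window(w, periodic=False))

# Channel-wise weighting of the fused representation (assumed form of Eq. (5)).
fused = torch.randn(533, 64, 64)          # fused DenseNet-201 + FCN-8s maps
scores = torch.randn(21, 64, 64)          # FCN-8s score maps on the same grid
weight = cosine_window(64, 64) * semantic_mask(scores, (32, 32))
weighted = weight * fused                 # broadcasts the 2-D weight over channels
print(weighted.shape)                     # torch.Size([533, 64, 64])
```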

The matrix form of the semantically weighted fused deep representations is defined as

A = \big[ A_{1}^{\mathrm{T}}, \ldots, A_{M}^{\mathrm{T}} \big]^{\mathrm{T}}, \qquad A_{j} = \big[ \operatorname{diag}\big( \hat{\tilde{x}}_{j}^{1} \, \hat{b}_{1} \big), \ldots, \operatorname{diag}\big( \hat{\tilde{x}}_{j}^{D} \, \hat{b}_{D} \big) \big]   (7)

such that the weighted fused feature maps are posed to the continuous domain by the interpolation function (with Fourier coefficients \hat{b}_{d}). By defining the diagonal matrix of training-sample weights as \Gamma = \operatorname{diag}(\alpha_{1}, \ldots, \alpha_{M}), the regularized least-squares problem (1) is minimized by

E(\hat{f}) = \big\| A \hat{f} - \hat{y} \big\|_{\Gamma}^{2} + \big\| W \hat{f} \big\|^{2}   (8)

which, with W denoting the convolution matrix corresponding to the penalty function w, has the following closed-form solution (the normal equations)

\big( A^{\mathrm{H}} \Gamma A + W^{\mathrm{H}} W \big) \hat{f} = A^{\mathrm{H}} \Gamma \, \hat{y}   (9)
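Equation (9) is solved iteratively rather than by explicit matrix inversion. The sketch below shows a plain, real-valued conjugate-gradient loop over a matrix-free operator; it is a simplification of the preconditioned, complex-valued solver used in the ECO framework, and the toy operator and sizes are assumptions.

```python
import numpy as np

def conjugate_gradient(apply_A, b, iters=25):
    """Solve A x = b for symmetric positive-definite A given only apply_A(v) = A @ v.

    Because only matrix-vector products are needed, a product such as
    A^H Gamma A never has to be formed or stored explicitly.
    """
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x = x + alpha * p
        r = r - alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage: the operator D^T D + lambda * I is applied without ever forming it.
rng = np.random.default_rng(0)
D = rng.standard_normal((100, 40))
lam = 0.1
b = D.T @ rng.standard_normal(100)
x = conjugate_gradient(lambda v: D.T @ (D @ v) + lam * v, b)
print(np.linalg.norm(D.T @ (D @ x) + lam * x - b))   # small residual
```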

The proposed method learns the multi-channel convolution filters by employing the iterative conjugate gradient (CG) method. Because the iterative CG method does not need to form the products A^{\mathrm{H}} \Gamma A and W^{\mathrm{H}} W explicitly, it can minimize the objective function in a limited number of iterations. To track the target in subsequent frames, the multitasking deep feature maps \tilde{z} are extracted from the search region in the current frame. Next, the confidence map is calculated by

s(t) = \sum_{d=1}^{D} f^{d} \ast J_{d}\{\tilde{z}^{d}\}(t)   (10)

whose maximum determines the target location in the current frame. Then, a new set of continuous convolution filters is trained on the current target sample, and the prior learned filters are updated by the current learned filters as

f_{t} = (1 - \eta) \, f_{t-1} + \eta \, f_{t}^{\mathrm{new}}   (11)

in which \eta is the learning rate. Therefore, the proposed method alternates between online training and tracking of the generic visual target over all video frames.
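On a discrete grid, the detection and update steps of Eqs. (10) and (11) reduce to the following sketch, with the filters and features held as per-channel 2-D DFTs; the continuous-domain interpolation of the actual framework is omitted, and the array shapes and dummy data are assumptions.

```python
import numpy as np

def confidence_map(filters_f: np.ndarray, sample_f: np.ndarray) -> np.ndarray:
    """Detection response from per-channel Fourier-domain filters and features.

    filters_f, sample_f: (C, H, W) complex arrays (2-D DFT of each channel).
    The response is the inverse DFT of the channel-wise sum of
    conj(filter) * sample, a grid-sampled stand-in for Eq. (10).
    """
    response_f = np.sum(np.conj(filters_f) * sample_f, axis=0)
    return np.real(np.fft.ifft2(response_f))

def update_filters(filters_f: np.ndarray, new_filters_f: np.ndarray,
                   lr: float = 0.009) -> np.ndarray:
    """Linear model update of Eq. (11) with learning rate lr."""
    return (1.0 - lr) * filters_f + lr * new_filters_f

# Usage with dummy Fourier-domain data (533 fused channels on a 64x64 grid).
C, H, W = 533, 64, 64
f = np.fft.fft2(np.random.randn(C, H, W))
z = np.fft.fft2(np.random.randn(C, H, W))
response = confidence_map(f, z)
row, col = np.unravel_index(np.argmax(response), response.shape)  # target shift
f = update_filters(f, np.fft.fft2(np.random.randn(C, H, W)))
```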

5 Experimental Results

All implementation details related to the exploited models (including the ResNet, ResNeXt, SE-ResNet, SE-ResNeXt, DenseNet, and FCN-8s models) are the same as in their original papers ResNet; ResNeXt; SE-Res-CVPR; SE-Res-PAMI; DenseNet; FCN-8s. The ResNet-based models have been trained on the ImageNet dataset ImageNet, while the FCN-8s model has been trained on the PASCAL VOC and MSCOCO datasets PascalVOC; MSCOCO. Similar to the exploitation of FENs in state-of-the-art visual tracking methods DeepSRDCF; DeepMotionFeatures; STRCF; CCOT; ECO; ETDL; FCNT; HCFT; HCFTs; DNT; HDT; LCTdeep; CF-CNN, these CNNs are not fine-tuned or trained during the evaluations on the OTB-2013, OTB-2015, TC-128, and VOT-2018 visual tracking datasets. In fact, the FENs transfer the power of deep features to the visual tracking methods. According to the visual tracking benchmarks, all visual trackers are initialized with only a given bounding box in the first frame of each video sequence. Then, these visual trackers learn and track the target simultaneously. To investigate the effectiveness of the proposed method and ensure fair evaluations, the experimental configuration of the baseline ECO tracker is used. The search region, learning rate, number of CG iterations, and number of training samples are set to a fixed multiple of the target box (as in the baseline ECO tracker), 0.009, 5, and 50, respectively. The desirable response map is a 2D Gaussian function whose standard deviation is set to 0.083. Given a target with size P \times Q and spatial location (m, n), the penalty function is defined as

w(m, n) = \mu + \lambda \Big( \big( \tfrac{m}{P} \big)^{2} + \big( \tfrac{n}{Q} \big)^{2} \Big)   (12)

which is a quadratic function with a minimum at its center (\mu and \lambda are the regularization constants of the baseline tracker). Moreover, the numbers of deep feature maps extracted from the DenseNet and FCN-8s models are 512 (the L3 maps of DenseNet-201 in Table 2) and 21, respectively. To estimate the target scale, the proposed method utilizes the multi-scale search strategy DSST with 17 scales and a relative scale factor of 1.02. Note that this multi-scale search strategy exploits HOG features to accelerate visual tracking (as in DSST, C-COT, and ECO). Like the baseline ECO tracker, the proposed method applies the same configuration to all videos in the different visual tracking datasets. However, this work exploits only deep features to model the target appearance, to highlight the effectiveness of the proposed method. This section includes a comparison of the proposed method with state-of-the-art visual tracking methods and an ablation study, which evaluates the effectiveness of the proposed components on visual tracking performance.
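A small sketch of these two ingredients, the quadratic spatial penalty of Eq. (12) and the 17-scale search grid with factor 1.02, is given below; the constants mu and eta are illustrative placeholders, not values reported in the paper.

```python
import numpy as np

def scale_factors(num_scales: int = 17, step: float = 1.02) -> np.ndarray:
    """Relative scale factors of the DSST-style multi-scale search."""
    exponents = np.arange(num_scales) - (num_scales - 1) / 2.0
    return step ** exponents                    # 1.02**-8 ... 1.02**8

def spatial_penalty(height: int, width: int, target_h: int, target_w: int,
                    mu: float = 1e-3, eta: float = 3.0) -> np.ndarray:
    """Quadratic penalty with its minimum at the window centre (cf. Eq. (12))."""
    m = np.arange(height) - (height - 1) / 2.0
    n = np.arange(width) - (width - 1) / 2.0
    return mu + eta * ((m[:, None] / target_h) ** 2 + (n[None, :] / target_w) ** 2)

print(scale_factors().round(3))                 # 17 factors around 1.0
print(spatial_penalty(64, 64, 32, 24).shape)    # (64, 64)
```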

5.1 Performance Comparison

To evaluate the performance of the proposed method, it is extensively and quantitatively compared with deep and handcrafted feature-based state-of-the-art visual trackers (for which benchmark results on these datasets are publicly available) in the one-pass evaluation (OPE) on four well-known visual tracking datasets: OTB-2013 OTB2013, OTB-2015 OTB2015, TC-128 TC128, and VOT-2018 VOT-2018. The compared deep visual trackers are ECO-CNN ECO, DeepSRDCF DeepSRDCF, UCT UCT, DCFNet DCFNet, DCFNet-2.0 DCFNet, LCTdeep LCTdeep, HCFT HCFT, HCFTs HCFTs, SiamFC-3s SiamFC, CNN-SVM CNN-SVM, PTAV PTAV-ICCV; PTAV, CCOT CCOT, DSiam DSiam, CFNet CFNet, DeepCSRDCF DeepCSRDCF, MCPF MCPF, TRACA TRACA, DeepSTRCF STRCF, SiamRPN SiamRPN, SA-Siam SA-Siam, LSART LSART, DRT DRT, DAT DAT, CFCF CFCF, CRPN CRPN, GCT GCT, SiamDW-SiamFC SiamDW, and SiamDW-SiamRPN SiamDW; the compared handcrafted feature-based visual trackers are Struck StruckTPAMI, BACF BACF, DSST DSST, MUSTer MUSTer, SRDCF SRDCF, SRDCFdecon SRDCFdecon, Staple Staple, LCT LCTdeep, KCF KCF-ECCV; KCF, SAMF SAMF, MEEM MEEM, and TLD TLD. These methods include visual trackers that exploit handcrafted features, deep features (via FENs or EENs), or both. The proposed method is implemented on an Intel i7-6800K 3.40 GHz CPU with 64 GB RAM and an NVIDIA GeForce GTX 1080 GPU. The overall and attribute-based performance comparisons of the proposed method with different visual trackers on the four visual tracking datasets, in terms of various evaluation metrics, are shown in Fig. 3 to Fig. 5.

Figure 3: Overall precision and success evaluations on the OTB-2013, OTB-2015, and TC-128 datasets.
Figure 4: Attribute-based evaluations of visual trackers in terms of average success rate on the OTB-2015 dataset.
Figure 5: Overall and attribute-based performance comparison of visual tracking methods on VOT-2018 dataset.

Based on the results of the attribute-based comparison on the OTB-2015 dataset, the proposed method improves the average success rate of the ECO tracker by up to 5.1%, 1.8%, 1.3%, 1.1%, 4.3%, 2.4%, 2.8%, and 6.5% for the IV, OPR, SV, OCC, DEF, FM, IPR, and BC attributes, respectively. However, the ECO tracker achieves better visual tracking performance for the MB, OV, and LR attributes. The proposed method outperforms the other state-of-the-art visual trackers in most challenging attributes, and the results indicate that it improves the average precision rate by up to 3.8%, 1.1%, and 1.6% and the average success rate by up to 5.3%, 2.6%, and 3.4% compared to the ECO-CNN tracker on the OTB-2013, OTB-2015, and TC-128 datasets, respectively. Thus, in addition to providing better visual tracking performance than the FEN-based visual trackers, the proposed method outperforms the EEN-based visual tracking methods. For example, the proposed method gains up to 3.7%, 7%, 7.2%, 11.6%, and 15.2% in average precision rate on the OTB-2015 dataset compared to the UCT, DCFNet-2.0, PTAV, SiamFC-3s, and DCFNet trackers, respectively. Furthermore, its success rate improves by up to 4.3%, 4.8%, 16.4%, 8.2%, and 13.9%, respectively, compared to the aforementioned EEN-based trackers. Because of the higher performance of deep visual trackers compared to handcrafted-based ones, the evaluations on the VOT-2018 dataset compare the proposed method only with state-of-the-art deep visual tracking methods. The results in Fig. 5 demonstrate the proficiency of the proposed method for visual tracking. According to these results, the proposed method not only considerably improves the performance of the ECO-CNN framework but also achieves competitive results compared to the EEN-based methods, which have been trained on large-scale datasets.

The qualitative comparison between the proposed method and the top-5 visual trackers (i.e., UCT, DCFNet-2.0, ECO-CNN, HCFTs, and DeepSRDCF) is shown in Fig. 6, which clearly demonstrates the remarkable robustness of the proposed method in real-world scenarios. It is noteworthy that the proposed method provides an acceptable average speed (about five frames per second (FPS)) compared to most FEN-based visual trackers, which run at less than one FPS MDNet; CCOT.

Figure 6: Qualitative evaluation results of visual tracking methods on Bird1, Ironman, Soccer, Human9, Basketball, and Couple video sequences from top to bottom row, respectively.

To robustly model visual targets, the proposed method uses two main components. First, it exploits the weighted fused deep representations in the learning process of convolution filters in the continuous domain. It extracts deep feature maps from two FENs (i.e., DenseNet-201 and FCN-8s) that are trained on different large-scale datasets (i.e., ImageNet, PASCAL VOC, and MSCOCO) for the different tasks of object recognition and semantic segmentation. These rich representations provide complementary information about an unknown visual target. Hence, the fused deep representation leads to a more robust target model, which is more resistant to challenging attributes such as heavy illumination variation, occlusion, deformation, background clutter, and fast motion. Moreover, the employed FENs are efficient and integrate into the proposed DCF-based tracker without complications. Second, the proposed method uses the feature maps extracted from the FCN-8s network to semantically enhance the weighting of target representations. This component helps the proposed method model the visual target more accurately and prevents contamination of the target appearance by the background or visual distractors. Hence, the proposed method demonstrates more robustness against significant variations of a visual target, including deformation and background clutter.

5.2 Performance Comparison: Ablation Study

To understand the advantages and disadvantages of each proposed component, the visual tracking performance of two ablated trackers is investigated and compared with the proposed method. The ablated trackers are the proposed method without the fusion process (i.e., exploiting only the DenseNet-201 representations and the semantic windowing process) and the proposed method without the semantic windowing process (i.e., only fusing the deep multitasking representations from the DenseNet-201 and FCN-8s networks to ensure a robust target appearance).

Figure 7: Overall and attribute-based ablation study of proposed visual-tracking method on the OTB-2013 dataset.

The results of the ablation study on the OTB-2013 dataset are shown in Fig. 7. Based on the precision and success plots of the overall comparison, it can be concluded that most of the success of the proposed method is related to the effective fusion of desirable representations for visual tracking. Moreover, the results clearly demonstrate that the proposed windowing process enhances the proposed method, particularly in terms of the precision metric. To identify the reason for the performance degradation observed for the MB, OV, and LR attributes, an attribute-based evaluation was performed. As seen in Fig. 7, the fusion process is the main contributor to the reduced visual tracking performance when these attributes occur. These results confirm that the semantic windowing process can effectively help visual trackers learn a more discriminative target appearance and prevent drift problems.

6 Conclusion

The use of twelve state-of-the-art ResNet-based FENs for visual tracking purposes was evaluated. A visual tracking method was proposed that not only fuses deep features extracted from the DenseNet-201 and FCN-8s networks but also semantically weights target representations in the learning process of continuous convolution filters. Moreover, the best ResNet-based FEN in the DCF-based framework was determined (i.e., DenseNet-201), and the effectiveness of using the DenseNet-201 network on the tracking performance of another DCF-based tracker was explored. Finally, comprehensive experimental results on the OTB-2013, OTB-2015, TC-128, and VOT-2018 datasets demonstrated that the proposed method outperforms state-of-the-art visual tracking methods in terms of various evaluation metrics.

Acknowledgement: This work was partly supported by a grant (No. 96013046) from Iran National Science Foundation (INSF).

Compliance with Ethical Standards (Conflict of Interest): All authors declare that they have no conflict of interest.

References