Beyond Background-Aware Correlation Filters: Adaptive Context Modeling by Hand-Crafted and Deep RGB Features for Visual Tracking

04/06/2020, by Seyed Mojtaba Marvasti-Zadeh, et al.

In recent years, background-aware correlation filters have attracted considerable research interest in visual target tracking. However, these methods cannot suitably model the target appearance due to their reliance on hand-crafted features. On the other hand, recent deep learning-based visual tracking methods provide competitive performance at the cost of extensive computations. In this paper, an adaptive background-aware correlation filter-based tracker is proposed that effectively models the target appearance by using either histogram of oriented gradients (HOG) or convolutional neural network (CNN) feature maps. The proposed method exploits the fast 2D non-maximum suppression (NMS) algorithm and semantic information comparison to detect challenging situations. When the HOG-based response map is not reliable, or the context region has a low semantic similarity with prior regions, the proposed method constructs the CNN context model to improve the target region estimation. Furthermore, the rejection option allows the proposed method to update the CNN context model only on valid regions. Comprehensive experimental results demonstrate that the proposed adaptive method clearly outperforms state-of-the-art methods in terms of the accuracy and robustness of visual target tracking on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets.

1 Introduction

Visual object tracking is one of the most challenging practical tasks in computer vision, with widespread applications in self-driving cars, autonomous robots, augmented reality, surveillance, and so forth. Although tracking methods have made significant progress in constrained environments or with prior knowledge about the target, they still face many challenging factors in practical scenarios, including occlusion, deformation, background clutter, camera motion, and out-of-view targets, to name a few. One of the most critical issues of recent visual tracking methods is how to model the target appearance so that it is robust against challenging factors in real scenes. Hence, deep learning-based visual trackers are used that mostly exploit RGB feature maps extracted by CNNs. Although deep learning-based visual trackers obtain satisfactory results, they mostly suffer from poor real-time tracking performance because of their expensive computations on all video frames MDNet; CCOT.

Generally speaking, visual target trackers can be mainly categorized into discriminative VisComp_ImpCF; VisComp_NewTLDCF; VisComp_LongSTC; VisComp_PartSRDCF, generative Generative1; Generative2, and hybrid discriminative-generative STC; ASTCT methods, based on their target appearance modeling. While the generative trackers learn a target model and search for the best matching region in the subsequent frame, the discriminative trackers try to discriminate the target from its background. With the development of powerful discriminative methods in object detection, tracking-by-detection methods have achieved great success in visual target tracking. Discriminative correlation filters (DCF) are the most popular branch of tracking-by-detection, which learn the target model via correlation filters trained on a set of samples. Due to simultaneous visual tracking and filter training via the fast Fourier transform (FFT), correlation filter-based trackers are efficient in terms of both computation and memory.
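As a rough illustration of how such a filter is trained and applied entirely in the Fourier domain, the following sketch implements a single-channel, MOSSE-style correlation filter; it is not the background-aware formulation proposed in this paper, and the regularization value is an arbitrary placeholder.

```python
import numpy as np

def train_filter(patch, response, lam=1e-2):
    # Single-channel correlation filter learned in the Fourier domain
    # (MOSSE-style): a closed-form ridge-regression solution that is
    # element-wise thanks to the circular-correlation assumption.
    P = np.fft.fft2(patch)          # DFT of the training patch
    Y = np.fft.fft2(response)       # DFT of the desired Gaussian response
    # Returns the (conjugated) filter in the frequency domain.
    return (Y * np.conj(P)) / (P * np.conj(P) + lam)

def detect(filt, patch):
    # Correlate the learned filter with a new patch and locate the peak.
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * filt))
    return np.unravel_index(np.argmax(response), response.shape)
```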

On the other hand, hybrid discriminative-generative trackers combine generative and discriminative trackers to efficiently benefit from the background information. Background-aware (or context-aware) visual trackers have been widely used to improve the robustness of the target model. In general, these trackers either use iterative optimization methods or formulate a closed-form solution to learn a context model. However, these trackers either avoid deep RGB features to preserve the computational efficiency of DCFs or fuse hand-crafted and deep RGB features, which dramatically degrades the tracking speed. Moreover, these trackers construct a context model based on all estimated target regions, whose information validity has not been verified.

Motivated by the above problems, an adaptive background-aware DCF-based method is proposed which provides both a validation process for target representations and an adaptive learning procedure for two context models according to the difficulty of target estimation. To the best of our knowledge, this paper presents the first background-aware visual tracker that adaptively exploits hand-crafted and deep RGB features. The main contributions of the proposed method are summarized as follows.

1) Based on the difficulty of target state estimation, the proposed method exploits two context models: one uses low-dimensional hand-crafted features (i.e., the HOG context model), and the other includes robust deep RGB features with high computational complexity (i.e., the CNN context model). Adaptive utilization of these context models helps the proposed tracker not only to exploit the CNN context model in challenging situations but also to attain an acceptable speed compared to deep learning-based visual trackers. Furthermore, the simultaneous exploitation of low-level and high-level information improves the accuracy and robustness of the proposed method in realistic scenarios.

2) To detect challenging situations in which the HOG context model may not be able to estimate the target region accurately, the proposed method utilizes the fast 2D-NMS algorithm 2DNMS and a rejection option, which is set by the semantic similarity comparison of target representations.

3) With the aid of semantic similarity comparisons, the proposed method selects the best target estimation and updates the greedy CNN context model to localize the target when the context model with hand-crafted features is ineffective.

Comprehensive experimental results on five well-known visual tracking datasets demonstrate the effectiveness of the proposed method compared to the state-of-the-art visual tracking methods.

The rest of this paper is organized as follows. Related work is briefly reviewed in Section 2. The proposed adaptive background-aware visual tracking method is explained in Section 3. Section 4 presents the experimental results. Finally, the conclusion is summarized in Section 5.

2 Related Work

In this section, two of the most well-known types of visual tracking methods, namely correlation filter-based and context-aware visual trackers, are briefly reviewed. The minimum output sum of squared error (MOSSE) correlation filter MOSSE has attracted the attention of researchers in the field of detection and tracking. MOSSE is a simple visual tracking strategy that circumvents conventional problems with correlation filters and prevents the visual tracker from responding incorrectly to background information. With the aid of efficient training of a multi-channel detector/filter, the multi-channel correlation filters MultiChannelCF extended correlation filter theory to produce a single-channel response map for a variety of tasks, such as detection and localization. Since then, this extension has been broadly used in other correlation filter-based visual trackers. To avoid iterating over shifted target patches to form a training set, the tracking-by-detection with kernels method KCF-ECCV exploited the well-established theory of circulant matrices to derive closed-form training and detection solutions for different kernels. Using the discrete Fourier transform (DFT), the kernelized correlation filter (KCF) tracker KCF developed an analytic model that can diagonalize a data matrix of thousands of translated target patches for high-speed tracking purposes.

Although correlation filter-based trackers have significant computational efficiency, their detection/tracking performance suffers from unwanted boundary effects caused by the periodic assumption. Hence, correlation filters with limited boundaries CFLimitedBoundaries utilized an augmented objective function that can largely remove the boundary effects. With the aid of the alternating direction method of multipliers (ADMM) ADMM, this tracker iteratively and efficiently optimized the ridge regression objective function. With a simple and powerful convex optimization strategy, the ADMM decomposes the augmented Lagrangian function into extremely efficient sub-problems that are appropriate for large learning problems. Also, the spatially regularized discriminative correlation filters (SRDCF) tracker SRDCF formulated the correlation filters with a spatial regularization component that allows larger image regions to contribute to the online learning process.

On the other hand, the context-aware visual tracking method ContextAwareVisualTracking utilized context information via three main components: auxiliary objects, a collaborative framework, and robust fusion. The context tracker ContextTracker automatically explored supporters and distracters on-the-fly to verify the genuine target and avoid drifting. The spatio-temporal structural context-based tracker (STT) Wen2012 utilized temporal and spatial context models that were constructed by the linear subspace method and the contributors (similar to supporters). To improve robustness against target appearance variations, the fast spatio-temporal context tracking method (STC) STC constructed spatial and spatio-temporal context models with fast learning and detection in the frequency domain. To track generic targets robustly in challenging situations, the adaptive STC method ASTCT utilized adaptive determination of learning parameters, accurate target localization with an alternating feature set, and a modified scale estimation scheme. Also, the context-aware correlation filter tracking ContextAwareCFTracking provides a closed-form solution to boost the performance of correlation filter-based trackers by using context information. Using densely extracted real negative training samples (instead of shifted foreground patches), the background-aware correlation filters (BACF) BACF learn optimized correlation filters using the ADMM and then update the context model with the Sherman-Morrison lemma to enhance the accuracy and robustness of visual tracking.

Recently, some correlation filter-based trackers have been proposed to exploit deep CNN features. With the aid of multiple levels of CNN features, the hierarchical correlation features-based tracker (HCFT) HCFT uses a coarse-to-fine fashion on the obtained CNN response maps to handle challenging scenarios of visual tracking. In addition to the exploitation of hierarchical CNN features, the advanced version of HCFT (i.e., HCFT* or HCFTs) HCFTs uses region proposals and a discriminative classifier to handle tracking failures (or the drift problem). The long-term correlation tracking (LCTdeep) LCTdeep uses a passive-aggressive algorithm, multiple adaptive correlation filters, and the incorporation of hand-crafted and deep CNN features for robust visual tracking. Inspired by SiamFC, which uses fully convolutional Siamese networks for visual tracking, the discriminative correlation filter-based tracker (DCFNet) DCFNet trains a Siamese network with a correlation layer not only to utilize the prior knowledge of convolutional layers and online feature learning, but also to benefit from the efficiency of correlation filters.

Although these visual tracking methods have provided competitive results for generic object tracking, they still suffer from some drawbacks. The conventional context-aware correlation filters usually exploit hand-crafted features (which may have lower representational power than deep CNN features) to preserve the computational efficiency of correlation filter-based methods. Besides, other methods that exploit deep CNN features either alter the architecture of deep neural networks to provide computational efficiency or combine different features to improve the target model representation in all video frames. In contrast to the existing methods, the proposed method adaptively exploits hand-crafted and deep RGB features in two independent target models. With the aid of two context models, context validation, and decision making based on the comparison of semantic similarities, the proposed method provides superior visual tracking performance with an efficient computational cost.

3 Proposed Method

In this section, the proposed adaptive background-aware correlation filters for visual tracking are presented, which effectively improve the BACF tracking performance. The proposed method is roughly decomposed into three main components. First, the adaptive context modeling is explained. Next, the validation process of context information is introduced. Finally, the decision making and update procedure are described. The overview of the proposed method is shown in Fig. 1.

Figure 1: Overview of the proposed adaptive background-aware visual tracking method.

3.1 Adaptive Context Modeling

Although BACF improves the tracking robustness against challenging attributes (e.g., deformation, scaling, and background clutter) by incorporating background information as real negative examples, this method may quickly fail when its response map does not exhibit a distinct peak at the target location. This problem mostly stems from the exclusive exploitation of hand-crafted features, which provide limited representation power for target modeling. Hence, the proposed method exploits the HOG and CNN context models adaptively. While the BACF tracker learns only the HOG context model, the proposed method exploits HOG or deep RGB features to learn two context models independently.

The two multi-channel context models are trained by minimizing the following objective function BACF in the frequency domain

E(h, ĝ) = (1/2) ||ŷ − X̂ĝ||² + (λ/2) ||h||²,   s.t.   ĝ = √T (F Pᵀ ⊗ I_K) h,    (1)

in which the hat operator denotes the DFT such that â = √T F a, where F is the Fourier transform matrix (including orthonormal and complex basis vectors) for mapping to the frequency domain, and the superscript ᵀ denotes the transpose operator on a complex matrix. Moreover, y, X, g, h, λ, and I_K denote the desired correlation response (i.e., a Gaussian function centered at the target location), the context information matrix (including the HOG or deep RGB features extracted from the target and its background region), the auxiliary variable, the trainable multi-channel correlation filters, the regularization parameter, and the identity matrix with K feature channels, respectively. Besides, the variable P indicates a binary crop matrix that returns the target region as the positive sample and the real background regions as the negative ones. Finally, T and ⊗ denote the length of the context region (vectorized image patch) and the Kronecker product, respectively.

To exploit efficient computations, the proposed method utilizes the FFT of weighted feature maps (i.e., the HOG or 3D CNN feature maps, weighted by a Gaussian window) and then constructs the context models by applying the ADMM algorithm to the weighted feature maps. Defining the penalty factor μ (updated with a constant scale factor) and the Lagrangian vector ζ̂, respectively, as in BACF

(2)
(3)

the ADMM algorithm iteratively applies the two closed-form solutions (4) and (5) and the Lagrangian update (6) as in BACF

(4)
(5)
(6)

in which x̂ denotes the DFT of the vectorized context information, and the scalar terms are obtained by multiplying x̂ with the variables indicated by their subscripts. Notice that the dimensionality of the CNN feature maps is significantly higher than that of the hand-crafted feature maps, which considerably reduces the convergence rate of the training process. As a result, the number of ADMM iterations for the CNN context model must be increased compared to the HOG context model, which can reduce the computational efficiency of the proposed method. Hence, the proposed method does not construct the CNN context model until the context information is found to be invalid (as described in Section 3.2).
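To make the training loop concrete, the following sketch shows a heavily simplified, single-channel variant of this ADMM procedure; it drops the crop operator P and uses the splitting ĝ = ĥ, so the closed-form updates, the default parameter values, and the variable names are only structural assumptions rather than the exact solutions in (2)-(6).

```python
import numpy as np

def admm_filter_training(x, y, lam=0.01, mu=1.0, beta=10.0, mu_max=1e4, iters=2):
    """Illustrative ADMM loop for frequency-domain filter learning.

    Simplified to a single channel and to the splitting g_hat = h_hat
    (the BACF crop operator P is omitted), so this is only a structural
    sketch of the alternating updates, not the paper's exact solutions.
    """
    # Orthonormal DFTs so that Parseval's theorem holds without extra scaling.
    x_hat = np.fft.fft2(x, norm="ortho")   # context features
    y_hat = np.fft.fft2(y, norm="ortho")   # desired Gaussian response
    g_hat = np.zeros_like(x_hat)           # auxiliary variable
    h_hat = np.zeros_like(x_hat)           # correlation filter
    zeta_hat = np.zeros_like(x_hat)        # Lagrangian vector

    for _ in range(iters):                 # more iterations for CNN features
        # Closed-form update of the auxiliary variable (analogous to (4)).
        g_hat = (np.conj(x_hat) * y_hat - zeta_hat + mu * h_hat) / \
                (np.abs(x_hat) ** 2 + mu)
        # Closed-form update of the filter (analogous to (5)).
        h_hat = (zeta_hat + mu * g_hat) / (lam + mu)
        # Lagrangian update and penalty schedule (analogous to (6) and (2)).
        zeta_hat = zeta_hat + mu * (g_hat - h_hat)
        mu = min(mu_max, beta * mu)
    return h_hat
```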

3.2 Context Information Validation

So far, there has been no rejection option or validation scheme to evaluate the context model in the BACF tracking method. Therefore, inaccurate estimation of context information contaminates the context model with tracking errors. These accumulated errors can reduce the tracking performance or lead to the drift problem. On the other hand, persistent training of the appearance model using deep RGB feature maps significantly reduces the speed of tracking methods (to less than one frame per second (FPS) MDNet; CCOT). To alleviate these issues, the proposed method extracts both the HOG and deep CNN features in each video frame. However, it does not construct a CNN context model until the HOG context model becomes inefficient or inaccurate. To detect critical situations, the proposed method enables the CNN context model when either the rejection flag is set (nonzero) or the HOG response map is not reliable.

As a common problem of context-aware visual tracking methods, accumulated errors of context region information contaminate the HOG context model, which aggravates the unreliable estimation of the target location. Therefore, the proposed method constructs the CNN context model when the rejection flag is enabled (i.e., set to one) to exploit the representation power of deep features for target localization. Besides, to quickly validate the reliability of the HOG response map (see Fig. 2 (a, b)), the proposed method applies the fast 2D-NMS algorithm to find the considerable local maxima that might affect accurate target localization. Due to its scan order (see Fig. 2 (c)), which differs from that of conventional methods 2DNMS, this fast and straightforward algorithm requires at most two comparisons per pixel to detect the global and local maxima of the HOG response map. Then, the proposed method validates the HOG response map using

(7)

in which the two quantities involved are the number of significant local maxima and a high threshold, respectively. According to the objective function, the response map must have a Gaussian shape with a significant peak centered on the target location when the DCF can model the target in the next frame. Hence, the HOG response map is not reliable if the number of significant peaks is more than one.

Figure 2: Examples of reliable and unreliable response maps, which are evaluated by the 3×3-neighborhood scan order of the 2D-NMS algorithm.
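As a rough illustration of this validation step, the snippet below counts the significant peaks of a response map using a 3×3 non-maximum suppression; the maximum-filter implementation and the relative threshold value are stand-ins for the paper's fast scan-order algorithm and the exact rule in (7).

```python
import numpy as np
from scipy.ndimage import maximum_filter

def response_is_reliable(response, high_thresh=0.5):
    """Check whether a correlation response map has a single dominant peak.

    Peaks are pixels equal to the maximum of their 3x3 neighborhood, and a
    peak is "significant" if it exceeds high_thresh times the global maximum.
    The threshold value is a hypothetical choice, not the paper's setting.
    """
    global_max = response.max()
    # 3x3 non-maximum suppression: keep pixels that are local maxima.
    local_max = (response == maximum_filter(response, size=3))
    significant = local_max & (response >= high_thresh * global_max)
    # Reliable only if the global peak is the single significant peak.
    return significant.sum() <= 1
```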

3.3 Decision Making and Update Scheme

Despite the advantages of deep RGB feature maps, many pre-trained feature maps are often noisy or irrelevant to the visual tracking task. Hence, the proposed method utilizes a joint decision making and update scheme that prevents the contribution of irrelevant deep feature maps to the context modeling process. The proposed method simultaneously exploits the 3D CNN feature maps (i.e., 2D spatial and 1D depth dimensions) and the FC7 feature vector, which are extracted from the pre-trained VGG-19 model VGGNet. The 3D CNN feature maps (including the conv1-2, conv2-2, and conv3-4 layers) are resized and concatenated from different convolutional layers, while an FC7 matrix (FC) is constructed by placing the FC7 feature vectors of valid target regions in the matrix rows as

(8)

where the number of rows equals the number of extracted FC7 feature vectors. To strictly focus on the best context information and effectively update the CNN context model, the proposed method exploits the FC7 feature vector of the context region. In fact, the semantic comparison of the estimated context region with previous valid context regions helps the proposed method to refine itself. This semantic similarity comparison is defined as the correlation score of semantic information between the estimated context region and the mean of prior target representations, and it is applied as follows.

First, the proposed method computes the correlation of the FC7 feature vector of the estimated context region with the mean of the valid FC7 feature vectors from previous frames (abbreviated as FCM); in (9), this mean is obtained with the aid of a column vector of ones. Hence, the FC7 and rejection flags are updated via the comparison of the semantic similarity score with two constant high and low thresholds as follows.

(9)
(10)
(11)
(12)

Thus, the proposed method identifies the valid context regions when the rejection and FC7 flags are zero and one, respectively. In this case, it not only updates the FC7 matrix by adding the FC7 feature vector to its rows but also updates the 3D CNN feature maps using a learning parameter as follows.

(13)
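The following sketch mimics this validation and update logic; the cosine-correlation score, the threshold values, and the learning rate are hypothetical placeholders for the exact rules in (9)-(13).

```python
import numpy as np

def update_semantic_state(fc7_region, fc7_matrix, cnn_maps, new_maps,
                          high_thresh=0.7, low_thresh=0.4, lr=0.01):
    """Sketch of the flag update and CNN feature-map update.

    The similarity score, thresholds, and learning rate are assumptions,
    not the paper's exact formulation.
    """
    # Mean of the valid FC7 vectors collected so far (the FCM).
    fcm = fc7_matrix.mean(axis=0)
    # Normalized correlation between the new region and the FCM.
    score = float(np.dot(fc7_region, fcm) /
                  (np.linalg.norm(fc7_region) * np.linalg.norm(fcm) + 1e-12))

    fc7_flag = 1 if score >= high_thresh else 0        # region is semantically valid
    rejection_flag = 1 if score <= low_thresh else 0   # region is rejected

    if rejection_flag == 0 and fc7_flag == 1:
        # Keep the new FC7 vector and blend the 3D CNN feature maps (cf. (13)).
        fc7_matrix = np.vstack([fc7_matrix, fc7_region])
        cnn_maps = (1.0 - lr) * cnn_maps + lr * new_maps
    return fc7_flag, rejection_flag, fc7_matrix, cnn_maps
```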

Second, the proposed method utilizes the semantic comparison when a challenging situation has occurred and the CNN response map is required to participate. When the fast 2D-NMS algorithm identifies more than one significant peak in the HOG response map, or the rejection flag is equal to one, a critical situation is detected and the CNN context model is enabled. In this case, the proposed method estimates the location and scale of the context region via the construction of the CNN context model (using the ADMM algorithm) and the extraction of multi-scale CNN feature maps. Then, the FC7 feature vectors of the two estimated context regions (obtained from the HOG and CNN response maps) are compared with the FCM, and the best location and scale of the target are chosen as

(14)
(15)
(16)

in which the involved terms denote the correlation operator and the locations and scales estimated using the CNN and HOG context models, respectively.
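A minimal sketch of this decision rule is given below; the helper names and the cosine-correlation scoring are assumptions standing in for (14)-(16).

```python
import numpy as np

def choose_estimate(fc7_hog, fc7_cnn, fcm, est_hog, est_cnn):
    """Pick the final target state between the HOG and CNN estimates.

    est_hog and est_cnn are (location, scale) tuples; the scoring function
    is an illustrative assumption, not the paper's exact rule.
    """
    def corr(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Keep the candidate whose FC7 vector is more similar to the history
    # of valid target representations (the FCM).
    return est_cnn if corr(fc7_cnn, fcm) >= corr(fc7_hog, fcm) else est_hog
```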

4 Experimental Results

In the proposed method, most of the implementation details of the HOG context model are the same as those of the BACF tracker. To efficiently estimate the location and scale of the target, the proposed method iteratively compares the estimate obtained at each step of the Newton method with the previous one, and terminates the Newton method when the difference between two successive localizations falls below a small threshold. Moreover, the high and low thresholds are set experimentally. To appropriately train the CNN context model, the number of iterations of the ADMM algorithm is increased accordingly. Furthermore, the FC7 and rejection flags are initialized to one and zero, respectively. The proposed method is implemented on an Intel Core i7 CPU. Except for the modified MatConvNet toolbox MatConvNet, whose computations were run on an NVIDIA GeForce GTX GPU, all other parts of the proposed method are implemented on the CPU. Nevertheless, the proposed method has an acceptable average tracking speed (measured on the OTB-2015 dataset), which is considerably higher than that of the state-of-the-art deep learning-based visual trackers.

The following subsections not only provide a comprehensive evaluation of the proposed method compared with state-of-the-art methods on five large, well-known visual tracking datasets, namely OTB-50 OTB2013, OTB-100 OTB2015, TC-128 TC128, UAV-123 UAV123, and VOT-2015 VOT-2015, but also evaluate ablated versions of the proposed tracker to investigate the effectiveness of the proposed components for visual tracking.

4.1 Performance Comparison: State-of-the-Art Methods

The proposed method is extensively compared with state-of-the-art visual trackers on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets, whose benchmark results are publicly available. These visual trackers include KCF KCF, SRDCF SRDCF, HCFT HCFT, HCFTs HCFTs, LCT LCTdeep, LCTdeep LCTdeep, SiamFC-3s SiamFC, DCFNet DCFNet, DCFNet-2.0 DCFNet, Staple-CA ContextAwareCFTracking, SAMF-CA ContextAwareCFTracking, BACF BACF, DSST DSST, Struck StruckTPAMI, MUSTer MUSTer, Staple Staple, MDNet MDNet, DeepSRDCF DeepSRDCF, SODLT VOT-2015, C-COT CCOT, TADT TADT, SA-Siam SA-Siam, Siam-MCF Siam-MCF, SAMF SAMF, MEEM MEEM, ACT ACT, TGPR TGPR, ASMS ASMS, OAB VOT-2015, SCBT SCBT, EBT EBT, SumShift VOT-2015, MIL MIL, FCT VOT-2015, DAT DAT, and sKCF VOT-2015. The OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 datasets have 50, 100, 128, 123, and 60 challenging video sequences, respectively. Also, they include challenging attributes, namely illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), partial occlusion (POC), full occlusion (FOC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out-of-view (OV), background clutter (BC), aspect ratio change (ARC), viewpoint change (VC), camera motion (CM), similar object (SOB), and low resolution (LR). Furthermore, the lengths of the video sequences in these datasets vary widely.

The visual tracking methods are quantitatively evaluated on the OTB-50, OTB-100, TC-128, and UAV-123 datasets with two widely used tracking evaluation metrics, namely the precision and success metrics OTB2015. While the precision metric measures the average Euclidean distance between the ground-truth and estimated target center locations, the success metric is given by the intersection over union (IoU) of the ground-truth and estimated bounding boxes. According to these definitions, the precision and success plots illustrate (and also rank) the percentage of video frames for which the Euclidean distance and the IoU are within a given threshold (e.g., 20 pixels and 0.5, respectively). Also, the two well-known performance metrics of accuracy and robustness (under the visual Tracking eXchange (TraX) protocol TraX) are used to rank visual tracking methods on the VOT-2015 dataset. The accuracy metric evaluates how well the estimated bounding box fits the ground-truth one, while the robustness metric measures the failures of a method (i.e., zero overlap of the estimated region with the ground-truth one) during tracking. The VOT protocol not only re-initializes the trackers five frames after a failure but also ignores the first ten frames after re-initialization to reduce the bias effect on the accuracy and robustness metrics.
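For reference, the following snippet computes these two OTB-style metrics for a single sequence; the 20-pixel and 0.5-IoU cut-offs are the commonly used defaults, and the bounding-box format is assumed to be (x, y, w, h).

```python
import numpy as np

def precision_success(pred_boxes, gt_boxes, dist_thresh=20.0, iou_thresh=0.5):
    """Compute OTB-style precision and success rates for one sequence."""
    pred, gt = np.asarray(pred_boxes, float), np.asarray(gt_boxes, float)

    # Precision: fraction of frames whose center distance is within the threshold.
    pred_c = pred[:, :2] + pred[:, 2:] / 2.0
    gt_c = gt[:, :2] + gt[:, 2:] / 2.0
    dist = np.linalg.norm(pred_c - gt_c, axis=1)
    precision = float(np.mean(dist <= dist_thresh))

    # Success: fraction of frames whose IoU with the ground truth exceeds the threshold.
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    iou = inter / np.maximum(union, 1e-12)
    success = float(np.mean(iou >= iou_thresh))
    return precision, success
```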

Figure 3: Precision and success rate comparisons of visual tracking methods on OTB-50, OTB-100, TC-128, and UAV-123 datasets.

Fig. 3 shows the precision and success plots of the visual tracking methods on the OTB-50, OTB-100, TC-128, and UAV-123 datasets. These results indicate that, on the OTB-50 dataset, the proposed method improves the average precision rate with the smallest margin over HCFTs and the largest margin over Struck, and improves the average success rate with the smallest margin over DCFNet-2.0 and the largest margin over Struck. The experimental results on the OTB-100 dataset show that the proposed method enhances the average precision and success rates with the smallest margins over HCFTs and BACF/DCFNet-2.0, respectively, and the largest margins over Struck. Moreover, the obtained results show that, on the TC-128 dataset, the proposed method increases the average precision rate with the smallest margin over Staple-CA and the largest margin over Struck, and the average success rate with the smallest margin over Staple-CA and the largest margin over DSST. Finally, on the UAV-123 dataset, the proposed method improves the average precision and success rates with the smallest margins over SRDCF and BACF, respectively, and the largest margins over KCF. The attribute-based evaluations (in terms of average success rate) of the visual tracking methods on OTB-100 are summarized in Table 1. According to this table, the proposed method achieves the best visual tracking performance on most attributes compared to BACF and the other state-of-the-art visual tracking methods. Based on the results listed in Table 1, the proposed method improves the BACF performance for the IV, OPR, SV, OCC, DEF, MB, FM, IPR, OV, BC, and LR attributes. Moreover, the qualitative evaluation results of the proposed method compared to the top-5 tracking methods, including Staple-CA, SiamFC-3s, HCFTs, BACF, and DCFNet-2.0, are shown in Fig. 4.

Top trackers       IV (38)  OPR (63)  SV (65)  OCC (49)  DEF (44)  MB (31)  FM (40)  IPR (53)  OV (14)  BC (31)  LR (10)
Proposed Method    0.863    0.793     0.773    0.804     0.774     0.809    0.799    0.745     0.738    0.841    0.720
HCFTs              0.696    0.612     0.544    0.604     0.624     0.717    0.654    0.652     0.569    0.758    0.526
HCFT               0.602    0.590     0.489    0.570     0.561     0.677    0.644    0.634     0.568    0.680    0.402
DCFNet-2.0         0.737    0.747     0.722    0.747     0.651     0.734    0.758    0.747     0.660    0.738    0.637
SiamFC-3s          0.716    0.708     0.697    0.688     0.638     0.704    0.719    0.685     0.627    0.663    0.750
BACF               0.782    0.708     0.706    0.695     0.696     0.752    0.765    0.701     0.692    0.766    0.637
Staple-CA          0.738    0.672     0.649    0.685     0.696     0.718    0.717    0.676     0.576    0.724    0.486
Staple             0.729    0.659     0.622    0.671     0.671     0.677    0.665    0.661     0.570    0.709    0.483
SRDCF              0.742    0.663     0.674    0.680     0.667     0.747    0.722    0.657     0.558    0.701    0.604
Table 1: Attribute-based evaluations (average success rate) on the OTB-100 dataset. The proposed method achieves the best result for every attribute except IPR (DCFNet-2.0) and LR (SiamFC-3s).
Figure 4: Qualitative results of visual tracking methods on Ironman, DragonBaby, and Skating2-1 video sequences from top to bottom row, respectively.

Finally, Fig. 5 (left) shows the comparison of the proposed method with the state-of-the-art visual tracking methods in terms of accuracy and robustness rank (AR rank) on the VOT-2015 dataset. The proposed method enhances the BACF performance in terms of both the accuracy and robustness metrics. According to these results, the proposed method not only considerably improves the tracking performance of the BACF but also almost reaches the accuracy of MDNet (which runs at about one FPS). Furthermore, the proposed method significantly improves the robustness of the BACF. The robustness result of the proposed method originates from the moderate robustness of BACF and of the transferred deep features when faced with some challenging attributes such as LR, MB, and CM. The proposed method outperforms the other context-aware visual trackers (e.g., ContextAwareCFTracking; BACF), which have a competitive performance against deep learning-based tracking methods. Although the context information reduces the unwanted boundary effects of correlation filters and achieves a higher tracking performance, it may not handle realistic scenarios in which several challenging attributes occur simultaneously. However, the proposed method tries to detect these situations and robustly estimate the context region. It also provides better results compared to the LCT, LCTdeep, and MUSTer trackers, which exploit short- and long-term memory storage. Furthermore, the proposed method provides better quantitative and qualitative results compared to the recent deep feature-based visual trackers that even exploit an end-to-end framework (e.g., SiamFC-3s, DCFNet, and DCFNet-2.0).

Figure 5: (left): AR ranking of visual tracking methods on VOT-2015 dataset. (right): Ablation study of proposed method on OTB-50 dataset.

The desirable performance of the proposed tracking method is due to the following reasons. First, it simultaneously uses the CNN feature maps to achieve a better target representation. The CNN context model is not exploited until the HOG context model cannot accurately estimate the target location. This adaptive exploitation leads to a remarkable performance along with an acceptable tracking speed. Second, the proposed method learns two context models, separately. Although the estimated target regions are validated to update the rejection and FC7 flags, the HOG model is updated in all video frames. On the other hand, the update process of CNN feature maps (which are involved in the construction of the CNN model) is dependent on the status of flags. In fact, the HOG context model represents a more flexible target model, whereas the CNN context model constructs a strict valid target model. This duality helps the proposed method to avoid the accumulated tracking errors (i.e., the drift problem). Third, the proposed method can enjoy the rejection option when the target is in extremely challenging situations (such as occlusion or out-of-view). This option prevents the update process of the CNN context model and allows the proposed method to re-detect the target (without any detection module) if it is in the search area. Finally, it exploits the semantic information along with low-level feature maps which are beneficial to capture intra-class variations. These semantic comparisons can modify the estimated location of the target when the HOG context model tends to drift towards the background.


4.2 Performance Comparison: Ablation Study

To investigate the effectiveness of the proposed components, four ablated trackers are evaluated on the OTB-50 dataset (see Fig. 5 (right)): the proposed method i) without the rejection flag, ii) without the semantic similarity comparison, iii) exploiting the FC6 feature vector (instead of FC7), and iv) exploiting the first convolutional layer (shallow features) of the lightweight MobileNetV2 (instead of HOG features) with different numbers of ADMM iterations. According to these results, replacing the HOG features with shallow features extracted from MobileNetV2 causes the largest degradation of the proposed method. As previously mentioned in MDNet, pre-trained CNNs have not been trained for video processing applications, and these networks may be deficient when exploited for transfer learning. Moreover, an inappropriate number of ADMM iterations may cause under-fitting or over-fitting of the target model. Next, removing the semantic similarity comparison, removing the rejection option, and replacing the FC7 with the FC6 feature vector have the next most destructive effects on the proposed method, respectively.

5 Conclusion

An adaptive context-aware visual tracking method that exploits both hand-crafted and deep RGB feature maps for context modeling was proposed. With the aid of the fast 2D non-maximum suppression algorithm and deep semantic information, the proposed method detects challenging situations throughout the video frames. When a challenging situation is detected, the proposed method constructs a CNN context model (based on the valid deep RGB feature maps) to increase the accuracy of target localization by using both the HOG and CNN context models. Otherwise, the HOG context model is used to increase the visual tracking speed. Based on the proposed context information validation, an update scheme allows the proposed method to either update the CNN feature maps or discard contaminated ones. The extensive experimental results on the OTB-50, OTB-100, TC-128, UAV-123, and VOT-2015 visual tracking datasets demonstrated that the proposed method not only improves the visual tracking performance but also achieves an acceptable tracking speed compared to the state-of-the-art tracking methods.

Acknowledgement: This work was partly supported by a grant (No. ) from Iran National Science Foundation (INSF).

Compliance with Ethical Standards: All authors declare that they have no conflict of interest.

References