Learning Reinforced Attentional Representation for End-to-End Visual Tracking

08/27/2019 ∙ by Peng Gao, et al.

Despite the tremendous advances made by numerous tracking approaches in the last decade, how to achieve high-performance visual tracking is still an open problem. In this paper, we propose an end-to-end network model to learn reinforced attentional representations for accurate target object discrimination and localization. We utilize a novel hierarchical attention module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention and effectively emphasize informative visual patterns. Moreover, we incorporate a contextual attentional correlation filter into the backbone network so that our model can be trained in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics, but also updates the correlation filters online without fine-tuning the backbone network, enabling adaptation to target appearance variations. Extensive experiments conducted on several popular benchmark datasets demonstrate the effectiveness of our proposed approach while maintaining computational efficiency.

1 Introduction

Visual tracking is an essential and active research problem in the field of computer vision, with various real-world applications such as robotic services, smart surveillance systems, autonomous driving and human-computer interaction. It refers to the automatic estimation of the trajectory of an arbitrary target object, usually specified by a bounding box in the first frame, as it moves through subsequent video frames. Although considerable progress has been made in the last few decades Yang et al. (2011); Smeulders et al. (2014); Li et al. (2018b), it is still commonly recognized as a very challenging task, partially due to numerous complicated real-world scenarios such as scale variations, fast motion, occlusions, deformations, and so forth.

One of the most successful tracking frameworks is the discriminative correlation filter (DCF) Bolme et al. (2010); Henriques et al. (2015); Danelljan et al. (2017b); Gao et al. (2018a). With the benefits of the fast Fourier transform, most DCF-based approaches can employ large numbers of cyclically shifted samples for training and achieve high accuracy while running at impressive frame rates. Recent years have witnessed significant advances of convolutional neural networks (CNNs) on many computer vision tasks, such as image classification and object detection Liu et al. (2016); Ren et al. (2015). This is because a CNN can gradually learn from finer-level geometries to coarse-level semantics of target objects through the transformation and enlargement of receptive fields across convolutional layers LeCun et al. (2015); Schmidhuber (2015). Encouraged by these great successes, some DCF-based trackers resort to pre-trained CNN models Krizhevsky et al. (2012); Simonyan and Zisserman (2015); He et al. (2016) instead of conventional handcrafted features Felzenszwalb et al. (2010); Van De Weijer et al. (2009) for target object representation and demonstrate favorable performance Gao et al. (2018b); Danelljan et al. (2017a). Recently, tracking by Siamese matching networks Tao et al. (2016); Bertinetto et al. (2016); Valmadre et al. (2017); Li et al. (2018a) has achieved record-breaking performance and efficiency. In each frame, these trackers learn a similarity metric between the target template and candidate patches of the current search frame in an end-to-end fashion.

Figure 1: Visualization of deep feature maps from different convolutional layers of different CNN architectures, including AlexNet Krizhevsky et al. (2012) (top row), VGG-19 Simonyan and Zisserman (2015) (middle row) and ResNet-50 He et al. (2016) (bottom row). It is evident that low-level geometries from shallow layers, such as ‘conv1’ in AlexNet, ‘conv1-2’ in VGG-19 and ‘conv1’ in ResNet-50, retain fine-grained target-specific details, while high-level semantics from deep layers, such as ‘conv5’ in AlexNet, ‘conv5-4’ in VGG-19 and ‘conv5-3c’ in ResNet-50, contain coarse category-specific information. Compared with AlexNet, the architecture of ResNet-50 is deeper and more sophisticated. The example frame is from the sequence dinosaur.
Figure 2: Visualization of feature channels in the last layer of the ‘conv3’, ‘conv4’ and ‘conv5’ stages in ResNet-50 He et al. (2016). Example frames are randomly picked from the Bolt, Lemming and Liquor sequences (shown from top to bottom on the left). We show the features extracted from 20 random channels of each stage, from top to bottom on the right of the corresponding example frame. It is clear that only a few feature channels and regions contribute to the target object representation, while the others mainly serve as redundant information. It is noteworthy that, for each example frame, the channels shown for the corresponding stage are the same.

Despite the above significant progress, existing CNN-based tracking approaches are still limited by several obstacles. Most methods directly transfer off-the-shelf CNN models pre-trained on large-scale image classification datasets Russakovsky et al. (2015); Lin et al. (2014) to obtain generic representations of target objects Krizhevsky et al. (2012); Simonyan and Zisserman (2015). It is well acknowledged that different convolutional layers of CNNs, as shown in Fig. 1, encode different types of features Zeiler and Fergus (2014). The features taken from higher convolutional layers retain rich coarse-level semantics but are ineffective for accurate localization or scale estimation of the target object. Conversely, the features extracted from lower convolutional layers retain finer-level geometries that capture target-specific spatial details and facilitate accurately locating the target object, but are insufficient to distinguish objects from non-objects with similar characteristics. Some prior works Qi et al. (2016); Ma et al. (2015) have tried to integrate advantages from multiple convolutional layers. Unfortunately, their performance still shows a notable gap to state-of-the-art trackers Bertinetto et al. (2016); Li et al. (2018a) that only employ the outputs of the last layers to represent target objects. Directly combining features from multiple layers is thus not sufficient for representing target objects and yields little performance gain under challenging scenarios.

In fact, in deep feature maps, each feature channel corresponds to a particular type of visual pattern, whereas spatial regions represent object-specific details Wang et al. (2018a); Gao et al. (2019b). We observe that deep features directly extracted from pre-trained CNN models treat every pixel equally along the channel-wise and spatial axes. In practice, probably only part of the features are closely related to the task of separating a specific target object from the surrounding background, while the others mainly serve as redundant information, which may cause model drift and even lead to failures during tracking Ma et al. (2018); Gao et al. (2019a), as illustrated in Fig. 2. Recently, visual attention mechanisms have made remarkable progress and achieved surprisingly good performance in many computer vision tasks Hu et al. (2018); Woo et al. (2018), owing to their ability to model contextual information. It is therefore necessary to highlight useful features and suppress irrelevant redundant information using attention mechanisms for visual tracking. Unfortunately, some previous trackers Lukežič et al. (2017); Wang et al. (2018a); Zhu et al. (2018b) only take advantage of intra-frame attention to learn which semantic attributes, i.e., which visual patterns along the channel axis, to select, but do not consider where to focus along the spatial axis, and thus achieve inferior tracking results. Moreover, most existing CNN-based trackers implement their models with shallow networks such as AlexNet Krizhevsky et al. (2012), and therefore cannot take advantage of the more powerful representations of deeper networks like ResNet He et al. (2016).

Figure 3: The framework of the proposed tracking approach. Specifically, our approach contains three main components, i.e., a backbone network for deep feature extraction (detailed in Section 3.1), a hierarchical attention module for informative feature emphasis (detailed in Section 3.2), and a decision module for target object discrimination and localization (detailed in Section 3.3).

Notably, the target object specified for visual tracking could be anything, so pre-trained CNN models may be agnostic about target objects unseen in their training sets. To ensure high-performance visual tracking, most trackers only employ the original deep features taken from the first frame to match candidate patches in the following frames Bertinetto et al. (2016); Tao et al. (2016); Li et al. (2018a). Since the characteristics of the target object are consistent within consecutive frames, there exists a strong temporal relationship between the target object appearance and motion in video data. Using context from historical frames may therefore enhance tracking accuracy and robustness under challenging scenarios such as occlusions and deformations. Recurrent neural networks (RNNs), especially long short-term memory (LSTM) Hochreiter and Schmidhuber (1997), have achieved great success in many natural language processing (NLP) applications because they retain informative temporal cues and forget irrelevant ones through gated memory components, which makes them well suited to exploring inter-frame attention for visual tracking. However, only a limited number of approaches employ such network models in visual tracking Wang et al. (2019); Chen et al. (2019). In fact, most trackers ignore inter-frame attention and can hardly capture the appearance variations of target objects well, which may lead to model drift. On the whole, how to make full use of inter- and intra-frame attention for visual tracking remains largely underexplored.

Figure 4: Visualization of feature and attention maps of the convolutional layer ‘conv3_4c’ in the backbone network, together with the corresponding correlation response map for the example image. From left to right: the example image from the sequence Lemming, the original feature map, the inter-frame attention map, the intra-frame attention map, and the correlation response map generated by the proposed approach.

To address the above issues, in this paper we propose a unified end-to-end reinforced attentional Siamese network model, dubbed RAR, to pursue high-performance visual tracking. The framework of the proposed approach is shown in Fig. 3. As mentioned above, it has already been proven that tracking can benefit from leveraging deep feature hierarchies across multiple convolutional layers Zeiler and Fergus (2014); Ma et al. (2015); Qi et al. (2016). Therefore, we use a carefully modified ResNet-50 as the backbone network and take multi-level deep features from the last three convolutional blocks to enhance the capability of target object representation. We adopt the tracking-by-detection paradigm to trace target objects and reformulate the tracking problem as a sequential inference task. To emphasize informative representations and suppress unnecessary information redundancy, we design a hierarchical attention module for learning multiple types of visual attention, which is composed of an inter-frame attention model and an intra-frame attention model. The inter-frame attention model is built upon convolutional LSTM units, which can fully explore temporal cues of the target object appearance at different convolutional layers across consecutive frames Chen et al. (2019); Yang and Chan (2017). It can be decomposed into sequential blocks, each of which corresponds to a certain time slice. We then design an intra-frame attention model, which consists of two multi-layer perceptrons (MLPs) operating along the channel-wise and spatial axes of the deep feature maps Gao et al. (2019b); Woo et al. (2018). With the guidance of both inter- and intra-frame attention, we obtain much more powerful attentional representations, as illustrated in Fig. 4. It is worth noting that the inter- and intra-frame attention at different convolutional layers are obtained independently. After that, the hierarchical attentional representations are merged together using a refinement strategy to maintain a desirable resolution. In addition, we adopt a DCF to discriminate and locate target objects. Since the background context around the target object has a significant impact on tracking performance, a contextual attentional DCF is employed as the decision module to take global context into account and further alleviate unnecessary disturbance. To allow the whole network model to be trained end to end, the correlation operation is reformulated as a differentiable correlation layer Valmadre et al. (2017); Wang et al. (2018b). Thus, the contextual attentional DCF can be updated online without fine-tuning the network model, which guides the adaptation of the target object appearance model.

We summarize the main contributions of our work as follows:

  1. An end-to-end reinforced attentional Siamese network model is proposed for high-performance visual tracking.

  2. A hierarchical attention module is utilized to leverage both inter- and intra-frame attention at each convolutional layer to effectively highlight informative representations and suppress unnecessary redundancy.

  3. A contextual attentional correlation layer is incorporated into the backbone network, which can take global context into account and further emphasize interesting regions.

  4. Extensive and ablative experiments on four popular benchmark datasets, i.e., OTB-2013 Wu et al. (2013), OTB-2015 Wu et al. (2015), VOT-2016 Kristan et al. (2016) and VOT-2017 Kristan et al. (2017), demonstrate that our proposed tracker outperforms state-of-the-art approaches.

The rest of the paper is organized as follows. Section 2 briefly reviews related works. Section 3 illustrates the proposed tracking approach. Section 4 details experiments and discusses results. Section 5 concludes the paper.

2 Related works

Visual tracking approaches with excellent effectiveness and efficiency are required by many industrial applications. In this section, we give a brief review of tracking-by-detection approaches based on DCF and CNN, which are most related to our work. For other visual tracking methods, please refer to more comprehensive reviews Yang et al. (2011); Smeulders et al. (2014); Li et al. (2018b).

In the past few years, DCF-based tracking approaches Bolme et al. (2010), which train DCFs by exploiting properties of circular correlation and performing operations in the Fourier frequency domain, have played a dominant role in the visual tracking community because of their superior computational efficiency and reasonably good accuracy. Several extensions have since been proposed to considerably improve tracking performance through the use of multi-dimensional features Henriques et al. (2012), nonlinear kernel correlation Henriques et al. (2015), reduced boundary effects Mueller et al. (2017) and robust scale estimation Danelljan et al. (2017b). However, most earlier DCF-based trackers rely on conventional handcrafted features Felzenszwalb et al. (2010); Van De Weijer et al. (2009) and thus suffer from inadequate representation capability.

Recently, with the rapid progress of deep learning techniques, CNN-based trackers have made remarkable progress and become a new trend in visual tracking. Some approaches combine the DCF framework with CNN features for tracking and show outstanding accuracy and high efficiency. As we know, finer-level features, which detail spatial information, play a vital role in accurate localization, while coarse-level features, which characterize semantics, play a pivotal role in robust discrimination. Therefore, a specific feature combination scheme needs to be designed before discrimination. HCF Ma et al. (2015) extracts deep features from hierarchical convolutional layers and merges those features using a fixed weighting scheme. HDT Qi et al. (2016) employs adaptive weights to combine deep features from multiple layers. However, these trackers merely exploit CNNs for feature extraction and then learn filters separately to locate the target object, so their performance may be suboptimal. Some works then try to train a network model to perform both feature extraction and target object localization simultaneously. Both CFNet Valmadre et al. (2017) and EDCF Wang et al. (2018b) unify the DCF as a differentiable correlation layer in a Siamese network model Bertinetto et al. (2016); Tao et al. (2016), and thus make it possible to learn powerful representations end to end. These approaches have promoted the development of visual tracking and greatly improved tracking performance. Nevertheless, many deep features taken from pre-trained CNN models are irrelevant to the task of distinguishing the target object from the background. These unnecessary disturbances significantly limit the performance of the end-to-end tracking approaches mentioned above.

Instead of exploiting vanilla deep features for visual tracking, methods using attention-weighted deep features alleviate the model drift problem caused by background noise. In fact, when tracking a target object, the tracker should focus on only a much smaller subset of deep features that can well discriminate and locate the target object against the background, which means that many deep features are irrelevant to the target object. Some works explore attention mechanisms to highlight useful information in visual tracking. CSRDCF Lukežič et al. (2017) constructs a spatial reliability map to constrain filter learning. ACFN Choi et al. (2017) establishes a special attention mechanism to choose useful filters during tracking. RASNet Wang et al. (2018a) and FlowTrack Zhu et al. (2018b) further introduce an attention network similar to the SENet architecture Hu et al. (2018) to enhance the representation capability of the output features. In particular, FlowTrack also exploits motion information to benefit from historical cues. CCOT Danelljan et al. (2016) takes previous frames into account during the filter training stage to enhance robustness. RTT Yang and Chan (2017) learns recurrent filters through an LSTM network to maintain the target object appearance. Nonetheless, all these trackers take advantage of only one or two aspects of attention to refine the output deep features, and much useful information in intermediate convolutional layers has not yet been fully explored.

Motivated by the above observations, we aim to achieve high-performance visual tracking by learning efficient representations and correlation filters mutually in an end-to-end network. Our approach is related to EDCF Wang et al. (2018b), which proposes a fully convolutional encoder-decoder network model to jointly perform similarity measurement and correlation operations on multi-level reinforced representations for multi-task tracking. In contrast, we propose to learn both inter- and intra-frame attention based on convolutional LSTM units and MLPs to emphasize useful features, and take global context and temporal correlation into account to train and update the DCF. Our approach is also related to, but different from, HCF Ma et al. (2015), which utilizes hierarchical convolutional features for robust tracking. Rather than using a fixed weighting scheme to fuse features from different levels, we first carry out attentional analysis on different convolutional layers separately and then merge the hierarchical attentional features with a refinement model for better target object representation.

3 The proposed approach

In this section, we describe our proposed tracking approach in detail.

3.1 Network Overview

We use the fully-convolutional portion of ResNet-50 He et al. (2016) as the backbone network and make some modifications to the architecture of the original network. Table 1 illustrates the details of our modified ResNet-50. The deep feature hierarchies extracted from the conv3_4, conv4_6 and conv5_3 blocks are exploited for visual tracking. To reduce the output stride of the original ResNet-50 network from 32 to 8, we set the spatial strides to 1 in the convolutional layers of the conv4_1 and conv5_1 blocks. Thus, all the hierarchical features have the same spatial resolution. In order to increase the receptive field, we also adopt deformable convolutions Dai et al. (2017) in the convolutional layers of the conv4_1 block. Then, we employ a hierarchical attention module, as proposed in Section 3.2, to obtain both inter- and intra-frame attention of each deep feature hierarchy separately. After that, the hierarchical attentional features are merged together using a refinement model. Moreover, our network model comprises two branches, which share the same network parameters, to learn hierarchical attentional representations from the target template and the searching candidates. The reinforced attentional representation of the target template is used to learn a contextual attentional correlation filter, as described in Section 3.3, which is then applied to compute the correlation response of the searching candidates. The maximum of the correlation response indicates the estimated position of the target object.

stage      output size (input 255×255)   blocks                                output stride
conv1      127×127                       7×7, 64                               2
maxpool1   63×63                         3×3 max pool                          4
conv2_x    63×63                         [1×1, 64;  3×3, 64;  1×1, 256] × 3    4
conv3_x    31×31                         [1×1, 128; 3×3, 128; 1×1, 512] × 4    8
conv4_x    31×31                         [1×1, 256; 3×3, 256; 1×1, 1024] × 6   8
conv5_x    31×31                         [1×1, 512; 3×3, 512; 1×1, 2048] × 3   8
Table 1: Architecture of the backbone network. The building blocks of each stage are shown in brackets, followed by the number of stacked blocks; the bracketed bottleneck configurations follow the standard ResNet-50 design.
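To make the stride modification concrete, the following minimal PyTorch sketch reduces the output stride of a standard torchvision ResNet-50 so that the conv3_x, conv4_x and conv5_x outputs share one spatial resolution. It is illustrative only: our actual model is built in MXNet, and the deformable convolutions in conv4_1 are omitted here.

```python
import torch
import torchvision

def build_backbone():
    """Stride-reduced ResNet-50 feature extractor (illustrative sketch).

    The spatial stride of the first bottleneck in conv4_x and conv5_x
    (torchvision's layer3/layer4) is set to 1, so the last three stages
    produce feature maps of the same resolution (output stride 8).
    """
    net = torchvision.models.resnet50()
    for layer in (net.layer3, net.layer4):          # conv4_x and conv5_x
        bottleneck = layer[0]
        bottleneck.conv2.stride = (1, 1)            # keep spatial resolution
        if bottleneck.downsample is not None:
            bottleneck.downsample[0].stride = (1, 1)
    return net

def extract_hierarchy(net, x):
    """Return the conv3_x, conv4_x and conv5_x feature maps."""
    x = net.maxpool(net.relu(net.bn1(net.conv1(x))))
    c2 = net.layer1(x)
    c3 = net.layer2(c2)
    c4 = net.layer3(c3)
    c5 = net.layer4(c4)
    return c3, c4, c5

net = build_backbone()
c3, c4, c5 = extract_hierarchy(net, torch.randn(1, 3, 255, 255))
# c3, c4 and c5 now share the same spatial resolution
```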

3.2 Hierarchical Attention Module

We propose a hierarchical attention module to leverage both inter- and intra-frame attention. The inter-frame attention is exploited to capture historical context information and perform robust inference in the current frame. The intra-frame attention along the channel-wise and spatial axes is employed to emphasize informative representations and suppress irrelevant redundancy. Details of our hierarchical attention module are described as follows.

Figure 5: Architecture of convolutional LSTM unit.

Inter-frame attention. We formulate the tracking task as a sequential inference problem and utilize a convolutional LSTM unit, as shown in Fig. 5, to model the temporal consistency of the target object appearance. Given the extracted feature map $x_t$ at the current frame $t$, the inter-frame attention is computed in the convolutional LSTM unit as:

$$f_t = \sigma(W_{xf} * x_t \oplus W_{hf} * h_{t-1}), \quad i_t = \sigma(W_{xi} * x_t \oplus W_{hi} * h_{t-1}),$$
$$o_t = \sigma(W_{xo} * x_t \oplus W_{ho} * h_{t-1}), \quad g_t = \tanh(W_{xg} * x_t \oplus W_{hg} * h_{t-1}),$$
$$c_t = f_t \odot c_{t-1} \oplus i_t \odot g_t, \qquad h_t = o_t \odot \tanh(c_t), \tag{1}$$

where $\oplus$ denotes element-wise addition, $\sigma$ and $\tanh$ are the sigmoid and hyperbolic tangent activations respectively, $*$ denotes the convolution operation, and $\odot$ denotes element-wise multiplication. $W_{x\cdot}$ and $W_{h\cdot}$ are the kernel weights of the input layer and the hidden layer. $f_t$, $i_t$, $o_t$ and $g_t$ indicate the forget, input, output and content gates, $c_t$ denotes the cell state, and $h_t$ is the hidden state, which is treated as the inter-frame attention. To facilitate the calculation of the intra-frame attention, $h_t$ is fed into two fully convolutional layers to separately obtain the inter-frame attention along the channel-wise axis, $h_t^{c}$, and the inter-frame attention along the spatial axis, $h_t^{s}$.
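A minimal convolutional LSTM cell in the spirit of Eq. 1 can be sketched as follows; the channel count, kernel size and gate packing are illustrative assumptions rather than our exact configuration.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell (sketch of the inter-frame attention model)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # A single convolution produces all four gates at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels, kernel_size, padding=padding)

    def forward(self, x, state):
        h_prev, c_prev = state
        f, i, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)),
                                 chunks=4, dim=1)
        f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
        c = f * c_prev + i * torch.tanh(g)   # cell state
        h = o * torch.tanh(c)                # hidden state = inter-frame attention
        return h, (h, c)

# Usage: carry (h, c) across frames so that temporal cues propagate.
cell = ConvLSTMCell(channels=256)
x = torch.randn(1, 256, 31, 31)
h = c = torch.zeros_like(x)
attn, (h, c) = cell(x, (h, c))
```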

Intra-frame attention along the channel-wise axis. We exploit channel-wise intra-frame attention to emphasize meaningful visual patterns and boost the performance of target object discrimination. Given the input feature $x_t$ and the channel-wise inter-frame attention $h_{t-1}^{c}$ of the previous frame, we first apply global average-pooling and max-pooling operations along the spatial axis on the input feature to generate two channel-wise context descriptors, $x_{avg}^{c}$ and $x_{max}^{c}$. We then combine them with $h_{t-1}^{c}$ and feed the combination into an MLP with sigmoid activation to obtain the channel-wise intra-frame attention $A_t^{c}$:

$$A_t^{c} = \sigma\big(W_1(W_0(x_{avg}^{c} \oplus x_{max}^{c} \oplus h_{t-1}^{c}))\big), \tag{2}$$

where $\sigma$ indicates the sigmoid function and $\oplus$ denotes element-wise addition. $W_0$ and $W_1$ are the MLP weights used to balance the dimensions of the channel-wise descriptors and the channel-wise intra-frame attention.
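The following sketch illustrates a CBAM-style channel attention of the form of Eq. 2; the reduction ratio, the hidden ReLU inside the MLP, and the assumption that the inter-frame term is already a channel descriptor are illustrative choices, not our exact parameterization.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-wise intra-frame attention with an inter-frame term (sketch of Eq. 2)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x, inter_frame):      # inter_frame: (B, C), assumed channel descriptor
        b, c, _, _ = x.shape
        avg = x.mean(dim=(2, 3))            # global average-pooled descriptor
        mx = x.amax(dim=(2, 3))             # global max-pooled descriptor
        a = torch.sigmoid(self.mlp(avg + mx + inter_frame))
        return a.view(b, c, 1, 1)           # broadcastable channel weights
```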

Figure 6: Overview of the intra-frame attention model.

Intra-frame attention along the spatial axis. We utilize spatial intra-frame attention to highlight target-specific details and enhance the capability for target object localization. Given the input feature $x_t$ and the spatial inter-frame attention $h_t^{s}$ at the current frame, we first combine two pooled spatial context descriptors, $x_{avg}^{s}$ and $x_{max}^{s}$, with $h_t^{s}$. Then, we feed the combination into an MLP with sigmoid activation to generate the spatial intra-frame attention $A_t^{s}$:

$$A_t^{s} = \sigma\big(W_1(W_0(x_{avg}^{s} \oplus x_{max}^{s} \oplus h_t^{s}))\big), \tag{3}$$

where $W_0$ and $W_1$ are parameters used to balance the dimensions of the spatial descriptors and the spatial intra-frame attention, $\sigma$ denotes the sigmoid function, and $\oplus$ denotes element-wise addition.
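A matching sketch of the spatial attention in Eq. 3 is shown below; here a single convolution plays the role of the MLP, and the pooled maps are concatenated rather than added, both simplifications made for compactness.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial intra-frame attention with an inter-frame term (sketch of Eq. 3)."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(3, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x, inter_frame):          # inter_frame: (B, 1, H, W) spatial map
        avg = x.mean(dim=1, keepdim=True)       # channel-averaged map
        mx, _ = x.max(dim=1, keepdim=True)      # channel-max map
        a = torch.sigmoid(self.conv(torch.cat([avg, mx, inter_frame], dim=1)))
        return a                                # (B, 1, H, W) spatial weights
```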

Figure 7: The structure of the refinement model.

Reinforced Attentional Representation. The hierarchical attentional representation $\tilde{x}_t$ can be computed with both inter- and intra-frame attention as:

$$\tilde{x}_t = x_t \oplus (x_t \otimes A_t^{c} \otimes A_t^{s}), \tag{4}$$

where $\oplus$ and $\otimes$ indicate element-wise addition and broadcasting multiplication, respectively. Finally, we merge these hierarchical attentional representations from coarse to fine to obtain the reinforced attentional representation using a refinement model, as shown in Fig. 7.
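In code, applying Eq. 4 amounts to a broadcast reweighting of the feature map, sketched below with an explicit residual addition; treating the element-wise addition as a residual connection is an assumption made for illustration.

```python
import torch

def reinforce(x, a_channel, a_spatial):
    """Reweight the feature map with channel and spatial attention (sketch of Eq. 4).

    a_channel has shape (B, C, 1, 1), a_spatial has shape (B, 1, H, W); both are
    broadcast over x, and the residual term keeps the original features available.
    """
    return x + x * a_channel * a_spatial

x = torch.randn(1, 256, 31, 31)
out = reinforce(x, torch.rand(1, 256, 1, 1), torch.rand(1, 1, 31, 31))
```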

3.3 Contextual Attentional Correlation Layer

Different from traditional DCF-based tracking approaches Henriques et al. (2015); Danelljan et al. (2017b); Gao et al. (2018b); Danelljan et al. (2017a), we make some essential modifications to the DCF to utilize contextual attention in consecutive frames. Given the previous size and position of the target object, we crop a searching image patch $z$ from the current image frame. The parameters of the contextual attentional DCF are denoted as $w$. The correlation response is then obtained as Henriques et al. (2015):

$$R = \varphi(z) \star w = \mathcal{F}^{-1}\big(\hat{\varphi}(z) \odot \hat{w}^{*}\big), \tag{5}$$

where $\star$ and $\odot$ denote the circular correlation operation and the Hadamard product respectively, $\varphi(\cdot)$ is the attentional feature mapping, $\hat{\varphi}(z)$ indicates the discrete Fourier transform of $\varphi(z)$, $\hat{w}^{*}$ represents the complex conjugate of $\hat{w}$, and $\mathcal{F}^{-1}$ denotes the inverse discrete Fourier transform.
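The Fourier-domain evaluation of Eq. 5 can be sketched with NumPy as follows; the placement of the complex conjugate and the channel-wise summation follow the usual multi-channel DCF convention and are assumptions rather than our exact formulation.

```python
import numpy as np

def correlation_response(z_feat, w):
    """Circular correlation via the FFT (sketch of Eq. 5).

    z_feat and w are (C, H, W) real arrays: the attentional features of the
    search patch and the learned filter. Channels are summed after the
    Hadamard product, as in standard multi-channel DCFs.
    """
    Z = np.fft.fft2(z_feat, axes=(-2, -1))
    W = np.fft.fft2(w, axes=(-2, -1))
    return np.fft.ifft2(np.sum(Z * np.conj(W), axis=0)).real

response = correlation_response(np.random.rand(256, 31, 31),
                                np.random.rand(256, 31, 31))
row, col = np.unravel_index(np.argmax(response), response.shape)  # estimated position
```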

We choose the CACF tracker Mueller et al. (2017) as the base of our decision module. Since the background around the target object may impact tracking performance, CACF takes global contextual information into account and demonstrates outstanding discriminative capability. We crop a target template patch and several context patches around it from the exemplar image frame. It is noteworthy that we use a set of exemplar image frames from previous frames to learn a DCF that has a high response on the target template patch and a response close to zero on all context patches:

$$\min_{w}\ \sum_{t=1}^{T}\Big(\lambda_1\big\|\varphi(x_{t,0}) \star w - y\big\|_2^{2} + \lambda_2\sum_{i=1}^{k}\big\|\varphi(x_{t,i}) \star w\big\|_2^{2}\Big) + \|w\|_2^{2}, \tag{6}$$

where $x_{t,0}$ and $x_{t,i}$ denote the target template patch and the $i$-th context patch cropped from the $t$-th exemplar frame, $\lambda_1$ is the impact factor of the target template patch, $y$ is the desired correlation response, which is designed as a Gaussian function centered at the previous target object position, and $\lambda_2$ controls the regression of the context patches to zero. The closed-form solution in the Fourier frequency domain is obtained as:

$$\hat{w} = \frac{\lambda_1\sum_{t=1}^{T}\hat{\varphi}^{*}(x_{t,0}) \odot \hat{y}}{\lambda_1\sum_{t=1}^{T}\hat{\varphi}^{*}(x_{t,0}) \odot \hat{\varphi}(x_{t,0}) + \lambda_2\sum_{t=1}^{T}\sum_{i=1}^{k}\hat{\varphi}^{*}(x_{t,i}) \odot \hat{\varphi}(x_{t,i}) + 1}. \tag{7}$$

Then, the current target object position and size are determined according to the maximum of the correlation response on the candidate patch, calculated by Eq. 5.
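A simplified single-channel version of the context-aware closed-form solution (Eq. 7), following Mueller et al. (2017), can be sketched as below; multi-frame accumulation and the attentional feature mapping are omitted, and the regularization values are placeholders.

```python
import numpy as np

def learn_context_aware_filter(target, contexts, y, lam1=1.0, lam2=0.5):
    """Closed-form context-aware DCF for one feature channel (sketch).

    The filter is constrained to reproduce the Gaussian label y on the target
    patch while responding with (near) zero on each context patch.
    """
    A0 = np.fft.fft2(target)
    Y = np.fft.fft2(y)
    denom = lam1 * np.conj(A0) * A0 + 1.0          # ridge term
    for ctx in contexts:
        Ai = np.fft.fft2(ctx)
        denom = denom + lam2 * np.conj(Ai) * Ai    # context suppression
    W = lam1 * np.conj(A0) * Y / denom             # filter in the Fourier domain
    return np.fft.ifft2(W).real

h, w = 64, 64
yy, xx = np.mgrid[0:h, 0:w]
label = np.exp(-((yy - h // 2) ** 2 + (xx - w // 2) ** 2) / (2 * 3.0 ** 2))
filt = learn_context_aware_filter(np.random.rand(h, w),
                                  [np.random.rand(h, w) for _ in range(4)], label)
```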

We formulate the contextual attentional DCF as a differentiable correlation layer to achieve end-to-end training of the whole network and online updating of the filters. These capabilities further enhance the adaptability of our approach to target object appearance variations. The network can therefore be trained by minimizing the difference between the real correlation response on the searching patch and the desired response $\tilde{y}$. The loss function is formulated as:

$$L = \big\|R - \tilde{y}\big\|_2^{2}. \tag{8}$$

The back-propagation of the loss with respect to the template and searching branches is computed as:

(9)

Once the back-propagation of the correlation layer is derived, our network can be trained end-to-end. Finally, the correlation filters can be incrementally updated during tracking as formulated in Eq. 7.
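The following PyTorch sketch shows why formulating the correlation as differentiable Fourier-domain operations enables end-to-end training: a ridge-regression filter is solved from the template branch, applied to the search branch, and the loss of Eq. 8 is back-propagated to both branches by autograd. The contextual attention terms of our layer are omitted, and the regularizer value is illustrative.

```python
import torch

def correlation_loss(x_feat, z_feat, y, lam=1e-4):
    """Differentiable correlation layer in the spirit of CFNet/DCFNet (sketch)."""
    X = torch.fft.fft2(x_feat)                          # (B, C, H, W) complex spectra
    Z = torch.fft.fft2(z_feat)
    Y = torch.fft.fft2(y)                               # (B, 1, H, W)
    kxx = (X * torch.conj(X)).real.sum(dim=1, keepdim=True) + lam
    Wc = torch.conj(X) * Y / kxx                        # conjugate filter, per channel
    response = torch.fft.ifft2((Wc * Z).sum(dim=1, keepdim=True)).real
    return ((response - y) ** 2).mean()                 # Eq. 8 style loss

x_feat = torch.randn(2, 32, 31, 31, requires_grad=True)   # template branch features
z_feat = torch.randn(2, 32, 31, 31, requires_grad=True)   # search branch features
y = torch.randn(2, 1, 31, 31)                              # desired Gaussian response
loss = correlation_loss(x_feat, z_feat, y)
loss.backward()                                            # gradients reach both branches
```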

3.4 Implementation Details

We apply stochastic gradient descent (SGD) with a gradually decayed learning rate, weight decay and momentum to train our RAR from scratch on the ILSVRC video object detection (VID) dataset, which contains thousands of annotated video sequences. The weights of the first two residual stages of the backbone network are fixed, and only the last three residual stages, i.e., conv3, conv4 and conv5, are fine-tuned. During training, the target template and searching candidates are cropped with additional padding from two frames randomly picked from sequences of the same target object and then resized to a standard input size of 255×255. Moreover, to deal with scale variations, we generate a proposal pyramid with three scales around the previous target object size. The regularization parameters are set to fixed values.
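The optimization recipe can be summarized by the minimal, hypothetical sketch below; the toy two-layer model and the concrete hyper-parameter values are placeholders for details not restated here.

```python
import torch
import torch.nn as nn

# Toy stand-in for the tracker: the first layer represents the frozen early
# backbone stages, the second the fine-tuned conv3-conv5 stages.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1),
    nn.Conv2d(16, 16, 3, padding=1),
)
for p in model[0].parameters():
    p.requires_grad = False                    # freeze the early stage

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-2, momentum=0.9, weight_decay=5e-4,  # placeholder hyper-parameters
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(3):                         # toy loop on random data
    x = torch.randn(4, 3, 64, 64)
    target = torch.randn(4, 16, 64, 64)
    loss = ((model(x) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```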

4 Experiments

Experiments are conducted on four modern benchmark datasets, including OTB-2013 with 50 videos Wu et al. (2013), OTB-2015 with 100 videos Wu et al. (2015), and VOT-2016 Kristan et al. (2016) and VOT-2017 Kristan et al. (2017), each with 60 videos. We implement our proposed tracker in Python using MXNet Chen et al. (2015) on an Amazon EC2 instance with an Intel Xeon E5 CPU @ 2.3GHz, 61GB of RAM, and an NVIDIA Tesla K80 GPU with 12GB of VRAM. The average speed of the proposed tracker is 37 fps.

4.1 Results on OTB

Trackers OTB-2013 OTB-2015 Speed (FPS)
AUC DP AUC DP
RAR 0.682 0.896 0.664 0.873 37
DaSiamRPN Zhu et al. (2018a) 0.656 0.890 0.658 0.881 97
SiamTri Dong and Shen (2018) 0.615 0.815 0.590 0.781 85
SA_Siam He et al. (2018) 0.676 0.894 0.656 0.864 50
SiamRPN Li et al. (2018a) 0.658 0.884 0.637 0.851 71
TRACA Choi et al. (2018) 0.652 0.898 0.602 0.816 65
EDCF Wang et al. (2018b) 0.665 0.885 0.635 0.836 65
CACF Mueller et al. (2017) 0.621 0.833 0.598 0.810 33
CFNet Valmadre et al. (2017) 0.611 0.807 0.568 0.767 73
SiamFC Bertinetto et al. (2016) 0.609 0.809 0.578 0.767 86
HCF Ma et al. (2015) 0.638 0.891 0.562 0.837 26
Table 2: Comparisons with recent real-time (≥ 25 fps) state-of-the-art tracking approaches on the OTB benchmarks using the AUC and Precision metrics. The best three values are highlighted in red, blue and green fonts, respectively.

OTB-2013 Wu et al. (2013) and OTB-2015 Wu et al. (2015) are two popular visual tracking benchmark datasets. The RAR tracker is compared on these benchmarks with recent real-time (≥ 25 fps) trackers including DaSiamRPN Zhu et al. (2018a), SiamTri Dong and Shen (2018), SA_Siam He et al. (2018), SiamRPN Li et al. (2018a), TRACA Choi et al. (2018), EDCF Wang et al. (2018b), CACF Mueller et al. (2017), CFNet Valmadre et al. (2017), SiamFC Bertinetto et al. (2016) and HCF Ma et al. (2015). We exploit two evaluation metrics, i.e., distance precision (DP) and overlap success rate (OSR). DP is defined as the percentage of frames where the Euclidean distance between the estimated target position and the ground truth is smaller than a preset threshold of 20 pixels, while OSR is the percentage of frames whose overlap ratio with the ground truth exceeds a given threshold, evaluated over the threshold range of [0, 1]. The area under the OSR curve (AUC) is mainly used to rank trackers. The evaluation results are illustrated in Table 2.
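For reference, the two metrics can be computed from per-frame bounding boxes as in the following sketch, assuming boxes in (x, y, w, h) format; this is a plain re-implementation of the protocol described above, not the official OTB toolkit.

```python
import numpy as np

def precision_and_auc(pred_boxes, gt_boxes, dp_threshold=20):
    """Distance precision at 20 pixels and AUC of the overlap success curve.

    pred_boxes and gt_boxes are (N, 4) arrays of (x, y, w, h) per frame.
    """
    pred_c = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2
    gt_c = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2
    dp = np.mean(np.linalg.norm(pred_c - gt_c, axis=1) <= dp_threshold)

    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    iou = inter / union

    thresholds = np.linspace(0, 1, 21)
    success = [np.mean(iou >= t) for t in thresholds]
    auc = np.mean(success)        # area under the success curve
    return dp, auc
```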

On the OTB-2013 benchmark dataset, the proposed tracker achieves a DP score of 0.896 and an AUC score of 0.682. Although the TRACA tracker obtains the best DP score of 0.898, our RAR tracker outperforms it with an absolute gain of 3.0% in AUC, because the proposed hierarchical attention module highlights informative representations and suppresses irrelevant redundancy. On the OTB-2015 benchmark dataset, our tracker achieves the best AUC score of 0.664 and the second-best DP score of 0.873. However, our tracker does not perform as well as the top-performing DaSiamRPN, which obtains the best DP score of 0.881. This can be attributed to the fact that DaSiamRPN exploits extra negative training samples from other datasets to enhance its discriminative capability. As the baselines of our tracker, EDCF and HCF achieve AUC scores of 0.635 and 0.562 on the OTB-2015 benchmark, respectively; their performance drops significantly, by 2.9% and 10.2% in AUC, compared to RAR.

Figure 8: Performance evaluation of five trackers on the OTB-2015 benchmark dataset with different attributes. Each subset of sequences corresponds to one of the attributes. The number in brackets after each attribute acronym is the number of sequences in the corresponding subset.

For a comprehensive evaluation, our approach is also compared with state-of-the-art trackers, including DaSiamRPN Zhu et al. (2018a), SiamRPN Li et al. (2018a), SiamFC Bertinetto et al. (2016) and HCF Ma et al. (2015), on different attributes of the OTB-2015 benchmark dataset. The video sequences in OTB are annotated with 11 different attributes: illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out of view (OV), background clutter (BC) and low resolution (LR). The results are presented in terms of AUC and DP scores in Fig. 8. Although our approach performs worse on the three attributes of in-plane rotation, out-of-plane rotation and low resolution, it achieves impressive performance on the remaining eight attributes.

Fig. 9 shows comparisons of the state-of-the-art trackers SiamRPN Li et al. (2018a), SiamFC Bertinetto et al. (2016) and HCF Ma et al. (2015) with our proposed approach on three challenging video sequences from the OTB-2015 benchmark dataset. In the sequence carScale, the target object undergoes scale variations while moving fast. None of the trackers except the proposed one can handle the scale variation desirably. Both SiamRPN and HCF end up tracking only a small part of the target object, while the bounding boxes produced by SiamFC are larger than the ground truth. In contrast, the proposed approach traces the target object well. In the sequence david3, the target object is partially occluded in a cluttered background. SiamFC drifts quickly when the occlusion occurs, while the others are able to trace the target object correctly throughout the sequence. The target object in the sequence MotorRolling undergoes rotations with varying illumination. Only SiamRPN and RAR can locate the target object accurately.

Figure 9: Comparison of our proposed approach with the state-of-the-art trackers SiamRPN Li et al. (2018a), SiamFC Bertinetto et al. (2016) and HCF Ma et al. (2015) on three challenging video sequences (from top to bottom are carScale, david3, and MotorRolling, respectively).

The good performance of the proposed approach can be attributed to two factors: first, both inter- and intra-frame attention are effective at selecting more meaningful representations, which accounts for appearance and scale variations; second, with the use of the contextual attentional correlation filter, the proposed approach can further handle more complicated scenarios such as background clutter and heavy occlusion.

4.2 Results on VOT

(a) Expected average overlap scores on VOT-2016 (b) Expected average overlap scores on VOT-2017
Figure 10: Expected average overlap plot on VOT datasets. The better trackers are located at the right. The values in the legend indicate the EAO scores. The horizontal dashed lines denote the state-of-the-art bounds according to the VOT committee.

The VOT challenge is the largest annual competition in the field of visual tracking. We compare our tracker with several state-of-the-art trackers on the VOT-2016 Kristan et al. (2016) and VOT-2017 Kristan et al. (2017) challenge datasets, respectively. Following the evaluation protocol of VOT, we report the tracking performance in terms of expected average overlap (EAO) scores, as shown in Fig. 10.

The RAR tracker obtains competitive EAO scores on both datasets. Compared to SiamFC Bertinetto et al. (2016), our approach achieves clear absolute gains in EAO on both datasets, which demonstrates its superiority in target object representation. In comparison, CCOT Danelljan et al. (2016) and LSART Sun et al. (2018) achieve the top performance on the VOT-2016 and VOT-2017 datasets, respectively. However, both run far below real-time speeds, whereas our approach runs orders of magnitude faster. As a result, our approach exceeds the state-of-the-art bounds by large margins, and it can be considered a state-of-the-art tracker according to the definition of the VOT committee. All the results demonstrate the effectiveness and efficiency of our proposed tracking approach.

Trackers OTB-2013 OTB-2015
AUC (%) DP (%) AUC (%) DP (%)
RAR 0.657 0.853 0.635 0.841
RAR 0.667 0.875 0.643 0.858
RAR 0.671 0.869 0.646 0.837
RAR 0.627 0.824 0.595 0.798
RAR 0.644 0.849 0.616 0.813
RAR 0.658 0.871 0.632 0.841
RAR 0.665 0.878 0.638 0.850
RAR 0.682 0.896 0.664 0.873
Table 3: Ablation studies of several variations of our tracker on OTB benchmark datasets using AUC and DP scores.

4.3 Ablation Studies

To investigate how each proposed component contributes to tracking performance, we evaluate several variants of our approach on the OTB benchmark datasets, including a variant using the VGG-M network Simonyan and Zisserman (2015) as the backbone, a variant with traditional correlation filters Danelljan et al. (2017b), a variant without hierarchical convolutional features, a variant without any attention, and three variants each removing a single type of attention. The detailed evaluation results are illustrated in Table 3.

The full algorithm (RAR) outperforms all the variants. It achieves clear absolute AUC gains over the backbone-related variants on the OTB-2015 benchmark dataset, which shows that our modified backbone network enhances the generalization capability to learn more informative target object representations. Compared with the variant without any attention, a large AUC gain is obtained on the OTB-2015 benchmark dataset, which clearly verifies the effectiveness of combining inter- and intra-frame attention for emphasizing meaningful representations and suppressing redundant information. By introducing the differentiable correlation layer, the AUC score is also significantly increased relative to the variant with traditional correlation filters on the OTB-2013 benchmark dataset, which shows the superiority of the proposed contextual attentional correlation filter. According to the ablation studies, every component of our approach contributes to improving tracking performance.

5 Conclusions

In this paper, we propose an end-to-end network model that jointly achieves hierarchical attentional representation learning and contextual attentional correlation filter training for high-performance visual tracking. Specifically, we introduce a hierarchical attention module to learn hierarchical attentional representations with both inter- and intra-frame attention at different convolutional layers, emphasizing informative representations and suppressing redundant information. Moreover, a contextual attentional correlation layer is incorporated into the network to enhance accurate target object discrimination and localization. Experimental results clearly demonstrate that our proposed tracker outperforms most state-of-the-art trackers in terms of both accuracy and robustness on the OTB and VOT benchmark datasets while running above real-time. Although the proposed tracker has achieved competitive tracking results, its performance could be further improved by utilizing multimodal representations, such as natural language features, and more powerful backbone networks, such as graph convolutional networks.

References

  • L. Bertinetto, J. Valmadre, J. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision (ECCV), pp. 850–865. Cited by: §1, §1, §1, §2, Figure 9, §4.1, §4.1, §4.1, §4.2, Table 2.
  • D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2544–2550. Cited by: §1, §2.
  • B. Chen, P. Li, C. Sun, D. Wang, G. Yang, and H. Lu (2019) Multi attention module for visual tracking. Pattern Recognition 87, pp. 80–93. Cited by: §1, §1.
  • T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang (2015) MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274. Cited by: §4.
  • J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi (2017) Attentional correlation filter network for adaptive visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • J. Choi, H. Jin Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Young Choi (2018) Context-aware deep feature compression for high-speed visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 479–488. Cited by: §4.1, Table 2.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pp. 764–773. Cited by: §3.1.
  • M. Danelljan, G. Bhat, S. F. Khan, and M. Felsberg (2017a) ECO: efficient convolution operators for tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §3.3.
  • M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2017b) Discriminative scale space tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8), pp. 1561–1575. Cited by: §1, §2, §3.3, §4.3.
  • M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In European Conference on Computer Vision (ECCV), pp. 472–488. Cited by: §2, §4.2.
  • X. Dong and J. Shen (2018) Triplet loss in siamese network for object tracking. In European Conference on Computer Vision (ECCV), Cited by: §4.1, Table 2.
  • P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645. Cited by: §1, §2.
  • P. Gao, Y. Ma, C. Li, K. Song, Y. Zhang, F. Wang, and L. Xiao (2018a) Adaptive object tracking with complementary models. IEICE Transactions on Information and Systems E101-D (11), pp. 2849–2854. Cited by: §1.
  • P. Gao, Y. Ma, K. Song, C. Li, F. Wang, L. Xiao, and Y. Zhang (2018b) High performance visual tracking with circular and structural operators. Knowledge-Based Systems 161, pp. 240–253. Cited by: §1, §3.3.
  • P. Gao, Y. Ma, R. Yuan, L. Xiao, and F. Wang (2019a) Learning cascaded siamese networks for high performance visual tracking. In Proceedings of International Conference on Image Processing (ICIP), Cited by: §1.
  • P. Gao, Y. Ma, R. Yuan, L. Xiao, and F. Wang (2019b) Siamese attentional keypoint network for high performance visual tracking. arXiv preprint arXiv:1904.10128. Cited by: §1, §1.
  • A. He, C. Luo, X. Tian, and W. Zeng (2018) A twofold siamese network for real-time object tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4834–4843. Cited by: §4.1, Table 2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: Figure 1, Figure 2, §1, §1, §3.1.
  • J. Henriques, R. Caseiro, P. Martins, and J. Batista (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In European conference on computer vision (ECCV), pp. 702–715. Cited by: §2.
  • J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 583–596. Cited by: §1, §2, §3.3.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1.
  • J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, and et al (2016) The visual object tracking vot2016 challenge results. In European Conference on Computer Vision (ECCV), Cited by: item 4, §4.2, §4.
  • M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, and et al (2017) The visual object tracking vot2017 challenge results. In IEEE International Conference on Computer Vision (ICCV), Cited by: item 4, §4.2, §4.
  • A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Annual Conference on Neural Information Processing Systems (NeurIPS), pp. 1097–1105. Cited by: Figure 1, §1, §1, §1.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018a) High performance visual tracking with siamese region proposal network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8971–8980. Cited by: §1, §1, §1, Figure 9, §4.1, §4.1, §4.1, Table 2.
  • P. Li, D. Wang, L. Wang, and H. Lu (2018b) Deep visual tracking: review and experimental comparison. Pattern Recognition 76, pp. 323–338. Cited by: §1, §2.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §1.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV), Cited by: §1.
  • A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4847–4856. Cited by: §1, §2.
  • C. Ma, J. Huang, X. Yang, and M. Yang (2015) Hierarchical convolutional features for visual tracking. In IEEE International Conference on Computer Vision (ICCV), pp. 3074–3082. Cited by: §1, §1, §2, §2, Figure 9, §4.1, §4.1, §4.1, Table 2.
  • Y. Ma, C. Yuan, P. Gao, and F. Wang (2018) Efficient multi-level correlating for visual tracking. In Asian Conference on Computer Vision (ACCV), pp. 452–465. Cited by: §1.
  • M. Mueller, N. Smith, and B. Ghanem (2017) Context-aware correlation filter tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1396–1404. Cited by: §2, §3.3, §4.1, Table 2.
  • Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M. Yang (2016) Hedged deep tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4303–4311. Cited by: §1, §1, §2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Annual Conference on Neural Information Processing Systems (NIPS), pp. 91–99. Cited by: §1.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and F. Li (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §1.
  • J. Schmidhuber (2015) Deep learning in neural networks: an overview. Neural Networks 61, pp. 85–117. Cited by: §1.
  • K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556v6. Cited by: Figure 1, §1, §1, §4.3.
  • A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah (2014) Visual tracking: an experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7), pp. 1442–1468. Cited by: §1, §2.
  • C. Sun, D. Wang, H. Lu, and M. Yang (2018) Learning spatial-aware regressions for visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8962–8970. Cited by: §4.2.
  • R. Tao, E. Gavves, and A. W. M. Smeulders (2016) Siamese instance search for tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1420–1429. Cited by: §1, §1, §2.
  • J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. S. Torr (2017) End-to-end representation learning for correlation filter based tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2805–2813. Cited by: §1, §1, §2, §4.1, Table 2.
  • J. Van De Weijer, C. Schmid, J. Verbeek, and D. Larlus (2009) Learning color names for real-world applications. IEEE Transactions on Image Processing 18 (7), pp. 1512–1523. Cited by: §1, §2.
  • Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank (2018a) Learning attentions: residual attentional siamese network for high performance online visual tracking. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4854–4863. Cited by: §1, §2.
  • Q. Wang, M. Zhang, J. Xing, J. Gao, W. Hu, and S. Maybank (2018b) Do not lose the details: reinforced representation learning for high performance visual tracking. In International Joint Conference on Artificial Intelligence (IJCAI), pp. 985–991. Cited by: §1, §2, §2, §4.1, Table 2.
  • Q. Wang, C. Yuan, J. Wang, and W. Zeng (2019) Learning attentional recurrent neural network for visual tracking. IEEE Transactions on Multimedia 21 (4), pp. 930–942. Cited by: §1.
  • S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: §1, §1.
  • Y. Wu, J. Lim, and M. Yang (2013) Online object tracking: a benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2411–2418. Cited by: item 4, §4.1, §4.
  • Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1834–1848. Cited by: item 4, §4.1, §4.
  • H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song (2011) Recent advances and trends in visual tracking: a review. Neurocomputing 74 (18), pp. 3823–3831. Cited by: §1, §2.
  • T. Yang and A. B. Chan (2017) Recurrent filter learning for visual tracking. In IEEE International Conference on Computer Vision (ICCV), pp. 2010–2019. Cited by: §1, §2.
  • M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision (ECCV), pp. 818–833. Cited by: §1, §1.
  • Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018a) Distractor-aware siamese networks for visual object tracking. In European Conference on Computer Vision (ECCV), pp. 103–119. Cited by: §4.1, §4.1, Table 2.
  • Z. Zhu, W. Wu, W. Zou, and J. Yan (2018b) End-to-end flow correlation tracking with spatial-temporal attention. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.