Efficient Visual Tracking with Exemplar Transformers

12/17/2021
by Philippe Blatter, et al.
ETH Zurich

The design of more complex and powerful neural network models has significantly advanced the state-of-the-art in visual object tracking. These advances can be attributed to deeper networks, or to the introduction of new building blocks, such as transformers. However, in the pursuit of increased tracking performance, efficient tracking architectures have received surprisingly little attention. In this paper, we introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking. E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU. This is up to 8 times faster than other transformer-based models, making it the only real-time transformer-based tracker. When compared to lightweight trackers that can operate in real-time on standard CPUs, E.T.Track consistently outperforms all other methods on the LaSOT, OTB-100, NFS, TrackingNet and VOT-ST2020 datasets. The code will soon be released on https://github.com/visionml/pytracking.


1 Introduction

Estimating the trajectory of an object in a video sequence, referred to as visual tracking, is one of the fundamental problems in computer vision. Deep neural networks have significantly advanced the performance of visual tracking methods through deeper networks [3], more accurate bounding boxes [21], or the introduction of new modules, such as transformers [33, 7, 29]. However, these advances often come at the cost of more expensive models. While the demand for real-time visual tracking in applications such as autonomous driving, robotics, and human-computer interfaces is increasing, efficient deep tracking architectures have received surprisingly little attention. This calls for visual trackers that, while accurate and robust, are capable of operating in real-time under the hard computational constraints of limited hardware.

Figure 1: Comparison of tracker performance in terms of AUC score (Success in %) on LaSOT [12] vs. tracking speed in fps on a standard CPU. Our Exemplar Transformer tracker (E.T.Track) outperforms all other real-time trackers. It achieves a higher AUC score than the mobile LightTrack [34] (LT-Mobile). Furthermore, our approach achieves up to 8 times faster runtimes on a CPU compared to previous transformer-based trackers.

Transformers [27], proposed for machine translation, have also demonstrated superior performance in a number of vision-based tasks, including image [2] and video [31] classification, object detection [6], and even multi-task learning [5]. The field of visual tracking has observed similar performance benefits [33, 26, 29]. While transformers have enabled trackers to improve accuracy and robustness, they suffer from high computational cost, leading to reduced run-time speeds, as depicted in Fig. 1. This is due to the quadratic scaling of the computational complexity with respect to the input size, a well-known limitation of the transformer architecture [27]. In this work, we set out to find an efficient transformer architecture for tracking, capable of achieving real-time performance on standard CPUs.

In this work we propose the Exemplar Transformer, an efficient transformer layer for visual tracking. Our attention module, the key building block of transformers, is inspired by a generalization of the standard “Scaled Dot-Product Attention” [27]. While the self-attention of [27] scales quadratically with the length of the input sequence, we re-design the key operands to improve its efficiency.

Our approach builds upon two hypotheses. Firstly, a small set of exemplar values can act as a shared memory between the samples of the dataset. Secondly, a query representation that is coarser than the input is sufficiently descriptive to utilize the exemplar representations.

Scaled dot-product attention eliminates spatial biases from the architecture, and provides the freedom to learn the spatial location dependencies through the similarity function. This comes at the cost of a computationally expensive similarity matrix. Exemplar Attention instead acts on a coarse representation of the input sequence, providing global information to the attention layer in the form of a Query. We jointly learn to project the input to a set of exemplar representations, inducing spatial bias. The significant decrease in the input dimension of the Query, coupled with the small number of exemplars gives rise to an efficient transformer layer capable of operating in real-time on standard CPUs.

We integrate our efficient transformer layer into a Siamese tracking architecture, E.T.Track. Specifically, we replace the convolutional layers in the tracker head with Exemplar Transformer layers. The additional expressivity of the Exemplar Transformer layer significantly improves performance over models based on regular convolutional layers. The added performance comes at a negligible cost in run-time, as seen in Fig. 1.

We verify this claim by performing different experiments on multiple benchmark datasets including LaSOT [12], OTB-100 [32], UAV-123 [23], NFS [18], GOT10k [17], TrackingNet [24] and VOT-ST2020 [19].

Our proposed tracker runs at 47 FPS on a CPU, making it the first real-time transformer-based tracker for CPU architectures.

In summary, our contributions are:

  • We introduce an efficient transformer layer based on the use of a novel Exemplar Attention.

  • We incorporate the proposed transformer layer into a Siamese-based tracking architecture and, thereby, significantly increase robustness with negligible effect on the run-time.

  • We present the first transformer-based tracking architecture that is capable of running in real-time on a CPU.

2 Related Work

Siamese Trackers

In recent years, Siamese trackers have gained significant popularity due to their performance and simplicity. The Siamese-based tracking framework formulates visual object tracking as a template matching problem, utilizing the cross-correlation between a template patch and a search patch. The original work of Bertinetto et al. introduced SiamFC [3], the first model incorporating feature correlation into a Siamese framework. Li et al. [21] introduced region proposal networks to increase efficiency and obtain more accurate bounding boxes. More recent advances on the Siamese tracker front include the use of additional branches [30], refinement modules for more precise bounding box regression [35], and various model update mechanisms [13, 14, 36, 37]. Unlike previous Siamese trackers, we propose an Exemplar Attention module, which is integrated into the prediction heads and achieves real-time frame rates on a CPU.
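The correlation step at the heart of this framework can be illustrated with a short sketch. The snippet below is a minimal, hypothetical example of a depthwise cross-correlation between template and search features; the tensor shapes are assumed, and the exact (pointwise) correlation variant used later in this paper may differ in its details.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(template: torch.Tensor, search: torch.Tensor) -> torch.Tensor:
    """Depthwise cross-correlation between template and search features.

    template: (B, C, Ht, Wt) features of the template patch
    search:   (B, C, Hs, Ws) features of the search patch
    returns:  (B, C, Hs-Ht+1, Ws-Wt+1) correlation map
    """
    b, c, ht, wt = template.shape
    # Treat every (batch, channel) pair as its own group so that each channel of the
    # template is correlated only with the matching channel of the search patch.
    search = search.reshape(1, b * c, *search.shape[2:])
    kernel = template.reshape(b * c, 1, ht, wt)
    corr = F.conv2d(search, kernel, groups=b * c)
    return corr.reshape(b, c, *corr.shape[2:])

# Toy usage with assumed feature sizes.
t = torch.randn(2, 96, 8, 8)    # template features
s = torch.randn(2, 96, 16, 16)  # search features
print(depthwise_xcorr(t, s).shape)  # torch.Size([2, 96, 9, 9])
```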

Transformers in Tracking

The Transformer [27] was introduced as a module to improve the learning of long-range dependencies in neural machine translation by enabling every element to attend to all others. In computer vision, transformers have been used in image [2] and video [31] classification, object detection [6], and even multi-task learning of dense prediction tasks [5]. More related to our work, transformers have also been utilized to advance the performance of visual trackers. STARK [33] utilizes transformers to model the global spatio-temporal feature dependencies between the target object and the search regions. This is achieved by integrating a dynamically updated template into the encoder, in addition to the regular search and template patches. The authors of [29] introduce a transformer architecture that improves the standard Siamese-like pipeline by additionally exploiting temporal context. The encoder mutually reinforces multiple template features by leveraging self-attention blocks. In the decoder, the template and search branches are bridged by cross-attention blocks in order to propagate temporal context. [7] also improves Siamese-based trackers by replacing the regular correlation operation with a transformer-based feature fusion network. The transformer-based fusion model aggregates global information, providing a superior alternative to the standard linear correlation operation. In this work, we also design a transformer architecture for tracking. Unlike the previous transformers for tracking, the Exemplar Transformer is lightweight and can be utilized on computationally limited hardware while running in real-time.

Efficient Tracking Architectures

With the increasing demand for real-time visual tracking in applications such as autonomous driving and human-computer interfaces, efficient deep tracking architectures are essential. Surprisingly, however, little attention has been paid to efficient trackers that can operate on computationally limited hardware. KCF [16] and fDSST [11] are two examples of trackers employing hand-crafted features that operate in real-time on a CPU. While fast, their reliance on hand-crafted features significantly hinders their accuracy and robustness, resulting in subpar performance compared to newer and more complex methods. In contrast, we present an efficient deep tracker that operates at a comparable run-time but performs on par with far more expensive deep trackers. More related to our work, LightTrack [34] employs NAS to find a lightweight and efficient Siamese tracking architecture. We instead propose an efficient transformer layer that can complement existing architectural advances such as LightTrack. Specifically, our transformer layer acts as a drop-in replacement for convolutional layers and can increase performance with negligible effect on the runtime.

3 Efficient Tracking with Transformers

Figure 2: Comparison of our exemplar attention module (right) and the standard scaled dot-product attention module presented in [27]. The matching blocks are indicated by the same color. The line thickness indicates the size of the input tensors.

Figure 3: Comparison of Scaled Dot-Product Attention (SDPA) and Exemplar Attention (EA). The attention map (red) in SDPA is an $n \times n$ matrix, where $n$ is the length of the input sequence, whereas in EA its dimensions are reduced to the number of queries times the small number of exemplars used in our experiments. In SDPA, $V$ is a projection of the input, whereas in EA, $V$ corresponds to the exemplars computed by convolving the input with a learned weight matrix; the exemplars correspond to the respective feature maps of the convolution. As a consequence, the computation of a single output pixel (shown in purple) has a computational complexity that scales with the sequence length $n$ in SDPA, while in EA it scales only with the number of exemplars, which is much smaller than $n$.

Striking a balance between well-performing object trackers and run-time speeds that fall within the real-time envelope is a challenging problem when deploying on computationally limited devices. In this section, we introduce the Exemplar Transformer, an efficient transformer architecture for visual object tracking that is capable of operating in real-time on standard CPUs. While lightweight, our Exemplar Transformer significantly closes the performance gap to the prohibitively expensive transformer-based trackers [33, 29, 7]. Sec. 3.1 first recaps the original Transformer of Vaswani et al. [27] and then introduces the efficient Exemplar Transformer. Sec. 3.2 introduces our tracker, E.T.Track: it outlines the overall architecture and presents how Exemplar Transformers are utilized within the tracker.

3.1 Exemplar Transformers

Standard Transformer

The Transformer [27], introduced for machine translation, receives a one-dimensional input sequence $X \in \mathbb{R}^{n \times d}$ of $n$ feature vectors of dimension $d$. The input sequence is processed by a series of transformer layers defined as

$$ T(X) = F\big(A(X) + X\big). \qquad (1) $$

The function $F$ is a lightweight feedforward network that independently projects each feature vector. The function $A$ represents a self-attention layer that acts across the entire sequence. Specifically, the authors used the “Scaled Dot-Product Attention”, defined as

$$ A(X) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}}\right) V. \qquad (2) $$

The queries $Q$, keys $K$, and values $V$ represent projections of the input sequence, while $\sqrt{d}$ is a normalization constant. The self-attention, therefore, computes a similarity score between all representations, linearly combines the feature representations, and accordingly adapts the input representation in eq. 1. The computational complexity of eq. 2 is $\mathcal{O}(n^2 \cdot d)$, i.e. it scales quadratically with the length of the input sequence.

The direct connection of all feature vectors enables the learning of long-range dependencies. As discussed earlier, however, these benefits come at the cost of poor computational and memory scaling properties. This is further exacerbated in vision-based tasks, where high-resolution information is desired to improve the quality of the predictions. Consequently, this limits the use of transformers in applications with strict latency constraints.
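For reference, the scaled dot-product attention of eq. 2 can be written in a few lines of PyTorch. The snippet below is a minimal sketch with assumed tensor sizes, shown only to make the quadratic cost explicit; it is not the tracker's actual code.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """q, k, v: (n, d) projections of the input sequence (eq. 2)."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (n, n) similarity matrix
    weights = torch.softmax(scores, dim=-1)          # row-wise normalization
    return weights @ v                               # (n, d) re-weighted values

x = torch.randn(1024, 256)                            # n = 1024 tokens, d = 256 (assumed)
wq, wk, wv = (torch.randn(256, 256) for _ in range(3))
out = scaled_dot_product_attention(x @ wq, x @ wk, x @ wv)  # O(n^2 * d) cost
```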

Exemplar Attention

We now introduce Exemplar Attention, the key building block of the Exemplar Transformer layer. We assume that, while the direct connection between all features is essential in machine translation, it is not necessary for vision tasks. By decreasing the number of feature vectors, the transformer is significantly accelerated. We describe the required modifications of the individual components below.

The standard query function projects every spatial location of the feature map independently to a query space. Unlike machine translation, where every feature represents a specific word or token, adjacent spatial representations in vision tasks often correspond to identical objects. We therefore hypothesize that a coarser representation is sufficiently descriptive while significantly decreasing the computational complexity. Consequently, we aggregate the information of the feature map $X \in \mathbb{R}^{H \times W \times d}$, where $H \times W$ represents the spatial dimensions. Specifically, we use a 2D adaptive average pooling layer with output size $S \times S$, followed by a flattening operation, denoted jointly by $\Psi_S$, decreasing the output spatial dimension to $S^2$. The compressed representation of $X$ is then projected to a query space as in the standard self-attention formulation,

$$ Q = \Psi_S(X)\, W^{Q}. \qquad (3) $$

In our experiments we set $S = 1$, providing the maximum attainable efficiency. While the adaptive pooling has been set to the extreme, we conjecture that, for single object tracking, a query size of one is sufficient. This is further supported by the success of global pooling in classification architectures [15] and transformer-based object detection [6].
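As a concrete illustration of eq. 3, the query can be computed with an adaptive average pooling layer followed by a linear projection. The following lines are a sketch under the assumption S = 1 and an assumed feature dimension; they do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

d = 256                            # feature dimension (assumed)
pool = nn.AdaptiveAvgPool2d(1)     # aggregate the H x W feature map to 1 x 1 (S = 1)
w_q = nn.Linear(d, d, bias=False)  # projection to the query space (eq. 3)

x = torch.randn(8, d, 16, 16)      # (batch, d, H, W) correlation-map features
q = w_q(pool(x).flatten(1))        # (batch, d): a single coarse query per sample
```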

The keys and values, as presented in eq. 2, are per-spatial-location linear projections of the input. This eliminates the spatial bias built into convolutional layers. Instead, self-attention enables the learning of spatial correlations, at the cost of every feature attending to all others. Rather than requiring a fine-grained feature map and relying only on intra-sample relationships, we learn a small set of exemplar representations that captures dataset information. To this end, we optimize a small set of exemplar keys $K_e \in \mathbb{R}^{E \times d}$ that, unlike the formulation in eq. 2, are independent of the input. The similarity matrix therefore associates the coarse query representation of eq. 3 with the exemplars. To enable our attention layer to also operate on the local level, we replace the value projection with a convolutional operation,

$$ V = X \ast W^{V}, \qquad (4) $$

where the exemplar value filters $W^{V} \in \mathbb{R}^{E \times d \times k \times k}$ can be of any spatial dimension $k$, while the number of exemplars $E$ can be chosen arbitrarily. We use a small number of exemplars in our experiments, which is significantly smaller than the spatial dimension $H \times W$ and the feature dimension $d$.

Our efficient exemplar attention is therefore defined as

$$ \hat{A}(X) = \mathrm{softmax}\!\left(\frac{\Psi_S(X) W^{Q} K_e^{\top}}{\sqrt{d}}\right)\left(X \ast W^{V}\right), \qquad (5) $$

but can also be written as

$$ \hat{A}(X) = X \ast \left[\mathrm{softmax}\!\left(\frac{\Psi_S(X) W^{Q} K_e^{\top}}{\sqrt{d}}\right) W^{V}\right]. \qquad (6) $$

In other words, eq. 5 first applies all exemplar value filters and then linearly combines the resulting feature maps with the attention weights, whereas eq. 6 first combines the exemplar filters into a single filter and then convolves the input once. While eq. 5 and eq. 6 are mathematically identical, eq. 6 can be implemented more efficiently for small spatial dimensions $H \times W$. On the other hand, for large spatial dimensions $H \times W$, eq. 5 is preferred.
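To make eqs. 3-6 concrete, the module below sketches one possible PyTorch implementation of Exemplar Attention. The number of exemplars, the kernel size, the weight initialization and the per-sample loop are assumptions made for illustration; the weighted-filter path corresponds to eq. 6.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExemplarAttention(nn.Module):
    """Sketch of Exemplar Attention (eqs. 3-6); hyper-parameters are assumed."""

    def __init__(self, dim: int, num_exemplars: int = 4, kernel_size: int = 3):
        super().__init__()
        self.k = kernel_size
        self.w_q = nn.Linear(dim, dim, bias=False)                  # query projection (eq. 3)
        self.keys = nn.Parameter(torch.randn(num_exemplars, dim))   # exemplar keys, input-independent
        # One value filter bank per exemplar: (E, dim_out, dim_in, k, k), cf. eq. 4.
        self.values = nn.Parameter(
            torch.randn(num_exemplars, dim, dim, kernel_size, kernel_size) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, d, h, w = x.shape
        q = self.w_q(F.adaptive_avg_pool2d(x, 1).flatten(1))        # (B, d): coarse query, S = 1
        attn = torch.softmax(q @ self.keys.t() / math.sqrt(d), -1)  # (B, E) similarity to exemplars
        # Eq. 6: mix the exemplar filters per sample, then convolve only once.
        mixed = torch.einsum('be,eoikj->boikj', attn, self.values)  # (B, d, d, k, k)
        out = torch.cat([
            F.conv2d(x[i:i + 1], mixed[i], padding=self.k // 2) for i in range(b)])
        return out                                                  # (B, d, H, W)

layer = ExemplarAttention(dim=128)
y = layer(torch.randn(2, 128, 16, 16))   # output keeps the spatial size of the input
```

Since the attention weights differ per sample, the sketch mixes the filters per sample before a single convolution, which is the arrangement of eq. 6; applying all exemplar filters first and then combining the resulting feature maps would instead follow eq. 5.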

Exemplar attention, while inspired by scaled dot-product attention, is conceptually different. In self-attention, $Q$, $K$ and $V$ are projections of the input to their corresponding feature spaces, with the similarity function learning the spatial relationships. Instead, to significantly accelerate the similarity function, we utilize convolutions to capture local features, while the similarity function operates on the global level. Furthermore, self-attention relies on intra-sample relationships and therefore requires fine-grained representations. We instead exploit dataset-level information by projecting our input onto exemplar representations, minimizing the need for computationally expensive self-attention modules.

Computational costs and path length

Table 1 compares the traditional self-attention layer [27], a convolutional layer, and the Exemplar Attention layer. We list the computational complexity, as well as the maximum path length between any two input and output positions, as introduced in [27]. In vision-based tasks, the spatial size $n = H \cdot W$ is often significantly larger than the feature dimension $d$. Like scaled dot-product attention, an exemplar attention layer, as seen in Table 1, only needs a single operation to connect all positions of the input sequence. Moreover, as shown in Sec. 3.1, the computational complexity of Exemplar Attention is significantly smaller than that of scaled dot-product attention. Besides a constant factor given by the number of exemplars, the computational complexity matches that of regular convolutions. This underlines the strength of our approach: increased model capacity with only a marginal increase in computational load.

Layer Type          | Max. Path Length        | Computational Complexity
Self-Attention [27] | $\mathcal{O}(1)$        | $\mathcal{O}(n^2 \cdot d)$
Convolution         | $\mathcal{O}(\log_k n)$ | $\mathcal{O}(k \cdot n \cdot d^2)$
Exemplar Attention  | $\mathcal{O}(1)$        | $\mathcal{O}(E \cdot k \cdot n \cdot d^2)$
Table 1: Computational complexity and maximum path length of the different modules ($n$: sequence length, $d$: feature dimension, $k$: kernel size, $E$: number of exemplars).

3.2 E.T.Track Architecture

In this section we introduce the base architecture used throughout our work. While Exemplar Transformers can be incorporated into any tracking architecture, we evaluate its efficacy on lightweight Siamese trackers. An overview of the E.T.Track architecture can be seen in figure 4.

Our model employs the lightweight backbone model introduced in [34]. The model was identified by NAS on a search space consisting only of efficient and lightweight building blocks. The feature-extracting backbone consists of convolutional layers, depthwise separable convolutional layers, and mobile inverted bottleneck layers with squeeze and excitation modules.

Figure 4: E.T.Track - a Siamese tracking pipeline that incorporates Exemplar Transformers in the tracker head.

The Exemplar Transformer layer can act as a drop-in replacement for any convolution operation in the architecture. We replace all the convolutions in the classification and bounding box regression branches, while keeping the lightweight backbone architecture untouched. This eliminates the need for retraining the backbone on ImageNet and minimizes the increase in model complexity.
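The drop-in replacement described above amounts to a simple module swap over the tracker head. The helper below is a hypothetical sketch (the factory argument and the decision to swap every Conv2d are assumptions) of how the convolutions in the classification and regression branches could be exchanged while the backbone is left untouched.

```python
import torch.nn as nn

def swap_convs_for_exemplar_transformers(head: nn.Module, make_layer) -> nn.Module:
    """Recursively replace Conv2d modules in the tracker head with Exemplar Transformer layers.

    head:       classification or bounding box regression branch (the backbone is not touched)
    make_layer: factory such as `lambda channels: ExemplarTransformerLayer(channels)` (assumed name)
    """
    for name, child in head.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(head, name, make_layer(child.in_channels))          # drop-in replacement
        else:
            swap_convs_for_exemplar_transformers(child, make_layer)     # recurse into sub-modules
    return head
```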

Search and template frames are initially processed by the backbone network. The similarity between the representations is computed by a pointwise cross-correlation. The resulting correlation map is then fed into the tracker head, where it is processed by a classification branch and a bounding box regression branch in parallel. The bounding box regression branch predicts the distance to all four sides of the bounding box. The classification branch predicts, for every region, whether it is part of the foreground or the background. Since, during training, the bounding box regression branch considers all pixels within the ground-truth bounding box as training samples, the model is able to determine the exact location of the object even when only small parts of the input image are classified as foreground. The model is trained by optimizing a weighted combination of the IoU loss between the predicted and the ground-truth bounding box and the binary cross-entropy (BCE) loss. For more details, as well as more information on the data preprocessing, we refer the reader to [38].
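The training objective described above, a weighted combination of an IoU loss for the regression branch and binary cross-entropy for the classification branch, can be sketched as follows. The tensor layouts and the loss weights are assumptions; the IoU computation follows the usual left-top-right-bottom parameterization of the box branch.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred_ltrb: torch.Tensor, target_ltrb: torch.Tensor) -> torch.Tensor:
    """IoU loss for (left, top, right, bottom) distances to the box sides, shape (N, 4)."""
    pred_area = (pred_ltrb[:, 0] + pred_ltrb[:, 2]) * (pred_ltrb[:, 1] + pred_ltrb[:, 3])
    tgt_area = (target_ltrb[:, 0] + target_ltrb[:, 2]) * (target_ltrb[:, 1] + target_ltrb[:, 3])
    # Both boxes are measured from the same location, so the overlap along each axis
    # is the sum of the smaller distances on either side.
    iw = torch.min(pred_ltrb[:, 0], target_ltrb[:, 0]) + torch.min(pred_ltrb[:, 2], target_ltrb[:, 2])
    ih = torch.min(pred_ltrb[:, 1], target_ltrb[:, 1]) + torch.min(pred_ltrb[:, 3], target_ltrb[:, 3])
    inter = iw.clamp(min=0) * ih.clamp(min=0)
    iou = inter / (pred_area + tgt_area - inter + 1e-6)
    return (1.0 - iou).mean()

def tracking_loss(box_pred, box_target, cls_logits, cls_target,
                  lambda_iou=1.0, lambda_bce=1.0):
    """Weighted sum of IoU loss (regression branch) and BCE (classification branch)."""
    reg = iou_loss(box_pred, box_target)
    cls = F.binary_cross_entropy_with_logits(cls_logits, cls_target)
    return lambda_iou * reg + lambda_bce * cls
```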

4 Experiments

We first present implementation details of our tracker in section 4.1. The results of our experiments are presented in section 4.2, followed by an ablation study in section 4.3. Code and trained models will be released upon publication.

4.1 Implementation Details

Model

All of our models have been trained on an Nvidia GTX TITAN X and evaluated on an Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz. All reported models use the lightweight backbone of the mobile architecture presented in [34]. The backbone is initialized with parameters pre-trained on ImageNet. During training, the entire backbone is frozen for the first 10 epochs. The classification branch consists of a depthwise-separable (DWS) Conv-BN-ReLU layer, followed by an Exemplar Transformer consisting of 4 layers, another DWS-Conv-BN-ReLU layer, and a regular Conv layer. The bounding box regression branch has the same structure, except that its Exemplar Transformer consists of 6 layers. The input-independent exemplar values of the attention module are initialized using the Kaiming initialization method, whereas the exemplar keys are initialized with normally distributed parameters. The FFN following the attention module consists of 2 linear layers with 1024 and 256 hidden units, combined with a ReLU activation, dropout with a ratio of 0.1, and LayerNorm layers.
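The layer structure described above can be sketched as a small PyTorch module: Exemplar Attention with a residual connection, followed by the FFN with 1024 and 256 units, dropout of 0.1 and LayerNorm. The attention module argument refers to the sketch given in Sec. 3.1, and the channel width of 256 as the layer dimension is an assumption.

```python
import torch
import torch.nn as nn

class ExemplarTransformerLayer(nn.Module):
    """Attention block followed by the FFN described above (a sketch; sizes are assumed)."""

    def __init__(self, attn: nn.Module, dim: int = 256, hidden: int = 1024, dropout: float = 0.1):
        super().__init__()
        self.attn = attn                      # e.g. the ExemplarAttention sketched in Sec. 3.1
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(             # two linear layers with 1024 and 256 units
            nn.Linear(dim, hidden), nn.ReLU(inplace=True), nn.Dropout(dropout),
            nn.Linear(hidden, dim), nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:     # x: (B, dim, H, W)
        x = x + self.attn(x)                                 # residual around the attention module
        x = self.norm1(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        y = self.ffn(x.permute(0, 2, 3, 1))                  # FFN applied per spatial location
        x = x + y.permute(0, 3, 1, 2)                        # residual around the FFN
        return self.norm2(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

# Usage sketch: a stack of 4 such layers for the classification branch, e.g.
# nn.Sequential(*[ExemplarTransformerLayer(ExemplarAttention(256)) for _ in range(4)])
```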

Training

We use the training splits of LaSOT [12], TrackingNet [24], GOT10k [17] and COCO [22] to train our model. Our trackers are implemented in Python using PyTorch. We train for 50 epochs on 3 GPUs, sampling 32 image pairs per batch per GPU and 500,000 pairs in total. We use SGD [25] with a momentum of 0.9 and weight decay. We use a step learning rate scheduler with a warmup period of 5 epochs, during which we increase the learning rate from 0.02 to 0.1. From epoch 5 to 50, we decrease the learning rate logarithmically, starting from 0.1. After unfreezing the backbone in epoch 10, the backbone parameters are trained with a learning rate smaller than the global learning rate. The sampled image pairs consist of a search frame and a template frame. The frames from which the template and search patches are cropped are sampled within different ranges: for LaSOT [12] the two frames are sampled within a range of 100 frames, for GOT10k [17] within a range of 100 frames, for TrackingNet [24] within a range of 30, and for COCO [22] within a range of 1. Both patches are randomly shifted and scaled. The training of our E.T.Track architecture is based on the training framework used in LightTrack [34], which, in turn, is based on the framework introduced in OCEAN [38], except for the feature alignment module.
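The learning rate schedule (linear warm-up from 0.02 to 0.1 over the first 5 epochs, followed by a logarithmic decay until epoch 50) can be expressed with a LambdaLR. The snippet below is a sketch: the final learning rate and the weight decay value are placeholders, since their exact values are not given above.

```python
import math
import torch

def make_scheduler(optimizer, warmup_epochs=5, total_epochs=50,
                   lr_start=0.02, lr_peak=0.1, lr_end=1e-4):
    """Warm up linearly from lr_start to lr_peak, then decay logarithmically to lr_end.

    lr_end is a placeholder: the final learning rate is not specified above.
    The optimizer is assumed to be constructed with lr=lr_peak.
    """
    def factor(epoch):
        if epoch < warmup_epochs:
            lr = lr_start + (lr_peak - lr_start) * epoch / warmup_epochs
        else:
            t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
            lr = math.exp(math.log(lr_peak) + t * (math.log(lr_end) - math.log(lr_peak)))
        return lr / lr_peak  # LambdaLR multiplies the base lr by this factor

    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=factor)

# Usage sketch: SGD with momentum 0.9, as in the training setup above
# (the weight decay value here is an assumption).
model = torch.nn.Linear(8, 8)  # stand-in for the tracker
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)
sched = make_scheduler(opt)
```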

4.2 Comparison with State-of-the-Art

In this section, we compare our proposed E.T.Track with the current state-of-the-art methods on 6 benchmark datasets: OTB-100 [32], NFS [18], UAV-123 [23], LaSOT [12], TrackingNet [24], and VOT2020 [19]. We compare our tracker to two different classes of trackers: on the one hand, transformer-based trackers, and on the other hand, trackers that are capable of running in real-time on a CPU.

Figure 5: Success plot on the LaSOT dataset [12]. The CPU real-time trackers are indicated by regular lines, while the CPU non-real-time trackers are indicated by dashed lines in colder colors. Our tracker clearly outperforms the other real-time trackers. E.T.Track also outperforms some of the more established trackers, such as DiMP [4], and is only slightly worse than PrDiMP [10].
Dataset         | SiamRPN++ [20] | DiMP-50 [4] | PrDiMP-50 [10] | SiamR-CNN [28] | TransT [7] | TrDiMP [29] | TrSiam [29] | STARK-ST50 [33] | ECOhc [9] | ATOM [8] | LT-Mobile [34] | E.T.Track (ours)
NFS             | 50.2 | 62.0 | 63.5 | 63.9 | 65.7 | 66.5 | 65.8 | 66.4 | 46.6 | 58.4 | 55.3 | 59.0
UAV-123         | 61.3 | 65.3 | 68.0 | 64.9 | 69.4 | 67.5 | 67.4 | 68.8 | 51.3 | 64.2 | 62.5 | 62.3
OTB-100         | 69.6 | 68.4 | 69.6 | 70.1 | 69.1 | 71.1 | 70.8 | 67.3 | 64.3 | 66.9 | 66.2 | 67.8
CPU Speed (fps) | 15*  | 15*  | 15*  | 15*  | 5    | 6    | 6    | 9    | 25   | 25   | 47   | 47
Table 2: State-of-the-art comparison on the NFS, OTB-100 and UAV-123 datasets in terms of AUC score. SiamRPN++ through STARK-ST50 are non-real-time trackers, while ECOhc through E.T.Track run in real-time on a CPU. The overall best score is highlighted in blue, while the best real-time score is highlighted in red. All models were benchmarked on the same CPU. The CPU speed of models that are neither transformer-based nor considered real-time trackers is marked with a *. As they are not relevant for our comparison and they all use a ResNet-50 backbone, we simply benchmarked the forward pass of a ResNet-50 to obtain an upper bound on the tracking speed (in FPS) for those models.
Metric          | SiamRPN++ [20] | DiMP-50 [4] | PrDiMP-50 [10] | SiamR-CNN [28] | TransT [7] | TrDiMP [29] | TrSiam [29] | STARK-ST50 [33] | ECO [9] | ATOM [8] | LT-Mobile [34] | E.T.Track (ours)
Prec. (%)       | 69.38 | 68.7 | 70.4 | 80.0 | 80.3 | 73.1 | 72.7 | -    | 48.86 | 64.84 | 69.5 | 70.55
N. Prec. (%)    | 79.98 | 80.1 | 81.6 | 85.4 | 86.7 | 83.3 | 82.9 | 86.1 | 62.14 | 77.11 | 77.9 | 80.32
Success (%)     | 73.3  | 74.0 | 75.8 | 81.2 | 81.4 | 78.4 | 78.1 | 81.3 | 56.13 | 70.34 | 72.5 | 74.98
CPU Speed (fps) | 15*   | 15*  | 15*  | 15*  | 5    | 6    | 6    | 9    | 25    | 25    | 47   | 47
Table 3: Comparison with state-of-the-art trackers on the TrackingNet test set [24], consisting of 511 sequences. The trackers are compared in terms of precision (Prec.), normalized precision (N. Prec.) and success (AUC score), in percent. SiamRPN++ through STARK-ST50 are non-real-time trackers, while ECO through E.T.Track run in real-time on a CPU. The overall best score is highlighted in blue, while the best real-time score is highlighted in red. The CPU speed of models that are neither transformer-based nor considered real-time trackers is marked with a *. As they are not relevant for our comparison and they all use a ResNet-50 backbone, we simply benchmarked the forward pass of a ResNet-50 to obtain an upper bound on the tracking speed (in FPS) for those models.
Metric          | DiMP [4] | SuperDiMP [1] | STARK-ST50 [33] | KCF [16] | SiamFC [3] | ATOM [8] | LT-Mobile [34] | E.T.Track (ours)
EAO (↑)         | 0.274 | 0.305 | 0.308 | 0.154 | 0.179 | 0.271 | 0.242 | 0.267
Accuracy (↑)    | 0.457 | 0.477 | 0.478 | 0.407 | 0.418 | 0.462 | 0.422 | 0.432
Robustness (↑)  | 0.740 | 0.786 | 0.799 | 0.432 | 0.502 | 0.734 | 0.689 | 0.741
CPU Speed (fps) | 15*   | 15*   | 9     | 95    | 46    | 25    | 47    | 47
Table 4: Comparison with state-of-the-art trackers on VOT-ST2020 [19]. The trackers are compared in terms of the expected average overlap (EAO), accuracy and robustness. DiMP, SuperDiMP and STARK-ST50 are non-real-time trackers, while KCF through E.T.Track run in real-time on a CPU. We only compare our model to other bounding-box-predicting trackers. The overall best score is highlighted in blue, while the best real-time score is highlighted in red. The CPU speed of models that are neither transformer-based nor considered real-time trackers is marked with a *. As they are not relevant for our comparison and they all use a ResNet-50 backbone, we simply benchmarked the forward pass of a ResNet-50 to obtain an upper bound on the tracking speed (in FPS) for those models.

LaSOT [12]: We evaluate our approach on the 280 test sequences of the LaSOT dataset. The dataset is highly challenging, featuring very long sequences with an average of 2500 frames per sequence. Robustness is therefore essential to score highly on LaSOT. The success plots are shown in figure 5. The transformer-based trackers are shown in colder colors, while the faster CPU trackers are color-coded by warmer colors. The tracking speed on a CPU is indicated by the color intensity of the graph. Compared to online-learning methods such as STARK that use a dynamically updated template, our model only uses the features of the template patch extracted from the first frame of the sequence. Nevertheless, our model is very robust and achieves a higher AUC score than the popular DiMP tracker [4]. Compared to the lightweight mobile architecture of LightTrack [34], our model improves the success score by a large margin while achieving the same speed.

NFS [18]: We evaluate our approach on the NFS dataset containing fast-moving objects. The scores are shown in table 2. While the state-of-the-art performance is achieved by TrDiMP [29], our model reaches an AUC score of 59.0%, outperforming the mobile version of LightTrack by 3.7%. (The IoU-Net used in TrDiMP and TrSiam [29] was removed from the models during the CPU benchmarking, since some of the modules it uses only run on a GPU. Hence, the measured values of 5.68 and 5.89 FPS, respectively, can be considered upper bounds on their tracking speed.)

OTB-100 [32]: OTB-100 contains 100 challenging sequences. As shown in table 2, the current state-of-the-art is set by the recently introduced TrDiMP [29], reaching an AUC score of 71.1%. Our model reaches an AUC score of 67.8%.

UAV-123 [23]: UAV-123 contains a total of 123 sequences from an aerial viewpoint. The results in terms of AUC are shown in table 2. Our model reaches an AUC score of 62.3%.

TrackingNet [24]: We evaluate our approach on the 511 sequences of the TrackingNet test set. The results are shown in table 3. As can be seen on the right side of the table, our Exemplar Transformer tracker E.T.Track outperforms all other real-time trackers. We improve the precision of the LightTrack mobile architecture by roughly 1%, the normalized precision by 2.4% and the AUC by 2.5%. Comparing E.T.Track to more complex transformer-based trackers such as TrSiam [29], our model is only 2.2% worse in terms of precision, 2.6% in terms of normalized precision and 3.1% in terms of AUC, while running almost 8 times faster on a CPU.

VOT-ST2020 [19]: We evaluate our tracker on the anchor-based short-term tracking benchmark VOT-ST2020. In comparison with other tracking datasets, VOT2020 defines various anchor points that are placed many frames apart. The tracker is then run from every anchor either forward or backward, depending on which direction leads to the longer subsequence. The trackers are evaluated in terms of accuracy, robustness and expected average overlap (EAO). The accuracy is a weighted combination of the average overlap between the ground truth and the predicted target positions on the subsequences defined by the anchors. The robustness indicates the percentage of frames before the tracker fails, on average. The EAO is a measure of the overall tracking performance and combines the accuracy and the robustness. The results are shown in table 4. Our model outperforms the lightweight convolutional baseline introduced in [34] in terms of both accuracy and robustness; the largest performance increase can be noted in terms of robustness, where our model improves the score by 5.2%. This confirms that our Exemplar Transformers significantly contribute to the robustness of a tracker compared to its convolutional counterpart. Furthermore, our tracker's performance is comparable to that of DiMP [4]: even though our model achieves a slightly lower accuracy, its robustness is marginally better.

4.3 Ablation Study

In this section, we analyze the different components of our approach in an incremental manner. The experiments are performed on OTB-100 [32], NFS [18], and LaSOT [12]. The trackers are evaluated in terms of the AUC metric [32].

Baseline. The Siamese tracker that we use as a baseline (sometimes also referred to as the convolutional baseline) corresponds to the mobile architecture of the LightTrack model introduced in [34]. The similarity of the search and template patch features is computed by a pointwise cross-correlation and passed to the tracker head consisting of two branches. The classification branch as well as the bounding box regression branch of the tracker head consist of convolutional layers, depthwise separable convolutional layers and skip connections. It employs the same lightweight backbone model as in our approach and other ablative settings. The performance of the baseline model is shown in the first row of table 5.

Exemplar Attention. We replace the convolutional layers in the baseline model with exemplar attention layers. We add a residual connection followed by a normalization layer after every attention module (abbreviated by “Att.” in table 5). On NFS [18] and LaSOT, the performance increases by 1.3% and 1.5%, respectively. This demonstrates the effectiveness of our Exemplar Attention modules, given that the number of layers and all other settings remain unchanged.

FFN. Similar to the encoder layer architecture presented in [27], we add a lightweight feed-forward network followed by a LayerNorm layer (abbreviated by “FFN” in table 5). This model corresponds to the model shown in figure 4. The additional expressivity introduced by the FFN improves the performance on all three datasets. The highest performance increase is achieved on LaSOT [12], where the AUC score increases by 5.5%. These results clearly demonstrate the power of the Exemplar Transformer.

Template conditioning. The queries used in the Exemplar Transformer are solely based on a transformed version of the initial correlation map. In the model abbreviated by “T-Cond.” in table 5, we explore the impact of adding template information to our Exemplar Attention module. To this end, we average-pool the feature map corresponding to the template patch and add it to the regular input in every layer. The richer queries lead to a 1% improvement on NFS [18]. However, on OTB-100 [32] and LaSOT [12], the model did not benefit from the additional information.

Conv | Att. | FFN | T-Cond. | NFS  | OTB-100 | LaSOT
✓    | -    | -   | -       | 55.3 | 66.2    | 52.1
-    | ✓    | -   | -       | 56.6 | 65.8    | 53.6
-    | ✓    | ✓   | -       | 58.0 | 67.3    | 59.1
-    | ✓    | ✓   | ✓       | 59.0 | 66.9    | 57.9
Table 5: Impact of each component in terms of AUC (%) on NFS [18], OTB-100 [32] and LaSOT [12]. The first row corresponds to the mobile LightTrack [34] architecture, which is our convolutional baseline.

5 Conclusion

We propose a novel efficient transformer layer based on Exemplar Attention. Exemplar Attention utilizes a coarse representation of the input sequence and jointly learns a small set of exemplar representations. This leads to a significant decrease in the total number of operations and, thus, to a remarkable speedup. The proposed transformer layer can be used throughout the architecture, e.g. as a substitute for a convolutional layer. Having the same computational complexity as standard convolutional layers while being more expressive, the proposed Exemplar Transformer layers can significantly improve the robustness of tracking models. E.T.Track, our Siamese tracker that incorporates Exemplar Transformers, significantly improves performance compared to the convolutional baseline at a negligible increase in run-time. E.T.Track is capable of running in real-time on computationally limited devices such as standard CPUs, making it the only real-time transformer-based tracker.

References

  • [1] Pytracking. https://github.com/visionml/pytracking. Accessed: 2021-11-16.
  • [2] Irwan Bello, Barret Zoph, Ashish Vaswani, Jonathon Shlens, and Quoc V Le. Attention augmented convolutional networks. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3286–3295, 2019.
  • [3] Luca Bertinetto, Jack Valmadre, Joao F Henriques, Andrea Vedaldi, and Philip HS Torr. Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pages 850–865. Springer, 2016.
  • [4] Goutam Bhat, Martin Danelljan, Luc Van Gool, and Radu Timofte. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6182–6191, 2019.
  • [5] David Bruggemann, Menelaos Kanakis, Anton Obukhov, Stamatios Georgoulis, and Luc Van Gool. Exploring relational context for multi-task dense prediction. arXiv preprint arXiv:2104.13874, 2021.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [7] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    , pages 8126–8135, June 2021.
  • [8] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Atom: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4660–4669, 2019.
  • [9] Martin Danelljan, Goutam Bhat, Fahad Shahbaz Khan, and Michael Felsberg. Eco: Efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6638–6646, 2017.
  • [10] Martin Danelljan, Luc Van Gool, and Radu Timofte. Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7183–7192, 2020.
  • [11] Martin Danelljan, Gustav Häger, Fahad Shahbaz Khan, and Michael Felsberg. Discriminative scale space tracking. IEEE transactions on pattern analysis and machine intelligence, 39(8):1561–1575, 2016.
  • [12] Heng Fan, Liting Lin, Fan Yang, Peng Chu, Ge Deng, Sijia Yu, Hexin Bai, Yong Xu, Chunyuan Liao, and Haibin Ling. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5374–5383, 2019.
  • [13] Junyu Gao, Tianzhu Zhang, and Changsheng Xu. Graph convolutional tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4649–4659, 2019.
  • [14] Qing Guo, Wei Feng, Ce Zhou, Rui Huang, Liang Wan, and Song Wang. Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE international conference on computer vision, pages 1763–1771, 2017.
  • [15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [16] João F Henriques, Rui Caseiro, Pedro Martins, and Jorge Batista. High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence, 37(3):583–596, 2014.
  • [17] Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • [18] Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, and Simon Lucey. Need for speed: A benchmark for higher frame rate object tracking. In Proceedings of the IEEE International Conference on Computer Vision, pages 1125–1134, 2017.
  • [19] Matej Kristan, Aleš Leonardis, Jiří Matas, Michael Felsberg, Roman Pflugfelder, Joni-Kristian Kämäräinen, Martin Danelljan, Luka Čehovin Zajc, Alan Lukežič, Ondrej Drbohlav, et al. The eighth visual object tracking vot2020 challenge results. In European Conference on Computer Vision, pages 547–601. Springer, 2020.
  • [20] Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, and Junjie Yan. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4282–4291, 2019.
  • [21] Bo Li, Junjie Yan, Wei Wu, Zheng Zhu, and Xiaolin Hu. High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8971–8980, 2018.
  • [22] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [23] Matthias Mueller, Neil Smith, and Bernard Ghanem. A benchmark and simulator for uav tracking. In European conference on computer vision, pages 445–461. Springer, 2016.
  • [24] Matthias Muller, Adel Bibi, Silvio Giancola, Salman Alsubaihi, and Bernard Ghanem. Trackingnet: A large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 300–317, 2018.
  • [25] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
  • [26] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. Transtrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
  • [27] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [28] Paul Voigtlaender, Jonathon Luiten, Philip HS Torr, and Bastian Leibe. Siam r-cnn: Visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6578–6588, 2020.
  • [29] Ning Wang, Wengang Zhou, Jie Wang, and Houqiang Li. Transformer meets tracker: Exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 1571–1580, June 2021.
  • [30] Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, and Philip HS Torr. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1328–1338, 2019.
  • [31] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7794–7803, 2018.
  • [32] Yi Wu, Jongwoo Lim, and Ming-Hsuan Yang. Online object tracking: A benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2411–2418, 2013.
  • [33] Bin Yan, Houwen Peng, Jianlong Fu, Dong Wang, and Huchuan Lu. Learning spatio-temporal transformer for visual tracking. arXiv preprint arXiv:2103.17154, 2021.
  • [34] Bin Yan, Houwen Peng, Kan Wu, Dong Wang, Jianlong Fu, and Huchuan Lu. Lighttrack: Finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15189, 2021.
  • [35] Bin Yan, Xinyu Zhang, Dong Wang, Huchuan Lu, and Xiaoyun Yang. Alpha-refine: Boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5289–5298, 2021.
  • [36] Tianyu Yang and Antoni B Chan. Learning dynamic memory networks for object tracking. In Proceedings of the European conference on computer vision (ECCV), pages 152–167, 2018.
  • [37] Lichao Zhang, Abel Gonzalez-Garcia, Joost van de Weijer, Martin Danelljan, and Fahad Shahbaz Khan. Learning the model update for siamese trackers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4010–4019, 2019.
  • [38] Zhipeng Zhang, Houwen Peng, Jianlong Fu, Bing Li, and Weiming Hu. Ocean: Object-aware anchor-free tracking. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXI 16, pages 771–787. Springer, 2020.

Appendix A Algorithm

We present the pseudocode corresponding to eq. 5, depicted on the right side of figure 2, in Algorithm 1. The more efficient implementation corresponding to eq. 6 is shown in Algorithm 2.

function ExemplarAttention(X):                        # X: feature map of shape (d, H, W); shapes assumed
      Q ← flatten(AdaptiveAvgPool(X, S)) · W_Q        # equation 3, with S = 1 in our experiments
      A ← softmax(Q · K_eᵀ / sqrt(d))                 # similarity between the query and the E exemplar keys
      V ← conv(X, W_V)                                # equation 4: E exemplar feature maps of shape (d, H, W)
      Y ← Σ_e A[e] · V[e]                             # combine the exemplar feature maps with the attention weights
      return Y
end
 
Algorithm 1 Pseudocode of the Exemplar Transformer layer, eq. 5.
function ExemplarAttention(X):                        # X: feature map of shape (d, H, W); shapes assumed
      Q ← flatten(AdaptiveAvgPool(X, S)) · W_Q        # equation 3, with S = 1 in our experiments
      A ← softmax(Q · K_eᵀ / sqrt(d))                 # similarity between the query and the E exemplar keys
      W ← Σ_e A[e] · W_V[e]                           # mix the E exemplar filters into a single filter
      Y ← conv(X, W)                                  # a single convolution with the mixed filter
      return Y
end
 
Algorithm 2 Pseudocode of the Exemplar Transformer layer, eq. 6.

Appendix B VOT-RT2020

We evaluate our tracker on the anchor-based real-time tracking challenge VOT-RT2020 [19]. The results are shown in table 6. While the accuracy of our model is comparable to that of the lightweight convolutional baseline introduced in [34], our model is approximately 5.6% better in terms of robustness. This confirms that our Exemplar Transformers significantly contribute to the robustness of a tracker compared to its convolutional counterpart. Our approach thus also achieves a better overall EAO score.

Metric          | DiMP [4] | SuperDiMP [1] | KCF [16] | SiamFC [3] | ATOM [8] | LT-Mobile [34] | E.T.Track (ours)
EAO (↑)         | 0.241 | 0.289 | 0.154 | 0.172 | 0.237 | 0.217 | 0.227
Accuracy (↑)    | 0.434 | 0.472 | 0.406 | 0.422 | 0.440 | 0.418 | 0.418
Robustness (↑)  | 0.700 | 0.767 | 0.434 | 0.479 | 0.687 | 0.607 | 0.663
CPU Speed (fps) | 15*   | 15*   | 95    | 46    | 25    | 47    | 47
Table 6: Comparison with state-of-the-art trackers on VOT-RT2020 [19]. The trackers are compared in terms of the expected average overlap (EAO), accuracy and robustness. DiMP and SuperDiMP are non-real-time trackers, while KCF through E.T.Track run in real-time on a CPU. We only compare our model to other bounding-box-predicting trackers. The overall best score is highlighted in blue, while the best real-time score is highlighted in red. The CPU speed of models that are neither transformer-based nor considered real-time trackers is marked with a *. As they are not relevant for our comparison and they all use a ResNet-50 backbone, we simply benchmarked the forward pass of a ResNet-50 to obtain an upper bound on the tracking speed (in FPS) for those models.

Appendix C Video Visualizations

We provide a set of videos for a visualization of our tracker's performance. We show a direct comparison between our E.T.Track model and LT-Mobile [34]. We list the video sequences used for the comparisons in table 7. The person8-2 sequence of the UAV-123 dataset [23], showing a man running on grass, nicely demonstrates that our tracker does not lose track of the target even if it partly moves out of the frame, and that it recovers completely when the target moves back into the frame. On this video sequence, LT-Mobile [34] performs comparably well. Human7 from the OTB dataset [32] shows a woman walking. Even though the video appears to be jittery, the appearance and the shape of the target object only change marginally. Our model achieves an average overlap of 88%, which is 7% higher than LT-Mobile [34]. The video boat-9 from the UAV-123 dataset [23] demonstrates that our model can handle situations in which the target gets very small due to an increased distance to the camera, as well as appearance changes. Unlike LT-Mobile, E.T.Track does not lose track of the boat even as it turns. In sequence basketball-3 of NFS [18], our tracker head is more robust to similar objects than LT-Mobile [34] and does not confuse the basketball with the player's head. The superior performance of our tracker on the sequences boat-9 and basketball-3 can be explained by the increased robustness of our model, attributed to the increased capacity introduced by the Exemplar Transformer layers. In the sequence drone-2 from the LaSOT dataset [12], the target object shortly moves completely out of the frame. When the target object re-enters the scene, its appearance is different from that of the initial frame. Furthermore, the target object's location deviates from the tracker's search range when re-entering the scene. These two aspects pose a problem for our model and are inherent limitations of the tracking inference pipeline used in our approach. Our tracker, as well as LT-Mobile [34], closely follows the inference code presented in [38]. Specifically, the tracking pipeline contains a post-processing step in which the predicted bounding boxes are refined. Changes in size as well as changes of the bounding box aspect ratio are penalized. Furthermore, both models only consider an image patch of a much smaller size than the actual input image around the previously predicted target location. Thus, disappearing objects are, by the design of the tracking pipeline from [38], hard to recover and pose a problem for both trackers. Addressing this issue by integrating our Exemplar Transformer layer into a tracker that directly predicts bounding boxes without any post-processing (such as the tracker head used in STARK [33]) is left for future work.

Sequence     | Dataset      | LT-Mobile [34] | E.T.Track
person8-2    | UAV-123 [23] | 0.889 | 0.915
Human7       | OTB [32]     | 0.813 | 0.883
boat-9       | UAV-123 [23] | 0.483 | 0.803
basketball-3 | NFS [18]     | 0.259 | 0.707
drone-2      | LaSOT [12]   | 0.192 | 0.887
Table 7: Direct per-sequence comparison of E.T.Track and LT-Mobile [34] on various sequences in terms of average overlap (AO). The best performance is highlighted in red.

Appendix D Attributes

Table 8 presents the results of various trackers on different sequence attributes of the LaSOT dataset [12]. It can be seen that we consistently outperform the other real-time trackers by a significant margin. The attribute on which we improve most compared to LT-Mobile [34] is Full Occlusion (+27% relative to LT-Mobile). The second largest improvement is achieved on the attribute Fast Motion (+22% relative to LT-Mobile). This can also be explained by the increased robustness. When comparing our model to the slower state-of-the-art (SOTA) transformer-based models, our model reaches about 89 ± 2% of the best performance for most of the attributes. The main weaknesses of our model compared to the SOTA performance are the attributes Viewpoint Change (82% relative to the SOTA), Full Occlusion (82% relative to the SOTA), Fast Motion (76% relative to the SOTA), Out-of-View (82% relative to the SOTA) and Low Resolution (82% relative to the SOTA). Referring back to the limitations of our tracking pipeline presented in the previous section, we note that even though occlusion and fast motion are the attributes on which our tracker performs the worst compared to the SOTA, the increased robustness of our Exemplar Transformer layers significantly improves on LT-Mobile. These insights are also confirmed by the videos listed in the previous section. This analysis paves the path for future research in the design of additional modules for efficient tracking, specifically designed to tackle the identified challenging attributes.

Tracker    | Illumination Variation | Partial Occlusion | Deformation | Motion Blur | Camera Motion | Rotation | Background Clutter | Viewpoint Change | Scale Variation | Full Occlusion | Fast Motion | Out-of-View | Low Resolution | Aspect Ratio Change | Total
STARK-ST50 | 66.8 | 64.3 | 66.9 | 62.9 | 69.0 | 66.1 | 57.3 | 67.8 | 66.1 | 58.7 | 53.8 | 62.1 | 59.4 | 64.9 | 66.4
TransT     | 65.2 | 62.0 | 67.0 | 63.0 | 67.2 | 64.3 | 57.9 | 61.7 | 64.6 | 55.3 | 51.0 | 58.2 | 56.4 | 63.2 | 64.9
TrDiMP     | 67.5 | 61.1 | 64.4 | 62.4 | 68.1 | 62.4 | 58.9 | 62.8 | 63.4 | 56.4 | 53.0 | 60.7 | 58.1 | 62.3 | 63.9
TrSiam     | 63.8 | 60.1 | 63.8 | 61.1 | 65.5 | 62.0 | 55.1 | 60.8 | 62.5 | 54.5 | 50.6 | 58.9 | 56.0 | 61.2 | 62.6
PrDiMP50   | 63.3 | 57.1 | 61.3 | 58.0 | 64.0 | 59.1 | 55.4 | 61.7 | 60.1 | 51.6 | 49.2 | 57.0 | 54.8 | 59.0 | 60.5
E.T.Track  | 61.3 | 55.7 | 61.0 | 55.5 | 60.2 | 58.1 | 51.8 | 55.9 | 58.8 | 48.7 | 41.1 | 51.1 | 48.8 | 56.9 | 59.1
DiMP       | 59.5 | 52.1 | 56.6 | 54.6 | 59.3 | 54.5 | 49.7 | 56.7 | 55.8 | 47.5 | 45.6 | 49.5 | 49.1 | 54.5 | 56.0
LT-Mobile  | 55.0 | 48.9 | 57.3 | 45.8 | 52.9 | 51.2 | 43.3 | 49.9 | 51.9 | 38.3 | 33.6 | 43.7 | 40.8 | 49.9 | 52.1
SiamRPN++  | 53.0 | 46.6 | 52.8 | 44.2 | 51.3 | 48.5 | 44.9 | 44.4 | 49.4 | 36.6 | 31.6 | 41.6 | 38.5 | 47.2 | 49.5
SiamFC     | 34.6 | 30.6 | 35.1 | 30.8 | 33.3 | 31.0 | 30.8 | 28.6 | 33.2 | 24.5 | 19.5 | 25.6 | 25.2 | 30.8 | 33.6
Table 8: LaSOT [12] attribute-based analysis. Each column corresponds to the results computed on all sequences in the dataset with the corresponding attribute. The trackers that do not run in real-time are greyed out. The overall best score is highlighted in blue, while the best real-time score is highlighted in red.

Appendix E Detailed Results on NFS, UAV-123 and OTB-100

We depict the success plot of the NFS dataset [18] in figure 6, the success plot of the UAV-123 dataset [23] in figure 7, and the success plot of the OTB-100 dataset [32] in figure 8. For efficient trackers, we limit our comparison to the mobile architecture presented in LightTrack [34], as SiamRPN++ [20] and SiamFC [3] were consistently outperformed by a large margin. While not real-time on a CPU, we additionally compare our tracker to the more complex transformer-based trackers STARK-ST50 [33], TrDiMP [29], TrSiam [29] and TransT [7], as well as DiMP [4] and PrDiMP [10], depicted with dashed lines in colder colors. The values shown in these plots correspond to our own evaluation results and may deviate slightly from the results presented in the main paper, which were taken directly from the respective papers. As can be seen, our model outperforms LT-Mobile [34] on all but one benchmark dataset. We further want to highlight the shrinking gap between the complex transformer-based trackers and our lightweight real-time CPU tracker. Closing this gap even further while maintaining real-time speed will be crucial for future work in order to deploy high-performing trackers on computationally limited edge devices.

Figure 6: Success plot on the NFS dataset [18]. The CPU real-time trackers are indicated by the regular lines while the CPU non-real-time trackers are indicated by the dashed lines in colder colors. Our tracker clearly outperforms LT-Mobile [34].
Figure 7: Success plot on the UAV-123 dataset [23]. The CPU real-time trackers are indicated by the regular lines while the CPU non-real-time trackers are indicated by the dashed lines in colder colors. The performance of our tracker is comparable to the performance of LT-Mobile, although slightly worse.
Figure 8: Success plot on the OTB-100 dataset [32]. The CPU real-time trackers are indicated by the regular lines while the CPU non-real-time trackers are indicated by the dashed lines in colder colors. Our tracker clearly outperforms LT-Mobile.