Learning Target Candidate Association to Keep Track of What Not to Track

03/30/2021 ∙ by Christoph Mayer, et al.

The presence of objects that are confusingly similar to the tracked target poses a fundamental challenge in appearance-based visual tracking. Such distractor objects are easily misclassified as the target itself, leading to eventual tracking failure. While most methods strive to suppress distractors through more powerful appearance models, we take an alternative approach. We propose to keep track of distractor objects in order to continue tracking the target. To this end, we introduce a learned association network, allowing us to propagate the identities of all target candidates from frame to frame. To tackle the problem of lacking ground-truth correspondences between distractor objects in visual tracking, we propose a training strategy that combines partial annotations with self-supervision. We conduct comprehensive experimental validation and analysis of our approach on several challenging datasets. Our tracker sets a new state-of-the-art on six benchmarks, achieving an AUC score of 67.2% on the LaSOT dataset.


1 Introduction

Generic visual object tracking is one of the fundamental problems in computer vision. The task involves estimating the state of the target object in every frame of a video sequence, given only the initial target location. Most prior research has been devoted to the development of robust appearance models, used for locating the target object in each frame. The two currently dominating paradigms are Siamese networks [2, 37, 36] and discriminative appearance modules [3, 13]. While the former employs template matching in a learned feature space, the latter constructs an appearance model through a discriminative learning formulation. Although these approaches have demonstrated promising performance in recent years, they are effectively limited by the quality and discriminative power of the appearance model.

As one of the most challenging factors, the co-occurrence of distractor objects similar in appearance to the target is a common problem in real-world tracking applications [4, 64, 53]. Appearance-based models struggle to identify the sought target in such cases, often leading to tracking failure. Moreover, the target object may undergo a drastic appearance change over time, further complicating the discrimination between target and distractor objects. In certain scenarios, e.g. as visualized in Fig. 1, it is even virtually impossible to unambiguously identify the target solely based on appearance information. Such circumstances can only be addressed by leveraging other cues during tracking, for instance the spatial relations between objects. We therefore set out to address problematic distractors by exploring such alternative cues.

Figure 1: Visualization of the proposed target candidate association network used for tracking. For each target candidate we extract a set of features such as score, position, and appearance in order to associate candidates across frames. The proposed target candidate association network then allows us to associate these candidates with the detected distractors and the target object of the previous frame. Lines connecting circles represent associations.

We propose to actively keep track of distractor objects in order to ensure more robust target identification. To this end, we introduce a target candidate association network that matches distractor objects as well as the target across frames. Our approach consists of a base appearance tracker, from which we extract target candidates in each frame. Each candidate is encoded with a set of distinctive features, consisting of the target classifier score, location, and appearance. The encodings of all candidates are jointly processed by a graph-based candidate embedding network. From the resulting embeddings, we compute the association scores between all candidates in subsequent frames, allowing us to keep track of the target and distractor objects over time. In addition, we estimate a target detection confidence, used to increase the robustness of the target classifier.

While associating target candidates over time provides a powerful cue, learning such a matching network requires tackling a few key challenges. In particular, generic visual object tracking datasets only provide annotations of one object in each frame, the target. As a result, there is a lack of ground-truth annotations for associating distractor objects across frames. Moreover, the definition of a distractor is not universal and depends on the properties of the appearance model of the employed tracker. We address these challenges by introducing an approach that allows our candidate matching network to learn from real tracker output. Our approach exploits the single-target annotations in existing tracking datasets in combination with a self-supervised strategy. Furthermore, we actively mine the training dataset to retrieve rare and challenging cases where distractor association is important, in order to learn a more effective model.

Contributions:  In summary, our contributions are as follows: (i) We propose a method for target candidate association based on a learnable candidate matching network. (ii) We develop an online object association method in order to propagate distractors and the target over time, and introduce a sample confidence score to update the target classifier more effectively during inference. (iii) We tackle the challenges of incomplete annotation by employing partial supervision, self-supervised learning, and sample mining to train our association network. (iv) We perform comprehensive experiments and ablative analysis by integrating our approach into the recent SuperDiMP tracker [17, 3]. Our approach sets a new state-of-the-art on six tracking datasets, obtaining an AUC of 67.2% on LaSOT [23] and 69.8% on UAV123 [43].

2 Related Work

Discriminative appearance model based trackers [12, 3, 30, 33, 59, 16] aim to suppress distractors based on their appearance by integrating background information when learning the target classifier online. While this often increases robustness, the capacity of an online appearance model is still limited. A few works have therefore developed more dedicated strategies for handling distractors. Bhat et al. [4] combine an appearance-based tracker with an RNN to propagate information about the scene across frames. Their method internally aims to track all regions in the scene by maintaining a learnable state representation. Other methods exploit the existence of distractors explicitly in the method formulation. DaSiamRPN [64] handles distractor objects by subtracting corresponding image features from the target template during online tracking. Xiao et al. [53] use the locations of distractors in the scene and employ hand-crafted rules to classify image regions into background and target candidates in each frame. SiamR-CNN [49] associates subsequent detections across frames using a hand-crafted association score to form short tracklets. In contrast, we introduce a learnable network that explicitly associates target candidates from frame to frame.

Many online trackers [12, 3, 13] employ a memory to store previous predictions used to fine-tune the tracker. Typically, the oldest sample is replaced in the memory, and an age-based weight controls the contribution of each sample when updating the tracker online. Danelljan et al. [15] propose to learn the tracking model and the training sample weights jointly. LTMU [11] combines an appearance-based tracker with a learned meta-updater. The goal of the meta-updater is to predict whether the employed online tracker is ready to be updated or not. In contrast, we use a learned target candidate association network to compute a confidence score and combine it with the sample age to manage the tracker updates.

The object association problem naturally arises in multi-object tracking (MOT). The dominant paradigm in MOT is tracking-by-detection [6, 1, 57, 47, 60], where tracking is posed as the problem of associating object detections over time. The latter is typically formulated as a graph partitioning problem. These methods are usually non-causal and thus require the detections from all frames in the sequence. Furthermore, MOT typically focuses on a limited set of object classes [18], such as pedestrians, for which strong object detectors are available. In comparison, we aim at tracking a different object in each sequence, defined solely by the initial frame. Furthermore, we lack ground-truth correspondences of distractor objects from frame to frame, whereas the ground-truth correspondences of different objects are typically provided in MOT datasets [18]. Most importantly, we aim at associating target candidates that are defined by the tracker itself, while MOT methods associate all detections that correspond to one of the sought objects.

3 Method

Figure 2: Overview of the entire online tracking pipeline processing the previous and current frames jointly to predict the target object.

Here, we describe our tracking approach, which actively associates distractor objects and the sought target across multiple frames.

3.1 Overview

First we give an overview of our tracking pipeline, which is shown in Fig. 2. We employ a base tracker with a discriminative appearance model and internal memory. In particular, we adopt the SuperDiMP [14] tracker, which employs the target classifier in DiMP [3] and the probabilistic bounding-box regression from [17], together with improved training settings.

We use the base tracker to predict the target score map for the current frame and extract target candidates at locations with a high target score. Then, we extract a set of features for each candidate, namely its target classifier score, its location in the image, and an appearance cue based on the backbone features of the base tracker. We encode this set of features into a single feature vector for each candidate. These representations, together with the equivalent ones of the previous frame (already extracted before), are fed into the candidate embedding network and processed jointly to obtain an enriched embedding for each candidate. We use these embeddings to compute a similarity matrix, from which we estimate the candidate assignment matrix between the two consecutive frames using an optimal matching strategy.

Once the candidate-to-candidate assignment probabilities are estimated, we build the set of objects currently visible in the scene and associate them to the previously identified objects, i.e., we determine which objects disappeared, newly appeared, or stayed visible and can be associated unambiguously. We then use this propagation strategy to reason about the target object in the current frame. Additionally, we compute a target detection confidence, used to manage the memory and to control the sample weight when updating the target classifier online.

3.2 Problem Formulation

First, we define the set of target candidates, which includes distractors and the sought target, as $\mathcal{V} = \{v_1, \dots, v_N\}$, where $N$ denotes the number of candidates present in the frame. Using the introduced notation, we define the target candidate sets $\mathcal{V}^{t-1}$ and $\mathcal{V}^{t}$ corresponding to the previous and current frame, respectively. We formulate the problem of target candidate association across two subsequent frames as finding the assignment matrix $A \in \{0, 1\}^{N^{t-1} \times N^{t}}$ between the two sets $\mathcal{V}^{t-1}$ and $\mathcal{V}^{t}$. If the target candidate $v^{t-1}_i$ corresponds to $v^{t}_j$, then $A_{i,j} = 1$, and $A_{i,j} = 0$ otherwise.

In practice, a matching candidate may not exist for every candidate. Therefore, we introduce dustbins, which are commonly used in graph matching [45, 19], to actively handle non-matching nodes. The idea is to match each unmatched candidate to the dustbin of the missing side. Therefore, we augment the assignment matrix by an additional row and column representing the dustbins, yielding $\tilde{A} \in \{0, 1\}^{(N^{t-1}+1) \times (N^{t}+1)}$. It follows that a newly appearing candidate $v^{t}_j$ – which is only present in the set $\mathcal{V}^{t}$ – leads to the entry $\tilde{A}_{N^{t-1}+1,\,j} = 1$. Similarly, a candidate $v^{t-1}_i$ that is no longer available in the set $\mathcal{V}^{t}$ results in $\tilde{A}_{i,\,N^{t}+1} = 1$. To solve the assignment problem, we design a learnable approach that predicts the matrix $\tilde{A}$. Our approach first extracts a representation of the target candidates, discussed next.
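To make the dustbin-augmented assignment concrete, the following minimal sketch (our own illustration, not code from the paper) builds the augmented binary matrix $\tilde{A}$ for a toy frame pair and checks the one-to-one constraints; the index convention, with the last row and column acting as dustbins, matches the formulation above.

```python
import torch

def build_augmented_assignment(matches, M, N):
    """Toy construction of the dustbin-augmented assignment matrix.

    matches: list of (i, j) pairs; i indexes candidates of the previous
             frame (0..M-1), j indexes candidates of the current frame (0..N-1).
    Returns an (M+1) x (N+1) binary matrix where row M and column N are the
    dustbins for newly appeared / disappeared candidates.
    """
    A = torch.zeros(M + 1, N + 1)
    matched_prev = {i for i, _ in matches}
    matched_curr = {j for _, j in matches}
    for i, j in matches:
        A[i, j] = 1.0                      # candidate-to-candidate match
    for i in range(M):
        if i not in matched_prev:
            A[i, N] = 1.0                  # candidate disappeared -> dustbin column
    for j in range(N):
        if j not in matched_curr:
            A[M, j] = 1.0                  # candidate newly appeared -> dustbin row
    return A

# Example: 3 candidates in the previous frame, 2 in the current frame,
# only candidate 0 keeps its identity.
A = build_augmented_assignment([(0, 0)], M=3, N=2)
assert torch.all(A[:3, :].sum(dim=1) == 1)  # every previous candidate is assigned
assert torch.all(A[:, :2].sum(dim=0) == 1)  # every current candidate is assigned
```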

3.3 Target Candidate Extraction

Here, we describe how to detect and represent target candidates and propose a set of features and their encoding. We define the set of target candidates as all unique coordinates that correspond to a local maximum of the target score map $s$ and exceed a minimal score. Thus, each target candidate $v_i$ with coordinate $c_i$ needs to fulfill the following two constraints,

$\delta(c_i, s) = 1 \quad \text{and} \quad s(c_i) \geq \tau, \qquad (1)$

where $\delta(c_i, s)$ returns 1 if the score at $c_i$ is a local maximum of $s$ and 0 otherwise, and $\tau$ denotes a threshold. This definition allows us to build the sets $\mathcal{V}^{t-1}$ and $\mathcal{V}^{t}$ by retrieving the local maxima of $s^{t-1}$ and $s^{t}$ with sufficient score value. We use a max-pooling operation in a local neighbourhood to retrieve the local maxima of $s$.
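The candidate extraction step can be sketched with a few lines of PyTorch; the neighbourhood size and the threshold value below are placeholders, since the exact values are not given here.

```python
import torch
import torch.nn.functional as F

def extract_candidates(score_map, tau=0.05, nms_ks=5):
    """Return (row, col) coordinates of local maxima of the target score map
    whose score exceeds tau. Neighbourhood size and threshold are placeholders."""
    s = score_map[None, None]                                # (1, 1, H, W)
    local_max = F.max_pool2d(s, nms_ks, stride=1, padding=nms_ks // 2)
    is_peak = (s == local_max) & (s >= tau)                  # local maximum + score test
    coords = is_peak[0, 0].nonzero(as_tuple=False)           # (K, 2) -> (row, col)
    scores = score_map[coords[:, 0], coords[:, 1]]
    return coords, scores

score_map = torch.zeros(23, 23)
score_map[5, 7], score_map[15, 18] = 0.9, 0.4                # two synthetic peaks
coords, scores = extract_candidates(score_map)
print(coords.tolist(), scores.tolist())
```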

For each candidate we build a set of features exploiting two assumptions. First, we assume small motion of the same objects from frame to frame, and thus similar locations and similar distances between different objects. Hence, the position $c_i$ of a target candidate forms a strong cue. In addition, we assume small changes in appearance for each object. Thus, we use the target classifier score $s_i = s(c_i)$ as another cue. In order to add a more discriminative appearance-based feature $f_i$, we process the backbone features (used in the baseline tracker) with a single learnable convolution layer. Finally, we build a feature tuple $(s_i, c_i, f_i)$ for each target candidate. These features are combined into a single encoding $z_i$, where a Multi-Layer Perceptron (MLP) maps the score $s_i$ and the position $c_i$ to the same dimensional space as the appearance feature $f_i$. This encoding permits jointly reasoning about appearance, target similarity, and position.
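A minimal sketch of such an encoder is given below. The additive combination of the MLP output with the appearance feature follows the SuperGlue-style keypoint encoder; the feature dimension, layer widths, and the exact way the cues are combined are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class CandidateEncoder(nn.Module):
    """Encode (score, position, appearance) of each candidate into one vector."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 32), nn.ReLU(),
            nn.Linear(32, 64), nn.ReLU(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, score, pos, app_feat):
        # score: (K, 1), pos: (K, 2) normalized to [0, 1], app_feat: (K, feat_dim)
        geom = torch.cat([score, pos], dim=-1)        # (K, 3) geometric/score cues
        return app_feat + self.mlp(geom)              # joint encoding per candidate

enc = CandidateEncoder()
z = enc(torch.rand(4, 1), torch.rand(4, 2), torch.rand(4, 256))
print(z.shape)  # torch.Size([4, 256])
```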

3.4 Candidate Embedding Network

In order to further enrich the encoded features, and in particular to facilitate extracting features that are aware of neighbouring candidates, we employ a candidate embedding network. On an abstract level, our association problem bears similarities with the task of sparse feature matching. In order to incorporate information of neighbouring candidates, we thus take inspiration from recent advances in this area. In particular, we adopt the SuperGlue [45] architecture that establishes the current state-of-the-art in sparse feature matching. Its design allows the network to exchange information between different nodes, to handle occlusions, and to estimate the assignment of nodes across two images. In particular, the candidate features of both frames translate to nodes of a single complete graph with two types of directed edges: 1) self edges within the same frame and 2) cross edges connecting only nodes between the frames. The idea is then to exchange information either along self or cross edges.

The adopted architecture [45] uses a Graph Neural Network (GNN) with message passing that sends the messages in an alternating fashion across self or cross edges to produce a new feature representation for each node after every layer. Moreover, an attention mechanism computes the messages, using self-attention for self edges and cross-attention for cross edges. After the last message passing layer, a linear projection layer extracts the final feature representation $e_i$ for each candidate $v_i$.
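The sketch below imitates this alternating message passing with standard multi-head attention layers as a stand-in for the SuperGlue-style GNN; the number of layers, heads, and the residual update are illustrative choices (the architecture used here stacks nine layers of each type, see Sec. A.3).

```python
import torch
import torch.nn as nn

class CandidateEmbeddingNet(nn.Module):
    """Alternate self- and cross-attention between the candidate encodings of the
    previous and the current frame. Layer count, width and head number are placeholders."""

    def __init__(self, dim=256, num_layers=4, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(dim, num_heads, batch_first=True)
             for _ in range(2 * num_layers)]
        )
        self.proj = nn.Linear(dim, dim)   # final linear projection

    def forward(self, z_prev, z_curr):
        # z_prev: (1, M, dim), z_curr: (1, N, dim)
        for i, attn in enumerate(self.layers):
            if i % 2 == 0:   # self edges: attend within the same frame
                z_prev = z_prev + attn(z_prev, z_prev, z_prev)[0]
                z_curr = z_curr + attn(z_curr, z_curr, z_curr)[0]
            else:            # cross edges: attend to the other frame
                upd_prev = attn(z_prev, z_curr, z_curr)[0]
                upd_curr = attn(z_curr, z_prev, z_prev)[0]
                z_prev, z_curr = z_prev + upd_prev, z_curr + upd_curr
        return self.proj(z_prev), self.proj(z_curr)

net = CandidateEmbeddingNet()
e_prev, e_curr = net(torch.rand(1, 3, 256), torch.rand(1, 2, 256))
print(e_prev.shape, e_curr.shape)  # (1, 3, 256) (1, 2, 256)
```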

3.5 Candidate Matching

To represent the similarities between candidates $v^{t-1}_i$ and $v^{t}_j$, we construct the similarity matrix $S$. The sought similarity is measured using the scalar product, $S_{i,j} = \langle e^{t-1}_i, e^{t}_j \rangle$, for the feature vectors $e^{t-1}_i$ and $e^{t}_j$ corresponding to the candidates $v^{t-1}_i$ and $v^{t}_j$.

As previously introduced, we make use of the dustbin concept [19, 45] to actively match candidates that miss their counterparts to the so-called dustbins. However, a dustbin is a virtual candidate without any feature representation. Thus, the similarity score between candidates and the dustbins cannot be predicted directly; instead, a candidate corresponds to a dustbin only if its similarity scores to all other candidates are sufficiently low. In this process, the similarity matrix $S$ represents only an initial association prediction between candidates, disregarding the dustbins. Note that a candidate corresponds either to another candidate or to the dustbin of the other frame. When two candidates are matched, the one-to-one matching constraints require that the corresponding row and column of the assignment matrix each sum to one. These constraints, however, do not apply for missing matches, since multiple candidates may correspond to the same dustbin. Therefore, the constraints for the dustbins read as follows: all candidates not matched to another candidate must be matched to a dustbin, i.e., the dustbin row sums to $N^{t} - K$ and the dustbin column to $N^{t-1} - K$, where $K$ denotes the number of candidate-to-candidate matches. In order to solve the association problem under these constraints, we follow Sarlin et al. [45] and use the Sinkhorn-based algorithm [46, 9] therein.
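A compact log-domain Sinkhorn sketch is shown below. The fixed dustbin score and the marginals assigning mass $N^{t}$ and $N^{t-1}$ to the two dustbins follow the SuperGlue formulation; in [45] the dustbin score is a learnable parameter, whereas the constant used here is an arbitrary placeholder.

```python
import torch

def log_sinkhorn_matching(e_prev, e_curr, dustbin_score=1.0, iters=50):
    """Scalar-product similarities plus a dustbin row/column, normalised with
    Sinkhorn iterations in log space. A sketch, not the implementation of [45]."""
    M, N = e_prev.shape[0], e_curr.shape[0]
    S = e_prev @ e_curr.t()                                   # (M, N) similarities
    bins_r = torch.full((M, 1), dustbin_score)
    bins_c = torch.full((1, N + 1), dustbin_score)
    Z = torch.cat([torch.cat([S, bins_r], dim=1), bins_c], dim=0)  # (M+1, N+1)

    # log marginals: each real candidate has mass 1, each dustbin absorbs the rest
    log_mu = torch.cat([torch.zeros(M), torch.tensor([float(N)]).log()])
    log_nu = torch.cat([torch.zeros(N), torch.tensor([float(M)]).log()])
    u, v = torch.zeros(M + 1), torch.zeros(N + 1)
    for _ in range(iters):
        u = log_mu - torch.logsumexp(Z + v[None, :], dim=1)
        v = log_nu - torch.logsumexp(Z + u[:, None], dim=0)
    return (Z + u[:, None] + v[None, :]).exp()                # soft assignment probabilities

P = log_sinkhorn_matching(torch.rand(3, 256), torch.rand(2, 256))
print(P.shape, P[:-1].sum(dim=1))  # rows of real candidates sum to ~1
```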

3.6 Learning Candidate Association

Training the embedding network that parameterizes the similarity matrix used for optimal matching requires ground-truth assignments. Hence, in the domain of sparse keypoint matching, recent learning-based approaches leverage large-scale datasets [21, 45] such as MegaDepth [40] or ScanNet [10] that provide ground-truth matches. However, in tracking, such ground-truth correspondences (between distractor objects) are not available. Only the target object and its location provide a single ground-truth correspondence. Manually annotating correspondences for distracting candidates, identified by a tracker on video datasets, is expensive and may not be very useful. Instead, we propose a novel training approach that exploits (i) the partial supervision from the annotated target objects, and (ii) self-supervision by artificially mimicking the association problem. Our approach requires only the annotations that already exist in standard tracking datasets. The candidates for matching are obtained by running the base tracker only once through the training dataset.

Partially Supervised Loss:  For each pair of consecutive frames, we retrieve the two candidates corresponding to the annotated target, if available. This forms a partial supervision for a single association, while all other associations remain unknown. For the retrieved candidates $v^{t-1}_a$ and $v^{t}_b$, we define the association as the index tuple $(a, b)$. Here, we also mimic the associations for redetections and occlusions by occasionally excluding one of the corresponding candidates from $\mathcal{V}^{t-1}$ or $\mathcal{V}^{t}$. We replace the excluded candidate by the corresponding dustbin to form the correct association for supervision; the simulated associations for redetection and occlusion thus pair the remaining candidate with the dustbin of the previous or the current frame, respectively. The supervised loss for each frame pair is then given by the negative log-likelihood of the predicted assignment probability $\tilde{P}$ at the supervised entry,

$L_{\text{sup}} = -\log \tilde{P}_{a,b}. \qquad (2)$

Self-Supervised Loss:  To facilitate the association of distractor candidates, we employ a self-supervision strategy. The proposed approach first extracts a set of candidates from any given frame. The corresponding candidates for matching are identical, but with augmented features. Since the feature augmentation does not affect the associations, the initial ground-truth association set is given by the identity mapping between the two copies. In order to create a more challenging learning problem, we simulate occlusions and redetections as described above for the partially supervised loss. Note that the simulated occlusions and redetections change the candidate sets and the association set; we keep the same notation with a slight abuse for simplicity. Our feature augmentation involves randomly translating the location $c_i$, increasing or decreasing the score $s_i$, and transforming the given image before extracting the visual features $f_i$. Now, using the set $\mathcal{G}$ of simulated ground-truth associations, our self-supervised loss is given by

$L_{\text{self}} = -\sum_{(i,j) \in \mathcal{G}} \log \tilde{P}_{i,j}. \qquad (3)$

Finally, we combine both losses into a single training objective. It is important to note that real training data is used only for the former loss function, whereas synthetic data is used only for the latter one.
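Both losses reduce to a negative log-likelihood over a set of (possibly dustbin-valued) index pairs, which the small helper below illustrates; the matrix and the toy pairs are synthetic, and the helper is a sketch rather than the training code of the paper.

```python
import torch

def association_nll(P, gt_pairs):
    """Negative log-likelihood of a set of ground-truth associations.
    P is the (M+1) x (N+1) soft assignment matrix, gt_pairs a list of (i, j)
    index pairs where i == M or j == N denotes the respective dustbin.
    For the partially supervised loss gt_pairs contains at most the annotated
    target correspondence; for the self-supervised loss it contains all
    simulated correspondences."""
    idx = torch.tensor(gt_pairs, dtype=torch.long)
    return -torch.log(P[idx[:, 0], idx[:, 1]] + 1e-9).sum()

# toy usage: supervise the target match (0, 0) and one simulated occlusion (1, N)
P = torch.full((4, 3), 1.0 / 12)          # M = 3 candidates, N = 2, plus dustbins
loss = association_nll(P, [(0, 0), (1, 2)])
print(loss.item())
```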

Data Mining:  Most frames contain a candidate corresponding to the target object and are thus applicable for supervised training. However, a majority of these frames are not very informative for training, because they contain only a single candidate with a high target classifier score that corresponds to the target object. Conversely, the dataset also contains adverse situations where associating the candidate corresponding to the target object is very challenging. Such situations include sub-sequences with varying numbers of candidates, with changes in appearance, or with large motion between frames. Thus, particularly valuable for training are sub-sequences where the appearance model either fails and starts to track a distractor, or where the tracker is no longer able to detect the target with sufficient confidence. However, such failure cases are relatively rare even in large-scale datasets. Similarly, we prefer frames with many target candidates when creating synthetic sub-sequences, to simultaneously include candidate associations, redetections, and occlusions. Thus, we mine the training dataset using the dumped target candidates of the base tracker in order to retrieve more informative data for training.

Training Details:  We first retrain the base tracker SuperDiMP without the learned discriminative loss parameters but keep everything else unchanged. We split the LaSOT training set into a train-train and a train-val set. We run the base tracker on all sequences and save the search region and score map of each frame to disk. We use the dumped data to mine the dataset and to extract the target candidates and their features. We freeze the weights of the base tracker during training of the proposed network and train for 15 epochs by sampling 6400 sub-sequences per epoch from the train-train split. We sample real and synthetic data with equal probability. We use ADAM [34] with a learning rate of 0.0001 and a learning rate decay of 0.2 every 6th epoch. We run 50 Sinkhorn iterations. Please refer to the supplementary (Sec. A) for additional details about training.
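The optimizer and schedule described above translate directly into PyTorch; the module below is only a stand-in for the association network, and the loop body is omitted.

```python
import torch
import torch.nn as nn

# Stand-in for the target candidate association network sketched in the method section.
association_net = nn.Linear(256, 256)

optimizer = torch.optim.Adam(association_net.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=6, gamma=0.2)

for epoch in range(15):
    for _ in range(6400):          # sub-sequences sampled per epoch
        # sample a real (partially supervised) or a synthetic (self-supervised)
        # sub-sequence with equal probability, run the matching, compute the
        # loss, and take an optimizer step -- omitted in this sketch
        pass
    scheduler.step()
```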

3.7 Object Association

We focus in this part on applying the estimated assignments (see Sec. 3.5) in order to determine the object correspondences during online tracking. An object corresponds either to the target or to a distractor. The general idea is to keep track of every object present in the scene over time. We implement this idea with a database of objects, where each entry corresponds to an object that is visible in the current frame. Fig. 3 shows these objects as circles. An object disappears from the scene if none of the current candidates is associated with it; e.g., in Fig. 3 the purple and pink objects no longer correspond to a candidate in the last frame. Then, we delete this object from the database. Similarly, we add a new object to the database if a new target candidate appears. When initializing a new object, we assign it a new object-id, not used previously, and store the score of its candidate. In Fig. 3, object-ids are represented by colors. For objects that remain visible, we add the score of the corresponding candidate to the history of scores of this object. Furthermore, we delete the old object and create a new one if the candidate correspondence is ambiguous, i.e., if the assignment probability is smaller than a threshold.

If associating the target object across frames is unambiguous, one object keeps the same object-id as the initially provided object. Thus, we return this object as the selected target. However, in real-world scenarios the target object gets occluded, leaves the scene, or associating the target object becomes ambiguous. Then, none of the candidates corresponds to the sought target and we first need to redetect it. We redetect the target if the candidate with the highest target classifier score exceeds the redetection threshold. We declare the corresponding object the target as long as no other candidate achieves a higher score in the current frame. If another candidate does, we switch to this candidate and declare it the target, provided its score is higher than any score in the history of the currently selected object. Otherwise, we treat this object as a distractor for now, but if its score increases we may select it as the target object in the future. Please refer to the supplementary (Sec. B) for the detailed algorithm.
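The bookkeeping described in this section can be summarized by a simplified sketch; the dictionary-based database, the probability threshold value, and the id handling are simplifications of our own, and redetection and target switching are omitted.

```python
def update_object_database(objects, assignments, cand_scores, next_id, p_min=0.75):
    """Simplified bookkeeping of the objects visible in the scene.

    objects:      {object_id: [score history]} from the previous frame
    assignments:  {prev_object_id: (cand_idx, prob)} from the matching step;
                  objects without an entry are treated as disappeared
    cand_scores:  target classifier score of each current candidate
    p_min:        minimum assignment probability to keep an identity (assumed value)
    """
    new_objects, used = {}, set()
    for obj_id, history in objects.items():
        match = assignments.get(obj_id)
        if match is None:
            continue                        # no associated candidate -> object disappeared
        cand_idx, prob = match
        if prob < p_min:
            continue                        # ambiguous association -> re-initialised below
        new_objects[obj_id] = history + [cand_scores[cand_idx]]
        used.add(cand_idx)
    for cand_idx, score in enumerate(cand_scores):
        if cand_idx not in used:            # newly appearing candidate gets a fresh id
            new_objects[next_id] = [score]
            next_id += 1
    return new_objects, next_id

# two objects tracked so far (id 0 is the target); in the current frame only
# object 0 is re-associated, candidate 0 is new and becomes object 2
objects = {0: [0.9, 0.8], 1: [0.4]}
objects, next_id = update_object_database(objects, {0: (1, 0.95)}, [0.35, 0.85], next_id=2)
print(objects)   # {0: [0.9, 0.8, 0.85], 2: [0.35]}
```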

Figure 3: Visual comparison of the base tracker and our tracker. The bounding boxes represent the tracker result; green indicates correct detections and red refers to tracker failure. Each circle represents an object. Circles with the same color are connected to indicate that the object-ids are identical. If a target candidate cannot be matched with an existing object, we add a new object. Similarly, we delete an object if no candidate corresponds to it anymore in the next frame.

3.8 Memory Sample Confidence

While updating the tracker online is often beneficial, it is disadvantageous if the training samples are of poor quality. Thus, we describe a memory sample confidence score that we use to decide which sample to keep in the memory and which to replace when employing a fixed-size memory. In addition, we use the score to control the contribution of each training sample when updating the tracker online. In contrast, the base tracker replaces frames using a first-in-first-out policy if the target was detected, and weights samples during inference solely based on age.

First, we define the training sample of frame $k$ as $(x_k, y_k)$. We assume a memory of fixed size that stores samples from previous frames $k < t$, where $t$ denotes the current frame number. The loss minimized during tracking then reads as

$L(w) = \sum_{k} \gamma_k\, r(x_k, y_k; w) + \lambda \lVert w \rVert^2, \qquad (4)$

where $r$ denotes the data term, $\lVert w \rVert^2$ the regularisation term, $\lambda$ is a scalar, and $w$ represents the appearance model weights. The weights $\gamma_k$ control the impact of the sample from frame $k$, i.e., a higher value increases the influence of the corresponding sample during training. We follow other appearance-based trackers and use a learning-rate parameter to control the weights $\gamma_k$, such that older samples achieve a smaller value and their impact during training decreases. In addition, we propose a second set of weights $\nu_k$ that represent the confidence of the tracker that the predicted label is correct. Instead of removing the oldest sample to keep the memory size fixed [3], we propose to drop the sample that achieves the smallest score combining age and confidence. Thus, if the memory is full, we remove the sample with the smallest combined score of age weight $\gamma_k$ and confidence $\nu_k$ by setting its weight $\gamma_k = 0$. This means that, if all samples achieve a similar confidence, the oldest sample will be replaced, and that a new but low-confidence sample will be replaced quickly if all other samples are recent and have a higher confidence.

We describe the extraction of the confidence weights as

$\nu_k = \begin{cases} \sqrt{s^{\max}_k}, & \text{if the selected object is the initially provided target} \\ s^{\max}_k, & \text{otherwise,} \end{cases} \qquad (5)$

where $s^{\max}_k$ denotes the maximum value of the target classifier score map of frame $k$. For simplicity, we assume that $s^{\max}_k \in [0, 1]$. The condition is fulfilled if the currently selected object is identical to the initially provided target object, i.e., both objects share the same object-id. Then, it is very likely that the selected object corresponds to the target object, such that we increase the confidence using the square root function, which increases values in the range $(0, 1)$. Hence, the described confidence score combines the confidence of the target classifier with the confidence of the object association module, but fully relies on the target classifier once the target is lost.
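A small sketch of the confidence weighting and the replacement rule is given below; combining age and confidence by a product is an assumption of this sketch, as the text only states that the two terms are combined.

```python
import math

def sample_confidence(score_max, target_identity_kept):
    """Confidence of one training sample: the peak classifier score, boosted with
    a square root when the association module confirms the target identity
    (the square root increases values in (0, 1))."""
    score_max = min(max(score_max, 0.0), 1.0)
    return math.sqrt(score_max) if target_identity_kept else score_max

def index_to_replace(age_weights, confidences):
    """Replace the memory sample with the smallest combined age/confidence score.
    The product used to combine the two terms is an assumption of this sketch."""
    combined = [a * c for a, c in zip(age_weights, confidences)]
    return min(range(len(combined)), key=combined.__getitem__)

ages = [0.5, 0.7, 0.9, 1.0]           # older samples -> smaller age weight
confs = [0.95, 0.2, 0.9, 0.85]
print(index_to_replace(ages, confs))  # 1: recent but low-confidence sample goes first
```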

Inference details:  During inference, we use the same parameters as for SuperDiMP but increase the search area scale from 6 to 8 and the resolution from 352 to 480 pixels in image space. Our tracker follows the target longer before it is lost, such that small search areas occur frequently. Thus, we reset the search area to its previous size if it was drastically decreased before the target was lost, in order to facilitate redetections. In addition, if only one candidate with a high score is present in both the previous and the current frame, we select this candidate as the target and omit running the target candidate association network to speed up tracking. Please refer to the supplementary (Sec. B) for additional inference details.

4 Experiments

We evaluate our proposed tracking architecture on seven tracking benchmarks. Our approach is implemented in Python using PyTorch. On a single Nvidia GTX 2080Ti GPU, we achieve a tracking speed of 12.7 FPS.

4.1 Ablation Study

We perform an extensive analysis of the proposed tracker, memory sample confidence, and training losses.

Online tracking components:

Memory Sample Confidence | Search Area Adaptation | Target Candidate Association Network | NFS | UAV123 | LaSOT
– | – | – | 64.4 | 68.2 | 63.5
✓ | – | – | 64.7 | 68.0 | 65.0
✓ | ✓ | – | 65.2 | 69.1 | 65.8
✓ | ✓ | ✓ | 66.4 | 69.8 | 67.2
Table 1: Impact of each component in terms of AUC (%) on three datasets. The first row corresponds to our SuperDiMP baseline.

We study the importance of the memory sample confidence, the search area protocol, and the target candidate association of our final method. In Tab. 1 we analyze the impact of sequentially adding each component, and report the average of five runs on the NFS [25], UAV123 [43] and LaSOT [23] datasets. The first row reports the results of the employed base tracker. First, we add the memory sample confidence approach (Sec. 3.8) and observe similar performance on NFS and UAV123 but a significant improvement of 1.5% on LaSOT, demonstrating its potential for long-term tracking. With the added robustness, we next employ a larger search area and increase it if it was drastically shrunk before the target was lost. This leads to a fair improvement on all datasets. Finally, we add the target candidate association network, which provides substantial performance improvements on all three datasets, with a gain of 1.4% AUC on LaSOT. These results clearly demonstrate the power of the target candidate association network.

Training:

Loss | no TCA | single loss only | single loss only | combined | combined
Data-mining | n.a. | ✓ | ✓ | – | ✓
LaSOT, AUC (%) | 65.8 | 66.6 | 66.8 | 66.5 | 67.2
Table 2: Analysis on LaSOT of association learning using different loss functions, with and without data-mining.

Sample replacement with conf. score | Online updating with conf. score | Conf. score threshold | LaSOT AUC (%)
– | – | – | 63.5
✓ | – | – | 64.1
✓ | ✓ | 0.00 | 64.6
✓ | ✓ | 0.50 | 65.0
Table 3: Analysis of our memory weighting component on LaSOT.

In order to study the effect of the proposed training losses, we retrain the target candidate association network either with only the partially supervised loss or with only the self-supervised loss. We report the performance on LaSOT [23] in Tab. 2. The results show that each loss individually allows training the network and outperforming the baseline without the target candidate association network (no TCA), but that combining both losses leads to the best tracking results. In addition, training the network with the combined loss but without data-mining decreases the tracking performance.

Memory management:  We not only use the sample confidence to manage the memory but also to control the impact of samples when learning the target classifier online. In Tab. 3, we study the importance of each component by adding one after the other and report the results on LaSOT [23]. First, we use the sample confidence scores only to decide which sample to remove next from the memory. This already improves the tracking performance. Reusing these weights when learning the target classifier, as described in Eq. (4), increases the performance again. In order to suppress the impact of poor-quality samples during online learning, we ignore samples with a confidence score below 0.5. This leads to another improvement on LaSOT. The last row corresponds to the setting used in the final proposed tracker.

Ours | SiamR-CNN [49] | SuperDiMP [17] | PrDiMP [17] | TLPG [38] | TACT [8] | LTMU [11] | DiMP [3] | Ocean [62] | SiamAttN [58] | CRACT [24] | SiamGAT [27] | PG-NET [41] | ATOM [12]
Precision 70.4 68.4 65.3 60.8 60.7 57.2 56.7 56.6 53.0 50.5
Norm. Prec. 77.4 72.2 72.2 68.8 66.0 66.2 65.0 65.1 64.8 62.8 63.3 60.5 57.6
Success (AUC) 67.2 64.8 63.1 59.8 58.1 57.5 57.2 56.9 56.0 56.0 54.9 53.9 53.1 51.5
Table 4: Comparison with state-of-the-art on the LaSOT [23] test set in terms of precision, normalized precision, and success (AUC).

4.2 State-of-the-art Comparison

We compare our approach on seven tracking benchmarks. The same settings and parameters are used for all datasets. In order to ensure the significance of the results, we report the average over 5 runs on all datasets unless the evaluation protocol requires otherwise.

LaSOT [23]:  First, we compare on the large-scale LaSOT dataset (280 test sequences with 2500 frames on average) to demonstrate the robustness and accuracy of the proposed tracker. The success plot in Fig. 6a shows the overlap precision as a function of the overlap threshold. Trackers are ranked by their area-under-the-curve (AUC) score, denoted in the legend. Tab. 4 shows more results, including precision and normalized precision. Our approach outperforms the previous best tracker, SiamR-CNN, by a large margin of 2.4% in AUC and the base tracker SuperDiMP by 4.1%. The improvement in overlap precision is most prominent at lower and moderate thresholds, demonstrating the superior robustness of our tracker. In Tab. 5, we further perform an apples-to-apples comparison with KYS [4] and LTMU [11], where all methods employ the same SuperDiMP as base tracker. We outperform the best method on each metric, achieving an AUC improvement of 2.5%.

Figure 6: Success plots on (a) LaSOT [23] and (b) LaSOTExtSub [22], showing the overlap precision as a function of the overlap threshold. Our approach outperforms all other methods by a large margin in AUC, reported in the legend.
| Ours | LTMU [11] | KYS [4] | SuperDiMP [17]
Precision | 70.4 | 66.5 | 64.0 | 65.3
Norm. Prec. | 77.4 | 73.7 | 70.7 | 72.2
Success (AUC) | 67.2 | 64.7 | 61.9 | 63.1
Table 5: Results on the LaSOT [23] test set. All trackers use the same base tracker SuperDiMP.

LaSOTExtSub [22]:  We evaluate our tracker on the recently published extension subset of LaSOT. LaSOTExtSub is meant for testing only and consists of 15 new classes with 10 sequences each. The sequences are long (2500 frames on average), posing substantial challenges. Fig. 6b shows the success plot, averaged over 5 runs. All results, except ours and SuperDiMP, are obtained from [22], including DaSiamRPN [64], SiamRPN++ [36] and SiamMask [51]. Our method achieves superior results, outperforming both LTMU and SuperDiMP by significant margins.

OxUvALT [48]:  The OxUvA long-term dataset contains 166 test videos with an average length of 3300 frames. Trackers are required to predict whether the target is present or absent, in addition to the bounding box, for each frame. Trackers are ranked by the maximum geometric mean (MaxGM) of the true positive rate (TPR) and true negative rate (TNR). We use the proposed confidence score and set the threshold for target presence using the separate dev set. Tab. 6 shows the results on the test set, which are obtained through an evaluation server. Our method sets the new state-of-the-art in terms of MaxGM, achieving an improvement of 6.1% compared to the previous best method and exceeding the result of the base tracker SuperDiMP by 6.4%.

VOT2018LT [35]:  Next, we evaluate on the 2018 edition of the VOT long-term tracking challenge. We compare with the top methods in the challenge [35], as well as more recent methods. The dataset contains 35 videos with 4200 frames per sequence on average. Trackers are required to predict a confidence score that the target is present, in addition to the bounding box, for each frame. Trackers are ranked by the F-score, evaluated over a range of confidence thresholds. As shown in Tab. 7, our tracker achieves the best results in all three metrics and outperforms the base tracker SuperDiMP by 9.1% in F-score.

UAV123 [43]:  This dataset contains 123 videos and is designed to assess trackers for UAV applications. It contains small objects, fast motions, and distractor objects. Tab. 8 shows the results in terms of AUC, computed over a range of IoU thresholds. Our method sets a new state-of-the-art with an AUC of 69.8%, clearly exceeding the performance of the second best method.

OTB-100 [52]:  For reference, we also evaluate our tracker on the OTB-100 dataset. This dataset has become highly saturated in recent years, with several approaches reaching over 70% AUC, as shown in Tab. 8. Still, our approach performs similarly to the top methods, with an AUC gain over the SuperDiMP baseline.

NFS [25]:  Lastly, we report results on the 30 FPS version of the Need for Speed (NFS) dataset. It contains fast motions and challenging distractors. Tab. 8 shows that our approach significantly exceeds the previous state-of-the-art.

| Ours | LTMU [11] | SuperDiMP [17] | SiamR-CNN [49] | TACT [8] | GlobalTrack [31] | SPLT [55] | MBMD [61] | SiamFC+R [48] | TLD [32]
MaxGM | 81.2 | 75.1 | 74.8 | 72.3 | 70.9 | 60.3 | 62.2 | 54.4 | 45.4 | 43.1
TPR | 79.6 | 74.9 | 79.7 | 70.1 | 80.9 | 57.4 | 49.8 | 60.9 | 42.7 | 20.8
TNR | 82.8 | 75.4 | 70.2 | 74.5 | 62.2 | 63.3 | 77.6 | 48.5 | 48.1 | 89.5
Table 6: Results on the OxUvALT [48] test set in terms of TPR, TNR, and the maximum geometric mean of TPR and TNR (MaxGM).
Ours | LTMU [11] | SiamR-CNN [49] | PGNet [41] | SiamRPN++ [36] | SuperDiMP [17] | SPLT [55] | MBMD [61] | DaSiamLT [64, 35]
F-Score 71.3 69.0 66.8 64.2 62.9 62.2 61.6 61.0 60.7
Precision 72.7 71.0 67.9 64.9 64.3 63.3 63.4 62.7
Recall 70.3 67.2 61.0 60.9 61.0 60.0 58.8 58.8
Table 7: Results on the VOT2018LT dataset [35] in terms of F-Score, Precision and Recall.

Ours | CRACT [24] | SuperDiMP [17] | PrDiMP [17] | RetinaMAML [50] | SiamAttN [58] | DCFST [63] | SiamR-CNN [49] | SiamGAT [27] | KYS [4] | DiMP [3] | UPDT [5]
UAV123 69.8 66.4 68.1 68.0 65.0 64.9 64.6 65.3 54.5
OTB-100 70.8 72.6 70.1 69.6 71.2 71.2 70.9 70.1 71.0 69.5 68.4 70.2
NFS 66.4 62.5 64.7 63.5 64.1 63.9 63.5 62.0 53.7
Table 8: Comparison with state-of-the-art on the OTB-100 [52], NFS [25] and UAV123 [43] datasets in terms of AUC score.

5 Conclusion

We propose a novel tracking pipeline employing a learned target candidate association network in order to track both the target and distractor objects. This approach allows us to propagate the identities of all target candidates throughout the sequence. In addition, we propose a training strategy that combines partial annotations with self-supervision to compensate for the lack of ground-truth correspondences between distractor objects in visual tracking. We conduct comprehensive experimental validation and analysis of our approach on seven generic object tracking benchmarks and set the new state-of-the-art on six of them.

Acknowledgments: This work was partly supported by the ETH Zürich Fund (OK), a Siemens project, a Huawei Technologies Oy (Finland) project, an Amazon AWS grant, and an Nvidia hardware grant.

References

  • [1] P. Bergmann, T. Meinhardt, and L. Leal-Taixe (2019-10) Tracking without bells and whistles. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.
  • [2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV workshop, Cited by: §1.
  • [3] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019-10) Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §A.3, §D.2, Table 11, Table 12, §1, §1, §2, §2, §3.1, §3.8, Table 4, Table 8.
  • [4] G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte (2020) Know your surroundings: exploiting scene information for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 205–221. Cited by: Table 12, §1, §2, §4.2, Table 5, Table 8.
  • [5] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In ECCV, Cited by: §D.2, Table 12, Table 8.
  • [6] G. Braso and L. Leal-Taixe (2020-06) Learning a neural solver for multiple object tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [7] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji (2020-06) Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, Table 12.
  • [8] J. Choi, J. Kwon, and K. M. Lee (2020-11) Visual tracking by tridentalign and context embedding. In Proceedings of the Asian Conference on Computer Vision (ACCV), Cited by: Table 11, Table 4, Table 6.
  • [9] M. Cuturi (2013) Sinkhorn distances: lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems, Vol. 26. Cited by: §3.5.
  • [10] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Niessner (2017-07) ScanNet: richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.6.
  • [11] K. Dai, Y. Zhang, D. Wang, J. Li, H. Lu, and X. Yang (2020-06) High-performance long-term tracking with meta-updater. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, §2, §4.2, Table 4, Table 5, Table 6, Table 7.
  • [12] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019-06) ATOM: accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §D.2, Table 11, Table 12, §2, §2, Table 4.
  • [13] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR, Linköping University, Computer Vision. Cited by: §D.2, Table 12, §1, §2.
  • [14] M. Danelljan and G. Bhat (2019) PyTracking: visual tracking library based on PyTorch. Note: https://github.com/visionml/pytracking, Accessed: 1/08/2019. Cited by: §A.3, §3.1.
  • [15] M. Danelljan, G. Häger, F. Shahbaz Khan, and M. Felsberg (2016) Adaptive decontamination of the training set: a unified formulation for discriminative visual tracking. In CVPR, Cited by: §2.
  • [16] M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In ECCV, Cited by: §D.2, Table 12, §2.
  • [17] M. Danelljan, L. Van Gool, and R. Timofte (2020-06) Probabilistic regression for visual tracking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §A.3, §D.2, Table 11, Table 12, §1, §3.1, Table 4, Table 5, Table 6, Table 7, Table 8.
  • [18] P. Dendorfer, A. Osep, A. Milan, K. Schindler, D. Cremers, I. Reid, S. Roth, and L. Leal-Taixé (2020) MOTChallenge: a benchmark for single-camera multiple target tracking. International Journal of Computer Vision, pp. 1–37. Cited by: §2.
  • [19] D. DeTone, T. Malisiewicz, and A. Rabinovich (2018-06) SuperPoint: self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Cited by: §3.2, §3.5.
  • [20] X. Dong, J. Shen, L. Shao, and F. Porikli (2020) CLNet: a compact latent network for fast adjusting siamese trackers. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 378–395. Cited by: Table 11, Table 12.
  • [21] M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler (2019-06) D2-net: a trainable cnn for joint description and detection of local features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.6.
  • [22] H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, M. Huang, J. Liu, Y. Xu, et al. (2021) Lasot: a high-quality large-scale single object tracking benchmark. International Journal of Computer Vision 129 (2), pp. 439–461. Cited by: Figure 12, §D.1, (b)b, Figure 6, §4.2.
  • [23] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019-06) LaSOT: a high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §A.1, Appendix C, Figure 9, §D.1, Table 11, Learning Target Candidate Association to Keep Track of What Not to Track, §1, (a)a, Figure 6, §4.1, §4.1, §4.1, §4.2, Table 4, Table 5.
  • [24] H. Fan and H. Ling (2020) CRACT: cascaded regression-align-classification for robust visual tracking. arXiv preprint arXiv:2011.12483. Cited by: Table 11, Table 12, Table 4, Table 8.
  • [25] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey (2017) Need for speed: a benchmark for higher frame rate object tracking. In ICCV, Cited by: Appendix C, Figure 16, §D.2, Table 12, §4.1, §4.2, Table 8.
  • [26] J. Gao, T. Zhang, and C. Xu (2019-06) Graph convolutional tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 12.
  • [27] D. Guo, Y. Shao, Y. Cui, Z. Wang, L. Zhang, and C. Shen (2020) Graph attention tracking. arXiv preprint arXiv:2011.11204. Cited by: Table 11, Table 12, Table 4, Table 8.
  • [28] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen (2020-06) SiamCAR: siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, Table 12.
  • [29] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §A.3.
  • [30] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. TPAMI 37 (3), pp. 583–596. Cited by: §2.
  • [31] L. Huang, X. Zhao, and K. Huang (2020) Globaltrack: a simple and strong baseline for long-term tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 11037–11044. Cited by: §D.1, Table 11, Table 6.
  • [32] Z. Kalal, K. Mikolajczyk, and J. Matas (2012) Tracking-learning-detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (7), pp. 1409–1422. External Links: Document Cited by: Table 6.
  • [33] H. Kiani Galoogahi, A. Fagg, and S. Lucey (2017) Learning background-aware correlation filters for visual tracking. In ICCV, Cited by: §2.
  • [34] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. In ICLR, Cited by: §3.6.
  • [35] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pfugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, and et al. (2018) The sixth visual object tracking vot2018 challenge results. In ECCV workshop, Cited by: §4.2, Table 7.
  • [36] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: evolution of siamese visual tracking with very deep networks. In CVPR, Cited by: §D.1, §D.2, Table 11, Table 12, §1, §4.2, Table 7.
  • [37] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In CVPR, Cited by: §1.
  • [38] S. Li, Z. Zhang, Z. Liu, A. Wang, L. Qiu, and F. Du (2020-07) TLPG-tracker: joint learning of target localization and proposal generation for visual tracking. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 708–715. Cited by: Table 11, Table 12, Table 4.
  • [39] Y. Li, C. Fu, F. Ding, Z. Huang, and G. Lu (2020-06) AutoTrack: towards high-performance visual tracking for uav with automatic spatio-temporal regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 12.
  • [40] Z. Li and N. Snavely (2018-06) MegaDepth: learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.6.
  • [41] B. Liao, C. Wang, Y. Wang, Y. Wang, and J. Yin (2020) PG-net: pixel to global matching network for visual tracking. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 429–444. Cited by: Table 11, Table 12, Table 4, Table 7.
  • [42] Y. Liu, R. Li, Y. Cheng, R. T. Tan, and X. Sui (2020) Object tracking using spatio-temporal networks for future prediction location. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 1–17. Cited by: Table 12.
  • [43] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for uav tracking. In ECCV, Cited by: Appendix C, Figure 16, §D.2, Table 12, §1, §4.1, §4.2, Table 8.
  • [44] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In CVPR, Cited by: §D.2, Table 12.
  • [45] P. Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich (2020-06) SuperGlue: learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §A.3, §3.2, §3.4, §3.4, §3.5, §3.6.
  • [46] R. Sinkhorn and P. Knopp (1967) Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics 21 (2), pp. 343–348. Cited by: §3.5.
  • [47] S. Tang, M. Andriluka, B. Andres, and B. Schiele (2017-07) Multiple people tracking by lifted multicut and person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [48] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. W.M. Smeulders, P. H.S. Torr, and E. Gavves (2018-09) Long-term tracking in the wild: a benchmark. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Learning Target Candidate Association to Keep Track of What Not to Track, §4.2, Table 6.
  • [49] P. Voigtlaender, J. Luiten, P. H.S. Torr, and B. Leibe (2020-06) Siam R-CNN: visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, Table 12, §2, Table 4, Table 6, Table 7, Table 8.
  • [50] G. Wang, C. Luo, X. Sun, Z. Xiong, and W. Zeng (2020-06) Tracking by instance detection: a meta-learning approach. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, Table 12, Table 8.
  • [51] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H.S. Torr (2019-06) Fast online object tracking and segmentation: a unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §D.1, Table 11, §4.2.
  • [52] Y. Wu, J. Lim, and M. Yang (2015) Object tracking benchmark. TPAMI 37 (9), pp. 1834–1848. Cited by: Figure 16, §D.2, Table 12, §4.2, Table 8.
  • [53] J. Xiao, L. Qiao, R. Stolkin, and A. Leonardis (2016) Distractor-supported single target tracking in extremely cluttered scenes. In ECCV, Cited by: §1, §2.
  • [54] Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu (2020) Siamfc++: towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, pp. 12549–12556. Cited by: Table 11, Table 12.
  • [55] B. Yan, H. Zhao, D. Wang, H. Lu, and X. Yang (2019-10) ’Skimming-perusal’ tracking: a framework for real-time and robust long-term tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: Table 6, Table 7.
  • [56] T. Yang, P. Xu, R. Hu, H. Chai, and A. B. Chan (2020-06) ROAM: recurrently optimizing tracking model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11.
  • [57] Q. Yu, G. Medioni, and I. Cohen (2007) Multiple target tracking using spatio-temporal markov chain monte carlo data association. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
  • [58] Y. Yu, Y. Xiong, W. Huang, and M. R. Scott (2020-06) Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table 11, Table 12, Table 4, Table 8.
  • [59] J. Zhang, S. Ma, and S. Sclaroff (2014) MEEM: robust tracking via multiple experts using entropy minimization. In ECCV, Cited by: §2.
  • [60] L. Zhang, Y. Li, and R. Nevatia (2008) Global data association for multi-object tracking using network flows. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.
  • [61] Y. Zhang, D. Wang, L. Wang, J. Qi, and H. Lu (2018) Learning regression and verification networks for long-term visual tracking. External Links: 1809.04320 Cited by: Table 6, Table 7.
  • [62] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu (2020) Ocean: object-aware anchor-free tracking. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 771–787. Cited by: Table 11, Table 12, Table 4.
  • [63] L. Zheng, M. Tang, Y. Chen, J. Wang, and H. Lu (2020) Learning feature embeddings for discriminant model based tracking. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 759–775. Cited by: Table 12, Table 8.
  • [64] Z. Zhu, Q. Wang, L. Bo, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In ECCV, Cited by: §D.1, §D.2, Table 11, Table 12, §1, §2, §4.2, Table 7.

Appendix A Training

First, we describe the training data generation and sample selection to train the network more effectively. Then, we provide additional details about the training procedure such as training in batches, augmentations and synthetic sample generation. Finally, we briefly summarize the employed network architecture.

Name | Number of candidates | Is a candidate selected as target? | Does the candidate with the highest score correspond to the g.t. target? | Does any candidate correspond to the g.t. target? | Num. frames | Ratio
D 1.8M 67.9%
H 498k 18.4%
G x 8k 0.3%
J x x 76k 2.8%
K x 42k 1.5%
other 243k 9.1%
Table 9: Categories and specifications for each frame in the training dataset used for data-mining.

A.1 Data-Mining

We use the LaSOT [23] training set to train our target candidate association network. In particular, we split the 1120 training sequences randomly into a train-train (1000 sequences) and a train-val (120 sequences) set. We run the base tracker on all sequences and store the target classifier score map and the search area of each frame on disk. During training, we use the score map and the search area to extract the target candidates and their features, providing the data to train the target candidate association network.

We observed that many sequences or sub-sequences contain mostly one target candidate with a high target classifier score. In these cases, target candidate association is trivial and learning from them is less effective. Conversely, tracking datasets contain sub-sequences that are very challenging (large motion, appearance changes, or many distractors), such that trackers often fail. While these sub-sequences lead to more effective training, they are relatively rare, which is why we decided to actively mine the training dataset.

First, we assign each frame to one of six different categories. We classify each frame based on four observations: the number of candidates, their target classifier scores, whether one of the target candidates is selected as the target, and whether this selection is correct (see Tab. 9). A candidate corresponds to the annotated target object if the spatial distance between the candidate location and the center coordinate of the target object is smaller than a threshold.

Assigning each frame to the proposed categories, we observe that the dominant category is D (roughly 70%), which corresponds to frames with a single target candidate matching the annotated target object. However, we favour more challenging settings for training. In order to learn distractor associations using self-supervision, we require frames with multiple detected target candidates. Category H (18.4%) corresponds to such frames, where in addition the candidate with the highest target classifier score matches the annotated target object. Hence, the base tracker selects the correct candidate as the target. Furthermore, category G corresponds to frames where the base tracker was no longer able to track the target because the target classifier score of the corresponding candidate fell below a threshold. We favour these frames during training in order to learn to continue tracking the target even if the score is low.

Both categories J and K correspond to tracking failures of the base tracker. Whereas in K the correct target is detected but not selected, in frames of category J it is not detected at all. Thus, we aim to learn from tracking failures in order to train the target candidate association network such that it learns to compensate for and correct failures of the base tracker. In particular, frames of category K are important for training because the two candidates with the highest target classifier scores no longer match, such that the network is forced to include other cues for matching. We use frames of category J because frames where the object is visible but not detected typically contain many distractors, making them suitable for learning distractor associations using self-supervised learning.

To summarize, we select only frames of categories H, K and J for self-supervised training and sample them with a higher ratio than their natural frequency in the dataset. We ignore frames from category D during self-supervised training because we require frames with multiple target candidates. Furthermore, we select sub-sequences of two consecutive frames for partially supervised training. We choose challenging sub-sequences that either contain many distractors in each frame (HH, 350k), sub-sequences where the base tracker fails and switches to track a distractor (HK, 1001), or sub-sequences where the base tracker is no longer able to identify the target with high confidence (HG, 1380). Again, we increase the sampling ratio of the rarer cases during training, in order to use failure cases more often during training than they occur in the training set.
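The following sketch illustrates how such category-based sampling could be implemented; the weights are hypothetical placeholders, since the exact sampling ratios are not reproduced above.

import random

# Hypothetical per-category sampling weights; the paper samples failure cases
# (K, J) more often than they occur in the dataset, but the exact ratios are
# not reproduced here.
CATEGORY_WEIGHTS = {"H": 1.0, "K": 4.0, "J": 4.0}

def sample_training_frame(frames_by_category):
    """Draw one frame for self-supervised training from categories H, K and J.

    `frames_by_category` maps a category name to a list of frame identifiers;
    categories D, G and 'other' are excluded, as described above.
    """
    categories = list(CATEGORY_WEIGHTS)
    weights = [CATEGORY_WEIGHTS[c] for c in categories]
    category = random.choices(categories, weights=weights, k=1)[0]
    return random.choice(frames_by_category[category])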

a.2 Training Data Preparation

During training we use two different levels of augmentation. First, we augment all features of each target candidate to enable self-supervised training with automatically produced ground-truth correspondences. In addition, we use augmentation to improve generalization and to reduce overfitting of the network.

When creating artificial features, we randomly scale each target classifier score, randomly jitter the candidate location within the search area, and apply common image transformations to the original image before extracting the appearance-based features for the artificial candidates. In particular, we randomly jitter the brightness, blur the image, and jitter the search area before cropping the image to the search area.
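A minimal sketch of the feature-level part of this augmentation is given below; the scale range and jitter magnitude are illustrative assumptions rather than the values used for training.

import random

def augment_candidate(score, location_xy, search_area_wh,
                      score_scale_range=(0.9, 1.1), max_jitter_frac=0.05):
    """Create an artificial copy of a candidate's score and location features.

    Sketch of the feature-level augmentation described above; the scale range
    and jitter magnitude are illustrative placeholders. Image-level
    augmentations (brightness jitter, blurring, search-area jitter) are applied
    separately before the appearance features are extracted.
    """
    aug_score = score * random.uniform(*score_scale_range)   # scale the score
    w, h = search_area_wh
    x = location_xy[0] + random.uniform(-max_jitter_frac, max_jitter_frac) * w
    y = location_xy[1] + random.uniform(-max_jitter_frac, max_jitter_frac) * h
    # Keep the jittered location inside the search area.
    x = min(max(x, 0.0), w)
    y = min(max(y, 0.0), h)
    return aug_score, (x, y)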

To reduce overfitting and improve generalization, we randomly scale the target candidate scores for synthetic and real sub-sequences. We also randomly remove candidates from both candidate sets in order to simulate newly appearing or disappearing objects. Furthermore, training in batches requires the same number of target candidates in each frame. Thus, we keep the five candidates with the highest target classifier score, or add artificial peaks with a small score at random locations, such that five candidates per frame are present. When computing the losses, we ignore these artificial candidates.
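The padding step could look as follows; the candidate representation (a dict with score, location and a padding flag) and the default values are assumptions made for illustration.

import random

def pad_or_truncate_candidates(candidates, num_slots=5, pad_score=1e-3,
                               search_area_wh=(352, 352)):
    """Bring the candidate list of a frame to a fixed size for batched training.

    Sketch of the step described above: keep the `num_slots` highest-scoring
    candidates or add artificial low-score peaks at random locations. Each
    candidate is assumed to be a dict with 'score' and 'xy' entries; the pad
    score and default search-area size are placeholders. Padded entries are
    flagged so that they can be ignored in the losses.
    """
    kept = sorted(candidates, key=lambda c: c["score"], reverse=True)[:num_slots]
    while len(kept) < num_slots:
        kept.append({
            "score": pad_score,
            "xy": (random.uniform(0, search_area_wh[0]),
                   random.uniform(0, search_area_wh[1])),
            "is_padding": True,  # ignored when computing the losses
        })
    return kept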

a.3 Architecture Details

We use the SuperDiMP tracker [14] as our base tracker. SuperDiMP employs the DiMP [3] target classifier and the probabilistic bounding-box regression of PrDiMP [17], together with improved training settings. It uses a pretrained ResNet-50 [29] network as backbone feature extractor. We freeze all parameters of SuperDiMP while training the target candidate association network. To produce the visual features for each target candidate, we use the third-layer ResNet-50 features. In particular, we obtain a backbone feature map and feed it into a convolutional layer that produces the feature map used for the candidate appearance features. Note that the spatial resolutions of the target classifier score map and this feature map agree, which simplifies extracting the appearance-based feature for each target candidate at its location.
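A sketch of this appearance branch in PyTorch is shown below; the channel sizes, kernel size and exact read-out are assumptions, only the overall structure (one convolutional layer on ResNet-50 layer-3 features, followed by sampling at the candidate locations) follows the description above.

import torch
import torch.nn as nn

class AppearanceFeatureExtractor(nn.Module):
    """Sketch of the appearance branch described above.

    A single convolutional layer maps backbone (ResNet-50 layer-3) features to
    an embedding map whose spatial resolution matches the target classifier
    score map; each candidate's appearance feature is read out at its peak
    location. Channel sizes and kernel size are illustrative placeholders.
    """

    def __init__(self, in_channels=1024, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=3, padding=1)

    def forward(self, backbone_feat, candidate_rc):
        # backbone_feat: (B, C, H, W) layer-3 features.
        # candidate_rc: (B, N, 2) integer (row, col) peak locations in
        # score-map coordinates (long tensor).
        feat = self.proj(backbone_feat)                        # (B, D, H, W)
        b, d, h, w = feat.shape
        flat = feat.flatten(2)                                 # (B, D, H*W)
        idx = candidate_rc[..., 0] * w + candidate_rc[..., 1]  # (B, N)
        idx = idx.unsqueeze(1).expand(-1, d, -1)               # (B, D, N)
        return flat.gather(2, idx).permute(0, 2, 1)            # (B, N, D)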

Furthermore, we use a four-layer multi-layer perceptron (MLP) with batch normalization to project the target classifier score and the location of each candidate into the same dimensional space as the appearance features. Before feeding the candidate locations into the MLP, we normalize them by the image size.
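A possible realization of this MLP is sketched below; the hidden widths and the choice of a single MLP for score and location are placeholders, only the depth of four layers and the use of batch normalization follow the description.

import torch.nn as nn

def make_score_location_mlp(embed_dim=256, hidden_dim=32):
    """Four-layer MLP that embeds (score, x, y) of a candidate.

    Only the depth (four layers) and the use of batch normalization follow
    the description above; the hidden widths and the single-MLP design are
    illustrative assumptions.
    """
    dims = [3, hidden_dim, 2 * hidden_dim, 4 * hidden_dim, embed_dim]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:  # no normalization/activation after the last layer
            layers.append(nn.BatchNorm1d(dims[i + 1]))
            layers.append(nn.ReLU(inplace=True))
    return nn.Sequential(*layers)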

We follow Sarlin et al. [45] when designing the candidate embedding network. In particular, we use self- and cross-attention layers in an alternating fashion and employ nine layers of each type. In addition, we append a convolutional layer to the last cross-attention layer. We again follow Sarlin et al. [45] for the optimal matching layer, reuse their implementation of the Sinkhorn algorithm, and run it for 50 iterations.
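For illustration, a minimal log-domain Sinkhorn normalization in the spirit of Sarlin et al. [45] could look as follows; the dustbin handling for unmatched candidates used in their optimal matching layer is omitted here.

import torch

def log_sinkhorn(scores, num_iters=50):
    """Minimal log-domain Sinkhorn normalization.

    `scores` is an (M, N) matrix of pairwise matching scores between the
    candidates of two consecutive frames. The routine alternates row and
    column normalizations for `num_iters` iterations and returns log
    assignment probabilities. The dustbin rows/columns used by Sarlin et
    al. [45] to handle unmatched candidates are omitted for brevity.
    """
    log_p = scores
    for _ in range(num_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # rows
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # columns
    return log_p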

Appendix B Inference

In this section we provide the detailed algorithm that describes the object association module (Sec. 3.7 in the paper). Furthermore, we explain the idea of search area rescaling at occlusion and how it is implemented. We conclude with additional inference details.

b.1 Object Association Module

Here, we provide a detailed algorithm describing the object association module presented in the main paper, see Alg. 1. It contains the candidate-to-object association and the redetection logic used to retrieve the target object after it was lost.

First, we briefly explain the notation. Each object is modeled similarly to a class in programming: each object contains attributes that can be accessed using dot notation, e.g., o.scores returns the score attribute of object o. In total, the object class contains two attributes: the list of scores and the object id. Both setting and getting the attribute values is possible.

The algorithm requires the following inputs: the set of target candidates of the current frame, the set of previously detected objects, and the object selected as target in the previous frame. First, we check whether a target candidate matches any previously detected object and verify that the assignment probability is higher than a threshold. If such a match exists, we associate the candidate with the object, append its target classifier score to the object's score list, and add the object to the set of currently visible objects. If a target candidate matches none of the previously detected objects, we create a new object and add it to the set of visible objects. Hence, previously detected objects without a matching candidate are not included in this set. Once all target candidates are associated with an already existing or newly created object, we check whether the object previously selected as target is still visible in the current scene; if so, it forms the new target. After the object was lost, it is possible that the object selected as target is in fact a distractor. Thus, we select another object as target if this other object achieves a higher target classifier score in the current frame than any score the currently selected object achieved in the past. Furthermore, if the object previously selected as target is no longer visible, we try to redetect it by checking whether the object with the highest target classifier score in the current frame achieves a score above a threshold. If the score is high enough, we select this object as the target.

Require: set of target candidates C of the current frame
Require: set of previously detected objects O and the object o_prev selected as target in the previous frame
1: O_vis ← ∅, o_target ← None ▷ Initialize
2: for each candidate c ∈ C do ▷ Propagate objects via their ids
3:     if c is assigned to an object o ∈ O and the assignment probability exceeds the threshold then ▷ Candidate association successful
4:         append c.score to o.scores
5:         O_vis ← O_vis ∪ {o}
6:     else
7:         O_vis ← O_vis ∪ {new object initialized from c} ▷ Initialize new object if unmatched
8: if o_prev exists and o_prev ∈ O_vis then ▷ Retrieve target if detected
9:     o_target ← o_prev
10:     for each o ∈ O_vis do
11:         if current score of o > max(o_prev.scores) then ▷ Check if better fitting target detected
12:             o_target ← o
13: else
14:     o_max ← object in O_vis with the highest current score ▷ Get object with highest score
15:     if current score of o_max > redetection threshold then ▷ Check if object with highest score is selected
16:         o_target ← o_max
17:     else
18:         o_target ← None ▷ Target not found
19: return O_vis, o_target
Algorithm 1 Object Association Algorithm.

b.2 Search Area Rescaling at Occlusion

The target object often gets occluded or moves out of view in many tracking sequences. Shortly before the target is lost, the tracker typically detects only a small part of the target object and estimates a smaller bounding box than in the preceding frames. The base tracker SuperDiMP employs a search area whose size depends on the currently estimated bounding box. Thus, a partially visible target object causes a small bounding box and a small search area. The problem with a small search area is that it complicates redetecting the target object: the target typically reappears at a slightly different location than where it disappeared, and if it reappears outside of the search area, redetection is impossible. Small search areas occur more frequently when using the target candidate association network, because it allows us to track the object longer before declaring it lost.

Hence, we use a procedure to increase the search area if it decreased before the target object was lost. First, we store all search area resolutions during tracking in a list, as long as the object is detected. If the object was lost k frames ago, we compute the new search area by averaging those of the last k list entries that are larger than the search area at occlusion. We average over at most 30 previous search areas to compute the new one. If the target object is not redetected within these 30 frames, we keep the search area fixed until redetection.
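A sketch of this rescaling procedure is given below, under the assumption that search areas are represented by scalar resolutions; variable names are illustrative.

def rescale_search_area(area_history, area_at_occlusion, frames_since_lost,
                        max_history=30):
    """Sketch of the search-area rescaling described above.

    `area_history` holds the search-area resolutions recorded while the object
    was still detected. We average those of the most recent entries that are
    larger than the area at occlusion, looking back at most `max_history`
    (here 30) frames; if no such entry exists, the area at occlusion is kept.
    Variable names and the scalar representation of the search area are
    illustrative assumptions.
    """
    lookback = min(frames_since_lost, max_history)
    recent = area_history[-lookback:] if lookback > 0 else []
    larger = [a for a in recent if a > area_at_occlusion]
    if not larger:
        return area_at_occlusion
    return sum(larger) / len(larger)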

b.3 Inference Details

In contrast to training, we use all extracted target candidates to compute the candidate associations between consecutive frames. In order to save computation, we extract the candidates and their features only for the current frame and cache the results so that they can be reused when computing the associations in the next frame.

Memory Sample Confidence | Larger Search Area | Search Area Rescaling | Target Candidate Association Network | Candidate Embedding Network | NFS | UAV123 | LaSOT
64.4 68.2 63.5
64.7 68.0 65.0
65.3 68.4 65.5
64.7 68.4 65.8
65.2 69.1 65.8
65.9 69.2 66.6
66.4 69.8 67.2
Table 10: Impact of each component in terms of AUC (%) on three datasets. The first row corresponds to our SuperDiMP baseline.

Appendix C More Detailed Analysis

In addition to the ablation study presented in the main paper (Sec. 4.1), we provide more settings in order to better assess the contribution of each component. In particular, we split the term search area adaptation into larger search area and search area rescaling, where larger search area refers to a search area scale of 8 instead of 6 and a search area resolution of 480 instead of 352 pixels in the image domain. Furthermore, we evaluate the target candidate association network when omitting the candidate embedding network and using the encoded features directly for optimal matching. Tab. 10 shows all results on NFS [25], UAV123 [43] and LaSOT [23]. We run each experiment five times and report the average. We conclude that both search area adaptation techniques improve the tracking quality, but we achieve the best results on all three datasets when employing both at the same time. Furthermore, using the target candidate association network without the candidate embedding network outperforms the pipeline without target candidate association (next-to-last row in Tab. 10). However, we observe clearly the best tracking performance on all three benchmarks when using the entire proposed tracking pipeline including the candidate embedding network (last row in Tab. 10).

Appendix D Experiments

We provide more details to complement the state-of-the-art comparison performed in the paper (Sec. 4.2).

Figure 9: Success (a) and normalized precision (b) plots on LaSOT [23]. Our approach outperforms all other methods by a large margin in AUC, reported in the legend.
Figure 12: Success (a) and normalized precision (b) plots on LaSOTExtSub [22]. Our approach outperforms all other methods by a large margin in AUC, reported in the legend.
Tracker | LaSOT AUC (%)
Ours | 67.2
Siam R-CNN [49] | 64.8
SuperDiMP [17] | 63.1
PrDiMP [17] | 59.8
TLPG [38] | 58.1
TACT [8] | 57.5
LTMU [11] | 57.2
DiMP [3] | 56.9
Ocean [62] | 56.0
SiamAttN [58] | 56.0
CRACT [24] | 54.9
SiamFC++ [54] | 54.4
SiamGAT [27] | 53.9
PG-NET [41] | 53.1
FCOS-MAML [50] | 52.3
GlobalTrack [31] | 52.1
ATOM [12] | 51.5
DaSiamRPN [64] | 51.5
SiamBAN [7] | 51.4
SiamCAR [28] | 50.7
CLNet [20] | 49.9
SiamRPN++ [36] | 49.6
Retina-MAML [50] | 48.0
SiamMask [51] | 46.7
ROAM++ [56] | 44.7
Table 11: Comparison with state-of-the-art on the LaSOT [23] test set in terms of overall AUC score. The average value over 5 runs is reported for our approach. The marked results were produced by Fan et al. [23]; otherwise they are obtained directly from the official paper.

d.1 LaSOT and LaSOTExtSub

In addition to the success plot, we provide the normalized precision plot on the LaSOT [23] test set (280 videos) and the LaSOTExtSub [22] test set (150 videos). The normalized precision score is computed as the percentage of frames where the normalized distance (relative to the target size) between the predicted and ground-truth target center location is less than a threshold, and is plotted over a range of thresholds. The trackers are ranked using the AUC, which is shown in the legend. Figs. 9(b) and 12(b) show the normalized precision plots. We compare with state-of-the-art trackers and report their success (AUC) in Tab. 11, and where available we show the raw results in Fig. 9. In particular, we use the raw results provided by the authors, except for DaSiamRPN [64], GlobalTrack [31], SiamRPN++ [36] and SiamMask [51], for which such results were not provided, so we use the raw results produced by Fan et al. [23]. Thus, the exact results for these methods might differ between the plot and the table, because the table shows the result reported in the corresponding paper. Similarly, we obtain all results on LaSOTExtSub directly from Fan et al. [22], except the result of SuperDiMP, which we produced ourselves.
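For reference, one common way to write this metric is sketched below in LaTeX; the exact normalization follows the protocol of Fan et al. [23], so the formula should be read as illustrative rather than as the official definition.

\[
P_{\text{norm}}(\tau) \;=\; \frac{1}{N}\sum_{t=1}^{N}
\mathbb{1}\!\left[\,\left\lVert\left(\frac{x_t-\hat{x}_t}{w_t^{gt}},\;
\frac{y_t-\hat{y}_t}{h_t^{gt}}\right)\right\rVert_2 < \tau\right],
\]

where (x_t, y_t) and (\hat{x}_t, \hat{y}_t) denote the ground-truth and predicted target center in frame t, (w_t^{gt}, h_t^{gt}) the ground-truth target size, and N the number of frames.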

Figure 16: Success plots on (a) UAV123 [43], (b) OTB-100 [52], and (c) NFS [25] in terms of overall AUC score, reported in the legend.
Super Pr Siam Siam Retina FCOS Auto Siam Siam Siam Siam Siam DaSiam
Ours Dimp DiMP R-CNN DiMP KYS RPN++ ATOM UPDT MAML MAML Ocean STN Track BAN CAR ECO DCFST PG-NET CRACT GCT GAT CLNet TLPG AttN FC++ MDNet CCTO RPN
[17] [17] [49] [3] [4] [36] [12] [5] [50] [50] [62] [42] [39] [7] [28] [13] [63] [41] [24] [26] [27] [20] [38] [58] [54] [44] [16] [64]
UAV123 69.8 67.7 68.0 64.9 65.3 61.3 64.2 54.5 64.9 67.1 63.1 61.4 53.2 66.4 50.8 64.6 63.3 65.0 51.3 57.7
OTB-100 70.8 70.1 69.6 70.1 68.4 69.5 69.6 66.9 70.2 71.2 70.4 68.4 69.3 69.6 69.1 70.9 69.1 72.6 64.8 71.0 69.8 71.2 68.3 67.8 68.2 65.8
NFS 66.4 64.8 63.5 63.9 62.0 63.5 58.4 53.7 59.4 46.6 64.1 62.5 54.3 41.9 48.8
Table 12: Comparison with state-of-the-art on the OTB-100 [52], NFS [25] and UAV123 [43] datasets in terms of overall AUC score. The average value over 5 runs is reported for our approach.

d.2 UAV123, OTB-100 and NFS

We provide the success plots over the 123 videos of the UAV123 dataset [43] in Fig. 16(a), the 100 videos of the OTB-100 dataset [52] in Fig. 16(b), and the 100 videos of the NFS dataset [25] in Fig. 16(c). We compare with the state-of-the-art trackers SuperDiMP [17], PrDiMP50 [17], UPDT [5], SiamRPN++ [36], ECO [13], DiMP [3], CCOT [16], MDNet [44], ATOM [12], and DaSiamRPN [64]. Our method provides a significant gain over the baseline SuperDiMP on UAV123 and NFS and performs among the top methods on OTB-100. Tab. 12 shows additional results on UAV123, OTB-100 and NFS in terms of success (AUC).