
Divert More Attention to Vision-Language Tracking

Relying on Transformer for complex visual feature learning, object tracking has witnessed a new standard of state-of-the-art (SOTA) performance. However, this advancement is accompanied by larger training data and a longer training period, making tracking increasingly expensive. In this paper, we demonstrate that reliance on Transformer is not necessary and that pure ConvNets are still competitive and even better, while being more economical and training-friendly, in achieving SOTA tracking. Our solution is to unleash the power of multimodal vision-language (VL) tracking, simply using ConvNets. The essence lies in learning novel unified-adaptive VL representations with our modality mixer (ModaMixer) and asymmetrical ConvNet search. We show that our unified-adaptive VL representation, learned purely with ConvNets, is a simple yet strong alternative to Transformer visual features, improving a CNN-based Siamese tracker by a remarkable 14.5% in SUC and even outperforming several Transformer-based SOTA trackers. Besides empirical results, we theoretically analyze our approach to evidence its effectiveness. By revealing the potential of VL representation, we expect the community to divert more attention to VL tracking and hope to open more possibilities for future tracking beyond Transformer. Code and models will be released at https://github.com/JudasDie/SOTS.


1 Introduction

Figure 1: Comparison between CNN-based and Transformer-based trackers on LaSOT Fan et al. (2019).

Transformer tracking has recently received a surge of research interest and has become almost a necessity for achieving state-of-the-art (SOTA) performance Chen et al. (2021); Wang et al. (2021a); Cui et al. (2022). The success of Transformer trackers is mainly attributed to attention, which enables complex feature interactions. But is this complex attention the only way to realize SOTA tracking? In other words, is Transformer the only path to SOTA?

We answer no, and demonstrate a Transformer-free path using a pure convolutional neural network (CNN). Different from the complex attention-based interactions within visual features, which require more training data and longer training time, our alternative is to explore simple multimodal interactions, i.e., between vision and language, through CNN. In fact, language, an equally important cue as vision, has been widely explored in vision-related tasks and is not new to tracking. Prior works Feng et al. (2021); Li et al. (2017); Feng et al. (2020) have exploited vision-language (VL) multimodal learning to improve tracking. However, their performance falls far behind current SOTAs. For instance, on LaSOT Fan et al. (2019), the gap between the current best VL tracker Feng et al. (2021) and the recent Transformer tracker Chen et al. (2021) is an absolute 10.9% in SUC (see Fig. 1). So, what is the bottleneck of VL tracking in achieving SOTA?

The devil is in the VL representation. Feature representation has been shown to be crucial for improving tracking Wang et al. (2015); Zhang and Peng (2019); Li et al. (2019); Guo et al. (2022); Cui et al. (2022). Given the two modalities of vision and language, the VL feature is desired to be unified and adaptive Pérez-Rúa et al. (2019); Yu et al. (2020b). The former property requires deep interaction of vision and language, while the latter needs the VL feature to accommodate different scenarios of visual and linguistic information. However, in existing VL trackers, vision and language are treated independently and processed separately until the final result fusion. Although this fusion easily gets the two modalities connected, it does not accord with the human learning procedure, which integrates multisensory signals with various neurons before causal inference Badde et al. (2021), resulting in a lower upper bound for VL tracking. Besides, current VL trackers treat the template and search branches as homogeneous inputs and adopt symmetrical feature learning structures for the two branches, inherited from typical vision-only Siamese tracking Feng et al. (2021). We argue that the mixed modality may have a different intrinsic nature than the pure vision modality, and thus requires a more flexible and general design for different signals.

Our solution. Having observed the above, we introduce a novel unified-adaptive vision-language representation, aiming for SOTA VL tracking without using Transformer (we stress that we do not use Transformer for visual feature learning, as in current Transformer trackers, or for multimodal learning; we only use it for language embedding extraction, i.e., BERT Devlin et al. (2018)). Specifically, we first present the modality mixer, or ModaMixer, a conceptually simple but effective module for VL interaction. Language is a high-level representation: its class embedding helps distinguish targets of different categories (e.g., cat and dog), while its attribute embedding (e.g., color, shape) provides a strong prior to separate targets of the same class (e.g., cars with different colors). The intuition is that channel features in the vision representation also reveal semantics of objects Gao et al. (2018); Yang et al. (2015). Inspired by this, ModaMixer regards the language representation as a selector to reweight different channels of visual features, enhancing target-specific channels while suppressing irrelevant channels from both intra- and inter-class distractors. The selected feature is then fused with the original feature, using a special asymmetrical design (analyzed later in experiments), to generate the final unified VL representation. A set of ModaMixers is installed in a typical CNN from shallow to deep layers, boosting the robustness and discriminability of the unified VL representation at different semantic levels. Despite its simplicity, ModaMixer brings 6.9% SUC gains over a pure CNN baseline Guo et al. (2020a) (i.e., 50.7% to 57.6%).

Despite the huge improvement, the gap to the SOTA Transformer tracker Chen et al. (2021) remains (57.6% vs. 64.9%). To mitigate the gap, we propose an asymmetrical searching strategy (ASS) to adapt the unified VL representation for further improvement. Different from current VL tracking Feng et al. (2021), which adopts symmetrical and fixed template and search branches as in vision-only Siamese tracking Li et al. (2019), we argue that the learning framework for the mixed modality should be adaptive rather than fixed. To this end, ASS borrows the idea of neural architecture search (NAS) Zoph and Le (2017); Real et al. (2019) to separately learn distinctive, asymmetrical networks for the mixed modality in the two branches and the ModaMixers. The asymmetrical architecture, to our best knowledge, is the first of its kind in matching-based tracking. Note that, although NAS has been adopted in matching-based tracking Yan et al. (2021b), that method finds symmetrical networks for a single modality. Differently, ASS is applied on the mixed modality and the resulting architecture is asymmetrical. Moreover, the network searched by ASS avoids burdensome re-training on ImageNet Deng et al. (2009), enabling quick reproducibility of our work (only 0.625 GPU days with a single RTX-2080Ti). Our ASS is general and flexible, and together with ModaMixer it surprisingly brings an additional 7.6% gain (i.e., 57.6% to 65.2%), evidencing our argument and the effectiveness of ASS.

Eventually, with the unified-adaptive representation, we implement the first pure CNN-based VL tracker that achieves SOTA results comparable to and even better than Transformer-based solutions, without bells and whistles. Specifically, we apply our method to the CNN baseline SiamCAR Guo et al. (2020a), and the resulting VL tracker VLT_SCAR achieves 65.2% SUC on LaSOT Fan et al. (2019) while running at 43 FPS, remarkably improving the baseline by 14.5% and outperforming SOTA Transformer trackers Chen et al. (2021); Wang et al. (2021a) (see again Fig. 1). We observe similar improvements with our approach on four other benchmarks. Besides empirical results, we provide a theoretical analysis to evidence the effectiveness of our method. Note that our approach is general in improving vision-only trackers, including Transformer-based ones. We show this by applying it to TransT Chen et al. (2021); the resulting tracker VLT_TT gains 2.4% SUC (i.e., 64.9% to 67.3%), evidencing its effectiveness and generality.

We are aware that one can certainly leverage the Transformer Meinhardt et al. (2022) to learn a good (maybe better) VL representation for tracking, with larger data and a longer training period. Different from this, our goal is to explore a cheaper way, with simple architectures such as pure CNNs, to reach SOTA tracking performance and open more possibilities for future tracking beyond Transformer. In summary, our contributions are four-fold: (i) we introduce a novel unified-adaptive vision-language representation for SOTA VL tracking; (ii) we propose the embarrassingly simple yet effective ModaMixer for unified VL representation learning; (iii) we present ASS to adapt the mixed VL representation for better tracking; and (iv) using a pure CNN architecture, we achieve SOTA results on multiple benchmarks.

Figure 2: The proposed vision-language tracking framework. The semantic information of the language description is injected into vision from shallow to deep layers of the asymmetrical modeling architecture to learn a unified-adaptive vision-language representation.

2 Related Work

Visual Tracking. Tracking has witnessed great progress in the past decades. In particular, Siamese tracking Bertinetto et al. (2016); Tao et al. (2016), which aims to learn a generic matching function, is a representative branch and has evolved with numerous extensions Li et al. (2018); Zhang and Peng (2019); Li et al. (2019); Yu et al. (2020a); Fan and Ling (2019); Guo et al. (2017); Zhang et al. (2021, 2020); Chen et al. (2020). Recently, Transformer Vaswani et al. (2017) has been introduced to Siamese tracking for better interactions of visual features and has greatly pushed the standard of state-of-the-art performance Chen et al. (2021); Wang et al. (2021a); Sun et al. (2020); Cui et al. (2022); Lin et al. (2021). From a different perspective than using a complex Transformer, we explore multimodality with a simple CNN to achieve SOTA tracking.

Vision-Language Tracking. Natural language contains high-level semantics and has been leveraged to foster vision-related tasks Fukui et al. (2016); Kim et al. (2018); Anderson et al. (2018), including tracking Li et al. (2017); Feng et al. (2020, 2021). The work of Li et al. (2017) first introduces linguistic descriptions to tracking and shows that language enhances the robustness of vision-based methods. Most recently, SNLT Feng et al. (2021) integrates linguistic information into Siamese tracking by fusing the results respectively obtained from vision and language. Different from these VL trackers, which regard vision and language as independent cues connected only weakly at result fusion, we propose ModaMixer to unleash the power of VL tracking by learning a unified VL representation.

NAS for Tracking. Neural architecture search (NAS) aims at finding the optimal design of deep network architectures Zoph and Le (2017); Real et al. (2019); Liu et al. (2019); Guo et al. (2020b) and has been introduced to tracking Yan et al. (2021b); Zhang et al. (2021). LightTrack Yan et al. (2021b) searches a lightweight backbone but is computationally demanding (about 40 V100 GPU days). AutoMatch Zhang et al. (2021) uses DARTS Liu et al. (2019) to find better matching networks for Siamese tracking. All these methods leverage NAS for vision-only tracking and search a symmetrical Siamese architecture. Differently, our work searches networks for multimodal tracking and aims at a more general and flexible asymmetrical two-stream counterpart. In addition, our search pipeline takes only 0.625 RTX-2080Ti GPU days, which is much more resource-friendly.

3 Unified-Adaptive Vision-Language Tracking

This section details our unified-adaptive vision-language (VL) tracking, as shown in Fig. 2. Specifically, we first describe the proposed modality mixer for generating a unified multimodal representation, and then the asymmetrical network search for learning an adaptive VL representation. Afterwards, we illustrate the proposed tracking framework, followed by a theoretical analysis of our method.

3.1 Modality Mixer for Unified Representation

The essence of multimodal learning is a simple and effective modality fusion module. As discussed before, existing VL trackers simply adopt late fusion, in which different modalities are treated independently and processed separately until their final results are merged Feng et al. (2021); Li et al. (2017). Despite being effective to some extent, the complementarity of different modalities in representation learning is largely unexplored, which may prevent multimodal learning from unleashing its power for VL tracking. In this work, we propose the modality mixer (dubbed ModaMixer) as a compact way to learn a unified vision-language representation for tracking.

Figure 3: Illustration of the ModaMixer.

ModaMixer considers the language representation as a selector to reweight the channels of vision features. Specifically, given the language description with N words of a video (the language description in tracking is generated only from the initial target object in the first frame), a language model Devlin et al. (2018) is adopted to abstract the sentence into a semantic feature f_l of size (N+2) x C_l. The extra "2" corresponds to the "[CLS]" and "[SEP]" tokens added by the language model (see Devlin et al. (2018) for more details). Notably, descriptions of different videos may have various lengths N. To make ModaMixer applicable to all videos, we first average the features of all words along the sequence-length dimension "(N+2)" to generate a unique language representation for each description. A linear layer then aligns its channel number with that of the corresponding vision feature f_v. The channel selector is expressed as a Hadamard product, which point-wisely multiplies the aligned language representation with the embedding of each spatial position in the vision feature f_v. Finally, a residual connection between the mixed feature and the vision feature f_v is applied to avoid losing informative vision details. In a nutshell, the ModaMixer can be formulated as

f_vl = Φ( f_v ⊙ Linear(f_l) ) + f_v,    (1)

where ⊙ denotes the Hadamard product, Linear(·) is a linear projection layer with a weight matrix of size C_l x C_v for channel-number alignment, and Φ(·) indicates the post-processing block before the residual connection. Please note that, to enable adaptive feature modeling for different modalities, we search different Φ(·) to process the features before and after fusion (see Sec. 3.2 for more details). The proposed ModaMixer is illustrated in Fig. 3. Akin to channel attention Shen et al. (2021); Gao et al. (2018), the high-level semantics in the language representation dynamically enhance target-specific channels in the vision features, and meanwhile suppress the responses of both inter- and intra-class distractors.
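To make the formulation concrete, the following PyTorch-style sketch implements Eq. (1) under the notation above. It is an illustration rather than the released implementation: the post-processing block Φ, which is searched by ASS in our method (Sec. 3.2), is replaced here by a fixed convolutional block, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ModaMixer(nn.Module):
    """Sketch of Eq. (1): the language embedding acts as a channel selector on vision features."""

    def __init__(self, lang_dim: int, vis_channels: int):
        super().__init__()
        # Linear(.): aligns the language channels C_l with the vision channels C_v.
        self.align = nn.Linear(lang_dim, vis_channels)
        # Phi(.): post-processing before the residual connection; searched by ASS in the
        # paper, a fixed conv block stands in for it here.
        self.post = nn.Sequential(
            nn.Conv2d(vis_channels, vis_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(vis_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_v: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        # f_v: (B, C_v, H, W) vision feature; f_l: (B, N+2, C_l) token-level language feature.
        f_l_avg = f_l.mean(dim=1)                      # average over the (N+2) tokens
        selector = self.align(f_l_avg)                 # (B, C_v) channel selection scores
        selected = f_v * selector[:, :, None, None]    # Hadamard product at every spatial position
        return self.post(selected) + f_v               # residual keeps informative vision details


mixer = ModaMixer(lang_dim=768, vis_channels=256)
f_vl = mixer(torch.randn(2, 256, 32, 32), torch.randn(2, 12, 768))  # (2, 256, 32, 32)
```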

3.2 Asymmetrical Search for Adaptive Vision-Language Representation

Besides the fusion module, the other crucial key for vision-language tracking is how to construct the basic modeling structure. The simplest strategy is to inherit a symmetrical Siamese network from vision-based tracking (e.g., Bertinetto et al. (2016); Li et al. (2019)), as in current VL trackers Feng et al. (2021). However, a performance gap remains with this strategy, which we mostly attribute to neglecting the different intrinsic nature of VL-based multimodality compared with vision-only single modality. To remedy this, we propose an asymmetrical searching strategy (dubbed ASS) to learn an adaptive modeling structure paired with the ModaMixer.

The idea of network search originates from the field of neural architecture search (NAS). We adopt a popular NAS model, in particular the single-path one-shot method SPOS Guo et al. (2020b), to search for the optimal structure for our purpose. Although SPOS has been utilized for tracking Yan et al. (2021b), our work differs significantly in two aspects: 1) our ASS is tailored for constructing an asymmetrical two-stream network for multimodal tracking, while Yan et al. (2021b) is designed to find a symmetrical Siamese network for vision-only single-modality tracking; besides, we search layers both in the backbone network and in the post-processing of the ModaMixer (see Eq. 1); 2) our ASS reuses the pre-trained supernet from SPOS, which avoids burdensome re-training on ImageNet Deng et al. (2009) (both for the supernet and the found subnet) and thus reduces the time complexity of our search pipeline to a small fraction of that of LightTrack Yan et al. (2021b) (i.e., 0.625 RTX-2080Ti GPU days vs. 40 V100 GPU days). Due to limited space, please refer to the appendix for more details and a comparison of our ASS and Yan et al. (2021b).

The search space and search strategy of ASS are kept consistent with the original SPOS Guo et al. (2020b). In particular, the search pipeline is formulated as

W_A = argmin_W  E_{a ~ Γ(A)} [ L_train( N(a, W(a)) ) ],    (2)

a* = argmax_{a ∈ A}  SUC_val( N(a, W_A(a)) ),    (3)

where A represents the architecture search space of the network N, a is a sample from A, Γ(A) denotes uniform random sampling from A, and W denotes the corresponding network weights. Notably, the network includes three components N = (N_z, N_x, N_m), where N_z indicates the backbone of the template branch, N_x the backbone of the search branch, and N_m the layers in the ModaMixer. The whole pipeline consists of training the supernet on tracking datasets via random sampling from the search space (Eq. 2) and finding the optimal subnet via an evolutionary algorithm (Eq. 3). The SUC (success score) on validation data is used as the reward of the evolutionary algorithm. Tab. 1 shows the searched asymmetrical networks of our VL tracking. For more details of ASS, we kindly refer readers to the appendix or Guo et al. (2020b).
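For concreteness, a minimal sketch of the single-path one-shot step behind Eq. (2) is shown below. The operator choices, channel widths, and the squared-error loss are placeholders for the real tracking search space and objective; Eq. (3) is left to the evolutionary-search sketch in the appendix.

```python
import random
import torch
import torch.nn as nn

class SuperLayer(nn.Module):
    """One searchable layer: four candidate blocks, only the sampled one runs per forward."""

    def __init__(self, channels: int):
        super().__init__()
        # Stand-ins for the Shuffle blocks (kernel 3/5/7) and the Shuffle Xception block.
        self.choices = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (3, 5, 7, 3)
        )

    def forward(self, x, idx):
        return self.choices[idx](x)


class SuperNet(nn.Module):
    def __init__(self, channels: int = 64, num_layers: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(SuperLayer(channels) for _ in range(num_layers))

    def forward(self, x, path):
        for layer, idx in zip(self.layers, path):
            x = layer(x, idx)
        return x


def sample_path(net):
    # Uniform sampling Gamma(A) over the search space A (Eq. 2).
    return [random.randrange(len(layer.choices)) for layer in net.layers]


# One supernet training step: sample a random single path and update only its weights.
net = SuperNet()
opt = torch.optim.SGD(net.parameters(), lr=0.01)
x, target = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
loss = nn.functional.mse_loss(net(x, sample_path(net)), target)  # tracking loss in practice
opt.zero_grad(); loss.backward(); opt.step()
# Eq. (3) is then solved by evolutionary search over paths (see the appendix sketch).
```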

3.3 Tracking Framework

Table 1: The asymmetrical architecture learned by ASS. Each row (Template, Search) lists the block chosen for the stem convolution layer and for Stage1-Stage4, with a ModaMixer following every stage. Each searchable layer picks one basic ASS unit out of four choices: the first three are Shuffle blocks Zhang et al. (2018) with kernel sizes of 3, 5, and 7, respectively, and the fourth is a Shuffle Xception block Zhang et al. (2018) with kernel size 3.

With the proposed ModaMixer and the searched asymmetrical networks, we construct a new vision-language tracking framework, as shown in Fig. 2 and Tab. 1. Our framework is matching-based tracking. Both the template and search backbone networks contain 4 stages with a maximum stride of 8; the chosen block of each stage is listed in Tab. 1. A ModaMixer is integrated into each stage of the template and search networks to learn an informative mixed representation. It is worth noting that the asymmetry is revealed not only in the design of the backbone networks, but also in the ModaMixer. Each ModaMixer shares the same meta-structure as in Fig. 3, but comprises different post-processing layers to allow adaptation to different semantic levels (i.e., network depths) and input signals (i.e., template and search, pure-vision and mixed features in each ModaMixer). With the learned unified-adaptive VL representations from the template and search branches, we perform feature matching and target localization the same as in our baseline.
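The sketch below (building on the ModaMixer sketch from Sec. 3.1) illustrates how the two searched branches and the per-stage ModaMixers fit together. Stage widths, strides, and kernel choices are illustrative only, and the depth-wise cross-correlation stands in for the baseline's matching and localization modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
# ModaMixer refers to the sketch given in Sec. 3.1.

def stage(in_c, out_c, kernel, stride):
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, kernel, stride=stride, padding=kernel // 2),
        nn.BatchNorm2d(out_c), nn.ReLU(inplace=True))

class VLBranch(nn.Module):
    """One branch (template or search): 4 stages, each followed by a ModaMixer.
    ASS lets the two branches pick different blocks, hence the per-branch `kernels`."""

    def __init__(self, kernels, lang_dim=768, widths=(64, 128, 256, 256)):
        super().__init__()
        chans, strides = [3] + list(widths), (2, 2, 2, 1)   # total stride 8
        self.stages = nn.ModuleList(
            stage(chans[i], chans[i + 1], kernels[i], strides[i]) for i in range(4))
        self.mixers = nn.ModuleList(ModaMixer(lang_dim, c) for c in widths)

    def forward(self, img, f_l):
        x = img
        for s, m in zip(self.stages, self.mixers):
            x = m(s(x), f_l)            # inject language at every semantic level
        return x

# Asymmetry: the two branches use different searched configurations.
template_net = VLBranch(kernels=(3, 5, 3, 7))
search_net = VLBranch(kernels=(7, 3, 5, 3))
f_l = torch.randn(1, 12, 768)
z = template_net(torch.randn(1, 3, 128, 128), f_l)            # (1, 256, 16, 16)
x = search_net(torch.randn(1, 3, 256, 256), f_l)              # (1, 256, 32, 32)
score = F.conv2d(x, z.permute(1, 0, 2, 3), groups=z.size(1))  # depth-wise cross-correlation
```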

3.4 A Theoretical Explanation

This section presents a theoretical explanation of our method, following the analysis in Huang et al. (2021). Based on the Empirical Risk Minimization (ERM) principle Singh (2019), the objective of representation learning is to find better network parameters by minimizing the empirical risk, i.e.,

ĝ = argmin_{g ∈ G} R_emp(g),  with  R_emp(g) = (1/m) Σ_{i=1}^{m} ℓ( g(x_i), y_i ),    (4)

where ℓ denotes the loss function, M represents the modality set, m indicates the sample number, x is the input multimodal signal, y is the training label, and G denotes the optimization space of g. Given the empirical risk R_emp(g), its corresponding population risk is defined as

R(g) = E_{(x, y)} [ ℓ( g(x), y ) ].    (5)

Following Amini et al. (2009); Tripuraneni et al. (2020); Huang et al. (2021), the population risk is adopted to measure the learning quality. Then the latent representation quality Huang et al. (2021) is defined as

γ_S(M) = R( ĝ_M ) − R( g* ),    (6)

where g* represents the optimal case and R(g*) indicates the best achievable population risk. With the empirical Rademacher complexity Rad_S(·) Bartlett and Mendelson (2002), we restate the conclusion of Huang et al. (2021) with our definitions.

Theorem 1 (Huang et al. (2021)).

Assume we have produced the empirical risk minimizers ĝ_M and ĝ_N by training with M and N modalities separately (M > N). Then, for all δ ∈ (0, 1), with probability at least 1 − δ,

R( ĝ_M ) − R( ĝ_N ) ≤ γ_S(M, N) + O( Rad_S(G) ) + O( sqrt( log(1/δ) / m ) ),    (7)

where

γ_S(M, N) = γ_S(M) − γ_S(N)    (8)

computes the quality difference between the representations learned from multiple modalities and a single modality with dataset S. Theorem 1 gives an upper bound on the population risk when training with different numbers of modalities, which proves that more modalities can potentially enhance the representation quality. Furthermore, the Rademacher complexity is proportional to the network complexity, which indicates that a heterogeneous network would theoretically raise this upper bound, and also shows that our asymmetrical design has a larger optimization space when learning with M modalities compared with N modalities (M > N). The proof is beyond our scope; please refer to Huang et al. (2021) for details.
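As a toy illustration of Eqs. (4)-(5) (not part of the analysis above), the snippet below compares the empirical risk of a fixed hypothesis on m samples with a Monte-Carlo estimate of its population risk. The synthetic data, the labelling rule, and the squared loss are assumptions chosen only to show that the two quantities concentrate as m grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_star = rng.normal(size=d)          # ground-truth labelling rule (toy assumption)
g = rng.normal(size=d)               # some fixed hypothesis g produced elsewhere by ERM

def risk(w, X, y):
    # Squared loss averaged over the given samples, i.e. an empirical risk (Eq. 4).
    return np.mean((X @ w - y) ** 2)

# Monte-Carlo estimate of the population risk R(g) (Eq. 5) with a very large sample.
X_pop = rng.normal(size=(200_000, d))
population_risk = risk(g, X_pop, np.sign(X_pop @ w_star))

for m in (50, 500, 5_000, 50_000):
    X = rng.normal(size=(m, d))
    empirical_risk = risk(g, X, np.sign(X @ w_star))
    print(f"m={m:6d}  |empirical - population| = {abs(empirical_risk - population_risk):.4f}")
```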

Type        Method                                   LaSOT        LaSOT_EXT    TNL2K        GOT-10k                   OTB99-L
                                                     SUC    P     SUC    P     SUC    P     AO    SR_0.5    SR_0.75   SUC    P

CNN-based
SiamRCNN Voigtlaender et al. (2020) 64.8 68.4 - - 52.3 52.8 64.9 72.8 59.7 70.0 89.4
PrDiMP Danelljan et al. (2020) 59.8 60.8 - - 47.0 45.9 63.4 73.8 54.3 69.5 89.5
AutoMatch Zhang et al. (2021) 58.3 59.9 37.6 43.0 47.2 43.5 65.2 76.6 54.3 71.6 93.2
Ocean Zhang et al. (2020) 56.0 56.6 - - 38.4 37.7 61.1 72.1 47.3 68.0 92.1
KYS Bhat et al. (2020) 55.4 - - - 44.9 43.5 63.6 75.1 51.5 - -
ATOM Danelljan et al. (2019) 51.5 50.5 37.6 43.0 40.1 39.2 55.6 63.4 40.2 67.6 82.4
SiamRPN++ Li et al. (2019) 49.6 49.1 34.0 39.6 41.3 41.2 51.7 61.6 32.5 63.8 82.6
C-RPN Fan and Ling (2019) 45.5 42.5 27.5 32.0 - - - - - - -
SiamFC Bertinetto et al. (2016) 33.6 33.9 23.0 26.9 29.5 28.6 34.8 35.3 9.8 58.7 79.2
ECO Danelljan et al. (2017) 32.4 30.1 22.0 24.0 - - 31.6 30.9 11.1 - -
SiamCAR Guo et al. (2020a) 50.7 51.0 33.9 41.0 35.3 38.4 56.9 67.0 41.5 68.8 89.1

CNN-VL
SNLT Feng et al. (2021) 54.0 57.6 26.2 30.0 27.6 41.9 43.3 50.6 22.1 66.6 80.4
VLT_SCAR (Ours) 65.2 69.1 44.7 51.6 49.8 51.0 61.4 72.4 52.3 73.9 89.8


Trans-based
STARK Yan et al. (2021a) 66.4 71.2 47.8 55.1 - - 68.0 77.7 62.3 69.6 91.4
TrDiMP Wang et al. (2021a) 63.9 66.3 - - - - 67.1 77.7 58.3 70.5 92.5
TransT Chen et al. (2021) 64.9 69.0 44.8 52.5 50.7 51.7 67.1 76.8 60.9 70.8 91.2

Trans-VL
VLT_TT (Ours) 67.3 72.1 48.4 55.9 53.1 53.3 69.4 81.1 64.5 76.4 93.1

Table 2: State-of-the-art comparisons on LaSOT Fan et al. (2019), LaSOT_EXT Fan et al. (2021), TNL2K Wang et al. (2021b), GOT-10k Huang et al. (2019) and OTB99-LANG (OTB99-L) Li et al. (2017). TransT and SiamCAR are the baselines of the proposed VLT_TT and VLT_SCAR, respectively. All performance metrics are reported in % unless otherwise specified.

4 Experiment

4.1 Implementation Details

We apply our method to both the CNN-based SiamCAR Guo et al. (2020a) (dubbed VLT_SCAR) and the Transformer-based TransT Chen et al. (2021) (dubbed VLT_TT). The matching module and localization head are inherited from the baseline trackers without any modification.

Searching for VLT_SCAR. The proposed ASS aims to find a more flexible modeling structure for vision-language tracking. Taking VLT_SCAR as an example, the supernet from SPOS Guo et al. (2020b) is used as the feature extractor to replace the ResNet He et al. (2016) in SiamCAR. We train the tracker with the supernet using the training splits of COCO Lin et al. (2014), ImageNet-VID Deng et al. (2009), ImageNet-DET Deng et al. (2009), Youtube-BB Real et al. (2017), GOT-10k Huang et al. (2019), LaSOT Fan et al. (2019) and TNL2K Wang et al. (2021b) for 5 epochs, where each epoch contains a fixed number of sampled template-search pairs. Once the supernet training finishes, the evolutionary algorithm as in SPOS Guo et al. (2020b) is applied to search for the optimal subnet, which finally yields VLT_SCAR. The whole search pipeline consumes 15 hours on a single RTX-2080Ti GPU. The search process of VLT_TT is similar to that of VLT_SCAR. We present more details in the appendix due to space limitations.

Optimizing VLT_SCAR and VLT_TT. The training protocols of VLT_SCAR and VLT_TT follow those of the corresponding baselines SiamCAR Guo et al. (2020a) and TransT Chen et al. (2021). Notably, for each epoch, half of the training pairs come from datasets without language annotations (i.e., COCO Lin et al. (2014), ImageNet-VID Deng et al. (2009), ImageNet-DET Deng et al. (2009), Youtube-BB Real et al. (2017)). The language representation is set to a 0-tensor or a pooled visual feature in these circumstances (discussed in Tab. 4). GOT-10k Huang et al. (2019) provides simple descriptions of the object/motion/major/root class, e.g., "dove, walking, bird, animal", for each video; we concatenate these words to obtain a pseudo language description.
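The fallback logic can be sketched as below; `bert_encode` and `roi_pool` are hypothetical stand-ins for the language model and the template-feature pooling, and the tensor shapes follow the BERT hidden size assumed in the earlier sketches.

```python
import torch

LANG_DIM = 768                       # BERT hidden size (assumption)
FALLBACK = "template-embedding"      # or "0-tensor", compared in Tab. 4

def language_embedding(sample, bert_encode, roi_pool):
    """Pick the language representation for one training pair; half of the pairs
    come from datasets without language annotation."""
    if sample.get("description"):
        return bert_encode(sample["description"])           # (1, N+2, LANG_DIM)
    if FALLBACK == "0-tensor":
        return torch.zeros(1, 1, LANG_DIM)                  # silent selector
    # "template-embedding": average the template feature inside the target box.
    feat = roi_pool(sample["template_feature"], sample["template_box"])  # (1, C, h, w)
    return feat.mean(dim=(2, 3)).unsqueeze(1)               # (1, 1, C) pseudo-language token
```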

4.2 State-of-the-art Comparison

Tab. 2 presents the results and comparisons of our trackers with other SOTAs on LaSOT Fan et al. (2019), LaSOT_EXT Fan et al. (2021), TNL2K Wang et al. (2021b), OTB99-LANG Li et al. (2017) and GOT-10k Huang et al. (2019). The proposed VLT_SCAR and VLT_TT run at 43 and 35 FPS on a single RTX-2080Ti GPU, respectively. Compared with the speeds of the baseline trackers SiamCAR Guo et al. (2020a) and TransT Chen et al. (2021), which run at 52 and 32 FPS, the extra computation cost of our method is small. Moreover, our VLT_TT outperforms TransT in terms of both accuracy and speed.

Compared with SiamCAR Guo et al. (2020a), VLT_SCAR achieves considerable SUC gains of 14.5%/10.8%/14.5% on LaSOT/LaSOT_EXT/TNL2K, respectively, which demonstrates the effectiveness of the proposed VL tracker. Notably, our VLT_SCAR outperforms the current best VL tracker SNLT Feng et al. (2021) by 11.2%/18.5% on LaSOT/LaSOT_EXT, showing that the unified-adaptive vision-language representation is more robust for VL tracking and superior to simply fusing the tracking results of different modalities. This advantage is preserved across different benchmarks. What surprises us more is that the CNN-based VLT_SCAR is competitive with and even better than recent vision Transformer-based approaches. For example, VLT_SCAR outperforms TransT Chen et al. (2021) on LaSOT, while running faster (43 FPS vs. 32 FPS) and requiring fewer training pairs. By applying our method to TransT, the new tracker VLT_TT improves the baseline to 67.3% SUC on LaSOT with a 2.4% gain while being faster, showing its generality.

#   Method      ModaMixer   ASS   LaSOT                  TNL2K
                                  SUC    P_Norm   P      SUC    P_Norm   P
①   Baseline    -           -     50.7   60.0     51.0   35.3   43.6     38.4
②   VLT_SCAR    ✓           -     57.6   65.8     61.1   41.5   49.2     43.2
③   VLT_SCAR    -           ✓     52.1   59.8     50.6   40.7   47.2     40.2
④   VLT_SCAR    ✓           ✓     65.2   74.9     69.1   48.3   55.2     46.6

Table 3: Ablation on the ModaMixer and the asymmetrical searching strategy (ASS).
Method      Setting     LaSOT        LaSOT_EXT    TNL2K        GOT-10k                    OTB99-L
                        SUC    P     SUC    P     SUC    P     AO     SR_0.5   SR_0.75    SUC    P
VLT_SCAR    0-tensor    65.2   69.1  41.2   47.5  48.3   46.6  61.4   72.4     52.3       72.7   88.8
VLT_SCAR    template    63.9   67.9  44.7   51.6  49.8   51.1  61.0   70.8     52.2       73.9   89.8

Table 4: Comparison of different strategies for handling training videos without language annotation.

4.3 Component-wise Ablation

We analyze the influence of each component in our method to show the effectiveness and rationality of the proposed ModaMixer and ASS. The ablation experiments are conducted on VLT_SCAR, and the results are presented in Tab. 3. Directly applying the ModaMixer to the baseline SiamCAR Guo et al. (2020a) ("ResNet50+ModaMixer") obtains an SUC gain of 6.9% on LaSOT (② vs. ①). This verifies that the unified VL representation effectively improves tracking robustness. One interesting observation is that ASS alone improves the vision-only baseline by 1.4% on LaSOT (③ vs. ①), but when equipped with the ModaMixer, it surprisingly brings a further 7.6% SUC gain (④ vs. ②), which shows the complementarity of multimodal representation learning (ModaMixer) and the proposed ASS.

4.4 Further Analysis

Dealing with videos without language description during training. As mentioned above, language annotations are not provided in several training datasets (e.g., Youtube-BB Real et al. (2017)). We design two strategies to handle this. One is to use a "0-tensor" as the language embedding, and the other is to replace the language embedding with a visual feature generated by pooling the template feature inside the bounding box. As shown in Tab. 4, the two strategies perform competitively, but the one with the visual feature is slightly better on average. Therefore, in the following, we take the VLT_SCAR trained by replacing the language embedding with the template visual embedding as the baseline for further analysis.

Symmetrical or Asymmetrical? The proposed asymmetrical searching strategy is essential for achieving an adaptive vision-language representation. As illustrated in Tab. 5(a), we experiment with searching for a symmetrical network (for both the backbone and the post-processing in the ModaMixer), but it is inferior to the asymmetrical counterpart by 3.9%/6.2% in success rate (SUC) and precision (P) on LaSOT Fan et al. (2019), respectively, which empirically supports our argument.

(a) Settings           SUC    P
    symmetrical        60.0   61.7
    asymmetrical       63.9   67.9

(b) Settings               SUC    P
    Shuffle-ModaMixer      59.1   62.2
    NAS-ModaMixer          63.9   67.9

(c) Settings            SUC    P
    w/o language        53.4   54.6
    pseudo language     51.6   53.4

Table 5: Evaluation of different settings on LaSOT: (a) the influence of the symmetrical versus our asymmetrical design, (b) adopting a fixed ShuffleNet block versus searching the post-processing block in the ModaMixer, and (c) removing the language description or using a caption generated by Radford et al. (2021) during inference.

Asymmetry in the ModaMixer. The asymmetry lies not only in the backbone network, but also in the ModaMixer. In our work, the post-processing layers for different signals (visual and mixed features) are decided by ASS, which enables adaptation to both semantic levels (i.e., network depths) and different input signals (i.e., template and search, pure-vision and mixed features in each ModaMixer). As shown in Tab. 5(b), when replacing the post-processing layers with a fixed ShuffleNet block from SPOS Guo et al. (2020b) (i.e., inheriting the structure and weights from the last block in each backbone stage), the performance drops from 63.9% to 59.1% in SUC on LaSOT. This reveals that the proposed ASS is important for building a better VL learner.

No/Pseudo description during inference. VL trackers require that the first frame of a video be annotated with a language description. One may wonder what happens if there is no language description. Tab. 5(c) presents the results of removing the description and of using a description generated by a recent advanced image-caption method Radford et al. (2021) (in ICML 2021). The results show that, without a language description, performance degrades heavily (53.4% SUC on LaSOT), verifying that the high-level semantics in language do help improve robustness. Even so, the performance is still better than the vision-only baseline. Surprisingly, using the generated description does not show promising results either (51.6% SUC), indicating that it is still challenging to generate accurate captions in real-world cases, and noisy captions may even bring negative effects to the model.

Figure 4: (a) Feature channels with the maximum/minimum (top/bottom) selection scores from the ModaMixer in stages 1-4. (b) Activation maps before/after (top/bottom) multimodal fusion in the ModaMixer.

Channel selection by ModaMixer. The ModaMixer translates the language description into a channel selector to reweight visual features. As shown in Fig. 4, the channel activation maps with the maximum selection scores always correspond to the target, while surrounding distractors are assigned the minimum scores (Fig. 4 (a)-bottom). Besides, with multimodal fusion (i.e., channel selection), the network enhances the response of the target and meanwhile suppresses the distractors (see Fig. 4 (b)). This evidences our argument that the language embedding can identify semantics in visual feature channels and effectively select useful information for localizing targets. More visualization results are presented in the appendix due to limited space.
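A sketch of how such visualizations can be produced from the ModaMixer sketch in Sec. 3.1 is given below; it simply ranks channels by the language selector's scores, which is one plausible way to obtain the maps in Fig. 4(a), not necessarily the exact procedure used for the figure.

```python
import torch

def rank_channels(mixer, f_v, f_l, k=1):
    """Return the channels that the language selector weights most and least,
    together with their activation maps (cf. Fig. 4(a))."""
    with torch.no_grad():
        selector = mixer.align(f_l.mean(dim=1))       # (B, C_v) channel selection scores
        top = selector.topk(k, dim=1).indices         # target-specific channels
        bottom = (-selector).topk(k, dim=1).indices   # suppressed (distractor) channels
        max_maps = f_v[0, top[0]]                     # (k, H, W) maps of the top channels
        min_maps = f_v[0, bottom[0]]                  # (k, H, W) maps of the bottom channels
    return top, bottom, max_maps, min_maps
```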

5 Conclusion

In this work, we explore a different path to achieve SOTA tracking without a complex Transformer, i.e., multimodal VL tracking. The essence is a unified-adaptive VL representation, learned by our ModaMixer and asymmetrical networks. In experiments, our approach surprisingly boosts a pure CNN-based Siamese tracker to achieve competitive or even better performance compared with recent SOTAs. Besides, we provide a theoretical explanation to evidence the effectiveness of our method. We hope that this work inspires more possibilities for future tracking beyond Transformer.

Acknowledgments

This work is co-supervised by Prof. Liping Jing and Dr. Zhipeng Zhang.

Appendix A Appendix

The appendix presents additional details of our tracker in terms of design and experiments, as follows.

  • A.1    Volume of Language-Annotated Training Data
    We analyze how the volume of language-annotated training data affects tracking performance.

  • A.2    Different Language Models
    We analyze and compare different language embedding models (i.e., BERT Devlin et al. (2018) and GPT-2 Radford et al. (2019)) in our method.

  • A.3    Details of The Proposed Asymmetrical Searching Strategy (ASS)
    We present more details about the pipeline of our proposed ASS.

  • A.4    Comparison of ASS and LightTrack Yan et al. (2021b)
    We show efficiency comparison of our ASS and another NAS-based tracker LightTrack Yan et al. (2021b).

  • A.5    Visualization of Tracking Result and Failure Case
    We visualize the tracking results and analyze a failure case of our tracker.

  • A.6    Activation Analysis of Different Language Descriptions
    We study the impact of different language descriptions on tracking performance by visualizing their activation maps.

  • A.7    Attribute-based Performance Analysis
    We conduct attribute-based performance analysis on LaSOT Fan et al. (2019), and the results demonstrate the robustness of our tracker in various complex scenarios.

(a) Settings    SUC (%)   P (%)
    50%         58.9      61.4
    75%         61.8      64.9
    100%        63.9      67.9

(b) Settings                        SUC (%)   P (%)
    GPT-2 Radford et al. (2019)     59.3      62.3
    BERT Devlin et al. (2018)       63.9      67.9

Table 6: Evaluation of different settings on LaSOT: (a) training with different volumes of language-annotated data, (b) the influence of different language models.

A.1 Volume of Language-Annotated Training Data

Language-annotated data is crucial for our proposed tracker to learn robust vision-language representations. We analyze the influence of training with different volumes of language-annotated data, and the results are presented in Tab. 6(a). The default setting in the manuscript is denoted "100%". For the "50%" and "75%" settings, the reduced part is filled with data without language annotation, which keeps the total training data volume unchanged. As the number of language-annotated training pairs is reduced, the performance on LaSOT Fan et al. (2019) gradually decreases (from 63.9% to 58.9% in SUC), demonstrating that more language-annotated data helps improve model capacity.

A.2 Different Language Models

As described in Sec. 3.1 of the manuscript, the language model BERT Devlin et al. (2018) is adopted to abstract the semantics of the sentence, which directly relates to the learning of the vision-language representation. To show the influence of different language models, we compare the results of using BERT Devlin et al. (2018) and GPT-2 Radford et al. (2019), as shown in Tab. 6(b). An interesting finding is that GPT-2 Radford et al. (2019) even decreases the performance, which is at odds with recent findings in natural language processing. One possible reason is that the bi-directional learning strategy in BERT Devlin et al. (2018) captures the context information of a sentence better than the auto-regression in GPT-2 Radford et al. (2019).
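For reference, the comparison can be reproduced in spirit with the public Hugging Face checkpoints as sketched below; both models are mean-pooled over tokens, mirroring Sec. 3.1, though the exact extraction in our experiments may differ.

```python
import torch
from transformers import AutoModel, AutoTokenizer

def sentence_embedding(name: str, text: str) -> torch.Tensor:
    """Mean-pooled token features from a pretrained language model (a sketch of the
    comparison in Tab. 6(b); model names are the public Hugging Face checkpoints)."""
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModel.from_pretrained(name)
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, N+2, D) for BERT, (1, N, D) for GPT-2
    return hidden.mean(dim=1)                        # average over the token dimension

desc = "a black bird standing on the ground"
bert_vec = sentence_embedding("bert-base-uncased", desc)   # 768-d
gpt2_vec = sentence_embedding("gpt2", desc)                # 768-d
```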

A.3 Details of The Proposed Asymmetrical Searching Strategy (ASS)

As mentioned in Sec. 3.2 of the manuscript, ASS is designed to adapt the mixed modalities for different branches by simultaneously searching the asymmetrical network N = (N_z, N_x, N_m). The pipeline of ASS consists of two stages: the first stage searches the architecture via pretraining, and the second stage retrains it for our VL tracking, as summarized in Alg. 1.

1:  /* Search */
2:  Input: network N = (N_z, N_x, N_m), search space A, max iteration T, random sampling Γ(A); train dataset: image pairs D_train with language descriptions where available; val dataset: D_val; strategy for videos without language annotation: "0-tensor" or "template-embedding".
3:  Initialize: initialize the network parameters W by reusing the SPOS supernet weights.
4:  for t = 1 : T do
5:     for each training pair (z, x) in D_train do
6:        if the language annotation exists then
7:           f_l = LanguageModel(description);
8:        else if strategy is "0-tensor" then
9:           f_l = 0; // Default setting without language annotation
10:        else if strategy is "template-embedding" then
11:           f_l = pooled template feature; // Robust setting without language annotation
12:        end if
13:        sample a subnet a ~ Γ(A), compute the loss L(N(a, W(a)); z, x, f_l);
14:     end for
15:     Update the network parameters with gradient descent:
16:     W = W - η ∇_W L;
17:  end for
18:  a* = EvolutionaryArchitectureSearch(A, W, D_val); Guo et al. (2020b)
19:  Initialize: initialize the parameters W(a*) of the searched subnet a*.
20:  while not converged do
21:     for each training pair (z, x) in D_train do
22:        line 6 - 12
23:        compute the loss L(N(a*, W(a*)); z, x, f_l);
24:     end for
25:     Update the network parameters with gradient descent: W(a*) = W(a*) - η ∇ L;
26:  end while
27:  Output: searched architecture a*, network parameters W(a*).
28:  /* Retrain */
29:  Train the searched networks N_z(a*), N_x(a*), N_m(a*);
30:  /* Inference */
31:  Track the target with the retrained networks.
Algorithm 1: Asymmetrical Searching Strategy.

The pretraining stage (lines 3-18 in Alg. 1) contains four steps: 1) ASS first initializes the network parameters of N = (N_z, N_x, N_m). Concretely, N_z and N_x reuse the pretrained supernet of SPOS Guo et al. (2020b), while N_m copies the weights of the last layer in N_z and N_x. This avoids the tedious training on ImageNet Deng et al. (2009) and enables quick reproducibility of our work; 2) the language model Devlin et al. (2018) processes the annotated sentence to get the corresponding representation f_l. If language annotations are not provided, two different strategies handle these cases (i.e., "0-tensor" or "template-embedding", illustrated in Sec. 4.4 and Tab. 4 of the manuscript); 3) then, N is trained for T iterations. In each iteration, a subnet a is randomly sampled from the search space A by the sampling function Γ(A) and outputs the predictions; the corresponding parameters of a are updated by gradient descent; 4) after pretraining, evolutionary architecture search Guo et al. (2020b) is performed to find the optimal subnet a*. The reward of the evolutionary search is the SUC (success score) on the validation data D_val. The retraining stage (lines 19-27 in Alg. 1) optimizes the searched subnet following the training pipeline of the baseline trackers Guo et al. (2020a); Chen et al. (2021).
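The evolutionary step (line 18 of Alg. 1) can be sketched as a standard crossover-plus-mutation loop over single-path encodings. Here `score_fn` is assumed to evaluate a candidate subnet on the validation videos and return its SUC, and the population and mutation settings are illustrative rather than the values used in our experiments.

```python
import random

def evolutionary_search(score_fn, num_layers, num_choices=4,
                        population=20, generations=10, mutate_p=0.1):
    """Sketch of the evolutionary step of ASS (Eq. 3): evolve single-path subnets and
    keep the one with the best validation SUC returned by `score_fn(path)`."""
    pop = [[random.randrange(num_choices) for _ in range(num_layers)]
           for _ in range(population)]
    best = max(pop, key=score_fn)
    for _ in range(generations):
        scored = sorted(pop, key=score_fn, reverse=True)
        parents = scored[:population // 2]           # keep the top half as parents
        children = []
        while len(children) < population - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, num_layers)    # one-point crossover
            child = a[:cut] + b[cut:]
            child = [random.randrange(num_choices) if random.random() < mutate_p else g
                     for g in child]                 # per-gene mutation
            children.append(child)
        pop = parents + children
        best = max(pop + [best], key=score_fn)
    return best
```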

Tab. 7 displays the detailed configurations of the searched asymmetrical architecture, providing a complement to Tab. 1 in the manuscript.

Table 7: Configurations of the asymmetrical architecture learned by ASS, listing for each branch the layer name (stem convolution, the searched block and ModaMixer of Stage1-Stage4, and the output convolution block), its parameters, and its output size.

A.4 Comparison of ASS and LightTrack Yan et al. (2021b)

Despite greatly boosting tracking performance, Neural Architecture Search (NAS) brings a complicated training process and large computation costs. Considering this complexity, we remove unnecessary steps from ASS to achieve a better trade-off between training time and performance. Taking another NAS-based tracker, LightTrack Yan et al. (2021b), as the comparison, we demonstrate the efficiency of our proposed ASS.

As illustrated in Tab. 8, NAS-based trackers usually need to first pretrain the supernet on ImageNet Deng et al. (2009) to initialize the parameters, which results in high time complexity in training. LightTrack even trains the backbone network on ImageNet Deng et al. (2009) twice (i.e., the 1st and 4th steps), which heavily increases the time complexity. By contrast, our ASS avoids this cost by reusing the pre-trained supernet from SPOS, which is much more efficient.

Steps                     LightTrack Yan et al. (2021b)                     ASS in VLT_SCAR (Ours)
1st step                  Pretraining the backbone supernet                 Reusing the trained backbone supernet
                          on ImageNet Deng et al. (2009)                    of SPOS Guo et al. (2020b)
2nd step                  Training the tracking supernet                    Training the tracking supernet
                          on tracking datasets                              on tracking datasets
3rd step                  Searching with the evolutionary                   Searching with the evolutionary
                          algorithm on the tracking supernet                algorithm on the tracking supernet
4th step                  Retraining the searched backbone                  Reusing the trained backbone supernet
                          subnet on ImageNet Deng et al. (2009)             of SPOS Guo et al. (2020b)
5th step                  Finetuning the searched tracking                  Finetuning the searched tracking
                          subnet on tracking datasets                       subnet on tracking datasets
Network searching cost    40 Tesla-V100 GPU days                            3 RTX-2080Ti GPU days

Table 8: Pipeline comparison of ASS and LightTrack in terms of time complexity.

A.5 Visualization of Tracking Result and Failure Case

As shown in Fig. 5, the proposed VLT delivers more robust tracking under deformation and occlusion (the first row) and under interference from similar objects (the second row). This demonstrates the effectiveness of the learned multimodal representation, especially in complex environments. The third row shows a failure case of our tracker. In this case, the target is fully occluded for about 100 frames and distracted by similar objects, preventing our tracker from learning helpful information. A possible solution is to apply a global search strategy, and we leave this to future work.

Figure 5: The first two rows show the success of our tracker in locating target object in complex scenarios, while the third row exhibits a failure case of our method when the target is occluded for a long period (with around 100 frames).
Figure 6: Activation visualization of VLT with different language descriptions using GradCAM Selvaraju et al. (2017). The general language description endows our VL tracker the distinguishability between the target and interferences.

A.6 Activation Analysis of Different Language Descriptions

Language description provides high-level semantics to enhance target-specific channels while suppressing target-irrelevant ones. As presented in Fig. 6, we show the effect of different words to evidence that language helps to identify targets. The first row shows that the VLT without a language description focuses on two birds (red areas), interfered by an object of the same class. When introducing the word "bird", the response of the similar object is obviously suppressed. With the more detailed "black bird", the responses of distractors almost disappear, which reveals that a more specific annotation can help the tracker better locate the target area. Furthermore, we also try to track the target with only the environmental description, i.e., "standing on the ground". The result in column 5 shows that the background is enhanced while the target area is suppressed. This comparison evidences that the language description of the object class is crucial for the tracker to distinguish the target from background clutter, while a mere description of the environment (the fourth column) may introduce interference instead. The last column shows the activation maps with the full description, where the tracker can precisely locate the target, demonstrating the effectiveness of the learned unified-adaptive vision-language representation.

A.7 Attribute-based Performance Analysis

Fig. 7 presents the attribute-based evaluation on LaSOT Fan et al. (2019). Fig. 7(a) compares VLT_SCAR and VLT_TT with representative state-of-the-art algorithms: our methods are more effective than the competing trackers on most attributes. Fig. 7(b) shows the ablation on the different components of VLT_SCAR, which evidences that the integration of ModaMixer and ASS is necessary for a powerful VL tracker.

Figure 7: AUC scores for different attributes on LaSOT: (a) comparison with different trackers; (b) ablation on components of VLT_SCAR.

References

  • [1] M. R. Amini, N. Usunier, and C. Goutte (2009) Learning from multiple partially observed views - an application to multilingual text categorization. In Advances in Neural Information Processing Systems.
  • [2] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [3] S. Badde, F. Hong, and M. S. Landy (2021) Causal inference and the evolution of opposite neurons. Proceedings of the National Academy of Sciences.
  • [4] P. L. Bartlett and S. Mendelson (2002) Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research.
  • [5] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision Workshops.
  • [6] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2020) Know your surroundings: exploiting scene information for object tracking. In European Conference on Computer Vision.
  • [7] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu (2021) Transformer tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [8] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji (2020) Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [9] Y. Cui, J. Cheng, L. Wang, and G. Wu (2022) MixFormer: end-to-end tracking with iterative mixed attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [10] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [11] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) ATOM: accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [12] M. Danelljan, L. V. Gool, and R. Timofte (2020) Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [13] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [14] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • [15] H. Fan, H. Bai, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, M. Huang, J. Liu, Y. Xu, et al. (2021) LaSOT: a high-quality large-scale single object tracking benchmark. International Journal of Computer Vision.
  • [16] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2019) LaSOT: a high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [17] H. Fan and H. Ling (2019) Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [18] Q. Feng, V. Ablavsky, Q. Bai, G. Li, and S. Sclaroff (2020) Real-time visual object tracking with natural language description. In IEEE Winter Conference on Applications of Computer Vision.
  • [19] Q. Feng, V. Ablavsky, Q. Bai, and S. Sclaroff (2021) Siamese natural language tracker: tracking by natural language descriptions with siamese trackers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [20] A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach (2016) Multimodal compact bilinear pooling for visual question answering and visual grounding. In Conference on Empirical Methods in Natural Language Processing.
  • [21] X. Gao, Y. Zhao, Ł. Dudziak, R. Mullins, and C. Xu (2018) Dynamic channel pruning: feature boosting and suppression. arXiv preprint arXiv:1810.05331.
  • [22] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen (2020) SiamCAR: siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [23] M. Guo, Z. Zhang, H. Fan, L. Jing, Y. Lyu, B. Li, and W. Hu (2022) Learning target-aware representation for visual tracking via informative interactions. In International Joint Conference on Artificial Intelligence.
  • [24] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang (2017) Learning dynamic siamese network for visual object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • [25] Z. Guo, X. Zhang, H. Mu, W. Heng, Z. Liu, Y. Wei, and J. Sun (2020) Single path one-shot neural architecture search with uniform sampling. In European Conference on Computer Vision.
  • [26] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [27] L. Huang, X. Zhao, and K. Huang (2019) GOT-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [28] Y. Huang, C. Du, Z. Xue, X. Chen, H. Zhao, and L. Huang (2021) What makes multi-modal learning better than single (provably). Advances in Neural Information Processing Systems.
  • [29] J. Kim, J. Jun, and B. Zhang (2018) Bilinear attention networks. In Advances in Neural Information Processing Systems.
  • [30] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2019) SiamRPN++: evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [31] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [32] Z. Li, R. Tao, E. Gavves, C. G. Snoek, and A. W. Smeulders (2017) Tracking by natural language specification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [33] L. Lin, H. Fan, Y. Xu, and H. Ling (2021) SwinTrack: a simple and strong baseline for transformer tracking. arXiv preprint arXiv:2112.00995.
  • [34] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision.
  • [35] H. Liu, K. Simonyan, and Y. Yang (2019) DARTS: differentiable architecture search. In International Conference on Learning Representations.
  • [36] T. Meinhardt, A. Kirillov, L. Leal-Taixe, and C. Feichtenhofer (2022) TrackFormer: multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [37] J. Pérez-Rúa, V. Vielzeuf, S. Pateux, M. Baccouche, and F. Jurie (2019) MFAS: multimodal fusion architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [38] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
  • [39] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al. (2019) Language models are unsupervised multitask learners. OpenAI blog.
  • [40] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le (2019) Regularized evolution for image classifier architecture search. In The Association for the Advancement of Artificial Intelligence.
  • [41] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke (2017) YouTube-BoundingBoxes: a large high-precision human-annotated data set for object detection in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [42] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision.
  • [43] Z. Shen, M. Zhang, H. Zhao, S. Yi, and H. Li (2021) Efficient attention: attention with linear complexities. In IEEE Winter Conference on Applications of Computer Vision.
  • [44] A. Singh (2019) Foundations of machine learning. Available at SSRN 3399990.
  • [45] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo (2020) TransTrack: multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460.
  • [46] R. Tao, E. Gavves, and A. W. Smeulders (2016) Siamese instance search for tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [47] N. Tripuraneni, M. Jordan, and C. Jin (2020) On the theory of transfer learning: the importance of task diversity. Advances in Neural Information Processing Systems.
  • [48] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems.
  • [49] P. Voigtlaender, J. Luiten, P. H. Torr, and B. Leibe (2020) Siam R-CNN: visual tracking by re-detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [50] N. Wang, J. Shi, D. Yeung, and J. Jia (2015) Understanding and diagnosing visual tracking systems. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • [51] N. Wang, W. Zhou, J. Wang, and H. Li (2021) Transformer meets tracker: exploiting temporal context for robust visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [52] X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu (2021) Towards more flexible and accurate object tracking with natural language: algorithms and benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [53] B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu (2021) Learning spatio-temporal transformer for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • [54] B. Yan, H. Peng, K. Wu, D. Wang, J. Fu, and H. Lu (2021) LightTrack: finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [55] B. Yang, J. Yan, Z. Lei, and S. Z. Li (2015) Convolutional channel features. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • [56] Y. Yu, Y. Xiong, W. Huang, and M. R. Scott (2020) Deformable siamese attention networks for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [57] Z. Yu, Y. Cui, J. Yu, M. Wang, D. Tao, and Q. Tian (2020) Deep multimodal neural architecture search. In ACM International Conference on Multimedia.
  • [58] X. Zhang, X. Zhou, M. Lin, and J. Sun (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [59] Z. Zhang, Y. Liu, X. Wang, B. Li, and W. Hu (2021) Learn to match: automatic matching network design for visual tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision.
  • [60] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu (2020) Ocean: object-aware anchor-free tracking. In European Conference on Computer Vision.
  • [61] Z. Zhang and H. Peng (2019) Deeper and wider siamese networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
  • [62] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In International Conference on Learning Representations.