Visual object tracking (VOT) aims to accurately localize an unknown object in sequential video frames, given only its initial state. Visual trackers constantly seek more robust and accurate approaches to cope with the various applications and challenges of real-world scenarios. In the spirit of deep learning (DL), an important objective is to design reliable network architectures for visual tracking purposes [OurSurvey], which usually requires adequate experience, insightful knowledge, learned heuristics, and extensive manual trial & error.
Neural architecture search (NAS) has been developed to automatically discover preferable (or ideally optimal) network architectures for a learning task by exploring a wide-reaching space of candidates. Generally, NAS methods are classified into reinforcement learning (RL)-based, evolutionary algorithm (EA)-based, Bayesian optimization (BO)-based, and gradient-based methods, according to their diversified search strategies. While the first three categories suffer from low efficiency, high time consumption, and extensive computational overhead, gradient-based methods provide competitive performance with considerably higher efficiency. The well-known differentiable architecture search (DARTS) [DARTS] introduces a generic approach that relaxes the search space into the continuous domain and shares the parameters among candidate architectures. Although DARTS has achieved promising results with its gradient-based search, several works [PDARTS, DARTS+, RobustDARTS, FairDARTS, DARTS-, ProxylessNAS] have been proposed to study and address its problems.
Despite the exploitation of NAS in numerous tasks (e.g., classification [DARTS, FairDARTS], detection [DetNAS, NAS-FCOS], semantic segmentation [Auto-DeepLab, Nas-Seg1]), almost all network architectures for visual tracking are based on human-designed heuristics. Very recently, LightTrack [LightTrack] used evolutionary search to obtain lightweight architectures for resource-limited hardware platforms. It also uses single-path uniform sampling and lightweight building blocks to achieve more compact architectures and reduce computational costs. However, single-path sampling-based methods decouple the optimization of the weights and architecture parameters of the super-net, leading to large variance in the optimization process and a tendency toward non-complex structures [NAS-FCOS]. LightTrack [LightTrack] thus inherits the limitations of evolutionary algorithms as well as single-path search approaches. Furthermore, it searches within a limited search space and stacks basic blocks to construct the final architecture. In contrast, this work aims to automatically discover the best architecture block (or cell) that adapts large-scale trained backbone features to the tracking objectives. It modifies DARTS [DARTS], which provides appealing advantages such as weight-sharing, gradient-based search, efficiency, and simplicity, to achieve better generalization. The modifications include (i) operation-level Dropout [PDARTS] to prevent the aggregation of skip-connections and (ii) early stopping of the search procedure using a hold-out sample set for validating the transferability of the best cell. Moreover, the proposed approach exploits the complete search space of DARTS. Last but not least, the proposed approach does not stack multiple cells, providing a balanced accuracy & speed trade-off for visual tracking and avoiding the conventional performance gap between the search and evaluation phases [PDARTS_IJCV]. Finally, the effectiveness of NAS exploitation and its generalization is validated by employing three versions of DARTS [DARTS, FairDARTS] and integrating the proposed approach into two visual trackers [DiMP, PrDiMP].
In summary, the main contributions are as follows:
A novel cell-level differentiable architecture search mechanism is proposed to automate the network design of the tracking module during offline training. Our approach is simple, efficient, and easily incorporated into existing trackers for improved performance.
Extensive experimental evaluations on five widely-used visual tracking benchmarks demonstrate the superior performance of the proposed approach. Our architecture search mechanism empirically takes 41 hours to produce the final network architecture; moreover, it is practically shown to boost the overall performance when applied to existing baselines. Our code is to be made publicly available upon paper acceptance.
2 Related Work
2.1 Single Object Tracking
Most state-of-the-art visual trackers are based on classic/custom Siamese networks [SiamRPN++, SiamBAN, SiamCAR, SiamAttn, DiMP, PrDiMP], providing a good trade-off between performance & computational complexity. The main ideas include taking powerful backbone features and employing lightweight modules to extract robust target-specific features for visual tracking. For instance, the SiamRPN++ [SiamRPN++], SiamBAN [SiamBAN], SiamCAR [SiamCAR], DiMP [DiMP], SiamAttn [SiamAttn], and PrDiMP [PrDiMP] trackers all use ResNet-50 [ResNet] as the backbone and adapt its features for visual object tracking using shallow sub-networks. However, these hand-designed sub-networks are biased toward human priors, with no guarantee of achieving the highest effectiveness. This motivates the present work to automatically design these modules by a cell-level search procedure.
2.2 Differentiable NAS
Typically, NAS methods use RL-based or EA-based algorithms to search for an optimal architecture in a discrete space of potential candidates. Unfortunately, these methods are inefficient and unaffordable, as they demand large computational resources to perform the search procedure. In contrast, the gradient-based approach has shown promising results while searching for only a few GPU days. As the most popular gradient-based approach, DARTS [DARTS] proposes a continuous search space that provides appealing generalization ability, simplicity, and efficiency. It introduces first- & second-order approximation-based approaches according to the calculation of the architecture gradient, where the second-order one leads to better performance but lower search speed. However, DARTS suffers from (i) the performance gap between the search & evaluation phases [PDARTS, PDARTS_IJCV], (ii) the repeating-blocks restriction [ProxylessNAS], (iii) performance collapse [DARTS+, DARTS-, RobustDARTS] due to model over-fitting, (iv) degenerate architectures [RobustDARTS], and (v) the aggregation of skip connections [FairDARTS, PDARTS, PDARTS_IJCV].
Consequently, several works have been presented to address the problems of DARTS. To bridge the gap between the search and evaluation phases, progressive DARTS (PDARTS) [PDARTS] gradually increases the network depth, assisted by search-space approximation & regularization. It achieves better performance while alleviating heavy computational overhead and instability. ProxylessNAS [ProxylessNAS] proposes learning architectures on large-scale datasets, path-level pruning, and a latency regularization loss to address the repeating-blocks restriction, GPU memory consumption, and hardware limitations.
DARTS+ [DARTS+] proposes an early-stopping paradigm to avoid the performance collapse of DARTS due to model over-fitting in the search phase. To improve robustness, RobustDARTS [RobustDARTS] investigates the failure cases of DARTS that cause degenerate architectures with inferior performance. It then introduces an early-stopping criterion based on the dominant Hessian eigenvalue of the validation loss. Besides, an adaptive regularization is adopted according to the curvature of the validation loss. DARTS- [DARTS-] proposes an indicator-free approach to handle the performance collapse & search instability of DARTS. It distinguishes the two roles of skip connections (i.e., stabilization of super-net training & candidate operation) by an auxiliary skip connection between every two nodes. Finally, Fair-DARTS [FairDARTS] proposes a collaborative competition approach and an auxiliary loss to address the aggregation of skip connections & the discretization discrepancy problem, respectively.
The common limitation is that most DARTS-based methods employ first-order DARTS to reduce computational complexity, which allows performing the search procedure on several stacked cells. In contrast, this work exploits second-order DARTS integrated into a visual tracking framework by adopting a cell-level search procedure. Besides, operation-level Dropout [PDARTS, PDARTS_IJCV] and a proposed early-stopping strategy are used to alleviate the aggregation of skip connections and select the best cell architecture.
3 Proposed Approach: CHASE
The primary motivation of this work is to automatically adapt the robust features extracted from the backbone to the tracking objective by a computational cell (see Fig. 1). Hence, this work exploits a modified version of DARTS [DARTS] that forms an ordered directed acyclic graph (DAG) with nodes as its computational cell, which is learned through the architecture search procedure. Note that conventional NAS techniques (e.g., [DARTS, DARTS-, FairDARTS]) and LightTrack [LightTrack] stack a series of cells considering the size of the training set and the desired computational complexity. In contrast, the proposed approach learns a single cell within a visual tracking network to preserve limited computational complexity. This work uses PrDiMP [PrDiMP] as the baseline to demonstrate the effectiveness of the proposed approach for visual tracking. PrDiMP includes the target center regression (TCR) & bounding box regression (BBR) networks. It predicts the conditional probability density to minimize the Kullback-Leibler (KL) divergence between the predictions and the label distribution (see [PrDiMP] for more details). The proposed tracker (CHASE) replaces the additional convolutional blocks after the backbone with a DAG to find the best operations and node connections by the modified second-order DARTS.
The computational cell has two input nodes (from Block3 & Block4 of ResNet-50 [ResNet]) and four intermediate nodes. Given a feature map $x^{(i)}$ at node $i$, the corresponding latent representation at intermediate node $j$ is computed as
$$x^{(j)} = \sum_{i<j} o^{(i,j)}\big(x^{(i)}\big),$$
where $o^{(i,j)}$ stands for a candidate operation (from a predefined set $\mathcal{O}$ in the search space) on edge $(i,j)$. The candidate operations consist of 3×3 & 5×5 separable convolutions, 3×3 & 5×5 dilated convolutions, 3×3 max pooling, 3×3 average pooling, zero (no connection), and skip connection (i.e., identity). Since DARTS tends to aggregate skip connections due to their rapid error decay during optimization [PDARTS, PDARTS_IJCV], CHASE employs operation-level Dropout (previously used by [PDARTS, PDARTS_IJCV]) with an initial rate that gradually decays during the search procedure. However, in contrast to [PDARTS], CHASE does not control the number of skip connections in the final cell architecture, to fairly explore all operations and provide more flexibility in the final decision.
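The operation-level Dropout described above can be sketched as follows. This is a minimal NumPy illustration; the linear decay schedule and the inverted-dropout rescaling are assumptions for exposition, since the exact schedule is not fixed here:

```python
import numpy as np

def skip_with_dropout(x, drop_rate, rng):
    # Randomly drop the whole skip-connection output with probability
    # drop_rate, rescaling the surviving signal (inverted dropout) so that
    # its expected magnitude is unchanged.
    if rng.random() < drop_rate:
        return np.zeros_like(x)
    return x / (1.0 - drop_rate)

def decayed_drop_rate(initial_rate, epoch, total_epochs):
    # Gradually decay the dropout rate toward zero over the search epochs,
    # so skip connections compete freely near the end of the search.
    return initial_rate * (1.0 - epoch / total_epochs)
```

During the search, the decayed rate would be recomputed each epoch and applied only to the skip-connection branch of every edge, leaving the other candidate operations untouched.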
Since the architecture representations construct a discrete search space, DARTS [DARTS] utilizes the softmax function to relax the problem into a continuous search space. That is, the mixed output $\bar{o}^{(i,j)}$ for edge $(i,j)$ is calculated by
$$\bar{o}^{(i,j)}(x) = \sum_{o\in\mathcal{O}} \frac{\exp\big(\alpha_{o}^{(i,j)}\big)}{\sum_{o'\in\mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x),$$
in which $\alpha_{o}^{(i,j)}$ is the operation mixing weight associated with operation $o$ between nodes $i$ and $j$. In fact, the architecture search problem converts into learning the continuous parameters $\alpha = \{\alpha^{(i,j)}\}$. To jointly learn the network parameters ($w$) and architecture parameters ($\alpha$), the gradient descent (GD) algorithm is used to minimize the training ($\mathcal{L}_{train}$) and validation ($\mathcal{L}_{val}$) losses by performing the bi-level optimization problem
$$\min_{\alpha}\; \mathcal{L}_{val}\big(w^{*}(\alpha), \alpha\big) \quad \text{s.t.} \quad w^{*}(\alpha) = \arg\min_{w}\; \mathcal{L}_{train}(w, \alpha).$$
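The continuous relaxation can be illustrated with a small NumPy sketch; the operation outputs here are placeholder arrays rather than real convolution or pooling results:

```python
import numpy as np

def mixed_op_output(op_outputs, alphas):
    # op_outputs: array of shape (num_ops, ...) stacking the outputs of all
    # candidate operations on one edge.
    # alphas: shape (num_ops,) architecture parameters for that edge.
    w = np.exp(alphas - alphas.max())   # numerically stable softmax
    w /= w.sum()
    # The weighted sum over the operation axis gives the mixed (relaxed)
    # output of the edge; as one alpha dominates, the mixture approaches
    # the corresponding single operation.
    return np.tensordot(w, op_outputs, axes=1)
```

With equal alphas the mixed output is simply the average of the candidate outputs; a strongly dominant alpha recovers a single operation, which is what the later discretization step exploits.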
To avoid the expensive inner optimization, DARTS reduces the evaluation of the architecture gradient by applying a single virtual gradient step and a finite difference approximation as
$$\nabla_{\alpha}\mathcal{L}_{val}\big(w^{*}(\alpha),\alpha\big) \approx \nabla_{\alpha}\mathcal{L}_{val}(w',\alpha) - \xi\,\frac{\nabla_{\alpha}\mathcal{L}_{train}(w^{+},\alpha) - \nabla_{\alpha}\mathcal{L}_{train}(w^{-},\alpha)}{2\epsilon},$$
where $w' = w - \xi\nabla_{w}\mathcal{L}_{train}(w,\alpha)$ and $w^{\pm} = w \pm \epsilon\nabla_{w'}\mathcal{L}_{val}(w',\alpha)$. Also, $\xi$ and $\epsilon$ are the learning rate for one step of the inner optimization and a small scalar, respectively. Accordingly, two forward passes for $w$ and two backward passes for $\alpha$ are required to perform the second-order approximation of DARTS (see [DARTS] for more details).
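The second-order approximation above can be sketched with scalar toy losses; real DARTS applies the same recipe to tensors via automatic differentiation, and the callables below (gradient functions of the training and validation losses) are placeholders for autograd results:

```python
def arch_grad_second_order(w, alpha, g_w_train, g_a_train, g_w_val, g_a_val,
                           xi=0.1, eps=1e-3):
    # One virtual gradient step on the weights: w' = w - xi * dL_train/dw.
    w1 = w - xi * g_w_train(w, alpha)
    # Direct term of the architecture gradient, evaluated at (w', alpha).
    direct = g_a_val(w1, alpha)
    # Finite-difference approximation of the second-order term: perturb w
    # by +/- eps along the direction v = dL_val/dw'.
    v = g_w_val(w1, alpha)
    hvp = (g_a_train(w + eps * v, alpha) - g_a_train(w - eps * v, alpha)) / (2 * eps)
    return direct - xi * hvp
```

For quadratic losses such as L_train = 0.5*(w - alpha)^2 and L_val = 0.5*w^2, the finite difference is exact and the result matches the analytic total derivative xi * w', which is a convenient sanity check for an implementation.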
Lastly, the algorithm seeks the best cell based on a one-hot encoding of the architecture parameters. However, DARTS suffers from performance collapse and poor generalization if the architecture search procedure is not stopped at a proper training point. Therefore, this work introduces an early-stopping strategy that holds out a set of samples for selecting the best architecture. Note that CHASE never uses the validation/testing sets for this purpose. Also, the proposed early-stopping strategy is a straightforward way to validate the transferability of the best architecture to other datasets. CHASE derives the final architecture by
$$o^{(i,j)} = \arg\max_{o\in\mathcal{O},\; o\neq zero}\; \alpha_{o}^{(i,j)},$$
where the most likely operation is kept to achieve a discrete architecture at the minimum hold-out loss on the hold-out set.
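The discretization step can be sketched as follows; the operation names follow the standard DARTS search space described earlier, and the exact string identifiers are illustrative:

```python
import numpy as np

# Candidate operations of the DARTS search space (names are illustrative).
CANDIDATE_OPS = ["sep_conv_3x3", "sep_conv_5x5", "dil_conv_3x3", "dil_conv_5x5",
                 "max_pool_3x3", "avg_pool_3x3", "none", "skip_connect"]

def discretize(alpha):
    # alpha: (num_edges, num_ops) architecture parameters after the search.
    # On each edge, keep the most likely operation while excluding 'none'
    # (the zero operation), as in the DARTS discretization convention.
    none_idx = CANDIDATE_OPS.index("none")
    chosen = []
    for edge_alpha in np.asarray(alpha, dtype=float):
        masked = edge_alpha.copy()
        masked[none_idx] = -np.inf
        chosen.append(CANDIDATE_OPS[int(np.argmax(masked))])
    return chosen
```

In the full procedure this argmax would be applied to the alpha snapshot whose discrete cell attains the minimum loss on the hold-out set, implementing the early-stopping selection.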
4 Empirical Experiments
Herein, the implementation details of the proposed approach, an ablation analysis of three DARTS-based methods, and the tracking results of the best architecture on benchmark datasets are reported.
4.1 Implementation Details
The backbone consists of the ResNet-50 architecture [ResNet] initialized with pre-trained ImageNet [ImageNet] weights. The offline experiments comprise the searching and training phases. The proposed CHASE tracker is implemented in PyTorch and runs on a single Nvidia Tesla V100 GPU with 16GB RAM. Except for the following details, the rest of the hyper-parameters are set to the ones in [PrDiMP]. The test sets are never utilized in the searching or training phases.
4.1.1 Searching Phase
In this phase, the cell architecture is searched by the modified second-order DARTS. The cell includes 14 edges among 7 nodes (2 input, 4 intermediate, and 1 output), where the output node is obtained by depth-wise concatenation of the intermediate nodes. CHASE applies operation-level Dropout to avoid shallow connections; its rate starts from an initial value and gradually decays until the last epoch. In contrast to PDARTS [PDARTS], which forces the number of skip-connections to be two, CHASE lets DARTS decide fairly, considering the importance of skip-connections for the evaluation accuracy.
The training split of the TrackingNet dataset [TrackingNet] is divided into two subsets for the training and validation procedures. The weights of the network ($w$) and the encoding weights of the architecture ($\alpha$) are optimized on the training ($\mathcal{L}_{train}$) and validation ($\mathcal{L}_{val}$) sets, respectively. Besides, the training sets of the GOT-10k [GOT-10k] and LaSOT [LaSOT] datasets are used as the hold-out set to determine the best architecture among three runs (with different random seeds), picking the final cell architecture based on their performance. This strategy prevents over-fitting and validates the transferability of the best cell to other datasets. Following the NAS training strategies in [AutoFPN], the backbone and BBR parameters are frozen during the architecture search, while the architecture parameters are only optimized after a few warm-up epochs. This strategy provides fair competition between weight-free operations and the other ones, leading to performance improvement, acceleration, and the avoidance of bad local optima. The network is trained with the same batch size as the baseline [PrDiMP]. However, the proposed approach stops the training procedure based on the proposed early-stopping strategy. The Adam optimizer [ADAM] is used to learn the network and architecture parameters, with a cosine annealing scheduler applied to the learning rate. The maximum numbers of iterations are set separately for the training, validation, and early-stopping procedures. The search phase takes about 41 (18) hours for the second (first) order DARTS method on the TrackingNet dataset using one Nvidia Tesla V100 GPU with 16GB RAM.
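The cosine annealing scheduler mentioned above can be sketched as follows; the minimum learning rate and step granularity are assumptions for illustration, since only the scheduler type is stated:

```python
import math

def cosine_annealed_lr(base_lr, step, total_steps, min_lr=0.0):
    # Cosine annealing from base_lr at step 0 down to min_lr at total_steps,
    # as commonly applied to the architecture-parameter optimizer.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * step / total_steps))
```

The schedule decays slowly at first, fastest mid-way, and flattens near the end, which pairs naturally with an early-stopping criterion that may cut the search short before total_steps is reached.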
4.1.2 Training Phase
In this phase, the computational cell is replaced by the best cell architecture, and the whole network (including the backbone, TCR, and BBR) is jointly trained from scratch. The TCR and BBR layers are initialized with random weights, ignoring the weights obtained during the searching phase. For the training phase, the training splits of the LaSOT [LaSOT], TrackingNet [TrackingNet], GOT-10k [GOT-10k], and COCO [MSCOCO] datasets are utilized, similar to the baseline [PrDiMP]. Also, the other hyper-parameters are set as in the baseline tracker [PrDiMP].
4.1.3 Evaluating Phase
After the offline searching and training phases, the proposed CHASE tracker is evaluated on the test splits of generic and aerial visual tracking datasets, namely GOT-10k [GOT-10k], TrackingNet [TrackingNet], LaSOT [LaSOT], UAV-123 [UAV123], and VisDrone-2019-test-dev [VisDrone2019]. In the online phase, all procedures and settings are the same as in [PrDiMP].
4.2 Ablation Analysis
Different versions of CHASE are evaluated on the GOT-10k dataset [GOT-10k] to validate the effectiveness of various DARTS-based methods. First, the results of the best cells derived by first-order DARTS [DARTS] (CHASE-D1), Fair-DARTS [FairDARTS] (CHASE-FD), and the modified second-order DARTS (CHASE-PrDiMP, or simply CHASE) are reported in Table 1 (see Fig. 2). Second, the proposed approach is integrated into the DiMP tracker [DiMP] (CHASE-DiMP) to demonstrate the generalization of the proposed approach for visual tracking. As shown in Fig. 2, CHASE-D1 derives a cell dominated by weight-free operations (i.e., skip and pooling operations) with no connections between intermediate nodes, resulting in a shallow architecture. CHASE-FD employs Fair-DARTS [FairDARTS], which utilizes the Sigmoid activation function and an auxiliary loss to address two problems: the exclusive competition of skip-connections and the discretization discrepancy. Nonetheless, CHASE outperforms both CHASE-D1 & CHASE-FD in terms of the average overlap (AO) metric (see Table 1). To investigate the generality of the proposed approach, it is also integrated into the DiMP tracker [DiMP], which minimizes an L2-based discriminative learning loss to train its network. The proposed approach outperforms the DiMP tracker [DiMP] in terms of both AO and success rate (SR). The computational cells derived by CHASE validate the selection of various operations according to the objective function and the prevention of over-fitting problems (e.g., shallow architectures) during the searching phase. Finally, the best-performing tracker, CHASE, is selected for comparison with recent trackers in the next section.
Table 1: Ablation comparison of DiMP [DiMP], CHASE-DiMP, PrDiMP [PrDiMP], CHASE-D1, CHASE-FD, and CHASE-PrDiMP on GOT-10k.
4.3 State-of-the-art Comparison
In this section, the state-of-the-art evaluations are performed on five large-scale visual tracking benchmarks (refer to Sec. 4.1.3) and the proposed CHASE tracker is compared with various state-of-the-art visual trackers, namely ECO [ECO], SiamMask [SiamMask], DaSiamRPN [DaSiamRPN], SiamRPN++ [SiamRPN++], ATOM [ATOM], DCFST [DCFST], COMET [COMET], SiamFC++ [SiamFCpp], DiMP-50 [DiMP], PrDiMP-50 [PrDiMP], KYS [KYS], SiamAttn [SiamAttn], MAML [MAML], ROAM++ [ROAM], SiamCAR [SiamCAR], SiamBAN [SiamBAN], D3S [D3S], Ocean [Ocean], and LightTrack [LightTrack].
GOT-10k [GOT-10k]: This large, high-diversity dataset provides an extensive training set and an evaluation set without publicly available ground-truth. Notably, the target classes used for evaluation do not overlap with the training ones. Hence, this dataset is usually used for studying the transferability of proposed approaches to tracking unseen targets, and the proposed CHASE therefore uses its training set as one of the hold-out sets to early-stop the cell searching phase. The comparison results presented in Table 2 show that CHASE outperforms the baseline in terms of AO and SR at overlap thresholds of 0.5 and 0.75. Besides, CHASE achieves better results in AO and SR0.5 compared with LightTrack [LightTrack].
Table 2: State-of-the-art comparison in terms of AO, SR0.5, & SR0.75 (GOT-10k); AUC, normalized precision, & precision (LaSOT and TrackingNet); SR0.5 & precision (UAV-123); and AUC & precision (VisDrone-2019-test-dev).
LaSOT [LaSOT]: LaSOT is a long-term and challenging tracking benchmark consisting of lengthy videos, whose test set contains target disappearance/reappearance scenarios. Thus, this dataset appropriately indicates the robustness of short-term trackers in real-world situations. For this reason, the proposed tracker uses its training set as the second dataset of the hold-out set in the searching phase. As shown in Table 2, CHASE improves the baseline results [PrDiMP] in terms of the area under curve (AUC), normalized precision, and precision.
TrackingNet [TrackingNet]: TrackingNet is a challenging in-the-wild tracking dataset with diverse target classes collected from YouTube videos; the ground-truths of its test set are not publicly available. From Table 2, the MAML tracker [Instance_Det_meta] achieves close results (better in the precision metric) compared with the proposed tracker, since it employs a modern object detector (i.e., FCOS [FCOS]) and online domain adaptation to enhance discriminating the target from non-target regions. However, the proposed CHASE tracker achieves better results in terms of AUC and normalized precision, and it improves the baseline in both the AUC and precision metrics.
UAV-123 [UAV123]: UAV-123 is a challenging aerial-view tracking dataset with diverse target classes captured from a low-altitude perspective. According to the results in Table 2, the proposed CHASE tracker outperforms not only the state-of-the-art visual trackers but also the baseline tracker [PrDiMP] in terms of the success and precision rate metrics.
VisDrone-2019-test-dev [VisDrone2019]: VisDrone-2019 also targets tracking visual objects captured from an aerial view. It includes test videos with challenging scenarios such as abrupt camera motion, tiny targets, fast view-point changes, and day/night conditions. Compared with the baseline [PrDiMP], the CHASE tracker improves the results in both AUC and precision rate. COMET [COMET] obtains the best results by employing the training set of VisDrone for its offline training and carefully designed modules for small object tracking.
5 Conclusion
A novel cell-level differentiable architecture search mechanism has been proposed. To address the inherent limitations of differentiable architecture search, we modify second-order DARTS with operation-level Dropout and early stopping to mitigate the skip-connection aggregation and performance collapse issues. Our approach is simple, efficient, and easily integrated into existing visual trackers. Extensive experiments demonstrate the effectiveness of the proposed approach, as well as the noticeable performance improvement it brings when combined with different existing trackers.