Model-free visual object tracking is one of the most fundamental problems in computer vision. Given the object of interest marked in the first video frame, the objective is to localize the target in subsequent frames, despite object motion, changes in viewpoint, lighting variation, among other disturbing factors. One of the most challenging difficulties with model-free tracking is the lack of prior knowledge on the target object appearance. Since any arbitrary object may be tracked, it is impossible to train a fully specialized tracker.
Recently, convolutional neural networks (CNNs) have demonstrated strong power in learning feature representations. To fully exploit the representation power of CNNs in visual tracking, it is desirable to train them on large datasets specialized for visual tracking, and covering a wide range of variations in the combination of target and background. However, it is truly challenging to learn a unified representation based on videos that have completely different characteristics. Some trackers train regression networks for tracking in an entirely offline manner. Other works [2, 3, 6] propose to train deep CNNs to address the general similarity learning problem in an offline phase and evaluate the similarity online during tracking. However, since these works have no online adaptation, the representations they learned offline are general but not always discriminative.
Rather than applying a single fixed network for feature extraction, we propose to use multiple network branches with an online branch selection mechanism. It is well known that different networks designed and trained for different tasks have diverse feature representations. With the online branch selection mechanism, our tracker dynamically selects the most efficient and robust branch for target representation, even if the target appearance changes. Our goal is to improve the generalization capability with multiple networks.
The main contributions of our work are summarized as follows. First, we propose a multi-branch framework based on a siamese network for object tracking. The proposed architecture is designed to extract appearance representations that are robust against target variations and changing contrast with background scene elements. Second, to make full use of the different branches, we propose an effective and generic branch selection mechanism that dynamically selects branches according to their discriminative power. Third, building on the multiple branches and the branch selection mechanism, we present a novel deep learning tracker that achieves improved tracking performance at real-time speed. Our extensive experiments compare the proposed Multi-Branch Siamese Tracker (MBST) with state-of-the-art trackers on OTB benchmarks [4, 5].
2 Related Work
Siamese Network Based Trackers. Object tracking can be addressed using similarity learning. By learning a deep embedding function, we can evaluate the similarity between an exemplar image patch and a candidate patch in a search region. The target is then tracked to the location that obtains the highest similarity score. Inspired by this idea, the pioneering work of SiamFC [2] proposed a fully-convolutional Siamese network in which similarity learning with deep CNNs is addressed using a Siamese architecture. Since this approach does not need online training, it can easily achieve real-time tracking. Owing to the robustness and real-time performance of SiamFC [2], several subsequent works proceeded along this direction to address the tracking problem. In this context, EAST [7] employs an early-stopping agent to speed up tracking: easy frames are processed with cheap features, while challenging frames are processed with deep features. CFNet [3] incorporates a correlation filter into a shallow Siamese network, which speeds up tracking without an accuracy drop compared to a deep Siamese network. TRACA [8] applies context-aware feature compression before tracking to achieve high tracking performance. SA-Siam [6] combines semantic features and appearance features to improve generalization capability. In our work, we use the Siamese network as an embedding function to extract feature representations. All branches use the Siamese architecture to apply an identical transformation to the target patch and the search region.
Multi-Branch Tracking Frameworks. The diversity of target representation from a single fixed network is limited. The learned features may not be discriminative in all tracking situations. Many works use diverse features with context-aware or domain-aware schemes.
TRACA [8] is a multi-branch tracker, which utilizes multiple expert auto-encoders to robustly compress raw deep convolutional features. Since each of the expert auto-encoders is trained according to a different context, it performs context-dependent compression. MDNet [9] is composed of shared layers and multiple branches of domain-specific layers. BranchOut [18] employs a CNN for target representation, with common convolutional layers and multiple branches of fully connected layers. It allows a different number of layers in each branch to maintain variable abstraction levels of target appearances.
A common insight of these multi-branch trackers is that a robust tracker can be built by utilizing different feature representations. Our method shares some insights and design principles with other multi-branch trackers. Our network architecture is composed of multiple branches that are trained offline separately and focus on different types of CNN features. In addition, we use an AlexNet [11] branch in our framework that is designed and pretrained for image classification. In our multi-branch framework, the combination of branches trained in different scenarios ensures a better use of diverse feature representations.
Online Branch Selection. Different models produce different feature maps for tracked targets under varying scale, rotation, illumination and other factors. Using all available features for tracking a single object is neither efficient nor effective. BranchOut [18] selects a subset of branches randomly for model update to diversify learned target appearance models. MDNet [9] learns domain-independent representations from pretraining, and identifies branches through online learning.
In our online branch selection mechanism, we analyse the feature representation of each branch to select the most robust branch at regular frame intervals. This allows us to use diverse feature representations and to handle various challenges in the object tracking problem more efficiently.
3 Multi-Branch Siamese Tracker
We propose a multi-branch siamese network for tracking. Given that different neural network models produce diverse feature representations, we use many of them as branches in our tracker to produce diverse feature representations and select the most robust branch with our online branch selection mechanism.
3.1 Network Architecture
Using multiple target representations has been shown to be beneficial for object tracking [6, 10], as different CNNs can provide various feature representations. In our work, we ensemble several siamese networks, comprising a set of context-dependent branches and one AlexNet branch. The context-dependent branches have the same structure as SiamFC [2] and the AlexNet branch has the same structure as AlexNet [11]. Each branch of the tracker is a siamese network applying an identical transformation to both inputs and combining their representations with a cross-correlation layer. The architecture of the proposed tracker is illustrated in Fig. 1.
The input consists of a target patch cropped from the first video frame and another patch containing the search region in the current frame. The target patch is specified by its width, height and number of color channels. The search region is specified in the same way and is larger than the target patch in width and height. The search region can be considered as a collection of candidate patches with the same dimensions as the target patch.
From what we observed, there are two strategies to improve the discriminative ability of the tracking networks. The first is to train the network in different contexts, while the second is to use multiple networks designed and trained for different tasks. In our approach, we utilize context-dependent branches pretrained in different contexts, in addition to another branch pretrained for the image classification task, to improve our tracking performance. We note that more branches could be added from other pretrained networks at the cost of slower tracking speed.
Context-dependent branches: We use several context-dependent branches and one general branch. All these branches have the same architecture as the SiamFC network [2]. Context-dependent branches are trained in three steps. First, we train the basic siamese network on the ILSVRC-2015 [12] video dataset (henceforth ImageNet), which includes 4,000 video sequences and around 1.3 million frames containing about 2 million tracked objects. We keep the basic siamese network as the general branch. Then, we perform contextual clustering on the low-level feature maps from the ImageNet Video dataset to find the context-dependent clusters. Finally, we use the clusters to train the context-dependent branches, each initialized from the basic siamese network. These branches take the target patch $z$ and the search region $X$ as input and extract their feature maps. Then, using a cross-correlation layer, we combine the feature maps to get a response map. The response map of the context-dependent branches is calculated as:
$$h_c(z, X) = \mathrm{corr}\big(f_c(z), f_c(X)\big),$$
where $c$ indicates the contextual index, which also covers the general branch, and $f_c$ denotes the features generated by the corresponding network.
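The cross-correlation step that turns two feature maps into a response map can be sketched as follows. This is a minimal NumPy illustration rather than the tracker's actual code: the function name `cross_correlate`, the toy feature sizes and the synthetic features are assumptions made for the example, and in the tracker the features would come from a branch network.

```python
import numpy as np

def cross_correlate(target_feat, search_feat):
    """Slide the target features over the search features.

    target_feat: (h, w, c) embedding of the target patch.
    search_feat: (H, W, c) embedding of the search region, with H >= h, W >= w.
    Returns an (H - h + 1, W - w + 1) response map of similarity scores.
    """
    h, w, c = target_feat.shape
    H, W, c2 = search_feat.shape
    if c != c2 or H < h or W < w:
        raise ValueError("incompatible feature shapes")
    response = np.empty((H - h + 1, W - w + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[i:i + h, j:j + w, :]
            # Inner product between the target and one candidate window.
            response[i, j] = np.sum(window * target_feat)
    return response

# Toy data (assumed sizes): plant the target features at offset (2, 4)
# inside an otherwise empty search feature map.
rng = np.random.default_rng(0)
target_feat = rng.standard_normal((3, 3, 4))
search_feat = np.zeros((8, 8, 4))
search_feat[2:5, 4:7, :] = target_feat

response = cross_correlate(target_feat, search_feat)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(response.shape, peak)
```

Each entry of the response map scores one candidate window of the search region, so the location of the maximum indicates the most similar candidate.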
The AlexNet branch: We use AlexNet [11] pretrained on the image classification task as a branch, so that one network in the ensemble is trained for a different task. Small modifications are made to the stride to ensure that the output response map has the same dimension as that of the other branches. Since AlexNet is trained for image classification and its deeper layers encode more semantic information about targets, target representations from this branch are more robust to significant appearance variations. This branch takes the target patch $z$ and the search region $X$ as input, and the generated features are denoted as $f_a(\cdot)$. The response map is expressed as:
$$h_a(z, X) = \mathrm{corr}\big(f_a(z), f_a(X)\big).$$
In our implementation, MBST is composed of several context-dependent branches and one AlexNet branch. The output of each branch is a response map indicating the similarity between the target and the candidate patches within the search region. The branch selection mechanism compares the maps from each branch to select the most discriminative one. The corresponding branch is then used for a fixed number of subsequent frames.
3.2 Online Branch Selection Mechanism
Different branches trained in different scenarios can be used to diversify the target representation. To ensure the optimal exploitation of the diverse representations from our branches, we designed a branch selection mechanism to monitor the tracking output and automatically select the most discriminative branch as illustrated in Fig. 2.
Given the input image pair, each branch applies an identical transformation to both inputs and calculates the response map using a cross-correlation layer. Since the ranges of feature values differ between branches, we apply response weights $w_i$ to the response map of each branch to normalize their range differences. The discriminative power is then measured based on the weighted response maps from all branches. The heuristic approach we used to measure the discriminative power of branches is formulated as:
$$D(h_i) = w_i \big(P(h_i) - L(h_i)\big),$$
where $h_i$ is the response map for each branch $i$, $P(h_i)$ is the peak value of the response map $h_i$, and $L(h_i)$ is the minimum value of the response map $h_i$.
The objective function of our branch selection mechanism can be written as:
$$i^* = \arg\max_i D(h_i),$$
where $i^*$ is the selected branch to transform inputs.
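The selection rule can be illustrated with a short sketch. It assumes that the discriminative power of a branch is its response weight multiplied by the difference between the peak and the minimum of its response map, which is one reading of the description above; the function names and the toy response maps are hypothetical, while the weights 1.0 and 10.5 are the values reported in the hyperparameter settings.

```python
import numpy as np

def discriminative_power(response, weight):
    # Assumed score: weight * (peak value - minimum value) of the response map.
    return weight * (float(np.max(response)) - float(np.min(response)))

def select_branch(responses, weights):
    """Return the index of the branch with the highest score, plus all scores."""
    scores = [discriminative_power(r, w) for r, w in zip(responses, weights)]
    return int(np.argmax(scores)), scores

# Toy response maps (hypothetical values): two siamese-style branches and one
# branch whose raw responses live on a much smaller scale.
responses = [
    np.array([[0.1, 0.2], [0.3, 0.9]]),      # range 0.8
    np.array([[0.0, 0.5], [0.4, 0.6]]),      # range 0.6
    np.array([[0.01, 0.02], [0.03, 0.10]]),  # range 0.09 before weighting
]
weights = [1.0, 1.0, 10.5]

best, scores = select_branch(responses, weights)
print(best, scores)
```

Without the weight, the third branch would never be selected because its raw responses are small; the weighting puts the ranges on a comparable scale before the comparison.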
4 Experiments
The first aim of our experiments is to investigate the effect of incorporating multiple feature representations with an online branch selection mechanism. For this purpose, we performed an ablation analysis of our framework. We then compare our method with state-of-the-art trackers. The experimental results demonstrate that our method achieves improved performance with respect to the basic SiamFC tracker [2].
4.1 Implementation Details
Network structure: The context-dependent branches have exactly the same structure as the SiamFC network [2]. For the AlexNet branch, we use AlexNet [11] pretrained on the ImageNet dataset [12], with a small modification to ensure that the output response map has the same dimension as that of the other branches. Other branches based on other network architectures could also be used.
Data Dimensions: In our experiments, the target image patch and the search region have fixed dimensions, as do the embedding outputs computed for them. Since all branches are fully convolutional, they can easily be adapted to other input dimensions.
Training: We use the ImageNet dataset [12] for training and only consider color images. For simplicity, we randomly pick a pair of images and crop the target patch in the center of one image and the search region in the center of the other. Images are scaled such that the bounding box, plus an added margin for context, has a fixed area. The basic siamese branch is trained for 50 epochs with an initial learning rate of 0.01. The learning rate decays after every epoch with a decay factor of 0.869. The context-dependent branches are fine-tuned from the parameters of the general branch with a learning rate of 0.00001 for 10 epochs. For the AlexNet branch, we directly use AlexNet [11] pretrained on the ImageNet dataset [12].
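The learning-rate schedule of the basic siamese branch can be written out explicitly. The sketch below assumes that the decay is multiplicative and applied once after each epoch (plain exponential decay); the function name `learning_rates` is hypothetical.

```python
def learning_rates(initial_lr=0.01, decay=0.869, epochs=50):
    """Learning rate used in each epoch, assuming lr_e = initial_lr * decay**e."""
    return [initial_lr * decay ** epoch for epoch in range(epochs)]

rates = learning_rates()
print(rates[0], rates[-1])
```

Under this assumption, the rate in the last of the 50 epochs is on the order of 1e-5, the same order of magnitude as the rate used to fine-tune the context-dependent branches.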
Our experiments are performed on a PC with an Intel i7-3770 3.40 GHz CPU and an NVIDIA Titan X GPU. We evaluated our results using the Python implementation of the OTB toolkit. The average testing speed of MBST is 17 fps.
Hyperparameters: The response weights for the context-dependent branches all have the same value of 1.0. For the AlexNet branch, we perform a grid search from 8.0 to 12.0 with a step of 0.5. Evaluation suggests that the best performance is achieved when the AlexNet weight is 10.5. This value is thus used for all the test sequences. In order to handle scale variations, we rescale the inputs into three different resolutions.
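The grid over the AlexNet weight contains nine candidate values. The sketch below enumerates them and picks the best one; `toy_evaluate` is a hypothetical stand-in for running the tracker on the benchmark with a given weight, and its shape is an assumption chosen only to make the example runnable.

```python
import numpy as np

def grid_search(candidates, evaluate):
    """Return the candidate with the highest evaluation score."""
    scores = [evaluate(float(w)) for w in candidates]
    return float(candidates[int(np.argmax(scores))])

# Candidate weights 8.0, 8.5, ..., 12.0 (end point included).
candidates = np.arange(8.0, 12.0 + 0.25, 0.5)

def toy_evaluate(weight):
    # Hypothetical score; a real run would return a benchmark metric instead.
    return -(weight - 10.5) ** 2

best_weight = grid_search(candidates, toy_evaluate)
print(len(candidates), best_weight)
```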
4.2 Dataset and Evaluation Metrics
OTB: We evaluate the proposed tracker on the OTB benchmarks [4, 5], whose video sequences are annotated with eleven interference attributes. The OTB benchmark uses the precision and success rate for quantitative analysis. For the precision plot, we calculate the Euclidean distance between the center locations of the tracked targets and the manually labeled ground truth. The average center location error over all the frames of one sequence is then used to summarize the overall performance. As the representative precision score for each tracker, we use the score at a threshold of 20 pixels. For the success plot, we compute the IoU (intersection over union) between the tracked and ground truth bounding boxes. A success plot is obtained by evaluating the success rate at different IoU thresholds. The area under the curve (AUC) of the success plot is reported.
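The two metrics can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the OTB toolkit: boxes are assumed to be given as (x, y, width, height), the AUC is approximated as the mean success rate over 21 evenly spaced IoU thresholds, and all function names are hypothetical.

```python
import numpy as np

def center_error(pred, gt):
    """Euclidean distance between box centers; boxes are (n, 4) as x, y, w, h."""
    pred_center = pred[:, :2] + pred[:, 2:] / 2.0
    gt_center = gt[:, :2] + gt[:, 2:] / 2.0
    return np.linalg.norm(pred_center - gt_center, axis=1)

def iou(pred, gt):
    """Intersection over union for each pair of boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, threshold=20.0):
    """Fraction of frames whose center error is within the pixel threshold."""
    return float(np.mean(center_error(pred, gt) <= threshold))

def success_auc(pred, gt, num_thresholds=21):
    """Mean success rate over evenly spaced IoU thresholds in [0, 1]."""
    overlaps = iou(pred, gt)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([np.mean(overlaps > t) for t in thresholds]))

# Toy sequence of two frames: one perfect prediction, one complete miss.
gt = np.array([[0.0, 0.0, 10.0, 10.0], [0.0, 0.0, 10.0, 10.0]])
pred = np.array([[0.0, 0.0, 10.0, 10.0], [30.0, 0.0, 10.0, 10.0]])

precision = precision_at(pred, gt)
auc = success_auc(pred, gt)
print(precision, auc)
```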
4.3 Ablation Analysis
To verify the contribution of each branch and the online branch selection mechanism of our algorithm, we implemented several variations of our approach and evaluated them on the OTB benchmarks.
Multiple branches improve the tracking result. We compared our full multi-branch algorithm with various combinations of branches, as shown in Table 1. We evaluate the performance of the original branch, the context-dependent branches and the AlexNet branch alone. Note that, among these evaluations, branch selection is applied only to the context-dependent branches, since that is the only setting with several branches available. For the other experiments in Table 1, we combine these branches with online branch selection for testing. The results demonstrate that the proposed multi-branch architecture makes better use of diverse feature representations. The best FPS is achieved by the general siamese branch, which is expected since a single branch requires less computation.
Online branch selection for every frame is not necessary. As shown in Fig. 3, we conduct experiments on the branch selection interval by varying its value. When the branch selection interval is less than 7 frames, the tracking performance is reduced. This can be explained by the fact that frequent execution of the selection mechanism increases the possibility of selecting an inappropriate branch. When the interval is more than 7 frames, the tracking performance also decreases, because a branch that is no longer discriminative is kept for too long. In our experiments, the optimal value of the branch selection interval was 7 frames.
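The role of the selection interval can be made concrete with a small scheduling sketch. It assumes that selection is re-run on every frame whose index is a multiple of the interval and that the chosen branch is reused in between; `pick_branch` is a hypothetical callback standing in for the actual selection mechanism.

```python
def track_with_interval(num_frames, interval, pick_branch):
    """Return the index of the branch used at each frame.

    pick_branch(frame_idx) is a hypothetical callback that evaluates all
    branches on that frame and returns the most discriminative one.
    """
    if interval < 1:
        raise ValueError("interval must be at least 1")
    used = []
    current = None
    for frame_idx in range(num_frames):
        if frame_idx % interval == 0:
            # Re-run branch selection only at the start of each interval.
            current = pick_branch(frame_idx)
        used.append(current)
    return used

calls = []

def toy_pick(frame_idx):
    # Toy selector: records when it is called and cycles through 3 branches.
    calls.append(frame_idx)
    return (frame_idx // 7) % 3

used = track_with_interval(num_frames=20, interval=7, pick_branch=toy_pick)
print(calls)
print(used)
```

With an interval of 7, the selector runs on 3 of the 20 frames, which shows how the interval trades selection cost and stability against responsiveness to appearance changes.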
4.4 Comparison with State-of-the-Art Trackers
We compare MBST with CFNet [3], SiamFC [2], Staple [13], LCT [14], Struck [15], MEEM [16], SCM [17], LMCF [19], MUSTER [20] and TLD [21] on OTB benchmarks. The precision plots and success plots of one-pass evaluation (OPE) are shown in Fig. 4. Based on the precision and success plots, the overall comparison suggests that the proposed MBST achieves the best performance among these trackers on OTB benchmarks. Notably, it outperforms SiamFC [2] as well as its variation CFNet [3] on all datasets. This demonstrates that diverse feature representations are important to improve tracking, as feature maps from various CNNs can be quite different. Fig. 5 shows that our tracker effectively handles challenging situations that often require high-level semantic understanding. For example, our tracker significantly outperforms SiamFC in the case of deformation, occlusion and out-of-plane rotation, because the contrast between the object and the background changes and switching to another feature map may give better discriminability. Therefore, our approach is beneficial whenever the appearance of the object changes significantly during tracking.
5 Conclusion
In this paper, we propose a Multi-Branch Siamese Network with Online Selection. We ensemble multiple siamese networks to diversify target feature representations. Using our online branch selection mechanism, the most discriminative branch is selected as the target appearance varies. Our tracker benefits from the diverse target representation, and can handle all kinds of challenging situations in visual object tracking. Our experimental results show improved performance compared to standard Siamese network trackers, while outperforming several recent state-of-the-art trackers.
References
- [1] Held, D., Thrun, S. and Savarese, S.: Learning to Track at 100 FPS with Deep Regression Networks. In: Leibe, B., Matas, J., Sebe, N. and Welling, M. (eds.) ECCV 2016, pp. 749–765. Springer
- [2] Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A. and Torr, P.H.: Fully-convolutional siamese networks for object tracking. In: ECCV 2016, pp. 850–865. Springer
- [3] Valmadre, J., Bertinetto, L., Henriques, J.F., Vedaldi, A. and Torr, P.H.: End-to-end representation learning for correlation filter based tracking. In: CVPR 2017, pp. 5000–5008. IEEE
- [4] Wu, Y., Lim, J. and Yang, M.H.: Online object tracking: A benchmark. In: CVPR 2013, pp. 2411–2418
- [5] Wu, Y., Lim, J. and Yang, M.H.: Object tracking benchmark. TPAMI 37(9), 1834–1848 (2015)
- [6] He, A., Luo, C., Tian, X. and Zeng, W.: A twofold siamese network for real-time object tracking. In: CVPR 2018, pp. 4834–4843
- [7] Huang, C., Lucey, S. and Ramanan, D.: Learning policies for adaptive tracking with deep feature cascades. In: ICCV 2017, pp. 105–114
- [8] Choi, J., Chang, H.J., Fischer, T., Yun, S., Lee, K., Jeong, J., Demiris, Y. and Choi, J.Y.: Context-aware Deep Feature Compression for High-speed Visual Tracking. In: CVPR 2018, pp. 479–488
- [9] Nam, H. and Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: CVPR 2016, pp. 4293–4302
- [10] Nam, H., Baek, M. and Han, B.: Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242 (2016)
- [11] Krizhevsky, A., Sutskever, I. and Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS 2012, pp. 1097–1105
- [12] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: Imagenet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
- [13] Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O. and Torr, P.H.: Staple: Complementary learners for real-time tracking. In: CVPR 2016, pp. 1401–1409
- [14] Ma, C., Yang, X., Zhang, C. and Yang, M.H.: Long-term correlation tracking. In: CVPR 2015, pp. 5388–5396
- [15] Hare, S., Saffari, A. and Torr, P.H.: Struck: Structured output tracking with kernels. In: ICCV 2011, pp. 263–270
- [16] Zhang, J., Ma, S. and Sclaroff, S.: MEEM: robust tracking via multiple experts using entropy minimization. In: ECCV 2014, pp. 188–203
- [17] Zhong, W., Lu, H. and Yang, M.H.: Robust object tracking via sparsity-based collaborative model. In: CVPR 2012, pp. 1838–1845
- [18] Han, B., Sim, J. and Adam, H.: BranchOut: Regularization for online ensemble tracking with convolutional neural networks. In: ICCV 2017, pp. 2217–2224
- [19] Wang, M., Liu, Y. and Huang, Z.: Large margin object tracking with circulant feature maps. In: CVPR 2017, pp. 21–26
- [20] Hong, Z., Chen, Z., Wang, C., Mei, X., Prokhorov, D. and Tao, D.: Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In: CVPR 2015, pp. 749–758
- [21] Kalal, Z., Mikolajczyk, K., Matas, J., et al.: Tracking-learning-detection. TPAMI 34(7), 1409 (2012)