Learning to Fuse Asymmetric Feature Maps in Siamese Trackers

12/04/2020 ∙ by Wencheng Han, et al. ∙ Linköping University ∙ IEEE ∙ Beijing Institute of Technology

In recent years, Siamese-based trackers have achieved promising performance in visual tracking. Most recent Siamese-based trackers typically employ a depth-wise cross-correlation (DW-XCorr) to obtain multi-channel correlation information from the two feature maps (target and search region). However, DW-XCorr has several limitations within Siamese-based tracking: it can easily be fooled by distractors, has fewer activated channels, and provides weak discrimination of object boundaries. Further, DW-XCorr is a handcrafted parameter-free module and cannot fully benefit from offline learning on large-scale data. We propose a learnable module, called the asymmetric convolution (ACM), which learns to better capture the semantic correlation information in offline training on large-scale data. Different from DW-XCorr and its predecessor (XCorr), which regard a single feature map as the convolution kernel, our ACM decomposes the convolution operation on a concatenated feature map into two mathematically equivalent operations, thereby avoiding the need for the feature maps to be of the same size (width and height) during concatenation. Our ACM can incorporate useful prior information, such as bounding-box size, with standard visual features. Furthermore, ACM can easily be integrated into existing Siamese trackers based on DW-XCorr or XCorr. To demonstrate its generalization ability, we integrate ACM into three representative trackers: SiamFC, SiamRPN++, and SiamBAN. Our experiments reveal the benefits of the proposed ACM, which outperforms existing methods on six tracking benchmarks. On the LaSOT test set, our ACM-based tracker obtains a significant improvement of 5.8


1 Introduction

Visual tracking is a challenging problem, where the task is to estimate the state of an arbitrary target in each frame of a video, given only its location in the initial frame. Recently, trackers based on Siamese networks have gained attention due to their combined advantage of high speed and tracking performance. The pioneering method, SiamFC [bertinetto2016fully], utilizes Siamese networks to extract deep convolutional features from the template in the initial frame of a video and instances inside the search regions of other frames. A cross correlation layer (XCorr) is then used to compute the similarity between the template and instances. Consequently, the instance with the highest similarity score is considered the target. The XCorr in SiamFC produces a single-channel response map and assumes the target is located near the highest response. As an extension, SiamRPN [li2018high] formulates the tracking problem as one-shot detection. It introduces a region proposal network (RPN) [ren2015faster] and utilizes up-channel cross correlation (UP-XCorr). However, UP-XCorr imbalances the parameter distribution, making the training optimization hard. To address this issue, SiamRPN++ introduces a depth-wise correlation (DW-XCorr) to efficiently generate a multi-channel correlation feature map. Due to its efficiency, several recent Siamese trackers [guo2020siamcar, chen2020siamese, yu2020deformable, xu2020siamfc++, du2020correlation] also employ DW-XCorr in their frameworks.

Figure 1: Comparison between DW-XCorr and ACM in terms of being fooled by distractors (first row), information distribution across channels (second row) and background suppression to better discriminate target boundaries (third row). DW-XCorr produces similar responses for distractors and the target (Fig. 1a). In contrast, ACM produces more distinct responses (Fig. 1b). In both cases (a and b), red arrows indicate the feature vectors extracted from the correlation feature maps at the corresponding pixels, and the cosine similarity is computed between the two feature vectors. Only a few channels of DW-XCorr have a high response when tracking a desired target (Fig. 1c). Instead, more channels of the ACM map carry a high response with different semantic information, such as the top right corner (left) or the center of the target (right), as shown in Fig. 1d. We show two example feature channels for DW-XCorr and ACM. DW-XCorr maps are blurry and do not accurately capture the shape of the target (Fig. 1e). In comparison, AC maps suppress the background, providing clear boundaries of the target (Fig. 1f).

As discussed above, most recent Siamese trackers employ DW-XCorr to compute the similarity between the template and instances. However, both DW-XCorr and its predecessor XCorr are handcrafted parameter-free modules and are not able to fully benefit from large-scale offline learning. DW-XCorr has several limitations in the context of tracking. First, it produces similar correlation responses for the target and distractors of homogeneous appearance. To demonstrate this, we analyze the similarity between DW-XCorr features of a target and its distractors in Fig. 1a. The heatmap is generated by performing an L1 normalization over the channel values of every pixel in the DW-XCorr features. As can be seen, DW-XCorr produces high responses (i.e., feature norms) not only near the target (the red rectangle), but also near other instances. We compute the cosine similarity between the target and one distractor (the green rectangle) and observe a high value ( 0.8), indicating that DW-XCorr produces similar results for both. This makes it difficult for the RPN to effectively discriminate the desired target from distractors.

The second limitation is that only a few channels in the DW-XCorr feature map are activated, i.e. have a high response when tracking a particular target [li2019siamrpn]. To perform cross-correlation, features of different targets are desired to be orthogonal and distributed in different channels, so that correlation feature channels of different targets are suppressed and only a few channels of the same target are activated. The suppressed channels are unable to help RPN in making robust and precise predictions and can reduce the capacity of the model. As shown in Fig. 1c, the maximum value of a channel with middle response is significantly lower than the global maximum value. This indicates that these channels contribute little to the final predictions. Last, DW-XCorr often produces responses at irrelevant background. As a consequence, correlation maps are often blurry and do not have clear boundaries, as shown in Fig. 1e. This is likely to hinder RPN from making accurate and robust predictions.

The aforementioned shortcomings of DW-XCorr and its predecessor XCorr, within Siamese-based trackers, motivate us to design a new module that learns to fuse feature maps by benefiting from offline learning on large-scale data. When the two feature maps (i.e., the template and a sub-window in a search image) have the same size, a straightforward way is to concatenate (fuse) them and then learn the fusion jointly by adding convolutional layers. Here, the additional convolutional layers can learn to discriminate the target from the background. However, such a concatenation strategy is non-trivial in the case of Siamese-based trackers, since the two feature maps are of different sizes (height and width). Further, the concatenation of feature maps of different sizes needs to be performed efficiently to meet the real-time requirements during inference.

1.1 Contributions

We introduce a novel module, called the asymmetric convolution (ACM), that avoids the need for the feature maps to be of the same size during concatenation. Our ACM decomposes the convolution operation on a concatenated feature map into two mathematically equivalent operations. First, it performs convolutions on two feature maps independently using kernels of the same size as that of the template feature map. Then, it performs a summation on the resulting feature maps, through broadcasting [numpy]. By utilizing the broadcasting of matrix addition, we efficiently compute the summation on these different-sized feature maps.
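The broadcasting step can be illustrated in isolation. The following is a minimal NumPy sketch, not the authors' code: the array names and the toy sizes (`c`, `h`, `w`) are assumptions chosen for illustration. It adds a single per-channel vector, shaped like a convolved template, to a larger per-position map, shaped like a convolved search region.

```python
import numpy as np

# Hypothetical toy sizes: c channels, h x w spatial positions in the search map.
c, h, w = 4, 5, 5
template_term = np.arange(c, dtype=np.float64).reshape(c, 1, 1)  # one vector for the template
search_term = np.ones((c, h, w))                                 # one vector per search position

# Broadcasting adds the single template vector to every spatial position.
fused = template_term + search_term
print(fused.shape)  # (4, 5, 5)

# The "virtual duplication" can be inspected through a broadcast view:
# the expanded axes get stride 0, so the template data is not copied.
view = np.broadcast_to(template_term, (c, h, w))
print(view.strides, np.shares_memory(view, template_term))
```

Because the smaller operand is only viewed, not tiled, the cost of the summation is dominated by the larger search-region map.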

The proposed ACM produces more discriminative features, as shown in Fig. 1b, with respect to the target (the red rectangle) and distractors (the green rectangle). This enables the tracker to make more robust predictions. Further, the maximum values of different channels in our ACM are closer, which indicates that more channels carry useful information, as shown in Fig. 1d. At the same time, ACM can effectively suppress background, thereby providing clear boundaries for the target, as in Fig. 1f. We validate these advantages by conducting an extensive analysis on 50k different image pairs from the LaSOT train set [fan2019lasot]. Details are presented in §3.2.

In addition to overcoming the aforementioned limitations of DW-XCorr, the proposed ACM is flexible and can also incorporate useful additional information. Here, we incorporate prior information in the form of bounding-box (b-box) size (height and width) from the initial frame in a video. This prior information helps to overcome the lack of accurate target-box locations in the template image, thereby providing guidance to the RPN headers. Furthermore, we show the generalization ability by replacing the standard DW-XCorr or XCorr with our ACM in three representative Siamese-based trackers: SiamFC [bertinetto2016fully], SiamRPN++ [li2019siamrpn] and SiamBAN [chen2020siamese]. Comprehensive experiments on six tracking benchmarks show the benefits of our ACM, leading to favorable performance against existing methods. On the large-scale LaSOT test set [fan2019lasot], our ACM-based trackers (SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM) achieve relative gains of 8.6%, 5.7% and 11.3%, in terms of area-under-the-curve (AUC), over their respective baselines (SiamFC, SiamRPN++ and SiamBAN).

2 Related Work

Recently, deep learning has pervaded computer vision with great success in a variety of tasks, including object tracking [wang2013learning, bertinetto2016learning, nam2016learning, tao2016siamese, bertinetto2016fully, held2016learning, yun2017action, kosiorek2017hierarchical, song2018vital, park2018meta, pu2018deep, danelljan2019atom]. Several deep learning-based trackers learn a classifier online to distinguish the target from the background and distractors [wang2013learning, nam2016learning, song2017crest, sun2018correlation, danelljan2019atom]. The MDNet [nam2016learning] tracker employs a CNN trained offline from multiple annotated videos. During evaluation, it learns a domain-specific detector online to discriminate between the background and foreground. ATOM [danelljan2019atom] comprises two dedicated components: target estimation, which is trained offline, and classification trained online. DiMP [bhat2019learning] employs a meta-learning based architecture, trained offline, that predicts the weights of the target model. The recently introduced KYS [bhat2020know] extends DiMP by exploiting scene information to improve the results.

Several existing deep trackers [bertinetto2016fully, li2018high, li2019siamrpn, chen2020siamese, guo2020siamcar, xu2020siamfc++] are based on Siamese networks and focus on learning a universal discriminator during large-scale offline learning. These trackers formulate the task as a general similarity computation between the target template and the search region. The pioneering work, SiamFC [bertinetto2016fully], introduced the XCorr layer to combine feature maps and can run at a speed of 100 frames per second (FPS). DSiam [guo2017learning] learns a feature map adaptation method to suppress background and handle target variation. SiamRPN++ [li2019siamrpn] and SiamDW [zhang2019deeper] overcome the issues of previous Siamese-based trackers that restrict them to using only relatively shallow networks. Specifically, they address the problems stemming from destroying the strict translation invariance and introduce modern deep networks, such as ResNet [he2016deep] and ResNeXt [xie2017aggregated], into Siamese trackers. SiamRPN++ utilizes a depth-wise correlation (DW-XCorr) to efficiently generate a multi-channel correlation feature map. The recently introduced SiamBAN [chen2020siamese] and SiamCAR [guo2020siamcar] also employ DW-XCorr and use an anchor-free strategy to predict bounding-boxes (b-boxes) directly without pre-defined anchor boxes.

Our Approach: As discussed earlier, most recent Siamese trackers typically employ a handcrafted module, DW-XCorr, to compute the similarity between the template and instances. Both DW-XCorr and its predecessor XCorr are not able to fully benefit from large-scale offline learning and have several limitations, including being easily fooled by distractors and providing weak discrimination of the object boundaries. To address these issues, we propose a new module (ACM) that learns to better capture semantic information from large-scale data during offline training. Our ACM produces more discriminative features with respect to the target and distractors, contains more activated channels carrying useful information and effectively suppresses the background, thereby providing clear boundaries of the target. Furthermore, our ACM is flexible and generic, enabling easy integration into existing Siamese trackers. We show this by integrating our ACM into three Siamese trackers and demonstrate its effectiveness on six benchmarks.

3 Method

3.1 Siamese Networks for Tracking

Siamese networks formulate the tracking task as learning a general similarity map between the feature maps extracted from the target template and the search region. When certain sliding windows in the search region are similar to the template, responses in these windows are high [bertinetto2016fully]. These networks are designed as Y-shaped, with two branches: one for the template $z$ and the other for the search region $x$. The two branches share the same network $\varphi$ with parameters $\theta$ and produce two feature maps $\varphi_z = \varphi(z; \theta)$ and $\varphi_x = \varphi(x; \theta)$. These two feature maps have the same channel number ($c$) but different sizes ($h_z \times w_z$ vs. $h_x \times w_x$), where $h_z < h_x$ and $w_z < w_x$. Then, a function $\omega$ is used to combine the feature maps and generate a similarity map $f$, where the center of the target is most likely found at the position with the highest response. Usually, $\omega$ is an XCorr operation between $\varphi_z$ and $\varphi_x$. The formulation is as follows:

$$f(z, x) = \omega(\varphi_z, \varphi_x) = \varphi_z \star \varphi_x \tag{1}$$
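To make the XCorr fusion concrete, the following NumPy sketch slides a template feature map over a search feature map and sums over all channels, giving a single-channel response map. It is an illustrative toy under stated assumptions, not SiamFC's implementation: the function name `xcorr`, the random features and the map sizes are all hypothetical.

```python
import numpy as np

def xcorr(template_feat, search_feat):
    """Single-channel cross-correlation of a (c, hz, wz) template over a (c, hx, wx) search map."""
    _, hz, wz = template_feat.shape
    _, hx, wx = search_feat.shape
    response = np.zeros((hx - hz + 1, wx - wz + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            window = search_feat[:, i:i + hz, j:j + wz]
            # Inner product over channels and spatial positions.
            response[i, j] = np.sum(window * template_feat)
    return response

# Toy data (assumed): weak background noise with the template pasted at row 2, column 4.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3, 3))
search = 0.01 * rng.standard_normal((4, 8, 8))
search[:, 2:5, 4:7] = template

response = xcorr(template, search)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(response.shape, peak)
```

The function has no trainable parameters: the template itself acts as the kernel, which is the property the text refers to as "parameter-free".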

To further improve the performance of Siamese-based trackers, SiamRPN [li2018high] adds a region proposal network (RPN) [ren2015faster] to generate bounding-boxes (b-boxes) for each frame of a tracking sequence. The RPN contains two XCorr modules to extract correlation maps and two headers on them to perform anchor classification and regression, respectively. This is different from previous Siamese trackers, such as SiamFC, where the b-box is not explicitly regressed and is typically set based on the size that best matches the search region. While SiamRPN utilizes an RPN, it employs up-channel cross correlation (UP-XCorr), which imbalances the parameter distribution, making the training optimization difficult. SiamRPN++ [li2019siamrpn] addresses this issue by introducing a depth-wise correlation (DW-XCorr) to efficiently generate a multi-channel correlation feature map $f_{dw}$ from the template and search-region feature maps $\varphi_z$ and $\varphi_x$, as in Fig. 2a. The formulation is as follows:

$$f_{dw} = \varphi_z \star_{dw} \varphi_x, \qquad f_{dw}^{(i)} = \varphi_z^{(i)} \star \varphi_x^{(i)}, \quad i = 1, \dots, c \tag{2}$$

where $\star_{dw}$ is a depth-wise convolution [howard2017mobilenets] of two feature maps, and $c$ is the number of channels. Then, the features are fed into the RPN headers to produce the final tracking b-box. The RPN headers are usually constructed with sequences of convolutional layers, including the classification module $\mathcal{H}_{cls}$, which predicts the classification score of each b-box candidate, and the regression module $\mathcal{H}_{reg}$, which obtains the details (size in terms of width and height) of each b-box. By applying these headers to the correlation maps, we can obtain the score map $S$ and b-box map $B$:

$$S = \mathcal{H}_{cls}(f_{dw}), \qquad B = \mathcal{H}_{reg}(f_{dw}) \tag{3}$$
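The depth-wise variant differs from plain XCorr only in that channels are not summed: each template channel is correlated with the matching search channel, so the output keeps one response map per channel. The sketch below is a hedged NumPy illustration with hypothetical names and toy sizes; the RPN headers are left out because their layer configuration is not specified in this section.

```python
import numpy as np

def dw_xcorr(template_feat, search_feat):
    """Depth-wise cross-correlation: (c, hz, wz) template over (c, hx, wx) search -> (c, h, w)."""
    c, hz, wz = template_feat.shape
    _, hx, wx = search_feat.shape
    out = np.zeros((c, hx - hz + 1, wx - wz + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = search_feat[:, i:i + hz, j:j + wz]
            # Sum over spatial positions only; channels stay separate.
            out[:, i, j] = np.sum(window * template_feat, axis=(1, 2))
    return out

rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3, 3))   # assumed toy sizes
search = rng.standard_normal((4, 8, 8))

corr = dw_xcorr(template, search)
print(corr.shape)  # (4, 6, 6)

# Summing the depth-wise result over channels recovers the single-channel XCorr response.
single_channel = corr.sum(axis=0)
```

As with XCorr, nothing in this operation is learned; the only trainable parts of such a tracker are the backbone and the headers applied afterwards.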

As we can see, the fusion method is crucial for Siamese-based trackers. However, both XCorr and depth-wise XCorr (DW-XCorr) are parameter-free methods and therefore cannot fully benefit from large-scale training. Further, they have several limitations, as described in §1. Our asymmetric convolution module (ACM) addresses these limitations by using an asymmetric convolution (AC) as the fusion function. With its learnable parameters, AC can be optimized during training and finds a better way to fuse the template and search-region feature maps.

3.2 Asymmetric Convolution

Figure 2: Comparison of (c) AC with (a) DW-XCorr and (b) a naive strategy to fuse different-sized feature maps. (a): DW-XCorr uses a multi-channel feature map extracted from the template as the kernel and convolves instance feature maps in a depth-wise manner to generate a multi-channel correlation feature map. (b): A naive strategy to perform concatenation on different-sized feature maps (template and search region) is to first split the search region feature map into sub-windows of the same size as that of the template feature map. Then, different sub-windows and the template are concatenated along the channel axis, followed by a convolution, learned during offline training, to generate a new feature. (c): AC efficiently concatenates different-sized feature maps by first separately convolving the two feature maps (template and search) using kernels of the same size as that of the template feature map. Then, it computes a summation on these different-sized feature maps through broadcasting. In addition, our AC possesses the ability to incorporate useful non-visual features (dashed line), such as b-box size.

Different from handcrafted methods (e.g., DW-XCorr and XCorr) for fusing features in Siamese networks, we look into how to concatenate two different-sized feature maps and learn a fusion during offline training on large-scale data. Learning to fuse feature maps during offline training is expected to provide rich prior information, enabling the fusion method to better adapt to different challenging situations, such as motion blur, deformation, fast motion and clutter. However, an efficient direct concatenation of these feature maps is challenging due to the different sizes of the template and search image. To this end, we investigate the problem of efficiently fusing feature maps of different sizes. A straightforward way (Fig. 2b) is to first split the search region feature map $\varphi_x$ into sub-windows of the same size as that of the template feature map $\varphi_z$. It is worth noting that every sub-window is a sliding window here. Then, different sub-windows and the template can be concatenated along the channel axis, followed by a convolution operation to produce a new feature $f$. However, such a strategy (Fig. 2b) based on direct convolution on the concatenated feature map is computationally expensive, since the convolution operation is required to be repeated for each sub-window. To circumvent this problem, we introduce a mathematically equivalent procedure, called the asymmetric convolution (AC), that replaces this direct convolution on the concatenated feature map with two independent convolutions (Fig. 2c). For a sub-window $x_i$, our AC, comprising two independent convolutions followed by a summation, is mathematically equivalent to the direct convolution on the concatenated feature map:

$$f_i = \begin{bmatrix} \theta_z & \theta_x \end{bmatrix} * \begin{bmatrix} \varphi_z \\ x_i \end{bmatrix} = \theta_z * \varphi_z + \theta_x * x_i \tag{4}$$

where $x_i$ is a window of $\varphi_x$, $\theta_z$ is the kernel applied to $\varphi_z$, and $\theta_x$ is that applied to $x_i$. After the convolution operation, the result $f_i$ has a shape of $c \times 1 \times 1$, where $c$ is the number of channels. The left side of Eq. 4 is a convolution on a concatenated feature map of $\varphi_z$ and $x_i$, and it is equivalent to the right side, i.e., two independent convolutions and a summation. Next, we collect the features of all $n$ windows inside $\varphi_x$ to formulate a new feature map $f$, as follows:

$$f = \{ f_i \mid i = 1, \dots, n \} = \theta_z * \varphi_z \oplus_b \{ \theta_x * x_i \mid i = 1, \dots, n \} \tag{5}$$

where $\oplus_b$ is a summation with broadcasting. We utilize the broadcasting method since it efficiently conducts arithmetic operations on matrices with different shapes and is widely available in scientific computing packages, including NumPy [numpy] and PyTorch [steiner2019pytorch]. Through broadcasting, these packages allow the dimensions of the operand arrays to differ. Specifically, arrays with smaller sizes are virtually duplicated (that is, without copying any data in memory and thus introducing little computational burden), so that the shapes of the operands match [numpy]. Moreover, all sub-windows inside $\varphi_x$ share the same convolution. Therefore, we replace $\{ \theta_x * x_i \mid i = 1, \dots, n \}$ with $\theta_x * \varphi_x$ for simplicity (details are provided in the supplementary material).

In this way, we perform a convolution operation on two feature maps with different shapes simultaneously. After applying a ReLU activation function, we obtain a new fusion $\omega_{ac}$, which can be optimized during training:

$$\omega_{ac}(\varphi_z, \varphi_x) = \mathrm{ReLU}\left( \theta_z * \varphi_z \oplus_b \theta_x * \varphi_x \right) \tag{6}$$
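The equivalence claimed above can be checked numerically. The NumPy sketch below is an illustration under stated assumptions rather than the authors' implementation: the helper `conv_valid`, the random kernels and features, the toy sizes and the omission of a bias term are all choices made here. It compares the naive strategy (concatenate every sliding window with the template, then convolve) against two independent convolutions followed by a broadcast sum and a ReLU.

```python
import numpy as np

def conv_valid(feat, kernel):
    """Stride-1, no-padding convolution (deep-learning convention, i.e. cross-correlation).
    feat: (c_in, H, W); kernel: (c_out, c_in, kh, kw) -> (c_out, H - kh + 1, W - kw + 1)."""
    c_out, _, kh, kw = kernel.shape
    _, H, W = feat.shape
    out = np.zeros((c_out, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = feat[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernel, window, axes=([1, 2, 3], [0, 1, 2]))
    return out

rng = np.random.default_rng(0)
c, c_out, hz, wz, hx, wx = 3, 5, 2, 2, 6, 6        # assumed toy sizes
template_feat = rng.standard_normal((c, hz, wz))
search_feat = rng.standard_normal((c, hx, wx))
kernel = rng.standard_normal((c_out, 2 * c, hz, wz))  # kernel over the concatenated channels
kernel_z, kernel_x = kernel[:, :c], kernel[:, c:]     # split along the input-channel axis

# Naive strategy: repeat "concatenate + convolve" for every sliding window.
naive = np.zeros((c_out, hx - hz + 1, wx - wz + 1))
for i in range(naive.shape[1]):
    for j in range(naive.shape[2]):
        window = search_feat[:, i:i + hz, j:j + wz]
        stacked = np.concatenate([template_feat, window], axis=0)  # (2c, hz, wz)
        naive[:, i, j] = conv_valid(stacked, kernel)[:, 0, 0]

# Asymmetric convolution: two independent convolutions, then a broadcast sum.
z_term = conv_valid(template_feat, kernel_z)  # (c_out, 1, 1)
x_term = conv_valid(search_feat, kernel_x)    # (c_out, 5, 5)
fused = np.maximum(z_term + x_term, 0.0)      # ReLU

print(np.allclose(np.maximum(naive, 0.0), fused))
```

The template term is computed once, whereas the naive loop recomputes the template half of the convolution for every window; this is where the saving comes from.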
Figure 3: Comparison between DW-XCorr and AC in terms of (a and b) producing more discriminative features for targets and distractors to avoid being fooled by the distractors and (c) information distribution across correlation channels. The comparison is performed on 50k different image pairs from the LaSOT train set. (a): Cosine similarity between targets and distractors based on DW-XCorr and AC feature maps, respectively. (b): Same as (a), except that cosine similarity is replaced by Euclidean distance. In (b), the correlation feature maps are first normalized to [0,1] and then the Euclidean distance is computed between targets and distractors. (c): Average values over all maximum feature values of channels for DW-XCorr and AC, respectively. In each case, the maximum feature values are obtained by first performing a normalization (dividing the values by their global maximum value).

As discussed earlier, our AC benefits from the offline training and alleviates the limitations of DW-XCorr. To demonstrate that AC produces more discriminative features for the targets and distractors than DW-XCorr, we perform an experiment in which we compute the cosine similarity between targets and distractors based on the AC and DW-XCorr feature maps, respectively, on 50k different image pairs from the LaSOT dataset. We set the target to be at the center of the search region and select the features located at the center of the AC and DW-XCorr maps to represent it. Then, we find the maximum response outside the b-box region and select features at this point to represent the distractor. Finally, the cosine similarity between the target and distractor features is computed to evaluate the discriminative ability of AC and DW-XCorr. Fig. 3a shows that AC maps are less affected by distractors, producing more discriminative features, compared to DW-XCorr. Fig. 3b shows a similar comparison but from a different perspective, where cosine similarity is replaced with the Euclidean distance. Here, the correlation feature maps are first normalized to [0,1] and then the Euclidean distance is computed between targets and distractors. Further, AC maps contain more semantic information than DW-XCorr, as shown earlier in Fig. 1b. We also validate, on the same 50k image pairs from LaSOT, that AC channels provide more diversity when extracting correlation information, compared to DW-XCorr. We first normalize the AC and DW-XCorr maps by dividing them by their global maximum value, and then calculate the maximum value of each channel. Finally, average values over all channels are used to draw a comparison, shown in Fig. 3c. Lastly, AC maps suppress the influence of irrelevant background better than DW-XCorr, as shown earlier in Fig. 1f (additional results are provided in the supplementary material). This helps the RPN headers to predict the b-boxes more accurately.
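The two statistics described in this paragraph can be sketched for a single correlation map as follows. This is a hedged illustration, not the evaluation code behind Fig. 3: the function names are hypothetical, the "response" of a pixel is assumed to be the L1 norm of its feature vector, the b-box is given in map coordinates, and the map is assumed to have a positive global maximum.

```python
import numpy as np

def target_distractor_cosine(corr_map, box):
    """corr_map: (c, h, w); box: (top, left, bottom, right) on the map, end-exclusive.
    Returns the cosine similarity and the distractor location (row, col)."""
    _, h, w = corr_map.shape
    target_vec = corr_map[:, h // 2, w // 2]         # target assumed at the map center
    response = np.abs(corr_map).sum(axis=0)          # assumed response: per-pixel L1 norm
    top, left, bottom, right = box
    outside = response.copy()
    outside[top:bottom, left:right] = -np.inf        # ignore everything inside the b-box
    di, dj = np.unravel_index(np.argmax(outside), outside.shape)
    distractor_vec = corr_map[:, di, dj]
    denom = np.linalg.norm(target_vec) * np.linalg.norm(distractor_vec)
    cosine = float(target_vec @ distractor_vec / denom) if denom > 0 else 0.0
    return cosine, (int(di), int(dj))

def mean_channel_peak(corr_map):
    """Normalize by the global maximum, take each channel's maximum, then average."""
    normalized = corr_map / corr_map.max()
    return float(normalized.max(axis=(1, 2)).mean())

# Toy map (assumed): the target excites channel 0, a distractor excites channel 1.
corr = np.zeros((2, 5, 5))
corr[:, 2, 2] = [3.0, 0.0]
corr[:, 0, 4] = [0.0, 2.0]

cosine, location = target_distractor_cosine(corr, box=(1, 1, 4, 4))
print(cosine, location, mean_channel_peak(corr))
```

A low cosine value means the target and the strongest distractor are represented by clearly different feature vectors, and a mean channel peak close to 1 means that many channels reach responses comparable to the global maximum.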

Figure 4: (a): Effectiveness of our ACM in fusing additional information (a single number indicating the digit location) with visual feature maps for the task of digit prediction on MNIST. Digit images from MNIST are concatenated into a $2 \times 2$ matrix. We then randomly generate an index between 0 and 3 to indicate the position of the digit. Here, "index" means the position to predict, and "prediction" is the predicted digit at this position. Indexes are 0 and 1 in the first row and 2 and 3 in the second row of the matrix. The colors, superimposed on the images, are responses of feature maps, where high responses are represented by warm colors. The incorporation of additional index information using ACM results in high responses concentrated around the target position. (b): Tracking comparison between our ACM-based tracker (SiamRPN++ACM) and the baseline (SiamRPN++) on example frames, where the target is only part of an object (e.g., part of a hand or body). Here, we also show DW-XCorr and ACM feature maps of the baseline and SiamRPN++ACM, respectively. Each feature map shown is obtained by taking the L1 norm of each pixel in the respective feature map. Our ACM map is able to focus on regions belonging to the target. Moreover, the integration of non-visual 1D b-box size features provides useful prior information to the RPN headers, leading to more accurate predictions.

3.3 Incorporating Prior Non-Visual Information

As discussed earlier, our ACM is flexible and can incorporate additional (non-visual) information. Here, we show the integration of prior information in the form of the target b-box size (width and height) from the initial frame. It is worth noting that a traditional RPN header has no exact prior information about the target b-box, which can be of arbitrary shape. ACM can provide such additional prior information, in terms of the b-box size, to the RPN header for accurate target localization. However, a b-box size is a one-dimensional feature and cannot be fed directly into 2D convolutional networks. Here, we regard a b-box size as a specific image feature with a size of $c \times 1 \times 1$, where $c$ is the channel number. In this way, we utilize ACM to fuse useful prior information, such as b-box size, with standard high-dimensional visual features representing the template and search regions.

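We use the b-box size information from the initial frame in our tracking framework to distinguish features belonging to the target and provide guidance to the RPN headers:

$$\omega_{acm}(\varphi_z, \varphi_x, b_0) = \mathrm{ReLU}\left( \theta_z * \varphi_z \oplus_b \theta_x * \varphi_x \oplus_b \psi(b_0; \theta_\psi) \right) \tag{7}$$

Here, $\theta_z * \varphi_z$ and $\theta_x * \varphi_x$ are the convolved template and search-region feature maps, $\oplus_b$ is a summation with broadcasting, $b_0$ is the b-box of the initial frame and $\psi$ is a three-layer fully-connected network with parameters $\theta_\psi$. Since the target in the template is always at the center of the image, we only use the width and height of the b-box. Fig. 4(b) shows a tracking comparison between our ACM-based tracker and the baseline (using DW-XCorr) on example frames, where the target is only part of an object (e.g., part of a hand or body).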
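A rough sketch of how a b-box size can enter the fusion is given below. It is an assumption-laden illustration, not the paper's network: the hidden width, the ReLU between the fully-connected layers, the random weights, the normalized (width, height) input and the stand-in visual terms are all hypothetical. The point is only that a $c$-dimensional vector, reshaped to $c \times 1 \times 1$, broadcasts against the visual terms in the same way the template term does.

```python
import numpy as np

def box_prior(box_wh, layers):
    """Map (width, height) to a c-dimensional vector with three fully-connected layers."""
    h = np.asarray(box_wh, dtype=np.float64)
    for idx, (weight, bias) in enumerate(layers):
        h = weight @ h + bias
        if idx < len(layers) - 1:
            h = np.maximum(h, 0.0)  # ReLU between layers (an assumption)
    return h

rng = np.random.default_rng(0)
c, hidden = 6, 8  # hypothetical channel count and hidden width
layers = [
    (rng.standard_normal((hidden, 2)), np.zeros(hidden)),
    (rng.standard_normal((hidden, hidden)), np.zeros(hidden)),
    (rng.standard_normal((c, hidden)), np.zeros(c)),
]

# Stand-ins for the convolved template (c, 1, 1) and search-region (c, h, w) terms.
template_term = rng.standard_normal((c, 1, 1))
search_term = rng.standard_normal((c, 5, 5))

prior = box_prior((0.3, 0.6), layers).reshape(c, 1, 1)  # normalized width, height (assumed)
pre_activation = template_term + search_term + prior    # broadcast sum of the three branches
fused = np.maximum(pre_activation, 0.0)
print(prior.shape, fused.shape)
```

Because the prior is a single vector per sequence, it shifts every spatial position of the fused map by the same per-channel offset before the non-linearity.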

To further demonstrate the effectiveness of our fusion, we conduct a simple experiment for digit prediction on the MNIST dataset [lecun2010mnist]. First, we concatenate four digit images from MNIST into a $2 \times 2$ matrix and randomly generate an index between 0 and 3 to indicate the position of the digit to predict. Then, we design a network similar to AlexNet to predict the digit at a given position. To incorporate the index information (a single number), we extract the index features using a three-layer fully-connected network and fuse them with the feature map of the matrix image using our ACM. We then feed the fused features into a prediction network. As shown in Fig. 4(a), high responses are spread uniformly when no index information is used. After integrating index information using ACM, they are more concentrated around the target positions. Even though we only give the network a single index number, it is able to better discriminate the target position, with emphasis on the region belonging to the target. As a result, our network correctly predicts the digit at the given position.

Figure 5: Overview of our tracker (SiamBAN-ACM) which integrates ACM, in place of DW-XCorr, in the baseline SiamBAN. The network contains three branches: template, search region and b-box. The template and search region branches share the same ResNet50 backbone, producing feature maps for the ACM. The b-box branch utilizes a fully-connected network and provides useful prior information. The ACM fuses the resulting features from the three branches and generates a correlation feature map for the RPN headers. Based on this feature map, the Cls header predicts foreground-background classification, whereas the Reg header performs b-box regression. Consequently, predictions from different layers are fused by layer-wise aggregation.

3.4 ACM for Visual Tracking

The proposed ACM is generic and can be easily integrated into existing Siamese trackers. Here, we integrate ACM into three trackers: SiamFC [bertinetto2016fully], the recently introduced SiamRPN++ [li2019siamrpn] and SiamBAN [chen2020siamese]. For SiamFC, we replace its XCorr with ACM, whereas for SiamRPN++ and SiamBAN we replace their DW-XCorr with ACM. The resulting ACM-based trackers are named SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM, respectively.

Our SiamFC-ACM. The original SiamFC [bertinetto2016fully] employs XCorr to produce a single-channel response map. We use the same network as the original SiamFC to extract features, and feed the feature maps produced by the template and search region branches into ACM, producing a correlation map with a single channel. The position with the highest response is then set as the predicted target center.

Our SiamRPN++ACM. The original SiamRPN++ [li2019siamrpn] is the first to introduce DW-XCorr into Siamese trackers. For SiamRPN++ACM, we replace the DW-XCorr in the original SiamRPN++ with our ACM. Specifically, ACM fuses the features from the three branches (template, search region and b-box) to generate a correlation feature map, as shown in Fig. 10. The b-box branch uses three fully-connected layers to generate a target location feature map. Then, we apply two convolutions without padding to the template and search region feature maps to obtain semantic feature maps. The summation of the three feature maps (i.e., template, search region and b-box maps) is then batch normalized and used as input to the RPN headers. The template and initial b-box are fixed during inference and the three branches remain independent until the broadcasting summation. Thus, we can cache the two branches (template and b-box) to save computational cost. In this way, the additional computational cost introduced by ACM is only a single convolution on the search region, thereby causing no significant degradation to the overall inference speed.
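The caching argument can be sketched as follows. This is a simplified NumPy illustration with hypothetical names (`CachedFusion`, `conv_valid`), random weights and toy sizes; batch normalization and the RPN headers are omitted, so it only shows which terms are computed once per sequence and which are computed per frame.

```python
import numpy as np

def conv_valid(feat, kernel):
    """Stride-1, no-padding convolution: (c_in, H, W) with (c_out, c_in, kh, kw) kernels."""
    c_out, _, kh, kw = kernel.shape
    _, H, W = feat.shape
    out = np.zeros((c_out, H - kh + 1, W - kw + 1))
    for i in range(out.shape[1]):
        for j in range(out.shape[2]):
            window = feat[:, i:i + kh, j:j + kw]
            out[:, i, j] = np.tensordot(kernel, window, axes=([1, 2, 3], [0, 1, 2]))
    return out

class CachedFusion:
    """Caches the template and b-box terms; only the search branch runs per frame."""

    def __init__(self, kernel_z, kernel_x, template_feat, prior_vec):
        self.kernel_x = kernel_x
        # Both terms depend only on the first frame, so they are computed once.
        self.static_term = conv_valid(template_feat, kernel_z) + prior_vec[:, None, None]

    def __call__(self, search_feat):
        # Per-frame cost: a single convolution on the search region plus a broadcast sum.
        return np.maximum(self.static_term + conv_valid(search_feat, self.kernel_x), 0.0)

rng = np.random.default_rng(0)
c, c_out, hz, wz = 3, 4, 2, 2                      # assumed toy sizes
kernel_z = rng.standard_normal((c_out, c, hz, wz))
kernel_x = rng.standard_normal((c_out, c, hz, wz))
template_feat = rng.standard_normal((c, hz, wz))
prior_vec = rng.standard_normal(c_out)             # stand-in for the b-box branch output

fusion = CachedFusion(kernel_z, kernel_x, template_feat, prior_vec)
frames = [rng.standard_normal((c, 6, 6)) for _ in range(2)]
outputs = [fusion(frame) for frame in frames]
print([out.shape for out in outputs])
```

The cached term has shape `(c_out, 1, 1)`, so storing it costs almost nothing compared with recomputing the template and b-box branches for every frame.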

Our SiamBAN-ACM. The recently introduced SiamBAN [chen2020siamese] does not employ pre-defined anchors, enabling it to perform better and faster than its baseline SiamRPN++. To obtain SiamBAN-ACM, we apply the same changes (replacing DW-XCorr with ACM) to the baseline SiamBAN as described above for SiamRPN++ACM.

4 Experiments

We perform comprehensive experiments on six tracking benchmarks: OTB-100 [wu2015object], UAV123 [mueller2016a], TrackingNet [muller2018trackingnet], VOT2016, VOT2019 [VOT_TPAMI] and LaSOT [fan2019lasot]. A well-documented and complete training and inference code will be publicly released.

Implementation Details.

Our ACM-based tracking frameworks are implemented using the PyTorch tracking platform PySOT. For fair comparison, we follow the same training protocol for our SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM as that of their respective baseline SiamFC, SiamRPN++ and SiamBAN trackers. Further, we use the same loss functions in our tracking networks as those of the respective baselines, as ACM can be optimized without auxiliary guidance. We perform training on a workstation with an Intel E5-2698 v4 CPU, 512 GB of memory, and four V100 GPUs. For both training and testing, template patches and search regions are cropped to fixed sizes in pixels.

4.1 Ablation Study

We perform an ablation study to analyze the impact of ACM in the three tracking architectures. As discussed earlier, our ACM addresses the limitations of XCorr and DW-XCorr by introducing an asymmetric convolution (AC). Further, ACM also possesses the flexibility to incorporate additional (non-visual) information in the form of b-box size. Here, we also analyze the impact of only replacing the XCorr or DW-XCorr with AC and not incorporating additional (b-box size) prior information. We perform ablation experiments on the VOT2016 and OTB-100 datasets. We follow the standard evaluation protocols of the respective datasets. On VOT2016, trackers are evaluated using the expected average overlap (EAO) score. The EAO score takes into account both robustness and accuracy. Here, robustness represents the number of tracking failures, while accuracy indicates the average overlap between the ground-truth b-box and the tracker prediction. On OTB-100, trackers are evaluated using the area-under-the-curve (AUC), which is obtained by averaging the overlap precision (OP) scores over a range of thresholds in [0, 1]. Here, the OP metric indicates the percentage of frames where the intersection-over-union (IoU) overlap between the ground-truth b-box and the prediction from the tracker is greater than a certain threshold.
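The OP and AUC definitions above can be written out directly. The following plain-Python sketch is illustrative only: the function names, the (x, y, width, height) box convention and the choice of 21 evenly spaced thresholds are assumptions made here, not details stated in the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def overlap_precision(ious, threshold):
    """Fraction of frames whose IoU is strictly greater than the threshold."""
    return sum(value > threshold for value in ious) / len(ious)

def success_auc(ious, num_thresholds=21):
    """Average OP over evenly spaced thresholds in [0, 1] (the threshold count is an assumption)."""
    thresholds = [k / (num_thresholds - 1) for k in range(num_thresholds)]
    return sum(overlap_precision(ious, t) for t in thresholds) / num_thresholds

# Toy sequence of three frames: a perfect hit, a half-shifted box and a complete miss.
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
predictions = [(0, 0, 2, 2), (1, 0, 2, 2), (5, 5, 2, 2)]
ious = [iou(gt, pred) for gt, pred in zip(ground_truth, predictions)]

print(ious)
print(overlap_precision(ious, 0.5), success_auc(ious))
```

A single OP value depends on the chosen threshold, which is why the AUC, averaging over the whole threshold range, is used as the summary score.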

Table 1 shows the results using three baseline tracking architectures on both datasets. We also report the speed in terms of frames per second (FPS). Note that all speeds are reported on a GTX1080Ti GPU. On VOT2016, the baseline SiamBAN and SiamRPN++ achieve EAO scores of 50.5 and 46.4, respectively. A consistent improvement in tracking performance is obtained when replacing the DW-XCorr with our AC in these two baseline architectures. Our final ACM, which contains both the AC and the prior b-box size information, achieves significant improvement in performance over both baselines. Our ACM-based trackers (SiamBAN-ACM and SiamRPN++ACM) obtain absolute gains of 4.4% and 3.7%, in terms of EAO, over their respective SiamBAN and SiamRPN++ baselines. In the case of the baseline SiamFC, our ACM contains only the AC and no additional (non-visual) information, since SiamFC only needs to predict the center of the target. Our ACM-based tracker (SiamFC-ACM) obtains a significant gain of 6.1% over the baseline SiamFC. Similarly, our ACM-based trackers also provide consistent improvements in tracking performance over their respective baselines on OTB-100.

Tracker     Variant           VOT2016 (EAO)   OTB2015 (AUC score)   Speed (fps)
SiamBAN     Baseline          0.505           0.695                 48
SiamBAN     + AC              0.535           0.715                 41
SiamBAN     + ACM             0.549           0.720                 41
SiamRPN++   Baseline          0.464           0.695                 46
SiamRPN++   + AC              0.485           0.705                 40
SiamRPN++   + ACM             0.501           0.712                 40
SiamFC      Baseline          0.277           0.586                 190
SiamFC      + ACM (AC only)   0.338           0.600                 172

Table 1: Ablation study on VOT2016 [kristan2016the] and OTB-100 [wu2015object]. We show the results using three different baseline tracking architectures. All speeds are reported on a GTX1080Ti GPU. We also show our ACM with only AC and without the integration of prior non-visual information. In all cases, our final ACM achieves consistent improvement in tracking performance over the baseline architectures. The best scores are highlighted in bold in each case.

Tracker                      A       P       NP
SiamFC [tao2016siamese]      0.571   0.553   0.652
SiamFC-ACM                   0.577   0.537   0.675
DiMP [bhat2019learning]      0.740   0.687   0.801
SiamRPN++ [li2019siamrpn]    0.733   0.694   0.800
SiamBAN [chen2020siamese]    0.725   0.687   0.795
KYS [bhat2020know]           0.740   0.688   0.800
SiamRPN++ACM                 0.747   0.705   0.804
SiamBAN-ACM                  0.753   0.712   0.810

Table 2: State-of-the-art comparison on TrackingNet [muller2018trackingnet] test set in terms of success (AUC), precision and normalized precision. Success, precision and normalized precision are denoted by A, P and NP, respectively. The best two results are shown in red and blue, respectively. Our ACM-based trackers (SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM) consistently outperform their respective baselines (SiamFC, SiamRPN++ and SiamBAN). Further, our SiamBAN-ACM achieves favorable performance against existing trackers.

4.2 State-of-the-Art Comparison

TrackingNet [muller2018trackingnet]: Table 2 shows the comparison on the TrackingNet test set, which comprises over 500 videos without publicly available ground-truths. The results are obtained through an online evaluation server. Our three trackers (SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM) achieve consistent improvement over their respective baselines (SiamFC, SiamRPN++ and SiamBAN). The recently introduced KYS [bhat2020know] and its baseline DiMP [bhat2019learning] achieve normalized precision (NP) scores of 80.0 and 80.1, respectively. Our SiamBAN-ACM achieves an NP score of 81.0, outperforming both KYS and DiMP. SiamBAN-ACM also achieves a favorable result in terms of success (A) against existing trackers, with an AUC score of 75.3.

OTB-100 [wu2015object]: Fig. 6(a) shows the results, in terms of success plot, over all 100 videos of OTB-100. The trackers are ranked in terms of their AUC score (in the legend). Among existing methods, SiamBAN achieves an AUC score of 69.6. The recently introduced KYS [bhat2020know] and its baseline DiMP [bhat2019learning] obtain AUC scores of 69.4 and 68.8, respectively. Our SiamBAN-ACM outperforms existing trackers with an AUC score of 72.0. Further, our SiamBAN-ACM obtains an absolute gain of 2.5% over the baseline SiamBAN. In OTB-100, each video is annotated with 11 different attributes. SiamBAN-ACM achieves promising performance on all these attributes, compared to existing methods. The attribute plots are provided in the supplementary material.

Figure 6: State-of-the-art comparison on (a) OTB-100 [wu2015object] and (b) LaSOT [fan2019lasot] test set in terms of success plot. For each method, we show the AUC scores in the legend. On both datasets, our ACM-based trackers (SiamFC-ACM, SiamRPN++ACM and SiamBAN-ACM) consistently outperform their respective baselines (SiamFC, SiamRPN++ and SiamBAN). On OTB-100, our SiamBAN-ACM sets a new state-of-the-art with an AUC score of 72.0. On LaSOT, SiamBAN-ACM performs favorably against existing methods with an AUC score of 57.2. Best viewed zoomed in.

LaSOT [fan2019lasot]: We evaluate our approach on the test set comprising 280 long videos. Fig. 6(b) shows the success plot. We rank the trackers w.r.t. their AUC scores (in the legend). Among existing methods, SiamBAN and DiMP obtain AUC scores of 51.4 and 56.5, respectively. Our SiamBAN-ACM obtains favorable results against the state-of-the-art, while outperforming the baseline SiamBAN by an absolute AUC gain of 5.8%.

VOT 2016 and 2019 [VOT_TPAMI]: Tables 3 and 4 show comparisons on VOT 2016 and 2019, respectively. On VOT2016, our SiamBAN-ACM outperforms the previous best SiamBAN with an absolute EAO (E) gain of 4.4%. Similarly, on VOT 2019, our three trackers (in bold) achieve consistent improvement in performance over their baselines. Compared to SiamBAN, our SiamBAN-ACM has a 20% lower failure rate, while also achieving improved tracking accuracy.
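The two figures quoted above are of different kinds: the EAO gain is an absolute difference in percentage points, whereas the failure-rate figure is a relative reduction. The short sketch below makes the arithmetic explicit using values from Tables 3 and 4. The helper names are chosen here for illustration, and it is an assumption that the 20% figure refers to the VOT2019 robustness (R) values of SiamBAN and SiamBAN-ACM.

```python
def absolute_gain(baseline, ours):
    """Difference in percentage points for scores given in [0, 1]."""
    return (ours - baseline) * 100.0


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return (baseline - ours) / baseline * 100.0


# EAO on VOT2016 (Table 3): SiamBAN 0.505 -> SiamBAN-ACM 0.549.
eao_gain_vot2016 = absolute_gain(0.505, 0.549)

# Robustness R on VOT2019 (Table 4): SiamBAN 0.396 -> SiamBAN-ACM 0.316.
# Assumption: this is the pair behind the "20% lower failure rate" statement.
failure_drop_vot2019 = relative_reduction(0.396, 0.316)

print(round(eao_gain_vot2016, 1), round(failure_drop_vot2019, 1))
```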

UAV123 [mueller2016a]: Table 5 shows the comparison in terms of success (AUC). Among existing Siamese trackers, SiamCAR and SiamBAN achieve AUC scores of 61.4 and 63.1, respectively. Our SiamBAN-ACM achieves favorable performance against existing trackers with an AUC score of 64.8.

Tracker                         E       R       A
SiamFC [bertinetto2016fully]    0.277   0.382   0.549
SiamFC-ACM                      0.338   0.294   0.535
SiamRPN++ [li2019siamrpn]       0.441   0.174   0.599
ROAM++ [yang2020roam]           0.434   0.210   0.620
SPM [wang2019spm]               0.481   0.206   0.610
SiamRPN++ACM                    0.501   0.144   0.666
SiamBAN [chen2020siamese]       0.505   0.149   0.632
SiamBAN-ACM                     0.549   0.098   0.647

Table 3: State-of-the-art comparison on VOT2016 challenge dataset [VOT_TPAMI] in terms of expected average overlap (E), robustness (R) and accuracy (A). The best two results are shown in red and blue fonts, respectively. Our SiamBAN-ACM approach achieves the best EAO score, outperforming the previous best tracker SiamBAN with an absolute EAO (E) gain of 4.4%.

Tracker                         E       R       A
SiamFC [bertinetto2016fully]    0.163   0.958   0.470
SiamFC-ACM                      0.206   0.712   0.503
SiamRPN++ [li2019siamrpn]       0.285   0.482   0.599
SiamRPN++ACM                    0.303   0.431   0.624
DiMP [bhat2019learning]         0.321   0.371   0.582
SiamBAN [chen2020siamese]       0.327   0.396   0.602
Ocean [zhang2020ocean]          0.350   0.316   0.594
SiamBAN-ACM                     0.362   0.316   0.621

Table 4: State-of-the-art comparison on VOT2019 challenge dataset [VOT_TPAMI] in terms of expected average overlap (E), robustness (R) and accuracy (A). The best two results are shown in red and blue fonts, respectively. Our SiamBAN-ACM tracker performs favorably against existing methods.

Tracker                         AUC
SiamFC [bertinetto2016fully]    0.498
SiamFC-ACM                      0.508
SiamRPN++ [li2019siamrpn]       0.613
DiMP [bhat2019learning]         0.654
SiamRPN++ACM                    0.634
SiamCAR [guo2020siamcar]        0.614
SiamBAN [chen2020siamese]       0.631
SiamBAN-ACM                     0.648

Table 5: State-of-the-art comparison on UAV123 [mueller2016a] in terms of success (AUC). The best two results are shown in red and blue fonts, respectively. Our three trackers (in bold) achieve improved performance, compared to their respective baselines. Further, our SiamBAN-ACM obtains results comparable with the state-of-the-art, while outperforming its baseline SiamBAN by 1.7% in AUC.

5 Conclusion

We propose a learnable module, called the asymmetric convolution (ACM), to efficiently fuse feature maps of different sizes in Siamese trackers. Our ACM addresses the limitations of standard DW-XCorr and benefits from large-scale offline training. Further, ACM possesses the flexibility to integrate useful non-visual information, such as the b-box size of the target in the initial frame. We integrate ACM into three Siamese tracking architectures. Experiments on six datasets demonstrate that ACM-based trackers provide consistent improvement over their baselines, leading to favorable results against existing methods.

References

Detailed proof of Section 3.2

For one sub-window with the same size of inside , the convolution on it is:

(8)

Because is the i-th sub-window of , it can be written as:

(9)

All sub-windows inside share the same convolution, so the collection of the results can be achieved by a convolution on the whole :

(10)

Therefore, we replace with .
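The equivalence argued above can be checked numerically on a toy case. The sketch below is an illustration under simplifying assumptions, not the authors' implementation: the template and search features each have a single channel, the names `z`, `x`, `theta_z` and `theta_x` are chosen here for illustration, and "convolution" means the cross-correlation used in CNN layers. Integer values are used so that the two routes can be compared exactly.

```python
import random


def corr2d_valid(x, k):
    """Valid 2-D cross-correlation of a 2-D list `x` with kernel `k`."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[i + a][j + b] * k[a][b] for a in range(kh) for b in range(kw))
             for j in range(out_w)] for i in range(out_h)]


def rand_map(h, w, rng):
    """Random integer map, used as a stand-in for learned features or weights."""
    return [[rng.randint(-3, 3) for _ in range(w)] for _ in range(h)]


rng = random.Random(0)
z = rand_map(3, 3, rng)        # template feature (hypothetical, one channel)
x = rand_map(6, 6, rng)        # search feature (hypothetical, one channel)
theta_z = rand_map(3, 3, rng)  # kernel slice acting on the template channel
theta_x = rand_map(3, 3, rng)  # kernel slice acting on the search channel

# Route 1: concatenate z with every sub-window x_i of x (same size as z)
# and apply the two-channel kernel to each stack.
concat_out = []
for i in range(4):
    row = []
    for j in range(4):
        x_i = [r[j:j + 3] for r in x[i:i + 3]]
        stacked, kernel = [z, x_i], [theta_z, theta_x]
        row.append(sum(stacked[c][a][b] * kernel[c][a][b]
                       for c in range(2) for a in range(3) for b in range(3)))
    concat_out.append(row)

# Route 2: one convolution over the whole x, plus the template term broadcast
# to every output position.
z_term = corr2d_valid(z, theta_z)[0][0]
decomposed_out = [[v + z_term for v in r] for r in corr2d_valid(x, theta_x)]

print(concat_out == decomposed_out)
```

Route 2 never builds the concatenated stacks, which is why the two feature maps do not need to share a spatial size.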

Visualization of feature maps

To demonstrate that our ACM is better at suppressing irrelevant background and keeping clear target boundaries, we show more cases of ACM and DW-XCorr feature maps in Fig. 7. The color indicates the L1 norm of each pixel, and high values are represented by warm colors. From these pictures, we can see that DW-XCorr produces high responses both around the targets and on irrelevant background objects. These background responses prevent the headers from making robust predictions. We attribute this to the poor discriminative ability of DW-XCorr on background objects. In contrast, ACM only produces high responses near the targets and suppresses the influence of irrelevant background. Thus, clear boundaries of the targets can be drawn, which helps the headers make more robust and precise predictions.
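A per-pixel L1 norm map of the kind described above can be computed as in the following sketch. It assumes that the norm is taken across channels at each spatial location and that the feature map is stored as a channels x height x width nested list. The min-max normalization is an added step for mapping values to a color scale and is not stated in the text.

```python
def l1_heatmap(feature_map):
    """Collapse a C x H x W nested list into an H x W map of per-pixel L1 norms."""
    channels = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    return [[sum(abs(feature_map[c][i][j]) for c in range(channels))
             for j in range(width)] for i in range(height)]


def normalize(heatmap):
    """Scale values to [0, 1]; 1 would correspond to the warmest color."""
    flat = [v for row in heatmap for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    return [[(v - lo) / span for v in row] for row in heatmap]


# Toy feature map with 2 channels and 2 x 2 spatial size.
features = [
    [[1.0, -2.0], [0.0, 0.5]],
    [[-1.0, 2.0], [0.0, -0.5]],
]
heat = l1_heatmap(features)
print(heat, normalize(heat))
```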

Figure 7: Visualization of ACM and DW-XCorr feature maps.
Figure 8: Network used in our number prediction experiment.
Figure 9: Evaluation on all sequences and 11 attributes on OTB-2015 [wu2015object] in terms of success.
Figure 10: Evaluation on LaSOT [fan2019lasot] in terms of success, precision and normalized precision.

Details of Number Prediction Experiments

To show that ACM is able to incorporate prior information into a convolutional neural network, we conduct a number prediction experiment. First, we concatenate four different images from the MNIST [lecun2010mnist] dataset, making a matrix. Then an index is randomly generated, indicating one position in the matrix. A model is trained to predict the number at the given position. The network is based on AlexNet [krizhevsky2017imagenet], as shown in Fig. 8. The prior information used in this experiment is the index, which is a single number. The index is sent into a fully-connected network to extract features. These features are then fused, using ACM, with the feature map extracted by the third layer of the backbone. We also try fusing them at different layers of the backbone, and the networks can always accurately predict the correct number at the given location.
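The construction of one training sample in this experiment can be sketched as follows. This is a hypothetical illustration only: the 2 x 2 layout, the row-major cell indexing, the function name `make_sample`, and the constant-valued stand-in images are all assumptions, since the text does not specify them. Real MNIST digits would be loaded in place of the placeholders.

```python
import random


def make_sample(images, labels, rng):
    """Tile four H x W images into a 2 x 2 grid and pick one cell to query.

    The 2 x 2 layout and row-major cell indexing are assumptions.
    """
    assert len(images) == 4 and len(labels) == 4
    top = [a + b for a, b in zip(images[0], images[1])]
    bottom = [a + b for a, b in zip(images[2], images[3])]
    grid = top + bottom
    index = rng.randrange(4)  # the prior information: a single number
    target = labels[index]    # the digit the model should predict
    return grid, index, target


# Constant-valued 28 x 28 placeholders standing in for MNIST digits.
labels = [3, 1, 4, 7]
images = [[[float(d)] * 28 for _ in range(28)] for d in labels]
grid, index, target = make_sample(images, labels, random.Random(0))
print(len(grid), len(grid[0]), index, target)
```

The image grid would go through the backbone and the index through the fully-connected branch, with the two fused by ACM as described above.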

Experiment Results

OTB-2015. Following [wu2015object], we report the results in OPE using success plots. We evaluate our trackers on 11 different attributes: fast motion (FM), background clutter (BC), motion blur (MB), deformation (DEF), illumination variation (IV), in-plane rotation (IPR), low resolution (LR), occlusion (OCC), out-of-plane rotation (OPR), out-of-view (OV) and scale variation (SV). Notably, our trackers significantly outperform their baselines on MB, DEF, FM and BC sequences.

LaSOT. On the LaSOT [fan2019lasot] test set, we evaluate our models and six other trackers in one-pass evaluation (OPE) using success, precision and normalized precision plots. Our SiamBAN-ACM achieves the best performance in all three metrics and obtains absolute gains of 5.8%, 6.6% and 5.5% in terms of success, precision and normalized precision, respectively.