Tracking is a fundamental task in any video application requiring some degree of reasoning about objects of interest, as it allows object correspondences to be established between frames [makovski2008visual]. It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition. Given the location of an arbitrary target of interest in the first frame of a video, the aim of visual object tracking is to estimate its position in all the subsequent frames [yilmaz2006object, yang2011recent, smeulders2014visual].
For many applications, it is important that tracking can be performed online, while the video is streaming. In other words, the tracker should not make use of future frames to reason about the current position of the object [kristan2016novel]. This is the scenario portrayed by visual object tracking benchmarks, which represent the target object with a simple axis-aligned [wu2013online, liang2015encoding, mueller2016benchmark, valmadre2018long, muller2018trackingnet] or rotated [kristan2016novel] bounding box. Such a simple annotation helps to keep the cost of data labelling low; what is more, it allows a user to perform a quick and simple initialisation of the target.
Similar to object tracking, the task of semi-supervised video object segmentation (VOS) requires estimating the position of an arbitrary target specified in the first frame of a video. However, in this case the object representation consists of a binary segmentation mask which expresses whether each pixel belongs to the target or not [perazzi2016benchmark]. Such a detailed representation is more desirable for applications that require pixel-level information, like video editing [perazzi2017video] and rotoscoping [miksik2017roam]. Understandably, producing pixel-level estimates requires more computational resources than a simple bounding box. As a consequence, VOS methods have been traditionally slow, often requiring several seconds per frame (e.g. [wen2015jots, tsai2016video, perazzi2017learning, bao2018cnn]). Very recently, there has been a surge of interest in faster approaches [Yang_2018_CVPR, marki2016bilateral, wug2018fast, cheng2018fast, chen2018blazingly, jampani2017video, hu2018videomatch]. However, even the fastest still cannot operate in real-time.
In this paper, we aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems. Our method is motivated by the success of fast tracking approaches based on fully-convolutional Siamese networks trained offline on millions of pairs of video frames [bertinetto2016fully, SiamRPN] and by the very recent availability of a large video dataset with pixel-wise annotations such as YouTube-VOS [xu2018youtube]. We aim at retaining the offline trainability and online speed of these methods while at the same time significantly refining their representation of the target object, which is limited to a simple axis-aligned bounding box.
To achieve this goal, we simultaneously train a Siamese network on three tasks, each corresponding to a different strategy to establish correspondences between the target object and candidate regions in the new frames. As in the fully-convolutional approach of Bertinetto et al. [bertinetto2016fully], one task is to learn a measure of similarity between the target object and multiple candidates in a sliding window fashion. The output is a dense response map which only indicates the location of the object, without providing any information about its spatial extent. To refine this information, we simultaneously learn two further tasks: bounding box regression using a Region Proposal Network [ren2015faster, SiamRPN] and class-agnostic binary segmentation [DeepMask]. Notably, binary labels are only required during offline training to compute the segmentation loss and not during tracking. In our proposed architecture, each task is represented by a different branch departing from a shared CNN and contributes towards a final loss, which sums the three outputs together.
Once trained, SiamMask solely relies on a single bounding box initialisation, operates online without updates and produces object segmentation masks and rotated bounding boxes at 35 frames per second. Despite its simplicity and fast speed, SiamMask establishes a new state-of-the-art on VOT-2018 for the problem of real-time object tracking. Moreover, the same method is also very competitive against recent semi-supervised VOS approaches on DAVIS-2016 and DAVIS-2017, while being the fastest by a large margin. This result is achieved with a simple bounding box initialisation (as opposed to a mask) and without adopting costly techniques often used by VOS approaches such as fine-tuning [maninis2017video, perazzi2017learning, bao2018cnn, voigtlaender2017online], data augmentation [LucidDataDreaming_CVPR17_workshops, li2018video] and optical flow [tsai2016video, bao2018cnn, perazzi2017learning, li2018video, cheng2018fast].
The rest of this paper is organised as follows. Section 2 briefly outlines some of the most relevant prior work in visual object tracking and semi-supervised VOS; Section 3 describes our proposal; Section 4 evaluates it on four benchmarks and illustrates several ablative studies; Section 5 concludes the paper.
2 Related Work
In this section, we briefly cover the most representative techniques for the two problems tackled in this paper.
Visual object tracking.
Arguably, until very recently, the most popular paradigm for tracking arbitrary objects has been to train online a discriminative classifier exclusively from the ground-truth information provided in the first frame of a video (and then update it online). This strategy has often been referred to as tracking-by-detection (e.g. [babenko2009visual, smeulders2014visual]). In the past few years, the Correlation Filter, a simple algorithm that discriminates between the template of an arbitrary target and its 2D translations, rose to prominence as a particularly fast and effective strategy for tracking-by-detection thanks to the pioneering work of Bolme et al. [bolme2010visual]. The performance of Correlation Filter-based trackers has since been notably improved with the adoption of multi-channel formulations [kiani2013multi, henriques2015tracking], spatial constraints [kiani2015correlation, danelljan2015learning, lukezic2017discriminative, li2018learning] and deep features (e.g. [danelljan2017eco, valmadre2017end]).
Recently, a radically different approach has been introduced [bertinetto2016fully, held2016learning, tao2016siamese]. Instead of learning a discriminative classifier online, the idea is to train (offline) a similarity function on pairs of video frames. At test time, this function can be simply evaluated on a new video, once per frame. Evolutions of the fully-convolutional Siamese approach [bertinetto2016fully] considerably improved tracking performance by making use of region proposals [SiamRPN], hard negative mining [zhu2018distractor], ensembling [he2018towards] and memory networks [yang2018learning].
Most modern trackers, including all the ones mentioned above, use a rectangular bounding box both to initialise the target and to estimate its position in the subsequent frames. Despite its convenience, a simple rectangle often fails to properly represent an object, as is evident in the examples of Figure 1. This motivated us to propose a tracker able to produce binary segmentation masks while still only relying on a bounding box initialisation.
Interestingly, in the past it was not uncommon for trackers to produce a coarse binary mask of the target object (e.g. [comaniciu2000real, perez2002color, bibby2008robust]). However, to the best of our knowledge, the only recent tracker that, like ours, is able to operate online and produce a binary mask starting from a bounding box initialisation is the superpixel-based approach of Yeo et al. [yeo2017superpixel]. At 4 frames per second (fps), its fastest variant is significantly slower than our proposal (35 fps). Furthermore, when using CNN features, its speed is affected by a 60-fold decrease, plummeting below 0.1 fps. Finally, it has not been demonstrated to be competitive on modern tracking or VOS benchmarks (despite being available, its code is not usable). Similar to us, the methods of Perazzi et al. [perazzi2017learning] and Ci et al. [ci2018video] can also start from a rectangle and output per-frame masks. However, they require fine-tuning at test time, which makes them slow.
Semi-supervised video object segmentation. Benchmarks for arbitrary object tracking (e.g. [smeulders2014visual, kristan2016novel]) assume that trackers receive input frames in a sequential fashion. This aspect is generally referred to with the attributes online or causal [kristan2016novel]. Moreover, methods are often focused on achieving a speed that exceeds typical video framerates [VOT2018]. Conversely, semi-supervised VOS algorithms have been traditionally more concerned with an accurate representation of the object of interest [perazzi2017video, perazzi2016benchmark].
In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [wen2015jots, perazzi2015fully, tsai2016video, marki2016bilateral, bao2018cnn]). In particular, Bao et al. [bao2018cnn] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN.
Another popular strategy is to process video frames independently (e.g. [maninis2017video, perazzi2017learning, voigtlaender2017online]), similarly to what happens in most tracking approaches. For example, in OSVOS-S Maninis et al. [maninis2017video] do not make use of any temporal information. They rely on a fully-convolutional network pre-trained for classification and then, at test time, they fine-tune it using the ground-truth mask provided in the first frame. MaskTrack [perazzi2017learning] instead is trained from scratch on individual images, but it does exploit some form of temporality at test time by using the latest mask prediction and optical flow as additional input to the network.
Aiming towards the highest possible accuracy, at test time VOS methods often feature computationally intensive techniques such as fine-tuning [maninis2017video, perazzi2017learning, bao2018cnn, voigtlaender2017online], data augmentation [LucidDataDreaming_CVPR17_workshops, li2018video] and optical flow [tsai2016video, bao2018cnn, perazzi2017learning, li2018video, cheng2018fast]. Therefore, these approaches are generally characterised by low framerates and the inability to operate online. For example, it is not uncommon for methods to require minutes [perazzi2017learning, cheng2017segflow] or even hours [tsai2016video, bao2018cnn] for videos that are just a few seconds long.
Recently, there has been an increasing interest in the VOS community towards faster methods [marki2016bilateral, wug2018fast, cheng2018fast, chen2018blazingly, jampani2017video, hu2018videomatch]. To the best of our knowledge, the fastest approaches with a performance competitive with the state-of-the-art are the ones of Yang et al. [Yang_2018_CVPR] and Wug et al. [wug2018fast]. The former uses a meta-network “modulator” to quickly adapt the parameters of a segmentation network during test time, while the latter does not use any fine-tuning and adopts an encoder-decoder Siamese architecture trained in multiple stages. Both these methods run below 10 frames per second, while we are more than four times faster and only rely on a bounding box initialisation.
To allow online operability and fast speed, we adopt the fully-convolutional Siamese framework [bertinetto2016fully]. Moreover, to illustrate that our approach is agnostic to the specific fully-convolutional method used as a starting point (e.g. [bertinetto2016fully, SiamRPN, zhu2018distractor, yang2018learning, he2018twofold]), we consider the popular SiamFC [bertinetto2016fully] and SiamRPN [SiamRPN] as two representative examples. We first introduce them in Section 3.1 and then describe our approach in Section 3.2.
3.1 Fully-convolutional Siamese networks
SiamFC. Bertinetto et al. [bertinetto2016fully] propose to use, as a fundamental building block of a tracking system, an offline-trained fully-convolutional Siamese network that compares an exemplar image $z$ against a (larger) search image $x$ to obtain a dense response map. $z$ and $x$ are, respectively, a crop centered on the target object and a larger crop centered on the last estimated position of the target. The two inputs are processed by the same CNN $f_\theta$, yielding two feature maps that are cross-correlated:
$$g_\theta(z, x) = f_\theta(z) \star f_\theta(x). \qquad (1)$$
In this paper, we refer to each spatial element of the response map (left-hand side of Eq. 1) as response of a candidate window (RoW). For example, $g_\theta^n(z, x)$ encodes a similarity between the exemplar $z$ and the $n$-th candidate window in $x$. For SiamFC, the goal is for the maximum value of the response map to correspond to the target location in the search area $x$. Instead, in order to allow each RoW to encode richer information about the target object, we replace the simple cross-correlation of Eq. 1 with depth-wise cross-correlation [bertinetto2016learning] and produce a multi-channel response map. SiamFC is trained offline on millions of video frames with the logistic loss [bertinetto2016fully, Section 2.2], which we refer to as $\mathcal{L}_{sim}$.
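To make the difference between plain and depth-wise cross-correlation concrete, here is a minimal sketch on tiny nested-list feature maps. The function names and toy shapes are assumptions for illustration only and are not taken from the authors' code; a real tracker would run this inside a deep-learning framework on CNN features.

```python
def xcorr2d(search, kernel):
    """Valid cross-correlation of one 2-D channel (lists of lists)."""
    sh, sw = len(search), len(search[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(sh - kh + 1):
        row = []
        for j in range(sw - kw + 1):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    acc += search[i + a][j + b] * kernel[a][b]
            row.append(acc)
        out.append(row)
    return out


def depthwise_xcorr(search_feat, exemplar_feat):
    """Correlate channel c of the search features with channel c of the
    exemplar features; returns one response map per channel."""
    return [xcorr2d(s, k) for s, k in zip(search_feat, exemplar_feat)]


def plain_xcorr(search_feat, exemplar_feat):
    """Single-channel response: the per-channel maps summed together."""
    maps = depthwise_xcorr(search_feat, exemplar_feat)
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(w)] for i in range(h)]


# Toy example: 2 channels, 3x3 search features, 2x2 exemplar features.
search_feat = [
    [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
    [[0, 0, 2], [0, 2, 0], [2, 0, 0]],
]
exemplar_feat = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]
multi = depthwise_xcorr(search_feat, exemplar_feat)  # two 2x2 maps
single = plain_xcorr(search_feat, exemplar_feat)     # one 2x2 map
```

In the depth-wise case every spatial position keeps a vector with one entry per channel, which is what allows a RoW to carry more than a single similarity score.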
SiamRPN. Li et al. [SiamRPN] considerably improve the performance of SiamFC by relying on a region proposal network (RPN) [ren2015faster], which makes it possible to estimate the target location with a bounding box of variable aspect ratio. In particular, in SiamRPN each RoW encodes a set of anchor box proposals and corresponding object/background scores. Therefore, SiamRPN outputs box predictions in parallel with classification scores. The two output branches are trained using the smooth $L_1$ and the cross-entropy losses [SiamRPN, Section 3.2]. In the following, we refer to them as $\mathcal{L}_{box}$ and $\mathcal{L}_{score}$ respectively.
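For readers unfamiliar with the two losses named above, the sketch below shows the textbook forms of a smooth L1 regression loss and a softmax cross-entropy loss for a single anchor. The `beta` parameter, the function names and the toy inputs are assumptions of this sketch; the exact normalisation used by SiamRPN is described in the cited paper, not here.

```python
import math


def smooth_l1(x, beta=1.0):
    """Quadratic near zero, linear for |x| >= beta."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta
    return ax - 0.5 * beta


def box_loss(pred, target):
    """Sum of smooth L1 terms over the four box offsets of one anchor."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def cross_entropy(logits, label):
    """Softmax cross-entropy for one anchor (label 0 = background, 1 = object)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]


# Hypothetical predictions for one anchor.
l_box = box_loss([0.5, 2.0, 0.0, -0.2], [0.0, 0.0, 0.0, 0.0])
l_score = cross_entropy([0.0, 0.0], 1)
```

The quadratic region keeps gradients small for nearly correct boxes, while the linear region limits the influence of large regression errors.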
Unlike existing tracking methods that rely on low-fidelity object representations, we argue the importance of producing per-frame binary segmentation masks. To this aim we show that, besides similarity scores and bounding box coordinates, it is possible for the RoW of a fully-convolutional Siamese network to also encode the information necessary to produce a pixel-wise binary mask. This can be achieved by extending existing Siamese trackers with an extra branch and loss.
We predict binary masks of size $w \times h$ (one for each RoW) using a simple two-layer neural network $h_\phi$ with learnable parameters $\phi$. Let $m_n$ denote the predicted mask corresponding to the $n$-th RoW $g_\theta^n(z, x)$,
$$m_n = h_\phi(g_\theta^n(z, x)). \qquad (2)$$
From Eq. 2 we can see that the mask prediction is a function of both the image to segment $x$ and the target object in $z$. In this way, $z$ can be used as a reference to guide the segmentation process, such that objects of any arbitrary class can be tracked. This clearly means that, given a different reference image $z$, the network will produce a different segmentation mask for $x$.
Loss function. During training, each RoW is labelled with a ground-truth binary label $y_n \in \{\pm 1\}$ and also associated with a pixel-wise ground-truth mask $c_n$ of size $w \times h$. Let $c_n^{ij} \in \{\pm 1\}$ denote the label corresponding to pixel $(i, j)$ of the object mask in the $n$-th candidate RoW. The loss function $\mathcal{L}_{mask}$ (Eq. 3) for the mask prediction task is a binary logistic regression loss over all RoWs:
$$\mathcal{L}_{mask}(\theta, \phi) = \sum_n \Big( \frac{1 + y_n}{2wh} \sum_{ij} \log\big(1 + e^{-c_n^{ij} m_n^{ij}}\big) \Big). \qquad (3)$$
Thus, the classification layer of $h_\phi$ consists of $w \times h$ classifiers, each indicating whether a given pixel belongs to the object in the candidate window or not. Note that $\mathcal{L}_{mask}$ is considered only for positive RoWs (with $y_n = 1$).
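The following sketch illustrates a per-pixel binary logistic loss that is accumulated only over positive RoWs, as described above. The data layout (a list of `(label, logits, target)` tuples) and the division by the number of pixels are assumptions of this sketch rather than a transcription of the authors' implementation.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mask_loss(rows):
    """rows: list of (label, logits, target) where label is +1 or -1,
    logits is an h x w grid of real-valued mask predictions and
    target is an h x w grid of pixel labels in {+1, -1}."""
    total = 0.0
    for label, logits, target in rows:
        if label != 1:  # negative RoWs contribute nothing
            continue
        n_pixels = len(logits) * len(logits[0])
        pixel_sum = sum(
            softplus(-c * m)
            for logit_row, target_row in zip(logits, target)
            for m, c in zip(logit_row, target_row)
        )
        total += pixel_sum / n_pixels
    return total


rows = [
    (1, [[0.0, 0.0], [0.0, 0.0]], [[1, -1], [-1, 1]]),       # positive RoW
    (-1, [[5.0, 5.0], [5.0, 5.0]], [[-1, -1], [-1, -1]]),    # ignored
]
loss = mask_loss(rows)
```

With all logits at zero every pixel classifier is maximally uncertain, so the positive RoW contributes log 2 and the negative RoW contributes nothing.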
Mask representation. In contrast to semantic segmentation methods à la FCN [long2015fully] and Mask R-CNN [maskrcnn], which maintain explicit spatial information throughout the network, our approach follows the spirit of [DeepMask, SharpMask] and generates masks starting from a flattened representation of the object. In particular, in our case this representation corresponds to one of the RoWs produced by the depth-wise cross-correlation between the exemplar features $f_\theta(z)$ and the search features $f_\theta(x)$. Importantly, the network $h_\phi$ of the segmentation task is composed of two $1 \times 1$ convolutional layers, one with 256 and the other with $63^2$ channels (Figure 2). This allows every pixel classifier to utilise information contained in the entire RoW and thus to have a complete view of its corresponding candidate window in the search image $x$, which is critical to disambiguate between instances that look like the target (last row of Figure 5), also known as distractors [zhu2018distractor]. With the aim of producing a more accurate object mask, we follow the strategy of [SharpMask], which merges low and high resolution features using multiple refinement modules made of upsampling layers and skip connections. Further details can be found in Appendix A.
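As a rough illustration of how a mask can be generated from a flattened RoW, the sketch below applies two linear layers to a single RoW vector and reshapes the result into a grid. Applied to one spatial position, a convolution with a 1×1 kernel reduces to such a linear layer. The toy sizes (3 channels, 2 hidden units, a 2×2 mask) are placeholders standing in for the much larger sizes of the paper, normalisation is omitted, and all names are hypothetical.

```python
def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def mask_head(row_vec, w1, b1, w2, b2, mask_h, mask_w):
    """Map one flattened RoW to a mask_h x mask_w grid of mask logits."""
    hidden = relu(linear(row_vec, w1, b1))
    flat = linear(hidden, w2, b2)  # mask_h * mask_w pixel classifiers
    return [flat[i * mask_w:(i + 1) * mask_w] for i in range(mask_h)]


# Toy parameters chosen by hand so the result is easy to check.
row_vec = [1.0, -2.0, 0.5]
w1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b2 = [0.0, 0.0, 0.5, 0.0]
mask_logits = mask_head(row_vec, w1, b1, w2, b2, 2, 2)
```

Because each output pixel is a weighted sum over the whole hidden vector, every pixel classifier sees the entire RoW, which is the property the paragraph above relies on.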
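Two variants. For our experiments, we augment the architectures of SiamFC [bertinetto2016fully] and SiamRPN [SiamRPN] with our segmentation branch and the loss $\mathcal{L}_{mask}$, obtaining what we call the two-branch and three-branch variants of SiamMask. These respectively optimise the multi-task losses $\mathcal{L}_{2B}$ and $\mathcal{L}_{3B}$, defined as:
$$\mathcal{L}_{2B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{sim}, \qquad (4)$$
$$\mathcal{L}_{3B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{score} + \lambda_3 \cdot \mathcal{L}_{box}. \qquad (5)$$
We refer the reader to [bertinetto2016fully, Section 2.2] for the similarity loss $\mathcal{L}_{sim}$ and to [SiamRPN, Section 3.2] for the box and score losses $\mathcal{L}_{box}$ and $\mathcal{L}_{score}$. For $\mathcal{L}_{3B}$, a RoW is considered positive ($y_n = 1$) if one of its anchor boxes has IOU with the ground-truth box of at least 0.6 and negative ($y_n = -1$) otherwise. For $\mathcal{L}_{2B}$, we adopt the same strategy of [bertinetto2016fully] to define positive and negative samples. We did not search over the hyperparameters of Eq. 4 and Eq. 5 and simply set $\lambda_1 = 32$ like in [DeepMask] and $\lambda_2 = \lambda_3 = 1$. The task-specific branches for the box and score outputs are constituted by two convolutional layers. Figure 2 illustrates the two variants of SiamMask.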
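The labelling rule and the weighted sum of task losses described above can be sketched as follows. Boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the loss values and weights in the example are placeholders chosen for easy arithmetic, not the values used in the paper.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def row_label(anchor_boxes, gt_box, threshold=0.6):
    """+1 if any anchor of the RoW overlaps the ground truth enough, else -1."""
    return 1 if any(iou(a, gt_box) >= threshold for a in anchor_boxes) else -1


def multi_task_loss(losses, weights):
    """Weighted sum of per-task losses (mask first, then the other tasks)."""
    return sum(w * l for w, l in zip(weights, losses))


gt = (0, 0, 10, 10)
positive = row_label([(0, 0, 10, 8), (5, 0, 15, 10)], gt)  # IOU 0.8 for the first anchor
negative = row_label([(5, 0, 15, 10)], gt)                 # IOU 1/3 only
total = multi_task_loss([0.5, 1.0, 2.0], [2.0, 1.0, 1.0])  # placeholder values
```

The same `multi_task_loss` helper covers both variants: two entries for the two-branch model and three for the three-branch model.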
Box generation. Note that, while VOS benchmarks require binary masks, typical tracking benchmarks such as VOT [kristan2016novel] require a bounding box as final representation of the target object. We consider three different strategies to generate a bounding box from a binary mask (Figure 3): (1) axis-aligned bounding rectangle (Min-max), (2) rotated minimum bounding rectangle (MBR) and (3) the optimisation strategy used for the automatic bounding box generation proposed in VOT-2016 [kristan2016visual] (Opt). We empirically evaluate these alternatives in Section 4 (Table 1).
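Of the three strategies listed above, the Min-max box is the simplest to write down. The sketch below computes it from a binary mask stored as a list of rows; the function name and the inclusive `(x_min, y_min, x_max, y_max)` pixel convention are assumptions of this example. The rotated MBR and the Opt strategy need additional geometry or optimisation code and are not shown.

```python
def min_max_box(mask):
    """Axis-aligned bounding rectangle of the non-zero pixels of a binary mask.
    Returns (x_min, y_min, x_max, y_max) in inclusive pixel indices,
    or None when the mask is empty."""
    ys = [i for i, row in enumerate(mask) if any(row)]
    if not ys:
        return None
    xs = [j for j in range(len(mask[0])) if any(row[j] for row in mask)]
    return (min(xs), min(ys), max(xs), max(ys))


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = min_max_box(mask)
```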
3.3 Implementation details
Network architecture. For both our variants, we use a ResNet-50 [he2016deep] until the final convolutional layer of the 4-th stage as our backbone $f_\theta$. In order to obtain a high spatial resolution in deeper layers, we reduce the output stride to 8 by using convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [chen2018deeplab]. In our model, we add to the shared backbone $f_\theta$ an unshared adjust layer ($1 \times 1$ conv with 256 outputs). For simplicity, we omit it in Eq. 1. We describe the network architectures in more detail in Appendix A.
Training. Like SiamFC [bertinetto2016fully], we use exemplar and search image patches of $127 \times 127$ and $255 \times 255$ pixels respectively. During training, we randomly jitter exemplar and search patches. Specifically, we consider random translations (up to pixels) and rescaling (of and for exemplar and search respectively).
The network backbone is pre-trained on the ImageNet classification task. We use SGD with a first warmup phase in which the learning rate increases linearly from to for the first 5 epochs and then decreases logarithmically until for 15 more epochs. We train all our models using COCO [lin2014microsoft], ImageNet-VID [russakovsky2015imagenet] and YouTube-VOS [xu2018youtube].
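A schedule with a linear warmup followed by a logarithmic decrease can be sketched as below. This is one plausible reading, in which "logarithmic" is taken to mean interpolation that is linear in log space. The three learning-rate values in the example are placeholders chosen for illustration, since the actual values are not given in this text; only the epoch counts (5 and 15) come from the paragraph above.

```python
import math


def lr_at_epoch(epoch, lr_start, lr_peak, lr_end,
                warmup_epochs=5, decay_epochs=15):
    """Learning rate for a zero-based epoch index."""
    if epoch < warmup_epochs:
        # Linear warmup: lr_start at epoch 0, reaching lr_peak at warmup_epochs.
        t = epoch / warmup_epochs
        return lr_start + t * (lr_peak - lr_start)
    # Log-space interpolation from lr_peak down to lr_end.
    t = min((epoch - warmup_epochs) / (decay_epochs - 1), 1.0)
    log_lr = math.log(lr_peak) + t * (math.log(lr_end) - math.log(lr_peak))
    return math.exp(log_lr)


# Placeholder values, for illustration only.
schedule = [lr_at_epoch(e, 0.001, 0.01, 0.0001) for e in range(20)]
```

Under this reading the rate is multiplied by a constant factor at every epoch of the decay phase, so it drops quickly at first in absolute terms and more slowly later on.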
Inference. During tracking, SiamMask is simply evaluated once per frame, without any adaptation. In both our variants, we select the output mask using the location attaining the maximum score in the classification branch. Then, after having applied a per-pixel sigmoid, we binarise the output of the mask branch with a threshold . In the two-branch variant, for each video frame after the first one, we fit the output mask with the Min-max box and use it as reference to crop the next frame search region. Instead, in the three-branch variant, we find it more effective to exploit the highest-scoring output of the box branch as reference.
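The selection step described above (pick the highest-scoring location, apply a sigmoid to its mask logits, then binarise) can be sketched as follows. The grid layout, the function names and the threshold of 0.5 used in the example are assumptions of this sketch.

```python
import math


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def select_mask(scores, masks, threshold):
    """scores: 2-D grid with one classification score per RoW.
    masks: grid of the same shape holding one grid of mask logits per RoW.
    Returns the best location and its binarised mask."""
    locations = [(i, j) for i in range(len(scores)) for j in range(len(scores[0]))]
    best = max(locations, key=lambda ij: scores[ij[0]][ij[1]])
    logits = masks[best[0]][best[1]]
    binary = [[1 if sigmoid(v) > threshold else 0 for v in row] for row in logits]
    return best, binary


scores = [[0.1, 0.9], [0.3, 0.2]]
blank = [[-5.0, -5.0], [-5.0, -5.0]]
masks = [
    [blank, [[2.0, -2.0], [0.5, -0.1]]],
    [blank, blank],
]
best, binary = select_mask(scores, masks, threshold=0.5)  # placeholder threshold
```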
For the implementation of SiamMask, we used PyTorch. Code, pre-computed results and pre-trained models will be made available.
In this section, we evaluate our approach on two related tasks: visual object tracking (on VOT-2016 and VOT-2018) and semi-supervised video object segmentation (on DAVIS-2016 and DAVIS-2017). We refer to our two-branch and three-branch variants as SiamMask-2B and SiamMask respectively.
4.1 Evaluation for visual object tracking
Datasets and settings. We adopt two widely used benchmarks for the evaluation of the object tracking task: VOT-2016 [kristan2016visual] and VOT-2018 [VOT2018], both annotated with rotated bounding boxes. We use VOT-2016 to conduct an experiment to understand how different types of representation affect the performance. For this first experiment, we use mean intersection over union (IOU) and Average Precision (AP) at 0.5 and 0.7 IOU. We then compare against the state-of-the-art on VOT-2018, using the official VOT toolkit and the Expected Average Overlap (EAO), a measure that considers both accuracy and robustness of a tracker [VOT2018].
How much does the object representation matter? Existing tracking methods typically predict axis-aligned bounding boxes with a fixed [bertinetto2016fully, henriques2015tracking, danelljan2015learning, lukezic2017discriminative] or variable [SiamRPN, held2016learning, zhu2018distractor] aspect ratio. We are interested in understanding to which extent producing a per-frame binary mask can improve tracking. In order to focus on representation accuracy, for this experiment only we ignore the temporal aspect and sample video frames at random. The approaches described in the following paragraph are tested on randomly cropped search patches (with random shifts within pixels and scale deformations up to ) from the sequences of VOT-2016.
In Table 1, we compare our three-branch variant using the Min-max, MBR and Opt approaches (described at the end of Section 3.2 and in Figure 3). For perspective, we also report results for SiamFC and SiamRPN as representative of the fixed and variable aspect-ratio approaches, together with three oracles that have access to per-frame ground-truth information and serve as upper bound for the different representation strategies. (1) The fixed aspect-ratio oracle uses the per-frame ground-truth area and center location, but fixes the aspect ratio to the one of the first frame and produces an axis-aligned bounding box. (2) The Min-max oracle uses the minimal enclosing rectangle of the rotated ground-truth bounding box to produce an axis-aligned bounding box. (3) Finally, the MBR oracle uses the rotated minimum bounding rectangle of the ground-truth. Note that (1), (2) and (3) can be considered, respectively, the performance upper bounds for the representation strategies of SiamFC, SiamRPN and SiamMask.
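The first two oracles lend themselves to a short worked example. The sketch below builds the fixed aspect-ratio box from a per-frame area and centre plus the first-frame width and height, and the Min-max box from the four corners of a rotated rectangle. Function names, argument conventions and the numbers are hypothetical and serve only to illustrate the geometry.

```python
import math


def fixed_aspect_ratio_box(center, area, first_frame_size):
    """Axis-aligned box with the given centre and area, whose width/height
    ratio is copied from the first-frame box."""
    w0, h0 = first_frame_size
    ratio = w0 / h0
    w = math.sqrt(area * ratio)
    h = math.sqrt(area / ratio)
    cx, cy = center
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def enclosing_box(corners):
    """Minimal axis-aligned rectangle containing the corners of a rotated box."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


# First-frame box of 20 x 10 pixels; current ground truth has area 800.
oracle_1 = fixed_aspect_ratio_box((50, 40), 800, (20, 10))
# Rotated ground-truth box given as a diamond.
oracle_2 = enclosing_box([(5, 0), (10, 5), (5, 10), (0, 5)])
```

In the second example the enclosing rectangle has area 100 while the diamond itself covers only 50, which illustrates how much background an axis-aligned box can include for a rotated object.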
Table 1 shows that our method achieves the best mIOU, no matter the box generation strategy used (Figure 3). Although SiamMask-Opt offers the highest IOU and mAP, it requires significant computational resources due to its slow optimisation procedure [kristan2016visual]. Instead, we adopt the MBR strategy (whose computational overhead is negligible) for our final object tracking evaluation. SiamMask-MBR achieves a mAP@0.5 IOU of , with a respective improvement of and points w.r.t. the two fully-convolutional baselines. Interestingly, the gap significantly widens when considering mAP at the higher accuracy regime of 0.7 IOU: and respectively. Notably, our accuracy results are not far from the fixed aspect-ratio oracle. Moreover, comparing the upper bound performance represented by the oracles, it is possible to notice how, by simply changing the bounding box representation, there is considerable room for improvement ( mIOU improvement between the fixed aspect-ratio and the MBR oracles).
Overall, this study shows how the MBR strategy to obtain a rotated bounding box from a binary mask of the object offers a significant advantage over popular strategies that simply report axis-aligned bounding boxes.
|Method||mIOU (%)||mAP@0.5 IOU||mAP@0.7 IOU|
|Fixed a.r. Oracle||73.43||90.15||62.52|
Results on VOT-2018. In Table 2 we compare the two variants of SiamMask against seven recently published state-of-the-art trackers on the VOT-2018 benchmark. Both achieve outstanding performance and run in real-time. In particular, our three-branch variant (SiamMask) significantly outperforms the very recent and top performing DaSiamRPN [zhu2018distractor]. Even without box regression branch, our simpler two-branch variant (SiamMask-2B) achieves a high EAO of , which is on par with SA_Siam_R [he2018towards] and superior to any other real-time method in the published literature. SiamMask provides a further relative gain of , achieving an EAO of , which establishes a new state-of-the-art for real-time tracking. Our model is particularly strong under the accuracy metric, showing a significant advantage with respect to the Correlation Filter-based trackers CSRDCF [lukezic2017discriminative], STRCF [li2018learning] and ECO [danelljan2017eco]. This is not surprising, as SiamMask relies on a richer object representation, as outlined in Table 1. Interestingly, similarly to us, He et al. (SA_Siam_R) [he2018towards] are motivated to achieve a more accurate target representation. To this aim, they consider multiple rotated and rescaled bounding boxes as candidates. However, like SiamFC [bertinetto2016fully], SA_Siam_R is still constrained to a fixed aspect-ratio bounding box.
|SiamMask||SiamMask-2B||DaSiamRPN [zhu2018distractor]||SiamRPN [SiamRPN]||SA_Siam_R [he2018towards]||CSRDCF [lukezic2017discriminative]||STRCF [li2018learning]||LSART [sun2018learning]||ECO [danelljan2017eco]|
4.2 Evaluation for semi-supervised VOS
In the following we show how, once trained, the same method can also be used for the task of video object segmentation, achieving competitive performance without requiring any adaptation at test time. Importantly, unlike typical VOS approaches, ours can operate online, runs in real-time and only requires a simple bounding box initialisation.
Datasets and settings. We report the performance of SiamMask on the popular DAVIS-2016 [perazzi2016benchmark] and DAVIS-2017 [pont2017davis] benchmarks. For both datasets, we use the official performance measures: the Jaccard index ($\mathcal{J}$) to express region similarity and the F-measure ($\mathcal{F}$) to express contour accuracy. For each measure $\mathcal{C} \in \{\mathcal{J}, \mathcal{F}\}$, three statistics are considered: mean $\mathcal{C}_\mathcal{M}$, recall $\mathcal{C}_\mathcal{O}$, and decay $\mathcal{C}_\mathcal{D}$, which informs us about the gain/loss of performance over time [perazzi2016benchmark].
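To give a feel for the region-similarity measure and its three statistics, here is a simplified sketch. The Jaccard index of two binary masks is their intersection over union. The recall threshold of 0.5 and the quartile-based decay follow what we understand to be the DAVIS convention, but both are stated here as assumptions, and the official evaluation code should be used for reported numbers.

```python
def jaccard(pred, gt):
    """Intersection over union of two binary masks (lists of rows)."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    return inter / union if union else 1.0


def mean(values):
    return sum(values) / len(values)


def recall(values, threshold=0.5):
    """Fraction of frames whose score exceeds the threshold (assumed 0.5)."""
    return sum(1 for v in values if v > threshold) / len(values)


def decay(values):
    """Mean of the first quarter of frames minus mean of the last quarter
    (simplified quartile split)."""
    q = max(1, len(values) // 4)
    return mean(values[:q]) - mean(values[-q:])


j = jaccard([[1, 1], [0, 0]], [[1, 0], [0, 0]])
per_frame = [0.9, 0.8, 0.6, 0.4]  # hypothetical per-frame Jaccard scores
stats = (mean(per_frame), recall(per_frame), decay(per_frame))
```

A small decay means that quality at the end of a sequence is close to quality at the start, which is why this statistic is used as a proxy for robustness over time.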
To initialise SiamMask, we extract the axis-aligned bounding box (Min-max strategy, Figure 3) from the mask provided in the first frame. Similarly to most VOS methods, in case of multiple objects in the same video (DAVIS-2017) we simply perform multiple inferences.
Results on DAVIS-2016 and 2017. In the semi-supervised setting, VOS methods are initialised with a binary mask [perazzi2017video] and many of them require computationally intensive techniques at test time such as fine-tuning [maninis2017video, perazzi2017learning, bao2018cnn, voigtlaender2017online], data augmentation [LucidDataDreaming_CVPR17_workshops, li2018video], inference on MRF/CRF [wen2015jots, tsai2016video, marki2016bilateral, bao2018cnn] and optical flow [tsai2016video, bao2018cnn, perazzi2017learning, li2018video, cheng2018fast]. As a consequence, it is not uncommon for VOS techniques to require several minutes to process the short sequences of DAVIS. Clearly, these strategies make the online applicability (which is our focus) impossible. For this reason, in our comparison (Tables 3 and 4, Figure 4) we mainly concentrate on fast state-of-the-art approaches.
Table 3 shows how SiamMask can be considered as a strong baseline for online VOS. First, it is almost two orders of magnitude faster than accurate approaches such as OnAVOS [voigtlaender2017online] or SFL [cheng2017segflow]. Second, it is competitive with recent VOS methods that do not employ fine-tuning, while being four times more efficient than the fastest ones (OSMN [Yang_2018_CVPR] and RGMP [wug2018fast]). Interestingly, we note that SiamMask achieves the best decay [perazzi2016benchmark] for region similarity ($\mathcal{J}_\mathcal{D}$) and contour accuracy ($\mathcal{F}_\mathcal{D}$) on both DAVIS-2016 and DAVIS-2017. This suggests that our method is robust over time and thus well suited to particularly long sequences.
Figure 4 offers a clearer overview of the tradeoff between segmentation accuracy (as mean IOU, which corresponds to $\mathcal{J}_\mathcal{M}$) and speed (in frames per second). Among the methods of Table 3 that do not fine-tune the model online, the recently proposed FAVOS [cheng2018fast] obtains the best results. However, it combines several independent modules (a part-based tracker, a segmentation network and a similarity-based aggregation module), while SiamMask is only evaluated with a single model and it is almost 50 times faster.
Qualitative results of SiamMask for both VOT and DAVIS sequences are shown in Figures 5, 10 and 11. Despite the high speed, SiamMask produces accurate segmentation masks even in the presence of distractors.
4.3 Further analysis
In this section, we illustrate ablation studies, failure cases and timings of our methods.
|SiamMask-2B w/o R||✔||0.326||62.3||55.6||43|
|SiamMask w/o R||✔||0.342||68.6||57.8||37|
Network architecture. In Table 5, AN and RN denote whether we use AlexNet or ResNet-50 as the shared backbone (Figure 2), while with “w/o R” we mean that the method does not use the refinement strategy of Pinheiro et al. [SharpMask]. From the results of Table 5, it is possible to make several observations. (1) The first set of rows shows that, by simply updating the architecture of the backbone $f_\theta$, it is possible to achieve an important performance improvement. However, this comes at the cost of speed, especially for SiamRPN. (2) SiamMask-2B and SiamMask considerably improve over their baselines (with same $f_\theta$) SiamFC and SiamRPN. With a relative in EAO, the gap of the two-branch variant is particularly important. (3) Interestingly, the refinement approach of Pinheiro et al. [SharpMask] is very important for the contour accuracy $\mathcal{F}_\mathcal{M}$, but less so for the other metrics.
Multi-task learning. We conducted two further experiments to disentangle the effect of multi-task learning [caruana1997multitask, ruder2017overview] from that of using the MBR box (from a binary mask) as representation for the target object instead of a traditional axis-aligned bounding box. Results are reported in Table 5. To achieve this, we modified the two variants of SiamMask so that, respectively, they report an axis-aligned bounding box from the score branch (SiamMask-2B-box) or the box branch (SiamMask-box). Therefore, despite having been trained, the mask branch is not used during inference. We can observe how both variants obtain an improvement with respect to their simpler counterparts: from 0.251 to 0.265 EAO for the two-branch and from 0.329 to 0.331 for the three-branch. However, for both variants the gap between SiamMask-box and SiamMask is higher. This implies that, despite being meaningful, the improvement brought simply by training multiple related tasks together is less relevant than the type of target object representation used.
Timing. SiamMask operates online without any adaptation to the test sequence. On a single NVIDIA Titan X GPU, we measured an average speed of 35 and 40 frames per second, respectively for the two-branch and three-branch variants. Note that the highest computational burden comes from the feature extractor $f_\theta$. For this reason, changing its architecture is a convenient way to obtain different speed/performance trade-offs.
Failure cases. Finally, we discuss two scenarios in which SiamMask fails: motion blur and “non-object” pattern (Figure 6). Despite being different in nature, these two cases arguably arise from the complete lack of similar training samples in a training set such as YouTube-VOS [xu2018youtube], which is focused on objects that can be unambiguously segmented from the background.
We introduced SiamMask, a simple approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object. We show how it can be applied with success to both tasks of visual object tracking and semi-supervised video object segmentation, showing better accuracy than state-of-the-art trackers and, at the same time, the fastest speed among VOS methods. The two variants of SiamMask we proposed are initialised with a simple bounding box, operate online, run in real-time and do not require any adaptation to the test sequence. We hope that our work will inspire further studies that consider the two problems of visual object tracking and video object segmentation together.
Appendix A Network architecture details
Network backbone. Table 6 illustrates the details of our backbone architecture ($f_\theta$ in the main paper). For both variants, we use a ResNet-50 [he2016deep] until the final convolutional layer of the 4-th stage. In order to obtain a higher spatial resolution in deep layers, we reduce the output stride to 8 by using convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [chen2018deeplab]. Specifically, we set the stride to 1 and the dilation rate to 2 in the $3 \times 3$ conv layer of conv4_1. Differently to the original ResNet-50, there is no downsampling in conv4_x. We also add to the backbone an adjust layer (a $1 \times 1$ convolutional layer with 256 output channels). Exemplar and search patches share the network’s parameters from conv1 to conv4_x, while the parameters of the adjust layer are not shared. The output features of the adjust layer are then depth-wise cross-correlated, resulting in a feature map of size $17 \times 17$.
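The spatial sizes quoted in this appendix can be checked with the standard convolution output-size formula. In the sketch below, the 127 and 255 pixel inputs and the unpadded first convolution are assumptions chosen because they reproduce the conv1 output sizes of 61 and 125 listed in the backbone table. The 15 and 31 feature sizes fed to the cross-correlation are likewise assumed values that are consistent with a response map of size 17.

```python
def conv_out(size, kernel, stride=1, padding=0, dilation=1):
    """Output length of a convolution along one spatial dimension."""
    effective_kernel = dilation * (kernel - 1) + 1
    return (size + 2 * padding - effective_kernel) // stride + 1


def xcorr_out(search_size, exemplar_size):
    """Output length of a valid cross-correlation."""
    return search_size - exemplar_size + 1


conv1_exemplar = conv_out(127, kernel=7, stride=2)  # assumed 127-pixel input
conv1_search = conv_out(255, kernel=7, stride=2)    # assumed 255-pixel input

# A 3x3 convolution with dilation 2 covers a 5x5 window; with stride 1 and
# padding 2 it leaves the spatial size unchanged.
dilated_same = conv_out(15, kernel=3, stride=1, padding=2, dilation=2)

# Assumed feature sizes for exemplar (15) and search (31) after the backbone.
response_size = xcorr_out(31, 15)
```

Setting the stride to 1 and compensating with dilation is what keeps the resolution of the deeper stage unchanged while still enlarging the receptive field.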
The conv5 block in both variants contains a normalisation layer and ReLU non-linearity while conv6 only consists of a convolutional layer.
|block||exemplar output size||search output size||backbone|
|conv1||61×61||125×125||7×7, 64, stride 2|
3×3 max pool, stride 2
|conv5||1×1, 256||1×1, 256||1×1, 256|
|conv6||1×1, 2||1×1, 4||1×1, (63×63)|
|conv5||1×1, 256||1×1, 256|
|conv6||1×1, 1||1×1, (63×63)|
Mask refinement module. With the aim of producing a more accurate object mask, we follow the strategy of [SharpMask], which merges low and high resolution features using multiple refinement modules made of upsampling layers and skip connections. Figure 9 illustrates how a mask is generated with stacked refinement modules. Figure 7 gives an example of a refinement module.
Appendix B Further qualitative results
Different masks at different locations. Our model generates a mask for each RoW. During inference, we rely on the score branch to select the final output mask (using the location attaining the maximum score). The example of Figure 8 illustrates the multiple output masks produced by the mask branch, each corresponding to a different RoW.