SiamMask: A Framework for Fast Online Object Tracking and Segmentation

In this paper we introduce SiamMask, a framework to perform both visual object tracking and video object segmentation, in real-time, with the same simple method. We improve the offline training procedure of popular fully-convolutional Siamese approaches by augmenting their losses with a binary segmentation task. Once the offline training is completed, SiamMask only requires a single bounding box for initialization and can simultaneously carry out visual object tracking and segmentation at high frame-rates. Moreover, we show that it is possible to extend the framework to handle multiple object tracking and segmentation by simply re-using the multi-task model in a cascaded fashion. Experimental results show that our approach has high processing efficiency, at around 55 frames per second. It yields real-time state-of-the-art results on visual-object tracking benchmarks, while at the same time demonstrating competitive performance at a high speed for video object segmentation benchmarks.


1 Introduction

Tracking is a fundamental task in any video application requiring some degree of reasoning about objects of interest, as it makes it possible to establish object correspondences between frames. It finds use in a wide range of scenarios such as automatic surveillance, vehicle navigation, video labelling, human-computer interaction and activity recognition. Given the location of an arbitrary target of interest in the first frame of a video, the aim of visual object tracking is to estimate its position in all the subsequent frames with the best possible accuracy [79]. For many applications, it is important that tracking can be performed online, i.e. while the video is streaming, which implies that the tracker should not make use of future frames to reason about the current position of the object [42]. This is the scenario portrayed by visual-object tracking benchmarks, which represent the target object with a simple axis-aligned [48, 59, 60, 82] or rotated [43, 42] bounding box. Such a simple annotation helps to keep the cost of data labelling low; what is more, it allows a user to perform a quick and simple initialization of the target. However, in the presence of complex movements and non-rigid deformations, bounding boxes are a very poor approximation of an object's contour, which can cause the erroneous inclusion of background pixels in the representation.

Similar to object tracking, the task of video object segmentation (VOS) requires estimating the position of an arbitrary target specified in the first frame of a video. However, in this case the object representation consists of a binary segmentation mask expressing whether or not a pixel belongs to the target [67]. Such a detailed representation is more desirable for applications that require pixel-level information, like video editing [69] and rotoscoping [58]. Understandably, producing pixel-wise masks requires more computational resources than a simple bounding box. As a consequence, VOS methods have traditionally been slow, often requiring several seconds per frame (e.g. [89, 81, 66, 1]). Recently, there has been a surge of interest in faster approaches [97, 57, 65, 12, 10, 37, 35]. However, even the fastest have not been able to operate in real time.

Fig. 1: Our method addresses both tasks of visual tracking and video object segmentation to achieve high practical convenience. Like conventional object trackers such as [18] (red), it relies on a simple bounding box initialization (blue) and operates online. However, SiamMask (green) is able to produce binary segmentation masks (out of which we can infer rotated bounding-boxes) that much more accurately describe the target object.

We aim at narrowing the gap between arbitrary object tracking and VOS by proposing SiamMask, a simple multi-task learning approach that can be used to address both problems. Our method is motivated by the success of fast tracking approaches based on fully-convolutional Siamese networks [3] trained offline on millions of pairs of video frames (e.g. [45, 106, 30, 98]) and by the recent availability of YouTube-VOS [91], a large video dataset with pixel-wise annotations. We aim at retaining both the offline trainability and online speed of Siamese approaches, while at the same time significantly refining their representation of the target object, which is limited to a simple axis-aligned bounding box. To achieve this goal, we simultaneously train a fully-convolutional Siamese network on three tasks, each corresponding to a different strategy to establish correspondences between the target object and candidate regions in the new frames. As in the work of Bertinetto et al. [3], one task is to learn a measure of similarity between the target object and multiple candidates in a sliding window fashion. The output is a dense response map which only indicates the location of the object, without providing any information about its spatial extent. To refine this information, we simultaneously learn two further tasks: bounding box regression using a Region Proposal Network [75, 45] and class-agnostic binary segmentation [64]. Notably, the segmentation binary labels are only required during offline training to compute the segmentation loss, and not online during segmentation/tracking. In our proposed architecture, each task is represented by a different branch departing from a shared CNN and contributes towards a final loss, which sums the three outputs together.

Once trained, SiamMask solely relies on a single bounding box initialisation, operates online without updates and produces object segmentation masks and rotated bounding boxes at 55 frames per second (on a single consumer-grade GPU). Despite its simplicity and speed, SiamMask establishes a new state-of-the-art for the problem of real-time object tracking. Moreover, the same method is also very competitive against VOS approaches (on multiple benchmarks), while being the fastest by a large margin. This result is achieved with a simple bounding box initialisation (as opposed to a mask) and without adopting costly techniques often used by VOS approaches, such as fine-tuning [56, 66, 1, 84], data augmentation [40, 47] and optical flow [81, 1, 66, 47, 12].

We further extend our multi-task framework to the problem of multiple object tracking and segmentation by adopting a trained SiamMask model within a two-stage cascade strategy. Using multiple instances of SiamMask (one per target object), the first stage identifies a crop where the target object is likely to be located, and the second extracts an accurate pixel-wise mask. As is common in multiple-object tracking, the data association problem (where new target objects are mapped to existing tracks) is addressed with the Hungarian algorithm. This overall strategy is fairly effective and, despite its simplicity, it achieved second place in the YouTube-VIS challenge [96].

The rest of this paper is organized as follows. Section 2 briefly reviews related work on visual object tracking and video object segmentation. Section 3 presents an overview of fully-convolutional Siamese networks, which form the basis of our work. Section 4 describes our proposed approach for tracking and segmentation, while Section 5 explains how it can be extended to address the problem of multiple object tracking and segmentation. Section 6 reports quantitative and qualitative experimental results for all the tasks considered on several popular benchmarks. Finally, Section 7 concludes the paper.

2 Related Work

To provide context for our work, we briefly discuss some of the most representative developments in visual-object tracking and video-object segmentation of the last few years.

2.1 Visual object tracking

Until recently, the most popular paradigm for tracking arbitrary objects was to train a discriminative classifier from the ground-truth information provided in the first frame of a video, and then to update it online [79]. One particularly popular and effective strategy was the use of the Correlation Filter [6], a simple algorithm that makes it possible to discriminate between the template of an arbitrary target and its 2D translations at high speed thanks to its formulation in the Fourier domain. Since the pioneering work of Bolme et al., the performance of Correlation Filter-based trackers has been notably improved with the adoption of multi-channel formulations [25, 34], spatial constraints [41, 19, 55, 46] and deep features (e.g. [18, 83]).

In 2016, a different paradigm started gaining popularity [3, 33, 80]. Instead of learning a discriminative classifier online, these methods train offline a similarity function on pairs of video frames. At test time, this function can simply be evaluated on a new video, once per frame. In particular, evolutions of the fully-convolutional Siamese approach [3] considerably improved tracking performance by making use of region proposals [45], hard negative mining [106], ensembling [30, 29], memory networks [98], multiple-stage cascaded regression [85, 22] and anchor-free mechanisms [92, 27, 11]. More recent developments improved the fully-convolutional framework under several different aspects. Yang et al. [99] recur to meta-learning to iteratively adjust the network's parameters during tracking via the use of an offline-trained recurrent neural network. Cheng et al. [14] focus on the challenging issue brought by "distractors" during tracking by explicitly learning a "relation detector" to discriminate them from the background. Guo et al. [26] propose to enhance the Siamese network with a graph attention mechanism to establish correspondences between the features of the target and the search area. Zhou et al. [105] focus on mining the most salient regions of the tracked objects to increase the discriminative power of the trained model. Yan et al. [93] use Neural Architecture Search to prune the large space of Siamese-based architectures and find the best-performing one, or the most efficient in terms of FLOPs. Furthermore, several works [38, 49, 61, 94, 28] are concerned with the negative implications of adversarial attacks on tracking systems, and address them with several techniques from the robustness literature (e.g. adversarial training).

A recently popular trend is that of using self-supervised approaches for visual object tracking. Wang et al. [86] propose a Siamese Correlation Filter-based network trained using pseudo-labels obtained by running a tracker back and forth on a video to obtain stable trajectories. Yuan et al. [101] exploit cycle-consistency for representation learning. Zheng et al. [104] learn a Siamese network in an unsupervised way by mining moving objects in video via optical flow. Similarly, Sio et al. [78] manage to learn a Siamese network in an unsupervised way by extracting both the exemplar and the search image from the same frame, while employing data augmentation to prevent the problem from becoming trivial. Finally, Wu et al. [90] learn a foreground/background discriminator in an unsupervised way using contrastive learning, which is then used to discover corresponding patches throughout a video to learn the tracking model.

The above trackers use a rectangular bounding box both to initialise the target and to estimate its position in the subsequent frames. Despite its convenience, a simple rectangle often fails to properly represent an object, as is evident in the examples of Figure 1. This motivated us to propose a tracker able to produce binary segmentation masks while still only relying on a quick-to-draw bounding box initialization.

2.2 Video object segmentation (VOS)

Benchmarks for arbitrary object tracking (e.g. [79, 44]) assume that trackers receive input frames in a sequential fashion. This aspect is generally referred to with the attributes online or causal [44]. Moreover, methods are often focused on achieving a speed that exceeds typical video frame rates (around 25 to 30 frames per second). Conversely, VOS algorithms have traditionally been more concerned with an accurate representation of the object of interest [69, 67].

In order to exploit consistency between video frames, several methods propagate the supervisory segmentation mask of the first frame to the temporally adjacent ones via graph labeling approaches (e.g. [89, 68, 81, 57, 1, 103]). In particular, Bao et al. [1] recently proposed a very accurate method that makes use of a spatio-temporal MRF in which temporal dependencies are modelled by optical flow, while spatial dependencies are expressed by a CNN. Another popular strategy is to process video frames independently (e.g. [56, 66, 84]), similarly to what happens in most tracking approaches. For example, in OSVOS-S Maninis et al. [56] do not make use of any temporal information. They rely on a fully-convolutional network pre-trained for classification and then, at test time, they fine-tune it using the ground-truth mask provided in the first frame. MaskTrack [66] instead is trained from scratch on individual images, but it does exploit some form of temporality at test time by using the latest mask prediction and optical flow as additional input to the network.

Aiming towards the highest possible accuracy, at test time VOS methods often feature computationally intensive techniques such as fine-tuning [56, 66, 1, 84], data augmentation [40, 47] and optical flow [81, 1, 66, 47, 12]. Therefore, these approaches are generally characterized by low frame rates and the inability to operate online. For example, it is not uncommon for methods to require minutes [66, 13] or even hours [81, 1] for videos that are just a few seconds long, like the ones of the DAVIS benchmark [67]. Recently, there has been an increasing interest in the VOS community towards faster methods [57, 65, 12, 10, 37, 35]. Two notable fast approaches with a performance competitive with the state of the art are OSMN [97] and RGMP [65]. The former uses a meta-network “modulator” to quickly adapt the parameters of a segmentation network during test time, while the latter does not use any fine-tuning and adopts an encoder-decoder Siamese architecture trained in multiple stages. Both these methods run at less than 10 frames per second, which does not make them suitable for real-time applications.

2.3 Tracking and segmentation

Interestingly, in the past it was not uncommon for online trackers to produce a very coarse binary mask of the target object (e.g. [16, 70, 5, 73]). In modern times, faster and online-operating trackers have typically used rectangular bounding boxes to represent the target object, while to be able to produce accurate masks researchers have often forgone speed and online operability, as we saw in the previous section.

A few notable exceptions exist, some of which are very recent and were published after the conference version of this paper. Yeo et al. [100] proposed a superpixel-based tracker that is able to operate online and produce binary masks for the object starting from a bounding box initialization. However, the fastest variant of this tracker runs at 4 frames per second and, when CNN features are used, its speed decreases by a factor of 40. Perazzi et al. [66] and Ci et al. [15] propose video object segmentation methods which, like ours, can be initialized using a simple axis-aligned rectangle in the first frame while also outputting a mask at each frame. Yan et al. [95] adopt a pixel-wise correlation layer and an auxiliary mask head to improve tracking performance; however, their method requires online learning of the network parameters, which restricts its practical applications. Lukezic et al. [54] propose an approach that handles both tracking and segmentation by encoding the target object with two discriminative models capturing complementary properties: one is adaptive but only considers Euclidean motions, while the other accounts for non-rigid transformations. These methods require online learning during tracking, which can impact their speed.

3 Fully-convolutional Siamese Networks

To allow online operability and fast speed, we adopt the fully-convolutional Siamese framework [3], considering both SiamFC [3, 83] and SiamRPN [45] as starting points. We first introduce them in Sections 3.1 and 3.2 and then describe our approach in Section 4.

3.1 SiamFC

Bertinetto et al. [3] proposed to use, as a fundamental building block of a tracking system, an offline-trained fully-convolutional Siamese network that compares an exemplar image z against a larger search image x to obtain a dense response map (where the location with the highest response can be used to infer the location of the exemplar in the search image). z and x are, respectively, a crop centered on the target object and a larger crop centered on the last estimated position of the target. The two inputs are processed by the same CNN $f_\theta$, yielding two feature maps that are cross-correlated:

$$g_\theta(z, x) = f_\theta(z) \star f_\theta(x). \qquad (1)$$

In this paper, we refer to each spatial element of the response map as the response of a candidate window (RoW): the $n$-th RoW, $g_\theta^n(z, x)$, encodes the similarity between the exemplar z and the $n$-th candidate window in x. For SiamFC, the goal is for the maximum value of the response map to correspond to the target location in the search area x. In order to allow each RoW to encode richer information about the target object, as we will see later, we replace the simple cross-correlation of Eq. 1 with a depth-wise cross-correlation [2, 39] and produce a multi-channel response map.
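To make the two correlation operators concrete, the following is a minimal PyTorch sketch (not the released SiamMask code): both the plain cross-correlation of Eq. 1 and its depth-wise variant can be implemented with grouped convolutions, treating the exemplar features as convolution kernels. The tensor shapes are arbitrary stand-ins for $f_\theta(z)$ and $f_\theta(x)$.

```python
import torch
import torch.nn.functional as F

def xcorr(z_feat, x_feat):
    """Plain cross-correlation (cf. Eq. 1): one response channel per sample."""
    b, c, h, w = z_feat.shape
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    out = F.conv2d(x, z_feat, groups=b)            # (1, b, H', W')
    return out.reshape(b, 1, out.size(2), out.size(3))

def xcorr_depthwise(z_feat, x_feat):
    """Depth-wise cross-correlation: one response channel per feature channel."""
    b, c, h, w = z_feat.shape
    kernel = z_feat.reshape(b * c, 1, h, w)
    x = x_feat.reshape(1, b * c, x_feat.size(2), x_feat.size(3))
    out = F.conv2d(x, kernel, groups=b * c)        # (1, b*c, H', W')
    return out.reshape(b, c, out.size(2), out.size(3))

z = torch.randn(2, 256, 7, 7)       # stand-in for f_theta(z)
x = torch.randn(2, 256, 31, 31)     # stand-in for f_theta(x)
print(xcorr(z, x).shape)            # torch.Size([2, 1, 25, 25])
print(xcorr_depthwise(z, x).shape)  # torch.Size([2, 256, 25, 25])
```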

SiamFC is trained offline on millions of video frames with the logistic loss [3, Section 2.2]. Let $y_n \in \{-1, +1\}$ be the ground-truth label for the RoW at position $n$ in the grid $\mathcal{D}$ of the response map, and let $s_n$ denote the corresponding response value. The logistic loss is defined as:

$$\mathcal{L}_{sim}(\theta) = \frac{1}{|\mathcal{D}|} \sum_{n \in \mathcal{D}} \log\left(1 + e^{-y_n s_n}\right). \qquad (2)$$
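As an illustration only, a compact sketch of this per-position logistic loss (names are placeholders; `scores` is a response map and `labels` contains values in {-1, +1}):

```python
import torch
import torch.nn.functional as F

def logistic_loss(scores, labels):
    # softplus(-y * s) == log(1 + exp(-y * s)), averaged over the response map.
    return F.softplus(-labels * scores).mean()

scores = torch.randn(17, 17)                      # toy response map
labels = torch.where(torch.rand(17, 17) > 0.9,
                     torch.tensor(1.0), torch.tensor(-1.0))
print(logistic_loss(scores, labels))
```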

3.2 SiamRPN

Li et al. [45] considerably improved the tracking accuracy of SiamFC by relying on a region proposal network (RPN) [75, 23], which makes it possible to estimate the target location with a bounding box of variable aspect ratio. In particular, in SiamRPN each RoW encodes a set of anchor-box proposals and corresponding object/background scores. Therefore, SiamRPN is able to output box predictions (with a regression branch) in parallel with object/background classification scores.

Using the nomenclature and formulation of [45], let us assume a total of $k$ anchor boxes. Convolutional layers are used to obtain the two features $[f_\theta(z)]_{cls}$ (for the classification branch) and $[f_\theta(z)]_{reg}$ (for the regression branch) from the feature map $f_\theta(z)$, and analogously $[f_\theta(x)]_{cls}$ and $[f_\theta(x)]_{reg}$ from $f_\theta(x)$. For the exemplar, the number of channels depends on the number of anchors and increases, respectively, by $2k$ and $4k$ times w.r.t. $f_\theta(z)$. The correlation between z and x on the classification branch is obtained by

$$A^{cls}_{w \times h \times 2k} = [f_\theta(x)]_{cls} \star [f_\theta(z)]_{cls}, \qquad (3)$$

while the correlation on the regression branch by

$$A^{reg}_{w \times h \times 4k} = [f_\theta(x)]_{reg} \star [f_\theta(z)]_{reg}. \qquad (4)$$

This way, each spatial location in $A^{cls}$ and $A^{reg}$ has a "depth" of $2k$ and $4k$ channels respectively. In other words, for each anchor from the RPN module, the network produces two multi-channel response maps:

  • A two-channel output for object/background classification, whose "scores" correspond to the respective locations in the original response map.

  • A four-channel output for bounding-box regression, representing the center distance and the width and height differences between the anchor and the corresponding ground truth.

Offline, the classification branch is trained using the cross-entropy loss [45]:

$$\mathcal{L}_{score}(\theta) = -\frac{1}{k\,|\mathcal{D}|} \sum_{n \in \mathcal{D}} \sum_{j=1}^{k} \left[ \frac{1 + y_{n,j}}{2} \log p_{n,j} + \frac{1 - y_{n,j}}{2} \log\left(1 - p_{n,j}\right) \right], \qquad (5)$$

where $p_{n,j}$ is the output of the classification branch for the $j$-th anchor of the $n$-th RoW, while the regression branch is trained using the smooth $L_1$ loss with normalized coordinates. Let $A_x$, $A_y$, $A_w$, and $A_h$ denote the coordinates of the center point and the width and height of an anchor box. Let $T_x$, $T_y$, $T_w$, and $T_h$ denote the central coordinates, width, and height of the ground-truth box. The normalized distance between an anchor and the ground-truth box is defined as

$$\delta[0] = \frac{T_x - A_x}{A_w}, \quad \delta[1] = \frac{T_y - A_y}{A_h}, \quad \delta[2] = \ln\frac{T_w}{A_w}, \quad \delta[3] = \ln\frac{T_h}{A_h}. \qquad (6)$$

The smooth $L_1$ loss for a residual $x$ is

$$\mathrm{smooth}_{L_1}(x, \sigma) = \begin{cases} 0.5\,\sigma^2 x^2, & |x| < \dfrac{1}{\sigma^2} \\ |x| - \dfrac{1}{2\sigma^2}, & \text{otherwise,} \end{cases} \qquad (7)$$

where $\sigma$ is a hyperparameter that needs tuning. Let $\hat{\delta}_{n,j}[0], \ldots, \hat{\delta}_{n,j}[3]$ be the outputs of the four channels of the regression branch for the $j$-th anchor of the $n$-th RoW. The loss for the anchor regression is defined as:

$$\mathcal{L}_{box}(\theta) = \sum_{n \in \mathcal{D}} \sum_{j=1}^{k} \frac{1 + y_{n,j}}{2} \sum_{i=0}^{3} \mathrm{smooth}_{L_1}\!\left(\hat{\delta}_{n,j}[i] - \delta_{n,j}[i],\, \sigma\right), \qquad (8)$$

where $y_{n,j}$ is the ground-truth label for the $j$-th anchor of the $n$-th RoW.
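The snippet below is a hedged, plain-Python sketch of the quantities above: the normalized offsets of Eq. 6 between an anchor and a ground-truth box, and the smooth $L_1$ penalty of Eq. 7 accumulated over the four coordinates. The anchor/ground-truth values and the default $\sigma$ are purely illustrative.

```python
import math

def box_offsets(anchor, gt):
    """Normalized offsets (Eq. 6) between an anchor and a ground-truth box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = gt
    return [(tx - ax) / aw,        # normalized center-x distance
            (ty - ay) / ah,        # normalized center-y distance
            math.log(tw / aw),     # log width ratio
            math.log(th / ah)]     # log height ratio

def smooth_l1(x, sigma):
    """Smooth L1 (Eq. 7): quadratic for small residuals, linear elsewhere."""
    if abs(x) < 1.0 / sigma ** 2:
        return 0.5 * sigma ** 2 * x ** 2
    return abs(x) - 0.5 / sigma ** 2

def box_regression_loss(pred, target, sigma=3.0):   # sigma value is illustrative
    return sum(smooth_l1(p - t, sigma) for p, t in zip(pred, target))

target = box_offsets(anchor=(50, 50, 40, 80), gt=(54, 48, 44, 76))
pred = [0.05, -0.02, 0.10, -0.04]                   # toy regression outputs
print(box_regression_loss(pred, target))
```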

4 SiamMask

Unlike existing tracking methods that rely on low-fidelity object representations, we argue for the importance of producing per-frame binary segmentation masks. To this aim we show that, besides similarity scores and bounding box coordinates, it is possible for the RoW of a fully-convolutional Siamese network to also encode the information necessary to produce a pixel-wise binary mask. This can be achieved by extending existing Siamese trackers with an extra branch and loss. In the following subsections, we describe the multi-branch network architecture (4.1), the strategy to obtain and refine a mask representation (4.2), the loss function (4.3), and how we obtain a bounding box from a mask (4.4).

4.1 Multi-branch network architecture

We augment the architectures of SiamFC [3] and SiamRPN [45] by adding our segmentation branch, obtaining the two-branch and three-branch variants of the proposed Siamese mask network (SiamMask), illustrated in Fig. 2 and Fig. 3. In the two-branch variant, the score branch is tasked with discriminating each RoW between target object and background, while the mask branch outputs one segmentation mask per RoW. In addition to these, the three-branch variant also employs a box-regression branch like in SiamRPN.

Fig. 2: Schematic illustration of the two-branch variant of SiamMask, based on [3]. $\star_d$ denotes depth-wise cross-correlation.

Importantly, in order to allow each RoW to encode richer information about the target object, we replace the simple cross-correlation "$\star$" of Eq. 1 with depth-wise cross-correlation (see e.g. [2]) "$\star_d$" and produce multi-channel response maps. The output of the depth-wise cross-correlation at each spatial location is therefore a vector with multiple channels (in the illustrative example of Fig. 2 it has size $1 \times 1 \times 256$). Note that, for the classification branch, the multi-channel output is mapped to a single-channel response map by a $1 \times 1$ convolutional layer; the logistic loss of Eq. 2 is then computed on this single-channel response.

In the segmentation branch, we predict binary masks (one for each RoW) using a simple two-layer neural network $h_\phi$ with learnable parameters $\phi$:

$$m_n = h_\phi\left(g_\theta^n(z, x)\right). \qquad (9)$$

From Eq. 9 we can see that the mask prediction $m_n$ is a function of both the search area x and the exemplar image z. This way, z can be used as a guide to "prime" the segmentation: given a different reference image, the network will produce a different segmentation mask, as we will see in the experimental section.
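For illustration, the following PyTorch sketch shows one way such a two-layer head can be realised with $1 \times 1$ convolutions, mapping every spatial position (RoW) of the correlated feature map to a flattened mask; the 256 input channels and the 63x63 mask size are assumptions chosen for the example, not prescriptions.

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """Two-layer 1x1-convolutional head: one flattened mask per RoW."""
    def __init__(self, in_channels=256, hidden=256, mask_size=63):
        super().__init__()
        self.mask_size = mask_size
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, mask_size * mask_size, kernel_size=1),
        )

    def forward(self, row_features):
        # row_features: (B, C, H, W); every spatial position is one RoW.
        logits = self.net(row_features)              # (B, mask_size^2, H, W)
        b, _, h, w = logits.shape
        return logits.view(b, self.mask_size, self.mask_size, h, w)

features = torch.randn(1, 256, 17, 17)   # toy grid of RoWs
masks = MaskHead()(features)
print(masks.shape)                        # torch.Size([1, 63, 63, 17, 17])
```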

Fig. 3: Schematic illustration of the three-branch variant of SiamMask, based on [45].

4.2 Mask representation and refinement

Fig. 4: Schematic illustration of the stacked refinement modules. For a more detailed version of refinement modules, see Fig. 5.

In contrast to semantic segmentation methods in the style of FCN [52] and Mask R-CNN [31], which maintain explicit spatial information throughout the network, our approach follows the spirit of DeepMask [64] and SharpMask [71] and generates masks starting from a flattened representation of the object. In particular, in our case this representation corresponds to one of the RoWs produced by the depth-wise cross-correlation between the exemplar and search features. Importantly, the network $h_\phi$ of the segmentation task is composed of two convolutional layers, one with 256 and the other with $63^2$ channels (Figure 2). This allows every pixel classifier to utilise information contained in the entire RoW and thus to have a complete view of its corresponding candidate window in x. With the aim of producing a more accurate object mask, we follow the strategy of [71], which combines low and high resolution features using multiple refinement modules made of upsampling layers and skip connections.

Fig. 4 represents a more detailed illustration of our architecture, which explicitly shows the stack of refinement modules for generating the final mask. Exemplar and search image are processed by the same network, and their features are (depth-wise) cross-correlated to obtain the RoW features $g_\theta(z, x)$. Deconvolution is carried out on each RoW to obtain its segmentation representation at a relatively low resolution. In the first refinement module, this mask representation is combined with the feature map extracted from the third layer of the Siamese network for the corresponding candidate window in x to obtain, by upsampling, a mask representation with a higher resolution. Analogously, the two subsequent modules combine the current mask representation with the feature maps extracted from the second and first layers, producing representations with increasingly higher resolutions. Besides obtaining higher-resolution feature maps, this procedure makes it possible to use complementary information from layers at different "depths".

Fig. 5 shows the structure of one refinement module as an example. The incoming mask representation is used to obtain a new mask representation A via two convolutional layers and a non-linear layer. The feature map from the backbone is then used to output a new feature B with the same size as A via three convolutional layers and two non-linear layers. The sum of A and B produces a new mask C. Finally, one last non-linear layer is used to produce the new, upscaled mask representation that is passed to the next module.
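The sketch below mirrors this description (two convolutions and a non-linearity on the mask path, three convolutions and two non-linearities on the feature path, element-wise sum, then a final non-linearity and 2x upsampling); channel sizes are assumptions chosen for the example, not the values used in our experiments.

```python
import torch
import torch.nn as nn

class RefinementModule(nn.Module):
    def __init__(self, mask_channels, skip_channels, out_channels):
        super().__init__()
        # Path producing A from the incoming (coarse) mask representation.
        self.mask_path = nn.Sequential(
            nn.Conv2d(mask_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        # Path producing B from the skip feature of an earlier backbone layer.
        self.skip_path = nn.Sequential(
            nn.Conv2d(skip_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
        )
        # Final non-linearity and 2x upsampling of C = A + B.
        self.post = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, mask_repr, skip_feat):
        c = self.mask_path(mask_repr) + self.skip_path(skip_feat)
        return self.post(c)

m = torch.randn(1, 32, 15, 15)      # coarse mask representation
f = torch.randn(1, 512, 15, 15)     # backbone feature at the same resolution
print(RefinementModule(32, 512, 32)(m, f).shape)  # torch.Size([1, 32, 30, 30])
```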

Fig. 5: Schematic illustration of one of the three refinement modules.

4.3 Loss function

We define the loss function for the segmentation branch and combine it with the loss functions of the other branches.

Loss for the segmentation branch. During training, each RoW is labelled with a ground-truth binary label $y_n \in \{-1, +1\}$ and also associated with a pixel-wise ground-truth mask $c_n$ of size $w \times h$. Let $c_n^{ij} \in \{-1, +1\}$ denote the label corresponding to pixel $(i, j)$ of the object mask in the $n$-th candidate RoW. The loss function $\mathcal{L}_{mask}$ (Eq. 10) for the mask prediction task is a binary logistic regression loss over all RoWs:

$$\mathcal{L}_{mask}(\theta, \phi) = \sum_{n} \left( \frac{1 + y_n}{2wh} \sum_{ij} \log\left(1 + e^{-c_n^{ij} m_n^{ij}}\right) \right), \qquad (10)$$

where $m_n^{ij}$ is an element of the mask $m_n$ defined in Eq. 9. Thus, the classification layer of $h_\phi$ consists of $w \times h$ classifiers, each indicating whether a given pixel belongs to the object in the candidate window or not. Note that $\mathcal{L}_{mask}$ is considered only for positive RoWs (i.e. with $y_n = 1$): given the high number of negative samples, considering them would unbalance the loss. We experimented with a weighted loss, so that negative and positive samples would hold the same importance, but it showed worse results than only considering the positives.
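A rough sketch of this loss in PyTorch, accumulating the per-pixel logistic term only over positive RoWs; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, mask_targets, row_labels):
    """
    mask_logits:  (N, H*W) predicted mask logits, one row per RoW
    mask_targets: (N, H*W) ground-truth pixel labels in {-1, +1}
    row_labels:   (N,)     RoW labels y_n in {-1, +1}
    """
    pos = row_labels == 1
    if pos.sum() == 0:
        return mask_logits.sum() * 0.0             # no positive RoW in the batch
    # softplus(-c * m) == log(1 + exp(-c * m)), averaged over pixels and RoWs.
    return F.softplus(-mask_targets[pos] * mask_logits[pos]).mean()

logits = torch.randn(5, 63 * 63)
targets = torch.where(torch.rand(5, 63 * 63) > 0.5,
                      torch.tensor(1.0), torch.tensor(-1.0))
labels = torch.tensor([1, -1, 1, -1, -1])
print(mask_loss(logits, targets, labels))
```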

Multiple task loss. For our experiments, we augment the architectures of SiamFC [3] and SiamRPN [45] with our segmentation branch and the loss $\mathcal{L}_{mask}$, obtaining what we call the two-branch and three-branch variants of SiamMask. These respectively optimise the multi-task losses $\mathcal{L}_{2B}$ and $\mathcal{L}_{3B}$, defined as:

$$\mathcal{L}_{2B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{sim}, \qquad (11)$$
$$\mathcal{L}_{3B} = \lambda_1 \cdot \mathcal{L}_{mask} + \lambda_2 \cdot \mathcal{L}_{score} + \lambda_3 \cdot \mathcal{L}_{box}. \qquad (12)$$

We did not search over the hyperparameters of Eq. 11 and Eq. 12 and simply set $\lambda_1 = 32$ like in [64] and $\lambda_2 = \lambda_3 = 1$. The task-specific branches for the box and score outputs are constituted by two convolutional layers.

4.4 Box generation

Note that, while video object segmentation benchmarks require binary masks, typical tracking benchmarks such as VOT [44] require a bounding box as final representation of the target object. We consider three different strategies to generate a bounding box from a binary mask, represented in Fig. 6: (1) axis-aligned bounding rectangle (Min-max), (2) rotated minimum bounding rectangle (MBR) and (3) the optimisation strategy used for the automatic bounding box generation proposed in the VOT benchmarks [44, 42] (Opt). We empirically evaluate these alternatives in Section 6 (Table IV).

Fig. 6: In order to generate a bounding box from a binary mask (in yellow), we experiment with three different methods. Min-max: the axis-aligned rectangle containing the object (red); MBR: the minimum bounding rectangle (green); Opt: the rectangle obtained via the optimisation strategy proposed in VOT-2016 [44, 42] (blue).
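As an illustration of the first two strategies, the hedged OpenCV sketch below extracts an axis-aligned (Min-max) and a rotated minimum-area (MBR) rectangle from a binary mask; the Opt strategy of [44, 42] is not reproduced here.

```python
import cv2
import numpy as np

def boxes_from_mask(mask):
    """mask: binary (H, W) uint8 array with foreground pixels set to 1."""
    ys, xs = np.nonzero(mask)
    # Min-max: axis-aligned rectangle containing every foreground pixel.
    min_max = (xs.min(), ys.min(), xs.max(), ys.max())
    # MBR: rotated minimum-area rectangle around the foreground pixels.
    points = np.stack([xs, ys], axis=1).astype(np.float32)
    rect = cv2.minAreaRect(points)                 # ((cx, cy), (w, h), angle)
    mbr_corners = cv2.boxPoints(rect)              # 4 corner points
    return min_max, mbr_corners

mask = np.zeros((100, 100), dtype=np.uint8)
cv2.ellipse(mask, (50, 50), (30, 15), 30, 0, 360, 1, -1)   # toy rotated blob
aligned, rotated = boxes_from_mask(mask)
print(aligned)
print(rotated)
```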

4.5 Training and testing

During training, pairs of exemplar and search image samples are input to the network, and the predicted masks, scores, and boxes are used to optimize the multi-task loss (Eq. 11 or 12). Positive and negative samples are defined differently in the two- and three-branch versions of SiamMask. For $\mathcal{L}_{3B}$, a RoW is considered positive ($y_n = 1$) if one of its anchor boxes has an intersection-over-union (IOU) with the ground-truth bounding box of at least 0.6, and negative ($y_n = -1$) otherwise. For $\mathcal{L}_{2B}$, we adopt a strategy similar to the one of [3] to define positive and negative samples: a RoW is considered a positive sample if the distance between the center of the prediction and the center of the ground-truth is below 16 pixels (in feature space), and negative otherwise.
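The two labelling rules can be summarised by the small sketch below (thresholds taken from the text: IoU of at least 0.6 for the three-branch variant, a centre distance of at most 16 units for the two-branch variant; helper names are illustrative).

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def label_row_three_branch(anchor_boxes, gt_box, iou_thr=0.6):
    """Positive if any of the RoW's anchors overlaps the ground truth enough."""
    return 1 if any(iou(a, gt_box) >= iou_thr for a in anchor_boxes) else -1

def label_row_two_branch(row_center, gt_center, max_dist=16.0):
    """Positive if the RoW centre is close enough to the ground-truth centre."""
    dx, dy = row_center[0] - gt_center[0], row_center[1] - gt_center[1]
    return 1 if (dx * dx + dy * dy) ** 0.5 <= max_dist else -1

print(label_row_three_branch([(10, 10, 50, 50)], (12, 12, 52, 52)))  # 1
print(label_row_two_branch((100, 100), (110, 112)))                  # 1
```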

During online tracking, the offline-trained SiamMask is used to produce masks and boxes for every input frame, with no further adaptation of the network’s parameters and a simple axis-aligned bounding-box initialization. In both variants, the output mask is selected using the location that achieves the maximum score in the classification branch.

5 Multiple Object Tracking and Segmentation

Fig. 7: The two-stage version of SiamMask used for each object in the multiple object tracking and segmentation problem.

We extend the application of the proposed SiamMask to segmentation-based multiple object tracking [87], which requires segmenting and tracking an arbitrary number of objects throughout the video [102]. Compared to the single-object tracking problem, it presents the added difficulty of having to disambiguate between object instances. This means that, across frames, each object needs to be labelled with the correct identity, which becomes particularly challenging in crowded scenes.

To address this problem, we use a pre-trained segmentation-based object detection algorithm to initialize individual tracks, and then for each object we apply SiamMask twice in a "cascaded fashion". More specifically, for each new frame an off-the-shelf segmentation-based object detector [8] is used to obtain a set of candidate masks $\mathcal{D}$. Given the existing trajectories, SiamMask is used to produce a set of masks $\mathcal{M}$. Then, identity association across frames is obtained by solving the assignment between $\mathcal{M}$ and $\mathcal{D}$ as an optimal transport problem. Let $a_{ij}$ be the pairwise affinity between the $i$-th mask $m_i \in \mathcal{M}$ and the $j$-th mask $d_j \in \mathcal{D}$, defined as the IOU (intersection over union) between the two masks. Let $x_{ij}$ represent the pairwise association between $m_i$ and $d_j$, which is 1 if they belong to the same object and 0 otherwise. The assignment between $\mathcal{M}$ and $\mathcal{D}$ is formulated as:

$$\max_{x} \sum_{i,j} a_{ij}\, x_{ij} \quad \text{s.t.} \quad \sum_{j} x_{ij} \le 1 \;\; \forall i, \qquad \sum_{i} x_{ij} \le 1 \;\; \forall j, \qquad x_{ij} \in \{0, 1\}. \qquad (13)$$

This ensures that each mask in $\mathcal{M}$ is associated with at most one mask in $\mathcal{D}$, and conversely that each mask in $\mathcal{D}$ is associated with at most one mask in $\mathcal{M}$. This constrained integer optimization problem can be solved with the Hungarian algorithm. By solving this assignment problem at each frame, object tracks are maintained throughout the video.
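For illustration, the association step can be sketched as below: pairwise mask IoU as the affinity and SciPy's Hungarian solver for the assignment of Eq. 13. The minimum-IoU gate and the handling of unmatched tracks/detections are simplifications, not the exact procedure used in our experiments.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def associate(track_masks, detection_masks, min_iou=0.1):   # gate is illustrative
    affinity = np.array([[mask_iou(t, d) for d in detection_masks]
                         for t in track_masks])
    rows, cols = linear_sum_assignment(-affinity)           # maximise total IoU
    return [(r, c) for r, c in zip(rows, cols) if affinity[r, c] >= min_iou]

# Toy example: two tracks and two detections on a 4x4 grid.
t0 = np.zeros((4, 4), dtype=bool); t0[:2, :2] = True
t1 = np.zeros((4, 4), dtype=bool); t1[2:, 2:] = True
d0 = np.zeros((4, 4), dtype=bool); d0[2:, 2:] = True
d1 = np.zeros((4, 4), dtype=bool); d1[:2, :2] = True
print(associate([t0, t1], [d0, d1]))                        # [(0, 1), (1, 0)]
```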

Although many existing visual object tracking algorithms and object segmentation algorithms can be used to construct object correspondences between adjacent frames, we found SiamMask to be a practical and effective choice. Its low computational cost allows us to deal with multiple objects, and the tracking score from the classification branch can be conveniently used to indicate the occlusion or disappearance of the target object.

For our experiments on multiple object tracking and segmentation we propose a cascaded version of SiamMask, as illustrated in Fig. 7. In a first stage, the regression branch of SiamMask predicts a coarse location for each object. In the second, a crop of the search image is extracted around the highest-scoring bounding box from the first stage and used to predict a refined mask with the segmentation branch. These masks are the new predictions associated with the current object trajectories. Then, IoUs between the masks predicted by SiamMask and the newly detected masks are computed. The assignment from Eq. 13 associates the detected masks with the existing trajectories, and it is also used to keep track of newly appeared or disappeared objects.

6 Experiments

In this experimental section, we first describe implementation details (6.1) and then evaluate our approach on three related but different problems: visual object tracking on VOT-2016, VOT-2018, GOT-10k and TrackingNet (6.2); video object segmentation on DAVIS-2016, DAVIS-2017 and YouTube-VOS (6.3); and multiple object tracking and segmentation on YouTube-VIS (6.4). We conclude the section with ablation studies (6.5) and qualitative examples from benchmark videos (6.6).

6.1 Implementation

Network architecture. For both the two-branch and three-branch variants of SiamMask, we use a ResNet-50 [32] architecture up to the final convolutional layer of the fourth stage as our backbone $f_\theta$. In order to obtain a higher spatial resolution in deeper layers, we reduce the output stride to 8 by using convolutions with stride 1. Moreover, we increase the receptive field by using dilated convolutions [9]. Table I outlines the structure of $f_\theta$, while Tables II and III show the architectures of the branches of the two variants of SiamMask. In our model, we add to the shared backbone a (not shared) "adjust layer" consisting of a $1 \times 1$ convolution with 256 output channels (for simplicity, this is omitted in Eq. 1). In the $3 \times 3$ convolutional layer of block conv4_1, the stride is 1 and the dilation rate is 2. The conv5 block in both variants contains a normalization layer and a ReLU non-linearity, while block conv6 only consists of a $1 \times 1$ convolutional layer. For the three-branch variant, the number of anchors is set to $k = 5$. Exemplar and search images share the network parameters from block conv1 to conv4_x, but do not share the parameters of the adjust layer. As a last step, depth-wise cross-correlation is carried out between the output feature maps from the individual adjust layers for the exemplar and search area, obtaining a response map of size $17 \times 17 \times 256$.

Block Exemplar output size Search output size Details
conv1_x , 64, stride 2
max pool, stride 2
conv2_x
conv3_x
conv4_x
adjust
xcorr depth-wise
TABLE I: Backbone architecture. The structure of each block is shown in square brackets
Block Score Mask
conv5
conv6
TABLE II: The architecture of the two-branch variant.
Block Score Box Mask
conv5
conv6
TABLE III: The architecture of the three-branch variant. is the number of anchors for each RoW.
Method mIoU mAP@0.5 IoU mAP@0.7 IoU
Oracle Fixed
Min-max/aligned
MBR
SiamFC [3]
SiamRPN [45]
SiamMask Min-max/aligned
MBR
Opt
TABLE IV: Accuracies for different bounding box representation strategies on VOT-2016.

Offline training settings. As in SiamFC [3], the exemplar and search image patches are of size 127x127 and 255x255 pixels respectively. Training samples originate from COCO [50], ImageNet-VID [76] and YouTube-VOS [91], and were augmented by randomly shifting and scaling the input patches: random scaling is within [0.95, 1.05] and [0.82, 1.18] for exemplar and search images respectively. The network backbone is pre-trained on the ILSVRC ImageNet classification task (1000 classes). We use SGD with an initial warmup phase in which the learning rate increases linearly for the first 5 epochs, and then decreases logarithmically for 15 more epochs.
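Purely as an illustration of this schedule shape (linear warmup for 5 epochs followed by a logarithmic decrease for 15 more), the sketch below generates per-epoch learning rates; the start, peak and final values are placeholders, since they are not specified here.

```python
import numpy as np

def lr_schedule(warmup_epochs=5, decay_epochs=15,
                lr_start=1e-3, lr_peak=5e-3, lr_end=5e-4):   # placeholder values
    warmup = np.linspace(lr_start, lr_peak, warmup_epochs)   # linear increase
    decay = np.logspace(np.log10(lr_peak), np.log10(lr_end), decay_epochs)
    return np.concatenate([warmup, decay])

for epoch, lr in enumerate(lr_schedule()):
    print(epoch, float(lr))
```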

Online inference settings. During tracking, SiamMask is simply evaluated once per frame, without any adaptation. In both our variants, we select the output mask using the location attaining the maximum score in the classification branch. Then, after having applied a per-pixel sigmoid, we binarise the output of the mask branch with a fixed threshold. In the two-branch variant, for each video frame after the first one, we fit the output mask with the Min-max box and use it as a reference to crop the search region of the next frame. Instead, in the three-branch variant, we find it more effective to exploit the highest-scoring output of the box branch as reference.
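A highly condensed sketch of this per-frame loop for the two-branch variant is given below; `model` and the binarisation threshold are placeholders (the threshold value is not taken from the text), and the forward pass is assumed to return one score and one flattened mask per RoW.

```python
import numpy as np

def track_step(model, search_patch, binary_threshold=0.5):   # threshold assumed
    scores, mask_logits = model(search_patch)   # hypothetical forward pass
    best = int(np.argmax(scores))               # RoW with the maximum score
    mask_prob = 1.0 / (1.0 + np.exp(-mask_logits[best]))      # per-pixel sigmoid
    mask = mask_prob > binary_threshold         # binary segmentation output
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return mask, None
    min_max_box = (xs.min(), ys.min(), xs.max(), ys.max())
    return mask, min_max_box                    # box drives the next frame's crop
```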

Timing. SiamMask operates online without any adaptation to the test sequence. On a single NVIDIA RTX 2080 GPU, we measured an average speed of 55 and 60 frames per second for the three-branch and two-branch variants, respectively. Note that SiamMask does not perform online adaptation of the network parameters during tracking, and that the highest computational burden comes from the feature extractor $f_\theta$.

Datasets. To evaluate tracking performance, the following four benchmarks were used: VOT-2016 [43], VOT-2018 [42], GOT-10k [36], and TrackingNet [60].

  • VOT-2016 and VOT-2018. We use VOT-2016 to understand how different types of representation affect the performance. For this first experiment, we use mean intersection over union (IOU) and Average Precision (AP) at 0.5 and 0.7 IOU thresholds. We then compare against the state-of-the-art on both VOT-2018 and VOT-2016, using the official VOT toolkit and the Expected Average Overlap (EAO), a measure that considers both the accuracy and the robustness of a tracker [44].

  • GOT-10k and TrackingNet. These are larger and more recent visual object tracking datasets which are useful to test the generalization ability of trackers on a vast number of diverse classes, scenarios, and types of motions.

To evaluate (VOS) segmentation performance, the following four benchmarks were used: DAVIS-2016 [67], DAVIS-2017 [72], YouTube-VOS [91], and YouTube-VIS [96].

  • DAVIS-2016 and DAVIS-2017. We report the performance of SiamMask on the DAVIS-2016 [67], DAVIS-2017 [72] and YouTube-VOS [91] benchmarks. For both DAVIS datasets, we use the official performance measures: the Jaccard index ($\mathcal{J}$) to express region similarity and the F-measure ($\mathcal{F}$) to express contour accuracy. For each measure $\mathcal{C} \in \{\mathcal{J}, \mathcal{F}\}$, three statistics are considered: mean $\mathcal{C}_\mathcal{M}$, recall $\mathcal{C}_\mathcal{O}$, and decay $\mathcal{C}_\mathcal{D}$, which informs us about the gain/loss of performance over time [67].

  • YouTube-VOS. Following [91], the final result for a test sample in YouTube-VOS is the average of the following four metrics: the overlap precision for seen classes that appear in the training set, the overlap precision for unseen classes that do not appear in the training set, the edge precision for seen classes, and the edge precision for unseen classes. We report the mean Jaccard index and F-measure for both seen ($\mathcal{J}_S$, $\mathcal{F}_S$) and unseen categories ($\mathcal{J}_U$, $\mathcal{F}_U$); the overall score is the average of these four measures.

  • YouTube-VIS. This is a large multiple object tracking and segmentation dataset with 2,883 high-resolution videos containing objects labelled with 40 classes, and 131,000 high-quality pixel-wise masks. Average precision (AP) and average recall are used as performance metrics [96].

6.2 Evaluation for tracking

Target object representation. Existing tracking methods typically predict axis-aligned bounding boxes with a fixed [3, 34, 19, 55] or variable [45, 33, 106] aspect ratio. We are interested in understanding to what extent producing a per-frame binary mask can improve tracking. In order to focus on representation accuracy, for this experiment only we ignore the temporal aspect and sample video frames at random. The approaches described in the following paragraphs are tested on randomly cropped search patches (with random shifts and scale deformations) from the sequences of VOT-2016.

In Table IV, we compare our three-branch variant using the Min-max, MBR and Opt approaches (described in Section 4.4 and in Figure 6). For reference, we also report results for SiamFC and SiamRPN as representative of the fixed and variable aspect-ratio approaches, together with three oracles that have access to per-frame ground-truth information and serve as upper bounds for the different representation strategies. (1) The fixed aspect-ratio oracle ("fixed" in the table) uses the per-frame ground-truth area and center location, but fixes the aspect ratio to the one of the first frame and produces an axis-aligned bounding box. (2) The Min-max oracle uses the minimal enclosing rectangle of the rotated ground-truth bounding box to produce an axis-aligned bounding box. (3) Finally, the MBR oracle uses the rotated minimum bounding rectangle of the ground-truth. Note that (1), (2) and (3) can be considered, respectively, the performance upper bounds for the representation strategies of SiamFC, SiamRPN and SiamMask.

The results are reported for SiamFC and SiamRPN as the representative trackers using, respectively, a fixed and a variable aspect-ratio bounding-box representation. For SiamMask's three-branch variant, we report the results obtained when using the Min-max, MBR, and Opt representation strategies. Although SiamMask-Opt offers the highest IOU and mAP, it requires significant computational resources due to its slow optimisation procedure. SiamMask-MBR achieves a clear improvement in mAP@0.5 IOU w.r.t. the two fully-convolutional baselines and, interestingly, the gap significantly widens when considering mAP at the higher accuracy regime of 0.7 IOU. Notably, our accuracy results are not far from the fixed aspect-ratio oracle. Moreover, comparing the upper bound performance represented by the oracles, it is possible to notice how, by simply changing the bounding box representation, there is great room for improvement (e.g. the mIOU improvement between the fixed aspect-ratio and the MBR oracles).

Overall, this study shows how the MBR strategy to obtain a rotated bounding box from a binary mask of the object offers a significant advantage over popular strategies that simply report axis-aligned bounding boxes.

Results on VOT-2016 and VOT-2018. Table V shows the results of SiamMask with different box generation strategies on the VOT-2016 and VOT-2018 benchmarks. The metrics EAO, accuracy, robustness, and speed are considered. SiamMask-box indicates that the box branch of SiamMask is adopted for inference despite the mask branch having been trained. From the table, the following observations can be made:

  • On the simpler VOT-2016 benchmark, compared with SiamMask-box which directly outputs axis-aligned boxes from the box regression branch, SiamMask-Opt which outputs boxes from the mask branch increases EAO by 3%, and increases the accuracy by 4.7%.

  • On the more challenging VOT-2018, SiamMask-Opt increases EAO by 2.4% and increases the accuracy by 5.8%. The robustness is also increased.

  • Overall, SiamMask-MBR yields a better performance than SiamMask-box, while keeping real time speed.

In general, we can observe clear improvements on all evaluation metrics by using the mask branch for box generation. Note how SiamMask-Opt is best for overall EAO (especially in terms of Accuracy), but its improvement w.r.t. SiamMask-MBR does not justify the significantly higher computational cost.

Method VOT-2016 benchmark VOT-2018 benchmark Speed
EAO Accu. Robust. EAO Accu. Robust. (fps)
SiamMask-box
SiamMask-MBR 55
SiamMask-Opt 5
TABLE V: Results of SiamMask on the VOT-2016 and VOT-2018 benchmarks.

In Table VI we compare the two variants of SiamMask with the MBR strategy, together with SiamMask-Opt, against five popular trackers on the VOT-2018 benchmark. Unless stated otherwise, SiamMask refers to our three-branch variant with the MBR strategy. Both variants achieve a strong performance and run in real-time. In particular, our three-branch variant significantly outperforms DaSiamRPN [106] (which is trained on YouTube-BoundingBoxes [74]) while running at 55 frames per second. Even without the box regression branch, our simpler two-branch variant (SiamMask-2B) achieves a high EAO, on par with SA_Siam_R [30] and superior to any other real-time method in the published literature at the time of the conference version of this paper [88]. Finally, in SiamMask-Opt, the strategy proposed in [42] to find the optimal rotated rectangle from a binary mask brings the best overall performance (and a particularly high accuracy), but comes at a significant computational cost.

Our model is particularly strong under the accuracy metric, showing a significant advantage with respect to the Correlation Filter-based trackers CSRDCF [55], STRCF [46]. This is not surprising, as SiamMask relies on a richer object representation, as demonstrated in the experiments of Table IV. Interestingly, similarly to us, He et al. (SA_Siam_R) [30] are motivated to achieve a more accurate target representation by considering multiple rotated and rescaled bounding boxes. However, their representation is still constrained to a fixed aspect-ratio box.

Metric SiamMask DaSiam-RPN [106] Siam-RPN [45] SA_Siam_R [34] CSRDCF [55] STRCF [46]
Opt MBR
Three-branch Two-branch
EAO
Accur.
Robus.
Speed 5 55 60 160
TABLE VI: Comparison with the state-of-the-art on the VOT-2018 benchmark.
Fig. 8: The EAO plot for the proposed SiamMask and the top 10 real time competing trackers on the VOT2018 challenge.

Real-time VOT-2018 comparison. As an additional visualization, we use the plots provided by the VOT toolkit to compare SiamMask against the top 10 real-time trackers in terms of EAO (Expected Average Overlap: the metric used as a summary by VOT). In Fig. 8, the horizontal coordinate represents the rank of the trackers, and the vertical coordinate their EAO. The horizontal gray line in the figure indicates what was considered by the VOT committee as the state-of-the-art at the time of the competition. Compared with the fully-convolutional network SiamFC, SiamMask increases the performance by a very significant absolute 19.8%.

VOT attributes breakdown. In the VOT benchmarks, frames are densely labelled with scene attributes to give a more qualitative understanding of how different trackers perform under different circumstances. The scene attributes are: occlusion, illumination change, motion change, size (scale) change, and camera motion. We compared SiamMask with popular and representative trackers [18, 30, 45, 46, 106, 20, 4, 62] with respect to these attributes on the VOT-2016 and VOT-2018 benchmarks. The results are shown in Fig. 9. For both benchmarks, it can be seen that SiamMask obtains the best results for most scene attributes. One clear advantage brought by our method is its capability of providing a pixel-wise mask representation of the target object (at high speed), which allows a much higher accuracy and ease of adaptation, especially in the presence of rapid non-rigid deformations.

Fig. 9: Comparison between SiamMask and state-of-the-art trackers with respect to different visual scene attributes on VOT2016 and VOT2018.

Unsupervised learning. A recent trend in object tracking (and computer vision in general) is to train feature extractors on large-scale datasets with a self-supervised proxy task. This is a compelling strategy, as it offers a way to exploit large datasets without having to provide costly box or mask labels. However, these methods encounter the additional challenge of choosing the right proxy task outside of the usual supervision loop, which requires a much larger amount of experiments for tuning a large number of highly-consequential hyperparameters, like the ones controlling data augmentation. Table VII compares our method with a representative set of recent self-supervised-learning-based trackers [86, 101, 104, 78, 90] on the VOT-2018 benchmark. It can be seen that, while their speed is comparable to our method's (and sometimes higher), their overall performance still lags behind.

Metric LUDT [86] CycleSiam [101] USOT [104] S2SiamFC [78] PUL [90] Our method
EAO
Accuracy
Robustness
Speed (fps) 55 44
TABLE VII: Comparison with unsupervised learning-based methods on the VOT-2018 benchmark.

Results on GOT-10k and TrackingNet. GOT-10k [36] is a very large scale tracking dataset covering 563 object classes and 87 motion patterns. In total, it consists of 10,000 video clips with 1.5 million bounding-box labels. Trackers are evaluated on a selection of 180 videos exhibiting 84 different object classes and 32 "motion types", and ranked using average overlap. We also report the success rates at two thresholds: 0.5 and 0.75. In this case, SiamMask was compared with the results made available by the benchmark for CFNet [83], SiamFC [3], GOTURN [32], CCOT [20] and MDNet [63]. Results are shown in Table VIII. Compared to CFNet [83] (which has the best performance among the competing trackers), SiamMask has a significant advantage on all the metrics considered: it provides a relative increase in average overlap of 37%, and of up to 150% in success rate. In general, the fact that SiamMask maintains its strong performance on a dataset with a large number of classes should be taken as a positive signal for its generalization capabilities. However, it is hard to establish an apples-to-apples comparison between SiamMask and the methods reported by this benchmark. On the one hand, the trackers reported by the benchmark are trained on the training split of the same dataset (which is supposedly sampled from the same distribution as the set used as benchmark). On the other hand, apart from the "person" class, the GOT-10k training set does not contain any other class from the benchmark set. For the sake of simplicity, and for consistency with the other experiments in the paper, we did not enforce the same separation, so we do not have data regarding the class overlap between the two sets. This should be taken into account when making comparisons on this benchmark.

Metric SiamMask CFNet [83] SiamFC [3] GOTURN [32] CCOT [20] MDNet [63]
Average overlap
Success rate with overlap threshold
Success rate with overlap threshold
TABLE VIII: Comparison on GOT-10k benchmark.
Metric SiamMask ATOM [17] MDNet [63] CFNet [83] SiamFC [3] ECO [18]
AUC of success rate
Tracking precision
Normalized precision
TABLE IX: Comparison on TrackingNet benchmark.

TrackingNet [60] is a popular large video benchmark of 511 videos for testing visual object tracking algorithms. Trackers are ranked according to the area under the curve (AUC) of the success rate, tracking precision, and normalized precision. On this dataset, SiamMask was compared with ATOM [17], MDNet [63], CFNet [83], SiamFC [3] and ECO [18]. Results are shown in Table IX for supervised methods and Table X for self-supervised methods. Again, it can be seen that SiamMask outperforms the competitors according to all the metrics considered by the benchmark. Interestingly, SiamMask even slightly (+2.1%) improves over ATOM [17], which adapts the parameters of the network used as a feature extractor online.

Metric LUDT [86] USOT [104] PUL [90] SiamMask
AUC of success rate
Tracking precision
Normalized precision
TABLE X: Comparison with unsupervised learning-based methods on the TrackingNet dataset.

On the TrackingNet benchmark, we also compared our method with a few unsupervised learning-based visual trackers [86, 104, 90]. Results are shown in Table X. Unsurprisingly, SiamMask can leverage millions of bounding-box and mask labels during training and achieves significantly better results.

6.3 Evaluation for video object segmentation (VOS)

Our model, once trained, can also be used for the task of VOS, achieving competitive performance without requiring any adaptation at test time. Importantly, unlike typical VOS approaches, ours can operate online, runs in real-time and only requires a simple bounding box initialization. To initialize SiamMask, an axis-aligned bounding box (obtained with the Min-max strategy shown in Fig. 6) is extracted from the mask provided in the first frame (multiple objects are tracked and segmented using multiple instances of the tracker). VOS methods, instead, are typically initialized with a binary mask [69], and many of them require computationally intensive techniques at test time such as fine-tuning [56, 66, 1, 84], data augmentation [40, 47], inference on MRF/CRF [89, 81, 57, 1] and optical flow [81, 1, 66, 47, 12]. As a consequence, it is not uncommon for VOS techniques to require several minutes to process a short sequence. Clearly, these strategies make online applicability (which is our focus) impossible. For this reason, in our comparison we mainly concentrate on fast VOS approaches.

Results on DAVIS-2016. Fig. 10 compares SiamMask with several popular fast VOS methods on DAVIS-2016 in terms of segmentation accuracy (y-axis) and speed (x-axis). SiamMask shows a comparable accuracy while running significantly faster than other methods (often by an order of magnitude). Notably, the competitive accuracy is obtained without requiring online updates of the backbone model, as done for instance in OSMN [97].

Fig. 10: Comparison in terms of mean IOU and speed (fps) between SiamMask and popular fast video object segmentation algorithms on the DAVIS-2016 dataset.

Table XI offers a more detailed breakdown of the comparison, while also considering slower but better-performing methods such as OnAVOS and MSK. A few notes:

  • OnAVOS [84] and MSK [66] yield the best overlap and contour precision across the board. However, their strategy of performing online model updates makes them hundreds of times slower than SiamMask and unable to run in real time.

  • In contrast with VOS approaches which do not perform online model updates (FAVOS [12], RGMP [65], SFL-ol [13], PML [10], OSMN [97], PLM [77] and VPN [37]), SiamMask has a simpler initialization (bounding box instead of pixel-wise mask) and an important advantage in terms of speed.

  • Importantly, SiamMask yields the best performance (lowest values) for the decay of both the region overlap and the contour precision. This suggests that SiamMask is robust over time and thus particularly appropriate to be used in long sequences.

Method FT Speed
OnAVOS [84]
MSK [66]
MSK [66]
SFL [13]
FAVOS [12]
RGMP [65] 8
SFL-ol [13]
PML [10]
OSMN [97]
PLM [77]
VPN [37]
SiamMask
TABLE XI: Comparison between SiamMask and the state-of-the-art segmentation algorithms on the DAVIS 2016 validation set: FT denotes whether fine-tuning is required (✓) or not (✗); M denotes whether video segmentation is initialized with a mask (✓) or a bounding box (✗); Speed is measured in frames per second.

Results on DAVIS-2017 and YouTube-VOS.

Method FT M Speed
OnAVOS [84]
OSVOS [7]
FAVOS [12]
OSMN [97]
SiamMask 55
TABLE XII: Comparison on the DAVIS 2017 validation set.

Tables XII and XIII compare the performance of SiamMask on two additional VOS benchmarks: DAVIS-2017 and YouTube-VOS. Looking at the results in the tables, a few comments can be made:

  • SiamMask again does not have the best performance overall, but it is very competitive at a speed that is often hundreds of times faster than that of higher-performing methods like OnAVOS and OSVOS.

  • On DAVIS-2017, again SiamMask exhibits a strong temporal robustness (expressed by a low decay). It is surpassed only by the slower FAVOS, which maintains several trackers for different object parts, thus being able to deal with complex deformations that naturally occur over time.

  • On YouTube-VOS, SiamMask surprisingly achieves the best accuracy for the set of seen classes.

  • The second fastest method after SiamMask is OSMN, which uses meta learning to perform rapid online parameters updates. However, it performs worse across the board in all metrics.

Method FT M Speed
OnAVOS [84]
OSVOS [7]
OSMN [97]
SiamMask
TABLE XIII: Comparison on the YouTube-VOS validation set.

General remarks. The results on DAVIS-2016, DAVIS-2017 and YouTube-VOS (Table XI, XII and XIII) demonstrate that SiamMask can achieve competitive online segmentation performance at a fast speed, with only a simple bounding-box initialization and without any adaptation at test time. Moreover, SiamMask (1) is almost two orders of magnitude faster than accurate segmentation algorithms such as OnAVOS [84] and SFL [13]; (2) is about four times faster than fast online methods such as OSMN [97] and RGMP [65]; (3) has a remarkably low decay in performance over time, which makes it effective in long sequences. These points suggest that SiamMask can be treated as a strong baseline for online video object segmentation, as well as tracking.

6.4 Evaluation for multiple object tracking and segmentation

First, we validate the efficacy of the two-stage strategy when performing multiple object tracking and segmentation on the validation set of YouTube-VIS [96]. The comparison between the one-stage and the two-stage approach is shown in Table XIV. In both cases, one tracker per target object is instantiated and the off-the-shelf segmentation approach HTC [8] is used to initialize the tracks (the cascaded per-object pipeline is sketched below). As can be seen, the two-stage variant modestly improves the results, with an absolute improvement in both mAP and average recall. We attribute this improvement to the fact that the regression branch of the two-stage version helps limit the area to be segmented, thus reducing the difficulty of the task for the mask branch.
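The sketch below illustrates this cascaded pipeline under simplifying assumptions; the function and method names (init_segmenter, make_tracker, predict_box, predict_mask) are hypothetical placeholders rather than the actual implementation.

    def track_multiple_objects(frames, init_segmenter, make_tracker):
        # Stage 0: instantiate one tracker per object detected in the first
        # frame by an off-the-shelf segmenter (e.g. HTC in our experiments).
        init_masks = init_segmenter(frames[0])
        trackers = [make_tracker(frames[0], mask) for mask in init_masks]

        results = []
        for frame in frames[1:]:
            per_frame = []
            for tracker in trackers:
                box = tracker.predict_box(frame)             # stage 1: box regression
                mask = tracker.predict_mask(frame, roi=box)  # stage 2: mask restricted to the box
                per_frame.append((box, mask))
            results.append(per_frame)
        return results

In the single-stage variant, the mask would instead be predicted directly on the full search area, without the box-restricted second stage, which is the difference ablated in Table XIV.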

Model mAP ΔmAP AR10 ΔAR10
HTC+SiamMask
HTC+ two-stage SiamMask
TABLE XIV: Comparison between the proposed two-stage SiamMask and the single-stage SiamMask on the validation set of YouTube-VIS [96]. mAP refers to mean average precision, while AR10 is the average recall with 10 proposals, averaged over IoU thresholds and classes. ΔmAP and ΔAR10 indicate the difference between the two variants for both metrics. HTC is the off-the-shelf segmentation method [8] used to initialize the tracks.
Team name mAP AP50 AP75 AR1 AR10 Ranking
Jono [53] 1
Ours 2
bellejuillet [24] 3
Linhj 4
mingmingdiii [21] 5
xiAaonice [51] 6
guwop 7
exing 8
Baseline [37]
TABLE XV: Leaderboard of the 2019 edition of YouTube-VIS. The metrics used by the challenge are mean average precision (mAP), average precision at fixed IoU thresholds of 50% and 75% (AP50 and AP75), and average recall with 1 or 10 proposals (AR1 and AR10). For more information, and for a description of all the approaches used in the competition, see https://youtube-vos.org/challenge/2019/leaderboard.

Table XV shows the comparison between the two-stage version of SiamMask and the algorithms participating in the 2019 YouTube-VIS challenge (on the test set) [96]. Compared with the official baseline proposed by the YouTube-VIS organizers [96], the two-stage version of SiamMask obtains a relative improvement in mAP. Despite its very simple approach, SiamMask ranks second on the leaderboard, after the method described in [53], which considers the VIS task as comprising four separate problems (detection, classification, segmentation and tracking) and solves them individually.

One important situation that our simple adaptation of SiamMask does not handle well is confusion: when multiple objects are very close to each other, there is high uncertainty in mapping pixels to identities, with the result that one object can “hijack” the mask of another. A solution to this issue could be to model the relationship between pixels within the same mask (e.g. with conditional random fields or graph neural networks), rather than treating each pixel individually. However, this would inevitably reduce the tracking speed.

6.5 Ablation studies

We perform a series of ablation studies to analyse the impact of different architectures and multi-task training setups.

Table XVI compares different variants of the fully-convolutional Siamese framework, indicating whether the classic AlexNet or ResNet-50 is used as the backbone, whether or not the mask refinement strategy (from [64]) is used, and which multi-task configuration is adopted. A few observations can be made:

  • Unsurprisingly, using a ResNet-50 backbone delivers better performance, and at a reasonable cost in terms of speed.

  • Using the same ResNet-50 backbone, the two- and three-branch variants of SiamMask improve over their respective baselines, SiamFC and SiamRPN.

  • Mask refinement is very useful for increasing the contour accuracy in the segmentation task. However, it does not seem to significantly affect the EAO tracking metric. This is not particularly surprising, as EAO only considers rotated bounding boxes, which are a crude approximation of the actual object boundaries.

Method AlexNet ResNet-50 EAO Speed (fps)
SiamFC 86
SiamFC 40
SiamRPN
SiamRPN 76
SiamMask-2branches without mask refinement 43
SiamMask-3branches without mask refinement 58
SiamMask-2branches-score 40
SiamMask-3branches-box 76
SiamMask-2branches 60
SiamMask-3branches 55
TABLE XVI: Ablation studies of SiamMask on the VOT-2018 and DAVIS-2016 datasets

We conducted two further experiments to disentangle the effect of multi-task training, also reported in Table XVI. To achieve this, we modified the two variants of SiamMask during inference so that, respectively, they report an axis-aligned bounding box from the score branch (SiamMask-2branches-score) or the box branch (SiamMask-3branches-box). Therefore, despite having been trained, the mask branch is not used during inference. We can observe that both variants obtain a modest but meaningful improvement with respect to their counterparts (SiamFC and SiamRPN): from 0.251 to 0.265 EAO for the two-branch variant and from 0.359 to 0.363 for the three-branch variant (the corresponding relative gains are worked out below). This might suggest that learning an additional task could act as a regularizer even when the segmentation output is not used, although our experimental setup is too limited to draw such a conclusion with confidence.
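For reference, the relative gains implied by these EAO values can be computed directly; the snippet below is only illustrative arithmetic on the numbers quoted above.

    # Relative EAO improvements of the multi-task-trained variants over
    # their single-task counterparts, using the values reported in the text.
    ablation = {
        "two-branch vs SiamFC": (0.251, 0.265),
        "three-branch vs SiamRPN": (0.359, 0.363),
    }
    for name, (baseline, multitask) in ablation.items():
        gain = 100 * (multitask - baseline) / baseline
        print(f"{name}: +{gain:.1f}% relative EAO")
    # two-branch vs SiamFC: +5.6% relative EAO
    # three-branch vs SiamRPN: +1.1% relative EAO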

6.6 Qualitative examples

Failure cases. Qualitatively, we observed a few scenarios in which SiamMask performs rather poorly. One is extreme motion blur (e.g. the left-hand side of Fig. 11), which is caused by sudden camera motion or by fast movements of the target object. Since the labelling of such frames is itself expected to contain a significant amount of noise, this is a scenario to which a purely offline-trained, supervised strategy like SiamMask can be particularly susceptible. In contrast, when motion blur is not extreme, SiamMask typically performs rather well (see e.g. some of the examples in Fig. 14).

As already mentioned when considering the multiple-object case, since SiamMask models pixels individually, confusion (i.e. when different objects' trajectories overlap) is another very challenging scenario, even when we are only interested in one target.

Finally, a rather pathological but still important failure case occurs when the area selected for tracking does not correspond to an object, but rather to a texture or a part of an object (e.g. see the right-hand side of Fig. 11). Given that SiamMask is trained on a large dataset with object-level labels, it is naturally biased towards objects even when the initialization provided by the user says otherwise.

Fig. 11: Failure cases: extreme motion blur and “non-object” initialization.

A variety of objects and shapes. Fig. 12 shows a few qualitative examples of mask predictions for objects of different types and shapes. In general, we observe that SiamMask adapts well to all sorts of objects and deformations, and provides fairly accurate masks even in the presence of noisy backgrounds and non-rigid deformations.

Fig. 12: Example score maps from the mask branch for a variety of objects.

Multiple masks per output. SiamMask generates one mask for each individual RoW (response of a candidate window). During tracking, the RoW attaining the maximum score from the classification branch is taken as the region containing the object, and the segmentation branch's prediction for that RoW gives the final output mask (a short sketch of this selection step is given below). To more clearly observe what the mask branch predicts, we visualize the masks predicted from different RoWs of the same search area in Fig. 13.
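A minimal sketch of this selection step follows; the grid size and per-RoW mask resolution used here (a 17×17 RoW grid with a flattened 63×63 mask per RoW) are illustrative placeholders, and the function name is ours rather than the actual implementation.

    import numpy as np

    def select_output_mask(score_map, mask_logits, threshold=0.0):
        # score_map:   (H, W)       one classification score per RoW
        # mask_logits: (H, W, h*w)  flattened mask prediction for every RoW
        h_idx, w_idx = np.unravel_index(np.argmax(score_map), score_map.shape)
        side = int(np.sqrt(mask_logits.shape[-1]))
        # Keep only the mask of the highest-scoring RoW and binarise it.
        return mask_logits[h_idx, w_idx].reshape(side, side) > threshold

    mask = select_output_mask(np.random.randn(17, 17),
                              np.random.randn(17, 17, 63 * 63))
    print(mask.shape)  # (63, 63)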

Fig. 13: Score maps from the mask branch at different locations.

Further qualitative results. To qualitatively analyse the tracking and segmentation accuracy of SiamMask, we present visual results for some challenging video sequences from the VOT-2018, DAVIS-2016, DAVIS-2017 and YouTube-VIS datasets.

Sequences from the single-object tracking benchmark VOT-2018 are shown in Fig. 14. SiamMask maintains high accuracy on sequences presenting significant non-rigid deformations, such as butterfly and iceskater1. While butterfly is a fairly “simple” sequence because of the stark contrast between object and background, iceskater1 is challenging because of the complexity of the background. With fast-moving objects, SiamMask can produce accurate segmentation masks even in the presence of distractors (e.g. crabs1 and iceskater2). However, as seen previously, conditions become much more challenging for SiamMask when object trajectories overlap. Video object segmentation algorithms are often sensitive to motion blur and variations in illumination. In contrast, SiamMask yields accurate masks for the sequences singer2, shaking, and soccer1, which present large variations in illumination and severe motion blur.

Fig. 14: Qualitative results of SiamMask for the sequences butterfly, crabs1, iceskater1, iceskater2, motocross1, singer2, shaking, and soccer1 from the visual object tracking benchmark VOT-2018.

Fig. 15 shows the qualitative results of SiamMask on a few representative sequences from DAVIS-2016 and DAVIS-2017. For the video object segmentation task, SiamMask effectively adapts to the challenges presented by changes in scale, viewing angle, and shape (e.g. the drift-straight and motocross-jump sequences). Moreover, it also accurately handles minor occlusions of the target object (e.g. the bmx-trees and libby sequences).

Finally, Fig. 16 visualizes a set of positive results of our two-stage SiamMask variant described in Section 5 on some challenging videos from YouTube-VIS, where objects undergo deformation, occlusion, or rapid motion.

Fig. 15: Qualitative results of SiamMask for some sequences from the object segmentation benchmarks DAVIS-2016 and DAVIS-2017.
Fig. 16: Example results of the two-stage version of SiamMask on some sequences from Youtube-VIS.

7 Conclusion

In this paper we introduced SiamMask, a simple approach that enables fully-convolutional Siamese trackers to produce class-agnostic binary segmentation masks of the target object. We showed how it can be successfully applied to both visual object tracking and semi-supervised video object segmentation, achieving better accuracy than most real-time trackers and, at the same time, the fastest speed among VOS methods. The two variants of SiamMask we proposed are initialised with a simple bounding box, operate online, run in real-time and do not require any adaptation to the test sequence. In addition, SiamMask can be easily extended to also perform multiple object tracking and segmentation by cascading two models. We hope that our work will inspire further studies on multi-task approaches that consider different but closely related computer vision problems together.

References

  • [1] L. Bao, B. Wu, and W. Liu (2018) Cnn in mrf: video object segmentation via inference in a cnn-based higher-order spatio-temporal mrf. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5977–5986. Cited by: §1, §1, §2.2, §2.2, §6.3.
  • [2] L. Bertinetto, J. F. Henriques, J. Valmadre, P. Torr, and A. Vedaldi (2016) Learning feed-forward one-shot learners. Advances in neural information processing systems 29. Cited by: §3.1, §4.1.
  • [3] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr (2016) Fully-convolutional siamese networks for object tracking. In European conference on computer vision, pp. 850–865. Cited by: §1, §2.1, §3.1, §3.1, §3, Fig. 2, §4.1, §4.3, §4.5, §6.1, §6.2, §6.2, §6.2, TABLE IV, TABLE VIII, TABLE IX.
  • [4] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 483–498. Cited by: §6.2.
  • [5] C. Bibby and I. Reid (2008) Robust real-time visual tracking using pixel-wise posteriors. In European Conference on Computer Vision, pp. 831–844. Cited by: §2.3.
  • [6] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui (2010) Visual object tracking using adaptive correlation filters. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 2544–2550. Cited by: §2.1.
  • [7] S. Caelles, K. Maninis, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2017) One-shot video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 221–230. Cited by: TABLE XII, TABLE XIII.
  • [8] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019) Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4974–4983. Cited by: §5, §6.4, TABLE XIV.
  • [9] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2017) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence 40 (4), pp. 834–848. Cited by: §6.1.
  • [10] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool (2018) Blazingly fast video object segmentation with pixel-wise metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1189–1198. Cited by: §1, §2.2, 2nd item, TABLE XI.
  • [11] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji (2020) Siamese box adaptive network for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6668–6677. Cited by: §2.1.
  • [12] J. Cheng, Y. Tsai, W. Hung, S. Wang, and M. Yang (2018) Fast and accurate online video object segmentation via tracking parts. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7415–7424. Cited by: §1, §1, §2.2, 2nd item, §6.3, TABLE XI, TABLE XII.
  • [13] J. Cheng, Y. Tsai, S. Wang, and M. Yang (2017) Segflow: joint learning for video object segmentation and optical flow. In Proceedings of the IEEE international conference on computer vision, pp. 686–695. Cited by: §2.2, 2nd item, §6.3, TABLE XI.
  • [14] S. Cheng, B. Zhong, G. Li, X. Liu, Z. Tang, X. Li, and J. Wang (2021) Learning to filter: siamese relation network for robust tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4421–4431. Cited by: §2.1.
  • [15] H. Ci, C. Wang, and Y. Wang (2018) Video object segmentation by learning location-sensitive embeddings. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–516. Cited by: §2.3.
  • [16] D. Comaniciu, V. Ramesh, and P. Meer (2000) Real-time tracking of non-rigid objects using mean shift. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), Vol. 2, pp. 142–149. Cited by: §2.3.
  • [17] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) Atom: accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669. Cited by: §6.2, TABLE IX.
  • [18] M. Danelljan, G. Bhat, F. Shahbaz Khan, and M. Felsberg (2017) Eco: efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6638–6646. Cited by: Fig. 1, §2.1, §6.2, §6.2, TABLE IX.
  • [19] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In Proceedings of the IEEE international conference on computer vision, pp. 4310–4318. Cited by: §2.1, §6.2.
  • [20] M. Danelljan, A. Robinson, F. Shahbaz Khan, and M. Felsberg (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In European conference on computer vision, pp. 472–488. Cited by: §6.2, §6.2, TABLE VIII.
  • [21] M. Dong, J. Wang, Y. Huang, D. Yu, K. Su, K. Zhou, J. Shao, S. Wen, and C. Wang (2019) Temporal feature augmented network for video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: TABLE XV.
  • [22] H. Fan and H. Ling (2019) Siamese cascaded region proposal networks for real-time visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7952–7961. Cited by: §2.1.
  • [23] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In Proceedings of the IEEE international conference on computer vision, pp. 3038–3046. Cited by: §3.2.
  • [24] Q. Feng, Z. Yang, P. Li, Y. Wei, and Y. Yang (2019) Dual embedding learning for video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: TABLE XV.
  • [25] H. K. Galoogahi, T. Sim, and S. Lucey (2013) Multi-channel correlation filters. In Proceedings of the IEEE international conference on computer vision, pp. 3072–3079. Cited by: §2.1.
  • [26] D. Guo, Y. Shao, Y. Cui, Z. Wang, L. Zhang, and C. Shen (2021) Graph attention tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9543–9552. Cited by: §2.1.
  • [27] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen (2020) SiamCAR: siamese fully convolutional classification and regression for visual tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6269–6277. Cited by: §2.1.
  • [28] Q. Guo, X. Xie, F. Juefei-Xu, L. Ma, Z. Li, W. Xue, W. Feng, and Y. Liu (2020) Spark: spatial-aware online incremental attack against visual tracking. In European Conference on Computer Vision, pp. 202–219. Cited by: §2.1.
  • [29] A. He, C. Luo, X. Tian, and W. Zeng (2018) A twofold siamese network for real-time object tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4834–4843. Cited by: §2.1.
  • [30] A. He, C. Luo, X. Tian, and W. Zeng (2018) Towards a better match in siamese network based visual object tracker. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §1, §2.1, §6.2, §6.2, §6.2.
  • [31] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §4.2.
  • [32] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §6.1, §6.2, TABLE VIII.
  • [33] D. Held, S. Thrun, and S. Savarese (2016) Learning to track at 100 fps with deep regression networks. In European conference on computer vision, pp. 749–765. Cited by: §2.1, §6.2.
  • [34] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2014) High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence 37 (3), pp. 583–596. Cited by: §2.1, §6.2, TABLE VI.
  • [35] Y. Hu, J. Huang, and A. G. Schwing (2018) Videomatch: matching based video object segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 54–70. Cited by: §1, §2.2.
  • [36] L. Huang, X. Zhao, and K. Huang (2019) Got-10k: a large high-diversity benchmark for generic object tracking in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence 43 (5), pp. 1562–1577. Cited by: §6.1, §6.2.
  • [37] V. Jampani, R. Gadde, and P. V. Gehler (2017) Video propagation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 451–461. Cited by: §1, §2.2, 2nd item, TABLE XI, TABLE XV.
  • [38] S. Jia, C. Ma, Y. Song, and X. Yang (2020) Robust tracking against adversarial attacks. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.1.
  • [39] L. Kaiser, A. N. Gomez, and F. Chollet (2018) Depthwise separable convolutions for neural machine translation. In International Conference on Learning Representations, pp. 1–10. Cited by: §3.1.
  • [40] A. Khoreva, R. Benenson, E. Ilg, T. Brox, and B. Schiele (2017) Lucid data dreaming for object tracking. In The DAVIS challenge on video object segmentation, Cited by: §1, §2.2, §6.3.
  • [41] H. Kiani Galoogahi, T. Sim, and S. Lucey (2015) Correlation filters with limited boundaries. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4630–4638. Cited by: §2.1.
  • [42] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, et al. (2018) The sixth visual object tracking vot2018 challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §1, Fig. 6, §4.4, §6.1, §6.2.
  • [43] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Cehovin Zajc, T. Vojir, G. Hager, A. Lukezic, A. Eldesokey, et al. (2017) The visual object tracking vot2017 challenge results. In Proceedings of the IEEE international conference on computer vision workshops, pp. 1949–1972. Cited by: §1, §6.1.
  • [44] M. Kristan, J. Matas, A. Leonardis, T. Vojíř, R. Pflugfelder, G. Fernandez, G. Nebehay, F. Porikli, and L. Čehovin (2016) A novel performance evaluation methodology for single-target trackers. IEEE transactions on pattern analysis and machine intelligence 38 (11), pp. 2137–2155. Cited by: §2.2, Fig. 6, §4.4, 1st item.
  • [45] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8971–8980. Cited by: §1, §2.1, §3.2, §3.2, §3.2, §3, Fig. 3, §4.1, §4.3, §6.2, §6.2, TABLE IV, TABLE VI.
  • [46] F. Li, C. Tian, W. Zuo, L. Zhang, and M. Yang (2018) Learning spatial-temporal regularized correlation filters for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4904–4913. Cited by: §2.1, §6.2, §6.2, TABLE VI.
  • [47] X. Li and C. C. Loy (2018) Video object segmentation with joint re-identification and attention-aware mask propagation. In Proceedings of the European conference on computer vision (ECCV), pp. 90–105. Cited by: §1, §2.2, §6.3.
  • [48] P. Liang, E. Blasch, and H. Ling (2015) Encoding color information for visual tracking: algorithms and benchmark. IEEE transactions on image processing 24 (12), pp. 5630–5644. Cited by: §1.
  • [49] S. Liang, X. Wei, S. Yao, and X. Cao (2020) Efficient adversarial attacks for visual object tracking. In European Conference on Computer Vision, pp. 34–50. Cited by: §2.1.
  • [50] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §6.1.
  • [51] X. Liu, H. Ren, and T. Ye (2019) Spatio-temporal attention network for video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: TABLE XV.
  • [52] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §4.2.
  • [53] J. Luiten, P. Torr, and B. Leibe (2019) Video instance segmentation 2019: a winning approach for combined detection, segmentation, classification and tracking.. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 709–712. Cited by: §6.4, TABLE XV.
  • [54] A. Lukezic, J. Matas, and M. Kristan (2020) D3s-a discriminative single shot segmentation tracker. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 7133–7142. Cited by: §2.3.
  • [55] A. Lukezic, T. Vojir, L. Čehovin Zajc, J. Matas, and M. Kristan (2017) Discriminative correlation filter with channel and spatial reliability. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6309–6318. Cited by: §2.1, §6.2, §6.2, TABLE VI.
  • [56] K. Maninis, S. Caelles, Y. Chen, J. Pont-Tuset, L. Leal-Taixé, D. Cremers, and L. Van Gool (2018) Video object segmentation without temporal information. IEEE transactions on pattern analysis and machine intelligence 41 (6), pp. 1515–1530. Cited by: §1, §2.2, §2.2, §6.3.
  • [57] N. Märki, F. Perazzi, O. Wang, and A. Sorkine-Hornung (2016) Bilateral space video segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 743–751. Cited by: §1, §2.2, §2.2, §6.3.
  • [58] O. Miksik, J. Pérez-Rúa, P. H. Torr, and P. Pérez (2017) Roam: a rich object appearance model with application to rotoscoping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4691–4699. Cited by: §1.
  • [59] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for uav tracking. In European conference on computer vision, pp. 445–461. Cited by: §1.
  • [60] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem (2018) Trackingnet: a large-scale dataset and benchmark for object tracking in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 300–317. Cited by: §1, §6.1, §6.2.
  • [61] K. K. Nakka and M. Salzmann (2020) Temporally-transferable perturbations: efficient, one-shot adversarial attacks for online visual object trackers. arXiv preprint arXiv:2012.15183. Cited by: §2.1.
  • [62] H. Nam, M. Baek, and B. Han (2016) Modeling and propagating cnns in a tree structure for visual tracking. arXiv preprint arXiv:1608.07242. Cited by: §6.2.
  • [63] H. Nam and B. Han (2016) Learning multi-domain convolutional neural networks for visual tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4293–4302. Cited by: §6.2, §6.2, TABLE VIII, TABLE IX.
  • [64] P. O. O Pinheiro, R. Collobert, and P. Dollár (2015) Learning to segment object candidates. Advances in neural information processing systems 28. Cited by: §1, §4.2, §4.3, §6.5.
  • [65] S. W. Oh, J. Lee, K. Sunkavalli, and S. J. Kim (2018) Fast video object segmentation by reference-guided mask propagation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7376–7385. Cited by: §1, §2.2, 2nd item, §6.3, TABLE XI.
  • [66] F. Perazzi, A. Khoreva, R. Benenson, B. Schiele, and A. Sorkine-Hornung (2017) Learning video object segmentation from static images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2663–2672. Cited by: §1, §1, §2.2, §2.2, §2.3, 1st item, §6.3, TABLE XI.
  • [67] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung (2016) A benchmark dataset and evaluation methodology for video object segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 724–732. Cited by: §1, §2.2, §2.2, 1st item, §6.1.
  • [68] F. Perazzi, O. Wang, M. Gross, and A. Sorkine-Hornung (2015) Fully connected object proposals for video segmentation. In Proceedings of the IEEE international conference on computer vision, pp. 3227–3234. Cited by: §2.2.
  • [69] F. Perazzi (2017) Video object segmentation. In PhD thesis, Cited by: §1, §2.2, §6.3.
  • [70] P. Pérez, C. Hue, J. Vermaak, and M. Gangnet (2002) Color-based probabilistic tracking. In European Conference on Computer Vision, pp. 661–675. Cited by: §2.3.
  • [71] P. O. Pinheiro, T. Lin, R. Collobert, and P. Dollár (2016) Learning to refine object segments. In European conference on computer vision, pp. 75–91. Cited by: §4.2.
  • [72] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbeláez, A. Sorkine-Hornung, and L. Van Gool (2017) The 2017 davis challenge on video object segmentation. arXiv preprint arXiv:1704.00675. Cited by: 1st item, §6.1.
  • [73] H. Possegger, T. Mauthner, and H. Bischof (2015) In defense of color-based model-free tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2113–2120. Cited by: §2.3.
  • [74] E. Real, J. Shlens, S. Mazzocchi, X. Pan, and V. Vanhoucke (2017) Youtube-boundingboxes: a large high-precision human-annotated data set for object detection in video. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5296–5305. Cited by: §6.2.
  • [75] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §1, §3.2.
  • [76] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §6.1.
  • [77] J. Shin Yoon, F. Rameau, J. Kim, S. Lee, S. Shin, and I. So Kweon (2017) Pixel-level matching for video object segmentation using convolutional neural networks. In Proceedings of the IEEE international conference on computer vision, pp. 2167–2176. Cited by: 2nd item, TABLE XI.
  • [78] C. H. Sio, Y. Ma, H. Shuai, J. Chen, and W. Cheng (2020) S2siamfc: self-supervised fully convolutional siamese network for visual tracking. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 1948–1957. Cited by: §2.1, §6.2, TABLE VII.
  • [79] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah (2013) Visual tracking: an experimental survey. IEEE transactions on pattern analysis and machine intelligence 36 (7), pp. 1442–1468. Cited by: §1, §2.1, §2.2.
  • [80] R. Tao, E. Gavves, and A. W. Smeulders (2016) Siamese instance search for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1420–1429. Cited by: §2.1.
  • [81] Y. Tsai, M. Yang, and M. J. Black (2016) Video segmentation via object flow. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3899–3908. Cited by: §1, §1, §2.2, §2.2, §6.3.
  • [82] J. Valmadre, L. Bertinetto, J. F. Henriques, R. Tao, A. Vedaldi, A. W. Smeulders, P. H. Torr, and E. Gavves (2018) Long-term tracking in the wild: a benchmark. In Proceedings of the European conference on computer vision (ECCV), pp. 670–685. Cited by: §1.
  • [83] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, and P. H. Torr (2017) End-to-end representation learning for correlation filter based tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2805–2813. Cited by: §2.1, §3, §6.2, §6.2, TABLE VIII, TABLE IX.
  • [84] P. Voigtlaender and B. Leibe (2017) Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364. Cited by: §1, §2.2, §2.2, 1st item, §6.3, §6.3, TABLE XI, TABLE XII, TABLE XIII.
  • [85] G. Wang, C. Luo, Z. Xiong, and W. Zeng (2019) Spm-tracker: series-parallel matching for real-time visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3643–3652. Cited by: §2.1.
  • [86] N. Wang, W. Zhou, Y. Song, C. Ma, W. Liu, and H. Li (2021) Unsupervised deep representation learning for real-time tracking. International Journal of Computer Vision 129 (2), pp. 400–418. Cited by: §2.1, §6.2, §6.2, TABLE X, TABLE VII.
  • [87] Q. Wang, Y. He, X. Yang, Z. Yang, and P. Torr (2019) An empirical study of detection-based video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pp. 0–0. Cited by: §5.
  • [88] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr (2019) Fast online object tracking and segmentation: a unifying approach. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, pp. 1328–1338. Cited by: §6.2.
  • [89] L. Wen, D. Du, Z. Lei, S. Z. Li, and M. Yang (2015) Jots: joint online tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2226–2234. Cited by: §1, §2.2, §6.3.
  • [90] Q. Wu, J. Wan, and A. B. Chan (2021) Progressive unsupervised learning for visual object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2993–3002. Cited by: §2.1, §6.2, §6.2, TABLE X, TABLE VII.
  • [91] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. Price, S. Cohen, and T. Huang (2018) Youtube-vos: sequence-to-sequence video object segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 585–601. Cited by: §1, 1st item, 2nd item, §6.1, §6.1.
  • [92] Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu (2020) Siamfc++: towards robust and accurate visual tracking with target estimation guidelines. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, pp. 12549–12556. Cited by: §2.1.
  • [93] B. Yan, H. Peng, K. Wu, D. Wang, J. Fu, and H. Lu (2021) Lighttrack: finding lightweight neural networks for object tracking via one-shot architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15180–15189. Cited by: §2.1.
  • [94] B. Yan, D. Wang, H. Lu, and X. Yang (2020) Cooling-shrinking attack: blinding the tracker with imperceptible noises. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 990–999. Cited by: §2.1.
  • [95] B. Yan, X. Zhang, D. Wang, H. Lu, and X. Yang (2021) Alpha-refine: boosting tracking performance by precise bounding box estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5289–5298. Cited by: §2.3.
  • [96] L. Yang, Y. Fan, and N. Xu (2019) Video instance segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5188–5197. Cited by: §1, 3rd item, §6.1, §6.4, §6.4, TABLE XIV.
  • [97] L. Yang, Y. Wang, X. Xiong, J. Yang, and A. K. Katsaggelos (2018) Efficient video object segmentation via network modulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6499–6507. Cited by: §1, §2.2, 2nd item, §6.3, §6.3, TABLE XI, TABLE XII, TABLE XIII.
  • [98] T. Yang and A. B. Chan (2018) Learning dynamic memory networks for object tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 152–167. Cited by: §1, §2.1.
  • [99] T. Yang, P. Xu, R. Hu, H. Chai, and A. B. Chan (2020) ROAM: recurrently optimizing tracking model. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6718–6727. Cited by: §2.1.
  • [100] D. Yeo, J. Son, B. Han, and J. Hee Han (2017) Superpixel-based tracking-by-segmentation using markov chains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1812–1821. Cited by: §2.3.
  • [101] W. Yuan, M. Y. Wang, and Q. Chen (2020) Self-supervised object tracking with cycle-consistent siamese networks. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 10351–10358. Cited by: §2.1, §6.2, TABLE VII.
  • [102] L. Zhang, Y. Li, and R. Nevatia (2008) Global data association for multi-object tracking using network flows. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §5.
  • [103] Y. Zhang, Z. Wu, H. Peng, and S. Lin (2020) A transductive approach for video object segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6949–6958. Cited by: §2.2.
  • [104] J. Zheng, C. Ma, H. Peng, and X. Yang (2021) Learning to track objects from unlabeled videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 13546–13555. Cited by: §2.1, §6.2, §6.2, TABLE X, TABLE VII.
  • [105] Z. Zhou, W. Pei, X. Li, H. Wang, F. Zheng, and Z. He (2021) Saliency-associated object tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9866–9875. Cited by: §2.1.
  • [106] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, and W. Hu (2018) Distractor-aware siamese networks for visual object tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 101–117. Cited by: §1, §2.1, §6.2, §6.2, §6.2, TABLE VI.