End-to-End Referring Video Object Segmentation with Multimodal Transformers

by Adam Botach, et al.

The referring video object segmentation task (RVOS) involves segmentation of a text-referred object instance in the frames of a given video. Due to the complex nature of this multimodal task, which combines text reasoning, video understanding, instance segmentation and tracking, existing approaches typically rely on sophisticated pipelines in order to tackle it. In this paper, we propose a simple Transformer-based approach to RVOS. Our framework, termed Multimodal Tracking Transformer (MTTR), models the RVOS task as a sequence prediction problem. Following recent advancements in computer vision and natural language processing, MTTR is based on the realization that video and text can both be processed together effectively and elegantly by a single multimodal Transformer model. MTTR is end-to-end trainable, free of text-related inductive bias components and requires no additional mask-refinement post-processing steps. As such, it simplifies the RVOS pipeline considerably compared to existing methods. Evaluation on standard benchmarks reveals that MTTR significantly outperforms previous art across multiple metrics. In particular, MTTR shows impressive +5.7 and +5.0 mAP gains on the A2D-Sentences and JHMDB-Sentences datasets respectively, while processing 76 frames per second. In addition, we report strong results on the public validation set of Refer-YouTube-VOS, a more challenging RVOS dataset that has yet to receive the attention of researchers. The code to reproduce our experiments is available at https://github.com/mttr2021/MTTR




1 Introduction

Figure 1: Given a text query and a sequence of video frames, the proposed model outputs prediction sequences for all object instances in the video prior to determining the referred instance. Here predictions with the same color and shape belong to the same sequence and attend to the same object instance in different frames. Note that the order of instance predictions for different frames remains the same. Best viewed in color.

Attention-based [vaswani2017attention] deep neural networks exhibit impressive performance on various tasks across different fields, from computer vision [ViT_2021, liu2021swin] to natural language processing [Devlin2019BERTPO, brown2020gpt3]. These advancements make networks of this sort, such as the Transformer [vaswani2017attention], particularly interesting candidates for solving multimodal problems. By relying on the self-attention mechanism, which allows each token in a sequence to globally aggregate information from every other token, Transformers excel at modeling global dependencies and have become the cornerstone of most NLP tasks [Devlin2019BERTPO, yang2019xlnet, radford2019language, brown2020gpt3]. Transformers have also started showing promise in solving computer vision tasks, from recognition [ViT_2021] to object detection [carion2020detr], even outperforming the long-used CNNs as general-purpose vision backbones [liu2021swin].

The referring video object segmentation task (RVOS) involves the segmentation of a text-referred object instance in the frames of a given video. Compared with the more widely researched referring image segmentation task (RIS) [yu2016refcoco, mao2016refcocoplus], in which objects are mainly referred to by their appearance, in RVOS objects can also be referred to by the actions they are performing or in which they are involved. This renders RVOS significantly more complicated than RIS, as text expressions that refer to actions often cannot be properly deduced from a single static frame. Furthermore, unlike their image-based counterparts, RVOS methods may also be required to establish data association of the referred object across multiple frames (i.e., tracking) in order to deal with disturbances such as occlusions or motion blur.

To solve these challenges and effectively align video with text, existing RVOS approaches [hui2021cstm, liu2021cmpc, ning2020polar] typically rely on complicated pipelines. In contrast, here we propose a simple, end-to-end Transformer-based approach to RVOS. Using recent advancements in Transformers for textual feature extraction [vaswani2017attention, liu2019roberta], visual feature extraction [ViT_2021, liu2021swin, liu2021vswin] and object detection [carion2020detr, wang2021vistr], we develop a framework that significantly outperforms existing approaches. To accomplish this, we employ a single multimodal Transformer and model the task as a sequence prediction problem. Given a video and a text query, our model generates prediction sequences for all objects in the video before determining the one the text refers to. Additionally, our method is free of text-related inductive bias modules and utilizes a simple cross-entropy loss to align the video and the text. As such, it is much less complicated than previous approaches to the task.

The proposed pipeline is schematically depicted in Fig. 1. First, we extract linguistic features from the text query using a standard Transformer-based text encoder, and visual features from the video frames using a spatio-temporal encoder. The features are then passed into a multimodal Transformer, which outputs several sequences of object predictions [wang2021vistr]. Next, to determine which of the predicted sequences best corresponds to the referred object, we compute a text-reference score for each sequence. For this we propose a temporal segment voting scheme that allows our model to focus on more relevant parts of the video when making the decision.

Our main contributions are as follows:

  • We present a Transformer-based RVOS framework, dubbed Multimodal Tracking Transformer (MTTR), which models the task as a parallel sequence prediction problem and outputs predictions for all objects in the video prior to selecting the one referred to by the text.

  • Our sequence selection strategy is based on a temporal segment voting scheme, a novel reasoning scheme that allows our model to focus on the parts of the video that are most relevant to the text.

  • The proposed method is end-to-end trainable, free of text-related inductive bias modules, and requires no additional mask refinement. As such, it greatly simplifies the RVOS pipeline compared to existing approaches.

  • We thoroughly evaluate our method on three benchmarks. On the A2D-Sentences and JHMDB-Sentences [gavrilyuk2018a2d] datasets, MTTR significantly outperforms all existing methods across all metrics. Moreover, we report strong results on the public validation set of Refer-YouTube-VOS [seo2020urvos], a more challenging dataset that has yet to receive attention in the literature.

2 Related Work

Referring video object segmentation.

The RVOS task was originally introduced by gavrilyuk2018a2d, whose goal was to attain pixel-level segmentation of actors and their actions in video content. To effectively aggregate and align visual, temporal and lingual information from video and text, state-of-the-art approaches to RVOS typically rely on complicated pipelines [wang2019acga, wang2020cmd, ning2020polar, McIntosh_2020_CVPR, liu2021cmpc]. gavrilyuk2018a2d proposed an I3D-based [carreira2017i3d] encoder-decoder architecture that generated dynamic filters from text features and convolved them with visual features to obtain the segmentation masks. Following them, wang2020cmd added spatial context to the kernels by incorporating deformable convolutions [Dai_2017_ICCV]. For a more effective representation than convolutions, VT-Capsule [McIntosh_2020_CVPR] encoded each modality in capsules [capsule_networks_nips_2017], while ACGA [wang2019acga] utilized a co-attention mechanism to enhance the multimodal features. To improve positional relation representations in the text, PRPE [ning2020polar] explored a positional encoding mechanism based on the polar coordinate system. URVOS [seo2020urvos] improved tracking capabilities by performing language-based object segmentation using the key frame in the video and propagating the predicted mask throughout the video. Differently from others, AAMN [yang2020aamn] utilized a top-down approach where an off-the-shelf object detector is used to localize objects in the video prior to parsing relations between visual and textual features. Recently, CMPC-V [liu2021cmpc] achieved state-of-the-art results by constructing a temporal graph using video and text features, and applying graph convolution [graph_conf_ICLR_2017] to detect the referred entity.

Figure 2: A detailed overview of MTTR. First, the input text and video frames are passed through feature encoders and then concatenated into multimodal sequences (one per frame). A multimodal Transformer then encodes the feature relations and decodes instance-level features into a set of prediction sequences. Next, corresponding mask and reference prediction sequences are generated. Finally, the predicted sequences are matched with the ground truth sequences for supervision (in training) or used to generate the final prediction (during inference).


The Transformer [vaswani2017attention] was introduced as an attention-based building block for sequence-to-sequence machine translation, and since then has become the cornerstone for most NLP tasks [Devlin2019BERTPO, yang2019xlnet, radford2019language, brown2020gpt3]. Unlike previous architectures, the Transformer relies entirely on the attention mechanism to draw dependencies between input and output.

Recently, the introduction of Transformers to computer vision tasks has demonstrated spectacular performance. DETR [carion2020detr], which utilizes a non-auto-regressive Transformer, significantly simplifies the traditional object detection pipeline while achieving performance comparable to that of CNN-based detectors [faster_rcnn_NIPS_2017]. Given a fixed set of learned object queries, DETR reasons about the global context of an image and the relations between its objects and then outputs a final set of detection predictions in parallel. VisTR [wang2021vistr] extends the idea behind DETR to video instance segmentation. It views the task as a direct end-to-end parallel sequence prediction problem. By supervising video instances at the sequence level as a whole, VisTR is able to output an ordered sequence of masks for each instance in a video directly (i.e., natural tracking).

ViT [ViT_2021] introduces the Transformer to image recognition by using linearly projected fixed-sized patches as tokens for a standard Transformer encoder. liu2021swin introduced the Swin Transformer, a general-purpose backbone for computer vision based on a hierarchical Transformer whose representations are computed inside shifted windows. The Swin architecture was also recently extended to the video domain [liu2021vswin], which we adapt as our temporal encoder.

Another recent relevant work is MDETR [kamath2021mdetr], a DETR-based end-to-end multimodal detector that detects objects in an image conditioned on a text query. Different from our method, their approach is designed to work on static images, and its performance largely depends on well-annotated datasets that contain aligned text and box annotations, the types of which are not available in the RVOS task.

3 Method

3.1 Method Overview

Task definition.

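The input of RVOS consists of a frame sequence $\mathcal{V} = \{v_i\}_{i=1}^{T}$, where $v_i \in \mathbb{R}^{C \times H_0 \times W_0}$, and a text query $\mathcal{T} = \{t_i\}_{i=1}^{L}$, where $t_i$ is the $i$-th word in the text. Then, for a subset of frames of interest $\mathcal{V}_I \subseteq \mathcal{V}$ of size $T_I$, the goal is to segment the object referred to by $\mathcal{T}$ in each frame in $\mathcal{V}_I$. We note that since producing mask annotations requires significant effort, $\mathcal{V}_I$ rarely contains all of the frames in $\mathcal{V}$.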

Feature extraction.

We begin by extracting features from each frame in the sequence using a deep spatio-temporal encoder. Simultaneously, linguistic features are extracted from the text query using a Transformer-based [vaswani2017attention] text encoder. Then, the spatio-temporal and linguistic features are linearly projected to a shared dimension $D$.

Instance prediction.

In the next step, the features of each frame of interest are flattened and separately concatenated with the text embeddings, producing a set of multimodal sequences. These sequences are fed in parallel into a Transformer [vaswani2017attention, carion2020detr]. In the Transformer’s encoder layers, the textual embeddings and the visual features of each frame exchange information. Then, the decoder layers, which are fed with $N_q$ object queries per input frame, query the multimodal sequences for entity-related information and store it in the object queries. Corresponding queries of different frames share the same trainable weights and are trained to attend to the same instance in the video (each one in its designated frame). We refer to these queries (represented by the same unique color and shape in Figs. 1 and 2) as queries belonging to the same instance sequence. This design allows for natural tracking of each object instance in the video [wang2021vistr].

Output generation.

For each instance sequence output by the Transformer, we generate a corresponding mask sequence. To accomplish this we use an FPN-like [lin2017fpn] spatial decoder and dynamically generated conditional convolution kernels [tian2020conditional, wang2021maxdeeplab]. Finally, we use a novel text-reference score function that, based on mask and text associations, determines which of the object query sequences has the strongest association with the object described in the text query $\mathcal{T}$, and returns its segmentation sequence as the model’s prediction.

3.2 Temporal Encoder

A suitable temporal encoder for the RVOS task should be able to extract both visual characteristics (e.g., shape, size, location) and action semantics for each instance in the video. Several previous works [gavrilyuk2018a2d, ning2020polar, liu2021cmpc] utilized the Kinetics-400 [kay2017kinetics] pre-trained I3D network [carreira2017i3d] as their temporal encoder. However, since I3D was originally designed for action classification, using its outputs as-is for tasks that require fine details (e.g., instance segmentation) is not ideal as the features it outputs tend to suffer from spatial misalignment caused by temporal downsampling. To compensate for this side effect, past state-of-the-art approaches came up with different solutions, from auxiliary mask refinement algorithms [krahenbuhl2012crf, liu2021cmpc] to utilizing additional backbones that operate alongside the temporal encoder [hui2021cstm]. In contrast, our end-to-end approach does not require any additional mask refinement steps and utilizes a single backbone.

Recently, the Video Swin Transformer [liu2021vswin] was proposed as a generalization of the Swin Transformer [liu2021swin] to the video domain. While the original Swin was designed with dense predictions (such as segmentation) in mind, Video Swin was tested mainly on action recognition benchmarks. To the best of our knowledge, we are the first to utilize it (with a slight modification) for video segmentation. As opposed to I3D, Video Swin contains just a single temporal downsampling layer and can be modified easily to output per-frame feature maps (more details in the supplementary material). As such, it is a much better choice for processing a full sequence of consecutive video frames for segmentation purposes.

3.3 Multimodal Transformer

For each frame of interest, the temporal encoder generates a feature map $f^{v_I}_t \in \mathbb{R}^{H \times W \times C_v}$, and the text encoder outputs a linguistic embedding $f_{\mathcal{T}} \in \mathbb{R}^{L \times C_t}$ for the text, where $L$ is the number of text tokens. These visual and linguistic features are linearly projected to a shared dimension $D$. The features of each frame are then flattened and separately concatenated with the text embeddings, resulting in a set of multimodal sequences (one per frame of interest), each of shape $(H \times W + L) \times D$. The multimodal sequences, along with a set of $N_q$ instance sequences, are then fed in parallel into a Transformer as described earlier. Our Transformer architecture is similar to the one used in DETR [carion2020detr]. Accordingly, the problem now comes down to finding the instance sequence that attends to the text-referred object.
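The flatten-and-concatenate step above can be illustrated with a small, dependency-free Python sketch. The helper names (`flatten_frame`, `build_multimodal_sequences`) and the toy sizes are hypothetical and chosen only for illustration; the features are assumed to be already projected to the shared dimension $D$, and the visual tokens are assumed to precede the text tokens in each sequence.

```python
from typing import List

Vector = List[float]


def flatten_frame(feature_map: List[List[Vector]]) -> List[Vector]:
    """Flatten an H x W grid of D-dim vectors into H*W tokens (row-major)."""
    return [pixel for row in feature_map for pixel in row]


def build_multimodal_sequences(
    frame_maps: List[List[List[Vector]]], text_tokens: List[Vector]
) -> List[List[Vector]]:
    """Build one sequence per frame: H*W visual tokens followed by L text tokens."""
    return [flatten_frame(fm) + list(text_tokens) for fm in frame_maps]


# Toy sizes (hypothetical): 2 frames of interest, a 2 x 3 feature map,
# 4 text tokens, and a shared dimension of 5.
T_I, H, W, L, D = 2, 2, 3, 4, 5
frames = [[[[float(t)] * D for _ in range(W)] for _ in range(H)] for t in range(T_I)]
text = [[-1.0] * D for _ in range(L)]

sequences = build_multimodal_sequences(frames, text)
print(len(sequences), len(sequences[0]), len(sequences[0][0]))  # prints "2 10 5"
```

Each of the resulting sequences has $H \times W + L = 10$ tokens of dimension $D = 5$, and the same text tokens are shared by every frame.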

3.4 The Instance Segmentation Process

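Our segmentation process, as shown in Fig. 2, consists of several steps. First, given $\mathcal{F}_E$, the updated multimodal sequences output by the last Transformer encoder layer, we extract and reshape the video-related part of each sequence (i.e., the first $H \times W$ tokens) into the set $\mathcal{F}'_{\mathcal{V}_I}$. Then, we take $\mathcal{F}^{1}_{\mathcal{V}_I}, \dots, \mathcal{F}^{n-1}_{\mathcal{V}_I}$, the outputs of the first $n-1$ blocks of our temporal encoder, and hierarchically fuse them with $\mathcal{F}'_{\mathcal{V}_I}$ using an FPN-like [lin2017fpn] spatial decoder $G_{Seg}$. This process results in semantically rich, high-resolution feature maps of the video frames, denoted as $\mathcal{F}_{Seg}$:

$$\mathcal{F}_{Seg} = \{f^{t}_{Seg}\}_{t=1}^{T_I}, \quad f^{t}_{Seg} \in \mathbb{R}^{D_s \times H_s \times W_s}, \tag{1}$$

where $T_I$ is the number of frames of interest, $D_s$ is the channel dimension of the decoder output and $H_s \times W_s$ is its spatial resolution.

Next, for each instance sequence $\mathcal{Q} = \{q_t\}_{t=1}^{T_I}$, $q_t \in \mathbb{R}^{D}$, output by the Transformer decoder, we use a two-layer perceptron $G_{Kernel}$ to generate a corresponding sequence of conditional segmentation kernels [tian2020conditional, wang2021maxdeeplab]:

$$G_{Kernel}(\mathcal{Q}) = \{k_t\}_{t=1}^{T_I}, \quad k_t \in \mathbb{R}^{D_s}. \tag{2}$$

Finally, a sequence of segmentation masks $\mathcal{M}$ is generated for $\mathcal{Q}$ by convolving each segmentation kernel with its corresponding frame features, followed by a bilinear upsampling operation to resize the masks into the ground-truth resolution $H_0 \times W_0$:

$$\mathcal{M} = \{m_t\}_{t=1}^{T_I}, \quad m_t = \mathrm{Upsample}(k_t * f^{t}_{Seg}) \in \mathbb{R}^{H_0 \times W_0}. \tag{3}$$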
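To make the kernel-based mask generation concrete, the following is a minimal sketch in plain Python. It assumes that each conditional kernel is a single vector with one weight per feature channel, so that the convolution reduces to a per-pixel dot product (a 1x1 convolution); the bilinear upsampling step is omitted. The function names and toy values are hypothetical.

```python
from typing import List

Grid = List[List[float]]


def apply_kernel(kernel: List[float], features: List[Grid]) -> Grid:
    """Per-pixel dot product between a kernel (one weight per channel)
    and a feature map stored as a list of channels, each an h x w grid."""
    h, w = len(features[0]), len(features[0][0])
    return [
        [sum(k * channel[y][x] for k, channel in zip(kernel, features)) for x in range(w)]
        for y in range(h)
    ]


def mask_sequence(kernels: List[List[float]], frame_features: List[List[Grid]]) -> List[Grid]:
    """One mask-logit map per frame: kernel t is applied to the features of frame t."""
    return [apply_kernel(k, f) for k, f in zip(kernels, frame_features)]


# Toy example: one frame, two channels, a 2 x 2 feature map.
features = [
    [[1.0, 0.0], [0.0, 1.0]],  # channel 0
    [[0.0, 2.0], [2.0, 0.0]],  # channel 1
]
kernel = [1.0, -0.5]

logits = mask_sequence([kernel], [features])[0]
binary_mask = [[value > 0 for value in row] for row in logits]
print(logits)  # [[1.0, -1.0], [-1.0, 1.0]]
```

Because the kernel is generated from the instance query, the same frame features yield a different mask for each instance sequence.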


3.5 Instance Sequence Matching

During the training process we need to determine which of the predicted instance sequences best fits the referred object. However, if the video sequence contains additional annotated instances, we found that supervising their detection (as negative examples) alongside that of the referred instance helps stabilize the training process.

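Let us denote by $\mathbf{y}$ the set of ground-truth sequences that are available for the frames of interest $\mathcal{V}_I$, and by $\hat{\mathbf{y}} = \{\hat{y}_i\}_{i=1}^{N_q}$ the set of the predicted instance sequences. We assume that the number of predicted sequences $N_q$ is chosen to be strictly greater than the number of annotated instances $N_i$ (i.e., $N_q > N_i$) and that the ground-truth sequences set is padded with $\emptyset$ (no object) to fill any missing slots. Then, we want to find a matching between the two sets [carion2020detr, wang2021vistr]. Accordingly, we search for a permutation $\hat{\sigma} \in S_{N_q}$ with the lowest possible cost:

$$\hat{\sigma} = \arg\min_{\sigma \in S_{N_q}} \sum_{i=1}^{N_q} \mathcal{C}_{Match}(\hat{y}_{\sigma(i)}, y_i), \tag{4}$$

where $\mathcal{C}_{Match}$ is a pair-wise matching cost. The optimal permutation can be computed efficiently using the Hungarian algorithm [kuhn1955hungarian]. Each ground-truth sequence is of the form

$$y = (\mathbf{m}, \mathbf{r}) = (\{m_t\}_{t=1}^{T_I}, \{r_t\}_{t=1}^{T_I}), \tag{5}$$

where $T_I$ is the number of frames of interest, $m_t$ is a ground-truth mask, and $r_t \in \{0, 1\}^2$ is a one-hot referring vector, i.e., the positive class means that $y$ corresponds to the text-referred object and that this object is visible in the corresponding video frame $v_t$.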

To allow our model to produce reference predictions in the form of Eq. 5, we use a reference prediction head, denoted $G_{Ref}$, which consists of a single linear layer of shape $D \times 2$ followed by a softmax layer. Given a predicted object query $q \in \mathbb{R}^{D}$, this head takes $q$ as input and outputs a reference prediction $\hat{r}$.
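As an illustration, a linear layer with two outputs followed by a softmax can be written in a few lines of plain Python. This is only a sketch: the bias term, the class order (index 0 for "referred", index 1 for "unreferred") and the toy weights are assumptions made for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reference_head(query: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Linear layer mapping a query of any dimension to two logits, then softmax.

    weight holds two rows (one per class), each as long as the query.
    """
    logits = [sum(w * q for w, q in zip(row, query)) + b for row, b in zip(weight, bias)]
    return softmax(logits)


# Toy example with a 3-dimensional query (hypothetical values).
query = [1.0, 2.0, -1.0]
weight = [[0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]  # logits become 1.5 and -1.0
bias = [0.0, 0.0]

r_hat = reference_head(query, weight, bias)
print(r_hat)  # two probabilities that sum to 1; index 0 is the "referred" class here
```

Applying the head to every query of an instance sequence yields one reference prediction per frame of interest.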

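Thus, each prediction of our model is a pair of sequences:

$$\hat{y} = (\hat{\mathbf{m}}, \hat{\mathbf{r}}) = (\{\hat{m}_t\}_{t=1}^{T_I}, \{\hat{r}_t\}_{t=1}^{T_I}). \tag{6}$$

We define the pair-wise matching cost function as the sum of two terms:

$$\mathcal{C}_{Match}(\hat{y}, y) = \lambda_d \, \mathcal{C}_{Dice}(\hat{\mathbf{m}}, \mathbf{m}) + \lambda_r \, \mathcal{C}_{Ref}(\hat{\mathbf{r}}, \mathbf{r}), \tag{7}$$

where $\lambda_d, \lambda_r \in \mathbb{R}$ are hyperparameters. The first term, $\mathcal{C}_{Dice}$, supervises the predicted mask sequence using the ground-truth mask sequence by averaging the negation of the Dice coefficients [milletari2016vnet] of each pair of corresponding masks at every time step. The second term, $\mathcal{C}_{Ref}$, supervises the reference predictions using the corresponding ground-truth sequence as follows:

$$\mathcal{C}_{Ref}(\hat{\mathbf{r}}, \mathbf{r}) = -\frac{1}{T_I} \sum_{t=1}^{T_I} \hat{r}_t \cdot r_t. \tag{8}$$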

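| Method | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | IoU Overall | IoU Mean | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hu2016segmentation | 34.8 | 23.6 | 13.3 | 3.3 | 0.1 | 47.4 | 35.0 | 13.2 |
| gavrilyuk2018a2d (RGB) | 47.5 | 34.7 | 21.1 | 8.0 | 0.2 | 53.6 | 42.1 | 19.8 |
| RefVOS [bellver2020refvos] | 57.8 | - | - | - | 9.3 | 67.2 | 49.7 | - |
| AAMN [yang2020aamn] | 68.1 | 62.9 | 52.3 | 29.6 | 2.9 | 61.7 | 55.2 | 39.6 |
| CMSA+CFSA [ye2021cfsa] | 48.7 | 43.1 | 35.8 | 23.1 | 5.2 | 61.8 | 43.2 | - |
| CSTM [hui2021cstm] | 65.4 | 58.9 | 49.7 | 33.3 | 9.1 | 66.2 | 56.1 | 39.9 |
| CMPC-V (I3D) [liu2021cmpc] | 65.5 | 59.2 | 50.6 | 34.2 | 9.8 | 65.3 | 57.3 | 40.4 |
| MTTR ($w=8$, ours) | 72.1 | 68.4 | 60.7 | 45.6 | 16.4 | 70.2 | 61.8 | 44.7 |
| MTTR ($w=10$, ours) | 75.4 | 71.2 | 63.8 | 48.5 | 16.9 | 72.0 | 64.0 | 46.1 |

Table 1: Comparison with state-of-the-art methods on A2D-Sentences [gavrilyuk2018a2d]. Precision is given in percent at IoU thresholds 0.5 to 0.9, $w$ denotes the window size, and "-" marks values that are not reported.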
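| Method | P@0.5 | P@0.6 | P@0.7 | P@0.8 | P@0.9 | IoU Overall | IoU Mean | mAP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| hu2016segmentation | 63.3 | 35.0 | 8.5 | 0.2 | 0.0 | 54.6 | 52.8 | 17.8 |
| gavrilyuk2018a2d (RGB) | 69.9 | 46.0 | 17.3 | 1.4 | 0.0 | 54.1 | 54.2 | 23.3 |
| AAMN [yang2020aamn] | 77.3 | 62.7 | 36.0 | 4.4 | 0.0 | 58.3 | 57.6 | 32.1 |
| CMSA+CFSA [ye2021cfsa] | 76.4 | 62.5 | 38.9 | 9.0 | 0.1 | 62.8 | 58.1 | - |
| CSTM [hui2021cstm] | 78.3 | 63.9 | 37.8 | 7.6 | 0.0 | 59.8 | 60.4 | 33.5 |
| CMPC-V (I3D) [liu2021cmpc] | 81.3 | 65.7 | 37.1 | 7.0 | 0.0 | 61.6 | 61.7 | 34.2 |
| MTTR ($w=8$, ours) | 91.0 | 81.5 | 57.0 | 14.4 | 0.1 | 67.4 | 67.9 | 36.6 |
| MTTR ($w=10$, ours) | 93.9 | 85.2 | 61.6 | 16.6 | 0.1 | 70.1 | 69.8 | 39.2 |

Table 2: Comparison with state-of-the-art methods on JHMDB-Sentences [gavrilyuk2018a2d]. Precision is given in percent at IoU thresholds 0.5 to 0.9, $w$ denotes the window size, and "-" marks values that are not reported.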
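The matching step of Sec. 3.5 can be sketched in plain Python as follows. This is a toy illustration under several assumptions rather than the released implementation: the optimal assignment is found by brute-force enumeration instead of the Hungarian algorithm, the reference cost is taken as the negated mean inner product between predicted and ground-truth referring vectors, padded (empty) ground-truth slots are assumed to add a constant cost and are therefore skipped, and index 0 of a referring vector is treated as the positive class. All names and values are hypothetical.

```python
from itertools import permutations


def dice(pred, gt, eps=1e-6):
    """Dice coefficient between two flattened masks with values in [0, 1]."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)


def match_cost(pred, gt, lambda_d=1.0, lambda_r=1.0):
    """Pair-wise cost between a predicted and a ground-truth sequence.

    Each sequence is (masks, refs): masks[t] is a flat mask and refs[t]
    is a 2-vector, one entry per time step.
    """
    (pred_masks, pred_refs), (gt_masks, gt_refs) = pred, gt
    steps = len(gt_masks)
    c_dice = -sum(dice(p, g) for p, g in zip(pred_masks, gt_masks)) / steps
    c_ref = -sum(sum(a * b for a, b in zip(p, g)) for p, g in zip(pred_refs, gt_refs)) / steps
    return lambda_d * c_dice + lambda_r * c_ref


def best_assignment(preds, gts):
    """Try every way of giving each ground truth a distinct prediction.

    Returns (assignment, cost), where assignment[i] is the index of the
    prediction matched to ground truth i.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[j], gts[i]) for i, j in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost


# Toy example: one time step, 4-pixel masks, 2 annotated instances, 3 predictions.
gts = [
    ([[1, 1, 0, 0]], [[1, 0]]),  # referred instance
    ([[0, 0, 1, 1]], [[0, 1]]),  # un-referred instance
]
preds = [
    ([[0.0, 0.0, 0.9, 0.8]], [[0.1, 0.9]]),
    ([[0.9, 0.9, 0.1, 0.0]], [[0.8, 0.2]]),
    ([[0.1, 0.0, 0.0, 0.1]], [[0.3, 0.7]]),
]

assignment, cost = best_assignment(preds, gts)
print(assignment)  # (1, 0): prediction 1 matches the referred instance
```

Brute-force enumeration is only practical for a handful of sequences. For realistic sizes, a Hungarian-algorithm solver such as `scipy.optimize.linear_sum_assignment` would typically be applied to the full cost matrix instead.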

3.6 Loss Functions

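Let us denote (with a slight abuse of notation) by $\hat{\mathbf{y}}$ the set of predicted instance sequences permuted according to the optimal permutation $\hat{\sigma}$. Then, we can define our loss function as follows:

$$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = \sum_{i=1}^{N_q} \left[ \mathcal{L}_{Mask}(\hat{\mathbf{m}}_i, \mathbf{m}_i) + \mathcal{L}_{Ref}(\hat{\mathbf{r}}_i, \mathbf{r}_i) \right], \tag{9}$$

where $N_q$ is the number of predicted instance sequences. Following VisTR [wang2021vistr], the first term, dubbed $\mathcal{L}_{Mask}$, ensures mask alignment between the predicted and ground-truth sequences. As such, this term is defined as a combination of the Dice [milletari2016vnet] and Focal [lin2017focal] loss functions:

$$\mathcal{L}_{Mask}(\hat{\mathbf{m}}_i, \mathbf{m}_i) = \lambda_d \, \mathcal{L}_{Dice}(\hat{\mathbf{m}}_i, \mathbf{m}_i) + \lambda_f \, \mathcal{L}_{Focal}(\hat{\mathbf{m}}_i, \mathbf{m}_i), \tag{10}$$

where $\lambda_d, \lambda_f \in \mathbb{R}$ are hyperparameters and $\mathcal{L}_{Focal}$ is applied on the masks per-pixel. Additionally, these two losses are normalized by the number of instances inside the batch.

The second loss term, denoted $\mathcal{L}_{Ref}$, is a cross-entropy term that supervises the sequence reference predictions:

$$\mathcal{L}_{Ref}(\hat{\mathbf{r}}_i, \mathbf{r}_i) = -\lambda_r \frac{1}{T_I} \sum_{t=1}^{T_I} r_{i,t} \cdot \log(\hat{r}_{i,t}), \tag{11}$$

where $\lambda_r \in \mathbb{R}$ is a hyperparameter and $T_I$ is the number of frames of interest. In practice we further downweight the terms of the negative (“unreferred”) class by a factor of 10 to account for class imbalance [carion2020detr]. Also, note that the same $\lambda_d$ and $\lambda_r$ are used as weights in the matching cost (7) and loss functions. Intriguingly, despite its simplicity and its lack of explicit text-related inductive bias, $\mathcal{L}_{Ref}$ is able to deliver equivalent or even better performance compared with more complex loss functions [kamath2021mdetr] that we tested. Hence, and for the sake of simplicity, no additional loss functions are used for text supervision in our method.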
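The loss terms described above can be illustrated with the following plain-Python sketch. It is written under stated assumptions and is not the released training code: the focal-loss defaults (`alpha=0.25`, `gamma=2.0`) are the commonly used values from the focal loss paper rather than values given here, the weights `lambda_d`, `lambda_f` and `lambda_r` are placeholders, the per-batch normalization by the number of instances is omitted, index 0 of a referring vector is treated as the positive class, and downweighting "by a factor of 10" is interpreted as multiplying the negative-class terms by 0.1.

```python
import math


def dice_loss(pred, gt, eps=1.0):
    """1 - Dice coefficient for flattened masks with values in [0, 1]."""
    inter = sum(p * g for p, g in zip(pred, gt))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(gt) + eps)


def focal_loss(pred, gt, alpha=0.25, gamma=2.0):
    """Binary focal loss applied per pixel and averaged over the mask."""
    total = 0.0
    for p, g in zip(pred, gt):
        p = min(max(p, 1e-7), 1.0 - 1e-7)  # avoid log(0)
        p_t = p if g == 1 else 1.0 - p
        a_t = alpha if g == 1 else 1.0 - alpha
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(pred)


def mask_loss(pred_masks, gt_masks, lambda_d=1.0, lambda_f=1.0):
    """Weighted Dice + Focal loss, averaged over the time steps of a sequence."""
    per_step = [
        lambda_d * dice_loss(p, g) + lambda_f * focal_loss(p, g)
        for p, g in zip(pred_masks, gt_masks)
    ]
    return sum(per_step) / len(per_step)


def ref_loss(pred_refs, gt_refs, lambda_r=1.0, neg_weight=0.1):
    """Cross-entropy over time steps with a downweighted negative class."""
    total = 0.0
    for p, g in zip(pred_refs, gt_refs):
        pos = g[0] * math.log(max(p[0], 1e-12))
        neg = neg_weight * g[1] * math.log(max(p[1], 1e-12))
        total += -(pos + neg)
    return lambda_r * total / len(gt_refs)


# Toy check: an uncertain prediction costs ten times less on a negative frame.
pos_frame = ref_loss([[0.5, 0.5]], [[1, 0]])
neg_frame = ref_loss([[0.5, 0.5]], [[0, 1]])
print(pos_frame / neg_frame)  # ratio of the two losses, close to 10
```

With `neg_weight=0.1`, frames in which the instance is not the referred object contribute ten times less to the reference loss than frames in which it is, which counteracts the dominance of negative examples.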

3.7 Inference

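For a given sample of video and text, let us denote by $\mathcal{R} = \{\hat{\mathbf{r}}_i\}_{i=1}^{N_q}$ the set of reference prediction sequences output by our model, where $\hat{\mathbf{r}}_i = \{\hat{r}_{i,t}\}_{t=1}^{T_I}$. Additionally, we denote by $p_{ref}(\hat{r}_{i,t})$ the probability of the positive (“referred”) class for a given reference prediction $\hat{r}_{i,t}$. During inference we return the segmentation mask sequence that corresponds to $\hat{\mathbf{r}}_{pred}$, the predicted reference sequence with the highest positive score:

$$\hat{\mathbf{r}}_{pred} = \arg\max_{\hat{\mathbf{r}}_i \in \mathcal{R}} \sum_{t=1}^{T_I} p_{ref}(\hat{r}_{i,t}). \tag{12}$$
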
This sequence selection scheme, which we term the “temporal segment voting scheme”, grades prediction sequences based on the number of terms they contain which directly relate to the referred object. Thus, it allows our model to focus on more relevant parts of the video (in which the referred object is visible), and disregard less relevant parts (which depict other objects or in which the referred object is occluded) when making the decision.
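A minimal Python sketch of this selection rule is given below. It assumes that the score of a sequence is the sum, over the frames of interest, of its positive-class probability, and that the positive class is stored at index 0 of each reference prediction; both are assumptions of the example, as are the function name and the toy probabilities.

```python
def select_referred_sequence(ref_preds, positive_index=0):
    """Return (best_index, scores) for a set of reference prediction sequences.

    ref_preds[i][t] is the 2-class probability vector of sequence i at frame t.
    The score of a sequence is the sum of its positive-class probabilities.
    """
    scores = [sum(r[positive_index] for r in sequence) for sequence in ref_preds]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores


# Toy example: 3 instance sequences over 4 frames.
ref_preds = [
    [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.1, 0.9]],  # confident in one frame only
    [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8], [0.6, 0.4]],  # supported by three frames
    [[0.3, 0.7], [0.3, 0.7], [0.3, 0.7], [0.3, 0.7]],  # never confident
]

best_index, scores = select_referred_sequence(ref_preds)
print(best_index)  # 1
```

In this example the second sequence wins even though the first has the single most confident frame, since the frames vote jointly and a low score in a frame where the object is occluded does not rule a sequence out.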

4 Experiments

To evaluate our approach, we conduct experiments on three referring video object segmentation datasets. The first two, A2D-Sentences and JHMDB-Sentences [gavrilyuk2018a2d], were created by adding textual annotations to the original A2D [xu2015can] and JHMDB [jhuang2013towards] datasets. Each video in A2D has 3–5 frames annotated with pixel-level segmentation masks, while in JHMDB, 2D articulated human puppet masks are available for all frames. Additional details about these datasets are presented in the supplementary material. We adopt Overall IoU, Mean IoU, and precision@K to evaluate our method on these datasets. Overall IoU computes the ratio between the total intersection and the total union area over all the test samples. Mean IoU is the averaged IoU over all the test samples. Precision@K considers the percentage of test samples whose IoU scores are above a threshold K, where $K \in \{0.5, 0.6, 0.7, 0.8, 0.9\}$. We also compute mean average precision (mAP) over 0.50:0.05:0.95 [lin2014coco].
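The three IoU-based metrics defined above can be computed with a short Python sketch. Masks are represented here as flat lists of 0/1 values, the function names are hypothetical, and samples with an empty union are assigned an IoU of 0 as a convention of this example. The mAP metric is deliberately left out, since it is computed separately with the COCO API.

```python
def iou_counts(pred, gt):
    """Intersection and union pixel counts for two flat binary masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter, union


def evaluate(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """Return (overall_iou, mean_iou, precision_at_k) over all test samples."""
    counts = [iou_counts(p, g) for p, g in zip(preds, gts)]
    ious = [inter / union if union > 0 else 0.0 for inter, union in counts]
    overall_iou = sum(inter for inter, _ in counts) / sum(union for _, union in counts)
    mean_iou = sum(ious) / len(ious)
    precision_at_k = {k: sum(iou > k for iou in ious) / len(ious) for k in thresholds}
    return overall_iou, mean_iou, precision_at_k


# Toy example with two samples: IoU = 2/3 for the first and 8/9 for the second.
preds = [[1, 1, 1, 0], [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]]
gts = [[1, 1, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]]

overall_iou, mean_iou, precision_at_k = evaluate(preds, gts)
print(round(overall_iou, 4), round(mean_iou, 4), precision_at_k[0.7])  # 0.8333 0.7778 0.5
```

The example also shows why Overall IoU and Mean IoU differ: Overall IoU pools pixels across samples and is therefore dominated by large objects, whereas Mean IoU weights every sample equally.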

We want to note that we found inconsistencies in the mAP metric calculation in previous studies. For example, examination of published code revealed incorrect calculation of the metric as the average of the precision@K metric over several K values. To avoid further confusion and ensure a fair comparison, we suggest adopting the COCO API (https://github.com/cocodataset/cocoapi) for mAP calculation. For reference, a full implementation of the evaluation that utilizes the API is released with our code.

We further evaluate MTTR on the more challenging Refer-YouTube-VOS dataset, introduced by seo2020urvos, who provided textual annotations for the original YouTube-VOS dataset [xu2018youtube]. Each video has pixel-level instance segmentation annotations for every fifth frame. The original release of Refer-YouTube-VOS contains two subsets. One subset contains first-frame expressions that describe only the first frame. The other contains full-video expressions that are based on the whole video and are, therefore, more challenging. Following the introduction of the RVOS competition (https://youtube-vos.org/dataset/rvos/), only the more challenging subset of the dataset is publicly available now. Since ground-truth annotations are available only for the training samples and the test server is currently inaccessible, we report results on the validation samples by uploading our predictions to the competition’s server (https://competitions.codalab.org/competitions/29139). The primary evaluation metrics for this dataset are the average of the region similarity ($\mathcal{J}$) and the contour accuracy ($\mathcal{F}$).

4.1 Implementation Details

As our temporal encoder we use the smallest (“tiny”) version of the Video Swin Transformer [liu2021vswin] pretrained on Kinetics-400 [kay2017kinetics]. The original Video Swin consists of four blocks with decreasing spatial resolution. We found the output of the fourth block to be too small for small object detection and hence we only utilize the first three blocks. We use the output of the third block as the input of the multimodal Transformer, while the outputs of the earlier blocks are fed into the spatial decoder. We also modify the encoder’s single temporal downsampling layer to output per-frame feature maps as required by our model. As our text encoder we use the Hugging Face [wolf2020Transformers] implementation of RoBERTa-base [liu2019roberta]. For A2D-Sentences [gavrilyuk2018a2d] we feed the model windows of frames with the annotated target frame in the middle. Each frame is resized such that the shorter side is at least 320 pixels and the longer side is at most 576 pixels. For Refer-YouTube-VOS [seo2020urvos], we use windows of consecutive annotated frames during training, and full-length videos (up to 36 annotated frames) during evaluation. Each frame is resized such that the shorter side is at least 360 pixels and the longer side is at most 640 pixels. We do not use any segmentation-related pretraining, e.g., on COCO [lin2014coco], which is known to boost segmentation performance [wang2021vistr]. We refer the reader to the supplementary material for more details.
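The construction of an input window around an annotated target frame can be sketched as follows. The function name is hypothetical, and the boundary handling (clamping indices to the video bounds, which repeats the first or last frame) as well as the placement of the target for even window sizes are assumptions of this example, not details stated above.

```python
def window_indices(target, window_size, num_frames):
    """Indices of window_size consecutive frames with the target in the middle.

    Indices that fall outside the video are clamped to the valid range
    [0, num_frames - 1], so border frames may be repeated.
    """
    start = target - window_size // 2
    return [min(max(i, 0), num_frames - 1) for i in range(start, start + window_size)]


print(window_indices(10, 8, 100))  # [6, 7, 8, 9, 10, 11, 12, 13]
print(window_indices(1, 5, 100))   # [0, 0, 1, 2, 3]
```

For an even window size the target cannot sit exactly in the center, so this sketch places it at position `window_size // 2` within the window.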

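| Method | Validation set | $\mathcal{J}\&\mathcal{F}$ | $\mathcal{J}$ | $\mathcal{F}$ |
| --- | --- | --- | --- | --- |
| URVOS [seo2020urvos] | original | 47.23 | 45.27 | 49.19 |
| CMPC-V (I3D) [liu2021cmpc] | original | 47.48 | 45.64 | 49.32 |
| ding2021progressive † | public | 54.8 | 53.7 | 56.0 |
| MTTR (ours) | public | 55.32 | 54.00 | 56.64 |

Table 3: Results on Refer-YouTube-VOS. The upper half is evaluated on the original validation set, while the bottom half is evaluated on the public validation set. † denotes an ensemble.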
Figure 3: Visual examples of MTTR’s performance on the Refer-YouTube-VOS [seo2020urvos] validation set. Best viewed in color.

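(a) Non-temporal backbones.

| Method | IoU Overall | IoU Mean | mAP |
| --- | --- | --- | --- |
| CMPC-I [liu2021cmpc] | 64.9 | 51.5 | 35.1 |
| MTTR (DeepLab-ResNet101) | 67.5 | 60.2 | 41.2 |
| MTTR (Video Swin-T, $w=1$) | 68.9 | 60.3 | 41.8 |

(b) Input window size effect. Rows are ordered by increasing window size $w$; the first, fourth and fifth rows correspond to $w=1$, $w=8$ and $w=10$.

| IoU Overall | IoU Mean | mAP |
| --- | --- | --- |
| 68.9 | 60.3 | 41.8 |
| 69.7 | 61.5 | 43.8 |
| 69.5 | 61.8 | 44.0 |
| 70.2 | 61.8 | 44.7 |
| 72.0 | 64.0 | 46.1 |

(c) Textual feature extractor.

| Method | IoU Overall | IoU Mean | mAP |
| --- | --- | --- | --- |
| RoBERTa (base) | 69.5 | 61.8 | 44.0 |
| Distill-RoBERTa (base) | 70.5 | 62.4 | 43.8 |
| BERT (base) | 69.7 | 62.1 | 44.3 |

Table 4: Ablation studies on the A2D-Sentences [gavrilyuk2018a2d] dataset.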

4.2 Comparison with State-of-the-Art Methods

We evaluate our method on the A2D-Sentences dataset in order to compare its performance with previous state-of-the-art approaches. For fair comparison with existing works [hui2021cstm, liu2021cmpc], our model is trained and evaluated for this purpose with windows of size 8. As shown in Tab. 1, our method significantly outperforms all existing approaches across all metrics. For example, our model shows a 4.3 mAP gain over the current state of the art, and an absolute improvement of 6.6% on the most stringent metric, P@0.9, which demonstrates its ability to generate high-quality masks. We also note that our top configuration (window size $w=10$) achieves a 5.7 mAP gain and a 6.7% absolute improvement on both Mean and Overall IoU compared to the current state of the art. This configuration does so while processing 76 frames per second on a single RTX 3090 GPU.

Following previous works [gavrilyuk2018a2d, liu2021cmpc], we evaluate the generalization ability of our model by evaluating it on JHMDB-Sentences without fine-tuning. We uniformly sample three frames from each video and evaluate our best model on these frames. As shown in Tab. 2, our method generalizes well and outperforms all existing approaches. Note that all methods (including ours) produce low results on P@0.9. This can be attributed to JHMDB’s [jhuang2013towards] imprecise mask annotations which were generated by a coarse human puppet model.

Finally, we report our results on the public validation set of Refer-YouTube-VOS [seo2020urvos] in Tab. 3. As mentioned earlier, this subset contains only the more challenging full-video expressions from the original release of Refer-YouTube-VOS. Compared with existing methods [seo2020urvos, liu2021cmpc] which trained and evaluated on the full version of the dataset, our model demonstrates superior performance across all metrics despite being trained on less data and evaluated exclusively on a more challenging subset. Additionally, our method shows competitive performance compared with the methods that led in the 2021 RVOS competition [liang2021rethinking, ding2021progressive]. We note, however, that these methods use ensembles and are trained on additional segmentation and referring datasets [lin2014coco, xu2018youtube, yu2016refcoco, mao2016refcocoplus].

4.3 Ablation Studies

We conduct several ablation studies on A2D-Sentences to evaluate our model’s design and robustness. For this purpose, unless stated otherwise, we use window size .

Temporal encoder.

To evaluate MTTR’s performance independently of the temporal encoder, we compare it with CMPC-I, the image-targeted version of CMPC-V [liu2021cmpc]. Following CMPC-I, we use DeepLab-ResNet101 [chen2017deeplab] pretrained on PASCAL-VOC [everingham2010pascal] as a visual feature extractor. We train our model using only the target frames (i.e., without additional frames for temporal context). As shown in Tab. 4, our method significantly surpasses CMPC-I across all metrics, including a 6.1 gain in mAP and 8.7% absolute improvement in Mean IoU. In fact, the results in Tabs. 1 and 4 suggest that this configuration of our model surpasses all existing methods regardless of the temporal context.

Temporal context.

In Tab. 4 we study the effect of the temporal context size on MTTR’s performance. A larger temporal context enables better extraction of action-related information. For this purpose, we train and evaluate our model using different window sizes. As expected, widening the temporal context leads to large performance gains, with an mAP gain of 4.3 and an absolute Mean IoU improvement of 3.7% when changing the window size from 1 to 10.

Text encoder.

To study the effect of the text encoder on our model’s performance, we train our model using two additional widely used feature extractors, namely BERT-base [Devlin2019BERTPO] and DistilRoBERTa-base [sanh2019distilbert], a distilled version of RoBERTa [liu2019roberta]. As shown in Tab. 4, our model achieves comparable results regardless of the chosen text encoder, which shows its robustness to this change.

Supervision of un-referred instances.

To study the effect of supervising the detections of un-referred instances alongside that of the referred instance in each sample, we train different configurations of our model without supervision of un-referred instances. Intriguingly, in all such experiments our model immediately converges to a local minimum of the text loss, where the same object query is repeatedly matched with all ground-truth instances, thus leaving the rest of the object queries untrained. In some experiments our model manages to escape this local minimum after a few epochs and then achieves comparable performance with our default configuration. Nevertheless, in other experiments this phenomenon significantly hinders its final mAP score.

4.4 Qualitative Analysis

As illustrated in Fig. 3, MTTR can successfully track and segment text-referred objects even in challenging situations where they are surrounded by similar instances, occluded, or completely outside the camera’s field of view for extensive parts of the video.

5 Conclusion

We introduced MTTR, a simple Transformer-based approach to RVOS that models the task as a sequence prediction problem. Our end-to-end method considerably simplifies existing RVOS pipelines by simultaneously processing both text and video frames in a single multimodal Transformer. Extensive evaluation of our approach on standard benchmarks reveals that our method outperforms existing state-of-the-art methods by a large margin (e.g., a 5.7 mAP improvement on A2D-Sentences). We hope our work will inspire others to see the potential of Transformers for solving complex multimodal tasks.


Appendix A Additional Dataset Details

A.1 A2D-Sentences & JHMDB-Sentences

A2D-Sentences [gavrilyuk2018a2d] contains 3,754 videos (3,017 train, 737 test) with 7 actor classes performing 8 action classes. Additionally, the dataset contains 6,655 sentences describing the actors in the videos and their actions. JHMDB-Sentences [gavrilyuk2018a2d] contains 928 videos along with 928 corresponding sentences describing 21 different action classes.

A.2 Refer-YouTube-VOS

The original release of Refer-YouTube-VOS [seo2020urvos] contains 27,899 text expressions for 7,451 objects in 3,975 videos. The objects belong to 94 common categories. The subset with the first-frame expressions contains 10,897 expressions for 3,412 videos in the train split and 1,993 expressions for 507 videos in the validation split. The subset with the full-video expressions contains 12,913 expressions for 3,471 videos in the train split and 2,096 expressions for 507 videos in the validation split. Following the introduction of the RVOS competition (https://youtube-vos.org/dataset/rvos/), only the more challenging full-video expressions subset is publicly available now, so we use this subset exclusively in our experiments. Additionally, this subset’s original validation set was split into two separate competition validation and test sets of 202 and 305 videos respectively. Since ground-truth annotations are available only for the training set and the test server is currently closed, we report results exclusively on the competition validation set by uploading our predictions to the competition’s server (https://competitions.codalab.org/competitions/29139).

Appendix B Additional Implementation Details

B.1 Temporal Encoder Modifications

The original architecture of Video Swin Transformer [liu2021vswin] contains a single temporal down-sampling layer, realized as a 3D convolution with kernel and stride of size (the first dimension is temporal). However, since our multimodal Transformer expects per-frame embeddings, we canceled this temporal down-sampling step by modifying the kernel and stride of the above convolution to size . Additionally, in order to maintain support for the Kinetics-400 [kay2017kinetics] pretrained weights of the original Swin configuration, we summed the pretrained kernel weights of the aforementioned convolution over its temporal dimension, resulting in a new kernel. This solution is equivalent to (but more efficient than) duplicating each frame in the input sequence before inserting it into the temporal encoder.
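The equivalence stated above follows from the linearity of convolution and can be checked numerically. The following is a minimal single-channel sketch in plain Python, not the authors' implementation: the kernel extents (2 in time, 4 in space), the helper names and the random integer data are assumptions chosen for illustration, and the bias term is omitted because it is unaffected by the change.

```python
import random


def conv3d_patch(video, kernel):
    """Single-channel 3D convolution whose stride equals its kernel size."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    n_frames, height, width = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(0, n_frames - kt + 1, kt):
        plane = []
        for y in range(0, height - kh + 1, kh):
            row = []
            for x in range(0, width - kw + 1, kw):
                acc = 0
                for dt in range(kt):
                    for dy in range(kh):
                        for dx in range(kw):
                            acc += kernel[dt][dy][dx] * video[t + dt][y + dy][x + dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


def sum_over_time(kernel):
    """Collapse the temporal axis of a kernel by summation (temporal extent 1)."""
    kh, kw = len(kernel[0]), len(kernel[0][0])
    return [[[sum(k[dy][dx] for k in kernel) for dx in range(kw)] for dy in range(kh)]]


rng = random.Random(0)
KT, KH, KW = 2, 4, 4  # assumed kernel extents; the first dimension is temporal
kernel = [[[rng.randint(-3, 3) for _ in range(KW)] for _ in range(KH)] for _ in range(KT)]
video = [[[rng.randint(0, 9) for _ in range(8)] for _ in range(8)] for _ in range(3)]

# Option 1: duplicate every frame KT times and apply the original kernel.
duplicated = [frame for frame in video for _ in range(KT)]
out_duplicated = conv3d_patch(duplicated, kernel)

# Option 2: apply the temporally summed kernel to the original frames.
out_summed = conv3d_patch(video, sum_over_time(kernel))
```

Both options yield one output plane per original frame with identical values, but the second processes a sequence that is shorter by a factor of the temporal kernel extent, which is where the efficiency gain comes from.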

B.2 Multimodal Transformer

We employ the same Transformer architecture proposed by [carion2020detr]. The decoder layers are fed with a set of object queries per input frame. For efficiency reasons we only utilize 3 layers in both the encoder and decoder, but note that more layers may lead to additional performance gains, as demonstrated by [carion2020detr]. Also, similarly to [carion2020detr], fixed sine spatial positional encodings are added to the features of each frame before inserting them into the Transformer. No positional encodings are used for the text embeddings, as in our experiments using sine embeddings led to reduced performance and learnable encodings had no effect compared to using no encodings at all.
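As a rough illustration of what fixed sine spatial positional encodings look like, the sketch below builds a 2D encoding in which half of the channels encode the row index and the other half the column index. It is a simplified version written for this purpose, not the code of [carion2020detr]: the function name, the temperature of 10000, the 0-based unnormalized positions and the channel layout are assumptions.

```python
import math


def sine_encoding_2d(height, width, dim, temperature=10000.0):
    """Return a height x width grid of fixed sine/cosine position vectors."""
    if dim % 4 != 0:
        raise ValueError("dim must be divisible by 4 (sin/cos pairs for y and x)")
    half = dim // 2

    def encode(position):
        feats = []
        for i in range(half):
            # Each sin/cos pair shares one frequency; frequencies decay geometrically.
            freq = temperature ** (2 * (i // 2) / half)
            angle = position / freq
            feats.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        return feats

    # The first half of the channels encodes the row, the second half the column.
    return [[encode(y) + encode(x) for x in range(width)] for y in range(height)]


# The same grid would be added to the feature map of every frame.
encoding = sine_encoding_2d(height=2, width=3, dim=8)
```

Because the encoding depends only on the spatial location, it carries no learnable parameters and can be computed once per feature-map resolution.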

B.3 Instance Segmentation

The spatial decoder is an FPN-like [lin2017fpn] module consisting of several 2D convolution, GroupNorm [wu2018group] and ReLU layers. Nearest neighbor interpolation is used for the upsampling steps. The segmentation kernels and the feature maps of are of dimension following [tian2020conditional].

B.4 Additional Training Details

We use as the feature dimension of the multimodal Transformer’s inputs and outputs. The hyperparameters for the loss and matching cost functions are .

During training we utilize AdamW [loshchilov2017decoupled] as the optimizer with weight decay set to (following [carion2020detr]). We also apply gradient clipping with a maximal gradient norm of . A learning rate of is used for the Transformer and for the temporal encoder. The text encoder is kept frozen.
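Gradient clipping by a maximal norm rescales all gradients jointly whenever their global L2 norm exceeds the chosen limit. The sketch below shows this in plain Python; it is an illustration rather than the training code, the function name and the `max_norm` argument are hypothetical, and the numerical limit is deliberately left as a parameter. In PyTorch the corresponding utility is `torch.nn.utils.clip_grad_norm_`.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient vectors so that their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Scale down only when the norm exceeds the limit; otherwise leave gradients unchanged.
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


# Two parameter tensors (flattened) with a joint gradient norm of 5.
example_grads = [[3.0], [4.0]]
clipped_grads, norm_before = clip_by_global_norm(example_grads, max_norm=1.0)
```

Since a single scale factor is applied to every gradient, the update direction is preserved and only its magnitude is bounded.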

Similarly to [carion2020detr] we found that utilizing auxiliary decoding losses on the outputs of all layers in the Transformer decoder expedites training and improves the overall performance of the model.

During training, to enhance the model’s position awareness, we randomly flip the input frames horizontally and swap direction-related words in the corresponding text expressions accordingly (e.g., the word ‘left’ is replaced with ‘right’).
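This augmentation can be sketched as follows. The code is a minimal illustration under stated assumptions, not the authors' implementation: frames are represented as nested lists of rows, only the words "left" and "right" are swapped, the flip probability of 0.5 is assumed, and all function names are hypothetical.

```python
import random
import re

_SWAP = {"left": "right", "right": "left"}
_DIRECTION = re.compile(r"\b(left|right)\b", flags=re.IGNORECASE)


def swap_direction_words(text):
    """Replace every whole-word 'left' with 'right' and vice versa."""
    def replace(match):
        word = match.group(0)
        swapped = _SWAP[word.lower()]
        # Preserve a leading capital letter, e.g. at the start of a sentence.
        return swapped.capitalize() if word[0].isupper() else swapped

    return _DIRECTION.sub(replace, text)


def hflip_frames(frames):
    """Mirror every frame horizontally by reversing each pixel row."""
    return [[row[::-1] for row in frame] for frame in frames]


def random_hflip(frames, text, rng, p=0.5):
    """With probability p, flip all frames of a clip and swap direction words in its text."""
    if rng.random() < p:
        return hflip_frames(frames), swap_direction_words(text)
    return frames, text


clip = [[[1, 2, 3], [4, 5, 6]]]  # one frame, two rows, three columns
flipped_clip, flipped_text = random_hflip(
    clip, "the dog on the left runs right", random.Random(0), p=1.0
)
```

Flipping all frames of a clip with the same decision keeps the temporal window consistent. A plain word swap cannot distinguish spatial uses of "right" from other senses of the word, so a practical version may need additional care.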

We train the model for 70 epochs on A2D-Sentences [gavrilyuk2018a2d] with a learning rate drop by a factor of 2.5 after the first 50 epochs. In the default configuration we use window size and batch size of 6 on 3 RTX 3090 24GB GPUs. On Refer-YouTube-VOS [seo2020urvos] the model is trained for 30 epochs with a learning rate drop by a factor of 2.5 after the first 20 epochs. In the default configuration we use window size and batch size of 4 on 4 A6000 48GB GPUs.
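The learning rate schedule described above is a single step decay. A minimal sketch follows, assuming 0-based epoch indices so that the drop takes effect once the stated number of epochs has been completed; the function name and the `base_lr` argument are hypothetical, and the base learning rate itself is left as a parameter.

```python
def step_lr(base_lr, epoch, drop_epoch, factor=2.5):
    """Learning rate for a 0-based epoch index, with one drop by `factor` at `drop_epoch`."""
    return base_lr / factor if epoch >= drop_epoch else base_lr


# A2D-Sentences: 70 epochs, drop after the first 50.
a2d_schedule = [step_lr(1.0, epoch, drop_epoch=50) for epoch in range(70)]

# Refer-YouTube-VOS: 30 epochs, drop after the first 20.
ytvos_schedule = [step_lr(1.0, epoch, drop_epoch=20) for epoch in range(30)]
```

With a base value of 1.0 the lists give the relative learning rate per epoch. In PyTorch, a comparable schedule could be obtained with `torch.optim.lr_scheduler.MultiStepLR` using a single milestone and `gamma=0.4`.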