Local-Global Context Aware Transformer for Language-Guided Video Segmentation

03/18/2022
by Chen Liang, et al.

We explore the task of language-guided video segmentation (LVS). Previous algorithms mostly adopt 3D CNNs to learn video representations; these struggle to capture long-term context and are prone to visual-linguistic misalignment. In light of this, we present Locater (local-global context aware Transformer), which augments the Transformer architecture with a finite memory so that the entire video can be queried with the language expression efficiently. The memory has two components: one persistently preserves global video content, and one dynamically gathers local temporal context and segmentation history. Based on the memorized local-global context and the particular content of each frame, Locater holistically and flexibly comprehends the expression as an adaptive query vector for each frame, which is then used to query the corresponding frame for mask generation. The memory also lets Locater process videos in linear time with a constant-size memory footprint, whereas standard Transformer self-attention scales quadratically with sequence length. To thoroughly examine the visual grounding capability of LVS models, we contribute a new LVS dataset, A2D-S+, which builds upon the A2D-S dataset but poses increased challenges in disambiguating among similar objects. Experiments on three LVS datasets and our A2D-S+ show that Locater outperforms previous state-of-the-art methods. Furthermore, our Locater-based solution won first place in the Referring Video Object Segmentation Track of the 3rd Large-scale Video Object Segmentation Challenge. Our code and dataset are available at: https://github.com/leonnnop/Locater
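
To make the memory design concrete, below is a minimal PyTorch sketch of the mechanism as the abstract describes it: a constant-size global memory written once from the full video, a bounded FIFO local memory of recent frame summaries and segmentation history, and a language query that attends over both to become a frame-adaptive vector before being correlated with per-pixel features. All names (LocaterSketch, encode_video, segment_frame, global_mem, local_mem) and the specific layer choices are illustrative assumptions, not the authors' implementation.

```python
# A minimal, self-contained sketch of the memory-augmented querying idea
# described in the abstract. All module and variable names here are
# illustrative assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class LocaterSketch(nn.Module):
    """Toy memory-augmented cross-attention over per-frame features.

    Two fixed-size memories keep the per-frame cost constant:
      * global_mem: written once from the whole video, then frozen.
      * local_mem:  a FIFO buffer of recent frame summaries / history.
    """

    def __init__(self, dim=256, n_global=8, n_local=4):
        super().__init__()
        self.n_local = n_local
        self.global_mem = None                      # (n_global, dim), set by encode_video
        self.local_mem = []                         # rolling list of (dim,) summaries
        self.global_pool = nn.AdaptiveAvgPool1d(n_global)
        self.ctx_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.query_proj = nn.Linear(dim, dim)
        self.mask_head = nn.Conv2d(dim, dim, 1)

    @torch.no_grad()
    def encode_video(self, frame_feats):            # (T, dim) pooled frame features
        # Compress the whole video into a constant-size global memory.
        self.global_mem = self.global_pool(frame_feats.t().unsqueeze(0)).squeeze(0).t()

    def segment_frame(self, frame_feat, expr_feat):
        """frame_feat: (dim, H, W) features of one frame; expr_feat: (dim,)."""
        assert self.global_mem is not None, "call encode_video first"
        d, h, w = frame_feat.shape
        # 1. Build the context the expression attends to: global memory,
        #    local memory (recent frames / history), and this frame's summary.
        frame_summary = frame_feat.mean(dim=(1, 2))
        ctx = [self.global_mem] + [m.unsqueeze(0) for m in self.local_mem]
        ctx = torch.cat(ctx + [frame_summary.unsqueeze(0)], dim=0).unsqueeze(0)
        # 2. Contextualize the expression into a frame-adaptive query vector.
        q, _ = self.ctx_attn(expr_feat.view(1, 1, -1), ctx, ctx)
        q = self.query_proj(q.view(-1))             # (dim,)
        # 3. Query the frame: correlate the adaptive vector with every pixel.
        pixels = self.mask_head(frame_feat.unsqueeze(0)).squeeze(0)
        mask_logits = torch.einsum("d,dhw->hw", q, pixels) / d ** 0.5
        # 4. Push a summary into the bounded local memory (FIFO -> O(1) size).
        self.local_mem.append(frame_summary.detach())
        self.local_mem = self.local_mem[-self.n_local:]
        return mask_logits                          # (H, W); sigmoid gives a soft mask


if __name__ == "__main__":
    torch.manual_seed(0)
    model, T, dim = LocaterSketch(), 12, 256
    model.encode_video(torch.randn(T, dim))         # one pass fills the global memory
    for t in range(T):                              # then linear-time per-frame decoding
        logits = model.segment_frame(torch.randn(dim, 14, 14), torch.randn(dim))
    print(logits.shape)                             # torch.Size([14, 14])
```

Because both memories have fixed capacity, each frame attends over a constant-length context, which is what yields the linear time and constant memory footprint claimed above, in contrast to full self-attention over all frames at once.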

Related research

03/22/2021 · Context-aware Biaffine Localizing Network for Temporal Sentence Grounding
This paper addresses the problem of temporal sentence grounding (TSG), w...

11/18/2022 · Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
This paper deals with the problem of localizing objects in image and vid...

06/01/2023 · Lightweight Vision Transformer with Bidirectional Interaction
Recent advancements in vision backbones have significantly improved thei...

07/15/2023 · Multiscale Memory Comparator Transformer for Few-Shot Video Segmentation
Few-shot video segmentation is the task of delineating a specific novel ...

04/16/2020 · Local-Global Video-Text Interactions for Temporal Grounding
This paper addresses the problem of text-to-video temporal grounding, wh...

11/11/2022 · CoRAL: a Context-aware Croatian Abusive Language Dataset
In light of unprecedented increases in the popularity of the internet an...

03/27/2022 · Video Polyp Segmentation: A Deep Learning Perspective
In the deep learning era, we present the first comprehensive video polyp...
