ClawCraneNet: Leveraging Object-level Relation for Text-based Video Segmentation

03/19/2021
by   Chen Liang, et al.

Text-based video segmentation is a challenging task that segments out the objects referred to by a natural language expression in videos. It essentially requires semantic comprehension and fine-grained video understanding. Existing methods introduce language representation into segmentation models in a bottom-up manner, which merely conducts vision-language interaction within the local receptive fields of ConvNets. We argue that such interaction is insufficient, since the model can hardly construct region-level relationships from partial observations, which contradicts the description logic of natural language referring expressions. In fact, people usually describe a target object through its relations with other objects, which may not be easily understood without seeing the whole video. To address this issue, we introduce a novel top-down approach that imitates how humans segment an object under language guidance. We first identify all candidate objects in the video and then choose the referred one by parsing relations among those high-level objects. Three kinds of object-level relations are investigated for precise relationship understanding: positional relations, text-guided semantic relations, and temporal relations. Extensive experiments on A2D Sentences and J-HMDB Sentences show that our method outperforms state-of-the-art methods by a large margin. Qualitative results also show that our predictions are more explainable.
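The top-down pipeline described above can be sketched as a toy scoring procedure: given per-object features and a text embedding, score each candidate by a text-guided semantic cue plus a positional cue among the objects, then pick the highest-scoring one. This is a minimal illustration under assumed shapes and a hand-picked weighting, not the paper's actual network (the temporal relation is omitted for brevity).

```python
import numpy as np

def select_referred_object(object_feats, text_feat, positions):
    """Toy top-down selection over candidate objects.

    object_feats: (N, D) array of per-object features (assumed given).
    text_feat:    (D,) embedding of the referring expression.
    positions:    (N, 2) object center coordinates.
    Returns the index of the chosen (referred) object.
    """
    def cos(a, b):
        # Cosine similarity; epsilon guards against zero vectors.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    # Text-guided semantic relation: similarity of each object to the text.
    semantic = np.array([cos(f, text_feat) for f in object_feats])

    # Positional relation: score each object against the mean position of the
    # others, a crude stand-in for pairwise spatial reasoning among objects.
    n = len(positions)
    positional = np.zeros(n)
    for i in range(n):
        others = np.array([positions[j] for j in range(n) if j != i])
        positional[i] = -np.linalg.norm(positions[i] - others.mean(axis=0))

    # Combine relation cues (weight 0.1 is an arbitrary illustrative choice).
    scores = semantic + 0.1 * positional
    return int(np.argmax(scores))
```

For example, with three candidates where the second object's feature matches the text embedding, the function returns index 1. The point of the sketch is the structure: relations are computed among whole, high-level objects rather than inside local convolutional receptive fields.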


Related research

Video Object Grounding using Semantic Roles in Language Description (03/24/2020)
We explore the task of Video Object Grounding (VOG), which grounds objec...

Localizing Moments in Video with Natural Language (08/04/2017)
We consider retrieving a specific temporal segment, or moment, from a vi...

Human-centric Relation Segmentation: Dataset and Solution (05/24/2021)
Vision and language understanding techniques have achieved remarkable pr...

Stories in the Eye: Contextual Visual Interactions for Efficient Video to Language Translation (11/20/2015)
Integrating higher level visual and linguistic interpretations is at the...

Contrastive Video-Language Segmentation (09/29/2021)
We focus on the problem of segmenting a certain object referred by a nat...

Object Relation Detection Based on One-shot Learning (07/16/2018)
Detecting the relations among objects, such as "cat on sofa" and "person...

Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection (07/16/2022)
Human Object Interaction (HOI) detection is a challenging task that requ...
