Cross-Modal Progressive Comprehension for Referring Segmentation

05/15/2021
by Si Liu, et al.

Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively, guided by the informative words in the expression: first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to mimic this behavior, and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the entities potentially referred to by the expression. Then, the relational words are adopted to highlight the target entity, and to suppress irrelevant ones, via spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I by further exploiting action words to highlight the entity matched with the action cues via temporal graph reasoning. In addition to CMPC, we introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module that integrates the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms our image and video referring segmentation frameworks, which achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks, respectively.
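The two-stage idea behind CMPC-I can be illustrated with a minimal toy sketch (not the authors' implementation): stage one scores candidate regions against the entity/attribute words of the expression, and stage two propagates scores along relational edges of a hypothetical spatial graph so that the region satisfying the relation is highlighted. Region names, edge labels, and scoring are all illustrative assumptions.

```python
# Hypothetical sketch of progressive comprehension, not the paper's code.
# Stage 1: entity/attribute words score candidate regions.
# Stage 2: a relational word passes messages over a spatial graph,
# boosting the region whose neighbor matches the relation.

def stage1_perceive(regions, words):
    """Score each region by how many expression words match its attributes."""
    return {name: sum(w in attrs for w in words)
            for name, attrs in regions.items()}

def stage2_relate(scores, edges, relation):
    """Boost regions connected via `relation` to scored neighbors."""
    boosted = dict(scores)
    for src, rel, dst in edges:
        if rel == relation:
            boosted[src] += scores.get(dst, 0)  # message from neighbor
    return boosted

# Toy scene for the expression "the dog left of the tree".
regions = {
    "dog_a": {"dog"},
    "dog_b": {"dog"},
    "tree":  {"tree"},
}
edges = [("dog_a", "left_of", "tree"),
         ("dog_b", "right_of", "tree")]

s1 = stage1_perceive(regions, ["dog", "tree"])  # both dogs tie at stage 1
s2 = stage2_relate(s1, edges, "left_of")        # only dog_a gets the boost
target = max(s2, key=s2.get)
```

After stage one the two dogs are indistinguishable; only the relational reasoning of stage two separates the target from the distractor, which is the progression the abstract describes.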


Related research

10/01/2020 - Referring Image Segmentation via Cross-Modal Progressive Comprehension
Referring image segmentation aims at segmenting the foreground masks of ...

02/09/2021 - Referring Segmentation in Images and Videos with Cross-Modal Self-Attention Network
We consider the problem of referring segmentation in images and videos w...

04/09/2019 - Cross-Modal Self-Attention Network for Referring Image Segmentation
We consider the problem of referring image segmentation. Given an input ...

04/06/2022 - Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension
Procedural Multimodal Documents (PMDs) organize textual instructions and...

10/09/2021 - Two-stage Visual Cues Enhancement Network for Referring Image Segmentation
Referring Image Segmentation (RIS) aims at segmenting the target object ...

10/01/2020 - Linguistic Structure Guided Context Modeling for Referring Image Segmentation
Referring image segmentation aims to predict the foreground mask of the ...

03/12/2022 - Differentiated Relevances Embedding for Group-based Referring Expression Comprehension
Referring expression comprehension (REC) aims to locate a certain object...
