Referring Expression Object Segmentation with Caption-Aware Consistency

10/10/2019
by Yi-Wen Chen, et al.

Referring expressions are natural language descriptions that identify a particular object within a scene, and they are common in everyday conversation. In this work, we focus on segmenting the object in an image that a referring expression specifies. To this end, we propose an end-to-end trainable comprehension network consisting of language and visual encoders that extract feature representations from both domains. We introduce spatial-aware dynamic filters to transfer knowledge from text to image and to capture the spatial information of the specified object effectively. To improve communication between the language and visual modules, we employ a caption generation network that takes features shared across both domains as input and improves both representations through a consistency constraint that encourages the generated sentence to be similar to the given referring expression. We evaluate the proposed framework on two referring expression datasets and show that our method performs favorably against state-of-the-art algorithms.
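
To make the idea of text-generated, spatially aware filters more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that a sentence embedding is projected into the weights of a single 1x1 filter, and that the visual feature map is augmented with two normalized coordinate channels so the filter can respond to position as well as appearance. The names `coord_channels`, `dynamic_filter_response`, and `proj`, the `tanh` and sigmoid nonlinearities, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np

def coord_channels(h, w):
    """Return two channels with normalized x and y positions in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([xs, ys], axis=0)  # shape (2, h, w)

def dynamic_filter_response(visual_feat, text_feat, proj):
    """Apply a text-generated 1x1 filter to coordinate-augmented visual features.

    visual_feat: (c, h, w) feature map from some visual encoder (assumed)
    text_feat:   (d,) sentence embedding from some language encoder (assumed)
    proj:        (c + 2, d) hypothetical learned projection that produces the filter
    """
    c, h, w = visual_feat.shape
    # Append x/y coordinate channels so the filter can use spatial position.
    augmented = np.concatenate([visual_feat, coord_channels(h, w)], axis=0)
    # The filter weights depend on the expression, hence "dynamic".
    filt = np.tanh(proj @ text_feat)  # shape (c + 2,)
    # A 1x1 convolution is a weighted sum over channels at every location.
    response = np.tensordot(filt, augmented, axes=([0], [0]))  # shape (h, w)
    # Squash to (0, 1) so the map can be read as a per-pixel score.
    return 1.0 / (1.0 + np.exp(-response))

# Toy example: the "expression" only activates the x-coordinate weight,
# so the score grows from the left edge of the map to the right edge.
visual = np.zeros((1, 4, 5))
text = np.array([1.0])
proj = np.array([[0.0], [5.0], [0.0]])
heat = dynamic_filter_response(visual, text, proj)
```

In this toy setting the resulting map scores pixels on the right higher than pixels on the left, which is the kind of behavior an expression such as "the object on the right" would need. A real system would learn `proj` jointly with both encoders and would typically use richer filters than a single 1x1 kernel.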


research
03/20/2016

Segmentation from Natural Language Expressions

In this paper we approach the novel problem of segmenting an image based...
research
04/21/2022

Referring Expression Comprehension via Cross-Level Multi-Modal Fusion

As an important and challenging problem in vision-language tasks, referr...
research
12/30/2016

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

Referring expressions are natural language constructions used to identif...
research
07/06/2018

Dynamic Multimodal Instance Segmentation guided by natural language queries

In this paper, we address the task of segmenting an object given a natur...
research
06/14/2023

LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation

Referring video object segmentation (RVOS) aims to segment the target in...
research
03/18/2020

MUTATT: Visual-Textual Mutual Guidance for Referring Expression Comprehension

Referring expression comprehension (REC) aims to localize a text-related...
research
03/28/2020

BiLingUNet: Image Segmentation by Modulating Top-Down and Bottom-Up Visual Processing with Referring Expressions

We present BiLingUNet, a state-of-the-art model for image segmentation u...
