CMF: Cascaded Multi-model Fusion for Referring Image Segmentation

06/16/2021
by   Jianhua Yang, et al.

In this work, we address the task of referring image segmentation (RIS), which aims to predict a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate the two modalities, while multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial during multi-modal fusion for localizing and segmenting objects with large scale variations. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch progressively integrates multi-scale contextual information and facilitates the alignment of the two modalities during fusion. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
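
To make the structure described above concrete, the following is a minimal sketch of parallel atrous (dilated) convolutions combined with a cascaded branch. It is not the authors' implementation: the abstract does not specify the wiring, so the way the two branches are combined here (summing the parallel outputs and adding the cascaded output) is an assumption, as are the names `atrous_conv3x3`, `cascaded_fusion`, `kernels`, and `rates`. The input `fused` is assumed to be a single-channel feature map in which visual and linguistic features have already been combined.

```python
from typing import List

Grid = List[List[float]]


def atrous_conv3x3(x: Grid, kernel: Grid, rate: int) -> Grid:
    """3x3 atrous (dilated) convolution on one channel, zero-padded to keep size."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for ki in range(3):
                for kj in range(3):
                    # Kernel taps are spaced `rate` pixels apart.
                    ii, jj = i + (ki - 1) * rate, j + (kj - 1) * rate
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[ki][kj] * x[ii][jj]
            out[i][j] = acc
    return out


def add(a: Grid, b: Grid) -> Grid:
    """Element-wise sum of two equally sized grids."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def cascaded_fusion(fused: Grid, kernels: List[Grid], rates: List[int]) -> Grid:
    """Hypothetical wiring: parallel atrous branches plus one cascaded branch."""
    parallel_sum = [[0.0] * len(fused[0]) for _ in fused]
    cascade = fused
    for kernel, rate in zip(kernels, rates):
        # Parallel branch: every dilation rate sees the same fused input.
        parallel_sum = add(parallel_sum, atrous_conv3x3(fused, kernel, rate))
        # Cascaded branch: each rate refines the previous rate's output,
        # so context from smaller scales is carried into larger ones.
        cascade = atrous_conv3x3(cascade, kernel, rate)
    return add(parallel_sum, cascade)
```

As a sanity check, calling `cascaded_fusion` with three identity kernels (a single 1 at the center) leaves every branch equal to its input, so the result is the input scaled by four: three parallel copies plus one cascaded copy. A real module would use learned multi-channel kernels and a learned way of merging the branches.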
