CMF: Cascaded Multi-model Fusion for Referring Image Segmentation

by Jianhua Yang, et al.

In this work, we address the task of referring image segmentation (RIS), which aims to predict a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate the two modalities, while multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial for localizing and segmenting objects with large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch progressively integrates multi-scale contextual information and facilitates the alignment of the two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at
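The following is a toy illustration only, not the authors' code. It sketches one plausible reading of the idea described above: several atrous (dilated) convolutions run in parallel at different rates, and a cascaded branch carries the fused result from one rate to the next. Everything specific here is an assumption made for illustration: the 1-D single-channel signal standing in for a 2-D feature map, the scalar `language_gate` standing in for a linguistic feature vector, the multiplicative fusion, and the function names `atrous_conv1d` and `cascaded_fusion`.

```python
from typing import List, Sequence


def atrous_conv1d(signal: Sequence[float], kernel: Sequence[float], rate: int) -> List[float]:
    """Same-length 1-D atrous (dilated) convolution with zero padding."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - half) * rate  # kernel taps are spaced `rate` positions apart
            if 0 <= j < len(signal):
                acc += w * signal[j]
        out.append(acc)
    return out


def cascaded_fusion(
    visual: Sequence[float],
    language_gate: float,
    kernel: Sequence[float],
    rates: Sequence[int],
) -> List[float]:
    """Hypothetical fusion: parallel atrous branches plus a cascaded branch."""
    parallel_sum = [0.0] * len(visual)
    cascade = [0.0] * len(visual)
    for rate in rates:
        # Parallel branch: visual context at this rate, gated by the language feature
        # (multiplicative gating is an assumption, not taken from the paper).
        branch = [language_gate * v for v in atrous_conv1d(visual, kernel, rate)]
        parallel_sum = [p + b for p, b in zip(parallel_sum, branch)]
        # Cascaded branch: merge the running result with this branch and
        # convolve again, so later rates see context gathered at earlier rates.
        merged = [c + b for c, b in zip(cascade, branch)]
        cascade = atrous_conv1d(merged, kernel, rate)
    return [p + c for p, c in zip(parallel_sum, cascade)]


if __name__ == "__main__":
    visual = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
    print(cascaded_fusion(visual, language_gate=0.5, kernel=[1.0, 1.0, 1.0], rates=[1, 2, 3]))
```

Larger rates widen the receptive field without adding kernel weights, which is why stacking several rates is a common way to gather multi-scale context. A real implementation would operate on multi-channel 2-D feature maps with learned kernels.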




