MaIL: A Unified Mask-Image-Language Trimodal Network for Referring Image Segmentation

11/21/2021
by Zizhang Li, et al.
Referring image segmentation is a typical multi-modal task that aims to generate a binary mask for the referent described in a given language expression. Prior art adopts a bimodal solution, treating image and language as two modalities within an encoder-fusion-decoder pipeline. However, this pipeline is sub-optimal for the target task for two reasons. First, it fuses only the high-level features produced separately by uni-modal encoders, which hinders sufficient cross-modal learning. Second, the uni-modal encoders are pre-trained independently, which introduces inconsistency between the uni-modal pre-training tasks and the target multi-modal task. In addition, this pipeline often ignores or makes little use of intuitively beneficial instance-level features. To relieve these problems, we propose MaIL, a more concise encoder-decoder pipeline with a Mask-Image-Language trimodal encoder. Specifically, MaIL unifies the uni-modal feature extractors and their fusion model into a deep modality interaction encoder, facilitating sufficient feature interaction across modalities. Meanwhile, MaIL directly avoids the second limitation, since uni-modal encoders are no longer needed. Moreover, for the first time, we introduce instance masks as an additional modality, which explicitly intensifies instance-level features and promotes finer segmentation results. The proposed MaIL sets a new state of the art on all frequently used referring image segmentation datasets, including RefCOCO, RefCOCO+, and G-Ref, with significant gains of 3%-10% over the previous best methods. Code will be released soon.
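To make the idea of a single deep modality interaction encoder more concrete, here is a minimal sketch of how mask, image, and language tokens could be tagged by modality, concatenated into one sequence, and mixed by joint self-attention. It is an illustration under stated assumptions, not the authors' implementation: the function names (`build_trimodal_sequence`, `joint_self_attention`), the additive modality-type embeddings, the identity query/key/value projections, and the random features standing in for real patch and word embeddings are all hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def build_trimodal_sequence(mask_tokens, image_tokens, text_tokens, type_embeddings):
    """Add a modality-type embedding to every token, then concatenate all tokens.

    The additive type embedding is an assumption made for this sketch.
    """
    sequence = []
    groups = (("mask", mask_tokens), ("image", image_tokens), ("language", text_tokens))
    for name, group in groups:
        offset = type_embeddings[name]
        sequence.extend([[x + o for x, o in zip(token, offset)] for token in group])
    return sequence


def joint_self_attention(tokens):
    """Single-head self-attention with identity projections (illustration only).

    Every token attends to every other token, regardless of modality.
    """
    dim = len(tokens[0])
    fused = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim) for key in tokens]
        weights = softmax(scores)
        fused.append([sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)])
    return fused


def rand_tokens(count, dim):
    """Random features standing in for real mask/patch/word embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


random.seed(0)
DIM = 8
type_embeddings = {name: rand_tokens(1, DIM)[0] for name in ("mask", "image", "language")}
mask_tokens = rand_tokens(3, DIM)   # e.g. tokens from candidate instance masks
image_tokens = rand_tokens(4, DIM)  # e.g. image patch tokens
text_tokens = rand_tokens(5, DIM)   # e.g. word tokens of the expression

sequence = build_trimodal_sequence(mask_tokens, image_tokens, text_tokens, type_embeddings)
fused = joint_self_attention(sequence)
print(len(fused), len(fused[0]))  # 12 8
```

In contrast with an encoder-fusion-decoder pipeline, nothing here is encoded per modality before fusion: the three token groups interact from the first attention step onward.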


