Towards Unifying Reference Expression Generation and Comprehension

10/24/2022
by   Duo Zheng, et al.

Reference Expression Generation (REG) and Comprehension (REC) are two highly correlated tasks. Modeling REG and REC simultaneously to exploit the relation between them is a promising way to improve both. However, the two tasks take distinct inputs, and building connections between them in a single model poses challenges for the design and training of a joint model. To address these problems, we propose a unified model for REG and REC, named UniRef. It unifies the two tasks with a carefully designed Image-Region-Text Fusion layer (IRTF), which fuses image, region and text via image cross-attention and region cross-attention. Additionally, IRTF can generate pseudo input regions for the REC task, so that REC and REG share an identical representation space in a uniform way. We further propose Vision-conditioned Masked Language Modeling (VMLM) and Text-Conditioned Region Prediction (TRP) to pre-train the UniRef model on multi-granular corpora. VMLM and TRP are directly related to REG and REC, respectively, yet can benefit each other. We conduct extensive experiments on three benchmark datasets, RefCOCO, RefCOCO+ and RefCOCOg. Experimental results show that our model outperforms previous state-of-the-art methods on both REG and REC.
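To make the fusion idea concrete, below is a minimal PyTorch sketch of an IRTF-style block. The abstract only states that the layer fuses image, region and text through image cross-attention and region cross-attention; the ordering of sub-layers, residual/normalization layout, hidden size, and all names (`ImageRegionTextFusion`, `image_cross_attn`, `region_cross_attn`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of an Image-Region-Text Fusion (IRTF) block.
# Assumptions: hidden size, sub-layer order, and module names are illustrative.
import torch
import torch.nn as nn


class ImageRegionTextFusion(nn.Module):
    """Fuses text tokens with image features and region features via two
    cross-attention steps, following the high-level description in the abstract."""

    def __init__(self, hidden_size: int = 768, num_heads: int = 12):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.image_cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.region_cross_attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size, 4 * hidden_size),
            nn.GELU(),
            nn.Linear(4 * hidden_size, hidden_size),
        )
        self.norms = nn.ModuleList([nn.LayerNorm(hidden_size) for _ in range(4)])

    def forward(self, text, image, region):
        # text:   (B, L_t, H) token embeddings of the referring expression
        # image:  (B, L_i, H) patch/grid features of the full image
        # region: (B, L_r, H) features of the input region (or a pseudo region for REC)
        x = self.norms[0](text + self.self_attn(text, text, text)[0])
        x = self.norms[1](x + self.image_cross_attn(x, image, image)[0])
        x = self.norms[2](x + self.region_cross_attn(x, region, region)[0])
        return self.norms[3](x + self.ffn(x))


if __name__ == "__main__":
    fusion = ImageRegionTextFusion()
    text = torch.randn(2, 16, 768)    # referring-expression tokens
    image = torch.randn(2, 196, 768)  # image patch features
    region = torch.randn(2, 1, 768)   # single region feature
    print(fusion(text, image, region).shape)  # torch.Size([2, 16, 768])
```

For REG, the region input would hold the target object's features and the fused text states would feed a language-modeling head; for REC, a pseudo region (as the abstract mentions) keeps the inputs uniform while a prediction head localizes the referred region.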


