Differentiated Relevances Embedding for Group-based Referring Expression Comprehension

03/12/2022
by Fuhai Chen, et al.

Referring expression comprehension (REC) aims to locate the object in an image that a natural language expression refers to. To jointly understand regions and expressions, existing REC works typically focus on modeling the cross-modal relevance of each region-expression pair within a single image. In this paper, we explore a new but general REC-related problem, named Group-based REC, where the regions and expressions can come from different subject-related images (images in the same group), e.g., sets of photo albums or video frames. Unlike REC, Group-based REC involves differentiated cross-modal relevances within each group and across different groups, which are neglected in the existing one-line paradigm. To this end, we propose a novel relevance-guided multi-group self-paced learning schema (termed RMSL), in which the within-group region-expression pairs are adaptively assigned different priorities according to their cross-modal relevances, while the bias of the group priority is balanced via an across-group relevance constraint. In particular, based on the visual and textual semantic features, RMSL conducts an adaptive learning cycle upon triplet ranking, where (1) the target-negative region-expression pairs with low within-group relevances are used preferentially in model training to distinguish the primary semantics of the target objects, and (2) an across-group relevance regularization is integrated into model training to balance the bias of group priority. The relevances, the pairs, and the model parameters are alternately updated under a unified self-paced hinge loss.
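The idea of a self-paced hinge loss over triplet ranking can be illustrated with a small sketch. This is not the formulation from the paper, which the abstract does not spell out. It is a generic hard-weighted self-paced scheme in which negative pairs with low relevance to the target are admitted into training first. The function names, the `pace` threshold, the `margin` value, and the toy scores are all assumptions made for illustration, and the across-group regularization is left out.

```python
def triplet_hinge(pos_score, neg_score, margin=0.1):
    """Triplet ranking hinge: penalize a negative pair that scores
    within `margin` of (or above) the positive pair."""
    return max(0.0, margin + neg_score - pos_score)


def self_paced_weights(relevances, pace):
    """Hard self-paced weights (assumed selection rule): a negative pair
    gets weight 1.0 only if its relevance to the target is at most `pace`,
    so low-relevance negatives enter training first."""
    return [1.0 if r <= pace else 0.0 for r in relevances]


def self_paced_hinge_loss(pos_score, neg_scores, relevances, pace, margin=0.1):
    """Sum of hinge losses over the negatives currently admitted by `pace`."""
    weights = self_paced_weights(relevances, pace)
    losses = [triplet_hinge(pos_score, s, margin) for s in neg_scores]
    return sum(w * l for w, l in zip(weights, losses))


# Toy example: one target pair and three negative pairs (hypothetical values).
pos_score = 0.8
neg_scores = [0.75, 0.3, 0.9]   # matching scores of the negative pairs
relevances = [0.1, 0.5, 0.9]    # relevance of each negative to the target

# Early in training only the low-relevance negative is used.
early = self_paced_hinge_loss(pos_score, neg_scores, relevances, pace=0.2)
# Later, raising the pace admits every negative.
late = self_paced_hinge_loss(pos_score, neg_scores, relevances, pace=1.0)
```

In an alternating scheme of the kind the abstract describes, the weights would be recomputed from updated relevances between parameter updates, with `pace` raised gradually so that harder, more relevant negatives are introduced over time.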


