Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

01/09/2023
by Haowei Wang, et al.

Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding task that locates the regions of an image corresponding to a text description. Existing approaches to PNG are mainly based on a two-stage paradigm, which is computationally expensive. In this paper, we propose a one-stage network for real-time PNG, termed End-to-End Panoptic Narrative Grounding network (EPNG), which directly generates masks for referents. Specifically, we propose two designs, Locality-Perceptive Attention (LPA) and a bidirectional Semantic Alignment Loss (SAL), to properly handle the many-to-many relationship between textual expressions and visual objects. LPA embeds local spatial priors into attention modeling, i.e., a pixel may belong to multiple masks at different scales, thereby improving segmentation. To help the model capture these complex semantic relationships, SAL introduces a bidirectional contrastive objective that regularizes semantic consistency across modalities. Extensive experiments on the PNG benchmark dataset demonstrate the effectiveness and efficiency of our method. Compared to the single-stage baseline, our method achieves a significant improvement of up to 9.4 [...] two-stage model. Meanwhile, the generalization ability of EPNG is also validated by zero-shot experiments on other grounding tasks.
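The abstract does not give the formula for SAL, so the following is only a minimal sketch of what a bidirectional contrastive objective over a many-to-many phrase/mask matching could look like. It is not the authors' implementation. The function names, the `temperature` parameter, the softmax-based form, and the use of a precomputed similarity matrix `sim` with a binary `match` matrix are all assumptions made for illustration.

```python
import math


def _log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def directional_loss(sim, match, temperature=0.1):
    """One direction of the objective (hypothetical helper).

    For each row, average the negative log-probability of its matched
    columns. A row may have several positives (many-to-many); rows with
    no positive are skipped.
    """
    total, count = 0.0, 0
    for sim_row, match_row in zip(sim, match):
        positives = [j for j, flag in enumerate(match_row) if flag]
        if not positives:
            continue
        logp = _log_softmax([s / temperature for s in sim_row])
        total += -sum(logp[j] for j in positives) / len(positives)
        count += 1
    return total / count if count else 0.0


def bidirectional_contrastive_loss(sim, match, temperature=0.1):
    """Average of the phrase-to-mask and mask-to-phrase losses.

    sim[i][j]   : assumed similarity between phrase i and mask j
    match[i][j] : 1 if phrase i refers to mask j, else 0
    """
    sim_t = [list(col) for col in zip(*sim)]
    match_t = [list(col) for col in zip(*match)]
    forward = directional_loss(sim, match, temperature)
    backward = directional_loss(sim_t, match_t, temperature)
    return 0.5 * (forward + backward)


if __name__ == "__main__":
    match = [[1, 0], [0, 1]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    misaligned = [[0.0, 1.0], [1.0, 0.0]]
    print(bidirectional_contrastive_loss(aligned, match))
    print(bidirectional_contrastive_loss(misaligned, match))
```

Averaging the two directions means a phrase is pulled toward its masks and each mask is pulled toward its phrases, so neither modality dominates the alignment. In this sketch, similarities that agree with `match` give a lower loss than similarities that contradict it.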


research
03/03/2019

Improving Referring Expression Grounding with Cross-modal Attention-guided Erasing

Referring expression grounding aims at locating certain objects or perso...
research
07/31/2021

Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding

Current one-stage methods for visual grounding encode the language query...
research
03/10/2022

Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

Recently, one-stage visual grounders attract high attention due to the c...
research
12/07/2019

A Real-time Global Inference Network for One-stage Referring Expression Comprehension

Referring Expression Comprehension (REC) is an emerging research spot in...
research
08/11/2022

PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to ...
research
11/15/2022

YORO – Lightweight End to End Visual Grounding

We present YORO - a multi-modal transformer encoder-only architecture fo...
research
04/09/2021

Look Before You Leap: Learning Landmark Features for One-Stage Visual Grounding

An LBYL (`Look Before You Leap') Network is proposed for end-to-end trai...
