PPMN: Pixel-Phrase Matching Network for One-Stage Panoptic Narrative Grounding

08/11/2022
by Zihan Ding, et al.

Panoptic Narrative Grounding (PNG) is an emerging task whose goal is to segment the visual objects of things and stuff categories described by dense narrative captions of a still image. The previous two-stage approach first extracts segmentation region proposals with an off-the-shelf panoptic segmentation model, then performs coarse region-phrase matching to ground the candidate regions for each noun phrase. However, the two-stage pipeline is usually limited by low-quality proposals from the first stage, loses spatial details through region feature pooling, and requires complicated strategies designed separately for things and stuff categories. To alleviate these drawbacks, we propose a one-stage end-to-end Pixel-Phrase Matching Network (PPMN), which directly matches each phrase to its corresponding pixels instead of region proposals and outputs panoptic segmentation by simple combination. Our model can thus exploit sufficient and finer cross-modal semantic correspondence from the supervision of densely annotated pixel-phrase pairs rather than sparse region-phrase pairs. We also propose a Language-Compatible Pixel Aggregation (LCPA) module that further enhances the discriminative ability of phrase features through multi-round refinement: for each phrase, it selects the most compatible pixels and adaptively aggregates the corresponding visual context. Extensive experiments show that our method achieves new state-of-the-art performance on the PNG benchmark, with an absolute gain of 4.0 in Average Recall.
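
The two ideas in the abstract, matching each phrase directly to pixels and refining phrase features from their most compatible pixels, can be illustrated with a toy sketch. This is not the authors' implementation. The abstract does not specify the scoring function, the selection rule, or the aggregation, so the dot-product-plus-sigmoid score, the top-k selection, the score-weighted mean, the residual update, and the 0.5 threshold below are all assumptions chosen for illustration. The function names (`match_scores`, `aggregate_top_k`, `ground`) are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def match_scores(phrases, pixels):
    """Pixel-phrase matching: scores[n][p] = sigmoid(phrase_n . pixel_p).

    Assumption: a plain dot product followed by a sigmoid stands in for
    whatever matching function the paper actually uses.
    """
    return [
        [1.0 / (1.0 + math.exp(-dot(f, x))) for x in pixels]
        for f in phrases
    ]


def aggregate_top_k(phrases, pixels, k):
    """One refinement round in the spirit of LCPA (assumed form).

    For each phrase, pick its k highest-scoring pixels, take their
    score-weighted mean as visual context, and add it to the phrase feature.
    """
    scores = match_scores(phrases, pixels)
    refined = []
    for f, row in zip(phrases, scores):
        top = sorted(range(len(pixels)), key=lambda p: row[p], reverse=True)[:k]
        total = sum(row[p] for p in top)  # sigmoid outputs are > 0, so total > 0
        context = [
            sum(row[p] * pixels[p][d] for p in top) / total
            for d in range(len(f))
        ]
        refined.append([a + c for a, c in zip(f, context)])
    return refined


def ground(phrases, pixels, rounds=2, k=2, threshold=0.5):
    """Refine phrase features for a few rounds, then threshold the scores.

    Returns one boolean mask per phrase over the flattened pixel list.
    """
    for _ in range(rounds):
        phrases = aggregate_top_k(phrases, pixels, k)
    return [[s > threshold for s in row] for row in match_scores(phrases, pixels)]


# Toy input: four flattened pixels and two phrases, all with 2-D features.
pixels = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [-0.1, 1.0]]
phrases = [[1.0, -1.0], [-1.0, 1.0]]
masks = ground(phrases, pixels)
```

In this toy input the first phrase ends up covering the first two pixels and the second phrase the last two. Because every pixel is scored against every phrase, supervision can be applied per pixel rather than per pooled region, which is the contrast with region-phrase matching that the abstract draws.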


research
03/18/2019

Neural Sequential Phrase Grounding (SeqGROUND)

We propose an end-to-end approach for phrase grounding in images. Unlike...
research
06/06/2020

MAGNet: Multi-Region Attention-Assisted Grounding of Natural Language Queries at Phrase Level

Grounding free-form textual queries necessitates an understanding of the...
research
08/18/2019

A Fast and Accurate One-Stage Approach to Visual Grounding

We propose a simple, fast, and accurate one-stage approach to visual gro...
research
01/09/2023

Towards Real-Time Panoptic Narrative Grounding by an End-to-End Grounding Network

Panoptic Narrative Grounding (PNG) is an emerging cross-modal grounding ...
research
12/09/2018

Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

In this paper, we propose a novel end-to-end model, namely Single-Stage ...
research
03/14/2023

Medical Phrase Grounding with Region-Phrase Context Contrastive Alignment

Medical phrase grounding (MPG) aims to locate the most relevant region i...
research
09/01/2019

Phrase Grounding by Soft-Label Chain Conditional Random Field

The phrase grounding task aims to ground each entity mention in a given ...
