Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding

by   Long Chen, et al.

The prevailing framework for solving referring expression grounding is based on a two-stage process: 1) detecting proposals with an object detector and 2) grounding the referent to one of the proposals. Existing two-stage solutions mostly focus on the grounding step, which aims to align the expressions with the proposals. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., expression-agnostic), hoping that the proposals contain all right instances in the expression (i.e., expression-aware). Due to this mismatch, current two-stage methods suffer from a severe performance drop between detected and ground-truth proposals. To this end, we propose Ref-NMS, which is the first method to yield expression-aware proposals at the first stage. Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object. These scores can guide the NMS operation to filter out the boxes irrelevant to the expression, increasing the recall of critical objects, resulting in a significantly improved grounding performance. Since Ref- NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method. Extensive ablation studies on several backbones, benchmarks, and tasks consistently demonstrate the superiority of Ref-NMS. Codes are available at:


page 2

page 8

page 14

page 15


VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching

The prevailing framework for matching multimodal inputs is based on a tw...

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

Today's VidSGG models are all proposal-based methods, i.e., they first g...

Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos

In this paper, we address the problem of referring expression comprehens...

Real-Time Referring Expression Comprehension by Single-Stage Grounding Network

In this paper, we propose a novel end-to-end model, namely Single-Stage ...

Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding

Recently, one-stage visual grounders attract high attention due to the c...

Learning High-quality Proposals for Acne Detection

Acne detection is crucial for interpretative diagnosis and precise treat...

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts

Textual grounding is an important but challenging task for human-compute...