Words aren't enough, their order matters: On the Robustness of Grounding Visual Referring Expressions

05/04/2020
by Arjun R Akula, et al.

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7% of test instances do not require reasoning on linguistic structure, i.e., the words alone are enough to identify the target object and their order doesn't matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which doesn't. Additionally, we create an out-of-distribution dataset, Ref-Adv, by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12% to 23% lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at https://github.com/aws/aws-refcocog-adv
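
The split described above can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract says only that the split rests on a human study, so the code assumes a hypothetical callable `identify_target` that stands in for whoever (an annotator or a model) picks the target object. The record layout `(image_id, expression, target_object_id)` and the function names are likewise assumptions made for illustration. An example counts as "words suffice" if the target is still identified after the expression's words are shuffled.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical record layout: (image_id, expression, target_object_id).
Example = Tuple[str, str, str]


def shuffle_words(expression: str, rng: random.Random) -> str:
    """Return the expression with its whitespace-separated words in random order."""
    words = expression.split()
    rng.shuffle(words)
    return " ".join(words)


def split_by_word_order(
    examples: List[Example],
    identify_target: Callable[[str, str], str],
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Partition examples by whether the target survives word shuffling.

    identify_target(image_id, expression) is an assumed stand-in for a
    human annotator or a model; it returns the chosen object id.
    """
    rng = random.Random(seed)  # seeded so the shuffles are reproducible
    words_suffice: List[Example] = []
    needs_structure: List[Example] = []
    for example in examples:
        image_id, expression, target = example
        guess = identify_target(image_id, shuffle_words(expression, rng))
        if guess == target:
            words_suffice.append(example)    # order did not matter
        else:
            needs_structure.append(example)  # order (structure) mattered
    return words_suffice, needs_structure
```

A single shuffle per expression is a coarse probe. A more careful check would likely try several shuffles per example and would treat one-word expressions, which cannot be reordered, as a separate case.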
