DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners

09/07/2023
by Clarence Lee, et al.

State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between referring to all objects of a kind and referring to only certain objects of interest. In natural language, humans specify a particular object or set of objects of interest using determiners such as "my", "either", and "those". Determiners are an important word class that encodes the reference or quantity of a noun. Existing grounded-referencing datasets place far less emphasis on determiners than on other word classes such as nouns, verbs, and adjectives, which makes it difficult to develop models that understand the full variety and complexity of object referencing. We therefore developed and released the DetermiNet dataset, which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes that identify the objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on this dataset, highlighting the limitations of existing models on reference and quantification tasks.
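
To make the task concrete, below is a minimal, hypothetical Python sketch of what a DetermiNet-style example and a box-level evaluation could look like. The field names (image_path, caption, determiner, target_boxes), the IoU threshold, and the F1-style scoring are illustrative assumptions, not the released dataset's actual schema or the paper's evaluation protocol; the sketch only shows how a model that predicts every object in the scene, rather than only those selected by the determiner, would be penalised.

```python
# Hypothetical sketch of a DetermiNet-style example and a box-level score;
# field names and the scoring rule are assumptions, not the dataset's schema.
from dataclasses import dataclass
from typing import List

Box = List[float]  # [x_min, y_min, x_max, y_max]

@dataclass
class DeterminerExample:
    image_path: str          # path to the synthetic scene image
    caption: str             # e.g. "pick up both apples on the table"
    determiner: str          # e.g. "both"
    target_boxes: List[Box]  # ground-truth boxes satisfying the determiner

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def score_example(pred_boxes: List[Box], gt_boxes: List[Box],
                  thr: float = 0.5) -> float:
    """F1 over boxes: a prediction is correct if it matches an unmatched
    ground-truth box with IoU >= thr, so over-predicting is penalised."""
    matched, tp = set(), 0
    for p in pred_boxes:
        best, best_iou = None, thr
        for i, g in enumerate(gt_boxes):
            if i in matched:
                continue
            v = iou(p, g)
            if v >= best_iou:
                best, best_iou = i, v
        if best is not None:
            matched.add(best)
            tp += 1
    if not pred_boxes and not gt_boxes:
        return 1.0  # e.g. "neither ball" may have an empty answer set
    prec = tp / len(pred_boxes) if pred_boxes else 0.0
    rec = tp / len(gt_boxes) if gt_boxes else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

# Example: caption "both apples", two ground-truth apples; the model also
# returns a distractor box, so precision (and hence F1) drops.
gt = [[10, 10, 50, 50], [60, 10, 100, 50]]
pred = [[12, 11, 49, 52], [61, 9, 99, 51], [150, 150, 200, 200]]
print(score_example(pred, gt))  # ~0.8
```

Set-level matching of this kind is one natural way to score determiners such as "both" or "neither", whose answers are sets of boxes (possibly empty) rather than a single region.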
