RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D

08/23/2023
by Shuhei Kurita, et al.

Grounding textual expressions on scene objects from first-person views is a demanding capability for developing agents that are aware of their surroundings and follow intuitive text instructions. Such a capability is necessary for glass devices and autonomous robots to localize referred objects in the real world. Conventional referring expression comprehension datasets for images, however, are mostly constructed from web-crawled data and do not reflect the diversity of real-world scenes in which textual expressions must be grounded. Recently, the massive-scale egocentric video dataset Ego4D was proposed. Ego4D covers diverse real-world scenes from around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, and manufacturing. Based on the egocentric videos of Ego4D, we constructed RefEgo, a broad-coverage video-based referring expression comprehension dataset. It comprises more than 12k video clips totaling 41 hours, annotated for video-based referring expression comprehension. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object-tracking algorithm, achieving video-wise tracking of the referred object even in difficult conditions: when the referred object goes out of frame in the middle of the video, or when multiple similar objects appear in the video.
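To illustrate the combination described above, the following is a minimal, hypothetical sketch of how per-frame outputs from a 2D referring expression comprehension model might be linked into a video-wise track with a simple IoU-based heuristic. The data layout (per-frame lists of `(box, score)` candidates) and all function names are assumptions for illustration, not RefEgo's actual pipeline; real systems use far stronger trackers.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0


def track_referred_object(frame_boxes, score_thresh=0.5):
    """Link per-frame grounding candidates into one object track.

    frame_boxes: list over frames; each entry is a list of
    (box, grounding_score) pairs, possibly empty.
    Returns one box (or None, if the object is out of frame) per frame.
    """
    track, prev = [], None
    for candidates in frame_boxes:
        # Boost candidates that overlap the previous position, so a
        # spatially consistent box can beat a higher-scored distractor.
        scored = [(s + (iou(b, prev) if prev is not None else 0.0), b)
                  for b, s in candidates if s >= score_thresh]
        if scored:
            _, box = max(scored)
            track.append(box)
            prev = box  # keep prev across gaps to re-link after re-entry
        else:
            track.append(None)  # referred object out of frame
    return track
```

The key design choice here is keeping the last known box even across empty frames, so that when the referred object re-enters the view it can be re-associated with the existing track rather than starting a new one.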


