Object Referring in Videos with Language and Human Gaze

01/04/2018
by Arun Balajee Vasudevan, et al.

We investigate the problem of object referring (OR), i.e., localizing a target object in a visual scene given a language description. Humans perceive the world more as continuous video snippets than as static images, and describe objects not only by their appearance, but also by their spatial-temporal context and motion features. Humans also gaze at the object when they issue a referring expression. Existing work on OR mostly focuses on static images, which fall short in providing many such cues. This paper addresses OR in videos with language and human gaze. To that end, we present a new video dataset for OR, with 30,000 objects over 5,000 stereo video sequences annotated for their descriptions and gaze. We further propose a novel network model for OR in videos that integrates appearance, motion, gaze, and spatial-temporal contextual information into one network. Experimental results show that our method effectively utilizes motion cues, human gaze, and spatial-temporal context information. Our method outperforms previous OR methods. The dataset and code will be made available.


research
06/06/2023

Human-Object Interaction Prediction in Videos through Gaze Following

Understanding the human-object interactions (HOIs) from a video is essen...
research
09/13/2018

Seeing Tree Structure from Vibration

Humans recognize object structure from both their appearance and motion;...
research
09/13/2023

STUPD: A Synthetic Dataset for Spatial and Temporal Relation Reasoning

Understanding relations between objects is crucial for understanding the...
research
10/31/2019

A Self Validation Network for Object-Level Human Attention Estimation

Due to the foveated nature of the human vision system, people can focus ...
research
03/27/2023

Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention

Predicting human gaze is important in Human-Computer Interaction (HCI). ...
research
06/08/2022

Language-Bridged Spatial-Temporal Interaction for Referring Video Object Segmentation

Referring video object segmentation aims to predict foreground labels fo...
research
04/19/2021

What can human minimal videos tell us about dynamic recognition models?

In human vision objects and their parts can be visually recognized from ...
