Unsupervised Visual-Linguistic Reference Resolution in Instructional Videos

03/07/2017
by De-An Huang, et al.

We propose an unsupervised method for reference resolution in instructional videos, where the goal is to temporally link an entity (e.g., "dressing") to the action (e.g., "mix yogurt") that produced it. The key challenge is the inevitable visual-linguistic ambiguity arising from changes in both the visual appearance and the referring expression of an entity in the video. This challenge is amplified by the fact that we aim to resolve references with no supervision. We address these challenges by learning a joint visual-linguistic model, where linguistic cues can help resolve visual ambiguities and vice versa. We verify our approach by training our model without supervision on more than two thousand unstructured cooking videos from YouTube, and show that our visual-linguistic model substantially improves upon the state-of-the-art linguistic-only model on reference resolution in instructional videos.


