VILT: Video Instructions Linking for Complex Tasks

08/23/2022
by Sophie Fischer, et al.

This work addresses challenges in developing conversational assistants that support rich multimodal video interactions to accomplish real-world tasks interactively. We introduce the task of automatically linking instructional videos to task steps as "Video Instructions Linking for Complex Tasks" (VILT). Specifically, we focus on the domain of cooking and on empowering users to cook meals interactively with a video-enabled Alexa skill. We create a reusable benchmark with 61 queries from recipe tasks and curate a collection of 2,133 instructional "How-To" cooking videos. Studying VILT with state-of-the-art retrieval methods, we find that dense retrieval with ANCE is the most effective, achieving an NDCG@3 of 0.566 and P@1 of 0.644. We also conduct a user study that measures the effect of incorporating videos in a real-world task setting, where 10 participants perform several cooking tasks with varying multimodal experimental conditions using a state-of-the-art Alexa TaskBot system. Users interacting with manually linked videos said they learned something new 64% of the time, compared with 55% for automatically linked videos, indicating that the relevance of linked videos is important for task learning.
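
The two retrieval metrics reported above, NDCG@3 and P@1, can be illustrated with a minimal sketch. It assumes graded relevance labels per query, linear gain, and a log2 rank discount. The function names and sample labels are hypothetical and do not come from the paper, whose evaluation tooling may use a different gain function or relevance threshold.

```python
import math
from typing import Sequence


def precision_at_k(ranked: Sequence[int], k: int, threshold: int = 1) -> float:
    """Fraction of the top-k results whose label is at least `threshold`.

    `ranked` holds the relevance labels of the retrieved videos in rank order.
    The threshold of 1 is an assumption, not a value taken from the paper.
    """
    hits = sum(1 for label in ranked[:k] if label >= threshold)
    return hits / k


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with linear gain and a log2(rank + 1) discount."""
    return sum(label / math.log2(rank + 2) for rank, label in enumerate(labels[:k]))


def ndcg_at_k(ranked: Sequence[int], judged: Sequence[int], k: int) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking.

    `judged` holds the labels of all judged videos for the query, so the
    ideal ranking can be built by sorting them in descending order.
    """
    ideal = dcg_at_k(sorted(judged, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant video was judged for this query
    return dcg_at_k(ranked, k) / ideal


# Hypothetical labels for one query: 2 = highly relevant, 1 = partially, 0 = not.
system_ranking = [2, 0, 1]
all_judged = [2, 1, 0, 0]

p_at_1 = precision_at_k(system_ranking, k=1)
ndcg_at_3 = ndcg_at_k(system_ranking, all_judged, k=3)
print(f"P@1 = {p_at_1:.3f}, NDCG@3 = {ndcg_at_3:.3f}")
```

A benchmark-level score would be the mean of these per-query values over all queries, 61 in the case of the benchmark described here.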

research
11/16/2022

Task-aware Retrieval with Instructions

We study the problem of retrieval with instructions, where users of a re...
research
09/20/2023

Grounded Complex Task Segmentation for Conversational Assistants

Following complex instructions in conversational assistants can be quite...
research
03/05/2015

What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

We present a novel method for aligning a sequence of instructions to a v...
research
11/17/2021

Induce, Edit, Retrieve: Language Grounded Multimodal Schema for Instructional Video Retrieval

Schemata are structured representations of complex tasks that can aid ar...
research
02/17/2023

Multimodal Subtask Graph Generation from Instructional Videos

Real-world tasks consist of multiple inter-dependent subtasks (e.g., a d...
research
06/30/2015

Unsupervised Learning from Narrated Instruction Videos

We address the problem of automatically learning the main steps to compl...
research
10/09/2017

LD-SDS: Towards an Expressive Spoken Dialogue System based on Linked-Data

In this work we discuss the related challenges and describe an approach ...
