
What's Cookin'? Interpreting Cooking Videos using Text, Speech and Vision

03/05/2015
by Jonathan Malmaud, et al. (Google, MIT)

We present a novel method for aligning a sequence of instructions to a video of someone carrying out a task. In particular, we focus on the cooking domain, where the instructions correspond to the recipe. Our technique relies on an HMM to align the recipe steps to the (automatically generated) speech transcript. We then refine this alignment using a state-of-the-art visual food detector, based on a deep convolutional neural network. We show that our technique outperforms simpler techniques based on keyword spotting. It also enables interesting applications, such as automatically illustrating recipes with keyframes, and searching within a video for events of interest.
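As a rough illustration of the alignment idea only (not the paper's actual model or features), the sketch below runs Viterbi decoding over a left-to-right, monotone sequence of recipe steps, scoring each transcript sentence against each step with a simple word-overlap emission score. The word_overlap function, the stay_penalty parameter, and the example data are all hypothetical stand-ins introduced for demonstration.

# Illustrative sketch: monotone HMM-style alignment of recipe steps (hidden
# states) to transcript sentences (observations). Hypothetical scoring only.

def word_overlap(step, sentence):
    """Emission score: fraction of recipe-step words found in the sentence."""
    step_words = set(step.lower().split())
    sent_words = set(sentence.lower().split())
    return len(step_words & sent_words) / max(len(step_words), 1)

def align(recipe_steps, transcript, stay_penalty=0.1):
    """Viterbi decoding over a left-to-right HMM: each transcript sentence
    either stays on the current recipe step or advances to the next one."""
    n_steps, n_obs = len(recipe_steps), len(transcript)
    NEG = float("-inf")
    score = [[NEG] * n_steps for _ in range(n_obs)]
    back = [[0] * n_steps for _ in range(n_obs)]

    score[0][0] = word_overlap(recipe_steps[0], transcript[0])
    for t in range(1, n_obs):
        for s in range(n_steps):
            emit = word_overlap(recipe_steps[s], transcript[t])
            stay = score[t - 1][s] - stay_penalty          # remain on same step
            move = score[t - 1][s - 1] if s > 0 else NEG   # advance to next step
            if move >= stay:
                score[t][s], back[t][s] = move + emit, s - 1
            else:
                score[t][s], back[t][s] = stay + emit, s

    # Trace back the best monotone path of step indices.
    s = max(range(n_steps), key=lambda k: score[-1][k])
    path = [s]
    for t in range(n_obs - 1, 0, -1):
        s = back[t][s]
        path.append(s)
    return list(reversed(path))

if __name__ == "__main__":
    steps = ["preheat the oven", "mix flour and eggs", "bake for twenty minutes"]
    spoken = ["first preheat your oven", "now mix the flour with two eggs",
              "keep mixing until smooth", "bake it for about twenty minutes"]
    print(align(steps, spoken))  # prints [0, 1, 1, 2]

In the paper's full pipeline, an alignment of this kind is then refined with scores from a visual food detector; in a sketch like this, that would amount to adding a second, vision-based term to the emission score.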


Related research

Learning to Localize and Align Fine-Grained Actions to Sparse Instructions (09/22/2018)
Automatic generation of textual video descriptions that are time-aligned...

VILT: Video Instructions Linking for Complex Tasks (08/23/2022)
This work addresses challenges in developing conversational assistants t...

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention (10/24/2017)
This paper describes a novel text-to-speech (TTS) technique based on dee...

A Recipe for Creating Multimodal Aligned Datasets for Sequential Tasks (05/19/2020)
Many high-level procedural tasks can be decomposed into sequences of ins...

Unsupervised Semantic Action Discovery from Video Collections (05/11/2016)
Human communication takes many forms, including speech, text and instruc...

The Conditional Analogy GAN: Swapping Fashion Articles on People Images (09/14/2017)
We present a novel method to solve image analogy problems: it allows to...