Multi-Task Learning of Object State Changes from Uncurated Videos

11/24/2022
by   Tomáš Souček, et al.

We aim to learn to temporally localize object state changes and the corresponding state-modifying actions by observing people interacting with objects in long uncurated web videos. We introduce three principal contributions. First, we explore alternative multi-task network architectures and identify a model that enables efficient joint learning of multiple object states and actions, such as pouring water and pouring coffee. Second, we design a multi-task self-supervised learning procedure that exploits different types of constraints between objects and state-modifying actions, enabling end-to-end training of a model for temporal localization of object states and actions in videos from only noisy video-level supervision. Third, we report results on the large-scale ChangeIt and COIN datasets containing tens of thousands of long (un)curated web videos depicting various interactions such as hole drilling, cream whisking, or paper plane folding. We show that our multi-task model achieves a relative improvement of 40% and significantly outperforms both image-based and video-based zero-shot models for this problem. We also test our method on long egocentric videos of the EPIC-KITCHENS and Ego4D datasets in a zero-shot setup, demonstrating the robustness of our learned model.


Related research:

- Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos (03/22/2022)
- Joint Discovery of Object States and Manipulation Actions (02/09/2017)
- Zero-Shot Generation of Human-Object Interaction Videos (12/05/2019)
- RareAct: A video dataset of unusual interactions (08/03/2020)
- Video-Mined Task Graphs for Keystep Recognition in Instructional Videos (07/17/2023)
- Learning Video Object Segmentation from Unlabeled Videos (03/10/2020)
- Learning Action Changes by Measuring Verb-Adverb Textual Relationships (03/27/2023)
