Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos

03/22/2022
by Tomáš Souček, et al.

Human actions often induce changes of object states, such as "cutting an apple", "cleaning shoes" or "pouring coffee". In this paper, we seek to temporally localize object states (e.g., an "empty" and a "full" cup) together with the corresponding state-modifying actions ("pouring coffee") in long uncurated videos with minimal supervision. The contributions of this work are threefold. First, we develop a self-supervised model for jointly learning state-modifying actions and the corresponding object states from an uncurated set of videos from the Internet. The model is self-supervised by the causal ordering signal, i.e. initial object state → manipulating action → end state. Second, to cope with noisy uncurated training data, our model incorporates a noise adaptive weighting module, supervised by a small number of annotated still images, which allows irrelevant videos to be filtered out efficiently during training. Third, we collect a new dataset with more than 2600 hours of video and 34 thousand changes of object states, and manually annotate a part of this data to validate our approach. Our results demonstrate substantial improvements over prior work in both action and object state recognition in video.
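To make the causal ordering signal concrete, the following is a minimal sketch of how an ordering constraint of the form initial state → action → end state could be applied to per-frame scores. It is an illustration only, not the authors' implementation: the function name `best_ordered_triple`, the three score lists, and the choice to add the three scores together are all assumptions made for this example.

```python
from typing import Sequence, Tuple


def best_ordered_triple(
    initial: Sequence[float],
    action: Sequence[float],
    end: Sequence[float],
) -> Tuple[int, int, int]:
    """Return frame indices (i, j, k) with i < j < k that maximize
    initial[i] + action[j] + end[k].

    The inputs are hypothetical per-frame scores, e.g. classifier
    outputs for "initial state", "action" and "end state".
    Summing the scores is an assumption of this sketch.
    """
    n = len(initial)
    if not (n == len(action) == len(end)):
        raise ValueError("all score sequences must have the same length")
    if n < 3:
        raise ValueError("need at least three frames")

    # prefix_best[j]: index i < j with the highest initial-state score.
    prefix_best = [0] * n
    best = 0
    for j in range(1, n):
        if initial[j - 1] > initial[best]:
            best = j - 1
        prefix_best[j] = best

    # suffix_best[j]: index k > j with the highest end-state score.
    suffix_best = [n - 1] * n
    best = n - 1
    for j in range(n - 2, -1, -1):
        if end[j + 1] > end[best]:
            best = j + 1
        suffix_best[j] = best

    # Pick the action frame j whose best surrounding states score highest.
    best_j = max(
        range(1, n - 1),
        key=lambda j: initial[prefix_best[j]] + action[j] + end[suffix_best[j]],
    )
    return prefix_best[best_j], best_j, suffix_best[best_j]


if __name__ == "__main__":
    initial_scores = [0.9, 0.8, 0.1, 0.1, 0.0]
    action_scores = [0.0, 0.1, 0.9, 0.2, 0.1]
    end_scores = [0.0, 0.1, 0.2, 0.7, 0.9]
    print(best_ordered_triple(initial_scores, action_scores, end_scores))
```

Frames selected this way could then serve as pseudo-labels for further training. How the model described in the paper actually uses the ordering signal is not specified in this abstract, so that step is left out of the sketch.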


research
11/24/2022

Multi-Task Learning of Object State Changes from Uncurated Videos

We aim to learn to temporally localize object state changes and the corr...
research
02/09/2017

Joint Discovery of Object States and Manipulation Actions

Many human activities involve object manipulations aiming to modify the ...
research
05/27/2019

Toward Self-Supervised Object Detection in Unlabeled Videos

Unlabeled video in the wild presents a valuable, yet so far unharnessed,...
research
12/13/2019

End-to-End Learning of Visual Representations from Uncurated Instructional Videos

Annotating videos is cumbersome, expensive and not scalable. Yet, many s...
research
07/04/2014

Weakly Supervised Action Labeling in Videos Under Ordering Constraints

We are given a set of video clips, each one annotated with an ordered l...
research
03/27/2023

Learning Action Changes by Measuring Verb-Adverb Textual Relationships

The goal of this work is to understand the way actions are performed in ...
research
11/19/2020

Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision

In this paper, we teach machines to understand visuals and natural langu...
