Self-supervised Learning for Unintentional Action Prediction

09/24/2022
by Olga Zatsarynna, et al.

Distinguishing whether an action is performed as intended or whether it fails is an important skill, not only for humans but also for intelligent systems that operate in human environments. Recognizing whether an action is unintentional, or anticipating whether an action will fail, is not straightforward, however, due to the lack of annotated data. While videos of unintentional or failed actions can be found on the Internet in abundance, high annotation costs are a major bottleneck for training networks for these tasks. In this work, we thus study the problem of self-supervised representation learning for unintentional action prediction. While previous works learn the representation based on a local temporal neighborhood, we show that the global context of a video is needed to learn a good representation for the three downstream tasks: unintentional action classification, localization, and anticipation. In the supplementary material, we show that the learned representation can also be used for detecting anomalies in videos.

research
11/21/2016

Self-Supervised Video Representation Learning With Odd-One-Out Networks

We propose a new self-supervised CNN pre-training technique based on a n...
research
12/11/2021

Self-supervised Spatiotemporal Representation Learning by Exploiting Video Continuity

Recent self-supervised video representation learning methods have found ...
research
10/28/2019

Skip-Clip: Self-Supervised Spatiotemporal Representation Learning by Future Clip Order Ranking

Deep neural networks require collecting and annotating large amounts of ...
research
11/22/2021

Towards Tokenized Human Dynamics Representation

For human action understanding, a popular research direction is to analy...
research
05/04/2022

TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition

Recognizing transformation types applied to a video clip (RecogTrans) is...
research
11/12/2018

A Perceptual Prediction Framework for Self Supervised Event Segmentation

Temporal segmentation of long videos is an important problem, that has l...
research
10/16/2020

What Can You Learn from Your Muscles? Learning Visual Representation from Human Interactions

Learning effective representations of visual data that generalize to a v...
