ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers

02/23/2022
by Kunyu Peng, et al.

Automatically understanding human behaviour allows household robots to identify the most critical needs and plan how to assist the human according to the current situation. However, the majority of such methods are developed under the assumption that a large number of labelled training examples is available for all concepts-of-interest. Robots, on the other hand, operate in constantly changing unstructured environments, and need to adapt to novel action categories from very few samples. Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays, which are then used as input to convolutional neural networks. We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement. In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning. Inspired by the recent success of feature enhancement methods in semi-supervised learning, we further introduce ProFormer – an improved training strategy which uses soft-attention over iteratively estimated action category prototypes to augment the embeddings and compute an auxiliary consistency loss. Extensive experiments consistently demonstrate the effectiveness of our approach for one-shot recognition from body poses, achieving state-of-the-art results on multiple datasets and surpassing the best published approach on the challenging NTU-120 one-shot benchmark by 1.84%. Our code is available at https://github.com/KPeng9510/ProFormer.
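To make the described pipeline concrete, below is a minimal PyTorch sketch of the two core ideas: (1) treating a pose sequence as a 3-channel image-like array that is split into patch embeddings and encoded by a visual transformer, and (2) augmenting the resulting embedding with a soft-attention mixture over class prototypes, plus a simple consistency term between the plain and augmented embeddings. All names, shapes, and hyper-parameters here (SkeletonViT, soft_prototype_augment, the patch size, the temperature) are illustrative assumptions rather than the authors' released implementation; see the linked repository for the actual code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonViT(nn.Module):
    # Minimal visual-transformer encoder over image-like skeleton arrays.
    # A sequence of T frames x J joints x 3 coordinates is treated as a
    # 3-channel T-by-J "image", cut into patches, and encoded.
    def __init__(self, frames=64, joints=25, patch=(8, 5), dim=256, depth=4, heads=8):
        super().__init__()
        num_patches = (frames // patch[0]) * (joints // patch[1])
        # A strided convolution is equivalent to linearly projecting flattened patches.
        self.to_patches = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(0.02 * torch.randn(1, num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # x: (B, 3, T, J)
        p = self.to_patches(x)                 # (B, dim, T/ph, J/pw)
        p = p.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        z = self.encoder(p + self.pos)
        return z.mean(dim=1)                   # pooled embedding, (B, dim)

def soft_prototype_augment(emb, prototypes, temperature=0.1):
    # Mix class prototypes into each embedding, weighted by
    # soft-attention over embedding-prototype similarity.
    attn = F.softmax(emb @ prototypes.t() / temperature, dim=-1)   # (B, C)
    return emb + attn @ prototypes                                 # (B, dim)

def consistency_loss(emb, aug_emb):
    # Auxiliary term encouraging plain and prototype-augmented
    # embeddings to stay aligned (here: mean cosine distance).
    return 1.0 - F.cosine_similarity(emb, aug_emb, dim=-1).mean()

# Example: embed a batch of 4 pose sequences; augment with 120 class prototypes.
model = SkeletonViT()
poses = torch.randn(4, 3, 64, 25)       # (batch, xyz, frames, joints)
protos = torch.randn(120, 256)          # e.g. running class-mean embeddings
emb = model(poses)
loss_cons = consistency_loss(emb, soft_prototype_augment(emb, protos))

In the paper, the prototypes are iteratively re-estimated from the current embeddings during training, and the consistency term is added to a deep-metric-learning objective; the cosine-distance loss above is one plausible instantiation, not necessarily the exact formulation used.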
