Self-supervised Extraction of Human Motion Structures via Frame-wise Discrete Features

09/12/2023
by Tetsuya Abe, et al.

This paper proposes an encoder-decoder model that extracts the structure of human motion, represented by frame-wise discrete features, in a self-supervised manner. Features are extracted as codes in a motion codebook without the use of human knowledge, and the relationships between these codes can be visualized as a graph. Because the codes are expected to be temporally sparse relative to the capture frame rate and to be shared across multiple sequences, the model requires training constraints that encourage both properties. Specifically, the model consists of self-attention layers and a vector clustering block. The attention layers contribute to finding sparse keyframes and discrete features, which are then extracted as motion codes by vector clustering. The constraints are realized as training losses so that identical motion codes are as contiguous as possible and are shared by multiple sequences. In addition, we propose the use of causal self-attention to compute attention over long sequences consisting of numerous frames. In our experiments, the sparse structures of motion codes were used to compile a graph that facilitates visualization of the relationships between codes and the differences between sequences. We then evaluated the extracted motion codes on multiple recognition tasks and found that linear probing achieved performance comparable to task-optimized methods.
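As a rough illustration of the pipeline the abstract describes, the sketch below assigns each frame feature to its nearest codebook entry, collapses contiguous repeats into a sparse code sequence, and counts code-to-code transitions as the edges of a directed graph. It is not the authors' implementation: the function names (`assign_codes`, `collapse_runs`, `transition_counts`), the squared Euclidean distance, and the toy 2-D features and codebook are all assumptions made for illustration, and plain nearest-neighbor lookup stands in for the learned vector clustering block.

```python
from collections import Counter
from itertools import groupby


def assign_codes(features, codebook):
    """Map each frame feature to the index of its nearest codebook vector.

    Assumption: squared Euclidean distance; the paper's clustering block may differ.
    """
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sqdist(f, codebook[k]))
        for f in features
    ]


def collapse_runs(codes):
    """Merge contiguous repeats into (code, run_length) pairs."""
    return [(code, len(list(group))) for code, group in groupby(codes)]


def transition_counts(code_sequences):
    """Count transitions between consecutive distinct codes across sequences.

    Each key (a, b) is a directed edge a -> b; the value is how often it occurs.
    """
    edges = Counter()
    for codes in code_sequences:
        keys = [code for code, _ in collapse_runs(codes)]
        edges.update(zip(keys, keys[1:]))
    return edges


# Hypothetical toy data: a 3-entry codebook and two short feature sequences.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
seq_a = [(0.1, 0.0), (0.0, 0.1), (0.9, 0.1), (1.1, -0.1), (0.1, 0.9)]
seq_b = [(0.0, 0.0), (0.2, 0.8), (0.1, 1.2)]

codes_a = assign_codes(seq_a, codebook)   # [0, 0, 1, 1, 2]
codes_b = assign_codes(seq_b, codebook)   # [0, 2, 2]
sparse_a = collapse_runs(codes_a)         # [(0, 2), (1, 2), (2, 1)]
graph = transition_counts([codes_a, codes_b])
```

In this toy case both sequences share codes 0 and 2, so the resulting edge counts show where the sequences agree (both start at code 0 and end at code 2) and where they differ (only `seq_a` passes through code 1).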


