Contrastive Learning from Extremely Augmented Skeleton Sequences for Self-supervised Action Recognition

12/07/2021
by   Tianyu Guo, et al.
Peking University

In recent years, self-supervised representation learning for skeleton-based action recognition has advanced with the development of contrastive learning methods. Existing contrastive learning methods use normal augmentations to construct similar positive samples, which limits their ability to explore novel movement patterns. In this paper, to make better use of the movement patterns introduced by extreme augmentations, a Contrastive Learning framework utilizing Abundant Information Mining for self-supervised action Representation (AimCLR) is proposed. First, extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to obtain diverse positive samples, which bring novel movement patterns and improve the universality of the learned representations. Second, since directly using extreme augmentations may not boost performance due to the drastic changes in original identity, the Dual Distributional Divergence Minimization Loss (D³M Loss) is proposed to minimize the distribution divergence in a gentler way. Third, Nearest Neighbors Mining (NNM) is proposed to further expand the positive set and make the abundant information mining process more reasonable. Extensive experiments on the NTU RGB+D 60, PKU-MMD, and NTU RGB+D 120 datasets verify that AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols and yields higher-quality action representations. Our code is available at https://github.com/Levigty/AimCLR.


1 Introduction

Because action recognition has broad applications in fields such as human-computer interaction, video content analysis, and smart surveillance, it has long been a popular research topic in computer vision. Due to the development of depth sensors Zhang (2012) and human pose estimation algorithms Cao et al. (2019); Fang et al. (2017), skeleton-based action recognition has gradually become a significant branch of action recognition.

In the past few years, most existing skeleton-based action recognition methods have been built on the supervised learning framework. Whether CNN-based Du, Fu, and Wang (2015); Ke et al. (2017); Liu, Liu, and Chen (2017), RNN-based Du, Wang, and Wang (2015); Song et al. (2018); Zhang et al. (2019), or GCN-based Yan, Xiong, and Lin (2018); Shi et al. (2019); Si et al. (2019); Chen et al. (2021), these methods rely on large amounts of labeled data to learn action representations. Fully supervised action recognition is inevitably data-driven, but the cost of labeling large-scale datasets is particularly high. Therefore, more and more researchers intend to use unlabeled skeleton data to learn human action representations.

Recently, several works Zheng et al. (2018); Su, Liu, and Shlizerman (2020); Lin et al. (2020) focus on designing pretext tasks for self-supervised methods to learn action representations from unlabeled skeleton data. With the development of contrastive self-supervised learning and its ability to make feature representations more discriminative, several works Rao et al. (2021); Li et al. (2021) directly rely on the contrastive learning framework, using normal augmentations to construct similar positive samples. However, those carefully designed augmentations limit the model's ability to further explore the novel movement patterns exposed by other augmentations, and several significant motivations still need to be carefully considered:

1) Stronger data augmentations could benefit representation learning. SkeletonCLR Li et al. (2021) uses only two data augmentations, Shear and Crop. Nevertheless, studies Tian et al. (2020); Wang and Qi (2021) have shown that data augmentation design is crucial, and the abundant semantic information introduced by stronger data augmentations can significantly improve the generalizability of learned representations and eventually bridge the gap with fully supervised methods.

2) Directly using stronger augmentations could deteriorate the performance. Stronger data augmentations bring novel movement patterns while the augmented skeleton sequence may not keep the identity of the original sequence. Therefore, directly using extreme augmentations may not necessarily be able to boost the performance due to the drastic changes in original identity. Thus, additional efforts are needed to explore the role of stronger augmentations.

3) How to force the model to learn more features. Simply relying on the contrastive learning framework cannot force the model to learn richer features. Studies Pan et al. (2021); Cheng et al. (2020) have shown that the drop mechanism can be used in contrastive learning and can effectively alleviate over-fitting. However, the drop mechanism has not yet been well exploited in self-supervised skeleton-based action recognition.

4) How to better expand the positive set to make the learning process more reasonable. In contrastive learning, two differently augmented samples from the same sample are considered positive, while all samples in the memory bank are treated as negative. However, the samples in the memory bank are not necessarily all negatives, which makes the learning process unreasonable to a certain extent.

To this end, a contrastive learning framework utilizing abundant information mining for self-supervised action representation (AimCLR) is proposed; its pipeline is shown in Figure 1. Different from traditional contrastive learning methods Rao et al. (2021); Li et al. (2021), which directly use normally augmented views, our framework builds on extreme augmentations and the drop mechanism to obtain diverse positive samples and bring abundant spatio-temporal information. Then the Dual Distributional Divergence Minimization Loss (D³M Loss) is proposed to minimize the distribution divergence between the normally augmented view and the extremely augmented views. Furthermore, Nearest Neighbors Mining (NNM) is used to expand the positive set and make the learning process more reasonable.

In summary, we have made the following contributions:

  • Compared with traditional contrastive learning methods that use similar augmented pairs, AimCLR uses more extreme augmentations and more reasonable abundant information mining, which greatly improves the effectiveness of contrastive learning.

  • Specifically, the extreme augmentations and the Energy-based Attention-guided Drop Module (EADM) are proposed to introduce novel movement patterns and force the model to learn more general representations. The D³M Loss is then proposed to gently learn from the introduced movement patterns. To alleviate the irrationality of the positive set, we further propose the Nearest Neighbors Mining (NNM) strategy.

  • With the multi-stream fusion scheme, our 3s-AimCLR achieves state-of-the-art performances under a variety of evaluation protocols such as KNN evaluation, linear evaluation, semi-supervised evaluation, and finetune evaluation protocol on three benchmark datasets.

2 Related Work

Figure 1: The pipeline of the proposed AimCLR. Through the normal and extreme augmentations, two normally augmented sequences x1, x2 and one extremely augmented sequence x3 are obtained from the input sequence x. The query encoder and an MLP extract the queries q and q_e, while the query encoder with EADM and an MLP is used to obtain the dropped query q_d. The momentum-updated key encoder and an MLP are used to obtain the key k; the keys are stored in the memory bank at each training step, serving as negative samples for the following training steps. Besides the InfoNCE loss, we propose the D³M Loss to minimize the distribution divergence between the normally augmented view and the extremely augmented views. Furthermore, we propose NNM to expand the positive set and make the learning process more reasonable.

Supervised Skeleton-based Action Recognition. Early skeleton-based action recognition methods are usually based on hand-crafted features Wang et al. (2012); Vemulapalli, Arrate, and Chellappa (2014); Vemulapalli and Chellapa (2016). With the rapid development of deep learning in recent years, some methods Du, Wang, and Wang (2015); Song et al. (2018); Zhang et al. (2019) use RNNs to process skeleton data. Meanwhile, several methods convert the 3D skeleton sequence into an image representation and achieve good results with CNNs Du, Fu, and Wang (2015); Ke et al. (2017); Liu, Liu, and Chen (2017). More recently, with the introduction of graph convolutional networks, a variety of GCN-based methods Shi et al. (2019); Si et al. (2019); Chen et al. (2021) have emerged on the basis of ST-GCN Yan, Xiong, and Lin (2018) to better model spatio-temporal structural relationships. In this paper, we adopt the widely used ST-GCN as the encoder to extract skeleton features.

Contrastive Self-Supervised Representation Learning. Some self-supervised learning methods Zhang, Isola, and Efros (2016); Pathak et al. (2016); Gidaris, Singh, and Komodakis (2018) focus on designing various novel pretext tasks to find the pattern information hidden in unlabeled data. MoCo and MoCo v2 He et al. (2020); Chen et al. (2020b) promote contrastive self-supervised learning through a queue-based memory bank and a momentum update mechanism. SimCLR Chen et al. (2020a) uses a much larger batch size to compute the embeddings in real time and a multi-layer perceptron (MLP) head to further improve the performance of self-supervised representation learning. The recent CLSA Wang and Qi (2021) shows that strong augmentations benefit downstream tasks and learns from strongly augmented samples. These developments in contrastive self-supervised representation learning lay the foundation for our AimCLR.

Self-supervised Skeleton-based Action Recognition. LongT GAN Zheng et al. (2018) uses an encoder-decoder to regenerate the input sequence and thereby obtain useful feature representations. P&C Su, Liu, and Shlizerman (2020) proposes a training strategy that weakens the decoder, forcing the encoder to learn more discriminative features. Yang et al. (2021b) design a novel skeleton cloud colorization technique to learn skeleton representations. AS-CAL Rao et al. (2021) and SkeletonCLR Li et al. (2021) use a momentum encoder for contrastive learning on single-stream skeleton sequences, while CrosSCLR Li et al. (2021) proposes a cross-stream knowledge mining strategy to improve performance and ISC Thoker, Doughty, and Snoek (2021) proposes inter-skeleton contrastive learning to learn from multiple different input skeleton representations. To learn more general features, MS²L Lin et al. (2020) introduces multiple self-supervised tasks. However, existing methods rarely explore the gains that abundant spatio-temporal information brings to action recognition. Therefore, a more concise and general framework is needed.

3 AimCLR

3.1 SkeletonCLR Overview

SkeletonCLR Li et al. (2021) is based on the recent advanced practice MoCo v2 Chen et al. (2020b) to learn single-stream 3D action representations. The pipeline of SkeletonCLR is shown in the bottom blue part of Figure 1. Given an encoded query q and an encoded key k, the batch embeddings of k are stored in a first-in-first-out memory bank M = {m_i} to get rid of redundant computation, serving as negative samples for the following training steps. The InfoNCE loss Oord, Li, and Vinyals (2018) can then be written as:

\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(q \cdot k / \tau)}{\exp(q \cdot k / \tau) + \sum_{m_i \in M} \exp(q \cdot m_i / \tau)}    (1)

where τ is the temperature hyper-parameter and the dot product q·k computes the similarity of the ℓ2-normalized embeddings q and k.

After computing the InfoNCE loss in Eq. 1, the query encoder is updated via gradients while the key encoder is updated as a moving average of the query encoder. Denoting the parameters of the query encoder as θ_q and those of the key encoder as θ_k, the key encoder is updated as:

\theta_k \leftarrow m\,\theta_k + (1 - m)\,\theta_q    (2)

where m is a momentum coefficient. The momentum encoder is updated slowly based on the query encoder, which ensures stable key representations.
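
To make the baseline concrete, below is a minimal PyTorch-style sketch of Eq. 1 and Eq. 2 under the MoCo v2 setup the paper follows; the tensor shapes, the default temperature, and the momentum value are illustrative assumptions rather than the exact released implementation.

```python
import torch
import torch.nn.functional as F

def info_nce(q, k, queue, temperature=0.07):
    """Eq. 1: InfoNCE with a memory bank of negatives.
    q, k: (N, D) l2-normalized query/key embeddings; queue: (D, M) negatives."""
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)       # (N, 1) positive logits
    l_neg = torch.einsum("nd,dm->nm", q, queue)                # (N, M) negative logits
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positives at index 0
    return F.cross_entropy(logits, labels)

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    """Eq. 2: key encoder parameters follow a moving average of the query encoder."""
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)
```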

3.2 Data Augmentations

In contrastive learning, the augmentations used to build positive samples provide the semantic information that the encoder learns from. However, carefully designed augmentations limit the encoder's ability to further explore the novel patterns exposed by other augmentations. Therefore, we aim for a more general framework in which extreme augmentations introduce more novel movement patterns than normal augmentations.

1) Normal Augmentations. One spatial augmentation, Shear, and one temporal augmentation, Crop, are used as the normal augmentations, as in SkeletonCLR.

2) Extreme Augmentations. We introduce four spatial augmentations (Shear, Spatial Flip, Rotate, Axis Mask), two temporal augmentations (Crop, Temporal Flip), and two spatio-temporal augmentations (Gaussian Noise and Gaussian Blur). The combination of all eight augmentations (the two normal ones plus the six others) is applied to obtain one extremely augmented sequence, as sketched below. Because the combination space of extreme augmentations is complicated, we do not tune it carefully; instead, we rely on the framework to exploit the novel movement patterns that extreme augmentations introduce beyond the normal ones.
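
For concreteness, the helper below shows how the normal and extreme pipelines could be assembled from individual augmentation functions (such as those sketched in Appendix A); the function names in the usage comment are placeholders, not the released API.

```python
from typing import Callable, List
import numpy as np

Transform = Callable[[np.ndarray], np.ndarray]

def compose(augs: List[Transform]) -> Transform:
    """Chain augmentations into a single transform, applied left to right."""
    def apply(seq: np.ndarray) -> np.ndarray:
        for aug in augs:
            seq = aug(seq)
        return seq
    return apply

# Usage sketch (augmentation functions as in Appendix A; names are placeholders):
# normal_transform  = compose([shear, temporal_crop])
# extreme_transform = compose([shear, spatial_flip, rotate, axis_mask,       # spatial
#                              temporal_crop, temporal_flip,                 # temporal
#                              gaussian_noise, gaussian_blur])               # spatio-temporal
```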

3.3 Energy-based Attention-guided Drop Module

For a feature learned by the encoder, we hope that different actions can still be distinguished even if some important parts of the feature are discarded. Studies Pan et al. (2021); Cheng et al. (2020) have shown that the drop mechanism can be used in contrastive learning and effectively alleviates over-fitting. This inspires us to compute a spatio-temporal attention map and drop several important features, which forces the model to learn more features and obtain more general and robust representations.

Many modules Hu, Shen, and Sun (2018); Woo et al. (2018); Lee, Kim, and Nam (2019) have been proposed to calculate attention maps. In order not to introduce additional parameters, we adopt the parameter-free attention module SimAM Yang et al. (2021a) to calculate the attention map. Formally, let t and x_i be the target neuron and the other neurons in a single channel of the input feature f ∈ R^{C×T×V}, where C denotes the number of channels, T the temporal dimension, and V the spatial dimension. The minimal energy of the target neuron t can be computed as:

e_t^{*} = \frac{4(\hat{\sigma}^2 + \lambda)}{(t - \hat{\mu})^2 + 2\hat{\sigma}^2 + 2\lambda}    (3)

where \hat{\mu} and \hat{\sigma}^2 are the mean and variance of all neurons in that channel, and λ is a hyper-parameter. Eq. 3 indicates that the lower the energy e_t^*, the more the neuron t is distinguished from the surrounding neurons and the more important it is for visual processing. Therefore, the importance of each neuron can be obtained by 1/e_t^*, and the attention map is A = sigmoid(1/E), where E groups all e_t^* across the channel and spatio-temporal dimensions. We then use the attention map to drop some important features following Algorithm 1.

1: Input: a GCN feature f with dimension C × T × V; γ: controls the importance margin to drop.
2: Output: the feature f after processing.
3: Compute the attention map A of f with SimAM (Eq. 3);
4: Generate the spatial attention map A_s and the temporal attention map A_t by averaging A over the (C, T) and (C, V) dimensions, respectively;
5: Apply the spatial mask: f ← f ⊙ M_s, where M_s zeroes the joints whose values in A_s exceed the margin controlled by γ;
6: Normalize the feature:
7:   f ← f · count(M_s) / count_ones(M_s);
8: Apply the temporal mask: f ← f ⊙ M_t, where M_t zeroes the frames whose values in A_t exceed the margin controlled by γ;
9: Normalize the feature:
10:   f ← f · count(M_t) / count_ones(M_t);
11: return f;
Algorithm 1 Energy-based attention-guided drop module (EADM).
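
The sketch below illustrates the SimAM importance of Eq. 3 and the drop-and-renormalize steps of Algorithm 1 in PyTorch; the pooling used to obtain the spatial/temporal maps, the thresholding rule, the renormalization, and the value of `gamma` are assumptions made for illustration rather than the exact EADM implementation.

```python
import torch

def simam_importance(x, lam=1e-4):
    """Inverse of the minimal energy in Eq. 3 (SimAM): higher value = more important.
    x: GCN feature of shape (N, C, T, V)."""
    n = x.shape[2] * x.shape[3] - 1
    d = (x - x.mean(dim=(2, 3), keepdim=True)).pow(2)
    v = d.sum(dim=(2, 3), keepdim=True) / n          # per-channel variance estimate
    return d / (4 * (v + lam)) + 0.5                  # equals 1 / e_t^*

def eadm(x, gamma=0.7):
    """Sketch of Algorithm 1: drop the most important joints and frames according
    to the attention map, then rescale (threshold rule and gamma are assumptions)."""
    attn = torch.sigmoid(simam_importance(x))         # (N, C, T, V) importance map
    a_s = attn.mean(dim=(1, 2))                       # (N, V) spatial attention
    a_t = attn.mean(dim=(1, 3))                       # (N, T) temporal attention
    # Zero out joints whose spatial attention exceeds gamma * max, then rescale.
    m_s = (a_s < gamma * a_s.amax(dim=1, keepdim=True)).float()
    x = x * m_s[:, None, None, :]
    x = x * m_s.numel() / m_s.sum().clamp(min=1.0)
    # Same for frames with high temporal attention.
    m_t = (a_t < gamma * a_t.amax(dim=1, keepdim=True)).float()
    x = x * m_t[:, None, :, None]
    x = x * m_t.numel() / m_t.sum().clamp(min=1.0)
    return x
```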

3.4 Dual Distributional Divergence Minimization

As shown in Figure 1, for the input sequence x, we apply normal augmentations to obtain x1 and x2 and extreme augmentations to obtain x3. The query encoder f_q extracts the features h1 = f_q(x1) and h3 = f_q(x3), and h_d denotes the dropped feature obtained by applying EADM on the extremely augmented view. An MLP head g_q projects each feature into a lower-dimensional space: q = g_q(h1), q_e = g_q(h3), and q_d = g_q(h_d). The key encoder f_k and its MLP head g_k are momentum-updated versions of f_q and g_q, and produce the key k = g_k(f_k(x2)).

The memory bank M of negative samples is a first-in-first-out queue updated per iteration with k. After each training step, k is enqueued while the earliest embedding in M is dequeued. M provides numerous negative embeddings, while the newly computed k is the positive embedding. Thus, we can obtain a conditional distribution:

p(m_i \mid q) = \frac{\exp(q \cdot m_i / \tau)}{\exp(q \cdot k / \tau) + \sum_{m_j \in M} \exp(q \cdot m_j / \tau)}    (4)

which encodes the likelihood of the query q being assigned to the embedding m_i in the memory bank M. Similarly, we also have the likelihood of the positive pair, i.e., of the query q being assigned to its positive counterpart k:

p(k \mid q) = \frac{\exp(q \cdot k / \tau)}{\exp(q \cdot k / \tau) + \sum_{m_j \in M} \exp(q \cdot m_j / \tau)}    (5)

The InfoNCE loss in Eq. 1 can be rewritten in another form:

\mathcal{L}_{\mathrm{InfoNCE}} = -\Big(\hat{p}(k \mid q)\log p(k \mid q) + \sum_{m_i \in M}\hat{p}(m_i \mid q)\log p(m_i \mid q)\Big)    (6)

where \hat{p} is the ideal distribution of the likelihood and p is the distribution learned by the network. To avoid exploring the unknown ideal distribution, the InfoNCE loss regards \hat{p} as a one-hot distribution, where the positive pair has \hat{p}(k | q) = 1 and the negative pairs satisfy \hat{p}(m_i | q) = 0. This means that the InfoNCE loss maximizes the agreement of two differently augmented sequences' representations from the same sequence while minimizing the agreement with other, negative sequences. To explore the novel movement patterns from the extreme augmentations, a straightforward approach is to directly use the extremely augmented sequence as the query and the normally augmented sequence as the key in the InfoNCE loss. However, compared to the normally augmented sequence, the extremely augmented sequence may not keep the identity of the original sequence due to the dramatic changes in movement patterns, leading to performance degradation.

In addition, it is almost impossible to obtain the ideal likelihood distribution. Fortunately, CLSA Wang and Qi (2021) found that the normally augmented query and the extremely augmented query share a similar distribution for a randomly initialized network. This inspires us to use the distribution of the normally augmented query over the memory bank to supervise that of the extremely augmented query. It avoids directly using a one-hot distribution for the extremely augmented views and is able to explore the novel patterns exposed by the extreme augmentations.

Similar to Eq. 4 and Eq. 5, we obtain conditional distributions p_e(m_i | q_e) and p_e(k | q_e) for the extremely augmented query q_e based on its positive sample and negative samples. The conditional distributions p_d(m_i | q_d) and p_d(k | q_d) for the dropped query q_d are calculated in the same way. Then, we propose to minimize the following distributional divergence between the normally augmented view and the extremely augmented view:

\mathcal{L}_{d1} = -\Big(p(k \mid q)\log p_e(k \mid q_e) + \sum_{m_i \in M} p(m_i \mid q)\log p_e(m_i \mid q_e)\Big)    (7)

Similarly, the distributional divergence between the normally augmented view and the dropped extremely augmented view is minimized such that:

\mathcal{L}_{d2} = -\Big(p(k \mid q)\log p_d(k \mid q_d) + \sum_{m_i \in M} p(m_i \mid q)\log p_d(m_i \mid q_d)\Big)    (8)

Therefore, the proposed D³M loss consists of the two divergence terms L_{d1} and L_{d2}.
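
One way to implement Eqs. 4-8 is sketched below: the normally augmented query's distribution over {positive key, memory bank} supervises the distributions of the extremely augmented and dropped queries. Treating the target distribution as a constant (stop-gradient) and the temperature value are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def contrast_distribution(q, k, queue, temperature=0.07):
    """Eq. 4/5: softmax likelihoods over [positive key, memory-bank negatives]."""
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)   # (N, 1)
    l_neg = torch.einsum("nd,dm->nm", q, queue)            # (N, M)
    return F.softmax(torch.cat([l_pos, l_neg], dim=1) / temperature, dim=1)

def d3m_loss(q, q_extreme, q_drop, k, queue, temperature=0.07):
    """Eq. 7 + Eq. 8: the normal-view distribution supervises the two extreme-view
    distributions instead of a one-hot target."""
    with torch.no_grad():                                  # target distribution is not back-propagated
        p_normal = contrast_distribution(q, k, queue, temperature)
    p_extreme = contrast_distribution(q_extreme, k, queue, temperature)
    p_drop = contrast_distribution(q_drop, k, queue, temperature)
    loss_d1 = -(p_normal * torch.log(p_extreme + 1e-12)).sum(dim=1).mean()
    loss_d2 = -(p_normal * torch.log(p_drop + 1e-12)).sum(dim=1).mean()
    return loss_d1, loss_d2
```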

3.5 Nearest Neighbors Mining

Traditional contrastive learning methods regard the normally augmented samples from the same sample as positives and all samples in the memory bank as negatives. However, the samples in the memory bank are not necessarily all negative samples Dwibedi et al. (2021). Therefore, the nearest neighbors of the queries q, q_e, and q_d over the memory bank M should also be considered positive samples to expand the positive set.

Specifically, N_q is the index set of the top-k nearest neighbors that are most similar to the normally augmented query q in the memory bank M. Similarly, N_{q_e} and N_{q_d} denote the index sets of the top-k nearest neighbors of the extremely augmented query q_e and the dropped query q_d. Thus, we can treat the mined nearest neighbors as positive samples to make the learning process more reasonable:

\mathcal{L}_{\mathrm{NNM}} = -\log \frac{\exp(q \cdot k / \tau) + \sum_{i \in N_{+}} \exp(q \cdot m_i / \tau)}{\exp(q \cdot k / \tau) + \sum_{m_j \in M} \exp(q \cdot m_j / \tau)}    (9)

where N_{+} = N_q ∪ N_{q_e} ∪ N_{q_d}. Compared to Eq. 1, Eq. 9 leads to a more regular embedding space by pulling close more high-confidence positive samples.
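
A sketch of Eq. 9 is given below, shown for the normal query q only; the extension to the other queries is analogous. The top-k value and the detail that mined indices are shifted by one to skip the positive-key slot are assumptions of this illustration.

```python
import torch

def nnm_loss(q, q_extreme, q_drop, k_pos, queue, topk=1, temperature=0.07):
    """Eq. 9: InfoNCE with the positive set expanded by nearest neighbors mined
    from the memory bank for q, the extreme query, and the dropped query."""
    def logits(x):
        return torch.cat(
            [torch.einsum("nd,nd->n", x, k_pos).unsqueeze(-1),     # positive key
             torch.einsum("nd,dm->nm", x, queue)], dim=1) / temperature  # memory bank

    l_q = logits(q)                                                # (N, 1 + M)
    # Mark the positive key plus the mined neighbors of all three queries.
    pos_mask = torch.zeros_like(l_q, dtype=torch.bool)
    pos_mask[:, 0] = True
    for x in (q, q_extreme, q_drop):
        sim = torch.einsum("nd,dm->nm", x, queue)
        idx = sim.topk(topk, dim=1).indices + 1                    # +1 skips the key slot
        pos_mask.scatter_(1, idx, True)

    exp_logits = torch.exp(l_q)
    pos = (exp_logits * pos_mask).sum(dim=1)
    return -torch.log(pos / exp_logits.sum(dim=1)).mean()
```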

Two-stage Training Strategy.

In the early training stage, the model is not stable or strong enough to provide reliable nearest neighbors. Thus, we perform two-stage training for AimCLR. In the first stage, the model is trained with the loss L_1 = L_InfoNCE + λ1·L_{d1} + λ2·L_{d2}. In the second stage, the loss becomes L_2 = L_NNM + λ1·L_{d1} + λ2·L_{d2}, which starts mining the nearest neighbors. Here, λ1 and λ2 are coefficients that balance the losses. Though other values may achieve better results, we keep a single fixed setting of λ1 and λ2 to make AimCLR more general.
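
A compact sketch of how the two stages could be scheduled is shown below; the coefficient values and the dictionary-based interface are placeholders, while the 150-epoch switch point follows the setting reported in Section 4.2.

```python
def total_loss(epoch, losses, lambda1=1.0, lambda2=1.0, switch_epoch=150):
    """Two-stage objective: InfoNCE plus the D3M terms first, then the NNM loss
    replaces plain InfoNCE once nearest neighbors become reliable.
    `losses` is a dict of precomputed scalar loss tensors:
    {'infonce': ..., 'nnm': ..., 'd1': ..., 'd2': ...}."""
    base = losses["infonce"] if epoch < switch_epoch else losses["nnm"]
    return base + lambda1 * losses["d1"] + lambda2 * losses["d2"]
```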

4 Experiments

4.1 Dataset

PKU-MMD Dataset Liu et al. (2020): It contains almost 20,000 action sequences covering 51 action classes. It consists of two subsets. Part I is an easier version for action recognition, while part II is more challenging with more noise caused by view variation. We conduct experiments under the cross-subject protocol on the two subsets.

NTU RGB+D 60 Dataset Shahroudy et al. (2016): The dataset contains 56,578 action sequences and 60 action classes. There are two evaluation protocols: cross-subject (xsub) and cross-view (xview). In xsub, half of the subjects are used as training sets, and the rest are used as test sets. In xview, the samples of camera 2 and 3 are used for training while the samples of camera 1 are used for testing.

NTU RGB+D 120 Dataset Liu et al. (2019): It extends NTU RGB+D 60 to 120 action classes and 113,945 sequences. There are two evaluation protocols: cross-subject (xsub) and cross-setup (xset). In xsub, actions performed by 53 subjects are used for training and the rest for testing. In xset, the 32 camera setups are split in half for training and the other half for testing.

w/ NA  w/ EA  w/ EADM  w/ NNM    xsub(%)  xview(%)
  ✓      -       -       -        75.0     79.8
  -      ✓       -       -        71.3     77.8
  ✓      ✓       -       -        77.4     82.5
  ✓      ✓       ✓       -        78.2     82.8
  ✓      ✓       ✓       ✓        78.9     83.8
Table 1: Ablation study results on the NTU-60 dataset.
Method  Stream  NTU-60 xsub(%)  NTU-60 xview(%)  PKU part I(%)  NTU-120 xsub(%)  NTU-120 xset(%)
SkeletonCLR (CVPR'21)  joint  68.3  76.4  80.9  56.8  55.9
AimCLR (ours)  joint  74.3 (↑6.0)  79.7 (↑3.3)  83.4 (↑2.5)  63.4 (↑6.6)  63.4 (↑7.5)
SkeletonCLR (CVPR'21)  motion  53.3  50.8  63.4  39.6  40.2
AimCLR (ours)  motion  66.8 (↑13.5)  70.6 (↑19.8)  72.0 (↑8.6)  57.3 (↑17.7)  54.4 (↑14.2)
SkeletonCLR (CVPR'21)  bone  69.4  67.4  72.6  48.4  52.0
AimCLR (ours)  bone  73.2 (↑3.8)  77.0 (↑9.6)  82.0 (↑9.4)  62.9 (↑14.5)  63.4 (↑11.4)
3s-SkeletonCLR (CVPR'21)  joint+motion+bone  75.0  79.8  85.3  60.7  62.6
3s-AimCLR (ours)  joint+motion+bone  78.9 (↑3.9)  83.8 (↑4.0)  87.8 (↑2.5)  68.2 (↑7.5)  68.8 (↑6.2)
Table 2: Linear evaluation results compared with SkeletonCLR on the NTU-60, PKU-MMD, and NTU-120 datasets. "↑" denotes the gain compared to SkeletonCLR using the same stream data. "3s" means three-stream fusion.

4.2 Experimental Settings

All experiments are conducted with the PyTorch Paszke et al. (2019) framework. For data pre-processing, we follow SkeletonCLR and CrosSCLR Li et al. (2021) for a fair comparison. The mini-batch size is set to 128.

Self-supervised Pretext training.

ST-GCN is adopted as the encoder, and the contrastive settings follow SkeletonCLR. For optimization, we use SGD with momentum (0.9) and weight decay (0.0001). The model is trained for 300 epochs with a learning rate of 0.1 (decreased to 0.01 at epoch 250). For the nearest neighbors mining, the number of nearest neighbors k is kept fixed. For the two-stage training strategy described in Section 3.5, the encoder is trained with L_1 in the first 150 epochs and with L_2 in the remaining epochs. For a fair comparison, we also generate three streams of skeleton sequences, i.e., joint, bone, and motion. For all reported three-stream results, we apply fixed weighted fusion as in other multi-stream GCN methods.

KNN Evaluation Protocol.

A K-Nearest Neighbor (KNN) classifier is used on the features of the trained encoder. It can also reflect the quality of the features learned by the encoder.

Linear Evaluation Protocol.

The models are verified by linear evaluation for the action recognition task. Specifically, we train a supervised linear classifier (a fully-connected layer followed by a softmax layer) on top of the fixed encoder.
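
A minimal sketch of the linear evaluation protocol is given below; the optimizer settings, learning rate, and number of epochs are placeholders, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

def linear_evaluation(encoder, train_loader, num_classes, epochs=100, lr=1.0):
    """Linear evaluation: freeze the pre-trained encoder and train only a
    fully-connected classifier on its features (hyper-parameters are placeholders)."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad = False
    x0, _ = next(iter(train_loader))
    feat_dim = encoder(x0).shape[1]                    # infer feature dimension
    classifier = nn.Linear(feat_dim, num_classes)
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()                  # softmax + NLL
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = encoder(x)                     # frozen features
            loss = criterion(classifier(feats), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```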

Semi-supervised Evaluation Protocol. We pre-train the encoder with all data and then finetune the whole model with only 1% or 10% randomly selected labeled data.

Finetune Protocol. We append a linear classifier to the trained encoder and then train the whole model to compare it with fully supervised methods.

(a) SkeletonCLR
(b) 3s-SkeletonCLR
(c) 3s-CrosSCLR
(d) AimCLR
(e) 3s-AimCLR
(f) 3s-AimCLR*
Figure 2: The t-SNE visualization of embeddings on NTU-60 xsub. "*" means using the cross-stream knowledge mining strategy proposed in 3s-CrosSCLR Li et al. (2021).
Method 100ep 150ep 200ep 300ep
3s-SkeletonCLR(repro.) 71.3 73.8 74.1 74.1
3s-CrosSCLR(repro.) 70.0 72.8 76.0 77.2
3s-AimCLR* (ours) 75.4 76.0 78.2 78.6
3s-AimCLR (ours) 76.5 77.4 78.3 78.9
Table 3: Linear evaluation results on NTU-60 xsub for different epochs. "*" denotes training with the cross-stream knowledge mining strategy.

4.3 Ablation Study

Figure 3: Comparison of KNN accuracy of SkeletonCLR Li et al. (2021) and our AimCLR.

We conduct ablation studies on different datasets to verify the effectiveness of different components of our method.

The effectiveness of the data augmentations. As shown in Table 1, 3s-SkeletonCLR Li et al. (2021) uses the normal augmentations (w/ NA) and achieves accuracies of 75.0% and 79.8% on xsub and xview, respectively. Simply replacing the normal augmentations with extreme augmentations (w/ EA) decreases the accuracy on both xsub and xview, which illustrates that directly using extreme augmentations may not boost performance due to the drastic changes in original identity. When both NA and EA are used, i.e., when the loss in Eq. 7 comes into play, the accuracy is improved by 2.4% and 2.7%.

The effectiveness of EADM and NNM. From Table 1, it is worth noting that when EADM is introduced, the accuracies on xsub and xview are improved by a further 0.8% and 0.3%, respectively. Notably, our 3s-AimCLR achieves the highest accuracy when NNM is further introduced. This shows that the proposed EADM and NNM enable the encoder to learn more robust features that are better suited to downstream tasks.

Method NTU-60(%)
xsub xview
Single-stream:
LongT GAN (AAAI'18) 39.1 48.1
MS²L (ACM MM'20) 52.6 -
AS-CAL (Information Sciences'21) 58.5 64.8
P&C (CVPR'20) 50.7 76.3
SeBiReNet (ECCV'20) - 79.7
SkeletonCLR (CVPR'21) 68.3 76.4
AimCLR (ours) 74.3 79.7
Three-stream:
3s-SkeletonCLR (CVPR'21) 75.0 79.8
3s-Colorization (ICCV'21) 75.2 83.1
3s-CrosSCLR (CVPR'21) 77.8 83.4
3s-AimCLR* (ours) 78.6 82.6
3s-AimCLR (ours) 78.9 83.8
Table 4: Linear evaluation results on the NTU-60 dataset.

The effectiveness of AimCLR. We conduct experiments on three datasets to compare our AimCLR with SkeletonCLR. As can be seen from Table 2, AimCLR performs much better than SkeletonCLR on all three streams of all three datasets, with especially large gains on the motion and bone streams. For the fusion results, our 3s-AimCLR far exceeds 3s-SkeletonCLR on the three datasets. In addition, Table 3 shows that 3s-AimCLR is always better than 3s-CrosSCLR and 3s-SkeletonCLR under the same number of training epochs, whether or not the cross-stream knowledge mining strategy is used. The result of 3s-AimCLR at 100 epochs is even better than that of 3s-SkeletonCLR at 300 epochs.

Qualitative Results. We apply t-SNE Van der Maaten and Hinton (2008) with fixed settings to show the embedding distributions in Figure 2. From the visual results, we conclude that 3s-AimCLR clusters the embeddings of the same class more tightly than 3s-SkeletonCLR for the simple fusion results. When the cross-stream knowledge mining strategy is used, our 3s-AimCLR also makes heavily overlapping action classes more distinguishable compared with 3s-CrosSCLR.

Method part I(%) part II(%)
Supervised:
ST-GCN (AAAI'18) 84.1 48.2
VA-LSTM (TPAMI'19) 84.1 50.0
Self-supervised:
LongT GAN (AAAI'18) 67.7 26.0
MS²L (ACM MM'20) 64.9 27.6
3s-CrosSCLR (CVPR'21) 84.9 21.2
ISC (ACM MM'21) 80.9 36.0
3s-AimCLR* (ours) 87.4 39.5
3s-AimCLR (ours) 87.8 38.5
Table 5: Linear evaluation results on the PKU-MMD dataset.
Method xsub(%) xset(%)
P&C (CVPR'20) 42.7 41.7
AS-CAL (Information Sciences'21) 48.6 49.2
3s-CrosSCLR (CVPR'21) 67.9 66.7
ISC (ACM MM'21) 67.9 67.1
3s-AimCLR* (ours) 68.0 68.7
3s-AimCLR (ours) 68.2 68.8
Table 6: Linear evaluation results on the NTU-120 dataset.

4.4 Comparison with State-of-the-art

We compare the proposed method with prior related methods under a variety of evaluation protocols.

KNN Evaluation Results. Notably, the KNN classifier does not require learning extra weights compared with the linear classifier. From Figure 3, we can see that our AimCLR is better than SkeletonCLR on the two datasets under the KNN classifier. The obvious gains also show that the features learned by our AimCLR are more discriminative.

Linear Evaluation Results on NTU-60. As shown in Table 4, for a single stream (i.e., the joint stream), our AimCLR outperforms all other methods Zheng et al. (2018); Lin et al. (2020); Rao et al. (2021); Su, Liu, and Shlizerman (2020); Nie, Liu, and Liu (2020); Li et al. (2021). For the three-stream setting, 3s-AimCLR leads 3s-SkeletonCLR by 3.9% and 4.0% under the xsub and xview protocols, respectively. It is worth mentioning that regardless of whether 3s-AimCLR uses the cross-stream knowledge mining strategy, the results are better than 3s-CrosSCLR and 3s-Colorization Yang et al. (2021b). This also indicates that even without knowledge mining between streams, 3s-AimCLR is able to learn better feature representations.

Linear Evaluation Results on PKU-MMD. As shown in Table 5, our 3s-AimCLR is ahead of existing self-supervised methods on both part I and part II of this dataset. Part II is more challenging, with more skeleton noise caused by view variation. Notably, 3s-CrosSCLR suffers on part II while our 3s-AimCLR still performs well, which shows that 3s-AimCLR copes well with the noisy movement patterns caused by skeleton noise.

Linear Evaluation Results on NTU-120. As shown in Table 6, our 3s-AimCLR defeats the other self-supervised methods on NTU-120. Our fusion results outperform the advanced ISC (68.2% vs. 67.9% on xsub and 68.8% vs. 67.1% on xset). This shows that our 3s-AimCLR is also competitive on large-scale datasets with many classes.

Semi-supervised Evaluation Results. From Table 7, even with only a small labeled subset, our 3s-AimCLR consistently performs better than the state of the art for all configurations. The results with 1% and 10% labeled data far exceed ISC, 3s-CrosSCLR, and 3s-Colorization. This also shows that the novel movement patterns brought by extreme augmentations have a large impact when only a small amount of labeled data is available.

Method PKU-MMD(%) NTU-60(%)
part I part II xsub xview
1% labeled data:
LongT GAN (AAAI'18) 35.8 12.4 35.2 -
MS²L (ACM MM'20) 36.4 13.0 33.1 -
ISC (ACM MM'21) 37.7 - 35.7 38.1
3s-CrosSCLR (CVPR'21) 49.7 10.2 51.1 50.0
3s-Colorization (ICCV'21) - - 48.3 52.5
3s-AimCLR (ours) 57.5 15.1 54.8 54.3
10% labeled data:
LongT GAN (AAAI'18) 69.5 25.7 62.0 -
MS²L (ACM MM'20) 70.3 26.1 65.2 -
ISC (ACM MM'21) 72.1 - 65.9 72.5
3s-CrosSCLR (CVPR'21) 82.9 28.6 74.4 77.8
3s-Colorization (ICCV'21) - - 71.7 78.9
3s-AimCLR (ours) 86.1 33.4 78.2 81.6
Table 7: Semi-supervised evaluation results on the PKU-MMD and NTU-60 datasets.

Finetuned Evaluation Results. For fair comparison, the ST-GCN encoders used by the methods in Table 8 all have the same structure and number of parameters. For a single bone stream, the finetuned results of our AimCLR are better than those of SkeletonCLR. Moreover, the finetuned 3s-AimCLR also outperforms 3s-CrosSCLR and the supervised 3s-ST-GCN, indicating the effectiveness of our method.

Method NTU-60(%) NTU-120(%)
xsub xview xsub xset
SkeletonCLR‡ (CVPR'21) 82.2 88.9 73.6 75.3
AimCLR‡ (ours) 83.0 89.2 76.4 76.7
3s-ST-GCN (AAAI'18) 85.2 91.4 77.2 77.1
3s-CrosSCLR (CVPR'21) 86.2 92.5 80.5 80.4
3s-AimCLR (ours) 86.9 92.8 80.1 80.9
Table 8: Finetuned results on the NTU-60 and NTU-120 datasets. "‡" means using the bone stream data.

5 Conclusion

In this paper, AimCLR is proposed to explore the novel movement patterns brought by extreme augmentations. Specifically, the extreme augmentations and the energy-based attention-guided drop module are proposed to introduce novel movement patterns and improve the universality of the learned representations. The D³M Loss is proposed to minimize the distribution divergence in a gentler way. To alleviate the irrationality of the positive set, the nearest neighbors mining strategy is further proposed to make the learning process more reasonable. Experiments show that 3s-AimCLR performs favorably against state-of-the-art methods under a variety of evaluation protocols and yields higher-quality action representations.

References

  • Cao et al. (2019) Cao, Z.; Hidalgo, G.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2019. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 43(1): 172–186.
  • Chen et al. (2020a) Chen, T.; Kornblith, S.; Norouzi, M.; and Hinton, G. 2020a. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 1597–1607.
  • Chen et al. (2020b) Chen, X.; Fan, H.; Girshick, R.; and He, K. 2020b. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297.
  • Chen et al. (2021) Chen, Z.; Li, S.; Yang, B.; Li, Q.; and Liu, H. 2021. Multi-scale spatial temporal graph convolutional network for skeleton-based action recognition. In AAAI Conference on Artificial Intelligence, volume 35, 1113–1122.
  • Cheng et al. (2020) Cheng, K.; Zhang, Y.; Cao, C.; Shi, L.; Cheng, J.; and Lu, H. 2020. Decoupling GCN with dropgraph module for skeleton-based action recognition. In European Conference on Computer Vision (ECCV), 536–553.
  • Du, Fu, and Wang (2015) Du, Y.; Fu, Y.; and Wang, L. 2015. Skeleton based action recognition with convolutional neural network. In Asian Conference on Pattern Recognition (ACPR), 579–583.
  • Du, Wang, and Wang (2015) Du, Y.; Wang, W.; and Wang, L. 2015. Hierarchical recurrent neural network for skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1110–1118.
  • Dwibedi et al. (2021) Dwibedi, D.; Aytar, Y.; Tompson, J.; Sermanet, P.; and Zisserman, A. 2021. With a little help from my friends: Nearest-neighbor contrastive learning of visual representations. arXiv preprint arXiv:2104.14548.
  • Fang et al. (2017) Fang, H.-S.; Xie, S.; Tai, Y.-W.; and Lu, C. 2017. RMPE: Regional multi-person pose estimation. In IEEE International Conference on Computer Vision (ICCV), 2334–2343.
  • Gidaris, Singh, and Komodakis (2018) Gidaris, S.; Singh, P.; and Komodakis, N. 2018. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728.
  • He et al. (2020) He, K.; Fan, H.; Wu, Y.; Xie, S.; and Girshick, R. 2020. Momentum contrast for unsupervised visual representation learning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 9729–9738.
  • Hu, Shen, and Sun (2018) Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 7132–7141.
  • Ke et al. (2017) Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; and Boussaid, F. 2017. A new representation of skeleton sequences for 3D action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3288–3297.
  • Lee, Kim, and Nam (2019) Lee, H.; Kim, H.-E.; and Nam, H. 2019. SRM: A style-based recalibration module for convolutional neural networks. In IEEE International Conference on Computer Vision (ICCV), 1854–1862.
  • Li et al. (2021) Li, L.; Wang, M.; Ni, B.; Wang, H.; Yang, J.; and Zhang, W. 2021. 3D human action representation learning via cross-view consistency pursuit. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4741–4750.
  • Lin et al. (2020) Lin, L.; Song, S.; Yang, W.; and Liu, J. 2020. MS²L: Multi-task self-supervised learning for skeleton based action recognition. In ACM International Conference on Multimedia (ACM MM), 2490–2498.
  • Liu et al. (2019) Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.-Y.; and Kot, A. C. 2019. NTU RGB + D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10): 2684–2701.
  • Liu et al. (2020) Liu, J.; Song, S.; Liu, C.; Li, Y.; and Hu, Y. 2020. A benchmark dataset and comparison study for multi-modal human action analytics. ACM Transactions on Multimedia Computing, Communications, and Applications, 16(2): 1–24.
  • Liu, Liu, and Chen (2017) Liu, M.; Liu, H.; and Chen, C. 2017. Enhanced skeleton visualization for view invariant human action recognition. Pattern Recognition, 68: 346–362.
  • Nie, Liu, and Liu (2020) Nie, Q.; Liu, Z.; and Liu, Y. 2020. Unsupervised 3D human pose representation with viewpoint and pose disentanglement. In European Conference on Computer Vision (ECCV), 102–118.
  • Oord, Li, and Vinyals (2018) Oord, A. v. d.; Li, Y.; and Vinyals, O. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.
  • Pan et al. (2021) Pan, T.; Song, Y.; Yang, T.; Jiang, W.; and Liu, W. 2021. VideoMoCo: Contrastive video representation learning with temporally adversarial examples. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 11205–11214.
  • Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems (NeurIPS), 32: 8026–8037.
  • Pathak et al. (2016) Pathak, D.; Krahenbuhl, P.; Donahue, J.; Darrell, T.; and Efros, A. A. 2016. Context encoders: Feature learning by inpainting. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2536–2544.
  • Rao et al. (2021) Rao, H.; Xu, S.; Hu, X.; Cheng, J.; and Hu, B. 2021. Augmented skeleton based contrastive action learning with momentum LSTM for unsupervised action recognition. Information Sciences, 569: 90–109.
  • Shahroudy et al. (2016) Shahroudy, A.; Liu, J.; Ng, T.-T.; and Wang, G. 2016. NTU RGB + D: A large scale dataset for 3D human activity analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1010–1019.
  • Shi et al. (2019) Shi, L.; Zhang, Y.; Cheng, J.; and Lu, H. 2019. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 12026–12035.
  • Si et al. (2019) Si, C.; Chen, W.; Wang, W.; Wang, L.; and Tan, T. 2019. An attention enhanced graph convolutional LSTM network for skeleton-based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1227–1236.
  • Song et al. (2018) Song, S.; Lan, C.; Xing, J.; Zeng, W.; and Liu, J. 2018. Spatio-temporal attention-based LSTM networks for 3D action recognition and detection. IEEE Transactions on Image Processing (TIP), 27(7): 3459–3471.
  • Su, Liu, and Shlizerman (2020) Su, K.; Liu, X.; and Shlizerman, E. 2020. Predict & Cluster: Unsupervised skeleton based action recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 9631–9640.
  • Thoker, Doughty, and Snoek (2021) Thoker, F. M.; Doughty, H.; and Snoek, C. G. 2021. Skeleton-contrastive 3D action representation learning. In ACM International Conference on Multimedia (ACM MM).
  • Tian et al. (2020) Tian, Y.; Sun, C.; Poole, B.; Krishnan, D.; Schmid, C.; and Isola, P. 2020. What makes for good views for contrastive learning? arXiv preprint arXiv:2005.10243.
  • Van der Maaten and Hinton (2008) Van der Maaten, L.; and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
  • Vemulapalli, Arrate, and Chellappa (2014) Vemulapalli, R.; Arrate, F.; and Chellappa, R. 2014. Human action recognition by representing 3D skeletons as points in a lie group. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 588–595.
  • Vemulapalli and Chellapa (2016) Vemulapalli, R.; and Chellapa, R. 2016. Rolling rotations for recognizing human actions from 3D skeletal data. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4471–4479.
  • Wang et al. (2012) Wang, J.; Liu, Z.; Wu, Y.; and Yuan, J. 2012. Mining actionlet ensemble for action recognition with depth cameras. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1290–1297.
  • Wang and Qi (2021) Wang, X.; and Qi, G.-J. 2021. Contrastive learning with stronger augmentations. arXiv preprint arXiv:2104.07713.
  • Woo et al. (2018) Woo, S.; Park, J.; Lee, J.-Y.; and Kweon, I. S. 2018. CBAM: Convolutional block attention module. In European Conference on Computer Vision (ECCV), 3–19.
  • Xu et al. (2020) Xu, S.; Rao, H.; Hu, X.; and Hu, B. 2020. Prototypical contrast and reverse prediction: Unsupervised skeleton based action recognition. arXiv preprint arXiv:2011.07236.
  • Yan, Xiong, and Lin (2018) Yan, S.; Xiong, Y.; and Lin, D. 2018. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI Conference on Artificial Intelligence, volume 32.
  • Yang et al. (2021a) Yang, L.; Zhang, R.-Y.; Li, L.; and Xie, X. 2021a. SimAM: A simple, parameter-free attention module for convolutional neural networks. In International Conference on Machine Learning (ICML), 11863–11874.
  • Yang et al. (2021b) Yang, S.; Liu, J.; Lu, S.; Er, M. H.; and Kot, A. C. 2021b. Skeleton cloud colorization for unsupervised 3D action representation learning. In IEEE International Conference on Computer Vision (ICCV).
  • Zhang et al. (2019) Zhang, P.; Lan, C.; Xing, J.; Zeng, W.; Xue, J.; and Zheng, N. 2019. View adaptive neural networks for high performance skeleton-based human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41(8): 1963–1978.
  • Zhang, Isola, and Efros (2016) Zhang, R.; Isola, P.; and Efros, A. A. 2016. Colorful image colorization. In European Conference on Computer Vision (ECCV), 649–666.
  • Zhang (2012) Zhang, Z. 2012. Microsoft kinect sensor and its effect. IEEE Multimedia, 19(2): 4–10.
  • Zheng et al. (2018) Zheng, N.; Wen, J.; Liu, R.; Long, L.; Dai, J.; and Gong, Z. 2018. Unsupervised representation learning with long-term dynamics for skeleton based action recognition. In AAAI Conference on Artificial Intelligence, volume 32.

Appendix A Data Augmentation

In contrastive learning, the augmentations used to build positive samples provide the semantic information that the encoder learns from. Like traditional contrastive learning frameworks Chen et al. (2020a); He et al. (2020); Chen et al. (2020b), we naturally exploit the "pattern invariant" nature of the skeleton sequence for normal augmentations: the same skeleton sequence is randomly augmented while maintaining similar action patterns, and contrastive learning is then performed to extract more effective action representations. However, such carefully designed augmentations limit the encoder's ability to further explore the novel patterns exposed by other augmentations.

Therefore, extreme augmentations are introduced to bring more novel movement patterns and to learn more general feature representations. We do not want to spend a lot of effort exploring data augmentations; instead, we aim for a more general framework in which extreme augmentations introduce more novel movement patterns than normal augmentations. According to the characteristics of the skeleton sequence and several previous works Rao et al. (2021); Li et al. (2021), we adopt the following data augmentations.

Normal Augmentations. One spatial augmentation, Shear, and one temporal augmentation, Crop, are used as the normal augmentations, as in SkeletonCLR and CrosSCLR Li et al. (2021).

Extreme Augmentations. We introduce four spatial augmentations (Shear, Spatial Flip, Rotate, Axis Mask), two temporal augmentations (Crop, Temporal Flip), and two spatio-temporal augmentations (Gaussian Noise and Gaussian Blur).

1) Shear: The shear augmentation is a linear transformation in the spatial dimension: the 3D coordinates of the body joints are slanted by a random angle. The transformation matrix is defined as:

A = \begin{pmatrix} 1 & a_{12} & a_{13} \\ a_{21} & 1 & a_{23} \\ a_{31} & a_{32} & 1 \end{pmatrix}    (10)

where the shear factors a_{12}, a_{13}, a_{21}, a_{23}, a_{31}, a_{32} are sampled randomly from a uniform distribution [-β, β], and β is the shear amplitude (fixed as a hyper-parameter). The skeleton sequence is then multiplied by the transformation matrix along the channel dimension.
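
A minimal sketch of the shear augmentation in Eq. 10 is given below; the default amplitude `beta` and the (C, T, V, M) data layout are assumptions for illustration.

```python
import numpy as np

def shear(seq, beta=0.5):
    """Shear augmentation (Eq. 10): slant the 3D joint coordinates with random
    off-diagonal factors in [-beta, beta]. `beta` is a placeholder amplitude.
    seq: array of shape (C=3, T, V, M)."""
    a = np.random.uniform(-beta, beta, size=6)
    A = np.array([[1.0, a[0], a[1]],
                  [a[2], 1.0, a[3]],
                  [a[4], a[5], 1.0]])
    # Multiply along the channel (coordinate) dimension.
    return np.einsum("cd,dtvm->ctvm", A, seq)
```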

2) Crop: For image classification tasks, cropping is a very commonly used data augmentation because it increases diversity while maintaining the distinguishability of the original samples. For a temporal skeleton sequence, we symmetrically pad some frames to the sequence and then randomly crop it back to the original length. The padding length is determined by a padding ratio γ, which is fixed as a hyper-parameter.
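
A sketch of the temporal crop is shown below; the padding-length rule (T divided by a padding ratio) and reflection-style symmetric padding are assumptions of this illustration.

```python
import numpy as np

def temporal_crop(seq, pad_ratio=6):
    """Crop augmentation: symmetrically pad frames, then randomly crop back to the
    original length. The padding rule (T // pad_ratio) is an assumption.
    seq: array of shape (C, T, V, M)."""
    C, T, V, M = seq.shape
    pad = T // pad_ratio
    # Reflect a few frames at both ends, then crop a random window of length T.
    padded = np.concatenate([seq[:, :pad][:, ::-1], seq, seq[:, -pad:][:, ::-1]], axis=1)
    start = np.random.randint(0, padded.shape[1] - T + 1)
    return padded[:, start:start + T]
```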

3) Spatial Flip: Spatial flip exploits two facts: the symmetry of the human body structure and the robustness of actions to symmetric exchange. For example, throwing with the left hand and throwing with the right hand should both be recognized as the action "throw". Thus, each frame of the skeleton sequence is symmetrically transformed with a certain probability. In particular, the position of the torso in the center of the skeleton remains unchanged, while the positions of the left and right sub-skeletons are exchanged.

4) Temporal Flip: In Xu et al. (2020), reverse prediction is proposed to learn higher-level information (e.g., movement order) that is meaningful to human perception. Hence, temporal flip is used as one of our data augmentations. Specifically, the skeleton sequence is reversed in time with a certain probability.

5) Rotate: Due to the variability of the camera position in the spatial coordinate system, we introduce random rotation to the skeleton sequence. Inspired by Rao et al. (2021), for each skeleton sequence we randomly select a main rotation axis and choose a random rotation angle for it, and the remaining two axes are rotated by separately sampled random angles. This is consistent with people's general perception of movement while maintaining the movement pattern: a change in the observation viewpoint does not affect the action itself.
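
A sketch of the rotation augmentation is given below; the angle ranges for the main and remaining axes are placeholders, not the paper's values.

```python
import numpy as np

def _rot_matrix(axis, angle):
    """Basic rotation matrix around one of the x/y/z axes (angle in radians)."""
    c, s = np.cos(angle), np.sin(angle)
    R = np.eye(3)
    i, j = [(1, 2), (0, 2), (0, 1)][axis]
    R[i, i], R[i, j], R[j, i], R[j, j] = c, -s, s, c
    return R

def rotate(seq, main_angle=np.pi / 6, minor_angle=np.pi / 18):
    """Rotate augmentation: a random main axis gets a larger random angle, the other
    two axes get smaller random angles. Angle ranges are placeholders.
    seq: array of shape (C=3, T, V, M)."""
    main_axis = np.random.randint(3)
    R = np.eye(3)
    for axis in range(3):
        limit = main_angle if axis == main_axis else minor_angle
        R = _rot_matrix(axis, np.random.uniform(-limit, limit)) @ R
    return np.einsum("cd,dtvm->ctvm", R, seq)
```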

6) Axis Mask: For a 3D skeleton sequence, we hope that its projection to 2D can also be used as an augmented sequence. Specifically, we randomly select an axis and apply a zero-mask to it with a certain probability.

7) Gaussian Noise: To simulate the noisy positions caused by estimation or annotation, we add Gaussian noise over joint coordinates of the original sequence.

8) Gaussian Blur: As an effective augmentation strategy to reduce the level of detail and noise in images, Gaussian blur can also be applied to the skeleton sequence to smooth noisy joints and decrease action details Rao et al. (2021). We randomly sample the standard deviation σ of a Gaussian kernel, which acts as a sliding window of length 15 along the temporal dimension. Joint coordinates of the original sequence are blurred with a 50% chance by the kernel below:

K(i) = \exp\!\left(-\frac{i^2}{2\sigma^2}\right)    (11)

where i denotes the relative position from the center of the window, and the kernel length of 15 corresponds to the total span of i ∈ [-7, 7].
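
A sketch of the temporal Gaussian blur in Eq. 11 is shown below; the sampling range for sigma and the kernel normalization are assumptions of this illustration.

```python
import numpy as np

def gaussian_blur(seq, kernel_size=15, p=0.5):
    """Gaussian blur (Eq. 11): smooth joint coordinates along time with a length-15
    Gaussian kernel; sigma is sampled randomly (range assumed).
    seq: array of shape (C, T, V, M)."""
    if np.random.rand() > p:
        return seq
    sigma = np.random.uniform(0.1, 2.0)                   # assumed sampling range for sigma
    offsets = np.arange(kernel_size) - kernel_size // 2   # relative positions in [-7, 7]
    kernel = np.exp(-offsets ** 2 / (2 * sigma ** 2))
    kernel /= kernel.sum()
    C, T, V, M = seq.shape
    pad = kernel_size // 2
    padded = np.pad(seq, ((0, 0), (pad, pad), (0, 0), (0, 0)), mode="reflect")
    out = np.zeros_like(seq)
    for t in range(T):
        # Weighted sum of the kernel window centered at frame t.
        out[:, t] = np.tensordot(kernel, padded[:, t:t + kernel_size], axes=(0, 1))
    return out
```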

Appendix B Visualization Results

Qualitative Results. We apply t-SNE Van der Maaten and Hinton (2008) with fixed settings to show the embedding distributions of SkeletonCLR and our AimCLR on PKU-MMD part I in Figure 4. We can clearly see that, for all three streams, our AimCLR makes the feature representations of the same class more compact and those of different classes more distinguishable. This is further evidence of the superior performance of our AimCLR on various downstream tasks.

(a) SkeletonCLR
(b) SkeletonCLR†
(c) SkeletonCLR‡
(d) 3s-SkeletonCLR
(e) AimCLR
(f) AimCLR†
(g) AimCLR‡
(h) 3s-AimCLR
Figure 4: The t-SNE visualization of embeddings on PKU-MMD part I. "†" means using motion stream data while "‡" means using bone stream data.