3D Human Action Representation Learning via Cross-View Consistency Pursuit

04/29/2021 ∙ by Linguo Li, et al. ∙ Shanghai Jiao Tong University

In this work, we propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR), leveraging multi-view complementary supervision signals. CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner. CVC-KM exchanges high-confidence positive/negative samples and their distributions among views according to their embedding similarity, ensuring cross-view consistency of the contrastive context, i.e., similar distributions. Extensive experiments show that CrosSCLR achieves remarkable action recognition results on the NTU-60 and NTU-120 datasets under unsupervised settings, with observed higher-quality action representations. Our code is available at https://github.com/LinguoLi/CrosSCLR.




1 Introduction

 Equal Contribution Corresponding Author: Bingbing Ni

Human action recognition is an important but challenging task in computer vision research. Thanks to light-weight and robust pose estimation algorithms [cao2019openpose, xu2020deep], the 3D skeleton has become a popular feature representation for studying human action dynamics. Many 3D action recognition works [du2015hierarchical, zhang2017view, ke2017new, liang2019three, si2019attention, liu2020disentangling, kay2017kinetics] follow a fully-supervised paradigm and require massive labeled 3D skeleton data. However, annotating data is expensive and time-consuming, which prompts the exploration of unsupervised methods [zheng2018unsupervised, lin2020ms2l, rao2020augmented, su2020predict] on skeleton data. Some unsupervised methods exploit the structure completeness within each sample via pretext tasks, including reconstruction [gui2018adversarial, zheng2018unsupervised], auto-regression [kundu2019unsupervised, su2020predict] and jigsaw puzzles [noroozi2016unsupervised, wei2019iterative], but it is unclear whether the designed pretext tasks generalize well to downstream tasks. Other unsupervised methods are based on contrastive learning [wu2018unsupervised, chen2020simple, he2020momentum, khosla2020supervised], aiming to leverage the instance discrimination of samples in latent space.

Figure 1: Hand waving in joint and motion form. The two samples are from the same action class. (a) Usual contrastive learning methods regard them as a negative pair. (b) In a multi-view situation, considering their similar motion patterns, they can be a positive pair. This motivates us to introduce cross-view contrastive learning for skeleton representation.

Although the above approaches improve skeleton representation capability to some extent, we believe the power of unsupervised methods is still far from fully explored. On the one hand, traditional contrastive learning uses only one positive pair generated by data augmentation, and even similar samples are regarded as negatives. Despite their high similarity, these negative samples are forced away in embedding space, which is unreasonable for clustering. On the other hand, current unsupervised methods [ben2018coding, zheng2018unsupervised, kundu2019unsupervised, lin2020ms2l, su2020predict] have not yet explored the rich intra-supervision provided by different skeleton modalities. Considering that skeleton data is easy to obtain in multiple “views”, e.g., joint, motion and bone, the complementary information preserved in different views can assist in mining positive pairs from similar negative samples. As shown in Figure 1, the same hand-waving actions are different in pose (joint) but similar in motion. Usual contrastive learning methods regard them as negative pairs, keeping them apart in embedding space. If such complementary information, i.e., different in joint but similar in motion, could be fully utilized, the size of the hidden positive set in joint could be boosted, enhancing training fidelity. Thus, a cross-view contrastive learning strategy takes advantage of multi-view knowledge, resulting in better-extracted skeleton features.

To this end, we propose a Cross-view Contrastive Learning framework for Skeleton-based action Representation (CrosSCLR), which exploits multi-view information to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, enabling the model to extract more comprehensive cross-view features. First, parallel contrastive learning is invoked for each single-view Skeleton action Representation (SkeletonCLR), yielding multiple single-view embedding features. Second, inspired by the fact that the distance between samples in embedding space reflects their similarity in the original space, we use the high similarity of samples in one view to guide the learning process in another view, as shown in Figure 1. More specifically, the Cross-View Consistent Knowledge Mining (CVC-KM) module is developed to examine the similarity of samples and select the most similar pairs as positives to boost the positive set in complementary views, i.e., the embedding distance/similarity (confidence score) serves as the weight of the corresponding mined sample in the contrastive loss. In other words, CVC-KM conveys the most prominent knowledge from one view to others, introduces complementary pseudo-supervised constraints and promotes information sharing among views. The entire framework excavates positive pairs across views according to the distance between samples in embedding space, so that the extracted skeleton features contain multi-view knowledge and are more competitive for various downstream tasks. Extensive results on the NTU-RGB+D [shahroudy2016ntu, liu2019ntu] datasets demonstrate that our method indeed boosts 3D action representation learning by benefiting from cross-view consistency. We summarize our contributions as follows:


  • We propose CrosSCLR, a cross-view contrastive learning framework for skeleton-based action representation.

  • We develop Contrastive Learning for Skeleton-based action Representation (SkeletonCLR) to learn the single-view representations of skeleton data.

  • We use parallel SkeletonCLR models and CVC-KM to excavate useful samples across views, enabling the model to capture more comprehensive representations unsupervisedly.

  • We evaluate our model on 3D skeleton datasets, e.g., NTU-RGB+D 60/120, and achieve remarkable results under unsupervised settings.

2 Related Work

Self-Supervised Representation Learning.

Self-supervised learning aims to learn feature representations from numerous unlabeled data, usually generating supervision via pretext tasks, e.g., jigsaw puzzles [noroozi2016unsupervised, noroozi2018boosting, wei2019iterative], colorization [zhang2016colorful] and rotation prediction [gidaris2018unsupervised, zhai2019s4l]. For sequence data, supervision can be generated by frame order [misra2016shuffle, fernando2017self, lee2017unsupervised], space-time cubic puzzles [kim2019self] and prediction [wang2019self, lin2020ms2l], but these methods rely highly on the quality of the pretext tasks. Recently, contrastive methods [oord2018representation, wu2018unsupervised, tian2019contrastive, khosla2020supervised] based on instance discrimination have been proposed for representation learning. MoCo [he2020momentum] introduces a memory bank to store the embeddings of negative samples, and SimCLR [chen2020simple] uses a much larger mini-batch size to compute the embeddings in real time, but neither captures cross-view knowledge for 3D action representation. The concurrent work CoCLR [han2020self] leverages multi-modal information for video representation via co-training, but does not consider the contrastive context. Our CrosSCLR simultaneously trains models in all views by encouraging cross-view consistency, leading to more representative embeddings.
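MoCo's memory bank, which our single-view module later adopts, is essentially a fixed-size first-in-first-out queue of past embeddings. A minimal NumPy sketch (class and method names are our own, not from any released code):

```python
import numpy as np

class MemoryQueue:
    """Minimal MoCo-style FIFO memory bank (illustrative sketch)."""

    def __init__(self, size, dim):
        # Initialize with random unit embeddings, as MoCo does.
        m = np.random.randn(size, dim)
        self.bank = m / np.linalg.norm(m, axis=1, keepdims=True)
        self.ptr = 0  # position of the oldest entry

    def enqueue_dequeue(self, batch):
        """Overwrite the oldest entries with the newest batch embeddings."""
        n = batch.shape[0]
        idx = (self.ptr + np.arange(n)) % self.bank.shape[0]
        self.bank[idx] = batch
        self.ptr = (self.ptr + n) % self.bank.shape[0]

    def negatives(self):
        # All stored embeddings serve as negatives for the next step.
        return self.bank
```

Storing embeddings this way avoids recomputing negatives at every step, which is what lets MoCo decouple the number of negatives from the mini-batch size.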

Skeleton-based Action Recognition. To tackle skeleton-based action recognition, early approaches are generally based on hand-crafted features [ni2011rgbd, wang2012mining, vemulapalli2014human, vemulapalli2016rolling]. Recent methods pay more attention to deep neural networks. For the sequential structure of skeleton data, many RNN-based methods [du2015hierarchical, shahroudy2016ntu, zhang2017view, song2018spatio, zhang2019view] were proposed to effectively utilize temporal features. Since RNNs suffer from gradient vanishing [hochreiter2001gradient], CNN-based models [ke2017new, li2017skeleton] attracted researchers’ attention, but they need to convert skeleton data into another form. Further, ST-GCN [stgcn2018aaai] was proposed to better model the graph structure of skeleton data. Attention mechanisms [li2019actional, shi2019two, si2019attention, zhang2020context, zhang2020semantics] and multi-stream structures [liang2019three, shi2019skeleton, shi2019two, wang2020learning] were then applied to adaptively capture multi-stream features based on GCNs. We adopt the widely-used ST-GCN as the backbone to extract skeleton features.

Unsupervised Skeleton Representation. Many unsupervised methods [srivastava2015unsupervised, luo2017unsupervised, martinez2017human, li2018unsupervised] were proposed to capture action representations in videos. For skeleton data, earlier works [zanfir2013moving, ben2018coding] achieved some progress in unsupervised representation learning without deep neural networks. Recent deep learning methods [gui2018adversarial, zheng2018unsupervised, kundu2019unsupervised, su2020predict] are based on encoder-decoder structures or generative adversarial networks (GANs). LongT GAN [zheng2018unsupervised] proposed an auto-encoder-based GAN for sequential reconstruction and evaluated it on action recognition tasks. P&C [su2020predict] uses a weak decoder in the encoder-decoder model, forcing the encoder to learn more discriminative features. MS²L [lin2020ms2l] proposed a multi-task learning scheme for action representation learning. However, these methods depend highly on reconstruction or prediction, and they do not exploit the natural multi-view knowledge of skeleton data. Thus, we introduce CrosSCLR for unsupervised 3D action representation.

3 CrosSCLR

Although the 3D skeleton has shown its importance in action recognition, unsupervised skeleton representation has not been well exploited. Since the easily-obtained “multi-view” skeleton information plays a significant role in action recognition, we expect to exploit it to mine positive samples and pursue cross-view consistency in unsupervised contrastive learning, giving rise to the Cross-view Contrastive Learning framework for Skeleton-based action Representation (CrosSCLR).

As shown in Figure 3, CrosSCLR contains two key modules: 1) SkeletonCLR (Section 3.1): a contrastive learning framework to unsupervisedly learn single-view representations, and 2) CVC-KM (Section 3.2): a module that conveys the most prominent knowledge from one view to others, introduces complementary pseudo-supervised constraints and promotes information sharing among views. Finally, more discriminating representations are obtained by cooperative training (Section 3.2).

3.1 Single-View 3D Action Representation

Contrastive learning has been widely-used due to its instance discrimination capability, especially for images [chen2020simple, he2020momentum] and videos [han2020self]. Inspired by this, we develop SkeletonCLR to learn single-view 3D action representations, based on the recent advanced practice, MoCo [he2020momentum].

SkeletonCLR. It is a memory-augmented contrastive learning method for skeleton representation, which considers different augments of one sample as its positives and other samples as negatives. In each training step, the batch embeddings are stored in a first-in-first-out memory to get rid of redundant computation, serving as negatives for the following steps. Positive samples are embedded close to each other, while the embeddings of negative samples are pushed away. As shown in Figure 2, SkeletonCLR consists of the following major components:


  • A data augmentation module that randomly transforms a given skeleton sequence into two different augments, which are considered a positive pair. For skeleton data, we adopt Shear and Crop as the augmentation strategy (see Section 3.3 and the Appendix).

  • Two encoders that embed the two augments into a hidden space, where the second (key) encoder is a momentum-updated version of the first: its parameters track an exponential moving average of the first encoder’s parameters, controlled by a momentum coefficient. SkeletonCLR uses ST-GCN [stgcn2018aaai] as the backbone (details are in Section 3.3).

  • A simple projector and its momentum-updated version that project the hidden vectors into a lower-dimensional space. The projector is a fully-connected (FC) layer with ReLU.

  • A memory bank that stores negative samples to avoid redundant computation of embeddings. It is a first-in-first-out queue updated per iteration: after each inference step, the newly computed key embeddings enqueue while the earliest embeddings dequeue. During contrastive training, the memory bank provides numerous negative embeddings, while the newly calculated key embedding is the positive one.

  • An InfoNCE [oord2018representation] loss for instance discrimination:

    L_info = −log [ exp(z·ẑ/τ) / ( exp(z·ẑ/τ) + Σ_{i=1}^{M} exp(z·m_i/τ) ) ],

    where z and ẑ are the normalized projected embeddings of the two augments, m_i are the negative embeddings stored in the memory bank of size M, τ is the temperature hyper-parameter [hinton2015distilling], and the dot product computes their similarity.

Constrained by the contrastive loss, the model is unsupervisedly trained to discriminate each sample in the training set. At last, we obtain a strong encoder that is beneficial for extracting distinguishing single-view representations.
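Under the standard MoCo recipe that SkeletonCLR builds on, the InfoNCE loss for one sample can be sketched as follows (illustrative NumPy; the temperature value is an assumption, not the paper's setting):

```python
import numpy as np

def info_nce(z, z_hat, memory, tau=0.07):
    """InfoNCE loss for one sample (sketch of the standard MoCo objective).

    z, z_hat: embeddings of two augments of the same sequence (positive pair).
    memory:   (M, d) array of negative embeddings from the memory bank."""
    # l2-normalize so dot products are cosine similarities.
    z = z / np.linalg.norm(z)
    z_hat = z_hat / np.linalg.norm(z_hat)
    memory = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    pos = np.exp(z @ z_hat / tau)         # similarity to the positive
    neg = np.exp(memory @ z / tau).sum()  # similarities to all negatives
    return -np.log(pos / (pos + neg))
```

The loss is small when the two augments are close and the memory-bank entries are far, which is exactly the instance-discrimination behavior described above.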

Figure 2: Architecture of single-view SkeletonCLR, which is a memory augmented contrastive learning framework.

Limitations of Single-View Contrastive Learning. The above SkeletonCLR still suffers the following limitations:

1) Embedding distribution can provide more reliable information. We expect samples from the same category to be embedded closely. However, instance discrimination in SkeletonCLR uses only one positive pair, and even similar samples are regarded as negatives. It is unreasonable that these negative samples are forced away in embedding space despite their high embedding similarity. In other words, one positive pair cannot fully describe the relationships among samples, and a more reliable embedding distribution is needed, i.e., the positive/negative setting plus the embedding similarity. We aim to mine more representative knowledge to facilitate contrastive learning, which is also the knowledge we want to exchange across views. Thus, we introduce the contrastive context in Section 3.2.

2) Multi-view data can benefit representation learning. SkeletonCLR relies only on single-view data. As shown in Figure 1, since we do not have any annotations, different samples of the same class are inevitably embedded into distinct places far from each other, i.e., they are distributed sparsely/irregularly, bringing much difficulty to linear classification. Considering the readily generated multi-view data of 3D skeletons (see Section 3.3), if such complementary information as in Figure 1, i.e., different in joint but similar in motion, could be fully utilized, the size of the hidden positive set in joint could be boosted, enhancing training fidelity. To this end, we inject this consideration into the unsupervised contrastive learning framework.

Figure 3: (a) CrosSCLR. Given two samples generated from the same raw data, e.g., joint and motion, SkeletonCLR models produce single-view embeddings while cross-view consistent knowledge mining (CVC-KM) exchanges multi-view complementary knowledge. (b) How the cross-view loss works in embedding space. In step 1, we mine high-confidence knowledge from the similarities to boost the positive set of the other view, i.e., a sample shares its cross-view counterpart’s neighbors; in step 2, we use the similarities to supervise the embedding distribution in the other view, so that samples share similar relationships with others. Thus, the two embedding spaces become similar under this constraint.

3.2 Cross-View Consistent Knowledge Mining

Motivated by the situation in Figure 1 that complementary knowledge is preserved in multiple views, we propose Cross-View Consistent Knowledge Mining (CVC-KM), leveraging the high similarity of samples in one view to guide the learning process in another view. It excavates positive pairs across views according to the embedding similarity to promote knowledge exchange among views; the size of the hidden positive set in each view can then be boosted, and the extracted skeleton features will contain multi-view knowledge, resulting in a more regular embedding space.

In this section, we first clarify contrastive context as the consistent knowledge across views, and then show how to mine high-confidence knowledge, and finally inject its cross-view consistency into single-view SkeletonCLR to further benefit the cross-view unsupervised representation.

Contrastive Context as Consistent Knowledge. As discussed above, the knowledge we want to exchange across views is one sample’s contrastive context, which describes this sample’s relationships with others (its distribution) in embedding space under the settings of contrastive learning. Notice that SkeletonCLR uses a memory bank to store the necessary embeddings. Given one sample’s embedding and the corresponding memory bank, its contrastive context is the set of similarities (dot products) between the embedding and the entries of the memory bank, conditioned on a specific knowledge miner that generates the index set of positive samples. Thus the contrastive context consists of the following two aspects:


  • Embedding context: the relationship between one sample and the others in embedding space, i.e., the distribution;

  • Contrastive setting: the positive index set mined by the knowledge miner according to the embedding similarity;

thus the contrastive context splits into a positive context and a negative context over the memory bank. The contrastive context contains not only the information of the most similar samples but also the detailed relationships among samples (the distribution).

In Equation (1), the embedding’s positive context does not consider any of its neighbors in embedding space except for its augments. Despite their high similarity, the negative samples are forced away in embedding space, so samples belonging to the same category can hardly be embedded into the same cluster, which is inefficient for building a “regular” embedding space for down-stream classification tasks.

High-confidence Knowledge Mining. To solve the above issue, we develop the high-confidence Knowledge Mining mechanism (KM), which selects the most similar pairs as positive ones to boost the positive sets. It shares similar high-level spirit with neighborhood embedding [hinton2002stochastic] but performs differently in an unsupervised contrastive manner.

Specifically, it is based on the following observation in Figure 4: after single-view contrastive learning, two embeddings most likely belong to the same category if they are embedded closely enough; conversely, two embeddings hardly belong to the same class if they lie extremely far from each other in embedding space. Therefore, we can facilitate contrastive learning by setting the most similar embeddings as positives to make the embedding space more clustered:


where the knowledge miner selects the indices of the top-k most similar embeddings in the memory bank. Compared to Equation (1), Equation (5) leads to a more regular space by pulling more high-confidence positive samples close. Additionally, since we do not have any labels, a larger k may harm the contrastive performance (see Section 4.3).
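The mining step itself reduces to a top-k lookup over the memory bank. A hedged NumPy sketch (the function name is ours):

```python
import numpy as np

def mine_topk(z, memory, k=1):
    """Return indices of the top-k most similar memory-bank embeddings,
    to be treated as extra positives (sketch of the KM step)."""
    z = z / np.linalg.norm(z)
    memory = memory / np.linalg.norm(memory, axis=1, keepdims=True)
    sims = memory @ z  # cosine similarity to every stored embedding
    # argpartition finds the k largest similarities without a full sort.
    return np.argpartition(-sims, k - 1)[:k]
```

With small k, only the nearest neighbors (most likely same-category samples, per the observation above) are promoted to positives; the ablation in Section 4.3 studies how performance degrades as k grows.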

Cross-View Consistency Learning. Considering the easily-obtained multi-view skeleton data, complementary information preserved in different views can assist in mining positive pairs from similar negative samples, as in Figure 1. The size of the hidden positive set can then be boosted by cross-view knowledge communication, resulting in better-extracted skeleton features. To this end, we design cross-view consistency learning, which not only mines high-confidence positive samples from the complementary view but also lets the embedding context be consistent across views. Its two-view case is illustrated in Figure 3 as an example.

Specifically, two samples are generated from the same raw data by the view generation method in Section 3.3, one for each of two data views. After single-view contrastive learning, the two SkeletonCLR modules obtain their embeddings and the corresponding memory banks respectively. We can then mine high-confidence knowledge from the two views with the knowledge miner, obtaining the positive context of each view.

CrosSCLR aims to learn consistent embedding distributions in different views by encouraging the similarity of contrastive contexts, i.e., exchanging high-confidence knowledge across views. As Figure 3 (b) illustrates, using the knowledge of one view to guide the other view’s contrastive learning involves two aspects: 1) step 1: the most similar (positive) pairs mined in the first view are taken as the positive set in the second view, so a sample shares its cross-view counterpart’s positive neighbors; 2) step 2: the embedding similarities from the first view serve as the weights of the corresponding embeddings in the second view, providing the detailed relational information. Finally, the overall loss is conducted as:


where means we transfer the contrastive context of to that of . Since are the embedding context of , we call Equation (7) as cross-view contrastive context learning, which constrains the similar distribution of two views (see Section 4.3, the results of t-SNE). Compared to Equation (5), Equation (6) considers the cross-view information, cooperatively using one view’s high-confidence positive samples and its distribution to instruct the other view’s contrastive learning, resulting in more regular space and better extracted skeleton features.
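The two-step exchange can be sketched in simplified form (names are ours; the actual Equation (6) also keeps the augment positive pair, and the exact weighting details may differ):

```python
import numpy as np

def cross_view_loss(z_u, mem_u, pos_v, sims_v, tau=0.07):
    """Sketch of the cross-view term: view v hands over its mined positive
    index set `pos_v` and similarity weights `sims_v`, which reweight view
    u's contrastive loss so u's embedding context follows v's.

    z_u:    embedding of the sample in view u.
    mem_u:  (M, d) memory bank of view u.
    pos_v:  indices of positives mined in view v.
    sims_v: similarity scores of those positives, observed in view v."""
    z_u = z_u / np.linalg.norm(z_u)
    mem_u = mem_u / np.linalg.norm(mem_u, axis=1, keepdims=True)
    logits = np.exp(mem_u @ z_u / tau)
    total = logits.sum()
    # Positives mined in view v are pulled close in view u, each weighted
    # by the similarity it had in view v (the embedding context).
    w = sims_v / sims_v.sum()
    return -(w * np.log(logits[pos_v] / total)).sum()
```

The loss is small when view u's embedding already sits close to the samples that view v considers its neighbors, which is the consistency the module enforces.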

Learning CrosSCLR. For more views, CrosSCLR has the following objective:


where the objective aggregates the cross-view losses over all pairs of distinct views.

In the early training process, the model is not stable and strong enough to provide reliable cross-view knowledge without the supervision of labels. As unreliable information may lead the model astray, it is not advisable to enable cross-view communication too early. We therefore perform two-stage training for CrosSCLR: 1) each view of the model is individually trained with Equation (1) without cross-view communication; 2) once the model can supply high-confidence knowledge, the loss function is replaced with Equation (8), starting cross-view knowledge mining.

3.3 Model Details

View Generation of 3D Skeleton. Generally, a 3D human skeleton sequence has a number of frames, each containing a set of body joints, and each joint carries a 3D coordinate feature. Different from videos, the views [shi2019skeleton, shi2019two] of a skeleton, e.g., joint, motion, bone, and motion of bone, can be easily obtained, which is a natural advantage for skeleton-based representation learning. Motion is represented as the temporal displacement of joint coordinates between consecutive frames, and bone as the coordinate difference between two neighboring joints in the same frame. For simplicity, we use three views, joint, motion and bone, in the experiments.
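The view generation described above can be sketched directly; the (child, parent) bone list is skeleton-specific and assumed given here:

```python
import numpy as np

def make_views(joint, bone_pairs):
    """Generate the motion and bone views from a joint sequence.

    joint:      (C, T, V) array of joint coordinates.
    bone_pairs: list of (child, parent) joint-index pairs."""
    # Motion: temporal displacement between consecutive frames
    # (the last frame has no successor, so its motion stays zero).
    motion = np.zeros_like(joint)
    motion[:, :-1] = joint[:, 1:] - joint[:, :-1]
    # Bone: difference between neighboring joints in the same frame.
    bone = np.zeros_like(joint)
    for child, parent in bone_pairs:
        bone[:, :, child] = joint[:, :, child] - joint[:, :, parent]
    return motion, bone
```

All three views share the same (C, T, V) shape, so the same ST-GCN encoder architecture can be applied to each of them.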

Encoder. We adopt ST-GCN [stgcn2018aaai] as the encoder, which is suitable for modeling graph-structured skeleton data by exploiting spatial and temporal relations. After a series of ST-GCN blocks, an average pooling operation over the spatial and temporal dimensions is applied to the output feature, obtaining the final representation.

4 Experiments

4.1 Datasets

NTU-RGB+D 60. The NTU-RGB+D 60 (NTU-60) dataset [shahroudy2016ntu] is a large-scale dataset of 3D joint coordinate sequences for skeleton-based action recognition, containing 56,880 skeleton sequences in 60 action categories. Each skeleton graph contains 25 body joints as nodes, and their 3D coordinates are the initial features. Two protocols [shahroudy2016ntu] are recommended: 1) Cross-Subject (xsub): training and validation data are collected from different subjects. 2) Cross-View (xview): training and validation data are collected from different camera views.

NTU-RGB+D 120. The NTU-RGB+D 120 (NTU-120) dataset [liu2019ntu] is an extended version of NTU-60, containing 114,480 skeleton sequences in 120 action categories. Two protocols [liu2019ntu] are recommended: 1) Cross-Subject (xsub): training and validation data are collected from different subjects. 2) Cross-Setup (xset): training and validation data are collected from different setup IDs.

NTU-RGB+D 61-120. The NTU-RGB+D 61-120 (NTU-61-120) dataset is a subset of NTU-120, containing the skeleton sequences of the last 60 action categories in NTU-120. The categories in NTU-61-120 do not intersect with those in NTU-60. This dataset is used as an external dataset to evaluate the transfer capability of our method.

4.2 Experimental Settings

All experiments are conducted on the PyTorch [paszke2017automatic] framework. For data pre-processing, we remove the invalid frames of each skeleton sequence and then resize it to a fixed length by linear interpolation. For optimization, we use SGD with momentum and weight decay, with a fixed mini-batch size.

Data Augmentation. For skeleton sequences, we choose Shear [rao2020augmented] and Crop [shorten2019survey] as the augmentation strategies.


is a linear transformation on the spatial dimension. The transformation matrix is defined as:


where , , , , , are shear factors randomly sampled from . is the shear amplitude. The sequence is multiplied by the transformation matrix on the channel dimension. Then, the human pose in 3D coordinate is inclined at a random angle.
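A sketch of the shear augmentation under these definitions (the default amplitude value here is an assumption):

```python
import numpy as np

def shear(seq, beta=0.5, rng=None):
    """Shear augmentation: a random linear transform of the 3D coordinates.

    seq:  (C=3, T, V) skeleton sequence.
    beta: shear amplitude; factors are drawn uniformly from [-beta, beta]."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.uniform(-beta, beta, size=6)
    # Identity on the diagonal, random shear factors off-diagonal.
    A = np.array([[1.0, a[0], a[1]],
                  [a[2], 1.0, a[3]],
                  [a[4], a[5], 1.0]])
    C, T, V = seq.shape
    # Apply the transform on the channel (coordinate) dimension.
    return (A @ seq.reshape(C, -1)).reshape(C, T, V)
```

With beta = 0 the transform reduces to the identity, which makes the augmentation intensity easy to ablate.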


is an augmentation on the temporal dimension that symmetrically pads some frames to the sequence and then randomly crops it to the original length. The padding length is defined as

, is noted as padding ratio. The padding operation uses the reflection of the original boundary.

NTU-60 (%)
Method View xsub xview
SkeletonCLR Joint 68.3 76.4
SkeletonCLR Motion 53.3 50.8
SkeletonCLR Bone 69.4 67.4
2s-SkeletonCLR Joint + Motion 70.5 77.9
3s-SkeletonCLR Joint + Motion + Bone 75.0 79.8
CrosSCLR Joint 72.9 79.9
CrosSCLR Motion 72.7 77.6
CrosSCLR Bone 75.2 78.8
2s-CrosSCLR Joint + Motion 74.5 82.1
3s-CrosSCLR Joint + Motion + Bone 77.8 83.4
Table 1: Comparisons of SkeletonCLR and CrosSCLR on each view and their ensembles. SkeletonCLR models are trained independently and “+” means the ensemble model.

Unsupervised Pre-training. We generate three views of the skeleton sequences, i.e., joint, motion and bone. For the encoder, we adopt ST-GCN [stgcn2018aaai], with the number of channels in each layer reduced from the original setting. For the contrastive settings, we follow MoCo v2 [chen2020improved] but reduce the size of the memory bank. For data augmentation, we use the shear amplitude and padding ratio chosen in the ablation (Section 4.3). The InfoNCE loss in Equation (1) is used in the first training stage and then replaced with the cross-view loss in Equation (8). We set k=1 as the default in the knowledge mining mechanism.

Linear Evaluation Protocol. The models are verified by linear evaluation on the action recognition task, i.e., attaching a linear classifier (a fully-connected layer followed by a softmax layer) to the frozen encoder and then training the classifier supervisedly, with a step-decayed learning rate.

Finetune Protocol. We append a linear classifier to the learnable encoder and then train the whole model on the action recognition task, to compare with fully-supervised methods, again with a step-decayed learning rate.

4.3 Ablation Study

All experiments in this section are conducted on NTU-60 dataset and follow the unsupervised pre-training and linear evaluation protocol in Section 4.2.

Effectiveness of CrosSCLR. In Table 1, we separately pre-train SkeletonCLR models and jointly pre-train CrosSCLR models on different skeleton views, e.g., joint, motion and bone. We adopt linear evaluation on each view of the models. Table 1 reports that 1) CrosSCLR improves the capability of each single SkeletonCLR model, e.g., CrosSCLR-joint (79.88) vs. SkeletonCLR-joint (76.44) on the xview protocol; 2) CrosSCLR bridges the performance gap between two views and jointly improves their accuracy, e.g., for SkeletonCLR, joint (76.44) vs. motion (50.82), but for CrosSCLR, joint (79.88) vs. motion (77.59); 3) CrosSCLR improves the multi-view ensemble results via cross-view training. In summary, cross-view high-confidence knowledge does help the model extract more discriminating representations.

Figure 4: The t-SNE visualization of embeddings at different epochs during pre-training. Embeddings from several categories are sampled and visualized in different colors. For CrosSCLR, the cross-view loss only becomes available in the second training stage, so its distribution does not differ from that of SkeletonCLR before that point, shown in red boxes.

Qualitative Results. We apply t-SNE [maaten2008visualizing] with fixed settings to show the embedding distributions of SkeletonCLR and CrosSCLR at several epochs during pre-training in Figure 4. Note that the cross-view loss, Equation (8), is available only in the second training stage. From the visual results, we can draw a similar conclusion to that in Table 1. Embeddings of CrosSCLR are clustered more closely than those of SkeletonCLR, i.e., they are more discriminating. For CrosSCLR, the distributions of joint and motion are distinct at the earlier epoch but look very similar at the later epoch, i.e., a consistent distribution. Especially, both build a more “regular” space than SkeletonCLR, proving the effectiveness of CrosSCLR.

Effects of Contrastive Setting top-k. As the hyper-parameter k determines the number of mined samples, influencing the depth of knowledge exchange, we study how k impacts the performance of cross-view learning. Table 2 shows that k has a great influence on the performance, achieving the best result when k=1. However, a larger k decreases the performance, because less confident information may lead the model astray in the unsupervised case.

Contrastive Setting and Embedding Context. We develop the following models in Table 4 for comparison: 1) SkeletonCLR + KM is a model with single-view knowledge mining. 2) CrosSCLR w/o. embedding context (EC) is the model using only the contrastive setting for cross-view learning, ignoring the embedding context/distribution in Equation (6). The results of SkeletonCLR + KM show that KM improves the representation capability of SkeletonCLR. Additionally, CrosSCLR achieves worse performance without the embedding context (EC), proving the significance of the similarity/distribution among samples.

Effects of Augmentations. SkeletonCLR and CrosSCLR are based on contrastive learning, but data augmentation strategies for skeleton data are rarely explored, especially for GCN encoders. We verify the effectiveness of data augmentation and the impact of different augmentation intensities in skeleton-based contrastive learning by conducting experiments on SkeletonCLR, as shown in Table 3. The results indicate the importance of data augmentation in SkeletonCLR. We choose the default shear amplitude and padding ratio according to the mean accuracy on the xsub and xview protocols.

NTU-60 (%)
top-k xsub xview
0 70.5 77.4
1 74.5 82.1
3 73.7 79.9
5 72.4 79.2
7 73.0 78.6
10 64.4 69.9
Table 2: Results of pre-training 2s-CrosSCLR with various k in the knowledge miner.
Augmentation NTU-60 (%)
Shear Crop xsub xview
33.3 26.2
62.7 67.7
66.3 68.8
62.0 66.8
67.6 76.3
68.3 76.4
69.1 74.7
Table 3: Ablation study on different data augmentations for SkeletonCLR (joint).
NTU-60 (%)
Method Views of Pre-training xsub xview
SkeletonCLR Joint 68.3 76.4
SkeletonCLR + KM Joint 69.3 77.4
CrosSCLR w/o. EC Joint + Motion 71.4 78.5
CrosSCLR Joint + Motion 72.9 79.9
Table 4: Ablation study on contrastive settings and embedding context (EC). The models are linear evaluated on only joint.

4.4 Comparison

We compare CrosSCLR with other methods under the linear evaluation and finetune protocols. Since the backbone of many methods is an RNN-based model, e.g., a GRU or LSTM, we additionally use an LSTM (following the settings in [rao2020augmented]) as the encoder for a fair comparison, denoted as CrosSCLR (LSTM).
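The linear evaluation protocol can be sketched as follows: the pre-trained encoder is frozen and only a single fully-connected softmax layer is trained on its features. This is a minimal illustrative sketch with a toy frozen encoder, not the paper's training code.

```python
import numpy as np

def linear_evaluate(encoder, x_train, y_train, x_test, y_test,
                    epochs=200, lr=0.5):
    """Train a linear softmax classifier on frozen features; return top-1 acc."""
    f_tr, f_te = encoder(x_train), encoder(x_test)   # encoder is never updated
    n_cls = int(y_train.max()) + 1
    W = np.zeros((f_tr.shape[1], n_cls))
    onehot = np.eye(n_cls)[y_train]
    for _ in range(epochs):
        logits = f_tr @ W
        p = np.exp(logits - logits.max(1, keepdims=True))
        p /= p.sum(1, keepdims=True)
        W -= lr * f_tr.T @ (p - onehot) / len(f_tr)  # softmax-regression gradient
    return (np.argmax(f_te @ W, 1) == y_test).mean() # top-1 accuracy

# Toy check with an identity "encoder" on linearly separable data.
rng = np.random.default_rng(0)
centers = np.array([[-2.0, 0.0], [2.0, 0.0]])
y = np.repeat([0, 1], 100)
x = rng.normal(size=(200, 2)) + centers[y]
acc = linear_evaluate(lambda z: z, x, y, x, y)
assert acc > 0.9
```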

Unsupervised Results on NTU-60. In Table 5, LongT GAN [zheng2018unsupervised] adversarially trains the model with a skeleton-inpainting pretext task; MS2L [lin2020ms2l] trains the model with a multi-task scheme, i.e., prediction, jigsaw puzzles, and instance discrimination; AS-CAL [rao2020augmented] uses a momentum LSTM encoder for contrastive learning on single-view skeleton sequences; P&C [su2020predict] trains a stronger encoder by weakening the decoder; and SeBiReNet [nie2020unsupervised] constructs a human-like GRU network to exploit view-independent and pose-independent features. Our CrosSCLR exploits multi-view knowledge via cross-view consistent knowledge mining. With a fully-connected (FC) layer as the classifier, our model outperforms the other methods using the same classifier; with an LSTM encoder and LSTM classifier, it outperforms the above methods on both the xsub and xview protocols.

Results on NTU-120. Since few unsupervised results are reported on the NTU-120 dataset, we compare our method with both unsupervised and supervised methods. As shown in Table 6, TSRJI [caetano2019skeleton] utilizes a supervised attention LSTM, AS-CAL [rao2020augmented] adopts an LSTM for skeleton modeling, and our method outperforms the other unsupervised method and some of the supervised methods.

NTU-60 (%)
Method Encoder Classifier xsub xview
LongT GAN [zheng2018unsupervised] GRU FC 39.1 48.1
MS2L [lin2020ms2l] GRU GRU 52.6 -
AS-CAL [rao2020augmented] LSTM FC 58.5 64.8
P&C [su2020predict] GRU KNN 50.7 76.3
SeBiReNet [nie2020unsupervised] GRU LSTM - 79.7
3s-CrosSCLR (LSTM) LSTM FC 62.8 69.2
3s-CrosSCLR (LSTM) LSTM LSTM 70.4 79.9
3s-CrosSCLR† ST-GCN FC 72.8 80.7
3s-CrosSCLR ST-GCN FC 77.8 83.4
Table 5: Unsupervised results on NTU-60. These methods are pre-trained to learn the encoder and then follow the linear evaluation protocol to learn the classifiers. “†” indicates the model pre-trained on NTU-61-120.
NTU-120 (%)
Method Supervision xsub xset
Part-Aware LSTM [shahroudy2016ntu] Supervised 25.5 26.3
Soft RNN [hu2018early] Supervised 36.3 44.9
TSRJI [caetano2019skeleton] Supervised 67.9 62.8
ST-GCN [stgcn2018aaai] Supervised 79.7 81.3
AS-CAL [rao2020augmented] Unsupervised 48.6 49.2
3s-CrosSCLR (LSTM) Unsupervised 53.9 53.2
3s-CrosSCLR Unsupervised 67.9 66.7
Table 6: Unsupervised results on NTU-120. We compare our method with both unsupervised and supervised methods.

Linear Classification with Fewer Labels. We follow the same protocol as MS2L [lin2020ms2l], i.e., pre-training with all training data and then finetuning the classifier with only 1% and 10% randomly-selected labeled data, respectively. As shown in Table 7, CrosSCLR achieves higher performance than the other methods.

NTU-60 (%)
Method Label Fraction xsub xview
LongT GAN [zheng2018unsupervised] 1% 35.2 -
MS2L [lin2020ms2l] 1% 33.1 -
3s-CrosSCLR 1% 51.1 50.0
LongT GAN [zheng2018unsupervised] 10% 62.0 -
MS2L [lin2020ms2l] 10% 65.2 -
3s-CrosSCLR 10% 74.4 77.8
Table 7: Linear classification with fewer labels on NTU-60.
NTU-60 (%) NTU-120 (%)
Method xsub xview xsub xset
3s-ST-GCN [stgcn2018aaai] 85.2 91.4 77.2 77.1
3s-CrosSCLR† (FT) 85.6 92.0 - -
3s-CrosSCLR (FT) 86.2 92.5 80.5 80.4
Table 8: Finetuned results on NTU-60 and NTU-120. ST-GCN is the method reproduced from the released code. “†” indicates the model pre-trained on NTU-61-120. “FT” means the finetune protocol.

Finetuned Results on NTU-60 and NTU-120. We first pre-train our model unsupervisedly and then follow the finetune protocol for evaluation. For a fair comparison, ST-GCN [stgcn2018aaai] in Table 8 has the same number of parameters as 3s-CrosSCLR (three streams). The finetuned model, CrosSCLR (FT), outperforms the supervised ST-GCN on both the NTU-60 and NTU-120 datasets, indicating the effectiveness of cross-view pre-training.

Transfer Ability. We first pre-train CrosSCLR on NTU-61-120 and then transfer it to NTU-60 for linear evaluation, denoted as CrosSCLR†. The model trained under the xsub protocol is transferred to the xsub protocol of NTU-60, and the model trained under the xset protocol is transferred to the xview protocol of NTU-60. In Table 5, it achieves better results than the other unsupervised methods, and its supervised finetuning result is higher than that of ST-GCN, as shown in Table 8.

5 Conclusion

In this work, we propose a cross-view contrastive learning framework for unsupervised 3D skeleton-based action representation that exploits multi-view high-confidence knowledge as complementary supervision. It integrates single-view contrastive learning with a cross-view consistent knowledge mining module, which exchanges contrastive settings and embedding context among views via high-confidence sample mining. Experiments show the remarkable performance of CrosSCLR for action recognition on the NTU datasets.


This work was supported by the National Natural Science Foundation of China (U20B2072, 61976137, U19B2035) and the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0102).