SM-SGE: A Self-Supervised Multi-Scale Skeleton Graph Encoding Framework for Person Re-Identification

by Haocong Rao, et al.
Beijing Institute of Technology

Person re-identification via 3D skeletons is an emerging topic with great potential in security-critical applications. Existing methods typically learn body and motion features from the body-joint trajectory, whereas they lack a systematic way to model body structure and underlying relations of body components beyond the scale of body joints. In this paper, we for the first time propose a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that comprehensively models human body, component relations, and skeleton dynamics from unlabeled skeleton graphs of various scales to learn an effective skeleton representation for person Re-ID. Specifically, we first devise multi-scale skeleton graphs with coarse-to-fine human body partitions, which enables us to model body structure and skeleton dynamics at multiple levels. Second, to mine inherent correlations between body components in skeletal motion, we propose a multi-scale graph relation network to learn structural relations between adjacent body-component nodes and collaborative relations among nodes of different scales, so as to capture more discriminative skeleton graph features. Last, we propose a novel multi-scale skeleton reconstruction mechanism to enable our framework to encode skeleton dynamics and high-level semantics from unlabeled skeleton graphs, which encourages learning a discriminative skeleton representation for person Re-ID. Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods. We further demonstrate its effectiveness on 3D skeleton data estimated from large-scale RGB videos. Our codes are open at





1. Introduction

Person re-identification (Re-ID) aims to retrieve the same individual from a different view or scene, with great potential in authentication-related applications (vezzani2013people). Conventional studies (wang2018learning; karianakis2018reinforced; haque2016recurrent) typically extract appearance-based features such as body texture and silhouettes from RGB or depth images to perform person Re-ID. Nevertheless, an important flaw of these methods is their vulnerability to illumination or appearance changes (rao2021a2). In contrast, skeleton-based models exploit 3D coordinates of key body joints to characterize the human body and motion, and are usually robust to factors such as view and body-shape changes (han2017space). Although skeleton data have been extensively studied in action- and motion-related tasks (li2020dynamic), it remains an open challenge to extract discriminative body and motion features from 3D skeletons for person Re-ID (rao2020self). In this sense, this work aims to construct a systematic framework that tackles the skeleton-based person Re-ID task from three aspects.

Figure 1. SM-SGE exploits multi-scale skeleton graphs to model body structure and internal relations (structural and collaborative relations), and captures multi-scale graph dynamics to learn skeleton representations for person Re-ID.

(1) Multi-scale skeleton graphs. Most existing works (barbosa2012re; munaro2014one; andersson2015person; pala2019enhanced) construct skeleton descriptors to depict discriminative features of body structure and motion (e.g., anthropometric and gait attributes (andersson2015person)) for person Re-ID. However, they typically extract these hand-crafted features from skeletons with a single spatial scale and topology, which limits their ability to capture underlying structural information from different body partitions beyond the body-joint level (e.g., limb-level components) (li2020dynamic). To fully mine latent structural features of the body, it is beneficial to devise a systematic manner to represent skeletons at different levels. In this work, we model skeletons as multi-scale graphs (see Fig. 1) to learn coarse-to-fine-grained body and motion features from 3D skeleton data.

(2) Multi-scale relation learning. In human walking, body components usually possess different internal relations, which could carry unique and recognizable patterns (murray1964walking; winter2009biomechanics). Recent works like (rao2020self; liao2020model; rao2021a2) typically encode the body-joint trajectory or pre-defined pose descriptors into a feature vector for skeleton representation learning, while they rarely explore the inherent relations between different body joints or components. For example, the adjacent body joints "knee" and "foot" are strongly correlated in walking, while each enjoys a different degree of collaboration with its corresponding limb-level component "leg". To capture such internal body relations in skeletal motion, it is highly desirable to design a framework that captures correlations between physically-connected body parts (referred to as "structural relations") and relations among all collaborative body components (referred to as "collaborative relations").

(3) Multi-scale skeleton dynamics modeling. Existing 3D skeleton based Re-ID methods usually model skeleton dynamics from the trajectory of body joints (rao2020self; rao2021a2) or the sequence of pre-defined joint features (e.g., pairwise joint distances and pose descriptors (liao2020model)). Since these methods learn skeleton motion at the fixed scale of body joints, they lack the flexibility to capture motion patterns at various levels. For instance, they cannot explicitly model the movement or interaction of higher-level limbs from joint trajectories, which might cause a loss of global motion features. Hence, it is important to devise a framework that can explicitly model skeleton dynamics at different scales to better capture body motion patterns.

To fulfill all the above goals, this work for the first time proposes a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that exploits coarse-to-fine skeleton graphs to model body-structure and motion features for person Re-ID. Specifically, we first construct multi-scale skeleton graphs by spatially dividing each skeleton into body-component nodes of different granularities (shown in Fig. 2), which allows our framework to fully model body structure and capture skeleton features at various levels. Second, motivated by the fact that human walking usually carries unique patterns (murray1964walking), which endow body components with different internal relations, we propose a Multi-scale Graph Relation Network (MGRN) to capture structural and collaborative relations among body components in multi-scale skeleton graphs. MGRN exploits structural relations between adjacent body-component nodes to aggregate key correlative features for better node representations, and meanwhile incorporates collaborative relations among nodes of different scales into the graph encoding process to enhance global pattern learning. Finally, we propose a novel Multi-scale Skeleton Reconstruction (MSR) mechanism with two concurrent pretext tasks, namely a skeleton subsequence reconstruction task and a cross-scale skeleton inference task, to enable our framework to capture skeleton dynamics and latent high-level semantics (e.g., body-part correspondence, sequence order) from unlabeled skeleton graph representations. The graph features of all scales learned from the proposed framework are then combined as the final skeleton representation to perform the downstream task of person Re-ID.

The proposed SM-SGE framework enjoys three main advantages: First, it seamlessly unifies the learning of multi-scale skeleton graphs into a systematic framework, which enables us to model body structure, component relations, and motion patterns of skeletons at different levels. Second, unlike most existing skeleton-based methods that require manual annotation (e.g., ID labels) for representation learning, our framework is able to learn an effective representation for unlabeled skeletons, which can be directly applied to skeleton-based tasks such as person Re-ID. Last, our framework is also effective on 3D skeleton data estimated from RGB videos (yu2006framework), so it can potentially be applied to RGB-based datasets under general settings. In summary, our main contributions include:

  • We devise multi-scale graphs to fully model 3D skeletons, and propose a novel self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeletons for person Re-ID.

  • We propose the multi-scale graph relation network (MGRN) to learn both structural and collaborative relations of body-component nodes, so as to aggregate crucial correlative features of nodes and capture richer pattern information.

  • We propose the multi-scale skeleton reconstruction (MSR) mechanism to enable the framework to encode graph dynamics and high-level semantics from unlabeled skeletons.

  • Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods on three person Re-ID benchmarks, and it can achieve highly competitive performance on skeletons estimated from large-scale RGB videos.

2. Related Works

This section briefly reviews existing skeleton-based person Re-ID methods using hand-crafted features, supervised learning or self-supervised learning. We also introduce depth-based and multi-modal Re-ID methods that are related to skeleton-based models.

Hand-crafted and Supervised Re-ID Methods with Skeleton Data. Most existing works extract hand-crafted skeleton descriptors to depict certain geometric, anthropometric, and gait attributes of the human body. (barbosa2012re) computes 7 Euclidean distances between the floor plane and joints or joint pairs to construct a distance matrix, which is learned to match gallery individuals with a quasi-exhaustive strategy. (munaro2014one) and (pala2019enhanced) further extend these to 13 and 16 skeleton descriptors respectively, and leverage support vector machine (SVM), k-nearest neighbor (KNN), or Adaboost classifiers for Re-ID. As existing solutions that use skeleton features alone typically perform unsatisfactorily, other modalities such as 3D face descriptors (pala2019enhanced) and 3D point clouds (munaro20143d) are often used to boost performance. A few recent studies resort to supervised deep learning models to learn discriminative skeleton representations: one line of work utilizes long short-term memory (LSTM) (hochreiter1997long) to encode the temporal dynamics of pairwise joint distances to perform person Re-ID; Liao et al. (liao2020model) propose the PoseGait model, which learns 81 hand-crafted pose features of 3D skeleton data with deep convolutional neural networks (CNN) for gait-based human recognition.

Self-supervised Skeleton-based Re-ID Methods. Recently, Rao et al. (rao2020self) devise a self-supervised attention-based gait encoding model with multi-layer LSTM to encode gait features from unlabeled skeleton sequences for person Re-ID. The latest self-supervised study (rao2021a2) further proposes a locality-awareness approach that combines various pretext tasks (e.g., reverse sequential reconstruction) and a contrastive learning scheme to enhance self-supervised gait representation learning for the person Re-ID task.

Depth-based and Multi-modal Re-ID Methods. Depth-based methods exploit depth-image sequences to extract human shapes, silhouettes or gait features for person Re-ID. (sivapalan2011gait) proposes the Gait Energy Volume (GEV) algorithm, which extends Gait Energy Image (GEI) (chunli2010behavior) to the 3D domain, to learn depth features for human recognition. (munaro2014one) devises a depth-based point cloud matching (PCM) method to match multi-view 3D point cloud sets to discriminate different individuals. In (haque2016recurrent), Haque et al. leverage 3D LSTM and 3D CNN (boureau2010theoretical) to learn motion dynamics from 3D point clouds for person Re-ID. As for multi-modal methods, skeleton information and RGB or depth features (e.g., depth shape features (munaro20143d; wu2017robust; hasan2016long)) are usually combined to enhance Re-ID performance. In (karianakis2018reinforced), Karianakis et al. propose an RGB-to-depth transferred CNN-LSTM model with reinforced temporal attention (RTA) for the person Re-ID task.

3. The Proposed Framework

Suppose that a 3D skeleton sequence with $f$ consecutive skeletons is $S = (S_1, \dots, S_f)$, where $S_t \in \mathbb{R}^{J \times 3}$ denotes the $t$-th skeleton with three-dimensional coordinates $(x, y, z)$ of $J$ body joints. $\Phi = \{S^{(i)}\}_{i=1}^{N}$ represents the training set that contains $N$ skeleton sequences collected from different persons and views. Each skeleton sequence $S^{(i)}$ corresponds to an ID label $y_i$, where $y_i \in \{1, \dots, C\}$ and $C$ is the number of different persons. The goal of the SM-SGE framework is to learn a latent discriminative representation $H$ from skeleton sequences without using any label. Then, we evaluate the effectiveness of the learned skeleton representation $H$ on the downstream task of person Re-ID: the frozen $H$ and corresponding ID labels are used to train a multi-layer perceptron (MLP) for person Re-ID (note that the learned features $H$ are NOT tuned at this training stage). The overview of the proposed SM-SGE framework is shown in Fig. 3, and we present each technical component below.

3.1. Multi-Scale Skeleton Graph Construction

The human body can be segmented into functional components with diverse granularities (e.g., knee joint, thigh part, leg limb), each of which typically carries different geometric or anthropometric attributes of the body (winter2009biomechanics). Inspired by this fact, we regard body joints as the basic components, and merge spatially nearby groups of joints into a higher-level body-component node at the center of their positions. As shown in Fig. 2, we first construct skeleton graphs at three scales, namely joint-scale, part-scale, and body-scale graphs, for each skeleton $S_t$. Besides, to encourage our model to capture coarse-to-fine skeleton features more systematically, we also build a hyper-joint-scale graph based on a denser body-limb representation (liu2018recognizing), which is constructed by linearly interpolating nodes between adjacent nodes in the joint-scale graph. Each graph $\mathcal{G}^m = (\mathcal{V}^m, \mathcal{E}^m)$ ($m \in \{1, 2, 3, 4\}$) consists of nodes $\mathcal{V}^m = \{v^m_1, \dots, v^m_{n_m}\}$ and edges $\mathcal{E}^m$. Here $\mathcal{V}^m$ and $\mathcal{E}^m$ denote the set of nodes corresponding to different body components and the set of their structural relations respectively, and $n_m$ is the number of nodes in the $m$-th scale graph. We use $\mathbf{A}^m$ to represent a graph's adjacency matrix, where each element $a^m_{i,j}$ is defined as the normalized structural relation between adjacent nodes $v^m_i$ and $v^m_j$; the relations satisfy $\sum_{j \in \mathcal{N}_i} a^m_{i,j} = 1$, where $\mathcal{N}_i$ denotes the indices of neighbor nodes of $v^m_i$ in $\mathcal{G}^m$. During training of SM-SGE, $\mathbf{A}^m$ is adaptively learned to capture flexible structural relations.
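To make the construction above concrete, here is a minimal numpy sketch of node merging into a coarser scale and of row-normalizing an adjacency matrix so each node's relations sum to 1. The joint grouping and the chain topology are illustrative assumptions, not the paper's exact partition:

```python
import numpy as np

# Hypothetical 20-joint skeleton; group pairs of joints into 10 part-level
# nodes (illustrative grouping, not the paper's exact body partition).
J = 20
skeleton = np.random.rand(J, 3)                 # (x, y, z) per joint
part_groups = [list(range(i, i + 2)) for i in range(0, J, 2)]

def merge_nodes(joints, groups):
    """Build a coarser scale: each node is the centroid of a joint group."""
    return np.stack([joints[g].mean(axis=0) for g in groups])

part_nodes = merge_nodes(skeleton, part_groups)  # (10, 3) part-scale nodes

def normalize_adjacency(A, eps=1e-8):
    """Row-normalize structural relations so each row sums to 1."""
    A = A + np.eye(A.shape[0])                   # include self-connection
    return A / (A.sum(axis=1, keepdims=True) + eps)

# Chain topology as a stand-in for the skeleton's physical connections
A_part = normalize_adjacency(np.eye(10, k=1) + np.eye(10, k=-1))
```

In the framework itself these normalized relations are not fixed but learned adaptively during training; the sketch only shows the normalization constraint they satisfy.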

Figure 2. Four graph scales for a skeleton with 20 body joints. We divide body into 10 and 5 parts to build part-scale and body-scale graphs, and merge internal joints into nodes.
Figure 3. Schematic diagram of SM-SGE: First, we construct graphs of four scales for each skeleton in a sequence. Second, we exploit the multi-scale graph relation network (MGRN) to capture structural relations of neighbor nodes and aggregate crucial structural features into node representations, and compute both single-scale and cross-scale collaborative relations among body-component nodes, which are exploited to fuse collaborative node features across scales. Then, we utilize an LSTM to encode the fused graph representation of each skeleton in a randomly sampled subsequence into an encoded graph state at each scale, so as to capture graph dynamics and reconstruct skeletons across scales. Finally, the learned encoded graph states of all scales are concatenated and fed into an MLP for person Re-ID.

3.2. Multi-Scale Graph Relation Network

Different body parts typically possess internal relations at the physical or kinematic level, which could be exploited to mine rich body-structure features and motion patterns (aggarwal1998nonrigid). Motivated by this fact, we propose to learn relations of body components from two aspects: (1) Structural relations: Structurally-connected body components usually enjoy a higher motion correlation than distant pairs. Thus, when encoding skeleton graph features, it is crucial to capture structural relations of neighbor nodes to aggregate the most correlative spatial features for each body-component node. (2) Collaborative relations: Human motion like walking is often performed by several action-related body components, which collaborate in a relatively stable pattern (e.g., gait patterns) (murray1964walking; rao2020self). It is therefore beneficial to learn the inherent collaborative relations among different body components to mine more global pattern information from skeletons. To achieve the above goals, we propose the Multi-scale Graph Relation Network (MGRN) with structural and collaborative relation learning as below.

Structural Relation Learning. Given the $m$-th scale graph $\mathcal{G}^m$ of a skeleton, MGRN first computes the structural relation $e^m_{i,j}$ between adjacent nodes $v^m_i$ and $v^m_j$ in $\mathcal{G}^m$ as follows:

$e^m_{i,j} = \mathrm{LeakyReLU}\left(\mathbf{w}_r^{m\top}\left[\mathbf{W}^m v^m_i \,\|\, \mathbf{W}^m v^m_j\right]\right)$  (1)

where $\mathbf{W}^m$ is the weight matrix that maps the $m$-th scale nodes into a higher-level feature space, $\mathbf{w}_r^m$ denotes a learnable weight vector for relation learning at the $m$-th scale, $\|$ indicates the feature concatenation of two nodes, and $\mathrm{LeakyReLU}(\cdot)$ is a non-linear activation function. Then, to learn flexible structural relations that focus on more correlative nodes, we normalize relations with a temperature-based $\mathrm{softmax}$ function (hinton2015distilling) as follows:

$a^m_{i,j} = \frac{\exp\left(e^m_{i,j}/\tau_1\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e^m_{i,k}/\tau_1\right)}$  (2)

where $\tau_1$ denotes the temperature, normally set to 1 in the softmax function; a higher value of $\tau_1$ produces a softer relation distribution over nodes and retains more similar relation information. Here $\mathcal{N}_i$ denotes the neighbor nodes (including $v^m_i$ itself) of node $v^m_i$ in the graph.

To aggregate features of the most relevant nodes to represent node $v^m_i$, we exploit the normalized structural relations to yield its representation by $\hat{v}^m_i = \sum_{j \in \mathcal{N}_i} a^m_{i,j} \mathbf{W}^m v^m_j$. To sufficiently capture potential structural relations (e.g., motion correlation, position similarity), MGRN concurrently and independently learns $R$ different structural relation matrices using the same computation (see Eq. 1, Eq. 2). In this way, MGRN can jointly capture structural relation information of nodes from different representation subspaces (velickovic2018graph) based on the learnable structural relation matrices (see Fig. 4). We average the features learned by these matrices to represent each node:

$\hat{v}^m_i = \frac{1}{R} \sum_{r=1}^{R} \sum_{j \in \mathcal{N}_i} a^{m,r}_{i,j}\, \mathbf{W}^{m,r} v^m_j$  (3)

where $\hat{v}^m_i$ denotes the node representation of $v^m_i$ learned by the $R$ structural relation matrices, $a^{m,r}_{i,j}$ represents the structural relation between nodes $v^m_i$ and $v^m_j$ computed by the $r$-th structural relation matrix, and $\mathbf{W}^{m,r}$ denotes the corresponding weight matrix that performs node feature mapping. Here we use an average rather than a concatenation operation to reduce the feature dimension of nodes and allow for learning more structural relation matrices.
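The structural relation computation can be sketched in numpy as follows (single relation matrix, i.e. one "head"; the chain topology, feature sizes, and LeakyReLU slope are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, alpha=0.2):
    return np.where(x > 0, x, alpha * x)

def structural_relations(V, A, W, w_r, tau=1.0):
    """GAT-style structural relations with temperature softmax over
    neighbors (incl. self), followed by neighbor-feature aggregation."""
    H = V @ W.T                               # map nodes to feature space
    n = V.shape[0]
    e = np.full((n, n), -np.inf)              # -inf masks non-neighbors
    for i in range(n):
        for j in range(n):
            if A[i, j] > 0 or i == j:
                e[i, j] = leaky_relu(w_r @ np.concatenate([H[i], H[j]]))
    a = np.exp(e / tau)                       # exp(-inf) = 0
    a /= a.sum(axis=1, keepdims=True)         # normalized relations a_ij
    return a @ H                              # aggregated node features

n, d_in, d = 20, 3, 8
V = rng.random((n, d_in))                     # 3D node positions
A = np.eye(n, k=1) + np.eye(n, k=-1)          # chain topology for the demo
W = rng.random((d, d_in))
w_r = rng.random(2 * d)
V_hat = structural_relations(V, A, W, w_r)    # (20, 8) node representations
```

The multi-head variant of Eq. 3 would run this with R independent (W, w_r) pairs and average the R outputs.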

Collaborative Relation Learning. Motivated by the fact that unique walking patterns could be represented by the dynamic cooperation among body joints or between different body components (murray1964walking), we expect our model to capture more discriminative patterns globally by learning collaborative relations from two aspects: (1) single-scale collaborative relations among nodes of the same scale, and (2) cross-scale collaborative relations between a node and its spatially corresponding or motion-related higher-level body component. To this end, MGRN computes the collaborative relation matrix $\mathbf{C}^{m,l}$ between $m$-th scale nodes and $l$-th scale nodes as follows (shown in Fig. 4 and Fig. 3):

$\mathbf{C}^{m,l}_{i,j} = \frac{\exp\left(\hat{v}^{m\top}_i \hat{v}^l_j / \tau_2\right)}{\sum_{k=1}^{n_l} \exp\left(\hat{v}^{m\top}_i \hat{v}^l_k / \tau_2\right)}$  (4)

where $\mathbf{C}^{m,l}_{i,j}$ represents the single-scale (when $m = l$) or cross-scale (when $m \neq l$) collaborative relation between node $\hat{v}^m_i$ in $\mathcal{G}^m$ and node $\hat{v}^l_j$ in $\mathcal{G}^l$. Here MGRN computes the inner product of node feature representations, which aggregate key spatial information through structural relation learning (see Eq. 3), to measure the degree of collaboration between two nodes. $\tau_2$ denotes the temperature that adjusts the softness of relation learning (illustrated in Eq. 2).

Multi-scale Collaboration Fusion. To adaptively focus on key correlative features of body-component collaboration at different spatial levels and enhance global pattern learning, we propose multi-scale collaboration fusion, which exploits collaborative relations to fuse node features across scales. Each node representation $\hat{v}^m_i$ in the $m$-th scale graph is updated by the feature fusion of collaborative nodes learned from different graphs as below (see Fig. 3):

$\bar{v}^m_i = \hat{v}^m_i + \lambda \sum_{l} \sum_{j=1}^{n_l} \mathbf{C}^{m,l}_{i,j}\, \mathbf{W}^l_c\, \hat{v}^l_j$  (5)

where $\mathbf{W}^l_c$ is a learnable weight matrix that integrates the collaborative features of $l$-th scale nodes into the $m$-th scale node representation, $n_l$ represents the number of nodes in the $l$-th scale graph, and $\lambda$ is the fusion coefficient to fuse collaborative graph node features. We denote the fused graph features of the $m$-th scale for a skeleton sequence by $\bar{\mathcal{G}}^m_1, \dots, \bar{\mathcal{G}}^m_f$. Note that multi-scale collaboration fusion does NOT directly fuse the graph features of all scales into one representation. Instead, the graph representation of each individual scale is retained (shown in Fig. 3) to encourage our model to capture skeleton dynamics and pattern information at different levels.
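A minimal numpy sketch of collaborative relation learning and the cross-scale fusion for one pair of scales (the residual-style additive update, shapes, and random features are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, tau=1.0, axis=-1):
    """Temperature-based softmax (numerically stabilized)."""
    z = np.exp((x - x.max(axis=axis, keepdims=True)) / tau)
    return z / z.sum(axis=axis, keepdims=True)

def collaborative_fusion(V_m, V_l, W_c, lam=1.0, tau=1.0):
    """Fuse scale-l node features into scale-m node representations using
    inner-product collaborative relations (Eq. 4-5 style sketch)."""
    C = softmax(V_m @ V_l.T, tau=tau, axis=1)   # (n_m, n_l) relation matrix
    return V_m + lam * (C @ V_l @ W_c.T)        # additive cross-scale fusion

d = 8
V_joint = rng.random((20, d))                   # joint-scale node features
V_part = rng.random((10, d))                    # part-scale node features
W_c = rng.random((d, d))
V_fused = collaborative_fusion(V_joint, V_part, W_c)
```

In the full framework this fusion runs over all scale pairs (including the single-scale case V_m = V_l), and each scale keeps its own fused representation rather than being collapsed into one vector.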

Figure 4. Examples of three types of relations: (1) Structural relations within a graph. (2) Single-scale collaborative relations within a graph. (3) Cross-scale collaborative relations between graphs of two different scales.

3.3. Multi-Scale Skeleton Reconstruction Mechanism

To enable SM-SGE to encode multi-scale graph dynamics of unlabeled skeletons, we propose a self-supervised Multi-scale Skeleton Reconstruction (MSR) mechanism to simultaneously capture skeleton graph dynamics and high-level semantics (e.g., skeleton order in the subsequence, cross-scale component correspondence) from graphs of different scales. Unlike plain reconstruction, which learns to reconstruct the whole sequence at a single scale, the objective of MSR combines two concurrent pretext tasks as follows:

(1) Skeleton subsequence reconstruction task, which reconstructs multiple skeleton subsequences based on their graph representations. In particular, MSR aims to reconstruct the target multi-scale skeletons corresponding to the multi-scale graphs in subsequences, instead of reconstructing the original subsequences (for clarity, we use the vector $\mathbf{S}^m_t$ to represent all node positions in the $m$-th scale graph of the $t$-th skeleton in the subsequence).

(2) Cross-scale skeleton inference task, which exploits fine skeleton graph representations to infer 3D positions of coarser body components. For instance, we propose to use joint-scale graph representations, which may contain richer spatial information with denser nodes, to infer the nodes of body-scale skeletons. It can also be viewed as a cross-scale reconstruction task that reconstructs nodes of a different scale from the same skeleton graph.

To simultaneously achieve the above two pretext tasks, we first sample $n$-length subsequences by randomly discarding skeletons from the input sequence. To exploit more potential samples for training, the random sampling process is repeated for $s$ rounds, and each round covers all possible lengths from 1 to $f$. Second, given a sampled skeleton subsequence, MGRN encodes its corresponding skeleton graphs of each scale into fused graph features (see Eq. 1-5). Then, we leverage an LSTM to integrate the temporal dynamics of graphs at each scale into effective representations: the LSTM encodes each skeleton graph representation together with the previous step's latent state (if it exists), which provides the temporal context of the $m$-th scale graph representations, into the current latent state:

$h^m_t = \mathrm{LSTM}^m\left(\bar{\mathcal{G}}^m_t,\, h^m_{t-1}\right)$  (6)

where $t \in \{1, \dots, n\}$, $\mathrm{LSTM}^m$ denotes the LSTM encoder that aims to capture long-term dynamics of graph representations at the $m$-th scale, and $h^m_1, \dots, h^m_n$ are encoded graph states that contain crucial temporal encoding information of the $m$-th scale graph representations from time 1 to $n$. Last, we exploit the encoded graph states at the $m$-th scale to reconstruct the target skeleton at the $l$-th scale as follows:

$\hat{\mathbf{S}}^l_t = f^{m \rightarrow l}\left(h^m_t\right)$  (7)

where $\hat{\mathbf{S}}^l_t$ is the reconstructed skeleton: when $m = l$, Eq. 7 is the plain skeleton reconstruction at the same scale, while $m \neq l$ indicates the cross-scale skeleton inference. $f^{m \rightarrow l}(\cdot)$ is a network function built by an MLP, whose weights are NOT shared between different scales; we train a different individual MLP for each skeleton reconstruction at the same or different scales (see Fig. 3).

As we expect to capture graph dynamics and pattern features of skeletons at various scales, we employ the MSR mechanism with the above reconstruction objective on all scales of graphs. Formally, we define the objective function for the self-supervision of MSR on the $m$-th scale graphs, which minimizes the $\ell_1$ loss between the ground-truth skeletons of the $n$-length subsequence and the reconstructed skeletons:

$\mathcal{L}^{m \rightarrow l} = \sum_{t=1}^{n} \left\| \hat{\mathbf{S}}^l_t - \mathbf{S}^l_t \right\|_1$  (8)

where $\hat{\mathbf{S}}^l_t$ is the reconstructed $l$-th scale skeleton based on the $m$-th scale encoded graph states (see Eq. 7), and $\|\cdot\|_1$ denotes the $\ell_1$ norm. The reason for using the $\ell_1$ loss is twofold: it gives sufficient gradients to positions with small losses to facilitate precise spatial reconstruction, and meanwhile alleviates gradient explosion with stable gradients for large losses (li2020dynamic). It should be noted that our implementation actually optimizes Eq. 8 on each individual graph scale, and the sum of the reconstruction loss over all sampled subsequences is computed. By learning to reconstruct skeletons of the same scale and infer cross-scale body-component nodes dynamically (e.g., using varying subsequences), MSR encourages our framework to integrate crucial skeleton dynamics and high-level semantics into the encoded graph states to achieve better person Re-ID performance (see Sec. 5).
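The subsequence sampling and the $\ell_1$ reconstruction objective might be sketched as below (the sequence length, subsequence length, and toy "reconstruction" are illustrative assumptions; the real targets come from the MLP heads of Eq. 7):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_subsequence(seq, n):
    """Randomly discard skeletons to obtain an n-length subsequence,
    preserving the original temporal order."""
    idx = np.sort(rng.choice(len(seq), size=n, replace=False))
    return seq[idx]

def l1_reconstruction_loss(pred, target):
    """Sum of l1 distances between reconstructed and ground-truth
    skeleton node positions (Eq. 8 style)."""
    return np.abs(pred - target).sum()

f, J = 10, 20
sequence = rng.random((f, J, 3))            # f skeletons with J joints each
sub = sample_subsequence(sequence, n=6)     # one sampled subsequence
# Toy "reconstruction" offset by a constant, just to exercise the loss
loss = l1_reconstruction_loss(sub + 0.01, sub)
```

In training, this sampling is repeated for several rounds over all possible subsequence lengths, and the losses of all sampled subsequences and all scale pairs are summed.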

3.4. The Entire Framework

The computation flow of the entire framework during self-supervised learning can be summarized as: multi-scale graph construction (Sec. 3.1) → multi-scale graph relation learning (Sec. 3.2) → graph dynamics encoding (Eq. 6) → multi-scale skeleton reconstruction (Eq. 7). The self-supervised loss (see Eq. 8) is employed to train the SM-SGE framework to learn an effective skeleton representation from multi-scale skeleton graphs. For the downstream task of person Re-ID, we extract the encoded graph states learned from the pre-trained framework and exploit an MLP to predict the sequence label. Specifically, for the $t$-th skeleton in an input sequence, we concatenate its corresponding encoded graph states of the four scales, namely $h_t = [h^1_t \| h^2_t \| h^3_t \| h^4_t]$, as the skeleton-level representation of the sequence. Then, we train the MLP with the frozen $h_t$ and its label (note that $h_t$ is NOT tuned at this training stage). The ID predictions of all skeleton-level representations in a sequence are averaged to obtain the final sequence-level ID prediction. We employ the cross-entropy loss to train the MLP for person Re-ID.
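The averaging of per-skeleton ID predictions into a sequence-level prediction can be sketched as follows (the logits are random placeholders standing in for the frozen-feature MLP's outputs; the class count matches IAS-Lab's 11 identities only as an example):

```python
import numpy as np

def sequence_prediction(skeleton_logits):
    """Average per-skeleton softmax ID predictions over a sequence to
    obtain the sequence-level ID prediction."""
    z = np.exp(skeleton_logits - skeleton_logits.max(axis=1, keepdims=True))
    probs = z / z.sum(axis=1, keepdims=True)    # per-skeleton softmax
    return probs.mean(axis=0)                   # sequence-level average

rng = np.random.default_rng(3)
logits = rng.random((6, 11))                    # 6 skeletons, 11 identities
pred = sequence_prediction(logits)              # averaged class distribution
predicted_id = int(np.argmax(pred))
```

Averaging probabilities rather than taking a per-frame majority vote keeps the sequence-level prediction differentiable-friendly and uses the full confidence of every frame.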

Method                                                 IAS-A             IAS-B
                                                   Rank-1   nAUC    Rank-1   nAUC
Hand-Crafted and Supervised Methods
Gait Energy Image (chunli2010behavior)†              25.6    72.1      15.9   66.0
3D CNN + Average Pooling (boureau2010theoretical)†   33.4    81.4      39.1   82.8
Gait Energy Volume (sivapalan2011gait)†              20.4    66.2      13.7   64.8
3D LSTM (haque2016recurrent)†                        31.0    77.6      33.8   78.0
PCM + Skeleton (munaro20143d)‡                       27.3    —         81.8   —
DVCov + SKL (wu2017robust)‡                          46.6    —         45.9   —
ED + SKL (wu2017robust)‡                             52.3    —         63.3   —
D13 descriptors + KNN (munaro2014one)                33.8    63.6      40.5   71.1
Single-layer LSTM (haque2016recurrent)               20.0    65.9      19.1   68.4
Multi-layer LSTM (zheng2019relational)               34.4    72.1      30.9   71.9
D16 descriptors + Adaboost (pala2019enhanced)        27.4    65.5      39.2   78.2
PoseGait (liao2020model)                             41.4    79.9      37.1   74.8
Self-Supervised Methods
Attention Gait Encodings (rao2020self)               56.1    81.7      58.2   85.3
SGELA (rao2021a2)                                    60.1    82.9      62.5   86.9
SM-SGE (Ours)                                        59.4    86.7      69.8   90.4
Table 1. Performance comparison with hand-crafted, supervised and self-supervised methods on IAS-A and IAS-B. † and ‡ denote depth-based and multi-modal methods respectively. Bold numbers refer to the best performer among skeleton-based methods. "—" indicates no published results.

4. Experiments

4.1. Experimental Settings

Datasets: We evaluate our framework on three public person Re-ID datasets that contain skeleton data (IAS-Lab (munaro2014feature), KS20 (nambiar2017context), KGBD (andersson2015person)) and a large RGB video based multi-view dataset CASIA B (yu2006framework), which contain 11, 20, 164, and 124 different individuals respectively. We follow the frequently used evaluation setup in the literature (rao2020self; haque2016recurrent): For IAS-Lab, we use the full training set and two testing splits, IAS-A and IAS-B; For KGBD, since no training and testing splits are given, we randomly leave one skeleton video of each person for testing and use the remaining videos for training; For KS20, we randomly select one sequence from each viewpoint for testing and use the rest of skeleton sequences for training.

To evaluate the effectiveness of SM-SGE when 3D skeleton data are directly estimated from RGB videos rather than Kinect, we introduce a large-scale RGB video based dataset, CASIA B (yu2006framework), and exploit pre-trained pose estimation models (chen20173d; cao2019openpose) to extract 3D skeletons from RGB videos (detailed in the Appendix). We evaluate our approach on each view (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, 180°) of CASIA B and use the adjacent views for training.

Method                                                 KGBD              KS20
                                                   Rank-1   nAUC    Rank-1   nAUC
Hand-Crafted and Supervised Methods
D13 descriptors + KNN (munaro2014one)                46.9    90.0      58.3   78.0
Single-layer LSTM (haque2016recurrent)               39.8    87.2      80.9   92.3
Multi-layer LSTM (zheng2019relational)               46.2    89.8      81.6   94.2
D16 descriptors + Adaboost (pala2019enhanced)        69.9    90.6      59.8   78.8
PoseGait (liao2020model)                             90.6    97.8      70.5   94.0
Self-Supervised Methods
Attention Gait Encodings (rao2020self)               87.7    96.3      86.5   94.7
SGELA (rao2021a2)                                    86.9    97.1      86.9   94.9
SM-SGE (Ours)                                        99.5    99.6      87.5   95.8
Table 2. Performance comparison on KGBD and KS20 datasets. Bold numbers refer to the best performer.

Implementation Details: The numbers of nodes in the hyper-joint-scale and joint-scale graphs vary with the dataset, since KS20, CASIA B, and the IAS-Lab/KGBD datasets provide different numbers of body joints. For the part-scale and body-scale graphs, the numbers of nodes are 10 and 5 for all datasets. On the IAS-Lab, KS20 and KGBD datasets, the sequence length $f$ is empirically set to the value that achieves the best overall performance among different settings; for the largest dataset, CASIA B, with roughly estimated skeleton data from RGB frames, a separate sequence length is set for training/testing. The node feature dimension and the number of structural relation matrices ($R$) are set empirically. The temperatures ($\tau_1$, $\tau_2$) for relation learning, the collaboration fusion coefficient ($\lambda$), and the number of sampling rounds ($s$) are empirically set to 1. An MLP with one hidden layer is employed in SM-SGE. For the MSR mechanism, we use a 2-layer LSTM. We adopt the Adam optimizer, with different learning rates for IAS-Lab/KGBD and for KS20/CASIA B, to train the framework.

Evaluation Metrics: Person Re-ID typically adopts a “multi-shot” manner that leverages predictions of multiple frames or a sequence-level representation to predict a sequence label. In this work, we compute both Rank-1 accuracy and nAUC (area under the cumulative matching curve (CMC) normalized by the number of ranks (gray2008viewpoint)) to quantify multi-shot person Re-ID performance.
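The two metrics above can be computed from a probe-gallery distance matrix as in the following numpy sketch (the toy distance matrix with a zero diagonal, which makes every probe's true match rank first, is an illustrative assumption):

```python
import numpy as np

def cmc_curve(dist, gallery_ids, probe_ids):
    """Cumulative matching curve: CMC[k] is the fraction of probes whose
    true identity appears among the top-(k+1) ranked gallery samples."""
    ranks = np.argsort(dist, axis=1)              # ascending distance
    match = gallery_ids[ranks] == probe_ids[:, None]
    first_hit = match.argmax(axis=1)              # rank of first correct match
    cmc = np.zeros(dist.shape[1])
    for r in first_hit:
        cmc[r:] += 1
    return cmc / dist.shape[0]

def nauc(cmc):
    """Area under the CMC curve normalized by the number of ranks."""
    return cmc.mean()

rng = np.random.default_rng(4)
gallery_ids = np.arange(5)
probe_ids = np.arange(5)
dist = rng.random((5, 5))
np.fill_diagonal(dist, 0.0)                       # true match is closest
cmc = cmc_curve(dist, gallery_ids, probe_ids)
rank1 = cmc[0]                                    # Rank-1 accuracy
area = nauc(cmc)
```

With the zero-diagonal toy matrix, every probe's correct gallery entry ranks first, so both Rank-1 and nAUC equal 1.0; real distance matrices yield a monotonically non-decreasing CMC below that ceiling.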

4.2. Comparison with State-of-the-Art Methods

In this section, we compare our approach with existing hand-crafted, supervised and self-supervised skeleton-based Re-ID methods on IAS-Lab (see Table 1), KS20 and KGBD (see Table 2). We also include classic depth-based methods and representative multi-modal methods as a reference. The comparison results are reported below:

Comparison with Hand-crafted and Supervised Skeleton-based Methods: As shown in Table 1 and Table 2, the proposed SM-SGE framework enjoys evident advantages over existing skeleton-based methods: First, our approach significantly outperforms two representative hand-crafted methods that extract anthropometric attributes of skeletons (e.g., descriptors (munaro2014one; pala2019enhanced)) by - Rank-1 accuracy and - nAUC on different datasets. Second, compared with the state-of-the-art CNN-based model (PoseGait (liao2020model)) and LSTM-based models (haque2016recurrent; zheng2019relational), our self-supervised framework achieves superior performance by a large margin (up to Rank-1 accuracy and nAUC) on all datasets. Besides, these supervised methods typically require massive labels and even extra hand-crafted features (e.g., PoseGait (liao2020model) relies on 81 hand-crafted pose and motion features) for representation learning, while our framework automatically models spatial and temporal features of unlabeled skeleton graphs at various scales to learn a more effective skeleton representation for the person Re-ID task.

Comparison with Self-supervised Skeleton-based Methods: Our approach achieves a significant improvement (- Rank-1 accuracy and - nAUC) over existing state-of-the-art self-supervised methods on three of the four testing sets (IAS-B, KGBD, KS20). On IAS-A, although SGELA and our framework obtain close Rank-1 accuracy, our approach attains a markedly higher nAUC () than SGELA, which suggests that our approach achieves better overall Re-ID performance when retrieving persons from high to low ranks. Notably, the proposed SM-SGE outperforms existing self-supervised methods by more than Rank-1 accuracy on the largest skeleton-based dataset, KGBD, which demonstrates the great potential of our approach for large-scale person Re-ID.

Comparison with Depth-based and Multi-modal Methods: As reported in Table 1, our skeleton-based framework consistently performs better than classic depth-based methods (GEI (chunli2010behavior), GEV (sivapalan2011gait), 3D CNN (boureau2010theoretical), 3D LSTM (haque2016recurrent)) by at least Rank-1 accuracy and nAUC on IAS-A and IAS-B. Compared with representative multi-modal methods, our approach is still the best performer in most cases. Interestingly, although the “PCM + Skeleton” method (munaro20143d), which uses both skeletons and 3D point cloud matching, attains the best Rank-1 accuracy on IAS-B, it is inferior to SM-SGE in Rank-1 accuracy on IAS-A, which demonstrates that our approach is more effective under settings with frequent shape and appearance changes (IAS-A). Considering that the proposed SM-SGE only requires 3D skeleton data as input and achieves satisfactory performance on each dataset, it can be a promising solution to person Re-ID and other potential skeleton-related tasks.

5. Further Analysis

In this section, we first evaluate the performance of SM-SGE on skeleton data estimated from RGB videos in CASIA B. Then, we conduct ablation study to demonstrate the effectiveness of each component, and evaluate effects of different parameters on SM-SGE. Last, we visualize and analyze the learned collaborative relations.

Evaluation with Model-estimated Skeletons. We exploit pre-trained pose estimation models (cao2019openpose; chen20173d) to extract 3D skeletons from RGB videos of CASIA B, and evaluate the performance of SM-SGE on the estimated skeleton data. We compare our framework with the state-of-the-art supervised method PoseGait (liao2020model) under the same evaluation setup. As shown in Table 3, our approach significantly outperforms PoseGait by - Rank-1 accuracy on all views of CASIA B. Notably, SM-SGE obtains more stable performance than PoseGait on 7 consecutive views from to , which shows the robustness of our framework to viewpoint variation. On the two most challenging views ( and ), our approach still performs better than PoseGait by more than Rank-1 accuracy. These results demonstrate the effectiveness of our framework on skeleton data estimated from RGB videos, and also show the great potential of our approach for large RGB-based datasets under general settings (e.g., varying views).

Methods            Rank-1 accuracy across the 11 views of CASIA B
PoseGait [2020]    10.7  37.4  52.5  28.3  24.3  18.9  23.5  17.2  23.6  18.8   4.3
SM-SGE (Ours)      18.4  50.8  53.6  40.9  51.2  59.3  52.3  53.9  30.2  28.8  13.6
Table 3. Rank-1 accuracy on different views of CASIA B.
MG    MGRN    MSR          IAS-A   IAS-B
(baseline)                 53.5    61.8
                           55.0    65.8
                           56.1    67.0
                           56.6    67.8
                           57.4    68.9
                           57.2    67.7
                           57.9    68.1
(full model)               59.4    69.8
Table 4. Rank-1 accuracy of SM-SGE with different components: Structural/collaborative relation (SR/CR) in MGRN, cross-scale skeleton inference (CSI) and skeleton subsequence reconstruction (SSR) in MSR. “MG” denotes exploiting multi-scale graphs rather than using joint-scale graphs.

Ablation Study. We evaluate the contribution of each component in our framework (here IAS-Lab is taken as an example) and report the results in Table 4. We use an LSTM with plain skeleton reconstruction as the baseline (see the first row in Table 4). We can draw the following conclusions: (1) Introducing multi-scale skeleton graphs (MG) consistently improves person Re-ID performance by at least Rank-1 accuracy, which justifies our claim that modeling skeletons as multi-scale graphs facilitates learning richer body and motion features for person Re-ID. (2) Exploiting structural relations (SR) between body components produces a significant performance gain of - Rank-1 accuracy compared with directly modeling the body-joint trajectory (baseline), while combining collaborative relation (CR) learning further boosts Re-ID performance by up to Rank-1 accuracy. These results demonstrate the effectiveness of the multi-scale graph relation network (MGRN) in capturing more discriminative body structural features and motion patterns for person Re-ID. (3) The proposed MSR mechanism, based on the cross-scale skeleton inference and skeleton subsequence reconstruction pretext tasks, evidently improves model performance (- Rank-1 accuracy) under different relation learning (SR or CR), which verifies our intuition that mining high-level semantics such as cross-scale body-component correspondence encourages learning a more effective skeleton representation for the person Re-ID task. Similar results are observed on the other datasets.
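To make the subsequence reconstruction pretext task concrete, here is a minimal sketch of its two ingredients: random contiguous subsequence sampling, and a plain mean-squared reconstruction error over skeleton coordinates. The shapes and helper names are hypothetical and this is not the paper's exact objective.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subsequence(seq, length):
    """Randomly pick a contiguous subsequence of `length` frames from `seq`."""
    start = rng.integers(0, len(seq) - length + 1)
    return seq[start:start + length]

def reconstruction_loss(pred, target):
    """Plain mean-squared error between reconstructed and original coordinates."""
    return float(np.mean((pred - target) ** 2))

# Hypothetical sequence: 20 frames, 25 joints, 3D coordinates.
seq = rng.standard_normal((20, 25, 3))
sub = sample_subsequence(seq, 6)
```

In the actual framework, an encoder-decoder would be trained to reconstruct `sub` from its learned representation, with the loss driving self-supervised learning.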

Figure 5. Rank-1 accuracy on IAS-Lab illustrating the effects of different hyper-parameters: (a)-(b) Temperatures for relation learning. (c) Number of structural relation matrices. (d) Collaboration fusion coefficient. (e) Sequence length. (f) Number of random sampling rounds. Zoom in for better visualization.
Body-scale   Part-scale   Joint-scale   Hyper-joint-scale   IAS-A   IAS-B
✓                                                           53.3    60.4
✓            ✓                                              56.9    65.8
✓            ✓            ✓                                 58.3    68.5
✓            ✓            ✓             ✓                   59.4    69.8
Table 5. Performance of SM-SGE with different graph scales.

Effects of Multiple Graph Scales. As shown in Table 5, combining graph scales from coarse (body-scale) to fine (hyper-joint-scale) progressively improves Rank-1 accuracy by - on both IAS-A and IAS-B. Compared with the plain reconstruction of body-scale graphs (note that plain reconstruction without collaborative relation learning is employed when only body-scale graphs are used, as shown in the first row), employing multi-scale skeleton reconstruction (MSR) based on two adjacent scales of graphs (body-scale and part-scale) obtains a significant performance gain of - Rank-1 accuracy. These results further demonstrate the effectiveness of multi-scale graphs and MSR, which capture more discriminative skeleton features at various levels for person Re-ID.
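The coarse-to-fine idea can be illustrated by average-pooling joint-scale node features into higher-level nodes. The grouping below is a hypothetical 10-joint partition chosen for illustration; the actual SM-SGE partitions differ.

```python
import numpy as np

# Hypothetical grouping of 10 joint-scale nodes into 5 part-scale nodes;
# the real SM-SGE partitions differ, this only illustrates coarse-to-fine pooling.
PART_GROUPS = {
    "head_torso": [0, 1, 2],
    "left_arm":   [3, 4],
    "right_arm":  [5, 6],
    "left_leg":   [7, 8],
    "right_leg":  [9],
}

def pool_to_part_scale(joint_feats):
    """(num_joints, dim) joint features -> (num_parts, dim) part features,
    where each part-scale node averages the joint-scale nodes it spans."""
    return np.stack([joint_feats[idx].mean(axis=0) for idx in PART_GROUPS.values()])
```

Applying the same pooling again (parts into a body-scale node) yields the coarsest graph, so all scales are derived from one joint-scale skeleton.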

Model Sensitivity Analysis. We evaluate the effects of different hyper-parameters on SM-SGE: (1) We observe that SM-SGE is not sensitive to temperature changes from 0.1 to 1.0 (see Fig. 5 (a), (b)). Since lower temperatures tend to ignore more similar information (hinton2015distilling) and could reduce relation learning performance, we set the temperature to 1.0 for all relation learning in our framework. (2) As shown in Fig. 5 (c), introducing more learnable structural relation matrices can improve model performance on both IAS-A and IAS-B. However, too many relation matrices may cause the model to learn redundant relation information, which leads to a slight performance degradation. (3) The collaboration fusion coefficient controls the degree of fusion with collaborative node features. We find that fusing graph node features with a larger coefficient achieves better Re-ID performance (see Fig. 5 (d)), which verifies the necessity of sufficient multi-scale collaboration fusion for learning a more effective skeleton representation. (4) As shown in Fig. 5 (e), (f), SM-SGE obtains the highest Re-ID performance when and in most cases. Although larger values provide more training subsequences, they also increase the computational complexity (e.g., requiring more memory and training time); thus we set and to achieve a better trade-off between performance and complexity.
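The temperature effect discussed in (1) is easy to see in a small sketch: dividing relation scores by a lower temperature before the softmax sharpens the distribution, so all but the top score are increasingly ignored. The score values below are illustrative only.

```python
import numpy as np

def softmax_with_temperature(scores, tau=1.0):
    """Softmax over relation scores; lower tau yields a sharper distribution."""
    z = scores / tau
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5])
smooth = softmax_with_temperature(scores, tau=1.0)  # keeps mass on similar scores
sharp = softmax_with_temperature(scores, tau=0.1)   # near one-hot on the top score
```

With tau = 1.0, similar scores retain comparable weights, which matches the paper's choice of 1.0 so that relation learning does not discard near-equal correlations.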



Figure 6. (a)-(b): Cross-scale collaborative relations (CCR) among body components for sample skeletons in KS20 and IAS-A. (c)-(d): CCR matrices () for (a) and (b). Note that the abscissa and ordinate denote indices of nodes.

Analysis of Cross-scale Collaborative Relations. We visualize the positions of different body components and their collaborative relations across adjacent scales (note that we draw only significant relations, i.e., those with values larger than a given fraction of the maximum), and we obtain the following observations: (1) As shown in Fig. 6(a) and 6(b), spatially corresponding or nearby body components (e.g., subcomponents of limbs on the same side) possess evident relations across different scales, which demonstrates that SM-SGE can capture the high-level semantics of body-component correspondence between different graphs. (2) For non-adjacent body components (e.g., arms and legs) with a joint movement trend, the framework also learns higher correlations among their corresponding nodes in different graphs (see Fig. 6(c), 6(d)), which justifies our claim that the SM-SGE framework can adaptively infer global body-component cooperation in skeletal motion. More results and proofs are provided in the Appendix.
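A cross-scale relation matrix of the kind visualized in Fig. 6 can be sketched as a row-normalized similarity between node features of two scales. This is a dot-product/softmax illustration only, not the paper's exact MGRN formulation, which involves learnable parameters.

```python
import numpy as np

def cross_scale_relations(feats_a, feats_b, tau=1.0):
    """Row-normalized relation matrix between node features of two graph scales.

    feats_a: (n_a, d), feats_b: (n_b, d); entry (i, j) is the softmax-normalized
    dot-product similarity of node i (scale A) to node j (scale B)."""
    sim = feats_a @ feats_b.T / tau
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(sim)
    return e / e.sum(axis=1, keepdims=True)
```

Each row of the resulting matrix sums to 1, so it can be read as node i's attention over the other scale's components, which is how the CCR matrices in Fig. 6(c)-(d) are interpreted.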

6. Conclusion

In this paper, we model 3D skeletons as multi-scale graphs, and propose a self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeleton graphs for person Re-ID. To capture key correlative features of graph nodes, we propose the multi-scale graph relation network (MGRN) to learn structural and collaborative relations among body-component nodes in different graphs. A novel multi-scale skeleton reconstruction (MSR) mechanism with subsequence reconstruction and cross-scale skeleton inference tasks is devised to encode graph dynamics and discriminative high-level features of skeleton graphs for person Re-ID. SM-SGE outperforms most state-of-the-art skeleton-based methods, and it can achieve satisfactory performance on 3D skeleton data estimated from RGB videos.