1. Introduction
Person re-identification (ReID) aims to retrieve the same individual from a different view or scene, and has great potential in authentication-related applications (vezzani2013people). Conventional studies (wang2018learning; karianakis2018reinforced; haque2016recurrent) typically extract appearance-based features such as body texture and silhouettes from RGB or depth images to perform person ReID. Nevertheless, an important flaw of these methods is their vulnerability to illumination or appearance changes (rao2021a2). In contrast, skeleton-based models exploit the 3D coordinates of key body joints to characterize human body and motion, and are usually robust to factors such as view and body-shape changes (han2017space). Although skeleton data have been extensively studied in action- and motion-related tasks (li2020dynamic), extracting discriminative body and motion features from 3D skeletons for person ReID remains an open challenge (rao2020self). In this sense, this work constructs a systematic framework that tackles the skeleton-based person ReID task from three aspects.
(1) Multi-scale skeleton graphs. Most existing works (barbosa2012re; munaro2014one; andersson2015person; pala2019enhanced) construct skeleton descriptors to depict discriminative features of body structure and motion (e.g., anthropometric and gait attributes (andersson2015person)) for person ReID. However, they typically extract these hand-crafted features from skeletons at a single spatial scale and topology, which limits their ability to capture underlying structural information from body partitions beyond the body-joint level (e.g., limb-level components) (li2020dynamic). To fully mine latent structural features within the body, it is beneficial to represent skeletons at different levels in a systematic manner. In this work, we model skeletons as multi-scale graphs (see Fig. 1) to learn coarse-to-fine-grained body and motion features from 3D skeleton data.
(2) Multi-scale relation learning. In human walking, body components usually possess different internal relations, which can carry unique and recognizable patterns (murray1964walking; winter2009biomechanics). Recent works such as (rao2020self; liao2020model; rao2021a2) typically encode body-joint trajectories or pre-defined pose descriptors into a feature vector for skeleton representation learning, but rarely explore the inherent relations between different body joints or components. For example, the adjacent body joints "knee" and "foot" are strongly correlated in walking, yet they exhibit different degrees of collaboration with their corresponding limb-level component "leg". To capture such internal body relations in skeletal motion, it is highly desirable to design a framework that captures correlations between physically-connected body parts (referred to as "structural relations") and relations among all collaborative body components (referred to as "collaborative relations").

(3) Multi-scale skeleton dynamics modeling. Existing 3D skeleton based ReID methods usually model skeleton dynamics from the trajectories of body joints (rao2020self; rao2021a2) or sequences of pre-defined joint features (e.g., pairwise joint distances and pose descriptors (liao2020model)). Since these methods learn skeleton motion at a fixed scale of body joints, they lack the flexibility to capture motion patterns at various levels. For instance, they cannot explicitly model the movement or interaction of higher-level limbs from joint trajectories, which may cause a loss of global motion features. Hence, it is important to devise a framework that explicitly models skeleton dynamics at different scales to better capture body motion patterns.
To fulfill all of the above goals, this work for the first time proposes a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that exploits coarse-to-fine skeleton graphs to model body-structure and motion features for person ReID. Specifically, we first construct multi-scale skeleton graphs by spatially dividing each skeleton into body-component nodes of different granularities (shown in Fig. 2), which allows our framework to fully model body structure and capture skeleton features at various levels. Second, motivated by the fact that human walking usually carries unique patterns (murray1964walking), which endow body components with different internal relations, we propose a Multi-scale Graph Relation Network (MGRN) to capture structural and collaborative relations among body components in multi-scale skeleton graphs. MGRN exploits structural relations between adjacent body-component nodes to aggregate key correlative features for better node representations, and meanwhile incorporates collaborative relations among nodes of different scales into the graph encoding process to enhance global pattern learning. Finally, we propose a novel Multi-scale Skeleton Reconstruction (MSR) mechanism with two concurrent pretext tasks, namely a skeleton subsequence reconstruction task and a cross-scale skeleton inference task, to enable our framework to capture skeleton dynamics and latent high-level semantics (e.g., body-part correspondence, sequence order) from unlabeled skeleton graph representations. The graph features of all scales learned by the proposed framework are then combined as the final skeleton representation to perform the downstream task of person ReID.
The proposed SM-SGE framework enjoys three main advantages: First, it seamlessly unifies the learning of multi-scale skeleton graphs into a systematic framework, which enables us to model body structure, component relations, and motion patterns of skeletons at different levels. Second, unlike most existing skeleton-based methods that require manual annotation (e.g., ID labels) for representation learning, our framework is able to learn an effective representation from unlabeled skeletons, which can be directly applied to skeleton-based tasks such as person ReID. Last, our framework is also effective with 3D skeleton data estimated from RGB videos (yu2006framework), so it can potentially be applied to RGB-based datasets under general settings. In summary, our main contributions include:

We devise multi-scale graphs to fully model 3D skeletons, and propose a novel self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeletons for person ReID.

We propose the multi-scale graph relation network (MGRN) to learn both structural and collaborative relations of body-component nodes, so as to aggregate crucial correlative features of nodes and capture richer pattern information.

We propose the multi-scale skeleton reconstruction (MSR) mechanism to enable the framework to encode graph dynamics and high-level semantics from unlabeled skeletons.

Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods on three person ReID benchmarks, and achieves highly competitive performance on skeletons estimated from large-scale RGB videos.
2. Related Works
This section briefly reviews existing skeleton-based person ReID methods using hand-crafted features, supervised learning, or self-supervised learning. We also introduce depth-based and multi-modal ReID methods that are related to skeleton-based models.
Hand-Crafted and Supervised ReID Methods with Skeleton Data. Most existing works extract hand-crafted skeleton descriptors to depict certain geometric, anthropometric, and gait attributes of the human body. (barbosa2012re) computes 7 Euclidean distances (between the floor plane and joints, or between joint pairs) to construct a distance matrix, which is learned to match gallery individuals with a quasi-exhaustive strategy. (munaro2014one) and (pala2019enhanced) further extend these to 13 and 16 skeleton descriptors respectively, and leverage support vector machine (SVM), K-nearest neighbor (KNN), or Adaboost classifiers for ReID. As existing solutions that use skeleton features alone typically perform unsatisfactorily, other modalities such as 3D face descriptors (pala2019enhanced) and 3D point clouds (munaro20143d) are often used to boost performance. A few recent studies resort to supervised deep learning models to learn discriminative skeleton representations: (haque2016recurrent) utilizes long short-term memory (LSTM) (hochreiter1997long) to encode the temporal dynamics of pairwise joint distances to perform person ReID; Liao et al. (liao2020model) propose the PoseGait model, which learns 81 hand-crafted pose features of 3D skeleton data with deep convolutional neural networks (CNN) for gait-based human recognition.
Self-supervised Skeleton-based ReID Methods. Recently, Rao et al. (rao2020self) devise a self-supervised attention-based gait encoding model with multi-layer LSTM to encode gait features from unlabeled skeleton sequences for person ReID. The latest self-supervised study (rao2021a2) further proposes a locality-awareness approach that combines various pretext tasks (e.g., reverse sequential reconstruction) and a contrastive learning scheme to enhance self-supervised gait representation learning for the person ReID task.
Depth-based and Multi-modal ReID Methods. Depth-based methods exploit depth-image sequences to extract human shapes, silhouettes, or gait features for person ReID. (sivapalan2011gait) proposes the Gait Energy Volume (GEV) algorithm, which extends Gait Energy Image (GEI) (chunli2010behavior) to the 3D domain, to learn depth features for human recognition. (munaro2014one) devises a depth-based point cloud matching (PCM) method to match multi-view 3D point cloud sets to discriminate different individuals. In (haque2016recurrent), Haque et al. leverage 3D LSTM and 3D CNN (boureau2010theoretical) to learn motion dynamics from 3D point clouds for person ReID. As to multi-modal methods, skeleton information and RGB or depth features (e.g., depth shape features (munaro20143d; wu2017robust; hasan2016long)) are usually combined to enhance ReID performance. In (karianakis2018reinforced), Karianakis et al. propose an RGB-to-depth transferred CNN-LSTM model with reinforced temporal attention (RTA) for the person ReID task.
3. The Proposed Framework
Suppose that a 3D skeleton sequence with $f$ consecutive skeletons is $S = (S_1, \dots, S_f)$, where $S_t \in \mathbb{R}^{J \times 3}$ denotes the $t^{th}$ skeleton with three-dimensional coordinates of its $J$ body joints. $\Phi = \{S^{(i)}\}_{i=1}^{N}$ represents the training set that contains $N$ skeleton sequences collected from different persons and views. Each skeleton sequence $S^{(i)}$ corresponds to an ID label $y_i \in \{1, \dots, C\}$, where $C$ is the number of different persons. The goal of the SM-SGE framework is to learn a latent discriminative representation $\mathbf{H}$ from skeleton sequences without using any label. Then, we evaluate the effectiveness of the learned skeleton representation $\mathbf{H}$ on the downstream task of person ReID: the frozen $\mathbf{H}$ and the corresponding ID labels are used to train a multi-layer perceptron (MLP) for person ReID (note that the learned features $\mathbf{H}$ are NOT tuned at this training stage). The overview of the proposed SM-SGE framework is shown in Fig. 3, and we present each technical component below.

3.1. Multi-Scale Skeleton Graph Construction
The human body can be segmented into functional components of diverse granularities (e.g., knee joint, thigh part, leg limb), each of which typically carries different geometric or anthropometric attributes of the body (winter2009biomechanics). Inspired by this fact, we regard body joints as the basic components, and merge spatially nearby groups of joints into a higher-level body-component node placed at the center of their positions. As shown in Fig. 2, we first construct skeleton graphs at three scales, namely joint-scale, part-scale, and body-scale graphs (denoted as $\mathcal{G}^2_t$, $\mathcal{G}^3_t$, $\mathcal{G}^4_t$) for each skeleton $S_t$. Besides, to encourage our model to capture coarse-to-fine skeleton features more systematically, we also build a hyper-joint-scale graph (denoted as $\mathcal{G}^1_t$) based on a denser body-limb representation (liu2018recognizing), which is constructed by linearly interpolating nodes between adjacent nodes of the joint-scale graph. Each graph $\mathcal{G}^m_t = (\mathcal{V}^m_t, \mathcal{E}^m_t)$ ($m \in \{1, 2, 3, 4\}$) consists of nodes $\mathcal{V}^m_t = \{v^m_{t,1}, \dots, v^m_{t,n_m}\}$ ($v^m_{t,i} \in \mathbb{R}^3$) and edges $\mathcal{E}^m_t \subseteq \mathcal{V}^m_t \times \mathcal{V}^m_t$. Here $\mathcal{V}^m_t$ and $\mathcal{E}^m_t$ denote the set of nodes corresponding to different body components and the set of their structural relations respectively, and $n_m$ is the number of nodes in the $m^{th}$-scale graph. We use $\mathbf{A}^m \in \mathbb{R}^{n_m \times n_m}$ to represent a graph's adjacency matrix, where each element $\mathbf{A}^m_{i,j}$ is defined as the normalized structural relation between adjacent nodes $v^m_{t,i}$ and $v^m_{t,j}$, and satisfies $\sum_{j \in \mathcal{N}_i} \mathbf{A}^m_{i,j} = 1$, where $\mathcal{N}_i$ denotes the indices of neighbor nodes of node $i$ in $\mathcal{G}^m_t$. During training of SM-SGE, $\mathbf{A}^m$ is adaptively learned to capture flexible structural relations.

3.2. Multi-Scale Graph Relation Network
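Before detailing relation learning, the node-merging step of the graph construction in Sec. 3.1 can be sketched in a few lines of numpy. The joint grouping below is illustrative (a toy 6-joint skeleton), not the exact partition of Fig. 2:

```python
import numpy as np

# Hypothetical joint groups: which joint indices merge into each
# higher-level body-component node (illustrative only).
PART_GROUPS = [[0, 1], [2, 3], [4, 5]]

def build_coarser_scale(joints, groups):
    """Merge groups of 3D joints into higher-level body-component nodes,
    each placed at the center (mean position) of its group."""
    return np.stack([joints[g].mean(axis=0) for g in groups])

# Toy joint-scale skeleton: 6 joints with 3D coordinates.
joints = np.arange(18, dtype=float).reshape(6, 3)
parts = build_coarser_scale(joints, PART_GROUPS)
print(parts.shape)  # (3, 3): 3 part-scale nodes, 3D each
```

The same merging applied again to `parts` would yield an even coarser scale, mirroring the joint-scale → part-scale → body-scale hierarchy.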
Different body parts typically possess internal relations at the physical or kinematic level, which can be exploited to mine rich body-structure features and motion patterns (aggarwal1998nonrigid). Motivated by this fact, we propose to learn relations of body components from two aspects: (1) Structural relations: structurally-connected body components usually enjoy higher motion correlation than distant pairs. Thus, to better represent each body-component node when encoding skeleton graph features, it is crucial to capture structural relations of neighbor nodes so as to aggregate the most correlative spatial features. (2) Collaborative relations: human motion such as walking is often performed by several action-related body components, which collaborate in a relatively stable pattern (e.g., gait patterns) (murray1964walking; rao2020self). It is therefore beneficial to learn the inherent collaborative relations among different body components to mine more global pattern information from skeletons. To achieve the above goals, we propose the Multi-scale Graph Relation Network (MGRN) with structural and collaborative relation learning as below.
Structural Relation Learning. Given the $m^{th}$-scale graph $\mathcal{G}^m_t$ of a skeleton, MGRN first computes the structural relation $e^m_{i,j}$ between adjacent nodes $v^m_{t,i}$ and $v^m_{t,j}$ in $\mathcal{G}^m_t$ as follows:

$e^m_{i,j} = \sigma\left(\mathbf{W}^m_r \left[\mathbf{W}^m v^m_{t,i} \,\|\, \mathbf{W}^m v^m_{t,j}\right]\right) \quad (1)$

where $\mathbf{W}^m \in \mathbb{R}^{D \times 3}$ is the weight matrix that maps the $m^{th}$-scale node $v^m_{t,i} \in \mathbb{R}^3$ into a higher-level feature space $\mathbb{R}^D$, $\mathbf{W}^m_r \in \mathbb{R}^{1 \times 2D}$ denotes a learnable weight matrix for relation learning at the $m^{th}$ scale, $\|$ indicates the feature concatenation of two nodes, and $\sigma(\cdot)$ is a non-linear activation function. Then, to learn flexible structural relations that focus on more correlative nodes, we normalize relations with a temperature-based softmax function (hinton2015distilling) as follows:

$\mathbf{A}^m_{i,j} = \dfrac{\exp\left(e^m_{i,j} / \tau_1\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e^m_{i,k} / \tau_1\right)} \quad (2)$

where $\tau_1$ denotes the temperature, which is normally set to $1$ in the softmax function, while a higher value of $\tau_1$ produces a softer relation distribution over nodes and retains more similar relation information. Here $\mathcal{N}_i$ denotes the neighbor nodes (including $i$) of node $i$ in the graph.
To aggregate the features of the most relevant nodes to represent node $i$, we exploit the normalized structural relations to yield the node representation $\overline{v}^m_{t,i} = \sigma\left(\sum_{j \in \mathcal{N}_i} \mathbf{A}^m_{i,j} \mathbf{W}^m v^m_{t,j}\right)$. To sufficiently capture potential structural relations (e.g., motion correlation, position similarity), MGRN concurrently and independently learns $K$ different structural relation matrices $\mathbf{A}^{m,1}, \dots, \mathbf{A}^{m,K}$ using the same computation (see Eq. 1, Eq. 2). In this way, MGRN can jointly capture structural relation information of nodes from different representation subspaces (velickovic2018graph) based on the $K$ learnable structural relation matrices (see Fig. 4). We average the features learned by these matrices to represent each node:

$\overline{v}^m_{t,i} = \sigma\left(\dfrac{1}{K} \sum_{k=1}^{K} \sum_{j \in \mathcal{N}_i} \mathbf{A}^{m,k}_{i,j} \mathbf{W}^{m,k} v^m_{t,j}\right) \quad (3)$

where $\overline{v}^m_{t,i}$ denotes the representation of node $v^m_{t,i}$ learned by the $K$ structural relation matrices, $\mathbf{A}^{m,k}_{i,j}$ represents the structural relation between node $i$ and node $j$ computed by the $k^{th}$ structural relation matrix, and $\mathbf{W}^{m,k}$ denotes the corresponding weight matrix that performs node feature mapping. Here we use the average rather than the concatenation operation to reduce the feature dimension of nodes and allow for learning more structural relation matrices.
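A minimal numpy sketch of the structural relation learning above (Eqs. 1–3), with random matrices standing in for the learned weights and `tanh` as an assumed stand-in for the non-linear activation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax_T(x, tau=1.0):
    # Temperature-based softmax (Eq. 2): higher tau -> softer distribution.
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def structural_aggregate(V, neighbors, K=2, D=8, tau=1.0):
    """Sketch of Eqs. 1-3 for one graph scale: V is (n, 3) node positions,
    neighbors[i] lists neighbor indices of node i (including i itself)."""
    n = V.shape[0]
    W = rng.standard_normal((K, D, 3))    # per-head feature mapping W^{m,k}
    Wr = rng.standard_normal((K, 2 * D))  # per-head relation weights W^m_r
    out = np.zeros((n, D))
    for i in range(n):
        nbrs = neighbors[i]
        for k in range(K):
            feats = V @ W[k].T            # map all nodes to R^D
            # Eq. 1: relation score from concatenated node features
            e = np.array([Wr[k] @ np.concatenate([feats[i], feats[j]])
                          for j in nbrs])
            a = softmax_T(e, tau)         # Eq. 2: normalized relations
            out[i] += a @ feats[nbrs]     # relation-weighted aggregation
    return np.tanh(out / K)               # Eq. 3: average over K heads

V = rng.standard_normal((4, 3))
neighbors = [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3]]
H = structural_aggregate(V, neighbors)
print(H.shape)  # (4, 8)
```

In a trainable implementation the loops would be vectorized and the weights learned by backpropagation; the sketch only makes the data flow of the three equations concrete.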
Collaborative Relation Learning. Motivated by the fact that unique walking patterns can be represented by the dynamic cooperation among body joints or between different body components (murray1964walking), we expect our model to capture more discriminative patterns globally by learning collaborative relations from two aspects: (1) single-scale collaborative relations among nodes of the same scale, and (2) cross-scale collaborative relations between a node and its spatially corresponding or motion-related higher-level body component. To this end, MGRN computes the collaborative relation matrix $\mathbf{C}^{m,l} \in \mathbb{R}^{n_m \times n_l}$ between $m^{th}$-scale nodes and $l^{th}$-scale nodes as follows (shown in Fig. 4 and Fig. 3):

$\mathbf{C}^{m,l}_{i,j} = \dfrac{\exp\left(\left(\overline{v}^m_{t,i}\right)^{\top} \overline{v}^l_{t,j} / \tau_2\right)}{\sum_{k=1}^{n_l} \exp\left(\left(\overline{v}^m_{t,i}\right)^{\top} \overline{v}^l_{t,k} / \tau_2\right)} \quad (4)$

where $\mathbf{C}^{m,l}_{i,j}$ represents the single-scale (when $l = m$) or cross-scale (when $l \neq m$) collaborative relation between node $i$ in $\mathcal{G}^m_t$ and node $j$ in $\mathcal{G}^l_t$. Here MGRN computes the inner product of node feature representations, which aggregate key spatial information via structural relation learning (see Eq. 3), to measure the degree of collaboration between two nodes, and $\tau_2$ denotes the temperature that adjusts the softness of relation learning (illustrated in Eq. 2).
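Eq. 4 reduces to a row-wise temperature softmax over inner products of node features; a small numpy sketch with random features as stand-ins:

```python
import numpy as np

def collaborative_relations(Hm, Hl, tau=1.0):
    """Sketch of Eq. 4: row-wise temperature softmax over inner products
    between m-th-scale node features Hm (n_m, D) and l-th-scale features
    Hl (n_l, D), yielding the relation matrix C of shape (n_m, n_l)."""
    scores = Hm @ Hl.T / tau                       # collaboration degree
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
Hm = rng.standard_normal((5, 8))   # e.g. joint-scale node features
Hl = rng.standard_normal((2, 8))   # e.g. body-scale node features
C = collaborative_relations(Hm, Hl)
print(C.shape)  # (5, 2); each row sums to 1
```

Setting `Hl = Hm` gives the single-scale case ($l = m$) with the same code path.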
Multi-Scale Collaboration Fusion. To adaptively focus on key correlative features of body-component collaboration at different spatial levels and thereby enhance global pattern learning, we propose multi-scale collaboration fusion, which exploits collaborative relations to fuse node features across scales. Each node representation $\overline{v}^m_{t,i}$ in the $m^{th}$-scale graph is updated by the feature fusion of collaborative nodes learned from different graphs as below (see Fig. 3):

$\hat{v}^m_{t,i} = \overline{v}^m_{t,i} + \beta \sum_{l} \sum_{j=1}^{n_l} \mathbf{C}^{m,l}_{i,j} \mathbf{W}^{l,m} \overline{v}^l_{t,j} \quad (5)$

where $\mathbf{W}^{l,m} \in \mathbb{R}^{D \times D}$ is a learnable weight matrix that integrates the collaborative features of $l^{th}$-scale nodes into the $m^{th}$-scale node representation $\hat{v}^m_{t,i}$, $n_l$ represents the number of nodes in the $l^{th}$-scale graph, and $\beta$ is the fusion coefficient for fusing collaborative graph node features. We denote the fused graph features of the $m^{th}$ scale for a skeleton sequence by $\hat{\mathbf{X}}^m = (\hat{\mathbf{x}}^m_1, \dots, \hat{\mathbf{x}}^m_f)$, where $\hat{\mathbf{x}}^m_t$ collects the fused node representations of $\mathcal{G}^m_t$. Note that the multi-scale collaboration fusion does NOT directly fuse the graph features of all scales into one representation. Instead, the graph representation of each individual scale is retained (shown in Fig. 3) to encourage our model to capture skeleton dynamics and pattern information at different levels.
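A toy numpy sketch of the collaboration fusion in Eq. 5 for one target scale; the dictionary keys, uniform relations, and identity mapping are illustrative stand-ins for the learned quantities:

```python
import numpy as np

def collaboration_fusion(H, C, W, beta=1.0):
    """Sketch of Eq. 5 for one target scale m: H["m"] holds the (n_m, D)
    target-scale features, H[l] the (n_l, D) features of another scale l,
    C[l] the (n_m, n_l) collaborative relations toward scale l, and W[l]
    a (D, D) stand-in for the learned mapping W^{l,m}."""
    fused = H["m"].copy()
    for l in C:
        fused += beta * C[l] @ H[l] @ W[l].T  # add collaborative features
    return fused

rng = np.random.default_rng(2)
D = 4
H = {"m": rng.standard_normal((3, D)), "l": rng.standard_normal((2, D))}
C = {"l": np.full((3, 2), 0.5)}   # uniform relations for the demo
W = {"l": np.eye(D)}              # identity mapping for the demo
out = collaboration_fusion(H, C, W, beta=1.0)
print(out.shape)  # (3, 4)
```

With uniform relations and an identity mapping, each target node simply adds the mean of the other scale's features, which makes the residual structure of Eq. 5 easy to verify.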
3.3. Multi-Scale Skeleton Reconstruction Mechanism
To enable SM-SGE to encode the multi-scale graph dynamics of unlabeled skeletons, we propose a self-supervised Multi-scale Skeleton Reconstruction (MSR) mechanism to simultaneously capture skeleton graph dynamics and high-level semantics (e.g., skeleton order in the subsequence, cross-scale component correspondence) from graphs of different scales. Unlike plain reconstruction, which learns to reconstruct the whole sequence at a single scale, the objective of MSR combines two concurrent pretext tasks as follows:
(1) Skeleton subsequence reconstruction task, which reconstructs multiple skeleton subsequences based on their graph representations. In particular, MSR aims to reconstruct the target multi-scale skeletons $\mathbf{s}^m_t$ corresponding to the multi-scale graphs $\mathcal{G}^m_t$ in subsequences, instead of reconstructing the original subsequences (for clarity, we use the vector $\mathbf{s}^m_t \in \mathbb{R}^{3 n_m}$ to represent all node positions in the $m^{th}$-scale graph of the $t^{th}$ skeleton in the subsequence).

(2) Cross-scale skeleton inference task, which exploits fine skeleton graph representations to infer the 3D positions of coarser body components. For instance, we propose to use joint-scale graph representations, which may contain richer spatial information with denser nodes, to infer the nodes of body-scale skeletons. It can also be viewed as a cross-scale reconstruction task that reconstructs nodes of a different scale from the same skeleton graph.
To simultaneously achieve the above two pretext tasks, we first sample length-$n$ subsequences by randomly discarding skeletons from the input sequence $S$. To exploit more potential samples for training, the random sampling process is repeated for $R$ rounds, and each round covers all possible lengths from $1$ to $f$. Second, given a sampled skeleton subsequence, MGRN encodes its corresponding skeleton graphs of each scale into fused graph features (see Eq. 1–5). Then, we leverage an LSTM to integrate the temporal dynamics of graphs at each scale into effective representations: the LSTM encodes each skeleton graph representation $\hat{\mathbf{x}}^m_t$ and the previous step's latent state $\mathbf{h}^m_{t-1}$ (if it exists), which provides the temporal context of the $m^{th}$-scale graph representations, into the current latent state $\mathbf{h}^m_t$:

$\mathbf{h}^m_t = \mathrm{LSTM}^m\left(\hat{\mathbf{x}}^m_t, \mathbf{h}^m_{t-1}\right) \quad (6)$

where $t \in \{1, \dots, n\}$ and $\mathrm{LSTM}^m$ denotes the LSTM encoder that captures the long-term dynamics of graph representations at the $m^{th}$ scale. $\mathbf{h}^m_1, \dots, \mathbf{h}^m_n$ are encoded graph states that contain crucial temporal encoding information of the $m^{th}$-scale graph representations from time $1$ to $n$. Last, we exploit the encoded graph states at the $m^{th}$ scale to reconstruct the target skeleton at the $l^{th}$ scale as follows:

$\overline{\mathbf{s}}^l_t = f^{m \to l}\left(\mathbf{h}^m_t\right) \quad (7)$

where $\overline{\mathbf{s}}^l_t$ is the reconstructed skeleton: when $l = m$, Eq. 7 performs plain skeleton reconstruction at the same scale, while $l \neq m$ indicates the cross-scale skeleton inference. $f^{m \to l}(\cdot)$ is a network function built by an MLP, whose weights are NOT shared between different scales; we train an individual MLP for each skeleton reconstruction at the same or a different scale (see Fig. 3).
As we expect to capture graph dynamics and pattern features of skeletons at various scales, we employ the MSR mechanism with the above reconstruction objective on all scales of graphs. Formally, we define the objective function for the self-supervision of MSR on the $m^{th}$-scale graphs, which minimizes the $\ell_1$ loss between the ground-truth skeletons of the length-$n$ subsequence and the reconstructed skeletons:

$\mathcal{L}^m = \sum_{l} \sum_{t=1}^{n} \left\| \overline{\mathbf{s}}^l_t - \mathbf{s}^l_t \right\|_1 \quad (8)$

where $\overline{\mathbf{s}}^l_t$ is the reconstructed $l^{th}$-scale skeleton based on the $m^{th}$-scale encoded graph states (see Eq. 7), and $\|\cdot\|_1$ denotes the $\ell_1$ norm. The reason for using the $\ell_1$ loss is twofold: it gives sufficient gradients to positions with small losses to facilitate precise spatial reconstruction, and it meanwhile alleviates gradient explosion with stable gradients for large losses (li2020dynamic). It should be noted that our implementation actually optimizes Eq. 8 on each individual graph scale, and the sum of the reconstruction loss over all sampled subsequences is computed. By learning to reconstruct skeletons of the same scale and to infer cross-scale body-component nodes dynamically (i.e., using varying subsequences), MSR encourages our framework to integrate crucial skeleton dynamics and high-level semantics into the encoded graph states to achieve better person ReID performance (see Sec. 5).
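The subsequence sampling and reconstruction objective can be sketched as follows; the decoder is replaced by a constant-offset stand-in so the $\ell_1$ loss of Eq. 8 is easy to verify by hand:

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_subsequence(seq, n):
    """Sample a length-n subsequence by randomly discarding skeletons
    while keeping temporal order (as in the MSR sampling step)."""
    idx = np.sort(rng.choice(len(seq), size=n, replace=False))
    return seq[idx]

def msr_l1_loss(recon, target):
    """Sketch of Eq. 8: l1 loss between reconstructed and ground-truth
    skeleton vectors, summed over all time steps."""
    return np.abs(recon - target).sum()

# Toy sequence: f=6 skeletons, each flattened to 9 values (3 nodes x 3D).
seq = rng.standard_normal((6, 9))
sub = sample_subsequence(seq, n=4)
recon = sub + 0.1              # stand-in for the decoder output
loss = msr_l1_loss(recon, sub)
print(sub.shape)               # (4, 9); loss is 0.1 * 36 here
```

In the full mechanism this loss would be accumulated over all sampled subsequences, all rounds $R$, and all scale pairs $(m, l)$.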
3.4. The Entire Framework

The computation flow of the entire framework during self-supervised learning can be summarized as: multi-scale graph construction (Sec. 3.1) → multi-scale graph relation learning (Sec. 3.2) → temporal graph encoding (Eq. 6) → multi-scale skeleton reconstruction (Eq. 7). The self-supervised loss (see Eq. 8) is employed to train the SM-SGE framework to learn an effective skeleton representation from multi-scale skeleton graphs. For the downstream task of person ReID, we extract the encoded graph states $\mathbf{h}^m_t$ learned by the pre-trained framework, and exploit an MLP $f_c(\cdot)$ to predict the sequence label. Specifically, for the $t^{th}$ skeleton in an input sequence, we concatenate its corresponding encoded graph states of the four scales, namely $\mathbf{h}_t = \left[\mathbf{h}^1_t \,\|\, \mathbf{h}^2_t \,\|\, \mathbf{h}^3_t \,\|\, \mathbf{h}^4_t\right]$, as the skeleton-level representation of the sequence. Then, we train the MLP $f_c(\cdot)$ with the frozen $\mathbf{h}_t$ and its label (note that $\mathbf{h}_t$ is NOT tuned in this training stage). The ID predictions $f_c(\mathbf{h}_t)$ of all skeleton-level representations in a sequence are averaged to obtain the final sequence-level ID prediction $\hat{y}$. We employ the cross-entropy loss to train $f_c(\cdot)$ for person ReID.
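The sequence-level prediction step (averaging per-skeleton ID predictions) can be sketched as:

```python
import numpy as np

def sequence_prediction(frame_logits):
    """Average per-skeleton ID predictions (softmax probabilities) into a
    sequence-level prediction, as in the downstream ReID stage."""
    e = np.exp(frame_logits - frame_logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)  # per-frame class probabilities
    return probs.mean(axis=0)                 # sequence-level prediction

# Toy case: 3 skeletons in a sequence, 4 identity classes.
logits = np.array([[2.0, 0.1, 0.0, 0.0],
                   [1.5, 0.2, 0.1, 0.0],
                   [2.2, 0.0, 0.3, 0.1]])
p = sequence_prediction(logits)
print(int(p.argmax()))  # 0 -- predicted identity
```

Averaging probabilities (rather than hard per-frame votes) keeps the sequence-level prediction differentiable-in-spirit and robust to a few noisy frames.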
Table 1: Rank-1 accuracy (%) and nAUC (%) comparison on IAS-A and IAS-B ("—" indicates no reported result).

| Methods | IAS-A Rank-1 | IAS-A nAUC | IAS-B Rank-1 | IAS-B nAUC |
|---|---|---|---|---|
| Hand-Crafted and Supervised Methods | | | | |
| Gait Energy Image (chunli2010behavior) | 25.6 | 72.1 | 15.9 | 66.0 |
| 3D CNN + Average Pooling (boureau2010theoretical) | 33.4 | 81.4 | 39.1 | 82.8 |
| Gait Energy Volume (sivapalan2011gait) | 20.4 | 66.2 | 13.7 | 64.8 |
| 3D LSTM (haque2016recurrent) | 31.0 | 77.6 | 33.8 | 78.0 |
| PCM + Skeleton (munaro20143d) | 27.3 | — | 81.8 | — |
| DVCov + SKL (wu2017robust) | 46.6 | — | 45.9 | — |
| ED + SKL (wu2017robust) | 52.3 | — | 63.3 | — |
| descriptors + KNN (munaro2014one) | 33.8 | 63.6 | 40.5 | 71.1 |
| Single-layer LSTM (haque2016recurrent) | 20.0 | 65.9 | 19.1 | 68.4 |
| Multi-layer LSTM (zheng2019relational) | 34.4 | 72.1 | 30.9 | 71.9 |
| descriptors + Adaboost (pala2019enhanced) | 27.4 | 65.5 | 39.2 | 78.2 |
| PoseGait (liao2020model) | 41.4 | 79.9 | 37.1 | 74.8 |
| Self-Supervised Methods | | | | |
| Attention Gait Encodings (rao2020self) | 56.1 | 81.7 | 58.2 | 85.3 |
| SGELA (rao2021a2) | 60.1 | 82.9 | 62.5 | 86.9 |
| SM-SGE (Ours) | 59.4 | 86.7 | 69.8 | 90.4 |
4. Experiments
4.1. Experimental Settings
Datasets: We evaluate our framework on three public person ReID datasets that contain skeleton data (IAS-Lab (munaro2014feature), KS20 (nambiar2017context), KGBD (andersson2015person)) and a large RGB video based multi-view dataset, CASIA B (yu2006framework), which contain 11, 20, 164, and 124 different individuals respectively. We follow the frequently used evaluation setup in the literature (rao2020self; haque2016recurrent): For IAS-Lab, we use the full training set and two testing splits, IAS-A and IAS-B. For KGBD, since no training and testing splits are given, we randomly leave one skeleton video of each person for testing and use the remaining videos for training. For KS20, we randomly select one sequence from each viewpoint for testing and use the rest of the skeleton sequences for training.
To evaluate the effectiveness of SM-SGE when 3D skeleton data are directly estimated from RGB videos rather than Kinect, we introduce the large-scale RGB video based dataset CASIA B (yu2006framework), and exploit pre-trained pose estimation models (chen20173d; cao2019openpose) to extract 3D skeletons from RGB videos (detailed in the Appendix). We evaluate our approach on each view (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, 180°) of CASIA B and use the adjacent views for training.

Table 2: Rank-1 accuracy (%) and nAUC (%) comparison on KGBD and KS20.

| Methods | KGBD Rank-1 | KGBD nAUC | KS20 Rank-1 | KS20 nAUC |
|---|---|---|---|---|
| Hand-Crafted and Supervised Methods | | | | |
| descriptors + KNN (munaro2014one) | 46.9 | 90.0 | 58.3 | 78.0 |
| Single-layer LSTM (haque2016recurrent) | 39.8 | 87.2 | 80.9 | 92.3 |
| Multi-layer LSTM (zheng2019relational) | 46.2 | 89.8 | 81.6 | 94.2 |
| descriptors + Adaboost (pala2019enhanced) | 69.9 | 90.6 | 59.8 | 78.8 |
| PoseGait (liao2020model) | 90.6 | 97.8 | 70.5 | 94.0 |
| Self-Supervised Methods | | | | |
| Attention Gait Encodings (rao2020self) | 87.7 | 96.3 | 86.5 | 94.7 |
| SGELA (rao2021a2) | 86.9 | 97.1 | 86.9 | 94.9 |
| SM-SGE (Ours) | 99.5 | 99.6 | 87.5 | 95.8 |
Implementation Details: The numbers of nodes in the hyper-joint-scale and joint-scale graphs are ,  in KS20, ,  in CASIA B, and ,  in the IAS-Lab and KGBD datasets. For part-scale and body-scale graphs, the numbers of nodes are  and  for all datasets. On the IAS-Lab, KS20, and KGBD datasets, the sequence length is empirically set to the value that achieves the best overall performance among different settings. For the largest dataset, CASIA B, with roughly estimated skeleton data from RGB frames, we set a shorter sequence length for training/testing. The node feature dimension is $D$ and the number of structural relation matrices is $K$. The temperatures ($\tau_1$, $\tau_2$) for relation learning, the collaboration fusion coefficient ($\beta$), and the number of sampling rounds ($R$) are empirically set to 1. An MLP with one hidden layer is employed in SM-SGE. For the MSR mechanism, we use a 2-layer LSTM with hidden units per layer. We adopt the Adam optimizer, with a learning rate on IAS-Lab and KGBD and a smaller one on KS20 and CASIA B, to train the framework.
Evaluation Metrics: Person ReID typically adopts a "multi-shot" manner that leverages the predictions of multiple frames or a sequence-level representation to predict a sequence label. In this work, we compute both Rank-1 accuracy and nAUC (area under the cumulative matching curve (CMC) normalized by the number of ranks (gray2008viewpoint)) to quantify multi-shot person ReID performance.
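A small numpy sketch of both metrics from a CMC curve; note it assumes nAUC is computed as the mean CMC value over ranks, which matches "area under the CMC normalized by the number of ranks":

```python
import numpy as np

def rank1_and_nauc(cmc):
    """Compute Rank-1 accuracy and nAUC from a CMC curve, where cmc[k] is
    the matching rate within the top (k+1) ranks. nAUC is the area under
    the CMC normalized by the number of ranks (values in [0, 1])."""
    rank1 = cmc[0]
    nauc = cmc.mean()  # sum over ranks divided by the number of ranks
    return rank1, nauc

# Toy CMC over 5 ranks.
cmc = np.array([0.6, 0.8, 0.9, 1.0, 1.0])
r1, nauc = rank1_and_nauc(cmc)
print(r1, nauc)  # 0.6 0.86
```

Because the CMC is non-decreasing, nAUC rewards methods that retrieve the correct identity at high ranks even when Rank-1 accuracies are similar.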
4.2. Comparison with State-of-the-Art Methods
In this section, we compare our approach with existing hand-crafted, supervised, and self-supervised skeleton-based ReID methods on IAS-Lab (see Table 1), KS20, and KGBD (see Table 2). We also include classic depth-based methods and representative multi-modal methods as a reference. The comparison results are reported below.
Comparison with Hand-Crafted and Supervised Skeleton-based Methods: As shown in Table 1 and Table 2, the proposed SM-SGE framework enjoys evident advantages over existing skeleton-based methods. First, our approach significantly outperforms two representative hand-crafted methods that extract anthropometric attributes of skeletons (descriptor-based methods (munaro2014one; pala2019enhanced)) in both Rank-1 accuracy and nAUC on different datasets. Second, compared with state-of-the-art CNN-based (PoseGait (liao2020model)) and LSTM-based models (haque2016recurrent; zheng2019relational), our self-supervised framework achieves superior performance by a large margin in both Rank-1 accuracy and nAUC on all datasets. Besides, these supervised methods typically require massive labels and even extra hand-crafted features (e.g., PoseGait (liao2020model) relies on 81 hand-crafted pose and motion features) for representation learning, while our framework automatically models the spatial and temporal features of unlabeled skeleton graphs at various scales to learn a more effective skeleton representation for the person ReID task.
Comparison with Self-supervised Skeleton-based Methods: Our approach achieves a significant improvement in Rank-1 accuracy and nAUC over existing state-of-the-art self-supervised methods on three out of four testing sets (IAS-B, KGBD, KS20). On IAS-A, although both SGELA and our framework obtain a close Rank-1 accuracy, our approach gains a markedly higher nAUC (86.7 vs. 82.9), which suggests that our approach achieves better overall ReID performance when retrieving persons from high to low ranks. Notably, the proposed SM-SGE outperforms existing self-supervised methods by more than 11% Rank-1 accuracy on the largest skeleton-based dataset, KGBD, which demonstrates the great potential of our approach for large-scale person ReID.
Comparison with Depth-based and Multi-modal Methods: As reported in Table 1, our skeleton-based framework consistently performs better than classic depth-based methods (GEI (chunli2010behavior), GEV (sivapalan2011gait), 3D CNN (boureau2010theoretical), 3D LSTM (haque2016recurrent)) in both Rank-1 accuracy and nAUC on IAS-A and IAS-B. Compared with representative multi-modal methods, our approach is still the best performer in most cases. Interestingly, although the "PCM + Skeleton" method (munaro20143d), which uses both skeletons and 3D point cloud matching, attains the best Rank-1 accuracy on IAS-B, it is inferior to SM-SGE by 32.1% accuracy on IAS-A, which demonstrates that our approach is more effective in the setting with frequent shape and appearance changes (IAS-A). Considering that the proposed SM-SGE only requires 3D skeleton data as input and achieves more satisfactory performance on each dataset, it can be a promising solution for person ReID and other potential skeleton-related tasks.
5. Further Analysis
In this section, we first evaluate the performance of SM-SGE on skeleton data estimated from RGB videos in CASIA B. Then, we conduct an ablation study to demonstrate the effectiveness of each component, and evaluate the effects of different parameters on SM-SGE. Last, we visualize and analyze the learned collaborative relations.
Evaluation with Model-estimated Skeletons. We exploit pre-trained pose estimation models (cao2019openpose; chen20173d) to extract 3D skeletons from RGB videos of CASIA B, and evaluate the performance of SM-SGE with the estimated skeleton data. We compare our framework with the state-of-the-art supervised method PoseGait (liao2020model) under the same evaluation setup. As shown in Table 3, our approach significantly outperforms PoseGait in Rank-1 accuracy on all views of CASIA B. Notably, SM-SGE obtains more stable performance than PoseGait on the 7 consecutive views from 18° to 126°, which shows the robustness of our framework to viewpoint variation. On the two most challenging views (0° and 180°), our approach still performs better than PoseGait by more than 7% Rank-1 accuracy. These results demonstrate the effectiveness of our framework on skeleton data estimated from RGB videos, and also show the great potential of our approach to be applied to large RGB-based datasets under general settings (e.g., varying views).
Table 3: Rank-1 accuracy (%) on different views of CASIA B.

| Methods | 0° | 18° | 36° | 54° | 72° | 90° | 108° | 126° | 144° | 162° | 180° |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PoseGait [2020] | 10.7 | 37.4 | 52.5 | 28.3 | 24.3 | 18.9 | 23.5 | 17.2 | 23.6 | 18.8 | 4.3 |
| SM-SGE (Ours) | 18.4 | 50.8 | 53.6 | 40.9 | 51.2 | 59.3 | 52.3 | 53.9 | 30.2 | 28.8 | 13.6 |
Table 4: Ablation study on IAS-A and IAS-B (Rank-1 accuracy, %). MG: multi-scale skeleton graphs; SR/CR: structural/collaborative relation learning (MGRN); CSI/SSR: cross-scale skeleton inference / skeleton subsequence reconstruction (MSR).

MG | SR | CR | CSI | SSR | IAS-A | IAS-B
   |    |    |     |     | 53.5  | 61.8
✓  |    |    |     |     | 55.0  | 65.8
✓  | ✓  |    |     |     | 56.1  | 67.0
✓  | ✓  |    | ✓   |     | 56.6  | 67.8
✓  | ✓  |    | ✓   | ✓   | 57.4  | 68.9
✓  | ✓  | ✓  |     |     | 57.2  | 67.7
✓  | ✓  | ✓  | ✓   |     | 57.9  | 68.1
✓  | ✓  | ✓  | ✓   | ✓   | 59.4  | 69.8
Ablation Study. We evaluate the contribution of each component in our framework (IAS-Lab is taken as an example here) and report the results in Table 4. We use an LSTM with plain skeleton reconstruction as the baseline (first row in Table 4). We can draw the following conclusions: (1) Introducing multi-scale skeleton graphs (MG) consistently improves person ReID performance in Rank-1 accuracy, which justifies our claim that modeling skeletons as multi-scale graphs facilitates learning richer body and motion features for person ReID. (2) Exploiting structural relations (SR) between body components produces a significant performance gain in Rank-1 accuracy compared with directly modeling body-joint trajectories (baseline), while combining collaborative relation (CR) learning further boosts the ReID performance. These results demonstrate the effectiveness of multi-scale graph relation learning (MGRN) in capturing more discriminative body structural features and motion patterns for person ReID. (3) The proposed MSR mechanism, based on the cross-scale skeleton inference and skeleton subsequence reconstruction pretext tasks, evidently improves the model performance under both relation learning schemes (SR or CR), which verifies our intuition that mining high-level semantics such as cross-scale body-component correspondence encourages learning a more effective skeleton representation for the person ReID task. Similar results are observed on the other datasets.
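The skeleton subsequence reconstruction pretext task can be illustrated with a minimal sketch: enumerate fixed-length windows of a skeleton sequence and penalize the reconstruction error against each window. The sampling scheme and plain MSE loss below are illustrative assumptions, not the exact MSR objective.

```python
import numpy as np

def sample_subsequences(seq, length, stride):
    """Enumerate skeleton subsequences used as reconstruction targets:
    every contiguous window of `length` frames, taken every `stride` frames.
    seq: (frames, joints, 3) array of 3D joint coordinates."""
    f = seq.shape[0]
    return [seq[s:s + length] for s in range(0, f - length + 1, stride)]

def reconstruction_loss(pred, target):
    """Plain per-coordinate MSE between reconstructed and ground-truth skeletons."""
    return float(np.mean((pred - target) ** 2))
```

Smaller strides yield more (overlapping) training subsequences at the cost of extra computation.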
Table 5: Effects of combining graph scales on IAS-A and IAS-B (Rank-1 accuracy, %).

Body-scale | Part-scale | Joint-scale | Hyper-joint-scale | IAS-A | IAS-B
✓          |            |             |                   | 53.3  | 60.4
✓          | ✓          |             |                   | 56.9  | 65.8
✓          | ✓          | ✓           |                   | 58.3  | 68.5
✓          | ✓          | ✓           | ✓                 | 59.4  | 69.8
Effects of Multiple Graph Scales. As shown in Table 5, combining graph scales from coarse (body-scale) to fine (hyper-joint-scale) progressively improves Rank-1 accuracy on both IAS-A and IAS-B. Compared with the plain reconstruction of body-scale graphs (note that plain reconstruction without collaborative relation learning is employed when only body-scale graphs are used, as shown in the first row), employing multi-scale skeleton reconstruction (MSR) based on two adjacent scales of graphs (body-scale and part-scale) obtains a significant gain in Rank-1 accuracy. These results further demonstrate the effectiveness of multi-scale graphs and MSR, which are able to capture more discriminative skeleton features at various levels for person ReID.
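The coarse-to-fine scales compared in Table 5 can be obtained by pooling lower-level nodes into higher-level ones. A minimal sketch, assuming a hypothetical 20-joint skeleton and an illustrative partition (the actual groupings used in our framework may differ):

```python
import numpy as np

# Hypothetical partition: 20 joints -> 10 part-scale nodes -> 5 body-scale nodes.
PART_GROUPS = [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9],
               [10, 11], [12, 13], [14, 15], [16, 17], [18, 19]]
BODY_GROUPS = [[0, 1, 2], [3, 4], [5, 6], [7, 8], [9]]

def pool_nodes(nodes, groups):
    """Build a coarser graph scale: each higher-level node is the mean
    position of the lower-level nodes it aggregates."""
    return np.stack([nodes[g].mean(axis=0) for g in groups])

def multiscale_graphs(joints):
    """joints: (20, 3) joint-scale node positions -> three graph scales."""
    parts = pool_nodes(joints, PART_GROUPS)   # (10, 3) part scale
    body = pool_nodes(parts, BODY_GROUPS)     # (5, 3) body scale
    return joints, parts, body
```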
Model Sensitivity Analysis. We evaluate the effects of different hyper-parameters on SM-SGE: (1) We observe that SM-SGE is not sensitive to temperature changes from 0.1 to 1.0 (see Fig. 5 (a), (b)). Since lower temperatures tend to ignore more similar information (hinton2015distilling) and could reduce relation learning performance, we set the temperature to 1.0 for all relation learning in our framework. (2) As shown in Fig. 5 (c), introducing more learnable structural relation matrices can improve model performance on both IAS-A and IAS-B. However, too many relation matrices may cause the model to learn redundant relation information, which leads to a slight performance degradation. (3) The fusion parameter controls the degree of fusion with collaborative node features. We find that fusing graph node features with a larger fusion degree achieves better ReID performance (see Fig. 5 (d)), which verifies the necessity of sufficient multi-scale collaboration fusion for learning a more effective skeleton representation. (4) As shown in Fig. 5 (e), (f), SM-SGE obtains the highest ReID performance at moderate subsequence settings in most cases. Although larger settings can provide more training subsequences, they also increase the computational complexity (i.e., more memory and time for training), so we choose values that achieve a better trade-off between performance and complexity.
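The effect of the temperature on relation learning can be seen in a minimal softmax sketch: raw compatibility scores between nodes are normalized into a row-stochastic relation matrix, and lowering the temperature sharpens the distribution so that similarly scored relations are suppressed. This is a generic illustration, not the exact relation module of SM-SGE.

```python
import numpy as np

def relation_matrix(scores, temperature=1.0):
    """Temperature-scaled softmax over each row of raw pairwise scores.
    Lower temperatures approach a hard argmax, discarding weaker but
    possibly informative relations; temperature 1.0 keeps them comparable."""
    z = scores / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

For two nearly equal scores, temperature 1.0 keeps their relation weights close, while temperature 0.1 pushes almost all mass onto the larger one.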
Analysis of Cross-scale Collaborative Relations. We visualize the positions of different body components and their collaborative relations across adjacent scales (note that we only draw significant relations, i.e., those whose values exceed a fixed fraction of the maximum relation value), and we obtain the following observations: (1) As shown in Fig. 6(a) and 6(b), spatially corresponding or nearby body components (e.g., sub-components of limbs on the same side) possess evident relations across different scales, which demonstrates that SM-SGE can capture the high-level semantics of body-component correspondence between different graphs. (2) For non-adjacent body components (e.g., arms and legs) with a joint movement trend, the framework also learns higher correlations among their corresponding nodes in different graphs (see Fig. 6(c), 6(d)), which justifies our claim that the SM-SGE framework is able to adaptively infer global body-component cooperation in skeletal motion. More results and proofs are provided in the Appendix.
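The visualization rule above can be sketched as a simple thresholding step over a learned cross-scale relation matrix; the fraction of the maximum used as the cutoff here is a hypothetical choice.

```python
import numpy as np

def significant_relations(rel, frac=0.5):
    """Keep only cross-scale relation entries whose value exceeds a fixed
    fraction of the matrix maximum, returning (lower_node, upper_node, value)
    triples that a plotting routine could draw as edges."""
    thresh = frac * rel.max()
    idx = np.argwhere(rel > thresh)
    return [(int(i), int(j), float(rel[i, j])) for i, j in idx]
```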
6. Conclusion
In this paper, we model 3D skeletons as multi-scale graphs, and propose a self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeleton graphs for person ReID. To capture key correlative features of graph nodes, we propose the multi-scale graph relation network (MGRN) to learn structural and collaborative relations among body-component nodes in different graphs. A novel multi-scale skeleton reconstruction (MSR) mechanism with subsequence reconstruction and cross-scale skeleton inference tasks is devised to encode graph dynamics and discriminative high-level features of skeleton graphs for person ReID. SM-SGE outperforms most state-of-the-art skeleton-based methods, and it achieves satisfactory performance on 3D skeleton data estimated from RGB videos.