Person re-identification (Re-ID) aims to retrieve the same individual from a different view or scene, with great potential in authentication-related applications (vezzani2013people). Conventional studies (wang2018learning; karianakis2018reinforced; haque2016recurrent) typically extract appearance-based features such as body texture and silhouettes from RGB or depth images to perform person Re-ID. Nevertheless, an important flaw of these methods is their vulnerability to illumination or appearance changes (rao2021a2). In contrast, skeleton-based models exploit 3D coordinates of key body joints to characterize human body and motion, which are usually robust to factors such as view and body shape changes (han2017space). Although skeleton data have been extensively studied in action and motion related tasks (li2020dynamic), it is still an open challenge to extract discriminative body and motion features from 3D skeletons for person Re-ID (rao2020self). In this sense, this work aims to construct a systematic framework from three aspects to tackle the skeleton-based person Re-ID task.
(1) Multi-scale skeleton graphs. Most existing works (barbosa2012re; munaro2014one; andersson2015person; pala2019enhanced) construct skeleton descriptors to depict discriminative features of body structure and motion (e.g., anthropometric and gait attributes (andersson2015person)) for person Re-ID. However, they typically extract these hand-crafted features from skeletons with a single spatial scale and topology, which limits their ability to capture underlying structural information from different body partitions beyond the body-joint level (e.g., limb-level components) (li2020dynamic). To fully mine latent structural features within body structure, it is beneficial to devise a systematic manner to represent skeletons at different levels. In this work, we model skeletons as multi-scale graphs (see Fig. 1) to learn coarse-to-fine grained body and motion features from 3D skeleton data.
(2) Multi-scale relation learning. In human walking, body components usually possess different internal relations, which could carry unique and recognizable patterns (murray1964walking; winter2009biomechanics). Recent works like (rao2020self; liao2020model; rao2021a2) typically encode body-joint trajectories or pre-defined pose descriptors into a feature vector for skeleton representation learning, while they rarely explore the inherent relations between different body joints or components. For example, adjacent body joints "knee" and "foot" are strongly correlated in walking, while they enjoy different degrees of collaboration with their corresponding limb-level component "leg". To capture such internal body relations in skeletal motion, it is highly desirable to design a framework that captures correlations between physically-connected body parts (referred to as "structural relations") and relations among all collaborative body components (referred to as "collaborative relations").
(3) Multi-scale skeleton dynamics modeling. Existing 3D skeleton-based Re-ID methods usually model skeleton dynamics from the trajectory of body joints (rao2020self; rao2021a2) or the sequence of pre-defined joint features (e.g., pairwise joint distances and pose descriptors (liao2020model)). Since these methods learn skeleton motion at the fixed scale of body joints, they lack the flexibility to capture motion patterns at various levels. For instance, they cannot explicitly model the movement or interaction of higher level limbs from joint trajectories, which might cause a loss of global motion features. Hence, it is important to devise a framework that can explicitly model skeleton dynamics at different scales to better capture body motion patterns.
To fulfill all of the above goals, this work for the first time proposes a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that exploits coarse-to-fine skeleton graphs to model body-structure and motion features for person Re-ID. Specifically, we first construct multi-scale skeleton graphs by spatially dividing each skeleton into body-component nodes of different granularities (shown in Fig. 2), which allows our framework to fully model body structure and capture skeleton features at various levels. Second, motivated by the fact that human walking usually carries unique patterns (murray1964walking), which endow body components with different internal relations, we propose a Multi-scale Graph Relation Network (MGRN) to capture structural and collaborative relations among body components in multi-scale skeleton graphs. MGRN exploits structural relations between adjacent body-component nodes to aggregate key correlative features for better node representations, and meanwhile incorporates collaborative relations among nodes of different scales into the graph encoding process to enhance global pattern learning. Finally, we propose a novel Multi-scale Skeleton Reconstruction (MSR) mechanism with two concurrent pretext tasks, namely a skeleton subsequence reconstruction task and a cross-scale skeleton inference task, to enable our framework to capture skeleton dynamics and latent high-level semantics (e.g., body-part correspondence, sequence order) from unlabeled skeleton graph representations. The graph features of all scales learned by the proposed framework are then combined as the final skeleton representation to perform the downstream task of person Re-ID.
The proposed SM-SGE framework enjoys three main advantages: First, it seamlessly unifies the learning of multi-scale skeleton graphs into a systematic framework, which enables us to model body structure, component relations, and motion patterns of skeletons at different levels. Second, unlike most existing skeleton-based methods that require manual annotation (e.g., ID labels) for representation learning, our framework is able to learn an effective representation from unlabeled skeletons, which can be directly applied to skeleton-based tasks such as person Re-ID. Last, our framework is also effective on 3D skeleton data estimated from RGB videos (yu2006framework), thus it can potentially be applied to RGB-based datasets under general settings. In summary, our main contributions include:
We devise multi-scale graphs to fully model 3D skeletons, and propose a novel self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeletons for person Re-ID.
We propose the multi-scale graph relation network (MGRN) to learn both structural and collaborative relations of body-component nodes, so as to aggregate crucial correlative features of nodes and capture richer pattern information.
We propose the multi-scale skeleton reconstruction (MSR) mechanism to enable the framework to encode graph dynamics and high-level semantics from unlabeled skeletons.
Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods on three person Re-ID benchmarks, and it can achieve highly competitive performance on skeletons estimated from large-scale RGB videos.
2. Related Works
This section briefly reviews existing skeleton-based person Re-ID methods using hand-crafted features, supervised learning or self-supervised learning. We also introduce depth-based and multi-modal Re-ID methods that are related to skeleton-based models.
Hand-crafted and Supervised Re-ID Methods with Skeleton Data. Most existing works extract hand-crafted skeleton descriptors to depict certain geometric, anthropometric, and gait attributes of the human body. (barbosa2012re) computes 7 Euclidean distances between the floor plane and joints or joint pairs to construct a distance matrix, which is learned to match gallery individuals with a quasi-exhaustive strategy. (munaro2014one) and (pala2019enhanced) further extend these to 13 and 16 skeleton descriptors respectively, and leverage classifiers such as support vector machines (SVM) or K-nearest neighbors (KNN) for person Re-ID; to boost performance, other modalities such as face descriptors (pala2019enhanced) and 3D point clouds (munaro20143d) are often combined with skeleton data. A few recent studies resort to supervised deep learning models to learn discriminative skeleton representations: (haque2016recurrent) utilizes long short-term memory (LSTM) (hochreiter1997long) to encode the temporal dynamics of pairwise joint distances to perform person Re-ID; Liao et al. (liao2020model) propose the PoseGait model, which learns 81 hand-crafted pose features of 3D skeleton data with deep convolutional neural networks (CNN) for gait-based human recognition.
Self-supervised Skeleton-based Re-ID Methods. Recently, Rao et al. (rao2020self) devise a self-supervised attention-based gait encoding model with multi-layer LSTM to encode gait features from unlabeled skeleton sequences for person Re-ID. The latest self-supervised study (rao2021a2) further proposes a locality-awareness approach that combines various pretext tasks (e.g., reverse sequential reconstruction) and a contrastive learning scheme to enhance self-supervised gait representation learning for the person Re-ID task.
Depth-based and Multi-modal Re-ID Methods. Depth-based methods exploit depth-image sequences to extract human shapes, silhouettes or gait features for person Re-ID. (sivapalan2011gait) proposes the Gait Energy Volume (GEV) algorithm, which extends Gait Energy Image (GEI) (chunli2010behavior) to the 3D domain, to learn depth features for human recognition. (munaro2014one) devises a depth-based point cloud matching (PCM) method to match multi-view 3D point cloud sets to discriminate different individuals. In (haque2016recurrent), Haque et al. leverage 3D LSTM and 3D CNN (boureau2010theoretical) to learn motion dynamics from 3D point clouds for person Re-ID. As to multi-modal methods, skeleton information and RGB or depth features (e.g., depth shape features (munaro20143d; wu2017robust; hasan2016long)) are usually combined to enhance Re-ID performance. In (karianakis2018reinforced), Karianakis et al. propose an RGB-to-depth transferred CNN-LSTM model with reinforced temporal attention (RTA) for the person Re-ID task.
3. The Proposed Framework
Suppose that a 3D skeleton sequence with $f$ consecutive skeletons is $S = (s_1, \dots, s_f)$, where $s_t \in \mathbb{R}^{J \times 3}$ denotes the $t$-th skeleton with three-dimensional coordinates of $J$ body joints. $\Phi = \{S_i\}_{i=1}^{N}$ represents the training set that contains $N$ skeleton sequences collected from different persons and views. Each skeleton sequence $S_i$ corresponds to an ID label $y_i$, where $y_i \in \{1, \dots, C\}$ and $C$ is the number of different persons. The goal of the SM-SGE framework is to learn a latent discriminative representation $\mathbf{H}$ from skeleton sequences without using any label. Then, we evaluate the effectiveness of the learned skeleton representation $\mathbf{H}$ on the downstream task of person Re-ID: the frozen $\mathbf{H}$ and corresponding ID labels are used to train a multi-layer perceptron (MLP) for person Re-ID (note that the learned features $\mathbf{H}$ are NOT tuned at this training stage). The overview of the proposed SM-SGE framework is shown in Fig. 3, and we present each technical component below.
3.1. Multi-Scale Skeleton Graph Construction
The human body can be segmented into functional components with diverse granularities (e.g., knee joint, thigh part, leg limb), each of which typically carries different geometric or anthropometric attributes of the body (winter2009biomechanics). Inspired by this fact, we regard body joints as the basic components, and merge spatially nearby groups of joints into a higher level body-component node placed at the center of their positions. As shown in Fig. 2, we first construct skeleton graphs at three scales, namely joint-scale, part-scale, and body-scale graphs (denoted as $\mathcal{G}^1, \mathcal{G}^2, \mathcal{G}^3$) for each skeleton $s_t$. Besides, to encourage our model to capture coarse-to-fine skeleton features more systematically, we also build a hyper-joint-scale graph (denoted as $\mathcal{G}^0$) based on a denser body-limb representation (liu2018recognizing), which is constructed by linearly interpolating nodes between adjacent nodes in the joint-scale graph. Each graph $\mathcal{G}^m = (\mathcal{V}^m, \mathcal{E}^m)$ ($m \in \{0, 1, 2, 3\}$) consists of nodes $\mathcal{V}^m = \{v^m_1, \dots, v^m_{n_m}\}$ ($v^m_i \in \mathbb{R}^3$) and edges $\mathcal{E}^m$. Here $\mathcal{V}^m$ and $\mathcal{E}^m$ denote the set of nodes corresponding to different body components and the set of their structural relations respectively, and $n_m$ is the number of nodes in the $m$-th scale graph $\mathcal{G}^m$. We use $\mathbf{A}^m \in \mathbb{R}^{n_m \times n_m}$ to represent a graph's adjacency matrix, where each element $\mathbf{A}^m_{i,j}$ is defined as the normalized structural relation between adjacent nodes $v^m_i$ and $v^m_j$, and satisfies $\sum_{j \in \mathcal{N}_i} \mathbf{A}^m_{i,j} = 1$, where $\mathcal{N}_i$ denotes the indices of neighbor nodes of node $v^m_i$ in $\mathcal{G}^m$. During training of SM-SGE, $\mathbf{A}^m$ is adaptively learned to capture flexible structural relations.
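The two construction steps above (merging joint groups into coarser nodes, and row-normalizing an adjacency matrix so each node's structural relations sum to 1) can be sketched as follows. This is a minimal NumPy sketch: the toy 8-joint skeleton, the joint groups, and the star-shaped adjacency are illustrative assumptions, not the paper's actual body partition.

```python
import numpy as np

# Hypothetical joint groups for a toy 8-joint skeleton: each part-scale node
# is placed at the mean (center) position of the joints it merges.
PART_GROUPS = {
    "head_torso": [0, 1],
    "left_arm":   [2, 3],
    "right_arm":  [4, 5],
    "legs":       [6, 7],
}

def merge_to_parts(joints: np.ndarray, groups: dict) -> np.ndarray:
    """Merge joint-scale nodes (J x 3) into part-scale nodes at group centers."""
    return np.stack([joints[idx].mean(axis=0) for idx in groups.values()])

def normalize_adjacency(A: np.ndarray) -> np.ndarray:
    """Row-normalize so each node's structural relations sum to 1."""
    A = A + np.eye(len(A))           # include self-loops (a node is its own neighbor)
    return A / A.sum(axis=1, keepdims=True)

joints = np.arange(24, dtype=float).reshape(8, 3)   # toy skeleton, 8 joints x 3D coords
parts = merge_to_parts(joints, PART_GROUPS)          # 4 part-scale nodes
A = normalize_adjacency(np.array([[0, 1, 1, 1],
                                  [1, 0, 0, 0],
                                  [1, 0, 0, 0],
                                  [1, 0, 0, 0]], dtype=float))
```

In the framework itself the normalized relations are learned adaptively (Eq. 1-2) rather than fixed as here; this sketch only shows the graph-construction geometry.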
3.2. Multi-Scale Graph Relation Network
Different body parts typically possess internal relations at the physical or kinematic level, which could be exploited to mine rich body-structure features and motion patterns (aggarwal1998nonrigid). Motivated by this fact, we propose to learn relations of body components from two aspects: (1) Structural relations: Structurally-connected body components usually enjoy a higher motion correlation than distant pairs. Thus, to better represent each body-component node when encoding skeleton graph features, it is crucial to capture structural relations of neighbor nodes to aggregate the most correlative spatial features. (2) Collaborative relations: Human motion like walking is often performed by several action-related body components, which collaborate together in a relatively stable pattern (e.g., gait patterns) (murray1964walking; rao2020self). It is therefore beneficial to learn the inherent collaborative relations among different body components to mine more global pattern information from skeletons. To achieve the above goals, we propose the Multi-scale Graph Relation Network (MGRN) with structural and collaborative relation learning as below.
Structural Relation Learning. Given the $m$-th scale graph $\mathcal{G}^m$ of a skeleton, MGRN first computes the structural relation $e^m_{i,j}$ between adjacent nodes $v^m_i$ and $v^m_j$ in $\mathcal{G}^m$ as follows:

$$e^m_{i,j} = \mathrm{LeakyReLU}\left(\left(\mathbf{W}^m_r\right)^{\top}\left[\mathbf{W}^m v^m_i \,\|\, \mathbf{W}^m v^m_j\right]\right) \quad (1)$$

where $\mathbf{W}^m$ is the weight matrix that maps the $m$-th scale node $v^m_i \in \mathbb{R}^3$ into a higher level feature space, $\mathbf{W}^m_r$ denotes a learnable weight matrix for relation learning at the $m$-th scale, $[\cdot \,\|\, \cdot]$ indicates the feature concatenation of two nodes, and $\mathrm{LeakyReLU}(\cdot)$ is a non-linear activation function. Then, to learn flexible structural relations that focus on more correlative nodes, we normalize relations with a temperature-based $\mathrm{softmax}(\cdot)$ function (hinton2015distilling) as follows:

$$\mathbf{A}^m_{i,j} = \frac{\exp\left(e^m_{i,j}/\tau_1\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e^m_{i,k}/\tau_1\right)} \quad (2)$$

where $\tau_1$ denotes the temperature, which is normally set to 1 in the softmax function; a higher value of $\tau_1$ produces a softer relation distribution over nodes and retains more similar relation information. Here $\mathcal{N}_i$ denotes the neighbor nodes (including $v^m_i$ itself) of node $v^m_i$ in graph $\mathcal{G}^m$.
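The effect of the temperature on the normalized relation distribution can be sketched as below (a minimal NumPy sketch with toy relation scores; a low temperature sharpens the distribution onto the highest-scoring neighbor, while a high temperature softens it):

```python
import numpy as np

def temperature_softmax(scores: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Normalize raw relation scores over a node's neighbors; higher tau
    yields a softer (more uniform) relation distribution."""
    z = np.exp((scores - scores.max()) / tau)   # max-shift for numerical stability
    return z / z.sum()

scores = np.array([2.0, 1.0, 0.0])              # toy relation scores for 3 neighbors
sharp = temperature_softmax(scores, tau=0.1)    # nearly one-hot on the top neighbor
soft  = temperature_softmax(scores, tau=10.0)   # close to uniform
```

Subtracting the maximum before dividing by the temperature is algebraically equivalent to `softmax(scores / tau)` but avoids overflow for large scores.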
To aggregate the features of the most relevant nodes to represent node $v^m_i$, we exploit the normalized structural relations to yield the representation $\hat{v}^m_i$ for node $v^m_i$ by $\hat{v}^m_i = \sigma\big(\sum_{j \in \mathcal{N}_i} \mathbf{A}^m_{i,j}\, \mathbf{W}^m v^m_j\big)$, where $\sigma(\cdot)$ is a non-linear activation. To sufficiently capture potential structural relations (e.g., motion correlation, position similarity), MGRN concurrently and independently learns $K$ different structural relation matrices using the same computation (see Eq. 1, Eq. 2). In this way, MGRN can jointly capture structural relation information of nodes from different representation subspaces (velickovic2018graph) based on learnable structural relation matrices (see Fig. 4). We average the features learned by these matrices to represent each node:

$$\hat{v}^m_i = \sigma\left(\frac{1}{K}\sum_{k=1}^{K}\sum_{j \in \mathcal{N}_i} \mathbf{A}^{m,(k)}_{i,j}\, \mathbf{W}^{m,(k)} v^m_j\right) \quad (3)$$

where $\hat{v}^m_i$ denotes the node representation of $v^m_i$ learned by the $K$ structural relation matrices, $\mathbf{A}^{m,(k)}_{i,j}$ represents the structural relation between nodes $v^m_i$ and $v^m_j$ computed by the $k$-th structural relation matrix, and $\mathbf{W}^{m,(k)}$ denotes the corresponding weight matrix that performs node feature mapping. Here we use an average rather than a concatenation operation to reduce the feature dimension of nodes and allow for learning more structural relation matrices.
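The multi-matrix aggregation described above (map nodes into a feature space, normalize per-row relations, aggregate neighbor features, then average across relation matrices rather than concatenating) can be sketched roughly as follows. The relation scoring here is a simplified inner-product stand-in for the learned scoring of Eq. 1-2, and all shapes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d, K = 5, 3, 8, 4         # nodes, input dim, feature dim, relation matrices

V = rng.normal(size=(n, d_in))      # node positions of one scale
W = rng.normal(size=(K, d_in, d))   # one mapping matrix per relation matrix

def head_relations(feat: np.ndarray) -> np.ndarray:
    """Toy stand-in for Eq. 1-2: similarity scores, softmax-normalized per row."""
    scores = feat @ feat.T
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Average the K aggregated features instead of concatenating (keeps dim = d).
heads = []
for k in range(K):
    feat = V @ W[k]                 # map nodes into the d-dim feature space
    A_k = head_relations(feat)      # n x n normalized structural relations
    heads.append(A_k @ feat)        # aggregate neighbor features
node_repr = np.tanh(np.mean(heads, axis=0))   # n x d node representations
```

For simplicity every node attends over all others here; in the framework, aggregation is restricted to each node's graph neighbors.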
Collaborative Relation Learning. Motivated by the fact that unique walking patterns could be represented by the dynamic cooperation among body joints or between different body components (murray1964walking), we expect our model to capture more discriminative patterns globally by learning collaborative relations from two aspects: (1) single-scale collaborative relations among nodes of the same scale, and (2) cross-scale collaborative relations between a node and its spatially corresponding or motion-related higher level body component. To this end, MGRN computes the collaborative relation matrix $\mathbf{C}^{m,l} \in \mathbb{R}^{n_m \times n_l}$ between the $m$-th scale nodes and the $l$-th scale nodes as follows (shown in Fig. 4 and Fig. 3):

$$\mathbf{C}^{m,l}_{i,j} = \frac{\exp\left(\left(\hat{v}^m_i\right)^{\top}\hat{v}^l_j / \tau_2\right)}{\sum_{k=1}^{n_l}\exp\left(\left(\hat{v}^m_i\right)^{\top}\hat{v}^l_k / \tau_2\right)} \quad (4)$$

where $\mathbf{C}^{m,l}_{i,j}$ represents the single-scale (when $l = m$) or cross-scale (when $l \neq m$) collaborative relation between node $v^m_i$ in $\mathcal{G}^m$ and node $v^l_j$ in $\mathcal{G}^l$. Here MGRN computes the inner product of the node feature representations, which aggregate key spatial information through structural relation learning (see Eq. 3), to measure the degree of collaboration between two nodes. $\tau_2$ denotes the temperature that adjusts the softness of relation learning (illustrated in Eq. 2).
Multi-scale Collaboration Fusion. To adaptively focus on key correlative features in body-component collaboration at different spatial levels and enhance global pattern learning, we propose multi-scale collaboration fusion, which exploits collaborative relations to fuse node features across scales. Each node representation $\hat{v}^m_i$ in the $m$-th scale graph is updated by the feature fusion of collaborative nodes learned from different graphs as below (see Fig. 3):

$$\bar{v}^m_i = \hat{v}^m_i + \lambda \sum_{l}\sum_{j=1}^{n_l} \mathbf{C}^{m,l}_{i,j}\, \mathbf{W}^{l}_c\, \hat{v}^l_j \quad (5)$$

where $\mathbf{W}^{l}_c$ is a learnable weight matrix that integrates the collaborative features of $l$-th scale nodes into the $m$-th scale node representation, $n_l$ represents the number of nodes in the $l$-th scale graph, and $\lambda$ is the fusion coefficient used to fuse collaborative graph node features. We denote the fused graph features of the $m$-th scale for a skeleton sequence as $\mathbf{X}^m$. Note that the multi-scale collaboration fusion does NOT directly fuse graph features of all scales into one representation. Instead, the graph representations of each individual scale are retained (shown in Fig. 3) to encourage our model to capture skeleton dynamics and pattern information at different levels.
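A rough sketch of collaborative relation computation and cross-scale fusion as described above: inner-product affinities softmax-normalized over the target scale's nodes, followed by a residual-style weighted feature fusion. The node counts, the mapping matrix, and the fusion coefficient are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_m, n_l, tau, lam = 8, 10, 4, 1.0, 1.0   # feat dim, node counts, temperature, coeff

Vm = rng.normal(size=(n_m, d))   # e.g. joint-scale node representations
Vl = rng.normal(size=(n_l, d))   # e.g. part-scale node representations

def collaborative_relations(Vm: np.ndarray, Vl: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """Inner-product affinities, softmax-normalized over the target-scale nodes."""
    scores = (Vm @ Vl.T) / tau
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)     # n_m x n_l, each row sums to 1

C = collaborative_relations(Vm, Vl, tau)
Wc = rng.normal(size=(d, d))                    # stand-in for a learnable fusion mapping
fused = Vm + lam * (C @ Vl @ Wc)                # fuse cross-scale collaborative features
```

The residual form keeps the original node representation while adding a weighted summary of the collaborating nodes from the other scale.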
3.3. Multi-Scale Skeleton Reconstruction Mechanism
To enable SM-SGE to encode multi-scale graph dynamics of unlabeled skeletons, we propose a self-supervised Multi-scale Skeleton Reconstruction (MSR) mechanism that simultaneously captures skeleton graph dynamics and high-level semantics (e.g., skeleton order in the subsequence, cross-scale component correspondence) from different scales of graphs. Unlike plain reconstruction, which learns to reconstruct the whole sequence at a sole scale, the objective of MSR combines two concurrent pretext tasks as follows:
(1) Skeleton subsequence reconstruction task, which reconstructs multiple skeleton subsequences based on their graph representations. In particular, MSR aims to reconstruct the target multi-scale skeletons corresponding to the multi-scale graphs in subsequences, instead of reconstructing the original subsequences (for clarity, we use a single vector to represent all node positions in one scale's graph of a skeleton in the subsequence).
(2) Cross-scale skeleton inference task, which exploits fine skeleton graph representations to infer the 3D positions of coarser body components. For instance, we propose to use joint-scale graph representations, which may contain richer spatial information with denser nodes, to infer the nodes of body-scale skeletons. It can also be viewed as a cross-scale reconstruction task that reconstructs nodes of a different scale from the same skeleton graph.
To simultaneously achieve the above two pretext tasks, we first sample $n$-length subsequences by randomly discarding skeletons from the input sequence $S$. To exploit more potential samples for training, the random sampling process is repeated for several rounds, and each round covers all possible subsequence lengths. Second, given a sampled skeleton subsequence, MGRN encodes its corresponding skeleton graphs of each scale into fused graph features (see Eq. 1-5). Then, we leverage an LSTM to integrate the temporal dynamics of graphs at each scale into effective representations: the LSTM encodes each skeleton graph representation $\bar{\mathbf{x}}^m_t$ and the previous step's latent state $\mathbf{h}^m_{t-1}$ (if it exists), which provides the temporal context of the $m$-th scale graph representations, into the current latent state $\mathbf{h}^m_t$:

$$\mathbf{h}^m_t = \mathrm{LSTM}^m\left(\bar{\mathbf{x}}^m_t,\ \mathbf{h}^m_{t-1}\right) \quad (6)$$

where $t \in \{1, \dots, n\}$, $\mathrm{LSTM}^m$ denotes the LSTM encoder that aims to capture long-term dynamics of graph representations at the $m$-th scale, and $\mathbf{h}^m_1, \dots, \mathbf{h}^m_n$ are encoded graph states that contain crucial temporal encoding information of the $m$-th scale graph representations from time 1 to $n$. Last, we exploit the encoded graph states at the $m$-th scale to reconstruct the target skeleton at the $l$-th scale as follows:
$$\hat{\mathbf{x}}^l_t = f^{m \to l}\left(\mathbf{h}^m_t\right) \quad (7)$$

where $\hat{\mathbf{x}}^l_t$ is the reconstructed skeleton: when $l = m$, Eq. 7 is the plain skeleton reconstruction at the same scale, and $l \neq m$ indicates the cross-scale skeleton inference. $f^{m \to l}(\cdot)$ is a network function built with an MLP whose weights are NOT shared between different scales; i.e., we train individual MLPs for each skeleton reconstruction at the same or different scales (see Fig. 3).
As we expect to capture graph dynamics and pattern features of skeletons at various scales, we employ the MSR mechanism with the above reconstruction objective on all scales of graphs. Formally, we define the objective function for the self-supervision of MSR, which minimizes the $\ell_1$ loss between the ground-truth skeletons of the $n$-length subsequence and the reconstructed skeletons:

$$\mathcal{L}^{m,l} = \sum_{t=1}^{n} \left\| \mathbf{x}^l_t - \hat{\mathbf{x}}^l_t \right\|_1 \quad (8)$$

where $\hat{\mathbf{x}}^l_t$ is the reconstructed $l$-th scale skeleton based on the $m$-th scale encoded graph states (see Eq. 7), and $\|\cdot\|_1$ denotes the $\ell_1$ norm. The reason for using the $\ell_1$ loss is twofold: it gives sufficient gradients to positions with small losses to facilitate precise spatial reconstruction, and meanwhile alleviates gradient explosion with stable gradients for large losses (li2020dynamic). It should be noted that our implementation actually optimizes Eq. 8 on each individual graph scale, and the sum of the reconstruction loss over all sampled subsequences is computed. By learning to reconstruct skeletons of the same scale and infer cross-scale body-component nodes dynamically (e.g., using varying subsequences), MSR encourages our framework to integrate crucial skeleton dynamics and high-level semantics into the encoded graph states to achieve better person Re-ID performance (see Sec. 5).
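The two mechanical pieces described above, order-preserving random subsequence sampling and the absolute-error reconstruction loss, can be sketched as follows (a minimal NumPy sketch with toy shapes; not the exact training pipeline):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_subsequence(seq: np.ndarray, length: int) -> np.ndarray:
    """Randomly discard skeletons to get an ordered subsequence of the given length."""
    keep = np.sort(rng.choice(len(seq), size=length, replace=False))
    return seq[keep]

def l1_reconstruction_loss(target: np.ndarray, reconstructed: np.ndarray) -> float:
    """Sum of absolute coordinate errors: stable gradients for large errors,
    non-vanishing gradients for small ones."""
    return float(np.abs(target - reconstructed).sum())

seq = rng.normal(size=(6, 20, 3))       # 6 skeletons, 20 nodes, 3D coordinates
sub = sample_subsequence(seq, 4)        # keep 4 skeletons, original order preserved
loss = l1_reconstruction_loss(sub, sub + 0.1)   # a uniformly offset "reconstruction"
```

Sorting the kept indices is what preserves the temporal order that the subsequence reconstruction pretext task relies on.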
3.4. The Entire Framework
The computation flow of the entire framework during self-supervised learning can be summarized as: multi-scale graph construction (Sec. 3.1) → multi-scale graph relation learning (Sec. 3.2) → temporal graph encoding (Eq. 6) → multi-scale skeleton reconstruction (Eq. 7). The self-supervised loss (see Eq. 8) is employed to train the SM-SGE framework to learn an effective skeleton representation from multi-scale skeleton graphs. For the downstream task of person Re-ID, we extract the encoded graph states learned by the pre-trained framework, and exploit an MLP to predict the sequence label. Specifically, for each skeleton in an input sequence, we concatenate its corresponding encoded graph states of all four scales as the skeleton-level representation of the sequence. Then, we train the MLP with the frozen skeleton-level representations and their labels (note that the pre-trained framework is NOT tuned in this training stage). The ID predictions of the skeleton-level representations in a sequence are averaged to form the final sequence-level ID prediction. We employ the cross-entropy loss to train the MLP for person Re-ID.
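The downstream protocol above (a one-hidden-layer MLP classifier on top of frozen per-skeleton states, with per-skeleton predictions averaged into one sequence-level prediction) can be sketched as below; random weights stand in for the trained parameters, and all shapes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
T, D, C = 6, 16, 5                 # skeletons per sequence, feature dim, persons

H = rng.normal(size=(T, D))        # frozen per-skeleton graph states (NOT tuned)
W1 = rng.normal(size=(D, 32))      # hidden layer of the MLP classifier
W2 = rng.normal(size=(32, C))      # output layer of the MLP classifier

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# One-hidden-layer MLP applied to each frozen skeleton-level representation.
per_skeleton_pred = softmax(np.maximum(H @ W1, 0) @ W2)   # T x C class probabilities
sequence_pred = per_skeleton_pred.mean(axis=0)            # average -> sequence-level
label = int(sequence_pred.argmax())                       # predicted person ID
```

Only `W1` and `W2` would receive gradients during this stage; `H` stays fixed, matching the frozen-representation evaluation described in the text.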
Table 1: Rank-1 accuracy (%) and nAUC (%) comparison on the IAS-A and IAS-B testing sets.

| Methods | IAS-A Rank-1 | IAS-A nAUC | IAS-B Rank-1 | IAS-B nAUC |
|---|---|---|---|---|
| *Hand-Crafted and Supervised Methods* | | | | |
| Gait Energy Image (chunli2010behavior) | 25.6 | 72.1 | 15.9 | 66.0 |
| 3D CNN + Average Pooling (boureau2010theoretical) | 33.4 | 81.4 | 39.1 | 82.8 |
| Gait Energy Volume (sivapalan2011gait) | 20.4 | 66.2 | 13.7 | 64.8 |
| 3D LSTM (haque2016recurrent) | 31.0 | 77.6 | 33.8 | 78.0 |
| PCM + Skeleton (munaro20143d) | 27.3 | — | 81.8 | — |
| DVCov + SKL (wu2017robust) | 46.6 | — | 45.9 | — |
| ED + SKL (wu2017robust) | 52.3 | — | 63.3 | — |
| Descriptors + KNN (munaro2014one) | 33.8 | 63.6 | 40.5 | 71.1 |
| Single-layer LSTM (haque2016recurrent) | 20.0 | 65.9 | 19.1 | 68.4 |
| Multi-layer LSTM (zheng2019relational) | 34.4 | 72.1 | 30.9 | 71.9 |
| Descriptors + Adaboost (pala2019enhanced) | 27.4 | 65.5 | 39.2 | 78.2 |
| *Self-Supervised Methods* | | | | |
| Attention Gait Encodings (rao2020self) | 56.1 | 81.7 | 58.2 | 85.3 |
4. Experiments

4.1. Experimental Settings
Datasets: We evaluate our framework on three public person Re-ID datasets that contain skeleton data (IAS-Lab (munaro2014feature), KS20 (nambiar2017context), KGBD (andersson2015person)) and a large RGB video based multi-view dataset CASIA B (yu2006framework), which contain 11, 20, 164, and 124 different individuals respectively. We follow the frequently used evaluation setup in the literature (rao2020self; haque2016recurrent): For IAS-Lab, we use the full training set and two testing splits, IAS-A and IAS-B; For KGBD, since no training and testing splits are given, we randomly leave one skeleton video of each person for testing and use the remaining videos for training; For KS20, we randomly select one sequence from each viewpoint for testing and use the rest of skeleton sequences for training.
To evaluate the effectiveness of SM-SGE when 3D skeleton data are directly estimated from RGB videos rather than Kinect, we introduce the large-scale RGB video based dataset CASIA B (yu2006framework), and exploit pre-trained pose estimation models (chen20173d; cao2019openpose) to extract 3D skeletons from RGB videos (detailed in the Appendix). We evaluate our approach on each view (0°, 18°, 36°, 54°, 72°, 90°, 108°, 126°, 144°, 162°, 180°) of CASIA B and use the adjacent views for training.
Table 2: Rank-1 accuracy (%) and nAUC (%) comparison on the KGBD and KS20 testing sets.

| Methods | KGBD Rank-1 | KGBD nAUC | KS20 Rank-1 | KS20 nAUC |
|---|---|---|---|---|
| *Hand-Crafted and Supervised Methods* | | | | |
| Descriptors + KNN (munaro2014one) | 46.9 | 90.0 | 58.3 | 78.0 |
| Single-layer LSTM (haque2016recurrent) | 39.8 | 87.2 | 80.9 | 92.3 |
| Multi-layer LSTM (zheng2019relational) | 46.2 | 89.8 | 81.6 | 94.2 |
| Descriptors + Adaboost (pala2019enhanced) | 69.9 | 90.6 | 59.8 | 78.8 |
| *Self-Supervised Methods* | | | | |
| Attention Gait Encodings (rao2020self) | 87.7 | 96.3 | 86.5 | 94.7 |
Implementation Details: The numbers of nodes in the hyper-joint-scale and joint-scale graphs depend on each dataset's joint set (KS20, CASIA B, and IAS-Lab/KGBD differ), while the numbers of nodes in the part-scale and body-scale graphs are the same for all datasets. On the IAS-Lab, KS20 and KGBD datasets, the sequence length is empirically set to the value that achieves the best overall performance among different settings; for the largest dataset CASIA B, with roughly estimated skeleton data from RGB frames, we set a fixed sequence length for training/testing. The temperatures for relation learning, the collaboration fusion coefficient, and the number of sampling rounds are empirically set to 1. An MLP with one hidden layer is employed in SM-SGE. For the MSR mechanism, we use a 2-layer LSTM. We adopt the Adam optimizer with dataset-specific learning rates on IAS-Lab/KGBD and on KS20/CASIA B to train the framework.
Evaluation Metrics: Person Re-ID typically adopts a “multi-shot” manner that leverages predictions of multiple frames or a sequence-level representation to predict a sequence label. In this work, we compute both Rank-1 accuracy and nAUC (area under the cumulative matching curve (CMC) normalized by the number of ranks (gray2008viewpoint)) to quantify multi-shot person Re-ID performance.
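One common convention for these metrics (assumed here; the text does not spell out the exact computation) derives the CMC curve from the rank at which each query's true match appears, takes Rank-1 as the first CMC value, and normalizes the area under the CMC curve by the number of ranks:

```python
import numpy as np

def cmc_curve(ranks_of_true_match, num_ranks: int) -> np.ndarray:
    """CMC: fraction of queries whose true match appears within each rank r."""
    ranks = np.asarray(ranks_of_true_match)            # 1-based rank of the true match
    return np.array([(ranks <= r).mean() for r in range(1, num_ranks + 1)])

def nauc(cmc: np.ndarray) -> float:
    """Area under the CMC curve normalized by the number of ranks, in percent."""
    return 100.0 * cmc.mean()

# Toy gallery of 5 identities; 4 queries whose true matches rank 1st, 2nd, 1st, 5th.
cmc = cmc_curve([1, 2, 1, 5], num_ranks=5)
rank1 = 100.0 * cmc[0]
```

The CMC curve is non-decreasing by construction, so nAUC rewards methods that retrieve the correct person at consistently high ranks, not just at rank 1.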
4.2. Comparison with State-of-the-Art Methods
In this section, we compare our approach with existing hand-crafted, supervised and self-supervised skeleton-based Re-ID methods on IAS-Lab (see Table 1), KS20 and KGBD (see Table 2). We also include classic depth-based methods and representative multi-modal methods as a reference. The comparison results are reported below:
Comparison with Hand-crafted and Supervised Skeleton-based Methods: As shown in Table 1 and Table 2, the proposed SM-SGE framework enjoys evident advantages over existing skeleton-based methods: First, our approach significantly outperforms the two representative hand-crafted methods that extract anthropometric attributes of skeletons (i.e., the descriptor-based methods (munaro2014one; pala2019enhanced)) in both Rank-1 accuracy and nAUC on different datasets. Second, compared with state-of-the-art CNN-based (PoseGait (liao2020model)) and LSTM-based models (haque2016recurrent; zheng2019relational), our self-supervised framework achieves superior performance by a large margin in Rank-1 accuracy and nAUC on all datasets. Besides, these supervised methods typically require massive labels and even extra hand-crafted features (e.g., PoseGait (liao2020model) relies on 81 hand-crafted pose and motion features) for representation learning, while our framework is able to automatically model spatial and temporal features of unlabeled skeleton graphs at various scales to learn a more effective skeleton representation for the person Re-ID task.
Comparison with Self-supervised Skeleton-based Methods: Our approach achieves a significant improvement in both Rank-1 accuracy and nAUC over existing state-of-the-art self-supervised methods on three out of four testing sets (IAS-B, KGBD, KS20). On IAS-A, although SGELA and our framework obtain close Rank-1 accuracy, our approach gains a markedly higher nAUC than SGELA, which suggests that our approach achieves better overall Re-ID performance when retrieving persons from high to low ranking. Notably, the proposed SM-SGE outperforms existing self-supervised methods by a clear Rank-1 margin on the largest skeleton-based dataset KGBD, which demonstrates the great potential of our approach for large-scale person Re-ID.
Comparison with Depth-based and Multi-modal Methods: As reported in Table 1, our skeleton-based framework consistently performs better than classic depth-based methods (GEI (chunli2010behavior), GEV (sivapalan2011gait), 3D CNN (boureau2010theoretical), 3D LSTM (haque2016recurrent)) in both Rank-1 accuracy and nAUC on IAS-A and IAS-B. Compared with representative multi-modal methods, our approach is still the best performer in most cases. Interestingly, although the “PCM + Skeleton” method (munaro20143d), which uses both skeletons and 3D point cloud matching, attains the best Rank-1 accuracy on IAS-B, it is inferior to SM-SGE on IAS-A, which demonstrates that our approach is more effective under settings with frequent shape and appearance changes (IAS-A). Considering that the proposed SM-SGE only requires 3D skeleton data as input and achieves more satisfactory performance on each dataset, it can be a promising solution for person Re-ID and other potential skeleton-related tasks.
5. Further Analysis
In this section, we first evaluate the performance of SM-SGE on skeleton data estimated from RGB videos in CASIA B. Then, we conduct ablation study to demonstrate the effectiveness of each component, and evaluate effects of different parameters on SM-SGE. Last, we visualize and analyze the learned collaborative relations.
Evaluation with Model-estimated Skeletons. We exploit pre-trained pose estimation models (cao2019openpose; chen20173d) to extract 3D skeletons from the RGB videos of CASIA B, and evaluate the performance of SM-SGE on the estimated skeleton data. We compare our framework with the state-of-the-art supervised method PoseGait (liao2020model) under the same evaluation setup. As shown in Table 3, our approach significantly outperforms PoseGait in Rank-1 accuracy on all views of CASIA B. Notably, SM-SGE obtains more stable performance than PoseGait across 7 consecutive views, which shows the robustness of our framework to viewpoint variation. On the two most challenging views (0° and 180°), our approach still performs better than PoseGait by a clear Rank-1 margin. These results demonstrate the effectiveness of our framework on skeleton data estimated from RGB videos, and also show the great potential of our approach to be applied to large RGB-based datasets under general settings (e.g., varying views).
Table 3: Rank-1 accuracy (%) of PoseGait on different views of CASIA B.

| Methods | 0° | 18° | 36° | 54° | 72° | 90° | 108° | 126° | 144° | 162° | 180° |
|---|---|---|---|---|---|---|---|---|---|---|---|
| PoseGait (liao2020model) | 10.7 | 37.4 | 52.5 | 28.3 | 24.3 | 18.9 | 23.5 | 17.2 | 23.6 | 18.8 | 4.3 |
Ablation Study. We evaluate the contribution of each component in our framework (here IAS-Lab is taken as an example) and report the results in Table 4. We use an LSTM with plain skeleton reconstruction as the baseline (see the first row in Table 4). We can draw the following conclusions: (1) Introducing multi-scale skeleton graphs (MG) consistently improves person Re-ID performance in Rank-1 accuracy, which justifies our claim that modeling skeletons as multi-scale graphs facilitates learning richer body and motion features for person Re-ID. (2) Exploiting structural relations (SR) between body components produces a significant Rank-1 performance gain compared with directly modeling body-joint trajectories (baseline), while combining collaborative relation (CR) learning further boosts Re-ID performance. These results demonstrate the effectiveness of multi-scale graph relation learning (MGRN) in capturing more discriminative body structural features and motion patterns for person Re-ID. (3) The proposed MSR mechanism, based on the cross-scale skeleton inference and skeleton subsequence reconstruction pretext tasks, evidently improves model performance under either relation learning scheme (SR or CR), which verifies our intuition that mining high-level semantics such as cross-scale body-component correspondence encourages learning a more effective skeleton representation for the person Re-ID task. Similar results are observed on the other datasets.
Effects of Multiple Graph Scales. As shown in Table 5, combining graph scales from coarse (body-scale) to fine (hyper-joint-scale) progressively improves Rank-1 accuracy on both IAS-A and IAS-B. Compared with plain reconstruction of body-scale graphs (note that plain reconstruction without collaborative relation learning is employed when only body-scale graphs are used, as shown in the first row), employing multi-scale skeleton reconstruction (MSR) based on two adjacent graph scales (body-scale and part-scale) obtains a significant Rank-1 accuracy gain. These results further demonstrate the effectiveness of multi-scale graphs and MSR, which capture more discriminative skeleton features at various levels for person Re-ID.
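As an illustration of the coarse-to-fine scales discussed above, the sketch below pools joint coordinates into part-scale and body-scale nodes by averaging the joints within each group; the toy 8-joint skeleton and its partitions are hypothetical and do not reflect the actual body partitions used in our framework.

```python
import numpy as np

def pool_scale(joints, groups):
    """Average joint coordinates within each group to form coarser graph nodes.

    joints: (J, 3) array of 3D joint positions.
    groups: list of joint-index lists, one per coarse node.
    Returns (len(groups), 3) coarse-node positions.
    """
    return np.stack([joints[g].mean(axis=0) for g in groups])

# Hypothetical 8-joint skeleton with deterministic coordinates.
joints = np.arange(24, dtype=float).reshape(8, 3)   # joint-scale: 8 nodes
part_groups = [[0, 1], [2, 3], [4, 5], [6, 7]]      # part-scale: 4 nodes
body_groups = [[0, 1, 2, 3], [4, 5, 6, 7]]          # body-scale: 2 nodes

parts = pool_scale(joints, part_groups)
body = pool_scale(joints, body_groups)
```

Stacking such pooled graphs yields the coarse-to-fine hierarchy on which multi-scale relation learning and MSR operate.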
Model Sensitivity Analysis. We evaluate the effects of different hyper-parameters on SM-SGE: (1) We observe that SM-SGE is not sensitive to temperature changes from 0.1 to 1.0 (see Fig. 5 (a), (b)). Since lower temperatures tend to ignore more similar information (hinton2015distilling) and could degrade relation learning, we set the temperature to 1.0 for all relation learning in our framework. (2) As shown in Fig. 5 (c), introducing more learnable structural relation matrices can improve model performance on both IAS-A and IAS-B. However, too many relation matrices may cause the model to learn redundant relation information, which leads to a slight performance degradation. (3) The fusion coefficient controls the degree of fusion with collaborative node features. We find that fusing graph node features with a larger coefficient achieves better Re-ID performance (see Fig. 5 (d)), which verifies the necessity of sufficient multi-scale collaborative fusion for learning a more effective skeleton representation. (4) As shown in Fig. 5 (e), (f), SM-SGE obtains the highest Re-ID performance with moderate subsequence settings in most cases. Although larger values provide more training subsequences, they also increase computational complexity (e.g., more memory and time for training), so we choose settings that achieve a better trade-off between performance and complexity.
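The temperature behavior noted in (1) can be illustrated with a minimal temperature-scaled softmax: lowering the temperature sharpens the distribution, so smaller (but still similar) relation scores are increasingly ignored. The scores below are arbitrary toy values, not learned relation values from our model.

```python
import numpy as np

def softmax(x, temperature=1.0):
    """Temperature-scaled softmax; lower temperature sharpens the output."""
    z = x / temperature
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
p_soft = softmax(scores, temperature=1.0)   # relatively smooth weights
p_sharp = softmax(scores, temperature=0.1)  # nearly one-hot: small scores ignored
```

With temperature 0.1 almost all mass concentrates on the largest score, which is why overly low temperatures can discard useful relation information.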
Analysis of Cross-scale Collaborative Relations. We visualize the positions of different body components and their collaborative relations across adjacent scales (note that we draw only significant relations, i.e., those whose values exceed a fixed fraction of the maximum), and we obtain the following observations: (1) As shown in Fig. 6(a) and 6(b), spatially corresponding or nearby body components (e.g., subcomponents of limbs on the same side) possess evident relations across different scales, which demonstrates that SM-SGE can capture the high-level semantics of body-component correspondence between different graphs. (2) For non-adjacent body components (e.g., arms, legs) with a joint movement trend, the framework also learns higher correlations among their corresponding nodes in different graphs (see Fig. 6(c), 6(d)), which justifies our claim that the SM-SGE framework is able to adaptively infer global body-component cooperation in skeletal motion. More results and proofs are provided in the Appendix.
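The thresholded visualization described above can be sketched as follows: only cross-scale relation entries above a fraction of the matrix maximum are kept as drawn edges. The relation matrix and the 50%-of-maximum cutoff below are illustrative assumptions, not values taken from our trained model.

```python
import numpy as np

def significant_relations(R, ratio=0.5):
    """Keep only relation entries above ratio * max(R).

    R: (n_fine, n_coarse) cross-scale relation matrix.
    Returns a list of (fine_idx, coarse_idx) pairs to draw as edges.
    """
    thresh = ratio * R.max()
    return list(zip(*np.where(R > thresh)))

# Toy relation matrix: 3 fine-scale nodes x 2 coarse-scale nodes.
R = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.5, 0.3]])
edges = significant_relations(R, ratio=0.5)  # entries > 0.45: (0,0), (1,1), (2,0)
```

Drawing only these surviving pairs highlights the strongest cross-scale body-component correspondences while suppressing weak, noisy relations.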
In this paper, we model 3D skeletons as multi-scale graphs, and propose a self-supervised multi-scale skeleton graph encoding (SM-SGE) framework to learn an effective representation from unlabeled skeleton graphs for person Re-ID. To capture key correlative features of graph nodes, we propose the multi-scale graph relation network (MGRN) to learn structural and collaborative relations among body-component nodes in different graphs. A novel multi-scale skeleton reconstruction (MSR) mechanism with subsequence reconstruction and cross-scale skeleton inference tasks is devised to encode graph dynamics and discriminative high-level features of skeleton graphs for person Re-ID. SM-SGE outperforms most state-of-the-art skeleton-based methods, and it can achieve satisfactory performance on 3D skeleton data estimated from RGB videos.