Self-Supervised Gait Encoding with Locality-Aware Attention for Person Re-Identification

08/21/2020 ∙ by Haocong Rao, et al. ∙ South China University of Technology ∙ NetEase, Inc.

Gait-based person re-identification (Re-ID) is valuable for safety-critical applications, and using only 3D skeleton data to extract discriminative gait features for person Re-ID is an emerging open topic. Existing methods either adopt hand-crafted features or learn gait features by traditional supervised learning paradigms. Unlike previous methods, we for the first time propose a generic gait encoding approach that can utilize unlabeled skeleton data to learn gait representations in a self-supervised manner. Specifically, we first introduce self-supervision by learning to reconstruct input skeleton sequences in reverse order, which facilitates learning richer high-level semantics and better gait representations. Second, inspired by the fact that motion's continuity endows temporally adjacent skeletons with higher correlations ("locality"), we propose a locality-aware attention mechanism that encourages learning larger attention weights for temporally adjacent skeletons when reconstructing the current skeleton, so as to learn locality when encoding gait. Finally, we propose Attention-based Gait Encodings (AGEs), which are built from the context vectors learned by locality-aware attention, as final gait representations. AGEs are directly utilized to realize effective person Re-ID. Our approach typically improves existing skeleton-based methods by 10-20% Rank-1 accuracy, and it achieves comparable or even superior performance to multi-modal methods with extra RGB or depth information. Our code is publicly available.







1 Introduction

Figure 1: Gait-based person Re-ID using 3D skeleton data.

Fast, effective and reliable person re-identification (Re-ID) is vital for various applications such as security authentication, human tracking, and role-based activity understanding [22]. For person Re-ID, gait is one of the most useful human body characteristics, and it has drawn increasing attention from researchers since it can be extracted by unobtrusive methods without cooperative subjects [5]. Existing physiological and psychological studies [18, 6] suggest that different human individuals have different gait characteristics, which contain numerous unique and relatively stable patterns (e.g., stride length and angles of body joints). In the field of gait analysis, gait can be described by two types of methods:

(a) Appearance-based methods [26], which exploit human silhouettes from aligned image sequences to depict gait. However, this type of method is vulnerable to body shape and appearance changes. (b) Model-based methods [13], which model gait by the body structure and the motion of human body joints. Unlike appearance-based methods, model-based methods are scale and view invariant [19], thus leading to better robustness in practice. Among different models, the 3D skeleton model, which describes a human by the 3D coordinates of key body joints, is a highly efficient representation of human body structure and motion [8]. Compared with RGB or depth data, 3D skeletons enjoy many merits such as better robustness and much smaller data size, and they can be easily collected by modern devices like Kinect. Therefore, exploiting 3D skeleton data to perform gait analysis for downstream tasks like person Re-ID (illustrated in Fig. 1) has gained surging popularity [13]. Nevertheless, how to extract or learn discriminative gait features from 3D skeleton sequences remains an open problem.

To this end, most existing works like [3, 2] rely on hand-crafted skeleton descriptors. However, it is usually complicated and tedious to design such hand-crafted descriptors; e.g., [2] defines 80 skeleton descriptors from the views of anthropometric and gait attributes for person Re-ID. Besides, these methods heavily rely on domain knowledge such as human anatomy [25], thus lacking the ability to mine latent gait features beyond human cognition. To alleviate this problem, a few recent works like [9] resort to deep neural networks to learn gait representations automatically. However, they all follow the classic supervised learning paradigm and require labeled data, so they cannot perform gait representation learning with unlabeled skeleton data.

In this paper, we for the first time propose a self-supervised approach with a locality-aware attention mechanism, which only requires unlabeled 3D skeleton data for gait encoding. First, by introducing certain self-supervision as a high-level learning goal, we not only enable our model to learn gait representations from unlabeled skeleton data, but also encourage the model to capture richer high-level semantics (e.g., sequence order, body part motion) and more discriminative gait features. Specifically, we propose a self-supervised learning objective that aims to reconstruct the input skeleton sequence in reverse order, which is implemented by an encoder-decoder architecture. Second, since the continuity of motion leads to very small pose/skeleton changes within a small time interval [1], skeleton sequences possess the property of locality: for each skeleton frame in a sequence, its temporally adjacent skeleton frames within a local context enjoy higher correlations to it. Thus, to enable better skeleton reconstruction and gait representation learning, we propose a locality-aware attention mechanism to incorporate such locality into the gait encoding process. Last, we leverage the context vectors learned by the proposed locality-aware attention mechanism to construct Attention-based Gait Encodings (AGEs) as the final gait representations. We demonstrate that AGEs, which are learned without skeleton sequence labels, can be directly applied to person Re-ID and achieve highly competitive performance.

In summary, we make the following contributions:

  • We propose a self-supervised approach based on reverse sequential skeleton reconstruction and an encoder-decoder model, which enables us to learn discriminative gait representations without skeleton sequence labels.

  • We propose a locality-aware attention mechanism to exploit the locality nature in skeleton sequences for better skeleton reconstruction and gait encoding.

  • We propose AGEs as novel gait representations, which are shown to be highly effective for person Re-ID.

Experiments demonstrate the effectiveness of our approach in gait encoding and person Re-ID. It is also shown that our model is readily transferable among datasets, which validates its ability to learn transferable high-level features of skeletons.

2 Related Work

Person Re-ID using Skeleton Data.  For person Re-ID, most existing works extract hand-crafted features to depict certain geometric, morphological or anthropometric attributes of 3D skeleton data. [3] computes 7 Euclidean distances between the floor plane and joints or joint pairs to construct a distance matrix, which is learned by a quasi-exhaustive strategy to perform person Re-ID. [16] further extends them to 13 skeleton descriptors and leverages support vector machines (SVM) and k-nearest neighbors (KNN) for classification. In [20], 16 Euclidean distances between body joints are fed to Adaboost to realize Re-ID. Since skeleton-based features alone can hardly achieve satisfactory Re-ID performance, features from other modalities (e.g., 3D point clouds [15], a 3D face descriptor [20]) are also used to enhance performance. Meanwhile, a few recent works exploit supervised deep learning models to learn gait representations from skeleton data: one line of work utilizes long short-term memory (LSTM) [11] to model the temporal dynamics of body joints for person Re-ID, and [13] proposes PoseGait, which feeds 81 hand-crafted pose features of 3D skeleton data into convolutional neural networks (CNN) for human recognition.

Depth-based and Multi-modal Person Re-ID Methods.

Depth-based methods typically exploit human shapes or silhouettes from depth images to extract gait features for person Re-ID. [21] extends the Gait Energy Image (GEI) [4] to the 3D domain and proposes the Gait Energy Volume (GEV) algorithm based on depth images for gait-based human recognition. 3D point clouds based on depth data are also widely used to estimate body shape and motion trajectories: [16] proposes point cloud matching (PCM) to compute the distances between multi-view point cloud sets so as to discriminate different persons, and [9] adopts a 3D LSTM to model the motion dynamics of 3D point clouds for person Re-ID. As to multi-modal methods, they usually combine skeleton-based features with extra RGB or depth information (e.g., depth shape features based on point clouds [15, 10, 24]) to boost Re-ID performance. In [12], a CNN-LSTM with reinforced temporal attention (RTA) is proposed for person Re-ID based on a split-rate RGB-depth transfer approach.

3 The Proposed Approach

Figure 2: Flow diagram of our model: (1) The Gait Encoder (yellow) encodes each skeleton frame S_t into an encoded gait state h_t. (2) The locality-aware attention mechanism (green) first computes the basic attention alignment scores a_t, which measure the content-based correlations between each encoded gait state h_i and the decoded gait state \bar{h}_t from the Gait Decoder (purple). Then, the locality mask provides an objective (the masked scores \hat{a}_t) that guides our model to learn locality-aware alignment scores via the locality-aware attention alignment loss L_{la}. Next, the encoded gait states are weighted by the alignment scores to compute the context vector c_t; c_t and \bar{h}_t are fed into the concatenation layer to produce an attentional state vector \tilde{h}_t. Finally, \tilde{h}_t is fed into the fully connected layer to predict the skeleton \bar{S}_t and passed to the Gait Decoder for the next decoding step. (3) The context vectors are used to build Attention-based Gait Encodings (AGEs), which are fed into a recognition network for person Re-ID (blue).

Suppose that an input skeleton sequence S = (S_1, ..., S_f) contains f consecutive skeleton frames, where each frame S_t contains the 3D coordinates of J body joints. The training set \Phi contains N skeleton sequences collected from different persons. Each skeleton sequence corresponds to a label y, where y \in {1, ..., C} and C is the number of persons. Our goal is to learn discriminative gait features from \Phi without using any label. Then, the effectiveness of the learned features is validated by using them to perform person Re-ID: the learned features and labels are used to train a simple recognition network (note that the learned features are fixed and NOT tuned by training at this stage). The overview of the proposed approach is given in Fig. 2, and we present the details of each technical component below.

3.1 Self-Supervised Skeleton Reconstruction

To learn gait representations without labels, we propose to introduce self-supervision by learning to reconstruct input skeleton sequences in reverse order, i.e., by taking S = (S_1, ..., S_f) as input, we expect our model to output the reversed sequence \hat{S} = (S_f, ..., S_1). Compared with naïve reconstruction that learns to reconstruct the exact inputs (S -> S), the proposed learning objective (S -> \hat{S}) is endowed with high-level information (the skeleton order in a sequence) that is meaningful to human perception, and it requires the model to capture richer high-level semantics (e.g., body parts, motion patterns) to achieve this learning objective. In this way, our model is expected to learn better gait representations than with the frequently-used plain reconstruction. Formally, given an input skeleton sequence, we use the encoder to encode each skeleton frame S_t and the previous step's latent state h_{t-1} (if it exists), which provides context information, into the current latent state h_t:

h_t = \phi_E(S_t, h_{t-1}),

where t \in {1, ..., f} and \phi_E denotes our Gait Encoder (GE). GE is built with an LSTM, which aims to capture the long-term temporal dynamics of skeleton sequences. h_1, ..., h_f are encoded gait states that contain preliminary gait encoding information. In the training phase, the encoded gait states are decoded by a Gait Decoder (GD) to reconstruct the target sequence \hat{S}, and the decoding process is performed below (see Fig. 2):


\bar{h}_t = \phi_D(\hat{S}_{t-1}, \bar{h}_{t-1}, \tilde{h}_{t-1}),

where \phi_D denotes the GD. GD consists of an LSTM and a fully connected (FC) layer that outputs predicted skeletons. \bar{h}_t refers to the decoded gait state, i.e., the latent state output by GD's LSTM to predict the skeleton \bar{S}_t. When the decoding is initialized (t = 1), we feed an all-0 skeleton placeholder and the final encoded gait state h_f into GD to decode the first skeleton. Afterwards, to predict the skeleton \bar{S}_t, \phi_D takes three inputs from the (t-1)-th decoding step: the decoded gait state \bar{h}_{t-1}, the ground-truth skeleton \hat{S}_{t-1} (i.e., S_{f-t+2}), which enables better convergence, and the attentional state vector \tilde{h}_{t-1}, which fuses encoding and decoding information based on the proposed attention mechanism and will be elaborated in Sec. 3.2. In this way, we define the objective function for skeleton reconstruction, which minimizes the mean square error (MSE) between the target skeleton sequence \hat{S} and the predicted skeleton sequence \bar{S}:

L_r = (1/f) \sum_{t=1}^{f} \sum_{j=1}^{J} || \bar{S}_t^j - \hat{S}_t^j ||^2,

where \bar{S}_t^j and \hat{S}_t^j denote the j-th joint position of the predicted and target skeleton, respectively. In the testing phase, to test the reconstruction ability of our model, we use the predicted skeleton \bar{S}_{t-1} rather than the target skeleton \hat{S}_{t-1} as the input to \phi_D in the t > 1 case. To facilitate training, our implementation actually optimizes L_r on each individual dimension of the skeleton's 3D coordinates: L_r = \sum_k L_r^k, where k \in {x, y, z} corresponds to a dimension of 3D space.
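As a concrete illustration, the reverse-reconstruction target and the MSE objective can be sketched in NumPy as follows (a minimal sketch; the helper names `reverse_target` and `reconstruction_loss` are our own, not from the paper's released code):

```python
import numpy as np

def reverse_target(seq):
    # Self-supervised target: the input skeleton sequence in
    # reverse temporal order (frame f, ..., frame 1).
    return seq[::-1].copy()

def reconstruction_loss(pred, target):
    # Mean squared error between predicted and target sequences,
    # summed over 3D coordinates and averaged over frames/joints.
    return float(np.mean(np.sum((pred - target) ** 2, axis=-1)))

# Toy sequence: f = 6 frames, J = 20 joints, 3D coordinates each.
rng = np.random.default_rng(0)
seq = rng.normal(size=(6, 20, 3))
target = reverse_target(seq)   # target[0] is the last input frame
```

A perfect reverse reconstruction drives the loss to zero, while echoing the input unchanged does not, which is what makes the reversed target a non-trivial learning signal.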

3.2 Locality-Aware Attention Mechanism

As learning gait features requires capturing motion patterns from skeleton sequences, it is natural to consider a straightforward property of motion: continuity. Motion's continuity ensures that skeletons within a small temporal interval do NOT undergo drastic changes, thus resulting in higher inter-skeleton correlations within this local temporal interval, which is referred to as "locality". Due to such locality, when reconstructing one skeleton in a sequence, we expect our model to pay more attention to its neighboring skeletons in the local context. To this end, we propose a novel locality-aware attention mechanism, with its modules presented below:

Figure 3: (a) Visualization of the BAS (top) and LAS (bottom) attention matrices, which represent average attention alignment scores. The abscissa and ordinate denote the indices of input skeletons and predicted skeletons, respectively. (b) Reconstruction loss curves when using no attention, BAS, MBAS or LAS for reconstruction.

Basic Attention Alignment. We first introduce Basic Attention (BA) alignment [14] to measure the content-based (i.e., latent-state-based) correlations between the input sequence and the predicted sequence. As shown in Fig. 2, at the t-th decoding step, we compute the BA Alignment Scores (BAS) a_t between the decoded gait state \bar{h}_t and each encoded gait state h_i (i \in {1, ..., f}):

a_t(i) = exp(\bar{h}_t^T h_i) / \sum_{j=1}^{f} exp(\bar{h}_t^T h_j)

BAS aims to focus on the more correlative skeletons in the encoding stage and provides preliminary attention weights for skeleton decoding. However, BA alignment only considers content-based correlations and does not explicitly take locality into consideration, which motivates us to propose the locality mask and locality-aware attention alignment below.

Locality Mask. Our motivation is to incorporate locality into the gait encoding process for better skeleton reconstruction. As the goal of the t-th decoding step is to reconstruct the skeleton S_{f-t+1}, we assume those skeletons in the local temporal context of the position p_t = f - t + 1 to be highly correlated to it (note that we use reverse reconstruction). To describe the local context centered at p_t, we define an attentional window [p_t - D, p_t + D], where D is a selected integer that controls the attentional range. Since locality favors temporal positions near p_t (closer positions are more correlative), a direct solution is to place a Gaussian distribution centered at p_t as a locality mask:

G_t(i) = exp( -(i - p_t)^2 / (2\sigma^2) ),

where we empirically set \sigma = D/2, and i is a position within the window centered at p_t. We can weight BAS by this locality mask to compute the Masked BA Alignment Scores (MBAS) below, which directly forces the alignment scores to obtain locality:

\hat{a}_t(i) = a_t(i) · G_t(i)
Locality-Aware Attention Alignment. Although the locality mask is a straightforward way to yield locality, it is a very coarse solution that brutally changes the alignment scores. Therefore, instead of using MBAS (\hat{a}_t) directly, we propose Locality-aware Attention (LA) alignment. Specifically, an LA alignment loss term is used to encourage the learned alignment scores to acquire a locality similar to that of the MBAS:

L_{la} = (1/f) \sum_{t=1}^{f} \sum_{i=1}^{f} ( a_t(i) - \hat{a}_t(i) )^2

By adding the loss term L_{la}, we obtain the LA Alignment Scores (LAS). Note that the scores being learned are still computed as BAS; the mask only shapes them through the loss. For clarity, we use \tilde{a}_t to denote the learned LAS here.
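The three score types can be sketched in NumPy as follows (a sketch under our own notation; the dot-product alignment form and the function names are assumptions for illustration, not the authors' released code):

```python
import numpy as np

def bas(dec_state, enc_states):
    # Basic Attention alignment Scores: softmax over dot-product
    # correlations between one decoded state and all encoded states.
    scores = enc_states @ dec_state
    e = np.exp(scores - scores.max())
    return e / e.sum()

def locality_mask(f, p, D):
    # Gaussian locality mask centered at target position p, sigma = D/2.
    i = np.arange(f)
    sigma = D / 2.0
    return np.exp(-((i - p) ** 2) / (2 * sigma ** 2))

def mbas(a, p, D):
    # Masked BAS: weight the scores by the locality mask, renormalize.
    m = a * locality_mask(len(a), p, D)
    return m / m.sum()

def la_loss(a_steps, a_masked_steps):
    # LA alignment loss: mean squared gap between learned scores and
    # their locality-masked counterparts, over all decoding steps.
    return float(np.mean((np.asarray(a_steps) - np.asarray(a_masked_steps)) ** 2))
```

For example, with uniform scores over f = 6 frames and a window centered at position 2, `mbas` peaks exactly at position 2; the LA loss pushes the learned scores toward that shape instead of applying the mask at inference time.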

With the guidance of L_{la}, our model learns to allocate more attention to the local context by itself, rather than through a hard locality mask. To utilize the alignment scores to yield an attention-weighted encoded gait state at the t-th decoding step, we calculate the context vector c_t as a weighted sum of the encoded gait states:

c_t = \sum_{i=1}^{f} \tilde{a}_t(i) h_i

Note that the context vector can also be computed with BAS or MBAS. c_t provides a synthesized gait encoding that is more relevant to \bar{h}_t, which facilitates the reconstruction of the t-th skeleton. To combine both encoding and decoding information for reconstruction, we use a concatenation layer that combines c_t and \bar{h}_t into an attentional state vector \tilde{h}_t:

\tilde{h}_t = tanh( W_c [c_t ; \bar{h}_t] ),

where W_c represents the learnable weight matrix in the concatenation layer. Finally, we generate (reconstruct) the t-th skeleton with the FC layer of the GD:

\bar{S}_t = W_s \tilde{h}_t,

where W_s is the weight matrix to be learned in this FC layer.
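One decoding step of this pipeline (context vector, attentional state, FC output) can be sketched as follows; the shapes, weight initializations and the name `attention_step` are illustrative assumptions:

```python
import numpy as np

def attention_step(align, enc_states, dec_state, W_c, W_s):
    # align: (f,) alignment scores; enc_states: (f, d) encoded gait
    # states; dec_state: (d,) decoded gait state.
    c = align @ enc_states                     # context vector, (d,)
    concat = np.concatenate([c, dec_state])    # concatenation, (2d,)
    h_att = np.tanh(W_c @ concat)              # attentional state, (d,)
    pred = W_s @ h_att                         # flattened skeleton, (3J,)
    return c, h_att, pred

# Toy shapes: f = 6 steps, d = 8 hidden units, J = 20 joints.
rng = np.random.default_rng(1)
f, d, J = 6, 8, 20
align = np.full(f, 1.0 / f)
enc_states = rng.normal(size=(f, d))
dec_state = rng.normal(size=d)
W_c = rng.normal(size=(d, 2 * d))
W_s = rng.normal(size=(3 * J, d))
c, h_att, pred = attention_step(align, enc_states, dec_state, W_c, W_s)
```

With uniform alignment scores the context vector reduces to the mean of the encoded states; locality-aware scores instead concentrate it on the states near the target position.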

Remarks: (a) To provide an intuitive illustration of the proposed locality-aware attention mechanism, we visualize the BAS and LAS attention matrices, which are formed by average alignment scores computed on the x dimension as an example. As shown by Fig. 3(a), LA alignment significantly improves the locality of the learned alignment scores: relatively large alignment scores are densely distributed near the clinodiagonal line of the alignment score matrix (the clinodiagonal line reflects skeletons' correlations to themselves under reverse reconstruction), which means that adjacent skeletons are assigned larger attention weights than remote skeletons. By contrast, although BA alignment also learns locality to some extent, its alignment weights show a much more even distribution, and many non-adjacent skeletons are also given large alignment scores. Similar trends are observed for the y and z dimensions. (b) The proposed LA alignment enables better reconstruction of skeleton sequences. We visualize the reconstruction loss during training in four cases: using no attention mechanism, BAS, MBAS, and LAS. As shown by Fig. 3(b), training with LAS converges faster and with an evidently smaller reconstruction loss, which justifies our intuition that locality facilitates training. Interestingly, we observe that using the locality mask directly in fact does not benefit training, which verifies that learning is a better way to obtain locality.

3.3 Attention-based Gait Encodings (AGEs)

Since our ultimate goal is to learn good gait features from skeleton data to perform person Re-ID, we need to extract a certain internal skeleton embedding from the proposed model as gait representations. Unlike traditional LSTM-based methods that basically rely on the last hidden state to compress the temporal dynamics of a sequence [23], we recall that the dynamic context vectors from the attention mechanism integrate the key encoded gait states of input skeletons and retain crucial spatio-temporal information to reconstruct target skeletons. Hence, we utilize them instead of the last hidden state to build our final gait representations, i.e., AGEs. A skeleton-level AGE A_t is defined as follows:

A_t = [ c_t^x ; c_t^y ; c_t^z ],

where c_t^k denotes the context vector computed on dimension k \in {x, y, z} at the t-th decoding step. To perform person Re-ID, we use AGEs to train a simple recognition network f_R that consists of a hidden layer and a softmax layer. We average the predictions of the skeleton-level AGEs (A_1, ..., A_f) in a skeleton sequence to obtain the final sequence-level prediction for person Re-ID. Note that during training, each skeleton in one sequence shares the same skeleton sequence label y. Besides, skeleton labels are only used to train the recognition network, i.e., AGEs are fixed during training to demonstrate the effectiveness of the learned gait features.
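Building skeleton-level AGEs and the averaged sequence-level prediction can be sketched as follows (the recognition-network weights here are random placeholders, and the helper names are our own):

```python
import numpy as np

def build_age(c_x, c_y, c_z):
    # Skeleton-level AGE: concatenate the context vectors learned on
    # the x, y and z coordinate dimensions.
    return np.concatenate([c_x, c_y, c_z])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_prediction(ages, W_h, W_o):
    # Average the softmax prediction of every skeleton-level AGE in a
    # sequence to obtain the sequence-level prediction.
    probs = [softmax(W_o @ np.tanh(W_h @ v)) for v in ages]
    return np.mean(probs, axis=0)

# Toy setup: 6 AGEs of size 3 * d, C = 5 identity classes.
rng = np.random.default_rng(2)
d, C = 8, 5
ages = [build_age(*rng.normal(size=(3, d))) for _ in range(6)]
W_h = rng.normal(size=(16, 3 * d))
W_o = rng.normal(size=(C, 16))
p = sequence_prediction(ages, W_h, W_o)
```

The averaged output is still a valid probability distribution over identities, so the sequence label can be read off with an argmax.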

3.4 The Entire Approach

As a summary, the computation flow of the entire approach during skeleton reconstruction is: encode each frame S_t into h_t, decode \bar{h}_t, compute the alignment scores and context vector c_t, form the attentional state \tilde{h}_t, and predict the skeleton \bar{S}_t. To guide model training in the reconstruction process, we combine the skeleton reconstruction loss L_r and the LA alignment loss L_{la} as follows:

L = \lambda_r L_r + \lambda_a L_{la} + \beta ||\Theta||^2,

where \Theta denotes the parameters of the model, \lambda_r and \lambda_a are the weight coefficients that trade off the importance of the reconstruction loss and the LA alignment loss, and \beta ||\Theta||^2 is the \ell_2 regularization term. For the person Re-ID task, we employ the cross-entropy loss to train the recognition network f_R with AGEs.

4 Experiments

4.1 Experimental Settings

Datasets: We evaluate our method on three public Re-ID datasets that provide 3D skeleton data: BIWI [16], IAS-Lab [17] and the Kinect Gait Biometry Dataset (KGBD) [2], which collect skeleton data from 50, 11 and 164 different individuals, respectively. We follow the evaluation setup in [9], which is frequently used in the literature: for BIWI, we use the full training set and the Walking testing set, which contains dynamic skeleton data; for IAS-Lab, we use the full training set and two test splits, IAS-A and IAS-B; for KGBD, since no training and testing splits are given, we randomly leave one skeleton video of each person for testing and use the rest of the videos for training. The experiments on KGBD are repeated multiple times and the average performance is reported. We discard the first and last 10 frames of each original skeleton sequence to avoid ineffective skeleton recording. Then, we split the training data into multiple skeleton sequences of length f such that two consecutive sequences share overlapping skeletons, which aims to obtain as many skeleton sequences as possible to train our model.
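The splitting procedure described above can be sketched as follows (the trimming count comes from the text; the function name and stride-1 overlap are our assumptions):

```python
def split_sequences(frames, f=6, trim=10):
    # Discard the first and last `trim` frames of the recording, then
    # split the remainder into length-f sequences with stride 1, so
    # consecutive sequences share overlapping skeletons.
    core = frames[trim:len(frames) - trim]
    return [core[i:i + f] for i in range(len(core) - f + 1)]

# A 30-frame recording yields a 10-frame core and 5 overlapping
# length-6 training sequences.
seqs = split_sequences(list(range(30)))
```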

Implementation Details: The number of body joints J is set to the maximum number among all datasets. The sequence length f is empirically set to 6, as it achieves the best average performance among different sequence-length settings. To learn locality-aware attention over the whole sequence, the attentional range D of LA is also set to 6. We use a 2-layer LSTM for both GE and GD. We empirically set both \lambda_r and \lambda_a to 1, and use a momentum of 0.9 for optimization. We use a fixed learning rate and set the weight \beta of the \ell_2 regularization to 0.02. The batch size is set to 128 in all experiments.

Evaluation Metrics: Person Re-ID typically follows a “multi-shot” manner that leverages predictions of multiple frames or a sequence-level representation to produce a sequence label. In this paper, we compute both Rank-1 accuracy and nAUC (area under the cumulative matching curve (CMC) normalized by the number of ranks [7]) to evaluate multi-shot person Re-ID performance.
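For reference, Rank-1 accuracy and nAUC can be computed from the rank at which each query's correct identity appears (a common formulation; here nAUC is taken as the mean of the CMC values over the evaluated ranks, which is one standard way to normalize the area by the number of ranks):

```python
import numpy as np

def cmc_curve(correct_ranks, num_ranks):
    # CMC value at rank k: fraction of queries whose correct identity
    # appears within the top-k candidates, for k = 1..num_ranks.
    r = np.asarray(correct_ranks)
    return np.array([(r <= k).mean() for k in range(1, num_ranks + 1)])

def rank1_and_nauc(correct_ranks, num_ranks):
    # Rank-1 accuracy is the first CMC value; nAUC is the area under
    # the CMC curve normalized by the number of ranks.
    cmc = cmc_curve(correct_ranks, num_ranks)
    return float(cmc[0]), float(cmc.mean())

# Three queries whose correct identities appear at ranks 1, 1 and 2.
r1, nauc = rank1_and_nauc([1, 1, 2], num_ranks=3)
```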

| Id | Method | Rank-1 BIWI | Rank-1 IAS-A | Rank-1 IAS-B | Rank-1 KGBD | nAUC BIWI | nAUC IAS-A | nAUC IAS-B | nAUC KGBD |
| 1 | Gait Energy Image chunli2010behavior | 21.4 | 25.6 | 15.9 | — | 73.2 | 72.1 | 66.0 | — |
| 2 | Gait Energy Volume sivapalan2011gait | 25.7 | 20.4 | 13.7 | — | 83.2 | 66.2 | 64.8 | — |
| 3 | 3D LSTM haque2016recurrent | 27.0 | 31.0 | 33.8 | — | 83.3 | 77.6 | 78.0 | — |
| 4 | PCM + Skeleton munaro20143d | 42.9 | 27.3 | 81.8 | — | — | — | — | — |
| 5 | Size-Shape descriptors + SVM hasan2016long | 20.5 | — | — | — | — | — | — | — |
| 6 | Size-Shape descriptors + LDA hasan2016long | 22.1 | — | — | — | — | — | — | — |
| 7 | DVCov + SKL wu2017robust | 21.4 | 46.6 | 45.9 | — | — | — | — | — |
| 8 | ED + SKL wu2017robust | 30.0 | 52.3 | 63.3 | — | — | — | — | — |
| 9 | CNN-LSTM with RTA karianakis2018reinforced | 50.0 | — | — | — | — | — | — | — |
| 10 | descriptors + SVM munaro2014one | 17.9 | — | — | — | — | — | — | — |
| 11 | descriptors + KNN munaro2014one | 39.3 | 33.8 | 40.5 | 46.9 | 64.3 | 63.6 | 71.1 | 90.0 |
| 12 | descriptors + Adaboost pala2019enhanced | 41.8 | 27.4 | 39.2 | 69.9 | 74.1 | 65.5 | 78.2 | 90.6 |
| 13 | Single-layer LSTM haque2016recurrent | 15.8 | 20.0 | 19.1 | 39.8 | 65.8 | 65.9 | 68.4 | 87.2 |
| 14 | Multi-layer LSTM zheng2019relational | 36.1 | 34.4 | 30.9 | 46.2 | 75.6 | 72.1 | 71.9 | 89.8 |
| 15 | PoseGait liao2020model | 33.3 | 41.4 | 37.1 | 90.6 | 81.8 | 79.9 | 74.8 | 97.8 |
| 16 | Ours | 59.1 | 56.1 | 58.2 | 87.7 | 86.5 | 81.7 | 85.3 | 96.3 |
Table 1: Comparison with existing skeleton-based methods (10-15). Depth-based methods (1-3) and multi-modal methods (4-9) are also included as a reference. Bold numbers refer to the best performers among skeleton-based methods. “—” indicates no published result.

4.2 Performance Comparison

In Table 1, we conduct an extensive comparison with existing skeleton-based person Re-ID methods (Id = 10-15) in the literature. In the meantime, we also include classic depth-based methods (Id = 1-3) and representative multi-modal methods (Id = 4-9) as a reference. We make the following observations:

Comparison with Skeleton-based Methods: As shown by Table 1, our approach enjoys obvious advantages over existing skeleton-based methods in terms of person Re-ID performance. First, it evidently outperforms the methods that rely on manually-designed geometric or anthropometric skeleton descriptors (Id = 10-12). For example, descriptors + KNN (Id = 11) and the recent descriptors + Adaboost (Id = 12) are two of the most representative hand-crafted-feature-based methods, and our model outperforms both of them by a large margin in Rank-1 accuracy and nAUC on all datasets. Second, our approach is also superior to recent skeleton-based methods that utilize deep neural networks (Id = 13-15): it is the best performer on three out of four datasets (BIWI, IAS-A, IAS-B) with clear Rank-1 accuracy and nAUC gains. On the KGBD dataset, our approach ranks second and performs slightly inferior to the latest PoseGait (Id = 15). However, although PoseGait uses a CNN, it still requires extracting 81 hand-crafted features. Besides, labeled skeleton data are indispensable for existing deep-learning-based methods, while our approach can learn better gait representations from unlabeled skeletons only.

Comparison with Depth-based Methods and Multi-modal Methods: Although our approach only takes skeleton data as input, it consistently outperforms the classic depth-based baselines (Id = 1-3) in both Rank-1 accuracy and nAUC. Considering that skeleton data are of much smaller size than depth image data, our approach is both effective and efficient. As to the comparison with recent methods that exploit multi-modal inputs (Id = 4-9), the performance of our approach is still highly competitive: although a few multi-modal methods perform better on IAS-B, our skeleton-based method achieves the best Rank-1 accuracy on BIWI and IAS-A. Interestingly, we note that the multi-modal approach that uses both point cloud matching (PCM) and skeletons yields the best accuracy on IAS-B, but it performs markedly worse on the datasets that undergo more frequent shape and appearance changes (IAS-A and BIWI). By contrast, our approach consistently achieves stable and satisfactory performance on each dataset. Thus, with 3D skeletons as the sole input, our approach can be a promising solution to person Re-ID and other potential skeleton-related tasks.

| Rank-1 (%) | nAUC |
| 36.1 | 75.6 |
| 41.5 | 80.1 |
| 46.7 | 81.5 |
| 45.7 | 84.1 |
| 53.3 | 84.6 |
| 55.1 | 85.2 |
| 52.9 | 85.0 |
| 53.1 | 83.6 |
| 54.5 | 85.6 |
| 57.7 | 85.8 |
| 57.2 | 85.7 |
| 59.1 | 86.5 |
Table 2: Ablation study of our model. "✓" indicates that the corresponding model component is used: GE, GD, reverse skeleton reconstruction (Rev.), and the different types of attention alignment scores (BAS, MBAS, LAS). "AGEs" indicates exploiting AGEs rather than the encoded gait states of GE's LSTM for person Re-ID.

5 Discussion

Ablation Study. We perform an ablation study to verify the effectiveness of each model component. As shown in Table 2, we draw the following conclusions: (a) The proposed encoder-decoder architecture (GE-GD) performs remarkably better in both Rank-1 accuracy and nAUC than the supervised learning paradigm that uses GE only, which verifies the necessity of the encoder-decoder architecture and the skeleton reconstruction mechanism. (b) Using reverse reconstruction (Rev. in Table 2) produces an evident performance gain in Rank-1 accuracy and nAUC when compared with the configurations without reverse reconstruction. Such results justify our claim that reverse reconstruction enables the model to learn more discriminative gait features for person Re-ID. (c) Introducing the attention mechanism (BAS) improves the model in both Rank-1 accuracy and nAUC, while LAS further improves BAS's performance. Besides, directly using the locality mask (MBAS) degrades the performance, which substantiates our claim that learning locality enables better gait representation learning. (d) AGEs provide more effective gait representations: using AGEs consistently improves the Re-ID performance by up to 1.8% Rank-1 accuracy and 0.7% nAUC, regardless of the type of alignment scores used. Other datasets report similar results.

Gait Encoding Model Transfer. We discover that the gait encoding model learned on one dataset can be readily transferred to other datasets. For example, we use the model pre-trained on KGBD to directly encode skeletons from BIWI and IAS-Lab, and compare the Re-ID performance with models trained on BIWI and IAS-Lab themselves (denoted by "Self"). As shown by Table 3, the two cases achieve fairly comparable performance, and the transferred model even outperforms the original model on IAS-A (LAS) and IAS-B (BAS). Such transferability demonstrates that our approach learns transferable high-level semantics of 3D skeleton data, which enables the pre-trained model to capture discriminative gait features from unseen skeletons of a new dataset.

| Config. | BIWI Self | BIWI Transfer | IAS-A Self | IAS-A Transfer | IAS-B Self | IAS-B Transfer |
| BAS | 55.1 | 53.5 | 54.7 | 54.2 | 56.3 | 57.1 |
| LAS | 59.1 | 58.4 | 56.1 | 56.3 | 58.2 | 57.4 |
Table 3: Rank-1 accuracy comparison between the original model (“Self”) and the transferred model (“Transfer”). Results of different datasets and alignment score types (BAS or LAS) are reported.

6 Conclusion

In this paper, we propose a generic self-supervised approach to learn effective gait representations for person Re-ID. We introduce self-supervision by learning reverse skeleton reconstruction, which enables our model to learn high-level semantics and discriminative gait features from unlabeled skeleton data. To facilitate skeleton reconstruction and gait representation learning, a novel locality-aware attention mechanism is proposed to incorporate locality into the gait encoding process. We construct AGEs as final gait representations to perform person Re-ID. Our approach evidently outperforms existing skeleton-based Re-ID methods, and its performance is comparable or superior to that of depth-based and multi-modal methods.


This work was supported in part by the National Key Research and Development Program of China (Grant No. 2019YFA0706200, No. 2018YFB1003203), in part by the National Natural Science Foundation of China (Grant No. 61632014, No. 61627808, No. 61210010, No. 61773392, No. 61672528), in part by the National Basic Research Program of China (973 Program, Grant No. 2014CB744600), in part by the Program of Beijing Municipal Science & Technology Commission (Grant No. Z171100000117005), and in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams 2017ZT07X183 and the Fundamental Research Funds for the Central Universities D2191240. Xiping Hu, Jun Cheng and Bin Hu are the corresponding authors of this paper.


  • [1] J. K. Aggarwal and Q. Cai (1999) Human motion analysis: a review. Computer vision and image understanding 73 (3), pp. 428–440. Cited by: §1.
  • [2] V. O. Andersson and R. M. Araujo (2015) Person identification using anthropometric and gait data from kinect sensor. In AAAI, Cited by: §1, §4.1.
  • [3] I. B. Barbosa, M. Cristani, A. Del Bue, L. Bazzani, and V. Murino (2012) Re-identification with rgb-d sensors. In ECCV, pp. 433–442. Cited by: §1, §2.
  • [4] L. Chunli and W. Kejun (2010) A behavior classification based on enhanced gait energy image. In 2nd International Conference on Networking and Digital Society, Vol. 2, pp. 589–592. Cited by: §2.
  • [5] P. Connor and A. Ross (2018) Biometric recognition by gait: a survey of modalities and features. Computer vision and image understanding 167, pp. 1–27. Cited by: §1.
  • [6] J. E. Cutting and L. T. Kozlowski (1977) Recognizing friends by their walk: gait perception without familiarity cues. Bulletin of the psychonomic society 9 (5), pp. 353–356. Cited by: §1.
  • [7] D. Gray and H. Tao (2008) Viewpoint invariant pedestrian recognition with an ensemble of localized features. In ECCV, pp. 262–275. Cited by: §4.1.
  • [8] F. Han, B. Reily, W. Hoff, and H. Zhang (2017) Space-time representation of people based on 3d skeletal data: a review. Computer Vision and Image Understanding 158, pp. 85–105. Cited by: §1.
  • [9] A. Haque, A. Alahi, and L. Fei-Fei (2016) Recurrent attention models for depth-based person identification. In CVPR, pp. 1229–1238. Cited by: §1, §2, §4.1.
  • [10] M. Hasan and N. Babaguchi (2016) Long-term people reidentification using anthropometric signature. In 8th International Conference on Biometrics Theory, Applications and Systems, pp. 1–6. Cited by: §2.
  • [11] S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.
  • [12] N. Karianakis, Z. Liu, Y. Chen, and S. Soatto (2018) Reinforced temporal attention and split-rate transfer for depth-based person re-identification. In ECCV, pp. 715–733. Cited by: §2.
  • [13] R. Liao, S. Yu, W. An, and Y. Huang (2020) A model-based gait recognition method with body pose and human prior knowledge. Pattern Recognition 98, pp. 107069. Cited by: §1, §2.
  • [14] T. Luong, H. Pham, and C. D. Manning (2015) Effective approaches to attention-based neural machine translation. In EMNLP, Lisbon, Portugal, pp. 1412–1421. Cited by: §3.2.
  • [15] M. Munaro, A. Basso, A. Fossati, L. Van Gool, and E. Menegatti (2014) 3D reconstruction of freely moving persons for re-identification with a depth sensor. In ICRA, pp. 4512–4519. Cited by: §2, §2.
  • [16] M. Munaro, A. Fossati, A. Basso, E. Menegatti, and L. Van Gool (2014) One-shot person re-identification with a consumer depth camera. In Person Re-Identification, pp. 161–181. Cited by: §2, §2, §4.1.
  • [17] M. Munaro, S. Ghidoni, D. T. Dizmen, and E. Menegatti (2014) A feature-based approach to people re-identification using skeleton keypoints. In ICRA, pp. 5644–5651. Cited by: §4.1.
  • [18] M. P. Murray, A. B. Drought, and R. C. Kory (1964) Walking patterns of normal men. JBJS 46 (2), pp. 335–360. Cited by: §1.
  • [19] A. Nambiar, A. Bernardino, and J. C. Nascimento (2019) Gait-based person re-identification: a survey. ACM Computing Surveys (CSUR) 52 (2), pp. 33. Cited by: §1.
  • [20] P. Pala, L. Seidenari, S. Berretti, and A. Del Bimbo (2019) Enhanced skeleton and face 3d data for person re-identification from depth cameras. Computers & Graphics 79, pp. 69–80. Cited by: §2.
  • [21] S. Sivapalan, D. Chen, S. Denman, S. Sridharan, and C. Fookes (2011) Gait energy volumes and frontal gait recognition using depth images. In 2011 International Joint Conference on Biometrics (IJCB), pp. 1–6. Cited by: §2.
  • [22] R. Vezzani, D. Baltieri, and R. Cucchiara (2013) People reidentification in surveillance and forensics: a survey. ACM Computing Surveys (CSUR) 46 (2), pp. 29. Cited by: §1.
  • [23] J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In ICLR, Cited by: §3.3.
  • [24] A. Wu, W. Zheng, and J. Lai (2017) Robust depth-based person re-identification. IEEE Transactions on Image Processing 26 (6), pp. 2588–2603. Cited by: §2.
  • [25] J. Yoo, M. S. Nixon, and C. J. Harris (2002) Extracting gait signatures based on anatomical knowledge. In Proceedings of BMVA Symposium on Advancing Biometric Technologies, pp. 596–606. Cited by: §1.
  • [26] Y. Zhang, Y. Huang, L. Wang, and S. Yu (2019) A comprehensive study on gait biometrics using a joint cnn-based method. Pattern Recognition 93, pp. 228–236. Cited by: §1.